Supporting the data cleansing processes of data migration projects using the KNIME Analytics Platform

During data migration projects, data cleaning is often neglected simply because sufficient skilled resources are not available. Therefore, in a pilot project we examined which tool could be used, without developer knowledge, to effectively support data cleansing tasks during data migration projects.

Overview

MINDSPIRE Consulting, a member of Inovivo, our company group, provides data migration services for its banking industry clientele. They have developed a data migration methodology and toolset based on the experience gained from their successful ETL projects.

Our company, Onespire Ltd., is involved in data cleaning activities through our Data Science (DS) services, so in this joint pilot project we reviewed the Data Science toolset to see whether it can cover the data cleansing needs arising in migration projects.

The data cleaning tasks of data migration projects

A variety of data cleanliness issues can emerge during data migration projects, from simple typing errors to complex data consistency issues.

Based on our experience, data cleansing tasks are often not carried out during data migration projects because there are no experts available to perform this complex task.

Data cleaning tasks of this kind are also challenging because they are unique, so solutions or processes developed earlier cannot be reused on other projects without changes. Therefore, after the data quality has been verified, a customized concept must be developed for the given environment.

Application of Data Science tools in data cleaning projects

One of the questions we faced was whether, in addition to traditional data migration solutions, Data Science tools could be used to support data cleansing tasks during the implementation of such projects.

In this regard, we selected the KNIME Analytics Platform because it is easy to use, does not require programming skills, and has many specific functions that are well suited to solving data cleaning tasks.

The KNIME Analytics Platform is a free, open-source data analysis, reporting and integration platform. The tool effectively supports extract, transform, load (ETL) processes.

The solution has a community of one hundred thousand users who, in addition to data migration, also use the software for data scrubbing, training algorithms, predictive analytics, interactive visual display and report creation.

KNIME is good at identifying data patterns and supports business decisions by exploiting hidden information. Developer knowledge is not required for its use: a complete process can be created on the interface by arranging the various elementary units, called nodes.

The other question was to what extent Data Science methodology can meet the requirements of data migration projects.
It was evident that Data Science takes into account several aspects that are not relevant in data migration projects. These include, but are not limited to, scaling and normalization. However, there are also many building blocks that can be easily implemented in the data migration methodology. Examples include replacing missing values, removing duplicates, and type conversion.
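To make these building blocks more concrete, the short Python sketch below shows what replacing missing values, removing duplicates and type conversion can look like in code. It is only an illustration under our own assumptions, not part of the pilot's KNIME workflow: the field names, the sample records and the `UNKNOWN` placeholder are hypothetical.

```python
from datetime import datetime

# Hypothetical customer records; field names and values are invented for illustration.
raw_records = [
    {"customer_id": "1001", "name": "Anna Kovacs", "birth_date": "1985-03-12", "country": "HU"},
    {"customer_id": "1002", "name": "Peter Nagy", "birth_date": "", "country": None},
    {"customer_id": "1001", "name": "Anna Kovacs", "birth_date": "1985-03-12", "country": "HU"},
]

DEFAULT_COUNTRY = "UNKNOWN"  # assumed placeholder for missing country codes


def convert_types(record):
    """Type conversion: parse the ID as an integer and the birth date as a date."""
    birth = record["birth_date"]
    return {
        "customer_id": int(record["customer_id"]),
        "name": record["name"],
        "birth_date": datetime.strptime(birth, "%Y-%m-%d").date() if birth else None,
        "country": record["country"],
    }


def fill_missing(record):
    """Missing values: replace an empty country with a fixed placeholder."""
    filled = dict(record)
    if not filled["country"]:
        filled["country"] = DEFAULT_COUNTRY
    return filled


def remove_duplicates(records):
    """Duplicates: keep the first record seen for each customer_id."""
    seen = set()
    unique = []
    for record in records:
        if record["customer_id"] not in seen:
            seen.add(record["customer_id"])
            unique.append(record)
    return unique


cleaned = remove_duplicates([fill_missing(convert_types(r)) for r in raw_records])
for row in cleaned:
    print(row)
```

In a visual tool such as KNIME, each of these three functions would roughly correspond to one configured step in the workflow. The choice to keep the first record per ID is a simplification; a real project would need its own rule for resolving duplicates.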

In our joint pilot project with MINDSPIRE Consulting, we therefore created a sample data cleaning process with the help of KNIME in order to verify our concept.

Overview of our KNIME data cleansing project

The purpose of the pilot project was to find out whether KNIME can be used to support the data cleaning tasks of a data migration project. A big advantage of the KNIME platform, in addition to its ease of use, is its flexibility. An existing workflow can be changed quickly and easily by inserting new steps or by replacing and reconfiguring previous ones. The disadvantage is that with larger amounts of data, we may face performance problems with the free version.

Structure of the KNIME workflow

The project aimed to create a workflow that cleans the data of a customer database consisting of ten records, and it had four separate tasks:

  1. Defining the data cleaning steps based on the Data Science methodology.
  2. Selecting or constructing a sample database.
  3. Creating the workflow using the KNIME workbench.
  4. Iterative process of testing and correcting.

The workflow was run on a data set with limited records, containing intentionally incorrect customer data.
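As an illustration of how such a test data set can be checked, the following Python sketch plants a few errors in a handful of records and reports them per record. It is a minimal example under our own assumptions and does not reproduce the pilot's data or rules: the records, the field names, the four-digit postal code format and the `find_issues` helper are all hypothetical.

```python
import re

# Hypothetical sample records with deliberately planted errors.
sample = [
    {"id": 1, "name": "Anna Kovacs", "email": "anna.kovacs@example.com", "postal_code": "1051"},
    {"id": 2, "name": "  peter nagy ", "email": "peter.nagy[at]example.com", "postal_code": "10A1"},
    {"id": 3, "name": "", "email": "eva.toth@example.com", "postal_code": "6720"},
]

# Deliberately simple patterns; real validation rules would be project-specific.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
POSTAL_PATTERN = re.compile(r"^\d{4}$")  # assumed four-digit postal code format


def find_issues(record):
    """Return a list of human-readable data quality issues for one record."""
    issues = []
    if not record["name"].strip():
        issues.append("missing name")
    elif record["name"] != record["name"].strip():
        issues.append("name has surrounding whitespace")
    if not EMAIL_PATTERN.match(record["email"]):
        issues.append("invalid email")
    if not POSTAL_PATTERN.match(record["postal_code"]):
        issues.append("invalid postal code")
    return issues


# Map each record ID to the issues found in that record.
report = {record["id"]: find_issues(record) for record in sample}
for record_id, issues in report.items():
    print(record_id, issues or "OK")
```

Because the errors are planted on purpose, the expected findings are known in advance, which makes it easy to see in each test iteration whether a cleaning step catches what it should.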

 

The KNIME workflow created during the project


Detailed information about the KNIME data cleansing sample project is available here:

https://www.mindspire-consulting.com/blog/data-migration-blog/data-migration-related-data-cleansing-process/

 

Conclusion

The data cleansing pilot project carried out with the experts of MINDSPIRE Consulting supported our assumption that there are many similarities between the procedures defined and applied by Data Science and the mostly ad-hoc solutions used during data scrubbing tasks.

Accordingly, we strongly recommend drawing on existing experience and knowledge of Data Science tools and methodology when planning and executing the data cleansing tasks of data migration projects.

Based on our current knowledge, KNIME is adequate for planning, building and testing the data cleaning function. However, a truly effective data cleansing solution could be created as an independent module developed in an advanced programming language.

An additional advantage of KNIME is that it allows hidden patterns in the data to be recognized and analyzed even without programming knowledge, which makes it possible to involve additional employees in the tasks of typically resource-poor projects.


This article was written by: Ákos Erdész

Onespire Data Science and Analytics Services
