Supporting the data cleansing processes of data migration projects using the KNIME Analytics Platform

During data migration projects, data cleaning is often neglected, simply because sufficient skilled resources are not available. Therefore, in a pilot project we examined which tool could be used without developer knowledge to effectively support data cleansing tasks during data migration projects.

Overview

MINDSPIRE Consulting, a member of Inovivo, our company group, provides data migration services for its banking industry clientele. They have developed a data migration methodology and toolset based on the experience gained from their successful ETL projects.

Our company, Onespire Ltd., is involved in data cleaning activities through our Data Science (DS) services, so in this joint pilot project we reviewed its toolset in order to cover the data cleansing needs arising in migration projects.

The data cleaning tasks of data migration projects

A variety of data cleanliness issues can emerge during data migration projects, from simple typing errors to complex data consistency issues.

Based on our experience, data cleansing tasks are often not carried out during data migration projects because there are no experts available to perform this complex task.

Data cleaning tasks of this kind are also challenging because they are unique, so solutions or processes developed earlier cannot be reused on other projects without changes. Therefore, after verifying the data quality, a customized concept must be developed for the given environment.
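To make the "verifying the data quality" step more concrete, the following minimal Python sketch counts missing and distinct values per column. It is an illustration only, not part of the pilot project: the `profile` function, the column names and the sample records are all hypothetical.

```python
def profile(records, columns):
    """Return, per column, the number of missing and of distinct values."""
    report = {}
    for col in columns:
        values = [rec.get(col) for rec in records]
        # Treat None and blank strings as missing.
        present = [str(v).strip() for v in values
                   if v is not None and str(v).strip() != ""]
        report[col] = {
            "missing": len(values) - len(present),
            # Compare case-insensitively so "Budapest" and "budapest" match.
            "distinct": len({v.lower() for v in present}),
        }
    return report


# Hypothetical sample records, invented for this illustration.
customers = [
    {"id": "1", "name": "Anna Kovacs", "city": "Budapest"},
    {"id": "2", "name": "", "city": "budapest"},
    {"id": "3", "name": "Peter Nagy", "city": None},
]
quality_report = profile(customers, ["id", "name", "city"])
```

A report like this can help decide which cleaning steps the customized concept needs for a given data set.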

Application of Data Science tools in data cleaning projects

One of the questions we faced was whether, in addition to traditional data migration solutions, Data Science tools could be used to support data cleansing tasks during the implementation of such projects.

In this regard, we selected the KNIME Analytics Platform because it is easy to use, does not require programming skills, and has many specific functions that are well suited to solving data cleaning tasks.

The KNIME Analytics Platform is a free, open-source data analysis, reporting and integration platform. The tool effectively supports data extraction, data transformation and data loading (ETL) processes.

The solution has a community of one hundred thousand users who, in addition to data migration, also use the software for data scrubbing, training algorithms, predictive analytics, interactive visual display and report creation.

KNIME is good at identifying data patterns and supports business decisions by exploiting hidden information. Developer knowledge is not required for its use: a complete process can be created on the interface by arranging and connecting its elementary units, the nodes.
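For readers who think in code, the idea of building a process from nodes can be compared to chaining small functions. The Python sketch below is only an analogy under that assumption. It does not use or represent KNIME's own API, and the function names are hypothetical.

```python
from functools import reduce


def trim_strings(rows):
    """Node-like step: strip surrounding whitespace from string fields."""
    return [
        {key: value.strip() if isinstance(value, str) else value
         for key, value in row.items()}
        for row in rows
    ]


def drop_empty_names(rows):
    """Node-like step: discard rows whose name is empty."""
    return [row for row in rows if row.get("name")]


def run_workflow(rows, nodes):
    """Pass the data through each step in order, like connected nodes."""
    return reduce(lambda data, node: node(data), nodes, rows)


rows = [{"name": "  Anna Kovacs "}, {"name": "   "}]
result = run_workflow(rows, [trim_strings, drop_empty_names])
```

Reordering, inserting or swapping entries in the list of steps is a rough counterpart of rearranging nodes on the workbench.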

The other question was to what extent Data Science methodology can meet the requirements of data migration projects.
It was evident that Data Science takes into account several aspects that are not relevant in data migration projects. These include, but are not limited to, scaling and normalization. However, there are also many building blocks that can be easily implemented in the data migration methodology. Examples include replacing missing values, removing duplicates, and type conversion.
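The three building blocks named above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the logic of the pilot workflow: the field names, the `UNKNOWN` default and the rule that a name and city pair identifies a duplicate are all hypothetical.

```python
def clean(records, default_city="UNKNOWN"):
    """Replace missing cities, convert balances to float, drop duplicates."""
    cleaned, seen = [], set()
    for rec in records:
        row = dict(rec)  # work on a copy, leave the input untouched

        # Replacing missing values: fill blank or absent cities.
        if not (row.get("city") or "").strip():
            row["city"] = default_city

        # Type conversion: text to float, accepting a decimal comma.
        try:
            row["balance"] = float(str(row.get("balance")).replace(",", "."))
        except ValueError:
            row["balance"] = None  # unparseable values are marked, not guessed

        # Removing duplicates: compare on a normalized (name, city) key.
        key = (row["name"].strip().lower(), row["city"].strip().lower())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Hypothetical input, invented for this illustration.
raw = [
    {"name": "Anna Kovacs", "city": "Budapest", "balance": "1200,50"},
    {"name": "anna kovacs ", "city": "budapest", "balance": "1200.50"},
    {"name": "Peter Nagy", "city": "", "balance": "n/a"},
]
cleaned = clean(raw)
```

In a real migration, the duplicate key and the default values would have to come from the business rules of the given project.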

In our joint pilot project with MINDSPIRE Consulting, we therefore created a sample data cleaning process with the help of KNIME in order to verify our concept.

Overview of our KNIME data cleansing project

The purpose of the pilot project was to find out whether KNIME can be used to support the data cleaning tasks of a data migration project. A big advantage of the KNIME platform, in addition to its ease of use, is its flexibility: an existing workflow can be changed easily and quickly by inserting new steps or by replacing and reconfiguring previous ones. The disadvantage is that with larger amounts of data we may face performance problems with the free version.

Structure of the KNIME workflow

The workflow created during the project aimed to clean the data of a customer database consisting of ten records. The project itself comprised four separate tasks:

  1. Defining the data cleaning steps based on the Data Science methodology.
  2. Selecting or constructing a sample database.
  3. Creating the workflow using the KNIME workbench.
  4. Iterative process of testing and correcting.

The workflow was run on a data set with limited records, containing intentionally incorrect customer data.
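Testing against intentionally incorrect records can be illustrated with a small rule-based check. The Python sketch below is only an assumption about how such a test could look. The rules, the field names and the sample records are hypothetical and are not taken from the pilot data set.

```python
import re

EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Hypothetical validation rules; each returns True for a valid record.
RULES = {
    "name is present": lambda r: bool(r.get("name", "").strip()),
    "email looks valid": lambda r: bool(EMAIL_PATTERN.match(r.get("email", ""))),
    "postal code has four digits": lambda r: bool(
        re.fullmatch(r"\d{4}", r.get("postal_code", ""))
    ),
}


def validate(records):
    """Return (record index, rule name) for every failed rule."""
    failures = []
    for index, rec in enumerate(records):
        for rule_name, check in RULES.items():
            if not check(rec):
                failures.append((index, rule_name))
    return failures


# One clean record and one with deliberately planted errors.
sample = [
    {"name": "Anna Kovacs", "email": "anna.kovacs@example.com", "postal_code": "1051"},
    {"name": "", "email": "peter.nagy(at)example.com", "postal_code": "10S1"},
]
failures = validate(sample)
```

Running such a check before and after each correction shows whether a cleaning step fixed the planted errors, which mirrors the iterative testing and correcting in task 4.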

 

The KNIME workflow created during the project

Detailed information about the KNIME data cleansing sample project is available here:

https://www.mindspire-consulting.com/blog/data-migration-blog/data-migration-related-data-cleansing-process/

 

Conclusion

The data cleansing pilot project carried out with the experts of MINDSPIRE Consulting supported our assumption that there are many similarities between the procedures defined and applied by Data Science and the mostly ad-hoc solutions used during data scrubbing tasks.

Accordingly, we strongly recommend drawing on existing experience and knowledge of Data Science tools and methodology when planning and executing the data cleansing tasks of data migration projects.

Based on our current knowledge, KNIME can be used adequately for planning, building and testing the data cleaning function. However, a truly effective data cleansing solution could be created with an independent module developed in an advanced programming language.

An additional advantage of KNIME is that it enables the recognition and analysis of hidden patterns in the data even without programming knowledge, thus enabling the involvement of additional employees in the tasks on typically resource-poor projects.

This article was written by: Ákos Erdész

Onespire Data Science and Analytics Services
