Supporting the data cleansing processes of data migration projects using the KNIME Analytics Platform
During data migration projects, data cleansing is often neglected simply because sufficiently skilled resources are not available. In a pilot project we therefore examined which tool could effectively support data cleansing tasks without requiring developer knowledge.

Overview
MINDSPIRE Consulting, a member of our company group Inovivo, provides data migration services for its banking industry clientele. It has developed a data migration methodology and toolset based on the experience gained from its successful ETL projects.
Our company, Onespire Ltd., performs data cleaning as part of its Data Science (DS) services, so in this joint pilot project we reviewed MINDSPIRE's toolset to see how the data cleansing needs arising in migration projects could be covered.
The data cleaning tasks of data migration projects
A variety of data quality issues can emerge during data migration projects, from simple typing errors to complex data consistency problems.
In our experience, data cleansing tasks are often skipped during data migration projects because no experts are available to perform this complex work.
These data cleaning tasks are also challenging because they are unique: solutions and processes developed on earlier projects cannot be reused elsewhere without changes. Therefore, after assessing the data quality, a customized concept must be developed for each environment.
Application of Data Science tools in data cleaning projects
One of the questions we faced was whether, in addition to traditional data migration solutions, Data Science tools could be used to support data cleansing during the implementation of such projects.
We selected the KNIME Analytics Platform because it is easy to use, does not require programming skills, and offers many functions well suited to data cleaning tasks.
The KNIME Analytics Platform is a free, open-source data analysis, reporting, and integration platform. It effectively supports extract, transform, load (ETL) processes.
The platform has a community of over one hundred thousand users who, in addition to data migration, also use the software for data scrubbing, training algorithms, predictive analytics, interactive visualization, and report creation.
KNIME is good at identifying data patterns and supports business decisions by surfacing hidden information. Developer knowledge is not required: a complete process can be assembled on the graphical interface by connecting elementary units called nodes.
The other question was to what extent Data Science methodology can meet the requirements of data migration projects.
It was evident that Data Science covers several aspects that are not relevant to data migration projects, such as scaling and normalization. However, many of its building blocks can be easily incorporated into the data migration methodology, for example replacing missing values, removing duplicates, and type conversion.
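To make the three building blocks concrete, here is a minimal sketch in plain Python. The field names and example records are purely illustrative and not taken from the pilot project; in the actual pilot these operations were performed with KNIME nodes rather than code.

```python
# Minimal sketch of three data cleaning building blocks:
# missing-value replacement, duplicate removal, and type conversion.
# The records and field names below are hypothetical.

records = [
    {"id": "1", "name": "Alice", "balance": "100.50"},
    {"id": "2", "name": "Bob",   "balance": None},   # missing value
    {"id": "2", "name": "Bob",   "balance": None},   # exact duplicate
    {"id": "3", "name": "Carol", "balance": "0"},
]

# 1. Replace missing values with a default (here: "0").
for rec in records:
    if rec["balance"] is None:
        rec["balance"] = "0"

# 2. Remove exact duplicates while preserving record order.
seen, unique = set(), []
for rec in records:
    key = tuple(sorted(rec.items()))
    if key not in seen:
        seen.add(key)
        unique.append(rec)

# 3. Convert types: id to int, balance to float.
cleaned = [
    {"id": int(r["id"]), "name": r["name"], "balance": float(r["balance"])}
    for r in unique
]

print(cleaned)
```

Each step corresponds to a ready-made KNIME node (e.g. missing-value handling, duplicate filtering, type conversion), which is what makes the platform usable without programming.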
In our joint pilot project with MINDSPIRE Consulting, we therefore created a sample data cleaning process with the help of KNIME in order to verify our concept.
Overview of our KNIME data cleansing project
The purpose of the pilot project was to find out whether KNIME can be used to support the data cleaning tasks of a data migration project. A big advantage of the KNIME platform, in addition to its ease of use, is flexibility: a workflow, once created, can be quickly changed by inserting new steps or by replacing and reconfiguring existing ones. The disadvantage is that with larger data volumes, the free version may run into performance problems.
Structure of the KNIME workflow
The workflow created during the project aimed to clean the data of a customer database consisting of ten records, and involved four separate tasks:
- Defining the data cleaning step based on the Data Science methodology.
- Selecting or constructing a sample database.
- Creating the workflow using the KNIME workbench.
- Iterative process of testing and correcting.
The workflow was run on a data set with limited records, containing intentionally incorrect customer data.
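The actual workflow was built graphically in KNIME, but the idea of running validation rules over a small set of intentionally incorrect customer records can be mirrored in a few lines of Python. The records, field names, and rules below are hypothetical, chosen only to illustrate the approach.

```python
import re

# Hypothetical stand-in for the pilot's test data: a few customer
# records with intentionally introduced errors. The real workflow
# was assembled from KNIME nodes; this only mirrors the concept of
# applying validation rules to a small record set.

customers = [
    {"name": "Anna Smith", "email": "anna.smith@example.com"},
    {"name": "j0hn  doe",  "email": "john.doe@example"},   # invalid email
    {"name": "",           "email": "eve@example.com"},    # missing name
]

EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")

def validate(rec):
    """Return the list of rule violations found in one record."""
    errors = []
    if not rec["name"].strip():
        errors.append("missing name")
    if not EMAIL_RE.match(rec["email"]):
        errors.append("invalid email")
    return errors

report = {c["email"]: validate(c) for c in customers}
print(report)
```

In the pilot, the equivalent of the testing-and-correction loop in the task list above was to run such checks, fix the flagged records or the workflow configuration, and rerun until the report came back clean.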
The KNIME workflow created during the project
Detailed information about the KNIME data cleansing sample project is available here.
Conclusion
The data cleansing pilot project carried out with the experts of MINDSPIRE Consulting supported our assumption that there are many similarities between the procedures defined and applied by Data Science and the mostly ad-hoc solutions used during data scrubbing tasks.
Accordingly, we strongly recommend applying the existing experience and knowledge of Data Science tools and methodology when planning and executing the data cleansing tasks of data migration projects.
Based on our current experience, KNIME is well suited to planning, building, and testing the data cleaning function; however, a truly high-performance data cleansing solution may require an independent module developed in a general-purpose programming language.
An additional advantage of KNIME is that it enables the recognition and analysis of hidden patterns in the data even without programming knowledge, thus enabling the involvement of additional employees in the tasks on typically resource-poor projects.
This article was written by: Ákos Erdész
Onespire Data Science and Analytics Services
