Zooming in on data wrangling and cleaning

Zooming in on data wrangling and cleaning


What is data wrangling ?

A major challenge when working with data is the quality of the data. Whether you want to visualize data in a dashboard or BI tool, or you want to feed data into a machine learning model to get predictions, the outcome of a data pipeline is always heavily dependant on the quality of the data that goes in. 

Data wrangling typically refers to the manual work that a human needs to perform to clean the data. In an end-to-end data pipeline, there are usually many automated steps to transform the data, for example changing the data model into a star schema, filtering the data into data marts etc. With “data wrangling” on the other hand, we typically mean manual work that needs to happen, because somebody needs to actually look at the data and decide what part needs cleaning.

However, recently AI is taking over part of the data wrangling job. AI is now also capable of looking at a dataset and figuring out which parts needs some sort of cleaning. A simple example would be a “country” column in a table with customer data. AI could figure out that since most countries are written as “United Kingdom”, “France”, “Italy” etc, the abbreviation “UK” needs to be replaced with “United Kingdom” to have a uniform list of countries.


Why would you do data wrangling ?

As a data-driven person, you typically spend a large part of your time on cleaning the data before you can do your actual work. There are different reasons why you would need to clean your data:

Reason 1: data correctness

Data wrangling allows you to check if data is actually correct. When you discover “outliers” in your data, you can zoom in and figure out if that data makes sense. A good example would be a table with revenue per customer. An outlier could be a customer with exceptionally low or high revenue. By having a look at those outliers, you can figure out if something is wrong with the data that needs to be fixed at the source.

Another example would be to combine rules, e.g. “B2B customers with a missing VAT number”. Again, by zooming in on all customers that pop up in this filter, you can identify rows of data that need an update. Data wrangling sometimes feels like combing the data in different directions, and each time checking which data is caught be the comb and needs to be untied.

Reason 2: data aggregation in a dashboard/BI tool

Data wrangling is important when you want to do aggregation or grouping of data. Aggregation happens in almost every dashboard that visualizes data.Let’s say you want to see a bar chart with the number of customers per country, you have to make sure that the country names for all your customers are uniform. If you have customers with country “UK” and others with country “United Kingdom”, this will show up as two separate bars in your graph, which is undesired.

Reason 3: joining data

In many cases you want to combine or join data from multiple tables. One table might hold your customers, and a second table might hold purchases. In order to see top customers by purchase value, you need to perform a “join” between both tables. Sometimes such as a join is straightforward using the so called primary key (PK) in one table and the foreign key (FK) in the second table. In our previous example the key that ties together both tables would be “customer id”. However, if both tables originate from a different source, you might not have such a key to use. In that case you might need to match data on other fields such as the customer VAT number. When you do so, you have to make sure that data looks exactly the same in both tables. Data wrangling could be used to e.g. remove spaces from the VAT number so that they match in both tables.

Reason 4: feeding a machine learning model

Machine learning and AI are used to get new insights from data. ML/AI can be used to make predictions, for example “which of my customers are most likely to churn ?”. ML/AI can also be used for classification, pattern detection and extrapolation. In all of these cases, the ML/AI model needs data as input. Input data is needed for two reasons:

Training: an ML/AI model needs to be trained on historic data.

Applying the model: once the model is trained, it needs fresh data to make new insights.

For both these steps, the quality of the data is important. Incorrect data in the training phase will lead to a poorly trained model. And incorrect data used to apply the model can lead to wrong conclusions. This is the reason why data wrangling is such an important part of every ML/AI project.

Reason 5: master data management and the “single customer view”

Master data management or MDM is the overall process of cleaning and building the “master” datasets in a company which are shared between teams. An interesting case of MDM occurs in the company’s CRM, for example Salesforce. Within the CRM context, MDM means that the data in a CRM is cleaned in such a way that a single customer view is possible, or in other words that every customer exists only once in the CRM. This might sound trivial but it definitely is not. In larger organisations, multiple sales executives might enter information in the CRM, in such a way that the same customer appears multiple times, e.g. as “Coca Cola” and as “Coca Cola Inc”. This is a simplistic example, the challenge can be that customers have multiple entities and sub entities. MDM in a CRM is the process of cleaning up this data and linking or merging customer records (and all the data linked to it such as proposals, orders, invoices, contacts etc.) to end up with a single customer view. MDM therefore requires quite some manual data wrangling, combined with enrichment data to recognize that two companies belong to each other, based on their address, NACE code and other information.

Reason 6: business agility

More in general, data wrangling is an important aspect of the overall business agility. Companies that are data driven rely in high quality data. While data should be as clean as possible within the source, the reality is that data wrangling is needed to improve the quality of the data for all consumers.


The need for a Shift Left approach

Eventually data should become a “data product”, which means it is ready for consumption by other teams without the need to perform data wrangling. This requires a shift left approach, where the data quality issues are traced back to the original source and fixed as early as possible in the overall data pipeline.

A shift left approach requires new collaborations in an organization and it requires people to take up new responsibilities. For example a product manager might want to perform analysis on sales data from her product. The data source could be a CRM. As long as the sales executives input incorrect data, she will have to perform data wrangling to come up with useful analytics. However, the CRM owner could improve the data model to avoid incorrect or non-uniform data entry at the source. A simple example of this is replacing free input text fields in a CRM with a dropdown (or so called “enum”) to uniform the data.


Types of data wrangling

Here’s a short overview of data quality aspects that are often fixed in a data wrangling tool:

  • Fill in missing values: e.g. customer records with a missing country. Sometimes this can be fixed by introducing default values or working with 3rd party enrichment data. Worst case, the missing data must be manually looked up and entered.
  • Make data uniform: we gave the example before of replacing “UK” with “United Kingdom”. Ideally “enums” are used to avoid incorrect values. In a business application this means replacing free text fields with a dropdown or lookup field.
  • Detect outliers: outliers can be detected by sorting on a specific column or filtering, or in general applying rules to the data. Outliers can be manually fixed, the root cause can be fixed, or outliers can also simply be deleted which is common when training an ML/AI model.
  • Converting data formats: e.g. currency data sometimes needs cleaning such as replacing a dot with a comma or removing spaces from VAT numbers.
  • Converting data types: e.g. converting a text field to a numerical field to allow aggregation such as making a sum by removing a currency sign in a field to make it numerical.

Tools for data wrangling

Data wrangling can be done in many different ways and tools, depending on the persona and use case. Here are a few tools and technologies that are used to clean data:

  • Visual data wrangling tools such as Trifacta.
  • SQL based transformations, using for example dbt.
  • ETL pipeline transformations in tools such as Snaplogic.
  • Programming data wrangling transformations in e.g. Python or R, because it needs to be applied on big data sets using Spark.
  • MS Excel: still the number one tool used by millions to clean data and work with data
  • All-in-one data platforms such as

Pitfalls of data wrangling

The biggest pitfall to avoid in data wrangling is of course working on a data “silo” so that all the hard work is lost in future applications of the data. This typically happens when you receive an export or “dump” of data that you need to clean in Microsoft Excel or Google Sheet. After a few days or weeks your data is outdated, you need a new export and you have to start your data wrangling work all over.

Ideally data wrangling is a built-in step of a data pipeline, so that when new data arrives that needs cleaning, the data owner can go in and do the work. This assumes that all efforts are recorded as “transformations” that will be re-applied when new data comes in. In other words, you only need to replace “UK” with “United Kingdom” once and after that it becomes a step that gets applied on all new data that flows through the pipeline.

One step further is to flow all these data improvements back to the actual source system. It’s great that “UK” gets replaced in the data warehouse, but in an ideal world it gets replaced in the source system, e.g. the company’s CRM Salesforce. However, this reverse ETL process has some additional challenges. For now, I wish you happy data wrangling !

Niko Nelissen

Niko Nelissen

Niko Nelissen is the founder of Peliqan. Before Peliqan, Niko founded, a no-code iPaaS integration platform. was acquired by Qlik in 2020 and Niko became Senior Director of Product Management for Automation at Qlik. Niko’s primary interest is in the modern data stack, ML/AI as well as SaaS software in general and building disruptive technology.