(c) 2022 Peliqan BV
A major challenge when working with data is the quality of the data. Whether you want to visualize data in a dashboard or BI tool, or you want to feed data into a machine learning model to get predictions, the outcome of a data pipeline is always heavily dependent on the quality of the data that goes in.
Data wrangling typically refers to the manual work that a human needs to perform to clean the data. In an end-to-end data pipeline, there are usually many automated steps to transform the data, for example changing the data model into a star schema, filtering the data into data marts etc. But with “data wrangling” we typically mean manual work that needs to happen, because somebody needs to actually look at the data and decide what part needs cleaning.
However, AI has recently started taking over part of the data wrangling job. AI is now also capable of looking at a dataset and figuring out which parts need some sort of cleaning. A simple example would be a “country” column in a table with customer data. AI could figure out that since most countries are written as “United Kingdom”, “France”, “Italy” etc., the abbreviation “UK” needs to be replaced with “United Kingdom” to have a uniform list of countries.
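The country-normalization idea above can be sketched in a few lines of plain Python. The customer records and the alias mapping below are hypothetical, purely for illustration; a real tool would infer the canonical values from the data itself.

```python
from collections import Counter

# Hypothetical customer records with non-uniform country values
customers = [
    {"name": "Acme Ltd", "country": "United Kingdom"},
    {"name": "Brit Co", "country": "UK"},
    {"name": "Paris SA", "country": "France"},
    {"name": "Roma Srl", "country": "Italy"},
]

# Map known abbreviations to the canonical form used by the majority of rows
ALIASES = {"UK": "United Kingdom", "US": "United States"}

for row in customers:
    row["country"] = ALIASES.get(row["country"], row["country"])

print(Counter(row["country"] for row in customers))
# Counter({'United Kingdom': 2, 'France': 1, 'Italy': 1})
```

After the replacement, a count per country produces one entry per real country, which is exactly what a “customers per country” chart needs.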
As a data-driven person, you typically spend a large part of your time on cleaning the data before you can do your actual work. There are different reasons why you would need to clean your data:
Data wrangling allows you to check if data is actually correct. For example if you discover “outliers” in your data, you can zoom in and figure out if that data makes sense. A good example would be a table with revenue per customer. An outlier could be a customer with exceptionally low or high revenue. By having a look at those outliers, you can figure out if something is wrong with the data that needs to be fixed at the source.
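A minimal sketch of such an outlier check, using made-up revenue figures and a simple standard-deviation rule (real tools often use more robust methods such as the interquartile range):

```python
import statistics

# Hypothetical revenue per customer; one value is suspiciously high
revenue = {"cust_a": 1000, "cust_b": 1100, "cust_c": 900, "cust_d": 1050, "cust_e": 50000}

mean = statistics.mean(revenue.values())
stdev = statistics.stdev(revenue.values())

# Flag customers more than 1.5 standard deviations from the mean for manual review
outliers = [c for c, r in revenue.items() if abs(r - mean) > 1.5 * stdev]
print(outliers)  # ['cust_e']
```

The flagged customers are not automatically “wrong”; they are the rows a human should zoom in on to decide whether something needs fixing at the source.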
Another example would be to combine rules, e.g. “B2B customers with a missing VAT number”. Again, by zooming in on all customers that pop up in this filter, you can identify rows of data that need an update. Data wrangling sometimes feels like combing the data in different directions, each time checking which data is caught by the comb and needs to be untangled.
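The “B2B customers with a missing VAT number” rule is straightforward to express as a filter. The records below are hypothetical:

```python
# Hypothetical customer records, some with a missing VAT number
customers = [
    {"name": "Acme Ltd", "type": "B2B", "vat": "GB123456789"},
    {"name": "Brit Co", "type": "B2B", "vat": None},
    {"name": "Jane Doe", "type": "B2C", "vat": None},
]

# "Comb" the data: B2B customers where the VAT number is missing
to_review = [c["name"] for c in customers if c["type"] == "B2B" and not c["vat"]]
print(to_review)  # ['Brit Co']
```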
Data wrangling is very important if you want to do some sort of aggregation or grouping of the data, which happens in almost every dashboard that visualizes data. For example if you want to see a bar chart with the number of customers per country, you have to make sure that the country names for all your customers are uniform. If you have customers with country “UK” and others with country “United Kingdom”, this will show up as two separate bars in your graph, which is an undesired result.
In many cases you want to combine or join data from multiple tables. One table might hold all your customers, and a second table might hold all purchases. In order to see top customers by purchase value, you need to perform a “join” between both tables. Sometimes such a join is straightforward, using the so-called primary key (PK) in one table and the foreign key (FK) in the second table. In our previous example the key that ties together both tables would be “customer id”. However, if both tables originate from a different source, you might not have such a key to use. In that case you might need to match data on other fields such as the customer VAT number. When you do so, you have to make sure that the data looks exactly the same in both tables. Data wrangling could be used to e.g. remove spaces from the VAT number so that the values match in both tables.
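A sketch of such a cross-source join, where the VAT number is normalized before matching. The table contents and the `norm_vat` helper are hypothetical:

```python
# Hypothetical tables from two different sources, without a shared customer id
customers = [{"vat": "BE 0123.456.789", "name": "Acme BV"}]
purchases = [{"vat": "BE0123456789", "amount": 250}]

def norm_vat(vat: str) -> str:
    # Strip spaces and dots so VAT numbers match across sources
    return vat.replace(" ", "").replace(".", "").upper()

# Join purchases to customers on the normalized VAT number
by_vat = {norm_vat(c["vat"]): c["name"] for c in customers}
joined = [(by_vat.get(norm_vat(p["vat"])), p["amount"]) for p in purchases]
print(joined)  # [('Acme BV', 250)]
```

Without the normalization step, `"BE 0123.456.789"` and `"BE0123456789"` would never match and the join would silently drop rows.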
Machine learning and AI are used to get new insights from data. ML/AI can be used to make predictions, for example “which of my customers are most likely to churn?”. ML/AI can also be used for classification, pattern detection and extrapolation. In all of these cases, the ML/AI model needs data as input. Input data is needed for two reasons:
Training: an ML/AI model needs to be trained on historic data.
Applying the model: once the model is trained, it needs fresh data to make new insights.
For both these steps, the quality of the data is important. Incorrect data in the training phase will lead to a poorly trained model. And incorrect data used to apply the model can lead to wrong conclusions. This is the reason why data wrangling is such an important part of every ML/AI project.
Master data management or MDM is the overall process of cleaning and building the “master” datasets in a company which are shared between teams. An interesting case of MDM occurs in the company’s CRM, for example Salesforce. Within the CRM context, MDM means that the data in a CRM is cleaned in such a way that a single customer view is possible, or in other words that every customer exists only once in the CRM. This might sound trivial but it definitely is not. In larger organisations, multiple sales executives might enter information in the CRM, in such a way that the same customer appears multiple times, e.g. as “Coca Cola” and as “Coca Cola Inc”. This is a simplistic example, the challenge can be that customers have multiple entities and sub entities. MDM in a CRM is the process of cleaning up this data and linking or merging customer records (and all the data linked to it such as proposals, orders, invoices, contacts etc.) to end up with a single customer view. MDM therefore requires quite some manual data wrangling, combined with enrichment data that can for example recognize that two companies belong to each other, based on their address, NACE code and other information.
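A simplistic sketch of the name-matching part of MDM, grouping CRM records that likely refer to the same company by stripping punctuation and legal-form suffixes. All names and the suffix list are hypothetical; real MDM combines this with enrichment data (address, NACE code, etc.) as described above:

```python
import re

# Hypothetical CRM rows; the same customer was entered twice with name variations
crm_rows = ["Coca Cola", "Coca Cola Inc", "Pepsi Co"]

LEGAL_SUFFIXES = {"inc", "ltd", "bv", "co", "corp", "sa", "srl"}

def name_key(name: str) -> str:
    # Lowercase, drop punctuation and legal-form suffixes to build a match key
    tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)

# Group records by match key; groups with more than one row are merge candidates
groups = {}
for name in crm_rows:
    groups.setdefault(name_key(name), []).append(name)
print(groups)
# {'coca cola': ['Coca Cola', 'Coca Cola Inc'], 'pepsi': ['Pepsi Co']}
```

The output only proposes merge candidates; deciding whether two records really are the same customer (and merging the linked proposals, orders and invoices) still needs a human or an enrichment source.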
More generally, data wrangling is an important aspect of overall business agility. Companies that are data driven rely on high-quality data. While data should be as clean as possible within the source, the reality is that data wrangling is needed to improve the quality of the data for all consumers.
Eventually data should become a “data product”, which means it is ready for consumption by other teams without the need to perform data wrangling. This requires a shift left approach, where the data quality issues are traced back to the original source and fixed as early as possible in the overall data pipeline.
A shift left approach requires new collaborations in an organization and it requires people to take up new responsibilities. For example a product manager might want to perform analysis on sales data for her product. The data source could be a CRM. As long as the sales executives input incorrect data, she will have to perform data wrangling to come up with useful analytics. However, the CRM owner could improve the data model to avoid incorrect or non-uniform data entry at the source. A simple example of this is replacing free input text fields in a CRM with a dropdown (a so-called “enum”) to make the data uniform.
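The dropdown/enum idea amounts to validating values against a fixed list at entry time instead of cleaning them up later. A minimal sketch, with a hypothetical allowed list:

```python
# Allowed values for a hypothetical "country" dropdown (enum) in the CRM
ALLOWED_COUNTRIES = {"United Kingdom", "France", "Italy"}

def validate_country(value: str) -> str:
    # Reject free-text entries that are not part of the enum, at entry time
    if value not in ALLOWED_COUNTRIES:
        raise ValueError(f"not an allowed country: {value!r}")
    return value

print(validate_country("France"))  # France
```

Entering “UK” would now fail at the source instead of producing a second bar in everyone’s dashboard downstream.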
Here’s a short overview of data quality aspects that are often fixed in a data wrangling tool:
Data wrangling can be done in many different ways and tools, depending on the persona and use case. Here are a few tools and technologies that are used to clean data:
The biggest pitfall of data wrangling to avoid is of course working on a data “silo”, so that all the hard work is lost in future applications of the data. This typically happens when you receive an export or “dump” of data that you clean in MS Excel or Google Sheets. After a few days or weeks your data is outdated, you need a new export and you have to start your data wrangling work all over.
Ideally data wrangling is a built-in step of a data pipeline, so that when new data arrives that needs cleaning, the data owner can go in and do the work. This assumes that all efforts are recorded as “transformations” that will be re-applied when new data comes in. In other words, you only need to replace “UK” with “United Kingdom” once and after that it becomes a step that gets applied on all new data that flows through the pipeline.
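The “record once, re-apply on every new batch” idea can be sketched as a list of transformation functions that a pipeline runs over each incoming batch. The transformations and the sample batch are hypothetical:

```python
# Record each cleaning action once as a transformation; the pipeline re-applies
# the whole list every time new data flows through.
transformations = [
    # Replace "UK" with "United Kingdom" (recorded once, applied forever)
    lambda row: {**row, "country": {"UK": "United Kingdom"}.get(row["country"], row["country"])},
    # Strip spaces from VAT numbers
    lambda row: {**row, "vat": row["vat"].replace(" ", "")},
]

def run_pipeline(rows):
    for t in transformations:
        rows = [t(row) for row in rows]
    return rows

# A fresh batch arriving later gets the same cleaning automatically
new_batch = [{"country": "UK", "vat": "GB 123"}]
print(run_pipeline(new_batch))  # [{'country': 'United Kingdom', 'vat': 'GB123'}]
```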
One step further is to flow all these data improvements back to the actual source system. It’s great that “UK” gets replaced in the data warehouse, but in an ideal world it gets replaced in the source system as well, e.g. the company’s Salesforce CRM. However, this reverse ETL process comes with additional challenges of its own. For now, I wish you happy data wrangling!