(c) 2022 Peliqan BV
Data has become a valuable asset in every company, as it forms the basis of predictions, personalization of services, optimization and getting insights in general. While companies have been aggregating data for decades, the tech stack has greatly evolved and is now referred to as “the modern data stack”.
The modern data stack is an architecture and a strategy that allows companies to become data driven in every decision they make. The goal of the modern data stack is to set up the tools, processes and teams to build up end-to-end data pipelines.
A data pipeline is a process where data flows from a source to a destination, typically with many intermediate steps. Historically this process was called ETL, referring to the 3 main steps of the process:
The first step in a data pipeline is to extract data from the source. The challenge is that there are many different types of sources. These sources include databases, log files but also business applications. Modern tools such as Stitch and Fivetran make it easy to extract data from a wide range of sources, including SaaS business applications. For the latter, the APIs of those SaaS applications are used to read the data incrementally.
For large databases, the concept of CDC is often used. CDC (change data capture) is a method where are changes that occur in a source database (inserts of new data, updates of existing data and deletions of data) are tracked and sent to a destination database (or data warehouse) to recreate the data. CDC avoids the need to make a daily or hourly full dump of the data which would take too long to import.
Data transformations can be done in different ways, for example with visual tools where users drag and drop blocks to implement a pipeline that has multiple steps. Each step is one block in the visual workflow and applies one type of transformation.
Another popular and “modern” approach is to use SQL queries to define transformations, this makes sense because SQL is a powerful language, known by many users and it’s also “declarative” meaning that users can define in a concise manner what they want to accomplish. The SQL queries are executed by the data warehouse to do the actual transformations of the data. Typically this means that data moves from one table to the other, until the final result is available in a set of golden tables. The most popular tool to implement pipelines using SQL queries is called “dbt”.
Data scientists will often use programming languages to transform the data, for example Python or R.
The third step in a classic ETL pipeline, is to load the data into tables in a data warehouse. A well known pattern to load the data is the so called star schema, which defines the structure of the tables and how information is organized in these tables.
Funny enough, ETL does not cover the entire data pipeline. An ETL pipeline stops at the data warehouse as its final destination. The data warehouse is a central location to store data, but the end goal is typically either BI tools to create dashboards (Qlik, Tableau, Power BI, Looker) or a machine learning model that uses the data from the data warehouse to make for example predictions.
More recently companies have adopted an ELT approach, switching the L and the T. This means that the data is extracted and then loaded into a central repository, but the transformation takes place at a later date, if and when it’s necessary. The switch from ETL to ELT is a result of the explosion of data which is being generated. Since the transformation step is the most complex and most costly step, it makes sense to wait and only transform the data that is actually required at a certain point in time.
ELT is considered a more modern approach compared to ETL. Not because it’s better, but at least the data is stored in a central location, although in a raw format, so that it is available to multiple teams within the company. That’s what we call data democracy, or I should say the first step to democratize data access in your organization.
However, in an ELT context, the data consumer is confronted with raw data that still needs to be transformed to make it usable within a given context. Those transformations could include documenting the data, cleaning the data, replacing data, combining data etc. In short, everything that is needed to get to a so called “golden dataset” which can be used to feed a BI dashboard or which can be used to start making AI-driven predictions.
Data pipelines used to go in one direction only: from the source to the destination. The destination is typically a data warehouse. Reverse ETL is a relatively new concept where the data from the data warehouse is used to feed it back into business applications, for example to keep data in sync between different business applications or to enrich data in a given business application. For example, using reverse ETL, you could feed information on support tickets into Salesforce, so that sales managers can see how their customers are interacting with the support team. In this example, the total number of support tickets per customer is queried from the data warehouse, and the result is written into a few customer fields in Salesforce.
A different approach to reverse ETL is using an iPaaS, an integration platform. An iPaaS does not require a data warehouse to be in place, instead it will flow data directly from one business application to another. Both strategies (reverse ETL and iPaaS) make sense and have different pro’s and con’s.
Data warehouses are essentially databases but with a different architecture which is optimized for reading and aggregating data. The main difference between a regular database and a data warehouse (DW) is that DWs are columnar oriented, which means that the data is stored per column and not per row. This makes it faster to e.g. get all the amounts of orders and calculate the sum.
In the modern data stack, the data warehouse resides in the cloud. Some of the most common data warehouses in the cloud include Snowflake, Redshift (offered by Amazon AWS), Google Bigquery and Microsoft’s data warehouse on Azure. Snowflake pioneered the concept of splitting up storage nodes and compute nodes. If you use a data warehouse such as Snowflake, you pay for the “compute” power that you use, which are the servers (in the cloud) responsible for executing your queries.
A different and more recent approach to data warehouses is the concept of a data lake. A data lake is a different approach where data is “dumped” in files on storage, for example Amazon S3. This is easier than loading the data into a data warehouse because no transformation is needed to fit the data into tables. Further more, a data lake can store huge amounts of data at a very low cost. For example if your web shop generates large log files everyday (perhaps gigabytes, terrabytes or even petabytes in size), it makes sense to store these files on cloud storage.
While data lakes can hold files of different types, you will often see CSV files, but also files with a specific file format such as Avro or Parquet. While most file formats store data row by row (row oriented), Parquet stores data in a columnar fashion, similar to a data warehouse.
A data lake does not allow direct querying of the data. That’s what query engines such as Presto and Trino solve, they allow you to execute SQL queries on data lakes, making it easy to work with the data. AWS users will be familiar with e.g. Athena, which is based on Presto. Another approach is to process the data using Spark, which allow parallel processing of data by using code written in Scala or even Python. Data engineers will often use Spark to process data and e.g. use it to feed machine learning (AI) models.
Databricks is a well known vendor offering data lake solutions.
Both data warehouses and data lakes have pro’s and con’s, so why not add a new concept: the lakehouse. A lakehouse is a data lake but with an additional meta layer that describes the structure of the data. Since data warehouses such as Snowflake already decouple storage from compute and data lakes such as Databricks already offer all the tools to process and query the data, the difference between data warehouses (DWs) and data lakes (DLs) will probably blur in the near future.
We touched upon the basic ingredients of the modern data stack, but there are more aspects to it. For example data governance is a hot topic, where all aspects of the data lifecycle are managed including the quality of the data, SLA’s on the delivery of data, managing access to data etc.
Another important topic is data observability or data monitoring. The modern data stack approach tries to implement best practises from other mature sectors such as software development and DevOps. Similar to how software services in the cloud have to be monitored, data also needs to be monitored end-to-end in a data pipeline to make sure the correct data flows across.
We also need to talk about data catalogs, they are a central market place for team members to discover data assets that are available throughout the organization. Data catalogs will typically also visualize the data lineage, which means showing all the steps that were taken from the source to get to a final data asset which is used by an individual.
With data pipelines becoming increasingly more complex, users need a way to automate the entire process, and that’s what orchestration tools such as Airflow will handle. Orchestration tools can handle all the steps that have to be taken in the right order to process the data. These steps are typically visualized as a workflow, more specifically in a so called “DAG” (a directed acyclic graph). A DAG is a fancy word for a workflow with different paths that can run at the same time — without having loops. The DAG shows all the steps that need to be taken, including their dependencies, for example “wait for all data from source A and from source B to arrive, before merging the data in the next step”.
Finally, the hot new kid on the block today is the data mesh. The data mesh is new concept that explains how large organizations can enable local teams to build data pipelines and make the data available across the larger organization, without the need for one central CDO (Chief Data Officer) office that coordinates and implements all the data pipelines. Instead, local teams are empowered to work in an autonomous fashion, while still adhering to certain rules (e.g. around meta data and documentation of the data) so that the central data engineering team is not a bottleneck.
Each of these topics deserves its own blog article (and more), stay tuned on the Peliqan blog for more !