An interesting topic in the data world revolves around data mutability and immutability. I have always been intrigued by this topic, and to be honest, early on in my career I did not fully understand all the fuzz about it.
Mutability means that something can be changed, altered, updated. Data in a business context is usually mutable. For example when you work in a CRM, you can freely update the details of a customer, and when you work with a relational database such as SQL server or MySQL, you can also make changes to records in tables — assuming of course you have the required permissions.
At first sight, you would think that data should always be mutable: if something needs an update, just make the update. The opposite would be unthinkable: if our business data would be immutable, it would be impossible to make changes and e.g. update the address of a customer when that customer moves to a new address.
However, mutable data has a few major drawbacks. For example, if data can be freely changed, there is very little control over what needs to happen on a change. When a customer address is changed, perhaps it needs to be verified or maybe the invoicing details need to be updated as well. Another challenge of freely mutable data is that old values are no longer available. What if the address was changed in error and needs to be reverted ?
Mutable data also makes it hard to understand things that have happened in the past. Let’s look at a more complex example of a relational database with customers and invoices. The customer records holds the address and the invoices are linked to a customer using the customer id (that’s the foreign key). Here’s a simple example of a customer with 2 invoices:
Customer id: 103
Name: ACME
Address: Grand Rue 102, Paris, France
Invoice id: 1001
Customer id: 103
Invoice date: 1st of March
Total: 1000 EUR
Invoice id: 1020
Customer id: 103
Invoice date: 1st of April
Total: 1200 EUR
Assume that someone changes the address of the customer, and one week later the latest invoice is returned by mail because the address was incorrect. How would you know if that invoice was sent to the old address or the new address ? That’s exactly the type of problem “data immutability” solves.
What is Data Immutability?
Data immutability refers to the principle of making data unchangeable once it’s created. Unlike traditional mutable data that can be freely edited or deleted, immutable data remains fixed, creating a permanent record. This approach offers several advantages. Firstly, it ensures data integrity and prevents accidental or malicious modifications. Secondly, it allows for easier auditing and tracking changes over time, making it valuable for compliance and historical analysis.
Implementing immutability can involve various techniques. “Append-only” databases, for example, only allow adding new entries, while existing ones stay untouched. Another method involves creating new versions of data with edits, leaving the original version intact. While immutability can sometimes pose challenges in terms of updating data, it offers significant benefits in terms of security, reliability, and data governance.
How data immutability works
How exactly does data immutability work ? Does it mean that data cannot be changed, ever ? No, of course not. Instead of overwriting old data, the data is replaced. There are different ways to replace old data. For the above example of customers and invoices, we could simply introduce a new customer record with a new id:
Customer id: 107
Name: ACME
Address: Rue Picard 30, Paris, France
New invoices can be linked to this new customer id, so that it’s always clear to which address the invoices were sent. In reality it gets a bit more complex: we would introduce a new field to indicate which customer record is the current and valid record, and we would also link the new customer to the old one to make it clear that this is still the same customer ACME.
Even better would be to store the address in a separate table. In that case the customer record does not change, only the linked record with the address would change. Keep in mind, we would not overwrite the old address, instead we add a new address and flag the previous one is “old”.
This is what it would look like in our database:
Customer id: 103
Name: ACME
Address id: 2000
Customer id: 103
Address: Grand Rue 102, Paris, France
Current address ? No
Address id: 2005
Customer id: 103
Address: Rue Picard 30, Paris, France
Current address ? Yes
Great, we now have an overview of past and current addresses of our customers. Each address by itself is immutable, it’s carved in stone. We can add a new address whenever we need to do so, but we should not alter an existing address. This list of addresses now becomes a “log” of changes, that gives us full insight into what happened in the past. In a real world scenario we would add a timestamp as well:
Address id: 2000
Customer id: 103
Address: Grand Rue 102, Paris, France
Current address ? No
Timestamp_created: 1st of March
Address id: 2005
Customer id: 103
Address: Rue Picard 30, Paris, France
Current address ? Yes
Timestamp_created: 1st of May
Time travel
Now that we have the above list of addresses, we can perform so called “time travel”. Let’s say you wanted to know what the data looked like on April 1st. Simply ignore the records that were added after this date, and you will see the data exactly as what it looked like on April 1st:
Address id: 2000
Customer id: 103
Address: Grand Rue 102, Paris, France
Timestamp_created: 1st of March
Data immutability allows “time travel”, to see what data looked like at any point in time in the past.
Performance
We now understand what immutable data means in a relational database: it’s a pattern to store data changes in a way that it preserves the old data.
Immutable data concepts can also be introduced for performance reasons, for example when data is stored in files. If you want to update one record in the middle of a file, it’s impossible to do so. For example if you have a data lake and you use Amazon AWS S3 to store the files, you will notice that updating individual records (lines) in e.g. a CSV file is impossible to do.
Instead, a so called “copy on write” is used. When data needs to be changed, a copy of the file is created with the required changes and the old files is deleted. That can be a heavy process to change just one record, imagine the fill is really big (e.g. 1GB in size) and you need to create a copy. As an alternative, it’s also possible to keep the original file and store the changes in a new file. This allows to write data updates very quickly, however the performance hit is now moved to the reading phase: when the data is being read, the changes need to be applied first.
Benefits of immutable data
As a recap, immutable data is a pattern to handle changes in data and there are different reasons to implement this pattern:
- Keeping track of data changes and preserving old data
- Being able to do “time travel” (see exactly what the data looked like at a certain point in time)
- Performance reasons, e.g. in a data lake by doing a “copy on write”
- Making sure that certain business logic is applied when data needs to be updated
At Peliqan.io we handle all source data as immutable, regardless of what the source looks like. This means that data engineers can configure read-only access for business users to access the data that they need. The data source can be a database, a data warehouse or a data lake. At the same time we make sure that from a business user perspective, anyone can “work” with the data in a virtual copy, for example transform, edit, annotate or clean the data. Peliqan.io combines the power of immutability with the freedom to work with a personal copy of the data that magically stays up to date with the source.