Decoupling Storage and Transform


This is not a new concept, but rather my take on something that is already happening in the data space. When we look at the Modern Data Stack, there is one part of it that is fundamentally different from traditional data workflows: the ETL step. Traditionally, when you wanted to load data into a central repository, you would follow an ETL approach: Extract, Transform, and Load.

The idea here is that we extract data from our transactional systems (for example, our CRM). We rarely get that data in a clean format we can query directly, so a transformation has to be applied first. For example, let’s say that you want to know which sales rep is handling each opportunity in your CRM. However, your CRM’s API might give you a JSON like this:

{
	"id": "111111111",
	"owner": {
		"id": "4325",
		"name": "Juan"
	}
}

What you really want to insert into your database is something like owner_id and owner_name. Therefore, during your transformation step, you would read this JSON and flatten it. Finally, during the load step, you would insert the flattened records into your database.
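To make that concrete, here is a minimal sketch of the transform-and-load steps in Python. SQLite stands in for the target database, and the table and column names are only illustrative:

import json
import sqlite3

# The payload exactly as it comes back from the CRM API (same example as above).
raw = '{"id": "111111111", "owner": {"id": "4325", "name": "Juan"}}'

# Transform: flatten the nested "owner" object into scalar columns.
record = json.loads(raw)
row = (record["id"], record["owner"]["id"], record["owner"]["name"])

# Load: insert the flattened record into the target database.
con = sqlite3.connect("warehouse.db")
con.execute(
    "CREATE TABLE IF NOT EXISTS opportunities (id TEXT, owner_id TEXT, owner_name TEXT)"
)
con.execute("INSERT INTO opportunities VALUES (?, ?, ?)", row)
con.commit()
con.close()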

The Shift to ELT

Here’s where things get interesting. The Modern Data Stack flips this on its head: Extract, Load, then Transform. Instead of transforming data before loading it, you load the raw data first, often storing it exactly as it comes from the source. Only then do you transform it, using tools like dbt.
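For contrast with the ETL sketch above, here is roughly the same flow done as ELT, again with SQLite standing in for a cloud warehouse (this assumes a SQLite build with the JSON1 functions, which modern builds include). The raw payload is stored untouched, and the flattening happens later, entirely inside the database; the table and view names are made up for illustration:

import sqlite3

# The same payload, but this time we load it exactly as it arrived.
raw = '{"id": "111111111", "owner": {"id": "4325", "name": "Juan"}}'

con = sqlite3.connect("warehouse.db")

# Load: store the raw JSON untouched.
con.execute("CREATE TABLE IF NOT EXISTS raw_opportunities (payload TEXT)")
con.execute("INSERT INTO raw_opportunities VALUES (?)", (raw,))

# Transform: later, and entirely inside the database, flatten the JSON into
# an analytics-friendly model (this is the kind of SQL a dbt model contains).
con.execute("""
    CREATE VIEW IF NOT EXISTS stg_opportunities AS
    SELECT
        json_extract(payload, '$.id')         AS id,
        json_extract(payload, '$.owner.id')   AS owner_id,
        json_extract(payload, '$.owner.name') AS owner_name
    FROM raw_opportunities
""")
con.commit()
con.close()

That final CREATE VIEW is exactly the kind of logic you would put in a dbt model: it lives in the warehouse, and you can change it without ever touching the ingestion code.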

At first glance, this might seem like a minor reshuffling of the same operations. But it’s actually a profound architectural shift that changes how we think about data workflows entirely.

Why Does This Matter?

The traditional ETL approach made sense in a world where storage was expensive and compute was relatively cheap. You transformed data before loading it because storing unnecessary or messy data was costly. The transformation logic lived in scripts and was often tightly coupled to the loading process. If you wanted to change a transformation, you had to change the code, retest the pipeline, and re-run the load.

But here’s the catch: your analytical questions evolve faster than your data sources do.

When I first encountered this pattern in practice, I was working with a sales team that kept changing what they wanted to track. One week they cared about opportunities by owner, the next week they wanted opportunities by region, then by product line, then by a combination of all three with various date filters. Each time, we’d have to go back, modify the transformation logic, and reload historical data. On top of that, they would create new properties in HubSpot that I then had to load and transform. It was slow, brittle, and frustrating.

With ELT, you sidestep this problem. You load the raw data once, and transformation becomes a separate, queryable step.
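Continuing the sketch from above (it assumes the raw_opportunities table from that example already exists): answering a new question, say opportunities per owner, is just another query over data you have already loaded. The ingestion pipeline does not change at all:

import sqlite3

con = sqlite3.connect("warehouse.db")

# A new question shows up: how many opportunities does each owner have?
# No re-extraction and no reload: just a new transformation over the raw data.
rows = con.execute("""
    SELECT
        json_extract(payload, '$.owner.name') AS owner_name,
        COUNT(*)                              AS opportunities
    FROM raw_opportunities
    GROUP BY owner_name
""").fetchall()

for owner_name, count in rows:
    print(owner_name, count)

con.close()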

The Economics of Storage vs. Compute

The rise of cloud data warehouses like Snowflake, BigQuery, and Redshift fundamentally changed the economics of data storage. Storage became cheap, and compute became something you could spin up on demand and pay for only when you use it. Suddenly, the old constraint of “don’t store anything you don’t need” dissolved.

This shift unlocked a new approach: store everything in its raw form, and let the warehouse handle the heavy lifting of transformation. Modern columnar databases are built for exactly this: they’re optimized for analytical queries, they handle semi-structured data like JSON natively, and they scale well.

But more than that, decoupling storage from transformation gives you optionality. You’re no longer locked into the transformations you defined at load time. You can go back to the raw data and ask new questions without touching your ingestion pipeline. This is liberating in ways that are hard to overstate if you’ve ever been stuck in the old model.

A Hidden Benefit

There’s a less obvious benefit to this architectural shift that often goes unmentioned: organizational decoupling. When storage and transformation are separate, you can have separate teams owning each piece. Your data engineering team can focus on building reliable, robust ingestion pipelines, ensuring data flows from source systems into your warehouse without loss or corruption. Meanwhile, your analytics engineering team can focus entirely on the transformation layer, building and maintaining the models that turn raw data into business insights.

This might sound like a minor operational detail, but it’s transformative in practice. The skills and mindset required to build a reliable ingestion pipeline are fundamentally different from those needed to model business logic in SQL. The former requires deep knowledge of APIs, error handling, rate limits, and data extraction patterns. The latter requires an understanding of business logic, of the relationships between entities, and of how to structure data for analytical queries.
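To illustrate how different these concerns are, here is a hypothetical extraction helper written with the requests library. It worries about authentication, rate limits, retries, and timeouts, and knows nothing about business logic; the endpoint and retry policy are invented for the example:

import time
import requests

def extract_opportunities(base_url: str, api_key: str, max_retries: int = 5):
    """Pull raw opportunity records from a (hypothetical) CRM API."""
    url = f"{base_url}/opportunities"  # invented endpoint, for illustration only
    headers = {"Authorization": f"Bearer {api_key}"}

    for attempt in range(max_retries):
        response = requests.get(url, headers=headers, timeout=30)
        if response.status_code == 429:
            # Rate limited: back off exponentially and try again.
            time.sleep(2 ** attempt)
            continue
        response.raise_for_status()
        # Return the payload as-is; flattening it is the analytics team's job.
        return response.json()

    raise RuntimeError("Still rate limited after all retries")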

When these concerns are coupled, you need people who are good at both, or you end up with compromises. Ingestion gets hacky because it’s bundled with transformation logic, or transformations become inflexible because they’re constrained by the ingestion process. With ELT, each team can use the best tool for their job. Data engineers use tools like Airbyte for extraction. Analytics engineers use dbt for transformation. The interface between them is clean: raw data in the warehouse. No stepping on each other’s toes. No artificial coupling of unrelated concerns.