Version Control for Data Teams
A metric suddenly changes. The query is sitting there in your BI tool, but you have no idea what’s different from last week. Someone remembers tweaking something on Friday, but was that this query or a different one? There’s no history, no way to diff the changes, and no clear path back to the working version. This plays out constantly on data teams, and it’s completely avoidable.
Why this keeps happening
I’ve seen plenty of data teams treat version control as a “nice to have,” something only software engineers need to worry about. The result is predictable chaos: critical SQL lives in BI tools with no meaningful change history, so you can’t see who modified a query or why. Notebooks accumulate increasingly desperate names like analysis_final_FINAL_v3_actually_final.ipynb because there’s no better way to track versions. Pipeline configurations live exclusively in web UIs where, if you’re lucky, you get a timestamp of when something changed, but no diff of what actually changed and no context about the decision. Every modification is a leap of faith, and every debugging session starts with archaeology.
Then someone makes what looks like a harmless change:
- left join payments p on p.order_id = o.id
+ inner join payments p on p.order_id = o.id
Just one word, but now you’re silently excluding every order that has no matching payment record. Without a commit message explaining why it changed, you’re stuck guessing. You can’t reproduce old results, you can’t collaborate with your team, and you can’t debug the issue.
What Git actually gives you
I don’t want to oversell this, because Git is just a tool. But it’s an unusually useful tool for data work, and the benefits compound over time in ways that aren’t obvious at first.
You get a timeline, not just of what changed, but of the context around why it changed. This matters more than you’d think. Six months from now, when you’re staring at a piece of SQL wondering why it’s written in a weird way, the commit message that says “filtering out test accounts because they skewed the conversion metrics” saves you from having to rediscover that context through trial and error. Even when you make a mistake, having that history helps you understand what you were thinking at the time, which makes it easier to learn from those mistakes instead of just repeating them.
You can reproduce old results with confidence. When your CEO asks “how did we calculate churn in Q2?”, you can actually answer that question instead of shrugging and hoping your memory is accurate. You check out the commit from that quarter, rerun the analysis with the exact same code, and get the exact same numbers. No guessing about which version of the query was in production, no anxiety about whether you’re comparing apples to oranges.
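Concretely, that recovery is a couple of commands. A minimal sketch, assuming the analysis lives in a file called `churn_analysis.sql` and that Q2 ended on the date shown (file name and date are both illustrative):

```bash
# Find the last commit that touched the churn analysis before the end of Q2
git log --until="2024-06-30" -1 --oneline -- churn_analysis.sql

# Check the whole repo out exactly as it was at that commit
git checkout <commit-sha>

# ...rerun the analysis, get the same numbers as the Q2 report...

# Come back to the present
git checkout main
```

Tagging reporting snapshots with `git tag` makes this even easier, since you can check out a named tag instead of hunting through the log for a commit.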
Collaboration gets easier too, and in more ways than you’d expect. Pull requests aren’t bureaucracy; they’re how your individual SQL turns into team knowledge. Even a quick review catches obvious mistakes, like that left join quietly becoming an inner join, before they reach production. More importantly, the PR documents the discussion around why you made certain choices. Future team members (including future you) can read that context and understand the tradeoffs without having to ask around or reverse engineer the logic.
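The mechanics are lightweight even on a team of one. A sketch of the flow, with hypothetical branch and file names:

```bash
# Work on a branch instead of committing straight to main
git checkout -b filter-test-accounts

# ...edit models/conversion.sql...

git add models/conversion.sql
git commit -m "Filter out test accounts: they skewed the conversion metrics"
git push -u origin filter-test-accounts
# Then open a pull request in GitHub/GitLab/etc. and ask for a quick review
```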
And debugging actually becomes tractable instead of a guessing game. When a metric suddenly changes and you’re not sure why, you can diff the code to see exactly what’s different. Once you’ve found the change, you can revert it, fix it properly, or at least understand what went wrong. Without version control, you’re stabbing in the dark, hoping you remember what changed.
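With history in place, that investigation is short. Paths and placeholders here are hypothetical:

```bash
# What touched the revenue model in the last week?
git log --since="1 week ago" --oneline -- models/revenue.sql

# Show exactly what changed between a known-good commit and now
git diff <good-sha> HEAD -- models/revenue.sql

# Undo the offending commit without rewriting history
git revert <bad-sha>
```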
Some objections I’ve heard
Our SQL lives in the BI tool
Yeah, a lot of teams are in this situation, but there are a few ways to get started. The ideal is to move your logic into versioned models (dbt is great for this) and point your BI tool at clean tables instead of having it do the heavy lifting. This also helps you stay DRY (don’t repeat yourself), since the same models can feed multiple dashboards. A pragmatic middle ground is a script that exports your BI tool’s SQL to Git on a schedule; at least you get history and diffs, even if the workflow isn’t perfect. Either way, start small: pick your most critical dashboard and move just that one to versioned SQL.
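That export job doesn’t need to be fancy. A rough sketch, run from cron or any scheduler; the `bi-export` command is a stand-in for whatever export API or CLI your BI tool actually provides:

```bash
#!/usr/bin/env bash
set -euo pipefail
cd /path/to/bi-sql-repo

# Placeholder: dump each saved query to its own .sql file.
# Swap in your BI tool's real export mechanism here.
bi-export --format sql --out ./queries/

# Commit only when something actually changed
git add -A queries/
if ! git diff --cached --quiet; then
    git commit -m "Automated BI query snapshot, $(date +%F)"
    git push
fi
```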
Notebooks are impossible to version
They’re annoying, but not impossible. A few approaches that actually work:
- Use Marimo, a notebook that is just a Python file. Diffs work normally, no weird JSON.
- Try Jupytext, which maintains a `.py` or `.md` version alongside your `.ipynb`
- At minimum, use `nbdime` to get readable diffs
You can also parameterize your notebooks so you can rerun them with a specific date and random seed. And don’t commit output cells; it keeps the repo lean and the diffs readable.
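For the Jupytext and nbdime routes, setup is a handful of commands; `nbstripout` is one common way to keep output cells out of the repo (all three are separate installs):

```bash
# Pair the notebook with a plain .py script that Jupytext keeps in sync
jupytext --set-formats ipynb,py:percent analysis.ipynb

# After editing either file, bring the pair back in sync
jupytext --sync analysis.ipynb

# Human-readable notebook diffs via nbdime
nbdiff old_analysis.ipynb new_analysis.ipynb

# Strip output cells automatically at commit time (installs a git filter)
nbstripout --install
```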
We’re too small for this kind of process
Small teams are exactly who should do this (even one-person teams!). You don’t have the slack to waste time reconstructing what happened last month.
Here’s the minimum viable setup:
- One repo with folders for `/models`, `/analyses`, `/notebooks`
- Branch and PR for any logic change (even solo, this helps future you)
- Write a one-sentence commit message explaining why you made the change
- Optional but helpful: a basic CI check that runs your SQL against a dev schema (see the sketch below)
That’s it. It takes an afternoon to set up and saves you countless hours of confusion.
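The optional CI check can start as a single script your CI runner executes on every PR. A minimal sketch, assuming Postgres, `psql` available on the runner, and a `DEV_DATABASE_URL` secret pointing at your dev schema (all assumptions, adapt to your stack):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Run every model against the dev database; fail the build on the first error.
for f in models/*.sql; do
    echo "Checking $f"
    psql "$DEV_DATABASE_URL" -v ON_ERROR_STOP=1 -f "$f" > /dev/null
done
echo "All models ran cleanly"
```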
What changes when you actually do this
The biggest shift isn’t technical, it’s psychological: data work gets less stressful.
- When something breaks, you can find what changed with `git diff` instead of interrogating your teammates
- Refactoring stops being scary because you can always revert
- New team members can read through commit history to understand why things are built the way they are
- You stop having to choose between “move fast” and “be reliable”