Monday I wrote about how multiclaude and GasTown converged on nearly identical primitives for multi-agent orchestration. The key insight wasn't about prompts or models or agent personas. It was about infrastructure: CI is the ratchet. Let chaos reign. Multiple agents, overlapping work, duplicated effort, whatever. As long as you have a mechanism that only captures forward progress, you're good.
That phrase has been rattling around my head ever since. Because here's the thing: we have this for code. What's the equivalent for data?
## The Missing Ratchet
CI transformed software development by giving us a one-way gate. Code either passes or it doesn't. No negotiations, no exceptions, no "we'll fix it later." The ratchet clicks forward, and it never clicks back.
Data has no such mechanism.
Oh, we have tools. We have Great Expectations (pun intended). We have dbt tests and schema validators and anomaly detectors. But none of them function as the arbiter: the single, uncompromising source of truth that says "this data is real now, and we're never going backward."
Instead, we have... hope? Process? Tickets that say "data quality issue" that sit in someone's backlog for three sprints while the dashboard keeps serving numbers that everyone knows are wrong but nobody can prove?
## What Would a Data Ratchet Look Like?
Let's steal the multiclaude architecture and apply it to data:
| Code Ratchet | Data Ratchet |
| --- | --- |
| CI tests | Schema validation + semantic checks |
| Passing tests | Data meeting quality thresholds |
| Merged PRs | Verified, immutable records |
| Git history | Data lineage with provenance |
| Multiple agents | Multiple validators / transformation paths |
The principle is the same: chaos is fine, as long as we ratchet forward.
Multiple data sources can feed into your system. They can be messy, inconsistent, formatted in ways that make you question whether the upstream team has ever heard of ISO 8601. That's the Brownian motion: the random thermal energy of the real world generating data in a thousand incompatible ways.
But the ratchet, the verification layer, only lets validated data through. And once it's through, it's permanent. Immutable. Part of the record.
## The Four Components
I think a data ratchet needs four things:
### 1. The Pawl: Schema as Contract

JSON Schema (or Avro, or Protobuf, whatever floats your boat) isn't just documentation. It's the pawl that prevents backward motion. Data either conforms or it doesn't. No partial credit.
Here's what a schema-as-pawl actually looks like:
```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "SensorReading",
  "type": "object",
  "required": ["device_id", "timestamp", "value", "unit"],
  "properties": {
    "device_id": {
      "type": "string",
      "pattern": "^[A-Z]{2}-[0-9]{6}$"
    },
    "timestamp": {
      "type": "string",
      "format": "date-time"
    },
    "value": {
      "type": "number",
      "minimum": -273.15
    },
    "unit": {
      "type": "string",
      "enum": ["celsius", "fahrenheit", "kelvin"]
    }
  },
  "additionalProperties": false
}
```
Notice `"additionalProperties": false`. That's the pawl. You can't sneak extra fields through. You can't send `"value": "hot"` instead of a number. You can't omit the timestamp and promise to fill it in later.
But here's where most systems fail: they treat schema validation as a warning, not a wall. "Schema violation detected, logging and continuing." That's not a ratchet. That's a turnstile with a broken lock.
A real data ratchet rejects non-conforming data. Full stop. The data can go back to the source, get transformed, get remediated, whatever it needs to do. But it doesn't get through until it conforms.
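Here's a minimal sketch of that distinction in Python, using the `jsonschema` library against the schema above (the file path and function name are mine, for illustration):

```python
import json

import jsonschema

# Load the SensorReading schema shown above (path is illustrative).
with open("schemas/sensor_reading.json") as f:
    SENSOR_SCHEMA = json.load(f)

def admit(record: dict) -> dict:
    """Ratchet behavior: raise on any violation; never log-and-continue."""
    jsonschema.validate(instance=record, schema=SENSOR_SCHEMA)  # raises ValidationError
    return record  # only conforming records come back out

# The "turnstile with a broken lock" version, by contrast, swallows the error:
#
#   try:
#       jsonschema.validate(instance=record, schema=SENSOR_SCHEMA)
#   except jsonschema.ValidationError as exc:
#       log.warning("schema violation: %s", exc.message)  # ...and continues anyway
```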
### 2. The Wheel: Idempotent Checkpoints

In multiclaude, git worktrees give each agent isolation. If an agent's work fails, it fails in its own branch. The main branch (the ratcheted progress) stays untouched.
Data pipelines need the same thing: checkpoints that are idempotent and isolated. If a transformation fails, you can retry from the last checkpoint without corrupting the verified data downstream.
```python
class CheckpointedPipeline:
    def __init__(self, checkpoint_store: str):
        self.checkpoint_store = checkpoint_store

    def process_batch(self, batch_id: str, records: list[dict]) -> str:
        # Check if we already processed this batch
        checkpoint = self.load_checkpoint(batch_id)
        if checkpoint and checkpoint["status"] == "completed":
            return checkpoint["output_path"]  # Idempotent: return existing result

        # Process in isolation (write to temp location)
        temp_path = f"{self.checkpoint_store}/pending/{batch_id}"
        validated = []
        for record in records:
            if self.validate(record):
                validated.append(record)
            else:
                self.quarantine(record, batch_id)  # Don't lose it, just don't let it through
        self.write_records(temp_path, validated)

        # Only after success: commit the checkpoint
        final_path = f"{self.checkpoint_store}/verified/{batch_id}"
        self.atomic_move(temp_path, final_path)
        self.save_checkpoint(batch_id, {"status": "completed", "output_path": final_path})
        return final_path
```
The key moves: write to a temp location first, only move to the verified path after success, and the checkpoint makes retries safe. If the process dies mid-batch, we start over. No partial state leaking into the verified dataset.
Most pipelines I've seen treat state as something that happens to them rather than something they manage. They're stateless in theory and stateful in practice, which is the worst of both worlds.
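If you're wondering what `atomic_move` and `save_checkpoint` would actually do, here's one way they might look: a sketch assuming a local filesystem store and a write-then-rename pattern (neither implementation comes from multiclaude or any particular tool):

```python
import json
import os
import tempfile

def atomic_move(temp_path: str, final_path: str) -> None:
    """Publish a batch by rename. os.replace is atomic on a single filesystem,
    so readers see either the old state or the new one, never a partial write."""
    os.makedirs(os.path.dirname(final_path), exist_ok=True)
    os.replace(temp_path, final_path)

def save_checkpoint(checkpoint_dir: str, batch_id: str, state: dict) -> None:
    """Write the checkpoint to a temp file, then rename it into place, so a
    crash mid-write can never leave a half-written 'completed' marker behind."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=checkpoint_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, os.path.join(checkpoint_dir, f"{batch_id}.json"))
```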
### 3. The Arbiter: Automated Verification with Teeth

Here's the multiclaude rule that matters: agents are forbidden from weakening CI to make their work pass.
Translate that to data: no one can weaken the validation rules to make bad data pass. Not the data team, not the business stakeholder with a deadline, not the executive who needs the dashboard updated yesterday.
What does "CI for data" actually look like? Something like this:
```yaml
# data-ci.yaml
name: Data Quality Gate

on:
  data_ingestion:
    sources: ["sensor-feed", "partner-api", "user-uploads"]

jobs:
  validate:
    steps:
      - name: Schema Validation
        run: |
          jsonschema --instance ${{ inputs.data_path }} \
            --schema schemas/${{ inputs.source }}.json
        fail_on_error: true  # This is the ratchet. No exceptions.

      - name: Semantic Checks
        run: |
          python checks/semantic_validator.py \
            --data ${{ inputs.data_path }} \
            --rules rules/${{ inputs.source }}.yaml
        # Example rules:
        #   - timestamp must be within last 24 hours
        #   - device_id must exist in device registry
        #   - value must be within 3 std devs of rolling mean

      - name: Lineage Recording
        if: success()
        run: |
          record-lineage \
            --input ${{ inputs.data_path }} \
            --schema-version ${{ inputs.schema_hash }} \
            --validator-version ${{ github.sha }} \
            --output verified/${{ inputs.batch_id }}

  on_failure:
    steps:
      - name: Quarantine Bad Data
        run: |
          move-to-quarantine ${{ inputs.data_path }} \
            --reason "${{ job.failure_reason }}"

      - name: Alert Source System
        run: |
          notify-upstream ${{ inputs.source }} \
            --batch ${{ inputs.batch_id }} \
            --errors ${{ job.validation_errors }}
```
The critical bit is `fail_on_error: true` with no escape hatch. No `continue-on-error`. No "warn and proceed." The data either passes or it goes to quarantine.
This is culturally difficult. It requires the same organizational commitment that "we don't ship if tests fail" required for software teams. But it's the only way the ratchet works.
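To make that concrete, the gate step can be nothing more than a script whose only two outcomes are exit 0 (verified) or exit 1 (quarantined), with no flag that downgrades failures to warnings. A sketch, with the file layout and quarantine convention as my own assumptions:

```python
#!/usr/bin/env python3
"""Hypothetical data quality gate: pass or quarantine, nothing in between."""
import json
import shutil
import sys
from pathlib import Path

import jsonschema

def main(data_path: str, schema_path: str, quarantine_dir: str) -> int:
    schema = json.loads(Path(schema_path).read_text())
    records = json.loads(Path(data_path).read_text())  # assumes a JSON array of records
    errors = []
    for i, record in enumerate(records):
        try:
            jsonschema.validate(instance=record, schema=schema)
        except jsonschema.ValidationError as exc:
            errors.append({"record": i, "error": exc.message})
    if errors:
        # The only way out is the quarantine directory, never the verified store.
        dest = Path(quarantine_dir) / Path(data_path).name
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(data_path, str(dest))
        Path(f"{dest}.errors.json").write_text(json.dumps(errors, indent=2))
        return 1  # non-zero exit fails the pipeline run, full stop
    return 0

if __name__ == "__main__":
    sys.exit(main(*sys.argv[1:4]))
```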
### 4. Reproducibility: The Secret Ingredient

There's one more piece that makes the code ratchet work: reproducibility. When CI fails, you can reproduce the failure. When it passes, you can reproduce the pass. Same inputs, same outputs, every time.
Data systems are notoriously bad at this. The pipeline that worked yesterday fails today because someone changed an upstream schema. Or because the source system had a hiccup. Or because Mercury is in retrograde. (I've debugged all three. The Mercury one was actually a timezone issue in a system named "Mercury." I wish I was kidding.)
A real data ratchet needs what I'd call a "usability signature":
```json
{
  "batch_id": "2026-01-22-sensor-feed-042",
  "verified_at": "2026-01-22T14:32:01Z",
  "input_hash": "sha256:a1b2c3d4...",
  "schema": {
    "name": "SensorReading",
    "version": "2.1.0",
    "hash": "sha256:e5f6g7h8..."
  },
  "validators": {
    "semantic_checks": "v1.4.2",
    "anomaly_detector": "v0.9.1"
  },
  "result": {
    "status": "passed",
    "records_in": 10482,
    "records_verified": 10479,
    "records_quarantined": 3
  },
  "output_path": "verified/2026-01-22/sensor-feed-042.parquet",
  "output_hash": "sha256:i9j0k1l2..."
}
```
This signature is an artifact, not just a log line. You can take this signature, grab the input data by its hash, run the exact versions of the validators, and you'll get the same result. If you can't do that, you don't have a ratchet. You have a coin flip.
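And here's roughly how you'd cash it in: hash the bytes, replay with the pinned validator versions, and compare. A sketch, where `rerun_pipeline` stands in for whatever mechanism re-executes the batch (that hook, and the function names, are assumptions):

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: str) -> str:
    """Content hash, so the signature pins exactly which bytes were verified."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()

def check_signature(signature_path: str, input_path: str, rerun_pipeline) -> bool:
    """Replay a verified batch and confirm the result is bit-for-bit identical.
    `rerun_pipeline(input_path, validators) -> output_path` re-executes the batch
    with the exact validator versions recorded in the signature."""
    sig = json.loads(Path(signature_path).read_text())
    if sha256_of(input_path) != sig["input_hash"]:
        raise ValueError("input bytes don't match the signature's input_hash")
    output_path = rerun_pipeline(input_path, sig["validators"])
    return sha256_of(output_path) == sig["output_hash"]
```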
## The Uncomfortable Implication
Here's what this means in practice: a lot of data that's currently flowing through your systems wouldn't make it through a real ratchet.
That's not a bug. That's the point.
The Brownian ratchet works because it's uncompromising. The pawl doesn't care that you really need this data for a quarterly review. It doesn't care that the source system "usually" sends valid records. It doesn't care about your deadline.
CI transformed software quality not by being smart, but by being stubborn. It created a culture where "works on my machine" stopped being an excuse because there was an objective arbiter that didn't care about your machine.
Data needs the same stubbornness. The same willingness to say "no" and mean it.
## What This Looks Like in Practice
I've been thinking about this in the context of what we're building at Expanso: intelligent data pipelines that can process data at the edge. The edge is where the Brownian motion is strongest. Sensors, devices, user inputs, all generating data in a thousand formats with a thousand failure modes.
The traditional answer is to centralize. Pull everything to a data lake, clean it up, validate it there. But that's expensive, slow, and loses context. By the time you've moved the data, you've lost the ability to remediate at the source.
What if the ratchet lived at the edge? Validation happens where data is generated. Non-conforming data gets rejected immediately, while there's still context to fix it. Only verified data propagates upstream.
That's the vision. Not a single central ratchet, but a distributed network of ratchets. Each one small and stubborn. Each one clicking forward, never back.
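For concreteness, a toy sketch of what one of those edge-side ratchets could look like (the class, callbacks, and wiring are all assumptions about how such a node might be built, not a description of Expanso or any specific product):

```python
import jsonschema

class EdgeRatchet:
    """Hypothetical edge-side gate: validate where the data is generated,
    reject immediately, and forward only verified records upstream."""

    def __init__(self, schema: dict, forward_upstream, reject_locally):
        self.schema = schema
        self.forward_upstream = forward_upstream  # e.g. publish to an upstream queue
        self.reject_locally = reject_locally      # e.g. local quarantine + alert at the source

    def ingest(self, record: dict) -> bool:
        try:
            jsonschema.validate(instance=record, schema=self.schema)
        except jsonschema.ValidationError as exc:
            # Rejected at the source, while there's still context to fix it.
            self.reject_locally(record, reason=exc.message)
            return False
        self.forward_upstream(record)  # only verified data propagates
        return True
```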
*Want to learn how intelligent data pipelines can reduce your AI costs? Check out Expanso. Or don't. Who am I to tell you what to do.*
NOTE: I'm currently writing a book on the real-world challenges of data preparation for machine learning, drawn from what I've seen in practice and focused on operations, compliance, and cost. I'd love to hear your thoughts!
Originally published at Distributed Thoughts.