DEV Community: Marcus

I Gave Claude Code 27 Rule Files Instead of One CLAUDE.md

Marcus — Sat, 25 Jul 2026 16:49:51 +0000

There is a moment when a CLAUDE.md tips over. That moment does not sit on any calendar. It shows in behavior: the file keeps getting longer, and the rules inside it keep getting followed less. Every new convention you write in dilutes the ones already there. That is exactly where the project behind this article stood, and the answer was not a better CLAUDE.md but its dissolution into individual rule files. Today there are 27 of them.

This article is the experience report: how Claude Code rules need to be structured so they hold up over months, what a single rule file must carry, and why the most important parts are not the rules themselves. They are the don'ts and the inventories.

The key points up front:

One file per convention — 27 rule files instead of one CLAUDE.md, each with a single topic.
Anatomy of a rule that holds: rule, reasoning, don'ts, inventory, cross-references.
Don'ts beat prescriptions — at least in the DI² project: a negative example is concrete and recognizable, a prescription competes against the training prior.
Inventories are the drift radar: the part of a rule that claims an actual state of the code — and therefore the first to stand out when code and documentation diverge.
paths: scoping lets large rules load only where they apply — 3 of the 27 files use it.
The honest downside: maintenance effort, rule conflicts, and a state in which the documentation deliberately runs ahead of the code.

Prerequisites. A project with Claude Code and a .claude/rules/ directory. The pattern transfers to any coding agent that loads convention files into its context — Cursor rules or comparable mechanisms in other tools work on the same principle. What loads into context when, and what that costs, is covered by the sibling article Skills vs. Rules in Claude Code — this article starts one level earlier: at the question of what the rule files themselves have to look like.

The Starting Point: One File That Kept Growing

The project behind the numbers is DI², an ETL generator built on Next.js and PostgreSQL, its code written almost entirely AI-assisted with Claude Code. In the beginning, all conventions lived where every Claude Code project collects them first: in the CLAUDE.md. That works as long as the file is short. It stops working once the file becomes a container in which database conventions, color tokens and commit rules all sit side by side.

The dilution is hard to measure but easy to feel. It also matches what research shows about long contexts: language models make measurably worse use of information sitting in the middle of long inputs ("Lost in the Middle", Liu et al. 2023). In my experience, a rule on line 40 of a long file gets followed less than the same rule in its own, topically named file. A second effect has less to do with the model than with the humans: in a 400-line file, even the author cannot find a rule again when they want to check whether it still holds. Why a single file does not scale structurally, and which loading mechanics sit behind that, belongs to the skills-vs-rules article; this one is about what comes after.

The consequence in the project: a .claude/rules/ directory holding 27 rule files, one per convention (as of July 22, 2026). The same count holds for the public edition of this rule structure, the di2-starter-kit on GitHub — you can verify it there, not counting the subfolder READMEs. There is a file for the table conventions, one for dialogs, one for loading states, one for the security model. The directory keeps growing with the project. In the very week this article was written, three new files arrived, for views, triggers and database policies. A rule system is finished when the project is finished, which is to say never.

Anatomy of a Rule That Holds

After several months with this system, a fixed structure has emerged. A rule file that holds consists of five parts:

# <Convention> (<project>)

> One-sentence summary: What does this rule enforce, and where does it apply?

## The Rule

Every <structure/component/procedure> <does exactly one thing, stated imperatively>.
One convention per file — do not mix topics.

## Reasoning

Why this rule exists: the concrete incident, bug or review finding
that triggered it. A rule without a reason gets weighed away in
trade-offs — the reason is part of the rule, not decoration.

## Don'ts

- ❌ `<concrete negative example from your own code>` — why it drifts.
- ❌ `<second negative example>` — what applies instead (with the target spelling).

## Inventory

| Usage site | File | Status |
|---|---|---|
| <Site A> | `components/site-a.tsx` | ✅ compliant |
| <Site B> | `components/site-b.tsx` | ⏳ retrofit open |

## Cross-References

- [neighbor-rule.md](neighbor-rule.md) — boundary: what is governed there, not here.

The rule itself (the first two or three sentences under The Rule) is the smallest part, and that is no accident. It states imperatively what applies. Everything beyond that belongs to the other four parts.

The reasoning is not a courtesy to the reader. An agent weighs trade-offs, and as observed in the DI² project, a rule without a reason loses that weighing more easily against a plausible counterargument from the concrete case. A rule with a reason gives the concrete case something specific to argue against. The difference shows in exactly the moments that matter, namely when the model considers an exception justified. How a convention and its reasoning come into being in the first place is described in the methodology article Deriving SQL Conventions with Claude Code.

The remaining three parts are the actual substance of this article. The don'ts and the inventory each get their own section below, and the cross-references almost explain themselves: they draw the boundary to the neighboring rule so that two files do not creep into governing the same topic. Every boundary violation that surfaces later gets recorded there as an explicit reference.
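The five-part anatomy is easy to check mechanically. The following is a minimal sketch, assuming the section headings are spelled as in the template above; `missingSections` is a hypothetical helper, not part of Claude Code or of the DI² project.

```typescript
// Hypothetical helper: reports which of the five parts a rule file lacks.
// The heading names follow the template above; adjust them to your own files.
const REQUIRED_SECTIONS = [
  "The Rule",
  "Reasoning",
  "Don'ts",
  "Inventory",
  "Cross-References",
] as const;

function missingSections(markdown: string): string[] {
  // Collect the text of every level-2 heading ("## Heading").
  const headings = markdown
    .split("\n")
    .map((line) => /^##\s+(.+?)\s*$/.exec(line))
    .filter((m): m is RegExpExecArray => m !== null)
    .map((m) => m[1]);

  // A section counts as present if a heading starts with its name, so
  // "Inventory + Retrofit Backlog" satisfies "Inventory".
  return REQUIRED_SECTIONS.filter(
    (name) => !headings.some((h) => h.indexOf(name) === 0),
  );
}

const draft = [
  "# Dialog icons (example)",
  "## The Rule",
  "Every dialog carries a leading icon.",
  "## Reasoning",
  "Review finding: dialogs without icons were hard to tell apart.",
].join("\n");

console.log(missingSections(draft));
// → [ "Don'ts", "Inventory", "Cross-References" ]
```

A check like this only proves that the sections exist. Whether the don'ts come from real finds and whether the inventory is current still needs a reader.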

Why Don'ts Beat Prescriptions

This is the central observation from several months of rule maintenance in the DI² project. Whether it transfers to other projects has not been examined — it is project experience, not a study. A prescription says what should be. A negative example shows what must not be — with a concrete, recognizable spelling. The difference looks small at first glance. In practice it is not:

<!-- Before: a prescription without an anchor -->

Use only the project tokens for font sizes.

<!-- After: negative examples with recognition value -->

**Don'ts:**

- ❌ `text-[12px]` — raw pixel value, drifts; snap target is `text-token-meta`.
- ❌ `text-sm` — framework default instead of the project scale; in the app
  scope the linter guard flags it as an error.
- ❌ Inline `line-height` override via `[line-height:Xpx]` — the scale ships
  its own line height; a deliberate override needs a code comment with
  a justification.

The prescription "use only the project tokens" is factually correct, and yet it accomplishes little. At every generation it competes against the model's training prior, in which text-sm is the statistically most common way to write small text. "Only project tokens" first has to be translated onto the concrete case, and the rule gets lost in that translation.

The negative example skips the translation. text-[12px] is precisely the string the model is about to write, and it sits verbatim in the rule, marked with a ❌ and the reason. A don't leaves no room for interpretation. It additionally names the snap target, the spelling that applies instead. Whoever reads the don't, whether human or model, afterwards knows both things: what is wrong and what belongs in its place. There is no don't-specific magic behind this, only a familiar effect from prompt research: concrete examples are easier for language models to act on than abstract prescriptions. Negative examples are simply the form in which a rule file can harness that effect.

The best don'ts do not come from the rule author's imagination but from real finds. Every time a review or a bug surfaces a new bypass variant, exactly that variant goes into the rule file as a don't. The rule system thereby learns the same mistakes the code has already made once. There was no shortage of material: the sibling article AI-Assisted Coding Gave Me 799 Hardcoded Font Sizes documents the drift finding from which the font-size don'ts emerged.

Inventories as a Drift Radar

A rule states a target state, and target states have an inconvenient property: they cannot become wrong. "Every dialog carries a leading icon" stays correct as a sentence even when six dialogs without icons have long been sitting in the code. The rule notices none of it.

An inventory changes that. It lists the concrete usage sites of the convention together with their actual state:

## Inventory + Retrofit Backlog

As of now, this rule is the documented truth. Existing sites without
<convention> are brought in line as a tracked follow-up step (retrofit) —
until then the code deliberately lags behind the documentation.

| Dialog | File | Status |
|---|---|---|
| Editor (domain object A) | `components/object-a-editor-dialog.tsx` | ✅ icon leads left |
| Inspector (domain object B) | `components/object-b-inspector.tsx` | ✅ icon leads left |
| Invite user | `components/user-invite-dialog.tsx` | ⏳ retrofit open |
| Bulk delete | `components/bulk-delete-dialogs.tsx` | ⏳ retrofit open (destructive tone) |

The table is the part of a rule file that can fail against the actual state of the code. A reasoning section or a cross-reference can go stale too, but only the inventory makes a checkable claim about what the code looks like right now. If a new dialog arrives and is missing from the table, the inventory is incomplete. If a listed dialog gets rebuilt and its status is not updated, the inventory is stale. Exactly this vulnerability is what makes it valuable: a rule without an inventory can drift away from the code unnoticed for years, while the lag of a rule with an inventory becomes visible at the next reconciliation at the latest. The inventory is the rule's drift radar.

In the project, 9 of the 27 rule files carry such sections. They come in two flavors. The caller inventory lists who uses a component or convention — it answers the question "if I change this, what is affected?" before anyone has to search. The retrofit backlog lists which existing sites do not yet satisfy the convention. Both forms share the mechanics but differ in the direction of view: one looks at the rule's users, the other at its open debts.

The inventory is not maintained in a separate documentation session but in the same commit as the code change. Whoever adds a dialog adds it to the table. Whoever completes a retrofit sets the status to ✅. That sounds like a high demand on discipline. In an agent workflow, though, it is the cheapest possible moment, because the agent usually has the rule file in context anyway when it works inside the rule's scope.
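The reconciliation between table and code can be partly automated. Here is a minimal sketch, assuming the inventory uses the three-column table layout shown above with the file path in backticks. `parseInventory` and `inventoryDrift` are hypothetical names, and how the list of existing files is collected (a glob over the components directory, for instance) is deliberately left out.

```typescript
interface InventoryRow {
  site: string;
  file: string;
  status: string;
}

// Reads the rows of a three-column inventory table out of a rule file.
function parseInventory(markdown: string): InventoryRow[] {
  return markdown
    .split("\n")
    .filter((line) => line.trim().indexOf("|") === 0)
    .map((line) =>
      line.trim().replace(/^\||\|$/g, "").split("|").map((cell) => cell.trim()),
    )
    // Keeps only data rows: the header and the |---| separator carry no
    // backticked path in the second column.
    .filter((cells) => cells.length === 3 && /^`.+`$/.test(cells[1]))
    .map(([site, file, status]) => ({ site, file: file.slice(1, -1), status }));
}

function inventoryDrift(markdown: string, actualFiles: string[]) {
  const listed = parseInventory(markdown).map((row) => row.file);
  return {
    // Exists in the code but not in the table: the inventory is incomplete.
    unlisted: actualFiles.filter((f) => listed.indexOf(f) === -1),
    // Listed in the table but gone from the code: the inventory is stale.
    missing: listed.filter((f) => actualFiles.indexOf(f) === -1),
  };
}

const ruleFile = [
  "| Dialog | File | Status |",
  "|---|---|---|",
  "| Invite user | `components/user-invite-dialog.tsx` | ⏳ retrofit open |",
  "| Bulk delete | `components/bulk-delete-dialogs.tsx` | ⏳ retrofit open |",
].join("\n");

const filesOnDisk = [
  "components/user-invite-dialog.tsx",
  "components/export-dialog.tsx", // hypothetical dialog added without a table entry
];

console.log(inventoryDrift(ruleFile, filesOnDisk));
```

The sketch covers completeness only. Whether a status cell still tells the truth is a judgment about the code, and that part stays with the commit author.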

Rules That Load Only Where They Apply

As the file count grows, the context-cost question returns. 27 files that all load all the time would just be a partitioned CLAUDE.md at the same cost. The lever against that is a paths: front matter that binds a rule to its path scope:

---
paths:
  - "src/app/api/**"
  - "src/lib/db*"
---

# Backend conventions

This rule loads only when the task touches files in its path scope —
API routes and the database access layer. A frontend task does not
pay its context costs.

In the project, 3 of the 27 files carry this front matter: the backend rules, the frontend rules and the security model. The selection follows a simple criterion. These three files are large, and their scope is a clearly bounded subtree of the project. A backend rule inside a pure frontend task is dead context. The remaining 24 files load unscoped because they are either small or apply across the whole project, like the commit conventions.
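To make the selection logic tangible, here is a simplified illustration of the idea behind path scoping. It is not how Claude Code implements it: the glob handling (`*` stays inside one path segment, `**` crosses directories) is an assumption, and the rule file names are invented for the example.

```typescript
// Illustration only: a simplified glob matcher, not Claude Code's implementation.
function globToRegExp(glob: string): RegExp {
  const source = glob
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters; * is handled below
    .replace(/\*\*/g, "\u0000")            // placeholder so ** survives the next step
    .replace(/\*/g, "[^/]*")               // * stays within one path segment
    .replace(/\u0000/g, ".*");             // ** may cross directory boundaries
  return new RegExp("^" + source + "$");
}

interface RuleFile {
  name: string;
  paths?: string[]; // absent means unscoped: the rule always loads
}

// Returns the names of the rules that apply to a task touching the given files.
function rulesFor(touchedFiles: string[], rules: RuleFile[]): string[] {
  return rules
    .filter((rule) => {
      if (!rule.paths) return true;
      const patterns = rule.paths.map(globToRegExp);
      return touchedFiles.some((file) => patterns.some((p) => p.test(file)));
    })
    .map((rule) => rule.name);
}

const rules: RuleFile[] = [
  { name: "backend.md", paths: ["src/app/api/**", "src/lib/db*"] }, // hypothetical file names
  { name: "commits.md" },
];

console.log(rulesFor(["src/components/button.tsx"], rules));
// → [ "commits.md" ]
console.log(rulesFor(["src/app/api/users/route.ts"], rules));
// → [ "backend.md", "commits.md" ]
```

The frontend task in the first call never sees the backend rule, which is the whole point of the front matter.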

This article deliberately goes no deeper here. The mechanics behind it — what Claude Code loads into context when, what separates rules from skills, and how the costs add up — are the subject of Skills vs. Rules in Claude Code. For the structural question of this article, the finding suffices: paths: scoping is the main reason 27 files do not mean 27-fold costs.

The Honest Downside

A rule system of this size is not free, and an experience report that hides that would be advertising.

Maintenance is real work. Every rule file wants attention at every convention change, inventories want updating in the same commit, and the cross-references between files go stale when a rule moves. The effort is not a one-time investment but a running cost. In the project it is the kind of work that pays off, but it does not disappear just because you approve of it.

Rules end up in conflict. With 27 files it happens that two rules govern the same case from different angles — the dialog rule wants an icon, the confirmation-dialog rule forbids one for its special case. The resolution is the same every time: the conflict gets written into both files as an explicit carve-out, with references to each other. Undecided conflicts are the worst thing that can happen to a rule system, because then the prioritization stays ambiguous — which rule prevails depends on the particular context and the task at hand.

The documentation runs ahead of the code — deliberately. When a new convention is decided, it applies to new code immediately. The existing sites are not rebuilt within the hour but tracked as a retrofit backlog in the inventory and brought in line step by step. That state is not a failure as long as it is documented. The rule file says honestly: this is the truth, and the code deliberately lags behind at these listed sites. Untracked, the same state would be a lie, because then the documentation claims an actual state that does not exist.

When a Rule Needs a Linter

Prose has a limit, and the sibling article AI-Assisted Coding Gave Me 799 Hardcoded Font Sizes measures it in detail: a documented rule raises the generation hit rate but does not guarantee it. At high volume, any residual rate turns into visible drift.

From that follows a division of labor that has proven itself in the project. Whatever is machine-checkable gets a guard in addition to the rule — in the font-size case a custom ESLint rule at error level that flags exactly the spellings from the don'ts. The rule file remains the source all the same: it explains the why, defines the mapping and lists the deliberate exceptions, while the linter enforces only the checkable subset.
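What such a guard can look like, as a sketch rather than the project's actual rule: the object below follows the general shape of a custom ESLint rule (`meta` plus a `create` function that returns node visitors). The small interfaces are local stand-ins for ESLint's types so that the example runs without dependencies, and the patterns and messages are assumptions derived from the don'ts shown earlier.

```typescript
// Minimal local stand-ins for ESLint's types, so the sketch has no dependencies.
interface LiteralNode { type: "Literal"; value: unknown }
interface ReportDescriptor { node: LiteralNode; message: string }
interface RuleContext { report(descriptor: ReportDescriptor): void }

// Spellings taken from the don'ts. The capture group holds the offending class.
const FORBIDDEN: { pattern: RegExp; hint: string }[] = [
  {
    pattern: /(?:^|[\s:])(text-\[[0-9.]+px\])/,
    hint: "raw pixel font size, use a project token such as text-token-meta",
  },
  {
    pattern: /(?:^|[\s:])(text-(?:xs|sm|base))(?=\s|$)/,
    hint: "framework default, use the project scale",
  },
];

const noRawFontSize = {
  meta: { type: "problem", schema: [] },
  create(context: RuleContext) {
    return {
      // Called for every literal; only string literals can hold class names.
      Literal(node: LiteralNode) {
        if (typeof node.value !== "string") return;
        for (const { pattern, hint } of FORBIDDEN) {
          const match = pattern.exec(node.value);
          if (match) context.report({ node, message: `${match[1]}: ${hint}` });
        }
      },
    };
  },
};

// Driving the visitor by hand with a fake context, the way a unit test would.
const messages: string[] = [];
const visitor = noRawFontSize.create({
  report: (descriptor) => { messages.push(descriptor.message); },
});
visitor.Literal({ type: "Literal", value: "flex text-[12px] md:text-sm" });
visitor.Literal({ type: "Literal", value: "text-token-meta" });
console.log(messages);
```

The sketch reports only the first hit per pattern and ignores template literals. Restricting the rule to the app scope would be a matter of the ESLint configuration, not of the rule itself.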

Whatever is not machine-checkable stays a pure prose rule and needs the inventory as its substitute radar. No linter can decide whether a dialog icon is the right one for the domain. Whether all dialogs have one is written in the inventory table. The rule of thumb from the project: a rule whose violation can be expressed as a search pattern is a linter candidate. A rule whose violation a human has to recognize needs an inventory.

What I Would Do Differently Today

Looking back, I would approach three things differently, and all three can be named concretely.

Too many files too early. The first weeks produced rule files for topics that did not even have a second usage site yet. A convention for a single case is not a convention but a note. Today a new rule file comes into being only when the same decision comes up for the second time. And cleanup is part of the deal: a rule whose scope has disappeared, or whose content a neighboring rule meanwhile carries, gets deleted or merged into that neighbor. A rule set that only ever grows becomes the very unwieldy file it was meant to replace.

Rules without inventories that went stale silently. The early files consisted of rule and reasoning, without a state-of-the-code part. After two months, some of them described a state the code had long left behind, and nobody had noticed. Only the inventory sections made the aging visible. In hindsight, every rule with usage sites should have carried one from day one.

Discovered paths: too late. The scoping arrived only when the context costs were already noticeable, and converting existing files to clean path scopes was more tedious than an early cut would have been. Anyone starting today should ask one short question at every new rule file: does this apply everywhere, or in one subtree? Answering that question costs ten seconds at creation time. Not asking it costs a refactor later.

FAQ

When should I split my CLAUDE.md?
At the latest when you are writing a new convention into it and catch yourself asking where it actually belongs. That is the signal that the file carries multiple topics. A second signal is the repeated ignoring of a rule that verifiably sits in the file: the rule is drowning in its surroundings.

How many rule files are too many?
The number itself is not the limit; the context costs are. 27 files work because the large ones load path-scoped and the unscoped ones are small. How those costs add up and where the line runs is covered by Skills vs. Rules in Claude Code.

Why do negative examples work better than prescriptions?
Because they close the translation gap. A prescription first has to be applied by the model onto the concrete case, while a don't already contains the wrong spelling verbatim — precisely the string the training prior would suggest, marked with the reason and the target spelling. Recognizing is more reliable than deriving.

What do I do when two rules contradict each other?
Write the conflict into both files, as an explicit carve-out with references to each other. Which rule wins in the overlap case must be stated in the files, not in the author's head. An undecided conflict otherwise gets decided by chance, depending on which rule happens to sit more prominently in context.

Does the code always have to match the documentation?
No, and that may be the most counterintuitive lesson. A new convention applies to new code immediately, while the existing code is brought in line step by step as a tracked retrofit backlog. The tracking is what matters: a documented lag is a deliberate decision; an undocumented one is documentation that lies.

Going deeper:

Skills vs. Rules in Claude Code — What Auto-Loads, What Loads on Demand — the loading mechanics and context costs behind this article.
Maximal Template Over Empty Repo — a Claude Code Setup That Prunes Itself via /init — how a rule inventory gets tailored at project start.
Setting Up a Claude Code Project with a Development Workflow and Database — the sub-pillar: the setup this rule system lives in.

Upstream:

AI-Assisted Coding Gave Me 799 Hardcoded Font Sizes — the drift finding that makes rules and guards necessary in the first place.
Deriving SQL Conventions with Claude Code — the Generate-Refine-Derive Loop — how a convention comes into being before you pour it into a rule file.
AI-Assisted SQL Development with Claude Code — Rules, Skills and Agents — the pillar: the enforcement system as a whole.

Starter kit:

The open DI² starter kit on GitHub — a project template built on this article's one-file-per-convention principle, ready to tailor.

AI-Assisted Coding Gave Me 799 Hardcoded Font Sizes

Marcus — Tue, 21 Jul 2026 23:42:38 +0000

It didn't start with an audit. It started with a nagging feeling: the interface looked restless. You don't notice it at first glance, but on the second and third look it is there — a timestamp slightly larger here than there, a dialog title a touch smaller than in the neighboring dialog. Everyone knows the discipline from letters and résumés: same typeface, same font size, same alignment. The same holds for an application interface, except the violation doesn't show up in any one spot. It shows up as a diffuse restlessness across many screens.

It was only that feeling that led to a counting command across the frontend, and the count delivered the explanation: 799 hits for raw pixel font sizes like text-[13px], spread across 25 distinct pixel values in 74 files. And that in a project which had a documented font-size scale with six tokens all along. The scale was used 263 times and bypassed roughly 1,180 times. This is not a sloppiness finding from some legacy codebase grown over a decade; it is the state of a codebase built AI-assisted with Claude Code from day one. This is what AI code drift looks like: every single suggestion is locally plausible, and what adds up is the restlessness you can see before you can measure it.

This article is the experience report — with the real numbers, the rule that ended the problem, and the honest admission that a convention living only in prose loses against a language model.

The key points up front:

The finding: 799 raw pixel font sizes against 263 token usages, even though the scale was documented. The ratio is the message, not the single number.
The biggest source of drift lies between the tokens: values like 11px and 12px, for which no token existed at all, form the majority at roughly 400 occurrences.
No code review catches this, because every diff is harmless on its own.
Better prompts raise the hit rate but don't eliminate the drift — at high generation volume, any residual rate becomes visible.
What holds: a 6-step scale, a fixed element-to-token mapping, an unambiguous snap rule for all in-between values, and a linter at error level.

Prerequisites: The example uses Tailwind CSS and ESLint in a Next.js project. The pattern applies to any codebase with design tokens, regardless of framework — and, as the end of the article shows, just as much to SQL conventions.

The Finding: 799 Against 263

The project behind the numbers is DI², an ETL generator built on Next.js and PostgreSQL whose code was written almost entirely AI-assisted. For font sizes, a clear convention existed: six named tokens from text-di-h1 (18px) down to text-di-label (10px), defined in the Tailwind configuration and described in a brand rule file that the agent loads for every frontend task. Everything that follows is the measurement of a single project: one data point, not proof. What makes the case interesting beyond the project is the mechanism behind it, and that mechanism, as we will see, is not project-specific.

The inventory on June 25, 2026 across all src/**/*.tsx files produced three categories:

| Declaration style | Occurrences | Files | Assessment |
|---|---|---|---|
| Raw pixels `text-[Xpx]` | 799 | 74 | drift, 25 distinct pixel values |
| Tailwind defaults `text-xs`/`text-sm`/`text-base` … | 383 | 85 | drift within the app scope |
| Canonical tokens `text-di-*` | 263 | 43 | the target pattern |

The three rows distinguish three ways of declaring the same font size. text-[13px] is Tailwind's arbitrary-value syntax: the pixel value sits literally inside square brackets and acts like an inline font-size — any value is possible, and that is how 25 different ones come into being. text-xs, text-sm and text-base, on the other hand, are named size steps, but they belong to the framework's bundled default scale (12, 14 and 16 pixels). That looks disciplined, because it follows a scale. It is just the wrong one: Tailwind's generic scale instead of the project's own, whose six tokens don't even contain those three values. The third row, finally, is the project's own scale — the target pattern. Read top to bottom, the table is a ladder: freehand value, foreign scale, own scale.
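The three declaration styles can be told apart with three patterns. The following sketch counts them in a piece of source text. The raw-pixel and token patterns mirror the class names used in this article, the list of Tailwind default steps is an illustrative subset, and `countStyles` is a hypothetical helper.

```typescript
type Style = "raw-pixel" | "tailwind-default" | "token";

// Counts font-size declarations per style in a piece of source text.
function countStyles(source: string): Record<Style, number> {
  const counts: Record<Style, number> = {
    "raw-pixel": 0,
    "tailwind-default": 0,
    token: 0,
  };
  const patterns: [Style, RegExp][] = [
    // Freehand value: any pixel size inside square brackets.
    ["raw-pixel", /text-\[[0-9.]+px\]/g],
    // Foreign scale: a subset of Tailwind's bundled size steps (assumption).
    ["tailwind-default", /\btext-(?:xs|sm|base|lg|xl)\b/g],
    // Own scale: the six project tokens.
    ["token", /text-di-(?:h1|h2|body|meta|micro|label)\b/g],
  ];
  for (const [style, pattern] of patterns) {
    counts[style] = (source.match(pattern) || []).length;
  }
  return counts;
}

const sample =
  '<p className="text-[13px]">a</p>' +
  '<span className="text-sm text-di-meta">b</span>' +
  '<i className="text-[11px]" />';

console.log(countStyles(sample));
// → { "raw-pixel": 2, "tailwind-default": 1, token: 1 }
```

Run per file and summed up, counts like these reproduce the three rows of the table for any codebase that follows the same naming.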

That two scales coexist is not an accident, by the way — it is the framework's default. Tailwind ships its complete size ladder with every project, and the project's own tokens were added via theme.extend — and extend means exactly that: extend, not replace. From that point on, every text-sm compiles just as happily as every text-di-body. There is no moment at which the project consents to the foreign scale, and none at which it announces itself. It is simply there, from day one, as the statistically closest choice for anyone who needs small text — human or model.

A second term needs an explanation, because it recurs throughout the article. The app scope is the application behind the login: the dashboard and administration pages plus the components they render. The token scale applies only there. Outside it lie two zones with rights of their own — the public pages (landing, legal, login), which deliberately carry larger formats, and the bundled UI base components, which internally work with Tailwind defaults. A text-sm is therefore not wrong per se; on a marketing page it is legitimate. That is why the second row says drift within the app scope: of the 383 occurrences, only the share inside the application counts as drift.

The convention, then, was ignored at a ratio of roughly 1:4.5: 263 token usages against 799 + 383 = 1,182 declarations that bypass the scale. To rerun the inventory in your own project, all you need is a search tool like ripgrep (the examples are written in PowerShell):

# Raw pixel font sizes: total number of occurrences
rg --no-filename -o 'text-\[[0-9.]+px\]' src | Measure-Object -Line

# Distribution: which pixel value occurs how often?
rg --no-filename -o 'text-\[[0-9.]+px\]' src | Group-Object | Sort-Object Count -Descending

# For comparison: the canonical tokens
rg --no-filename -o 'text-di-(h1|h2|body|meta|micro|label)' src | Measure-Object -Line

The timeline delivers a punchline of its own. Between the first count and the recount three weeks later, shortly before the cleanup migration started, the total grew from 799 to 813 occurrences and from 25 to 26 distinct values. The drift kept growing while its removal was already being planned. A convention that is not enforced doesn't lose once — it loses a little more every day.

Why Nobody Caught It

The first reflex at a number like this: how did that slip through? The answer is uncomfortable because it describes no negligence, but a structural gap.

A text-[12px] is not wrong in any single diff. It renders correctly, it looks fine in the preview, it breaks no test. A reviewer reading the diff of a new dialog checks the logic, the states, the accessibility. They do not compare whether the font size in this file is consistent with the one in 73 other files. The drift doesn't live in any one file. It lives between the files.

It was visible all the same — just not as a defect, but as the nagging feeling from the beginning. A restless interface shows the symptom, not the location: which of the 74 files do you point at when no single line is wrong on its own? So the impression stayed without consequence for a long time. It couldn't be pinned to any diff, and what can't be pinned to a diff ends up in no review comment and no ticket. Visual regression tests usually don't catch this form of inconsistency either, because they compare each view against its own baseline. Two views that are both brand new share no baseline against which the difference could stand out. Only the inventory turned the feeling into a finding with numbers — and thereby into something fixable.

Then there is the speed. A person who builds one dialog a day makes a handful of font-size decisions per week, and muscle memory keeps them reasonably stable. An agent that creates twenty components in the same week makes the same decision a hundred times over — and each one is optimized locally, not for consistency with all the previous ones. The inconsistency doesn't come from carelessness; it comes from the sheer volume of independent single decisions. The densest single file in the inventory accounted for 70 raw pixel sizes on its own.

Humans produce the same drift — honesty demands saying so. The difference is not the kind of mistake but the pace and the volume. What a team accumulates in two years of wild growth, AI-assisted development manages in a quarter.

The Drift Lives Between the Tokens

The most revealing part of the inventory is the distribution of the 25 pixel values. It splits into three classes, and the middle one is the interesting one:

Exact token matches: values that correspond to a token, just written raw — 13px instead of text-di-body (166 occurrences), 11.5px instead of text-di-meta (46), plus the remaining token values. This class is mechanically repairable and visually a no-op.
Off-token values: roughly 400 occurrences on values for which no token exists at all — led by 11px with 196 and 12px with 143 occurrences, plus 12.5px (42) and 14px (19).
Deliberate exceptions: marketing and legal pages with large formats of their own, such as 22px or 44px, which intentionally sit outside the app scale.

The second class deserves a second look. 11px and 12px both sit next to the same token, text-di-meta (11.5px), and both were later migrated to it. Together that is 339 places where two different pixel values played the same semantic role: meta text, timestamps, helper lines. Nobody ever decided that both values should exist. There is no commit with the message "we are introducing 11px as an alternative to 12px." Both values simply came into being, suggestion by suggestion, because each looked reasonable on its own.

That is what separates this drift from a copying error. Someone who mistypes a documented value produces a findable defect. Someone who puts plausible values into a gap of the scale a hundred times over produces a creeping second scale that was never decided anywhere and therefore never stands out anywhere. That is AI code drift in its purest form.

The "Better Prompts" Fallacy

The obvious reaction to the finding would have been to state the convention more forcefully. Add a chapter to the brand rule, repeat the tokens in the prompt, instruct the agent more insistently.

That doesn't carry far, and the reason lies in how the model works. A language model is trained on vast amounts of public code, among it countless Tailwind projects, and there text-sm or text-[12px] is the overwhelmingly most common way to write small text. A project-specific convention like text-di-meta, by contrast, is exactly one file in a rules directory. At every single generation the rule competes against that weight, and it wins often — but not always. Across a thousand decisions, a hit rate of 90 percent still leaves a hundred drift spots, and even a considerably higher rate ends up in the dozens at sufficient volume.

Rules in prose improve the rate, and rules with reasons improve it further. How a ruleset comes into being that an agent actually follows is described in the methodology article Deriving SQL Conventions with Claude Code. But any rate below 100 percent means drift at high volume. For a convention that is meant to hold without exception, the prompt is the wrong tool. It needs a check that doesn't get tired.
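The arithmetic behind that statement fits in one line. A small sketch, assuming every generation hits the convention with the same probability; the 97 percent figure is purely illustrative and not a measurement from the project.

```typescript
// Expected number of drift spots for a given volume and hit rate.
function expectedDriftSpots(decisions: number, hitRate: number): number {
  return Math.round(decisions * (1 - hitRate));
}

console.log(expectedDriftSpots(1000, 0.9));  // → 100
console.log(expectedDriftSpots(1000, 0.97)); // → 30 (illustrative rate)
console.log(expectedDriftSpots(1000, 1));    // → 0, the only rate without drift
```

The residual scales linearly with volume, which is why an agent producing many components per week makes even a small miss rate visible.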

What Works: Scale, Category Mapping, Snap Rule

The cleanup consisted of three building blocks that only work in combination.

First, the scale itself. Six steps, defined in exactly one place in the Tailwind configuration, each with its line height built in:

// tailwind.config.ts — the scale as the single source of truth
fontSize: {
  'di-h1':    ['18px',   { lineHeight: '23.4px' }],
  'di-h2':    ['15px',   { lineHeight: '21px' }],
  'di-body':  ['13px',   { lineHeight: '18.85px' }],
  'di-meta':  ['11.5px', { lineHeight: '15px' }],
  'di-micro': ['10.5px', { lineHeight: '13.65px' }],
  'di-label': ['10px',   { lineHeight: '12px', letterSpacing: '0.05em' }],
},

An important decision was to keep the scale at six steps. The tempting alternative would have been to create new tokens for 11px and 12px and thereby legalize the status quo. That would have turned the drift into an official eight-step scale, and the next in-between value would have found a gap again.

Second, the element-category mapping. A table in the rule file defines which kind of element carries which token. Section headings get di-h1, dialog titles di-h2, table cells and buttons di-body, timestamps and helper texts di-meta, counter pills di-micro, uppercase labels di-label. With that, "what size does this element need?" is no longer a matter of taste but a lookup.

Third, the snap rule. The name says it: roughly 400 occurrences sat on in-between values like 11px or 12px, and during the migration each of them had to snap onto one of the six tokens, like an object being pulled onto the grid in a graphics editor. Which token it becomes is decided in two stages. First comes the element's role from the category mapping: a timestamp gets di-meta because it is meta text, no matter which pixel value used to be there. Only when an occurrence cannot be assigned to any category does numeric proximity to the nearest token decide. Ties are settled explicitly: 14px sits exactly between di-body (13px) and di-h2 (15px), and di-body wins as the default. Role before number is not a formality, either. A 12px in a sticky table header belongs to di-micro (10.5px) by category, although di-meta (11.5px) would be numerically closer. And because the rule is meant to leave no room for judgment, two people, or two agent runs, should resolve the same raw value identically.
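
The two-stage decision can be written down as a small function. This is a minimal sketch, not the project's migration code: the category keys are illustrative names, the pixel sizes come from the scale above, and the 11px entry (equidistant from di-micro and di-meta) follows the snap table of the ESLint rule shown later in this article.

```typescript
type Token = 'di-h1' | 'di-h2' | 'di-body' | 'di-meta' | 'di-micro' | 'di-label';

// Pixel sizes of the six tokens, taken from the scale in the Tailwind configuration.
const SCALE: ReadonlyArray<readonly [Token, number]> = [
  ['di-h1', 18], ['di-h2', 15], ['di-body', 13],
  ['di-meta', 11.5], ['di-micro', 10.5], ['di-label', 10],
];

// Stage 1: the role decides. The keys are illustrative names, not the project's.
const CATEGORY_TOKEN = new Map<string, Token>([
  ['section-heading', 'di-h1'],
  ['dialog-title', 'di-h2'],
  ['table-cell', 'di-body'],
  ['timestamp', 'di-meta'],
  ['sticky-table-header', 'di-micro'],
  ['uppercase-label', 'di-label'],
]);

// Explicit tie-breaks for values that sit exactly between two tokens.
const TIE_DEFAULT = new Map<number, Token>([
  [14, 'di-body'], // between di-body (13) and di-h2 (15)
  [11, 'di-meta'], // between di-micro (10.5) and di-meta (11.5)
]);

function snap(px: number, category?: string): Token {
  const byRole = category === undefined ? undefined : CATEGORY_TOKEN.get(category);
  if (byRole !== undefined) return byRole;

  const tie = TIE_DEFAULT.get(px);
  if (tie !== undefined) return tie;

  // Stage 2: numeric proximity, used only when no category applies.
  let best: Token = 'di-body';
  let bestDistance = Infinity;
  for (const [token, size] of SCALE) {
    const distance = Math.abs(size - px);
    if (distance < bestDistance) {
      best = token;
      bestDistance = distance;
    }
  }
  return best;
}

console.log(snap(12, 'sticky-table-header')); // di-micro (role wins)
console.log(snap(12));                        // di-meta (nearest token)
console.log(snap(14));                        // di-body (explicit tie-break)
```

Because the function takes no other input than the raw value and the role, two runs over the same occurrence cannot disagree. The judgment moves into assigning the role, which stays a human or agent decision.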

With those three building blocks the migration itself was unspectacular: 80 files in the app scope, converted category by category, with shifts of less than a pixel. Exactly one deliberate exception remained — a large page title outside the app scale, marked with an inline comment and a justification.

One side finding from the same cleanup deserves a mention for completeness: the tokens had to be explicitly registered as font-size classes in the tailwind-merge configuration, because the library otherwise classifies unknown text-* classes as text color and silently discards one of them on conflict. That, however, is an ordinary trap when introducing custom tokens and has nothing to do with AI. Know it, pin it with a small test, done.

Even the Fresh Rule Drifted

The most instructive finding of the whole story comes from the migration's quality assurance, and it cuts against the author.

The category rule — "the element category wins over pixel proximity" — was written into the rule file in the same commit that carried the migration. And in that very commit it was undercut: the data cells of the densest table registers, previously 12px, belonged to di-body (13px) according to the category table. They were migrated numerically to di-meta (11.5px) instead, because that matched the density intent of those views. The freshly written rule and its first application contradicted each other, and it surfaced only in the downstream QA pass, which filed it as a documentation inconsistency.

You can consider the finding small, half a pixel in dense tables. Its value lies elsewhere: it shows that even a carefully worded, freshly written rule drifts at the moment of its application when only humans and prose carry it. Not out of ignorance, but because a second legitimate consideration intervened in the concrete case and nobody checked against the wording. Checking an application against the wording of a rule is one of the tasks a machine performs more reliably than any human participant.

Enforcement: A Linter at error

The fourth building block therefore makes the convention machine-checkable. The tool class is secondary: a compiler check, a Tailwind plugin or a CI script can play the same role. For this project, a custom ESLint rule was the most practical form — it reports every raw pixel font size and every Tailwind size default in the app scope as an error, not a warning. The core of the rule fits on a page:

// eslint-rules/no-raw-font-size.mjs — the core of the rule
const PX_RE      = /text-\[(\d+(?:\.\d+)?)px\]/g
const DEFAULT_RE = /(?<![\w-])text-(xs|sm|base)(?![\w-])/g

// Snap table: raw value -> canonical token (from the rule file)
const PX_SNAP = {
  '11':   'text-di-meta',
  '12':   'text-di-meta',
  '12.5': 'text-di-body',
  '13':   'text-di-body',
  '14':   'text-di-body',
  '15':   'text-di-h2',
}

export default {
  meta: { type: 'problem', messages: {
    rawPx: 'Raw pixel font size `{{match}}`. Use {{target}}.',
  } },
  create(context) {
    const check = (node, raw) => {
      for (const m of raw.matchAll(PX_RE)) {
        context.report({ node, messageId: 'rawPx', data: {
          match: m[0], target: PX_SNAP[m[1]] ?? 'a token from the scale',
        } })
      }
      // DEFAULT_RE analogous
    }
    return {
      Literal(node)         { if (typeof node.value === 'string') check(node, node.value) },
      TemplateElement(node) { check(node, node.value?.cooked ?? '') },
    }
  },
}

Three decisions have proven themselves:

The rule inspects all string literals and template parts, not just JSX attributes (the Literal and TemplateElement visitors at the end of the rule). In practice, class strings also arise in cn() arguments, in .join(" ") helpers and in exported constants — a rule that only sees className="…" would have blind spots there.
The error message names the snap target (the target field in the report data). Whoever sees the error sees the fix and doesn't have to go find the rule file first. That holds for human readers just as much as for the agent reacting to the linter error.
error, not warn. A warning is a number in a summary; an error breaks the build. Only the second one is enforcement. The halfway path — warnings plus occasional cleanup — ends up reproducing the very state that led to the drift.

One obvious alternative deserves a mention because it looks simpler than it is: remove Tailwind's default scale from the configuration altogether by defining fontSize without extend. Then text-sm simply would not exist anymore. That fails on two counts. An unknown text-sm produces no error in Tailwind. It produces no CSS at all, so the text would silently inherit whatever font size its parent has, and a silent failure like that is harder to find than the drift it is meant to prevent. And the bundled UI base components, like the marketing pages, build on exactly those default classes internally. The radical fix would break the very zones that legitimately live off the standard scale. A linter with a scope can express that; a global configuration cannot.

The deliberate exceptions therefore live not in the rule but in the ESLint configuration: a path list exempts marketing, legal and login pages, whose large formats intentionally sit outside the app scale. That records the convention's scope machine-readably in one place, congruent with the scope in the rule file. The direction of the definition is worth noting: the app scope itself is never listed anywhere — it is simply everything that was not exempted. A new public page that nobody adds to the exemption list is treated as app scope, and the guard flags its large formats as errors. That is the right direction to fail: a forgotten list entry shows up as a loud false alarm in the build instead of slipping through as silent drift.

// eslint.config.mjs — carve-out as a path list (abridged)
const FONT_SIZE_CARVE_OUT = [
  'src/app/page.tsx',        // landing: deliberately larger formats
  'src/app/impressum/**',
  'src/app/login/**',
  'src/**/*.test.{ts,tsx}',  // tests reference classes as test data
]

export default [
  { files: ['src/**/*.{ts,tsx}'], rules: { 'di2/no-raw-font-size': 'error' } },
  { files: FONT_SIZE_CARVE_OUT,   rules: { 'di2/no-raw-font-size': 'off' } },
]
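
The direction of the definition is easy to see when the scope check is written out as a predicate. A minimal sketch, assuming the globs from the list above are hand-translated into regular expressions (ESLint does its own glob matching; isGuarded and the pricing path are illustrative):

```typescript
// Exempted zones, hand-translated from the glob list in the ESLint configuration.
const CARVE_OUT: RegExp[] = [
  /^src\/app\/page\.tsx$/,
  /^src\/app\/impressum\//,
  /^src\/app\/login\//,
  /^src\/.*\.test\.tsx?$/,
];

// App scope is defined negatively: every source file that is not exempted.
function isGuarded(path: string): boolean {
  const isSource = /^src\/.*\.tsx?$/.test(path);
  return isSource && !CARVE_OUT.some((re) => re.test(path));
}

console.log(isGuarded('src/app/login/page.tsx'));   // false: exempted
console.log(isGuarded('src/app/pricing/page.tsx')); // true: nobody listed it, so it is guarded
```

A new public page lands on the guarded side by default. The cost of forgetting the list entry is a red build, not a quiet gap.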

One sequencing detail to close: the guard was activated as the last step, after the app scope had been migrated clean. The other way around, every unfinished file would have needed a temporary exemption list, and temporary exemption lists have a tendency to become permanent.

What the Linter Cannot Do

For the experience report to stay honest, the limits belong in it.

The guard checks the spelling, not the assignment. That no raw size stands where a token belongs, it enforces reliably. Whether an element carries the right token for its category, it cannot know. In that gap lived the rule contradiction from the QA finding, and there it can arise again. The category assignment remains a rule in prose, with all the weaknesses described, just on a much smaller attack surface.

Two limits are chosen deliberately. Larger Tailwind defaults from text-lg upward the rule does not flag, because a strikingly large text in an app file is a case for design review, not a candidate for mechanical snapping. And dynamically assembled class strings, such as values concatenated from variables, are only partially covered by a static rule. Both gaps are documented rather than concealed, because a check whose omissions nobody knows creates false confidence.
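
The second gap is easiest to see from the rule's point of view. A static rule only ever receives the fixed text parts of a template literal or a concatenation, never the interpolated value. A minimal sketch, reusing the rule's PX_RE on hand-written part lists (rawSizesInParts is an illustrative helper, not part of the rule):

```typescript
// Same pattern as PX_RE in the ESLint rule.
const PX_RE = /text-\[(\d+(?:\.\d+)?)px\]/g;

// Collects raw pixel sizes from the static string parts a linter can see.
function rawSizesInParts(staticParts: string[]): string[] {
  const hits: string[] = [];
  for (const part of staticParts) {
    hits.push(...(part.match(PX_RE) ?? []));
  }
  return hits;
}

// A plain literal: the raw size is visible.
console.log(rawSizesInParts(['flex text-[12px] gap-2'])); // [ 'text-[12px]' ]

// A template like `text-[${size}px]` reaches the rule as two fragments.
console.log(rawSizesInParts(['text-[', 'px]'])); // []
```

Neither fragment matches on its own, so the assembled class passes the guard unnoticed.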

And finally: the linter preserves the scale, it does not justify it. Whether six steps are the right six and which category deserves which token remains a design decision that precedes the tool.

The Same Pattern in SQL

The case is a frontend case on purpose, because that is where the drift was measurable. The pattern behind it is bound to no framework. Transferring it to SQL is therefore a reasoned analogy, not a second measurement — but one that follows from the same mechanism.

A SQL convention — singular table names, snake_case parameters, a fixed procedure skeleton — is the same class of rule as a font-size scale: a project-specific commitment competing against the statistically more common spelling from other codebases. An agent generating PL/pgSQL drifts there for the same reason as with the pixel values, and the answer has the same structure. The convention lives versioned in a rule file, with reasons and negative examples, the way the Postgres Table Conventions, the PL/pgSQL Procedure Conventions and the PL/pgSQL Function Conventions demonstrate. And what is machine-checkable gets checked by a tool: for SQL layout, say, sqlfluff with a project-specific configuration; for structural conventions, a script in the CI gate.

The division of labor is the same in both worlds. The prose rule explains the why and raises the hit rate. The tool closes the gap between a high hit rate and zero exceptions. The underlying pattern stays the same, whether in the frontend or in the database: local plausibility creates global inconsistency. How the overall interplay of rules, skills and agents is set up is described in the pillar article AI-Assisted SQL Development with Claude Code.

FAQ

How do I find consistency drift in my own project?
It often announces itself as an impression first: the interface feels restless without any single spot to point at. It becomes tangible through a counting inventory: one search pattern for the canonical spelling, one for the bypasses, and the two numbers set in relation. For design tokens those are expressions like text-\[[0-9.]+px\] against the token names. More revealing than the total is the distribution of values — clusters on values without a token show where the scale has a gap or where an unofficial second convention has formed.
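
A counting inventory of that kind fits in a few lines. The sketch below assumes the source text has already been read into a string; takeInventory and the Inventory shape are illustrative, the patterns follow the ones named above.

```typescript
// Canonical spelling: the six token names. Bypass: raw pixel values.
const TOKEN_RE = /text-di-(?:h1|h2|body|meta|micro|label)\b/g;
const BYPASS_RE = /text-\[([0-9.]+)px\]/g;

interface Inventory {
  canonical: number;
  bypasses: number;
  distribution: Map<string, number>; // raw px value -> number of occurrences
}

function takeInventory(source: string): Inventory {
  const canonical = (source.match(TOKEN_RE) ?? []).length;
  const distribution = new Map<string, number>();
  let bypasses = 0;

  BYPASS_RE.lastIndex = 0; // global regexes keep state between exec() calls
  let m: RegExpExecArray | null;
  while ((m = BYPASS_RE.exec(source)) !== null) {
    const value = m[1] ?? '';
    bypasses += 1;
    distribution.set(value, (distribution.get(value) ?? 0) + 1);
  }
  return { canonical, bypasses, distribution };
}
```

The two counts give the ratio; the distribution map shows which raw values cluster and therefore where an unofficial convention has formed.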

Isn't it enough to put the rule into CLAUDE.md?
A documented rule improves the hit rate but doesn't guarantee it — at high generation volume, any residual rate turns into measurable drift. Rules in prose and machine checks are therefore not alternatives but two halves: the rule explains the why, the linter enforces the what. Where rules live in Claude Code and what they cost in context is covered in Skills vs. Rules in Claude Code.

Why a custom ESLint rule instead of an off-the-shelf one?
Because the error message should name the project's own snap target, and the check has to cover class strings outside JSX attributes as well. Generic approaches like no-restricted-syntax can flag the pattern but cannot suggest a contextual fix — and it is exactly that fix which makes the error immediately actionable for humans and agents alike.

Does this only apply to CSS and design tokens?
No. AI code drift affects any project-specific convention that competes against a more widespread standard spelling — SQL naming, file structures, error-handling patterns all drift by the same mechanism. The countermeasure is the same everywhere: document with reasons, and pour the machine-checkable part into a tool at error level.

Doesn't a human team produce the same drift?
It does, and that is the honest core. The difference is pace: an agent makes as many individual decisions in a week as a team makes in months, and it makes each one optimized locally rather than for consistency with the previous ones. Conventions that held up passably at human speed break visibly under that volume — which also means AI drift merely makes an old problem visible faster.

Going deeper:

Deriving SQL Conventions with Claude Code — the Generate-Refine-Derive Loop — how a convention comes into being before you can enforce it.
AI-Assisted SQL Development with Claude Code — Rules, Skills and Agents — the pillar: the enforcement system as a whole.
Skills vs. Rules in Claude Code — where conventions live and what they cost in context.

Starter kit:

The open DI² starter kit on GitHub — a project template with the rules structure described here; the font-size guard from this article ships with it as a reusable template.

Design Pattern // The Architecture of an ETL Process — How to Isolate Bad Data Cleanly

Marcus — Mon, 20 Jul 2026 13:46:45 +0000

A single date string that cannot be parsed, and the entire ETL run aborts. The design pattern for ETL process architecture presented here prevents exactly that: bad data is isolated, not passed along.

TL;DR — what this article covers:

Work packages and schema layering E0 – L2 — how to decompose the ETL process into distinct, self-contained packages, each with its own database schema.
Technical vs. structural transformation — why separating type conversion and foreign-key resolution into two passes is safer and easier to debug than doing both in one.
Data quality at the schema boundaries — erroneous records are caught at the transitions; the main stream keeps flowing cleanly.
Historization as an optional layer — SCD 1 / SCD 2 pay off mostly when delta loads are involved.

Prerequisite. Basic familiarity with ETL processes. This is a conceptual article — not a step-by-step tutorial. Root of the article series: Data quality in an ETL process; the present article covers the architecture part.

Tasks of the ETL Process

An ETL process consists of the three general steps E = Extract, T = Transform and L = Load. What exactly has to happen within each of these top-level steps, however, is a matter of definition.

Extraction

In this step, data is extracted from various data sources. Sources can be databases, files, or APIs. The data to be extracted may be structured or unstructured and may come in different formats. This article series deals exclusively with structured data. Structured data sources include relational databases, but also CSV documents as well as XML and JSON documents — as long as their data elements follow a logical structure. Unstructured data such as text from social networks is out of scope here.

This generic definition leaves open what extraction concretely means. The following sections describe a concrete shape for the extraction process:

Materialization of the extracted data
Extended extraction tasks
No type conversion of the data

Materialization of Extracted Data

The design pattern presented here stores all extracted data in a database. The data must be stored in such a way that, in this step, there is no technical reason for the data not to fit into the database. The only acceptable cause for aborting extraction is an infrastructure issue (storage, network, etc.). I refer to this writing-to-database step as the materialization of the data.

Extraction and the materialization of extracted data give three main benefits:

The source system is read only once and as briefly as possible.
All extracted data is available in a database for subsequent processing.
After-the-fact error analysis on concrete records becomes possible.

Reading from the source system can put it under enough load that its performance and response times suffer. Extracting first minimizes the duration of that access.

Once all extracted data sits in a database, downstream steps can work on it using SQL. No additional ETL tool is required just to integrate heterogeneous source systems. This lowers technical hurdles and, in practice, makes the downstream processes substantially more performant — both in execution and in development.

Extended Extraction Tasks

For text files in XML and JSON format (and, depending on the case, CSV), materialization works a bit differently. XML and JSON documents are stored in the database before their contained attributes are extracted. As an extended extraction task, the attributes are then extracted from the stored documents using powerful T-SQL functions such as OPENXML or OPENJSON and written to the database.

No Type Conversion of the Data

Text-file deliveries are particularly problematic. The data they contain is not type-safe by any means. A date delivered as text may or may not be convertible into a date value. There has to be an agreement between the source-data-producing process and the ETL process about, for example, which date format is used (yyyy/dd/MM, dd.MM.yyyy, etc.). Converting values during extraction is a source of errors and risks aborting the entire ETL run. Converting data into the target data types is therefore not permitted during extraction.

Transformation

A common definition of the transformation step goes something like: "Transformation converts the extracted data into the desired format." Another definition lumps all the tasks under the term data integration. Both phrasings are vague and offer no concrete guidance.

Starting from the extracted data, the design pattern presented here defines two mandatory tasks and one optional task:

Type conversion of the extracted data
Data quality check
Optional: historization

When data arrives as text files, the extracted attributes must first be converted into the target data types. This also applies when data is extracted from databases whose data types diverge from those in the target system. We focus here on text files as the data source. As described above, values extracted from text files are first stored as text. The target system, however, expects strongly typed data. A date, for example, will routinely have to be converted into a value of type date.

The data quality check is, on top of that, a critically important task that fundamentally shapes the outcome of an ETL run. It starts with the question whether a delivered value can be converted into the data type of the corresponding target field. Where required, deliveries must also be checked for duplicates. There are many other useful and necessary checks and tasks that belong under the umbrella of data quality.

In the historization step, source data identified as changed (new, modified, or deleted records) is rolled forward in separate tables, so it is always reproducible when a record was inserted, modified, or deleted. This step is optional. A colleague once called the historized data the brain of the ETL process: the data of the downstream target system can be reconstructed from the historized data at any point. Of course, historization comes with additional maintenance tasks such as backups.

The term data integration is, in fact, closer to what we will call structural transformation. There, data from various sources is filtered, merged, and aggregated. Although that is also a transformation task, the design pattern presented here does not perform it during T of the ETL process — it happens during L. At this point, it pays to draw a sharp terminological line between the transformation tasks described in this section and structural transformation. The transformation tasks described here, performed during T, are referred to as technical transformation. The transformation tasks performed during L are referred to as structural transformation.

The boundaries between the three top-level ETL steps are fluid and, in the end, a matter of definition.

Typing of Extracted Data

If the data source is a database such as SQL Server or Oracle, the data will typically already be strongly typed. Even so, type conversion may still be necessary to match the data types of the target system.

Example. Consider the length of text fields or the storage of a date without a time zone. Application developers do not always pay close attention to input length limits. As a result, an address field in a source system might be able to hold entire novels. Users who notice such a gap will — empirically — happily use it to dump information that simply does not belong there. If the source system does not support storage of a date with time zone, the time zone of the source system must be determined and taken into account when converting to the target data type date with time zone.

When processing attributes extracted from a text file, typing the extracted values into the target data types is always required.

Example. During extraction, attributes are stored as values of type text. A text that looks like a date to us is not necessarily convertible into a date. For instance, 30-02-2023 is not a valid date. Another example: 03-05-2023 cannot be interpreted as a date without additional context about the data source. Read in American style (mm-dd-yyyy), it becomes 05-Mar-2023; read in the typical German style (dd-mm-yyyy), it becomes 03-May-2023. Correct interpretation requires knowledge of the date format — that is, the format string. Similar challenges arise for numeric values where decimal and thousands separators must be agreed on.
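
Both cases can be reproduced in a few lines. In the pattern itself the conversion runs in the staging database; TypeScript is used here only to keep the example self-contained, and parseDate and DateFormat are illustrative names. The sketch assumes two-digit day and month and a four-digit year, separated by hyphens.

```typescript
type DateFormat = 'dd-mm-yyyy' | 'mm-dd-yyyy';

// Returns an ISO date (yyyy-mm-dd), or null if the text is not a valid date
// under the agreed format. Never throws, so one bad value cannot abort a run.
function parseDate(text: string, format: DateFormat): string | null {
  const m = /^(\d{2})-(\d{2})-(\d{4})$/.exec(text);
  if (m === null) return null;

  const first = Number(m[1]);
  const second = Number(m[2]);
  const year = Number(m[3]);
  const day = format === 'dd-mm-yyyy' ? first : second;
  const month = format === 'dd-mm-yyyy' ? second : first;

  // Date.UTC silently rolls invalid dates over (30 Feb becomes 2 Mar),
  // so the result is accepted only if it survives the round trip.
  const d = new Date(Date.UTC(year, month - 1, day));
  const valid =
    d.getUTCFullYear() === year &&
    d.getUTCMonth() === month - 1 &&
    d.getUTCDate() === day;
  return valid ? d.toISOString().slice(0, 10) : null;
}

console.log(parseDate('30-02-2023', 'dd-mm-yyyy')); // null
console.log(parseDate('03-05-2023', 'mm-dd-yyyy')); // 2023-03-05
console.log(parseDate('03-05-2023', 'dd-mm-yyyy')); // 2023-05-03
```

The same text yields two different valid dates, which is why the format string has to be part of the delivery agreement rather than something the ETL process guesses.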

Data Quality Check

The data quality check inspects the extracted and converted data for completeness and correctness. These checks cover a wide field. Examples are:

Type conversion check
Duplicate identification
Spelling and orthography check on text values
Foreign key check
Mandatory field missing value check
Business logic validation

The article Data quality in an ETL process introduces the term technical data quality. The check on technical data quality operates on the typed data. For data quality checks on typed data, simple logical conditions can be set up that identify errors per value or per record. A logical condition is technically expressed as a WHERE clause in the ETL process and applied to the typed data. If a WHERE clause returns records, those records contain an error in the inspected field.
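
The idea of one logical condition per check can be sketched outside SQL as well. In the example below each rule plays the role of one WHERE clause: it selects the faulty records. The record shape, the two rules and all names are illustrative assumptions, and in the pattern itself the conditions run as SQL against the typed tables.

```typescript
// Shape of a typed record; the fields are illustrative.
interface TypedCustomer {
  id: string;
  name: string | null;
  birthDate: string | null; // ISO date (yyyy-mm-dd) after type conversion
}

// One rule = one logical condition that is true for faulty records.
interface QualityRule {
  field: keyof TypedCustomer;
  isError: (c: TypedCustomer) => boolean;
}

function qualityRules(today: string): QualityRule[] {
  return [
    // Mandatory field: a name must have been delivered.
    { field: 'name', isError: (c) => c.name === null || c.name.trim() === '' },
    // Business logic: a date of birth must not lie in the future.
    // ISO dates compare correctly as plain strings.
    { field: 'birthDate', isError: (c) => c.birthDate !== null && c.birthDate > today },
  ];
}

interface Finding { id: string; field: string }

function runChecks(records: TypedCustomer[], rules: QualityRule[]) {
  const findings: Finding[] = [];
  for (const record of records) {
    for (const rule of rules) {
      if (rule.isError(record)) findings.push({ id: record.id, field: rule.field });
    }
  }
  // Faulty records are isolated; only clean ones move on.
  const faulty = new Set(findings.map((f) => f.id));
  return { clean: records.filter((r) => !faulty.has(r.id)), findings };
}
```

The findings list names record and field, which is what makes after-the-fact error analysis on concrete records possible.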

Type Conversion Check

Whether the type conversion succeeds or fails has direct impact on all downstream tasks. If an input value cannot be converted into the target data type, the offending record may have to be excluded from further processing. The design pattern presented here checks for every delivered source record whether its input values can be converted into the respective target data types.

Duplicate Identification

Duplicate identification can be arbitrarily complex. In this article series, I limit myself to a combination of fields that, per the delivery contract, must follow a defined cardinality or must be unique (cardinality = 1).
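
For the uniqueness case, the check reduces to grouping by the agreed field combination and reporting every group with more than one member. A minimal sketch with illustrative names; in the staging database the same thing would be a GROUP BY with a HAVING condition.

```typescript
// Groups records by a key and returns every group that violates uniqueness.
function findDuplicates<T>(records: T[], keyOf: (record: T) => string): T[][] {
  const groups = new Map<string, T[]>();
  for (const record of records) {
    const key = keyOf(record);
    const group = groups.get(key);
    if (group === undefined) {
      groups.set(key, [record]);
    } else {
      group.push(record);
    }
  }
  const duplicates: T[][] = [];
  groups.forEach((group) => {
    if (group.length > 1) duplicates.push(group);
  });
  return duplicates;
}

// The key fields are hypothetical; JSON.stringify keeps the parts unambiguous.
const delivered = [
  { customerNo: '1001', country: 'DE' },
  { customerNo: '1001', country: 'DE' },
  { customerNo: '1002', country: 'DE' },
];
console.log(findDuplicates(delivered, (r) => JSON.stringify([r.customerNo, r.country])).length); // 1
```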

Spelling and Orthography Check on Text Values

Phone numbers, for instance, have many possible notations. The German DIN 5008 standard prescribes that the area code be written without parentheses and separated from the rest of the number by a single space. Notation checks on a value can be performed as part of the technical transformation.

Foreign Key Check

If the delivered data contains a foreign key relationship, only the structural validity of a delivered foreign key value is checked here — that is, format, presence where required, and data type. The actual foreign key resolution against the target system (mapping source-system code → target surrogate key) only happens later, as part of the structural transformation. The reason for the split: resolution needs context from the target system (such as a Countries table), whereas format and presence checks can be answered from the record alone.

Mandatory Field Missing Value Check

If an attribute is a mandatory field in the target system, the typed data must be inspected to ensure that a corresponding value was delivered.

Business Logic Validation

Checking business logic is itself a wide field that can become arbitrarily complex. Even checking simple business logic can substantially improve data quality. A simple example might be a customer's date of birth — which obviously must not lie in the future.

Loading

In the final step of the ETL process, the typed data is structurally transformed to match the data structures of the target system, optionally re-checked for data errors, filtered, aggregated, historized, and finally loaded into the target system. The tasks involved are:

Structural transformation
Data quality check
Filtering
Aggregation
Optional: historization
Loading the data into the target system

The previously technically transformed data can be loaded into different target systems. The target could be a CRM system or a data warehouse. The structural transformation task is specific to the chosen target system. That is why the structural transformation happens during L of the ETL process. Again: the boundaries between the top-level ETL steps are fluid, and it is a matter of definition which tasks fall into which step.

Structural Transformation

The structural transformation operates exclusively on the typed, quality-checked, and possibly historized data that was found to be error-free. Technically, the structural transformation corresponds to a SELECT statement joining historized tables and shaping the output to match the target system's data structures. Among other things, this step resolves foreign keys and lookup values:

Foreign key resolution
Lookup value resolution

The output of the structural transformation is — as with extraction and the technical transformation — materialized in the database, so this data, too, is available for analysis and error diagnosis. The data structures of the structurally transformed data largely correspond to the structures in the target system. In particular, the column names and data types of the output are chosen to match those expected by the target.

Foreign Key Resolution

If foreign keys cannot be determined from the extracted data alone, they must be looked up against the target system's data.

Example. Target systems often store countries in a separate table. The country United States is then identified both by its country name and — typically — by a technical key (for example, a GUID). When structurally transforming a customer whose source data identifies the country as the text United States, this text must be translated to the primary key of United States in the target system and stored as a foreign key with the customer record.

Foreign key resolution requires either direct read access to the Countries table in the target system or — if direct access is not available — that table must be read in advance and made available in the staging database. At that point, reading the Countries table is itself an extraction task.
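
The resolution step itself is a lookup with two outcomes. A minimal sketch, assuming the Countries table has already been read into a map from country name to key; the record shapes and the key values are made up for illustration, and in the pattern itself this is a join in SQL.

```typescript
interface CustomerRow { customerNo: string; countryName: string }
interface LoadableCustomer { customerNo: string; countryId: string }

// Translates the delivered country text into the target system's key.
// Records without a match are isolated instead of being passed along.
function resolveCountry(customers: CustomerRow[], countries: Map<string, string>) {
  const resolved: LoadableCustomer[] = [];
  const unresolved: CustomerRow[] = [];
  for (const customer of customers) {
    const countryId = countries.get(customer.countryName);
    if (countryId === undefined) {
      unresolved.push(customer);
    } else {
      resolved.push({ customerNo: customer.customerNo, countryId });
    }
  }
  return { resolved, unresolved };
}

// Hypothetical content of the Countries table (name -> technical key).
const countryKeys = new Map<string, string>([
  ['United States', 'c-001'],
  ['Germany', 'c-002'],
]);
const outcome = resolveCountry(
  [
    { customerNo: '1001', countryName: 'United States' },
    { customerNo: '1002', countryName: 'Untied States' },
  ],
  countryKeys,
);
console.log(outcome.resolved.length, outcome.unresolved.length); // 1 1
```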

Lookup Value Resolution

Source and target systems often use different codings for the value of a dropdown field. A dropdown field, for example, could be a list field for selecting a customer's salutation.

In the database, what is shown and selected in the application is rarely stored verbatim. A salutation of Mr. might be stored as the value 1 and Ms. as 2. The codings used in source and target systems typically differ.

These coded attributes are often not stored in separate tables. Translating the source-system code into the target-system code therefore requires explicit knowledge of the translation rules. Following terminology used in Microsoft Dynamics CRM, this translation is called lookup value resolution. To resolve lookup values, the codes used by source and target systems must be determined and stored in a mapping table that is consulted during the structural transformation.

Data Quality Check

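Experience from real projects shows that foreign key resolution and lookup value resolution are major sources of errors, typically rooted in incomplete or incorrect mappings of source-system codes to target-system codes. The structurally transformed data is therefore checked again, and records whose foreign keys or lookup values could not be resolved are held back from loading.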
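
An incomplete mapping is cheap to detect before it causes load errors: compare the codes that were actually delivered with the codes the mapping table knows. A minimal sketch with illustrative names and made-up codes; in the staging database this would be an anti-join against the mapping table.

```typescript
// Returns every delivered source code that has no entry in the mapping table.
function missingMappings(deliveredCodes: string[], mapping: Map<string, string>): string[] {
  const missing: string[] = [];
  for (const code of deliveredCodes) {
    if (!mapping.has(code) && missing.indexOf(code) === -1) {
      missing.push(code);
    }
  }
  return missing;
}

// Hypothetical salutation mapping: source code -> target code.
const salutationMapping = new Map<string, string>([
  ['1', 'MR'],
  ['2', 'MS'],
]);
console.log(missingMappings(['1', '2', '3', '3'], salutationMapping)); // [ '3' ]
```

An empty result does not prove the mapping is correct, only that it is complete for this delivery. A wrong target code still needs a review of the mapping table itself.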

Filtering

Unless the target system is being initially populated with data, only records with specific properties should be loaded into the target. Filtering for the records actually destined for loading should — where possible — already happen during the technical transformation. If that is not feasible there, filtering happens during the structural transformation.

Aggregation

Data may need to be aggregated before loading into the target system.

If end-to-end traceability of every processing step in the ETL pipeline is required, aggregation should be considered as a separate processing step downstream of the structural transformation. Aggregated data would then be stored in separate tables of the staging database.

Historization

As in the technical transformation, the structurally transformed and checked data can be rolled forward in separate tables. New records are inserted, changed records are updated, and deleted records are flagged as deleted.
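
The three cases can be sketched as one roll-forward function. This is a simplified illustration, not the pattern's implementation: it assumes a full delivery, where absence means deletion, and it overwrites changed records in place instead of keeping versions. All names are illustrative.

```typescript
interface HistoryRow { key: string; payload: string; deleted: boolean }
interface DeliveredRow { key: string; payload: string }

// Rolls the history forward with one full delivery.
function rollForward(history: HistoryRow[], delivery: DeliveredRow[]): HistoryRow[] {
  const delivered = new Map<string, string>();
  for (const row of delivery) delivered.set(row.key, row.payload);

  const next: HistoryRow[] = [];
  const known = new Set<string>();
  for (const row of history) {
    known.add(row.key);
    const payload = delivered.get(row.key);
    if (payload === undefined) {
      next.push({ ...row, deleted: true });                  // gone from the source: flag, do not remove
    } else {
      next.push({ key: row.key, payload, deleted: false });  // still there: take over the current state
    }
  }
  for (const row of delivery) {
    if (!known.has(row.key)) {
      known.add(row.key);
      next.push({ key: row.key, payload: row.payload, deleted: false }); // new record: insert
    }
  }
  return next;
}
```

With delta loads the deletion signal cannot be derived from absence and would have to come from the source system.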

Loading Data Into the Target System

The final loading of the changed data into the target system therefore operates on quality-assured, structurally transformed, and historized data. Only error-free records — those for which foreign keys and lookup values were successfully resolved — are loaded.

Technologically, this article focuses on loading change data into a target database. The target database is updated via SQL statements, that is, INSERTs, UPDATEs, and where applicable DELETEs. Other target systems — such as Microsoft Dynamics 365 — require the use of a proprietary API, both for writing data into and reading data from the target. In that case, an ETL tool such as SQL Server Integration Services is needed.

Architecture of the ETL Process

The architecture of the ETL process presented here is generic and can be used regardless of the kind of source data or target system, in data migration and data integration projects alike. It also fits the data-loading workflow of a data warehouse. The ETL process is decomposed into small, self-contained work packages. The tasks performed within a work package are sharply defined. During processing, data quality is checked at each step. After a work package finishes, only error-free data is handed over to the next package. At the end of the pipeline, quality-assured data sits in data structures similar to those of the target system and can be loaded there without further transformation.

Work Packages of the ETL Process

The following diagram illustrates the work packages of the ETL process presented here:

The top lane of the diagram shows the top-level steps from the ETL acronym: Extract, Transform, and Load. The bottom lane names the concrete work packages of the ETL process and maps each to one of the top-level steps. Each work package is paired with a database schema. The middle lane labels the database schemas used per work package (E0–L2). Data is handed from work package to work package, that is, from schema to schema, as processing progresses. The ETL process consists of the following work packages:

Data extraction
Technical transformation
Historization of the technically transformed data
Structural transformation
Historization of the structurally transformed data
Loading the data into the target system

The Technical Transformation and Structural Transformation work packages check the data quality of the transformed data and hand over only error-free data to the next package. In the diagram, these checks are indicated by the dark arrow heads. The sections below summarize the steps within each work package and provide an overview of the technology used to carry them out.

Extraction

The goal of extraction is to first store all data to be processed in the staging database. Within extraction, it matters whether the source data comes from a database, from documents with table-like structures (such as Excel or CSV), or from documents with complex logical structures (such as XML or JSON).

Extraction From a Database

When reading from a database or from table-like structures, the attributes / columns are first materialized into tables of schema E1. The structures of the tables in schema E1 closely match the structures in the source system. When extracting from a database, the data is stored using the data types from the source system. If the source-system data types are not supported by SQL Server, the data is stored in schema E1 as nvarchar.

Extraction From Documents With Table-Like Structures

Data from documents with table-like structures, such as Excel and CSV, cannot be delivered in a type-safe way. These documents are often hand-authored and hand-maintained, and the ETL process cannot assume that a column contains, say, a valid date. To make sure that all values from these documents can be materialized in the staging database in tables of schema E1, all data is first stored as nvarchar. Use generous maximum text lengths so that every delivered value actually fits.

Extraction From Documents With Complex Logical Structures

When XML or JSON documents are to be processed, the documents themselves are first stored in tables of schema E0. Extraction of the attributes then happens into tables of schema E1. The attribute extraction operates on the documents stored in schema E0 in the first step.

Attributes extracted from these documents are stored in schema E1 as nvarchar. Use generous maximum text lengths to ensure that the data fits.
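
The two-step flow for complex documents can be sketched in a few lines. The following Python snippet only illustrates the idea and is not the article's T-SQL approach with OPENJSON; the document layout, the names `e0_documents` and `e1_orders`, and the sample values are assumptions made for this example.

```python
import json

# Step 1 (schema E0): the documents themselves are stored unchanged.
e0_documents = [
    (1, '{"customer_id": 4711, "name": "Ada", "orders": [{"no": "A-1"}, {"no": "A-2"}]}'),
]

# Step 2 (schema E1): attributes are extracted from the stored documents.
e1_orders = []
for doc_id, body in e0_documents:
    doc = json.loads(body)
    for order in doc.get("orders", []):
        # Every attribute is kept as text, mirroring the nvarchar columns in E1.
        e1_orders.append((str(doc_id), str(doc["customer_id"]), str(order["no"])))
```

Because the raw document stays in E0, the attribute extraction can be repeated or extended later without reading the source again.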

Technology

Extraction of data from a database or from table-like structures can be done with Microsoft's SQL Server Integration Services (SSIS) or any other ETL tool. To extract XML or JSON documents, SSIS first loads them into tables of schema E0. The attribute extraction from the documents uses the powerful T-SQL functions OPENXML or OPENJSON.

Summary

This extraction approach has several advantages. Using an ETL tool such as SSIS — which supports a high degree of parallelism in data processing — materialization into schemas E0 and E1 can be done with high throughput. Upstream systems are minimally impacted, and the data is available for further processing — including attribute extraction from XML and JSON documents via OPENXML or OPENJSON — in the staging database. The materialized data also enables root-cause analysis when errors arise.

Technical Transformation

Within the top-level transformation step, this design pattern performs the technical transformation as described above. It consists of the following sub-steps:

Type conversion
Technical data quality check
Data error logging
Flagging of erroneous records
Hash value computation

Type Conversion

The output of the technical transformation is typed data that matches the target system's expectations. Typing can be driven by metadata via generic user-defined stored procedures and materializes the data into tables of schema T1.

Per attribute from schema E1, two columns are provided in schema T1. The first column holds the extracted value in the data type used in schema E1. The second column holds the typed value in the target data type — if the value can be converted. If the value cannot be converted, the second column stores NULL.

Technical Data Quality Check

After typing, the result is checked for type-conversion problems by comparing the two columns of each pair: a value that is present in the first column but NULL in the second could not be converted. Because the type conversion is purely technical, this check is also called the technical data quality check. The error check can already be extended here to cover simple business logic.
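
The column-pair idea can be sketched outside the database. The following Python snippet is a minimal illustration, not the article's metadata-driven T-SQL implementation; the helper names `try_date`, `to_pair` and `has_conversion_error`, the ISO date format and the sample values are assumptions made for this example.

```python
from datetime import date, datetime
from typing import Optional


def try_date(raw: Optional[str]) -> Optional[date]:
    """Return the parsed date, or None if the text is not a valid ISO date."""
    if raw is None:
        return None
    try:
        return datetime.strptime(raw.strip(), "%Y-%m-%d").date()
    except ValueError:
        return None


def to_pair(raw: Optional[str]) -> tuple:
    """Build the (extracted value, typed value) column pair for schema T1."""
    return raw, try_date(raw)


def has_conversion_error(pair: tuple) -> bool:
    """A value was delivered, but the typed column is empty: conversion failed."""
    raw, typed = pair
    return raw is not None and typed is None


pairs = [to_pair(v) for v in ["2026-07-16", "16.07.2026", None]]
errors = [has_conversion_error(p) for p in pairs]
```

Only the second value counts as an error in this sketch: it was delivered but could not be converted. A missing value is not a conversion error.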

Data Error Logging

Detected data errors are logged in a readable, queryable form in an error table.

Flagging of Erroneous Records

If a record contains at least one error, it is flagged as erroneous so it can be excluded from further processing. The flag lives in a column that stores the count of detected errors. Error-free records carry NULL in this column.

Hash Value Computation

The last sub-step of the technical transformation is computing and storing two hash values per record. The first hash represents the business-key columns of the record; the second hash represents all remaining columns. Through these two hashes, the next work package — Historization of Technically Transformed Data — can identify change records. Hash values are computed only for error-free records.
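
A minimal sketch of the two hashes, written in Python rather than T-SQL. SHA-256, the separator character, the NULL marker and the sample record are assumptions made for this illustration, not details taken from the article.

```python
import hashlib

SEP = "\x1f"  # unit separator keeps ("ab", "c") distinct from ("a", "bc")


def row_hash(values) -> str:
    """SHA-256 over the delimited text form of the values; NULL gets its own marker."""
    parts = ["\x00" if v is None else str(v) for v in values]
    return hashlib.sha256(SEP.join(parts).encode("utf-8")).hexdigest()


record = {"customer_id": 4711, "name": "Ada", "city": None}
key_columns = ["customer_id"]
attr_columns = [c for c in record if c not in key_columns]

key_hash = row_hash(record[c] for c in key_columns)    # identifies the record
attr_hash = row_hash(record[c] for c in attr_columns)  # changes when any attribute changes
```

The separator and the NULL marker matter: without them, different column contents could concatenate to the same text and produce the same hash.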

Technology

Conversion of extracted values into target data types, error checks, flagging of erroneous records, and hash value computation can all be implemented as generic stored procedures that build the appropriate dynamic SQL statements from metadata. This requires upfront investment in implementing those procedures. Once they exist, the tasks above reduce to simple procedure calls. In the long run, this reduces development effort and maximizes reuse.

Scope of the dynamic part. Dynamic SQL in the strict sense only appears in the data-quality check — one rule maps to one WHERE clause applied to the typed table at run time. Beyond that, the procedures listed above (type conversion, DQ check, flagging, hash-value computation) are metadata-generatable, because they follow the same structural pattern for every target table. This generation covers the corridor from extraction up to technical historization (schema T2). From schema L1 onward — the structural transformation — the JOIN statements are target-system-specific and are developed manually; so are the historization procedures for schema L2 (see the corresponding sections below).

How these generic checks are implemented in practice is shown in Checking Data Quality with SQL — a configurable framework that handles the tasks listed above through metadata-driven procedures.

Historization of Technically Transformed Data

Historization consists of the following sub-steps:

Historization
Identification of change records
Identification via hash values
Storing hash values
Promoting only error-free records

Historization

Historization means that delivered data is rolled forward in a database. In the data warehousing world, Slowly Changing Dimensions (commonly abbreviated as SCD) describes several types of historization that specify exactly how the rolling-forward works. Ralph Kimball originally described three types (SCD 1 through SCD 3); the Kimball Group later extended the typology to Types 0 through 7, including Type 0 (attributes that never change) and Type 7 (hybrid of surrogate and natural keys). Only two of these types are relevant here:

SCD 1 — strictly speaking, no real historization at all. A record loaded earlier is simply overwritten by its changed counterpart. Only the most recent state of each record is ever stored.
SCD 2 — every table that historizes data gets two extra columns ValidFrom and ValidTill, indicating the validity interval of the record. Currently valid records are open-ended, indicated for example by NULL in ValidTill. When a change record arrives for a currently valid record, the previously valid record's ValidTill is set to the date from which the change record becomes valid, and the change record itself is inserted with ValidTill = NULL.
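
The SCD 2 mechanics can be sketched with an in-memory SQLite database driven from Python. This is an illustration under assumptions, not the article's SQL Server implementation: the table `customer_hist`, the helper `apply_change` and the sample dates are made up for this example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE customer_hist ("
    " customer_id INTEGER, city TEXT, ValidFrom TEXT, ValidTill TEXT)"
)
con.execute("INSERT INTO customer_hist VALUES (4711, 'Berlin', '2026-01-01', NULL)")


def apply_change(con, customer_id, city, valid_from):
    """Close the currently valid version, then insert the open-ended change record."""
    con.execute(
        "UPDATE customer_hist SET ValidTill = ?"
        " WHERE customer_id = ? AND ValidTill IS NULL",
        (valid_from, customer_id),
    )
    con.execute(
        "INSERT INTO customer_hist VALUES (?, ?, ?, NULL)",
        (customer_id, city, valid_from),
    )


apply_change(con, 4711, "Hamburg", "2026-07-01")
history = con.execute(
    "SELECT city, ValidFrom, ValidTill FROM customer_hist ORDER BY ValidFrom"
).fetchall()
```

After the call, the Berlin version is closed at the date on which the Hamburg version becomes valid, and only the Hamburg version is open-ended.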

Historization is optional. With delta loads it can be helpful or even required, however. Suppose a customer places a new order. In a delta load, the order is delivered, but not the customer (who has not changed). Resolving the foreign-key relationship between order and customer cannot be done from the delivered data alone. To resolve it, either the customer data has to be extracted from the target system, or customers must be historized in the staging database so they are available on subsequent ETL runs.

In the context of the ETL process presented here, historization means that only error-free, changed records are historized. Historization can follow either SCD 1 or SCD 2.

Identification of Change Records

Historization requires being able to recognize change records in the source system and, subsequently, in the historized tables. Source systems often provide no information, or only unreliable information, about when a record was inserted, modified, or deleted. When a CSV file is generated from a hand-edited Excel document, for example, no reliable change information can be expected. Against that backdrop, this design pattern always derives change records from the data itself. The hash values computed during the Technical Transformation are used for this.

Identification via Hash Values

In the Technical Transformation section, hash values were computed for error-free records — one over the business-key columns, one over the remaining columns. Both can be used to identify change records. New, modified, and deleted records can be identified by comparing the hashes of the business key and the attributes between the tables holding the extracted data (schema T1) and the historized data (schema T2):

Hash (business key)	Hash (attributes)	Type of change
present in T1 and T2, equal	equal	no change
present in T1 and T2, equal	not equal	record was modified
only in T1 (extracted)	—	new record
only in T2 (historized)	—	record was deleted
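
The decision table above can be expressed as a small function. The sketch below uses plain Python dictionaries in place of the T1 and T2 tables; the function name `classify_changes` and the placeholder hash strings are assumptions made for this illustration.

```python
def classify_changes(t1: dict, t2: dict) -> dict:
    """Map each key hash to 'new', 'modified', 'deleted' or 'unchanged'.

    t1 and t2 map the business-key hash to the attribute hash.
    """
    result = {}
    for key_hash, attr_hash in t1.items():
        if key_hash not in t2:
            result[key_hash] = "new"
        elif t2[key_hash] != attr_hash:
            result[key_hash] = "modified"
        else:
            result[key_hash] = "unchanged"
    for key_hash in t2:
        if key_hash not in t1:
            result[key_hash] = "deleted"
    return result


t1 = {"k1": "a1", "k2": "a2-changed", "k3": "a3"}  # current delivery
t2 = {"k1": "a1", "k2": "a2", "k4": "a4"}          # historized state
changes = classify_changes(t1, t2)
```

In a database, the same comparison would typically be a join between the two tables on the business-key hash.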

Storing Hash Values

When a new record is inserted into the historized tables, updated, or flagged there as deleted, the hash values of the new, modified, or deleted record are stored or updated accordingly. This ensures that the hash values stored there always represent the status quo of the source systems and that change records can be identified via hash values at any later point (in subsequent ETL runs).

Promoting Only Error-Free Records

Promoting an erroneous record — and later loading it into the target system — could cause an error and potentially abort the entire ETL run. Therefore, only error-free change records from schema T1 are stored in schema T2.

Structural Transformation

The structural transformation consists of the following sub-steps:

Structural transformation and resolution of foreign-key relationships and lookup values
Structural data quality check
Data error logging
Flagging of erroneous records
Hash value computation

Structural Transformation and Resolution of Foreign Key Relationships and Lookup Values

The output of the structural transformation is data in table structures matching the target system. The structural transformation is implemented as SQL statements with the required JOINs in the FROM clause. Developing those statements requires solid knowledge of the data, the relationships among entities, and especially the foreign-key relationships among tables in the source system — or among the source systems being integrated.

Besides the actual structural transformation of source data, the structural transformation resolves foreign-key relationships for the target system and determines the codes to store for lookup values. The result is stored in tables of schema L1, whose structure, column names, and data types resemble those of the target system.

Structural Data Quality Check

After the structural transformation, the result is checked: could all foreign-key relationships and lookup values be resolved? If no foreign key or no lookup code can be determined for a record, the record counts as erroneous. Since this check concerns the outcome of the structural transformation, it is called the structural data quality check here.

Data Error Logging

Detected data errors are logged in a readable, queryable form in an error table.

Flagging of Erroneous Records

As in the technical transformation, a record with at least one detected error is flagged as erroneous so that it can be excluded from further processing.

Hash Value Computation

The last sub-step of the structural transformation is computing two hash values per record. The first hash represents the business-key columns of a structurally transformed record; the second hash represents all remaining attribute columns. Both hashes let the next work package — Historization of Structurally Transformed Data — identify change records.

Historization of Structurally Transformed Data

Historization of the structurally transformed data covers the same sub-steps as historization of the technically transformed data. It is an optional step, because — as long as the data from the technical transformation is historized — the structurally transformed data can always be reconstructed by running the structural transformation again.

Historization of the structurally transformed data takes the records from tables in schema L1 and stores them in tables of schema L2. The approach is identical to historizing data from schema T1 into schema T2. Only error-free change records are historized from L1 into L2. New, changed, and deleted records are additionally marked with a flag indicating that they still need to be loaded into the target system. If the data of schema L2 is historized as well, it must never be deleted and should be backed up by a maintenance process. This way, it is always possible to trace when which record changed.

The procedures required to historize data into schema L2 must be developed manually.

Loading

The transformed and quality-checked data in schema L2 can now be loaded into the target system using a technology of choice. The change records to be loaded are identified via a flag indicating whether the record has already been loaded. Records loaded successfully into the target system are flagged accordingly.

FAQ

What is the difference between technical and structural transformation?

The technical transformation works on each record in isolation: it converts input values into the target data types (text → date, decimal, …) and runs a first data-quality check at the value level, both without looking at other tables. The structural transformation, by contrast, needs context from the target system — resolving foreign keys, mapping lookup values — and therefore happens in its own work package after the technical transformation. Splitting them lets the two classes of errors be logged and fixed separately.

Why materialize every work package in the database?

Materialization, that is, writing each work package's output to a database table, decouples three things: the source system is read only once and is not touched again by downstream steps; every step becomes restartable without rerunning the entire ETL pipeline; and a traceable audit trail emerges for diagnosing errors on individual records. For the data volumes this pattern targets, the storage overhead is small compared with the robustness gained.

Do I really need all six persistence layers?

Not strictly. The full E0/E1/T1/T2/L1/L2 layering pays off mostly where audit trail, per-package restartability, and after-the-fact error analysis are hard requirements — typically in classical migration and CRM-integration projects with data volumes in the low to medium range (≤100M records per run). For large volumes or modern platforms such as Snowflake, Databricks, or BigQuery, some intermediate layers are often implemented as views rather than materialized tables — the architectural logic stays the same while storage and I/O overhead drop. Rule of thumb: T2 and L1 are the first candidates for virtualization, because they can always be reconstructed from T1 and T2 respectively.

When do I need historization (SCD)?

Historization pays off when the ETL process handles delta loads — that is, only the changes since the last run, not a full snapshot. In a delta load, the order record is delivered but not the related customer (if the customer has not changed); without historized customer data, the foreign-key relationship cannot be resolved. With full snapshot loads — where every run pulls the entire source system — SCD is optional and usually unnecessary.

Can this pattern be used with Postgres instead of SQL Server?

The core concepts of the pattern — work packages, schema layering E0–L2, data quality at the boundaries, hash-based SCD — are not tied to SQL Server and can be implemented in any relational database. The code presented here, however, is consistently T-SQL / SQL Server-centric (SSIS, OPENXML, OPENJSON, HASHBYTES, metadata-driven stored-procedure generation). The most important Postgres equivalents are: xmltable() for XML, jsonb_to_recordset() or JSON_TABLE (from Postgres 17) for JSON, and digest(…, 'sha256') from pgcrypto for hashes. Structural adaptation can go deeper than just renaming functions — for example, the E0/E1 split for XML/JSON can often be dropped in Postgres because xmltable() extracts directly from the source read.

How does this pattern relate to Data Vault 2.0?

There are clear similarities — business key, hash-based delta detection, layer separation into Raw / Cleansed / Business, auditability. Readers familiar with Data Vault will recognize E1/T2 as a "Raw + Hub/Satellite-equivalent" and L1/L2 as the "Business Vault". The pattern presented here is, however, lighter: no strict Hub/Link/Satellite separation, no mandatory insert-only history, no Raw-Vault-vs-Business-Vault architectural dogma. For classical migration and CRM-integration projects with audit requirements, this simplification is pragmatic; for pure data-warehouse loading with multi-source integration, Data Vault 2.0 is worth a look.

Data quality in an ETL process — root of the article series.
Checking Data Quality with SQL — a Configurable Framework — the implementation layer beneath this architecture: spotting bad data generically and classifying it by severity.
Design Pattern // Logging an ETL process with T-SQL — cluster sibling covering the logging layer.
Design Pattern // Safe Type Conversion with T-SQL — fn_try_convert_* UDFs for the technical transformation.
Data Quality // Fundamentals of Type Conversion with T-SQL — foundational article on TRY_CONVERT.
ETL vs. ELT — How to Tell Which Pattern You Actually Built — classifies the architecture presented here as persistent-staging ELT.

Checking Data Quality with SQL — a Configurable Framework for Spotting Bad Data Generically

Marcus — Thu, 16 Jul 2026 21:12:39 +0000

Bad data gives no warning. An age of 200 years, a duplicate customer number, a country code that doesn't exist — in the source system nobody notices. Only when the ETL run tries to push the rows into the strictly modelled target layer does the load break: on a CHECK, on a UNIQUE index, on a foreign key. Checking data quality with SQL means finding exactly those rows beforehand, classifying them by severity and sorting them out deliberately — without a special tool, with a handful of generic SQL routines.

The essentials up front:

Three generic check routines — a WHERE clause, a uniqueness check and a foreign-key check — cover a large share of typical data errors.
All three write into one shared error table: one row per violation, with the business key, the offending value and a plain-text message.
A severity (error / warning / information) drives a quality gate: only error-free rows flow on.
All in plain PL/pgSQL: the same basic principle that specialised data-quality tools use, here built by hand and free of dependencies.

Prerequisite: Postgres as the example engine and a central staging layer that is loaded raw first. The checks run against it set-based. That is the counter-design to the tool-centric package that mixes extraction, transformation and loading per table and scatters the quality check across the whole process (more on that in the architecture article of this series).

Data quality with SQL: the dimensions behind it

There is plenty to read about data quality and little to grab hold of. The literature has agreed for decades on what makes data good: Wang & Strong described fifteen dimensions in their 1996 paper "Beyond Accuracy", the DAMA UK working group singled out six of them in 2013 as core dimensions for practice (completeness, uniqueness, timeliness, validity, accuracy, consistency), and ISO/IEC 25012 defines data-quality characteristics in an international standard. What the literature rarely delivers is the how, and when it does, it is usually tied to a particular tool.

The concrete lever, though, is obvious: anyone who wants to check data quality with SQL formulates the check as a query and ends up with a table holding the bad records. Three routines cover a large share of practice, and each one cleanly maps to one of the established dimensions:

Check routine	What it finds	Dimension
WHERE clause on one table	values outside allowed ranges, missing mandatory values	Validity (+ Completeness)
Uniqueness / maximum occurrences	duplicates, over-frequent keys	Uniqueness
Foreign key against a reference	orphaned rows without a master	Consistency / Integrity

That covers four of the six dimensions; timeliness and accuracy — in the sense of matching the real world — lie outside their reach and need other means. The theory behind it — error classes, the criteria canon, and the full coverage map including the two blind spots — is deepened in the concept article Data Quality: Dimensions and Error Classes.

To put this in perspective: the approach is not new. Tools like Soda Core, dbt tests or Great Expectations follow the same approach at the core: formulate check logic, collect the hits, attach a severity. Whether that happens as generated SQL or against another engine (Pandas, Spark, a data warehouse) is an implementation detail. Around that, granted, they offer more, from monitoring through profiling to lineage. The value of the home-built version is not originality but transparency: every line is readable, nothing is bound to a product, and it runs everywhere you aren't allowed to install an extra tool.

The common denominator: one error table

The backbone is not the check but its result. All three routines write into the same table — one row per violation found, stored so that the source record can later be identified unambiguously:

CREATE TABLE dq.error
(
    id             bigint      NOT NULL GENERATED ALWAYS AS IDENTITY
   ,schema_name    text        NOT NULL
   ,table_name     text        NOT NULL
   ,id1_column     text
   ,id1_value      text
   ,id2_column     text
   ,id2_value      text
   ,id3_column     text
   ,id3_value      text
   ,error_column   text
   ,error_value    text
   ,severity       char(1)     NOT NULL
   ,message        text        NOT NULL
   ,created_on     timestamptz NOT NULL DEFAULT now()
   ,CONSTRAINT pk_error            PRIMARY KEY (id)
   ,CONSTRAINT ck_error_severity   CHECK (severity IN ('E', 'W', 'I'))
);

The column pairs id1_column/id1_value through id3_column/id3_value hold the business key of the affected record: id1_column holds the column name (say customer_id), id1_value the value (say 4711). That lets you reconstruct the bad row later — WHERE customer_id = 4711. Three pairs are enough for composite keys; in practice one almost always suffices. Alongside, each row records the checked column (error_column), the offending value (error_value), the message in plain text and the severity (severity: Error, Warning, Information).

The configuration: one rule per row

The check rules are not hard-coded but written into a table. One row = one rule. New checks can be added without a deployment, and the business side can read what is being checked:

CREATE TABLE dq.check_rule
(
    id              bigint  NOT NULL GENERATED ALWAYS AS IDENTITY
   ,check_type      text    NOT NULL   -- 'constraint' | 'unique' | 'lookup'
   ,schema_name     text    NOT NULL
   ,table_name      text    NOT NULL
   ,id1_column      text    NOT NULL   -- business key (up to three)
   ,id2_column      text
   ,id3_column      text
   ,check_column    text    NOT NULL   -- checked column
   ,where_clause    text               -- 'constraint': the "bad" predicate
   ,max_occurrence  int     NOT NULL DEFAULT 1   -- 'unique': allowed occurrences
   ,ref_schema      text               -- 'lookup': reference table
   ,ref_table       text
   ,ref_column      text
   ,severity        char(1) NOT NULL DEFAULT 'E'
   ,message         text    NOT NULL
   ,active          boolean NOT NULL DEFAULT true
   ,CONSTRAINT pk_check_rule  PRIMARY KEY (id)
);

check_type decides which of the three routines is built for the row. Depending on the type, different columns matter: where_clause for the WHERE check, max_occurrence for uniqueness, ref_schema/ref_table/ref_column for the foreign key. severity and message hang on every rule — so each finding carries its severity and its plain text straight from the configuration.

Routine 1: the WHERE clause

The simplest and at the same time most powerful check: a condition that describes bad rows, attached to a table. Everything the condition matches is a finding. For the rule "age must be between 0 and 120" the routine produces this statement:

  1: INSERT INTO dq.error (schema_name, table_name, id1_column, id1_value,
  2:                       error_column, error_value, severity, message)
  3: SELECT
  4:     'staging'
  5:    ,'customer'
  6:    ,'customer_id'
  7:    ,T01.customer_id::text
  8:    ,'age'
  9:    ,T01.age::text
 10:    ,'E'
 11:    ,'Age out of range 0..120'
 12: FROM
 13:    staging.customer T01
 14: WHERE
 15:    age < 0 OR age > 120;

The predicate on line 15 comes unchanged from where_clause. With that, this one routine covers a whole family: ranges (age < 0 OR age > 120), mandatory fields (email IS NULL), formats (length(zip) <> 5), plausibility (order_date > current_date). Two dimensions at once — validity and completeness. How to push this routine to its limits — and which NULL trap of three-valued logic it has to avoid — is covered in depth by the spoke Validating Data with SQL.
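
One instance of three-valued logic is easy to reproduce: a range predicate does not match a row whose value is NULL. The sketch below uses Python with an in-memory SQLite database as a stand-in for Postgres, whose NULL semantics are the same for this comparison. The table and sample rows are assumptions made for this illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (customer_id INTEGER, age INTEGER)")
con.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, 35), (2, 200), (3, None)],
)

# Rule as configured: a NULL age evaluates to UNKNOWN and is not reported.
range_only = con.execute(
    "SELECT customer_id FROM customer WHERE age < 0 OR age > 120 ORDER BY customer_id"
).fetchall()

# A rule that should also report a missing age has to say so explicitly.
with_null = con.execute(
    "SELECT customer_id FROM customer"
    " WHERE age IS NULL OR age < 0 OR age > 120 ORDER BY customer_id"
).fetchall()
```

Whether a missing age is an error is a business decision. The point is only that the predicate has to state it, for example as a separate mandatory-field rule.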

Routine 2: uniqueness and cardinality

Duplicate keys are the classic case. The check is a GROUP BY with HAVING on the count:

SELECT
   customer_id
FROM
   staging.customer
GROUP BY
   customer_id
HAVING
   count(*) > 1;

The real trick sits in the > 1: it comes from max_occurrence. Instead of checking for uniqueness, you check for a maximum count — > 1 for true uniqueness, > 3 for example when a key may appear at most three times. The routine then joins the keys it found back onto the table and logs every occurrence (not just the first), so that every affected row appears in the error table.
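
The join-back step can be sketched as follows, again using Python with an in-memory SQLite database as a stand-in for Postgres. The table layout and the sample rows are assumptions made for this example; in the framework, the threshold would come from max_occurrence in the rule table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (customer_id INTEGER, email TEXT)")
con.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "a@example.org"), (1, "b@example.org"), (2, "c@example.org")],
)

max_occurrence = 1  # would come from dq.check_rule.max_occurrence

# Find the over-frequent keys, then join back so every occurrence is reported.
findings = con.execute(
    """
    SELECT T01.customer_id, T01.email
    FROM customer T01
    JOIN (
        SELECT customer_id
        FROM customer
        GROUP BY customer_id
        HAVING count(*) > ?
    ) T02 ON T01.customer_id = T02.customer_id
    ORDER BY T01.email
    """,
    (max_occurrence,),
).fetchall()
```

Both rows with customer_id 1 are reported, not just one representative, which matches the two "not unique" findings for the same key in the demo output.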

What is checked here is exactly the business key that identifies the record — the same column that lands in the error table as id1_column. That is no coincidence: this check directly mirrors the UNIQUE constraint that the strict target layer carries on the business key. What must be unique there is caught up front in the source. How to push this routine — maximum cardinality, composite keys and the NULL trap in UNIQUE, where SQL Server and Postgres differ — is deepened by the spoke Finding Duplicates with SQL.

Routine 3: referential integrity

The third case is a foreign key that points into the void: a country_code for which there is no entry in the master-data table. Put generically, the check finds all rows of the child table that have no partner in the master table via LEFT JOIN:

  1: SELECT
  2:    T01.customer_id
  3: FROM
  4:    staging.customer T01
  5:    LEFT JOIN staging.country T02
  6:    ON
  7:      T01.country_code = T02.country_code
  8: WHERE
  9:        T01.country_code IS NOT NULL
 10:    AND T02.country_code IS NULL;

Child table, child column (check_column) and master (ref_schema/ref_table/ref_column) come from the configuration — so the same routine works for any master-child relationship. The IS NOT NULL condition on line 9 deliberately separates "unknown value" (an error) from "no value given" (that is the WHERE routine's job). How to push this routine — the three phrasings LEFT JOIN … IS NULL/NOT EXISTS/NOT IN, the notorious NOT IN-plus-NULL trap, and composite and self-referencing foreign keys — is deepened by the spoke Finding Orphaned Records with SQL.

The runner: dynamic SQL — done safely

The three statements above are hard-coded. They become generic when a function assembles them at runtime from the configuration. In PL/pgSQL, format() is the right tool, and it is also the point where you have to be careful. Identifiers are inserted with %I, literals with %L; Postgres quotes both correctly, which rules out SQL injection through table and column names. Here is the branch for the WHERE check:

  1: l_sql := format($sql$INSERT INTO dq.error
  2:                      (
  3:                          schema_name
  4:                         ,table_name
  5:                         ,id1_column
  6:                         ,id1_value
  7:                         ,error_column
  8:                         ,error_value
  9:                         ,severity
 10:                         ,message
 11:                      )
 12:                      SELECT
 13:                          %1$L
 14:                         ,%2$L
 15:                         ,%3$L
 16:                         ,T01.%4$I::text
 17:                         ,%5$L
 18:                         ,T01.%6$I::text
 19:                         ,%7$L
 20:                         ,%8$L
 21:                      FROM
 22:                         %9$I.%10$I T01
 23:                      WHERE
 24:                         %11$s
 25:                 $sql$
 26:    ,l_rule.schema_name
 27:    ,l_rule.table_name
 28:    ,l_rule.id1_column
 29:    ,l_rule.id1_column
 30:    ,l_rule.check_column
 31:    ,l_rule.check_column
 32:    ,l_rule.severity
 33:    ,l_rule.message
 34:    ,l_rule.schema_name
 35:    ,l_rule.table_name
 36:    ,l_rule.where_clause
 37: );
 38: EXECUTE l_sql;

The schema, table and column names (lines 16, 18, 22) go through %I, the fixed values through %L. The uniqueness and foreign-key branches are built on the same pattern; only the inner statement differs.

One spot stays deliberately raw: the predicate on line 24 is inserted with %s, i.e. as unchanged SQL. It has to be, because where_clause is a SQL expression, not a value. That makes the dq.check_rule table the trust boundary of the system: whoever may write there can have arbitrary SQL executed. In practice this is unproblematic as long as the configuration is maintained administratively and never filled from user input, but you have to be aware of it and restrict write access accordingly. Identifiers, by contrast, are watertight through %I: a column name like age"; DROP TABLE staging.customer; -- from the configuration becomes a (non-existent) quoted identifier and raises a clean error instead of dropping the table.
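
Why identifier quoting defuses such a column name can be shown with a simplified stand-in. The Python function below is not the Postgres implementation behind %I, which quotes only when necessary. It always quotes and doubles embedded double quotes, which is enough to show the effect. The function name `quote_ident` is borrowed from Postgres for readability; the rest is an assumption made for this sketch.

```python
def quote_ident(name: str) -> str:
    """Simplified stand-in for Postgres identifier quoting: always quote,
    and double every embedded double quote."""
    return '"' + name.replace('"', '""') + '"'


malicious = 'age"; DROP TABLE staging.customer; --'
quoted = quote_ident(malicious)
statement = f"SELECT T01.{quoted}::text FROM staging.customer T01"
```

The embedded quote is doubled, so the whole string stays inside one quoted identifier. The parser looks for a column with that odd name and fails with an unknown-column error; the DROP TABLE text is never parsed as a statement.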

The runner loops over all active rules of a table, executes the generated statement for each rule, and finally writes the severity counters back.

Take it with you: The complete framework — error table, rule table, runner, demo data and gate query — is available as a runnable SQL bundle for download (PostgreSQL 13+, verified against Postgres 16). An empty database is all you need; the script sets up everything itself.

A word on runtime, to be honest: most freely configured predicates run as a full-table scan over the source table, because no matching index exists for an arbitrary expression — across many rules on large tables that adds up. A simple email IS NULL or a key check may well use an index, but with freely configured rules you cannot rely on it. It is acceptable because the check runs in the staging window that is scheduled anyway, against the freshly loaded set, not against the production system. For very large tables it pays to limit the check to the partitions or batches loaded in the current run, instead of scanning everything every time.

Severity and the quality gate

Up to here, dq.error holds what is wrong. Steering the ETL process, however, needs a statement per record: may it proceed or not? For that the source table gets three counter columns — sys_error, sys_warning, sys_info — and the runner fills them after each run: per business key the number of findings by severity.

The write-back here runs over the single-column business key (id1) — by far the most common case. Gating composite keys across several columns would work the same way but is material for its own spoke; the error table already holds the key parts id1–id3 for it.

That turns the gate into a trivial WHERE condition:

SELECT
    customer_id
   ,country_code
   ,email
   ,age
FROM
   staging.customer
WHERE
   sys_error = 0;

Only rows without an error flow into the next layer. Warnings and information do not block: they are logged but pose no obstacle. That is the whole point of severity: it separates "must not proceed" from "worth a look". In the demo set of seven rows exactly two pass the gate: the clean row and the row with the missing email (a warning only). Everything with age out of range, the unknown country and the duplicate customer number stays behind, neatly logged:

severity	id1_value	error_column	error_value	message
E	1	customer_id	1	customer_id not unique
E	1	customer_id	1	customer_id not unique
E	2	age	200	Age out of range 0..120
E	3	country_code	XX	Unknown country_code
E	5	age	-3	Age out of range 0..120
W	4	email		Email missing

Why not just constraints?

The obvious question: if the target layer has CHECK, UNIQUE and foreign-key constraints anyway — why the effort? The answer lies precisely there. The downstream tables are strictly modelled; that is intended. But a constraint knows only two outcomes: the row fits, or the whole load breaks. When loading thousands of rows, "breaks" is the worst of all options — a single bad row stops the entire process, and you don't even know which one.

That is exactly why you check in the source up front: you identify all rows that would fail at the target constraints, classify them by severity and let only the clean ones pass. The framework does not rebuild the constraints — it is the transparent, auditable pre-filter before a deliberately strict target layer. Instead of an aborted load you get a table of findings and a process that carries on with the good data.

In fact the three routines are exactly the pre-filter for the three constraint types the target enforces:

Constraint at the target	Check routine in the source
`CHECK`	WHERE clause (Routine 1)
`UNIQUE`	Uniqueness (Routine 2)
`FOREIGN KEY`	Foreign key (Routine 3)

What is enforced as a constraint at the target is checked in the source up front — on the same business key that carries the UNIQUE constraint at the target. The difference is not what is checked but how the violation is handled: report and classify instead of aborting the load.
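
The difference in handling can be tried out in a few lines. The sketch below uses temporary demo tables with hypothetical names and a reduced column set; it only assumes that the source carries the sys_error counter described above.

```sql
-- Temporary demo tables (hypothetical names), not the article's real schema.
CREATE TEMP TABLE target_demo (
    customer_id int PRIMARY KEY,
    age         int CHECK (age BETWEEN 0 AND 120)
);
CREATE TEMP TABLE source_demo (
    customer_id int,
    age         int,
    sys_error   int NOT NULL DEFAULT 0
);
INSERT INTO source_demo VALUES (6, 34, 0), (2, 200, 1);

-- Without the gate, the CHECK constraint rejects the whole statement
-- and not a single row arrives:
-- INSERT INTO target_demo SELECT customer_id, age FROM source_demo;

-- With the gate, the bad row stays behind and the load carries on.
INSERT INTO target_demo (customer_id, age)
SELECT customer_id, age
FROM source_demo
WHERE sys_error = 0;
```

The target keeps its strict constraint; the gate merely makes sure that it is never triggered by rows that were already reported.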

What this approach does not cover

Three routines are a lot, but not everything — and it pays to be honest about where the line lies:

Repair is deliberately not part of it. The framework finds and makes transparent; it corrects nothing. That is a decision, not an omission — transparency first.
Cross-field business rules ("discount only if status = active") can sometimes be expressed as a WHERE clause, sometimes not.
Temporal consistency, accuracy against an external truth, complex patterns (beyond simple length and format checks) are topics of their own.

And if you'd rather use a finished tool: Soda Core, dbt tests and Great Expectations cover the same area, free and well maintained. The SQL home-build pays off when you want transparency, zero dependencies and full control over every line — or simply work in an environment where no extra tool may be installed.

Postgres-to-SQL-Server bridge

The pattern is not Postgres-specific. In SQL Server, sp_executesql takes the role of EXECUTE format(); identifiers are protected there with QUOTENAME() instead of %I. The structure stays identical: a configuration table, a cursor (or a loop) over the rules, a dynamically built INSERT … SELECT … WHERE per rule, a shared error table and severity columns in the source as a gate. Anyone coming from SQL Server transfers the approach one to one.
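
As a rough T-SQL counterpart, building one statement might look like the sketch below. The variable names and literal values are assumptions standing in for one row of a rule table; the statement is only printed, not executed.

```sql
-- T-SQL sketch; the literal values stand in for one row of a rule table.
DECLARE @schema sysname       = N'staging';
DECLARE @table  sysname       = N'customer';
DECLARE @column sysname       = N'age]; DROP TABLE staging.customer; --';  -- hostile name
DECLARE @where  nvarchar(max) = N'age NOT BETWEEN 0 AND 120';              -- raw, trusted

DECLARE @sql nvarchar(max) =
      N'SELECT COUNT(*) FROM ' + QUOTENAME(@schema) + N'.' + QUOTENAME(@table)
    + N' WHERE ' + QUOTENAME(@column) + N' IS NOT NULL AND ' + @where + N';';

PRINT @sql;
-- To run it against a real table: EXEC sys.sp_executesql @sql;
```

QUOTENAME() doubles the closing bracket inside the name, so the hostile value remains a single bracketed identifier, just as %I does with double quotes in Postgres.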

FAQ

Do I need a dedicated tool for data quality?
No. Checking data quality with SQL works with three generic routines — range, uniqueness and foreign-key checks — that cover a large share of typical errors. Tools like Soda or dbt take work off your hands but at the core do the same: build SQL, collect hits, attach a severity.

Isn't dynamic SQL a security risk?
Only if you glue inputs together raw. With format() and %I (identifiers) or %L (literals), Postgres quotes correctly and injection through table/column names is ruled out. The freely configurable predicate (where_clause) is deliberately raw SQL — which is why the configuration table is the trust boundary and belongs under administrative protection.

Why not just use the target table's CHECK and foreign-key constraints?
Because a constraint aborts the whole load instead of reporting bad rows one by one. The check in the source identifies all rows up front that would fail at the target, classifies them and lets only clean ones pass — auditable instead of aborted.

How do I find the faulty record again from the error table?
Via the id*_column/id*_value pairs: they hold the column name and value of the business key. WHERE <id1_column> = <id1_value> leads back to the source row. For composite keys, up to three pairs are available.
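
A lookup statement for one finding could be generated like this. It is a sketch with assumed values: the two literals stand in for the id1_column and id1_value of a row in the error table, and the source table name is written out directly.

```sql
-- Sketch: the two literals stand in for id1_column / id1_value of one finding.
SELECT format(
           'SELECT * FROM staging.customer WHERE %I = %L;',
           'customer_id',   -- id1_column
           '2'              -- id1_value
       ) AS lookup_statement;

-- lookup_statement:
-- SELECT * FROM staging.customer WHERE customer_id = '2';
```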

Does this work in SQL Server too?
Yes. sp_executesql replaces EXECUTE format(), QUOTENAME() replaces %I. Configuration table, error table, severity gate and the three routines stay structurally the same.

Data Quality in an ETL Process — the bigger picture: catching technical and business errors before they reach the target system.
Design Pattern // The Architecture of an ETL Process — how bad data is cleanly isolated across layered staging levels.
Data Quality // Fundamentals of Type Conversion with T-SQL — the field-wise validity check when converting, the "T" building block of this series.
Validating Data with SQL — the spoke for Routine 1: value ranges, required fields and the NULL trap of three-valued logic in detail.
Finding Duplicates with SQL — the spoke for Routine 2: maximum cardinality, composite keys and the NULL trap of UNIQUE semantics between SQL Server and Postgres.
Finding Orphaned Records with SQL — the spoke for Routine 3: LEFT JOIN … IS NULL vs. NOT EXISTS vs. NOT IN, the NOT-IN-plus-NULL trap and composite/self-referencing foreign keys.
Data Quality: Dimensions and Error Classes — the theory frame of the series: error classes, the criteria per Apel et al., and an honest coverage map of what SQL reaches — and the two criteria it cannot.

Is Claude a Woman or a Man? — and Why We Ask in the First Place

Marcus — Thu, 09 Jul 2026 08:15:32 +0000

"You spend more time with her than with me." That's a sentence you usually hear when an affair comes to light — in my case, it was about Claude Code. And suddenly there was a question in the room I had never thought about before: Is Claude actually a woman or a man?

What this article puts on trial:

The evidence for "woman" and the evidence for "man" — a circumstantial trial in two acts
The verdict: a SQL query whose result is only four letters long
Why we assign gender to tools in the first place — from ships to the GPS voice
What that means for working with Claude Code: calibrating trust instead of adopting a colleague

Prerequisites: none. Although, if you have ever lost a row to a NULL in a WHERE clause, the verdict will be twice the fun.

The Case: A First Name in the Terminal

Anthropic could have called its language model "Assistant 3000". Instead, it got a French first name — and first names trigger reflexes: Whoever has a first name has a face, a voice, a story. And, so the reflex insists, a gender.

English is comparatively merciful here — "the AI" and "the model" carry no gender. My native German is not: every noun drags a gendered article along, so German speakers cast a vote with every sentence — "die KI" (feminine) or "der Assistent" (masculine). And anyone who works with Claude Code all day catches themselves thinking "he solved that cleanly" — or "she's contradicting me again" — in any language. The question sounds silly, but it leads somewhere interesting: to the line between tool and colleague. So let's try it properly: two lines of evidence, one verdict — and then the actually interesting follow-up question.

The Evidence for "Woman"

Exhibit 1: language. In French, the name's home country, artificial intelligence is feminine: une IA. In my native German, it is too: die KI, die Maschine, die Antwort — whoever says "frag mal die KI" has already ruled. Two languages, one tendency.

Exhibit 2: the first name. Claude is one of the few French first names that have been used for both genders for centuries. Claude Pompidou was France's First Lady, Claude Jade starred for Truffaut, Claude Cahun photographed her way through every role model of the 20th century. If you hear "Claude" and automatically picture a man, you only know half the name's history.

Exhibit 3: the demeanor. The cliché says: listens, apologizes a lot, weighs every position three times. Sounds like Claude. (That this is a cliché about women rather than a property of women is part of the evidence — this exact mechanism will keep us busy after the verdict.)

Exhibit 4: the jealousy. The strongest piece of evidence comes from my own living room — see the opening: my girlfriend is jealous of Claude. By now she knows my excuses by heart: "I just need to ask Claude something real quick." "Five minutes, honestly." "No, we're just discussing index strategies." And jealousy is a surprisingly precise measuring instrument: Nobody has ever been jealous of a wrench. Of a colleague you talk to for hours every day — apparently, yes. (This exhibit, too, will reappear after the verdict — it is living proof for the chapter on anthropomorphization.)

The Evidence for "Man"

Exhibit 1: language, now for the other side. In English, the default pronoun for a coding agent slips toward "he" with remarkable ease — "ask him to fix the branch", "he's already opened the PR". And German votes twice: der Assistent, der Agent, der Chatbot, der Algorithmus — all masculine. Whoever says "the agent has already created the branch — he's fast today" has ruled just as firmly, only the other way.

Exhibit 2: the namesakes. Claude Monet, Claude Debussy, Claude Lévi-Strauss — and Claude Shannon, the founder of information theory, most frequently traded as the secret namesake. Anthropic has never officially confirmed it; but what name would suit a language model better than that of the man who made the information content of language computable? (Objection from the prosecution: speculation. — Sustained. The exhibit stays in the record anyway.)

Exhibit 3: the demeanor, cross-check. The cliché says: explains things unasked and at full length, is remarkably sure of itself — especially when it's wrong. Anyone who has ever received a confidently delivered, entirely invented API signature nods knowingly at this point.

A note for readers without German, because the two pans only pair up this way there: the labels lean on German grammar, where the AI is a she (die KI) and the assistant is a he (der Assistent). The scale weighs exactly those two grammatical camps against each other.

The Verdict

Two lines of evidence, both conclusive, both built on clichés and pronouns. They cancel each other out exactly. The court retires to deliberate — to the place where all verdicts on this blog are rendered: the database.

SELECT
    name
   ,gender
FROM
   assistants
WHERE
   name = 'Claude';

-- name    gender
-- Claude  NULL

There it is, in four letters: NULL.

The cross-check confirms the verdict — Claude shows up neither among the women nor among the non-women, because NULL is neither equal nor unequal to anything:

SELECT
   count(*)
FROM
   assistants
WHERE
   gender = 'female';
-- 0

SELECT
   count(*)
FROM
   assistants
WHERE
   gender <> 'female';
-- 0
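
For anyone who wants to reproduce the cross-check, here is a small sketch in PostgreSQL syntax. The table assistants_demo is a hypothetical one-row stand-in for the table queried above; the third counter shows the only predicate that actually finds the row.

```sql
-- Temporary one-row demo table (hypothetical), mirroring the query above.
CREATE TEMP TABLE assistants_demo (name text, gender text);
INSERT INTO assistants_demo VALUES ('Claude', NULL);

SELECT
    count(*) FILTER (WHERE gender = 'female')   AS is_female      -- 0
   ,count(*) FILTER (WHERE gender <> 'female')  AS is_not_female  -- 0
   ,count(*) FILTER (WHERE gender IS NULL)      AS is_null        -- 1
FROM
   assistants_demo;
```

Both comparisons evaluate to unknown rather than true, so neither filter counts the row; only IS NULL asks the question that has an answer.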

In SQL, NULL does not mean "empty" and it does not mean "zero". It means: there is no value here. It gets interesting once you ask why there is none — and there are two very different readings:

Missing, but applicable: The value exists, we just don't know it. The birth date of a customer who never provided it.
Missing, because not applicable: There is no value that could belong in this column at all. The shoe size of a number. The maiden name of a warehouse shelf.

Edgar F. Codd, the inventor of the relational model, even wanted to distinguish these two cases with two separate markers — it never caught on; to this day, SQL has one NULL for both. The Claude case is clearly the second reading: The gender is not unknown, it is not applicable. There is no hidden gender that Anthropic keeps secret — there is simply no value that belongs in this column. Claude itself, by the way, answers the question exactly the same way: neither, in every language.

That would close the case. But the more interesting question only starts here: Why did we ask in the first place?

Why We Assign Gender to Tools

The reflex is old and well documented. Ships have been "she" in English for centuries. The GPS voice is "the lady in the nav" in a surprising number of households, although nobody lives in there. Voice assistants like Alexa and Siri entered the market with female-coded names and default voices — a design decision that has drawn plenty of criticism since, because it perpetuates the pattern "assisting role = female".

And we act accordingly: We thank machines. We say "please" to Siri. We comfort the robot vacuum when it gets stuck under the sofa. Communication research measured this reflex back in the 1990s — Byron Reeves and Clifford Nass showed that people respond to computers with the same social patterns as to humans, even when they know perfectly well there is a machine in front of them. That is not stupidity, it is economy: Our brain has exactly one module for dialogue, and it was trained on humans.

A language model with a first name that answers in full sentences, apologizes and asks follow-up questions hits that module with full force. Wanting to assign Claude a gender is not an accident — it is the expected consequence of Claude functioning like a conversation partner. And the mechanism works in both directions: The jealousy from Exhibit 4 is the same reflex, just seen from the outside — if you spend hours talking to "someone", your environment sees a relationship, not a toolchain. The question "woman or man?" is the most harmless symptom of this mechanism. The less harmless one follows in the next section.

What This Means for Working with Claude Code

If you develop with Claude Code, you work in dialogue for hours and days on end. The anthropomorphization reflex runs in the background the whole time — and it has a side effect that directly concerns code quality: Colleagues get the benefit of the doubt, tools get checked.

The colleague who has delivered good work for years gets his pull request waved through once in a while. Exactly this pattern transfers to the AI assistant once you internally promote it to colleague: After twenty good answers, you stop really reading the twenty-first. But the twenty-first answer of a language model is statistically just as much at risk as the first — the model has no reputation to lose and does not keep track of its own success rate.

The verdict has been rendered — but it comes with terms of probation:

Review discipline independent of gut feeling. Generated code gets read, executed and tested — even if "the colleague" was right ten times in a row. What that means in practice is what the convention and workflow articles of this blog are about: rules that enforce conventions, and checks that find errors mechanically instead of relying on trust.
First-name terms are allowed, the responsibility stays here. It is perfectly fine to say "he" or "she" and to thank the assistant — as long as accountability stays clear: bugs in deployed SQL belong to the human who merged it.

The gender stays NULL. The responsibility stays NOT NULL. And the pull request still gets a review.

FAQ

Where does the name Claude come from? Anthropic has never officially explained it. The most common guess is a bow to Claude Shannon, the founder of information theory. The only thing certain: It is a deliberately human first name — and in French one of the few that are used for women and men alike.

What does Claude itself answer? Neither. Claude describes itself as an AI without gender and without a body — consistently, whether you ask in English, German or French.

Is it harmful to humanize the AI? Not per se — the reflex is normal and makes the collaboration more pleasant. It only gets risky when humanizing turns into the benefit of the doubt: when generated code goes through unread because "the colleague" has been reliable so far. The solution is not less friendliness, but more systematic reviews.

And what do I do about the jealousy at home? Read out the verdict. Granted: "It's just a tool" sounds exactly like something a person with something to hide would say. But NULL is, after all, the only relationship status where it is guaranteed that nothing is going on.

Why do Alexa and Siri have female voices, while Claude has no persona at all? The voice assistants of the 2010s were deliberately designed as friendly service personas — with a name, a voice and a small-talk repertoire. Anthropic went a different way: a human first name, but no fixed voice, no avatar, no gender. That is a deliberate design decision, and it emphasizes the tool character.

The NULL side of the story:

Validating Data with SQL — Ranges, Required Fields and the NULL Trap — why NULL values fall through WHERE filters and how to find them anyway

The serious AI side of this blog:

AI-Assisted SQL Development with Claude Code — Rules, Skills and Agents That Enforce Conventions — how "the colleague" becomes a tool with enforced conventions
Setting Up a Claude Code Project with a Development Workflow and Database — the place to start if you want to build with Claude yourself
Skills vs. Rules in Claude Code — What Auto-Loads, What Loads on Demand — the mechanics behind review discipline: conventions that do not depend on gut feeling

Design Pattern // Safe Type Conversion with T-SQL — Catch Errors Instead of Aborting the ETL Process

Marcus — Thu, 09 Jul 2026 08:14:23 +0000

A single value that won't convert — a 25.5 in an integer column, an empty string, a date like 20240230 — and the ETL run aborts mid-import. Anyone who loads text data from upstream systems knows it: the delivery doesn't honour the agreed interface, and a bare CONVERT throws an exception instead of cleanly logging the offending value.

This article describes a design pattern for safe type conversion: an approach that makes every conversion error individually identifiable without aborting the ETL process. It rests on three paradigms, derived in the next section.

What you'll learn here:

Materialization — why the intermediate results belong in a persisted table, so the result stays inspectable at any time: after an abort as well as after a successful run.
Error identification — how a single WHERE clause finds every failed value instead of killing the run.
Data-type subtleties — why TRY_CONVERT(int, '') returns 0 and when that is functionally wrong.
Reusable UDFs — fn_try_convert_* with empty-string-→-NULL handling as a building block per target type.

Prerequisite: SQL Server / T-SQL and an ETL context where text data must be moved into typed columns. For the plain conversion functions CAST, CONVERT, TRY_CAST and TRY_CONVERT, see Type Conversion Basics with T-SQL — this article builds the pattern on top of them.

The Three Paradigms

The approach rests on three paradigms:

NULL instead of abort. The conversion function returns NULL when the input value cannot be converted to the target data type — no runtime error, no ETL abort.
Materialize input and output value. The ETL process stores both the input value (text) and the converted output value in one table.
Identify errors by comparison. Comparing the input and output value finds every failed value with a simple WHERE clause.

SQL Server delivers the first paradigm out of the box with TRY_CONVERT: the function returns NULL on failure instead of throwing an exception. Applying it alone, however, does not guarantee a functionally correct conversion — for that you need to know the specifics of each target data type (see below). For some data types TRY_CONVERT can't be applied sensibly at all: a yes/no value, for instance, arrives as text (J, N, Y, Yes, No, …), and date values often need pre-processing too. For these cases you write user-defined functions that satisfy the first paradigm (NULL on failure) — the section Reusable Conversion Functions shows them.
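
The first paradigm can be checked directly with the values from the introduction. This is a small T-SQL sketch; the column aliases are chosen for illustration only.

```sql
-- CONVERT raises a conversion error for a value that does not fit:
-- SELECT CONVERT(int, N'25.5');

-- TRY_CONVERT returns NULL instead, so the statement keeps running.
SELECT
    TRY_CONVERT(int,  N'87')        AS ok_int      -- 87
   ,TRY_CONVERT(int,  N'25.5')      AS bad_int     -- NULL
   ,TRY_CONVERT(date, N'20240218')  AS ok_date     -- 2024-02-18
   ,TRY_CONVERT(date, N'20240230')  AS bad_date;   -- NULL
```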

Robust, safe type conversion matters especially in data migration projects, where the data to be processed is delivered as files (Excel, CSV, XML, JSON, …).

Input and Output Values

Input values are data that has been extracted and stored in a database, in tables and columns of type nvarchar. Output values are data converted from the input values into the target data types. For every input value to be processed there is also an output value.

Materializing the Extracted Data

A pure in-memory approach tempts you not to persist intermediate results at all: extraction, type conversion and error identification then run in a single processing flow in memory (SSIS with its Control Flow is a well-known example). When values can't be converted, extensive error handling is needed mid-flow — and if a record contains several errors, often only the first is handled and logged. Faulty records end up, at best, in a text file nobody ever looks at. As powerful as such tools are: comprehensive error handling for type conversion is rarely implemented consistently in practice.

It is better to separate the steps strictly and materialize the intermediate results in a database — regardless of which tool does the processing. The decisive gain is the persistence itself: the conversion result stays inspectable at any time — not only after an abort, but also after a successful run. You can look inside at any moment, trace individual errors and analyze their causes. That is exactly what the acronym ETL stands for: the data is first extracted into a database, then converted robustly, and in the last step error-free data is identified and processed further:

This figure shows an ETL process in which a separate database schema is created for each task to be performed:

Schema	Meaning
E0	Storing XML and JSON documents in the database
E1	Extracting the values from the text files
T1	Type conversion of the extracted values
T2	Historization of error-free converted records
L1	Structural transformation toward the target system
L2	Storage of error-free, structurally transformed data

The full derivation of this schema layering — from extraction (E0/E1) through transformation to loading (L1/L2) — is given in Design Pattern // The Architecture of an ETL Process. This article deepens the type-conversion step from schema E1 to T1 — so let's focus on those two.

Schema E1

Extracted data is stored in tables of schema E1 in columns of type nvarchar. If in doubt, leave the length of the text columns unrestricted; in any case it must be chosen so that all data can be extracted in full. Extraction can then fail only because of infrastructure problems. The extracted data is also called the input values.

Schema T1

For every table in schema E1 there is a table of the same name in schema T1. In the T1 tables the input-value columns are carried over with type nvarchar, and a second column is added per input value — this time with the target system's data type. For pragmatic reasons these column pairs share the same column name, with the columns holding the input value receiving the suffix _E1. An example…

CREATE TABLE [T1].[Table]
(
    [Id]         int IDENTITY(1,1) NOT NULL
   ,[PK_E1]      nvarchar(256)         NULL
   ,[PK]         int                   NULL
   ,[Text_E1]    nvarchar(256)         NULL
   ,[Text]       nvarchar(3)           NULL
   ,[Integer_E1] nvarchar(256)         NULL
   ,[Integer]    int                   NULL
   ,[Date_E1]    nvarchar(256)         NULL
   ,[Date]       datetime              NULL
);

In the table [T1].[Table] all columns except [Id] are declared nullable. This lets both the input values from an [E1].[Table] and the converted output values be stored — even in the presence of conversion problems. The prerequisite, however, is that all conversion functions return NULL on failure.

Identifying Conversion Errors

Let's look at a data example for the [T1].[Table] above.

Id	PK_E1	PK	Text_E1	Text	Integer_E1	Integer	Date_E1	Date
1	1023	1023	S01	S01	25.5	*NULL*	20240218	2024-02-18
2	1024	1024	S022	S02	87	87	20240230	*NULL*
3	1025X	*NULL*	S03	S03	65	65	20240219	2024-02-19

Problems converting input values into the output value's data type can be identified with a simple WHERE clause. For output values of the general type non-text, the following WHERE clauses find the type-conversion problems — the input value is populated, but the converted output value is NULL:

WHERE [PK_E1]      IS NOT NULL AND [PK]      IS NULL
WHERE [Integer_E1] IS NOT NULL AND [Integer] IS NULL
WHERE [Date_E1]    IS NOT NULL AND [Date]    IS NULL

For output values of the general type text, you can query for inequality to identify values that are too long (truncated) and therefore problematic:

WHERE [Text_E1] <> [Text]
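
Put together, the load from E1 to T1 and the error check can be sketched end to end with the three sample rows. The temporary tables are stand-ins for [E1].[Table] and [T1].[Table]; the E1 column names and the flag column names in the final query are assumptions made for this illustration.

```sql
-- Temporary stand-ins for [E1].[Table] and [T1].[Table]. The E1 column names
-- are an assumption; the T1 layout follows the table definition in the text.
CREATE TABLE #E1_Table
(
    [PK]      nvarchar(256) NULL
   ,[Text]    nvarchar(256) NULL
   ,[Integer] nvarchar(256) NULL
   ,[Date]    nvarchar(256) NULL
);

CREATE TABLE #T1_Table
(
    [Id]         int IDENTITY(1,1) NOT NULL
   ,[PK_E1]      nvarchar(256) NULL
   ,[PK]         int           NULL
   ,[Text_E1]    nvarchar(256) NULL
   ,[Text]       nvarchar(3)   NULL
   ,[Integer_E1] nvarchar(256) NULL
   ,[Integer]    int           NULL
   ,[Date_E1]    nvarchar(256) NULL
   ,[Date]       datetime      NULL
);

INSERT INTO #E1_Table VALUES
    (N'1023',  N'S01',  N'25.5', N'20240218')
   ,(N'1024',  N'S022', N'87',   N'20240230')
   ,(N'1025X', N'S03',  N'65',   N'20240219');

-- Input value and converted value are written side by side.
INSERT INTO #T1_Table
    ([PK_E1], [PK], [Text_E1], [Text], [Integer_E1], [Integer], [Date_E1], [Date])
SELECT
    [PK]
   ,TRY_CONVERT(int, [PK])
   ,[Text]
   ,CONVERT(nvarchar(3), [Text])   -- explicit conversion truncates silently
   ,[Integer]
   ,TRY_CONVERT(int, [Integer])
   ,[Date]
   ,TRY_CONVERT(datetime, [Date])
FROM
   #E1_Table;

-- One flag per column pair shows where each row failed.
SELECT
    [PK_E1]
   ,CASE WHEN [PK_E1]      IS NOT NULL AND [PK]      IS NULL THEN 1 ELSE 0 END AS [PK_failed]
   ,CASE WHEN [Text_E1] <> [Text]                            THEN 1 ELSE 0 END AS [Text_truncated]
   ,CASE WHEN [Integer_E1] IS NOT NULL AND [Integer] IS NULL THEN 1 ELSE 0 END AS [Integer_failed]
   ,CASE WHEN [Date_E1]    IS NOT NULL AND [Date]    IS NULL THEN 1 ELSE 0 END AS [Date_failed]
FROM
   #T1_Table;
```

One caveat for the text comparison: SQL Server ignores trailing spaces when comparing strings with <>, so a value that differs from its input only by trailing blanks is not flagged.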

Why both values are persisted — and what the alternative costs. The pattern deliberately stores both the input value (_E1, text) and the converted output value side by side. In theory you could drop the output column and keep only the input values — but then the check routines become more complex. It comes down to where the conversion happens:

Conversion materialized (this approach): the conversion result is written into the output column. The check is then a simple column comparison on the input/output pair — and you see directly in the row which specific column the error is in (in the example above: row 1 at Integer, row 2 at Date, row 3 at PK).
Conversion only in the check: the check routine applies the actual conversion/validation logic to the input values at runtime and logs or counts the errors. That works — but the table no longer shows where the error is: you learn how many errors a record has, but not, by simply looking at the row, in which column. This very lack of visibility has caused confusion in practice. This rule-based route — applying check rules generically via dynamic SQL — is described in detail in Checking Data Quality with SQL.

Conversion by Target Data Type

SQL Server provides functions for converting data into a target data type. Apply them without examining exactly how they work, and you'll get surprises. On closer inspection it turns out, for instance, that converting an empty string yields the number 0:

SELECT TRY_CONVERT(int, N'')   -- 0
SELECT TRY_CONVERT(int, N' ')  -- 0

That can be functionally correct. From a database developer's point of view, however, no value was delivered — the value is unknown, and therefore NULL would be the correct conversion result. There are a number of such subtleties to consider for safe type conversion.

The linked articles of this series derive, per data type, how an input value is converted safely and correctly into the output value's data type. They cover the following data types:

Data type	Range	Bytes
char	Fixed-length string, 1 byte per character	n
nchar	Fixed-length string, 2 bytes per character	2 * n
varchar	Variable-length string, 1 byte per character	variable
nvarchar	Variable-length string, 2 bytes per character	variable
bigint	-9,223,372,036,854,775,808 to 9,223,372,036,854,775,807	8
int	-2,147,483,648 to 2,147,483,647	4
smallint	-32,768 to 32,767	2
tinyint	0 to 255	1
numeric [(p [, s])] / decimal [(p [, s])]	-10^38 +1 to 10^38 -1. The two data types are functionally identical. p = total number of digits, s = digits after the decimal point	5-17
money / smallmoney	For precision and because of money's special behaviour in calculations, decimal is recommended instead.	8 / 4
float[(n)]	n = number of bits used to store the mantissa (1-53)	4 or 8
real	Synonym for float(24)	4
bit	0 or 1	1
date	0001-01-01 to 9999-12-31 (no time)	3
datetime	1753-01-01 to 9999-12-31, with time; fractional-second precision rounded to ~3.33 ms	8
datetime2(n)	0001-01-01 to 9999-12-31, with time; n = fractional-second precision	6-8
time(n)	00:00:00 to 23:59:59.9999999; n = fractional-second precision	3-5

The following articles in this series derive safe type conversion depending on the output value's data type.

Reusable Conversion Functions

The first paradigm — NULL instead of an exception for a non-convertible value — repeats for every target type. Rather than spelling it out in every SELECT, you encapsulate it in a user-defined function fn_try_convert_<type>. It handles two things TRY_CONVERT alone does not: mapping empty strings and pure whitespace input to NULL (instead of the 0 trap above) and — for floating-point numbers — normalizing comma decimal notation.

For the integer types, the int variant looks like this (the siblings fn_try_convert_bigint, fn_try_convert_smallint and fn_try_convert_tinyint differ only in the target type):

CREATE FUNCTION [dbo].[fn_try_convert_int] (@p_input AS nvarchar(256))
RETURNS int
AS
BEGIN
   DECLARE @normalized AS nvarchar(256);

   SET @normalized = LTRIM(RTRIM(@p_input));

   -- empty string/whitespace is an unknown value, not 0
   IF @normalized = N'' RETURN NULL;

   RETURN TRY_CONVERT(int, @normalized);
END;

For the floating-point types, comma-to-dot normalization is added so that a value written with a comma (e.g. 25,5, common in German-language source data) converts correctly (fn_try_convert_real is identical except for the target type):

CREATE FUNCTION [dbo].[fn_try_convert_float] (@p_input AS nvarchar(256))
RETURNS float
AS
BEGIN
   DECLARE @normalized AS nvarchar(256);

   -- comma decimal notation: comma to dot
   SET @normalized = REPLACE(LTRIM(RTRIM(@p_input)), N',', N'.');

   IF @normalized = N'' RETURN NULL;

   RETURN TRY_CONVERT(float, @normalized);
END;

With these functions, the conversion in schema T1 becomes a simple, abort-safe expression — [dbo].[fn_try_convert_int]([Integer_E1]) returns the typed value or NULL, never an exception. The data-type-specific subtleties (ranges, rounding for decimal, J/N mapping for bit, date formats) are derived per type in the linked articles of this series.

Critical Appraisal of the Approach

This article has laid out the basic approach to safe type conversion. Implementing it in an ETL process looks laborious at first: SELECT statements that read data from the schema E1 tables and store it typed in the schema T1 tables can become complex for tables with many columns.

It therefore makes sense to solve this task once with a generic, metadata-driven procedure: a procedure that generates the conversion SELECT dynamically from the T1 table structures reduces the development effort per table to essentially one line of code. This very pattern — configurable bad-data detection via dynamic SQL — is described in Checking Data Quality with SQL.

This places the article clearly: it is the practical rule guide for conversion error checking. The architectural frame — the schema layering E0–L2 — is provided by The Architecture of an ETL Process; the generalization to arbitrary data-quality rules by the framework just mentioned. This article covers the piece in between: how to concretely detect errors at the type-conversion step.

FAQ

Why not just use TRY_CONVERT directly in the SELECT?

TRY_CONVERT alone has two pitfalls: an empty string becomes 0 instead of NULL, and on failure the information about which value was non-convertible is lost. The pattern solves both — the fn_try_convert_* function maps empty values to NULL, and the E1/T1 materialization keeps the original value next to the conversion result.

What's the difference from the "Type Conversion Basics" article?

The basics article compares the functions themselves — CAST, CONVERT, TRY_CAST and TRY_CONVERT. This article builds the design pattern on top: the ETL approach of materialization, error identification via WHERE clause and reusable UDFs. The basics are the tool, this is the method.

How do I find all failed conversions?

With a WHERE clause on the column pair: for non-text types the conversion failed if the input value is populated but the output value is NULL ([x_E1] IS NOT NULL AND [x] IS NULL). For text types, inequality ([x_E1] <> [x]) reveals truncated values.
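As a minimal T-SQL sketch, the two predicates could look like this. The table [T1].[customer] and the columns [customer_id], [age] and [name] are hypothetical; only the pairing of [x_E1] with [x] is taken from the pattern.

```sql
-- Assumed layout: every typed column [x] sits next to its raw original [x_E1].
-- [T1].[customer] and all column names are illustrative, not from the article.

-- Non-text type: input populated, typed value NULL -> conversion failed.
SELECT [customer_id], [age_E1]
FROM   [T1].[customer]
WHERE  [age_E1] IS NOT NULL
  AND  [age]    IS NULL;

-- Text type: raw and stored value differ -> value was truncated.
SELECT [customer_id], [name_E1], [name]
FROM   [T1].[customer]
WHERE  [name_E1] <> [name];
```

If the raw column holds empty strings rather than NULL, the first query lists those rows as well, because the conversion functions map an empty string to NULL. Whether that counts as a finding is a project decision.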

How do I automate the conversion across many columns and tables?

With a generic, metadata-driven procedure that generates the conversion SELECT dynamically from the table structures. The pattern is worked out in Checking Data Quality with SQL — there as configurable bad-data detection via dynamic SQL.

Does the pattern also apply to PostgreSQL?

Conceptually yes. The three paradigms are engine-neutral. Postgres has no TRY_CONVERT, but the same idea can be implemented with a PL/pgSQL function that wraps the cast in a BEGIN … EXCEPTION WHEN others THEN RETURN NULL block. This is especially relevant for data migration to PostgreSQL.
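A minimal PL/pgSQL sketch of that idea for the integer case follows. The function name is chosen here only to mirror the T-SQL helpers and is an assumption, as is the decision to treat whitespace-only input like an empty string.

```sql
-- Sketch only: PostgreSQL counterpart to a fn_try_convert_int-style helper.
-- Name and blank handling are assumptions made for this illustration.
CREATE OR REPLACE FUNCTION fn_try_convert_int(p_input text)
RETURNS integer
LANGUAGE plpgsql
IMMUTABLE
AS $$
BEGIN
    -- Empty or whitespace-only input maps to NULL instead of raising an error.
    IF p_input IS NULL OR btrim(p_input) = '' THEN
        RETURN NULL;
    END IF;

    RETURN btrim(p_input)::integer;
EXCEPTION
    -- Catch-all as described in the text; it could be narrowed to
    -- invalid_text_representation and numeric_value_out_of_range.
    WHEN others THEN
        RETURN NULL;
END;
$$;

-- Usage: returns 42, NULL, NULL.
SELECT fn_try_convert_int('42'),
       fn_try_convert_int('abc'),
       fn_try_convert_int('');
```

A block with an EXCEPTION clause is more expensive to enter than a plain block in PL/pgSQL, so on very large loads it may be worth measuring this approach before applying it to every column.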

Basics:

Data Quality // Type Conversion Basics with T-SQL — CAST, CONVERT, TRY_CAST and TRY_CONVERT compared.

In the ETL, data-quality and migration context:

The Architecture of an ETL Process — the schema layering E0–L2 into which this pattern places the E1→T1 conversion step.
Data Quality in an ETL Process — the higher-level view of data quality in the process.
Checking Data Quality with SQL — generic bad-data detection via dynamic SQL.
Data Migration: SQL Server to PostgreSQL — the context where safe type conversion matters most.

AI-Assisted SQL Development with Claude Code — Rules, Skills and Agents That Enforce Conventions

Marcus — Tue, 07 Jul 2026 16:25:08 +0000

A stored procedure, a migration script, a complex report — Claude Code writes them in seconds. That's the easy part. The hard part starts afterwards: generated SQL that belongs to no one drifts apart just like hand-written code — only faster, because the AI produces hundreds of lines on demand. AI-assisted SQL development only pays off when the generated code follows the same conventions as the hand-written kind — and when a human still understands what was produced.

This article is the entry point to a series on how AI-assisted SQL development with Claude Code works in practice — not as autocomplete, but as three concrete levers: rules files that enforce conventions, skills for recurring tasks, and agents for multi-step data workflows. The common thread stays the data work: SQL Server, Postgres, ETL — not AI for its own sake.

What you'll take away:

why SQL and ETL work in particular benefits from machine-enforced conventions;
the three levers of Claude Code — rules, skills, agents — and what each is good for;
how a .claude/rules/ file turns a style guide into a default at generation time;
why generated code needs conventions and human understanding — not one without the other.

Prerequisite: a basic grasp of SQL/ETL. Claude Code (Anthropic's AI coding agent) is introduced here, not assumed.

Why AI coding, and why for SQL work?

SQL and ETL work is full of recurring patterns: the same naming convention across hundreds of objects, the same procedure layout, the same log inserts in every load step, the same formatting across every statement. Patterns like these are the ideal ground for machine enforcement — regular enough that an AI can reproduce them reliably, and numerous enough that human discipline eventually tires.

At the same time, SQL lets missing understanding slip through silently: a statement can be syntactically correct, run fast, and still answer the wrong question. An AI that generates SQL sharpens both sides — it produces patterns faster, but it also produces wrong answers faster. AI-assisted SQL development is therefore only worth it with two guardrails: enforced conventions, so the generated code stays readable and reviewable, and a human who understands the business question.

The three levers: rules, skills, agents

Claude Code offers three mechanisms that go beyond plain autocomplete — each solves a different problem:

Rules files (.claude/rules/) — enforce conventions. Project instructions the agent receives automatically on every request. They answer: "What should generated code look like?"
Skills / slash commands — encapsulate recurring tasks. Named, parameterizable routines for things you do the same way over and over. They answer: "How do I trigger a known task reproducibly?"
Agents — orchestrate multi-step workflows. Routines that go beyond a single prompt — several steps, tools, checks. They answer: "How do I run a whole chain of steps reliably?"

The following sections take each lever in turn — through the SQL lens.

Rules files: conventions the agent enforces

The most direct lever. A .claude/rules/ file sits in the repository and is handed to Claude Code as a project instruction on every request. Whatever it says becomes the default at generation time: prescribe sp_<verb>_<entity> and you get sp_upd_project instead of update_project — without anyone having to remember it in review.

The difference from a classic style guide is fundamental. A style guide in a wiki depends on human discipline and gets forgotten; a rules file is handed to the agent on every generation. "Please stick to it" becomes "this is how it's generated". For that to hold, the rules file must be built differently for an agent than for a human: explicit code examples instead of prose, explicit anti-patterns (Don'ts), and — most importantly — a reason per rule, so the agent can transfer it to new cases.

What that looks like in detail — from the name prefix through the collapsible block structure to tabular alignment — is shown by the sister article on PL/pgSQL conventions using a complete, battle-tested sql.md. It's the case study for this section. And how such a rules file comes about in the first place — from manual corrections that Claude Code turns into rules — is shown step by step in Deriving SQL Conventions with Claude Code.

The "what" behind the rules files — the concrete conventions for formatting and structure — is described at length on this blog anyway: in the articles on Formatting SQL Statements (Part 1) and Part 2, and on Structuring SQL Statements. A rules file turns that knowledge into a machine-enforced rule.

Skills and slash commands: recurring SQL tasks

Some tasks you do the same way every time: create a new table with its standard procedures, drop a logging pattern into an ETL step, format a statement by the house rules, generate a data-quality check. A skill (invoked as a slash command in Claude Code) encapsulates such a routine under a name — including the necessary steps and conventions.

The gain is reproducibility: instead of rephrasing every prompt (and forgetting this or that along the way), you invoke the same reviewed routine every time. For SQL work that means, for instance: a skill that creates a table by the surrogate-PK rule and generates the sp_ins_/sp_upd_ procedures in the house style right alongside — identically structured every time.

Agents: multi-step data workflows

The biggest lever — and the one that demands the most care. An agent runs a routine that goes beyond a single prompt: several steps, intermediate results, tool calls, checks. In the ETL context this is a natural fit, because ETL itself is multi-step — extract, check, transform, load, log.

An agent could, say, analyze a source table, propose a suitable data-quality check, generate the load script and wire in the logging — all in the conventions the rules file prescribes. The decisive point: the more steps an agent takes on its own, the more important the two guardrails from the start become. Enforced conventions keep the generated code reviewable; a human who understands the business question catches the plausible-but-wrong results that an agent produces just as fluently as the right ones.

The common thread: generated code needs understanding

The same thesis runs through all three levers: the tool takes the typing off your hands, not the understanding. Formatting and structuring SQL was never just cosmetics — it's the act in which you read the statement in full and build the relationships between the tables mentally. When the AI takes over that act, a gap opens up: technically correct SQL that still doesn't answer the business question.

Conventions don't close that gap by themselves — but they make it visible. Generated code that follows house rules is readable enough for a human to review in seconds instead of having to decipher it first. That's exactly why rules, skills and agents are not a replacement for domain knowledge but an amplifier: they handle the reproducible part reliably and free the human's mind for what no model can do safely — asking the right question.

FAQ

Do I need Claude Code, or does this work with any AI tool?

The three levers (enforced conventions, encapsulated tasks, multi-step routines) are tool-agnostic as a concept. The concrete implementation differs: Claude Code loads .claude/rules/ files automatically as a project instruction, while other tools have their own mechanisms (project settings, system prompts, custom instructions). The principle "conventions as a machine-loaded rule" transfers; the file paths do not transfer one-to-one.

Is generated SQL safe enough for production?

Only with review. An AI produces plausible code — including where it's wrong. Enforced conventions lower the risk because generated code stays readable and therefore reviewable; they don't replace the review. Security-relevant logic (permission checks, mutations) deserves especially careful reading.

How do I keep the AI from ignoring my conventions?

By keeping the conventions in a rules file in the repo instead of in your head. Three things help: explicit code examples instead of prose descriptions, explicit Don'ts (anti-patterns), and a reason per rule, so the agent transfers it to cases it hasn't been shown.

Is this worth it for solo developers or only in a team?

Both — for different reasons. In a team, conventions prevent drift across several hands. Solo, the gain is the "second developer" that never forgets the rules: even working alone, you benefit from generated code staying consistent and still being readable in 18 months.

Starter kit:

The open DI² starter kit on GitHub — the rules, skills and SQL rule tree from this article as a forkable template.

Data Quality: Dimensions and Error Classes — the Theory Behind the SQL Checks

Marcus — Sat, 04 Jul 2026 15:38:03 +0000

A lot gets written about data quality, and very little gets measured. The German-language practitioner's standard reference alone lists sixty possible quality criteria, from timeliness to reliability, and even the lean models still arrive at six to fifteen dimensions. Yet the core of the matter is surprisingly tangible: a data error caught during loading costs you a log entry. The same error running through to the report costs a wrong invoice, a bad decision, and the trust in every number that follows. This article makes the underlying framework tangible, without a theory slog: it sorts the common data errors into classes, maps them to the established dimensions of data quality, and uses an honest coverage map to show which of them three generic SQL routines actually cover. And it names the two they cannot reach.

The essentials up front:

Data errors sort into a handful of classes — technical vs. business, and field-level vs. record-level vs. relationship-level. That is enough to place any concrete error.
The dimensions of data quality are established canon, not taste — defined criteria in the German-language standard reference by Apel et al., internationally in Wang & Strong, DAMA, and ISO/IEC 25012.
Generic SQL checks four criteria directly in the database: completeness, validity, uniqueness, and referential integrity/consistency. Two stay out of reach — content accuracy and timeliness; both need information from outside the database.
Checking early is cheap, repairing late is expensive: prevention at the source cuts total costs by roughly two thirds (Redman, as cited by Apel et al.) — and 100 % data quality is not the economic target anyway.

Prerequisites: none. This article provides the framing before the SQL practice begins; the theory part needs no SQL knowledge. If you want the how right away, head for the framework article Data quality checks with SQL.

What data errors look like: a classification
The dimensions of data quality
What can SQL actually check?
The two blind spots
What bad data costs
From theory to practice
FAQ
Related Articles

What data errors look like: a classification

Before slicing data quality into dimensions, it helps to look at the errors themselves. They can be sorted along two axes, and together the two axes cover practically every case.

The first axis is technical versus business. A technical (syntactic) error violates the form: letters in a numeric column, a date that is no date, a foreign key pointing nowhere. You can spot these errors without any domain knowledge — they break the rules of the data type or the model. A business (semantic) error, by contrast, is formally flawless and still wrong: an age of 200, a ship date before the order date, a customer flagged as "premium" without a single order. Only domain knowledge makes the error visible.

The second axis is the scope of the error:

field-level — the error sits in a single value (age = 200, "abc" in a numeric column).
record-level — the error only emerges from the interplay of several fields in one row (ship_date < order_date) or between rows of one table (two records sharing the same key).
relationship-level — the error lies in the relationship between tables (a reference whose target does not exist).

The two axes combine into a compact taxonomy in which every concrete error gets a place and a responsible check routine:

Scope	technical / syntactic	business / semantic
field-level	`"abc"` in a numeric column, an invalid date → validity	`age = 200`, a negative price → validity (business rule)
record-level	two records with the same primary key → uniqueness	`ship_date < order_date` → consistency (cross-field)
relationship-level	foreign key pointing to a missing master → consistency (integrity)	"premium" customer without an order → business rule

One special case falls outside the grid: the value that is missing. An empty mandatory field is neither technically nor semantically malformed — it simply is not there. A criterion of its own is responsible for it, completeness — it cannot be pinned to any single cell but cuts across the whole matrix. Two more kinds of error are absent from the matrix entirely — the value that is well-formed and still factually wrong, and the value that is outdated. We will come back to both; they are the blind spots.

The dimensions of data quality

The classification says what an error looks like. The dimensions of data quality say which property of good data it violates. There is no shortage of theory here — rather a surplus, and that is why the term so often stays fuzzy.

The German-language practitioner's standard reference on data quality in business intelligence projects is Datenqualität erfolgreich steuern ("Managing Data Quality Successfully") by Apel, Behme, Eberlein, and Merighi (3rd edition, Edition TDWI). Following Würthele, it defines data quality as a "multidimensional measure of the suitability of data to fulfill the purpose tied to its capture/generation" — a suitability that can change over time as needs change (all book quotes translated from the German original). Two things are already baked into this definition: quality is multidimensional, and it is purpose-bound. There is no absolute "good", only a "good enough for this purpose".

How many dimensions there are is a matter of definition. The book starts with an alphabetical catalog of sixty possible quality criteria and narrows it down to the practically viable ones; for the business intelligence context it highlights six — accuracy, consistency, reliability, completeness, timeliness, and relevance. For this article, the criteria that matter are the ones data errors can technically be pinned to:

Criterion	What it demands (per Apel et al.)
Completeness	The attributes are populated with values that "semantically differ from the value `NULL` (unknown)"; no data gets lost in transformations.
Validity (in the book: formal accuracy + uniformity)	The values arrive in the predefined format and are represented uniformly — in practice, the allowed value range belongs here too.
Uniqueness (in the book: freedom from redundancy + key uniqueness)	No record describes the same real-world entity twice; the business key occurs only as often as it may.
Referential integrity / consistency	Every foreign key uniquely references an existing primary key; values do not contradict each other — within a record, between records, across applications.
Accuracy (content)	The values match the entities of the real world — the data corresponds to reality.
Timeliness	The records reflect the current state of the modeled world and are not outdated.

Two names in this table each bundle two of the book's criteria — the book slices finer than international usage does. Validity does appear in the catalog, but the actual definition lives in the formal component of accuracy (delivery in the predefined data format) and in uniformity. And what DAMA and ISO call uniqueness, the book lists as freedom from redundancy and key uniqueness; its own criterion named "uniqueness" means something else there, namely the unambiguous interpretability of a record through its metadata. Key uniqueness the book phrases in terms of primary keys — yet the check is only meaningful on the business key, because a constraint-enforced primary key cannot occur twice in the first place. The check gets interesting exactly where the constraint is (still) missing: in staging tables and at interfaces. This article sticks to the common names and means the book criteria listed above.

That several such lists exist is not a contradiction but a method: group the criteria and you get a quality model. The book presents two side by side — a theoretical taxonomy after Hinrichs, and the categorization of the German Society for Information and Data Quality (DGIQ), derived from a user survey and in turn based on the much-cited study by Wang and Strong (1996) with its fifteen dimensions. Internationally, the six dimensions of DAMA UK (2013) and the ISO/IEC 25012 standard are widely used as well. They are different cuts through the same subject — which one fits depends on the purpose. For the rest of this article we stay with the book's criteria, because they point most directly at an SQL check.

What can SQL actually check?

Now for the decisive question: how many of these criteria does a handful of generic SQL checks actually reach? The practical framework of the SQL check series builds on three routines — a value-based WHERE check, a duplicate check, and a foreign-key check. Plot them against the criteria and you get this coverage map:

Criterion	verifiable with SQL?	Routine
Completeness	✅ yes	value check (`IS NULL` on mandatory fields)
Validity	✅ yes	value check (value range, format, safe type conversion)
Uniqueness	✅ yes	duplicate check (`GROUP BY … HAVING count(*) > 1`)
Referential integrity / consistency	✅ partly	foreign-key check; cross-field rules in the `WHERE`
Accuracy	❌ no	—
Timeliness	❌ no	—

Four criteria, then, are within reach of generic, configurable SQL — and with surprisingly little code. "Verifiable" means: SQL executes the rule; the domain has to formulate it — a price > 0 is written into no engine by default. Completeness is an IS NULL check on the mandatory columns — in the book's wording, "the individual attributes contain no NULL values". Validity is a value-range or format check; its classic case is safe type conversion, which — for example with TRY_CONVERT in SQL Server — turns an invalid value into NULL instead of aborting the load (more in the type conversion basics). Uniqueness is a grouping over the business key with HAVING count(*) > 1. Referential integrity — in the book's wording, "every foreign key must uniquely reference an existing primary key" — is a foreign-key check.
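To make the four checks concrete, here is a hedged sketch in PostgreSQL syntax. The tables staging.orders and staging.customers and all column names are hypothetical; the rule price > 0 stands in for whatever the domain actually defines.

```sql
-- Hypothetical staging tables (all names are illustrative):
--   staging.orders(order_id, order_no, customer_id, price)
--   staging.customers(customer_id)

-- Completeness: a mandatory column is empty.
SELECT order_id
FROM   staging.orders
WHERE  customer_id IS NULL;

-- Validity: value outside the range the domain has defined.
SELECT order_id, price
FROM   staging.orders
WHERE  price <= 0;

-- Uniqueness: the business key occurs more than once.
SELECT   order_no, count(*) AS occurrences
FROM     staging.orders
GROUP BY order_no
HAVING   count(*) > 1;

-- Referential integrity: a reference whose target does not exist.
SELECT o.order_id, o.customer_id
FROM   staging.orders AS o
WHERE  o.customer_id IS NOT NULL
  AND  NOT EXISTS (SELECT 1
                   FROM   staging.customers AS c
                   WHERE  c.customer_id = o.customer_id);
```

Each query returns the offending rows rather than a count, so the result can be written straight into an error table.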

The "partly" for consistency deserves honesty — and a distinction: referential integrity is only the relational special case of consistency; the book explicitly calls its key criteria a "special alignment toward the relational database model". Consistency itself reaches further: the book explicitly includes the reconciliation of data across different applications — that a value does not contradict itself across systems. A single SQL check in one database cannot deliver that cross-system reconciliation; the variants checkable inside the database (referential integrity, cross-field rules within a row) it covers cleanly. The boundary, however, is set by the process, not the technology: bring the data needed for the reconciliation — the master from the other system, the reference table of the other application — into the ETL process, and they sit side by side in one database; the cross-system reconciliation becomes an ordinary SQL check. The foreign-key check then runs against a master that originally came from an entirely different system.

The book draws this line itself, by the way — just elsewhere: in its example metrics per criterion (chapter 7, table 7–3), completeness is an automated "number of NULL values" query, while content accuracy and timeliness list "user feedback" as the measuring method. Where no query can reach, a human has to answer.

The two blind spots

That leaves the two criteria where SQL alone has to pass — not because of a weakness of the language, but because the required information simply is not in the database. Both sit in the book's catalog right next to the checkable ones.

The first blind spot is timeliness. Whether an address is "current" cannot be read off the address itself — it looks identical yesterday and today. Timeliness demands that the data reflect "the current state of the modeled world", as the book puts it — and that takes a temporal reference: a timestamp of when the value was last confirmed, and an expectation of how long it stays valid. If such a last_verified field exists, SQL can check the rule (last_verified < now() - interval '1 year') — but then it checks the completeness and validity of that timestamp, not the currency of the actual value. Without the timestamp, the criterion is invisible to a pure database check.
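Assuming such a timestamp exists, the rule could be sketched like this in PostgreSQL. The table staging.customer_address and the one-year validity window are illustrative assumptions.

```sql
-- Hypothetical table: staging.customer_address(customer_id, last_verified timestamptz).
-- The one-year window is an assumed expectation, not a general rule.
SELECT customer_id, last_verified
FROM   staging.customer_address
WHERE  last_verified IS NULL                        -- never confirmed
   OR  last_verified < now() - interval '1 year';   -- confirmation too old
```

The query flags addresses whose confirmation is missing or stale. It says nothing about whether a recently confirmed address is actually still correct.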

The second blind spot is accuracy — and it is the subtlest point of this whole article, because the book defines it as having "a content component and a formal component". The formal side — the right data type, the predefined format — is exactly validity, and thus checkable. The content side is not: a birth date of 1990-05-14 is perfectly valid — right type, plausible range, clean format. Whether it is the person's actual birth date, the database does not know and cannot know. That would take an external source of truth: the ID document, a population register, a second independent record. SQL compares data against rules, not against the world.

This makes the boundary more precise: accuracy is unverifiable as long as the source of truth lies outside the database. As soon as it becomes available as data (say, a leading system that is co-extracted in the ETL process, the same pattern as with the cross-system consistency reconciliation), the check becomes an ordinary SQL comparison: source against leading system, deviation equals finding. The typical case is data migration, where the target system already knows many of the records; how to build such reconciliations systematically is shown in the article on verifying a migration. What is checked then, however, is agreement with the designated source of truth; whether that source itself agrees with the world remains a governance decision, not an SQL question.
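A reconciliation of this kind might be sketched as follows in PostgreSQL. The tables staging.customer (source) and staging.customer_leading (co-extracted leading system) and the compared column birth_date are hypothetical.

```sql
-- Hypothetical tables, both keyed by customer_id (names are illustrative):
--   staging.customer          = data from the source being checked
--   staging.customer_leading  = co-extracted copy of the leading system
SELECT s.customer_id,
       s.birth_date AS source_value,
       l.birth_date AS leading_value
FROM   staging.customer         AS s
JOIN   staging.customer_leading AS l
       ON l.customer_id = s.customer_id
WHERE  s.birth_date IS DISTINCT FROM l.birth_date;  -- NULL-safe inequality
```

IS DISTINCT FROM also reports rows where only one side is NULL, which a plain <> comparison would silently skip. Records that exist on only one side are not covered by this inner join and would need a separate check.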

This boundary is not an excuse but the honest core of the matter: an SQL check guarantees that data is well-formed and internally coherent — not that it is true. Whoever says "automatically checked" has to name the two criteria left out — otherwise they are selling a green checkmark as proof of truth.

What bad data costs

You will find plenty of charts online along the lines of "37 % of all data errors are completeness errors". No such chart appears here, deliberately — there is no citable primary source for a reliable frequency distribution per dimension, and a made-up number would be exactly the kind of data error this article is about.

What is citable is the economics behind it — and it has two sides. The first concerns when you check. The 1-10-100 rule originally comes from quality management (Labovitz and Chang, Making Quality Work, 1992) and has since been applied to data quality: what costs 1 to prevent at the source costs 10 to correct afterwards and 100 if you let the error take effect — as a wrong invoice, a lost customer, a bad decision. The specific factors are a rule of thumb, not a law of nature; the direction, though, is undisputed and matches everyday experience: an error gets more expensive the later it surfaces. That is exactly why checking at the source (staging) pays off, before the data flows on — there an error is still reportable, at the target it is fatal.

The second side concerns not when you check but how far you should push data quality. In their book Datenqualität erfolgreich steuern (3rd edition, Edition TDWI, figure 3–2, p. 43), Apel, Behme, Eberlein, and Merighi contrast two opposing cost curves: the costs caused by poor data quality fall as quality rises — bad decisions, rework, and lost customers become rarer. The costs of producing and assuring good quality rise instead, and disproportionately so the closer you get to 100 %. The sum of both — the total quality cost — has its minimum not at 100 % but at an optimum in between. That is the core message: 100 % data quality is rarely the economic target; what you are looking for is the most cost-effective combination for the purpose at hand.

Figure (own rendering after Apel et al., "Datenqualität erfolgreich steuern", 3rd edition, Edition TDWI, figure 3–2, p. 43): the classic total-cost-of-quality model.

The chart is deliberately simplified — a principle, not a measurement; the optimum shifts with data and purpose. Two practical consequences follow:

Achieve a lot with little effort. The production curve stays flat for a long stretch and only explodes at the top end. The cheap early stretch is the obvious NULL values, broken formats, and duplicates that generic SQL extracts with little effort — exactly the four checkable criteria. The last few percent, and even more so the two uncheckable criteria (accuracy and timeliness), cost far more and are only worth it where the domain demands it.
Prevent early instead of cleaning up late. Prevention at the source is the strongest lever: avoiding errors there, instead of detecting and cleansing them later, cuts total costs by roughly two thirds on average, according to the research cited in the same chapter (Redman 2008).

From theory to practice

The framework is in place: errors sort into a handful of classes, good data describes itself in a handful of criteria, and four of them are checkable with generic SQL. The road from here into practice runs through four articles:

The framework — Data quality checks with SQL — builds the shared error table, the severity gate, and the configurable runner that generates the three routines at runtime.
Validity + completeness are deepened in Validate data with SQL — value ranges, mandatory fields, and the distinction between "no value" and "unknown value".
Uniqueness is deepened in Find duplicates with SQL — from count(*) > 1 to composite keys.
Consistency / integrity is deepened in Find orphaned records — checking referential integrity, even without a foreign-key constraint.

With the dimensions in mind, these four articles no longer read as loose tricks but as what they are: one answer each to one measurable property of good data.

FAQ

What data quality dimensions are there?
It depends on the model. The German-language practitioner's standard reference (Apel et al., Datenqualität erfolgreich steuern) lists sixty possible quality criteria and highlights six for the business intelligence context: accuracy, consistency, reliability, completeness, timeliness, and relevance. Internationally, the six dimensions of DAMA UK (2013) and the fifteen characteristics of ISO/IEC 25012 are widespread; the academic origin is the fifteen dimensions of Wang & Strong (1996). They all describe the same subject at different resolutions — which list fits depends on the purpose.

What is the difference between validity and accuracy?
Validity means: the value conforms to the rules (right type, valid format, allowed value range). Accuracy means: the value corresponds to reality. The standard reference bundles both under "accuracy" (a content and a formal component), but for an SQL check exactly this separation is decisive: a birth date can be perfectly valid (well-formed, plausible) and still factually wrong. SQL checks the formal side, because the rules live in the database; content agreement with reality needs an external source of truth and cannot be checked with SQL alone. There is one exception: if a designated source of truth is available as data, such as the leading system in a data migration, the accuracy check becomes an SQL reconciliation against that source.

How many data quality dimensions can you check with SQL?
Four criteria can be checked with generic SQL directly in the database: completeness (IS NULL check on mandatory fields), validity (value range/format), uniqueness (GROUP BY … HAVING count(*) > 1), and referential integrity/consistency (foreign-key check). Content accuracy and timeliness are not checkable: both need information from outside the database (an external reference or a timestamp).

What is the difference between technical and business data errors?
A technical (syntactic) error violates the form and is detectable without domain knowledge: letters in a numeric column, a foreign key pointing nowhere. A business (semantic) error is formally correct and still wrong: an age of 200 or a ship date before the order date. Generic SQL catches technical errors easily; business errors need an explicitly formulated business rule.

Do I need a tool to measure data quality?
Not necessarily for the four checkable criteria: a handful of generic SQL checks with a central error table covers completeness, validity, uniqueness, and referential integrity. Specialized tools (dbt tests, Great Expectations, Soda) automate and orchestrate this more comfortably and add reporting, but at their core they check the same criteria. They do not solve accuracy and timeliness either, at least not without an external reference or a timestamp.

Data quality checks with SQL — the configurable SQL framework that casts the four checkable criteria of this article into three generic routines: shared error table, severity gate, dynamic runner.
Validate data with SQL — the routine for validity and completeness, including the NULL trap.
Find duplicates with SQL — the routine for uniqueness: maximum cardinality and composite keys.
Find orphaned records — the routine for consistency / referential integrity.
Data quality in an ETL process — the bigger picture: where in the ETL process the dimensions are checked and how bad data is isolated.

Data Migration: SQL Server to PostgreSQL — the Complete Guide

Marcus — Tue, 30 Jun 2026 12:46:36 +0000

A data migration from SQL Server to PostgreSQL rarely fails at actually copying the data. It fails at the silent differences that only surface in the target: datetime, which knows no time zone; bit, which is not a boolean; an IDENTITY that turns into a sequence; and a collation that suddenly compares case-sensitively. Anyone who sets out to migrate SQL Server to PostgreSQL isn't just copying tables: they're translating types, schema, code and behaviour from one engine into another.

This guide gives the overview: it sorts the move into five phases and names the key trip-ups for each. The depth lives in a dedicated article per phase — this one supplies the through-line that ties the phases together.

The essentials up front:

Five phases instead of "moving data": the move breaks down into data types, schema/DDL, data transfer, code porting and verification. Tackling them in this order spares you the typical setbacks.
Each phase with its own trip-up: sometimes it's breaking types, sometimes IDENTITY becomes a sequence, sometimes it's the transfer method or translating T-SQL into PL/pgSQL. None of them shows up until it strikes in the target — which is why verification comes last.
What tooling takes off your plate: tools like pgloader handle the mechanical ~80% — the uncritical types and the bulk transfer. The last 20% — type edge cases, procedure logic, triggers — stay handwork.
The through-line: each phase has its own detailed article with the full depth. This hub connects them and tells you in which order they play together.

Prerequisite: a basic grasp of relational databases and of T-SQL. The source is SQL Server 2017+, the target PostgreSQL 16/17. Postgres concepts (sequences, text, COPY, PL/pgSQL) are placed in context briefly on first mention, so no prior Postgres knowledge is required.

Why migrate SQL Server to PostgreSQL?
The migration path in five phases
One Table, Several Phases at Once
What tooling takes off your plate — and where handwork remains
FAQ
Related articles

Why migrate SQL Server to PostgreSQL?

The reasons are rarely technical — they're usually economic or strategic. Licensing costs fall away: PostgreSQL is open source and usable without core or CAL licences. The open-source stack runs freely on any infrastructure, with no vendor lock-in. And cloud portability is high — nearly every provider offers managed PostgreSQL.

This is no holy war over "Postgres is better than SQL Server". Both engines are mature, and not every workload belongs on the move. The point is sober: once the decision to migrate SQL Server to PostgreSQL has been made, it pays to plan the move as a structured path rather than a one-off copy operation. That's exactly what this guide is for.

The migration path in five phases

A database migration is not a single step but a chain of dependent phases. Tackling them in this order avoids the typical setbacks — such as transferring data before the target schema has the right types. Each phase has its own detailed article.

1. Data types. The foundation. Most types convert one-to-one (int, numeric, varchar, date), but datetime, bit, money, uniqueidentifier, nvarchar and tinyint have no one-to-one equivalent. datetime forces the time-zone question (timestamp vs. timestamptz), bit becomes a real boolean, and the Postgres money type is best avoided in favour of numeric. Which types convert cleanly and which break is covered in Data Type Mapping SQL Server → PostgreSQL.

2. Schema & DDL. Once the types are settled, the structure around them follows: IDENTITY becomes GENERATED ALWAYS AS IDENTITY or GENERATED BY DEFAULT AS IDENTITY (or an explicit sequence), default constraints and named constraints move over, and identifier handling changes. SQL Server usually compares identifiers case-insensitively, while Postgres folds unquoted identifiers to lowercase and treats quoted ones as case-sensitive. Details in Schema Migration SQL Server → PostgreSQL.
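
The identifier behaviour is easiest to see on two throwaway tables. This is a small sketch in PostgreSQL; the table names customer_demo and "OrderDemo" are made up for the illustration:

```sql
-- Unquoted identifiers are folded to lowercase
CREATE TABLE Customer_Demo (Id integer);   -- stored as customer_demo(id)
SELECT id FROM CUSTOMER_DEMO;              -- works: folded to customer_demo

-- Quoted identifiers keep their case and must be quoted on every use
CREATE TABLE "OrderDemo" ("Id" integer);
SELECT "Id" FROM "OrderDemo";              -- works
-- SELECT Id FROM OrderDemo;               -- fails: relation "orderdemo" does not exist
```

Migrating with lowercase, unquoted names avoids having to quote every identifier in ported queries.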

3. Data transfer. Only once the target schema stands do the data move. The choice of method depends on data volume and downtime tolerance: a bcp export plus COPY, the all-in-one tool pgloader, or an ETL pipeline. Which method fits when is compared in Transferring Data: bcp, COPY, pgloader, ETL.
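
For the export-plus-COPY route, the load side might look like the following sketch. It assumes that a CSV export of the customer table from the worked example already exists; the path /tmp/customer.csv, the header row and the column order are assumptions about that export, not something the export tool guarantees.

```sql
-- Server-side load: the file must be readable by the PostgreSQL server process.
-- From psql on a client machine, the \copy meta-command takes the same options.
COPY customer (customer_id, full_name, is_active, created_at)
FROM '/tmp/customer.csv'
WITH (FORMAT csv, HEADER true);
```

The boolean column accepts 1 and 0 as input text, so a bit column exported as digits loads without a conversion step.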

4. Code porting. Tables are only half the database. Stored procedures, functions and triggers have to be translated from T-SQL to PL/pgSQL — different error handling, different variable syntax, different idioms (TRY_CONVERT has no direct counterpart). This is the part with the highest handwork share. Explored in depth in Porting T-SQL to PL/pgSQL.
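
One common way to fill the TRY_CONVERT gap is a small helper function that traps the conversion error. The sketch below is one possible shape, not a drop-in replacement: the name try_to_int is hypothetical, and edge cases such as surrounding whitespace may behave differently than in T-SQL.

```sql
-- Hypothetical helper: returns NULL instead of raising when the cast fails
CREATE OR REPLACE FUNCTION try_to_int(p_value text)
RETURNS integer
LANGUAGE plpgsql
IMMUTABLE
AS $$
BEGIN
    RETURN p_value::integer;
EXCEPTION
    -- not a number at all, or a number too large for integer
    WHEN invalid_text_representation OR numeric_value_out_of_range THEN
        RETURN NULL;
END;
$$;

SELECT try_to_int('42')  AS ok,     -- 42
       try_to_int('4x2') AS bad;    -- NULL
```

A block with an EXCEPTION clause is more expensive than a plain cast, so a helper like this fits staging and cleansing queries better than hot paths.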

5. Verification. A migration is only done once you've proven that nothing was lost or silently corrupted: row reconciliation per table, spot-check comparisons, data-quality checks after the load. How to check that systematically is shown in Verifying the Migration — Data Quality and Row Reconciliation.
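
A row reconciliation can start as simply as comparing counts. The sketch below assumes a hypothetical helper table migration_source_counts in the target database that holds the row counts measured on the source; the inserted number is a placeholder, not a real result.

```sql
-- Hypothetical helper table with the counts taken on the SQL Server side
CREATE TABLE migration_source_counts
(
    table_name  text    PRIMARY KEY
   ,row_count   bigint  NOT NULL
);

-- Placeholder value: in practice the result of SELECT count(*) FROM dbo.customer
INSERT INTO migration_source_counts (table_name, row_count)
VALUES ('customer', 1000);

-- Report only tables whose counts differ between source and target
SELECT s.table_name
      ,s.row_count               AS source_rows
      ,t.row_count               AS target_rows
      ,s.row_count - t.row_count AS missing_rows
FROM migration_source_counts AS s
JOIN (SELECT 'customer' AS table_name, count(*) AS row_count FROM customer) AS t
  ON t.table_name = s.table_name
WHERE s.row_count <> t.row_count;
```

Matching counts prove only that no rows went missing; silently altered values need the spot-check comparisons and data-quality checks on top.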

The phases build on one another but can be iterated — it's typical to run types and schema together on a pilot table before transferring the full load.

One Table, Several Phases at Once

To make the phase path tangible, a small table that bundles several trip-ups at once. First the source in T-SQL:

  1: CREATE TABLE dbo.customer
  2: (
  3:     customer_id   int             IDENTITY(1, 1)  NOT NULL
  4:    ,full_name     nvarchar(100)   NOT NULL
  5:    ,is_active     bit             NOT NULL  DEFAULT 1
  6:    ,created_at    datetime        NOT NULL  DEFAULT GETDATE()
  7:    ,CONSTRAINT pk_customer PRIMARY KEY (customer_id)
  8: );

And the same as a PostgreSQL target:

  1: CREATE TABLE customer
  2: (
  3:     customer_id   integer       GENERATED BY DEFAULT AS IDENTITY
  4:    ,full_name     text          NOT NULL
  5:    ,is_active     boolean       NOT NULL  DEFAULT true
  6:    ,created_at    timestamp     NOT NULL  DEFAULT now()
  7:    ,CONSTRAINT pk_customer PRIMARY KEY (customer_id)
  8: );

Four columns, and already two phases interlock — data types and schema — in a table nobody would call migration-critical. Three of the changes you see coming:

Line 3 (schema): int IDENTITY(1, 1) → integer GENERATED BY DEFAULT AS IDENTITY — the auto-value column moves to the SQL-standard mechanism. The real work comes after the load: the sequence must be advanced to the highest value, or the next INSERT collides.
Line 5 (data type): bit → boolean, 1 becomes true. Beware: ported queries like WHERE is_active = 1 fail in Postgres, because a boolean cannot be compared with an integer without an explicit cast.
Line 6 (data type): datetime → timestamp. The trap isn't now() instead of GETDATE(), but that datetime carries no time zone: whether timestamp or timestamptz is right depends on whether the values were meant as local time or UTC — if you don't decide deliberately, it shifts silently later.
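
The follow-up work for lines 3 and 5 can be sketched as follows, assuming the customer table from above has been loaded with its original customer_id values:

```sql
-- Line 3: move the identity sequence past the highest loaded value.
-- For an empty table the sequence is reset so the next value is 1.
SELECT setval(
           pg_get_serial_sequence('customer', 'customer_id')
          ,COALESCE(max(customer_id), 1)
          ,max(customer_id) IS NOT NULL
       )
FROM customer;

-- Line 5: filter on the boolean itself
SELECT customer_id
FROM customer
WHERE is_active;            -- or: WHERE is_active = true
-- WHERE is_active = 1      -- raises: operator does not exist: boolean = integer
```

Running the setval statement once per identity column right after the load is cheap and removes the collision on the first new INSERT.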

The fourth one slips by when you skim — and is the most likely to go unnoticed:

Line 4 (data type): nvarchar(100) → text. "Postgres is natively Unicode, so text" is true — but it hides that the length limit disappears. This is exactly what an auto-converter like pgloader reaches for by default: nvarchar(100) becomes text, the limit drops without being asked. If the 100 was just technical baggage, text is the right choice. If it was a business rule — the database was not allowed to accept anything longer — a silent validation has been lost, and what belongs there is a varchar(100) or a CHECK constraint rather than bare text. The tool makes the translation automatically — but whether the limit was intended by the domain, it cannot know: that question isn't in the type, it's in the domain.
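
If the limit of 100 turns out to be a business rule, either of the two alternatives named above restores it on the customer table. A sketch; the constraint name is an assumption chosen for the example:

```sql
-- Option A: keep the limit in the type
ALTER TABLE customer
    ALTER COLUMN full_name TYPE varchar(100);

-- Option B: keep text and state the rule as a named constraint
ALTER TABLE customer
    ADD CONSTRAINT ck_customer_full_name_length
    CHECK (char_length(full_name) <= 100);
```

Both statements fail if existing rows already exceed the limit, which is itself a useful check after the load. Note that Postgres counts characters here, while nvarchar(100) in SQL Server counts UTF-16 code units, so the two limits are not strictly identical for text with supplementary characters.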

That's the whole guide in one table: four lines that look like search-and-replace carry two phases and at least one trap you only see if you know the data by its meaning — not just its type.

What tooling takes off your plate — and where handwork remains

The honest expectation-setting first: there is no "one click and done". Tools like pgloader (free, takes schema and data over in a single run) or commercial converters handle the mechanical bulk — and that's a lot: the uncritical types, the bulk transfer, the standard constraints. As a rule of thumb they cover around 80% of the mechanics.

The remaining 20% are exactly the spots where a business decision is needed that no tool can know:

Breaking types — datetime → timestamp or timestamptz? money → which numeric scale? tinyint with a preserving CHECK constraint? These cases need checking, not blind adoption.
Procedure and trigger logic — T-SQL to PL/pgSQL is translation work, not search-and-replace.
Performance tuning — indexes, statistics and query plans differ; what was fast in SQL Server may need a different index in Postgres.
Behavioural differences — collation/case, NULL ordering, transaction semantics on errors.
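
Several of these decisions end up as explicit DDL or explicit query clauses. A sketch on a hypothetical invoice_demo table; the column names and the choice of timestamptz are assumptions for the illustration:

```sql
CREATE TABLE invoice_demo
(
    invoice_id  integer         GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
   ,amount      numeric(19, 4)  NOT NULL   -- money in SQL Server has four decimal places
   ,priority    smallint        NOT NULL
                CHECK (priority BETWEEN 0 AND 255)   -- preserves the tinyint range
   ,paid_at     timestamptz     NULL       -- deliberate choice: values are points in time
);

-- SQL Server sorts NULLs first in ascending order, Postgres sorts them last.
-- Ported queries that rely on the old order need to say so explicitly.
SELECT invoice_id, paid_at
FROM invoice_demo
ORDER BY paid_at ASC NULLS FIRST;
```

None of these lines is hard to write; the work lies in knowing that the decision exists and making it per column.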

The art of migration lies not in transferring the easy 80% but in the clean, verified translation of the hard 20%. This cluster devotes a dedicated article to each of these spots.

FAQ

How long does a SQL-Server-to-Postgres migration take?
It depends on schema complexity and the amount of code, not primarily on data volume. The data transfer itself is often done in hours; the time goes into code porting (stored procedures, triggers) and verification. A simple database with little logic is doable in days, one with hundreds of procedures in weeks to months.

Can I automate this completely?
No. Tools like pgloader take the mechanical ~80% off your plate — the uncritical types and the bulk transfer. The last 20% (breaking types, procedure logic, triggers, performance tuning) need human decisions. Expecting "one click and done" builds in silent errors.

What happens to my stored procedures?
They have to be ported from T-SQL to PL/pgSQL — that's the most demanding phase. Error handling (TRY/CATCH → EXCEPTION), variable syntax and many idioms differ. There's no direct translation tool; the dedicated article on code porting shows the patterns.
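
The basic shape of the TRY/CATCH → EXCEPTION translation, sketched as an anonymous block against the customer table from the worked example. The inserted values are made up for the illustration:

```sql
DO $$
BEGIN
    INSERT INTO customer (customer_id, full_name) VALUES (1, 'Ada');
    INSERT INTO customer (customer_id, full_name) VALUES (1, 'Duplicate');  -- violates pk_customer
EXCEPTION
    WHEN unique_violation THEN
        -- SQLERRM holds the error message, comparable to ERROR_MESSAGE() in T-SQL
        RAISE NOTICE 'insert skipped: %', SQLERRM;
END;
$$;
```

One behavioural difference worth knowing when porting: when the EXCEPTION clause catches the error, all database changes made inside that block are rolled back, so the first INSERT does not survive either.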

Do I need downtime?
For the simplest variant (export → transfer → switch over), yes — the source is ideally read-only during the transfer so no changes are lost. Low-downtime strategies (logical replication, gradual cutover) are possible but considerably more involved and a topic of their own.

Should I migrate gradually or as a big bang?
For most solo and mid-size scenarios, a big-bang cutover in a quiet maintenance window is pragmatic. Gradual migration (both systems in parallel) pays off for large, continuously available systems — but at the cost of considerable synchronisation complexity.

Does this also work for Azure SQL or other source databases?
The phases (types → schema → transfer → code → verification) apply generally, and Azure SQL is closely related to SQL Server — much of it carries over. This cluster, though, is specifically tailored to SQL Server → PostgreSQL; Oracle or MySQL as the source bring different type and dialect traps.

Related articles

This guide is the overview of the cluster on migrating from SQL Server to PostgreSQL. Each phase has its own detailed article:

Data types: Data Type Mapping SQL Server → PostgreSQL — what converts cleanly and what breaks
Schema & DDL: Schema Migration SQL Server → PostgreSQL — Identity, Constraints, Defaults, Sequences
Data transfer: Transferring Data: bcp, COPY, pgloader, ETL — Which Method When
Code porting: Porting T-SQL to PL/pgSQL — Migrating Procedures and Functions
Verification: Verifying the Migration — Data Quality and Row Reconciliation

For more on adjacent topics:

Postgres Table Conventions — what the target schema looks like idiomatically
ETL vs. ELT — Explained — transfer patterns and tool choice
Data Quality Checks with SQL — the checking framework for the verification phase
Data Quality // Fundamentals of Type Conversion with T-SQL — background on the conversion pitfalls
