venkatesh m

Posted on Mar 1 • Edited on Mar 6

130 Shades of Gray: Building a Design Token Pipeline That Killed Our Color Chaos

#designsystems #css #typescript #frontend

The Codebase That Had 130 Grays

I spent four years working in a production codebase that had 130 gray variables. I actually counted them once.

Someone needs a border color and creates $gray-v1. Someone else needs a slightly darker background, doesn't find the right gray in the file (or can't tell which of 130 is correct), and creates $gray-v2. Four years of that across 5 product modules and this is what you get:

$gray-v12: #6B7280;
$gray-v13: #6E7581;
$gray-v14: #707682;
$gray-v38: #4A5163;
$gray-v39: #4B5264;
$gray-v47: #5C6370;
$gray-v89: #3D4450;

Seven variables. Can you tell them apart by hex value? We couldn't either. $gray-v38 and $gray-v39 differ by a single step in each color channel. They were used on different screens for the same purpose: disabled text. Created six months apart by two different developers who had no way of knowing the other one existed.

And the grays weren't even the worst part — the utility classes were.

.FontSm14GrayV52 {
  font-size: 14px;
  color: $gray-v52;
}

.FontRg16GrayV23 {
  font-size: 16px;
  color: $gray-v23;
}

.BgGrayV7Pd12 {
  background-color: $gray-v7;
  padding: 12px;
}

Font size, color, and gray version baked into one class name. Need the same gray at a different font size? New class. Same font size, different gray? Another class.

Now imagine a designer updates the disabled text color from #6B7280 to #8B95A5. Go find every gray variable that was being used as disabled text. Not labeled as disabled text — just happening to be that hex value, or one digit away from it, scattered across 130 numbered variables and 200+ utility classes. Good luck.

Nobody built this mess on purpose. There was no architecture to prevent it. What was missing wasn't discipline — it was a system.

Why This Happens

The instinct is to blame developers. Better naming conventions. Stricter code review. More discipline.

We tried all of that. It doesn't work — not at scale, not over time.

Here's what actually happens. A developer needs a color for disabled text. They open the variables file, see 80 grays with names like $gray-v52 and $gray-v61, and have no way to know which one is "the disabled text gray." Nothing in the name tells them. Nothing in the system maps intent to value. So they pick one that looks close enough, or they create a new one. Both choices are rational. Both make the problem worse.

Code review doesn't catch this because the reviewer is looking at the same meaningless variable names. They can verify that a gray was used. They can't verify it was the right gray. There's no source of truth to check against.

Naming conventions help for about six months. Then the team grows, the conventions drift, someone joins who never read the doc, and you're back to $gray-v131.

The problem isn't that developers make bad choices. It's that the system they're working in makes the right choice invisible. You can't consistently pick the correct gray if nothing in the codebase defines what "correct" means.

The Three-Tier Model

Take one gray: #6B7280. In the old system, that hex value could live in $gray-v12 or $gray-v47 or a utility class or hardcoded inline. No meaning attached, no intent captured. Just a color floating in a file somewhere.

In a token system, that same gray passes through three layers before it ever reaches a component.

Tier 1 — Global tokens. These are the raw palette. Every color, every spacing value, every font size your system supports. Named by what they are, not what they do.

{
  "color": {
    "gray": {
      "500": { "$value": "#6B7280" },
      "600": { "$value": "#4B5563" },
      "700": { "$value": "#374151" }
    }
  }
}

color.gray.500 is a fact. It's a gray. It's the 500 weight in your scale. It says nothing about where it goes or why it exists. That's intentional — global tokens are the raw material, not the decisions.

Tier 2 — Semantic tokens. This is where intent enters the system. Semantic tokens don't define colors — they define purposes and point to a global token.

{
  "color": {
    "text": {
      "primary": { "$value": "{color.gray.700}" },
      "secondary": { "$value": "{color.gray.500}" },
      "disabled": { "$value": "{color.gray.500}" }
    },
    "border": {
      "default": { "$value": "{color.gray.500}" }
    }
  }
}

Now #6B7280 has meaning. It's color.text.secondary. It's color.text.disabled. It's color.border.default. Three different purposes, same underlying value — and that relationship is explicit, not accidental.

Remember the designer changing the disabled text color? In the old system, that was an archaeology project across 130 variables. Here, you update color.text.disabled to point to color.gray.600 instead of color.gray.500. One change. Every disabled text element in the product updates. Nothing else is affected.

Tier 3 — Component tokens. These bind semantic tokens to specific component surfaces.

{
  "button": {
    "text": {
      "disabled": { "$value": "{color.text.disabled}" }
    }
  },
  "input": {
    "text": {
      "disabled": { "$value": "{color.text.disabled}" }
    }
  }
}

A developer building a button never picks a gray. They use --fw-button-text-disabled. They don't need to know it resolves to color.text.disabled, which resolves to color.gray.500, which resolves to #6B7280. The chain exists, but the developer at the end of it just sees a name that tells them exactly what it's for.

This is what it looks like when the system makes the right choice the only choice. There's no wrong gray to pick because developers aren't picking grays. They're picking intentions — --fw-button-text-disabled, --fw-input-border-default, --fw-card-bg-primary — and the token pipeline resolves those intentions to actual values.
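To make the chain concrete, here is a minimal sketch that walks one component token back to its hex value and records every hop. The flat map and the traceChain helper are illustrative assumptions, not part of the flintwork pipeline, and the helper only understands values that are a single whole reference.

```typescript
// Hypothetical flat map of token path -> raw value, mirroring the JSON above.
const tokens = new Map<string, string>([
  ['color.gray.500', '#6B7280'],
  ['color.text.disabled', '{color.gray.500}'],
  ['button.text.disabled', '{color.text.disabled}'],
]);

// Follow whole-value references like "{a.b.c}" and record every hop.
function traceChain(start: string, map: Map<string, string>): string[] {
  const hops: string[] = [start];
  let current = map.get(start);
  // The hop limit is an arbitrary safety net for this sketch.
  while (current !== undefined && hops.length < 20) {
    // Record the referenced path (braces stripped) or the final literal.
    hops.push(current.replace(/^\{|\}$/g, ''));
    const ref = /^\{([^}]+)\}$/.exec(current)?.[1];
    current = ref !== undefined ? map.get(ref) : undefined;
  }
  return hops;
}

console.log(traceChain('button.text.disabled', tokens).join(' -> '));
// button.text.disabled -> color.text.disabled -> color.gray.500 -> #6B7280
```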

And here's where theming comes in for free. The entire resolution chain above is your light theme. Your dark theme is a second set of semantic mappings pointing to different global tokens:

{
  "color": {
    "text": {
      "disabled": { "$value": "{color.gray.400}" }
    }
  }
}

Same component tokens. Same developer API. Different underlying values. Swap a single attribute on the root element:

<div data-theme="dark">
  <!-- every component re-resolves automatically -->
</div>

No separate stylesheet. No conditional logic in components. The architecture handles it.
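One way to picture "a second set of semantic mappings" is to resolve the same component token against two different semantic layers. The sketch below assumes color.gray.400 is #9CA3AF (consistent with the dark values in the generated CSS) and uses a simplified resolve helper that only handles whole-value references; both are assumptions for illustration.

```typescript
type TokenMap = Map<string, string>;

// Hypothetical token data for each tier.
const globals: TokenMap = new Map([
  ['color.gray.400', '#9CA3AF'],
  ['color.gray.500', '#6B7280'],
]);
const lightSemantic: TokenMap = new Map([['color.text.disabled', '{color.gray.500}']]);
const darkSemantic: TokenMap = new Map([['color.text.disabled', '{color.gray.400}']]);
const component: TokenMap = new Map([['button.text.disabled', '{color.text.disabled}']]);

// Resolve a whole-value reference by searching the maps in order.
function resolve(value: string, lookups: TokenMap[], depth = 0): string {
  const ref = /^\{([^}]+)\}$/.exec(value)?.[1];
  if (ref === undefined) return value; // literal value, nothing to follow
  if (depth > 10) throw new Error(`Reference chain too deep at {${ref}}`);
  for (const lookup of lookups) {
    const next = lookup.get(ref);
    if (next !== undefined) return resolve(next, lookups, depth + 1);
  }
  throw new Error(`Unresolved token reference: {${ref}}`);
}

// Same component token, two semantic layers, two results.
const light = resolve('{button.text.disabled}', [component, lightSemantic, globals]);
const dark = resolve('{button.text.disabled}', [component, darkSemantic, globals]);
console.log(light, dark); // #6B7280 #9CA3AF
```

The component map never changes between themes; only the semantic layer passed to the resolver does.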

The Build

The full pipeline is in the flintwork repo if you want every line. Here I'll walk through the parts that do the actual work.

The input is a folder of JSON files organized by tier — global/ for the raw palette, semantic/ for theme mappings, component/ for component-level bindings. The output is CSS files with custom properties scoped to data-theme attributes. One build script, no Style Dictionary, no runtime dependencies.

Flattening nested tokens.

Token files are nested JSON following the W3C Design Tokens format. A token has a $value — everything else is a group container. The first step is flattening that tree into a flat map of dot-notation paths to values:

function flattenTokens(
  group: TokenGroup,
  prefix: string = ''
): FlatTokenMap {
  const map: FlatTokenMap = new Map();

  for (const [key, value] of Object.entries(group)) {
    if (key.startsWith('$')) continue; // skip metadata

    const path = prefix ? `${prefix}.${key}` : key;

    if (isTokenValue(value)) {
      map.set(path, value.$value);
    } else if (typeof value === 'object' && value !== null) {
      const nested = flattenTokens(value as TokenGroup, path);
      for (const [nestedPath, nestedValue] of nested) {
        map.set(nestedPath, nestedValue);
      }
    }
  }

  return map;
}

{ color: { gray: { 500: { $value: "#6B7280" } } } } becomes Map { "color.gray.500" => "#6B7280" }. Keys starting with $ are W3C metadata ($type, $description) and get skipped — they're documentation, not output.
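The snippet above leans on TokenGroup, FlatTokenMap, and isTokenValue without showing them. The definitions below are one plausible shape, written as an assumption so the function can be read on its own; the real ones in the repo may differ.

```typescript
// Assumed shapes; the article does not show these definitions.
type TokenRawValue = string | number | string[];

interface TokenValue {
  $value: TokenRawValue;
  $type?: string;
  $description?: string;
}

// A group maps keys to nested groups or tokens, plus optional
// $-prefixed metadata strings at the group level.
interface TokenGroup {
  [key: string]: TokenGroup | TokenValue | string | undefined;
}

type FlatTokenMap = Map<string, TokenRawValue>;

// Type guard: treat any object that carries a $value key as a token.
function isTokenValue(value: unknown): value is TokenValue {
  return typeof value === 'object' && value !== null && '$value' in value;
}

console.log(isTokenValue({ $value: '#6B7280' })); // true
console.log(isTokenValue({ gray: {} })); // false
```

Note that under this guard a misspelled key such as $valeu makes the object look like a group, so the token quietly drops out of the flat map.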

Resolving references.

The core problem: tokens point to other tokens. {color.text.disabled} points to {color.gray.500} which points to #6B7280. The resolver follows those chains using regex replacement, so a single value can contain multiple references:

function resolveValue(
  value: string | number | string[],
  lookups: FlatTokenMap[],
  depth: number = 0
): string {
  if (Array.isArray(value)) return value.join(', ');
  if (typeof value === 'number') return String(value);
  if (depth > 10) {
    throw new Error(
      `Circular reference detected while resolving: ${value}`
    );
  }
  if (!value.includes('{')) return value;

  return value.replace(/\{([^}]+)\}/g, (_, refPath: string) => {
    for (const lookup of lookups) {
      const resolved = lookup.get(refPath);
      if (resolved !== undefined) {
        return resolveValue(resolved, lookups, depth + 1);
      }
    }
    throw new Error(`Unresolved token reference: {${refPath}}`);
  });
}

The function handles three value types — arrays for font family stacks that get joined with commas, numbers that pass through as strings, and string values that might contain references. The depth guard at 10 catches circular references instead of blowing the call stack. The throw on unresolved references is deliberate — I want the build to fail loudly if a token points to nothing. A silent fallback means a missing color in production that nobody catches until a user reports it. Fail at build time, not at render time.
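A depth counter is cheap, but it can't say which tokens form the loop. As an alternative to what the pipeline does, the sketch below carries the chain of paths being resolved and reports the exact cycle. resolveTracked and the sample tokens (including border.default) are hypothetical, and this variant handles string values only.

```typescript
type FlatMap = Map<string, string>;

// Variant resolver: tracks the paths currently being resolved so a
// repeated path can be reported together with the loop that led to it.
function resolveTracked(value: string, lookups: FlatMap[], chain: string[] = []): string {
  return value.replace(/\{([^}]+)\}/g, (_match: string, refPath: string) => {
    if (chain.includes(refPath)) {
      throw new Error(`Circular reference: ${[...chain, refPath].join(' -> ')}`);
    }
    for (const lookup of lookups) {
      const found = lookup.get(refPath);
      if (found !== undefined) {
        return resolveTracked(found, lookups, [...chain, refPath]);
      }
    }
    throw new Error(`Unresolved token reference: {${refPath}}`);
  });
}

// Hypothetical tokens: one healthy chain and one deliberate loop.
const sample: FlatMap = new Map([
  ['color.gray.500', '#6B7280'],
  ['color.border.default', '{color.gray.500}'],
  ['border.default', '1px solid {color.border.default}'],
  ['loop.a', '{loop.b}'],
  ['loop.b', '{loop.a}'],
]);

console.log(resolveTracked('{border.default}', [sample])); // 1px solid #6B7280
try {
  resolveTracked('{loop.a}', [sample]);
} catch (err) {
  console.log((err as Error).message); // Circular reference: loop.a -> loop.b -> loop.a
}
```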

The lookups parameter is an array of flat token maps, searched in order. The build passes [componentTokens, semanticTokens, typographyTokens, globalTokens], so component-level tokens are searched first, followed by semantic, typography, and finally global tokens.

The order encodes the architecture.
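The order only matters when the same path exists in more than one map. A small sketch of that first-match-wins behavior, using a hypothetical collision between two tiers (the firstMatch helper and the data are assumptions for illustration):

```typescript
type Lookup = Map<string, string>;

// Return the first hit, searching the maps in the order given.
function firstMatch(path: string, lookups: Lookup[]): string | undefined {
  for (const lookup of lookups) {
    const hit = lookup.get(path);
    if (hit !== undefined) return hit;
  }
  return undefined;
}

// Hypothetical collision: the same path defined in two tiers.
const componentTier: Lookup = new Map([['color.text.disabled', '{color.gray.600}']]);
const semanticTier: Lookup = new Map([['color.text.disabled', '{color.gray.500}']]);

console.log(firstMatch('color.text.disabled', [componentTier, semanticTier])); // {color.gray.600}
console.log(firstMatch('color.text.disabled', [semanticTier, componentTier])); // {color.gray.500}
```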

CSS output.

Token paths convert to custom properties with a --fw- prefix to avoid collisions in consumer apps. color.text.primary becomes --fw-color-text-primary. The theme scoping is straightforward — light tokens go under :root, dark tokens go under [data-theme="dark"]:

:root {
  --fw-color-text-primary: #374151;
  --fw-color-text-disabled: #6B7280;
  --fw-button-text-disabled: #6B7280;
}

[data-theme="dark"] {
  --fw-color-text-primary: #F9FAFB;
  --fw-color-text-disabled: #9CA3AF;
  --fw-button-text-disabled: #9CA3AF;
}

The pipeline generates a combined tokens.css plus separate light.css and dark.css for consumers who only need one theme. Three files, 213 tokens resolved, light and dark themes ready. The whole build runs in about 40ms.
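The output step can be sketched in a few lines: turn each dot path into a prefixed custom property and wrap the declarations in a selector. toCustomProperty and renderBlock are illustrative names, not the functions in the repo, and the token values are copied from the CSS above.

```typescript
type Resolved = Map<string, string>;

// color.text.primary -> --fw-color-text-primary
function toCustomProperty(path: string, prefix = 'fw'): string {
  return `--${prefix}-${path.split('.').join('-')}`;
}

// Render one rule block for a selector from already-resolved tokens.
function renderBlock(selector: string, resolved: Resolved): string {
  const lines = Array.from(resolved.entries()).map(
    ([path, value]) => `  ${toCustomProperty(path)}: ${value};`
  );
  return `${selector} {\n${lines.join('\n')}\n}`;
}

const lightTokens: Resolved = new Map([
  ['color.text.primary', '#374151'],
  ['color.text.disabled', '#6B7280'],
]);
const darkTokens: Resolved = new Map([
  ['color.text.primary', '#F9FAFB'],
  ['color.text.disabled', '#9CA3AF'],
]);

// Light values go under :root, dark values under the attribute selector.
const stylesheet = [
  renderBlock(':root', lightTokens),
  renderBlock('[data-theme="dark"]', darkTokens),
].join('\n\n');
console.log(stylesheet);
```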

Theme Switching in Practice

The old codebase never had dark mode. Now I understand why — imagine adding it:

.card {
  background: $gray-v3;
  color: $gray-v71;
}

.dark-mode .card {
  background: $gray-v88;
  color: $gray-v14;
}

That's one component. Now multiply it across every surface, every text style, every border and background in the product. Each one needs a .dark-mode override block. Each one requires someone to pick the right gray from a list of 130. Nobody was willing to start that project. I don't blame them.

With tokens, dark mode already exists. Not because I built it separately — because the architecture makes it free:

.card {
  background: var(--fw-color-surface-primary);
  color: var(--fw-color-text-primary);
}

No override block. The component references intentions, not values. Switch one attribute on the root:

<div data-theme="dark">
  <!-- every token re-resolves automatically -->
</div>

Same component code. Same custom property names. Different values underneath. No stylesheet swap, no JavaScript toggling classes on individual elements, no new CSS files.

Dark mode went from "a project nobody wanted to touch" to a single attribute change. That's the difference between having a system and not having one.
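For completeness, the attribute flip itself is a couple of lines of script. This is a minimal sketch with assumed names (ThemeRoot, toggleTheme); it takes anything with getAttribute and setAttribute so it can run outside a browser, where you would pass document.documentElement instead.

```typescript
type Theme = 'light' | 'dark';

// Minimal slice of the DOM Element API that the toggle needs.
interface ThemeRoot {
  getAttribute(attr: string): string | null;
  setAttribute(attr: string, value: string): void;
}

// Flip data-theme between light and dark and return the new theme.
// Any value other than "dark" (including a missing attribute) counts as light.
function toggleTheme(root: ThemeRoot): Theme {
  const next: Theme = root.getAttribute('data-theme') === 'dark' ? 'light' : 'dark';
  root.setAttribute('data-theme', next);
  return next;
}

// In-memory stand-in for an element, so the sketch runs without a browser.
const attrs = new Map<string, string>();
const fakeElement: ThemeRoot = {
  getAttribute: (attr) => attrs.get(attr) ?? null,
  setAttribute: (attr, value) => {
    attrs.set(attr, value);
  },
};

console.log(toggleTheme(fakeElement)); // dark
console.log(toggleTheme(fakeElement)); // light
```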

What I'd Do Differently

If I rebuilt this pipeline tomorrow, two things would change immediately.

Token validation before resolution. Right now a typo in a JSON file passes silently. $valeu instead of $value, a hex code like #6B728 with five digits — the pipeline won't catch it until something downstream breaks or the CSS output looks wrong. For a solo project with 213 tokens, I can eyeball the output. For a team where five people are editing token files in the same sprint, this is the first thing that breaks. A JSON schema validation step before resolution even starts — checking that every token has a valid $value, that hex codes are well-formed, that references point to paths that actually exist — would catch these at the source instead of at the symptom.
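A rough sketch of what that pre-resolution pass could look like. It uses hand-written checks rather than an actual JSON Schema, and the list of accepted $-keys is an assumption you would adjust to the parts of the format you use. collectPaths and validateTokens are hypothetical helpers.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

const HEX = /^#(?:[0-9a-fA-F]{3,4}|[0-9a-fA-F]{6}|[0-9a-fA-F]{8})$/;
// Assumed set of accepted $-prefixed keys.
const KNOWN_META = new Set(['$value', '$type', '$description', '$extensions']);

// First pass: every path that ends in an object with a $value key.
function collectPaths(tree: Json, path = '', out = new Set<string>()): Set<string> {
  if (typeof tree !== 'object' || tree === null || Array.isArray(tree)) return out;
  if ('$value' in tree) {
    out.add(path);
    return out;
  }
  for (const [key, child] of Object.entries(tree)) {
    if (!key.startsWith('$')) collectPaths(child, path ? `${path}.${key}` : key, out);
  }
  return out;
}

// Second pass: collect problems instead of throwing, so one run reports all of them.
function validateTokens(tree: Json, known: Set<string>, path = '', issues: string[] = []): string[] {
  if (typeof tree !== 'object' || tree === null || Array.isArray(tree)) return issues;
  for (const [key, child] of Object.entries(tree)) {
    const here = path ? `${path}.${key}` : key;
    if (!key.startsWith('$')) {
      validateTokens(child, known, here, issues);
      continue;
    }
    if (!KNOWN_META.has(key)) issues.push(`${here}: unknown key (typo?)`);
    if (key === '$value' && typeof child === 'string') {
      if (child.startsWith('#') && !HEX.test(child)) {
        issues.push(`${path}: malformed hex ${child}`);
      }
      for (const ref of child.match(/\{[^}]+\}/g) || []) {
        const refPath = ref.slice(1, -1);
        if (!known.has(refPath)) issues.push(`${path}: unknown reference {${refPath}}`);
      }
    }
  }
  return issues;
}

const sampleTree: Json = {
  color: {
    gray: { '500': { $value: '#6B728' } }, // five hex digits
    text: {
      disabled: { $valeu: '{color.gray.500}' }, // misspelled key
      secondary: { $value: '{color.gray.50}' }, // dangling reference
    },
  },
};

const issues = validateTokens(sampleTree, collectPaths(sampleTree));
issues.forEach((issue) => console.log(issue));
// color.gray.500: malformed hex #6B728
// color.text.disabled.$valeu: unknown key (typo?)
// color.text.secondary: unknown reference {color.gray.50}
```

Failing the build when the returned list is non-empty would keep the same fail-loudly behavior as the resolver.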

Flat values vs. var() chains. The pipeline resolves every reference to its final value. --fw-button-text-disabled outputs #6B7280, not var(--fw-color-text-disabled). I chose this because flat values are easier to debug — you inspect an element and see the actual color, not a chain of three custom property references. But the tradeoff is real. If a semantic token changes, I rebuild and every downstream value updates. A var() chain would let that cascade at runtime without a rebuild. For a shipped design system with consumers who import your CSS, the var() approach is probably the right call. I went with flat values because I was optimizing for "can I see what's happening in DevTools" during development. I'd revisit that decision before publishing to npm.
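Switching to var() chains would mostly mean rewriting references instead of resolving them. A minimal sketch, assuming the same --fw- prefix and path-to-property convention as the rest of the pipeline (toVarReference is a hypothetical helper, and color.border.default is used only as an example path):

```typescript
// Rewrite "{color.text.disabled}" as "var(--fw-color-text-disabled)"
// instead of replacing it with the final value.
function toVarReference(value: string, prefix = 'fw'): string {
  return value.replace(
    /\{([^}]+)\}/g,
    (_match: string, refPath: string) => `var(--${prefix}-${refPath.split('.').join('-')})`
  );
}

console.log(toVarReference('{color.text.disabled}')); // var(--fw-color-text-disabled)
console.log(toVarReference('1px solid {color.border.default}')); // 1px solid var(--fw-color-border-default)
console.log(toVarReference('#6B7280')); // #6B7280
```

One thing to check before adopting this: a custom property that references another through var() is resolved on the element where it is declared. If data-theme sits on a descendant rather than the root element, component-level declarations made under :root would likely need to be repeated in the dark scope to pick up the overridden semantic values.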

Both of these are solvable. Neither requires rearchitecting anything; they're additive improvements to a pipeline that already works. The gaps are at the edges, not the foundation.

I built this pipeline for flintwork, not for that codebase. But everything I learned about what breaks — and why — came from four years of living inside a system without one. That's the thing about building a system from scratch: you have to know exactly what the absence of one costs.

The second article in this series — on building the headless primitives that consume these tokens — is now live.