Building SW: Lessons from the Backend

Small warning: This post is about swearing in code.

For the last few months, I've been building SW - a backend service that ingests GitHub events, extracts human-written text, and detects tongue-in-cheek profanity. The idea is lighthearted analytics - not moderation - with consent pipelines baked in at the core. Along the way, I've fought with storage engines, schema design, NLP quirks, and the usual Go pitfalls. This post is about the journey - the problems, dead ends, and the fixes that got me here.

Why?

I was on GitHub a few months ago and saw a commit message that caught my attention:

"Fuck you, dependabot."

It got me thinking: how often do developers experience this kind of tooling rage? I dug into GH Archive a bit and the answer was pretty clear: it happens... a lot. And it's not just tooling rage - broken builds, language nuance, self-owns... And while there have been research papers investigating toxicity in programming, no one has dug into what I affectionately call Expletive-Driven Development.

I had questions:

  • Are there trends in how often people curse in commits?
  • Any kind of rhythm to it?
  • Is tooling_rage more pervasive than lang_rage or framework_rage?
  • Are some languages more likely to cause developers to drop F-bombs than others?

While the zeitgeist of the day is AI & LLMs, I wanted to do it my way: no fancy algos, no machine learning, and as close to stdlib as I could get (with some exceptions).


1. Data Volume and Storage

I started with Postgres because it was familiar, battle-tested, and offered strong guarantees. For a while it was fine. But as I scaled, a few issues became unavoidable:

  • Maintaining views: Materialized views were brittle and expensive to refresh. I was constantly running docker compose down and docker compose up to rebuild the db, and the views kept slowing me down.
  • Ingestion time: High-volume inserts triggered deadlocks and lock contention.
  • Nuances: VACUUM tuning, WAL bloat, index churn - the usual Postgres headaches at scale. Just because we're used to it doesn't mean we can't try different approaches.

Migration: I adopted a split model:

  • Postgres (v18 beta) -> The Control-plane. Perfect for ingest accounting, consent, and rulepack metadata. Postgres 18 features (like improved parallelism, UUIDv7) gave me better ergonomics without giving up safety.
  • ClickHouse -> High-volume facts. Utterances and hits live here, with pruning-friendly tables and cheap rollups. In practice, hits are the real workhorse - powering stats and samples - while utterances (events) serve as the canonical backing.

Example (ClickHouse hits schema, simplified):

CREATE TABLE hits (
  bucket_hour   DateTime,
  detver        UInt16,
  lang_code     LowCardinality(Nullable(String)),
  term_id       UInt64
  -- measure columns (e.g. hit counters) omitted for brevity
)
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(bucket_hour)
ORDER BY (bucket_hour, detver, lang_code, term_id);
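To give a feel for how the hits table gets used, here's a rough sketch of the kind of timeseries query the stats path runs. The database/sql wiring (and whatever ClickHouse driver is registered) is my assumption for illustration, not the project's actual data-access code.

import (
    "context"
    "database/sql"
    "time"
)

// hourlyHitCounts is an illustrative stats query against the hits table above.
// It assumes a ClickHouse driver has been registered with database/sql elsewhere.
func hourlyHitCounts(ctx context.Context, db *sql.DB, since time.Time) (map[time.Time]uint64, error) {
    const q = `
        SELECT bucket_hour, count() AS n
        FROM hits
        WHERE bucket_hour >= ?
        GROUP BY bucket_hour
        ORDER BY bucket_hour`

    rows, err := db.QueryContext(ctx, q, since)
    if err != nil {
        return nil, err
    }
    defer rows.Close()

    out := make(map[time.Time]uint64)
    for rows.Next() {
        var ts time.Time
        var n uint64
        if err := rows.Scan(&ts, &n); err != nil {
            return nil, err
        }
        out[ts] = n
    }
    return out, rows.Err()
}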

This balance - Postgres as the control plane, ClickHouse as the fact store - is where I finally stopped fighting the store.


2. Text Normalization & Detection Drift

Profanity isn't written straight. Developers get creative with leetspeak, homoglyphs, zero-width spaces. And if you over-normalize, you hit Scunthorpe-class false positives.

I built a normalizer pipeline:

func Normalize(s string) string {
    s = utf8Repair(s)        // replace invalid UTF-8 sequences
    s = nfkcFold(s)          // NFKC-fold full-width forms, ligatures, etc.
    s = caseFold(s)          // Unicode case folding
    s = stripZeroWidth(s)    // drop zero-width joiners/spaces
    s = leetMap(s)           // map common leetspeak substitutions
    return collapseSpaces(s) // squash runs of whitespace
}
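Two of those helpers are close to free. Here's a rough sketch of how nfkcFold and stripZeroWidth might look - golang.org/x/text handles the NFKC part (one of the "some exceptions" to stdlib) - though the real implementations differ in detail.

import (
    "strings"

    "golang.org/x/text/unicode/norm"
)

// nfkcFold applies NFKC normalization so compatibility characters
// (full-width forms, ligatures, etc.) collapse to their plain equivalents.
func nfkcFold(s string) string {
    return norm.NFKC.String(s)
}

// stripZeroWidth removes zero-width characters commonly used to sneak
// words past naive matchers.
func stripZeroWidth(s string) string {
    return strings.Map(func(r rune) rune {
        switch r {
        case '\u200B', '\u200C', '\u200D', '\uFEFF':
            return -1 // returning a negative rune drops the character
        }
        return r
    }, s)
}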

Then came detection drift. Early on, I had a flat rules.json. Over time, I introduced the rulepacker JSON format, with templates, lemmas, categories, and versioning baked in. Every hit carries a detector_version - so backfills can compare old vs new results without rewriting history.

{
  "version": 2,
  "templates": [
    {"pattern": "fuck you, {TARGET_BOT}", "category": "bot_rage", "severity": "strong"}
  ],
  "lemmas": ["fuck", "shit"]
}

This let me evolve detection rules without reprocessing 14 years of commits every time. Implementing tried-and-true algorithms like Aho-Corasick gave me a clear path to follow; I just needed to wire it up.
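To make the versioning concrete, here's an illustrative sketch of the matching path. The Hit and Matcher names, and the naive scan, are mine for illustration; the real detector compiles the rulepack into an Aho-Corasick automaton. The important part is that every hit is stamped with the detector version that produced it.

import "strings"

// Hit is one detected term in a normalized utterance.
type Hit struct {
    TermID          uint64
    DetectorVersion uint16 // stamped on every hit so backfills can compare rule versions
}

// Matcher finds rulepack terms in already-normalized text.
type Matcher interface {
    Find(normalized string) []Hit
}

// naiveMatcher stands in for the Aho-Corasick automaton: it scans for each
// lemma independently. Fine for a sketch, far too slow for 14 years of commits.
type naiveMatcher struct {
    version uint16
    lemmas  map[string]uint64 // lemma -> term_id
}

func (m naiveMatcher) Find(normalized string) []Hit {
    var hits []Hit
    for lemma, id := range m.lemmas {
        if strings.Contains(normalized, lemma) {
            hits = append(hits, Hit{TermID: id, DetectorVersion: m.version})
        }
    }
    return hits
}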


3. Language Handling

Language detection is fucking hard. And it sucks that GitHub doesn't provide language metadata on commit messages or issues.

I went through several stages:

  1. Custom hinter: A script-based detector (langhint) that counted Unicode ranges. Rough, but decisive for Japanese/Korean/Arabic.
  2. pg_cld2_detect_language: Hooked into Postgres triggers for per-utterance detection.
  3. ClickHouse beta NLP: Functions like detectLanguageMixed(text_raw) gave me speed and scale.

Snippet from our first hinter:

// DetectScriptAndLang returns a coarse script and a best-effort lang code.
func DetectScriptAndLang(s string) (script string, lang string) {
    // Count letters by Unicode block (Han, Hangul, Arabic, ...),
    // then map the dominant block to a probable language.
    // ...
    return script, lang
}
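For the curious, the spirit of that hinter is roughly the following - a minimal sketch, not the production langhint code:

import "unicode"

// hintLang counts letters per Unicode range and maps the dominant script to a
// coarse language code, returning "un" (unknown) when nothing is decisive.
func hintLang(s string) string {
    counts := map[string]int{}
    for _, r := range s {
        switch {
        case unicode.Is(unicode.Hiragana, r) || unicode.Is(unicode.Katakana, r):
            counts["ja"]++
        case unicode.Is(unicode.Hangul, r):
            counts["ko"]++
        case unicode.Is(unicode.Arabic, r):
            counts["ar"]++
        case unicode.Is(unicode.Han, r):
            counts["zh"]++ // Han alone is ambiguous (zh vs ja); the real hinter is fuzzier here
        }
    }
    best, bestN := "un", 0
    for lang, n := range counts {
        if n > bestN {
            best, bestN = lang, n
        }
    }
    return best
}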

Final setup: ClickHouse. While using experimental features isn't something I would recommend, I was already set on using the latest tech anyway (image: postgres:18beta3, GOEXPERIMENT=jsonv2), so at this point it's really just an "in for a penny, in for a pound" kind of thing.


4. Consent, Anonymity & Guardrails

Analytics without doxxing. That was the bar.

While I initially stored repo names & GitHub usernames (as well as GitHub event IDs), I dropped them completely. I turned GitHub user IDs and repo IDs into HIDs and forced any identity-related information into a dedicated Postgres schema... with strict rules to prevent accidental API leaks. When repo/user information is fetched, the consent boundary either writes the full information or an anonymized record. If an opt-out exists in the consent receipts, we just don't write anything into CH.

CREATE DOMAIN hid_bytes AS bytea CHECK (octet_length(VALUE) = 32);

-- Minimal FK targets
CREATE TABLE principals_repos  (repo_hid  hid_bytes PRIMARY KEY);
CREATE TABLE principals_actors (actor_hid hid_bytes PRIMARY KEY);

CREATE TABLE repositories (
  repo_hid        hid_bytes  PRIMARY KEY REFERENCES principals_repos(repo_hid) ON DELETE CASCADE,
  consent_id      uuid UNIQUE REFERENCES consent_receipts(consent_id) ON DELETE CASCADE,
  full_name       text,
  default_branch  text,
  primary_lang    text,
  languages       jsonb,
  stars           int,
  forks           int,
  subscribers     int,
  open_issues     int,
  license_key     text,
  -- ...
  gone_at         timestamptz,
  gone_code       int2,
  gone_reason     text
);
CREATE TABLE actors (
  actor_hid        hid_bytes  PRIMARY KEY REFERENCES principals_actors(actor_hid) ON DELETE CASCADE,
  consent_id       uuid UNIQUE REFERENCES consent_receipts(consent_id) ON DELETE CASCADE,
  login            text,
  name             text,
  type             text,
  company          text,
  location         text,
  bio              text,
  blog             text,
  twitter_username text,
  followers        int,
  following        int,
  public_repos     int,
  public_gists     int,
  -- ...
  gone_at          timestamptz,
  gone_code        int2,
  gone_reason      text
);

CREATE SCHEMA ident;

CREATE TABLE ident.gh_repo_map (
  repo_hid   hid_bytes PRIMARY KEY,
  gh_repo_id bigint      NOT NULL,
  seen_at    timestamptz NOT NULL DEFAULT now()
);

-- Consent gates and audited HID resolution (function bodies omitted for brevity):
CREATE OR REPLACE FUNCTION can_expose_repo(hid public.hid_bytes)
RETURNS boolean;
CREATE OR REPLACE FUNCTION can_expose_actor(hid public.hid_bytes)
RETURNS boolean;

CREATE OR REPLACE FUNCTION ident.resolve_repo
(p_hid hid_bytes, p_who text, p_purpose text)
RETURNS bigint
SECURITY DEFINER;

CREATE OR REPLACE FUNCTION ident.resolve_actor
(p_hid hid_bytes, p_who text, p_purpose text)
RETURNS bigint
SECURITY DEFINER;

I had to respect opt-outs and allow opt-ins with verifiable proof - without OAuth, tokens, or brand-specific files.

Guardrails v2:

  • Opt-out: Commit a zero-byte dotfile (.<hash>.txt) at repo root.
  • Opt-in: Create a public gist with the challenge hash.
  • Receipts: Postgres stores durable receipts.
  • Revocation: Delete the file or gist, state flips to revoked.

Enforcement:

  • Opt-outs override everything.
  • Opt-ins allow scoped de-masking (repo name or actor handle).
  • By default, everything stays anonymized, and slurs get masked.

This made consent auditable, cheap to verify, and ethically defensible. Repo/actor names would be hashed to allow API consumers to graph a particular identity over time, but if they say "Hey, I don't mind everyone knowing that I swear like a sailor" (like me), we'd expose their information.
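As a sketch of the opt-out side: checking for the dotfile only needs GitHub's public contents endpoint, no tokens. The helper below is illustrative (the challenge-hash derivation and Bouncer's actual wiring are out of scope here).

import (
    "context"
    "fmt"
    "net/http"
)

// repoHasOptOut reports whether the zero-byte opt-out dotfile exists at the
// repo root, via GitHub's public contents endpoint (no auth needed for public
// repos). How the challenge hash is derived is out of scope for this sketch.
func repoHasOptOut(ctx context.Context, owner, repo, hash string) (bool, error) {
    url := fmt.Sprintf("https://api.github.com/repos/%s/%s/contents/.%s.txt", owner, repo, hash)
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return false, err
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return false, err
    }
    defer resp.Body.Close()

    switch resp.StatusCode {
    case http.StatusOK:
        return true, nil // dotfile present -> opt-out active
    case http.StatusNotFound:
        return false, nil // no dotfile -> no opt-out on record
    default:
        return false, fmt.Errorf("unexpected status %d while checking opt-out", resp.StatusCode)
    }
}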


5. Backfill & Replays

Processing a decade and a half of GH Archive means retries are inevitable. Without careful design, retries lead to duplication or drift.

Our fixes:

  • Deterministic IDs: UUIDv7 + SHA-256. Replays are idempotent.
  • Leasing: Postgres coordinates hourly leases. A single worker claims an hour, runs it, marks it complete. No collisions, no double work. Running multiple workers at the same time is the default, and everything is configurable.
  • Telemetry: Every lease logs ingest stats (fetch_ms, read_ms, db_ms).

Deterministic IDs in practice:

sha := sha256.Sum256(rawJSON)  // content hash of the raw event JSON
id := uuidv7FromHash(sha[:])   // project helper: derive a stable UUIDv7 from the hash
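The lease claim itself is plain Postgres. A minimal sketch of what claiming an hour can look like, assuming a lease table with a status column - the table and column names here are illustrative, not the exact schema:

import (
    "context"
    "database/sql"
    "errors"
    "time"
)

// claimHour atomically claims one pending hour for this worker using
// FOR UPDATE SKIP LOCKED, so concurrent workers never grab the same hour.
// Table and column names are illustrative.
func claimHour(ctx context.Context, db *sql.DB, worker string) (time.Time, bool, error) {
    const q = `
        UPDATE ingest_hours
        SET status = 'leased', leased_by = $1, leased_at = now()
        WHERE bucket_hour = (
            SELECT bucket_hour FROM ingest_hours
            WHERE status = 'pending'
            ORDER BY bucket_hour
            LIMIT 1
            FOR UPDATE SKIP LOCKED
        )
        RETURNING bucket_hour`

    var hour time.Time
    err := db.QueryRowContext(ctx, q, worker).Scan(&hour)
    if errors.Is(err, sql.ErrNoRows) {
        return time.Time{}, false, nil // nothing left to claim
    }
    if err != nil {
        return time.Time{}, false, err
    }
    return hour, true, nil
}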

Every developer should understand this: idempotency is survival.


6. API & Service Design

I standardized on a ports-first architecture, but I also made deliberate choices that Go style purists would call violations:

  • I use http in package names instead of aliasing everywhere. I believe aliasing should be the responsibility of the importer, and I will die on that hill.
  • I use explicit service, repo, and module directories for every domain.

Ports strategy:

type StatsPort interface {
    Timeseries(ctx context.Context, in TimeseriesInput) (TimeseriesResp, error)
}

Modules only export ports. This killed cyclic deps, made testing easier, and kept module surfaces boring and predictable.
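Concretely, a module looks something like this. The package and type names below (statsmod, Repo, service) are illustrative - the point is the shape: the constructor returns the port, and the concrete type never leaves the package.

// Package statsmod exports only its port; the concrete service stays private.
package statsmod

import "context"

type TimeseriesInput struct{ /* ... */ }
type TimeseriesResp struct{ /* ... */ }

type StatsPort interface {
    Timeseries(ctx context.Context, in TimeseriesInput) (TimeseriesResp, error)
}

// Repo is the module's outbound port to the fact store (ClickHouse behind it).
type Repo interface {
    QueryTimeseries(ctx context.Context, in TimeseriesInput) (TimeseriesResp, error)
}

// New wires the module and hands back the port - callers never see *service.
func New(repo Repo) StatsPort {
    return &service{repo: repo}
}

type service struct{ repo Repo }

func (s *service) Timeseries(ctx context.Context, in TimeseriesInput) (TimeseriesResp, error) {
    return s.repo.QueryTimeseries(ctx, in)
}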

Wiring is done via Modkit - a package built specifically to alias my internal platform libraries and dependencies. It could be argued that this makes it harder to debug issues, and I will agree with that. I do, however, counter with "that's what tests are for". Developer experience can be improved with syntactic sugar - all I cared about was cleaner imports & predictability.

I also had a bit of fun with the service names. Bouncer is the service that handles consent API calls. Nightshift is my rollup & aggregate service. Let's be honest... It's a project about cussing. Who cares?!


7. Observability & Ops

Long backfills + heavy inserts = ops pain if you're blind. I wired observability in from the start:

  • Structured logs with per-op timings.
  • ingest_hours table in Postgres tracks hourly telemetry and drops.
  • ClickHouse queries for compression vs uncompressed sizes.
  • Nightshift jobs record metrics as they prune, archive, or reverify policy.

Example log:

2025-09-14T15:52:04Z INF ch op args="table=hits rows=1" elapsed_ms=2.663 sql="INSERT BULK"
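The pattern behind that line is just "time the op, log the fields". A minimal sketch with stdlib log/slog - the project's actual logger and field names differ:

import (
    "context"
    "log/slog"
    "time"
)

// timedOp runs fn and emits a structured line with the elapsed milliseconds,
// mirroring the elapsed_ms / sql fields in the example log above.
func timedOp(ctx context.Context, log *slog.Logger, sqlLabel string, fn func(context.Context) error) error {
    start := time.Now()
    err := fn(ctx)
    log.InfoContext(ctx, "ch op",
        slog.String("sql", sqlLabel),
        slog.Float64("elapsed_ms", float64(time.Since(start).Microseconds())/1000.0),
        slog.Bool("ok", err == nil),
    )
    return err
}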

Ops stopped being about babysitting jobs - I trust the metrics, and when something spikes, it's immediately visible.


8. Data retention policies

The Nightshift service was born out of necessity.

One of the biggest challenges with GH Archive is sheer volume. Public events have been flowing since 2011, adding up to billions of records - on the order of 2 to 4 billion utterances by 2025. In our ClickHouse schema that translates into anywhere from ~160 GB (best case) to over a terabyte (worst case) just to keep the raw utterances and detector hits online. Add a few more years of growth and the footprint balloons by tens to hundreds of gigabytes per year. Even at cloud-cheap disk prices, that's real money for cold facts we rarely query.

That's why Nightshift exists. Instead of hoarding every commit message and issue comment forever, we roll them up into summaries - hourly and daily counts, categories, top terms, and language breakdowns. Once that's written, we can safely nuke the raw utterances and hits for that window. The savings are enormous: dropping from hundreds of gigabytes per billion utterances down to a thin slice of summarized facts measured in a few gigabytes. We keep all the trend data we need for charts and exploration, while avoiding the operational nightmare of running a profanity time machine at terabyte scale.
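The rollup itself is deliberately boring: write the summary for a closed window, then drop the raw partition. A sketch of the shape of it - the hits_daily table and the Execer interface are illustrative, not the real Nightshift code:

import (
    "context"
    "fmt"
    "time"
)

// Execer is whatever can run a ClickHouse statement (driver details omitted).
type Execer interface {
    Exec(ctx context.Context, query string, args ...any) error
}

// rollupAndPrune summarizes one month of hits into a daily rollup table, then
// drops the raw partition - only after the summary has landed.
func rollupAndPrune(ctx context.Context, ch Execer, month time.Time) error {
    part := month.Year()*100 + int(month.Month()) // matches PARTITION BY toYYYYMM(bucket_hour)

    const rollup = `
        INSERT INTO hits_daily
        SELECT toDate(bucket_hour) AS day, detver, lang_code, term_id, count() AS n
        FROM hits
        WHERE toYYYYMM(bucket_hour) = ?
        GROUP BY day, detver, lang_code, term_id`
    if err := ch.Exec(ctx, rollup, part); err != nil {
        return fmt.Errorf("rollup %d: %w", part, err)
    }

    // The raw utterances/hits for that window are now safe to nuke.
    return ch.Exec(ctx, fmt.Sprintf("ALTER TABLE hits DROP PARTITION %d", part))
}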

The other reason for the configurable retention policy is that the project is (will be) open sourced. It should be explorable on a home laptop, and a few gigs is a lot easier to digest.

Takeaways

  • Postgres -> ClickHouse split was the win. Control-plane vs fact-store is the right balance.
  • Normalization + rulepacker solved detection drift without reprocessing history.
  • Language detection is hard, but ClickHouse works. Fall back to lang_code un when it's not clear.
  • Consent flows built on GitHub primitives (dotfiles, gists) kept things simple and ethical.
  • Determinism + leasing made replays safe and repeatable.
  • Ports-first design killed cyclic deps and gave me clean seams, even if it breaks Go orthodoxy.
  • Observability baked-in made ops boring (and boring is good).
  • Configurable data retention kept disk usage low (which is also good).

I'm not ready to share the repo for SW yet, but the backend development is solid: hybrid Postgres/ClickHouse, versioned detector, consent guardrails, replay-safe backfills, and metrics you can trust.

Of course... Now this means I have to write frontend code too... Bloody hell.
