Xianpeng Shen

Posted on Jul 2

Your Git History Already Knows Which Code Is AI-Written. Your CI Should Too.

#opensource #ai #github #devops

Somewhere between a quarter and half of the code your team merged this year was probably written by an AI assistant. Ask which PRs, which files, which lines — and most teams can only shrug.

Here's the thing: the answer is already sitting in your git history. Nobody's reading it.

The trailer nobody reads

When Claude Code writes a commit, it appends this to the message:

Co-Authored-By: Claude <noreply@anthropic.com>

GitHub Copilot does the same. So does Cursor. Every commit these tools touch carries a machine-readable attribution stamp — emitted automatically, no configuration, no human discipline required.
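Reading that stamp takes very little code. Here is a minimal sketch of pulling Co-Authored-By trailers out of a commit message. It is not how ODS implements detection; the `AI_COAUTHOR_EMAILS` allowlist and both function names are hypothetical, and the only address in the list is the one quoted above.

```python
import re

# Matches one trailer line, case-insensitively, e.g.
# "Co-Authored-By: Claude <noreply@anthropic.com>"
TRAILER_RE = re.compile(
    r"^co-authored-by:[ \t]*(?P<name>[^<\n]+?)[ \t]*<(?P<email>[^>\n]+)>[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)

# Hypothetical allowlist of co-author emails to treat as AI tools.
AI_COAUTHOR_EMAILS = {"noreply@anthropic.com"}


def co_authors(commit_message: str) -> list[tuple[str, str]]:
    """Return (name, email) for every Co-Authored-By trailer in the message."""
    return [
        (m["name"], m["email"].lower())
        for m in TRAILER_RE.finditer(commit_message)
    ]


def is_ai_attributed(commit_message: str) -> bool:
    """True if any co-author email is on the (assumed) AI allowlist."""
    return any(email in AI_COAUTHOR_EMAILS for _, email in co_authors(commit_message))
```

Feed it full commit messages, for example the output of `git log --format=%B`, and a human co-author such as `Jane Doe <jane@example.com>` is parsed but not flagged.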

Meanwhile, most "AI usage policies" I've seen work like this: a PR template with a checkbox that says "This PR contains AI-generated code", filled in on the honor system, read by no one, enforced by nothing.

We have a reliable, automatic disclosure signal in every commit, and an unreliable, manual one in the PR body — and teams are betting their governance on the manual one.

Open Delivery Spec (ODS) is a small open-source project built on one idea: read the signal that's already there, and act on it in CI.

Attribution, not detection

Let me be upfront about what ODS is not, because this is where most tools in this space lose me.

"AI code detectors" that claim to forensically identify AI-written code from style are, at the line level, snake oil. The false-positive rate makes them unusable for anything with consequences. ODS doesn't try. Its primary signal is attribution — what the AI tools themselves disclose via Co-Authored-By trailers. If someone squashes commits and strips trailers, ODS won't catch them. It's not a lie detector; it's a ledger of what the tools reported.

The project's own docs put it bluntly: ODS is a signal producer, not a quality oracle. A "PASS" means no policy rule fired, not that the code is good. An 85% detection confidence means the change is likely AI-assisted — not that 85% of the lines were machine-written.

I find this honesty more useful than the alternative. You can build real policy on a signal whose failure modes you understand.

What it actually does

ODS runs four steps on every pull request:

① Detect   →  Is there AI code?         (Co-Authored-By trailers, PR disclosure, branch prefix)
② Analyze  →  What quality defects?     (built-in heuristics + your analyzers via SARIF)
③ Score    →  How much tech debt added? (quality-driven, weighted by AI risk)
④ Check    →  Block, warn, or pass?     (your policy, written in OPA Rego)

Setup is one workflow file:

# .github/workflows/ods.yml
name: ODS AI Code Quality
on:
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  ods:
    runs-on: ubuntu-latest
    steps:
      - uses: open-delivery-spec/validate-action@v1

That's the whole integration. You get a PR comment, a job summary, an HTML report, and a badge. Add semgrep: true and it runs Semgrep and merges the findings into the gate. Drop a .ods/policy.rego in your repo and your rules decide what blocks.

A real example: blocking a real vulnerability

This is from the reproducible walkthrough in the spec repo — every output below is captured from the tools, not hand-written.

An AI-assisted change lands this classic:

import subprocess

def run_user_command(user_input):
    # Untrusted input flows into a shell — command injection risk.
    return subprocess.run(user_input, shell=True, capture_output=True)

Semgrep finds it and emits SARIF. ODS ingests the finding — keeping Semgrep's rule id and severity — and the policy gate does its job:

$ ods check --sarif semgrep.sarif --policy .ods/policy.rego
❌  Policy check failed
   Policy: .ods/policy.rego
   Denials:
     ❌ high: python-subprocess-shell-true (app/runner.py:11)
$ echo $?
1

Non-zero exit, CI fails, merge blocked. Fix the code (pass an argument list, shell=False), rerun Semgrep so the SARIF file reports zero findings, and the same gate passes:

$ ods check --sarif semgrep.sarif --policy .ods/policy.rego
✅  Policy check passed
$ echo $?
0
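For reference, the fix described above (an argument list with shell=False) might look like the sketch below. This is not the walkthrough's actual patch. The `ALLOWED_COMMANDS` allowlist is a hypothetical addition, included because dropping the shell stops injection but still lets the caller pick the executable.

```python
import shlex
import subprocess

# Hypothetical allowlist: the only programs this function is willing to run.
ALLOWED_COMMANDS = {"echo"}


def run_user_command(user_input: str) -> subprocess.CompletedProcess:
    # Split into an argument list instead of handing a raw string to a shell.
    args = shlex.split(user_input)
    if not args or args[0] not in ALLOWED_COMMANDS:
        raise ValueError("command not allowed")
    # With shell=False, characters like ';' or '|' reach the program as
    # literal argument text and are never interpreted by a shell.
    return subprocess.run(args, shell=False, capture_output=True)
```

With this version, an input like `echo hello; whoami` runs only `echo`, and the `;` and `whoami` arrive as plain arguments.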

Notice the division of labor here. ODS didn't find the vulnerability — Semgrep did. ODS's job was attribution (this change is AI-assisted), aggregation (fold external findings into one score and one policy input), and enforcement (your Rego rules, one exit code). It composes with the analyzers you already trust instead of pretending to replace them.

The scoring philosophy: AI use is not a sin

An earlier version of the scoring treated "percentage of AI code" as technical debt in itself. That was wrong, and it got fixed: a clean, fully-AI-written change now scores ~0.

The current model: real quality signals — defect density, high/critical findings, coverage gaps, duplication — form the base debt. The AI ratio acts only as a bounded risk multiplier (1.0× to 1.5×) on top, because AI-authored defects are ones no human reasoned through. Quality problems get amplified for AI-heavy changes; AI quantity alone never creates debt.
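A minimal sketch of that shape, under stated assumptions: the post only gives the multiplier's bounds (1.0× to 1.5×), so the linear interpolation here is a guess, and both function names are hypothetical. How ODS computes the base debt is not shown.

```python
def risk_multiplier(ai_ratio: float) -> float:
    """Map an AI ratio in [0, 1] to a multiplier in [1.0, 1.5].

    The bounds come from the post; the linear curve is an assumption.
    """
    clamped = min(max(ai_ratio, 0.0), 1.0)
    return 1.0 + 0.5 * clamped


def debt_score(base_debt: float, ai_ratio: float) -> float:
    """Scale quality-driven base debt by the bounded AI risk multiplier."""
    return base_debt * risk_multiplier(ai_ratio)
```

Because the AI ratio only multiplies, a change with zero base debt scores zero no matter how much of it was AI-written, while the same defects cost up to 1.5× more in a fully AI-written change.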

If your AI writes tested, clean code, ODS waves it through. That's by design. The point is governance, not gatekeeping.

Building the demo found a bug in our own tool

A story worth telling, because it's the whole argument for runnable examples.

While building the walkthrough above, the gate refused to block: real Semgrep findings sailed through with severity info. The reason: Semgrep puts severity in the SARIF rule's defaultConfiguration.level, not on each individual result, and the ingester only read the per-result field. Every real-world Semgrep finding was being quietly downgraded to informational, so the policy gate would never have blocked anything from Semgrep in production.
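The lookup order that avoids this class of bug can be sketched as follows. This is an illustration over a SARIF 2.1.0 log parsed into a dict, not the ODS ingester's code (the CLI is written in Go), and the function names are hypothetical.

```python
def effective_level(result: dict, rules: list[dict]) -> str:
    """Resolve a SARIF result's level, falling back to its rule's default."""
    # 1. A level set on the result itself always wins.
    if "level" in result:
        return result["level"]
    # 2. Otherwise find the rule, by ruleIndex if present, else by ruleId.
    index = result.get("ruleIndex")
    if isinstance(index, int) and 0 <= index < len(rules):
        rule = rules[index]
    else:
        rule = next((r for r in rules if r.get("id") == result.get("ruleId")), None)
    if rule is not None:
        level = rule.get("defaultConfiguration", {}).get("level")
        if level:
            return level
    # 3. SARIF treats a missing level as "warning".
    return "warning"


def levels(sarif: dict) -> list[tuple[str, str]]:
    """Return (ruleId, effective level) for every result in every run."""
    out = []
    for run in sarif.get("runs", []):
        rules = run.get("tool", {}).get("driver", {}).get("rules", [])
        for result in run.get("results", []):
            out.append((result.get("ruleId", ""), effective_level(result, rules)))
    return out
```

A result with no `level` of its own, whose rule declares `defaultConfiguration.level` as `error`, resolves to `error` here instead of falling through to a lower severity.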

A demo that actually runs caught what unit tests and documentation hadn't. The fix shipped before the walkthrough did. If you take one non-ODS lesson from this post: make your examples executable.

What it doesn't do (yet)

Honest limitations, so you can decide with open eyes:

Attribution can be evaded. Squash-and-strip removes trailers. ODS measures disclosed AI use, which is the right basis for governance — it's not forensic detection, which we think is a losing game anyway.
The built-in analyzer is heuristic. Five rules targeting common AI code smells. The real defect-finding power is meant to come from your existing analyzers via SARIF. Treat the built-ins as a fallback signal, not the product.
Thresholds are heuristic defaults. The score verdicts (pass / review / block at 1 / 3 / 5) are sensible starting points, not calibrated science. Override them with your own Rego policy — that's what it's there for.
It's young. The ODS org dogfoods it on every PR across its own repos, and it's just starting to onboard external projects. Early-adopter territory.

Try it

Spec & walkthrough: github.com/open-delivery-spec/spec
CLI (Go): github.com/open-delivery-spec/cli — go install github.com/open-delivery-spec/cli/cmd/ods@latest
GitHub Action: github.com/open-delivery-spec/validate-action

Everything is Apache-2.0. If you run it on a real repo, the maintainers genuinely want to hear what broke, what blocked wrongly, and what threshold made no sense — that feedback is worth more than a star (though stars are nice too).

Your repos have been accumulating AI attribution data for a year or more. It costs one workflow file to start reading it.
