If you ask GPT-4o or Claude to extract Federal Acquisition Regulation clause numbers from a federal solicitation, a non-trivial fraction of the time they will hand you a number that does not exist. There is no FAR 52.999-99. The model just made it up. For a federal contractor staffing a proposal, that is the difference between a clean compliance matrix and a rejected bid.
I went looking for a benchmark that measured this. There isn't one. Commercial tools in the space — Capture2Proposal, GovTribe, GovWin, OrangeSlices — all do natural-language processing on federal solicitations, but none publish benchmarks. Academic work on RFP processing is narrow and one-off. GSA's own srt-fbo-scraper covers only Section 508 compliance.
So I built one.
FedProc-Bench
FedProc-Bench is a multi-task benchmark for federal procurement NLP. Four tasks, drawn from real federal contracting sources:
| # | Task | What it tests |
|---|---|---|
| 1 | Notice type classification | Eight SAM.gov notice-type buckets — Solicitation, Combined Synopsis/Solicitation, Sources Sought, and so on |
| 2 | NAICS sector prediction | Twenty top-level NAICS sectors |
| 3 | Set-aside identification | Multi-label across SBA, SDVOSB, WOSB, EDWOSB, 8(a), HUBZone, and SDB |
| 4 | FAR / DFARS clause extraction | Token-level entity recognition on canonical clause numbers like 52.219-9 or 252.225-7042 |
Task 4 is the headline. It is the task where frontier LLMs visibly fail.
The data sources are public and free. SAM.gov provides the solicitations themselves through its Opportunities API. The Electronic Code of Federal Regulations gives me Title 48 — the full FAR and DFARS — as structured XML, which I parse down to 1,032 individual clause records. Claude Haiku fills in a small amount of synthetic augmentation for rare set-aside types like HUBZone and EDWOSB that real SAM data barely contains. Every record carries a source and label_origin field, so anyone can audit the provenance line by line.
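For readers who want to see what those provenance fields look like in practice, here is a minimal sketch of how the Title 48 XML could be parsed into clause records. The DIV8/HEAD tag names and the digit widths in the regex are my assumptions about the eCFR bulk-data layout, not the repo's actual parser; the `source` and `label_origin` field names are the ones described above.

```python
import re
import xml.etree.ElementTree as ET

# Canonical FAR (52.xxx-x) and DFARS (252.xxx-xxxx) clause numbers.
# Exact digit widths are an assumption; check against the real Title 48 data.
CLAUSE_RE = re.compile(r"\b(?:52|252)\.\d{3}-\d{1,4}\b")

def clause_records_from_ecfr(xml_path: str) -> list[dict]:
    """Walk an eCFR Title 48 XML export and emit one record per clause section.

    The DIV8/HEAD tag names reflect the eCFR bulk-data layout as I understand
    it; treat them as assumptions and verify against the file you download.
    """
    tree = ET.parse(xml_path)
    records = []
    for section in tree.iter("DIV8"):          # one DIV8 element per section/clause
        head = section.findtext("HEAD", default="")
        match = CLAUSE_RE.search(head)
        if not match:
            continue
        text = " ".join(section.itertext())
        records.append({
            "clause_number": match.group(0),
            "text": text,
            "source": "ecfr_title48",          # provenance fields from the post
            "label_origin": "regulation_text",
        })
    return records
```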
The v0 release ships 1,615 records split 1,129 / 243 / 243 train / val / test. That is small; I had originally targeted 10,000. We will come back to why in the section on where this is actually weak.
The model
The companion model — raihan-js/fedproc-180m-v0 — is a 149-million-parameter ModernBERT-base with one shared encoder and four task heads: sequence classification heads for tasks 1 and 2 (softmax over the label set), a seven-output sigmoid head for the multi-label set-aside task, and a per-token BIO head for the FAR-clause extractor.
The interesting design choice is the task mask. Records from different sources contribute different supervision: SAM metadata contributes tasks 1, 2, and 3; raw FAR clause text contributes task 4; synthetic excerpts contribute all four. Inside the model's forward pass, a per-record four-boolean mask says which heads get gradient for each example. That is how a single model trains jointly on heterogeneous sources without diluting any head's signal.
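A minimal sketch of what that per-record mask can look like in a PyTorch loss step. The head names, dimensions, and equal loss weighting are assumptions for illustration, not the actual repo code; the real model is ModernBERT-base with its own heads.

```python
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    """Four task heads over a shared encoder (dimensions are assumptions)."""
    def __init__(self, hidden=768, n_notice=8, n_naics=20, n_setaside=7, n_bio=3):
        super().__init__()
        self.notice = nn.Linear(hidden, n_notice)      # task 1: softmax classification
        self.naics = nn.Linear(hidden, n_naics)        # task 2: softmax classification
        self.setaside = nn.Linear(hidden, n_setaside)  # task 3: multi-label sigmoid
        self.bio = nn.Linear(hidden, n_bio)            # task 4: per-token B/I/O tags

def masked_loss(heads_out, labels, task_mask):
    """Sum per-task losses, zeroing heads the record does not supervise.

    task_mask: float tensor of shape (batch, 4); 1.0 where the record's source
    provides labels for that task, 0.0 otherwise. Unsupervised tasks carry a
    dummy label, and the mask zeroes their contribution. A real implementation
    would also ignore padding tokens in the task-4 loss.
    """
    ce = nn.CrossEntropyLoss(reduction="none")
    bce = nn.BCEWithLogitsLoss(reduction="none")   # expects float multi-hot labels
    l1 = ce(heads_out["notice"], labels["notice"]) * task_mask[:, 0]
    l2 = ce(heads_out["naics"], labels["naics"]) * task_mask[:, 1]
    l3 = bce(heads_out["setaside"], labels["setaside"]).mean(-1) * task_mask[:, 2]
    # token-level loss: (batch, seq) per-token loss, averaged per record, then masked
    l4 = ce(heads_out["bio"].transpose(1, 2), labels["bio"]).mean(-1) * task_mask[:, 3]
    return (l1 + l2 + l3 + l4).mean()
```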
Training takes 4.3 minutes on a single RTX 3060 for six epochs. The whole training run cost me zero dollars and the electricity to keep my desk lamp on.
What I compared against
I ran the same four tasks through three frontier systems and the trained model:
- Claude Sonnet 4.6 (Anthropic)
- GPT-4o (OpenAI)
- Claude Haiku 4.5 (Anthropic)
- FedProc-180M v0 (the model I trained)
Each system gets the same prompt and the same test split. For task 4, a clean canonical metric: entity F1 with exact match on clause-number strings, plus a hallucination rate, which I define as the share of predicted clause numbers that do not appear anywhere in the cached real FAR + DFARS corpus. Inventing a number that does not exist is the failure mode that matters here.
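Here is a minimal sketch of both task-4 metrics, assuming predictions and references are lists of clause-number strings per record and that duplicate spans within a record are deduplicated; this is my own illustration, not the benchmark's scoring script.

```python
def task4_metrics(predictions, references, known_clauses):
    """Exact-match entity F1 and hallucination rate for clause extraction.

    predictions / references: one list of clause-number strings per record.
    known_clauses: set of canonical numbers from the cached FAR + DFARS corpus.
    """
    tp = fp = fn = hallucinated = total_pred = 0
    for pred, gold in zip(predictions, references):
        pred_set, gold_set = set(pred), set(gold)
        tp += len(pred_set & gold_set)
        fp += len(pred_set - gold_set)
        fn += len(gold_set - pred_set)
        total_pred += len(pred_set)
        hallucinated += sum(1 for c in pred_set if c not in known_clauses)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    rate = hallucinated / total_pred if total_pred else 0.0
    return {"entity_f1": f1, "hallucination_rate": rate}
```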
The headline results
Aggregate scores across all four tasks (mean of per-task macro-F1; task 4 is entity F1):
| Rank | Model | Aggregate | T4 entity F1 | T4 hallucination |
|---|---|---|---|---|
| 1 | Claude Sonnet 4.6 | 0.911 | 0.991 | 0.0% (0 / 493) |
| 2 | GPT-4o | 0.896 | 0.970 | 4.4% (23 / 517) |
| 3 | Claude Haiku 4.5 | 0.851 | 0.916 | 15.0% (88 / 587) |
| 4 | FedProc-180M v0 | 0.497 | 0.921 | 5.5% (26 / 473) |
Claude Sonnet 4.6 is genuinely impressive — zero invented clauses across 493 predicted spans. GPT-4o is close behind. The compact model places fourth on the aggregate because tasks 2 and 3 are weak in v0 (more on this in a moment), but on task 4 — the headline task — it is right behind GPT-4o and roughly matches Claude Haiku on F1 while inventing about a third as many fake clauses.
The honest read
Before anyone runs away with that table: 65 of the 220 task-4 test records are Claude-generated synthetic excerpts that cite specific pinned clauses. Frontier models from the Claude family are being graded on text their own family wrote. That is a real bias.
The way I disclose this in the benchmark is to break out task 4 by record source. The real-FAR slice is the honest read because no system in this comparison helped author it:
| Model | Real FAR text — F1 | Real FAR text — hallucination |
|---|---|---|
| Claude Sonnet 4.6 | 0.984 | 0.0% (0 / 182) |
| GPT-4o | 0.937 | 11.0% (23 / 209) |
| Claude Haiku 4.5 | 0.804 | 32.1% (88 / 274) |
| FedProc-180M v0 | 0.800 | 13.8% (22 / 159) |
So on the cleanest available slice, Claude Sonnet 4.6 still wins outright. GPT-4o is solid but invents a clause number more than one in ten times. Claude Haiku 4.5 invents a clause number almost a third of the time. And FedProc-180M, the compact specialized model, matches Haiku on F1 with less than half the hallucination rate.
That last comparison is the v0 takeaway: a 150M-parameter model trained in four minutes on a consumer GPU produces task-specific extraction that is competitive with Claude Haiku and demonstrably more reliable on the failure mode that matters for the use case. At roughly fifty times lower latency and three orders of magnitude lower per-call cost, that is a real Pareto point for federal contractors who want on-prem, predictable, auditable FAR-clause extraction.
Where this is actually weak
I am not going to oversell the rest of the table. Tasks 1, 2, and 3 are limited in v0 because the SAM.gov daily quota on a non-federal API key ran out mid-pull, before I could fetch description text for the cached solicitations. For task-2 NAICS prediction, the model sees only titles like 53--O-RING, and predicting a sector from that is essentially impossible. v0.1, once the quota window cycles, will retrain on the full description text, and these numbers should move substantially.
The other honest caveat: 1,129 training records is tiny by NLP standards. The fact that ModernBERT-base lifts task 4 to 0.921 F1 on this little data is partly attributable to ModernBERT being a genuinely strong base model, and partly to the fact that the FAR-clause-number pattern is fundamentally structural — it is easier to learn 52.<digits>-<digits> than to learn what makes a notice an RFI versus a Sources Sought.
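A three-line illustration of why that pattern is easy to learn, and why structure alone is not enough. The digit widths in the regex are my assumption.

```python
import re

# Canonical FAR/DFARS clause-number shape: part 52 or 252, a three-digit
# subpart, a dash, and a clause suffix.
CLAUSE_SHAPE = re.compile(r"^(?:52|252)\.\d{3}-\d{1,4}$")

print(bool(CLAUSE_SHAPE.match("52.219-9")))      # True  - real clause
print(bool(CLAUSE_SHAPE.match("252.225-7042")))  # True  - real clause
print(bool(CLAUSE_SHAPE.match("52.999-99")))     # True  - well-formed but fake,
# which is exactly why the benchmark checks predictions against the real corpus
```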
Why I built this
I am a co-founder at VETR Proposal, which builds AI-assisted federal proposal management for SDVOSB, WOSB, and 8(a) contractors. Reliable FAR clause handling is core to that product. Before I shipped anything to a customer that touches clause citation, I wanted to know how often current AI systems make things up. There was no public answer. So I made the measurement public.
That is the other reason the benchmark and dataset are open: anyone working in this space — competitors, GSA, academic groups, internal teams at large contractors — can now use the same yardstick. The benchmark is the contribution. The model is just one entry on the leaderboard.
Try it
The model: raihan-js/fedproc-180m-v0.
The dataset: raihan-js/fedproc-bench.
Both are Apache 2.0. The source code that built them lives in the repo (link in the model card). To reproduce from scratch you need a SAM.gov developer key (free), an Anthropic key for the synthetic step, and a couple of hours on a GPU you already have.
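If it helps, this is roughly what the SAM.gov pull looks like. The endpoint, parameter names, and response fields here are my reading of the public Opportunities API documentation, not code from the repo, so verify them against the current docs before relying on this sketch.

```python
import os
import requests

# Minimal pull from the SAM.gov Opportunities API (v2 search endpoint).
API_KEY = os.environ["SAM_API_KEY"]  # free developer key from sam.gov

resp = requests.get(
    "https://api.sam.gov/opportunities/v2/search",
    params={
        "api_key": API_KEY,
        "postedFrom": "01/01/2025",   # MM/DD/YYYY, per the API docs
        "postedTo": "01/31/2025",
        "ptype": "o",                 # 'o' = solicitation, as I read the docs
        "limit": 100,
    },
    timeout=30,
)
resp.raise_for_status()
for opp in resp.json().get("opportunitiesData", []):
    print(opp.get("noticeId"), opp.get("title"), opp.get("naicsCode"))
```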
If you find a clause number my model misses, an obvious bug, or a hallucination my regex did not catch — open an issue. v0.1 lands tomorrow.
If you build in federal contracting tech or care about the reliability of LLMs on regulated text, let me know what you find when you run it.