<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Chee Yu Yang</title>
    <description>The latest articles on DEV Community by Chee Yu Yang (@chyuang).</description>
    <link>https://dev.to/chyuang</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F347440%2F002efdb8-608e-4ad4-8d4d-181644fc65cc.png</url>
      <title>DEV Community: Chee Yu Yang</title>
      <link>https://dev.to/chyuang</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/chyuang"/>
    <language>en</language>
    <item>
      <title>Failing to Train DeBERTa to Detect Patent Antecedent Basis Errors</title>
      <dc:creator>Chee Yu Yang</dc:creator>
      <pubDate>Thu, 26 Mar 2026 07:05:35 +0000</pubDate>
      <link>https://dev.to/chyuang/failing-to-train-deberta-to-detect-patent-antecedent-basis-errors-2p12</link>
      <guid>https://dev.to/chyuang/failing-to-train-deberta-to-detect-patent-antecedent-basis-errors-2p12</guid>
      <description>&lt;p&gt;Patent claims have a simple rule: introduce "a thing" before referring to "the thing." I fine-tuned DeBERTa-v3 on synthetic antecedent basis errors and hit 90% F1 on my test set. Then I evaluated on real USPTO examiner rejections from the PEDANTIC dataset and watched that number collapse to 14.5% F1, 8% recall. The model catches 8 out of 100 real errors. This writeup covers what I built, why it failed, and what the failure reveals about the gap between synthetic and real patent data.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;Antecedent basis errors are one of the most common reasons for indefiniteness rejections under 35 U.S.C. 112(b). They're also one of the most annoying—purely mechanical mistakes that slip through because patent claims get long, dependencies get tangled, and things get edited over time. You introduce "a sensor" in claim 1, then three claims later you write "the detector" meaning the same thing. Or you delete a clause during revision and forget that it was the antecedent for something downstream.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;"A device comprising a processor, wherein the controller manages memory."&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;↑ "the controller" appears out of nowhere—no antecedent&lt;/p&gt;

&lt;h3&gt;
  
  
  More examples of antecedent basis errors
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Ambiguous reference&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✗ "a first lever and a second lever... the lever is connected to..."&lt;br&gt;
✓ "a first lever and a second lever... the first lever is connected to..."&lt;/p&gt;

&lt;p&gt;When multiple elements share a name, "the lever" is ambiguous.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inconsistent descriptors&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✗ "a lever... the aluminum lever"&lt;br&gt;
✓ "an aluminum lever... the aluminum lever"&lt;/p&gt;

&lt;p&gt;Adding a descriptor that wasn't in the antecedent creates uncertainty.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compound terms&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✗ "a video display unit... the display"&lt;br&gt;
✓ "a video display unit... the video display unit"&lt;/p&gt;

&lt;p&gt;You can't reference part of a compound term on its own unless that part was introduced separately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implicit synonyms&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✗ "a sensor... the detector"&lt;br&gt;
✓ "a sensor... the sensor"&lt;/p&gt;

&lt;p&gt;Even if they mean the same thing, different words require separate antecedents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gray area: Morphological changes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;? "a controlled stream of fluid... the controlled fluid"&lt;/p&gt;

&lt;p&gt;Often acceptable because the scope is "reasonably ascertainable," but some examiners may still flag it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not an error: Inherent properties&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✓ "a sphere... the outer surface of said sphere"&lt;/p&gt;

&lt;p&gt;You don't need to explicitly introduce inherent components. A sphere obviously has an outer surface.&lt;/p&gt;

&lt;p&gt;When the USPTO catches it, you get an office action. You pay your attorney to draft a response. The application gets delayed. All for an error so mechanical, so tedious, that checking for it yourself feels almost insulting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prior work
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Commercial tools
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.patentclaimmaster.com/" rel="noopener noreferrer"&gt;ClaimMaster&lt;/a&gt; is a Microsoft Word plugin that parses claims and highlights potential antecedent basis issues: missing antecedents, ambiguous terms, singular/plural mismatches. They describe it as using "natural-language processing technologies" and have recently added LLM integration for drafting and analysis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.patentbots.com/" rel="noopener noreferrer"&gt;Patent Bots&lt;/a&gt; is a web-based alternative that highlights terms in green (has antecedent), yellow (warning), or red (missing antecedent).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.lexisnexisip.com/solutions/patent-drafting/patentoptimizer/" rel="noopener noreferrer"&gt;LexisNexis PatentOptimizer&lt;/a&gt; is the enterprise option, checking for antecedent basis and specification support.&lt;/p&gt;

&lt;h3&gt;
  
  
  Open source
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/cgupatent/antecedent-check" rel="noopener noreferrer"&gt;antecedent-check&lt;/a&gt; parses claims into noun phrases using Apache OpenNLP. &lt;a href="https://github.com/btrettel/plint" rel="noopener noreferrer"&gt;plint&lt;/a&gt; is a patent claim linter that requires manually marking up claims with special syntax for new elements and references.&lt;/p&gt;

&lt;h3&gt;
  
  
  Research
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://github.com/boschresearch/pedantic-patentsemtech" rel="noopener noreferrer"&gt;PEDANTIC dataset&lt;/a&gt; from Bosch Research contains 14,000 patent claims annotated with indefiniteness reasons, including antecedent basis errors. They tested logistic regression baselines and LLM agents (Qwen 2.5 32B and 72B) on binary classification of whether a claim is indefinite, with the best model achieving 60.3 AUROC. Antecedent basis was the most common error type, accounting for 36% of all rejections.&lt;/p&gt;

&lt;h2&gt;
  
  
  Approach
&lt;/h2&gt;

&lt;p&gt;I framed this as token classification: feed the model a claim with its parent claims as context, and have it label each token as the start of an error span, a continuation of one, or clean. I used &lt;a href="https://huggingface.co/microsoft/deberta-v3-base" rel="noopener noreferrer"&gt;DeBERTa-v3-base&lt;/a&gt; and evaluated against PEDANTIC's test split (885 samples with real examiner-flagged antecedent basis errors).&lt;/p&gt;

&lt;h2&gt;
  
  
  The training data problem
&lt;/h2&gt;

&lt;p&gt;PEDANTIC has labeled antecedent basis errors, but only ~2,500 training examples. I wanted more data and control over the error types. So I decided to generate synthetic training data.&lt;/p&gt;

&lt;p&gt;I started by pulling ~25,000 granted US patents (2019–2024) from &lt;a href="https://console.cloud.google.com/marketplace/product/google_patents_public_datasets/google-patents-public-data" rel="noopener noreferrer"&gt;Google Patents BigQuery&lt;/a&gt;. These are clean, examiner-approved claims with no antecedent basis errors—at least in theory. I parsed out the claim structure, built dependency chains so each dependent claim had its parent claims as context, and ended up with about 370,000 claim-context pairs.&lt;/p&gt;

&lt;p&gt;Then I wrote a corruption generator to inject synthetic errors. The idea: take clean claims and break them in ways that create antecedent basis errors, recording exactly which character spans are wrong.&lt;/p&gt;

&lt;h3&gt;
  
  
  The six corruption types I generate
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Remove antecedent&lt;/strong&gt; — Find "a sensor" in context, delete it. Now "the sensor" in the claim is orphaned.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Swap determiner&lt;/strong&gt; — Change "a controller" → "the controller" in the claim where no controller was introduced.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inject orphan&lt;/strong&gt; — Insert "the processor connected to" from a hardcoded list of 24 common patent nouns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plural mismatch&lt;/strong&gt; — "a sensor" in context → "the sensors" in claim. Singular introduced, plural referenced.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partial compound&lt;/strong&gt; — "a temperature sensor" introduced → "the sensor" referenced. Can't drop the modifier.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ordinal injection&lt;/strong&gt; — "a first valve" and "a second valve" exist → inject "the third valve".&lt;/li&gt;
&lt;/ol&gt;
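&lt;p&gt;As a concrete example, corruption type 2 (swap determiner) can be sketched like this. The function name and the span-recording convention are mine, not the actual generator's:&lt;/p&gt;

```python
import re

def swap_determiner(claim_text):
    """Corruption sketch: turn the first 'a/an X' into 'the X' so the
    definite article loses its antecedent, recording the corrupted
    character span for later BIO labeling.

    Returns (corrupted_text, (start, end)), or None if nothing to swap.
    """
    m = re.search(r"\b(a|an)\s+(\w+)", claim_text, re.IGNORECASE)
    if m is None:
        return None
    replacement = "the " + m.group(2)
    corrupted = claim_text[: m.start()] + replacement + claim_text[m.end():]
    span = (m.start(), m.start() + len(replacement))  # error span in corrupted text
    return corrupted, span
```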

&lt;p&gt;50/50 split between clean and corrupted examples. The corrupted ones got converted to BIO format (B-ERR for beginning of error span, I-ERR for inside, O for everything else) and fed to DeBERTa.&lt;/p&gt;
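&lt;p&gt;Converting a recorded character span into BIO tags looks roughly like this. The sketch uses whitespace tokenization to stay self-contained; the real pipeline labels DeBERTa subword tokens via the tokenizer's &lt;code&gt;offset_mapping&lt;/code&gt; instead:&lt;/p&gt;

```python
def spans_to_bio(text, error_spans):
    """Label whitespace tokens as B-ERR / I-ERR / O from character spans.

    Illustrative only; adjacent distinct spans would merge here, which
    a real implementation has to handle.
    """
    tokens, labels = [], []
    pos = 0
    prev_in_error = False
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        pos = end
        # Token overlaps an error span iff both intervals extend past
        # each other's start (written without the '<' operator).
        in_error = any(end > s and e > start for s, e in error_spans)
        if in_error and prev_in_error:
            labels.append("I-ERR")
        elif in_error:
            labels.append("B-ERR")
        else:
            labels.append("O")
        prev_in_error = in_error
        tokens.append(tok)
    return tokens, labels
```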

&lt;p&gt;After about 12,500 steps, the model hit &lt;strong&gt;90.84% F1&lt;/strong&gt; on my validation set. I was feeling pretty good about it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vsothzggy8ltjcgsaqg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vsothzggy8ltjcgsaqg.png" alt="Training progress showing F1, precision, recall reaching ~90% and loss decreasing over 12,500 steps" width="800" height="566"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Then I tested on real data
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://github.com/boschresearch/pedantic-patentsemtech" rel="noopener noreferrer"&gt;PEDANTIC dataset&lt;/a&gt; contains actual USPTO examiner rejections with the error spans labeled by hand. This is the real thing—885 test samples where examiners flagged antecedent basis issues in actual patent applications.&lt;/p&gt;

&lt;p&gt;Out of the box, my model hit 5% F1. That 90% on synthetic data? Gone. But before giving up, I wanted to understand what was actually happening inside. The model outputs a confidence score for each token—how sure is it that this word is part of an error? By default, it only flags tokens where it's more than 50% confident. What if I lowered that bar?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04kapxja78h23i99b4yf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04kapxja78h23i99b4yf.png" alt="Model performance vs confidence threshold showing precision stable at ~70% while recall drops from 8% to 1% as threshold increases" width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Look at the blue line. Precision barely moves—it stays around 70% no matter what threshold I pick. That's interesting. It means the model actually learned something real. When it speaks up, it's right about 70% of the time, whether I make it cautious or aggressive.&lt;/p&gt;

&lt;p&gt;The green line is the problem. At the default threshold, recall is 2.6%—catching almost nothing. Crank the threshold down to 0.05, and recall triples to 8.2%. Still bad, but less bad. F1 goes from 5% to 14.5%. I'll take it.&lt;/p&gt;
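&lt;p&gt;The sweep itself is simple bookkeeping. A sketch (the input names are hypothetical, and the real evaluation scores spans rather than raw tokens):&lt;/p&gt;

```python
def sweep_thresholds(token_probs, token_labels, thresholds):
    """Precision/recall/F1 at several confidence thresholds.

    token_probs:  per-token P(error) from the model
    token_labels: per-token gold labels, True where the token is an error
    """
    results = {}
    for t in thresholds:
        pred = [p >= t for p in token_probs]
        tp = sum(1 for p, g in zip(pred, token_labels) if p and g)
        fp = sum(1 for p, g in zip(pred, token_labels) if p and not g)
        fn = sum(1 for p, g in zip(pred, token_labels) if g and not p)
        prec = tp / (tp + fp) if (tp + fp) else 0.0
        rec = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        results[t] = (prec, rec, f1)
    return results
```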

&lt;p&gt;So the model isn't broken. It learned patterns that transfer to real data—just not very many of them. The synthetic corruption I generated covers maybe 8% of what USPTO examiners actually flag. The other 92%? Patterns I didn't think to simulate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Exploring the data
&lt;/h2&gt;

&lt;p&gt;Before trying to fix anything, I wanted to understand what's actually in PEDANTIC and what my model sees. I ran every test sample through the model at threshold 0.05 and categorized every prediction.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the model catches
&lt;/h3&gt;

&lt;p&gt;Of the 258 true positives, 92% start with "the"—phrases like "the user", "the source profile", "the matrix". This makes sense. My synthetic training data generates errors by swapping "a X" to "the X", so the model learned to flag definite articles that lack antecedents. When it sees "the controller" and can't find "a controller" earlier in the context, it speaks up.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it misses
&lt;/h3&gt;

&lt;p&gt;The 2,883 false negatives tell a different story. Only 38% start with "the". The rest?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bare nouns (no determiner)&lt;/strong&gt; — "widgets", "pattern", "text content" — 37% of all errors. The noun is used without "the" or "a" but still lacks proper introduction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"said X" patterns&lt;/strong&gt; — "said widget", "said data" — Patent-speak for "the". My model catches almost none of these despite 30% augmentation in training.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embedded errors&lt;/strong&gt; — "a location of the occluded area" — The error is "the occluded area" but PEDANTIC marks the whole phrase. Different annotation granularity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pronouns&lt;/strong&gt; — "it" — 20 cases. Never in my training data because I focused on noun phrases.&lt;/p&gt;

&lt;h3&gt;
  
  
  False positives
&lt;/h3&gt;

&lt;p&gt;The 152 false positives are mostly patent boilerplate: "The method of claim 8", "The system", "The apparatus". These always have antecedents—"method" refers to the claim itself, "system" to whatever was introduced in claim 1. The model doesn't understand claim structure, just surface patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data quality
&lt;/h3&gt;

&lt;p&gt;Some PEDANTIC annotations look like parsing artifacts. I found dozens of instances of "d widgets" and "idgets"—clearly broken spans from the word "widgets". A small percentage of false negatives have suspicious patterns: spans starting with spaces, single characters, or truncated words. Not a huge problem, but worth noting.&lt;/p&gt;

&lt;h2&gt;
  
  
  The gap
&lt;/h2&gt;

&lt;p&gt;Now the picture comes together. Go back to those six corruption types I wrote—every single one produces errors starting with "the" or "said". That's all the model ever saw during training.&lt;/p&gt;

&lt;p&gt;But real examiner rejections are messier. PEDANTIC breaks down like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;% of PEDANTIC&lt;/th&gt;
&lt;th&gt;Recall&lt;/th&gt;
&lt;th&gt;Training&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"the X"&lt;/td&gt;
&lt;td&gt;42.0%&lt;/td&gt;
&lt;td&gt;16.2%&lt;/td&gt;
&lt;td&gt;trained&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bare nouns&lt;/td&gt;
&lt;td&gt;37.3%&lt;/td&gt;
&lt;td&gt;1.0%&lt;/td&gt;
&lt;td&gt;never&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;embedded "the"&lt;/td&gt;
&lt;td&gt;9.6%&lt;/td&gt;
&lt;td&gt;9.3%&lt;/td&gt;
&lt;td&gt;never*&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"a/an X" phrases&lt;/td&gt;
&lt;td&gt;6.2%&lt;/td&gt;
&lt;td&gt;1.0%&lt;/td&gt;
&lt;td&gt;never&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"said X"&lt;/td&gt;
&lt;td&gt;4.3%&lt;/td&gt;
&lt;td&gt;1.5%&lt;/td&gt;
&lt;td&gt;30% aug&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pronouns&lt;/td&gt;
&lt;td&gt;0.6%&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;never&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;*Embedded patterns get partial credit when my "the X" detection overlaps with the annotated span&lt;/p&gt;

&lt;p&gt;But wait—if I trained on "the X" patterns, why is recall only 16%? Where did the other 84% go?&lt;/p&gt;

&lt;h2&gt;
  
  
  Digging into the 84%
&lt;/h2&gt;

&lt;p&gt;I dug into the model's actual predictions and found both distribution problems and outright bugs in my corruption logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Distribution mismatch on context.&lt;/strong&gt; 51% of missed "the X" errors are in independent claims (no parent context). My training data has 18% without context—not zero, but the distribution is off. The model learned to rely heavily on cross-referencing "the X" against "a X" in context. When context is missing or sparse, it's less confident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I trained on wrong labels.&lt;/strong&gt; Here's the bug: 7% of my training errors are "The X of claim N" patterns—things like "The method of claim 1, wherein...". These should &lt;em&gt;never&lt;/em&gt; be errors. The phrase "of claim 1" explicitly provides the antecedent. But my &lt;code&gt;remove_antecedent&lt;/code&gt; corruption doesn't understand this. It sees "a method" in context, "the method" in the claim, removes "a method", and labels "The method" as orphaned. Wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This created spurious patterns.&lt;/strong&gt; 10.8% of error tokens in my training data appear within 3 tokens after the [SEP] separator—right at the claim start. The model learned "claim start → likely error". On real data, it puts ~0.3 probability on [SEP] and claim-start boilerplate. Actual errors also get ~0.3 probability. The model can't distinguish real errors from the noise I accidentally taught it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real errors are more subtle.&lt;/strong&gt; My synthetic training creates obvious cases—I delete "a sensor" from context, making "the sensor" clearly orphaned. But 17% of PEDANTIC's "the X" errors have an "a X" that &lt;em&gt;does&lt;/em&gt; exist somewhere earlier. The examiner flagged it anyway because the reference was ambiguous, or referred to something different, or had a scope issue. I never generated these nuanced cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  The false positives
&lt;/h2&gt;

&lt;p&gt;The 152 false positives are almost all patent boilerplate: "the method", "the apparatus", "the system". Now I know why—I literally trained the model to flag these. Those 7% wrong labels taught it that claim-start phrases are errors. The model is doing exactly what I trained it to do. I just trained it wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real gap
&lt;/h2&gt;

&lt;p&gt;90% F1 on synthetic data, 14.5% on real data. The gap is my corruption logic. I accidentally trained the model on wrong labels, created spurious patterns around [SEP] and claim-starts, and never generated the subtle ambiguity cases that real examiners flag. The model architecture is fine. My training data was broken.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future work
&lt;/h2&gt;

&lt;p&gt;The model architecture isn't the problem—DeBERTa learned exactly what I taught it. The corruption logic is what's broken. There are a few clear directions to try:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix the bugs.&lt;/strong&gt; Filter out "The X of claim N" patterns from error labels. Add explicit negative examples where boilerplate phrases like "The method of claim 1" are labeled as NOT errors. Rebalance the context distribution to match PEDANTIC (more independent claims, fewer dependent).&lt;/p&gt;
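&lt;p&gt;The boilerplate filter is the easiest fix to sketch. Something like this (hypothetical names; a real filter also needs "claims 1-3" ranges and similar variants):&lt;/p&gt;

```python
import re

# Drop error labels on claim-reference boilerplate like
# "The method of claim 1", which always has an antecedent.
BOILERPLATE = re.compile(r"^the\s+\w+\s+of\s+claim\s+\d+", re.IGNORECASE)

def keep_error_label(claim_text, span):
    """Return False when the labeled span starts claim-reference boilerplate."""
    start, _ = span
    return not BOILERPLATE.match(claim_text[start:])
```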

&lt;p&gt;&lt;strong&gt;Cover more patterns.&lt;/strong&gt; 37% of real errors are bare nouns—"widgets", "pattern", "text content"—and I never generated any. Add corruptions that reference bare nouns without introduction. Generate "said X" errors more aggressively (30% augmentation wasn't enough for 1.5% recall). Add pronoun cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generate harder cases.&lt;/strong&gt; Right now I create obvious errors—delete "a sensor" and "the sensor" is clearly orphaned. But 17% of real errors have an antecedent that &lt;em&gt;exists&lt;/em&gt; but is ambiguous, refers to something different, or has scope issues. This probably requires either manual curation or a smarter generation strategy that intentionally creates near-miss patterns.&lt;/p&gt;

&lt;p&gt;Or skip synthetic generation entirely and fine-tune on PEDANTIC's training split. It's smaller (only ~2,500 antecedent basis examples vs my 185,000), but it's real data with real annotation patterns. The distribution would match by construction.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>nlp</category>
      <category>ai</category>
      <category>python</category>
    </item>
    <item>
      <title>OCR on Patent Figures with DeepSeek-OCR</title>
      <dc:creator>Chee Yu Yang</dc:creator>
      <pubDate>Thu, 26 Mar 2026 00:58:21 +0000</pubDate>
      <link>https://dev.to/chyuang/ocr-on-patent-figures-with-deepseek-ocr-5aci</link>
      <guid>https://dev.to/chyuang/ocr-on-patent-figures-with-deepseek-ocr-5aci</guid>
      <description>&lt;p&gt;12 approaches to extracting text and reference numbers from patent figure sheets, tested against 8 sheets from US11423567B2 (a facial recognition depth mapping system). Flowcharts, dense instrument screenshots, architectural diagrams with tiny scattered reference numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The figures
&lt;/h2&gt;

&lt;p&gt;Patent figures have text at multiple orientations (some sheets are rotated 90 degrees), tiny reference numbers like "41" or "7025" scattered among drawings, dense data screens with white text on dark backgrounds, structural elements (boxes, arrows, lines) that look like text to a machine, and "Figure X" labels often printed sideways.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp2wtgxjp1yffwqucypj8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp2wtgxjp1yffwqucypj8.png" alt="Patent figure sheet 01 — person in vehicle with camera system, rotated 90 degrees with scattered reference numbers" width="800" height="867"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Sheet 01 from US11423567B2. The whole thing is rotated 90 degrees, with labels like "BP", "DR", "1", and "D" scattered around the drawing.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;DeepSeek-OCR is a 3.3B parameter vision model that runs locally. It has a grounding mode that returns bounding boxes alongside text—the prompt &lt;code&gt;&amp;lt;|grounding|&amp;gt;OCR this image.&lt;/code&gt; produces output like &lt;code&gt;&amp;lt;|ref|&amp;gt;camera 110&amp;lt;/ref&amp;gt;&amp;lt;|det|&amp;gt;[[412, 8, 455, 63]]&amp;lt;/det&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Test 1: Baseline
&lt;/h2&gt;

&lt;p&gt;Raw images into DeepSeek-OCR. Clean upright flowcharts came out perfect:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fndrvnch8ggawt8df23vq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fndrvnch8ggawt8df23vq.png" alt="Sheet 00 flowchart — all labels and text detected correctly with bounding boxes" width="800" height="707"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Sheet 00, test 1. Clean flowchart. Every label and text block detected correctly.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Everything else had problems—rotated text came back garbled ("Accurling" instead of "Acquiring"), it read "61" as "19" on one sheet, and small labels near drawings were consistently missed. Two sheets perfect, six with errors.&lt;/p&gt;

&lt;p&gt;The dense instrument screenshot was the worst—grid marks triggered 225 hallucinated "+" detections:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqwuqjatkslya53sh873r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqwuqjatkslya53sh873r.png" alt="Dense instrument screenshot with hundreds of colored bounding boxes on grid marks — hallucinated detections" width="800" height="1161"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Sheet 03, test 1. Every colored box is an OCR detection. Most of the ones on the right side are hallucinated "+" symbols from grid marks.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Tests 2–3: Preprocessing
&lt;/h2&gt;

&lt;p&gt;Binarization (converting to pure black and white, boosting contrast) gave identical results. The images were already clean line drawings—nothing to clean up.&lt;/p&gt;

&lt;p&gt;Tesseract OSD for rotation detection got confused by the sideways "Figure X" labels on otherwise upright sheets and rotated things that shouldn't have been rotated. Results got worse.&lt;/p&gt;

&lt;h2&gt;
  
  
  Test 4: Manual rotation
&lt;/h2&gt;

&lt;p&gt;Some patent figure sheets are printed in landscape orientation—the entire page is rotated 90 degrees. DeepSeek-OCR doesn't handle this well. At the wrong angle, it either misses text entirely or garbles it ("Accurling" instead of "Acquiring"). At the right angle, the same text comes through perfectly.&lt;/p&gt;

&lt;p&gt;I ran every sheet at three angles (0, 90, 270 degrees) and manually compared. Sheet 01 went from 2 usable detections at 0 degrees to 32 at 270 degrees:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq5o4nlxhlxy3dysn3xrm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq5o4nlxhlxy3dysn3xrm.png" alt="Sheet 01 at 0 degrees — sideways, few detections" width="800" height="867"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;0 degrees. Sideways. Finds a few large labels (100, 10, 110, 120, 121) but misses BP, DR, D, 1, and most of the small text.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffssg6xtehf1dy0trwrvr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffssg6xtehf1dy0trwrvr.png" alt="Sheet 01 at 270 degrees — upright, all labels detected" width="800" height="737"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;270 degrees. Upright. All labels detected—BP, DR, D, 1, 10, 100, 110, 111, 120, 121. "Figure 1" read correctly too.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The problem was figuring out which rotation to use automatically. Not every sheet needs rotating, and rotating an already-upright sheet makes things worse.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tests 5–6: Automatic rotation detection
&lt;/h2&gt;

&lt;p&gt;Cheap probes (running OCR with only 128 output tokens at each angle): 5/8 correct. The probes were too short to distinguish close cases, and not meaningfully faster than running all three angles fully.&lt;/p&gt;

&lt;p&gt;OpenCV text line detection (morphological operations to find horizontal vs. vertical text lines): 4/8 correct. Patent figures have box borders, arrows, and structural lines that register as text lines. The algorithm couldn't tell a box outline from a line of text.&lt;/p&gt;

&lt;h2&gt;
  
  
  Test 7: Brute force scoring
&lt;/h2&gt;

&lt;p&gt;Instead of predicting the right angle, I ran all three and scored each result by counting meaningful detections, unique labels, and penalizing spam. Best score wins.&lt;/p&gt;

&lt;p&gt;6/8 correct. The two failures were ties—two angles produced the same number of detections with the same label lengths. The scoring couldn't tell "Decermine" from "Determine" because it wasn't checking whether the words were real English.&lt;/p&gt;
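&lt;p&gt;The scoring looked roughly like this (my sketch of the idea, with made-up weights):&lt;/p&gt;

```python
def score_ocr_result(detections, spam_tokens=("+",)):
    """Heuristic score for one rotation's OCR output: reward unique,
    meaningful labels; penalize spam detections like hallucinated '+'.

    detections: list of recognized strings from the grounding output.
    """
    meaningful = [d for d in detections if d.strip() and d not in spam_tokens]
    spam = len(detections) - len(meaningful)
    return len(set(meaningful)) + 0.5 * len(meaningful) - 2 * spam

def pick_rotation(results_by_angle):
    """results_by_angle: {angle: [detected strings]}. Best score wins."""
    return max(results_by_angle, key=lambda a: score_ocr_result(results_by_angle[a]))
```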

&lt;h2&gt;
  
  
  Test 8: GLM-OCR
&lt;/h2&gt;

&lt;p&gt;GLM-OCR is a newer, smaller model (0.9B parameters) that benchmarks higher than DeepSeek on standard OCR tasks. I tested both its "Text Recognition" and "Figure Recognition" prompts at all three angles.&lt;/p&gt;

&lt;p&gt;The "Figure Recognition" prompt was useless—it returned only "Figure X" on every sheet at every angle.&lt;/p&gt;

&lt;p&gt;The "Text Recognition" prompt was more interesting. On text-heavy sheets (the flowcharts, the dense instrument screen), it was rotation-proof—identical perfect output at 0, 90, and 270 degrees. DeepSeek can't do that.&lt;/p&gt;

&lt;p&gt;On diagram sheets with scattered reference numbers, results were inconsistent. Some sheets returned only "Figure X" at every angle (sheets 01, 05, 07—all the reference numerals ignored). Others partially worked but only at specific rotations—sheet 06 returned just "Figure 6" at 0 and 90 degrees, but at 270 it found 62, 61, 601, 602, 603, and 6. Sheet 04 found BP, 41, 42, 43 at 90/270 but not at 0.&lt;/p&gt;

&lt;p&gt;GLM-OCR seems to treat isolated small numbers near drawings as non-text. When the numbers are large and clearly part of the layout it picks them up, but the tiny scattered reference numerals that patents rely on get skipped. It's a different failure mode from DeepSeek's: DeepSeek at least attempts them (and sometimes gets them wrong), while GLM doesn't try.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tests 9–11: Known-label matching
&lt;/h2&gt;

&lt;p&gt;Patent reference numerals aren't unknown—the patent specification text defines them explicitly ("camera (110)", "distance measuring sub-system (120)", etc.). We already extract these in our app, so we have a list of every reference number that should appear in the figures.&lt;/p&gt;

&lt;p&gt;Test 9 used the "Figure" label as a filter. If DeepSeek reads "Fisure4" or "File 7" at a given angle, that angle is wrong. This reliably eliminated bad angles but couldn't break ties between two angles that both read "Figure" correctly.&lt;/p&gt;

&lt;p&gt;Tests 10–11 added known-label matching—after filtering by "Figure", count how many OCR detections match known reference numerals from the patent spec. The angle with the most matches wins.&lt;/p&gt;
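&lt;p&gt;The matching step is a couple of lines (function names are mine, and the numeral extraction from the spec is assumed to exist already):&lt;/p&gt;

```python
def count_known_matches(detections, known_numerals):
    """Count OCR detections that match reference numerals from the spec.

    known_numerals comes from parsing the specification text, e.g.
    'camera (110)' yields '110'.
    """
    return sum(1 for d in detections if d.strip() in known_numerals)

def pick_rotation_by_labels(results_by_angle, known_numerals):
    """Pick the angle whose detections match the most known numerals."""
    return max(
        results_by_angle,
        key=lambda a: count_known_matches(results_by_angle[a], known_numerals),
    )
```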

&lt;p&gt;This fixed the "61" vs "19" problem ("61" is a known reference numeral, "19" isn't). 7/8 correct. The single miss was a three-way tie where the same four numerals appeared at every angle.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsiht0damgxpdukaz8ley.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsiht0damgxpdukaz8ley.png" alt="Sheet 06 test 1 — sideways, 61 misread as 19" width="800" height="1134"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Sheet 06, test 1. Sideways. Reads "61" as "19".&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvswvz13w2z6f69cf0shw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvswvz13w2z6f69cf0shw.png" alt="Sheet 06 test 11 — upright, 61 correctly detected" width="800" height="564"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Sheet 06, test 11. Correct rotation selected via known-label matching. "61" detected correctly.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Test 12: Google Cloud Vision API
&lt;/h2&gt;

&lt;p&gt;I tried Google's Vision API as a sanity check. It got every sheet right on the first try with no rotation and no preprocessing. It found labels that DeepSeek missed at every angle—the tiny "1" and "BP" on the cluttered diagram, the "7" in the corner of the neural network sheet. Zero typos. Word-level bounding boxes in pixel coordinates. 0.3 seconds per image vs. 9+ seconds for three rotation passes locally.&lt;/p&gt;
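&lt;p&gt;Those word-level boxes come back in the response's &lt;code&gt;textAnnotations&lt;/code&gt; list (Vision REST field names: entry 0 is the full-page text, entries 1+ are individual words). A sketch of pulling out just the known reference numerals, run on an invented response for illustration:&lt;/p&gt;

```python
def numeral_boxes(text_annotations, known):
    """Map each detected known numeral to its pixel bounding box.

    text_annotations follows the Vision REST shape: entry 0 is the
    whole-page text, entries 1+ are word-level detections.
    """
    out = {}
    for word in text_annotations[1:]:
        label = word["description"].strip()
        if label in known:
            out[label] = [(v["x"], v["y"]) for v in word["boundingPoly"]["vertices"]]
    return out

# Invented sample in the REST response shape.
resp = [
    {"description": "Figure 6 61"},
    {"description": "61", "boundingPoly": {"vertices": [
        {"x": 512, "y": 300}, {"x": 540, "y": 300},
        {"x": 540, "y": 322}, {"x": 512, "y": 322}]}},
]
print(numeral_boxes(resp, {"61"}))
```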

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs6hhovaw0eyawsno3bi8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs6hhovaw0eyawsno3bi8.png" alt="Sheet 07 neural network diagram — DeepSeek test 11, missed the 7" width="800" height="460"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Sheet 07, DeepSeek (test 11). 13 labels. Missed "7" in the top right.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvopehd5np8iwbq5ky7jo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvopehd5np8iwbq5ky7jo.png" alt="Sheet 07 — Google Vision, all 14 labels including the 7" width="800" height="1389"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Sheet 07, Google Vision (test 12). All 14 labels including "7".&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Sheet&lt;/th&gt;
&lt;th&gt;Google Vision&lt;/th&gt;
&lt;th&gt;DeepSeek (test 11)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Flowchart&lt;/td&gt;
&lt;td&gt;Perfect&lt;/td&gt;
&lt;td&gt;Perfect&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Person in vehicle (rotated)&lt;/td&gt;
&lt;td&gt;Found BP, DR, "1" — 12 detections&lt;/td&gt;
&lt;td&gt;Missed 1, D, BP — 8 detections&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rotated flowchart&lt;/td&gt;
&lt;td&gt;Perfect, no rotation needed&lt;/td&gt;
&lt;td&gt;Typos without correct rotation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dense instrument screen&lt;/td&gt;
&lt;td&gt;63 words, caught everything&lt;/td&gt;
&lt;td&gt;33 detections at best angle&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Face profiles&lt;/td&gt;
&lt;td&gt;All labels, no rotation needed&lt;/td&gt;
&lt;td&gt;All labels, needed 270&amp;deg; rotation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Face/depth images&lt;/td&gt;
&lt;td&gt;All labels correct&lt;/td&gt;
&lt;td&gt;All labels correct&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Depth map diagram&lt;/td&gt;
&lt;td&gt;"61" correct immediately&lt;/td&gt;
&lt;td&gt;Read "61" as "19" without rotation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Neural network architecture&lt;/td&gt;
&lt;td&gt;All 14 labels including "7"&lt;/td&gt;
&lt;td&gt;13 labels, missed "7"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Google Vision API pricing: first 1,000 images/month free, $0.0015 per image after that. A typical patent has 5–15 figure sheets, so the free tier covers 65–200 patents per month. At scale, 10,000 patents would cost $75–225.&lt;/p&gt;
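&lt;p&gt;The arithmetic above as a throwaway estimator (the pricing figures are the ones quoted here; check Google's current price list before relying on them):&lt;/p&gt;

```python
FREE_IMAGES = 1000   # free tier, per month
PRICE = 0.0015       # USD per image after the free tier

def monthly_cost(patents, sheets_per_patent):
    images = patents * sheets_per_patent
    return max(0, images - FREE_IMAGES) * PRICE

print(monthly_cost(200, 5))      # 0.0, inside the free tier
print(monthly_cost(10_000, 15))  # high end of the $75-225 estimate
```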

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;If cloud is acceptable, Google Vision API is the obvious choice—one API call per image, no rotation logic, no scoring heuristics, no local GPU.&lt;/p&gt;

&lt;p&gt;If it has to stay local, DeepSeek-OCR with the test 11 pipeline works: run at three angles, filter by "Figure" quality, pick the angle that matches the most known reference numerals. 7/8 sheets correct, and the one miss is cosmetic (correct numerals, garbled text labels).&lt;/p&gt;

&lt;p&gt;Image preprocessing (binarization, contrast), Tesseract for rotation detection, OpenCV text line analysis, and cheap probe strategies didn't help on patent figures.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>ai</category>
      <category>computervision</category>
    </item>
  </channel>
</rss>
