<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Chee Yu Yang</title>
    <description>The latest articles on DEV Community by Chee Yu Yang (@chyuang).</description>
    <link>https://dev.to/chyuang</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F347440%2F002efdb8-608e-4ad4-8d4d-181644fc65cc.png</url>
      <title>DEV Community: Chee Yu Yang</title>
      <link>https://dev.to/chyuang</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/chyuang"/>
    <language>en</language>
    <item>
      <title>Failing to Train DeBERTa to Detect Patent Antecedent Basis Errors</title>
      <dc:creator>Chee Yu Yang</dc:creator>
      <pubDate>Thu, 26 Mar 2026 07:05:35 +0000</pubDate>
      <link>https://dev.to/chyuang/failing-to-train-deberta-to-detect-patent-antecedent-basis-errors-2p12</link>
      <guid>https://dev.to/chyuang/failing-to-train-deberta-to-detect-patent-antecedent-basis-errors-2p12</guid>
      <description>&lt;p&gt;Patent claims have a simple rule: introduce "a thing" before referring to "the thing." I fine-tuned DeBERTa-v3 on synthetic antecedent basis errors and hit 90% F1 on my test set. Then I evaluated on real USPTO examiner rejections from the PEDANTIC dataset and watched that number collapse to 14.5% F1, 8% recall. The model catches 8 out of 100 real errors. This writeup covers what I built, why it failed, and what the failure reveals about the gap between synthetic and real patent data.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;Antecedent basis errors are one of the most common reasons for indefiniteness rejections under 35 U.S.C. 112(b). They're also one of the most annoying—purely mechanical mistakes that slip through because patent claims get long, dependencies get tangled, and things get edited over time. You introduce "a sensor" in claim 1, then three claims later you write "the detector" meaning the same thing. Or you delete a clause during revision and forget that it was the antecedent for something downstream.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;"A device comprising a processor, wherein the controller manages memory."&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;↑ "the controller" appears out of nowhere—no antecedent&lt;/p&gt;

&lt;h3&gt;
  
  
  More examples of antecedent basis errors
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Ambiguous reference&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✗ "a first lever and a second lever... the lever is connected to..."&lt;br&gt;
✓ "a first lever and a second lever... the first lever is connected to..."&lt;/p&gt;

&lt;p&gt;When multiple elements share a name, "the lever" is ambiguous.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inconsistent descriptors&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✗ "a lever... the aluminum lever"&lt;br&gt;
✓ "an aluminum lever... the aluminum lever"&lt;/p&gt;

&lt;p&gt;Adding a descriptor that wasn't in the antecedent creates uncertainty.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compound terms&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✗ "a video display unit... the display"&lt;br&gt;
✓ "a video display unit... the video display unit"&lt;/p&gt;

&lt;p&gt;You can't reference part of a compound term on its own unless that part was introduced separately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implicit synonyms&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✗ "a sensor... the detector"&lt;br&gt;
✓ "a sensor... the sensor"&lt;/p&gt;

&lt;p&gt;Even if they mean the same thing, different words require separate antecedents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gray area: Morphological changes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;? "a controlled stream of fluid... the controlled fluid"&lt;/p&gt;

&lt;p&gt;Often acceptable because the scope is "reasonably ascertainable," but some examiners may still flag it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not an error: Inherent properties&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✓ "a sphere... the outer surface of said sphere"&lt;/p&gt;

&lt;p&gt;You don't need to explicitly introduce inherent components. A sphere obviously has an outer surface.&lt;/p&gt;

&lt;p&gt;When the USPTO catches it, you get an office action. You pay your attorney to draft a response. The application gets delayed. All for an error so mechanical, so tedious, that checking for it yourself feels almost insulting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prior work
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Commercial tools
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.patentclaimmaster.com/" rel="noopener noreferrer"&gt;ClaimMaster&lt;/a&gt; is a Microsoft Word plugin that parses claims and highlights potential antecedent basis issues: missing antecedents, ambiguous terms, singular/plural mismatches. They describe it as using "natural-language processing technologies" and have recently added LLM integration for drafting and analysis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.patentbots.com/" rel="noopener noreferrer"&gt;Patent Bots&lt;/a&gt; is a web-based alternative that highlights terms in green (has antecedent), yellow (warning), or red (missing antecedent).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.lexisnexisip.com/solutions/patent-drafting/patentoptimizer/" rel="noopener noreferrer"&gt;LexisNexis PatentOptimizer&lt;/a&gt; is the enterprise option, checking for antecedent basis and specification support.&lt;/p&gt;

&lt;h3&gt;
  
  
  Open source
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/cgupatent/antecedent-check" rel="noopener noreferrer"&gt;antecedent-check&lt;/a&gt; parses claims into noun phrases using Apache OpenNLP. &lt;a href="https://github.com/btrettel/plint" rel="noopener noreferrer"&gt;plint&lt;/a&gt; is a patent claim linter that requires manually marking up claims with special syntax for new elements and references.&lt;/p&gt;

&lt;h3&gt;
  
  
  Research
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://github.com/boschresearch/pedantic-patentsemtech" rel="noopener noreferrer"&gt;PEDANTIC dataset&lt;/a&gt; from Bosch Research contains 14,000 patent claims annotated with indefiniteness reasons, including antecedent basis errors. They tested logistic regression baselines and LLM agents (Qwen 2.5 32B and 72B) on binary classification of whether a claim is indefinite, with the best model achieving 60.3 AUROC. Antecedent basis was the most common error type, accounting for 36% of all rejections.&lt;/p&gt;

&lt;h2&gt;
  
  
  Approach
&lt;/h2&gt;

&lt;p&gt;I framed this as token classification: feed the model a claim with its parent claims as context, and have it label each token as the start of an error span, a continuation of one, or clean. I used &lt;a href="https://huggingface.co/microsoft/deberta-v3-base" rel="noopener noreferrer"&gt;DeBERTa-v3-base&lt;/a&gt; and evaluated against PEDANTIC's test split (885 samples with real examiner-flagged antecedent basis errors).&lt;/p&gt;

&lt;h2&gt;
  
  
  The training data problem
&lt;/h2&gt;

&lt;p&gt;PEDANTIC has labeled antecedent basis errors, but only ~2,500 training examples. I wanted more data and control over the error types. So I decided to generate synthetic training data.&lt;/p&gt;

&lt;p&gt;I started by pulling ~25,000 granted US patents (2019–2024) from &lt;a href="https://console.cloud.google.com/marketplace/product/google_patents_public_datasets/google-patents-public-data" rel="noopener noreferrer"&gt;Google Patents BigQuery&lt;/a&gt;. These are clean, examiner-approved claims with no antecedent basis errors—at least in theory. I parsed out the claim structure, built dependency chains so each dependent claim had its parent claims as context, and ended up with about 370,000 claim-context pairs.&lt;/p&gt;

&lt;p&gt;Then I wrote a corruption generator to inject synthetic errors. The idea: take clean claims and break them in ways that create antecedent basis errors, recording exactly which character spans are wrong.&lt;/p&gt;

&lt;h3&gt;
  
  
  The six corruption types I generate
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Remove antecedent&lt;/strong&gt; — Find "a sensor" in context, delete it. Now "the sensor" in the claim is orphaned.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Swap determiner&lt;/strong&gt; — Change "a controller" → "the controller" in the claim where no controller was introduced.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inject orphan&lt;/strong&gt; — Insert "the processor connected to" from a hardcoded list of 24 common patent nouns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plural mismatch&lt;/strong&gt; — "a sensor" in context → "the sensors" in claim. Singular introduced, plural referenced.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partial compound&lt;/strong&gt; — "a temperature sensor" introduced → "the sensor" referenced. Can't drop the modifier.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ordinal injection&lt;/strong&gt; — "a first valve" and "a second valve" exist → inject "the third valve".&lt;/li&gt;
&lt;/ol&gt;
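&lt;p&gt;As a concrete example, corruption type 2 (swap determiner) can be sketched like this. The function name and the span-recording convention are mine, not the actual generator's:&lt;/p&gt;

```python
import re

def swap_determiner(claim_text):
    """Corruption sketch: turn the first 'a/an X' into 'the X' so the
    definite article loses its antecedent, recording the corrupted
    character span for later BIO labeling.

    Returns (corrupted_text, (start, end)), or None if nothing to swap.
    """
    m = re.search(r"\b(a|an)\s+(\w+)", claim_text, re.IGNORECASE)
    if m is None:
        return None
    replacement = "the " + m.group(2)
    corrupted = claim_text[: m.start()] + replacement + claim_text[m.end():]
    span = (m.start(), m.start() + len(replacement))  # error span in corrupted text
    return corrupted, span
```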

&lt;p&gt;50/50 split between clean and corrupted examples. The corrupted ones got converted to BIO format (B-ERR for beginning of error span, I-ERR for inside, O for everything else) and fed to DeBERTa.&lt;/p&gt;
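&lt;p&gt;Converting a recorded character span into BIO tags looks roughly like this. The sketch uses whitespace tokenization to stay self-contained; the real pipeline labels DeBERTa subword tokens via the tokenizer's &lt;code&gt;offset_mapping&lt;/code&gt; instead:&lt;/p&gt;

```python
def spans_to_bio(text, error_spans):
    """Label whitespace tokens as B-ERR / I-ERR / O from character spans.

    Illustrative only; adjacent distinct spans would merge here, which
    a real implementation has to handle.
    """
    tokens, labels = [], []
    pos = 0
    prev_in_error = False
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        pos = end
        # Token overlaps an error span iff both intervals extend past
        # each other's start (written without the '<' operator).
        in_error = any(end > s and e > start for s, e in error_spans)
        if in_error and prev_in_error:
            labels.append("I-ERR")
        elif in_error:
            labels.append("B-ERR")
        else:
            labels.append("O")
        prev_in_error = in_error
        tokens.append(tok)
    return tokens, labels
```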

&lt;p&gt;After about 12,500 steps, the model hit &lt;strong&gt;90.84% F1&lt;/strong&gt; on my validation set. I was feeling pretty good about it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vsothzggy8ltjcgsaqg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vsothzggy8ltjcgsaqg.png" alt="Training progress showing F1, precision, recall reaching ~90% and loss decreasing over 12,500 steps" width="800" height="566"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Then I tested on real data
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://github.com/boschresearch/pedantic-patentsemtech" rel="noopener noreferrer"&gt;PEDANTIC dataset&lt;/a&gt; contains actual USPTO examiner rejections with the error spans labeled by hand. This is the real thing—885 test samples where examiners flagged antecedent basis issues in actual patent applications.&lt;/p&gt;

&lt;p&gt;Out of the box, my model hit 5% F1. That 90% on synthetic data? Gone. But before giving up, I wanted to understand what was actually happening inside. The model outputs a confidence score for each token—how sure is it that this word is part of an error? By default, it only flags tokens where it's more than 50% confident. What if I lowered that bar?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04kapxja78h23i99b4yf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04kapxja78h23i99b4yf.png" alt="Model performance vs confidence threshold showing precision stable at ~70% while recall drops from 8% to 1% as threshold increases" width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Look at the blue line. Precision barely moves—it stays around 70% no matter what threshold I pick. That's interesting. It means the model actually learned something real. When it speaks up, it's right about 70% of the time, whether I make it cautious or aggressive.&lt;/p&gt;

&lt;p&gt;The green line is the problem. At the default threshold, recall is 2.6%—catching almost nothing. Crank the threshold down to 0.05, and recall triples to 8.2%. Still bad, but less bad. F1 goes from 5% to 14.5%. I'll take it.&lt;/p&gt;
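&lt;p&gt;The sweep itself is simple bookkeeping. A sketch (the input names are hypothetical, and the real evaluation scores spans rather than raw tokens):&lt;/p&gt;

```python
def sweep_thresholds(token_probs, token_labels, thresholds):
    """Precision/recall/F1 at several confidence thresholds.

    token_probs:  per-token P(error) from the model
    token_labels: per-token gold labels, True where the token is an error
    """
    results = {}
    for t in thresholds:
        pred = [p >= t for p in token_probs]
        tp = sum(1 for p, g in zip(pred, token_labels) if p and g)
        fp = sum(1 for p, g in zip(pred, token_labels) if p and not g)
        fn = sum(1 for p, g in zip(pred, token_labels) if g and not p)
        prec = tp / (tp + fp) if (tp + fp) else 0.0
        rec = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        results[t] = (prec, rec, f1)
    return results
```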

&lt;p&gt;So the model isn't broken. It learned patterns that transfer to real data—just not very many of them. The synthetic corruption I generated covers maybe 8% of what USPTO examiners actually flag. The other 92%? Patterns I didn't think to simulate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Exploring the data
&lt;/h2&gt;

&lt;p&gt;Before trying to fix anything, I wanted to understand what's actually in PEDANTIC and what my model sees. I ran every test sample through the model at threshold 0.05 and categorized every prediction.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the model catches
&lt;/h3&gt;

&lt;p&gt;Of the 258 true positives, 92% start with "the"—phrases like "the user", "the source profile", "the matrix". This makes sense. My synthetic training data generates errors by swapping "a X" to "the X", so the model learned to flag definite articles that lack antecedents. When it sees "the controller" and can't find "a controller" earlier in the context, it speaks up.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it misses
&lt;/h3&gt;

&lt;p&gt;The 2,883 false negatives tell a different story. Only 38% start with "the". The rest?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bare nouns (no determiner)&lt;/strong&gt; — "widgets", "pattern", "text content" — 37% of all errors. The noun is used without "the" or "a" but still lacks proper introduction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"said X" patterns&lt;/strong&gt; — "said widget", "said data" — Patent-speak for "the". My model catches almost none of these despite 30% augmentation in training.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embedded errors&lt;/strong&gt; — "a location of the occluded area" — The error is "the occluded area" but PEDANTIC marks the whole phrase. Different annotation granularity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pronouns&lt;/strong&gt; — "it" — 20 cases. Never in my training data because I focused on noun phrases.&lt;/p&gt;

&lt;h3&gt;
  
  
  False positives
&lt;/h3&gt;

&lt;p&gt;The 152 false positives are mostly patent boilerplate: "The method of claim 8", "The system", "The apparatus". These always have antecedents—"method" refers to the claim itself, "system" to whatever was introduced in claim 1. The model doesn't understand claim structure, just surface patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data quality
&lt;/h3&gt;

&lt;p&gt;Some PEDANTIC annotations look like parsing artifacts. I found dozens of instances of "d widgets" and "idgets"—clearly broken spans from the word "widgets". A small percentage of false negatives have suspicious patterns: spans starting with spaces, single characters, or truncated words. Not a huge problem, but worth noting.&lt;/p&gt;

&lt;h2&gt;
  
  
  The gap
&lt;/h2&gt;

&lt;p&gt;Now the picture comes together. Go back to those six corruption types I wrote—every single one produces errors starting with "the" or "said". That's all the model ever saw during training.&lt;/p&gt;

&lt;p&gt;But real examiner rejections are messier. PEDANTIC breaks down like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;% of PEDANTIC&lt;/th&gt;
&lt;th&gt;Recall&lt;/th&gt;
&lt;th&gt;Training&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"the X"&lt;/td&gt;
&lt;td&gt;42.0%&lt;/td&gt;
&lt;td&gt;16.2%&lt;/td&gt;
&lt;td&gt;trained&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bare nouns&lt;/td&gt;
&lt;td&gt;37.3%&lt;/td&gt;
&lt;td&gt;1.0%&lt;/td&gt;
&lt;td&gt;never&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;embedded "the"&lt;/td&gt;
&lt;td&gt;9.6%&lt;/td&gt;
&lt;td&gt;9.3%&lt;/td&gt;
&lt;td&gt;never*&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"a/an X" phrases&lt;/td&gt;
&lt;td&gt;6.2%&lt;/td&gt;
&lt;td&gt;1.0%&lt;/td&gt;
&lt;td&gt;never&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"said X"&lt;/td&gt;
&lt;td&gt;4.3%&lt;/td&gt;
&lt;td&gt;1.5%&lt;/td&gt;
&lt;td&gt;30% aug&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pronouns&lt;/td&gt;
&lt;td&gt;0.6%&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;never&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;*Embedded patterns get partial credit when my "the X" detection overlaps with the annotated span&lt;/p&gt;

&lt;p&gt;But wait—if I trained on "the X" patterns, why is recall only 16%? Where did the other 84% go?&lt;/p&gt;

&lt;h2&gt;
  
  
  Digging into the 84%
&lt;/h2&gt;

&lt;p&gt;I dug into the model's actual predictions and found both distribution problems and outright bugs in my corruption logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Distribution mismatch on context.&lt;/strong&gt; 51% of missed "the X" errors are in independent claims (no parent context). My training data has 18% without context—not zero, but the distribution is off. The model learned to rely heavily on cross-referencing "the X" against "a X" in context. When context is missing or sparse, it's less confident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I trained on wrong labels.&lt;/strong&gt; Here's the bug: 7% of my training errors are "The X of claim N" patterns—things like "The method of claim 1, wherein...". These should &lt;em&gt;never&lt;/em&gt; be errors. The phrase "of claim 1" explicitly provides the antecedent. But my &lt;code&gt;remove_antecedent&lt;/code&gt; corruption doesn't understand this. It sees "a method" in context, "the method" in the claim, removes "a method", and labels "The method" as orphaned. Wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This created spurious patterns.&lt;/strong&gt; 10.8% of error tokens in my training data appear within 3 tokens after the [SEP] separator—right at the claim start. The model learned "claim start → likely error". On real data, it puts ~0.3 probability on [SEP] and claim-start boilerplate. Actual errors also get ~0.3 probability. The model can't distinguish real errors from the noise I accidentally taught it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real errors are more subtle.&lt;/strong&gt; My synthetic training creates obvious cases—I delete "a sensor" from context, making "the sensor" clearly orphaned. But 17% of PEDANTIC's "the X" errors have an "a X" that &lt;em&gt;does&lt;/em&gt; exist somewhere earlier. The examiner flagged it anyway because the reference was ambiguous, or referred to something different, or had a scope issue. I never generated these nuanced cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  The false positives
&lt;/h2&gt;

&lt;p&gt;The 152 false positives are almost all patent boilerplate: "the method", "the apparatus", "the system". Now I know why—I literally trained the model to flag these. Those 7% wrong labels taught it that claim-start phrases are errors. The model is doing exactly what I trained it to do. I just trained it wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real gap
&lt;/h2&gt;

&lt;p&gt;90% F1 on synthetic data, 14.5% on real data. The gap is my corruption logic. I accidentally trained the model on wrong labels, created spurious patterns around [SEP] and claim-starts, and never generated the subtle ambiguity cases that real examiners flag. The model architecture is fine. My training data was broken.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future work
&lt;/h2&gt;

&lt;p&gt;The model architecture isn't the problem—DeBERTa learned exactly what I taught it. The corruption logic is what's broken. There are a few clear directions to try:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix the bugs.&lt;/strong&gt; Filter out "The X of claim N" patterns from error labels. Add explicit negative examples where boilerplate phrases like "The method of claim 1" are labeled as NOT errors. Rebalance the context distribution to match PEDANTIC (more independent claims, fewer dependent).&lt;/p&gt;
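&lt;p&gt;The boilerplate filter is the easiest fix to sketch. Something like this (hypothetical names; a real filter also needs "claims 1-3" ranges and similar variants):&lt;/p&gt;

```python
import re

# Drop error labels on claim-reference boilerplate like
# "The method of claim 1", which always has an antecedent.
BOILERPLATE = re.compile(r"^the\s+\w+\s+of\s+claim\s+\d+", re.IGNORECASE)

def keep_error_label(claim_text, span):
    """Return False when the labeled span starts claim-reference boilerplate."""
    start, _ = span
    return not BOILERPLATE.match(claim_text[start:])
```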

&lt;p&gt;&lt;strong&gt;Cover more patterns.&lt;/strong&gt; 37% of real errors are bare nouns—"widgets", "pattern", "text content"—and I never generated any. Add corruptions that reference bare nouns without introduction. Generate "said X" errors more aggressively (30% augmentation wasn't enough for 1.5% recall). Add pronoun cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generate harder cases.&lt;/strong&gt; Right now I create obvious errors—delete "a sensor" and "the sensor" is clearly orphaned. But 17% of real errors have an antecedent that &lt;em&gt;exists&lt;/em&gt; but is ambiguous, refers to something different, or has scope issues. This probably requires either manual curation or a smarter generation strategy that intentionally creates near-miss patterns.&lt;/p&gt;

&lt;p&gt;Or skip synthetic generation entirely and fine-tune on PEDANTIC's training split. It's smaller (only ~2,500 antecedent basis examples vs my 185,000), but it's real data with real annotation patterns. The distribution would match by construction.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>nlp</category>
      <category>ai</category>
      <category>python</category>
    </item>
    <item>
      <title>OCR on Patent Figures with DeepSeek-OCR</title>
      <dc:creator>Chee Yu Yang</dc:creator>
      <pubDate>Thu, 26 Mar 2026 00:58:21 +0000</pubDate>
      <link>https://dev.to/chyuang/ocr-on-patent-figures-with-deepseek-ocr-5aci</link>
      <guid>https://dev.to/chyuang/ocr-on-patent-figures-with-deepseek-ocr-5aci</guid>
      <description>&lt;p&gt;12 approaches to extracting text and reference numbers from patent figure sheets, tested against 8 sheets from US11423567B2 (a facial recognition depth mapping system). Flowcharts, dense instrument screenshots, architectural diagrams with tiny scattered reference numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The figures
&lt;/h2&gt;

&lt;p&gt;Patent figures have text at multiple orientations (some sheets are rotated 90 degrees), tiny reference numbers like "41" or "7025" scattered among drawings, dense data screens with white text on dark backgrounds, structural elements (boxes, arrows, lines) that look like text to a machine, and "Figure X" labels often printed sideways.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp2wtgxjp1yffwqucypj8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp2wtgxjp1yffwqucypj8.png" alt="Patent figure sheet 01 — person in vehicle with camera system, rotated 90 degrees with scattered reference numbers" width="800" height="867"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Sheet 01 from US11423567B2. The whole thing is rotated 90 degrees, with labels like "BP", "DR", "1", and "D" scattered around the drawing.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;DeepSeek-OCR is a 3.3B parameter vision model that runs locally. It has a grounding mode that returns bounding boxes alongside text—the prompt &lt;code&gt;&amp;lt;|grounding|&amp;gt;OCR this image.&lt;/code&gt; produces output like &lt;code&gt;&amp;lt;|ref|&amp;gt;camera 110&amp;lt;/ref&amp;gt;&amp;lt;|det|&amp;gt;[[412, 8, 455, 63]]&amp;lt;/det&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Test 1: Baseline
&lt;/h2&gt;

&lt;p&gt;Raw images into DeepSeek-OCR. Clean upright flowcharts came out perfect:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fndrvnch8ggawt8df23vq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fndrvnch8ggawt8df23vq.png" alt="Sheet 00 flowchart — all labels and text detected correctly with bounding boxes" width="800" height="707"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Sheet 00, test 1. Clean flowchart. Every label and text block detected correctly.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Everything else had problems—rotated text came back garbled ("Accurling" instead of "Acquiring"), it read "61" as "19" on one sheet, and small labels near drawings were consistently missed. Two sheets perfect, six with errors.&lt;/p&gt;

&lt;p&gt;The dense instrument screenshot was the worst—grid marks triggered 225 hallucinated "+" detections:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqwuqjatkslya53sh873r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqwuqjatkslya53sh873r.png" alt="Dense instrument screenshot with hundreds of colored bounding boxes on grid marks — hallucinated detections" width="800" height="1161"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Sheet 03, test 1. Every colored box is an OCR detection. Most of the ones on the right side are hallucinated "+" symbols from grid marks.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Tests 2–3: Preprocessing
&lt;/h2&gt;

&lt;p&gt;Binarization (converting to pure black and white, boosting contrast) gave identical results. The images were already clean line drawings—nothing to clean up.&lt;/p&gt;

&lt;p&gt;Tesseract OSD for rotation detection got confused by the sideways "Figure X" labels on otherwise upright sheets and rotated things that shouldn't have been rotated. Results got worse.&lt;/p&gt;

&lt;h2&gt;
  
  
  Test 4: Manual rotation
&lt;/h2&gt;

&lt;p&gt;Some patent figure sheets are printed in landscape orientation—the entire page is rotated 90 degrees. DeepSeek-OCR doesn't handle this well. At the wrong angle, it either misses text entirely or garbles it ("Accurling" instead of "Acquiring"). At the right angle, the same text comes through perfectly.&lt;/p&gt;

&lt;p&gt;I ran every sheet at three angles (0, 90, 270 degrees) and manually compared. Sheet 01 went from 2 usable detections at 0 degrees to 32 at 270 degrees:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq5o4nlxhlxy3dysn3xrm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq5o4nlxhlxy3dysn3xrm.png" alt="Sheet 01 at 0 degrees — sideways, few detections" width="800" height="867"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;0 degrees. Sideways. Finds a few large labels (100, 10, 110, 120, 121) but misses BP, DR, D, 1, and most of the small text.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffssg6xtehf1dy0trwrvr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffssg6xtehf1dy0trwrvr.png" alt="Sheet 01 at 270 degrees — upright, all labels detected" width="800" height="737"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;270 degrees. Upright. All labels detected—BP, DR, D, 1, 10, 100, 110, 111, 120, 121. "Figure 1" read correctly too.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The problem was figuring out which rotation to use automatically. Not every sheet needs rotating, and rotating an already-upright sheet makes things worse.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tests 5–6: Automatic rotation detection
&lt;/h2&gt;

&lt;p&gt;Cheap probes (running OCR with only 128 output tokens at each angle): 5/8 correct. The probes were too short to distinguish close cases, and not meaningfully faster than running all three angles fully.&lt;/p&gt;

&lt;p&gt;OpenCV text line detection (morphological operations to find horizontal vs. vertical text lines): 4/8 correct. Patent figures have box borders, arrows, and structural lines that register as text lines. The algorithm couldn't tell a box outline from a line of text.&lt;/p&gt;

&lt;h2&gt;
  
  
  Test 7: Brute force scoring
&lt;/h2&gt;

&lt;p&gt;Instead of predicting the right angle, I ran all three and scored each result by counting meaningful detections, unique labels, and penalizing spam. Best score wins.&lt;/p&gt;

&lt;p&gt;6/8 correct. The two failures were ties—two angles produced the same number of detections with the same label lengths. The scoring couldn't tell "Decermine" from "Determine" because it wasn't checking whether the words were real English.&lt;/p&gt;
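&lt;p&gt;The scoring looked roughly like this (my sketch of the idea, with made-up weights):&lt;/p&gt;

```python
def score_ocr_result(detections, spam_tokens=("+",)):
    """Heuristic score for one rotation's OCR output: reward unique,
    meaningful labels; penalize spam detections like hallucinated '+'.

    detections: list of recognized strings from the grounding output.
    """
    meaningful = [d for d in detections if d.strip() and d not in spam_tokens]
    spam = len(detections) - len(meaningful)
    return len(set(meaningful)) + 0.5 * len(meaningful) - 2 * spam

def pick_rotation(results_by_angle):
    """results_by_angle: {angle: [detected strings]}. Best score wins."""
    return max(results_by_angle, key=lambda a: score_ocr_result(results_by_angle[a]))
```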

&lt;h2&gt;
  
  
  Test 8: GLM-OCR
&lt;/h2&gt;

&lt;p&gt;GLM-OCR is a newer, smaller model (0.9B parameters) that benchmarks higher than DeepSeek on standard OCR tasks. I tested both its "Text Recognition" and "Figure Recognition" prompts at all three angles.&lt;/p&gt;

&lt;p&gt;The "Figure Recognition" prompt was useless—it returned only "Figure X" on every sheet at every angle.&lt;/p&gt;

&lt;p&gt;The "Text Recognition" prompt was more interesting. On text-heavy sheets (the flowcharts, the dense instrument screen), it was rotation-proof—identical perfect output at 0, 90, and 270 degrees. DeepSeek can't do that.&lt;/p&gt;

&lt;p&gt;On diagram sheets with scattered reference numbers, results were inconsistent. Some sheets returned only "Figure X" at every angle (sheets 01, 05, 07—all the reference numerals ignored). Others partially worked but only at specific rotations—sheet 06 returned just "Figure 6" at 0 and 90 degrees, but at 270 it found 62, 61, 601, 602, 603, and 6. Sheet 04 found BP, 41, 42, 43 at 90/270 but not at 0.&lt;/p&gt;

&lt;p&gt;GLM-OCR seems to treat isolated small numbers near drawings as non-text. When the numbers are large and clearly part of the layout it picks them up, but the tiny scattered reference numerals that patents rely on get skipped. It's a different failure mode from DeepSeek's: DeepSeek at least attempts them (and sometimes gets them wrong), while GLM doesn't try.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tests 9–11: Known-label matching
&lt;/h2&gt;

&lt;p&gt;Patent reference numerals aren't unknown—the patent specification text defines them explicitly ("camera (110)", "distance measuring sub-system (120)", etc.). We already extract these in our app, so we have a list of every reference number that should appear in the figures.&lt;/p&gt;

&lt;p&gt;Test 9 used the "Figure" label as a filter. If DeepSeek reads "Fisure4" or "File 7" at a given angle, that angle is wrong. This reliably eliminated bad angles but couldn't break ties between two angles that both read "Figure" correctly.&lt;/p&gt;

&lt;p&gt;Tests 10–11 added known-label matching—after filtering by "Figure", count how many OCR detections match known reference numerals from the patent spec. The angle with the most matches wins.&lt;/p&gt;
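&lt;p&gt;The matching step is a couple of lines (function names are mine, and the numeral extraction from the spec is assumed to exist already):&lt;/p&gt;

```python
def count_known_matches(detections, known_numerals):
    """Count OCR detections that match reference numerals from the spec.

    known_numerals comes from parsing the specification text, e.g.
    'camera (110)' yields '110'.
    """
    return sum(1 for d in detections if d.strip() in known_numerals)

def pick_rotation_by_labels(results_by_angle, known_numerals):
    """Pick the angle whose detections match the most known numerals."""
    return max(
        results_by_angle,
        key=lambda a: count_known_matches(results_by_angle[a], known_numerals),
    )
```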

&lt;p&gt;This fixed the "61" vs "19" problem ("61" is a known reference numeral, "19" isn't). 7/8 correct. The single miss was a three-way tie where the same four numerals appeared at every angle.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsiht0damgxpdukaz8ley.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsiht0damgxpdukaz8ley.png" alt="Sheet 06 test 1 — sideways, 61 misread as 19" width="800" height="1134"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Sheet 06, test 1. Sideways. Reads "61" as "19".&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvswvz13w2z6f69cf0shw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvswvz13w2z6f69cf0shw.png" alt="Sheet 06 test 11 — upright, 61 correctly detected" width="800" height="564"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Sheet 06, test 11. Correct rotation selected via known-label matching. "61" detected correctly.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Test 12: Google Cloud Vision API
&lt;/h2&gt;

&lt;p&gt;I tried Google's Vision API as a sanity check. It got every sheet right on the first try with no rotation and no preprocessing. It found labels that DeepSeek missed at every angle—the tiny "1" and "BP" on the cluttered diagram, the "7" in the corner of the neural network sheet. Zero typos. Word-level bounding boxes in pixel coordinates. 0.3 seconds per image vs. 9+ seconds for three rotation passes locally.&lt;/p&gt;
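&lt;p&gt;Those word-level boxes come back in the response's &lt;code&gt;textAnnotations&lt;/code&gt; list (Vision REST field names: entry 0 is the full-page text, entries 1+ are individual words). A sketch of pulling out just the known reference numerals, run on an invented response for illustration:&lt;/p&gt;

```python
def numeral_boxes(text_annotations, known):
    """Map each detected known numeral to its pixel bounding box.

    text_annotations follows the Vision REST shape: entry 0 is the
    whole-page text, entries 1+ are word-level detections.
    """
    out = {}
    for word in text_annotations[1:]:
        label = word["description"].strip()
        if label in known:
            out[label] = [(v["x"], v["y"]) for v in word["boundingPoly"]["vertices"]]
    return out

# Invented sample in the REST response shape.
resp = [
    {"description": "Figure 6 61"},
    {"description": "61", "boundingPoly": {"vertices": [
        {"x": 512, "y": 300}, {"x": 540, "y": 300},
        {"x": 540, "y": 322}, {"x": 512, "y": 322}]}},
]
print(numeral_boxes(resp, {"61"}))
```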

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs6hhovaw0eyawsno3bi8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs6hhovaw0eyawsno3bi8.png" alt="Sheet 07 neural network diagram — DeepSeek test 11, missed the 7" width="800" height="460"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Sheet 07, DeepSeek (test 11). 13 labels. Missed "7" in the top right.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvopehd5np8iwbq5ky7jo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvopehd5np8iwbq5ky7jo.png" alt="Sheet 07 — Google Vision, all 14 labels including the 7" width="800" height="1389"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Sheet 07, Google Vision (test 12). All 14 labels including "7".&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Sheet&lt;/th&gt;
&lt;th&gt;Google Vision&lt;/th&gt;
&lt;th&gt;DeepSeek (test 11)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Flowchart&lt;/td&gt;
&lt;td&gt;Perfect&lt;/td&gt;
&lt;td&gt;Perfect&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Person in vehicle (rotated)&lt;/td&gt;
&lt;td&gt;Found BP, DR, "1" — 12 detections&lt;/td&gt;
&lt;td&gt;Missed 1, D, BP — 8 detections&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rotated flowchart&lt;/td&gt;
&lt;td&gt;Perfect, no rotation needed&lt;/td&gt;
&lt;td&gt;Typos without correct rotation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dense instrument screen&lt;/td&gt;
&lt;td&gt;63 words, caught everything&lt;/td&gt;
&lt;td&gt;33 detections at best angle&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Face profiles&lt;/td&gt;
&lt;td&gt;All labels, no rotation needed&lt;/td&gt;
&lt;td&gt;All labels, needed 270&amp;deg; rotation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Face/depth images&lt;/td&gt;
&lt;td&gt;All labels correct&lt;/td&gt;
&lt;td&gt;All labels correct&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Depth map diagram&lt;/td&gt;
&lt;td&gt;"61" correct immediately&lt;/td&gt;
&lt;td&gt;Read "61" as "19" without rotation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Neural network architecture&lt;/td&gt;
&lt;td&gt;All 14 labels including "7"&lt;/td&gt;
&lt;td&gt;13 labels, missed "7"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Google Vision API pricing: first 1,000 images/month free, $0.0015 per image after that. A typical patent has 5–15 figure sheets, so the free tier covers 65–200 patents per month. At scale, 10,000 patents would cost $75–225.&lt;/p&gt;
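&lt;p&gt;The arithmetic above as a throwaway estimator (the pricing figures are the ones quoted here; check Google's current price list before relying on them):&lt;/p&gt;

```python
FREE_IMAGES = 1000   # free tier, per month
PRICE = 0.0015       # USD per image after the free tier

def monthly_cost(patents, sheets_per_patent):
    images = patents * sheets_per_patent
    return max(0, images - FREE_IMAGES) * PRICE

print(monthly_cost(200, 5))      # 0.0, inside the free tier
print(monthly_cost(10_000, 15))  # high end of the $75-225 estimate
```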

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;If cloud is acceptable, Google Vision API is the obvious choice—one API call per image, no rotation logic, no scoring heuristics, no local GPU.&lt;/p&gt;

&lt;p&gt;If it has to stay local, DeepSeek-OCR with the test 11 pipeline works: run at three angles, filter by "Figure" quality, pick the angle that matches the most known reference numerals. 7/8 sheets correct, and the one miss is cosmetic (correct numerals, garbled text labels).&lt;/p&gt;

&lt;p&gt;Image preprocessing (binarization, contrast), Tesseract for rotation detection, OpenCV text line analysis, and cheap probe strategies didn't help on patent figures.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>ai</category>
      <category>computervision</category>
    </item>
  </channel>
</rss>
