<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sanjay Chauhan</title>
    <description>The latest articles on DEV Community by Sanjay Chauhan (@san64777).</description>
    <link>https://dev.to/san64777</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3973975%2F812a2bda-ff71-43d7-9593-cf3caf97f341.jpg</url>
      <title>DEV Community: Sanjay Chauhan</title>
      <link>https://dev.to/san64777</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/san64777"/>
    <language>en</language>
    <item>
      <title>Your scraper says 200 OK. I measured how often it's lying.</title>
      <dc:creator>Sanjay Chauhan</dc:creator>
      <pubDate>Mon, 08 Jun 2026 13:57:28 +0000</pubDate>
      <link>https://dev.to/san64777/your-scraper-says-200-ok-i-measured-how-often-its-lying-3d0h</link>
      <guid>https://dev.to/san64777/your-scraper-says-200-ok-i-measured-how-often-its-lying-3d0h</guid>
      <description>&lt;p&gt;You write a scraper. It hits a URL, gets back &lt;code&gt;200 OK&lt;/code&gt;, you check &lt;code&gt;resp.status_code&lt;/code&gt;, it is 200, so you call &lt;code&gt;save(resp)&lt;/code&gt; and move on. The pipeline runs nightly. Everything is green. You trust it, because the whole point of a status code is that it tells you what happened.&lt;/p&gt;

&lt;p&gt;Three days later a downstream report looks subtly wrong. A column is empty, or a count is off. You start the long walk back to find out which page quietly handed you junk instead of the article you asked for. It was a login wall. Or a JavaScript app-shell with no content rendered yet. Or a soft-404 dressed up as a real page. The status code said success, your code believed it, and the junk got stored as data.&lt;/p&gt;

&lt;p&gt;In 2026 a &lt;code&gt;200 OK&lt;/code&gt; is not ground truth. It can just as easily be an anti-bot challenge page, a login wall, a soft-404, or an empty JavaScript shell that never rendered. Retry logic keyed on the status code never notices the difference, so the junk goes straight into your store.&lt;/p&gt;

&lt;p&gt;I wanted to see this happen on real, named sites rather than argue it in the abstract, so I went and measured it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I found
&lt;/h2&gt;

&lt;p&gt;I took three popular Python fetchers (&lt;code&gt;requests&lt;/code&gt;, &lt;code&gt;curl_cffi&lt;/code&gt;, &lt;code&gt;scrapling&lt;/code&gt;), pointed them at a mix of control sites and protected ones, and ran 3 requests each. Then came the part that matters: I captured each raw body and labeled what every fetcher &lt;em&gt;actually&lt;/em&gt; got back &lt;strong&gt;independently&lt;/strong&gt;, by reading the stored bytes, not by trusting the status line and not by trusting veriscrape (the library introduced further down). Only after labeling did I compare. Every result was stable, 3 of 3.&lt;/p&gt;

&lt;p&gt;A "silent failure" here means one specific thing: a &lt;code&gt;2xx&lt;/code&gt; response whose body is junk (a login wall, a JS app-shell, an empty page) that gets reported as success with no signal that anything is wrong. The cleanest, least-disputable case:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;discord.com/app&lt;/code&gt; and &lt;code&gt;web.telegram.org&lt;/code&gt; return &lt;code&gt;200&lt;/code&gt; with an &lt;strong&gt;empty JavaScript app-shell&lt;/strong&gt;: a mount point and a wall of scripts, zero server-rendered content. Every status-code-only fetcher (&lt;code&gt;requests&lt;/code&gt;, &lt;code&gt;curl_cffi&lt;/code&gt;, &lt;code&gt;scrapling&lt;/code&gt;) stores that husk as a successful page. The HTML loads, the content does not.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a category-wide, structural blind spot, not a knock on any one tool. Any fetcher that decides success from the status line stores that &lt;code&gt;200&lt;/code&gt; as good data, because a &lt;code&gt;200&lt;/code&gt; with a skeleton in the body is, by every status-code measure, a success.&lt;/p&gt;

&lt;p&gt;The independent labeling earned its keep in an unflattering way, and I am keeping that on the record because it is the whole point. An earlier draft of this writeup reported one cell as a competitor's (&lt;code&gt;scrapling&lt;/code&gt;) "silent failure" on &lt;code&gt;g2.com&lt;/code&gt;. Re-labeling from the captured body showed that was wrong: it was a &lt;em&gt;veriscrape&lt;/em&gt; false positive. The real, content-rich G2 homepage had come back (the anti-bot let the fetch through), and veriscrape had mislabeled it as a login wall. I fixed the detector (that homepage now classifies &lt;code&gt;OK&lt;/code&gt;) and retracted the claim. That is the thesis turned on its author: the tool exists to flag silently-wrong data, and the discipline has to apply to its own output first. If it cannot survive that, it has no business judging anyone else's fetch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why retry logic cannot see it
&lt;/h2&gt;

&lt;p&gt;Here is the shape of almost every fetch-and-store loop I have ever written or reviewed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fetcher&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="c1"&gt;# looks fine, ships it
&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;retry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="c1"&gt;# only fires on 4xx / 5xx
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The branch that matters never runs. A login wall is served with &lt;code&gt;200&lt;/code&gt;. A JS shell is served with &lt;code&gt;200&lt;/code&gt;. A DataDome gate can be served with &lt;code&gt;200&lt;/code&gt;. The status code is doing exactly what it is defined to do (the HTTP transaction succeeded), and your code is reading a meaning into it that was never there. So the &lt;code&gt;if&lt;/code&gt; is true, &lt;code&gt;save()&lt;/code&gt; runs, and the corruption is now in your store. There is no error, no exception, no log line.&lt;/p&gt;

&lt;p&gt;You cannot fix this with more retries, because retrying a &lt;code&gt;200&lt;/code&gt; login wall just gives you the same &lt;code&gt;200&lt;/code&gt; login wall, stably, 3 of 3. The only way to catch it is to look at the body, the headers, and the cookies, and decide what the response &lt;em&gt;is&lt;/em&gt;.&lt;/p&gt;
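
&lt;p&gt;To make "look at the body" concrete, here is a minimal sketch using only the standard library. It is not a production classifier: the names &lt;code&gt;visible_text&lt;/code&gt; and &lt;code&gt;looks_like_content&lt;/code&gt; and the 200-character threshold are assumptions chosen for illustration. It strips scripts and styles, then asks whether any real server-rendered text is left.&lt;/p&gt;

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text that sits outside script, style, and similar elements."""

    _SKIP = {"script", "style", "noscript", "template", "title"}

    def __init__(self) -> None:
        super().__init__()
        self._depth = 0  # how many skipped elements we are currently inside
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._depth == 0 and text:
            self.chunks.append(text)


def visible_text(html: str) -> str:
    """Return the server-rendered text a reader would actually see."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def looks_like_content(status: int, body: str, min_chars: int = 200) -> bool:
    """True only for a 200 whose visible text reaches min_chars (an assumed cutoff)."""
    return status == 200 and len(visible_text(body)) >= min_chars
```

&lt;p&gt;Swapping &lt;code&gt;resp.status_code == 200&lt;/code&gt; for a check like this already catches the empty app-shell case, because a mount point plus a wall of scripts leaves no visible text. It does nothing for login walls or padded soft-404s, which need their own markers.&lt;/p&gt;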

&lt;h2&gt;
  
  
  The fix: get the bytes plus a verdict
&lt;/h2&gt;

&lt;p&gt;That decision is what I built. It is a library called veriscrape: a verified-fetch primitive that returns the bytes &lt;em&gt;plus&lt;/em&gt; a portable, deterministic trust verdict, so the moment your data is silently wrong you have a signal at the fetch layer instead of a wrong report three days later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;veriscrape
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;veriscrape&lt;/span&gt;

&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;veriscrape&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://discord.com/app&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;      &lt;span class="c1"&gt;# 'EMPTY_SHELL'
&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cause&lt;/span&gt;        &lt;span class="c1"&gt;# 'js_app_shell'  (or 'datadome', 'login_wall', 'cloudflare_challenge', ...)
&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt;   &lt;span class="c1"&gt;# 0.0 to 1.0
&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;evidence&lt;/span&gt;     &lt;span class="c1"&gt;# the exact markers matched, for audit
&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ok&lt;/span&gt;           &lt;span class="c1"&gt;# True ONLY when r.verdict is OK
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;get()&lt;/code&gt; is a drop-in for &lt;code&gt;requests.get&lt;/code&gt;. It fetches with curl_cffi (browser-like TLS, so you are not flagged on the TLS fingerprint alone), then runs the deterministic classifier over the response. The verdict is one of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OK  BLOCKED  CHALLENGE  HONEYPOT  SOFT_404  LOGIN_WALL  EMPTY_SHELL  UNVERIFIED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The taxonomy is the whole point. Instead of one boolean &lt;code&gt;status_code == 200&lt;/code&gt;, you get a named reason for &lt;em&gt;what the response actually is&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If you already have a fetch stack, you do not have to replace it. Classify what you already pulled, without re-fetching:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;veriscrape.adapters&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;from_requests&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;from_response&lt;/span&gt;
&lt;span class="c1"&gt;# from_requests(resp) for a requests.Response
# from_response(...) for httpx, Playwright, or any stack
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is also a Scrapy middleware (&lt;code&gt;VeriscrapeMiddleware&lt;/code&gt;) and a CLI (&lt;code&gt;veriscrape check &amp;lt;url&amp;gt;&lt;/code&gt;, exit code 0 for OK or UNVERIFIED, 1 on a problem) for pipelines and CI.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works under the hood
&lt;/h2&gt;

&lt;p&gt;It is deterministic. No LLM. The verdict is computed from status, headers, cookies, and body, which means it is reproducible and auditable: &lt;code&gt;r.evidence&lt;/code&gt; shows you exactly which markers matched, so you can argue with any verdict.&lt;/p&gt;

&lt;p&gt;The core rule I keep coming back to is the &lt;strong&gt;two-key rule&lt;/strong&gt;. A vendor fingerprint &lt;em&gt;alone&lt;/em&gt; is not a verdict. &lt;code&gt;Server: cloudflare&lt;/code&gt;, a &lt;code&gt;cf-ray&lt;/code&gt; header, a &lt;code&gt;_px&lt;/code&gt; cookie, an &lt;code&gt;x-kpsdk-*&lt;/code&gt; header: all of these show up on perfectly normal allowed pages too. If you treat vendor presence as "challenge," you will flag half the internet. So a real verdict needs two keys: the vendor gate &lt;strong&gt;and&lt;/strong&gt; a challenge-or-block-specific marker on a genuine mitigation response. One key without the other is not a verdict.&lt;/p&gt;
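
&lt;p&gt;As a rough sketch of the two-key idea for a single vendor (this is not veriscrape's detector, and the function name, the evidence strings, and the body marker list are assumptions for illustration): key one is the Cloudflare fingerprint, key two is a challenge-specific signal such as the &lt;code&gt;cf-mitigated: challenge&lt;/code&gt; response header. Only both together produce a verdict.&lt;/p&gt;

```python
from collections.abc import Mapping
from typing import Optional

# Illustrative body marker only; a real detector would carry a vetted list.
CHALLENGE_BODY_MARKERS = ("just a moment...",)


def cloudflare_two_key(
    headers: Mapping[str, str], body: str
) -> tuple[Optional[str], list[str]]:
    """Return (verdict, evidence). The verdict is None unless both keys are present."""
    h = {name.lower(): value.lower() for name, value in headers.items()}
    evidence: list[str] = []

    # Key 1: the vendor sits in front of this response at all.
    vendor = h.get("server") == "cloudflare" or "cf-ray" in h
    if vendor:
        evidence.append("vendor:cloudflare")

    # Key 2: something specific to a challenge or block response.
    marker = False
    if h.get("cf-mitigated") == "challenge":
        marker = True
        evidence.append("header:cf-mitigated=challenge")
    lowered = body.lower()
    for text in CHALLENGE_BODY_MARKERS:
        if text in lowered:
            marker = True
            evidence.append("body:" + text)

    verdict = "CHALLENGE" if vendor and marker else None
    return verdict, evidence
```

&lt;p&gt;An ordinary page served through Cloudflare carries key one only, so it yields no verdict, and a page that merely contains the marker text without the vendor fingerprint yields none either. The evidence list is returned in both cases so a reader can see what matched.&lt;/p&gt;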

&lt;p&gt;Coverage today is 14 negative detectors plus the affirmative &lt;code&gt;OK&lt;/code&gt; detector. The negatives are 7 anti-bot vendors (Cloudflare, DataDome, Akamai, PerimeterX/HUMAN, Kasada, Imperva/Incapsula, F5 BIG-IP ASM), 3 CAPTCHA gates (reCAPTCHA, Turnstile, hCaptcha), and honeypot, login-wall, soft-404, and empty-shell. The affirmative &lt;code&gt;OK&lt;/code&gt; detector is the one that ships a green light: it is the only path to &lt;code&gt;r.ok&lt;/code&gt; being &lt;code&gt;True&lt;/code&gt;, and it is deliberately the hardest to earn (more on that below).&lt;/p&gt;

&lt;p&gt;Every detector ships allowed-page fixtures: real pages from the same vendors that are &lt;em&gt;not&lt;/em&gt; challenges. The test suite fails if any of those fixtures trips a verdict. The whole product is the claim "I will not lie to you," so the false-positive gate is the part I care about most.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest caveat
&lt;/h2&gt;

&lt;p&gt;Read this part before you adopt anything.&lt;/p&gt;

&lt;p&gt;veriscrape &lt;strong&gt;abstains rather than guesses&lt;/strong&gt;, and the affirmative &lt;code&gt;OK&lt;/code&gt; verdict that ships today is built around that. &lt;code&gt;get()&lt;/code&gt; returns a positive &lt;code&gt;OK&lt;/code&gt; only for a 200 that is a real document (it has a &lt;code&gt;&amp;lt;title&amp;gt;&lt;/code&gt;) with substantial server-rendered visible text, the inverse of an empty shell. Anything short, ambiguous, or disqualified comes back &lt;code&gt;UNVERIFIED&lt;/code&gt;, not &lt;code&gt;OK&lt;/code&gt;, and &lt;code&gt;r.ok&lt;/code&gt; is &lt;code&gt;True&lt;/code&gt; only on that affirmative &lt;code&gt;OK&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;What that means in practice: a padded soft-404, a paywall teaser, a geo or maintenance or age-gate page, a suspended or error page served as a &lt;code&gt;200&lt;/code&gt;, none of these get blessed; they stay &lt;code&gt;UNVERIFIED&lt;/code&gt;. The detector keys on affirmative evidence (real document, substantial server-rendered text) and disqualifies long-but-bad pages, because length alone is not proof of content.&lt;/p&gt;

&lt;p&gt;That is on purpose. The design rule is &lt;strong&gt;abstain over guess&lt;/strong&gt;: I would rather return &lt;code&gt;UNVERIFIED&lt;/code&gt; than emit a confident-but-wrong &lt;code&gt;OK&lt;/code&gt;, because a confident-but-wrong success is the exact failure this whole thing exists to prevent. It is the failure mode that costs you three days, and I am not going to reproduce it inside the tool meant to catch it. &lt;code&gt;UNVERIFIED&lt;/code&gt; is a real verdict and it is not &lt;code&gt;ok&lt;/code&gt;. If that tradeoff does not fit your pipeline, that is fair, and now you know it up front.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it, and please try to break it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;veriscrape
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Repo: &lt;a href="https://github.com/san64777/veriscrape" rel="noopener noreferrer"&gt;https://github.com/san64777/veriscrape&lt;/a&gt; (Apache-2.0, Python 3.12+)&lt;/li&gt;
&lt;li&gt;The benchmark, dated, with the raw cases and the correction banner for the retracted claim: &lt;a href="https://github.com/san64777/veriscrape/blob/main/benchmark/results-2026-06-07.md" rel="noopener noreferrer"&gt;https://github.com/san64777/veriscrape/blob/main/benchmark/results-2026-06-07.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The deeper reasoning (why these detectors, why deterministic, the two-key rule): &lt;a href="https://github.com/san64777/veriscrape/blob/main/WHY.md" rel="noopener noreferrer"&gt;https://github.com/san64777/veriscrape/blob/main/WHY.md&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reproduce it yourself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run &lt;span class="nt"&gt;--extra&lt;/span&gt; benchmark python &lt;span class="nt"&gt;-m&lt;/span&gt; benchmark.run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What I most want is for you to break it. If you find a false positive (a normal page that gets flagged) or a false negative (junk that comes back &lt;code&gt;UNVERIFIED&lt;/code&gt; when a detector should have caught it), open an issue with the URL. The detectors are pure functions of &lt;code&gt;(status, headers, body)&lt;/code&gt;, so a captured response is enough to reproduce and add as a fixture, and the evidence dict will tell us both exactly which marker tripped. That is the conversation I want to have.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I am Sanjay Chauhan. I build reliability and data-integrity primitives for data pipelines. veriscrape is open source under Apache-2.0: &lt;a href="https://github.com/san64777/veriscrape" rel="noopener noreferrer"&gt;https://github.com/san64777/veriscrape&lt;/a&gt;. Reach me at &lt;a href="mailto:san64777@gmail.com"&gt;san64777@gmail.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>webscraping</category>
      <category>showdev</category>
      <category>opensource</category>
    </item>
    <item>
      <title>acroforge: turn a flat PDF into a real fillable AcroForm, with a deterministic core and zero copyleft</title>
      <dc:creator>Sanjay Chauhan</dc:creator>
      <pubDate>Mon, 08 Jun 2026 12:13:30 +0000</pubDate>
      <link>https://dev.to/san64777/acroforge-turn-a-flat-pdf-into-a-real-fillable-acroform-with-a-deterministic-core-and-zero-50k5</link>
      <guid>https://dev.to/san64777/acroforge-turn-a-flat-pdf-into-a-real-fillable-acroform-with-a-deterministic-core-and-zero-50k5</guid>
      <description>&lt;p&gt;You generated a "fillable" PDF. It looks perfect in Chrome. Then a user opens it in Firefox and the checkboxes are blank, the text fields show nothing until clicked, and one field renders its value an inch to the left. You did not change anything. Two PDF viewers just disagreed about what your form means.&lt;/p&gt;

&lt;p&gt;That is the first trap with PDF forms: a field is not "filled" until it renders the same way across viewers. The PDF spec lets a viewer either trust the appearance stream you baked in or regenerate its own, and the two paths drift apart constantly. If you do not control the appearance stream, you are at the mercy of whichever engine opens the file.&lt;/p&gt;

&lt;p&gt;The second trap is licensing. Reach for the well-known Python PDF tools and you hit a wall fast: the strongest ones are AGPL (viral copyleft you cannot ship inside a closed product without consequences), or they are paid, or they are a cloud API you have to send documents to. For a lot of teams, "send our forms to someone else's server" is a non-starter, and "relicense our whole app" is worse.&lt;/p&gt;

&lt;p&gt;I wanted a small library that does one job: take a flat, non-interactive PDF and turn it into a real, fillable AcroForm, deterministically, locally, with a permissive license. That is &lt;strong&gt;acroforge&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;acroforge
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apache-2.0, Python 3.11+, no network calls, no AI, no cloud.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 5-line core
&lt;/h2&gt;

&lt;p&gt;Three functions. They all take &lt;code&gt;bytes&lt;/code&gt; and return &lt;code&gt;bytes&lt;/code&gt;, so they compose into any pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;acroforge&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;af&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;acroforge&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FieldSpec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FieldType&lt;/span&gt;

&lt;span class="n"&gt;fields&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;FieldSpec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;FieldType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;700&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;450&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;730&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;full_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;fillable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;af&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flat_pdf_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;          &lt;span class="c1"&gt;# inject real AcroForm fields
&lt;/span&gt;&lt;span class="n"&gt;filled&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;af&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fillable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;full_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Jane Doe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;  &lt;span class="c1"&gt;# set values by name
&lt;/span&gt;&lt;span class="n"&gt;final&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;af&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flatten&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                         &lt;span class="c1"&gt;# bake appearances, lock the form
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;build&lt;/code&gt; injects standards-compliant interactive fields at the coordinates you specify.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fill&lt;/code&gt; sets field values by name.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;flatten&lt;/code&gt; bakes the field appearances into the page content, removes interactivity, and locks the result. The flattened PDF is the one you hand to anyone, because there is no interactive layer left to render differently.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Describing fields
&lt;/h2&gt;

&lt;p&gt;A &lt;code&gt;FieldSpec&lt;/code&gt; is a plain pydantic model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FieldSpec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;FieldType&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;                                  &lt;span class="c1"&gt;# 0-indexed
&lt;/span&gt;    &lt;span class="n"&gt;rect&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;    &lt;span class="c1"&gt;# (x0, y0, x1, y1) in PDF points
&lt;/span&gt;    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;                                  &lt;span class="c1"&gt;# AcroForm field name
&lt;/span&gt;    &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;           &lt;span class="c1"&gt;# radio group members
&lt;/span&gt;    &lt;span class="n"&gt;maxlen&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;                  &lt;span class="c1"&gt;# TEXT cap / COMB cell count
&lt;/span&gt;    &lt;span class="n"&gt;export_value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;            &lt;span class="c1"&gt;# checkbox/radio on-value
&lt;/span&gt;    &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;                    &lt;span class="c1"&gt;# 1.0 = you authored it
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
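
&lt;p&gt;A note on &lt;code&gt;rect&lt;/code&gt;: PDF points are 1/72 of an inch, and the default PDF coordinate system puts the origin at the bottom-left of the page, while most design tools measure from the top-left. The helper below is a small sketch for that conversion. It is not part of acroforge; the name &lt;code&gt;rect_from_top_left_inches&lt;/code&gt; is made up for this example, and it assumes an unrotated page with the default origin and a US Letter height of 792 points unless told otherwise.&lt;/p&gt;

```python
POINTS_PER_INCH = 72.0


def rect_from_top_left_inches(
    left: float,
    top: float,
    width: float,
    height: float,
    page_height_pt: float = 792.0,  # US Letter; pass 841.89 for A4
) -> tuple[float, float, float, float]:
    """Convert a box measured in inches from the top-left corner of the page
    into an (x0, y0, x1, y1) rect in PDF points with a bottom-left origin."""
    x0 = left * POINTS_PER_INCH
    x1 = (left + width) * POINTS_PER_INCH
    y1 = page_height_pt - top * POINTS_PER_INCH  # top edge of the box
    y0 = y1 - height * POINTS_PER_INCH           # bottom edge of the box
    return (x0, y0, x1, y1)
```

&lt;p&gt;For example, a box 1 inch from the left and 1 inch from the top, 2 inches wide and half an inch tall, becomes &lt;code&gt;(72.0, 684.0, 216.0, 720.0)&lt;/code&gt; on a Letter page, which can be passed as &lt;code&gt;rect&lt;/code&gt;.&lt;/p&gt;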



&lt;p&gt;The field types cover the forms people actually fill:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;TEXT&lt;/code&gt; single-line text, optional &lt;code&gt;maxlen&lt;/code&gt; to cap length.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;COMB&lt;/code&gt; the segmented box style, where &lt;code&gt;maxlen&lt;/code&gt; is the number of cells. An SSN field is &lt;code&gt;maxlen=9&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CHECKBOX&lt;/code&gt; with an &lt;code&gt;export_value&lt;/code&gt; for the on-state (default &lt;code&gt;"Yes"&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;RADIO&lt;/code&gt; one &lt;code&gt;FieldSpec&lt;/code&gt; per button, sharing a &lt;code&gt;name&lt;/code&gt;, each with its own &lt;code&gt;export_value&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SIGNATURE&lt;/code&gt; a placeholder signature widget.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is text plus a checkbox plus a radio group:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;fields&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;FieldSpec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;FieldType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;700&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;450&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;730&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;full_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;FieldSpec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;FieldType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COMB&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;660&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;360&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;690&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ssn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;maxlen&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;FieldSpec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;FieldType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CHECKBOX&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;620&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;220&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;640&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agree&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;export_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Yes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;FieldSpec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;FieldType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RADIO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;580&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;220&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;plan&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;export_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;basic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;FieldSpec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;FieldType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RADIO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;260&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;580&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;280&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;plan&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;export_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;fillable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;af&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flat_pdf_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;final&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;af&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flatten&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;af&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fillable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;full_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Jane Doe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ssn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;123456789&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agree&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;plan&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The honest two-layer design
&lt;/h2&gt;

&lt;p&gt;acroforge is deliberately split into two layers, and the split is the whole point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer one is the deterministic engine: &lt;code&gt;build&lt;/code&gt;, &lt;code&gt;fill&lt;/code&gt;, &lt;code&gt;flatten&lt;/code&gt;.&lt;/strong&gt; You tell it exactly where fields go, and it puts them there, fills them, and flattens them reliably. This works on any PDF, vector or scanned, because it does not need to understand the document. It only needs coordinates. This is the part you should depend on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer two is a best-effort detector, and it is labeled best-effort everywhere on purpose.&lt;/strong&gt; &lt;code&gt;af.detect(pdf)&lt;/code&gt; reads the PDF's vector geometry and nearby text labels and &lt;em&gt;guesses&lt;/em&gt; where fields belong, returning a &lt;code&gt;FormManifest&lt;/code&gt; where every field carries &lt;code&gt;confidence &amp;lt; 1.0&lt;/code&gt; to flag it as a guess. &lt;code&gt;af.make_fillable(pdf)&lt;/code&gt; runs &lt;code&gt;detect&lt;/code&gt; then &lt;code&gt;build&lt;/code&gt; in one step.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;manifest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;af&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;detect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;            &lt;span class="c1"&gt;# guesses, each confidence &amp;lt; 1.0
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rect&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;fillable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;af&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;make_fillable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;# detect() then build(), one call
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The detector handles underline-style forms (write-on rules become text fields), bordered table and grid forms (cells become text fields, label-aware), vector checkbox squares, and font-glyph checkboxes like the box and check characters. It is vector-only: scanned or image-only PDFs are refused with &lt;code&gt;ScannedPDFError&lt;/code&gt;, because there is no OCR and I would rather refuse than pretend.&lt;/p&gt;

&lt;p&gt;I make no accuracy promises about detection. It will miss fields and invent spurious ones, and quality varies wildly by document. The intended workflow is: run &lt;code&gt;detect&lt;/code&gt;, review the draft manifest, correct it, and hand the corrected specs to the engine. The fuzzy layer bootstraps you; the deterministic layer is what ships. Keeping that boundary sharp is what stops the guessing from contaminating the guaranteed path.&lt;/p&gt;
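
&lt;p&gt;The review step in that workflow can be as simple as sorting the draft by confidence. A minimal sketch, assuming only the four attributes printed in the detect example above: &lt;code&gt;DraftField&lt;/code&gt; is a stand-in dataclass, not acroforge's own class, and both the &lt;code&gt;triage&lt;/code&gt; helper and the 0.8 cutoff are hypothetical choices for illustration.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DraftField:
    """Stand-in for one detected field; NOT acroforge's own class."""
    type: str
    name: str
    rect: tuple
    confidence: float


def triage(fields, cutoff=0.8):
    """Split a draft into (keep, review). The cutoff is an arbitrary assumption."""
    keep, review = [], []
    for field in fields:
        (keep if field.confidence >= cutoff else review).append(field)
    return keep, review


draft = [
    DraftField("text", "full_name", (72, 700, 300, 718), 0.93),
    DraftField("checkbox", "agree", (200, 620, 220, 640), 0.55),
]
keep, review = triage(draft)
print([f.name for f in review])  # prints ['agree']
```

&lt;p&gt;Everything in &lt;code&gt;review&lt;/code&gt; gets a human look before it is turned into a spec for the engine. Fields in &lt;code&gt;keep&lt;/code&gt; are still guesses, so a quick visual check of those is worth it too.&lt;/p&gt;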

&lt;h2&gt;
  
  
  Cross-viewer correctness is the contract
&lt;/h2&gt;

&lt;p&gt;Back to the first trap. The way acroforge avoids the Chrome-vs-Firefox split is by treating cross-viewer rendering as the actual test contract, not an afterthought.&lt;/p&gt;

&lt;p&gt;Every field type has golden-image render tests in two engines: PDFium, which is what Chrome uses, and PDF.js, which is what Firefox uses. A change that makes a field render differently in either engine fails CI. Adobe Reader is a manual spot-check on top of that. The rule I hold the library to is simple: a field does not "work" until it renders correctly across viewers, so I test before claiming it.&lt;/p&gt;

&lt;p&gt;It has been exercised on real documents, including IRS forms W-9 and 1040 and a 43-page credentialing packet, which is where comb fields, duplicate field names, and multi-page layouts stop being theoretical.&lt;/p&gt;

&lt;h2&gt;
  
  
  The zero-copyleft story
&lt;/h2&gt;

&lt;p&gt;The runtime dependency tree is strictly permissive, by design and by enforcement:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Package&lt;/th&gt;
&lt;th&gt;License&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;reportlab&lt;/td&gt;
&lt;td&gt;BSD&lt;/td&gt;
&lt;td&gt;field widget rendering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pypdf&lt;/td&gt;
&lt;td&gt;BSD-3-Clause&lt;/td&gt;
&lt;td&gt;PDF read / merge / flatten&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pdfplumber&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;geometry utilities&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PyPDFForm&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;fill helpers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pydantic&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;model validation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No GPL, AGPL, LGPL, or SSPL anywhere in the runtime tree, not even MPL. This is not a promise I am asking you to take on faith. CI enforces it on every push with &lt;code&gt;pip-licenses --fail-on='GPL;AGPL;LGPL;SSPL'&lt;/code&gt;, so a copyleft dependency cannot sneak in through a transitive bump without turning the build red. You can drop acroforge into a commercial product without a licensing conversation.&lt;/p&gt;
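
&lt;p&gt;To make the idea of a license gate concrete, here is a rough stdlib-only sketch that scans the license metadata of installed distributions for copyleft markers. It is not how pip-licenses works internally and not acroforge's CI script. The marker list and the names &lt;code&gt;is_copyleft&lt;/code&gt; and &lt;code&gt;installed_copyleft&lt;/code&gt; are illustrative assumptions.&lt;/p&gt;

```python
from importlib import metadata

# Substring markers for the license families named above. The spelled-out
# names catch classifiers that omit the abbreviation. This list is an
# assumption for illustration, not pip-licenses' matching logic.
COPYLEFT_MARKERS = (
    "GPL",  # also matches AGPL and LGPL as substrings
    "General Public License",
    "SSPL",
    "Server Side Public License",
)


def is_copyleft(license_text):
    """Return True if any marker appears in the string (case-insensitive)."""
    lowered = license_text.lower()
    return any(marker.lower() in lowered for marker in COPYLEFT_MARKERS)


def installed_copyleft():
    """Map distribution name to the offending license strings found."""
    hits = {}
    for dist in metadata.distributions():
        declared = [dist.metadata.get("License") or ""]
        classifiers = dist.metadata.get_all("Classifier") or []
        declared += [c for c in classifiers if c.startswith("License ::")]
        bad = [text for text in declared if is_copyleft(text)]
        if bad:
            hits[dist.metadata.get("Name", "unknown")] = bad
    return hits


print(is_copyleft("GNU Affero General Public License v3"))  # prints True
print(is_copyleft("MIT License"))                           # prints False
```

&lt;p&gt;A CI step could call &lt;code&gt;installed_copyleft()&lt;/code&gt; and exit non-zero when the result is non-empty. Substring matching can over-flag a package whose &lt;code&gt;License&lt;/code&gt; field embeds full license text, so treat a hit as a prompt for a manual look.&lt;/p&gt;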

&lt;h2&gt;
  
  
  read_fields: the inverse of build
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;read_fields(pdf)&lt;/code&gt; ingests the AcroForm fields already in a fillable PDF back into &lt;code&gt;FieldSpec&lt;/code&gt;s (real registered fields, so &lt;code&gt;confidence = 1.0&lt;/code&gt;). It is the inverse of &lt;code&gt;build&lt;/code&gt;, which means the two round-trip:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;specs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;af&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_fields&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fillable.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;   &lt;span class="c1"&gt;# -&amp;gt; list[FieldSpec]
&lt;/span&gt;
&lt;span class="c1"&gt;# copy one form's field layout onto another PDF
&lt;/span&gt;&lt;span class="n"&gt;af&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;other_pdf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;af&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_fields&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;template_pdf&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get one &lt;code&gt;FieldSpec&lt;/code&gt; per widget with coordinates, type, name, and checkbox/radio on-states recovered. Dropdowns come back as text; pushbuttons are skipped. It is handy for inspecting an existing form, diffing layouts, or lifting a known-good field arrangement onto a new document.&lt;/p&gt;
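
&lt;p&gt;Diffing two layouts needs nothing beyond the recovered specs. A minimal sketch, assuming specs expose the &lt;code&gt;type&lt;/code&gt;, &lt;code&gt;page&lt;/code&gt;, &lt;code&gt;rect&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, and &lt;code&gt;export_value&lt;/code&gt; attributes used earlier in this post: &lt;code&gt;Spec&lt;/code&gt; is a stand-in dataclass, not acroforge's &lt;code&gt;FieldSpec&lt;/code&gt;, and &lt;code&gt;diff_layouts&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    """Stand-in carrying the attributes used in this post; NOT acroforge's class."""
    type: str
    page: int
    rect: tuple
    name: str
    export_value: str = ""


def diff_layouts(old, new):
    """Return (added, removed, changed) keys between two spec lists.

    Keys are (name, export_value) so radio options sharing a name stay
    distinct. Specs that share both values collapse into one entry, which
    is a known limitation of this sketch.
    """
    old_by_key = {(s.name, s.export_value): s for s in old}
    new_by_key = {(s.name, s.export_value): s for s in new}
    added = sorted(new_by_key.keys() - old_by_key.keys())
    removed = sorted(old_by_key.keys() - new_by_key.keys())
    shared = set(old_by_key).intersection(new_by_key)
    changed = sorted(k for k in shared if old_by_key[k] != new_by_key[k])
    return added, removed, changed


old = [
    Spec("radio", 0, (200, 580, 220, 600), "plan", "basic"),
    Spec("radio", 0, (260, 580, 280, 600), "plan", "pro"),
]
new = [
    Spec("radio", 0, (200, 580, 220, 600), "plan", "basic"),
    Spec("radio", 0, (300, 580, 320, 600), "plan", "pro"),
    Spec("checkbox", 0, (200, 620, 220, 640), "agree", "Yes"),
]
added, removed, changed = diff_layouts(old, new)
print(added, removed, changed)  # prints [('agree', 'Yes')] [] [('plan', 'pro')]
```

&lt;p&gt;Here the "pro" radio moved and an "agree" checkbox appeared, which is the kind of change worth catching when a form template is revised.&lt;/p&gt;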

&lt;h2&gt;
  
  
  Try it and tell me where it breaks
&lt;/h2&gt;

&lt;p&gt;acroforge 0.2.0 is live on PyPI and the source is on GitHub. It is small, focused, and meant to stay that way.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Repo: &lt;a href="https://github.com/san64777/acroforge" rel="noopener noreferrer"&gt;https://github.com/san64777/acroforge&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;PyPI: &lt;a href="https://pypi.org/project/acroforge/" rel="noopener noreferrer"&gt;https://pypi.org/project/acroforge/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you run it against a form that renders wrong in some viewer, or a layout the detector mangles, that is exactly the report I want. Open an issue with the PDF (or a reproducer) and the viewer. And if it saves you from an AGPL dependency or a cloud round-trip, a star helps other people find it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;acroforge
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>python</category>
      <category>opensource</category>
      <category>pdf</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
