Devil Scrapes

Posted on Jun 4

USPTO Patent Search API: scrape US patents to clean JSON for $3/1K

#webscraping #python #apify #data

Quick answer: The USPTO's Patents Public Search tool is interactive-only and paginates manually, and it offers no structured bulk export of search results. To get structured patent data programmatically, you scrape the XHR endpoint a browser calls when it searches Google Patents, which mirrors the USPTO index. The Apify Actor below does that, returning patent number, title, abstract, inventors, assignees, CPC/IPC classifications, and all four date fields as Pydantic-validated JSON, for $0.003 per patent (about $3.00 per 1,000 results, plus a $0.005 start fee per run). No API key is needed for the data source itself; it is a free public government dataset.

Every week the USPTO publishes a new batch of granted patents. If you're doing freedom-to-operate research, tracking a competitor's filing activity, or building an IP-landscape model, that weekly batch is the raw material. The problem is getting it out of the USPTO's search interface and into a format your code can actually use — the query surface is narrow, the response format is undocumented in places, and the server has opinions about the TLS fingerprint of whoever is calling it. A plain requests.get() will not get you far. Here is what actually works, and how I packaged it.

What is USPTO Patents Public Search? 🔎

The US Patent and Trademark Office runs the Patents Public Search system — a public-access database of every US patent and patent application since the 1790s. It is the canonical source for:

Granted patents (kind code B1, B2) — full text, inventors, assignee, all dates
Pre-grant publications (kind code A1) — applications published 18 months after priority date
CPC and IPC classifications — the codes that map a patent to its technology domain

It is the same index patent attorneys use for freedom-to-operate work and examiners use to review new applications — free, public, updated weekly.

Does USPTO have a patent search API? 🔌

Sort of. The developer.uspto.gov/ds-api/patents endpoint accepts GET requests and returns JSON. In practice it is a thin wrapper over the web UI's search index: narrow documented coverage, no structured field-level export, and it won't return the full metadata a patent row needs — abstract, all inventors, CPC codes — in one call.

What we actually use under the hood is the XHR endpoint that Google Patents calls when you type a query into its search box. Google Patents mirrors the USPTO index, exposes a richer query syntax (TI/, AN/, IN/, CPC/), and returns structured pagination cursors. The catch is that the endpoint inspects TLS handshakes and is not documented for third-party use; handling both is precisely the job the Actor exists to do.

What the data looks like

One patent, one flat row. Every field from models.py, validated with Pydantic before it lands in your dataset.

{
  "patent_number": "US11948025B2",
  "publication_id": "US11948025B2",
  "title": "System and method for training neural networks with federated learning",
  "abstract": "A distributed training system wherein a central coordinator aggregates gradients from edge devices without exposing raw training data...",
  "inventors": ["Alice Chen", "Robert Mihalcea"],
  "assignees": ["Acme AI Corporation"],
  "applicants": [],
  "cpc_classifications": ["G06N3/084", "G06N20/00"],
  "ipc_classifications": [],
  "publication_date": "2024-04-02",
  "grant_date": "2024-04-02",
  "filing_date": "2022-11-14",
  "priority_date": "2022-11-14",
  "kind_code": "B2",
  "patent_url": "https://patents.google.com/patent/US11948025B2/en",
  "scraped_at": "2026-05-31T09:14:22+00:00"
}

Sixteen fields. patent_number, title, and patent_url are always populated; any other field can be null or, for list fields such as applicants in the sample above, empty. Pre-grant publications frequently omit the assignee and grant date, which is correct behaviour, not a parser bug.
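If you want a typed view of a row on the consumer side, the shape above can be mirrored roughly as follows. This is a stdlib dataclass sketch, not the Actor's actual Pydantic model from models.py; the class name PatentRow and the from_item helper are hypothetical, and only the field names come from the sample row.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Optional


@dataclass
class PatentRow:
    # Always populated, per the article.
    patent_number: str
    title: str
    patent_url: str
    # Everything else may be missing, null, or an empty list.
    publication_id: Optional[str] = None
    abstract: Optional[str] = None
    inventors: list[str] = field(default_factory=list)
    assignees: list[str] = field(default_factory=list)
    applicants: list[str] = field(default_factory=list)
    cpc_classifications: list[str] = field(default_factory=list)
    ipc_classifications: list[str] = field(default_factory=list)
    publication_date: Optional[str] = None
    grant_date: Optional[str] = None
    filing_date: Optional[str] = None
    priority_date: Optional[str] = None
    kind_code: Optional[str] = None
    scraped_at: Optional[str] = None

    @classmethod
    def from_item(cls, item: dict[str, Any]) -> "PatentRow":
        """Build a row from one dataset item, rejecting rows without the required keys."""
        for name in ("patent_number", "title", "patent_url"):
            if not item.get(name):
                raise ValueError(f"missing required field: {name}")
        known = {f.name for f in fields(cls)}
        # Drop unknown keys and explicit nulls so list fields fall back to [].
        clean = {k: v for k, v in item.items() if k in known and v is not None}
        return cls(**clean)
```

Normalising null list fields to empty lists means downstream code can iterate over inventors or assignees without a None check.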

The naive approach (and why it falls apart) 🔥

The first attempt looks reasonable:

Find the XHR call in Chrome DevTools (patents.google.com/xhr/query)
Replay it with httpx.get()
Parse the paginated JSON

It fails before the first usable response arrives, for three reasons:

1. TLS fingerprinting. The endpoint checks the JA3/JA4 signature of your TLS ClientHello. Python's stdlib ssl module presents a fingerprint no real browser ever sends, so the server returns 403 before reading your query parameters. We work past it by running curl-cffi with explicit browser impersonation — AsyncSession(impersonate="chrome131") — rotating across chrome131 / chrome124 / firefox147 / safari180 profiles per request. The handshake the server sees is indistinguishable from a real browser, because at the TLS layer it is.

2. Two-stage fetch per patent. The query endpoint returns a list of publication IDs. The metadata you actually want (abstract, inventor list, CPC codes) lives on the individual patent pages, scraped from <meta> tags in the HTML. One query for 100 patents therefore means at least 101 HTTP requests: one or more for the paginated query, plus one per patent page. We handle both stages in the same curl-cffi session, with the same impersonated fingerprint across all legs.

3. Undocumented response shape. The cluster/result structure from the XHR endpoint changes format depending on query type. We maintain a parser that handles both publication_number-keyed results and href-parsed fallbacks for when the primary key is absent. We retry on 408 / 429 / 5xx with exponential backoff and surface partial results with a clear status message rather than a silent empty dataset.
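The impersonation described in point 1 can be sketched as below. This is a minimal illustration, not the Actor's code: pick_profile and fetch are hypothetical helpers, and the profile names are the ones listed above, so whether each is accepted depends on the curl-cffi version you have installed.

```python
import random

# Profile names as listed in the article; availability depends on the
# installed curl-cffi version, so treat this list as an assumption.
PROFILES = ["chrome131", "chrome124", "firefox147", "safari180"]


def pick_profile(rng: random.Random) -> str:
    """Choose one browser profile to impersonate."""
    return rng.choice(PROFILES)


async def fetch(url: str, profile: str) -> tuple[int, str]:
    """GET a URL with a browser-like TLS handshake and return (status, body)."""
    # Imported lazily so pick_profile works even without curl-cffi installed.
    from curl_cffi.requests import AsyncSession

    async with AsyncSession(impersonate=profile) as session:
        response = await session.get(url, timeout=30)
        return response.status_code, response.text
```

You would drive it with something like asyncio.run(fetch(url, pick_profile(random.Random()))). Keeping one session per profile lets the query call and the follow-up page fetches share the same fingerprint.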

We thread Apify residential proxies for fresh exit IPs on blocks and pace requests within rate-limit boundaries. None of it is glamorous. All of it is the difference between a script that works once and a pipeline that runs every Monday morning.
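The retry policy mentioned above (408 / 429 / 5xx with exponential backoff) might look roughly like this. It is a sketch under assumptions: the function names, the five-attempt cap, the one-second base delay, and the 10% jitter are all illustrative choices, not values taken from the Actor.

```python
import random
import time
from typing import Callable

RETRYABLE = {408, 429}


def is_retryable(status: int) -> bool:
    """True for 408, 429, and any 5xx status."""
    return status in RETRYABLE or 500 <= status <= 599


def fetch_with_retry(
    send: Callable[[], tuple[int, str]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, str]:
    """Call send() until it returns a non-retryable status or attempts run out."""
    status, body = send()
    for attempt in range(1, max_attempts):
        if not is_retryable(status):
            break
        # Exponential backoff: base, 2x, 4x, ... plus up to 10% jitter.
        delay = base_delay * 2 ** (attempt - 1)
        sleep(delay + random.uniform(0, delay * 0.1))
        status, body = send()
    return status, body
```

The caller still sees the last status when retries are exhausted, which is what makes it possible to surface a partial result with a clear message instead of an empty dataset.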

The Actor ⚙️

The result is an Apify Actor: USPTO Patents Scraper. Paste a query in the Apify Console and click Start, or drive it from code:

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("DevilScrapes/uspto-patents-scraper").call(
    run_input={
        "searchQuery": "AN/Apple AND TI/(machine learning)",
        "maxResults": 500,
        "sortBy": "publication_date_desc",
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["patent_number"], item["title"])

The searchQuery field accepts plain text (searched across title, abstract, and claims) or field-qualified syntax:

| Qualifier | Searches | Example |
| --- | --- | --- |
| `TI/<text>` | Title | `TI/(transformer attention)` |
| `AN/<name>` | Assignee | `AN/Google` |
| `IN/<name>` | Inventor | `IN/Hinton` |
| `CPC/<code>` | CPC class | `CPC/G06N3` |
| `AB/<text>` | Abstract | `AB/(diffusion model)` |

maxResults caps the total across all pages (1–2000); sortBy supports relevance, publication_date_desc, and publication_date_asc.
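If you build inputs in code, a small helper can catch out-of-range values before you pay for a run. This is a sketch: qualified and build_run_input are hypothetical helpers, and joining terms with AND simply follows the example query above.

```python
ALLOWED_SORTS = {"relevance", "publication_date_desc", "publication_date_asc"}
QUALIFIERS = {"TI", "AN", "IN", "CPC", "AB"}


def qualified(qualifier: str, value: str) -> str:
    """Format one field-qualified term, e.g. TI/(machine learning)."""
    if qualifier not in QUALIFIERS:
        raise ValueError(f"unknown qualifier: {qualifier}")
    value = value.strip()
    # Multi-word values are parenthesised, as in the table's examples.
    return f"{qualifier}/({value})" if " " in value else f"{qualifier}/{value}"


def build_run_input(terms, max_results=100, sort_by="relevance"):
    """Build the Actor's run_input dict from (qualifier, value) pairs."""
    if not 1 <= max_results <= 2000:
        raise ValueError("maxResults must be between 1 and 2000")
    if sort_by not in ALLOWED_SORTS:
        raise ValueError(f"unsupported sortBy: {sort_by}")
    query = " AND ".join(qualified(q, v) for q, v in terms)
    return {"searchQuery": query, "maxResults": max_results, "sortBy": sort_by}
```

For example, build_run_input([("AN", "Apple"), ("TI", "machine learning")], 500, "publication_date_desc") reproduces the run_input shown earlier.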

Use cases

IP-landscape monitoring. Run weekly on a competitor's assignee name (AN/CompetitorCorp), diff the patent numbers against last week's output, and you have a filing-velocity signal before their press release. At $0.003 per result plus $0.005 per run, a weekly sweep of five competitors costs under $1 as long as the five runs return fewer than 325 patents in total; cap each run with maxResults to keep it there.
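The week-over-week diff is a set difference on patent_number. A minimal sketch, assuming you have both weeks' dataset items loaded as lists of dicts (the function name is hypothetical):

```python
def new_patent_numbers(last_week, this_week):
    """Return patent numbers present this week but not last week, in order."""
    seen = {row["patent_number"] for row in last_week}
    fresh = []
    for row in this_week:
        number = row["patent_number"]
        if number not in seen:
            seen.add(number)  # also guards against duplicates within this week
            fresh.append(number)
    return fresh
```

The length of the returned list, tracked per week, is the filing-velocity signal.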

Freedom-to-operate (FTO) screening. Query a CPC class covering your product category (CPC/G06N3), pull the last five years of grants sorted by publication date, and feed the abstracts into an LLM for a first-pass claim-scope summary. The structured output drops straight into a Pandas DataFrame.

Inventor tracking. Follow a researcher across employer changes with IN/Bengio. The assignees field tells you who owns their recent work; priority_date shows the cadence.

Acquisition diligence. Quantify a target company's portfolio before a deal: how many grants, which CPC domains, filing trend over three years — enough metadata to answer those questions without a patent-database subscription.
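Those three diligence questions reduce to a few counters over the dataset rows. A rough sketch, assuming that kind codes starting with "B" mark grants (as described earlier) and that the first four characters of a CPC code identify its subclass; summarize_portfolio is a hypothetical name:

```python
from collections import Counter


def summarize_portfolio(rows):
    """Count grants, CPC subclasses, and filings per year for a list of rows."""
    grants = sum(1 for r in rows if (r.get("kind_code") or "").startswith("B"))
    # Counts every code occurrence, so a patent with two G06N codes adds two.
    cpc_domains = Counter(
        code[:4] for r in rows for code in (r.get("cpc_classifications") or [])
    )
    filings_by_year = Counter(
        r["filing_date"][:4] for r in rows if r.get("filing_date")
    )
    return {
        "grants": grants,
        "cpc_domains": cpc_domains,
        "filings_by_year": filings_by_year,
    }
```

Null or missing fields are skipped rather than raising, since most fields in a row are optional.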

Pricing — exact numbers 💰

Pay-per-event. You pay for patents that land in your dataset, nothing for query overhead.

| Event | Price | What it covers |
| --- | --- | --- |
| `actor-start` | $0.005 | One-off warm-up per run |
| `result` | $0.003 | Per patent written to the dataset |

| Pull size | Total cost |
| --- | --- |
| 100 patents | $0.31 |
| 500 patents | $1.51 |
| 1,000 patents | $3.01 |
| 2,000 patents (max/run) | $6.01 |

Apify's $5 free credit covers your first ~1,650 patents with no credit card. For comparison, a Derwent Innovation seat or similar professional patent-database subscription starts at several hundred dollars a month and a sales call.
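The arithmetic behind those totals is one start fee plus a per-result fee. A small sketch using the two event prices above; the function names are hypothetical, and rounding half-up to whole cents is an assumption made to match the displayed totals.

```python
from decimal import Decimal, ROUND_HALF_UP

ACTOR_START = Decimal("0.005")
PER_RESULT = Decimal("0.003")


def run_cost(patents: int) -> Decimal:
    """Exact cost of one run, before rounding."""
    return ACTOR_START + PER_RESULT * patents


def display_cost(patents: int) -> str:
    """Cost of one run rounded half-up to whole cents."""
    cents = run_cost(patents).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return f"${cents}"


def max_patents(budget: str) -> int:
    """Largest single-run pull that fits within a dollar budget."""
    return int((Decimal(budget) - ACTOR_START) // PER_RESULT)
```

display_cost(1000) gives $3.01, and max_patents("5") gives 1665 for a single run, in line with the rough free-credit figure above. Splitting the credit across several runs lowers that number by one start fee per extra run.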

The technically interesting bit

The Actor uses Google Patents as a proxy to the USPTO index, not the official developer.uspto.gov endpoint. Google's XHR query interface is substantially richer — it supports the full field-qualified syntax (TI/, AN/, IN/, CPC/) and returns structured pagination with cluster-grouped results that map cleanly to individual patent pages. Those patent pages carry the same <meta> tags the USPTO itself embeds (DC.title, DC.contributor scheme="inventor", citation_publication_date), so the HTML parse runs against a stable, well-structured source rather than raw patent XML.
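Reading those <meta> tags needs nothing beyond the standard library. The sketch below is not the Actor's parser (which the FAQ describes as regex-based); MetaTagParser is a hypothetical class, and it only handles the three tag names mentioned above.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect title, inventors, and publication date from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.inventors = []
        self.publication_date = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        if not content:
            return
        if name == "DC.title":
            self.title = content.strip()
        elif name == "DC.contributor" and attr.get("scheme") == "inventor":
            # One tag per inventor; keep them in document order.
            self.inventors.append(content.strip())
        elif name == "citation_publication_date":
            self.publication_date = content.strip()
```

Feed it the page HTML with parser.feed(html) and read the attributes afterwards. Self-closing tags work too, because HTMLParser routes them through handle_starttag by default.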

This is the same mirror Lens.org and most commercial patent-intelligence tools build on. The difference is that we hand you the raw data in your own dataset at pay-per-event pricing, not a monthly SaaS seat.

Limitations 🚧

No full claims text. The Actor returns the abstract and metadata. Full claims extraction requires parsing the patent document XML and is a roadmap item; the patent_url field gives you the Google Patents page where the full document is one click away.
Inventor city/state not returned. The meta-tag schema carries no geographic inventor data; that's in the document body, out of scope for v1.
Family members not linked. Patent families (continuations, PCT national-phase entries) are not resolved — you get the individual publication, not the full family tree.
USPTO publishes weekly. New grants appear in the index within roughly one week of the issue date. The Actor reflects the current search index, not a real-time feed.
Max 2,000 results per run (the maxResults ceiling). For bulk portfolio pulls, run multiple queries split by date filter or CPC sub-class.
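Splitting a bulk pull by CPC sub-class can be sketched as follows. The helper names and the sub-class codes in the usage note are illustrative. A patent usually carries several CPC codes, so the same publication can come back from more than one query; the merge step de-duplicates on patent_number.

```python
def split_by_cpc(subclasses, max_results=2000):
    """Build one run_input per CPC sub-class, each capped at the per-run maximum."""
    return [
        {
            "searchQuery": f"CPC/{code}",
            "maxResults": max_results,
            "sortBy": "publication_date_desc",
        }
        for code in subclasses
    ]


def merge_unique(batches):
    """Concatenate result batches, keeping the first row seen per patent_number."""
    seen = set()
    merged = []
    for batch in batches:
        for row in batch:
            if row["patent_number"] not in seen:
                seen.add(row["patent_number"])
                merged.append(row)
    return merged
```

For instance, split_by_cpc(["G06N3", "G06N5", "G06N20"]) yields three inputs you can pass to separate runs, after which merge_unique combines their datasets.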

FAQ ❓

Is scraping Google Patents for USPTO data legal?
Google Patents indexes publicly available USPTO data the US government makes freely accessible. The Actor sends queries at a pace that respects rate limits, collects only structured metadata (no personal data beyond inventor names on public filings), and bypasses no authentication. Check your own jurisdiction and use case; the underlying data is a public government record.

Does this work for international patents (EPO, WIPO, PCT)?
The query endpoint returns US patents (country=US is fixed). PCT applications get a US national-phase publication number once they enter the US, findable by assignee or inventor. Dedicated EPO/WIPO coverage is out of scope for this Actor.

Can I export to CSV or push results to a warehouse?
Yes. Export the dataset as JSON, CSV, or Excel from the Apify Console's Storage tab. You can also webhook the ACTOR.RUN.SUCCEEDED event into Make, Zapier, or n8n, or pull results via the Apify Dataset API.
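If you would rather flatten rows yourself, for example to control how list fields are joined, the standard csv module is enough. A sketch with a hypothetical rows_to_csv helper; the "; " separator is an arbitrary choice.

```python
import csv
import io

LIST_FIELDS = (
    "inventors",
    "assignees",
    "applicants",
    "cpc_classifications",
    "ipc_classifications",
)


def rows_to_csv(rows, columns):
    """Render dataset rows as CSV text, joining list fields with '; '."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        flat = {}
        for col in columns:
            value = row.get(col)
            if col in LIST_FIELDS:
                value = "; ".join(value or [])
            # Null scalars become empty cells.
            flat[col] = "" if value is None else value
        writer.writerow(flat)
    return buffer.getvalue()
```

Write the returned string to a file opened with newline="" so the csv module's own line endings are preserved.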

Why is abstract sometimes empty?
A small number of pre-grant publications carry the abstract in a non-standard HTML position the current regex doesn't reach. The patent_url always resolves to the full document. If you hit this consistently on a patent class, open an issue on the Actor's Issues tab.

Try it

The Actor is live on the Apify Store: apify.com/DevilScrapes/uspto-patents-scraper. Free $5 trial credit, no credit card. Run it on AN/Google AND TI/(large language model) and you'll have structured patent rows in your dataset within a minute. Got a use case I didn't cover — full claims, patent families, bulk historical pull? Drop it in the comments; the roadmap follows what people actually need.

Built by Devil Scrapes — Apify Actors with attitude. Pay-per-event, transparent pricing, no junk fields. 😈

DEV Community