DEV Community

NexGenData

Originally published at thenextgennexus.com

New: Patents to Markdown for RAG — turn US/EP/WO patents into clean, chunked Markdown for RAG

What it does

Patents to Markdown for RAG pulls patents from Google Patents (US, EP, and WO families) and converts them into clean, chunked Markdown ready for retrieval-augmented generation (RAG) and other LLM pipelines. It extracts the abstract, claims, and full description, then splits the text into chunks sized by token count, so you can embed and index it without cleaning up HTML or PDFs first.
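To make the chunking step concrete, here is a minimal Python sketch of one way to split Markdown into chunks bounded by a token budget. It is an illustration only, not the actor's actual implementation: the function names, the `max_tokens` parameter, the integer `chunk_id`, and the whitespace-based token estimate are all assumptions. A real pipeline would likely count tokens with the tokenizer of its embedding model.

```python
import re


def approx_token_count(text: str) -> int:
    """Rough token estimate: number of whitespace-separated words (an assumption)."""
    return len(text.split())


def _make_chunk(index: int, paragraphs: list[str]) -> dict:
    """Join paragraphs into one chunk and attach illustrative metadata."""
    text = "\n\n".join(paragraphs)
    return {
        "chunk_id": index,  # hypothetical: a simple running index
        "markdown": text,
        "token_count": approx_token_count(text),
    }


def chunk_markdown(markdown: str, max_tokens: int = 200) -> list[dict]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    A chunk is closed as soon as adding the next paragraph would exceed
    max_tokens. A single paragraph longer than max_tokens is kept whole
    as its own (oversized) chunk rather than being split mid-paragraph.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", markdown) if p.strip()]
    chunks: list[dict] = []
    current: list[str] = []
    current_tokens = 0

    for para in paragraphs:
        n = approx_token_count(para)
        if current and current_tokens + n > max_tokens:
            chunks.append(_make_chunk(len(chunks), current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += n

    if current:
        chunks.append(_make_chunk(len(chunks), current))
    return chunks
```

Splitting on paragraph boundaries keeps individual claims and headings intact, at the cost of somewhat uneven chunk sizes.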

Who it's for

Built for AI engineers developing patent search, IP analysts assembling prior-art corpora, and legal-tech teams that need patent text in a format an LLM can actually consume.

Sample fields / output

  • patent_number
  • title
  • abstract
  • claims
  • description
  • assignee
  • inventors
  • filing_date
  • publication_date
  • jurisdiction
  • markdown
  • chunk_id
  • token_count
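If you load the output into Python, the fields above could be modeled roughly as follows. This is a sketch under assumptions: the field names come from the list above, but the types (for example, `inventors` as a list of strings, dates as strings, `chunk_id` as a string) and the `PatentRecord` class itself are guesses for illustration, not a published schema.

```python
from dataclasses import dataclass, field, fields
from typing import Any


@dataclass
class PatentRecord:
    """One output item. Field names follow the list above; types are assumed."""

    patent_number: str
    title: str = ""
    abstract: str = ""
    claims: str = ""
    description: str = ""
    assignee: str = ""
    inventors: list[str] = field(default_factory=list)
    filing_date: str = ""
    publication_date: str = ""
    jurisdiction: str = ""
    markdown: str = ""
    chunk_id: str = ""
    token_count: int = 0

    @classmethod
    def from_item(cls, item: dict[str, Any]) -> "PatentRecord":
        """Build a record from a raw dict, ignoring keys not listed above.

        Raises TypeError if patent_number is missing, since it is the
        only field without a default.
        """
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in item.items() if k in known})
```

Ignoring unknown keys keeps the loader tolerant of extra fields in the output.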

Example use cases

  • Build a prior-art RAG knowledge base for a patent-search assistant.
  • Feed chunked claims and descriptions into an embeddings index for semantic search.
  • Generate LLM-ready context for freedom-to-operate and invalidity analysis.

Try Patents to Markdown for RAG on Apify

FAQ

Which patent offices are covered?

US, EP, and WO families via Google Patents, including abstract, claims, and full description.

What format is the output?

Clean Markdown split into token-sized chunks, each with a chunk_id and token_count for direct embedding.
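One way to use `token_count` is to group chunks into embedding requests that stay under a per-request token limit. The Python sketch below assumes each chunk is a dict with `chunk_id` and `token_count` keys. The function name and the `max_batch_tokens` parameter are hypothetical, and the right limit depends on your embedding provider.

```python
def batch_by_token_budget(chunks: list[dict], max_batch_tokens: int) -> list[list]:
    """Group chunk_ids into ordered batches whose token totals fit a budget.

    Chunks are taken in order. A batch is closed when the next chunk would
    push its total over max_batch_tokens. A chunk that alone exceeds the
    budget still gets a batch of its own.
    """
    batches: list[list] = []
    current: list = []
    used = 0

    for chunk in chunks:
        n = chunk["token_count"]
        if current and used + n > max_batch_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(chunk["chunk_id"])
        used += n

    if current:
        batches.append(current)
    return batches
```

Because the counts are already in the output, this batching needs no tokenizer at indexing time, provided the counts were produced with a tokenizer comparable to your embedding model's.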

Do I need a Google Patents login?

No login is required.
