DEV Community

AlterLab

Posted on • Originally published at alterlab.io

Reducing LLM Token Usage in RAG via Structured Extraction

TL;DR

To reduce LLM token usage in RAG pipelines, replace raw HTML with clean Markdown or structured JSON. This removes non-semantic noise like <script> and <div> tags, lowering costs and improving retrieval accuracy.

In Retrieval-Augmented Generation (RAG) workflows, the quality of your context is directly tied to the density of semantic information. A common mistake is feeding raw HTML directly into embedding models or LLMs. This is inefficient: HTML is noisy, filled with boilerplate, and heavily penalizes your token budget.

By implementing a transformation layer that converts web content into Markdown or structured JSON, you can achieve higher accuracy with significantly lower latency and cost.

The Problem: HTML Token Bloat

When you scrape a page and pass the source code to an LLM, you are paying for characters that carry no semantic meaning. A single <div> nested deep within a complex layout, with its class lists, inline styles, and data attributes, can consume dozens of tokens.

Consider the following comparison:

  • Raw HTML: Contains tags, attributes, scripts, and styles. Often 10x larger than the visible text.
  • Markdown: Retains semantic structure (headers, lists, links) using minimal characters.
  • JSON: Extracts only the specific data points required for your application.

| Format | Token Density | Semantic Clarity | Best Use Case |
| :--- | :--- | :--- | :--- |
| Raw HTML | Very Low | Low (Noisy) | Browser rendering |
| Markdown | High | High | General RAG context |
| Structured JSON | Very High | Maximum | Automated workflows |
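
To make the difference concrete, here is a minimal sketch that strips tags, scripts, and styles from a small HTML snippet using only Python's standard library, then compares rough token estimates. The sample HTML is invented for illustration, and the four-characters-per-token ratio is only a common rule of thumb for English text, not a real tokenizer; substitute your model's tokenizer for actual measurements.

```python
from html.parser import HTMLParser

# Hypothetical page fragment, invented for this example.
SAMPLE_HTML = """
<html><head><style>.card{margin:0;padding:12px}</style>
<script>window.analytics = {id: "abc123"};</script></head>
<body><div class="card card--wide shadow-lg" data-track="hero">
<h1>Quarterly Report</h1>
<p>Revenue grew in the third quarter.</p>
</div></body></html>
"""


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def visible_text(html: str) -> str:
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def rough_token_estimate(text: str) -> int:
    # Assumption: roughly 4 characters per token. A heuristic, not a tokenizer.
    return max(1, len(text) // 4)


text = visible_text(SAMPLE_HTML)
print("HTML tokens (est.):", rough_token_estimate(SAMPLE_HTML))
print("Text tokens (est.):", rough_token_estimate(text))
```

Running the same comparison on a sample of your own pages is a quick way to check how much of your token budget is going to markup before you commit to a conversion step.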

Strategy 1: Markdown for Semantic Context

Markdown is the "Goldilocks" format for LLMs. It preserves the hierarchy of a page (H1, H2, lists), which helps the model understand the relationship between different pieces of text, but it strips away the bulk of HTML tags and attributes.

If you are building a knowledge base where the LLM needs to understand the relationship between a heading and a paragraph, Markdown is your best choice.
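
One way to take advantage of that hierarchy is to split the Markdown into chunks at each heading and carry the heading trail along with every chunk, so retrieved passages keep their context. The sketch below is one possible approach using only the standard library; the function name `chunk_by_heading` and the sample document are illustrative, and it deliberately ignores edge cases such as `#` characters inside fenced code blocks.

```python
import re

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split Markdown at headings; each chunk records its heading trail."""
    chunks = []
    trail = []  # stack of (level, title) for the current position
    body = []   # lines collected since the last heading

    def flush():
        text = "\n".join(body).strip()
        if text:
            path = " > ".join(title for _, title in trail)
            chunks.append({"path": path, "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before descending.
            while trail and trail[-1][0] >= level:
                trail.pop()
            trail.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks


# Hypothetical document, invented for this example.
doc = """# Pricing
Plans are billed monthly.

## Enterprise
Contact sales for volume discounts.

# Support
Email support is included.
"""

chunks = chunk_by_heading(doc)
for chunk in chunks:
    print(chunk["path"], "->", chunk["text"])
```

Prefixing each chunk's text with its `path` before embedding is a common design choice, since a paragraph like "Contact sales for volume discounts." is ambiguous without its heading.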

You can automate this by using a Python web scraping API that handles the heavy lifting of rendering JavaScript before you perform the conversion.

Implementation Example

Here is how you can fetch a page and prepare it for an LLM using a Python client.

```python title="extract_markdown.py" {1-4}
import alterlab
import markdownify

client = alterlab.Client("YOUR_API_KEY")

# Fetch the page content
response = client.scrape("https://example-news-site.com/article")
html_content = response.text

# Convert to clean Markdown
md_content = markdownify.markdownify(html_content)

print(md_content[:500])  # View the first 500 characters
```

For high-scale production environments, you should use an extraction tool that performs this conversion server-side to minimize local processing.

## Strategy 2: Structured JSON for Targeted Extraction

When your RAG pipeline doesn't need the entire article, only specific data points like prices, product names, or dates, do not use Markdown. Use structured extraction.

Instead of asking an LLM to "Read this HTML and tell me the price," you should use an extraction engine to turn the HTML into a JSON object. This moves the complexity from the LLM to the scraping layer, which is significantly cheaper.

  1. Scrape: Fetch the page via API.
  2. Extract: Identify keys using AI or selectors.
  3. Store: Save clean JSON to the vector DB.
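
As a rough illustration of the scrape, extract, and store steps, the sketch below runs them locally on a hard-coded page so it works without network access. Everything specific here is an assumption: the `data-field` attributes, the sample markup, and the in-memory list standing in for a vector database are all invented for the example, and a real extraction service would replace the hand-written parser.

```python
import json
from html.parser import HTMLParser


# Step 1 (Scrape): stand-in for an API call. The markup is hypothetical.
def scrape(url: str) -> str:
    return (
        '<div class="product"><h1 data-field="product_name">Trail Backpack</h1>'
        '<span data-field="price">$89.50</span>'
        '<span data-field="availability">In stock</span></div>'
    )


class FieldParser(HTMLParser):
    """Collects the text of elements marked with a data-field attribute."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        # Remember which field (if any) the upcoming text belongs to.
        self._current = dict(attrs).get("data-field")

    def handle_endtag(self, tag):
        self._current = None

    def handle_data(self, data):
        if self._current and data.strip():
            self.fields[self._current] = data.strip()


# Step 2 (Extract): turn the HTML into a small typed record.
def extract(html: str) -> dict:
    parser = FieldParser()
    parser.feed(html)
    parser.close()
    raw = parser.fields
    return {
        "product_name": raw["product_name"],
        "price": float(raw["price"].lstrip("$")),
        "availability": raw["availability"].lower() == "in stock",
    }


# Step 3 (Store): a plain list stands in for the vector DB.
store = []


def save(record: dict) -> None:
    store.append(json.dumps(record))


record = extract(scrape("https://example-store.com/product/123"))
save(record)
print(store[0])
```

The point of the sketch is the division of labor: by the time anything reaches the LLM or the embedding model, it is already a compact record rather than a page of markup.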

### Automating Extraction with cURL

You can define your desired schema directly in your request. This ensures that what enters your database is already clean, structured, and token-optimized.



```bash title="Extracting JSON via cURL"
curl -X POST https://api.alterlab.io/v1/extract \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example-store.com/product/123",
    "schema": {
      "product_name": "string",
      "price": "number",
      "availability": "boolean"
    }
  }'
```

By requesting JSON directly, you bypass the need for a separate "Cleanup LLM" pass. This single architectural change can reduce your LLM-related costs by 60-80%.
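
If your pipeline is in Python rather than shell, the same request can be assembled with the standard library. The sketch below mirrors the cURL call (same endpoint, header, and schema) but only builds the request object without sending it. The response shape is not documented in this post, so `validate_record` assumes a flat JSON object whose keys match the schema, and both function names are illustrative.

```python
import json
import urllib.request

EXTRACT_URL = "https://api.alterlab.io/v1/extract"  # endpoint from the cURL example

SCHEMA = {"product_name": "string", "price": "number", "availability": "boolean"}


def build_extract_request(page_url: str, schema: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for the extract endpoint."""
    payload = json.dumps({"url": page_url, "schema": schema}).encode("utf-8")
    return urllib.request.Request(
        EXTRACT_URL,
        data=payload,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def validate_record(record: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the record matches.

    Assumption: the response is a flat object keyed like the schema.
    """
    checks = {
        "string": lambda v: isinstance(v, str),
        # bool is a subclass of int in Python, so exclude it explicitly.
        "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
        "boolean": lambda v: isinstance(v, bool),
    }
    problems = []
    for key, type_name in schema.items():
        if key not in record:
            problems.append(f"missing field: {key}")
        elif not checks[type_name](record[key]):
            problems.append(f"wrong type for {key}: expected {type_name}")
    return problems


request = build_extract_request(
    "https://example-store.com/product/123", SCHEMA, "YOUR_API_KEY"
)
# To send it: urllib.request.urlopen(request), then json.load() the response body.
```

Validating each record before it is stored keeps a malformed extraction (for example, a price returned as a string) from silently entering your index.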

Comparing Approaches

To decide which method to use, consider your end-use case:

| Feature | Raw HTML | Markdown | Structured JSON |
| :--- | :--- | :--- | :--- |
| Token Usage | Extremely High | Low | Minimal |
| Semantic Value | High (but noisy) | High | Targeted |
| LLM Latency | High | Low | Minimal |
| Implementation | Easy | Moderate | Advanced |

When dealing with complex, dynamic sites, ensure your pipeline includes robust anti-bot handling to prevent scraping failures from breaking your RAG ingestion.

Summary of Best Practices

  1. Never embed raw HTML in prompts: It is a waste of money and increases the chance of hallucinations.
  2. Use Markdown for unstructured text: If the content is long-form (blogs, news), Markdown preserves the structure LLMs need.
  3. Use JSON for structured data: For data-driven RAG (e.g., product catalogs), always extract via JSON schema.
  4. Pre-process before embedding: Clean your text (remove extra whitespace, boilerplate footers) before sending it to your embedding model.
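
The pre-processing step in the last item can be sketched in a few lines. This is a minimal example, assuming the boilerplate you want to drop can be recognized by simple line patterns; the patterns in `BOILERPLATE_PATTERNS` and the sample text are hypothetical and should be replaced with ones that match your own sources.

```python
import re

# Hypothetical patterns for footer/widget lines; tune these for your sources.
BOILERPLATE_PATTERNS = [
    re.compile(r"^©"),
    re.compile(r"^all rights reserved", re.IGNORECASE),
    re.compile(r"^(share|subscribe|follow us)\b", re.IGNORECASE),
]


def clean_for_embedding(text: str) -> str:
    """Collapse whitespace and drop lines that look like boilerplate."""
    kept = []
    for line in text.splitlines():
        # Collapse runs of spaces and tabs inside the line.
        line = re.sub(r"[ \t]+", " ", line).strip()
        if any(pattern.search(line) for pattern in BOILERPLATE_PATTERNS):
            continue
        kept.append(line)
    # Collapse three or more newlines into a single paragraph break.
    cleaned = re.sub(r"\n{3,}", "\n\n", "\n".join(kept))
    return cleaned.strip()


raw = (
    "Intro   paragraph  with   gaps.\n\n\n\n"
    "Second paragraph.\n\n"
    "Share this article\n"
    "© 2024 Example Corp\n"
)
print(clean_for_embedding(raw))
```

Line-level patterns are easy to audit, which matters here: an overly broad pattern would silently delete real content before it is ever embedded.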

For more advanced implementation details, check our API documentation or read our recent posts on the AlterLab blog.

Hit reply if you have questions.

AlterLab // Web Data, Simplified.
