DEV Community

AlterLab

Posted on • Originally published at alterlab.io

Optimizing AI Data Pipelines: JSON vs Markdown vs Text

## TL;DR

Markdown is the optimal format for LLM grounding and RAG pipelines because it preserves structural hierarchy with minimal token overhead. Use JSON only when your agent requires strict schema adherence for tool-calling, and avoid raw text for complex pages where layout and table relationships are critical for reasoning.

## The Token Cost of Structure

Building data pipelines for AI agents requires a fundamental shift in how we think about data serialization. In traditional ETL, JSON is the undisputed king. However, when feeding data into Large Language Models (LLMs), the primary constraint isn't just "readability" or "parseability"—it is token density and semantic preservation.

Every token you send to an LLM has a cost, both in latency and in actual spend. More importantly, the format of that data dictates how well the model "understands" the relationships between different pieces of information. If you strip too much structure, the model loses context. If you keep too much (like raw HTML), you waste the context window on boilerplate markup.

| Format | Token Efficiency | Structural Context | Ideal Use Case |
|---|---|---|---|
| Raw Text | Highest | None | Simple summarization, sentiment analysis |
| Markdown | High | Excellent | RAG pipelines, long-form content grounding |
| JSON | Medium | Strict/Typed | Tool calling, data extraction, database ingestion |
| HTML | Lowest | High (noisy) | Legacy systems, visual layout analysis |

## Markdown: The Sweet Spot for LLMs

Markdown has emerged as the de facto standard for LLM grounding. There are three technical reasons for this dominance: training data alignment, structural semantics, and token efficiency.

### 1. Training Data Alignment

Most state-of-the-art LLMs were trained on massive repositories of technical documentation, GitHub READMEs, and specialized forums, all of which heavily use Markdown. As a result, models are exceptionally good at interpreting `#` for headers, `>` for blockquotes, and `|` for tables. When you provide data in Markdown, you are speaking the model's native language of structure.

### 2. Semantic Preservation in RAG

In a Retrieval-Augmented Generation (RAG) pipeline, you must split long documents into smaller "chunks" so that each piece fits the embedding model's input limit and the retrieved pieces fit the prompt's context window. Fixed-size chunking of raw text often breaks mid-sentence or mid-paragraph, losing the connection between a heading and its sub-points.

Markdown allows for "header-based chunking." You can split a document at every `##` or `###` heading, so that each chunk is a self-contained semantic unit. This tends to improve the quality of the embeddings stored in a vector database, because each vector represents one topic rather than an arbitrary slice of text.
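
A minimal sketch of header-based chunking, using only the standard library. The function name `chunk_by_headers` and the returned dictionary shape are illustrative assumptions, not part of any particular RAG framework. It splits at `##` and `###` headings and ignores heading-like lines inside fenced code.

```python
import re

# Matches a level-2 or level-3 ATX heading ("## Title" or "### Title").
HEADING = re.compile(r"^(#{2,3})\s+(.*)$")
FENCE = "`" * 3  # fenced-code delimiter, built this way to keep the sketch fence-safe


def _has_content(chunk: dict) -> bool:
    """True if the chunk has a heading or at least one non-blank line."""
    return chunk["heading"] is not None or any(line.strip() for line in chunk["lines"])


def chunk_by_headers(markdown: str) -> list[dict]:
    """Split Markdown into chunks that each begin at a ## or ### heading."""
    chunks = []
    current = {"heading": None, "lines": []}
    in_fence = False
    for line in markdown.splitlines():
        # Track fenced code so a "## " inside a code sample is not a split point.
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            if _has_content(current):
                chunks.append(current)
            current = {"heading": match.group(2).strip(), "lines": [line]}
        else:
            current["lines"].append(line)
    if _has_content(current):
        chunks.append(current)
    return [
        {"heading": c["heading"], "text": "\n".join(c["lines"]).strip()}
        for c in chunks
    ]


doc = "Intro text\n\n## Pricing\nPlans below.\n\n### Pro\n$50\n\n## FAQ\nNone yet."
for chunk in chunk_by_headers(doc):
    print(chunk["heading"], "->", repr(chunk["text"]))
```

Keeping the heading line inside each chunk's text means the embedding sees the topic label together with the content it describes. Text that appears before the first heading is returned as a chunk whose `heading` is `None`.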

### 3. Table Integrity

Consider a pricing table on an e-commerce site. In raw text, the relationship between a "Feature" and its "Price" is often lost as the table is flattened into a single stream of words. In JSON, a table can become incredibly verbose. Markdown tables preserve the grid structure with minimal token usage, allowing the LLM to perform accurate "lookups" within the context window.

## JSON: When Schema Precision Matters

While Markdown is superior for grounding, JSON remains essential for "Extraction" tasks. If your goal is to populate a database or trigger a programmatic function, the LLM must output JSON.

However, using JSON as the input format for an agent can be problematic. Consider this comparison:

```json title="data.json"
[
  {"id": 1, "name": "Standard Plan", "price": "$10", "limit": "1000 requests"},
  {"id": 2, "name": "Pro Plan", "price": "$50", "limit": "Unlimited"}
]
```

```markdown title="data.md"
| ID | Name | Price | Limit |
|---|---|---|---|
| 1 | Standard Plan | $10 | 1000 requests |
| 2 | Pro Plan | $50 | Unlimited |
```

The JSON version repeats every key ("id", "name", "price", "limit") on every single row. In a table with 50 rows, that is 200 repeated keys, plus their quotes, colons, and braces, which can add up to hundreds of tokens that carry no new information. For a high-volume pipeline, this inefficiency scales into significant costs.
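
To make the key-repetition effect concrete, here is a rough sketch that serializes the same rows both ways and counts how often a key appears. The helper names and the generated rows are hypothetical. Character length is used only as a crude proxy for tokens; real token counts depend on the tokenizer.

```python
import json


def to_json(rows: list[dict]) -> str:
    # Compact separators, so JSON is not penalized for optional whitespace.
    return json.dumps(rows, separators=(",", ":"))


def to_markdown_table(rows: list[dict]) -> str:
    # Assumes flat rows with identical keys and no "|" inside the values.
    headers = list(rows[0].keys())
    lines = [
        "| " + " | ".join(headers) + " |",
        "|" + "---|" * len(headers),
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")
    return "\n".join(lines)


def make_rows(n: int) -> list[dict]:
    # Hypothetical pricing rows, shaped like the example above.
    return [
        {"id": i, "name": f"Plan {i}", "price": f"${i * 10}", "limit": "1000 requests"}
        for i in range(1, n + 1)
    ]


rows = make_rows(50)
as_json = to_json(rows)
as_markdown = to_markdown_table(rows)

# The key "name" appears once per row in JSON, but only once (in the header) in Markdown.
print("JSON occurrences of the name key:", as_json.count('"name"'))
print("Markdown occurrences of the name key:", as_markdown.count("name"))
print("Characters (JSON, Markdown):", len(as_json), len(as_markdown))
```

For very small tables the header and separator rows can cancel out the savings, so the gap is mostly worth measuring on tables with many rows.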

## Implementation: Transforming Web Data

The challenge for engineers is converting messy web content into clean Markdown or JSON. Most scrapers return raw HTML, which is a nightmare for LLMs due to the high noise-to-signal ratio.

When building your pipeline, you should use a Python SDK that handles the conversion at the edge. This reduces the payload size coming into your application and saves you from writing complex BeautifulSoup logic.

```python title="pipeline.py" {10-12}
import alterlab  # assumed import name, implied by alterlab.Client below

# Initialize the client
client = alterlab.Client(api_key="YOUR_API_KEY")

# Request specific formats to optimize for LLM grounding
response = client.scrape(
    url="https://example-news-site.com/article",
    formats=["markdown", "json"],
    min_tier=3,  # Ensure JS-heavy content is rendered
)

# Use Markdown for the RAG context
markdown_content = response.markdown

# Use JSON for metadata (author, date, tags)
metadata = response.json
```

By requesting `markdown` directly from the API (see the [documentation](https://alterlab.io/docs)), you avoid the overhead of local processing. The API performs the "DOM cleaning" (removing scripts, ads, and navbars) before converting the semantic structure to Markdown.
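
For intuition about what "DOM cleaning" involves, here is a small local sketch using Python's standard `html.parser`. It is not how the API works internally. The list of noise tags and the heading handling are assumptions chosen for illustration, and a production converter would cover far more cases (tables, links, lists, malformed markup).

```python
from html.parser import HTMLParser

# Tags whose entire subtree is treated as noise in this sketch (an assumed list).
NOISE_TAGS = {"script", "style", "nav", "footer", "aside"}
HEADING_TAGS = {"h1": "# ", "h2": "## ", "h3": "### "}


class ContentExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a noise subtree
        self.prefix = ""      # Markdown heading marker for the next text node
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.skip_depth += 1
        elif tag in HEADING_TAGS:
            self.prefix = HEADING_TAGS[tag]

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.skip_depth == 0 and text:
            self.parts.append(self.prefix + text)
            self.prefix = ""


def clean_text(html: str) -> str:
    """Drop noise subtrees and return the remaining text, one block per line."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


page = (
    "<html><head><style>p{}</style></head><body>"
    "<nav><a href='/'>Home</a></nav>"
    "<h2>Pricing</h2><p>Pro costs $50.</p>"
    "<script>track()</script><footer>Cookie Policy</footer>"
    "</body></html>"
)
print(clean_text(page))
```

Even this toy version shows why cleaning has to happen before conversion: the navigation link and the cookie notice never reach the output, so they cannot end up in a chunk.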

## Raw Text: The Minimalist Approach
Raw text is only recommended when the structural relationship between data points is irrelevant. For example, if you are performing sentiment analysis on a 2,000-word product review, the headings and bullet points matter less than the prose itself. 

However, even in these cases, we often find that "Clean Text" (text with boilerplate removed) is better than "Raw Text." Using an [anti-bot solution](https://alterlab.io/smart-rendering-api) that also handles content extraction ensures that you aren't feeding the LLM "Cookie Policy" or "Sign Up for Newsletter" text, which can lead to hallucinations.

## Benchmarking Token Usage
To illustrate the difference, we ran a sample 500-word technical blog post through three common serialization formats and measured the token count with tiktoken (using the encodings for the `o1` and `gpt-4o` models).

<div data-infographic="stats">
  <div data-stat data-value="4,820" data-label="Raw HTML Tokens"></div>
  <div data-stat data-value="1,150" data-label="JSON Tokens"></div>
  <div data-stat data-value="740" data-label="Markdown Tokens"></div>
</div>

The Markdown version used about 36% fewer tokens than JSON and about 85% fewer than raw HTML, while retaining 100% of the structural hierarchy needed for grounding.
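
The relative savings can be recomputed from the three token counts reported above. The helper name `savings` is just illustrative.

```python
def savings(baseline: int, candidate: int) -> float:
    """Percentage of tokens saved by `candidate` relative to `baseline`."""
    return (baseline - candidate) / baseline * 100


# Token counts taken from the benchmark figures above.
html_tokens, json_tokens, markdown_tokens = 4820, 1150, 740

print(f"Markdown vs JSON: {savings(json_tokens, markdown_tokens):.1f}%")  # 35.7%
print(f"Markdown vs HTML: {savings(html_tokens, markdown_tokens):.1f}%")  # 84.6%
```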

## Strategy: Designing the Multi-Format Pipeline
The most robust AI agents don't rely on a single format. They use a hybrid approach:

1.  **Markdown for Knowledge**: The body of the page, tables, and lists are stored as Markdown in a vector database for RAG.
2.  **JSON for Discovery**: Metadata like page title, published date, and breadcrumbs are stored as JSON for filtering and sorting.
3.  **Text for Summarization**: Large blocks of prose can be simplified to text to maximize the context window for extremely long documents.

<div data-infographic="steps">
  <div data-step data-number="1" data-title="Extraction" data-description="Scrape site and request Markdown + JSON formats."></div>
  <div data-step data-number="2" data-title="Chunking" data-description="Split Markdown by headers to preserve semantic context."></div>
  <div data-step data-number="3" data-title="Grounding" data-description="Inject relevant chunks into the LLM prompt."></div>
</div>
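
The grounding step can be sketched as follows. This is a simplified stand-in: a real pipeline would rank chunks by embedding similarity in a vector database, whereas this example uses naive keyword overlap so that it runs with the standard library alone. The function names and the prompt wording are assumptions.

```python
import re


def words(text: str) -> set[str]:
    """Lowercased word set, with punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))


def score(query: str, chunk: str) -> int:
    """Naive relevance: number of distinct query words that occur in the chunk."""
    return len(words(query) & words(chunk))


def build_prompt(query: str, chunks: list[str], top_k: int = 2) -> str:
    """Pick the top_k highest-scoring Markdown chunks and inject them into a prompt."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    context = "\n\n---\n\n".join(ranked[:top_k])
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


chunks = [
    "## Pricing\nThe Pro plan costs $50 per month.",
    "## FAQ\nRefunds are available within 30 days.",
    "## About\nWe build web data tools.",
]
print(build_prompt("What does the Pro plan cost?", chunks, top_k=1))
```

Because each chunk keeps its heading, the model sees which section a passage came from even when only one or two chunks are injected.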

## Best Practices for AI Agent Pipelines
When configuring your data ingestion, follow these rules to ensure your agent remains accurate and cost-effective:

*   **Filter before format**: Remove non-content elements (nav, footer, sidebars) before converting to Markdown. An LLM grounded on a sidebar's "Related Articles" list will likely hallucinate those titles as part of the primary content.
*   **Sanitize Markdown**: Some converters produce "dirty" Markdown with excessive newlines or nested divs. Ensure your pipeline outputs standard CommonMark.
*   **Schema Validation**: If you are using JSON, use Pydantic (Python) or Zod (TypeScript) to validate the structure before it reaches your agent logic. LLMs can occasionally "drift" from a schema if the input data is ambiguous.
*   **Monitor Token Density**: Track the ratio of "Useful Characters" to "Total Tokens." If your JSON keys are longer than the values they hold, consider switching to a more compact representation or Markdown for that specific data segment.
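
One way to approximate the "Monitor Token Density" check is to measure how much of a JSON payload is spent on keys rather than values. The sketch below is a character-based proxy under the assumption of flat records. The function name and the 0.5 threshold are illustrative, not an established metric.

```python
import json


def key_overhead(records: list[dict]) -> float:
    """Fraction of serialized characters spent on keys (flat records only)."""
    key_chars = sum(len(json.dumps(key)) for record in records for key in record)
    value_chars = sum(len(json.dumps(value)) for record in records for value in record.values())
    total = key_chars + value_chars
    return key_chars / total if total else 0.0


flags = [{"is_available": True, "currency_code": "USD"}]
article = [{"body": "A long paragraph of prose. " * 10}]

for name, records in (("flags", flags), ("article", article)):
    ratio = key_overhead(records)
    # Assumed rule of thumb: above 0.5, the keys outweigh the values they hold.
    verdict = "consider a more compact format" if ratio > 0.5 else "fine as JSON"
    print(f"{name}: {ratio:.2f} ({verdict})")
```

Records dominated by short flags or codes score high and are good candidates for a Markdown table, while records that carry long text values score low and lose little by staying in JSON.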

## Takeaway
For developers building the next generation of AI-native applications, the choice of data format is a performance optimization. **Markdown** is the clear winner for grounding and RAG due to its balance of structural context and token efficiency. Reserve **JSON** for structured extraction and **Text** for the simplest of prose-only tasks. By optimizing your ingestion format, you reduce costs, lower latency, and significantly improve the reasoning capabilities of your agents.

AlterLab // Web Data, Simplified.