Web scraping and structured data extraction from websites have long been fundamental tasks in data engineering, fueling analytics, competitive intelligence, and various automation workflows. The traditional approach, relying heavily on CSS selectors or XPath expressions, presents a persistent challenge: fragility. Website layouts change frequently, leading to broken selectors, pipeline failures, and significant maintenance overhead. This fragility is exacerbated by the increasing complexity of modern web applications, which often render content dynamically using JavaScript, making static HTML parsing insufficient.
The Promise and Pitfalls of LLM-First Extraction
The advent of large language models (LLMs) appeared to offer a compelling solution to the brittleness of traditional scraping. The intuition is straightforward: provide an LLM with raw HTML and a natural language instruction or a schema, then request structured JSON output. This paradigm shift promised to abstract away the intricate details of DOM structure, offering a more resilient approach to data extraction.
However, practical application of LLMs for web data extraction reveals several significant challenges that can make the "LLM-first" approach more painful than anticipated:
Token Budget Exhaustion and Noise Reduction
Raw HTML, especially from modern web pages, is replete with superfluous content for extraction purposes. Navigation bars, footers, headers, advertisements, tracking scripts, inline styles, and comment blocks collectively constitute a substantial portion of the page's HTML, often representing 80% or more of the total token count. Feeding this undifferentiated mass to an LLM quickly exhausts token budgets, leading to higher API costs and potentially truncated or less accurate outputs due to context window limitations. Effective noise reduction is therefore not merely an optimization but a prerequisite for feasible LLM-based extraction.
Malformed JSON Output
Despite sophisticated instruction following capabilities, LLMs are not infallible JSON generators. They frequently produce malformed JSON, particularly when dealing with complex, nested schemas or lengthy outputs. A single missing bracket, an unescaped quote, or an extraneous comma can render the entire output unparsable by standard JSON libraries, leading to pipeline crashes and lost data. The problem is compounded when extracting arrays of objects, where a single invalid item can corrupt the entire array structure. Robust error recovery is crucial to mitigate this common failure mode.
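To make the failure mode concrete, here is a minimal sketch: a single trailing comma renders an otherwise correct payload unparsable, so a strict `JSON.parse` call must be wrapped defensively. The `tryParseJson` helper is illustrative, not part of any library's API.

```typescript
// Strict parsing: any syntax slip makes the entire payload unusable.
export function tryParseJson(raw: string): unknown | null {
  try {
    return JSON.parse(raw);
  } catch {
    return null; // a trailing comma, unescaped quote, etc. lands here
  }
}

// One trailing comma invalidates an otherwise correct array:
const broken = '[{"name": "Widget", "price": 9.99},]';
console.log(tryParseJson(broken)); // null: the whole extraction is lost
```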
URL Hygiene and Normalization
Web pages are a mosaic of relative URLs, URL fragments, query parameters, and tracking identifiers. When extracting links or image sources, these must be canonicalized and normalized into absolute, clean URLs. An LLM might extract a relative path, an uncleaned URL with tracking parameters, or a URL with a fragment identifier, necessitating post-processing that is often overlooked in initial LLM integration attempts. Consistent URL normalization is essential for data integrity and usability.
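A minimal sketch of that normalization, assuming Node's WHATWG `URL` API; the tracking-parameter list is illustrative, not exhaustive.

```typescript
// Hypothetical helper: resolve, de-fragment, and de-track a raw URL.
const TRACKING_PARAMS = ['fbclid', 'gclid'];

export function normalizeUrl(raw: string, baseUrl: string): string | null {
  try {
    const url = new URL(raw, baseUrl); // resolves relative paths
    url.hash = '';                     // drop fragment identifiers
    // Copy keys first: deleting while iterating can skip entries.
    for (const key of [...url.searchParams.keys()]) {
      if (key.startsWith('utm_') || TRACKING_PARAMS.includes(key)) {
        url.searchParams.delete(key);
      }
    }
    return url.toString();
  } catch {
    return null; // malformed input
  }
}
```

A post-processing pass like this runs over every extracted link and image source before the data leaves the pipeline.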
Repetitive Boilerplate
Integrating LLMs into a data extraction pipeline typically involves a series of common, repetitive steps: fetching HTML (often requiring browser automation), cleaning the HTML, converting it to a more LLM-friendly format (like Markdown), constructing the LLM prompt, invoking the LLM, parsing its output, handling potential errors, and finally validating the structured data against a schema. Rebuilding this entire pipeline for every new extraction task leads to significant developer overhead and inconsistencies across projects.
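The recurring steps can be sketched as a typed pipeline of composable stages. The stage names and the regex-based stub cleaner below are illustrative only; a real cleaner would use DOM parsing, as discussed in the next section.

```typescript
// A stage transforms one pipeline artifact into the next.
type Stage<A, B> = (input: A) => B;

// Compose two stages left-to-right into a single function.
function pipeline<A, B, C>(f: Stage<A, B>, g: Stage<B, C>): Stage<A, C> {
  return (input) => g(f(input));
}

// Stub stages standing in for: clean HTML -> Markdown-ish text -> prompt.
// (Regex stripping is a placeholder here, not a recommended technique.)
const cleanHtml: Stage<string, string> = (html) =>
  html.replace(/<script[\s\S]*?<\/script>/g, '');
const toPrompt: Stage<string, string> = (md) => `Extract JSON from:\n${md}`;

const htmlToPrompt = pipeline(cleanHtml, toPrompt);
```

Encapsulating these stages once, rather than per project, is precisely the gap a library like Lightfeed Extractor aims to fill.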
Lightfeed Extractor: An Opinionated Solution for Robustness
Recognizing these systemic challenges, Lightfeed Extractor emerged as a TypeScript library designed to encapsulate the complete pipeline from raw HTML to validated, structured data. Its architecture focuses on robustness, type safety, and efficient resource utilization, aiming to turn the promise of LLM-based extraction into a reliable production reality.
Core Component: Intelligent HTML Pre-processing and Markdown Conversion
The most critical initial step for efficient LLM processing is transforming raw, verbose HTML into a concise, content-focused representation. Lightfeed Extractor employs a multi-stage approach for this:
DOM Segmentation and Noise Reduction
The library leverages robust DOM parsing capabilities, typically through jsdom for server-side processing or Playwright's browser context for live DOM interaction. The objective is to identify and isolate the main content block while aggressively pruning irrelevant elements.
- DOM Construction: The raw HTML is parsed into a navigable Document Object Model (DOM) tree. This allows for semantic analysis and structural manipulation, which is difficult with regex-based approaches.
- Main Content Identification: Sophisticated heuristics, often inspired by readability algorithms (e.g., Mozilla's Readability.js), are applied to locate the primary content region of the page. These algorithms typically analyze factors such as element density, text length within blocks, tag frequency, and the presence of semantic HTML5 tags like `<article>`, `<main>`, and `<section>`. The goal is to intelligently discern editorial content from surrounding UI elements.
- Noise Pruning: Once the main content candidate is identified, extraneous elements are systematically removed from the DOM. This includes:
  - Structural elements commonly found outside main content: `<nav>`, `<header>`, `<footer>`, `<aside>`.
  - Non-content elements: `<script>`, `<style>`, `<iframe>`, `<noscript>`.
  - Common identifiers for advertisements, social sharing widgets, comment sections (if not desired), and other non-essential interactive components, often identifiable via CSS classes or IDs.
  - Empty elements or elements with minimal text content that are likely layout placeholders.
This process significantly reduces the overall token count, allowing more substantial content to fit within the LLM's context window and focusing the LLM's attention on relevant data.
Semantic Enhancement and URL Canonicalization
After noise reduction, the remaining DOM is semantically enhanced and normalized:
- Image Inclusion: If desired, `<img>` tags are processed. Their `src` attributes are extracted, and their `alt` text is prioritized.
- Link Normalization: All `<a>` tags' `href` attributes are canonicalized. This involves:
  - Resolution of Relative URLs: Relative paths (e.g., `/products/item-123`) are resolved against the base URL of the original page to form absolute URLs (e.g., `https://example.com/products/item-123`).
  - Query Parameter Cleaning: Common tracking parameters (e.g., `utm_source`, `fbclid`, `gclid`) and other irrelevant query components are stripped to produce cleaner, canonical URLs.
  - Fragment Removal: URL fragments (`#section`) are typically removed unless explicitly required, as they do not identify unique resources.
Markdown Representation for LLM Efficiency
The cleaned and enhanced HTML fragment is then converted into Markdown. Markdown is chosen for its conciseness and its natural alignment with how LLMs process textual information. Libraries like turndown or custom renderers facilitate this conversion, ensuring that headings, lists, links, and paragraphs are represented efficiently. This final Markdown output is a significantly condensed, yet semantically rich, representation of the original web page's core content, optimized for LLM consumption.
```typescript
import { JSDOM } from 'jsdom';
import TurndownService from 'turndown';
import { URL } from 'url';

interface HtmlProcessorOptions {
  baseUrl: string;
  includeImages?: boolean;
  stripTrackingParams?: boolean;
}

export class HtmlToMarkdownProcessor {
  private turndownService: TurndownService;
  private options: HtmlProcessorOptions;

  constructor(options: HtmlProcessorOptions) {
    this.options = options;
    this.turndownService = new TurndownService({
      headingStyle: 'atx',
      codeBlockStyle: 'fenced',
      hr: '---',
    });

    // Custom rule for anchors: resolve relative hrefs against the base URL
    // and optionally strip common tracking parameters.
    this.turndownService.addRule('anchor', {
      filter: ['a'],
      replacement: (content, node) => {
        const anchor = node as HTMLAnchorElement;
        // getAttribute returns the raw (possibly relative) href.
        let href = anchor.getAttribute('href') || '';
        if (href) {
          try {
            const resolvedUrl = new URL(href, this.options.baseUrl);
            if (this.options.stripTrackingParams) {
              // Copy the keys first: deleting while iterating can skip entries.
              for (const key of [...resolvedUrl.searchParams.keys()]) {
                if (key.startsWith('utm_') || key === 'gclid' || key === 'fbclid') {
                  resolvedUrl.searchParams.delete(key);
                }
              }
            }
            href = resolvedUrl.toString(); // now an absolute, cleaned URL
          } catch (e) {
            // Leave malformed URLs untouched.
          }
        }
        return `[${content}](${href})`;
      },
    });

    if (!this.options.includeImages) {
      this.turndownService.remove('img');
    } else {
      this.turndownService.addRule('image', {
        filter: ['img'],
        replacement: (_content, node) => {
          const img = node as HTMLImageElement;
          let src = img.getAttribute('src') || '';
          if (src) {
            try {
              src = new URL(src, this.options.baseUrl).toString();
            } catch (e) {
              // Leave malformed URLs untouched.
            }
          }
          return `![${img.getAttribute('alt') || ''}](${src})`;
        },
      });
    }
  }

  public process(html: string): string {
    const dom = new JSDOM(html);
    const document = dom.window.document;

    // Remove common noisy elements from the entire document.
    ['nav', 'header', 'footer', 'aside', 'script', 'style', 'iframe'].forEach(tag => {
      document.querySelectorAll(tag).forEach(el => el.remove());
    });

    // A simplistic main content extraction for demonstration.
    // In a real library, this would involve sophisticated readability heuristics.
    const mainContentElement = document.querySelector('main') || document.body;

    // Further heuristic-based removal could target ad divs, social buttons, etc.
    return this.turndownService.turndown(mainContentElement.innerHTML);
  }
}

// Example usage:
const htmlContent = `
<!DOCTYPE html>
<html>
<head><title>Product Page</title></head>
<body>
  <header><h1>Site Header</h1><nav>...</nav></header>
  <main>
    <article>
      <h2>Product Name</h2>
      <p>This is a great product. <a href="/details?id=123&utm_source=test">More info</a></p>
      <img src="/assets/product.jpg" alt="Product Image">
    </article>
    <aside>Related products...</aside>
  </main>
  <footer><p>Copyright</p></footer>
  <script>alert('hello');</script>
</body>
</html>
`;

const processor = new HtmlToMarkdownProcessor({
  baseUrl: 'https://example.com',
  includeImages: true,
  stripTrackingParams: true,
});

const markdown = processor.process(htmlContent);
console.log(markdown);
/* Expected output (simplified):
## Product Name

This is a great product. [More info](https://example.com/details?id=123)

![Product Image](https://example.com/assets/product.jpg)
*/
```
Type-Safe Data Contracts with Zod
One of the most significant advancements for robust data extraction is the integration of Zod for defining and validating output schemas. Zod is a TypeScript-first schema declaration and validation library, offering powerful compile-time type inference and runtime validation.
Defining Schemas for LLM Output
With Lightfeed Extractor, developers define the expected structure of the extracted data using Zod. This schema serves as both a contract for the LLM and a validation mechanism for its output.
```typescript
import { z } from 'zod';

// Define the schema for a single product
const productSchema = z.object({
  name: z.string().describe('The name of the product.'),
  price: z.number().positive().describe('The price of the product, as a positive number.'),
  currency: z.string().length(3).describe('The 3-letter currency code (e.g., USD, EUR).'),
  description: z.string().optional().describe('A brief description of the product.'),
  features: z.array(z.string()).min(1).optional().describe('A list of key features.'),
  imageUrl: z.string().url().optional().describe('The absolute URL to the product image.'),
});

// Define the schema for the entire extraction, which might be an array of products
const extractionSchema = z.object({
  products: z.array(productSchema).min(1).describe('An array of extracted product details.'),
});

type ExtractedProducts = z.infer<typeof extractionSchema>;

// This schema is then passed to the extractor, along with the Markdown content.
```
The .describe() method is particularly useful as it adds metadata that can be directly incorporated into the LLM's prompt, guiding its output generation more effectively.
Runtime Validation and Developer Experience
Upon receiving the LLM's raw JSON output, Lightfeed Extractor attempts to parse it and then rigorously validates the resulting JavaScript object against the defined Zod schema.
- Compile-time Safety: Developers benefit from TypeScript's static type checking, ensuring that code interacting with the extracted data aligns with the schema.
- Runtime Validation: Zod performs deep validation, checking types, required fields, string formats (e.g., `url()`, `email()`), array lengths (`min()`, `max()`), and custom validation rules. This catches discrepancies that LLMs might introduce, even if the JSON is syntactically valid.
- Detailed Error Reporting: If validation fails, Zod provides granular, human-readable error messages, pinpointing exactly which part of the data structure is invalid and why. This is invaluable for debugging LLM prompts or identifying problematic website structures.
Resilient Data Recovery and Partial Extraction
Perhaps one of the most distinguishing features of Lightfeed Extractor is its emphasis on resilience and graceful degradation in the face of malformed LLM output. Instead of failing outright, the library attempts to salvage valid data.
Strategies for Malformed JSON
When JSON.parse() fails, the library employs a multi-pronged strategy to recover parsable JSON:
- Prefix/Suffix Trimming: LLMs sometimes wrap JSON in conversational text or Markdown code fences (e.g., a fenced `json` block). Regular expressions are used to trim such extraneous content, isolating the actual JSON string between the first opening `{` and the last closing `}` (or `[` and `]`).
- Heuristic-based Repair: For common JSON syntax errors, the library can apply heuristic repairs. This might involve:
- Missing Commas: Detecting and inserting commas between objects in an array or between key-value pairs in an object when they are clearly missing.
- Unclosed Strings/Brackets: Attempting to complete obviously unclosed quotes or brackets.
- Trailing Commas: Removing redundant trailing commas, which are invalid in strict JSON.
- Invalid Escapes: Correcting common escape sequence errors.
- Iterative Refinement (Advanced): For persistent issues, an advanced strategy involves sending the malformed output back to the LLM with explicit instructions to correct its JSON, referencing the original prompt and schema. This "self-correction" mechanism can significantly improve extraction success rates but incurs additional LLM token costs and latency.

This level of heuristic repair often relies on dedicated JSON repair libraries or custom parsing logic, balancing aggressive correction with the risk of unintended data alteration.
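A minimal sketch of such a self-correction loop. Here `llm` is any callable returning model text (synchronous for brevity); this is an illustration, not Lightfeed's or LangChain's actual API.

```typescript
// Retry loop: on a parse failure, feed the error and the bad output
// back to the model with an instruction to return corrected JSON.
function extractWithRetry(
  llm: (prompt: string) => string,
  basePrompt: string,
  maxAttempts = 2
): unknown | null {
  let prompt = basePrompt;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const output = llm(prompt);
    try {
      return JSON.parse(output);
    } catch (e: any) {
      // Reference the original task, the error, and the invalid output.
      prompt = `${basePrompt}\n\nYour previous output was invalid JSON ` +
        `(${e.message}). Previous output:\n${output}\nReturn corrected JSON only.`;
    }
  }
  return null; // give up after maxAttempts
}
```

Each retry costs additional tokens and latency, so a low `maxAttempts` (one or two) is the usual trade-off.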
The Importance of Granular Validation
Even if the JSON is syntactically correct, individual data points might not conform to the schema (e.g., a price field containing a string "N/A" instead of a number). Lightfeed Extractor handles this through granular validation, especially crucial for extracting lists of items:
- Array Item Validation: If the top-level schema expects an array of objects (e.g., `z.array(productSchema)`), the library iterates through each item in the LLM's parsed output array.
- Individual Item Validation: Each item is validated against its specific Zod sub-schema (`productSchema`).
- Collection of Valid Data: Valid items are collected into the final result. Invalid items are skipped, and their validation errors are typically logged or returned as part of a comprehensive error report, alongside the partially extracted data.
This approach ensures that if an LLM successfully extracts 19 out of 20 products, those 19 valid products are still returned, preventing complete pipeline failure due to a single malformed entry.
```typescript
import { z } from 'zod';

const itemSchema = z.object({
  id: z.string(),
  value: z.number().positive(),
});
type Item = z.infer<typeof itemSchema>;

function parseAndValidateWithRecovery(
  rawLLMOutput: string,
  schema: z.ZodSchema<any>
): { data: any | null; errors: z.ZodError | string | null; partialData?: any[] } {
  let parsedJson: any;

  try {
    // Attempt a standard parse first.
    parsedJson = JSON.parse(rawLLMOutput);
  } catch (e: any) {
    // Fallback: attempt heuristic repair.
    console.warn('JSON parse failed. Attempting heuristic repair...');
    try {
      // Strip Markdown code fences; the `{3} quantifier matches three backticks.
      const jsonRegex = /`{3}json\s*(\{[\s\S]*\}|\[[\s\S]*\])\s*`{3}/;
      let cleanedOutput = rawLLMOutput;
      const match = rawLLMOutput.match(jsonRegex);
      if (match && match[1]) {
        cleanedOutput = match[1];
      } else {
        // More robust repair could involve a dedicated library.
        // For demonstration, find the outermost braces or brackets.
        const firstBrace = cleanedOutput.indexOf('{');
        const lastBrace = cleanedOutput.lastIndexOf('}');
        if (firstBrace !== -1 && lastBrace > firstBrace) {
          cleanedOutput = cleanedOutput.substring(firstBrace, lastBrace + 1);
        } else {
          const firstBracket = cleanedOutput.indexOf('[');
          const lastBracket = cleanedOutput.lastIndexOf(']');
          if (firstBracket !== -1 && lastBracket > firstBracket) {
            cleanedOutput = cleanedOutput.substring(firstBracket, lastBracket + 1);
          }
        }
      }
      parsedJson = JSON.parse(cleanedOutput);
      console.warn('JSON repaired successfully.');
    } catch (repairError) {
      return {
        data: null,
        errors: `Failed to parse JSON even after repair: ${e.message} and ${repairError}`,
      };
    }
  }

  // Validate against the schema.
  const validationResult = schema.safeParse(parsedJson);
  if (validationResult.success) {
    return { data: validationResult.data, errors: null };
  }

  // Handle partial data recovery for arrays.
  if (schema instanceof z.ZodArray && Array.isArray(parsedJson)) {
    const elementSchema = (schema as z.ZodArray<any>).element;
    const partialData: any[] = [];
    const issues: z.ZodIssue[] = [];
    parsedJson.forEach((item, index) => {
      const itemResult = elementSchema.safeParse(item);
      if (itemResult.success) {
        partialData.push(itemResult.data);
      } else {
        // Prefix each issue's path with the item index for readable reports.
        for (const issue of itemResult.error.issues) {
          issues.push({ ...issue, path: [index, ...issue.path] });
        }
      }
    });
    if (partialData.length > 0) {
      console.warn(`Partial data extracted: ${partialData.length} of ${parsedJson.length} items.`);
      return { data: null, errors: new z.ZodError(issues), partialData };
    }
  }

  console.error('Schema validation failed:', validationResult.error);
  return { data: null, errors: validationResult.error };
}

// Example usage. Note that the fenced payload itself is valid JSON once the
// fences are stripped; the invalid *values* (a string and a negative number)
// are caught per item by Zod rather than by JSON.parse.
const malformedJsonOutput = `
This is some introductory text.
\`\`\`json
[
  {"id": "A1", "value": 100},
  {"id": "B2", "value": "invalid"},
  {"id": "C3", "value": 300},
  {"id": "D4", "value": -50}
]
\`\`\`
Some trailing text.
`;

const schemaForArray = z.array(itemSchema);
const result = parseAndValidateWithRecovery(malformedJsonOutput, schemaForArray);
console.log('\nResult:', result);
/* B2 and D4 fail item validation; partialData contains A1 and C3. */
```
Orchestration with LangChain-Compatible LLMs
Lightfeed Extractor is designed to be agnostic to the underlying LLM provider, achieving this through compatibility with LangChain's flexible interfaces. This allows seamless integration with various models: OpenAI (GPT-3.5, GPT-4), Google Gemini, Anthropic Claude, or locally hosted models via Ollama.
Prompt Engineering for Structured Output
Effective prompt engineering is paramount for consistent LLM output. The library constructs prompts that include:
- The cleaned Markdown content of the web page.
- The Zod schema (often serialized into a JSON schema or a structured natural language description).
- Clear instructions to generate JSON conforming strictly to the provided schema, without conversational preamble or postscript.
- Instructions for handling missing data (e.g., return `null` or omit optional fields if data is not found).
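An illustrative assembly of those ingredients; the exact wording is an assumption, not Lightfeed's actual template.

```typescript
// Hypothetical prompt builder combining content, schema, and instructions.
function buildExtractionPrompt(markdown: string, schemaJson: string): string {
  return [
    'Extract structured data from the following page content.',
    'Respond with JSON only: no preamble, no code fences, no commentary.',
    'The JSON must conform strictly to this schema:',
    schemaJson,
    'If a value is missing from the page, use null or omit the optional field.',
    '--- PAGE CONTENT ---',
    markdown,
  ].join('\n\n');
}
```

Placing the instructions and schema before the page content keeps them prominent even when the Markdown payload is long.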
Integration Flexibility
By leveraging LangChain's Runnable interface or similar abstractions, Lightfeed Extractor can easily swap LLM implementations without altering the core logic. This provides developers with the flexibility to choose models based on cost, performance, and specific task requirements.
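In practice, an extractor core only needs an `invoke`-style callable, which LangChain chat models (ChatOpenAI, ChatAnthropic, and others) provide. The simplified interface below stands in for LangChain's Runnable; the stub shows that swapping providers leaves the core logic unchanged.

```typescript
// Minimal stand-in for a LangChain-compatible model interface.
interface InvokableLLM {
  invoke(prompt: string): Promise<string>;
}

// Core logic depends only on the interface, never on a concrete provider.
async function runExtraction(llm: InvokableLLM, prompt: string): Promise<unknown> {
  const raw = await llm.invoke(prompt);
  return JSON.parse(raw);
}

// Swapping providers means swapping the object passed in; a real call site
// would pass e.g. a ChatOpenAI or ChatOllama instance instead of this stub.
const fakeLlm: InvokableLLM = { invoke: async () => '{"ok": true}' };
```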
Leveraging Playwright for Enhanced Web Interaction
While jsdom is effective for static HTML parsing, many modern websites rely heavily on JavaScript for rendering content. Furthermore, many sites employ anti-bot measures. Lightfeed Extractor addresses these challenges by integrating Playwright for browser automation.
Dynamic Content and Anti-Bot Measures
- JavaScript Rendering: Playwright launches a real browser instance (Chromium, Firefox, or WebKit), allowing it to execute JavaScript, render dynamic content, and interact with single-page applications (SPAs) just like a human user would. This ensures that the HTML provided to the LLM is the fully rendered content.
- Anti-Bot Patches: Playwright can be configured with "stealth" techniques to mimic human browsing behavior, making it more difficult for websites to detect and block automated access. This includes setting realistic user agents, viewport sizes, avoiding common bot-like network patterns, and handling CAPTCHAs (though the latter often requires external services).
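A sketch of such context settings; the values are illustrative examples and are not guaranteed to defeat detection.

```typescript
// Browser context options mimicking a common real-world environment.
const stealthContextOptions = {
  userAgent:
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  viewport: { width: 1366, height: 768 }, // common desktop resolution
  locale: 'en-US',
};

// With Playwright this would be applied as (not executed here):
//   const browser = await chromium.launch({ headless: true });
//   const context = await browser.newContext(stealthContextOptions);
```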
Resource Optimization
Playwright also enables optimization strategies:
- Resource Blocking: Specific resources (images, fonts, stylesheets, scripts from known ad/tracker domains) can be intercepted and blocked at the network level. This reduces bandwidth consumption, speeds up page loading, and minimizes the amount of "noise" HTML that needs to be processed later, further conserving token budgets.
- Headless Operation: For server environments, Playwright operates in headless mode, meaning no graphical browser window is displayed, making it suitable for scalable deployment.
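A sketch of a blocking predicate for the resource-blocking strategy above. The resource types and tracker domains are illustrative, and the Playwright wiring is shown in a comment rather than executed.

```typescript
// Resource types that rarely matter for text extraction.
const BLOCKED_RESOURCE_TYPES = new Set(['image', 'font', 'stylesheet', 'media']);

export function shouldBlockRequest(resourceType: string, url: string): boolean {
  if (BLOCKED_RESOURCE_TYPES.has(resourceType)) return true;
  // Illustrative tracker domains; a real blocklist would be far longer.
  return /doubleclick\.net|google-analytics\.com/.test(url);
}

// Playwright wiring (not executed here):
//   await page.route('**/*', (route) =>
//     shouldBlockRequest(route.request().resourceType(), route.request().url())
//       ? route.abort()
//       : route.continue()
//   );
```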
Architectural Considerations and Operational Insights
The Lightfeed Extractor embodies a robust pipeline design, orchestrating various components into a cohesive workflow.
Pipeline Flow
- Input: A URL or raw HTML string, accompanied by a Zod schema.
- Browser Automation (Optional): If a URL is provided, Playwright navigates to the page, waits for full rendering, and extracts the raw HTML.
- HTML Pre-processing: The raw HTML is cleaned of noise and converted into a condensed Markdown representation, as described earlier.
- LLM Extraction: A prompt combining the Markdown content with the Zod schema is sent to the configured LLM.
- Parsing and Validation: The LLM's output is parsed, repaired heuristically if necessary, and validated against the schema, with partial recovery for arrays of items.
- Output: Validated, typed data is returned alongside any per-item validation errors.
Originally published in Spanish at www.mgatc.com/blog/robust-llm-extractor-websites-typescript/