Nathan Skiles

Posted on • Originally published at serpapi.com

Turning search results into Markdown for LLMs

Intro

This article will walk through converting search results into Markdown format, suitable for use in large language models (LLMs) and other applications.

Markdown is a lightweight markup language that provides a simple, readable way to format text with plain-text syntax. Check out the Markdown Guide for more information:

Markdown Guide

Use Case

Markdown's simple, readable format allows for the transformation of raw webpage data into clean, actionable information across different use cases:

  1. LLM Training: Generate Q&A datasets or custom knowledge bases.
  2. Content Aggregation: Create training datasets or compile research.
  3. Market Research: Monitor competitors or gather product information.

SerpApi

SerpApi is a web scraping company that allows developers to extract search engine results and data from various search engines, including Google, Bing, Yahoo, Baidu, Yandex, and others. It provides a simple way to access search engine data programmatically without dealing directly with the complexities of web scraping.

This guide focuses on the Google Search API, but the concepts and techniques discussed can be adapted for use with SerpApi’s other APIs.

Google Search API

The Google Search API lets developers programmatically retrieve structured JSON data from live Google searches. Key benefits include:

  • CAPTCHA handling and browser automation: Avoid manual intervention and IP blocks.
  • Structured data: Output is clean and easy to parse.
  • Global and multilingual support: Search in specific languages or regions.
  • Scalability: Perform high-volume searches without disruptions.

Google Search Engine Results API

Getting Started

This section provides a complete code example for fetching Google search results using SerpApi, parsing the webpage content, and converting it to Markdown. While this example uses Node.js (JavaScript), the same principles apply in other languages.

Required Packages

Make sure to install the following packages in your Node.js project. The example below also uses dotenv (to load the API key) and node-fetch (to fetch pages), so install those as well.

SerpApi JavaScript: Scrape and parse search engine results using SerpApi. Get search results from Google, Bing, Baidu, Yandex, Yahoo, Home Depot, eBay and more.

SerpApi JavaScript

Cheerio: A fast, flexible, and elegant library for parsing and manipulating HTML and XML.

Cheerio

Turndown: Convert HTML into Markdown with JavaScript.

Turndown

Importing Packages

First, we must import all of our required packages:

import dotenv from "dotenv";
import fetch from "node-fetch";
import fs from "fs/promises";
import path from "path";
import { getJson } from "serpapi";
import * as cheerio from "cheerio";
import TurndownService from "turndown";

// Load environment variables (e.g. SERPAPI_KEY) from a .env file
dotenv.config();

Fetching Search Results

The fetchSearchResults function retrieves search results using SerpApi’s Google Search API:

const fetchSearchResults = async (query) => {
  return await getJson("google", {
    api_key: process.env.SERPAPI_KEY,
    q: query,
    num: 5,
  });
};

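The call resolves to a JSON object. Here is a trimmed, hypothetical sketch of its shape — only the fields this guide relies on are shown, and real responses contain many more (search metadata, pagination, and so on):

```javascript
// Hypothetical, trimmed example of a search response — real responses
// contain many additional fields beyond organic_results.
const exampleResponse = {
  organic_results: [
    { position: 1, title: "Example result", link: "https://example.com/post" },
    { position: 2, title: "Another result", link: "https://example.org/page" },
  ],
};

// The later steps in this guide only need each result's link:
const links = exampleResponse.organic_results.map((result) => result.link);
console.log(links); // [ 'https://example.com/post', 'https://example.org/page' ]
```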

Create a .env file containing your SerpApi API key so the dotenv package can load it into process.env. Alternatively, if you are just running the script locally, replace process.env.SERPAPI_KEY with your API key directly.
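A minimal .env file looks like this (replace the placeholder with your real key):

```
SERPAPI_KEY=your_api_key_here
```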

Parsing Webpage Content

The parseUrl function fetches the HTML of a given URL, cleans it, and converts it to Markdown:

const parseUrl = async (url) => {
  try {
    // Configure fetch request with browser-like headers
    const response = await fetch(url, {
      headers: {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
        Accept:
          "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
      },
    });

    if (!response.ok) {
      throw new Error(`HTTP error! status: ${response.status}`);
    }

    const html = await response.text();

    // Initialize HTML parser and markdown converter
    const $ = cheerio.load(html);
    const turndown = new TurndownService({
      headingStyle: "atx",
      codeBlockStyle: "fenced",
    });

    // Clean up HTML by removing unnecessary elements
    $("script, style, nav, footer, iframe, .ads").remove();

    // Extract title and main content
    const title = $("title").text().trim() || $("h1").first().text().trim();
    const mainContent =
      $("article, main, .content, #content, .post").first().html() ||
      $("body").html();
    const content = turndown.turndown(mainContent || "");

    return { title, content };
  } catch (error) {
    console.error(`Failed to parse ${url}:`, error.message);
    return null;
  }
};


This function produces clean, readable Markdown by removing non-essential elements like scripts and ads.

Sanitizing Keywords

To prevent filename issues, we can sanitize keywords before using them in filenames:

const sanitizeKeyword = (keyword) => {
  return keyword
    .replace(/\s+/g, "_") // Replace spaces with underscores
    .substring(0, 15) // Truncate to 15 characters
    .toLowerCase(); // Convert to lowercase
};

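For instance, a multi-word keyword becomes a short, filesystem-safe slug (the helper is repeated here so the snippet runs on its own):

```javascript
// Same helper as above, redefined so this snippet is self-contained
const sanitizeKeyword = (keyword) => {
  return keyword
    .replace(/\s+/g, "_") // Replace spaces with underscores
    .substring(0, 15) // Truncate to 15 characters
    .toLowerCase(); // Convert to lowercase
};

console.log(sanitizeKeyword("PlayStation 5 Reviews")); // "playstation_5_r"
```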

Writing to Markdown

This function writes the parsed content to a Markdown file, using the sanitize function to set the file's name:

const writeToMarkdown = async (data, keyword, index, url) => {
  const sanitizedKeyword = sanitizeKeyword(keyword);
  // Strip ":" and "." from the timestamp so the filename is valid on every platform
  const timestamp = new Date().toISOString().replace(/[:.]/g, "-");
  const filename = path.join(
    "output",
    `${timestamp}_${sanitizedKeyword}_${index + 1}.md`
  );
  const content = `[//]: # (Source: ${url})\n\n# ${data.title}\n\n${data.content}`;
  await fs.writeFile(filename, content, "utf-8");
  return filename;
};

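Thanks to the comment-style source link, a generated file might look something like this (the title and content are made up for illustration):

```markdown
[//]: # (Source: https://example.com/coffee-guide)

# A Beginner's Guide to Coffee

Coffee is a brewed drink prepared from roasted coffee beans...
```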

Main Execution

The main script ties everything together. Update the keywords array with keywords relevant to your use case:

// Example Keyword array
const keywords = ["coffee", "playstation 5", "web scraping"];

// Main execution block
(async () => {
  try {
    // Create output directory if it doesn't exist
    await fs.mkdir("output", { recursive: true });

    // Process each keyword
    for (const keyword of keywords) {
      const results = await fetchSearchResults(keyword);

      // Process search results if available
      if (results.organic_results && results.organic_results.length > 0) {
        for (const [index, result] of results.organic_results.entries()) {
          try {
            const data = await parseUrl(result.link);

            // parseUrl returns null on failure, so skip this result
            if (!data) continue;

            const filename = await writeToMarkdown(
              data,
              keyword,
              index,
              result.link
            );
            console.log(`Written to: ${filename}`);
          } catch (err) {
            console.error(`Failed to process ${result.link}:`, err.message);
            continue;
          }
        }
      } else {
        console.log(`No organic results found for keyword: ${keyword}`);
      }
    }
  } catch (error) {
    console.error(error);
  }
})();



To summarize the above, we:

  • Set up the output directory: Ensures files are saved to an appropriate location.
  • Fetch and parse results: Process each search result URL for relevant content.
  • Error handling: Prevents the entire process from failing due to individual errors.

Next Steps

While the above should get you started, you may need to configure Cheerio or Turndown further to dial in the sections you're scraping.

You can find a repository for the above code here:

NateSkiles/search-results-to-markdown

Conclusion

SerpApi simplifies accessing structured search engine data through programmatic methods. By leveraging code-based solutions, developers can efficiently extract and transform web pages from search results into usable formats, enabling data collection and analysis.
