Scrape vs Crawl vs Map: Picking the Right Anakin API for the Job

You have a URL. You need data from it. The question is not "how do I scrape this?" The question is "what scope of data do I actually need, and what's the cheapest way to get it?"

Anakin exposes three distinct APIs for web data extraction: Scrape, Crawl, and Map. They sound like synonyms. They are not. Using the wrong one wastes money, slows you down, and sometimes returns way more (or less) than you needed. Here is how to think about each one.

What Each API Actually Does

Scrape API takes a single URL and returns clean, structured content from that page. You get the text, maybe the HTML, maybe specific fields depending on how you configure the request. One URL in, one payload out. It handles JavaScript rendering and bot detection, and gives you something you can immediately feed into a parser or an LLM prompt.

Crawl API starts at a URL and follows links. It traverses the site according to rules you set: max depth, URL patterns to include or exclude, page limits. It is designed for situations where you need content from many pages but you do not know in advance which URLs those are.

Map API discovers all the URLs on a domain without fetching page content. It reads sitemaps, follows internal links, and returns a list of URLs. No content, just the address book.

The mental model: Map tells you what exists. Crawl fetches what exists. Scrape fetches exactly what you point it at.

When to Use Which

Use Scrape when you already know the URLs.

If you have a list of product pages, article URLs, or profile pages, use the Scrape API in a loop. It is fast per request, cheap, and predictable. Building a RAG pipeline from a known corpus? Scrape each URL. Monitoring a specific page for changes? Scrape it on a schedule. Extracting a single article for an LLM prompt? Scrape it.

import httpx
import json

ANAKIN_API_KEY = "your_api_key"
SCRAPE_ENDPOINT = "https://api.anakin.ai/v1/scrape"

urls = [
    "https://example.com/products/widget-a",
    "https://example.com/products/widget-b",
    "https://example.com/products/widget-c",
]

results = []

for url in urls:
    # One request per known URL; no discovery step needed.
    response = httpx.post(
        SCRAPE_ENDPOINT,
        headers={"Authorization": f"Bearer {ANAKIN_API_KEY}"},
        json={"url": url, "format": "markdown"},
        timeout=30,  # scrapes with JS rendering can be slow; don't hang forever
    )
    if response.status_code == 200:
        data = response.json()
        results.append({"url": url, "content": data["content"]})
    else:
        print(f"Failed {url}: {response.status_code}")

# Now feed results into your vector store or LLM pipeline
with open("scraped_products.json", "w") as f:
    json.dump(results, f, indent=2)

This is the right pattern for a known URL list. No crawling overhead, no discovery step, just clean content per URL.

Use Crawl when the site structure is the source of truth.

Say you are building a competitor intelligence tool and you need everything from docs.competitor.com. You do not have a list of pages. The site does. Crawl API starts at the root, follows links up to whatever depth you set, and returns content from every page it reaches.

This also fits content migration jobs, documentation ingestion for search indexes, and any situation where "all pages under this path" is your query. The cost is that you often get pages you do not want: legal pages, tag archives, duplicate content from pagination. Budget for filtering.

A practical crawl config:

crawl_payload = {
    "url": "https://docs.example.com",  # where the crawl starts
    "maxDepth": 3,  # how many link-hops to follow from the root
    "maxPages": 200,  # hard cap on the page budget
    "includePaths": ["/docs/", "/guides/"],  # only follow URLs under these paths
    "excludePaths": ["/docs/archive/", "/legal/"],  # skip these even if linked
    "format": "markdown",
}

Setting includePaths is the most important tuning knob. Without it, a crawl on a large site will hit irrelevant pages fast and eat into your page budget.
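
Submitting the crawl follows the same POST pattern as the other examples. Here is a minimal sketch, assuming a /v1/crawl endpoint that mirrors the /v1/scrape and /v1/map naming used elsewhere in this post, and a response containing a pages array of url/content objects. Check Anakin's docs for the real shape: many crawl APIs are asynchronous and hand back a job ID to poll instead.

import httpx

ANAKIN_API_KEY = "your_api_key"

# Hypothetical endpoint name, extrapolated from /v1/scrape and /v1/map.
crawl_response = httpx.post(
    "https://api.anakin.ai/v1/crawl",
    headers={"Authorization": f"Bearer {ANAKIN_API_KEY}"},
    json=crawl_payload,
    timeout=300,  # a 200-page crawl takes far longer than a single scrape
)
crawl_response.raise_for_status()

# Assumed response shape: {"pages": [{"url": ..., "content": ...}, ...]}
for page in crawl_response.json()["pages"]:
    print(page["url"], len(page["content"]))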

Use Map when you need inventory, not content.

Map is the cheapest of the three because it returns URLs, not content. This makes it the right first step for a lot of workflows:

  • You want to understand the shape of a site before deciding what to scrape.
  • You are building a selective crawl: map the whole domain, filter the URL list to what you care about, then scrape only those URLs.
  • You need to check whether a page exists without fetching it.
  • You are auditing a site for broken links or running an SEO analysis.

The output is a flat JSON array of URLs. That is it. Feed it into a filter, pass it to the Scrape API in batches, or just inspect it manually to understand what you are working with.

map_response = httpx.post(
    "https://api.anakin.ai/v1/map",
    headers={"Authorization": f"Bearer {ANAKIN_API_KEY}"},
    json={"url": "https://docs.example.com"},
    timeout=60,  # mapping a large domain can take a while
)
map_response.raise_for_status()

all_urls = map_response.json()["urls"]
# Keep only HTML doc pages; drop PDFs and everything outside /docs/
doc_urls = [u for u in all_urls if "/docs/" in u and not u.endswith(".pdf")]

print(f"Found {len(doc_urls)} doc pages to scrape")

Map plus a filter plus Scrape in a loop is often better than Crawl for sites where you know the URL pattern but not the full list. You get tighter control and no wasted requests.
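
Wired together, the whole pattern is one short function. A sketch reusing the doc_urls list from the Map example, with the same endpoint and response-field assumptions as the scrape loop earlier:

def scrape_urls(urls):
    """Scrape a pre-filtered URL list; Map already did the discovery."""
    results = []
    for url in urls:
        resp = httpx.post(
            SCRAPE_ENDPOINT,
            headers={"Authorization": f"Bearer {ANAKIN_API_KEY}"},
            json={"url": url, "format": "markdown"},
            timeout=30,
        )
        if resp.status_code == 200:
            results.append({"url": url, "content": resp.json()["content"]})
    return results

# Map gave the inventory, the filter set the scope, Scrape does the fetching.
docs = scrape_urls(doc_urls)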

A Decision Rule to Keep in Mind

Before you make an API call, ask two questions:

  1. Do I know which URLs I need? Yes: use Scrape. No: keep going.
  2. Do I need the content, or just the URL list? Content: use Crawl. Just URLs: use Map.
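
If you prefer the rule as code, the whole flowchart fits in one function (the name and signature are mine, purely illustrative):

def pick_api(know_urls: bool, need_content: bool) -> str:
    """Encode the two-question decision rule."""
    if know_urls:
        return "scrape"
    return "crawl" if need_content else "map"

assert pick_api(know_urls=True, need_content=True) == "scrape"
assert pick_api(know_urls=False, need_content=True) == "crawl"
assert pick_api(know_urls=False, need_content=False) == "map"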

The worst pattern I see is people reaching for Crawl when they already know their URLs, or using Scrape one-by-one to build an index when Crawl would handle it in a single call. The second worst is using Crawl when Map plus a filter would get the same URL list at a fraction of the cost.

What to Build Next

If you are putting together a documentation RAG pipeline, the pattern I would use is: Map the docs domain, filter to relevant paths, batch-scrape those URLs with the Scrape API, chunk the markdown, embed it, and load into a vector store. Crawl is the shortcut version, but the Map-plus-Scrape approach gives you an explicit list of what went into your index, which matters when you need to refresh individual pages later.
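
The chunking step is the only piece of that pipeline with no API call in it. A minimal sketch of fixed-size chunking with overlap (the sizes are arbitrary; tune them for your embedding model):

def chunk_markdown(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split scraped markdown into overlapping chunks for embedding."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + size])
        start += size - overlap
    return chunks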

If you are doing ongoing monitoring, Scrape on a schedule is almost always the right answer. Crawl is for one-time or periodic full ingestions.
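
For the monitoring case, change detection is just a hash comparison on top of the scrape. A minimal sketch with the same endpoint assumptions as above; the schedule itself belongs in cron or your job runner, and the state file is a stand-in for whatever storage you already have:

import hashlib
import json
import pathlib

import httpx

ANAKIN_API_KEY = "your_api_key"
SCRAPE_ENDPOINT = "https://api.anakin.ai/v1/scrape"
STATE_FILE = pathlib.Path("page_hashes.json")  # stand-in for real storage

def content_fingerprint(url: str) -> str:
    """Scrape the page and hash its markdown content."""
    resp = httpx.post(
        SCRAPE_ENDPOINT,
        headers={"Authorization": f"Bearer {ANAKIN_API_KEY}"},
        json={"url": url, "format": "markdown"},
        timeout=30,
    )
    resp.raise_for_status()
    return hashlib.sha256(resp.json()["content"].encode()).hexdigest()

url = "https://example.com/pricing"
state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}

new_hash = content_fingerprint(url)
if state.get(url) != new_hash:
    print(f"{url} changed")  # swap in your real alerting here
    state[url] = new_hash
    STATE_FILE.write_text(json.dumps(state))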

The APIs are composable. Use them that way.
