# Scrape vs Crawl vs Map: Picking the Right Anakin API for the Job
You have a website you need data from. You open the docs, see three APIs that all sound like they touch web pages, and have to make a choice. Scrape, Crawl, Map. The names feel intuitive until you actually need to pick one.
This article breaks down what each one does, where it fits, and what the failure modes look like if you reach for the wrong one.
## What Each API Actually Does
Scrape takes a single URL and returns clean, structured content from that page. You get the main text, metadata, possibly tables or links, depending on how you configure it. One request in, one page out. It is the right tool when you already know exactly which page holds the data you want.
Crawl starts at a URL and follows links, returning content from every page it visits up to some depth or page limit. You use it when the data you want is spread across a site and you do not know the exact URLs ahead of time. Think documentation sites, blog archives, or any site where the index page links to the content pages.
Map does not scrape content at all. It traverses a domain and returns every discoverable URL, nothing more. No page content, just a list of addresses. It is fast because it only needs to find links, not render and parse each page.
Here is a quick decision table:
| You want... | Use |
|---|---|
| The content of one specific page | Scrape |
| The content of many pages on a site | Crawl |
| A list of all URLs on a site | Map |
| URLs first, then selectively scrape some | Map + Scrape |
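The decision table maps onto three different request shapes. Here is a rough sketch: the endpoints and the `formats` field match the example later in this article, while the `limit` field on crawl is a hypothetical name for a page cap, so check the API reference for the real parameter.

```python
# Request payload shapes for the three APIs. This is a sketch, not a
# definitive client; "limit" on crawl is a hypothetical parameter name.

def scrape_payload(url: str) -> dict:
    """One known page -> structured content."""
    return {"url": url, "formats": ["markdown"]}

def crawl_payload(start_url: str, limit: int = 100) -> dict:
    """Start URL -> content of every page reachable by links, capped."""
    return {"url": start_url, "limit": limit}

def map_payload(domain_url: str) -> dict:
    """Domain -> list of discoverable URLs, no page content."""
    return {"url": domain_url}
```

The point of writing these out is that the shapes are almost identical; what differs is how much work the API does per request, and that is exactly what you are choosing between.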
## A Concrete Example: Building a RAG Knowledge Base
Say you are building a RAG pipeline over a software product's documentation. The docs live at docs.example.com. You want to ingest every page.
The wrong instinct here is to jump straight to Crawl. Before you do that, run Map to understand what you are dealing with.
```python
import requests

ANAKIN_API_KEY = "your_api_key"

# Step 1: Map the docs site to get all URLs
map_response = requests.post(
    "https://api.anakin.ai/v1/map",
    headers={"Authorization": f"Bearer {ANAKIN_API_KEY}"},
    json={"url": "https://docs.example.com"}
)
urls = map_response.json().get("urls", [])
print(f"Found {len(urls)} URLs")

# Filter out non-content pages
content_urls = [
    u for u in urls
    if not any(skip in u for skip in ["/changelog", "/search", "/404"])
]
print(f"Scraping {len(content_urls)} content pages")

# Step 2: Scrape each page individually
results = []
for url in content_urls[:10]:  # test with first 10
    scrape_response = requests.post(
        "https://api.anakin.ai/v1/scrape",
        headers={"Authorization": f"Bearer {ANAKIN_API_KEY}"},
        json={"url": url, "formats": ["markdown"]}
    )
    data = scrape_response.json()
    results.append({
        "url": url,
        "title": data.get("title"),
        "content": data.get("markdown")
    })
print(f"Scraped {len(results)} pages")
```
This pattern gives you control. You can filter URLs before scraping, prioritize certain sections, and avoid wasting API calls on pages that will not add anything useful to your index (changelog entries, auto-generated search pages, and so on).
If you had used Crawl directly, you would have gotten all of that content automatically, but with less visibility into what was included until after the fact.
## When Crawl Is the Right Choice
Crawl makes sense when you want comprehensive coverage and do not need to filter ahead of time. A few good cases:
- Ingesting a blog where every post is worth including
- Archiving a site before a migration
- Building a competitive analysis tool that needs the full text of a competitor's site
The thing to watch with Crawl is depth configuration. A site that looks small can have thousands of pages once you follow pagination, tag pages, and user-profile URLs. Set a page limit and check the shape of what comes back before running it at full scale.
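A capped first pass might look like the sketch below. The `limit` and `maxDepth` parameter names are assumptions for illustration, not confirmed API fields; you would send the resulting payload with `requests.post` exactly as in the Map/Scrape example above.

```python
def build_crawl_request(start_url: str, limit: int = 200, max_depth: int = 2) -> dict:
    """Build a crawl payload with an explicit page cap and link depth.

    NOTE: "limit" and "maxDepth" are hypothetical parameter names used for
    illustration; check the API reference for the real ones.
    """
    return {"url": start_url, "limit": limit, "maxDepth": max_depth}

# First pass: cap at 25 pages, depth 1, to inspect the shape of the results
# before committing to a full-site crawl.
trial = build_crawl_request("https://docs.example.com", limit=25, max_depth=1)
```

Running a small trial crawl like this is how you catch the "site that looks small but has thousands of pages" problem before it costs you anything.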
Crawl also handles the link-following logic for you. If a docs site has a sidebar with 200 links, manually extracting and deduplicating them is tedious work; Crawl does it for you. Map does the same thing for URL discovery, but without the content.
## When to Combine All Three
There is a pattern that shows up in more sophisticated pipelines: Map to discover, filter in code, then Scrape selectively. This is the approach in the example above.
It adds a step, but it gives you a clean separation between "what exists on this site" and "what I actually want." That matters when:
- The site has sections you want to exclude (API reference pages that are too terse to be useful in a RAG context, for example)
- You want to check which URLs you have already indexed before scraping again
- You are building an incremental update system where you only re-scrape pages that have changed
For the incremental case, you would run Map periodically to detect new URLs, compare against your stored URL list, and then Scrape only the new ones. That is a lot cheaper than re-crawling the whole site every time.
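The diff step in that incremental flow is just set arithmetic. A minimal sketch, assuming you keep previously indexed URLs in a set (the URL values here are illustrative):

```python
def find_new_urls(mapped_urls, indexed_urls):
    """Return URLs discovered by Map that are not yet in the index."""
    return sorted(set(mapped_urls) - set(indexed_urls))

# URLs already in the knowledge base
indexed = {
    "https://docs.example.com/intro",
    "https://docs.example.com/setup",
}

# URLs returned by a fresh Map run
mapped = [
    "https://docs.example.com/intro",
    "https://docs.example.com/setup",
    "https://docs.example.com/new-feature",
]

# Only the newly discovered page needs a Scrape call
to_scrape = find_new_urls(mapped, indexed)
```

One Map call plus a handful of Scrape calls per update cycle, instead of a full crawl every time.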
## The Failure Mode to Avoid
The most common mistake is using Scrape in a loop when Crawl would handle it better, or using Crawl when you only needed one page.
Scraping in a loop without any link discovery means you have to manually maintain the list of URLs. If the site adds a new page, you miss it. Crawl handles that automatically.
Using Crawl for a single known URL is just overhead. You will get the content of that page plus everything linked from it, which is not what you wanted, and you will pay for the extra pages.
Map is the one that tends to get overlooked. It feels redundant if you are already planning to crawl. But if you want to do anything intelligent with URL selection before fetching content, Map gives you that information at low cost.
## What I Would Do Next
If you are starting a new data ingestion project, run Map first. Look at what comes back. Understand the structure of the site: how many pages, what the URL patterns look like, whether there are sections worth skipping. Then decide whether Crawl covers your needs or whether you want the finer control of Map plus selective Scraping.
That five-minute audit at the start will save you from over-fetching junk pages or under-fetching content that was one link deeper than you expected.