DEV Community: tokozen

Scrape vs Crawl vs Map: Picking the Right Anakin API for the Job

tokozen — Fri, 15 May 2026 06:45:46 +0000

You have a URL. You need data from it. The question is not "how do I scrape this?" The question is "what scope of data do I actually need, and what's the cheapest way to get it?"

Anakin exposes three distinct APIs for web data extraction: Scrape, Crawl, and Map. They sound like synonyms. They are not. Using the wrong one wastes money, slows you down, and sometimes returns way more (or less) than you needed. Here is how to think about each one.

What Each API Actually Does

Scrape API takes a single URL and returns clean, structured content from that page. You get the text, maybe the HTML, maybe specific fields depending on how you configure the request. One URL in, one payload out. It handles JavaScript rendering, handles bot detection, and gives you something you can immediately feed into a parser or an LLM prompt.

Crawl API starts at a URL and follows links. It traverses the site according to rules you set: max depth, URL patterns to include or exclude, page limits. It is designed for situations where you need content from many pages but you do not know in advance which URLs those are.

Map API discovers the URLs on a domain without returning page content. It reads sitemaps, follows internal links, and returns a list of URLs. No content, just the address book.

The mental model: Map tells you what exists. Crawl fetches what exists. Scrape fetches exactly what you point it at.

When to Use Which

Use Scrape when you already know the URLs.

If you have a list of product pages, article URLs, or profile pages, use the Scrape API in a loop. It is fast per request, cheap, and predictable. Building a RAG pipeline from a known corpus? Scrape each URL. Monitoring a specific page for changes? Scrape it on a schedule. Extracting a single article for an LLM prompt? Scrape it.

import httpx
import json

ANAKIN_API_KEY = "your_api_key"
SCRAPE_ENDPOINT = "https://api.anakin.ai/v1/scrape"

urls = [
    "https://example.com/products/widget-a",
    "https://example.com/products/widget-b",
    "https://example.com/products/widget-c",
]

results = []

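for url in urls:
    response = httpx.post(
        SCRAPE_ENDPOINT,
        headers={"Authorization": f"Bearer {ANAKIN_API_KEY}"},
        json={"url": url, "format": "markdown"},
        timeout=30.0,  # httpx defaults to 5 seconds, which is tight for rendered pages
    )
    if response.status_code == 200:
        data = response.json()
        # .get avoids a KeyError if a response comes back without a content field
        results.append({"url": url, "content": data.get("content", "")})
    else:
        print(f"Failed {url}: {response.status_code}")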

# Now feed results into your vector store or LLM pipeline
with open("scraped_products.json", "w") as f:
    json.dump(results, f, indent=2)

This is the right pattern for a known URL list. No crawling overhead, no discovery step, just clean content per URL.

Use Crawl when the site structure is the source of truth.

Say you are building a competitor intelligence tool and you need everything from docs.competitor.com. You do not have a list of pages. The site does. Crawl API starts at the root, follows links up to whatever depth you set, and returns content from every page it reaches.

This also fits content migration jobs, documentation ingestion for search indexes, and any situation where "all pages under this path" is your query. The cost is that you often get pages you do not want: legal pages, tag archives, duplicate content from pagination. Budget for filtering.

A practical crawl config:

crawl_payload = {
    "url": "https://docs.example.com",
    "maxDepth": 3,
    "maxPages": 200,
    "includePaths": ["/docs/", "/guides/"],
    "excludePaths": ["/docs/archive/", "/legal/"],
    "format": "markdown",
}

Setting includePaths is the most important tuning knob. Without it, a crawl on a large site will hit irrelevant pages fast and eat into your page budget.
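
To see what the include and exclude rules do to a page budget, here is a small local sketch. It assumes plain path-prefix matching with excludes taking priority; that is an assumption for illustration, and the Crawl API's actual matching rules (globs, regexes, precedence) may differ. The in_scope helper is hypothetical, not part of any SDK.

```python
from urllib.parse import urlparse

def in_scope(url: str, include_paths: list[str], exclude_paths: list[str]) -> bool:
    """Return True if the URL path matches an include prefix and no exclude prefix."""
    path = urlparse(url).path
    # Assumed precedence: an exclude match always wins over an include match
    if any(path.startswith(prefix) for prefix in exclude_paths):
        return False
    return any(path.startswith(prefix) for prefix in include_paths)

include = ["/docs/", "/guides/"]
exclude = ["/docs/archive/", "/legal/"]

candidates = [
    "https://docs.example.com/docs/install",
    "https://docs.example.com/docs/archive/v1",
    "https://docs.example.com/legal/terms",
    "https://docs.example.com/blog/launch",
]

kept = [u for u in candidates if in_scope(u, include, exclude)]
print(kept)  # ['https://docs.example.com/docs/install']
```

Only one of the four candidate pages would count against maxPages under these rules, which is the point of setting includePaths early.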

Use Map when you need inventory, not content.

Map is the cheapest of the three because it returns URLs, not content. This makes it the right first step for a lot of workflows:

You want to understand the shape of a site before deciding what to scrape.
You are building a selective crawl: map the whole domain, filter the URL list to what you care about, then scrape only those URLs.
You need to check whether a page exists without fetching it.
You are auditing a site for broken link detection or SEO analysis.

The output is a flat list of URLs. That is it. Feed it into a filter, pass it to the Scrape API in batches, or just inspect it manually to understand what you are working with.

map_response = httpx.post(
    "https://api.anakin.ai/v1/map",
    headers={"Authorization": f"Bearer {ANAKIN_API_KEY}"},
    json={"url": "https://docs.example.com"},
    timeout=30.0,
)
map_response.raise_for_status()  # fail loudly here instead of with a KeyError below

all_urls = map_response.json()["urls"]
doc_urls = [u for u in all_urls if "/docs/" in u and not u.endswith(".pdf")]

print(f"Found {len(doc_urls)} doc pages to scrape")

Map plus a filter plus Scrape in a loop is often better than Crawl for sites where you know the URL pattern but not the full list. You get tighter control and no wasted requests.
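
Between the filter and the Scrape loop, it usually helps to deduplicate the URL list and split it into batches so a failure only costs one batch. A minimal sketch follows; the batched helper and the batch size of 3 are illustrative choices, not something the API requires.

```python
from typing import Iterator

def batched(urls: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` URLs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]

# Stand-in for the filtered Map output; one URL appears twice on purpose
doc_urls = [f"https://docs.example.com/docs/page-{i}" for i in range(7)]
doc_urls.append("https://docs.example.com/docs/page-0")

# dict.fromkeys drops duplicates while keeping first-seen order
unique_urls = list(dict.fromkeys(doc_urls))
batches = list(batched(unique_urls, 3))

print([len(batch) for batch in batches])  # [3, 3, 1]
```

Each batch can then go through the same scrape loop shown earlier, with results written out per batch.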

A Decision Rule to Keep in Mind

Before you make an API call, ask two questions:

Do I know which URLs I need? Yes: use Scrape. No: keep going.
Do I need the content, or just the URL list? Content: use Crawl. Just URLs: use Map.
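
The two questions above fit in a few lines of code. This is only a restatement of the rule; the pick_api name and the string labels are made up for the example.

```python
def pick_api(know_urls: bool, need_content: bool) -> str:
    """Encode the two-question decision rule."""
    if know_urls:
        # Question 1: a known URL list always means Scrape
        return "scrape"
    # Question 2: unknown URLs, so choose by whether content is needed
    return "crawl" if need_content else "map"

print(pick_api(know_urls=True, need_content=True))    # scrape
print(pick_api(know_urls=False, need_content=True))   # crawl
print(pick_api(know_urls=False, need_content=False))  # map
```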

The worst pattern I see is people reaching for Crawl when they already know their URLs, or using Scrape one-by-one to build an index when Crawl would handle it in a single call. The second worst is using Crawl when Map plus a filter would get the same URL list at a fraction of the cost.

What to Build Next

If you are putting together a documentation RAG pipeline, the pattern I would use is: Map the docs domain, filter to relevant paths, batch-scrape those URLs with the Scrape API, chunk the markdown, embed it, and load into a vector store. Crawl is the shortcut version, but the Map-plus-Scrape approach gives you an explicit list of what went into your index, which matters when you need to refresh individual pages later.

If you are doing ongoing monitoring, Scrape on a schedule is almost always the right answer. Crawl is for one-time or periodic full ingestions.
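
For the monitoring case, the scheduled scrape only matters if you can tell whether the page changed. One simple approach, sketched here as an assumption rather than a built-in feature, is to hash the scraped markdown after normalizing whitespace and compare it with the previous run's hash.

```python
import hashlib

def content_fingerprint(markdown: str) -> str:
    """Hash the page text after collapsing whitespace, so reflowed text is not a change."""
    normalized = " ".join(markdown.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def has_changed(previous_fingerprint: str, markdown: str) -> bool:
    """Compare a stored fingerprint against freshly scraped content."""
    return previous_fingerprint != content_fingerprint(markdown)

baseline = content_fingerprint("Price: $10\n")

print(has_changed(baseline, "Price:   $10"))  # False, only whitespace differs
print(has_changed(baseline, "Price: $12"))    # True
```

Store the fingerprint alongside the URL and only re-embed or alert when has_changed returns True.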

The APIs are composable. Use them that way.

Browser Sessions: Stateful Web Automation Behind a CDP Connection

tokozen — Tue, 12 May 2026 06:26:04 +0000

Most web automation breaks the moment a site asks you to log in. Your scraper fetches the login page, posts credentials, and then... loses the session cookie on the next request because it spun up a fresh context. Or it hits a JS-rendered dashboard and gets back an empty <div id="app"></div>.

The root problem is statefulness. HTTP is stateless by design, and most scraping tools treat each request as isolated. But real user flows are not isolated. Logging in, clicking through a multi-step form, waiting for a WebSocket to push data, then reading the result: that is a single continuous session, and automating it requires a browser that remembers where it has been.

This is where Chrome DevTools Protocol (CDP) sessions come in.

What CDP Actually Gives You

CDP is the protocol Chrome and Chromium expose for external control. Tools like Playwright and Puppeteer sit on top of it. When you connect to a CDP endpoint, you get a persistent channel to a running browser instance. You can send commands (Page.navigate, Input.dispatchMouseEvent, Runtime.evaluate), subscribe to events (Network.responseReceived, Page.loadEventFired), and read the DOM at any point.

The key word is persistent. Unlike a one-shot headless request, a CDP session keeps the browser alive between your commands. Cookies stay. LocalStorage stays. Auth tokens stay. If the site uses a session cookie set after login, your next navigation carries that cookie automatically because it is in the same browser profile.

Anakin's Browser Sessions API exposes exactly this: a hosted, stateful CDP connection you can attach to without managing your own Chromium fleet. You get a wsEndpoint URL, connect with Playwright or Puppeteer, and the browser persists across your script.

The Login-Then-Scrape Pattern

Here is a concrete example. Say you need data from a SaaS dashboard that sits behind an email/password login. The page loads via React, so the data never appears in the initial HTML. You need to:

1. Navigate to the login page.
2. Fill and submit the form.
3. Wait for the dashboard to render.
4. Extract the data from the live DOM.

With a CDP-backed session, this is straightforward:

import asyncio
from playwright.async_api import async_playwright
import httpx

ANAKIN_API_KEY = "your_api_key"

async def scrape_dashboard():
    # Create a browser session via Anakin's API
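    async with httpx.AsyncClient() as client:
        # Use the async client so this request does not block the event loop
        response = await client.post(
            "https://api.anakin.ai/v1/browser-sessions",
            headers={"Authorization": f"Bearer {ANAKIN_API_KEY}"},
            json={"session_ttl": 300},  # 5-minute session
        )
    response.raise_for_status()
    session = response.json()
    ws_endpoint = session["wsEndpoint"]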

    async with async_playwright() as p:
        # Connect to the existing hosted browser instance
        browser = await p.chromium.connect_over_cdp(ws_endpoint)
        context = browser.contexts[0]
        page = await context.new_page()

        # Step 1: navigate to login
        await page.goto("https://example-saas.com/login")
        await page.wait_for_selector("input[name='email']")

        # Step 2: fill and submit
        await page.fill("input[name='email']", "user@example.com")
        await page.fill("input[name='password']", "s3cur3pass")
        await page.click("button[type='submit']")

        # Step 3: wait for dashboard content to render
        await page.wait_for_selector(".dashboard-metrics", timeout=10000)

        # Step 4: extract data from the live DOM
        metrics = await page.eval_on_selector_all(
            ".metric-card",
            "cards => cards.map(c => ({ label: c.querySelector('.label').textContent, value: c.querySelector('.value').textContent }))"
        )

        print(metrics)
        # [{'label': 'Monthly Revenue', 'value': '$42,180'}, ...]

        await browser.close()

asyncio.run(scrape_dashboard())

A few things worth noting here. connect_over_cdp attaches to the remote browser rather than launching a new one. The browser.contexts[0] line reuses the existing context, which means any cookies or storage the session already has are available. And wait_for_selector is doing real work: it is polling the DOM until the React component finishes rendering, which a static HTTP request would never see.

Handling the Awkward Edge Cases

Stateful sessions introduce problems that stateless scraping does not have.

Session expiry mid-flow. Sites time out inactive sessions, sometimes in minutes. If your script pauses between steps (say, you are processing data from step 3 before continuing), the site may have logged you out. The fix is to check for the presence of a login redirect at each navigation step, not just at the start.

2FA and CAPTCHAs. Some sites push a CAPTCHA or a 2FA prompt after login, especially on new IP addresses. CDP sessions let you intercept the page at that point and either route to a CAPTCHA-solving service or pause and wait for a human to complete the challenge, then resume programmatically. You keep the session alive while the human intervenes.

Memory and resource leaks. A long-running CDP session accumulates open pages, event listeners, and network logs. If you are running many sessions in parallel, close pages you are done with explicitly. Do not rely on garbage collection.

Flaky selectors. JS-heavy apps change their DOM structure on deploys. Prefer aria-label, data-testid, or visible text over deeply nested CSS selectors. They are less likely to break silently.
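
The advice about closing pages explicitly is easy to enforce with an async context manager. The sketch below relies only on context.new_page() and page.close(), which Playwright contexts and pages provide; the FakeContext and FakePage classes are stand-ins so the example runs without a browser, and managed_page is a hypothetical helper name.

```python
import asyncio
from contextlib import asynccontextmanager

@asynccontextmanager
async def managed_page(context):
    """Open a page and guarantee it is closed, even if the flow raises."""
    page = await context.new_page()
    try:
        yield page
    finally:
        await page.close()

# Stand-ins for a Playwright context and page, for demonstration only
class FakePage:
    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True

class FakeContext:
    def __init__(self):
        self.pages = []

    async def new_page(self):
        page = FakePage()
        self.pages.append(page)
        return page

async def demo() -> FakeContext:
    context = FakeContext()
    try:
        async with managed_page(context) as page:
            raise RuntimeError("simulated failure mid-flow")
    except RuntimeError:
        pass
    return context

context = asyncio.run(demo())
print(context.pages[0].closed)  # True, the page was closed despite the error
```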

One pattern that works well for multi-step flows: treat each logical step as a function that asserts it is on the right page before proceeding. If page.url (a property in Playwright's Python API, not a method) does not match what you expect, log the actual URL and screenshot, then raise. This turns silent failures into loud ones.

async def assert_on_page(page, expected_path: str):
    current = page.url
    if expected_path not in current:
        await page.screenshot(path="debug_screenshot.png")
        raise RuntimeError(f"Expected path '{expected_path}', got '{current}'")
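
The session-expiry case described earlier can be caught with a similar check run after every navigation. This sketch assumes the site redirects logged-out users to a path such as /login; the LOGIN_PATH_HINTS values are guesses you would replace with the target site's real login paths.

```python
from urllib.parse import urlparse

# Hypothetical login paths; adjust these to the site you are automating
LOGIN_PATH_HINTS = ("/login", "/signin", "/sign-in")

def was_logged_out(current_url: str) -> bool:
    """Return True if the current URL looks like a redirect to a login page."""
    path = urlparse(current_url).path.lower()
    return any(path.startswith(hint) for hint in LOGIN_PATH_HINTS)

print(was_logged_out("https://example-saas.com/login?next=%2Fdashboard"))  # True
print(was_logged_out("https://example-saas.com/dashboard"))                # False
```

Calling was_logged_out(page.url) after each goto or click gives you a place to re-run the login step instead of scraping a login form by accident.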

When to Reach for This vs. Simpler Tools

CDP-based sessions are heavier than a plain HTTP scraper. They consume more memory, have higher latency, and require more careful error handling. Use them when:

The target is behind authentication and session cookies matter.
The data is rendered client-side by JavaScript.
The flow involves multiple user interactions (clicks, form fills, file uploads).
You need to observe network traffic or intercept requests mid-flow.

If the data is in the initial HTML response, a standard scrape API call is faster and cheaper. CDP is for the cases where the simpler tool genuinely cannot reach the data.

The interesting direction from here is combining stateful browser sessions with structured extraction. Once you have a live, authenticated, fully-rendered page, you can pass the DOM to an LLM-backed extractor that reads the content semantically rather than with CSS selectors. That combination handles the "the DOM structure changes but the information is always there" problem, which is the last real reliability bottleneck in production scraping.

How Agentic Search Actually Works: The Research Loop Link-Fetching Agents Miss

tokozen — Fri, 08 May 2026 05:40:02 +0000

Most agent tutorials show you the same pattern: take a user query, call a search API, grab the top result, stuff the text into your prompt. Done. Ship it.

That works fine for trivia. It falls apart when the question requires synthesis across multiple sources, when the first result is a listicle with no substance, or when the answer depends on information that only shows up three clicks deep into a documentation site.

The difference between a link-fetching agent and a genuinely useful research agent is the loop. Let me show you what that loop actually looks like.

What a Naive Search Agent Does

A basic agent that "uses web search" usually does something like this:

Receive question
Run one search query
Take the first URL from results
Fetch that URL
Return whatever text comes back

The problem is that web search results are ranked for clicks, not for answer quality. The top result might be a vendor comparison page with affiliate links, or a forum thread where nobody answered the question, or a press release from three years ago.

Even if you grab five results instead of one, you're still making a single pass. You're not evaluating whether what you got actually answers the question.

The Research Loop

Agentic search works differently. The core idea is that the LLM drives the process iteratively, deciding at each step whether it has enough information or needs to dig further. The loop looks more like this:

1. Receive question
2. Generate a targeted search query (the LLM should write this, not just pass the user input verbatim)
3. Get search results as structured data (titles, URLs, snippets)
4. Decide which results are worth fetching based on snippets
5. Fetch selected pages, get clean text
6. Evaluate: does this actually answer the question? What's still missing?
7. If incomplete, generate a follow-up query targeting the gap
8. Repeat until confident or until a step budget is exhausted
9. Synthesize a final answer with sources

Steps 6 and 7 are where most agent implementations stop short. Without them, you have a retrieval tool, not a research agent.
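
To make the control flow concrete before bringing in any API, here is a skeleton of the loop with the search, fetch, and evaluate steps passed in as plain functions. Everything here is a stand-in: research_loop and the fake_* stubs are hypothetical, the "decide what to fetch" step is reduced to skipping URLs already read, and in a real agent the evaluate function would be an LLM call.

```python
from typing import Callable

def research_loop(
    question: str,
    search: Callable[[str], list[dict]],
    fetch: Callable[[str], str],
    evaluate: Callable[[str, list[str]], dict],
    max_steps: int = 4,
) -> dict:
    """Search, read, evaluate, and follow up until complete or out of budget."""
    if max_steps < 1:
        raise ValueError("max_steps must be at least 1")
    notes: list[str] = []
    sources: list[str] = []
    query = question
    steps_used = 0
    for step in range(1, max_steps + 1):
        steps_used = step
        for result in search(query):
            if result["url"] in sources:
                continue  # already read this page
            notes.append(fetch(result["url"]))
            sources.append(result["url"])
        verdict = evaluate(question, notes)  # the "what's still missing?" step
        if verdict["complete"]:
            break
        query = verdict["follow_up"]  # target the gap on the next pass
    return {"notes": notes, "sources": sources, "steps": steps_used}

# Stubs standing in for a search API, a scraper, and an LLM judge
PAGES = {
    "https://a.example/limits": "Tier 1 limit is 100 rpm.",
    "https://b.example/tiers": "Tier 2 limit is 500 rpm.",
}

def fake_search(query: str) -> list[dict]:
    if "tier 2" in query.lower():
        return [{"url": "https://b.example/tiers"}]
    return [{"url": "https://a.example/limits"}]

def fake_fetch(url: str) -> str:
    return PAGES[url]

def fake_evaluate(question: str, notes: list[str]) -> dict:
    have_both = any("Tier 1" in n for n in notes) and any("Tier 2" in n for n in notes)
    return {"complete": have_both, "follow_up": "tier 2 rate limit"}

outcome = research_loop("What are the rate limits by tier?", fake_search, fake_fetch, fake_evaluate)
print(outcome["steps"], outcome["sources"])
```

The first pass finds only the Tier 1 page, the evaluation flags the gap, and the follow-up query retrieves the Tier 2 page on the second pass.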

Here's a minimal Python implementation of this loop using Anakin's Agentic Search endpoint, which handles the iterative querying and source-grounded answering in one call, and their Scrape API for when you need to go deeper on a specific page:

import httpx
import json

ANAKIN_API_KEY = "your-api-key"

def agentic_search(question: str) -> dict:
    """
    Call Anakin's Agentic Search API. Returns an answer grounded in sources
    with citations, running the research loop server-side.
    """
    response = httpx.post(
        "https://api.anakin.ai/v1/agentic-search",
        headers={
            "Authorization": f"Bearer {ANAKIN_API_KEY}",
            "Content-Type": "application/json",
        },
        json={"query": question},
        timeout=60.0,
    )
    response.raise_for_status()
    return response.json()

def scrape_page(url: str) -> str:
    """
    Fetch clean text from a specific URL when the agent needs to go deeper.
    """
    response = httpx.post(
        "https://api.anakin.ai/v1/scrape",
        headers={
            "Authorization": f"Bearer {ANAKIN_API_KEY}",
            "Content-Type": "application/json",
        },
        json={"url": url, "format": "markdown"},
        timeout=30.0,
    )
    response.raise_for_status()
    return response.json().get("content", "")

def research_with_followup(question: str, context: str = "") -> str:
    """
    Run agentic search, then optionally scrape a specific source
    if the answer references something worth digging into further.
    """
    full_query = f"{question}\n\nContext: {context}" if context else question

    result = agentic_search(full_query)
    answer = result.get("answer", "")
    sources = result.get("sources", [])

    print(f"Initial answer:\n{answer}\n")
    print(f"Sources used: {len(sources)}")
    for s in sources:
        print(f"  - {s.get('title')} ({s.get('url')})")

    # If the answer mentions a specific doc or page worth reading fully,
    # you can scrape it and pass that content back for a deeper pass.
    # This sketch simply takes the first cited source, if it has a URL.
    top_source_url = sources[0].get("url") if sources else None
    if top_source_url:
        print(f"\nScraping top source for deeper context: {top_source_url}")
        full_text = scrape_page(top_source_url)

        # Now run a follow-up with the full page content as grounding
        followup_result = agentic_search(
            f"{question}\n\nFull text of primary source:\n{full_text[:4000]}"
        )
        return followup_result.get("answer", answer)

    return answer

if __name__ == "__main__":
    question = "What are the current rate limits for the OpenAI Assistants API and how do they differ by tier?"
    final_answer = research_with_followup(question)
    print(f"\nFinal answer:\n{final_answer}")

The key thing here is that the agentic search call is not just "search and return links." It's running its own internal loop: reformulating queries, following promising threads, discarding junk sources, and producing an answer with citations attached. Then the outer code can decide whether to go one level deeper on any cited source.

Why Snippets Are Not Enough for RAG

When you're building a RAG pipeline and you index search result snippets, you're indexing 150-character summaries written by search engines to generate clicks. Those snippets frequently omit the actual technical details: the exact configuration parameter, the version constraint, the exception to the rule that matters for your use case.

For factual retrieval tasks, this is fine. For technical research, it's a consistent source of hallucination. The LLM fills in the missing detail from training data, confidently, and wrong.

The fix is to fetch full page content for the sources that actually matter, get clean structured text (not raw HTML), and index that. When you integrate agentic search into a RAG pipeline, the output you want to embed is not the answer text. It's the source content the answer was grounded in, tagged with the query that surfaced it.

def build_rag_chunks(question: str) -> list[dict]:
    result = agentic_search(question)
    chunks = []
    for source in result.get("sources", []):
        full_text = scrape_page(source["url"])
        chunks.append({
            "url": source["url"],
            "title": source["title"],
            "content": full_text,
            "query": question,
        })
    return chunks

Now you have grounded, full-text chunks you can actually embed and retrieve later, not snippets.
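
Full pages are usually too long to embed as one unit, so a splitting step normally follows. Here is a minimal character-based sketch with overlap; the chunk_text helper and its default sizes are illustrative assumptions, and a production pipeline might split on headings or sentences instead.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into fixed-size windows that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks

print(chunk_text("abcdefghij", chunk_size=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```

Each piece would carry the same url, title, and query fields as its parent page so retrieval results can still be traced back to a source.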

Where to Go From Here

The research loop is not complicated conceptually, but it has real operational costs: latency goes up, token usage goes up, and you need a budget strategy so agents don't spin forever on hard questions. A step limit of 3 to 5 iterations covers most real-world queries without runaway costs.

If I were building this for production, I'd add a confidence score threshold to the evaluation step so the loop exits early when the answer quality is already high, rather than always burning the full budget. I'd also log every query and source fetched so I can audit where the agent went wrong when users report bad answers, because they will.
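
A rough sketch of those two additions, the early-exit rule and the audit log, might look like this. The should_stop function, the ResearchTrace class, and the 0.8 threshold are all hypothetical; how you obtain a confidence score (for example, asking the LLM to grade its own answer) is left open.

```python
import json
from dataclasses import dataclass, field

def should_stop(confidence: float, step: int, threshold: float = 0.8, max_steps: int = 5) -> bool:
    """Exit early on a confident answer, or when the step budget is spent."""
    return confidence >= threshold or step >= max_steps

@dataclass
class ResearchTrace:
    """Append-only record of what the agent searched for and read."""
    question: str
    events: list = field(default_factory=list)

    def record(self, step: int, query: str, urls: list, confidence: float) -> None:
        self.events.append(
            {"step": step, "query": query, "urls": urls, "confidence": confidence}
        )

    def to_json_lines(self) -> str:
        # One JSON object per line is easy to grep when a user reports a bad answer
        return "\n".join(
            json.dumps({"question": self.question, **event}) for event in self.events
        )

trace = ResearchTrace("What are the rate limits by tier?")
trace.record(1, "rate limits by tier", ["https://a.example/limits"], 0.4)
trace.record(2, "tier 2 rate limit", ["https://b.example/tiers"], 0.9)

print(should_stop(0.4, step=1))  # False, keep digging
print(should_stop(0.9, step=2))  # True, confident enough to stop early
print(trace.to_json_lines())
```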

The link-fetching approach feels like research. The loop actually does it.

Building a RAG Pipeline That Stays Fresh with Live Web Data

tokozen — Tue, 05 May 2026 06:02:44 +0000

You build a RAG pipeline, embed your documents, stand up a vector store, and it works great. Then three months later, users start complaining that the answers are wrong. Your product pricing changed. A regulation was updated. A library released a breaking version. The documents you indexed at setup time are now lying to your users.

The fix is not to re-index more aggressively. The fix is to stop treating the web as a one-time data source and start treating it as a live feed that your pipeline can query at retrieval time.

Here is how to wire that up.

The Core Problem with Static RAG

Standard RAG looks like this: ingest documents, chunk them, embed them, store vectors, retrieve on query, generate. Every step happens at ingest time except retrieval and generation. That is fine for a corporate knowledge base with a weekly update cycle. It breaks down when:

You are answering questions about prices, regulations, or news
Your users ask about things that happened after your last ingest
The authoritative source is a website that updates continuously

The mental model shift is to treat retrieval as a two-stage process. First, check your local vector store for stable reference content (your docs, FAQs, internal data). Second, run a live web query for anything time-sensitive and merge those results into the context window before generation.

What the Pipeline Looks Like

Here is the revised flow:

User sends a query
Classify the query: does it need fresh data?
If yes, fire a web search, scrape the top results, pull clean text
Combine local retrieval results with fresh web content
Pack both into the prompt context and generate

The classification step does not need to be fancy. A simple heuristic works: if the query contains words like "current", "latest", "today", "now", "price", or a named entity that changes frequently, route it to the live path. You can also ask the LLM itself with a one-shot classifier prompt.

The scraping step is where most implementations fall apart. Raw HTML is terrible context. You want clean, structured text. Anakin's Scrape API handles this well: you give it a URL and it returns the page content as clean markdown or plain text, stripping nav, ads, and boilerplate. That matters a lot because every token of garbage HTML in your context window is a token not used for actual reasoning.

A Concrete Implementation

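import os
import re
import requests
from openai import OpenAI

ANAKIN_API_KEY = os.environ["ANAKIN_API_KEY"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

client = OpenAI(api_key=OPENAI_API_KEY)

def needs_fresh_data(query: str) -> bool:
    # Hard-coded years go stale; update them or derive them from the current date
    trigger_words = {"current", "latest", "today", "now", "price", "prices", "recent", "2024", "2025"}
    # Match whole words so "now" does not fire on "know" or "snow"
    words = set(re.findall(r"[a-z0-9]+", query.lower()))
    return not trigger_words.isdisjoint(words)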

def web_search(query: str, num_results: int = 3) -> list[dict]:
    resp = requests.get(
        "https://api.anakin.ai/v1/search",
        headers={"Authorization": f"Bearer {ANAKIN_API_KEY}"},
        params={"q": query, "limit": num_results},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("results", [])

def scrape_url(url: str) -> str:
    resp = requests.post(
        "https://api.anakin.ai/v1/scrape",
        headers={"Authorization": f"Bearer {ANAKIN_API_KEY}", "Content-Type": "application/json"},
        json={"url": url, "format": "markdown"},
        timeout=15,
    )
    if resp.status_code != 200:
        return ""
    return resp.json().get("content", "")[:3000]  # cap characters per source

def build_context(query: str, local_chunks: list[str]) -> str:
    sections = []

    if local_chunks:
        sections.append("## Internal Knowledge\n" + "\n\n".join(local_chunks))

    if needs_fresh_data(query):
        search_results = web_search(query)
        web_sections = []
        for result in search_results:
            url = result.get("url", "")
            title = result.get("title", url)
            content = scrape_url(url)
            if content:
                web_sections.append(f"### {title}\nSource: {url}\n\n{content}")
        if web_sections:
            sections.append("## Live Web Results\n" + "\n\n".join(web_sections))

    return "\n\n".join(sections)

def answer(query: str, local_chunks: list[str]) -> str:
    context = build_context(query, local_chunks)
    system_prompt = (
        "You are a helpful assistant. Answer the user's question using the provided context. "
        "If you cite information from a web source, mention the source URL. "
        "If the context does not contain enough information, say so clearly."
    )
    user_message = f"Context:\n{context}\n\nQuestion: {query}"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content

# Example usage
local_knowledge = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Enterprise plans include SSO and audit logs.",
]
query = "What is the current price of GPT-4o API calls?"
print(answer(query, local_knowledge))

A few things worth noting in this code:

The scrape_url function caps content at 3000 characters per source. You will want to tune this based on how many sources you pull and what your context window budget looks like. With three sources at 3000 chars each plus your local chunks, you are comfortably under 16k tokens.
needs_fresh_data is intentionally simple. In production you would replace this with a proper classifier or even a quick LLM call with a boolean output schema.
The system prompt tells the model to cite sources when using web content. This matters for user trust and for debugging when something goes wrong.
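
The context budget claim is easy to sanity-check with back-of-envelope arithmetic. The sketch below assumes roughly 4 characters per token, a common rule of thumb for English text rather than an exact figure; for real budgeting you would count with the model's tokenizer.

```python
CHARS_PER_TOKEN = 4  # assumption: rough average for English text

def estimate_tokens(num_chars: int) -> int:
    """Round up so the estimate errs on the high side."""
    return -(-num_chars // CHARS_PER_TOKEN)

local_chunks = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Enterprise plans include SSO and audit logs.",
]

web_chars = 3 * 3000  # three sources, each capped at 3000 characters
local_chars = sum(len(chunk) for chunk in local_chunks)

web_tokens = estimate_tokens(web_chars)
total_tokens = estimate_tokens(web_chars + local_chars)

print(web_tokens)             # 2250
print(total_tokens < 16_000)  # True
```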

Keeping Latency Manageable

The obvious concern with adding live web requests to a RAG pipeline is latency. A search call plus three scrape calls adds up. A few approaches that help:

Run search and scrape in parallel using asyncio or concurrent.futures. The three scrape calls should happen simultaneously, not serially.
Cache search results for identical or near-identical queries with a short TTL, maybe 5 to 15 minutes depending on how time-sensitive your domain is.
Set aggressive timeouts. A scrape that takes more than 10 seconds is probably not worth waiting for. Fail gracefully and use whatever you got.
Limit scraping to the top two results rather than five. Marginal sources past the second result rarely change the answer quality, but they do add latency.
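
The caching idea from the list above can be sketched as a small in-memory TTL cache keyed on a normalized query. TTLCache and normalize_query are hypothetical helpers, and the injectable clock exists only so the expiry logic can be exercised without waiting; in a multi-process deployment you would more likely use a shared store with built-in expiry.

```python
import time

def normalize_query(query: str) -> str:
    """Lowercase and collapse whitespace so trivially different queries share a key."""
    return " ".join(query.lower().split())

class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}

    def put(self, key: str, value) -> None:
        self._entries[key] = (self._clock(), value)

    def get(self, key: str):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[key]  # expired, drop it
            return None
        return value

# A fake clock makes the expiry behavior visible without sleeping
now = [0.0]
cache = TTLCache(ttl_seconds=600, clock=lambda: now[0])
cache.put(normalize_query("Current  GPT-4o price"), ["result-1", "result-2"])

now[0] = 300.0
fresh_hit = cache.get(normalize_query("current gpt-4o price"))
now[0] = 601.0
stale_hit = cache.get(normalize_query("current gpt-4o price"))

print(fresh_hit)  # ['result-1', 'result-2']
print(stale_hit)  # None
```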

With parallelism and reasonable timeouts, you can usually keep the live web path under two seconds of added latency for the retrieval step.
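
Here is one way the parallel-scrape-with-deadline idea could look using concurrent.futures from the standard library. The scrape_all helper is hypothetical and takes the scrape function as a parameter, so it works with the scrape_url function above or with a stub as shown. One caveat: Python threads cannot be killed, so a straggler keeps running in the background after the deadline; its result is simply ignored.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait
from typing import Callable

def scrape_all(urls: list[str], scrape: Callable[[str], str], deadline_seconds: float = 10.0) -> dict:
    """Scrape URLs concurrently and keep whatever finished before the deadline."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(urls)))
    futures = {pool.submit(scrape, url): url for url in urls}
    done, _not_done = wait(futures, timeout=deadline_seconds)
    # Do not block on stragglers; cancel anything that has not started yet
    pool.shutdown(wait=False, cancel_futures=True)
    results = {}
    for future in done:
        # Skip scrapes that raised or returned empty content
        if future.exception() is None and future.result():
            results[futures[future]] = future.result()
    return results

# Stub scraper: one URL is deliberately slower than the deadline
def fake_scrape(url: str) -> str:
    if "slow" in url:
        time.sleep(0.5)
    return f"content of {url}"

results = scrape_all(
    ["https://example.com/a", "https://example.com/slow", "https://example.com/b"],
    fake_scrape,
    deadline_seconds=0.2,
)
print(sorted(results))  # ['https://example.com/a', 'https://example.com/b']
```

The total wait is bounded by the deadline rather than by the sum of the individual request times, which is what keeps the live path's added latency predictable.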

Where to Go from Here

The pattern above handles the common case: detect, search, scrape, merge. But there is a more interesting direction when you need depth rather than breadth. Anakin's Agentic Search does the research loop for you, returning a synthesized answer with sources rather than raw search results. That can be a better fit when the query needs multi-step reasoning across sources rather than a simple "what does this page say."

The thing I would tackle next is making the freshness classifier smarter. Right now it is a keyword list. A better version would use embeddings to detect semantic similarity to a set of "time-sensitive topics" defined for your specific domain. That gets you higher precision without much more code.
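
The shape of that embedding-based classifier is roughly the following. The vectors here are tiny hand-made stand-ins; in practice they would come from whatever embedding model you already use, and the 0.75 threshold is an arbitrary starting point to tune on your own queries.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def needs_fresh_data_semantic(
    query_vec: list[float],
    topic_vecs: list[list[float]],
    threshold: float = 0.75,
) -> bool:
    """Route to the live path if the query is close to any time-sensitive topic."""
    best = max((cosine(query_vec, topic) for topic in topic_vecs), default=0.0)
    return best >= threshold

# Toy 3-dimensional "embeddings" standing in for real model output
time_sensitive_topics = [
    [1.0, 0.0, 0.0],  # e.g. "pricing"
    [0.8, 0.6, 0.0],  # e.g. "release notes"
]
pricing_query = [0.9, 0.1, 0.0]
refund_policy_query = [0.0, 0.1, 1.0]

print(needs_fresh_data_semantic(pricing_query, time_sensitive_topics))        # True
print(needs_fresh_data_semantic(refund_policy_query, time_sensitive_topics))  # False
```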

Static RAG is a starting point, not a destination. The web is your database. Treat retrieval accordingly.

Anti-bot without the arms race: what Camoufox does differently

tokozen — Tue, 28 Apr 2026 06:18:46 +0000

You spin up Playwright, navigate to a site, and get a CAPTCHA or a 403 before the page even loads. Your request headers were not the problem. The site fingerprinted your browser.

This is the core problem with headless automation: most anti-bot systems do not just check whether you look like a bot by your behavior. They check what your browser is at a runtime level, things like the values returned by navigator.webdriver, the shape of the AudioContext API, your canvas rendering hash, font metrics, screen geometry, and dozens of other signals that are consistent across real user sessions but drift or become obviously synthetic in headless environments.

The typical response is a cat-and-mouse loop: someone finds a leak, writes a patch in JavaScript, the detection vendors update their fingerprinting logic, repeat. Camoufox takes a different approach.

What Camoufox actually patches

Camoufox is a modified Firefox build. The patches live at the C++ and Rust level inside the browser engine, not in a JavaScript shim injected at runtime. That distinction matters because JS-layer patches can be detected: a site can inspect whether navigator.__proto__ has been tampered with, check timing consistency between API calls, or look for the characteristic fingerprint of Object.defineProperty overrides. Engine-level patches do not leave those artifacts because the values come from the compiled binary itself.

The specific things Camoufox addresses include:

Canvas fingerprinting: real browsers produce slightly different canvas renders depending on GPU driver and OS. Headless Chrome on a Linux server produces a consistent, recognizable hash. Camoufox injects controlled noise at the rendering level so hashes vary realistically across sessions.
WebGL vendor strings: these normally reflect actual hardware. In a container you get Mesa software rendering strings, which are a strong signal. Camoufox spoofs these to plausible hardware values.
Screen and window geometry: headless browsers often report a viewport that does not match any screen, or a screen size of 0x0. Camoufox lets you configure these to match common real-world resolutions.
navigator.webdriver: obviously the first thing everyone patches, but Camoufox also handles the surrounding properties that tend to be absent or misshapen in automation contexts.
Font enumeration: the set of fonts available in a container differs from a desktop OS. Camoufox allows configuring which fonts are reported.

Because these patches live in the engine, you get them without writing any JavaScript injection code yourself.

Using Camoufox from Python

Camoufox ships a Python wrapper that handles launching the patched browser and configuring fingerprint parameters. Here is a minimal working example that loads a page and dumps the fingerprint values a detection library would see:

import asyncio
from browserforge.fingerprints import Screen
from camoufox.async_api import AsyncCamoufox

async def main():
    async with AsyncCamoufox(
        headless=True,
        os="windows",           # target OS persona
        # screen takes a browserforge Screen constraint rather than a plain dict
        screen=Screen(max_width=1920, max_height=1080),
        fonts=["Arial", "Times New Roman", "Calibri"],
    ) as browser:
        page = await browser.new_page()

        # Navigate to a fingerprint audit page
        await page.goto("https://abrahamjuliot.github.io/creepjs/")

        # Grab some of the values the page reports back
        webdriver_flag = await page.evaluate("navigator.webdriver")
        canvas_hash = await page.evaluate("""
            () => {
                const c = document.createElement('canvas');
                const ctx = c.getContext('2d');
                ctx.fillStyle = 'red';
                ctx.fillRect(0, 0, 50, 50);
                return c.toDataURL().slice(-20);
            }
        """)
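        webgl_vendor = await page.evaluate("""
            () => {
                const gl = document.createElement('canvas').getContext('webgl');
                if (!gl) return null;  // WebGL can be unavailable in some headless setups
                const ext = gl.getExtension('WEBGL_debug_renderer_info');
                // Fall back to the masked vendor if the debug extension is missing
                return ext ? gl.getParameter(ext.UNMASKED_VENDOR_WEBGL) : gl.getParameter(gl.VENDOR);
            }
        """)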

        print(f"webdriver: {webdriver_flag}")
        print(f"canvas tail hash: {canvas_hash}")
        print(f"webgl vendor: {webgl_vendor}")

        await page.close()

asyncio.run(main())

On a clean Playwright Chromium install in a container you would typically see webdriver: True and a software-renderer vendor string. With Camoufox configured as above, webdriver should print as False (or None if the property is undefined), alongside a spoofed GPU vendor.

The os parameter tells Camoufox which persona to build. Choosing "windows" means the font list, screen geometry defaults, and some navigator properties are calibrated to a typical Windows desktop profile. You can also use "macos" or "linux".

Where this does and does not help

Camoufox handles the static fingerprint layer well. It is harder to fool behavioral analysis, which looks at things like mouse movement entropy, scroll patterns, timing between interactions, and whether you click precisely in the center of every button.

If you are doing research scraping or building data pipelines where you control the interaction flow, you can work around behavioral heuristics by adding realistic delays and some randomization to your action sequences. Camoufox does not do this for you, but it removes the browser-layer signals so behavioral analysis is the only remaining vector.
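A minimal sketch of what "realistic delays and some randomization" could look like in your own action sequences. None of this is part of Camoufox: the function names, the log-normal distribution, and the default values are assumptions picked for illustration, and you would tune them against the site you are working with.

```python
import random
from typing import Optional


def action_delays(n_actions: int, base: float = 0.8, jitter: float = 0.5,
                  seed: Optional[int] = None) -> list:
    """Return n_actions pause lengths in seconds, randomized around `base`."""
    rng = random.Random(seed)
    delays = []
    for _ in range(n_actions):
        # Log-normal noise: mostly short pauses, with an occasional long one
        pause = base * rng.lognormvariate(0.0, jitter)
        delays.append(round(max(0.05, pause), 3))
    return delays


def click_point(width: float, height: float, rng: random.Random) -> tuple:
    """Pick a point inside an element's box, avoiding the exact center every time."""
    x = rng.uniform(0.2, 0.8) * width
    y = rng.uniform(0.2, 0.8) * height
    return (round(x, 1), round(y, 1))


rng = random.Random(7)
for pause in action_delays(3, seed=7):
    # In a real run you would sleep for `pause` seconds before the next action
    print(f"wait {pause}s, then click at {click_point(120, 40, rng)}")
```

Seeding is only there to make runs reproducible while you debug. In production you would leave the seed unset so no two sessions share a timing pattern.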

There is also the question of IP reputation. A perfectly spoofed browser fingerprint coming from a datacenter IP that has been seen making thousands of requests still gets blocked at the network layer. Pairing Camoufox with residential or mobile proxies gives you coverage at both layers.

For sites that use session-based auth and require you to maintain cookies across requests, Camoufox supports persistent contexts the same way Playwright does. If you are building something that needs to stay logged in across scraping runs, you can serialize the browser storage state to disk and restore it on the next run, same API.

What I would actually try next

If you are hitting detection walls and have already tried header spoofing and basic evasion without success, Camoufox is worth running against a fingerprinting audit tool like CreepJS or BrowserLeaks before pointing it at your target. See what signals still leak. The audit pages give you a structured view of exactly which APIs are returning anomalous values.

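Engine-level patching is not magic, but it is a more durable foundation than stacking JS overrides. When detection vendors update their fingerprinting logic, JS overrides tend to break quickly because the new checks target the override mechanism itself. A C++ patch to the canvas rendering pipeline leaves no such mechanism to target, although it can still fall behind if vendors start probing a signal the patch set does not cover.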

The source is on GitHub under the Mozilla Public License 2.0. If you are doing this at scale and need something customized, the patch set is auditable and the build process is documented.

Authenticated Scraping: Why Session Persistence Matters

tokozen — Fri, 24 Apr 2026 05:54:58 +0000

You write a scraper that logs in, grabs a token, and then makes a request to a protected endpoint. It works once. You run it again five minutes later and get a 401. You add a retry. It works. Then it stops working entirely when the site starts fingerprinting your requests.

This is the session persistence problem, and it trips up a lot of scrapers that handle auth in a simplified way.

What "Authenticated Scraping" Actually Involves

Most people treat authentication as a one-time step: POST credentials, receive a cookie or token, attach it to subsequent requests. That model works fine for simple REST APIs. For real web applications, especially ones built for human users, it falls apart quickly.

Here's what actually happens when a person logs into a web app:

The browser sends credentials via a form POST.
The server sets a session cookie (often HttpOnly, Secure, SameSite=Lax).
Subsequent requests carry that cookie automatically.
The server may also rotate the cookie value on each request, or issue a CSRF token that must accompany state-changing requests.
If the browser goes quiet for too long, the session expires. The server might also tie the session to IP, User-Agent, or a device fingerprint.

A naive scraper might grab the initial cookie and replay it indefinitely. But if the site rotates session tokens, replaying an old value causes an immediate logout or a redirect to /login. If the site checks the User-Agent or Accept-Language headers for consistency, a mismatch triggers a challenge page.

Session persistence means maintaining the full browser-like state across requests: cookies, headers, timing, and sometimes JavaScript execution for SPAs that build auth state on the client side.

Where Naive Implementations Break

Here are the specific failure modes I see most often:

Cookie expiry without refresh. Sessions have TTLs. If your scraper sleeps for an hour between page fetches, the session may be dead when it wakes up. You need logic to detect a redirect to a login page (watch for Location: /login in a 302, or a 200 response whose URL or body content signals you've been kicked out) and re-authenticate.

CSRF token rotation. Some apps embed a CSRF token in the page HTML and require it on every POST. If you cache the token from login and reuse it, the second POST fails. You need to parse the current page's token before each state-changing request.

Missing cookies from redirects. Module-level calls such as requests.get() follow redirects by default, but each call builds a throwaway session: cookies set along the redirect chain are used for that one call and then discarded, so your next request goes out without them. A requests.Session() object stores cookies across all requests in the session, which is what you want. Forgetting to use a session object is a common source of intermittent auth failures.

JavaScript-gated auth flows. OAuth and SSO flows often rely on JavaScript to exchange tokens, redirect, and set cookies. An HTTP-only client never executes that code, so it never completes the handshake. You need a real browser for these.
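The expiry check described in the first failure mode can be sketched as a small pure function that works on values any HTTP client exposes. The login paths, the password-field heuristic, and the function name are assumptions for illustration; real sites will need their own markers.

```python
from urllib.parse import urlparse

LOGIN_PATHS = ("/login", "/signin")  # assumed paths; adjust per site
REDIRECT_CODES = (301, 302, 303, 307, 308)


def looks_logged_out(status: int, final_url: str, location: str = "", body: str = "") -> bool:
    """Heuristically decide whether a response means the session is dead."""
    if status == 401:
        return True
    if status in REDIRECT_CODES:
        # Unfollowed redirect: inspect where the server is sending us
        return urlparse(location).path.startswith(LOGIN_PATHS)
    if urlparse(final_url).path.startswith(LOGIN_PATHS):
        # Redirect was followed and we landed on the login page
        return True
    # A 200 that renders a login form in place of the content we asked for
    return 'type="password"' in body.lower()


print(looks_logged_out(302, "https://example.com/dashboard", location="/login?next=/dashboard"))
print(looks_logged_out(200, "https://example.com/dashboard/data", body='{"rows": []}'))
```

The body check is the weakest part: a legitimate "change password" page would also match, so on a real site you would look for something more specific, such as the login form's id.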

A Concrete Example with Session Handling

Here's a Python example using requests.Session to log in, detect session expiry, and re-authenticate transparently:

import requests
from bs4 import BeautifulSoup

LOGIN_URL = "https://example.com/login"
PROTECTED_URL = "https://example.com/dashboard/data"
CREDENTIALS = {"username": "user@example.com", "password": "hunter2"}

def get_csrf_token(session, url):
    resp = session.get(url)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    token_input = soup.find("input", {"name": "csrf_token"})
    if not token_input:
        raise ValueError("CSRF token not found on page")
    return token_input["value"]

def login(session):
    csrf = get_csrf_token(session, LOGIN_URL)
    payload = {**CREDENTIALS, "csrf_token": csrf}
    resp = session.post(LOGIN_URL, data=payload, allow_redirects=True)
    resp.raise_for_status()
    # Confirm we're actually logged in, not silently redirected back
    if "dashboard" not in resp.url:
        raise RuntimeError(f"Login failed, landed at: {resp.url}")
    print("Logged in successfully")

def fetch_protected(session):
    resp = session.get(PROTECTED_URL)
    # Detect silent redirect to login page
    if "login" in resp.url or resp.status_code == 401:
        print("Session expired, re-authenticating...")
        login(session)
        resp = session.get(PROTECTED_URL)
    resp.raise_for_status()
    return resp.json()

with requests.Session() as s:
    s.headers.update({
        "User-Agent": "Mozilla/5.0 (compatible; MyCrawler/1.0)",
        "Accept-Language": "en-US,en;q=0.9",
    })
    login(s)
    data = fetch_protected(s)
    print(data)

A few things worth noting here. The requests.Session object persists cookies across all calls automatically, including cookies set on redirects. The CSRF token is fetched fresh from the login page before each login attempt. After the POST, we check resp.url rather than the status code alone, because many apps return a 200 with the login form when credentials are wrong. The fetch_protected function detects a stale session by inspecting the final URL after any redirects.

This handles most form-based auth flows. It does not handle JavaScript-rendered auth, multi-factor prompts, or fingerprint-based bot detection.

When You Need a Real Browser

Some situations require actual browser automation:

OAuth flows that rely on JavaScript redirects and postMessage between frames.
Apps using WebAuthn or device-bound credentials.
Sites that serve a blank HTML shell and populate auth state entirely via JavaScript.
Anti-bot systems (Cloudflare, PerimeterX, DataDome) that run behavioral challenges.

For these cases, a stateful browser session over CDP (Chrome DevTools Protocol) gives you a real browser context with proper JavaScript execution, cookie storage, and behavioral signals. You log in once through the browser, then reuse that browser context for subsequent requests. The session state (cookies, local storage, IndexedDB entries) persists exactly as a human user's would.

Anakin's Browser Sessions product works this way: you get a CDP-accessible browser instance where you can authenticate interactively and then hand off the live session to automated scraping. For sites with particularly aggressive bot detection, this is often the only path that works reliably.

What I'd Do Next

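If you're building something that needs to scrape behind a login wall, start by mapping exactly what the login flow does. Open DevTools, go to the Network tab, enable Preserve log so redirects do not wipe the list, and walk through the login manually. Look at document requests as well as Fetch/XHR, because a classic form POST shows up as a document navigation rather than an XHR. Note every cookie that gets set, every redirect, every token in the request or response body.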

Then decide: can an HTTP client replay this, or does it require JavaScript execution? Most traditional web apps can be handled with a session-aware HTTP client and careful cookie and CSRF management. SPAs and OAuth-heavy flows usually need a real browser.

The session expiry detection logic is the part most people skip initially and then scramble to add later. Build it in from the start. A scraper that silently returns stale or wrong data because its session expired is harder to debug than one that fails loudly with a re-authentication attempt.

Scrape vs Crawl vs Map: Picking the Right Anakin API for the Job

tokozen — Tue, 21 Apr 2026 05:39:58 +0000

You have a website you need data from. You open the docs, see three APIs that all sound like they touch web pages, and have to make a choice. Scrape, Crawl, Map. The names feel intuitive until you actually need to pick one.

This article breaks down what each one does, where it fits, and what the failure modes look like if you reach for the wrong one.

What Each API Actually Does

Scrape takes a single URL and returns clean, structured content from that page. You get the main text, metadata, possibly tables or links, depending on how you configure it. One request in, one page out. It is the right tool when you already know exactly which page holds the data you want.

Crawl starts at a URL and follows links, returning content from every page it visits up to some depth or page limit. You use it when the data you want is spread across a site and you do not know the exact URLs ahead of time. Think documentation sites, blog archives, or any site where the index page links to the content pages.

Map does not scrape content at all. It traverses a domain and returns every discoverable URL, nothing more. No page content, just a list of addresses. It is fast because it only needs to find links, not render and parse each page.

Here is a quick decision table:

You want...	Use
The content of one specific page	Scrape
The content of many pages on a site	Crawl
A list of all URLs on a site	Map
URLs first, then selectively scrape some	Map + Scrape

A Concrete Example: Building a RAG Knowledge Base

Say you are building a RAG pipeline over a software product's documentation. The docs live at docs.example.com. You want to ingest every page.

The wrong instinct here is to jump straight to Crawl. Before you do that, run Map to understand what you are dealing with.

import requests

ANAKIN_API_KEY = "your_api_key"

# Step 1: Map the docs site to get all URLs
map_response = requests.post(
    "https://api.anakin.ai/v1/map",
    headers={"Authorization": f"Bearer {ANAKIN_API_KEY}"},
    json={"url": "https://docs.example.com"},
    timeout=60,
)
map_response.raise_for_status()  # fail loudly instead of parsing an error body

urls = map_response.json().get("urls", [])
print(f"Found {len(urls)} URLs")

# Filter out non-content pages
content_urls = [
    u for u in urls
    if not any(skip in u for skip in ["/changelog", "/search", "/404"])
]
print(f"Scraping {len(content_urls)} content pages")

# Step 2: Scrape each page individually
results = []
for url in content_urls[:10]:  # test with first 10
    scrape_response = requests.post(
        "https://api.anakin.ai/v1/scrape",
        headers={"Authorization": f"Bearer {ANAKIN_API_KEY}"},
        json={"url": url, "formats": ["markdown"]},
        timeout=60,
    )
    scrape_response.raise_for_status()  # skip silent failures on individual pages
    data = scrape_response.json()
    results.append({
        "url": url,
        "title": data.get("title"),
        "content": data.get("markdown")
    })

print(f"Scraped {len(results)} pages")

This pattern gives you control. You can filter URLs before scraping, prioritize certain sections, and avoid wasting API calls on pages that will not add anything useful to your index (changelog entries, auto-generated search pages, and so on).
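One caveat with the substring filter in the example: "/search" also matches a page like /guide/search-tips. A slightly stricter sketch, matching on path prefixes and dropping duplicates that differ only by trailing slash or fragment, might look like this. The function name and the default prefixes are assumptions for illustration.

```python
from urllib.parse import urlparse


def filter_content_urls(urls, skip_prefixes=("/changelog", "/search", "/404")):
    """Keep URLs whose path is not under a skipped section, without duplicates."""
    seen = set()
    kept = []
    for url in urls:
        parsed = urlparse(url)
        path = parsed.path.rstrip("/") or "/"
        # Match whole path segments, so "/search" does not hit "/guide/search-tips"
        if any(path == p or path.startswith(p + "/") for p in skip_prefixes):
            continue
        key = (parsed.netloc.lower(), path, parsed.query)  # fragment is ignored
        if key in seen:
            continue
        seen.add(key)
        kept.append(url)
    return kept


mapped = [
    "https://docs.example.com/guide/install",
    "https://docs.example.com/guide/install/",
    "https://docs.example.com/changelog/v2",
    "https://docs.example.com/guide/search-tips",
    "https://docs.example.com/search?q=x",
    "https://docs.example.com/guide/install#step-2",
]
print(filter_content_urls(mapped))
# → ['https://docs.example.com/guide/install', 'https://docs.example.com/guide/search-tips']
```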

If you had used Crawl directly, you would have gotten all of that content automatically, but you would have less visibility into what was included until after the fact.

When Crawl Is the Right Choice

Crawl makes sense when you want comprehensive coverage and do not need to filter ahead of time. A few good cases:

Ingesting a blog where every post is worth including
Archiving a site before a migration
Building a competitive analysis tool that needs the full text of a competitor's site

The thing to watch with Crawl is depth configuration. A site that looks small can have thousands of pages once you follow pagination, tag pages, and user-profile URLs. Set a page limit and check the shape of what comes back before running it at full scale.
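To see why both limits matter, here is a generic breadth-first traversal over an in-memory link graph. It is a sketch of the general technique, not a description of how Anakin's Crawl is implemented, and the site graph, function name, and parameter names are made up for the example.

```python
from collections import deque


def bounded_crawl_order(links: dict, start: str, max_depth: int, max_pages: int) -> list:
    """Return pages in the order a breadth-first crawl would visit them."""
    visited = {start}
    order = []
    queue = deque([(start, 0)])
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # depth limit reached: do not follow this page's links
        for nxt in links.get(url, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, depth + 1))
    return order


site = {
    "/": ["/blog", "/tags"],
    "/blog": ["/blog/post-1", "/blog/post-2", "/blog?page=2"],
    "/tags": ["/tags/python", "/tags/scraping"],
    "/blog?page=2": ["/blog/post-3", "/blog?page=3"],
}

print(bounded_crawl_order(site, "/", max_depth=1, max_pages=10))
# → ['/', '/blog', '/tags']
print(len(bounded_crawl_order(site, "/", max_depth=2, max_pages=10)))
# → 8
print(bounded_crawl_order(site, "/", max_depth=5, max_pages=4))
# → ['/', '/blog', '/tags', '/blog/post-1']
```

Going from depth 1 to depth 2 nearly triples the page count on this tiny graph, mostly from tag and pagination links. The page limit is the backstop for when the depth estimate turns out to be wrong.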

Crawl also handles the link-following logic for you. If a docs site has a sidebar with 200 links and you would have to manually extract and deduplicate them, Crawl saves you that work. Map does the same thing for URL discovery, but without the content.

When to Combine Map and Scrape

There is a pattern that shows up in more sophisticated pipelines: Map to discover, filter in code, then Scrape selectively. This is the approach in the example above.

It adds a step, but it gives you a clean separation between "what exists on this site" and "what I actually want." That matters when:

The site has sections you want to exclude (API reference pages that are too terse to be useful in a RAG context, for example)
You want to check which URLs you have already indexed before scraping again
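You are building an incremental update system where you only scrape pages you have not indexed yet

For the incremental case, you would run Map periodically to detect new URLs, compare against your stored URL list, and then Scrape only the new ones. Map alone will not tell you that an existing page was edited; for that you need a separate signal, such as a content hash from an occasional re-scrape. That is still a lot cheaper than re-crawling the whole site every time.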
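The comparison step is plain set arithmetic. A minimal sketch, assuming you keep the previously indexed URLs somewhere and have a fresh list from Map; the function and key names are hypothetical:

```python
def plan_incremental_scrape(stored_urls, mapped_urls) -> dict:
    """Compare the URLs already indexed with the latest Map result."""
    stored, mapped = set(stored_urls), set(mapped_urls)
    return {
        "to_scrape": sorted(mapped - stored),   # new since the last run
        "to_retire": sorted(stored - mapped),   # no longer discoverable
        "unchanged": sorted(stored & mapped),   # already indexed, URL still exists
    }


stored = ["https://docs.example.com/a", "https://docs.example.com/b"]
mapped = ["https://docs.example.com/b", "https://docs.example.com/c"]

plan = plan_incremental_scrape(stored, mapped)
print(plan["to_scrape"])  # → ['https://docs.example.com/c']
print(plan["to_retire"])  # → ['https://docs.example.com/a']
```

Treat "to_retire" with some care: a URL can drop out of one Map run for transient reasons, so you may want to see it missing more than once before deleting it from your index.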

The Failure Mode to Avoid

The most common mistakes are using Scrape in a loop when Crawl would handle it better, and using Crawl when you only needed one page.

Scraping in a loop without any link discovery means you have to manually maintain the list of URLs. If the site adds a new page, you miss it. Crawl handles that automatically.

Using Crawl for a single known URL is just overhead. You will get the content of that page plus everything linked from it, which is not what you wanted, and you will pay for the extra pages.

Map is the one that tends to get overlooked. It feels redundant if you are already planning to crawl. But if you want to do anything intelligent with URL selection before fetching content, Map gives you that information at low cost.

What I Would Do Next

If you are starting a new data ingestion project, run Map first. Look at what comes back. Understand the structure of the site: how many pages, what the URL patterns look like, whether there are sections worth skipping. Then decide whether Crawl covers your needs or whether you want the finer control of Map plus selective Scraping.

That five-minute audit at the start will save you from over-fetching junk pages or under-fetching content that was one link deeper than you expected.

How Agentic Search Actually Works: The Research Loop Link-Fetching Agents Miss

tokozen — Thu, 16 Apr 2026 18:59:06 +0000

Most agent pipelines treat web search as a single-shot tool call: send a query, get back some URLs, fetch one or two of them, stuff the text into the context window, move on. That works fine for lookup tasks. "What is the capital of France?" does not need a research loop.

But real research tasks do. "What are the current funding trends in open-source AI infrastructure?" or "How does Company X's pricing compare to its three main competitors?" requires following threads, noticing gaps, and issuing follow-up queries. A single fetch-and-summarize pass almost always misses the part of the answer that was buried in a secondary source, a forum thread, or a page the first result happened to link to.

That is the gap agentic search is supposed to fill. Here is what the actual loop looks like, and why the naive version falls short.

What a naive search agent does

A typical ReAct-style agent calls a search tool, gets back a list of results, picks the top one or two, fetches the content, and hands it to the LLM. The LLM either answers from that or gives up.

The failure mode is quiet. The agent returns an answer, often a confident one, but it is based on whatever happened to rank highest in that one query. If the first result is a marketing page, a paywalled article, or a three-year-old blog post, the answer reflects that without the agent noticing.

Three concrete problems:

Single query coverage: one phrasing of a question surfaces a different slice of the web than a slightly different phrasing. No single query covers the topic.
No gap detection: the agent does not evaluate whether the retrieved content actually answers the question. It feeds whatever it got to the LLM and lets the LLM figure it out.
No follow-up: if the first batch of results is insufficient, the agent has no mechanism to try again with a refined query or drill into a promising link.

The research loop that fixes this

Agentic search replaces the single-shot pattern with a loop that has three phases: query generation, result evaluation, and follow-up decision.

Phase 1: generate multiple queries. Given a research goal, the LLM generates three to five distinct queries that approach the topic from different angles. Not just synonyms, but genuinely different framings: a factual lookup, a comparison query, a "what are people saying about" query, a date-scoped query.

Phase 2: fetch, extract, and score. For each query, fetch the top results. Extract clean text (not raw HTML with nav bars and cookie banners, but the actual prose). Score each chunk against the original research goal: is this relevant? does it introduce new information? does it contradict something already retrieved?

Phase 3: decide to continue or stop. If the retrieved content covers the goal, synthesize and return. If there are still gaps, generate new queries targeting those gaps specifically, and loop. Most well-scoped research tasks converge in two to four iterations.

Here is a minimal version of that loop in Python using Anakin's Agentic Search API, which handles the fetch-and-extract step and returns results with sources attached:

import requests
import os

ANAKIN_API_KEY = os.environ["ANAKIN_API_KEY"]
ANAKIN_ENDPOINT = "https://api.anakin.ai/v1/agentic-search"

def agentic_search(goal: str, max_iterations: int = 3) -> dict:
    collected_sources = []
    queries_tried = []
    current_query = goal

    for iteration in range(max_iterations):
        print(f"Iteration {iteration + 1}: querying '{current_query}'")

        response = requests.post(
            ANAKIN_ENDPOINT,
            headers={
                "Authorization": f"Bearer {ANAKIN_API_KEY}",
                "Content-Type": "application/json",
            },
            json={
                "query": current_query,
                "include_sources": True,
            },
            timeout=30,
        )
        response.raise_for_status()
        data = response.json()

        answer = data.get("answer", "")
        sources = data.get("sources", [])
        collected_sources.extend(sources)
        queries_tried.append(current_query)

        # Ask the LLM whether the answer covers the goal
        # or whether a follow-up query is needed.
        # In a real pipeline this is an LLM call.
        # Here we fake it with a length heuristic for illustration.
        if len(answer.split()) > 150:
            print(f"Goal covered after {iteration + 1} iteration(s).")
            return {"answer": answer, "sources": collected_sources}

        # Generate a follow-up query targeting the gap.
        # In production: call your LLM with the goal, the answer so far,
        # and ask it to produce a more specific query.
        current_query = f"{goal} detailed analysis {iteration + 1}"

    return {
        "answer": "Max iterations reached. Partial results below.",
        "sources": collected_sources,
    }


if __name__ == "__main__":
    result = agentic_search(
        goal="What are the main approaches to reducing LLM inference costs in 2024?"
    )
    print(result["answer"])
    for src in result["sources"]:
        print(" -", src.get("url"), src.get("title"))

The key thing the loop adds is the gap-detection step. Even with the fake heuristic above, the structure forces you to ask: did I actually get what I needed? That question is absent from the single-shot pattern.

What clean source attribution changes

The other thing a proper agentic search loop enables is traceable answers. When you fetch raw pages yourself and concatenate the text into a prompt, you lose the mapping between claims and sources. The LLM synthesizes across everything and you cannot tell which sentence came from where.

When each result comes back with its source URL attached to the specific text chunk it came from, you can build a citation index. For RAG pipelines this matters a lot: the final answer can include inline citations, and a downstream verification step can re-fetch the source to confirm a claim has not changed since it was indexed.
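A citation index can be as simple as numbering each distinct source URL the first time it appears. This is a sketch under the assumption that every chunk arrives as a dict with "text" and "url" keys; that shape and the function name are mine, not a documented response format.

```python
def build_citation_index(chunks: list) -> tuple:
    """Tag each chunk with a citation number and return the ordered source list."""
    numbers = {}  # url -> citation number, in first-seen order
    cited = []
    for chunk in chunks:
        n = numbers.setdefault(chunk["url"], len(numbers) + 1)
        cited.append(f'{chunk["text"]} [{n}]')
    return cited, list(numbers)


chunks = [
    {"text": "Quantization cuts memory use.", "url": "https://example.com/quant"},
    {"text": "Batching raises throughput.", "url": "https://example.com/batch"},
    {"text": "4-bit weights are common.", "url": "https://example.com/quant"},
]

cited, sources = build_citation_index(chunks)
for line in cited:
    print(line)
for i, url in enumerate(sources, start=1):
    print(f"[{i}] {url}")
```

Because the first and third chunks share a URL, they share citation number 1, which is what lets a verification step re-fetch one source and check every claim attributed to it.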

For agent memory, this is also useful. If the agent stores what it has already fetched (by URL or by a hash of the content), it avoids re-fetching the same page on the next iteration and can detect when two sources contradict each other.
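The deduplication half of that idea fits in a few lines. The class below is a hypothetical sketch: it remembers URLs and a whitespace-normalized content hash, and it does not attempt contradiction detection, which needs an LLM or similar comparison on top.

```python
import hashlib


class FetchMemory:
    """Remember fetched URLs and content hashes so the agent can skip repeats."""

    def __init__(self):
        self.seen_urls = set()
        self.seen_hashes = {}  # content digest -> first URL that served it

    def should_fetch(self, url: str) -> bool:
        return url not in self.seen_urls

    def remember(self, url: str, content: str) -> bool:
        """Record a page. Returns False if the same content was already stored."""
        self.seen_urls.add(url)
        normalized = " ".join(content.split())  # ignore whitespace-only differences
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in self.seen_hashes:
            return False
        self.seen_hashes[digest] = url
        return True


memory = FetchMemory()
print(memory.remember("https://example.com/a", "Same   article text"))
print(memory.remember("https://mirror.example.com/a", "Same article text\n"))
print(memory.should_fetch("https://example.com/a"))
```

The second call returns False because the mirror serves the same text, which is the case URL-only deduplication misses.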

Where to take this next

The loop I showed above carries almost no state across iterations: it appends sources and queries to lists but never reads them back. A more robust version would maintain a shared context object that accumulates:

All queries tried (to avoid rephrasing the same one)
All URLs fetched (to skip duplicates)
A running summary of what is known vs. what is still unknown
A confidence score that drives the stop condition

The stop condition is the hardest part to get right. Too aggressive and the agent stops after one good-looking result. Too lenient and it loops until it hits the token or cost limit. In practice, a small LLM call that scores coverage against a checklist derived from the original goal works better than a word-count heuristic.
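Here is one way the shared context object and a checklist-driven stop condition could fit together. Everything in it is an assumption for illustration: the class name, the 0.8 threshold, and the idea that some upstream step (in practice a small LLM call) reports which checklist items an iteration covered.

```python
from dataclasses import dataclass, field


@dataclass
class ResearchState:
    checklist: list                      # items the final answer must cover
    max_iterations: int = 4
    queries_tried: list = field(default_factory=list)
    urls_fetched: set = field(default_factory=set)
    covered: set = field(default_factory=set)
    iterations: int = 0

    def record(self, query: str, urls: list, newly_covered: list) -> None:
        """Fold one iteration's results into the shared state."""
        self.queries_tried.append(query)
        self.urls_fetched.update(urls)
        self.covered.update(item for item in newly_covered if item in self.checklist)
        self.iterations += 1

    @property
    def coverage(self) -> float:
        return len(self.covered) / len(self.checklist) if self.checklist else 1.0

    def gaps(self) -> list:
        return [item for item in self.checklist if item not in self.covered]

    def should_stop(self, threshold: float = 0.8) -> bool:
        return self.coverage >= threshold or self.iterations >= self.max_iterations


state = ResearchState(checklist=["quantization", "batching", "caching"])
state.record("llm inference cost", ["https://example.com/quant"], ["quantization"])
print(state.should_stop(), state.gaps())
state.record("llm batching and kv cache", ["https://example.com/batch"], ["batching", "caching"])
print(state.should_stop(), state.gaps())
```

The gaps() list is what you would hand to the query generator on the next iteration, so follow-up queries target what is still missing instead of rephrasing the original goal.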

The agents that produce genuinely useful research outputs are not the ones with the best base model. They are the ones with the tightest loop: query, evaluate, follow up, stop when done.