Isaac Sunday
Building a Fast Web Scraper Without Puppeteer: A Live Coding Challenge

How I built a production-grade scraper for NC public notices using Cheerio and TLS fingerprinting instead of headless browsers

The Challenge

During a live coding interview, I was tasked with building a scraper for ncnotices.com, a North Carolina public notices database. The catch? I had limited time and needed to handle a complex ASP.NET application with state management, pagination, and CAPTCHA protection.

The interviewer mentioned their team used Puppeteer for their existing scraper. I chose a different path.

Why Not Puppeteer?

Don't get me wrong - Puppeteer is fantastic for many use cases. But for this task, it felt like bringing a tank to a knife fight:

  • Resource overhead: each headless Chrome instance eats 100-200MB of RAM
  • Slower execution: Browser startup alone takes 1-2 seconds
  • Complexity: Managing browser lifecycles, waiting for selectors, handling browser crashes

Since ncnotices.com renders server-side (classic ASP.NET), I didn't need JavaScript execution. I just needed to:

  1. Parse HTML efficiently
  2. Manage cookies and sessions
  3. Bypass basic bot detection

The Tech Stack

{
  "dependencies": {
    "cheerio": "^1.1.2",      // Fast HTML parsing
    "impit": "^0.7.0",        // TLS fingerprinting
    "tough-cookie": "^6.0.0"  // Cookie management
  }
}

Key Libraries

Cheerio: jQuery-like HTML parsing that's blazing fast. Perfect for server-rendered content.

Impit: The secret sauce. This library mimics real browser TLS fingerprints, making requests indistinguishable from actual Chrome/Firefox browsers. Much lighter than Puppeteer.

tough-cookie: Proper cookie jar implementation for session management.
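
To give a feel for the parsing side, here's a minimal Cheerio sketch (the markup and selector are illustrative, not ncnotices.com's actual structure):

import * as cheerio from "cheerio";

const html = `
  <table id="results">
    <tr><td><a href="/notice/123">Foreclosure notice</a></td></tr>
  </table>`;

const $ = cheerio.load(html);

// jQuery-style traversal over server-rendered markup - no browser involved.
$("#results a").each((_, el) => {
  console.log($(el).text().trim(), $(el).attr("href"));
});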

The Technical Challenges

1. ASP.NET ViewState Management

ASP.NET uses __VIEWSTATE and __VIEWSTATEGENERATOR tokens for state management. These need to be extracted and sent with each request:

private getStateTokens($: CheerioAPI): {
  view_state: string;
  view_state_generator: string;
} {
  // ASP.NET AJAX partial postbacks return a pipe-delimited payload in which
  // hidden fields appear as `...|hiddenField|__VIEWSTATE|<value>|...`.
  const configText = $.text().split("0|hiddenField").at(-1)?.split("|") || [];

  // On full page loads, the tokens live in ordinary hidden inputs.
  let view_state = $("#__VIEWSTATE").attr("value") || "";
  let view_state_generator = $("#__VIEWSTATEGENERATOR").attr("value") || "";

  // Otherwise, fall back to the delimited payload: each value directly
  // follows its field name.
  if (!view_state) {
    view_state = configText[configText.findIndex((x) => x.trim() == "__VIEWSTATE") + 1];
  }

  if (!view_state_generator) {
    view_state_generator = configText[configText.findIndex((x) => x.trim() == "__VIEWSTATEGENERATOR") + 1];
  }

  return { view_state, view_state_generator };
}

The tokens can be in hidden fields OR embedded in the response text. I handle both cases.

2. Pagination Hell

Classic ASP.NET pagination requires simulating user interactions through POST requests with specific event targets:

// The ScriptManager parameter names the UpdatePanel plus the control that
// "clicked": the search button on the first request, the grid's Next button
// for every page after it.
const queryParams: Record<string, string> = {
  ctl00$ToolkitScriptManager1:
    params.page > 1
      ? "ctl00$ContentPlaceHolder1$WSExtendedGridNP1$updateWSGrid|ctl00$ContentPlaceHolder1$WSExtendedGridNP1$GridView1$ctl01$btnNext"
      : "ctl00$ContentPlaceHolder1$as1$upSearch|ctl00$ContentPlaceHolder1$as1$btnGo",
  __VIEWSTATE: params.config.view_state,
  __VIEWSTATEGENERATOR: params.config.view_state_generator,
  __ASYNCPOST: "true",
  // ... more params
};

Each page requires the previous page's state tokens, creating a chain of requests.

3. CAPTCHA Support

The site occasionally shows reCAPTCHA. I built an extensible solver interface:

export interface ICaptchaSolver {
  solveCaptcha(siteKey: string, pageUrl?: string): Promise<string>;
}

This allows plugging in any CAPTCHA solving service (2captcha, Anti-Captcha, etc.) without modifying the scraper logic.
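
A thin adapter over an HTTP-based solving service is enough to satisfy it. Here's a sketch - the endpoint and payload shape are hypothetical placeholders, not any specific provider's API:

class HttpCaptchaSolver implements ICaptchaSolver {
  constructor(
    private readonly endpoint: string, // hypothetical solver endpoint
    private readonly apiKey: string,
  ) {}

  async solveCaptcha(siteKey: string, pageUrl?: string): Promise<string> {
    // Forward the reCAPTCHA parameters to the external service...
    const res = await fetch(this.endpoint, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ apiKey: this.apiKey, siteKey, pageUrl }),
    });

    // ...and hand back the solved token for the form submission.
    const { token } = (await res.json()) as { token: string };
    return token;
  }
}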

4. Session Management

The scraper maintains cookies across requests using tough-cookie:

this.client = new Impit({
  browser: "chrome",
  proxyUrl: proxyUrl,
  ignoreTlsErrors: true,
  cookieJar: new CookieJar(),
});
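
With the jar attached, every request through the client rides the same session. A minimal usage sketch, assuming impit's fetch-style API:

// The first request establishes the ASP.NET session cookies in the jar...
const response = await this.client.fetch("https://www.ncnotices.com/");
const html = await response.text();

// ...and every later POST replays them automatically - no manual
// Set-Cookie parsing required.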

The Complete Flow

async searchKeyword(query: string): Promise<NCSearchItem[]> {
  let hasNext = true;
  let setPreset = true; // Simulate form fill on first request
  let page = 1;
  const config = await this.getSearchConfig();
  const items: NCSearchItem[] = [];

  do {
    const { items: currentItems, view_state_generator, view_state, has_next }
      = await this.executeSearch(query, { page, date: config.date, config }, setPreset);

    items.push(...currentItems);
    page++;
    hasNext = has_next;
    config.view_state = view_state;
    config.view_state_generator = view_state_generator;
    setPreset = false;
  } while (hasNext);

  return items;
}

In short:

  1. Get initial config and state tokens
  2. Execute search with form simulation
  3. Extract results and pagination state
  4. Continue until no more pages
  5. Return all items

Performance Comparison

Let's crunch the numbers for a realistic scenario: scraping 100 public notices across 10 paginated search result pages.

Puppeteer Approach

Browser startup:        1,500ms (one-time cost)
10 search page loads:  10 × 300ms = 3,000ms
100 detail page loads: 100 × 300ms = 30,000ms
───────────────────────────────────────────
Total time:            ~34.5 seconds
Memory usage:          ~150MB per instance

This Approach (Cheerio + Impit)

HTTP client init:      50ms (one-time cost)
10 search requests:    10 × 100ms = 1,000ms
100 detail requests:   100 × 100ms = 10,000ms
───────────────────────────────────────────
Total time:            ~11 seconds
Memory usage:          ~10MB per instance

Result: 3.1x faster execution time, 15x lower memory footprint

Scalability on 8GB RAM

Approach            Concurrent Instances   Theoretical Throughput
────────────────────────────────────────────────────────────────
Puppeteer           ~50                    ~145 notices/sec
Cheerio + Impit     ~800                   ~7,270 notices/sec

That's 16x more concurrent instances (8GB / 150MB ≈ 53 vs. 8GB / 10MB = 800) and roughly 50x the theoretical throughput - though realistically you'll hit network bandwidth and rate limits long before memory.

What I Learned

  1. Choose the right tool: Puppeteer is amazing, but not always necessary
  2. Understand the target: Server-side rendering = no JavaScript needed
  3. TLS fingerprinting matters: Impit makes requests look legitimate
  4. State management is crucial: ASP.NET apps require careful token handling
  5. Build for extensibility: The CAPTCHA solver interface allows easy integration

The Outcome

I completed this scraper 2 hours after the interview ended and sent it to the interviewer. While I didn't get the job (they chose another candidate), the interviewer specifically mentioned:

"You are clearly very smart and have a lot of expertise. I will put you at the top of my list for the future."

Sometimes the best outcome isn't getting the job - it's building something you're proud of and demonstrating genuine problem-solving skills.

Try It Yourself

The full source code demonstrates:

  • Clean TypeScript architecture
  • Comprehensive type definitions
  • Testable design with dependency injection
  • Production-ready error handling

Basic usage:

const scraper = new NCNotices("http://your-proxy:8888");
const results = await scraper.searchKeyword("foreclosure");

for (const searchItem of results) {
  const detail = await scraper.getItem(searchItem);
  console.log(detail);
}

Final Thoughts

Web scraping isn't about using the most popular tools - it's about understanding your target, choosing the right approach, and building maintainable solutions. Sometimes a lightweight HTTP client with good HTML parsing beats a full browser automation framework.

What's your experience with web scraping? Have you faced similar challenges with state management or bot detection? Drop a comment below!


P.S. - The Plot Twist

So... did I get the job?

As you already know: nope! But I had fun building this and learned a ton in the process.

Plus, I got one of the nicest rejection messages I've ever received:

[Screenshot: the rejection email showing the interviewer's appreciation]

Not bad for a live coding challenge I completed 2 hours after the interview ended. 😄


Interested in more deep dives on web scraping, TypeScript, and performance optimization? Follow me for more technical breakdowns!
