Prithwish Nath

Posted on • Originally published at Medium

Building a Fault-Tolerant Web Data Ingestion Pipeline with Effect-TS

TL;DR — Silent failures break data pipelines. This post shows how Effect-TS enables typed errors, safe resource management, declarative retry logic, and composable pipelines to build predictable, fault-tolerant web data ingestion systems at scale.

It annoys me to no end that production web data pipelines rarely fail catastrophically. Instead, batch jobs “succeed” with incomplete data — silently corrupting downstream analytics, triggering retry storms that lead to IP bans, or letting one bad edge case crash a large nightly job.

I’m currently rebuilding the web data ingestion pipeline I’m responsible for at work: aggregation and analysis from 100+ upstream sources daily, hundreds of items per batch, with strict consistency requirements. Over time, I stopped trying to paper over failures with more logging + more retries, and started looking for a way to make system behavior explicit and easier to reason about.

That search eventually led me to Effect (formerly Effect-TS) — a TypeScript effect system for modeling side effects, failures, and resource lifecycles directly in the type system.

Effect didn’t make my life easier in the sense of “fewer lines of code.” What it changed was how I thought about failure in TypeScript systems. Instead of treating network errors, rate limits, and partial responses as things to catch and move on from, Effect pushes you to model these failure modes explicitly and decide ahead of time how the system should respond.

Reliability engineering isn’t about building systems that never fail. It’s about building systems where failure is expected, understood, and bounded — so it doesn’t cascade into larger outages or silent data corruption.

In this post, I’ll walk through what that style of reliability engineering looks like in practice: using Effect-TS with typed errors, resource management, and declarative retries to build a fault-tolerant web data ingestion pipeline whose behavior is predictable under real-world failure.

All of the code in this post lives in a public Effect-TS web scraping and data ingestion repository on GitHub:

https://github.com/sixthextinction/effect-ts-scraping

What is Effect-TS?

At a practical level, Effect lets you describe work without running it yet.

An Effect value represents an operation that might perform I/O, might fail, and might depend on some environment…but none of that happens until you explicitly run it.

The crucial thing to realize is that Effect doesn’t just describe what the operation does. It also encodes what the operation produces on success, what it can fail with, and, optionally, what it depends on.

And ALL of that information lives in the TypeScript type system.

This might not sound like a big deal at first, but it changes when decisions get made — and it turns out that matters a lot.

In a typical TypeScript codebase, a data-fetching function looks like this:

async function fetchHtml(url: string): Promise<string> {  
  const res = await fetch(url);  

  if (!res.ok) {  
    throw new Error(`Request failed: ${res.status}`);  
  }  

  return await res.text();  
}  

// and then...  
const promise = fetchHtml("https://example.com");

What most people don’t realize is that Promises in JavaScript are eager. As soon as that line runs, the request has already started. Even if you never await the promise, the network request is already in flight, side effects have already happened, and yes — failures may already be occurring.

Now compare that to an Effect-based version:

// first define the error  
class NetworkError extends Data.TaggedError('NetworkError')<{  
  url: string;  
}> {}  

// then, do this.  
const fetchHtml = (url: string): Effect.Effect<string, NetworkError, never> =>  
  Effect.tryPromise({  
    try: async () => {  
      const res = await fetch(url);  
      if (!res.ok) {  
        throw new Error();  
      }  
      return await res.text();  
    },  
    catch: () => new NetworkError({ url }),  
  });

Effect is lazy, not eager. With Effect, just doing const effect = fetchHtml("https://example.com"); does nothing. It's simply data — a description of a computation. Nothing runs until you explicitly say so, by calling a runner like this:

Effect.runPromise(fetchHtml("https://example.com"));

Because the work doesn’t start until you run it, you can still alter how it should behave — retries, timeouts, cancellations, and more can be attached before execution, not bolted on afterward.

Instead of discovering failure modes at runtime (or more likely, encoding them in comments/conventions) you’re forced to confront them at design time.

const program = pipe(  
  fetchHtml("https://example.com"),  
  Effect.retry(retryPolicy), // add retry logic  
  Effect.timeout("10 seconds") // add a timeout  
 // anything else  
);  

// still nothing has run  

// until….  
Effect.runPromise(program);  

That’s why Effect works so well for hostile I/O like data ingestion. You’re deciding ahead of time how the system behaves when failures do happen. And with it, cross-cutting concerns (retries, rate limits, cleanup) can go on top without refactoring core logic.

Also, look at the Effect version’s type — Effect<string, NetworkError> — this is a machine-checkable contract that tells you, precisely:

  • this operation performs effects
  • it produces a string on success
  • it can fail with NetworkError
  • it CANNOT fail with any other expected error

Compare that to the vanilla TS type signature — (url: string) => Promise<string> — where you cannot tell:

  • what errors might be thrown
  • whether they’re retryable
  • whether this is safe to call multiple times
  • whether this does I/O or just compute

All of that information exists only in comments, conventions, or someone’s head (or you only find out by running it and reacting.)

All this is why Effect feels like the TypeScript framework that you didn’t know you needed.

How Effect Changed How I Design for Failure

When the mental model of Effect clicked for me, I realized that if I can describe behavior before anything runs, then I’m not just deciding what happens on success — I’m deciding how the operation behaves under every condition. That includes failures, obviously, but it also includes retries, slowdowns, and backpressure.

That’s where my thinking about data ingestion started to change. Most failures in a data ingestion pipeline are expected. None of them behave like typical fix-and-forget bugs:

  • networks are slow or unavailable
  • upstream APIs rate-limit you
  • data formats change without notice
  • some batches succeed while others fail

What’s different about these failures is that they’re maddeningly partial — and frequent. A job can succeed just enough to look healthy while quietly producing incomplete or stale data.

That’s not a correctness problem so much as a reliability problem. Once I started using Effect more deliberately, I noticed that it actually pushed me away from reacting to failures after the fact, and toward making those decisions up front.
So instead of adding another retry or another catch after the fact, I had to decide at design time:

  • What kinds of failures do I expect to see in production?
  • Which of these should be retried, and which shouldn’t?
  • When should the pipeline slow down instead of pushing harder?
  • When is failing fast the correct behavior?*

*This one is slightly debatable, but let’s throw it in there because it’s an adjacent problem anyway.

Because these questions are now part of the TypeScript type system itself, those decisions end up close to the code that triggers them. There’s less room for “we’ll handle it later” logic that never quite materializes, because Effect forces the conversation.

Designing a Web Data Pipeline with Effect

The first concrete step was obvious: I needed to enumerate what actually breaks in my pipeline, and decide how each case should behave.

So I sat down and reduced my Puppeteer-based ingestion pipeline to its real failure modes:

  • Network timeouts. Transient. These should be retried with backoff.
  • Rate limits. Expected. These require slowing down.
  • IP blocks. Fatal without proxy rotation; but with the right infrastructure (as was my case), just another retryable case.
  • CAPTCHAs. Not a logic problem. For me, this is handled entirely by the proxy layer, and is also retryable without any code on my part.
  • Schema changes. The site changed and selectors broke. This isn’t transient — it’s a logic error and should fail fast.

Traditional error handling lumps all of these into “something went wrong, throw an exception.” Effect lets you model them as distinct failure types, which means you can build infrastructure that handles them systematically. And that’s exactly where we’re going to start.

For reference, here’s the code for the full pipeline: https://github.com/sixthextinction/effect-ts-scraping/blob/main/full-pipeline.ts

Failure as a First-Class Concept (Tagged Errors)

The first thing I do is write down every failure mode I expect to see, and give each one a name. Each of these errors represents something meaningfully different from an operational perspective.

class NetworkError extends Data.TaggedError('NetworkError')<{  
  message: string;  
  url: string;  
  cause?: unknown;  
}> {}  

class TimeoutError extends Data.TaggedError('TimeoutError')<{  
  message: string;  
  url: string;  
  timeout: number;  
}> {}  

class RateLimitError extends Data.TaggedError('RateLimitError')<{  
  message: string;  
  url: string;  
  retryAfter?: number;  
}> {}  

class IPBlockError extends Data.TaggedError('IPBlockError')<{  
  message: string;  
  url: string;  
  proxyId?: string;  
}> {}  

class ParseError extends Data.TaggedError('ParseError')<{  
  message: string;  
  cause?: unknown;  
}> {}  

// Puppeteer-level failures (launch, page creation, navigation) get their own type,  
// used by the browser code and the retryable list below  
class BrowserError extends Data.TaggedError('BrowserError')<{  
  message: string;  
  cause?: unknown;  
}> {}  

// the union the rest of the pipeline works against  
type ScrapingError =  
  | NetworkError  
  | TimeoutError  
  | RateLimitError  
  | IPBlockError  
  | BrowserError  
  | ParseError;

What’s this Data.TaggedError? That’s something Effect provides us. Basically, it’s a premade error class that automatically gets a _tag field — a string literal that acts as a discriminant.

This _tag field gives us type-safe error handling. TypeScript can distinguish between different error types at compile time, and you can use functions like Effect.catchTag to handle specific errors without losing type information.

You technically can do this with vanilla TypeScript (discriminated unions), but it’ll be a pain. Yes, you can catch generic Error objects and use instanceof checks — but TypeScript can’t always narrow them correctly. Effect’s tagged errors give you precise type narrowing. When you catch a NetworkError, for example, TypeScript 100% knows it has a url property. When you catch a RateLimitError, TypeScript knows it might have a retryAfter property. This makes error handling both type-safe and composable (not to mention 500% less annoying to write code for. 😅)
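Here’s what that narrowing looks like in practice — a sketch using the fetchHtml effect from earlier, assuming the same Effect and pipe imports from the effect package; the fallback HTML is made up:

const withFallback = pipe(  
  fetchHtml('https://example.com'),  
  Effect.catchTag('NetworkError', (error) =>  
    // error is narrowed to NetworkError here, so error.url is fully typed  
    Effect.succeed(`<!-- fallback page for ${error.url} -->`)  
  )  
);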

Anyway, when modeling complex pipelines, this has to be your first step because once you have typed errors, you can decide which ones are retryable and which aren’t:

const retryableErrors = [  
  'NetworkError',  
  'TimeoutError',   
  'RateLimitError',  
  'IPBlockError',  
  'BrowserError',  
] as const;  

const isRetryableError = (error: ScrapingError): boolean =>  
  retryableErrors.includes(error._tag as any);

So a ParseError means your HTML selectors broke. That’s not a network problem, so retrying won’t help. But a TimeoutError is transient — that’s exactly when you retry with backoff.

Browser Logic — Side Effects Without the Pain

My pipeline uses Puppeteer to handle dynamic, JS-heavy websites.

For this, we’ll use the Effect interface (the core type of the Effect-TS/effect library we’re working with). Instead of letting Puppeteer leak browser state all over the codebase, everything related to it lives inside these Effects.

The Effect interface is the quintessential part of the Effect-TS library — a description of a workflow or operation that is lazily executed. Here's what it looks like:

Effect<Success, Error, Requirements>

Where Success is the type produced on success, Error is the type of expected failures, and Requirements is the type of dependencies the effect needs before it can run.
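For example, the fetchHtml effect from earlier has this shape (the commented-out HttpClient line is purely hypothetical, just to illustrate the third slot):

// succeeds with a string, can fail with NetworkError, needs no dependencies:  
type FetchHtmlEffect = Effect.Effect<string, NetworkError, never>;  

// same success/error types, but can't run until a (hypothetical) HttpClient service is provided:  
// type FetchViaClient = Effect.Effect<string, NetworkError, HttpClient>;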

We’ve talked about the main difference before — unlike Promises, Effects are lazy. They don’t run until executed. This opens up a lot of opportunities for composition, cancellation, and resource management.

Using Effect, the lowest-level operation we need is…simply launching a browser. Makes sense that this should be our Step 1:

// STEP 1: Actually launch the browser  

// tryPromise converts a Promise-returning function into an Effect  
// Errors are caught and converted to typed errors (BrowserError)  
const launchBrowser = (  
  proxyConfig: ProxyConfig  
): Effect.Effect<Browser, BrowserError, never> =>  
  Effect.tryPromise({  
    try: async () => {  
      process.env['NODE_TLS_REJECT_UNAUTHORIZED'] = '0'; // disable SSL validation for Bright Data proxy  
      return await puppeteer.launch({  
        headless: true,  
        ignoreHTTPSErrors: true, // ignore SSL certificate errors  
        args: [  
          '--no-sandbox',  
          '--disable-setuid-sandbox',  
          `--proxy-server=${proxyConfig.host}:${proxyConfig.port}`, // proxy host:port (credentials set via page.authenticate later)  
        ],  
      });  
    },  
    catch: (error: unknown) =>  
      new BrowserError({  
        message: 'Failed to launch browser with Bright Data proxy',  
        cause: error,  
      }),  
  });

The never in the Requirements position means the effect doesn’t require any external dependencies or context to run.

Effect.tryPromise converts a Promise-returning function into an Effect (here, puppeteer.launch). Any thrown error gets mapped into a typed failure — since I provide a catch function, every failure is mapped to an error of type BrowserError.

Your proxy config can live in a separate object like so. Using a proxy is technically optional, but I already had access to residential proxies and that handles the messy parts for me — fingerprinting, CAPTCHA solving, IP rotation, and geo-targeting — so in my pipeline, my Puppeteer instances behave like a real user instead of getting blocked immediately, with no extra code on my part.


// Bright Data HTTP Proxy configuration (from env vars or .env file)  
// You'll get these values from your dashboard when you sign up  
const BRIGHT_DATA_CONFIG = {  
  customerId: process.env.BRIGHT_DATA_CUSTOMER_ID,  
  zone: process.env.BRIGHT_DATA_ZONE,  
  password: process.env.BRIGHT_DATA_PASSWORD,  
  proxyHost: 'brd.superproxy.io',  
  proxyPort: 33335,  
};  

// Validate configuration  
if (!BRIGHT_DATA_CONFIG.customerId || !BRIGHT_DATA_CONFIG.zone || !BRIGHT_DATA_CONFIG.password) {  
  throw new Error(  
    'Bright Data configuration missing. Set BRIGHT_DATA_CUSTOMER_ID, BRIGHT_DATA_ZONE, and BRIGHT_DATA_PASSWORD environment variables or add them to .env file'  
  );  
}  

interface ProxyConfig {  
  host: string;  
  port: number;  
  username: string;  
  password: string;  
}  

const buildProxyConfig = (): ProxyConfig => {  
  const username = `brd-customer-${BRIGHT_DATA_CONFIG.customerId}-zone-${BRIGHT_DATA_CONFIG.zone}`;  
  return {  
    host: BRIGHT_DATA_CONFIG.proxyHost,  
    port: BRIGHT_DATA_CONFIG.proxyPort,  
    username,  
    password: BRIGHT_DATA_CONFIG.password!,  
  };  
};

Those proxy config values are just the credentials you get when you set up a proxy to use.

With Puppeteer, proxy credentials are supplied via page.authenticate(), which is exactly what we’ll do in the next step.

Alright, so as of now, we have a Puppeteer instance up and running. Next, we need navigation and content extraction. We’ll use Effect.acquireUseRelease to do this:

// STEP 2: Go to page, extract content.   

const navigatePageAndGetContent = (  
  browser: Browser,         // this was returned as a result of what we did in Step 1  
  url: string,              // the URL to go to  
  proxyConfig: ProxyConfig, // we already set this up earlier  
  timeout: number           // use your own values in ms  
) =>  
  Effect.acquireUseRelease(  
    // acquire: create the page  
    // use: navigate and get content  
    // release: always close the page, even on error  
  );

This acquireUseRelease is Effect’s version of try / finally. You use it when describing real-world operations where you have to work with external resources (database connections, network stuff, etc.) that must be acquired, used properly, and released when no longer needed (even if an error occurs).

It always involves a 3-step process. For us, this will involve:

  • Acquire: open a page in Puppeteer using a proxy that we authenticate
  • Use: navigate, check status codes, return HTML
  • Release: close the page, even if something failed

You don’t have to explicitly remember to do cleanup — the structure enforces it.

Let’s look at all of those steps in detail.

// STEP 2: Go to page, extract content.   

// Effect.acquireUseRelease manages resource lifecycle: acquire, use, and release  
// Ensures cleanup happens even if errors occur (like try/finally)  
// See: https://effect.website/docs/resource-management/introduction  
const navigatePageAndGetContent = (  
  browser: Browser,  
  url: string,  
  proxyConfig: ProxyConfig,   
  timeout: number = 10000   
): Effect.Effect<string, BrowserError | TimeoutError | IPBlockError | RateLimitError | NetworkError> =>  
  Effect.acquireUseRelease(  
    // STEP 2.1: acquire: create the page  
    Effect.tryPromise({  
      try: async () => {  
        const page = await browser.newPage();  
        await page.authenticate({ username: proxyConfig.username, password: proxyConfig.password }); // authenticate with Bright Data proxy  
        return page;  
      },  
      catch: (error: unknown) =>  
        new BrowserError({  
          message: 'Failed to create page or authenticate',  
          cause: error,  
        }),  
    }),  
    // STEP 2.2: use: navigate and get content  
    (page) =>  
      Effect.tryPromise({  
        try: async () => {  
          const response = await page.goto(url, {  
            waitUntil: 'networkidle2', // use 'load' if 'networkidle2' fails - proxies can have background requests that never stop  
            timeout: timeout    
          });  

          // check for HTTP errors that indicate blocks/rate limits  
          if (response) {  
            const status = response.status();  
            if (status === 429) {  
              throw new RateLimitError({  
                message: `Rate limited: ${url}`,  
                url,  
              });  
            }  
            if (status === 403) {  
              throw new IPBlockError({  
                message: `IP blocked: ${url}`,  
                url,  
              });  
            }  
            if (status >= 400) {  
              throw new NetworkError({  
                message: `HTTP error ${status}: ${url}`,  
                url,  
              });  
            }  
          }  

          return await page.content();  
        },  
        catch: (error: unknown) => {  
          if (error instanceof RateLimitError || error instanceof IPBlockError) {  
            return error;  
          }  
          if (error instanceof NetworkError) {  
            return error;  
          }  
          if (error instanceof Error && error.message.includes('timeout')) {  
            return new TimeoutError({  
              message: `Navigation timeout after ${timeout}ms`,  
              url,  
              timeout,  
            });  
          }  
          return new BrowserError({  
            message: 'Failed to navigate or get content',  
            cause: error,  
          });  
        },  
      }),  
    // STEP 2.3: release: always close the page, even on error  
    (page) =>  
      pipe(  
        Effect.tryPromise({  
          try: async () => await page.close(),  
          catch: () => new Error('Failed to close page'),  
        }),  
        Effect.catchAll(() => Effect.void) // ignore close errors  
      )  
  );

Effect's pipe (seen here in Step 2.3) composes functions left-to-right, passing the output of one as input to the next. It makes Effect operations readable instead of nested. So reading top to bottom, you start with Effect.tryPromise, then apply Effect.catchAll on top of it.
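If it helps, here’s pipe on plain values — nothing Effect-specific about it:

// pipe(x, f, g) is just g(f(x)), read left to right  
const n = pipe(2, (x) => x + 1, (x) => x * 10); // 30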

Great, we now have a Puppeteer instance, and we can use it to visit a page, extract the content we want, and close the page. Now we just have to bring it all together, i.e., manage the browser lifecycle.

This should also follow an acquire → use → release cycle, this time at the browser level rather than the page level.

const scrapeUrl = (  
  url: string,  
  options?: { timeout?: number }  
): Effect.Effect<  
  string,  
  BrowserError | TimeoutError | IPBlockError | RateLimitError | NetworkError,  
  never  
> => {  
  const proxyConfig = buildProxyConfig();  
  const timeout = options?.timeout || 10000;  

  // Bright Data automatically rotates IPs on each request,  
  // so retrying after an IP block gets a fresh IP  
  return Effect.acquireUseRelease(  
    // STEP 1: ACQUIRE -- launch browser  
    launchBrowser(proxyConfig),  

    // STEP 2: USE -- navigate and extract HTML  
    (browser) =>  
      navigatePageAndGetContent(browser, url, proxyConfig, timeout),  

    // STEP 3: RELEASE -- always clean up the browser  
    (browser) =>  
      pipe(  
        Effect.tryPromise({  
          try: async () => await browser.close(),  
          catch: () => new Error('Failed to close browser'),  
        }),  
        Effect.catchAll(() => Effect.void)  
      )  
  );  
};

💡 In production usage you will usually also want to separate the fetch → parse → exit cycle into fetch → persist raw → parse → persist parsed, so you can debug raw HTML later or parallelize parsing — sketched below.
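A rough sketch of that split — saveRawHtml here is a hypothetical helper, not part of the repo:

// Hypothetical helper: archive raw HTML somewhere durable before parsing  
const saveRawHtml = (url: string, html: string) =>  
  Effect.sync(() => {  
    // e.g. write the HTML to disk or object storage, keyed by URL + timestamp  
  });  

const fetchAndArchive = (url: string) =>  
  pipe(  
    scrapeUrl(url, { timeout: 30000 }),  
    Effect.tap((html) => saveRawHtml(url, html)) // parsing can now happen later, from the archive  
  );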

Again, each step is expressed as a function returning an Effect, and cleanup is guaranteed, even if navigation or retries fail.

At the end of this stage we have the basic Puppeteer loop: visiting a dynamic page, extracting its HTML, and cleaning up after ourselves.

But there’s more to do — namely, making all of the above work with parsing (business logic), and retry behavior + rate limiting (cross-cutting concerns.)

Retries That Understand Why Something Failed

Effect makes retry behavior declarative via its Schedule API.

// remember we defined which errors were retryable in step 1  
// here, first, we define HOW we should schedule retries…  

const retryPolicy = pipe(  
  Schedule.exponential(Duration.seconds(1)),  
  Schedule.intersect(Schedule.recurs(3))  
);

This says: “retry with exponential backoff starting at 1 second, at most 3 times.” Schedule.intersect combines the two schedules, so retrying stops as soon as either one is exhausted — here, after the 3 recurrences allowed by Schedule.recurs(3).

But the schedule alone isn’t enough. The system also needs to know which failures deserve a retry. Luckily, we already know those.

//… and then actually retry the ones which are retryable  
// Effect<A, E, R> is just the Effect ecosystem’s convention/shorthand for Effect<Success, Error, Requirements>  

const retryIfRetryable = <A, E extends ScrapingError, R>(effect: Effect.Effect<A, E, R>) =>  
  Effect.retry(effect, {  
    schedule: retryPolicy,  
    while: (error) => isRetryableError(error), // keep retrying only while the error is retryable  
  });

The while predicate is the key. Instead of retrying blindly, the system checks the error type and only keeps retrying when it makes sense. If you hit a ParseError (not in our list of retryable errors), the pipeline fails immediately — which makes sense; there’s no point in hammering a broken CSS selector.

This is where our tagged errors from Step 1 pay off. The retry logic doesn’t inspect strings or guess intent. It operates entirely on types.

We’ll use retryIfRetryable at the very end, when we’re bringing all parts of our pipeline together.

Rate Limiting as a Declarative Policy

Rate limiting is really backpressure rather than error handling. That is, you don’t want to wait until you get rate-limited to slow down — you want to prevent hitting the limit in the first place.

For this tutorial, we can keep rate limiting intentionally boring. Because even the rate limiting is an Effect (Effect.Effect), it composes cleanly with the retries and resource management above.

const withSimpleRateLimit = <A, E, R>( effect: Effect.Effect<A, E, R>) =>  
  pipe(  
    Effect.sleep(Duration.millis(100)),  
    Effect.flatMap(() => effect)  
  );

This simply introduces a delay before the effect runs. Just like retries, we’ll use withSimpleRateLimit at the very end when composing the pipeline.
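If you prefer, the same thing can be written with Effect’s built-in delay combinator — a sketch with equivalent behavior:

const withSimpleRateLimitAlt = <A, E, R>(effect: Effect.Effect<A, E, R>) =>  
  Effect.delay(effect, Duration.millis(100));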

Of course, if you wanted to, you could go all out and build production-grade rate limiting, because Effect provides primitives like Ref (shared state), Queue (a lightweight in-memory queue), and Schedule (which we used in the previous step).

💡 I’m not going to go into detail on building a full rate limiter with Effect because that’s way too much cognitive load for just a blogpost. The point isn’t to build perfect throttling — it’s to show that backpressure can be a first-class part of the pipeline in Effect.
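Still, for a small taste of those primitives — this is a sketch, not code from the repo, and urls is a hypothetical list — here’s a semaphore capping how many scrapes run at once:

// cap concurrency at 2 scrapes at a time  
const urls = ['https://example.com/a', 'https://example.com/b', 'https://example.com/c'];  

const cappedScrapes = Effect.flatMap(Effect.makeSemaphore(2), (semaphore) =>  
  Effect.all(  
    urls.map((url) => semaphore.withPermits(1)(scrapeUrl(url, { timeout: 30000 }))),  
    { concurrency: 'unbounded' } // the semaphore does the limiting, not this option  
  )  
);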

Parsing as Its Own Failure Domain

Finally, remember that while parsing logic is synchronous, it can still fail. We haven’t accounted for those failures yet.

This is our HTML parsing step to get the data we need. Instead of letting parse errors throw, we wrap the logic in Effect.try — similar to Effect.tryPromise from earlier, but for synchronous functions that may throw (like cheerio parsing here).

// This is just your scraping logic with selectors for the data you want  
// For this one we just get h1’s and spans.  

interface ScrapingResult {  
  title: string;  
  spans: string[];  
  url: string;  
}  

// Effect.try wraps synchronous logic that may throw  
// Errors are caught and converted to typed errors (ParseError)  
const parseHtml = (html: string): Effect.Effect<ScrapingResult, ParseError, never> =>  
  Effect.try({  
    try: () => {  
      const $ = cheerio.load(html);  
      const title = $('h1').text().trim();  
      const spans = $('span')  
        .map((_i: number, el: any) => $(el).text().trim())  
        .get()  
        .filter((s: string) => s.length > 0);  

      return {  
        title,  
        spans,  
        url: TARGET_URL,  
      };  
    },  
    catch: (error: unknown) =>  
      new ParseError({  
        message: 'Failed to parse HTML',  
        cause: error,  
      }),  
  });  

This completes our error modeling — now parsing failures are distinct from network failures and both are handled properly. If parsing breaks, all our retries stop. That should be intentional — and so we made it so.

Composing the Pipeline

Up to this point, we’ve built individual pieces in isolation:

  • scrapeUrl knows how to fetch HTML safely
  • retryIfRetryable knows when to retry
  • withSimpleRateLimit enforces basic throttling
  • parseHtml turns raw HTML into structured data as per our domain logic (we provide the selectors we need)

Now we compose them into a single pipeline.

const scrapeWithRetry = (): Effect.Effect<  
  ScrapingResult,  
  ScrapingError  
> =>  
  pipe(  
    // Step 2: fetch HTML  
    scrapeUrl(TARGET_URL, { timeout: 30000 }),  

    // Step 4: apply rate limiting  
    withSimpleRateLimit,  

    // Step 3: retry transient failures  
    retryIfRetryable,  

    // Step 5: parse HTML  
    Effect.flatMap(parseHtml)  
  );

Read this top to bottom.

  1. We start by fetching HTML with scrapeUrl (which instantiates and uses Puppeteer)
  2. That operation is rate-limited
  3. If it fails with a retryable error, it’s retried with backoff
  4. If it succeeds, we move on to parsing
  5. If parsing fails, the whole Effect fails immediately (as it should)

There are no callbacks here, no try/catch, and no manual error propagation. Control flow is handled by the Effect runtime.

Crucially, this function does not run anything yet. It only describes what should happen.

💡 I’ve skipped observability for this blog post, but in production you should add more detailed logging, save retry metrics and failures, add tracing per URL or proxy, and so on. Effect makes this very easy with its logging APIs.
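For a flavor of what that looks like — a sketch using Effect’s built-in logger, not part of the repo:

const observedScrape = pipe(  
  scrapeWithRetry(),  
  Effect.tap((result) => Effect.log(`scraped ${result.url}: ${result.spans.length} spans`)),  
  Effect.tapError((error) => Effect.logError(`scrape failed with ${error._tag}`))  
);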

Executing the Pipeline

So far, we’ve built a description of a workflow. To actually execute it, we need to define what happens at the edges of the system (and this is where we finally use scrapeWithRetry.)

That’s what the final program does. This is again a pipe, so these happen sequentially:

const program = pipe(  
  scrapeWithRetry(),  

  // Log success  
  Effect.tap((result: ScrapingResult) =>  
    pipe(  
      Console.log('Scraping successful!'),  
      Effect.flatMap(() =>  
        Console.log(JSON.stringify(result, null, 2))  
      )  
    )  
  ),  

  // Handle all failures in one place  
  Effect.catchAll((error: ScrapingError) =>  
    pipe(  
      Console.error('Pipeline failed:', error),  
      Effect.flatMap(() =>  
        Effect.sync(() => process.exit(1))  
      )  
    )  
  )  
);

Finally, here’s how you run this.

// This is the entry point that ACTUALLY kicks off the entire pipeline  
Effect.runPromise(program).catch((error: unknown) => {  
  console.error('Unhandled error:', error);  
  process.exit(1);  
});

The .catch() wrapper here handles any truly unexpected errors that escape the Effect system (really shouldn’t happen, but it’s defensive programming).

This is the moment where everything becomes “real” — the browser is launched, requests are made, retries happen, resources are acquired and released, and logs are written.

Until this line runs, nothing has executed.

That mental separation — describing a workflow first, then running it explicitly — is one of the key reasons Effect works well for systems like this. You can reason about behavior before anything touches the network.

Why This Matters for Production Data Pipelines

So what did we build? Our pipeline has a few important properties:

  • Our errors are part of the type system and MUST be handled or intentionally propagated, and structured errors make any debugging or observability WAY easier
  • All of our resource lifecycles are enforced by construction
  • Our retry behavior is declarative, composable, and constrained by error types. In general, all cross-cutting concerns (retries, rate limits, cleanup) compose without refactoring core logic
  • All failure modes are explicit and discoverable at compile time
  • Any concurrency is safer by default, especially around shared resources

That structure is what makes this production pipeline evolvable. You can add ingestion sources, tune retries, adjust rate limits, or add observability without turning the code into a pile of special cases.

This system will always, always fail predictably, with enough context to debug what went wrong and why.

This kind of setup pays off when you’re scraping at scale, your failures have real business impact, and you need debuggability and auditability — something a team will maintain.

Most scraping tutorials stop at “how to fetch a lot of HTML without getting blocked.” That’s not the hard part. The hard part is building something you don’t have to babysit. Effect gives you a way to model failure honestly, plus a lot of ready-made, first-class APIs for the parts application code should never reimplement from scratch — and you can add proxies on top to handle CAPTCHAs and general unblocking. It’s a solid foundation to work from.

It’s way more difficult, absolutely — Effect’s learning curve is more like a cliff wall — but it’s also way more reliable. And when a system runs unattended in production, that reliability is what actually matters.


Full source code: https://github.com/sixthextinction/effect-ts-scraping
