What We Were Trying to Do Every Morning
Every morning, our inbox looks like a battlefield. Around a thousand emails pour in overnight, each one an IFTTT-generated notification carrying a single Twitter link. Keyword subscriptions, competitor alerts, KOL tracking, industry signals, all jammed together in an unreadable pile.
Out of those thousand, roughly fifty are actually worth reading. The other 950 are noise.
But here is the catch. You cannot tell which is which by looking at the email subject. The IFTTT notification gives you a link and not much else. To filter properly, you have to open each tweet and pull three pieces of information: the actual content, the view count, and the like count. Only then can you decide whether a post deserves a closer look or a follow-up.
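For reference, the per-tweet record we care about is tiny. The sketch below is illustrative only; the field names and the follow-up thresholds are placeholders of our own, not anything prescribed by IFTTT or Twitter.

```python
from dataclasses import dataclass

@dataclass
class Tweet:
    url: str       # the link extracted from the IFTTT email
    content: str   # full text of the tweet
    views: int     # view count at crawl time
    likes: int     # like count at crawl time

def deserves_follow_up(t: Tweet, min_views: int = 10_000, min_likes: int = 100) -> bool:
    """Cheap first-pass triage; the thresholds here are made-up examples."""
    return t.views >= min_views or t.likes >= min_likes
```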
So we did what every engineering team does when faced with a thousand repetitive clicks. We automated it.
And for a long time, our automation was fine. Not great. Just fine.
Where the First Version Broke
Our first crawler was a straightforward Actionbook script running in local mode. Open a batch of tabs, visit each IFTTT-provided link, extract the content and metrics, move on. On paper, we could push it to 30 concurrent tabs on a single machine. In practice, it hit Twitter's rate limit almost immediately.
Anything above a handful of concurrent requests started coming back with empty pages, challenge screens, or outright blocks. Dropping concurrency to keep the crawler stable meant a full run took around thirty minutes. By the time the data landed, standup was halfway over.
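For context, the first version was conceptually no more than the loop below. The sketch uses Playwright rather than Actionbook's own API (which we are not reproducing here), and the selector is a guess that Twitter can and does break; it is meant to show the shape of the problem, not the production script.

```python
import asyncio
from playwright.async_api import async_playwright

CONCURRENCY = 5  # anything higher tripped the rate limiter from our single exit IP

async def scrape_one(context, url, sem):
    async with sem:
        page = await context.new_page()
        try:
            await page.goto(url, wait_until="domcontentloaded")
            # Selector is illustrative; Twitter's markup changes frequently.
            content = await page.inner_text("article")
            return {"url": url, "content": content}
        finally:
            await page.close()

async def run(urls):
    sem = asyncio.Semaphore(CONCURRENCY)
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        results = await asyncio.gather(*(scrape_one(u_url, url, sem)
                                         for u_url, url in ((context, u) for u in urls)))
        await browser.close()
    return results
```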
We tried the usual bag of tricks. Jittered delays. Rotating user agents. Residential proxies. Each one added maintenance cost and the payoff kept shrinking.
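For the curious, "the usual bag of tricks" looked roughly like this: a random delay before each navigation and a fresh browser context with a rotated user agent. The user-agent strings and the delay window below are placeholders.

```python
import asyncio
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",           # truncated placeholders
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
]

async def polite_context(browser):
    # Jittered delay so requests do not land in a perfectly regular rhythm.
    await asyncio.sleep(random.uniform(2.0, 8.0))
    # Fresh context with a rotated user agent (standard Playwright option).
    return await browser.new_context(user_agent=random.choice(USER_AGENTS))
```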
Eventually we stopped and asked the obvious question. What is actually limiting us here?
It was not CPU. It was not bandwidth. It was not even the script.
It was the single exit IP.
Every request we made went out through the same door. Twitter was not rate-limiting our machine. It was rate-limiting anyone knocking from that address. No amount of local optimization could fix a problem that existed at the network edge.
The only real way forward was to move to cloud browsers, so that every request could go out from a different IP.
Enter --provider
Around this time, Actionbook shipped exactly what we needed. A --provider flag that lets you delegate browser sessions to different cloud browser services. Today it supports three backends: Driver, HyperBrowser, and BrowserUse.
What matters is not which three providers are on the list. What matters is that you can switch between them by changing a single flag, without touching the script. That means we can run the same crawler across multiple providers in parallel, and each provider brings its own pool of IPs.
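In practice, the orchestration layer only ever edits one argument. The invocation below is a hedged sketch: only the --provider flag and the three backend names come from what is described above; the script name and the rest of the command line are placeholders for whatever your crawler entry point looks like.

```python
import subprocess

PROVIDERS = ["driver", "hyperbrowser", "browseruse"]  # casing may differ in the real CLI

def launch_crawler(provider: str, url_file: str) -> subprocess.Popen:
    # Same script, same arguments; only the --provider value changes per run.
    return subprocess.Popen(
        ["actionbook", "crawler.js", "--provider", provider, "--urls", url_file]
    )

procs = [launch_crawler(p, f"slice_{i}.txt") for i, p in enumerate(PROVIDERS)]
for proc in procs:
    proc.wait()
```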
The New SOP: Spreading the Load Across Three Providers
Here is where the economics get interesting. Each of these cloud browser providers offers a free tier. Individually, none of them is generous enough to handle our full daily volume. Together, running in parallel, they comfortably are.
So we designed the pipeline around that observation.
At 7 AM, a cron job ingests every IFTTT email from the overnight inbox and extracts the Twitter link embedded in each one. Today that yields around a thousand URLs. The list gets split into three roughly equal slices, one for each provider.
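A hedged sketch of that ingestion step, assuming a plain IMAP inbox and that the tweet URL appears somewhere in the email body; the host, credentials, and folder are placeholders.

```python
import email
import imaplib
import re

TWEET_RE = re.compile(r"https?://(?:twitter|x)\.com/\S+/status/\d+")

def collect_overnight_urls(host: str, user: str, password: str) -> list[str]:
    urls: set[str] = set()
    imap = imaplib.IMAP4_SSL(host)
    imap.login(user, password)
    imap.select("INBOX")
    _, data = imap.search(None, "UNSEEN")          # overnight IFTTT notifications
    for num in data[0].split():
        _, msg_data = imap.fetch(num, "(RFC822)")
        msg = email.message_from_bytes(msg_data[0][1])
        for part in msg.walk():
            payload = part.get_payload(decode=True)
            if payload:
                urls.update(TWEET_RE.findall(payload.decode(errors="ignore")))
    imap.logout()
    return sorted(urls)

def split_into_slices(urls: list[str], n: int = 3) -> list[list[str]]:
    # Round-robin split into n roughly equal slices, one per provider.
    return [urls[i::n] for i in range(n)]
```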
For each slice, we open a single cloud browser session on the corresponding provider. Inside that session, we drive 10 tabs concurrently, each one visiting its assigned tweet URL and pulling the content, view count, and like count.
Three providers running in parallel. One session each. Ten tabs per session. Thirty real requests in flight at any moment, but split across three completely independent IP pools. If Twitter throttles one provider's egress, the other two keep going without even noticing.
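Per slice, the work inside a session looks roughly like the sketch below. Actionbook hides the provider-specific wiring, so this version fakes it with Playwright's connect_over_cdp against whatever remote endpoint a provider hands back; the endpoint, the selector, and the omitted metric parsing are all assumptions, not the real integration.

```python
import asyncio
from playwright.async_api import async_playwright

TABS_PER_SESSION = 10

async def scrape_slice(cdp_endpoint: str, urls: list[str]) -> list[dict]:
    sem = asyncio.Semaphore(TABS_PER_SESSION)
    results: list[dict] = []

    async with async_playwright() as p:
        # One remote cloud-browser session; in our pipeline Actionbook does this hookup.
        browser = await p.chromium.connect_over_cdp(cdp_endpoint)
        context = browser.contexts[0] if browser.contexts else await browser.new_context()

        async def one(url: str) -> None:
            async with sem:
                page = await context.new_page()
                try:
                    await page.goto(url, wait_until="domcontentloaded")
                    results.append({
                        "url": url,
                        # Selector-based extraction is illustrative and brittle.
                        "content": await page.inner_text("article"),
                        # View and like counts would be parsed here too; the exact
                        # selectors change often, so they are omitted from the sketch.
                    })
                finally:
                    await page.close()

        await asyncio.gather(*(one(u) for u in urls))
        await browser.close()
    return results
```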
From the crawler's point of view, none of this complexity exists. Actionbook abstracts away the differences between Driver, HyperBrowser, and BrowserUse. We wrote one scraper. The orchestrator decides which provider gets which slice, spins up the sessions, and collects the results.
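The fan-out itself then reduces to a few lines, assuming a scrape_slice coroutine like the sketch above and some way to obtain each provider's session endpoint; open_session below is a hypothetical stand-in for whatever does that in your setup.

```python
import asyncio

PROVIDERS = ["driver", "hyperbrowser", "browseruse"]

async def run_pipeline(urls: list[str]) -> list[dict]:
    slices = [urls[i::len(PROVIDERS)] for i in range(len(PROVIDERS))]
    # open_session(provider) is a stand-in for however a cloud-browser endpoint is
    # obtained for that provider (Actionbook handles this for us in practice).
    endpoints = [await open_session(p) for p in PROVIDERS]
    batches = await asyncio.gather(
        *(scrape_slice(ep, sl) for ep, sl in zip(endpoints, slices))
    )
    return [row for batch in batches for row in batch]
```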
Once everything lands, the pipeline runs a summarization pass over the collected content and applies our relevance filters. The thousand raw URLs collapse into about fifty tweets that genuinely deserve attention, and those show up in the morning briefing channel before anyone walks into standup. The runtime dropped from around thirty minutes to about five, and rate limit errors effectively went to zero.
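The filtering stage is deliberately boring. A minimal sketch, reusing the Tweet records from earlier; the keyword list and thresholds are ours, not anything canonical, and the summarization pass itself is omitted.

```python
def morning_briefing(tweets: list[Tweet], keywords: list[str], limit: int = 50) -> list[Tweet]:
    """Collapse the raw crawl down to the handful of posts worth reading."""
    def relevant(t: Tweet) -> bool:
        text = t.content.lower()
        return any(k.lower() in text for k in keywords) and deserves_follow_up(t)

    # Keep the most-viewed relevant posts; ~50 is what survives on a typical day.
    return sorted(filter(relevant, tweets), key=lambda t: t.views, reverse=True)[:limit]
```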
Closing Thought
This Twitter pipeline is just one example. Once you have Actionbook's --provider flag combined with isolated sessions across different cloud browsers, a lot of workflows that used to feel impractical suddenly become straightforward.
Anything that needs high-volume access to the same domain, anything that needs a clean session per task, anything that needs to sidestep the limits of a single local machine, all of it fits naturally into this pattern.