Zil Norvilis
Scraping at Warp Speed: Parallelism vs. Concurrency

The 100,000 URL Problem

You’ve been tasked with scraping a massive product catalog of 100,000 pages. You write a clean Ruby script using the http.rb and Nokogiri gems. It takes about 2 seconds to fetch and process one page.

  • Sequential time: ~55 hours.
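The back-of-the-envelope math behind that estimate:

```ruby
# 100,000 pages at ~2 seconds each, processed one at a time
pages = 100_000
seconds_per_page = 2
total_hours = (pages * seconds_per_page) / 3600.0
puts format("%.1f hours", total_hours) # => "55.6 hours"
```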

That’s unacceptable. You need to leverage your computer’s hardware to do multiple things at once. In the Ruby world, this brings us to the classic debate: Parallelism vs. Concurrency.

While these terms are often used interchangeably, they are technically very different—especially because of Ruby’s GVL (Global VM Lock).


1. Concurrency: The Art of Waiting (Threads)

Web scraping is primarily an I/O-bound task. Your script spends 90% of its time doing nothing—just waiting for a server in a data center halfway across the world to send back some HTML.

Concurrency is about dealing with lots of things at once. In Ruby, we use Threads for this. While one thread is waiting for a network response, the Ruby scheduler switches to another thread to start a new request.

require 'http'

urls = ["https://site.com/p1", "https://site.com/p2", ...] # 100k URLs
queue = Queue.new
urls.each { |url| queue << url }

threads = 10.times.map do
  Thread.new do
    loop do
      url = begin
        queue.pop(true) # non-blocking pop
      rescue ThreadError
        break # queue is empty, this worker is done
      end
      response = HTTP.get(url)
      puts "Fetched #{url}"
    end
  end
end

threads.each(&:join)
  • Pro: Low memory overhead. Excellent for network-heavy tasks.
  • Con: Because of the GVL, only one thread can execute Ruby code at a time. The GVL is released during blocking I/O (which is why threads help with network waits at all), but if your "parsing" logic (Nokogiri) is CPU-heavy, threads won't help as much as you think.
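You can see the GVL's effect with a quick benchmark sketch: on CRuby, threading pure CPU work gives roughly no speedup.

```ruby
require 'benchmark'

# CPU-bound work: under the GVL, threads cannot run this in parallel.
cpu_work = -> { 2_000_000.times { Math.sqrt(rand) } }

sequential = Benchmark.realtime { 4.times { cpu_work.call } }
threaded   = Benchmark.realtime do
  4.times.map { Thread.new(&cpu_work) }.each(&:join)
end

puts "sequential: #{sequential.round(2)}s"
puts "threaded:   #{threaded.round(2)}s"
# On CRuby the two numbers come out roughly the same:
# the GVL serializes CPU work across threads.
```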

2. Parallelism: The Brute Force (Processes)

Parallelism is about doing many things at the same time. This requires multiple CPU cores. In Ruby, we achieve this by forking Processes.

Each process has its own memory space and its own GVL. This means they can truly run Ruby code (like heavy HTML parsing) at the exact same time.

# gem install parallel
require 'parallel'
require 'http'
require 'nokogiri'

urls = [...] # 100k URLs

# 8 worker processes; omit in_processes to default to your CPU core count
Parallel.each(urls, in_processes: 8) do |url|
  doc = Nokogiri::HTML(HTTP.get(url).body.to_s)
  # Heavy parsing logic here...
  puts "Processed #{url}"
end
  • Pro: Real use of multi-core CPUs. Best for heavy parsing/data processing.
  • Con: High memory usage. Each process copies the entire application (though Copy-on-Write helps, it still adds up).

3. The New Wave: Async & Fibers

In modern Ruby (3.x+), we have a third option: Non-blocking I/O with Fibers. Using the async gem, you can handle thousands of concurrent connections on a single thread with almost zero overhead.

# gem install async-http
require 'async'
require 'async/http/internet'

urls = [...] # 100k URLs

Async do
  internet = Async::HTTP::Internet.new

  urls.each do |url|
    Async do
      response = internet.get(url)
      response.read # drain the body so the connection can be reused
      puts "Async fetched #{url}"
    end
  end
ensure
  internet.close
end
  • Pro: Incredible scalability. You can have 5,000 active "requests" waiting at once without the memory cost of 5,000 threads.
  • Con: Requires a different way of thinking about code (the reactor pattern).

Which one should you choose?

| If your scraper... | Use this... | Why? |
| --- | --- | --- |
| Is just waiting for the network | Threads | Simple to implement, low RAM usage. |
| Does heavy Nokogiri/Regex parsing | Parallel (Processes) | Bypasses the GVL to use all CPU cores. |
| Needs to hit 10,000+ URLs ASAP | Async (Fibers) | The most efficient way to handle massive I/O. |
| Is limited by your RAM | Threads or Async | Processes will crash your server if you spawn too many. |

The "Warp Speed" Secret: The Hybrid Approach

Production-grade scrapers use a Hybrid Approach:

  1. Use Async/Threads to fetch the HTML (The I/O part).
  2. Push the raw HTML into a Background Job (Sidekiq).
  3. Sidekiq workers (which are separate processes) handle the Parsing (The CPU part).

This separation of concerns ensures that a slow network doesn't block your CPU, and a heavy CPU task doesn't stop your network fetchers.
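As a toy sketch of that pipeline shape (everything in one process for illustration: a `Queue` stands in for Sidekiq/Redis, threads stand in for separate worker processes, and a regex stands in for Nokogiri):

```ruby
# Stage 1 feeds raw HTML into a queue; Stage 2 workers parse it.
html_queue = Queue.new
parsed = Queue.new

# Stage 1: fetchers (I/O-bound). Canned HTML stands in for HTTP.get.
pages = {
  "https://site.com/p1" => "<title>Widget A</title>",
  "https://site.com/p2" => "<title>Widget B</title>"
}
fetchers = pages.map do |url, html|
  Thread.new { html_queue << [url, html] }
end
fetchers.each(&:join)
html_queue.close # a closed queue returns nil once drained

# Stage 2: parsing workers (CPU-bound).
workers = 2.times.map do
  Thread.new do
    while (job = html_queue.pop)
      url, html = job
      parsed << [url, html[%r{<title>(.*?)</title>}, 1]]
    end
  end
end
workers.each(&:join)

results = []
results << parsed.pop until parsed.empty?
results.each { |url, title| puts "#{url} -> #{title}" }
```

In a real deployment, Stage 1 would call `SomeWorker.perform_async(url, raw_html)` instead of pushing to a local queue, and Sidekiq would fan the parsing out across processes.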

Summary

Speed in scraping isn't just about doing things fast; it's about doing things efficiently.

  • Stop waiting: Use Threads.
  • Stop bottlenecking: Use Processes.
  • Scale to the moon: Use Async.

Pick the right tool for your specific bottleneck, and watch your 55-hour execution time melt away into minutes.


Are you hitting a performance wall in your Ruby scrapers? Share your benchmark stats in the comments! 👇
