Neetin Singh Negi

Posted on Jul 5

How I Built a High-Performance Browser Image Processing Pipeline with Web Workers and WebAssembly

#webassembly #performance #javascript #webdev

A deep dive into worker pools, zero-copy transfers, SharedArrayBuffer, scheduling, and the engineering decisions behind a browser-native image processing engine.

Introduction

In my previous article, I explained how I replaced an image-processing backend with WebAssembly and moved the entire optimization pipeline into the browser.

Many readers asked the same question afterward:

"How do you process dozens of large images in parallel without freezing the browser?"

The answer isn't WebAssembly.

It isn't libvips.

And surprisingly, it isn't image compression either.

The hardest part of the entire project wasn't image compression—it was building a worker pool that could process large batches efficiently while keeping memory usage under control.

A naïve implementation quickly runs into problems:

Too many workers compete for CPU.
Decoded images consume far more memory than their file size suggests.
Aggressive parallelism can make the browser unresponsive.

This article is a deep dive into how I designed a browser-native processing pipeline using Web Workers, SharedArrayBuffer, task scheduling, and zero-copy memory transfers.

The Problem: Browser-Native Processing Doesn't Scale Automatically

Processing a single image inside the browser is surprisingly straightforward.

Most modern browsers can easily decode an image, run it through a WebAssembly module, and return the optimized result.

The challenge begins when users stop uploading a single image.

Real-world image optimization tools are rarely used one file at a time. More often, users drag an entire folder into the browser and expect dozens of high-resolution images to begin processing immediately.

That's where browser-native processing becomes much more complicated.

Large images may occupy only a few megabytes on disk, but after decoding they can consume hundreds of megabytes of memory. At the same time, image encoding is computationally expensive, and users still expect the interface to remain responsive while progress updates, previews, and downloads continue to work smoothly.

The obvious solution might seem to be creating more Web Workers.

Unfortunately, that usually makes the problem worse.

More workers mean more decoded images in memory, higher CPU contention, additional garbage collection pressure, and an increased risk of exhausting the browser's available heap.

The challenge isn't simply processing images in parallel.

The real challenge is deciding how much work should run simultaneously, which images should run first, and how memory should be managed while everything is happening.

That realization completely changed the architecture of my application.

Instead of building "an image compressor," I ended up building a scheduling system.

Figure 1. Processing one image is easy. Processing large batches requires scheduling, controlled concurrency, and careful memory management.

Why a Single Worker Isn't Enough

Image optimization tools are rarely used on a single image. More often, users drag an entire folder of photos into the browser and expect everything to be processed at once.

At first glance, the solution seems obvious: either process images one by one, or spin up a worker for each image. In practice, neither approach works well in the browser.

Let's look at both extremes and why we need something smarter.

Figure 2. Neither sequential processing nor unlimited parallelism scales well. Efficient browser-native image processing requires controlled concurrency through a worker pool and intelligent task scheduling.

Overall Pipeline

Once I realized that browser-native image processing was really a scheduling problem rather than a compression problem, the overall architecture became much clearer.

Instead of sending every uploaded image directly to a worker, each image moves through a series of stages designed to maximize throughput while keeping memory usage predictable and the browser responsive.

The complete pipeline looks like this.

Figure 3. Every uploaded image passes through a scheduler before reaching the worker pool. This allows the application to control concurrency, minimize memory pressure, and process images efficiently without blocking the main thread.

Images

Every uploaded image is first converted into a task and placed into a processing queue. Rather than immediately assigning work to a Web Worker, the application waits until resources are available.

This small design decision gives the scheduler complete control over the workload.

Task Scheduler

The scheduler acts as the brain of the entire system.

Instead of simply processing images in the order they arrive, it decides:

Which image should run next.
Which worker should receive the task.
How many images can safely run in parallel.
Whether heavy images should wait while smaller images complete first.

This prevents a handful of very large images from blocking an entire batch.

Worker Pool

Once a task is selected, it is assigned to an available worker from a fixed-size worker pool.

Each worker runs independently on its own thread, allowing multiple images to be processed simultaneously without blocking the browser's main UI thread.

Because workers are long-lived, the expensive WebAssembly runtime is initialized once per worker instead of once per image. This allows the scheduler to dispatch new tasks almost immediately, rather than repeatedly downloading, initializing, and configuring the runtime.

Rather than creating and destroying workers continuously, the scheduler simply assigns new tasks to workers that have become idle.

SharedArrayBuffer

Large image buffers are transferred efficiently between JavaScript and WebAssembly using shared memory.

Reducing unnecessary allocations keeps memory usage stable and significantly lowers garbage collection pressure during large batch operations.

WebAssembly + libvips

This is where the heavy work happens.

A WebAssembly build of libvips performs decoding, resizing, compression, format conversion, and encoding directly inside the browser.

The processing engine is the same class of native library commonly used on backend servers—except it's now running entirely on the client.

Output

Once processing finishes, the optimized image is returned to the React application, where users can preview, download, or optionally upload it to cloud storage.

At no point does the image need to pass through a backend server.

This architecture shifts the browser from being a simple user interface into a complete image-processing runtime.

Designing the Worker Pool

Building a worker pool sounds straightforward until you start thinking about everything that can go wrong.

Workers aren't simply "running" or "idle."

In production they constantly move between different states.

Figure 4. The scheduler continuously monitors worker availability and assigns new tasks only to idle workers, ensuring efficient resource utilization without oversubscribing the browser.

Each worker can be:

Idle – waiting for work.
Busy – actively processing an image.
Timed Out – taking longer than expected.
Failed – encountered an unexpected runtime error.
Restarting – being recreated after a failure.

Managing these state transitions turned out to be just as important as image compression itself.
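To make the state list concrete, here is a minimal sketch of how those five states could be modeled as a small state machine. The state names mirror the list above, but the transition table itself is an assumption for illustration, not the exact rules used in the application.

```typescript
// Hypothetical state names mirroring the list above.
type WorkerState = "idle" | "busy" | "timedOut" | "failed" | "restarting";

// Assumed transition table: which states may follow which.
const TRANSITIONS: Record<WorkerState, readonly WorkerState[]> = {
  idle: ["busy"],
  busy: ["idle", "timedOut", "failed"],
  timedOut: ["restarting"],
  failed: ["restarting"],
  restarting: ["idle", "failed"],
};

// Returns true when moving from `from` to `to` is allowed by the table.
function canTransition(from: WorkerState, to: WorkerState): boolean {
  return TRANSITIONS[from].includes(to);
}

// Applies a transition, or throws so that illegal moves surface early.
function transition(from: WorkerState, to: WorkerState): WorkerState {
  if (!canTransition(from, to)) {
    throw new Error(`Illegal worker transition: ${from} -> ${to}`);
  }
  return to;
}
```

Keeping the allowed transitions in one table makes it harder for a worker to end up, for example, busy and restarting at the same time.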

Instead of creating a new Web Worker for every uploaded image, I initialize a fixed-size pool when the application starts.

Those workers stay alive for the lifetime of the session and continuously receive new tasks from the scheduler.

This approach has several advantages:

The WebAssembly runtime is loaded only once per worker.
Memory allocations are reused instead of recreated.
Worker startup overhead disappears after initialization.
Browser resources remain predictable even for very large batches.

Assigning Work

Whenever a worker finishes processing an image, it immediately requests another task from the scheduler.

The scheduler simply finds the next available image and dispatches it to the newly idle worker.

That continuous cycle keeps every worker busy without overwhelming the browser.

Because only idle workers receive new work, concurrency always remains under control regardless of how many images users upload.
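The dispatch cycle described above can be sketched roughly as follows. This is a simplified illustration, not the production scheduler: MiniPool, Slot, Task, and WorkerLike are hypothetical names, and WorkerLike is a structural stand-in for a real Worker so the sketch can run without a browser.

```typescript
// Structural stand-in for a Web Worker (assumption: only postMessage is needed here).
interface WorkerLike {
  postMessage(message: unknown): void;
}

interface Task {
  id: number;
  payload: unknown;
}

interface Slot {
  worker: WorkerLike;
  currentTaskId: number | null;
}

class MiniPool {
  private readonly slots: Slot[];
  private readonly queue: Task[] = [];

  constructor(workers: WorkerLike[]) {
    // The pool size is fixed by the number of workers passed in.
    this.slots = workers.map((worker) => ({ worker, currentTaskId: null }));
  }

  // New work always goes through the queue first.
  enqueue(task: Task): void {
    this.queue.push(task);
    this.pump();
  }

  // Call this when a worker reports that a task has finished.
  complete(taskId: number): void {
    const slot = this.slots.find((s) => s.currentTaskId === taskId);
    if (!slot) return;
    slot.currentTaskId = null;
    this.pump();
  }

  get activeCount(): number {
    return this.slots.filter((s) => s.currentTaskId !== null).length;
  }

  get pendingCount(): number {
    return this.queue.length;
  }

  // Hand queued tasks to idle slots only; busy slots are never touched.
  private pump(): void {
    for (const slot of this.slots) {
      if (slot.currentTaskId !== null) continue;
      const next = this.queue.shift();
      if (!next) return;
      slot.currentTaskId = next.id;
      slot.worker.postMessage(next.payload);
    }
  }
}
```

With two workers and five queued tasks, two tasks run and three wait, no matter how quickly tasks are added. That is the property that keeps concurrency bounded.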

Handling Failures

Production systems need to assume that failures will happen.

A corrupted image, an unexpected WebAssembly error, or a browser limitation should never stall the entire pipeline.

Each task is therefore assigned a timeout.

If a worker stops responding:

The task is marked as failed.
The worker is terminated.
A replacement worker is created.
Remaining tasks continue processing normally.

This fault-tolerant design prevents a single bad image from affecting the rest of the batch.

Zero-Copy Transfers

Once multiple workers were processing images in parallel, another performance problem became obvious.

Moving large image buffers between the main thread and workers wasn't free.

Every unnecessary memory copy increases allocation pressure, consumes additional RAM, and creates more work for the browser's garbage collector. For multi-megabyte images, those costs add up surprisingly quickly.

Instead of copying image data into a worker, I transfer ownership of the underlying ArrayBuffer.

private assignTask(slot: WorkerSlot, task: TaskRecord): void {
  if (slot.dead) {
    this.taskQueue.unshift(task);
    return;
  }

  slot.currentTaskId = task.id;

  if (slot.timeoutId) clearTimeout(slot.timeoutId);

  slot.timeoutId = setTimeout(() => {
    this.failSlot(
      slot,
      new Error("Local processing timed out. Try a smaller image or reload the page."),
    );
  }, task.timeoutMs);

  try {
    // Zero-copy transfer of ArrayBuffer into the worker.
    slot.worker.postMessage(task.request, [task.request.buffer]);
  } catch (error) {
    this.failSlot(slot, error instanceof Error ? error : new Error("Failed"));
  }
}

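The second argument to postMessage() is a transfer list: ownership of each listed ArrayBuffer moves to the worker instead of being copied, and the sender's reference becomes detached (its byteLength drops to 0).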

For large batches, this significantly reduces memory usage and improves responsiveness.
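The effect of a transfer is easy to observe in isolation. The sketch below uses structuredClone with a transfer list as a stand-in for postMessage, since both follow the same transfer rules; this is an assumption made so the example runs without a worker (it needs a runtime that provides structuredClone, such as a current browser or Node 17+). The helper name moveBuffer is hypothetical.

```typescript
// Moves a buffer the same way a postMessage transfer list would:
// the returned buffer owns the bytes, the input is detached.
function moveBuffer(buffer: ArrayBuffer): ArrayBuffer {
  return structuredClone(buffer, { transfer: [buffer] });
}

// 4 MiB of hypothetical pixel data.
const source = new ArrayBuffer(4 * 1024 * 1024);
new Uint8Array(source)[0] = 255;

const moved = moveBuffer(source);

console.log(source.byteLength); // 0 (detached on the sending side)
console.log(moved.byteLength); // 4194304
console.log(new Uint8Array(moved)[0]); // 255
```

One consequence worth planning for: once a buffer has been transferred, the sending side can no longer read it. A task that may need to be retried therefore has to keep its own copy or re-read the bytes from the original File.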

Figure 5. Instead of copying image data between threads, ownership of the ArrayBuffer is transferred directly to the worker, eliminating unnecessary memory allocations.

Why SharedArrayBuffer Matters

Passing messages between workers is straightforward.

Sharing memory between workers is considerably more powerful.

Without shared memory, every worker maintains its own independent allocations, which quickly increases overall memory consumption during large batch processing.

By enabling SharedArrayBuffer, JavaScript and the WebAssembly runtime can coordinate through a shared memory region instead of constantly allocating new buffers.

This reduces allocation overhead and allows the WebAssembly runtime to reuse memory much more efficiently.

The trade-off is deployment complexity.

Browsers only expose SharedArrayBuffer when the page is running in a Cross-Origin Isolated environment.

That requires enabling both:

Cross-Origin-Opener-Policy (COOP)
Cross-Origin-Embedder-Policy (COEP)

Without those headers, shared memory is disabled entirely, regardless of how the application is written.
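As a rough sketch of what that looks like in practice: the header values below are the standard ones for cross-origin isolation (COEP also accepts credentialless in browsers that support it), while the constant and helper names are hypothetical. How the headers are actually attached depends on your server or hosting platform, which is not covered here.

```typescript
// Response headers the page (and its worker scripts) must be served with.
const CROSS_ORIGIN_ISOLATION_HEADERS: Record<string, string> = {
  "Cross-Origin-Opener-Policy": "same-origin",
  "Cross-Origin-Embedder-Policy": "require-corp",
};

// Runtime check before attempting to create shared memory.
// In a page or worker, pass the global scope (for example `self`),
// which exposes the boolean `crossOriginIsolated`.
function canUseSharedMemory(scope: { crossOriginIsolated?: boolean }): boolean {
  return (
    scope.crossOriginIsolated === true &&
    typeof SharedArrayBuffer !== "undefined"
  );
}
```

Checking at runtime lets the application fall back to a non-shared configuration, or show a clear error, instead of failing somewhere inside WebAssembly initialization.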

Loading WebAssembly Only Once

Initializing a WebAssembly runtime is surprisingly expensive. Before a single image can be processed, the browser needs to download the module, instantiate the runtime, configure memory, detect runtime capabilities, and initialize libvips.

If every worker repeated that process for every task, startup latency would quickly dominate the overall processing time.

Instead, I lazily initialize the runtime and cache the resulting promise. The first task performs the initialization, while every subsequent task simply waits for the same promise to resolve. This ensures that WebAssembly is loaded only once per worker, regardless of how many images are processed.

let vipsPromise: Promise<VipsRuntime> | undefined;

async function getVips(): Promise<VipsRuntime> {
  if (vipsPromise) return vipsPromise;

  vipsPromise = (async () => {
    const memory = getSharedWasmMemory();
    const vipsEs6Url = `${ORIGIN}/wasm-vips/vips-es6.js`;

    const mod = await import(
      /* webpackIgnore: true */
      vipsEs6Url
    );

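    // The module may expose the factory as its default export or as the module itself.
    const factory = ((mod as { default?: unknown }).default ?? mod) as (
      options: Record<string, unknown>,
    ) => Promise<VipsRuntime>;

    if (supportsSimd()) {
      console.info("SIMD supported.");
    } else {
      console.warn("SIMD unavailable.");
    }

    const vips = await factory({
      wasmMemory: memory,
      mainScriptUrlOrBlob: vipsEs6Url,
      locateFile: (file: string) => `${ORIGIN}/wasm-vips/${file}`,
    });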

    vips.Cache?.maxMem?.(WASM_HEAP_MAX_BYTES);

    return vips;
  })();

  return vipsPromise;
}

Although the implementation is relatively small, it encapsulates several important performance optimizations:

Lazy initialization ensures the runtime is created only when it's actually needed.
Promise caching guarantees that multiple requests share the same initialization instead of creating duplicate WebAssembly instances.
SharedArrayBuffer-backed memory allows the runtime to work with shared memory instead of allocating separate heaps.
Dynamic imports keep the initial application bundle smaller by loading the WebAssembly runtime only when image processing begins.
SIMD detection records whether the browser supports WebAssembly SIMD, the additional CPU instructions that speed up image processing; in the snippet above the result is only logged.

This initialization happens only once per worker, but it has a significant impact on the overall user experience. By avoiding repeated runtime creation, the application can immediately begin processing the next image instead of repeatedly paying the cost of setting up WebAssembly.
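The snippet above calls getSharedWasmMemory() without showing it. A helper of that kind could look roughly like the sketch below. This is an assumption about its shape, not the actual implementation: the function name, parameters, and sizes are hypothetical. WebAssembly.Memory sizes are expressed in 64 KiB pages, and a shared memory must declare a maximum.

```typescript
const WASM_PAGE_BYTES = 64 * 1024;

// Creates a WebAssembly.Memory whose buffer is a SharedArrayBuffer.
// In browsers this only works when the page is cross-origin isolated.
function createSharedWasmMemory(
  initialBytes: number,
  maxBytes: number,
): WebAssembly.Memory {
  return new WebAssembly.Memory({
    initial: Math.ceil(initialBytes / WASM_PAGE_BYTES),
    maximum: Math.ceil(maxBytes / WASM_PAGE_BYTES),
    shared: true,
  });
}

// Hypothetical sizes: start at 16 MiB, allow growth up to 256 MiB.
const memory = createSharedWasmMemory(16 * 1024 * 1024, 256 * 1024 * 1024);
console.log(memory.buffer instanceof SharedArrayBuffer); // true
```

Setting an explicit maximum also gives the application a hard ceiling on how far the WebAssembly heap can grow, which fits the goal of predictable memory usage.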

The Real Bottleneck: Memory

When I started this project, I assumed CPU performance would be the biggest challenge.

Image compression is computationally expensive, so I expected most of my time would be spent optimizing encoder settings and reducing processing time.

I was wrong.

The real bottleneck wasn't CPU—it was memory.

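A JPEG that occupies only 10 MB on disk may require 200–300 MB of memory once it's decoded for processing, because decoded size depends on pixel dimensions (width × height × bytes per pixel), not on file size.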

That changes the problem completely.

Processing one image is usually straightforward.

Processing ten large images simultaneously can consume several gigabytes of memory surprisingly quickly.
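To make the arithmetic concrete, here is a small estimate of decoded size. The 8000 × 6000 dimensions are a hypothetical example, and the calculation assumes 8-bit RGBA (4 bytes per pixel) with no padding; real pipelines may hold additional intermediate buffers on top of this.

```typescript
const BYTES_PER_MIB = 1024 * 1024;

// Uncompressed size of a decoded bitmap: every pixel stores bytesPerPixel bytes.
function decodedBytes(
  width: number,
  height: number,
  bytesPerPixel: number = 4,
): number {
  return width * height * bytesPerPixel;
}

// Converts bytes to MiB, rounded to one decimal place.
function toMiB(bytes: number): number {
  return Math.round((bytes / BYTES_PER_MIB) * 10) / 10;
}

// Hypothetical 48-megapixel photo decoded to RGBA.
const single = decodedBytes(8000, 6000);
console.log(toMiB(single)); // 183.1
console.log(toMiB(single * 10)); // 1831.1
```

A single decoded frame of that size is already about 183 MiB, and ten of them held at once approach 1.8 GiB before any working copies are counted.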

This is where many browser-native image processing experiments begin to fail.

An aggressive worker pool might keep every CPU core busy, but it also increases:

Browser heap usage
Garbage collection pressure
Memory fragmentation
Risk of exhausting available memory

Eventually, the browser spends more time reclaiming memory than processing images.

Ironically, adding more workers can make the application slower instead of faster.

That realization completely changed my priorities.

Instead of maximizing throughput at all costs, I focused on keeping memory usage predictable.

Stable performance turned out to be far more valuable than maximum parallelism.

Figure 6. Compressed files are relatively small, but decoding them dramatically increases memory usage. Managing decoded images efficiently became the primary engineering challenge.

Why FIFO Scheduling Wasn't Good Enough

Once memory became the primary constraint, the scheduler became the most important component of the system.

A simple first-in, first-out queue sounds reasonable.

Until someone uploads twenty images where the first file is a 300 MB panorama.

Every smaller image waits behind that single task.

The browser appears frozen even though workers are available.

Instead, the scheduler estimates workload and separates tasks into different queues.

Small images finish quickly, giving users immediate feedback, while larger images continue processing in the background.
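One possible way to express that idea is a two-lane queue, sketched below. This is an illustrative policy under stated assumptions rather than the exact scheduler: LaneScheduler, the byte threshold, and the cap on in-flight heavy tasks are all hypothetical, and file size is used as a cheap proxy for workload.

```typescript
interface QueuedImage {
  id: string;
  fileBytes: number;
}

class LaneScheduler {
  private readonly light: QueuedImage[] = [];
  private readonly heavy: QueuedImage[] = [];
  private heavyInFlight = 0;
  private readonly heavyThresholdBytes: number;
  private readonly maxHeavyInFlight: number;

  constructor(heavyThresholdBytes: number, maxHeavyInFlight: number) {
    this.heavyThresholdBytes = heavyThresholdBytes;
    this.maxHeavyInFlight = maxHeavyInFlight;
  }

  // Route each image to a lane based on its estimated workload.
  add(image: QueuedImage): void {
    const lane =
      image.fileBytes >= this.heavyThresholdBytes ? this.heavy : this.light;
    lane.push(image);
  }

  // Called when a worker becomes idle. Small images go first; heavy images
  // are released only while the in-flight cap allows it.
  next(): QueuedImage | undefined {
    const small = this.light.shift();
    if (small) return small;
    if (this.heavyInFlight >= this.maxHeavyInFlight) return undefined;
    const large = this.heavy.shift();
    if (large) this.heavyInFlight += 1;
    return large;
  }

  // Called when a task completes (or fails) so the heavy cap is released.
  finished(image: QueuedImage): void {
    if (image.fileBytes >= this.heavyThresholdBytes && this.heavyInFlight > 0) {
      this.heavyInFlight -= 1;
    }
  }
}
```

Capping heavy tasks separately also limits how many very large decoded images can be in memory at the same time, independent of the pool size.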

The scheduler also adjusts concurrency based on available hardware, ensuring that lower-powered devices aren't overwhelmed while more capable machines can process additional work in parallel.
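A minimal sketch of how a pool size could be derived from the device is shown below. The heuristic is an assumption, not the application's actual formula: it reserves one core for the main thread, bounds the result by a memory budget, and applies a hard cap. In a browser, the first argument would typically come from navigator.hardwareConcurrency.

```typescript
// Picks a worker count bounded by CPU cores, a memory budget, and a hard cap.
function choosePoolSize(
  hardwareConcurrency: number | undefined,
  memoryBudgetBytes: number,
  perTaskBytes: number,
  hardCap: number = 4,
): number {
  // Fall back to 2 cores when the browser does not report a value.
  const cores =
    hardwareConcurrency && hardwareConcurrency > 0 ? hardwareConcurrency : 2;
  // Leave one core free so the UI thread stays responsive.
  const byCpu = Math.max(1, cores - 1);
  // Never schedule more tasks than the memory budget can hold.
  const byMemory = Math.max(1, Math.floor(memoryBudgetBytes / perTaskBytes));
  return Math.min(byCpu, byMemory, hardCap);
}

const MIB = 1024 * 1024;
console.log(choosePoolSize(8, 1024 * MIB, 256 * MIB)); // 4
console.log(choosePoolSize(2, 1024 * MIB, 256 * MIB)); // 1
```

Taking the minimum of the three limits means a many-core machine with little memory headroom still ends up with a small pool.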

This simple change dramatically improved perceived performance.

Users no longer had to wait for the largest image before seeing progress.

Instead, optimized images begin appearing almost immediately, making the application feel significantly faster even when total processing time remains similar.

Fault Recovery

Building a fast processing pipeline is only half the problem. It also needs to recover gracefully when something goes wrong.

In practice, workers don't always complete successfully. A corrupted image, an unexpected runtime error, or even a browser-specific issue can leave a worker stuck indefinitely. If a single worker hangs, it can stall the entire processing queue.

Failures are inevitable. Hanging forever isn't.

To prevent that, every task is assigned a timeout when it's dispatched to a worker.

If the timeout expires before the worker returns a result, the scheduler assumes the worker is no longer healthy. The task is marked as failed, the worker is recycled, and a fresh worker takes its place.

This ensures that one problematic image doesn't block every other image waiting in the queue.

slot.timeoutId = setTimeout(() => {
  this.failSlot(
    slot,
    new Error(
      "Local processing timed out. Try a smaller image or reload the page."
    )
  );
}, task.timeoutMs);

The implementation itself is straightforward, but the impact on reliability is significant. The recovery strategy follows four simple steps:

Detect stalled workers with configurable timeouts.
Remove unhealthy workers from the pool.
Create replacement workers automatically.
Continue processing the remaining tasks.

This approach favors reliability over maximum throughput. In a browser environment, keeping the application responsive is often more valuable than squeezing out a few extra milliseconds of performance.
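The four steps above can be sketched as follows. The timeout snippet calls failSlot without showing its body, so this is an assumed shape rather than the actual implementation: RecoveringPool, the slot fields, and the injected createWorker factory are hypothetical, with WorkerLike standing in for a real Worker so the sketch runs outside a browser.

```typescript
// Structural stand-in for a Web Worker (assumption for portability).
interface WorkerLike {
  postMessage(message: unknown): void;
  terminate(): void;
}

interface Slot {
  worker: WorkerLike;
  currentTaskId: number | null;
  timeoutId: ReturnType<typeof setTimeout> | null;
  dead: boolean;
}

class RecoveringPool {
  readonly slots: Slot[] = [];
  readonly failures = new Map<number, Error>();
  private readonly createWorker: () => WorkerLike;

  constructor(size: number, createWorker: () => WorkerLike) {
    this.createWorker = createWorker;
    for (let i = 0; i < size; i++) this.slots.push(this.newSlot());
  }

  private newSlot(): Slot {
    return {
      worker: this.createWorker(),
      currentTaskId: null,
      timeoutId: null,
      dead: false,
    };
  }

  // Step 1: every dispatched task carries a timeout.
  run(slot: Slot, taskId: number, request: unknown, timeoutMs: number): void {
    slot.currentTaskId = taskId;
    slot.timeoutId = setTimeout(() => {
      this.failSlot(slot, new Error("Local processing timed out."));
    }, timeoutMs);
    slot.worker.postMessage(request);
  }

  failSlot(slot: Slot, error: Error): void {
    if (slot.dead) return; // already handled
    slot.dead = true;
    if (slot.timeoutId !== null) clearTimeout(slot.timeoutId);
    // Record the failure so the UI can report it for this image.
    if (slot.currentTaskId !== null) this.failures.set(slot.currentTaskId, error);
    // Step 2: remove the unhealthy worker.
    slot.worker.terminate();
    // Step 3: put a fresh worker in its place so the pool keeps its size.
    const index = this.slots.indexOf(slot);
    if (index !== -1) this.slots[index] = this.newSlot();
    // Step 4: the remaining queue keeps draining through the healthy slots.
  }
}
```

Replacing the slot in place keeps the pool at a constant size, so a failure never reduces capacity for the rest of the batch.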

Figure 7. If a worker becomes unresponsive, the scheduler automatically recovers by recycling the worker and continuing with the remaining tasks.

Results

Instead of focusing on compression ratios—which I covered in my previous article—I wanted to evaluate how the architecture behaved under sustained workloads.

The goal wasn't simply to compress images faster. It was to determine whether the browser could remain responsive while processing large batches of high-resolution images in parallel.

After introducing worker pooling, zero-copy transfers, shared memory, and dynamic scheduling, the difference was immediately noticeable.

Users no longer have to wait for an entire batch to finish before seeing results. As workers complete individual tasks, optimized images begin appearing almost immediately, making the application feel significantly more responsive.

The architectural improvements can be summarized like this:

Before	After
Sequential processing	Parallel worker pool
Frequent memory copies	Zero-copy ArrayBuffer transfers
Main thread blocked	Responsive UI
Fixed execution order	Dynamic task scheduling
Workers could stall indefinitely	Automatic timeout & recovery
High memory pressure	Controlled concurrency

None of these improvements came from changing the compression algorithm itself. They came from treating the browser like a runtime rather than just a user interface, and together they improved throughput, responsiveness, and overall stability.

The browser now behaves much more like a dedicated processing engine than a traditional web page.

Lessons Learned

When I started this project, I thought performance meant making image compression faster.

By the end, I realized performance is mostly about architecture.

The compression library was already highly optimized. My job wasn't to make libvips faster—it was to build a system that could use it efficiently inside the constraints of a browser.

That meant thinking less about algorithms and more about how work flows through the system.

A few architectural decisions ended up having a far greater impact than any micro-optimization:

Reusing workers instead of constantly creating new ones.
Eliminating unnecessary memory copies with transferable ArrayBuffers.
Sharing memory efficiently with SharedArrayBuffer.
Scheduling work instead of processing images strictly in arrival order.
Limiting concurrency based on available resources instead of maximizing parallelism.
Recovering automatically from stalled workers without interrupting the user.

Individually, none of these techniques are groundbreaking.

Together, they transformed the browser into a runtime capable of handling workloads that I previously assumed required a backend.

That was probably the biggest lesson from the entire project.

Modern browsers aren't just rendering engines anymore. They're increasingly capable application platforms—but getting the best performance out of them requires thinking like a systems engineer rather than a frontend developer.

Conclusion

In my previous article, I showed that modern browsers are capable of replacing an image-processing backend.

This article explored what it actually takes to make that architecture reliable in production.

Moving image processing into the browser isn't simply a matter of compiling native code to WebAssembly. It requires careful attention to worker pools, concurrency, memory management, scheduling, and fault tolerance. WebAssembly makes browser-native image processing possible, but it's the surrounding architecture that makes it practical.

As browser APIs continue to evolve, I expect more traditionally server-side workloads to move to the client.

The interesting question is no longer:

Can the browser do this?

It's becoming:

Does this feature really need a backend anymore?

📚 Browser-Native Image Processing Series

If you're interested in browser-native image processing, this article is part of a two-part series:

Part 1: How I Replaced My Image Processing Backend with WebAssembly

Learn why I moved image processing entirely into the browser and how WebAssembly made it possible.

Part 2: How I Built a High-Performance Browser Image Processing Pipeline with Web Workers and WebAssembly (You're here)

A deep dive into the worker pool, task scheduler, SharedArrayBuffer, zero-copy transfers, and fault recovery that make the architecture production-ready.
