A farmer stands in her field, phone in hand, recording a quick video of a sick tomato plant. The camera shakes. The sun creates harsh shadows. Her thumb accidentally covers the corner of the frame for three seconds.
That video contains maybe 900 frames. Maybe ten of them are actually usable for plant disease diagnosis. The other 890 are blurry, redundant, or partially obscured.
The question that drove weeks of development: how do you automatically find those ten good frames without uploading 900 to a server?
Why Client-Side Processing
The obvious architecture: upload the raw video, process it on the server with Python or FFmpeg, and send back extracted frames.
For my users, that architecture fails:
Bandwidth costs real money. On metered data plans common in rural Africa, uploading a 30MB video might cost more than the farmer earns that day. Extracting frames locally and uploading only the useful ones—maybe 500KB total—changes the economics entirely.
Networks are unreliable. A large upload over a 2G connection with frequent drops means failed transfers, wasted data, and frustrated users. Smaller uploads succeed more often.
Latency compounds. Upload time plus server processing time plus download time. On slow networks, this becomes intolerable. Processing locally eliminates the round-trip.
So everything happens in the browser. Image compression, video frame extraction, blur detection, frame selection—all client-side JavaScript using Canvas APIs.
Image Compression: The Foundation
Every image, regardless of source, goes through compression before upload. The target: 1024 pixels on the longest dimension, JPEG at 80% quality.
Why these specific numbers?
1024 pixels is the sweet spot for Claude Vision. Larger images don't improve diagnostic accuracy—Claude doesn't need to count individual pixels to identify disease symptoms. Smaller images lose the detail needed to spot early infections. I tested this extensively: 1024px captures everything diagnostically relevant.
80% JPEG quality is where compression artifacts become invisible to humans but file size drops dramatically. At 90% quality, files are 50% larger with no visible benefit. At 70%, subtle disease symptoms might be obscured by compression artifacts. 80% is the balance point.
The result: a 4000x3000 pixel PNG (roughly 15MB) becomes a 1024x768 JPEG (roughly 100KB). That's a 150x reduction in data transferred. On a slow network, that's the difference between a 30-second upload and a 2-second upload.
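For concreteness, here is a minimal sketch of that resize-and-re-encode step, assuming a browser with Canvas and createImageBitmap support. The function name and constants are illustrative, not the project's actual code.

```javascript
// A minimal sketch of the compression step; names are illustrative.
const MAX_DIMENSION = 1024;
const JPEG_QUALITY = 0.8;

async function compressImage(file) {
  // Decode the file into a bitmap without touching the DOM.
  const bitmap = await createImageBitmap(file);

  // Scale so the longest side is at most 1024px, preserving aspect ratio.
  const scale = Math.min(1, MAX_DIMENSION / Math.max(bitmap.width, bitmap.height));
  const width = Math.round(bitmap.width * scale);
  const height = Math.round(bitmap.height * scale);

  const canvas = document.createElement('canvas');
  canvas.width = width;
  canvas.height = height;
  canvas.getContext('2d').drawImage(bitmap, 0, 0, width, height);
  bitmap.close(); // release the decoded image memory promptly

  // Re-encode as JPEG at 80% quality.
  return new Promise((resolve, reject) => {
    canvas.toBlob(
      (blob) => (blob ? resolve(blob) : reject(new Error('Compression failed'))),
      'image/jpeg',
      JPEG_QUALITY
    );
  });
}
```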
Video Frame Extraction: The Hard Part
Videos are information-dense but mostly redundant. A 10-second clip at 30fps contains 300 frames, but probably only 5-10 are worth analyzing.
The naive approach—grab every Nth frame—fails in practice:
Blur clustering. If the camera moves at the 3-second mark, you get a blurry frame. The sharp frames at 2.8 and 3.2 seconds are skipped.
Redundancy. If the camera holds steady for 5 seconds, you might extract 2 nearly identical frames while missing coverage of different angles.
Temporal bias. Fixed intervals ignore content. The user might have shown three different angles at 5, 15, and 25 seconds. Fixed extraction at 0, 10, 20, 30 might miss all three.
The solution requires two innovations: blur detection to find sharp frames, and temporal filtering to ensure diversity.
Blur Detection: The Laplacian Trick
To find sharp frames, you need to measure sharpness. The computer vision community solved this decades ago with the Laplacian operator.
The intuition: sharp images have many edges—sudden transitions between light and dark. Blurry images have few edges because transitions are smoothed out.
The Laplacian operator computes, for each pixel, how different it is from its immediate neighbors. In a flat region (like a solid color), Laplacian values are near zero. At an edge (like a leaf vein against leaf tissue), values are high.
For each pixel, the calculation is: 4 * center - top - bottom - left - right. If the center pixel is similar to its neighbors, this equals zero. If the center pixel is dramatically different (an edge), this equals a large positive or negative number.
By computing the variance of Laplacian values across the entire image, you get a single "sharpness score." High variance means many edges, which means sharp. Low variance means few edges, which means blurry.
In testing, I found these thresholds:
- Variance above 500: reliably sharp, excellent for analysis
- Variance 100-500: acceptable, usable if nothing better available
- Variance below 100: too blurry, likely unusable
The calculation happens on downscaled frames (1024px max dimension) to keep processing fast. Full-resolution Laplacian computation would be too slow for real-time frame scoring on mobile devices.
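A minimal sketch of that sharpness score, assuming the frame has already been downscaled and drawn to a canvas (the function name is illustrative):

```javascript
// Laplacian variance as a sharpness score; illustrative sketch.
function laplacianVariance(canvas) {
  const { width, height } = canvas;
  const { data } = canvas.getContext('2d').getImageData(0, 0, width, height);

  // Convert RGBA pixels to a grayscale luminance array.
  const gray = new Float32Array(width * height);
  for (let i = 0; i < width * height; i++) {
    const r = data[i * 4], g = data[i * 4 + 1], b = data[i * 4 + 2];
    gray[i] = 0.299 * r + 0.587 * g + 0.114 * b;
  }

  // Apply the 4-neighbor Laplacian: 4*center - top - bottom - left - right.
  const lap = [];
  for (let y = 1; y < height - 1; y++) {
    for (let x = 1; x < width - 1; x++) {
      const i = y * width + x;
      lap.push(
        4 * gray[i] - gray[i - width] - gray[i + width] - gray[i - 1] - gray[i + 1]
      );
    }
  }

  // Variance of the Laplacian values: high means many edges, i.e. sharp.
  const mean = lap.reduce((sum, v) => sum + v, 0) / lap.length;
  return lap.reduce((sum, v) => sum + (v - mean) ** 2, 0) / lap.length;
}
```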
Temporal Diversity: Not Just The Sharpest
Selecting the 10 sharpest frames sounds right, but fails in practice. They often cluster in one time window when the camera happens to be stable.
For disease diagnosis, temporal diversity matters. The user naturally shifts perspective while recording—showing the top of the leaf, then the underside, then the stem. A diverse frame selection captures this variation.
The algorithm (sketched in code below):
- Score all candidate frames by sharpness (Laplacian variance)
- Sort by sharpness, highest first
- Select the sharpest frame
- Calculate minimum time gap: video duration divided by (desired frames × 2)
- Skip any frame too close in time to already-selected frames
- Select the next-sharpest that passes the temporal filter
- Repeat until you have enough frames
- Sort selected frames chronologically for display
For a 30-second video, selecting 5 frames enforces at least 3-second gaps between selections. The result: frames are both sharp AND temporally diverse, capturing different moments and angles from the recording.
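In code, the selection step might look like this sketch, where each candidate is assumed to carry a timestamp and a precomputed sharpness score (the object shape and function name are illustrative):

```javascript
// Greedy selection of sharp, temporally diverse frames; illustrative sketch.
function selectDiverseFrames(candidates, maxFrames, videoDuration) {
  const minGap = videoDuration / (maxFrames * 2); // e.g. 30s / (5 * 2) = 3s

  // Consider the sharpest frames first.
  const bySharpness = [...candidates].sort((a, b) => b.sharpness - a.sharpness);
  const selected = [];
  for (const frame of bySharpness) {
    if (selected.length >= maxFrames) break;
    // Skip any frame too close in time to an already-selected one.
    const tooClose = selected.some((s) => Math.abs(s.time - frame.time) < minGap);
    if (!tooClose) selected.push(frame);
  }

  // Present the result in recording order.
  return selected.sort((a, b) => a.time - b.time);
}
```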
The Complete Video Pipeline
When a user uploads a video, the pipeline runs these steps (a condensed code sketch follows below):
- Load metadata: Get duration and dimensions without loading the full video into memory
- Calculate sample points: Divide duration by max frames to get interval
- Seek and capture: For each sample point, seek the video element and draw the current frame to the canvas
- Score sharpness: Compute Laplacian variance for each captured frame
- Select best frames: Apply the temporal diversity filter to choose final frames
- Compress frames: Each selected frame goes through image compression (1024px, 80% JPEG)
- Return for analysis: Final frames ready for Claude Vision
The whole process typically takes 3-5 seconds for a 30-second video on a mid-range phone. Users see a progress indicator showing frames being extracted, then their selected frames are displayed for confirmation before analysis.
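Put together, the extraction loop looks roughly like the sketch below. It reuses the `laplacianVariance` and `selectDiverseFrames` sketches from earlier; the names, the 3x oversampling factor, and encoding frames to JPEG at capture time rather than after selection are simplifications of mine, not the project's exact code.

```javascript
// Condensed extraction loop; illustrative sketch built on the helpers above.
async function extractFrames(videoFile, maxFrames = 5) {
  const video = document.createElement('video');
  video.preload = 'auto';
  video.muted = true;
  video.src = URL.createObjectURL(videoFile);

  // Step 1: load metadata (duration, dimensions) without decoding the whole file.
  await new Promise((resolve, reject) => {
    video.onloadedmetadata = resolve;
    video.onerror = () => reject(new Error('Could not read video'));
  });

  // One reused canvas, downscaled to 1024px max dimension for fast scoring.
  const canvas = document.createElement('canvas');
  const scale = Math.min(1, 1024 / Math.max(video.videoWidth, video.videoHeight));
  canvas.width = Math.round(video.videoWidth * scale);
  canvas.height = Math.round(video.videoHeight * scale);
  const ctx = canvas.getContext('2d');

  // Steps 2-4: sample evenly, capture each frame, score its sharpness.
  const sampleCount = maxFrames * 3; // oversample so the diversity filter has choices
  const candidates = [];
  for (let i = 0; i < sampleCount; i++) {
    const time = (video.duration * (i + 0.5)) / sampleCount;
    await new Promise((resolve) => {
      video.onseeked = resolve;
      video.currentTime = time;
    });
    ctx.drawImage(video, 0, 0, canvas.width, canvas.height);
    const blob = await new Promise((resolve) =>
      canvas.toBlob(resolve, 'image/jpeg', 0.8)
    );
    candidates.push({ time, sharpness: laplacianVariance(canvas), blob });
  }

  URL.revokeObjectURL(video.src); // release the blob URL as soon as extraction ends

  // Steps 5-6: sharp, temporally diverse frames, already JPEG-compressed above.
  return selectDiverseFrames(candidates, maxFrames, video.duration);
}
```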
Handling Multiple Media Items
Users can upload multiple images or videos in a single session. The system needs to know: same plant from different angles, or different plants entirely?
For images, the UI asks directly. "Different Plants" produces N separate diagnoses. "Same Plant" produces one comprehensive diagnosis considering all evidence.
For video input, extracted frames are treated as "the same plant" by default. The user was recording continuously, so frames presumably show the same subject. This makes video the fastest path to a comprehensive diagnosis—point and record, get analysis.
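Illustratively, the grouping decision maps onto analysis requests something like this sketch; the request shape is a hypothetical stand-in, not the project's actual API:

```javascript
// Hypothetical request shape; illustrative only.
function buildAnalysisRequests(mediaItems, samePlant) {
  if (samePlant) {
    // Video frames, and images the user marks "Same Plant", share one request
    // so the model sees all the evidence together.
    return [{ images: mediaItems, comprehensive: true }];
  }
  // "Different Plants": one request, and one diagnosis, per image.
  return mediaItems.map((item) => ({ images: [item], comprehensive: false }));
}
```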
Error Handling: Fail Visibly
Real-world media processing fails constantly. Files are corrupted. Formats are unsupported. Memory runs out on old phones.
Unsupported format: Check MIME type before processing. Provide clear error listing supported formats (JPEG, PNG, GIF, WebP for images; MP4, WebM, QuickTime for video).
Corrupt file: Wrap processing in try-catch. If the image fails to load or the video fails to seek, provide specific feedback.
Processing failure: If compression fails, retry with lower quality settings. If that fails, skip the problematic file but continue with others.
Memory pressure: Process one item at a time rather than parallelizing. Release canvas references after each operation. Revoke blob URLs immediately after use. This is slower but prevents crashes on memory-constrained devices.
Unusable output: If blur detection determines all extracted frames are below the usability threshold, warn the user and suggest re-recording with a steadier hand or better lighting.
The principle: fail visibly, never silently. A user who sees "Video too blurry, try recording again in better light" can fix the problem. A user whose analysis silently uses garbage frames loses trust in the whole system.
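A sketch of how those checks might combine in a single wrapper; the message strings, the `compress` callback, and its quality parameter are assumptions for illustration:

```javascript
// Supported formats from the list above; the wrapper shape is illustrative.
const SUPPORTED_TYPES = [
  'image/jpeg', 'image/png', 'image/gif', 'image/webp',
  'video/mp4', 'video/webm', 'video/quicktime',
];

async function processMediaItem(file, compress) {
  // Unsupported format: reject up front with a clear, actionable message.
  if (!SUPPORTED_TYPES.includes(file.type)) {
    return {
      ok: false,
      error: 'Unsupported format. Please use JPEG, PNG, GIF, or WebP images, ' +
             'or MP4, WebM, or QuickTime video.',
    };
  }
  try {
    return { ok: true, blob: await compress(file, 0.8) };
  } catch (firstError) {
    try {
      // Processing failure: retry once with lower quality before giving up.
      return { ok: true, blob: await compress(file, 0.6) };
    } catch (secondError) {
      // Corrupt file or repeated failure: skip it, but tell the user why.
      return {
        ok: false,
        error: `Could not process ${file.name}. The file may be corrupted; ` +
               'please try another photo or video.',
      };
    }
  }
}
```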
Performance: Making It Work on Old Phones
The target device isn't the latest iPhone. It's a three-year-old Android phone with 2GB of RAM running Chrome. A few patterns keep the pipeline workable on that hardware (two are sketched in code below):
Canvas reuse: Creating a new canvas element for each operation is expensive. The pipeline reuses canvas elements across operations, clearing and resizing as needed.
Downscale first: Blur detection doesn't need full resolution. A 4000x3000 image downscaled to 1024x768 gives equally valid sharpness scores at roughly 1/15th the processing cost.
Progressive loading: For videos, metadata loads first (duration, dimensions), then frames extract one at a time with progress feedback. Users see activity immediately rather than waiting for complete processing.
Explicit cleanup: Large media can exhaust mobile browser memory. The pipeline explicitly nulls references after use. Video object URLs are revoked immediately after frame extraction. This prevents memory from accumulating across multiple uploads.
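Two of those patterns, canvas reuse and sequential processing with eager cleanup, might look roughly like this (helper names are illustrative):

```javascript
// One shared canvas for the whole pipeline; resizing it also clears it.
const sharedCanvas = document.createElement('canvas');

function drawScaled(source, width, height) {
  sharedCanvas.width = width;   // setting width/height resets the canvas
  sharedCanvas.height = height;
  sharedCanvas.getContext('2d').drawImage(source, 0, 0, width, height);
  return sharedCanvas;
}

// Process uploads one at a time: slower than Promise.all, but it keeps peak
// memory low on 2GB devices and lets each blob URL be revoked immediately.
async function processSequentially(files, processOne) {
  const results = [];
  for (const file of files) {
    const url = URL.createObjectURL(file);
    try {
      results.push(await processOne(url));
    } finally {
      URL.revokeObjectURL(url); // release the object URL right after use
    }
  }
  return results;
}
```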
What I Learned
Test with garbage input. Development used well-lit, centered photos taken by someone who knows what they're doing. Production receives shaky videos recorded while walking, photos with fingers partially covering the lens, and screenshots of screenshots. Robustness only emerged after I deliberately tried to break the system with the worst possible input.
Client-side processing is more feasible than expected. My initial assumption was that "real" image processing needed server resources. Modern browsers have Canvas APIs that handle common operations efficiently. Even blur detection—fundamentally a computer vision algorithm—runs fast enough on phone CPUs when you're smart about resolution.
Users prefer control over automation. Early versions automatically selected "best" frames without showing users what was selected. Users didn't trust it. Showing extracted frames and letting users confirm or re-record built confidence in the system. The extra step is worth the trust it creates.
Fail visibly. Silent failures are the worst. A processing error that shows "Something went wrong, try again" is infinitely better than one that silently produces garbage output. Users can recover from visible failures; invisible ones erode trust permanently.
The Invisibility Goal
The media pipeline is invisible when it works. Users upload a shaky video, see some frames appear, confirm their selection, and get a diagnosis. They don't think about Laplacian variance or temporal diversity or JPEG compression ratios.
That invisibility is the goal. The complexity exists so that users don't have to understand it. They just need to point their phone at a sick plant and get help.
Every technical decision—client-side processing, blur scoring, temporal filtering, compression ratios—serves that goal. Not because the techniques are elegant, but because they make the experience work for farmers with old phones, slow networks, and unsteady hands.
Technical complexity that serves users is engineering. Technical complexity that serves itself is self-indulgence. The media pipeline exists to turn chaos into clarity, invisibly.
This completes the Shamba-MedCare technical series.
The full system represents months of iteration across architecture, prompt engineering, accessibility, and media processing. The code is open source. The problems are documented. The opportunity—bringing agricultural AI to farmers who need it most—is massive.
If you're working on similar problems, I'd love to hear from you.