Building a Free OSHA Compliance Tool — 8 Weeks Solo

Ayush Gupta — Sat, 30 May 2026 10:13:05 +0000

Commercial workplace-safety software — Protex AI, Intenseye, and the rest — runs $500 to $2,000 a month. It watches camera feeds for PPE violations: a worker without a hard hat, a missing high-vis vest, no fall harness at height. The technology isn't exotic anymore. The price tag is.

So over eight weeks, solo, I built SafetyVision — an open-source PPE compliance monitor that does the core job for free and runs on $0 of infrastructure. Not a toy: a fine-tuned detection model, explainable predictions, OSHA-grounded incident reports, compliance forecasting, a documented API and SDK, and a one-command self-host. Three live surfaces, all free-tier.

▶ 3-minute walkthrough · Live app · GitHub

This is the story of the decisions that mattered — including the ones that didn't go to plan.

The product, in one breath

Upload a worksite photo. SafetyVision finds each worker, flags missing PPE in red ranked by risk, shows you why it flagged it (a GradCAM heatmap and SHAP attribution), writes an incident report citing the actual OSHA regulation, exports an audit-ready PDF, and forecasts the site's 7-day compliance trend. Every inspection is saved to your history.

It runs three ways: a Next.js web app on Vercel (the product), a no-signup Gradio demo on Hugging Face Spaces (the open-source try-it), and a serverless REST API on AWS Lambda (for developers). The same core powers all three.

The compromises in this project are about scale — free tiers, a small model, a modest training set — never about sophistication. Here's where the sophistication went.

The model, and an honest 0.763

Detection is a fine-tuned YOLOv8, exported to ONNX so it runs on a plain CPU — no GPU required for end users. Version 1 was YOLOv8*n* (nano), trained on ~58k images, landing at 0.701 mAP@50. Decent, but it had a clear weakness: it was biased toward frontal poses and missed workers seen from the side, the back, or partially occluded.

For v2 I went bigger — YOLOv8*s* (small), 80k+ images, and an aggressive Albumentations augmentation pipeline (random occlusion, brightness/contrast jitter, motion blur, mosaic) specifically to fight that frontal bias. The target was mAP@50 ≥ 0.78.

It landed at 0.763.

I could have buried that. Instead it's in the README, the model card, and the demo's closing line. Here's why: a recruiter or a safety officer evaluating this doesn't trust a project with no failure modes — they trust one that knows exactly where it's weak. v2 is a real improvement (Fall-Detected hits 0.956, hard hats 0.936), and the per-class breakdown shows precisely which classes still struggle (NO-Safety-Vest at 0.382). An honest 0.763 with a documented gap is worth more than a suspicious 0.78.

That became the project's organizing principle: the demo is curated, the model card is honest. The demo shows the best-case path because that's what every product demo does; the model card lists every failure mode because that's what every responsible model card does. Both exist on purpose.

Why two explainers, not one

Every detection ships with both a GradCAM heatmap and SHAP attribution. That's deliberate redundancy, and it's the feature I'm most attached to.

GradCAM answers "where did the model look?" — it paints a heatmap over the image so you can see it attended to the head region when it flagged a missing hard hat. It's spatial and immediately intuitive; a safety officer with no ML background gets it in two seconds.
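For readers who want the mechanics, the generic Grad-CAM computation fits in a few lines. This is the textbook formulation in plain Python with nested lists, not SafetyVision's actual code: the function name and the toy feature maps are illustrative assumptions, and a real pipeline would pull activations and gradients from a chosen convolutional layer of the detector.

```python
def grad_cam(activations, gradients):
    """Textbook Grad-CAM over K feature maps of size H x W (nested lists).

    activations[k][i][j]: feature map k at position (i, j)
    gradients[k][i][j]:   d(score)/d(activation) at the same position
    Returns an H x W heatmap scaled to [0, 1].
    """
    # Channel weight = spatial average of that channel's gradients.
    weights = [
        sum(sum(row) for row in grad) / (len(grad) * len(grad[0]))
        for grad in gradients
    ]
    height, width = len(activations[0]), len(activations[0][0])
    # Weighted sum of the feature maps, then ReLU to keep positive evidence only.
    cam = [
        [
            max(0.0, sum(w * act[i][j] for w, act in zip(weights, activations)))
            for j in range(width)
        ]
        for i in range(height)
    ]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy input (made-up numbers): channel 0 supports the prediction in the
# top-left cell, channel 1 argues against it in the bottom-right cell.
activations = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
gradients = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(activations, gradients)
```

The ReLU is why the heatmap only highlights regions that pushed the score up: the negatively weighted channel contributes nothing to the final map.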

SHAP answers a different question: "which pixels actually moved the prediction?" — per-pixel attribution that a technical reviewer can interrogate. It's slower to compute (the heaviest step in the pipeline) and harder to read, but it's the one that holds up under scrutiny.

A black-box safety tool is a non-starter — if the system flags a worker, someone needs to be able to ask why. Shipping both means the answer satisfies the floor manager and the auditor.

Grounding the reports in real regulations

A generic "this worker is missing a hard hat" message isn't useful. A citation of 29 CFR 1910.135(a)(1) is. So the incident report is generated by a multimodal Gemini Flash model that receives three things: the annotated image (so it sees what the camera sees), the structured violation data, and the relevant OSHA regulation text — retrieved by a RAG pipeline (Qdrant vector store + BGE embeddings) over the actual 29 CFR 1910 and 1926 standards.
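The retrieve-then-prompt step can be sketched with a toy in-memory index. This is a stand-in, not the project's pipeline: it swaps Qdrant for a Python list and BGE embeddings for hand-written 3-dimensional vectors, the regulation text is left as placeholders, and the function names, the `NO-Hardhat` label, and the confidence value are hypothetical. The annotated image would be attached to the multimodal request alongside this text.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, index, top_k=2):
    """Return the top_k passages most similar to the query vector."""
    ranked = sorted(index, key=lambda p: cosine(query_vec, p["vector"]), reverse=True)
    return ranked[:top_k]


def build_prompt(violations, passages):
    """Combine structured violations and retrieved regulation text into one prompt."""
    lines = ["Write an incident report for the attached annotated image.", "",
             "Detected violations:"]
    lines += [f"- {v['label']} (confidence {v['confidence']:.2f})" for v in violations]
    lines += ["", "Cite only the regulation excerpts below:"]
    lines += [f"[{p['citation']}] {p['text']}" for p in passages]
    return "\n".join(lines)


# Toy index: vectors are invented, text fields are placeholders.
INDEX = [
    {"citation": "29 CFR 1910.135(a)(1)", "vector": [1.0, 0.0, 0.0],
     "text": "<head protection text>"},
    {"citation": "29 CFR 1926.501", "vector": [0.0, 1.0, 0.0],
     "text": "<fall protection text>"},
    {"citation": "29 CFR 1910.132", "vector": [0.0, 0.0, 1.0],
     "text": "<general PPE text>"},
]
violations = [{"label": "NO-Hardhat", "confidence": 0.91}]  # hypothetical detection
top = retrieve([0.9, 0.1, 0.0], INDEX, top_k=1)
prompt = build_prompt(violations, top)
```

Restricting the model to the retrieved excerpts is what makes the citation checkable: the report can only quote text that was actually pulled from the standards.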

Does the RAG grounding actually help, or is it theater? I A/B tested it. With RAG vs. without, judged on report quality: RAG wins, Cohen's d = 0.65, p = 0.0197 (paired t-test, N=16). Small sample, but a real and significant effect. I ran a second A/B on the detection confidence threshold (0.40 vs 0.55): 0.40 wins, McNemar p = 4×10⁻⁵ on 200 held-out images. Decisions backed by numbers, not vibes.
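The two tests can be sanity-checked with standard formulas. The sketch below assumes Cohen's d was computed on the paired differences (mean difference over the standard deviation of differences), in which case the paired t statistic is d times the square root of N. The McNemar function is the exact two-sided binomial form, and the discordant counts in its example are invented for illustration, since the post does not report them.

```python
import math


def paired_t_from_effect_size(d, n):
    """t statistic of a paired t-test, given Cohen's d on the differences."""
    return d * math.sqrt(n)


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    b: items only variant A got right; c: items only variant B got right.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Reported RAG A/B: d = 0.65, N = 16, so t is about 2.6 with 15 degrees of freedom.
t_stat = paired_t_from_effect_size(0.65, 16)

# Hypothetical discordant counts, NOT the project's data.
p_example = mcnemar_exact_p(1, 9)
```

A t of about 2.6 with 15 degrees of freedom sits right at the two-sided 0.02 critical value (roughly 2.602), so the reported effect size, sample size, and p-value agree with each other.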

The infrastructure war stories

The GCP quota wall

I planned to train on GCP with the $300 free credit. Every GPU VM request bounced — across dozens of zones and machine types. The error messages pointed at regional quotas that looked fine. The real culprit took systematic testing to find: a global GPUS_ALL_REGIONS umbrella quota that defaults to 0 on new paid accounts and silently overrides every regional quota. You can have regional GPU quota of 1 and still be blocked because the global cap is 0.
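The failure mode reduces to one line of logic. This is a simplified model of the behavior described above, not Google's actual admission code, and the function name is made up.

```python
def can_allocate_gpus(requested, regional_quota, global_quota):
    """A request is admitted only if both the regional and global caps allow it."""
    return requested <= min(regional_quota, global_quota)


# Regional quota of 1 looks fine, but a global cap of 0 still blocks the VM.
blocked = can_allocate_gpus(1, regional_quota=1, global_quota=0)
allowed = can_allocate_gpus(1, regional_quota=1, global_quota=1)
```

Checking the global quota first would have saved the zone-by-zone search.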

For v1 I pivoted to Kaggle's free 2×T4 notebooks and trained around the 12-hour session cap with checkpoint-resume. For v2, after the account aged and an explicit quota request cleared the global cap, I trained on a single GCP L4, then wound the whole GCP footprint down to $0 once the weights were on Hugging Face. I documented the entire diagnosis as an architecture decision record, because the next person hitting that wall deserves better than the error message I got.
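The checkpoint-resume pattern for a capped session looks roughly like this. It is a minimal sketch, not the project's training script: the checkpoint here is a small JSON file holding only the epoch counter (a real one would also hold model and optimizer state), and `train_with_budget`, `run_epoch`, and the simulated clock are all hypothetical names.

```python
import json
import os
import tempfile
import time


def train_with_budget(total_epochs, ckpt_path, budget_seconds, run_epoch,
                      clock=time.monotonic):
    """Run epochs until training finishes or the session budget is nearly spent.

    Returns the number of epochs completed across all sessions so far.
    """
    state = {"epoch": 0}
    if os.path.exists(ckpt_path):  # resume from a previous session
        with open(ckpt_path) as f:
            state = json.load(f)

    session_start = clock()
    longest_epoch = 0.0
    while state["epoch"] < total_epochs:
        epoch_start = clock()
        # Stop if another epoch as long as the slowest one so far would overrun.
        if (epoch_start - session_start) + longest_epoch > budget_seconds:
            break
        run_epoch(state["epoch"])
        state["epoch"] += 1
        # Write to a temp file, then rename, so a killed session never
        # leaves a half-written checkpoint behind.
        tmp_path = ckpt_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, ckpt_path)
        longest_epoch = max(longest_epoch, clock() - epoch_start)
    return state["epoch"]


# Simulated run: each fake epoch "takes" 10 seconds, each session allows 25.
fake_now = [0.0]


def fake_epoch(epoch_index):
    fake_now[0] += 10.0


ckpt = os.path.join(tempfile.mkdtemp(), "state.json")
sessions = []
while not sessions or sessions[-1] < 5:
    sessions.append(
        train_with_budget(5, ckpt, 25.0, fake_epoch, clock=lambda: fake_now[0])
    )
```

In the simulation, five epochs take three sessions, with each session picking up where the last checkpoint left off.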

Lambda Function URLs over API Gateway

For the API, I chose a Lambda Function URL over API Gateway. The reasoning: Function URLs are free forever, while API Gateway's free tier expires after 12 months — and for a single /analyze endpoint, I didn't need API Gateway's usage plans or request transformations. API-key auth and rate-limiting live at the handler level instead. It's the kind of trade you make explicit so the alternative is on record, not the kind you default into.

The 4MB that looked like 6MB

Lambda Function URLs cap payloads at 6MB. I built the frontend to that limit, and 5MB images started returning 413s. The cause: base64 inflation. Base64 adds ~33% to the payload, so a 6MB on-the-wire cap leaves only about 4.5MB of raw image, and less once the rest of the JSON envelope is counted. It bites the response too: my annotated image, GradCAM, and SHAP visuals were going out as PNG and blowing the ceiling. Fix: JPEG q85 instead of PNG, cap input resolution at 1280px, and set the real frontend limit to 4MB, which leaves headroom. The kind of constraint that's invisible until production traffic finds it.
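The arithmetic is easy to check. The sketch below assumes "6MB" means 6 × 1024 × 1024 bytes and ignores the other JSON fields unless you pass an envelope size. The helper names are illustrative, and `fit_within` only computes target dimensions for a 1280px cap, it does not resize anything.

```python
import base64

LIMIT_BYTES = 6 * 1024 * 1024  # assumed interpretation of the 6MB cap


def base64_size(raw_bytes):
    """Length of the padded base64 encoding of raw_bytes bytes."""
    return 4 * ((raw_bytes + 2) // 3)


def max_raw_bytes(limit_bytes, envelope_bytes=0):
    """Largest raw payload whose base64 form still fits under the limit."""
    return (limit_bytes - envelope_bytes) // 4 * 3


def fit_within(width, height, max_side=1280):
    """Scale dimensions down so the longest side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


five_mb_encoded = base64_size(5 * 1024 * 1024)   # larger than LIMIT_BYTES
raw_ceiling = max_raw_bytes(LIMIT_BYTES)         # 4.5 * 1024 * 1024 bytes
resized = fit_within(4000, 3000)
```

A 5MB image encodes to roughly 6.7MB, which explains the 413s, and the theoretical ceiling of about 4.5MB shrinks further once the rest of the request body is counted.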

$0, on purpose

The hard constraint was zero ongoing cost, and every runtime service honors it: AWS Lambda/S3/DynamoDB/ECR (always-free, no 12-month cliff), Supabase for Postgres + auth, Vercel for the frontend, Hugging Face for hosting and weights, Qdrant Cloud for vectors, Google AI Studio for the LLM. Cost per analysis: $0.

That's not a limitation to apologize for — for a small factory that can't justify $2,000/month, the free version is the product-relevant version. And the architecture is built so the expensive upgrades (a bigger model on a GPU endpoint, a frontier LLM, multi-seed evals) are config flags away, not rewrites. I built the cheap version of an upgrade-ready system.

What I'd do with more time

None of these are blind spots — each was a conscious trade against "ship the rigorous free version."

Close the mAP gap to 0.78+ — more side/back-view and occluded training data; the augmentation helped but didn't fully solve the pose bias.
A semantic guardrail / second model for the report layer, beyond the current prompt-level grounding.
Multi-seed evals and bigger A/B samples — the current intervals are wide; the effects are directional, not bankable beyond the strong ones.
RTSP / live-camera ingestion — the obvious product next step, but it needs persistent compute, which breaks the $0 rule.

What it actually demonstrates

Eight weeks, solo, $0: a fine-tuned and ONNX-exported detector, dual explainability, RAG-grounded multimodal reporting, time-series forecasting with a baseline, statistically validated A/B tests, a three-surface deployment (Next.js + Vercel, Gradio + HF Spaces, serverless AWS via Terraform), a published PyPI SDK with a CLI, and honest metrics throughout.

The point was never to out-spend the incumbents. It was to show that the capability is no longer the moat — and to build the free version well enough that someone would actually use it.

Try the live demo · Read the model card · Deploy your own

SafetyVision is an AI-assisted pre-screening tool to support human safety officers — not a replacement for human judgment.

DEV Community: Ayush Gupta