A story about driving a car with a camera, letting AI do the rest, and spending $0/month on the whole thing.
A housing inspector drives down a residential street at 15 miles per hour. They squint at every house — looking for peeling paint, overgrown vegetation, broken windows, junk vehicles in the yard. When they spot something, they pull over, write it up, take a photo, and move on. A single neighborhood takes hours. Scale that to an entire municipality with tens of thousands of properties, and you're looking at weeks of windshield time for problems that are fundamentally visual and repetitive.
But what if you could drive the route once with a GoPro, and let software handle everything else?
Not just "scan for problems"; that's the easy part to imagine. I mean the whole pipeline: identify exactly which property each issue belongs to, map it to the municipal parcel database, annotate the evidence, and drop it into the enforcement system. No manual address transcription. No "I think that was 142, or maybe 144."
Three months later, I had a working system that does exactly that. It runs on a budget Linux server. The recurring cost is zero dollars per month. Here's how I built it, what I learned, and why the boring parts were the hardest.
The Pipeline at a Glance
The system has four phases:
- Frame extraction — Turns hours of GoPro footage into a handful of high-quality still frames
- Violation detection — A vision-language model inspects each frame for 13 housing code violations
- Parcel geolocation — GPS telemetry + spatial queries map each violation to a specific property
- Enforcement delivery — Results land in a CRM with addresses, violation details, and annotated images
But describing it as four bullet points hides where the real engineering lives. Let me walk through each phase.
Phase 1: From Hours of Video to the Right 50 Frames
A 10-minute GoPro video at 30 fps is 18,000 frames. Sending every frame to a vision-language model would cost hundreds of dollars, take days, and produce mostly garbage: 90% of the frames are pavement, sidewalks, trees, or the sides of cars.
The solution is two-stage frame selection driven by a zero-shot image classification model, in this case CLIP.
Stage 1 extracts frames cheaply using ffmpeg — 3 frames per second, scaled down to 640 pixels wide. Each frame gets scored by a CLIP vision model (ViT-L-14) using three criteria:
- Sharpness: Laplacian variance. Blurry frames from motion or focus hunting get rejected.
- House occupancy: CLIP similarity to prompts like "a front view of a residential house." The cutoff for what counts as house-like is calibrated per video rather than hardcoded.
- Vehicle or tree dominance: CLIP similarity to "a car, truck or a tree taking up most of the frame." If a FedEx truck blocks the view, or a tree stands right in front of the house, the frame gets discarded.
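The sharpness criterion is simple enough to show in full. This is a minimal pure-Python sketch of Laplacian variance, not the project's code; in practice the same number is usually computed with `cv2.Laplacian(gray, cv2.CV_64F).var()`. The function name and the list-of-rows image format are assumptions made for illustration.

```python
from statistics import pvariance

def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels.

    `gray` is a list of equal-length rows of grayscale intensities
    (hypothetical input format; real code would use a NumPy array).
    """
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x]
                   + gray[y][x - 1] + gray[y][x + 1]
                   - 4 * gray[y][x])
            responses.append(lap)
    return pvariance(responses)

# A flat image has no edges; a checkerboard is all edges.
flat = [[128] * 6 for _ in range(6)]
checker = [[255 if (x + y) % 2 else 0 for x in range(6)] for y in range(6)]
print(laplacian_variance(flat), laplacian_variance(checker))
```

A blurry frame behaves like the flat image: few strong edges, so the Laplacian responses cluster near zero and the variance is low.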
Stage 2 goes back to the original video and re-extracts only the top-scoring frames at full native resolution via ffmpeg seek. Five frames per 60-second chunk is usually enough — that's about 50 frames for a 10-minute video (highly dependent on the speed of the car). The VLM never sees the other 17,950.
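Stage 2 can be sketched as two small steps: keep the best few timestamps in each 60-second chunk, then build an ffmpeg seek command for each one. The function names and the `(timestamp, score)` input shape are assumptions; the ffmpeg flags (`-ss`, `-i`, `-frames:v`, `-q:v`) are standard, but the exact options the pipeline uses are not stated in this post.

```python
from collections import defaultdict

def top_frames_per_chunk(scored, chunk_seconds=60, keep=5):
    """scored: iterable of (timestamp_seconds, score). Returns kept timestamps, sorted."""
    chunks = defaultdict(list)
    for ts, score in scored:
        chunks[int(ts // chunk_seconds)].append((score, ts))
    kept = []
    for items in chunks.values():
        best = sorted(items, reverse=True)[:keep]  # highest scores first
        kept.extend(ts for _, ts in best)
    return sorted(kept)

def seek_command(video_path, ts, out_path):
    """ffmpeg argv that grabs one native-resolution frame at `ts` seconds."""
    return ["ffmpeg", "-y", "-ss", f"{ts:.3f}", "-i", video_path,
            "-frames:v", "1", "-q:v", "2", out_path]

scored = [(0.0, 0.1), (1.0, 0.9), (2.0, 0.5), (61.0, 0.3), (62.0, 0.8)]
for ts in top_frames_per_chunk(scored, keep=2):
    print(" ".join(seek_command("drive.mp4", ts, f"frame_{ts:08.3f}.jpg")))
```

Placing `-ss` before `-i` asks ffmpeg to seek on the input, which avoids decoding the whole video just to reach one frame.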
The auto-calibration trick: Every video has different lighting, different neighborhoods, different weather. Hardcoded thresholds fail. So the system samples roughly one frame every 30 seconds from the actual video, computes the distribution of sharpness, house similarity, and vehicle similarity scores, and sets each threshold at a percentile of that distribution rather than at a fixed value. No per-video tuning required. It just works.
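Percentile-based calibration might look like the sketch below. The specific percentiles (25th, 50th, 75th) and the key names are placeholders I picked for illustration; the post does not say which percentiles the real system uses.

```python
from statistics import quantiles

def percentile(values, pct):
    """pct-th percentile (1..99) of `values`, using linear interpolation."""
    cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
    return cuts[pct - 1]

def calibrate(samples, sharp_pct=25, house_pct=50, vehicle_pct=75):
    """samples: list of dicts with 'sharpness', 'house', 'vehicle' scores.

    Returns per-video thresholds. The percentile defaults are assumptions.
    """
    return {
        "min_sharpness": percentile([s["sharpness"] for s in samples], sharp_pct),
        "min_house": percentile([s["house"] for s in samples], house_pct),
        "max_vehicle": percentile([s["vehicle"] for s in samples], vehicle_pct),
    }

samples = [{"sharpness": i, "house": i / 100, "vehicle": 1 - i / 100}
           for i in range(101)]
thresholds = calibrate(samples)
print(thresholds)
```

Because the thresholds come from the video's own score distribution, a dim evening drive and a bright noon drive each reject about the same share of frames.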
Resumability matters. Processing a 30-minute video can take hours. If it crashes at the last minute, you don't want to start over. Checkpoints save which chunks are done, and calibration data is cached — reruns pick up where they left off instantly.
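A chunk-level checkpoint can be as small as a JSON file listing finished chunk IDs. This is a sketch under assumed names (`load_done`, `mark_done`, `run`, and the `done_chunks` key are all hypothetical), not the project's actual checkpoint format.

```python
import json
import os
from pathlib import Path

def load_done(path):
    """Return the set of chunk IDs already processed, or an empty set."""
    p = Path(path)
    if not p.exists():
        return set()
    return set(json.loads(p.read_text())["done_chunks"])

def mark_done(path, done, chunk_id):
    """Record a finished chunk, writing via a temp file so a crash can't corrupt it."""
    done.add(chunk_id)
    tmp = Path(str(path) + ".tmp")
    tmp.write_text(json.dumps({"done_chunks": sorted(done)}))
    os.replace(tmp, path)  # atomic rename on the same filesystem

def run(chunk_ids, checkpoint_path, process_chunk):
    """Process only the chunks that are not in the checkpoint yet."""
    done = load_done(checkpoint_path)
    for chunk_id in chunk_ids:
        if chunk_id in done:
            continue
        process_chunk(chunk_id)
        mark_done(checkpoint_path, done, chunk_id)
```

Calling `run` again after a crash skips everything that was marked done, so the only repeated work is the chunk that was in flight.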
Phase 2: Teaching a VLM to Be a Housing Inspector
Each selected frame goes to a vision-language model (I used Gemini's free tier) with a carefully constructed system prompt. The prompt establishes three things:
- Identity: "You are an expert housing inspector for the City."
- Context: A catalog of 13 violation types sourced from the municipal code — peeling paint, vehicles on unpaved surfaces, overgrown vegetation, broken windows, bad roofing, damaged siding, and so on.
- Output contract: Strict JSON with `house_number` and `violations[]`, each entry with `violation_category`, `seen_in_image` (a description), and `bbox` in 0–1000 normalized coordinates.
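A strict output contract is only useful if it is enforced on the way in. The sketch below validates a response against the field names listed above. The four-number `bbox` layout is an assumption (the post does not give the coordinate order), and `parse_inspection` is a hypothetical helper name.

```python
import json

REQUIRED_VIOLATION_KEYS = {"violation_category", "seen_in_image", "bbox"}

def parse_inspection(raw):
    """Parse a model response and raise ValueError if it breaks the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or "house_number" not in data:
        raise ValueError("missing house_number")
    if not isinstance(data.get("violations"), list):
        raise ValueError("violations must be a list")
    for violation in data["violations"]:
        if not isinstance(violation, dict):
            raise ValueError("each violation must be an object")
        missing = REQUIRED_VIOLATION_KEYS - violation.keys()
        if missing:
            raise ValueError(f"violation missing keys: {sorted(missing)}")
        bbox = violation["bbox"]
        if not isinstance(bbox, list) or len(bbox) != 4:
            raise ValueError("bbox must be a list of four numbers")  # assumed layout
    return data

example = json.dumps({
    "house_number": "142",
    "violations": [{"violation_category": "peeling paint",
                    "seen_in_image": "paint flaking on the porch columns",
                    "bbox": [210, 340, 480, 900]}],
})
print(parse_inspection(example)["violations"][0]["violation_category"])
```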
The model returns structured data. But here's the thing about VLM bounding boxes: they are noisy. Models confidently return boxes that are 2 pixels wide, boxes where x2 < x1, and boxes that run the full height of the image as a thin "vertical sliver." Left unprocessed, these break annotation rendering and corrupt results.
So every inference result passes through a bbox post-processing pipeline:
- Repair: Clamp all coordinates to 0–1000, fix axis ordering, reject NaN/Inf values, enforce a minimum size
- Expand slivers: Tall vertical strips (like a box covering the entire façade for "peeling paint") are widened about their center until the aspect ratio is at most 8:1
- Filter: Optionally drop boxes that are too tiny or have extreme aspect ratios
- IoU merge: Same-category boxes with heavy overlap (IoU > 65%) get deduplicated
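Here is a minimal sketch of the repair, sliver-expansion, and merge steps. It assumes boxes are `[x1, y1, x2, y2]` in 0–1000 space and that "enforce minimum size" means growing a too-small box about its center; the 10-unit minimum and all function names are illustrative, while the 8:1 and 65% values come from the list above.

```python
import math

def _widen(lo, hi, size):
    """Grow the interval [lo, hi] about its center to `size`, staying inside 0..1000."""
    size = min(size, 1000.0)
    center = (lo + hi) / 2
    lo, hi = center - size / 2, center + size / 2
    if lo < 0:
        lo, hi = 0.0, size
    if hi > 1000:
        lo, hi = 1000.0 - size, 1000.0
    return lo, hi

def repair(box, min_size=10.0):
    """Clamp, reorder, and minimum-size a box; return None for NaN/Inf or bad shape."""
    if len(box) != 4 or not all(
            isinstance(v, (int, float)) and math.isfinite(v) for v in box):
        return None
    x1, y1, x2, y2 = (min(1000.0, max(0.0, float(v))) for v in box)
    x1, x2 = sorted((x1, x2))
    y1, y2 = sorted((y1, y2))
    if x2 - x1 < min_size:
        x1, x2 = _widen(x1, x2, min_size)
    if y2 - y1 < min_size:
        y1, y2 = _widen(y1, y2, min_size)
    return [x1, y1, x2, y2]

def expand_sliver(box, max_aspect=8.0):
    """Widen a tall, thin (already repaired) box until height/width <= max_aspect."""
    x1, y1, x2, y2 = box
    height = y2 - y1
    if height / (x2 - x1) > max_aspect:
        x1, x2 = _widen(x1, x2, height / max_aspect)
    return [x1, y1, x2, y2]

def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def dedupe(items, threshold=0.65):
    """Keep the first of any same-category boxes whose IoU exceeds `threshold`."""
    kept = []
    for item in items:
        duplicate = any(
            k["violation_category"] == item["violation_category"]
            and iou(k["bbox"], item["bbox"]) > threshold
            for k in kept)
        if not duplicate:
            kept.append(item)
    return kept
```

Running `repair` before `expand_sliver` matters: the expansion step divides by the box width, so it relies on the minimum size already being enforced.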
There's also a bad-run detector: if a frame comes back with two or more nonsensical vertical slivers (height/width ≥ 10), it's flagged for automatic retry on the next cycle instead of being committed to final results. The model has off days, and the pipeline accounts for that.
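The bad-run check described above fits in a few lines. This sketch uses the thresholds from the text (two or more boxes with height/width ≥ 10); the function name and the choice to count zero-width boxes as slivers are my assumptions.

```python
def is_bad_run(boxes, sliver_ratio=10.0, min_slivers=2):
    """True if a frame's boxes contain at least `min_slivers` tall, thin slivers.

    boxes: list of [x1, y1, x2, y2]. Zero-width boxes are counted as slivers.
    """
    slivers = 0
    for x1, y1, x2, y2 in boxes:
        width, height = abs(x2 - x1), abs(y2 - y1)
        if width == 0 or height / width >= sliver_ratio:
            slivers += 1
    return slivers >= min_slivers

frame_boxes = [[100, 0, 110, 1000], [400, 0, 420, 900], [200, 300, 600, 700]]
print(is_bad_run(frame_boxes))  # two slivers in one frame, so retry it
```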
Rate limit handling deserves its own mention. Free-tier APIs have quotas. Instead of exponential-backoff loops that burn through your retry budget in five minutes, the system counts consecutive failures. After five in a row, it pauses all API calls until local midnight — when quotas reset — then resumes automatically. State is persisted, so a server restart doesn't reset the counter. This single design choice eliminated the "wake up to find 400 error messages" problem entirely.
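One way to sketch the midnight-pause pattern is a small gate object that persists its state to disk. The class name, state file layout, and use of naive local datetimes are assumptions for illustration; only the "five consecutive failures, then wait for local midnight" rule comes from the text.

```python
import json
from datetime import datetime, timedelta
from pathlib import Path

MAX_CONSECUTIVE_FAILURES = 5

def next_local_midnight(now):
    """The first midnight strictly after `now` (naive local time)."""
    return (now + timedelta(days=1)).replace(hour=0, minute=0, second=0, microsecond=0)

class QuotaGate:
    def __init__(self, state_path):
        self.path = Path(state_path)
        self.state = {"failures": 0, "paused_until": None}
        if self.path.exists():  # survive restarts
            self.state = json.loads(self.path.read_text())

    def _save(self):
        self.path.write_text(json.dumps(self.state))

    def can_call(self, now):
        """False while paused; clears the pause once its deadline has passed."""
        until = self.state["paused_until"]
        if until and now < datetime.fromisoformat(until):
            return False
        if until:
            self.state = {"failures": 0, "paused_until": None}
            self._save()
        return True

    def record(self, ok, now):
        """Track consecutive failures and start a pause at the limit."""
        if ok:
            self.state["failures"] = 0
        else:
            self.state["failures"] += 1
            if self.state["failures"] >= MAX_CONSECUTIVE_FAILURES:
                self.state["paused_until"] = next_local_midnight(now).isoformat()
        self._save()
```

The calling loop checks `can_call(datetime.now())` before each request and sleeps when it returns `False`, so nothing is sent during the pause.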
Phase 3: The Hard Part — Which House Is This?
Detecting that a house has peeling paint is useful. Knowing which house it is makes the detection actionable. This is where things got interesting.
GoPro cameras embed GPS telemetry (GPMF) directly in the video stream: latitude, longitude, speed, and heading, sampled at roughly 18 Hz. That's a GPS fix every ~55 milliseconds. Even driving at 25 mph (about 11 m/s), that's roughly 0.6 meters between samples. One caveat: the telemetry describes the position of the car, not the house that appears in the frame.
The geolocation pipeline has six stages, but the core insight is stage 4: spatial parcel matching.
For each frame, the system knows the car's position and heading at that exact millisecond. It projects two rays outward from the car, one perpendicular to the left (heading − 90°) and one to the right (heading + 90°), representing the two directions a side-facing camera could be looking. Since the camera is dash-mounted facing one side of the street, one of these rays points at the house.
The system then performs point-in-polygon queries against a GeoDataFrame of local parcel boundaries — publicly available GIS data, downloaded once from the state's ArcGIS service. No API keys. No ongoing cost.
But a single projected point doesn't always land inside a parcel. GPS has noise. The road-to-house setback varies. So the system tries multiple projection distances — 6m, 10m, 14m, 18m, 22m — and picks the hit whose projected point lands deepest inside the polygon.
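The multi-distance projection can be sketched without any GIS library. The version below works in a flat local coordinate system in meters (x = east, y = north) with heading measured clockwise from north, and represents parcels as plain vertex lists; the real pipeline uses a GeoDataFrame, so treat every name here as a stand-in. "Deepest inside" is interpreted as the greatest distance to the nearest parcel edge, which is my assumption.

```python
import math

DISTANCES_M = (6, 10, 14, 18, 22)

def project(x, y, bearing_deg, dist):
    """Point `dist` meters from (x, y) along a compass bearing."""
    b = math.radians(bearing_deg)
    return x + dist * math.sin(b), y + dist * math.cos(b)

def _edges(poly):
    return zip(poly, poly[1:] + poly[:1])

def contains(poly, px, py):
    """Ray-casting point-in-polygon test; poly is a list of (x, y) vertices."""
    inside = False
    for (x1, y1), (x2, y2) in _edges(poly):
        if (y1 > py) != (y2 > py):
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

def edge_distance(poly, px, py):
    """Distance from a point to the nearest polygon edge (edges must be non-degenerate)."""
    best = math.inf
    for (x1, y1), (x2, y2) in _edges(poly):
        dx, dy = x2 - x1, y2 - y1
        t = ((px - x1) * dx + (py - y1) * dy) / (dx * dx + dy * dy)
        t = max(0.0, min(1.0, t))
        best = min(best, math.hypot(px - (x1 + t * dx), py - (y1 + t * dy)))
    return best

def match_parcel(car_x, car_y, heading_deg, side, parcels):
    """Return the parcel ID whose hit lies deepest inside its polygon, or None."""
    bearing = heading_deg + (90 if side == "right" else -90)
    best = None  # (parcel_id, depth)
    for dist in DISTANCES_M:
        px, py = project(car_x, car_y, bearing, dist)
        for parcel_id, poly in parcels.items():
            if contains(poly, px, py):
                depth = edge_distance(poly, px, py)
                if best is None or depth > best[1]:
                    best = (parcel_id, depth)
    return best[0] if best else None

parcels = {
    "A": [(8, -10), (30, -10), (30, 10), (8, 10)],      # east of the road
    "B": [(-30, -10), (-8, -10), (-8, 10), (-30, 10)],  # west of the road
}
print(match_parcel(0, 0, 0, "right", parcels), match_parcel(0, 0, 0, "left", parcels))
```

Preferring the deepest hit makes the match less sensitive to GPS noise: a point that barely clips a parcel corner loses to one that lands well inside a lot.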
If no ray lands inside any parcel, there's a nearest-on-side fallback: find the closest parcel centroid within 60 meters that's on the correct side of the car's direction of travel.
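The fallback needs a way to decide which side of the car a centroid is on. A cross product between the heading vector and the vector to the centroid does that. This sketch reuses the flat local-meters convention (x = east, y = north, heading clockwise from north); the function names are hypothetical, and the 60-meter limit is the one stated above.

```python
import math

def side_of(car_x, car_y, heading_deg, px, py):
    """'left' or 'right' of the direction of travel, via a 2D cross product."""
    h = math.radians(heading_deg)
    fx, fy = math.sin(h), math.cos(h)  # forward unit vector (east, north)
    cross = fx * (py - car_y) - fy * (px - car_x)
    return "left" if cross > 0 else "right"

def nearest_on_side(car_x, car_y, heading_deg, side, centroids, max_dist=60.0):
    """Closest centroid within `max_dist` meters on the requested side, or None."""
    best_id, best_dist = None, max_dist
    for parcel_id, (cx, cy) in centroids.items():
        dist = math.hypot(cx - car_x, cy - car_y)
        if dist <= best_dist and side_of(car_x, car_y, heading_deg, cx, cy) == side:
            best_id, best_dist = parcel_id, dist
    return best_id

centroids = {"A": (20, 5), "B": (-15, 0), "C": (70, 0)}
print(nearest_on_side(0, 0, 0, "right", centroids))
```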
The optical flow trick: Here's the part I'm most proud of. The system doesn't need you to tell it which side the camera is mounted on. As the car drives forward, everything outside a side-facing camera slides from the front of the car toward the back. Through a camera facing left, that motion runs right-to-left across the frame; through a camera facing right, it runs left-to-right. This is fixed geometry, not a heuristic. By computing pixel motion between consecutive video frames, the system infers camera side with no calibration or annotation. The optical flow hint is used conservatively: it breaks ties on corner lots and near-equal nearest-centroid results, but GPS point-in-polygon always takes priority.
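The decision step can be sketched independently of how the flow is computed. Assume some upstream step (for example OpenCV's `cv2.calcOpticalFlowFarneback`) has already produced one median horizontal displacement per frame pair, with positive values meaning rightward motion in the image. Which sign corresponds to which side depends on the image x-axis convention and on any mirroring, so this sketch takes it as an explicit argument instead of hardcoding it. The function name and both thresholds are assumptions.

```python
from statistics import median

def infer_camera_side(dx_samples, sign_when_left, min_abs_px=0.5, min_samples=5):
    """Vote on camera side from per-frame-pair horizontal flow (pixels).

    sign_when_left: +1 or -1, the flow sign expected when the camera views the
    left side of the street (a convention the caller must supply).
    Returns 'left', 'right', or None when the evidence is too weak.
    """
    usable = [dx for dx in dx_samples if abs(dx) >= min_abs_px]  # drop near-stationary pairs
    if len(usable) < min_samples:
        return None
    m = median(usable)
    if m == 0:
        return None
    return "left" if (m > 0) == (sign_when_left > 0) else "right"

flow = [-3.1, -2.8, -3.5, -0.1, -2.9, -3.0]
print(infer_camera_side(flow, sign_when_left=-1))
```

Returning `None` on weak evidence fits the conservative use described above: the hint only breaks ties, so it is better to abstain than to guess.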
Consensus voting ties it together. A single frame might be ambiguous — GPS glitch, property boundary crossing, corner lot. So frames are grouped into fixed 2-second windows and vote on the dominant parcel ID. If one parcel appears in ≥50% of matched frames, confidence is "high." Ties, GPS gaps, and empty windows get flagged as "low confidence" with specific reason codes.
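The voting step might look like this sketch. The 2-second window and the ≥50% rule come from the text; the reason-code strings and the tuple return shape are placeholders of mine.

```python
from collections import Counter, defaultdict

def vote(frames, window_s=2.0, high_share=0.5):
    """frames: iterable of (timestamp_s, parcel_id or None).

    Returns {window_index: (parcel_id, confidence, reason)}.
    """
    windows = defaultdict(list)
    for ts, parcel_id in frames:
        windows[int(ts // window_s)].append(parcel_id)

    results = {}
    for index, parcel_ids in sorted(windows.items()):
        matched = [p for p in parcel_ids if p is not None]
        if not matched:
            results[index] = (None, "low", "no_match")
            continue
        counts = Counter(matched).most_common()
        top_id, top_n = counts[0]
        if len(counts) > 1 and counts[1][1] == top_n:
            results[index] = (None, "low", "tie")
        elif top_n / len(matched) >= high_share:
            results[index] = (top_id, "high", "majority")
        else:
            results[index] = (top_id, "low", "weak_plurality")
    return results

frames = [(0.1, "A"), (0.5, "A"), (1.2, "B"),
          (2.1, "A"), (2.5, "B"),
          (4.2, None)]
for index, outcome in vote(frames).items():
    print(index, outcome)
```

Note that the share is computed over matched frames only, so a window where most frames had no GPS hit can still reach "high" if the few matches agree.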
The result: ~90% parcel-match accuracy on real-world test runs, purely spatial. No manual annotation. No address database lookup. The GPS and parcel geometry do all the work.
Phase 4: Making It Actionable
The final phase merges VLM-detected violations with geolocated parcels and produces a JSON report file with one entry per house: address, parcel ID, list of violations, frame evidence, and confidence metadata.
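A sketch of that merge, grouping per-frame detections by parcel. The input shapes and every field name in the output are assumptions for illustration, not the actual report schema.

```python
import json

def build_report(detections, frame_parcels, addresses):
    """Group violations by parcel.

    detections:    {frame_id: [violation dicts]}
    frame_parcels: {frame_id: (parcel_id or None, "high" | "low")}
    addresses:     {parcel_id: street address}
    Frames with no parcel match are skipped here; a real system would
    probably route them to manual review.
    """
    report = {}
    for frame_id, violations in detections.items():
        parcel_id, confidence = frame_parcels.get(frame_id, (None, "low"))
        if parcel_id is None or not violations:
            continue
        entry = report.setdefault(parcel_id, {
            "parcel_id": parcel_id,
            "address": addresses.get(parcel_id),
            "violations": [],
            "frames": [],
            "confidence": confidence,
        })
        entry["violations"].extend(violations)
        entry["frames"].append(frame_id)
        if confidence == "low":  # one weak frame downgrades the whole entry
            entry["confidence"] = "low"
    return list(report.values())

detections = {
    "f001": [{"violation_category": "peeling paint"}],
    "f002": [{"violation_category": "broken windows"}],
    "f003": [{"violation_category": "overgrown vegetation"}],
}
frame_parcels = {"f001": ("P1", "high"), "f002": ("P1", "low"), "f003": (None, "low")}
report = build_report(detections, frame_parcels, {"P1": "142 Example St"})
print(json.dumps(report, indent=2))
```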
A separate watcher process polls this file and pushes results to a CRM/Database (the pattern works with any REST API). For each address, it:
- Normalizes the address to match CRM conventions (just for this case)
- Finds or creates the property record by parcel ID
- Updates property fields (address, city, zip, property class, owner) from the public parcel dataset
- Uploads the raw house image as an attachment, titled with street address + timestamp + version counter
An inspector opens their CRM/Database, sees a property flagged with violations, clicks the annotated image, verifies with one click, and moves on. What used to take hours of squinting at every house from the driver's seat now takes reviewing a queue.
The Zero-Cost Stack
I built this on a single budget Linux server — nothing exotic. Here's what costs money and what doesn't:
| Component | Cost |
|---|---|
| Server (used desktop with NVIDIA GPU) | One-time hardware |
| Gemini VLM API | $0 — free tier with limits |
| CLIP model (ViT-L-14) | $0 — open-source, runs locally |
| Parcel GIS data | $0 — NYS public ArcGIS service |
| ffmpeg, Python, geopandas, shapely, Pillow | $0 — all open-source |
| CRM/DB API | $0 — included with org subscription, or minimal cost for DB |
| GoPro camera | $0 — already owned |
The only recurring cost is electricity. Everything else is either free-tier API, local compute, or public data.
A 10-minute GoPro video processes end-to-end in about 1 hour on budget settings (fewer frames per chunk, smaller CLIP model). Bump up the CLIP model to ViT-H-14 on a better GPU, and you get higher accuracy with only slightly longer processing. The pipeline is tunable — trade speed for quality at any point.
What I'd Do Differently
The bbox pipeline should have been day-one work. I spent the first two weeks fighting inconsistent model outputs before I built the repair/expand/filter/merge pipeline. VLMs are stochastic, and bounding boxes are the least reliable part of their output. Don't trust them. Process them.
Checkpoint everything. Every stage of every phase has resumable state now. At first, only the video extraction had it. A crash during VLM inference meant re-processing frames that already had results — wasteful and confusing. Adding idempotent merge logic to the inference results file fixed this.
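An idempotent merge for the inference results file can be sketched as "only add frames that are not already there." The file layout (a JSON object keyed by frame ID) and the function names are assumptions.

```python
import json
import os
from pathlib import Path

def merge_results(path, new_results):
    """Add results for frames not yet in the file; return the frame IDs added.

    Existing entries are never overwritten, so replaying a batch is a no-op.
    """
    p = Path(path)
    current = json.loads(p.read_text()) if p.exists() else {}
    added = [frame_id for frame_id in new_results if frame_id not in current]
    for frame_id in added:
        current[frame_id] = new_results[frame_id]
    tmp = p.with_name(p.name + ".tmp")
    tmp.write_text(json.dumps(current, indent=2, sort_keys=True))
    os.replace(tmp, p)  # atomic swap so a crash never leaves a half-written file
    return added

def pending(path, frame_ids):
    """Frames that still need inference."""
    p = Path(path)
    current = json.loads(p.read_text()) if p.exists() else {}
    return [frame_id for frame_id in frame_ids if frame_id not in current]
```

After a crash, the inference loop only needs to call `pending` and send those frames, which is what removes the wasted re-processing described above.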
The optical flow side inference was a late addition that should have been earlier. The original design required manually specifying which side the camera was on for each video. The motion-based inference eliminated that entirely and improved accuracy on corner lots and ambiguous GPS data. Physics beats configuration.
Rate limits are a system design problem, not an error handling problem. The midnight-pause backoff pattern replaced exponential retry loops and eliminated a whole class of operational headaches. If you're building on free-tier APIs, design for the quota window, not the individual request.
The Bigger Picture
Housing code enforcement is one of those domains where the problem is obvious but the solution has been stuck for decades. Inspectors driving every street looking for visual violations is the same workflow from 1970, just with a tablet instead of a clipboard.
What changed isn't the problem — it's that the building blocks became free or nearly free. Vision-language models that can reason about images. Open-source embedding models that can rank frames by relevance. Public parcel data that covers entire counties. GPS sensors embedded in consumer cameras.
The interesting work isn't the model. It's the pipeline around it — the frame selection, the geolocation, the error recovery, the consensus logic. The model is 30% of the solution. The other 70% is everything that turns raw model output into something an inspector can actually use.
And none of it requires a cloud budget.
Built with Python, Gemini, CLIP, geopandas, and a GoPro. $0/month. Open source everything.
Connect & Share
I’m Faham — currently diving deep into AI/ML while pursuing my Master’s at the University at Buffalo. I share what I learn as I build real-world AI apps.
If you find this helpful, or have any questions, let’s connect on LinkedIn and X (formerly Twitter).
AI Disclosure
This blog post was written by Faham with assistance from AI tools for research, content structuring, and image generation. All technical content has been reviewed and verified for accuracy.




