Arnab Datta
How We Built an AI Littering Detection System in 4 Days — and Won 2nd Place

We had 4 days, one laptop with an RTX 2050, and a problem nobody on our team had fully solved before. This is the story of building TRACE — and everything that broke along the way.


What Is TRACE?

TRACE (Trash Recognition and Automated Civic Enforcement) is a real-time AI surveillance pipeline that:

  • Detects littering events across multiple live camera feeds
  • Confirms offender identity using a 5-state behaviour machine
  • Reads license plates via OCR for vehicle offenders
  • Routes WhatsApp alerts with evidence snapshots to the nearest municipality ward office using GPS distance
  • Streams live annotated video to a real-time analytics dashboard

Stack: Python · YOLOv8 · ByteTrack · EasyOCR · FastAPI · SQLite · Twilio · OpenCV · HTML/CSS/JS

We won 🥈 2nd place at NextGenHack 2026. It was my first ever hackathon, in my first semester of college, competing against seniors.

Here's what actually happened during those 4 days.


Problem 1: Detecting Behaviour, Not Just Objects

The obvious approach — detect trash, flag it — doesn't work. Trash appears in a frame for a lot of reasons that aren't littering. Someone carrying a bag. A bin. A parked vehicle with litter near it. You'd get false alerts constantly.

We looked at Human Action Recognition (HAR) models first. The idea was to classify the action — "person dropping object" — directly. But every model we tested was either too slow for real-time inference on our hardware, trained on datasets that didn't cover littering specifically, or produced too many false positives on adjacent actions like "person placing object on surface."

No perfect fit existed. So I designed something from scratch.

The 5-State Machine

Every tracked trash object moves through states independently:

```
UNKNOWN → CARRYING → SEPARATION → STATIONARY → ALERTED
                                ↘ CANCELLED (owner returns)
```
  • UNKNOWN: Trash first appears. Looking for the nearest suspect.
  • CARRYING: Suspect within 150px of the object — assumed being carried.
  • SEPARATION: Suspect has moved away. Timer starts.
  • STATIONARY: Object hasn't moved more than 15px in 30+ frames since separation.
  • ALERTED: Suspect is beyond 200px — confirmed littering. Evidence captured, alert dispatched.
  • CANCELLED: Owner identified by ByteTrack ID returns — false alarm cleared.

The key detail: owner identity is verified using ByteTrack track IDs, not just position. A different person walking near a stationary object doesn't cancel the alert. Without this, any passerby would reset the timer.
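A minimal sketch of the transition logic described above. The class and argument names (TrackedTrash, moved_px, and so on) are illustrative, not our exact code; the thresholds are the ones from the post.

```python
CARRY_DIST = 150        # px: suspect within this range is assumed carrying
ABANDON_DIST = 200      # px: suspect beyond this range confirms littering
STATIONARY_PX = 15      # px: movement below this counts as stationary
STATIONARY_FRAMES = 30  # frames the object must stay still after separation

class TrackedTrash:
    def __init__(self):
        self.state = "UNKNOWN"
        self.owner_id = None     # ByteTrack ID of the suspected owner
        self.still_frames = 0

    def update(self, dist_to_owner, nearest_id, nearest_dist, moved_px):
        if self.state == "UNKNOWN" and nearest_dist < CARRY_DIST:
            self.owner_id, self.state = nearest_id, "CARRYING"
        elif self.state == "CARRYING" and dist_to_owner > CARRY_DIST:
            self.state = "SEPARATION"
        elif self.state == "SEPARATION":
            self.still_frames = self.still_frames + 1 if moved_px < STATIONARY_PX else 0
            if self.still_frames >= STATIONARY_FRAMES:
                self.state = "STATIONARY"
        elif self.state == "STATIONARY":
            # Only the original owner's track ID can cancel; a passerby cannot
            if nearest_id == self.owner_id and nearest_dist < CARRY_DIST:
                self.state = "CANCELLED"
            elif dist_to_owner > ABANDON_DIST:
                self.state = "ALERTED"
```

Each tracked object carries one of these instances, updated once per processed frame.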

This took hours of whiteboarding. Getting the transition logic right — especially the cancellation paths — was harder than the model training.


Problem 2: Single Camera to Multi-Camera

Getting one camera working was straightforward. Getting three to run simultaneously without everything collapsing was a different problem entirely.

The issues hit in layers:

Threading: Each camera needs its own detection loop. Python's GIL rules out true parallelism for CPU-bound threads, but the heavy lifting here happens inside native GPU inference calls, which release the GIL. So we ran one thread per camera, each with its own model instances.

Shared state: ByteTrack maintains tracking state across frames. If two cameras share a tracker, their track IDs collide and the state machine breaks completely. Solution: each camera thread gets its own ByteTrack instance. No sharing.
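The no-sharing rule looks roughly like this. This is an illustrative stub: a real worker would hold a YOLO model and a ByteTrack instance where the placeholder counter sits.

```python
import threading

class CameraWorker(threading.Thread):
    """One worker per camera. Each thread owns its own detector and tracker,
    so track IDs from different cameras can never collide."""
    def __init__(self, cam_id, frames):
        super().__init__(daemon=True)
        self.cam_id = cam_id
        self.frames = frames      # stand-in for a cv2.VideoCapture read loop
        self.track_ids = []       # this thread's private tracker output
        self._next_id = 0

    def _track(self, frame):
        # Placeholder for this thread's own ByteTrack update call
        self._next_id += 1
        return f"{self.cam_id}:{self._next_id}"

    def run(self):
        for frame in self.frames:
            self.track_ids.append(self._track(frame))

workers = [CameraWorker(cam, range(3)) for cam in ("cam0", "cam1", "cam2")]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Because the IDs are namespaced per worker, nothing downstream can confuse two cameras' tracks.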

MJPEG streaming: The dashboard needs live video. Naive implementation — encode frame, POST to backend, serve — blocks the detection loop and tanks FPS. We decoupled it: a separate sender thread reads from a shared frame buffer and POSTs independently. The detection loop writes one frame to the buffer (microseconds) and moves on. If the backend is slow, the sender skips to the latest frame. Detection runs at full GPU speed regardless.
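The decoupling hinges on a single-slot, latest-wins buffer between the two threads. A sketch of the idea (class and method names are assumptions):

```python
import threading

class FrameBuffer:
    """Single-slot buffer between the detection loop and the MJPEG sender.
    The writer never blocks on I/O; the reader always gets the newest frame."""
    def __init__(self):
        self._lock = threading.Lock()
        self._frame = None

    def put(self, frame):
        # Called by the detection loop: one write under a lock, microseconds
        with self._lock:
            self._frame = frame

    def latest(self):
        # Called by the sender thread: returns the newest frame, silently
        # dropping anything it was too slow to send
        with self._lock:
            return self._frame
```

The sender thread then loops: grab latest(), JPEG-encode, POST to the backend. If the backend stalls, frames are dropped from the stream, and the detection loop never notices.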


Problem 3: Round 2 Surprise — Add Geofencing

Midway through the hackathon, the judges told us to add geofencing. New requirement, mid-build.

The goal: instead of hardcoding a phone number per camera, alerts should automatically route to the nearest municipality ward office based on the camera's GPS coordinates.

My first instinct was Euclidean distance — just subtract the coordinates. That's wrong.

1 degree of latitude ≈ 111 km. Raw degree subtraction treats coordinates as flat 2D points, which gives completely wrong distances at any real-world scale. A camera 200 metres from an office could appear farther than one 2 km away depending on which direction you measure.

The correct formula is Haversine, which accounts for the Earth's curvature:

```python
import math

def haversine(lat1, lng1, lat2, lng2):
    R = 6_371_000  # Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lng2 - lng1)
    a = math.sin(dphi/2)**2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda/2)**2
    return R * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
```

Now nearest_office(cam_lat, cam_lng) iterates through every entry in MUNICIPALITY_OFFICES, computes Haversine distance, and returns the closest one. Adding a new ward office requires one dict entry in config. No camera config changes needed — routing updates automatically.

We also added high-sensitivity zones — schools, stations, heritage sites — where cameras within a defined radius never drop below MEDIUM priority surveillance.
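Both routing rules can be sketched together. The office and zone entries below are made up, not our real config, and haversine is repeated so the snippet runs standalone:

```python
import math

def haversine(lat1, lng1, lat2, lng2):
    R = 6_371_000  # Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi, dlambda = math.radians(lat2 - lat1), math.radians(lng2 - lng1)
    a = math.sin(dphi/2)**2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda/2)**2
    return R * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

MUNICIPALITY_OFFICES = {   # illustrative entries, not our real config
    "Ward 12": (22.5726, 88.3639),
    "Ward 45": (22.6010, 88.4000),
}

SENSITIVE_ZONES = [        # (lat, lng, radius_m): schools, stations, heritage sites
    (22.5730, 88.3645, 500.0),
]

def nearest_office(cam_lat, cam_lng):
    # One dict entry per ward office; adding one needs no camera config change
    return min(MUNICIPALITY_OFFICES,
               key=lambda name: haversine(cam_lat, cam_lng, *MUNICIPALITY_OFFICES[name]))

def priority_floor(cam_lat, cam_lng):
    # Cameras inside a sensitive zone never drop below MEDIUM
    for lat, lng, radius in SENSITIVE_ZONES:
        if haversine(cam_lat, cam_lng, lat, lng) <= radius:
            return "MEDIUM"
    return "LOW"
```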


Problem 4: The GPU Was Choking

More cameras meant the GPU was hitting its ceiling. On an RTX 2050, we could run 4 cameras at full inference before FPS started dropping hard.

I looked at standard rate-control approaches:

  • Token bucket: Solves contention between producers sharing one resource. But each camera owns its own thread and model instances — there's no shared queue to arbitrate. Doesn't fit.
  • Frame differencing: Gates inference on pixel-change detection. Sounds good, but lighting changes, wind, insects — all produce false triggers. Also creates irregular frame gaps that ByteTrack's persist=True wasn't designed for, breaking track continuity.

We'd actually had simple frame skipping in an earlier version — run detection every Nth frame regardless of what's happening. We scrapped it because it broke tracking. ByteTrack needs consistent temporal input to maintain IDs reliably.

The Dynamic Priority System

The insight: most cameras are idle most of the time. A camera pointed at an empty street at 2am doesn't need the same inference rate as one that just detected a littering event.

Each camera thread tracks time since its last confirmed trash detection and assigns itself a priority:

```python
def get_camera_skip(ctx):
    elapsed = time.time() - ctx.last_trash_time
    if elapsed < PRIORITY_HIGH_WINDOW:     # 5 seconds
        return PRIORITY_HIGH_SKIP          # skip=1, every frame
    elif elapsed < PRIORITY_MEDIUM_WINDOW: # 30 seconds
        return PRIORITY_MEDIUM_SKIP        # skip=5
    else:
        return PRIORITY_LOW_SKIP           # skip=8
```

Key design decisions:

  • Trash model runs every frame regardless — only person detection is skipped
  • Cameras start at LOW automatically: last_trash_time=0.0 means elapsed ≈ 1.7 billion seconds
  • Skipped frames reuse last_known_persons cache — ByteTrack state is preserved between detection frames
  • Priority transitions POST to backend only on change — not every frame
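Putting those decisions together, the per-frame loop looks roughly like this. get_camera_skip is repeated so the sketch runs standalone; CamCtx, process_frame, and the stub detectors are illustrative names, not our exact code:

```python
import time

PRIORITY_HIGH_WINDOW, PRIORITY_MEDIUM_WINDOW = 5.0, 30.0   # seconds
PRIORITY_HIGH_SKIP, PRIORITY_MEDIUM_SKIP, PRIORITY_LOW_SKIP = 1, 5, 8

class CamCtx:
    """Per-camera state (field names are assumptions)."""
    def __init__(self):
        self.last_trash_time = 0.0     # 0.0 => huge elapsed => starts at LOW
        self.frame_idx = 0
        self.last_known_persons = []   # cache reused on skipped frames

def get_camera_skip(ctx):
    elapsed = time.time() - ctx.last_trash_time
    if elapsed < PRIORITY_HIGH_WINDOW:
        return PRIORITY_HIGH_SKIP      # every frame
    elif elapsed < PRIORITY_MEDIUM_WINDOW:
        return PRIORITY_MEDIUM_SKIP
    return PRIORITY_LOW_SKIP

def process_frame(ctx, frame, detect_trash, detect_persons):
    ctx.frame_idx += 1
    trash = detect_trash(frame)            # trash model runs every frame
    if trash:
        ctx.last_trash_time = time.time()  # bumps this camera to HIGH for 5s
    if ctx.frame_idx % get_camera_skip(ctx) == 0:
        ctx.last_known_persons = detect_persons(frame)
    return trash, ctx.last_known_persons   # skipped frames reuse the cache
```

A confirmed trash detection resets the clock, so the very next frame runs at HIGH.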

Result: we went from 4 cameras at full load to 6-9 cameras on the same RTX 2050.

The difference from the old frame skipping: this version is activity-aware. It doesn't skip blindly on a fixed schedule — it skips based on what's actually happening in the scene. A camera that just detected a littering event immediately jumps to HIGH (every frame) for 5 seconds. An idle camera at LOW still runs the trash model every frame, just not person detection.


The Dashboard Problem (JS with No JS Experience)

None of our team were JavaScript developers. The dashboard needed to be live, multi-camera, handle MJPEG streams, update charts every 5 seconds, and look presentable to judges.

We deliberately chose plain HTML/CSS/JS — no React, no build step, no npm. Zero risk of build failures mid-demo. It opens directly in any browser and polls the FastAPI backend every 5 seconds.

Chart.js for the graphs. Native <img> tags for MJPEG streams — the browser handles multipart decode natively, no JS required. fetch() for everything else. It works. It held up through the entire demo.


What We Shipped

  • Multi-camera real-time detection (threaded, one worker per camera)
  • 5-state littering behaviour machine with ByteTrack ID-based owner verification
  • EasyOCR license plate recognition with Indian format validation
  • Haversine geofencing — nearest ward office routing
  • Dynamic HIGH/MEDIUM/LOW priority inference system
  • imgbb snapshot upload → Twilio WhatsApp alert with zone label
  • FastAPI backend, SQLite, MJPEG streaming
  • Live dashboard with priority badges and zone sensitivity indicators

Model performance: YOLOv8s fine-tuned on TACO dataset, mAP50 = 0.81


What I'd Do Differently

The state machine thresholds (150px carry distance, 200px abandon distance) were tuned empirically on test videos. They work, but they're pixel-based — which means they're resolution and camera-angle dependent. A proper implementation would normalize by estimated person height in frame.

The seen_trash_ids set that tracks confirmed events is never pruned. Over a long session it grows indefinitely. Simple fix with a timestamp-based TTL, just didn't make the hackathon cut.
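The TTL fix could be as small as this: switch the set to a dict of timestamps and prune periodically. A sketch, with the names and the 600-second TTL being assumptions:

```python
import time

SEEN_TTL = 600.0   # seconds before a confirmed event ID can be forgotten

# seen_trash_ids as {track_id: last_seen_timestamp} instead of a plain set
seen_trash_ids = {}

def mark_seen(track_id, now=None):
    seen_trash_ids[track_id] = time.time() if now is None else now

def prune_seen(now=None):
    now = time.time() if now is None else now
    stale = [tid for tid, ts in seen_trash_ids.items() if now - ts > SEEN_TTL]
    for tid in stale:
        del seen_trash_ids[tid]
```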

Frame differencing as a complement to the priority system — gating the trash model on truly static scenes — would be the next meaningful optimization. The priority system handles person detection well. The trash model still runs every frame regardless.


Links

Live Feed


First hackathon. First semester. If you're a first-year student reading this wondering whether to enter one — just enter it.
