
Crime Brasil

Posted on • Originally published at crimebrasil.com.br

How We Built an Open Crime Data Platform for Brazil (3 Million Records)


Brazil has one of the world's highest homicide rates — over 47,000 murders per year. Yet accessing crime statistics is surprisingly hard. Data is fragmented across 27 state secretariats, each with different formats, different crime taxonomies, and wildly inconsistent update schedules.

We built Crime Brasil to fix that.

What it does

Crime Brasil is an interactive crime map covering Rio Grande do Sul, Rio de Janeiro, and Minas Gerais — with bairro (neighborhood) level granularity for RS, and municipality-level for the others.

Key features:

  • 3+ million individual crime records for RS (2022–2026)
  • Choropleth maps at state → city → neighborhood drill-down
  • Per-100K-inhabitants rate normalization (so you can actually compare)
  • Crime type filters (homicide, robbery, drug offenses, etc.)
  • Time range selector (last 6/12/24/36 months or full history)
  • API access — all data available via REST endpoints
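The per-100K normalization in the list above is simple but easy to get subtly wrong. As a minimal sketch (the function name and signature are illustrative, not the platform's actual code):

```python
def rate_per_100k(crime_count: int, population: int) -> float:
    """Crimes per 100,000 inhabitants — the normalization that makes a
    40,000-person bairro comparable to a 400,000-person city.

    Hypothetical helper for illustration; not Crime Brasil's real code.
    """
    if population <= 0:
        raise ValueError("population must be positive")
    return crime_count / population * 100_000
```

The guard matters in practice: census polygons occasionally report zero population (industrial zones, parks), and dividing through silently produces absurd rates.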

The tech stack

  • Backend: FastAPI + SQLAlchemy + SQLite (yeah, SQLite — handles 3M rows fine with proper indexes)
  • Frontend: Next.js 14 + Leaflet.js + Tailwind CSS
  • Deployment: Docker + zero-downtime atomic alias swap (no Kubernetes needed at this scale)
  • Data pipeline: Python ETL from SSP/RS ZIP files, ISP/RJ CSV, SEJUSP/MG CSV
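The "atomic alias swap" deploy boils down to POSIX rename semantics: build the new release in its own directory, then atomically repoint a symlink at it, so readers never observe a half-deployed state. A sketch of the idea (hypothetical helper; the real deploy script may differ):

```python
import os

def atomic_symlink_swap(new_release_dir: str, live_link: str) -> None:
    """Repoint `live_link` at `new_release_dir` in one atomic step.

    Illustrative sketch of the zero-downtime swap, assuming a POSIX
    filesystem where rename(2) replaces the destination atomically.
    """
    tmp = live_link + ".swap"
    if os.path.lexists(tmp):        # clean up a leftover from a crashed deploy
        os.remove(tmp)
    os.symlink(new_release_dir, tmp)  # create the new alias under a temp name
    os.replace(tmp, live_link)        # atomic: no window where the link is missing
```

Because the swap is a single rename, a rollback is just another swap back to the previous release directory.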

The hardest part: bairro name normalization

The same neighborhood appears as:

  • CENTRO HISTÓRICO
  • CENTRO HISTORICO
  • centro historico
  • CENTRO-HISTÓRICO
  • CTR HIST

Across police records, census data, and map boundaries.

We built a phonetic normalization pipeline using custom Brazilian Portuguese rules — strips accents, normalizes common abbreviations (RUA → R, AVENIDA → AV), handles compound names. This alone took about 3 weeks of iteration against real data.
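The core of that pipeline can be sketched in a few lines. This is a simplified stand-in — the abbreviation table below is a tiny illustrative subset, and the real rules are far more extensive:

```python
import re
import unicodedata

# Illustrative subset only; the real pipeline carries a much larger
# Brazilian Portuguese abbreviation table, tuned against real data.
ABBREVIATIONS = {"RUA": "R", "AVENIDA": "AV", "CENTRO": "CTR", "HISTORICO": "HIST"}

def normalize_bairro(name: str) -> str:
    """Canonicalize a neighborhood name: strip accents, unify case,
    collapse hyphens/whitespace, apply abbreviation rules. A sketch."""
    # Strip accents: decompose (NFD), then drop combining marks
    name = unicodedata.normalize("NFD", name)
    name = "".join(c for c in name if not unicodedata.combining(c))
    # Uppercase; treat hyphens and runs of whitespace as one separator
    name = re.sub(r"[-\s]+", " ", name.upper()).strip()
    # Contract token-by-token toward the canonical short form
    return " ".join(ABBREVIATIONS.get(t, t) for t in name.split(" "))
```

With this sketch, all five spellings from the list above collapse to the same key, which is exactly the property needed before joining police records against census and boundary data.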

Why SQLite, not PostgreSQL

Conventional wisdom says "use Postgres for anything serious." But SQLite with covering indexes on (state, year_month, tipo_enquadramento) serves our heatmap queries in <100ms at 3M rows. The database is ~800MB. It fits on any $5 VPS. Zero connection pooling headaches. Zero migrations for most schema changes.

We do have one performance rule: always use BETWEEN for year filters, never LIKE. SQLite uses the covering index for BETWEEN '{ano}-01' AND '{ano}-12' but not for LIKE '{ano}-%'.
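You can verify this yourself with `EXPLAIN QUERY PLAN` against a toy table shaped like ours (schema mirrors the post; the data and index name here are synthetic):

```python
import sqlite3

# In-memory demo of the BETWEEN-vs-LIKE rule.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE crimes (
    state TEXT, year_month TEXT, tipo_enquadramento TEXT, n INTEGER)""")
con.execute("""CREATE INDEX idx_cov
    ON crimes (state, year_month, tipo_enquadramento)""")

def plan(sql: str) -> str:
    """Return the query-plan detail strings for a statement."""
    return " | ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

cols = "SELECT state, year_month, tipo_enquadramento FROM crimes "
# Range predicate: the optimizer narrows on year_month via the index...
p_between = plan(cols + "WHERE state='RS' "
                        "AND year_month BETWEEN '2024-01' AND '2024-12'")
# ...while LIKE (case-insensitive by default, BINARY collation) cannot,
# so only the state=? equality uses the index.
p_like = plan(cols + "WHERE state='RS' AND year_month LIKE '2024-%'")
```

With SQLite's defaults, the first plan shows a `year_month > ? AND year_month < ?` range on the index while the second constrains only `state = ?`. (Turning on `PRAGMA case_sensitive_like` or giving the column `NOCASE` collation can let LIKE prefixes use the index, but we'd rather not depend on pragma state.)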

Open data

All datasets are published openly — everything shown on the map is also available through the API.

What's next

  • São Paulo coverage (aggregated municipal-level data already being processed)
  • API authentication for high-volume consumers
  • Historical comparison charts per location

The entire REST API is public: https://crimebrasil.com.br/api/heatmap/municipios returns municipality-level heatmap data, and https://crimebrasil.com.br/api/location-stats returns detailed stats per location.
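Building a request against those endpoints is straightforward. A sketch — note that the query parameter names (`state`, `months`) are my illustrative guesses, not documented parameters, so check the API before relying on them:

```python
from urllib.parse import urlencode

BASE = "https://crimebrasil.com.br/api"

def heatmap_url(state: str, months: int) -> str:
    """Build a request URL for the public heatmap endpoint.

    Parameter names here are assumptions for illustration only.
    """
    return f"{BASE}/heatmap/municipios?" + urlencode({"state": state, "months": months})
```

From there, any HTTP client (`urllib.request`, `requests`, plain `curl`) can fetch the JSON.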

Happy to answer questions about the data pipeline, the normalization approach, or the zero-downtime deploy architecture.


Crime Brasil: https://crimebrasil.com.br — open crime data for Brazil
