46 Real-World Hackathon Problems With Datasets and Research Papers

#opensource #hackathon #showdev #python

Here's a scenario you've probably been part of.

The hackathon starts. Your team gathers around a laptop. Someone says "let's build something with AI." Then comes the debate. Too vague. Too ambitious. Too fake. Four hours later you've decided on "something with a chatbot" because nobody had a better idea.

I've been in that room too many times. So I spent the last few months building something to fix it.

A curated collection of 46 real-world problem statements across 5 tracks, each with linked datasets, peer-reviewed research, and realistic build timelines.

The whole thing is open source on GitHub. MIT license. Free to use, fork, or contribute to.

Why most hackathon prompts fail
The typical hackathon prompt falls into one of three traps: too vague, too ambitious, or too fake.

This repo fixes all three. Every problem is grounded in actual data, backed by research, scoped to a realistic build time, and comes with clear success criteria. You know what "done" looks like before you start.

The 5 tracks at a glance

The collection has grown to 46 problems across 5 tracks. Here's what's inside.

Global South Impact (10 problems)

AI and ML problems for the developing world. Maternal health risk stratification (287K deaths per year). Public procurement fraud detection ($1.3 to $4 trillion lost annually). Offline crop disease diagnostics for 500 million farmers without internet. Groundwater depletion forecasting affecting 2 billion people.

US Civic Tech (10 problems)

Systems that still run on paper in 2026. Workers' compensation claim navigation in a $50 billion industry with zero consumer software. Medical bill decoding when 80% of bills contain errors. Public records automation for journalists. Family court assistance where 70 to 80% of people represent themselves.

India Impact (5 problems)

These are my personal favorites. Problems built on India's digital public infrastructure (DPI) layer. Mandi price intelligence through Agmarknet APIs for farmers losing 10,000 crore rupees annually to price opacity. MSME compliance copilot for 6.45 crore small businesses. Court case navigation through eCourt APIs where 52 million cases are pending. Government scheme eligibility through DigiLocker where 7.67 lakh crore rupees in schemes have low uptake.

Rapid Prototypes (11 problems)

Weekend-sized builds across public health, land records, and civic services. Village grain bank manager. School resource transparency map. Waste worker platform. Infrastructure defect reporter. Tight scope. Clear criteria. You can ship something real in a weekend.

Frontier AI Platforms (10 problems)

The newest track. Mostly healthcare problems that actually matter, plus a few from other high-stakes domains. Algorithmic bias auditing. Antimicrobial resistance surveillance. Clinical trial matching equity. Dementia caregiver decision support. Perinatal mental health screening. Wildfire risk preparedness. Youth mental health crisis triage. SMB cybersecurity compliance. Each one is hard, important, and comes with a clear path to a working prototype.

What makes this different from other collections

I've seen plenty of "X project ideas for developers" lists. Most of them are just titles. Here's what this repo does differently.

Every problem has linked data. The hardest part of any hackathon isn't coding. It's finding usable data. Most interesting datasets are locked behind paywalls or buried in government PDFs. Every problem here either links to an accessible source or tells you exactly where to get it.

{
  "track": "global-south-impact",
  "problem": "Public Procurement Fraud Detection",
  "dataset": "Transparency International / Open Contracting Data Standard",
  "papers": [
    "Decarolis et al. (2020) - Procurement corruption and firm entry",
    "Fazekas et al. (2016) - Red flags in public procurement"
  ],
  "build_time": "5-7 months",
  "success_criteria": "ML model flagging high-risk contracts with >80% precision"
}
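The success criterion in that example is a precision threshold, so it helps to see what checking it might look like. Here's a rough sketch that computes precision from a model's flags and the ground-truth labels. The function name and the toy labels are made up for illustration and are not taken from the repo.

```python
def precision(predicted_flags, true_labels):
    """Fraction of flagged contracts that were truly high-risk."""
    if len(predicted_flags) != len(true_labels):
        raise ValueError("inputs must have the same length")
    true_positives = sum(1 for p, t in zip(predicted_flags, true_labels) if p and t)
    flagged = sum(1 for p in predicted_flags if p)
    if flagged == 0:
        return 0.0  # nothing flagged; treat precision as 0 by convention
    return true_positives / flagged


# Toy data (made up): 1 = flagged / truly high-risk, 0 = not.
predicted = [1, 1, 1, 0, 1, 0, 1, 0]
actual = [1, 1, 0, 0, 1, 1, 1, 0]

score = precision(predicted, actual)
meets_criterion = score > 0.80  # the example criterion is strictly above 80%
print(f"precision={score:.2f}, meets >80% target: {meets_criterion}")
```

In this toy run, 4 of the 5 flagged contracts are truly high-risk, so precision is exactly 0.80 and the strict ">80%" target is not met yet.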

Every problem has research backing. Each statement cites peer-reviewed papers. You're not guessing whether this is a real problem. Someone has already studied it.

Every problem has a scope. Build times range from 2 weeks to 18 months. You can pick something that fits your timeline instead of overcommitting.
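If you want to filter by timeline programmatically, something like this could work. It's a minimal sketch that assumes build times are written as strings like "5-7 months" or "2 weeks" (the format in the example record), and it uses a rough 4-weeks-per-month conversion. The helper names and the second sample entry are hypothetical.

```python
import re

WEEKS_PER_UNIT = {"week": 1, "month": 4, "year": 52}  # rough conversion, an assumption


def build_time_weeks(text):
    """Return (min_weeks, max_weeks) for strings like '5-7 months' or '2 weeks'."""
    match = re.fullmatch(
        r"\s*(\d+)(?:\s*-\s*(\d+))?\s*(week|month|year)s?\s*", text.lower()
    )
    if match is None:
        raise ValueError(f"unrecognized build time: {text!r}")
    low = int(match.group(1))
    high = int(match.group(2)) if match.group(2) else low
    factor = WEEKS_PER_UNIT[match.group(3)]
    return low * factor, high * factor


def fits(problem, available_weeks):
    """True if the upper bound of the estimate fits the available time."""
    _, high = build_time_weeks(problem["build_time"])
    return high <= available_weeks


problems = [
    {"problem": "Public Procurement Fraud Detection", "build_time": "5-7 months"},
    {"problem": "Hypothetical short build", "build_time": "2 weeks"},  # placeholder entry
]
shortlist = [p["problem"] for p in problems if fits(p, available_weeks=8)]
print(shortlist)
```

Comparing against the upper bound of the range is a deliberately conservative choice. It keeps you from picking something that only fits if everything goes well.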

Getting started in 3 steps

Step one is the easiest part.

git clone https://github.com/AshayK003/hackathon-problem-statements.git
cd hackathon-problem-statements

Step two. Pick a track that matches your interests and available time. The INDEX.md file has a complete table of contents with all 46 problems searchable by track, build time, and tech stack.
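If you'd rather search the index from a script than scroll through it, here's a rough sketch. It assumes INDEX.md contains a pipe-delimited markdown table with columns like the ones below. The actual column names and layout may differ, so check the file first. The sample rows are placeholders apart from the procurement example.

```python
def parse_markdown_table(text):
    """Parse pipe-delimited table rows in text into a list of dicts keyed by header."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip anything that is not a table row
        rows.append([cell.strip() for cell in line.strip("|").split("|")])
    if len(rows) < 3:
        return []  # need a header, a separator, and at least one data row
    header, body = rows[0], rows[2:]  # rows[1] is the |---|---| separator
    return [dict(zip(header, cells)) for cells in body]


def search(rows, filters):
    """Keep rows where each filter value appears (case-insensitive) in its column."""
    return [
        row
        for row in rows
        if all(value.lower() in row.get(column, "").lower()
               for column, value in filters.items())
    ]


# Assumed layout for illustration only; tech stacks here are placeholders.
SAMPLE_INDEX = """\
| Problem | Track | Build time | Tech stack |
|---|---|---|---|
| Public Procurement Fraud Detection | Global South Impact | 5-7 months | Python |
| Example problem B | Rapid Prototypes | 2 weeks | Python |
"""

rows = parse_markdown_table(SAMPLE_INDEX)
matches = search(rows, {"Track": "rapid"})
print([row["Problem"] for row in matches])
```

To run it against the real file, you'd replace SAMPLE_INDEX with the contents of INDEX.md from your clone and adjust the column names to match.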

Step three. Each problem has its own markdown file with the full breakdown. Context. Dataset links. Research citations. Success criteria. A suggested tech stack. You can go from zero to building in the time it normally takes to decide what to build.
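To pull those pieces out of a problem file in code, a small section splitter is enough. This sketch assumes each part sits under a "## " heading with the names shown. That's my guess at the layout, not a documented format, and the sample text is abbreviated from the procurement example.

```python
def split_sections(markdown_text):
    """Map each '## ' heading to the text beneath it."""
    sections = {}
    current = None
    for line in markdown_text.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {title: "\n".join(lines).strip() for title, lines in sections.items()}


# Heading names are assumptions based on the breakdown described above.
EXPECTED = [
    "Context",
    "Dataset links",
    "Research citations",
    "Success criteria",
    "Suggested tech stack",
]

SAMPLE_PROBLEM = """\
# Public Procurement Fraud Detection

## Context
An estimated $1.3 to $4 trillion is lost annually.

## Dataset links
Transparency International / Open Contracting Data Standard

## Research citations
Fazekas et al. (2016), Red flags in public procurement

## Success criteria
ML model flagging high-risk contracts with >80% precision
"""

sections = split_sections(SAMPLE_PROBLEM)
missing = [name for name in EXPECTED if name not in sections]
print("Success criteria:", sections["Success criteria"])
print("Missing sections:", missing)
```

The sample deliberately leaves out the tech stack section so the missing-section check has something to report.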

Honest limitations

This collection is thorough but it has gaps.

The datasets are curated but not hosted. You still need to download and process them yourself. Some of the government data sources require API keys or approval.
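For the sources that need an API key, a pattern like this keeps the key out of your code and avoids downloading the same file twice. It's a sketch under assumptions: the environment variable name, the bearer-token header, and the example.org URL are all placeholders, and each provider documents its own authentication scheme.

```python
import os
import shutil
import urllib.request
from pathlib import Path


def build_request(url, env_var="DATA_API_KEY", environ=os.environ):
    """Build a request that carries an API key read from the environment."""
    api_key = environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"set {env_var} before downloading")
    # Bearer auth is an assumption; some providers expect a query parameter instead.
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


def download(request, destination):
    """Download to destination unless a local copy already exists."""
    destination = Path(destination)
    if destination.exists():
        return destination  # reuse the cached file
    with urllib.request.urlopen(request) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
    return destination


# Demo with a fake key and placeholder URL; nothing is sent over the network here.
request = build_request(
    "https://example.org/data.csv", environ={"DATA_API_KEY": "demo-key"}
)
print(request.full_url, request.get_header("Authorization"))
```

Calling download(request, "data.csv") would then fetch the file once and return the cached path on later runs.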

The Global South and India tracks are the most complete because that's where the biggest gaps in accessible problem statements existed. The Frontier AI track is the newest and still being refined.

Not every problem is a weekend build. Some of them need months. The scope is honest, which means you won't waste time on something that can't work in your timeframe.

Why this matters

The best thing about hackathons is that they prove something. You can build. You can ship. You can solve a real problem in limited time.

The worst thing is that most hackathon output gets deleted after the event because the problem wasn't real enough to sustain the project.

This collection exists because I believe the best tools should solve real problems. Open source is how we make that happen.

If you build something from this repo, I'd genuinely love to see it. Open an issue. Tag me. Send a pull request. The collection keeps growing because people contribute their own problems and improvements.