DEV Community: SentinelCipher

46 Real-World Hackathon Problems With Datasets and Research Papers

SentinelCipher — Fri, 12 Jun 2026 07:03:23 +0000

Here's a scenario you've probably been part of.

The hackathon starts. Your team gathers around a laptop. Someone says "let's build something with AI." Then comes the debate. Too vague. Too ambitious. Too fake. Four hours later you've decided on "something with a chatbot" because nobody had a better idea.

I've been in that room too many times. So I spent the last few months building something to fix it.

A curated collection of 46 real-world problem statements across 5 tracks, each with linked datasets, peer-reviewed research, and realistic build timelines.

The whole thing is open source on GitHub. MIT license. Free to use, fork, or contribute to.

Why most hackathon prompts fail
The typical hackathon prompt falls into one of three traps: too vague, too ambitious, or too fake.

This repo fixes all three. Every problem is grounded in actual data, backed by research, scoped to a realistic build time, and comes with clear success criteria. You know what "done" looks like before you start.

The 5 tracks at a glance

The collection has grown to 46 problems across 5 tracks. Here's what's inside.

Global South Impact (10 problems)

AI and ML problems for the developing world. Maternal health risk stratification (287K deaths per year). Public procurement fraud detection ($1.3 to $4 trillion lost annually). Offline crop disease diagnostics for 500 million farmers without internet. Groundwater depletion forecasting affecting 2 billion people.

US Civic Tech (10 problems)

Systems that still run on paper in 2026. Workers' compensation claim navigation in a $50 billion industry with zero consumer software. Medical bill decoding when 80% of bills contain errors. Public records automation for journalists. Family court assistance where 70 to 80% of people represent themselves.

India Impact (5 problems)

These are my personal favorites. Problems built on India's DPI layer. Mandi price intelligence through Agmarknet APIs for farmers losing 10,000 crore rupees annually to price opacity. MSME compliance copilot for 6.45 crore small businesses. Court case navigation through eCourt APIs where 52 million cases are pending. Government scheme eligibility through DigiLocker where 7.67 lakh crore rupees in schemes have low uptake.

Rapid Prototypes (11 problems)

Weekend-sized builds across public health, land records, and civic services. Village grain bank manager. School resource transparency map. Waste worker platform. Infrastructure defect reporter. Tight scope. Clear criteria. You can ship something real in a weekend.

Frontier AI Platforms (10 problems)

The newest track. Healthcare problems that actually matter. Algorithmic bias auditing. Antimicrobial resistance surveillance. Clinical trial matching equity. Dementia caregiver decision support. Perinatal mental health screening. Wildfire risk preparedness. Youth mental health crisis triage. SMB cybersecurity compliance. Each one is hard, important, and comes with a clear path to a working prototype.

What makes this different from other collections

I've seen plenty of "X project ideas for developers" lists. Most of them are just titles. Here's what this repo does differently.

Every problem has linked data. The hardest part of any hackathon isn't coding. It's finding usable data. Most interesting datasets are locked behind paywalls or buried in government PDFs. Every problem here either links to an accessible source or tells you exactly where to get it.

```json
{
  "track": "global-south-impact",
  "problem": "Public Procurement Fraud Detection",
  "dataset": "Transparency International / Open Contracting Data Standard",
  "papers": [
    "Decarolis et al. (2020): Procurement corruption and firm entry",
    "Fazekas et al. (2016): Red flags in public procurement"
  ],
  "build_time": "5-7 months",
  "success_criteria": "ML model flagging high-risk contracts with >80% precision"
}
```

Every problem has research backing. Each statement cites peer-reviewed papers. You're not guessing whether this is a real problem. Someone has already studied it.

Every problem has a scope. Build times range from 2 weeks to 18 months. You can pick something that fits your timeline instead of overcommitting.

Getting started in 3 steps

Step one is the easiest part.

```bash
git clone https://github.com/AshayK003/hackathon-problem-statements.git
cd hackathon-problem-statements
```

Step two. Pick a track that matches your interests and available time. The INDEX.md file has a complete table of contents with all 46 problems searchable by track, build time, and tech stack.

Step three. Each problem has its own markdown file with the full breakdown. Context. Dataset links. Research citations. Success criteria. A suggested tech stack. You can go from zero to building in the time it normally takes to decide what to build.

Honest limitations

This collection is thorough but it has gaps.

The datasets are curated but not hosted. You still need to download and process them yourself. Some of the government data sources require API keys or approval.

The Global South and India tracks are the most complete because that's where the biggest gaps in accessible problem statements existed. The Frontier AI track is the newest and still being refined.

Not every problem is a weekend build. Some of them need months. The scope is honest, which means you won't waste time on something that can't work in your timeframe.

Why this matters

The best thing about hackathons is that they prove something. You can build. You can ship. You can solve a real problem in limited time.

The worst thing is that most hackathon output gets deleted after the event because the problem wasn't real enough to sustain.

This collection exists because I believe the best tools should solve real problems. Open source is how we make that happen.

If you build something from this repo, I'd genuinely love to see it. Open an issue. Tag me. Send a pull request. The collection keeps growing because people contribute their own problems and improvements.

The bottom line

46 real problems. 5 tracks. Linked datasets. Research citations. Clear success criteria. All open source.

The repo is at github.com/AshayK003/hackathon-problem-statements

What would you build?

How I built DeltaGrid: a Paris Agreement gap analysis dashboard with 5 dependencies and zero paid APIs

SentinelCipher — Sun, 07 Jun 2026 05:07:22 +0000

The Paris Agreement is full of pledges. What it lacks is a simple way to see whether anyone is actually keeping them.
I built DeltaGrid to answer that. It calculates the gap between each country's NDC pledge and their actual energy transition trajectory, lets you adjust how you weight different energy sources, and shows you the result on a world map in real time.
200+ countries. 138 tests. 5 dependencies.

The normalization problem

NDCs (Nationally Determined Contributions) cannot be compared directly. Some countries pledge intensity reductions (emissions per unit of GDP), others pledge absolute cuts. Base years differ. Some pledges cover electricity only, others the full economy.
To compare them, I normalize everything into a Green Score from 0 to 100 and compute a gap against each country's pledged trajectory.

The Green Score formula

```python
green_score = sum(share_i * weight_i) / max(all_weights)
```

share_i is the percentage of a country's energy from source i. weight_i is a user-adjustable slider between 0.0 and 2.0.

Dividing by max(all_weights), rather than rescaling every score by score.max() and multiplying by 100, keeps the output on an absolute scale. This is important: if you lower the weight of coal, countries that rely on coal see their score drop visibly. The map responds to your choices in a way that actually means something.

I had this wrong in the first version. The old normalization (score / score.max() * 100) compressed all scores into 0 to 100 regardless of weight changes. Sliders felt broken because they barely moved the map. Switching to max-weight normalization fixed it immediately.

Default weights:

| Source | Weight | Why |
| --- | --- | --- |
| Solar | 1.0 | Zero emission, fastest growing |
| Wind | 1.0 | Zero emission, rapidly scaling |
| Hydro | 1.0 | Zero emission, established baseload |
| Nuclear | 0.5 | Low carbon but controversial |
| Gas | 0.2 | Fossil fuel, bridge role |
| Coal | 0.0 | Highest emission fossil |
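
To make the normalization concrete, here is a minimal sketch of the formula in plain Python. The function name, the dictionary layout, and the example country are assumptions for illustration, not DeltaGrid's actual code.

```python
# Default weights from the table above; each slider may range from 0.0 to 2.0.
DEFAULT_WEIGHTS = {
    "solar": 1.0, "wind": 1.0, "hydro": 1.0,
    "nuclear": 0.5, "gas": 0.2, "coal": 0.0,
}


def green_score(shares: dict, weights: dict) -> float:
    """Weighted sum of energy shares (in percent), divided by the largest weight."""
    max_weight = max(weights.values())
    if max_weight == 0:
        return 0.0  # every slider at zero: nothing counts as green
    weighted = sum(share * weights.get(source, 0.0)
                   for source, share in shares.items())
    return weighted / max_weight


# Hypothetical country whose shares sum to 100 percent.
example = {"solar": 10.0, "wind": 10.0, "hydro": 20.0,
           "nuclear": 10.0, "gas": 20.0, "coal": 30.0}

baseline = green_score(example, DEFAULT_WEIGHTS)
# Raising the gas slider changes the score for this country directly,
# because the result is not rescaled against other countries.
gas_friendly = green_score(example, {**DEFAULT_WEIGHTS, "gas": 1.0})
print(baseline, gas_friendly)
```

Because the divisor depends only on the sliders, one country's score never shifts just because another country's score changed.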

The gap formula

```python
expected_trajectory = linear_interpolation(
    base_value=0,
    target_value=NDC_ghg_target_percent,
    base_year=NDC_pledge_base_year,
    target_year=NDC_pledge_target_year,
    current_year=selected_year,
)

gap = actual_green_score - expected_trajectory
```

NDC data comes from the Climate Watch API. The bulk fetch is cached to disk for 24 hours. Parsing GHG targets from the raw API response is messy: values come in as ranges ("30-40%"), dashes, floats, or keywords. The _parse_ghg_percentage() function handles all of these cases and has 21 dedicated tests in test_climate_watch.py.
Classification thresholds after gap is computed:

| Class | Gap |
| --- | --- |
| Hidden Champion | > 5 |
| On Track | 0 to 5 |
| Slightly Behind | -5 to 0 |
| Laggard | < -5 |
| No Data | missing |
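
The gap calculation and the thresholds can be sketched as two small functions. This is an illustration under stated assumptions: the function names are hypothetical, progress is clamped outside the pledge window, and the boundary values (exactly 0 and exactly -5) are assigned to "On Track" and "Slightly Behind", which the table leaves open.

```python
from typing import Optional


def expected_trajectory(target_percent: float, base_year: int,
                        target_year: int, current_year: int) -> float:
    """Linear interpolation from 0 at base_year to target_percent at target_year."""
    if target_year <= base_year:
        raise ValueError("target_year must be after base_year")
    progress = (current_year - base_year) / (target_year - base_year)
    progress = min(max(progress, 0.0), 1.0)  # assumption: clamp outside the window
    return target_percent * progress


def classify(gap: Optional[float]) -> str:
    """Map a gap value onto the classes from the table above."""
    if gap is None:
        return "No Data"
    if gap > 5:
        return "Hidden Champion"
    if gap >= 0:      # assumption: a gap of exactly 0 counts as on track
        return "On Track"
    if gap >= -5:     # assumption: a gap of exactly -5 is still slightly behind
        return "Slightly Behind"
    return "Laggard"


# Hypothetical country: 40% target pledged for 2030 against a 2015 base year.
expected = expected_trajectory(40.0, 2015, 2030, 2024)
print(classify(30.0 - expected))
```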

The 5-dependency constraint

The entire app runs on: streamlit, plotly, pandas, requests, numpy.
Every dependency I considered adding had a cost: geopandas adds system-level binaries and makes cloud deployment fragile. A database adds a persistence layer the dataset does not need. An embeddings library adds an API call for something that does not require semantic search.
The dataset is 4,500 rows. Pandas in memory is the right tool.
For the world map, Plotly's px.choropleth has built-in country outlines covering 200+ countries. No GeoJSON bundling, no shapefile management, no projection configuration. It just works with an ISO-3166 alpha-3 column.
Both data sources (Our World in Data energy CSV and Climate Watch NDC API) are free and open access. No API keys required anywhere in the app.

Architecture: three layers

```text
app/        # Streamlit pages and components
src/        # Computation: scoring, gap, ranking, pipeline
src/data/   # Ingestion, caching, validation, preprocessing
```

The data flow is linear:

```text
sidebar weights + year
  -> compute_green_score()
  -> fetch_all_ndcs()
  -> compute_gap()
  -> classify_countries()
  -> choropleth + tables
```

@st.cache_data memoizes scoring, gap analysis, and choropleth figures across reruns. The OWID CSV is cached with a 1-hour TTL. NDC API responses are cached to disk for 24 hours using a simple JSON-based TTL cache in src/data/cache.py.
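
A JSON-based TTL cache on disk needs very little code. The sketch below shows one way to build it with the standard library. The class name, file layout, and key sanitizing are assumptions and do not reflect the contents of src/data/cache.py.

```python
import json
import tempfile
import time
from pathlib import Path


class JsonTTLCache:
    """Stores each entry as one JSON file with the time it was written."""

    def __init__(self, directory, ttl_seconds=24 * 3600):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.ttl_seconds = ttl_seconds

    def _path(self, key: str) -> Path:
        # Replace anything that is not alphanumeric so the key is a safe filename.
        safe = "".join(c if c.isalnum() else "_" for c in key)
        return self.directory / f"{safe}.json"

    def get(self, key: str):
        path = self._path(key)
        if not path.exists():
            return None
        entry = json.loads(path.read_text(encoding="utf-8"))
        if time.time() - entry["stored_at"] > self.ttl_seconds:
            path.unlink()  # expired: drop the file and report a miss
            return None
        return entry["value"]

    def set(self, key: str, value) -> None:
        entry = {"stored_at": time.time(), "value": value}
        self._path(key).write_text(json.dumps(entry), encoding="utf-8")


# Usage with a throwaway directory and made-up data.
with tempfile.TemporaryDirectory() as tmp:
    cache = JsonTTLCache(tmp)
    cache.set("ndc:all", {"IND": 45})
    print(cache.get("ndc:all"))
```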

Custom data upload

The sidebar accepts CSV or XLSX uploads. Column detection is fuzzy: a column called "solar_pct", "solar_share", or just "solar" all resolve to solar_share_energy. Encoding is auto-detected. ISO codes are normalized and aggregates (World, Africa, etc.) are filtered out automatically.
The upload preprocessor has 33 tests covering encoding edge cases, column normalization, ISO mapping, alternative column names, and the full preprocessing pipeline end to end.
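
Fuzzy column detection of this kind can be approximated by stripping known suffixes and looking up what remains. The sketch below is an assumption about how such a mapping might look. Only the solar aliases come from the description above; the other canonical names and the suffix list are illustrative.

```python
import re

# Canonical column per energy source. Only the solar mapping is described in
# the post; the remaining entries are assumed to follow the same pattern.
CANONICAL = {
    "solar": "solar_share_energy",
    "wind": "wind_share_energy",
    "hydro": "hydro_share_energy",
    "nuclear": "nuclear_share_energy",
    "gas": "gas_share_energy",
    "coal": "coal_share_energy",
}

# Longest suffix first so "_share_energy" wins over "_share".
SUFFIXES = ("_share_energy", "_share", "_percent", "_pct")


def normalize_column(name: str) -> str:
    """Return the canonical column name, or the input unchanged if unknown."""
    key = re.sub(r"[^a-z0-9]+", "_", name.strip().lower()).strip("_")
    for suffix in SUFFIXES:
        if key.endswith(suffix):
            key = key[: -len(suffix)]
            break
    return CANONICAL.get(key, name)


for raw in ("solar_pct", "Solar Share", "solar", "country"):
    print(raw, "->", normalize_column(raw))
```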

Testing: 138 tests across 10 modules

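| Module | Tests |
| --- | --- |
| test_upload_preprocessor.py | 33 |
| test_climate_watch.py | 21 |
| test_ranking.py | 17 |
| test_country_codes.py | 17 |
| test_validators.py | 12 |
| test_cache.py | 10 |
| test_scoring.py | 9 |
| test_gap.py | 6 |
| test_owid.py | 4 |
| test_integration.py | 8 |
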
Integration tests cover end-to-end pipeline runs, weight-specific ranking behavior, and countries with no NDC data.

What I would do differently

The NDC parsing is the messiest part of the codebase. Climate Watch API responses are inconsistent enough that the parser handles 6 different value formats. A preprocessing step that normalizes raw API responses before they enter the scoring pipeline would have made this cleaner.
I would also add confidence intervals to the gap score earlier. A country that barely has NDC data should show more uncertainty than one with a full pledge and strong historical energy data. Right now they get the same classification treatment.

Deployment

Push to GitHub, connect to Streamlit Community Cloud, set main file to app/main.py. No secrets, no environment variables, no paid services.
```bash
# Local
streamlit run app/main.py

# Dev workflow
make lint && make typecheck && make test
```

Contributing

Read AGENTS.md first. It has the full agent context including bug history, design decisions, and conventions.
The 5-dependency constraint is hard. Any PR adding a new dependency needs a strong argument. Everything else is open: new classification schemes, new data sources, better NDC parsing, UI improvements.
Repo: github.com/AshayK003/DeltaGrid
Try the app here: https://deltagrid.streamlit.app/

How I built PACE: an open source content analysis pipeline with parallel LLM batching (and what I learned)

SentinelCipher — Sun, 07 Jun 2026 04:35:07 +0000

I built PACE because I was drowning in content I needed to process.
Research papers, YouTube talks, long articles. I kept pasting things into AI chat interfaces one piece at a time, getting inconsistent output with no repeatable structure. It worked, but it did not scale and it certainly did not feel like a system.
So I built one.
PACE (Precise Analysis and Compilation of Extracts) is an open source Streamlit app that ingests content from 5 sources and outputs a structured 10-section report. This post covers the architecture decisions, what worked, and what did not.
Repo: github.com/AshayK003/PACE

The pipeline overview

```text
Input (YouTube / PDF / Article / Audio / Text)
  -> Ingest
  -> Clean + Chunk
  -> Parallel LLM Analysis (3 batches, 10 sections)
  -> Final Synthesis
  -> Markdown or PDF report
```

Every stage is modular. Ingestors live in app/ingestors/, each inheriting from BaseIngestor and implementing validate() and ingest(). Adding a new source means adding one file and inheriting the base class.
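
The ingestor contract could look roughly like the sketch below. BaseIngestor, validate(), and ingest() are named in the post, but the signatures and the PlainTextIngestor example are assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class BaseIngestor(ABC):
    """Assumed shape of the base class: check the input, then return raw text."""

    @abstractmethod
    def validate(self, source: str) -> bool:
        """Return True if this ingestor can handle the given input."""

    @abstractmethod
    def ingest(self, source: str) -> str:
        """Return the extracted text for the given input."""


class PlainTextIngestor(BaseIngestor):
    """Hypothetical ingestor for pasted text."""

    def validate(self, source: str) -> bool:
        return bool(source.strip())

    def ingest(self, source: str) -> str:
        if not self.validate(source):
            raise ValueError("empty input")
        return source.strip()


print(PlainTextIngestor().ingest("  some pasted notes  "))
```

Because the abstract methods are enforced at instantiation time, a new ingestor that forgets one of them fails immediately rather than halfway through a pipeline run.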

The ingestor choices

YouTube: youtube-transcript-api. No API key, no OAuth, just a URL. Works for anything with auto-generated or manual captions.
PDF: PyMuPDF4LLM combined with pdfplumber for table extraction. PyMuPDF4LLM runs at 0.09 seconds per page and stays under 1GB RAM, which matters a lot on Streamlit Community Cloud where memory is limited.
Articles: trafilatura. I tested several extractors against each other. trafilatura consistently had the best signal-to-noise ratio on real-world news articles and blog posts. It's not the most popular library, but it outperforms readability and newspaper3k on F1 score in published benchmarks.
Audio: faster-whisper for local speech-to-text. This tab is disabled on Streamlit Cloud because it requires local compute. Worth including for self-hosters.

Semantic chunking without embeddings

Long content needs to be chunked before going into an LLM context window. Most approaches either split naively by character count (destroys semantic coherence) or use embeddings to find meaningful boundaries (adds an API call and a vector dependency).
I used semchunk, which does semantic splitting based on sentence structure and content similarity without requiring embeddings. It keeps related content together and stays cheap to run. For a tool designed to work with free-tier LLMs this was the right call.

The parallel batching decision

This was the biggest performance unlock.
The naive approach is sequential: call the LLM, get section 1, call again, get section 2, repeat 10 times. At 2 to 3 seconds per call, that is 20 to 30 seconds minimum.
PACE groups the 10 analysis sections into 3 batches and fires them concurrently with asyncio. Each batch handles multiple sections in a single LLM call, and the 3 batches run in parallel.
Result: total analysis time dropped from 45 seconds to under 20 seconds. Around 60% faster in practice.
The tradeoff is that prompt construction gets more complex. You have to instruct the model to return multiple labeled sections in one response, then parse them back out reliably. The parser in app/analyzers/parser.py handles this and has 9 dedicated tests covering edge cases.
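
The batching pattern can be sketched with asyncio.gather and a stand-in for the LLM call. Everything named here is an assumption: the section names, the "### name" label format, and fake_llm_call are illustrative, and the real parser in app/analyzers/parser.py may work differently.

```python
import asyncio
import re

# Hypothetical section names: 10 sections grouped into 3 batches.
BATCHES = [
    ["summary", "key_points", "themes", "audience"],
    ["claims", "evidence", "gaps"],
    ["quotes", "actions", "verdict"],
]


async def fake_llm_call(sections):
    """Stand-in for one LLM request that returns several labeled sections."""
    await asyncio.sleep(0.01)  # simulates network latency
    return "\n".join(f"### {name}\nText for {name}." for name in sections)


def parse_sections(response: str) -> dict:
    """Split a response labeled with '### name' lines into {name: body}."""
    parts = re.split(r"^### (.+)$", response, flags=re.MULTILINE)
    # parts is [preamble, name1, body1, name2, body2, ...]
    return {name.strip(): body.strip()
            for name, body in zip(parts[1::2], parts[2::2])}


async def analyze(batches) -> dict:
    # All batches are in flight at once; gather preserves their order.
    responses = await asyncio.gather(*(fake_llm_call(b) for b in batches))
    report = {}
    for response in responses:
        report.update(parse_sections(response))
    return report


report = asyncio.run(analyze(BATCHES))
print(sorted(report))
```

With this shape, total latency is close to the slowest batch rather than the sum of all calls, which is where the speedup comes from.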

LLM provider strategy

I built the LLM client against the OpenAI-compatible API, which most major providers now support. This means the same client code works with Gemini, Groq, Cerebras, Mistral, DeepSeek, and OpenRouter without any provider-specific logic.
There is a built-in free tier key for people who want to try the tool without signing up anywhere. For heavier use, BYOK from the sidebar. The key stays in Streamlit session state and never hits disk.
The LRU cache (50 entries, 1 hour TTL) means re-analyzing the same content costs zero LLM calls on repeat runs.
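
An LRU cache with a TTL fits in a few lines around OrderedDict. This is a minimal sketch using the 50-entry and 1-hour figures from above. The class name, the injectable clock, and the idea of keying on a content hash are assumptions, not PACE's actual implementation.

```python
import hashlib
import time
from collections import OrderedDict


def content_key(text: str) -> str:
    """Assumed cache key: a hash of the content being analyzed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class LRUTTLCache:
    def __init__(self, max_entries=50, ttl_seconds=3600.0, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._data = OrderedDict()

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        stored_at, value = item
        if self._clock() - stored_at > self.ttl_seconds:
            del self._data[key]  # expired entry counts as a miss
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key, value):
        self._data[key] = (self._clock(), value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict the least recently used


cache = LRUTTLCache()
key = content_key("some article text")
cache.set(key, {"summary": "cached report"})
print(cache.get(key))
```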

Security was not optional
PACE makes HTTP requests based on user-supplied URLs. That is a classic SSRF vector. I added DNS resolution with IP blocking before any outbound request goes through. Private IP ranges, cloud metadata endpoints, and localhost are all blocked.
Other security layers:

File upload validates magic bytes, not just extension
50k character input cap prevents prompt stuffing
Prompt injection detection on user inputs
Error sanitization strips file paths, API keys, and internal details from any error message the user sees

All of this is covered in app/security.py with 40 tests in test_security.py.
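
The resolve-then-check step behind the SSRF protection can be sketched with the standard library. The function names are hypothetical and the exact block list in app/security.py is not shown in this post, so treat the categories below as an assumption.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_blocked_ip(ip_text: str) -> bool:
    """True for private, loopback, link-local, and other non-public addresses."""
    ip = ipaddress.ip_address(ip_text)
    return (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_reserved or ip.is_multicast or ip.is_unspecified)


def is_url_allowed(url: str) -> bool:
    """Resolve the host and reject the URL if any resolved address is blocked."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = socket.getaddrinfo(parsed.hostname, parsed.port or 443,
                                   proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return False  # unresolvable hosts are rejected
    return all(not is_blocked_ip(info[4][0]) for info in infos)


# 169.254.169.254 is the link-local address used by cloud metadata services.
for candidate in ("http://127.0.0.1/admin",
                  "http://169.254.169.254/latest/meta-data/",
                  "ftp://example.com/file"):
    print(candidate, is_url_allowed(candidate))
```

A check like this only covers the address seen at validation time. If the HTTP client resolves the hostname again when connecting, a DNS answer that changes in between can bypass it, so a stricter design connects to the address that was validated.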

Testing: 215 tests across 9 modules

| Module | Tests |
| --- | --- |
| test_analyzers.py | 30 |
| test_security.py | 40 |
| test_ingestors.py | 31 |
| test_output.py | 38 |
| test_cleaner.py | 20 |
| test_chunker.py | 10 |
| test_config.py | 14 |
| test_parser.py | 9 |
| test_integration.py | 16 |

The integration tests were the most valuable. They test full pipeline runs with various content types and failure modes. Every time I changed the batching logic or the parser, the integration tests caught regressions before I manually tested anything.

Deployment

Streamlit Community Cloud is zero cost and handles multi-user sessions automatically. Deployment steps:

Push to GitHub
Go to share.streamlit.io
Set OPENCODE_ZEN_KEY in secrets

Done. The only caveat is that audio transcription requires local compute so that tab is hidden on cloud deployments.

Contributing

The codebase is designed to be easy to extend in three specific ways:
New ingestor: add app/ingestors/my_source.py, inherit BaseIngestor, implement validate() and ingest().
New analysis step: add a prompt to app/analyzers/prompts.py, register it in ALL_PROMPTS.
New LLM preset: add an entry to the presets dict in app/ui/sidebar.py.
All contributions need tests. Run pytest before opening a PR. All 215 must pass.

What I would do differently
The prompt engineering took way longer than expected. Getting LLMs to return structured multi-section output consistently across different providers required many iterations. If I rebuilt this, I would have started with a dedicated output validation layer earlier rather than treating it as a late-stage concern.
I would also add a web scraping fallback for paywalled articles sooner. Right now trafilatura fails gracefully, but a secondary fetch strategy would improve reliability.

Links

Repo: github.com/AshayK003/PACE
MIT license. Stars and PRs welcome.

How I Built BreachAlpha: Quantifying Cybersecurity Breach Impact Using Event Study Methodology

SentinelCipher — Tue, 02 Jun 2026 13:30:30 +0000

A few months ago I kept running into the same wall while talking to security practitioners: they had solid technical evidence of a breach's severity but no credible financial number to bring to business stakeholders. I decided to fix that.
The result is BreachAlpha, an open source tool that uses event study methodology to measure how breaches move stock prices and predict severity using XGBoost.

The methodology (why it is actually rigorous)

Event study methodology comes from financial economics. The idea is simple: isolate the impact of a specific event on an asset's price by comparing actual returns to expected returns (based on the market's movement). The difference is the "abnormal return."

For breaches, the math is:
AR = R_stock - R_market
CAR = sum of AR over event window

When Equifax disclosed the 2017 breach, the market dropped that week too. Event study separates the market-wide drop from the Equifax-specific drop. The CAR over a (-5, +30) trading day window gives you the net financial impact attributable to the breach.

The market prices in company size, sector dynamics, and breach-specific context. It is more honest than parametric cost models that rely on averages.
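
The two formulas translate directly into code. Here is a minimal sketch with made-up daily returns. The function names and numbers are illustrative, and it uses the simple market-adjusted form given above (stock return minus market return), not a fitted market model.

```python
def abnormal_returns(stock_returns, market_returns):
    """AR for each day: stock return minus market return."""
    if len(stock_returns) != len(market_returns):
        raise ValueError("return series must have the same length")
    return [rs - rm for rs, rm in zip(stock_returns, market_returns)]


def cumulative_abnormal_return(ars):
    """CAR: the sum of abnormal returns over the event window."""
    return sum(ars)


# Made-up three-day window: the stock falls more than the market does.
stock = [-0.02, -0.05, 0.01]
market = [-0.01, -0.01, 0.00]

ars = abnormal_returns(stock, market)
car = cumulative_abnormal_return(ars)
print(ars, car)
```

In this example the market also fell, so only the part of the stock's decline that exceeds the market's decline is attributed to the event.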

Architecture overview

breachalpha/ FastAPI + XGBoost backend
frontend/ React + Vite + Tailwind
tests/ 144 tests, 11 modules
The feature engine computes five core signals:

Abnormal return at Day 0, 1, 5, 30
CAR over (-1,+1) and (-5,+30) windows
Volatility spike (ratio of post-breach to pre-breach realized vol)
Volume change
Recovery time in trading days

These go into an XGBoost classifier that outputs Low/Medium/High/Critical severity plus a 0-100 risk score calculated as a weighted probability sum.

Stock data pipeline

Reliable stock data is harder than it sounds. Yahoo Finance rate limits aggressively. So I built a four-source fallback chain:

```python
sources = [
    YFinanceSource(),       # primary, Chrome TLS fingerprint
    AlphaVantageSource(),   # fallback, 25 free calls/day
    NSEIndiaSource(),       # .NS/.BO tickers
    YahooScrapingSource(),  # last resort HTML scrape
]
```

Each source implements fetch() and supports_ticker(). The fetcher gates each source before calling it, so NSE India never tries to resolve a NASDAQ ticker.
Stock data is cached locally with a 24h TTL. In testing this cut API calls by around 80% on repeated runs.
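
The gating logic can be sketched with stand-in sources. fetch() and supports_ticker() are named in the post, but StaticSource, fetch_with_fallback, and the sample prices are hypothetical and exist only to show the control flow.

```python
class StaticSource:
    """Stand-in data source backed by a dictionary instead of a network call."""

    def __init__(self, name, suffixes, prices):
        self.name = name
        self.suffixes = tuple(suffixes)  # empty tuple means "any ticker"
        self.prices = prices
        self.calls = []

    def supports_ticker(self, ticker: str) -> bool:
        return not self.suffixes or ticker.endswith(self.suffixes)

    def fetch(self, ticker: str):
        self.calls.append(ticker)
        if ticker not in self.prices:
            raise LookupError(f"{self.name} has no data for {ticker}")
        return self.prices[ticker]


def fetch_with_fallback(ticker, sources):
    """Try each source that supports the ticker until one returns data."""
    errors = []
    for source in sources:
        if not source.supports_ticker(ticker):
            continue  # gate: never call a source that cannot resolve this ticker
        try:
            return source.name, source.fetch(ticker)
        except Exception as exc:  # any failure moves on to the next source
            errors.append(f"{source.name}: {exc}")
    raise LookupError(f"all sources failed for {ticker}: {errors}")


primary = StaticSource("primary", (), {})
nse = StaticSource("nse", (".NS", ".BO"), {"INFY.NS": [1500.0]})
backup = StaticSource("backup", (), {"EFX": [140.0]})

print(fetch_with_fallback("EFX", [primary, nse, backup]))
```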

Three engineering decisions worth stealing

Decouple domain exceptions from HTTP
Services raise BreachAlphaError subclasses (TickerNotFoundError, InsufficientDataError, etc.). A single global exception handler in server.py translates them to HTTP status codes. Business logic never imports from FastAPI.
This means services are fully testable without spinning up a web server, and switching frameworks later would be a one-file change.
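
The separation can be illustrated without any web framework. In the sketch below, the exception class names come from the post, while the status codes, the mapping table, and to_http_status are assumptions. In FastAPI, a function like this would typically be called from a handler registered with app.exception_handler.

```python
class BreachAlphaError(Exception):
    """Base class for domain errors; knows nothing about HTTP."""


class TickerNotFoundError(BreachAlphaError):
    pass


class InsufficientDataError(BreachAlphaError):
    pass


# Assumed mapping; the project's real status codes are not shown in the post.
STATUS_BY_ERROR = {
    TickerNotFoundError: 404,
    InsufficientDataError: 422,
    BreachAlphaError: 400,
}


def to_http_status(error: Exception) -> int:
    """Walk the class hierarchy so subclasses inherit their parent's status."""
    for cls in type(error).__mro__:
        if cls in STATUS_BY_ERROR:
            return STATUS_BY_ERROR[cls]
    return 500  # anything that is not a domain error is an internal error


def score_company(ticker: str) -> float:
    """Hypothetical service function: raises domain errors only."""
    if ticker != "EFX":
        raise TickerNotFoundError(ticker)
    return 72.5


try:
    score_company("UNKNOWN")
except BreachAlphaError as exc:
    print(to_http_status(exc))
```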
Route factories with injected dependencies
```python
def create_score_routes(limiter: Limiter) -> APIRouter:
    router = APIRouter()

    # ... route definitions

    return router
```

The rate limiter gets injected, not imported as a global. Tests pass a mock limiter. This pattern scales well as the number of route modules grows.
ProcessPoolExecutor for CPU-bound feature computation
Feature computation is CPU-heavy. Neither async/await nor threads help here, because the GIL lets only one thread run Python bytecode at a time. ProcessPoolExecutor actually parallelizes across cores:

```python
from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor() as executor:
    future = executor.submit(compute_features, price_data, breach_date)
    features = future.result()
```

On a 4-core machine this roughly halves computation time for batch scoring.

API surface

The core endpoints:
```bash
POST /api/score           # score a single company
POST /api/score/auto      # auto-search breach data then score
POST /api/explain         # step-by-step calculation breakdown
POST /api/upload/analyze  # batch score from CSV/XLSX
GET  /api/breach-search   # search breach incidents
```

Example curl:

```bash
curl -X POST http://localhost:8000/api/score \
  -H "Content-Type: application/json" \
  -d '{
    "company": "Equifax",
    "breach_type": "data_leak",
    "records_affected": 147000000,
    "breach_date": "2017-09-07"
  }'
```

Response includes risk score, severity prediction, confidence, per-class probabilities, and all the raw feature values so you can audit the calculation.

Running it locally

```bash
git clone https://github.com/AshayK003/BreachAlpha.git
cd BreachAlpha
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
uvicorn breachalpha.server:app --reload --port 8000

# separate terminal
cd frontend && npm install && npm run dev
```

Frontend at localhost:3000, backend at localhost:8000. The model bootstraps on synthetic data in about 2 seconds the first time.

What I want to improve

The biggest limitation right now is the training data. Synthetic data works for the interface and for demos but a model trained on real, labeled breach events would be significantly more accurate. If you have access to structured historical breach data (VCDB, OSF DataBreaches, similar), I would love to collaborate.
Sector-adjusted baselines are also on the list. A breach hitting a healthcare company has a different risk profile than the same breach at a retail chain, and the model should reflect that.

Contributing

The 144-test suite needs to pass. Coverage is enforced at 60% minimum. Main contribution areas right now:

Expanding the known tickers dictionary (currently 200+ companies)
Additional data sources
Real breach training data
Docker Compose setup for easier deployment

If you work in security research, quant finance, or you are building anything around cyber risk quantification, I would genuinely appreciate feedback on the methodology and the feature set.
Repo Link: https://github.com/AshayK003/BreachAlpha

I Built CausalLens — A Free, Open-Source Causal Impact Calculator for Time Series (5 Methods, Zero Setup)

SentinelCipher — Sat, 30 May 2026 16:20:50 +0000

I want to show you a tool I just open-sourced. It's called CausalLens, and it answers one specific question that most analytics stacks get completely wrong: did this intervention actually cause the change in my metric?

The problem with standard before/after analysis
Before/after comparisons are everywhere. They're also almost always misleading.

When you compare a metric before and after an intervention, you're implicitly assuming that the only thing that changed was your intervention. In practice, seasonality changes, external trends shift, unrelated events happen. The "improvement" you're seeing might have occurred anyway.

The right answer is to build a counterfactual: a statistical estimate of what would have happened if you had never intervened. The gap between that counterfactual and your observed data is your causal estimate.
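
A toy example shows why the counterfactual matters. The sketch below fits a straight-line trend to the pre-intervention data and projects it forward. This is a deliberately simplified stand-in for the models CausalLens uses, and the function names and numbers are made up for illustration.

```python
def fit_linear_trend(values):
    """Ordinary least squares fit of value against time index 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pre-intervention points")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def estimate_effect(series, intervention_index):
    """Mean gap between observed values and the projected pre-period trend."""
    pre, post = series[:intervention_index], series[intervention_index:]
    slope, intercept = fit_linear_trend(pre)
    counterfactual = [intercept + slope * t
                      for t in range(intervention_index, len(series))]
    gaps = [obs - cf for obs, cf in zip(post, counterfactual)]
    return sum(gaps) / len(gaps)


def naive_before_after(series, intervention_index):
    """The misleading comparison: mean after minus mean before."""
    pre, post = series[:intervention_index], series[intervention_index:]
    return sum(post) / len(post) - sum(pre) / len(pre)


# Made-up metric that was already growing by 2 per period before the change.
series = [10, 12, 14, 16, 18, 25, 27, 29]
print(naive_before_after(series, 5), estimate_effect(series, 5))
```

On this series the naive comparison reports an uplift of 13, while the trend-based counterfactual attributes only 5 to the intervention. The rest is growth that was already underway.
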

What CausalLens does

You provide a CSV with a time series and an intervention date. The app fits a pre-intervention model, projects it forward as the counterfactual, and reports:

Estimated effect size (absolute and percentage)
p-value for statistical significance
95% confidence interval
Plain-English interpretation
Downloadable PDF and interactive HTML reports

The 5 methods and when to use each

ARIMA ITS (Interrupted Time Series)
Best for: single series, no obvious seasonality, straightforward before/after structure. The ITS framework is well-validated in public health and economics literature for exactly this use case.

SARIMAX
Best for: data with strong seasonal patterns (weekly cycles, monthly cycles, etc.). Ignoring seasonality inflates or deflates your effect estimate badly, so this matters more than people expect.

Bayesian Structural Time Series
Best for: when you want probabilistic output and explicit uncertainty quantification rather than a point estimate. The Bayesian approach also handles structural changes in the pre-period more gracefully.

Difference-in-Differences
Best for: when you have a natural control group that didn't receive the intervention. Classic econometrics approach, still one of the most credible methods when the parallel trends assumption holds.
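
The core Difference-in-Differences estimate is simple arithmetic on four group means. This is a minimal sketch with made-up numbers. It leaves out the standard errors and parallel-trends diagnostics a real analysis needs, and the function names are illustrative.

```python
def mean(values):
    return sum(values) / len(values)


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Made-up data: both groups drift upward, the treated group by more.
effect = did_estimate(
    treated_pre=[100, 102], treated_post=[115, 117],
    control_pre=[90, 92], control_post=[95, 97],
)
print(effect)
```

The control group's change stands in for what would have happened to the treated group without the intervention, which is exactly where the parallel trends assumption enters.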

Synthetic Control
Best for: when you have multiple potential control units but no single clean control group. The method finds the optimal weighted combination of control units to build your counterfactual. Computationally the most expensive method here, and the trickiest to implement correctly on messy data.

Technical stack and deployment constraints
Everything runs on Streamlit. The whole app is designed to fit within Streamlit Community Cloud's free tier: CPU-only, 1GB RAM, no external services.

The main packages:

statsmodels for ARIMA, SARIMAX
pymc for Bayesian STS
scipy.optimize for the Synthetic Control weight solver
reportlab for PDF generation
plotly for the interactive HTML reports

One non-obvious decision: I avoided causalimpact (the Python port of the R package) because it has dependency issues on resource-constrained environments. Building the Bayesian STS from scratch with PyMC gave me more control and better stability.

The hardest part: Synthetic Control on real data
The Synthetic Control weight optimization is a quadratic program subject to simplex constraints. In theory, clean. In practice, donor pool data is often collinear, the objective surface is flat in places, and solvers behave inconsistently.

I ended up wrapping the optimizer with multiple fallback strategies and added explicit diagnostics (pre-period fit quality, effective number of donors) so users can see when the method is straining.
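
For reference, the standard form of that quadratic program can be written as follows. The symbols are introduced here for illustration: x_1 is the treated unit's vector of pre-period outcomes, X_0 is the matrix whose J columns are the donor units' pre-period outcomes, and w is the vector of donor weights. Whether CausalLens adds predictor weighting on top of this is not stated in the post.

```latex
\[
\begin{aligned}
\min_{w \in \mathbb{R}^{J}} \quad & \lVert x_1 - X_0 w \rVert_2^2 \\
\text{subject to} \quad & w_j \ge 0 \quad (j = 1, \dots, J), \\
& \sum_{j=1}^{J} w_j = 1
\end{aligned}
\]
```

When the columns of X_0 are nearly collinear, many different weight vectors give almost the same objective value, which is one reason solvers can disagree on the weights while agreeing on the fit.
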

What I'd build next

Regression Discontinuity Design is the obvious missing method. It handles the case where treatment assignment was determined by a threshold (e.g., everyone above a score threshold got the intervention). If you want to contribute that, the repo is ready for it.

Longer term, I want to add automated method selection based on data characteristics, and better guidance for users who aren't sure which method fits their situation.

Try it

Live app: https://causallens-khg4uatpmnhustajhn8mdl.streamlit.app/
GitHub: https://github.com/AshayK003/CausalLens

Feedback, issues, and PRs all welcome. The goal is to make rigorous causal analysis accessible to people who need it but don't have time to become econometricians.

I Built an Adaptive EDA Tool That Learns How You Explore Data

SentinelCipher — Thu, 28 May 2026 12:45:47 +0000

Most exploratory data analysis tools generate static reports.

You upload a dataset, get dozens of charts, scroll for a few minutes, and leave with information overload instead of actual insight.

After running into this problem repeatedly, I decided to build something different.

So I open sourced XAdaptiveEDA.

A Python + Streamlit tool that adapts its recommendations based on how you interact with your data.

GitHub: https://github.com/AshayK003/XadaptiveEDA

What Makes It Different?

Traditional EDA tools treat every dataset and every user the same way.

XAdaptiveEDA tries to behave more like an adaptive system instead of a one-time report generator.

You upload a CSV, Excel, or JSON file, and the app:

ranks analyses by relevance
tracks your feedback with 👍 and 👎 interactions
adapts future recommendations in real time
avoids repetitive analyses
prioritizes columns and patterns you explore frequently
lets you chat with your dataset using natural language

The goal was to make exploratory data analysis feel more interactive and personalized.

Features

Current capabilities include:

Core Analysis
Distribution analysis
Correlation analysis
Missing value detection
Outlier analysis
Categorical analysis
Time series analysis
Clustering
Feature importance

Adaptive Recommendation Engine

The recommendation engine combines:

data relevance
user preferences
novelty scoring
diversity penalties
temporal decay
affinity tracking
ε-greedy exploration

Instead of dumping every possible chart, the tool tries to surface the analyses most likely to matter.
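
Two of these ingredients, temporal decay and ε-greedy exploration, can be sketched in a few lines. The function names, the half-life value, and the feedback format are assumptions for illustration and are not taken from the XAdaptiveEDA codebase.

```python
import random


def decayed_score(feedback, now, half_life=10.0):
    """Sum of +1/-1 votes, each halved for every `half_life` time units of age."""
    return sum(vote * 0.5 ** ((now - t) / half_life) for t, vote in feedback)


def pick_analysis(scores, epsilon, rng):
    """With probability epsilon explore at random, otherwise exploit the best."""
    if rng.random() < epsilon:
        return rng.choice(sorted(scores))
    return max(scores, key=scores.get)


# Hypothetical feedback history as (timestamp, vote) pairs per analysis type.
history = {
    "correlation": [(0, 1), (8, 1)],
    "outliers": [(2, -1)],
    "distribution": [(9, 1)],
}
scores = {name: decayed_score(votes, now=10) for name, votes in history.items()}

rng = random.Random(42)  # seeded so the sketch is repeatable
print(pick_analysis(scores, epsilon=0.1, rng=rng))
```

The decay keeps old thumbs-up votes from dominating forever, and the small exploration rate gives analyses with little or no feedback a chance to be shown.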

Built-in AI Features

I also added optional LLM integration for:

chatting with datasets
AI-generated analysis insights
smart column naming
natural language query classification

Supported providers:

Ollama (local-first)
OpenRouter
Groq
Custom APIs

One thing I cared about heavily was privacy.

If you use Ollama locally, your data never leaves your machine.

Tech Stack

The project is intentionally lightweight.

Built with:

Streamlit
Plotly
pandas
NumPy
SQLite
Ollama

No massive infrastructure setup required.

The entire system currently runs with just 6 dependencies.

Engineering Details

Some things I focused on while building this:

explainable recommendation scoring
session persistence with SQLite
progressive sampling for large datasets
GPU acceleration support through Ollama
rate limiting for remote APIs
modular architecture
fully local workflows

The project currently has:

68 passing tests
MIT license
modular analysis pipeline
explainable scoring system

Why I Open Sourced It

I strongly believe useful developer tools should be accessible and hackable.

A lot of data tooling today feels either:

too enterprise-focused
too rigid
too expensive
or too opaque

I wanted to build something developers could actually inspect, extend, and experiment with.

What’s Next

Planned improvements include:

plugin system for custom analyses
exportable reports
dashboard mode
multi-dataset comparison
collaborative sessions

I also want to improve the recommendation quality and overall UX significantly.

Looking for Feedback

I’d genuinely love feedback from:

data scientists
Python developers
Streamlit builders
open source contributors
anyone working with exploratory analysis workflows

Especially around:

recommendation quality
UI/UX
adaptive scoring logic
real-world usability

GitHub:
https://github.com/AshayK003/XadaptiveEDA

If you find the project interesting, feel free to star the repo or contribute.
