<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: SentinelCipher</title>
    <description>The latest articles on DEV Community by SentinelCipher (@sentinelcipher).</description>
    <link>https://dev.to/sentinelcipher</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3956595%2Ff8e375a3-9f80-4d90-86f2-f6ff10cb1d04.png</url>
      <title>DEV Community: SentinelCipher</title>
      <link>https://dev.to/sentinelcipher</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sentinelcipher"/>
    <language>en</language>
    <item>
      <title>46 Real-World Hackathon Problems With Datasets and Research Papers</title>
      <dc:creator>SentinelCipher</dc:creator>
      <pubDate>Fri, 12 Jun 2026 07:03:23 +0000</pubDate>
      <link>https://dev.to/sentinelcipher/46-real-world-hackathon-problems-with-datasets-and-research-papers-10pm</link>
      <guid>https://dev.to/sentinelcipher/46-real-world-hackathon-problems-with-datasets-and-research-papers-10pm</guid>
      <description>&lt;p&gt;Here's a scenario you've probably been part of.&lt;/p&gt;

&lt;p&gt;The hackathon starts. Your team gathers around a laptop. Someone says "let's build something with AI." Then comes the debate. Too vague. Too ambitious. Too fake. Four hours later you've decided on "something with a chatbot" because nobody had a better idea.&lt;/p&gt;

&lt;p&gt;I've been in that room too many times. So I spent the last few months building something to fix it.&lt;/p&gt;

&lt;p&gt;A curated collection of 46 real-world problem statements across 5 tracks, each with linked datasets, peer-reviewed research, and realistic build timelines.&lt;/p&gt;

&lt;p&gt;The whole thing is open source on GitHub. MIT license. Free to use, fork, or contribute to.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why most hackathon prompts fail
&lt;/h2&gt;

&lt;p&gt;The typical hackathon prompt falls into one of three traps:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzdc0p2wdp505ijngvz1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzdc0p2wdp505ijngvz1.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This repo fixes all three. Every problem is grounded in actual data, backed by research, scoped to a realistic build time, and comes with clear success criteria. You know what "done" looks like before you start.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5sw3i4i8fo9xbyeztnwj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5sw3i4i8fo9xbyeztnwj.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The 5 tracks at a glance
&lt;/h2&gt;

&lt;p&gt;The collection has grown to 46 problems across 5 tracks. Here's what's inside.&lt;/p&gt;
&lt;h3&gt;
  
  
  Global South Impact (10 problems)
&lt;/h3&gt;

&lt;p&gt;AI and ML problems for the developing world. Maternal health risk stratification (287K deaths per year). Public procurement fraud detection ($1.3 to $4 trillion lost annually). Offline crop disease diagnostics for 500 million farmers without internet. Groundwater depletion forecasting affecting 2 billion people.&lt;/p&gt;
&lt;h3&gt;
  
  
  US Civic Tech (10 problems)
&lt;/h3&gt;

&lt;p&gt;Systems that still run on paper in 2026. Workers' compensation claim navigation in a $50 billion industry with zero consumer software. Medical bill decoding when 80% of bills contain errors. Public records automation for journalists. Family court assistance where 70 to 80% of people represent themselves.&lt;/p&gt;
&lt;h3&gt;
  
  
  India Impact (5 problems)
&lt;/h3&gt;

&lt;p&gt;These are my personal favorites. Problems built on India's DPI layer. Mandi price intelligence through Agmarknet APIs for farmers losing 10,000 crore rupees annually to price opacity. MSME compliance copilot for 6.45 crore small businesses. Court case navigation through eCourt APIs where 52 million cases are pending. Government scheme eligibility through DigiLocker where 7.67 lakh crore rupees in schemes have low uptake.&lt;/p&gt;
&lt;h3&gt;
  
  
  Rapid Prototypes (11 problems)
&lt;/h3&gt;

&lt;p&gt;Weekend-sized builds across public health, land records, and civic services. Village grain bank manager. School resource transparency map. Waste worker platform. Infrastructure defect reporter. Tight scope. Clear criteria. You can ship something real in a weekend.&lt;/p&gt;
&lt;h3&gt;
  
  
  Frontier AI Platforms (10 problems)
&lt;/h3&gt;

&lt;p&gt;The newest track. Healthcare problems that actually matter. Algorithmic bias auditing. Antimicrobial resistance surveillance. Clinical trial matching equity. Dementia caregiver decision support. Perinatal mental health screening. Wildfire risk preparedness. Youth mental health crisis triage. SMB cybersecurity compliance. Each one is hard, important, and comes with a clear path to a working prototype.&lt;/p&gt;
&lt;h2&gt;
  
  
  What makes this different from other collections
&lt;/h2&gt;

&lt;p&gt;I've seen plenty of "X project ideas for developers" lists. Most of them are just titles. Here's what this repo does differently.&lt;/p&gt;

&lt;p&gt;Every problem has linked data. The hardest part of any hackathon isn't coding. It's finding usable data. Most interesting datasets are locked behind paywalls or buried in government PDFs. Every problem here either links to an accessible source or tells you exactly where to get it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "track": "global-south-impact",
  "problem": "Public Procurement Fraud Detection",
  "dataset": "Transparency International / Open Contracting Data Standard",
  "papers": [
    "Decarolis et al. (2020): Procurement corruption and firm entry",
    "Fazekas et al. (2016): Red flags in public procurement"
  ],
  "build_time": "5-7 months",
  "success_criteria": "ML model flagging high-risk contracts with &amp;gt;80% precision"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Every problem has research backing. Each statement cites peer-reviewed papers. You're not guessing whether this is a real problem. Someone has already studied it.&lt;/p&gt;

&lt;p&gt;Every problem has a scope. Build times range from 2 weeks to 18 months. You can pick something that fits your timeline instead of overcommitting.&lt;/p&gt;
&lt;h2&gt;
  
  
  Getting started in 3 steps
&lt;/h2&gt;

&lt;p&gt;Step one is the easiest part.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/AshayK003/hackathon-problem-statements.git
&lt;span class="nb"&gt;cd &lt;/span&gt;hackathon-problem-statements
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step two. Pick a track that matches your interests and available time. The INDEX.md file has a complete table of contents with all 46 problems searchable by track, build time, and tech stack.&lt;/p&gt;

&lt;p&gt;Step three. Each problem has its own markdown file with the full breakdown. Context. Dataset links. Research citations. Success criteria. A suggested tech stack. You can go from zero to building in the time it normally takes to decide what to build.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest limitations
&lt;/h2&gt;

&lt;p&gt;This collection is thorough but it has gaps.&lt;/p&gt;

&lt;p&gt;The datasets are curated but not hosted. You still need to download and process them yourself. Some of the government data sources require API keys or approval.&lt;/p&gt;

&lt;p&gt;The Global South and India tracks are the most complete because that's where the biggest gaps in accessible problem statements existed. The Frontier AI track is the newest and still being refined.&lt;/p&gt;

&lt;p&gt;Not every problem is a weekend build. Some of them need months. The scope is honest, which means you won't waste time on something that can't work in your timeframe.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;The best thing about hackathons is that they prove something. You can build. You can ship. You can solve a real problem in limited time.&lt;/p&gt;

&lt;p&gt;The worst thing is that most hackathon output gets deleted after the event because the problem wasn't real enough to sustain.&lt;/p&gt;

&lt;p&gt;This collection exists because I believe the best tools should solve real problems. Open source is how we make that happen.&lt;/p&gt;

&lt;p&gt;If you build something from this repo, I'd genuinely love to see it. Open an issue. Tag me. Send a pull request. The collection keeps growing because people contribute their own problems and improvements.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;46 real problems. 5 tracks. Linked datasets. Research citations. Clear success criteria. All open source.&lt;/p&gt;

&lt;p&gt;The repo is at &lt;a href="https://github.com/AshayK003/hackathon-problem-statements"&gt;github.com/AshayK003/hackathon-problem-statements&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;What would you build?&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>hackathon</category>
      <category>showdev</category>
      <category>python</category>
    </item>
    <item>
      <title>How I built DeltaGrid: a Paris Agreement gap analysis dashboard with 5 dependencies and zero paid APIs</title>
      <dc:creator>SentinelCipher</dc:creator>
      <pubDate>Sun, 07 Jun 2026 05:07:22 +0000</pubDate>
      <link>https://dev.to/sentinelcipher/how-i-built-deltagrid-a-paris-agreement-gap-analysis-dashboard-with-5-dependencies-and-zero-paid-54n2</link>
      <guid>https://dev.to/sentinelcipher/how-i-built-deltagrid-a-paris-agreement-gap-analysis-dashboard-with-5-dependencies-and-zero-paid-54n2</guid>
      <description>&lt;p&gt;The Paris Agreement is full of pledges. What it lacks is a simple way to see whether anyone is actually keeping them.&lt;br&gt;
I built DeltaGrid to answer that. It calculates the gap between each country's NDC pledge and their actual energy transition trajectory, lets you adjust how you weight different energy sources, and shows you the result on a world map in real time.&lt;br&gt;
200+ countries. 138 tests. 5 dependencies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7c2mdvx4jm002b632ox0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7c2mdvx4jm002b632ox0.png" alt=" " width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The normalization problem
&lt;/h2&gt;

&lt;p&gt;NDCs (Nationally Determined Contributions) cannot be compared directly. Some countries pledge intensity reductions (emissions per unit of GDP), others pledge absolute cuts. Base years differ. Some pledges cover electricity only, others the full economy.&lt;br&gt;
To compare them, I normalize everything into a Green Score from 0 to 100 and compute a gap against each country's pledged trajectory.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Green Score formula
&lt;/h2&gt;

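&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;green_score = sum(share_i * weight_i) / max(all_weights)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;share_i is the percentage of a country's energy from source i. weight_i is a user-adjustable slider between 0.0 and 2.0.&lt;/p&gt;

&lt;p&gt;Dividing by max(all_weights) instead of rescaling with score / score.max() * 100 keeps the output on an absolute scale. This is important: if you lower the weight of coal, countries that rely on coal see their score drop visibly. The map responds to your choices in a way that actually means something.&lt;/p&gt;

&lt;p&gt;I had this wrong in the first version. The old normalization (score / score.max() * 100) stretched every result so the top country always sat at 100, regardless of weight changes. Sliders felt broken because they barely moved the map. Switching to max-weight normalization fixed it immediately.&lt;/p&gt;

&lt;p&gt;Default weights:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Source&lt;/th&gt;&lt;th&gt;Weight&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Solar&lt;/td&gt;&lt;td&gt;1.0&lt;/td&gt;&lt;td&gt;Zero emission, fastest growing&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Wind&lt;/td&gt;&lt;td&gt;1.0&lt;/td&gt;&lt;td&gt;Zero emission, rapidly scaling&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Hydro&lt;/td&gt;&lt;td&gt;1.0&lt;/td&gt;&lt;td&gt;Zero emission, established baseload&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Nuclear&lt;/td&gt;&lt;td&gt;0.5&lt;/td&gt;&lt;td&gt;Low carbon but controversial&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Gas&lt;/td&gt;&lt;td&gt;0.2&lt;/td&gt;&lt;td&gt;Fossil fuel, bridge role&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Coal&lt;/td&gt;&lt;td&gt;0.0&lt;/td&gt;&lt;td&gt;Highest emission fossil&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;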
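&lt;p&gt;To make the formula concrete, here is a minimal sketch in plain Python. It is not the code from the repo: the function name, the dictionary layout, and the example country are assumptions for illustration, and only the formula and the default weights come from the text above.&lt;/p&gt;

```python
# Default weights from the post; in the app each one is a slider in [0.0, 2.0].
DEFAULT_WEIGHTS = {
    "solar": 1.0,
    "wind": 1.0,
    "hydro": 1.0,
    "nuclear": 0.5,
    "gas": 0.2,
    "coal": 0.0,
}


def green_score(shares, weights=DEFAULT_WEIGHTS):
    """Weighted sum of energy shares (in percent), divided by the largest weight."""
    max_weight = max(weights.values())
    if max_weight == 0:
        return 0.0  # assumption: with every slider at zero, nothing counts as green
    weighted = sum(shares.get(source, 0.0) * w for source, w in weights.items())
    return weighted / max_weight


# Hypothetical country: shares are percentages and sum to 100.
example_shares = {"solar": 10.0, "wind": 10.0, "hydro": 20.0,
                  "nuclear": 20.0, "gas": 20.0, "coal": 20.0}
print(green_score(example_shares))
```

&lt;p&gt;Because the divisor is the largest weight rather than the largest score, changing one slider changes the scores of the countries that depend on that source, instead of being rescaled away.&lt;/p&gt;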

&lt;h2&gt;
  
  
  The gap formula
&lt;/h2&gt;

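&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;gap = actual_green_score - expected_trajectory

expected_trajectory = linear_interpolation(
    base_value=0,
    target_value=NDC_ghg_target_percent,
    base_year=NDC_pledge_base_year,
    target_year=NDC_pledge_target_year,
    current_year=selected_year
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;NDC data comes from the Climate Watch API. The bulk fetch is cached to disk for 24 hours. Parsing GHG targets from the raw API response is messy: values come in as ranges ("30-40%"), dashes, floats, or keywords. The _parse_ghg_percentage() function handles all of these cases and has 21 dedicated tests in test_climate_watch.py.&lt;/p&gt;

&lt;p&gt;Classification thresholds after the gap is computed:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Class&lt;/th&gt;&lt;th&gt;Gap&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Hidden Champion&lt;/td&gt;&lt;td&gt;&amp;gt; 5&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;On Track&lt;/td&gt;&lt;td&gt;0 to 5&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Slightly Behind&lt;/td&gt;&lt;td&gt;-5 to 0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Laggard&lt;/td&gt;&lt;td&gt;&amp;lt; -5&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;No Data&lt;/td&gt;&lt;td&gt;missing&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;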
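&lt;p&gt;The interpolation and the thresholds above can be sketched in a few lines. Treat this as an illustration under assumptions rather than the repo's implementation: the function names are mine, clamping outside the pledge window is a guess, and so is which class a gap of exactly 0 or -5 falls into, since the thresholds do not say.&lt;/p&gt;

```python
def expected_trajectory(target_percent, base_year, target_year, current_year):
    """Linear path from 0 at base_year to target_percent at target_year."""
    if target_year == base_year:
        return float(target_percent)
    progress = (current_year - base_year) / (target_year - base_year)
    progress = max(0.0, min(1.0, progress))  # assumption: clamp outside the pledge window
    return target_percent * progress


def classify(gap):
    """Map a gap to a class; handling of exact boundary values is an assumption."""
    if gap is None:
        return "No Data"
    if gap > 5:
        return "Hidden Champion"
    if gap >= 0:
        return "On Track"
    if gap >= -5:
        return "Slightly Behind"
    return "Laggard"


# Hypothetical pledge: 40% by 2030 against a 2010 base year, evaluated for 2020.
expected = expected_trajectory(40.0, 2010, 2030, 2020)  # halfway through the window
gap = 27.5 - expected                                    # made-up green score of 27.5
print(expected, gap, classify(gap))
```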

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxhfs01g9mewska0w6cu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxhfs01g9mewska0w6cu.png" alt=" " width="799" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The 5-dependency constraint
&lt;/h2&gt;

&lt;p&gt;The entire app runs on: streamlit, plotly, pandas, requests, numpy.&lt;br&gt;
Every dependency I considered adding had a cost: geopandas adds system-level binaries and makes cloud deployment fragile. A database adds a persistence layer the dataset does not need. An embeddings library adds an API call for something that does not require semantic search.&lt;br&gt;
The dataset is 4,500 rows. Pandas in memory is the right tool.&lt;br&gt;
For the world map, Plotly's px.choropleth has built-in country outlines covering 200+ countries. No GeoJSON bundling, no shapefile management, no projection configuration. It just works with an ISO-3166 alpha-3 column.&lt;br&gt;
Both data sources (Our World in Data energy CSV and Climate Watch NDC API) are free and open access. No API keys required anywhere in the app.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp3in5lq0eb974bqvfqyk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp3in5lq0eb974bqvfqyk.png" alt=" " width="799" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture: three layers
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;app/          # Streamlit pages and components
src/          # Computation: scoring, gap, ranking, pipeline
src/data/     # Ingestion, caching, validation, preprocessing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The data flow is linear:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sidebar weights + year
  -&amp;gt; compute_green_score()
  -&amp;gt; fetch_all_ndcs()
  -&amp;gt; compute_gap()
  -&amp;gt; classify_countries()
  -&amp;gt; choropleth + tables
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;@st.cache_data memoizes scoring, gap analysis, and choropleth figures across reruns. The OWID CSV is cached with a 1-hour TTL. NDC API responses are cached to disk for 24 hours using a simple JSON-based TTL cache in src/data/cache.py.&lt;/p&gt;
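&lt;p&gt;For readers who have not written a disk cache with a TTL before, this is roughly what one can look like. It is a sketch, not the contents of src/data/cache.py: the class name, the one-file-per-key layout, and the payload are all assumptions, and keys are assumed to be safe to use as file names.&lt;/p&gt;

```python
import json
import tempfile
import time
from pathlib import Path


class JsonTTLCache:
    """Stores one JSON file per key and treats entries older than ttl_seconds as missing."""

    def __init__(self, directory, ttl_seconds=24 * 3600):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.ttl_seconds = ttl_seconds

    def _path(self, key):
        return self.directory / f"{key}.json"

    def get(self, key):
        path = self._path(key)
        if not path.exists():
            return None
        entry = json.loads(path.read_text(encoding="utf-8"))
        if time.time() - entry["saved_at"] > self.ttl_seconds:
            return None  # expired: the caller should refetch
        return entry["value"]

    def set(self, key, value):
        entry = {"saved_at": time.time(), "value": value}
        self._path(key).write_text(json.dumps(entry), encoding="utf-8")


# Usage with a throwaway directory and a placeholder payload.
cache = JsonTTLCache(tempfile.mkdtemp())
cache.set("ndc_bulk", {"XYZ": "30-40%"})
print(cache.get("ndc_bulk"))
```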

&lt;h2&gt;
  
  
  Custom data upload
&lt;/h2&gt;

&lt;p&gt;The sidebar accepts CSV or XLSX uploads. Column detection is fuzzy: a column called "solar_pct", "solar_share", or just "solar" all resolve to solar_share_energy. Encoding is auto-detected. ISO codes are normalized and aggregates (World, Africa, etc.) are filtered out automatically.&lt;br&gt;
The upload preprocessor has 33 tests covering encoding edge cases, column normalization, ISO mapping, alternative column names, and the full preprocessing pipeline end to end.&lt;/p&gt;
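&lt;p&gt;One simple way to get this kind of column matching is to normalize the name and strip known suffixes. The sketch below is rule-based and probably simpler than what the repo does; the suffix list, the source list, and the function name are assumptions, and only the solar examples come from the text above.&lt;/p&gt;

```python
import re

# Assumed suffixes and sources; the canonical form is "{source}_share_energy".
SUFFIXES = ("_share_energy", "_share", "_pct", "_percent")
SOURCES = ("solar", "wind", "hydro", "nuclear", "gas", "coal")


def normalize_column(name):
    """Map a loosely named column to its canonical name, or None if unrecognized."""
    cleaned = re.sub(r"[^a-z0-9]+", "_", name.strip().lower()).strip("_")
    for suffix in SUFFIXES:
        if cleaned.endswith(suffix):
            cleaned = cleaned[: -len(suffix)]
            break
    return f"{cleaned}_share_energy" if cleaned in SOURCES else None


for raw in ("solar_pct", "Solar Share", "solar", "population"):
    print(raw, "->", normalize_column(raw))
```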

&lt;h2&gt;
  
  
  Testing: 138 tests across 10 modules
&lt;/h2&gt;

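&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Module&lt;/th&gt;&lt;th&gt;Tests&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;test_upload_preprocessor.py&lt;/td&gt;&lt;td&gt;33&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;test_climate_watch.py&lt;/td&gt;&lt;td&gt;21&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;test_ranking.py&lt;/td&gt;&lt;td&gt;17&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;test_country_codes.py&lt;/td&gt;&lt;td&gt;17&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;test_validators.py&lt;/td&gt;&lt;td&gt;12&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;test_cache.py&lt;/td&gt;&lt;td&gt;10&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;test_scoring.py&lt;/td&gt;&lt;td&gt;9&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;test_gap.py&lt;/td&gt;&lt;td&gt;6&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;test_owid.py&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;test_integration.py&lt;/td&gt;&lt;td&gt;8&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Integration tests cover end-to-end pipeline runs, weight-specific ranking behavior, and countries with no NDC data.&lt;/p&gt;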

&lt;h2&gt;
  
  
  What I would do differently
&lt;/h2&gt;

&lt;p&gt;The NDC parsing is the messiest part of the codebase. Climate Watch API responses are inconsistent enough that the parser handles 6 different value formats. A preprocessing step that normalizes raw API responses before they enter the scoring pipeline would have made this cleaner.&lt;br&gt;
I would also add confidence intervals to the gap score earlier. A country that barely has NDC data should show more uncertainty than one with a full pledge and strong historical energy data. Right now they get the same classification treatment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment
&lt;/h2&gt;

&lt;p&gt;Push to GitHub, connect to Streamlit Community Cloud, set main file to app/main.py. No secrets, no environment variables, no paid services.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Local
streamlit run app/main.py

# Dev workflow
make lint &amp;amp;&amp;amp; make typecheck &amp;amp;&amp;amp; make test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Contributing
&lt;/h2&gt;

&lt;p&gt;Read AGENTS.md first. It has the full agent context including bug history, design decisions, and conventions.&lt;br&gt;
The 5-dependency constraint is hard. Any PR adding a new dependency needs a strong argument. Everything else is open: new classification schemes, new data sources, better NDC parsing, UI improvements.&lt;br&gt;
Repo: github.com/AshayK003/DeltaGrid&lt;br&gt;
Try the app here: &lt;a href="https://deltagrid.streamlit.app/" rel="noopener noreferrer"&gt;https://deltagrid.streamlit.app/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>opensource</category>
      <category>climate</category>
      <category>showdev</category>
    </item>
    <item>
      <title>How I built PACE: an open source content analysis pipeline with parallel LLM batching (and what I learned)</title>
      <dc:creator>SentinelCipher</dc:creator>
      <pubDate>Sun, 07 Jun 2026 04:35:07 +0000</pubDate>
      <link>https://dev.to/sentinelcipher/how-i-built-pace-an-open-source-content-analysis-pipeline-with-parallel-llm-batching-and-what-i-2moh</link>
      <guid>https://dev.to/sentinelcipher/how-i-built-pace-an-open-source-content-analysis-pipeline-with-parallel-llm-batching-and-what-i-2moh</guid>
      <description>&lt;p&gt;I built PACE because I was drowning in content I needed to process.&lt;br&gt;
Research papers, YouTube talks, long articles. I kept pasting things into AI chat interfaces one piece at a time, getting inconsistent output with no repeatable structure. It worked, but it did not scale and it certainly did not feel like a system.&lt;br&gt;
So I built one.&lt;br&gt;
PACE (Precise Analysis and Compilation of Extracts) is an open source Streamlit app that ingests content from 5 sources and outputs a structured 10-section report. This post covers the architecture decisions, what worked, and what did not.&lt;br&gt;
Repo: github.com/AshayK003/PACE&lt;/p&gt;

&lt;h2&gt;
  
  
  The pipeline overview
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input (YouTube / PDF / Article / Audio / Text)
    -&amp;gt; Ingest
    -&amp;gt; Clean + Chunk
    -&amp;gt; Parallel LLM Analysis (3 batches, 10 sections)
    -&amp;gt; Final Synthesis
    -&amp;gt; Markdown or PDF report
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Every stage is modular. Ingestors live in app/ingestors/, each inheriting from BaseIngestor and implementing validate() and ingest(). Adding a new source means adding one file and inheriting the base class.&lt;/p&gt;
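&lt;p&gt;The base-class pattern described above looks roughly like this. The method names validate() and ingest() come from the post; the signatures, the return types, and the PlainTextIngestor example are assumptions made for illustration, so check the repo before relying on them.&lt;/p&gt;

```python
from abc import ABC, abstractmethod


class BaseIngestor(ABC):
    """Contract every source follows: check the input, then turn it into text."""

    @abstractmethod
    def validate(self, source):
        """Return True when this ingestor can handle the given input."""

    @abstractmethod
    def ingest(self, source):
        """Return the extracted plain text."""


class PlainTextIngestor(BaseIngestor):
    """Hypothetical example source: raw text pasted by the user."""

    def validate(self, source):
        return isinstance(source, str) and bool(source.strip())

    def ingest(self, source):
        if not self.validate(source):
            raise ValueError("empty or non-text input")
        return source.strip()


print(PlainTextIngestor().ingest("  some pasted notes  "))
```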

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frk84mdd18g85alvzrs30.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frk84mdd18g85alvzrs30.png" alt=" " width="799" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The ingestor choices
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;YouTube: youtube-transcript-api. No API key, no OAuth, just a URL. Works for anything with auto-generated or manual captions.&lt;/li&gt;
&lt;li&gt;PDF: PyMuPDF4LLM combined with pdfplumber for table extraction. PyMuPDF4LLM runs at 0.09 seconds per page and stays under 1GB RAM, which matters a lot on Streamlit Community Cloud where memory is limited.&lt;/li&gt;
&lt;li&gt;Articles: trafilatura. I tested several extractors against each other. trafilatura consistently had the best signal-to-noise ratio on real-world news articles and blog posts. It's not the most popular library but it outperforms readability and newspaper3k on F1 score in published benchmarks.&lt;/li&gt;
&lt;li&gt;Audio: faster-whisper for local speech-to-text. This tab is disabled on Streamlit Cloud because it requires local compute. Worth including for self-hosters.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Semantic chunking without embeddings
&lt;/h2&gt;

&lt;p&gt;Long content needs to be chunked before going into an LLM context window. Most approaches either split naively by character count (destroys semantic coherence) or use embeddings to find meaningful boundaries (adds an API call and a vector dependency).&lt;br&gt;
I used semchunk, which does semantic splitting based on sentence structure and content similarity without requiring embeddings. It keeps related content together and stays cheap to run. For a tool designed to work with free-tier LLMs this was the right call.&lt;/p&gt;

&lt;h2&gt;
  
  
  The parallel batching decision
&lt;/h2&gt;

&lt;p&gt;This was the biggest performance unlock.&lt;br&gt;
The naive approach is sequential: call the LLM, get section 1, call again, get section 2, repeat 10 times. At 2 to 3 seconds per call, that is 20 to 30 seconds minimum.&lt;br&gt;
PACE groups the 10 analysis sections into 3 batches and fires them concurrently with asyncio. Each batch handles multiple sections in a single LLM call, and the 3 batches run in parallel.&lt;br&gt;
Result: total analysis time dropped from 45 seconds to under 20 seconds. Around 60% faster in practice.&lt;br&gt;
The tradeoff is that prompt construction gets more complex. You have to instruct the model to return multiple labeled sections in one response, then parse them back out reliably. The parser in app/analyzers/parser.py handles this and has 9 dedicated tests covering edge cases.&lt;/p&gt;
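&lt;p&gt;Here is a stripped-down sketch of the concurrency pattern, with the LLM call replaced by a short sleep so it runs anywhere. The section names, the 4/3/3 grouping, and the function names are placeholders I made up; the real prompts and grouping live in the repo.&lt;/p&gt;

```python
import asyncio

# Hypothetical section names and grouping into three batches.
SECTIONS = [f"section_{i}" for i in range(1, 11)]
BATCHES = [SECTIONS[0:4], SECTIONS[4:7], SECTIONS[7:10]]


async def call_llm(batch):
    """Stand-in for one LLM request that returns every section in the batch."""
    await asyncio.sleep(0.1)  # simulated network latency
    return {name: f"analysis for {name}" for name in batch}


async def analyze():
    # All three requests are in flight at once, so the wait is roughly the slowest one.
    results = await asyncio.gather(*(call_llm(batch) for batch in BATCHES))
    merged = {}
    for partial in results:
        merged.update(partial)
    return merged


report = asyncio.run(analyze())
print(len(report))
```

&lt;p&gt;asyncio.gather returns results in the order the calls were passed in, so the merged report keeps the sections in their original order even when the batches finish out of order.&lt;/p&gt;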

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F677rhn09cy6sff0yvyuh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F677rhn09cy6sff0yvyuh.png" alt=" " width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  LLM provider strategy
&lt;/h2&gt;

&lt;p&gt;I built the LLM client against the OpenAI-compatible API interface which every major provider now supports. This means the same client code works with Gemini, Groq, Cerebras, Mistral, DeepSeek, and OpenRouter without any provider-specific logic.&lt;br&gt;
There is a built-in free tier key for people who want to try the tool without signing up anywhere. For heavier use, BYOK from the sidebar. The key stays in Streamlit session state and never hits disk.&lt;br&gt;
The LRU cache (50 entries, 1 hour TTL) means re-analyzing the same content costs zero LLM calls on repeat runs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwff8xrgjzoc3qq3ev214.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwff8xrgjzoc3qq3ev214.png" alt=" " width="800" height="595"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Security was not optional
&lt;/h2&gt;

&lt;p&gt;PACE makes HTTP requests based on user-supplied URLs. That is a classic SSRF vector. I added DNS resolution with IP blocking before any outbound request goes through. Private IP ranges, cloud metadata endpoints, and localhost are all blocked.&lt;/p&gt;

&lt;p&gt;Other security layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;File upload validates magic bytes, not just extension&lt;/li&gt;
&lt;li&gt;50k character input cap prevents prompt stuffing&lt;/li&gt;
&lt;li&gt;Prompt injection detection on user inputs&lt;/li&gt;
&lt;li&gt;Error sanitization strips file paths, API keys, and internal details from any error message the user sees&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of this is covered in app/security.py with 40 tests in test_security.py.&lt;/p&gt;
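&lt;p&gt;The resolve-then-block idea can be sketched with the standard library alone. This is not the code in app/security.py; the function names and the exact set of blocked categories are assumptions. It only shows the shape of the check: resolve the host, then refuse if any resulting address is not public.&lt;/p&gt;

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_blocked_ip(ip_text):
    """True for loopback, private, link-local, and other non-public addresses."""
    ip = ipaddress.ip_address(ip_text)
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local  # covers 169.254.169.254, a common cloud metadata address
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def is_url_allowed(url):
    """Resolve the host first and reject the URL if any resolved address is blocked."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = socket.getaddrinfo(parsed.hostname, parsed.port or 443)
    except socket.gaierror:
        return False
    return not any(is_blocked_ip(info[4][0]) for info in infos)


print(is_blocked_ip("169.254.169.254"), is_blocked_ip("10.0.0.5"), is_blocked_ip("8.8.8.8"))
```

&lt;p&gt;A check like this is only as good as the request that follows it: if the HTTP client resolves the hostname again, the answer can differ, so it is safer to connect to the address that was actually checked.&lt;/p&gt;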

&lt;h2&gt;
  
  
  Testing: 215 tests across 9 modules
&lt;/h2&gt;

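&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Module&lt;/th&gt;&lt;th&gt;Tests&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;test_analyzers.py&lt;/td&gt;&lt;td&gt;30&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;test_security.py&lt;/td&gt;&lt;td&gt;40&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;test_ingestors.py&lt;/td&gt;&lt;td&gt;31&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;test_output.py&lt;/td&gt;&lt;td&gt;38&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;test_cleaner.py&lt;/td&gt;&lt;td&gt;20&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;test_chunker.py&lt;/td&gt;&lt;td&gt;10&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;test_config.py&lt;/td&gt;&lt;td&gt;14&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;test_parser.py&lt;/td&gt;&lt;td&gt;9&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;test_integration.py&lt;/td&gt;&lt;td&gt;16&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The integration tests were the most valuable. They test full pipeline runs with various content types and failure modes. Every time I changed the batching logic or the parser, the integration tests caught regressions before I manually tested anything.&lt;/p&gt;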

&lt;h2&gt;
  
  
  Deployment
&lt;/h2&gt;

&lt;p&gt;Streamlit Community Cloud is zero cost and handles multi-user sessions automatically. Deployment steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Push to GitHub&lt;/li&gt;
&lt;li&gt;Go to share.streamlit.io&lt;/li&gt;
&lt;li&gt;Set OPENCODE_ZEN_KEY in secrets&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Done. The only caveat is that audio transcription requires local compute so that tab is hidden on cloud deployments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Contributing
&lt;/h2&gt;

&lt;p&gt;The codebase is designed to be easy to extend in three specific ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New ingestor: add app/ingestors/my_source.py, inherit BaseIngestor, implement validate() and ingest().&lt;/li&gt;
&lt;li&gt;New analysis step: add a prompt to app/analyzers/prompts.py, register it in ALL_PROMPTS.&lt;/li&gt;
&lt;li&gt;New LLM preset: add an entry to the presets dict in app/ui/sidebar.py.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All contributions need tests. Run pytest before opening a PR. All 215 must pass.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would do differently
&lt;/h2&gt;

&lt;p&gt;The prompt engineering took way longer than expected. Getting LLMs to return structured multi-section output consistently across different providers required many iterations. If I rebuilt this, I would have started with a dedicated output validation layer earlier rather than treating it as a late-stage concern.&lt;/p&gt;

&lt;p&gt;I would also add a web scraping fallback for paywalled articles sooner. Right now trafilatura fails gracefully, but a secondary fetch strategy would improve reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;p&gt;Repo: github.com/AshayK003/PACE&lt;br&gt;
MIT license. Stars and PRs welcome.&lt;/p&gt;

</description>
      <category>python</category>
      <category>opensource</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
    <item>
      <title>How I Built BreachAlpha: Quantifying Cybersecurity Breach Impact Using Event Study Methodology</title>
      <dc:creator>SentinelCipher</dc:creator>
      <pubDate>Tue, 02 Jun 2026 13:30:30 +0000</pubDate>
      <link>https://dev.to/sentinelcipher/how-i-built-breachalpha-quantifying-cybersecurity-breach-impact-using-event-study-methodology-4cm7</link>
      <guid>https://dev.to/sentinelcipher/how-i-built-breachalpha-quantifying-cybersecurity-breach-impact-using-event-study-methodology-4cm7</guid>
      <description>&lt;p&gt;A few months ago I kept running into the same wall while talking to security practitioners: they had solid technical evidence of a breach's severity but no credible financial number to bring to business stakeholders. I decided to fix that.&lt;br&gt;
The result is BreachAlpha, an open source tool that uses event study methodology to measure how breaches move stock prices and predict severity using XGBoost.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffjrwmcwyd36u8rlppp9f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffjrwmcwyd36u8rlppp9f.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The methodology (why it is actually rigorous)
&lt;/h3&gt;

&lt;p&gt;Event study methodology comes from financial economics. The idea is simple: isolate the impact of a specific event on an asset's price by comparing actual returns to expected returns (based on the market's movement). The difference is the "abnormal return."&lt;/p&gt;

&lt;p&gt;For breaches, the math is:&lt;br&gt;
AR = R_stock - R_market&lt;br&gt;
CAR = sum of AR over event window&lt;/p&gt;

&lt;p&gt;When Equifax disclosed the 2017 breach, the market dropped that week too. Event study separates the market-wide drop from the Equifax-specific drop. The CAR over a (-5, +30) trading day window gives you the net financial impact attributable to the breach.&lt;/p&gt;
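&lt;p&gt;As a quick worked example of the two formulas above, here is the arithmetic in plain Python. The function names are mine and the returns are made-up numbers, not real market data; the point is only to show how AR and CAR relate.&lt;/p&gt;

```python
def abnormal_returns(stock_returns, market_returns):
    """AR for each day: the stock's return minus the market's return."""
    return [rs - rm for rs, rm in zip(stock_returns, market_returns)]


def cumulative_abnormal_return(stock_returns, market_returns):
    """CAR: the sum of daily abnormal returns over the event window."""
    return sum(abnormal_returns(stock_returns, market_returns))


# Made-up daily returns for a three-day window, as fractions (0.01 = 1%).
stock = [-0.040, -0.020, 0.010]
market = [-0.010, 0.000, 0.005]
print(round(cumulative_abnormal_return(stock, market), 4))
```

&lt;p&gt;In this toy window the stock fell 5% in total while the market fell 0.5%, so the part attributed to the event is the 4.5% difference.&lt;/p&gt;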

&lt;p&gt;The market prices in company size, sector dynamics, and breach-specific context. It is more honest than parametric cost models that rely on averages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture overview
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;breachalpha/        FastAPI + XGBoost backend
frontend/           React + Vite + Tailwind
tests/              144 tests, 11 modules
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The feature engine computes five core signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Abnormal return at Day 0, 1, 5, 30&lt;/li&gt;
&lt;li&gt;CAR over (-1,+1) and (-5,+30) windows&lt;/li&gt;
&lt;li&gt;Volatility spike (ratio of post-breach to pre-breach realized vol)&lt;/li&gt;
&lt;li&gt;Volume change&lt;/li&gt;
&lt;li&gt;Recovery time in trading days&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The five signals feed an XGBoost classifier that outputs a Low/Medium/High/Critical severity plus a 0-100 risk score calculated as a weighted sum of the class probabilities.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stock data pipeline
&lt;/h3&gt;

&lt;p&gt;Reliable stock data is harder than it sounds. Yahoo Finance rate limits aggressively, so I built a four-source fallback chain:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;sources = [
    YFinanceSource(),       # primary, Chrome TLS fingerprint
    AlphaVantageSource(),   # fallback, 25 free calls/day
    NSEIndiaSource(),       # .NS/.BO tickers
    YahooScrapingSource(),  # last resort HTML scrape
]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Each source implements fetch() and supports_ticker(). The fetcher gates each source before calling it, so NSE India never tries to resolve a NASDAQ ticker.&lt;/p&gt;

&lt;p&gt;Stock data is cached locally with a 24h TTL. In testing this cut API calls by around 80% on repeated runs.&lt;/p&gt;
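&lt;p&gt;The gating-plus-fallback loop could look roughly like the sketch below. The real source classes are not shown in this post, so &lt;code&gt;StubSource&lt;/code&gt;, &lt;code&gt;SourceError&lt;/code&gt;, and &lt;code&gt;fetch_with_fallback&lt;/code&gt; are hypothetical stand-ins; only the &lt;code&gt;fetch()&lt;/code&gt; and &lt;code&gt;supports_ticker()&lt;/code&gt; method names come from the description above.&lt;/p&gt;

```python
class SourceError(Exception):
    """Raised by a source that cannot return data (hypothetical name)."""


class StubSource:
    """Stand-in for a real data source; the original classes are not shown."""

    def __init__(self, name, suffixes, fail=False):
        self.name = name
        self.suffixes = suffixes  # ticker suffixes handled; empty tuple means any ticker
        self.fail = fail

    def supports_ticker(self, ticker):
        return not self.suffixes or ticker.endswith(self.suffixes)

    def fetch(self, ticker):
        if self.fail:
            raise SourceError(f"{self.name} unavailable")
        return {"ticker": ticker, "source": self.name}


def fetch_with_fallback(ticker, sources):
    """Try each source in order, skipping any that does not support the ticker."""
    errors = []
    for source in sources:
        if not source.supports_ticker(ticker):
            continue  # gate: never call a source that cannot resolve this ticker
        try:
            return source.fetch(ticker)
        except SourceError as exc:
            errors.append(str(exc))  # remember the failure, then try the next source
    raise SourceError(f"all sources failed for {ticker}: {errors}")


sources = [
    StubSource("primary", (), fail=True),   # simulates a rate-limited primary
    StubSource("india", (".NS", ".BO")),    # only handles Indian tickers
    StubSource("last_resort", ()),
]
result = fetch_with_fallback("EFX", sources)
print(result["source"])
```

&lt;p&gt;Keeping the gate separate from the error handling means an unsupported ticker costs nothing: the source is skipped without a network call and without an exception.&lt;/p&gt;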

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo95v9m8ptteni8gnz4cm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo95v9m8ptteni8gnz4cm.png" alt=" " width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;
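&lt;p&gt;The 0-100 risk score mentioned earlier is described as a weighted probability sum. One way to read that is sketched below. The class weights are hypothetical (evenly spaced from 0 to 100); the weights the project actually uses are not given in this post.&lt;/p&gt;

```python
# Severity classes in the order named in the post.
SEVERITIES = ["Low", "Medium", "High", "Critical"]

# Hypothetical weights: the score a class would produce if predicted with certainty.
CLASS_WEIGHTS = {"Low": 0.0, "Medium": 100 / 3, "High": 200 / 3, "Critical": 100.0}


def risk_score(probabilities):
    """Weighted sum of per-class probabilities, giving a score between 0 and 100."""
    total = sum(probabilities[label] for label in SEVERITIES)
    if abs(total - 1.0) > 1e-6:
        raise ValueError("class probabilities must sum to 1")
    return sum(CLASS_WEIGHTS[label] * probabilities[label] for label in SEVERITIES)


# Hypothetical classifier output for one breach event.
probs = {"Low": 0.1, "Medium": 0.2, "High": 0.3, "Critical": 0.4}
score = risk_score(probs)
print(f"risk score = {score:.1f}")
```

&lt;p&gt;A score built this way moves smoothly as probability mass shifts between classes, which a hard severity label cannot do.&lt;/p&gt;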

&lt;h3&gt;
  
  
  Three engineering decisions worth stealing
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Decouple domain exceptions from HTTP&lt;br&gt;
Services raise BreachAlphaError subclasses (TickerNotFoundError, InsufficientDataError, etc.). A single global exception handler in server.py translates them to HTTP status codes. Business logic never imports from FastAPI.&lt;br&gt;
This means services are fully testable without spinning up a web server, and switching frameworks later would be a one-file change.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Route factories with injected dependencies&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def create_score_routes(limiter: Limiter) -&amp;gt; APIRouter:
    router = APIRouter()
    # ... route definitions
    return router
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The rate limiter gets injected, not imported as a global. Tests pass a mock limiter. This pattern scales well as the number of route modules grows.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;ProcessPoolExecutor for CPU-bound feature computation&lt;br&gt;
Feature computation is CPU-heavy. Neither async/await nor threads help here because of the GIL. ProcessPoolExecutor actually parallelizes across cores:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor() as executor:
    future = executor.submit(compute_features, price_data, breach_date)
    features = future.result()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On a 4-core machine this roughly halves computation time for batch scoring.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
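&lt;p&gt;To make the first decision concrete, here is a framework-free sketch of the exception-to-status translation. The exception names come from the list above; the status codes, the &lt;code&gt;STATUS_BY_ERROR&lt;/code&gt; table, and the &lt;code&gt;resolve_ticker&lt;/code&gt; service function are assumptions for illustration, not the project's actual mapping.&lt;/p&gt;

```python
class BreachAlphaError(Exception):
    """Base class for domain errors; it knows nothing about HTTP."""


class TickerNotFoundError(BreachAlphaError):
    pass


class InsufficientDataError(BreachAlphaError):
    pass


# Hypothetical mapping; the status codes the project actually uses are not shown.
STATUS_BY_ERROR = {
    TickerNotFoundError: 404,
    InsufficientDataError: 422,
}


def to_http_status(exc):
    """Translate a domain exception into an HTTP status code at the web boundary."""
    for error_type, status in STATUS_BY_ERROR.items():
        if isinstance(exc, error_type):
            return status
    return 500  # unmapped domain errors fall back to a generic server error


def resolve_ticker(company, known_tickers):
    """Service-layer function: raises a domain error and imports nothing from the web framework."""
    try:
        return known_tickers[company]
    except KeyError:
        raise TickerNotFoundError(company) from None


try:
    resolve_ticker("Unknown Corp", {"Equifax": "EFX"})
except BreachAlphaError as exc:
    print(to_http_status(exc))
```

&lt;p&gt;In a FastAPI app, a function like &lt;code&gt;to_http_status&lt;/code&gt; would typically be called from a single handler registered with &lt;code&gt;app.add_exception_handler&lt;/code&gt;, which is the only place that needs to know about both worlds.&lt;/p&gt;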

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsqznik69mh8xatym4ch1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsqznik69mh8xatym4ch1.png" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  API surface
&lt;/h3&gt;

&lt;p&gt;The core endpoints:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;POST /api/score          # score a single company
POST /api/score/auto     # auto-search breach data then score
POST /api/explain        # step-by-step calculation breakdown
POST /api/upload/analyze # batch score from CSV/XLSX
GET  /api/breach-search  # search breach incidents
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Example curl:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;curl -X POST http://localhost:8000/api/score \
  -H "Content-Type: application/json" \
  -d '{
    "company": "Equifax",
    "breach_type": "data_leak",
    "records_affected": 147000000,
    "breach_date": "2017-09-07"
  }'
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The response includes the risk score, severity prediction, confidence, per-class probabilities, and all the raw feature values so you can audit the calculation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Running it locally
&lt;/h3&gt;

&lt;pre&gt;&lt;code&gt;git clone https://github.com/AshayK003/BreachAlpha.git
cd BreachAlpha
python -m venv .venv &amp;amp;&amp;amp; source .venv/bin/activate
pip install -e ".[dev]"
uvicorn breachalpha.server:app --reload --port 8000

# separate terminal
cd frontend &amp;amp;&amp;amp; npm install &amp;amp;&amp;amp; npm run dev
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The frontend runs at localhost:3000 and the backend at localhost:8000. The model bootstraps on synthetic data in about 2 seconds the first time.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I want to improve
&lt;/h3&gt;

&lt;p&gt;The biggest limitation right now is the training data. Synthetic data works for the interface and for demos, but a model trained on real, labeled breach events would be significantly more accurate. If you have access to structured historical breach data (VCDB, OSF DataBreaches, similar), I would love to collaborate.&lt;/p&gt;

&lt;p&gt;Sector-adjusted baselines are also on the list. A breach hitting a healthcare company has a different risk profile than the same breach at a retail chain, and the model should reflect that.&lt;/p&gt;

&lt;h3&gt;
  
  
  Contributing
&lt;/h3&gt;

&lt;p&gt;The 144-test suite needs to pass. Coverage is enforced at 60% minimum. Main contribution areas right now:&lt;/p&gt;

&lt;p&gt;Expanding the known tickers dictionary (currently 200+ companies)&lt;br&gt;
Additional data sources&lt;br&gt;
Real breach training data&lt;br&gt;
Docker Compose setup for easier deployment&lt;/p&gt;

&lt;p&gt;If you work in security research, quant finance, or you are building anything around cyber risk quantification, I would genuinely appreciate feedback on the methodology and the feature set.&lt;br&gt;
Repo Link:  &lt;a href="https://github.com/AshayK003/BreachAlpha" rel="noopener noreferrer"&gt;https://github.com/AshayK003/BreachAlpha&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>opensource</category>
      <category>cybersecurity</category>
      <category>showdev</category>
    </item>
    <item>
      <title>I Built CausalLens — A Free, Open-Source Causal Impact Calculator for Time Series (5 Methods, Zero Setup)</title>
      <dc:creator>SentinelCipher</dc:creator>
      <pubDate>Sat, 30 May 2026 16:20:50 +0000</pubDate>
      <link>https://dev.to/sentinelcipher/i-built-causallens-a-free-open-source-causal-impact-calculator-for-time-series-5-methods-zero-3hm5</link>
      <guid>https://dev.to/sentinelcipher/i-built-causallens-a-free-open-source-causal-impact-calculator-for-time-series-5-methods-zero-3hm5</guid>
      <description>&lt;p&gt;I want to show you a tool I just open-sourced. It's called CausalLens, and it answers one specific question that most analytics stacks get completely wrong: did this intervention actually cause the change in my metric?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem with standard before/after analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before/after comparisons are everywhere. They're also almost always misleading.&lt;/p&gt;

&lt;p&gt;When you compare a metric before and after an intervention, you're implicitly assuming that the only thing that changed was your intervention. In practice, seasonality changes, external trends shift, unrelated events happen. The "improvement" you're seeing might have occurred anyway.&lt;/p&gt;

&lt;p&gt;The right answer is to build a counterfactual: a statistical estimate of what would have happened if you had never intervened. The gap between that counterfactual and your observed data is your causal estimate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What CausalLens does&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56h38iokslxytgsm12jb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56h38iokslxytgsm12jb.png" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You provide a CSV with a time series and an intervention date. The app fits a pre-intervention model, projects it forward as the counterfactual, and reports:&lt;/p&gt;

&lt;p&gt;Estimated effect size (absolute and percentage)&lt;br&gt;
p-value for statistical significance&lt;br&gt;
95% confidence interval&lt;br&gt;
Plain-English interpretation&lt;br&gt;
Downloadable PDF and interactive HTML reports&lt;/p&gt;
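&lt;p&gt;The fit-then-project idea can be shown with the simplest possible pre-intervention model. The sketch below swaps in a straight-line trend fitted by least squares where CausalLens would use ARIMA or one of the other methods, and it leaves out the p-value and confidence interval entirely. The function names and the sample series are hypothetical.&lt;/p&gt;

```python
def fit_linear_trend(values):
    """Least-squares fit of value = intercept + slope * t for t = 0, 1, 2, ..."""
    n = len(values)
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    slope = sxy / sxx
    return y_mean - slope * t_mean, slope


def estimate_effect(series, intervention_index):
    """Project the pre-period trend forward and compare it with the observed post-period."""
    pre = series[:intervention_index]
    post = series[intervention_index:]
    intercept, slope = fit_linear_trend(pre)
    counterfactual = [intercept + slope * t for t in range(intervention_index, len(series))]
    absolute = sum(o - c for o, c in zip(post, counterfactual)) / len(post)
    relative = absolute / (sum(counterfactual) / len(counterfactual))
    return absolute, relative


# Hypothetical series: steady growth of 2 per period, then a jump of 10 at the intervention.
series = [100, 102, 104, 106, 108, 120, 122, 124]
absolute, relative = estimate_effect(series, intervention_index=5)
print(f"effect = {absolute:.1f} ({relative:.1%})")
```

&lt;p&gt;A naive before/after comparison of means on the same series would report a gap of 18, because it counts the growth that was already under way as part of the effect. The counterfactual removes that trend and leaves the jump of 10.&lt;/p&gt;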

&lt;p&gt;&lt;strong&gt;The 5 methods and when to use each&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ARIMA ITS (Interrupted Time Series)&lt;br&gt;
Best for: single series, no obvious seasonality, straightforward before/after structure. The ITS framework is well-validated in public health and economics literature for exactly this use case.&lt;/p&gt;

&lt;p&gt;SARIMAX&lt;br&gt;
Best for: data with strong seasonal patterns (weekly cycles, monthly cycles, etc.). Ignoring seasonality inflates or deflates your effect estimate badly, so this matters more than people expect.&lt;/p&gt;

&lt;p&gt;Bayesian Structural Time Series&lt;br&gt;
Best for: when you want probabilistic output and explicit uncertainty quantification rather than a point estimate. The Bayesian approach also handles structural changes in the pre-period more gracefully.&lt;/p&gt;

&lt;p&gt;Difference-in-Differences&lt;br&gt;
Best for: when you have a natural control group that didn't receive the intervention. Classic econometrics approach, still one of the most credible methods when the parallel trends assumption holds.&lt;/p&gt;
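&lt;p&gt;In its simplest two-group, two-period form, the Difference-in-Differences estimate is just arithmetic on four means. The sketch below shows that form only, with made-up numbers; it is not the CausalLens implementation and it does not test the parallel trends assumption.&lt;/p&gt;

```python
def mean(values):
    return sum(values) / len(values)


def difference_in_differences(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus the change in the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical metric values before and after the intervention.
effect = difference_in_differences(
    treated_pre=[50, 52, 51],
    treated_post=[60, 61, 62],
    control_pre=[40, 41, 42],
    control_post=[44, 45, 46],
)
print(f"DiD estimate = {effect:.1f}")
```

&lt;p&gt;Here the treated group rose by 10 and the control group by 4, so 6 is the part of the change that the control group's trend does not explain.&lt;/p&gt;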

&lt;p&gt;Synthetic Control&lt;br&gt;
Best for: when you have multiple potential control units but no single clean control group. The method finds the optimal weighted combination of control units to build your counterfactual. Computationally the most expensive method here, and the trickiest to implement correctly on messy data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqf876kqhjzzazdi8vud.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqf876kqhjzzazdi8vud.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical stack and deployment constraints&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Everything runs on Streamlit. The whole app is designed to fit within Streamlit Community Cloud's free tier: CPU-only, 1GB RAM, no external services.&lt;/p&gt;

&lt;p&gt;The main packages:&lt;/p&gt;

&lt;p&gt;statsmodels for ARIMA, SARIMAX&lt;br&gt;
pymc for Bayesian STS&lt;br&gt;
scipy.optimize for the Synthetic Control weight solver&lt;br&gt;
reportlab for PDF generation&lt;br&gt;
plotly for the interactive HTML reports&lt;/p&gt;

&lt;p&gt;One non-obvious decision: I avoided causalimpact (the Python port of the R package) because it has dependency issues in resource-constrained environments. Building the Bayesian STS from scratch with PyMC gave me more control and better stability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The hardest part: Synthetic Control on real data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Synthetic Control weight optimization is a quadratic program subject to simplex constraints. In theory, clean. In practice, donor pool data is often collinear, the objective surface is flat in places, and solvers behave inconsistently.&lt;/p&gt;
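&lt;p&gt;For readers who have not seen the problem written down: the weights must be non-negative, sum to 1, and minimize the squared pre-period gap between the treated unit and the weighted donors. The sketch below solves a tiny instance with projected gradient descent in plain Python. That is a deliberate substitution to keep the example dependency-free; it is not the scipy.optimize solver CausalLens uses. The "effective number of donors" is computed as the inverse sum of squared weights, which is my assumption about the definition, and the donor data is invented.&lt;/p&gt;

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1} (sort-based method)."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for i, ui in enumerate(u, start=1):
        cumulative += ui
        candidate = (cumulative - 1.0) / i
        if ui - candidate > 0:
            theta = candidate  # keep the threshold from the last index that qualifies
    return [max(x - theta, 0.0) for x in v]


def fit_weights(treated_pre, donors_pre, steps=5000, learning_rate=0.01):
    """Projected gradient descent on the pre-period squared error."""
    n_donors = len(donors_pre)
    weights = [1.0 / n_donors] * n_donors  # start from equal weights
    for _ in range(steps):
        residuals = [
            y - sum(w * donor[t] for w, donor in zip(weights, donors_pre))
            for t, y in enumerate(treated_pre)
        ]
        gradient = [
            -2.0 * sum(r * donor[t] for t, r in enumerate(residuals))
            for donor in donors_pre
        ]
        weights = project_to_simplex(
            [w - learning_rate * g for w, g in zip(weights, gradient)]
        )
    return weights


def effective_donors(weights):
    """Inverse sum of squared weights: 1.0 means one donor carries everything."""
    return 1.0 / sum(w * w for w in weights)


# Hypothetical pre-period data; the treated unit is 0.75 * donor A + 0.25 * donor B.
donors_pre = [
    [1.0, 2.0, 3.0, 4.0],  # donor A
    [4.0, 3.0, 2.0, 1.0],  # donor B
    [2.0, 2.0, 2.0, 2.0],  # donor C (flat, so it cannot help match the slope)
]
treated_pre = [1.75, 2.25, 2.75, 3.25]
weights = fit_weights(treated_pre, donors_pre)
print([round(w, 3) for w in weights], round(effective_donors(weights), 2))
```

&lt;p&gt;The fixed step size works here because the toy data is small and well scaled. On real donor pools the step size, scaling, and stopping rule all need care, which is part of why a diagnostic like the effective donor count is worth reporting alongside the weights.&lt;/p&gt;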

&lt;p&gt;I ended up wrapping the optimizer with multiple fallback strategies and adding explicit diagnostics (pre-period fit quality, effective number of donors) so users can see when the method is straining.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I'd build next&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Regression Discontinuity Design is the obvious missing method. It handles the case where treatment assignment was determined by a threshold (e.g., everyone above a score threshold got the intervention). If you want to contribute that, the repo is ready for it.&lt;/p&gt;

&lt;p&gt;Longer term, I want to add automated method selection based on data characteristics, and better guidance for users who aren't sure which method fits their situation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try it&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F85qlhvaer9asy5sqkx09.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F85qlhvaer9asy5sqkx09.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Live app: &lt;a href="https://causallens-khg4uatpmnhustajhn8mdl.streamlit.app/" rel="noopener noreferrer"&gt;https://causallens-khg4uatpmnhustajhn8mdl.streamlit.app/&lt;/a&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/AshayK003/CausalLens" rel="noopener noreferrer"&gt;https://github.com/AshayK003/CausalLens&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feedback, issues, and PRs all welcome. The goal is to make rigorous causal analysis accessible to people who need it but don't have time to become econometricians.&lt;/p&gt;

</description>
      <category>python</category>
      <category>opensource</category>
      <category>datascience</category>
      <category>statistics</category>
    </item>
    <item>
      <title>I Built an Adaptive EDA Tool That Learns How You Explore Data</title>
      <dc:creator>SentinelCipher</dc:creator>
      <pubDate>Thu, 28 May 2026 12:45:47 +0000</pubDate>
      <link>https://dev.to/sentinelcipher/i-built-an-adaptive-eda-tool-that-learns-how-you-explore-data-21fd</link>
      <guid>https://dev.to/sentinelcipher/i-built-an-adaptive-eda-tool-that-learns-how-you-explore-data-21fd</guid>
      <description>&lt;p&gt;Most exploratory data analysis tools generate static reports.&lt;/p&gt;

&lt;p&gt;You upload a dataset, get dozens of charts, scroll for a few minutes, and leave with information overload instead of actual insight.&lt;/p&gt;

&lt;p&gt;After running into this problem repeatedly, I decided to build something different.&lt;/p&gt;

&lt;p&gt;So I open-sourced XAdaptiveEDA.&lt;/p&gt;

&lt;p&gt;A Python + Streamlit tool that adapts its recommendations based on how you interact with your data.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/AshayK003/XadaptiveEDA" rel="noopener noreferrer"&gt;https://github.com/AshayK003/XadaptiveEDA&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What Makes It Different?&lt;/p&gt;

&lt;p&gt;Traditional EDA tools treat every dataset and every user the same way.&lt;/p&gt;

&lt;p&gt;XAdaptiveEDA tries to behave more like an adaptive system instead of a one-time report generator.&lt;/p&gt;

&lt;p&gt;You upload a CSV, Excel, or JSON file, and the app:&lt;/p&gt;

&lt;p&gt;ranks analyses by relevance&lt;br&gt;
tracks your feedback with 👍 and 👎 interactions&lt;br&gt;
adapts future recommendations in real time&lt;br&gt;
avoids repetitive analyses&lt;br&gt;
prioritizes columns and patterns you explore frequently&lt;br&gt;
lets you chat with your dataset using natural language&lt;/p&gt;

&lt;p&gt;The goal was to make exploratory data analysis feel more interactive and personalized.&lt;/p&gt;

&lt;p&gt;Features&lt;/p&gt;

&lt;p&gt;Current capabilities include:&lt;/p&gt;

&lt;p&gt;Core Analysis&lt;/p&gt;

&lt;p&gt;Distribution analysis&lt;br&gt;
Correlation analysis&lt;br&gt;
Missing value detection&lt;br&gt;
Outlier analysis&lt;br&gt;
Categorical analysis&lt;br&gt;
Time series analysis&lt;br&gt;
Clustering&lt;br&gt;
Feature importance&lt;/p&gt;

&lt;p&gt;Adaptive Recommendation Engine&lt;/p&gt;

&lt;p&gt;The recommendation engine combines:&lt;/p&gt;

&lt;p&gt;data relevance&lt;br&gt;
user preferences&lt;br&gt;
novelty scoring&lt;br&gt;
diversity penalties&lt;br&gt;
temporal decay&lt;br&gt;
affinity tracking&lt;br&gt;
ε-greedy exploration&lt;/p&gt;
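&lt;p&gt;Two of these ingredients are easy to show in isolation: temporal decay and ε-greedy exploration. The sketch below is a generic version of both, not the XAdaptiveEDA scoring code. The half-life, the analysis names, and the scores are made up, and how the real engine combines the seven components is not shown in this post.&lt;/p&gt;

```python
import random


def decayed_score(base_score, age_in_steps, half_life=10.0):
    """Exponential temporal decay: a score loses half its weight every half_life steps."""
    return base_score * 0.5 ** (age_in_steps / half_life)


def choose_analysis(scores, epsilon, rng):
    """Epsilon-greedy: explore a random analysis with probability epsilon, else pick the best."""
    if epsilon > rng.random():
        return rng.choice(sorted(scores))  # explore
    return max(scores, key=scores.get)     # exploit


# Hypothetical preference scores with the age (in interactions) of the last feedback.
scores = {
    "distribution": decayed_score(0.9, age_in_steps=20),
    "correlation": decayed_score(0.6, age_in_steps=0),
    "outliers": decayed_score(0.4, age_in_steps=5),
}
rng = random.Random(0)  # seeded so the example is repeatable
picks = [choose_analysis(scores, epsilon=0.1, rng=rng) for _ in range(10)]
print(picks)
```

&lt;p&gt;The decay is what lets a strong but stale preference ("distribution") lose to a weaker, fresher one ("correlation"), and the small ε keeps the tool from only ever showing what it already believes you like.&lt;/p&gt;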

&lt;p&gt;Instead of dumping every possible chart, the tool tries to surface the analyses most likely to matter.&lt;/p&gt;

&lt;p&gt;Built-in AI Features&lt;/p&gt;

&lt;p&gt;I also added optional LLM integration for:&lt;/p&gt;

&lt;p&gt;chatting with datasets&lt;br&gt;
AI-generated analysis insights&lt;br&gt;
smart column naming&lt;br&gt;
natural language query classification&lt;/p&gt;

&lt;p&gt;Supported providers:&lt;/p&gt;

&lt;p&gt;Ollama (local-first)&lt;br&gt;
OpenRouter&lt;br&gt;
Groq&lt;br&gt;
Custom APIs&lt;/p&gt;

&lt;p&gt;One thing I cared about heavily was privacy.&lt;/p&gt;

&lt;p&gt;If you use Ollama locally, your data never leaves your machine.&lt;/p&gt;

&lt;p&gt;Tech Stack&lt;/p&gt;

&lt;p&gt;The project is intentionally lightweight.&lt;/p&gt;

&lt;p&gt;Built with:&lt;/p&gt;

&lt;p&gt;Streamlit&lt;br&gt;
Plotly&lt;br&gt;
pandas&lt;br&gt;
NumPy&lt;br&gt;
SQLite&lt;br&gt;
Ollama&lt;/p&gt;

&lt;p&gt;No massive infrastructure setup required.&lt;/p&gt;

&lt;p&gt;The entire system currently runs with just 6 dependencies.&lt;/p&gt;

&lt;p&gt;Engineering Details&lt;/p&gt;

&lt;p&gt;Some things I focused on while building this:&lt;/p&gt;

&lt;p&gt;explainable recommendation scoring&lt;br&gt;
session persistence with SQLite&lt;br&gt;
progressive sampling for large datasets&lt;br&gt;
GPU acceleration support through Ollama&lt;br&gt;
rate limiting for remote APIs&lt;br&gt;
modular architecture&lt;br&gt;
fully local workflows&lt;/p&gt;

&lt;p&gt;The project currently has:&lt;/p&gt;

&lt;p&gt;68 passing tests&lt;br&gt;
MIT license&lt;br&gt;
modular analysis pipeline&lt;br&gt;
explainable scoring system&lt;/p&gt;

&lt;p&gt;Why I Open Sourced It&lt;/p&gt;

&lt;p&gt;I strongly believe useful developer tools should be accessible and hackable.&lt;/p&gt;

&lt;p&gt;A lot of data tooling today feels either:&lt;/p&gt;

&lt;p&gt;too enterprise-focused&lt;br&gt;
too rigid&lt;br&gt;
too expensive&lt;br&gt;
or too opaque&lt;/p&gt;

&lt;p&gt;I wanted to build something developers could actually inspect, extend, and experiment with.&lt;/p&gt;

&lt;p&gt;What’s Next&lt;/p&gt;

&lt;p&gt;Planned improvements include:&lt;/p&gt;

&lt;p&gt;plugin system for custom analyses&lt;br&gt;
exportable reports&lt;br&gt;
dashboard mode&lt;br&gt;
multi-dataset comparison&lt;br&gt;
collaborative sessions&lt;/p&gt;

&lt;p&gt;I also want to improve the recommendation quality and overall UX significantly.&lt;/p&gt;

&lt;p&gt;Looking for Feedback&lt;/p&gt;

&lt;p&gt;I’d genuinely love feedback from:&lt;/p&gt;

&lt;p&gt;data scientists&lt;br&gt;
Python developers&lt;br&gt;
Streamlit builders&lt;br&gt;
open source contributors&lt;br&gt;
anyone working with exploratory analysis workflows&lt;/p&gt;

&lt;p&gt;Especially around:&lt;/p&gt;

&lt;p&gt;recommendation quality&lt;br&gt;
UI/UX&lt;br&gt;
adaptive scoring logic&lt;br&gt;
real-world usability&lt;/p&gt;

&lt;p&gt;GitHub:&lt;br&gt;
&lt;a href="https://github.com/AshayK003/XadaptiveEDA" rel="noopener noreferrer"&gt;https://github.com/AshayK003/XadaptiveEDA&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you find the project interesting, feel free to star the repo or contribute.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1frqms4pl5o17swyd9ba.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1frqms4pl5o17swyd9ba.png" alt=" " width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>opensource</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
