<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Clo O</title>
    <description>The latest articles on DEV Community by Clo O (@clo_o_d63b1179f8af77546ac).</description>
    <link>https://dev.to/clo_o_d63b1179f8af77546ac</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3626078%2Fdb174a9c-a86c-4393-a7f1-a993f3098d4e.png</url>
      <title>DEV Community: Clo O</title>
      <link>https://dev.to/clo_o_d63b1179f8af77546ac</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/clo_o_d63b1179f8af77546ac"/>
    <language>en</language>
    <item>
      <title>AI Agentic RAG Pipeline to Surface Community Insights from Census Data</title>
      <dc:creator>Clo O</dc:creator>
      <pubDate>Sun, 23 Nov 2025 19:56:58 +0000</pubDate>
      <link>https://dev.to/clo_o_d63b1179f8af77546ac/ai-agentic-rag-pipeline-to-surface-community-insights-from-census-data-57e2</link>
      <guid>https://dev.to/clo_o_d63b1179f8af77546ac/ai-agentic-rag-pipeline-to-surface-community-insights-from-census-data-57e2</guid>
      <description>&lt;p&gt;Disclaimer: I am a product person, not a coding guru, but this works, and it brought value to the lean startup I was working for.&lt;/p&gt;

&lt;p&gt;Project Overview&lt;/p&gt;

&lt;p&gt;Github: &lt;a href="https://github.com/cliffordodendaal/community-insights-pipeline" rel="noopener noreferrer"&gt;https://github.com/cliffordodendaal/community-insights-pipeline&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Role: Technical Architect, AI Onboarding Lead&lt;/p&gt;

&lt;p&gt;Timeline: 2 weeks (September 2025)&lt;/p&gt;

&lt;p&gt;Platform: Modular RAG pipeline + Streamlit UI + Python + Langchain&lt;/p&gt;

&lt;p&gt;Impact: Enabled natural-language querying over census PDFs, built a reproducible ingestion pipeline, and laid the foundation for mentoring others into AI workflows &lt;/p&gt;

&lt;p&gt;Executive Summary&lt;br&gt;
This project delivers a modular Retrieval-Augmented Generation (RAG) pipeline that transforms static census PDFs into a searchable knowledge base. Built with LangChain, FAISS, and OpenAI, the system lets users ask natural-language questions and receive answers grounded in the retrieved census text. The pipeline was designed for reproducibility, onboarding clarity, and real-world impact: every module is documented, every friction point is surfaced, and every decision was made with future mentees in mind.&lt;/p&gt;

&lt;p&gt;The Problem Space&lt;/p&gt;

&lt;p&gt;African census data is locked in PDFs — inaccessible to non-technical users and difficult to query at scale. Manual analysis is slow, error-prone, and siloed.&lt;/p&gt;

&lt;p&gt;We needed a system that could:&lt;/p&gt;

&lt;p&gt;Ingest and chunk civic data&lt;/p&gt;

&lt;p&gt;Embed it for semantic search&lt;/p&gt;

&lt;p&gt;Retrieve relevant context and answer questions&lt;/p&gt;

&lt;p&gt;Be modular, teachable, and reproducible&lt;/p&gt;

&lt;p&gt;Project Constraints&lt;/p&gt;

&lt;p&gt;Unstructured PDF data with inconsistent formatting&lt;/p&gt;

&lt;p&gt;Limited compute for embedding and querying&lt;/p&gt;

&lt;p&gt;Need for absolute clarity in onboarding steps&lt;/p&gt;

&lt;p&gt;No existing pipeline for civic RAG use cases&lt;/p&gt;

&lt;p&gt;Requirement to support future mentoring and portfolio framing&lt;/p&gt;

&lt;p&gt;Discovery &amp;amp; Diagnosis&lt;/p&gt;

&lt;p&gt;Technical Benchmarking&lt;br&gt;
LangChain’s document loaders and text splitters&lt;/p&gt;

&lt;p&gt;FAISS for local vectorstore indexing&lt;/p&gt;

&lt;p&gt;OpenAI embeddings for semantic search&lt;/p&gt;

&lt;p&gt;Streamlit for rapid UI prototyping&lt;/p&gt;

&lt;p&gt;Onboarding Friction Points&lt;br&gt;
Ambiguous chunking strategies&lt;/p&gt;

&lt;p&gt;Hidden config dependencies (e.g. environment variables)&lt;/p&gt;

&lt;p&gt;Manual errors in embedding and retrieval steps&lt;/p&gt;

&lt;p&gt;Lack of beginner-friendly documentation in most RAG tutorials&lt;/p&gt;
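
&lt;p&gt;One way to take the surprise out of hidden config dependencies is to check for them at startup and fail with a single readable message. The sketch below is an illustration, not code from the repository: the helper names missing_settings and require_settings are hypothetical, and the only variable listed is OPENAI_API_KEY, which the OpenAI client reads by default. Whether the project needs other settings is not stated in this write-up.&lt;/p&gt;

```python
import os


def missing_settings(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


def require_settings(required, env=None):
    """Raise one clear error that lists every missing setting at once."""
    missing = missing_settings(required, env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))


# Assumption: only the OpenAI key is listed; extend this for your own setup.
REQUIRED = ["OPENAI_API_KEY"]
```

&lt;p&gt;Calling require_settings(REQUIRED) as the first line of the app would turn a confusing downstream failure into an explicit setup instruction for a new mentee.&lt;/p&gt;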

&lt;p&gt;Modular Architecture&lt;br&gt;
Each step of the pipeline was broken into a reusable function:&lt;/p&gt;

&lt;p&gt;load_pdf() — loads and parses documents&lt;/p&gt;

&lt;p&gt;chunk_documents() — splits text into overlapping chunks&lt;/p&gt;

&lt;p&gt;embed_chunks() — embeds and stores in FAISS&lt;/p&gt;

&lt;p&gt;query_chunks() — retrieves and answers via GPT-3.5&lt;/p&gt;
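
&lt;p&gt;To make "overlapping chunks" concrete, here is a minimal standard-library sketch of the idea behind chunk_documents(). It is not the repository's implementation, which builds on LangChain's text splitters. The function name chunk_text and the default sizes are assumptions chosen for illustration. LangChain's RecursiveCharacterTextSplitter exposes comparable chunk_size and chunk_overlap settings but splits more carefully, on separators rather than raw character offsets.&lt;/p&gt;

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into character windows; neighbours share `overlap` characters."""
    if not (chunk_size > overlap >= 0):
        raise ValueError("need chunk_size > overlap >= 0")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Tiny demonstration with a window of 4 characters and an overlap of 2.
demo = chunk_text("abcdefghij", chunk_size=4, overlap=2)
```

&lt;p&gt;The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, which is the usual reason to accept the extra storage.&lt;/p&gt;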

&lt;p&gt;Streamlit UI&lt;br&gt;
A lightweight frontend was built to:&lt;/p&gt;

&lt;p&gt;Accept user questions&lt;/p&gt;

&lt;p&gt;Retrieve relevant chunks&lt;/p&gt;

&lt;p&gt;Display answers with context&lt;/p&gt;

&lt;p&gt;Cache the retriever and LLM for performance&lt;/p&gt;

&lt;p&gt;Key Design Decisions&lt;br&gt;
Decision 1: Modularize Everything&lt;br&gt;
Instead of a monolithic script, each step was abstracted into a function — enabling reuse, testing, and teaching.&lt;/p&gt;

&lt;p&gt;Decision 2: Cache the Retriever&lt;br&gt;
To avoid reloading the FAISS index on every query, the retriever and LLM were cached using st.cache_resource.&lt;/p&gt;
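
&lt;p&gt;The effect of this decision can be shown without Streamlit. The sketch below uses functools.lru_cache from the standard library as a stand-in for st.cache_resource, so it runs anywhere. The function get_retriever, the path "faiss_index", and the placeholder return value are all hypothetical. In the real app the function body would load the FAISS index and be decorated with st.cache_resource instead.&lt;/p&gt;

```python
from functools import lru_cache

LOAD_COUNT = 0  # counts how often the expensive load actually runs


@lru_cache(maxsize=1)
def get_retriever(index_path="faiss_index"):
    """Pretend to load an index; a real app would load FAISS from disk here."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return {"index_path": index_path}  # placeholder object, not a real retriever


# Two "queries" in a row: the second call reuses the cached object.
first = get_retriever()
second = get_retriever()
```

&lt;p&gt;Streamlit reruns the whole script on every interaction, so without a cache like this the index would be reloaded for each question a user types.&lt;/p&gt;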

&lt;p&gt;Decision 3: Build for Teaching&lt;br&gt;
Every function includes docstrings, type hints, and rationale — designed to be copy-pasted into notebooks or onboarding guides.&lt;/p&gt;

&lt;p&gt;Implementation &amp;amp; Validation&lt;br&gt;
Technical Execution&lt;br&gt;
LangChain loaders and splitters for ingestion&lt;/p&gt;

&lt;p&gt;OpenAI embeddings stored in FAISS&lt;/p&gt;

&lt;p&gt;GPT-3.5 via LangChain’s ChatOpenAI&lt;/p&gt;

&lt;p&gt;Streamlit UI with sample prompts and error handling&lt;/p&gt;
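
&lt;p&gt;The retrieve-then-answer flow can be sketched with the standard library alone. This is a deliberately simplified stand-in, not the project's code: word counts replace OpenAI embeddings, a sorted list replaces the FAISS index, and the prompt is only built, not sent to GPT-3.5. The function names and the sample chunk strings are invented for illustration and are not census data.&lt;/p&gt;

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(question, chunks, k=2):
    """Return the k chunks most similar to the question (stand-in for a FAISS search)."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_chunks):
    """Assemble the grounded prompt that would be passed to the chat model."""
    context = "\n\n".join(context_chunks)
    return "Answer using only this context:\n" + context + "\n\nQuestion: " + question


# Invented placeholder chunks, only to show the ranking step.
sample_chunks = [
    "piped water access by municipality",
    "household income by province",
    "school enrolment by age",
]
best = top_k("access to piped water", sample_chunks, k=1)
prompt = build_prompt("access to piped water", best)
```

&lt;p&gt;Grounding comes from the prompt: the model is asked to answer from the retrieved chunks only, so the quality of the retrieval step largely determines the quality of the answer.&lt;/p&gt;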

&lt;p&gt;Validation&lt;br&gt;
Queried: “Which municipalities in KwaZulu-Natal have the lowest access to piped water?”&lt;/p&gt;

&lt;p&gt;Received a grounded, context-rich answer drawn from the embedded census data&lt;/p&gt;

&lt;p&gt;Screenshot captured for portfolio&lt;/p&gt;

&lt;p&gt;Results &amp;amp; Impact&lt;br&gt;
Modular pipeline built and tested end-to-end&lt;/p&gt;

&lt;p&gt;Streamlit UI deployed locally for live querying&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyznzim8j1o7jt1t3p7om.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyznzim8j1o7jt1t3p7om.png" alt=" " width="496" height="1872"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ready for mentoring, onboarding, and civic RAG extensions&lt;/p&gt;

&lt;p&gt;Lessons Learned&lt;br&gt;
Modularity Is Mentorship&lt;br&gt;
Every function you modularize becomes a teaching tool. Beginners don’t need magic — they need clarity.&lt;/p&gt;

&lt;p&gt;RAG Needs Reproducibility&lt;br&gt;
Most RAG tutorials skip the hard parts. This pipeline documents every step, every config, and every friction point.&lt;/p&gt;

&lt;p&gt;UI Unlocks Accessibility&lt;br&gt;
Streamlit made the pipeline usable by non-technical users — a key step in democratizing civic data access.&lt;/p&gt;

&lt;p&gt;What’s Next&lt;/p&gt;

&lt;p&gt;For Portfolio&lt;br&gt;
Add README with setup, sample queries, and impact framing&lt;/p&gt;

&lt;p&gt;Embed screenshots and flowcharts&lt;/p&gt;

&lt;p&gt;Publish case study on GitHub and LinkedIn&lt;/p&gt;

&lt;p&gt;For Mentoring&lt;br&gt;
Create Jupyter notebook walkthrough&lt;/p&gt;

&lt;p&gt;Build glossary of key terms (chunking, embeddings, retriever)&lt;/p&gt;

&lt;p&gt;Add onboarding guide for mentees&lt;/p&gt;

&lt;p&gt;For Scaling&lt;br&gt;
Extend to property spreadsheets and municipal budgets&lt;/p&gt;

&lt;p&gt;Add metadata filters to retriever&lt;/p&gt;

&lt;p&gt;Deploy to Hugging Face Spaces or Streamlit Cloud&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6acra42dkeicrxjevgnm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6acra42dkeicrxjevgnm.png" alt=" " width="667" height="253"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>rag</category>
      <category>ai</category>
      <category>python</category>
    </item>
  </channel>
</rss>
