AI Agentic RAG Pipeline to Surface Community Insights from Census Data

Disclaimer: I am a product person, not a coding guru, but this works and it brought value to the lean startup I was working for.

Project Overview

Github: https://github.com/cliffordodendaal/community-insights-pipeline

Role: Technical Architect, AI Onboarding Lead

Timeline: 2 weeks (September 2025)

Platform: Modular RAG pipeline + Streamlit UI + Python + LangChain

Impact: Enabled natural-language querying over census PDFs, built a reproducible ingestion pipeline, and laid the foundation for mentoring others into AI workflows

Executive Summary

This project delivers a modular Retrieval-Augmented Generation (RAG) pipeline that transforms static census PDFs into a searchable knowledge base. Built with LangChain, FAISS, and OpenAI, the system enables users to ask natural-language questions and receive grounded, context-rich answers. The pipeline was designed for reproducibility, onboarding clarity, and real-world impact — with every module documented, every friction point surfaced, and every decision made with future mentees in mind.

The Problem Space

African census data is locked in PDFs — inaccessible to non-technical users and difficult to query at scale. Manual analysis is slow, error-prone, and siloed.

We needed a system that could:

Ingest and chunk civic data

Embed it for semantic search

Retrieve relevant context and answer questions

Be modular, teachable, and reproducible

Project Constraints

Unstructured PDF data with inconsistent formatting

Limited compute for embedding and querying

Need for absolute clarity in onboarding steps

No existing pipeline for civic RAG use cases

Requirement to support future mentoring and portfolio framing

Discovery & Diagnosis

Technical Benchmarking
LangChain’s document loaders and text splitters

FAISS for local vectorstore indexing

OpenAI embeddings for semantic search

Streamlit for rapid UI prototyping

Onboarding Friction Points
Ambiguous chunking strategies

Hidden config dependencies (e.g. environment variables; see the sketch after this list)

Manual errors in embedding and retrieval steps

Lack of beginner-friendly documentation in most RAG tutorials
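
One way to tame the environment-variable friction is to fail fast with a clear message before anything else runs. A minimal sketch (the file name and error wording are illustrative, assuming python-dotenv is installed; this is not the repo's actual config code):

```python
# config.py: load required settings up front so a missing key fails loudly (illustrative)
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads a local .env file if one exists

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    raise RuntimeError(
        "OPENAI_API_KEY is not set. Add it to your .env file or export it before running the pipeline."
    )
```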

Modular Architecture
Each step of the pipeline was broken into a reusable function (a sketch follows this list):

load_pdf() — loads and parses documents

chunk_documents() — splits text into overlapping chunks

embed_chunks() — embeds and stores in FAISS

query_chunks() — retrieves and answers via GPT-3.5
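
The repo holds the real implementations; the sketch below is just a rough picture of how those four functions can hang together with LangChain, FAISS, and OpenAI. Imports, chunk sizes, and the prompt wording are illustrative assumptions rather than code copied from the project (it assumes langchain-community, langchain-openai, faiss-cpu, and pypdf are installed):

```python
# rag_pipeline.py: illustrative sketch of the four pipeline steps
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter


def load_pdf(path: str) -> list:
    """Load and parse a census PDF into LangChain documents."""
    return PyPDFLoader(path).load()


def chunk_documents(docs: list, chunk_size: int = 1000, overlap: int = 150) -> list:
    """Split documents into overlapping chunks so context survives page breaks."""
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=overlap)
    return splitter.split_documents(docs)


def embed_chunks(chunks: list, index_path: str = "faiss_index") -> FAISS:
    """Embed chunks with OpenAI embeddings and persist a local FAISS index."""
    store = FAISS.from_documents(chunks, OpenAIEmbeddings())
    store.save_local(index_path)
    return store


def query_chunks(store: FAISS, question: str, k: int = 4) -> str:
    """Retrieve the top-k chunks and answer the question with GPT-3.5."""
    docs = store.as_retriever(search_kwargs={"k": k}).invoke(question)
    context = "\n\n".join(doc.page_content for doc in docs)
    llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
    prompt = f"Answer using only this census context:\n{context}\n\nQuestion: {question}"
    return llm.invoke(prompt).content


# Example wiring (the PDF file name is hypothetical):
# store = embed_chunks(chunk_documents(load_pdf("census.pdf")))
# print(query_chunks(store, "Which municipalities have the lowest access to piped water?"))
```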

Streamlit UI
A lightweight frontend (sketched after this list) was built to:

Accept user questions

Retrieve relevant chunks

Display answers with context

Cache the retriever and LLM for performance
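
In outline, that front end can stay very small. The sketch below reuses query_chunks() from the pipeline sketch above; the widget labels and file names are illustrative, not the repo's actual app:

```python
# app.py: minimal Streamlit front end over the pipeline (illustrative)
import streamlit as st
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

from rag_pipeline import query_chunks  # from the pipeline sketch above

st.title("Community Insights: Census Q&A")

question = st.text_input(
    "Ask a question about the census data",
    placeholder="Which municipalities have the lowest access to piped water?",
)

if question:
    # Reloads the index on every rerun; Decision 2 below moves this behind a cache.
    store = FAISS.load_local(
        "faiss_index", OpenAIEmbeddings(), allow_dangerous_deserialization=True
    )
    with st.spinner("Retrieving context..."):
        st.write(query_chunks(store, question))
```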

Key Design Decisions
Decision 1: Modularize Everything
Instead of a monolithic script, each step was abstracted into a function — enabling reuse, testing, and teaching.

Decision 2: Cache the Retriever
To avoid reloading the FAISS index on every query, the retriever and LLM were cached using st.cache_resource.
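
A sketch of that pattern, with hypothetical function names: the expensive objects are built inside functions decorated with st.cache_resource, so Streamlit constructs them once per process and reuses them across reruns.

```python
# Cached loaders: built once per Streamlit process, reused across queries (illustrative)
import streamlit as st
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings


@st.cache_resource
def load_retriever(index_path: str = "faiss_index"):
    """Load the FAISS index once and expose it as a retriever."""
    store = FAISS.load_local(
        index_path, OpenAIEmbeddings(), allow_dangerous_deserialization=True
    )
    return store.as_retriever(search_kwargs={"k": 4})


@st.cache_resource
def load_llm():
    """Create the chat model once instead of on every query."""
    return ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
```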

Decision 3: Build for Teaching
Every function includes docstrings, type hints, and rationale — designed to be copy-pasted into notebooks or onboarding guides.

Implementation & Validation
Technical Execution
LangChain loaders and splitters for ingestion

OpenAI embeddings stored in FAISS

GPT-3.5 via LangChain’s ChatOpenAI

Streamlit UI with sample prompts and error handling

Validation
Queried: “Which municipalities in KwaZulu-Natal have the lowest access to piped water?”

Received grounded, context-rich answer from embedded census data

Screenshot captured for portfolio

Results & Impact
Modular pipeline built and tested end-to-end

Streamlit UI deployed locally for live querying

Ready for mentoring, onboarding, and civic RAG extensions

Lessons Learned
Modularity Is Mentorship
Every function you modularize becomes a teaching tool. Beginners don’t need magic — they need clarity.

RAG Needs Reproducibility
Most RAG tutorials skip the hard parts. This pipeline documents every step, every config, and every friction point.

UI Unlocks Accessibility
Streamlit made the pipeline usable by non-technical users — a key step in democratizing civic data access.

What’s Next

For Portfolio
Add README with setup, sample queries, and impact framing

Embed screenshots and flowcharts

Publish case study on GitHub and LinkedIn

For Mentoring
Create Jupyter notebook walkthrough

Build glossary of key terms (chunking, embeddings, retriever)

Add onboarding guide for mentees

For Scaling
Extend to property spreadsheets and municipal budgets

Add metadata filters to the retriever (see the sketch after this list)

Deploy to Hugging Face Spaces or Streamlit Cloud
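
LangChain's FAISS wrapper already accepts a metadata filter at query time, so that retriever extension could be as small as the sketch below. The province key is hypothetical and assumes chunks are stored with that metadata during ingestion:

```python
# Illustrative: restrict retrieval to chunks whose metadata matches a filter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# The same persisted index built by embed_chunks() in the earlier sketch
store = FAISS.load_local(
    "faiss_index", OpenAIEmbeddings(), allow_dangerous_deserialization=True
)

# "province" is a hypothetical metadata key attached to chunks at ingestion time
retriever = store.as_retriever(
    search_kwargs={"k": 4, "filter": {"province": "KwaZulu-Natal"}}
)
docs = retriever.invoke("Which municipalities have the lowest access to piped water?")
```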
