Disclaimer: I am a product person, not a coding guru, but this works, and it brought value to the lean startup I was working for.
Project Overview
Github: https://github.com/cliffordodendaal/community-insights-pipeline
Role: Technical Architect, AI Onboarding Lead
Timeline: 2 weeks (September 2025)
Platform: Modular RAG pipeline + Streamlit UI + Python + Langchain
Impact: Enabled natural-language querying over census PDFs, built a reproducible ingestion pipeline, and laid the foundation for mentoring others into AI workflows
Executive Summary
This project delivers a modular Retrieval-Augmented Generation (RAG) pipeline that transforms static census PDFs into a searchable knowledge base. Built with LangChain, FAISS, and OpenAI, the system enables users to ask natural-language questions and receive grounded, context-rich answers. The pipeline was designed for reproducibility, onboarding clarity, and real-world impact: every module is documented, every friction point is surfaced, and every decision was made with future mentees in mind.
The Problem Space
African census data is locked in PDFs — inaccessible to non-technical users and difficult to query at scale. Manual analysis is slow, error-prone, and siloed.
We needed a system that could:
Ingest and chunk civic data
Embed it for semantic search
Retrieve relevant context and answer questions
Be modular, teachable, and reproducible
Project Constraints
Unstructured PDF data with inconsistent formatting
Limited compute for embedding and querying
Need for absolute clarity in onboarding steps
No existing pipeline for civic RAG use cases
Requirement to support future mentoring and portfolio framing
Discovery & Diagnosis
Technical Benchmarking
LangChain’s document loaders and text splitters
FAISS for local vectorstore indexing
OpenAI embeddings for semantic search
Streamlit for rapid UI prototyping
Onboarding Friction Points
Ambiguous chunking strategies
Hidden config dependencies (e.g. environment variables)
Manual errors in embedding and retrieval steps
Lack of beginner-friendly documentation in most RAG tutorials
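One way to take the surprise out of hidden config dependencies is to check for them at startup and fail with a single readable message. The sketch below is illustrative rather than taken from the repository: `require_env` is a hypothetical helper name, and `OPENAI_API_KEY` is used because it is the variable the OpenAI client reads by default.

```python
import os


def require_env(names, environ=None):
    """Return the requested settings, or raise one error naming every missing variable.

    `environ` defaults to os.environ; passing a dict makes the check easy to demo and test.
    """
    environ = os.environ if environ is None else environ
    # Treat unset and empty values the same way: both mean "not configured".
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}


# Demo with an explicit dict and a placeholder value (not a real key).
demo = require_env(["OPENAI_API_KEY"], environ={"OPENAI_API_KEY": "sk-placeholder"})
```

Calling `require_env(["OPENAI_API_KEY"])` at the top of the app would surface a missing key before any PDF is loaded or embedded, which is where beginners tend to lose the most time.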
Modular Architecture
Each step of the pipeline was broken into a reusable function:
load_pdf() — loads and parses documents
chunk_documents() — splits text into overlapping chunks
embed_chunks() — embeds and stores in FAISS
query_chunks() — retrieves and answers via GPT-3.5
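To make "overlapping chunks" concrete, here is a dependency-free sketch of the idea. It is not the project's `chunk_documents()`, which relies on LangChain's text splitters. `chunk_text` is a hypothetical name, the split is a plain character window, and the default sizes are assumptions chosen for illustration.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Tiny example so the overlap is visible by eye.
sample_chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk. Real splitters usually also try to break on paragraph or sentence boundaries, which this sketch ignores.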
Streamlit UI
A lightweight frontend was built to:
Accept user questions
Retrieve relevant chunks
Display answers with context
Cache the retriever and LLM for performance
Key Design Decisions
Decision 1: Modularize Everything
Instead of a monolithic script, each step was abstracted into a function — enabling reuse, testing, and teaching.
Decision 2: Cache the Retriever
To avoid reloading the FAISS index on every query, the retriever and LLM were cached using st.cache_resource.
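The caching pattern can be sketched without Streamlit installed. In the snippet below, `functools.lru_cache` is a standard-library stand-in for `st.cache_resource`, and `get_retriever` with its placeholder return value is hypothetical. In the real app the function body would load the FAISS index.

```python
from functools import lru_cache

load_calls = 0  # counts how often the expensive load actually runs


@lru_cache(maxsize=1)
def get_retriever():
    """Stand-in for an expensive load, such as reading a FAISS index from disk."""
    global load_calls
    load_calls += 1
    return {"index": "loaded"}  # placeholder object, not a real retriever


# Two "queries" in a row: the second call reuses the cached object.
first = get_retriever()
second = get_retriever()
```

In a Streamlit script the same function would be decorated with `@st.cache_resource`, so the object survives the script reruns that Streamlit triggers on every interaction.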
Decision 3: Build for Teaching
Every function includes docstrings, type hints, and rationale — designed to be copy-pasted into notebooks or onboarding guides.
Implementation & Validation
Technical Execution
LangChain loaders and splitters for ingestion
OpenAI embeddings stored in FAISS
GPT-3.5 via LangChain’s ChatOpenAI
Streamlit UI with sample prompts and error handling
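For readers new to vector search, the retrieval step boils down to "score every chunk against the question and keep the best few". The toy sketch below uses cosine similarity over hand-written three-number vectors. The chunk texts and vectors are made-up placeholders rather than census data or real embeddings, and a FAISS index may use a different metric and a far faster search than this linear scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, indexed_chunks, k=2):
    """Return the texts of the k chunks whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed_chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Made-up index: (chunk text, toy embedding).
toy_index = [
    ("piped water access by municipality", [0.9, 0.1, 0.0]),
    ("household size by province", [0.1, 0.9, 0.0]),
    ("electricity connections by ward", [0.0, 0.2, 0.9]),
]

# A query vector that points mostly along the first dimension.
results = top_k([1.0, 0.0, 0.1], toy_index, k=1)
```

The retrieved chunk texts are what get passed to the language model as context, which is why the answer stays grounded in the source documents.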
Validation
Queried: “Which municipalities in KwaZulu-Natal have the lowest access to piped water?”
Received grounded, context-rich answer from embedded census data
Screenshot captured for portfolio
Results & Impact
Modular pipeline built and tested end-to-end
Streamlit UI deployed locally for live querying
Ready for mentoring, onboarding, and civic RAG extensions
Lessons Learned
Modularity Is Mentorship
Every function you modularize becomes a teaching tool. Beginners don’t need magic — they need clarity.
RAG Needs Reproducibility
Most RAG tutorials skip the hard parts. This pipeline documents every step, every config, and every friction point.
UI Unlocks Accessibility
Streamlit made the pipeline usable by non-technical users — a key step in democratizing civic data access.
What’s Next
For Portfolio
Add README with setup, sample queries, and impact framing
Embed screenshots and flowcharts
Publish case study on GitHub and LinkedIn
For Mentoring
Create Jupyter notebook walkthrough
Build glossary of key terms (chunking, embeddings, retriever)
Add onboarding guide for mentees
For Scaling
Extend to property spreadsheets and municipal budgets
Add metadata filters to retriever
Deploy to Hugging Face Spaces or Streamlit Cloud

