Problem Statement
We have a misinformation problem. But more specifically, we have a speed problem.
A journalist spots a suspicious claim. They search for sources. Cross-reference databases. Call experts. Write a verdict. Get it edited. Publish, maybe 6 hours later. Maybe 3 days later.
Meanwhile, the original claim has been screenshotted, reposted, quoted in newsletters, and cited in arguments across five platforms.
I wanted to build something that closed that gap. Not a chatbot that guesses. A proper pipeline, one that retrieves real evidence, reasons from it, and tells you why it reached a verdict.
That's what Sift is.
What is Sift?
Sift (Source Inspection & Fact-checking Tool) is an open-source multi-agent AI pipeline that takes any text, extracts every factual claim, retrieves grounded evidence, and returns auditable verdicts — TRUE, FALSE, or UNCERTAIN, with cited sources and full reasoning chains.
Paste a news article. A politician's speech. A viral statistic. A WhatsApp forward. Sift breaks it into individual claims and fact-checks each one independently.
Why Multi-Agent?
The naive approach is to ask an LLM: "Is this claim true?"
The problem: LLMs hallucinate. They have knowledge cutoffs. They're confidently wrong in ways that are hard to detect. And critically, they don't show their work.
A single LLM call can't reliably handle the full pipeline of:
- Extracting structured claims from noisy text
- Retrieving dated, traceable evidence from live sources
- Reasoning across conflicting evidence without confabulating
- Adversarially reviewing its own conclusions for overconfidence
- Finding corrections when something is wrong
Each of these is a distinct task with its own failure modes, and each benefits from its own prompt and its own tools. That's why I built five separate agents, orchestrated with LangGraph.
The 5-Agent Pipeline
Agent 1 — Claim Extractor
A single paragraph can contain 4-5 distinct factual claims. Generic LLMs miss them or conflate them.
This agent uses LLaMA 3.3 70B via Groq with Pydantic structured output to extract every distinct verifiable claim from the input text. The output is a typed list of claims — exact text, no paraphrasing, no hallucination.
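To make the "exact text, no paraphrasing" constraint concrete, here is a minimal sketch of how extracted claims could be checked against the input. It is an illustration under assumptions, not Sift's actual code: a stdlib dataclass stands in for the Pydantic model so the snippet runs without dependencies, and the `Claim` fields, the `parse_claims` helper, and the JSON shape are all hypothetical.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Claim:
    text: str   # verbatim span copied from the input
    start: int  # character offset of that span in the input


def parse_claims(source_text: str, llm_json: str) -> List[Claim]:
    """Parse the model's JSON output and keep only claims quoted verbatim."""
    claims = []
    for item in json.loads(llm_json)["claims"]:
        text = item["text"].strip()
        start = source_text.find(text)
        if text and start != -1:  # drop paraphrased or invented claims
            claims.append(Claim(text=text, start=start))
    return claims


source = "The Fed raised rates in March 2024. Inflation fell to 2% last year."
# Hypothetical model output; the third item does not appear in the source.
raw = json.dumps({"claims": [
    {"text": "The Fed raised rates in March 2024."},
    {"text": "Inflation fell to 2% last year."},
    {"text": "The Fed cut rates."},
]})
claims = parse_claims(source, raw)
```

The substring check is a cheap guard: anything the model reworded or invented never reaches the retrieval stage.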
Agent 2 — Evidence Hunter
LLMs hallucinate citations. You need real, retrievable, dated evidence.
This agent runs HyDE retrieval across 4,270 indexed Guardian + Wikipedia chunks stored in pgvector, then hits Tavily live web search for recent data.
Why HyDE instead of standard RAG?
Standard RAG embeds the raw claim and searches for similar text. A short factual claim like "The Fed raised rates in March 2024" has a weak semantic signal on its own.
HyDE (Hypothetical Document Embeddings) generates a hypothetical document that would contain the answer — something like a news article excerpt — then embeds that. The result is a richer semantic signal and significantly better retrieval recall on short factual claims.
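The flow can be sketched in a few lines. This is a toy under stated assumptions: `embed` is a bag-of-words stand-in for a real sentence encoder such as all-MiniLM-L6-v2, `generate_hypothetical` is a placeholder for the LLM call, and the in-memory list replaces pgvector. Only the shape of the technique carries over: embed the drafted passage, not the raw claim.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; stands in for a real sentence encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def generate_hypothetical(claim: str) -> str:
    """Placeholder for the LLM call that drafts a passage about the claim."""
    return (f"{claim} The Federal Reserve's policy committee announced its "
            "decision on the federal funds rate after its meeting.")


def hyde_search(claim: str, chunks: List[str], k: int = 1) -> List[str]:
    query = embed(generate_hypothetical(claim))  # embed the draft, not the claim
    ranked = sorted(chunks, key=lambda c: cosine(query, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "The Federal Reserve's policy committee left the federal funds rate "
    "unchanged after its meeting.",
    "Wheat prices rose in March after a dry winter.",
]
top = hyde_search("The Fed raised rates in March 2024", chunks)
```

Note that the retrieved chunk may contradict the claim. That is the point: retrieval finds relevant evidence, and judging it is left to the later agents.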
Agent 3 — Synthesis Agent
This agent reasons strictly from retrieved evidence. It returns TRUE / FALSE / UNCERTAIN with a calibrated confidence score.
Critically — if evidence is thin or conflicting, it returns UNCERTAIN instead of confabulating certainty. This was one of the hardest things to get right. LLMs naturally trend toward false confidence. I had to explicitly prompt for epistemic humility and add Pydantic validators to catch zero-confidence outputs.
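A minimal sketch of that kind of validation, with stdlib dataclasses standing in for Pydantic so it runs as-is. The `Verdict` fields, the fallback to UNCERTAIN, and the `FALLBACK_CONFIDENCE` value are assumptions for illustration, not Sift's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

FALLBACK_CONFIDENCE = 0.1  # hypothetical value used when a verdict is rejected


class Label(str, Enum):
    TRUE = "TRUE"
    FALSE = "FALSE"
    UNCERTAIN = "UNCERTAIN"


@dataclass
class Verdict:
    label: Label
    confidence: float
    sources: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject zero, negative, or out-of-range confidence scores.
        if not 0.0 < self.confidence <= 1.0:
            raise ValueError("confidence must be in (0, 1]")
        # A definite verdict without any cited source is not auditable.
        if self.label is not Label.UNCERTAIN and not self.sources:
            raise ValueError("TRUE/FALSE verdicts need at least one source")


def safe_verdict(label: str, confidence: float, sources: List[str]) -> Verdict:
    """Build a Verdict, downgrading invalid model output to UNCERTAIN."""
    try:
        return Verdict(Label(label), confidence, sources)
    except ValueError:
        return Verdict(Label.UNCERTAIN, FALLBACK_CONFIDENCE, sources)
```

For example, `safe_verdict("TRUE", 0.0, ["https://example.org"])` comes back as UNCERTAIN rather than a TRUE verdict with a meaningless score.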
Agent 4 — Critic Agent
Synthesis agents tend toward overconfidence when evidence partially supports a claim. You need an adversarial check.
This agent independently reviews every verdict. It flags unsupported reasoning, recognizes when a gap like 1.1°C vs 1.19°C is a difference in precision rather than a false claim, and adjusts confidence downward when warranted.
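One way to picture the numeric part of that review is a relative-tolerance check. The sketch below is an assumption about how such a rule could look, not the Critic Agent's actual logic (which is prompt-driven): the 10% tolerance, the 0.7 confidence cap, and the `review` helper are all hypothetical.

```python
import math
from typing import Tuple

REL_TOL = 0.10         # hypothetical: values within 10% count as a near-match
CONFIDENCE_CAP = 0.7   # hypothetical ceiling after a verdict is overturned


def numbers_agree(claimed: float, evidence: float, rel_tol: float = REL_TOL) -> bool:
    """True when the claimed figure is within rel_tol of the evidence figure."""
    return math.isclose(claimed, evidence, rel_tol=rel_tol)


def review(label: str, confidence: float,
           claimed: float, evidence: float) -> Tuple[str, float]:
    """Return a possibly revised (label, confidence) pair."""
    if label == "FALSE" and numbers_agree(claimed, evidence):
        # Near-match: overturn the FALSE verdict but lower the confidence.
        return "TRUE", min(confidence, CONFIDENCE_CAP)
    return label, confidence


near_match = review("FALSE", 0.9, claimed=1.1, evidence=1.19)
real_gap = review("FALSE", 0.9, claimed=1.1, evidence=2.0)
```

With these thresholds the 1.1 vs 1.19 case is overturned with reduced confidence, while 1.1 vs 2.0 stays FALSE.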
This is the step most fact-checking systems skip — and it's the one that matters most for borderline claims.
Agent 5 — Correction Agent
Knowing something is false isn't enough. Users need to know what IS true.
This agent fires only on FALSE or UNCERTAIN verdicts. It runs a targeted live search to find the correct information and surfaces it with a cited source. Conditional — doesn't waste tokens on TRUE verdicts.
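The conditional trigger is simple to sketch. Here the search backend is injected as a callable so the example runs offline; `maybe_correct`, the query format, and the result fields `snippet` and `url` are hypothetical names, and `fake_search` is a stub rather than a real Tavily call.

```python
from typing import Callable, List, Optional


def maybe_correct(claim: str, label: str,
                  search: Callable[[str], Optional[dict]]) -> Optional[dict]:
    """Run the correction search only for FALSE or UNCERTAIN verdicts."""
    if label not in {"FALSE", "UNCERTAIN"}:
        return None  # TRUE verdicts skip the search entirely
    hit = search(f"fact check: {claim}")
    if hit is None:
        return None  # nothing found, so no correction is surfaced
    return {"correction": hit["snippet"], "source": hit["url"]}


calls: List[str] = []


def fake_search(query: str) -> Optional[dict]:
    """Stub search that records each query and returns a placeholder hit."""
    calls.append(query)
    return {"snippet": "Placeholder snippet from the stub.",
            "url": "https://example.org/report"}


skipped = maybe_correct("Claim judged true", "TRUE", fake_search)
fixed = maybe_correct("Claim judged false", "FALSE", fake_search)
```

After both calls, `calls` holds a single query: the TRUE verdict never touched the search backend.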
Why LangGraph?
The pipeline isn't linear for every claim. Some claims have no evidence, so they skip synthesis and go straight to the Critic Agent. Some need multiple retrieval attempts. Some claims loop.
LangGraph's state machine handles conditional branching, loops, and shared state across agents cleanly. The state is typed with TypedDict — every agent reads from and writes to the same state object.
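The branching described above boils down to a routing function over the shared state. The sketch below is plain Python so it runs without LangGraph installed; a function of this shape is the kind of router that LangGraph's conditional edges accept. The state fields, node names, and retry limit are assumptions, not Sift's actual graph.

```python
from typing import List, Optional, TypedDict

MAX_RETRIEVAL_ATTEMPTS = 2  # hypothetical retry budget


class ClaimState(TypedDict):
    claim: str
    evidence: List[str]
    attempts: int            # retrieval attempts made so far
    verdict: Optional[str]


def route_after_retrieval(state: ClaimState) -> str:
    """Pick the next node once the Evidence Hunter has run."""
    if state["evidence"]:
        return "synthesis"
    if state["attempts"] < MAX_RETRIEVAL_ATTEMPTS:
        return "evidence_hunter"  # loop back for another retrieval pass
    return "critic"               # no evidence: skip synthesis


state: ClaimState = {
    "claim": "The Fed raised rates in March 2024",
    "evidence": [],
    "attempts": 1,
    "verdict": None,
}
next_node = route_after_retrieval(state)
```

Because every agent reads and writes the same typed dictionary, the router needs no extra plumbing: it inspects the state and returns the name of the next node.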
Infrastructure
FastAPI returns a task ID immediately. Celery + Redis runs the pipeline in the background. The client polls for results.
Redis cache stores results for 7 days — the same viral claim doesn't cost tokens twice. Cache hits at the API layer return in under 1 second, before Celery even runs.
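A sketch of the cache-first lookup, assuming claims are normalized and hashed into a key. The in-memory `TTLCache` is a stand-in for Redis with an expiry, and the `sift:verdict:` key prefix and the normalization rule are hypothetical.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple

TTL_SECONDS = 7 * 24 * 60 * 60  # 7 days, matching the retention described above


def cache_key(claim: str) -> str:
    """Hash a case- and whitespace-normalized claim into a stable key."""
    normalized = " ".join(claim.lower().split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return "sift:verdict:" + digest


class TTLCache:
    """In-memory stand-in for Redis keys that carry an expiry."""

    def __init__(self, now: Callable[[], float] = time.time) -> None:
        self._store: Dict[str, Tuple[dict, float]] = {}
        self._now = now

    def set(self, key: str, value: dict, ttl: int = TTL_SECONDS) -> None:
        self._store[key] = (value, self._now() + ttl)

    def get(self, key: str) -> Optional[dict]:
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._now() >= expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return value
```

Checking `cache.get(cache_key(claim))` before queueing a task is what lets a repeated viral claim return without touching the pipeline.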
LangFuse traces every LLM call — prompt, output, latency, token count — so I can debug agent failures without guessing.
Tech Stack
LLM: LLaMA 3.3 70B via Groq API
Embeddings: all-MiniLM-L6-v2 via HuggingFace Inference API
Orchestration: LangGraph state machine
RAG: HyDE + pgvector hybrid search
Vector DB: PostgreSQL + pgvector
API: FastAPI + Pydantic
Task Queue: Celery + Redis
Evidence Sources: Tavily (live) + Guardian API + Wikipedia
Observability: LangFuse + Prometheus + Grafana
Try It
The project is fully open source and Dockerized. After cloning the repo and adding your API keys, one command (docker compose up) runs the entire stack:
git clone https://github.com/ashg2099/Sift.git
cd Sift
cp .env.example .env
# Add your API keys (Groq, Tavily, HuggingFace — all free tiers)
docker compose up
Open http://localhost:8000 and start verifying claims.
I'm actively looking for feedback — especially where it breaks. If you try it, I'd love to know what it gets wrong.
GitHub: https://github.com/ashg2099/Sift
LinkedIn: https://www.linkedin.com/in/ashwin-gururaj-93943816a/