DEV Community: ashg2099

Why I'm Betting on CrewAI for Multi-Agent Orchestration (And Where It Falls Short)

ashg2099 — Wed, 08 Jul 2026 02:02:36 +0000

I've been deep-diving into CrewAI lately, and here's my honest technical breakdown.

What is CrewAI?
It's a multi-agent orchestration framework where you define a crew of AI agents, each with a role, goal, backstory, and tools, that collaborate to solve complex tasks sequentially or in parallel.
Think: a team of specialists instead of one generalist doing everything.

Single LLM vs CrewAI, where it breaks down:
A single LLM call has no memory across steps, no specialization, and collapses under long, complex workflows.
CrewAI solves this by:
✅ Decomposing problems into focused subtasks
✅ Letting agents maintain context within their scope
✅ Passing outputs between agents as structured inputs
✅ Supporting tool use per agent (search, code execution, file I/O)
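The handoff pattern in the list above can be sketched without any framework. This is a plain-Python illustration, not CrewAI's API: the `Agent` dataclass, `run_sequential`, and the lambda stubs are hypothetical names, and each stub stands in for an LLM call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Agent:
    """Hypothetical stand-in for a role-scoped agent (not the CrewAI class)."""
    role: str
    goal: str
    work: Callable[[str], str]  # stub for an LLM call with this agent's prompt


def run_sequential(agents: list[Agent], task_input: str) -> list[tuple[str, str]]:
    """Run agents in order, feeding each agent's output to the next one."""
    trace = []
    current = task_input
    for agent in agents:
        current = agent.work(current)
        trace.append((agent.role, current))  # per-agent trace helps debugging
    return trace


crew = [
    Agent("researcher", "collect facts", lambda text: f"facts({text})"),
    Agent("writer", "draft a summary", lambda text: f"summary({text})"),
    Agent("editor", "tighten the draft", lambda text: f"edited({text})"),
]

trace = run_sequential(crew, "topic")
final_output = trace[-1][1]
print(final_output)
```

Keeping the per-agent trace is a cheap way to see which step produced a bad intermediate result.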

How it compares to other frameworks:
🔵 LangChain — great for chains and RAG pipelines, but not built for agent collaboration. CrewAI is purpose-built for multi-agent coordination.
🟠 LangGraph — more low-level, gives you full control over state machines and conditional flows. Better for complex branching logic. CrewAI trades that flexibility for simplicity and speed of development.
🟡 AutoGen (Microsoft) — conversation-based multi-agent, agents talk to each other in chat loops. CrewAI is more structured, roles and tasks are explicit, not emergent from conversation.

The verdict: CrewAI sits in the sweet spot, higher-level than LangGraph, more structured than AutoGen, more agent-native than LangChain.

Where CrewAI genuinely shines:
🔹 Research pipelines (search → summarize → report)
🔹 Automated data analysis workflows
🔹 Content generation with review/editing loops
🔹 Customer support triage with routing logic

Real limitations:
⚠️ Token costs compound fast — every agent call is an LLM call
⚠️ Debugging is hard — tracing which agent caused a failure isn't always obvious
⚠️ Sequential crews can be slow — parallelism requires careful design
⚠️ Agent "hallucination" compounds — errors in one agent propagate downstream
⚠️ Still maturing — production reliability is improving, but not battle-tested at scale like LangChain
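Two of these limitations are easy to put rough numbers on. A back-of-the-envelope sketch, assuming every agent is correct independently with the same probability and every agent receives all earlier outputs as context. Both are simplifications, and the figures are illustrative, not measurements.

```python
def chain_success_probability(per_agent_accuracy: float, n_agents: int) -> float:
    """Probability that every agent in a sequential chain is correct,
    assuming independent errors (a simplification)."""
    return per_agent_accuracy ** n_agents


def cumulative_input_tokens(base_prompt: int, output_per_agent: int, n_agents: int) -> int:
    """Total input tokens when agent k receives the base prompt plus
    the outputs of all k earlier agents."""
    return sum(base_prompt + k * output_per_agent for k in range(n_agents))


p_chain = chain_success_probability(0.95, 5)
tokens = cumulative_input_tokens(base_prompt=500, output_per_agent=300, n_agents=5)
print(f"{p_chain:.3f}")
print(tokens)
```

Under these assumptions, five agents at 95% each leave roughly a 77% chance that the whole chain is right, and input tokens grow faster than linearly with the number of agents.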

For prototyping agentic workflows fast, CrewAI is hard to beat.
For production-grade, fine-grained control, reach for LangGraph.
The best engineers I've seen aren't loyal to one framework. They know when to use each one.
I'm still learning, but this is the mental map I'm building.

#CrewAI #AIAgents #MultiAgent #DataScience #AIEngineering #LLM #LangChain #LangGraph

How I improved my fact-checker from F1 0.655 to 0.813: what actually changed

ashg2099 — Sun, 21 Jun 2026 01:31:47 +0000

I built a multilingual fact-checker using XLM-RoBERTa fine-tuned on the FEVER dataset. The first version hit F1 0.655. Not bad, but it kept misfiring on obvious real-world claims. Earth being the third planet from the Sun returned FALSE at 76% confidence. Something was fundamentally wrong.
A commenter identified the issue immediately: I was training the model on claims alone, with no evidence. FEVER is not a claim classification task. It's a Natural Language Inference task — the model is supposed to verify a claim against evidence, not guess from the claim text alone. I had been training it wrong from the start.

What FEVER actually is:

FEVER (Fact Extraction and VERification) contains 228,000+ Wikipedia claim-evidence pairs. Each claim is annotated as SUPPORTS, REFUTES, or NOT ENOUGH INFO based on retrieved Wikipedia sentences. The whole point is that the model sees both the claim and the evidence together and decides if the evidence supports or contradicts the claim.

Training on claims alone strips out all that signal. The model has nothing to reason about, it just memorizes surface patterns in the claim text.

Phase 1 — Retraining with evidence

The fix was straightforward: concatenate the claim with its gold evidence sentences before passing them to the model. XLM-RoBERTa uses </s> as a sentence separator, so the format becomes [claim] </s> [evidence]. I fine-tuned for one epoch on the full FEVER training set, starting from the existing checkpoint. F1 jumped from 0.655 to 0.813.
The improvement wasn't from a better architecture, more data, or longer training. It was purely from feeding the model what it was designed to receive.
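A minimal sketch of that input format in plain Python. `build_nli_input` is a hypothetical helper name, and the `</s>` separator string is an assumption based on XLM-RoBERTa's tokenizer convention. In practice, passing the claim and the evidence as a text pair to a Hugging Face tokenizer inserts the separator tokens for you, so this only illustrates what the model ends up seeing. The evidence sentences are illustrative.

```python
SEP = "</s>"  # assumed separator string for XLM-RoBERTa


def build_nli_input(claim: str, evidence_sentences: list[str], sep: str = SEP) -> str:
    """Join a claim with its evidence sentences into one NLI-style input."""
    evidence = " ".join(s.strip() for s in evidence_sentences if s.strip())
    return f"{claim.strip()} {sep} {evidence}"


claim_only = "Earth is the third planet from the Sun."
with_evidence = build_nli_input(
    claim_only,
    [
        "Earth is the third planet from the Sun.",
        "It is the only planet known to harbor life.",
    ],
)
print(with_evidence)
```

The v1 model only ever saw `claim_only`; the retrained model sees `with_evidence`.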

Phase 2 — Making it work in the real world

The retrained model was great on FEVER benchmarks but useless for real-world claims, because real-world claims don't come with pre-labeled Wikipedia evidence attached. You need to retrieve the evidence yourself.
For this, I used BGE (BAAI/bge-base-en-v1.5), a retrieval-optimized embedding model from the Beijing Academy of Artificial Intelligence. I call the approach Reverse HyDE: instead of generating a hypothetical document for the query, you embed the claim itself as the retrieval query and find the most semantically similar evidence passages. The FEVER passages are indexed in a FAISS flat inner-product index, which gives cosine similarity when the vectors are normalized.
At inference time: embed the claim, retrieve the top 3 most relevant FEVER passages, concatenate them with the claim, and pass to XLM-RoBERTa. The whole pipeline takes under a second.
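Why inner product over normalized vectors equals cosine similarity, and what top-3 retrieval does, can be shown with toy vectors. This sketch uses plain Python instead of BGE and FAISS; the three-dimensional vectors and the passage labels are made up for illustration.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query: list[float], passages: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Brute-force inner-product search over unit vectors."""
    q = normalize(query)
    scores = {
        name: sum(a * b for a, b in zip(q, normalize(vec)))
        for name, vec in passages.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Toy embeddings; a real system would get these from the embedding model.
passages = {
    "solar_system": [0.9, 0.1, 0.0],
    "planets": [0.8, 0.3, 0.1],
    "cooking": [0.0, 0.2, 0.9],
    "astronomy": [0.7, 0.0, 0.2],
}
results = top_k([1.0, 0.1, 0.0], passages, k=3)
names = [name for name, _ in results]
print(names)
```

The retrieved passages would then be concatenated with the claim and passed to the classifier.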

Results

The combined system — retrieval augmented XLM-RoBERTa — handled real-world claims correctly where the v1 model failed. Claims about historical facts, scientific facts, and geography all returned sensible verdicts with high confidence.

Why XLM-RoBERTa specifically

XLM-RoBERTa is pretrained on CommonCrawl data across 100 languages. This means the fine-tuned model inherits multilingual capability without any additional training. You can submit claims in Hindi, Spanish, Tamil, Arabic, or Chinese and the model understands them. The retrieved evidence is always English (since FEVER is English), but XLM-RoBERTa handles cross-lingual NLI reasonably well — the claim and evidence don't need to be in the same language.

Key lesson

The architecture did not change. The dataset did not change. The training duration did not change. What changed was understanding what the task actually requires and formatting the input accordingly. A 24% relative F1 improvement (0.158 absolute) from fixing the input format is a good reminder to read the dataset paper before training.
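For clarity on the headline number, the arithmetic works out as follows. This is a quick check that uses only the two F1 scores reported in this post.

```python
f1_before = 0.655
f1_after = 0.813

absolute_gain = f1_after - f1_before       # in F1 points
relative_gain = absolute_gain / f1_before  # as a fraction of the old score

print(f"absolute: {absolute_gain:.3f}")
print(f"relative: {relative_gain:.1%}")
```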

🤗 Model:

huggingface.co/ashg2099/xlm-roberta-factchecker

Misinformation doesn't speak one language. Our tools do.

ashg2099 — Tue, 02 Jun 2026 02:50:20 +0000

In 2024, the Oxford Internet Institute studied misinformation spread across 81 countries.
Their finding: the most dangerous misinformation wasn't in English. It was in languages that English-language fact-checking tools couldn't read. WhatsApp forwards in Hindi. Facebook posts in Swahili. Telegram chains in Arabic. Viral claims in Tamil that never get fact-checked because the tools don't exist.

Here's the uncomfortable truth about the current state of NLP fact-checking:
95% of fact-checking models are English-only.

The LIAR dataset — the most cited benchmark in claim verification research — is entirely in English. FEVER, the gold standard for fact verification, is entirely in English. Most production fact-checking APIs? English only.

Meanwhile, India alone has 22 official languages and 500 million WhatsApp users. A false claim about a vaccine, an election, a riot — spreads in minutes in a language no existing model can verify.
This is not a model problem. It's an architecture problem.
Cross-lingual transfer learning has existed since 2019 — XLM-RoBERTa was pre-trained on 100 languages simultaneously. The capability is there. The application isn't.

Datasets like MM-COVID, CLEF CheckThat! 2023, and IndicGLUE exist precisely for this: multilingual misinformation and language-understanding benchmarks that almost nobody in the open-source community has seriously combined and trained on.

The gap between what's possible and what's been built is embarrassingly wide. Someone should close it.

This is exactly why I built Sift 🔍
Sift is an open-source multi-agent fact-checking pipeline — 5 agents, each playing a distinct role. But Sift today only speaks English. 🌐
And that's the problem I'm solving next. Someone should close this gap. I intend to. 🚀

🔗 Full technical breakdown of how Sift works: https://dev.to/ashg2099/i-built-an-open-source-multi-agent-fact-checker-heres-how-it-works-5eah

Data Scientist & AI Engineer — Open to Full-Time Opportunities

ashg2099 — Fri, 29 May 2026 00:20:39 +0000

Hey Dev.to community,

I'm Ashwin Gururaj — a Data Scientist & AI Engineer based in Melbourne, Australia, currently open to full-time, contract, and internship opportunities.
I specialise in building production-grade AI systems — not just notebooks and demos, but end-to-end pipelines that actually run in production.

What I work with:

Python · LangChain · LangGraph · FastAPI · RAG pipelines · pgvector · Multi-agent systems · LLMs · Groq · HuggingFace · Pydantic · Docker · Celery · Redis · PostgreSQL · Data Science · SQL · Pandas · Scikit-learn

What I've built recently:

Sift — an open-source multi-agent fact-checking pipeline. Takes any text, extracts every factual claim, retrieves grounded evidence via HyDE RAG + live web search, and returns auditable verdicts with cited sources. Built with LangGraph, pgvector, FastAPI, and Docker.
→ GitHub: https://github.com/ashg2099/Sift

Open to:

Full-time Data Scientist / AI Engineer / ML Engineer roles
Remote or Melbourne-based
Companies building serious AI products

If you're hiring or know someone who is — I'd genuinely appreciate a connection.

GitHub: https://github.com/ashg2099
LinkedIn: https://www.linkedin.com/in/ashwin-gururaj-93943816a/

Thanks!

I Built an Open-Source Multi-Agent Fact-Checker — Here's How It Works

ashg2099 — Thu, 28 May 2026 00:25:32 +0000

Problem Statement

We have a misinformation problem. But more specifically, we have a speed problem.
A journalist spots a suspicious claim. They search for sources. Cross-reference databases. Call experts. Write a verdict. Get it edited. Publish, maybe 6 hours later. Maybe 3 days later.
Meanwhile, the original claim has been screenshotted, reposted, quoted in newsletters, and cited in arguments across five platforms.
I wanted to build something that closed that gap. Not a chatbot that guesses. A proper pipeline, one that retrieves real evidence, reasons from it, and tells you why it reached a verdict.
That's what Sift is.

What is Sift?

Sift (Source Inspection & Fact-checking Tool) is an open-source multi-agent AI pipeline that takes any text, extracts every factual claim, retrieves grounded evidence, and returns auditable verdicts — TRUE, FALSE, or UNCERTAIN, with cited sources and full reasoning chains.
Paste a news article. A politician's speech. A viral statistic. A WhatsApp forward. Sift breaks it into individual claims and fact-checks each one independently.

Why Multi-Agent?

The naive approach is to ask an LLM: "Is this claim true?"
The problem: LLMs hallucinate. They have knowledge cutoffs. They're confidently wrong in ways that are hard to detect. And critically, they don't show their work.
A single LLM call can't reliably handle the full pipeline of:

Extracting structured claims from noisy text
Retrieving dated, traceable evidence from live sources
Reasoning across conflicting evidence without confabulating
Adversarially reviewing its own conclusions for overconfidence
Finding corrections when something is wrong

Each of these is a distinct task that benefits from its own prompt, its own tools, and its own failure modes. That's why I built five separate agents, orchestrated with LangGraph.

The 5-Agent Pipeline

Agent 1 — Claim Extractor

A single paragraph can contain 4-5 distinct factual claims. Generic LLMs miss them or conflate them.
This agent uses LLaMA 3.3 70B via Groq with Pydantic structured output to extract every distinct verifiable claim from the input text. The output is a typed list of claims — exact text, no paraphrasing, no hallucination.
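One way to enforce the "exact text, no paraphrasing" constraint is to check every extracted claim against the source. A minimal sketch, assuming claims must appear verbatim in the input: `ExtractedClaim` and `keep_verbatim_claims` are hypothetical names, a stdlib dataclass stands in for the Pydantic model, and the hard-coded list stands in for the LLM's structured output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedClaim:
    """Hypothetical claim record; the real pipeline uses a Pydantic model."""
    text: str


def keep_verbatim_claims(source: str, claims: list[ExtractedClaim]) -> list[ExtractedClaim]:
    """Drop any claim whose text is not an exact substring of the source."""
    return [c for c in claims if c.text and c.text in source]


source = "Earth is the third planet from the Sun. It has one natural satellite."
llm_output = [
    ExtractedClaim("Earth is the third planet from the Sun"),
    ExtractedClaim("It has one natural satellite"),
    ExtractedClaim("Earth has a single moon"),  # paraphrase: rejected
]
verified = keep_verbatim_claims(source, llm_output)
print([c.text for c in verified])
```

A substring check is strict by design; a paraphrase that happens to be accurate is still rejected, which keeps every downstream verdict traceable to the original wording.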

Agent 2 — Evidence Hunter

LLMs hallucinate citations. You need real, retrievable, dated evidence.
This agent runs HyDE retrieval across 4,270 indexed Guardian + Wikipedia chunks stored in pgvector, then hits Tavily live web search for recent data.
Why HyDE instead of standard RAG?
Standard RAG embeds the raw claim and searches for similar text. A short factual claim like "The Fed raised rates in March 2024" has a weak semantic signal on its own.
HyDE (Hypothetical Document Embeddings) generates a hypothetical document that would contain the answer — something like a news article excerpt — then embeds that. The result is a richer semantic signal and significantly better retrieval recall on short factual claims.
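The difference between the two retrieval styles can be sketched with stubs. In this illustration, `fake_llm_hypothetical_doc` stands in for the LLM call and a word-overlap score stands in for a real embedding model. Both are assumptions made so the example runs offline. The toy corpus is constructed so the effect is visible; it is not evidence of how much HyDE helps in practice.

```python
def tokens(text: str) -> set[str]:
    """Crude 'embedding': the set of lowercase words (stand-in for an encoder)."""
    return {w.strip(".,").lower() for w in text.split() if w.strip(".,")}


def similarity(a: str, b: str) -> float:
    """Jaccard overlap between word sets, standing in for cosine similarity."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


def fake_llm_hypothetical_doc(claim: str) -> str:
    """Stub for the LLM that writes a passage which would contain the answer."""
    return (f"{claim} The central bank announced the decision after its policy "
            "meeting, citing inflation and interest rate pressures.")


def retrieve(query_text: str, corpus: list[str]) -> str:
    """Return the corpus chunk most similar to the query text."""
    return max(corpus, key=lambda chunk: similarity(query_text, chunk))


corpus = [
    "After its policy meeting the central bank announced an interest rate decision citing inflation",
    "The team raised the trophy after the final",
]
claim = "The Fed raised rates"

standard_hit = retrieve(claim, corpus)                          # embed the raw claim
hyde_hit = retrieve(fake_llm_hypothetical_doc(claim), corpus)   # embed the hypothetical doc
print(standard_hit)
print(hyde_hit)
```

The short claim matches the wrong chunk on surface words, while the longer hypothetical passage shares vocabulary with the relevant one.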

Agent 3 — Synthesis Agent

This agent reasons strictly from retrieved evidence. It returns TRUE / FALSE / UNCERTAIN with a calibrated confidence score.
Critically — if evidence is thin or conflicting, it returns UNCERTAIN instead of confabulating certainty. This was one of the hardest things to get right. LLMs naturally trend toward false confidence. I had to explicitly prompt for epistemic humility and add Pydantic validators to catch zero-confidence outputs.
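The kind of guardrail described here might look like the following. This is a sketch with a stdlib dataclass standing in for the Pydantic model. The field names, the rule that zero confidence is rejected, and the downgrade to UNCERTAIN when no evidence is attached are assumptions about one reasonable design, not the project's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(str, Enum):
    TRUE = "TRUE"
    FALSE = "FALSE"
    UNCERTAIN = "UNCERTAIN"


@dataclass
class SynthesisResult:
    """Hypothetical output schema for the synthesis step."""
    verdict: Verdict
    confidence: float
    evidence_ids: list[str]

    def __post_init__(self) -> None:
        self.verdict = Verdict(self.verdict)  # raises ValueError on unknown labels
        if not 0.0 < self.confidence <= 1.0:
            raise ValueError(f"confidence must be in (0, 1], got {self.confidence}")
        # A definite verdict with no supporting evidence is downgraded.
        if self.verdict is not Verdict.UNCERTAIN and not self.evidence_ids:
            self.verdict = Verdict.UNCERTAIN


ok = SynthesisResult("TRUE", 0.82, ["doc-1"])
downgraded = SynthesisResult("FALSE", 0.9, [])

try:
    SynthesisResult("TRUE", 0.0, ["doc-1"])
    rejected = False
except ValueError:
    rejected = True

print(ok.verdict.value, downgraded.verdict.value, rejected)
```

Failing loudly on a malformed result lets the pipeline retry the LLM call instead of passing a meaningless score downstream.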

Agent 4 — Critic Agent

Synthesis agents tend toward overconfidence when evidence partially supports a claim. You need an adversarial check.
This agent independently reviews every verdict. It flags unsupported reasoning, catches cases where 1.1°C vs 1.19°C is a rounding difference, not a false claim, and adjusts confidence downward when warranted.
This is the step most fact-checking systems skip — and it's the one that matters most for borderline claims.

Agent 5 — Correction Agent

Knowing something is false isn't enough. Users need to know what IS true.
This agent fires only on FALSE or UNCERTAIN verdicts. It runs a targeted live search to find the correct information and surfaces it with a cited source. Conditional — doesn't waste tokens on TRUE verdicts.

Why LangGraph?

The pipeline isn't linear for every claim. Some claims have no evidence, so they skip synthesis and go straight to the critic. Some need multiple retrieval attempts. Some claims loop.
LangGraph's state machine handles conditional branching, loops, and shared state across agents cleanly. The state is typed with TypedDict — every agent reads from and writes to the same state object.
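The routing described above can be sketched in plain Python. This is not LangGraph code: `PipelineState`, `route_after_retrieval`, `route_after_critic`, and the node names are hypothetical, and the retry limit of 2 is an assumption. In LangGraph, functions of this shape would typically be registered as conditional edges.

```python
from typing import TypedDict


class PipelineState(TypedDict):
    """Shared state every node reads from and writes to (hypothetical fields)."""
    claim: str
    evidence: list[str]
    retrieval_attempts: int
    verdict: str  # "", "TRUE", "FALSE", or "UNCERTAIN"


MAX_RETRIEVAL_ATTEMPTS = 2  # assumed limit for illustration


def route_after_retrieval(state: PipelineState) -> str:
    """Pick the next node once the evidence step has run."""
    if state["evidence"]:
        return "synthesis"
    if state["retrieval_attempts"] < MAX_RETRIEVAL_ATTEMPTS:
        return "evidence_hunter"  # loop back and retry retrieval
    return "critic"  # no evidence found: skip synthesis


def route_after_critic(state: PipelineState) -> str:
    """Run the correction step only for FALSE or UNCERTAIN verdicts."""
    return "correction" if state["verdict"] in {"FALSE", "UNCERTAIN"} else "end"


state: PipelineState = {"claim": "x", "evidence": [], "retrieval_attempts": 1, "verdict": ""}
first = route_after_retrieval(state)
state["retrieval_attempts"] = 2
second = route_after_retrieval(state)
state["verdict"] = "TRUE"
third = route_after_critic(state)
print(first, second, third)
```

Keeping the routers as pure functions of the state makes each branch easy to unit test without running any agent.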

Infrastructure

FastAPI returns a task ID immediately. Celery + Redis runs the pipeline in the background. The client polls for results.
Redis cache stores results for 7 days — the same viral claim doesn't cost tokens twice. Cache hits at the API layer return in under 1 second, before Celery even runs.
LangFuse traces every LLM call — prompt, output, latency, token count — so I can debug agent failures without guessing.
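The cache-before-queue idea can be sketched with an in-memory dictionary standing in for Redis. `ClaimCache`, the normalization rule (lowercase, collapsed whitespace), and the injectable clock are assumptions for illustration; only the 7-day lifetime comes from the description above.

```python
import hashlib
import time
from typing import Callable, Optional

SEVEN_DAYS = 7 * 24 * 60 * 60  # cache lifetime in seconds


def cache_key(claim: str) -> str:
    """Hash a normalized claim so trivially different copies share one entry."""
    normalized = " ".join(claim.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class ClaimCache:
    """In-memory stand-in for a Redis cache with per-entry expiry."""

    def __init__(self, ttl: int = SEVEN_DAYS, clock: Callable[[], float] = time.time):
        self._ttl = ttl
        self._clock = clock
        self._store: dict[str, tuple[float, dict]] = {}

    def get(self, claim: str) -> Optional[dict]:
        entry = self._store.get(cache_key(claim))
        if entry is None:
            return None
        expires_at, result = entry
        if self._clock() >= expires_at:
            return None  # expired: caller should enqueue a fresh run
        return result

    def set(self, claim: str, result: dict) -> None:
        self._store[cache_key(claim)] = (self._clock() + self._ttl, result)


now = [0.0]  # fake clock so the example does not need to wait
cache = ClaimCache(clock=lambda: now[0])
cache.set("The Earth is  ROUND", {"verdict": "TRUE"})
hit = cache.get("the earth is round")
now[0] = SEVEN_DAYS + 1
miss = cache.get("the earth is round")
print(hit, miss)
```

In the API handler, a hit would be returned directly and only a miss would enqueue a background task.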

Tech Stack

LLM: LLaMA 3.3 70B via Groq API
Embeddings: all-MiniLM-L6-v2 via HuggingFace Inference API
Orchestration: LangGraph state machine
RAG: HyDE + pgvector hybrid search
Vector DB: PostgreSQL + pgvector
API: FastAPI + Pydantic
Task Queue: Celery + Redis
Evidence Sources: Tavily (live) + Guardian API + Wikipedia
Observability: LangFuse + Prometheus + Grafana

Try It

The project is fully open source and Dockerized. One command runs the entire stack:

git clone https://github.com/ashg2099/Sift.git
cd Sift
cp .env.example .env
# Add your API keys (Groq, Tavily, HuggingFace — all free tiers)
docker compose up

Open http://localhost:8000 and start verifying claims.
I'm actively looking for feedback — especially where it breaks. If you try it, I'd love to know what it gets wrong.

GitHub: https://github.com/ashg2099/Sift
LinkedIn: https://www.linkedin.com/in/ashwin-gururaj-93943816a/