Let me walk you through something I have been experimenting with lately: running a local AI agent that detects hate speech, powered by Google's Gemma 4 model and the CrewAI framework, all without calling any paid API. No keys, no credits, just a local model and Python. Let me explain how I put this together.
Why Gemma + CrewAI?
Most CrewAI tutorials you will find online default to GPT-4 or Claude. That is fine, but what if you want to run everything locally, maybe for privacy or cost reasons, or just to understand the stack at a deeper level?
Well, I could have gone with Meta's Llama 3.2 3B model, which is a good open-source alternative as well, but Gemma 4 is new and I just wanted to try it. You can use any open-source model available on Hugging Face.
The model I used is google/gemma-4-E2B-it, a 2-billion-parameter instruction-tuned version of Google's Gemma 4. It is light enough to run on a Colab GPU without any resource crunch, and it loads via the Hugging Face transformers library.
The Problem: CrewAI Doesn't Talk to Transformers
The CrewAI Agent class expects an LLM object, but out of the box it only works well with OpenAI-compatible APIs. So I had to build a custom LLM class by inheriting from CrewAI's BaseLLM.
The Gemma4CrewAILLM class, which extends BaseLLM, overrides the call() method to handle message formatting and generation manually:
class Gemma4CrewAILLM(BaseLLM):
    def call(self, messages, tools=None, ...):
        # Convert messages to Gemma's format
        # Apply chat template via the processor
        # Run model.generate()
        # Strip the prompt from the output and return clean text
The key insight here is that Gemma uses AutoProcessor for both tokenization and chat templating, and I had to be careful to set enable_thinking=False to skip the "thinking tokens" the model might otherwise generate.
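To make the pseudocode above concrete, here is a minimal sketch of the two pure-Python helpers call() needs. The helper names (to_gemma_messages, strip_prompt) are my own, and folding system messages into the first user turn is an assumption about how the chat template handles roles; adjust it for your template version.

```python
def to_gemma_messages(messages):
    """Map OpenAI-style message dicts to the structure Gemma's chat template expects.

    Gemma's template has no separate 'system' role, so system content is folded
    into the next user turn (an assumption; adjust for your template version).
    Gemma uses 'model' rather than 'assistant' for its own turns.
    """
    out, system = [], ""
    for m in messages:
        if m["role"] == "system":
            system += m["content"] + "\n"
        elif m["role"] == "user":
            content = (system + m["content"]) if system else m["content"]
            system = ""
            out.append({"role": "user", "content": content})
        else:  # assistant turns become "model" turns
            out.append({"role": "model", "content": m["content"]})
    return out


def strip_prompt(full_decoded, prompt_decoded):
    """model.generate() echoes the prompt; keep only the newly generated text."""
    if full_decoded.startswith(prompt_decoded):
        return full_decoded[len(prompt_decoded):].strip()
    return full_decoded.strip()
```

In the real call() you would run these around processor.apply_chat_template() and model.generate(), but the conversion logic itself is plain Python.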
Defining the Agent and Task
Once the Gemma4CrewAILLM was ready, plugging it into CrewAI was straightforward. I defined a Hate Speech Detection Specialist agent with a detailed role, goal, and backstory, because CrewAI uses these as system-prompt context to guide how the agent reasons.
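For a sense of why the role, goal, and backstory matter, here is roughly how CrewAI assembles them into system-prompt context. The wording below is approximated from CrewAI's prompt templates, and build_system_prompt is my own illustrative helper, not a CrewAI API.

```python
def build_system_prompt(role, goal, backstory):
    # Approximation of CrewAI's agent prompt template: the role and backstory
    # establish the persona, then the goal steers every task the agent runs.
    return (
        f"You are {role}. {backstory}\n"
        f"Your personal goal is: {goal}"
    )


prompt = build_system_prompt(
    role="Hate Speech Detection Specialist",
    goal="Classify text and produce a structured moderation report",
    backstory="You are an expert content moderator with years of experience.",
)
```

Because all three fields end up in the system prompt, a lazy one-line backstory gives the model far less to anchor on than a detailed one.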
The task I gave it was to analyze a piece of text and produce a structured 3-bullet report:
- Verdict: Yes / No / Uncertain with a reason
- Detected Content: Specific phrases flagged, plus the targeted group
- Severity and Recommendation: A numeric rating out of 10, plus a recommended action (Allow / Flag for Review / Remove)
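Downstream code usually wants this report as structured data rather than free text. Here is a hypothetical parser for the 3-bullet format; parse_report is my own helper, and the model's actual bullet layout may vary, so treat it as a sketch.

```python
import re

def parse_report(text):
    """Pull the three labeled bullets out of the agent's report into a dict.

    Assumes each field appears as '- <Label>: <value>' on its own line,
    matching the task spec; real model output may need looser matching.
    """
    fields = {}
    for key in ("Verdict", "Detected Content", "Severity and Recommendation"):
        m = re.search(rf"-\s*{re.escape(key)}:\s*(.+)", text)
        if m:
            fields[key] = m.group(1).strip()
    return fields


sample = (
    "- Verdict: Yes, contains an insult targeting a protected group\n"
    "- Detected Content: 'example phrase', targeting group X\n"
    "- Severity and Recommendation: 8/10, Remove"
)
```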
This structure forces the model to commit to a specific verdict instead of producing a vague summary, which is exactly what you want in a content moderation pipeline.
Here is my gist for reference: Link.
What Should I Do Next?
This setup is good for experimentation and development while sparing your wallet from expensive tokens. But here are some improvements and extensions that could be done:
- Add more agents: a second agent for sentiment or intent classification working in parallel
- Try function calling / tools: Gemma 4 and Llama 3.2 support structured outputs, which could make the crew much more powerful
- Wrap it in a FastAPI endpoint: so you can POST text and get back a structured moderation report over HTTP
The whole point for me was to prove that we can build a capable, agentic AI pipeline with zero API costs, running entirely on open-weights models. And honestly, for a 2B-parameter model, Gemma 4 handles this task well. Larger models, such as the bigger Llama 3.2 or Gemma 4 variants, should give even better results.
If you have been trying local LLMs for agent workflows, I hope you find this post helpful. Drop your questions or thoughts in the comments; I would love to hear what use cases you are considering for agents.