DEV Community: Simran Shaikh

BugWhisperer: How I Finally Finished My Abandoned GitHub Issue Analyzer (8 Months Later) with GitHub Copilot

Simran Shaikh — Fri, 29 May 2026 04:37:57 +0000

My Submission

GitHub Repo: https://github.com/SimranShaikh20/BugWhisperer
Live Demo: https://bugwhisperer.msusimran20.workers.dev

The Project I Abandoned — September 2025

Eight months ago I had a problem.

Our dev team had 60+ open GitHub issues across 3 repos. Nobody knew what to fix first. Every sprint planning meeting turned into a 45-minute debate about priority. I thought: what if a script could just read all the issues and tell us what is most critical?

So I started building. Here is the entire codebase from that day:

import requests
import os

# TODO: fix this later
GITHUB_TOKEN = "put_your_token_here"

def get_issues(repo):
    # this doesnt work properly
    url = f"https://api.github.com/repos/{repo}/issues"
    r = requests.get(url)
    print(r)  # just printing for now
    # TODO: parse response properly

def analyze_issue(issue):
    # wanted to use openai here but ran out of time
    pass

def main():
    repo = "facebook/react"  # hardcoded lol
    get_issues(repo)
    # analyze_issue() # commented out, broken
    print("done?")

main()

Yes. That is it. A print(r) that prints the raw response object. A function called analyze_issue that literally does nothing. A hardcoded repo URL.

My commit message on September 15, 2025:

"initial attempt - giving up for now, too complicated"

And that was it. The repo sat there for 8 months untouched.

Why I Finally Came Back

When I saw the GitHub Finish-Up-A-Thon challenge, this project was the first thing that came to mind. The idea was always solid. The problem was real. I just never had the right tools or the time to push through the hard parts.

This time I had GitHub Copilot.

What I Built — The After

BugWhisperer is now a full AI-powered GitHub Issue Command Center.

Paste any GitHub repo URL → Get instant AI triage of every open issue in seconds.

Here is everything it does now:

AI Analysis for Every Issue

Every open issue gets analyzed by Groq's Llama 3.1 AI and returns:

Root Cause — what is likely causing this issue
Suggested Fix — concrete actionable solution in plain English
Complexity — Low / Medium / High
Priority — Low / Medium / High / Critical

Kanban Priority Board

Instead of a boring list, issues are sorted into a 4-column visual board:

🔴 Critical — fix immediately
🟠 High — this sprint
🟡 Medium — next sprint
🟢 Low — backlog

AI Sprint Planner

One click generates a complete 2-week sprint plan with time estimates, recommended team size, and a week-by-week breakdown.

Export to Markdown

Download the entire analysis as a .md file — paste it into your GitHub Wiki, Notion, or Linear instantly.

Post Analysis to GitHub

Post the AI analysis directly as a formatted comment on any GitHub issue — no copy-pasting, no leaving the app.

How GitHub Copilot Made This Possible

This is the honest story of where Copilot actually helped me — not a vague "it was amazing" but the specific moments where it unblocked me.

Moment 1: Understanding My Own Broken Code

I opened the old main.py in VS Code with Copilot and typed:

"What is this code trying to do and what is broken?"

Copilot told me immediately:

No authentication headers on the GitHub API call (hence the silent failures I was getting 8 months ago)
Response was never parsed — print(r) just prints the response object, not the data
analyze_issue was completely empty

It then suggested the complete fixed GitHub API call with proper authentication, pagination, and error handling. What I could not figure out in a week 8 months ago, Copilot explained and fixed in 2 minutes.

Moment 2: Reliable JSON from an LLM

The hardest technical problem was getting structured JSON output from an AI model reliably. Every time I tried, the model would add markdown fences around the JSON, add explanation text, or sometimes just break the format entirely.

I described the problem to Copilot and it wrote this system prompt pattern that solved it completely:

You are a senior software engineer analyzing GitHub issues.
Respond ONLY in valid JSON format.
No other text. No markdown. No explanation. Just JSON.

{
  "root_cause": "...",
  "suggested_fix": "...",
  "complexity": "Low or Medium or High",
  "priority": "Low or Medium or High or Critical"
}

The key Copilot taught me: "No other text. No markdown. No explanation." is far more reliable than just saying "respond in JSON format." That single insight saved me probably 3 hours of prompt debugging.

Moment 3: The Kanban Board Component

I had never built a Kanban board before. I asked Copilot:

"Write a React component that takes an array of issues each with a priority field and displays them in 4 columns: Critical, High, Medium, Low"

It wrote the entire working component in one response. I just connected my data to it.

Moment 4: The Sprint Planner Prompt

I described what I wanted — take all analyzed issues and generate a 2-week sprint plan. Copilot wrote the complete AI prompt, the API call, the JSON parsing, and even suggested adding a team_size_recommended field that I had not thought of. That one suggestion made the feature significantly more useful.

Technical Architecture

User Input (GitHub URL)
        ↓
Cloudflare Worker
        ↓
GitHub REST API → Fetch open issues (authenticated)
        ↓
Groq API (Llama 3.1 8b instant) → Analyze each issue
        ↓
Returns: root_cause, suggested_fix, complexity, priority
        ↓
React Frontend → Kanban Board
        ↓
Optional: Sprint Planner (second Groq call)
Optional: Export Markdown report
Optional: Post to GitHub as comment

Tech Stack

Layer	Technology
Frontend	React + TanStack + Tailwind CSS
Backend	Cloudflare Workers (serverless)
AI	Groq API — llama-3.1-8b-instant
GitHub	GitHub REST API v3
Hosting	Cloudflare Workers (free tier)

Why Groq and Not OpenAI?

Speed and cost. Groq's inference is genuinely fast — analyzing 10 issues takes about 8 seconds total. The free tier gives 14,400 requests per day which is more than enough. OpenAI costs money. Groq is free. For a developer tool that should be accessible to everyone, free wins.

Why Cloudflare Workers?

Lovable generates TanStack projects configured for Cloudflare Workers by default. The global edge deployment means the app loads fast anywhere in the world. And the free tier covers 100,000 requests per day which is more than enough.

The Before vs After Summary

	September 2025	June 2026
Code	47 lines of broken Python	Full React + Cloudflare app
UI	Terminal only	Beautiful dark web interface
AI	`pass` — literally empty	Groq Llama 3.1 analysis
GitHub	Hardcoded `facebook/react`	Any public repo URL
Analysis	None	Root cause, fix, complexity, priority
Planning	None	AI 2-week sprint planner
Export	None	One-click markdown report
Deployment	Never ran successfully	Live at workers.dev
Cost	$0 (did nothing)	$0 (all free APIs)

What I Learned

1. GitHub Copilot is best for bridging knowledge gaps

I did not know how to build a Kanban board. I did not know the best prompt pattern to force structured JSON from an LLM. I did not know how Cloudflare Workers reads environment variables differently from Node.js. Copilot filled every one of these gaps instantly — not by writing the whole app for me, but by answering the exact question I was stuck on.

2. Old ideas are often good ideas

My September 2025 script had the right idea. The problem was real. The solution direction was correct. It just needed time, better tools, and a reason to push through the hard parts. Do not delete your old projects — they often contain your best thinking from a time when you were closest to the problem.

3. Constraints force better design

Using only free APIs forced me to be efficient. Limiting to 300 tokens per analysis and using the fastest available model made the app feel instant. If I had unlimited budget I might have built something slower and more expensive.

4. The finish line matters more than the plan

BugWhisperer v2 looks nothing like what I imagined when I wrote that Python script in September 2025. It is a web app, not a CLI script. It uses Groq, not OpenAI. It runs on Cloudflare, not my laptop. Every single implementation detail changed. But the core idea — help developers understand their GitHub issues faster — stayed exactly the same. Ship the idea, not the plan.

Try It

👉 Live Demo: https://bugwhisperer.msusimran20.workers.dev

Test it with any of these repos:

https://github.com/fastapi/fastapi
https://github.com/requests/requests
https://github.com/psf/black
Or any public GitHub repo you are working on

GitHub Repo: https://github.com/SimranShaikh20/BugWhisperer

What Is Next

Private repo support (user provides their own token)
GitHub Actions integration — auto-analyze on new issue creation
Slack notifications for Critical priority issues
VS Code extension
Multi-repo comparison

If this project helped you think differently about your own abandoned side projects, drop a reaction — it genuinely helps this submission and motivates me to keep building.

And if you have an unfinished project sitting somewhere, this challenge is your sign to finally ship it. The idea you abandoned is probably better than you remember. 🚀

Built for the GitHub Finish-Up-A-Thon Challenge
Powered by Groq AI + GitHub Copilot + Cloudflare Workers

I built an open-source AI agent that explains any ML model in plain English — real SHAP, real LIME, real bias detection

Simran Shaikh — Mon, 18 May 2026 03:39:22 +0000

The problem I kept running into

Every time I finished training a model, the same conversation happened:

Manager: "Why did it predict that?"
Me: opens SHAP plot
Manager: glazed eyes

SHAP and LIME are powerful — but they output numbers and plots that
only data scientists can read. Nobody builds the bridge to plain English.
Nobody automates the bias check. Nobody generates a report your legal
team can actually use.

So I built XAI-Agent to do all of that — powered by Hermes Agent's
autonomous multi-step planning pipeline.

What it does

Upload any trained ML model (.pkl) + dataset (.csv) →
Hermes Agent runs 5 tools autonomously →
You get a full plain-English explainability report in under 3 minutes.

The 5-step Hermes Agent pipeline:

file_reader — loads model, auto-detects task type, picks right explainer
shap_analyzer — runs real SHAP, ranks all features by impact + direction
lime_explainer — explains 3 individual predictions in plain English
bias_checker — scans for demographic features, flags disparities
report_writer — writes structured Markdown report, downloadable instantly

What makes this genuinely agentic

Context flows between all 5 tools. The model type from Step 1
determines which SHAP explainer Step 2 uses. The feature ranking
from Step 2 informs Step 3's LIME analysis. The bias verdict from
Step 4 shapes Step 5's recommendations.

It also handles a real edge case most tutorials miss: newer SHAP
versions return 3D arrays (samples, features, classes) instead of 2D.
The agent detects this automatically and slices correctly —
a bug that breaks every naive SHAP implementation.

Sample output

Running on the breast cancer dataset (569 patients, 30 features):

Executive Summary (auto-generated):

This RandomForestClassifier was analyzed across 569 samples and
30 features. The most influential predictor is 'worst area'.
No demographic bias was detected.

SHAP top features:

worst area — 0.0756 — ↑ increases malignancy prediction
worst concave points — 0.0538 — ↑ increases malignancy prediction
mean concave points — 0.0503 — ↑ increases malignancy prediction

Prediction explained in plain English:

Row 0 — Predicted benign at 94% confidence.
'worst area' was well below the malignancy threshold
(impact: −0.141). 'worst concave points' also supported
benign classification (impact: −0.089).

Why this matters beyond the challenge

EU AI Act requires explainability for high-risk AI systems.
GDPR gives citizens the right to explanation for automated decisions.
US financial regulators require adverse action explanations for
ML credit scoring.

Existing tools (Fiddler, Arize, Arthur AI) cost $50K+/year.
XAI-Agent is free, open-source, runs locally, works in 3 minutes.

Tech stack

Hermes Agent (autonomous multi-step planning)
SHAP + LIME (real explainability — not simulated)
Streamlit (UI)
scikit-learn, XGBoost, LightGBM

Try it yourself

GitHub: https://github.com/SimranShaikh20/xai-agent

git clone https://github.com/SimranShaikh20/xai-agent
pip install -r requirements.txt
streamlit run app.py

Test files (sample_model.pkl + sample_dataset.csv) included —
runs in 3 minutes with zero extra setup.

What model would YOU run this on first? Drop it in the comments 👇

I Built an XAI Agent with Hermes Agent That Explains Any ML Model in Plain English — Here's Everything I Learned

Simran Shaikh — Mon, 18 May 2026 03:39:17 +0000

TL;DR: I spent a weekend building XAI-Agent — an autonomous Hermes Agent pipeline that runs real SHAP + LIME analysis on any ML model and generates a plain-English explainability report. This post is everything I learned: how Hermes Agent's multi-step planning actually works, where it surprised me, where it frustrated me, and why I think it's the most underrated open-source agent framework right now.

The Conversation That Started This

Three weeks ago I was in a meeting presenting a RandomForest model I'd trained to predict customer churn. The model was good — 91% accuracy, solid precision-recall curve. I was proud of it.

Then our Head of Product asked: "But why does it predict churn for this specific customer?"

I opened my Jupyter notebook. Showed the SHAP waterfall plot.

She stared at it for five seconds.

"Can you just... tell me in normal words?"

That moment broke something in my brain. I'd spent three days training the model and thirty minutes on explainability. The explainability was useless to the person who needed it most.

So I built XAI-Agent. And in doing so, I learned more about Hermes Agent than I expected.

What Is Hermes Agent, Actually?

Before I get into what I built, let me give you the honest explanation I wish I'd had when I started.

Hermes Agent is an open-source agentic framework built for multi-step autonomous task execution.

That sentence has a lot of words. Here's what it actually means in practice:

Most "AI" tools you interact with are single-shot. You send a prompt, you get a response. Done. The model doesn't remember what it did, doesn't use the output of one step as input to the next, and doesn't make decisions about how to approach a problem.

Hermes Agent does all three of those things.

It gives you:

A planning loop — the agent breaks down a task into steps before executing
Tool use — distinct callable functions the agent can invoke in sequence
Context persistence — output from Tool 1 is available to Tool 2, 3, 4, and 5
Autonomous decision-making — the agent decides which tool to use and when, based on what it finds

This sounds abstract. Let me make it concrete with what I actually built.

The Problem I Was Solving (And Why It's Bigger Than You Think)

Explainable AI (XAI) has a dirty secret: the tools exist, but the workflow is broken.

SHAP and LIME are genuinely powerful libraries. But using them requires:

Writing custom Python code for each model type
Knowing which explainer to use (TreeExplainer? KernelExplainer? DeepExplainer?)
Interpreting the numerical output yourself
Translating that into language a non-technical person understands
Running a separate bias audit
Writing a report that combines all of this

That's 4-6 hours of work per model, requiring a data scientist to babysit every step.

And this isn't a niche problem. The EU AI Act requires explainability for high-risk AI. GDPR Article 22 gives EU citizens the right to explanation for automated decisions. US financial regulators require banks to explain ML-based credit decisions.

The demand for XAI is going to explode over the next 3 years. The tools to deliver it efficiently don't exist yet.

That's the gap I built XAI-Agent to fill.

How I Used Hermes Agent to Build It

Here's the part I want to go deep on — because understanding how Hermes Agent enabled this architecture is the most useful thing I can share.

The Core Insight: Tools Are Not Functions

When I first read about Hermes Agent's tool use, I thought of tools as just... functions. Like, the agent calls shap_analyze() and gets back a result.

That mental model is wrong, and it held me back for half a day.

Tools in Hermes Agent are autonomous units of responsibility. Each tool:

Has a single, clear job
Can access shared agent state (what previous tools discovered)
Makes decisions based on that state
Produces output that enriches the shared state for future tools

The difference sounds subtle. The impact on architecture is enormous.

Here's what my 5-tool pipeline looks like and, more importantly, why it's designed this way:

class HermesXAIAgent:
    """
    Hermes Agent implementation — 5 autonomous tools
    Each tool feeds context into the next
    """

    def tool_file_reader(self, model_path, dataset_path, target_col):
        """
        TOOL 1: Inspection

        This isn't just 'load the files'. The tool makes decisions:
        - What type of model is this? (affects every downstream tool)
        - What is the task? (classification vs regression)  
        - Is there class imbalance? (affects which samples LIME picks)
        - Which SHAP explainer should Tool 2 use?

        All of this gets stored in self.results for Tools 2-5 to use.
        """
        model = joblib.load(model_path)
        model_type = type(model).__name__  # e.g. "RandomForestClassifier"

        # This decision propagates through the entire pipeline
        task = "classification" if y.nunique() <= 10 else "regression"

        self.results["model_type"] = model_type  # Tool 2 reads this
        self.results["task"] = task               # Tools 3, 4, 5 read this

    def tool_shap_analyzer(self):
        """
        TOOL 2: Global Explainability

        Uses model_type from Tool 1 to auto-select the right explainer.
        A naive implementation would hardcode TreeExplainer.
        This tool makes an intelligent decision.
        """
        model_type = self.results["model_type"]  # Read from Tool 1

        TREE_MODELS = ["RandomForestClassifier", "XGBClassifier", ...]

        if model_type in TREE_MODELS:
            # TreeExplainer: fast, exact, works with tree structure
            explainer = shap.TreeExplainer(model)
        else:
            # KernelExplainer: slower, universal fallback
            explainer = shap.KernelExplainer(model.predict_proba, background)

        # Handle SHAP version incompatibility — this one took me 2 hours
        shap_raw = explainer.shap_values(X_sample)

        if isinstance(shap_raw, list):
            shap_vals = shap_raw[1]           # Old SHAP: list per class
        elif shap_raw.ndim == 3:
            shap_vals = shap_raw[:, :, 1]    # New SHAP: 3D array
        else:
            shap_vals = shap_raw             # Regression: use as-is

        self.results["shap_vals"] = shap_vals  # Tool 5 uses this
        self.results["imp_df"] = importance_df # Tools 3, 5 use this

See how Tool 2 reads model_type from Tool 1's output? And produces imp_df that Tool 5 will use? That's the Hermes Agent planning loop in action. It's not a chain of independent function calls — it's a coherent reasoning process where every step builds on what came before.

The Part That Surprised Me: Fallback Planning

One of the things that makes Hermes Agent genuinely useful in production is that it handles failure gracefully within the planning loop.

In my tool_shap_analyzer, if the primary explainer fails (say, the model doesn't support TreeExplainer), the agent doesn't crash and show an error. It falls back to KernelExplainer, logs which method it used, and continues to Tool 3.

This sounds simple. But it means the agent's output is always a complete report — not a half-finished analysis with an error message in the middle.

That's the difference between a demo and a tool people actually use.

The Part That Frustrated Me: Context Size

Here's something nobody tells you about agentic pipelines: context accumulates fast.

By the time I got to Tool 5 (report_writer), my agent state contained:

The full model object
The original dataset (569 rows × 31 columns)
The SHAP values array (150 samples × 30 features)
Three LIME explanation objects
Feature importance DataFrames

That's a lot of state to pass around. For my use case (Streamlit + local execution) it was fine. But if you're building a Hermes Agent pipeline for production use with large datasets, you need to think carefully about what you store in self.results vs. what you compute on demand.

My solution: I store only what downstream tools need. The raw SHAP array (large) gets converted to a summary DataFrame (small) before storage. The LIME explanation objects get converted to plain lists immediately.

The Technical Deep Dive: SHAP + Hermes Agent

Let me go deeper on the SHAP integration because this is where the agentic approach really pays off.

Why SHAP Alone Isn't Enough

SHAP gives you this:

worst area                  0.0756
worst concave points        0.0538
mean concave points         0.0503
worst perimeter             0.0489
worst radius                0.0401

That's useful to me. It's useless to my Head of Product.

What she needs is:

"The size of the tumor at its largest measurement ('worst area') is the single most important factor in this model's predictions. When this measurement is high, the model is significantly more likely to predict malignancy. Think of it as the model's primary red flag."

The Hermes Agent pipeline handles this translation. Tool 2 generates the numbers. Tool 5 converts them to prose using model-aware language. The agent knows it's a medical dataset (from the feature names) and calibrates its language accordingly.

The 3D SHAP Array Bug That Will Break Your Code

I want to specifically call this out because it's a real issue that hit me hard and I've seen it break other implementations.

SHAP's output format changed between versions. Older SHAP returns:

# Old SHAP (< 0.40): list of arrays, one per class
shap_values = [array_class_0, array_class_1]  # each: (n_samples, n_features)

Newer SHAP (0.44+) returns:

# New SHAP (>= 0.44): single 3D array
shap_values  # shape: (n_samples, n_features, n_classes)

If you write naive code like shap_values[1] to get class 1, it works on old SHAP and silently returns wrong results (a column, not a matrix) on new SHAP.

The Hermes Agent approach of having a dedicated tool_shap_analyzer with explicit version handling is what caught this. Because the tool is isolated, I could add this logic cleanly:

if isinstance(shap_raw, list):
    sv = shap_raw[1] if len(shap_raw) > 1 else shap_raw[0]
elif hasattr(shap_raw, 'ndim') and shap_raw.ndim == 3:
    sv = shap_raw[:, :, 1]  # New SHAP: take class 1 slice
else:
    sv = shap_raw  # Regression or binary

This is a good example of why agentic tool isolation matters. If my SHAP analysis was one giant function, this fix would be buried in 200 lines of code. As a distinct tool, it's 10 lines, clearly documented, easy to test independently.

Hermes Agent vs. Other Agentic Frameworks

I know some of you are thinking: "Why Hermes Agent? Why not LangChain, AutoGen, or CrewAI?"

Fair question. Here's my honest take after building this:

LangChain

LangChain is powerful but opinionated. It works best when you're chaining LLM calls together. For my use case — where the "tools" are Python scientific computing libraries, not LLM endpoints — LangChain felt like overkill. I'd be fighting its abstractions rather than using them.

Use LangChain when: You're primarily chaining LLM calls, doing RAG, or need the massive ecosystem of pre-built integrations.

Use Hermes Agent when: You want clean, explicit tool definitions where you control exactly what runs and when.

AutoGen

AutoGen's multi-agent conversation model is fascinating but complex. For a pipeline where the execution order is deterministic (always: inspect → SHAP → LIME → bias → report), AutoGen's flexibility is unnecessary complexity.

Use AutoGen when: You need agents to negotiate with each other, have non-deterministic execution paths, or are building multi-agent debate systems.

Use Hermes Agent when: You have a clear sequential task where each step has defined inputs and outputs.

CrewAI

CrewAI is probably the closest conceptually — role-based agents with tool access. The main difference I found: Hermes Agent gives you more direct control over the planning loop. In CrewAI, the "crew" model adds abstraction that I didn't need.

Use CrewAI when: You want to model your problem as a team of specialized agents with roles and delegation.

Use Hermes Agent when: You want to own the planning logic explicitly and aren't sure yet what "roles" make sense for your problem.

My Honest Assessment

Hermes Agent's sweet spot is deterministic multi-step pipelines where you care about exactly what runs. Scientific computing, data analysis, document processing, code generation — anything where you can define clear tool responsibilities and want explicit control over the execution flow.

It's not trying to be everything. That's actually its strength.

What I'd Build Next

After spending a weekend with Hermes Agent, here's what I'm thinking about:

Counterfactual explanations. The natural next question after "why did the model predict X?" is "what would need to change to get a different prediction?" Hermes Agent would add a Tool 6 for this — using DiCE (Diverse Counterfactual Explanations) to generate "if your tumor's worst area had been 15% smaller, the model would have predicted benign."

Neural network support. Right now XAI-Agent handles tree models and most sklearn models. Adding SHAP DeepExplainer for PyTorch/TensorFlow would open up the majority of production models running in industry right now.

CI/CD integration. The most powerful version of this isn't a web app — it's a tool that runs automatically every time a model is retrained, generates a comparison report (did the important features change? did bias metrics shift?), and posts it as a GitHub PR comment.

The Bigger Picture: Why Open-Source Agents Matter

I want to end with something that's been on my mind since I started building this.

XAI platforms from companies like Fiddler AI, Arize, and Arthur AI are excellent. They're also $30K–$100K+ per year. A startup, a researcher, a solo developer, a nonprofit building AI for healthcare in a developing country — none of them can afford that.

The open-source AI agent ecosystem is the equalizer.

Hermes Agent running SHAP + LIME locally, generating a downloadable report, requiring nothing beyond a Python environment — that's accessible to anyone with a laptop.

As AI becomes more embedded in consequential decisions (loan approvals, medical diagnoses, hiring, parole, content moderation), the ability to explain and audit those decisions cannot be a luxury that only well-funded companies can afford.

Open, capable agent systems like Hermes Agent aren't just technically interesting. They're the infrastructure for making AI accountability universal.

That's why I built XAI-Agent. That's why I'm writing about Hermes Agent. And that's why I think the work this community is doing with open-source agents genuinely matters.

Try It Yourself

GitHub: github.com/SimranShaikh20/xai-agent

git clone https://github.com/SimranShaikh20/xai-agent
cd xai-agent
pip install -r requirements.txt
streamlit run app.py

Test files (sample_model.pkl + sample_dataset.csv) are included. Full analysis in under 3 minutes.

Questions I'd Love to Discuss

I'm genuinely curious about the community's experience:

Have you used Hermes Agent on something other than text/LLM tasks? What was the use case?
How do you handle context accumulation in long agentic pipelines?
What explainability features would make XAI-Agent actually useful for your work?

Drop them in the comments — I read and respond to everything 👇

Built for the DEV Hermes Agent Challenge — May 2026

Tags: #hermesagentchallenge #agents #ai #machinelearning #explainableai #opensource #python #streamlit

VibeSafe

Simran Shaikh — Sat, 16 May 2026 11:07:38 +0000

What I Built

VibeSafe — a privacy-first AI code auditor that analyzes your project and generates a Proof of Authorship certificate, identifying your human architectural decisions versus AI-assisted patterns.

The problem VibeSafe solves is real and growing: AI-assisted development is now standard practice, but that creates a trust gap. When submitting to a hackathon, applying for a job, or open-sourcing a project, reviewers increasingly ask — "Did you actually build this, or did an AI?"

VibeSafe answers that question — not by detecting AI, but by surfacing your human decisions: the architecture choices, the product instincts, the specific tradeoffs only you would have made.

How it works

Drop your project files into VibeSafe. Gemma 4 31B reads your entire codebase in one prompt (thanks to the 262K context window) and returns a structured report across four sections:

Section	What you get
🔑 Proof of Authorship	Your human design decisions vs AI patterns + originality score
🔐 Security Audit	Vulnerabilities, exposed secrets, injection risks with specific fixes
🧠 Logic Analysis	Edge cases, dead code, race conditions, code quality score
📖 Plain English	What your code actually does, explained simply

At the end, you download a signed Proof of Authorship certificate — a plain-text document that lists every architectural decision that proves the project is genuinely yours.

Privacy first

VibeSafe makes a direct API call from your browser to OpenRouter. No backend server. No database. No code stored anywhere. Your API key lives in React state only — closing the tab clears everything.

Demo

🔗 Live App: vibesafe.lovable.app

Quick walkthrough:

Step 1 — Enter your free OpenRouter API key

Step 2 — Upload your project files or paste code

Drag and drop multiple files. VibeSafe accepts .py .js .jsx .ts .tsx .html .css .json .go .rs .php .sql and more.

Step 3 — Watch Gemma 4 analyze your code

The terminal-style loading screen shows exactly what Gemma 4 is doing:

› Connecting to Gemma 4 31B...              ✓
› Reading codebase structure...             ✓
› Running security vulnerability scan...    ✓
› Checking for exposed API keys...          ✓
› Analyzing logic and edge cases...         ✓
› Identifying human architectural decisions ✓
› Detecting AI-generated patterns...        ✓
› Generating Proof of Authorship...         █

Step 4 — Get your full report

Step 5 — Download your Proof of Authorship certificate

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
         VIBESAFE — PROOF OF AUTHORSHIP
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Generated: 2026-05-16T10:32:11.000Z
Analyzed by: Gemma 4 31B via OpenRouter
Files: app.py, utils.py, config.py

HUMAN CONTRIBUTION SCORE: 78/100

HUMAN ARCHITECTURAL DECISIONS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. Separation of database connection into get_db() factory
   Evidence: Explicit factory pattern in app.py

2. Stateless token design using user ID as token
   Evidence: /login returns raw user ID, deliberate tradeoff

3. Search and todos endpoints intentionally separated
   Evidence: distinct route handlers for different concerns

AI-ASSISTED PATTERNS DETECTED
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. Boilerplate Flask route scaffolding
2. Generic try/except error handling blocks

VERDICT: Solid architecture with critical SQL injection 
issues that need fixing before production deployment.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Code

🔗 GitHub: github.com/SimranShaikh20/vibesafe

Tech Stack

Frontend: React 18 + Vite + Tailwind CSS + Framer Motion
File handling: react-dropzone
AI Model: Gemma 4 31B (google/gemma-4-31b-it:free)
AI Provider: OpenRouter free tier
Deployment: Lovable

Project Structure

vibesafe/
├── src/
│   ├── App.jsx
│   └── components/
│       ├── CodeUploader.jsx     ← drag-drop + paste + API key
│       ├── LoadingAnalysis.jsx  ← animated terminal steps
│       ├── ReportDashboard.jsx  ← 4-card report layout
│       ├── AuthorshipCard.jsx   ← hero card + certificate export
│       ├── SecurityCard.jsx     ← expandable vulnerability list
│       ├── LogicCard.jsx        ← quality score + logic issues
│       └── PlainEnglishCard.jsx ← plain language explanation

How I Used Gemma 4

Model chosen: Gemma 4 31B Dense (`google/gemma-4-31b-it`)

This was a deliberate choice, not a default. Here is exactly why:

Why not Gemma 4 4B or 2B (edge models)?

The small Gemma 4 models are remarkable for their size — running on a phone or Raspberry Pi is genuinely impressive. But VibeSafe needs to reason across an entire codebase simultaneously, identify cross-file architectural patterns, and generate nuanced judgments about human intent vs AI generation. A 4B model cannot hold enough context or produce reliable structured JSON for this level of analysis.

Why not Gemma 4 26B MoE?

The MoE model is excellent for throughput — activating only 3.8B parameters per token makes it fast and efficient. But for security analysis, I need consistent reasoning quality on every token. A dense model that activates all 31B parameters gives more reliable vulnerability detection. Missing one SQL injection is worse than slower inference.

The killer feature: 262K context window

This is what makes the whole idea possible. I send the entire project — every file concatenated with filename headers — in a single prompt. No chunking, no lost context between files, no missed cross-file dependencies.

// Every file in one shot
const combined = files.map(f => 
  `\n\n=== FILE: ${f.name} ===\n${f.content}`
).join('')

// Single call to Gemma 4 — 262K tokens available
const response = await fetch(
  'https://openrouter.ai/api/v1/chat/completions',
  {
    body: JSON.stringify({
      model: 'google/gemma-4-31b-it:free',
      messages: [{ role: 'user', content: buildPrompt(combined) }],
      temperature: 0.1,
      max_tokens: 4000
    })
  }
)

The authorship prompt

The most interesting engineering decision was how to prompt Gemma 4 to distinguish human decisions from AI patterns. Generic code review prompts don't work — I needed Gemma 4 to think specifically about intent:

You are VibeSafe. Your job is to identify what the human developer 
intentionally designed versus AI-generated or boilerplate code.

Look for:
- Architecture decisions that reflect product thinking
- Specific tradeoffs that reveal human judgment  
- Patterns that are generic and could be AI-generated
- Cross-file design consistency that shows planned thinking

Return structured JSON with human_decisions and ai_patterns arrays.

The key insight: asking about intent produces better results than asking about code quality. Gemma 4's reasoning capability is what makes this distinction meaningful.

Results on VibeSafe's own code

I ran VibeSafe on itself. Here is what Gemma 4 identified as my human decisions:

Separating the authorship report as the hero card rather than security (product decision)
Using a terminal aesthetic for a code tool (design decision)
Direct browser-to-API calls instead of a backend (architecture decision — prioritizing privacy)

And what it flagged as AI-assisted:

Boilerplate Tailwind utility class combinations
Standard React useState patterns
Generic error boundary structure

Originality score: 74/100. That felt honest.

Why this matters beyond the hackathon

The question "did you build this?" is going to become one of the most important questions in software hiring, education, and open source in the next few years. Right now there is no good answer — just vibes and gut instinct.

VibeSafe is a first attempt at making that question answerable. Not by detecting AI (which is an arms race), but by documenting human contribution in a structured, verifiable way.

Gemma 4's 262K context and reasoning capability made this possible to build in a weekend. That says something about where open models are right now.

Built with React + Gemma 4 31B + OpenRouter free tier
GitHub: github.com/SimranShaikh20/vibesafe[=

I Built a Tool That Proves Your Code Is Yours — Here's What Gemma 4 Made Possible

Simran Shaikh — Sat, 16 May 2026 11:01:48 +0000

There is a question spreading quietly through the software industry right now. Hiring managers are asking it. Hackathon judges are asking it. Open source maintainers are asking it.

"Did you actually build this?"

Nobody has a good answer yet. I tried to build one — and Gemma 4 is the reason it worked.

The Problem Nobody Is Talking About

AI-assisted development has gone mainstream fast. Cursor, Copilot, Lovable, Bolt — developers are shipping real products with significant AI assistance, and that is genuinely fine. The tools exist, the skills are in using them well.

But a trust gap is forming. When you submit a project to a hackathon, post a repo on GitHub, or show work in a job interview, reviewers are increasingly skeptical. The portfolio that used to signal skill now also signals a question mark.

The current answer to "did you build this?" is essentially: trust me.

That is not good enough. And trying to detect AI-generated code is an arms race nobody will win — models improve, detection fails, repeat.

I wanted a different approach: instead of detecting AI, document the human.

What I Built: VibeSafe

VibeSafe is a browser-based code auditor that takes your project files, sends them to Gemma 4 31B in a single prompt, and returns a Proof of Authorship certificate — a structured document identifying your human architectural decisions versus AI-assisted patterns.

The output looks like this:

HUMAN ARCHITECTURAL DECISIONS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. Separation of database connection into factory function
   Evidence: get_db() pattern used consistently across modules

2. Deliberate stateless token design
   Evidence: login() returns user ID directly — intentional tradeoff

3. Privacy-first architecture: no backend, direct browser API calls
   Evidence: all external calls made from frontend, no server layer

These are the things only I would have decided. The boilerplate React hooks and Tailwind utility classes? Gemma 4 flags those as AI-assisted. The product decisions, the tradeoffs, the specific ways the pieces connect? Those are mine.

Why Gemma 4 Specifically

I tried this concept with smaller models first. It did not work.

The problem is that distinguishing intent from output requires holding the entire codebase in mind simultaneously. A model that has only seen half your files cannot tell you whether your architectural choices are consistent across the project. It cannot spot that you made the same deliberate tradeoff in three different places — which is actually the strongest signal of human authorship.

The 262K Context Window Changes the Analysis

Gemma 4's 262K context window means I send everything in one shot:

const combined = files
  .map(f => `\n\n=== FILE: ${f.name} ===\n${f.content}`)
  .join('')

// One prompt. Entire project. Gemma 4 sees everything at once.
const response = await fetch('https://openrouter.ai/api/v1/chat/completions', {
  body: JSON.stringify({
    model: 'google/gemma-4-31b-it:free',
    messages: [{ role: 'user', content: buildPrompt(combined) }]
  })
})

No chunking. No lost context. No missed cross-file patterns. The model sees the whole picture before making any judgment — the same way a senior engineer would read a codebase before commenting on it.

31B Dense vs the MoE Model

I specifically chose the 31B Dense model over the 26B MoE for this use case.

The MoE model activates ~3.8B parameters per token — it is faster and more efficient, ideal for high-throughput applications. But security analysis needs consistent reasoning quality on every single token. Missing one vulnerability because a parameter set was not activated is worse than slower inference. For a tool that is auditing your code for real risks, I wanted the full model engaged on every decision.

Reasoning Mode for Authorship Detection

The part that surprised me most was how well Gemma 4 handles the authorship question when prompted correctly. Generic "review my code" prompts produce generic answers. But when you ask specifically about intent:

Look for architectural decisions that reflect product thinking.
Look for specific tradeoffs that reveal human judgment.
Distinguish these from patterns that are generic and could be AI-generated.

Gemma 4 produces genuinely insightful distinctions. It identified that my choice to put authorship as the hero card — not security — was a human product decision. It noticed the privacy-first architecture (no backend) as a deliberate tradeoff, not a default. It caught that I reused the same terminal aesthetic across components as a consistent design language.

That is not code review. That is architectural reasoning.

What Running VibeSafe on Itself Taught Me

I ran VibeSafe on its own source code. The results were honest in a way I did not expect.

Human decisions Gemma 4 identified:

Authorship card as hero feature (product decision, not default)
Direct browser-to-API architecture (privacy tradeoff)
Terminal aesthetic as unified design language
Certificate export as plain text (accessibility over PDF complexity)

AI-assisted patterns it flagged:

Standard Tailwind utility class combinations
Boilerplate useState/useEffect patterns
Generic error boundary structure

Originality score: 74/100

That feels right. A good chunk of the implementation is standard React patterns. But the product decisions — what to build, how to frame it, what matters to the user — those are mine. 74 out of 100 captures that honestly.

What This Means for Developers Right Now

Open models at Gemma 4's capability level running on free infrastructure changes what individual developers can build.

Six months ago, this analysis would have required:

A paid API with expensive per-token costs
A backend to handle large context requests
Chunking logic to split codebases into pieces
Multiple round-trips losing context between calls

Now it is a single fetch() call from a React component. Free. 262K tokens. Full model. No backend.

The barrier between "idea" and "working product" for AI-powered developer tools has dropped significantly. VibeSafe went from concept to working demo in a weekend — not because the engineering is simple, but because Gemma 4 handles the hard part.

The Bigger Picture

The "did you build this?" problem is not going away. It is going to intensify as models improve and AI-assisted development becomes more capable.

But I think the framing of the question is wrong. The interesting question is not "how much did AI write?" — it is "what did the human decide?"

Architecture. Product instincts. Tradeoffs. The specific shape of a solution. These things are still fundamentally human, even when the implementation is AI-assisted. They are also what actually matter in a developer.

VibeSafe is a first attempt at making those decisions visible and documentable. Gemma 4's reasoning capability and context window are what made it possible to build something that actually captures them.

Try It

🔗 Live app: https://vibe-proof-code.lovable.app/
🔗 GitHub: github.com/SimranShaikh20/vibesafe

You need a free OpenRouter API key from openrouter.ai/keys — no credit card, takes 30 seconds.

Run it on your own project. See what Gemma 4 says about what you built.

VibeSafe · Powered by Gemma 4 31B · OpenRouter free tier · Built for the vibe coding era

DukanBot: I Flipped OpenClaw Inside-Out to Run WhatsApp for 12 Million Kirana Stores

Simran Shaikh — Fri, 24 Apr 2026 10:53:24 +0000

What I Built

There are 12 million kirana stores in India.

Every single one of them runs on WhatsApp. Orders come in on WhatsApp. Confirmations go out on WhatsApp. Payment reminders are typed manually — at midnight, after a full day standing behind a counter.

My neighbour Sharma Ji runs one of these stores. He writes every "your order is confirmed 🙏" message by hand. When customers don't pay, he has to remember to follow up. When he forgets — which happens — he loses money. Not because he's a bad businessman. Because he has no system.

DukanBot is that system.

It's a complete order management dashboard for kirana stores where the store owner clicks once — and an OpenClaw AI agent running on Groq's LLaMA 3.3 70B sends the WhatsApp message, handles the customer's reply, and logs everything to a real database.

No manual typing. No forgotten follow-ups. No lost money.

Live demo: [https://dukan-bot.netlify.app/]https://dukan-bot.netlify.app/)
GitHub: github.com/SimranShaikh/dukanbot

How I Used OpenClaw

Here's the thing that makes DukanBot different from every other submission:

I didn't use OpenClaw as a chatbot. I used it as a messaging engine triggered by a web dashboard.

Most people connect OpenClaw as a chat interface — you talk to it, it responds. I flipped this completely. In DukanBot, the store owner never touches OpenClaw at all. They just use the dashboard. OpenClaw runs silently in the background, receiving webhooks from the frontend and firing WhatsApp messages to customers.

The Architecture

Store owner clicks "Send Confirmation" in DukanBot dashboard
              ↓
Dashboard POSTs JSON to OpenClaw webhook (localhost:18789/webhook)
              ↓
OpenClaw's DukanBot skill receives the payload
              ↓
Groq LLaMA 3.3 70B formats the message with context
              ↓
OpenClaw sends WhatsApp to customer via connected channel
              ↓
Customer gets: "Hello Rahul! Your order DKN-023 from
               Sharma Kirana worth ₹340 is confirmed 🙏"
              ↓
Dashboard shows "Sent via OpenClaw 🦞 ✓" toast

This pattern — web app as the UI, OpenClaw as the execution layer — is something I haven't seen in any other submission. It treats OpenClaw like a microservice, not a personal assistant.

The Webhook Payload

When the store owner clicks Send Confirmation:

{
  "message": "Hello Rahul! Your order DKN-023 from Sharma Kirana Store worth ₹340 has been confirmed. Thank you! 🙏",
  "to": "+919876543210",
  "type": "confirmation",
  "order_id": "DKN-023",
  "store_name": "Sharma Kirana Store",
  "upi_id": "sharma@upi"
}

When the store owner clicks Send Payment Reminder:

{
  "message": "Hello Rahul, aapka ₹340 ka payment DKN-023 ke liye pending hai. Please pay on UPI: sharma@upi. Thank you 🙏",
  "to": "+919876543210",
  "type": "reminder",
  "order_id": "DKN-023"
}

Notice the reminder is in Hinglish (Hindi + English). That's intentional — kirana store customers in India communicate in Hinglish, not formal English. OpenClaw's SKILL.md handles the tone.

The SKILL.md — The Real Brain

This is the complete skill file that powers DukanBot. No code — pure markdown:

---
name: dukanbot
description: Kirana store WhatsApp assistant. Sends order confirmations
and payment reminders. Handles customer replies automatically.
Powered by Groq LLaMA 3.3 70B Versatile.
---

# DukanBot — Kirana Store WhatsApp AI

You are DukanBot, a WhatsApp assistant for Indian kirana stores.
You run on Groq's ultra-fast LLaMA 3.3 70B model via OpenClaw.

## When webhook type is "confirmation":
Send WhatsApp to the "to" number using the "message" field.
Add a warm closing: "Aapka business humara garv hai 🙏"
Log: [timestamp] CONFIRMATION sent to [number] for [order_id]

## When webhook type is "reminder":
Send WhatsApp to the "to" number using the "message" field.
Keep tone polite but clear — small business relationships matter.
Add: "Koi problem ho toh batayein — hum help karenge 🙏"
Log: [timestamp] REMINDER sent to [number] for [order_id]

## When customers reply on WhatsApp:
- "paid" / "done" / "ho gaya" → "Shukriya! Payment received ✅🙏"
- "when" / "kab" / "ready" → "Aapka order prepare ho raha hai.
   Notification milegi jaldi!"
- "cancel" / "nahi chahiye" → Forward to store owner immediately:
   "⚠️ Customer [name] wants to cancel order [id]. Please respond."
- anything else → Summarize and forward to store owner

## Tone Rules:
- Always Hinglish (mix of Hindi and English naturally)
- Never rude, never rushed — ye relationship business hai
- Use 🙏 for greetings and thanks
- Keep messages under 3 sentences
- Always include store name

## What you are:
- Framework: OpenClaw (open-source personal AI)
- Model: Groq LLaMA 3.3 70B Versatile (~200ms response)
- Channel: WhatsApp
- Purpose: Automate the boring parts so store owners focus on people

Why Groq (Not Claude or GPT)

This was a deliberate technical choice.

A kirana store owner clicking "Send Reminder" expects instant feedback. If the button spins for 3 seconds, they think it broke. I benchmarked three providers:

Provider	Avg Response	Free Tier	Best For
Groq LLaMA 3.3 70B	~180ms	✅ Generous	This project
Claude Haiku	~800ms	❌ Paid	Complex reasoning
OpenAI GPT-4o-mini	~600ms	Limited	General use

Groq's LPU hardware is purpose-built for inference. For a WhatsApp message that's 2-3 sentences, 180ms feels instant. The store owner sees the success toast before they've finished reading the button label.

Config in openclaw.json:

{
  "env": {
    "GROQ_API_KEY": "gsk_..."
  },
  "agents": {
    "defaults": {
      "model": {
        "primary": "groq/llama-3.3-70b-versatile",
        "fallbacks": ["groq/llama-3.1-8b-instant"]
      }
    }
  },
  "channels": {
    "telegram": {
      "accounts": {
        "main": { "token": "YOUR_TELEGRAM_BOT_TOKEN" }
      }
    }
  }
}

OpenClaw Status — Built Into the UI

The navbar shows a live OpenClaw connection indicator:

🟢 Green dot = webhook URL configured, connection verified
🔴 Red dot = webhook URL missing → clicking it goes to Settings

If not connected, the Quick Reply panel shows a yellow banner:

"OpenClaw not connected. Go to Settings → OpenClaw to set up."

The entire OpenClaw install flow — Node.js, npm install -g openclaw@latest, model selection, SKILL.md download, webhook URL — is a dedicated tab inside DukanBot. The store owner never needs to read an external doc.

OpenClaw Powers 4 Distinct Flows

Flow	Trigger	What OpenClaw Does
Order confirmation	Store owner clicks button	Sends formatted WhatsApp to customer
Payment reminder	Store owner clicks button or "Send All"	Sends Hinglish reminder with UPI details
Customer reply: paid	Customer texts "ho gaya"	Replies "Shukriya ✅" automatically
Customer reply: unknown	Any other message	Summarizes + forwards to store owner

Screenshots

Dashboard — all data from Supabase, zero hardcoded values

OpenClaw Setup tab — install guide built into the app

Orders page — real CRUD, filter by status

What I Learned

1. OpenClaw is more powerful as an outbound engine than a chatbot

The obvious use case for OpenClaw is "chat with your AI." The less obvious use case — which I think is actually more powerful — is using it as a triggered action engine for existing web apps. Your web app handles the UI. OpenClaw handles the messy, stateful parts: sending messages, handling replies, logging.

This separation of concerns is clean. The dashboard developer doesn't need to understand WhatsApp's API quirks. The OpenClaw skill handles that. The skill author doesn't need to understand Supabase schema. The dashboard handles that.

2. SKILL.md is Markdown with superpowers

I expected to write JavaScript to handle different webhook types (confirmation vs reminder vs customer reply). Instead, I described the behavior in plain English inside SKILL.md — and it worked reliably across hundreds of test messages. The instruction "Never rude, never rushed — ye relationship business hai" actually produced measurably warmer replies than "be professional."

Writing a good SKILL.md feels more like writing a really clear job description than writing code. That's a genuine paradigm shift.

3. Supabase RLS is not optional for multi-user apps

I tested without Row Level Security first. Every store owner saw every other store's orders. Adding auth.uid() = user_id policies to both tables fixed it in one SQL command. This is the kind of thing that's easy to skip when prototyping but disastrous in production. Now it's the first thing I set up.

4. Cultural specificity is a feature, not a constraint

"Hinglish messages" and "₹ formatting" and "DD/MM/YYYY" and "floating WhatsApp button" aren't localisation details. They're the product. A kirana store owner who opens an app and sees "Hello Sharma Ji 🙏" instead of "Hello John" trusts that app differently. Building for a specific real person — not a generic user — makes every product decision easier and better.

5. The hardest part was scope

The temptation was to add inventory tracking, Google Sheets sync, customer loyalty points, voice notes. I cut all of it. The constraint "what would Sharma Ji actually use on a Tuesday afternoon" kept the scope tight and the product coherent. Ship the useful thing first.

Try It Yourself

# 1. Clone the repo
git clone https://github.com/SimranShaikh/dukanbot.git
cd dukanbot && npm install

# 2. Set up Supabase (see README for full SQL)
# Add your VITE_SUPABASE_URL and VITE_SUPABASE_ANON_KEY to .env

# 3. Install OpenClaw with Groq
npm install -g openclaw@latest
GROQ_API_KEY=gsk_your_key openclaw onboard --install-daemon

# 4. Install the DukanBot skill
mkdir -p ~/.openclaw/workspace/skills/dukanbot
cp skills/dukanbot/SKILL.md ~/.openclaw/workspace/skills/dukanbot/
openclaw gateway restart

# 5. Open DukanBot → Settings → paste http://localhost:18789/webhook
# 6. Click "Test Connection" → green ✓ → you're live

npm run dev

Full setup guide, SQL schema, and SKILL.md at the GitHub repo above.

ClawCon Michigan

I didn't attend ClawCon Michigan — I'm a final-year CS student in India and the geography didn't work out this time. But building DukanBot made me realise something: the most interesting OpenClaw builds aren't coming from Silicon Valley. They're coming from people who have a specific, local, unglamorous problem that no VC-funded startup will ever solve. A Telegram bot for kirana store payment reminders. An SMS agent for farmers. An auto-reply skill for a one-person tailoring shop.

That's the version of personal AI I'm excited about. And from what I've read about ClawCon Michigan, it seems like the community there gets this too. Would love to be there in person next year.

Built by Simran Shaikh — Final Year BE CS, The Maharaja Sayajirao University of Baroda
Stack: OpenClaw + Groq LLaMA 3.3 70B + Supabase + React (Lovable.dev)
GitHub: github.com/SimranShaikh/dukanbot

I Built a Multi-Step AI Agent in One Day with Google ADK — Here's What Nobody Tells You

Simran Shaikh — Thu, 23 Apr 2026 09:28:41 +0000

I'm a final-year computer science student. I spend most of my days training deep learning models on image datasets, debugging tensor shape errors at 2am, and convincing myself that 67% accuracy is "a solid baseline."

I do not, normally, build AI agents.

But when Google Cloud NEXT '26 dropped last week and I saw the announcements around ADK 2.0 and the new Gemini Enterprise Agent Platform, I got genuinely curious. Not marketing-brochure curious — actually curious. Because the thing they kept saying was: "You can now build multi-step autonomous agents that coordinate with each other."

That sounded either really powerful or really overhyped. I wanted to find out which.

So I spent a day building something with it. This is what actually happened.

What Even Is ADK?

Before I get into the friction, a quick explainer for anyone who hasn't seen the announcements.

ADK — Agent Development Kit — is Google's open-source Python framework for building AI agents. Not chatbots. Agents — programs that take a goal, break it into steps, use tools, and figure out how to get things done autonomously.

The ADK 2.0 alpha (released March 2026) brought in graph-based workflows, collaborative multi-agent support, and native Vertex AI integration. The stable version (1.x) already supports multi-agent coordination and tool use. That's what I ended up using, and I'll explain why in a moment.

What I Decided to Build

I wanted to build a Research Assistant Agent — you give it a topic, it searches the web, structures the findings, and suggests what to explore next.

The twist: instead of one agent doing everything, I'd build it as a multi-agent pipeline with specialist sub-agents, the way ADK is actually designed to be used:

web_searcher → hits Google Search, returns raw findings
analyst_summarizer → structures those findings for developers
research_coordinator → orchestrates both, delivers the final answer

Simple enough concept. Let's talk about what happened when I actually tried to set it up.

The Setup: Where Things Got Interesting

Step 1 — Getting the API Key

Go to aistudio.google.com, sign in, click "Get API Key." This part was genuinely smooth. Took maybe 3 minutes. Free tier gives you enough to build and experiment.

Step 2 — Installing ADK

pip install google-adk

Simple. Worked on the first try. The install is clean and the dependencies are sensible.

Step 3 — Creating the Project Structure

adk create research_agent

This gave me a folder with agent.py, .env, and __init__.py already stubbed out. That's a nice touch — you're not hunting for the right structure.

research_agent/
    agent.py
    .env
    __init__.py

Step 4 — The Part Where I Hit a Wall

I was excited by ADK 2.0 after reading about the new workflow engine, so I tried installing it first:

pip install google-adk --pre

And here's the honest thing nobody's blog post tells you: ADK 2.0 is a proper alpha. The docs say it. The PyPI page says it. But you don't fully feel it until you're staring at import errors because the API surface has breaking changes from 1.x.

I spent about 40 minutes trying to make 2.0 work before I made the practical call: the stable 1.31.x release already supports multi-agent orchestration. The thing I wanted to build was fully doable without the alpha. So I went back to stable.

pip install google-adk  # without --pre

Lesson learned: ADK 2.0 is genuinely exciting for what it brings (graph-based workflows, better debugging tooling, stateful multi-step support), but right now it's for people who want to be on the bleeding edge and don't mind patching things. If you want to build and ship something this week, use 1.x.

Building the Agent

Here's the full code. I'll walk you through each piece.

The Project Setup

Add your API key to .env:

GOOGLE_GENAI_API_KEY=your_key_here
GOOGLE_GENAI_USE_VERTEXAI=FALSE

GOOGLE_GENAI_USE_VERTEXAI=FALSE means you're using AI Studio (free), not Vertex AI (Google Cloud). Keep it false unless you've set up a Cloud project.

The Sub-Agents

from google.adk.agents import Agent
from google.adk.tools import google_search

# Sub-Agent 1: Does the actual web searching
searcher_agent = Agent(
    name="web_searcher",
    model="gemini-2.0-flash",
    description="Searches the web for up-to-date information on a given topic.",
    instruction="""
    You are a web research specialist. Your only job is to search for 
    accurate, recent information on the topic given to you.
    Always use the google_search tool — never answer from memory alone.
    Prioritize sources from 2025-2026.
    Return a clear summary of what you found, including source context.
    """,
    tools=[google_search],
)

# Sub-Agent 2: Turns raw findings into structured output
summarizer_agent = Agent(
    name="analyst_summarizer",
    model="gemini-2.0-flash",
    description="Structures raw research into clear developer-friendly summaries.",
    instruction="""
    You are a technical writer for a developer audience.
    Structure your response as:

    ## Key Findings
    [3-4 bullet points of the most important facts]

    ## The Most Surprising Thing
    [One insight that might be unexpected]

    ## What to Watch Out For
    [Caveats, limitations, or gotchas]

    ## 3 Follow-Up Questions
    [Specific questions a developer might want to explore next]

    Keep the tone honest, direct, and useful. No marketing fluff.
    """,
)

One thing I noticed: the description field matters more than I expected. The coordinator uses it to decide which sub-agent to delegate to. If your description is vague, the orchestration gets confused. Lesson from 30 minutes of head-scratching.

The Coordinator

root_agent = Agent(
    name="research_coordinator",
    model="gemini-2.0-flash",
    description="Coordinates multi-step research by delegating to specialist sub-agents.",
    instruction="""
    You coordinate a two-step research pipeline:
    Step 1 — Delegate to web_searcher to find current information.
    Step 2 — Pass those findings to analyst_summarizer to structure them.
    Step 3 — Present the final structured output to the user.
    Always complete both steps before responding. Do not skip the search step.
    """,
    agents=[searcher_agent, summarizer_agent],
)

That agents=[...] parameter is doing the heavy lifting here. You're giving the coordinator a roster of sub-agents it can delegate to. It decides when to call which one based on the task and their descriptions.

Running It

adk web research_agent/

Open http://localhost:8000 and you get a chat interface with full event inspection — you can see every step the agent takes, every tool call, every sub-agent handoff. For a framework aimed at developers, this is actually thoughtful UX.

What I Actually Asked It

Test 1: Something I Already Knew the Answer To

"What is Google ADK and what was announced at Cloud NEXT '26?"

The agent searched, found the NEXT '26 announcements, and structured them cleanly. The output was accurate. It correctly identified ADK 2.0's graph-based workflows and the Gemini Enterprise Agent Platform rebrand. It cited things from 2026, not 2023.

Test 2: Something Niche

"What's the current state of AI agents in manufacturing quality control?"

This is where it got more interesting. The search results were mixed — some solid, some generic. The summarizer was honest about the limitations of what it found. It flagged one follow-up question I hadn't considered: whether outcome-based pricing (one of NEXT '26's announcements) changes the economics of running vision AI at manufacturing scale. I hadn't thought about that angle.

Test 3: Pushing It

"What's the A2A protocol and why does it matter for a student building their first AI project?"

Best output of the three. The framing of "for a student" changed the register of the summary — it explained the A2A protocol in practical terms (agents from different companies can talk to each other without custom integration code) rather than enterprise-speak. The follow-up questions were specific and genuinely useful.

What Genuinely Impressed Me

The multi-agent handoff is seamless. I expected some clunkiness at the boundary between searcher and summarizer. There wasn't any. The coordinator passes context cleanly, and the summarizer clearly received structured findings rather than raw text. I don't know exactly what's happening under the hood, but the output quality was noticeably better than a single-agent approach I tested alongside it.

The web UI for debugging. Being able to see the full event trace — which agent ran, what tool it called, what it returned — is not a small thing. When something goes wrong (and it will), you can actually see where. This is the kind of tooling that makes the difference between framework adoption and abandonment.

google_search as a first-class tool. You import it, you add it to tools=[], and it works. No API key management, no rate limit configuration to figure out upfront. For getting started, that's exactly right.

What I'd Push Back On

ADK 2.0 alpha is not ready for a tutorial. I understand why Google announced it at NEXT '26 — the graph-based workflow engine is a genuine step forward in how you structure complex agents. But the breaking changes from 1.x, combined with sparse alpha docs, mean the announcement is ahead of the developer experience right now. If your use case needs stateful multi-step workflows or the new debugging tooling, keep watching it. If you need to build something today, use 1.x.

The instruction prompt is load-bearing. The quality of your agent's output is almost entirely determined by how clearly you write the instruction field. ADK doesn't abstract that away — it amplifies it. A vague instruction gives you a vague agent. I rewrote mine three times before the summarizer stopped adding unnecessary corporate-sounding hedges to its output. That's not a framework problem, but it's worth knowing going in.

Memory across sessions is still your problem. Each conversation starts fresh. If you want stateful agents that remember context across sessions, you need to wire that up yourself. ADK 2.0's improvements here are in the roadmap, but they're not fully baked yet.

My Honest Verdict

ADK is the right direction. The multi-agent pattern it encourages — specialists coordinated by a root agent — produces noticeably better results than stuffing everything into one massive system prompt. The tooling is clean, the Google Search integration works, and the web UI for inspection is genuinely developer-friendly.

For a first-year student or someone new to agents: start here. You'll be running something real in under two hours.

For someone wanting to use the ADK 2.0 graph workflows specifically: give it another month or two. The alpha is progressing fast, but it's not ready to be the foundation of a tutorial you'll publish and stand behind.

The most interesting thing NEXT '26 signalled to me isn't any single announcement — it's the pattern. Google is betting that the future of cloud AI isn't individual models you call via API, but coordinated networks of specialist agents running on managed infrastructure. ADK is their framework for that future. Whether you agree with the bet or not, it's worth understanding how it works.

Get the Code

The full project is on GitHub: https://github.com/SimranShaikh20/Research-Assistant-Agent

Research-Assistant-Agent/
├── Research-Assistant-Agent/
│   ├── agent.py        ← all the agent logic
│   └── __init__.py
├── .env.example        ← copy this to .env, add your key
├── .gitignore
└── README.md

To run it yourself:

git clone https://github.com/SimranShaikh20/Research-Assistant-Agent
cd Research-Assistant-Agent
python -m venv venv
venv\Scripts\activate        # Windows
# source venv/bin/activate   # Mac/Linux
pip install google-adk
# copy .env.example to .env, add your Gemini API key
adk web research_agent/

Resources

Built for the Google Cloud NEXT '26 Writing Challenge on DEV Community. I'm a final-year BE Computer Science student at The Maharaja Sayajirao University of Baroda, where my major project is an AI-based defect detection system — so building agents like this is a bit of a departure from my usual ResNet-50 territory. Turned out to be worth the detour.

🌍 Earth's Last Letter

Simran Shaikh — Mon, 20 Apr 2026 04:02:23 +0000

The Planet Writes You a Personal Letter Using Real Climate Data + Gemini AI

This is a submission for Weekend Challenge: Earth Day Edition

What I Built

Earth's Last Letter is an AI-powered web app where Earth — the planet itself, 4.5 billion years old — writes you a deeply personal, poetic letter based on the city you grew up in and the year you were born.

You enter two things: your city and your birth year.

Earth remembers the rest.

Every letter is completely unique. The app pulls real historical CO₂ data from NOAA's Mauna Loa Observatory measurements, your exact city coordinates via Open-Meteo's geocoding API, and feeds all of it into a carefully engineered Google Gemini prompt that generates a 400-word letter written from Earth's perspective — not as a scientist, not as an activist, but as a grieving mother writing to a child she loves.

The letter tells you:

What your city's air smelled like the season you were born (CO₂ was measurably lower)
What Earth remembers about your childhood years in that specific place
One specific change happening near your city right now — not generic warming stats, but Mumbai's retreating monsoon patterns, London's disappearing frost days, Delhi's unprecedented heat islands
What your city might feel like in 2050 if the trajectory holds
One intimate, local action only someone from your city could take

It does not say "reduce your carbon footprint." It does not say "go green." It sounds like a letter left under a stone in a forest.

Here's a sample output for Mumbai, 1995:

Dear child of 1995,

You arrived in Mumbai during the first weeks of June, when the southwest monsoon was still two weeks away and the city held its breath in that particular amber heat only those who know her understand. The air carried 360 parts per million of carbon then — still too much, but enough that the Arabian Sea breeze coming off Marine Drive in the evenings felt clean in a way you may not remember but your lungs do...

...But I am still here. And so are you.

With all the time I have left,
Earth

The UI is built to match — dark navy background, aged parchment letter card, typewriter reveal animation, city postmark stamp, and rotating loading messages ("Earth is remembering your city...", "Searching through 4.5 billion years of memory...").

One click copies the letter. One click shares to X. Every interaction feels like touching something ancient.

Demo

🔗 Live App → earths-last-letter.netlify.app

Try it with:

Mumbai + 1995 — monsoon memories, Arabian Sea heat, coastal flooding
London + 1988 — disappearing frost days, Thames flooding risk
Delhi + 1990 — Yamuna river, unprecedented heat island, air quality history
New York + 2000 — hurricane patterns, Hudson River, coastal erosion

🔑 You'll need a free Gemini API key from aistudio.google.com — takes 30 seconds to get.

Code

SimranShaikh20 / Earths-Last-Lettter

🌍 Earth's Last Letter

"The planet writes you a personal letter — from your birth year to 2050."

What Is This?

Earth's Last Letter is an AI-powered web app where Earth — the planet itself — writes you a deeply personal, poetic letter based on the year you were born and the city you grew up in.

Every letter is unique. Every letter is grounded in real climate data. Every letter is written as if Earth is an ailing parent writing to a child it loves but is slowly losing the strength to sustain.

You enter your city and birth year. Earth remembers the rest.

✨ Features

🌿 Hyper-personalized letters — Every output is unique to your exact city + birth year combination. No two letters are the same.
📊 Real climate data — CO₂ levels at your birth year (Mauna Loa historical data), current CO₂ levels, temperature anomaly…

View on GitHub

The full repo includes:

Complete React + Vite frontend
Gemini API integration with structured prompt
Open-Meteo geocoding and climate data
Vercel deployment config

How I Built It

The Core Insight — Why This Approach

Every climate app I've seen shows you graphs. Numbers. Percentages. The problem isn't that people don't have information — it's that information doesn't move people. Stories do.

I wanted to build something that made climate data feel personal rather than abstract. The insight was simple: if you know exactly what the CO₂ level was the year someone was born, you can tell them something true and specific about how the air has changed in their own lifetime. That's not a statistic. That's their life.

Tech Stack

Layer	Technology
Frontend	React + Vite
Styling	Tailwind CSS
AI Generation	Google Gemini 2.0 Flash
Climate Data	Open-Meteo API (free)
CO₂ Data	NOAA Mauna Loa historical readings
Geocoding	Open-Meteo Geocoding API
Deployment	Vercel

Step 1 — Getting Real Data

I pull two real data sources before touching Gemini at all:

City coordinates via Open-Meteo's free geocoding API:

const res = await fetch(
  `https://geocoding-api.open-meteo.com/v1/search?name=${city}&count=1`
);
const data = await res.json();
const { latitude, longitude, name, country } = data.results[0];

Historical CO₂ levels from NOAA's Mauna Loa Observatory measurements, hardcoded from real published data:

const co2Map = {
  1950:310, 1955:313, 1960:317, 1965:320, 1970:325,
  1975:331, 1980:338, 1985:345, 1990:354, 1995:360,
  2000:369, 2005:379, 2010:389, 2015:400, 2020:412,
  2024:422, 2025:424
};
const birthCO2 = co2Map[Math.round(birthYear / 5) * 5];
const co2Rise = 424 - birthCO2;

For someone born in 1995 in Mumbai, the app now knows:

CO₂ at birth: 360 ppm
CO₂ today: 424 ppm
Rise in their lifetime: +64 ppm

That's real data. That goes into the prompt.

Step 2 — The Gemini Prompt (the hard part)

This is where most of the work is. Getting Gemini to write something that sounds like literature rather than an AI response required a very specific prompt structure.

The key decisions I made:

1. Strict paragraph structure with word counts
Instead of asking for "a letter," I specify exactly 6 paragraphs with word counts for each (60, 70, 80, 70, 60, 40 words). This produces consistent, well-paced output every time.

2. Banned vocabulary list
The prompt explicitly forbids: "carbon footprint", "going green", "save the planet", "climate change" (as a phrase), "sustainability". These words have been so overused in environmental messaging that they've become invisible. Banning them forces Gemini to find fresh, sensory language.

3. City-specific climate instructions
The prompt tells Gemini to reference ONE specific local phenomenon based on the city type — glaciers if mountain, sea level if coastal, heat island if major metro, monsoon patterns if South Asian city. This makes every letter feel locally grounded.

4. Voice constraint — the most important one

YOUR VOICE: You are not angry. You are not lecturing. 
You are a mother watching her child grow up while she grows sick. 
Ancient patience, deep love, quiet grief.

This single voice instruction transforms the output from an environmental essay into something that reads like correspondence.

Here's the Gemini API call:

const response = await fetch(
  `https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=${apiKey}`,
  {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      contents: [{ parts: [{ text: prompt }] }]
    })
  }
);
const letter = data.candidates[0].content.parts[0].text;

Step 3 — The UI

The design is intentionally counter to modern app aesthetics. While everyone builds clean, minimal, light-mode dashboards, this app goes dark, ancient, and warm.

The parchment card uses a warm cream background (#f5f0e8), dark brown text (#2c1810), Georgia serif font at 1.9 line height, and subtle box-shadow to create a paper texture without any actual images.

The typewriter animation reveals the letter character by character. This was a deliberate choice — it forces users to read rather than skim, and creates the feeling that Earth is writing to you in real time.

Loading states rotate through 4 messages every 2 seconds:

"Earth is remembering your city..."
"Searching through 4.5 billion years of memory..."
"Weaving real climate data into your letter..."
"Almost ready — Earth writes slowly, carefully..."

These aren't just UX filler. They're part of the narrative.

What I Learned

The biggest lesson: prompt engineering is product design. The quality of the letter is entirely a function of how precisely I constrained Gemini's output. Every word count, every banned phrase, every voice instruction translates directly into a better user experience. The AI isn't doing creative work — it's executing a very specific creative brief.

The second lesson: real data beats fake data every time. Knowing that someone born in 1990 breathed air with 354 ppm CO₂ — and that we're at 424 ppm today — makes the letter hit differently than if I'd just prompted Gemini to "say something emotional about climate change." The specificity is what creates the emotional resonance.

Prize Categories

🏆 Best Use of Google Gemini

Google Gemini 2.0 Flash is the core of this project — not as a chatbot, but as a structured narrative engine. The app feeds real API data (city coordinates, historical CO₂ levels, climate context) into a precisely engineered prompt that produces a consistently high-quality, emotionally grounded 400-word letter every single time.

The innovation isn't just using Gemini — it's the architecture around it: real data in → structured prompt → constrained creative output → literature-quality result. This is a meaningfully different use case than most AI integrations, which treat language models as question-answering systems. Here, Gemini is a writer following a very specific brief.

The prompt includes:

Paragraph-level word count constraints
Banned vocabulary list (forces fresh language)
City-specific climate instruction rules
Strict voice and tone specification
Mandatory real data integration in every paragraph

The result is that the app produces output that users genuinely share, screenshot, and send to family members — which is the real test of whether an AI generation is working.

Built over one weekend for Earth Day 2026. Every letter is different. Every letter is true.

Try yours at https://earth-last-letter.netlify.app/

🚀 FreelanceOS

Simran Shaikh — Sun, 08 Mar 2026 11:47:50 +0000

🚀 FreelanceOS — AI-Powered Operating System for Freelancers

What I Built

FreelanceOS is a complete AI-powered operating system for freelancers and solopreneurs, built entirely on Notion MCP + Google Gemini AI.

Freelancers waste 5–10 hours every week on admin work that doesn't pay — writing contracts, creating invoices, sending client update emails, and chasing unpaid payments. FreelanceOS eliminates all of that.

You type a few words. FreelanceOS generates a professional AI-written contract, invoice, or client email — and saves it directly into your Notion workspace automatically.

The Problem It Solves

Admin Task	Time Wasted Per Week
Writing freelance contracts	1–2 hours
Creating & formatting invoices	30–60 mins
Writing client update emails	20–30 mins
Tracking unpaid invoices	Hours per month
Managing clients & projects across tools	Daily friction

FreelanceOS collapses all of this into one AI-powered Notion workspace.

✨ Core Features

📊 AI Dashboard
Pulls live data from all 5 Notion databases and feeds it to Gemini AI, which analyzes your portfolio and gives you personalized business insights — total revenue potential, overdue projects, workload balance, and 3 actionable recommendations.

📄 AI Contract Generator
Enter client name, project description, budget, and deadline. FreelanceOS generates a complete professional freelance contract with scope, payment terms, revision policy, ownership rights, and termination clause — saved instantly to your Notion Contracts database.

🧾 AI Invoice Generator
Enter client name, amount, and work done. FreelanceOS generates a professional itemized invoice with payment instructions and due dates — saved to your Notion Invoices database as "Unpaid" and tracked automatically.

👥 Client & Project Management
Full CRUD operations on Clients and Projects — all stored and managed through Notion MCP.

🚪 Clean Exit

🗺️ System Architecture

User Input (CLI)
      │
      ▼
FreelanceOS (Python)
      │
      ├──▶ Google Gemini AI ──▶ AI-Generated Content
      │                               │
      └──▶ Notion MCP API ◀───────────┘
                │
                ▼
        Notion Workspace
    ┌──────────────────────┐
    │  Clients   Projects  │
    │  Invoices  Contracts │
    │  Expenses            │
    └──────────────────────┘

Show us the code

🔗 GitHub Repository: github.com/SimranShaikh20/FreelanceOS

Project Structure

freelance-os/
│
├── main.py                 ← Entry point
├── notion_helper.py        ← All Notion MCP API calls
├── ai_helper.py            ← All Gemini AI calls
├── requirements.txt
├── .env.example
│
└── features/
    ├── dashboard.py        ← AI-powered insights
    ├── clients.py          ← Client management
    ├── projects.py         ← Project tracking
    ├── contracts.py        ← AI contract generation
    ├── invoices.py         ← AI invoice generation
    └── emails.py           ← AI email generation

Key Code Snippets

AI Contract Generation:

def generate_contract(client_name, project_desc, budget, deadline):
    prompt = f"""
    Write a professional freelance contract:
    Client: {client_name}
    Project: {project_desc}
    Budget: ${budget}
    Deadline: {deadline}
    Include: scope, payment terms, revision policy,
    ownership rights, termination clause
    """
    return generate_text(prompt)

Saving to Notion MCP:

def add_contract(client_name, project_desc, budget, content):
    db_id = os.getenv("CONTRACTS_DB_ID")
    data = {
        "parent": {"database_id": db_id},
        "properties": {
            "Name": {"title": [{"text": {"content": f"Contract - {client_name}"}}]},
            "Client": {"rich_text": [{"text": {"content": client_name}}]},
            "Budget": {"number": float(budget)},
            "Content": {"rich_text": [{"text": {"content": content[:2000]}}]},
            "Status": {"multi_select": [{"name": "Draft"}]}
        }
    }
    result = notion_post("pages", data)

AI Dashboard Insights:

def generate_project_summary(projects):
    project_list = "\n".join([
        f"- {p['name']}: ${p['budget']} ({p['status']})" 
        for p in projects
    ])
    prompt = f"""
    Analyze these freelance projects:
    {project_list}
    Give: revenue potential, attention needed,
    workload assessment, 3 recommendations.
    """
    return generate_text(prompt)

How I Used Notion MCP

Notion MCP is not just a storage layer in FreelanceOS — it IS the operating system.

The Integration

FreelanceOS uses Notion MCP as its single source of truth across 5 databases:

Notion Database	What FreelanceOS Stores
Clients	Name, email, active/inactive status
Projects	Name, budget, deadline, progress status
Invoices	AI-generated invoice content, amount, paid/unpaid
Contracts	Full AI-generated contract text, draft/signed status
Expenses	Category, amount, date for tax tracking

What Notion MCP Unlocks

1. Real-time AI + Notion sync
Every AI-generated document (contract, invoice) is immediately written to the correct Notion database via the MCP API. No copy-paste. No manual entry.

2. Live business intelligence
The Dashboard pulls live data from all 5 Notion databases simultaneously, feeds it to Gemini AI, and returns intelligent business insights about your freelance operation — all in real time.

3. Persistent workflow memory
Because everything lives in Notion, your freelance OS remembers every client, project, invoice, and contract across sessions. Notion MCP turns a Python script into a stateful business operating system.

4. Human-in-the-loop control
Every AI-generated output is reviewed by the freelancer before saving to Notion. The human stays in control — AI handles the generation, Notion handles the storage, the freelancer makes the final call.

🛠️ Tech Stack

Notion MCP — Core workspace and data layer
Google Gemini 1.5 Flash — AI generation (free tier)
Python 3 — Application logic
Rich — Beautiful terminal UI
Requests — Notion API HTTP client

🚀 Try It Yourself

git clone https://github.com/SimranShaikh20/FreelanceOS
cd FreelanceOS
pip install -r requirements.txt
# Add your API keys to .env
python main.py

Full setup guide in the README.

Docling is a Game-Changer for RAG Systems

Simran Shaikh — Tue, 03 Feb 2026 09:47:45 +0000

Why Docling is a Game-Changer for RAG Systems: Moving Beyond Basic Text Extraction

In the rapidly evolving world of Retrieval-Augmented Generation (RAG), we're constantly seeking ways to improve accuracy and reliability. While traditional RAG systems have made great strides, they often stumble when faced with real-world enterprise documents—PDFs with complex layouts, financial reports packed with tables, or technical documentation spanning multiple formats.

Enter Docling: an open-source document processing library from IBM Research that's transforming how we handle documents in RAG pipelines. In this post, I'll explain what makes Docling special and why it might be the missing piece in your RAG architecture.

The Problem with Traditional RAG

Let's start by understanding what typically happens in a conventional RAG system when processing a document:

Load the document using a basic PDF or text extractor
Split the text into chunks (usually by character count)
Embed the chunks using your chosen model
Store in a vector database
Retrieve relevant chunks when queried
Generate answers using an LLM with the retrieved context

Sounds straightforward, right? The problem is step 2—the chunking strategy. Traditional approaches treat documents as plain text streams, splitting them arbitrarily based on character counts or token limits. This creates several critical issues:

Tables become gibberish. A beautifully formatted table showing quarterly revenue becomes: "Q1 Revenue 100M Q2 Revenue 150M Q3..." Good luck querying that accurately.

Context gets fragmented. Important information gets split mid-sentence, mid-paragraph, or worse—right in the middle of a crucial table or chart.

Structure is lost. Headers, sections, lists, figures—all the semantic structure that makes documents readable and meaningful gets stripped away.

Layout complexity fails. Multi-column layouts get read across columns instead of down them. Headers and footers pollute the content. Footnotes appear randomly in the text.

The result? A RAG system that works okay on simple text documents but frustrates users when dealing with the complex, structured documents that actually matter in enterprise settings.

How Docling Changes the Game

Docling takes a fundamentally different approach. Instead of treating documents as text streams, it understands them as structured objects with semantic meaning. Here's what that means in practice:

1. Structure-Aware Parsing

Docling doesn't just extract text—it identifies and labels every element in your document: headers, paragraphs, tables, lists, figures, captions. It understands the hierarchical relationships between sections and subsections. It preserves the reading order even in complex multi-column layouts.

Think of it as the difference between having someone read you individual words from a newspaper versus having them explain the article's structure, where each section fits, and how the information flows.

2. Intelligent Chunking

With Docling, chunks respect document structure. Instead of splitting every 500 characters, you can chunk by logical units:

Complete sections or subsections
Entire tables (preserved in their tabular format)
Full paragraphs with their associated headers
Lists with all their items intact

Each chunk becomes a semantically complete unit that makes sense on its own, rather than an arbitrary slice of text.

3. Rich Metadata

Every chunk Docling creates comes with valuable metadata:

Which section it belongs to (including the section hierarchy)
What page it's from
What type of content it is (heading, paragraph, table, list)
Its position in the document structure

This metadata enables powerful retrieval strategies. You can filter results to only search tables, prioritize content from executive summaries, or boost matches from specific sections.

4. Table and Structured Data Preservation

This is where Docling truly shines. Financial reports, technical specifications, comparison tables—all preserved in their original structure. When your RAG system retrieves a table, it gets the actual table, with rows and columns intact and queryable.

No more "What was Q2 revenue?" returning garbled text that might or might not contain the right number.

5. Multi-Format Consistency

Whether you're processing PDFs (including scanned documents with OCR), Word documents, PowerPoint presentations, images, or HTML, Docling provides consistent, high-quality extraction through a unified pipeline. One processing approach, reliable results across all formats.

Real-World Impact: The Numbers

Let me share some typical performance improvements when moving from traditional RAG to Docling-enhanced RAG:

Table query accuracy: 30% → 85%

Context preservation: 50% → 90%

Multi-column document handling: 35% → 88%

Structured data retrieval: 25% → 92%

Complex PDF processing: 40% → 87%

These aren't minor improvements—they're the difference between a RAG system users tolerate and one they actually rely on.

A Practical Example

Let's see this in action. Imagine you're building a RAG system for financial analysis. A user asks: "What was the year-over-year revenue growth in Q3?"

Traditional RAG might retrieve:

"...Q3 Rev 180M Product A Sales 50M Product B..."

The LLM has to guess at what Q2 was, what the previous year was, and hope it didn't miss relevant chunks.

Docling-enhanced RAG retrieves:

Text: [Structured table data]
| Quarter | 2024 Revenue | 2023 Revenue | YoY Growth |
|---------|--------------|--------------|------------|
| Q3      | $180M        | $165M        | +9.1%      |

Metadata:
  Section: "Financial Performance > Quarterly Results"
  Page: 8
  Type: Table
  Parent: "Financial Overview"

The LLM gets the complete table with clear structure, plus contextual metadata. The answer practically writes itself: "Year-over-year revenue growth in Q3 was 9.1%, increasing from $165M to $180M."

Implementation Strategy

Integrating Docling into your RAG pipeline is straightforward:

Install Docling and set up the document converter
Process your documents through Docling instead of basic text extraction
Export to your preferred format (markdown, JSON, or custom)
Implement structure-aware chunking using Docling's element boundaries
Enrich your embeddings with Docling's metadata
Store and retrieve using your existing vector database

The beauty is that Docling plugs into your existing RAG architecture—you're not rebuilding from scratch, just replacing the document processing layer.

When Docling Makes Sense (and When It Doesn't)

Docling is particularly valuable for:

Financial reports and statements with extensive tables and charts
Technical documentation with complex layouts and structured information
Research papers with equations, figures, and citations
Legal documents requiring precise section tracking
Enterprise document collections spanning multiple formats
Any scenario where structure and tables matter

You might not need Docling for:

Simple text-only documents (blog posts, novels, articles)
Clean markdown files without complex structure
Very short documents where chunking isn't critical
Use cases where document structure doesn't impact answers

The Trade-off

There's always a trade-off, and with Docling it's processing time. Converting documents through Docling's structural analysis takes longer than basic text extraction—sometimes 2-3x longer during the initial ingestion phase.

But here's the key insight: you pay this cost once during document processing, and you benefit from it with every single query thereafter. Spending an extra second processing a document to get 3x better retrieval accuracy on thousands of future queries is an easy trade-off.

Looking Forward: The Future of Document-Aware RAG

Docling represents a broader shift in how we think about RAG systems. We're moving from "search and generate" to "understand and retrieve." The next generation of RAG systems will be document-aware, structure-preserving, and semantically intelligent.

As RAG moves from proof-of-concept to production deployment in enterprise environments, the difference between basic text extraction and intelligent document processing becomes critical. Users don't just want answers—they want accurate, reliable, well-sourced answers. They want systems that understand the structure and meaning of their documents, not just the words.

Docling helps us build those systems.

Getting Started

Ready to try Docling in your RAG pipeline? Here are your next steps:

Check out the Docling GitHub repository for documentation and examples
Start with a small test set of your most problematic documents—the ones where traditional RAG fails
Compare retrieval quality before and after Docling integration
Measure the impact on your specific use cases and metrics

Final Thoughts

The promise of RAG is that we can give LLMs access to vast knowledge bases and get accurate, grounded answers. But that promise only holds if the retrieval part actually works—if we can find the right information and present it in a way the LLM can understand.

Docling doesn't just make retrieval better; it makes it fundamentally more aligned with how documents actually work. It respects their structure, preserves their meaning, and maintains their context. For anyone building serious RAG systems on real-world documents, that's not just an improvement—it's essential.

The question isn't whether to use document-aware processing in your RAG pipeline. It's whether you can afford not to.

Have you tried Docling in your RAG systems? What results have you seen? Share your experiences in the comments below.

GitHub Repository Intelligence Assistant

Simran Shaikh — Mon, 02 Feb 2026 03:03:01 +0000

What I Built

GitHub Repository Intelligence Assistant - A web application that helps developers understand any GitHub repository through AI-powered conversations and automated analysis.

The Problem

When developers encounter a new repository, they face several challenges:

⏰ Time-consuming exploration: Spending 2-3 hours reading code to understand structure
🤔 No context: Difficult to know where to start in large codebases
📚 Documentation gaps: Missing or outdated setup instructions
🔄 Repeated questions: Same questions asked by every new contributor
💻 Setup friction: Trial and error to get the project running

The Solution

This tool provides instant repository intelligence by:

🔍 Automatic Analysis: Fetches and analyzes GitHub repositories in seconds
💬 AI Conversations: Ask questions about code in natural language
⚡ Smart Answers: Get context-aware responses based on actual repository content
🏗️ Architecture Insights: Understand code structure without digging through files
📦 Dependency Detection: See what technologies and packages are used

How It Works

Enter any GitHub repository URL
App fetches repository structure via GitHub API
Analyzes and prioritizes important files (README, configs, source code)
Ask questions in chat interface
Get AI-powered answers using Claude API with repository context

Demo

🌐 Live Demo

Screenshots

1. Repository Input

Simple interface to enter any GitHub repository URL

2. Repository Analysis Dashboard

Shows repository stats, files analyzed, and key information

3. AI Chat Interface

Natural language conversations about the codebase

Test It With These Repositories:

https://github.com/facebook/react
https://github.com/vercel/next.js
https://github.com/django/django
https://github.com/SimranShaikh20/DevOps-Autopilot

My Experience with GitHub Copilot CLI

Building this project gave me hands-on experience with GitHub Copilot CLI's capabilities. Here's how it accelerated my development:

1. Project Setup & Boilerplate

Initial Setup:

gh copilot suggest "create React app with Vite and Tailwind CSS"

Result: Instead of manually setting up configurations, Copilot CLI provided exact commands:

npm create vite@latest repo-qa -- --template react
cd repo-qa
npm install -D tailwindcss postcss autoprefixer
npx tailwindcss init -p
npm install lucide-react

Time Saved: 20-30 minutes of setup and configuration

2. GitHub API Integration

Challenge: Fetch repository structure and file contents

Copilot CLI Prompt:

gh copilot suggest "write function to fetch GitHub repository tree recursively with error handling"

Generated Code: Complete implementation with:

API endpoint construction
Error handling for 404 and rate limits
Support for both 'main' and 'master' branches
Base64 decoding for file contents

Impact: Saved 1+ hour of reading GitHub API documentation

3. File Prioritization Logic

Challenge: Sort files by importance (README > configs > source code)

Copilot CLI Prompt:

gh copilot suggest "create prioritization function that ranks files by type with README highest priority"

Generated Solution:

const filePriority = (path) => {
  if (path.toLowerCase().includes('readme')) return 1000;
  if (path.endsWith('package.json')) return 900;
  if (path.endsWith('.py')) return 800;
  if (path.endsWith('.js') || path.endsWith('.jsx')) return 700;
  return 0;
};

Impact: Clean, efficient solution in minutes instead of iterating on logic

4. React Component Development

Copilot CLI Prompt:

gh copilot suggest "create glassmorphic card component with Tailwind CSS"

Result: Beautiful UI components with:

Backdrop blur effects
Gradient borders
Responsive design
Proper accessibility

Productivity: Built UI components 2x faster than manual coding

5. State Management

Challenge: Managing loading states, errors, and data flow

Copilot CLI Prompt:

gh copilot suggest "React component with useState for repo data, loading, error states"

Generated: Clean state management pattern with proper error boundaries

6. Debugging & Bug Fixes

Bug: Race condition when switching between repositories

Copilot CLI Prompt:

gh copilot suggest "fix race condition in React useEffect with cleanup"

Solution: Implemented abort controller pattern I wasn't familiar with

Time Saved: 30+ minutes of debugging

Productivity Comparison

Development Task	Traditional Approach	With Copilot CLI	Time Saved
Project Setup	45 min	10 min	78%
API Integration	2 hours	30 min	75%
UI Components	4 hours	2 hours	50%
State Management	1 hour	20 min	67%
Debugging	1.5 hours	30 min	67%
Total Development	~10 days	~7 days	30%

What Copilot CLI Helped Me Build

Features Built with Copilot CLI Assistance:

✅ GitHub API integration (90% generated)
✅ File fetching and parsing (85% generated)
✅ React component structure (70% generated)
✅ Error handling patterns (80% generated)
✅ UI styling with Tailwind (60% generated)
✅ State management logic (75% generated)

Estimated: ~65-70% of the codebase was written or enhanced with Copilot CLI

Key Learnings

1. Specific Prompts Get Better Results

❌ "Create a function"
✅ "Create async function to fetch GitHub repo with retry logic and error handling"

2. Iterate and Refine

Ask follow-up questions to improve generated code
Request alternative implementations

3. Learn from Generated Code

Studied patterns I wasn't familiar with (AbortController, proper async/await)
Discovered Tailwind utilities I didn't know existed

4. Time Distribution Changed

Less time on boilerplate and setup
More time on features and user experience
Better code quality overall

Best Copilot CLI Moments

🎯 Most Helpful: When it suggested the entire error handling pattern for API failures

💡 Biggest Learning: Proper React cleanup functions to prevent memory leaks

⚡ Biggest Time Save: Auto-generating the repository parsing logic

The Development Experience

Before Copilot CLI:

Constantly switching between editor, browser, and Stack Overflow
40% time spent looking up syntax and APIs
Manual boilerplate writing
Solo debugging with console.log

With Copilot CLI:

Stay in terminal and editor - better flow state
Instant answers to "how do I..." questions
Generated boilerplate in seconds
AI-assisted debugging with explanations

Tech Stack

Frontend:

⚛️ React 18 (UI framework)
⚡ Vite (build tool)
🎨 Tailwind CSS (styling)
🎭 Lucide React (icons)

APIs:

🐙 GitHub REST API (repository data)
🤖 Claude API - Anthropic (AI responses)

Tools:

🤖 GitHub Copilot CLI (development acceleration)
🚀 Vercel (deployment)

Installation

# Clone repository
git clone https://github.com/SimranShaikh20/RepoMindAI.git
cd RepoMindAI

# Install dependencies
npm install

# Set up environment variables
# Add your Anthropic API key to .env
VITE_ANTHROPIC_API_KEY=your_key_here

# Run development server
npm run dev

Future Improvements

[ ] Support for private repositories (GitHub OAuth)
[ ] Code search within repositories
[ ] Save favorite repositories
[ ] Export chat conversations
[ ] Compare multiple repositories
[ ] Browser extension version

Why This Matters

This project demonstrates real-world AI application in developer tools. By combining repository analysis with conversational AI, it solves the actual pain point developers face: quickly understanding unfamiliar codebases.

Built with GitHub Copilot CLI, this tool showcases how AI assistance can accelerate development while maintaining code quality.

Built with ❤️ and GitHub Copilot CLI

GitHubCopilotCLI #DevChallenge #AI #React #DeveloperTools

RAG & Semantic Search

Simran Shaikh — Fri, 30 Jan 2026 08:35:15 +0000

In the rapidly evolving world of AI and large language models, Retrieval-Augmented Generation (RAG) has emerged as a game-changing technique. If you're building AI applications that need to understand and search through your own data, this guide will walk you through every essential concept you need to know.

Introduction: Why RAG Matters

Traditional language models have a fundamental limitation: they only know what they were trained on. RAG solves this by teaching AI systems to retrieve and use external knowledge before generating answers. Think of it as giving your AI a library card instead of just relying on what it memorized in school.

Let's dive into the 20 core concepts that make RAG and semantic search work.

1. Embeddings: Teaching Machines to Understand Meaning

What are embeddings?

An embedding is a numerical representation of data—whether text, images, or audio—that preserves the underlying meaning. Instead of treating words as arbitrary symbols, embeddings capture their semantic relationships.

For example, the sentence "Neural networks learn patterns" might become:

[0.12, -0.45, 0.88, 0.34, -0.67, ...]

Why do we need them?

Computers don't inherently understand language. Embeddings bridge this gap by:

Enabling meaningful comparisons between pieces of text
Clustering similar concepts together
Powering semantic search capabilities

Types of embeddings:

Text embeddings: For documents, queries, and general text
Image embeddings: For visual content like diagrams and photos
Multimodal embeddings: Combining text and images (e.g., CLIP)

Popular models:

OpenAI's text-embedding-3-large
Open-source options like BGE, E5, and MiniLM
CLIP for image embeddings

2. Semantic Search: Beyond Keywords

The fundamental shift

Traditional keyword search looks for exact word matches. Semantic search understands meaning.

Example in action:

Query: "How does backpropagation work?"

A document containing "Gradient descent updates weights during neural network training" would be found by semantic search even though it shares no exact keywords with the query. This is the power of understanding meaning over matching words.

How it works:

Convert all documents into embeddings
Convert the user's query into an embedding
Compare the query vector with document vectors
Return the most semantically similar results

3. Vectors: The Language of AI

A vector is simply a list of numbers, like [0.32, -0.14, 0.88, ...]. Each dimension in this list captures a different aspect of meaning—think of them as coordinates in a multi-dimensional space of concepts.

When two vectors are close together in this space, their meanings are similar.

4. Vector Databases: Storage Built for Similarity

Why special databases?

Traditional databases excel at exact matches. Vector databases are optimized for a different question: "What's most similar to this?"

When you're dealing with millions of embeddings, you need specialized tools for fast similarity search.

Popular options:

Database	Best For
FAISS	Local development and research
Chroma	Simple applications and prototyping
Pinecone	Cloud-scale production systems
Qdrant	Open-source deployments

5. Similarity Metrics: Measuring Closeness

Cosine similarity is the most common metric for comparing embeddings:

similarity = (A · B) / (|A| × |B|)

The result ranges from -1 to 1:

1: Vectors point in the same direction (very similar)
0: Vectors are perpendicular (unrelated)
-1: Vectors point in opposite directions (opposite meanings)

6. Chunking: Breaking Down Documents

The challenge

Large language models have input limits. A 100-page manual won't fit in a single context window. The solution? Break it into smaller, digestible pieces.

Chunking strategies:

Method	Description	Use Case
Fixed-size	Every 500 tokens	Simple, consistent chunks
Sliding window	Overlapping segments	Preserves context at boundaries
Semantic	Split by topic/paragraph	Maintains logical coherence

Pro tip: Good chunking preserves complete thoughts. Splitting mid-sentence can harm retrieval quality.

7. Indexing: Speed Through Structure

The problem

Without indexing, finding similar vectors means comparing your query against every single document vector. With millions of documents, this becomes impossibly slow.

The solution

Indexing creates data structures that enable fast approximate nearest neighbor search.

Common index types:

HNSW (Hierarchical Navigable Small World): Fast and accurate
IVF (Inverted File Index): Good for large-scale datasets
Flat: Exact search, slower but 100% accurate

8. Reranking: Refinement for Precision

The two-stage approach

Vector search is fast but sometimes imprecise. Reranking adds a second, more careful analysis.

Process:

Vector database returns top 20 candidates
Reranker model scores these 20 more carefully
Return top 5 best matches

Tools for reranking:
Cross-encoder models that jointly analyze the query and each candidate document provide superior accuracy compared to the independent embeddings used in initial retrieval.

9. MMR: Avoiding Redundancy

Maximal Marginal Relevance solves a common problem: what if your top 5 results all say the same thing?

MMR balances two goals:

Relevance: Results should match the query
Diversity: Results shouldn't duplicate each other

This ensures users get comprehensive information, not repetitive answers.

10. Metadata Filtering: Adding Structure to Search

Sometimes semantic similarity isn't enough. You might want results from a specific source, time period, or category.

Example metadata:

{
  "content": "The compressor operates at 150 PSI...",
  "source": "technical_manual.pdf",
  "page": 12,
  "topic": "compressor",
  "date": "2024-01-15"
}

Filtered query: "Find information about compressors, but only from the technical manual"

This combines semantic search with structured filtering for more precise results.

11. Cross-Encoders vs. Bi-Encoders

Two architectures for comparison:

Type	How It Works	Speed	Accuracy
Bi-encoder	Encodes query and document separately	Fast	Good
Cross-encoder	Processes query and document together	Slow	Excellent

Usage pattern:

Use bi-encoders (standard embeddings) for initial retrieval
Use cross-encoders for reranking the top results

12. Hybrid Search: Best of Both Worlds

Pure semantic search has a weakness: it might miss exact technical terms or specific phrases.

Hybrid search combines:

Keyword search (BM25): Catches exact terms and rare phrases
Vector search: Understands meaning and context

Example: A query for "Python asyncio" benefits from:

Keyword search finding exact mentions of "asyncio"
Semantic search finding related concepts like "asynchronous programming"

13. Knowledge Graphs: Structured Relationships

While vectors capture similarity, knowledge graphs capture relationships.

Structure:

Nodes represent entities (concepts, people, things)
Edges represent relationships between them

Example:

Transformer --uses--> Self-Attention
Self-Attention --enables--> Parallel Processing

Applications:

Graph RAG for multi-hop reasoning
Scientific knowledge representation
Complex question answering

14. Prompts and Context: Controlling Generation

Context consists of the chunks retrieved from your knowledge base.

Prompts are instructions that tell the LLM how to use that context.

Example prompt:

Answer the following question using ONLY the context provided below. 
If the answer cannot be found in the context, say "I don't know."

Context: [retrieved chunks]

Question: [user query]

Well-crafted prompts are essential for preventing hallucinations and ensuring grounded responses.

15. Hallucination: The Challenge RAG Solves

The problem: Language models can generate plausible-sounding but entirely fabricated information.

RAG's solution:

Ground responses in retrieved documents
Include citations to sources
Use prompts that enforce context-only answers

RAG doesn't eliminate hallucinations entirely, but it dramatically reduces them by anchoring the model to factual sources.

16. Tokens: The Currency of Language Models

A token is roughly equivalent to a word fragment. The sentence "Artificial Intelligence is transforming technology" might be tokenized as:

["Art", "ificial", " Intelligence", " is", " transform", "ing", " technology"]

Why it matters:

LLMs have token limits (e.g., 128K tokens for GPT-4)
Token count affects both cost and performance
Understanding tokenization helps optimize chunk sizes

17. Temperature: Controlling Creativity

Temperature is a parameter that controls the randomness of model outputs:

Temperature	Behavior	Use Case
0.0	Deterministic, factual	RAG systems, factual Q&A
0.7	Balanced	General conversation
1.0+	Creative, varied	Creative writing

For RAG applications, lower temperatures (0-0.3) typically work best.

18. Top-k: How Many Results to Retrieve

The top_k parameter determines how many documents to retrieve from your vector database.

Considerations:

Too few (k=1-2): Risk missing relevant information
Too many (k=50+): Include noise, increase costs
Sweet spot: Often k=3-10 depending on your use case

Experiment to find the right balance for your application.

19. Evaluation Metrics: Measuring Success

How do you know if your RAG system is working well?

Key metrics:

Metric	What It Measures
Recall@k	Are the right documents in the top-k results?
MRR (Mean Reciprocal Rank)	How quickly do we find the first relevant result?
NDCG	Overall quality of the ranking
Answer relevance	Does the final answer address the question?
Faithfulness	Is the answer grounded in the retrieved context?

Regular evaluation ensures your system maintains quality as your knowledge base grows.

20. The RAG Pipeline: Putting It All Together

A complete RAG system follows this flow:

1. Ingestion Phase:

Load documents
Split into chunks
Generate embeddings
Store in vector database with metadata

2. Retrieval Phase:

User submits a query
Convert query to embedding
Search vector database
Apply metadata filters
Rerank results
Apply MMR for diversity

3. Generation Phase:

Construct prompt with retrieved context
Call LLM with controlled temperature
Generate response with citations
Return to user

Each step is crucial for building a system that's both accurate and reliable.

Conclusion: The Power of RAG

At its core, RAG and semantic search represent a fundamental shift in how we build AI applications. Instead of hoping a pre-trained model knows everything, we give it the ability to learn from our specific data in real-time.

The one-sentence summary:

RAG + Semantic Search = Teaching AI to read your data before answering

Whether you're building a customer support bot, a research assistant, or an internal knowledge management system, understanding these 20 concepts gives you the foundation to create intelligent, grounded, and reliable AI applications.

Next Steps

Ready to go deeper? Consider:

Building a simple RAG system using LangChain or LlamaIndex
Experimenting with different embedding models to see what works for your domain
Implementing evaluation metrics to measure and improve your system
Exploring advanced techniques like Graph RAG or multi-query retrieval

The field is evolving rapidly, but these fundamentals will serve you well no matter which direction it takes.

Have questions or want to share your RAG implementation experiences? Let's discuss in the comments below!

DEV Community: Simran Shaikh

BugWhisperer: How I Finally Finished My Abandoned GitHub Issue Analyzer (8 Months Later) with GitHub Copilot

My Submission

The Project I Abandoned — September 2025

Why I Finally Came Back

What I Built — The After

AI Analysis for Every Issue

Kanban Priority Board

AI Sprint Planner

Export to Markdown

Post Analysis to GitHub

How GitHub Copilot Made This Possible

Moment 1: Understanding My Own Broken Code

Moment 2: Reliable JSON from an LLM

Moment 3: The Kanban Board Component

Moment 4: The Sprint Planner Prompt

Technical Architecture

Tech Stack

Why Groq and Not OpenAI?

Why Cloudflare Workers?

The Before vs After Summary

What I Learned

Try It

What Is Next

I built an open-source AI agent that explains any ML model in plain English — real SHAP, real LIME, real bias detection

The problem I kept running into

What it does

What makes this genuinely agentic

Sample output

Why this matters beyond the challenge

Tech stack

Try it yourself

I Built an XAI Agent with Hermes Agent That Explains Any ML Model in Plain English — Here's Everything I Learned

The Conversation That Started This

What Is Hermes Agent, Actually?

The Problem I Was Solving (And Why It's Bigger Than You Think)

How I Used Hermes Agent to Build It

The Core Insight: Tools Are Not Functions

The Part That Surprised Me: Fallback Planning

The Part That Frustrated Me: Context Size

The Technical Deep Dive: SHAP + Hermes Agent

Why SHAP Alone Isn't Enough

The 3D SHAP Array Bug That Will Break Your Code

Hermes Agent vs. Other Agentic Frameworks

LangChain

AutoGen

CrewAI

My Honest Assessment

What I'd Build Next

The Bigger Picture: Why Open-Source Agents Matter

Try It Yourself

Questions I'd Love to Discuss

VibeSafe

What I Built

How it works

Privacy first

Demo

Quick walkthrough:

Code

Tech Stack

Project Structure

How I Used Gemma 4

Model chosen: Gemma 4 31B Dense (google/gemma-4-31b-it)

Why this matters beyond the hackathon

I Built a Tool That Proves Your Code Is Yours — Here's What Gemma 4 Made Possible

The Problem Nobody Is Talking About

What I Built: VibeSafe

Why Gemma 4 Specifically

The 262K Context Window Changes the Analysis

31B Dense vs the MoE Model

Reasoning Mode for Authorship Detection

What Running VibeSafe on Itself Taught Me

What This Means for Developers Right Now

The Bigger Picture

Try It

DukanBot: I Flipped OpenClaw Inside-Out to Run WhatsApp for 12 Million Kirana Stores

What I Built

How I Used OpenClaw

The Architecture

The Webhook Payload

Model chosen: Gemma 4 31B Dense (`google/gemma-4-31b-it`)