DEV Community: Muhammad Asim Hanif

From Chatbot to Agent: What Hermes Agent Taught Me About Building Real AI Workflows

Muhammad Asim Hanif — Sat, 23 May 2026 20:26:33 +0000

This is a submission for the Hermes Agent Challenge: Write About Hermes Agent

Introduction

Most people start using AI by building a chatbot.

A user types a question, the model gives an answer, and the application displays that answer on the screen.

That is useful, but it is also limited.

When I started exploring Hermes Agent, the biggest shift in my thinking was this:

A real AI agent should not only answer. It should plan, use tools, follow steps, recover from failure, and produce a structured result.

That idea changed how I think about AI applications.

Instead of treating the language model as the entire application, Hermes Agent encourages a better architecture where the model becomes part of a larger workflow.

The agent can decide what needs to happen, call tools, pass information between steps, and create outputs that are easier to trust and debug.

This post is about what I learned while exploring Hermes Agent, why agentic workflows are different from normal chatbots, and how developers can start thinking in terms of tools, pipelines, and explainable AI systems.

The Problem With Simple AI Wrappers

A lot of AI applications today are basically wrappers around a prompt.

The frontend sends user input to an LLM, the LLM responds, and the answer is shown to the user.

That design is simple, but it has some problems:

The model is expected to do everything at once.
The workflow is hidden inside one large prompt.
It is difficult to debug where something went wrong.
There is no clear separation between tasks.
The output can be inconsistent.
It is hard to add fallback logic.
It is hard to prove what steps the system followed.

For small use cases, a simple chatbot may be enough.

But for serious applications, especially those involving documents, research, data processing, automation, or decision support, we need something more structured.

That is where an agentic approach becomes useful.

What Makes Hermes Agent Different

Hermes Agent is interesting because it pushes developers toward building workflow-based AI systems.

Instead of asking the model to solve everything in one response, you can break the problem into smaller steps.

For example, instead of this:

User input → LLM → Final answer

An agentic workflow looks more like this:

This structure feels closer to how real software systems work.

Each step has a responsibility.
Each tool has a clear job.
Each output becomes input for the next stage.

That makes the system easier to understand, test, and improve.

Thinking in Tools Instead of Prompts

One of the most important lessons I learned from Hermes Agent is that good agent design starts with tools.

A tool can be anything the agent uses to complete a task:

A document reader
A database lookup
A search function
A calculator
An OCR engine
A parser
A code executor
A file generator
A summarizer
A notification system
A custom API

The key idea is that the LLM should not do every job by itself.

For example, if the task involves extracting structured values from a document, we should not only rely on the LLM guessing from raw text. A better design is:

Document reader → Parser → Validator → LLM explanation

This gives each component a clear role.

The parser extracts values.
The validator checks them.
The LLM explains them.
The agent controls the overall flow.

That is much stronger than putting everything into one giant prompt.

A Simple Agentic Pipeline Example

A basic Hermes-style agentic pipeline can be imagined like this:

This pattern can be used in many projects.

For example:

Research Assistant

Question → Search sources → Extract facts → Compare sources → Summarize answer

Resume Analyzer

Resume upload → Extract skills → Match job description → Find gaps → Suggest improvements

Finance Assistant

Transaction data → Categorize spending → Detect anomalies → Generate budget advice

Medical Report Explainer

Report upload → Extract values → Check ranges → Retrieve context → Explain results

The domain changes, but the agentic structure remains similar.

That is what makes Hermes Agent useful: it gives developers a way to build repeatable, multi-step intelligence into their applications.

Why Planning Matters

Planning is one of the biggest differences between a chatbot and an agent.

A chatbot responds directly.

An agent thinks in steps.

For example, when solving a task, the agent may need to ask:

What information do I need?
Which tool should I use first?
What should I do if this step fails?
Is the result complete?
Do I need another tool call?
How should I present the final answer?

This kind of structure is very important when the task is complex.

Without planning, the model may jump directly to an answer.

With planning, the system can follow a controlled workflow.

That gives the developer more power because the process is not random. It is designed.

Why Tool Calling Matters

Tool calling is what makes an agent useful in the real world.

An LLM has language ability, but tools give it action.

With tools, an agent can:

Read files
Query databases
Analyze data
Retrieve knowledge
Call APIs
Generate reports
Create structured outputs
Trigger follow-up actions

This is where Hermes Agent becomes more than a text generator.

It becomes a coordinator.

The agent does not replace traditional software. It connects traditional software components with AI reasoning.

That combination is powerful.

Why Audit Logs Are Important

One thing I believe every serious agentic system should have is logging.

When an agent runs a multi-step workflow, the user or developer should be able to see what happened.

A simple audit log might show:

[INFO] Agent started
[INFO] Tool 1 called
[INFO] Data extracted
[INFO] Tool 2 called
[INFO] Validation completed
[INFO] Explanation generated
[INFO] Final output returned

This is important for debugging.

But it is also important for trust.

If the system produces an output, users should have some visibility into how that output was created.

This is especially important for sensitive areas like healthcare, education, finance, or legal support.

Agentic systems should not feel like magic black boxes. They should feel like transparent workflows.

Self-Correction Makes Agents More Useful

Another important idea in agentic design is self-correction.

Real-world inputs are messy.

Documents may be badly formatted.
Images may be unclear.
APIs may fail.
Data may be incomplete.
A parser may miss something.

A simple chatbot may fail silently or produce a weak answer.

A better agent can detect failure and try another strategy.

For example:

This makes the system more robust.

It also makes the user experience better because the agent can handle imperfect inputs instead of stopping immediately.

Hermes Agent Encourages Better Software Architecture

The most valuable part of Hermes Agent, in my opinion, is not only the AI capability.

It is the architecture mindset.

It encourages developers to separate responsibilities:

Agent = workflow controller
Tools = task executors
LLM = reasoning and language layer
Database = knowledge storage
Frontend = user interaction layer
Logs = transparency layer

This separation makes projects cleaner.

Instead of mixing everything into one prompt or one file, the application becomes modular.

That means it is easier to:

Add new tools
Replace one component
Debug errors
Improve accuracy
Test individual steps
Explain how the system works

This is how AI applications should be built when they move beyond demos.

Chatbot vs Agent

Here is a simple comparison:

Normal Chatbot	Hermes-style Agentic System
Gives direct answers	Follows a workflow
Mostly prompt-based	Tool-based and modular
Hard to debug	Easier to trace with logs
Usually one-step	Multi-step reasoning
Limited action	Can call tools and APIs
Output may be unstructured	Can return structured results
Weak failure handling	Can use fallback strategies

This does not mean chatbots are useless.

Chatbots are great for conversation.

But when the task requires process, tools, validation, and reliability, an agentic system is a better fit.

Where Hermes Agent Can Be Useful

Hermes Agent can be useful in many areas, such as:

1. Document Intelligence

Reading PDFs, extracting key information, summarizing documents, and generating reports.

2. Research Workflows

Searching sources, comparing information, summarizing findings, and creating citations.

3. Developer Tools

Automating code review, testing, documentation, and debugging workflows.

4. Healthcare Support

Helping users understand reports, medical documents, or health instructions with proper disclaimers.

5. Education

Creating study plans, quizzes, explanations, and progress tracking.

6. Business Automation

Processing invoices, generating emails, classifying tickets, and creating summaries.

7. Local AI Systems

Running private workflows on local infrastructure where data privacy matters.

The common pattern is the same:

Input → Tools → Reasoning → Output

Practical Advice for Developers

If you are starting with Hermes Agent, I would suggest not beginning with a huge project.

Start with a small workflow.

For example:

Upload file → Extract text → Summarize → Generate action items

Then slowly add more tools.

A good starting approach is:

Define the final output first.
Break the task into steps.
Create one tool for each step.
Log every tool call.
Add fallback logic.
Keep the output structured.
Test with messy real-world input.

The goal is not to make the agent look complex.

The goal is to make the workflow reliable.

A Good Agent Is Not Just an LLM

One of my biggest takeaways is this:

A good agent is not just an LLM. A good agent is an LLM connected to tools, memory, rules, validation, and a clear workflow.

The LLM is powerful, but it should not be responsible for everything.

When we combine the LLM with traditional software engineering, the result becomes much better.

That is where agentic systems become practical.

What Open Agentic Systems Mean for Developers

Open agentic systems like Hermes Agent are important because they give developers more control.

Instead of depending only on closed platforms or black-box automation, developers can build systems that run on their own infrastructure and match their own requirements.

This matters for:

Privacy
Customization
Local deployment
Domain-specific tools
Transparent workflows
Developer ownership

For many real-world applications, especially in sensitive domains, control is important.

Developers should be able to decide:

Which tools are used?
Where does the data go?
How is the workflow executed?
What happens if a step fails?
How is the final answer generated?

Hermes Agent fits into that direction.

It gives developers a way to build AI systems that are not only smart, but also structured and explainable.

Final Thoughts

Exploring Hermes Agent helped me understand the difference between building an AI feature and building an AI workflow.

A chatbot can answer.

An agent can act.

A chatbot can respond.

An agent can plan, use tools, recover from failure, and produce structured results.

That is the real value of agentic systems.

For developers, Hermes Agent is a good opportunity to think beyond prompts and start building AI applications like real software systems:

modular, testable, explainable, and useful

The future of AI development will not only be about better prompts.

It will be about better workflows.

And that is exactly where Hermes Agent becomes exciting.

MedReport Agent — AI-Powered Medical Report Interpreter

Muhammad Asim Hanif — Sat, 23 May 2026 20:16:03 +0000

This is a submission for the Hermes Agent Challenge: Build With Hermes Agent

What I Built

I built MedReport Agent, an AI-powered medical report interpreter that helps patients understand their lab reports in simple and clear language.

Many patients receive medical reports such as blood tests, liver function tests, kidney function tests, thyroid reports, and other lab results, but they often cannot understand the medical terms, abbreviations, values, and reference ranges written inside those reports.

MedReport Agent solves this problem by allowing users to upload a medical report as a PDF or image. The system then reads the report, extracts medical values, detects abnormal results, compares them with reference ranges, and generates easy-to-understand explanations in both English and Urdu.

The goal of this project is not to replace doctors. Instead, it helps patients understand their reports better and prepare useful questions before visiting a healthcare professional.

The system provides:

Medical report upload support
OCR-based text extraction
Automatic report type detection
Lab value extraction
Reference range comparison
Abnormal value highlighting
English and Urdu explanations
Doctor questions generation
Clear next-step guidance
Agent audit logs
Privacy-focused local processing

Demo

Demo Video:

Screenshots:
1. Upload screen

2. Agent processing/progress screen

3. Final results dashboard

4. Agent audit logs section

Code

GitHub Repository:
https://github.com/codedbyasim/MedReport

The project structure includes:

MedReport/
├── backend/
│   ├── main.py
│   ├── hermes_agent.py
│   ├── medreport_skill.yaml
│   ├── database.py
│   ├── ocr_processor.py
│   ├── llm_client.py
│   ├── chroma_kb.py
│   ├── requirements.txt
│   └── Dockerfile
│
├── frontend/
│   ├── src/
│   │   ├── App.tsx
│   │   ├── index.css
│   │   └── main.tsx
│   ├── package.json
│   └── Dockerfile
│
├── skills/
│   └── medical/medreport-interpreter/
│       └── SKILL.md
│
├── Model/
│   └── qwen2.5-1.5b-instruct-q4_k_m.gguf
│
├── docker-compose.yml
├── README.md
├── DOCUMENTATION.md
├── hackathon_evaluation.md
├── LICENSE
└── SRS.txt

My Tech Stack

Layer	Technology
Agent Workflow	Hermes Agent
Backend	Python, FastAPI, Uvicorn
Frontend	React, TypeScript, Vite
OCR	PyMuPDF, EasyOCR
Local LLM	Qwen2.5 GGUF
LLM Runtime	llama-cpp-python
Knowledge Retrieval	ChromaDB
Styling	CSS, responsive dashboard UI
Deployment	Docker, Docker Compose
License	MIT

How I Used Hermes Agent

Hermes Agent is the core of MedReport Agent. I used it to build a real multi-step agentic workflow instead of a simple chatbot-style application.

The agent controls the complete medical report analysis pipeline from upload to final explanation.

The workflow follows these steps:

Each step is handled as a separate tool or module. This makes the system more reliable, easier to debug, and closer to a real agentic application.

Agentic Capabilities Used

I used Hermes Agent for:

Planning: The agent follows a structured medical report analysis workflow.
Tool use: Each stage of the pipeline is handled by a dedicated tool.
Multi-step reasoning: The system connects OCR output, parsed values, reference ranges, and retrieved knowledge to generate the final explanation.
Self-correction: If normal parsing fails, the agent can use an LLM-based fallback parsing strategy.
Audit logging: Every major tool call is logged so the workflow remains transparent.
Skill-based configuration: The workflow is defined using a Hermes skill configuration file.

Main Features

Medical Report Upload

Users can upload a PDF or image of their medical report through the web dashboard.

OCR Processing

The backend extracts text from uploaded reports. Digital PDFs are handled using PDF text extraction, while scanned images can be processed using OCR.

Report Type Detection

The system identifies the type of medical report, such as:

CBC
LFT
RFT
Lipid profile
Thyroid profile
Glucose or diabetes-related reports

Lab Value Extraction

The parser extracts medical test names, values, and units from the report text.

Reference Range Comparison

Extracted values are compared with stored reference ranges to determine whether a result is normal, low, high, or critical.

Bilingual Explanation

The system generates patient-friendly explanations in:

English
Urdu

This makes the project more useful for local users who may not be comfortable with English medical terminology.

Doctor Questions

The agent generates useful questions that the patient can ask their doctor during consultation.

Agent Audit Logs

The dashboard shows the workflow logs so users and developers can understand what the agent did at each step.

Why This Project Is Useful

Medical reports are often difficult for normal users to understand. A patient may see values such as hemoglobin, WBC, platelets, ALT, AST, bilirubin, creatinine, glucose, TSH, or HbA1c, but may not know what they mean.

MedReport Agent converts this complex medical information into simple explanations.

This can help patients:

Understand their report better
Identify abnormal values
Ask better questions to doctors
Reduce confusion caused by medical jargon
Access explanations in their local language

The project is especially useful in regions where medical literacy is low and where patients may not always get enough time with doctors to discuss every value in detail.

What Makes It Different

MedReport Agent is different from a normal chatbot because it does not depend on one single prompt.

Instead, it uses a complete agentic pipeline:

This makes the output more structured and transparent.

The project also focuses on:

Local processing
Privacy
Urdu support
Medical report understanding
Transparent agent workflow
Practical healthcare use case

Challenges I Faced

One challenge was handling different medical report formats because labs write test names and values in different ways.

Another challenge was extracting useful text from both digital PDFs and scanned images.

It was also important to design the system in a way that gives helpful explanations without pretending to be a doctor.

To solve these challenges, I used a modular pipeline where each step has a clear responsibility.

How to Run Locally

git clone https://github.com/codedbyasim/MedReport
cd MedReport
docker-compose up --build

What I Learned

While building this project, I learned that agentic systems are most useful when they are connected with real tools and real workflows.

Hermes Agent helped me design the project as a proper pipeline instead of a basic AI wrapper.

I learned about:

Agentic workflow design
OCR integration
Tool-based architecture
Local LLM usage
Retrieval-based medical explanation
Error handling and fallback strategies
User-centered healthcare UI design

Future Improvements

In the future, I would like to add:

Support for more medical test categories
Better handwritten report recognition
PDF export of final explanation
Voice explanation in Urdu
Mobile app version
Patient history comparison
Doctor-side dashboard
More local languages such as Punjabi, Sindhi, and Pashto

Final Thoughts

MedReport Agent shows how Hermes Agent can power a practical and useful real-world application.

The project combines OCR, local LLMs, medical reference ranges, retrieval-based knowledge, bilingual explanation, and an agentic workflow into one complete system.

It is designed to help patients understand their reports better and approach doctors with more confidence.

Disclaimer: MedReport Agent is not a replacement for professional medical advice, diagnosis, or treatment. It is an educational assistant that helps users understand medical reports in simple language. Users should always consult a qualified medical professional for healthcare decisions.

CodePulse AI — Reviving an AI-Powered Repository Intelligence Platform

Muhammad Asim Hanif — Sat, 23 May 2026 10:15:54 +0000

This is a submission for the GitHub Finish-Up-A-Thon Challenge

What I Built

CodePulse AI is an AI-powered repository intelligence platform that analyzes GitHub repositories and transforms complex codebases into understandable architectural insights.

The platform automatically:

Generates architecture and class diagrams
Detects dependency relationships
Performs security and code quality analysis
Maps blast radius impact across repositories
Identifies technical debt in legacy systems
Explains repository structure using AI

Originally, this project started as an unfinished experimental repository analyzer powered by IBM Watsonx.ai. The initial version lacked polish, had unstable analysis flows, incomplete UX, and limited architectural visualization.

For the GitHub Finish-Up-A-Thon, I completely revived the project by:

migrating the entire AI stack from IBM Watsonx.ai to Gemini 2.5 Flash
redesigning the UI into a modern AI developer platform
adding Blast Radius Analysis
rebuilding repository visualization workflows
improving analysis generation and loading flows
polishing the developer experience end-to-end

The final result became a production-style engineering intelligence platform designed for developers working with large or unfamiliar codebases.

Demo

GitHub Repository

https://github.com/codedbyasim/codepulse-ai

Video Walkthrough

Before vs After

Before → Initial Unfinished Prototype

The original version of CodePulse AI started as an experimental AI-powered repository analyzer. While the foundation existed, the platform lacked visual polish, modern UX, stable AI workflows, and advanced engineering intelligence features.

The initial prototype:

used IBM Watsonx.ai for inference
had incomplete repository analysis flows
lacked polished architecture visualization
had minimal dependency mapping
had static and unfinished UI components
lacked blast radius prediction
had limited developer experience optimization

Before Screenshots

1. Original Landing Page

2. Initial Analyze Repository Interface

3. Initial Legacy Code Analysis Page

4. Original About Page

5. Basic Repository Visualization

6. Initial Loading & Analysis Workflow

After → Revived & Fully Polished Platform

During the GitHub Finish-Up-A-Thon, I completely revived and transformed CodePulse AI into a production-style AI-powered engineering intelligence platform.

The platform now features:

Gemini 2.5 Flash integration
Blast Radius dependency analysis
Interactive repository intelligence
Modern SaaS-inspired UI
Animated dependency graph previews
Security & code quality analysis
Improved loading and analysis flows
Advanced architecture visualization
Responsive developer-focused UX

Major Improvements

AI Stack Migration

One of the biggest upgrades was migrating the entire AI inference layer from IBM Watsonx.ai to Gemini 2.5 Flash.

This migration included:

rebuilding the backend proxy layer
refactoring request/response handling
converting payloads to OpenAI-compatible chat completion format
fixing malformed JSON parsing issues
redesigning Gemini fallback handling
updating environment configuration and model management

UI/UX Redesign

The frontend was completely redesigned into a modern AI SaaS-style experience inspired by:

GitHub
Linear
Vercel
Cursor

New additions included:

animated dependency graph previews
futuristic grid backgrounds
improved typography
polished loading states
responsive layouts
glassmorphism-inspired UI
dark mode refinement

Blast Radius Analysis

One of the biggest new features was Blast Radius Analysis.

This system:

maps repository dependency relationships
visualizes affected nodes
predicts propagation impact across services
helps developers understand change risk before deployment

Repository Intelligence

The platform now provides:

architecture diagrams
dependency insights
security analysis
tech stack detection
repository exploration
AI-generated documentation
legacy code archaeology

After Screenshots

1. Redesigned Hero Section

2. Modern AI Repository Dashboard

3. Blast Radius Analysis Visualization

4. Advanced Dependency Mapping

5. Improved Loading & Analysis Flow

6. AI-Powered Repository Intelligence

7. After About Page

Transformation Summary

Before	After
Static prototype	Production-style AI platform
IBM Watsonx.ai	Gemini 2.5 Flash
Minimal UI	Modern SaaS experience
Basic repository analysis	Advanced repository intelligence
No dependency prediction	Blast Radius Analysis
Incomplete UX	Fully polished workflows
Static components	Animated developer-focused interface
Experimental project	Revived engineering intelligence platform

The Comeback Story

CodePulse AI originally began as an unfinished side project focused on AI-assisted repository understanding. While the core idea was strong, the platform was incomplete and lacked a polished user experience.

The original system:

used IBM Watsonx.ai for inference
had unstable response parsing
lacked proper architecture visualization
had static UI components
had incomplete analysis workflows
did not clearly communicate repository impact analysis

During the Finish-Up-A-Thon, I decided to fully revive the project and transform it into a polished developer intelligence platform.

The project evolved from a rough experimental prototype into a fully redesigned engineering intelligence platform capable of:

dependency analysis
blast radius prediction
AI-powered architecture understanding
repository exploration
security insights
modern developer-focused UX

My Experience with GitHub Copilot

GitHub Copilot became my pair programmer throughout the revival process.

I used Copilot extensively for:

refactoring React + TypeScript components
redesigning Tailwind layouts
generating animation logic
debugging Gemini integration issues
restructuring API payload handling
improving loading workflows
accelerating UI polishing
rebuilding analysis components

Copilot was especially helpful while:

migrating from IBM Watsonx.ai to Gemini 2.5 Flash
implementing animated dependency graph previews
refactoring the backend inference layer
improving frontend responsiveness and styling

Instead of generating the entire project automatically, Copilot acted as a collaborative engineering assistant that helped speed up iteration, experimentation, debugging, and polishing.

Tech Stack

Frontend

React
TypeScript
Tailwind CSS
Framer Motion
Mermaid.js

Backend

Node.js
Express.js

AI

Gemini 2.5 Flash
AIML API

Features

Repository Analysis
Blast Radius Visualization
Security Insights
Dependency Mapping
AI Documentation Generation
Legacy Code Archaeology

Transformation Summary

Before	After
Static prototype	Production-style AI platform
IBM Watsonx.ai	Gemini 2.5 Flash
Minimal UI	Modern SaaS experience
Basic analysis	Advanced repository intelligence
No dependency prediction	Blast Radius Analysis
Incomplete UX	Fully polished workflows
Simple architecture diagrams	Interactive engineering visualization
Experimental project	Revived developer platform

What I Learned

This project taught me:

how to refactor and revive unfinished software
how to migrate AI inference providers
how to build production-style developer tooling
how to design modern SaaS interfaces
how to improve architecture visualization
how to work alongside AI-assisted development tools effectively

Most importantly, this challenge helped me finally finish and polish a project that had previously been left incomplete.

Built for the GitHub Finish-Up-A-Thon 🚀

From Assistants to Agents: My Take on Google I/O 2026

Muhammad Asim Hanif — Fri, 22 May 2026 19:40:21 +0000

This is a submission for the Google I/O Writing Challenge

From Assistants to Agents: My Take on Google I/O 2026

Google I/O 2026 was the moment Google fully embraced agentic AI. Rather than showing incremental improvements, this year’s announcements reframed Gemini as an ecosystem of models, tools and platforms designed to act on our behalf.

In this post I’ll unpack the key releases, highlight some exceptional projects from Google’s Gemini Live Agent Challenge, and share my perspective on what these advances mean for developers.

The Evolution of Gemini: Omni, Flash 3.5 and Spark

Gemini 3.5 Flash

Gemini 3.5 Flash represents a major leap in performance and efficiency.

Google built it as a high-throughput model capable of handling long-horizon reasoning, planning and agentic workflows much faster than previous generations.

What stood out to me most was that Google focused less on “AI hype” and more on practical developer productivity.

This model is designed for:

Fast reasoning
Tool usage
Long context understanding
Agent orchestration
Real-time interactions

For developers, this matters because modern AI systems are no longer just chatbots. They are becoming autonomous systems capable of executing workflows.

Gemini Omni

Gemini Omni was one of the most impressive announcements.

It combines:

Video generation
Physical world understanding
Image editing
Audio interactions
Realistic scene creation

The ability to generate and edit multimodal content from prompts feels like Google entering full-stack creative AI territory.

This also signals that future applications will not rely only on text interfaces anymore.

AI is becoming visual, interactive and context-aware.

Gemini Spark

Gemini Spark may be the clearest preview of where AI is heading.

Spark acts like a persistent personal AI agent that can:

Read emails
Summarize conversations
Schedule appointments
Monitor tasks
Automate workflows

Unlike traditional assistants, Spark is designed to proactively help users rather than waiting for commands.

This changes the role of AI from “tool” to “digital operator.”

AI Search Is Becoming Agentic

Google Search also underwent a massive transformation.

The new AI-powered search experience introduces:

Persistent information agents
Cross-modal search
Continuous monitoring
Personalized summaries

Instead of manually searching repeatedly, users can now ask AI agents to monitor topics continuously.

For example:

“Watch for Chromium security updates”
“Track flights from Islamabad to Dubai”
“Monitor GPU price drops”

This turns search into an active system instead of a passive query engine.

Antigravity 2.0 and Developer Ecosystem

One of the most underrated announcements was Antigravity 2.0.

Google is clearly preparing infrastructure for multi-agent applications.

Antigravity introduces:

Long-running agent sessions
Sub-agent orchestration
Async task execution
Agent SDKs
Terminal-based AI workflows

This feels like the beginning of operating systems designed specifically for AI agents.

As developers, we may soon build applications where dozens of AI agents collaborate simultaneously.

Gemini Live Agent Challenge Winners

One of my favorite parts of Google I/O 2026 was seeing real-world projects from developers.

These projects proved that agentic AI is not theoretical anymore.

Category	Project	What It Does
Grand Prize	ORION	Surgical AI copilot for robotic surgery
Best Live Agent	drone-copilot	Voice-controlled drone assistant
Best Storytelling	Sankofa	AI heritage storyteller
Best UI Navigator	Moonwalk	Voice-controlled desktop AI
Best Multimodal	Wand	Gesture + voice browser agent
Best Innovation	Rayan Memory	3D memory palace AI
Best Technical Execution	JohnKeats.AI	Emotional conversational companion

What impressed me most was the consistent design pattern across all winners:

Persistent sessions
Tool calling
Multimodal reasoning
Streaming interactions
Memory systems
Safety layers

This is clearly becoming the standard architecture for next-generation AI systems.

What This Means for Developers

Google I/O 2026 changed how developers should think about AI systems.

Previously:

AI answered questions

Now:

AI plans
AI remembers
AI acts
AI monitors
AI collaborates

That shift is huge.

Developers now need to focus on:

State management
Long-running sessions
Safety verification
Tool interfaces
Agent collaboration
Ethical safeguards

Prompt engineering alone is no longer enough.

We are entering the era of AI system engineering.

My Biggest Takeaway

The biggest realization I had after watching Google I/O 2026 is this:

AI is no longer becoming a feature inside applications.

Applications themselves are becoming AI-native.

The interface, logic, workflows and automation layers are all merging together into intelligent systems.

That is both exciting and slightly terrifying.

One Concern: Hype vs Reality

While the demos looked impressive, real-world deployment will still be difficult.

Challenges like:

Latency
Reliability
Memory consistency
Hallucinations
Safety verification
Tool failures

remain major problems.

Building truly reliable AI agents is significantly harder than creating impressive demos.

I think the next few years will determine whether agentic AI becomes genuinely useful or simply another hype cycle.

Final Thoughts

Google I/O 2026 felt like a turning point.

This year was not about slightly better chatbots.

It was about creating autonomous AI ecosystems capable of reasoning, planning and acting independently.

Gemini, Spark, Omni and Antigravity together show that Google is betting heavily on an agentic future.

For developers, this creates massive opportunities.

But it also creates massive responsibility.

Because once software begins acting on behalf of humans, trust becomes more important than ever.

Helpful Resources

Thanks for reading 🚀

The Edge AI Revolution: Why Gemma 4 E4B is a Game-Changer for Offline Multimodality

Muhammad Asim Hanif — Fri, 22 May 2026 18:57:11 +0000

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

The Cloud is Great, But the Edge is Essential

When we talk about the future of AI, the conversation almost always drifts toward massive data centers, hundreds of gigabytes of VRAM, and cloud APIs. But what happens when the cloud isn't there?

In real-world crises—like the catastrophic floods that frequently hit South Asia—power grids fail and internet connectivity vanishes. In these critical moments, an API key is useless. This is exactly where the true potential of open-source, edge-optimized models comes into play.

With the release of Gemma 4, Google didn't just give us a capable open model; they gave us the Gemma 4 E4B (4B parameter) variant. After spending time building offline systems with it, I believe this specific model is a massive paradigm shift for edge computing. Here is a technical breakdown of why Gemma 4 E4B is quietly revolutionizing local AI.

1. Native Multimodality vs. The "Frankenstein" Pipeline

Before Gemma 4, building a multimodal offline system meant chaining together multiple different models. If you wanted to process a victim's voice note and a photo from a disaster zone on a local laptop, your pipeline looked like this:

Audio to Text: Run OpenAI's Whisper (requires its own memory footprint).
Vision to Text: Run LLaVA or Moondream to generate image descriptions.
Text to Action: Feed all those text strings into an LLM for reasoning.

This "Frankenstein" approach is a nightmare for edge devices. Context switching between models destroys VRAM efficiency, spikes latency, and drains laptop batteries.

The Gemma 4 E4B Solution:
Gemma 4 E4B introduces native multimodality at the edge. It doesn't rely on external transcription or OCR hacks. Through Ollama, you can pass an audio file, an image, and a text prompt in a single /api/chat request.

The model's native audio and vision encoders process the raw data directly into its context window. This single-forward-pass architecture drops latency from over 15 seconds (in chained pipelines) to sub-5 seconds on a modest 4GB VRAM GPU.

2. Agentic Tool Calling... Offline!

One of the most impressive features of the Gemma 4 family is its advanced reasoning and tool-calling capabilities. While we expect this from 100B+ parameter models, seeing it in a 4B model running on a local machine is staggering.

In my experience integrating Gemma 4 into an offline command center, the model isn't just generating text—it's taking actions. You can define Python tools (e.g., dispatch_rescue_team(location, priority)) and Gemma 4 will reliably format JSON arguments to execute those functions.

Because it operates within a 128K context window, you can inject local RAG (Retrieval-Augmented Generation) data—like NDMA or WHO protocols—directly into the prompt. Gemma 4 will read the offline documents, analyze a photo of a flooded area, and accurately call a backend function to dispatch a rescue boat. No internet required.

3. The Power of "Small" Dense Models

We often get caught up in the parameter wars, but the Gemma 4 E4B dense model proves that architecture and training data quality trump raw size.

By packaging advanced reasoning, multimodality, and tool-calling into a 4B effective parameter footprint, developers can deploy sophisticated AI on:

Consumer-grade laptops in remote disaster zones.
Raspberry Pi 5s for localized IoT networks.
Mobile devices operating entirely off-grid.

Conclusion: Building for Global Resilience

The release of Gemma 4 forces developers to ask a new question: "Does this app actually need the internet?" For years, we've built AI applications that assume perfect connectivity. But the most impactful use cases for AI—disaster response, remote healthcare, and off-grid education—exist in places where connectivity is a luxury.

Gemma 4 E4B proves that we don't need to sacrifice intelligence to achieve true offline capability. The future of AI isn't just in the cloud; it's decentralized, local, and running right at the edge where it's needed most.

When Networks Fail, SARA Stands Up: Offline Flood Rescue with Gemma 4 E4B

Muhammad Asim Hanif — Fri, 22 May 2026 18:50:31 +0000

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

During major floods—like the catastrophic 2022 Pakistan Floods that displaced over 33 million people—mobile towers lose power and internet services collapse. This creates a critical communication blackout where stranded victims cannot signal for help, and rescue teams deploy boats, helicopters, and medical assets based on guesswork.

SARA (Safety And Rescue Assistant) is a 100% offline-first, local emergency command center. Deployed on a single coordinator laptop alongside a simple Wi-Fi hotspot, it creates a private local network—no internet required.

The SARA End-to-End Rescue Flow

SARA simplifies disaster coordination into a seamless, offline process:

Flood victims connect to the hotspot (SARA-HELP) and access SARA’s intake form using their mobile browser—no app installation needed.

Victims can submit emergency details via text, photo evidence (water depth, injuries), or a recorded voice message.

Coordinators manage resources through a live-updating Glassmorphic Admin Dashboard equipped with offline Leaflet maps, live WebSocket streams, and RAG-integrated medical protocols.

Demo

Here is the walkthrough of SARA's offline system deployment, victim-side emergency reporting form, and real-time dashboard triage updates:

Code

The complete codebase, configurations, and deployment steps are fully open-source and available on GitHub:

👉 GitHub Repository: SARA Offline Rescue

How I Used Gemma 4

At the center of SARA is Google's Gemma 4 Edge-optimized family (gemma4:e4b / 4B) running locally on the coordinator laptop via Ollama.

Gemma 4 powers SARA in three major ways:

1. Intentional Model Selection: Why Gemma 4 E4B?

Disaster response centers operate on battery backups or portable generators. I needed a highly capable model that could run locally on consumer-grade laptop CPUs/GPUs without needing a connection to cloud servers. Gemma 4 4B fits comfortably within under 8GB VRAM, delivering stable, sub-5-second local inferences in the field.

2. Native Multimodality & Offline RAG Integration

Stranded victims report emergencies under high stress. Gemma 4's native multimodal capabilities allow me to process multiple modalities in a single pipeline without context switching.

Audio: Local voice messages are transcribed seamlessly.
Vision: Photo uploads are evaluated directly by the model to detect water depth, trapped individuals, or visible injuries.
Offline RAG: The system searches local manuals from WHO and the National Disaster Management Authority (NDMA) Pakistan using local nomic-embed-text embeddings, injecting critical first-aid instructions into Gemma's prompt.
Bilingual Generation: Gemma 4 acts as a translation engine, analyzing English/Roman Urdu inputs and writing simple, reassuring Urdu summary updates for the victim.

3. Agentic Task Coordination via Tool Calling

SARA provides a natural language command box for rescue coordinators. When a coordinator types "Are there any available rescue boats?" or "Dispatch helicopter to case #3", Gemma 4 maps the query to custom Python tools (dispatch_rescue_team, get_resource_status, etc.) via Ollama's native tool calling. It updates the SQLite database, triggers WebSocket alerts, and returns structured confirmation text—all fully offline.

Google Finally Answered the Question Nobody Was Asking Out Loud

Muhammad Asim Hanif — Sun, 26 Apr 2026 16:51:16 +0000

There's a thing that happens at big tech conferences. You sit through an hour of polished demos, applause lines, and customer success stories, and somewhere in the middle of it all, a single slide quietly destroys a problem you'd been working around for months.

That happened to me while watching Google Cloud NEXT '26.

The announcement wasn't the flashiest one. It wasn't the 8th-gen TPUs (though those are genuinely wild). It wasn't Gemini 3.1 Pro. It was something called the Agent2Agent protocol — and if you've spent any time trying to build real multi-agent systems, you probably just sat up a little straighter.

The problem everyone's been ignoring

Let me back up.

For the past year or so, the developer narrative around AI agents has been: "build your agent, make it smart, deploy it." And tools have gotten genuinely good at that part. But there's a messy reality underneath the demos — what happens when your agent needs to talk to another agent?

Not your agent calling a REST API. Not your agent hitting a database. Your agent needing to hand off a task to a completely different agent, built by a different team, running on a different platform, with different internal logic.

Right now, that looks like a lot of custom glue code. HTTP calls with manually agreed-upon schemas. Hoping the other team's agent returns something predictable. Debugging failures that could be anywhere in a chain of three or four systems.

I've been in that situation. It's not fun. And nobody's really been talking about it as a protocol problem — it's been treated as an integration problem you just solve case by case.

Google's answer at NEXT '26: stop solving it case by case.

What A2A actually is

The Agent2Agent (A2A) protocol is an open standard for agent-to-agent communication. The idea is straightforward — give agents a common language for handing off tasks, sharing context, and reporting status, regardless of what platform they're built on.

Here's what struck me about it: A2A isn't a Google-only thing. It's already built into LangGraph, CrewAI, LlamaIndex, Semantic Kernel, and AutoGen. The Agent Development Kit (ADK) hit stable v1.0 across Python, Go, and Java with TypeScript available too. This isn't a vendor lock-in play disguised as an open standard — or at least, it's not only that.

The practical picture they painted: a Salesforce agent built on Agentforce hands off a task to a Google agent on Vertex AI (now "Gemini Enterprise Agent Platform"), which queries a ServiceNow agent for IT asset data — all through A2A, without any of the three systems needing to understand each other's internals. No custom schema negotiation. No fragile adapter layers.

If that actually works as advertised in production, it changes the economics of multi-agent system design pretty dramatically.

The part that's easy to miss

What I think is genuinely underrated in the NEXT '26 announcements is the security layer sitting underneath all of this.

A2A without trust guarantees is just chaos at scale. If agents can call each other freely, you need to know which agent called what, with what permissions, and be able to audit the whole chain.

Google's answer is Agent Identity — every agent gets a unique cryptographic ID. Agent Gateway handles traffic control between agents and data. Model Armor adds runtime protection against prompt injection and tool poisoning.

These aren't afterthoughts bolted on. According to the docs, they're baked into the Agent Platform from the ground up, which means if you build on it, you get that traceability by default rather than having to engineer it yourself.

I'll be honest — I was skeptical when I read "secure-by-design" in the keynote. That phrase gets used a lot. But the architecture around Agent Identity is specific enough that it reads less like marketing and more like a genuine engineering decision. Cryptographic IDs per agent. Audit logging through Cloud IAM. Centralized observability.

Whether it holds up when you actually try to build something complex on it — that's a different question. But the intent is at least coherent.

Let's actually try it — ADK in under 5 minutes

This is where I'll stop summarizing announcements and show you something concrete. If you want to form your own opinion, the fastest way is to run something.

Install the ADK:

pip install google-adk

Here's a minimal multi-agent setup — a coordinator that delegates to two specialized sub-agents. This is the exact pattern A2A is designed to scale across platforms:

from google.adk.agents import LlmAgent

# A specialized agent that only fetches data
data_agent = LlmAgent(
    name="data_fetcher",
    model="gemini-2.5-flash",
    instruction="""You are a data retrieval specialist.
    When given a topic, return a concise structured summary of relevant facts.
    Keep responses under 150 words."""
)

# A specialized agent that only writes summaries
writer_agent = LlmAgent(
    name="report_writer",
    model="gemini-2.5-flash",
    instruction="""You are a technical writer.
    Take raw data points and turn them into a clean, readable paragraph.
    Avoid jargon. Write for a developer audience."""
)

# Coordinator that routes between them
coordinator = LlmAgent(
    name="coordinator",
    model="gemini-2.5-flash",
    description="I coordinate data fetching and report writing tasks.",
    instruction="""You manage a small team of agents.
    For any research request: first delegate to data_fetcher, 
    then pass those results to report_writer for a clean output.
    Do not do either task yourself.""",
    sub_agents=[data_agent, writer_agent]
)

Run it from your terminal:

adk run .

Or spin up the dev UI to see the full agent trace visually:

adk web

The dev UI is actually one of the underrated parts — you get a real-time view of which sub-agent handled what, what it returned, and how long each step took. That kind of observability is what's been missing from most agent frameworks.

What's notable here is that data_agent and writer_agent could each be running on entirely different infrastructure — or even built by different teams using different frameworks — and with A2A, the coordinator would still hand off tasks the same way. That's the point.

What this actually means for developers

Let me be concrete about what changes if A2A gains real adoption:

Building a pipeline of specialized agents becomes viable. Right now, chaining agents usually means one team owns the whole chain. With A2A, you could have a data-fetching agent from one team, a reasoning agent from another, and a summarization agent from a third — all interoperating without a massive integration project.

The ADK is worth actually looking at now. It's model-agnostic, deployable to any container or Kubernetes environment, and optimized for Gemini but not exclusive to it. The v1.0 stable release across multiple languages means this is past the "experimental" phase.

Agent simulation before you ship. The new Agent Simulation tool lets you stress-test agents against real-world scenarios before deployment. I'm more interested in this than most of the headline features because it addresses one of the most painful parts of agent development — you genuinely don't know how your agent behaves until something weird happens in production.

My honest take

Google's keynote framing was "the era of the pilot is over, the era of the agent is here." I think that's a little optimistic. Most teams I know are still figuring out how to make a single reliable agent, let alone orchestrating fleets of them.

But the infrastructure they're building at NEXT '26 — particularly A2A and the identity/governance layer — is the right bet. The bottleneck in multi-agent systems isn't model intelligence anymore. It's interoperability and trust. And those are fundamentally protocol and infrastructure problems.

The Danfoss example they shared (80% of email-based order processing automated, response times cut from 42 hours to near real-time) and Suzano (95% reduction in query time for natural-language SQL) suggest at least some organizations are past the pilot stage. But enterprise manufacturers and large corporates are a different environment than most of us are building in.

The question for the average developer isn't "is Google's agentic vision compelling." It is. The question is whether A2A becomes a genuine standard or a Google-flavored standard that only really works well in Google's ecosystem. That's determined by adoption, not announcement.

Worth watching. Worth experimenting with. The ADK is free to try, Agent Platform gives $300 in credits, and the A2A spec is open.

That's enough to form your own opinion, which is always better than taking mine.

Further reading: