This is a submission for the Hermes Agent Challenge
Let me be honest with you before we start.
I went into this expecting to write a clean "look how cool this AI is" post. You know the type. Polished. Slightly breathless. Ends with "the future is here."
That is not what happened.
What happened was messier, more interesting, and honestly kind of unsettling. So let me just walk you through it.
First — What Even Is Hermes Agent?
Because when I first heard the name I thought it was another LangChain wrapper with a good logo. It's not.
Hermes Agent is an open-source autonomous AI agent framework built by Nous Research, released in February 2026 under the MIT license. In roughly three months it crossed 100,000 GitHub stars (Repository), making it one of the fastest-growing open-source AI projects ever. That number alone made me pay attention.
But here's what actually makes it different from everything else out there right now.
Most AI tools you use today are stateless. You open a chat, you ask something, you close it. Tomorrow you come back and it remembers nothing. It's Groundhog Day but for your productivity.
Hermes doesn't work like that.
Hermes lives on your server — a $5 VPS, your laptop, a serverless backend, whatever you have. It runs persistently. It builds a three-layer memory system as it works: short-term conversation context, medium-term session summaries, long-term skill documents that capture how it solved specific problems. It doesn't just complete tasks. It learns from completing them.
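I haven't read Hermes' actual schema, so treat this as a mental model rather than documentation, but "three-layer memory in a local SQLite database" maps onto something roughly like this (table and column names are mine, purely illustrative):

```python
import sqlite3

# Illustrative sketch only: these tables are my own mental model of a
# three-layer memory store, not Hermes Agent's actual schema.
conn = sqlite3.connect("agent_memory.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS short_term (        -- current conversation context
    id INTEGER PRIMARY KEY,
    role TEXT,                                 -- 'user' or 'agent'
    content TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS session_summaries ( -- medium-term: one row per session
    id INTEGER PRIMARY KEY,
    session_id TEXT,
    summary TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS skills (            -- long-term: reusable skill documents
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE,
    document TEXT,                             -- how the agent solved a class of problem
    times_used INTEGER DEFAULT 0
);
""")
conn.commit()
```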
The core mechanism behind this is called GEPA — an ICLR 2026 Oral-accepted self-improvement loop. Every time the agent completes roughly 15 tasks, it reviews its own performance, identifies patterns, and writes new Skill Documents. Agents with 20+ self-generated skills complete similar future tasks 40% faster than fresh instances. That's not a marketing claim. That's a benchmarked number from TokenMix.ai independent testing.
It supports 200+ LLMs through OpenRouter. It connects to Telegram, Discord, Slack, WhatsApp, Signal, and your terminal from a single gateway. And all your data — memories, skills, conversation history — lives in a local SQLite database on your own machine. No telemetry. No cloud lock-in.
In other words: it's the first agent that actually compounds.
Now. Let's talk about what happened when I put it through five tasks designed to break it.
Task 1: Aggregate Real-Time Data From Multiple Sources Simultaneously
The task: Pull today's top 5 tech news stories, summarize each one in under 50 words, rank them by relevance to full-stack developers, and format it as a clean daily briefing. Do it automatically every morning at 8am.
Why it's "impossible": Multi-source aggregation + LLM summarization + relevance ranking + automated scheduling requires at least three separate tools working in sync. Most agent setups fall apart coordinating even two.
What actually happened:
It worked. And that was the first moment I got that slightly uncomfortable feeling.
The cron scheduling part especially. In Hermes, you set scheduled tasks in natural language — "every morning at 8am, pull tech news and brief me" — and it handles the cron job internally. No YAML. No crontab entries. You just describe what you want and it figures out the execution.
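To make the "no crontab" point concrete: somewhere underneath, the natural-language schedule still has to become a real trigger. A toy version of that translation, my own sketch rather than Hermes' parser, looks like:

```python
import re

# Toy sketch of mapping a natural-language schedule to a cron expression.
# This is my illustration of the idea, not Hermes Agent's actual parser.
def to_cron(phrase: str) -> str:
    match = re.search(r"every (morning|day) at (\d{1,2})(am|pm)", phrase.lower())
    if not match:
        raise ValueError(f"Can't parse schedule: {phrase!r}")
    hour = int(match.group(2)) % 12
    if match.group(3) == "pm":
        hour += 12
    return f"0 {hour} * * *"   # minute hour day month weekday

print(to_cron("every morning at 8am, pull tech news and brief me"))  # "0 8 * * *"
```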
The ranking was interesting. It didn't just sort by publish time. It actually weighted results based on the tools and frameworks mentioned — things like Next.js, Supabase, TypeScript, Rust were flagged as relevant. Things like enterprise SaaS funding rounds got deprioritized. I did not explicitly tell it to do this. It inferred developer relevance from the task context.
Is that "intelligence"? I genuinely don't know. But it saved me from reading a funding article for a B2B CRM no one cares about.
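My best guess is that the weighting amounts to keyword scoring against a developer-oriented vocabulary. This is purely my reconstruction of the behavior, not the agent's actual logic:

```python
# My reconstruction of developer-relevance ranking, not Hermes' internals.
DEV_SIGNALS = {"next.js": 3, "supabase": 3, "typescript": 2, "rust": 2, "open source": 2}
NOISE_SIGNALS = {"funding round": -2, "series b": -2, "enterprise saas": -1}

def relevance(story: dict) -> int:
    text = (story["title"] + " " + story["summary"]).lower()
    return sum(w for kw, w in {**DEV_SIGNALS, **NOISE_SIGNALS}.items() if kw in text)

stories = [
    {"title": "Next.js 16 ships partial prerendering", "summary": "..."},
    {"title": "B2B CRM startup raises Series B funding round", "summary": "..."},
]
ranked = sorted(stories, key=relevance, reverse=True)  # dev-relevant stories float up
```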
Verdict: Passed. The daily briefing has been running for a week now. It's genuinely useful.
Task 2: Automate a Multi-Step Development Workflow End-to-End
The task: Given a GitHub repository URL, do all of the following without me touching anything: read the README, identify what the project does, write a structured code review checklist based on the tech stack detected, and push a summary as a GitHub issue.
Why it's "impossible": This requires reading external files, comprehending code, generating structured output, and writing back to a third-party service. That's four distinct operations with a failure point at each handoff.
What actually happened:
Two out of four steps were clean. Reading the README and detecting the tech stack — solid. It correctly identified a Next.js + Supabase + Tailwind project just from the README and package.json reference.
The code review checklist was decent but generic. It knew the stack but the checklist read like it was pulled from a "React best practices" article from 2023. Not wrong. Just not deep. There was nothing about Supabase RLS policies, nothing about edge function cold starts, nothing stack-specific that a senior dev would actually flag.
The GitHub issue push worked when given the right token permissions. When I gave it an insufficient-scope token, it failed silently instead of telling me what scope it needed. That was annoying.
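That failure mode is avoidable, too. For classic personal access tokens, GitHub's REST API reports the granted scopes in the X-OAuth-Scopes response header, so a defensive wrapper can check scopes up front and surface the real problem instead of going quiet. A minimal sketch of what I'd want, not Hermes' code:

```python
import requests

def create_issue(owner: str, repo: str, title: str, body: str, token: str) -> dict:
    """Create a GitHub issue, but check token scopes first and fail loudly."""
    headers = {"Authorization": f"token {token}", "Accept": "application/vnd.github+json"}

    # Classic PATs report their granted scopes in the X-OAuth-Scopes header.
    probe = requests.get("https://api.github.com/user", headers=headers)
    probe.raise_for_status()
    scopes = {s.strip() for s in probe.headers.get("X-OAuth-Scopes", "").split(",") if s.strip()}
    if not scopes & {"repo", "public_repo"}:
        raise PermissionError(
            f"Token is missing 'repo' (or 'public_repo') scope; it has: {scopes or 'none'}"
        )

    resp = requests.post(
        f"https://api.github.com/repos/{owner}/{repo}/issues",
        headers=headers,
        json={"title": title, "body": body},
    )
    resp.raise_for_status()   # surfaces 403/404 instead of failing silently
    return resp.json()
```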
Verdict: Partially passed. The automation scaffolding works. The depth of reasoning is shallow on complex domain knowledge. This is a real limitation and I'd rather be straight about it than pretend otherwise.
Task 3: Make a Decision Under Complexity and Uncertainty
The task: I gave it a genuine decision I've been sitting on — choosing between two different backend architectures for a side project (Supabase-first serverless vs a dedicated Node/Express server + PostgreSQL). I gave it my constraints: solo developer, limited time, need for auth + realtime + storage, cost-sensitive, deployed on Vercel. I asked it to make a recommendation with reasoning.
Why it's "impossible": Real decisions have tradeoffs, missing information, and no clean right answer. I wanted to see if it would reason through ambiguity or just give me a confident-sounding nothing answer.
What actually happened:
This is the one that genuinely surprised me.
It didn't just recommend one option. It built a decision matrix. It listed factors I hadn't mentioned — like "as a solo developer, onboarding cognitive load matters; Supabase's managed auth reduces the number of systems you need to reason about under deadline pressure." It flagged that Node/Express gives more control but that control has a cost when you're the only one maintaining it at 2am.
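If you've never built one, a decision matrix is just factors, weights, and scores. A stripped-down version of that shape, with weights and numbers that are mine rather than the agent's, looks like:

```python
# Illustrative weighted decision matrix: the factors mirror the constraints I gave,
# but the weights and scores here are my own, not the agent's actual numbers.
factors = {                       # (weight, Supabase-first score, Node/Express score), 0-5
    "solo-dev cognitive load": (0.30, 5, 2),
    "built-in auth/realtime/storage": (0.25, 5, 2),
    "cost at low traffic": (0.20, 4, 3),
    "control over business logic": (0.15, 3, 5),
    "Vercel deployment fit": (0.10, 5, 3),
}

supabase = sum(w * a for w, a, _ in factors.values())
node = sum(w * b for w, _, b in factors.values())
print(f"Supabase-first: {supabase:.2f}  |  Node/Express: {node:.2f}")
```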
The recommendation it landed on was Supabase-first with a specific caveat: avoid complex business logic in Edge Functions because cold starts compound when you chain them. Keep the critical path simple.
That caveat is correct. And I hadn't mentioned Edge Functions once.
I'm still not sure what to make of that.
Verdict: Passed. This is the task I expected it to fail worst at. It didn't.
Task 4: Self-Generate a New Skill From a Novel Workflow
The task: Ask it to do something it has never done before — specifically, analyze a CSV of student grade data, identify students at risk of failing (below certain thresholds across multiple subjects), and generate a personalized intervention note for each one. Then turn that workflow into a reusable skill it can apply to future CSV uploads automatically.
Why it's "impossible": Skill self-generation is the core claim of Hermes. I wanted to stress-test it against a workflow that doesn't exist in its default 118-skill library.
What actually happened:
The CSV analysis worked fine. Basic pandas-style operations under the hood, nothing shocking there.
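For reference, the core of that analysis is a few lines of pandas. This is my sketch of the workflow, with column names and a threshold I invented for illustration:

```python
import pandas as pd

# My sketch of the at-risk analysis. Column names and the 50-point threshold
# are assumptions for illustration, not the actual file or the agent's code.
THRESHOLD = 50
SUBJECTS = ["math", "science", "english"]

grades = pd.read_csv("grades.csv")                 # one row per student
below = grades[SUBJECTS].lt(THRESHOLD)             # True where a subject is failing
at_risk = grades[below.sum(axis=1) >= 2]           # below threshold in 2+ subjects

for _, row in at_risk.iterrows():
    failing = [s for s in SUBJECTS if row[s] < THRESHOLD]
    print(f"{row['name']}: below {THRESHOLD} in {', '.join(failing)}, flag for intervention")
```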
The "personalized" intervention notes were... okay. They were structurally correct — each note addressed the specific subjects where the student was below threshold. But they were cold. "Student X is showing below-average performance in Mathematics and Science. Recommend additional support sessions." That's technically an intervention note. It's also the kind of note a tired administrator writes at the end of a long Friday. No teacher would actually send it.
The skill generation part, though? That worked exactly as advertised. After completing the task, it wrote a Skill Document called something like "at-risk-student-csv-analyzer" and indexed it. When I uploaded a second, different CSV the next day and asked it to "do the analysis thing you did before," it retrieved the skill, adapted it to the new column structure, and ran the workflow without needing my re-explanation.
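To be clear about what a Skill Document amounts to conceptually, here is my own illustration of the shape of the idea, not the file Hermes actually wrote:

```python
# My own illustration of what a reusable skill document could look like conceptually.
# This is not the file Hermes generated, just the shape of the idea.
skill = {
    "name": "at-risk-student-csv-analyzer",
    "trigger": "user uploads a CSV of per-student grades and asks for at-risk analysis",
    "steps": [
        "infer which columns are subjects and which identify the student",
        "flag students below the failing threshold in two or more subjects",
        "draft one intervention note per flagged student",
    ],
    "learned_adaptations": [
        "column names vary between uploads; match by content, not by exact header",
    ],
}
```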
That's the compounding effect in real action. And it's genuinely different from anything I've used before.
Verdict: Passed on infrastructure, mixed on quality. The skill loop is real. The output depth depends on how much context you give upfront.
Task 5: Handle a Multi-Turn Workflow That Changes Midway
The task: Start a content planning workflow. Halfway through — after it's already begun — change the brief completely. Go from "write a content calendar for a developer tools startup" to "actually, this is for a personal finance app, let me redo the audience." Watch whether it gracefully adapts or collapses.
Why it's "impossible": Mid-stream context shifts break most AI tools. They either ignore the update and keep going, or reset completely and lose all prior progress.
What actually happened:
It adapted. Mostly.
When I interrupted and changed the target audience, it acknowledged the shift, flagged which parts of what it had already generated were still salvageable (basic calendar structure, posting cadence, format) and which parts needed regeneration (topic ideas, tone, example posts). It didn't start over from scratch. It didn't ignore me. It basically said, in its way: okay, here's what I'm keeping, here's what I'm rebuilding, confirm?
That's a more sophisticated response than most human collaboration tools give you.
The place it slipped: one of the regenerated topic ideas still referenced developer tools in the framing. Subtle. If I hadn't been watching for it I'd have missed it. But it's the kind of context bleed that shows the memory management isn't perfect yet.
Verdict: Passed with caveats. The recovery behavior is impressive. The context bleed is real and worth watching in production.
So What's the Actual Verdict on Hermes Agent?
Here's the honest summary.
Hermes Agent is the most interesting open-source agent framework of 2026 — not because it's perfect, but because its architecture is the right bet. The self-improving skill loop, the three-layer memory, the GEPA mechanism — these are the right answers to the right problems. Stateless AI is a ceiling. Compounding AI is a direction.
The gaps are real though. Output quality is heavily context-dependent. The shallow-domain problem on complex workflows (the code review checklist, the cold intervention notes) is a real limitation. Silent failures on misconfiguration — like the GitHub token scope issue — need better error communication.
But it's MIT licensed. It runs on a $5 VPS. Your data stays on your machine. And it's actively evolving at a release cadence that looks less like a hobby project and more like a well-funded lab that knows what it's building. v0.10.0 shipped 16 April 2026 with 118 skills and a closed learning loop. The pace is aggressive.
The benchmark that stuck with me: agents with 20+ self-generated skills complete similar future research tasks 40% faster than fresh instances. That is compounding intelligence in measurable form. Not philosophy. Not a demo. A number.
Is it ready for production? For solo developers and small teams building non-critical workflows — yes, today. For enterprise-grade production with audit requirements — not yet. But it's closer than anything else in the open-source space.
A Question I Can't Stop Thinking About
When the agent made that call about Supabase Edge Functions — something I never mentioned — was that reasoning, pattern matching, or just a lucky inference from my constraints?
I've been turning that over for a few days and I don't have a clean answer.
What I do know is this: the gap between "useful tool" and "autonomous collaborator" is narrowing faster than I expected. And Hermes is one of the clearest signals of that.
What would you give Hermes Agent as a sixth impossible task? Genuinely curious what breaks it in your domain. Drop it in the comments.
And if you've already been using it — what's the limitation that surprised you most? Because I suspect I've only scratched the surface of where this gets weird.
You can find me across the web here:
- ✍️ Read more on Medium: @syedahmershah
- 💬 Join the discussion on DEV.to: @syedahmershah
- 🧠 Deep dives on Hashnode: @syedahmershah
- 💻 Check my code on GitHub: @ahmershahdev
- 🔗 Connect professionally on LinkedIn: Syed Ahmer Shah
- 🧭 All my links in one place on Beacons: Syed Ahmer Shah
- 🌐 Visit my Portfolio Website: ahmershah.dev
Top comments (16)
This is probably the first Hermes Agent review I’ve read that actually tests real-world engineering workflows instead of just glorified demos. The part about Hermes remembering and reusing the CSV workflow on a completely different dataset is honestly the most impressive thing here. That’s where it stops feeling like a chatbot and starts feeling like a persistent system.
Also appreciated that you highlighted the weak spots too — especially the shallow code review reasoning and silent GitHub token failure. Those details made the whole review feel way more credible.
The “stateless AI vs compounding AI” line is going to stick with me for a while 👏
Appreciate that. The goal of this Hermes Agent review was exactly that — test real engineering workflows, not surface-level demos. The CSV workflow reuse is where it starts to feel like persistent systems instead of a stateless chatbot, and that shift is the real story behind compounding AI agents.
What I liked most about this post is that you didn’t oversell Hermes Agent as “AGI in a box.” You tested where it breaks, where it adapts, and where it actually shows signs of useful long-term memory. The Task 5 context-switch test was especially interesting because most agents completely lose coherence once the workflow changes midway.
The biggest takeaway for me: persistent memory + self-generated skills feels way more important than just bigger models now. That “do the analysis thing you did before” moment is exactly the kind of behavior that makes agents feel practical instead of gimmicky.
Really solid breakdown. Technical, skeptical, and still exciting at the same time.
That’s the key distinction I was trying to highlight — this isn’t “AGI in a box,” it’s a system that behaves consistently across time. The context-switch test in Task 5 showed that boundary clearly. Persistent memory and recovery behavior matter more than raw intelligence claims right now.
The strongest part of this article is that it reads like an actual engineering evaluation instead of AI marketing copy. The Edge Functions observation alone tells me Hermes is doing more than shallow summarization because that’s a very real production concern most people don’t notice until deployment pain hits.
Also, the “compounding AI” framing is incredibly well put. That idea explains why persistent agents feel fundamentally different from normal chat-based tools.
Exactly. A lot of AI agent content misses the production layer entirely. The Edge Functions point stood out because it’s a real deployment issue, not a theoretical one. That’s why I framed Hermes around compounding AI instead of marketing-level “smart agent” language.
It was the moment this stopped sounding like another agent framework and started sounding genuinely useful. The fact Hermes generated a reusable workflow skill and successfully adapted it to a second CSV structure without re-training or re-prompting is actually huge.
Most “AI agents” automate tasks. Very few improve operationally after completing them. That difference matters.
That’s the interesting part — automation is common now, but operational improvement over time is rare. The fact it could generate a reusable workflow skill and then apply it again without re-prompting is where AI agents start becoming systems instead of tools.
Really appreciated the honesty in this review. Calling out the weak code review depth and the cold intervention-note outputs made the successful parts feel far more believable. Too many AI posts ignore the rough edges completely.
The memory persistence + recovery during context switching is the part I keep thinking about though. If agents can reliably preserve useful structure while adapting goals mid-workflow, that changes how people collaborate with software entirely.
Glad you caught that angle. I wanted the Hermes Agent review to stay honest, especially around weak reasoning depth and silent failures. The real shift happens when context is preserved but still flexible under changing goals — that’s where collaboration with agents starts to feel different.
This post did a great job separating “AI that sounds smart” from “AI that can actually sustain workflows over time.” The fact Hermes retained and reapplied its own generated skill on a new dataset is the kind of capability that feels genuinely important for the future of autonomous agents.
Also loved that you tested failure cases instead of cherry-picking perfect outputs. That made the whole review way more valuable for developers considering real-world use.
That’s exactly the line I was exploring — sustained workflows, not one-off outputs. The skill reuse across datasets is the clearest signal that compounding behavior is actually happening. And yes, failure cases matter more than polished demos because that’s where real limits show up.
What makes this Hermes Agent breakdown stand out is that it evaluates autonomous AI agents under actual engineering pressure instead of controlled demo scenarios. The most important insight here is not the task automation itself, but the compounding behavior through persistent memory and reusable skill generation. The CSV workflow reuse across different datasets is a strong example of why long-term context management may become more valuable than simply increasing model size.
I also appreciated the attention to failure modes. The shallow code review depth, token permission handling, and context bleed during workflow switching are exactly the kinds of limitations developers need to understand before deploying AI agents into production systems. That balance between capability and constraint made this review significantly more credible than typical “AI changed everything” posts.
The distinction between stateless AI and compounding AI is probably the strongest concept in the article. Persistent agents that improve operationally over time could fundamentally change how developers think about automation, orchestration, and collaboration with software.
This is one of the most balanced and technically grounded reviews of Hermes Agent I’ve read so far. What makes this post stand out is that it doesn’t treat autonomous AI agents like magic — it evaluates them under real engineering pressure, with real workflows, real failure points, and real tradeoffs.
The strongest insight here is the distinction between stateless AI and compounding AI. Most AI tools today generate outputs. Hermes Agent seems to generate operational continuity. The reusable skill generation in Task 4 was especially interesting because that’s the moment the system stops feeling like a chatbot and starts behaving like a persistent engineering layer.
This was one of the few Hermes Agent reviews that actually felt grounded in real software engineering instead of AI hype. The most compelling part was not the automation itself, but the persistent memory and reusable skill generation. The CSV workflow reuse across completely different datasets showed why “compounding AI” matters more than one-off prompts. I also appreciated the focus on limitations like shallow code review depth, silent token failures, and context bleed during workflow shifts. That balance made the entire evaluation far more credible, practical, and valuable for developers exploring autonomous AI agents in production workflows.
What stood out in this review wasn’t the task completion itself, but the difference between automation and continuity. Most agents can execute prompts. Very few can preserve operational context, generate reusable workflows, and adapt those workflows later without being re-taught. The CSV skill reuse example was probably the strongest proof of that distinction.
I also appreciated that you tested failure conditions instead of polishing everything into “AI solved software engineering.” The shallow code review reasoning, silent GitHub permission failures, and context bleed during workflow switching are exactly the kinds of weaknesses that matter in production environments. That honesty made the successful parts far more convincing.
The “stateless AI vs compounding AI” framing is also important because it shifts the conversation away from bigger models toward systems that improve operationally over time. That feels like the real architectural direction agents are moving toward, especially for solo developers managing long-running workflows.