DEV Community: Dan Parii

I turned my real YouTube feed into a daily digest, with Hermes Agent

Dan Parii — Sat, 11 Jul 2026 11:43:32 +0000

Hermes Agent was new to me, so I built something small and real on it and paid attention to what the harness made easy.

What I built

A personal YouTube curator. Every morning a scheduled job scrapes my logged-in home feed and watch history, an agent ranks it, and I get a tiered digest on Telegram (learning, infotainment, entertainment) with a duration and a short reason per pick. A second agent maintains a small wiki of the topics and channels I keep coming back to, and my replies on Telegram steer the next morning's digest, either by changing the shape of the digest, or adding extra guidance on what my preferences are at the moment. I can also go deeper on any pick from the digest by asking for more detail from the video's transcript.

I built this to run on small, non-frontier models, so the whole point was control rather than raw intelligence. The idea is to take a modest model and keep it tightly steered on one narrow task, and that is what Hermes is good at: giving each step just enough scope to do its job and nothing more. Here is a single run:

What my project needed, and what Hermes gave me

Building blocks I did not have to write. There is an overwhelming number of pre-built skills, and scanning them turns up useful ones. I used the youtube-content and llm-wiki skills.

A gateway and a scheduled job. Both are first-party in Hermes, so I did not have to build the plumbing for either. The cron job runs a deterministic fetch script first, and only then does the agent wake up to reason over what was collected.
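To make the "fetch first, reason second" ordering concrete, here is a minimal Python sketch. It is not the code from the repo and does not use any Hermes API: `fetch_feed`, `run_fetch`, `build_agent_prompt`, the sample videos, and the `feed.json` filename are all hypothetical stand-ins. The only point it illustrates is that the snapshot is written to disk by plain code before any prompt is assembled.

```python
import json
import tempfile
from pathlib import Path


def fetch_feed() -> list[dict]:
    """Stand-in for the deterministic fetch script (hypothetical sample data)."""
    return [
        {"title": "How transformers work", "channel": "ExampleChannel", "minutes": 18},
        {"title": "Street food tour", "channel": "AnotherChannel", "minutes": 42},
    ]


def run_fetch(out_path: Path) -> Path:
    """Step 1: plain code, no model involved. Writes a snapshot to disk."""
    out_path.write_text(json.dumps(fetch_feed(), indent=2), encoding="utf-8")
    return out_path


def build_agent_prompt(snapshot_path: Path) -> str:
    """Step 2: the agent only sees data that already exists on disk."""
    videos = json.loads(snapshot_path.read_text(encoding="utf-8"))
    lines = [f"- {v['title']} ({v['channel']}, {v['minutes']} min)" for v in videos]
    return "Rank these videos for today's digest:\n" + "\n".join(lines)


def run_job(workdir: Path) -> str:
    """Run the fetch, then build the prompt from the saved snapshot."""
    snapshot = run_fetch(workdir / "feed.json")
    return build_agent_prompt(snapshot)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_job(Path(tmp)))
```

Because the model never performs the fetch, a failed or empty scrape shows up as a missing or empty snapshot file rather than as a confused agent run.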

A way to handle unsafe internet data. Hermes hooks give you fine-grained control. I was able to grant Python write access to a single directory on my machine, and nothing else, to only the subagents that this particular job spins up. That let me trust small local models with a pile of fresh, unfiltered YouTube data every day without worrying about prompt injection. The guard that enforces this lives in the repo if you want to see how it is wired.
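For readers who want a feel for what a single-directory write guard checks, here is a minimal Python sketch. It is an illustration only, not the guard from the repo and not the Hermes hook API: the function name `is_write_allowed` and the example paths are assumptions. The idea is to resolve the target path first, so that `..` segments and symlinks cannot escape the allowed directory, and to compare path components rather than string prefixes.

```python
from pathlib import Path


def is_write_allowed(target: str, allowed_dir: str) -> bool:
    """Return True only if `target` resolves to a path inside `allowed_dir`.

    Resolving first collapses `..` segments and follows existing symlinks.
    Checking `parents` compares whole path components, so a sibling such as
    `/srv/curator/outside` does not pass just because it shares a prefix
    with `/srv/curator/out`.
    """
    allowed = Path(allowed_dir).resolve()
    resolved = Path(target).resolve()
    return allowed in resolved.parents


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(is_write_allowed("/srv/curator/out/digest.md", "/srv/curator/out"))
    print(is_write_allowed("/srv/curator/out/../secrets.txt", "/srv/curator/out"))
```

A hook that runs before each write tool call could reject the call whenever a check like this returns False.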

A taste profile that improves over time. Hermes classifies any feedback I give as memory-worthy or not and folds it in. My preference for investigative infotainment became part of my profile, and if I want tomorrow's digest to carry less brain-off content, I just say so on Telegram and it lands.

A digest I can reshape by asking. Taste is one lever, the format of the digest is another. When I want to change the shape of the digest itself, I ask the agent in chat and it edits the skill that defines it.

The backbone, briefly

None of this is revolutionary, but the project falls into three clean, reusable pieces that I would keep for anything similar:

A cron job can run a deterministic fetch script, so the model never does the fetching and only reasons over data that already exists.
Subagents with isolated contexts and very limited tool sets are safer. The transcript subagent gets a terminal narrowed to a single command; the wiki subagent gets file access and no shell.
Agents can then distill the fetched information with the user's help through something like an LLM wiki, so knowledge accumulates instead of being rebuilt from scratch each run.

Swap the fetch script and the topic, and the same three pieces carry over.
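The "terminal narrowed to a single command" idea from the second piece can be sketched in a few lines of Python. This is a hedged illustration, not how Hermes or the repo implements it: the program name `fetch-transcript`, the set of rejected characters, and `validate_command` are all hypothetical.

```python
import shlex

ALLOWED_PROGRAM = "fetch-transcript"  # hypothetical command name
FORBIDDEN_CHARS = set(";|&$`><\n")  # characters a shell would interpret


def validate_command(command_line: str) -> list[str]:
    """Return the parsed argv if the line runs the one allowed program.

    Raises PermissionError for anything else, including lines that try to
    chain or substitute commands with shell metacharacters.
    """
    if any(ch in FORBIDDEN_CHARS for ch in command_line):
        raise PermissionError("shell metacharacters are not allowed")
    argv = shlex.split(command_line)
    if not argv or argv[0] != ALLOWED_PROGRAM:
        raise PermissionError(f"only {ALLOWED_PROGRAM!r} may be run")
    return argv


if __name__ == "__main__":
    print(validate_command("fetch-transcript --video abc123"))
```

The returned list is meant to be passed to `subprocess.run` without `shell=True`, so the arguments are never re-interpreted by a shell.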

Some extra things I like about Hermes

There are also some smaller touches. The harness nudges the agent when it looks stuck, and it sends a token budget warning when a run goes on too long. Its own footprint is modest too: the system context it spends out of the box is around 13k tokens, which leaves a small model real room to work. I have run my local models on both Claude Code and Hermes, and while the two are built for different purposes, Hermes feels built with SLMs in mind. That has made me want to try coding-agent harnesses built for SLMs too. My experience with Claude Code and SLMs has been underwhelming, but Hermes showed me that when the scaffolding is designed around them, small models can be far more impressive.

Repo: https://github.com/pariidanDKE/HermesYoutubeCurator

If you have built on Hermes, I would like to hear how you are handling memory and untrusted input in your own agents!

Gemma Guide - Real-Time Spatial Awareness for Blind Users

Dan Parii — Sat, 23 May 2026 12:58:04 +0000

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

The Problem

For a blind user, the important question is not just what is in front of me, but how far away it is and how I should move safely. That gap between scene description and grounded spatial awareness is what Gemma Guide is built to close.

The Solution

Gemma Guide combines Gemma 4 with TIPSv2 into a routed multi-agent pipeline. The user speaks a question, the system interprets it visually and acoustically, and returns grounded guidance: not just "there is a chair in front of you" but "the chair is 1.4 meters ahead, slightly to your left."

The diagram below shows the flow end to end. A Scout agent first decides whether the question needs spatial analysis at all. If it does, a Mapper agent localizes the relevant objects and calls the TIPSv2 tool stack to measure distance and bearing for each one. A Navigator agent then reasons over those grounded measurements and produces the spoken response.

The Spatial Grounding Layer

TIPSv2 (Google DeepMind) provides three capabilities the spatial grounding layer depends on: metric depth, semantic segmentation, and open-vocabulary matching:

Class heads (segmentation + metric depth): Dense prediction transformer (DPT) heads produce per-pixel semantic segmentation and metric depth across 150 common object classes. When Gemma localizes a known object, the system intersects that region with the segmentation mask so depth is measured over the right pixels, not a coarse bounding box.

Open-vocabulary matching: TIPSv2 is a vision-language encoder trained so that image patches and text live in the same embedding space. Gemma can pass any class name directly to the encoder and receive a per-patch similarity map in return, turning open-vocabulary understanding into open-vocabulary measurement. This extends grounding well beyond the 150-class limit without any fine-tuning.

I used the B/14 variant (86M vision + 110M text params), the smallest of the four.
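To show why intersecting the localized region with the segmentation mask matters, here is a small pure-Python sketch. It does not call TIPSv2 and is not the project's code: the toy depth map, the mask, the end-exclusive box convention, and the choice of the median as the summary statistic are all assumptions made for illustration.

```python
from statistics import median


def masked_depth(depth, mask, box):
    """Median depth in meters over pixels inside `box` where `mask` is True.

    `depth` and `mask` are row-major 2D lists of the same shape.
    `box` is (x0, y0, x1, y1) with end-exclusive x1 and y1.
    Returns None when no pixel in the box belongs to the object.
    """
    x0, y0, x1, y1 = box
    values = [
        depth[y][x]
        for y in range(y0, y1)
        for x in range(x0, x1)
        if mask[y][x]
    ]
    return median(values) if values else None


# Toy scene: an object at 1.4 m in front of a wall at 5.0 m.
depth = [
    [5.0, 5.0, 5.0, 5.0],
    [5.0, 1.4, 1.4, 5.0],
    [5.0, 1.4, 5.0, 5.0],
]
object_mask = [
    [False, False, False, False],
    [False, True, True, False],
    [False, True, False, False],
]
everything = [[True] * 4 for _ in range(3)]
box = (0, 0, 4, 3)

print(masked_depth(depth, object_mask, box))  # object pixels only
print(masked_depth(depth, everything, box))   # whole bounding box
```

In this toy case the masked estimate is the object's 1.4 m, while the box-only estimate is dominated by the 5.0 m background, which is the failure mode the mask intersection avoids.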

The Interface

Gemma Guide is designed with a blind-first philosophy at its core: accessibility is not a feature added on top, it is the only way to build. The UI uses a two-zone tap-anywhere layout with no visual-first interaction patterns. Audio soundscapes and TTS guidance bridge the gap during model reasoning, and all architectural complexity is entirely hidden from the user.

Demo

Code

pariidanDKE / GemmaGuide

Gemma Guide

Gemma Guide is a blind-first multimodal navigation assistant that combines Gemma 4 with TIPSv2 to answer grounded questions like:

What object is in front of me?
How far away is it?
Where is it relative to me?
How should I move safely?

Motivation

For a blind user, the important question is not just what is in front of me, but how far away it is and how I should move safely. That is the gap between scene description and real navigation assistance. A useful system must do more than describe a scene in natural language; it must produce grounded spatial answers that guide movement in the real world.

Language models are not reliable depth sensors, but with Gemma 4, they can act as an agent that identifies an object, calls specialized spatial tools, and turns grounded distance estimates into practical guidance. Gemma Guide is built to turn…

How I Used Gemma 4

I used Gemma 4 E4B as the multimodal orchestrator across a three-agent pipeline.

Why Gemma 4 E4B specifically: the interaction is voice-driven, the scene must be interpreted visually, and the model must decide in real time whether a conversational answer is sufficient or whether grounded spatial measurement is required. That demands audio understanding, image understanding, native function calling, and strong reasoning together, and E4B delivers all of this while remaining compact.

That compactness is important for more than speed. It makes co-deployment alongside TIPSv2 feasible, and it directly advances the case for on-device deployment. A blind user should not need an internet connection to see, and running locally removes a meaningful privacy concern around continuous camera access. Keeping the language model in this size class is what makes that future realistic.

The Gemma 4 model card also lists two capabilities that turned out to be directly relevant here: native pointing (the model can return spatial coordinates for objects within an image, not just describe them) and interleaved multimodal input (audio and visual content freely mixed in a single prompt). Both are described under image understanding in the model card, and leveraging them gave more reliable responses.

The pipeline:

Scout: Classifies the request - direct answer or spatial analysis needed - and routes accordingly.
Mapper: Interprets the question, localizes relevant objects, and calls the TIPSv2 spatial tool stack to build a grounded scene state with distance and bearing per object.
Navigator: Receives the annotated image and structured measurements and generates the final spoken guidance. Separating this from the Mapper matters: reasoning over a clean measured world model is more reliable than reasoning over raw tool outputs.
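The routing between the three agents can be sketched as plain Python control flow. This is a simplified illustration under stated assumptions, not the project's implementation: each function below is a hypothetical stand-in for a Gemma 4 call, the Scout uses keyword matching instead of a model, and the Mapper returns fixed measurements instead of calling the TIPSv2 tools.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    label: str
    distance_m: float
    bearing_deg: float  # negative = left of center, positive = right


def scout(question: str) -> str:
    """Hypothetical stand-in for the Scout: keyword routing, no model call."""
    spatial_words = ("how far", "where", "get to", "distance")
    return "spatial" if any(w in question.lower() for w in spatial_words) else "direct"


def mapper(question: str) -> list[Measurement]:
    """Stand-in for localization plus spatial tool calls; returns fixed data."""
    return [Measurement("chair", 1.4, -10.0)]


def navigator(scene: list[Measurement]) -> str:
    """Turn the measured scene state into guidance text."""
    sentences = []
    for m in scene:
        if m.bearing_deg < 0:
            side = "to your left"
        elif m.bearing_deg > 0:
            side = "to your right"
        else:
            side = "straight ahead"
        sentences.append(f"The {m.label} is {m.distance_m:.1f} meters away, {side}.")
    return " ".join(sentences)


def direct_answer(question: str) -> str:
    """Stand-in for a conversational reply that needs no measurement."""
    return "Direct answer (no measurement needed)."


def answer(question: str) -> str:
    """Scout routes; Mapper and Navigator run only for spatial questions."""
    if scout(question) == "direct":
        return direct_answer(question)
    return navigator(mapper(question))


if __name__ == "__main__":
    print(answer("How far is the chair?"))
```

The Navigator here receives only structured measurements, which mirrors the separation described above: it never sees raw tool output.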

User Question: Can you tell me how to get to the grill in my backyard?

Gemma Guide: The grill is 2.342 meters away, straight ahead. There are a few things in the way. You should first step slightly right to clear the table, which is 0.985 meters away about 25 degrees to your left. Then, the chair at 1.345 meters straight ahead will be in your path. Steer gently to your right to avoid it and step forward.

Key Findings

Distance alone is not enough. Early versions reported metric depth but users had no sense of direction. Adding horizontal bearing from the object's position in the frame turned a distance reading into actionable spatial guidance.
Whole-scene depth reasoning was too unreliable. Asking the model to reason over a full depth map produced ambiguous results. The reliable path was having Gemma localize the object first, then feeding that into the measurement pipeline - leveraging what the model is actually built for: reasoning, localization, and tool calling.
Separation of concerns made outputs consistent. Combining scene description, tool orchestration, and navigation reasoning in one agent made outputs inconsistent. Splitting into Scout, Mapper, and Navigator fixed this.
vLLM was the right local inference choice for now. Ollama lacks audio input entirely, ruling it out for a voice-driven pipeline, and its Gemma 4 tool-calling parser has had numerous bugs, whereas vLLM's has proved robust. With quantization, reduced max sequence length, and fewer image patches per call, the full stack - Gemma 4 E4B plus TIPSv2 - fits on a single 16GB GPU.
Latency is the dominant UX constraint. Tool calls run in parallel, but the Mapper and Navigator still bottleneck on model reasoning itself, and that cost grows as conversation history lengthens. Complex scenes can push end-to-end response time past 20 seconds.
Reliability is the remaining work. The navigation instructions are still sometimes off, and the next step is fine-tuning the grounding stack on task-specific data.
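The first finding, deriving horizontal bearing from the object's position in the frame, can be illustrated with a short Python sketch. It assumes an ideal pinhole camera with a known horizontal field of view and no lens distortion; the 70 degree field of view, the 5 and 20 degree phrasing thresholds, and both function names are assumptions for illustration, not values from the project.

```python
import math


def bearing_deg(x_center_px: float, image_width_px: float, hfov_deg: float) -> float:
    """Horizontal bearing of a pixel column, in degrees.

    Negative means left of the optical axis, positive means right.
    Assumes a pinhole camera: focal length in pixels follows from the
    image width and the horizontal field of view.
    """
    half_width = image_width_px / 2
    focal_px = half_width / math.tan(math.radians(hfov_deg) / 2)
    return math.degrees(math.atan((x_center_px - half_width) / focal_px))


def describe(bearing: float) -> str:
    """Map a bearing to a spoken phrase (thresholds are arbitrary choices)."""
    if abs(bearing) < 5:
        return "straight ahead"
    side = "left" if bearing < 0 else "right"
    amount = "slightly to your" if abs(bearing) < 20 else "to your"
    return f"{amount} {side}"


if __name__ == "__main__":
    b = bearing_deg(x_center_px=250, image_width_px=640, hfov_deg=70)
    print(round(b, 1), describe(b))
```

Under this model a pixel at the image center has bearing zero and a pixel at the frame edge has a bearing of half the field of view, which is a quick sanity check for any real calibration.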

Toward on-device deployment:

I explored Google AI Edge Gallery as a path to partial on-device deployment (Gemma locally, TIPSv2 stack remote), but the current blocker is that the image Gemma sees in chat is not forwarded into the skill execution context, which breaks grounded measurement. A standalone mobile app with tighter camera and voice control is the stronger long-term path, and both models are compact enough to make fully offline deployment on edge hardware plausible.