DEV Community: Kirill

I Replaced ChatGPT With Gemma 4 In My Product. It Felt Like The Same Radio Show With A Different Host.

Kirill — Thu, 21 May 2026 21:05:01 +0000

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

Most “read later” links quietly die in browser tabs. At some point I realized I wasn’t actually trying to consume more content anymore. I was trying to reduce the cost of deciding what deserved my attention in the first place.

That realization eventually turned into TLDR Radio — a Telegram bot that converts long-form articles and discussion threads into short audio briefings you can consume while walking, commuting, cooking, or doing literally anything except staring at another glowing rectangle.

But while building it, I accidentally discovered something much more interesting than “AI summaries”. I swapped the underlying model, and almost nothing important broke. That realization stayed in my head much longer than any benchmark chart.

What I Built

TLDR Radio is an audio-first article triage system. You send a link. The bot:

fetches the article
extracts readable content
optionally pulls discussion context
generates a structured summary
converts it into audio
sends the result back through Telegram
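
The steps above map onto a chain of small stages. Here is a minimal Python sketch of that shape, not the bot's actual code (the real project runs on .NET and is private). Every name is hypothetical, and the network, LLM, TTS, and Telegram calls are replaced by placeholders.

```python
import re
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Briefing:
    """Carries one link through the pipeline (hypothetical shape)."""
    url: str
    html: str = ""
    text: str = ""
    discussion: list = field(default_factory=list)
    summary: str = ""
    audio: bytes = b""
    delivered: bool = False


def fetch(b: Briefing) -> Briefing:
    # Placeholder for an HTTP GET of the article.
    b.html = f"<html><body><p>Article at {b.url}</p></body></html>"
    return b


def extract(b: Briefing) -> Briefing:
    # Placeholder for readable-content extraction: a crude tag strip.
    b.text = re.sub(r"<[^>]+>", "", b.html).strip()
    return b


def add_discussion(b: Briefing) -> Briefing:
    # Optional stage: a real version would look up related comment threads.
    b.discussion = []
    return b


def summarize(b: Briefing) -> Briefing:
    # Placeholder for the LLM call.
    b.summary = b.text[:200]
    return b


def synthesize(b: Briefing) -> Briefing:
    # Placeholder for the TTS call.
    b.audio = b.summary.encode("utf-8")
    return b


def deliver(b: Briefing) -> Briefing:
    # Placeholder for sending the audio back through Telegram.
    b.delivered = True
    return b


Stage = Callable[[Briefing], Briefing]
STAGES: list = [fetch, extract, add_discussion, summarize, synthesize, deliver]


def run_pipeline(url: str, stages: list) -> Briefing:
    b = Briefing(url=url)
    for stage in stages:
        b = stage(b)
    return b
```

Because each stage has the same signature, the optional discussion step can be dropped from the list without touching the others.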

The original problem was surprisingly simple. My browser had basically become a graveyard of tabs I was never going to read anyway. And the real issue wasn’t lack of time. It was decision fatigue. Choosing what deserved attention started feeling more exhausting than the actual reading itself. So I stopped treating summarization as “compression”. I started treating it as attention routing.

The core UX decision was intentional from the beginning: I did not want another AI chat interface. I wanted passive consumption.

The product only started feeling genuinely useful once I could:

listen while walking
listen while driving
listen while cooking
listen while cleaning
stay away from the screen entirely

That constraint ended up influencing almost every architectural decision. The system itself looks less like a chatbot and more like a media-processing pipeline.

High-level flow:

Demo

Landing: https://tldr-radio.humifylab.com
Telegram Bot: https://t.me/TldrRadioBot

How to use: send one or two links and get a short audio summary.

[UX demo: from link to detailed audio summary]

[Conversation in Telegram, audio list and lock screen]

Each audio summary comes with a message containing a caption, tags, the first few sentences of the summary, and sources. You can see the difference between Gemma and ChatGPT by comparing those messages yourself. For the rest of the article, Gemma is on the left.

[Gemma on the left and ChatGPT on the right]

One thing I really like is pulling in discussion context from places like Hacker News and Reddit. An article is just one perspective. The comment threads usually surface the real signal way faster than the article itself. There's also an option to go deeper and get a more detailed summary, which works really well for long HN threads.

[Gemma on the left and ChatGPT on the right]

Code

I very deliberately wanted to keep these concerns separate:

webhook latency
durable job execution
asynchronous processing
execution snapshots

The architecture is heavily queue-oriented. The webhook itself stays lightweight and returns quickly. Long-running work happens asynchronously in workers.
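
A minimal sketch of that split, in Python for brevity (the project itself is .NET). The in-memory `queue.Queue` is only a stand-in: a production system would need a durable queue so that jobs survive restarts. The function names and the job shape are assumptions made for illustration.

```python
import queue
import threading


def handle_webhook(update: dict, jobs: queue.Queue) -> int:
    """Do the minimum on the request path: enqueue the job and return."""
    jobs.put({"chat_id": update["chat_id"], "url": update["text"]})
    return 200  # HTTP status returned right away


def worker(jobs: queue.Queue, done: list) -> None:
    """Drain jobs until the None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            break
        # Placeholder for the slow part: fetch, summarize, synthesize, send.
        done.append(f"processed {job['url']} for chat {job['chat_id']}")


def run_demo() -> tuple:
    jobs: queue.Queue = queue.Queue()
    done: list = []
    thread = threading.Thread(target=worker, args=(jobs, done))
    thread.start()
    status = handle_webhook({"chat_id": 1, "text": "https://example.com/a"}, jobs)
    jobs.put(None)  # tell the worker to stop
    thread.join()
    return status, done
```

The webhook handler never waits for the summary. Its only job is to accept the update and hand it off.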

[Architecture diagram]

The stack currently includes:

ASP.NET Core Minimal API
PostgreSQL
OpenTelemetry
LLM providers
Telegram Bot API
TTS providers
Fly.io deployment

The LLM is only one component inside the pipeline, not the entire product.

One small feature worth mentioning is procedurally generated cover images. Each summary MP3 gets ID3 tags, including an album cover. How do you like these?

[Procedurally generated images as covers]

The actual TLDR Radio repository is currently private. But during development I extracted part of the infrastructure into an open-source production-oriented Telegram bot starter for .NET:

https://github.com/lemesevkirill/telegram-bot-starter-dotnet

It contains the asynchronous webhook/worker architecture that heavily influenced TLDR Radio itself.

How I Used Gemma 4

Originally, TLDR Radio used ChatGPT-based summarization. That felt like the obvious choice. Then the Gemma 4 challenge appeared and I started wondering: What actually happens if I swap the model without changing anything else?

For the core reasoning engine of TLDR Radio, I selected the Gemma 4 31B Instruct model, deploying it via OpenRouter's free tier. Within the Gemma 4 ecosystem, developers often choose between the high-throughput Mixture-of-Experts (MoE) models (like the 26B variant) and dense architectures. I intentionally chose the 31B Dense model for a specific architectural reason: script-writing and role-preservation.

While MoE models are very cost-efficient because they activate only a subset of their parameters per token, a dense model uses all of its parameters (30.7B here) for every generated token. For an audio-first product like TLDR Radio, that mattered to me. It gave me cohesive narrative structure, good flow, and a "radio host personality" that held across complex, multi-layered summaries without breaking character.

Using OpenRouter let me plug the 31B dense model into my .NET pipeline right away, with a 256K context window and native multilingual support, and without managing any local infrastructure.

Honestly, I expected the quality to collapse. That’s not what happened. This became the most interesting part of the entire experiment.

[Gemma on the left and ChatGPT on the right]

I intentionally kept:

the same prompts
the same orchestration
the same summary structure
the same Telegram UX
the same audio generation flow

The only thing that changed was the model. And the result did not feel like smart AI vs dumb AI or high-quality vs low-quality. It felt more like swapping podcast hosts.
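
One way to make that kind of swap cheap is to keep the model behind a single narrow seam. The Python sketch below illustrates the idea and is not the project's code: `Summarizer`, `PROMPT_TEMPLATE`, and the two fake hosts are hypothetical, and a real `complete` function would call a provider API.

```python
from dataclasses import dataclass
from typing import Callable

# A "model" here is just a function from prompt text to completion text.
Completion = Callable[[str], str]

# Shared by every model, so prompts and structure stay identical.
PROMPT_TEMPLATE = "Summarize for a radio briefing:\n\n{article}"


@dataclass
class Summarizer:
    complete: Completion  # the only part that changes between models

    def summarize(self, article: str) -> str:
        prompt = PROMPT_TEMPLATE.format(article=article)
        return self.complete(prompt)


def fake_host_a(prompt: str) -> str:
    # Stand-in for one provider; echoes the last line of the prompt.
    return "Host A: " + prompt.splitlines()[-1]


def fake_host_b(prompt: str) -> str:
    # Stand-in for another provider with a different "voice".
    return "Host B: " + prompt.splitlines()[-1]
```

Swapping hosts is then a one-argument change: `Summarizer(fake_host_a)` versus `Summarizer(fake_host_b)`, with the orchestration around it untouched.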

[Gemma on the left and ChatGPT on the right]

ChatGPT often sounded patient and explanatory. Gemma frequently sounded denser and more compressed, almost like:

“here’s the essence, let’s move”

The factual quality was often surprisingly close for this workflow.
What changed more noticeably was:

pacing
sentence density
narration rhythm
listening feel
emotional texture

That was the moment where the whole thing stopped feeling like “model evaluation”. It started feeling more like media production for the same show with different hosts. And that realization stayed in my head much longer than expected. Because I originally assumed TLDR Radio was basically a model experiment. Smarter model equals better product. Simple. Then I started swapping models and something uncomfortable happened: The model quietly stopped feeling like the whole product.

[Gemma on the left and ChatGPT on the right]

Real-World Observations

One thing that became obvious very quickly: Operational reliability matters enormously in audio products.

The free Gemma endpoints through OpenRouter were heavily throttled during peak usage. The paid endpoint was dramatically more stable. Which mirrors a broader AI product lesson: Raw intelligence matters less if the operational experience becomes unreliable.
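
A common way to soften throttling is to retry with exponential backoff and then fall back to a more stable endpoint. The Python sketch below shows that pattern under stated assumptions: the `Throttled` exception, the function names, and the retry numbers are all hypothetical, and nothing here describes how TLDR Radio actually handles it.

```python
import time
from typing import Callable


class Throttled(Exception):
    """Raised by a (hypothetical) client when the endpoint rate-limits us."""


def call_with_fallback(
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    prompt: str,
    retries: int = 2,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try the primary endpoint with backoff, then use the fallback."""
    for attempt in range(retries):
        try:
            return primary(prompt)
        except Throttled:
            # Wait 0.5s, 1s, 2s, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    return fallback(prompt)
```

Passing `sleep` as a parameter keeps the function testable without real waiting.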

As long as the endpoint is stable, Gemma is totally fine on the pipeline side. Everything I did with ChatGPT also works with Gemma: latency, limits, context, and the rest of the technical details all hold up.

Another interesting observation: I expected prompt portability to be much worse. Instead, both models handled the orchestration surprisingly well. That made the models feel far more interchangeable than I originally expected. Multilingual behavior also changed the feel of the product in interesting ways. Not just translation quality. Personality. Different combinations of model / language / TTS provider started producing noticeably different listening experiences.

Again: less like swapping engines, more like swapping hosts.

Final Thought

Building TLDR Radio changed how I think about AI products. I expected swapping models to feel like replacing the engine. Instead, it felt more like replacing the host of the same radio show.

Gemma didn’t replace GPT in this project. It changed the pacing, tone, and listening feel of the experience. And that turned out to be much more interesting than a benchmark comparison.

The biggest surprise wasn’t realizing that open models got good. It was realizing how quickly the model itself stopped feeling like the whole product.

While building TLDR Radio, I ended up thinking about something larger: What happens when intelligence itself becomes infrastructure? I wrote a more philosophical version of that realization here:

https://futurehangover.substack.com/p/nobody-cares-about-your-frontier

And if you want to try the bot itself:

https://t.me/TldrRadioBot

Turns out writing and speaking are completely different skills

Kirill — Tue, 12 May 2026 18:59:35 +0000

A month ago I sent a cold message to a meetup organizer with a rough idea for a short talk about AI-assisted coding chaos.

At that point it was basically just a rant: "Why does asking an LLM to add one button end with half the repo rewritten?"

Since then:
— I gave the first version of the talk
— turned it into a dev.to article
— got surprisingly good discussions out of it
— and now I'm doing another session for Azure Meetup Konstanz

One thing I realized during this process: writing about engineering and speaking about engineering are completely different skills. When you write, you can edit weak parts away. When you speak, weak ideas become visible immediately.

The topic is still the same: how to use AI coding tools without turning your project into entropy.

If you're curious — here's the meetup link:
https://www.meetup.com/azure-meetup-konstanz-region/events/314158692/

And the original article:
https://dev.to/klem42/i-asked-an-llm-to-add-one-button-it-rewrote-half-my-repo-1l1f

Engineering Teams Quietly Reward Chaos

Kirill — Fri, 08 May 2026 14:32:05 +0000

Lately I've been having a strange thought.

The more chaotic a system became, the more valuable I looked inside it.
The harder things were to understand, the more people depended on me.
The more production pain existed, the more visibility I got.

And eventually I caught myself wondering something uncomfortable:
Was I actually incentivized to reduce any of this?

At first that thought sounded ridiculous. Nobody wakes up thinking: "Today I will make the system worse."

At least I hope not.

But engineering organizations create strange games. And once you notice the incentives, it's hard to unsee them.

You start noticing what actually gets rewarded.

Fast fixes get rewarded.
Heroic debugging sessions get rewarded.
Late-night firefighting gets rewarded.
Being the person who saves the release gets rewarded.
Visibility gets rewarded.
Urgency gets rewarded.
Visible suffering gets rewarded.

Meanwhile, a lot of other things become strangely invisible.

Reducing future complexity.
Eliminating operational pain before it exists.
Making systems boring.
Making onboarding easier.
Documenting things well enough that people stop depending on you.

Nobody gathers the team to celebrate that nothing exploded this quarter.

The email you will probably never receive.

But everybody notices when you jump into chaos and survive it.

I used to genuinely enjoy simplifying things.
I automated repetitive work.
I removed weird dependencies.
I cleaned up ugly flows.
I documented systems.
I tried to make things understandable.
I tried to make myself less of a bottleneck.

But something strange kept happening. The cleaner things became, the less important I seemed to be.

Fewer escalations.
Fewer emergency calls.
Fewer meetings where people suddenly needed me.

Stability made me respectable. Chaos made me important.

And that realization messed with my head more than I expected. Because once you understand the game, the incentives start pulling on you in strange ways.

If I automate this process perfectly, people stop needing me.
If I document everything properly, knowledge asymmetry disappears.
If onboarding becomes easy, I stop being irreplaceable.
If incidents disappear, heroic visibility disappears too.
If systems become predictable, nobody notices how much pain used to exist.

And the worst part is that you don't even need to become malicious. You just slowly adapt.

You postpone cleanup work a little longer.
You stop fighting heroic culture as hard as you used to.
You tolerate complexity instead of killing it immediately.
You leave certain knowledge undocumented because keeping it in your head preserves a kind of invisible leverage.
You optimize for what the organization visibly rewards.

And eventually, part of you starts needing emergencies. They make your value obvious.

That was probably the most uncomfortable realization for me. Not that organizations reward chaos. But how quickly the human brain learns to survive inside systems like that.

I don't think most engineers consciously choose this. I think many organizations accidentally create environments where reducing complexity becomes strategically irrational in the short term.

Then later everybody acts surprised.

Why are there knowledge silos?
Why are certain engineers impossible to replace?
Why does nobody fix root causes?
Why is every release emotionally exhausting?
Why does the organization depend on heroics just to function?

Maybe the real problem is that modern engineering organizations are much better at rewarding visible heroics than invisible stability.

And the hardest part is this: I'm no longer completely sure whether I'm reducing complexity inside these systems anymore. Or simply learning how to profit from it.

Move Fast and Break Things Was Fun. Now We’re Paying For It.

Kirill — Wed, 06 May 2026 14:35:34 +0000

The most valuable engineers I've worked with often looked... less "productive". They weren't always the fastest at closing tickets. They didn't push the most code. Sometimes it even felt like they were slowing things down.

And yet, somehow, teams around them ended up dealing with a lot less bullshit six months later.

Typical story: an API occasionally responds in 10 seconds under load. Not always. Just often enough to become painful. A ticket shows up: "Need to fix slow responses ASAP."

One engineer goes for the obvious fix: "Let's add caching."
A couple of days later, the issue seems gone. Everyone's happy. Ticket closed. Velocity looks great.

Then the familiar magic begins: cache invalidation becomes painful, stale data starts showing up in weird places, edge cases multiply, more patches pile on top, and a month later the system is more complicated than before the "fix".

Another engineer does the thing people often hate: instead of immediately writing code, they start asking why the problem exists in the first place.

And suddenly it turns out the backend isn't really the issue. The frontend is hammering the API because it has no proper way to receive updates.

So instead of another caching layer, the solution becomes boring but clean: events, change notifications, fewer requests, less state, less magic.
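
To make the difference concrete, here is a toy Python model of the two approaches. The `Resource` class and the tick counts are invented for illustration. It only counts how many reads a client issues, which is the load the backend sees.

```python
class Resource:
    """Toy backend resource that counts reads and can notify subscribers."""

    def __init__(self) -> None:
        self.state = "pending"
        self.reads = 0
        self._subscribers = []

    def read(self) -> str:
        self.reads += 1
        return self.state

    def subscribe(self, callback) -> None:
        self._subscribers.append(callback)

    def update(self, new_state: str) -> None:
        self.state = new_state
        for callback in self._subscribers:
            callback(new_state)


def poll_until(resource: Resource, target: str, change_at_tick: int,
               max_ticks: int = 100) -> int:
    """Poll once per tick until the state matches; return the read count."""
    for tick in range(max_ticks):
        if tick == change_at_tick:
            resource.update(target)  # the change happens server-side
        if resource.read() == target:
            break
    return resource.reads
```

If the state changes on the tenth tick, the polling client has issued ten reads by the time it notices. A subscriber gets one notification and issues none.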

In the moment, the second engineer looked slower. They asked annoying questions. They added friction. Maybe they even irritated people a little.
But six months later, systems around people like that somehow tend to be simpler, more stable, and easier to work with.

And I think this is one of the strangest things in our industry.
We're very good at measuring delivery speed. We're much worse at noticing engineers who reduce the amount of future pain inside a system.

And these aren't necessarily "10x architects" or technical geniuses.
Often they're just the people who ask an uncomfortable question at the right moment:

"Wait. Are we actually solving the right problem?"

The funny part is that these are usually the engineers people try to bring with them from company to company, pull into difficult projects, and retain at almost any cost.

Even though their value is incredibly hard to measure formally.

Have you worked with engineers like this?

I used to have 100+ open tabs I wasn't reading

Kirill — Sun, 03 May 2026 15:10:54 +0000

My browser was basically a graveyard. Choosing what to read felt more draining than reading itself. I realized I didn't actually want to read most of those articles - I just didn't want to miss the 10% that actually mattered.

So I changed the order. I started filtering everything through a ~3-minute audio breakdown before deciding whether to read it in full. I started doing this manually with ChatGPT, and that was enough to notice a pattern: for most long-form content, three minutes is enough.

A short breakdown gives me the core ideas, hidden assumptions, and counter-arguments without the fluff. Anything longer starts to feel like a "half-read" that doesn't actually save time. Three minutes seems to be the sweet spot for a Go/No-Go decision.

The signal improves a lot when you include discussion threads from Hacker News or Reddit. An article is one perspective. Comments expose the gaps and highlight the actual signal before you ever commit to a single tab.

I eventually automated this for myself, but that part isn't the point. This changed how I deal with information. I no longer feel the need to "finish" my backlog; I just need a faster way to tell what deserves my attention.

I don't try to read more anymore. I just commit less often.

I tried staring at a wall to fix my focus

Kirill — Wed, 29 Apr 2026 12:02:56 +0000

Recently I read a post here on dev.to. A guy was writing about how his health slowly went off track. Nothing dramatic at first, just the usual things stacking up. Sleep getting worse, caffeine creeping in, energy dropping, focus dissolving somewhere in the background. It didn't read like a crisis. It read like something that quietly happens to a lot of people.

For some reason that post stuck with me. Not because of the health part itself, but because of the pattern. The sense that things don't usually break in one big moment. They degrade through small habits that feel harmless when taken one by one.

Around the same time I came across another idea that felt strangely related. It came from a post about productivity, but it didn't look like productivity advice at all. The suggestion was almost absurd in its simplicity. When your focus drops, don't reach for your phone, don't switch tabs, don't look for stimulation. Just sit and stare at a wall for a few minutes.

That was it.

The author described a familiar loop where bad sleep leads to caffeine, caffeine leads to jittery focus, that leads to background noise like music or podcasts, which turns into scrolling and distraction, which then pushes sleep even further out of balance. A cycle that feeds itself without ever feeling like a conscious decision.

What caught my attention wasn't the cycle itself, but the proposed interruption. Not a better tool, not a smarter system, just the removal of input.

I tried it out of curiosity. It felt like something too simple to matter, which is usually a good reason to test it.

Sitting there and doing nothing turned out to be much harder than expected. Not physically, but mentally. The first thing I noticed was how quickly the mind tries to escape the situation. It doesn't matter where it goes. Checking messages, thinking about work, replaying conversations, planning something pointless. Anything is better than staying in that empty space.

That reaction was more interesting than the exercise itself. It made me realize how little time there is in a normal day where nothing is happening. Every gap gets filled automatically. Waiting becomes scrolling. Walking becomes listening. Eating becomes watching something. Even working often comes with some kind of background input.

At some point I saw a comment that put this into words more precisely. The problem with modern devices is not just that they take your attention. They take away the moments where your mind used to wander on its own.

"I've recently realised that the biggest problem with smartphones is not that they steal your attention (which is bad enough), but that they steal your disattention.
I don't know of a better word for it than disattention. Perhaps downtime? But it's not so structured. It's just those moments where you'd previously let your mind wander. Gone forever."

That idea reframed the whole thing. It's not just about distraction. It's about the disappearance of mental idle time. The kind of low-stimulation state where thoughts connect in unpredictable ways, or where nothing particularly useful happens, but the system resets itself.

Seen from that angle, staring at a wall doesn't look like a productivity trick. It looks more like restoring a missing state.

Some people would probably call this meditation, and maybe that's technically correct. But the framing feels different. Meditation comes with expectations, structure, and sometimes a sense that you're supposed to achieve something. This feels more like a diagnostic tool. You sit down and check whether you can tolerate the absence of input for a few minutes.

If the answer is yes, then nothing interesting happens. If the answer is no, then that's already a signal.

What surprised me was that after a short period of doing nothing, going back to work felt slightly easier. Not in a dramatic or motivational way, but in a quieter sense. There was less internal friction. Fewer competing impulses pulling attention away.

It didn't feel like gaining something new. It felt more like removing noise.

I'm not sure how much of this is a real technique and how much is just a reaction to being constantly overloaded. But the experiment itself feels valuable because it's so easy to try and so hard to fake.

You don't need a system or a habit tracker. You just need a few minutes and the willingness to not fill them with anything.

When was the last time you let your mind wander without a podcast or a screen? I'd be curious to hear if anyone else has tried this "zero-input" experiment.

The interesting part is not whether it works. The interesting part is whether you can actually do it.

I asked an LLM to add one button. It rewrote half my repo.

Kirill — Tue, 21 Apr 2026 16:42:08 +0000

Let’s be honest: modern LLMs write amazing code. Sometimes I look at the output and realize I couldn’t have done it better or faster myself. It hits you with this almost addictive rush of speed.

But that speed comes with a cost: you stop understanding what’s happening.

Recently, I was working on my pet project — a Telegram bot that turns articles into audio. I wanted to add one simple feature: a "Detailed Summary" mode. I threw a quick prompt at the AI:

"Give users a way to get more detailed summaries"

At first glance, everything looked fine. Then I tried to understand the diff.
I couldn't.

The Chaos of "Just Prompting"

I had a clean setup: all my prompts lived in a prompts.yaml file, and a PromptBuilder class would neatly assemble them. It was a predictable, single-source-of-truth system.

The agent ignored all of it. Instead of adding a template to the YAML, it shoved raw strings directly into the C# code.

It introduced if/else logic inside the builder and added extra prompt instructions as raw string literals. My architecture just left the building. Now I had two sources of truth.
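
For contrast, this is roughly what a single source of truth looks like. The Python sketch is illustrative only (the real project is C#): the `PROMPTS` dict stands in for the parsed contents of prompts.yaml, and its keys, wording, and the `build` signature are all hypothetical.

```python
from string import Template

# Stand-in for the parsed prompts.yaml; keys and wording are made up.
PROMPTS = {
    "summary.short": "Summarize in $minutes minutes of speech:\n$article",
    "summary.detailed": "Give a detailed, sectioned summary:\n$article",
}


class PromptBuilder:
    """Assembles prompts only from named templates, never inline strings."""

    def __init__(self, templates: dict) -> None:
        self._templates = templates

    def build(self, name: str, **values: str) -> str:
        if name not in self._templates:
            raise KeyError(f"unknown prompt template: {name}")
        # substitute() raises KeyError if a placeholder has no value.
        return Template(self._templates[name]).substitute(**values)
```

In this shape, a new mode is a new entry in the template file, and the builder needs no new branch.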

But the worst part was the UX. Instead of adding a simple Telegram button to trigger the detailed mode, the AI decided that the user should manually type magic hashtags like #detailed. It chose the easiest path for the code, not the user.

The model optimized for its own convenience, not for my system. And it made dozens of decisions I never asked for.

The Spec as Anxiety Relief

I realized I was tired. Tired of holding my breath every time I hit "Apply," wondering what exactly was about to break.

At some point, I realized this isn’t just about AI being unpredictable. It’s about me not defining things clearly enough. That’s when I started using a Spec.

It’s not just a technical fix; it’s anxiety relief. When I put a Markdown file in Git between my head and the code, I can finally breathe. I’m no longer guessing what’s going to happen.

My Workflow

1. The Architecture Chat: I use ChatGPT to argue about edge cases. I often dictate my thoughts by voice—it's easier to "think out loud" about things like race conditions. We talk until we have a solid feature_spec.md file.
2. The Consistency Check: I make the coding agent compare the new spec with my actual repo. If the AI finds a contradiction before writing code, I’ve already won.
3. The Implementation: Only when the spec and architecture are aligned do I let the AI touch the code.

Same Task. Same AI. Different Outcome.

When I implemented the same feature using a spec, the result was night and day.

I explicitly defined the rules: "Use the existing YAML prompt storage. Use Telegram's native buttons. Do not force the user to type hashtags."

The agent followed the contract. This time, it created a new prompt template in the right place and implemented a clean button-based UX.
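
For reference, a native Telegram button is just an `inline_keyboard` entry in the `reply_markup` of a `sendMessage` call. The Python sketch below only builds the payload and does not send it. The function name, message text, and the `detailed:<id>` callback format are assumptions. The 64-byte check reflects the Bot API's limit on `callback_data`.

```python
import json


def detailed_summary_message(chat_id: int, summary_id: str) -> dict:
    """Build a sendMessage payload with one inline button (hypothetical)."""
    callback_data = f"detailed:{summary_id}"
    # Telegram limits callback_data to 1-64 bytes.
    if len(callback_data.encode("utf-8")) > 64:
        raise ValueError("callback_data must fit in 64 bytes")
    return {
        "chat_id": chat_id,
        "text": "Your audio summary is ready.",
        "reply_markup": {
            "inline_keyboard": [
                [{"text": "Detailed summary", "callback_data": callback_data}]
            ]
        },
    }


def to_json_body(payload: dict) -> str:
    """Serialize the payload for a JSON POST to the sendMessage method."""
    return json.dumps(payload)
```

When the user taps the button, the bot receives a callback query carrying that `callback_data`, so nobody has to type a hashtag.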

The difference wasn’t the model. It was the spec. The code remained clean, the architecture stayed intact, and I actually understood the diff.

Bottom Line

AI is a fast but very average developer.

Without boundaries, it will pick the easiest, messiest path. It will still "work." And that's the dangerous part. You won't notice the rot until it's too late.

Stop prompting. Start defining.

How do you handle the AI's urge to "improve" your architecture without asking? Let’s discuss in the comments.

Photo: William Murphy / flickr — CC BY-SA 2.0