DEV Community: Sergey Shkuratov

2026-W28: Four Types of Agency Corrosion

Sergey Shkuratov — Thu, 16 Jul 2026 11:53:06 +0000

In the notes from last week, I saw signs of what I would call agency corrosion. The word “corrosion” feels more precise here than words like “stuckness” or “lack of clarity.” The process keeps going, attention keeps getting spent, decisions keep getting made, but the thing itself is slowly losing its support.

The point is not that things are hard. Things get hard in all sorts of places. What caught me was something else: across several different tracks in a row, the problem did not look like lack of effort. It looked like a slow loss of working shape. From the outside, it is easy to mistake this for normal forward motion. From the inside, it turns out there is motion, but no increase in clarity, honesty, or manageability.

Each of the problems below is small and easy to miss on its own. It was good that they showed up almost at the same time — that helped me notice four forms of the same thing: an unclear contract, a background priority, a false position, and the absence of a stopping condition.

1. Unclear contract

The densest example this week was around Ordo: a rethink of a large scenario structure and the patterns of working with input data and inter-process data.

From the outside, this could look like a scatter of technical improvements. But in practice, almost everything kept collapsing into the same question: where does the system already have an honest contract, and where is it still held together by a half-working convention.

A few clarifications grew out of that one after another. They look local on the surface, but in fact they change the type of clarity:

include is not a language, but a macro for gluing several files into one big file;
inputs is not a local subprocess call signature, but a declaration of global inputs;
profile is not just a convenient bundle of arguments, but a global execution context;
state has to be rigidly separated from the memory of the current run;
if static analysis of a scenario is incomplete, the system should show that explicitly instead of faking confident diagnostics.
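The last clarification in the list can be sketched in a few lines. This is only an illustration, not Ordo's actual code: the names `AnalysisReport`, `Coverage`, and `unanalyzed` are hypothetical. The idea is that the report carries its own coverage, so a clean result can never be confused with an unchecked one.

```python
from dataclasses import dataclass, field
from enum import Enum


class Coverage(Enum):
    COMPLETE = "complete"      # every part of the scenario was analyzed
    INCOMPLETE = "incomplete"  # some parts could not be analyzed statically


@dataclass
class AnalysisReport:
    # Hypothetical report shape; field names are assumptions for this sketch.
    errors: list[str] = field(default_factory=list)
    unanalyzed: list[str] = field(default_factory=list)  # skipped parts, with reasons

    @property
    def coverage(self) -> Coverage:
        return Coverage.INCOMPLETE if self.unanalyzed else Coverage.COMPLETE

    def summary(self) -> str:
        if self.errors:
            verdict = f"{len(self.errors)} error(s) found"
        else:
            verdict = "no errors found"
        if self.coverage is Coverage.INCOMPLETE:
            # Never report a clean bill of health for parts that were not checked.
            verdict += f"; {len(self.unanalyzed)} part(s) not analyzed"
        return verdict


report = AnalysisReport(unanalyzed=["step 'deploy': input resolved only at run time"])
print(report.summary())
```

With this shape, "no errors found" and "no errors found, but one part was not analyzed" are different answers, which is the whole point.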

What matters to me here is not that I managed to come up with several definitions in a row. What matters is that an unclear contract can pass for productive work for a very long time. As long as the system still somehow holds together, there is always a temptation to treat that area as “clear enough” and move on. But after that, almost any new improvement starts costing more than it seems to, precisely because it lands not on a contract but on a foggy convention.

I think this is the first type of agency corrosion: work continues, but it rests on something that has not actually been defined yet.

2. Background priority

The second failure that week was not about product work anymore, but about the structure of the day.

I stated it to myself quite explicitly: job search has to come first. First the search itself, then new platforms if they appear, then public writing, then public edits, then mandatory logging of the result for the day.

What matters here is not the list itself. What matters more is the difference between two modes. In one mode, an important task stands at the center of the day and organizes everything else around itself. In the other, it exists as a moral background: I sort of know that it is the main thing, but in practice it lives in the mode of “don’t forget to do it.” The priority is nominally recognized, but it has no organizational form.

Because of that, everything around it stays busy and even meaningful, while the thing that actually matters slowly turns into a source of background pressure. Activity loses manageability because what matters exists only in words, not in the real order of the day.

3. False position

The third type of corrosion showed up in a text about mobile development.

The problem there was not the topic, and not even the wording as such. The problem was the position the text was trying to speak from. It was slowly assembling itself as if its center were “an experienced person bravely dealing with complexity.” But that was not quite an honest lens.

What would have been more honest was something else: I do not know mobile development especially well, but I do know how to ask questions and test a line of thought.

This matters to me because a false position also does not look like a failure for a long time. On the contrary, at first it can seem more convincing, more solid, more “publishable.” But then the text starts collapsing from the inside. It has to keep supporting the pose instead of unfolding a real observation.

In that sense, agency corrodes not only in systems or schedules, but also in writing. If the position is dishonest, the work continues, the words keep coming, but the text has less and less internal support.

4. Absence of a stopping condition

The shortest decision that week was the pause on Binance.

The reason here is fairly simple: right now there is neither extra money for experiments nor enough clarity about the legal context in the EU. So continuing this track simply because it had once been opened is a bad bargain with my own attention.

What matters here is not the pause itself, but a more general failure. Some tracks continue not because they still have support, but because they have no legitimate stopping condition. They remain in the background, demand attention from time to time, and feed no longer on meaning but on inertia.

This seems to be another separate type of agency corrosion: the thing is not closed, not thought through, and not put on pause — it just keeps quietly eating resources by the mere fact of its own existence.

What these four forms have in common

In all four cases, the same nasty thing is present: activity continues, but the agent is not there.

In one case, the contract is undefined.
In another, what matters did not receive a real priority.
In the third, the very position from which the text is built is false.
In the fourth, the track lives without a right to stop.

In all these cases, it is easy to see activity from the outside. But if you look not at activity, but at manageability, the picture becomes less comforting.

I do not want to present this as some great discovery. It is more an attempt to carefully name a recurring type of failure that is easy to dismiss as a private circumstance when you are inside it.

The value here is not in the thesis itself, but in a more precise cut: agency gets damaged not only by overload, fatigue, or procrastination. Not every kind of busyness means movement, and not every stop means failure. Sometimes what you need is not to push harder, but to see what exactly has started to rust.

2026. Week 27: three decisions against architectural lying

Sergey Shkuratov — Fri, 10 Jul 2026 14:06:53 +0000

The main feeling of the week: almost everything revolved not around one feature, but around Ordo’s boundaries. Not “what else to add”, but “where the right surface is”. The same motif kept coming back: do not hide meaning inside convenience, do not treat architectural holes with one more wrapper, do not mistake documentation for truth.

Ordo is my project about giving an agent workflows and tools in a way that does not hide important decisions inside a shell hack, prose documentation, or a local convention. It is based on the idea that the environment that runs an agent’s helpers should validate run scenarios, explain why a run is impossible, and when something fails, suggest what to do next. Since it is not published yet, this note is specifically about the architectural forks where the system, under pressure from reality, started to show a more honest shape.

Decision 1. Connectors should be fact-first

A connector is a pluggable set of files for a domain: git, GitHub, shell, and so on. It consists of probes (“what is the state of the world right now?”) and actions (“what exactly do we change in the world?”).

At first, by oversight, I mixed them together, and as a result something specific to a particular process started to appear as a neutral fact. The raw engine kept slapping my hand, forcing me to clarify the “small things”. If I sum up the conclusions I reached while building a couple of connectors for the engine, they are these:

a probe should answer one domain question;
a probe should not smuggle in implicit constraints, otherwise you can no longer reuse it properly in another workflow;
sets of facts are fine, but smuggling habits in under the name of facts is not;
a convenience surface turns into trash very easily.

If the engine starts replacing facts with the local morality of a workflow, it becomes non-portable very quickly. Convenient and frequent things are better kept in a process library than canonized in the core engine.
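A minimal sketch of the fact-first split, under loud assumptions: `BranchFact`, `probe_branch`, and `release_allowed` are hypothetical names, and the `repo` dictionary stands in for whatever a real git connector would read. The probe answers one domain question; the rule that belongs to a specific workflow lives outside it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BranchFact:
    """A neutral fact: which branch is checked out and whether the tree is dirty."""
    name: str
    dirty: bool


def probe_branch(repo: dict) -> BranchFact:
    # Hypothetical probe: answers one domain question and decides nothing.
    return BranchFact(name=repo["branch"], dirty=bool(repo["changed_files"]))


def release_allowed(fact: BranchFact) -> bool:
    # Workflow-specific rule, kept outside the probe so another workflow
    # can reuse the same fact with different rules.
    return fact.name == "main" and not fact.dirty


fact = probe_branch({"branch": "feature/x", "changed_files": ["a.py"]})
print(fact, release_allowed(fact))
```

If `probe_branch` itself refused to report anything but a clean `main`, it would be carrying a release habit under the name of a fact, and a review workflow could no longer use it.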

Decision 2. It is better not to delegate the engine’s responsibilities

state is storage between runs. Value expressions are local data transformations. If you hide things like that inside connectors and extra functions, the system starts looking simpler than it really is, but also gets worse at holding its own boundaries.

Again and again there was a temptation to solve everything in some crooked but quick way:

with a connector;
with a shell/helper layer;
with an oversized memory/facts model;
with a little “come on, let’s just add a bit of language”, which would then spread.

The results of the classic fight between orthogonality and convenience:

state is a built-in runtime capability, not an external trick;
value expressions are also part of the runtime surface, but for pure local computation;
implicit normalizations should not be hidden inside git.* or gh.*;
${...} should not become a second language for controlling the world.
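The first result in the list, state as a runtime capability kept apart from the memory of a single run, might look roughly like this. It is a sketch, not Ordo's implementation: the `State` class, the JSON file backing, and `run_once` are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


class State:
    """Storage that survives between runs (a JSON file here, purely as an assumption)."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def load(self) -> dict:
        return json.loads(self.path.read_text()) if self.path.exists() else {}

    def save(self, data: dict) -> None:
        self.path.write_text(json.dumps(data))


def run_once(state: State) -> dict:
    memory = {"scratch": "only for this run"}  # run memory: never persisted
    data = state.load()
    data["runs"] = data.get("runs", 0) + 1
    state.save(data)
    return memory


with tempfile.TemporaryDirectory() as tmp:
    state = State(Path(tmp) / "state.json")
    run_once(state)
    run_once(state)
    persisted = state.load()

print(persisted)
```

After two runs the persisted state holds only the run counter; nothing from `memory` leaks into it.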

The constraint here was always the same: do not let local expressiveness quietly turn into a hidden control layer. The slide toward an ordinary command launcher does not start when the system gets a bit more expressive. It starts when expressions begin replacing control.

If some need really belongs to the process orchestrator, it is more reliable to admit that immediately instead of disguising it as an integration.

Decision 3. Project help should be built directly from code, not from texts about code

This was very much the central decision of the week. If help lives only in texts about code instead of in the code itself, it is guaranteed to fall behind the real runtime and confidently advise the wrong thing.

Almost everything else followed from this decision: how the system discovers its own capabilities, how it publishes its contract, how it builds help in the first place.

This decision also limits expressiveness: useful constructs can be added only up to the point where they can still be honestly published as part of the contract, rather than hidden as internal magic.

If the agent cannot work from the help and has to go digging in the code, that is bad code.
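One way to picture help that is built from code, as a sketch only: callables register themselves, and the help text is derived from their real signatures and docstrings. The `REGISTRY`, the `action` decorator, and `create_branch` are hypothetical; `inspect.signature` and `inspect.getdoc` are standard library calls.

```python
import inspect
from typing import Callable

# Hypothetical registry of capabilities; the names are assumptions for this sketch.
REGISTRY: dict[str, Callable] = {}


def action(fn: Callable) -> Callable:
    """Register a callable so help can be derived from it later."""
    REGISTRY[fn.__name__] = fn
    return fn


@action
def create_branch(name: str, base: str = "main") -> None:
    """Create a branch from a base branch."""


def build_help() -> str:
    lines = []
    for name, fn in sorted(REGISTRY.items()):
        # Signature and docstring come from the code itself, so the help
        # cannot describe parameters that no longer exist.
        lines.append(f"{name}{inspect.signature(fn)}: {inspect.getdoc(fn)}")
    return "\n".join(lines)


print(build_help())
```

Renaming a parameter of `create_branch` changes the help on the next call to `build_help()`, with no separate text to update.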

An accidental but important idea

One insight of the week: it is not enough to simply tell the agent why it cannot continue right now. If the system only puts up a stop signal at that point, it leaves behind not clarity, but emptiness.

A good failure path should not only forbid the wrong next step, but support the right one: what to check now, what to rely on, how to rearrange the line of reasoning. Otherwise failure stops being useful and turns into a dead end, after which the agent is once again left alone with raw context.

This is also a way not to slide into runner-thinking: Ordo should stay useful even in the cases where executing right now is a bad idea.
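A failure path of this kind can be sketched as a refusal that carries more than a reason. Everything here is hypothetical (`Refusal`, `plan_push`, the field names); it only shows the shape: why the run stops, what to check now, and what to do instead.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Refusal:
    # Hypothetical shape of a useful failure; field names are assumptions.
    reason: str                                           # why the run cannot continue
    checks: list[str] = field(default_factory=list)       # what to verify now
    next_steps: list[str] = field(default_factory=list)   # how to proceed instead

    def render(self) -> str:
        lines = [f"Cannot continue: {self.reason}"]
        lines += [f"  check: {c}" for c in self.checks]
        lines += [f"  next: {s}" for s in self.next_steps]
        return "\n".join(lines)


def plan_push(dirty: bool) -> Optional[Refusal]:
    """Return None when the push may proceed, or a Refusal that supports the next step."""
    if not dirty:
        return None
    return Refusal(
        reason="working tree has uncommitted changes",
        checks=["list the changed files"],
        next_steps=["commit or stash the changes, then re-run"],
    )


refusal = plan_push(dirty=True)
if refusal is not None:
    print(refusal.render())
```

A bare stop signal would be the first line alone; the other two lines are what keep the agent from being left with raw context.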

AI Is Making Live Performance a Worse Signal

Sergey Shkuratov — Sat, 27 Jun 2026 09:09:30 +0000

What This Is About

I increasingly feel that many engineering interviews are still structured like a first date. They are good at measuring synchronous work, speed, style fit, ease of contact, and the ability to make a good impression quickly. Evaluation of the consequences of someone’s work is much rarer. The problem is that real engineering work looks less like a first date and more like a long marriage with consequences.

This is not about saying that synchronous work is no longer needed or that all live formats are pointless. The point is narrower, but unpleasant: in AI-assisted development, live performance is getting worse at measuring the part of engineering value that is actually becoming the bottleneck.

As the machine takes over more and more local execution, the bottleneck is no longer how fast a person responds under observation. The bottleneck is the ability to hold the foundations of the task, notice a conflict, record a constraint, explain why one path was chosen and another rejected, and leave behind an artifact.

What I Mean by an Artifact

By artifact I do not mean bureaucracy or a cult of documentation. I mean much more grounded things:

clearly recorded rationale;
a review that shows not only “what is wrong,” but why it is dangerous;
a written invariant that should not dissolve in chat;
a short fork: why one path was rejected and another chosen;
an updated context layer, so that the next decision does not have to be reconstructed again from indirect clues.

These are not decorations around the work. They are the work itself, in the part that becomes more expensive as execution becomes cheaper.

Why This Is Especially Visible with AI

If AI can already quickly assemble a draft, write test scaffolding, suggest a decomposition, generate boilerplate, and plausibly complete what was left unsaid, then quick reaction stops being such a scarce advantage. Something else becomes scarce: the ability to make thinking portable.

In a world without strong agents, it was easier to live on memory, improvisation, and heroic on-the-fly guesswork. Someone understood something, discussed it, patched it on the fly, kept it in their head, and moved on. In a world with AI, the price of that implicitness rises. If an important decision remains only in someone’s head or in a conversation, the next agent will reconstruct it from hints. Sometimes successfully. Sometimes not. But almost always with a risk of drift.

So engineering value increasingly lives not in the moment of “I figured it out,” but in the moment of “I made this reasoning usable for the next step.”

LLMs do not produce meaning out of thin air. They rework what has already been written, read, connected, and thought through. So with AI, it becomes more and more important not just to formulate a prompt quickly, but to come to it with something in hand. With your own notes. With your own seeds of thought. With your own links between ideas. With fuel you have already gathered.

Otherwise the model very easily turns into a generator of smooth, plausible emptiness. This is dangerous not only in writing or research, but also in development: smoothness starts to look like understanding, even when there is no real foundation underneath.

In that sense, artifacts are also food for thinking, whether human or machine. Without them, we get the Fukushima move: instead of pouring cement, people just paste over the problem with paper printed to look like cement.

Artificial Intelligence as an Amplifier of Human Intelligence Problems

Of course, industrial software development could never seriously work without artifacts even before this. Architectural decisions, invariants, rationale, and agreements always had to be recorded somewhere, otherwise the system would start to live on memory and guesses. But now it is as if everyone has a strong execution-oriented agile team in their pocket, and the old problem has simply become sharper: the easier it is to produce the next move, the more expensive the absence of explicit grounds for that move becomes.

Very roughly, this is where the old smart notes intuition comes back: nothing counts except what is written down. Not because everything must be protocolized, but because understanding that remains only in the head rarely survives a pause, is hard to develop further, and is hard to check in a meaningful way. It easily becomes a victim of intellectual amnesia.

In this sense, smart notes are interesting not as a note-taking technique, but as a discipline of thinking. A thought must be left in a form that can be returned to, and enriched with your own meaning. Not just understood for yourself, but left behind as something the next iteration can build on. The next person. The next agent. Your future self.

This fits AI-assisted development very well. There the difference becomes visible especially fast: the difference between “I think I got it” and “I left behind a usable artifact.” In the first case, the next iteration guesses the meaning again from indirect clues. In the second, it inherits already completed intellectual work.

Why Live Performance Deceives So Easily

Performance-heavy evaluation still feels natural. Give a task. Ask the person to think out loud. Watch how they design on the fly. Do live coding. Set a timer. Ask them to explain tradeoffs quickly. There is logic in all of that.

It shows well how a person behaves under observation. How fast they answer. How they speak in real time. How they solve a task in an artificially compressed window. All of that can be a useful signal. But it is not necessarily the best answer to the question: does this person leave behind a structure of thinking that makes the next iteration stronger.

Charisma, Viscosity, and Reliability

There is another unpleasant asymmetry here. A person can be extremely charismatic, full of ideas, quickly aligned with your habits and tempo, and create a very strong feeling of compatibility. Almost an ideal worker — in the moment.

But there can also be another type: slower, more cautious, full of caveats, false starts, returns, and re-checking. In a live interview, this person almost inevitably loses. But this person often leaves behind the more reliable trace. Some qualities that hurt momentary performance improve the final artifact a lot. This person does not collapse uncertainty too early. This person adds a caveat where someone else would stay conveniently silent. This person notices self-contradiction. This person does not sell the solution too early. This person slows down where a mistake will later be expensive.

In other words, performance likes smoothness, while reliability often likes friction. A good impression and a good artifact do not correlate as much as we would like. A charismatic and fast person may leave behind less reliable decisions than a boring and slow one. The first sells confidence; the second more often sells verification.

Of course, there are people who both look brilliant and produce excellent artifacts that can be reused. But they are somewhat rarer than we would like. And the more we like evaluating by the first date, the higher the risk that we systematically overprice brilliance and underprice reliability.

What Follows for Hiring

It seems to me that this leads to an unpleasant but useful thought about evaluating people. Not in the sense that interviews should be abolished or that written format should be declared the only honest one. But in the sense that performance-heavy evaluation more and more often measures not what is actually becoming the most expensive thing.

If local execution becomes cheaper, then what gets more expensive is not impression, but consequences. Not a beautiful answer, but thinking that can be continued. Not just a “strong person,” but a person after whom the next iteration starts not with the question “does anyone remember what is even going on here?” but with an answer to that question.

So the formula that feels closer to me now is this: a strong engineer is not just a person who performs well live. It is a person who knows how to leave behind inheritable thinking.

Not impression. Not an aura of competence. Not a beautiful hour-long conversation. But review, rationale, constraints, explicit forks, and updated context — things that survive the moment and become the input to the next piece of work.

Conclusion

I do not think this cancels the value of synchrony. But I do think that with strong LLMs, the old bias becomes much more visible. By habit we still love performance, even though more and more of real engineering value lies not in solving a problem in front of our eyes, but in artifacts that make thinking portable.

Maybe this is especially visible to me because of my own mix of a solo+AI mode, text-first thinking, and fatigue from hiring theater, and maybe I am overestimating the scale of the shift. But if I am not, then we will have to rethink not only the way we work, but also the way we evaluate people: trust the smoothness of the moment less, and pay more attention to what remains after it.

2026. Week 25: when auth/session stops being a list of tasks

Sergey Shkuratov — Wed, 24 Jun 2026 08:17:03 +0000

2026-W25

This week I worked on auth/session across both backend and frontend. At first, it looked like a normal series of tasks while preparing the codebase for a mobile client: add Bearer authentication, avoid breaking the cookie flow, split the refresh routes, and bring the frontend to the same model.

But as the work went on, it became clear that the main difficulty was not in the endpoints themselves. The real problem was the contract: some of the rules lived in code and habits rather than in an explicitly defined model.

At some point it became clear that a “smart” universal layer was less useful here than a more explicit split. For the cookie flow — /user/*; for the bearer flow — /auth/*, with clearer rules for carrier semantics, protected endpoints, and conflict scenarios.

Once that contract became stricter, it became easier to see the actual mismatches. The backend mostly matched the intended model in terms of session core and fail-closed semantics, but the transport layer was still too permissive in some places: not every cookie+bearer conflict was rejected explicitly, and some restrictions existed more as informal assumptions than as a clear contract.
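The stricter transport rule can be sketched as one function that either names the credential carrier or refuses. This is an illustration under assumptions, not the project's code: the `session` cookie name, the example paths, and the choice to treat unknown routes as errors are all hypothetical, and real frameworks usually treat header names case-insensitively.

```python
from typing import Mapping, Optional


class AuthConflict(Exception):
    """Raised when a request carries credentials its route family must not accept."""


def resolve_carrier(
    path: str, headers: Mapping[str, str], cookies: Mapping[str, str]
) -> Optional[str]:
    """Return "cookie", "bearer", or None (no credentials); raise on any conflict."""
    has_bearer = headers.get("Authorization", "").startswith("Bearer ")
    has_cookie = "session" in cookies  # hypothetical cookie name

    if has_bearer and has_cookie:
        # Reject explicitly instead of silently preferring one carrier.
        raise AuthConflict("both cookie and bearer credentials present")
    if path.startswith("/user/"):
        if has_bearer:
            raise AuthConflict("bearer token sent to a cookie route")
        return "cookie" if has_cookie else None
    if path.startswith("/auth/"):
        if has_cookie:
            raise AuthConflict("session cookie sent to a bearer route")
        return "bearer" if has_bearer else None
    # Fail closed: a route outside both families is never guessed.
    raise AuthConflict("route does not belong to a known auth family")


print(resolve_carrier("/auth/refresh", {"Authorization": "Bearer t"}, {}))
print(resolve_carrier("/user/me", {}, {"session": "s"}))
```

Written this way, each conflict scenario is a visible branch rather than an informal assumption, so a missing rejection shows up as a missing line of code.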

So for me, the point of the week was not so much the number of shipped changes, but the fact that auth/session became much clearer, and the gaps I found turned into concrete follow-up tasks.

In short: the new transport did not just add more work. It pulled a hidden spec out into the open. That is not very visible in the list of closed tasks, but that was the main shift for me.

What stayed in the background

All the while, the job search was there in the background as well, not as a matter of “I should try harder”, but as constant friction with the format itself. Because of physical limitations, neither voice/video nor speed-based evaluations like live coding are workable for me. At the same time, in a normal hiring pipeline, those things are still treated as the standard way to understand a person. So I am running not only into the market, but also into the process itself: it is much better at recognizing a different type of candidate, while still being treated as a neutral norm. In practice, though, it seems it can be either “neutral” or the “norm”, but not both.

Working with AI Means Thinking More, Not Less

Sergey Shkuratov — Sat, 20 Jun 2026 09:33:09 +0000

Yes, this text is long. Yes, it repeats itself in places. I did not clean that up. A text that sounded too smooth while arguing that AI forces you to think more, not less, would be at least slightly dishonest. This is not fast food for quick consumption. And yes, don’t worry: you won’t hear anything especially new here. That is part of the problem too.

There is a popular and very seductive story about AI in software development. Now that the machine can write code, the human gets to think less. You just point it in the right direction, and the model will quickly and cheaply do a significant part of the work on its own. In that picture, AI is primarily an accelerator for code production, and human thinking gradually shifts from necessity to optional extra.

I keep feeling more and more strongly that this description is dangerously wrong.

A more accurate formula for my own experience right now is this: I’m the tech lead, the AI is the entire team in one body. And if you take that metaphor seriously, the conclusion is the exact opposite of the mainstream narrative. Working with AI is not a way to think less. It is a mode in which you need to think more, not less.

Not because the AI is bad. But because it is too good at one very treacherous thing: it confidently and smoothly fills in what was left unsaid.

I’m the tech lead, the AI is the team

At first this metaphor felt like a neat formulation. Now it feels like a literal description of what is going on.

If you treat AI as a very fast and very capable executor, a lot of things become clearer immediately. It really can wipe out months of routine work. It can spin up prototypes quickly, take over test scaffolding, try out alternatives, make local edits, help break a task into parts, and sometimes even suggest a decent direction.

On the surface, this really does look like a silver bullet. Especially if the human knows the stack and can read code. The pace becomes so extreme that old assumptions about development speed can be thrown into the garbage bin of history.

But that is also exactly where the most dangerous substitution begins.

Once you have an executor this strong, the temptation is to reduce your role to something like this: state the overall goal, wave your hand vaguely in the direction of the task, and then mostly stay out of the way. The system is smart, surely it will figure it out. And this is where the tech lead metaphor becomes genuinely useful: a good tech lead does not stop thinking just because the team is strong. On the contrary, the stronger the team, the more expensive mistakes in framing, boundaries, and verification become.

A strong tech lead does not lose their work. The work is still there. It’s just not where people think it is. They do not have to personally write every line, but they do have to hold onto:

the larger goal for the near future;
the boundaries of the change;
the signs that a task is actually complete;
an understanding of what must not be broken on the way there;
and a way to verify that the team did not produce something externally polished but systemically dangerous.

If you map this onto working with AI, it turns out that the core human responsibilities have not gone anywhere. If anything, they have become stricter.

The main risk is not bad code, but loss of ownership of intent

When people talk about problems with AI in programming, they usually discuss fairly simple things: hallucinations, nonexistent functions, weird syntax, weak tests, generic code, unsafe fragments. All of that happens. But that is not the most unpleasant part.

The real trouble starts when the code looks fine.

It is clean. It is tidy. It passes tests. It has sensible variable names. It does exactly what was requested. If you look at it as a local artifact, it may look more than convincing.

That is exactly why the danger here runs deeper than just bad code.

The problem is that when working with AI, it becomes very easy to lose ownership of intent. That means losing the actual link between:

what we are actually trying to achieve;
why the system is designed this way rather than another;
what constraints and invariants exist here;
and how we distinguish a real solution that works in real life from a plausible imitation of a solution.

Once that ownership is lost, a very unpleasant state appears: “it works, but I don’t know why.” And right behind it comes another one: “and I don’t know what will break it.”

This is especially treacherous because the failure does not happen at the moment of generation. At the moment of generation, everything may look excellent. The problem surfaces later — during the next change, at an edge case, on a repeated call, in a partial failure, when several locally reasonable decisions collide and together create systemic fragility.

So the main trap here is not that the AI wrote nonsense. The main trap is that the human stopped being the owner of their own system.

Why AI forces you to think more

This sounds paradoxical only at first glance. In reality it is fairly simple.

The stronger the executor, the more dangerous an unclear framing becomes. The faster the work gets done, the faster mistakes in intent materialize. The better the system gets at filling gaps, the more dangerous every unstated assumption becomes.

Earlier, a good developer could ask a follow-up question or at least avoid rushing into implementation. Now the model, in some sense carrying the experience of the whole world, fills in the blanks on its own. The further this goes, the better and more plausible the filling becomes, but not necessarily in a way that fits this specific context. And it does all of that silently.

So AI does not lower the demands on thinking. It makes thinking more mandatory and more disciplined.

Working with AI leaves you with no real choice but to:

write the goal down instead of holding it as a vague feeling;
separate the larger goal from the local request;
know in advance what counts as done;
define a contract for each step: inputs, outputs, errors, edge cases;
not accept the proposed decomposition automatically;
not accept code based on external impression alone;
not stop at the happy path;
read diffs and run checks;
and keep in mind what stands above the code: its behavior in the real world.

So with AI, it is not only the speed of code production that changes. The very point where human intelligence gets applied changes. It used to be enough to be the person who writes well by hand. Now you increasingly have to take pride in not losing the foundations of the task when the speed gets too high.

Why “it seems to work” is a trap

One of the nastiest effects of working with AI is that it very easily produces solutions that look complete from the outside.

The feature exists. The behavior exists. The types are in place. There are some tests too. So of course it starts to feel done.

But that feeling can be false in the most dangerous sense. Because external functionality and engineering reliability are not the same thing.

You can get code that executes the stated scenario and still:

violate existing project conventions;
create unnecessary complexity;
bypass an existing component instead of reusing it;
introduce a fragile assumption;
miss an important failure mode;
fail to preserve a domain invariant;
and in doing so buy future maintenance at the cost of today’s speed.

This is especially unpleasant because without review and without a full run of the checks, you suddenly end up almost in the role of a client for whom someone quickly assembled a “sort of working” product. From the outside, it’s alive. Inside, you don’t trust it.

Then three scenarios remain, and all of them are bad. Either you live with anxiety. Or you postpone the analysis until the first incident. Or you start digging immediately, but only after speed has already produced the illusion of completion.

That is why AI, for me, has turned from permission to think less into a demand to think more.

A local prompt does not carry the whole meaning of the project

There is another reason why AI-assisted development forces more thinking. A local request almost never carries all the context that is actually needed for a good solution.

What usually enters the system is not a full model of the project and not a careful list of invariants. What arrives is a local request: add a field, allow an action, change a state, insert a button, fix a flow, support a new behavior. Everything else has to be reconstructed.

Before, a human would at least notice that something essential was missing. They would slow down, clarify, remember past decisions, ask colleagues, dig into documentation, go read code. AI, by contrast, is very good at taking a narrow slice of framing and quickly turning it into a locally convincing solution.

That is the risk. Not that the model cannot do anything, but that it can continue too smoothly in places where the human should have stopped and asked: “Wait, on what basis do we think this is correct at all?”

That leads to an important point: a ticket, a prompt, or a feature request is usually not a specification. It is only a trigger. Pretending it contains the project is exactly how drift begins.

Which means that if the human gives the model nothing beyond the local request except maybe hope, then the model has to reconstruct everything else from indirect clues:

boundaries;
domain agreements;
sources of truth;
prohibitions;
the rationale behind previous decisions.

That reconstruction runs on hints. Sometimes it succeeds. Sometimes not. But almost always with a risk of drift.

The tech lead holds not only the goal, but anti-drift discipline

If we go back to the tech lead metaphor, it becomes clear that the role in AI-assisted development is even broader than just assigning tasks.

The tech lead is needed not only to say “this is what we are doing.” They are also needed so that the project does not quietly start rewriting its own foundations piece by piece.

AI is very good at helping with local execution. But precisely because of that, someone must hold onto the things that cannot be delegated wholesale:

which rules count as normal in this project;
which constraints must not be bypassed silently;
which decisions have already been made and why;
where the system’s real invariants live;
which compromises are acceptable and which are not.

So the human in the tech lead role becomes not just a source of tasks, but a carrier of anti-drift discipline.

This is the discipline that stops speed from turning the project into drift.

It requires very boring things:

writing and rereading the goal;
keeping steps manageable;
recording important decisions in artifacts rather than leaving them in chat;
reviewing not only the result but the line of reasoning;
checking not only new tests but old invariants;
asking not only “does it work?” but also “what must not happen here at all?” and “could something else quietly happen here too?”

These boring things turn out to be some of the most expensive engineering work there is.

What the tech lead actually does when working with AI

If you try to reduce all of this to a very practical loop, the human in this model is left with at least the following responsibilities.

1. Hold the larger goal

Not just the local prompt, but a longer line: what exactly are we trying to improve, what counts as success, and what matters most right now.

Without that, AI easily starts optimizing local form instead of global meaning.

2. Break the work into isolated parts

Not so large that you lose verifiability, and not so small that you drown in micromanagement.

Good decomposition here is not bureaucracy. It is a way not to overload either yourself or the model.

3. Set boundaries

What are we doing in this change, and what are we consciously not doing? Which parts of the system are in scope, and which are not? Where is a temporary solution acceptable, and where is it not?

4. Define the signs of done

Not in the sense of “well, it kind of works,” but in the sense of a verifiable contract: which inputs we support, which outputs we expect, which errors are acceptable, which edge cases must be preserved.

5. Read everything important

You do not have to manually write everything yourself. But you do have to read everything important: diffs, new decisions, key tests, controversial spots.

6. Run the existing checks

Do not stop at “the generated code passes its own tests.” All the checks that already exist matter, because those are what catch regressions against the old world.

7. Turn decisions into artifacts

If an important decision lived only in your head or in a conversation, that is a bad decision from the perspective of long-term work with AI. Tomorrow’s agent, or tomorrow’s version of you, will start reconstructing it from scratch — and will most likely get it wrong.

Why this matters beyond code

This is bigger than code.

The more AI can generate, the more valuable the ability becomes not to be a typing machine, but to:

hold the task before the code exists;
understand which context is mandatory;
see long-tail consequences;
distinguish what is locally correct from what is systemically dangerous;
leave behind an artifact that helps the next human or the next agent avoid reinventing the foundations.

That is why the claim that “you need to think more, not less” is not some old craftsman whining about new tools. It is a literal description of the new work.

AI removes part of the mechanics from the human. But everything tied to intent, boundaries, consequence-checking, and preserving meaning becomes more expensive, not less.

What changes in the feeling of the work itself

The biggest shift here is not even procedural. It’s in your head.

Before AI, you could hold on for a long time to the image of a programmer as a person who mostly writes code and sort of thinks around that process. Now it increasingly feels like writing code is no longer the central part of the role. The central part of the role is holding the system of thinking around the code.

So my work is less and less described as “I write the feature” and more and more as:

I hold the intent;
I define the boundaries;
I check whether understanding has been replaced by external plausibility;
I make sure the project does not drift;
I turn decisions into forms that will survive both tomorrow’s me and the next agent.

In that sense, the tech lead metaphor is useful not only as a description of process. It is useful because it protects you from lying to yourself. It reminds you that once you have an extremely strong executor, the temptation to relax grows faster than the right to relax.

Thinking, you lead. Stop thinking, you get led.

If I reduce all of this to a short version, my current conclusion is this.

Working with AI is not a mode in which you finally get to stop thinking. It is a mode in which you need to think more, not less.

More — because speed increases the price of ambiguity.

More — because local framing rarely carries all the necessary context.

More — because plausible code is more dangerous than obviously bad code.

More — because someone still has to hold the goal, the boundaries, the invariants, and the method of verification.

That is why the formula “I’m the tech lead, the AI is the whole team” still feels like the most accurate one to me.

It does not romanticize AI, and it does not let the human off the hook. It returns responsibility to where it belongs: the human who must not only start the work, but also understand what exactly is being started, why, and by what evidence the result will be shown to deserve trust rather than merely looking functional.

The cruel irony is that the AI almost certainly already knows all these subtleties better than we do. If you ask it properly, it will tell you about project management, and code review, and contracts, and regressions, and all those old, good ways of not breaking a system through sheer stupidity. It will even suggest the right precautions.

But we still have to ask.

We still have to frame the question.

We still have to notice that this is a place where a question is needed at all.

And we still have to do the thinking.

2026. Week 24: mobile as a test for backend honesty

Sergey Shkuratov — Fri, 19 Jun 2026 08:57:50 +0000

This week I wanted to “just start mobile development” for my checklist service: build a straightforward mobile client and cover the basic scenarios. The first questions looked practical: do I need a full editor on the phone, is the current API enough, and which stack and working setup should I choose?

But very quickly it became clear that this was not really a story about one more client. Mobile became a useful spotlight: it showed the places where I had not fully described the system as a contract, and where it still depended on browser defaults.

1) Product framing: a phone is not a second “full editor”

The request for a “full mobile editor” sounds logical until you break it down into real situations. In practice, it mixes two different user scenarios:

edits on the go: quickly fix text, mark a step, replace one item, continue a checklist in the field;
full editing: structural changes, bulk edits, careful work with formatting and visibility rules.

Once I separated these, the frame for the first version became much more realistic: the mobile client should first of all run checklists and support light edits, not fully copy the web editor.

This matters not because of “laziness”, but because of the cost of the tail. If you aim for full feature parity, you almost immediately pull in topics like offline sync, change conflicts, versioning, and complex UX for large edits. On a phone, these things are more expensive both to build and to use.

2) The API is “mostly enough”, but the bottleneck was not in the business methods

Then came the good news: a quick audit showed that for the first online version of the mobile client, the business contract was already mostly there. Auth, templates, checklists, instances, invitations — all of this exists. There is OpenAPI and there are types, so I do not have to guess from loose documentation.

At this point it would be easy to stop and say: “we can start building the app.”

But the real knot was not there.

The problem was that the current auth flow and session lifecycle in my project are tightly tied to the browser: cookies and expectations about how a client behaves inside a browser session.

For the web this is natural and convenient. For mobile it is not “impossible”, but it is a boundary that quickly turns into architecture debt: a mobile client should not have to adapt itself to the browser habits of the system.

So the paradox of the week sounds like this: the business API already looks mature, but the system as a whole is still not fully ready to honestly serve the same contract in different client environments.

3) The main shift: the question is not “how do I adapt this for mobile?”, but “how do I describe session behavior independently from transport?”

After several iterations, it became clear that mobile is not a demanding client that needs exceptions. It simply forces me to name what should already be defined:

what a session means in the system (in domain terms, not browser terms),
how the client gets and refreshes the right to make requests,
which errors are expected and what the client should do in response,
how all of this should be tested.

As a result, the frame became stronger: not “two kinds of auth”, but one shared session model and two ways to deliver that model depending on the environment:

cookies — natural for web;
bearer token — natural for a native mobile client.

This is no longer just cosmetics and not only “preparation for the future”. It is a way to bring the contract to a state where it does not depend on implicit assumptions about transport.
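A minimal sketch of "one session model, two transports", in Python to match the backend. Everything named here (`SessionCredential`, `resolve_credential`, the `session_id` cookie name) is hypothetical and not taken from the project. It only shows a single resolver that accepts either transport and hands the rest of the system the same value.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SessionCredential:
    """Transport-independent view of 'the client presented a session'."""

    token: str
    transport: str  # "bearer" or "cookie"; kept only for logging and tests


def resolve_credential(
    headers: Mapping[str, str],
    cookies: Mapping[str, str],
    cookie_name: str = "session_id",  # assumed cookie name, for illustration
) -> Optional[SessionCredential]:
    """Return the presented credential, or None for an anonymous request.

    Assumes header names were already normalized by the framework.
    """
    scheme, _, value = headers.get("Authorization", "").partition(" ")
    if scheme.lower() == "bearer" and value.strip():
        return SessionCredential(token=value.strip(), transport="bearer")

    cookie_value = cookies.get(cookie_name)
    if cookie_value:
        return SessionCredential(token=cookie_value, transport="cookie")

    return None
```

Code after this point (session lookup, expiry, refresh) would only see a `SessionCredential`, so the rules for an expired or invalid session can be written once rather than per client.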

And this is where something unpleasant but useful came up: before that, I had not separated the layers clearly enough. I had “it works on the web” in my head, and that was enough. But as soon as a second class of client appears, it becomes obvious that some decisions are not really decisions at all, but fog.

To clear that fog, I had to lift and define several system-level artifacts independently of the auth transport:

a description of login and session refresh flows (what the client does, what the server does);
behavior rules for common failures (network error, expired session, invalid token, repeated request);
a test matrix: which combinations of environments, clients, and scenarios must be tested for this to count as “real readiness”, not just a feeling of readiness.
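The test matrix in the last item can be made concrete with very little code. The sketch below is an assumption about its shape, not the project’s actual matrix: the client labels and scenario names are invented for illustration, and a third axis for environments could be added the same way.

```python
from itertools import product

# Hypothetical axes, for illustration only.
CLIENTS = ["web-cookie", "mobile-bearer"]
SCENARIOS = [
    "login",
    "session_refresh",
    "network_error",
    "expired_session",
    "invalid_token",
    "repeated_request",
]


def build_matrix(clients, scenarios):
    """Every (client, scenario) pair that must be exercised."""
    return set(product(clients, scenarios))


def uncovered(matrix, tested):
    """Cells of the matrix that no test has claimed yet, in stable order."""
    return sorted(matrix - set(tested))


matrix = build_matrix(CLIENTS, SCENARIOS)
tested = [("web-cookie", s) for s in SCENARIOS]  # the "it works on the web" state
gaps = uncovered(matrix, tested)
```

With only the web column tested, `gaps` lists every mobile cell, which is the difference between readiness and a feeling of readiness expressed as data.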

4) An episode about the editor and the boundary of “where an agent is useful”

Against that background, there was also a revealing local episode inside the web checklist editor. For a long time I could not catch a bug: the menu behaved badly in a situation where there was not enough space below it to open normally. The problem was clearly somewhere in low-level JS/DOM/layout mechanics — a layer I know less well than the product and architecture side.

At some point I stopped pretending I should do “manual debugging until victory” and asked an agent to bring up Playwright and run the loop on its own: reproduce, measure, and test hypotheses.

This turned out to be useful not only for the specific bug, but also as a process calibration. If the task is mostly instrumental work — reproduction, changing conditions, collecting facts — it makes sense to give the machine the role of executor in the research loop. The human still holds the frame: formulates what exactly we are looking for and prevents the process from drifting into random fixes.

Weekly result

The week started as an attempt to get into mobile development, and ended with a more important result: I saw the foundations of the system more clearly.

First, I narrowed the product frame of the mobile client to something viable: running checklists plus light edits, without pretending it should do everything the web version does. Then it became clear that business API readiness is only half of the story. Real readiness starts where auth and session behavior are described as a contract that lives honestly both in the browser and in a native app.

And maybe the main conclusion is this: new client environments are valuable not because they require a different UI or stack. They are valuable because they bring implicit assumptions into the light — especially around security, session behavior, and invariants, where “it works on the web” is not enough.

Not Every Lint Warning Is Cosmetic

Sergey Shkuratov — Sun, 14 Jun 2026 08:32:54 +0000

How old tools improve the work of new (non)humans.

I noticed this pattern while working through a series of backend cleanup tasks raised by pylint, flake8, and mypy: some warnings that I wanted to dismiss as housekeeping kept turning out to be the shortest path to hidden contracts in the code.

As a rule, they looked like small cleanup tasks: naming style, function signatures, module size. But once I started fixing them, it became clear that the problem was not cosmetic. The check was simply the first thing pointing at a place where an important contract in the code was still resting on an unspoken assumption.

For Python backend development, this is especially noticeable in places where the code already looks plausible and locally reasonable — including cases where the draft was assembled quickly with the help of an LLM. In those places, a warning is sometimes useful not as a demand to “make it cleaner”, but as an early signal: this is where it is worth checking boundaries, invariants, or the shape of the contract.

Below are four short cases where cleanup turned out to be not quite cleanup.

1. Enum cleanup that exposed a database contract

At first this looked almost cosmetic: I was simply normalizing naming style.

class Operator(str, Enum):
    eq = "="
    lt = "<"

then:

class Operator(str, Enum):
    EQ = "="
    LT = "<"

But that was not the end of it. It turned out that the real contract lives not only in the Python enum, but also in which values SQLAlchemy reads from and writes to the database. So the real fix ended up looking like this:

operator: MappedColumn[Operator] = mapped_column(
    SQLEnum(Operator, name="operator_enum", values_callable=_enum_lower_names)
)

So the warning was not really about enum style. It was about persistence semantics. From the outside it looked like cleanup, but in practice it forced me to make the contract between the Python enum and stored values explicit.

2. An extra parameter that turned out to be an ownership check

The next signal looked even more mundane: an extra parameter and a messy signature.

async def delete_condition(db: AsyncSession, item_id: UUID, condition_id: UUID) -> None:
    item = await db.get(ChecklistItem, item_id)

When I dug into it, it turned out the warning was not really about the shape of the function. It pointed at an under-specified domain contract. After the fix, the function explicitly required the item_id and checklist_id pair:

async def delete_condition(
    db: AsyncSession,
    checklist_id: UUID,
    item_id: UUID,
    condition_id: UUID,
) -> None:
    _ = await get_draft_item(db=db, item_id=item_id, checklist_id=checklist_id)

What mattered here was not the signal about the signature itself, but the fact that it led to an ownership check. After the fix, what became explicit was not the “neatness” of the function, but the dependency between item_id and checklist_id.
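The body of `get_draft_item` is not shown, so the following is only a sketch of the kind of check such a helper could perform, written against an in-memory dict instead of a database session. `ItemNotFound`, `get_owned_item`, and the simplified `ChecklistItem` are hypothetical names.

```python
from dataclasses import dataclass
from uuid import UUID, uuid4


class ItemNotFound(LookupError):
    """Raised when the item is missing or belongs to another checklist."""


@dataclass(frozen=True)
class ChecklistItem:
    id: UUID
    checklist_id: UUID


def get_owned_item(
    items: dict[UUID, ChecklistItem], item_id: UUID, checklist_id: UUID
) -> ChecklistItem:
    """Return the item only if it belongs to the given checklist."""
    item = items.get(item_id)
    # One error for both cases, so a caller cannot probe for foreign item IDs.
    if item is None or item.checklist_id != checklist_id:
        raise ItemNotFound(f"item {item_id} not found in checklist {checklist_id}")
    return item


checklist_a, checklist_b = uuid4(), uuid4()
item = ChecklistItem(id=uuid4(), checklist_id=checklist_a)
store = {item.id: item}
```

Here `get_owned_item(store, item.id, checklist_a)` returns the item, while the same `item.id` paired with `checklist_b` raises `ItemNotFound`. The pair is what carries the ownership rule; either ID alone does not.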

3. A module warning that turned out to be about boundaries

The third case was about structure. A warning about module size and shape would have been easy to dismiss as a linting nitpick. But the monolithic schemas.py had in fact stopped being maintainable.

# app/schemas.py
class HTTPErrorDetail(BaseModel): ...
class AuditEvent(BaseModel): ...
class ChecklistItemCreate(BaseModel): ...
class OrganisationCreate(BaseModel): ...
class TemplateListItem(BaseModel): ...

...and so on for hundreds of lines.

Instead of suppressing the warning, I turned the module into a proper package:

# app/schemas/__init__.py
from .audit import AuditEvent
from .checklists import Checklist, ChecklistItem, ItemCondition
from .enums import Operator, PropName
from .orgs import Organisation, Invite, Grant
from .templates import TemplateListItem, TemplatePublic

So the problem was not style as such. It was that the codebase was already asking for clearer boundaries. In this case, the file-size warning was not noise, but an early sign that the module needed to be split along responsibilities.

4. Docstrings that turned out to hold a local contract

Another case initially looked purely documentation-related. pylint and flake8 required proper docstrings for functions, methods, and classes, and from the outside this is easy to read as hygiene for the sake of hygiene.

But in several places the fix worked differently. For example, in checklist_router.py the docstring stopped being a generic phrase about a “router for Checklist entity” and turned into a short description of the real lifecycle contract: that published versions are immutable, that draft creation is explicit, that editor mutations operate only on the draft layer, and which denial/error semantics are considered normal here.

A similar thing happened with middleware. There the docstring started to record not a paraphrase of the method, but important boundary conditions: when Session-Log-ID is required, why the middleware returns 400, where exactly the request-scoped DB session lives, and why this is ASGI middleware rather than BaseHTTPMiddleware.

The framing from Arun Rajkumar’s post about agent-driven workflow is also useful here, because in that framing code and related artifacts act as a source of working context. Read that way, a meaningful docstring is not decorative documentation, but another explicit form of a local contract, useful not only to a human, but also to an AI agent: it no longer has to reconstruct those constraints from the implementation every single time.

Of course, not every docstring has that effect. If a checker only pushes on form (imperative mood, blank lines, section headers), that is more a matter of formatting discipline. It is still useful for consistency, but it is a different kind of value. Where a docstring pins down preconditions, boundaries, error semantics, or lifecycle assumptions, a documentation warning stops being pure cosmetics.
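To illustrate the difference, here is a hypothetical function, not taken from checklist_router.py, whose docstring states lifecycle rules of the kind described above instead of paraphrasing the function name. The dict-based `version` shape and the `PublishedVersionError` name are assumptions made for the example.

```python
class PublishedVersionError(Exception):
    """Raised when a caller tries to mutate a published version."""


def rename_item(version: dict, item_id: str, title: str) -> dict:
    """Rename one item in a draft version and return the updated copy.

    Contract:
    - Published versions are immutable: passing one raises
      PublishedVersionError.
    - Draft creation is explicit: this function never creates a draft.
    - An unknown item_id raises KeyError; the input is never modified.
    """
    if version["status"] != "draft":
        raise PublishedVersionError("published versions are immutable")
    if item_id not in version["items"]:
        raise KeyError(item_id)
    return {**version, "items": {**version["items"], item_id: title}}


draft = {"status": "draft", "items": {"i1": "Pack bags"}}
renamed = rename_item(draft, "i1", "Pack the bags")
```

Each line of that docstring corresponds to a branch in the body, which is what makes it a contract a reviewer or an agent can check rather than decoration.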

What follows from this

Across all four cases, what matters is not the warning itself as a ritual, but the move from implicit to explicit. In one place I had to make persistence semantics explicit, in another an ownership boundary, in the third module boundaries, and in the fourth a local API or middleware contract.

That is probably the most useful way to read such signals. Not as an order to make the code neater, but as a reason to check whether an important contract is still resting on a silent “this is obvious anyway.” This does not mean every warning hides a deep problem. But if it touches storage format, ownership, input shape, or a module boundary, I would no longer treat it as pure cosmetics.

2026. Week 23: a UI task that stopped being small

Sergey Shkuratov — Wed, 10 Jun 2026 08:41:33 +0000

I Thought This Would Be a Local UI Task

This week, I thought I was solving a fairly narrow task: how to show group settings more neatly in the new checklist editor. The question looked local enough: how to read a group’s state right in the item row, and how to provide a convenient entry point for editing.

[Screenshot: the Morning chaos row is active, and the group of indicators on the right takes up a significant part of the row’s space.]

At first, this looked like a normal UX improvement. I needed to find a form of presentation that would not force the user to open the detail panel too often, while also not overloading the item row.

The First Version Turned Out Too Noisy

My first move was toward a richer inline representation. I wanted to show a set of signals directly in the group row, so that its state could be read quickly.

It became clear quite fast that this was the wrong direction. This UI did not make reading easier; it added noise. Instead of a more “talkative” interface, I had to move in the opposite direction: remove extra elements and compress the state representation.

In the end, the solution narrowed down to one summary pill in the group row and one shared popover for editing. That became the main UX lesson of this part of the week: in a dense editor UI, extra signals stop helping very easily.

Then a Deeper Problem Surfaced

The story did not end there. While I was working on the interface, it became clear that the problem was not only about how it looked, but also about how it worked at all.

During the work, an ambiguity in the semantics of visibility surfaced unexpectedly. Initially, I made it so that if a field was invisible, conditions like “if another item has such-and-such value” could make it visible. But the new interface showed this approach poorly, to the point that even I, the author, could not quickly understand what was going on when looking at the editor.

After several experiments, I realized that the problem was not in the editor UX but in the semantics. In the end, I had to invert it: effective_visible = is_visible && conditions_pass. In other words, an item that is invisible by default cannot be made visible by any conditions. But if a visible item also has conditions, then the item can become invisible.
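The inverted rule is small enough to state as a function. This is a sketch, not the project’s code: it assumes that several conditions on one item are combined with AND, which the post does not specify, and the `Condition` type is invented for the example.

```python
from typing import Callable, Iterable, Mapping

# A condition looks at the current values of other items and says pass/fail.
Condition = Callable[[Mapping[str, object]], bool]


def effective_visible(
    is_visible: bool,
    conditions: Iterable[Condition],
    values: Mapping[str, object],
) -> bool:
    """effective_visible = is_visible && conditions_pass."""
    if not is_visible:
        return False  # conditions can never reveal a hidden item
    # Assumption: all conditions must pass; no conditions means "pass".
    return all(condition(values) for condition in conditions)


def has_pet(values: Mapping[str, object]) -> bool:
    return values.get("pet") == "yes"
```

With this rule, `effective_visible(False, [has_pet], {"pet": "yes"})` is `False`, and a visible item with `has_pet` attached is shown only while the other item’s value is `"yes"`. Conditions can only narrow visibility, which is what makes the editor state readable at a glance.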

A typical breaking change, and good that it happened early.

After That, the Most Honest Part of the Work Began

Once the rule became clearer, the work was not finished. After all the edits, a more unpleasant but also more honest phase began: I had to verify that the system really behaved the way it was now described.

This is where the E2E part began. It looked much less elegant than the idea of the solution itself. There were failures like Expected: "hidden" Received: "visible", there were timeouts around the drawer and popover, and there were situations where everything already looked fine locally, but the tests answered: no, the behavior is still not fully assembled.

In the end, E2E became the real finish line of this task. Not the moment when the interface already looked convincing, but the moment when the target scenarios started to converge in a verifiable form.

In Parallel, Another Shift Was Taking Shape

There was another line of work this week as well — backend cleanup. I do not want to expand on it here: it will get a separate text.

Against that background, and also against the background of publishing LLM-Assisted Deploy: You Save Typing, Not Thinking, another thought started to come together more strongly for me. A meaningful part of the work should not disappear into local edits, commits, and one-off sessions. It should be brought to the state of a public artifact: a text, a case, a note — something from which one can later read not only the result, but also the line of thought, the constraints, and the way of verification.

What I Take Away From This Week

In short, this was a week in which a small UI task refused to stay small.

First, it ran into the need to simplify my own solution. Then it brought out behavior that had not been fully thought through or clearly stated. Then it demanded that this behavior be proved and shown separately through tests.

That is probably why the week feels coherent. Its center was not that I simply made one more product improvement, but a more general movement: from a local change to an explicit contract, from an explicit contract to verification, and then further to proving and showing the work so that it would not dissolve inside the code.

LLM-Assisted Deploy: You Save Typing, Not Thinking

Sergey Shkuratov — Sat, 06 Jun 2026 13:24:08 +0000

TL;DR

An LLM helped me put together a deploy in about an hour, but only because I did not hand the deploy itself over to it.

What happened

I know how to write deploy scripts. Not in theory — I’ve done it many times, and usually I just sit down and write one.

This time the problem was not how to write it. The problem was time. And I had absolutely no desire to break anything along the way.

So I did not play the game of “the LLM will neatly put this together for me.” In semi-automated deployment, that is an unusually bad idea. Instead of a beautiful result, you get a beautifully broken site lying on the floor.

I did something else.

I defined the script structure myself, based on my previous experience. I spelled out what exactly the deploy had to do, where the boundaries were, what control points existed, what counted as success, and what was a reason not to proceed. Then I fed that text to the LLM and had it write the bash script.

In other words, the model helped where the cost of error was relatively cheap: typing code.

Everything that actually had a price stayed manual: decisions, review, checking the logic, and running it in a safe environment.

In the end I got two scripts: deploy-preprod.sh and deploy-production.sh.

That separation mattered to me. Production should not “just go” directly. First the preprod gate, then production. And I also kept a standard trick of mine for production deploy confirmation — a kind of textual captcha: the script prints a random code to the console, and nothing moves until you type it in by hand. It is not protection from a malicious hacker. It is protection from an overly easy, overly mechanical “fine, let’s just ship it already.”
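The real scripts are bash and are not reproduced in the post. The confirmation trick itself can be sketched in a few lines of Python; the function names, the six-character code, and the prompt text are all assumptions.

```python
import secrets
import string


def make_confirmation_code(length: int = 6) -> str:
    """Random code that has to be retyped by hand before production."""
    alphabet = string.ascii_uppercase + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


def is_confirmed(expected: str, typed: str) -> bool:
    """Exact match after trimming whitespace; anything else aborts."""
    return typed.strip() == expected


def confirm_production_deploy(read_line=input, write_line=print) -> bool:
    """Print a fresh code and require the operator to type it back."""
    code = make_confirmation_code()
    write_line(f"Type {code} to deploy to production:")
    return is_confirmed(code, read_line())
```

Because the code changes on every run, it cannot be piped in from shell history or answered from muscle memory, which is the whole point of the gate.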

It took four iterations. Honestly, that is not a sign that the approach was bad. Quite the opposite.

Across those passes, exactly the usual crap surfaced — the kind that makes deployment dangerous: typos, wrong variables, bad log parsing. Nothing conceptually interesting. The code itself looked fairly brisk. The irony was that the code had been written by the LLM — perhaps it was in a hurry too and did not want to get distracted from taking over the world.

My verification was not in the “well, looks reasonable enough” genre either. After the containers start, the script automatically scans the docker-compose logs for obvious signs of errors. Then comes a manual smoke check: log in and walk through the key flows. I did think about e2e, but for this task I decided it would be overkill. What I needed at that moment was not a perfect automation setup, but a reproducible deploy with explicit control points and predictable failure behavior.

If you ground this in the actual scripts, the line between “accelerate” and “hand over control” becomes pretty clear:

preprod reproduces an exact copy of the production site before the deploy itself, which is feasible because the site is small;
production refuses to run with latest;
production refuses to do a no-op deploy on the same tags;
before production, the preprod gate runs and fails if preprod is unhappy about anything;
there is an explicit failure if Postgres does not come up;
there is a log check for error, exception, traceback, panic, and fatal;
there is manual confirmation before production.
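Three of the gates in this list are simple enough to sketch. The real scripts are bash; this Python version is an illustration under assumptions: a single image tag instead of several, invented function names, and case-insensitive substring matching for the log check (so `errors` also matches, on the theory that a false alarm is cheaper than a miss).

```python
import re

# The marker words come from the list above; how they are matched is assumed.
ERROR_MARKERS = re.compile(r"error|exception|traceback|panic|fatal", re.IGNORECASE)


def preflight_problems(current_tag: str, new_tag: str) -> list[str]:
    """Reasons to refuse a production deploy before touching anything."""
    problems = []
    if new_tag == "latest":
        problems.append("refusing to deploy the floating tag 'latest'")
    if new_tag == current_tag:
        problems.append("refusing a no-op deploy: this tag is already running")
    return problems


def suspicious_log_lines(log_text: str) -> list[str]:
    """Log lines that contain any of the error markers."""
    return [line for line in log_text.splitlines() if ERROR_MARKERS.search(line)]


logs = (
    "web | started\n"
    "db  | FATAL: password authentication failed\n"
    "web | Traceback (most recent call last):"
)
```

A deploy wrapper would stop as soon as `preflight_problems(...)` is non-empty, and again if `suspicious_log_lines(logs)` returns anything after the containers start.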

So the LLM did not “do the deploy.” It helped me assemble the code around a structure and a set of constraints that I had already defined.

All in all, it took about an hour.

Takeaways

The LLM saved me time, but not where many people dream it will. Not on responsibility. Not on the engineering decision. Not on verification. It compressed the draft phase — the keyboard pounding. Everything critical was still manual engineering work.

For tasks like this, that is what a sane LLM-assisted mode looks like: do not delegate risk to the model; use it to strip away the mechanical part so more attention remains for architecture and for the control points where mistakes are actually expensive.

A minute of thought increases uptime by an hour.

Documentation is code: LLMs don’t actually read it — and honestly, neither do we

Sergey Shkuratov — Tue, 02 Jun 2026 09:46:17 +0000

I learned this the hard way: when an LLM says “it matches the docs”, it can still be wrong for a boring reason—it didn’t read the part that matters.

I’m building a small SaaS (checklists as a service). No users yet. Plenty of documentation already. And at some point my docs stopped being an asset and started turning into a liability.

This is the story of how I rebuilt my documentation so that an LLM could actually read it end-to-end—and how that restructure helped me.

The moment I got scared: “silent misses”

The docset grew. I kept asking the LLM to verify tasks against it.

And then I noticed a pattern that felt worse than hallucinations.

Not “the model invented stuff”, but “the model confidently said it matches”—while quietly missing exceptions, prohibitions, and thresholds. Keyword scanning instead of reading.

I called it silent drift: code slowly moves away from conventions, while the invariants remain only in my head.

In a project with roles, audit, and CI/CD security gates, that kind of drift isn’t “just messy docs”. It’s how you lose the ability to implement and review changes consistently.

I couldn’t do it manually (and I couldn’t delegate it fully)

I knew I had to redo the documentation. But I also knew I couldn’t realistically do it all by hand.

At the same time, I couldn’t just tell an LLM: “Rewrite everything according to approach X.” Not enough context, too easy to lose control.

So I went with a third option: build a reliable process out of unreliable components—me + an LLM.

Step 1: I separated my docs into domains (and forced the model to actually read)

First, I extracted domain areas from the old documentation: the vocabulary I was using to describe the project and its parts. I tried to keep domains mutually independent (so the overall framework still fits in my head).

Then I ran the same loop for each domain:

I asked the LLM to read all old docs carefully and extract requirements for that domain.
I moved those requirements into a dedicated file and gave each one a project-unique ID.
I asked the LLM to reread everything and check internal consistency.
I fixed contradictions (sometimes by cross-checking the code).
I repeated the consistency check (this caught small but nasty issues).
I reviewed diffs manually to catch what was missing or implicitly assumed.

This took ~4 days (about 4 hours/day). Exhausting, but still much faster than doing it without an LLM.

Step 2: I hit a wall—because I mixed “requirements” with “verification”

After the requirements pass, I wanted to extract scenarios (the thing that connects domains and requirements).

And suddenly the model started to stumble and hallucinate again.

The fix turned out to be painfully simple: my requirements were still “too thick” because they contained verification sections.

Verification text was useful, but it didn’t belong inside requirements files. It confused the extraction step.

So I separated verification into its own files per domain. After that, scenario extraction became stable again.
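The split itself can be sketched in a few lines. This assumes the docs are Markdown and that verification text lives under a heading titled "Verification". Both are assumptions made for the example, and the function name is hypothetical.

```python
def split_verification(text: str, heading: str = "verification") -> tuple[str, str]:
    """Split a document into (requirements, verification) by heading title.

    Lines under a heading whose title equals `heading` (case-insensitive) go to
    the verification part. Any other heading switches back to requirements.
    """
    requirements: list[str] = []
    verification: list[str] = []
    target = requirements
    for line in text.splitlines():
        if line.startswith("#"):
            title = line.lstrip("#").strip().lower()
            target = verification if title == heading else requirements
        target.append(line)
    return "\n".join(requirements), "\n".join(verification)


if __name__ == "__main__":
    # Illustrative document, not real project data.
    doc = (
        "## AUTH-001\nUsers log in.\n"
        "## Verification\nTry a bad password.\n"
        "## AUTH-002\nSessions expire."
    )
    reqs, checks = split_verification(doc)
    print(reqs)
    print("---")
    print(checks)
```

This is deliberately naive: a subheading nested inside a verification section would switch back to requirements, so the result still needs a manual look.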

Step 3 (the main artifact): I built per-subsystem digests

Even with cleaner docs, there was still one big problem:

An LLM is much more likely to actually read one document than to wander through folders and do keyword search across many files.

So I built a small, boring artifact:

a registry file listing subsystems and the docs that belong to each
a tiny builder script that concatenates those files into a single digest per subsystem

Now, for each subsystem (authentication, access control, audit, security gates, plus a few project-specific ones), I have one consolidated document.
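A builder like this can be very small. Below is a minimal sketch, assuming the registry is a JSON file that maps a subsystem name to an ordered list of doc files. The JSON format, the file names, and the "source" marker comments are all assumptions made for the example, not the project's actual layout.

```python
import json
from pathlib import Path


def build_digests(registry_path: Path, docs_root: Path, out_dir: Path) -> list[Path]:
    """Concatenate the docs listed for each subsystem into one digest file."""
    # Hypothetical registry shape: {"authentication": ["auth-req.md", ...], ...}
    registry = json.loads(registry_path.read_text(encoding="utf-8"))
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for subsystem, doc_names in registry.items():
        parts = []
        for name in doc_names:
            # A missing file raises FileNotFoundError, so a stale registry fails loudly.
            body = (docs_root / name).read_text(encoding="utf-8").strip()
            # Keep a marker so a finding can be traced back to its source file.
            parts.append(f"<!-- source: {name} -->\n{body}")
        digest = out_dir / f"{subsystem}.digest.md"
        digest.write_text("\n\n".join(parts) + "\n", encoding="utf-8")
        written.append(digest)
    return written


if __name__ == "__main__":
    # Hypothetical paths; adjust to the real project layout.
    for path in build_digests(Path("registry.json"), Path("docs"), Path("digests")):
        print(path)
```

The registry keeps the order of files explicit, so the digest reads the same way on every rebuild.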

I also keep a short “selection rules” note for myself: which digests to feed into the agent for a given task (e.g., access control vs audit logic). The LLM can check conformance well, but I don’t expect it to reliably infer what to check via chains of implicit assumptions.
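Such a note can also live next to the registry as data. This is only a sketch: the task kinds and digest names below are hypothetical placeholders, not the actual rules.

```python
# Hypothetical selection rules: task kind -> digests to hand to the agent.
SELECTION_RULES: dict[str, list[str]] = {
    "access control": ["access-control", "authentication"],
    "audit logic": ["audit", "access-control"],
}


def digests_for(task_kind: str) -> list[str]:
    """Return the digests to feed the agent, failing loudly on unknown task kinds."""
    try:
        return SELECTION_RULES[task_kind]
    except KeyError:
        raise ValueError(
            f"no selection rule for task kind {task_kind!r}; add one explicitly"
        ) from None


if __name__ == "__main__":
    print(digests_for("audit logic"))
```

The explicit error is the point of the design: an unknown task kind should stop the process instead of falling back to a guess.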

The payoff: the restructure wasn’t cosmetic

After I rebuilt the registry and digests, I asked the LLM to check the whole codebase for conformance to each consolidated document.

It found about 15 bugs. Some only manifested under specific conditions.

At first, I was upset: how could this exist with so many tests?

Then I realized: this was the clearest proof that the new documentation structure was doing real work.

What I’m taking from this

A big docset is not automatically verifiable.
If you want LLM-assisted development to be stable, you need docs the model can read, not just search.
A tiny artifact (subsystem registry + digest builder) can become a point of leverage for your whole workflow.

If you’ve dealt with docs/code drift (especially with LLMs in the loop), I’d love to hear what helped—and what failed.