DEV Community: Svetlana Perekrestova

I Built a Local Control Plane for My Coding Agents

Svetlana Perekrestova — Sat, 13 Jun 2026 14:27:01 +0000

I have been playing with my local AI setup for a while, and for most of that time it was hard to explain without opening five terminals and pointing at folders.

The tools were fine. That was not the issue.

I used Copilot. I used Claude Code. I used Codex. I tried OpenCode. Sometimes I wanted Claude because I already had the subscription. Sometimes I wanted Codex because I liked the workflow. Sometimes I wanted OpenCode because switching models through OpenRouter was easier there.

The annoying part was everything around the agents.

Every tool had its own home directory, its own settings format, its own MCP config, its own skills folder, its own memory story. So whenever I added a useful MCP or wrote a good rule, I had to ask the stupid question again: where does this tool expect it?

After enough of that, the setup stopped feeling like "a powerful local AI stack" and started feeling like duplicated dotfiles with chat windows attached.

The original mess

This was the rough shape before I cleaned it up:

My first workaround was Markdown.

I would ask one agent to plan a feature, save the plan into a .md file, and then hand that file to another agent to implement. It was better than relying on chat history. At least the plan existed outside one session.

But it did not really solve the problem.

The second agent would read the plan, miss some assumption, make a different tradeoff, and produce something slightly different. Then I would go back to the first agent and ask for review, and it would be confused because the implementation was not the thing it had planned.

So yes, files helped. They were not shared context.

What I wanted was much less dramatic:

write high-level rules once
keep permission rules in one place
stop copying skills by hand
stop re-declaring the same MCPs
have memory survive across projects and sessions
keep my home directory readable

That last one sounds cosmetic until you live with the opposite. A home directory full of half-overlapping AI folders is not a setup. It is a junk drawer.

The memory rabbit hole

I started with memory because that was the pain I could name.

Claude-specific memory was useful, but it was still Claude-specific. AgentMemory through MCP worked, but it felt heavier than I needed. Too many moving parts, too many features I was probably never going to touch.

Headroom was the first thing that made the shape click for me.

The useful bit was not only "memory". It was the split between two jobs I had been mentally mixing together:

compaction and routing: the model request goes through a local proxy before it reaches the provider
memory: long-lived facts and decisions live in a local store and are accessed through MCP

Those should not be the same thing.

In my setup, Headroom's proxy listens on 127.0.0.1:8787. Model traffic can go through that. The memory MCP talks to a shared SQLite database. Same project, related pieces, but different responsibilities.

Once I separated those two ideas, the rest of the architecture became easier to reason about.

The current folder layout

The center is now ~/.ai.

Not because ~/.ai is magic. I just needed one place that I could point tools at.

Most of the old-looking dotfolders are now symlinks. The tools can still find what they expect, but the real structure is under ~/.ai.

My shell makes that explicit:

export OPENCODE_CONFIG_DIR="$HOME/.ai/opencode"
export CLAUDE_CONFIG_DIR="$HOME/.ai/claude"
export CODEX_HOME="$HOME/.ai/codex"
export ANTHROPIC_BASE_URL="http://127.0.0.1:8787"
export OPENAI_BASE_URL="http://127.0.0.1:8787/v1"

Small Codex footnote: I have both ~/.ai/codex and ~/.ai/codex-desktop. That is intentional. The desktop app and the CLI/runtime state are not exactly the same thing. I still want both under the same roof.

This is not a perfect abstraction. It is just a directory structure I can remember.

OpenCode became my main entrance

OpenCode is now the place I usually start from when I want a normal CLI coding session.

The reason is provider switching. With OpenRouter behind it, I can move between models without rebuilding the rest of the setup. Some tasks need a stronger model. Some tasks are fine on something cheaper. Some days a provider is just slow.

The exact model in the config is not the point. The shape is:

{
  "instructions": [
    "~/.ai/instructions.md",
    "~/.ai/rules.md",
    "~/.ai/security.md"
  ],
  "provider": {
    "openrouter": {
      "options": {
        "baseURL": "http://127.0.0.1:8787/v1"
      }
    }
  }
}

OpenCode reads the shared instruction files from ~/.ai. Its OpenRouter traffic goes through Headroom.

The MCP list lives there too. Mine currently includes Headroom, Headroom memory, Firecrawl, Tavily, Deepgram docs, Sentrux, and Context7. That list will change. The important part is that I now treat MCPs as part of my local agent environment, not as random plugins installed for one tool and forgotten.

Runtime flow

This is the part I wish I had drawn earlier:

There are two paths.

Model calls go through the proxy. Memory goes through the memory MCP.

I keep repeating this because it prevents a very specific kind of debugging pain. If a model call behaves strangely, I look at the proxy path. If an agent forgot something, I look at the memory path. Different questions, different places to inspect.

Memory rules that are boring on purpose

The shared memory is a local SQLite database exposed through headroom-memory.

I do not want it to save everything. That would just turn memory into another log pile. I want it to remember things that are expensive to rediscover:

project decisions
architecture notes
workflow preferences
recurring bugs
lessons from previous sessions

For my own workflow I use two commands:

/remember when something should survive future sessions
/recall when an agent needs context from earlier work

Examples of memories that are useful:

"This project uses build scripts instead of editing Info.plist directly."
"Do not trust SourceKit diagnostics until the actual iOS build runs."
"The VPN project stores memory under the shared Headroom DB."
"This repo prefers feature branches and conventional commits."

I also have a hard rule against saving secrets, credentials, raw logs, or private personal data. Memory should make future work easier. It should not become a liability.

Permissions are not decoration

I used to think of permissions as a safety checkbox. Now I think of them as part of the agent environment.

If Claude Code, Codex, and OpenCode disagree about what they can read or run, then switching agents changes the behavior of the task. That is exactly the kind of invisible difference that causes trouble later.

So I moved the common policy into shared files:

~/.ai/rules.md for coding and workflow standards
~/.ai/security.md for denied commands and paths
~/.ai/instructions.md for memory behavior and cross-agent notes

The security file is plain on purpose:

## denied paths
- ~/.ssh/**
- ~/.aws/**
- ~/.kube/**

## denied commands
- rm -rf /
- rm -rf ~
- git push --force *
- git reset --hard *

Each tool still has its own config format. I am not trying to erase that. I just want the policies to rhyme closely enough that I do not have to remember which agent is more dangerous today.

What is still rough

Headroom memory needs a UI.

The CLI is fine for listing memories, and I like CLIs more than most people probably should. Still, I want a small browser view where I can inspect memories, edit wording, merge duplicates, and delete things that should not be there.

Some configuration also refuses to become shared, and that is fine. Claude Code has Claude-specific plugin settings. Codex has desktop runtime state. OpenCode has provider and TUI settings.

I stopped trying to force those into one shape.

The useful boundary is simple: shared things go into the local AI home, tool-specific things stay with the tool.

Local models, for now

I also tried to make local coding models part of this story.

My MacBook is an older Apple Silicon machine with 16 GB of memory. It can run small coding models, around the 5B range, but I do not get enough value from them yet.

For the tasks I actually care about, I still need stronger reasoning, better codebase navigation, and more reliable instruction following than this hardware gives me comfortably.

So my setup is not "offline AI". It is local orchestration with remote model execution.

That is less romantic, but more accurate.

Where I landed

Right now the setup gives me the thing I was missing:

one local AI directory
shared rules
shared security policy
deduplicated skills
MCPs treated as local infrastructure
model traffic routed through a local proxy
memory shared across agents, projects, and sessions
fewer random dotfolders in $HOME

The part I care about most is continuity.

I can plan in one tool and implement in another without pretending the chat transcript is enough. I can use OpenCode when I want model flexibility. I can still use Claude Code when the subscription workflow is more convenient. I can use Codex without treating it as a separate universe.

It is not one agent to replace the others.

It is a local setup where the agents share enough context that switching between them does not feel like starting over.

Short version

The migration looks like this:

I did not need to choose one perfect coding agent.

I needed to decide what all of them should share.

Why 'Brownfield' Deployments Break Agent Architectures — Lessons from Google

Svetlana Perekrestova — Sat, 21 Mar 2026 23:50:01 +0000

Last week I attended Google AI Agents Reloaded Live Labs Benelux and ended up winning in the "Agents for Good" track. 🎉

Between lab sessions and back-to-back talks, I filled several pages of notes — and a lot of it challenged or refined assumptions I had about building agents. This isn't a recap of "what is an AI agent." It's the non-obvious stuff: where the real traps are, what the data actually says, and which design decisions have outsized consequences in production.

First: the honest framing for 2026

The production track opened with this:

"If 2025 was the year of the agent, 2026 is the year of making it work."

Followed immediately by a slide that just read: "Reality check: it's all brownfield."

No one is deploying agents into pristine infrastructure built from scratch. They go into existing systems, legacy APIs, organizational processes, and teams that were never designed for autonomous AI. That constraint changes almost every architectural decision — and it's the lens through which everything else in this article should be read.

The A2A Protocol — and why the streaming part matters

Google's Agent-to-Agent (A2A) protocol is a standardized, framework-agnostic way for agents to discover each other, communicate, and delegate tasks. The spec is clean, but the interesting design decisions are in the details.

Discovery via Agent Cards

Remote agents advertise their capabilities at a well-known endpoint:

/.well-known/agent-card.json

This is the agent's capability manifest — it declares name, description, URL, version, skills, and supported content types. A client agent fetches this before sending any task. It's roughly the AI equivalent of an OpenAPI spec, but for autonomous agents rather than REST endpoints.

An Agent Card definition in code looks like:

agent_card = AgentCard(
    name='Currency Agent',
    description='Helps with exchange rates for currencies',
    url=f'http://{host}:{port}/',
    version='1.0.0',
    default_input_modes=CurrencyAgent.SUPPORTED_CONTENT_TYPES,
    default_output_modes=CurrencyAgent.SUPPORTED_CONTENT_TYPES,
    skills=[skill],
)

The communication model: Messages, Tasks, Artifacts

All communication happens over JSON-RPC over HTTP(S). The core data structures:

Message — has a role (user or agent) and one or more Parts (text, file, or JSON)
Task — returned by the server agent with an id and status for async tracking
Artifact — the final output payload, also structured as Parts

The Agent Executor sits inside the server-side agent and handles incoming messages, runs the internal reasoning loop, and emits responses back to the client.

Polling is explicitly an anti-pattern

This was called out directly in the session: polling task.status over HTTP for long-running tasks is inefficient. You're hammering an endpoint waiting for a status change when the server could simply tell you when it's done.

The right mechanism is SSE (Server-Sent Events) — the server agent pushes updates (initial task acknowledgment, intermediate messages, final artifacts) over a persistent HTTPS connection. You declare support in the agent card with streaming: true.

For multi-step agentic workflows where the client (or an orchestrating parent agent) needs to react to intermediate outputs, this isn't optional. Polling at scale compounds into latency and unnecessary infrastructure load.

A2A vs MCP — complementary, not competing

This distinction came up several times across talks and labs:

MCP (Model Context Protocol): connects an agent to tools — APIs, functions, data sources. It defines the agent↔tool interface. Primitives are Tools, Resources, and Prompts.
A2A: connects agents to agents. Full task delegation between autonomous agents that each have their own reasoning loops, tools, and memory.

A well-designed multi-agent system uses both: MCP for tool access within an agent, A2A for coordination across agent boundaries. The live demo showed a reference implementation worth looking at if you're planning a multi-agent architecture.

Evaluating agents properly — this section deserves its own article

The evaluation talk (by Naz Bayrak from Google) was the most practically useful session of the day. The field underinvests here, and it shows.

You need to evaluate two dimensions, not one

Every agent run produces:

Final response — did the agent achieve the goal? Is the output correct and useful?
Trajectory — what path did it take? Which tools were called, in what order?

Evaluating only the output misses an entire category of bugs: agents that arrive at correct-looking answers via wrong reasoning, agents that take five steps when two suffice, agents that call the wrong tool but recover via hallucination. These bugs surface reliably under slightly different inputs — and you won't know they exist until they do.

The three evaluation methods and their actual tradeoffs

Method	Strengths	Weaknesses
Human evaluation	Captures nuance, human factors, trust signals	Subjective, slow, expensive, doesn't scale
LLM-as-a-Judge	Scalable, consistent, automated	Bounded by the judge model's capability ceiling; misses complex intermediate steps
Automated metrics	Objective, deterministic, fits in CI	Can't measure creativity or complex reasoning; susceptible to gaming

None of these is sufficient alone. The pattern that works in production is layered: automated metrics catch regressions at scale, LLM-as-Judge handles qualitative assessment, human evaluation validates the trust signals that automation can't quantify.

Six trajectory metrics — and which to use when

ADK provides six distinct strategies for comparing an agent's actual trajectory against a "golden run":

Exact match — perfect replication required (strictest)
In-order match — correct steps in the correct order, extra steps allowed
Any-order match — correct steps in any order, extra steps allowed
Precision — how relevant/correct are the predicted actions?
Recall — how many required actions were actually captured?
Single-tool use — did the agent use a specific tool at least once?

Choosing wrong matters. Any-order match makes sense when steps are genuinely parallelizable. In-order match is right when sequence is semantically meaningful — a lookup must precede a write, a validation must precede a commit. Precision vs. recall is the classic tradeoff: optimize for precision when false positives are costly, recall when missing required steps is the bigger risk.

LLM-as-a-Judge: how to structure the rubric

The pattern that works:

"You are an impartial AI quality analyst. Rate the following response
on a scale of 1-5 for [criterion].
Does the response [specific check]?
Explain your reasoning."

Expected output — structured JSON, not prose:

{
  "groundedness_score": 5,
  "reasoning": "The response accurately reflects the source document and makes no unsupported claims."
}

Ground truth (a reference answer) is recommended but not required. Without it, scores are less reliable — include it whenever you have it.

Two case studies worth unpacking

Case Study 1: Multi-agent retail pricing system

Challenge: validate a system where three agents (Data, Action, Forecasting) had to collaborate correctly. The final output alone wasn't enough to trust — if the orchestrator called agents in the wrong order, outputs could look plausible while the underlying process was broken.

Three-layer evaluation:

Trajectory testing (pytest-automated): confirmed the orchestrator called the right agents in the right order. Tests verified that data queries started with transfer_to_agent('DataAgent') — catching internal process errors even when final outputs looked fine.
LLM-as-Judge with custom rubric: scored final responses on "Business Clarity" and "Conciseness" — qualitative, business-relevant criteria that no automated metric captures.
Rubric decomposition for stress testing: for complex multi-constraint prompts, one LLM generated a checklist of all constraints the response should satisfy; a second LLM gave binary Yes/No answers per item. Explainable, detailed failure analysis rather than a single opaque score.

Key learning: For multi-agent back-end systems, the process is the product. Trajectory correctness must be validated before output quality — a correct-looking answer via a wrong path is a bug, not a success.

Case Study 2: Customer-facing software assistant

Challenge: technical accuracy wasn't enough — the agent needed to be genuinely helpful and feel trustworthy to real users.

Strategy:

AI-simulated conversations: a "Simulated IT Pro" LLM dynamically generated realistic multi-turn dialogues, creating a large test dataset without expensive human annotation.
Expert Evaluator LLM: scored transcripts on helpfulness and task adherence, providing a quantitative baseline.
"Vibes-based" human testing: domain experts interacted directly with the agent and provided qualitative feedback on whether guidance felt right in real-world context — the thing automated systems can't capture.
Structured human evaluation: 1-5 scoring forms with free-form notes, identifying nuance and domain-specific errors the LLM judge missed.

Key learning: Automation provides scale; it doesn't provide trust. For user-facing agents, qualitative domain-expert signal isn't optional — it's the thing you're ultimately optimizing for.

Evaluation is a loop, not a gate

The four-stage continuous evaluation cycle:

Code & Build — validate logic before anything runs in production
Quality & Behavior Eval — pre-deployment testing of intelligence and behavior
Release & Live — real users generate real data
Observe, Analyze, Capture — production insights update golden datasets for the next iteration

Production interactions are your richest source of new test cases. Teams that close this loop improve continuously. Teams that treat evaluation as a pre-ship gate plateau.

Agentic memory — full context is not the answer

This was the most empirically grounded section of the whole conference, and it directly challenged a common default.

The accuracy data

Approach	Accuracy
No memory	10.8%
Full context (all history)	55.4%
Competing memory offering	63.8%
Reflective Memory Bank (RMB)	74.6%

Full context underperforms selective memory. There are two reasons:

Cost: full context processes all history every request; selective memory retrieves only the relevant subset.
Context rot: as irrelevant history accumulates in the context window, LLM output quality degrades systematically — the "lost in the middle" effect. The model's attention is diluted across noise.

The implication runs counter to the intuition that "more context = better": giving the model less but more relevant information outperforms giving it everything.

Two distinct memory layers

Sessions (short-term memory)

Agent Engine is inherently stateless — it doesn't store anything between calls. Sessions is the layer that adds statefulness. It stores conversation history, agent actions, and state within a single session, and eliminates the need to manage your own conversation history database.

Memory Bank (long-term memory)

Memory Bank persists facts across multiple sessions, linked to a specific userId. After each session ends, an LLM automatically extracts key facts from the conversation and stores them. Retrieval is similarity-based (semantic search) or by userId lookup. The extraction happens server-side, invisible to the user.

Sessions and Memory Bank are independent — you can use either or both.

How Reflective Memory Management actually works

RMM (from Google Research, arXiv:2503.08026) is the underlying mechanism:

Prospective Reflection: after a session ends, the system decomposes the dialogue by topic, summarizes key facts, and stores or merges them into the bank
Retrospective Reflection: before responding to a new query, retrieves potentially relevant topic summaries
Adaptive Reranking: a learnable module (trained via RL on LLM citation behavior) refines the Top-K retrieved memories to the Top-M most relevant — this is the key differentiator from naive vector similarity search

Wiring it up

With ADK, PreloadMemoryTool handles the full lifecycle — finding memories at session start, injecting them into context, and telling Memory Bank to learn from the session when it ends:

agent = adk.Agent(
    model=MODEL_NAME,
    name="helpful_assistant",
    instruction="""You are a helpful assistant with perfect memory.
        - Use the context to personalize responses
        - Naturally reference past conversations when relevant
        - Build upon previous knowledge about the user""",
    tools=[adk.tools.preload_memory_tool.PreloadMemoryTool()],
)

runner = adk.Runner(
    agent=agent,
    app_name=app_name,
    session_service=VertexAiSessionService(
        project=PROJECT_ID, location=LOCATION, agent_engine_id=agent_engine_id
    ),
    memory_service=VertexAiMemoryBankService(
        project=PROJECT_ID, location=LOCATION, agent_engine_id=agent_engine_id
    ),
)

For non-ADK frameworks (LangGraph, CrewAI), Memory Bank exposes a direct API with generate_memories and retrieve_memories — same capabilities, more manual wiring.

Agents in production: what changes at scale

The "agentic drift" problem

As agent autonomy increases, SRE and ops teams face a new operational discipline: agentic drift — when agent behavior gradually diverges from intended behavior in production without clear error signals. Agents may continue producing plausible-looking outputs while quietly drifting from their intended goals. This is harder to detect than traditional software failures, where errors are usually explicit.

The 5% of complex, novel outages that AI cannot self-resolve become the critical escalation path for human operators.

AI improves individuals but increases delivery instability

Survey data from the conference: 85% of respondents report AI has increased their individual productivity (13% extremely, 31% moderately, 41% slightly). But the measured organizational-level effects showed software delivery instability as the highest-impact negative outcome of AI adoption.

Individual speed gains don't automatically propagate to team or organizational outcomes. The gap between "I'm faster" and "we ship better" is where governance and process come in — and it's not a gap most teams are deliberately closing.

Two strategic frames worth keeping

The most useful mental model from the whole event:

Agent as Actor (done by agents): agents build and operate things autonomously — dev automation, code generation, orchestration.
Agent as Artifact (done to agents): agents are platform components that must be deployed, scaled, secured, governed, observed, and optimized.

Most teams invest heavily in the first and underinvest in the second. Production failures tend to originate in the second.

The design principles I'm keeping

Evaluate trajectories, not just outputs. A correct answer reached by an incorrect path is a bug waiting to surface on the next slightly different input. Trajectory evaluation is not optional — it's the signal that output evaluation misses.

Memory architecture is a first-class design decision. The choice between in-context history, Sessions, and Memory Bank has measurable accuracy and cost implications. The 20-point accuracy gap between full context and RMB is large enough to matter in production. It's not an implementation detail you optimize later.

Streaming over polling. SSE-based push vs. HTTP polling is the difference between a responsive and a broken UX for any long-running agent task. Design for it upfront.

Evaluation is a loop. Pre-deployment evals catch known issues. Production observation discovers the unknown unknowns. The feedback loop between production traces and golden datasets is where agents actually improve over time — and it needs to be built deliberately, not added retroactively.

Design for brownfield. The clean-room version of an agentic system is a prototype. The real one integrates with what already exists. Every architectural decision should be stress-tested against that constraint.

Based on notes from Google AI Agents Reloaded Live Labs Benelux. A2A demo code: github.com/MKand/agenticprotocols. Memory research: arXiv:2503.08026.

Closing the onboarding gap: a framework for Backend Engineers

Svetlana Perekrestova — Mon, 16 Feb 2026 23:35:09 +0000

After seven years in enterprise software engineering, I've noticed a pattern that almost every team repeats, regardless of company size or how mature its engineering culture is. There's always a company-level onboarding program — an HR portal, a welcome deck, a compliance module. And there's usually a team-level checklist — maybe a Confluence space, some wiki pages, a screen recording of someone clicking through the product. But there's a gap between those two things that almost nobody fills deliberately, and it quietly destroys the first few months of every new hire.

That gap is the ways-of-working layer: all the micro-decisions and conventions that live inside your team but rarely get written down properly. And closing it is one of the highest-leverage things a senior engineer can do.

Unveiling the hidden knowledge: bridging the onboarding gap

When I join a new team, I can usually find documentation about what the product does. What I can't find — or find only in fragmented, outdated form — is how work moves through the team: what a Git branch should be named, what the PR approval flow looks like, which Jira statuses map to which stages, whether the QA engineer writes the automation tests or whether I do, and where those tests even live.

This knowledge exists. It's just distributed across the heads of your team members, not written anywhere accessible. New engineers discover it incrementally, as friction, every time they touch a task.

The documentation that does exist usually has one of three problems. It's stale — written when the process was different and never updated, which makes it actively misleading. It's generic — written for the "standard" flow (new feature delivery) with no mention of edge cases like patching a previous version or handling a hotfix. Or it's scoped to a different team — written by a neighbouring squad whose conventions diverge from yours in ways that aren't immediately obvious.

None of this is negligence. It's just that keeping documentation current competes with shipping features, and documentation always loses that race unless someone makes it a priority.

The critical impact of effective onboarding on Engineer performance

Most companies measure new hires against some form of probation or ramp-up evaluation. Managers assess how quickly an engineer starts delivering meaningful output: merging PRs, closing tickets, moving features toward production. The expectation is reasonable — you were hired to ship.

But delivery is blocked by process comprehension. Until a newcomer understands how work flows through your team, they can't move fast, no matter how technically skilled they are. A senior engineer who joins a large enterprise with tightly coupled but independently managed departments can spend their entire probation period just figuring out who owns what and what sequence of steps gets a task to production. That's not a reflection of their ability — it's a structural failure of the onboarding process.

I've seen genuinely strong engineers underperform in their first review cycle for exactly this reason. And I've come to believe that the first two to three weeks of onboarding have an outsized, durable effect on how quickly — and how confidently — someone integrates into a team.

The two-week framework

What follows is the framework I've built and refined across multiple teams and companies. It assumes you're the senior engineer assigned as the newcomer's onboarding buddy. If no one has been assigned that role explicitly, volunteer for it anyway — it pays back in team throughput within weeks.

Week 1, days 1–2: environment and access

Don't underestimate this phase. Getting local environment, VPN, repository access, and IDE configuration working correctly always takes longer than expected, especially in enterprise environments with SSO, custom Maven/Gradle configurations, corporate certificate authorities, or internal artifact registries. Allocate real time for it, be available for troubleshooting, and don't let a newcomer burn their first two days feeling blocked and embarrassed to ask for help.

Week 1, days 3–5: Ways-of-Working walkthrough

This is the most critical session in the entire onboarding. I run it live, sharing my screen, walking through a complete example of how a piece of work moves from backlog to production. Not in slides — in the actual tools we use.

Concretely, I cover:

Jira workflow: how to pick up a ticket, what each status means, when to transition it, and when (and whether) to reassign it to QA for automation
Git branching strategy: how we name branches, what prefixes we use, what the branching model is (trunk-based, GitFlow, something in between)
Pull request process: how many approvals are required, who reviews what, what "done" looks like before you request review, and any automated checks that must pass
Testing requirements: which types of tests are expected (unit, integration, contract), what the coverage threshold is if one exists, and how to run them locally
CI/CD pipelines: which pipelines exist (build, test, deploy, release), which run automatically and which require manual trigger, what parameters are needed, where build artifacts land, and what artifact registry we use

I deliberately show all of this on screen. Words are ambiguous; a live walkthrough of the actual Jenkinsfile or GitHub Actions workflow removes ambiguity. I also stop to answer every question as it comes up, because questions during this session surface gaps in your own documentation that you'll want to fix later.

After the session, I send a follow-up message with links to every tool, page, and resource I mentioned — a curated reference list, not a dump of the entire Confluence tree.

Week 1, days 5+: codebase deep-dive

Once the process is clear, I move to the code itself. I pull a real task from the top of the backlog, assign it to the newcomer, and then walk them through how I would approach starting it — again on screen.

This covers:

Service structure and module layout
Where the relevant domain logic lives
How to find which service owns a given piece of behaviour (important in microservices environments)
How to run the service locally, including any pre-run steps that aren't integrated into the IDE run configuration
How to run the test suite, and any environment setup required before tests will pass locally
What to do if there's no clear task description — who to ask, what questions to ask them

This last point matters more than most onboarding documentation acknowledges. Incomplete tickets are normal, and knowing how to resolve ambiguity productively is a skill that needs to be explicitly modelled, not left to the newcomer to figure out on their own.

Week 2: supported independence

By the start of the second week, the newcomer should be working independently on their first task. Your job shifts from teaching to being reliably available. Answer questions quickly, review their PR thoroughly and constructively, and stay alert to signs that they're stuck on process rather than on a genuinely hard technical problem — those two things look similar from the outside but require very different responses.

Resist the urge to go quiet because they seem to be making progress. A quick daily check-in for the second week costs you ten minutes and prevents a lot of silent confusion.

The architecture diagram nobody draws

If you're working in a complex microservices environment — multiple services, event-driven communication, shared infrastructure components — take thirty minutes during the first week to draw a simplified architecture diagram together with the newcomer. It doesn't need to be beautiful. A whiteboard photo, a quick Miro sketch, or even a PlantUML diagram is fine.

The goal isn't completeness. It's orientation. Without a visual anchor, engineers in large systems spend weeks building a mental model that could be handed to them in half an hour. That's a straightforward win that almost never happens because no one prioritises it.

If your team doesn't have an architecture overview document, creating one during onboarding is a good forcing function. The newcomer's questions will tell you exactly what was missing.

Closing thoughts

Onboarding rarely gets the engineering rigour it deserves. The irony is that the investment is small and the return is significant: two focused weeks of structured knowledge transfer meaningfully accelerates a newcomer's ramp-up, reduces the noise that slows down the rest of the team, and increases the likelihood that a strong engineer survives their probation period and stays.

In my experience, it doesn't matter how large the company is or how well-documented its processes are at the macro level. Team-level onboarding is almost always underdeveloped. The engineers who make it a priority — who treat it as a real engineering problem worth solving deliberately — build better teams faster. That's worth the calendar time.