DEV Community: Juha Pellotsalo

Building a Personal Agent

Juha Pellotsalo — Tue, 28 Apr 2026 13:09:44 +0000

Shortly after OpenClaw came out I started building my own personal agent. I picked Claude Code as the harness, partly out of habit and partly because I wanted to see what it could do outside of coding.

The agent lives in a single directory on my file system. Launching Claude in that folder launches the agent. Nothing about it is Claude-specific, though. It runs on skills, MCPs, and custom CLI commands, and stores everything in markdown or YAML. Any harness can work with these concepts.

Bootstrapping

Each session starts by loading AGENT.md (or CLAUDE.md in this case). It's deliberately compact, just enough to point the agent in the right direction:

Agent identity and a link to SOUL.md, which holds the full description.
Operating principles, like "auto-improve custom CLI commands when they fail."
A short note on how skills are organized and how memory works.
Rules for formatting bash commands so they pass permission lists cleanly.

Everything else loads on demand.

Skills, CLIs, and the file system

Three pieces do most of the work, so it's worth introducing them up front.

Skills are the agent's playbook. Almost every action it takes is driven by one. A skill contains instructions for calling CLIs, with examples and guardrails, and procedures for things like compiling memory prints, priming sessions, and writing or reviewing code. I use relatively few skills with larger instruction sets, rather than a long catalog of small ones. Sessions typically begin with a skill call appropriate to the task. If I'm working on a project, /project primes the session by reading the project wiki.

CLIs are the arms and hands. They're custom-built commands the agent executes via bash, mostly thin wrappers around APIs: calendar-cli, gmail-cli, drive-cli. reddit-cli reads posts as JSON. youtube-cli pulls video transcripts. Going through CLIs instead of MCP gives me clearer control. The Google CLIs, for example, support multiple accounts, and the right credentials get picked up in code rather than from an agent-generated bash script.

CLI calls are also cheaper than MCP calls, and I can bake error correction into the code itself. LLMs love to hallucinate imaginary args, and validating those inside the CLI is trivial. The validation error returns the correct usage, so the agent fixes the call on the next try instead of looping through trial and error.
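A minimal sketch of that validation idea, assuming a hypothetical reddit-cli with made-up flags (--subreddit and --limit are not taken from the real tool). The parser rejects arguments it doesn't know and returns the full usage with the error, so the caller can correct the call in one retry.

```python
import argparse


class UsageError(Exception):
    """Raised instead of exiting, so the caller decides how to report it."""


class AgentFriendlyParser(argparse.ArgumentParser):
    def error(self, message):
        # Return the full help next to the error so the agent can fix the call.
        raise UsageError(f"error: {message}\n\n{self.format_help()}")


def build_parser():
    # Hypothetical flags, for illustration only.
    parser = AgentFriendlyParser(prog="reddit-cli", description="Read posts as JSON.")
    parser.add_argument("--subreddit", required=True, help="subreddit name")
    parser.add_argument("--limit", type=int, default=10, help="max posts to return")
    return parser


def run(argv):
    """Return (exit_code, output) for a list of CLI arguments."""
    try:
        args = build_parser().parse_args(argv)
    except UsageError as exc:
        return 2, str(exc)
    return 0, f"would fetch {args.limit} posts from r/{args.subreddit}"
```

Calling run(["--subreddit", "python", "--sort", "top"]) fails on the invented --sort flag and hands back the list of valid options instead of a bare error.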

The file system is the database. Markdown and YAML for everything. Folder structure is the schema.

Markdown as a data store

Journals and logs are the unstructured side of this. The structured side looks more like a row in a database: front matter defines the shape, the body holds the content.

---
id: project-axiom
status: active
priority: high
related: [agent-system, langgraph-experiments]
---

# The Axiom

Notes and references for the agentic newsroom project...

These files reference each other like foreign keys, with links as plain paths. This works because LLMs are, for some reason, very good at reasoning over directory trees, and markdown is plain text the model reads natively. Binary database files are opaque by comparison.

The catch is hallucination. Despite clear instructions, the agent will occasionally write a markdown file that doesn't match the defined shape, which immediately breaks anything that reads it.

To handle that, I use a third-party tool called ALS. A hook fires whenever a data markdown file is edited, ALS validates it against the shape, and on any error it returns a correction message to the agent. It's pure code, deterministic, free, and faster than any agentic validation loop.
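To make the idea of deterministic shape validation concrete, here is a small stand-in written from scratch. It is not ALS and does not reflect its API. The shape definition and the allowed values for status and priority are assumptions for illustration.

```python
def parse_front_matter(text):
    """Split a markdown document into (front matter dict, body).

    Handles only flat `key: value` pairs and inline `[a, b]` lists.
    """
    lines = [line.rstrip() for line in text.strip().splitlines()]
    if not lines or lines[0] != "---":
        raise ValueError("missing opening '---'")
    try:
        end = lines.index("---", 1)
    except ValueError:
        raise ValueError("missing closing '---'") from None
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a 'key: value' line: {line!r}")
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            value = [item.strip() for item in value[1:-1].split(",") if item.strip()]
        meta[key.strip()] = value
    return meta, "\n".join(lines[end + 1:]).strip()


# Hypothetical shape: a type, or a set of allowed string values, per field.
PROJECT_SHAPE = {
    "id": str,
    "status": {"active", "paused", "done"},
    "priority": {"low", "medium", "high"},
    "related": list,
}


def validate(text, shape):
    """Return correction messages; an empty list means the file is valid."""
    try:
        meta, _body = parse_front_matter(text)
    except ValueError as exc:
        return [str(exc)]
    errors = []
    for field, rule in shape.items():
        if field not in meta:
            errors.append(f"missing field '{field}'")
        elif isinstance(rule, set):
            if not isinstance(meta[field], str) or meta[field] not in rule:
                errors.append(f"'{field}' must be one of {sorted(rule)}, got {meta[field]!r}")
        elif not isinstance(meta[field], rule):
            errors.append(f"'{field}' must be a {rule.__name__}")
    errors += [f"unknown field '{field}'" for field in meta if field not in shape]
    return errors
```

The messages are written to be fed straight back to the agent, which is the part that replaces an agentic validation loop.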

Memory

The first thing I implemented was memory. Initially it was just a text dump into MEMORY.md, but I wanted memories from day one so I'd have a record of building the agent itself. Almost everything else can be re-created. Those chronological footprints can't.

The current system is a journal spread across markdown files, one per day:

memory/
└── journal/
    ├── 2026-03/
    └── 2026-04/
        ├── 01.md
        ├── 02.md
        └── ...

A /remember skill collects memories at the end of each session. Daily files are split into topics, and if I work on the same topic across sessions, the skill bakes new memories into the existing topic by rewriting the whole file. It's a heavy operation, both in tokens and time, which is why one day is the smallest unit. It's small enough not to choke the agent during a rewrite. The next step is probably per-topic daily files.

Memories are also indexed into a file-based SQLite-vec store in the same folder. A hook fires on every journal edit and re-chunks the file. Chunk size is dynamic, anchored on section headers, so each topic becomes one chunk, tagged with the header and filename as metadata.
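Header-anchored chunking can be sketched in a few lines. This version assumes every markdown header starts a new chunk and ignores text before the first header; the post doesn't say which header level marks a topic, so treat that as a guess. Embedding and the SQLite-vec write are left out.

```python
import re

HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headers(markdown, filename):
    """Split a journal file into one chunk per section header."""
    chunks = []
    current = None
    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            if current is not None:
                chunks.append(current)
            current = {"header": match.group(2).strip(), "file": filename, "lines": []}
        elif current is not None:
            current["lines"].append(line)
    if current is not None:
        chunks.append(current)
    # Header and filename travel with the text as metadata.
    return [
        {"header": c["header"], "file": c["file"], "text": "\n".join(c["lines"]).strip()}
        for c in chunks
    ]
```

A line that starts with # inside a code fence would be misread as a header here, which a real indexer would need to guard against.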

A /recall skill handles retrieval. The topic structure already creates clean semantic boundaries, which makes search effective on its own. On top of that, the final retrieval is hybrid: vector similarity blended with BM25 keyword scores at 70/30.
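One way the 70/30 blend might look, assuming the 70 is the vector side and that both score sets are min-max normalized before mixing. The normalization step is an assumption; the post only gives the ratio.

```python
def min_max(scores):
    """Scale a {doc_id: score} mapping to the 0..1 range."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - low) / (high - low) for doc, s in scores.items()}


def hybrid_rank(vector_scores, bm25_scores, vector_weight=0.7):
    """Blend two score sets; a doc missing from one side scores 0 there."""
    vec, kw = min_max(vector_scores), min_max(bm25_scores)
    blended = {
        doc: vector_weight * vec.get(doc, 0.0) + (1 - vector_weight) * kw.get(doc, 0.0)
        for doc in set(vec) | set(kw)
    }
    return sorted(blended.items(), key=lambda item: item[1], reverse=True)
```

Normalizing first matters because raw BM25 scores are unbounded while cosine similarity is not, so an unscaled sum would let the keyword side dominate.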

Knowledge bases

This part borrows from Andrej Karpathy's idea of LLM knowledge bases. There's a raw/ folder in the project root where the agent dumps anything mid-session: notes, screenshots, snapshots. Each night a scheduled task compiles those into a wiki, blending new material into what's already there. The wiki is a continuously updated snapshot of what I've been thinking and working on.

Projects

Almost every session is tied to some project: code, research, or work on the agent itself. A project is the basic unit of work. PROJECT.md holds the instructions, and each project has its own knowledge base that compiles into a self-updating wiki.

The agent also has project management built in. A Trello-like board with lists and cards, all stored as markdown.

---
id: card-042
list: in-progress
title: Implement /recall hybrid retrieval
created: 2026-04-12
---

Blend vector similarity with BM25 at 70/30. Validate on
last week's journal entries...

When a project session primes, the board loads into context so the agent always has the full scope of what's planned. The board is also modeled as a data store shape, which lets a small custom backend read it directly. More on that next.

Core and heartbeat

The agent system has a custom daemon running in the background. It's a cron-like scheduler that fires fresh agent sessions in headless terminals on a schedule. Each session has a small custom system prompt, but otherwise it's the same agent I use interactively, just running on its own.

A heartbeat task runs every 30 minutes. It backs up the system and pushes it to a remote git repo. Nothing more dramatic than that, but it means the agent's accumulated state is never more than half an hour from being safe.
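The scheduling core of such a daemon can be quite small. A rough sketch, with hypothetical task names and a launch callback standing in for whatever actually starts the headless session or runs the backup (the post doesn't show those commands):

```python
from datetime import datetime, timedelta


class Task:
    def __init__(self, name, every, last_run=None):
        self.name = name
        self.every = every        # timedelta between runs
        self.last_run = last_run  # datetime of the last run, or None

    def is_due(self, now):
        return self.last_run is None or now - self.last_run >= self.every


def run_due(tasks, now, launch):
    """Launch every due task once and record the run time."""
    started = []
    for task in tasks:
        if task.is_due(now):
            launch(task.name)
            task.last_run = now
            started.append(task.name)
    return started


# Hypothetical schedule matching the intervals mentioned in this post.
TASKS = [
    Task("heartbeat", timedelta(minutes=30)),
    Task("compile-wiki", timedelta(days=1)),
]
```

A real daemon would call run_due(TASKS, datetime.now(), launch) from a loop that sleeps between ticks; passing the clock in as an argument keeps the logic easy to test.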

Web apps on top

Most of my interaction is through the Claude CLI, typing and talking. But some things are easier to see than to describe.

The kanban board is the clearest example. A small Trello-like web app reads the same raw markdown files the agent does, which is safe because ALS guarantees the format and referential integrity hold. The web layer doesn't own the data. It just renders it.

Where this goes

A personal agent is a useful tool on its own, but it becomes something different once it has history. Memories, knowledge bases, project notes, access to my email and accounts and Spotify feeds. Every interaction is stored, and the agent's picture of me sharpens with each session.

It's not finished though. Claude Code is a worker harness. It's exceptional at coding and code-adjacent tasks, and bad at being a person. There's no real personality, no conversational nuance, and any character I write into SOUL.md gets diluted past the first few turns by the harness's own system prompt.

The natural next step is splitting the agent in two: a conversational layer on top, a worker layer underneath. The conversational layer would run on a different harness, with LangChain Deep Agents as my current candidate, and exist purely as the interface I talk to, delegating real work downward. That separation is what would let a personality actually take hold and evolve through memory, instead of getting flattened on every prompt.

The other missing piece is autonomy. Right now the agent is almost entirely interactive. Claude Code's recent auto-mode is a step in the right direction. It skips manual permission prompts in favor of a dedicated permission model. It's not quite there yet, but for an agent to run long tasks unattended, something like it is essential.

Building a Smart Storage Facility Prototype with BMAD

Juha Pellotsalo — Mon, 02 Feb 2026 03:54:58 +0000

I wanted to build something that takes the usual chat-with-data bot further. I created a fictional storage facility equipped with several environmental sensors: motion detectors, temperature monitors, and air quality sensors. These are laid out on a blueprint view split into several zones: a loading bay and different types of storage rooms. This approximates where sensors would be in a real-life layout.

Beyond the typical chat panel

The UI includes an assistant panel that resembles the typical chat view. But it also lets the user click sensors and other elements of the system, which auto-generates messages and sends them to the assistant. Instead of typing everything (which the user can always do), they get simpler UI elements to drive the conversation.

The system automatically scans the data and detects anomalies such as a rising cold room temperature. It flags warning scenarios in the UI and pre-formats them with button-clickable elements to query more info from the assistant. The system can generate formal incident reports, run compliance audit checks, and similar actions.
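The post doesn't describe how the scan works, so the following is only a sketch of one plausible rule: flag a cold room when its latest reading is over a limit, or when the last few readings climb steadily. The thresholds, zone names, and warning fields are all hypothetical.

```python
def rising_trend(readings, window=4, min_rise=1.0):
    """True if the last `window` readings climb steadily by at least `min_rise` in total."""
    if len(readings) < window:
        return False
    recent = readings[-window:]
    steadily_up = all(b > a for a, b in zip(recent, recent[1:]))
    return steadily_up and recent[-1] - recent[0] >= min_rise


def scan(zones, limit=8.0):
    """Build warning dicts for zones that are over the limit or trending up."""
    warnings = []
    for zone, temps in zones.items():
        if temps and temps[-1] > limit:
            warnings.append({"zone": zone, "kind": "over_limit", "value": temps[-1]})
        elif rising_trend(temps):
            warnings.append({"zone": zone, "kind": "rising", "value": temps[-1]})
    return warnings
```

Each warning dict carries enough context for the UI to pre-fill a question to the assistant, such as asking why that zone is warming up.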

Dynamic visualization generator

There's also a dynamic visualization generator that works the same way as Claude artifacts. It first generates a handful of visualization ideas based on the system status and then proceeds to generate a React component rendering the visualization, which gets dynamically executed in a designated UI component.

Why I build these prototypes

I've found building these quick prototypes extremely useful because it improves my ability to rapidly prototype using Claude Code. I also try to experiment with something new each time to learn more and to boost efficiency.

With this project I experimented with the BMAD method. It's a system that tries to emulate established methodologies like scrum in an agentic way. At its core, BMAD is a collection of 68 workflows and processes modeled as prompts, executed by 26 specialized agent personas. The whole system follows a four-phase cycle: Analysis, Planning, Solutioning, and Implementation.

How BMAD works

The full default flow starts with analysis and brainstorming to collect product requirements. It splits these into epics and then into stories that are implemented one at a time. There are separate flows to design UX, run tests, and so on. Each of these is executed with a specific agent role activated.

The core idea is that these predefined agent personas, when activated, always do the tasks the same way across the entire project. This is important because when you're just freely prompting, the conversation with Claude tends to steer a certain way, and these subtle shifts affect how and what Claude writes, leading to inconsistent artifacts. BMAD calls this "solutioning," the third phase of their cycle, focused on keeping the process consistent.

My experience with BMAD

I've never felt comfortable following any process strictly down to the last detail. They tend to be too rigid: they work well in some areas but get in the way in others. BMAD is no exception, because it doesn't quite fit my personal style. Brainstorming and ideation were extremely useful, but the implementation flow, with its repeated steps of creating a story and developing it, felt a little tedious.

That said, I can absolutely see great value in following this process: it produces written artifacts that record where you are in the development process. This is a pattern commonly used in deep research agents, where parts of the process are constantly stored on the file system. This helps immensely with Claude Code's main challenge: context window management. Running extensive one-shot tasks consumes the context window very quickly, leading to compaction, which never really works well. It's far more effective to clear the context frequently and then quickly continue where you left off. BMAD's doc artifacts help with exactly that.

Adversarial code review

Another useful feature is BMAD's adversarial code review. Claude Code's first attempt rarely produces the best result and doesn't always account for architectural structure. The adversarial review process systematically looks for problems in the code, running multiple passes to identify vulnerabilities and propose improvements. It includes explicit guidance on handling false positives. This fits well with my style of iterating between quick one-shot implementation and structural revisions.

Final thoughts

BMAD isn't perfect, but I think it's a step in the right direction. It's the first serious attempt I've come across to use AI to drive its own development process. This kind of engineering structure is necessary. Otherwise you need to constantly run the same prompts to clean up the code and architecture and instruct the harness to pick up where you left off before the context window filled up.

Claude Code has made it extremely easy to take high-quality open source projects, reverse engineer them, and tailor them to your own needs. That's exactly what I started doing: take what works in BMAD, then create my own version that uses the same patterns but tailored in a way that fits my development process.

The full project code is available in this repo.

Agentic Content Scout: Exploring LangGraph’s Handoff Pattern

Juha Pellotsalo — Mon, 19 Jan 2026 03:40:56 +0000

With this project, I wanted to experiment with the create_agent pattern. This was previously known as create_react_agent in LangGraph's prebuilt module; in the 1.0 release it was renamed and moved into LangChain, where it still runs on LangGraph. I actually think the original name was more descriptive because it explicitly implements the ReAct (Reasoning and Acting) pattern.

ReAct is a fundamental building block for agentic applications. It is a simple but powerful loop: the agent receives instructions, uses its tools to reason and execute, and then reflects on the results until the task is complete.
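Stripped of any framework, the loop is short. The sketch below is library-free and is not how create_agent is implemented; the scripted model is a stand-in for an LLM, and the step dictionary format is made up for illustration.

```python
def react_loop(task, model, tools, max_steps=5):
    """Alternate between model decisions and tool calls until the model answers."""
    history = [("task", task)]
    for _ in range(max_steps):
        step = model(history)                           # reason: pick the next action
        if step["action"] == "finish":
            return step["answer"], history
        result = tools[step["action"]](step["input"])   # act: run the chosen tool
        history.append((step["action"], result))        # observe: feed the result back
    raise RuntimeError("no answer within max_steps")


def scripted_model(history):
    """Stand-in for an LLM: search once, then answer from the observation."""
    if len(history) == 1:
        return {"action": "search", "input": history[0][1]}
    return {"action": "finish", "answer": f"Found: {history[-1][1]}"}


TOOLS = {"search": lambda query: f"3 articles about {query}"}
```

The max_steps cap is the one piece worth keeping in any real version, since a model that never decides to finish would otherwise loop forever.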

The Supervisor and Orchestration

Another goal was to explore the supervisor (or orchestrator) pattern via a conversational CLI. The setup is similar to tools like Claude Code where the user interacts through a command line.

In this architecture, user queries go to a Supervisor agent first. Its primary job is to determine intent. Once it understands what the user wants to do, it delegates the work to a specific subagent.

Why Handoffs Matter

The delegation happens using the handoff pattern, which is different from standard execution flows. In many systems, a supervisor acts as a middleman for every single interaction. This becomes a bottleneck in conversational systems that require Human-in-the-Loop (HITL).

If a subagent needs to ask the user for clarification, a traditional flow would require the subagent to pass control back to the supervisor, which then asks the user, receives the answer, and passes it back again.

I think of this as micromanagement vs. full delegation. In real life, you want to hand off a task and say, "Go do this and let me know when it's done." You don't want to be bogged down by the day-to-day details of how the subagent executes. The handoff pattern allows the subagent to own the conversation with the user directly until its specific task is finished.

Building the Content Scout

The Agentic Content Scout allows users to track specific topics, like "AI development news" or "Open world video games." You can define preferences for each topic—for example, prioritizing academic research over marketing fluff, or avoiding social media in favor of reputable news outlets.

The system uses two subagents:

Topic Manager: Handles the CRUD operations for maintaining topics and preferences.
Content Scout: Uses the Tavily search tool to find and retrieve articles.

How the Graph Works

The system is built as a LangGraph main graph using MemorySaver as a checkpointer to maintain state. The execution loop follows these steps:

Input: The user sends a command like "create a topic for AI safety."
Intent: The Supervisor receives the message and decides to call the handoff_to_topics tool.
Command: The handoff tool returns a LangGraph Command object. This tells the graph to stay within the agent node but switch the active_agent state to topic_manager.
Ownership: The Topic Manager takes over the conversation. If it needs more info, it calls a gather_preferences tool which triggers a LangGraph interrupt().
Pause and Resume: The graph saves its state and returns control to the CLI. Once the user answers, the checkpointer restores the state and the Topic Manager resumes exactly where it left off—without the supervisor ever being involved.
Completion: Once the subagent finishes, it hands back a summary to the Supervisor, which provides the final response to the user.

The beauty of this pattern is that the supervisor truly delegates. It doesn't micromanage or relay messages. The subagent handles everything, including user interactions, and only reports back when done. The graph's single node design with dynamic routing makes handoffs clean: you just update a state field and loop.
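The routing logic in the steps above can be simulated without LangGraph. In this sketch plain dict updates stand in for the Command object and an early return stands in for interrupt(); the agent functions and state keys other than active_agent are hypothetical.

```python
def supervisor(state):
    # Decide intent once, then hand off by changing which agent is active.
    if state.get("summary") is None:
        return {"active_agent": "topic_manager"}
    return {"reply": f"Done: {state['summary']}", "done": True}


def topic_manager(state):
    # Owns the conversation: asks the user directly when information is missing.
    if state.get("preferences") is None:
        return {"ask_user": "Which sources do you prefer?"}
    summary = f"created '{state['topic']}' preferring {state['preferences']}"
    return {"summary": summary, "active_agent": "supervisor"}


AGENTS = {"supervisor": supervisor, "topic_manager": topic_manager}


def agent_node(state):
    """Single node with dynamic routing: run the active agent, merge its update, loop."""
    while not state.get("done"):
        update = AGENTS[state["active_agent"]](state)
        state = {**state, **update}
        if "ask_user" in update:
            return state  # pause; the caller stores the state and resumes later
    return state
```

When the caller resumes with the user's answer added to the saved state, active_agent is still topic_manager, so the supervisor is skipped until the subagent hands back its summary.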

Here is the repo: https://github.com/juhapellotsalo/agentic-content-scout

Why Claude Code Excels at Legacy System Modernization

Juha Pellotsalo — Sun, 11 Jan 2026 03:56:18 +0000

Nobody sets out to build a legacy system. They begin as well-designed solutions to real problems. They work so well they become critical infrastructure. Then time passes, requirements accumulate, developers rotate, and one hasty fix at a time the system becomes something nobody fully understands anymore.

We (senior developers) all know the story. The outdated library that's too risky to upgrade, the undocumented script discovered on a production server. Bugs that only manifest on Tuesdays for accounts created after February. When it finally becomes clear the system needs to be rewritten, estimates come in: six figures, multi-year timelines, full teams deployed.

I've spent the last few months exploring agentic AI applications, but so far the most practical value I've found is simpler: Claude Code is exceptionally good at exactly those tasks that make legacy modernization so painful for humans.

Dependency archaeology

Legacy systems accumulate libraries that are outdated, discontinued, or have simply vanished from the internet. Claude Code can scan the codebase, identify the worst offenders, and suggest modern replacements. Trivial dependencies can be written from scratch, eliminating the external dependency entirely.

Reverse engineering at scale

A decade-old system often contains hundreds of thousands of lines written by dozens of developers, most of whom left no documentation. It takes months for a human engineer to understand this code well enough to safely modify it. Claude Code generates readable summaries and traces logic flows tirelessly, producing in hours what would take weeks of careful reading.

Dead code identification

Legacy codebases bloat because deletion is risky. New features get added but old code rarely gets removed. By cross-referencing the code with production logs, Claude Code can identify what's actually executing versus what's just taking up space.
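The cross-referencing itself is a set difference. A toy version for a Python codebase, assuming a hypothetical log format where each executed function shows up as a call=<name> token:

```python
import ast


def defined_functions(source):
    """Names of all functions defined in a Python source string."""
    tree = ast.parse(source)
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def dead_code_candidates(source, log_lines, marker="call="):
    """Functions that are defined but never appear in the production log."""
    seen = set()
    for line in log_lines:
        for token in line.split():
            if token.startswith(marker):
                seen.add(token[len(marker):])
    return sorted(defined_functions(source) - seen)
```

The result is a list of candidates, not proof: a function that only runs at year-end closing won't show up in a week of logs, so the log window needs to cover the system's longest cycle.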

Exposing fossilized hacks

Quick fixes have a way of becoming permanent fixtures. It's not unusual for a legacy app to be entangled with its original dev environment: hard-coded scripts, file-system paths set years ago that have since cemented into load-bearing parts of the application. These are notoriously difficult for humans to trace but CLI-driven agent coders excel at exactly this kind of detective work.

Generating a preliminary spec

Perhaps the most valuable output is turning the existing application into a readable specification with minimal human effort. Claude Code extracts what the system actually does, how components interact, and which business rules are embedded in the logic. The result is a solid foundation for a proper spec that humans can review and refine rather than having to create from scratch. For systems where the original requirements are long lost or hopelessly outdated, this alone can save months.

Conclusion

This changes the economics dramatically. Detective work that previously took months of senior developer time can happen in days. The system that seemed untouchable becomes approachable. One capable developer armed with sound Claude Code skills can get the work well under way without having to deploy a full team.

This isn’t speculative. Last week I analyzed an Android app repository I had not seen before and fixed an old issue. The pull request was approved the same afternoon. This would’ve been impossible a year ago, which goes to show how fast the tools are developing and how good they’ve already become.

Reflections on building my first LangGraph project

Juha Pellotsalo — Mon, 22 Dec 2025 06:30:47 +0000

I recently completed my first LangGraph project that was more than a simple tutorial. It simulates a newspaper editorial workflow where a set of agents turn a story idea into a finished article. The newsroom itself wasn’t the goal; I simply used it to learn LangGraph properly.

Learning LangGraph

I explored several frameworks, including OpenAI Agents SDK, CrewAI, and AutoGen, but eventually settled on LangGraph. Modeling the application as a graph felt natural, and the level of control appealed to my background as a senior engineer.

I started with Udemy courses and Medium posts, but the best material came from LangChain’s own academy. The LangGraph foundations and deep research project courses were the two most useful ones.

Modeling agents as subgraphs

The system ended up being a mostly linear workflow with agents such as Assignment Editor, Research Assistant, and Reporter. I modeled each agent as its own subgraph with a single responsibility.

This worked well. I could spend a couple of days improving the Reporter, then move on to the Graphics Desk. Once I was happy with the individual agents (subgraphs), stitching them together into a parent workflow graph was trivial.

Prompt management was a surprising challenge

I wasn’t expecting this, but keeping track of which prompt does what quickly became difficult. Several agents shared things like guardrails and writing guidelines, and sharing these without duplicating or contradicting instructions was surprisingly hard.

In an agentic system, prompts are effectively code, but because they’re written in natural language, they lack the structure and constraints of a programming language.

I ended up using XML-style elements in prompts to indicate things like <Task> and <Output Format>. This probably helps the model, but it definitely helped me keep things organized. This is also what Anthropic recommends in their context engineering guide.
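One way to keep shared blocks in a single place is to assemble prompts from named parts. This is a sketch of that idea rather than the project's actual code; the shared texts and function names are invented for illustration.

```python
# Shared blocks live in one place so agents can't drift apart (example texts are made up).
SHARED = {
    "Guardrails": "Do not invent quotes or sources.",
    "Writing Guidelines": "Use short paragraphs and plain language.",
}


def section(tag, content):
    """Wrap content in an XML-style element."""
    return f"<{tag}>\n{content.strip()}\n</{tag}>"


def build_prompt(task, output_format, shared=("Guardrails",)):
    """Assemble a prompt from agent-specific parts plus the chosen shared blocks."""
    parts = [section("Task", task)]
    parts += [section(name, SHARED[name]) for name in shared]
    parts.append(section("Output Format", output_format))
    return "\n\n".join(parts)
```

Each agent then declares which shared blocks it needs, for example build_prompt("Write the article draft.", "Markdown with a headline.", shared=("Guardrails", "Writing Guidelines")), and a guardrail change happens in exactly one string.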

LangSmith for tracing

You can enable LangSmith tracing simply by providing an API key. It’s the easiest way I’ve found to add tracing with a usable UI, and I learned very quickly that it’s a must-have. You simply cannot build an agentic system without tracing.

LangSmith also includes built-in evaluation tools, which I didn’t use in this project. That’s an area that clearly needs more attention, and if I were to extend this system, LangSmith would be my starting point for proper evaluation.

Editing long-form articles

Editing the articles turned out to be one of the hardest problems. The first draft always needs several rounds of reflection and revision. It’s easy enough to analyze factual or stylistic issues, but fixing them is much harder.

A full feature article can be around 2,000 words. Single-shot rewrites often fix some problems but introduce new ones, and the system easily gets stuck in a review–revise loop where each iteration drifts further from the original.

I wasn’t able to solve this completely. The best results came from revising the article section by section, giving the model one section to edit along with the preceding and following sections to preserve flow. This worked much better, but increased latency and token usage beyond what made sense for a portfolio project. It is, however, the approach I’d try in production.
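The section-by-section approach boils down to a sliding window over the article. A sketch, where revise is a placeholder for the LLM call and the choice to pass the already revised previous section (rather than the original) is an assumption:

```python
def revise_by_section(sections, revise):
    """Revise each section with its neighbours passed along as read-only context."""
    revised = list(sections)
    for i, current in enumerate(sections):
        # Use the already revised previous section so changes carry forward.
        previous = revised[i - 1] if i > 0 else None
        following = sections[i + 1] if i + 1 < len(sections) else None
        revised[i] = revise(previous, current, following)
    return revised


def tidy_whitespace(previous, current, following):
    """Stand-in for the model: only normalizes whitespace in the current section."""
    return " ".join(current.split())
```

The cost is visible in the structure: one model call per section, each carrying up to three sections of text, which is where the extra latency and token usage come from.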

Skills as a software pattern

I only noticed this afterward, but LangChain has a section in their documentation about skills. This is the same skills concept Claude Code introduced some time ago, but framed more explicitly as a software pattern.

This would have been a good fit for my Research Assistant agent. It doesn’t rely on a single master prompt; instead, each node behaves like a small sub-agent. Generating web search queries is one skill, curating results is another, and packing them into state is a third. Modeling these as skills would likely make the code cleaner.

Final thoughts

I like LangGraph. Once it clicks, modeling systems as explicit graphs and subgraphs feels natural, especially coming from a traditional software engineering background.

The biggest drawback is the learning curve. If you’re building something for clients, it’s worth being aware that you can accumulate technical debt quickly. I’d probably start with a lower-code framework like n8n or CrewAI to prototype, and only switch to LangGraph if that doesn’t scale—or if there’s a team willing and able to maintain a more complex system.

The full project code is available in this repo.