JOOJO DONTOH
An Autonomous, Agentic AI Assistant: Meet Alfred, and This Is How I Built Him

Introduction

My people, it's me again. This time I have built something fun but mostly useful. I gave building an autonomous agent a chance, and it's turning out well. I know it's a cliché, but his name is Alfred. The thing is, AI agents are no longer a novelty. They started out as simple chatbots chaining a few prompts together and have evolved into something far more capable: systems that can "reason" (I know it's just a lot of math and not actual reasoning), plan, use tools, and execute multi-step workflows with minimal human intervention. Agentic flows, where an AI iteratively breaks down a goal, takes actions, evaluates results, and course-corrects, are quickly becoming the backbone of serious productivity tooling.

But not all models are created equal. The market is crowded: GPT-4o, Gemini, Mistral, Llama, and DeepSeek all have their own strengths, trade-offs, and devoted user bases. Picking the right model for a given task has become something of an art form in itself, especially because the benchmarks keep getting blurrier.

For me, that choice keeps coming back to Anthropic's Claude, and specifically to Opus. As an engineer, I spend a significant portion of my day thinking in systems: abstractions, edge cases, failure modes, and architecture trade-offs. Opus is the only model that consistently feels like it's doing the same while cleverly grabbing my immediate system context. Where other models can produce code that technically compiles but misses the intent entirely, Opus tends to understand the why behind what I'm building, not just the what. That distinction, subtle as it sounds, makes an enormous practical difference when you're deep in a complex codebase. Opus has downsides too; sometimes it takes shortcuts without adhering to the principles you intended.

What sealed it for me, though, was the CLI experience. Claude's command-line interface is genuinely pleasant to use: fast, composable, and unobtrusive in a way that fits naturally into my existing workflow. It doesn't feel like a detour. It feels like a tool that belongs in my terminal alongside the rest of my stack.

In this article I'm going to talk about why I needed Alfred, the problem he solves for me, how I built him, and how I keep improving him in this ever-changing landscape where engineering meets productivity.

The Monday Morning Problem Every Developer Knows

It is Monday, 8:30 AM. Before I have written a single line of code, I already have a full-time job just figuring out where to start.

Over the weekend, 47 new Gmail messages came in. Some are spam. Some are newsletters I never unsubscribed from. But buried somewhere in that pile is an escalation that needs urgent attention and a teammate asking for a code review. I do not know which email it is yet. I have to dig for it.

That is just Gmail. I also have 12 Outlook emails from work: meeting updates, an HR policy change, and my manager asking about feature progress. Then there are 8 Teams messages spread across 3 different channels covering a production incident from Saturday, a design review thread, and standup notes. On top of that, 3 pull requests were opened against repos I review, and 2 calendar conflicts appeared for Tuesday that I need to sort out before the day gets going.

None of these systems talk to each other. So my morning routine becomes a manual context-switching exercise. I open Gmail, scan subject lines, try to mentally rank urgency. Then I switch to Outlook and do the same. Then Teams. Then Azure DevOps. By the time I have a rough picture of what actually needs my attention, 45 to 60 minutes have passed. And that client escalation? Still buried under newsletters when I finally find it.

The frustrating part is that most of that time is not real work. It is just triage. It is the overhead that comes before the actual job even starts. The other option is to close everything and wait for someone to walk to my table. Lmao I do this all the time.

But well, this is the problem I built Alfred to solve.

What do I want from Alfred?

Unification! Alfred is a personal AI agent built around a single idea: collapsing the chaos of my digital workday into one intelligent, unified system. It continuously polls Gmail at configurable intervals and receives Outlook emails and Microsoft Teams messages via Power Automate webhooks, storing everything locally in SQLite so that regardless of the source, nothing slips through the cracks.
Every incoming email is then put through an AI classification pipeline that assigns it one of six categories (Urgent, Personal, Work, Newsletter, Transactional, or Spam), gives it a priority level from 1 to 5, generates a human-readable summary, extracts action items with optional due dates, and flags whether a follow-up is needed.
From there, a configurable rules engine evaluates each classified email and proposes an appropriate action: archive it, delete it, forward it, draft a reply, or surface it for attention via a notify action with quick-action buttons.
Destructive actions like deletions, sends, and PR approvals wait behind an explicit approval gate in the dashboard, while non-destructive ones like classification and drafting execute automatically.
Every action is tracked through a full lifecycle from proposed to executed, with timestamps, rollback data, and execution results all stored in an append-only audit log.
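To make the shape of this pipeline concrete, here is a hedged TypeScript sketch of what the classification result and rules engine might look like. The type names, fields, and rule thresholds are my illustration, not Alfred's actual schema:

```typescript
// Hypothetical sketch of the classification result described above.
// Field names are illustrative assumptions, not Alfred's real schema.
type EmailCategory =
  | "Urgent" | "Personal" | "Work" | "Newsletter" | "Transactional" | "Spam";

interface EmailClassification {
  category: EmailCategory;
  priority: 1 | 2 | 3 | 4 | 5;                    // 1 = highest urgency
  summary: string;                                 // human-readable summary
  actionItems: { text: string; dueDate?: string }[];
  needsFollowUp: boolean;
}

// The rules engine maps a classification to one of the proposed actions.
type ProposedActionType = "archive" | "delete" | "forward" | "draft" | "notify";

function proposeAction(c: EmailClassification): ProposedActionType {
  if (c.category === "Spam") return "delete";
  if (c.category === "Newsletter" && c.priority >= 4) return "archive";
  if (c.needsFollowUp || c.priority <= 2) return "notify";
  return "archive";
}
```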

Email flow

Beyond email, Alfred integrates deeply with the rest of my work toolchain. It connects to Google Calendar and Outlook Calendar for listing, creating, updating, and searching events, and handles Azure DevOps for querying and managing work items, approving pull requests, tracking pipeline runs, and browsing repositories. When a pull request is opened, a dedicated webhook handler automatically fetches the PR details, checks pipeline status, attempts to link related work items from branch name patterns, generates an LLM summary, and proposes approval or work item creation actions accordingly. Microsoft Teams is covered too, with channel message search and webhook-based ingestion keeping Alfred aware of conversations happening outside of email. Tying everything together is a conversational chat interface powered by an agentic loop that extracts intents from natural language, executes them across services, and returns structured, context-aware responses.

devops

Let's look at some of Alfred's core flows in detail.

Email Polling and Synchronization

Alfred's background worker is built around an AgentLoop flow. When the server starts, the loop runs an initial poll immediately, then sets a repeating setInterval timer at a configurable cadence. Each tick calls emailPort.listMessages("in:inbox", 50) to fetch up to 50 messages from Gmail via the Gmail API; 50 is a reasonable cap for my personal workflow.
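The scheduling described above can be sketched roughly like this. The class shape and names are illustrative, not the actual AgentLoop source:

```typescript
// Minimal sketch of the poll scheduling: one immediate poll on start,
// then a repeating timer at a configurable cadence.
interface EmailPort {
  listMessages(query: string, max: number): Promise<{ id: string }[]>;
}

class AgentLoop {
  private timer?: ReturnType<typeof setInterval>;

  constructor(
    private emailPort: EmailPort,
    private intervalMs: number,
  ) {}

  start(): void {
    void this.poll();                                   // initial poll immediately
    this.timer = setInterval(() => void this.poll(), this.intervalMs);
  }

  stop(): void {
    if (this.timer) clearInterval(this.timer);
  }

  private async poll(): Promise<void> {
    const messages = await this.emailPort.listMessages("in:inbox", 50);
    // ...dedupe, persist, classify (covered in the rest of this section)
  }
}
```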

To avoid reprocessing emails Alfred has already seen, the loop maintains an in-memory string set of message IDs. Every polled message is checked against this set, and only genuinely new messages pass through:

const newMessages = messages.filter((msg) => !this.seenIds.has(msg.id));
for (const msg of newMessages) {
  this.seenIds.add(msg.id);
}

New messages are immediately persisted to SQLite through EmailRepo.upsert(). The upsert uses SQLite's INSERT ... ON CONFLICT(id) DO UPDATE pattern, which means if Alfred encounters the same email ID twice (for example after a server restart), it updates the existing row rather than creating a duplicate. The repository stores the full email body, sender, recipients, labels, attachments as serialized JSON, and a source field that distinguishes Gmail emails from Outlook emails. I cover the exact upsert schema in the Data Integrity section.

Before sending any email to the classifier, the loop applies a set of skip rules. Social media notifications from Facebook, Instagram, Twitter, TikTok, Reddit, Discord, and similar platforms are matched by regex against the sender address. Emails carrying Gmail's CATEGORY_PROMOTIONS or CATEGORY_SOCIAL labels are also skipped. LinkedIn is explicitly exempted from this filter because its emails often contain actionable professional content. This pre-filtering avoids burning LLM API calls on emails that would reliably classify as low priority anyway.
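A rough sketch of what such skip rules could look like; the regex and the LinkedIn exemption here are my guesses based on the description, not the exact implementation:

```typescript
// Hedged sketch of the pre-classification skip rules. The sender regex is
// an assumption; the Gmail label names come from the Gmail API.
const SOCIAL_SENDER_RE =
  /@(facebook|instagram|twitter|tiktok|reddit|discord)\w*\./i;
const SKIPPED_LABELS = new Set(["CATEGORY_PROMOTIONS", "CATEGORY_SOCIAL"]);

function shouldSkipClassification(from: string, labels: string[]): boolean {
  // LinkedIn mail often carries actionable professional content,
  // so it is explicitly exempted even though it is "social".
  if (/@linkedin\./i.test(from)) return false;
  if (SOCIAL_SENDER_RE.test(from)) return true;
  return labels.some((l) => SKIPPED_LABELS.has(l));
}
```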

The loop also checks whether each email already has a classification in the database before sending it to the classifier. If a record exists, the email is skipped entirely. This means restarting the server does not trigger re-classification of previously processed emails. I wrote it this way to keep costs to a minimum and the pipeline idempotent.

When the classifier encounters a fatal error such as an expired API key, exhausted credit balance, or a 429 rate limit response, the loop enters a paused state rather than crashing or retrying in a tight loop. It sets classifierPaused = true and stops classifying. This is sort of a circuit breaker. On subsequent polls, it still persists new emails to the database so no mail is lost, but it attempts a single test classification to check whether the service has recovered. Once the test succeeds, classification resumes automatically. Error messages are also deduplicated so the same error is only logged once regardless of how many polls occur while paused.
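Here is a simplified sketch of that circuit-breaker behaviour. The class, method names, and error handling are illustrative assumptions:

```typescript
// Illustrative circuit breaker: pause on a fatal classifier error, then
// probe once per poll until the service recovers.
interface Classifier {
  classify(text: string): Promise<string>;
}

class PausableClassifier {
  private paused = false;
  private lastError?: string;

  constructor(private inner: Classifier) {}

  async tryClassify(text: string): Promise<string | null> {
    if (this.paused) {
      // While paused, attempt a single probe; resume on success.
      try {
        const result = await this.inner.classify(text);
        this.paused = false;
        this.lastError = undefined;
        return result;
      } catch {
        return null;                 // still down; mail is persisted elsewhere
      }
    }
    try {
      return await this.inner.classify(text);
    } catch (err) {
      const msg = String(err);
      if (msg !== this.lastError) {  // deduplicate repeated error logs
        console.error("classifier paused:", msg);
        this.lastError = msg;
      }
      this.paused = true;
      return null;
    }
  }
}
```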

For Outlook, Alfred does not poll directly. Instead, an adapter calls a Power Automate flow that returns Outlook messages. A dedicated payload mapper normalizes Microsoft field names, timestamp formats, and nested structures into the same EmailMessage domain object that Gmail produces. This means the rest of the pipeline, including classification, action rules, and chat, works identically regardless of whether an email originated from Gmail or Outlook. I wrote it this way so that I can later extend email providers by just adding a normalization mapper and then it should be plug and play.
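As an illustration of the mapper idea, a normalisation function might look like this. The Outlook payload shape is loosely modelled on Microsoft's field naming, and the domain type is my simplification:

```typescript
// Sketch of the payload mapper: Microsoft-style fields normalised into the
// same domain shape Gmail produces. Field names are assumptions.
interface EmailMessage {
  id: string;
  from: string;
  subject: string;
  receivedAt: string;   // ISO 8601
  source: "gmail" | "outlook";
}

// Shape loosely modelled on Microsoft Graph / Power Automate payloads.
interface OutlookPayload {
  Id: string;
  From: { EmailAddress: { Address: string } };
  Subject: string;
  ReceivedDateTime: string;
}

function mapOutlookToEmailMessage(p: OutlookPayload): EmailMessage {
  return {
    id: p.Id,
    from: p.From.EmailAddress.Address,
    subject: p.Subject,
    receivedAt: new Date(p.ReceivedDateTime).toISOString(),
    source: "outlook",
  };
}
```

Because the mapper emits the same EmailMessage shape Gmail produces, everything downstream stays provider-agnostic.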

Action Proposal, Approval, and Execution

Actions in Alfred follow an event-sourced lifecycle. Every state transition is recorded as an append-only entry in an action log backed by a SQLite table. No rows are ever updated in place or deleted. The lifecycle flows through a fixed set of ActionStatus states: Proposed → Approved → Executed, or alternatively Rejected or RolledBack. This is purely for auditing, so that I can trace every autonomous action the agent takes.

Proposal

The ProposeAction use case starts with an idempotency check. It queries the action log for any existing entry with the same resourceId and type. If one already exists, it returns null and stops. Otherwise, it appends a new entry with status: Proposed.

From there, the action's RiskLevel determines what happens next. Low-risk actions like Classify, Draft, and Notify carry RiskLevel.Auto and execute immediately without my input. High-risk actions like Archive, Delete, Send, and Forward carry RiskLevel.ApprovalRequired and sit in the proposed state until I act on them from the dashboard:

const risk = ACTION_RISK_LEVELS[action.type];
if (risk === RiskLevel.Auto) {
  const strategy = this.strategies.find((s) => s.source === action.source);
  if (strategy?.canExecute(action.type)) {
    resultData = await strategy.execute({ type, resourceId, payload });
  }
  await this.actionLog.updateStatus(actionId, ActionStatus.Executed, new Date().toISOString());
}

If the action produces result data such as a created draft ID or classification details, that data is stored alongside the log entry via updateResultData().

Approval and Execution

When I click Approve in the dashboard, the ApproveAction use case first updates the log entry's status to Approved with a timestamp, then immediately attempts execution. It finds the correct ActionExecutionStrategy by matching the action's source field. Three strategies exist: GmailActionStrategy handles archive, delete, send, and draft operations via the Gmail API; OutlookActionStrategy handles equivalent operations through Power Automate; and DevOpsActionStrategy handles work item creation and PR approval via the Azure DevOps REST API. This follows the open-closed principle: new providers are supported by registering additional strategies rather than modifying the existing ones.

Each strategy declares which action types it supports through a canExecute() method. If a strategy exists but cannot execute the specific action type, the action is marked as executed without performing any real mutation. If execution succeeds, the status moves to Executed. If it fails, the error is returned to the caller but the action remains in Approved state so the user can retry without losing the approval.

The Notify action type is intentionally a no-op at the execution level. It exists so the rules engine can propose surfacing an email to the user without triggering any mutation on the mailbox. The notification itself is handled by the push notification system, not the action executor.
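The strategy pattern described here can be sketched as follows. The interface members and the supported-type set are illustrative, not the exact source:

```typescript
// Sketch of the execution-strategy shape: each strategy declares its
// source and which action types it can execute.
interface ActionExecutionStrategy {
  source: "gmail" | "outlook" | "devops";
  canExecute(type: string): boolean;
  execute(action: { type: string; resourceId: string }): Promise<unknown>;
}

class GmailActionStrategy implements ActionExecutionStrategy {
  source = "gmail" as const;
  private supported = new Set(["archive", "delete", "send", "draft"]);

  canExecute(type: string): boolean {
    return this.supported.has(type);
  }

  async execute(action: { type: string; resourceId: string }): Promise<unknown> {
    // A real implementation would call the Gmail API here.
    return { ok: true, type: action.type };
  }
}

// Registering strategies by source keeps the executor open for extension:
// a new provider needs a new strategy, not changes to existing code.
const strategies: ActionExecutionStrategy[] = [new GmailActionStrategy()];
const pick = (source: string, type: string) =>
  strategies.find((s) => s.source === source && s.canExecute(type));
```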

Chat Interface (Intent and Tool Use Modes)

Alfred's chat is the primary way I interact with my workspace data through natural language. I designed it to support two distinct modes of operation: an intent extraction mode (the default) and a tool_use mode powered by Anthropic's native tool choice. Both implement a ChatStrategy interface defined in a chat-strategy file, which standardises the input (message, history, context, system prompt, dependencies) and output (response text, result strings, action steps).

Intent Extraction Mode

The IntentExtractionStrategy uses a two-LLM architecture. A fast, cheap model (Claude Haiku) handles intent extraction, while the main model (Claude Sonnet) composes the final user-facing response.

The strategy runs an agentic loop of up to 5 rounds. In each round, it sends the user's message, the last 20 conversation history entries (each truncated to 2000 characters), and any results from prior rounds to the fast LLM. The system prompt includes detailed routing rules that map natural language patterns to intent types: "check my Outlook" routes to search_emails with source: "outlook", "calendar" without a provider routes to list_calendar_events without a source, and "work items" routes to query_work_items.

The LLM returns a JSON object with an intents array. Each intent specifies a type matching a registered tool name, along with type-specific fields like query, source, and timeMin. Invalid tool names are filtered out against the ToolRegistry. The strategy then executes each intent by calling the corresponding tool's execute() function, which delegates to the appropriate IntentExecutorDeps method:

for (let round = 0; round < MAX_ROUNDS; round++) {
  const intents = await this.extractIntents(extractionLlm, message, recentHistory, priorResults, deps, validToolNames);
  if (intents.length === 0) break;
  const results = await this.executeTools(deps, intents);
  allResults.push(`--- Round ${round + 1} ---\n${results.join("\n\n")}`);
}

Multi-round execution is what makes complex queries possible. A request like "invite Sabrina to my 3pm meeting tomorrow" requires two rounds: round 1 searches for tomorrow's calendar events, and round 2 uses the event ID from that result to update the event with a new attendee. The LLM receives prior results in an ACTIONS ALREADY EXECUTED THIS TURN block and can return {"intents": [{"type": "none"}]} to signal that all needed data has been gathered and the loop should stop.

After the loop completes, the ChatService combines all gathered results with local context (email stats, pending actions, and follow-ups from the database) and sends everything to the main LLM for final response composition, with extended thinking enabled.

Tool Use Mode

The ToolUseStrategy takes a fundamentally different approach. Rather than extracting intents and executing them as a separate step, it gives the LLM direct access to tools via completeWithTools(). The LLM decides which tools to call, receives structured results, and continues the conversation until it produces a final text response.

This mode requires the LLM adapter to support the Claude tool-use API. The strategy converts all registered tools into Claude tool definitions (name, description, input schema) and passes them alongside the message. The loop runs for up to 5 rounds, checking the stopReason after each response. When the model returns end_turn, the final text becomes the response. When it returns tool calls, the strategy executes each tool, packages the results as ToolResultBlock objects with matching tool_use_id, and sends them back as a user message for the next round:

const response = await deps.llm.completeWithTools({ system: systemPrompt, messages, tools, maxTokens: 4096 });
if (response.stopReason === "end_turn") {
  return { response: response.text ?? "", results: allResults, actions: allActions };
}

If the model exhausts all 5 rounds without reaching end_turn, the strategy returns a graceful fallback message in Alfred's butler voice rather than surfacing a raw error to the user.
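A minimal sketch of that tool-result round trip; the block shapes follow Anthropic's documented tool-use format, while the surrounding helper is my simplification:

```typescript
// Sketch of packaging tool results back into the conversation. Each result
// echoes the tool_use id, and the batch goes back as a *user* message.
interface ToolUseBlock { type: "tool_use"; id: string; name: string; input: object }
interface ToolResultBlock { type: "tool_result"; tool_use_id: string; content: string }

async function runTools(
  calls: ToolUseBlock[],
  execute: (name: string, input: object) => Promise<string>,
): Promise<{ role: "user"; content: ToolResultBlock[] }> {
  const content: ToolResultBlock[] = [];
  for (const call of calls) {
    const result = await execute(call.name, call.input);
    content.push({ type: "tool_result", tool_use_id: call.id, content: result });
  }
  // Tool results are sent back to the model as the next user message.
  return { role: "user", content };
}
```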

Tool Registry

Both modes share the ToolRegistry class in a tool-registry file, which acts as a central catalogue of all available tools. Each tool is registered with a name, description, JSON input schema, an execute function, and a summarize function that produces human-readable action steps such as "Searched Gmail for 'invoice'". The registry can export its tools in two formats: toToolDefinitions() for Claude's native tool-use API, and toIntentPrompt() for building the intent extraction system prompt.
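A minimal sketch of such a dual-export registry, with names following the description above but implementation details assumed:

```typescript
// Sketch of the shared registry: one catalogue, two export formats.
interface ToolEntry {
  name: string;
  description: string;
  inputSchema: Record<string, unknown>;
  execute: (input: Record<string, unknown>) => Promise<string>;
  summarize: (input: Record<string, unknown>) => string;
}

class ToolRegistry {
  private tools = new Map<string, ToolEntry>();

  register(entry: ToolEntry): void {
    this.tools.set(entry.name, entry);
  }

  get(name: string): ToolEntry | undefined {
    return this.tools.get(name);
  }

  // Claude native tool-use format: name / description / input_schema.
  toToolDefinitions(): object[] {
    return [...this.tools.values()].map((t) => ({
      name: t.name,
      description: t.description,
      input_schema: t.inputSchema,
    }));
  }

  // Prompt fragment for the intent-extraction mode.
  toIntentPrompt(): string {
    return [...this.tools.values()]
      .map((t) => `- ${t.name}: ${t.description}`)
      .join("\n");
  }
}
```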

System Prompts

All persona and mode-specific instructions are centralised in a system-prompts file. The BASE_PERSONA establishes Alfred's character as a refined English butler who addresses the user as "Master Jo" and has access to Google Workspace, Microsoft 365, and Azure DevOps. (Jeremy Irons is my favorite Alfred btw) Mode-specific instructions are appended on top: intent mode tells Alfred that actions have already been executed and results are in context so it should not pretend to be searching, while tool-use mode tells Alfred to actively call tools to fetch fresh data.

Authentication and Security

Alfred enforces security at multiple levels across both the dashboard and the agent server.

Dashboard Authentication

The dashboard uses NextAuth.js v5 configured in auth.ts with Google OAuth as the sole provider. Sessions use a JWT strategy with a 7-day maximum age. Access is restricted to a single authorised user through an email allowlist: the signIn callback compares the Google profile's email against the ALLOWED_EMAIL environment variable and rejects any mismatch:

callbacks: {
  signIn({ profile }) {
    return profile?.email?.toLowerCase() === allowedEmail;
  },
}

The auth system uses a custom sign-in page at /auth/login and redirects errors back to the same page for a clean user experience. Since Alfred is a personal, single-user tool, the allowlist approach is both simpler and more appropriate than a full role-based access system.

Server-Side Credentials

The agent server stores sensitive credentials in the macOS Keychain. Credentials are fetched lazily on first use and cached in memory for the lifetime of the process. This means they never appear in environment variables, config files, or logs.

Architectural Isolation

The dashboard is a pure client-rendered application. It contains no provider SDK imports, no direct database access, and no secret values. All data access flows through the agent server's HTTP API, and I made sure no credentials ever reach the dashboard layer. This means that even if the dashboard source code were fully exposed, it would not leak any credentials or grant any access to the underlying data.

Resilience and Caching

Alfred applies several resilience patterns across the system to handle network failures, API rate limits, and performance constraints.

In-Memory TTL Cache

The TtlCache class in cache.ts provides a simple time-to-live cache backed by a JavaScript Map. Each entry stores its data alongside an expiresAt timestamp. The get() method checks expiration on every access and automatically evicts stale entries. The getOrFetch() method combines cache lookup with lazy population:

async getOrFetch<T>(key: string, ttlMs: number, fetcher: () => Promise<T>): Promise<T> {
  const cached = this.get<T>(key);
  if (cached !== undefined) return cached;
  const data = await fetcher();
  this.set(key, data, ttlMs);
  return data;
}

This is used for calendar events and DevOps data, both cached with a 3-minute TTL. During a multi-round chat conversation where Alfred might query the calendar several times, only the first call hits the API and subsequent calls return the cached result. The 3-minute window balances data freshness with meaningful API call reduction.
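For completeness, here is a minimal re-implementation of the whole cache, including the lazy eviction in get(). This is an illustration of the pattern, not Alfred's exact cache.ts:

```typescript
// Minimal TTL cache backed by a Map: each entry carries an expiresAt
// timestamp, and get() evicts stale entries on access.
class TtlCache {
  private entries = new Map<string, { data: unknown; expiresAt: number }>();

  set<T>(key: string, data: T, ttlMs: number): void {
    this.entries.set(key, { data, expiresAt: Date.now() + ttlMs });
  }

  get<T>(key: string): T | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (Date.now() >= entry.expiresAt) {
      this.entries.delete(key);      // lazy eviction on access
      return undefined;
    }
    return entry.data as T;
  }

  async getOrFetch<T>(key: string, ttlMs: number, fetcher: () => Promise<T>): Promise<T> {
    const cached = this.get<T>(key);
    if (cached !== undefined) return cached;
    const data = await fetcher();
    this.set(key, data, ttlMs);
    return data;
  }
}
```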

Agent Loop Resilience

The classifier pause behavior is covered in the Email Polling section above. Beyond that, the polling loop is designed so that a failure in any single stage (classification, action proposal, or action execution) does not crash or block the rest of the loop. Each stage fails independently and logs the error without taking down the whole cycle.

Power Automate Retries

The Power Automate client implements a 3-attempt retry with linear backoff (1s, 2s, 3s) for transient HTTP errors and timeouts. Non-retryable errors such as 4xx client errors (excluding 429) fail immediately without retrying. Each request uses AbortController with a 30-second timeout to prevent indefinite hangs.
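A sketch of that policy follows; the helper names and error plumbing are assumptions, but the retry, backoff, and timeout shape match the description above:

```typescript
// Retry policy sketch: 3 attempts, linear backoff (1s, 2s, 3s),
// 30s AbortController timeout, fail-fast on non-429 4xx.
function isRetryableStatus(status: number): boolean {
  if (status === 429) return true;                  // rate limit: retry
  if (status >= 400 && status < 500) return false;  // other 4xx: fail fast
  return status >= 500;                             // server errors: retry
}

function backoffMs(attempt: number): number {
  return attempt * 1000;                            // linear: 1s, 2s, 3s
}

async function fetchWithRetry(
  url: string,
  init: RequestInit = {},
  attempts = 3,
): Promise<Response> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), 30_000); // 30s cap per request
    try {
      const res = await fetch(url, { ...init, signal: controller.signal });
      if (res.ok) return res;
      if (!isRetryableStatus(res.status)) throw new Error(`HTTP ${res.status} (not retried)`);
      lastError = new Error(`HTTP ${res.status}`);
    } catch (err) {
      if (err instanceof Error && err.message.endsWith("(not retried)")) throw err;
      lastError = err;                              // network error or abort timeout
    } finally {
      clearTimeout(timer);
    }
    if (attempt < attempts) await new Promise((r) => setTimeout(r, backoffMs(attempt)));
  }
  throw lastError;
}
```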

Push Notification Delivery

The web push delivery mechanics, including concurrent sends, Promise.allSettled(), and automatic cleanup of expired subscriptions, are covered in the Push Notifications section under Discoveries, where the full implementation is explained in context.

Deployment and Operations

Alfred runs as three persistent background services on macOS, managed by launchd, Apple's native process manager. The deployment system is entirely script-based with no containers, no cloud infrastructure, and no external process managers. Everything runs on a single Mac.

The Three Services

The agent server is the core process. It runs the Node.js HTTP API, the background email polling loop, the action execution pipeline, and the finance statement processor. It owns all external API calls to Gmail, Google Calendar, Anthropic, Azure DevOps, and Power Automate, along with all OAuth credentials stored in macOS Keychain and the SQLite database.

The dashboard is a Next.js application serving the client-rendered UI. In production it runs against a pre-built output directory and makes no direct calls to any external service. All data comes through the agent server's HTTP API. It receives a bearer token as an environment variable so it can authenticate its requests to the agent server.

The Cloudflare tunnel creates an encrypted outbound connection from the Mac to Cloudflare's edge network, making the dashboard publicly accessible without opening any inbound ports or touching the router. It routes HTTPS traffic from the public domain down to the local Next.js server on a local port.

launchd Service Configuration

Each service is defined as a .plist property list file. The plist files use placeholder tokens that are replaced with real values at deploy time using sed. The key properties are RunAtLoad: true to start on login, KeepAlive: true to auto-restart on crash, and ThrottleInterval: 10 to wait at least 10 seconds between restart attempts and prevent tight crash loops:

<key>ProgramArguments</key>
<array>
    <string>PROJECT_ROOT/node_modules/.bin/tsx</string>
    <string>apps/agent-server/src/index.ts</string>
</array>
<key>KeepAlive</key>
<true/>
<key>ThrottleInterval</key>
<integer>10</integer>

Each service logs stdout and stderr to separate files that can be tailed in real time for debugging.

The Deploy Script

Deployment runs through a single script that orchestrates six steps in order:

  • creating the log directory
  • sourcing the .env file to load environment variables
  • running npm install at the monorepo root to install all workspace dependencies
  • running npm run build to compile all TypeScript packages in dependency order (domain → application → infrastructure → contracts → agent server, then the Next.js dashboard)
  • copying each plist template into ~/Library/LaunchAgents/ with placeholders replaced by real paths
  • loading all three services with launchctl load to start them immediately

Before installing each plist, the script unloads any previously running version to prevent conflicts, resulting in a brief restart with minimal downtime:
for plist in com.alfred.agent.plist com.alfred.dashboard.plist com.alfred.cloudflared.plist; do
  launchctl unload "$LAUNCH_AGENTS_DIR/$plist" 2>/dev/null || true
  sed -e "s|PROJECT_ROOT|$PROJECT_ROOT|g" \
      -e "s|USER_HOME|$USER_HOME|g" \
      -e "s|CLOUDFLARED_BIN|$CLOUDFLARED_BIN|g" \
      -e "s|NODE_BIN_PATH|$NODE_BIN_PATH|g" \
      -e "s|BEARER_TOKEN_VALUE|${BEARER_TOKEN:-}|g" \
      "$DEPLOY_DIR/$plist" > "$LAUNCH_AGENTS_DIR/$plist"
done

The script automatically detects the Node.js binary path across nvm, Homebrew, and system installs, and locates the cloudflared binary for both Apple Silicon and Intel Homebrew paths. At the end it prints a macOS settings checklist reminding me to enable auto-login, prevent sleep, and configure startup after power failure, since the Mac effectively acts as a persistent home server.

First-Time Setup

Initial installation is handled by a setup script that checks prerequisites (Homebrew and Node.js 20 or above), installs cloudflared, creates the .env file interactively, runs the Google OAuth flow by opening a browser for consent and storing the resulting refresh token in Keychain, authenticates with Cloudflare, creates the tunnel, configures DNS routes, and then kicks off the deploy script to bring everything up.

Operational Commands

I have scripts for the full operational lifecycle. A status command shows whether each service is running, its PID, and the last 5 log lines. A teardown command unloads all services and removes the plist files from LaunchAgents while preserving logs. A universal launcher supports multiple modes: all for full production, dev for hot-reload development, agent or dashboard individually, status for health checks, and doctor for preflight validation.

Configuration

All configuration flows through environment variables loaded from a .env file at the project root. A config.ts module reads these and returns a typed AppConfig object. Three variables are required: GOOGLE_CLIENT_ID, GOOGLE_CLIENT_SECRET, and ANTHROPIC_API_KEY. Everything else is optional and enables features progressively. Setting AZURE_DEVOPS_ORG enables DevOps integration. Setting PA_FLOW_MAIL_SEARCH enables Outlook. Setting VAPID_PUBLIC_KEY enables push notifications and so on. If an optional config block is absent, the composition root simply skips registering those adapters and use cases, so the system degrades gracefully rather than failing to start.
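The progressive-enablement idea can be sketched like this. The variable names come from the description above, while the AppConfig shape and helper are illustrative:

```typescript
// Sketch of progressive feature enablement from environment variables:
// three required vars, everything else optional and feature-gating.
interface AppConfig {
  google: { clientId: string; clientSecret: string };
  anthropicApiKey: string;
  devOps?: { org: string };                 // present only if DevOps is enabled
  outlook?: { mailSearchFlowUrl: string };  // present only if Outlook is enabled
  push?: { vapidPublicKey: string };        // present only if push is enabled
}

function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const required = (name: string): string => {
    const v = env[name];
    if (!v) throw new Error(`Missing required env var: ${name}`);
    return v;
  };
  return {
    google: {
      clientId: required("GOOGLE_CLIENT_ID"),
      clientSecret: required("GOOGLE_CLIENT_SECRET"),
    },
    anthropicApiKey: required("ANTHROPIC_API_KEY"),
    devOps: env.AZURE_DEVOPS_ORG ? { org: env.AZURE_DEVOPS_ORG } : undefined,
    outlook: env.PA_FLOW_MAIL_SEARCH
      ? { mailSearchFlowUrl: env.PA_FLOW_MAIL_SEARCH }
      : undefined,
    push: env.VAPID_PUBLIC_KEY ? { vapidPublicKey: env.VAPID_PUBLIC_KEY } : undefined,
  };
}
```

The composition root can then check each optional block and skip registering the corresponding adapters when it is undefined.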

Data Integrity

Ensuring that Alfred handles data meticulously was very important to me. It does not make sense to build an assistant that is sloppy with the information it presents. Therefore I wrote Alfred in such a way that he prevents duplicate and inconsistent data through idempotency checks, upsert semantics, and schema separation at every data boundary.

Idempotent Action Proposals

Before creating a new entry in the action log, the proposal system queries for any existing entry with the same resourceId and type. If a match is found, the proposal is silently skipped and returns null. This means the polling loop can encounter the same email multiple times, such as after a server restart, without generating duplicate action proposals:

const existing = await this.actionLog.getByResourceIdAndType(action.resourceId, action.type);
if (existing) return null;

Email Upsert Semantics

Whether an email arrives via polling, a webhook, or is encountered again after a restart, the upsert guarantees exactly one row per email ID. All fields including subject, body, labels, and read status are updated to their latest values, and an updated_at timestamp records when the last refresh occurred:

INSERT INTO emails (id, thread_id, from_address, ..., updated_at)
VALUES (@id, @threadId, @from, ..., datetime('now'))
ON CONFLICT(id) DO UPDATE SET
  thread_id = excluded.thread_id,
  from_address = excluded.from_address,
  ...
  updated_at = datetime('now')

Conversation Ordering

Chat messages are stored with a created_at timestamp and always queried in chronological order using ORDER BY created_at ASC. Messages are never reordered, edited, or deleted after creation. This ensures the conversation history Alfred sees when composing a response exactly matches what the user experienced.

Normalised Schema Design

Classifications are stored in a separate classifications table linked to emails by email_id. This separation means re-classifying an email, whether due to a model update or a rule change, only touches the classification row without affecting the underlying email data. The email's original content, headers, labels, and metadata remain untouched. Follow-ups and action log entries follow the same pattern. Each table has a single source of truth for its own data, and no operation on one table can corrupt another.

Pitfalls: From Intent Extraction to Tool Use

I started Alfred's chat system with a pure intent extraction approach. The idea was straightforward: send my message to a fast LLM, ask it to return structured JSON with an intent type and parameters, then map that intent to an executor function. A message like "show me today's calendar" would produce {"type": "list_calendar_events", "timeMin": "2026-03-16", "timeMax": "2026-03-16"}, and the system would call the calendar adapter directly:

const intents = await this.extractIntents(extractionLlm, message, recentHistory, priorResults, deps, validToolNames);
for (const intent of intents) {
  const entry = deps.toolRegistry.get(intent.type as string);
  if (entry) {
    const result = await entry.execute(deps.intentExecutor, intent);
    if (result) results.push(result);
  }
}

I built this following the Open/Closed Principle. Each intent type was a self-contained ToolEntry registered in a ToolRegistry. Adding a new capability meant registering a new entry with a name, schema, executor function, and summariser. No existing code needed modification:

toolRegistry.register({
  name: "search_emails",
  description: "Search emails by query, category, or sender",
  inputSchema: { ... },
  execute: async (deps, input) => { ... },
  summarize: (input) => `Searched emails: ${input.query}`,
});

In theory this was clean and extensible. In practice, the cost of adding intents started to compound. Every new capability required writing a system prompt fragment describing the intent format, adding routing rules so the LLM knew when to select it, writing the executor function, and testing that the LLM reliably produced the right JSON structure. At 5 intent types it was manageable. By the time I had 15 (email search, calendar list, calendar create, calendar update, calendar search, work item query, work item create, PR query, pipeline list, Teams messages, follow-ups, actions, repo list, commits, branch list), the intent extraction system prompt had ballooned. The LLM was juggling too many format rules and frequently produced malformed JSON or selected the wrong intent type.

The extraction prompt had grown to include detailed routing rules, source-specific provider logic, multi-intent support, and follow-up round awareness:

```typescript
const INTENT_RULES = `
ROUTING RULES:
- "check my Outlook" → search_emails with source: "outlook"
- "search Gmail" → search_emails with source: "gmail"
- "Outlook calendar" → list_calendar_events with source: "outlook-calendar"
- "work items" / "tickets" → query_work_items
- "pull requests" / "PRs" → query_source_control with subtype: "pull_requests"
...
`;
```

Every new intent meant updating these routing rules, testing edge cases, and hoping the model did not confuse the new intent with existing ones. The Open/Closed architecture was holding up at the code level: I was not modifying existing executors. But the prompt was a single growing artifact shared by every intent, so adding one intent risked degrading the reliability of all the others.

This led me to Claude's native tool use API. Instead of asking the LLM to produce JSON matching my custom schema, I could give it proper tool definitions and let Claude's built-in tool calling handle the routing:

```typescript
const tools = deps.toolRegistry.toToolDefinitions();
const response = await deps.llm.completeWithTools({
  system: systemPrompt,
  messages,
  tools,
  maxTokens: 4096,
});
```

Claude's tool use was noticeably more reliable. It natively understands tool schemas, validates parameters against the input schema, and handles multi-tool calls cleanly. The model picks the right tool more consistently than my intent extraction prompt ever did, because tool selection is a first-class capability of the model rather than something I was trying to engineer through prompt instructions.
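For concreteness, a Claude tool definition is just a name, a description, and a JSON Schema describing the input. The `search_emails` entry from earlier, translated into that shape, might look like this (the parameter details are illustrative):

```typescript
// Shape of a tool definition for Claude's Messages API. The input_schema
// field is standard JSON Schema; parameters here are illustrative.
const searchEmailsTool = {
  name: "search_emails",
  description: "Search emails by query, category, or sender.",
  input_schema: {
    type: "object",
    properties: {
      query: { type: "string", description: "Free-text search query" },
      source: { type: "string", enum: ["gmail", "outlook"] },
    },
    required: ["query"],
  },
};
```

The model validates arguments against this schema before the call ever reaches your executor, which is exactly the reliability win over hand-rolled JSON extraction.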

But tool use burned through API credits quickly. Each round of the conversation becomes a full API call carrying the entire tool catalogue, conversation history, and system prompt. A simple question like "what meetings do I have today?" that previously cost one cheap Haiku call for intent extraction plus one Sonnet call for response composition now cost one or more full Sonnet calls with tool definitions attached, adding significant token overhead to every request.

I balanced models to keep costs sustainable. Intent extraction uses Haiku because it only needs to produce structured JSON, not reason deeply. Final response composition uses Sonnet with extended thinking enabled because that is where quality matters:

```typescript
const strategyDeps = {
  llm: this.deps.llm,         // Sonnet: reasoning and response
  fastLlm: this.deps.fastLlm, // Haiku: intent extraction
  ...
};
```

Rather than committing to one approach, I gave the chat system the ability to switch between both modes. The mode parameter on each request selects the active strategy:

```typescript
const strategy = mode === "tool_use" ? toolUseStrategy : intentStrategy;
const strategyResult = await strategy.run({ message, history, localContext, systemPrompt, deps });
```

Intent mode is cheaper and faster for straightforward queries where the routing rules work well. Tool use mode is more reliable for complex, ambiguous, or multi-step requests where maintaining routing rules would be impractical. Both strategies implement the same ChatStrategy interface and share the same ToolRegistry, so all capabilities are available in both modes without any duplication.
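Pieced together from the call sites above, the shared contract might look like this; the field types are my own guesses at the real signatures:

```typescript
// Both modes implement one contract; the caller only decides which
// strategy runs, never how it runs.
interface ChatStrategyInput {
  message: string;
  history: string[];
  localContext: string;
  systemPrompt: string;
  deps: unknown; // shared dependencies, including the ToolRegistry
}

interface ChatStrategyResult {
  response: string;
  results: string[];
  actions: string[];
}

interface ChatStrategy {
  run(input: ChatStrategyInput): Promise<ChatStrategyResult>;
}
```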

From Single Request-Response to Reasoning Loops

Early on, the chat used a single request-response pattern. I ask a question, Alfred gathers context from the database, sends everything to the LLM in one shot, and returns the response. The quality was poor. With 15+ tools and a rich system prompt, the model would frequently miss details, give shallow answers, or fail to connect information across multiple data sources. A question like "what's my schedule like tomorrow and do I have any overdue follow-ups?" would produce a partial answer because the model was trying to handle everything in a single pass.

My first instinct was to use a better model. I switched from Sonnet to Opus for the response composition step and the quality jumped immediately. Opus reasons more carefully, connects dots across context, and produces noticeably more nuanced responses. But it was expensive. Opus costs significantly more per token than Sonnet, and every chat message was a full context window call carrying email stats, action history, follow-up data, and conversation history.

This led me to implement reasoning loops. Instead of asking the model to do everything in one pass, I let it work iteratively. In intent mode, the strategy runs up to 5 rounds. Each round extracts intents, executes them, and feeds the results back into the next round's context:

```typescript
for (let round = 0; round < MAX_ROUNDS; round++) {
  const intents = await this.extractIntents(extractionLlm, message, recentHistory, priorResults, deps, validToolNames);
  if (intents.length === 0) break;
  const results = await this.executeTools(deps, intents);
  allResults.push(`--- Round ${round + 1} ---\n${results.join("\n\n")}`);
}
```

In tool use mode, the loop is similar but driven by Claude's stop reason. The model keeps calling tools until it decides it has enough information and returns a final text response:

```typescript
for (let round = 0; round < MAX_ROUNDS; round++) {
  const response = await deps.llm.completeWithTools({ system: systemPrompt, messages, tools, maxTokens: 4096 });
  if (response.stopReason === "end_turn") {
    return { response: response.text ?? "", results: allResults, actions: allActions };
  }
  // ... execute tool calls, feed results back
}
```
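The elided "execute tool calls, feed results back" step follows the Messages API convention: tool outputs go back to Claude as a user-role message containing `tool_result` blocks, each referencing the `id` of the `tool_use` block it answers. A simplified sketch, with types of my own:

```typescript
// Feed tool outputs back to Claude. Each tool_result must carry the id
// of the tool_use block it is answering, or the API rejects the turn.
type ContentBlock =
  | { type: "tool_use"; id: string; name: string; input: Record<string, unknown> }
  | { type: "tool_result"; tool_use_id: string; content: string };

function buildToolResultTurn(
  toolCalls: { id: string; output: string }[],
): { role: "user"; content: ContentBlock[] } {
  return {
    role: "user",
    content: toolCalls.map((call) => ({
      type: "tool_result",
      tool_use_id: call.id,
      content: call.output,
    })),
  };
}
```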

This multi-round approach means a request like "invite Sarah to my 3pm meeting tomorrow" works naturally: round 1 searches tomorrow's calendar events, and round 2 uses the event ID from that result to update the event with a new attendee. The LLM sees prior results in an ACTIONS ALREADY EXECUTED THIS TURN block and returns {"intents": [{"type": "none"}]} when everything is resolved and the loop should stop.

Here is a log excerpt from a real two-round tool-use conversation: round 1 calls a tool, round 2 composes the final answer.

```json
{"timestamp":"2026-03-16T07:11:03.210Z","level":"info","msg":"\nchat:start","component":"chat","message":"What does my outlook calendar look like ?","historyLength":16,"mode":"tool_use"}
{"timestamp":"2026-03-16T07:11:07.854Z","level":"info","msg":"llm:completeWithTools","component":"llm","model":"claude-opus-4-6","inputTokens":8168,"outputTokens":131,"durationMs":4644,"stopReason":"tool_use"}
{"timestamp":"2026-03-16T07:11:07.854Z","level":"info","msg":"chat:tool-use-round","component":"chat","round":1,"stopReason":"tool_use","toolCallCount":1,"hasText":true,"durationMs":4644}
{"timestamp":"2026-03-16T07:11:07.855Z","level":"info","msg":"chat:tool-result","component":"chat","tool":"list_calendar_events","resultLength":33,"resultPreview":"Calendar Events: No events found."}
{"timestamp":"2026-03-16T07:11:13.314Z","level":"info","msg":"llm:completeWithTools","component":"llm","model":"claude-opus-4-6","inputTokens":8318,"outputTokens":120,"durationMs":5458,"stopReason":"end_turn"}
{"timestamp":"2026-03-16T07:11:13.315Z","level":"info","msg":"chat:tool-use-round","component":"chat","round":2,"stopReason":"end_turn","toolCallCount":0,"hasText":true,"durationMs":5459}
{"timestamp":"2026-03-16T07:11:13.315Z","level":"info","msg":"chat:complete","component":"chat","totalDurationMs":10106,"mode":"tool_use","actionCount":1}
```

The reasoning happens where it counts. Mechanical work like deciding which tools to call uses the cheapest model that can do it reliably, and the expensive synthesis step only fires once at the end. A 3-round conversation costs 3 Haiku calls plus 1 Sonnet call rather than 3 Opus calls.

Prompt Refinement

Prompt refinement turned out to be significantly harder with intent extraction than with tool use. With intent extraction, I was responsible for the entire instruction surface: routing rules, format specifications, edge case handling, multi-intent support, source disambiguation, date inference, and conversational context awareness. Every ambiguous user message required a new rule or clarification in the prompt. The prompt became a fragile, growing document where changing one section could silently break another.

With tool use, Claude does most of the heavy lifting. I define each tool's name, description, and input schema. Claude figures out when to call it, what parameters to pass, and how to combine results across multiple tools. The refinement effort shifted from "teach the model my custom intent format" to "write clear tool descriptions and let the model's built-in tool selection do its job." This was a dramatically smaller surface area to maintain.

The persona prompt is where I spent the most deliberate effort, and I structured it to follow the Open/Closed Principle. The BASE_PERSONA defines Alfred's character, his access to workspace systems, and the critical behavioural rules that apply regardless of which mode is active:

```typescript
export const BASE_PERSONA = `You are Alfred, a distinguished personal workspace assistant.
You are an old English gentleman, impeccably dressed in a three-piece suit at all times,
refined in manner, and utterly devoted to your employer. You always address the user as
"Master Jo". Your speech carries the quiet authority and warmth of a seasoned butler...

CRITICAL RULES:
- ALWAYS address the user as "Master Jo"
- ONLY use the data provided to you. Do not make up emails, events, or results.
- When calendar events were CREATED, confirm this to the user with details and calendar links.
...`;
```

Mode-specific instructions are appended on top without touching the base. Intent mode tells Alfred that actions have already been executed and results are already in context, so he should not pretend to be searching. Tool use mode tells Alfred to actively call tools to fetch fresh data. The buildSystemPrompt() function composes these cleanly:

```typescript
export function buildSystemPrompt(mode: "intent" | "tool_use"): string {
  const modeInstructions = mode === "tool_use" ? TOOL_USE_MODE_INSTRUCTIONS : INTENT_MODE_INSTRUCTIONS;
  return BASE_PERSONA + "\n" + modeInstructions;
}
```

This separation means I can refine Alfred's personality, add new behavioural rules, or adjust mode-specific instructions entirely independently. Adding a new mode in the future means writing a new instruction block and adding a case to buildSystemPrompt(), without touching the persona or any existing mode instructions.
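Sketching that future extension, with a table lookup in place of the ternary so new modes are pure additions (the "briefing" mode and all instruction strings here are hypothetical):

```typescript
// Illustrative extension: a hypothetical "briefing" mode is a new constant
// plus one table entry, with no edits to the persona or existing modes.
const BASE_PERSONA = "You are Alfred...";
const INTENT_MODE_INSTRUCTIONS = "Results are already in context; do not pretend to search.";
const TOOL_USE_MODE_INSTRUCTIONS = "Actively call tools to fetch fresh data.";
const BRIEFING_MODE_INSTRUCTIONS = "Summarise overnight activity in one report."; // new block

type ChatMode = "intent" | "tool_use" | "briefing";

const MODE_INSTRUCTIONS: Record<ChatMode, string> = {
  intent: INTENT_MODE_INSTRUCTIONS,
  tool_use: TOOL_USE_MODE_INSTRUCTIONS,
  briefing: BRIEFING_MODE_INSTRUCTIONS,
};

function buildSystemPrompt(mode: ChatMode): string {
  return BASE_PERSONA + "\n" + MODE_INSTRUCTIONS[mode];
}
```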

The persona itself evolved through iteration. Early versions were too stiff and formal. Later versions overcorrected and became too casual. The current version balances warmth with efficiency, giving Alfred permission to be dry-witted and occasionally opinionated while staying concise and never fabricating data.

Discoveries

The Floodgate Effect

Once I had the first working version of Alfred deployed, something unexpected happened: my mind would not stop generating ideas. The initial version could poll Gmail, classify emails, propose actions, and let me approve them from a dashboard. It was functional, but using it every day exposed gaps and opportunities I had not anticipated during planning. Every morning I would open the dashboard, see how Alfred handled my overnight inbox, and think "what if he could also do this?" The backlog grew faster than I could build.

This is something I did not expect about building a personal tool. When you are the only user, the feedback loop is immediate. There is no product manager filtering requests, no sprint planning, no prioritisation meetings. You feel the friction directly, and the fix is always within reach. That immediacy is both a gift and a trap. I had to learn to be disciplined about scope, because every "quick addition" carries a maintenance cost that compounds.

Financial Statement Processing

The first major expansion came from a personal pain point. I bank with two banks in Malaysia, and both send monthly e-statements as password-protected PDF attachments to my Gmail. Every month I would download the PDFs, unlock them, manually scan through transactions, and try to categorise spending in a spreadsheet. It was tedious and error-prone, and I eventually stopped doing it altogether. Then I realised Alfred already had the infrastructure to solve this: he polls Gmail, he can download attachments, and he has an LLM for classification.

I built a six-stage pipeline that runs automatically during each polling cycle. Alfred searches Gmail for emails from the configured bank sender addresses, filters for emails with PDF attachments, and checks each against the bank_statements table to skip already-processed ones. The idempotency check matters because the polling loop runs every 60 seconds and the same bank emails will appear in search results repeatedly:

```typescript
private async findUnprocessedIds(bank: BankConfig, filters: EmailSearchFilters): Promise<string[]> {
  const ids = await this.deps.emailRead.searchFilteredIds(filters);
  const unprocessed: string[] = [];
  for (const id of ids) {
    if (!(await this.deps.statementRepo.isStatementProcessed(id))) {
      unprocessed.push(id);
    }
  }
  return unprocessed;
}
```

For each unprocessed email, Alfred downloads the PDF attachment and decrypts it using the bank-specific password from environment config. This is where I hit the first real bug. The pdf-parse library accepts a password option, but its internal implementation completely ignores it. It passes the raw buffer directly to PDF.js's getDocument() instead of wrapping it in { data, password }. Every statement was failing with a cryptic "No password given" error. The fix was a workaround that tricks pdf-parse by passing a PDF.js parameter object in place of the buffer:

```typescript
const pdfInput = { data: new Uint8Array(pdfBuffer), password } as unknown as Buffer;
const result = await pdf(pdfInput);
```

After decryption, the raw text goes to a bank-specific parser. Each bank formats its statements differently, so I built a StatementParserRegistry that routes to the correct parser based on the BankProvider enum.

The parser also strips page noise including headers, footers, and the Chinese and Malay translations that some banks include on every page, and collects multi-line transaction details like merchant names and reference numbers.

Once parsed, transactions go through a hybrid classification stage. The HybridTransactionClassifier first attempts rule-based categorisation using keyword matching (merchant names like "GRAB" map to transport, "MCDONALD'S" maps to food), and falls back to Claude Haiku for ambiguous transactions. This hybrid approach keeps costs low because most transactions have recognisable merchant names that do not need LLM inference.
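A sketch of that hybrid flow. The keyword table and category names are illustrative, and `classifyWithHaiku` stands in for the LLM fallback call:

```typescript
// Rule-based first: known merchant keywords resolve locally and cost
// nothing. Only unrecognised descriptions fall through to the LLM.
const KEYWORD_CATEGORIES: Record<string, string> = {
  GRAB: "transport",
  "MCDONALD'S": "food",
};

async function classifyTransaction(
  description: string,
  classifyWithHaiku: (desc: string) => Promise<string>,
): Promise<string> {
  const upper = description.toUpperCase();
  for (const [keyword, category] of Object.entries(KEYWORD_CATEGORIES)) {
    if (upper.includes(keyword)) return category;
  }
  return classifyWithHaiku(description); // LLM fallback for the ambiguous tail
}
```

Because the rule table handles the common merchants, the LLM only ever sees the long tail, which is what keeps per-statement costs low.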

The pipeline also handles historical backfill. On first run, it does not just process recent statements. It walks backward through the inbox month by month, processing older statements until it reaches a configurable cutoff, defaulting to 12 months. A backfill_state table tracks the cursor position per bank so the backfill can resume across server restarts:

```typescript
private async processBackfill(bank: BankConfig): Promise<void> {
  const isComplete = await this.deps.backfillStateRepo.isComplete(bank.bankProvider);
  if (isComplete) return;

  const cursor = await this.deps.backfillStateRepo.getCursor(bank.bankProvider);
  const cutoff = new Date();
  cutoff.setMonth(cutoff.getMonth() - this.deps.backfillMonths);
  // ... fetch historical emails before cursor, process, advance cursor
}
```

All of this produces a normalised finance_transactions table where every transaction from every bank shares the same schema: date, description, amount, type (credit or debit), balance, category, merchant name, and statement period. Two banks, different formats, one unified table.
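Expressed as a type, a row in that unified table might look like this; the field names mirror the list above, but the real column names may differ:

```typescript
// One schema for every bank: each parser normalises into this shape
// before the row is inserted into finance_transactions.
interface FinanceTransaction {
  date: string;            // ISO date of the transaction
  description: string;     // statement line, cleaned of page noise
  amount: number;
  type: "credit" | "debit";
  balance: number;         // running balance after the transaction
  category: string;        // from the hybrid classifier
  merchantName: string | null;
  statementPeriod: string; // e.g. "2026-02"
  bankProvider: string;    // which parser produced the row
}
```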

Making Financial Data Conversational

Having the data in SQLite was useful on its own: the dashboard has a Finance page with tables and charts. But the real power came from wiring it into Alfred's chat. I registered finance-specific tools in the ToolRegistry so that both chat modes can query transaction data naturally.

The chat can now answer questions like "how much did I spend on food last month?", "what were my biggest transactions in February?", or "show me all Grab transactions this year." Alfred queries the finance_transactions table, aggregates the results, and presents them in his butler persona.

What I did not anticipate is that this naturally enabled budgeting. Once Alfred could tell me "you spent RM 2,400 on dining in February, Master Jo," I started asking follow-up questions like "is that more than January?" and "set a reminder if I go over RM 2,000 next month." The transaction data combined with the follow-up system and push notifications created a lightweight budget monitoring capability that I never explicitly designed. It emerged from the intersection of features that already existed.

Progressive Web App

The dashboard started as a standard Next.js web app accessed through a browser tab. It worked, but it felt disposable. I would forget to check it, or close the tab and lose my place. Making Alfred a Progressive Web App changed that relationship. With a PWA manifest, a service worker, and the right meta tags, Alfred became an app I could install on my phone and in my Mac's dock. It has its own window, its own icon, and it persists across reboots.
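For reference, the manifest half of that setup is tiny; a minimal one along these lines (all values illustrative) is enough for installability:

```json
{
  "name": "Alfred",
  "short_name": "Alfred",
  "start_url": "/",
  "display": "standalone",
  "background_color": "#111111",
  "theme_color": "#111111",
  "icons": [
    { "src": "/icons/icon-192.png", "sizes": "192x192", "type": "image/png" },
    { "src": "/icons/icon-512.png", "sizes": "512x512", "type": "image/png" }
  ]
}
```

The other installability requirements are a registered service worker and serving over HTTPS.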

The practical difference is small since it is still the same Next.js app behind the scenes. But the psychological difference is significant. An app in the dock feels like a tool. A browser tab feels temporary. I open Alfred every morning now the way I open Slack or my email client. It has presence.

Push Notifications with Service Workers

The feature I am most proud of is the push notification system. Before I built it, Alfred was purely pull-based. I had to open the dashboard to see if anything needed attention. Proposed actions would sit in the approval queue for hours because I simply forgot to check. Follow-ups would go overdue silently.

Push notifications made Alfred proactive. When the classification pipeline proposes a new action for approval, Alfred sends a push notification to my browser. When a high-priority email arrives, he notifies me immediately. When a DevOps PR webhook fires, I get a notification with a deep link straight to the approvals page.

The implementation uses the Web Push protocol with VAPID keys for authentication. The SendNotification use case checks user preferences before sending. I can toggle notifications per event type from the Settings page, and for high-priority emails I can set a minimum priority threshold:

```typescript
const pref = await this.preferenceRepo.get(event.type);
if (pref && !pref.enabled) return;

if (event.type === NotificationEventType.HighPriorityEmail && emailPriority !== undefined) {
  const threshold = PRIORITY_THRESHOLDS[minPriority] ?? PRIORITY_THRESHOLDS.high;
  if (emailPriority > threshold) return;
}
```

The WebPushAdapter sends to all registered browser subscriptions concurrently using Promise.allSettled(), so a failed delivery to one device does not block others. It automatically cleans up expired subscriptions when the push service returns HTTP 410 or 404, which happens when a user clears browser data or uninstalls the PWA.
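The delivery-and-cleanup logic can be sketched independently of the push library. Here `send` stands in for web-push's `sendNotification`, which rejects with an error carrying a `statusCode` when delivery fails; `removeSubscription` is a hypothetical stand-in for the repository call:

```typescript
// Assumed shape of the send function: rejects with { statusCode } on failure.
type SendFn = (endpoint: string, payload: string) => Promise<void>;

async function sendToAll(
  endpoints: string[],
  payload: string,
  send: SendFn,
  removeSubscription: (endpoint: string) => Promise<void>,
): Promise<void> {
  // Promise.allSettled: one rejected delivery never blocks the others.
  const outcomes = await Promise.allSettled(endpoints.map((e) => send(e, payload)));
  for (const [i, outcome] of outcomes.entries()) {
    if (outcome.status === "rejected") {
      const statusCode = (outcome.reason as { statusCode?: number }).statusCode;
      // 410 Gone / 404: the browser subscription no longer exists.
      if (statusCode === 410 || statusCode === 404) {
        await removeSubscription(endpoints[i]);
      }
    }
  }
}
```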

On the client side, a service worker listens for push events and displays native OS notifications with the app icon, a body preview, and a deep link URL. The notificationclick handler is smart about reusing existing windows: if the dashboard is already open, it focuses that tab instead of opening a new one:

```javascript
self.addEventListener("notificationclick", (event) => {
  event.notification.close();
  const url = event.notification.data?.url ?? "/";
  event.waitUntil(
    self.clients.matchAll({ type: "window", includeUncontrolled: true }).then((clients) => {
      for (const client of clients) {
        if (client.url.includes(url) && "focus" in client) return client.focus();
      }
      return self.clients.openWindow(url);
    }),
  );
});
```

The usePushNotifications React hook manages the entire subscription lifecycle from the UI: checking browser support, requesting notification permission, fetching the VAPID public key from the server, subscribing via the Push API, and sending the subscription details to the server for storage. Unsubscribing reverses the process, removing the subscription from both the browser and the server database.
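One fiddly detail in that subscription flow: `pushManager.subscribe()` expects the VAPID public key as a `Uint8Array`, while the server hands it out as URL-safe base64. A dependency-free version of the conventional conversion helper:

```typescript
const B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Convert a URL-safe base64 VAPID public key into the Uint8Array the
// Push API requires as applicationServerKey.
function urlBase64ToUint8Array(base64String: string): Uint8Array {
  // Normalise the URL-safe alphabet back to standard base64 and pad
  // to a multiple of four characters.
  const padding = "=".repeat((4 - (base64String.length % 4)) % 4);
  const base64 = (base64String + padding).replace(/-/g, "+").replace(/_/g, "/");
  const bytes: number[] = [];
  for (let i = 0; i < base64.length; i += 4) {
    const chunk = base64.slice(i, i + 4);
    const n = chunk.split("").map((c) => (c === "=" ? 0 : B64.indexOf(c)));
    bytes.push((n[0] << 2) | (n[1] >> 4));
    if (chunk[2] !== "=") bytes.push(((n[1] & 15) << 4) | (n[2] >> 2));
    if (chunk[3] !== "=") bytes.push(((n[2] & 3) << 6) | n[3]);
  }
  return new Uint8Array(bytes);
}
```

The result is what gets passed as `applicationServerKey` in `registration.pushManager.subscribe({ userVisibleOnly: true, applicationServerKey })`.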

What made this feel like a real discovery is how it changed my workflow. Before push notifications, Alfred was a dashboard I checked. After push notifications, Alfred is an assistant who taps me on the shoulder. The difference between pull and push is the difference between a tool and a colleague. When my phone buzzes with "Action: archive. Proposed archive for 'Your NIKE order has shipped', Master Jo," I smile every time. It feels like Alfred is actually there, running the household.

Further Implementations

Retrieval-Augmented Generation for Personal Knowledge

The next frontier I want to explore is giving Alfred deep knowledge of everything I have written. I publish articles, write tweets, draft technical documentation, and take notes across multiple platforms. Right now Alfred knows my emails, my calendar, and my finances, but he does not know my voice. If someone asks me to write a thread about Clean Architecture, I start from scratch every time. If I need to reference a point I made in an article six months ago, I have to search manually.

I plan to build a RAG pipeline that indexes my published content, tweets, notes, and drafts into a vector store. A good friend of mine, Edem Kumodzi, already does this and has written an article about it. When I ask Alfred to help me write something, he would retrieve relevant passages from my own prior work and use them as context for generation. The goal is not for Alfred to write as me, but to write with full awareness of what I have already said, how I say it, and what positions I have taken. He should be able to say: "Master Jo, you wrote about this exact topic in your March article. Shall I pull the relevant points as a starting foundation?"

This is a step toward something larger. I want Alfred to have a total embodiment of who I am: not a shallow personality clone, but a deep contextual understanding of my thinking, my writing style, my professional opinions, and my personal preferences. He should know that I care about Clean Architecture and SOLID principles, that I have strong opinions about over-engineering, and that I prefer concise explanations with concrete examples. At the same time, he should remain his own person: a distinct entity with his butler persona who assists me rather than pretending to be me. The line between "knows me well" and "impersonates me" is one I want to walk carefully.

Expanding Service Integrations

Alfred currently connects to Google Workspace, Microsoft 365, and Azure DevOps. I want to push further into the services that shape my daily life.

WhatsApp is where most of my personal communication happens. The ability to search messages, get summaries of group conversations I have missed, or draft replies through Alfred would close a major gap. The challenge is that WhatsApp's API is designed for businesses rather than personal use, so I will likely need to explore the WhatsApp Business API with creative workarounds.

LinkedIn is the integration I am most excited about. I got the idea from a podcast about the discipline of maintaining professional relationships, and it resonated because I am genuinely terrible at it. I connect with people at conferences, have great conversations, and then never follow up. Alfred could do something far more personal than LinkedIn's built-in "keep in touch" feature: track my connections, identify people I have not interacted with in a while, cross-reference them with my calendar and email history, and nudge me with context. Not just "you haven't talked to Sarah in 3 months" but "you haven't talked to Sarah in 3 months. You last discussed the migration project at her company. She posted about a promotion last week. Shall I draft a congratulations message, Master Jo?" That level of contextual nudging is what turns a contact list into actual relationships.

Spotify might seem like an odd fit for a workspace assistant, but I spend a significant amount of my commute and focus time listening to engineering podcasts. I want Alfred to suggest relevant episodes based on what I am currently working on. If I am deep in a week of building a notification system, Alfred could recommend episodes about push notification architecture, service workers, or PWA best practices. The Spotify API is well-documented with solid search and recommendation endpoints, so this should be one of the more straightforward integrations to build.

Smart Home Integration

I have been thinking about extending Alfred beyond the digital workspace and into my physical space. Apple Shortcuts provides a bridge between software and home devices. If I can trigger Shortcuts programmatically, Alfred could control lights, check device status, set scenes, and interact with HomeKit accessories through natural language.

The most entertaining use case involves Juliana, my robot vacuum. She runs on a schedule, but I never actually know if she has finished cleaning or got stuck under the couch again. If I can query her status through a Shortcut or her manufacturer's API, Alfred could include in my morning briefing: "Juliana completed her cleaning cycle at 3 AM, Master Jo. All rooms covered, no incidents to report." Or more usefully: "Juliana appears to be stuck in the bedroom. She has not moved in 40 minutes. Shall I send a rescue party?"

The broader vision is for Alfred to be aware of my home the same way he is aware of my inbox. When I ask "is everything in order?", he should be able to answer with a status report covering emails, calendar, pending approvals, financial alerts, and whether the house has been cleaned. A proper butler would never limit his awareness to just the mail.

A Second Persona

My girlfriend has watched me use Alfred. This sparked an idea I had not considered: cloning Alfred's architecture for a second persona. The entire system is built on Clean Architecture with dependency injection, which means the persona, the rules, and the connected accounts are all configurable. The core infrastructure covering polling, classification, the action lifecycle, push notifications, and chat strategies is entirely provider-agnostic and user-agnostic.

In theory, creating a second instance means standing up another agent server pointed at different OAuth credentials, a different SQLite database, a different set of action rules, and a different system prompt. The persona would not be Alfred. She would get her own character, her own name, and her own way of speaking. But underneath, the same ChatService, the same ToolRegistry, the same AgentLoop, and the same strategy pattern would power everything.

The part that interests me most is how the persona shapes the experience. Alfred's butler character is not just flavour text. It affects how he delivers bad news ("I regret to inform you, Master Jo, that your credit card statement shows a rather generous dining budget this month"), how he prioritises information, and how he handles ambiguity. A different persona for a different person would need to match their communication style and preferences entirely. This is where the buildSystemPrompt() architecture pays off. The base capabilities and mode-specific instructions stay constant, while the persona layer is a separate, swappable block. Building a second agent is less about rewriting code and more about crafting a new character who happens to run on the same engine.

Conclusion

Building Alfred started as a weekend experiment: a polling loop that checked Gmail and labelled anything that looked important. What it became, over months of iteration, is something I did not fully anticipate: a personal operating system that sits between me and the noise of digital life.

The biggest lesson was not technical. It was architectural. Clean Architecture is not just an academic exercise you draw on whiteboards. It is the reason I was able to bolt on Microsoft Teams notifications, bank statement processing, and a full chat interface without rewriting the core. When your domain layer knows nothing about Gmail, adding Outlook is just another adapter. When your use cases speak in ports, swapping Claude Haiku for Sonnet is a one-line change in the composition root. The upfront cost of drawing those boundaries paid for itself ten times over.

That said, the path was not smooth. The jump from intent extraction to native tool use humbled me. Prompt engineering is not engineering in the traditional sense. There is no compiler to catch your mistakes, no type system to lean on. You ship a prompt, watch it hallucinate a tool name that does not exist, and go back to the drawing board. The multi-round reasoning loop took more iterations than any other feature, not because the code was complex, but because coaxing an LLM into reliable, structured behaviour across multiple turns is genuinely hard. Every fix revealed a new edge case. Every edge case demanded a new constraint in the system prompt. I have a much deeper respect now for anyone building production agentic systems.

The discovery that surprised me most was how naturally financial data fit into the system. I built Alfred to manage emails. The fact that bank statements arrive as email attachments meant the entire PDF extraction and transaction classification pipeline was, architecturally, just another use case plugged into the same ports. The backfill system, the hybrid classifier, the per-bank parser registry: none of it required changes to the core domain. That is Clean Architecture doing exactly what it promises.

Running everything on a Mac on my desk with a Cloudflare Tunnel was a deliberate choice. There is no monthly cloud bill. There is no cold start. My data never leaves my network unless I am the one requesting it through an encrypted tunnel. For a personal assistant that reads your email, knows your calendar, and processes your bank statements, that is not a nice-to-have. It is a requirement.

Alfred is far from finished. RAG-powered memory, WhatsApp integration, smart home control: the roadmap is long. But the foundation is solid. Every new capability I have added has reinforced the same pattern: define a port, write the use case, build the adapter, wire it in the composition root. The system grows without becoming fragile because each piece knows only what it needs to know.

If there is one thing I would tell someone starting a similar project, it is this: invest in the boundaries early. Not the features, not the UI, not the clever LLM tricks. The boundaries. Get the dependency direction right. Make your domain layer boring. Let your infrastructure layer be the only place that knows about the outside world. Everything else follows from that discipline. Alfred taught me that the most powerful personal software is not the one with the most features. It is the one you can keep evolving without fear of breaking what already works.

See you in the next one 😁
