DEV Community: German Burgardt

WhatsApp + MCP: automatic audio transcription

German Burgardt — Mon, 29 Sep 2025 19:39:53 +0000

Introduction

MCP (Model Context Protocol) can look complicated until you ship something real with it. Let's use it on something practical: expose your WhatsApp voice notes with your own MCP server and turn them into transcripts.

What is MCP?

MCP is an open standard that connects AI agents with external systems.

It has a server and a client, and they have two different ways to talk to each other:

stdio (stdin/stdout): the standard Unix mechanism for a process to receive or send data to the environment or another process.
Server-Sent Events (SSE): an HTTP mechanism where the server keeps the connection open and streams events to the client (one-way); the client sends its own messages as separate HTTP POST requests.

Quick comparison of stdio and SSE transports in MCP.
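To make the stdio transport concrete: MCP messages are JSON-RPC 2.0 objects, and over stdio each one travels as a single line terminated by a newline. The sketch below is not the SDK's implementation; `encodeMessage` and `decodeMessages` are hypothetical helpers that only illustrate the framing.

```typescript
// Illustrative only: the real SDK handles this framing for you.
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params?: Record<string, unknown>;
}

// One message per line: serialize and terminate with a newline.
function encodeMessage(msg: JsonRpcRequest): string {
  return JSON.stringify(msg) + "\n";
}

// Split a stdin buffer into complete messages plus any trailing partial line.
function decodeMessages(buffer: string): { messages: JsonRpcRequest[]; rest: string } {
  const lines = buffer.split("\n");
  const rest = lines.pop() ?? ""; // the last piece is incomplete until its newline arrives
  const messages = lines
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as JsonRpcRequest);
  return { messages, rest };
}

const wire =
  encodeMessage({ jsonrpc: "2.0", id: 1, method: "tools/list" }) +
  '{"jsonrpc":"2.0","id":2,"met'; // a second message still in flight
const decoded = decodeMessages(wire);
console.log(decoded.messages.length, JSON.stringify(decoded.rest));
```

Keeping the unfinished tail in `rest` is what lets a reader append the next chunk from stdin and try again.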

MCP architecture

Host: Claude Desktop / Cursor / any AI agent. It coordinates the LLM, spins up MCP clients, and shows results.
MCP Client: an implementation embedded in the host that connects to your server. It speaks the protocol, opens/manages the connection, and sends/receives requests.
MCP Server: your program that exposes tools. It runs actions and returns data/events to the client.

An MCP server can expose different capabilities, but in this project we stick to tools (actions like transcribing audio). MCP also supports resources and prompts; we skip them here to keep the flow simple.

Diagram of the Host → MCP Client → MCP Server flow.

Building the WhatsApp MCP

WhatsApp Desktop on macOS stores everything locally: an SQLite database with chats and folders containing the media files.

Our MCP server will:

Read the WhatsApp database
Find audio files per contact
Transcribe them with Whisper
Send the text back to the Client (Cursor in this case)

The working code lives in the repository: mcp-whatsapp-whisper. Let's walk through the key pieces.

The STDIN/STDOUT connection

import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";

const transport = new StdioServerTransport();
await this.server.connect(transport);

With that the server listens to every client request on STDIN and replies through STDOUT.

We pick stdio because this MCP server runs locally. It's the simplest and most stable transport on desktop/CLI: no open ports, no HTTP dependency, avoids CORS/firewalls, and hosts (Claude Desktop/Cursor) support it natively. SSE makes sense when the server lives remotely behind HTTP.

Exposing capabilities

this.server = new Server(
  {
    name: "whatsapp-audio-mcp",
    version: "1.0.0",
  },
  {
    capabilities: {
      tools: {}, // We will expose actions
    },
  }
);

Designing the tools

The server relies on three tools, each with a specific role:

getRecentAudio(contactName, count?): pulls the latest audio paths for a contact.
searchAudios(query, date?): narrows the list by name or date when the history is large. We get filtering without touching SQLite directly.
transcribeAudio(audioPath): turns a path into text with Whisper. It finishes the loop by delivering the result we care about.

The goal was a minimal set: find, refine, transcribe. Each tool lines up with one of those stages.

{
  name: 'transcribeAudio',
  description: 'Transcribe an audio file using OpenAI Whisper (SDK)',
  inputSchema: {
    type: 'object',
    properties: {
      audioPath: {
        type: 'string',
        description: 'Path to the audio file',
      },
    },
    required: ['audioPath'],
  },
}

The schema follows JSON Schema. With it, Cursor knows which parameters to send.
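As a rough illustration of what the schema buys you, the sketch below checks a tool call's arguments against the `required` list and the declared `type` of each property. `checkArguments` is a hypothetical helper, not part of the MCP SDK, and it covers only the flat string/number case used here; a real server would use a full JSON Schema validator.

```typescript
interface ToolInputSchema {
  type: "object";
  properties: Record<string, { type: "string" | "number"; description?: string }>;
  required?: string[];
}

const transcribeAudioSchema: ToolInputSchema = {
  type: "object",
  properties: {
    audioPath: { type: "string", description: "Path to the audio file" },
  },
  required: ["audioPath"],
};

// Returns a list of human-readable problems; an empty list means the call is acceptable.
function checkArguments(schema: ToolInputSchema, args: Record<string, unknown>): string[] {
  const problems: string[] = [];
  for (const name of schema.required ?? []) {
    if (!(name in args)) problems.push(`missing required argument: ${name}`);
  }
  for (const [name, value] of Object.entries(args)) {
    const spec = schema.properties[name];
    if (!spec) {
      problems.push(`unknown argument: ${name}`);
    } else if (typeof value !== spec.type) {
      problems.push(`argument ${name} should be a ${spec.type}`);
    }
  }
  return problems;
}

console.log(checkArguments(transcribeAudioSchema, { audioPath: "/tmp/note.opus" }));
console.log(checkArguments(transcribeAudioSchema, {}));
```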

Accessing WhatsApp

WhatsApp Desktop keeps everything under predictable paths:

this.dbPath = path.join(
  homeDir,
  "Library/Group Containers/group.net.whatsapp.WhatsApp.shared/ChatStorage.sqlite"
);
this.mediaPath = path.join(
  homeDir,
  "Library/Group Containers/group.net.whatsapp.WhatsApp.shared/Message/Media"
);

The database is SQLite:

const query = `
  SELECT DISTINCT 
    ZCONTACTJID as jid,
    ZPARTNERNAME as name,
    ZLASTMESSAGEDATE as lastMessageDate
  FROM ZWACHATSESSION
  WHERE ZPARTNERNAME IS NOT NULL
  AND ZCONTACTJID NOT LIKE '%@g.us'  -- Exclude groups
`;
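One detail worth sketching: the `Z`-prefixed table and column names are typical of Core Data stores, which usually save dates as seconds since 2001-01-01 UTC rather than as Unix timestamps. Assuming `ZLASTMESSAGEDATE` follows that convention (an assumption; verify against your own database), a conversion could look like this. `coreDataToDate` is a hypothetical helper name.

```typescript
// Seconds between the Unix epoch (1970-01-01) and the Core Data reference date (2001-01-01), both UTC.
const CORE_DATA_EPOCH_OFFSET_SECONDS = 978307200;

// Assumption: the column holds seconds since 2001-01-01 UTC, possibly fractional.
function coreDataToDate(secondsSinceReference: number): Date {
  return new Date((secondsSinceReference + CORE_DATA_EPOCH_OFFSET_SECONDS) * 1000);
}

console.log(coreDataToDate(0).toISOString()); // 2001-01-01T00:00:00.000Z
```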

Audio files are organized per contact. We scan recursively:

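import { promises as fs } from "node:fs";
import path from "node:path";

const audioExtensions = [".opus", ".m4a", ".mp3", ".aac", ".wav"];
const audioFiles: { path: string; filename: string; modifiedDate: string }[] = [];

async function scanDirectory(dir: string): Promise<void> {
  const entries = await fs.readdir(dir, { withFileTypes: true });

  for (const entry of entries) {
    const fullPath = path.join(dir, entry.name);

    if (entry.isDirectory()) {
      // Recurse into subfolders
      await scanDirectory(fullPath);
    } else if (audioExtensions.some((ext) => entry.name.endsWith(ext))) {
      // Found an audio file
      const stats = await fs.stat(fullPath);
      audioFiles.push({
        path: fullPath,
        filename: entry.name,
        modifiedDate: stats.mtime.toISOString(),
      });
    }
  }
}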
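Once the scan has collected files, `getRecentAudio` still has to choose the newest ones. The post doesn't show that step, so the following is only a guess at how it might work: sort by `modifiedDate` descending and keep the first `count` entries. `pickRecent` and the default of 1 are assumptions.

```typescript
interface AudioFile {
  path: string;
  filename: string;
  modifiedDate: string; // ISO 8601, as produced by Date.prototype.toISOString()
}

// Newest first. UTC strings from toISOString() sort chronologically as plain strings.
function pickRecent(files: AudioFile[], count = 1): AudioFile[] {
  return [...files] // copy so the caller's array is not reordered
    .sort((a, b) => (a.modifiedDate < b.modifiedDate ? 1 : a.modifiedDate > b.modifiedDate ? -1 : 0))
    .slice(0, count);
}

// Made-up sample data for the demonstration
const sample: AudioFile[] = [
  { path: "/media/a.opus", filename: "a.opus", modifiedDate: "2025-09-01T10:00:00.000Z" },
  { path: "/media/c.opus", filename: "c.opus", modifiedDate: "2025-09-03T08:30:00.000Z" },
  { path: "/media/b.opus", filename: "b.opus", modifiedDate: "2025-09-02T12:00:00.000Z" },
];

console.log(pickRecent(sample, 2).map((f) => f.filename)); // [ 'c.opus', 'b.opus' ]
```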

The transcription: FFmpeg + Whisper

WhatsApp ships audio in Opus, but the OpenAI transcription API expects one of a fixed set of formats such as MP3, WAV, or M4A, so we convert first. We use FFmpeg:

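import { spawn } from "node:child_process";

const ffmpeg = spawn("ffmpeg", [
  "-y", // Overwrite the temporary file if it already exists
  "-i",
  inputPath, // WhatsApp Opus audio
  "-acodec",
  "mp3",
  "-b:a",
  "128k",
  outputPath, // Temporary MP3
], { stdio: "ignore" }); // Keep FFmpeg's output out of unread pipes

// Wait for FFmpeg to exit before reading the MP3
await new Promise<void>((resolve, reject) => {
  ffmpeg.on("error", reject);
  ffmpeg.on("close", (code) => (code === 0 ? resolve() : reject(new Error(`ffmpeg exited with code ${code}`))));
});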

Then we transcribe with OpenAI Whisper (SDK):

import OpenAI from "openai";
import fs from "node:fs";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const transcription = await openai.audio.transcriptions.create({
  file: fs.createReadStream(outputPath), // Temporary MP3
  model: "whisper-1",
});

const transcriptionText = transcription.text;

Configuring Cursor (the client)

In the Cursor config (~/.cursor/mcp.json) we add:

{
  "mcpServers": {
    "whatsapp": {
      "command": "node",
      "args": ["/path/to/mcp-whatsapp-whisper/dist/server.js"],
      "env": {
        "OPENAI_API_KEY": "YOUR_OPENAI_KEY"
      }
    }
  }
}

Cursor can now invoke our server whenever it needs to.

MCP in action

The user asks Cursor:

"Send me the transcript of Elian's last audio."

Cursor automatically:

Calls getRecentAudio(contactName: "elian")
Receives the audio file path
Calls transcribeAudio(audioPath: "/path/to/audio.opus")
Receives the transcription
Summarizes or shows the full text
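Under the hood, each of those calls is a JSON-RPC `tools/call` request from the client. The sketch below builds the two requests by hand purely for illustration; the host and SDK normally do this for you, and `buildToolCall` is a hypothetical helper.

```typescript
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

function buildToolCall(id: number, name: string, args: Record<string, unknown>): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// Step 1: ask for the latest audio of the contact.
const first = buildToolCall(1, "getRecentAudio", { contactName: "elian" });

// Step 2: feed the returned path into the transcription tool.
// The path is the placeholder from the walkthrough, not a real file.
const second = buildToolCall(2, "transcribeAudio", { audioPath: "/path/to/audio.opus" });

console.log(JSON.stringify(first));
console.log(JSON.stringify(second));
```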

The transcription flows through the OpenAI API; the temporary MP3 is sent to get the text back. Cursor orchestrates; your server prepares the file and makes the call.

Cursor showing the transcription returned by the WhatsApp MCP server.

Limitations: macOS only

This server is macOS only. The WhatsApp paths are specific to Mac.

It depends on:

WhatsApp Desktop installed
FFmpeg (brew install ffmpeg)
OpenAI SDK (npm i openai) with OPENAI_API_KEY configured
Internet connection

We also skip Prompts and Resource Templates.

Security depends on the host. Cursor can ask for approval before it runs tools.

Keep it running with PM2

Build the project once (npm run build) and keep the server alive with pm2 start ecosystem.config.cjs. The provided config watches the compiled dist/server.js and restarts it if it crashes.

Conclusion

Your AI agent can now reach your data, use your tools, and work in your context.

The WhatsApp server is just one idea. Once you realize any program that speaks STDIN/STDOUT can be an MCP server, the possibilities get wild.

Next time you think "I wish Cursor could access...", remember: it probably can. You just need to build the bridge.

How AI Reflects Your Thinking

German Burgardt — Tue, 26 Aug 2025 13:36:38 +0000

When we code using AI we ask ourselves: "what's the best prompt?" or "what magic prompt should I use?".

We'd be better off asking "what kind of interaction is this?" and trying to understand the nature of the interaction between us and the model.

Maybe the problem isn't the technology, but us.

An Analogy

Imagine you hire a remote programmer. Brilliant, but with some quirks:

Never worked on your project before (0 context)
Extremely literal. If you don't explicitly tell them, they never assume anything.
Doesn't infer context
Completely loses their memory every day, returning to their initial state

How would you communicate with them?

You'd probably:

Explain all the necessary context, very detailed
Be very specific with requirements
Not assume they'll "figure out" anything. You explain everything
Expect some iterations before the final result
Maybe save context files to resend them every day

That's the best way to interact with an AI model.

AI As a Mirror

The model isn't just a task executor. It's also a mirror of your clarity when communicating a problem.

If you give it vague instructions, you get vague results simply because it faithfully reflects how vague your thinking was.

Most of the time, when the model "doesn't understand", the problem isn't the model. It's that we ourselves weren't clear about what we wanted.

Clarity As a Skill

The real skill isn't "writing good prompts". It's thinking clearly about problems and communicating that clarity. This is a fundamental skill for any programmer.

Example

What we usually do:

Optimize this function

Why it fails: Optimize in what sense? Speed? Memory? Readability? There are no success criteria.

What we should do:

The processOrders() function in orders.js takes 5 seconds with 1000 orders.
I need it to take less than 1 second.
Orders come from the database already sorted by date.
You can assume there are no duplicate orders.
Logs: <<detailed logs>>

This is much clearer and less abstract. It describes:

The problem (5 seconds is too much)
The measurable goal (less than 1 second)
Constraints (already sorted)
Assumptions (no duplicates)

Breaking Down Problems

One of the skills that improves working with AI is breaking problems down into smaller pieces. AI won't save you the work of thinking. The clarification process itself is valuable work in programming.

Instead of:

Implement a complete authentication system

You learn to think:

Step 1: Define the User model with minimum required fields: <fields>
Step 2: Create the registration endpoint with basic validation (validation type, etc)
[etc...]

The Limitations

In practice, AI tends to handle only 3-4 files well at a time. It's a limitation, but one with a bright side:

It forces you to keep responsibilities separated and create clear interfaces. You need to avoid coupling and think in small modules.

It incentivizes you to follow good architecture practices.

The Importance of Context

AI needs all the context you can give it; don't skimp.

CONTEXT: Users report the checkout page hangs
SYMPTOM: The "Pay" button stays in "Processing..." state indefinitely
FILE: checkout.js, handlePayment() function
SUSPICION: Probably missing a catch to handle API errors
TASK: Add robust error handling and visual feedback to the user

The Value of Programming with AI

Programming with AI trains you in thinking clearly and communicating precisely. It forces you to break problems into manageable pieces and be explicit with your requirements while constantly verifying results.

These seem like fundamental skills for any dev regardless of language.

Final Reflection

AI doesn't save you from thinking, or at least you shouldn't use it that way. It's the opposite: every prompt you write is an opportunity to clarify your understanding. Every response you receive is feedback on your clarity. Every iteration is a chance to improve.

Next time you use AI and don't get the expected result, before blaming the model, ask yourself:

Did I really have clarity on what I wanted?
Did I break down the problem into manageable parts?

These models are honest, literal collaborators. They give you exactly what you ask for, but they demand clarity. Learning to be clear is learning to think well. AI used properly makes you a better programmer.

Automate Any Repetitive Task with MCP

German Burgardt — Mon, 28 Jul 2025 17:30:01 +0000

The Problem: Repetitive Detailed Prompting

Every time I start a new task in Claude Code / Cursor, I type a detailed prompt to guide the AI through an internal monologue before proceeding. For example:

"You will generate an internal monologue of 200 numbered lines where two thinkers debate the approach:

Pragmatic focuses on functionality and efficiency
Creative on innovation and elegance
Follow these rules: exactly 200 lines, each starting with [Pragmatic] or [Creative]
Be specific about code without abstractions
Reflect and question without solutions
Mention files/functions/variables
Consider edge cases/performance/maintainability/user experience
Debate simplicity vs functionality
Question decisions, no repeats, end without conclusion
Then address the task: [actual task here]."

Typing this repeatedly 20+ times a day wastes time and disrupts focus.

As someone researching practical AI applications, I figured this was fixable.

// Before: 200+ word prompt every time
// After: "internal monologue 200 lines - implement auth system"

Enter MCPs: The Missing Link

The Model Context Protocol (MCP) lets you extend AI agents with custom tools. Common examples include fetching data, web browsing, or integrating with Slack; I used it in a less typical way, to automate my repetitive prompt.

From Repetition to Automation

I built an MCP server in my Remix app (essentially the same as plain Node.js) that generates these monologues on demand. Now, Claude detects the trigger and handles it automatically.

Here's a glimpse of what it generates:

1. [Pragmatic] We need to implement auth - start with basic JWT in middleware.js
2. [Creative] But what about OAuth? Users expect social login nowadays...
3. [Pragmatic] OAuth adds complexity - first nail down password flow, then extend
...

The difference:

Before: Type the full detailed prompt each time, then describe the task.
After: Simply say "internal monologue 200 lines about X - [task]", and Claude generates the monologue via the tool, then proceeds.

Time saved: ~2 minutes per task

Characters typed: 300+ → 40

Building Your Own Monologue MCP

Here's how to implement it in a Node.js server (adaptable from my Remix example).

Step 1: Install Dependencies

npm install @modelcontextprotocol/sdk zod @anthropic-ai/sdk

Step 2: Create the MCP Server Handler

Create app/lib/mcp-server.ts:

import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import {
  CallToolRequestSchema,
  ListToolsRequestSchema,
} from "@modelcontextprotocol/sdk/types.js";
import { z } from "zod";
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

export function createMCPServer() {
  const server = new Server(
    {
      name: "monologue-mcp",
      version: "1.0.0",
    },
    {
      capabilities: {
        tools: {},
      },
    }
  );

  // Define the monologue tool.
  // setRequestHandler takes the SDK's request schema, not a method-name string.
  server.setRequestHandler(ListToolsRequestSchema, async () => ({
    tools: [
      {
        name: "generate-monologue",
        description:
          "Generate a reflective internal monologue in the style of Pragmatic vs Creative thinker",
        inputSchema: {
          type: "object",
          properties: {
            lines: {
              type: "number",
              description: "Number of lines in the monologue (default: 100)",
              default: 100,
            },
            context: {
              type: "string",
              description: "Current conversation context",
            },
            task: {
              type: "string",
              description: "Description of the task to perform",
            },
          },
          required: ["task"],
        },
      },
    ],
  }));

  // The actual tool implementation
  server.setRequestHandler(CallToolRequestSchema, async (request) => {
    if (request.params.name === "generate-monologue") {
      const ArgsSchema = z.object({
        lines: z.number().int().min(1).max(500).default(100),
        context: z.string().max(2000).optional().default(""),
        task: z.string().min(1).max(1000),
      });

      const { lines, context, task } = ArgsSchema.parse(
        request.params.arguments
      );

      try {
        const systemPrompt = `You are two thinkers having an internal dialogue about programming.
Pragmatic is focused on functionality and efficiency.
Creative is obsessive about innovation and elegance.

STRICT RULES:
1. Generate EXACTLY ${lines} numbered lines
2. Each line must start with [Pragmatic] or [Creative]
3. NO abstractions - be specific about the code
4. NO complete solutions - REFLECT and QUESTION
5. Mention specific files, functions, variables when relevant
6. Think about: edge cases, performance, maintainability, user experience
7. Debate simplicity vs functionality
8. Question every technical decision
9. NO repeated ideas - each line must add new value
10. End without a definitive conclusion - it's reflection, not decision`;

        const userPrompt = `${
          context ? `Previous context:\n${context}\n\n` : ""
        }Current task: ${task}

Generate an internal monologue of EXACTLY ${lines} numbered lines where the two thinkers debate the best way to approach this task.`;

        const response = await anthropic.messages.create({
          model: "claude-opus-4-20250514",
          max_tokens: 32000,
          temperature: 1,
          system: systemPrompt,
          messages: [
            {
              role: "user",
              content: userPrompt,
            },
          ],
        });

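        // Content blocks are a union type; only text blocks carry a .text field
        const firstBlock = response.content[0];
        const monologue = firstBlock?.type === "text" ? firstBlock.text : "";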

        return {
          content: [
            {
              type: "text",
              text: monologue,
            },
          ],
        };
      } catch (error: any) {
        return {
          content: [
            {
              type: "text",
              text: `Error generating monologue: ${error.message}`,
            },
          ],
          isError: true,
        };
      }
    }

    throw new Error(`Unknown tool: ${request.params.name}`);
  });

  return server;
}
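The prompt demands EXACTLY N lines, but nothing in the handler checks that the model complied. As an optional safeguard (not part of the original server), a small validator could verify the format before returning the text. `checkMonologue` is a hypothetical helper, and the regular expression assumes lines look like `12. [Pragmatic] ...`, as in the sample output.

```typescript
interface MonologueCheck {
  ok: boolean;
  lineCount: number;
  badLines: number[]; // 1-based positions of lines that break the format
}

// Assumed line shape: "<n>. [Pragmatic] text" or "<n>. [Creative] text".
const LINE_PATTERN = /^(\d+)\.\s+\[(Pragmatic|Creative)\]\s+\S/;

function checkMonologue(text: string, expectedLines: number): MonologueCheck {
  const lines = text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line !== "");
  const badLines: number[] = [];
  lines.forEach((line, index) => {
    const match = LINE_PATTERN.exec(line);
    // The number on the line must also match its position.
    if (!match || Number(match[1]) !== index + 1) badLines.push(index + 1);
  });
  return {
    ok: badLines.length === 0 && lines.length === expectedLines,
    lineCount: lines.length,
    badLines,
  };
}

const sampleMonologue = [
  "1. [Pragmatic] Start with basic JWT in middleware.js",
  "2. [Creative] But what about OAuth?",
  "3. Missing speaker tag on this line",
].join("\n");

console.log(checkMonologue(sampleMonologue, 3));
```

When `ok` is false, the handler could retry the request or return the text with a warning; which is better depends on how strict you need the format to be.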

Step 3: Create the API Route

Create app/routes/api.mcp.ts:

The MCP server needs to be exposed as an HTTP endpoint. We use Bearer authentication to secure it. Only Claude (or other authorized clients) with the correct API key can access your server. This prevents random people from using your tools.

import type { LoaderFunctionArgs } from "@remix-run/node";
import { createMCPServer } from "~/lib/mcp-server";
import { SSEServerTransport } from "@modelcontextprotocol/sdk/server/sse.js";

// SSE (Server Sent Events) keeps an open connection between Claude and your server
// This allows Claude to call your tools in real time without polling

// Simple auth check
function verifyAuth(request: Request): boolean {
  const expectedKey = process.env.MCP_API_KEY;
  // Fail closed: without a configured key, nobody gets in
  if (!expectedKey) return false;
  return request.headers.get("Authorization") === `Bearer ${expectedKey}`;
}

export async function loader({ request }: LoaderFunctionArgs) {
  if (!verifyAuth(request)) {
    return new Response("Unauthorized", { status: 401 });
  }

  const responseHeaders = new Headers({
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
    "Access-Control-Allow-Origin": "*",
    "Access-Control-Allow-Headers": "Authorization, Content-Type",
  });

  const server = createMCPServer();

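  // Caution: the published SDK constructs this as
  // new SSEServerTransport(endpoint, res) with a Node ServerResponse,
  // so check this call against the SDK version you install.
  const transport = new SSEServerTransport({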
    endpoint: "/api/mcp",
    requestHeaders: Object.fromEntries(request.headers.entries()),
    responseHeaders: Object.fromEntries(responseHeaders.entries()),
  });

  const stream = new ReadableStream({
    async start(controller) {
      try {
        await server.connect(transport);

        // Keep the connection alive: idle SSE connections are often dropped after
        // ~30 seconds of silence, so send a comment line well inside that window
        const keepAlive = setInterval(() => {
          controller.enqueue(new TextEncoder().encode(": keepalive\n\n"));
        }, 15000);

        request.signal.addEventListener("abort", () => {
          clearInterval(keepAlive);
          controller.close();
        });
      } catch (error) {
        controller.error(error);
      }
    },
  });

  return new Response(stream, {
    headers: responseHeaders,
  });
}
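For readers unfamiliar with the wire format this route streams: an SSE response is plain text in which each event is a block of `field: value` lines ending with a blank line, and a line starting with a colon is a comment that clients ignore, which is why `: keepalive` works as a heartbeat. The helpers below are a hypothetical sketch of that format, not code from the SDK.

```typescript
// Format one SSE event. Multi-line data needs one "data:" line per line of payload.
function formatSseEvent(event: string, data: string): string {
  const dataLines = data.split("\n").map((line) => `data: ${line}`);
  // The two trailing empty strings produce the blank line that terminates the event.
  return [`event: ${event}`, ...dataLines, "", ""].join("\n");
}

// A comment line: ignored by clients, but it shows intermediaries the connection is not idle.
function formatSseComment(text: string): string {
  return `: ${text}\n\n`;
}

console.log(JSON.stringify(formatSseEvent("message", '{"jsonrpc":"2.0","id":1,"result":{}}')));
console.log(JSON.stringify(formatSseComment("keepalive")));
```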

Step 4: Configure Environment Variables

Add to your .env:

ANTHROPIC_API_KEY=your-anthropic-api-key
MCP_API_KEY=a-secret-key-for-your-mcp

The ANTHROPIC_API_KEY lets your server call Claude's API to generate monologues. The MCP_API_KEY is your own secret: it's what Claude will use to authenticate with your server.

Step 5: Deploy and Connect

Deploy your changes (I use Vercel, but any platform works):

git add .
git commit -m "Add MCP server for internal monologues"
git push

Then connect from Claude:

claude mcp add --transport sse monologue https://yourdomain.com/api/mcp --header "Authorization: Bearer your-secret-key"

The sse transport tells Claude to use Server Sent Events (the streaming connection type we set up). Replace your-secret-key with the same MCP_API_KEY from your .env file.

How It Works in Practice

Now, when working in Claude:

> internal monologue 150 lines - design user experience for login flow

Claude detects the phrase, calls the MCP tool, generates the detailed monologue (e.g., a debate on intuitive interfaces vs secure processes, navigation logic, etc.), and uses it to design the feature thoughtfully.

A sample monologue excerpt:

1. [Creative] Login flow should be innovative and seamless – perhaps biometric integration for delight?
2. [Pragmatic] Biometrics add complexity; focus on reliable password handling in auth.js first.
3. [Creative] But user experience suffers with forms – question if we can animate transitions smoothly.
4. [Pragmatic] Animations might impact performance on mobile; consider edge cases in responsive design.
...

Why This Matters

This setup replaces a long prompt I used to retype with a short trigger phrase, so planning stays consistent from task to task. It is also an atypical use of MCP, which made it a good way to explore the protocol more creatively and deeply.

What's Next?

One could build other creative tools, such as one that fetches and analyzes server logs directly, or another that integrates with external APIs for real time data checks.

Your Turn

What repetitive tasks do you deal with in your daily work? Maybe one of them could become an MCP tool. The code above is ready to adapt.

Questions? Leave a comment below and I'll be happy to help!

Forget the Hype: Agents are Loops

German Burgardt — Wed, 30 Apr 2025 19:44:07 +0000

What's an Agent?

"AI Agent" sounds complex, but often, the core is simpler than you think. An AI Agent typically runs a Large Language Model (LLM) inside a loop.

This pseudocode captures the essence in JavaScript:

// Environment holds state/context
const env = { state: initialState }; // Simple object state
// Tools available to the agent
const tools = new Tools(env);
// Instructions for the LLM
const system_prompt = `Your main mission, goals, constraints...`;

// The main agent loop
while (true) {
  // (Needs a real break condition!)
  // 1. LLM Brain: Decide action based on prompt + state
  const action = llm.run(`${system_prompt} ${env.state}`);

  // 2. System Hands: Run the actual code for the requested tool
  env.state = tools.run(action); // Update state with result
}

This simple loop is the heart of an Agent.

Simplified cycle: Think -> Act -> Update State -> Repeat.

Quick Breakdown

llm.run(...): The "brain". Uses instructions (system_prompt) + current situation (env.state) to decide the next action.
tools.run(action): The "hands". If action requests a tool, this executes the real code for that tool and updates env.state.

The loop repeats, feeding the new state back to the LLM.
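To make the loop concrete, here is a small runnable version in which a hard-coded function stands in for the LLM. Everything in it (the `fakeLlm` policy, the `increment` tool, the target of 3) is invented for illustration; a real agent would replace `fakeLlm` with a model call.

```typescript
type Action = { tool: "increment" } | { tool: "finish" };

interface Env {
  state: number;
}

// Stand-in for the LLM: decides the next action from the current state.
function fakeLlm(_systemPrompt: string, state: number): Action {
  return state < 3 ? { tool: "increment" } : { tool: "finish" };
}

// The "hands": real code that runs when the model asks for a tool.
const tools: Record<string, (env: Env) => number> = {
  increment: (env) => env.state + 1,
};

function runAgent(env: Env, maxIterations = 10): number {
  const systemPrompt = "Reach a state of 3, then finish.";
  for (let i = 0; i < maxIterations; i++) {
    const action = fakeLlm(systemPrompt, env.state); // 1. think
    if (action.tool === "finish") break; // the break condition the pseudocode asks for
    env.state = tools[action.tool](env); // 2. act, 3. update state
  }
  return env.state;
}

console.log(runAgent({ state: 0 })); // 3
```

The iteration cap matters as much as the `finish` action: it is what stops the loop when the model never decides it is done.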

Why Tools? Because LLMs Just Talk

The LLM only outputs text. It can't do things directly – no browsing, no file editing, no running commands. It needs "hands".

Tools are those hands. They are functions the system executes for the LLM when asked, allowing it to interact with the world.

Defining and Using a Tool

How does the LLM know about tools and when to use them?

1. What it is: The Tool Definition

Each tool needs a clear definition passed to the LLM, detailing:

Name: Unique ID (e.g., runLinter).
Description: What the tool is and does (e.g., "Runs ESLint on JS code/file...").
Input Parameters: The exact inputs needed (name, type, description for each).

This definition tells the LLM the tool's capabilities and how to structure a request for it.

// Example Tool Definition passed to the LLM API
{
  type: "function",
  function: {
    name: "runLinter", // Unique Name
    description: "Runs ESLint on JS code/file, returns JSON errors or success message.", // Description
    parameters: { /* Input Parameters Schema */ }
  }
}

A Tool is defined for the LLM by its Name, Description, and Parameters.
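The `parameters` field is a JSON Schema object. Based on the signature this post uses for the tool, `runLinter({ codeContent?, filePath? })`, a filled-in definition might look like the following. The property descriptions are illustrative wording, not the repository's.

```typescript
const runLinterDefinition = {
  type: "function",
  function: {
    name: "runLinter",
    description: "Runs ESLint on JS code/file, returns JSON errors or success message.",
    parameters: {
      type: "object",
      properties: {
        // Both inputs are optional, so there is no "required" list.
        codeContent: { type: "string", description: "JavaScript source to lint" },
        filePath: { type: "string", description: "Path to a JavaScript file to lint" },
      },
    },
  },
};

console.log(Object.keys(runLinterDefinition.function.parameters.properties));
```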

2. When to Use It: The System Prompt

The main system_prompt gives the agent its core instructions and strategy.

Crucially, it tells the LLM when and how to use its tools. It lists available tools and sets the rules for using them.

Example: "You have the runLinter tool. Always run it first. If it finds errors, fix the code, then run runLinter again to verify before finishing."

This ensures the LLM uses tools effectively within the loop.

Hands-On: Building a Simple Linter Agent

Let's see it in action. We'll walk through the key parts of a simple, working Node.js agent that fixes JavaScript linting errors using one single tool.

You can find the complete, runnable code (including the full System Prompt) for this example here:
simple-linter-agent

Here are the crucial pieces:

1. The System Prompt (`src/config.js` - Abbreviated)

This snippet shows the structure and key instructions of the agent's programming.

// src/config.js - System Prompt (Abbreviated)
const config = {
  model: "gpt-4o",
  systemPrompt: `
You are an expert JavaScript assistant that helps fix linting errors...

AVAILABLE TOOLS:
- runLinter({ codeContent?: string, filePath?: string }): Executes ESLint...

PROCESS TO FOLLOW:
1. Receive code/path.
2. **ALWAYS** use 'runLinter' first...
3. Analyze errors...
4. If no errors, return code...
5. If errors:
    a. Modify code...
    b. **IMPORTANT:** Call 'runLinter' AGAIN to verify...
    c. If verified, return corrected code...
    d. If still errors after retry, return best effort...

FINAL RESPONSE:
Your final response MUST contain only the complete corrected code... strictly wrapped between <final_code>...</final_code>...

// (Full details including TOOL CALL and FINAL RESPONSE examples in the repository code)
`,
};

export default config;

2. The Tool (`src/tools/linter.js` - Function Snippet)

This shows the core logic of the actual runLinter function executed by the system, omitting some boilerplate for clarity.

// src/tools/linter.js - Core Tool Function (Simplified)
import { ESLint } from "eslint";
import fs from "fs"; // Used by the omitted temporary-file logic

const runLinter = async ({ codeContent, filePath }) => {
  // --- Determine code source and handle temporary file logic ---
  // ... code to get codeToLint and manage useFilePath ...
  // ... includes fs.readFileSync/writeFileSync logic ...

  try {
    // --- Core ESLint Execution ---
    const eslint = new ESLint({ fix: false, useEslintrc: true });
    const results = await eslint.lintFiles([
      /* determined file path */
    ]);

    // --- Process Results ---
    const errors = results.flatMap(/* ... map results to error objects ... */);

    // --- Cleanup & Return ---
    // ... unlink temporary file if used ...
    return errors.length === 0
      ? { result: "No linting errors found!" }
      : { errors: JSON.stringify(errors) };
  } catch (err) {
    // --- Error Handling & Cleanup ---
    // ... unlink temporary file ...
    console.error("Error running ESLint:", err);
    return { error: `ESLint execution error: ${err.message}` };
  }
};

// --- The Definition Exported for the Agent/LLM ---
export default {
  name: "runLinter",
  description: "Runs ESLint on JS code/file...",
  parameters: {
    /* ... parameter schema ... */
  },
  function: runLinter, // Link to the actual function above
};

3. The Loop (`src/agent/agentInvoker.js` - Core Logic Snippet)

This snippet highlights the agent's execution flow: calling the LLM, handling tool calls, and updating the state.

// Inside src/agent/agentInvoker.js (Simplified Core Loop)
import linterTool from "../tools/linter.js";
import config from "../config.js";
import memory from "./memory.js";
// ... other functions: callLLM(messages), handleToolCall(toolCall) ...

const tools = [/* ... tool definition structure using linterTool ... */];

const invokeAgent = async (conversationId, inputs) => {
  // ... setup initial user message in memory ...

  const MAX_ITERATIONS = 3;
  let finalCode = null;

  // === THE AGENT LOOP ===
  for (let i = 0; i < MAX_ITERATIONS; i++) {
    const messages = memory.getMessages(conversationId); // Get state

    // --- 1. LLM Brain: Decide action ---
    const llmResponse = await callLLM(messages);
    const assistantMessage = llmResponse.choices[0].message;
    memory.addMessage(conversationId, assistantMessage); // Store thought

    if (assistantMessage.tool_calls) {
      // --- 2. System Hands: Execute Tool ---
      for (const toolCall of assistantMessage.tool_calls) {
        if (toolCall.function.name === linterTool.name) {
           const toolResult = await handleToolCall(toolCall); // Run runLinter
           const toolResultContent = /* ... format result ... */ ;
           // --- 3. Update State ---
           memory.addMessage(conversationId, { role: "tool", /*...*/ content: toolResultContent });
        }
      }
    } else if (assistantMessage.content) {
      // --- LLM provided final answer ---
      const match = /* ... check for <final_code> ... */ ;
      if (match) { finalCode = match[0]; break; } // Goal achieved!
    } else { /* ... handle error ... */ break; }
  } // === END LOOP ===

  // ... handle loop finish ...
  return finalCode;
};

This structure demonstrates the LLM (planning) + Loop (repetition) + Tool (action) pattern.
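The elided `<final_code>` check could be a small regular expression. The sketch below is one possible version, not the repository's code; note that it returns the capture group (index 1), the text between the tags, rather than the whole match.

```typescript
// Matches the first <final_code>...</final_code> pair; [\s\S] lets the body span multiple lines.
const FINAL_CODE_PATTERN = /<final_code>([\s\S]*?)<\/final_code>/;

function extractFinalCode(content: string): string | null {
  const match = FINAL_CODE_PATTERN.exec(content);
  return match ? match[1].trim() : null;
}

const reply = "Done.\n<final_code>\nconst x = 1;\n</final_code>";
console.log(extractFinalCode(reply)); // const x = 1;
console.log(extractFinalCode("still working")); // null
```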

Connecting to More Complex Agents (like Cursor)

Our simple linter agent uses just one tool, but it shows the fundamental pattern. Real-world agents like Cursor operate on the same principle, just scaled up.

Imagine asking Cursor to "Refactor ComponentA.jsx to use the new useDataFetching hook and update its tests in ComponentA.test.js." Cursor's LLM brain, guided by its own complex system prompt, might orchestrate a sequence like this within its loop:

Loop 1: LLM thinks: "Need ComponentA.jsx." -> Action: Calls readFile(path="..."). System runs it.
Loop 2: LLM thinks: "Need ComponentA.test.js." -> Action: Calls readFile(path="..."). System runs it.
Loop 3: LLM thinks: "Plan JSX changes." -> Action: Calls editFile(path="...", changes=[...]). System runs it.
Loop 4: LLM thinks: "Plan test changes." -> Action: Calls editFile(path="...", changes=[...]). System runs it.
Loop 5: LLM thinks: "Verify changes." -> Action: Calls runTests(path="..."). System runs it.
Loop N: (Continues...)

It's the same Think -> Act (Tool) -> Update State -> Repeat cycle, just with more tools (readFile, editFile, runTests, etc.) and a more complex strategy. The core LLM + Loop + Tools architecture remains the same.

The Pragmatic Takeaway

Forget the complex hype around "AI Agents." The core is usually that straightforward LLM + Loop + Tools pattern:

LLM Thinks (using System Prompt + Tool definitions + Current State)
System Acts (running actual code for requested Tools)
Repeat

It's a simple, yet powerful, way to make LLMs accomplish real-world tasks.

Check out this related video for more perspective:
AI Agents = LLM + Loop + Tools? (YouTube)

Code Faster in Cursor: A Pragmatic Guide to Voice Prompting

German Burgardt — Thu, 03 Apr 2025 21:10:37 +0000

The Problem: The Keyboard Bottleneck

For AI to work well, it needs context and clarity. Vague instructions lead to mediocre or wrong results. But writing really detailed and long prompts is tedious.

Writing detailed prompts manually can be slow.

The bottleneck is our keyboard. Typing is slow compared to speaking. It limits the amount of detail that can be easily included in a prompt before fatigue sets in or the train of thought is lost.

The Solution: Dictate with Whisper

Here's the trick: use Whisper to dictate prompts directly into Cursor. Speaking is ~5x faster than typing. This enables:

Creating Very Long Prompts: It's easy to dictate 50 lines of detailed instructions, explaining exactly what's needed, which files to consider, what logic to follow, and what to avoid. Typing that amount would be torture.
Increasing Detail Exponentially: When speaking, it's natural to add more context and examples. It's possible to "think out loud," rambling a bit. The AI is often good at filtering noise and extracting the crucial info from the monologue.
Reducing Friction: The process is almost instantaneous. Using an app like WisprFlow (https://wisprflow.ai/) allows mapping dictation to a key (e.g., Fn). Press the key, speak, release the key. The text magically appears in Cursor's Composer. Then just hit Enter.

Dictating a prompt quickly using WisprFlow and Cursor.

The Biggest Mistake: Lack of Information

In AI-assisted programming, the biggest mistake is lack of information in the prompt. Cursor isn't a mind reader. Telling it "fix this bug" will probably lead to failure. However, if you dictate a detailed monologue explaining:

What the code should do.
What it's doing wrong now.
Which file(s) contain the problem.
What approach might work.
What libraries or patterns are being used.
...and any other relevant detail that comes to mind...

...the chances of getting a useful solution skyrocket.

Example Comparison

Typical Prompt (Vague):

Add a search filter to the user list.

Result: Might do it frontend-only, or inefficiently.

Dictated Prompt (Detailed):

Okay, need to add a name filter to the user list in UserList.tsx. It gets data from /api/users. Want a simple text input above the table. On typing, debounce for 300ms and call /api/users?search=term. Make sure the backend in server.ts (Prisma) modifies the query with WHERE name ILIKE '%term%'. Don't filter on the frontend, it's inefficient. Update the users state with the response. Placeholder: 'Search by name...'.

Result: Much more likely to be what you need.

Dictating the second prompt takes seconds. Typing it, much longer.

Precise Vibe Coding?

Some talk about "Vibe Coding" with AI, just going with the flow. This approach is similar in fluency – dictation keeps the momentum – but insists on absolute clarity. You flow, yes, but explaining everything with surgical detail as you flow.

To explain something clearly, one needs to understand it (at least broadly). Dictating "forces" verbalization of the plan, which often clarifies one's own thoughts.

Give It a Try

If you use Cursor (or similar), try this technique:

Set up a dictation app like WisprFlow with a convenient shortcut.
Next time you're about to type a prompt, stop.
Take a breath, press the dictation key, and explain to the AI what's needed, with all the details that come to mind. Don't worry if it's not perfect, just talk.
Release the key, quickly review the text, and hit Enter.

This approach can make a significant difference, leading to richer prompts and better results when coding with AI.