<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Manish Rawal</title>
    <description>The latest articles on DEV Community by Manish Rawal (@manishrawal95).</description>
    <link>https://dev.to/manishrawal95</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3848391%2Ff179cded-5cc5-4f67-938e-88375ee6f582.png</url>
      <title>DEV Community: Manish Rawal</title>
      <link>https://dev.to/manishrawal95</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/manishrawal95"/>
    <language>en</language>
    <item>
      <title>Why I Built voice-to-task-agent: Kill Context Switching with Your Voice</title>
      <dc:creator>Manish Rawal</dc:creator>
      <pubDate>Sun, 29 Mar 2026 04:02:27 +0000</pubDate>
      <link>https://dev.to/manishrawal95/why-i-built-voice-to-task-agent-kill-context-switching-with-your-voice-4eh2</link>
      <guid>https://dev.to/manishrawal95/why-i-built-voice-to-task-agent-kill-context-switching-with-your-voice-4eh2</guid>
      <description>&lt;p&gt;You're in a pair programming session, deep in the terminal, and you spot a bug. "Ah, we should file a ticket for that," your partner says. The flow is broken. Someone has to open a browser, navigate to Jira, click "Create," remember the project key, fill out the summary, and then try to recapture the mental state you were just in.&lt;/p&gt;

&lt;p&gt;This tiny interruption -- this "context switch tax" -- is a silent killer of productivity. It happens dozens of times a day. An idea for a refactor comes up in a meeting, a follow-up is mentioned on a call, a bug is discovered mid-debug. Each one forces you out of your current task to perform a manual, repetitive chore. We lose focus, and sometimes we lose the action item entirely.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;voice-to-task-agent&lt;/code&gt; is my answer to this problem. It's a simple Python CLI that turns your spoken words into actions, in real-time, without ever leaving your terminal. It listens for your commands, understands your intent, and executes tasks like creating Jira tickets or sending emails, all while you keep your hands on the keyboard and your mind on the code.&lt;/p&gt;

&lt;h3&gt;Quick Start: Talk, Don't Type&lt;/h3&gt;

&lt;p&gt;Getting started is designed to be ridiculously fast. Once you've configured your API keys in a simple YAML file, you just run one command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;voice-to-task-agent
vtta listen
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
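&lt;p&gt;For reference, the YAML file mentioned above might look something like this -- the file location and key names here are illustrative, not the project's actual schema, so check the repo README for the real one:&lt;/p&gt;

```yaml
# illustrative config -- see the repo README for the actual schema
gemini:
  api_key: YOUR_GEMINI_API_KEY
jira:
  base_url: https://your-jira.atlassian.net
  email: you@example.com
  api_token: YOUR_JIRA_API_TOKEN
```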



&lt;p&gt;Now, just start talking.&lt;/p&gt;

&lt;p&gt;You say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Hey, can you create a ticket to fix the SSO login bug... put it in the 'WEB' project. High priority."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Your terminal responds:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Okay, creating a high-priority Jira ticket in project WEB for 'Fix SSO login bug'. Do you want to add a description?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You reply:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Yeah, just say 'Users are reporting 500 errors when logging in via Google SSO'."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And a moment later:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;✅ &lt;strong&gt;Jira ticket created:&lt;/strong&gt; &lt;a href="https://your-jira.atlassian.net/browse/WEB-1337" rel="noopener noreferrer"&gt;WEB-1337&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The ticket is filed. Your flow is intact. Your thought process is uninterrupted.&lt;/p&gt;

&lt;h3&gt;How It Works: A Real-Time Conversational Pipeline&lt;/h3&gt;

&lt;p&gt;This isn't just a fancy speech-to-text script. The magic is in the real-time, bidirectional data pipeline that connects your microphone to a large language model and back to your system's tools.&lt;/p&gt;

&lt;p&gt;It works in a few steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Audio Capture:&lt;/strong&gt; The CLI uses the &lt;code&gt;sounddevice&lt;/code&gt; library in Python to capture raw audio chunks from your microphone. It doesn't wait for silence; it starts streaming immediately.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Streaming to AI:&lt;/strong&gt; These audio chunks are sent directly to a streaming conversational voice API, like Google's multi-modal Gemini API. This is key -- the model starts processing your voice as you speak, providing near-instantaneous transcription and comprehension.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Unified Tool Calling:&lt;/strong&gt; As the model understands your intent, it doesn't just generate text. It generates a structured &lt;code&gt;tool_call&lt;/code&gt; request. When it hears "...create a ticket...", it recognizes this maps to a function you've defined, like &lt;code&gt;create_jira_ticket&lt;/code&gt;, and figures out the parameters (&lt;code&gt;summary&lt;/code&gt;, &lt;code&gt;project&lt;/code&gt;, &lt;code&gt;priority&lt;/code&gt;) from your natural language.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Local Execution and Response:&lt;/strong&gt; The agent running in your terminal receives this &lt;code&gt;tool_call&lt;/code&gt;, executes the corresponding Python function (which calls the Jira API), and gets a result. This result -- whether a success message with a ticket URL or an error -- is then streamed &lt;em&gt;back&lt;/em&gt; to the Gemini API as part of the same continuous conversation. The model then uses this information to formulate its final, helpful response to you: "Okay, I've created the ticket for you."&lt;/li&gt;
&lt;/ol&gt;
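&lt;p&gt;Steps 3 and 4 boil down to a small dispatch loop: map the model's structured request to a local function, run it, and send the result back. Here's a minimal sketch -- the tool-call shape and function names are illustrative, not the project's actual API:&lt;/p&gt;

```python
# Minimal sketch of steps 3-4: dispatching a model-issued tool call to a
# local Python function. The tool-call shape and names are illustrative,
# not the project's actual API; create_jira_ticket is stubbed.

def create_jira_ticket(summary, project, priority="Medium"):
    # The real agent would call the Jira REST API here.
    return {"status": "created", "key": f"{project}-1337", "summary": summary}

TOOLS = {"create_jira_ticket": create_jira_ticket}

def dispatch(tool_call):
    """Execute a structured tool call; the result is streamed back to the model."""
    func = TOOLS.get(tool_call["name"])
    if func is None:
        return {"status": "error", "detail": f"unknown tool: {tool_call['name']}"}
    try:
        return func(**tool_call["args"])
    except Exception as exc:  # e.g. the Jira API is down
        # Fail gracefully: report through the conversation, don't crash the loop.
        return {"status": "error", "detail": str(exc)}

result = dispatch({
    "name": "create_jira_ticket",
    "args": {"summary": "Fix SSO login bug", "project": "WEB", "priority": "High"},
})
print(result["key"])  # WEB-1337
```

&lt;p&gt;The important design choice is in the &lt;code&gt;except&lt;/code&gt; branch: an error is still a valid result to hand back to the model, which can then tell you "Jira seems to be down" instead of silently dying.&lt;/p&gt;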

&lt;p&gt;Architecting this requires careful management of a bidirectional stream, handling network latency, and designing for failure. What happens if the Jira API is down? The agent needs to handle that gracefully and report back through the conversational interface. It’s a surprisingly complex orchestration problem disguised as a simple CLI.&lt;/p&gt;

&lt;h3&gt;Why I Built This: A Program Manager Who Codes&lt;/h3&gt;

&lt;p&gt;My background is in Program Management and BizOps. My job has always been about one thing: making operations more efficient. I'm obsessed with identifying and eliminating friction that slows teams down. For years, I did this with process maps, spreadsheets, and strategy decks. But with the rise of agentic AI, I realized we now have a much more powerful tool.&lt;/p&gt;

&lt;p&gt;I build open-source AI tools because I believe the best way to prove the business value of AI is to build real, working solutions to tangible problems. I'm not just interested in what's theoretically possible; I'm focused on what's practically useful, today. This project is a perfect showcase of that philosophy. It directly attacks the "context switch tax," a well-known operational drag on engineering teams.&lt;/p&gt;

&lt;p&gt;Building &lt;code&gt;voice-to-task-agent&lt;/code&gt; was also an exercise in wearing my Technical Program Manager hat. It forced me to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Architect a data pipeline:&lt;/strong&gt; A real-time system with multiple dependencies (mic hardware, network, multiple APIs).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Integrate disparate systems:&lt;/strong&gt; Connecting a bleeding-edge AI service with a standard enterprise workhorse like Jira.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Focus on the user:&lt;/strong&gt; The goal isn't just to call an API; it's to create a seamless, "it just works" experience that doesn't break a developer's flow.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Think about the "ilities":&lt;/strong&gt; Reliability, usability, and extensibility. A production-grade tool can't be a brittle demo.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This practical, hands-on approach is what informs my other projects as well. When you build real agents, you quickly run into real problems. How much is this costing me? That led me to build &lt;code&gt;agent-cost-tracker&lt;/code&gt;. Is my agent just agreeing with me to be helpful? That led to &lt;code&gt;llm-sycophancy-eval&lt;/code&gt;. How do I debug when it's slow? That's why I created &lt;code&gt;agent-profiler&lt;/code&gt;. These tools aren't academic exercises--they are solutions to the real-world challenges of operationalizing AI.&lt;/p&gt;

&lt;h3&gt;What's Next?&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;voice-to-task-agent&lt;/code&gt; is just getting started. It's a proof of concept for a future where operational tasks are handled through ambient, conversational interfaces. Here's what I'm thinking about next:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;More Tools, More Actions:&lt;/strong&gt; Integrations for creating GitHub issues, sending Slack messages, and updating Salesforce records are obvious next steps. The tool-calling framework is designed to be easily extensible.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Smarter Confirmation:&lt;/strong&gt; For potentially destructive actions, an "Are you sure?" confirmation step that can be answered by voice is critical for safe, reliable use.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Local and On-Device Models:&lt;/strong&gt; Exploring the use of local, on-device models for the initial transcription could dramatically reduce latency and enhance privacy, sending only the structured intent to a cloud LLM for tool mapping.&lt;/li&gt;
&lt;/ul&gt;
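&lt;p&gt;On extensibility: a common pattern is to register each tool together with a JSON-schema-style declaration that gets advertised to the model. A hedged sketch of what that could look like -- the decorator and registry here are illustrative, not the project's current API:&lt;/p&gt;

```python
# One way to keep a tool-calling framework extensible: register each tool
# together with a JSON-schema-style declaration the model can see.
# The decorator and registry are illustrative, not the project's current API.
REGISTRY = {}

def tool(name, description, parameters):
    """Register a function plus the declaration advertised to the model."""
    def wrap(func):
        REGISTRY[name] = {
            "func": func,
            "declaration": {"name": name, "description": description,
                            "parameters": parameters},
        }
        return func
    return wrap

@tool("create_github_issue",
      "Create a GitHub issue in a repository.",
      {"type": "object",
       "properties": {"repo": {"type": "string"}, "title": {"type": "string"}},
       "required": ["repo", "title"]})
def create_github_issue(repo, title):
    return f"filed '{title}' in {repo}"  # stub; the real tool calls the GitHub API

# The declarations list is what you'd hand to the model at session start.
declarations = [entry["declaration"] for entry in REGISTRY.values()]
```

&lt;p&gt;Adding Slack or Salesforce support then means writing one function and one decorator call, with no changes to the core loop.&lt;/p&gt;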

&lt;p&gt;This project is open-source, and I'd welcome any and all contributions, from new tool integrations to documentation improvements.&lt;/p&gt;




&lt;h3&gt;Links&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;GitHub Repo:&lt;/strong&gt; &lt;a href="https://github.com/manishrawal95/voice-to-task-agent" rel="noopener noreferrer"&gt;https://github.com/manishrawal95/voice-to-task-agent&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Let's Chat:&lt;/strong&gt; Have ideas or want to talk about agentic AI? &lt;a href="https://www.mrawal.com/book" rel="noopener noreferrer"&gt;Book a call with me&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Connect on LinkedIn:&lt;/strong&gt; &lt;a href="https://linkedin.com/in/manishrawal95" rel="noopener noreferrer"&gt;https://linkedin.com/in/manishrawal95&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>automation</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Why I Built agent-cost-tracker: Stop Your AI Agent from Secretly Bankrupting You</title>
      <dc:creator>Manish Rawal</dc:creator>
      <pubDate>Sat, 28 Mar 2026 23:45:21 +0000</pubDate>
      <link>https://dev.to/manishrawal95/why-i-built-agent-cost-tracker-stop-your-ai-agent-from-secretly-bankrupting-you-32g6</link>
      <guid>https://dev.to/manishrawal95/why-i-built-agent-cost-tracker-stop-your-ai-agent-from-secretly-bankrupting-you-32g6</guid>
      <description>&lt;p&gt;You’ve done it. After days of prompt engineering, wrestling with LangChain, and debugging esoteric errors, your AI agent finally works. It autonomously researches topics, uses tools, and completes the complex task you assigned--it feels like magic. Then you open your OpenAI billing dashboard and the magic vanishes, replaced by a cold, hard, three-digit number that’s growing way too fast.&lt;/p&gt;

&lt;p&gt;I’ve been there. The very nature of agentic workflows--with their unpredictable loops and chains of thought--turns cost forecasting into a complete guessing game. This is a massive problem, not just for engineers, but for anyone trying to build a viable product on top of this technology.&lt;/p&gt;

&lt;p&gt;That’s why I built &lt;code&gt;agent-cost-tracker&lt;/code&gt;. It’s an open-source Python library that gives you a crystal-clear, step-by-step breakdown of your AI agent's API costs. It tracks every call, calculates the expense for both input and output tokens, and generates an interactive visualization so you can see exactly where your money is going. No more billing surprises, just the data you need to build efficient and economically viable agents.&lt;/p&gt;

&lt;h3&gt;Quick Start&lt;/h3&gt;

&lt;p&gt;Getting started is trivial. You wrap your agent's execution code in a &lt;code&gt;CostTracker&lt;/code&gt; context manager. That's it. It automatically patches the necessary libraries and starts listening.&lt;/p&gt;

&lt;p&gt;Here’s a complete example. Let's assume you have your agent's logic in a function called &lt;code&gt;run_my_agent_flow&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_cost_tracker&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CostTracker&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;my_agent_module&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;run_my_agent_flow&lt;/span&gt; &lt;span class="c1"&gt;# Your agent code lives here
&lt;/span&gt;
&lt;span class="c1"&gt;# Initialize the tracker
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;CostTracker&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;cost_tracker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Run your agent as you normally would
&lt;/span&gt;    &lt;span class="nf"&gt;run_my_agent_flow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What were the key highlights of Apple&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s latest earnings call?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Print the total cost and generate an interactive report
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total cost for the run: $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cost_tracker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_total_cost&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cost_tracker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;visualize_costs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Opens a report in your browser
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running this, you'll get a beautiful, interactive HTML file that breaks down the cost of every single LLM call your agent made during that run.&lt;/p&gt;

&lt;h3&gt;How It Works&lt;/h3&gt;

&lt;p&gt;The magic behind &lt;code&gt;agent-cost-tracker&lt;/code&gt; is a technique called monkey-patching. When you enter the &lt;code&gt;with CostTracker() as ...&lt;/code&gt; block, the library temporarily replaces the API call methods from popular libraries like &lt;code&gt;openai&lt;/code&gt; and &lt;code&gt;litellm&lt;/code&gt; with its own custom versions. Don't worry--it's less chaotic than it sounds.&lt;/p&gt;

&lt;p&gt;Here's the sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Patching:&lt;/strong&gt; &lt;code&gt;CostTracker&lt;/code&gt; finds the &lt;code&gt;chat.completions.create&lt;/code&gt; method on the &lt;code&gt;openai&lt;/code&gt; client object (and its async counterpart). It stores a reference to the original method and puts its own "wrapper" method in its place.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Execution:&lt;/strong&gt; Your agent code runs exactly as written. It thinks it's calling the normal OpenAI API, but it's actually calling the &lt;code&gt;CostTracker&lt;/code&gt; wrapper.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Interception:&lt;/strong&gt; The wrapper first calls the original OpenAI method, letting the API call complete successfully. When it receives the response, it intercepts it before passing it back to your agent. It pulls the &lt;code&gt;usage&lt;/code&gt; object from the response payload, which contains the &lt;code&gt;prompt_tokens&lt;/code&gt; and &lt;code&gt;completion_tokens&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Calculation &amp;amp; Logging:&lt;/strong&gt; Using a built-in, up-to-date price list for different models (like &lt;code&gt;gpt-4-turbo&lt;/code&gt;, &lt;code&gt;claude-3-opus&lt;/code&gt;, etc.), the tracker calculates the precise cost of that individual call. It logs the model used, the token counts, and the final cost, then returns the original response to your agent so the workflow can continue uninterrupted.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Restoration:&lt;/strong&gt; As soon as your code exits the &lt;code&gt;with&lt;/code&gt; block--either by finishing or by raising an error--the original, un-patched methods are put back in their place. This ensures the tracker has zero side effects on any other part of your application.&lt;/li&gt;
&lt;/ol&gt;
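&lt;p&gt;To make the patch/intercept/restore cycle concrete, here's a toy version against a stand-in client rather than the real &lt;code&gt;openai&lt;/code&gt; SDK -- class names, prices, and token counts are illustrative:&lt;/p&gt;

```python
# Toy version of the patch/intercept/restore cycle described above, using a
# stand-in client instead of the real openai SDK. Class names, prices, and
# token counts are illustrative.

class FakeCompletions:
    def create(self, **kwargs):
        # Real responses carry a usage object with token counts.
        return {"usage": {"prompt_tokens": 120, "completion_tokens": 30}}

# USD per token (gpt-4-turbo list price: $10/M input, $30/M output)
PRICES = {"gpt-4-turbo": {"prompt": 10.00 / 1_000_000,
                          "completion": 30.00 / 1_000_000}}

class CostTracker:
    def __init__(self, client, model="gpt-4-turbo"):
        self.client, self.model, self.calls = client, model, []

    def __enter__(self):
        self._original = self.client.create           # 1. keep a reference
        def wrapper(**kwargs):
            response = self._original(**kwargs)       # 2. let the real call finish
            usage = response["usage"]                 # 3. intercept the usage object
            price = PRICES[self.model]
            self.calls.append(usage["prompt_tokens"] * price["prompt"]
                              + usage["completion_tokens"] * price["completion"])
            return response                           # 4. hand back unchanged
        self.client.create = wrapper                  # patch
        return self

    def __exit__(self, *exc):
        self.client.create = self._original           # 5. restore, even on error
        return False

    def get_total_cost(self):
        return sum(self.calls)

client = FakeCompletions()
with CostTracker(client) as tracker:
    client.create(model="gpt-4-turbo", messages=[])
print(f"Total cost for the run: ${tracker.get_total_cost():.6f}")  # $0.002100
```

&lt;p&gt;The agent code inside the &lt;code&gt;with&lt;/code&gt; block is untouched; it never knows the wrapper exists.&lt;/p&gt;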

&lt;p&gt;The final step, visualization, is handled by Plotly. It takes the logged data and generates a self-contained HTML file with an interactive Sankey diagram. This diagram is perfect for visualizing flows, letting you easily trace the path of your agent and see which steps or tool uses are racking up the biggest bill.&lt;/p&gt;

&lt;h3&gt;Why I Built This&lt;/h3&gt;

&lt;p&gt;I’m a Program Manager with a background in Business Operations, and I'm obsessed with agentic AI. My job is to bridge the gap between business strategy and AI engineering. I don’t just write strategy decks; I build real tools to prove what’s possible and uncover the operational hurdles we’ll face in production.&lt;/p&gt;

&lt;p&gt;In BizOps, you learn one thing very quickly: a project without a predictable budget is a non-starter. When I started building complex AI agents, I was horrified by how opaque their costs were. An agent designed to do research might make five API calls for one query and fifty for another. You can't build a business on that kind of variance without a way to measure and control it.&lt;/p&gt;

&lt;p&gt;I needed answers to basic business questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  What is our average cost per task?&lt;/li&gt;
&lt;li&gt;  Which tool is the most expensive for our agent to use?&lt;/li&gt;
&lt;li&gt;  If we swap &lt;code&gt;gpt-4-turbo&lt;/code&gt; for &lt;code&gt;claude-3-sonnet&lt;/code&gt; in a specific step, how much do we save, and what's the impact on quality?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;agent-cost-tracker&lt;/code&gt; is the tool I needed to answer those questions. It turns an engineering black box into a measurable business process. It provides the concrete data required to make informed trade-offs between cost, latency, and performance. This is the same philosophy I applied to my other project, &lt;code&gt;llm-sycophancy-eval&lt;/code&gt;, which stress-tests agents for behavioral flaws. First, you have to understand and measure the system--whether its behavior or its cost--before you can optimize it.&lt;/p&gt;

&lt;h3&gt;What's Next&lt;/h3&gt;

&lt;p&gt;This is just the beginning. I believe cost-awareness needs to be a first-class citizen in the agent development lifecycle. Here are a few things I have planned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Expanded Provider Support:&lt;/strong&gt; Adding first-class support for other major model providers like Cohere, Gemini (through their native SDKs), and more open-source models.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Budget Thresholds and Alerting:&lt;/strong&gt; The ability to set a maximum budget for a run. If the agent exceeds it, the tracker will raise an exception to halt execution, preventing runaway costs.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Deeper Dashboard Insights:&lt;/strong&gt; More advanced analytics in the visualization, like breaking down costs by the "tool" being used or providing time-series data to spot performance regressions.&lt;/li&gt;
&lt;/ul&gt;
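&lt;p&gt;The budget-threshold idea is straightforward to picture. A sketch of how such a guard could behave -- this feature is planned, not shipped, and the names here are hypothetical:&lt;/p&gt;

```python
# Sketch of the planned budget-threshold behavior (not yet in the library;
# names are hypothetical): raise as soon as accumulated cost crosses the cap.

class BudgetExceededError(RuntimeError):
    pass

class BudgetGuard:
    def __init__(self, max_usd):
        self.max_usd = max_usd
        self.spent = 0.0

    def record(self, cost_usd):
        """Call after each LLM call; halts the run once the cap is crossed."""
        self.spent += cost_usd
        if self.spent > self.max_usd:
            raise BudgetExceededError(
                f"run cost ${self.spent:.4f} exceeded budget ${self.max_usd:.2f}")

guard = BudgetGuard(max_usd=0.05)
guard.record(0.02)                    # fine, still under budget
try:
    guard.record(0.04)                # total hits $0.06 and trips the guard
except BudgetExceededError as err:
    print(err)
```

&lt;p&gt;Raising an exception, rather than just logging, is what actually halts a runaway agent loop.&lt;/p&gt;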

&lt;p&gt;The project is fully open-source, and I welcome contributions. If you have an idea or want to help build out these features, please open an issue or a pull request on GitHub.&lt;/p&gt;

&lt;h3&gt;Links&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;GitHub Repo:&lt;/strong&gt; &lt;a href="https://github.com/manishrawal95/agent-cost-tracker" rel="noopener noreferrer"&gt;https://github.com/manishrawal95/agent-cost-tracker&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Chat with me:&lt;/strong&gt; Have ideas or want to talk about agentic AI? &lt;a href="https://www.mrawal.com/book" rel="noopener noreferrer"&gt;Book a call&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Connect on LinkedIn:&lt;/strong&gt; &lt;a href="https://linkedin.com/in/manishrawal95" rel="noopener noreferrer"&gt;https://linkedin.com/in/manishrawal95&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>agents</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
