Prateek022

How I Built a Voice-Controlled Local AI Agent with Streamlit, Local STT, and Safe Tool Execution

I built a voice-controlled local AI agent as part of an AI/ML and Generative AI internship assignment. The goal was to create a system that accepts audio input, converts speech into text, understands the user’s intent, executes local tools, and displays the full pipeline in a clean user interface.

The final application is built with Streamlit and follows a local-first design. A user can either record audio directly from the microphone or upload an existing audio file. Once the audio is provided, the system sends it through a speech-to-text layer, then into an intent-planning module, and finally into a safe tool execution layer. The UI shows the transcription, the planned actions, the executed result, and the final output.

The overall architecture is simple but effective:

Audio Input -> Speech-to-Text -> Intent Planner -> Tool Executor -> UI
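The wiring between these stages can be sketched as a single orchestration function. The function and parameter names below are illustrative, not taken from the repository; the point is that every intermediate artifact is returned so the UI can display the full trace:

```python
# Hypothetical pipeline wiring; `transcribe`, `plan`, and `execute` are
# injected callables standing in for the real STT, planner, and tool layers.
def run_pipeline(audio_bytes, transcribe, plan, execute):
    """Run audio through STT, planning, and tool execution, returning
    every intermediate result so the UI can render the whole trace."""
    transcript = transcribe(audio_bytes)          # Speech-to-Text
    steps = plan(transcript)                      # Intent Planner
    results = [execute(step) for step in steps]   # Tool Executor
    return {"transcript": transcript, "steps": steps, "results": results}
```

Passing the stages in as callables keeps each backend swappable, which matters later when fallbacks come into play.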

For speech-to-text, I designed the project to use a local Hugging Face model by default. This keeps the system aligned with the assignment’s preference for local models. At the same time, I included an API fallback option using OpenAI transcription for cases where local hardware may not be strong enough to run speech models smoothly. This was important because the assignment explicitly allowed an API-based workaround when local hardware is limited.
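The local-first preference with an API escape hatch boils down to a small selection rule. This is a minimal sketch of that decision, assuming two boolean flags; the real project's configuration surface may look different:

```python
# Illustrative STT backend selection: prefer the local Hugging Face model,
# fall back to the OpenAI transcription API only when allowed.
def pick_stt_backend(local_available: bool, api_fallback_enabled: bool) -> str:
    """Return which STT backend to use, local-first."""
    if local_available:
        return "local"
    if api_fallback_enabled:
        return "openai_api"
    raise RuntimeError(
        "No STT backend available: install the local model or enable the API fallback."
    )
```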

For intent understanding, I structured the system to support multiple backends. The default version uses a lightweight local rules-based planner so the app can still run without requiring a local LLM server. For a stronger local setup, the project also supports Ollama, which can be used to run a local model for planning, summarization, code generation, and chat. An OpenAI-based option is also supported. The system can detect the required assignment intents such as creating files, writing code, summarizing text, and general chat.
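A rules-based planner of this kind can be as simple as a keyword table. The patterns and intent names below are illustrative, not copied from the repo:

```python
import re

# First-match-wins rules mapping keywords to intents; order matters.
INTENT_RULES = [
    (r"\b(create|make)\b.*\bfolder\b", "create_folder"),
    (r"\b(create|make)\b.*\bfile\b", "create_file"),
    (r"\b(write|generate)\b.*\bcode\b", "write_code"),
    (r"\bsummar(y|ize|ise)\b", "summarize"),
]

def detect_intent(text: str) -> str:
    """Return the first matching intent, defaulting to general chat."""
    lowered = text.lower()
    for pattern, intent in INTENT_RULES:
        if re.search(pattern, lowered):
            return intent
    return "chat"
```

Defaulting to `"chat"` is what lets the app keep working with no local LLM server at all: anything the rules cannot classify still gets a response.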

One improvement I added beyond the minimum requirement was compound command support. Instead of handling only one action at a time, the planner can break a single request into multiple ordered steps. For example, if the user says, “Summarize this text and save it to summary.txt,” the system first performs the summarization and then saves the result into the requested file. This made the agent feel more practical and closer to a real workflow assistant.
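Splitting a compound request into ordered steps can be sketched with a simple split on coordinating phrases. This is a deliberate simplification (it would also split sentences like "bread and butter"), and the function name is hypothetical:

```python
import re

def split_steps(command: str) -> list[str]:
    """Break 'do X and then do Y' style requests into ordered sub-commands."""
    parts = re.split(r"\b(?:and then|then|and)\b", command, flags=re.IGNORECASE)
    return [part.strip() for part in parts if part.strip()]
```

Each sub-command then goes through intent detection on its own, and the executor runs the resulting steps in order.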

For execution, I created a safe local tool layer that maps intents to specific actions. These actions include creating folders, creating files, writing generated code into files, summarizing text, saving plain text, and responding to general chat prompts. The most important safety design decision was restricting all file operations to a dedicated output/ folder inside the repository. This ensures that even if the user asks for file creation or code writing, the application cannot accidentally overwrite files elsewhere on the machine.
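The intent-to-action mapping is essentially a dispatch table over small tool functions that all write under `output/`. The tool names below mirror the intents described above, but the repository's actual function names may differ:

```python
from pathlib import Path

OUTPUT_DIR = Path("output")  # all file writes stay inside this repo folder

def create_file(name: str, content: str = "") -> Path:
    """Write `content` into output/<name>, creating the folder if needed."""
    OUTPUT_DIR.mkdir(exist_ok=True)
    target = OUTPUT_DIR / name
    target.write_text(content, encoding="utf-8")
    return target

def create_folder(name: str) -> Path:
    """Create output/<name> as a directory."""
    target = OUTPUT_DIR / name
    target.mkdir(parents=True, exist_ok=True)
    return target

# Intent -> callable dispatch table (illustrative subset of the tools).
TOOLS = {"create_file": create_file, "create_folder": create_folder}

def execute(intent: str, **kwargs):
    """Run the tool registered for `intent`, rejecting unknown intents."""
    if intent not in TOOLS:
        raise ValueError(f"Unknown intent: {intent!r}")
    return TOOLS[intent](**kwargs)
```

Rejecting unknown intents outright means the planner can never trigger an action the tool layer does not explicitly define.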

The Streamlit UI was designed to make the pipeline transparent. After audio is provided, the app shows the transcribed text and the planned steps before execution. This gives the user a chance to review what the system is about to do. After execution, the UI displays the final output and the result of each step. I also added session history so previous actions remain visible during the run. This made the system easier to demonstrate and improved the overall user experience.

One of the biggest challenges in building this project was balancing capability with reliability. A fully local AI workflow is ideal, but local models can be demanding on modest hardware. To address this, I added graceful fallback behavior. If a local LLM backend is unavailable, the system can still use the rules-based planner. If local speech-to-text is not practical, the app can switch to an API backend. This made the project more robust and easier to run across different environments.

Another challenge was safe path handling. Since the agent is allowed to create files and write code, it was important to ensure those actions stay within a restricted area. I implemented path checks so all write targets resolve inside the output/ directory. This turned out to be one of the most important engineering decisions in the project because it keeps the local agent useful without making it unsafe.
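The core of that check is resolving every user-supplied filename and refusing anything that escapes the sandbox. A minimal sketch of the idea, with an illustrative function name:

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_path(user_name: str) -> Path:
    """Resolve a user-supplied filename and refuse anything that escapes
    the output/ sandbox (e.g. '../secrets.txt' or absolute paths)."""
    candidate = (OUTPUT_DIR / user_name).resolve()
    if candidate != OUTPUT_DIR and OUTPUT_DIR not in candidate.parents:
        raise ValueError(f"Refusing to write outside {OUTPUT_DIR}: {user_name}")
    return candidate
```

Resolving *before* comparing is the important part: it collapses `..` segments and symlink tricks, so the containment check runs against the real target path.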

If I had more time, I would improve the local LLM planning further and benchmark different local models for both speed and quality. For example, it would be useful to compare Whisper STT models of different sizes and different Ollama-hosted LLMs on latency, memory usage, and output quality. That would make the system even stronger from an engineering evaluation perspective.

Overall, this project was a great exercise in combining speech processing, intent understanding, safe local tool use, and frontend design into one end-to-end AI application. It goes beyond a simple chatbot by taking real actions on the local machine while still keeping those actions constrained and reviewable.

GitHub Repository: https://github.com/Prateek022/voice-controlled-local-ai-agent
