Akash kumar
Voice-Controlled Local AI Agent (Works Even on 8GB RAM)

What if you could control your computer using just your voice — without needing a powerful GPU or heavy local models?

I built a Voice-Controlled AI Agent that:

  • Understands speech 🎤
  • Detects user intent 🧠
  • Executes real actions like file creation, code generation, and summarization ⚡

And the best part?
👉 It works smoothly even on low-end systems (8GB RAM).


🎬 Demo

📽️ Watch the full demo here:
👉 https://youtu.be/Pl3lwBoYruM

(Alternate Drive link: https://drive.google.com/file/d/17Uvp72dDi82pAqEqbJ6pl3LaLphxwaGm/view?usp=sharing)

🚀 Live App

👉 https://localaiagent-twxulfwrigcagqtbecnomh.streamlit.app/

✨ Features

  • 🎤 Audio Input

    • Record directly from microphone
    • Upload audio files
  • 🧠 Intent Classification

    • Converts speech → structured JSON
    • Accurately detects user commands
  • ⚡ Core Actions

    • create_file → Creates files safely
    • write_code → Generates and saves code
    • summarize_text → Summarizes content
    • general_chat → Handles normal queries
  • 🔒 Safe Execution

    • All outputs are restricted to /output directory
    • Prevents accidental system modification
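The safe-execution idea above boils down to one path check: resolve every requested filename and refuse it if it escapes the output directory. Here's a minimal sketch of such a guard — the helper name `safe_path` and the `output/` location are illustrative, not the project's actual code:

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()  # assumed sandbox directory

def safe_path(filename: str) -> Path:
    """Resolve `filename` inside OUTPUT_DIR, rejecting escapes like '../'."""
    candidate = (OUTPUT_DIR / filename).resolve()
    if candidate != OUTPUT_DIR and OUTPUT_DIR not in candidate.parents:
        raise ValueError(f"refusing to touch a path outside {OUTPUT_DIR}")
    return candidate
```

Every file-writing action would route through a check like this, so even a mis-transcribed command such as "create ../../etc/hosts" cannot modify anything outside the sandbox.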

🏗️ System Architecture

Building AI systems locally with limited RAM is challenging. Here's how I solved it:

1. 🎙️ Speech-to-Text (STT)

  • Local Mode:
    Uses openai-whisper (tiny model) → runs on CPU

  • Fast Mode (Recommended):
    Uses Groq API (Whisper-large-v3) → extremely fast ⚡
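A two-mode transcription helper might look like the sketch below. This is an assumption-heavy illustration, not the app's actual code: the function name is mine, and the Groq call follows the pattern from Groq's Python SDK (`client.audio.transcriptions.create` with `whisper-large-v3`):

```python
import os

def transcribe(audio_path: str, mode: str = "fast") -> str:
    """Speech-to-text in two modes: 'local' (CPU Whisper tiny) or 'fast' (Groq API)."""
    if mode == "local":
        import whisper  # pip install openai-whisper
        model = whisper.load_model("tiny")  # smallest Whisper model; fits easily in 8GB RAM
        return model.transcribe(audio_path)["text"]
    if mode == "fast":
        from groq import Groq  # pip install groq
        client = Groq(api_key=os.environ["GROQ_API_KEY"])
        with open(audio_path, "rb") as f:
            result = client.audio.transcriptions.create(
                file=(audio_path, f.read()),
                model="whisper-large-v3",
            )
        return result.text
    raise ValueError(f"unknown mode: {mode!r}")
```

The lazy imports mean the heavy `whisper` package is only loaded when Local Mode is actually selected.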


2. 🧠 LLM + Intent Engine

Running large models locally was not feasible:

  • An 8B model consumes ~5GB of RAM ❌
  • That load causes severe slowdown on an 8GB machine

👉 Solution:

  • Used Groq API (Llama 3 - 8B / 70B)
  • Provides:

    • Fast inference ⚡
    • Structured JSON output
    • Reliable intent classification
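The intent engine is essentially one chat-completion call plus defensive JSON parsing. A hedged sketch follows — the system prompt, helper names, and `llama3-8b-8192` model id are illustrative assumptions, and the Groq call mirrors the SDK's documented `chat.completions.create` pattern:

```python
import json

ALLOWED_ACTIONS = {"create_file", "write_code", "summarize_text", "general_chat"}

# Illustrative system prompt -- the real one is likely longer and example-driven.
SYSTEM_PROMPT = (
    "Classify the user's command. Reply with JSON only, e.g. "
    '{"action": "create_file", "filename": "hello.py"}. '
    "Valid actions: create_file, write_code, summarize_text, general_chat."
)

def parse_intent(raw: str) -> dict:
    """Validate the model's JSON reply, falling back to general_chat."""
    try:
        intent = json.loads(raw)
    except json.JSONDecodeError:
        return {"action": "general_chat"}
    if intent.get("action") not in ALLOWED_ACTIONS:
        intent["action"] = "general_chat"
    return intent

def classify(text: str) -> dict:
    from groq import Groq  # pip install groq; reads GROQ_API_KEY from the env
    client = Groq()
    reply = client.chat.completions.create(
        model="llama3-8b-8192",  # assumed model id
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
        response_format={"type": "json_object"},  # ask Groq for strict JSON
    )
    return parse_intent(reply.choices[0].message.content)
```

The `parse_intent` fallback is what makes the pipeline reliable: a malformed or unexpected reply degrades to a normal chat turn instead of crashing an action handler.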

3. 🖥️ Frontend

  • Built using Streamlit
  • Uses st.audio_input for seamless recording
  • Simple and clean UI

🔄 How It Works

  1. User speaks or uploads audio 🎤
  2. Whisper converts speech → text
  3. LLM processes text → structured JSON
  4. System executes action locally

Example:

{
  "action": "create_file",
  "filename": "hello.py"
}
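Step 4 above is a plain dictionary dispatch: the validated JSON selects a handler function. A minimal sketch, with illustrative handler names — the real app would also route the filename through its output-folder sandbox check:

```python
from pathlib import Path

OUTPUT_DIR = Path("output")

def create_file(intent: dict) -> str:
    # Real code should also verify the resolved path stays inside OUTPUT_DIR.
    OUTPUT_DIR.mkdir(exist_ok=True)
    path = OUTPUT_DIR / intent["filename"]
    path.write_text(intent.get("content", ""))
    return f"Created {path}"

HANDLERS = {
    "create_file": create_file,
    # "write_code": ..., "summarize_text": ..., "general_chat": ...
}

def execute(intent: dict) -> str:
    handler = HANDLERS.get(intent.get("action"))
    if handler is None:
        return f"No handler for action {intent.get('action')!r}"
    return handler(intent)
```

Adding a new voice command then means writing one function and registering it in `HANDLERS`; the LLM prompt is the only other place that needs to know about it.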

💻 Example Use Case

🗣️ User says:

"Create a Python file called hello.py"

⚙️ System:

  • Transcribes audio
  • Detects create_file intent
  • Creates file in /output folder
  • Shows success message

⚡ Setup Instructions

Prerequisites

  • Python 3.9+ and pip
  • A Groq API key (required for Fast Mode)
  • A microphone, if you want live recording

Installation

git clone <your-repo-link>
cd local_ai_agent
pip install -r requirements.txt

Environment Setup

Set your Groq API key as an environment variable (e.g., in a `.env` file in the project root):

GROQ_API_KEY=your_api_key_here

Run the App

streamlit run app.py

⚠️ Challenges Faced

  • Running LLMs on 8GB RAM
  • Slow transcription using CPU Whisper
  • Ensuring consistent JSON output from LLM
  • Managing safe file execution

💡 Key Learnings

  • Hybrid approach (local + API) is powerful
  • Structured prompts = better automation
  • UI simplicity improves usability massively

🔮 Future Improvements

  • Add more actions (email automation, system control)
  • Improve offline performance
  • Add memory (conversation history)
  • Multi-command execution

🙌 Final Thoughts

This project shows that you don’t need expensive hardware to build powerful AI systems.

With the right architecture and smart trade-offs, even a mid-range laptop can run intelligent AI agents efficiently.

If you found this useful, feel free to ⭐ the repo or share your thoughts!


🏷️ Tags

#python #ai #machinelearning #streamlit #opensource #productivity
