thirzq

Posted on Jun 9

How I Created a Fully Local Voice Agent App + RAG

#ai #llm #rag #showdev

This project presents an offline voice agent that uses Indonesian law data from the Pasal.id API and is optimized for the Indonesian language. It can understand spoken Indonesian, generate responses in Indonesian, and speak back in Indonesian without requiring cloud APIs. The system combines Whisper-based speech recognition, Ollama-hosted LLMs, and local text-to-speech models to provide a privacy-preserving conversational AI experience. You can access the project repository here: PasalVA.

Voice assistant applications usually rely on cloud-based services, which creates dependence on third-party providers. An internet connection becomes mandatory, which hurts usability in environments with limited or unreliable network access. Cloud-based solutions also incur operational costs because every request is sent to third-party servers. To address these challenges, this project develops a fully local voice agent that removes external service dependencies while supporting the Indonesian language.

System Architecture

The application flow follows a voice assistant architecture with additional Retrieval-Augmented Generation (RAG) capabilities to retrieve relevant Indonesian laws.

User
│
├── Text Query
│   │
│   ▼
│ Text Input
│
└── Voice Query
    │
    ▼
Microphone
    │
    ▼
Speech-to-Text
    │
    ▼
Text Processing
    │
    ▼
Retrieve Related Laws
    │
    ▼
LLM (Ollama)
    │
    ▼
Response Text
    │
    ├── Display in UI
    │
    ▼
Text-to-Speech
    │
    ▼
Speaker Output

The application allows users to either type their query or use a microphone to ask a question. For voice input, the audio is first converted into text using a Speech-to-Text (STT) model. The resulting text, along with directly typed queries, is then processed to remove noise and normalize the input.

After preprocessing, the query is converted into embeddings and used to retrieve relevant Indonesian laws from the local knowledge base. The retrieved legal context and user query are then sent to a locally hosted LLM through Ollama. The LLM generates a response, which is returned to the frontend as text. If the user is using voice interaction, the response can also be converted into speech using a Text-to-Speech (TTS) model and played through the speaker.
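The flow described above can be summarized in a short sketch. The stage functions here (transcribe, retrieve, generate) are hypothetical placeholders passed in as callables, not the project's actual API; the sketch only illustrates the order of the steps.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Stages:
    """Hypothetical stand-ins for the STT, retrieval, and LLM components."""
    transcribe: Callable[[bytes], str]
    retrieve: Callable[[str], List[str]]
    generate: Callable[[str, List[str]], str]


def normalize(text: str) -> str:
    # Collapse repeated whitespace and trim; real preprocessing may do more.
    return " ".join(text.split())


def answer(stages: Stages, text: Optional[str] = None,
           audio: Optional[bytes] = None) -> str:
    # Voice queries are transcribed first; typed queries skip STT.
    if text is None:
        if audio is None:
            raise ValueError("either text or audio is required")
        text = stages.transcribe(audio)
    query = normalize(text)
    context = stages.retrieve(query)         # relevant law chunks
    return stages.generate(query, context)   # answer grounded in the chunks
```

Passing the stages in as callables keeps the sketch runnable without any models; in the real application each callable would wrap Whisper, ChromaDB, and Ollama respectively.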

Technology Stack

Component	Technology
Backend	Python FastAPI
LLM Runtime	Ollama
LLM Model	Gemma 3 1B
STT	Whisper Small
TTS	Facebook MMS TTS (Indonesian)
Vector Database	ChromaDB
Embedding Model	EmbeddingGemma
Frontend UI	React
Storage	PostgreSQL
Deployment	Docker

Technology Selection

The backend is built using Python FastAPI because it provides high performance with minimal overhead, making it suitable for serving local AI models efficiently.

Ollama is used as the LLM runtime because it simplifies the deployment and management of local language models and embedding models while integrating easily with Python applications.

Gemma 3 1B is selected as the primary language model because it offers fast inference and low resource requirements. The model can be upgraded to larger and more capable alternatives depending on the available hardware.

Whisper Small is used for Speech-to-Text because it is one of the most accurate open-source STT models and provides strong support for Indonesian language recognition.

Facebook MMS TTS Indonesian is used for Text-to-Speech because it produces natural Indonesian pronunciation while remaining lightweight enough for local deployment.

ChromaDB serves as the vector database because it is lightweight, easy to deploy locally, and works well in Docker-based environments.

EmbeddingGemma is used to generate embeddings for both legal documents and user queries. It supports multilingual understanding and can run completely offline.

React is used for the frontend because it simplifies client-server communication and enables rapid development of interactive user interfaces.

Docker is used for deployment and containerization, making the system easier to install, manage, and run across different environments.

PostgreSQL is used for persistent storage because it is reliable, performant, and easy to deploy alongside the rest of the application, particularly through lightweight Docker images such as the Alpine version.

Speech Recognition Pipeline

const startRecording = useCallback(async () => {
    try {
      cleanupAudioResources();

      const stream = await navigator.mediaDevices.getUserMedia({
        audio: { channelCount: 1, sampleRate: 16000, echoCancellation: true }
      });
      streamRef.current = stream;

      const audioContext = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 16000 });
      audioContextRef.current = audioContext;
      pcmChunksRef.current = [];

      const source = audioContext.createMediaStreamSource(stream);
      sourceRef.current = source;

      const processor = audioContext.createScriptProcessor(4096, 1, 1);
      processorRef.current = processor;

      processor.onaudioprocess = (e) => {
        const input = e.inputBuffer.getChannelData(0);
        pcmChunksRef.current.push(new Float32Array(input));
      };

      source.connect(processor);
      processor.connect(audioContext.destination);

      setIsRecording(true);
    } catch (err) {
      console.error("Failed to start recording:", err);
      cleanupAudioResources();
    }
  }, []);

Microphone capture is handled by the frontend, which uses the browser's built-in Web Audio API (AudioContext) to record audio from the user's microphone.

const formData = new FormData();
formData.append("audio", wavBlob, "recording.wav");

const res = await fetch("http://localhost:8000/chat/voice", {
  method: "POST",
  body: formData
});

The recorded audio is then sent to the backend as a WAV file through a POST request using FormData. The request is handled by the /chat/voice endpoint, which passes the audio file to the speech recognition pipeline powered by Whisper.

async def stt_async(audio_file_path: str) -> str:
    async def chunk_generator():
        with open(audio_file_path, "rb") as f:
            while chunk := f.read(4096):  # 4KB chunks
                yield chunk
    return await service.transcribe_speech(chunk_generator())

The audio file is read in 4 KB chunks and passed to the speech service as an asynchronous stream rather than as a single in-memory payload. This keeps the endpoint code independent of the file size; the service then collects the chunks before decoding, as shown next.
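Here is a self-contained sketch of the chunked read, assuming a consumer that simply concatenates the chunks. The collect and roundtrip helpers are hypothetical and only stand in for the real service.

```python
import asyncio
import os
import tempfile
from typing import AsyncIterator


async def read_in_chunks(path: str, chunk_size: int = 4096) -> AsyncIterator[bytes]:
    # Yield the file piece by piece instead of reading it in one call.
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk


async def collect(stream: AsyncIterator[bytes]) -> bytes:
    # Hypothetical consumer: joins every chunk back into one bytes object.
    return b"".join([chunk async for chunk in stream])


def roundtrip(payload: bytes) -> bytes:
    # Write the payload to a temporary file, stream it back, then clean up.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    try:
        return asyncio.run(collect(read_in_chunks(tmp.name)))
    finally:
        os.remove(tmp.name)
```

Note that open() and read() are blocking calls, so this pattern bounds the size of each read but does not by itself make the file I/O non-blocking.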

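async def transcribe_speech(self, audio_stream: typing.AsyncIterator[bytes]) -> str:
    # Collect the streamed chunks into one bytes object before decoding
    audio_bytes = b"".join([chunk async for chunk in audio_stream])
    audio_buffer = io.BytesIO(audio_bytes)
    # Decode the audio into a float32 NumPy waveform and read its sample rate
    raw_waveform_np, file_sample_rate = sf.read(
        audio_buffer,
        dtype="float32",
        always_2d=False
    )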

The streamed audio bytes are first collected and decoded into a NumPy waveform array. During this process, the audio sample rate (file_sample_rate) is also extracted. This step ensures that the audio data is converted into a format that can be processed by the Whisper model.

if raw_waveform_np.ndim == 2:
    raw_waveform_np = raw_waveform_np.mean(axis=1)

As a safety precaution, stereo audio is converted to mono by averaging across the channel axis. This is done because Whisper's feature extractor expects a single-channel waveform.

raw_waveform_tensor = torch.from_numpy(raw_waveform_np).float().unsqueeze(0)

Afterward, the NumPy waveform array is converted into a PyTorch tensor, and unsqueeze(0) adds a leading dimension of size 1. This is the layout used by the torchaudio resampling step that follows.

if file_sample_rate != self.stt_sampling_rate:
    resampler = torchaudio.transforms.Resample(
        orig_freq=file_sample_rate,
        new_freq=self.stt_sampling_rate
    )
    resampled_waveform_tensor = resampler(raw_waveform_tensor)
else:
    resampled_waveform_tensor = raw_waveform_tensor

The system then checks whether the audio file's sample rate matches the sample rate expected by the model. If the sample rates differ, the audio is resampled to the model's required sampling rate. Otherwise, the original waveform is used directly. This step ensures that all audio inputs are standardized before being passed to the Whisper model, preventing accuracy issues caused by mismatched sampling rates.
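To make the sample-rate check concrete, here is a dependency-free sketch that resamples by linear interpolation. This is a simplified stand-in chosen for illustration: torchaudio's Resample applies a windowed-sinc filter rather than linear interpolation, so its output will differ. The function name resample_linear is hypothetical.

```python
from typing import List


def resample_linear(samples: List[float], orig_rate: int, new_rate: int) -> List[float]:
    """Resample a mono waveform by linear interpolation (illustrative only)."""
    if orig_rate == new_rate or not samples:
        return list(samples)  # already at the target rate: use as-is
    out_len = int(len(samples) * new_rate / orig_rate)
    step = orig_rate / new_rate  # input samples advanced per output sample
    out = []
    for i in range(out_len):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        # Blend the two neighbouring input samples.
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out
```

For example, one second of 48 kHz audio (48,000 samples) becomes 16,000 samples at Whisper's 16 kHz rate.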

resampled_waveform_np = resampled_waveform_tensor.squeeze().numpy()

reduced_noise_audio = nr.reduce_noise(
    y=resampled_waveform_np,
    sr=self.stt_sampling_rate,
    prop_decrease=0.5,
    stationary=True
)

This section first removes the batch dimension from the tensor using squeeze(), converting it from a two-dimensional tensor with a batch size of 1 into a one-dimensional waveform array.

Before:
[
  [sample1, sample2, sample3, ...]
]

After:
[sample1, sample2, sample3, ...]

The waveform is then converted back into a NumPy array because the noise reduction library expects NumPy input. Afterward, noise reduction is applied to suppress background noise. The parameter prop_decrease=0.5 reduces the detected noise by approximately 50%, helping improve speech clarity before transcription.

input_features = self.feature_extractor(
    reduced_noise_audio,
    sampling_rate=self.stt_sampling_rate,
    return_tensors="pt"
).input_features.to(self.device)

input_features = input_features.to(self.stt_model.dtype)

This step converts the cleaned audio waveform into the features required by Whisper. Specifically, the feature extractor transforms the audio into a log-Mel spectrogram representation, which is the input format expected by the model. The resulting feature tensor is then moved to the target device (CPU or GPU) and converted to the same PyTorch data type as the model to ensure compatibility during inference.

with torch.no_grad():
    predicted_ids = self.stt_model.generate(
        input_features,
        attention_mask=torch.ones(
            input_features.shape[:2],
            dtype=torch.long,
            device=self.device
        ),
        language="id",
        task="transcribe",
    )

transcription = self.stt_processor.batch_decode(
    predicted_ids,
    skip_special_tokens=True
)[0]

self.logger.info(f"Transcription result: '{transcription.strip()}'")
return transcription.strip()

This is the final stage of the transcription pipeline. The torch.no_grad() context disables gradient computation because the model is only performing inference and not training. This reduces memory consumption and improves execution speed.

The Whisper model then generates token IDs from the extracted audio features. The parameters language="id" and task="transcribe" explicitly instruct the model to perform Indonesian speech transcription.

After the token IDs are generated, the tokenizer decodes them into human-readable text using the model's vocabulary. Finally, strip() removes any leading or trailing whitespace that may have been generated during decoding, and the cleaned transcription is returned as the final result.

Retrieval-Augmented Generation (RAG)

This system uses a dense embedding-based retriever to find legal documents relevant to a user's query. The legal documents consist of full-text regulations that are divided into smaller chunks using LangChain's RecursiveCharacterTextSplitter. The primary data source is the Indonesian Regulations Dataset available on Kaggle, while the secondary source is Pasal.id.
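The idea behind recursive splitting can be sketched in a few lines: try the coarsest separator first (paragraphs), and fall back to finer ones (lines, words, characters) only for pieces that are still too long. This is a simplified stand-in written for illustration, not LangChain's implementation, and the chunk size used here is an assumption rather than the project's setting.

```python
from typing import List, Sequence


def recursive_split(text: str, chunk_size: int = 200,
                    separators: Sequence[str] = ("\n\n", "\n", " ", "")) -> List[str]:
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate          # keep packing pieces into this chunk
            continue
        if current.strip():
            chunks.append(current)
        if len(part) > chunk_size:
            # Piece is still too long: retry with the finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
            current = ""
        else:
            current = part
    if current.strip():
        chunks.append(current)
    return chunks
```

Splitting on paragraph boundaries first tends to keep each article of a regulation together, which is usually what the retriever needs.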

When the system receives a query from either speech recognition or text input, it processes the query through the RAG pipeline.

def process_query(self, query):
    # ... other code ...
    self._ensure_populated(query)
    logger.info("Extracting keywords...")
    keywords = self.llm_service.get_keyword_from_query(query)
    logger.info(f"Extracted keywords: {keywords}")

    # Make sure the collection also covers each extracted keyword
    for kw in keywords:
        self._ensure_populated(kw)
    # ... other code ...

def _ensure_populated(self, query: str):
    """Check collection size and populate if empty."""
    try:
        count = self.rag_pipeline.chroma_client.get_collection_count()
        logger.info(f"Current collection count: {count}")

        if count == 0:
            logger.info("Collection is empty, populating from Pasal...")
            self._populate_from_pasal(query)
    except Exception as e:
        logger.error(f"Failed to check/populate collection: {e}")

First, the system checks whether the ChromaDB collection has already been populated with legal documents. If the collection is empty, it retrieves relevant regulations from Pasal.id and stores them in the vector database. This serves as a secondary data source that can supplement the primary dataset.

Because this process requires access to Pasal.id, an internet connection is needed when the collection has not yet been populated. For offline deployment, I preloaded legal documents into the vector database to ensure that the retrieval process can operate without internet access.

for term in search_terms:
    try:
        logger.info(f"Fetching Pasal data for term: '{term}'")
        pasal_response = self.pasal_service.search_pasal(term)

        snippets, metadatas = self.pasal_service._get_snippets_and_metadata(pasal_response)

        if snippets:
            self.rag_pipeline.populate(documents=snippets, metadatas=metadatas)
            pasal_info_parts.extend(snippets)
            populated = True
            logger.info(f"Populated {len(snippets)} snippets for term: '{term}'")

When the collection is empty, the _ensure_populated() method calls _populate_from_pasal(), which retrieves legal documents from the secondary data source and populates the vector database based on the user's query and the query keywords generated by the LLM.

try:
    retrieved_docs, retrieved_metas = self.rag_pipeline.retrieve(query, keyword=keywords)
except Exception as e:
    logger.error(f"Retrieval failed: {e}, attempting populate + retry...")
    self._populate_from_pasal(query, keywords)
    try:
        retrieved_docs, retrieved_metas = self.rag_pipeline.retrieve(query, keyword=keywords)
    except Exception as e2:
        logger.error(f"Retrieval failed again after populate: {e2}")
        retrieved_docs, retrieved_metas = [], []

# In the dense retriever
def retrieve(self, query: str, keyword: str = None, where_filter: Dict = None) -> tuple[List[str], List[Dict]]:
    if self.chroma_client.get_collection_count() == 0:
        print(f"Warning: Collection {self.chroma_client.collection.name} is empty.")
        return [], []

    query_embeddings = self._get_embeddings([query])
    if not query_embeddings:
        return [], []

    count = self.chroma_client.get_collection_count()
    if where_filter:
        available = len(self.chroma_client.collection.get(where=where_filter)["documents"])
        count = available if available > 0 else count

    results = self.chroma_client.collection.query(
        query_embeddings=query_embeddings,
        n_results=min(self.top_k, count),
        where=where_filter,
        include=["documents", "metadatas"]
    )

    if not results["documents"] or not results["documents"][0]:
        return [], []

    return results["documents"][0], results["metadatas"][0]

This is where the retrieval process takes place. First, the user's query is converted into an embedding using the embedding model. The embedding is then used to perform a similarity search against the documents stored in ChromaDB. The retriever returns the most relevant legal document chunks along with their corresponding metadata, which are later used as context for the LLM during answer generation.
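What the similarity search does can be illustrated with a brute-force version over toy vectors. This is only a sketch: ChromaDB uses its own index internally, the distance metric configured for the project's collection is not stated in this post (cosine similarity is assumed here as one common choice), and the two-dimensional vectors are made up for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float],
          docs: List[Tuple[str, Sequence[float], Dict]],
          k: int = 2) -> Tuple[List[str], List[Dict]]:
    """Return the texts and metadata of the k documents closest to the query."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    hits = ranked[:min(k, len(ranked))]  # never ask for more than we have
    return [text for text, _, _ in hits], [meta for _, _, meta in hits]
```

As in the retriever above, the number of results is capped by the number of available documents, and an empty collection yields two empty lists.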

try:
    decision = self.evaluator_service.evaluate(query, retrieved_docs, retrieved_metas)
except Exception as e:
    logger.error(f"Evaluator failed: {e}")
    decision = "incorrect"
logger.info(f"Evaluator decision: {decision}")

pasal_info = None

if decision == "correct" and retrieved_docs:
    try:
        response = self.llm_service.generate_answer_with_context(
            query,
            "\n\n---\n\n".join(retrieved_docs),
            retrieved_metas
        )
    except Exception as e:
        logger.error(f"LLM answer generation failed: {e}")
        decision = "incorrect"

if decision in ("ambiguous", "incorrect") or not retrieved_docs:
    pasal_info = self._populate_from_pasal(query, keywords)

    if pasal_info:
        try:
            response = self.llm_service.generate_answer_with_context(query, pasal_info, [])
        except Exception as e:
            logger.error(f"LLM answer generation from Pasal info failed: {e}")
            response = "Sorry, I couldn't generate an answer at this time."
    else:
        response = "Sorry, I couldn't find relevant information for your query."

After the retrieval step, the retrieved documents are evaluated by the decision evaluator. The evaluator uses the cross-encoder/ms-marco-MiniLM-L-6-v2 model to measure the semantic relevance between the user's query and the retrieved context. Based on the relevance score, the evaluator classifies the retrieval result into one of three categories: correct, ambiguous, or incorrect.

If the relevance score exceeds the predefined threshold, the retrieved documents are considered correct and are passed directly to the LLM as context for answer generation. If the score falls below the threshold, the result is classified as ambiguous or incorrect. In this case, the system performs a fallback retrieval from Pasal.id using the query and its extracted keywords, and uses the retrieved legal information as the context for answer generation instead.

This evaluation step helps ensure that only highly relevant retrieved documents are used during answer generation, reducing the risk of providing inaccurate or unrelated legal information.
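The three-way decision can be sketched as follows, in the spirit of the upper and lower thresholds used in Corrective RAG [1]. The threshold values, the assumption that scores are normalized to the 0 to 1 range, and the function name are all illustrative; they are not taken from the project's evaluator.

```python
from typing import List


def classify_retrieval(scores: List[float],
                       upper: float = 0.7,
                       lower: float = 0.3) -> str:
    """Map per-document relevance scores to a retrieval decision.

    The thresholds are illustrative assumptions, not project values.
    """
    if not scores:
        return "incorrect"
    best = max(scores)
    if best >= upper:
        return "correct"      # at least one chunk is clearly relevant
    if best < lower:
        return "incorrect"    # nothing relevant was retrieved
    return "ambiguous"        # in between: fall back to another source
```

Only the "correct" outcome lets the retrieved chunks reach the LLM directly; the other two outcomes trigger the fallback path.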

return {
    "answer": response,
    "source_documents": retrieved_docs if decision == "correct" else ([pasal_info] if pasal_info else [])
}

Finally, the system returns the generated answer along with the source documents used during the generation process. These results are then sent to the frontend, where the answer is displayed to the user together with the supporting legal references.

Text-to-Speech Module

To activate the Text-to-Speech (TTS) feature, the user clicks the speaker icon located beside the response. The frontend then sends the generated answer to the TTS endpoint.

// In frontend
const response = await fetch(ttsEndpoint, {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({ text })
});

# In backend
@router.post("/tts")
async def generate_tts(request: TTSRequest):
    try:
        tts_output_path = await tts_async(request.text)
        with open(tts_output_path, "rb") as f:
            audio_content = f.read()
        return StreamingResponse(
            iter([audio_content]),
            media_type="audio/wav",
            headers={"Content-Disposition": f'attachment; filename="response.wav"'}
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"TTS generation failed: {str(e)}")

async def tts_async(text: str) -> str:
    audio_chunks = []
    async for chunk in service.generate_speech_facebook(text):
        audio_chunks.append(chunk)
    full_audio = b''.join(audio_chunks)

    with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmp:
        with wave.open(tmp.name, 'wb') as wav:
            wav.setnchannels(1)
            wav.setsampwidth(2) 
            wav.setframerate(service.tts_sampling_rate)
            wav.writeframes(full_audio)
        return tmp.name

The frontend sends a request to the TTS endpoint, which generates the speech audio and returns it as a WAV file. During this process, the system configures the audio as mono (single-channel) audio, uses 16-bit PCM encoding (setsampwidth(2)), sets the sample rate according to the TTS model's sampling rate, and writes the generated audio data into a WAV file.
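The WAV settings described above can be tried in isolation with Python's standard wave module. This sketch encodes a list of float samples as mono 16-bit PCM entirely in memory; the write_wav helper and the 16 kHz rate used in the usage note are assumptions for the example, not the project's code.

```python
import io
import math
import struct
import wave
from typing import Sequence


def write_wav(samples: Sequence[float], sample_rate: int) -> bytes:
    """Encode float samples in [-1.0, 1.0] as a mono 16-bit PCM WAV file."""
    # Clip to the valid range, then scale to signed 16-bit integers.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    frames = struct.pack(f"<{len(pcm)}h", *pcm)  # little-endian int16
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(frames)
    return buffer.getvalue()


def sine_tone(freq: float, sample_rate: int, n_samples: int) -> list:
    # Helper for trying the encoder without a TTS model.
    return [math.sin(2 * math.pi * freq * t / sample_rate) for t in range(n_samples)]
```

Calling write_wav(sine_tone(440, 16000, 1600), 16000) produces a 0.1 second tone whose header reports one channel, a 2-byte sample width, and a 16,000 Hz frame rate.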

async def generate_speech_facebook(self, text: str) -> typing.AsyncIterator[bytes]:
    """
    Generates speech using the loaded Hugging Face VITS model and streams it as WAV bytes.
    """
    try:            
        cleaned_text = self._clean_text(text)

def _clean_text(self, text: str) -> str:
    """
    Cleans and preprocesses text for TTS:
    1. Expand abbreviations → 2. Convert numbers → 3. Clean punctuation
    """
    if not text:
        return ""

    text = self._expand_abbreviations(text)

    cleaned_text = self.konversi_angka_ke_kata(text)

    cleaned_text = re.sub(r"[^\w\s]", "", cleaned_text)
    cleaned_text = re.sub(r"\s+", " ", cleaned_text).strip()

    return cleaned_text

Before generating speech, the text is cleaned and preprocessed. First, abbreviations are expanded so that the model can pronounce them correctly. For example, abbreviations may be converted into individual characters separated by spaces to encourage spelling rather than pronunciation as a word. Next, numbers are converted into their word representations because the TTS model handles written numbers more accurately than numeric digits. Finally, punctuation is removed and whitespace is normalized to produce cleaner input for the model.
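The three cleaning steps can be illustrated with a small standalone version. The project's _expand_abbreviations and konversi_angka_ke_kata helpers are not shown in this post, so the functions below are hypothetical stand-ins: the abbreviation rule (spell out any all-caps token of two to five letters) is an assumption, and the number converter only spells out values below one million.

```python
import re

_WORDS = ["nol", "satu", "dua", "tiga", "empat", "lima", "enam",
          "tujuh", "delapan", "sembilan", "sepuluh", "sebelas"]


def _join(head: str, rest: int) -> str:
    return head if rest == 0 else f"{head} {number_to_words(rest)}"


def number_to_words(n: int) -> str:
    """Spell out a non-negative integer in Indonesian."""
    if n < 12:
        return _WORDS[n]
    if n < 20:
        return f"{_WORDS[n - 10]} belas"
    if n < 100:
        return _join(f"{_WORDS[n // 10]} puluh", n % 10)
    if n < 200:
        return _join("seratus", n - 100)
    if n < 1000:
        return _join(f"{_WORDS[n // 100]} ratus", n % 100)
    if n < 2000:
        return _join("seribu", n - 1000)
    if n < 1_000_000:
        return _join(f"{number_to_words(n // 1000)} ribu", n % 1000)
    # Larger values: fall back to reading digit by digit.
    return " ".join(_WORDS[int(d)] for d in str(n))


def clean_text(text: str) -> str:
    if not text:
        return ""
    # 1. Spell out short all-caps abbreviations letter by letter (assumed rule).
    text = re.sub(r"\b[A-Z]{2,5}\b", lambda m: " ".join(m.group()), text)
    # 2. Replace each run of digits with its word form.
    text = re.sub(r"\d+", lambda m: number_to_words(int(m.group())), text)
    # 3. Drop punctuation, then collapse whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()
```

With these rules, a citation such as "UU No. 11 Tahun 2008!" is rewritten so that the abbreviation is spelled out and both numbers are read as words.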

inputs = self.tts_tokenizer(
    cleaned_text,
    return_tensors="pt",
    padding=True
).to(self.device)

with torch.no_grad():
    output = self.tts_model(**inputs).waveform

audio_data = output.cpu().to(torch.float32).numpy().squeeze()

The cleaned text is then tokenized into a format that can be processed by the TTS model. During inference, gradient computation is disabled using torch.no_grad() to improve efficiency and reduce memory usage. The model generates a waveform tensor, which is then converted to a Float32 NumPy array and squeezed to remove the batch dimension.

wav_buffer = io.BytesIO()
int_audio = np.int16(audio_data * 32767)
scipy.io.wavfile.write(wav_buffer, rate=self.tts_sampling_rate, data=int_audio)
wav_buffer.seek(0)

while True:
    chunk = wav_buffer.read(4096)
    if not chunk:
        break
    yield chunk
    await asyncio.sleep(0.01)

Finally, the system creates an in-memory audio file and converts the generated waveform into 16-bit PCM format. The WAV data is written into a memory buffer and streamed in chunks of 4096 bytes. These chunks are sent incrementally to the TTS endpoint response, allowing the audio to be transmitted efficiently to the frontend for playback.

References

[1] S.-Q. Yan, J.-C. Gu, Y. Zhu, and Z.-H. Ling, "Corrective Retrieval-Augmented Generation," arXiv preprint arXiv:2401.15884, 2024. Available: https://arxiv.org/abs/2401.15884

[2] H. Sugiharto, "Indonesian Regulations Dataset," Kaggle. Available: https://www.kaggle.com/datasets/hermansugiharto/indonesian-regulations-dataset

[3] Pasal.id, "Pasal.id API Documentation." Available: https://pasal.id/api

[4] N. Reimers and the Hugging Face Community, "cross-encoder/ms-marco-MiniLM-L6-v2," Hugging Face Model Hub. Available: https://huggingface.co/cross-encoder/ms-marco-MiniLM-L6-v2

[5] Meta AI, "facebook/mms-tts-ind," Hugging Face Model Hub. Available: https://huggingface.co/facebook/mms-tts-ind

Top comments (1)

Mateo Ruiz • Jun 9

I think the most valuable takeaway here is that AI is shifting where the bottleneck lives.

A few years ago, writing code was often the slowest part. Today, generating code is cheap and fast. What remains expensive is understanding requirements, evaluating trade-offs, handling edge cases, designing architecture, and maintaining systems over time.

That's why the conversation shouldn't be "prompting vs engineering." It should be about amplifying engineering judgment. The teams getting the most value from AI aren't the ones with the fanciest prompts; they're the ones with strong review processes, testing, architecture standards, and domain expertise. AI can accelerate implementation dramatically, but the biggest wins come when experienced engineers use it to remove repetitive work and focus more energy on decisions that actually impact scalability, reliability, and business outcomes.

The tool has changed. The importance of sound engineering judgment hasn't.