DEV Community

Cover image for Your personal Blockchain tutor powered by AssemblyAI
Julian
Julian Subscriber

Posted on

Your personal Blockchain tutor powered by AssemblyAI

AssemblyAI Voice Agents Challenge: Domain Expert

This is a submission for the AssemblyAI Voice Agents Challenge

What I Built

1. Overview of the Voice Agent and Prompt Categories Addressed

The voice-based Crypto Education Agent is designed to provide interactive, personalized learning experiences in cryptocurrency topics through natural voice interactions. It uses AssemblyAI for real-time speech-to-text (STT) transcription, handling domain-specific jargon like "BTC," "DeFi," and "zero-knowledge proofs" with high accuracy. The transcribed queries are processed via a Retrieval-Augmented Generation (RAG) pipeline built with LlamaIndex, which retrieves factual information from a curated knowledge base stored in Pinecone vector database. Responses are generated using Anthropic's Claude LLM, ensuring detailed, hallucination-free explanations. The agent also learns from conversations by indexing dialogue history, enabling personalized follow-ups, such as building on prior queries about staking to explain related DeFi concepts.

Key features include:

  • Low-latency interactions: Achieving ~300ms transcription via AssemblyAI's Universal-Streaming, making it feel like a real-time conversation.
  • Domain-specific enhancements: Word boosts in AssemblyAI improve transcription accuracy for crypto terms to 95%+.
  • Educational tools: Supports use cases like concept explanations, quizzes, market insights, and interactive quizzes.
  • Testing and scalability: Validated through 5 structured use cases with metrics for accuracy, relevance, and personalization; built for future extensions like live market APIs.

This agent primarily addresses the following prompt categories:

  • Domain Expert: It acts as a specialized tutor in cryptocurrency education, drawing from a tailored knowledge base to explain complex topics (e.g., blockchain fundamentals or NFT terms) accurately and contextually, reducing hallucinations through RAG.
  • Real-Time Performance: Leveraging AssemblyAI's streaming STT for low-latency voice input, it enables seamless, interactive sessions where users can speak queries and receive immediate responses, with conversation learning for dynamic personalization.

While it could indirectly support Business Automation (e.g., automating educational workflows in trading advisory), the core focus is on expert knowledge delivery and performant real-time interactions.

Demo

GitHub Repository

AssemblyAI RAG with Learning System

A comprehensive system that combines AssemblyAI's Universal-Streaming technology with Retrieval-Augmented Generation (RAG) using LlamaIndex, Anthropic, and Pinecone. This project enables real-time speech-to-text transcription, semantic search, and conversational AI with long-term learning capabilities.

πŸ—οΈ Project Structure

assembly_ai/
β”œβ”€β”€ assembly.py                 # Main AssemblyAI streaming implementation
β”œβ”€β”€ build_rag_index.py         # Pinecone index construction
β”œβ”€β”€ rag_with_learning.py       # RAG chat engine with learning
β”œβ”€β”€ transcribe_test_audio.py   # Batch audio transcription
β”œβ”€β”€ crypto_kb/                 # Knowledge base documents
β”œβ”€β”€ test_audio/                # Audio test samples
β”œβ”€β”€ test_cases.json           # Test case definitions
└── .env                      # Environment variables (create this)

πŸš€ Features

  • Real-time Speech-to-Text: Using AssemblyAI's Universal-Streaming API
  • Domain-Specific Word Boost: Enhanced recognition for crypto/finance terminology
  • RAG Pipeline: Semantic search with Pinecone and LlamaIndex
  • Long-term Learning: Conversation history integration
  • Batch Testing: Audio transcription accuracy evaluation
  • Multi-Modal Input: Support for both live audio and pre-recorded files

πŸ“Ί Demo Video

Watch a demonstration…

Technical Implementation & AssemblyAI Integration

Using AssemblyAI's capabilities

def run_batch_tests():
    # Use the API key from Assembly
    aai.settings.api_key = api_key
    # Use the word_boost list
    config = aai.TranscriptionConfig(word_boost=word_boost)
    # Defines config fot the Assembly usage.
    transcriber = aai.Transcriber(config=config)

    with open("test_audio/test_cases.json", "r") as f:
        test_cases = json.load(f)

    for case in test_cases:
        transcript = transcriber.transcribe(case["filename"]).text
        print(f"Expected: {case['expected_transcript']}")
        print(f"Got: {transcript}")
        print()
Enter fullscreen mode Exit fullscreen mode

Explanation:

  • Iterates through a set of test cases, each with an audio file (.wav) and an expected transcript simulating a real time conversation.
  • Transcribes each file and prints both the expected and actual results for comparison.

Importance:

  • Provides a simple but effective way to evaluate transcription accuracy and system performance.
  • Supports regression testing and quality assurance for the speech-to-text pipeline.

Testing AssemblyAI's real time processing

microphone_stream = aai.extras.MicrophoneStream(sample_rate=8000)
for chunk in microphone_stream:
    print("Audio chunk read:", len(chunk))
    break  # Remove break to keep reading, this is just to test
client.stream(microphone_stream)
Enter fullscreen mode Exit fullscreen mode

Explanation:

  • This snippet tests live microphone audio capture by reading a chunk and printing its size.
  • The break is used to only read one chunk for debugging.
  • After confirming audio capture, the stream is passed to AssemblyAI for real-time transcription.
  • Not a priority for MVP but implementing in future versions.

Importance:

  • Ensures that the microphone is correctly configured and audio is being captured before starting a full streaming session.
  • Helps debug device index and sample rate issues, which are common in cross-platform audio applications.

Testing AssemblyAI's accuracy and storing date for future process

transcripts = []
for case in test_cases:
    audio_path = case["filename"]
    transcript = transcriber.transcribe(audio_path)
    transcripts.append(transcript.text)
    print(f"Transcript for {audio_path}: {transcript.text}")

# Save transcripts for RAG queries
with open("transcripts.json", "w") as f:
    json.dump(transcripts, f)
Enter fullscreen mode Exit fullscreen mode

Explanation:

  • This code batch-transcribes a set of audio files, storing the resulting transcripts.
  • Each transcript is printed and then saved to a JSON file for later use.
  • Imagine each transcript is the result of the real-time conversation with a normal user.

Importance:

  • Enables the integration of audio data into the RAG pipeline, allowing spoken content to be indexed and retrieved semantically.
  • Supports workflows where knowledge is captured from meetings, podcasts, or other audio sources.

Creating core vector index for RAG system

vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context, embed_model=embed_model)
Enter fullscreen mode Exit fullscreen mode

Explanation:

  • This code builds the core vector index for the RAG system.
  • PineconeVectorStore connects LlamaIndex to a managed Pinecone vector database.
  • StorageContext manages how data is stored and retrieved.
  • VectorStoreIndex.from_documents ingests all documents using the specified embedding model (here, HuggingFace's all-MiniLM-L6-v2), ensuring all vectors are compatible with the Pinecone index dimension.

Importance:

  • This is the foundation for semantic search and retrieval in your RAG pipeline.
  • Using HuggingFace embeddings ensures local, API-free operation and avoids OpenAI dependencies.
  • Ensures that all downstream retrieval and chat operations are based on a robust, scalable vector store.

Simulating human conversation with the agent (AssemblyAI, LlamaIndex, Anthropic)

# Simulate conversation loop
for query in queries:
    response = chat_engine.chat(query)  # Uses history
    print(f"Query: {query}\nResponse: {response}\n")

    # Update index with convo for long-term learning
    from llama_index.core import Document
    convo_doc = Document(text=f"User: {query} | Agent: {response}")
    index.insert(convo_doc)
Enter fullscreen mode Exit fullscreen mode

Explanation:

  • This loop simulates a user-agent conversation.
  • Each query is sent to the chat engine, which uses the RAG index and LLM (Anthropic) to generate a response.
  • After each exchange, the conversation is wrapped as a Document and inserted into the index, enabling the system to "learn" from new interactions.

Importance:

  • Demonstrates how the system can perform continual learning, updating its knowledge base with new conversational data.
  • This enables adaptive, context-aware responses and supports use cases like personalized assistants or evolving knowledge bases.

Top comments (0)