This is a submission for the AssemblyAI Voice Agents Challenge
What I Built
1. Overview of the Voice Agent and Prompt Categories Addressed
The voice-based Crypto Education Agent is designed to provide interactive, personalized learning experiences in cryptocurrency topics through natural voice interactions. It uses AssemblyAI for real-time speech-to-text (STT) transcription, handling domain-specific jargon like "BTC," "DeFi," and "zero-knowledge proofs" with high accuracy. The transcribed queries are processed via a Retrieval-Augmented Generation (RAG) pipeline built with LlamaIndex, which retrieves factual information from a curated knowledge base stored in Pinecone vector database. Responses are generated using Anthropic's Claude LLM, ensuring detailed, hallucination-free explanations. The agent also learns from conversations by indexing dialogue history, enabling personalized follow-ups, such as building on prior queries about staking to explain related DeFi concepts.
Key features include:
- Low-latency interactions: Achieving ~300ms transcription via AssemblyAI's Universal-Streaming, making it feel like a real-time conversation.
- Domain-specific enhancements: Word boosts in AssemblyAI improve transcription accuracy for crypto terms to 95%+.
- Educational tools: Supports use cases like concept explanations, quizzes, market insights, and interactive quizzes.
- Testing and scalability: Validated through 5 structured use cases with metrics for accuracy, relevance, and personalization; built for future extensions like live market APIs.
This agent primarily addresses the following prompt categories:
- Domain Expert: It acts as a specialized tutor in cryptocurrency education, drawing from a tailored knowledge base to explain complex topics (e.g., blockchain fundamentals or NFT terms) accurately and contextually, reducing hallucinations through RAG.
- Real-Time Performance: Leveraging AssemblyAI's streaming STT for low-latency voice input, it enables seamless, interactive sessions where users can speak queries and receive immediate responses, with conversation learning for dynamic personalization.
While it could indirectly support Business Automation (e.g., automating educational workflows in trading advisory), the core focus is on expert knowledge delivery and performant real-time interactions.
Demo
GitHub Repository
AssemblyAI RAG with Learning System
A comprehensive system that combines AssemblyAI's Universal-Streaming technology with Retrieval-Augmented Generation (RAG) using LlamaIndex, Anthropic, and Pinecone. This project enables real-time speech-to-text transcription, semantic search, and conversational AI with long-term learning capabilities.
ποΈ Project Structure
assembly_ai/
βββ assembly.py # Main AssemblyAI streaming implementation
βββ build_rag_index.py # Pinecone index construction
βββ rag_with_learning.py # RAG chat engine with learning
βββ transcribe_test_audio.py # Batch audio transcription
βββ crypto_kb/ # Knowledge base documents
βββ test_audio/ # Audio test samples
βββ test_cases.json # Test case definitions
βββ .env # Environment variables (create this)
π Features
- Real-time Speech-to-Text: Using AssemblyAI's Universal-Streaming API
- Domain-Specific Word Boost: Enhanced recognition for crypto/finance terminology
- RAG Pipeline: Semantic search with Pinecone and LlamaIndex
- Long-term Learning: Conversation history integration
- Batch Testing: Audio transcription accuracy evaluation
- Multi-Modal Input: Support for both live audio and pre-recorded files
πΊ Demo Video
Watch a demonstrationβ¦
Technical Implementation & AssemblyAI Integration
Using AssemblyAI's capabilities
def run_batch_tests():
# Use the API key from Assembly
aai.settings.api_key = api_key
# Use the word_boost list
config = aai.TranscriptionConfig(word_boost=word_boost)
# Defines config fot the Assembly usage.
transcriber = aai.Transcriber(config=config)
with open("test_audio/test_cases.json", "r") as f:
test_cases = json.load(f)
for case in test_cases:
transcript = transcriber.transcribe(case["filename"]).text
print(f"Expected: {case['expected_transcript']}")
print(f"Got: {transcript}")
print()
Explanation:
- Iterates through a set of test cases, each with an audio file (.wav) and an expected transcript simulating a real time conversation.
- Transcribes each file and prints both the expected and actual results for comparison.
Importance:
- Provides a simple but effective way to evaluate transcription accuracy and system performance.
- Supports regression testing and quality assurance for the speech-to-text pipeline.
Testing AssemblyAI's real time processing
microphone_stream = aai.extras.MicrophoneStream(sample_rate=8000)
for chunk in microphone_stream:
print("Audio chunk read:", len(chunk))
break # Remove break to keep reading, this is just to test
client.stream(microphone_stream)
Explanation:
- This snippet tests live microphone audio capture by reading a chunk and printing its size.
- The
break
is used to only read one chunk for debugging. - After confirming audio capture, the stream is passed to AssemblyAI for real-time transcription.
- Not a priority for MVP but implementing in future versions.
Importance:
- Ensures that the microphone is correctly configured and audio is being captured before starting a full streaming session.
- Helps debug device index and sample rate issues, which are common in cross-platform audio applications.
Testing AssemblyAI's accuracy and storing date for future process
transcripts = []
for case in test_cases:
audio_path = case["filename"]
transcript = transcriber.transcribe(audio_path)
transcripts.append(transcript.text)
print(f"Transcript for {audio_path}: {transcript.text}")
# Save transcripts for RAG queries
with open("transcripts.json", "w") as f:
json.dump(transcripts, f)
Explanation:
- This code batch-transcribes a set of audio files, storing the resulting transcripts.
- Each transcript is printed and then saved to a JSON file for later use.
- Imagine each transcript is the result of the real-time conversation with a normal user.
Importance:
- Enables the integration of audio data into the RAG pipeline, allowing spoken content to be indexed and retrieved semantically.
- Supports workflows where knowledge is captured from meetings, podcasts, or other audio sources.
Creating core vector index for RAG system
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context, embed_model=embed_model)
Explanation:
- This code builds the core vector index for the RAG system.
-
PineconeVectorStore
connects LlamaIndex to a managed Pinecone vector database. -
StorageContext
manages how data is stored and retrieved. -
VectorStoreIndex.from_documents
ingests all documents using the specified embedding model (here, HuggingFace'sall-MiniLM-L6-v2
), ensuring all vectors are compatible with the Pinecone index dimension.
Importance:
- This is the foundation for semantic search and retrieval in your RAG pipeline.
- Using HuggingFace embeddings ensures local, API-free operation and avoids OpenAI dependencies.
- Ensures that all downstream retrieval and chat operations are based on a robust, scalable vector store.
Simulating human conversation with the agent (AssemblyAI, LlamaIndex, Anthropic)
# Simulate conversation loop
for query in queries:
response = chat_engine.chat(query) # Uses history
print(f"Query: {query}\nResponse: {response}\n")
# Update index with convo for long-term learning
from llama_index.core import Document
convo_doc = Document(text=f"User: {query} | Agent: {response}")
index.insert(convo_doc)
Explanation:
- This loop simulates a user-agent conversation.
- Each query is sent to the chat engine, which uses the RAG index and LLM (Anthropic) to generate a response.
- After each exchange, the conversation is wrapped as a
Document
and inserted into the index, enabling the system to "learn" from new interactions.
Importance:
- Demonstrates how the system can perform continual learning, updating its knowledge base with new conversational data.
- This enables adaptive, context-aware responses and supports use cases like personalized assistants or evolving knowledge bases.
Top comments (0)