Let's Build a Voice RAG System That Actually Works 🎉

What We're Going to Build (And Why It's Pretty Cool)

So, you know how sometimes you wish you could just talk to your computer and have it actually understand what you're asking? Well, that's exactly what we're building today!

Picture this: You record yourself asking "Hey, what's machine learning all about?" and boom - your system transcribes what you said, searches through your documents AND the web, then gives you a smart answer back. Pretty neat, right?

The Magic Behind the Curtain ✨

Here's what our little system does:

  1. Listens to your voice (using a fancy Whisper model)
  2. Thinks about what you asked (searches your knowledge base)
  3. Looks stuff up on the internet (because why not get fresh info?)
  4. Puts it all together into a nice answer

And the best part? We're making it FAST by using your GPU properly. No more waiting around for 30 seconds while your model thinks!
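
Here's the whole flow at a glance (the method names match the class we'll build step by step below, so treat this as a roadmap):

# The whole pipeline in one glance - each method gets built in the steps below
def answer_voice_query(rag, audio_path):
    text = rag.transcribe_audio(audio_path)                   # 1. listen
    docs = rag.retrieve_documents_fast(text, k=5)             # 2. think (knowledge base)
    web = rag.web_search(text, num_results=3)                 # 3. look things up
    return rag.generate_response_optimized(text, web, docs)   # 4. put it all together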


Before We Dive In - Let's Get Ready! 🛠️

What You'll Need

Don't worry, this isn't going to break the bank:

  • A Google Colab account (the free one works fine!)
  • About 30 minutes of your time
  • A sense of curiosity (and maybe some coffee ☕)

Quick GPU Check

First things first - let's make sure we've got the good stuff:

import torch
print(f"🔥 GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU Device: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("⚠️ Uh oh! No GPU detected. Go to Runtime → Change Runtime Type → T4 GPU")

If you see something like "Tesla T4" pop up, you're golden! 🎉


Step 1: Installing Our Toolbox 📦

Alright, time to grab all the cool libraries we need. Think of this as gathering ingredients before we start cooking:

# The core ML stuff (this is where the magic happens)
!pip install -q "transformers>=4.41.0" torch torchaudio --upgrade  # quotes stop the shell treating >= as a redirect
!pip install -q accelerate bitsandbytes optimum

# For grabbing stuff from the web and making pretty interfaces
!pip install -q requests beautifulsoup4 gradio

# The smart search and audio processing bits
!pip install -q sentence-transformers datasets librosa soundfile

# Super-fast similarity search (the secret sauce!)
!pip install -q faiss-gpu

Pro tip: If faiss-gpu gives you trouble, just use faiss-cpu instead. It'll still work great!
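
Not sure which build you ended up with? Here's a quick check (the GPU build exposes StandardGpuResources; the CPU build doesn't):

# Check which FAISS build got installed - GPU wheels can be finicky on Colab
try:
    import faiss
    print(f"FAISS ready! GPU build: {hasattr(faiss, 'StandardGpuResources')}")
except ImportError:
    print("No FAISS found - run: !pip install -q faiss-cpu")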


Step 2: The Heart of Our System - The VoiceRAGT4 Class 💝

Now here's where things get interesting. We're building a class that's like a Swiss Army knife for voice processing:

import torch
import numpy as np
import faiss
import librosa
import requests
from transformers import (
    pipeline,
    AutoModelForSpeechSeq2Seq,
    AutoProcessor,
    BitsAndBytesConfig,
)
from sentence_transformers import SentenceTransformer

class VoiceRAGT4:
    def __init__(self):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"🚀 Using device: {self.device}")

        # These will hold our AI models
        self.speech_to_text = None
        self.embedder = None

        # Our knowledge base
        self.documents = []
        self.document_embeddings = None
        self.faiss_index = None

        # Let's set everything up!
        self.setup_models()

Think of this as setting up your workspace before you start a project. We're just getting everything organized!

Loading Our Models (The Fun Part!) 🤖

Here's where we load up our AI models. I've added some special sauce to make them run super fast on your T4:

def setup_models(self):
    print("🔧 Loading models with T4 optimizations...")

    # First, let's try to load your custom Whisper model
    try:
        model_id = "AfroLogicInsect/whisper-finetuned-float32"

        # This is the secret sauce - memory optimization!
        bnb_config = BitsAndBytesConfig(
            load_in_8bit=True  # Uses roughly half the memory of fp16
        )

        # Load the model
        model = AutoModelForSpeechSeq2Seq.from_pretrained(
            model_id,
            torch_dtype=torch.float16,  # Faster inference
            low_cpu_mem_usage=True,     # Be nice to your RAM
            use_safetensors=True,       # Modern and safe
            quantization_config=bnb_config
        )

        # Note: with load_in_8bit, accelerate already places the model on the GPU -
        # calling .to("cuda") on a quantized model raises an error, so we skip it

        processor = AutoProcessor.from_pretrained(model_id)

        # Create our speech-to-text pipeline
        self.speech_to_text = pipeline(
            "automatic-speech-recognition",
            model=model,
            tokenizer=processor.tokenizer,
            feature_extractor=processor.feature_extractor,
            max_new_tokens=128,
            chunk_length_s=30,  # Process in 30-second chunks
            batch_size=8,       # Sweet spot for T4
            torch_dtype=torch.float16,
            # No device= here: the quantized model is already pinned to the GPU
        )

        print("✅ Your custom Whisper model loaded successfully!")

    except Exception as e:
        print(f"🤔 Hmm, couldn't load your custom model: {e}")
        print("🔄 No worries! Using the standard Whisper-small instead...")

        # Fallback option
        self.speech_to_text = pipeline(
            "automatic-speech-recognition",
            model="openai/whisper-small",
            torch_dtype=torch.float16,
            device=0 if self.device == "cuda" else -1,
            chunk_length_s=30,
            batch_size=8
        )

    # Now let's load our embedding model (this finds similar documents)
    print("🧠 Loading sentence transformer...")
    self.embedder = SentenceTransformer('all-MiniLM-L6-v2', device=self.device)

    # Make it even faster on GPU
    if self.device == "cuda":
        self.embedder.half()

    print("🎉 All models loaded and ready to rock!")

What's happening here? We're loading two main models:

  1. Whisper - converts your speech to text
  2. Sentence Transformer - understands the meaning of text (pretty cool, right?)
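
Want to poke at these two model types on their own before wiring them into the class? Here's a quick standalone smoke test (whisper-tiny just to keep the download small; the audio path is a placeholder):

from transformers import pipeline
from sentence_transformers import SentenceTransformer

asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
embedder = SentenceTransformer("all-MiniLM-L6-v2")

print(embedder.encode("hello world").shape)  # (384,) - every text becomes a 384-dim vector
# asr("your_audio.wav") returns {"text": "..."} for a real audio file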

Step 3: Making Audio Sound Better 🎵

Before we feed audio to our model, let's clean it up a bit:

def preprocess_audio(self, audio_path):
    """Make audio sound nice for Whisper"""
    try:
        # Load audio at 16kHz (Whisper's favorite frequency)
        audio, sr = librosa.load(audio_path, sr=16000)

        # Normalize volume levels
        audio = librosa.util.normalize(audio)

        # Pre-emphasis boosts the high frequencies where speech detail lives
        # (a simple filter that sharpens speech - not true noise reduction)
        audio = librosa.effects.preemphasis(audio)

        return audio
    except Exception as e:
        print(f"😅 Audio preprocessing hiccup: {e}")
        return None

This is like adjusting the microphone settings to make sure Whisper can hear you clearly!

The Speech-to-Text Magic ✨

Here's where we actually convert your voice to text:

def transcribe_audio(self, audio_path):
    """Turn your voice into text with GPU power!"""
    try:
        # Clean up the audio first
        audio = self.preprocess_audio(audio_path)
        if audio is None:
            return "Oops! Couldn't process that audio file"

        # Clear some GPU memory (being polite!)
        if self.device == "cuda":
            torch.cuda.empty_cache()

        # The actual transcription (with speed boost!)
        with torch.cuda.amp.autocast(enabled=(self.device == "cuda")):  # Mixed precision = faster (GPU only)
            result = self.speech_to_text(
                audio,
                generate_kwargs={
                    "max_new_tokens": 128,
                    "num_beams": 2,      # Good balance of speed vs quality
                    "do_sample": False,
                    "use_cache": True
                }
            )

        # Clean up after ourselves
        if self.device == "cuda":
            torch.cuda.empty_cache()

        return result["text"].strip()

    except Exception as e:
        print(f"😬 Transcription went sideways: {e}")
        return f"Error: {e}"
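
One housekeeping item before we move on: later snippets call a clear_gpu_memory() method on our class that we haven't written yet, so let's add a minimal version alongside the other methods:

def clear_gpu_memory(self):
    """Free PyTorch's cached GPU memory between queries (no-op on CPU)."""
    if self.device == "cuda":
        torch.cuda.empty_cache()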

Step 4: Building Our Knowledge Base 📚

Now for the really cool part - teaching our system about stuff! We'll add documents and create a super-fast search index:

def add_documents_batch(self, documents, batch_size=32):
    """Add a bunch of documents and make them searchable"""
    print(f"📚 Processing {len(documents)} documents...")

    self.documents.extend(documents)

    # Process in batches to avoid memory issues. Note that we re-embed the
    # whole collection (old + new docs) so the index stays consistent -
    # simple, at the cost of re-encoding older documents on every call
    all_embeddings = []

    for i in range(0, len(self.documents), batch_size):
        batch = self.documents[i:i+batch_size]
        print(f"🔄 Processing batch {i//batch_size + 1}...")

        # Convert text to numbers (embeddings) that capture meaning
        if self.device == "cuda":
            with torch.cuda.amp.autocast():
                batch_embeddings = self.embedder.encode(
                    batch,
                    batch_size=batch_size,
                    show_progress_bar=True,
                    normalize_embeddings=True
                )
        else:
            batch_embeddings = self.embedder.encode(
                batch,
                batch_size=batch_size,
                show_progress_bar=True,
                normalize_embeddings=True
            )

        all_embeddings.append(batch_embeddings)

        # Keep things tidy
        if self.device == "cuda":
            torch.cuda.empty_cache()

    # Combine all the embeddings
    self.document_embeddings = np.vstack(all_embeddings)

    # Create a super-fast search index
    try:
        dimension = self.document_embeddings.shape[1]

        if self.device == "cuda":
            # GPU-powered search! 🚀
            res = faiss.StandardGpuResources()
            self.faiss_index = faiss.GpuIndexFlatIP(res, dimension)
            print("🚀 Using GPU-accelerated FAISS - this is gonna be fast!")
        else:
            self.faiss_index = faiss.IndexFlatIP(dimension)
            print("🔧 Using CPU FAISS - still pretty quick!")

        # Add our embeddings to the index
        self.faiss_index.add(self.document_embeddings.astype(np.float32))

    except Exception as e:
        print(f"🤷‍♂️ FAISS setup hiccup: {e}")
        print("📝 No worries, we'll use a backup method!")
        self.faiss_index = None

    print(f"✅ Added {len(documents)} documents to the knowledge base!")

What's happening here? We're converting all your documents into "embeddings" - these are like fingerprints that capture the meaning of the text. Then we build a super-fast search index so we can find relevant documents in milliseconds!
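
Curious what those fingerprints buy us? Because we pass normalize_embeddings=True, the inner product FAISS computes (that's the "IP" in IndexFlatIP) is exactly cosine similarity. A tiny standalone check:

from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")
a, b, c = embedder.encode(
    ["What is machine learning?",
     "Machine learning lets systems learn from data.",
     "I had pizza for lunch."],
    normalize_embeddings=True,
)
# With normalized vectors, the dot product IS the cosine similarity
print(f"related:   {np.dot(a, b):.2f}")  # scores noticeably higher...
print(f"unrelated: {np.dot(a, c):.2f}")  # ...than this pair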


Step 5: Web Search Integration 🌐

Sometimes we need fresh info from the internet. Let's add that capability:

def web_search(self, query, num_results=5):
    """Grab some fresh info from the web"""
    try:
        print(f"🌐 Searching the web for: {query}")

        # Using DuckDuckGo's Instant Answer API (free and doesn't track you!)
        # Note: it returns encyclopedic summaries, not full web results.
        # Letting requests build the URL also encodes the query properly.
        response = requests.get(
            "https://api.duckduckgo.com/",
            params={"q": query, "format": "json", "no_html": 1, "skip_disambig": 1},
            timeout=5,
        )
        data = response.json()

        results = []

        # Get the main abstract if available
        if data.get('Abstract'):
            results.append({
                'title': data.get('AbstractSource', 'DuckDuckGo')[:50],
                'content': data['Abstract'][:300],  # Keep it concise
                'url': data.get('AbstractURL', ''),
                'relevance': 1.0
            })

        # Get related topics
        for topic in data.get('RelatedTopics', [])[:num_results-1]:
            if isinstance(topic, dict) and topic.get('Text'):
                results.append({
                    'title': (topic.get('FirstURL', '').split('/')[-1] or 'Related')[:50],
                    'content': topic['Text'][:300],
                    'url': topic.get('FirstURL', ''),
                    'relevance': 0.8
                })

        print(f"📊 Found {len(results)} web results")
        return results

    except Exception as e:
        print(f"🤔 Web search didn't work out: {e}")
        return [{'title': 'Search Error', 'content': f'Search failed: {e}', 'url': '', 'relevance': 0}]

Step 6: Lightning-Fast Document Search ⚡

Here's where the magic really happens - finding relevant documents super quickly:

def retrieve_documents_fast(self, query, k=5):
    """Find the most relevant documents lightning fast!"""
    if len(self.documents) == 0:
        print("📭 No documents in the knowledge base yet!")
        return []

    try:
        print(f"🔍 Searching for: {query}")

        # Clear GPU memory
        if self.device == "cuda":
            torch.cuda.empty_cache()

        # Convert the query to an embedding
        if self.device == "cuda":
            with torch.cuda.amp.autocast():
                query_embedding = self.embedder.encode([query], normalize_embeddings=True)
        else:
            query_embedding = self.embedder.encode([query], normalize_embeddings=True)

        results = []

        if self.faiss_index is not None:
            # Use our super-fast FAISS index!
            scores, indices = self.faiss_index.search(
                query_embedding.astype(np.float32),
                min(k, len(self.documents))
            )

            for i, score in zip(indices[0], scores[0]):
                if score > 0.25:  # Only keep relevant results
                    results.append({
                        'content': self.documents[i],
                        'score': float(score),
                        'index': int(i)
                    })
        else:
            # Fallback method (still pretty fast!)
            from sklearn.metrics.pairwise import cosine_similarity
            similarities = cosine_similarity(query_embedding, self.document_embeddings)[0]

            top_indices = np.argsort(similarities)[::-1][:k]

            for idx in top_indices:
                score = similarities[idx]
                if score > 0.25:
                    results.append({
                        'content': self.documents[idx],
                        'score': float(score),
                        'index': int(idx)
                    })

        print(f"📋 Found {len(results)} relevant documents")
        return results

    except Exception as e:
        print(f"😅 Document search hit a snag: {e}")
        return []

Step 7: Putting It All Together 🎭

Now let's create the main function that orchestrates everything:

def process_voice_query_optimized(self, audio_file):
    """The main event - process a voice query end-to-end!"""
    from datetime import datetime

    start_time = datetime.now()
    print("🎬 Starting the voice RAG pipeline...")

    try:
        # Step 1: Speech to Text
        print("🎤 Converting speech to text...")
        stt_start = datetime.now()
        text_query = self.transcribe_audio(audio_file)
        stt_time = (datetime.now() - stt_start).total_seconds()
        print(f"📝 Got: '{text_query}' (took {stt_time:.2f}s)")

        if text_query.startswith("Error"):
            return text_query, "", f"Transcription failed in {stt_time:.2f}s"

        # Step 2: Search for relevant info (sequential here - see the
        # parallel variant after this code block)
        print("🔍 Searching knowledge base and web...")
        search_start = datetime.now()

        # Find relevant documents
        retrieved_docs = self.retrieve_documents_fast(text_query, k=5)

        # Search the web too
        search_results = self.web_search(text_query, num_results=3)

        search_time = (datetime.now() - search_start).total_seconds()

        # Step 3: Generate a nice response
        print("💭 Crafting the perfect response...")
        response_start = datetime.now()
        response = self.generate_response_optimized(text_query, search_results, retrieved_docs)
        response_time = (datetime.now() - response_start).total_seconds()

        total_time = (datetime.now() - start_time).total_seconds()

        # Show off our performance! 
        perf_summary = f"""⚡ Performance Report:
• Speech Recognition: {stt_time:.2f}s
• Document Search: {search_time:.2f}s  
• Response Crafting: {response_time:.2f}s
• Total Time: {total_time:.2f}s
• Documents Found: {len(retrieved_docs)}
• Web Results: {len(search_results)}

🎯 That's {60/total_time:.1f} queries per minute!"""

        return text_query, response, perf_summary

    except Exception as e:
        error_time = (datetime.now() - start_time).total_seconds()
        print(f"💥 Something went wrong: {e}")
        return f"❌ System Error: {e}", "", f"Failed after {error_time:.2f}s"
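
One more note on Step 2: as written, the document search and the web search run back to back. If you want them genuinely concurrent (both are I/O- or GPU-bound, so threads work fine), here's a drop-in sketch for that part of process_voice_query_optimized:

from concurrent.futures import ThreadPoolExecutor

# Replaces the two sequential search calls inside process_voice_query_optimized
with ThreadPoolExecutor(max_workers=2) as pool:
    docs_future = pool.submit(self.retrieve_documents_fast, text_query, 5)
    web_future = pool.submit(self.web_search, text_query, 3)
    retrieved_docs = docs_future.result()  # blocks until done
    search_results = web_future.result()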

Creating Beautiful Responses ✨

Let's make our responses look really nice:

def generate_response_optimized(self, query, search_results, retrieved_docs):
    """Create a beautiful, informative response"""

    context_parts = []

    # Add our knowledge base results first (they're usually more reliable)
    if retrieved_docs:
        context_parts.append("📚 From Your Knowledge Base:")
        # Sort by relevance score
        retrieved_docs.sort(key=lambda x: x['score'], reverse=True)
        for doc in retrieved_docs[:3]:  # Top 3
            context_parts.append(f"{doc['content'][:200]}... (confidence: {doc['score']:.2f})")

    # Add web search results
    if search_results:
        context_parts.append("\n🌐 Fresh from the Web:")
        for result in search_results[:3]:
            if result['content']:
                context_parts.append(f"{result['title']}: {result['content'][:150]}...")

    context = "\n".join(context_parts)

    if not context.strip():
        return "🤷‍♂️ Hmm, I couldn't find much about that. Try asking something else or check if your knowledge base has relevant info!"

    # Put together a nice response
    response = f"""🎯 **You asked**: {query}

{context}

💡 **In a nutshell**: The information above covers the key aspects of your question. The knowledge base results are typically most reliable, while web results give you the latest info!"""

    return response

Step 8: Let's See It in Action! 🎮

Time to create our user interface and actually use this thing:

# Initialize our system
print("🚀 Starting up the Voice RAG system...")
voice_rag = VoiceRAGT4()

# Add some sample documents to get started
print("📚 Adding some AI knowledge to get started...")
ai_docs = [
    "Artificial Intelligence (AI) is the simulation of human intelligence processes by machines, especially computer systems.",
    "Machine Learning is a subset of AI that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.",
    "Deep Learning is a machine learning technique that teaches computers to learn by example, using neural networks with many layers.",
    "Natural Language Processing (NLP) helps computers understand, interpret and generate human language in a valuable way.",
    "Computer Vision enables machines to identify and analyze visual content in images and videos.",
    "Neural networks are computing systems inspired by biological neural networks that constitute animal brains.",
    "Large Language Models (LLMs) are AI models trained on vast amounts of text data to understand and generate human language.",
    "Transformer architecture is the foundation of modern language models, using attention mechanisms to process sequences.",
    "GPU acceleration significantly speeds up AI model training and inference through parallel processing capabilities.",
    "Fine-tuning allows pre-trained models to be adapted for specific tasks with smaller, domain-specific datasets."
]

voice_rag.add_documents_batch(ai_docs, batch_size=16)
print("✅ Knowledge base is ready!")

The User Interface 🎨

Now let's create a nice interface with Gradio:

def process_audio_interface(audio):
    """User-friendly wrapper for our voice processing"""
    if audio is None:
        return "Please record or upload an audio file! 🎤", "", "No audio provided"

    print("🎵 Processing your audio...")
    result = voice_rag.process_voice_query_optimized(audio)

    # Keep things tidy
    voice_rag.clear_gpu_memory()

    return result

import gradio as gr

# Create the interface
interface = gr.Interface(
    fn=process_audio_interface,
    inputs=gr.Audio(
        type="filepath",
        label="🎤 Record Your Question or Upload Audio",
        sources=["microphone", "upload"]
    ),
    outputs=[
        gr.Textbox(label="📝 What You Said", lines=3, max_lines=5),
        gr.Textbox(label="🤖 AI Response", lines=12, max_lines=20),
        gr.Textbox(label="⚡ Performance Stats", lines=8, max_lines=10)
    ],
    title="🎙️ Voice RAG System - Ask Me Anything!",
    description="""
    **Hey there! 👋** 

    This is your personal voice-powered AI assistant! Just record your voice or upload an audio file, 
    and I'll transcribe what you said, search through the knowledge base, grab fresh info from the web, 
    and give you a comprehensive answer.

    **Try asking about**:
    • Artificial Intelligence and Machine Learning
    • Technology concepts
    • General knowledge questions
    • Current events (I'll search the web!)

    **Pro tip**: Speak clearly and ask specific questions for the best results! 🎯
    """,
    theme=gr.themes.Soft(),
    allow_flagging="never"
)

print("🎉 Interface ready! Click the link to start chatting with your AI!")
interface.launch(share=True, debug=True)

Want to Test How Fast It Is? 🏃‍♂️

Let's add a fun benchmark to see how speedy our system really is:

from datetime import datetime

def benchmark_system():
    """Let's see how fast this baby can go!"""
    test_queries = [
        "What is machine learning?",
        "How does deep learning work?", 
        "Explain artificial intelligence to me",
        "What are neural networks?",
        "How do transformers work in AI?"
    ]

    print("🏁 Starting the speed test!")
    total_times = []

    for i, query in enumerate(test_queries):
        print(f"\n🧪 Test {i+1}/5: '{query}'")
        start = datetime.now()

        # Run our pipeline (without audio since we're just testing speed)
        retrieved = voice_rag.retrieve_documents_fast(query, k=3)
        search_results = voice_rag.web_search(query, num_results=3)
        response = voice_rag.generate_response_optimized(query, search_results, retrieved)

        elapsed = (datetime.now() - start).total_seconds()
        total_times.append(elapsed)
        print(f"⏱️ Done in {elapsed:.2f} seconds!")

        voice_rag.clear_gpu_memory()  # Keep things clean

    avg_time = np.mean(total_times)
    print(f"\n📊 Speed Test Results:")
    print(f"🚀 Average query time: {avg_time:.2f} seconds")
    print(f"💨 Can handle {60/avg_time:.1f} queries per minute") 
    print(f"🎯 That's pretty darn fast for a full RAG system!")

# Run the benchmark!
benchmark_system()

Making It Your Own 🎨

Adding Your Own Documents

Want to teach your system about specific topics? Here's how:

def add_my_documents():
    """Add your own knowledge to the system"""

    # Replace these with your own content!
    my_docs = [
        "Your company's product information goes here",
        "Domain-specific knowledge for your field",
        "FAQ answers for common questions",
        "Technical documentation snippets",
        # Add as many as you want!
    ]

    # Only add if you've actually added content
    if "Your company's product information" not in my_docs[0]:
        voice_rag.add_documents_batch(my_docs, batch_size=16)
        print(f"🎉 Added {len(my_docs)} of your documents!")
    else:
        print("💡 Edit the my_docs list above to add your own content!")

# Uncomment this line when you've added your documents
# add_my_documents()

Loading Documents from Files

Want to load documents from PDFs or text files? Here's a helper:

def load_documents_from_files(file_paths):
    """Load documents from various file types"""
    documents = []

    for file_path in file_paths:
        try:
            if file_path.endswith('.txt'):
                with open(file_path, 'r', encoding='utf-8') as file:
                    content = file.read()
                    # Split into chunks so they're not too long
                    chunks = [content[i:i+500] for i in range(0, len(content), 400)]
                    documents.extend(chunks)

            elif file_path.endswith('.pdf'):
                # You'd need to install PyPDF2: !pip install PyPDF2
                import PyPDF2
                with open(file_path, 'rb') as file:
                    reader = PyPDF2.PdfReader(file)
                    text = ""
                    for page in reader.pages:
                        text += page.extract_text()
                    chunks = [text[i:i+500] for i in range(0, len(text), 400)]
                    documents.extend(chunks)

            print(f"📄 Loaded {file_path}")

        except Exception as e:
            print(f"😅 Couldn't load {file_path}: {e}")

    return documents

# Example usage:
# my_files = ["document1.txt", "manual.pdf", "faq.txt"]
# docs = load_documents_from_files(my_files)
# voice_rag.add_documents_batch(docs)

When Things Don't Go As Planned 🤔

Here are some common hiccups and how to fix them:

"Out of Memory" Errors

# If you get GPU memory errors, try these:

# 1. Reduce batch size
voice_rag.add_documents_batch(documents, batch_size=8)  # Instead of 32

# 2. Clear memory more often
torch.cuda.empty_cache()

# 3. Use CPU fallback
voice_rag.device = "cpu"  # Slower but uses less memory

Audio Problems

# If audio processing fails:

def fix_audio_issues(audio_path):
    """Sometimes audio files need extra help"""
    try:
        # Try loading with different settings
        audio, sr = librosa.load(audio_path, sr=None)

        # Convert to 16kHz if needed
        if sr != 16000:
            audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)

        return audio
    except Exception as e:
        print(f"🎵 Audio trouble: {e}")
        return None

Model Loading Issues

# If models won't load:

def setup_simple_models(self):
    """Simpler model setup as fallback"""
    print("🔧 Using simpler model configuration...")

    # Use basic Whisper without fancy optimizations
    self.speech_to_text = pipeline(
        "automatic-speech-recognition", 
        "openai/whisper-tiny",  # Smallest, fastest model
        device=0 if torch.cuda.is_available() else -1
    )

    # Basic embeddings
    self.embedder = SentenceTransformer('all-MiniLM-L6-v2')

    print("✅ Basic setup complete!")

What's Next? 🚀

Congratulations! You've built a pretty awesome voice RAG system. Here are some fun ideas to make it even better:

1. Real-time Processing

Make it work with live microphone input so you can just talk to it continuously.

2. Multi-language Support

Add support for different languages by using multilingual Whisper and embedding models.

3. Better Document Processing

Add support for more file types (Word docs, PowerPoints, etc.) and better text chunking.

4. Conversation Memory

Make it remember what you talked about earlier in the conversation.

5. Custom Response Styles

Train it to respond in different styles (formal, casual, technical, etc.).


Wrapping Up 🎁

We've just built something pretty amazing! Your Voice RAG system can:

  • ✅ Understand your speech
  • ✅ Search through documents lightning-fast
  • ✅ Grab fresh info from the web
  • ✅ Give you intelligent, contextual answers
  • ✅ Do it all really, really fast thanks to GPU optimization

The best part? This is just the beginning. You can customize it, add your own documents, integrate it into other systems, or just have fun asking it questions!

Remember: The more good documents we feed it, the smarter it gets. So start adding content that's relevant to what you want to ask about.

Now go forth and build something awesome! 🎉


P.S. - If you build something cool with this, I'd love to hear about it! And if you run into any weird issues, don't panic - that's just part of the fun of building AI systems. Happy coding!


Bonus Round: Cool Tricks and Advanced Features 🎪

Since you've made it this far, let me share some extra goodies that'll make your Voice RAG system even more impressive!

Memory Trick: Making It Remember Your Conversations 🧠

Want the system to remember what you talked about? Here's a simple way to add conversation memory:

from datetime import datetime

class ConversationalVoiceRAG(VoiceRAGT4):
    def __init__(self):
        super().__init__()
        self.conversation_history = []  # Remember everything!
        self.max_history = 10  # Don't remember TOO much

    def process_with_memory(self, audio_file):
        """Process voice with conversation context"""
        # Get the current query
        text_query = self.transcribe_audio(audio_file)

        # Build context from conversation history
        context_query = self.build_contextual_query(text_query)

        # Process normally but with context
        retrieved_docs = self.retrieve_documents_fast(context_query, k=5)
        search_results = self.web_search(context_query, num_results=3)
        response = self.generate_response_optimized(context_query, search_results, retrieved_docs)

        # Remember this conversation
        self.conversation_history.append({
            'user': text_query,
            'assistant': response,
            'timestamp': datetime.now()
        })

        # Don't let memory get too long
        if len(self.conversation_history) > self.max_history:
            self.conversation_history.pop(0)

        return text_query, response

    def build_contextual_query(self, current_query):
        """Add conversation context to the query"""
        if not self.conversation_history:
            return current_query

        # Get the last few exchanges for context
        recent_context = self.conversation_history[-3:]  # Last 3 exchanges

        context_parts = []
        for exchange in recent_context:
            context_parts.append(f"Previously discussed: {exchange['user']}")

        contextual_query = f"""
        Current question: {current_query}

        Conversation context:
        {chr(10).join(context_parts)}

        Please answer considering this conversation history.
        """

        return contextual_query
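
Using it looks just like before, except follow-up questions now carry context:

# Usage:
# conv_rag = ConversationalVoiceRAG()
# text, answer = conv_rag.process_with_memory("question.wav")
# text, answer = conv_rag.process_with_memory("follow_up.wav")  # sees the earlier exchange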

Multi-Language Magic 🌍

Want to understand different languages? Here's how to make it multilingual:

def setup_multilingual_models(self):
    """Support multiple languages like a boss!"""
    print("🌍 Setting up multilingual support...")

    # Use multilingual Whisper
    self.speech_to_text = pipeline(
        "automatic-speech-recognition",
        "openai/whisper-large",  # Supports 99 languages!
        torch_dtype=torch.float16,
        device=0 if torch.cuda.is_available() else -1,
        return_timestamps=True
    )

    # Multilingual embeddings
    self.embedder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

    print("✅ Now I can understand many languages!")

def detect_language(self, text):
    """Figure out what language someone is speaking"""
    # Simple language detection (needs: !pip install langdetect)
    try:
        from langdetect import detect
        language = detect(text)
        print(f"🗣️ Detected language: {language}")
        return language
    except Exception:
        return "unknown"

def respond_in_language(self, response, target_language):
    """Respond in the same language as the user"""
    if target_language == "en":
        return response

    # You could integrate with translation APIs here
    print(f"💬 Would translate response to: {target_language}")
    return response + f"\n\n(Response in {target_language} would go here)"

Real-Time Voice Processing 🎙️

Want to make it work with live audio? Here's a simple real-time version:

# PyAudio isn't preinstalled on Colab - you'll need:
# !apt-get install -y portaudio19-dev && pip install pyaudio
import pyaudio
import threading
import queue
import time
import numpy as np

class RealTimeVoiceRAG(VoiceRAGT4):
    def __init__(self):
        super().__init__()
        self.audio_queue = queue.Queue()
        self.is_listening = False
        self.audio_buffer = []

    def start_listening(self):
        """Start listening for voice input"""
        print("🎤 Starting real-time listening...")

        self.is_listening = True

        # Start audio capture thread
        audio_thread = threading.Thread(target=self.capture_audio, daemon=True)
        audio_thread.start()

        # Start processing thread
        process_thread = threading.Thread(target=self.process_audio_stream, daemon=True)
        process_thread.start()

        print("👂 I'm listening! Say something...")

    def capture_audio(self):
        """Capture audio from microphone"""
        try:
            p = pyaudio.PyAudio()
            stream = p.open(
                format=pyaudio.paFloat32,
                channels=1,
                rate=16000,
                input=True,
                frames_per_buffer=1024
            )

            while self.is_listening:
                data = stream.read(1024, exception_on_overflow=False)
                self.audio_queue.put(data)

            stream.stop_stream()
            stream.close()
            p.terminate()

        except Exception as e:
            print(f"🎵 Audio capture error: {e}")

    def process_audio_stream(self):
        """Process audio in real-time"""
        while self.is_listening:
            try:
                # Collect audio for 3 seconds
                audio_chunk = []
                for _ in range(48):  # ~3 seconds at 16kHz
                    if not self.audio_queue.empty():
                        audio_chunk.append(self.audio_queue.get())

                if audio_chunk:
                    # Convert to numpy array
                    audio_data = np.frombuffer(b''.join(audio_chunk), dtype=np.float32)

                    # Simple voice activity detection
                    if np.max(np.abs(audio_data)) > 0.01:  # Adjust threshold as needed
                        print("🗣️ Voice detected, processing...")
                        # Process the audio chunk
                        # (You'd save this to a temp file and process it)

                time.sleep(0.1)  # Small delay

            except Exception as e:
                print(f"🤔 Processing error: {e}")

    def stop_listening(self):
        """Stop real-time processing"""
        self.is_listening = False
        print("🛑 Stopped listening")

# Usage:
# real_time_rag = RealTimeVoiceRAG()
# real_time_rag.start_listening()
# # Let it run for a while...
# real_time_rag.stop_listening()

Smart Document Chunking 📄

Here's a smarter way to split your documents that preserves meaning:

def smart_chunk_documents(self, text, chunk_size=500, overlap=50):
    """Split text intelligently, keeping related sentences together.
    chunk_size is in characters; overlap is the number of words carried forward."""
    import re

    # Split into sentences first
    sentences = re.split(r'[.!?]+', text)

    chunks = []
    current_chunk = ""

    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue

        # If adding this sentence would exceed chunk size
        if len(current_chunk) + len(sentence) > chunk_size:
            if current_chunk:
                chunks.append(current_chunk.strip())

                # Start new chunk with overlap
                words = current_chunk.split()
                overlap_text = " ".join(words[-overlap:]) if len(words) > overlap else current_chunk
                current_chunk = overlap_text + " " + sentence
            else:
                current_chunk = sentence
        else:
            current_chunk += " " + sentence if current_chunk else sentence

    # Don't forget the last chunk!
    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks

def load_and_chunk_file(self, file_path):
    """Load a file and chunk it smartly"""
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()

        chunks = self.smart_chunk_documents(content, chunk_size=400, overlap=30)
        print(f"📄 Split {file_path} into {len(chunks)} smart chunks")

        return chunks
    except Exception as e:
        print(f"😅 Couldn't process {file_path}: {e}")
        return []

Performance Dashboard 📊

Want to see detailed performance metrics? Here's a cool dashboard:

import matplotlib.pyplot as plt
import numpy as np
from collections import defaultdict
import time

class PerformanceTracker:
    def __init__(self):
        self.metrics = defaultdict(list)
        self.query_history = []

    def track_query(self, query, transcription_time, search_time, response_time, total_docs, web_results):
        """Track performance for each query"""
        total_time = transcription_time + search_time + response_time

        self.metrics['transcription_times'].append(transcription_time)
        self.metrics['search_times'].append(search_time)
        self.metrics['response_times'].append(response_time)
        self.metrics['total_times'].append(total_time)
        self.metrics['docs_found'].append(total_docs)
        self.metrics['web_results'].append(web_results)

        self.query_history.append({
            'query': query,
            'total_time': total_time,
            'timestamp': time.time()
        })

    def show_performance_dashboard(self):
        """Create a cool performance visualization"""
        if not self.metrics['total_times']:
            print("📊 No performance data yet!")
            return

        fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(12, 8))

        # Response time distribution
        ax1.hist(self.metrics['total_times'], bins=20, alpha=0.7, color='skyblue')
        ax1.set_title('🚀 Response Time Distribution')
        ax1.set_xlabel('Time (seconds)')
        ax1.set_ylabel('Frequency')

        # Performance over time
        ax2.plot(self.metrics['total_times'], marker='o', color='orange')
        ax2.set_title('⏱️ Performance Over Time')
        ax2.set_xlabel('Query Number')
        ax2.set_ylabel('Time (seconds)')

        # Average time spent in each pipeline component
        ax3.bar(['Transcription', 'Search', 'Response'], 
                [np.mean(self.metrics['transcription_times']),
                 np.mean(self.metrics['search_times']),
                 np.mean(self.metrics['response_times'])], 
                color=['lightcoral', 'lightgreen', 'lightblue'])
        ax3.set_title('⚡ Average Time by Component')
        ax3.set_ylabel('Time (seconds)')

        # Success rate
        success_rate = len([t for t in self.metrics['total_times'] if t < 5]) / len(self.metrics['total_times']) * 100
        ax4.pie([success_rate, 100-success_rate], labels=['Fast (<5s)', 'Slow (>5s)'], 
                colors=['lightgreen', 'lightcoral'], autopct='%1.1f%%')
        ax4.set_title('🎯 Speed Success Rate')

        plt.tight_layout()
        plt.show()

        # Print summary stats
        print(f"""
📈 Performance Summary:
• Average total time: {np.mean(self.metrics['total_times']):.2f}s
• Fastest query: {min(self.metrics['total_times']):.2f}s
• Slowest query: {max(self.metrics['total_times']):.2f}s
• Success rate (<5s): {success_rate:.1f}%
• Total queries processed: {len(self.metrics['total_times'])}
        """)

# Add to your VoiceRAG class:
def __init__(self):
    # ... existing init code ...
    self.performance_tracker = PerformanceTracker()

# In your process_voice_query_optimized method, add:
# self.performance_tracker.track_query(text_query, stt_time, search_time, response_time, len(retrieved_docs), len(search_results))

# Then you can view stats:
# voice_rag.performance_tracker.show_performance_dashboard()

Adding Personality 🤖

Want to give your AI some personality? Here's how:

def generate_response_with_personality(self, query, search_results, retrieved_docs, personality="helpful"):
    """Generate responses with different personalities"""

    personalities = {
        "helpful": {
            "greeting": "🤔 Let me help you with that!",
            "tone": "friendly and informative",
            "emoji": ""
        },
        "enthusiastic": {
            "greeting": "🎉 Oh, that's a GREAT question!",
            "tone": "excited and energetic", 
            "emoji": "🚀"
        },
        "scholarly": {
            "greeting": "📚 An interesting inquiry indeed.",
            "tone": "academic and thorough",
            "emoji": "🎓"
        },
        "casual": {
            "greeting": "👋 Hey! So you want to know about",
            "tone": "relaxed and conversational",
            "emoji": "😊"
        }
    }

    style = personalities.get(personality, personalities["helpful"])

    # Build context as before...
    context_parts = []
    if retrieved_docs:
        context_parts.append("📚 From what I know:")
        for doc in retrieved_docs[:3]:
            context_parts.append(f"{doc['content'][:200]}...")

    if search_results:
        context_parts.append("\n🌐 Fresh from the web:")
        for result in search_results[:3]:
            if result['content']:
                context_parts.append(f"{result['content'][:150]}...")

    context = "\n".join(context_parts)

    if personality == "enthusiastic":
        response = f"""{style['greeting']} {query}

{context}

{style['emoji']} This is SO cool - there's tons of great info about this topic! Hope this helps fuel your curiosity!"""

    elif personality == "scholarly":
        response = f"""{style['greeting']}

Based on my analysis of the available sources:

{context}

{style['emoji']} In conclusion, the evidence suggests these are the key considerations regarding your inquiry."""

    elif personality == "casual":
        response = f"""{style['greeting']} {query.lower()}? 

Here's the deal:

{context}

{style['emoji']} Hope that clears things up! Let me know if you want me to dig deeper into any part of this."""

    else:  # helpful (default)
        response = f"""{style['greeting']}

{context}

{style['emoji']} I hope this information helps answer your question! Feel free to ask if you need clarification on anything."""

    return response

# Usage:
# response = voice_rag.generate_response_with_personality(query, search_results, retrieved_docs, personality="enthusiastic")

Web Interface Upgrade 🌐

Want a fancier web interface? Here's an enhanced Gradio setup:

def create_advanced_interface():
    """Create a more sophisticated interface"""

    with gr.Blocks(title="🎙️ Voice RAG Pro", theme=gr.themes.Soft()) as interface:
        gr.Markdown("# 🎙️ Voice RAG System Pro")
        gr.Markdown("Ask me anything using your voice! I'll search my knowledge base and the web to give you comprehensive answers.")

        with gr.Row():
            with gr.Column(scale=2):
                audio_input = gr.Audio(
                    label="🎤 Your Question",
                    sources=["microphone", "upload"],
                    type="filepath"
                )

                # Settings panel
                with gr.Accordion("⚙️ Settings", open=False):
                    personality = gr.Dropdown(
                        choices=["helpful", "enthusiastic", "scholarly", "casual"],
                        value="helpful",
                        label="🤖 AI Personality"
                    )

                    search_web = gr.Checkbox(
                        value=True,
                        label="🌐 Search Web"
                    )

                    max_docs = gr.Slider(
                        minimum=1,
                        maximum=10,
                        value=5,
                        step=1,
                        label="📚 Max Documents to Retrieve"
                    )

                submit_btn = gr.Button("🚀 Process Voice", variant="primary")

            with gr.Column(scale=3):
                transcription_output = gr.Textbox(
                    label="📝 What You Said",
                    lines=3,
                    max_lines=5
                )

                response_output = gr.Textbox(
                    label="🤖 AI Response",
                    lines=15,
                    max_lines=25
                )

                with gr.Accordion("📊 Performance & Debug", open=False):
                    performance_output = gr.Textbox(
                        label="⚡ Performance Metrics",
                        lines=8
                    )

        # Suggested questions (gr.Examples would need audio file paths for an
        # Audio input, so we just list prompts to read aloud instead)
        gr.Markdown("""### 🎯 Try Asking These Out Loud:
- What is machine learning?
- How do neural networks work?
- Explain artificial intelligence
- What's the latest in AI research?""")

        def process_with_settings(audio, personality, search_web, max_docs):
            """Process audio with custom settings"""
            if audio is None:
                return "Please record or upload audio!", "", ""

            # Your existing processing code here, but with the settings
            # This is where you'd modify the pipeline based on user preferences

            result = voice_rag.process_voice_query_optimized(audio)
            return result

        submit_btn.click(
            process_with_settings,
            inputs=[audio_input, personality, search_web, max_docs],
            outputs=[transcription_output, response_output, performance_output]
        )

        # Auto-submit when audio is uploaded
        audio_input.change(
            process_with_settings,
            inputs=[audio_input, personality, search_web, max_docs],
            outputs=[transcription_output, response_output, performance_output]
        )

    return interface

# Launch the advanced interface
# advanced_interface = create_advanced_interface()
# advanced_interface.launch(share=True, debug=True)

Final Pro Tips 🎯

Here are some insider secrets to make your system even better:

  1. Batch Everything: Always process multiple items together when possible - it's way more efficient!

  2. Cache Smart: Save frequently used embeddings and search results to avoid recomputing (see the sketch after this list).

  3. Monitor GPU Memory: Keep an eye on torch.cuda.memory_allocated() - clear cache when it gets too high.

  4. Use Async: For web searches, use asyncio to make multiple requests simultaneously.

  5. Quality Over Quantity: Better to have 100 high-quality documents than 1000 poor ones.

  6. Test with Real Users: Your system might work perfectly for you but confuse others - test it!

  7. Keep Learning: The AI field moves fast - stay updated with new models and techniques.
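
Here's what tip #2 can look like in practice - a tiny cache for query embeddings (a sketch that assumes queries repeat often enough to matter; lru_cache keys on the raw query string):

from functools import lru_cache

@lru_cache(maxsize=256)
def cached_query_embedding(query: str):
    # Re-encoding an identical query is pure waste - cache the embedding instead
    return voice_rag.embedder.encode([query], normalize_embeddings=True)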

Full script available here: https://github.com/AkanimohOD19A/Voice-RAG-v1

The End... Or Is It? 🎬

You've now got a seriously impressive Voice RAG system that can:

  • 🎤 Understand speech in multiple languages
  • 🧠 Remember conversations
  • ⚡ Process queries lightning-fast
  • 📊 Track its own performance
  • 🤖 Have different personalities
  • 🌐 Search the web intelligently
  • 📄 Handle complex documents

But here's the thing - this is really just the beginning! Every day, new models come out, new techniques are discovered, and new possibilities emerge.

The system you've built is a solid foundation that we can keep improving and adapting. Maybe next you'll add video understanding, or connect it to a robot, or make it control your smart home. The sky's the limit!

Remember: The best AI systems aren't just technically impressive - they're actually useful and fun to interact with. We have to keep that in mind as we continue building.

Now go forth and create something amazing! And most importantly... have fun with it! 🎉

Happy building, and may your GPU never run out of memory! 🚀
