Angad Singh

My Journey with Pipecat + Attendee (And Hacking the Infrastructure)

The journey of building production-ready voice agents with Pipecat + Attendee, and the infrastructure changes I had to implement.

The Problem I Was Trying to Solve

Building a voice agent that could actually join Zoom/Google Meet meetings and have conversations.

The challenge?

I needed to integrate Pipecat (a voice AI framework) with Attendee (a meeting bot infrastructure), but there was a fundamental mismatch in how they handled audio data.


The Original Problem: Mixed Audio vs. Speaker Identification

When I first started, Attendee was only sending mixed audio packets - basically one big audio stream with everyone's voices combined. This was a nightmare for voice AI because:

  1. No speaker identification - I couldn't tell who was talking
  2. Poor transcription quality - Mixed audio is harder to transcribe accurately
  3. Context loss - The AI couldn't respond appropriately without knowing the speaker

I had two options:

  • Per-participant audio packets - Individual audio streams per person
  • Transcription frames - Pre-processed transcription with speaker info

I chose transcription frames because they're more efficient and give me exactly what I need: clean text with speaker identification.


The Infrastructure Changes I Made

1. Added Transcription Frame Support to Attendee

I had to modify the Attendee infrastructure locally to support transcription frames. Here's what I added:

# In bot_controller.py - New method to send transcription frames
def send_transcription_to_pipecat(self, speaker_id: str, speaker_name: str, text: str, is_final: bool, timestamp_ms: int, duration_ms: int):
    """Send transcription frame with speaker information to Pipecat via WebSocket"""
    if not self.websocket_audio_client:
        return

    if not self.websocket_audio_client.started():
        logger.info("Starting websocket audio client for transcription...")
        self.websocket_audio_client.start()

    payload = transcription_websocket_payload(
        speaker_id=speaker_id,
        speaker_name=speaker_name,
        text=text,
        is_final=is_final,
        timestamp_ms=timestamp_ms,
        duration_ms=duration_ms,
        bot_object_id=self.bot_in_db.object_id,
    )

    self.websocket_audio_client.send_async(payload)
    logger.info(f"Sent transcription to Pipecat: [{speaker_name}]: {text}")

# Added logic to disable mixed audio when transcription frames are enabled
def disable_mixed_audio_packets(self):
    """Check if mixed audio packets should be disabled in favor of transcription frames"""
    audio_settings = self.bot_in_db.settings.get("audio_settings", {})
    return audio_settings.get("disable_mixed_audio_packets", True)

def add_mixed_audio_chunk_callback(self, chunk: bytes):
    # Skip WebSocket transmission if mixed audio packets are disabled
    if self.disable_mixed_audio_packets():
        logger.debug("Mixed audio packets disabled, skipping WebSocket transmission")
        return
    # ... rest of the method
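
For context, here's roughly how that method gets called. This is just a sketch - the on_caption_update handler and the shape of the utterance dict are my own stand-ins for wherever Attendee surfaces a finished caption; only send_transcription_to_pipecat() comes from the code above.

# Hypothetical caller - Attendee's actual caption hook will look different
def on_caption_update(self, utterance: dict):
    if not utterance.get("text"):
        return

    self.send_transcription_to_pipecat(
        speaker_id=utterance["speaker_id"],
        speaker_name=utterance["speaker_name"],
        text=utterance["text"],
        is_final=utterance.get("is_final", True),
        timestamp_ms=utterance.get("timestamp_ms", 0),
        duration_ms=utterance.get("duration_ms", 0),
    )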

2. Created New WebSocket Payload Format

I had to create a new payload format for transcription frames:

# In websocket_payloads.py - New transcription payload
def transcription_websocket_payload(
    speaker_id: str,
    speaker_name: str,
    text: str,
    is_final: bool,
    timestamp_ms: int,
    duration_ms: int,
    bot_object_id: str
) -> dict:
    """Package transcription data with speaker information for websocket transmission."""
    return {
        "trigger": RealtimeTriggerTypes.type_to_api_code(RealtimeTriggerTypes.TRANSCRIPTION_FRAME),
        "bot_id": bot_object_id,
        "data": {
            "speaker_id": speaker_id,
            "speaker_name": speaker_name,
            "text": text,
            "is_final": is_final,
            "timestamp_ms": timestamp_ms,
            "duration_ms": duration_ms,
        },
    }
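
For reference, here's what one of these messages looks like by the time it reaches Pipecat over the WebSocket. The values are made up, but the trigger string is the one the serializer checks for later in this post.

# Example JSON message as seen on the Pipecat side (illustrative values)
{
    "trigger": "realtime_audio.transcription",
    "bot_id": "bot_abc123",
    "data": {
        "speaker_id": "participant_123",
        "speaker_name": "John Doe",
        "text": "Hello, this is a test transcription.",
        "is_final": true,
        "timestamp_ms": 1703123456789,
        "duration_ms": 1200
    }
}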

3. Updated Bot Configuration

I added a new setting to control this behavior:

# In serializers.py - New audio setting
"disable_mixed_audio_packets": {
    "type": "boolean",
    "description": "Whether to disable mixed audio packets in favor of transcription frames. When True, the bot will use per-participant transcription frames instead of mixed audio for better speaker identification.",
    "default": True
}
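
In practice, turning this on is just a matter of including the flag in the bot's settings. The fragment below is simplified - the exact request shape depends on how you create the bot - but the audio_settings key is what the disable_mixed_audio_packets() check from earlier reads.

# Illustrative settings fragment, read via self.bot_in_db.settings
{
    "audio_settings": {
        "disable_mixed_audio_packets": True
    }
}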

Why I Chose Transcription Frames Over Per-Participant Audio

This was a key decision. Here's why I went with transcription frames:

Per-Participant Audio Problems:

  • Higher bandwidth - Multiple audio streams instead of one
  • More complex processing - Need to handle multiple audio inputs
  • Latency issues - More data to process and transmit
  • Still need transcription - Would have to do STT on each stream

Transcription Frames Benefits:

  • Lower bandwidth - Just text data instead of audio
  • Speaker identification built-in - Attendee already knows who's talking
  • Better accuracy - Attendee's transcription is optimized for meetings
  • Simpler processing - Just handle text, not audio

The transcription frames approach was cleaner and more efficient for my use case.
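
To put rough numbers on the bandwidth point, here's a quick back-of-envelope comparison, assuming 16 kHz / 16-bit mono PCM (a common format for meeting audio) and five participants:

# Back-of-envelope bandwidth comparison
# (assumptions: 16 kHz, 16-bit mono PCM, 5 participants)
participants = 5
audio_bytes_per_sec = 16_000 * 2                            # ~32 KB/s per audio stream
per_participant_total = participants * audio_bytes_per_sec  # ~160 KB/s, continuously

# A finalized transcription frame is a few hundred bytes of JSON,
# and only goes out when someone actually finishes an utterance
transcription_frame_bytes = 300

print(f"Per-participant audio: ~{per_participant_total / 1000:.0f} KB/s, all the time")
print(f"Transcription frames:  ~{transcription_frame_bytes} bytes per utterance")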


Building the Pipecat Integration

1. Custom Frame Serializer

I had to create a custom serializer to handle the new transcription format:

# Imports per the Pipecat version I was using
import json

from pipecat.frames.frames import Frame, TranscriptionFrame
from pipecat.serializers.base_serializer import FrameSerializer


class AttendeeFrameSerializer(FrameSerializer):
    async def deserialize(self, data: str | bytes) -> Frame | None:
        json_data = json.loads(data)

        # Handle the new transcription frames from my modified Attendee
        if json_data.get("trigger") == "realtime_audio.transcription":
            transcription_data = json_data["data"]
            speaker_id = transcription_data.get("speaker_id", "unknown")
            speaker_name = transcription_data.get("speaker_name", "Unknown")
            text = transcription_data.get("text", "")
            timestamp_ms = transcription_data.get("timestamp_ms", 0)

            # Create transcription frame with speaker info
            transcription_frame = TranscriptionFrame(
                text=text,
                user_id=speaker_id,
                timestamp=timestamp_ms
            )
            transcription_frame.speaker_name = speaker_name
            return transcription_frame

        # Fallback to per-participant audio if needed
        elif json_data.get("trigger") == "realtime_audio.per_participant":
            # ... handle per-participant audio
            ...

        return None
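
To actually use the serializer, it gets passed into the Pipecat WebSocket transport that Attendee connects to. Module paths and parameter names shift between Pipecat versions, so treat this as an approximate sketch rather than exact wiring:

# Sketch: plugging the custom serializer into Pipecat's WebSocket server transport
from pipecat.transports.network.websocket_server import (
    WebsocketServerParams,
    WebsocketServerTransport,
)

transport = WebsocketServerTransport(
    params=WebsocketServerParams(
        audio_out_enabled=True,                 # the agent still sends TTS audio back
        serializer=AttendeeFrameSerializer(),   # parses Attendee's transcription messages
    )
)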

2. Custom Transcription Processor

I built a processor to add speaker context to the LLM:

class TranscriptionFrameProcessor(FrameProcessor):
    async def process_frame(self, frame, direction):
        # Let the base class handle system frames (StartFrame, interruptions, etc.)
        await super().process_frame(frame, direction)

        if isinstance(frame, TranscriptionFrame):
            speaker_name = getattr(frame, 'speaker_name', 'Unknown Speaker')

            # Add speaker context for the LLM, dropping empty transcriptions
            if frame.text and frame.text.strip():
                original_text = frame.text
                frame.text = f"[{speaker_name}]: {original_text}"

                logger.info(f"🎤 Processing transcription from {speaker_name}: {frame.text}")
                await self.push_frame(frame, direction)
        else:
            # Pass all other frames through so the rest of the pipeline keeps working
            await self.push_frame(frame, direction)

3. Optimized Pipeline

I fine-tuned the pipeline for natural conversation:

# Optimized VAD parameters for responsive detection
vad_params = VADParams(
    confidence=0.4,      # Lower confidence for more responsive detection
    start_secs=0.15,     # Faster response time
    stop_secs=0.25,      # Shorter stop time for natural flow
    min_volume=0.25,     # Lower volume threshold
)

# Use transcription processor instead of STT (transcription comes from Attendee)
transcription_processor = TranscriptionFrameProcessor()
logger.info("Using Attendee transcription with speaker diarization instead of local STT")
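
Putting it together, the pipeline roughly looks like this. Names like llm, tts and context_aggregator stand in for whichever services you configure - the point is that the transcription processor takes the slot a local STT service would normally occupy:

# Schematic pipeline layout (llm, tts, context_aggregator are placeholders)
from pipecat.pipeline.pipeline import Pipeline

pipeline = Pipeline([
    transport.input(),          # transcription frames arriving from Attendee
    transcription_processor,    # adds the "[Speaker]: " prefix
    context_aggregator.user(),
    llm,                        # generates the response
    tts,                        # converts it to speech
    transport.output(),         # audio back into the meeting via Attendee
    context_aggregator.assistant(),
])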

The Result: A Working Voice Agent

After all this work, I ended up with a voice agent that:

  1. Joins real meetings via Attendee's infrastructure
  2. Knows who's speaking thanks to my transcription frame modifications
  3. Responds naturally with optimized conversation flow
  4. Handles real-time audio with low latency
  5. Has a web interface for easy configuration

The Complete Flow:

  1. User configures the agent via web interface
  2. Bot joins meeting via Attendee (with my modifications)
  3. Attendee sends transcription frames (from closed captions) with speaker info
  4. Pipecat processes the transcription with speaker context
  5. LLM generates appropriate responses
  6. TTS converts to natural speech
  7. Agent responds in the meeting

What I Learned

1. Infrastructure Changes Are Sometimes Necessary

The original Attendee infrastructure wasn't designed for voice AI use cases. I had to modify it to support transcription frames, which was the right architectural decision.

2. Speaker Context Is Everything

Without knowing who's speaking, a voice agent is just a chatbot. The transcription frame approach gave me perfect speaker identification.

3. Real-Time Performance Matters

Every millisecond of latency affects conversation quality. I spent a lot of time optimizing VAD parameters and buffer sizes.

4. Testing Is Critical

I built comprehensive tests to validate the integration:

def test_complete_flow_simulation(self):
    """Test the complete flow from transcription message to processed text frame."""
    # Create transcription message as sent by my modified Attendee
    transcription_message = {
        "trigger": "realtime_audio.transcription",
        "data": {
            "speaker_id": "participant_123",
            "speaker_name": "John Doe",
            "text": "Hello, this is a test transcription.",
            "is_final": True,
            "timestamp_ms": 1703123456789,
        },
    }

    # Test the complete pipeline
    json_message = json.dumps(transcription_message)
    transcription_frame = asyncio.run(self.serializer.deserialize(json_message))

    # Verify it works end-to-end
    self.assertIsInstance(transcription_frame, TranscriptionFrame)
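
On the Attendee side, a small companion test keeps the payload format honest. It assumes transcription_websocket_payload is importable from the modified websocket_payloads.py, and that RealtimeTriggerTypes maps TRANSCRIPTION_FRAME to the "realtime_audio.transcription" string the serializer expects.

def test_transcription_payload_shape(self):
    """Verify the Attendee-side payload matches what the Pipecat serializer expects."""
    payload = transcription_websocket_payload(
        speaker_id="participant_123",
        speaker_name="John Doe",
        text="Hello, this is a test transcription.",
        is_final=True,
        timestamp_ms=1703123456789,
        duration_ms=1200,
        bot_object_id="bot_abc123",
    )

    self.assertEqual(payload["trigger"], "realtime_audio.transcription")
    self.assertEqual(payload["bot_id"], "bot_abc123")
    self.assertEqual(payload["data"]["speaker_name"], "John Doe")
    self.assertTrue(payload["data"]["is_final"])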

The Bottom Line

This is an active development project where I'm iterating, experimenting, optimizing, and adding new features. Building this voice agent isn't just about learning Pipecat - it's about understanding how to integrate complex systems and make architectural decisions that actually work.

The transcription frame approach was cleaner, more efficient, and gave me exactly what I needed for natural conversation.

This is a work in progress, and I'm always open to questions and feedback!
