Bridging Two Worlds: Custom ASL Model on EKS + Claude Facilitation via Context Engineering

Part 1 covered training an ASL recognition model on EKS. This part focuses on deploying that model for inference and designing a real-world conversation flow using Bedrock Claude with context engineering.

The Problem

A deaf person and a hearing person want to have a conversation. No interpreter available.

  • Deaf person signs → hearing person needs audio
  • Hearing person speaks → deaf person needs text

It seems straightforward - just translate between modalities. But there are deeper challenges most people miss:

  1. ASL isn't English in sign form. Grammar, word order, and expression differ fundamentally. Direct translation doesn't work.
  2. ML models aren't perfect. Our model is 65% accurate. Users need guidance when detection fails.
  3. Vocabulary is limited. 100 signs vs thousands of English words. Users need help staying within what the system understands.

This is where AI facilitation matters. Claude doesn't just translate - it understands conversation context, suggests responses within vocabulary constraints, and keeps the dialogue flowing even when the recognition model stumbles. It bridges the gap between imperfect ML and usable conversation.

The app runs in a browser. No downloads, no accounts.

Some sample demo images (discussed in more detail below):

System Architecture

┌──────────────────────────────────────────────────────────────────────────┐
│                            Browser (React)                               │
│  ┌──────────────┐  ┌──────────────┐  ┌─────────────┐  ┌──────────────┐  │
│  │   Webcam     │  │  MediaPipe   │  │  Speech     │  │ Conversation │  │
│  │   Feed       │──│  Hand Track  │  │  Recognition│  │ View + TTS   │  │
│  └──────────────┘  └──────┬───────┘  └──────┬──────┘  └──────────────┘  │
└────────────────────────────┼────────────────┼────────────────────────────┘
                             │                │
                             │ landmarks      │ text (browser API)
                             ▼                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      CloudFront + Lambda                            │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │  /v1/asl/predict  →  EKS Inference (PoseLSTM)                 │  │
│  │  /v1/suggestions  →  Bedrock Claude (context-aware prompts)  │  │
│  └──────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────┘

Flow:

  • Deaf → Hearing: Webcam → MediaPipe → EKS model → Text → Browser TTS (audio)
  • Hearing → Deaf: Microphone → Browser Speech Recognition → Text display

Deploying the Inference Endpoint

EKS Setup for Inference

Training used a g6.12xlarge (4 L4 GPUs). For inference, that's overkill: the model is ~2M parameters and runs fine on CPU.

Key decisions:

  • CPU inference - Model is ~2M params, no GPU needed
  • Minimal resources - 256Mi memory, 100m CPU request
  • FastAPI server - Async-friendly, good for ML serving

The Inference API

  • Input: 32 frames of hand landmarks (126 features per frame)
  • Output: predicted sign, confidence score, top-5 predictions
  • Latency: ~40ms per prediction
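
Calling the endpoint from the browser is a plain JSON POST. Here's a minimal sketch; the exact field names (frames, prediction, confidence, top5) are assumptions for illustration, not the service's published schema.

// Hypothetical client for /v1/asl/predict. Field names are assumed.
interface PredictResponse {
  prediction: string;                              // top-1 sign label, e.g. "HELLO"
  confidence: number;                              // 0..1
  top5: { sign: string; confidence: number }[];    // top-5 predictions
}

async function predictSign(frames: number[][]): Promise<PredictResponse> {
  // frames: 32 frames x 126 landmark features per frame
  const res = await fetch('/v1/asl/predict', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ frames }),
  });
  if (!res.ok) throw new Error(`Predict request failed: ${res.status}`);
  return res.json();
}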

Why Lambda + CloudFront (Not Direct EKS)

Lambda as a proxy instead of exposing EKS directly:

  1. Security - EKS stays in private subnet, no public exposure
  2. Caching - CloudFront caches repeated requests (same sign = same response)
  3. Cost - Lambda scales to zero, only pay for actual requests
  4. Auth - CloudFront + OAC handles authentication without custom code
User → CloudFront → Lambda → EKS (private)
         ↓
      (caching)
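
The Lambda itself can stay very thin. Below is a sketch of what the proxy might look like as a Node.js 18 handler behind a Lambda Function URL (v2 HTTP event shape); the EKS_INTERNAL_URL environment variable and the event wiring are assumptions, not the deployed code.

// Hypothetical thin proxy: forward the request to the private EKS service.
import type { APIGatewayProxyEventV2, APIGatewayProxyResultV2 } from 'aws-lambda';

const EKS_INTERNAL_URL = process.env.EKS_INTERNAL_URL!; // e.g. http://inference.internal:8080

export const handler = async (
  event: APIGatewayProxyEventV2
): Promise<APIGatewayProxyResultV2> => {
  // Node 18's built-in fetch reaches the inference Service over the VPC
  const upstream = await fetch(`${EKS_INTERNAL_URL}${event.rawPath}`, {
    method: event.requestContext.http.method,
    headers: { 'Content-Type': 'application/json' },
    body: event.body,
  });

  return {
    statusCode: upstream.status,
    headers: { 'Content-Type': 'application/json' },
    body: await upstream.text(),
  };
};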

Conversation Facilitation: The Design Challenge

Sign recognition is only half the problem. The harder part: making the conversation actually work.

The Communication Gap

  • ASL is not English - Grammar, word order, and concepts differ
  • Vocabulary mismatch - Our model knows 100 signs. Real conversations need more.
  • Context matters - "BOOK" could mean "I want a book" or "I'm reading a book" depending on context

Solution: Context Engineering with Bedrock Claude

Claude handles conversation facilitation - not translation, but generating contextually relevant response suggestions.

Key requirement: Claude needs to know not just what was said, but who said it and how they communicate.

// suggestionService.ts - the actual implementation

interface ConversationTurn {
  type: 'asl' | 'voice';  // Who said it and how
  text: string;            // What was communicated
}

export async function getLLMSuggestions(
  conversationHistory: ConversationTurn[],
  mode: 'asl' | 'voice'
): Promise<string[]> {
  const prompt = buildSuggestionPrompt(conversationHistory, mode);
  const suggestions = await callClaudeBedrock(prompt);
  return suggestions;
}
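
buildSuggestionPrompt assembles the context described in the next section. callClaudeBedrock is the Bedrock call; a minimal sketch using the Bedrock Runtime SDK is below, where the model ID, token limit, and response parsing are assumptions rather than the exact production values.

// Hypothetical sketch of callClaudeBedrock via the Bedrock Runtime SDK.
import { BedrockRuntimeClient, InvokeModelCommand } from '@aws-sdk/client-bedrock-runtime';

const client = new BedrockRuntimeClient({ region: 'us-east-1' });

async function callClaudeBedrock(prompt: string): Promise<string[]> {
  const command = new InvokeModelCommand({
    modelId: 'anthropic.claude-3-haiku-20240307-v1:0', // assumed model
    contentType: 'application/json',
    body: JSON.stringify({
      anthropic_version: 'bedrock-2023-05-31',
      max_tokens: 256,
      messages: [{ role: 'user', content: prompt }],
    }),
  });

  const response = await client.send(command);
  const payload = JSON.parse(new TextDecoder().decode(response.body));
  // Claude is prompted to return a JSON array of suggestion strings
  return JSON.parse(payload.content[0].text);
}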

Context Engineering: The Four Key Decisions

1. Sliding Window (Last 6 Turns)

We don't send the entire conversation history. Just the last 6 turns.

conversationContext = history
  .slice(-6)  // Only recent context
  .map(turn => `[${turn.type === 'asl' ? 'Deaf user' : 'Hearing user'}]: ${turn.text}`)
  .join('\n');

Why 6? Enough context to understand the topic, not so much that it confuses the model or wastes tokens. Most conversations have natural topic shifts every 4-6 exchanges anyway.

2. Mode-Aware Prompting

The same conversation needs different suggestions depending on who's responding next:

const modeContext = isASLMode
  ? `The deaf user will respond using ASL signs. Suggest common ASL signs from the WLASL-100 vocabulary.
Available signs include: HELLO, HELP, YES, NO, THANK-YOU, PLEASE, GOOD, WANT, NEED, LIKE, GO, EAT, DRINK...
Keep suggestions to single words or short phrases that are actual ASL signs.`
  : `The hearing user will respond by speaking. Suggest natural, conversational responses.
Keep suggestions brief (under 8 words each) and appropriate for the context.`;

This is critical. ASL suggestions must be constrained to signs the model can actually recognize. Voice suggestions can be natural language.

3. Turn Type Labeling

Each message is tagged with who sent it:

// In the prompt, turns look like:
// [Deaf user (ASL)]: HELLO
// [Hearing user (Voice)]: Hi! How are you?
// [Deaf user (ASL)]: GOOD

Claude can track the conversation flow and understand the back-and-forth pattern. This helps it generate appropriate responses for each party.

4. Response Caching

Same context = same suggestions. No need to hit the API twice.

const suggestionCache = new Map<string, { suggestions: string[]; timestamp: number }>();
const CACHE_TTL_MS = 30000; // 30 seconds

// Cache key is the full context hash
const cacheKey = `${mode}:${conversationHistory.map(t => `${t.type}:${t.text}`).join('|')}`;

30-second TTL means rapid back-and-forth doesn't hammer the API, but suggestions stay fresh as the conversation evolves.
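
Putting it together, the lookup might wrap the Bedrock call like this (a sketch reusing the names above; getSuggestionsCached is a hypothetical helper):

// Check the cache before calling Bedrock; store fresh results with a timestamp.
async function getSuggestionsCached(cacheKey: string, prompt: string): Promise<string[]> {
  const cached = suggestionCache.get(cacheKey);
  if (cached && Date.now() - cached.timestamp < CACHE_TTL_MS) {
    return cached.suggestions;                 // still fresh: skip the API call
  }

  const suggestions = await callClaudeBedrock(prompt);
  suggestionCache.set(cacheKey, { suggestions, timestamp: Date.now() });
  return suggestions;
}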

Additional Design Decisions

  • Graceful fallbacks - If the API fails, default suggestions like ['YES', 'NO', 'HELP', 'THANK-YOU'] are shown (see the sketch after this list)
  • Vocabulary display - UI shows all 100 supported signs so users know what's possible
  • JSON output format - Claude returns suggestions as a JSON array for reliable parsing
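
For the fallback path mentioned above, a sketch might look like the following; getSuggestionsSafe is a hypothetical wrapper name.

// If Bedrock is unreachable, fall back to a small set of safe defaults.
const FALLBACK_SUGGESTIONS = ['YES', 'NO', 'HELP', 'THANK-YOU'];

async function getSuggestionsSafe(
  history: ConversationTurn[],
  mode: 'asl' | 'voice'
): Promise<string[]> {
  try {
    return await getLLMSuggestions(history, mode);
  } catch {
    return FALLBACK_SUGGESTIONS;   // keep the conversation moving even when the API fails
  }
}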

Real-Time Flow

Deaf user signs → Hearing user hears:

1. Webcam captures video stream
   ↓
2. MediaPipe extracts hand landmarks (client-side, ~10ms)
   ↓
3. Collect 32 frames of landmarks (~1 second of signing)
   ↓
4. Send to /v1/asl/predict (EKS)
   ↓
5. Model returns prediction + confidence (~40ms)
   ↓
6. If confidence ≥ 60%: show confirmation dialog (threshold balances false positives vs missed detections)
   ↓
7. User confirms → sign added to conversation
   ↓
8. Browser TTS speaks the sign to hearing user
   ↓
9. Bedrock generates suggestions for next response
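
On the client, steps 3-7 reduce to a small gate: buffer 32 frames of landmarks, predict, and only ask for confirmation above the threshold. The sketch below reuses the hypothetical predictSign helper from earlier; showConfirmationDialog stands in for the real UI hook.

// Hypothetical client-side gate for the ASL detection flow.
const MIN_CONFIDENCE = 0.6;   // threshold from step 6
const FRAME_WINDOW = 32;      // ~1 second of signing

declare function showConfirmationDialog(sign: string, confidence: number): void; // UI hook (assumed)

async function onLandmarkFrame(buffer: number[][], frame: number[]) {
  buffer.push(frame);
  if (buffer.length < FRAME_WINDOW) return;        // keep collecting landmarks

  const result = await predictSign(buffer.splice(0, FRAME_WINDOW));
  if (result.confidence >= MIN_CONFIDENCE) {
    showConfirmationDialog(result.prediction, result.confidence);
  }
  // Below the threshold, nothing is shown and the user simply keeps signing.
}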

Hearing user speaks → Deaf user reads:

1. Browser Speech Recognition API captures audio
   ↓
2. Real-time transcription (browser-native)
   ↓
3. User confirms → text added to conversation
   ↓
4. Deaf user reads the message
   ↓
5. Bedrock generates ASL sign suggestions (vocabulary-constrained)
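
The capture side in steps 1-2 is the plain Web Speech API. A minimal sketch, assuming Chrome's webkitSpeechRecognition prefix and a hypothetical handleTranscript handler for the confirmation step:

// Browser speech capture using the Web Speech API (Chrome exposes the prefixed constructor).
declare function handleTranscript(text: string): void; // hands final text to the confirmation UI (assumed)

const SpeechRecognitionImpl =
  (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;

const recognition = new SpeechRecognitionImpl();
recognition.lang = 'en-US';
recognition.continuous = true;
recognition.interimResults = true;   // live transcription while the person is still speaking

recognition.onresult = (event: any) => {
  const lastResult = event.results[event.results.length - 1];
  if (lastResult.isFinal) {
    handleTranscript(lastResult[0].transcript);
  }
};

recognition.start();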

Total latency: ~500ms for ASL detection flow. Speech recognition is near-instant (browser-native).

How This Differs from Other ASL Apps

Most ASL apps focus on:

  • Teaching ASL (educational)
  • Translating ASL to text (one-way)
  • Avatar-based signing (uncanny valley)

VoxSign:

  • Bidirectional (both parties can initiate)
  • Context-aware (Claude tracks conversation flow)
  • Confirmation-based (handles model uncertainty)
  • Suggestion-driven (guides users within vocabulary limits)

Unsolved Challenges

  • 100 signs isn't enough - Real conversations need 500+ signs minimum
  • Two-hand signs are harder - Model struggles with signs requiring hand interaction
  • No sentence-level understanding - We detect individual signs, not full ASL sentences
  • One-way ASL - Deaf user reads text, doesn't see ASL video (video generation was too slow)

Key Takeaways

  1. Context engineering > prompt engineering - Deciding what to send Claude (last 6 turns, turn types, mode context) matters more than prompt wording. The sliding window was a bigger win than any prompt change.

  2. Mode-aware prompting is essential - Same conversation, different constraints. ASL suggestions must be vocabulary-constrained; voice suggestions can be natural language.

  3. Confirmation UX handles model uncertainty - Users confirm predictions before they're added to the conversation.

  4. Cache aggressively - Same context = same suggestions. 30-second TTL saved API costs and improved latency.

  5. Bedrock is cost-effective - ~$0.003 per conversation turn. Context engineering to reduce token count pays off directly.
