DEV Community

John Mahoney

How we run real-time AI deposition analysis with Deepgram + Claude

TL;DR — Two-hour witness depos produce 12K–25K words of live transcript. We run that through Deepgram Nova-3 → a single Node.js WebSocket server → Claude Haiku-4-5 with a 12-key JSON schema → the attorney's screen in ~4 seconds per segment. The hard parts weren't the models; they were WebSocket idle timeouts, audio pipelines on macOS, and prompt engineering to stop hallucinated PubMed citations. Here's what shipped and what broke on the way.

The setup

Product context: plaintiff med-mal attorneys spend a lot of their life in depositions. "Did the expert's current testimony contradict their report from 2019?" is the kind of question that wins cases. It's also the kind of question a human brain doesn't hold well during two hours of straight listening.

We're building a tool that does hold it: Courtroom AI lives on a browser tab the attorney watches during the depo. It listens through the reporter's realtime stream (or a microphone, or a Deepgram hook-up), produces structured JSON analysis per testimony segment, and pushes real-time flags — admissions, evasion patterns, prior-testimony contradictions, peer-reviewed literature that contradicts "in my experience" claims, FRE foundation triggers — to the side panel.

Stack is almost embarrassingly simple:

  • Frontend: React + Vite, no router, no state library beyond component state and a useWebSocket hook. Single 232 KB JS bundle.
  • Backend: one Node.js HTTP server, ws for WebSocket, @anthropic-ai/sdk for Claude calls.
  • Transcription: Deepgram Nova-3 for streaming ASR; microphone fallback via the Web Speech API; pasted-transcript "simulator" mode for replaying historical depos.
  • Analysis: Claude Haiku-4-5 with max_tokens: 8192 (we'll get to why).
  • Hosting: Railway. Deploys on git push main. Uptime from UptimeRobot + Sentry for error tracking.

The 12-key analysis schema

Every testimony segment (roughly one Q-A pair or a 60-second chunk of narrative) goes through one Claude call that returns a strict JSON object with 12 top-level keys:

{
  "medical":         { "accuracyScore": 0-10, "inaccuracies": [...], ... },
  "daubert":         { "vulnerabilityScore": 0-10, "vulnerabilities": [...], ... },
  "priorTestimony":  { "inconsistencies": [...], "impeachmentOpportunities": [...] },
  "crossExam":       { "questions": [...], "keyWeaknesses": [...] },
  "elements":        { "duty": {...}, "breach": {...}, "causation": {...}, "damages": {...} },
  "admission":       { "isAdmission": bool, "quote": "...", "significance": "..." },
  "evasion":         { "isEvasive": bool, "pattern": "...", "escalationScript": [...] },
  "coverage":        { "topicsCovered": [...] },
  "foundation":      { "triggers": ["FRE 613", "FRE 803(18)", ...] },
  "chartContradiction": { "contradicted": bool, "witnessClaim": "...", "chartEvidence": "..." },
  "literatureHits":  [{ "witnessClaim": "...", "pubmedQuery": "...", "results": [...] }]
}

The reason it's one big schema instead of 12 separate calls: latency. At 12 sequential calls per segment, with Haiku averaging 1.5s per response, the attorney would see analysis ~18 seconds after the witness spoke. That's unusable. One combined call with a single output budget lands in ~4 seconds on dense segments.
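As a sketch, the combined call looks roughly like this. The model id and prompt wording here are assumptions; the one-call shape and the 8192-token output budget are the point:

```javascript
// Build the params for the single combined analysis call.
// One round-trip per segment instead of twelve sequential ones.
function buildAnalysisRequest(segmentText) {
  return {
    model: 'claude-haiku-4-5', // assumed model id
    max_tokens: 8192,          // raised ceiling (see Gotcha #1)
    messages: [{
      role: 'user',
      content: 'Return ONLY the 12-key analysis JSON for this segment:\n\n'
        + segmentText,
    }],
  };
}

// Usage with the official SDK (client = new Anthropic()):
// const res = await client.messages.create(buildAnalysisRequest(segment));
```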

Gotcha #1: max_tokens 4096 was too low

When we first shipped, crossExam at the end of the output would truncate mid-sentence. Users saw "Cross-examination analysis failed" toast errors. The model wasn't failing; it was hitting the token budget.

Haiku-4-5's default response budget is 4096 tokens. Twelve fields worth of nested arrays on dense Q-A segments routinely need 6–8K output tokens. We bumped to max_tokens: 8192 — you're billed for tokens actually generated, not the ceiling, so raising it carries no cost penalty. The fix was a one-line change; the diagnosis took hours because the error surfaced as "JSON parse error at position X" rather than "max_tokens exceeded."

Takeaway: if Claude is returning malformed JSON, check the stop_reason before debugging the prompt (the Anthropic Messages API calls it stop_reason; finish_reason is the OpenAI name). stop_reason: "max_tokens" means your budget is the problem.
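A minimal guard, assuming the standard Messages API response shape (a stop_reason plus a content array of text blocks):

```javascript
// Fail loudly on truncation instead of letting JSON.parse surface a
// confusing "JSON parse error at position X" later.
function extractAnalysis(response) {
  if (response.stop_reason === 'max_tokens') {
    throw new Error('analysis truncated: raise max_tokens');
  }
  return JSON.parse(response.content[0].text);
}
```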

Gotcha #2: WebSockets and Cloudflare's 100s idle timeout

Railway fronts everything with Cloudflare, which closes idle WebSocket connections after 100 seconds. A witness pausing to read a document could easily take 2 minutes. We were losing sessions silently, and the user would see "analysis stopped" with no obvious cause.

Fix: a client-side ping every 25 seconds. Four pings per Cloudflare window, well under the budget:

useEffect(() => {
  if (!ws) return;
  const id = setInterval(() => {
    if (ws.readyState === WebSocket.OPEN) {
      ws.send(JSON.stringify({ type: 'ping' }));
    }
  }, 25_000);
  return () => clearInterval(id);
}, [ws]);

Server side just echoes pong and moves on. Zero logic, pure connection keepalive.
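The server side can be as small as one message handler that recognizes the ping. A sketch (sock stands in for the ws connection object; the wiring comment assumes the ws library's event names):

```javascript
// Echo { type: 'pong' } for keepalive pings. Returns true if the message
// was a ping so real messages can fall through to the normal handler.
function handleKeepalive(sock, raw) {
  let msg;
  try { msg = JSON.parse(raw); } catch { return false; }
  if (msg && msg.type === 'ping') {
    sock.send(JSON.stringify({ type: 'pong' }));
    return true;
  }
  return false;
}

// Wiring:
// wss.on('connection', (sock) => {
//   sock.on('message', (raw) => {
//     if (handleKeepalive(sock, String(raw))) return;
//     // ...normal message handling...
//   });
// });
```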

Gotcha #3: PubMed citations that didn't exist

Early versions asked Claude to produce literature that contradicts the witness's "in my experience" claims — complete with PMIDs, authors, journal names. Which it did. Convincingly. And almost none of them were real.

This is the Mata v. Avianca problem in miniature, and it's disqualifying for legal tech. An attorney who reads one fake citation loses trust in everything the tool outputs.

Fix: Claude only generates a PubMed search query. We then call NCBI E-utilities (esearch.fcgi → esummary.fcgi) to resolve that query into actual PMIDs with real titles, authors, and journal years. If the query returns zero hits, we retry with a progressively simpler query (ladder from 3-AND-clause → 2-AND → 1-AND → first-noun-phrase) before giving up. The attorney never sees a fabricated citation; at worst they see "no literature found for this query," which is accurate.

// Progressive fallback: Claude's over-specific queries often return zero hits.
// Simplify until we get something, or return empty array honestly.
async function pubmedLookup(query, topN = 3) {
  const ladder = pubmedQueryLadder(query);
  for (const term of ladder) {
    const pmids = await esearch(term);
    if (pmids.length) return esummary(pmids);
  }
  return [];
}
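pubmedQueryLadder isn't shown above; one way to build it, assuming Claude's over-specific queries are AND-joined clauses (the final "first-noun-phrase" rung is approximated here by the first clause alone):

```javascript
// Original query first, then progressively fewer AND clauses:
// full query -> 3 clauses -> 2 -> 1.
function pubmedQueryLadder(query) {
  const clauses = query.split(/\s+AND\s+/i).map((c) => c.trim()).filter(Boolean);
  const ladder = [query];
  for (let n = Math.min(clauses.length - 1, 3); n >= 1; n--) {
    ladder.push(clauses.slice(0, n).join(' AND '));
  }
  return [...new Set(ladder)]; // dedupe when the query had few clauses
}
```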

A 24-hour in-memory cache keyed on the query string means repeated witnesses don't re-hit NCBI. We rate-limit to 3 req/sec, NCBI's published limit for requests without an API key.
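A minimal version of that cache (a sketch, not the production code; the fetcher parameter stands in for the E-utilities round-trip):

```javascript
const TTL_MS = 24 * 60 * 60 * 1000; // 24 hours
const cache = new Map(); // query string -> { at, value }

// Return the cached result if fresh, otherwise fetch and store it.
async function cachedPubmedLookup(query, fetcher, now = Date.now) {
  const hit = cache.get(query);
  if (hit && now() - hit.at < TTL_MS) return hit.value;
  const value = await fetcher(query);
  cache.set(query, { at: now(), value });
  return value;
}
```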

Gotcha #4: macOS audio pipelines and microphone permissions

Microphone capture in Chrome on macOS requires two things plaintiff attorneys don't think about: (a) the site's Permissions-Policy header must allow microphone=(self), and (b) the macOS system preferences must allow browser mic access at the OS level.
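Both failure modes surface as the same getUserMedia rejection in the browser, so it's worth mapping the DOMException names to something an attorney can act on. A sketch (the user-facing wording is ours):

```javascript
// getUserMedia rejects with a DOMException whose .name distinguishes the
// cases; on macOS an OS-level denial typically also shows up as
// NotAllowedError, hence the combined hint.
function describeMicError(err) {
  switch (err.name) {
    case 'NotAllowedError':
      return 'Mic blocked: check the browser prompt AND macOS System Settings > Privacy & Security > Microphone.';
    case 'NotFoundError':
      return 'No microphone detected.';
    case 'NotReadableError':
      return 'Microphone is in use by another app.';
    default:
      return `Mic error: ${err.name}`;
  }
}

async function startMic() {
  try {
    return await navigator.mediaDevices.getUserMedia({ audio: true });
  } catch (err) {
    throw new Error(describeMicError(err));
  }
}
```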

We tightened Permissions-Policy too aggressively one week and accidentally disabled mic access on /courtroom-ai/ for three days. The browser console message is unhelpful ([Deprecation] Feature policy 'microphone' is disabled) and the tool just... silently doesn't transcribe. Users called it "broken."

Fix: explicit per-route policy:

if (pathname.startsWith('/courtroom-ai')) {
  res.setHeader('Permissions-Policy', 'microphone=(self), camera=()');
}

For local dev on macOS we also ship a bin/courtroom-setup.sh that re-creates a BlackHole aggregate audio device after reboots (macOS doesn't persist aggregate devices across reboots, which is a whole separate surprise).

Gotcha #5: committed-dist because of nixpacks

Railway's nixpacks builder does run npm run build when it detects a Vite project, but its output sometimes doesn't overwrite the committed dist/ bundle, especially with cache hits. We lost a whole afternoon debugging "why isn't my frontend change showing up" when the answer was: the old bundle was still the one being served.

Fix: we commit dist/ to the repo and .gitignore explicitly re-allows it. On any frontend change:

cd courtroom-ai-tool/frontend && npx vite build
git add dist/

This feels wrong — "don't commit build artifacts" is received wisdom. But it gives us a deterministic bundle-hash match between source and prod, which matters more to a small team than .gitignore hygiene.

What we learned

  1. One call > many calls for real-time UX. Take the max_tokens hit; pay once, return everything.
  2. Never generate citations the user might act on. Generate queries that resolve against an authoritative source, and fail honestly when resolution returns nothing.
  3. Browser realtime = keepalive pings. Any path that goes through a CDN has an idle timeout. Find it before your users do.
  4. On-device audio is more brittle than the models. The transcript quality failures we've seen are overwhelmingly pipeline-level, not ASR-level. Test the mic path on fresh macOS installs.
  5. Commit the damn build. Until your deploy platform's cache semantics are bulletproof, deterministic artifacts beat clean gitignore every time.

We're MedLegal AI — we're not hiring, we're building this. If you're a plaintiff firm and want to kick the tires, 14-day free trial at medicalai.law. The Courtroom AI add-on is $99/mo for 10 hours or $299/mo for 50; full details here.

Canonical URL for this post: https://medicalai.law/blog/how-ai-analyzes-deposition-real-time

Tags: #webdev #ai #nodejs #legaltech

