DEV Community: John Mahoney

How we run real-time AI deposition analysis with Deepgram + Claude

John Mahoney — Tue, 21 Apr 2026 16:49:09 +0000

TL;DR — Two-hour witness depos produce 12K–25K words of live transcript. We run that through Deepgram Nova-3 → a single Node.js WebSocket server → Claude Haiku-4-5 with a 12-key JSON schema → the attorney's screen in ~4 seconds per segment. The hard parts weren't the models; they were WebSocket idle timeouts, audio pipelines on macOS, and prompt engineering to stop hallucinated PubMed citations. Here's what shipped and what broke on the way.

The setup

Product context: plaintiff med-mal attorneys spend a lot of their life in depositions. "Did the expert's current testimony contradict their report from 2019?" is the kind of question that wins cases. It's also the kind of question a human brain doesn't hold well during two hours of straight listening.

We're building a tool that does hold it: Courtroom AI lives on a browser tab the attorney watches during the depo. It listens through the reporter's realtime stream (or a microphone, or a Deepgram hook-up), produces structured JSON analysis per testimony segment, and pushes real-time flags — admissions, evasion patterns, prior-testimony contradictions, peer-reviewed literature that contradicts "in my experience" claims, FRE foundation triggers — to the side panel.

Stack is almost embarrassingly simple:

Frontend: React + Vite, no router, no state library beyond component state and a useWebSocket hook. Single 232 KB JS bundle.
Backend: one Node.js HTTP server, ws for WebSocket, @anthropic-ai/sdk for Claude calls.
Transcription: Deepgram Nova-3 for streaming ASR; microphone fallback via the Web Speech API; pasted-transcript "simulator" mode for replaying historical depos.
Analysis: Claude Haiku-4-5 with max_tokens: 8192 (we'll get to why).
Hosting: Railway. Deploys on git push main. Uptime from UptimeRobot + Sentry for error tracking.

The 12-key analysis schema

Every testimony segment (roughly one Q-A pair or a 60-second chunk of narrative) goes through one Claude call that returns a strict JSON object with 12 top-level keys:

{
  "medical":         { "accuracyScore": 0-10, "inaccuracies": [...], ... },
  "daubert":         { "vulnerabilityScore": 0-10, "vulnerabilities": [...], ... },
  "priorTestimony":  { "inconsistencies": [...], "impeachmentOpportunities": [...] },
  "crossExam":       { "questions": [...], "keyWeaknesses": [...] },
  "elements":        { "duty": {...}, "breach": {...}, "causation": {...}, "damages": {...} },
  "admission":       { "isAdmission": bool, "quote": "...", "significance": "..." },
  "evasion":         { "isEvasive": bool, "pattern": "...", "escalationScript": [...] },
  "coverage":        { "topicsCovered": [...] },
  "foundation":      { "triggers": ["FRE 613", "FRE 803(18)", ...] },
  "chartContradiction": { "contradicted": bool, "witnessClaim": "...", "chartEvidence": "..." },
  "literatureHits":  [{ "witnessClaim": "...", "pubmedQuery": "...", "results": [...] }]
}

The reason it's one big schema instead of 12 separate calls: latency. At 12 sequential calls per segment, with Haiku averaging 1.5s per response, the attorney would see analysis ~18 seconds after the witness spoke. That's unusable. One combined call with a single output budget lands in ~4 seconds on dense segments.

Gotcha #1: max_tokens 4096 was too low

When we first shipped, crossExam at the end of the output would truncate mid-sentence. Users saw "Cross-examination analysis failed" toast errors. The model wasn't failing; it was hitting the token budget.

The Messages API has no default response budget: max_tokens is a required parameter, and we had set it to 4096. Twelve fields' worth of nested arrays on dense Q-A segments routinely need 6–8K output tokens. We bumped to max_tokens: 8192. Output is billed by the tokens actually generated, not by the ceiling, so raising it costs nothing extra on segments that stay short. The fix was a one-line change; the diagnosis took hours because the error surfaced as "JSON parse error at position X" rather than "max_tokens exceeded."

Takeaway: if Claude is returning malformed JSON, check stop_reason on the response before debugging the prompt. stop_reason: "max_tokens" means your budget is the problem.
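
A minimal sketch of that check, assuming the response object has the Messages API shape (a stop_reason string and a content array of typed blocks). parseAnalysis is a hypothetical name, not the production helper; the point is only that truncation should raise its own error before JSON.parse gets a chance to raise a confusing one.

```javascript
// Hypothetical helper: turn a Messages API response into parsed JSON,
// reporting truncation explicitly instead of as a JSON parse failure.
function parseAnalysis(msg) {
  if (msg.stop_reason === 'max_tokens') {
    throw new Error('Claude output truncated: raise max_tokens or trim the schema');
  }
  // Concatenate only the text blocks of the reply.
  const text = msg.content
    .filter((block) => block.type === 'text')
    .map((block) => block.text)
    .join('');
  return JSON.parse(text);
}
```

With this in place, a truncated crossExam shows up in logs as a budget problem rather than a parse error at some character offset.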

Gotcha #2: WebSockets and Cloudflare's 100s idle timeout

Railway fronts everything with Cloudflare, which closes idle WebSocket connections after 100 seconds. A witness pausing to read a document could easily take 2 minutes. We were losing sessions silently, and the user would see "analysis stopped" with no obvious cause.

Fix: a client-side ping every 25 seconds. Four pings per Cloudflare window, well under the budget:

useEffect(() => {
  if (!ws) return;
  const id = setInterval(() => {
    if (ws.readyState === WebSocket.OPEN) {
      ws.send(JSON.stringify({ type: 'ping' }));
    }
  }, 25_000);
  return () => clearInterval(id);
}, [ws]);

Server side just echoes pong and moves on. Zero logic, pure connection keepalive.
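
For illustration, the server half might look like the sketch below. handleKeepalive and the send callback are hypothetical names; the only thing taken from the client code is the { type: 'ping' } message shape.

```javascript
// Hypothetical handler for one incoming WebSocket frame. `send` is whatever
// function writes a string back to the same client.
function handleKeepalive(raw, send) {
  let msg;
  try {
    msg = JSON.parse(String(raw));
  } catch {
    return false; // not JSON, so not a keepalive; leave it to other handlers
  }
  if (msg && msg.type === 'ping') {
    send(JSON.stringify({ type: 'pong' }));
    return true;
  }
  return false;
}

// Wiring with the ws package would look roughly like:
//   wss.on('connection', (ws) => {
//     ws.on('message', (raw) => {
//       if (handleKeepalive(raw, (s) => ws.send(s))) return;
//       /* ...route transcript and analysis messages... */
//     });
//   });
```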

Gotcha #3: PubMed citations that didn't exist

Early versions asked Claude to produce literature that contradicts the witness's "in my experience" claims — complete with PMIDs, authors, journal names. Which it did. Convincingly. And almost none of them were real.

This is the Mata v. Avianca problem in miniature, and it's disqualifying for legal tech. An attorney who reads one fake citation loses trust in everything the tool outputs.

Fix: Claude only generates a PubMed search query. We then call NCBI E-utilities (esearch.fcgi → esummary.fcgi) to resolve that query into actual PMIDs with real titles, authors, and journal years. If the query returns zero hits, we retry with a progressively simpler query (ladder from 3-AND-clause → 2-AND → 1-AND → first-noun-phrase) before giving up. The attorney never sees a fabricated citation; at worst they see "no literature found for this query," which is accurate.
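
One possible shape for the two E-utilities helpers, as a sketch rather than the production code. It assumes Node 18+ (global fetch) and the JSON output mode of E-utilities; the response field names (esearchresult.idlist, result.uids, fulljournalname, pubdate) reflect my reading of that format and should be checked against a live response.

```javascript
const EUTILS = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils';

function esearchUrl(term, retmax = 3) {
  const params = new URLSearchParams({ db: 'pubmed', term, retmode: 'json', retmax: String(retmax) });
  return `${EUTILS}/esearch.fcgi?${params}`;
}

function esummaryUrl(pmids) {
  const params = new URLSearchParams({ db: 'pubmed', id: pmids.join(','), retmode: 'json' });
  return `${EUTILS}/esummary.fcgi?${params}`;
}

// Resolve a search term to a list of PMIDs (possibly empty).
async function esearch(term, retmax = 3) {
  const res = await fetch(esearchUrl(term, retmax));
  if (!res.ok) throw new Error(`esearch failed: HTTP ${res.status}`);
  const body = await res.json();
  return body.esearchresult?.idlist ?? [];
}

// Resolve PMIDs to citation metadata that comes from NCBI, not from the model.
async function esummary(pmids) {
  const res = await fetch(esummaryUrl(pmids));
  if (!res.ok) throw new Error(`esummary failed: HTTP ${res.status}`);
  const { result = {} } = await res.json();
  return (result.uids ?? []).map((uid) => ({
    pmid: uid,
    title: result[uid].title,
    authors: (result[uid].authors ?? []).map((a) => a.name),
    journal: result[uid].fulljournalname,
    pubdate: result[uid].pubdate,
  }));
}
```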

// Progressive fallback: Claude's over-specific queries often return zero hits.
// Simplify until we get something, or return empty array honestly.
async function pubmedLookup(query, topN = 3) {
  const ladder = pubmedQueryLadder(query);
  for (const term of ladder) {
    const pmids = await esearch(term);
    // Cap at topN so the side panel only shows the top few hits.
    if (pmids.length) return esummary(pmids.slice(0, topN));
  }
  return [];
}
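
The ladder builder itself is not shown in the post. A rough sketch follows, under two assumptions of mine: the model joins clauses with a literal " AND ", and the earliest clauses matter most. The last rung is a crude stand-in for "first noun phrase" (first two bare words of the first clause), since real noun-phrase extraction would need an NLP step.

```javascript
// Hypothetical ladder builder: most specific query first, simplest last.
function pubmedQueryLadder(query) {
  const clauses = query
    .split(/\s+AND\s+/)
    .map((c) => c.trim())
    .filter(Boolean);
  const rungs = [];
  // Drop the trailing clause one at a time: 3 clauses, then 2, then 1.
  for (let n = Math.min(clauses.length, 3); n >= 1; n--) {
    rungs.push(clauses.slice(0, n).join(' AND '));
  }
  // Last rung: strip field tags, quotes, and parentheses from the first
  // clause and keep its first two words.
  const bare = (clauses[0] || '')
    .replace(/\[[^\]]*\]/g, ' ')
    .replace(/["()]/g, ' ')
    .trim()
    .split(/\s+/)
    .slice(0, 2)
    .join(' ');
  if (bare) rungs.push(bare);
  return [...new Set(rungs)]; // de-duplicate while keeping order
}
```

Queries with nested boolean groups would need a real parser; this split is only safe for flat AND chains.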

A 24-hour in-memory cache keyed on the query string means repeated witnesses don't re-hit NCBI. Requests are rate-limited to 3 per second, NCBI's published limit for clients that don't send an API key.
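
A sketch of how the cache and the rate gate could be built, with an injectable clock so both can be tested without waiting. createTtlCache and createThrottle are hypothetical names; the 334 ms gap is simply 1000 / 3 rounded up.

```javascript
// Hypothetical TTL cache keyed on the query string.
function createTtlCache(ttlMs = 24 * 60 * 60 * 1000, now = Date.now) {
  const store = new Map();
  return {
    get(key) {
      const hit = store.get(key);
      if (!hit) return undefined;
      if (now() >= hit.expiresAt) {
        store.delete(key); // expired: evict lazily on read
        return undefined;
      }
      return hit.value;
    },
    set(key, value) {
      store.set(key, { value, expiresAt: now() + ttlMs });
    },
  };
}

// Hypothetical rate gate: each call reserves the next free slot and returns
// how many milliseconds the caller should wait before sending its request.
function createThrottle(minGapMs = 334, now = Date.now) {
  let nextSlot = 0;
  return function reserveDelay() {
    const t = now();
    const start = Math.max(t, nextSlot);
    nextSlot = start + minGapMs;
    return start - t;
  };
}
```

A caller would check the cache first, and on a miss sleep for reserveDelay() milliseconds before hitting NCBI and storing the result.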

Gotcha #4: macOS audio pipelines and microphone permissions

Microphone capture in Chrome on macOS requires two things plaintiff attorneys don't think about: (a) the site's Permissions-Policy header must allow microphone=(self), and (b) the macOS system preferences must allow browser mic access at the OS level.

We tightened Permissions-Policy too aggressively one week and accidentally disabled mic access on /courtroom-ai/ for three days. The browser console message is unhelpful ([Deprecation] Feature policy 'microphone' is disabled) and the tool just... silently doesn't transcribe. Users called it "broken."

Fix: explicit per-route policy:

if (pathname.startsWith('/courtroom-ai')) {
  res.setHeader('Permissions-Policy', 'microphone=(self), camera=()');
}

For local dev on macOS we also ship a bin/courtroom-setup.sh that re-creates a BlackHole aggregate audio device after reboots (macOS doesn't persist aggregate devices across reboots, which is a whole separate surprise).

Gotcha #5: committed-dist because of nixpacks

Railway's nixpacks builder does run npm run build when it detects a Vite project, but its output sometimes doesn't overwrite the committed dist/ bundle, especially with cache hits. We lost a whole afternoon debugging "why isn't my frontend change showing up" when the answer was: the old bundle was still the one being served.

Fix: we commit dist/ to the repo and .gitignore explicitly re-allows it. On any frontend change:

cd courtroom-ai-tool/frontend && npx vite build
git add dist/

This feels wrong — "don't commit build artifacts" is received wisdom. But it gives us a deterministic bundle-hash match between source and prod, which is more valuable than the .gitignore hygiene in a small team.

What we learned

One call > many calls for real-time UX. Take the max_tokens hit; pay once, return everything.
Never generate citations the user might act on. Generate queries that resolve against an authoritative source, and fail honestly when resolution returns nothing.
Browser realtime = keepalive pings. Any path that goes through a CDN has an idle timeout. Find it before your users do.
On-device audio is more brittle than the models. The transcript quality failures we've seen are overwhelmingly pipeline-level, not ASR-level. Test the mic path on fresh macOS installs.
Commit the damn build. Until your deploy platform's cache semantics are bulletproof, deterministic artifacts beat clean gitignore every time.

We're MedLegal AI — we're not hiring, we're building this. If you're a plaintiff firm and want to kick the tires, 14-day free trial at medicalai.law. The Courtroom AI add-on is $99/mo for 10 hours or $299/mo for 50; full details here.

Canonical URL for this post: https://medicalai.law/blog/how-ai-analyzes-deposition-real-time

Tags: #webdev #ai #nodejs #legaltech

How we built real-time deposition analysis with Claude's streaming API

John Mahoney — Tue, 21 Apr 2026 01:11:59 +0000

Medical-malpractice plaintiff attorneys spend 3+ hours in expert depositions hunting for two things: admissions they can use at trial, and inconsistencies they can impeach. Both windows close in seconds. If you don't catch them live, you're reading the transcript a week later wishing you had.

We built a live-feed analyzer that watches the deposition stream, runs Claude against every 30-second window, and surfaces real-time signals to the attorney's laptop while they question the witness.

Architecture

Three hops:

Deepgram transcribes the live audio over WebSocket
Our Node WS server buffers transcript into 30-second segments
Claude (Haiku 4.5, streaming) analyzes each segment and returns a 12-key JSON
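
The buffering hop can be sketched as below. This is an illustration under assumptions, not the production buffer: createSegmentBuffer is a hypothetical name, and it takes already-finalized transcript pieces as { text, startSec, endSec }. Mapping Deepgram's result messages onto that shape is left out.

```javascript
// Hypothetical buffer: collects finalized transcript pieces and emits one
// segment once the buffered audio spans `windowSec` seconds.
function createSegmentBuffer(onSegment, windowSec = 30) {
  let pieces = [];
  let startedAt = null;
  let lastEnd = null;

  // Emit whatever is buffered; also call this when the session ends so the
  // tail of the testimony is analyzed.
  function flush() {
    if (pieces.length === 0) return;
    onSegment({ text: pieces.join(' '), startSec: startedAt, endSec: lastEnd });
    pieces = [];
    startedAt = null;
  }

  function push({ text, startSec, endSec }) {
    const clean = text.trim();
    if (!clean) return;
    if (startedAt === null) startedAt = startSec;
    pieces.push(clean);
    lastEnd = endSec;
    if (lastEnd - startedAt >= windowSec) flush();
  }

  return { push, flush };
}
```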

The JSON is the heart of the system. Every segment returns:

{
  "medical":         { accuracyScore, inaccuracies, accurateStatements, confidence, summary },
  "daubert":         { vulnerabilityScore, vulnerabilities, strengths, overallRisk },
  "priorTestimony":  { inconsistencies, impeachmentOpportunities, summary },
  "crossExam":       { questions, keyWeaknesses, recommendedApproach },
  "elements":        { duty, breach, causation, damages },   // each { advanced, quote }
  "admission":       { isAdmission, quote, significance, whyMatters },
  "evasion":         { isEvasive, pattern, escalationScript },
  "coverage":        { topicsCovered, notes },
  "foundation":      { triggers },                              // FRE 613/803(18)/803(6)/702, FRCP 30(b)(6)
  "chartContradiction": { contradicted, witnessClaim, chartEvidence, severity },
  "literatureHits":  [{ witnessClaim, pubmedQuery, foundationScript, pubmedUrl }]
}

The per-segment loop

// Per segment: single Claude call with the whole expert-witness-specific prompt
const msg = await anthropic.messages.create({
  model: 'claude-haiku-4-5-20251001',
  max_tokens: 8192,                // critical — 4096 truncates crossExam mid-stream
  messages: [{ role: 'user', content: buildPrompt(segment, caseContext, chart) }],
});
const analysis = sanitizeResult(JSON.parse(extractJson(msg.content[0].text)));
ws.send(JSON.stringify({ type: 'analysis', analysis, segment }));
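
The extractJson helper in that loop is not shown. One plausible version, offered as an assumption about what it does: take the outermost {...} span of the reply so that any prose or Markdown fence the model wraps around the object is ignored.

```javascript
// Hypothetical extractJson: returns the text from the first "{" to the last
// "}" in the model's reply, or throws if there is no such span.
function extractJson(text) {
  const start = text.indexOf('{');
  const end = text.lastIndexOf('}');
  if (start === -1 || end <= start) {
    throw new Error('No JSON object found in model output');
  }
  return text.slice(start, end + 1);
}
```

This does not repair truncated output; a reply cut off mid-object still fails in JSON.parse, which is why the token budget matters.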

Three things we got wrong the first time:

max_tokens=4096 was too small. The 12-key output needs ~6-8K on dense segments. If crossExam is written near the end of the stream, it gets truncated and the UI shows "Cross-examination analysis failed." Bumped to 8192.
Chart context wasn't propagating. chartContradiction can't fire without the chart data in the prompt. We now stash ws._sessionChartContext on a setChartContext WS message before analysis begins.
Cloudflare killed idle WebSockets after 100s. Claude's longer analyses took 45-90s, and during dense segments the WS went silent. Added a 25s keepalive ping from the client.
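
The chart-context fix from the second item above can be sketched as a small message router. Only ws._sessionChartContext and the setChartContext message type come from the post; routeMessage, the msg.chart field, the 'segment' message type, and the analyze callback are hypothetical.

```javascript
// Hypothetical router for messages on one connection. `ws` is the per-client
// socket object; `analyze` stands in for the per-segment Claude call.
function routeMessage(ws, msg, analyze) {
  switch (msg.type) {
    case 'setChartContext':
      ws._sessionChartContext = msg.chart; // kept for the life of the connection
      return 'stored';
    case 'segment':
      // Pass null explicitly when no chart has arrived, so the prompt builder
      // can tell the model to leave chartContradiction empty.
      analyze(msg.segment, ws._sessionChartContext ?? null);
      return 'analyzed';
    default:
      return 'ignored';
  }
}
```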

What we skipped (for now)

PDF.js text-layer positioning for the chart contradiction pin (today it's at the file-list row, not the page)
Firm-scoped vector index over historical transcripts (cross-case expert inconsistency)
Live PubMed API calls — today we generate the search query, the attorney clicks through

Full writeup (including the chart-cross-reference and co-counsel channel) is on our blog at medicalai.law/blog/how-ai-analyzes-deposition-real-time.

Questions welcome.

Processing 1,500 Pages of Medical Records in 3 Minutes with AI

John Mahoney — Wed, 15 Apr 2026 13:24:29 +0000

Medical malpractice attorneys deal with thousands of pages of medical records per case. Organizing those records into a chronological timeline is the foundation of every case — and it's historically been done by hand, taking 20-40 hours per case.

We built a pipeline that extracts structured data from uploaded medical record PDFs, streams AI-generated analysis back to the browser in real time, and handles files up to 500MB. Here's how it works.

John Mahoney, Founder @ MedLegal AI

The Architecture

The system has four stages:

Upload — Browser uploads PDFs directly to S3 via presigned URLs
Extract — Server pulls the file from S3, runs OCR if needed, extracts raw text
Analyze — Text is sent to Claude API for structured extraction
Stream — Results stream back to the browser via SSE as they're generated

Stage 1: Presigned S3 Uploads

Medical record PDFs are large. 200-500MB is common. We're deployed behind Cloudflare and Railway, both with upload size limits.

The solution: the browser uploads directly to S3 via presigned PUT URLs.

```javascript
const crypto = require('node:crypto');
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');
const { getSignedUrl } = require('@aws-sdk/s3-request-presigner');

async function generatePresignedUpload(userId, fileName) {
  // Store under a random ID rather than the user-supplied fileName.
  const fileId = crypto.randomUUID() + '.pdf';
  const s3Key = `case-analysis/uploads/${userId}/${fileId}`;

  const presignClient = new S3Client({
    region: process.env.AWS_REGION,
    // Skip the SDK's default checksums, which break browser PUTs.
    requestChecksumCalculation: 'WHEN_REQUIRED',
    responseChecksumValidation: 'WHEN_REQUIRED',
  });

  const putCmd = new PutObjectCommand({
    Bucket: process.env.S3_BUCKET,
    Key: s3Key,
    ServerSideEncryption: 'AES256',
  });

  // The URL stays valid for 10 minutes.
  return await getSignedUrl(presignClient, putCmd, { expiresIn: 600 });
}
```

Key gotcha: AWS SDK v3 adds checksum query params that break browser PUT requests. Set requestChecksumCalculation: 'WHEN_REQUIRED' to fix.
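
The browser half of the upload is not shown. A minimal sketch, with uploadToS3 as a hypothetical name: if the URL was signed with headers (my understanding is that the presigner signs x-amz-server-side-encryption as a header, but verify this against your own URLs), the PUT must send the same headers, so they are passed in rather than hard-coded. fetchImpl is injectable only to make the sketch testable.

```javascript
// Hypothetical browser-side upload to a presigned URL.
async function uploadToS3(presignedUrl, file, signedHeaders = {}, fetchImpl = fetch) {
  const res = await fetchImpl(presignedUrl, {
    method: 'PUT',
    headers: signedHeaders, // must match whatever headers were signed
    body: file, // a File or Blob from an <input type="file">
  });
  if (!res.ok) {
    throw new Error(`S3 upload failed: HTTP ${res.status}`);
  }
  return true;
}
```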

Stage 2: Text Extraction with OCR Fallback

We try pdf-parse first (fast, digital PDFs), then fall back to Poppler + Tesseract for scanned documents.
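
The fallback decision can be sketched as follows. Everything here is an assumption layered on the one sentence above: parseDigital and runOcr are injected async functions (for example, wrappers around pdf-parse and a Poppler + Tesseract pipeline) that resolve to { text, pages }, and the characters-per-page threshold is an illustrative heuristic for spotting scanned documents.

```javascript
// Hypothetical orchestration of the two extraction paths.
async function extractText(pdfBuffer, { parseDigital, runOcr, minCharsPerPage = 50 }) {
  try {
    const { text, pages } = await parseDigital(pdfBuffer);
    // Scanned PDFs often parse without error but yield almost no text.
    if (text.trim().length >= minCharsPerPage * Math.max(pages, 1)) {
      return { text, method: 'digital' };
    }
  } catch {
    // Malformed or image-only PDF: fall through to OCR.
  }
  const { text } = await runOcr(pdfBuffer);
  return { text, method: 'ocr' };
}
```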

Stage 3: AI Analysis with Claude

We use Claude's streaming Messages API. Rate limiting is handled with exponential backoff and user-visible status messages.
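
A sketch of the backoff loop, under stated assumptions: errors carry an HTTP status on err.status, retries apply to 429 and 5xx responses, and withBackoff, onStatus, and the delay constants are all hypothetical choices rather than the production values.

```javascript
// Hypothetical retry wrapper. `call` performs one API request; `onStatus`
// receives a user-visible message; `sleep` is injectable for tests.
async function withBackoff(call, {
  maxRetries = 5,
  baseMs = 1000,
  maxMs = 30000,
  onStatus = () => {},
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
} = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      const retryable = err.status === 429 || err.status >= 500;
      if (!retryable || attempt >= maxRetries) throw err;
      // Double the wait each attempt, cap it, and add jitter so parallel
      // uploads do not retry in lockstep.
      const delay = Math.min(maxMs, baseMs * 2 ** attempt) * (0.5 + Math.random() / 2);
      onStatus(`Claude is busy, retrying in ${Math.ceil(delay / 1000)}s...`);
      await sleep(delay);
    }
  }
}
```

The onStatus callback is where the user-visible message would be pushed into the SSE stream.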

Stage 4: SSE Streaming

Server-Sent Events give us real-time streaming from server to browser. We use fetch + ReadableStream instead of EventSource because we need POST requests.
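
Because fetch hands back raw bytes rather than parsed events, the client has to split the stream itself. A sketch under assumptions: createSseParser and streamAnalysis are hypothetical names, the '/api/analyze' path is a placeholder, and the parser only handles "\n" line endings and data: fields.

```javascript
// Hypothetical incremental SSE parser. Feed it decoded text chunks; it calls
// onEvent(data) for each complete event and skips ":" comment lines.
function createSseParser(onEvent) {
  let buffer = '';
  return function feed(chunk) {
    buffer += chunk;
    let sep;
    // Events are separated by a blank line.
    while ((sep = buffer.indexOf('\n\n')) !== -1) {
      const block = buffer.slice(0, sep);
      buffer = buffer.slice(sep + 2);
      const data = block
        .split('\n')
        .filter((line) => line.startsWith('data:'))
        .map((line) => line.slice(5).replace(/^ /, ''))
        .join('\n');
      if (data) onEvent(data);
    }
  };
}

// Reading a POST response body as an event stream in the browser.
async function streamAnalysis(payload, onEvent) {
  const res = await fetch('/api/analyze', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  });
  const feed = createSseParser(onEvent);
  const reader = res.body.pipeThrough(new TextDecoderStream()).getReader();
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    feed(value);
  }
}
```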

Critical for Railway: Send headers immediately and keepalive comments every 30s to prevent proxy timeouts.
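
On the server side, that advice might translate into something like the sketch below for a plain Node http response. openSse is a hypothetical helper, and schedule is injectable only so the timer can be driven in a test.

```javascript
// Hypothetical helper: open an SSE stream on a Node http.ServerResponse.
function openSse(res, { keepaliveMs = 30000, schedule = setInterval } = {}) {
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    Connection: 'keep-alive',
  });
  res.flushHeaders(); // send headers now, before any analysis output exists
  // Lines starting with ":" are SSE comments: clients ignore them, but the
  // proxy sees traffic and keeps the connection open.
  const timer = schedule(() => res.write(': keepalive\n\n'), keepaliveMs);
  res.on('close', () => clearInterval(timer));
  return {
    send: (data) => res.write(`data: ${JSON.stringify(data)}\n\n`),
    end: () => {
      clearInterval(timer);
      res.end();
    },
  };
}
```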

Results

1,500 pages processed in 3-5 minutes vs. 20-40 hours manually. SSE streaming means users see the timeline being built in real time.

Stack: Node.js 20+, Claude API, AWS S3, Poppler + Tesseract, React + Vite, Railway

John Mahoney builds AI tools for medical malpractice litigation at medicalai.law.

Building an AI Pipeline to Process 10,000+ Pages of Medical Records

John Mahoney — Tue, 07 Apr 2026 02:36:50 +0000