<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kiran Baby</title>
    <description>The latest articles on DEV Community by Kiran Baby (@kiranbaby14).</description>
    <link>https://dev.to/kiranbaby14</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F744589%2F12dd11a6-9fe2-4dd2-b2d1-262e7f7e7567.jpeg</url>
      <title>DEV Community: Kiran Baby</title>
      <link>https://dev.to/kiranbaby14</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kiranbaby14"/>
    <language>en</language>
    <item>
      <title>I built a real-time 3D map of London Underground trains</title>
      <dc:creator>Kiran Baby</dc:creator>
      <pubDate>Sat, 04 Apr 2026 15:33:45 +0000</pubDate>
      <link>https://dev.to/kiranbaby14/i-built-a-real-time-3d-map-of-london-underground-trains-4fl2</link>
      <guid>https://dev.to/kiranbaby14/i-built-a-real-time-3d-map-of-london-underground-trains-4fl2</guid>
      <description>&lt;h2&gt;
  
  
  The idea
&lt;/h2&gt;

&lt;p&gt;TFL (Transport for London) publishes live arrival predictions for every tube train through their Unified API. The data is free and I thought: what if I could take those predictions and actually &lt;em&gt;show&lt;/em&gt; every train moving across London in real time, on a 3D map?&lt;/p&gt;

&lt;p&gt;Turns out you can. But the gap between "TFL gives you arrival times" and "smooth 3D trains gliding along accurate track geometry" is way bigger than I expected. This post is about everything that lives in that gap.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwl1z3riwq5tczgfpor0h.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwl1z3riwq5tczgfpor0h.gif" alt="3D map of London showing live tube trains moving along their routes in real time" width="600" height="296"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live demo:&lt;/strong&gt; &lt;a href="https://minilondon3d.xyz" rel="noopener noreferrer"&gt;minilondon3d.xyz&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can tap any train to see its full route and upcoming stops, or tap a station to see all approaching trains with live countdowns. There's also a service status panel showing disruptions across all lines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture overview
&lt;/h2&gt;

&lt;p&gt;The system has three main pieces:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A Python worker&lt;/strong&gt; that polls the TFL API every 60 seconds, processes the raw data, and writes to Redis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A FastAPI server&lt;/strong&gt; that reads from Redis and pushes updates to frontends over WebSocket&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A Next.js frontend&lt;/strong&gt; that receives train data, animates positions along polylines using &lt;code&gt;requestAnimationFrame&lt;/code&gt;, and renders everything with Three.js on top of MapLibre GL&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The worker and API server are separate processes. They communicate entirely through Redis: the worker writes cached train data and publishes updates on a pub/sub channel, the API server subscribes and relays to WebSocket clients. This means I can restart either one independently, and the API server stays completely stateless.&lt;/p&gt;
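&lt;p&gt;A minimal sketch of the message the worker might hand to Redis (the cache key, channel name, and payload fields below are invented for illustration, not taken from the project):&lt;/p&gt;

```python
import json
import time

# Hypothetical shape of one worker update. In the real system the worker
# would also SET this under a cache key and PUBLISH it on a pub/sub
# channel; the key and channel names here are illustrative.
CHANNEL = "trains:updates"

def build_update(line_id, trains):
    """Return (cache_key, message) for one line's refreshed train set."""
    payload = {
        "line": line_id,
        "generated_at": int(time.time()),
        "trains": trains,
    }
    # Worker side:  r.set(key, message); r.publish(CHANNEL, message)
    # API side:     relay every message on CHANNEL to WebSocket clients
    return f"trains:{line_id}", json.dumps(payload)
```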

&lt;h2&gt;
  
  
  Building the foundation: static route data
&lt;/h2&gt;

&lt;p&gt;Before a single train can be placed on the map, the system needs two things: accurate track geometry (the physical shape of each line on a map) and a station coordinate index (where every station is and how stations are ordered along each line). These come from completely different sources and get loaded once at startup.&lt;/p&gt;

&lt;h3&gt;
  
  
  Track geometry: GeoJSON
&lt;/h3&gt;

&lt;p&gt;TFL's own route endpoint only gives you straight lines between stations, which looks terrible on a map. Real tube lines curve, run parallel, split at junctions. I got accurate geometry from &lt;a href="https://github.com/oobrien/vis/blob/master/tubecreature/data/tfl_lines.json" rel="noopener noreferrer"&gt;Oliver O'Brien's GeoJSON file&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The problem is it comes as many tiny disconnected line fragments. The server's first job at startup is chaining these into continuous polylines. The algorithm picks a fragment, scans for others whose start or end matches within about 5 meters, flips and chains them together, and repeats until nothing else connects. This reduces something like 47 fragments down to 3-5 continuous polylines per line (roughly one per branch).&lt;/p&gt;
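&lt;p&gt;The chaining step can be sketched like this (a simplified version: the ~5 m tolerance is approximated in degrees, and real code would use a spatial index rather than a linear scan):&lt;/p&gt;

```python
TOL = 5 / 111_000  # ~5 metres expressed in degrees of latitude

def close(a, b, tol=TOL):
    """Endpoint match within tolerance."""
    return abs(a[0] - b[0]) <= tol and abs(a[1] - b[1]) <= tol

def chain_fragments(fragments):
    """fragments: list of [(lon, lat), ...]. Returns merged polylines."""
    pool = [list(f) for f in fragments]
    chains = []
    while pool:
        chain = pool.pop()
        grew = True
        while grew:  # keep absorbing fragments until nothing connects
            grew = False
            for i, frag in enumerate(pool):
                if close(chain[-1], frag[0]):       # append as-is
                    chain += frag[1:]
                elif close(chain[-1], frag[-1]):    # append flipped
                    chain += frag[-2::-1]
                elif close(chain[0], frag[-1]):     # prepend as-is
                    chain = frag[:-1] + chain
                elif close(chain[0], frag[0]):      # prepend flipped
                    chain = frag[:0:-1] + chain
                else:
                    continue
                pool.pop(i)
                grew = True
                break
        chains.append(chain)
    return chains
```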

&lt;h3&gt;
  
  
  Station data: TFL's Route/Sequence endpoint
&lt;/h3&gt;

&lt;p&gt;The live arrivals endpoint gives you &lt;code&gt;naptanId&lt;/code&gt;, &lt;code&gt;stationName&lt;/code&gt;, &lt;code&gt;timeToStation&lt;/code&gt;, and other prediction data for each upcoming stop, but &lt;em&gt;no coordinates&lt;/em&gt;. No lat, no lon. If you want to know where King's Cross actually is on a map, you need a separate data source.&lt;/p&gt;

&lt;p&gt;That source is TFL's &lt;code&gt;/Line/{id}/Route/Sequence/{direction}&lt;/code&gt; endpoint, which I fetch for every line in both directions at startup. This endpoint returns two important structures:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;stopPointSequences&lt;/code&gt;&lt;/strong&gt; is where station coordinates and branch topology live. Each entry in this array represents a &lt;em&gt;branch&lt;/em&gt;: a segment of track between junctions. For a simple line like Victoria, there's basically one branch. For the Northern line, there are several.&lt;/p&gt;

&lt;p&gt;Each branch contains an ordered list of stop points with their naptan IDs, names, coordinates, zones, disruption flags, and which other lines serve that station. It also contains &lt;code&gt;nextBranchIds&lt;/code&gt; and &lt;code&gt;prevBranchIds&lt;/code&gt;, which describe how branches connect at junctions. This is essentially a graph of the line's topology.&lt;/p&gt;

&lt;p&gt;I use this branch graph for validation. Before attempting polyline extrapolation between two consecutive stops, I check that both stops appear in the same branch. If they don't, the two stops straddle a junction and polyline math would project the train onto the wrong branch.&lt;/p&gt;
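&lt;p&gt;The check itself is tiny once the topology is loaded. A toy stand-in (the station codes below are invented placeholders, not real naptan IDs):&lt;/p&gt;

```python
# Toy stand-in for the branch topology built from stopPointSequences.
branches = {
    "branch-1": ["AAA", "BBB", "CCC"],
    "branch-2": ["CCC", "DDD"],  # joins branch-1 at the junction CCC
}

def same_branch(stop_a, stop_b):
    """True if both stops appear in one branch's stop list."""
    return any(stop_a in stops and stop_b in stops
               for stops in branches.values())

# same_branch("AAA", "BBB") -> True: safe to extrapolate along the polyline
# same_branch("BBB", "DDD") -> False: the pair straddles a junction
```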

&lt;p&gt;&lt;strong&gt;&lt;code&gt;orderedLineRoutes&lt;/code&gt;&lt;/strong&gt; gives complete end-to-end route variants. While &lt;code&gt;stopPointSequences&lt;/code&gt; gives you the physical branch segments, &lt;code&gt;orderedLineRoutes&lt;/code&gt; gives you the full journeys that trains actually make. Each variant is a name and an ordered list of naptan IDs covering the full route from origin to terminus.&lt;/p&gt;

&lt;p&gt;These variants are critical for data cleaning. When a train shows up with a list of predicted stops, I match it against all variants by finding the one whose naptan ID list contains the most of the train's stops. This tells me which specific route the train is running, which in turn gives me the correct geographical ordering of stations. That ordering becomes the source of truth for detecting and filtering out stale or inconsistent predictions.&lt;/p&gt;
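&lt;p&gt;The matching is just a best-overlap score. A sketch (variant names and IDs below are made up):&lt;/p&gt;

```python
def match_variant(train_stops, variants):
    """variants: {variant_name: [naptan ids in geographical order]}.
    Pick the variant whose naptan list contains the most of the
    train's predicted stops."""
    def overlap(naptans):
        idset = set(naptans)
        return sum(1 for stop in train_stops if stop in idset)
    return max(variants, key=lambda name: overlap(variants[name]))

# The winning variant's ordered list then fixes the geographical station
# ordering used later to filter stale predictions.
```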

&lt;p&gt;All of this gets built into an in-memory station coordinate index (keyed by both naptan ID and normalized station name), a set of route variant definitions, and a branch topology graph. The whole thing gets cached in Redis so it survives restarts without re-fetching from TFL.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pre-computed segment table
&lt;/h3&gt;

&lt;p&gt;With both the track geometry and station positions loaded, the server does one more thing at startup: it walks every route variant's station list pairwise, snaps each station onto the polylines, slices the polyline between each consecutive pair of stations, and stores the resulting geometry. Both forward and reverse directions are cached.&lt;/p&gt;

&lt;p&gt;When a user later clicks a train and requests its full route path, the server just concatenates pre-computed segments. No geometry work at request time. This brought the path endpoint from 200-400ms down to essentially a dictionary lookup.&lt;/p&gt;
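&lt;p&gt;At request time, the lookup amounts to concatenating cached slices (the keys and coordinates below are illustrative):&lt;/p&gt;

```python
# Pre-computed at startup: (from_naptan, to_naptan) -> sliced polyline.
segments = {
    ("A", "B"): [(0.0, 0.0), (0.1, 0.1)],
    ("B", "C"): [(0.1, 0.1), (0.2, 0.1), (0.3, 0.2)],
}

def route_path(stops):
    """Concatenate cached slices for consecutive station pairs."""
    path = []
    for a, b in zip(stops, stops[1:]):
        seg = segments[(a, b)]
        path += seg if not path else seg[1:]  # skip the duplicated joint
    return path
```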

&lt;h2&gt;
  
  
  The hard part: where is each train?
&lt;/h2&gt;

&lt;p&gt;Now we have geometry, station coordinates, route variants, and branch topology all loaded. The worker starts polling TFL's arrivals endpoint every 60 seconds, which returns raw predictions for every active train across all lines.&lt;/p&gt;

&lt;p&gt;But remember: TFL doesn't give you GPS coordinates for trains. What you get is a list of arrival predictions: "Vehicle X will arrive at Y in Z seconds".&lt;/p&gt;

&lt;p&gt;For each prediction, we know the naptan ID of the station, but we look up that station's lat/lon from our own pre-built station coordinate index. TFL never tells us where the &lt;em&gt;train&lt;/em&gt; is. We have to figure that out.&lt;/p&gt;

&lt;h3&gt;
  
  
  Polyline-based extrapolation (the good one)
&lt;/h3&gt;

&lt;p&gt;Take the train's next two upcoming stops. Look up their coordinates from our station index. Snap both onto the line's track polylines. If they both land on the same polyline segment, extrapolate &lt;em&gt;backward&lt;/em&gt; from the first stop. The frontend then smooths between successive positions with interpolated animations.&lt;/p&gt;

&lt;p&gt;Before doing this, the system checks the branch graph to confirm both stops are on the same branch. This prevents the math from projecting Northern line trains onto the wrong branch at junctions.&lt;/p&gt;
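&lt;p&gt;Reduced to distance-along-track, the extrapolation looks something like this (a 1D sketch assuming constant speed between the two stops; the real code works on snapped polyline offsets):&lt;/p&gt;

```python
def extrapolate_offset(d1, d2, t1, t2):
    """d1/d2: track offsets (metres) of the next two stops;
    t1/t2: their timeToStation values (seconds)."""
    if t2 <= t1:
        return d1  # degenerate predictions: park at the next stop
    speed = (d2 - d1) / (t2 - t1)      # implied speed between the stops
    return max(0.0, d1 - speed * t1)   # walk backward from the first stop
```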

&lt;h3&gt;
  
  
  Straight-line extrapolation (fallback)
&lt;/h3&gt;

&lt;p&gt;When polyline snapping fails (the two stops land on different polyline segments, or there's a junction crossing between them), I fall back to simple linear extrapolation between the two station coordinates. Less accurate, but it keeps the train roughly where it should be.&lt;/p&gt;

&lt;p&gt;Both approaches require at least two upcoming stops with valid coordinates. In the rare cases where that's not available (a train approaching its terminus with only one prediction left, or a station coordinate lookup failure), the system falls back to parsing TFL's &lt;code&gt;currentLocation&lt;/code&gt; text (strings like &lt;code&gt;"Between King's Cross and Angel"&lt;/code&gt; or &lt;code&gt;"At Finsbury Park"&lt;/code&gt;), and as a last resort, just places the train at its nearest upcoming stop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Snapping to tracks (and not the wrong track)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Bearing-aware snapping
&lt;/h3&gt;

&lt;p&gt;Here's a problem I didn't anticipate. The Northern line has two branches through central London (Bank and Charing Cross) that run geographically very close together. A naive "snap to nearest polyline" approach would sometimes snap a Bank branch train onto the Charing Cross tracks, because they're only a few hundred meters apart.&lt;/p&gt;

&lt;p&gt;The fix: bearing-aware snapping. When snapping a point, I also pass the estimated direction of travel. The algorithm scores each candidate segment by combining distance &lt;em&gt;and&lt;/em&gt; bearing alignment, with the bearing penalty weighted 3x relative to distance. A segment that's farther away but aligned with the train's direction of travel beats a closer segment that's angled off. Anything more than 60 degrees off the train's bearing gets rejected entirely before scoring.&lt;/p&gt;
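&lt;p&gt;A toy version of the scoring (units and exact weights are illustrative; the real implementation scores polyline segments, not labeled candidates):&lt;/p&gt;

```python
def bearing_diff(a, b):
    """Smallest absolute angle between two bearings, in degrees (0-180)."""
    return abs((a - b + 180) % 360 - 180)

def best_segment(candidates, train_bearing):
    """candidates: list of (segment_id, distance_m, segment_bearing)."""
    best = None
    for seg_id, dist_m, seg_bearing in candidates:
        diff = bearing_diff(seg_bearing, train_bearing)
        if diff > 60:
            continue  # hard reject: pointing the wrong way
        score = dist_m + 3 * diff  # bearing penalty weighted 3x
        if best is None or score < best[0]:
            best = (score, seg_id)
    return best[1] if best else None

# A closer segment angled 90 degrees off loses to a farther, aligned one:
# best_segment([("bank", 100, 90), ("charing", 300, 5)], 0) -> "charing"
```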

&lt;p&gt;This single change fixed most of the "train on wrong branch" bugs.&lt;/p&gt;

&lt;h2&gt;
  
  
  TFL data quality issues (there are many)
&lt;/h2&gt;

&lt;p&gt;Working with TFL arrival predictions taught me that real-time transit data is messy in ways I really didn't expect.&lt;/p&gt;

&lt;h3&gt;
  
  
  Duplicate vehicle IDs across lines
&lt;/h3&gt;

&lt;p&gt;TFL reuses numeric vehicle IDs across different lines. Vehicle "240" on Bakerloo and "240" on Piccadilly are completely different physical trains. If you group predictions by just &lt;code&gt;vehicleId&lt;/code&gt;, you get Frankenstein trains with stops on two different lines. I group by &lt;code&gt;(vehicleId, lineId)&lt;/code&gt; and create composite IDs like &lt;code&gt;bakerloo_240&lt;/code&gt; to keep them separate.&lt;/p&gt;
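&lt;p&gt;The grouping is one composite key away (field names below mirror the TFL prediction fields mentioned above):&lt;/p&gt;

```python
from collections import defaultdict

def group_trains(predictions):
    """Group by (lineId, vehicleId) so vehicle "240" on two different
    lines never merges into one phantom train."""
    trains = defaultdict(list)
    for p in predictions:
        trains[f"{p['lineId']}_{p['vehicleId']}"].append(p)
    return dict(trains)
```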

&lt;h3&gt;
  
  
  DLR has no vehicle IDs at all
&lt;/h3&gt;

&lt;p&gt;DLR trains are driverless and TFL doesn't assign them vehicle IDs in the arrivals endpoint. Every single prediction comes with &lt;code&gt;vehicleId: "000"&lt;/code&gt;. To get DLR trains on the map, I synthesize unique IDs by combining the line, destination, direction, and a rank within each station group. It's a hack, but it works.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stale predictions and mixed snapshots
&lt;/h3&gt;

&lt;p&gt;This was the nastiest data quality issue and it took me the longest to figure out.&lt;/p&gt;

&lt;p&gt;You get situations where a train has already passed a station, but the API still reports a low &lt;code&gt;timeToStation&lt;/code&gt; prediction for it. Or the same train has predictions from two different moments in time, so the time-ordered stop list disagrees with the actual geographical order of the stations.&lt;/p&gt;

&lt;p&gt;If you just sort by &lt;code&gt;timeToStation&lt;/code&gt; and trust it, you get trains that appear to jump backward or zigzag.&lt;/p&gt;

&lt;p&gt;The fix uses the route variant's geographical station ordering as the source of truth (this is why resolving the variant first matters so much). First, I drop any stops that are geographically behind the train's anchor position along the variant. Next, I sort the remaining stops by their position along the route, not by time. Then I walk backward through the list and drop any stop whose time exceeds a later stop's time (a reverse monotonic filter). Finally, there's a plausibility check: if the first stop is many stations away from the second but only a few seconds apart in time, the first one is stale and gets dropped. A tube train needs at least some minimum travel time per station, so "5 stations in 20 seconds" is obviously wrong.&lt;/p&gt;
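&lt;p&gt;The steps above can be sketched as one cleaning pass. Here &lt;code&gt;order&lt;/code&gt; maps naptan ID to its index along the resolved route variant, and the minimum-seconds-per-station constant is an assumed value, not a figure from the project:&lt;/p&gt;

```python
MIN_SECS_PER_STATION = 45  # assumed plausibility constant

def clean_stops(stops, order, anchor_index):
    # 1. drop stops geographically behind the train's anchor position
    ahead = [s for s in stops if order[s["naptanId"]] >= anchor_index]
    # 2. sort by position along the route, not by predicted time
    ahead.sort(key=lambda s: order[s["naptanId"]])
    # 3. reverse monotonic filter: drop any stop whose time exceeds a
    #    later stop's time
    kept, min_later = [], float("inf")
    for s in reversed(ahead):
        if s["timeToStation"] <= min_later:
            kept.append(s)
            min_later = s["timeToStation"]
    kept.reverse()
    # 4. plausibility: a first stop several stations before the second
    #    but only seconds earlier is a stale prediction
    if len(kept) >= 2:
        gap = order[kept[1]["naptanId"]] - order[kept[0]["naptanId"]]
        dt = kept[1]["timeToStation"] - kept[0]["timeToStation"]
        if gap >= 2 and dt < gap * MIN_SECS_PER_STATION:
            kept = kept[1:]
    return kept
```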

&lt;h3&gt;
  
  
  Duplicate platform predictions
&lt;/h3&gt;

&lt;p&gt;TFL returns one prediction per &lt;em&gt;platform&lt;/em&gt;, not per station. At shared termini you can get 5+ predictions for a single stop because TFL broadcasts to all possible platforms before the platform is assigned. I deduplicate by &lt;code&gt;naptanId&lt;/code&gt; to keep one entry per station.&lt;/p&gt;
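&lt;p&gt;A small sketch of the deduplication (keeping the earliest prediction per &lt;code&gt;naptanId&lt;/code&gt; is my assumption here; the post only says it deduplicates by that key):&lt;/p&gt;

```python
def dedupe_platforms(predictions):
    """Collapse per-platform predictions to one per station."""
    best = {}
    for p in predictions:
        cur = best.get(p["naptanId"])
        if cur is None or p["timeToStation"] < cur["timeToStation"]:
            best[p["naptanId"]] = p
    return list(best.values())
```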

&lt;h3&gt;
  
  
  Station name inconsistencies
&lt;/h3&gt;

&lt;p&gt;TFL spells the same station differently across endpoints and lines. I built a normalizer to handle this, mainly so the station index doesn't end up with duplicate entries and so display names stay consistent across the UI. It also serves as a fallback for the rare cases where a naptan ID lookup fails and the system has to match by station name instead.&lt;/p&gt;
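&lt;p&gt;The post doesn't list the normalizer's exact rules, so this only shows the shape such a function might take: lowercase, drop apostrophes, strip "Station" suffixes, collapse punctuation and whitespace:&lt;/p&gt;

```python
import re

def normalize_station(name):
    """Illustrative normalizer; the project's actual rules may differ."""
    n = name.lower().replace("'", "")
    n = re.sub(r"\s*\b(underground|rail|dlr)?\s*station\b", "", n)
    n = re.sub(r"[^a-z0-9]+", " ", n)  # collapse punctuation/whitespace
    return n.strip()
```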

&lt;h2&gt;
  
  
  The frontend: animating 400+ trains at 60fps
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Client-side polyline animation
&lt;/h3&gt;

&lt;p&gt;The backend sends updated positions every 60 seconds. If I rendered those directly, trains would teleport every minute. Instead, the frontend builds an animation chain for each train.&lt;/p&gt;
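&lt;p&gt;The core of each animation step is placing the train at a fraction of the way along its polyline, with the fraction advancing over the 60 seconds between updates. The real frontend does this in TypeScript inside a &lt;code&gt;requestAnimationFrame&lt;/code&gt; loop; here is the interpolation idea in Python for brevity:&lt;/p&gt;

```python
def point_along(polyline, fraction):
    """Position at `fraction` (0..1) of the polyline's total length."""
    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    lengths = [dist(a, b) for a, b in zip(polyline, polyline[1:])]
    target = max(0.0, min(1.0, fraction)) * sum(lengths)
    for (a, b), seg in zip(zip(polyline, polyline[1:]), lengths):
        if target <= seg and seg > 0:
            t = target / seg  # linear interpolation within this segment
            return (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
        target -= seg
    return polyline[-1]

# Each frame: position = point_along(path, elapsed_seconds / 60)
```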

&lt;h3&gt;
  
  
  Three.js instanced rendering
&lt;/h3&gt;

&lt;p&gt;Every train is rendered as a small 3D shape (a simplified body with a pointed roof) using Three.js. All trains live in a single Three.js &lt;code&gt;InstancedMesh&lt;/code&gt; added to MapLibre as a custom layer, sharing the same WebGL context as the map. Since MapLibre doesn't know the trains exist, clicking on them requires manual raycasting against the instanced mesh.&lt;/p&gt;

&lt;h2&gt;
  
  
  WebSocket with REST fallback
&lt;/h2&gt;

&lt;p&gt;The frontend opens a WebSocket connection for real-time pushes. If the connection drops, it automatically falls back to REST polling every 30 seconds while attempting to reconnect with exponential backoff (starting at 1 second, capping at 30 seconds).&lt;/p&gt;

&lt;p&gt;On the backend side, the worker never talks to WebSocket clients directly. It publishes updates to a Redis pub/sub channel. The API server subscribes to that channel and relays messages to all connected WebSocket clients, grouped by subscription rooms.&lt;/p&gt;

&lt;p&gt;If you've read this far, thanks ❤️ I'd love to know your feedback!&lt;/p&gt;

</description>
      <category>python</category>
      <category>webdev</category>
      <category>threejs</category>
      <category>showdev</category>
    </item>
    <item>
      <title>🌍 I Built MapMeet: A 3D Globe Event Platform for the Mux + DEV Challenge</title>
      <dc:creator>Kiran Baby</dc:creator>
      <pubDate>Wed, 31 Dec 2025 19:13:04 +0000</pubDate>
      <link>https://dev.to/kiranbaby14/i-built-mapmeet-a-3d-globe-event-platform-for-the-mux-dev-challenge-5ai7</link>
      <guid>https://dev.to/kiranbaby14/i-built-mapmeet-a-3d-globe-event-platform-for-the-mux-dev-challenge-5ai7</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/mux-2025-12-03"&gt;DEV's Worldwide Show and Tell Challenge Presented by Mux&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🦈 Alright Sharks... I Mean, Judges!
&lt;/h2&gt;

&lt;p&gt;I'll be honest with you. I've binge-watched way too many episodes of Shark Tank. The drama, the pitches, the "I'm out" moments... I'm completely hooked.&lt;/p&gt;

&lt;p&gt;So when I saw this challenge was literally described as &lt;strong&gt;"Shark Tank but without the sharks"&lt;/strong&gt; I knew this was my moment.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;Today I'm here to pitch you &lt;strong&gt;MapMeet&lt;/strong&gt;, a global event discovery platform that lets anyone create, discover, and join events visualized on a stunning &lt;strong&gt;interactive 3D globe&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But here's the twist that makes MapMeet different from every other event platform out there:&lt;/p&gt;

&lt;p&gt;🌐 &lt;strong&gt;Real-time geographic arcs&lt;/strong&gt; connect attendees to events on the globe. When someone RSVPs and shares their location, a beautiful animated arc draws from their location to the event showing the &lt;em&gt;global reach&lt;/em&gt; of your event in the most visually stunning way possible.&lt;/p&gt;

&lt;p&gt;Imagine hosting a hackathon and watching arcs light up from Tokyo, India, Lagos, Berlin, and San Francisco all converging on your event marker. &lt;em&gt;That's&lt;/em&gt; the MapMeet experience.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Psst... I created a live event for this hackathon so you can see it in action yourself. Link below! 👀)&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  My Pitch Video
&lt;/h2&gt;

&lt;p&gt;

&lt;iframe src="https://player.mux.com/PoySx1xvSXhMei3qc02BKKW6BVmooortd00dt1YMpt4Lg" width="710" height="399"&gt;
&lt;/iframe&gt;



&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;🌍 &lt;strong&gt;Live App:&lt;/strong&gt; &lt;a href="https://www.mapmeet.co" rel="noopener noreferrer"&gt;https://www.mapmeet.co&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  🎉 JOIN THE MAPMEET EVENT I CREATED FOR THIS HACKATHON!
&lt;/h3&gt;

&lt;p&gt;I've created a special event on MapMeet to celebrate this Mux + DEV challenge. Join it to show your support and see the platform in action! I'm on Premium so &lt;strong&gt;unlimited people can join&lt;/strong&gt; - let's see how global we can make this! 🌍&lt;/p&gt;

&lt;p&gt;I did some digging and set the event location at &lt;strong&gt;Mux HQ in San Francisco&lt;/strong&gt; so all our arcs will converge right on their doorstep 😄 Also made a custom Mux + DEV cover image for it because why not go all in?&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://www.mapmeet.co/event/62e4buqr" rel="noopener noreferrer"&gt;JOIN: MapMeet Launch Party - Mux + DEV Hackathon 🌍&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No account needed to view, just click and explore! If you RSVP, you'll become one of those beautiful arcs on the globe. Let's light it up together! 🌈&lt;/p&gt;

&lt;h2&gt;
  
  
  How MapMeet Works - Complete Overview
&lt;/h2&gt;

&lt;p&gt;

&lt;iframe src="https://player.mux.com/FfGH0201WK8LlcSO9i00Aupn4NBqxQ35zXBdB902uPIfTn8" width="710" height="399"&gt;
&lt;/iframe&gt;



&lt;/p&gt;

&lt;h2&gt;
  
  
  The Story Behind It
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem I Saw
&lt;/h3&gt;

&lt;p&gt;Every event platform feels &lt;em&gt;flat&lt;/em&gt;. You create an event, share a link, and hope people show up. There's no visual excitement, no sense of global community, no "wow factor" that makes people &lt;em&gt;want&lt;/em&gt; to share your event.&lt;/p&gt;

&lt;p&gt;I asked myself: &lt;strong&gt;What if attending an event felt like being part of something global?&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The MapMeet Vision
&lt;/h3&gt;

&lt;p&gt;MapMeet transforms event hosting into a visual experience. Concert organizers can show fans flying in from around the world. Hackathon hosts can visualize their global developer community. Marathon coordinators can display runners coming from every continent. Conference speakers can see their audience's geographic spread. Community meetups can prove their worldwide reach to sponsors.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Shareability Secret
&lt;/h3&gt;

&lt;p&gt;Here's something I'm really proud of: &lt;strong&gt;Event pages don't require login to view.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is huge. When you share your MapMeet event link on WhatsApp, Instagram, Twitter, or LinkedIn, anyone can see your stunning 3D globe visualization, view attendee arcs from around the world, read all event details, and get hyped about joining.&lt;/p&gt;

&lt;p&gt;No friction. No "sign up to see more" walls. Just pure, shareable, eye-catching event pages that make people stop scrolling and say &lt;em&gt;"Wait, what is THIS?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This means your event promotion just got a serious upgrade. Instead of sharing a boring event link, you're sharing an interactive 3D experience. That's the kind of link people actually click.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building in Public: The Real Journey
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Timeline
&lt;/h3&gt;

&lt;p&gt;I started building MapMeet around December 9th. I had the vision clear in my head: a 3D globe, real-time connections, the whole thing.&lt;/p&gt;

&lt;p&gt;But somewhere around week two, I hit a wall. You know that feeling when you're deep in code, nothing's working the way you want, and suddenly every other project idea seems more exciting? Yeah. I started drifting to other side projects, telling myself I'd come back to MapMeet "later."&lt;/p&gt;

&lt;p&gt;Then I saw this hackathon announcement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shark Tank-style pitches? Video submissions? $3,000 in prizes?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That was the kick I needed. Having a deadline and a reason to ship changed everything. I went from "maybe I'll finish this someday" to "this is going live, and I'm pitching it to the world."&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thank you, Mux and DEV, for the accountability.&lt;/em&gt; 🙏&lt;/p&gt;

&lt;h3&gt;
  
  
  First-Time Integrations
&lt;/h3&gt;

&lt;p&gt;This project pushed me into territory I'd never explored before.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔐 Supabase (First Time)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I'd heard about Supabase but never actually built with it. MapMeet uses Supabase Auth for Google OAuth and email/password authentication, Supabase Realtime for broadcasting live arc updates, and Supabase Storage for event cover images.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;💳 Stripe (First Time)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I'd never implemented payments before. The idea of handling real money in my code was honestly intimidating.&lt;/p&gt;

&lt;p&gt;But Stripe's documentation is incredible. I set up checkout sessions for upgrading to Premium, and webhooks for syncing subscription status.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson learned:&lt;/strong&gt; The integrations you're scared of are usually the ones with the best documentation. Just start.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Highlights
&lt;/h2&gt;

&lt;p&gt;While MapMeet isn't open-source (yet 👀), here's the architecture powering the platform:&lt;/p&gt;

&lt;h3&gt;
  
  
  The Stack
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Frontend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Next.js, Tailwind CSS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Backend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;FastAPI, SQLModel ORM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Database&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;PostgreSQL (on Supabase)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Auth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Supabase Auth (Google OAuth + Email/Password)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Realtime&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Supabase Realtime (broadcast channels)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Supabase Storage (event cover images)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Payments&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stripe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3D Globe&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mapbox&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The Domain &amp;amp; Hosting Setup
&lt;/h3&gt;

&lt;p&gt;Quick story: I snagged &lt;strong&gt;mapmeet.co&lt;/strong&gt; from GoDaddy because their pricing was great AND it included custom email addresses for the first year.&lt;/p&gt;

&lt;p&gt;Frontend is hosted on &lt;strong&gt;Vercel&lt;/strong&gt;. I just pointed my nameservers from GoDaddy to Vercel, and we're live with edge-fast global performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Business Model
&lt;/h3&gt;

&lt;p&gt;MapMeet runs on a freemium model:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Free&lt;/th&gt;
&lt;th&gt;Premium&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Active Events&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Unlimited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Attendees per Event&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Unlimited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Real-time Arcs&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom Marker Colors&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Price&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;$19/month&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The free tier is genuinely useful for small meetups and testing the platform. Premium unlocks MapMeet for serious event organizers who need scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use of Mux
&lt;/h2&gt;

&lt;p&gt;Let's talk about &lt;strong&gt;Mux&lt;/strong&gt; because this was a genuine discovery for me.&lt;/p&gt;

&lt;h3&gt;
  
  
  Instant Thumbnails via URL
&lt;/h3&gt;

&lt;p&gt;Need a screenshot from your video? With YouTube, you'd have to manually screenshot and upload it.&lt;/p&gt;

&lt;p&gt;With Mux? You just construct a URL. That's it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next for MapMeet?
&lt;/h2&gt;

&lt;p&gt;This hackathon was the push to ship v1, but I'm just getting started:&lt;/p&gt;

&lt;p&gt;🎥 &lt;strong&gt;Video integration&lt;/strong&gt; (now that I've discovered Mux!)&lt;br&gt;
🌐 &lt;strong&gt;Event categories&lt;/strong&gt; for better discovery&lt;br&gt;
📊 &lt;strong&gt;Analytics dashboard&lt;/strong&gt; for organizers&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Connect!
&lt;/h2&gt;

&lt;p&gt;If you've made it this far, thank you. Seriously. It means the world.&lt;/p&gt;

&lt;p&gt;Here's how you can support MapMeet:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. 🌍 Join the Hackathon Event!
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.mapmeet.co/event/62e4buqr" rel="noopener noreferrer"&gt;JOIN: MapMeet Launch Party - Mux + DEV.to Hackathon 🌍&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Be one of the arcs on the globe! Let's make this the most globally distributed hackathon celebration ever.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. 💬 Tell Me What You Think
&lt;/h3&gt;

&lt;p&gt;Drop a comment below. I read and respond to every single one.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. ❤️ React If This Resonated
&lt;/h3&gt;

&lt;h3&gt;
  
  
  4. 🔗 Share With Event Organizers
&lt;/h3&gt;

&lt;p&gt;Know someone who hosts meetups, conferences, or hackathons? Share MapMeet with them!&lt;/p&gt;

&lt;h2&gt;
  
  
  One Last Thing
&lt;/h2&gt;

&lt;p&gt;Building MapMeet taught me that the scariest part of any project is showing it to the world. It's easy to keep tweaking forever, telling yourself "it's not ready yet."&lt;/p&gt;

&lt;p&gt;This hackathon gave me a deadline and a stage. I'm grateful for that push.&lt;/p&gt;

&lt;p&gt;To everyone building something and waiting for the "right moment" to share it: &lt;strong&gt;this is your sign.&lt;/strong&gt; Ship it. Pitch it. Let the world see what you've made.&lt;/p&gt;

&lt;p&gt;The globe is waiting for your arcs. 🌍✨&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;Try MapMeet:&lt;/strong&gt; &lt;a href="https://www.mapmeet.co" rel="noopener noreferrer"&gt;https://www.mapmeet.co&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;Join the Event:&lt;/strong&gt; &lt;a href="https://www.mapmeet.co/event/62e4buqr" rel="noopener noreferrer"&gt;https://www.mapmeet.co/event/62e4buqr&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What would YOU host on a 3D globe?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A 24-hour coding marathon across time zones? A worldwide marathon watch party? A concert with fans lighting up from every continent?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Drop yours below!&lt;/strong&gt; 👇&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>muxchallenge</category>
      <category>showandtell</category>
      <category>video</category>
    </item>
    <item>
      <title>Video Libraries Made Searchable by AI</title>
      <dc:creator>Kiran Baby</dc:creator>
      <pubDate>Fri, 26 Dec 2025 12:35:13 +0000</pubDate>
      <link>https://dev.to/kiranbaby14/i-built-a-video-search-engine-that-understands-what-youre-looking-for-51m7</link>
      <guid>https://dev.to/kiranbaby14/i-built-a-video-search-engine-that-understands-what-youre-looking-for-51m7</guid>
      <description>&lt;p&gt;&lt;strong&gt;Ever tried finding that ONE moment in a 2-hour video? Yeah, me too. It sucks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Back again with another project! Hope y'all had an amazing Christmas! 🎄 Jingle bells, jingle bells, jingle all the way&lt;/em&gt; ✌️&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;You recorded a meeting. Or a lecture. Or your kid's recital. Now you need to find that specific part where someone said something important, or that exact scene you vaguely remember.&lt;/p&gt;

&lt;p&gt;Your options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Scrub through the entire video like a caveman&lt;/li&gt;
&lt;li&gt;Hope YouTube's auto-chapters got it right (they didn't)&lt;/li&gt;
&lt;li&gt;Give up and rewatch the whole thing&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;What if you could just... describe what you're looking for?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;"Find the part where he talks about the budget"&lt;/p&gt;

&lt;p&gt;"Show me when there's a red car on screen"&lt;/p&gt;

&lt;p&gt;"Jump to where she mentions the deadline"&lt;/p&gt;

&lt;p&gt;That's what I built.&lt;/p&gt;




&lt;h2&gt;
  
  
  Introducing SearchLightAI 🔦
&lt;/h2&gt;

&lt;p&gt;SearchLightAI lets you search your videos by describing what you see OR what was said. Upload a video, wait for it to process, then search with natural language.&lt;/p&gt;

&lt;p&gt;It returns the exact timestamp. Click it. You're there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Search your videos like you search your documents.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Tech Stack 🤓
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Tech&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;FastAPI + SQLModel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Databases&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;PostgreSQL (metadata) + Qdrant (vectors)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vision AI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SigLIP2 (google/siglip2-base-patch16-512)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Speech AI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;faster-whisper + Sentence Transformers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Video Processing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;FFmpeg + PySceneDetect&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Frontend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Next.js 16, React 19, Tailwind CSS, shadcn/ui&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;📥 Ingestion Pipeline&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Video Upload
    ↓
PySceneDetect → finds scene changes
    ↓
FFmpeg → extracts keyframes + audio
    ↓
faster-whisper → transcribes speech
    ↓
SigLIP2 → embeds keyframes (768-dim)
Sentence Transformers → embeds transcript (384-dim)
    ↓
Qdrant → stores all vectors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;🔍 Search Pipeline&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your query: "when he talks about the budget"
    ↓
Same models embed your query
    ↓
Cosine similarity search in Qdrant
    ↓
Results ranked by relevance
    ↓
Click → jump to exact timestamp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
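&lt;p&gt;The ranking step is just cosine similarity between the query embedding and every stored segment embedding. Here's a minimal pure-Python sketch of that one step (Qdrant does this at scale with proper indexing; the toy 3-dim vectors below stand in for the real 768/384-dim embeddings):&lt;/p&gt;

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def rank_segments(query_vec, segments, top_k=3):
    """Rank (timestamp, vector) pairs by similarity to the query."""
    scored = [(ts, cosine(query_vec, vec)) for ts, vec in segments]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

# Toy 3-dim "embeddings"; the real ones are 768-dim (SigLIP2) / 384-dim (text).
segments = [
    (12.5, [0.9, 0.1, 0.0]),
    (48.0, [0.1, 0.9, 0.1]),
    (95.2, [0.8, 0.2, 0.1]),
]
print(rank_segments([1.0, 0.0, 0.0], segments, top_k=2))  # best-matching timestamps first
```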






&lt;h2&gt;
  
  
  Three Search Modes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;🎬 Visual Search&lt;/strong&gt; - Describe what you see&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"man standing near whiteboard"&lt;/li&gt;
&lt;li&gt;"outdoor scene with trees"&lt;/li&gt;
&lt;li&gt;"someone holding a laptop"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🎤 Speech Search&lt;/strong&gt; - What was said&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"when they mentioned the quarterly results"&lt;/li&gt;
&lt;li&gt;"the part about machine learning"&lt;/li&gt;
&lt;li&gt;"discussion about the timeline"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🔀 Hybrid Search&lt;/strong&gt; - Best of both&lt;br&gt;
Combines visual and speech results. Usually what you want.&lt;/p&gt;
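&lt;p&gt;A simplified sketch of one way to combine the two result sets - max-merge the scores per timestamp so a moment found by either modality survives (the actual ranking logic in the repo is more involved; the names here are illustrative):&lt;/p&gt;

```python
def hybrid_merge(visual_hits, speech_hits, top_k=5):
    """Merge two {timestamp: score} result sets, keeping the better score
    when both modalities hit the same moment (illustrative strategy)."""
    merged = dict(visual_hits)
    for ts, score in speech_hits.items():
        merged[ts] = max(merged.get(ts, 0.0), score)
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

visual = {12.5: 0.91, 48.0: 0.40}
speech = {48.0: 0.85, 95.2: 0.77}
print(hybrid_merge(visual, speech))
# 48.0 keeps its stronger speech score: [(12.5, 0.91), (48.0, 0.85), (95.2, 0.77)]
```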


&lt;h2&gt;
  
  
  The Secret Sauce: SigLIP2
&lt;/h2&gt;

&lt;p&gt;Most visual search uses CLIP. I went with SigLIP2 instead.&lt;/p&gt;

&lt;p&gt;Why? SigLIP uses sigmoid loss instead of softmax contrastive loss. The practical difference: better zero-shot performance, especially for fine-grained visual details.&lt;/p&gt;

&lt;p&gt;One quirk though - raw SigLIP scores are lower than you'd expect. A "great match" might be 0.25-0.35 cosine similarity. So I rescale them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rescale_siglip_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cosine_score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Maps SigLIP scores to intuitive 0-1 range.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;midpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.18&lt;/span&gt;
    &lt;span class="n"&gt;steepness&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cosine_score&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;midpoint&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;steepness&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now 0.35 → ~90%, 0.25 → ~70%, which feels right in the UI.&lt;/p&gt;




&lt;h2&gt;
  
  
  Smart Keyframe Extraction
&lt;/h2&gt;

&lt;p&gt;I'm not extracting every frame (that would be insane). PySceneDetect uses adaptive content detection to find actual scene changes.&lt;/p&gt;

&lt;p&gt;For each scene, I grab:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frame at the start&lt;/li&gt;
&lt;li&gt;Frame at the middle (for scenes &amp;gt; 2 seconds)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gives good coverage without exploding storage or processing time.&lt;/p&gt;
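&lt;p&gt;In sketch form, the keyframe selection boils down to this (timestamps in seconds; the scene tuples stand in for PySceneDetect's detected boundaries):&lt;/p&gt;

```python
def keyframe_times(scenes, min_middle_len=2.0):
    """Given (start, end) scene boundaries in seconds, return the timestamps
    to extract: the scene start, plus the midpoint for scenes longer than
    `min_middle_len` seconds."""
    times = []
    for start, end in scenes:
        times.append(start)
        if end - start > min_middle_len:
            times.append((start + end) / 2)
    return times

# Example scene boundaries: a short cut, a long scene, a medium scene
scenes = [(0.0, 1.5), (1.5, 9.5), (9.5, 12.0)]
print(keyframe_times(scenes))  # [0.0, 1.5, 5.5, 9.5, 10.75]
```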




&lt;h2&gt;
  
  
  Running It Yourself
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Docker Compose (Recommended)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/kiranbaby14/searchlightai.git
&lt;span class="nb"&gt;cd &lt;/span&gt;searchlightai

&lt;span class="nb"&gt;cp &lt;/span&gt;apps/server/.env.example apps/server/.env
&lt;span class="nb"&gt;cp &lt;/span&gt;apps/client/.env.example apps/client/.env

docker-compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait for models to load (around 2-3 min first time), then:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frontend: &lt;a href="http://localhost:3000" rel="noopener noreferrer"&gt;http://localhost:3000&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;API: &lt;a href="http://localhost:8000" rel="noopener noreferrer"&gt;http://localhost:8000&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;NVIDIA GPU with CUDA support&lt;/li&gt;
&lt;li&gt;Docker + Docker Compose&lt;/li&gt;
&lt;li&gt;4GB+ VRAM should be enough (SigLIP2 + faster-whisper + Sentence Transformers are relatively lightweight)&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;⏱️ &lt;strong&gt;Heads up:&lt;/strong&gt; Processing time depends on video length. A 10-min video takes a couple minutes, but longer videos (1hr+) will need more patience. Scene detection, transcription, and embedding generation all add up.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What Could You Build With This?
&lt;/h2&gt;

&lt;p&gt;Some ideas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📹 &lt;strong&gt;Meeting search&lt;/strong&gt; - Find decisions across hundreds of recorded meetings&lt;/li&gt;
&lt;li&gt;🎓 &lt;strong&gt;Lecture navigation&lt;/strong&gt; - Students jumping to specific topics&lt;/li&gt;
&lt;li&gt;📺 &lt;strong&gt;Media asset management&lt;/strong&gt; - Search through footage libraries&lt;/li&gt;
&lt;li&gt;📱 &lt;strong&gt;Personal video search&lt;/strong&gt; - Your phone videos, finally searchable&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Code Is Yours
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/kiranbaby14/searchlightai" rel="noopener noreferrer"&gt;github.com/kiranbaby14/SearchLightAI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Star it ⭐ if you think video search should be this easy.&lt;/p&gt;




&lt;h2&gt;
  
  
  Shoutouts 🙏
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SigLIP2&lt;/strong&gt; from Google for visual embeddings that actually work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PySceneDetect&lt;/strong&gt; for making scene detection actually usable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Qdrant&lt;/strong&gt; for a vector DB that just works&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;faster-whisper&lt;/strong&gt; for Whisper that's actually fast&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  That's It. Go Break It.
&lt;/h2&gt;

&lt;p&gt;Clone it, throw your weirdest videos at it, see what breaks. File issues. Send PRs. Roast my code in the comments.&lt;/p&gt;

&lt;p&gt;The best part of putting stuff out there? Finding out all the ways you didn't think of using it.&lt;/p&gt;

&lt;p&gt;Catch you in the next one. ✌️&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with ⚡, mass Claude Code sessions, and an unhealthy amount of caffeine ☕ by &lt;a href="https://github.com/kiranbaby14" rel="noopener noreferrer"&gt;@kiranbaby14&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I Built a 3D AI Avatar That Actually Sees and Talks Back 🎭</title>
      <dc:creator>Kiran Baby</dc:creator>
      <pubDate>Fri, 26 Dec 2025 11:11:25 +0000</pubDate>
      <link>https://dev.to/kiranbaby14/i-built-a-3d-ai-avatar-that-actually-sees-and-talks-back-4j1a</link>
      <guid>https://dev.to/kiranbaby14/i-built-a-3d-ai-avatar-that-actually-sees-and-talks-back-4j1a</guid>
      <description>&lt;p&gt;&lt;strong&gt;Chatbots are so 2020. Let me show you what I built instead.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;It's been ages since I last posted here. Hope y'all had a great Christmas! 🎄 Feels good to be back.&lt;/em&gt; ✌️&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem With Every AI Assistant Right Now
&lt;/h2&gt;

&lt;p&gt;You know what's annoying? Typing. &lt;/p&gt;

&lt;p&gt;Every AI tool out there wants you to &lt;em&gt;type type type&lt;/em&gt; like it's 1995. And don't even get me started on the ones that "listen" but can't see what you're showing them.&lt;/p&gt;

&lt;p&gt;So I asked myself: &lt;strong&gt;What if I built an AI that works like an actual conversation?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;👀 &lt;strong&gt;Sees&lt;/strong&gt; what you show it (camera feed)&lt;/li&gt;
&lt;li&gt;👂 &lt;strong&gt;Hears&lt;/strong&gt; you naturally (no push-to-talk nonsense)&lt;/li&gt;
&lt;li&gt;🗣️ &lt;strong&gt;Responds&lt;/strong&gt; with voice and perfectly synced lip movements&lt;/li&gt;
&lt;li&gt;🎭 &lt;strong&gt;Expresses emotions&lt;/strong&gt; through a 3D avatar&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And runs &lt;strong&gt;100% locally&lt;/strong&gt; on your machine. No API keys bleeding your wallet dry.&lt;/p&gt;




&lt;h2&gt;
  
  
  Introducing TalkMateAI 🚀
&lt;/h2&gt;

&lt;p&gt;TalkMateAI is a real-time, multimodal AI companion. You talk to it, show it things through your camera, and it responds with natural speech while a 3D avatar lip-syncs perfectly to every word.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's like having a conversation with a character from a video game, except it's actually intelligent.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Tech Stack (For My Fellow Nerds 🤓)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Backend (Python)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FastAPI + WebSockets → Real-time bidirectional communication
PyTorch + Flash Attention 2 → GPU go brrrrr
OpenAI Whisper (tiny) → Speech recognition
SmolVLM2-256M-Video-Instruct → Vision-language understanding
Kokoro TTS → Natural voice synthesis with word-level timing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Frontend (TypeScript)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Next.js 15 → Because Turbopack is fast af
Tailwind CSS + shadcn/ui → Pretty buttons
TalkingHead.js → 3D avatar with lip-sync magic
Web Audio API + AudioWorklet → Low-latency audio processing
Native WebSocket → None of that socket.io bloat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  How It Actually Works
&lt;/h2&gt;

&lt;p&gt;Here's the flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You speak → 
  VAD detects speech → 
    Audio (+ camera frame if enabled) sent via WebSocket → 
      Whisper transcribes → 
        SmolVLM2 understands text + image together → 
          Generates response → 
            Kokoro synthesizes speech with timing data → 
              Audio + lip-sync data sent back → 
                3D avatar speaks with perfect sync
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All of this happens in &lt;strong&gt;real-time&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Secret Sauce: Native Word Timing 🎯
&lt;/h2&gt;

&lt;p&gt;Most TTS solutions give you audio and that's it. You're left guessing when each word starts for lip-sync.&lt;/p&gt;

&lt;p&gt;Kokoro TTS gives you &lt;strong&gt;word-level timing data&lt;/strong&gt; out of the box:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;speakData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;audioBuffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;words&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Hello&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;world&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;wtimes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;      &lt;span class="c1"&gt;// when each word starts&lt;/span&gt;
  &lt;span class="na"&gt;wdurations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;   &lt;span class="c1"&gt;// how long each word lasts&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="c1"&gt;// TalkingHead uses this for pixel-perfect lip sync&lt;/span&gt;
&lt;span class="nx"&gt;headRef&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;speakAudio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;speakData&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result? Lips that move &lt;em&gt;exactly&lt;/em&gt; when they should. No uncanny valley weirdness.&lt;/p&gt;




&lt;h2&gt;
  
  
  Voice Activity Detection That Actually Works
&lt;/h2&gt;

&lt;p&gt;I didn't want push-to-talk. I wanted natural conversation flow.&lt;/p&gt;

&lt;p&gt;So I built a custom VAD using the Web Audio API's AudioWorklet. It calculates energy levels in real time and tracks speech frames vs. silence frames, all on the frontend, so no backend processing power is wasted on silence.&lt;/p&gt;

&lt;p&gt;You just... talk. When you pause naturally, it processes. When you keep talking, it waits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It respects conversational flow.&lt;/strong&gt;&lt;/p&gt;
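&lt;p&gt;The core idea is simple enough to sketch in a few lines. This Python version mirrors the logic (the real thing runs as a JS AudioWorklet; the threshold and frame counts here are made-up illustration values):&lt;/p&gt;

```python
def detect_utterances(frames, threshold=0.02, end_silence_frames=3):
    """Tiny energy-based VAD sketch. `frames` is a list of sample lists;
    an utterance ends after `end_silence_frames` consecutive quiet frames.
    Returns (start, end) frame-index pairs, end exclusive."""
    utterances, in_speech, silence, start = [], False, 0, 0
    for i, frame in enumerate(frames):
        energy = sum(s * s for s in frame) / len(frame)  # mean square energy
        if energy >= threshold:
            if not in_speech:
                in_speech, start = True, i
            silence = 0
        elif in_speech:
            silence += 1
            if silence >= end_silence_frames:
                utterances.append((start, i - silence + 1))
                in_speech, silence = False, 0
    if in_speech:  # audio ended mid-utterance
        utterances.append((start, len(frames) - silence))
    return utterances

loud, quiet = [0.5] * 4, [0.0] * 4
frames = [quiet, loud, loud, quiet, quiet, quiet, loud, loud]
print(detect_utterances(frames))  # [(1, 3), (6, 8)]
```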

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Heads up:&lt;/strong&gt; This version doesn't support barge-in (interrupting the avatar mid-speech) or sophisticated turn-taking detection. It's purely pause-based - you talk, pause, it responds.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Vision Component 👁️
&lt;/h2&gt;

&lt;p&gt;Here's where it gets spicy. The camera isn't just for show.&lt;/p&gt;

&lt;p&gt;When enabled, every audio segment gets sent &lt;em&gt;with&lt;/em&gt; a camera snapshot. SmolVLM2 processes both together - the audio transcription AND what it sees.&lt;/p&gt;

&lt;p&gt;You can literally say &lt;em&gt;"What am I holding?"&lt;/em&gt; and it'll tell you.&lt;/p&gt;




&lt;h2&gt;
  
  
  Running It Yourself
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Node.js 20+&lt;/li&gt;
&lt;li&gt;Python 3.10&lt;/li&gt;
&lt;li&gt;NVIDIA GPU (~4GB+ VRAM should work; I used an RTX 3070 8GB, but the models are lightweight: Whisper tiny + SmolVLM2-256M + Kokoro TTS)&lt;/li&gt;
&lt;li&gt;PNPM &amp;amp; UV package managers&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Setup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone it&lt;/span&gt;
git clone https://github.com/kiranbaby14/TalkMateAI.git
&lt;span class="nb"&gt;cd &lt;/span&gt;TalkMateAI

&lt;span class="c"&gt;# Install everything&lt;/span&gt;
pnpm run monorepo-setup

&lt;span class="c"&gt;# Run both frontend and backend&lt;/span&gt;
pnpm dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Frontend: &lt;code&gt;http://localhost:3000&lt;/code&gt;&lt;br&gt;
Backend: &lt;code&gt;http://localhost:8000&lt;/code&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What Can You Build With This?
&lt;/h2&gt;

&lt;p&gt;This is open source. Fork it. Break it. Make it weird.&lt;/p&gt;

&lt;p&gt;Some ideas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📚 &lt;strong&gt;Language tutors&lt;/strong&gt; that watch your pronunciation&lt;/li&gt;
&lt;li&gt;🎨 &lt;strong&gt;Creative companions&lt;/strong&gt; that see your art and give feedback&lt;/li&gt;
&lt;li&gt;🔍 &lt;strong&gt;Screen assistants&lt;/strong&gt; - combine with &lt;a href="https://github.com/mediar-ai/screenpipe" rel="noopener noreferrer"&gt;Screenpipe&lt;/a&gt; for an AI that knows what you've been doing&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Code Is Yours
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/kiranbaby14/TalkMateAI" rel="noopener noreferrer"&gt;github.com/kiranbaby14/TalkMateAI&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🛠️ &lt;strong&gt;Fair warning:&lt;/strong&gt; This was a curiosity-driven project, not a polished product. There are rough edges, things I'd do differently now, and probably bugs I haven't found yet. But that's the fun of open source, right? Dig in, break stuff, make it better.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Star it ⭐ if you think chatbots should evolve.&lt;/p&gt;




&lt;h2&gt;
  
  
  Shoutouts 🙏
&lt;/h2&gt;

&lt;p&gt;Big thanks to &lt;a href="https://github.com/met4citizen" rel="noopener noreferrer"&gt;met4citizen&lt;/a&gt; for the incredible &lt;a href="https://github.com/met4citizen/TalkingHead" rel="noopener noreferrer"&gt;TalkingHead&lt;/a&gt; library. The 3D avatar rendering and lip-sync magic? That's all their work. I just plugged it in and fed it audio + timing data. Absolute legend.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Would You Build?
&lt;/h2&gt;

&lt;p&gt;Seriously, drop a comment. I want to know what wild ideas you have for real-time multimodal AI.&lt;/p&gt;

&lt;p&gt;AI that sees + hears + responds naturally? That's not the future anymore.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's right now. And you can run it on your GPU.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with ❤️ and probably too much caffeine by &lt;a href="https://github.com/kiranbaby14" rel="noopener noreferrer"&gt;@kiranbaby14&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>opensource</category>
      <category>learning</category>
    </item>
    <item>
      <title>My First Blog and My First Game</title>
      <dc:creator>Kiran Baby</dc:creator>
      <pubDate>Thu, 27 Jan 2022 07:38:42 +0000</pubDate>
      <link>https://dev.to/kiranbaby14/my-first-blog-and-my-first-game-33dd</link>
      <guid>https://dev.to/kiranbaby14/my-first-blog-and-my-first-game-33dd</guid>
      <description>&lt;p&gt;Hey guys, so I am new to the DEV community and I am really excited to share my first blog about the first game that I created. The game was named as &lt;strong&gt;"Spheron-The ball game"&lt;/strong&gt; because the protagonist of the game was obviously a 'sphere' and I don't know from where the 'spheron' name popped up in my head. But anyway, I created this game a long while ago back in 2020 while I was doing my undergrad, and I managed to complete the game and upload it to the PlayStore once the colleges were closed due to the pandemic. I guess I am thankful for that which I shouldn't be, but hey, I got a lot of free time to develop the game. The game was made using unity engine and C# as its prgramming language. As I was a beginner into game dev I looked into and learned from a lot of youtube tutorials on how to build a game using unity. Brackey's youtube channel helped me a lot, I am sure Unity devs would've at least heard of this channel once in their lifetime. I know that the game is not an extraordinary or over-the-top one but it was my first game so it holds a special place in my heart. The genre of the game is an endless runner type and you could also collect coins along the way. I would link the game at the bottom of the post so you guys can check it out if you're interested.&lt;/p&gt;

&lt;h4&gt;
  
  
  Controls
&lt;/h4&gt;

&lt;p&gt;The controls are fairly simple&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Touch the right side of the screen to move right&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Touch the left side of the screen to move left&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The objective of the game is to get the protagonist, i.e. the sphere, to dodge all the obstacles that come along the way without falling off the platform, and to rack up the highest score you can. I've also created a coin system so the player can collect coins along the way, which can later be used to buy different skins for the character and, if the player dies midway, for resurrection.&lt;/p&gt;

&lt;p&gt;I've also incorporated ads into the app - but only reward ads, so you don't have to worry about ads popping up here and there and annoying you. The ad is completely optional: once the player dies, a popup menu appears with an ad button to resurrect the player and continue playing. I used Google AdMob for the implementation. At first, I messed up with the ads - when I uploaded my game to the Play Store, I clicked the ads many times myself on my own phone, and Google, as the all-seeing eye, found out and blocked my AdMob account. It got resolved later, though.&lt;/p&gt;

&lt;p&gt;So this was my first blog. I know it took me two years to write about the first game I made, but hey, I wrote it in the end. I hope to keep writing blogs on this wonderful platform. The next one will likely be about the second game I made, and once it's done I'll update the link here. I hope you guys enjoyed reading, and if you'd like to check out my game and give me feedback, the link's down below.&lt;/p&gt;

&lt;h4&gt;
  
  
  PlayStore Link
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://play.google.com/store/apps/details?id=com.Jbk.Spheron" rel="noopener noreferrer"&gt;https://play.google.com/store/apps/details?id=com.Jbk.Spheron&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Screenshots
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmt1owbp62j2ekggl2zeo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmt1owbp62j2ekggl2zeo.png" alt=" " width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2h5buogjxgo1ghqkafgg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2h5buogjxgo1ghqkafgg.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyxj4icijjcnm8klwuy8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyxj4icijjcnm8klwuy8.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>gamedev</category>
      <category>unity3d</category>
      <category>beginners</category>
      <category>android</category>
    </item>
  </channel>
</rss>
