Everywhere I look, teams are racing to ship AI-powered features, from solo founders building chatbots to enterprise teams automating workflows. The momentum is massive, and the big players (OpenAI, Google, and Meta) are pouring billions into new models.
But here’s the truth: you don’t need their budgets to build something impressive. What you do need are the right open-source tools and frameworks that give you full control, transparency, and freedom to experiment.
After experimenting with tons of AI integrations, I’ve found a handful of open-source repositories that make building real-time, multimodal apps actually doable.
These tools let you move from idea to prototype fast: no black boxes, no vendor lock-in.
1. Stream Vision Agents: Build Real-Time Video + Audio Intelligence
One of the cooler projects I’ve seen lately is Stream Vision Agents, an open-source framework for building real-time, multimodal AI that can see, hear, and respond in milliseconds.
It’s built for developers who want to bring true intelligence to live video without being locked into a single model or transport provider.
- Open Source: Fork it, read it, improve it.
- Open Platform: Works with Stream Video or any WebRTC-based SDK.
- Flexible Providers: Plug in OpenAI Realtime, Gemini Live, or your favorite STT/TTS and vision models.
It’s a bit like LiveKit Agents, but with a bigger focus on real-time vision and multimodal intelligence.
Let’s take a look at this example:
Sports Coach:
You can spin up a golf coaching AI with YOLO and OpenAI Realtime as the brain. YOLO handles pose detection, while the Realtime API reacts to movements as they happen. No lag, no buffering.
The cool part is, it’s not just for golf. The same setup works for stuff like drone-based fire detection, sports or gaming analytics, physical therapy assistance, workout form correction, and interactive dance or movement-based games. Basically, anything that needs a live “eyes and ears” AI.
```python
# Imports assume the vision-agents package layout; check the docs if modules move.
from vision_agents.core import Agent
from vision_agents.plugins import getstream, openai, ultralytics

agent = Agent(
    edge=getstream.Edge(),               # low-latency WebRTC edge transport
    agent_user=agent_user,               # the agent's user object, created beforehand
    instructions="Read @golf_coach.md",  # coaching behavior lives in a markdown prompt
    llm=openai.Realtime(fps=10),
    # llm=gemini.Realtime(fps=1),        # careful: higher FPS gets expensive fast
    processors=[ultralytics.YOLOPoseProcessor(model_path="yolo11n-pose.pt")],
)
```
For more about Vision Agents, visit their documentation.
Star the Vision Agents repository ⭐
2. Open-Sora: High-Fidelity Text-to-Video Generation
Open-Sora is a super interesting open-source take on OpenAI’s Sora. It lets you convert text or images into short, high-quality videos that actually look stable (smooth motion, consistent frames, the whole thing). You can fine-tune it on your own datasets if you want to generate domain-specific stuff like marketing clips, story scenes, or quick simulations. It’s still early, but there’s a lot of room to experiment.
Why you’ll like it:
- Supports text-to-video and image-to-video generation
- Built for efficiency with diffusion-based architecture
- Ideal for short clips (up to 15 seconds)
- Actively maintained and open for contributions
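Open-Sora ships its own training and inference scripts (see the repo’s README for the exact commands), but if you just want a feel for the text-to-video loop in Python, here’s a minimal sketch using a comparable open diffusion pipeline through Hugging Face diffusers. To be clear, this is a stand-in, not Open-Sora’s API; the model and prompt are just illustrative.

```python
# Stand-in text-to-video sketch via diffusers (NOT Open-Sora's own API);
# shows the general prompt -> frames -> mp4 workflow these models share.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b",  # a small open text-to-video model
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

frames = pipe(
    "a golfer swinging at sunrise, cinematic lighting",
    num_inference_steps=25,
).frames[0]

export_to_video(frames, "clip.mp4")  # write the frame stack out as a short mp4
```

The shape is the same across most of these models: load a pipeline, sample frames from a prompt, export them as video.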
Star the Open-Sora repository ⭐
3. OpenVoice v2: Instant Voice Cloning and Speech Synthesis
OpenVoice v2, from the MyShell AI team, is one of the most impressive open-source voice-cloning projects out there right now.
It can replicate a speaker’s tone and accent from just a few seconds of reference audio. It’s great for anything voice-driven. Think interactive AI agents, dubbing, or voice-enabled interfaces.
Why you’ll like it:
- Multilingual and emotion-aware voice synthesis
- Works well with real-time frameworks such as Stream Vision Agents
- Simple API for inference and fine-tuning
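To give you a feel for the API, here’s a condensed sketch of the tone-cloning flow based on the repo’s v2 demo notebook. The checkpoint paths are placeholders; download the real ones per the README before running.

```python
# Condensed OpenVoice v2 tone-cloning sketch, based on the repo's demo
# notebook. Checkpoint paths below are placeholders.
import torch
from openvoice import se_extractor
from openvoice.api import ToneColorConverter

device = "cuda" if torch.cuda.is_available() else "cpu"

converter = ToneColorConverter("checkpoints_v2/converter/config.json", device=device)
converter.load_ckpt("checkpoints_v2/converter/checkpoint.pth")

# Extract the tone-color embedding from a few seconds of reference audio
target_se, _ = se_extractor.get_se("reference.mp3", converter, vad=False)

# Re-voice an existing utterance with the cloned tone color
source_se = torch.load("checkpoints_v2/base_speakers/ses/en-default.pth", map_location=device)
converter.convert(
    audio_src_path="base_utterance.wav",
    src_se=source_se,
    tgt_se=target_se,
    output_path="cloned.wav",
)
```

In practice you’d generate the base utterance with a TTS engine like MeloTTS first, then re-voice it with the cloned tone color.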
Star the OpenVoice repository ⭐
4. SpeechBrain: All-in-One Toolkit for Speech and Audio Intelligence
SpeechBrain is a PyTorch-based open-source toolkit that covers pretty much everything audio: ASR, TTS, speaker recognition, even speech enhancement.
It’s modular, easy to experiment with, and surprisingly production-ready. There are tons of prebuilt recipes if you just want to prototype fast or plug audio intelligence into something bigger you’re already building.
Why you’ll like it:
- Unified library for speech recognition and generation
- Integrates easily with LLMs and real-time frameworks
- Supports distributed and on-device inference
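Getting a transcription out of a pretrained recipe takes just a few lines. Here’s a minimal example using one of the officially published LibriSpeech models from the Hugging Face Hub:

```python
# Minimal SpeechBrain ASR example with a pretrained LibriSpeech recipe.
# On older SpeechBrain versions the import lives in speechbrain.pretrained.
from speechbrain.inference.ASR import EncoderDecoderASR

asr = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",
    savedir="pretrained_models/asr-crdnn-rnnlm-librispeech",
)
print(asr.transcribe_file("sample.wav"))  # prints the recognized text
```

Swap the `source` string for any other published recipe (TTS, speaker ID, enhancement) and the pattern stays the same.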
Star the SpeechBrain repository ⭐
5. LiveKit Agents: Build Real-Time Voice and Video AI Applications
LiveKit Agents makes it easy to build real-time voice and video AI apps that actually feel live. Low latency, no awkward lag. You can run it locally or in the cloud, and hook it up to models like OpenAI Realtime, Gemini, or Whisper to handle the heavy lifting. It’s great for stuff like virtual meeting assistants, customer-support bots, or live translation apps.
Why you’ll like it:
- Real-time streaming via WebRTC
- Scales to thousands of concurrent sessions
- Works seamlessly with custom or hosted LLMs
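Here’s roughly what a minimal voice agent looks like with the Python SDK, trimmed from the 1.x quickstart pattern; treat the plugin names and options as a sketch, since the API moves quickly.

```python
# Minimal LiveKit Agents worker, trimmed from the 1.x quickstart pattern;
# assumes the livekit-agents and livekit-plugins-openai packages.
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import openai


async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()  # join the room this job was dispatched for

    session = AgentSession(
        llm=openai.realtime.RealtimeModel(),  # speech-to-speech, no separate STT/TTS
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a concise, friendly meeting assistant."),
    )


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```

Star the LiveKit Agents repository ⭐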
Thanks for reading the article.
In the comments below, let me know which other AI tools or frameworks have helped you build your applications.
P.S. Feel free to follow me on X; I share valuable stuff - promise!






