DEV Community

DoremonAI
Build 2026 & Cosmos 3: Microsoft and NVIDIA Drop Major AI Models This Week

The first week of June 2026 was packed. The open-weight mega-drop (25+ models) stole the headlines, but two announcements stand out as genuinely platform-shifting: NVIDIA's Cosmos 3 and Microsoft's new MAI model family at Build 2026.


🌌 NVIDIA Cosmos 3 — The First Open Omnimodal World Model

NVIDIA released Cosmos 3 on June 1, and the ambition is obvious. Built on a mixture-of-transformers architecture, Cosmos 3 is not another LLM or image generator. It is an open world foundation model for physical AI that works across text, images, video, audio, and actions.

Key highlights:

  • Open-source on Hugging Face under a permissive license
  • Currently ranked #1 open-source Text-to-Image and #1 Image-to-Video model by Artificial Analysis
  • Top policy model on RoboArena for robotics tasks
  • Built for physical AI — connecting understanding, generation, simulation, and action

This isn't just a model release — it's a blueprint for how future AI systems will perceive and interact with the physical world.


🏗️ Microsoft Build 2026: MAI Models Go Multimodal

At Microsoft Build 2026 (June 2–4, San Francisco), Microsoft unveiled a major expansion of its MAI (Microsoft AI) model family: four models spanning image, voice, and transcription.

MAI-Image-2.5 & MAI-Image-2.5-Flash

  • #2 on the Arena leaderboard for image editing
  • Precise, controllable image editing (not just generation)
  • Available in Microsoft Foundry for production workflows
  • Flash variant for low-latency use cases

MAI-Voice-2

  • 15+ languages with expanded emotional expression
  • Significant leap in natural speech synthesis
  • Built for Copilot and real-time voice interactions

MAI-Transcribe-1.5

  • 43 languages supported
  • Mixture-of-Experts (MoE) architecture for efficiency
  • Enterprise-grade speech-to-text accuracy
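
To make the "MoE for efficiency" point concrete, here is a minimal sketch of top-k expert routing in plain Python. It is a generic illustration of the technique, not MAI-Transcribe-1.5's actual architecture, which Microsoft has not detailed here. The function names (`route_top_k`, `moe_forward`), the toy experts, and the gate scores are all made up for the example.

```python
import math


def softmax(scores):
    """Turn raw gate scores into probabilities (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, gate_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    routes = route_top_k(gate_logits, k)
    output = sum(weight * experts[i](x) for i, weight in routes)
    return output, [i for i, _ in routes]


# Hypothetical toy experts: each one just scales its input.
experts = [lambda x, scale=scale: scale * x for scale in (1.0, 2.0, 3.0, 4.0)]

# Hypothetical gate scores; in a real model a small learned router produces these.
output, used = moe_forward(10.0, experts, gate_logits=[0.1, 2.0, 0.3, 2.0], k=2)
print(used, output)
```

Only two of the four experts run for this input, which is where the efficiency comes from: total parameter count can grow with the number of experts while per-input compute stays roughly tied to `k`.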

All models are available now via Microsoft Foundry (formerly Azure AI Foundry), Fireworks AI, Baseten, and OpenRouter.


🔮 Why This Matters

Both releases point in the same direction: multimodality is the new normal.

NVIDIA is betting that unifying every modality (including robotics actions) into one open model will unlock physical AI. Microsoft is betting that developers need modality-specific, production-ready models they can deploy today.

Whether you're building the next robotics startup or adding voice and image capabilities to your app, this was the week the toolbox got a whole lot bigger.


What are you building with these? Drop a comment below!

First published June 11, 2026
