<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Zilliz</title>
    <description>The latest articles on DEV Community by Zilliz (@zilliz).</description>
    <link>https://dev.to/zilliz</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F7647%2Fc86e3bfa-c8ad-40ff-9d53-0fbb6b3108a6.png</url>
      <title>DEV Community: Zilliz</title>
      <link>https://dev.to/zilliz</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/zilliz"/>
    <language>en</language>
    <item>
      <title>How to Install and Run OpenClaw (Previously Clawdbot/Moltbot) on Mac: A Step-by-Step Tutorial</title>
      <dc:creator>Chloe Williams</dc:creator>
      <pubDate>Fri, 13 Feb 2026 10:03:05 +0000</pubDate>
      <link>https://dev.to/zilliz/how-to-install-and-run-openclaw-previously-clawdbotmoltbot-on-mac-a-step-by-step-tutorial-3jc5</link>
      <guid>https://dev.to/zilliz/how-to-install-and-run-openclaw-previously-clawdbotmoltbot-on-mac-a-step-by-step-tutorial-3jc5</guid>
      <description>&lt;p&gt;&lt;a href="https://openclaw.ai/?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt; is an open-source, self-hosted gateway that bridges your everyday messaging apps to AI coding agents. Instead of switching between tabs, apps, and interfaces, you send a message from WhatsApp or Telegram and get an AI-powered response right in your pocket. It's MIT licensed, runs on your hardware, and keeps you in full control of your data.&lt;/p&gt;

&lt;p&gt;In this tutorial, we'll walk through everything you need to install and run OpenClaw on macOS — from prerequisites to your first working chat.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You'll Need
&lt;/h2&gt;

&lt;p&gt;Before you begin, make sure you have the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;macOS&lt;/strong&gt; (any recent version)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node.js 22 or newer&lt;/strong&gt; — the installer script will handle this for you if it's not already installed, but it's good to check&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An API key&lt;/strong&gt; — Anthropic is recommended by the OpenClaw team&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;About 5 minutes&lt;/strong&gt; of your time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To check your current Node version, open Terminal and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;node &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you see &lt;code&gt;v22.x.x&lt;/code&gt; or higher, you're good to go. If not, don't worry — the installer will take care of it.&lt;/p&gt;
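&lt;p&gt;If you prefer to manage Node yourself rather than let the installer handle it, one common route on macOS is Homebrew. This is a sketch under the assumption that Homebrew is already installed and that you want the versioned &lt;code&gt;node@22&lt;/code&gt; formula:&lt;/p&gt;

```shell
# Install Node 22 with Homebrew (assumes Homebrew is already set up)
brew install node@22

# node@22 is keg-only, so put it on PATH for the current shell
export PATH="$(brew --prefix node@22)/bin:$PATH"

# Confirm the version; it should report v22.x.x
node --version
```

&lt;p&gt;The installer script in Step 1 performs its own Node detection, so either route should work.&lt;/p&gt;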

&lt;h2&gt;
  
  
  Step 1: Install OpenClaw via the Installer Script (Recommended)
&lt;/h2&gt;

&lt;p&gt;The fastest way to install OpenClaw on macOS is the one-line installer script. It handles Node detection, CLI installation, and launches the onboarding wizard — all in one step.&lt;/p&gt;

&lt;p&gt;Open Terminal and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://openclaw.ai/install.sh | bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The script will download the CLI, install it globally via npm, and kick off the onboarding wizard automatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  Alternative: Install via npm Directly
&lt;/h3&gt;

&lt;p&gt;If you already have Node 22+ and prefer manual control, you can install OpenClaw with npm:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; openclaw@latest
openclaw onboard &lt;span class="nt"&gt;--install-daemon&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Alternative: Install via pnpm
&lt;/h3&gt;

&lt;p&gt;If pnpm is your package manager of choice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pnpm add &lt;span class="nt"&gt;-g&lt;/span&gt; openclaw@latest
pnpm approve-builds &lt;span class="nt"&gt;-g&lt;/span&gt;   &lt;span class="c"&gt;# approve openclaw, node-llama-cpp, sharp, etc.&lt;/span&gt;
openclaw onboard &lt;span class="nt"&gt;--install-daemon&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; pnpm requires explicit approval for packages with build scripts. After the first install shows the "Ignored build scripts" warning, run &lt;code&gt;pnpm approve-builds -g&lt;/code&gt; and select the listed packages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Troubleshooting: &lt;code&gt;sharp&lt;/code&gt; Build Errors
&lt;/h3&gt;

&lt;p&gt;If you have &lt;code&gt;libvips&lt;/code&gt; installed globally (common on macOS via Homebrew) and &lt;code&gt;sharp&lt;/code&gt; fails during installation, force prebuilt binaries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;SHARP_IGNORE_GLOBAL_LIBVIPS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; openclaw@latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you see the error &lt;code&gt;sharp: Please add node-gyp to your dependencies&lt;/code&gt;, either install build tooling (Xcode Command Line Tools + &lt;code&gt;npm install -g node-gyp&lt;/code&gt;) or use the environment variable above.&lt;/p&gt;
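&lt;p&gt;Put together, the two fixes look like this — a sketch combining the build-tooling route with the prebuilt-binary workaround shown above:&lt;/p&gt;

```shell
# Option A: install build tooling so sharp can compile from source
xcode-select --install            # Xcode Command Line Tools (errors harmlessly if present)
npm install -g node-gyp

# Option B: skip compilation by forcing sharp's prebuilt binaries
SHARP_IGNORE_GLOBAL_LIBVIPS=1 npm install -g openclaw@latest
```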

&lt;h2&gt;
  
  
  Step 2: Run the Onboarding Wizard
&lt;/h2&gt;

&lt;p&gt;If the installer script didn't automatically launch it, start the onboarding wizard manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw onboard &lt;span class="nt"&gt;--install-daemon&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The wizard walks you through configuring auth, gateway settings, and optional channel connections. It also installs OpenClaw as a background service (daemon), so the Gateway stays running even after you close Terminal.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the Wizard Configures
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Authentication&lt;/strong&gt; — generates a token for the Gateway so local and remote clients must authenticate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gateway settings&lt;/strong&gt; — port, bind address, and service installation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Channel connections&lt;/strong&gt; — optional setup for WhatsApp, Telegram, Discord, and more&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 3: Verify the Gateway Is Running
&lt;/h2&gt;

&lt;p&gt;Once onboarding completes, check that the Gateway is up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw gateway status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see confirmation that the Gateway is running. If you want to run it in the foreground for debugging or quick testing, use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw gateway &lt;span class="nt"&gt;--port&lt;/span&gt; 18789
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a full health check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw health
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 4: Open the Control UI (Dashboard)
&lt;/h2&gt;

&lt;p&gt;The fastest way to start chatting with OpenClaw is through the browser-based Control UI — no channel setup required.&lt;/p&gt;

&lt;p&gt;Run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw dashboard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This copies the dashboard URL, opens your browser if possible, and displays the link. By default, the Control UI is served at:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://127.0.0.1:18789/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the dashboard prompts for authentication, paste the token from your Gateway config. You can retrieve it with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw config get gateway.auth.token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Security note:&lt;/strong&gt; The Control UI is an admin surface — it provides access to chat, configuration, and execution approvals. Do not expose it publicly. Stick to localhost, Tailscale Serve, or an SSH tunnel.&lt;/p&gt;
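&lt;p&gt;As an illustration of the SSH-tunnel option, a standard local port forward keeps the Control UI reachable only through an encrypted hop. The hostname below is a placeholder; 18789 is the default port from this tutorial:&lt;/p&gt;

```shell
# Forward local port 18789 to the Gateway on the Mac running OpenClaw.
# "you@your-mac.example.com" is a placeholder for your own SSH target.
ssh -N -L 18789:127.0.0.1:18789 you@your-mac.example.com

# Then open http://127.0.0.1:18789/ on the local machine.
```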

&lt;h2&gt;
  
  
  Step 5: Connect a Chat Channel (Optional)
&lt;/h2&gt;

&lt;p&gt;While the Control UI gives you instant access to chat, the real power of OpenClaw is messaging your AI agent from the apps you already use. Here's a quick overview of supported channels:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Channel&lt;/th&gt;
&lt;th&gt;Setup Complexity&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Telegram&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Easiest&lt;/td&gt;
&lt;td&gt;Simple bot token&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;WhatsApp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Easy&lt;/td&gt;
&lt;td&gt;QR pairing required; stores more state on disk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Discord&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Bot API + Gateway&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;iMessage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Recommended via BlueBubbles macOS server&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IRC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Easy&lt;/td&gt;
&lt;td&gt;Classic IRC; channels + DMs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Slack&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Bolt SDK; workspace apps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Signal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Privacy-focused; uses signal-cli&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Multiple channels can run simultaneously — configure as many as you want and OpenClaw routes messages per chat.&lt;/p&gt;

&lt;p&gt;For a Slack-specific walkthrough, see &lt;a href="https://milvus.io/blog/stepbystep-guide-to-setting-up-openclaw-previously-clawdbotmoltbot-with-slack.md?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;OpenClaw Tutorial: Connect to Slack for Local AI Assistant - Milvus Blog&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Quick Example: Pair WhatsApp
&lt;/h3&gt;

&lt;p&gt;To connect WhatsApp, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw channels login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Follow the QR pairing flow, and you'll be able to message your AI agent directly from WhatsApp.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 6: Send a Test Message
&lt;/h2&gt;

&lt;p&gt;With a channel configured, send a test message from the CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw message send &lt;span class="nt"&gt;--target&lt;/span&gt; +15555550123 &lt;span class="nt"&gt;--message&lt;/span&gt; &lt;span class="s2"&gt;"Hello from OpenClaw"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace the phone number with your own. If everything is wired correctly, you'll see the message arrive in your messaging app — and OpenClaw's AI agent will respond.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optional: Build from Source
&lt;/h2&gt;

&lt;p&gt;For contributors or anyone who wants to run from a local checkout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/openclaw/openclaw.git
&lt;span class="nb"&gt;cd &lt;/span&gt;openclaw
pnpm &lt;span class="nb"&gt;install
&lt;/span&gt;pnpm ui:build
pnpm build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Link the CLI globally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pnpm &lt;span class="nb"&gt;link&lt;/span&gt; &lt;span class="nt"&gt;--global&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run onboarding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw onboard &lt;span class="nt"&gt;--install-daemon&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For hot-reload during development, use &lt;code&gt;pnpm gateway:watch&lt;/code&gt; instead of the standard gateway command.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optional: macOS App Onboarding
&lt;/h2&gt;

&lt;p&gt;OpenClaw also offers a native macOS app (menu bar) with its own onboarding flow. If you're using the app:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Approve the macOS security warning&lt;/strong&gt; when first launching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Allow "Find Local Networks"&lt;/strong&gt; permission&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Choose Local vs. Remote&lt;/strong&gt; — select "This Mac" for a local-only Gateway&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grant permissions&lt;/strong&gt; — the app may request Automation, Notifications, Accessibility, and other TCC permissions depending on your use case&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Install the CLI&lt;/strong&gt; (optional) — the app can install the global &lt;code&gt;openclaw&lt;/code&gt; CLI via npm so terminal workflows work alongside the app&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chat in the onboarding session&lt;/strong&gt; — the app opens a dedicated chat so the agent can introduce itself&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The stable workflow recommended by the OpenClaw docs is to install and launch &lt;code&gt;OpenClaw.app&lt;/code&gt;, complete the onboarding checklist, and then link your channels with &lt;code&gt;openclaw channels login&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuration Basics
&lt;/h2&gt;

&lt;p&gt;OpenClaw stores its configuration at &lt;code&gt;~/.openclaw/openclaw.json&lt;/code&gt;. Out of the box, it uses the bundled Pi binary in RPC mode with per-sender sessions — no configuration needed.&lt;/p&gt;
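&lt;p&gt;To see what onboarding actually wrote, you can inspect the file directly or query individual values through the CLI (the token lookup was shown in Step 4):&lt;/p&gt;

```shell
# Dump the whole config file created by onboarding
cat ~/.openclaw/openclaw.json

# Or read a single value via the CLI
openclaw config get gateway.auth.token
```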

&lt;p&gt;If you want to restrict who can message your agent, add an &lt;code&gt;allowFrom&lt;/code&gt; rule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"channels"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"whatsapp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"allowFrom"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"+15555550123"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"groups"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"*"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"requireMention"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"groupChat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"mentionPatterns"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"@openclaw"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key File Locations on macOS
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Path&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;~/.openclaw/openclaw.json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Main configuration file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;~/.openclaw/workspace/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Skills, prompts, memories&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;~/.openclaw/credentials/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Channel credentials&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;~/.openclaw/agents/&amp;lt;agentId&amp;gt;/sessions/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Agent session data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/tmp/openclaw/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Logs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Useful Environment Variables
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variable&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;OPENCLAW_HOME&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Override home directory for internal path resolution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;OPENCLAW_STATE_DIR&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Override the state directory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;OPENCLAW_CONFIG_PATH&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Override the config file path&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
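&lt;p&gt;As a quick illustration of the overrides above — the paths here are examples, useful for running a second, isolated profile alongside your main one:&lt;/p&gt;

```shell
# Point OpenClaw at a separate state directory and config file
# (both paths are examples, not defaults)
export OPENCLAW_STATE_DIR="$HOME/openclaw-test/state"
export OPENCLAW_CONFIG_PATH="$HOME/openclaw-test/openclaw.json"

# Commands run in this shell now use the alternate profile
openclaw gateway status
```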

&lt;h2&gt;
  
  
  Troubleshooting: &lt;code&gt;openclaw&lt;/code&gt; Not Found
&lt;/h2&gt;

&lt;p&gt;If your shell can't find the &lt;code&gt;openclaw&lt;/code&gt; command after installation, the issue is almost always a missing PATH entry.&lt;/p&gt;

&lt;p&gt;Quick diagnosis:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;node &lt;span class="nt"&gt;-v&lt;/span&gt;
npm &lt;span class="nt"&gt;-v&lt;/span&gt;
npm prefix &lt;span class="nt"&gt;-g&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PATH&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the output of &lt;code&gt;npm prefix -g&lt;/code&gt; plus &lt;code&gt;/bin&lt;/code&gt; isn't in your &lt;code&gt;$PATH&lt;/code&gt;, add it to your shell startup file (&lt;code&gt;~/.zshrc&lt;/code&gt; for modern macOS):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;npm prefix &lt;span class="nt"&gt;-g&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;/bin:&lt;/span&gt;&lt;span class="nv"&gt;$PATH&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then open a new Terminal window or run &lt;code&gt;source ~/.zshrc&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Troubleshooting: "Unauthorized" / 1008 Error in Dashboard
&lt;/h2&gt;

&lt;p&gt;If the Control UI shows an unauthorized error:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Make sure the Gateway is reachable: &lt;code&gt;openclaw status&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Retrieve the token: &lt;code&gt;openclaw config get gateway.auth.token&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;In the dashboard settings, paste the token into the auth field and reconnect&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you need to generate a fresh token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw doctor &lt;span class="nt"&gt;--generate-gateway-token&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What You Now Have
&lt;/h2&gt;

&lt;p&gt;After completing this tutorial, you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;running Gateway&lt;/strong&gt; on your Mac&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authentication configured&lt;/strong&gt; for secure access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control UI access&lt;/strong&gt; for browser-based chat&lt;/li&gt;
&lt;li&gt;Optionally, one or more &lt;strong&gt;connected messaging channels&lt;/strong&gt; (WhatsApp, Telegram, Discord, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From here, you can explore multi-agent routing, workspace isolation, media support, and the full range of OpenClaw's capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://milvus.io/blog/we-extracted-openclaws-memory-system-and-opensourced-it-memsearch.md?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;We Extracted OpenClaw’s Memory System and Open-Sourced It (memsearch)&lt;/a&gt; &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://milvus.io/blog/openclaw-formerly-clawdbot-moltbot-explained-a-complete-guide-to-the-autonomous-ai-agent.md?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;What Is OpenClaw? Complete Guide to the Open-Source AI Agent&lt;/a&gt; &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://milvus.io/blog/clawdbot-long-running-ai-agents-langgraph-milvus.md?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Build Clawdbot-Style AI Agents with LangGraph &amp;amp; Milvus&lt;/a&gt; &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://milvus.io/blog/why-claude-code-feels-so-stable-a-developers-deep-dive-into-its-local-storage-design.md?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;How Claude Code Manages Local Storage for AI Agents&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>community</category>
    </item>
    <item>
      <title>Our Journey to 35K+ GitHub Stars: The Real Story of Building Milvus from Scratch</title>
      <dc:creator>Chloe Williams</dc:creator>
      <pubDate>Mon, 30 Jun 2025 10:35:10 +0000</pubDate>
      <link>https://dev.to/zilliz/our-journey-to-35k-github-stars-the-real-story-of-building-milvus-from-scratch-5b1j</link>
      <guid>https://dev.to/zilliz/our-journey-to-35k-github-stars-the-real-story-of-building-milvus-from-scratch-5b1j</guid>
      <description>&lt;p&gt;For the past few years, we've been focused on one thing: building an enterprise-ready vector database for the AI era. The hard part isn't building &lt;em&gt;a&lt;/em&gt; database—it's building one that's scalable, easy to use, and actually solves real problems in production.&lt;/p&gt;

&lt;p&gt;This June, we reached a new milestone: Milvus hit &lt;a href="https://github.com/milvus-io/milvus?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;35,000 stars on GitHub&lt;/a&gt; (35.5K+ at the time of writing). We're not going to pretend this is just another number—it means a lot to us.&lt;/p&gt;

&lt;p&gt;Each star represents a developer who took the time to look at what we've built, found it useful enough to bookmark, and in many cases, decided to use it. Some of you have gone further: filing issues, contributing code, answering questions in our forums, and helping other developers when they get stuck.&lt;/p&gt;

&lt;p&gt;We wanted to take a moment to share our story—the real one, with all the messy parts included.&lt;/p&gt;

&lt;h2&gt;
  
  
  We Started Building Milvus Because Nothing Else Worked
&lt;/h2&gt;

&lt;p&gt;Back in 2017, we started with a simple question: As AI applications were starting to emerge and unstructured data was exploding, how do you efficiently store and search the vector embeddings that power semantic understanding?&lt;/p&gt;

&lt;p&gt;Traditional databases weren't built for this. They're optimized for rows and columns, not high-dimensional vectors. The existing technologies and tools were either unworkable or painfully slow for what we needed.&lt;/p&gt;

&lt;p&gt;We tried everything available. Hacked together solutions with Elasticsearch. Built custom indexes on top of MySQL. Even experimented with FAISS, but it was designed as a research library, not production database infrastructure. Nothing provided the complete solution we envisioned for enterprise AI workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So we started building our own.&lt;/strong&gt; Not because we thought it would be easy—databases are notoriously hard to get right—but because we could see where AI was heading and knew it needed purpose-built infrastructure to get there.&lt;/p&gt;

&lt;p&gt;By 2018, we were deep into developing what would become &lt;a href="https://milvus.io/?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Milvus&lt;/a&gt;. The term "&lt;strong&gt;vector database&lt;/strong&gt;" didn't even exist yet. We were essentially creating a new category of infrastructure software, which was both exciting and terrifying.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open-Sourcing Milvus: Building in Public
&lt;/h2&gt;

&lt;p&gt;In November 2019, we decided to open-source Milvus version 0.10.&lt;/p&gt;

&lt;p&gt;Open-sourcing means exposing all your flaws to the world. Every hack, every TODO comment, every design decision you're not entirely sure about. But we believed that if vector databases were going to become critical infrastructure for AI, they needed to be open and accessible to everyone.&lt;/p&gt;

&lt;p&gt;The response was overwhelming. Developers didn't just use Milvus—they improved it. They found bugs we'd missed, suggested features we hadn't considered, and asked questions that made us think harder about our design choices.&lt;/p&gt;

&lt;p&gt;In 2020, we joined the &lt;a href="https://lfaidata.foundation/?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;LF AI &amp;amp; Data Foundation&lt;/a&gt;. This wasn't just for credibility—it taught us how to maintain a sustainable open-source project. How to handle governance, backward compatibility, and building software that lasts years, not months.&lt;/p&gt;

&lt;p&gt;By 2021, we released Milvus 1.0 and &lt;a href="https://lfaidata.foundation/projects/milvus/?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;graduated from LF AI &amp;amp; Data Foundation&lt;/a&gt;. That same year, we won the &lt;a href="https://big-ann-benchmarks.com/neurips21.html?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;BigANN global challenge&lt;/a&gt; for billion-scale vector search. That win felt good, but more importantly, it validated that we were solving real problems the right way.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hardest Decision: Starting Over
&lt;/h2&gt;

&lt;p&gt;Here's where things get complicated. By 2021, Milvus 1.0 was working well for many use cases, but enterprise customers kept asking for the same things: better cloud-native architecture, easier horizontal scaling, more operational simplicity.&lt;/p&gt;

&lt;p&gt;We had a choice: patch our way forward or rebuild from the ground up. We chose to rebuild.&lt;/p&gt;

&lt;p&gt;Milvus 2.0 was essentially a complete rewrite. We introduced a fully decoupled storage-compute architecture with dynamic scalability. It took us two years and was honestly one of the most stressful periods in our company's history. We were throwing away a working system that thousands of people were using to build something unproven.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But when we released Milvus 2.0 in 2022, it transformed Milvus from a powerful vector database into production-ready infrastructure that could scale to enterprise workloads.&lt;/strong&gt; That same year, we also completed a &lt;a href="https://zilliz.com/news/vector-database-company-zilliz-series-b-extension?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Series B+ funding round&lt;/a&gt;—not to burn money, but to double down on product quality and support for global customers. We knew this path would take time, but every step had to be built on a solid foundation.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Everything Accelerated with AI
&lt;/h2&gt;

&lt;p&gt;2023 was the year of &lt;a href="https://zilliz.com/learn/Retrieval-Augmented-Generation?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;RAG&lt;/a&gt; (retrieval-augmented generation). Suddenly, semantic search went from an interesting AI technique to essential infrastructure for chatbots, document Q&amp;amp;A systems, and AI agents.&lt;/p&gt;

&lt;p&gt;Milvus's GitHub stars spiked. Support requests multiplied. Developers who had never heard of vector databases were suddenly asking sophisticated questions about indexing strategies and query optimization.&lt;/p&gt;

&lt;p&gt;This growth was exciting but also overwhelming. We realized we needed to scale not just our technology, but our entire approach to community support. We hired more developer advocates, completely rewrote our documentation, and started creating educational content for developers new to vector databases.&lt;/p&gt;

&lt;p&gt;We also launched &lt;a href="https://zilliz.com/cloud?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Zilliz Cloud&lt;/a&gt;—our fully managed version of Milvus. Some people asked why we were "commercializing" our open-source project. The honest answer is that maintaining enterprise-grade infrastructure is expensive and complex. Zilliz Cloud allows us to sustain and accelerate Milvus development while keeping the core project completely open source.&lt;/p&gt;

&lt;p&gt;Then came 2024. &lt;a href="https://zilliz.com/blog/zilliz-named-a-leader-in-the-forrester-wave-vector-database-report?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;&lt;strong&gt;Forrester named us a leader&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;in the vector database category.&lt;/strong&gt; Milvus passed 30,000 GitHub stars. &lt;strong&gt;And we realized: the road we'd been paving for seven years had finally become the highway.&lt;/strong&gt; As more enterprises adopted vector databases as critical infrastructure, our business growth accelerated rapidly—validating that the foundation we'd built could scale both technically and commercially.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Team Behind Milvus: Zilliz
&lt;/h2&gt;

&lt;p&gt;Here's something interesting: many people know Milvus but not Zilliz. We're actually fine with that. &lt;a href="https://zilliz.com/?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;&lt;strong&gt;Zilliz&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;is the team behind Milvus—we build it, maintain it, and support it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What we care about most are the unglamorous things that make the difference between a cool demo and production-ready infrastructure: performance optimizations, security patches, documentation that actually helps beginners, and responding thoughtfully to GitHub issues.&lt;/p&gt;

&lt;p&gt;We've built a 24/7 global support team across the U.S., Europe, and Asia, because developers need help in their time zones, not ours. We have community contributors we call "&lt;a href="https://docs.google.com/forms/d/e/1FAIpQLSfkVTYObayOaND8M1ci9eF_YWvoKDb-xQjLJYZ-LhbCdLAt2Q/viewform?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Milvus Ambassadors&lt;/a&gt;" who organize events, answer forum questions, and often explain concepts better than we do.&lt;/p&gt;

&lt;p&gt;We've also welcomed integrations with AWS, GCP, and other cloud providers—even when they offer their own managed versions of Milvus. More deployment options are good for users. Though we've noticed that when teams hit complex technical challenges, they often end up reaching out to us directly because we understand the system at the deepest level.&lt;/p&gt;

&lt;p&gt;Many people think of open source as just a toolbox, but it's really an evolutionary process—a collective effort by countless people who love and believe in it. Only those who truly understand the architecture can explain the "why" behind bug fixes, performance bottleneck analysis, data system integration, and architectural adjustments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So if you're using open-source Milvus, or considering vector databases as a core component of your AI system, we encourage you to reach out to us directly for the most professional and timely support.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Impact in Production: The Trust from Users
&lt;/h2&gt;

&lt;p&gt;The use cases for Milvus have grown beyond what we initially imagined. We're powering AI infrastructure for some of the world's most demanding enterprises across every industry.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0w7r6rotxew8gm57dmvd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0w7r6rotxew8gm57dmvd.png" alt="zilliz customers.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://zilliz.com/customers/bosch?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;&lt;strong&gt;Bosch&lt;/strong&gt;&lt;/a&gt;, the global automotive technology leader and pioneer in autonomous driving, revolutionized their data analysis with Milvus achieving 80% reduction in data collection costs and $1.4M annual savings while searching billions of driving scenarios in milliseconds for critical edge cases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://zilliz.com/customers/read-ai?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;&lt;strong&gt;Read AI&lt;/strong&gt;&lt;/a&gt;, one of the fastest-growing productivity AI companies serving millions of monthly active users, uses Milvus to achieve sub-20-50ms retrieval latency across billions of records and 5× speedup in agentic search. Their CTO says, "Milvus serves as the central repository and powers our information retrieval among billions of records."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://zilliz.com/customers/global-fintech-leader?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;&lt;strong&gt;A global fintech leader&lt;/strong&gt;&lt;/a&gt;, one of the world's largest digital payment platforms processing tens of billions of transactions across 200+ countries and 25+ currencies, chose Milvus for 5-10× faster batch ingestion than competitors, completing jobs in under 1 hour that took others 8+ hours.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://zilliz.com/customers/filevine?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;&lt;strong&gt;Filevine&lt;/strong&gt;&lt;/a&gt;, the leading legal work platform trusted by thousands of law firms across the United States, manages 3 billion vectors across millions of legal documents, saving attorneys 60-80% of time in document analysis and achieving "true consciousness of data" for legal case management.&lt;/p&gt;

&lt;p&gt;We're also supporting &lt;strong&gt;NVIDIA, OpenAI, Microsoft, Salesforce, Walmart,&lt;/strong&gt; and many others in almost every industry. Over 10,000 organizations have made Milvus or Zilliz Cloud their vector database of choice.&lt;/p&gt;

&lt;p&gt;These aren't just technical success stories—they're examples of how vector databases are quietly becoming critical infrastructure that powers the AI applications people use every day.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why We Built Zilliz Cloud: Enterprise-Grade Vector Database as a Service
&lt;/h2&gt;

&lt;p&gt;Milvus is open-source and free to use. But running Milvus well at enterprise scale requires deep expertise and significant resources. Index selection, memory management, scaling strategies, security configurations—these aren't trivial decisions. Many teams want the power of Milvus without the operational complexity, along with enterprise support and SLA guarantees.&lt;/p&gt;

&lt;p&gt;That's why we built &lt;a href="https://zilliz.com/cloud?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Zilliz Cloud&lt;/a&gt;—a fully managed version of Milvus deployed across 25 global regions and 5 major clouds, including AWS, GCP, and Azure, designed specifically for enterprise-scale AI workloads that demand performance, security, and reliability.&lt;/p&gt;

&lt;p&gt;Here's what makes Zilliz Cloud different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Massive Scale with High Performance:&lt;/strong&gt; Our proprietary AI-powered AutoIndex engine delivers 3-5× faster query speeds than open-source Milvus, with zero index tuning required. The cloud-native architecture supports billions of vectors and tens of thousands of concurrent queries while maintaining sub-second response times.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://zilliz.com/trust-center?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;&lt;strong&gt;Built-in Security &amp;amp; Compliance&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;:&lt;/strong&gt; Encryption at rest and in transit, fine-grained RBAC, comprehensive audit logging, SAML/OAuth2.0 integration, and &lt;a href="https://zilliz.com/bring-your-own-cloud?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;BYOC&lt;/a&gt; (bring your own cloud) deployments. We're compliant with GDPR, HIPAA, and other global standards that enterprises actually need. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Optimized for Cost-Efficiency:&lt;/strong&gt; Tiered hot/cold data storage, elastic scaling that responds to real workloads, and pay-as-you-go pricing can reduce total cost of ownership by 50% or more compared to self-managed deployments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Truly Cloud-Agnostic:&lt;/strong&gt; Deploy on AWS, Azure, GCP, Alibaba Cloud, or Tencent Cloud without vendor lock-in. We ensure global consistency and scalability regardless of where you run.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These capabilities might not sound flashy, but they solve real, daily problems that enterprise teams face when building AI applications at scale. And most importantly: it's still Milvus under the hood, so there's no proprietary lock-in or compatibility issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next: Vector Data Lake
&lt;/h2&gt;

&lt;p&gt;We coined the term "&lt;a href="https://zilliz.com/learn/what-is-vector-database?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;vector database&lt;/a&gt;" and were the first to build one, but we're not stopping there. We're now building the next evolution: &lt;strong&gt;Vector Data Lake.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here's the problem we're solving: not every vector search needs millisecond latency.&lt;/strong&gt; Many enterprises have massive datasets that are queried occasionally, including historical document analysis, batch similarity computations, and long-term trend analysis. For these use cases, a traditional real-time vector database is both overkill and expensive. &lt;/p&gt;

&lt;p&gt;Vector Data Lake uses a storage-compute separated architecture specifically optimized for massive-scale, infrequently accessed vectors while keeping costs dramatically lower than real-time systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core capabilities include:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unified Data Stack:&lt;/strong&gt; Seamlessly connects online and offline data layers with consistent formats and efficient storage, so you can move data between hot and cold tiers without reformatting or complex migrations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compatible Compute Ecosystem:&lt;/strong&gt; Works natively with frameworks like Spark and Ray, supporting everything from vector search to traditional ETL and analytics. This means your existing data teams can work with vector data using tools they already know.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost-Optimized Architecture:&lt;/strong&gt; Hot data stays on SSD or NVMe for fast access; cold data automatically moves to object storage like S3. Smart indexing and storage strategies keep I/O fast when you need it while making storage costs predictable and affordable.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't about replacing vector databases—it's about giving enterprises the right tool for each workload. Real-time search for user-facing applications, cost-effective vector data lakes for analytics and historical processing.&lt;/p&gt;

&lt;p&gt;We still believe in the logic behind Moore's Law and Jevons Paradox: as the unit cost of computing drops, adoption scales. The same applies to vector infrastructure.&lt;/p&gt;

&lt;p&gt;By improving indexes, storage structures, caching, and deployment models—day in, day out—we hope to make AI infrastructure more accessible and affordable for everyone, and to help bring unstructured data into the AI-native future.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Big Thanks to You All!
&lt;/h2&gt;

&lt;p&gt;Those 35K+ stars represent something we're genuinely proud of: a community of developers who find Milvus useful enough to recommend and contribute to.&lt;/p&gt;

&lt;p&gt;But we're not done. Milvus has bugs to fix, performance improvements to make, and features our community has been asking for. Our roadmap is public, and we genuinely want your input on what to prioritize.&lt;/p&gt;

&lt;p&gt;The number itself isn't what matters—it's the trust those stars represent. Trust that we'll keep building in the open, keep listening to feedback, and keep making Milvus better.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;To our contributors:&lt;/strong&gt; your PRs, bug reports, and documentation improvements make Milvus better every day. Thank you so much. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;To our users:&lt;/strong&gt; thank you for trusting us with your production workloads and for the feedback that keeps us honest.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;To our community:&lt;/strong&gt; thank you for answering questions, organizing events, and helping newcomers get started.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're new to vector databases, we'd love to help you get started. If you're already using Milvus or Zilliz Cloud, we'd love to &lt;a href="https://zilliz.com/share-your-story?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;hear about your experience&lt;/a&gt;. And if you're just curious about what we're building, our community channels are always open.&lt;/p&gt;

&lt;p&gt;Let's keep building the infrastructure that makes AI applications possible—together.&lt;/p&gt;




&lt;p&gt;Find us here: &lt;a href="https://github.com/milvus-io/milvus?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Milvus on GitHub&lt;/a&gt; | &lt;a href="https://zilliz.com/?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Zilliz Cloud&lt;/a&gt; | &lt;a href="https://discuss.milvus.io/?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; | &lt;a href="https://www.linkedin.com/company/the-milvus-project/?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; | &lt;a href="https://x.com/zilliz_universe?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;X&lt;/a&gt; | &lt;a href="https://www.youtube.com/@MilvusVectorDatabase/featured?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;YouTube&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://meetings.hubspot.com/chloe-williams1/milvus-office-hour?__hstc=175614333.dc4bcf53f6c7d650ea8978dcdb9e7009.1727350436713.1751017913702.1751029841530.667&amp;amp;__hssc=175614333.3.1751029841530&amp;amp;__hsfp=3554976067" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftt6b1za8ozv1byrtl2ds.png" width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>product</category>
    </item>
    <item>
      <title>Top 5 Open Source Vector Search Engines: A Comprehensive Comparison Guide for 2025</title>
      <dc:creator>Chloe Williams</dc:creator>
      <pubDate>Sun, 01 Jun 2025 15:09:10 +0000</pubDate>
      <link>https://dev.to/zilliz/top-5-open-source-vector-search-engines-a-comprehensive-comparison-guide-for-2025-26p6</link>
      <guid>https://dev.to/zilliz/top-5-open-source-vector-search-engines-a-comprehensive-comparison-guide-for-2025-26p6</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://zilliz.com/learn/vector-similarity-search?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;&lt;strong&gt;Vector search&lt;/strong&gt;&lt;/a&gt;, also known as vector similarity search, has quickly evolved from an experimental technology to a must-have component in many AI applications. As developers and technical leaders, we're increasingly looking for ways to handle similarity-based queries that traditional databases simply weren't designed to handle efficiently.&lt;/p&gt;

&lt;p&gt;Whether you're building a product recommendation system or implementing semantic search, the underlying challenge is the same: how do you efficiently find the "nearest neighbors" to a query vector in a potentially massive dataset? That's where vector search engines come in.&lt;/p&gt;

&lt;p&gt;The good news is that the open source community has stepped up with multiple high-quality options. The challenging part? Figuring out which one is right for your specific use case, technical requirements, and team expertise. &lt;/p&gt;

&lt;p&gt;In this guide, we'll walk through the most popular open-source vector search engines available today, compare their strengths and limitations, and provide practical insights to help you make an informed decision. We'll cover everything from the technical foundations to specific implementation considerations, with a focus on real-world applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Vector Search: Core Concepts
&lt;/h2&gt;

&lt;p&gt;Before diving into specific engines, let's establish some shared understanding of what vector search actually involves.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Are Vector Embeddings?
&lt;/h3&gt;

&lt;p&gt;At its core, vector search relies on embedding data into &lt;a href="https://zilliz.com/glossary/vector-embeddings?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;vectors&lt;/a&gt;—essentially converting information (text, images, audio, or any other data type) into lists of floating-point numbers that capture semantic meaning. These vectors typically range from dozens to thousands of dimensions.&lt;/p&gt;

&lt;p&gt;For example, &lt;a href="https://zilliz.com/ai-models?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;a text embedding model&lt;/a&gt; might encode the sentence "The weather is nice today" into a 384-dimensional vector where semantically similar sentences like "It's a beautiful day" would be positioned nearby in this high-dimensional space.&lt;/p&gt;
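&lt;p&gt;To make the idea concrete, here is a minimal sketch of similarity between embeddings, using tiny hand-picked 4-dimensional vectors (hypothetical values; a real embedding model produces hundreds of dimensions):&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" with hand-picked values; a real text
# embedding model would produce hundreds of dimensions.
nice_weather = [0.8, 0.6, 0.1, 0.0]   # "The weather is nice today"
beautiful_day = [0.7, 0.7, 0.2, 0.1]  # "It's a beautiful day"
stock_report = [0.0, 0.1, 0.9, 0.8]   # an unrelated sentence

print(round(cosine_similarity(nice_weather, beautiful_day), 2))  # 0.98
print(round(cosine_similarity(nice_weather, stock_report), 2))   # 0.12
```

&lt;p&gt;Semantically close sentences score near 1, while unrelated ones score much lower; this is the distance computation a vector search engine runs at scale.&lt;/p&gt;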

&lt;h3&gt;
  
  
  Vector Search vs. Traditional Search
&lt;/h3&gt;

&lt;p&gt;Traditional search engines typically use inverted indices and exact keyword matching. Vector search, in contrast, measures &lt;a href="https://zilliz.com/blog/similarity-metrics-for-vector-search?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;the distance between vectors&lt;/a&gt; to find similar items, regardless of exact keyword overlap.&lt;/p&gt;

&lt;p&gt;Consider these approaches:&lt;/p&gt;

&lt;p&gt;Traditional keyword search matches "red leather jacket" only with documents containing exactly those words. Vector search, however, can match "red leather jacket" with conceptually similar items, even ones described as "scarlet biker coat," because it measures semantic similarity rather than requiring exact term matches.&lt;/p&gt;
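&lt;p&gt;The contrast can be sketched in a few lines, using made-up embedding values for illustration (real scores would come from an embedding model):&lt;/p&gt;

```python
def keyword_overlap(a, b):
    # Exact-term matching: count shared words.
    return len(set(a.split()).intersection(set(b.split())))

def dot(a, b):
    # For roughly unit-length vectors, the dot product approximates
    # cosine similarity.
    return sum(x * y for x, y in zip(a, b))

query = "red leather jacket"
item = "scarlet biker coat"

print(keyword_overlap(query, item))  # 0: no shared words at all

# Hypothetical, roughly unit-length embeddings that place both
# phrases close together in "outerwear" space.
query_vec = [0.70, 0.70, 0.10]
item_vec = [0.65, 0.74, 0.15]
print(round(dot(query_vec, item_vec), 2))  # 0.99: semantically close
```

&lt;p&gt;A keyword engine would miss this item entirely; a vector engine ranks it near the top.&lt;/p&gt;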

&lt;h3&gt;
  
  
  Key Performance Metrics
&lt;/h3&gt;

&lt;p&gt;When evaluating vector search engines, several metrics matter:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query speed&lt;/strong&gt; is measured in milliseconds or queries per second (QPS), indicating how quickly results are returned. &lt;strong&gt;Recall&lt;/strong&gt; represents the percentage of relevant results actually retrieved compared to what should have been retrieved. &lt;strong&gt;Index build time&lt;/strong&gt; tells you how long it takes to create the search index, while &lt;strong&gt;memory usage&lt;/strong&gt; reflects RAM requirements for both indexing and querying. &lt;strong&gt;Scalability&lt;/strong&gt; refers to a system's ability to handle increasing data volumes and query loads without experiencing performance degradation.&lt;/p&gt;
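&lt;p&gt;Recall is straightforward to compute once you have ground-truth neighbors. A minimal sketch, with hypothetical result IDs:&lt;/p&gt;

```python
def recall_at_k(retrieved_ids, relevant_ids):
    # Fraction of the true nearest neighbors the engine actually
    # returned.
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    return len(retrieved.intersection(relevant)) / len(relevant)

# Hypothetical run: ground truth says items 1-5 are the true top-5
# neighbors, but the approximate index returned 1, 2, 3, 7, 9.
print(recall_at_k([1, 2, 3, 7, 9], [1, 2, 3, 4, 5]))  # 0.6
```

&lt;p&gt;Approximate indexes deliberately trade some recall for query speed; benchmarks report both numbers together for exactly this reason.&lt;/p&gt;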

&lt;p&gt;Understanding these fundamentals will help frame our exploration of the specific engines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Popular Vector Search Use Cases
&lt;/h2&gt;

&lt;p&gt;Vector search isn't just a theoretical concept—it's powering some of the most innovative applications being built today. Here are the key use cases where vector search engines are making a significant impact:&lt;/p&gt;

&lt;h3&gt;
  
  
  Retrieval Augmented Generation (RAG)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://zilliz.com/learn/Retrieval-Augmented-Generation?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;RAG&lt;/a&gt; has become one of the most common applications of vector search, combining the power of large language models with knowledge retrieval. In RAG implementations, documents are converted to vector embeddings and stored in a vector database like &lt;a href="https://milvus.io/?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Milvus&lt;/a&gt;, &lt;a href="https://zilliz.com/learn/faiss?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Faiss&lt;/a&gt;, and &lt;a href="https://zilliz.com/?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Zilliz Cloud&lt;/a&gt;. When a query arrives, the system retrieves the most relevant documents based on vector similarity. These retrieved documents provide context to an LLM, allowing for more accurate, up-to-date responses.&lt;/p&gt;

&lt;p&gt;This approach helps address the hallucination problem in LLMs while enabling them to access domain-specific information that wasn't included in their training data.&lt;/p&gt;
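&lt;p&gt;The retrieval step of RAG can be sketched as a brute-force nearest-neighbor search over a toy corpus. The documents, embeddings, and query vector below are all made up for illustration; a production system would use a vector database and a real embedding model:&lt;/p&gt;

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical corpus of (text, embedding) pairs.
docs = [
    ("Milvus supports HNSW and IVF indexes.", [0.9, 0.1, 0.0]),
    ("Our refund policy lasts 30 days.", [0.0, 0.2, 0.9]),
    ("Vector search finds nearest neighbors.", [0.8, 0.3, 0.1]),
]

def retrieve(query_embedding, k=2):
    # Rank every document by similarity to the query and keep top-k.
    ranked = sorted(docs, key=lambda d: cosine(d[1], query_embedding), reverse=True)
    return [text for text, _ in ranked[:k]]

# Hypothetical embedding for the question "how do I index vectors?"
context = retrieve([0.85, 0.2, 0.05])
prompt = "Answer using this context:\n" + "\n".join(context)
print(prompt)
```

&lt;p&gt;The retrieved context would then be prepended to the user's question and sent to the LLM, grounding its answer in your own documents.&lt;/p&gt;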

&lt;h3&gt;
  
  
  AI Agents and Knowledge Retrieval
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://zilliz.com/blog/function-calling-vs-mcp-vs-a2a-developers-guide-to-ai-agent-protocols?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;AI agents&lt;/a&gt; often need to make decisions based on relevant information scattered across various sources. Vector search enables these agents to quickly retrieve context-relevant information from large knowledge bases, identify similar past interactions or decisions, and construct memory systems that understand semantic similarity.&lt;/p&gt;

&lt;p&gt;For developers building AI agents, the choice of vector database can significantly impact both performance and capabilities.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recommendation Systems
&lt;/h3&gt;

&lt;p&gt;E-commerce platforms, streaming services, and content sites rely heavily on recommendation engines to increase engagement. Vector search powers these systems by representing user preferences and item features as vectors, finding items similar to those a user has liked previously, and identifying users with similar taste profiles.&lt;/p&gt;

&lt;p&gt;The right vector search engine can make the difference between recommendations that feel random versus those that seem to understand user preferences intuitively.&lt;/p&gt;

&lt;h3&gt;
  
  
  Semantic Search Applications
&lt;/h3&gt;

&lt;p&gt;Text search that understands meaning rather than just keywords is transforming how we interact with information. Vector search enables finding conceptually similar documents even when terminology differs, understanding user intent behind queries, and supporting multilingual search where concepts align across languages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Image and Multimedia Similarity Search
&lt;/h3&gt;

&lt;p&gt;Beyond text, vector search excels at finding similar images, audio, or videos. This capability powers applications like identifying visually similar products in e-commerce, finding music with similar acoustic properties, and detecting near-duplicate media assets.&lt;/p&gt;

&lt;p&gt;These applications require vector engines that can handle diverse embedding types efficiently.&lt;/p&gt;

&lt;p&gt;Now that we've covered the essentials of vector search and its common use cases, let's explore the top vector databases, with a focus on open-source options.&lt;/p&gt;

&lt;h2&gt;
  
  
  Milvus
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://milvus.io/?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Milvus&lt;/a&gt; is the most popular open-source vector database with more than &lt;a href="https://github.com/milvus-io/milvus?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;35,000 stars&lt;/a&gt; on GitHub. It first appeared in 2019 and has since gained significant traction in the developer community. Created specifically to handle large-scale similarity searches, Milvus was designed from the ground up to address the unique challenges of vector data management.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture and Technical Capabilities
&lt;/h3&gt;

&lt;p&gt;Milvus uses a cloud-native architecture with separated storage and compute layers. Stateless query nodes handle search requests, storage nodes manage data persistence, and coordinator nodes handle cluster management. This separation allows Milvus to scale horizontally as data volumes and query loads increase—a critical consideration for production deployments.&lt;/p&gt;

&lt;p&gt;The platform supports multiple index types, including &lt;a href="https://milvus.io/blog/understand-hierarchical-navigable-small-worlds-hnsw-for-vector-search.md?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;HNSW&lt;/a&gt; (Hierarchical Navigable Small World), &lt;a href="https://zilliz.com/learn/choosing-right-vector-index-for-your-project?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;IVF&lt;/a&gt; (Inverted File), &lt;a href="https://milvus.io/blog/diskann-explained.md?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;DiskANN&lt;/a&gt;, and others, providing developers with flexibility to optimize for different workloads. Milvus also offers &lt;a href="https://milvus.io/blog/get-started-with-hybrid-semantic-full-text-search-with-milvus-2-5.md?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;hybrid search capabilities&lt;/a&gt;, combining vector similarity with scalar filtering and full-text search, which proves valuable when search needs to consider both semantic similarity and keyword matching, as well as metadata constraints.&lt;/p&gt;

&lt;p&gt;Milvus supports multiple distance metrics, including Euclidean, Cosine, and Inner Product, making it adaptable to various embedding types and similarity definitions. Its storage architecture includes time travel capabilities, allowing point-in-time queries and backups.&lt;/p&gt;

&lt;p&gt;Milvus can be used to build various types of AI applications, from demos running locally in Jupyter Notebooks to massive-scale Kubernetes clusters handling tens of billions of vectors. Currently, there are three &lt;a href="https://milvus.io/docs/install-overview.md?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Milvus deployment options&lt;/a&gt;: Milvus Lite, Milvus Standalone, and Milvus Distributed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance Characteristics
&lt;/h3&gt;

&lt;p&gt;In benchmarks, Milvus demonstrates query latency typically in single-digit milliseconds for million-scale datasets, making it suitable for real-time applications. The platform supports ANNS (Approximate Nearest Neighbor Search) algorithms that trade perfect recall for substantial speed improvements—an essential trade-off for practical applications.&lt;/p&gt;

&lt;p&gt;Memory usage in Milvus is managed through disk-based storage with memory caching, allowing it to handle datasets larger than available RAM. This approach makes Milvus more cost-effective for large vector collections compared to purely in-memory solutions.&lt;/p&gt;

&lt;p&gt;For most production workloads, Milvus strikes a balance between recall accuracy and query speed, with tunable parameters that enable adjustments tailored to specific requirements. However, this flexibility comes with added complexity in configuration and optimization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Migration Simplicity
&lt;/h3&gt;

&lt;p&gt;A notable advantage of Milvus is the straightforward migration path from other vector databases. The open-source &lt;a href="https://github.com/zilliztech/vts?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Vector Transport Service (VTS)&lt;/a&gt; tool simplifies moving data from other vector search engines to Milvus, supporting automated schema mapping, incremental data migration, and data validation during the transfer process. This makes Milvus particularly attractive for teams that have outgrown their current solution or want to standardize on a single platform.&lt;/p&gt;

&lt;p&gt;That said, migration always involves some effort and risk, so thorough testing remains necessary, despite the use of these tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  Zilliz Cloud: Fully Managed Milvus
&lt;/h3&gt;

&lt;p&gt;While open-source Milvus is powerful on its own, deploying, operating, and maintaining it for production-level applications requires dedicated hardware and engineering resources. Zilliz, the engineering team behind Milvus, has created a fully managed Milvus service on &lt;a href="https://zilliz.com/cloud?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Zilliz Cloud&lt;/a&gt;, eliminating that operational overhead so customers can invest in their products and business rather than infrastructure management.&lt;/p&gt;

&lt;p&gt;Zilliz Cloud provides additional features, simplified deployment and operations, automatic scaling and resource management, advanced security, and SLA-backed reliability. The managed service also includes continuous updates and optimizations, reducing the need for in-house operational expertise.&lt;/p&gt;

&lt;p&gt;For teams focused on building applications rather than managing infrastructure, Zilliz Cloud provides a way to leverage Milvus without operational overhead. &lt;/p&gt;

&lt;h3&gt;
  
  
  Community and Ecosystem
&lt;/h3&gt;

&lt;p&gt;The Milvus ecosystem has grown substantially, with an active GitHub repository that features regular releases. The project provides client SDKs for Python, Java, Go, and other languages, as well as integration with popular AI models and ML frameworks, including LangChain and LlamaIndex. Additionally, it features a growing community forum and comprehensive documentation.&lt;/p&gt;

&lt;p&gt;This ecosystem maturity reduces implementation risks and provides multiple resources for troubleshooting. However, like any open-source project, community support can sometimes be unpredictable compared to paid support options.&lt;/p&gt;

&lt;h2&gt;
  
  
  Faiss
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://zilliz.com/learn/faiss?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Faiss&lt;/a&gt;, short for Facebook AI Similarity Search, is a popular vector search library that was developed and open-sourced by Facebook AI Research (now Meta) in 2017. Unlike some other options in this comparison, Faiss was created by researchers for researchers, initially focusing on academic and experimental workloads before being adopted for production systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Technical Overview
&lt;/h3&gt;

&lt;p&gt;Faiss takes a different approach from some other vector search solutions. It's implemented in C++ with Python bindings for performance and designed as a library rather than a standalone service. One distinguishing feature is its optimization for both CPU and GPU execution, with certain workloads seeing dramatic speedups on GPU hardware.&lt;/p&gt;

&lt;p&gt;The library offers multiple index types tailored for various scenarios. IndexFlatL2 offers exact search with L2 distance for perfect accuracy. IndexIVFFlat implements an inverted file with flat storage for improved query speed. IndexHNSW leverages Hierarchical Navigable Small World graphs for efficient approximate search. IndexPQ utilizes product quantization for memory efficiency, allowing even modest hardware to search billions of vectors.&lt;/p&gt;
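&lt;p&gt;To illustrate the idea behind IndexPQ, here is a minimal product-quantization sketch. The codebooks below are tiny hand-picked stand-ins; Faiss learns its codebooks with k-means and typically uses many more centroids per sub-space:&lt;/p&gt;

```python
def squared_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Two sub-spaces, each with a codebook of three 2-dimensional
# centroids (hand-picked for illustration; Faiss trains these with
# k-means and commonly uses 256 centroids per sub-space).
codebooks = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],   # covers dimensions 0-1
    [[0.5, 0.5], [1.0, 1.0], [0.0, 0.0]],   # covers dimensions 2-3
]

def encode(vector):
    # Replace each sub-vector with the index of its nearest centroid.
    codes = []
    for i, book in enumerate(codebooks):
        sub = vector[2 * i : 2 * i + 2]
        codes.append(min(range(len(book)), key=lambda j: squared_dist(book[j], sub)))
    return codes

def decode(codes):
    # Reconstruct an approximate vector from the stored codes.
    out = []
    for i, code in enumerate(codes):
        out.extend(codebooks[i][code])
    return out

vec = [0.9, 0.1, 0.6, 0.4]
codes = encode(vec)
print(codes)          # [1, 0]: two small integers instead of four floats
print(decode(codes))  # [1.0, 0.0, 0.5, 0.5], an approximation of vec
```

&lt;p&gt;Each vector is stored as a few small integers rather than full floats, which is where the order-of-magnitude memory savings come from.&lt;/p&gt;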

&lt;h3&gt;
  
  
  Strengths and Limitations
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;One of Faiss's major strengths is raw performance. It's often the fastest option for in-memory vector search when properly configured.&lt;/strong&gt; The library achieves memory efficiency through clever compression techniques, such as &lt;a href="https://zilliz.com/learn/harnessing-product-quantization-for-memory-efficiency-in-vector-databases?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;product quantization&lt;/a&gt;, which can reduce vector storage requirements by an order of magnitude.&lt;/p&gt;

&lt;p&gt;Faiss also stands out with native GPU support for even faster processing, making it ideal for research environments with access to GPU resources. The library offers fine-grained control with detailed parameter tuning options for those who want to optimize their workloads.&lt;/p&gt;

&lt;p&gt;However, Faiss comes with notable limitations. It has no built-in persistence layer, meaning developers must handle saving and loading indexes themselves. Because it's a library rather than a service, it requires more integration work than turnkey solutions, and it's less suited to distributed deployments without additional engineering. As a result, many developers reach for Faiss when experimenting or prototyping rather than as a complete production backend.&lt;/p&gt;

&lt;p&gt;Perhaps most significantly, Faiss has a steeper learning curve than some alternatives. The documentation, while comprehensive, assumes a strong understanding of the underlying algorithms and techniques.&lt;/p&gt;

&lt;h2&gt;
  
  
  Annoy
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://zilliz.com/learn/what-is-annoy?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Annoy&lt;/a&gt;, which stands for "Approximate Nearest Neighbors Oh Yeah," was developed by Spotify and open-sourced in 2013, making it one of the older solutions in this comparison. Created specifically to power Spotify's music recommendation system, Annoy takes a distinct approach optimized for read-heavy workloads with relatively static data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approximate Nearest Neighbors Approach
&lt;/h3&gt;

&lt;p&gt;Annoy uses random projection binary search trees as its core algorithm. Each tree splits the vector space differently, creating a forest of trees that collectively provide good approximations of the true nearest neighbors. As more trees are added to the forest, the probability of finding the true nearest neighbors increases, allowing a trade-off between accuracy and resource usage.&lt;/p&gt;

&lt;p&gt;This approach differs significantly from the graph-based methods used by many newer vector search engines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance Trade-offs
&lt;/h3&gt;

&lt;p&gt;Annoy makes specific trade-offs that distinguish it from more general-purpose solutions. It's read-optimized, delivering very fast performance at query time, but this comes at the cost of write flexibility. Once built, Annoy indexes don't change—new data requires rebuilding the index.&lt;/p&gt;

&lt;p&gt;The system is disk-based, with indexes that can be memory-mapped for efficiency. This allows Annoy to handle datasets larger than available RAM while maintaining good query performance. However, Annoy offers limited functionality beyond core approximate nearest neighbor search, lacking many features found in more comprehensive solutions.&lt;/p&gt;

&lt;p&gt;These design choices make Annoy different from databases designed for frequent updates and complex queries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Integration Options
&lt;/h3&gt;

&lt;p&gt;Annoy offers Python bindings with scikit-learn compatibility, making it accessible to data scientists and ML engineers. Its C++ core provides good performance despite the simplified API. The library supports easy serialization and deserialization of indexes, facilitating offline build processes.&lt;/p&gt;

&lt;p&gt;The API is simple and focused exclusively on nearest neighbor search, making it easy to learn, but it is limited in functionality. Unlike more comprehensive vector databases, Annoy requires additional infrastructure for features like persistence, scaling, and query filtering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Weaviate
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://zilliz.com/comparison/milvus-vs-weaviate?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Weaviate&lt;/a&gt; emerged in 2019 as a different approach to vector search. Unlike pure vector databases, Weaviate combines vector search capabilities with a knowledge graph, creating a hybrid system designed to add contextual understanding to similarity queries.&lt;/p&gt;

&lt;p&gt;What sets Weaviate apart is its graph-based data model. In Weaviate, data objects can be connected through semantic relationships, and these connections add context to vector-based queries. This allows queries to blend vector similarity with graph traversal, supporting more sophisticated searches than simple nearest-neighbor matching. For instance, a deployment might store product embeddings and also model relationships between products, categories, and brands. A user query could then return not only similar items but also those connected through shared attributes or behaviors.&lt;/p&gt;
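
&lt;p&gt;To make this concrete, here is a hedged sketch of the kind of GraphQL query Weaviate supports, blending &lt;code&gt;nearVector&lt;/code&gt; similarity with traversal of a reference property; the &lt;code&gt;Product&lt;/code&gt;/&lt;code&gt;Category&lt;/code&gt; schema and vector values are hypothetical.&lt;/p&gt;

```python
# A hypothetical Weaviate GraphQL query: vector similarity plus a hop
# across a reference property to enrich results with category context.
query = """
{
  Get {
    Product(
      nearVector: { vector: [0.12, 0.34, 0.56] }
      limit: 5
    ) {
      name
      inCategory {
        ... on Category { title }
      }
    }
  }
}
"""
print("nearVector" in query)
```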

&lt;p&gt;This hybrid model enables expressive querying, but it also introduces additional complexity in data modeling and indexing. Developers must manage both vector embeddings and graph relationships, which can increase the learning curve and operational overhead.&lt;/p&gt;

&lt;p&gt;Weaviate uses HNSW-based indexing for efficient vector search and supports flexible filtering applied either pre- or post-search. It scales through sharding, allowing it to handle growing datasets and query loads. However, distributed setups can become more complex to configure and operate, particularly at larger scales.&lt;/p&gt;

&lt;p&gt;While Weaviate performs well across a variety of use cases, it's not always the top performer in pure vector search benchmarks. Its additional graph features, while powerful, can lead to slower response times when executing complex queries that combine vector search with multiple relationship traversals. This makes it better suited to applications that benefit from contextual enrichment, rather than those requiring ultra-low latency on high-throughput vector-only workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Qdrant
&lt;/h2&gt;

&lt;p&gt;Qdrant (pronounced "quadrant") is a newer entrant to the vector database space, first appearing in 2021. Qdrant provides both REST and gRPC APIs for interacting with the database, making it accessible from virtually any programming language. Its storage is isolated in collections, similar to tables in traditional databases, providing logical separation of different data types. The architecture offers point-in-time consistency guarantees and ACID-compliant operations for data reliability. This approach makes Qdrant more familiar to developers coming from traditional database backgrounds, reducing the learning curve.&lt;/p&gt;

&lt;p&gt;A key strength of Qdrant is its ability to combine vector search with traditional filtering. The platform offers rich filter expressions that execute efficiently as part of the search process. Its payload-based filtering integrates directly into the search rather than being applied as a post-processing step. It also supports complex boolean conditions, including AND, OR, and NOT operations across multiple fields, and allows boosting results based on specific filter conditions—useful for nuanced ranking in hybrid search.&lt;/p&gt;
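
&lt;p&gt;As a sketch, this is roughly how such a filter appears inside a Qdrant REST search request; the field names and values here are hypothetical.&lt;/p&gt;

```python
# Qdrant-style search body: the filter runs inside the search itself,
# not as a post-processing step. Field names are hypothetical examples.
search_request = {
    "vector": [0.05, 0.61, 0.76, 0.74],
    "limit": 10,
    "filter": {
        "must": [                                     # AND across conditions
            {"key": "category", "match": {"value": "shoes"}},
            {"key": "price", "range": {"lte": 100}},
        ],
        "must_not": [                                 # NOT
            {"key": "discontinued", "match": {"value": True}},
        ],
    },
}
```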

&lt;p&gt;However, this filtering flexibility comes with trade-offs. As filter expressions become more complex or datasets grow, query performance may degrade, particularly when many filters are applied to high-cardinality fields. Additionally, while Qdrant supports distributed deployments, its horizontal scaling features are still evolving compared to more mature systems, and operational tooling around large-scale clustering remains relatively limited. These factors should be considered when evaluating Qdrant for high-scale or highly dynamic workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison Table: Key Features of Top Vector Search Engines
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;Architecture&lt;/th&gt;
&lt;th&gt;Filtering&lt;/th&gt;
&lt;th&gt;Managed Option&lt;/th&gt;
&lt;th&gt;Distributed&lt;/th&gt;
&lt;th&gt;Update Frequency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Milvus&lt;/td&gt;
&lt;td&gt;Cloud-native, storage/compute separation&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Zilliz Cloud&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Real-time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Faiss&lt;/td&gt;
&lt;td&gt;Library, C++ with Python bindings&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Batch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Annoy&lt;/td&gt;
&lt;td&gt;Forest of binary trees&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Offline only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weaviate&lt;/td&gt;
&lt;td&gt;Knowledge graph + vector DB&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Weaviate Cloud&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Real-time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qdrant&lt;/td&gt;
&lt;td&gt;Rust-based, collections&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Qdrant Cloud&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Real-time&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Other Notable Vector Search Options
&lt;/h2&gt;

&lt;p&gt;Beyond the purpose-built options highlighted above, many traditional databases have started to offer vector search capabilities as an add-on. &lt;/p&gt;

&lt;h3&gt;
  
  
  Elasticsearch with Vector Search
&lt;/h3&gt;

&lt;p&gt;Elasticsearch, already widely adopted for text search, has added vector search capabilities in recent versions. This functionality introduces kNN (k-Nearest Neighbors) search to the Elasticsearch ecosystem, enabling organizations to utilize their existing infrastructure for vector search requirements.&lt;/p&gt;

&lt;p&gt;The integration with existing Elasticsearch features enables teams to combine traditional text search, faceting, and aggregations with vector similarity on a single platform. The familiar API reduces the learning curve for teams already using Elasticsearch.&lt;/p&gt;
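
&lt;p&gt;A hedged sketch of what such a combined request can look like in Elasticsearch 8.x; the index mapping and field names are assumptions for illustration.&lt;/p&gt;

```python
# Elasticsearch kNN search body combining vector similarity with a
# traditional term filter. Field names are hypothetical.
body = {
    "knn": {
        "field": "title_vector",          # a dense_vector field
        "query_vector": [0.1, 0.2, 0.3],
        "k": 10,
        "num_candidates": 100,            # accuracy/latency trade-off
        "filter": {"term": {"status": "published"}},
    },
    "_source": ["title", "status"],
}
```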

&lt;p&gt;This approach works well for organizations already invested in the Elastic ecosystem who need to add vector capabilities without adopting an entirely new database. However, performance may not match purpose-built vector databases for large-scale, vector-only workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  Vespa
&lt;/h3&gt;

&lt;p&gt;Vespa is Yahoo's open source search engine that combines traditional search, vector search, and sophisticated ranking in a single platform. It offers real-time indexing and searching, with updates immediately available for query, unlike some solutions that require batch processing or index rebuilding.&lt;/p&gt;

&lt;p&gt;The platform provides sophisticated ranking frameworks that can combine multiple signals, including vector similarity, text relevance, and business rules. It scales to large deployments with a distributed architecture and has been battle-tested in production at major internet companies.&lt;/p&gt;

&lt;p&gt;Vespa's comprehensive feature set makes it suitable for complex search applications, though this comes with increased complexity compared to more focused solutions. It requires more resources to deploy and maintain than simpler vector search options.&lt;/p&gt;

&lt;h3&gt;
  
  
  pgvector
&lt;/h3&gt;

&lt;p&gt;pgvector is an extension that adds vector data types and operations to PostgreSQL, allowing vector search within a traditional relational database. It supports multiple index types including IVF and HNSW for efficient similarity search on vector columns.&lt;/p&gt;

&lt;p&gt;The key advantage is the ability to use SQL queries combining vector and relational data, making it easy to add vector search to existing applications without adopting a separate database. This option leverages existing PostgreSQL infrastructure and expertise, potentially reducing operational overhead.&lt;/p&gt;
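
&lt;p&gt;A minimal sketch of what this looks like in SQL, assuming a hypothetical &lt;code&gt;items&lt;/code&gt; table; the statements would be executed through any PostgreSQL driver.&lt;/p&gt;

```python
# pgvector sketch: DDL plus a nearest-neighbor query in plain SQL.
# `vector(3)` is a 3-dimensional vector column; `<->` is L2 distance.
setup_sql = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);
"""

# Vector and relational predicates mix freely in one statement.
query_sql = """
SELECT id FROM items
WHERE id < 1000
ORDER BY embedding <-> '[0.1, 0.2, 0.3]'
LIMIT 5;
"""
```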

&lt;p&gt;The main limitation is that performance may not match dedicated vector databases for very large vector collections or high query volumes. It represents a pragmatic compromise rather than an optimized solution for vector-only workloads. More fundamentally, &lt;a href="https://milvus.io/blog/why-ai-databases-do-not-need-sql.md?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;is SQL even necessary for AI workloads in the future&lt;/a&gt;?&lt;/p&gt;

&lt;h3&gt;
  
  
  Emerging Options
&lt;/h3&gt;

&lt;p&gt;The vector database space continues to evolve with newer projects entering the field. Chroma focuses specifically on embeddings for LLM applications, with simplified APIs for RAG implementations. Marqo emphasizes simplicity and cloud-native operations, aiming to reduce the operational burden of vector search. LanceDB offers embedded vector search capabilities, targeting edge devices and applications that need to operate offline.&lt;/p&gt;

&lt;p&gt;These emerging options show the continued innovation in the space, though they generally lack the production history and ecosystem maturity of more established solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing the Right Vector Search Engine 
&lt;/h2&gt;

&lt;p&gt;With so many options available, selecting the right vector search engine requires careful consideration of your specific needs and constraints.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decision Framework
&lt;/h3&gt;

&lt;p&gt;When evaluating vector search engines, start by considering your scale requirements—how many vectors will you store and query, both now and in the future? Different engines have different scaling characteristics and sweet spots.&lt;/p&gt;

&lt;p&gt;Next, assess your query patterns. Will you perform pure vector search, or do you need to combine vector similarity with filtering, relationship traversal, or other operations? Some engines excel at pure vector search but struggle with complex hybrid queries.&lt;/p&gt;

&lt;p&gt;Update frequency is another important consideration. If your data changes frequently or requires real-time updates, solutions like Annoy that require rebuilding indexes will be problematic. Conversely, if your data is relatively static, simpler architectures may offer performance advantages.&lt;/p&gt;

&lt;p&gt;Integration needs matter as well. Do you need a standalone service, a library to embed in your application, or an extension to an existing database? Your current infrastructure and team expertise may make certain options more practical than others.&lt;/p&gt;

&lt;p&gt;Finally, consider your team's expertise with specific technologies. The best technical solution on paper may not be the best choice if your team lacks the skills to implement and maintain it effectively.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scaling Considerations
&lt;/h3&gt;

&lt;p&gt;Different engines approach scaling in different ways, and understanding these differences is crucial for achieving long-term success. Milvus offers horizontal scaling with separated storage and compute, allowing independent scaling of different components as needs change. Faiss excels at vertical scaling, particularly with GPU acceleration, but requires more custom work for distributed deployments.&lt;/p&gt;

&lt;p&gt;Your anticipated growth trajectory should influence your choice, with some solutions better suited to gradual scaling while others may require significant re-architecture as you grow.&lt;/p&gt;

&lt;h3&gt;
  
  
  Total Cost of Ownership
&lt;/h3&gt;

&lt;p&gt;When selecting a vector search engine, consider all aspects of total cost of ownership. Infrastructure costs include RAM and CPU requirements, which vary significantly between solutions. Some engines require substantial memory for optimal performance, while others can operate effectively with more modest resources.&lt;/p&gt;

&lt;p&gt;Operational complexity affects ongoing maintenance costs. Deployment, monitoring, and maintenance effort varies widely, with some solutions requiring specialized expertise while others integrate more easily with standard DevOps practices.&lt;/p&gt;

&lt;p&gt;Development time is another important factor. The learning curve and integration complexity of different engines can significantly impact project timelines and success rates. Solutions with better documentation, more examples, and more intuitive APIs typically result in faster implementation.&lt;/p&gt;

&lt;p&gt;Support options range from community forums to commercial support agreements. Consider your organization's requirements for response times and support guarantees when evaluating options.&lt;/p&gt;

&lt;p&gt;Finally, consider potential migration costs. If your needs change, how difficult would it be to switch to a different solution? Engines with standard APIs and export capabilities provide more future flexibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  Future-Proofing
&lt;/h3&gt;

&lt;p&gt;Vector search technology is evolving rapidly; therefore, selecting a solution that can adapt to your changing needs is crucial. Examine community activity and release cadence to assess ongoing development. Projects with regular updates and active discussion forums are more likely to remain relevant and up-to-date.&lt;/p&gt;

&lt;p&gt;Corporate backing and sustainability matter for long-term viability. Projects supported by established companies or foundations generally have more stable development trajectories.&lt;/p&gt;

&lt;p&gt;Aligning the feature roadmap with your anticipated needs helps ensure the solution grows in directions that benefit your use cases. Finally, flexibility to adapt as requirements change provides insurance against unexpected shifts in project requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benchmarking with Real-world Workloads
&lt;/h3&gt;

&lt;p&gt;Benchmark results are often the first thing teams look at when comparing vector search engines, but many published benchmarks fail to reflect real-world usage. Synthetic tests tend to focus on idealized conditions—fixed datasets, uniform queries, and read-heavy workloads—while ignoring the complexities of real applications. In production, your system may need to support frequent updates, concurrent queries, multi-modal filtering, and hybrid search across structured and unstructured data. These challenges can drastically affect actual performance, scalability, and reliability. &lt;/p&gt;

&lt;p&gt;To make an informed choice, prioritize benchmarks that replicate your expected workload patterns as closely as possible. Testing with real datasets, realistic query volumes, and operational constraints will provide a more accurate picture of how a vector search engine performs in your environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/zilliztech/VectorDBBench?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;VDBBench&lt;/a&gt; is an open-source benchmark designed from the ground up to simulate production reality. Unlike synthetic tests that cherry-pick scenarios, VDBBench pushes databases through continuous ingestion, rigorous filtering conditions, and diverse scenarios, just like your actual production workloads. &lt;/p&gt;

&lt;p&gt;VDBBench GitHub: &lt;a href="https://github.com/zilliztech/VectorDBBench" rel="noopener noreferrer"&gt;https://github.com/zilliztech/VectorDBBench&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion and Next Steps
&lt;/h2&gt;

&lt;p&gt;Vector search has moved beyond niche applications to become a fundamental building block for many modern applications. The open source ecosystem offers multiple strong options, each with distinct advantages and trade-offs.&lt;/p&gt;

&lt;p&gt;For most teams just starting with vector search, Milvus provides a good balance of features, performance, and operational simplicity. Its comprehensive functionality and growing ecosystem make it suitable for a wide range of use cases, while fully managed options like Zilliz Cloud reduce operational overhead.&lt;/p&gt;

&lt;p&gt;For specific needs, alternatives like Faiss (performance-focused), Weaviate (knowledge graph integration), Qdrant (filtering capabilities), or Annoy (read-optimized workloads) may be better fits.&lt;/p&gt;

&lt;p&gt;Whatever you choose, start small, benchmark thoroughly against your specific workload, and validate assumptions before committing to a production deployment. Vector search technology continues to evolve rapidly, so staying engaged with the community around your chosen solution is essential for long-term success.&lt;/p&gt;

&lt;p&gt;Ready to get started? Most of these projects offer excellent quickstart guides, Docker containers for easy experimentation, and active communities eager to help newcomers. The best way to evaluate is to build a small proof of concept with your actual data and query patterns.&lt;/p&gt;

&lt;p&gt;Happy searching!&lt;/p&gt;

</description>
      <category>community</category>
    </item>
    <item>
      <title>Popular Video AI Models Every Developer Should Know</title>
      <dc:creator>Chloe Williams</dc:creator>
      <pubDate>Fri, 09 May 2025 21:19:26 +0000</pubDate>
      <link>https://dev.to/zilliz/popular-video-ai-models-every-developer-should-know-33ha</link>
      <guid>https://dev.to/zilliz/popular-video-ai-models-every-developer-should-know-33ha</guid>
      <description>&lt;p&gt;Ever wondered how Netflix recommends the perfect movie trailer, how security cameras detect unusual activity, or how sports broadcasters create instant highlights? The secret lies in Video AI models. Video AI models enable automated analysis of complex visual data in real-time, enhancing efficiency, accuracy, and decision-making across sports analytics, surveillance, and content creation. With the explosion of video content (~500 hours of video are uploaded to YouTube every minute), video-centric AI applications are more critical than ever. &lt;/p&gt;

&lt;p&gt;These applications are possible only because of the underlying AI models that allow machines to analyze, interpret, and even predict events from videos. Video AI models enable advanced tasks like object detection, action recognition, and semantic segmentation, where traditional computer vision models fall short. In this blog, we’ll dive into some of the most popular Video AI models every developer should know, and how they’re shaping the future of video intelligence.&lt;/p&gt;

&lt;h2&gt;
  
  
  YOLO (You Only Look Once): Real-Time Object Detection
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/1506.02640?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;YOLO (You Only Look Once)&lt;/a&gt; is a real-time object detection model that processes an entire image in a single pass through its neural network. Unlike traditional methods like R-CNN or Fast R-CNN that rely on region proposals and multiple stages, YOLO treats detection as a regression problem, predicting bounding boxes and class probabilities directly. This makes YOLO exceptionally fast and ideal for real-time applications with great accuracy. &lt;/p&gt;

&lt;h3&gt;
  
  
  Key Features and Benefits
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Real-time speed&lt;/strong&gt; - Delivers predictions at high frame rates suitable for real-time object detection. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Accuracy&lt;/strong&gt; - Generates fewer false positives and effectively detects multiple objects in a single frame due to consideration of global context during training. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalable and lightweight&lt;/strong&gt; - Optimized versions of YOLO can be easily deployed on edge devices with limited computational power. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-scale prediction&lt;/strong&gt; - Later versions of YOLO perform predictions at multiple scales, improving accuracy for small objects. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  YOLO v8: Latest Advances
&lt;/h3&gt;

&lt;p&gt;YOLO v8 offers an improved architecture compared to previous versions, with gains in both speed and accuracy, and can perform object detection, classification, and segmentation tasks. Some key modifications are as follows. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbj98rt5mut2lzqbkqq0m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbj98rt5mut2lzqbkqq0m.png" width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Basic architecture of YOLO v8 (&lt;a href="https://www.researchgate.net/figure/The-improved-YOLOv8-network-architecture-includes-an-additional-module-for-the-head_fig2_372207753?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Source&lt;/a&gt;)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Anchor-free detection&lt;/strong&gt; - The model directly predicts the center of an object instead of offsets, which helps improve the learning speed for custom datasets. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mosaic data augmentation&lt;/strong&gt; - For better generalization, four images are mixed during training to provide variable locations, occlusions, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;C2f module&lt;/strong&gt; - The model’s backbone consists of a C2f module instead of C3, which helps speed up the training process with improved gradient flow. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Decoupled head&lt;/strong&gt; - Classification and regression are performed by separate heads, improving model performance. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Common Use Cases
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Surveillance:&lt;/strong&gt; YOLO powers real-time crowd detection, intruder alerts, and suspicious activity monitoring in security cameras, enhancing public safety.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Retail:&lt;/strong&gt; Tasks like object counting, customer behavior analysis, and automated inventory tracking help retailers optimize stock management and improve the shopping experience.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sports:&lt;/strong&gt; YOLO enables real-time tracking of players, balls, and equipment, providing performance analytics and enhancing live broadcasts. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  MoViNet: Efficient Action Recognition for Embedding Extraction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2103.11511?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;MoViNet (Mobile Video Networks)&lt;/a&gt; is an advanced video recognition model for real-time action recognition tasks. MoViNet can stream videos for online inference while being computationally efficient and requiring minimal memory, as opposed to previous 2D CNN-based architectures, which were resource-intensive.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faduyhsqtsnrgcab21mdi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faduyhsqtsnrgcab21mdi.png" width="800" height="253"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MoViNet architecture for streaming eval (&lt;a href="https://www.analyticsvidhya.com/blog/2024/08/exploring-movinets-efficient-mobile-video-recognition/?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Source&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;MoViNet leverages techniques such as Neural Architecture Search (NAS) for generating 3D CNN architectures, stream buffering, and temporal ensembles of streaming MoViNets to improve accuracy and efficiency. This makes MoViNets optimized to run on edge devices such as smartphones and wearables, where low latency is critical. &lt;/p&gt;

&lt;h3&gt;
  
  
  Key Features and Benefits
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Low latency&lt;/strong&gt; - Optimized for real-time inference with low latency due to the stream buffer technique.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;High performance&lt;/strong&gt; - Achieves competitive accuracy on various video recognition benchmarks. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalability&lt;/strong&gt; - MoViNet models are available in different variants, allowing developers to choose a model that suits their resource constraints and performance needs. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  MoViNet for Edge Computing
&lt;/h3&gt;

&lt;p&gt;MoViNet is ideally suited to edge devices due to its low memory requirements, computational efficiency, and low latency. Techniques like stream buffering (processing video streams frame by frame, keeping the memory footprint constant, and reducing processing time) and causal convolutions (processing video frames sequentially) allow on-the-fly detection of various actions, which makes MoViNets suitable for use in fitness trackers, sports analytics, and autonomous vehicles.&lt;/p&gt;
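
&lt;p&gt;The stream-buffer idea can be illustrated with a toy sketch: frames arrive one at a time and only a fixed-size window of past state is retained, so memory stays constant regardless of clip length (the "model" below is just a moving average, not MoViNet itself).&lt;/p&gt;

```python
# Toy causal streaming: constant memory no matter how long the video runs.
from collections import deque

def stream_scores(frames, buffer_size=4):
    buffer = deque(maxlen=buffer_size)   # fixed-size buffer of past frames
    scores = []
    for frame in frames:                 # causal: only past frames are seen
        buffer.append(frame)
        scores.append(sum(buffer) / len(buffer))
    return scores

print(stream_scores([1.0, 2.0, 3.0, 4.0, 5.0], buffer_size=2))
# [1.0, 1.5, 2.5, 3.5, 4.5]
```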

&lt;h3&gt;
  
  
  Common Use Cases
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sports Analytics&lt;/strong&gt; - MoViNet can analyze player actions and movements, helping coaches and analysts track performance, strategy adherence, and injury risks in real-time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Healthcare Monitoring&lt;/strong&gt; - In physical therapy or elder care, MoViNet can analyze patient movements to ensure exercise compliance, track rehabilitation progress, and detect falls or unusual behavior.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Smart Homes&lt;/strong&gt; - MoViNet enables activity recognition and gesture control, allowing users to intuitively interact with smart home devices or automate home security with anomaly recognition.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  SlowFast: Temporal Modeling for Action Recognition
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/1812.03982?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;SlowFast&lt;/a&gt; is a two-stream convolutional neural network designed for action recognition. It treats spatial structures and temporal events separately, as not all spatiotemporal orientations change equally fast in a video. Hence, it processes video at two different frame rates: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft0xnqr77kfpshplpxs2k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft0xnqr77kfpshplpxs2k.png" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Low frame rate and high frame rate in SlowFast networks (&lt;a href="https://openaccess.thecvf.com/content_ICCV_2019/papers/Feichtenhofer_SlowFast_Networks_for_Video_Recognition_ICCV_2019_paper.pdf?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Source&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;1) The &lt;strong&gt;Slow pathway&lt;/strong&gt; operates at a low frame rate to capture spatial semantics (e.g., objects or appearance). &lt;/p&gt;

&lt;p&gt;2) The &lt;strong&gt;Fast pathway&lt;/strong&gt; operates at a high frame rate, capturing rapid actions and fine temporal information (e.g., clapping, waving, etc.). &lt;/p&gt;

&lt;p&gt;By combining high spatial resolution from the Slow stream with fine-grained motion cues from the Fast stream, SlowFast models achieve state-of-the-art performance in complex action recognition tasks.&lt;/p&gt;
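The two frame rates can be illustrated with a small sampling sketch in plain NumPy. The speed ratio of 8 between the pathways mirrors the paper's default α; the clip length and base stride are arbitrary toy values:

```python
import numpy as np

# A dummy clip: 64 frames of 224x224 RGB video, laid out as (T, H, W, C).
clip = np.random.rand(64, 224, 224, 3)

ALPHA = 8  # the Fast pathway samples ALPHA times more frames than the Slow one

# Slow pathway: sample sparsely (every 16th frame -> 4 frames).
slow_input = clip[::16]
# Fast pathway: sample ALPHA times more densely (every 2nd frame -> 32 frames).
fast_input = clip[::16 // ALPHA]

print(slow_input.shape[0], fast_input.shape[0])  # 4 32
```

The Fast pathway compensates for its higher frame count by using far fewer channels, which is why the extra frames stay cheap.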

&lt;h3&gt;
  
  
  Key Features and Benefits
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dual-Stream Architecture&lt;/strong&gt; - The slow and fast streams capture rich spatial information and temporal changes, merged through lateral connections, making action recognition efficient. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Computational Efficiency&lt;/strong&gt; - The reduced number of channels in the fast stream makes the model lightweight and computationally efficient. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Action Recognition Performance&lt;/strong&gt; - The model achieved state-of-the-art performance on the Kinetics-400, Kinetics-600, Charades, and AVA datasets.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SlowFast’s Strengths in Complex Video Analysis
&lt;/h3&gt;

&lt;p&gt;The SlowFast network is particularly effective for complex action recognition tasks. The slow stream maintains a global understanding of the scene over longer time spans, while the fast stream precisely models rapid temporal changes. The fast stream has temporal convolutions in every block, which help capture fine-grained temporal details. As a result, SlowFast can detect and recognize complex behaviors and actions. &lt;/p&gt;

&lt;h3&gt;
  
  
  Common Use Cases
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sports Analytics&lt;/strong&gt; - SlowFast can track player movements, capture team movement patterns, and support strategy analysis in matches. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Surveillance&lt;/strong&gt; - SlowFast can help detect suspicious behaviors by analyzing both rapid movements and long-term patterns, making it ideal for real-time security monitoring. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Entertainment&lt;/strong&gt; - SlowFast can classify action scenes in movies, enable automated content tagging, and offer personalized recommendations on entertainment platforms. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  TimeSformer: Transformers for Video Understanding and Embedding
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2102.05095?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;TimeSformer&lt;/a&gt; (Time-Space Transformer) is a video classification and action recognition model based purely on transformers, which are popularly used for Natural Language Processing tasks. The model uses self-attention mechanisms across both spatial and temporal dimensions, allowing efficient and accurate modeling of video data. Compared to modern 3D CNN models, TimeSformer is three times faster to train and requires less than one-tenth the amount of compute for inference, making it ideal for processing videos in real-time. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu84gxwpymu5g4f4o19i8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu84gxwpymu5g4f4o19i8.png" width="800" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Divided space-time attention in TimeSformer architecture (&lt;a href="https://ai.meta.com/blog/timesformer-a-new-architecture-for-video-understanding/?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Source&lt;/a&gt;)&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Features and Benefits
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalability&lt;/strong&gt; - Because expensive 3D convolutions are eliminated, large models can be trained on longer video clips (a temporal extent of 102 seconds), enabling understanding of complex human actions. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Low Computational Cost&lt;/strong&gt; - The input video is processed as a small set of patches, and the form of self-attention used avoids an exhaustive comparison of all patch pairs, saving time and resources.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Divided Space-Time Attention&lt;/strong&gt; - The self-attention mechanism is split into two sub-parts - temporal attention and spatial attention - which increases the efficiency and accuracy of the model. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  TimeSformer for Video Classification and Recognition
&lt;/h3&gt;

&lt;p&gt;TimeSformer is a novel transformer-based architecture that overcomes a key limitation of 3D convolutional filters: they cannot model space-time dependencies beyond their small receptive field. First, the input video is represented as a time-space sequence of image patches extracted from individual frames. These patches are then processed through a self-attention mechanism (comparing each patch to other patches) across both spatial and temporal dimensions. This lets TimeSformer perform video classification and recognition by capturing fine temporal details precisely and efficiently.&lt;/p&gt;
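A back-of-the-envelope count shows why dividing attention helps. With T frames and S patches per frame, joint space-time attention compares every patch with every other patch, while divided attention runs a temporal pass and a spatial pass separately. The numbers below are toy values; the patch count assumes ViT-style 14x14 patching of each frame:

```python
# Pairwise-comparison cost of joint vs. divided space-time attention.
T, S = 8, 196  # 8 frames, 14*14 = 196 patches per frame (assumed patching)

joint = (T * S) ** 2           # every patch attends to every other patch
divided = (T * S) * (T + S)    # temporal pass (T targets) + spatial pass (S targets)

print(joint, divided, round(joint / divided, 1))  # 2458624 319872 7.7
```

For this configuration divided attention needs roughly 7.7 times fewer comparisons, and the gap widens as clips get longer.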

&lt;h3&gt;
  
  
  Common Use Cases
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Action Recognition&lt;/strong&gt; - TimeSformer works great to detect specific activities in videos, such as identifying different sports movements or human interactions in surveillance footage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Video Classification&lt;/strong&gt; - As TimeSformer can model long-range video sequences, it can categorize entire videos into genres or themes (e.g., comedy, sports, documentaries). &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Content Moderation&lt;/strong&gt; - TimeSformer can identify inappropriate or harmful content in videos for automated filtering on platforms like YouTube and TikTok.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  CLIP (Contrastive Language-Image Pre-Training): Bridging Text and Video
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://zilliz.com/learn/exploring-openai-clip-the-future-of-multimodal-ai-learning?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;CLIP (Contrastive Language-Image Pretraining)&lt;/a&gt; is a multimodal AI model developed by OpenAI that learns visual concepts through natural language prompts. Having been trained on large-scale internet data, it generalizes well to various tasks such as video classification, action recognition, and OCR. CLIP associates text and images in a shared embedding space through contrastive pretraining which is then used for zero-shot classification. CLIP can be further extended to video by processing frames as image inputs and aligning them with textual descriptions which makes it highly effective for multimodal search and retrieval. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjt9b0621d8lch0j3sxl1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjt9b0621d8lch0j3sxl1.png" width="800" height="296"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Overview of CLIP architecture (&lt;a href="https://openai.com/index/clip/?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Source&lt;/a&gt;)&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Features and Benefits
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Text-video alignment&lt;/strong&gt; - CLIP can easily work with natural language to search within video content, e.g., ‘find all scenes with cats’.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost efficiency and performance&lt;/strong&gt; - CLIP removes the need for costly labeled datasets and delivers strong real-world performance with out-of-the-box predictions. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Efficiency and flexibility&lt;/strong&gt; - CLIP learns from unfiltered, noisy, and varied data, making it highly effective across a variety of tasks and flexible enough to adapt to new ones. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Versatility&lt;/strong&gt; - CLIP works across various video tasks such as retrieval, captioning, and action recognition. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  CLIP for Video Search and Retrieval
&lt;/h3&gt;

&lt;p&gt;CLIP can be used for video search and retrieval by processing the video as individual frames and embedding them, along with text descriptions, in a shared embedding space. Users can then leverage a vector database like &lt;a href="https://milvus.io/?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Milvus&lt;/a&gt; or &lt;a href="https://zilliz.com/?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Zilliz Cloud&lt;/a&gt; to search the video with natural language queries. For example, when searching for ‘all scenes with a beach’ in a video of Miami, the clips that best match this description can be retrieved. Furthermore, CLIP can be combined with other AI models such as TimeSformer or SlowFast for enhanced analysis of motion dynamics, paving the way for advanced applications such as automated content tagging and video summarization.&lt;/p&gt;
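The retrieval step can be sketched in a few lines of NumPy. The vectors here are random stand-ins for real CLIP embeddings: in practice you would encode sampled frames with CLIP's image encoder, encode the query with its text encoder, and store the frame vectors in a vector database such as Milvus rather than an in-memory array:

```python
import numpy as np

def cosine_sim(query, frames):
    # Cosine similarity between one query vector and a matrix of frame vectors.
    query = query / np.linalg.norm(query)
    frames = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    return frames @ query

rng = np.random.default_rng(0)
# Stand-ins for CLIP image embeddings of 1000 sampled frames (512-d).
frame_embeddings = rng.normal(size=(1000, 512))
# A "text" query embedding, deliberately placed near frame 42 for the demo.
query_embedding = frame_embeddings[42] + 0.01 * rng.normal(size=512)

scores = cosine_sim(query_embedding, frame_embeddings)
top_k = np.argsort(scores)[::-1][:5]  # indices of the 5 best-matching frames
print(int(top_k[0]))  # 42
```

A vector database replaces the argsort with an approximate nearest-neighbor index, which is what makes this scale past a few thousand frames.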

&lt;h3&gt;
  
  
  Common Use Cases
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multimodal Search and Retrieval&lt;/strong&gt; – Users can search for content within images or videos using natural language queries, making it useful for media libraries and stock footage platforms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automated Content Moderation&lt;/strong&gt; – CLIP can detect inappropriate or harmful content in images or videos, helping with social media moderation and copyright enforcement. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Image and Video Captioning&lt;/strong&gt; – CLIP can generate descriptive captions for images and videos, which is valuable for automating annotations or for enhancing content discovery. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  I3D (Inflated 3D ConvNet): 3D Convolutions for Video Embeddings
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;I3D (Inflated 3D ConvNet)&lt;/strong&gt; is a convolutional architecture introduced by DeepMind that extends traditional 2D Convolutional Neural Networks (CNNs) into 3D spatiotemporal models by inflating pre-trained 2D CNN filters into 3D volumetric filters. The 3D filters are created by repeating the 2D weights along an added temporal dimension, which allows a single image to be treated as a (static) video during training. These 3D convolutions let the model capture temporal dynamics, improving performance on video understanding tasks. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F50x1jyjumzkvtod21cch.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F50x1jyjumzkvtod21cch.png" width="800" height="312"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An example of inflated Inception-V1 architecture (left) and its inception submodule (right) (&lt;a href="https://arxiv.org/pdf/1705.07750v3.pdf?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Source&lt;/a&gt;)&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Features and Benefits
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Inflated 3D Convolutions&lt;/strong&gt; - Pre-trained 2D CNN weights are used for initialization to enhance training efficiency. 2D CNN filters are inflated to 3D by replicating them across time. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Large Scale Training on Kinetics Dataset&lt;/strong&gt; - I3D has been pre-trained on the Kinetics dataset (human action recognition) making it robust for transfer learning. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Two-Stream Architecture&lt;/strong&gt; - I3D also incorporates an optional two-stream architecture where the RGB stream processes the raw video frames for appearance features whereas the optical flow stream uses precomputed optical flow for motion cues to improve accuracy. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
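The inflation trick in the first bullet can be sketched in a few lines of NumPy (assuming a channels-last weight layout; real implementations inflate framework-specific tensors, but the idea is the same):

```python
import numpy as np

def inflate_2d_filter(w2d, t):
    # Repeat a pretrained 2D kernel of shape (kh, kw, c_in, c_out) along a new
    # leading time axis and rescale by 1/t, so a static ("boring") video
    # produces the same activations the 2D filter produced on a single image.
    return np.repeat(w2d[np.newaxis], t, axis=0) / t

w2d = np.random.rand(3, 3, 3, 64)   # a 3x3 kernel from a pretrained 2D CNN
w3d = inflate_2d_filter(w2d, t=3)   # -> shape (3, 3, 3, 3, 64)

# Sanity check: summing the inflated filter over time recovers the 2D filter.
print(w3d.shape, np.allclose(w3d.sum(axis=0), w2d))
```

This is what lets I3D bootstrap from ImageNet-pretrained weights instead of learning 3D filters from scratch.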

&lt;h3&gt;
  
  
  Common Use Cases
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Human Action Recognition&lt;/strong&gt; - I3D is great for action recognition tasks in surveillance, healthcare, or sports analytics (e.g., ‘running’, ‘playing soccer’, etc.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Gesture Recognition&lt;/strong&gt; - I3D can also detect hand gestures for sign language translation or AR/VR interactions. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Video Anomaly Detection&lt;/strong&gt; - I3D can be leveraged to identify unusual events such as accidents or theft in security footage. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Video AI models are changing how we perceive video data by extracting as much relevant information as possible. They are transforming industries such as surveillance, sports analytics, and healthcare through various applications. In this blog, we discussed the most popular video AI models every developer should know to unlock new use cases and research. YOLO excels in object detection, MoViNet in efficient action recognition, and SlowFast in temporal modeling. TimeSformer leverages transformers for long-range video understanding, while CLIP bridges text and video for multimodal search, and I3D uses 3D convolutions for spatiotemporal modeling. Together, these cutting-edge models can empower the future of video intelligence. &lt;/p&gt;

</description>
      <category>community</category>
    </item>
    <item>
      <title>The Great AI Agent Protocol Race: Function Calling vs. MCP vs. A2A</title>
      <dc:creator>Chloe Williams</dc:creator>
      <pubDate>Tue, 29 Apr 2025 06:17:37 +0000</pubDate>
      <link>https://dev.to/zilliz/the-great-ai-agent-protocol-race-function-calling-vs-mcp-vs-a2a-2k5b</link>
      <guid>https://dev.to/zilliz/the-great-ai-agent-protocol-race-function-calling-vs-mcp-vs-a2a-2k5b</guid>
      <description>&lt;p&gt;If you’ve been keeping an eye on the AI dev world lately, you’ve probably noticed something: everyone is now talking about &lt;a href="https://zilliz.com/blog/what-exactly-are-ai-agents-why-openai-and-langchain-are-fighting-over-their-definition?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;&lt;strong&gt;AI Agents&lt;/strong&gt;&lt;/a&gt; — not just smart chatbots, but full-blown autonomous programs that can use tools, call APIs, and even collaborate with each other. LangChain and OpenAI even had a &lt;a href="https://zilliz.com/blog/what-exactly-are-ai-agents-why-openai-and-langchain-are-fighting-over-their-definition?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;debate&lt;/a&gt; over the definition of “AI Agents.”&lt;/p&gt;

&lt;p&gt;But as soon as you start building serious AI Agent systems, one big headache hits you: &lt;strong&gt;there’s no clear, universal way for Agents to work with tools — or with each other.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Right now, three major approaches are competing to define the future of AI agent architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Function Calling&lt;/strong&gt;: OpenAI's pioneering approach — teaching LLMs to make API calls like junior developers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MCP (Model Context Protocol)&lt;/strong&gt;: Anthropic’s attempt to create a standard toolkit interface across models and services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A2A (Agent-to-Agent Protocol)&lt;/strong&gt;: Google’s brand-new spec for letting different Agents &lt;em&gt;talk to each other&lt;/em&gt; and &lt;em&gt;work as a team&lt;/em&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every major AI player — OpenAI, Anthropic, Google — is quietly betting that &lt;strong&gt;whoever defines these standards will shape the future agent ecosystem&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;For developers building beyond basic chatbots, understanding these protocols isn't just about keeping up — it's about avoiding painful rewrites down the road.&lt;/p&gt;

&lt;p&gt;Here's what we'll cover in this post:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;What &lt;strong&gt;Function Calling&lt;/strong&gt; is, why it made tool use possible, and why it’s not enough.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How &lt;strong&gt;MCP&lt;/strong&gt; tries to fix the mess by creating a real protocol for tools and models.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What &lt;strong&gt;A2A&lt;/strong&gt; adds by making Agents work together like teams, not loners.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How you should &lt;strong&gt;actually think about using them&lt;/strong&gt; (without wasting time chasing hype).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Function Calling: The Pioneer with Growing Pains
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Function Calling&lt;/strong&gt;, popularized by OpenAI and now adopted by Meta, Google, and others, was the first mainstream approach to connecting LLMs with external tools. Think of it as teaching your LLM to write API calls based on natural language requests.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknlsqkd4qq3ku4nh0aen.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknlsqkd4qq3ku4nh0aen.png" alt="Figure 1- Function calling workflow (Credit @Google Cloud)" width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Figure 1: Function calling workflow (Credit @Google Cloud)&lt;/p&gt;

&lt;p&gt;The workflow is straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;User asks a question ("What's the weather in Seattle?")&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;LLM recognizes it needs external data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It selects the appropriate function from your predefined list&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It formats parameters following the JSON Schema you defined:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "location": "Seattle",
  "unit": "celsius"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="5"&gt;
&lt;li&gt;&lt;p&gt;Your application executes the actual API call&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The LLM incorporates the returned data into its response&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
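The whole loop can be sketched end to end in Python. The tool schema below follows the JSON-Schema style from step 4, but the model's tool call is simulated with a hard-coded dict rather than a real LLM API response, and the weather lookup is a stub:

```python
import json

# Steps 3-4: the function the model may select, described in JSON-Schema style.
TOOLS = {
    "get_weather": {
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    }
}

def get_weather(location, unit="celsius"):
    # Step 5: your application executes the actual API call (stubbed here).
    return {"location": location, "temp": 11, "unit": unit}

# A simulated model response: the chosen function plus JSON-encoded arguments.
model_tool_call = {
    "name": "get_weather",
    "arguments": json.dumps({"location": "Seattle", "unit": "celsius"}),
}

# Steps 5-6: dispatch the call, then hand the result back to the LLM.
args = json.loads(model_tool_call["arguments"])
result = globals()[model_tool_call["name"]](**args)
print(result)
```

Note that the orchestration (parsing arguments, dispatching, returning results) lives entirely in your application code, which is exactly the part each provider handles differently.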

&lt;p&gt;For developers, Function Calling feels like giving your AI a cookbook of API recipes it can follow. For simple applications with a single model, it's nearly plug-and-play. To learn more about how to use function calling for building applications, check out the following articles: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://zilliz.com/blog/function-calling-ollama-llama-3-milvus?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;How to Use Function Calling with Ollama, Llama3 and Milvus - Zilliz blog&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://zilliz.com/blog/harnessing-function-calling-to-build-smarter-llm-apps?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Understanding Function Calling in LLMs - Zilliz blog&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But there's a significant drawback when scaling: &lt;strong&gt;no cross-model consistency&lt;/strong&gt;. Each LLM provider implements function calling differently. Want to support both Claude and GPT? You'll need to maintain separate function definitions and handle different response formats.&lt;/p&gt;

&lt;p&gt;It's like having to rewrite your restaurant order in a different language for each chef in the kitchen. This M×N problem becomes unwieldy fast as you add more models and tools.&lt;/p&gt;

&lt;p&gt;Function Calling also lacks native support for &lt;strong&gt;multi-step function chains&lt;/strong&gt;. If the output from one function needs to feed into another, you're handling that orchestration yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP (Model Context Protocol): The Universal Translator for AI and Tools
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://modelcontextprotocol.io/introduction?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;MCP (Model Context Protocol)&lt;/a&gt; addresses precisely these scaling issues. Backed by Anthropic and gaining support across models like Claude, GPT, Llama, and others, MCP introduces a standardized way for LLMs to interact with external tools and data sources.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How MCP Works&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Think of MCP as the "USB standard for AI tools" — a universal interface that ensures compatibility:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tools advertise their capabilities&lt;/strong&gt; using a standardized format, describing available actions, required inputs, and expected outputs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AI models read these descriptions&lt;/strong&gt; and can automatically understand how to use the tools&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Applications integrate once&lt;/strong&gt; and gain compatibility across the AI ecosystem&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;MCP transforms the messy M×N integration problem into a more manageable M+N problem: each model and each tool integrates with the protocol once, instead of every model integrating with every tool.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The MCP Architecture&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;MCP uses a client-server model with four key components:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftoepsxxzgsyu5wntbvoi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftoepsxxzgsyu5wntbvoi.png" alt="Figure 2- The MCP architecture (Credit @Anthropic)" width="800" height="551"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Figure 2: The MCP architecture (Credit @Anthropic)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MCP Hosts&lt;/strong&gt;: The applications where users interact with AI (like Claude Desktop or AI-enhanced code editors)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MCP Clients&lt;/strong&gt;: The connectors that manage communication between hosts and servers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MCP Servers&lt;/strong&gt;: Tool implementations that expose functionality through the MCP standard&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Sources&lt;/strong&gt;: The underlying files, databases, APIs and services that provide information&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
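As a concrete illustration, here is roughly what a tool description looks like when an MCP server advertises it. The name/description/inputSchema shape follows the MCP specification's tools-listing response; the vector-search tool itself is a hypothetical example:

```python
# A tool as an MCP server would advertise it in response to a tools/list
# request. Any MCP client can read this and know how to call the tool,
# with no per-model glue code -- the M+N payoff.
tool = {
    "name": "search_vectors",  # hypothetical Milvus-backed tool
    "description": "Search a Milvus collection for nearest neighbors",
    "inputSchema": {
        "type": "object",
        "properties": {
            "collection": {"type": "string"},
            "query": {"type": "string"},
            "top_k": {"type": "integer", "default": 5},
        },
        "required": ["collection", "query"],
    },
}

print(tool["name"], tool["inputSchema"]["required"])
```

Compare this with Function Calling, where the same description would have to be re-expressed in each provider's own tool format.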

&lt;p&gt;If Function Calling is like having to speak multiple languages to different chefs, MCP is like having a universal translator in the kitchen. Define your tools once, and any MCP-compatible model can use them without custom code. This dramatically reduces the marginal cost of adding new models or tools to your application. As someone who's dealt with integration headaches, that's music to my ears.&lt;/p&gt;

&lt;h2&gt;
  
  
  A2A (Agent-to-Agent Protocol): The Team Coordinator for AI Agents
&lt;/h2&gt;

&lt;p&gt;While Function Calling and MCP focus on model-to-tool interaction, &lt;a href="https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;A2A&lt;/a&gt; (Agent-to-Agent Protocol), introduced by Google, tackles a different challenge: &lt;strong&gt;How do we get multiple specialized agents to collaborate effectively?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As AI agent architectures grow more complex, it quickly becomes clear that no single agent should handle everything. You might have one agent specialized in document summarization, another in database queries, and another in user interaction.&lt;/p&gt;

&lt;p&gt;A2A defines a lightweight, open protocol that lets different Agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Discover&lt;/strong&gt; each other and advertise their capabilities,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Delegate&lt;/strong&gt; tasks dynamically to the best-suited Agent,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Coordinate&lt;/strong&gt; progress and share real-time updates securely.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjvwbufyvympn1dnfzz2s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjvwbufyvympn1dnfzz2s.png" alt="Figure 3- How A2A works (credit @Google)" width="800" height="569"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Figure 3: How A2A works (credit @Google)&lt;/p&gt;

&lt;p&gt;A2A facilitates communication between a "client" agent that manages tasks and a "remote" agent that executes them. If Function Calling gives an agent access to tools, A2A lets agents form effective teams.&lt;/p&gt;

&lt;p&gt;Consider hiring a software engineer: A hiring manager could task their agent to find candidates matching specific criteria. This agent then collaborates with specialized agents to source candidates, schedule interviews, and facilitate background checks — all through a unified interface.&lt;/p&gt;
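In A2A, the discovery step works through an "Agent Card": a JSON document each agent publishes (conventionally at /.well-known/agent.json) describing who it is and what skills it offers. The sketch below models the hiring scenario; the fields are simplified from Google's spec and the agent itself is hypothetical:

```python
# A simplified A2A Agent Card for the candidate-sourcing agent from the
# hiring example. A client agent fetches this card, inspects the skills,
# and delegates tasks to the best match.
agent_card = {
    "name": "candidate-sourcing-agent",
    "description": "Finds engineering candidates matching given criteria",
    "url": "https://agents.example.com/sourcing",  # hypothetical endpoint
    "skills": [
        {"id": "source_candidates", "description": "Search job boards for matches"},
        {"id": "schedule_interview", "description": "Book interview slots"},
    ],
}

# Discovery: the hiring manager's agent lists what this remote agent can do.
skill_ids = [skill["id"] for skill in agent_card["skills"]]
print(skill_ids)
```

The card plays the same role for agents that MCP's tool descriptions play for tools: a standard, machine-readable advertisement of capabilities.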

&lt;h2&gt;
  
  
  Quick Comparison: Function Calling vs MCP vs A2A
&lt;/h2&gt;

&lt;p&gt;It's tempting to see these protocols as competitors, but they actually solve different pieces of the agent ecosystem puzzle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Function Calling&lt;/strong&gt; connects models to individual tools (limited but simple)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MCP&lt;/strong&gt; standardizes tool access across different models (more scalable)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A2A&lt;/strong&gt; enables collaboration between independent agents (higher-level orchestration)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Function Calling&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;MCP&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;A2A&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;What it solves&lt;/td&gt;
&lt;td&gt;Model → API calls&lt;/td&gt;
&lt;td&gt;Model → Tools access, standardized&lt;/td&gt;
&lt;td&gt;Agent → Agent collaboration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Good for&lt;/td&gt;
&lt;td&gt;Simple real-time queries&lt;/td&gt;
&lt;td&gt;Scalable tool ecosystems&lt;/td&gt;
&lt;td&gt;Distributed multi-agent workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pain points&lt;/td&gt;
&lt;td&gt;No standard, messy multi-model support&lt;/td&gt;
&lt;td&gt;Need to set up servers&lt;/td&gt;
&lt;td&gt;Still early days, limited support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Real-world analogy&lt;/td&gt;
&lt;td&gt;Teaching your AI to make phone calls&lt;/td&gt;
&lt;td&gt;Having any smart app access any database/API easily&lt;/td&gt;
&lt;td&gt;Having teams of bots working together like coworkers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In architectural terms, MCP answers "what tools can my agent use?" while A2A handles "how can my agents work together?"&lt;/p&gt;

&lt;p&gt;This resembles how we structure complex software: individual components with well-defined interfaces, composed into larger systems. An effective agent ecosystem needs both tool interfaces (Function Calling/MCP) and inter-agent communication (A2A).&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for Developers
&lt;/h2&gt;

&lt;p&gt;So, what should you, as a developer building with AI, do with these competing standards?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;For simple applications&lt;/strong&gt;: Function Calling remains the quickest path to adding tool use to your LLM application, especially if you're only using one model provider.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;For cross-model compatibility&lt;/strong&gt;: Consider adopting MCP, which gives you broader model support without duplicating integration work.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;For complex multi-agent systems&lt;/strong&gt;: Keep an eye on A2A, which could become crucial as agent ecosystems mature.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The smart play might be to layer these approaches: use Function Calling for quick prototyping, but implement MCP adapters for better scalability, with A2A orchestration for multi-agent workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Road Ahead
&lt;/h2&gt;

&lt;p&gt;The conversation around what makes an "AI Agent" is still evolving — sometimes even debated between companies like OpenAI, Anthropic, and LangChain.&lt;/p&gt;

&lt;p&gt;But regardless of definitions, one thing is clear: &lt;strong&gt;Standards like Function Calling, MCP, and A2A are laying the foundation for the next generation of AI applications.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For developers, understanding these patterns early is an investment in future-proofing your work. It's how we move from toy demos to production-ready systems — the kind that solve real problems at scale. The agent ecosystem is developing rapidly, and building on these protocols now means positioning your applications for what's coming next.&lt;/p&gt;

&lt;p&gt;What do you think? Which protocols are you using in your AI projects? Are you betting on one standard winning out, or preparing for a multi-protocol future?&lt;/p&gt;

&lt;h2&gt;
  
  
  More Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://zilliz.com/blog/function-calling-ollama-llama-3-milvus?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;How to Use Function Calling with Ollama, Llama3 and Milvus - Zilliz blog&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://zilliz.com/blog/harnessing-function-calling-to-build-smarter-llm-apps?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Understanding Function Calling in LLMs - Zilliz blog&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://zilliz.com/blog/how-to-use-anthropic-mcp-server-with-milvus?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;How to Use Anthropic MCP Server with Milvus - Zilliz blog&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://zilliz.com/blog/what-exactly-are-ai-agents-why-openai-and-langchain-are-fighting-over-their-definition?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;What are AI Agents? Why LangChain Fights with OpenAI? - Zilliz blog&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://zilliz.com/blog/top-10-ai-agents-to-watch-in-2025?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Top 10 AI Agents to Watch in 2025 🚀 - Zilliz blog&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://zilliz.com/blog/critical-role-of-vectordbs-in-building-intelligent-ai-agents?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;How VectorDBs Power Intelligent AI Agents - Zilliz blog&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>community</category>
    </item>
    <item>
      <title>What Exactly Are AI Agents? Why OpenAI and LangChain Are Fighting Over Their Definition?</title>
      <dc:creator>Chloe Williams</dc:creator>
      <pubDate>Wed, 23 Apr 2025 06:53:28 +0000</pubDate>
      <link>https://dev.to/zilliz/what-exactly-are-ai-agents-why-openai-and-langchain-are-fighting-over-their-definition-4bl</link>
      <guid>https://dev.to/zilliz/what-exactly-are-ai-agents-why-openai-and-langchain-are-fighting-over-their-definition-4bl</guid>
      <description>&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;At the simplest level, &lt;strong&gt;AI agents&lt;/strong&gt; are software programs powered by artificial intelligence that can perceive their environment, make decisions, and take actions to achieve a goal—often autonomously. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;OpenAI and LangChain recently debated what truly defines an agent — simplicity vs. flexibility is the core divide.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Agents differ from LLMs, chatbots, and workflows by being goal-driven, tool-using, and proactive.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AI Agents are already used in coding, business ops, healthcare, education, personal productivity, and many other areas. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🥊 The OpenAI vs. LangChain “AI Agent” Debate
&lt;/h2&gt;

&lt;p&gt;The AI community witnessed a fascinating debate in early 2025 when OpenAI released its &lt;a href="https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf?ref=blog.langchain.dev&amp;amp;utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;comprehensive guide to AI agents&lt;/a&gt;, which prompted a &lt;a href="https://blog.langchain.dev/how-to-think-about-agent-frameworks/?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;swift response from LangChain&lt;/a&gt;. This public exchange highlighted fundamental differences in how major players conceptualize AI agents and revealed important distinctions that every developer should understand.&lt;/p&gt;

&lt;p&gt;Let’s talk drama first. 🙂&lt;/p&gt;

&lt;h3&gt;
  
  
  What Happened? What Sparked the Controversy?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;OpenAI, in their new documentation for the Assistants API, explained how to build agents using their platform, complete with tools, memory, threads, and a planning architecture.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;However, they described AI agents in a high-level, somewhat simplified manner: as large language models (LLMs) with memory and tools that can achieve goals.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Then, LangChain, whose entire framework revolves around agent workflows, dropped a response blog: &lt;a href="https://blog.langchain.dev/how-to-think-about-agent-frameworks/?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;“How to Think About Agent Frameworks”&lt;/a&gt;. And it didn't pull punches.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  LangChain’s Core Argument:
&lt;/h3&gt;

&lt;p&gt;LangChain argued that OpenAI’s guide:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Oversimplifies what agents are&lt;/strong&gt; – reducing them to just tool-using LLMs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Misrepresents existing frameworks&lt;/strong&gt; – implying LangChain-style agents are unstable or unreliable because of flaws in the architecture, not because of current limitations in LLM reasoning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ignores the core “agent loop”&lt;/strong&gt; – the concept of an agent continuously reasoning and deciding what to do next is critical, and it’s not front and center in OpenAI’s model.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Why Do They See It Differently?
&lt;/h3&gt;

&lt;p&gt;This isn’t just a clash of opinions — it’s a difference in &lt;strong&gt;philosophy&lt;/strong&gt; and &lt;strong&gt;design priorities&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Perspective&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;OpenAI&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;LangChain&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Focus&lt;/td&gt;
&lt;td&gt;API-first, productized “agent-like” experience for devs&lt;/td&gt;
&lt;td&gt;Open-source, modular framework for complex agent systems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Design&lt;/td&gt;
&lt;td&gt;Abstracts away the inner loop for stability and ease&lt;/td&gt;
&lt;td&gt;Embraces reasoning loops and flexibility, even if fragile&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Goal&lt;/td&gt;
&lt;td&gt;Make it simple to add memory, tools, and goals to your assistant&lt;/td&gt;
&lt;td&gt;Let devs build sophisticated, customizable multi-step agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tradeoff&lt;/td&gt;
&lt;td&gt;More controlled and user-friendly, but maybe less “agentic”&lt;/td&gt;
&lt;td&gt;More powerful and flexible, but higher risk of tool misuse or reasoning errors&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Who’s “Right”?
&lt;/h3&gt;

&lt;p&gt;Honestly? Both have good points.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;OpenAI wants to productize agents safely and cleanly for the average developer.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;LangChain wants to push the boundaries of autonomy and reasoning, even if it’s messier.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So if you’re just getting started and want something that works? OpenAI’s Assistants API is solid. If you’re building ambitious workflows and need total control? LangChain might be the better fit.&lt;/p&gt;

&lt;p&gt;The good news: this debate is fueling clarity in the space. It’s pushing the whole AI world to ask: “What does it really mean to build an autonomous, intelligent, goal-driven AI system?”&lt;/p&gt;

&lt;p&gt;And that’s the question we’ll dig into for the rest of this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  🔍 So, What Exactly Are AI Agents?
&lt;/h2&gt;

&lt;p&gt;Imagine waking up to find your coffee already brewing, your calendar optimized for the day, and your inbox sorted with draft responses ready for your approval. Meanwhile, your code repository has been scanned overnight, bugs fixed, and tests automatically generated. Welcome to the future. &lt;/p&gt;

&lt;p&gt;At the simplest level, &lt;strong&gt;AI agents are software programs powered by artificial intelligence that can perceive their environment, make decisions, and take actions to achieve a goal—often autonomously&lt;/strong&gt;. Unlike traditional software that follows rigid, pre-programmed instructions, AI agents can operate with varying degrees of autonomy, learning from their interactions and adapting their behavior accordingly.&lt;/p&gt;

&lt;p&gt;Think of an AI agent as a digital assistant on steroids – one that doesn't just respond to your commands but anticipates needs, solves problems, and accomplishes tasks with minimal human supervision. The key distinction is autonomy and goal-orientation: agents are built to pursue objectives rather than simply process inputs.&lt;/p&gt;

&lt;p&gt;To put it in everyday terms, if traditional software is like a bicycle that goes exactly where you steer it, an AI agent is more like a self-driving car that gets you to your destination while handling the navigation details itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  How AI Agents Work
&lt;/h2&gt;

&lt;p&gt;Let's peek under the hood of these AI agents. At their core, AI agents follow what we call a &lt;strong&gt;"perception-think-action loop"&lt;/strong&gt; – but don't let the fancy term intimidate you. It's actually pretty intuitive when you break it down:&lt;/p&gt;

&lt;h3&gt;
  
  
  The Perception-Think-Action Loop
&lt;/h3&gt;

&lt;p&gt;Think of this as the agent's basic rhythm:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Perception&lt;/strong&gt;: First, your agent takes in information. This could be your typed request, data from APIs, sensor readings, or even the content of files. It's basically gathering all the context it needs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reasoning&lt;/strong&gt;: Now comes the thinking part. The agent (usually powered by a Large Language Model or LLM) processes what it's perceived. It's asking itself: "What's really being asked here? What's the goal? What information do I have and what do I need?"&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Planning&lt;/strong&gt;: This is where agents really shine compared to simpler AI systems. The agent maps out a sequence of steps to achieve the goal. If the task is complex, it might break it down into sub-tasks and determine dependencies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Action&lt;/strong&gt;: Time to get things done! The AI agent executes its plan by utilizing the tools at its disposal – it may call an API, query a &lt;a href="https://zilliz.com/learn/what-is-vector-database?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;vector database&lt;/a&gt;, generate code, or even control physical devices if they are connected to it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Learning &amp;amp; Adaptation&lt;/strong&gt;: After taking action, the agent evaluates the results. Did it work? If not, why? It uses this feedback to adjust its approach, either immediately for the current task or to improve future performance.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let me share how this works with a concrete example. Say you tell your coding agent: "Create a weather dashboard for my city."&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Perception&lt;/strong&gt;: It processes your request and understands you want a weather dashboard application.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reasoning&lt;/strong&gt;: It determines it needs to: find your location, access weather data, create a visualization interface, and package it as a usable application.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Planning&lt;/strong&gt;: It maps out steps like:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;First, determine your location (either ask you or use default settings)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Research weather APIs that offer the needed data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Design a UI layout with key weather metrics&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Write front-end code for visualization&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Set up API connections to fetch real-time data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Package everything into a deployable application&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="4"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Action&lt;/strong&gt;: The agent starts executing these steps. It might ask you for your location, generate API authentication code for a weather service, create HTML/CSS/JS for the dashboard, and test that the data flows correctly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Learning&lt;/strong&gt;: If you say the temperature display is too small, it adapts and regenerates that component with a larger font. It remembers this preference for future tasks.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
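&lt;p&gt;The loop just walked through can be compressed into a toy sketch. Everything here (the &lt;code&gt;MiniAgent&lt;/code&gt; class, the fake LLM planner, and the canned action results) is an illustrative stand-in, not any real framework's API:&lt;/p&gt;

```python
# A toy sketch of the perception-reason-plan-act loop. The MiniAgent class,
# the fake LLM planner, and the canned action results are all illustrative
# stand-ins, not any real framework's API.

def fake_llm_plan(goal):
    """Stand-in for an LLM call that decomposes a goal into steps."""
    return [f"research {goal}", f"draft {goal}", f"review {goal}"]

class MiniAgent:
    def __init__(self):
        self.memory = []  # episodic memory of observations and actions

    def perceive(self, user_input):
        # 1. Perception: record the incoming request as context.
        self.memory.append(("observation", user_input))
        return user_input

    def plan(self, goal):
        # 2-3. Reasoning and planning: let the "LLM" decompose the goal.
        return fake_llm_plan(goal)

    def act(self, step):
        # 4. Action: a real agent would call a tool or API here.
        result = f"done: {step}"
        self.memory.append(("action", result))
        return result

    def run(self, user_input):
        goal = self.perceive(user_input)
        # 5. Learning is omitted; a fuller agent would score each result
        # and feed that evaluation back into future plans.
        return [self.act(step) for step in self.plan(goal)]

agent = MiniAgent()
results = agent.run("weather dashboard")
```

&lt;p&gt;Real agents replace &lt;code&gt;fake_llm_plan&lt;/code&gt; with model calls and &lt;code&gt;act&lt;/code&gt; with tool invocations, but the control flow stays this shape.&lt;/p&gt;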

&lt;h3&gt;
  
  
  The Secret Sauce: Tool Use
&lt;/h3&gt;

&lt;p&gt;What makes today's agents truly powerful is their ability to use tools – they're not limited to just generating text responses. An advanced agent might:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Write and execute code in various programming languages&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Call external APIs to get real-time data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Search the web for information&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Interact with databases&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Control browser automation tools&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Generate and manipulate images or other media&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This tool-use capability is what transforms a "smart chatbot" into a genuine AI agent. By leveraging these external tools, the agent can extend its capabilities beyond what's built into its core model.&lt;/p&gt;
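&lt;p&gt;Mechanically, tool use usually boils down to the model emitting a structured call that the host program dispatches to a registered function. A minimal, framework-agnostic sketch — the decorator, tool name, and JSON shape below are all invented for illustration:&lt;/p&gt;

```python
# Hypothetical tool registry and dispatcher. The decorator, tool name,
# and JSON call format are invented for illustration; real frameworks
# (OpenAI function calling, MCP, etc.) define their own schemas.
import json

TOOLS = {}

def tool(fn):
    """Register a plain function as an agent-callable tool."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def get_weather(city: str) -> dict:
    # A real tool would hit a weather API; this returns canned data.
    return {"city": city, "temp_c": 21}

def dispatch(llm_output: str):
    """Execute a tool call the model emitted as a JSON string."""
    call = json.loads(llm_output)
    return TOOLS[call["name"]](**call["arguments"])

result = dispatch('{"name": "get_weather", "arguments": {"city": "Paris"}}')
```

&lt;p&gt;The important part is the separation of concerns: the model only decides &lt;em&gt;what&lt;/em&gt; to call; the host program validates and executes it.&lt;/p&gt;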

&lt;h2&gt;
  
  
  Key Components of an AI Agent
&lt;/h2&gt;

&lt;p&gt;Modern AI agents are complex systems composed of several critical components working together to create intelligent, goal-oriented behavior. Let's break down these essential building blocks: &lt;/p&gt;

&lt;h3&gt;
  
  
  1. Foundation AI Models
&lt;/h3&gt;

&lt;p&gt;At the core of most AI agents is a foundation model, typically a Large Language Model (LLM) like GPT-4, Claude, or Llama that provides the reasoning capabilities. These models act as the "brain" of the agent, enabling it to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Process and generate natural language&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Understand context and nuance&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Apply common sense reasoning to novel situations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Generate plans and evaluate alternatives&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The choice of foundation model significantly impacts an agent's capabilities, with more advanced models generally offering better reasoning but at higher computational costs.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Memory Systems 
&lt;/h3&gt;

&lt;p&gt;Unlike simple chatbots, sophisticated AI agents maintain various types of memory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Short-term memory&lt;/strong&gt;: Keeps track of the current conversation or task context&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Long-term memory&lt;/strong&gt;: Stores persistent information like user preferences or learned knowledge&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Episodic memory&lt;/strong&gt;: Records specific interactions or "experiences" for future reference&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For instance, a customer service agent remembering your previous issues when you contact support again exemplifies effective memory utilization.&lt;/p&gt;

&lt;p&gt;Vector databases like &lt;a href="https://milvus.io/?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Milvus&lt;/a&gt; and &lt;a href="https://zilliz.com/cloud?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Zilliz Cloud&lt;/a&gt; usually play a key role in powering the memory system of AI agents.&lt;/p&gt;
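&lt;p&gt;To make the idea concrete, here is a toy similarity-based memory in pure Python. The word-count &lt;code&gt;embed()&lt;/code&gt; function is a crude stand-in for a real embedding model, and the plain list stands in for a vector database such as Milvus:&lt;/p&gt;

```python
# A toy long-term memory with similarity-based recall. The word-count
# embed() is a crude stand-in for a real embedding model, and the plain
# list stands in for a vector database such as Milvus.
import math
from collections import Counter

def embed(text):
    """Bag-of-words counts: a placeholder for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class Memory:
    def __init__(self):
        self.items = []  # (embedding, original text) pairs

    def store(self, text):
        self.items.append((embed(text), text))

    def recall(self, query, k=1):
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[0]),
                        reverse=True)
        return [text for _, text in ranked[:k]]

mem = Memory()
mem.store("user prefers large fonts on dashboards")
mem.store("user lives in Berlin")
mem.recall("which city does the user live in")
```

&lt;p&gt;Swapping in real embeddings and an indexed vector store changes the quality and scale of recall, not the interface: store experiences, retrieve the most similar ones at decision time.&lt;/p&gt;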

&lt;h3&gt;
  
  
  3. Tool Use Systems
&lt;/h3&gt;

&lt;p&gt;Today's most capable agents can leverage external tools to overcome the limitations of language models alone:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;API connections to external services&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Search engines and knowledge bases&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Database access&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Code execution environments&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Other specialized AI models (like image generators)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This tool use capability transforms agents from passive responders to active problem-solvers that can affect the world outside their language model.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Planning and Reasoning Systems
&lt;/h3&gt;

&lt;p&gt;Advanced agents incorporate explicit planning components that help them break down complex goals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Task decomposition&lt;/strong&gt;: Breaking larger goals into manageable subtasks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reasoning chains&lt;/strong&gt;: Using techniques like &lt;a href="https://zilliz.com/learn/chain-of-agents-large-language-models-collaborating-on-long-context-tasks?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;chain-of-thought&lt;/a&gt; (CoT) to work through problems step-by-step&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Self-reflection&lt;/strong&gt;: Evaluating the quality of their own plans and outputs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Feedback incorporation&lt;/strong&gt;: Learning from successes and failures to improve future plans&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. Agent Frameworks and Orchestration
&lt;/h3&gt;

&lt;p&gt;Most production AI agents are built on specialized frameworks that handle the complex integration of the above components. For example: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LangChain&lt;/strong&gt;: Provides modular components for building agents with memory, tool-use capabilities, and prompt management in a flexible architecture&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LlamaIndex&lt;/strong&gt;: Specializes in knowledge-intensive applications, particularly for retrieving and reasoning over document collections&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OpenAI Agents SDK&lt;/strong&gt;: Offers a simplified framework focused on reliable tool use with OpenAI's models&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These frameworks handle the complex plumbing needed for agents to function reliably, providing developers with abstractions for common agent patterns. Check out this blog for the most popular AI frameworks: &lt;a href="https://zilliz.com/blog/10-open-source-llm-frameworks-developers-cannot-ignore-in-2025?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;10 Open-Source LLM Frameworks Developers Can’t Ignore in 2025&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Knowledge Retrieval Mechanisms
&lt;/h3&gt;

&lt;p&gt;Truly useful agents need access to specific knowledge:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;RAG (&lt;/strong&gt;&lt;a href="https://zilliz.com/learn/Retrieval-Augmented-Generation?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;&lt;strong&gt;Retrieval-Augmented Generation&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;)&lt;/strong&gt;: Allows agents to pull relevant information from documents or databases before generating responses&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Knowledge graphs&lt;/strong&gt;: Provide structured relationships between concepts for more precise reasoning&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Vector search&lt;/strong&gt;: Enables semantic similarity matching rather than just keyword lookups&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hybrid retrieval&lt;/strong&gt;: Combines multiple approaches for more robust information access&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The knowledge component is often what transforms a generic agent into a domain-specific expert that can provide genuinely valuable insights or assistance.&lt;/p&gt;
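&lt;p&gt;A bare-bones retrieve-then-generate (RAG) sketch makes the pattern visible. Naive word-overlap retrieval stands in for real vector search, and the corpus and prompt template below are invented for illustration:&lt;/p&gt;

```python
# A bare-bones RAG sketch: retrieve the best-matching passage, then build
# a grounded prompt. Word-overlap retrieval stands in for vector search,
# and the corpus and prompt template are invented for illustration.

DOCS = [
    "Milvus supports HNSW and IVF indexes for approximate search.",
    "RAG retrieves supporting passages before the model generates an answer.",
]

def retrieve(query, docs):
    """Pick the document sharing the most words with the query."""
    q = set(query.lower().split())
    return max(docs, key=lambda d: len(q & set(d.lower().split())))

def build_prompt(query, docs):
    # Ground the eventual LLM call in retrieved context.
    context = retrieve(query, docs)
    return f"Context: {context}\nQuestion: {query}\nAnswer using only the context."

prompt = build_prompt("What indexes does Milvus support?", DOCS)
```

&lt;p&gt;In production, retrieval uses embedding similarity over a vector store rather than word overlap, but the shape is the same: fetch relevant knowledge first, then generate against it.&lt;/p&gt;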

&lt;h3&gt;
  
  
  7. Security and Safety Systems
&lt;/h3&gt;

&lt;p&gt;As agents gain more capabilities, safeguards become increasingly important:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Input filtering&lt;/strong&gt;: Screens requests for harmful content&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Output moderation&lt;/strong&gt;: Ensures responses meet safety guidelines&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Authorization boundaries&lt;/strong&gt;: Limits what actions agents can take&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitoring systems&lt;/strong&gt;: Tracks agent behavior and performance&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Explainability tools&lt;/strong&gt;: Makes agent reasoning transparent to users and developers&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These systems transform experimental agents into reliable, production-ready systems that can be trusted in real-world environments. &lt;/p&gt;

&lt;h2&gt;
  
  
  Vector Databases: The Backbone of Long-Term Agent Memory
&lt;/h2&gt;

&lt;p&gt;As mentioned above, for AI agents to function effectively, they need a robust memory system that extends beyond short-term context. This is where vector databases emerge as a critical infrastructure component powering sophisticated agent architectures.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://zilliz.com/learn/what-is-vector-database?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Vector databases&lt;/a&gt; such as &lt;a href="https://zilliz.com/what-is-milvus?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Milvus&lt;/a&gt; and &lt;a href="https://zilliz.com/cloud?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Zilliz Cloud&lt;/a&gt; store information as high-dimensional vectors—mathematical representations that capture the semantic meaning of data whether it's text, images, audio, or other unstructured formats. This approach allows agents to perform similarity searches and retrieve contextually relevant information based on meaning rather than exact keyword matches. For example, when an agent encounters a new query, it can access its memory system to retrieve similar past interactions or relevant knowledge, enabling it to make informed decisions and adapt to new situations. Without such memory, agents would lack the continuity required for advanced reasoning and adaptive learning.&lt;/p&gt;

&lt;p&gt;To get started quickly on building an AI agent yourself, check out the tutorials below.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Tutorial: &lt;a href="https://zilliz.com/blog/build-graphrag-agent-with-neo4j-and-milvus?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Building a GraphRAG Agent With Neo4j and Milvus&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tutorial: &lt;a href="https://zilliz.com/blog/agentic-rag-using-claude-3.5-sonnet-llamaindex-and-milvus?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Agentic RAG with Claude 3.5 Sonnet, LlamaIndex, and Milvus&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tutorial: &lt;a href="https://zilliz.com/blog/build-ai-agent-for-rag-with-milvus-and-llamaindex?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Building an AI Agent for RAG with Milvus and LlamaIndex&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tutorial: &lt;a href="https://zilliz.com/blog/build-your-voice-assistant-agentic-rag-with-milvus-and-llama-3-2?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Stop Waiting, Start Building: Voice Assistant With Milvus and Llama 3.2&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  AI Agents vs. Other AI Systems
&lt;/h2&gt;

&lt;p&gt;OK, so now you're probably wondering, "How are AI agents different from all the other AI stuff I've been using?" Great question! Let's clear up some confusion by comparing agents with their AI cousins: &lt;/p&gt;

&lt;h3&gt;
  
  
  AI Agents vs. LLMs (Even Advanced Ones) 
&lt;/h3&gt;

&lt;p&gt;Think of modern LLMs like GPT-4, Claude, or DeepSeek as incredibly powerful brains waiting for direction. Here's what separates them from true agents:&lt;/p&gt;

&lt;p&gt;LLMs by themselves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Function as "stateless" systems – forgetting context between sessions unless explicitly reminded&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Generate impressive text, but can't take actions beyond the chat interface&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Respond to prompts rather than independently pursuing objectives&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even cutting-edge models with reasoning capabilities (like Claude 3.7 Sonnet with extended thinking or DeepSeek R1) and built-in search:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Can break down complex problems step-by-step&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Access real-time information beyond their training data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Produce sophisticated analysis and explanations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;But still operate within a reactive, prompt-response paradigm&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What transforms an LLM into an agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Persistent memory architecture using vector databases and state management&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tool integration frameworks that enable a diverse action space&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Planning systems that maintain progress toward defined goals&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Feedback loops that allow adaptation based on outcomes&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The difference is like having a brilliant consultant (LLM) versus an autonomous colleague (agent). The consultant gives excellent advice when asked but forgets you between meetings. The agent remembers your preferences, anticipates needs, takes initiative on your behalf, and learns from each interaction to serve you better over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  AI Agents vs. AI Assistants
&lt;/h3&gt;

&lt;p&gt;This is a subtle but important distinction that confuses many developers. AI assistants (like the basic versions of Siri, Alexa, or even Claude) are designed primarily to help users through conversation and simple predefined actions. They're focused on the human-AI interaction.&lt;/p&gt;

&lt;p&gt;AI agents go a step further:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;They can operate independently, even when you're not directly interacting with them&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;They have more agency to make decisions within their scope&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;They often work in the background on longer-running tasks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;They can be more proactive rather than just reactive&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, an AI assistant might help you book a flight when you ask it to. An AI agent might notice you've been discussing a trip, proactively research flight options based on your calendar availability, and then suggest the best times to book based on price trends it's been monitoring.&lt;/p&gt;

&lt;h3&gt;
  
  
  AI Agents vs. Chatbots
&lt;/h3&gt;

&lt;p&gt;Traditional chatbots were designed for one thing: conversation. Even modern LLM-powered chatbots are primarily interfaces for communication. The differences from agents are stark:&lt;/p&gt;

&lt;p&gt;Chatbots:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;are conversation-first, with actions as an afterthought;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;typically wait for user prompts before doing anything;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;usually operate within a limited domain of knowledge.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  AI Agents vs. AI Workflows
&lt;/h3&gt;

&lt;p&gt;If you've built AI applications before, you might have created workflow chains or pipelines. These are predetermined sequences of AI operations linked together. While useful, they differ from agents in critical ways:&lt;/p&gt;

&lt;p&gt;AI workflows are like assembly lines – efficient but rigid. They follow the same steps every time, and if something unexpected happens, they often break down. Agents are more like skilled workers who can adapt their approach based on circumstances.&lt;/p&gt;

&lt;h2&gt;
  
  
  Types of AI Agents
&lt;/h2&gt;

&lt;p&gt;Not all AI agents are created equal. Let me walk you through the main types seen in the wild, with real examples that might help you understand their unique characteristics:&lt;/p&gt;

&lt;h3&gt;
  
  
  Task-Specific Agents
&lt;/h3&gt;

&lt;p&gt;These are specialized agents designed to excel at particular jobs. They're like expert contractors you bring in for specific work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: GitHub Copilot for Docs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This coding documentation agent doesn't just generate documentation – it reads codebases, understands function signatures and dependencies, analyzes existing documentation patterns, and then creates contextually appropriate docs that match team styles. It can work across multiple files, maintaining consistency in terminology and approach.&lt;/p&gt;

&lt;h3&gt;
  
  
  Autonomous Agents
&lt;/h3&gt;

&lt;p&gt;These agents can work independently over extended periods with limited supervision. They're more like employees than tools. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: AutoGPT&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the first autonomous agents that caught widespread attention. You give it a high-level goal like "Create a successful blog about renewable energy," and it breaks this down into subtasks: researching current trends, identifying target audiences, planning content categories, drafting articles, finding relevant images, setting up publishing schedules, and analyzing traffic patterns to optimize future content. It can spend days or weeks pursuing these goals, making adjustments based on results.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-Agent Systems
&lt;/h3&gt;

&lt;p&gt;These involve multiple specialized agents working together, like a team with different roles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/2308.10848?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;&lt;strong&gt;AgentVerse&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This framework exemplifies the multi-agent approach. In a content production environment, it might deploy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A research agent that gathers information on trending topics&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A planning agent that outlines content structure&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multiple specialist writers focused on different aspects (technical details, beginner explanations, etc.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;An editor agent that ensures consistency across pieces&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A feedback agent that analyzes user engagement&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A coordinator agent that manages workflows and resolves conflicts&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The magic happens in the interactions – agents can debate approaches, request clarification from each other, and collaboratively solve problems in ways none could individually.&lt;/p&gt;

&lt;h3&gt;
  
  
  Embodied Agents
&lt;/h3&gt;

&lt;p&gt;These agents control or interact with physical systems in the real world.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: Amazon's Warehouse Robots&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;These have evolved from simple path-following machines to sophisticated agents that adaptively navigate dynamic environments. They can reroute around obstacles, prioritize packages based on shipping deadlines, coordinate with other robots to prevent bottlenecks, and even predict and preposition themselves for anticipated order volumes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use Cases for AI Agents
&lt;/h2&gt;

&lt;p&gt;Let's explore how AI agents are actually being used right now across different industries. These examples represent what's truly possible with today's technology:&lt;/p&gt;

&lt;h3&gt;
  
  
  Software Development
&lt;/h3&gt;

&lt;p&gt;In modern development workflows, coding agents transform productivity. A modern coding agent doesn't just write code snippets – it functions as a true development partner. Feed it a product spec, and it will architect a solution, generate the code across multiple files and functions, create appropriate tests, and then help debug any issues.&lt;/p&gt;

&lt;p&gt;For example, at recent hackathons, teams have used agents to build entire image processing applications. The agent handles everything from setting up the React frontend to implementing the backend APIs and database schema. When teams run into performance bottlenecks with large image processing, the agent analyzes the code, identifies the issue, and implements a more efficient algorithm, complete with proper error handling and edge case management. What would take days of work is accomplished in hours.&lt;/p&gt;

&lt;h3&gt;
  
  
  Business Operations
&lt;/h3&gt;

&lt;p&gt;Finance departments have been early adopters of agent technology. Many CFOs deploy accounting agents that completely transform month-end close processes. These agents don't just process transactions – they reconcile accounts across multiple systems, identify discrepancies, follow up on missing documentation, prepare financial statements with explanatory notes, and even suggest journal entries to correct issues they discover.&lt;/p&gt;

&lt;p&gt;The game-changer is how they handle exceptions. Rather than simply flagging problems for humans to resolve, they can reason through complex accounting rules to suggest appropriate treatments for unusual transactions. When encountering truly novel situations, they research accounting standards, propose solutions with citations to relevant guidance, and learn from accountants' feedback to handle similar situations autonomously in the future.&lt;/p&gt;

&lt;h3&gt;
  
  
  Healthcare
&lt;/h3&gt;

&lt;p&gt;Healthcare providers are using monitoring agents that go far beyond traditional alert systems. Hospitals implement patient monitoring agents that integrate data from electronic health records, bedside monitors, medication administration systems, and lab results. These agents don't just notify staff when readings exceed thresholds – they understand clinical context.&lt;/p&gt;

&lt;p&gt;For instance, when a patient's oxygen saturation drops, the agent checks recent medication administration, position changes, and historical patterns for that patient. It can distinguish between temporary fluctuations and concerning trends, only alerting staff when truly necessary. Over time, it learns each patient's baseline and normal variations, dramatically reducing false alarms while catching subtle early warning signs of deterioration that static monitoring would miss.&lt;/p&gt;
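&lt;p&gt;The baseline-aware alerting described here boils down to comparing each reading against the patient's own recent history instead of a fixed threshold. A simplified sketch (window size, values, and limits are illustrative, not clinical guidance):&lt;/p&gt;

```python
# Alert only when a reading deviates sharply from the patient's own
# rolling baseline, rather than when it crosses a static threshold.
from statistics import mean, stdev

def should_alert(history, reading, window=10, z_limit=3.0):
    """True when `reading` sits more than z_limit standard deviations
    below the rolling baseline computed from recent history."""
    recent = history[-window:]
    if len(recent) > 2:
        baseline = mean(recent)
        spread = stdev(recent) or 1e-6  # guard against zero spread
        return (baseline - reading) / spread > z_limit
    return False  # not enough history to judge

spo2 = [97, 96, 97, 98, 97, 96, 97, 97, 96, 97]
print(should_alert(spo2, 96))   # normal fluctuation for this patient
print(should_alert(spo2, 88))   # sharp drop from this patient's baseline
```

Because the baseline is per-patient, the same absolute reading can be routine for one patient and alarming for another, which is exactly how false alarms get reduced.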

&lt;h3&gt;
  
  
  Education
&lt;/h3&gt;

&lt;p&gt;Educational agents are evolving from simple tutoring programs to comprehensive learning companions. University professors develop research mentor agents to support graduate students. These agents don't just answer questions – they help shape the entire research process.&lt;/p&gt;

&lt;p&gt;When a student begins a project, the agent helps refine research questions, suggests methodological approaches, identifies potential difficulties, and maps out a realistic timeline. As the student progresses, it reviews drafts, suggests improvements to experimental design, helps interpret results, and provides guidance on presenting findings effectively. Most impressively, it adapts its support based on each student's strengths, weaknesses, and learning style – providing more structure for those who need it while encouraging independence in others.&lt;/p&gt;

&lt;h3&gt;
  
  
  Personal Productivity
&lt;/h3&gt;

&lt;p&gt;Personal productivity agents are perhaps the most accessible use case for most people. A robust productivity agent transforms workload management. It's not just a glorified to-do list – it's a genuine workload management partner.&lt;/p&gt;

&lt;p&gt;It tracks projects across multiple tools (email, task managers, documents, calendar), identifies dependencies and potential conflicts, and proactively suggests schedule adjustments. When receiving new requests, it evaluates them against current commitments and helps determine what to prioritize or delegate. It drafts appropriate responses based on communication style and relationship with each person.&lt;/p&gt;

&lt;p&gt;What makes it truly valuable is how it learns preferences and working patterns over time. It recognizes which times of day are most suited for creative work versus meetings, which tasks tend to be procrastinated on, and how long similar tasks have typically taken in the past. It uses this knowledge to suggest realistic schedules that work with actual habits rather than some idealized productivity system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges and Considerations
&lt;/h2&gt;

&lt;p&gt;While AI agents present incredible opportunities, they also come with significant challenges that we need to address as developers and users:&lt;/p&gt;

&lt;h3&gt;
  
  
  Alignment Problems: When Agents Go Off-Track
&lt;/h3&gt;

&lt;p&gt;Consider an email management agent designed to prioritize inbox messages. Despite clear instructions about what "important" means, the agent might flag all messages from a manager as urgent (including lunch invitations) while categorizing client emergency requests as "can wait until tomorrow." Why? Because it observed the user responding quickly to their boss several times and learned the wrong pattern from this behavior.&lt;/p&gt;

&lt;p&gt;This is what's called an alignment problem – when agents optimize for goals that don't match the user's actual intentions. As agents gain more capabilities and autonomy, ensuring they accurately understand true objectives becomes critically important. The issue isn't about malicious AI but rather misunderstandings that can have significant consequences when agents have meaningful power to act independently.&lt;/p&gt;
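&lt;p&gt;The failure mode is easy to reproduce in miniature: an agent that scores importance by a proxy signal (observed reply speed) will confidently rank a lunch invite above an emergency. A toy sketch with made-up numbers:&lt;/p&gt;

```python
# Toy illustration of proxy misalignment: "how fast the user replied"
# is learned as a stand-in for "how important the sender is".
observed_reply_minutes = {"boss": 2, "client": 45}  # learned behavior

def proxy_priority(sender):
    # Faster observed replies are (wrongly) read as higher importance.
    return 1.0 / observed_reply_minutes.get(sender, 60)

emails = [("boss", "lunch?"), ("client", "PROD OUTAGE - need help now")]
ranked = sorted(emails, key=lambda e: proxy_priority(e[0]), reverse=True)
print(ranked[0])  # the lunch invite outranks the outage
```

The fix is not more optimization pressure but a better objective: incorporating explicit urgency signals and user corrections rather than a single behavioral proxy.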

&lt;h3&gt;
  
  
  The Black Box Problem: Why Did It Do That?
&lt;/h3&gt;

&lt;p&gt;Have you ever had an agent make a decision that left you scratching your head? I remember reviewing code changes made by an agent that completely restructured our authentication system. The changes worked, but I had no idea why the agent thought this approach was better.&lt;/p&gt;

&lt;p&gt;Without transparency into agent reasoning, it's difficult to trust their decisions or learn from their approaches. The most effective agent systems I've worked with provide clear explanations of their decision-making process – not just what they did, but why they chose that approach over alternatives.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security Headaches: New Attack Surfaces
&lt;/h3&gt;

&lt;p&gt;Giving agents access to systems creates new security considerations. A colleague of mine built an agent to help manage their AWS infrastructure. It was incredibly useful until it accidentally exposed sensitive configuration details in logs because it didn't understand the security implications.&lt;/p&gt;

&lt;p&gt;Agents often need broad access privileges to be useful, but this creates potential security vulnerabilities. Careful permission design, monitoring systems, and appropriate guardrails are essential – especially when agents interact with critical systems.&lt;/p&gt;
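&lt;p&gt;One concrete form of "careful permission design" is an explicit allowlist between the agent and its tools, so the default is deny rather than allow. A minimal sketch (tool names are hypothetical):&lt;/p&gt;

```python
# Gate every tool call through an allowlist: the agent gets read-only
# tools by default, and anything destructive must be added deliberately.
ALLOWED_TOOLS = {"read_logs", "list_buckets"}  # no write/delete by default

def run_tool(name, action):
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"agent may not call {name!r}")
    return action()

print(run_tool("read_logs", lambda: "ok"))
try:
    run_tool("delete_bucket", lambda: "boom")
except PermissionError as err:
    print("blocked:", err)
```

In production you would pair this with audit logging and scoped credentials, so even an allowed tool runs with the narrowest privileges that still make it useful.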

&lt;h3&gt;
  
  
  The Responsibility Question: Who's Accountable?
&lt;/h3&gt;

&lt;p&gt;Imagine your automated trading agent makes a series of questionable trades that lose money. The question immediately arises: who's responsible? The developer who built it? You, who deployed it? The company that created the underlying AI model?&lt;/p&gt;

&lt;p&gt;As agents take more autonomous actions in the world, we need clearer frameworks for accountability. This isn't just a legal question – it's also about designing appropriate human oversight and intervention mechanisms that preserve the efficiency benefits of automation while maintaining appropriate control.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;If you're just starting to explore this world of AI agents, don't be intimidated. Start small – maybe with a personal productivity agent or a code assistant. Watch how it works, learn its strengths and limitations, and gradually expand the tasks you entrust to it. Before you know it, you'll be designing multi-agent systems to tackle complex workflows that previously required entire teams.&lt;/p&gt;

&lt;p&gt;For those already building agents, consider the human-agent relationship carefully. The most successful implementations I've seen don't aim to replace human workers, but rather to enhance their capabilities – handling routine tasks so that people can focus on creative problem-solving, strategic thinking, and interpersonal connections.&lt;/p&gt;

&lt;p&gt;Whether you're looking to build AI agents or just understand how they'll impact your work, there's no better time to dive in. The tools are becoming increasingly accessible, their capabilities more impressive, and their applications more diverse with each passing month.&lt;/p&gt;

&lt;h2&gt;
  
  
  References 
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf?ref=blog.langchain.dev&amp;amp;utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;OpenAI’s guide on building agents&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.anthropic.com/engineering/building-effective-agents?ref=blog.langchain.dev&amp;amp;utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Anthropic’s guide on building effective agents&lt;/a&gt; &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;LangChain’s post: &lt;a href="https://blog.langchain.dev/how-to-think-about-agent-frameworks/?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;How to think about agent frameworks&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://zilliz.com/blog/10-open-source-llm-frameworks-developers-cannot-ignore-in-2025?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;10 Open-Source LLM Frameworks Developers Can’t Ignore in 2025&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>community</category>
    </item>
    <item>
      <title>Build RAG Chatbot 🤖 with LangChain, Milvus, Mistral AI Pixtral, and NVIDIA bge-m3</title>
      <dc:creator>Chloe Williams</dc:creator>
      <pubDate>Fri, 14 Mar 2025 07:00:00 +0000</pubDate>
      <link>https://dev.to/zilliz/build-rag-chatbot-with-langchain-milvus-mistral-ai-pixtral-and-nvidia-bge-m3-30om</link>
      <guid>https://dev.to/zilliz/build-rag-chatbot-with-langchain-milvus-mistral-ai-pixtral-and-nvidia-bge-m3-30om</guid>
      <description>&lt;h2&gt;
  
  
  Introduction to RAG
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://zilliz.com/learn/Retrieval-Augmented-Generation" rel="noopener noreferrer"&gt;Retrieval-Augmented Generation (RAG)&lt;/a&gt; is a game-changer for GenAI applications, especially in conversational AI. It combines the power of pre-trained large language models (&lt;a href="https://zilliz.com/glossary/large-language-models-(llms)" rel="noopener noreferrer"&gt;LLMs&lt;/a&gt;) like OpenAI’s GPT with external knowledge sources stored in &lt;a href="https://zilliz.com/learn/what-is-vector-database" rel="noopener noreferrer"&gt;vector databases&lt;/a&gt; such as &lt;a href="https://zilliz.com/what-is-milvus" rel="noopener noreferrer"&gt;Milvus&lt;/a&gt; and &lt;a href="https://zilliz.com/cloud" rel="noopener noreferrer"&gt;Zilliz Cloud&lt;/a&gt;, allowing for more accurate, contextually relevant, and up-to-date response generation. A RAG pipeline usually consists of four basic components: a vector database, an embedding model, an LLM, and a framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Components We'll Use for This RAG Chatbot
&lt;/h2&gt;

&lt;p&gt;This tutorial shows you how to build a simple RAG chatbot in Python using the following components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://zilliz.com/blog/langchain-ultimate-guide-getting-started" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt;: An open-source framework that helps you orchestrate the interaction between LLMs, vector stores, embedding models, and more, making it easier to assemble a RAG pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://milvus.io/?__hstc=220948871.b077ca1e9148f3f681770b3bfdf39e1f.1740691563293.1741723505748.1741726822764.46&amp;amp;__hssc=220948871.4.1741726822764&amp;amp;__hsfp=3541243462" rel="noopener noreferrer"&gt;Milvus&lt;/a&gt;: An open-source vector database optimized to store, index, and search large-scale vector embeddings efficiently, perfect for use cases like RAG, semantic search, and recommender systems. If you'd rather not manage your own infrastructure, we recommend using &lt;a href="https://zilliz.com/cloud" rel="noopener noreferrer"&gt;Zilliz Cloud&lt;/a&gt;, which is a fully managed vector database service built on Milvus and offers a free tier supporting up to 1 million vectors.&lt;/li&gt;
&lt;li&gt;Mistral AI Pixtral: Pixtral is Mistral AI's multimodal model, pairing a vision encoder with a Mistral language backbone so it can reason over both images and text and generate text responses. In this tutorial it serves as the LLM that turns retrieved context into answers, and its image understanding also makes it a natural fit for multimodal RAG applications.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;NVIDIA bge-m3&lt;/em&gt;: BGE-M3 is a versatile text embedding model from BAAI, notable for supporting dense, sparse, and multi-vector retrieval across more than 100 languages. Here we access it through NVIDIA's hosted API endpoints to convert documents and queries into vectors for similarity search.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By the end of this tutorial, you’ll have a functional chatbot capable of answering questions based on a custom knowledge base.&lt;/p&gt;

&lt;p&gt;Note: Since we may use proprietary models in our tutorials, make sure you have the required API key beforehand.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Install and Set Up LangChain
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%pip install --quiet --upgrade langchain-text-splitters langchain-community langgraph
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Step 2: Install and Set Up Mistral AI Pixtral
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -qU "langchain[mistralai]"

import getpass
import os

if not os.environ.get("MISTRAL_API_KEY"):
  os.environ["MISTRAL_API_KEY"] = getpass.getpass("Enter API key for Mistral AI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("pixtral-12b-2409", model_provider="mistralai")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Step 3: Install and Set Up NVIDIA bge-m3
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -qU langchain-nvidia-ai-endpoints

import getpass
import os

if not os.environ.get("NVIDIA_API_KEY"):
  os.environ["NVIDIA_API_KEY"] = getpass.getpass("Enter API key for NVIDIA: ")

from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

embeddings = NVIDIAEmbeddings(model="baai/bge-m3")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Step 4: Install and Set Up Milvus
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -qU langchain-milvus

from langchain_milvus import Milvus

vector_store = Milvus(embedding_function=embeddings)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
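&lt;p&gt;The one-liner above relies on the library's default connection behavior, which varies by langchain-milvus version. If your Milvus server runs elsewhere, or you are using Zilliz Cloud, you can pass the connection details explicitly; the &lt;code&gt;uri&lt;/code&gt;/&lt;code&gt;token&lt;/code&gt; keys below follow the langchain-milvus convention, and the values are placeholders to replace with your own:&lt;/p&gt;

```python
# Explicit connection settings, shown as a plain dict so you can adapt them.
# The uri may point at a local Milvus server, a Milvus Lite file, or a
# Zilliz Cloud endpoint; token is only needed for authenticated clusters.
connection_args = {
    "uri": "http://localhost:19530",  # placeholder: swap in your endpoint
    # "token": "YOUR_API_KEY",        # placeholder: cloud/auth deployments only
}
# Then construct the store with, e.g.:
# vector_store = Milvus(embedding_function=embeddings,
#                       connection_args=connection_args,
#                       collection_name="rag_tutorial")
print(connection_args["uri"])
```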
&lt;h2&gt;
  
  
  Step 5: Build a RAG Chatbot
&lt;/h2&gt;

&lt;p&gt;Now that you’ve set up all components, let’s start to build a simple chatbot. We’ll use the &lt;a href="https://milvus.io/docs/overview.md?__hstc=220948871.b077ca1e9148f3f681770b3bfdf39e1f.1740691563293.1741723505748.1741726822764.46&amp;amp;__hssc=220948871.4.1741726822764&amp;amp;__hsfp=3541243462" rel="noopener noreferrer"&gt;Milvus introduction doc&lt;/a&gt; as a private knowledge base. You can replace it with your own dataset to customize your RAG chatbot.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import bs4
from langchain import hub
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict

# Load and chunk contents of the blog
loader = WebBaseLoader(
    web_paths=("https://milvus.io/docs/overview.md",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("doc-style doc-post-content")
        )
    ),
)

docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_splits = text_splitter.split_documents(docs)

# Index chunks
_ = vector_store.add_documents(documents=all_splits)

# Define prompt for question-answering
prompt = hub.pull("rlm/rag-prompt")


# Define state for application
class State(TypedDict):
    question: str
    context: List[Document]
    answer: str


# Define application steps
def retrieve(state: State):
    retrieved_docs = vector_store.similarity_search(state["question"])
    return {"context": retrieved_docs}


def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}


# Compile application and test
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h3&gt;
  
  
  Test the Chatbot
&lt;/h3&gt;

&lt;p&gt;Yeah! You've built your own chatbot. Let's ask the chatbot a question.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;response = graph.invoke({"question": "What data types does Milvus support?"})

print(response["answer"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Example Output
&lt;/h3&gt;

&lt;p&gt;Milvus supports various data types including sparse vectors, binary vectors, JSON, and arrays. Additionally, it handles common numerical and character types, making it versatile for different data modeling needs. This allows users to manage unstructured or multi-modal data efficiently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimization Tips
&lt;/h2&gt;

&lt;p&gt;As you build your RAG system, optimization is key to ensuring peak performance and efficiency. While setting up the components is an essential first step, fine-tuning each one will help you create a solution that works even better and scales seamlessly. In this section, we’ll share some practical tips for optimizing all these components, giving you the edge to build smarter, faster, and more responsive RAG applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  LangChain optimization tips
&lt;/h3&gt;

&lt;p&gt;To optimize LangChain, focus on minimizing redundant operations in your workflow by structuring your chains and agents efficiently. Use caching to avoid repeated computations, speeding up your system, and experiment with modular design to ensure that components like models or databases can be easily swapped out. This will provide both flexibility and efficiency, allowing you to quickly scale your system without unnecessary delays or complications.&lt;/p&gt;
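&lt;p&gt;The caching advice above can be illustrated without any framework: at its core it is memoization of repeated prompts. LangChain also ships its own LLM cache (for example &lt;code&gt;set_llm_cache&lt;/code&gt; with an in-memory cache), but the principle is the same as this plain-Python sketch:&lt;/p&gt;

```python
# Memoize responses for repeated prompts so identical calls skip the
# expensive model invocation entirely.
from functools import lru_cache

calls = {"count": 0}  # tracks how many "real" model calls happened

@lru_cache(maxsize=256)
def cached_llm(prompt):
    calls["count"] += 1          # stands in for a slow, costly API call
    return "answer to: " + prompt

cached_llm("What is Milvus?")
cached_llm("What is Milvus?")    # identical prompt: served from cache
print(calls["count"])            # -> 1
```

In a real pipeline you would key the cache on the fully rendered prompt (including retrieved context), since two questions with different context must not share an answer.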

&lt;h3&gt;
  
  
  Milvus optimization tips
&lt;/h3&gt;

&lt;p&gt;Milvus serves as a highly efficient vector database, critical for retrieval tasks in a RAG system. To optimize its performance, ensure that indexes are properly built to balance speed and accuracy; consider utilizing HNSW (Hierarchical Navigable Small World) for efficient nearest neighbor search where response time is crucial. Partitioning data based on usage patterns can enhance query performance and reduce load times, enabling better scalability. Regularly monitor and adjust cache settings based on query frequency to avoid latency during data retrieval. Employ batch processing for vector insertions, which can minimize database lock contention and enhance overall throughput. Additionally, fine-tune the model parameters by experimenting with the dimensionality of the vectors; higher dimensions can improve retrieval accuracy but may increase search time, necessitating a balance tailored to your specific use case and hardware infrastructure.&lt;/p&gt;
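&lt;p&gt;As a concrete starting point for the HNSW suggestion, here is what index parameters look like in Milvus's &lt;code&gt;index_params&lt;/code&gt; format. The values are common starting points rather than tuned recommendations, and how you pass them (for example through langchain-milvus) may vary by version:&lt;/p&gt;

```python
# Illustrative HNSW index parameters in Milvus's index_params format.
# M and efConstruction trade build time and memory for recall.
index_params = {
    "index_type": "HNSW",
    "metric_type": "L2",       # or "IP"/"COSINE", matching your embedding model
    "params": {
        "M": 16,               # graph degree: higher means better recall, more memory
        "efConstruction": 200, # build-time search width
    },
}
# With langchain-milvus this can typically be supplied as
# Milvus(..., index_params=index_params).
print(index_params["index_type"])
```

Raising `ef` at query time (search width) is the usual first knob when recall is too low with an otherwise fixed index.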

&lt;h3&gt;
  
  
  Mistral AI Pixtral optimization tips
&lt;/h3&gt;

&lt;p&gt;Pixtral is optimized for multimodal RAG applications, requiring careful management of both textual and visual data retrieval. Improve retrieval efficiency by using specialized embeddings for different modalities—vector search for text and CLIP-based embeddings for images. Implement a multimodal ranking system to prioritize the most contextually relevant passages and images. Optimize model performance by structuring input prompts effectively, ensuring text and visual information are well-integrated without unnecessary repetition. Fine-tune temperature settings based on response requirements—lower values (0.1–0.2) for accuracy-driven applications, higher values for creative outputs. If deploying at scale, use parallel inference for handling large multimodal datasets efficiently. Streamline inference by leveraging batching and caching strategies, especially when handling frequently queried images and text pairs.&lt;/p&gt;

&lt;h3&gt;
  
  
  NVIDIA bge-m3 optimization tips
&lt;/h3&gt;

&lt;p&gt;To optimize the NVIDIA bge-m3 in a Retrieval-Augmented Generation (RAG) setup, ensure you're using the latest driver and CUDA toolkit for improved performance. Fine-tune the model hyperparameters such as learning rate and batch size based on your specific dataset to enhance efficiency. Employ mixed precision training to speed up computations and reduce memory usage. Utilize data augmentation techniques to increase the variability of your training dataset, helping the model generalize better. Additionally, streamline your retrieval process by implementing efficient indexing methods and caching frequently accessed data, which can significantly reduce latency during inference. Finally, monitor resource utilization with NVIDIA’s profiling tools to identify and address bottlenecks dynamically.&lt;/p&gt;

&lt;p&gt;By implementing these tips across your components, you'll be able to enhance the performance and functionality of your RAG system, ensuring it’s optimized for both speed and accuracy. Keep testing, iterating, and refining your setup to stay ahead in the ever-evolving world of AI development.&lt;/p&gt;

&lt;h2&gt;
  
  
  RAG Cost Calculator: A Free Tool to Calculate Your Cost in Seconds
&lt;/h2&gt;

&lt;p&gt;Estimating the cost of a Retrieval-Augmented Generation (RAG) pipeline involves analyzing expenses across vector storage, compute resources, and API usage. Key cost drivers include vector database queries, embedding generation, and LLM inference.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://zilliz.com/rag-cost-calculator/" rel="noopener noreferrer"&gt;RAG Cost Calculator&lt;/a&gt; is a free tool that quickly estimates the cost of building a RAG pipeline, including chunking, embedding, vector storage/search, and LLM generation. It also helps you identify cost-saving opportunities and achieve up to 10x cost reduction on vector databases with the serverless option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://zilliz.com/rag-cost-calculator/" rel="noopener noreferrer"&gt;Calculate your RAG cost now.&lt;/a&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0i8l16pgrop4ohq1ny1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0i8l16pgrop4ohq1ny1.png" alt="Calculate your RAG cost" width="800" height="404"&gt;&lt;/a&gt;Calculate your RAG cost&lt;/p&gt;

&lt;h2&gt;
  
  
  What Have You Learned?
&lt;/h2&gt;

&lt;p&gt;By diving into this tutorial, you’ve unlocked the power of combining cutting-edge tools to build a robust RAG system from scratch! You learned how LangChain acts as the glue, orchestrating the entire pipeline by seamlessly connecting your data sources, retrieval logic, and generative AI. With Milvus as your vector database, you saw firsthand how to store and query dense embeddings at scale, ensuring lightning-fast similarity searches that pull the most relevant context for your queries. Then came Mistral AI’s Pixtral, the LLM powerhouse that transforms retrieved snippets into coherent, human-like answers—showcasing its knack for multimodal understanding and creative problem-solving. And let’s not forget NVIDIA’s bge-m3, the embedding model that turns text into rich, multidimensional vectors, capturing semantic nuances so your system understands &lt;em&gt;exactly&lt;/em&gt; what users are asking for. Together, these tools form a dynamic quartet, turning raw data into actionable insights with precision and flair.&lt;/p&gt;

&lt;p&gt;But this tutorial didn’t stop at the basics—you also picked up pro tips for optimizing performance, like tweaking chunk sizes for better retrieval or fine-tuning prompts to guide Pixtral’s outputs. The cherry on top? That free RAG cost calculator you explored, which helps you balance accuracy and expenses as you scale. Now, imagine what you can build next! Whether it’s a customer support bot, a research assistant, or a personalized learning tool, you’ve got the blueprint to innovate. So fire up your IDE, experiment with these tools, and let your creativity run wild. The future of intelligent applications is in your hands—go build something amazing, share it with the world, and keep pushing the boundaries of what RAG can do! 🚀&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Resources
&lt;/h2&gt;

&lt;p&gt;🌟 In addition to this RAG tutorial, unleash your full potential with these incredible resources to level up your RAG skills.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://milvus.io/docs/multimodal_rag_with_milvus.md?__hstc=220948871.b077ca1e9148f3f681770b3bfdf39e1f.1740691563293.1741723505748.1741726822764.46&amp;amp;__hssc=220948871.4.1741726822764&amp;amp;__hsfp=3541243462" rel="noopener noreferrer"&gt;How to Build a Multimodal RAG&lt;/a&gt; | Documentation&lt;/li&gt;
&lt;li&gt;&lt;a href="https://milvus.io/docs/how_to_enhance_your_rag.md?__hstc=220948871.b077ca1e9148f3f681770b3bfdf39e1f.1740691563293.1741723505748.1741726822764.46&amp;amp;__hssc=220948871.4.1741726822764&amp;amp;__hsfp=3541243462" rel="noopener noreferrer"&gt;How to Enhance the Performance of Your RAG Pipeline&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://milvus.io/docs/graph_rag_with_milvus.md?__hstc=220948871.b077ca1e9148f3f681770b3bfdf39e1f.1740691563293.1741723505748.1741726822764.46&amp;amp;__hssc=220948871.4.1741726822764&amp;amp;__hsfp=3541243462" rel="noopener noreferrer"&gt;Graph RAG with Milvus&lt;/a&gt; | Documentation&lt;/li&gt;
&lt;li&gt;&lt;a href="https://zilliz.com/learn/How-To-Evaluate-RAG-Applications" rel="noopener noreferrer"&gt;How to Evaluate RAG Applications - Zilliz Learn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://zilliz.com/learn/generative-ai" rel="noopener noreferrer"&gt;Generative AI Resource Hub | Zilliz&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  We'd Love to Hear What You Think!
&lt;/h2&gt;

&lt;p&gt;We’d love to hear your thoughts! 🌟 Leave your questions or comments below or join our vibrant &lt;a href="https://discord.com/invite/milvus" rel="noopener noreferrer"&gt;Milvus Discord community&lt;/a&gt; to share your experiences, ask questions, or connect with thousands of AI enthusiasts. Your journey matters to us! If you like this tutorial, show your support by giving our &lt;a href="https://github.com/milvus-io/milvus" rel="noopener noreferrer"&gt;Milvus GitHub&lt;/a&gt; repo a star ⭐—it means the world to us and inspires us to keep creating! 💖&lt;/p&gt;

</description>
      <category>vectordatabase</category>
      <category>rag</category>
      <category>ai</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Tutorial: Build a RAG Chatbot with LangChain 🦜, Zilliz Cloud, Anthropic Claude 3 Opus, and Google Vertex AI text-embedding-004</title>
      <dc:creator>Chloe Williams</dc:creator>
      <pubDate>Wed, 12 Mar 2025 07:00:00 +0000</pubDate>
      <link>https://dev.to/zilliz/tutorial-build-a-rag-chatbot-with-langchain-zilliz-cloud-anthropic-claude-3-opus-and-google-24gg</link>
      <guid>https://dev.to/zilliz/tutorial-build-a-rag-chatbot-with-langchain-zilliz-cloud-anthropic-claude-3-opus-and-google-24gg</guid>
      <description>&lt;h2&gt;
  
  
  Introduction to RAG
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://zilliz.com/learn/Retrieval-Augmented-Generation" rel="noopener noreferrer"&gt;Retrieval-Augmented Generation (RAG)&lt;/a&gt; is a game-changer for GenAI applications, especially in conversational AI. It combines the power of pre-trained large language models (&lt;a href="https://zilliz.com/glossary/large-language-models-(llms)" rel="noopener noreferrer"&gt;LLMs&lt;/a&gt;) like OpenAI’s GPT with external knowledge sources stored in &lt;a href="https://zilliz.com/learn/what-is-vector-database" rel="noopener noreferrer"&gt;vector databases&lt;/a&gt; such as &lt;a href="https://zilliz.com/what-is-milvus" rel="noopener noreferrer"&gt;Milvus&lt;/a&gt; and &lt;a href="https://zilliz.com/cloud" rel="noopener noreferrer"&gt;Zilliz Cloud&lt;/a&gt;, allowing for more accurate, contextually relevant, and up-to-date response generation. A RAG pipeline usually consists of four basic components: a vector database, an embedding model, an LLM, and a framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Components We'll Use for This RAG Chatbot
&lt;/h2&gt;

&lt;p&gt;This tutorial shows you how to build a simple RAG chatbot in Python using the following components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://zilliz.com/blog/langchain-ultimate-guide-getting-started" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt;: An open-source framework that helps you orchestrate the interaction between LLMs, vector stores, embedding models, etc, making it easier to integrate a RAG pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://zilliz.com/cloud" rel="noopener noreferrer"&gt;Zilliz Cloud&lt;/a&gt;: a fully managed vector database-as-a-service platform built on top of the open-source &lt;a href="https://milvus.io/?__hstc=220948871.b077ca1e9148f3f681770b3bfdf39e1f.1740691563293.1741723505748.1741726822764.46&amp;amp;__hssc=220948871.3.1741726822764&amp;amp;__hsfp=3541243462" rel="noopener noreferrer"&gt;Milvus&lt;/a&gt;, designed to handle high-performance vector data processing at scale. It enables organizations to efficiently store, search, and analyze large volumes of unstructured data, such as text, images, or audio, by leveraging advanced vector search technology. It offers a free tier supporting up to 1 million vectors.&lt;/li&gt;
&lt;li&gt;Anthropic Claude 3 Opus: This advanced model in the Claude 3 series is designed for complex reasoning and nuanced conversations. It combines deep understanding with ethical considerations, making it ideal for sensitive applications like customer support, therapy chatbots, and content generation where context and empathy are paramount.&lt;/li&gt;
&lt;li&gt;Google Vertex AI text-embedding-004: This model specializes in creating high-quality text embeddings for diverse natural language processing tasks. Its strength lies in capturing semantic meaning and relationships effectively, making it suitable for applications such as semantic search, clustering, and recommendation systems. Ideal for developers seeking to enhance AI-driven insights from textual data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By the end of this tutorial, you’ll have a functional chatbot capable of answering questions based on a custom knowledge base.&lt;/p&gt;

&lt;p&gt;Note: Since we may use proprietary models in our tutorials, make sure you have the required API key beforehand.&lt;/p&gt;
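&lt;p&gt;Before wiring up the real components, it helps to see the retrieve-then-generate loop in miniature. The sketch below uses pure-Python stand-ins (a toy character-frequency "embedding", an in-memory store, and a template "LLM"); every name in it is hypothetical and exists only to show the data flow:&lt;/p&gt;

```python
# Toy RAG loop with stand-ins for the four components:
# embedding model, vector database, LLM, and a small framework function.

def embed(text):
    # Hypothetical embedding: normalized character-frequency vector
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

def similarity(a, b):
    # Cosine similarity (vectors are already normalized)
    return sum(x * y for x, y in zip(a, b))

class ToyVectorStore:
    def __init__(self):
        self.items = []  # (vector, document) pairs

    def add(self, doc):
        self.items.append((embed(doc), doc))

    def search(self, query, k=1):
        ranked = sorted(self.items, key=lambda it: -similarity(it[0], embed(query)))
        return [doc for _, doc in ranked[:k]]

def toy_llm(prompt):
    # Stand-in LLM: a real model would generate an answer from the prompt
    return "Based on the context: " + prompt

def rag_answer(store, question):
    context = " ".join(store.search(question, k=1))
    return toy_llm(f"Context: {context} Question: {question}")

store = ToyVectorStore()
store.add("Milvus is a vector database for similarity search.")
store.add("LangChain orchestrates LLM pipelines.")
answer = rag_answer(store, "What is Milvus?")
```

&lt;p&gt;In the steps below, LangChain plays the framework role, Zilliz Cloud replaces the toy store, text-embedding-004 replaces the toy embedding, and Claude 3 Opus replaces the stand-in LLM.&lt;/p&gt;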

&lt;h2&gt;
  
  
  Step 1: Install and Set Up LangChain
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%pip install --quiet --upgrade langchain-text-splitters langchain-community langgraph
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Step 2: Install and Set Up Anthropic Claude 3 Opus
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -qU "langchain[anthropic]"

import getpass
import os

if not os.environ.get("ANTHROPIC_API_KEY"):
  os.environ["ANTHROPIC_API_KEY"] = getpass.getpass("Enter API key for Anthropic: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("claude-3-opus-latest", model_provider="anthropic")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Step 3: Install and Set Up Google Vertex AI text-embedding-004
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -qU langchain-google-vertexai

from langchain_google_vertexai import VertexAIEmbeddings

embeddings = VertexAIEmbeddings(model="text-embedding-004")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Step 4: Install and Set Up Zilliz Cloud
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -qU langchain-milvus

from langchain_milvus import Zilliz

vector_store = Zilliz(
    embedding_function=embeddings,
    connection_args={
        "uri": ZILLIZ_CLOUD_URI,
        "token": ZILLIZ_CLOUD_TOKEN,
    },
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Step 5: Build a RAG Chatbot
&lt;/h2&gt;

&lt;p&gt;Now that you’ve set up all components, let’s start to build a simple chatbot. We’ll use the &lt;a href="https://milvus.io/docs/overview.md?__hstc=220948871.b077ca1e9148f3f681770b3bfdf39e1f.1740691563293.1741723505748.1741726822764.46&amp;amp;__hssc=220948871.3.1741726822764&amp;amp;__hsfp=3541243462" rel="noopener noreferrer"&gt;Milvus introduction doc&lt;/a&gt; as a private knowledge base. You can replace it with your own dataset to customize your RAG chatbot.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import bs4
from langchain import hub
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict

# Load and chunk contents of the blog
loader = WebBaseLoader(
    web_paths=("https://milvus.io/docs/overview.md",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("doc-style doc-post-content")
        )
    ),
)

docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_splits = text_splitter.split_documents(docs)

# Index chunks
_ = vector_store.add_documents(documents=all_splits)

# Define prompt for question-answering
prompt = hub.pull("rlm/rag-prompt")


# Define state for application
class State(TypedDict):
    question: str
    context: List[Document]
    answer: str


# Define application steps
def retrieve(state: State):
    retrieved_docs = vector_store.similarity_search(state["question"])
    return {"context": retrieved_docs}


def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}


# Compile application and test
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
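&lt;p&gt;The splitter above uses &lt;code&gt;chunk_size=1000&lt;/code&gt; and &lt;code&gt;chunk_overlap=200&lt;/code&gt;. A simplified fixed-window version (the real RecursiveCharacterTextSplitter also respects separators such as paragraphs and sentences) shows what the overlap does:&lt;/p&gt;

```python
# Simplified fixed-size chunking with overlap; consecutive chunks share
# chunk_overlap characters so context is not cut off at chunk boundaries.
def chunk(text, chunk_size, chunk_overlap):
    step = chunk_size - chunk_overlap
    stop = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]

chunks = chunk("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
# chunks[1] starts with the last 4 characters of chunks[0]
```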
&lt;h3&gt;
  
  
  Test the Chatbot
&lt;/h3&gt;

&lt;p&gt;Yeah! You've built your own chatbot. Let's ask the chatbot a question.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;response = graph.invoke({"question": "What data types does Milvus support?"})
print(response["answer"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Example Output
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Milvus supports various data types including sparse vectors, binary vectors, JSON, and arrays. Additionally, it handles common numerical and character types, making it versatile for different data modeling needs. This allows users to manage unstructured or multi-modal data efficiently.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Optimization Tips
&lt;/h2&gt;

&lt;p&gt;As you build your RAG system, optimization is key to ensuring peak performance and efficiency. While setting up the components is an essential first step, fine-tuning each one will help you create a solution that works even better and scales seamlessly. In this section, we’ll share some practical tips for optimizing all these components, giving you the edge to build smarter, faster, and more responsive RAG applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  LangChain optimization tips
&lt;/h3&gt;

&lt;p&gt;To optimize LangChain, focus on minimizing redundant operations in your workflow by structuring your chains and agents efficiently. Use caching to avoid repeated computations, speeding up your system, and experiment with modular design to ensure that components like models or databases can be easily swapped out. This will provide both flexibility and efficiency, allowing you to quickly scale your system without unnecessary delays or complications.&lt;/p&gt;
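&lt;p&gt;The caching tip can be sketched in a few lines. Here &lt;code&gt;cached_llm&lt;/code&gt; is a hypothetical stand-in; in a real LangChain app you would enable one of its built-in LLM caches instead:&lt;/p&gt;

```python
# Memoize repeated prompts so identical queries never hit the backend twice.
from functools import lru_cache

backend_calls = {"count": 0}

@lru_cache(maxsize=128)
def cached_llm(prompt):
    backend_calls["count"] += 1   # counts real backend hits
    return "answer to: " + prompt

cached_llm("What is Milvus?")
cached_llm("What is Milvus?")     # second call is served from the cache
```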

&lt;h3&gt;
  
  
  Zilliz Cloud optimization tips
&lt;/h3&gt;

&lt;p&gt;Optimizing Zilliz Cloud for a RAG system involves efficient index selection, query tuning, and resource management. Use Hierarchical Navigable Small World (HNSW) indexing for high-speed, approximate nearest neighbor search while balancing recall and efficiency. Fine-tune ef_construction and M parameters based on your dataset size and query workload to optimize search accuracy and latency. Enable dynamic scaling to handle fluctuating workloads efficiently, ensuring smooth performance under varying query loads. Implement data partitioning to improve retrieval speed by grouping related data, reducing unnecessary comparisons. Regularly update and optimize embeddings to keep results relevant, particularly when dealing with evolving datasets. Use hybrid search techniques, such as combining vector and keyword search, to improve response quality. Monitor system metrics in Zilliz Cloud’s dashboard and adjust configurations accordingly to maintain low-latency, high-throughput performance.&lt;/p&gt;
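&lt;p&gt;As a concrete reference point, the &lt;code&gt;M&lt;/code&gt; and &lt;code&gt;efConstruction&lt;/code&gt; parameters mentioned above are passed as index parameters. The values below are illustrative starting points in the Milvus/Zilliz style, not tuned recommendations:&lt;/p&gt;

```python
# Hedged sketch of HNSW index and search parameters.
hnsw_index_params = {
    "index_type": "HNSW",
    "metric_type": "COSINE",
    "params": {
        "M": 16,                # graph degree: higher improves recall, costs memory
        "efConstruction": 200,  # build-time beam width: higher builds a better graph, slower
    },
}

# At query time, "ef" controls the search beam width (keep it at least top-k)
hnsw_search_params = {"metric_type": "COSINE", "params": {"ef": 64}}
```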

&lt;h3&gt;
  
  
  Anthropic Claude 3 Opus optimization tips
&lt;/h3&gt;

&lt;p&gt;Claude 3 Opus is a powerful model for RAG applications requiring deep reasoning and high-quality responses. Optimize performance by structuring retrieval results effectively, ensuring that only the most relevant context is provided to avoid unnecessary token usage. Utilize a ranker to prioritize key passages before sending them to the model, preventing information overload and improving response quality. Fine-tune hyperparameters like temperature (0.1–0.3 for factual tasks) and top-k sampling to maintain accuracy while controlling response variation. If cost and speed are concerns, use Claude 3 Opus selectively for complex queries while relying on a smaller model like Claude 3 Haiku for simpler tasks. Implement caching for repeated or high-frequency queries to minimize API calls and improve latency. Use Claude’s parallel processing capabilities where applicable to handle multiple document queries efficiently.&lt;/p&gt;

&lt;h3&gt;
  
  
  Google Vertex AI text-embedding-004 optimization tips
&lt;/h3&gt;

&lt;p&gt;Google Vertex AI text-embedding-004 offers high-quality embeddings suitable for a wide range of RAG applications. To improve retrieval efficiency, reduce redundancy in input text by preprocessing data and focusing on key concepts and relevant context. For large-scale deployments, utilize batch processing to generate embeddings in parallel, reducing latency. Optimize search performance by implementing hybrid search strategies that combine traditional keyword matching with dense vector similarity. Fine-tune temperature settings to balance between creativity and precision, and adjust the model’s top-k and top-p parameters to control the variability of results. Cache embeddings for high-demand queries to reduce unnecessary processing, and refresh embeddings periodically to maintain relevance as new data is ingested.&lt;/p&gt;
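&lt;p&gt;The hybrid search idea mentioned above can be blended with a single weight. The scorers below are toy stand-ins (a real system would use BM25 and your embedding model), so treat the snippet as a sketch of the weighting, not an implementation:&lt;/p&gt;

```python
# Blend a keyword-overlap score with a dense vector similarity score.
def keyword_score(query, doc):
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q.intersection(d)) / max(len(q), 1)

def hybrid_score(query, doc, vector_sim, alpha=0.5):
    # alpha weights the dense side; (1 - alpha) weights the keyword side
    return alpha * vector_sim + (1 - alpha) * keyword_score(query, doc)

score = hybrid_score("vector database", "milvus is a vector database", vector_sim=0.9)
```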

&lt;p&gt;By implementing these tips across your components, you'll be able to enhance the performance and functionality of your RAG system, ensuring it’s optimized for both speed and accuracy. Keep testing, iterating, and refining your setup to stay ahead in the ever-evolving world of AI development.&lt;/p&gt;

&lt;h2&gt;
  
  
  RAG Cost Calculator: A Free Tool to Calculate Your Cost in Seconds
&lt;/h2&gt;

&lt;p&gt;Estimating the cost of a Retrieval-Augmented Generation (RAG) pipeline involves analyzing expenses across vector storage, compute resources, and API usage. Key cost drivers include vector database queries, embedding generation, and LLM inference.&lt;/p&gt;
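&lt;p&gt;The arithmetic behind such an estimate is simple. The per-token prices below are placeholders, not real rates; always check your providers' current pricing pages:&lt;/p&gt;

```python
# Back-of-the-envelope monthly RAG cost. Both prices are hypothetical.
EMBED_PRICE_PER_1K_TOKENS = 0.0001  # assumption
LLM_PRICE_PER_1K_TOKENS = 0.015     # assumption

def monthly_cost(corpus_tokens, queries_per_month, tokens_per_query):
    embedding = corpus_tokens / 1000 * EMBED_PRICE_PER_1K_TOKENS
    generation = queries_per_month * tokens_per_query / 1000 * LLM_PRICE_PER_1K_TOKENS
    return round(embedding + generation, 2)

# e.g. a 10M-token corpus and 1,000 queries/month at 2,000 tokens each
estimate = monthly_cost(10_000_000, 1000, 2000)
```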

&lt;p&gt;&lt;a href="https://zilliz.com/rag-cost-calculator/" rel="noopener noreferrer"&gt;RAG Cost Calculator&lt;/a&gt; is a free tool that quickly estimates the cost of building a RAG pipeline, including chunking, embedding, vector storage/search, and LLM generation. It also helps you identify cost-saving opportunities and achieve up to 10x cost reduction on vector databases with the serverless option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://zilliz.com/rag-cost-calculator/" rel="noopener noreferrer"&gt;Calculate your RAG cost now.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0i8l16pgrop4ohq1ny1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0i8l16pgrop4ohq1ny1.png" alt="Calculate your RAG cost" width="800" height="404"&gt;&lt;/a&gt;Calculate your RAG cost&lt;/p&gt;

&lt;h2&gt;
  
  
  What Have You Learned?
&lt;/h2&gt;


&lt;p&gt;Wow, what an exciting journey you've embarked on! In this tutorial, you’ve seen how the integration of various cutting-edge technologies can culminate in a powerful RAG system. You started with LangChain as the robust framework that effortlessly ties all components together, orchestrating their collaboration seamlessly. It’s truly the backbone of your architecture, allowing for a smooth flow of data and requests.&lt;/p&gt;

&lt;p&gt;Next, we dove into how the Zilliz Cloud vector database enhances your application by enabling lightning-fast searches, ensuring that retrieving relevant information is not only efficient but also scalable. This rapid retrieval capability is fundamental for delivering a stellar user experience.&lt;/p&gt;

&lt;p&gt;We then explored how the Anthropic Claude 3 Opus LLM elevates your application’s conversational intelligence, empowering your system to generate engaging and contextually aware responses. With its capabilities, your user interactions can now feel more natural and dynamic.&lt;/p&gt;

&lt;p&gt;The magic doesn’t stop there! The Google Vertex AI text-embedding-004 model generates rich semantic representations, giving unique context to searches and responses. You also picked up on optimizing techniques and learned about using a free cost calculator to manage potential expenses.&lt;/p&gt;

&lt;p&gt;Now, it’s your turn! With the knowledge and tools you've gathered, you have an incredible opportunity to build, innovate, and optimize your very own RAG applications. Get out there, experiment, and let your creativity shine! The future is bright, and the possibilities are endless. Happy building!&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Resources
&lt;/h2&gt;

&lt;p&gt;🌟 In addition to this RAG tutorial, unleash your full potential with these incredible resources to level up your RAG skills.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://milvus.io/docs/multimodal_rag_with_milvus.md?__hstc=220948871.b077ca1e9148f3f681770b3bfdf39e1f.1740691563293.1741723505748.1741726822764.46&amp;amp;__hssc=220948871.3.1741726822764&amp;amp;__hsfp=3541243462" rel="noopener noreferrer"&gt;How to Build a Multimodal RAG&lt;/a&gt; | Documentation&lt;/li&gt;
&lt;li&gt;&lt;a href="https://milvus.io/docs/how_to_enhance_your_rag.md?__hstc=220948871.b077ca1e9148f3f681770b3bfdf39e1f.1740691563293.1741723505748.1741726822764.46&amp;amp;__hssc=220948871.3.1741726822764&amp;amp;__hsfp=3541243462" rel="noopener noreferrer"&gt;How to Enhance the Performance of Your RAG Pipeline&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://milvus.io/docs/graph_rag_with_milvus.md?__hstc=220948871.b077ca1e9148f3f681770b3bfdf39e1f.1740691563293.1741723505748.1741726822764.46&amp;amp;__hssc=220948871.3.1741726822764&amp;amp;__hsfp=3541243462" rel="noopener noreferrer"&gt;Graph RAG with Milvus&lt;/a&gt; | Documentation&lt;/li&gt;
&lt;li&gt;&lt;a href="https://zilliz.com/learn/How-To-Evaluate-RAG-Applications" rel="noopener noreferrer"&gt;How to Evaluate RAG Applications - Zilliz Learn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://zilliz.com/learn/generative-ai" rel="noopener noreferrer"&gt;Generative AI Resource Hub | Zilliz&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  We'd Love to Hear What You Think!
&lt;/h2&gt;

&lt;p&gt;We’d love to hear your thoughts! 🌟 Leave your questions or comments below or join our vibrant &lt;a href="https://discord.com/invite/milvus" rel="noopener noreferrer"&gt;Milvus Discord community&lt;/a&gt; to share your experiences, ask questions, or connect with thousands of AI enthusiasts. Your journey matters to us!&lt;/p&gt;

&lt;p&gt;If you like this tutorial, show your support by giving our &lt;a href="https://github.com/milvus-io/milvus" rel="noopener noreferrer"&gt;Milvus GitHub&lt;/a&gt; repo a star ⭐—it means the world to us and inspires us to keep creating! 💖&lt;/p&gt;

</description>
      <category>vectordatabase</category>
      <category>rag</category>
      <category>tutorial</category>
      <category>ai</category>
    </item>
    <item>
      <title>How to Build a RAG Chatbot with LangChain, Milvus, Together AI Mixtral 8x7B Instruct v0.1, and OpenAI text-embedding-3-large</title>
      <dc:creator>Chloe Williams</dc:creator>
      <pubDate>Wed, 05 Mar 2025 17:00:00 +0000</pubDate>
      <link>https://dev.to/zilliz/how-to-build-a-rag-chatbot-with-langchain-milvus-together-ai-mixtral-8x7b-instruct-v01-and-324l</link>
      <guid>https://dev.to/zilliz/how-to-build-a-rag-chatbot-with-langchain-milvus-together-ai-mixtral-8x7b-instruct-v01-and-324l</guid>
      <description>&lt;h2&gt;
  
  
  Introduction to RAG
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://zilliz.com/learn/Retrieval-Augmented-Generation" rel="noopener noreferrer"&gt;Retrieval-Augmented Generation (RAG)&lt;/a&gt; is a game-changer for GenAI applications, especially in conversational AI. It combines the power of pre-trained large language models (&lt;a href="https://zilliz.com/glossary/large-language-models-(llms)" rel="noopener noreferrer"&gt;LLMs&lt;/a&gt;) like OpenAI’s GPT with external knowledge sources stored in &lt;a href="https://zilliz.com/learn/what-is-vector-database" rel="noopener noreferrer"&gt;vector databases&lt;/a&gt; such as &lt;a href="https://zilliz.com/what-is-milvus" rel="noopener noreferrer"&gt;Milvus&lt;/a&gt; and &lt;a href="https://zilliz.com/cloud" rel="noopener noreferrer"&gt;Zilliz Cloud&lt;/a&gt;, allowing for more accurate, contextually relevant, and up-to-date response generation. A RAG pipeline usually consists of four basic components: a vector database, an embedding model, an LLM, and a framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Components We'll Use for This RAG Chatbot
&lt;/h2&gt;

&lt;p&gt;This tutorial shows you how to build a simple RAG chatbot in Python using the following components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://zilliz.com/blog/langchain-ultimate-guide-getting-started" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt;: An open-source framework that helps you orchestrate the interaction between LLMs, vector stores, embedding models, etc., making it easier to assemble a RAG pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://milvus.io/?__hstc=220948871.b077ca1e9148f3f681770b3bfdf39e1f.1740691563293.1741036355418.1741044630283.13&amp;amp;__hssc=220948871.4.1741044630283&amp;amp;__hsfp=2816761639" rel="noopener noreferrer"&gt;Milvus&lt;/a&gt;: An open-source vector database optimized to store, index, and search large-scale vector embeddings efficiently, perfect for use cases like RAG, semantic search, and recommender systems. If you hate to manage your own infrastructure, we recommend using &lt;a href="https://zilliz.com/cloud" rel="noopener noreferrer"&gt;Zilliz Cloud&lt;/a&gt;, which is a fully managed vector database service built on Milvus and offers a free tier supporting up to 1 million vectors.&lt;/li&gt;
&lt;li&gt;Together AI Mixtral 8x7B Instruct v0.1: This model offers a powerful blend of instruction-based learning and advanced natural language understanding. With its 8x7B architecture, it excels in generating coherent and context-aware responses. Ideal for applications like chatbots, content creation, and educational tools where user guidance and high-quality interaction are essential.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://zilliz.com/ai-models/text-embedding-3-large" rel="noopener noreferrer"&gt;text-embedding-3-large&lt;/a&gt;: OpenAI's text embedding model, generating embeddings with 1536 dimensions, designed for tasks like semantic search and similarity matching.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By the end of this tutorial, you’ll have a functional chatbot capable of answering questions based on a custom knowledge base.&lt;/p&gt;

&lt;p&gt;Note: Since we may use proprietary models in our tutorials, make sure you have the required API key beforehand.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Install and Set Up LangChain
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%pip install --quiet --upgrade langchain-text-splitters langchain-community langgraph
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Step 2: Install and Set Up Together AI Mixtral 8x7B Instruct v0.1
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -qU "langchain[together]"

import getpass
import os

if not os.environ.get("TOGETHER_API_KEY"):
  os.environ["TOGETHER_API_KEY"] = getpass.getpass("Enter API key for Together AI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("mistralai/Mixtral-8x7B-Instruct-v0.1", model_provider="together")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Step 3: Install and Set Up OpenAI text-embedding-3-large
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -qU langchain-openai

import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
  os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Step 4: Install and Set Up Milvus
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -qU langchain-milvus

from langchain_milvus import Milvus

vector_store = Milvus(embedding_function=embeddings)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Step 5: Build a RAG Chatbot
&lt;/h2&gt;

&lt;p&gt;Now that you’ve set up all components, let’s start to build a simple chatbot. We’ll use the &lt;a href="https://milvus.io/docs/overview.md?__hstc=220948871.b077ca1e9148f3f681770b3bfdf39e1f.1740691563293.1741036355418.1741044630283.13&amp;amp;__hssc=220948871.4.1741044630283&amp;amp;__hsfp=2816761639" rel="noopener noreferrer"&gt;Milvus introduction doc&lt;/a&gt; as a private knowledge base. You can replace it with your own dataset to customize your RAG chatbot.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import bs4
from langchain import hub
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict

# Load and chunk contents of the blog
loader = WebBaseLoader(
    web_paths=("https://milvus.io/docs/overview.md",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("doc-style doc-post-content")
        )
    ),
)

docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_splits = text_splitter.split_documents(docs)

# Index chunks
_ = vector_store.add_documents(documents=all_splits)

# Define prompt for question-answering
prompt = hub.pull("rlm/rag-prompt")


# Define state for application
class State(TypedDict):
    question: str
    context: List[Document]
    answer: str


# Define application steps
def retrieve(state: State):
    retrieved_docs = vector_store.similarity_search(state["question"])
    return {"context": retrieved_docs}


def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}


# Compile application and test
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h3&gt;
  
  
  Test the Chatbot
&lt;/h3&gt;

&lt;p&gt;Yeah! You've built your own chatbot. Let's ask the chatbot a question.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;response = graph.invoke({"question": "What data types does Milvus support?"})
print(response["answer"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Example Output
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Milvus supports various data types including sparse vectors, binary vectors, JSON, and arrays. Additionally, it handles common numerical and character types, making it versatile for different data modeling needs. This allows users to manage unstructured or multi-modal data efficiently.&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Optimization Tips
&lt;/h2&gt;

&lt;p&gt;As you build your RAG system, optimization is key to ensuring peak performance and efficiency. While setting up the components is an essential first step, fine-tuning each one will help you create a solution that works even better and scales seamlessly. In this section, we’ll share some practical tips for optimizing all these components, giving you the edge to build smarter, faster, and more responsive RAG applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  LangChain Optimization Tips
&lt;/h3&gt;

&lt;p&gt;To optimize LangChain, focus on minimizing redundant operations in your workflow by structuring your chains and agents efficiently. Use caching to avoid repeated computations, speeding up your system, and experiment with modular design to ensure that components like models or databases can be easily swapped out. This will provide both flexibility and efficiency, allowing you to quickly scale your system without unnecessary delays or complications.&lt;/p&gt;

&lt;h3&gt;
  
  
  Milvus optimization tips
&lt;/h3&gt;

&lt;p&gt;Milvus serves as a highly efficient vector database, critical for retrieval tasks in a RAG system. To optimize its performance, ensure that indexes are properly built to balance speed and accuracy; consider utilizing HNSW (Hierarchical Navigable Small World) for efficient nearest neighbor search where response time is crucial. Partitioning data based on usage patterns can enhance query performance and reduce load times, enabling better scalability. Regularly monitor and adjust cache settings based on query frequency to avoid latency during data retrieval. Employ batch processing for vector insertions, which can minimize database lock contention and enhance overall throughput. Additionally, fine-tune the model parameters by experimenting with the dimensionality of the vectors; higher dimensions can improve retrieval accuracy but may increase search time, necessitating a balance tailored to your specific use case and hardware infrastructure.&lt;/p&gt;
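&lt;p&gt;The batch-insertion tip boils down to slicing your vectors before calling the client. In the sketch below, &lt;code&gt;insert_fn&lt;/code&gt; stands for a hypothetical insert call (e.g. a Milvus collection insert); here it just collects batches so the helper runs on its own:&lt;/p&gt;

```python
# Insert vectors in fixed-size batches instead of one giant call.
def insert_in_batches(vectors, insert_fn, batch_size=1000):
    for i in range(0, len(vectors), batch_size):
        insert_fn(vectors[i:i + batch_size])

batches = []
insert_in_batches(list(range(2500)), batches.append, batch_size=1000)
# 2500 vectors produce batches of 1000, 1000, and 500
```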

&lt;h3&gt;
  
  
  Together AI Mixtral 8x7B Instruct v0.1 optimization tips
&lt;/h3&gt;

&lt;p&gt;Together AI’s Mixtral 8x7B Instruct v0.1 uses a mixture-of-experts (MoE) architecture to balance efficiency and performance. Optimize retrieval by dynamically adjusting the number of retrieved documents based on query complexity to prevent overloading the context window. Structure prompts effectively, ensuring that critical details are at the start of the input to guide the model’s focus. Use a temperature of 0.1–0.3 for factual accuracy while tweaking top-k and top-p for balanced response generation. Together AI’s inference stack allows for optimized execution, so enable expert pruning to limit active pathways when full capacity isn’t needed. Implement caching strategies for common queries to minimize redundant processing. If integrating multiple models, use Mixtral 8x7B for medium-to-high complexity reasoning while offloading simpler queries to smaller, more efficient models.&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenAI text-embedding-3-large optimization tips
&lt;/h3&gt;

&lt;p&gt;OpenAI text-embedding-3-large is a high-capacity embedding model designed for precise and rich semantic representation, making it ideal for RAG systems with complex document retrieval needs. Optimize efficiency by preprocessing and normalizing text to reduce noise before embedding generation. Use dimensionality reduction techniques, such as PCA, if storage or computational limits become a concern. When querying, leverage HNSW-based approximate nearest neighbor (ANN) search to accelerate retrieval while maintaining accuracy. Batch process embedding requests to reduce latency and optimize resource utilization. Implement re-ranking models to further refine top results based on query context. Regularly update the embedding store with newly ingested data to maintain retrieval relevance.&lt;/p&gt;

&lt;p&gt;By implementing these tips across your components, you'll be able to enhance the performance and functionality of your RAG system, ensuring it’s optimized for both speed and accuracy. Keep testing, iterating, and refining your setup to stay ahead in the ever-evolving world of AI development.&lt;/p&gt;
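&lt;p&gt;The re-ranking step mentioned above retrieves a wide candidate set by vector score, then re-orders it with a second, more precise scorer. The word-overlap scorer below is a toy stand-in for a real cross-encoder re-ranker:&lt;/p&gt;

```python
# Re-rank retrieved candidates with a second scorer.
def overlap_score(query, doc):
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q.intersection(d)) / max(len(q), 1)

def rerank(query, candidates, k=2):
    return sorted(candidates, key=lambda doc: -overlap_score(query, doc))[:k]

candidates = [
    "Milvus supports sparse and binary vectors",
    "LangChain chains LLM calls together",
    "Milvus is a vector database",
]
top = rerank("What vectors does Milvus support?", candidates, k=1)
```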

&lt;h2&gt;
  
  
  RAG Cost Calculator: A Free Tool to Calculate Your Cost in Seconds
&lt;/h2&gt;

&lt;p&gt;Estimating the cost of a Retrieval-Augmented Generation (RAG) pipeline involves analyzing expenses across vector storage, compute resources, and API usage. Key cost drivers include vector database queries, embedding generation, and LLM inference.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://zilliz.com/rag-cost-calculator/" rel="noopener noreferrer"&gt;RAG Cost Calculator&lt;/a&gt; is a free tool that quickly estimates the cost of building a RAG pipeline, including chunking, embedding, vector storage/search, and LLM generation. It also helps you identify cost-saving opportunities and achieve up to 10x cost reduction on vector databases with the serverless option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://zilliz.com/rag-cost-calculator/" rel="noopener noreferrer"&gt;Calculate your RAG cost now.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0i8l16pgrop4ohq1ny1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0i8l16pgrop4ohq1ny1.png" alt="Calculate your RAG cost" width="800" height="404"&gt;&lt;/a&gt;Calculate your RAG cost&lt;/p&gt;

&lt;h2&gt;
  
  
  What Have You Learned?
&lt;/h2&gt;

&lt;p&gt;This tutorial has taken you on an exciting journey through the integration of a powerful framework, a vector database, a state-of-the-art large language model (LLM), and an innovative embedding model to build a cutting-edge Retrieval-Augmented Generation (RAG) system. You've seen how the framework elegantly ties everything together, orchestrating the flow of data and commands like a maestro leading a symphony.&lt;/p&gt;

&lt;p&gt;With Milvus as your vector database, you've harnessed the power of speedy and efficient searches, allowing your system to quickly locate relevant information without breaking a sweat. The Together AI Mixtral 8x7B Instruct model has shown you how to infuse conversational intelligence into applications, helping your users interact with utmost ease and understanding. Meanwhile, the OpenAI text-embedding-3-large model has equipped you with the capability to create rich semantic representations, ensuring that your data isn’t just accurate but deeply meaningful.&lt;/p&gt;

&lt;p&gt;Don’t forget the optimization tips you've picked up along the way, and the handy free cost calculator that empowers you to budget your resources wisely. Now, it’s time to roll up your sleeves and dive into building, optimizing, and innovating your own RAG applications! The possibilities are immense, and with the knowledge you’ve gained, you’re well-prepared to create solutions that can truly make a difference. Go ahead, let your creativity run wild—your RAG adventure starts now!&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Resources
&lt;/h2&gt;

&lt;p&gt;🌟 In addition to this RAG tutorial, unleash your full potential with these incredible resources to level up your RAG skills.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://milvus.io/docs/multimodal_rag_with_milvus.md?__hstc=220948871.b077ca1e9148f3f681770b3bfdf39e1f.1740691563293.1741036355418.1741044630283.13&amp;amp;__hssc=220948871.4.1741044630283&amp;amp;__hsfp=2816761639" rel="noopener noreferrer"&gt;How to Build a Multimodal RAG&lt;/a&gt; | Documentation&lt;/li&gt;
&lt;li&gt;&lt;a href="https://milvus.io/docs/how_to_enhance_your_rag.md?__hstc=220948871.b077ca1e9148f3f681770b3bfdf39e1f.1740691563293.1741036355418.1741044630283.13&amp;amp;__hssc=220948871.4.1741044630283&amp;amp;__hsfp=2816761639" rel="noopener noreferrer"&gt;How to Enhance the Performance of Your RAG Pipeline&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://milvus.io/docs/graph_rag_with_milvus.md?__hstc=220948871.b077ca1e9148f3f681770b3bfdf39e1f.1740691563293.1741036355418.1741044630283.13&amp;amp;__hssc=220948871.4.1741044630283&amp;amp;__hsfp=2816761639" rel="noopener noreferrer"&gt;Graph RAG with Milvus&lt;/a&gt; | Documentation&lt;/li&gt;
&lt;li&gt;&lt;a href="https://zilliz.com/learn/How-To-Evaluate-RAG-Applications" rel="noopener noreferrer"&gt;How to Evaluate RAG Applications - Zilliz Learn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://zilliz.com/learn/generative-ai" rel="noopener noreferrer"&gt;Generative AI Resource Hub | Zilliz&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  We'd Love to Hear What You Think!
&lt;/h2&gt;

&lt;p&gt;We’d love to hear your thoughts! 🌟 Leave your questions or comments below or join our vibrant &lt;a href="https://discord.com/invite/milvus" rel="noopener noreferrer"&gt;Milvus Discord community&lt;/a&gt; to share your experiences, ask questions, or connect with thousands of AI enthusiasts. Your journey matters to us!&lt;/p&gt;

&lt;p&gt;If you like this tutorial, show your support by giving our &lt;a href="https://github.com/milvus-io/milvus" rel="noopener noreferrer"&gt;Milvus GitHub&lt;/a&gt; repo a star ⭐—it means the world to us and inspires us to keep creating! 💖&lt;/p&gt;

</description>
      <category>openai</category>
      <category>rag</category>
      <category>vectordatabase</category>
      <category>ai</category>
    </item>
    <item>
      <title>RAG Chatbot: Build with LangChain, Milvus, Fireworks AI 🔥Llama 3.1 8B Instruct, and Cohere embed-multilingual-v2.0</title>
      <dc:creator>Chloe Williams</dc:creator>
      <pubDate>Mon, 03 Mar 2025 23:51:21 +0000</pubDate>
      <link>https://dev.to/zilliz/rag-chatbot-build-with-langchain-milvus-fireworks-ai-llama-31-8b-instruct-and-cohere-34hg</link>
      <guid>https://dev.to/zilliz/rag-chatbot-build-with-langchain-milvus-fireworks-ai-llama-31-8b-instruct-and-cohere-34hg</guid>
      <description>&lt;h2&gt;
  
  
  Introduction to RAG
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://zilliz.com/learn/Retrieval-Augmented-Generation" rel="noopener noreferrer"&gt;Retrieval-Augmented Generation (RAG)&lt;/a&gt; is a game-changer for GenAI applications, especially in conversational AI. It combines the power of pre-trained large language models (&lt;a href="https://zilliz.com/glossary/large-language-models-(llms)" rel="noopener noreferrer"&gt;LLMs&lt;/a&gt;) like OpenAI’s GPT with external knowledge sources stored in &lt;a href="https://zilliz.com/learn/what-is-vector-database" rel="noopener noreferrer"&gt;vector databases&lt;/a&gt; such as &lt;a href="https://zilliz.com/what-is-milvus" rel="noopener noreferrer"&gt;Milvus&lt;/a&gt; and &lt;a href="https://zilliz.com/cloud" rel="noopener noreferrer"&gt;Zilliz Cloud&lt;/a&gt;, allowing for more accurate, contextually relevant, and up-to-date response generation. A RAG pipeline usually consists of four basic components: a vector database, an embedding model, an LLM, and a framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Components We'll Use for This RAG Chatbot
&lt;/h2&gt;

&lt;p&gt;This tutorial shows you how to build a simple RAG chatbot in Python using the following components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://zilliz.com/blog/langchain-ultimate-guide-getting-started" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt;: An open-source framework that helps you orchestrate the interaction between LLMs, vector stores, embedding models, etc, making it easier to integrate a RAG pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://milvus.io/?__hstc=220948871.b077ca1e9148f3f681770b3bfdf39e1f.1740691563293.1741036355418.1741044630283.13&amp;amp;__hssc=220948871.3.1741044630283&amp;amp;__hsfp=2816761639" rel="noopener noreferrer"&gt;Milvus&lt;/a&gt;: An open-source vector database optimized to store, index, and search large-scale vector embeddings efficiently, perfect for use cases like RAG, semantic search, and recommender systems. If you hate to manage your own infrastructure, we recommend using &lt;a href="https://zilliz.com/cloud" rel="noopener noreferrer"&gt;Zilliz Cloud&lt;/a&gt;, which is a fully managed vector database service built on Milvus and offers a free tier supporting up to 1 million vectors.&lt;/li&gt;
&lt;li&gt;Fireworks AI Llama 3.1 8B Instruct: This model is designed to deliver precise instructions and guidance through advanced reasoning capabilities. With its 8 billion parameters, it excels in generating coherent responses across various domains, making it ideal for educational tools, virtual assistants, and interactive content creation. Its strength lies in user engagement through personalized interactions.&lt;/li&gt;
&lt;li&gt;Cohere embed-multilingual-v2.0: This model specializes in generating high-quality multilingual embeddings, enabling effective cross-lingual understanding and retrieval. Its strengths lie in capturing semantic relationships in diverse languages, making it suitable for applications such as multilingual search, recommendation systems, and global content analysis where language diversity is a critical factor.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By the end of this tutorial, you’ll have a functional chatbot capable of answering questions based on a custom knowledge base.&lt;/p&gt;

&lt;p&gt;Note: Since we may use proprietary models in our tutorials, make sure you have the required API key beforehand.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Install and Set Up LangChain
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%pip install --quiet --upgrade langchain-text-splitters langchain-community langgraph
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Step 2: Install and Set Up Fireworks AI Llama 3.1 8B Instruct
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -qU "langchain[fireworks]"


import getpass
import os

if not os.environ.get("FIREWORKS_API_KEY"):
  os.environ["FIREWORKS_API_KEY"] = getpass.getpass("Enter API key for Fireworks AI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("accounts/fireworks/models/llama-v3p1-8b-instruct", model_provider="fireworks")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Step 3: Install and Set Up Cohere embed-multilingual-v2.0
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -qU langchain-cohere


import getpass
import os

if not os.environ.get("COHERE_API_KEY"):
  os.environ["COHERE_API_KEY"] = getpass.getpass("Enter API key for Cohere: ")

from langchain_cohere import CohereEmbeddings

embeddings = CohereEmbeddings(model="embed-multilingual-v2.0")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Step 4: Install and Set Up Milvus
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -qU langchain-milvus


from langchain_milvus import Milvus

vector_store = Milvus(embedding_function=embeddings)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Step 5: Build a RAG Chatbot
&lt;/h2&gt;

&lt;p&gt;Now that you’ve set up all components, let’s start to build a simple chatbot. We’ll use the &lt;a href="https://milvus.io/docs/overview.md?__hstc=220948871.b077ca1e9148f3f681770b3bfdf39e1f.1740691563293.1741036355418.1741044630283.13&amp;amp;__hssc=220948871.3.1741044630283&amp;amp;__hsfp=2816761639" rel="noopener noreferrer"&gt;Milvus introduction doc&lt;/a&gt; as a private knowledge base. You can replace it with your own dataset to customize your RAG chatbot.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import bs4
from langchain import hub
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict

# Load and chunk contents of the blog
loader = WebBaseLoader(
    web_paths=("https://milvus.io/docs/overview.md",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("doc-style doc-post-content")
        )
    ),
)

docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_splits = text_splitter.split_documents(docs)

# Index chunks
_ = vector_store.add_documents(documents=all_splits)

# Define prompt for question-answering
prompt = hub.pull("rlm/rag-prompt")


# Define state for application
class State(TypedDict):
    question: str
    context: List[Document]
    answer: str


# Define application steps
def retrieve(state: State):
    retrieved_docs = vector_store.similarity_search(state["question"])
    return {"context": retrieved_docs}


def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}


# Compile application and test
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Test the Chatbot
&lt;/h3&gt;

&lt;p&gt;You've now built your own chatbot. Let's ask it a question.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;response = graph.invoke({"question": "What data types does Milvus support?"})
print(response["answer"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Example Output
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Milvus supports various data types including sparse vectors, binary vectors, JSON, and arrays. Additionally, it handles common numerical and character types, making it versatile for different data modeling needs. This allows users to manage unstructured or multi-modal data efficiently.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Optimization Tips
&lt;/h2&gt;

&lt;p&gt;As you build your RAG system, optimization is key to ensuring peak performance and efficiency. While setting up the components is an essential first step, fine-tuning each one will help you create a solution that works even better and scales seamlessly. In this section, we’ll share some practical tips for optimizing all these components, giving you the edge to build smarter, faster, and more responsive RAG applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  LangChain Optimization Tips
&lt;/h3&gt;

&lt;p&gt;To optimize LangChain, focus on minimizing redundant operations in your workflow by structuring your chains and agents efficiently. Use caching to avoid repeated computations, speeding up your system, and experiment with modular design to ensure that components like models or databases can be easily swapped out. This will provide both flexibility and efficiency, allowing you to quickly scale your system without unnecessary delays or complications.&lt;/p&gt;
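&lt;p&gt;As a rough, framework-agnostic illustration of the caching idea, the sketch below memoizes answers so that repeated questions never re-run the expensive chain call; the answer_question function here is a hypothetical stand-in for a real chain invocation. LangChain also ships built-in LLM caches (for example, set_llm_cache with InMemoryCache); check the current LangChain docs for the exact import path.&lt;/p&gt;

```python
from functools import lru_cache

call_count = 0  # tracks how often the "expensive" chain actually runs

@lru_cache(maxsize=1024)
def answer_question(question: str) -> str:
    """Hypothetical stand-in for invoking a LangChain chain or LLM."""
    global call_count
    call_count += 1
    return f"answer to: {question}"

# The second identical question hits the cache instead of re-running the chain.
answer_question("What is Milvus?")
answer_question("What is Milvus?")
print(call_count)  # 1
```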

&lt;h3&gt;
  
  
  Milvus optimization tips
&lt;/h3&gt;

&lt;p&gt;Milvus serves as a highly efficient vector database, critical for retrieval tasks in a RAG system. To optimize its performance, ensure that indexes are properly built to balance speed and accuracy; consider utilizing HNSW (Hierarchical Navigable Small World) for efficient nearest neighbor search where response time is crucial. Partitioning data based on usage patterns can enhance query performance and reduce load times, enabling better scalability. Regularly monitor and adjust cache settings based on query frequency to avoid latency during data retrieval. Employ batch processing for vector insertions, which can minimize database lock contention and enhance overall throughput. Additionally, fine-tune the model parameters by experimenting with the dimensionality of the vectors; higher dimensions can improve retrieval accuracy but may increase search time, necessitating a balance tailored to your specific use case and hardware infrastructure.&lt;/p&gt;
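&lt;p&gt;To make the HNSW suggestion concrete, here is a minimal sketch of index and search parameters for a Milvus collection. The specific values are illustrative starting points rather than tuned recommendations, and the commented pymilvus calls assume a collection with a vector field named "embedding".&lt;/p&gt;

```python
# Sketch of HNSW index and search parameters for a Milvus collection.
index_params = {
    "index_type": "HNSW",
    "metric_type": "COSINE",
    "params": {
        "M": 16,               # graph connectivity: higher means better recall, more memory
        "efConstruction": 200  # build-time search width: higher means a better graph, slower build
    },
}
search_params = {"metric_type": "COSINE", "params": {"ef": 64}}  # query-time search width

# With pymilvus, this would be applied roughly as:
# collection.create_index(field_name="embedding", index_params=index_params)
# results = collection.search(data=[query_vec], anns_field="embedding",
#                             param=search_params, limit=5)
print(index_params["index_type"])  # HNSW
```

Raising ef at query time trades latency for recall, so it is usually the first knob to experiment with.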

&lt;h3&gt;
  
  
  Fireworks AI Llama 3.1 8B Instruct optimization tips
&lt;/h3&gt;

&lt;p&gt;Llama 3.1 8B Instruct is a cost-efficient model that delivers strong performance in RAG applications with moderate complexity. Optimize retrieval by limiting context length to only the most relevant passages, ensuring efficient token usage. Structure prompts clearly, with short, well-organized sections that guide the model’s focus. Keep temperature around 0.1–0.3 for accuracy and fine-tune top-k and top-p for flexibility. Cache high-frequency queries to minimize redundant processing and reduce API costs. Take advantage of Fireworks AI’s infrastructure to batch requests, optimizing efficiency for large-scale operations. Use response streaming to enhance interactivity in applications requiring fast feedback. If deploying multiple models, leverage 8B for simple queries and hand off more complex tasks to larger models.&lt;/p&gt;
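&lt;p&gt;The last tip, routing simple queries to the 8B model and handing harder ones to a larger model, can be sketched with a simple heuristic. The thresholds below are arbitrary, and the 70B model ID is an assumption; check the Fireworks AI model catalog for current model names.&lt;/p&gt;

```python
# Sketch: send short, single questions to the 8B model and longer or
# multi-part questions to a larger model. The heuristic thresholds and
# the 70B model ID below are assumptions, not Fireworks recommendations.
SMALL_MODEL = "accounts/fireworks/models/llama-v3p1-8b-instruct"
LARGE_MODEL = "accounts/fireworks/models/llama-v3p1-70b-instruct"

def pick_model(question: str) -> str:
    word_count = len(question.split())
    multi_part = question.count("?") > 1 or ";" in question
    return LARGE_MODEL if (word_count > 40 or multi_part) else SMALL_MODEL

print(pick_model("What data types does Milvus support?") == SMALL_MODEL)  # True
```

The chosen model ID can then be passed to init_chat_model exactly as in Step 2.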

&lt;h3&gt;
  
  
  Cohere embed-multilingual-v2.0 optimization tips
&lt;/h3&gt;

&lt;p&gt;Cohere embed-multilingual-v2.0 supports a variety of languages, making it ideal for cross-lingual RAG setups. To optimize efficiency, preprocess text to remove language-specific noise and handle encoding issues, ensuring clean input for embedding generation. Implement efficient ANN algorithms, like FAISS with hierarchical indexing, to support fast retrieval across multilingual datasets. Compress embeddings using techniques such as product quantization or HNSW to optimize storage and speed. Use language detection models to route queries to the appropriate language-specific embeddings, minimizing unnecessary computation. Batch embedding operations and take advantage of parallel processing to handle large amounts of multilingual data efficiently. Regularly update embeddings to ensure the model reflects any language shifts or evolving trends.&lt;/p&gt;
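&lt;p&gt;A minimal sketch of the batching tip: split your documents into fixed-size batches before calling the embedding API. Cohere's embed endpoint accepts a limited number of texts per request (96 at the time of writing; check the current API docs), so 96 is used here as an assumed batch size.&lt;/p&gt;

```python
# Sketch: split documents into batches before calling the embedding API.
def batched(texts, batch_size=96):
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

docs = [f"doc {i}" for i in range(200)]
batches = list(batched(docs))
print([len(b) for b in batches])  # [96, 96, 8]

# Each batch would then be embedded in one call, e.g.
# vectors = embeddings.embed_documents(batch)  # CohereEmbeddings from Step 3
```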

&lt;p&gt;By implementing these tips across your components, you'll be able to enhance the performance and functionality of your RAG system, ensuring it’s optimized for both speed and accuracy. Keep testing, iterating, and refining your setup to stay ahead in the ever-evolving world of AI development.&lt;/p&gt;

&lt;h2&gt;
  
  
  RAG Cost Calculator: A Free Tool to Calculate Your Cost in Seconds
&lt;/h2&gt;

&lt;p&gt;Estimating the cost of a Retrieval-Augmented Generation (RAG) pipeline involves analyzing expenses across vector storage, compute resources, and API usage. Key cost drivers include vector database queries, embedding generation, and LLM inference.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://zilliz.com/rag-cost-calculator/" rel="noopener noreferrer"&gt;RAG Cost Calculator&lt;/a&gt; is a free tool that quickly estimates the cost of building a RAG pipeline, including chunking, embedding, vector storage/search, and LLM generation. It also helps you identify cost-saving opportunities and achieve up to 10x cost reduction on vector databases with the serverless option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://zilliz.com/rag-cost-calculator/" rel="noopener noreferrer"&gt;Calculate your RAG cost now.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0i8l16pgrop4ohq1ny1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0i8l16pgrop4ohq1ny1.png" alt="Calculate your RAG cost" width="800" height="404"&gt;&lt;/a&gt;Calculate your RAG cost&lt;/p&gt;

&lt;h2&gt;
  
  
  What Have You Learned?
&lt;/h2&gt;

&lt;p&gt;Wow, what an incredible journey we've taken together through the world of Retrieval-Augmented Generation (RAG)! You’ve successfully integrated a powerful framework with a cutting-edge vector database, an impressive large language model, and a sophisticated embedding model to create a next-gen RAG system. The joy of seeing these components work together is just fantastic, isn't it?&lt;/p&gt;

&lt;p&gt;You explored how the framework elegantly ties all the parts together, creating seamless workflows that make your projects feel more like magic than mere code. The lightning-fast searches powered by the vector database not only enhance performance but open up a universe of possibilities for retrieving relevant information at remarkable speeds! With the conversational intelligence provided by the LLM—Fireworks AI Llama 3.1—you can engage users like never before, making interactions feel natural and intuitive.&lt;/p&gt;

&lt;p&gt;Furthermore, the embedding model, Cohere embed-multilingual-v2.0, has given you remarkable capabilities in generating rich semantic representations, enabling you to capture nuances in language that can significantly enhance user experience. And let's not forget those handy optimization tips and that free cost calculator—tools designed to ensure you get the most value from your RAG application.&lt;/p&gt;

&lt;p&gt;So, what's next? Don’t let your newfound knowledge sit idle! Dive in, start building, optimizing, and innovating your own RAG applications. The world is eager for fresh ideas, and with the skills you’ve acquired, you’re more than ready to make a difference. Go ahead and unleash your creativity—your adventure in AI has just begun!&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Resources
&lt;/h2&gt;

&lt;p&gt;🌟 In addition to this RAG tutorial, unleash your full potential with these incredible resources to level up your RAG skills.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://milvus.io/docs/multimodal_rag_with_milvus.md?__hstc=220948871.b077ca1e9148f3f681770b3bfdf39e1f.1740691563293.1741036355418.1741044630283.13&amp;amp;__hssc=220948871.3.1741044630283&amp;amp;__hsfp=2816761639" rel="noopener noreferrer"&gt;How to Build a Multimodal RAG&lt;/a&gt; | Documentation&lt;/li&gt;
&lt;li&gt;&lt;a href="https://milvus.io/docs/how_to_enhance_your_rag.md?__hstc=220948871.b077ca1e9148f3f681770b3bfdf39e1f.1740691563293.1741036355418.1741044630283.13&amp;amp;__hssc=220948871.3.1741044630283&amp;amp;__hsfp=2816761639" rel="noopener noreferrer"&gt;How to Enhance the Performance of Your RAG Pipeline&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://milvus.io/docs/graph_rag_with_milvus.md?__hstc=220948871.b077ca1e9148f3f681770b3bfdf39e1f.1740691563293.1741036355418.1741044630283.13&amp;amp;__hssc=220948871.3.1741044630283&amp;amp;__hsfp=2816761639" rel="noopener noreferrer"&gt;Graph RAG with Milvus&lt;/a&gt; | Documentation&lt;/li&gt;
&lt;li&gt;&lt;a href="https://zilliz.com/learn/How-To-Evaluate-RAG-Applications" rel="noopener noreferrer"&gt;How to Evaluate RAG Applications - Zilliz Learn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://zilliz.com/learn/generative-ai" rel="noopener noreferrer"&gt;Generative AI Resource Hub | Zilliz&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  We'd Love to Hear What You Think!
&lt;/h2&gt;

&lt;p&gt;We’d love to hear your thoughts! 🌟 Leave your questions or comments below or join our vibrant &lt;a href="https://discord.com/invite/milvus" rel="noopener noreferrer"&gt;Milvus Discord community&lt;/a&gt; to share your experiences, ask questions, or connect with thousands of AI enthusiasts. Your journey matters to us!&lt;/p&gt;

&lt;p&gt;If you like this tutorial, show your support by giving our &lt;a href="https://github.com/milvus-io/milvus" rel="noopener noreferrer"&gt;Milvus GitHub&lt;/a&gt; repo a star ⭐—it means the world to us and inspires us to keep creating! 💖&lt;/p&gt;

</description>
      <category>rag</category>
      <category>vectordatabase</category>
      <category>llm</category>
      <category>ai</category>
    </item>
    <item>
      <title>Scaling Audio Similarity Search with Vector Databases</title>
      <dc:creator>Chloe Williams</dc:creator>
      <pubDate>Sat, 01 Mar 2025 00:01:50 +0000</pubDate>
      <link>https://dev.to/zilliz/scaling-audio-similarity-search-with-vector-databases-2nn3</link>
      <guid>https://dev.to/zilliz/scaling-audio-similarity-search-with-vector-databases-2nn3</guid>
      <description>&lt;p&gt;Imagine being able to find a song you can’t quite remember—just by humming a few notes into an app and instantly having all the details pop up. Sounds like magic, right? Well, it’s not—it's audio similarity search in action. In today’s world of exponential data growth, where audio content is exploding, efficient audio similarity search is crucial for powering everything from music recommendations to real-time content retrieval and even complex audio classifications. As the sheer volume of audio data soars into the millions (and even billions), traditional search methods simply can’t keep up. Enter vector databases, the game-changer in enabling scalable and ultra-fast similarity searches by turning audio signals into high-dimensional embeddings. Let’s dig into how vector databases make large-scale audio similarity search a reality.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Audio Similarity Search 
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is audio similarity search? 
&lt;/h3&gt;

&lt;p&gt;At its core, &lt;a href="https://milvus.io/docs/audio_similarity_search.md?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;audio similarity search&lt;/a&gt; involves finding and retrieving audio that closely matches a given query. Instead of relying on traditional keyword searches, which depend on metadata or transcriptions, this technology uses machine learning models to analyze audio characteristics like pitch, timbre, rhythm, and more, offering a much more nuanced and accurate retrieval.&lt;/p&gt;

&lt;h3&gt;
  
  
  Common use-cases 
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Music Recommendation&lt;/strong&gt; - Apps such as Spotify analyze audio features of the songs played often to suggest similar tracks, enhancing user experience. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Podcast Search&lt;/strong&gt; - Users can easily look for podcasts with similar content, voices, tones, or themes based on their preferences. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Speech Similarity&lt;/strong&gt; - Used in security applications and voice assistants to detect speaker identity or match spoken phrases. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Environmental Sound Recognition&lt;/strong&gt; - Used for wildlife monitoring by recognizing animal calls or for disaster response management by tracking the severity of earthquakes or landslides through audio cues. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Challenges in traditional audio search 
&lt;/h3&gt;

&lt;p&gt;Traditionally, audio search has relied on keywords: manually assigned tags or transcriptions of the audio data. This approach requires very precise metadata and ignores rich acoustic features, making retrieval difficult and inaccurate. Additionally, as datasets grow, manually tagging and indexing audio files becomes impractical. Modern approaches built on embeddings and vector databases make large-scale audio search possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Magic of Vector Databases in Audio Search
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is a vector database, anyway? 
&lt;/h3&gt;

&lt;p&gt;A vector database is a specialized database that can store, index, and retrieve any kind of unstructured data (text, images, video, or audio) in the form of &lt;a href="https://zilliz.com/glossary/vector-embeddings?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;vector embeddings&lt;/a&gt;. Embeddings are high-dimensional numerical representations (vectors) that capture the essential features of the data. They enable similarity search by mathematically comparing a query vector with the stored vectors, allowing efficient and accurate retrieval. Vector databases offer scalability for large datasets, real-time processing, and high-speed retrieval, making real-world applications possible.&lt;/p&gt;
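&lt;p&gt;The "mathematical comparison" is typically a distance or similarity measure between vectors. Here is a minimal cosine-similarity sketch with toy 3-dimensional vectors; real audio embeddings have hundreds or thousands of dimensions:&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [0.9, 0.1, 0.0]    # embedding of the query clip
clip_a = [1.0, 0.0, 0.0]   # acoustically similar clip
clip_b = [0.0, 0.0, 1.0]   # very different clip

print(cosine_similarity(query, clip_a) > cosine_similarity(query, clip_b))  # True
```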

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnwkzb7fnseqvls01hdtt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnwkzb7fnseqvls01hdtt.png" width="800" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Creating vector embeddings from unstructured data&lt;/p&gt;

&lt;h3&gt;
  
  
  How do vector databases store and index embeddings? 
&lt;/h3&gt;

&lt;p&gt;A vector embedding is stored in a vector database along with its metadata, which assists in efficient retrieval. Vector indexing organizes the stored embeddings so that search time is minimized. Common indexing techniques are &lt;a href="https://zilliz.com/learn/vector-index?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;IVF (Inverted File Index)&lt;/a&gt; and &lt;a href="https://zilliz.com/learn/hierarchical-navigable-small-worlds-HNSW?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;HNSW (Hierarchical Navigable Small World)&lt;/a&gt;; both partition the dataset to minimize search time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Popular vector databases - Milvus and Zilliz Cloud
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://milvus.io/?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Milvus&lt;/a&gt; is an open-source vector database that supports GPU acceleration and &lt;a href="https://zilliz.com/glossary/anns?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Approximate Nearest Neighbor (ANN)&lt;/a&gt; algorithms like HNSW, IVF, and PQ, making it ideal for applications such as audio similarity search, image retrieval, and recommendation systems. &lt;a href="https://zilliz.com/cloud?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Zilliz Cloud&lt;/a&gt; is the fully managed, cloud-native version of Milvus, offering a serverless infrastructure with auto-scaling, high availability, and enterprise-grade security. These databases enable efficient handling of large-scale vector search tasks with minimal operational overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Audio Embeddings Enable Similarity Search 
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What are audio embeddings?
&lt;/h3&gt;

&lt;p&gt;Audio embeddings are numerical representations of audio signals that capture key sound characteristics such as pitch, tempo, rhythm, and timbre. These embeddings enable direct comparison of audio clips based on their inherent acoustic characteristics instead of relying on textual metadata. &lt;/p&gt;

&lt;h3&gt;
  
  
  What are the different techniques to generate audio embeddings?
&lt;/h3&gt;

&lt;p&gt;Before creating embeddings, the raw audio signals undergo preprocessing steps such as resampling (standardizing the sample rate for consistency), noise reduction (removing unwanted background sounds), and segmentation (dividing audio into meaningful chunks). &lt;/p&gt;
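&lt;p&gt;As a small illustration of the segmentation step, the sketch below slices a sample array into fixed-length, overlapping windows. The window and hop sizes are arbitrary toy values; real systems derive them from the sample rate:&lt;/p&gt;

```python
# Sketch: segment a mono sample array into fixed-length, overlapping windows.
def segment(samples, window_size, hop_size):
    segments = []
    for start in range(0, len(samples) - window_size + 1, hop_size):
        segments.append(samples[start:start + window_size])
    return segments

samples = list(range(10))  # stand-in for audio samples
windows = segment(samples, window_size=4, hop_size=2)
print(windows)  # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```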

&lt;p&gt;Next, key audio features are extracted using different techniques:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mel-Frequency Cepstral Coefficients (MFCCs)&lt;/strong&gt;: These features mimic human auditory perception by capturing the spectral shape of a sound, making them useful for speech and music analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Spectrograms&lt;/strong&gt;: A visual representation of how frequency content varies over time, highlighting variations in pitch, intensity, and harmonic structure; spectrograms are widely used as input for deep learning models.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Chroma-based Features&lt;/strong&gt;: These capture the tonal content of an audio signal by emphasizing pitch class distribution.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once features are extracted, deep learning-based models further process them to generate high-dimensional embeddings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OpenL3&lt;/strong&gt;: A deep audio representation model trained on multimodal datasets, capturing a wide range of audio patterns for tasks like environmental sound recognition and music similarity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;YAMNet&lt;/strong&gt;: A model based on MobileNet, trained on the AudioSet dataset, which classifies and extracts embeddings for over 500 sound categories, including speech, instruments, and ambient noises.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;VGGish&lt;/strong&gt;: A deep neural network inspired by VGG, trained on YouTube videos, designed to extract generic audio features applicable to tasks like audio event detection and content-based retrieval.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once embeddings are generated, they are stored and indexed in a vector database, allowing for fast and scalable similarity search. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnn4pbl47uxk9sl0p01g0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnn4pbl47uxk9sl0p01g0.png" width="800" height="229"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Audio similarity search with Zilliz Cloud&lt;/p&gt;

&lt;h2&gt;
  
  
  Scaling Audio Similarity Search with Vector Databases
&lt;/h2&gt;

&lt;p&gt;Since audio datasets can contain millions of files, efficient search and retrieval becomes a challenge. Vector databases play a crucial role in making audio search systems scalable by offering advanced search algorithms and optimized indexing strategies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Managing Large-Scale Audio Datasets
&lt;/h3&gt;

&lt;p&gt;Handling massive audio datasets is possible with the help of techniques such as &lt;a href="https://zilliz.com/glossary/batch-processing?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;batch processing&lt;/a&gt;, distributed storage, and GPU-accelerated indexing offered by vector databases. They allow the processing of large volumes of audio embeddings without compromising performance. &lt;/p&gt;

&lt;h3&gt;
  
  
  Indexing Strategies for Efficient Search
&lt;/h3&gt;

&lt;p&gt;Vector databases optimize similarity search using indexing techniques like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HNSW (Hierarchical Navigable Small World)&lt;/strong&gt; - A graph-based indexing method that builds multiple layers of proximity-based connections between embeddings. The top layers contain sparsely connected nodes, whereas the lower layers have denser connections. When a query comes in, the search traverses the graph from the top layer down, moving toward ever-closer neighbors at each layer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1byz6nkxei0kwxwrkfkq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1byz6nkxei0kwxwrkfkq.png" width="800" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Search in HNSW algorithm (&lt;a href="https://towardsdatascience.com/similarity-search-part-4-hierarchical-navigable-small-world-hnsw-2aad4fe87d37/?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;Source&lt;/a&gt;)&lt;/p&gt;
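To build intuition for the traversal shown above, here is a minimal sketch of greedy best-first search on a single proximity-graph layer. Real HNSW maintains multiple layers and a candidate list; the toy graph and 2-D vectors below are invented purely for illustration:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_search(graph, vectors, query, entry_point):
    """Greedy best-first search on one proximity-graph layer.

    graph: node -> list of neighbor nodes; vectors: node -> embedding.
    Repeatedly moves to the closest neighbor until no neighbor is closer,
    which is how HNSW descends toward the query within each layer.
    """
    current = entry_point
    current_dist = euclidean(vectors[current], query)
    while True:
        best, best_dist = current, current_dist
        for neighbor in graph[current]:
            d = euclidean(vectors[neighbor], query)
            if d < best_dist:
                best, best_dist = neighbor, d
        if best == current:  # local minimum: no neighbor is closer
            return current, current_dist
        current, current_dist = best, best_dist

# Toy 2-D "embeddings" and a small neighbor graph (hypothetical data).
vectors = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 1.0), 3: (3.0, 3.0)}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
node, dist = greedy_search(graph, vectors, (2.1, 1.2), entry_point=0)
```

The search walks 0 → 1 → 2 and stops at node 2, the vector closest to the query.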

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;IVF (Inverted File Index)&lt;/strong&gt; - It splits the dataset into clusters using techniques like k-means clustering; when a new query arrives, the most similar cluster is identified first and the search continues only within it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PQ (Product Quantization)&lt;/strong&gt; - It compresses high-dimensional vectors into smaller sub-vectors, improving storage efficiency and search speed.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
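A toy illustration of the IVF idea, with hand-picked centroids standing in for the k-means step (the vectors and centroids below are invented for illustration):

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign each vector to its nearest centroid, forming inverted lists."""
    lists = {i: [] for i in range(len(centroids))}
    for vid, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda i: dist(v, centroids[i]))
        lists[nearest].append(vid)
    return lists

def ivf_search(query, vectors, centroids, lists):
    """Probe only the nearest cluster, then scan its inverted list exhaustively."""
    probe = min(range(len(centroids)), key=lambda i: dist(query, centroids[i]))
    return min(lists[probe], key=lambda vid: dist(query, vectors[vid]))

# Toy 2-D embeddings grouped around two hypothetical centroids.
vectors = [(0.1, 0.2), (0.3, 0.1), (5.2, 5.1), (4.9, 5.3)]
centroids = [(0.0, 0.0), (5.0, 5.0)]
lists = build_ivf(vectors, centroids)
best = ivf_search((5.0, 5.2), vectors, centroids, lists)
```

Only the vectors in the probed cluster are compared against the query, which is what makes IVF faster than a full scan; production systems usually probe several clusters to trade speed for recall.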

&lt;h3&gt;
  
  
  Handling High-Dimensional Data
&lt;/h3&gt;

&lt;p&gt;Audio embeddings are often high-dimensional, leading to the &lt;a href="https://zilliz.com/glossary/curse-of-dimensionality-in-machine-learning?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;curse of dimensionality&lt;/a&gt;, which causes increased computational cost and less effective indexing. Therefore, dimensionality reduction techniques such as Principal Component Analysis (PCA) or t-SNE can help reduce dimensions without losing critical audio features. Additionally, quantization techniques such as Product Quantization (PQ) and Scalar Quantization (SQ) can compress vectors to make them storage efficient. &lt;/p&gt;
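As an example of how quantization shrinks storage, here is a minimal scalar quantization sketch that maps each float to an 8-bit code over the vector's own range. The values are invented for illustration; production systems such as Milvus implement SQ and PQ internally:

```python
def scalar_quantize(vec, bits=8):
    """Map each component to an integer code in [0, 2**bits - 1] over the vector's range."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / (2 ** bits - 1) if hi > lo else 1.0
    codes = [round((x - lo) / scale) for x in vec]
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Approximately reconstruct the original floats from the integer codes."""
    return [lo + c * scale for c in codes]

vec = [0.0, 0.5, 1.0, 0.25]
codes, lo, scale = scalar_quantize(vec)
approx = dequantize(codes, lo, scale)
# Each code now fits in one byte instead of a 4- or 8-byte float,
# at the cost of a small reconstruction error bounded by scale / 2.
```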

&lt;h3&gt;
  
  
  Performing Real-time Search
&lt;/h3&gt;

&lt;p&gt;All in all, vector databases enable real-time search by maintaining low latency through efficient Approximate Nearest Neighbor (ANN) search algorithms, fast indexing techniques, distributed processing, quantization, in-memory operations, GPU acceleration, and effective handling of high-dimensional data. &lt;/p&gt;

&lt;h2&gt;
  
  
  Tools and Frameworks for Scaling Audio Similarity Search 
&lt;/h2&gt;

&lt;p&gt;To build a scalable audio similarity search system, choosing the right vector database and suitable embedding models or libraries is crucial. Here’s how to pick the ones relevant to you. &lt;/p&gt;

&lt;h3&gt;
  
  
  Which vector database is appropriate for you?
&lt;/h3&gt;

&lt;p&gt;Vector databases store and index high-dimensional embeddings at the scale of millions or billions, making real-world applications fast and scalable. Some of the most popular vector databases are: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Milvus&lt;/strong&gt; - Milvus is a highly scalable vector database (supports billion-scale vector search) built for real-time search and retrieval with efficient indexing methods such as HNSW and IVF. It is ideal for enterprise applications or for someone wanting an open-source yet scalable option. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Zilliz Cloud&lt;/strong&gt; - It is a fully managed, cloud-native version of Milvus, optimized for seamless scaling and deployment. It supports serverless architecture and integrates easily with AWS, Google Cloud, and other cloud providers. It is ideal for teams without dedicated DevOps resources who want a plug-and-play vector search solution. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;FAISS (Facebook AI Similarity Search)&lt;/strong&gt; - It is Facebook’s open-source library for performing quick similarity searches leveraging GPU acceleration. It is best suitable for offline, batch-based similarity search and research applications. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Which audio embedding model should you choose?
&lt;/h3&gt;

&lt;p&gt;Audio embeddings transform raw audio into meaningful feature vectors that can be compared in a vector space. The following models provide pre-trained embeddings: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OpenL3&lt;/strong&gt;: A deep learning-based model that extracts general-purpose audio embeddings using self-supervised learning on multimodal datasets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;VGGish&lt;/strong&gt;: A CNN-based model trained on YouTube-8M, commonly used for music and audio classification.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;YAMNet&lt;/strong&gt;: A MobileNet-based model trained on Google's AudioSet, specializing in environmental sound classification.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Other models like &lt;strong&gt;CLAP (Contrastive Language-Audio Pretraining)&lt;/strong&gt; and &lt;strong&gt;DEEP Audio Embeddings&lt;/strong&gt; provide domain-specific embeddings for speech processing and music retrieval. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
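Whichever model you choose, the embeddings it produces are compared the same way. Here is a minimal cosine-similarity sketch over mock embedding vectors; the 4-dimensional vectors are invented stand-ins for real model outputs, which are much larger (e.g. 512 dimensions for OpenL3, 128 for VGGish):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Mock "audio embeddings" (hypothetical values for illustration only).
clip_a = [0.9, 0.1, 0.3, 0.0]
clip_b = [0.8, 0.2, 0.4, 0.1]   # a similar-sounding clip
clip_c = [0.0, 0.9, 0.0, 0.8]   # a dissimilar clip
sim_ab = cosine_similarity(clip_a, clip_b)
sim_ac = cosine_similarity(clip_a, clip_c)
```

The similar pair scores close to 1.0 while the dissimilar pair scores near 0, which is the signal a vector database ranks on.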

&lt;h2&gt;
  
  
  Optimizing Performance and Efficiency in Large-Scale Audio Search 
&lt;/h2&gt;

&lt;p&gt;Performance and efficiency in large-scale audio search systems can be optimized by considering the following aspects. &lt;/p&gt;

&lt;h3&gt;
  
  
  Techniques to improve search speed and accuracy
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Approximate Nearest Neighbor (ANN) Search&lt;/strong&gt; - ANN algorithms quickly approximate the closest matches instead of exhaustively comparing every audio embedding. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Optimizing Memory Usage and Compute&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using dimensionality reduction techniques like PCA (Principal Component Analysis) or autoencoders reduces the size of embeddings, improving efficiency.&lt;/li&gt;
&lt;li&gt;Processing queries in batches instead of one at a time reduces computational overhead.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
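A sketch of the batching idea above: scoring many queries in a single pass over the database amortizes the per-scan overhead, instead of rescanning for every query (toy data, squared-L2 scoring, invented for illustration):

```python
def batch_nearest(queries, database):
    """One pass over the database scores every query, instead of one scan per query."""
    best = [(None, float("inf"))] * len(queries)
    for vid, vec in enumerate(database):
        for qi, q in enumerate(queries):
            d = sum((x - y) ** 2 for x, y in zip(q, vec))  # squared L2 distance
            if d < best[qi][1]:
                best[qi] = (vid, d)
    return [vid for vid, _ in best]

# Toy database of 2-D embeddings and a batch of two queries.
database = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
queries = [(0.1, 0.1), (4.8, 5.1)]
ids = batch_nearest(queries, database)
```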

&lt;h3&gt;
  
  
  Ways to balance accuracy with computational efficiency
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Tuning the index and search parameters of the vector database; for example, increasing the ‘ef’ search parameter in Milvus improves accuracy at the cost of some speed. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Using domain-specific embeddings and training custom models on task-specific datasets helps reduce noise and improve search quality. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
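To make the accuracy/efficiency trade-off concrete, here is what Milvus-style HNSW index and search parameter dictionaries typically look like. The values below are hypothetical starting points, not tuned recommendations:

```python
# Index-time parameters: built once when the collection is indexed.
index_params = {
    "index_type": "HNSW",
    "metric_type": "L2",
    # M = max graph degree; efConstruction = build-time candidate-list width.
    # Larger values give a better graph but a slower, bigger build.
    "params": {"M": 16, "efConstruction": 200},
}

# Search-time parameters: tunable per query without rebuilding the index.
# Raising `ef` widens the candidate list: higher recall, slower search.
search_params = {"metric_type": "L2", "params": {"ef": 64}}
```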

&lt;h3&gt;
  
  
  Techniques to reduce latency in real-time applications
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Preloading embeddings into memory, performing distributed search, and using multi-GPU processing are some of the ways to reduce latency and speed up operations. &lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Challenges and Considerations 
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Privacy and Security&lt;/strong&gt; - Audio data such as personal voice notes, biometric speech patterns, or medical audio must be carefully protected, as unauthorized access could lead to privacy violations. Encryption and secure access control mechanisms that allow fine-grained permission management (Zilliz Cloud, for instance, offers role-based access control) can be used to safeguard user data. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalability Challenges&lt;/strong&gt; - As the volume of audio datasets can keep increasing (millions to billions), the system must scale efficiently without compromising retrieval speed. Techniques like vector quantization, sharding, and HNSW indexing are essential to improve performance. Employing distributed storage solutions (Milvus deployed on Kubernetes) allows the system to handle high query loads while maintaining low latency. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model Drift&lt;/strong&gt; - Audio embeddings can become outdated as new sounds, voices, or music styles emerge, making the search system less accurate. Therefore, continuous retraining on fresh data is necessary to keep embeddings relevant. Implementing drift detection to monitor performance and embedding versioning to track updates can help keep search results accurate and up to date. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ethical Considerations&lt;/strong&gt; - Mitigating bias in audio datasets is essential to ensure fair results. An embedding model trained predominantly on certain accents or languages may not serve others well, leading to unfair retrieval results. Therefore, having diverse and representative data is crucial. Additionally, explainability techniques can provide transparency, helping users trust and interpret the results. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion 
&lt;/h2&gt;

&lt;p&gt;Audio similarity search powered by vector databases is transforming industries, from music recommendation to environmental monitoring. With the ability to handle vast datasets and offer lightning-fast retrieval, this technology opens up countless possibilities. But like any powerful tool, it requires careful handling of data privacy, scalability, and model relevance. As AI continues to evolve, audio similarity search will remain a foundational technology, unlocking new potential in the world of audio AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Additional Reading 
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://zilliz.com/learn/top-10-most-used-embedding-models-for-audio-data" rel="noopener noreferrer"&gt;https://zilliz.com/learn/top-10-most-used-embedding-models-for-audio-data&lt;/a&gt; &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://zilliz.com/learn/unlocking-pre-trained-models-developers-guide-to-audio-ai-tasks" rel="noopener noreferrer"&gt;https://zilliz.com/learn/unlocking-pre-trained-models-developers-guide-to-audio-ai-tasks&lt;/a&gt; &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://zilliz.com/vector-database-use-cases/audio-similarity-search" rel="noopener noreferrer"&gt;https://zilliz.com/vector-database-use-cases/audio-similarity-search&lt;/a&gt; &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://milvus.io/" rel="noopener noreferrer"&gt;https://milvus.io/&lt;/a&gt; &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://zilliz.com/cloud" rel="noopener noreferrer"&gt;https://zilliz.com/cloud&lt;/a&gt; &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>community</category>
    </item>
    <item>
      <title>DeepSeek vs. OpenAI: A Battle of Innovation in Modern AI</title>
      <dc:creator>Chloe Williams</dc:creator>
      <pubDate>Sat, 01 Mar 2025 00:01:47 +0000</pubDate>
      <link>https://dev.to/zilliz/deepseek-vs-openai-a-battle-of-innovation-in-modern-ai-50j8</link>
      <guid>https://dev.to/zilliz/deepseek-vs-openai-a-battle-of-innovation-in-modern-ai-50j8</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The rapid advancements in AI technology have led to models that not only excel in complex tasks but also seamlessly adapt to a wide range of applications, enhancing their utility across industries. OpenAI, a pioneer in this space, continues to push the boundaries with innovative models that redefine natural language processing and machine learning capabilities. These advancements have sparked a wave of innovation, thereby making AI more accessible, efficient, and capable of performing tasks that once seemed out of reach.&lt;/p&gt;

&lt;p&gt;However, OpenAI's dominance is now being challenged by emerging competitors, such as DeepSeek, a Chinese AI company that has introduced &lt;a href="https://github.com/deepseek-ai/DeepSeek-R1?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;&lt;strong&gt;DeepSeek R1&lt;/strong&gt;&lt;/a&gt;, an open-source model that rivals some of the most advanced models available. DeepSeek R1 stands out due to its focus on cost efficiency and its ability to match the performance of high-end models while keeping operational costs significantly lower. This emerging player has begun to capture attention, especially for organizations and developers seeking a balance between performance and affordability.&lt;/p&gt;

&lt;p&gt;In this blog, we will explore two standout models from OpenAI: OpenAI o1, which is renowned for its advanced reasoning capabilities and deliberate "thinking before responding" approach, and OpenAI o3-mini, a faster, more cost-efficient model optimized for STEM applications. Additionally, we will compare these models to DeepSeek R1, which offers similar performance and capabilities, but at a fraction of the cost. By examining their key features, performance benchmarks, and use cases, we aim to provide you with insights on how to choose the right AI model for your specific needs, whether you're working in research, software development, healthcare, or other fields where complex reasoning and cost efficiency are paramount.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenAI o1 Overview
&lt;/h2&gt;

&lt;p&gt;Launched in September 2024 (in its preview version, and the full version released in December 2024), OpenAI o1 represents a significant leap forward in AI reasoning. Unlike its predecessors, o1 is specifically designed to tackle complex, multi-step tasks using a &lt;a href="https://zilliz.com/glossary/chain-of-thoughts?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;&lt;strong&gt;chain-of-thought&lt;/strong&gt;&lt;/a&gt; reasoning approach, which is enhanced by &lt;strong&gt;large-scale&lt;/strong&gt; &lt;a href="https://zilliz.com/glossary/deep-reinforcement-learning?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;&lt;strong&gt;reinforcement learning&lt;/strong&gt;&lt;/a&gt;. This innovative method enables the model to think through problems step-by-step, improving its problem-solving capabilities and making it particularly effective for logical reasoning and decision-making in challenging scenarios.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Features
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Model Architecture&lt;/strong&gt;: OpenAI o1 is built on a &lt;a href="https://zilliz.com/glossary/transformer-models?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;transformer-based&lt;/a&gt; architecture optimized for reasoning and problem-solving. It employs a unique mechanism for generating extended &lt;a href="https://zilliz.com/glossary/chain-of-thoughts?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;chains-of-thought&lt;/a&gt;, allowing the model to perform deeper and more thorough analyses before providing an answer. This extended reasoning process enhances accuracy and reliability for complex queries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fze3ed6r2u4cvd8rwdn6j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fze3ed6r2u4cvd8rwdn6j.png" width="800" height="580"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 1: Multi-step conversation with reasoning tokens (&lt;/em&gt;&lt;a href="https://platform.openai.com/docs/guides/reasoning?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;&lt;em&gt;Source&lt;/em&gt;&lt;/a&gt;&lt;em&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Training Data&lt;/strong&gt;: OpenAI o1 was trained on a combination of filtered publicly available datasets and proprietary data from partnerships to enhance its reasoning and technical capabilities. Its training data includes web data, open-source datasets, reasoning datasets, paywalled content, specialized archives, and industry-specific resources, ensuring strong performance in both general knowledge and complex problem-solving. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance Benchmarks&lt;/strong&gt;: While OpenAI o1 is slower than earlier models like GPT-4o due to its reasoning processes, it consistently ranks higher in accuracy for complex tasks, particularly in STEM fields like science, mathematics, and coding. It achieved an impressive 83% on the American Invitational Mathematics Examination (AIME), and ranked in the 89th percentile on Codeforces competitive programming challenges. Additionally, it has demonstrated PhD-level accuracy in benchmarks for physics, biology, and chemistry problems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxk5ex0t7ai9i28cq02h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxk5ex0t7ai9i28cq02h.png" width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 2: Performance Benchmarks OpenAI o1 (&lt;/em&gt;&lt;a href="https://openai.com/index/learning-to-reason-with-llms/?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;&lt;em&gt;Source&lt;/em&gt;&lt;/a&gt;&lt;em&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Cases and Application Areas&lt;/strong&gt;: OpenAI o1 is widely applicable in fields requiring advanced reasoning, such as scientific research (data analysis, hypothesis testing), software development (multi-step workflows, debugging), healthcare (diagnosis development), and educational applications (solving complex puzzles or crosswords).&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenAI o3-mini Overview
&lt;/h2&gt;

&lt;p&gt;The OpenAI o3-mini was officially released in late January 2025, following a preview in December 2024. This model continues OpenAI's progression in enhancing reasoning capabilities for complex tasks, offering notable improvements over previous models like o1, particularly in speed, efficiency, and performance across coding, mathematics, and science challenges.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Features
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Model Architecture&lt;/strong&gt;: Similar to OpenAI o1, o3-mini is based on a &lt;a href="https://zilliz.com/glossary/transformer-models?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;transformer&lt;/a&gt; architecture specifically optimized for advanced reasoning. It leverages the &lt;a href="https://zilliz.com/glossary/chain-of-thoughts?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;chain-of-thought&lt;/a&gt; technique to enable step-by-step problem-solving, combined with large-scale &lt;a href="https://zilliz.com/glossary/deep-reinforcement-learning?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;reinforcement learning&lt;/a&gt; to enhance reasoning. A standout feature of o3-mini is its reduced latency compared to o1, allowing for faster results while maintaining high accuracy in complex tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fef2dwx2lltdqfb1c5rac.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fef2dwx2lltdqfb1c5rac.png" width="800" height="561"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 3: Latency Comparison o3-mini vs. o1 (&lt;/em&gt;&lt;a href="https://openai.com/index/openai-o3-mini/?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;&lt;em&gt;Source&lt;/em&gt;&lt;/a&gt;&lt;em&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Training Data:&lt;/strong&gt; Like its predecessor, o3-mini was trained on a combination of publicly available datasets, proprietary OpenAI data, and advanced filtering techniques to ensure a safe and effective training setup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance Benchmarks&lt;/strong&gt;: OpenAI o3-mini has demonstrated exceptional performance on benchmark datasets, achieving 87.3% accuracy in competition-level math problems, 79.7% accuracy on PhD-level science questions, and 49.3% accuracy in software programming, surpassing OpenAI o1 at higher reasoning levels.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuxdswxa090e3ms8e0ov.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuxdswxa090e3ms8e0ov.png" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 4: Performance Benchmarks o3-mini vs. o1: Mathematics, AIME (&lt;/em&gt;&lt;a href="https://openai.com/index/openai-o3-mini/?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;&lt;em&gt;Source&lt;/em&gt;&lt;/a&gt;&lt;em&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frw1eb39ilq4rxh0fgbwk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frw1eb39ilq4rxh0fgbwk.png" alt="Figure 5: Performance Benchmarks o3-mini vs. o1: PhD-level science" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 5: Performance Benchmarks o3-mini vs. o1: PhD-level science (&lt;/em&gt;&lt;a href="https://openai.com/index/openai-o3-mini/?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;&lt;em&gt;Source&lt;/em&gt;&lt;/a&gt;&lt;em&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28lbwbkldou6zc5dxfx1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28lbwbkldou6zc5dxfx1.png" width="800" height="709"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 6: Performance Benchmarks o3-mini vs. o1: Software Engineering (&lt;/em&gt;&lt;a href="https://openai.com/index/openai-o3-mini/?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;&lt;em&gt;Source&lt;/em&gt;&lt;/a&gt;&lt;em&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Cases and Application Areas&lt;/strong&gt;: While OpenAI o3-mini shares many use cases with o1, such as scientific research, software development, healthcare, and educational problem-solving, it stands out in domains where high-level reasoning with lower latency is essential. For instance, in financial analysis, o3-mini can efficiently handle complex risk forecasting, fraud detection, and investment strategy simulations, all while quickly processing large volumes of data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deepseek R1 Overview
&lt;/h2&gt;

&lt;p&gt;DeepSeek R1, released in January 2025, is an open-source AI model developed by the Chinese company DeepSeek. It is specifically designed for advanced reasoning and problem-solving, leveraging a combination of &lt;a href="https://zilliz.com/glossary/chain-of-thoughts?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;&lt;strong&gt;chain-of-thought&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;reasoning, supervised&lt;/strong&gt; &lt;a href="https://zilliz.com/glossary/fine-tuning?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;&lt;strong&gt;fine-tuning&lt;/strong&gt;&lt;/a&gt;, and &lt;a href="https://zilliz.com/glossary/deep-reinforcement-learning?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;&lt;strong&gt;reinforcement learning&lt;/strong&gt;&lt;/a&gt; to enhance logical inference. One of its standout features is its ability to achieve performance levels comparable to leading AI models, such as OpenAI o1, while maintaining significantly lower operational costs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Features
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Model Architecture&lt;/strong&gt;: DeepSeek R1 employs a &lt;a href="https://zilliz.com/glossary/transformer-models?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;transformer-based&lt;/a&gt; architecture optimized for reasoning tasks. Its training methodology builds on &lt;a href="https://zilliz.com/blog/why-deepseek-v3-is-taking-the-ai-world-by-storm?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;DeepSeek V3&lt;/a&gt; and includes multiple steps: large-scale reinforcement learning, supervised &lt;a href="https://zilliz.com/glossary/fine-tuning?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;fine-tuning&lt;/a&gt;, and a curated dataset of &lt;a href="https://zilliz.com/glossary/chain-of-thoughts?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;chain-of-thought&lt;/a&gt; examples. What sets it apart is the combination of three key elements, which significantly improve efficiency and problem-solving depth: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mixture-of-Experts (MoE):&lt;/strong&gt; This technique dynamically selects a subset of specialized &lt;a href="https://zilliz.com/glossary/neural-networks?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;neural network&lt;/a&gt; "experts" for each input, reducing computation while improving efficiency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-Head Latent Attention (MLA)&lt;/strong&gt;: The core idea behind MLA is low-rank joint compression for attention keys and values, which helps optimize the Key-Value (KV) cache during inference.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-Token Prediction (MTP)&lt;/strong&gt;: Enables the model to predict multiple future &lt;a href="https://zilliz.com/glossary/tokenization?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;tokens&lt;/a&gt; at once.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fet4t90mt1xnus9iutb0c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fet4t90mt1xnus9iutb0c.png" width="800" height="646"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 7: Basic architecture of DeepSeek-V3 (&lt;/em&gt;&lt;a href="https://arxiv.org/pdf/2412.19437?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;&lt;em&gt;Source&lt;/em&gt;&lt;/a&gt;&lt;em&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4ecpamqw0d8tx4qauzl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4ecpamqw0d8tx4qauzl.png" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 8: Illustration of our Multi-Token Prediction (MTP) implementation (&lt;/em&gt;&lt;a href="https://arxiv.org/pdf/2412.19437?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;&lt;em&gt;Source&lt;/em&gt;&lt;/a&gt;&lt;em&gt;)&lt;/em&gt;&lt;/p&gt;
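A minimal sketch of the MoE routing idea described above: a softmax gate scores the experts and only the top-k are actually evaluated. The experts here are scalar toy functions standing in for feed-forward sub-networks, and the whole setup is invented for illustration; DeepSeek's actual router is far more elaborate:

```python
import math

def top_k_gate(gate_scores, k=2):
    """Softmax over expert scores, then keep only the top-k experts (renormalized)."""
    exps = [math.exp(s) for s in gate_scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

def moe_layer(x, experts, gate_scores, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    weights = top_k_gate(gate_scores, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Four toy "experts"; only the two with the highest gate scores are computed.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
out = moe_layer(3.0, experts, gate_scores=[0.1, 2.0, 1.5, -1.0], k=2)
```

Because only k of the experts run per input, compute grows with k rather than with the total number of experts, which is the efficiency gain MoE architectures exploit.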

&lt;p&gt;&lt;strong&gt;Training Data&lt;/strong&gt;: DeepSeek R1 was trained on a combination of two proprietary datasets that are not publicly available. One dataset adds reasoning capabilities, while the other enhances general-purpose tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reasoning&lt;/strong&gt;: cold start &lt;a href="https://zilliz.com/glossary/chain-of-thoughts?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;chain-of-thought &lt;/a&gt;data to &lt;a href="https://zilliz.com/glossary/fine-tuning?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;fine-tune&lt;/a&gt; DeepSeek V3. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Non-reasoning&lt;/strong&gt;: labeled data for the subsequent supervised &lt;a href="https://zilliz.com/glossary/fine-tuning?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;fine-tuning&lt;/a&gt; step to enhance general-purpose tasks such as writing, translation or factual QA.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Performance Benchmarks&lt;/strong&gt;: DeepSeek R1 excels in tasks requiring reasoning and deep analytical thinking, achieving 79.8% on AIME, 97.3% on MATH-500, and a 96.3 percentile rank on Codeforces. It also performed strongly in general knowledge benchmarks, with 90.8% on &lt;a href="https://zilliz.com/glossary/mmlu-benchmark?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;MMLU&lt;/a&gt; and 71.5% on GPQA Diamond.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1rhvtzee6j9nmy6kcmos.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1rhvtzee6j9nmy6kcmos.png" width="800" height="606"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 9: Performance Benchmarks DeepSeek (&lt;/em&gt;&lt;a href="https://arxiv.org/pdf/2412.19437?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;&lt;em&gt;Source&lt;/em&gt;&lt;/a&gt;&lt;em&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Cases and Application Areas&lt;/strong&gt;: Due to its strong performance in math and coding tasks, DeepSeek R1 is well suited for scientific research, software development, and academic education. Its open-source nature makes it accessible to a wide range of users and industries at a low cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Versions
&lt;/h2&gt;

&lt;p&gt;OpenAI has released three versions of o1 (preview, mini, and full) and one version of o3-mini, differing in &lt;a href="https://zilliz.com/glossary/context-window?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;context window&lt;/a&gt; size and maximum output tokens. Meanwhile, DeepSeek has introduced two DeepSeek R1 models. The initial Zero version represented their first training attempt, excelling in reasoning benchmarks but facing challenges with readability and language mixing. Additionally, DeepSeek has developed distilled models by &lt;a href="https://zilliz.com/glossary/fine-tuning?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;fine-tuning&lt;/a&gt; open-source models such as Qwen and Llama.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm9hl75tw2mj7mb5uet7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm9hl75tw2mj7mb5uet7.png" width="800" height="516"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 10: OpenAI o1 model versions (&lt;/em&gt;&lt;a href="https://platform.openai.com/docs/models#o1?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;&lt;em&gt;Source&lt;/em&gt;&lt;/a&gt;&lt;em&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59gtet25uxp4pmj4h6oi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59gtet25uxp4pmj4h6oi.png" width="800" height="204"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 11: OpenAI o3-mini model versions (&lt;/em&gt;&lt;a href="https://platform.openai.com/docs/models#o1?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;&lt;em&gt;Source&lt;/em&gt;&lt;/a&gt;&lt;em&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fose2lzlmtfezcbowjidh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fose2lzlmtfezcbowjidh.png" width="800" height="140"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 12: DeepSeek R1 model versions (&lt;/em&gt;&lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-R1?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;&lt;em&gt;Source&lt;/em&gt;&lt;/a&gt;&lt;em&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcrs73xp0pmsaqj7sf57m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcrs73xp0pmsaqj7sf57m.png" width="644" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 13: DeepSeek R1 distill model versions (&lt;/em&gt;&lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-R1?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;&lt;em&gt;Source&lt;/em&gt;&lt;/a&gt;&lt;em&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost Comparison
&lt;/h2&gt;

&lt;p&gt;Pricing is a crucial factor when selecting an AI model for specific use cases. Below is a comparison of the costs of OpenAI o1, OpenAI o1-mini, OpenAI o3-mini, and DeepSeek R1, including input and output token pricing per 1 million tokens, as well as cached input prices.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Cached Price&lt;/strong&gt; &lt;strong&gt;(per 1M tokens)&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Input Token Price (per 1M tokens)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Output Token Price&lt;/strong&gt; &lt;strong&gt;(per 1M tokens)&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenAI o1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$7.50&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;$60.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenAI o1-mini&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.55&lt;/td&gt;
&lt;td&gt;$1.10&lt;/td&gt;
&lt;td&gt;$4.40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenAI o3-mini&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.55&lt;/td&gt;
&lt;td&gt;$1.10&lt;/td&gt;
&lt;td&gt;$4.40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DeepSeek R1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.14&lt;/td&gt;
&lt;td&gt;$0.55&lt;/td&gt;
&lt;td&gt;$2.19&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This table highlights the cost differences between the models. DeepSeek R1 is significantly cheaper, with input and output tokens priced at roughly half the rate of OpenAI's most affordable models (o1-mini and o3-mini), making it the more cost-efficient and scalable option.&lt;/p&gt;
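&lt;p&gt;To make these prices concrete, the per-request arithmetic can be sketched directly. The prices below come from the table above; the workload (10k input tokens, 2k output tokens per request) is a hypothetical example, and cache discounts are ignored:&lt;/p&gt;

```python
# Price table (USD per 1M tokens), taken from the cost comparison above.
PRICES = {
    "openai-o1":      {"input": 15.00, "output": 60.00},
    "openai-o3-mini": {"input": 1.10,  "output": 4.40},
    "deepseek-r1":    {"input": 0.55,  "output": 2.19},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request, ignoring cached-input discounts."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Hypothetical workload: 10k input tokens, 2k output tokens per request.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 10_000, 2_000):.4f}")
# openai-o1:      $0.2700
# openai-o3-mini: $0.0198
# deepseek-r1:    $0.0099
```

&lt;p&gt;At this workload, a request to o1 costs roughly 27x a request to DeepSeek R1, and o3-mini sits about 2x above R1, matching the table's pricing ratios.&lt;/p&gt;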

&lt;h2&gt;
  
  
  Comparison Table: DeepSeek R1 vs. OpenAI o1 vs. OpenAI o3-mini
&lt;/h2&gt;

&lt;p&gt;To provide a clearer comparison, the table below outlines key features, performance benchmarks, areas of application, and cost considerations. OpenAI's key strength lies in the low latency of the o3-mini model, while DeepSeek R1 stands out for its cost efficiency.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Feature&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;OpenAI o1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;OpenAI o3-mini&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;DeepSeek R1&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Release Date&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dec 2024&lt;/td&gt;
&lt;td&gt;Jan 2025&lt;/td&gt;
&lt;td&gt;Jan 2025&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Transformer-based, chain-of-thought reasoning&lt;/td&gt;
&lt;td&gt;Transformer-based, chain-of-thought optimized for low latency&lt;/td&gt;
&lt;td&gt;Transformer-based, MoE, MLA, MTP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Training Data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Public + proprietary&lt;/td&gt;
&lt;td&gt;Public + proprietary&lt;/td&gt;
&lt;td&gt;Proprietary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Benchmarks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;83% (AIME), 89% (Codeforces), 79.7% (PhD-level STEM accuracy)&lt;/td&gt;
&lt;td&gt;87.3% (AIME), 79.7% (Science), 49.3% (Coding)&lt;/td&gt;
&lt;td&gt;79.8% (AIME), 97.3% (MATH-500), 96.3% (Codeforces)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Higher due to extended reasoning&lt;/td&gt;
&lt;td&gt;Lower latency, faster responses&lt;/td&gt;
&lt;td&gt;Moderate latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Use Cases&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Scientific research, software development, healthcare, education&lt;/td&gt;
&lt;td&gt;Financial analysis, real-time decision-making, software development, research&lt;/td&gt;
&lt;td&gt;Scientific research, academic education, software development&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Open-Source&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost Efficiency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High-cost proprietary model&lt;/td&gt;
&lt;td&gt;High-cost proprietary model&lt;/td&gt;
&lt;td&gt;Lower operational cost&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbnt2rdz5l4gmwr6jrjmg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbnt2rdz5l4gmwr6jrjmg.png" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 14: Benchmark performance comparison DeepSeek vs. OpenAI (&lt;/em&gt;&lt;a href="https://arxiv.org/pdf/2412.19437?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;&lt;em&gt;Source&lt;/em&gt;&lt;/a&gt;&lt;em&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Because of its open-source nature, DeepSeek R1 is available on multiple platforms, including Hugging Face and Ollama, while OpenAI models are integrated into various enterprise solutions and cloud platforms. However, OpenAI has not yet released its model weights, and users cannot download a &lt;a href="https://zilliz.com/glossary/fine-tuning?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;fine-tuned&lt;/a&gt; model from the OpenAI platform. DeepSeek R1 also features a curated list of integrations in the following &lt;a href="https://github.com/deepseek-ai/awesome-deepseek-integration/tree/main?utm_medium=referral&amp;amp;utm_channel=devto" rel="noopener noreferrer"&gt;repository&lt;/a&gt;, including LiteLLM, Langfuse, and Ragflow. Additionally, it can be found on AWS and Azure.&lt;/p&gt;
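&lt;p&gt;As a minimal sketch of what calling DeepSeek R1 through its hosted, OpenAI-compatible API looks like, the snippet below builds a chat-completion request with only the standard library. The endpoint URL and the &lt;code&gt;deepseek-reasoner&lt;/code&gt; model name follow DeepSeek's published API conventions, but verify them against the current DeepSeek documentation before use; the prompt and key are placeholders:&lt;/p&gt;

```python
import json
import urllib.request

# OpenAI-compatible chat endpoint (check DeepSeek's docs for the current URL).
API_URL = "https://api.deepseek.com/chat/completions"

def build_r1_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build a chat-completion request for DeepSeek R1 ("deepseek-reasoner")."""
    payload = {
        "model": "deepseek-reasoner",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )

# Sending the request requires a DeepSeek API key, e.g.:
# with urllib.request.urlopen(build_r1_request("Prove 2+2=4.", "sk-...")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

&lt;p&gt;Because the request shape mirrors OpenAI's chat API, existing OpenAI client code can typically be pointed at DeepSeek by swapping the base URL, model name, and key, which is what integrations such as LiteLLM rely on.&lt;/p&gt;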

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;As we move further into the age of AI, the competition between leading models like OpenAI's o1 and o3-mini, and newcomers like DeepSeek R1, is only going to intensify. OpenAI's models have proven their effectiveness in a wide variety of applications, especially where high-level reasoning, scalability, and robust performance are essential. Their innovation, speed, and advanced features set a high bar for the industry.&lt;/p&gt;

&lt;p&gt;However, DeepSeek R1 offers a compelling alternative, especially for those seeking to leverage advanced AI capabilities without the high operational costs associated with other models. Its open-source nature and impressive performance benchmarks make it an attractive option for organizations and developers looking for a cost-effective solution without sacrificing the depth of reasoning and problem-solving capabilities.&lt;/p&gt;

&lt;p&gt;Ultimately, the choice between these models comes down to your specific use case. If you're working on high-complexity tasks that demand rigorous, step-by-step reasoning, OpenAI o1 might be the best fit. If speed and cost-efficiency are more important, especially in STEM-related fields or financial applications, OpenAI o3-mini might be the right choice. For those seeking an open-source, budget-friendly solution that still delivers exceptional performance in tasks like math and software development, DeepSeek R1 presents an excellent alternative.&lt;/p&gt;

</description>
      <category>community</category>
    </item>
  </channel>
</rss>
