Chloe Williams for Zilliz

Posted on Jul 25, 2024

AI/ML Terms You May Not Know & Battle of the LLMs 🤔

#llm #ai #machinelearning #eventsinyourcity

In this issue:

AI/ML Terms You May Not Know
Which LLM Knows Vector Databases Best?
Voyage AI Embeddings and Rerankers for Search and RAG
ICYMI: Unstructured Data Talk Recaps from Around the World
Upcoming Events

AI/ML Terms You May Not Know

🔥Topic Modeling & BERTopic

Why is it relevant?

Data of every kind is produced constantly, so tools for navigating it have become increasingly important. BERTopic uses neural network-based techniques to uncover themes and patterns in large collections of text.

What is Topic Modeling? AI VS Bread 🤖🍞

Topic modeling: a method for unearthing the latent themes or "topics" within a collection of documents. It examines the text of these documents to detect patterns and relationships that indicate the presence of these topics. 🔎

🍞Example: a document focused on artificial intelligence will likely contain terms like "large language models" and "ChatGPT," unlike a document centered on baking bread.

What is BERTopic?

BERTopic: a topic modeling technique that simplifies the process of finding topics. It uses various embedding techniques and class-based TF-IDF (c-TF-IDF) to create dense clusters, producing easily interpretable topics while keeping important words in the topic descriptions. 💡

It approaches topic modeling in 4 steps on a high level:

▶️ Document embedding: Convert documents into embeddings using Bidirectional Encoder Representations from Transformers (BERT).

▶️ Dimensionality Reduction: Compress embeddings into a lower-dimensional space.

▶️ Clustering: Group these embeddings to gather similar documents in one category.

▶️ Topic Extraction: Extract topic names using a class-based variation of TF-IDF.
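To make the last step more concrete, here is a minimal sketch of class-based TF-IDF in plain Python. It is not BERTopic's implementation: it assumes the first three steps (embedding, dimensionality reduction, clustering) have already run, so the clusters are assigned by hand, and the names `c_tf_idf`, `top_terms`, and the sample documents are made up for illustration. Each cluster is treated as one large document, and a term's score is its normalized frequency in the cluster times log(1 + A / f), where A is the average number of words per cluster and f is the term's frequency across all clusters.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """Score terms per cluster. `clusters` maps a cluster id to a list of documents."""
    # Treat every cluster as one big document and count its terms.
    class_counts = {
        cid: Counter(word for doc in docs for word in doc.lower().split())
        for cid, docs in clusters.items()
    }

    # Frequency of each term across all clusters.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    # A: average number of words per cluster.
    avg_words = sum(total_counts.values()) / len(class_counts)

    scores = {}
    for cid, counts in class_counts.items():
        n_words = sum(counts.values())
        scores[cid] = {
            term: (tf / n_words) * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
    return scores


def top_terms(scores, n=2):
    """Return the n highest-scoring terms per cluster (ties broken alphabetically)."""
    return {
        cid: [t for t, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]
        for cid, s in scores.items()
    }


# Hand-assigned clusters standing in for the output of the clustering step.
clusters = {
    "ai": [
        "large language models power chatgpt",
        "chatgpt uses large language models",
    ],
    "bread": [
        "knead dough bake bread",
        "bake bread flour yeast",
    ],
}

topics = top_terms(c_tf_idf(clusters))
print(topics)
```

The highest-scoring terms in each cluster then serve as a readable topic description, in the spirit of the AI vs. bread example above.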

Read the full guide

Which LLM Knows Vector Databases Best?

We asked LLMs to explain what a vector database is to someone who isn’t familiar with it. Which LLM do you think gave the “winning” response? Let us know in the comments below and see who we picked! ⬇️

Here are the results from various LLMs using the same prompt:

Anthropic Claude (Claude 3.5 Sonnet):

ChatGPT (GPT-4o):

Vector databases are super smart librarians! 🤓

Voyage AI Embeddings and Rerankers for Search and RAG

Voyage AI provides embedding models customized for many domains to support efficient RAG. These models work with vector databases like Milvus by Zilliz, which store the vector embeddings and retrieve the ones most relevant to a query.

Zilliz partnered with Voyage AI to streamline the conversion of unstructured data into searchable vector embeddings on Zilliz Cloud. Voyage AI embedding models integrated in Zilliz Cloud Pipelines are voyage-code-2, voyage-law-2, and voyage-large-2-instruct. These models target specific domains: code, law, finance, and multilingual text.
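The store-and-retrieve idea can be sketched in a few lines of plain Python. This is a toy illustration only: `ToyVectorStore` is a hypothetical in-memory stand-in, not the Milvus or Zilliz Cloud API, and the three-dimensional vectors are written by hand rather than produced by a Voyage AI model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical in-memory store; a real vector database indexes vectors for scale."""

    def __init__(self):
        self._rows = []  # list of (text, vector) pairs

    def insert(self, text, vector):
        self._rows.append((text, vector))

    def search(self, query_vector, top_k=2):
        # Brute-force scan: score every stored vector against the query.
        scored = [(cosine_similarity(query_vector, v), t) for t, v in self._rows]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:top_k]]


store = ToyVectorStore()
# Hand-written vectors standing in for embeddings of code, legal, and finance text.
store.insert("def add(a, b): return a + b", [0.9, 0.1, 0.0])
store.insert("The lessee shall pay rent monthly.", [0.1, 0.9, 0.1])
store.insert("Quarterly revenue rose 4 percent.", [0.0, 0.2, 0.9])

# Stand-in for the embedding of a question about code.
query_vector = [0.8, 0.2, 0.1]
results = store.search(query_vector, top_k=1)
print(results)
```

In a RAG application, the retrieved text would then be passed to the LLM as context for generating the answer.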

See the step-by-step guide to using this integration and Cohere (the LLM) to build a RAG application: Read Blog

Watch the meetup talk with the Voyage AI CEO on “cutting-edge embeddings and rerankers for search and RAG”

ICYMI: Unstructured Data Talk Recaps from Around the World

We had a packed schedule in July 🌞 with fun AI events all over the world. See the recaps and register for next month’s events! ⬇️⬇️⬇️

Recap: July 16 SF @ GitHub

▶️ Garbage In, Garbage Out: Why poor data curation is killing your AI models (and how to fix it)

▶️ It's your unstructured data: How to get your GenAI app to production (and speed up your delivery)

▶️ Multimodal Embeddings

Recap: July 17 Berlin @ Google

▶️ Gemini with Advanced Function Calling

▶️ RAG using PGvector

▶️ Docker tools & Open Source libraries for productivity

▶️ How to Load Test an LLM API using Gatling

▶️ Improving Analytics with Time Series and Vector Databases

▶️ Scaling Vector Search: How Milvus Handles Billions+

Save your spot for August events

Upcoming Events

August 1: Unstructured Data Processing from Cloud to Edge (virtual)

Learn why you should add a Cloud Native Vector Database to your Data and AI platform. Tim Spann will cover a quick introduction to Milvus, vector databases, and unstructured data processing. By adding Milvus to your architecture, you can scale out and improve your AI use cases through RAG, real-time search, multimodal search, recommendation engines, fraud detection, and many more emerging use cases.

Save Your Spot

August 2-4: AI-focused Touring Hackathon & Tech Fair (in-person)

🌟 TechStars Startup Weekend Personal AI Hackathon: Fri-Sat with the pitches 🗣️ and a Tech Fair 🎪 on Sunday showcasing local AI startups.

Christy Bergman will be there on Friday to answer your Milvus questions and kickstart the hackathon!

August 5: San Francisco Unstructured Data Meetup (in-person)

Join us in San Francisco for a meetup on August 5! There will be food, refreshments, networking, and cool AI talks.

▶️ Using Ray Data for Multimodal Embedding Inference with Christy Bergman, Developer Advocate at Zilliz

▶️ Building the Future of Neural Search: How to Train State-of-the-Art Embeddings with mixedbread.ai

▶️ A Different Angle: Retrieval Optimized Embedding Models with marqo.ai

Save your spot: https://lu.ma/3q2brqp8

August 8: Building an Agentic RAG locally with Milvus, Ollama, and Llama Agents (virtual)

With the recent release of Llama Agents, we can now build agents that are async-first and run as their own service. During this webinar, Stephen will show you how to build an Agentic RAG system using Llama Agents and Milvus.

Save Your Spot

August 13: South Bay Unstructured Data Meetup (in-person)