DEV Community

MindsDB Team for MindsDB

Posted on • Originally published at mindsdb.com

Building a Semantic Search Knowledge Base with MindsDB

Written by Andriy Burkov, Ph.D. & Author, MindsDB Advisor

What happens when a developer searches for "how to make async HTTP calls" but your documentation says "asynchronous network requests"? Traditional keyword search fails—even though the content is exactly what they need.

This is the fundamental limitation of keyword search: it matches words, not meaning.

In this tutorial, we'll build a semantic search system using MindsDB that understands user intent. Using 2 million Stack Overflow posts, we'll create knowledge bases with two different vector storage backends—pgvector and FAISS—and compare their performance.

What You'll Learn:

  • How MindsDB knowledge bases convert text into searchable vectors
  • Setting up pgvector (PostgreSQL-based) and FAISS (Facebook AI Similarity Search) storage
  • Combining semantic search with metadata filters
  • Building an AI agent that uses your knowledge base to answer questions

Prerequisites:

  • A MindsDB account (cloud or self-hosted)
  • PostgreSQL database with the Stack Overflow dataset
  • An OpenAI API key for embeddings

How Semantic Search Works

Before we dive in, let's understand the key difference between keyword and semantic search:

Aspect Keyword Search Semantic Search
Matching Exact words Meaning/intent
Query: "async HTTP" Misses "asynchronous requests" Finds both
Handles synonyms
Understands context

Semantic search works by:

  1. Embedding: Converting text into numerical vectors using an embedding model
  2. Storing: Saving these vectors in a vector database
  3. Querying: Converting the search query to a vector and finding the closest matches

MindsDB handles all of this through its Knowledge Base abstraction.

Installing Dependencies

We need two packages:

  • mindsdb_sdk: Python client for interacting with MindsDB servers
  • pandas: For working with query results as DataFrames
!pip install mindsdb_sdk pandas
Enter fullscreen mode Exit fullscreen mode

3. Connecting to the MindsDB Cloud Instance

import mindsdb_sdk

# Connect to your MindsDB instance
server = mindsdb_sdk.connect(
    'YOUR_MINDSDB_URL',  # e.g., 'https://cloud.mindsdb.com' for MindsDB Cloud
    login='YOUR_USERNAME',
    password='YOUR_PASSWORD'
)
print("Connected to MindsDB server")
Enter fullscreen mode Exit fullscreen mode
Connected to MindsDB server
Enter fullscreen mode Exit fullscreen mode

4. Connecting to the Data Source

def run_query(sql, success_msg="Query executed successfully"):
    """Execute a SQL query and handle 'already exists' errors gracefully."""
    try:
        result = server.query(sql).fetch()
        print(success_msg)
        return result
    except RuntimeError as e:
        if "already exists" in str(e).lower():
            print("Resource already exists - skipping")
        else:
            raise
        return None

# Connect to your PostgreSQL database containing Stack Overflow data
run_query("""
    CREATE DATABASE pg_sample
    WITH ENGINE = "postgres",
    PARAMETERS = {
        "user": "YOUR_PG_USER",
        "password": "YOUR_PG_PASSWORD",
        "host": "YOUR_PG_HOST",
        "port": "5432",
        "database": "sample"
    }
""", "Created pg_sample database connection")
Enter fullscreen mode Exit fullscreen mode
Created pg_sample database connection
Enter fullscreen mode Exit fullscreen mode

Let's verify the connection by exploring the data. Check the dataset size:

# Get total row count
count = server.query("SELECT COUNT(*) as cnt FROM pg_sample.stackoverflow_2m").fetch()
print(f"Dataset size: {count['cnt'].iloc[0]:,} rows")
Enter fullscreen mode Exit fullscreen mode
Dataset size: 2,000,000 rows
Enter fullscreen mode Exit fullscreen mode

Show 10 records:

# Test sample data
df = server.query("SELECT * FROM pg_sample.stackoverflow_2m LIMIT 10").fetch()

# Display as a nice table (in Jupyter notebooks)
from IPython.display import display
display(df)
Enter fullscreen mode Exit fullscreen mode
.dataframe tbody tr th:only-of-type {
    vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
Enter fullscreen mode Exit fullscreen mode
id PostTypeId AcceptedAnswerId ParentId Score ViewCount Body Title ContentLicense FavoriteCount CreationDate LastActivityDate LastEditDate LastEditorUserId OwnerUserId Tags
0 7 2 NaN 4.0 522 NaN An explicit cast to `double` like this isn't n... None CC BY-SA 4.0 NaN 2008-07-31T22:17:57.883 2019-10-21T14:03:54.607 2019-10-21T14:03:54.607 5496973.0 9.0 None
1 9 1 1404.0 NaN 2199 784860.0 Given a `DateTime` representing a person's bir... How do I calculate someone's age based on a Da... CC BY-SA 4.0 0.0 2008-07-31T23:40:59.743 2023-02-02T18:38:32.613 2022-07-27T22:34:36.320 3524942.0 1.0 c#,.net,datetime
2 11 1 1248.0 NaN 1644 197314.0 Given a specific `DateTime` value, how do I di... Calculate relative time in C# CC BY-SA 4.0 0.0 2008-07-31T23:55:37.967 2022-09-05T11:26:30.187 2022-07-10T00:19:55.237 16790137.0 1.0 c#,datetime,time,datediff,relative-time-span
3 14 1 NaN NaN 491 173083.0 What is the difference between [Math.Floor()](... Difference between Math.Floor() and Math.Trunc... CC BY-SA 3.0 0.0 2008-08-01T00:59:11.177 2022-04-22T08:59:43.817 2017-02-25T17:42:17.810 6495084.0 11.0 .net,math
4 6 1 31.0 NaN 319 23465.0 I have an absolutely positioned `div` containi... Why did the width collapse in the percentage w... CC BY-SA 4.0 0.0 2008-07-31T22:08:08.620 2021-01-29T18:46:45.963 2021-01-29T18:46:45.963 9134576.0 9.0 html,css,internet-explorer-7
5 12 2 NaN 11.0 347 NaN Here's how I do it\n\n ```\nvar ts = new TimeSp... None CC BY-SA 4.0 NaN 2008-07-31T23:56:41.303 2020-06-13T10:30:44.397 2020-06-13T10:30:44.397 238419.0 1.0 None
6 13 1 NaN NaN 701 277780.0 Is there a standard way for a web server to be... Determine a user's timezone CC BY-SA 4.0 0.0 2008-08-01T00:42:38.903 2022-03-29T07:31:31.320 2020-12-03T03:37:56.313 584192.0 9.0 html,browser,timezone,user-agent,timezone-offset
7 4 1 7.0 NaN 794 70633.0 I want to assign the decimal variable "trans" ... How to convert Decimal to Double in C#? CC BY-SA 4.0 0.0 2008-07-31T21:42:52.667 2022-09-08T05:07:26.033 2022-09-08T05:07:26.033 16124033.0 8.0 c#,floating-point,type-conversion,double,decimal
8 17 1 26.0 NaN 198 85547.0 How do I store binary data in [MySQL](http://e... Binary Data in MySQL CC BY-SA 3.0 0.0 2008-08-01T05:09:55.993 2020-12-03T03:37:51.763 2020-12-03T03:37:51.763 584192.0 2.0 mysql,database,binary-data,data-storage
9 24 1 49.0 NaN 193 101180.0 If I have a trigger before the update on a tab... Throw an error preventing a table update in a ... CC BY-SA 4.0 0.0 2008-08-01T12:12:19.350 2021-01-29T12:57:17.153 2021-01-29T12:57:17.153 14152908.0 22.0 mysql,database,triggers

The Stack Overflow dataset contains 2 million posts—both questions (PostTypeId=1) and answers (PostTypeId=2). Key columns include:

  • Id - Unique identifier for each post
  • Body - The content we'll make semantically searchable
  • Title - The title of the post (questions only)
  • Tags - Programming language and topic tags (e.g., python, javascript)
  • Score - Community voting score—useful for prioritizing high-quality content
  • ViewCount - Popularity metric for filtering
  • PostTypeId - Type of post (1=question, 2=answer)
  • AcceptedAnswerId - ID of the accepted answer (for questions)
  • CreationDate, LastActivityDate, LastEditDate - Timestamps

This rich metadata allows us to combine semantic understanding with traditional filters—for example, finding Python questions about async programming with a score above 10.

4. Setting Up Vector Storage Backends

MindsDB supports multiple vector storage options. We'll set up both pgvector and a recently added FAISS and will compare how quick they are.

pgvector (PostgreSQL Extension)

pgvector is a PostgreSQL extension for vector similarity search. It's ideal when you want to keep vectors alongside your relational data.


python
# Create pgvector database connection
run_query("""
    CREATE DATABASE pg_vector
    WITH ENGINE = "pgvector",
    PARAMETERS = {
        "user": "YOUR_PG_USER",
        "password": "YOUR_PG_PASSWORD",
        "host": "YOUR_PG_HOST",
        "port": "5432",
        "database": "vector"
    }
""", "Created pg_vector database connection")


Enter fullscreen mode Exit fullscreen mode
Created pg_vector database connection
Enter fullscreen mode Exit fullscreen mode

FAISS (Facebook AI Similarity Search)

FAISS is a library for efficient similarity search developed by Facebook AI Research. It's optimized for fast similarity search on large datasets.


python
# Create FAISS database connection
run_query("""
    CREATE DATABASE db_faiss
    WITH ENGINE = 'duckdb_faiss',
    PARAMETERS = {
        "persist_directory": "/home/ubuntu/faiss"
    }
""", "Created db_faiss database connection")


Enter fullscreen mode Exit fullscreen mode
Created db_faiss database connection
Enter fullscreen mode Exit fullscreen mode

Choosing Between pgvector and FAISS

Feature pgvector FAISS
Best for Integration with existing PostgreSQL Maximum query speed
Persistence Native PostgreSQL storage File-based
Scalability Good (PostgreSQL limits) Excellent (billions of vectors)
Setup complexity Requires PostgreSQL extension Standalone
Query speed Good (~19s for 2M vectors) Excellent (~5s for 2M vectors)

For this tutorial, we'll implement both so you can see the performance difference firsthand.

5. Creating Knowledge Bases

Now we have a table with relational data and two vector stores to keep the embedding vectors. We are ready to create knowledge bases using both storage backends.

The knowledge base will:

  • Use OpenAI's text-embedding-3-small model for generating embeddings
  • Store the post Body as searchable content
  • Include metadata fields for filtering results

Knowledge Base with pgvector Storage


python
def kb_exists(kb_name):
    """Check if a knowledge base already exists."""
    try:
        result = server.query("SELECT name FROM information_schema.knowledge_bases").fetch()
        return kb_name in result['name'].values
    except Exception:
        return False

# Create pgvector knowledge base
if kb_exists("kb_stack_vector"):
    print("kb_stack_vector already exists - skipping creation")
else:
    run_query("""
        CREATE KNOWLEDGE_BASE kb_stack_vector
        USING
            storage = pg_vector.stack,
            embedding_model = {
                "provider": "openai",
                "model_name": "text-embedding-3-small"
            },
            content_columns = ['Body'],
            metadata_columns = [
                "PostTypeId",
                "AcceptedAnswerId",
                "ParentId",
                "Score",
                "ViewCount",
                "Title",
                "ContentLicense",
                "FavoriteCount",
                "CreationDate",
                "LastActivityDate",
                "LastEditDate",
                "LastEditorUserId",
                "OwnerUserId",
                "Tags"
            ]
    """, "Created kb_stack_vector knowledge base")


Enter fullscreen mode Exit fullscreen mode
Created kb_stack_vector knowledge base
Enter fullscreen mode Exit fullscreen mode

Knowledge Base with FAISS Storage


python
# Create FAISS knowledge base
if kb_exists("kb_stack_faiss"):
    print("kb_stack_faiss already exists - skipping creation")
else:
    run_query("""
        CREATE KNOWLEDGE_BASE kb_stack_faiss
        USING
            storage = db_faiss.stack,
            embedding_model = {
                "provider": "openai",
                "model_name": "text-embedding-3-small"
            },
            content_columns = ['Body'],
            metadata_columns = [
                "PostTypeId",
                "AcceptedAnswerId",
                "ParentId",
                "Score",
                "ViewCount",
                "Title",
                "ContentLicense",
                "FavoriteCount",
                "CreationDate",
                "LastActivityDate",
                "LastEditDate",
                "LastEditorUserId",
                "OwnerUserId",
                "Tags"
            ]
    """, "Created kb_stack_faiss knowledge base")


Enter fullscreen mode Exit fullscreen mode
Created kb_stack_faiss knowledge base
Enter fullscreen mode Exit fullscreen mode

Understanding the Parameters

Parameter Description
storage Specifies the vector database connection and table name
embedding_model Configuration for the embedding model (provider and model name)
content_columns Columns to embed and make semantically searchable
metadata_columns Columns available for filtering (not embedded, but stored)

6. Loading Data into Knowledge Bases

Now we'll insert the Stack Overflow data into our knowledge bases. This process:

  1. Fetches data from the source table in batches
  2. Generates embeddings for content columns using the OpenAI API
  3. Stores vectors and metadata in the vector database

Loading Data into pgvector Knowledge Base


python
def is_kb_empty(kb_name):
    """Check if a knowledge base is empty (fast - only fetches 1 row)."""
    result = server.query(f"SELECT id FROM {kb_name} LIMIT 1").fetch()
    return len(result) == 0

if is_kb_empty("kb_stack_vector"):
    print("kb_stack_vector is empty - starting data insertion...")
    server.query("""
        INSERT INTO kb_stack_vector
        SELECT * FROM pg_sample.stackoverflow_2m 
        USING 
            batch_size = 1000, 
            track_column = id
    """).fetch()
    print("Data insertion started for kb_stack_vector")
else:
    print("kb_stack_vector is not empty - skipping data insertion")


Enter fullscreen mode Exit fullscreen mode
Data insertion started for kb_stack_vector
Enter fullscreen mode Exit fullscreen mode

Loading Data into FAISS Knowledge Base


python
if is_kb_empty("kb_stack_faiss"):
    print("kb_stack_faiss is empty - starting data insertion...")
    server.query("""
        INSERT INTO kb_stack_faiss
        SELECT * FROM pg_sample.stackoverflow_2m 
        USING 
            batch_size = 1000, 
            track_column = id
    """).fetch()
    print("Data insertion started for kb_stack_faiss")
else:
    print("kb_stack_faiss is not empty - skipping data insertion")


Enter fullscreen mode Exit fullscreen mode
Data insertion started for kb_stack_faiss
Enter fullscreen mode Exit fullscreen mode

Wait until the data insertion is complete.

7. Querying the Knowledge Bases

Once data is loaded, you can perform semantic searches combined with metadata filtering.

Basic Semantic Search

Search for content related to "8-bit music" (finds semantically similar content):


python
import time

# Semantic search on pgvector KB
start = time.time()
results_vector = server.query("""
    SELECT * FROM kb_stack_vector 
    WHERE content = '8-bit music'
    AND Tags LIKE '%python%'
    LIMIT 10
""").fetch()
elapsed_vector = time.time() - start
print(f"pgvector query time: {elapsed_vector:.2f} seconds")
display(results_vector)

# Semantic search on FAISS KB
start = time.time()
results_faiss = server.query("""
    SELECT * FROM kb_stack_faiss 
    WHERE content = '8-bit music'
    AND Tags LIKE '%python%'
    LIMIT 10
""").fetch()
elapsed_faiss = time.time() - start
print(f"FAISS query time: {elapsed_faiss:.2f} seconds")
display(results_faiss)


Enter fullscreen mode Exit fullscreen mode
pgvector query time: 19.21 seconds
Enter fullscreen mode Exit fullscreen mode
.dataframe tbody tr th:only-of-type {
    vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
Enter fullscreen mode Exit fullscreen mode
id chunk_id chunk_content distance relevance ContentLicense ViewCount LastEditDate Score AcceptedAnswerId OwnerUserId LastActivityDate Tags LastEditorUserId PostTypeId ParentId Title FavoriteCount CreationDate metadata
0 1118266 1118266:Body:1of2:0to971 Im trying to engineer in python a way of trans... 0.605447 0.622879 CC BY-SA 2.5 1694.0 2009-07-13T08:32:20.797 0 NaN NaN 2010-03-17T15:16:17.060 python,audio 12855.0 1 None List of values to a sound file NaN 2009-07-13T08:27:25.393 {'Tags': 'python,audio', 'Score': 0, 'Title': ...
1 974071 974071:Body:1of1:0to791 I have a mosquito problem in my house. This wo... 0.615257 0.619097 CC BY-SA 2.5 55695.0 2017-05-23T12:32:21.507 44 974291.0 51197.0 2020-02-12T22:24:39.977 python,audio,mp3,frequency -1.0 1 None Python library for playing fixed-frequency sound 0.0 2009-06-10T07:05:02.037 {'Tags': 'python,audio,mp3,frequency', 'Score'...
2 1967040 1967040:Body:1of1:0to224 I am confused because there are a lot of progr... 0.626904 0.614665 CC BY-SA 2.5 6615.0 None 7 1968691.0 237934.0 2021-08-10T10:40:59.217 python,audio NaN 1 None How can i create a melody? Is there any sound-... 0.0 2009-12-27T21:04:34.243 {'Tags': 'python,audio', 'Score': 7, 'Title': ...
3 1118266 1118266:Body:2of2:972to1430 The current solution I'm thinking of involves ... 0.627442 0.614461 CC BY-SA 2.5 1694.0 2009-07-13T08:32:20.797 0 NaN NaN 2010-03-17T15:16:17.060 python,audio 12855.0 1 None List of values to a sound file NaN 2009-07-13T08:27:25.393 {'Tags': 'python,audio', 'Score': 0, 'Title': ...
4 1344884 1344884:Body:1of1:0to327 I want to learn how to program a music applica... 0.643957 0.608289 CC BY-SA 2.5 2205.0 2017-05-23T12:11:22.607 7 1346272.0 164623.0 2022-04-14T09:12:07.197 python,perl,waveform -1.0 1 None Programming a Self Learning Music Maker 0.0 2009-08-28T03:28:03.937 {'Tags': 'python,perl,waveform', 'Score': 7, '...
5 2376505 2376505:Body:1of2:0to968 Write a function called listenToPicture that t... 0.645214 0.607824 CC BY-SA 2.5 3058.0 2010-03-04T02:28:26.703 0 NaN 285922.0 2010-03-06T05:27:48.017 python,image,audio 34397.0 1 None How do I loop through every 4th pixel in every... NaN 2010-03-04T02:26:22.603 {'Tags': 'python,image,audio', 'Score': 0, 'Ti...
6 2226853 2226853:Body:1of1:0to877 I'm trying to write a program to display PCM d... 0.654162 0.604536 CC BY-SA 2.5 12425.0 None 7 2226907.0 210920.0 2015-07-25T11:16:16.747 python,audio,pcm NaN 1 None Interpreting WAV Data 0.0 2010-02-09T05:01:25.703 {'Tags': 'python,audio,pcm', 'Score': 7, 'Titl...
7 1561104 1561104:Body:1of1:0to306 Is there a way to do this? Also, I need this t... 0.668074 0.599494 CC BY-SA 2.5 1303.0 2020-06-20T09:12:55.060 1 1561314.0 151377.0 2012-01-29T00:01:18.230 python,pygame,pitch -1.0 1 None Playing sounds with python and changing their ... NaN 2009-10-13T15:44:54.267 {'Tags': 'python,pygame,pitch', 'Score': 1, 'T...
8 1382998 1382998:Body:4of4:2649to3382 ``` \n¼ éíñ§ÐÌëÑ » ¼ ö ® © ’\n0 1\n2 10\n3 10\n... 0.670654 0.598568 CC BY-SA 3.0 12497.0 2011-06-09T06:00:51.243 18 1383721.0 6946.0 2015-06-04T17:13:43.323 python,unicode 6946.0 1 None latin-1 to ascii 0.0 2009-09-05T10:44:40.167 {'Tags': 'python,unicode', 'Score': 18, 'Title...
9 1837686 1837686:Body:1of2:0to950 I wish to take a file encoded in UTF-8 that do... 0.675999 0.596659 CC BY-SA 3.0 3016.0 2011-10-15T13:17:24.520 2 NaN NaN 2011-10-15T13:17:24.520 python,c,utf-8,compression 12113.0 1 None Compressing UTF-8(or other 8-bit encoding) to ... 0.0 2009-12-03T04:43:05.963 {'Tags': 'python,c,utf-8,compression', 'Score'...
FAISS query time: 5.04 seconds
Enter fullscreen mode Exit fullscreen mode
.dataframe tbody tr th:only-of-type {
    vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
Enter fullscreen mode Exit fullscreen mode
id distance chunk_id chunk_content relevance ContentLicense ViewCount LastEditDate Score AcceptedAnswerId OwnerUserId ParentId LastEditorUserId LastActivityDate Tags PostTypeId FavoriteCount Title CreationDate metadata
0 1118266 0.605468 1118266:Body:1of2:0to971 Im trying to engineer in python a way of trans... 0.622871 CC BY-SA 2.5 1694.0 2009-07-13T08:32:20.797 0 NaN NaN None 12855.0 2010-03-17T15:16:17.060 python,audio 1 NaN List of values to a sound file 2009-07-13T08:27:25.393 {'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
1 974071 0.615225 974071:Body:1of1:0to791 I have a mosquito problem in my house. This wo... 0.619109 CC BY-SA 2.5 55695.0 2017-05-23T12:32:21.507 44 974291.0 51197.0 None -1.0 2020-02-12T22:24:39.977 python,audio,mp3,frequency 1 0.0 Python library for playing fixed-frequency sound 2009-06-10T07:05:02.037 {'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
2 1967040 0.626923 1967040:Body:1of1:0to224 I am confused because there are a lot of progr... 0.614657 CC BY-SA 2.5 6615.0 None 7 1968691.0 237934.0 None NaN 2021-08-10T10:40:59.217 python,audio 1 0.0 How can i create a melody? Is there any sound-... 2009-12-27T21:04:34.243 {'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
3 1118266 0.627461 1118266:Body:2of2:972to1430 The current solution I'm thinking of involves ... 0.614454 CC BY-SA 2.5 1694.0 2009-07-13T08:32:20.797 0 NaN NaN None 12855.0 2010-03-17T15:16:17.060 python,audio 1 NaN List of values to a sound file 2009-07-13T08:27:25.393 {'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
4 1344884 0.643955 1344884:Body:1of1:0to327 I want to learn how to program a music applica... 0.608289 CC BY-SA 2.5 2205.0 2017-05-23T12:11:22.607 7 1346272.0 164623.0 None -1.0 2022-04-14T09:12:07.197 python,perl,waveform 1 0.0 Programming a Self Learning Music Maker 2009-08-28T03:28:03.937 {'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
5 2376505 0.645192 2376505:Body:1of2:0to968 Write a function called listenToPicture that t... 0.607832 CC BY-SA 2.5 3058.0 2010-03-04T02:28:26.703 0 NaN 285922.0 None 34397.0 2010-03-06T05:27:48.017 python,image,audio 1 NaN How do I loop through every 4th pixel in every... 2010-03-04T02:26:22.603 {'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
6 2226853 0.654112 2226853:Body:1of1:0to877 I'm trying to write a program to display PCM d... 0.604554 CC BY-SA 2.5 12425.0 None 7 2226907.0 210920.0 None NaN 2015-07-25T11:16:16.747 python,audio,pcm 1 0.0 Interpreting WAV Data 2010-02-09T05:01:25.703 {'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
7 1561104 0.668055 1561104:Body:1of1:0to306 Is there a way to do this? Also, I need this t... 0.599501 CC BY-SA 2.5 1303.0 2020-06-20T09:12:55.060 1 1561314.0 151377.0 None -1.0 2012-01-29T00:01:18.230 python,pygame,pitch 1 NaN Playing sounds with python and changing their ... 2009-10-13T15:44:54.267 {'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
8 1382998 0.670668 1382998:Body:4of4:2649to3382 ```\n¼ éíñ§ÐÌëÑ » ¼ ö ® © ’\n0 1\n2 10\n3 10\n... 0.598563 CC BY-SA 3.0 12497.0 2011-06-09T06:00:51.243 18 1383721.0 6946.0 None 6946.0 2015-06-04T17:13:43.323 python,unicode 1 0.0 latin-1 to ascii 2009-09-05T10:44:40.167 {'ContentLicense': 'CC BY-SA 3.0', 'LastActivi...
9 1837686 0.675986 1837686:Body:1of2:0to950 I wish to take a file encoded in UTF-8 that do... 0.596664 CC BY-SA 3.0 3016.0 2011-10-15T13:17:24.520 2 NaN NaN None 12113.0 2011-10-15T13:17:24.520 python,c,utf-8,compression 1 0.0 Compressing UTF-8(or other 8-bit encoding) to ... 2009-12-03T04:43:05.963 {'ContentLicense': 'CC BY-SA 3.0', 'LastActivi...

Analyzing the Results

Notice how the search for "8-bit music" returned posts about:

  • Converting values to sound files
  • Playing fixed-frequency sounds
  • Creating melodies programmatically

None of these posts contain the exact phrase "8-bit music," yet they're all semantically relevant to chiptune/retro audio generation. This is the power of semantic search.

Also note the 4x speed improvement with FAISS (5 seconds vs 19 seconds for pgvector). For production systems with high query volumes, this difference is significant.

Combined Semantic and Metadata Filtering

Find AJAX-related posts tagged with jQuery that have high view counts:


python
# pgvector: Semantic search with metadata filters
start = time.time()
results = server.query("""
    SELECT * FROM kb_stack_vector 
    WHERE content = 'ajax'
        AND Tags LIKE '%jquery%'
        AND ViewCount > 1000.0
        AND relevance > 0.6
    LIMIT 10
""").fetch()
print(f"pgvector query time: {time.time() - start:.2f} seconds")
display(results)

# FAISS: Semantic search with metadata filters
start = time.time()
results = server.query("""
    SELECT * FROM kb_stack_faiss 
    WHERE content = 'ajax'
        AND Tags LIKE '%jquery%'
        AND ViewCount > 1000.0
        AND relevance > 0.6
    LIMIT 10
""").fetch()
print(f"FAISS query time: {time.time() - start:.2f} seconds")
display(results)


Enter fullscreen mode Exit fullscreen mode
pgvector query time: 5.76 seconds
Enter fullscreen mode Exit fullscreen mode
.dataframe tbody tr th:only-of-type {
    vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
Enter fullscreen mode Exit fullscreen mode
id chunk_id chunk_content distance relevance ContentLicense ViewCount LastEditDate Score AcceptedAnswerId OwnerUserId LastActivityDate Tags LastEditorUserId PostTypeId ParentId Title FavoriteCount CreationDate metadata
0 1400637 1400637:Body:28of32:25627to26627 o.ajax({type:"POST",url:E,data:G,success:H,dat... 0.427265 0.700641 CC BY-SA 2.5 2741.0 2009-09-09T16:16:59.430 2 1400656.0 107129.0 2013-08-05T16:07:54.400 javascript,jquery 8590.0 1 None Stop reload for ajax submitted form NaN 2009-09-09T16:12:46.057 {'Tags': 'javascript,jquery', 'Score': 2, 'Tit...
1 1400637 1400637:Body:30of32:27488to28356 O=false;T.onload=T.onreadystatechange=function... 0.453764 0.687870 CC BY-SA 2.5 2741.0 2009-09-09T16:16:59.430 2 1400656.0 107129.0 2013-08-05T16:07:54.400 javascript,jquery 8590.0 1 None Stop reload for ajax submitted form NaN 2009-09-09T16:12:46.057 {'Tags': 'javascript,jquery', 'Score': 2, 'Tit...
2 1400637 1400637:Body:27of32:24691to25626 rn this},serialize:function(){return o.param(t... 0.454629 0.687460 CC BY-SA 2.5 2741.0 2009-09-09T16:16:59.430 2 1400656.0 107129.0 2013-08-05T16:07:54.400 javascript,jquery 8590.0 1 None Stop reload for ajax submitted form NaN 2009-09-09T16:12:46.057 {'Tags': 'javascript,jquery', 'Score': 2, 'Tit...
3 1424774 1424774:Body:2of2:934to1745 var self = this;\n $.ajax({\n ... 0.461486 0.684235 CC BY-SA 2.5 3601.0 None 1 1426940.0 173350.0 2020-06-08T10:43:45.037 jquery,loops NaN 1 None Loop with 8 times NaN 2009-09-15T02:02:58.927 {'Tags': 'jquery,loops', 'Score': 1, 'Title': ...
4 1400637 1400637:Body:31of32:28357to29238 N=function(X){if(J.readyState==0){if(P){clearI... 0.462191 0.683905 CC BY-SA 2.5 2741.0 2009-09-09T16:16:59.430 2 1400656.0 107129.0 2013-08-05T16:07:54.400 javascript,jquery 8590.0 1 None Stop reload for ajax submitted form NaN 2009-09-09T16:12:46.057 {'Tags': 'javascript,jquery', 'Score': 2, 'Tit...
5 546344 546344:Body:2of3:902to1764 var before = function() { $(loading).show() ;... 0.463258 0.683407 CC BY-SA 2.5 1463.0 2009-02-13T16:17:38.170 0 546642.0 2755.0 2009-02-13T16:37:59.867 javascript,jquery,ajax 2755.0 1 None Using jQuery, how can I store the result of a ... 0.0 2009-02-13T15:25:00.963 {'Tags': 'javascript,jquery,ajax', 'Score': 0,...
6 1279625 1279625:Body:2of3:782to1754 ``` \n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML ... 0.468882 0.680790 CC BY-SA 3.0 1130.0 2016-12-03T07:00:58.213 0 1279881.0 58375.0 2016-12-03T07:00:58.213 events,jquery,getjson 6637668.0 1 None Trouble with jQuery Ajax timing NaN 2009-08-14T19:06:28.043 {'Tags': 'events,jquery,getjson', 'Score': 0, ...
7 1400637 1400637:Body:32of32:29239to30048 L(){if(M.complete){M.complete(J,R)}if(M.global... 0.468944 0.680761 CC BY-SA 2.5 2741.0 2009-09-09T16:16:59.430 2 1400656.0 107129.0 2013-08-05T16:07:54.400 javascript,jquery 8590.0 1 None Stop reload for ajax submitted form NaN 2009-09-09T16:12:46.057 {'Tags': 'javascript,jquery', 'Score': 2, 'Tit...
8 1775625 1775625:Body:5of9:3144to4049 }\n\n}\n</script>\n\n\n\n<script type=... 0.472723 0.679014 CC BY-SA 2.5 2100.0 2009-11-21T14:46:00.250 1 1776406.0 212889.0 2009-11-21T19:03:52.070 jquery,form-submit 212889.0 1 None jQuery - Multiple form submission trigger unre... 0.0 2009-11-21T14:32:41.383 {'Tags': 'jquery,form-submit', 'Score': 1, 'Ti...
9 1400637 1400637:Body:26of32:23690to24690 nclick")}o(function(){var L=document.createEle... 0.477784 0.676689 CC BY-SA 2.5 2741.0 2009-09-09T16:16:59.430 2 1400656.0 107129.0 2013-08-05T16:07:54.400 javascript,jquery 8590.0 1 None Stop reload for ajax submitted form NaN 2009-09-09T16:12:46.057 {'Tags': 'javascript,jquery', 'Score': 2, 'Tit...
FAISS query time: 2.50 seconds
Enter fullscreen mode Exit fullscreen mode
.dataframe tbody tr th:only-of-type {
    vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
Enter fullscreen mode Exit fullscreen mode
id distance chunk_id chunk_content relevance ContentLicense ViewCount LastEditDate Score AcceptedAnswerId OwnerUserId ParentId LastEditorUserId LastActivityDate Tags PostTypeId FavoriteCount Title CreationDate metadata
0 1400637 0.427243 1400637:Body:28of32:25627to26627 o.ajax({type:"POST",url:E,data:G,success:H,dat... 0.700651 CC BY-SA 2.5 2741.0 2009-09-09T16:16:59.430 2 1400656.0 107129.0 None 8590.0 2013-08-05T16:07:54.400 javascript,jquery 1 NaN Stop reload for ajax submitted form 2009-09-09T16:12:46.057 {'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
1 1400637 0.453769 1400637:Body:30of32:27488to28356 O=false;T.onload=T.onreadystatechange=function... 0.687867 CC BY-SA 2.5 2741.0 2009-09-09T16:16:59.430 2 1400656.0 107129.0 None 8590.0 2013-08-05T16:07:54.400 javascript,jquery 1 NaN Stop reload for ajax submitted form 2009-09-09T16:12:46.057 {'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
2 1400637 0.454589 1400637:Body:27of32:24691to25626 rn this},serialize:function(){return o.param(t... 0.687479 CC BY-SA 2.5 2741.0 2009-09-09T16:16:59.430 2 1400656.0 107129.0 None 8590.0 2013-08-05T16:07:54.400 javascript,jquery 1 NaN Stop reload for ajax submitted form 2009-09-09T16:12:46.057 {'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
3 1424774 0.461469 1424774:Body:2of2:934to1745 var self = this;\n $.ajax({\n ... 0.684243 CC BY-SA 2.5 3601.0 None 1 1426940.0 173350.0 None NaN 2020-06-08T10:43:45.037 jquery,loops 1 NaN Loop with 8 times 2009-09-15T02:02:58.927 {'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
4 1400637 0.462233 1400637:Body:31of32:28357to29238 N=function(X){if(J.readyState==0){if(P){clearI... 0.683886 CC BY-SA 2.5 2741.0 2009-09-09T16:16:59.430 2 1400656.0 107129.0 None 8590.0 2013-08-05T16:07:54.400 javascript,jquery 1 NaN Stop reload for ajax submitted form 2009-09-09T16:12:46.057 {'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
5 546344 0.463237 546344:Body:2of3:902to1764 var before = function() { $(loading).show() ;... 0.683416 CC BY-SA 2.5 1463.0 2009-02-13T16:17:38.170 0 546642.0 2755.0 None 2755.0 2009-02-13T16:37:59.867 javascript,jquery,ajax 1 0.0 Using jQuery, how can I store the result of a ... 2009-02-13T15:25:00.963 {'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
6 1279625 0.468854 1279625:Body:2of3:782to1754 ```\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML ... 0.680803 CC BY-SA 3.0 1130.0 2016-12-03T07:00:58.213 0 1279881.0 58375.0 None 6637668.0 2016-12-03T07:00:58.213 events,jquery,getjson 1 NaN Trouble with jQuery Ajax timing 2009-08-14T19:06:28.043 {'ContentLicense': 'CC BY-SA 3.0', 'LastActivi...
7 1400637 0.468931 1400637:Body:32of32:29239to30048 L(){if(M.complete){M.complete(J,R)}if(M.global... 0.680767 CC BY-SA 2.5 2741.0 2009-09-09T16:16:59.430 2 1400656.0 107129.0 None 8590.0 2013-08-05T16:07:54.400 javascript,jquery 1 NaN Stop reload for ajax submitted form 2009-09-09T16:12:46.057 {'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
8 1775625 0.472740 1775625:Body:5of9:3144to4049 }\n\n}\n</script>\n\n\n\n<script type=... 0.679007 CC BY-SA 2.5 2100.0 2009-11-21T14:46:00.250 1 1776406.0 212889.0 None 212889.0 2009-11-21T19:03:52.070 jquery,form-submit 1 0.0 jQuery - Multiple form submission trigger unre... 2009-11-21T14:32:41.383 {'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
9 1400637 0.477785 1400637:Body:26of32:23690to24690 nclick")}o(function(){var L=document.createEle... 0.676688 CC BY-SA 2.5 2741.0 2009-09-09T16:16:59.430 2 1400656.0 107129.0 None 8590.0 2013-08-05T16:07:54.400 javascript,jquery 1 NaN Stop reload for ajax submitted form 2009-09-09T16:12:46.057 {'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...

Understanding Query Results

The query returns these columns:

Column Description
id Original document ID
chunk_id Identifier for the text chunk
chunk_content The actual text content
metadata JSON object with all metadata fields
distance Vector distance (lower = more similar)
relevance Relevance score (higher = more relevant, 0-1)

Filtering by Relevance

Get only highly relevant results:

The Power of Combined Filtering

The query we just ran demonstrates MindsDB's hybrid search capability:


sql
SELECT * FROM kb_stack_faiss 
WHERE content = 'ajax'              -- Semantic match
    AND Tags LIKE '%jquery%'        -- Metadata filter
    AND ViewCount > 1000            -- Popularity threshold
    AND relevance > 0.6             -- Quality threshold


Enter fullscreen mode Exit fullscreen mode

This finds posts that:

  1. Are semantically similar to "ajax" (not just keyword matches)
  2. Are tagged with jQuery
  3. Have significant engagement (>1000 views)
  4. Meet a minimum relevance score

This combination is impossible with traditional search and would require complex custom code with raw vector databases.


python
def run_query_ignore_exists(sql, success_msg="Query executed successfully"):
    """Execute a query, silently ignoring 'already exists' errors."""
    try:
        result = server.query(sql).fetch()
        print(success_msg)
        return result
    except RuntimeError as e:
        return None  # Silently ignore
# Create MindsDB Agent
run_query_ignore_exists("""
    drop agent stackoverflow_agent
""", "Dropped stackoverflow_agent")

run_query("""
    CREATE AGENT stackoverflow_agent
    USING
        model = {
            "provider": "openai",
            "model_name": "gpt-4.1"
        },
        data = {
            "knowledge_bases": ["mindsdb.kb_stack_faiss"]
        },
        prompt_template = '
            You are a helpful programming assistant. 
            mindsdb.kb_stack_faiss is a knowledge base that contains Stack Overflow questions and answers.
            Use this knowledge to provide accurate, helpful responses to programming questions.
            Include code examples when relevant.
            You must base your answer on the Stack Overflow questions and answers extracted from mindsdb.kb_stack_faiss.
            If you failed to get the results from mindsdb.kb_stack_faiss, answer I could not get the results from mindsdb.kb_stack_faiss.
            Print the chunk ID for each question and answer you based your answer on.
            IMPORTANT: Use a limit of 100 in your query to the knowledge base.
        '
""", "Created stackoverflow_agent")



Enter fullscreen mode Exit fullscreen mode
Dropped stackoverflow_agent
Created stackoverflow_agent
Enter fullscreen mode Exit fullscreen mode

python
# Query the agent
start = time.time()
response = server.query("""
    SELECT answer
    FROM stackoverflow_agent 
    WHERE question = 'Compare JavaScript to TypeScript for building web services'
""").fetch()
print(f"Agent response time: {time.time() - start:.2f} seconds\n")
print(response['answer'].iloc[0])


Enter fullscreen mode Exit fullscreen mode
Agent response time: 63.44 seconds

To compare JavaScript and TypeScript for building web services, let's look at insights from Stack Overflow posts (see chunk IDs for reference):

JavaScript:

  • JavaScript is a dynamic, weakly typed, prototype-based language with first-class functions (1253285:Body:1of1:0to384).
  • It is the default language for web development, both on the client (browser) and, with Node.js, on the server (870980:Body:1of1:0to133).
  • JavaScript is flexible and widely supported, but its lack of static typing can lead to runtime errors and makes large codebases harder to maintain.

TypeScript:

  • While not directly mentioned in the top results, TypeScript is a superset of JavaScript that adds static typing and modern language features. It compiles to JavaScript, so it runs anywhere JavaScript does.
  • TypeScript helps catch errors at compile time, improves code readability, and is especially beneficial for large projects or teams.

Web Services:

  • JavaScript (with Node.js) is commonly used to build RESTful APIs and web services (208051:Body:1of1:0to147).
  • TypeScript is increasingly popular for the same purpose, as it provides all the benefits of JavaScript plus type safety and better tooling (e.g., autocompletion, refactoring).

Summary Table:

Feature JavaScript TypeScript
Typing Dynamic, weakly typed Static typing (optional)
Tooling Good, but less type-aware Excellent (autocompletion, refactor)
Learning Curve Lower Slightly higher (due to types)
Error Checking Runtime Compile-time + runtime
Ecosystem Huge, universal Same as JS, plus TS-specific tools
Maintainability Can be challenging in large code Easier in large codebases

Conclusion:

  • For small projects or rapid prototyping, JavaScript is sufficient and easy to start with.
  • For larger projects, teams, or when maintainability and reliability are priorities, TypeScript is generally preferred.

References:

If you want more specific code examples or a deeper dive into either technology, let me know!

Enter fullscreen mode Exit fullscreen mode




Conclusion

We've built a complete semantic search system that:

  • Processes 2 million Stack Overflow posts
  • Supports both pgvector and FAISS backends
  • Combines semantic search with metadata filtering
  • Powers an AI agent for natural language queries

Key Takeaways

  1. FAISS is much faster than pgvector for pure search queries
  2. Metadata filtering lets you narrow results by tags, scores, dates
  3. Knowledge bases abstract complexity—no need to manage embeddings manually
  4. Agents can leverage knowledge bases for RAG-style applications

Next Steps

  • Try different embedding models
  • Add more data sources
  • Build a chat interface
  • Explore different chunking strategies

Top comments (0)