Echo.lee for seekdb

Posted on Mar 18

Building a Smart, AI-Powered Book Search App — The Easy Way

#hybridsearch #ai #programming #appwritehack

Just spent some time getting hands-on with seekdb, and it’s been a pleasant surprise—here’s a quick breakdown of what caught my eye (no fluff, just the good stuff):

✅ Lightweight & easy to spin up: Runs smoothly on my MacBook via Docker Desktop, or straight up with pip on Linux. macOS/Windows native support is on the way, so soon it’ll be a simple one-command install, no Docker required.

✅ Unified architecture done right: Natively supports relational, vector, full-text, JSON, and GIS data types—all indexes update atomically in the same transaction. Zero Data Lag, strict ACID compliance, and none of the latency/inconsistency headaches from traditional CDC sync.

✅ AI-Native out of the box: Built-in embedding models and AI functions mean one SQL query handles vector + full-text + scalar filtering. No more messy glue code to stitch tech stacks together—perfect for powering RAG workflows.

✅ Schema-free API: Write directly, no need to predefine rigid table structures—saves so much setup time.

✅ Full MySQL compatibility: Easy upgrade path for traditional databases looking to add AI capabilities without a complete overhaul.

✅ Open-source (Apache 2.0) with OceanBase backing: Long-term support is locked in, and the project’s only getting better—always a win for the community.

In this tutorial, we'll build an intelligent book search application from scratch using seekdb, demonstrating semantic search, hybrid search, and other core capabilities.

What We'll Build

This tutorial will walk you through creating a smart book search app that demonstrates seekdb's main features:

1. Data Import

Import from CSV files into seekdb
Support batch data import
Automatically convert book text information into 384-dimensional vector embeddings

2. Three Search Capabilities

Semantic Search: Based on vector similarity, use natural language queries to find semantically related books
Metadata Filtering: Precise filtering by rating, genre, year, price, and other fields
Hybrid Search: Combines semantic search + metadata filtering using RRF (Reciprocal Rank Fusion) algorithm
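
To make the filtering semantics concrete, here is a minimal pure-Python sketch of how filter conditions such as {"user_rating": {"$gte": 4.5}} (used later in this tutorial) can be read. The matches helper is hypothetical and covers only equality, $gte, and $and. It illustrates what the filter means, not how seekdb evaluates it internally.

```python
def matches(metadata: dict, where: dict) -> bool:
    """Return True if a metadata dict satisfies a small subset of the filter syntax."""
    for key, condition in where.items():
        if key == "$and":
            # Every sub-filter in the list must hold.
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif isinstance(condition, dict):
            # Operator form, e.g. {"$gte": 4.5}; only $gte is handled in this sketch.
            for op, threshold in condition.items():
                if op != "$gte":
                    raise ValueError(f"unsupported operator in this sketch: {op}")
                if key not in metadata or metadata[key] < threshold:
                    return False
        else:
            # A plain value means equality, e.g. {"genre": "Fiction"}.
            if metadata.get(key) != condition:
                return False
    return True


book = {"genre": "Fiction", "year": 2017, "user_rating": 4.7}
where = {"$and": [{"year": {"$gte": 2015}},
                  {"user_rating": {"$gte": 4.0}},
                  {"genre": "Fiction"}]}
print(matches(book, where))                     # True
print(matches(book, {"genre": "Non Fiction"}))  # False
```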

3. Index Optimization

Create HNSW vector indexes to boost semantic search performance
Generate column indexes from metadata (extract fields from JSON to create indexes)

4. Tech Stack

Database: seekdb, pyseekdb (seekdb's Python SDK), pymysql
Data Processing: pandas

Prerequisites

1. Install OrbStack

OrbStack is a lightweight Docker alternative optimized for Mac. It starts fast and uses fewer resources. We'll use it to deploy seekdb locally.

Step 1: Install via Homebrew (Recommended)

brew install orbstack

Or download from the official website: https://orbstack.dev

Step 2: Start OrbStack

# Start OrbStack
open -a OrbStack

# Verify installation
orb version

2. Deploy seekdb Image

If image downloads are slow, configure a Docker registry mirror close to your region in OrbStack's settings.

# Pull seekdb image
docker pull oceanbase/seekdb:latest

# Start seekdb container
docker run -d \
  --name seekdb \
  -p 2881:2881 \
  -e MODE=slim \
  oceanbase/seekdb:latest

# Check container status
docker ps | grep seekdb

# View logs (ensure service started successfully)
docker logs seekdb

Wait about 30 seconds for seekdb to fully start. You can monitor the startup logs with docker logs -f seekdb. When you see "boot success", it's ready.
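
If you'd rather script the wait than watch the logs, a minimal sketch like the one below polls the mapped port until it accepts TCP connections. The wait_for_port helper is hypothetical and uses only the Python standard library. An open port is a rough signal at best, since the container's port mapping may accept connections before the database is ready, so the "boot success" log line remains the reliable check.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


# Example call for the container started above:
# wait_for_port("127.0.0.1", 2881)
```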

3. Download the Dataset

Download the dataset from: https://www.kaggle.com/datasets/sootersaalu/amazon-top-50-bestselling-books-2009-2019

Rename it to: bestsellers_with_categories.csv. It contains 550 records of Amazon's historical bestsellers.
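
Before importing, it can help to sanity-check the renamed file. The sketch below uses only the standard library and checks for the seven column names that the import log in this tutorial reports. The check_dataset helper is hypothetical and not part of the tutorial repository.

```python
import csv

# Column names as reported by the import log later in this tutorial.
EXPECTED_COLUMNS = ["Name", "Author", "User Rating", "Reviews", "Price", "Year", "Genre"]


def check_dataset(path: str) -> int:
    """Verify the CSV header and return the number of data rows."""
    # utf-8-sig transparently skips a byte-order mark if the file has one.
    with open(path, newline="", encoding="utf-8-sig") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []
        missing = [c for c in EXPECTED_COLUMNS if c not in header]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        return sum(1 for _ in reader)


# Example call; the article reports 550 records for this file:
# check_dataset("bestsellers_with_categories.csv")
```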

4. Download the Tutorial Code

git clone https://github.com/kejun/demo-seekdb-hybridsearch.git

Project Structure:

demo-seekdb-books-hybrid-search/
├── database/
│   ├── db_client.py      # Database client wrapper
│   └── index_manager.py  # Index manager
├── data/
│   └── processor.py      # Data processor
├── models/
│   └── book_metadata.py  # Book metadata model
├── utils/
│   └── text_utils.py     # Text processing utilities
├── import_data.py        # Data import script
├── hybrid_search.py      # Hybrid search demo
└── bestsellers_with_categories.csv  # Data file

Create Python Virtual Environment:

# Create virtual environment
python3 -m venv venv

# Activate virtual environment
source venv/bin/activate   # macOS/Linux
# or
.\venv\Scripts\activate    # Windows

Install Dependencies:

pip install -r requirements.txt

Execution Results

Run python import_data.py to import data. You'll see the entire process: load data file → connect to database → create database → create collection → batch import data → create metadata indexes.

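(Note: seekdb currently supports HNSW indexes for embedding columns and full-text indexes for document columns out of the box. Automatic metadata field indexing is planned for future releases; in the meantime, this project creates the metadata indexes shown in the log with SQL DDL through pymysql.)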

seekdb uses a schema-free interface design. For example, in data/processor.py, when calling collection.add(), you can pass any dictionary directly:

collection.add(
    ids=valid_ids,
    documents=valid_documents,
    metadatas=valid_metadatas  # Pass dictionary list directly, no schema predefinition needed
)
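
To show what those lists might contain, here is a minimal sketch that turns one CSV row into an id, a document string, and a metadata dictionary. The row_to_record helper, the id scheme, and the document text format are all assumptions for illustration; the repository's data/processor.py may build them differently. The metadata keys match the index fields listed in the import log.

```python
def row_to_record(index: int, row: dict) -> tuple[str, str, dict]:
    """Turn one CSV row into (id, document, metadata) suitable for collection.add()."""
    doc_id = f"book_{index}"                        # hypothetical id scheme
    document = f"{row['Name']} by {row['Author']}"  # hypothetical document text
    metadata = {
        "author": row["Author"],
        "genre": row["Genre"],
        "year": int(row["Year"]),
        "user_rating": float(row["User Rating"]),
        "reviews": int(row["Reviews"]),
        "price": float(row["Price"]),
    }
    return doc_id, document, metadata


# CSV values arrive as strings; this row mirrors a result shown later in the article.
row = {
    "Name": "The 7 Habits of Highly Effective People: Powerful Lessons in Personal Change",
    "Author": "Stephen R. Covey",
    "User Rating": "4.6",
    "Reviews": "9325",
    "Price": "24",
    "Year": "2011",
    "Genre": "Non Fiction",
}
doc_id, document, metadata = row_to_record(0, row)
print(doc_id, metadata["user_rating"], metadata["year"])
```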

Complete Results (abbreviated):

Loading data file: bestsellers_with_categories.csv
Data loaded!
- Total rows: 550
- Total columns: 7
- Column names: Name, Author, User Rating, Reviews, Price, Year, Genre
- Load time: 0.01 seconds

Connecting to database...
Host: 127.0.0.1:2881
Database: demo_books
Collection: book_info
Database ready
Database connection successful

Creating/rebuilding collection...
Collection name: book_info
Vector dimensions: 384
Distance metric: cosine
Collection created successfully

Processing data...
Data preprocessing complete!
- Total records: 550
- Validation errors: 0
- Processing time: 0.05 seconds

Importing data to collection...
- Batch size: 100
- Total batches: 6
- Starting import...

Import progress: 100%|█████████████████████████████████████| 6/6 [00:53<00:00,  8.97s/batch]

Data import complete!
- Import time: 53.83 seconds
- Average speed: 10 records/second

Creating metadata indexes...
- Index fields: genre, year, user_rating, author, reviews, price
Index creation complete!
- Creation time: 3.81 seconds

Data import process complete!
Total time: 59.64 seconds
Imported records: 550
Database: demo_books
Collection: book_info
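
The batch count in the log follows directly from the batch size: 550 records in batches of 100 gives five full batches plus one of 50. A minimal sketch of that batching, with a hypothetical chunked helper (the repository may implement it differently):

```python
from typing import Iterator, Sequence


def chunked(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


records = list(range(550))            # stand-in for the 550 book records
batches = list(chunked(records, 100))
print(len(batches))                   # 6
print(len(batches[-1]))               # 50
```

Each batch would then be passed to one collection.add() call, which is where the embedding work (and most of the import time) happens.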

After the import, you can query the database directly from the terminal with a MySQL client (or obclient, if you have it installed).

# Enter seekdb container
docker exec -it seekdb bash

# Connect using MySQL client (seekdb is MySQL-compatible)
mysql -h127.0.0.1 -P2881 -uroot

book_info is a seekdb collection, which corresponds to the underlying table name c$v1$book_info:

-- View all databases
SHOW DATABASES;

-- Switch to the demo database (named demo_books in the import log)
USE demo_books;

-- View all tables (collections)
SHOW TABLES;

-- View collection structure
DESC c$v1$book_info;

-- Query collection data
SELECT * FROM c$v1$book_info LIMIT 10;

-- Count records
SELECT COUNT(*) FROM c$v1$book_info;

-- Exit
EXIT;
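
The same count can be run from Python. The sketch below works with any DB-API connection, so it should also work with pymysql, which the project already depends on. The helpers are hypothetical, and the assumption that every collection maps to a table named c$v1$ plus the collection name is generalized from the single book_info example above.

```python
def collection_table(collection_name: str) -> str:
    """Map a collection name to its underlying table name (assumed c$v1$ prefix)."""
    return f"c$v1${collection_name}"


def count_rows(conn, collection_name: str) -> int:
    """Run SELECT COUNT(*) on a collection's table using a DB-API connection."""
    table = collection_table(collection_name)
    cur = conn.cursor()
    try:
        # Backticks quote the identifier, which contains '$' characters.
        cur.execute(f"SELECT COUNT(*) FROM `{table}`")
        return cur.fetchone()[0]
    finally:
        cur.close()


# Example with pymysql against the local container:
# import pymysql
# conn = pymysql.connect(host="127.0.0.1", port=2881, user="root", database="demo_books")
# print(count_rows(conn, "book_info"))
```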

[Screenshot: table schema, output of DESC c$v1$book_info]

[Screenshot: indexes created on the table]

(Note: pyseekdb doesn't currently support direct indexing of metadata columns, so the project uses pymysql + SQL DDL to implement metadata indexing. The next pyseekdb version will support automatic indexing of metadata fields.)
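
As a rough idea of what "pymysql + SQL DDL" could look like, the sketch below builds MySQL-style statements that expose a JSON field as a generated column and then index it. Everything here is an assumption: the JSON column name (metadata), the column types, the index names, and whether seekdb accepts exactly this syntax. The article does not show the statements that database/index_manager.py actually runs, so treat this as an illustration of the technique only.

```python
TABLE = "c$v1$book_info"
JSON_COLUMN = "metadata"  # assumption: name of the JSON column holding metadata

# Field name -> SQL type of the generated column (types are illustrative).
FIELDS = {
    "genre": "VARCHAR(64)",
    "year": "INT",
    "user_rating": "DOUBLE",
}


def index_ddl(field: str, sql_type: str) -> list[str]:
    """Build DDL that exposes one JSON field as a generated column and indexes it."""
    extract = f"JSON_EXTRACT(`{JSON_COLUMN}`, '$.{field}')"
    if sql_type.upper().startswith("VARCHAR"):
        # Strip the surrounding JSON quotes from string values.
        extract = f"JSON_UNQUOTE({extract})"
    return [
        f"ALTER TABLE `{TABLE}` ADD COLUMN `{field}` {sql_type} "
        f"GENERATED ALWAYS AS ({extract}) VIRTUAL",
        f"CREATE INDEX `idx_{field}` ON `{TABLE}` (`{field}`)",
    ]


for name, sql_type in FIELDS.items():
    for statement in index_ddl(name, sql_type):
        print(statement)
```

Each printed statement could then be passed to cursor.execute() on a pymysql connection.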

Running Hybrid Search

Next, run python hybrid_search.py. seekdb's built-in embedding model is sentence-transformers/all-MiniLM-L6-v2, which produces 384-dimensional vectors. For better results, you can configure an external model service.

Hybrid search is seekdb's killer feature. It simultaneously executes full-text retrieval and vector retrieval, then merges results using the RRF (Reciprocal Rank Fusion) algorithm.

In the snippet below, query_params defines a full-text search for "inspirational", restricted by a metadata filter to books with user_rating >= 4.5. knn_params defines the semantic search: query_texts holds the phrase "inspirational life advice", and the same rating filter applies.

Code Snippet:

query_params = {
    "where_document": {"$contains": "inspirational"},
    "where": {"user_rating": {"$gte": 4.5}},
    "n_results": 5
}
knn_params = {
    "query_texts": ["inspirational life advice"],
    "where": {"user_rating": {"$gte": 4.5}},
    "n_results": 5
}

results = collection.hybrid_search(
    query=query_params,
    knn=knn_params,
    rank={"rrf": {}},
    n_results=5,
    include=["metadatas", "documents", "distances"]
)
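
For intuition about the rank={"rrf": {}} step, here is a minimal pure-Python sketch of Reciprocal Rank Fusion: each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The rrf_fuse helper and the sample rankings are hypothetical. k = 60 is the constant commonly used in the RRF literature; the article does not state seekdb's default, and this is not seekdb's implementation.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


full_text = ["b", "a", "c"]   # hypothetical full-text ranking
vector = ["a", "d", "b"]      # hypothetical vector ranking
for doc_id, score in rrf_fuse([full_text, vector]):
    print(doc_id, round(score, 4))
```

Documents that appear in both lists ("a" and "b") rise to the top, which is the point of fusing the two retrieval paths. With k = 60, a document ranked first in a single list contributes 1/61, or about 0.0164.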

The top results line up well with the intent of each query. Complete execution results (abbreviated):

=== Semantic Search ===
Query: ['self improvement motivation success']

Semantic Search - Found 5 results:

[1] The 7 Habits of Highly Effective People: Powerful Lessons in Personal Change
    Author: Stephen R. Covey
    Rating: 4.6
    Reviews: 9325
    Price: $24.0
    Year: 2011
    Genre: Non Fiction
    Similarity distance: 0.5358
    Similarity: 0.4642

(Other results omitted...)


=== Hybrid Search (Rating≥4.5) ===
Query: {'where_document': {'$contains': 'inspirational'}, 'where': {'user_rating': {'$gte': 4.5}}, 'n_results': 5}
KNN Query Texts: ['inspirational life advice']

Hybrid Search (Rating≥4.5) - Found 5 results:

[1] Mindset: The New Psychology of Success
    Author: Carol S. Dweck
    Rating: 4.6
    Reviews: 5542
    Price: $10.0
    Year: 2014
    Genre: Non Fiction
    Similarity distance: 0.0159
    Similarity: 0.9841

(Other results omitted...)


=== Hybrid Search (Non Fiction) ===
Query: {'where_document': {'$contains': 'business'}, 'where': {'genre': 'Non Fiction'}, 'n_results': 5}
KNN Query Texts: ['business entrepreneurship leadership']

Hybrid Search (Non Fiction) - Found 5 results:

[1] The Five Dysfunctions of a Team: A Leadership Fable
    Author: Patrick Lencioni
    Rating: 4.6
    Reviews: 3207
    Price: $6.0
    Year: 2009
    Genre: Non Fiction
    Similarity distance: 0.0164
    Similarity: 0.9836

(Other results omitted...)


=== Hybrid Search (Fiction, After 2015, Rating≥4.0) ===
Query: {'where_document': {'$contains': 'fiction'}, 'where': {'$and': [{'year': {'$gte': 2015}}, {'user_rating': {'$gte': 4.0}}, {'genre': 'Fiction'}]}, 'n_results': 5}
KNN Query Texts: ['fiction story novel']

Hybrid Search (Fiction, After 2015, Rating≥4.0) - Found 5 results:

[1] A Gentleman in Moscow: A Novel
    Author: Amor Towles
    Rating: 4.7
    Reviews: 19699
    Price: $15.0
    Year: 2017
    Genre: Fiction
    Similarity distance: 0.0154
    Similarity: 0.9846

(Other results omitted...)


=== Hybrid Search (Reviews≥10000) ===
Query: {'where_document': {'$contains': 'popular'}, 'where': {'reviews': {'$gte': 10000}}, 'n_results': 10}
KNN Query Texts: ['popular bestseller']

Hybrid Search (Reviews≥10000) - Found 10 results:

[1] Twilight (The Twilight Saga, Book 1)
    Author: Stephenie Meyer
    Rating: 4.7
    Reviews: 11676
    Price: $9.0
    Year: 2009
    Genre: Fiction
    Similarity distance: 0.0143
    Similarity: 0.9857

[2] 1984 (Signet Classics)
    Author: George Orwell
    Rating: 4.7
    Reviews: 21424
    Price: $6.0
    Year: 2017
    Genre: Fiction
    Similarity distance: 0.0145
    Similarity: 0.9855

[3] Last Week Tonight with John Oliver Presents A Day in the Life of Marlon Bundo (Better Bundo Book, LGBT Childrens Book)
    Author: Jill Twiss
    Rating: 4.9
    Reviews: 11881
    Price: $13.0
    Year: 2018
    Genre: Fiction
    Similarity distance: 0.0147
    Similarity: 0.9853

(Other results omitted...)
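
In the output above, each "Similarity" value is consistent with 1 minus the reported distance (for example, 0.5358 + 0.4642 = 1). Assuming that is how the demo script derives it, which the article does not confirm, the relationship for the cosine metric can be sketched as follows. Both helpers are hypothetical.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance: 1 minus the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def similarity_from_distance(distance: float) -> float:
    """Convert a distance back into a similarity score."""
    return 1.0 - distance


print(round(similarity_from_distance(0.5358), 4))  # 0.4642
```

In the hybrid searches, the reported distances all sit close to 0.015, so the derived similarities cluster near 0.98 and say little about relative quality. The rank order is the more useful signal there.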

Vibe Coding Friendly

If you're using Cursor or Claude Code for development, you've probably installed context7-mcp. It queries the latest API documentation, code examples, and more—the perfect companion for vibe coding. I noticed seekdb has been added to Context7:

seekdb: https://context7.com/oceanbase/seekdb
pyseekdb: https://context7.com/oceanbase/pyseekdb

If you haven't installed it yet, I highly recommend it:

{
  "mcpServers": {
    "context7": {
      "command": "npx",
      "args": [
        "-y",
        "@upstash/context7-mcp",
        "--api-key",
        "<your-apiKey-created-on-context7>"
      ]
    }
  }
}

Once it's set up, your coding assistant can look up seekdb documentation while you build, so you can learn and use seekdb at the same time.

Key Takeaways

What makes seekdb special:

Lightweight & Easy to Deploy: Runs smoothly on a MacBook, with native macOS/Windows support coming soon
Unified Architecture: Combines relational, vector, full-text, JSON, and GIS in one system
AI-Native: Built-in embeddings and AI functions, no glue code needed
Schema-Free: Write directly without predefining schemas
MySQL-Compatible: Easy migration path for existing databases
Open Source: Apache 2.0 license with OceanBase backing

The hybrid search capability is particularly impressive—combining semantic understanding with precise metadata filtering delivers results that feel both intelligent and accurate.

Repo: github.com/oceanbase/seekdb (Apache 2.0 — Stars, Issues, PRs welcome)
Docs: seekdb documentation
Discord: https://discord.com/channels/1331061822945624085/1331061823465590805
Medium: https://medium/seekdb
Press: OceanBase Releases seekdb (MarkTechPost)

I hope this tutorial helps you get started with seekdb more smoothly. Enjoy building! 🚀
