Abhi

Building Semantica: An AI-Powered Academic Search Platform with MindsDB

Connecting ideas across the frontiers of knowledge


🎯 The Problem: Information Overload in Academic Research

Picture this: You're a researcher trying to explore cutting-edge developments in quantum computing. You fire up your favorite search engine and type "quantum error correction." What do you get? Thousands of results. Papers from arXiv, patents, preprints from bioRxiv, medRxiv, and chemRxiv, all scattered across different platforms, each with its own search interface.

But here's the real challenge:

  1. Keyword-only search fails to capture semantic meaning: Traditional search engines look for exact keyword matches. If you search for "machine learning in healthcare," you might miss groundbreaking papers that use terms like "artificial intelligence in medical diagnostics" or "neural networks for disease prediction."

  2. Fragmented knowledge sources: Academic papers live in silos (arXiv for physics and CS, bioRxiv for biology, medRxiv for medicine, chemRxiv for chemistry, and patents in yet another database). Researchers waste hours searching across multiple platforms.

  3. No contextual understanding: Found an interesting paper? Great! Now you need to read all 50 pages to understand if it's relevant to your research. What if you could just ask the paper questions?

  4. Comparing multiple papers is tedious: Want to understand how three different approaches to the same problem compare? You'll need to read all three papers, take notes, and synthesize the information yourself.

The research community needed something better.


💡 The Solution: Semantica

Enter Semantica, an intelligent academic search and chat platform that revolutionizes how researchers discover and interact with scientific literature.

Semantica solves these problems by:

✅ Semantic Search: Find papers by meaning, not just keywords

✅ Multi-Source Integration: Search across arXiv, bioRxiv, medRxiv, chemRxiv, and patents simultaneously

✅ AI-Powered Chat: Ask questions and get answers directly from papers

✅ Multi-Document Analysis: Compare and analyze up to 4 papers in a single conversation

✅ Hybrid Search: Combine traditional keyword search with semantic understanding


GitHub Repo - Semantica
Video Demo - Semantica


πŸ—οΈ How I Built It: The MindsDB Magic

The secret sauce behind Semantica is MindsDBβ€”an AI platform that transforms databases into AI-powered systems. Here's how we leveraged MindsDB's powerful features:

1. Knowledge Bases with pgvector Integration

At the heart of Semantica lies MindsDB's knowledge base functionality, which seamlessly integrates with PostgreSQL and the pgvector extension.

What I did:

  • Created a PostgreSQL database with the pgvector extension for efficient vector storage
  • Connected it to MindsDB using their database integration
  • Created a knowledge base that automatically generates embeddings using OpenAI's text-embedding-3-small model
# Sample startup code (client is the MindsDB SDK connection)
def setup_knowledge_base(kb_name="my_knowledge_base",
                         embedding_model="text-embedding-3-small",
                         db_name="my_pgvector",
                         storage_table="pgvector_storage_table"):
    """Create knowledge base in MindsDB with automatic embeddings"""

    create_kb_query = f"""
    CREATE KNOWLEDGE_BASE {kb_name}
    USING
        model = '{embedding_model}',
        storage = {db_name}.{storage_table};
    """

    client.query(create_kb_query)

Why this is powerful:

  • Automatic embeddings: MindsDB handles the complexity of generating and storing vector embeddings
  • Metadata filtering: We can filter by publication year, category, source, and more
  • Hybrid search: MindsDB supports combining semantic similarity with traditional SQL WHERE clauses

2. Semantic Search with Hybrid Capabilities

One of MindsDB's standout features is its hybrid search capability, which we extensively use in Semantica.

The search query looks like this:

SELECT article_id, metadata, relevance 
FROM my_knowledge_base
WHERE content = 'quantum error correction'
  AND hybrid_search = true 
  AND hybrid_search_alpha = 0.7
  AND source = 'arxiv'
  AND published_year = '2024';

Breaking it down:

  • content = 'query': Performs semantic similarity search
  • hybrid_search = true: Enables hybrid mode (semantic + keyword)
  • hybrid_search_alpha: Controls the balance (0.0 = pure keyword, 1.0 = pure semantic)
  • Additional WHERE clauses: Traditional SQL filtering on metadata

The result? Users can fine-tune their search from pure semantic understanding to traditional keyword matching, getting the best of both worlds!
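
In Semantica's backend, this query string is assembled in Python before being sent through the MindsDB SDK. Here is a minimal sketch of such a builder; the helper name, filter handling, and defaults are my own illustration, and real code must sanitize user input before interpolating it into SQL:

```python
def build_hybrid_search_sql(kb, query, alpha=0.7, filters=None, limit=20):
    """Assemble a MindsDB hybrid-search query against a knowledge base.

    NOTE: illustrative only -- production code should escape quotes in
    `query` and `filters` before interpolating them into SQL.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("hybrid_search_alpha must be between 0.0 and 1.0")
    clauses = [
        f"content = '{query}'",
        "hybrid_search = true",
        f"hybrid_search_alpha = {alpha}",
    ]
    # Metadata filters become ordinary SQL WHERE conditions
    for column, value in (filters or {}).items():
        clauses.append(f"{column} = '{value}'")
    where = "\n  AND ".join(clauses)
    return (f"SELECT article_id, metadata, relevance\n"
            f"FROM {kb}\nWHERE {where}\nLIMIT {limit};")

sql = build_hybrid_search_sql(
    "my_knowledge_base", "quantum error correction",
    alpha=0.7, filters={"source": "arxiv", "published_year": "2024"},
)
```

The generated string matches the shape of the query shown above, with the alpha and metadata filters plugged in from the API request.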

3. Dynamic AI Agents for Multi-Document Chat

Here's where things get really exciting. When a user selects papers to chat with, Semantica:

  1. Creates individual knowledge bases for each paper
   CREATE KNOWLEDGE_BASE paper_123_kb
   USING model = 'text-embedding-3-small',
         storage = my_pgvector.pgvector_storage_table;

   INSERT INTO paper_123_kb
   SELECT text, title, authors, abstract
   FROM my_pgvector.paper_raw
   WHERE article_id = '123' AND source = 'arxiv';
  2. Generates a custom AI agent with access to all selected papers
   CREATE AGENT research_assistant
   USING 
       model = 'gpt-4o',
       skills = [],
       knowledge_bases = ['paper_123_kb', 'paper_456_kb', 'paper_789_kb'],
       prompt_template = 'You are a research assistant...';
  3. Enables natural language queries across all papers
   SELECT answer 
   FROM research_assistant 
   WHERE question = 'Compare the methodologies used in these papers';

Why this approach is brilliant:

  • Each chat session gets a dedicated AI agent
  • The agent has deep understanding of all selected papers through their knowledge bases
  • MindsDB handles the RAG (Retrieval-Augmented Generation) pipeline automatically
  • Responses are grounded in the actual paper content, reducing hallucinations

4. Reranking for Improved Relevance

MindsDB's knowledge base supports reranking, which we leverage to improve search quality:

knowledge_base:
  reranking_model:
    provider: "openai"
    model_name: "gpt-4o"

Reranking takes the initial search results and uses a more powerful model (GPT-4o) to reorder them based on true relevance to the query. This two-stage approach provides:

  • Fast initial retrieval using vector similarity
  • High-quality ranking using advanced language understanding
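
Conceptually, the two stages look like this. The toy scorer below is only a stand-in for the GPT-4o reranker that MindsDB actually runs, and the dict shape is illustrative:

```python
def rerank(candidates, rerank_score, top_k=5):
    """Stage two of retrieval: reorder a short candidate list with a
    slower, higher-quality scorer (GPT-4o in Semantica's setup)."""
    return sorted(candidates, key=rerank_score, reverse=True)[:top_k]

# Stage one (fast vector search) already produced this ordered list;
# `llm_score` stands in for the reranker's relevance judgment.
candidates = [
    {"id": "a", "llm_score": 0.2},
    {"id": "b", "llm_score": 0.9},
    {"id": "c", "llm_score": 0.5},
]
ranked = rerank(candidates, lambda d: d["llm_score"], top_k=2)  # b, then c
```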

5. Background Jobs to Keep the Knowledge Base Fresh

MindsDB also supports scheduled jobs, which we leverage to keep the knowledge base updated with the latest papers:

CREATE JOB kb_sync (
   INSERT INTO my_knowledge_base (
      SELECT * FROM my_pgvector.paper_raw
   )
)
EVERY 1 day;

Each paper's text and metadata is first inserted into the Postgres table my_pgvector.paper_raw, from which the daily job loads it into the knowledge base.
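
The ingestion side just has to land each fetched paper in that staging table. A sketch of the row mapping (column names are inferred from the queries elsewhere in this post; the `entry` shape is a hypothetical fetcher output, not any specific API's schema):

```python
from datetime import date

def to_paper_raw_row(entry, source="arxiv"):
    """Map one fetched paper onto the columns of my_pgvector.paper_raw,
    the staging table the daily sync job reads from."""
    return {
        "article_id": entry["id"],
        "source": source,
        "title": entry["title"].strip(),
        "authors": ", ".join(entry.get("authors", [])),
        "abstract": entry.get("abstract", "").strip(),
        "text": entry["full_text"],
        "published_year": str(entry["published"].year),
    }

row = to_paper_raw_row({
    "id": "2401.01234",
    "title": " Qubit Overhead Reduction ",
    "authors": ["A. Doe", "B. Roe"],
    "abstract": "We propose...",
    "full_text": "Full paper text here.",
    "published": date(2024, 1, 5),
})
```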


πŸ› οΈ Technical Architecture

The Stack

┌─────────────────────────────────────┐
│   Frontend (React + TypeScript)     │
│   - Search interface                │
│   - Chat UI with PDF viewer         │
│   - Filter controls                 │
└─────────────┬───────────────────────┘
              │ REST API
┌─────────────▼───────────────────────┐
│   Backend (FastAPI + Python)        │
│   - /search endpoint                │
│   - /chat/initiate endpoint         │
│   - /chat/completion endpoint       │
└─────────────┬───────────────────────┘
              │ MindsDB SDK
┌─────────────▼───────────────────────┐
│   MindsDB Platform                  │
│   ┌───────────────────────────┐     │
│   │   Knowledge Bases         │     │
│   │   - Main KB (all papers)  │     │
│   │   - Per-paper KBs (chat)  │     │
│   └───────────────────────────┘     │
│   ┌───────────────────────────┐     │
│   │   AI Agents (GPT-4o)      │     │
│   │   - Dynamic generation    │     │
│   │   - Multi-KB access       │     │
│   └───────────────────────────┘     │
└─────────────┬───────────────────────┘
              │ PostgreSQL Protocol
┌─────────────▼───────────────────────┐
│   PostgreSQL + pgvector             │
│   - Vector embeddings               │
│   - Metadata storage                │
│   - Fast similarity search          │
└─────────────────────────────────────┘

Data Flow for Search

  1. User enters query: "machine learning for optics"
  2. Frontend sends request to /api/v1/search with query and filters
  3. Backend constructs MindsDB query:
   SELECT article_id, metadata, relevance
   FROM my_knowledge_base
   WHERE content = 'machine learning for optics'
     AND hybrid_search = true
     AND hybrid_search_alpha = 0.7;
  4. MindsDB processes the query:
    • Generates query embedding
    • Performs vector similarity search in pgvector
    • Applies filters
    • Reranks results
  5. Results transformed and returned to frontend
  6. User sees ranked papers with relevance scores
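
The "results transformed and returned" step can be sketched as a small formatter. The row and response shapes here are assumptions; in my experience MindsDB results often carry metadata as a JSON string, so the sketch handles both cases:

```python
import json

def format_results(rows):
    """Convert raw MindsDB result rows into the payload the frontend
    renders, sorted by relevance (highest first)."""
    papers = []
    for row in rows:
        meta = row["metadata"]
        if isinstance(meta, str):
            meta = json.loads(meta)
        papers.append({
            "article_id": row["article_id"],
            "title": meta.get("title", "Untitled"),
            "source": meta.get("source"),
            "relevance": round(float(row["relevance"]), 4),
        })
    # Highest relevance first, matching the ranked list users see
    return sorted(papers, key=lambda p: p["relevance"], reverse=True)

results = format_results([
    {"article_id": "1", "metadata": '{"title": "A", "source": "arxiv"}',
     "relevance": "0.81"},
    {"article_id": "2", "metadata": '{"title": "B", "source": "biorxiv"}',
     "relevance": "0.93"},
])
```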

Data Flow for Chat

  1. User selects papers (e.g., 3 papers about quantum computing)
  2. Frontend initiates chat via /api/v1/chat/initiate
  3. Backend creates infrastructure:
    • Creates 3 individual knowledge bases (one per paper)
    • Populates each KB with paper content
    • Creates an AI agent with access to all 3 KBs
    • Returns agent ID to frontend
  4. User asks question: "What are the key challenges?"
  5. Frontend sends message via /api/v1/chat/completion
  6. Backend queries agent:
   SELECT answer 
   FROM agent_abc123 
   WHERE question = 'What are the key challenges?';
  7. MindsDB agent:
    • Retrieves relevant context from all 3 knowledge bases
    • Constructs answer using GPT-4o
    • Returns grounded response
  8. Answer displayed in chat interface with markdown formatting
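
One detail worth sketching from the completion step: the user's question is interpolated into a SQL string literal, so single quotes must be doubled or the query breaks mid-sentence. A minimal helper (the function name is my own):

```python
def build_agent_query(agent_name, question):
    """Build the completion query; single quotes in the question are
    doubled so the SQL string literal stays valid."""
    safe_question = question.replace("'", "''")
    return f"SELECT answer FROM {agent_name} WHERE question = '{safe_question}';"

query = build_agent_query("agent_abc123", "What's the key challenge?")
```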

🚀 Key Features in Action

Feature 1: Multi-Source Semantic Search

Users can search across five different academic sources simultaneously:

Example Query: "CRISPR applications in gene therapy"

What happens behind the scenes:

  1. Query is embedded using OpenAI's embedding model
  2. MindsDB performs vector similarity search across all sources
  3. Results are filtered by user-selected corpora
  4. Reranking improves result quality
  5. Papers are returned ranked by relevance

The UX:

Search Results:
📄 "CRISPR-Cas9 Applications in Gene Therapy" (bioRxiv, 2023)
📄 "Therapeutic Genome Editing Methods" (patent, 2024)
📄 "Gene Therapy Advances Using CRISPR" (medRxiv, 2023)

Feature 2: Hybrid Search with Alpha Control

Users can slide between semantic and keyword search:

  • Alpha = 0.0: Pure keyword matching (fast, specific)
  • Alpha = 0.5: Balanced hybrid search
  • Alpha = 1.0: Pure semantic search (finds conceptually similar papers)

Real-world example:

  • Query: "neural networks"
  • Alpha 0.0: Returns papers with exact phrase "neural networks"
  • Alpha 1.0: Returns papers about "deep learning," "artificial neural systems," "connectionist models"
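
Under the hood, hybrid scoring is commonly a convex combination of the two signals. MindsDB's exact formula isn't shown here, but the alpha semantics match this sketch:

```python
def hybrid_score(semantic, keyword, alpha=0.5):
    """Blend a semantic-similarity score with a keyword-match score.
    alpha = 0.0 -> keyword only, alpha = 1.0 -> semantic only."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * semantic + (1 - alpha) * keyword

keyword_only = hybrid_score(0.9, 0.4, alpha=0.0)   # 0.4
semantic_only = hybrid_score(0.9, 0.4, alpha=1.0)  # 0.9
balanced = hybrid_score(0.9, 0.4, alpha=0.5)
```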

Feature 3: AI-Powered Multi-Document Chat

The killer feature! Select up to 4 papers and have a conversation:

Example conversation:

User: What are the main differences in methodology between these papers?

AI Agent: Based on the three papers you selected:

Paper 1 uses a supervised learning approach with labeled datasets...
Paper 2 employs reinforcement learning with reward shaping...
Paper 3 introduces an unsupervised method using contrastive learning...

The key distinction is in their learning paradigms, with Paper 1 requiring...

What makes this powerful:

  • The AI has read and understood all papers
  • Answers are grounded in actual paper content
  • Can compare, contrast, and synthesize information
  • Cites specific findings from the papers

Feature 4: Live PDF Viewing

While chatting, users can view the actual PDFs in a split-pane interface:

  • Left: PDF viewer with Google Docs integration
  • Right: Chat interface
  • Switch between papers with one click

📊 The Impact of MindsDB Features

Before MindsDB: The Traditional Approach

Building this without MindsDB would require:

# Manually generate embeddings
from openai import OpenAI
client = OpenAI(api_key=api_key)

def get_embedding(text):
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

# Manually implement vector search
import pgvector
def search_papers(query, limit=10):
    query_vector = get_embedding(query)
    cursor.execute("""
        SELECT id, metadata, 
               1 - (embedding <=> %s::vector) as similarity
        FROM papers
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """, (query_vector, query_vector, limit))
    return cursor.fetchall()

# Manually implement RAG
def answer_question(question, paper_ids):
    # 1. Retrieve relevant chunks from papers
    chunks = retrieve_relevant_chunks(question, paper_ids)

    # 2. Construct prompt with context
    context = "\n".join(chunks)
    prompt = f"Context: {context}\n\nQuestion: {question}\nAnswer:"

    # 3. Call GPT-4
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Manually implement hybrid search
def hybrid_search(query, alpha=0.5):
    semantic_results = vector_search(query)
    keyword_results = full_text_search(query)
    return combine_results(semantic_results, keyword_results, alpha)

Problems with this approach:

  • 😰 200+ lines of complex code
  • 🐛 Bug-prone embedding management
  • 🔧 Manual vector database optimization
  • 📈 Difficult to scale
  • ⚡ No built-in reranking
  • 🤯 Complex RAG pipeline implementation

After MindsDB: The Modern Approach

-- Create knowledge base (handles embeddings, storage, indexing)
CREATE KNOWLEDGE_BASE papers_kb
USING model = 'text-embedding-3-small',
      storage = postgres_db.papers;

-- Search with hybrid capabilities
SELECT * FROM papers_kb
WHERE content = 'quantum computing'
  AND hybrid_search = true
  AND hybrid_search_alpha = 0.7;

-- Create AI agent for chat
CREATE AGENT research_agent
USING model = 'gpt-4o',
      knowledge_bases = ['paper1_kb', 'paper2_kb'];

-- Get answers
SELECT answer FROM research_agent
WHERE question = 'What are the key findings?';

Benefits:

  • ✅ Simple, declarative SQL syntax
  • ✅ Automatic embedding generation and management
  • ✅ Built-in hybrid search
  • ✅ Managed RAG pipeline
  • ✅ Optimized vector search
  • ✅ Automatic reranking
  • ✅ Scalable infrastructure

Development time saved: Approximately 2-3 weeks of implementation and testing!


📚 Lessons Learned

1. MindsDB Simplifies AI Pipelines

Key Takeaway: What took weeks to build manually takes hours with MindsDB.

The knowledge base abstraction is incredibly powerful:

  • No need to manage embedding generation
  • No need to implement vector search algorithms
  • No need to build RAG pipelines from scratch
  • Focus on user experience, not infrastructure

2. SQL as an AI Interface is Powerful

Key Takeaway: Developers already know SQL. Why not use it for AI?

-- This is AI magic disguised as familiar SQL:
SELECT answer FROM ai_agent WHERE question = 'Summarize this paper';

The learning curve is minimal, but the possibilities are vast.

3. Hybrid Search is Essential

Key Takeaway: Don't force users to choose between semantic and keyword search.

Users want both:

  • Semantic for exploratory research
  • Keyword for specific citations/terms
  • Hybrid for balanced results

MindsDB makes this trivial with the hybrid_search_alpha parameter.

4. Dynamic Resource Creation Unlocks Flexibility

Key Takeaway: Creating KBs and agents on-the-fly enables powerful features.

Rather than one monolithic agent:

  • Create focused, specialized agents per session
  • Tailor knowledge bases to user selection
  • Scale horizontally (more sessions = more agents)
  • Clean architecture with clear boundaries
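
The flip side of creating resources per session is tearing them down when the session ends, so dynamically created agents and KBs don't accumulate. A sketch of the cleanup statements (DROP AGENT and DROP KNOWLEDGE_BASE are standard MindsDB commands; the naming convention is my own illustration):

```python
def build_session_teardown_sql(session_id, paper_ids):
    """Statements that retire a finished chat session's resources:
    the per-session agent first, then each per-paper knowledge base."""
    statements = [f"DROP AGENT agent_{session_id};"]
    statements += [f"DROP KNOWLEDGE_BASE paper_{pid}_kb;" for pid in paper_ids]
    return statements

cleanup = build_session_teardown_sql("abc123", ["123", "456"])
```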

5. Developer Experience Matters

Key Takeaway: Fast setup = more adoption.

Our automated startup script:

  • Eliminates manual configuration
  • Reduces errors
  • Gets new contributors productive immediately
  • Serves as living documentation

🌟 Real-World Use Cases

Use Case 1: Literature Review

Scenario: PhD student researching quantum error correction.

Workflow:

  1. Search: "quantum error correction near-term devices"
  2. Enable hybrid search (alpha = 0.7) for balanced results
  3. Filter to 2023-2024, arXiv + patents
  4. Select 4 most relevant papers
  5. Initiate chat
  6. Ask: "What are the different approaches to reducing qubit overhead?"
  7. Ask: "Which paper reports the best error rates?"
  8. Ask: "What are the experimental challenges mentioned?"

Outcome: 30-minute conversation replaces hours of reading and note-taking.

Use Case 2: Cross-Domain Research

Scenario: Biomedical engineer exploring AI applications in medicine.

Workflow:

  1. Search: "machine learning medical diagnostics"
  2. Select sources: arXiv, bioRxiv, medRxiv
  3. Hybrid search to catch both ML papers and medical papers
  4. Find papers bridging CS and medicine
  5. Chat with selected papers to understand interdisciplinary approaches

Outcome: Discovers connections between computer science and medical research that single-domain search would miss.

Use Case 3: Patent Analysis

Scenario: R&D team checking novelty of invention.

Workflow:

  1. Search: "graph neural networks semiconductor design"
  2. Enable patents corpus
  3. Filter to recent years (2022-2024)
  4. Review patents in relevant space
  5. Chat: "What techniques are patented in this domain?"
  6. Chat: "Are there any patents specifically covering [our approach]?"

Outcome: Efficient prior art search with AI-assisted analysis.

Use Case 4: Teaching & Learning

Scenario: Professor preparing lecture on CRISPR.

Workflow:

  1. Search: "CRISPR gene therapy clinical trials"
  2. Select foundational papers + recent advances
  3. Chat: "Explain the evolution from bench to bedside"
  4. Chat: "What are the key safety concerns?"
  5. Chat: "Suggest examples for undergraduate vs graduate courses"

Outcome: AI becomes a teaching assistant, helping structure educational content.


πŸ™ Acknowledgments

This project wouldn't exist without:

  • MindsDB Team: For building an incredible AI platform and hosting Hacktoberfest
  • OpenAI: For GPT-4o and embedding models
  • PostgreSQL & pgvector: For robust vector storage
  • FastAPI & React: For excellent developer experiences
  • The Open Source Community: For countless libraries and tools

Special thanks to MindsDB for making AI accessible to developers worldwide!


#MindsDB #Hacktoberfest #AI #SemanticSearch #AcademicResearch #OpenSource #RAG #VectorDatabase #GPT4 #Python #React


Made with ❤️ for MindsDB Hacktoberfest 2025
