Abhi

Building Semantica: An AI-Powered Academic Search Platform with MindsDB

Connecting ideas across the frontiers of knowledge


🎯 The Problem: Information Overload in Academic Research

Picture this: You're a researcher trying to explore cutting-edge developments in quantum computing. You fire up your favorite search engine and type "quantum error correction." What do you get? Thousands of results. Papers from arXiv, patents, preprints from bioRxiv, medRxiv, and chemRxiv, all scattered across different platforms, each with its own search interface.

But here's the real challenge:

  1. Keyword-only search fails to capture semantic meaning: Traditional search engines look for exact keyword matches. If you search for "machine learning in healthcare," you might miss groundbreaking papers that use terms like "artificial intelligence in medical diagnostics" or "neural networks for disease prediction."

  2. Fragmented knowledge sources: Academic papers live in silos (arXiv for physics and CS, bioRxiv for biology, medRxiv for medicine, chemRxiv for chemistry, and patents in yet another database). Researchers waste hours searching across multiple platforms.

  3. No contextual understanding: Found an interesting paper? Great! Now you need to read all 50 pages to understand if it's relevant to your research. What if you could just ask the paper questions?

  4. Comparing multiple papers is tedious: Want to understand how three different approaches to the same problem compare? You'll need to read all three papers, take notes, and synthesize the information yourself.

The research community needed something better.


💡 The Solution: Semantica

Enter Semantica, an intelligent academic search and chat platform that revolutionizes how researchers discover and interact with scientific literature.

Semantica solves these problems by:

✅ Semantic Search: Find papers by meaning, not just keywords

✅ Multi-Source Integration: Search across arXiv, bioRxiv, medRxiv, chemRxiv, and patents simultaneously

✅ AI-Powered Chat: Ask questions and get answers directly from papers

✅ Multi-Document Analysis: Compare and analyze up to 4 papers in a single conversation

✅ Hybrid Search: Combine traditional keyword search with semantic understanding


GitHub Repo - Semantica
Video Demo - Semantica


πŸ—οΈ How I Built It: The MindsDB Magic

The secret sauce behind Semantica is MindsDBβ€”an AI platform that transforms databases into AI-powered systems. Here's how we leveraged MindsDB's powerful features:

1. Knowledge Bases with pgvector Integration

At the heart of Semantica lies MindsDB's knowledge base functionality, which seamlessly integrates with PostgreSQL and the pgvector extension.

What I did:

  • Created a PostgreSQL database with the pgvector extension for efficient vector storage
  • Connected it to MindsDB using their database integration
  • Created a knowledge base that automatically generates embeddings using OpenAI's text-embedding-3-small model
# Sample startup code (client is the MindsDB SDK connection)
def setup_knowledge_base(kb_name="my_knowledge_base",
                         embedding_model="text-embedding-3-small",
                         db_name="my_pgvector",
                         storage_table="pgvector_storage_table"):
    """Create knowledge base in MindsDB with automatic embeddings"""

    create_kb_query = f"""
    CREATE KNOWLEDGE_BASE {kb_name}
    USING
        model = '{embedding_model}',
        storage = {db_name}.{storage_table};
    """

    client.query(create_kb_query)

Why this is powerful:

  • Automatic embeddings: MindsDB handles the complexity of generating and storing vector embeddings
  • Metadata filtering: We can filter by publication year, category, source, and more
  • Hybrid search: MindsDB supports combining semantic similarity with traditional SQL WHERE clauses

2. Semantic Search with Hybrid Capabilities

One of MindsDB's standout features is its hybrid search capability, which we extensively use in Semantica.

The search query looks like this:

SELECT article_id, metadata, relevance 
FROM my_knowledge_base
WHERE content = 'quantum error correction'
  AND hybrid_search = true 
  AND hybrid_search_alpha = 0.7
  AND source = 'arxiv'
  AND published_year = '2024';

Breaking it down:

  • content = 'query': Performs semantic similarity search
  • hybrid_search = true: Enables hybrid mode (semantic + keyword)
  • hybrid_search_alpha: Controls the balance (0.0 = pure keyword, 1.0 = pure semantic)
  • Additional WHERE clauses: Traditional SQL filtering on metadata

The result? Users can fine-tune their search from pure semantic understanding to traditional keyword matching, getting the best of both worlds!
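
In Semantica's backend, this query string is assembled in Python before being sent through the MindsDB SDK. Here is a minimal sketch of such a builder; the helper name, filter handling, and defaults are my own illustration, and real code must sanitize user input before interpolating it into SQL:

```python
def build_hybrid_search_sql(kb, query, alpha=0.7, filters=None, limit=20):
    """Assemble a MindsDB hybrid-search query against a knowledge base.

    NOTE: illustrative only -- production code should escape quotes in
    `query` and `filters` before interpolating them into SQL.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("hybrid_search_alpha must be between 0.0 and 1.0")
    clauses = [
        f"content = '{query}'",
        "hybrid_search = true",
        f"hybrid_search_alpha = {alpha}",
    ]
    # Metadata filters become ordinary SQL WHERE conditions
    for column, value in (filters or {}).items():
        clauses.append(f"{column} = '{value}'")
    where = "\n  AND ".join(clauses)
    return (f"SELECT article_id, metadata, relevance\n"
            f"FROM {kb}\nWHERE {where}\nLIMIT {limit};")

sql = build_hybrid_search_sql(
    "my_knowledge_base", "quantum error correction",
    alpha=0.7, filters={"source": "arxiv", "published_year": "2024"},
)
```

The generated string matches the shape of the query shown above, with the alpha and metadata filters plugged in from the API request.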

3. Dynamic AI Agents for Multi-Document Chat

Here's where things get really exciting. When a user selects papers to chat with, Semantica:

  1. Creates individual knowledge bases for each paper
   CREATE KNOWLEDGE_BASE paper_123_kb
   USING model = 'text-embedding-3-small',
         storage = my_pgvector.pgvector_storage_table;

   INSERT INTO paper_123_kb
   SELECT text, title, authors, abstract
   FROM my_pgvector.paper_raw
   WHERE article_id = '123' AND source = 'arxiv';
  2. Generates a custom AI agent with access to all selected papers
   CREATE AGENT research_assistant
   USING 
       model = 'gpt-4o',
       skills = [],
       knowledge_bases = ['paper_123_kb', 'paper_456_kb', 'paper_789_kb'],
       prompt_template = 'You are a research assistant...';
  3. Enables natural language queries across all papers
   SELECT answer 
   FROM research_assistant 
   WHERE question = 'Compare the methodologies used in these papers';

Why this approach is brilliant:

  • Each chat session gets a dedicated AI agent
  • The agent has deep understanding of all selected papers through their knowledge bases
  • MindsDB handles the RAG (Retrieval-Augmented Generation) pipeline automatically
  • Responses are grounded in the actual paper content, reducing hallucinations

4. Reranking for Improved Relevance

MindsDB's knowledge base supports reranking, which we leverage to improve search quality:

knowledge_base:
  reranking_model:
    provider: "openai"
    model_name: "gpt-4o"

Reranking takes the initial search results and uses a more powerful model (GPT-4o) to reorder them based on true relevance to the query. This two-stage approach provides:

  • Fast initial retrieval using vector similarity
  • High-quality ranking using advanced language understanding
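
Conceptually, the two stages look like this. The toy scorer below is only a stand-in for the GPT-4o reranker that MindsDB actually runs, and the dict shape is illustrative:

```python
def rerank(candidates, rerank_score, top_k=5):
    """Stage two of retrieval: reorder a short candidate list with a
    slower, higher-quality scorer (GPT-4o in Semantica's setup)."""
    return sorted(candidates, key=rerank_score, reverse=True)[:top_k]

# Stage one (fast vector search) already produced this ordered list;
# `llm_score` stands in for the reranker's relevance judgment.
candidates = [
    {"id": "a", "llm_score": 0.2},
    {"id": "b", "llm_score": 0.9},
    {"id": "c", "llm_score": 0.5},
]
ranked = rerank(candidates, lambda d: d["llm_score"], top_k=2)  # b, then c
```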

5. Background Jobs to Keep the Knowledge Base Fresh

MindsDB also supports scheduled jobs, which we leverage to keep the knowledge base updated with the latest papers:

CREATE JOB kb_sync (
   INSERT INTO my_knowledge_base (
      SELECT * FROM my_pgvector.paper_raw
   )
)
EVERY 1 day;

Each paper's text and metadata is first inserted into the Postgres table my_pgvector.paper_raw, from which the daily job loads it into the knowledge base.
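
The ingestion side just has to land each fetched paper in that staging table. A sketch of the row mapping (column names are inferred from the queries elsewhere in this post; the `entry` shape is a hypothetical fetcher output, not any specific API's schema):

```python
from datetime import date

def to_paper_raw_row(entry, source="arxiv"):
    """Map one fetched paper onto the columns of my_pgvector.paper_raw,
    the staging table the daily sync job reads from."""
    return {
        "article_id": entry["id"],
        "source": source,
        "title": entry["title"].strip(),
        "authors": ", ".join(entry.get("authors", [])),
        "abstract": entry.get("abstract", "").strip(),
        "text": entry["full_text"],
        "published_year": str(entry["published"].year),
    }

row = to_paper_raw_row({
    "id": "2401.01234",
    "title": " Qubit Overhead Reduction ",
    "authors": ["A. Doe", "B. Roe"],
    "abstract": "We propose...",
    "full_text": "Full paper text here.",
    "published": date(2024, 1, 5),
})
```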


πŸ› οΈ Technical Architecture

The Stack

┌─────────────────────────────────────┐
│   Frontend (React + TypeScript)     │
│   - Search interface                │
│   - Chat UI with PDF viewer         │
│   - Filter controls                 │
└─────────────┬───────────────────────┘
              │ REST API
┌─────────────▼───────────────────────┐
│   Backend (FastAPI + Python)        │
│   - /search endpoint                │
│   - /chat/initiate endpoint         │
│   - /chat/completion endpoint       │
└─────────────┬───────────────────────┘
              │ MindsDB SDK
┌─────────────▼───────────────────────┐
│   MindsDB Platform                  │
│   ┌───────────────────────────┐     │
│   │   Knowledge Bases         │     │
│   │   - Main KB (all papers)  │     │
│   │   - Per-paper KBs (chat)  │     │
│   └───────────────────────────┘     │
│   ┌───────────────────────────┐     │
│   │   AI Agents (GPT-4o)      │     │
│   │   - Dynamic generation    │     │
│   │   - Multi-KB access       │     │
│   └───────────────────────────┘     │
└─────────────┬───────────────────────┘
              │ PostgreSQL Protocol
┌─────────────▼───────────────────────┐
│   PostgreSQL + pgvector             │
│   - Vector embeddings               │
│   - Metadata storage                │
│   - Fast similarity search          │
└─────────────────────────────────────┘

Data Flow for Search

  1. User enters query: "machine learning for optics"
  2. Frontend sends request to /api/v1/search with query and filters
  3. Backend constructs MindsDB query:
   SELECT article_id, metadata, relevance
   FROM my_knowledge_base
   WHERE content = 'machine learning for optics'
     AND hybrid_search = true
     AND hybrid_search_alpha = 0.7;
  4. MindsDB processes the query:
    • Generates query embedding
    • Performs vector similarity search in pgvector
    • Applies filters
    • Reranks results
  5. Results transformed and returned to frontend
  6. User sees ranked papers with relevance scores
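
The "results transformed and returned" step can be sketched as a small formatter. The row and response shapes here are assumptions; in my experience MindsDB results often carry metadata as a JSON string, so the sketch handles both cases:

```python
import json

def format_results(rows):
    """Convert raw MindsDB result rows into the payload the frontend
    renders, sorted by relevance (highest first)."""
    papers = []
    for row in rows:
        meta = row["metadata"]
        if isinstance(meta, str):
            meta = json.loads(meta)
        papers.append({
            "article_id": row["article_id"],
            "title": meta.get("title", "Untitled"),
            "source": meta.get("source"),
            "relevance": round(float(row["relevance"]), 4),
        })
    # Highest relevance first, matching the ranked list users see
    return sorted(papers, key=lambda p: p["relevance"], reverse=True)

results = format_results([
    {"article_id": "1", "metadata": '{"title": "A", "source": "arxiv"}',
     "relevance": "0.81"},
    {"article_id": "2", "metadata": '{"title": "B", "source": "biorxiv"}',
     "relevance": "0.93"},
])
```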

Data Flow for Chat

  1. User selects papers (e.g., 3 papers about quantum computing)
  2. Frontend initiates chat via /api/v1/chat/initiate
  3. Backend creates infrastructure:
    • Creates 3 individual knowledge bases (one per paper)
    • Populates each KB with paper content
    • Creates an AI agent with access to all 3 KBs
    • Returns agent ID to frontend
  4. User asks question: "What are the key challenges?"
  5. Frontend sends message via /api/v1/chat/completion
  6. Backend queries agent:
   SELECT answer 
   FROM agent_abc123 
   WHERE question = 'What are the key challenges?';
  7. MindsDB agent:
    • Retrieves relevant context from all 3 knowledge bases
    • Constructs answer using GPT-4o
    • Returns grounded response
  8. Answer displayed in chat interface with markdown formatting
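
One detail worth sketching from the completion step: the user's question is interpolated into a SQL string literal, so single quotes must be doubled or the query breaks mid-sentence. A minimal helper (the function name is my own):

```python
def build_agent_query(agent_name, question):
    """Build the completion query; single quotes in the question are
    doubled so the SQL string literal stays valid."""
    safe_question = question.replace("'", "''")
    return f"SELECT answer FROM {agent_name} WHERE question = '{safe_question}';"

query = build_agent_query("agent_abc123", "What's the key challenge?")
```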

🚀 Key Features in Action

Feature 1: Multi-Source Semantic Search

Users can search across five different academic sources simultaneously:

Example Query: "CRISPR applications in gene therapy"

What happens behind the scenes:

  1. Query is embedded using OpenAI's embedding model
  2. MindsDB performs vector similarity search across all sources
  3. Results are filtered by user-selected corpora
  4. Reranking improves result quality
  5. Papers are returned ranked by relevance

The UX:

Search Results:
📄 "CRISPR-Cas9 Applications in Gene Therapy" (bioRxiv, 2023)
📄 "Therapeutic Genome Editing Methods" (patent, 2024)
📄 "Gene Therapy Advances Using CRISPR" (medRxiv, 2023)

Feature 2: Hybrid Search with Alpha Control

Users can slide between semantic and keyword search:

  • Alpha = 0.0: Pure keyword matching (fast, specific)
  • Alpha = 0.5: Balanced hybrid search
  • Alpha = 1.0: Pure semantic search (finds conceptually similar papers)

Real-world example:

  • Query: "neural networks"
  • Alpha 0.0: Returns papers with exact phrase "neural networks"
  • Alpha 1.0: Returns papers about "deep learning," "artificial neural systems," "connectionist models"
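
Under the hood, hybrid scoring is commonly a convex combination of the two signals. MindsDB's exact formula isn't shown here, but the alpha semantics match this sketch:

```python
def hybrid_score(semantic, keyword, alpha=0.5):
    """Blend a semantic-similarity score with a keyword-match score.
    alpha = 0.0 -> keyword only, alpha = 1.0 -> semantic only."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * semantic + (1 - alpha) * keyword

keyword_only = hybrid_score(0.9, 0.4, alpha=0.0)   # 0.4
semantic_only = hybrid_score(0.9, 0.4, alpha=1.0)  # 0.9
balanced = hybrid_score(0.9, 0.4, alpha=0.5)
```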

Feature 3: AI-Powered Multi-Document Chat

The killer feature! Select up to 4 papers and have a conversation:

Example conversation:

User: What are the main differences in methodology between these papers?

AI Agent: Based on the three papers you selected:

Paper 1 uses a supervised learning approach with labeled datasets...
Paper 2 employs reinforcement learning with reward shaping...
Paper 3 introduces an unsupervised method using contrastive learning...

The key distinction is in their learning paradigms, with Paper 1 requiring...

What makes this powerful:

  • The AI has read and understood all papers
  • Answers are grounded in actual paper content
  • Can compare, contrast, and synthesize information
  • Cites specific findings from the papers

Feature 4: Live PDF Viewing

While chatting, users can view the actual PDFs in a split-pane interface:

  • Left: PDF viewer with Google Docs integration
  • Right: Chat interface
  • Switch between papers with one click

📊 The Impact of MindsDB Features

Before MindsDB: The Traditional Approach

Building this without MindsDB would require:

# Manually generate embeddings
from openai import OpenAI
client = OpenAI(api_key=api_key)

def get_embedding(text):
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

# Manually implement vector search
import pgvector
def search_papers(query, limit=10):
    query_vector = get_embedding(query)
    cursor.execute("""
        SELECT id, metadata, 
               1 - (embedding <=> %s::vector) as similarity
        FROM papers
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """, (query_vector, query_vector, limit))
    return cursor.fetchall()

# Manually implement RAG
def answer_question(question, paper_ids):
    # 1. Retrieve relevant chunks from papers
    chunks = retrieve_relevant_chunks(question, paper_ids)

    # 2. Construct prompt with context
    context = "\n".join(chunks)
    prompt = f"Context: {context}\n\nQuestion: {question}\nAnswer:"

    # 3. Call GPT-4
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Manually implement hybrid search
def hybrid_search(query, alpha=0.5):
    semantic_results = vector_search(query)
    keyword_results = full_text_search(query)
    return combine_results(semantic_results, keyword_results, alpha)

Problems with this approach:

  • 😰 200+ lines of complex code
  • 🐛 Bug-prone embedding management
  • 🔧 Manual vector database optimization
  • 📈 Difficult to scale
  • ⚡ No built-in reranking
  • 🤯 Complex RAG pipeline implementation

After MindsDB: The Modern Approach

-- Create knowledge base (handles embeddings, storage, indexing)
CREATE KNOWLEDGE_BASE papers_kb
USING model = 'text-embedding-3-small',
      storage = postgres_db.papers;

-- Search with hybrid capabilities
SELECT * FROM papers_kb
WHERE content = 'quantum computing'
  AND hybrid_search = true
  AND hybrid_search_alpha = 0.7;

-- Create AI agent for chat
CREATE AGENT research_agent
USING model = 'gpt-4o',
      knowledge_bases = ['paper1_kb', 'paper2_kb'];

-- Get answers
SELECT answer FROM research_agent
WHERE question = 'What are the key findings?';

Benefits:

  • ✅ Simple, declarative SQL syntax
  • ✅ Automatic embedding generation and management
  • ✅ Built-in hybrid search
  • ✅ Managed RAG pipeline
  • ✅ Optimized vector search
  • ✅ Automatic reranking
  • ✅ Scalable infrastructure

Development time saved: Approximately 2-3 weeks of implementation and testing!


📚 Lessons Learned

1. MindsDB Simplifies AI Pipelines

Key Takeaway: What took weeks to build manually takes hours with MindsDB.

The knowledge base abstraction is incredibly powerful:

  • No need to manage embedding generation
  • No need to implement vector search algorithms
  • No need to build RAG pipelines from scratch
  • Focus on user experience, not infrastructure

2. SQL as an AI Interface is Powerful

Key Takeaway: Developers already know SQL. Why not use it for AI?

-- This is AI magic disguised as familiar SQL:
SELECT answer FROM ai_agent WHERE question = 'Summarize this paper';

The learning curve is minimal, but the possibilities are vast.

3. Hybrid Search is Essential

Key Takeaway: Don't force users to choose between semantic and keyword search.

Users want both:

  • Semantic for exploratory research
  • Keyword for specific citations/terms
  • Hybrid for balanced results

MindsDB makes this trivial with the hybrid_search_alpha parameter.

4. Dynamic Resource Creation Unlocks Flexibility

Key Takeaway: Creating KBs and agents on-the-fly enables powerful features.

Rather than one monolithic agent:

  • Create focused, specialized agents per session
  • Tailor knowledge bases to user selection
  • Scale horizontally (more sessions = more agents)
  • Clean architecture with clear boundaries
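
The flip side of creating resources per session is tearing them down when the session ends, so dynamically created agents and KBs don't accumulate. A sketch of the cleanup statements (DROP AGENT and DROP KNOWLEDGE_BASE are standard MindsDB commands; the naming convention is my own illustration):

```python
def build_session_teardown_sql(session_id, paper_ids):
    """Statements that retire a finished chat session's resources:
    the per-session agent first, then each per-paper knowledge base."""
    statements = [f"DROP AGENT agent_{session_id};"]
    statements += [f"DROP KNOWLEDGE_BASE paper_{pid}_kb;" for pid in paper_ids]
    return statements

cleanup = build_session_teardown_sql("abc123", ["123", "456"])
```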

5. Developer Experience Matters

Key Takeaway: Fast setup = more adoption.

Our automated startup script:

  • Eliminates manual configuration
  • Reduces errors
  • Gets new contributors productive immediately
  • Serves as living documentation

🌟 Real-World Use Cases

Use Case 1: Literature Review

Scenario: PhD student researching quantum error correction.

Workflow:

  1. Search: "quantum error correction near-term devices"
  2. Enable hybrid search (alpha = 0.7) for balanced results
  3. Filter to 2023-2024, arXiv + patents
  4. Select 4 most relevant papers
  5. Initiate chat
  6. Ask: "What are the different approaches to reducing qubit overhead?"
  7. Ask: "Which paper reports the best error rates?"
  8. Ask: "What are the experimental challenges mentioned?"

Outcome: 30-minute conversation replaces hours of reading and note-taking.

Use Case 2: Cross-Domain Research

Scenario: Biomedical engineer exploring AI applications in medicine.

Workflow:

  1. Search: "machine learning medical diagnostics"
  2. Select sources: arXiv, bioRxiv, medRxiv
  3. Hybrid search to catch both ML papers and medical papers
  4. Find papers bridging CS and medicine
  5. Chat with selected papers to understand interdisciplinary approaches

Outcome: Discovers connections between computer science and medical research that single-domain search would miss.

Use Case 3: Patent Analysis

Scenario: R&D team checking novelty of invention.

Workflow:

  1. Search: "graph neural networks semiconductor design"
  2. Enable patents corpus
  3. Filter to recent years (2022-2024)
  4. Review patents in relevant space
  5. Chat: "What techniques are patented in this domain?"
  6. Chat: "Are there any patents specifically covering [our approach]?"

Outcome: Efficient prior art search with AI-assisted analysis.

Use Case 4: Teaching & Learning

Scenario: Professor preparing lecture on CRISPR.

Workflow:

  1. Search: "CRISPR gene therapy clinical trials"
  2. Select foundational papers + recent advances
  3. Chat: "Explain the evolution from bench to bedside"
  4. Chat: "What are the key safety concerns?"
  5. Chat: "Suggest examples for undergraduate vs graduate courses"

Outcome: AI becomes a teaching assistant, helping structure educational content.


πŸ™ Acknowledgments

This project wouldn't exist without:

  • MindsDB Team: For building an incredible AI platform and hosting Hacktoberfest
  • OpenAI: For GPT-4o and embedding models
  • PostgreSQL & pgvector: For robust vector storage
  • FastAPI & React: For excellent developer experiences
  • The Open Source Community: For countless libraries and tools

Special thanks to MindsDB for making AI accessible to developers worldwide!


#MindsDB #Hacktoberfest #AI #SemanticSearch #AcademicResearch #OpenSource #RAG #VectorDatabase #GPT4 #Python #React


Made with ❤️ for MindsDB Hacktoberfest 2025
