Connecting ideas across the frontiers of knowledge
The Problem: Information Overload in Academic Research
Picture this: You're a researcher trying to explore cutting-edge developments in quantum computing. You fire up your favorite search engine and type "quantum error correction." What do you get? Thousands of results. Papers from arXiv, patents, preprints from bioRxiv, medRxiv, and chemRxiv, all scattered across different platforms, each with its own search interface.
But here's the real challenge:
Keyword-only search fails to capture semantic meaning: Traditional search engines look for exact keyword matches. If you search for "machine learning in healthcare," you might miss groundbreaking papers that use terms like "artificial intelligence in medical diagnostics" or "neural networks for disease prediction."
Fragmented knowledge sources: Academic papers live in silos: arXiv for physics and CS, bioRxiv for biology, medRxiv for medicine, chemRxiv for chemistry, and patents in yet another database. Researchers waste hours searching across multiple platforms.
No contextual understanding: Found an interesting paper? Great! Now you need to read all 50 pages to understand if it's relevant to your research. What if you could just ask the paper questions?
Comparing multiple papers is tedious: Want to understand how three different approaches to the same problem compare? You'll need to read all three papers, take notes, and synthesize the information yourself.
The research community needed something better.
The Solution: Semantica
Enter Semantica: an intelligent academic search and chat platform that revolutionizes how researchers discover and interact with scientific literature.
Semantica solves these problems by:
- Semantic Search: Find papers by meaning, not just keywords
- Multi-Source Integration: Search across arXiv, bioRxiv, medRxiv, chemRxiv, and patents simultaneously
- AI-Powered Chat: Ask questions and get answers directly from papers
- Multi-Document Analysis: Compare and analyze up to 4 papers in a single conversation
- Hybrid Search: Combine traditional keyword search with semantic understanding
GitHub Repo - Semantica
Video Demo - Semantica
How I Built It: The MindsDB Magic
The secret sauce behind Semantica is MindsDB, an AI platform that transforms databases into AI-powered systems. Here's how we leveraged MindsDB's powerful features:
1. Knowledge Bases with pgvector Integration
At the heart of Semantica lies MindsDB's knowledge base functionality, which seamlessly integrates with PostgreSQL and the pgvector extension.
What I did:
- Created a PostgreSQL database with the pgvector extension for efficient vector storage
- Connected it to MindsDB using their database integration
- Created a knowledge base that automatically generates embeddings using OpenAI's text-embedding-3-small model
# Sample startup code (names and connection details are placeholders)
kb_name = "my_knowledge_base"
embedding_model = "text-embedding-3-small"
db_name = "my_pgvector"                  # Postgres + pgvector datasource registered in MindsDB
storage_table = "pgvector_storage_table"

def setup_knowledge_base():
    """Create a knowledge base in MindsDB with automatic embeddings."""
    # `client` is the MindsDB connection created earlier in the startup script
    create_kb_query = f"""
        CREATE KNOWLEDGE_BASE {kb_name}
        USING
            model = '{embedding_model}',
            storage = {db_name}.{storage_table};
    """
    client.query(create_kb_query)
Why this is powerful:
- Automatic embeddings: MindsDB handles the complexity of generating and storing vector embeddings
- Metadata filtering: We can filter by publication year, category, source, and more
- Hybrid search: MindsDB supports combining semantic similarity with traditional SQL WHERE clauses
2. Semantic Search with Hybrid Capabilities
One of MindsDB's standout features is its hybrid search capability, which we extensively use in Semantica.
The search query looks like this:
SELECT article_id, metadata, relevance
FROM my_knowledge_base
WHERE content = 'quantum error correction'
AND hybrid_search = true
AND hybrid_search_alpha = 0.7
AND source = 'arxiv'
AND published_year = '2024';
Breaking it down:
- content = 'query': Performs semantic similarity search
- hybrid_search = true: Enables hybrid mode (semantic + keyword)
- hybrid_search_alpha: Controls the balance (0.0 = pure keyword, 1.0 = pure semantic)
- Additional WHERE clauses: Traditional SQL filtering on metadata
The result? Users can fine-tune their search from pure semantic understanding to traditional keyword matching, getting the best of both worlds!
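On the backend, this query is assembled from the user's filters and the alpha value coming from the UI slider. Below is a minimal sketch of such a query builder; the function name, parameters, and naive quote escaping are illustrative rather than the exact code in the repo.

from typing import Optional

def build_search_query(kb_name: str, query: str, alpha: float = 0.7,
                       source: Optional[str] = None,
                       year: Optional[str] = None) -> str:
    """Assemble a MindsDB hybrid-search statement from user-supplied filters."""
    safe_query = query.replace("'", "''")   # naive escaping, enough for a sketch
    sql = (
        f"SELECT article_id, metadata, relevance FROM {kb_name} "
        f"WHERE content = '{safe_query}' "
        f"AND hybrid_search = true "
        f"AND hybrid_search_alpha = {alpha}"
    )
    if source:
        sql += f" AND source = '{source}'"
    if year:
        sql += f" AND published_year = '{year}'"
    return sql + ";"

# Example: a balanced hybrid search restricted to 2024 arXiv papers
print(build_search_query("my_knowledge_base", "quantum error correction",
                         alpha=0.7, source="arxiv", year="2024"))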
3. Dynamic AI Agents for Multi-Document Chat
Here's where things get really exciting. When a user selects papers to chat with, Semantica:
- Creates individual knowledge bases for each paper
CREATE KNOWLEDGE_BASE paper_123_kb
USING model = 'text-embedding-3-small',
storage = my_pgvector.pgvector_storage_table;
INSERT INTO paper_123_kb
SELECT text, title, authors, abstract
FROM my_pgvector.paper_raw
WHERE article_id = '123' AND source = 'arxiv';
- Generates a custom AI agent with access to all selected papers
CREATE AGENT research_assistant
USING
model = 'gpt-4o',
skills = [],
knowledge_bases = ['paper_123_kb', 'paper_456_kb', 'paper_789_kb'],
prompt_template = 'You are a research assistant...';
- Enables natural language queries across all papers
SELECT answer
FROM research_assistant
WHERE question = 'Compare the methodologies used in these papers';
Why this approach is brilliant:
- Each chat session gets a dedicated AI agent
- The agent has deep understanding of all selected papers through their knowledge bases
- MindsDB handles the RAG (Retrieval-Augmented Generation) pipeline automatically
- Responses are grounded in the actual paper content, reducing hallucinations
4. Reranking for Improved Relevance
MindsDB's knowledge base supports reranking, which we leverage to improve search quality:
knowledge_base:
reranking_model:
provider: "openai"
model_name: "gpt-4o"
Reranking takes the initial search results and uses a more powerful model (GPT-4o) to reorder them based on true relevance to the query. This two-stage approach provides:
- Fast initial retrieval using vector similarity
- High-quality ranking using advanced language understanding
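MindsDB applies this step automatically once the reranking model is configured; the sketch below is not MindsDB's internal implementation, only a hand-rolled approximation of the same two-stage pattern to make the idea concrete (the scoring prompt and helper names are invented for illustration).

from openai import OpenAI

openai_client = OpenAI()   # reads OPENAI_API_KEY from the environment

def rerank(query, candidates, top_k=10):
    """Stage 2: ask a stronger model to score each fast vector-search hit for relevance."""
    scored = []
    for doc in candidates:              # `candidates` come from stage 1 (vector similarity)
        response = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": (
                    f"Rate from 0 to 10 how relevant this abstract is to the query "
                    f"'{query}'. Reply with a number only.\n\n{doc['abstract']}"
                ),
            }],
        )
        try:
            score = float(response.choices[0].message.content.strip())
        except ValueError:
            score = 0.0                  # unparseable reply: push it to the bottom
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]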
5. Background Jobs to Keep the Knowledge Base Current
MindsDB supports scheduled jobs, which we leverage to keep the knowledge base updated with the latest papers:
CREATE JOB kb_sync (
INSERT INTO kv_kb (
SELECT * FROM my_pgvector.paper_raw
)
)
EVERY 1 day;
Each paper's text and metadata are first written to the Postgres table my_pgvector.paper_raw, and the job then inserts new rows from that table into the knowledge base.
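The ingestion side is plain Postgres work: fetch new papers from each source's API and append rows to the staging table so the nightly job picks them up. A minimal sketch of the insert step is below; the fetching code is omitted, the column list is inferred from the queries above, and paper_raw is the underlying table that MindsDB exposes as my_pgvector.paper_raw.

import psycopg2

def insert_papers(dsn, papers):
    """Write freshly fetched papers into the staging table the MindsDB job reads from.

    `papers` is a list of dicts such as:
    {"article_id": "...", "source": "arxiv", "title": "...",
     "authors": "...", "abstract": "...", "text": "..."}
    """
    conn = psycopg2.connect(dsn)
    with conn, conn.cursor() as cur:
        for p in papers:
            cur.execute(
                """
                INSERT INTO paper_raw (article_id, source, title, authors, abstract, text)
                VALUES (%s, %s, %s, %s, %s, %s)
                ON CONFLICT DO NOTHING
                """,
                (p["article_id"], p["source"], p["title"],
                 p["authors"], p["abstract"], p["text"]),
            )
    conn.close()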
Technical Architecture
The Stack
+---------------------------------------+
|     Frontend (React + TypeScript)     |
|  - Search interface                   |
|  - Chat UI with PDF viewer            |
|  - Filter controls                    |
+------------------+--------------------+
                   | REST API
+------------------v--------------------+
|      Backend (FastAPI + Python)       |
|  - /search endpoint                   |
|  - /chat/initiate endpoint            |
|  - /chat/completion endpoint          |
+------------------+--------------------+
                   | MindsDB SDK
+------------------v--------------------+
|           MindsDB Platform            |
|  +---------------------------------+  |
|  |        Knowledge Bases          |  |
|  |  - Main KB (all papers)         |  |
|  |  - Per-paper KBs (chat)         |  |
|  +---------------------------------+  |
|  +---------------------------------+  |
|  |       AI Agents (GPT-4o)        |  |
|  |  - Dynamic generation           |  |
|  |  - Multi-KB access              |  |
|  +---------------------------------+  |
+------------------+--------------------+
                   | PostgreSQL Protocol
+------------------v--------------------+
|        PostgreSQL + pgvector          |
|  - Vector embeddings                  |
|  - Metadata storage                   |
|  - Fast similarity search             |
+---------------------------------------+
Data Flow for Search
- User enters query: "machine learning for optics"
- Frontend sends a request to /api/v1/search with the query and filters
- Backend constructs the MindsDB query:
SELECT article_id, metadata, relevance
FROM my_knowledge_base
WHERE content = 'machine learning for optics'
AND hybrid_search = true
AND hybrid_search_alpha = 0.7;
- MindsDB processes the query:
  - Generates the query embedding
  - Performs vector similarity search in pgvector
  - Applies metadata filters
  - Reranks results
- Results are transformed and returned to the frontend
- User sees ranked papers with relevance scores
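Put together, the /search endpoint is little more than the query builder sketched earlier plus the MindsDB client. A condensed sketch follows; the real endpoint also handles pagination and multiple corpora, and the DataFrame-style result shape is an assumption about the client in use.

from typing import Optional
from fastapi import FastAPI

app = FastAPI()

@app.get("/api/v1/search")
def search(q: str, alpha: float = 0.7,
           source: Optional[str] = None, year: Optional[str] = None):
    """Build the hybrid-search SQL, run it through MindsDB, and return ranked papers."""
    sql = build_search_query("my_knowledge_base", q, alpha=alpha,
                             source=source, year=year)
    rows = client.query(sql).fetch()    # assumes a DataFrame-like result, as in the MindsDB Python SDK
    return {"results": rows.to_dict(orient="records")}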
Data Flow for Chat
- User selects papers (e.g., 3 papers about quantum computing)
- Frontend initiates the chat via /api/v1/chat/initiate
- Backend creates the infrastructure:
  - Creates 3 individual knowledge bases (one per paper)
  - Populates each KB with the paper's content
  - Creates an AI agent with access to all 3 KBs
  - Returns the agent ID to the frontend
- User asks a question: "What are the key challenges?"
- Frontend sends the message via /api/v1/chat/completion
- Backend queries the agent:
SELECT answer
FROM agent_abc123
WHERE question = 'What are the key challenges?';
- MindsDB agent:
  - Retrieves relevant context from all 3 knowledge bases
  - Constructs the answer using GPT-4o
  - Returns a grounded response
- Answer displayed in chat interface with markdown formatting
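The completion endpoint follows the same pattern, only it queries the session's agent instead of the knowledge base. Continuing the sketch above, with the agent name coming from the initiate step and the result shape again assumed to be DataFrame-like:

@app.post("/api/v1/chat/completion")
def chat_completion(agent_name: str, question: str):
    """Forward the user's question to the session's MindsDB agent and return its answer."""
    safe_question = question.replace("'", "''")   # naive escaping, enough for a sketch
    rows = client.query(
        f"SELECT answer FROM {agent_name} WHERE question = '{safe_question}';"
    ).fetch()
    return {"answer": rows["answer"].iloc[0]}     # first row holds the grounded response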
Key Features in Action
Feature 1: Multi-Source Semantic Search
Users can search across five different academic sources simultaneously:
Example Query: "CRISPR applications in gene therapy"
What happens behind the scenes:
- Query is embedded using OpenAI's embedding model
- MindsDB performs vector similarity search across all sources
- Results are filtered by user-selected corpora
- Reranking improves result quality
- Papers are returned ranked by relevance
The UX:
Search Results:
- "CRISPR-Cas9 Applications in Gene Therapy" (bioRxiv, 2023)
- "Therapeutic Genome Editing Methods" (patent, 2024)
- "Gene Therapy Advances Using CRISPR" (medRxiv, 2023)
Feature 2: Hybrid Search with Alpha Control
Users can slide between semantic and keyword search:
- Alpha = 0.0: Pure keyword matching (fast, specific)
- Alpha = 0.5: Balanced hybrid search
- Alpha = 1.0: Pure semantic search (finds conceptually similar papers)
Real-world example:
- Query: "neural networks"
- Alpha 0.0: Returns papers with exact phrase "neural networks"
- Alpha 1.0: Returns papers about "deep learning," "artificial neural systems," "connectionist models"
Feature 3: AI-Powered Multi-Document Chat
The killer feature! Select up to 4 papers and have a conversation:
Example conversation:
User: What are the main differences in methodology between these papers?
AI Agent: Based on the three papers you selected:
Paper 1 uses a supervised learning approach with labeled datasets...
Paper 2 employs reinforcement learning with reward shaping...
Paper 3 introduces an unsupervised method using contrastive learning...
The key distinction is in their learning paradigms, with Paper 1 requiring...
What makes this powerful:
- The AI has read and understood all papers
- Answers are grounded in actual paper content
- Can compare, contrast, and synthesize information
- Cites specific findings from the papers
Feature 4: Live PDF Viewing
While chatting, users can view the actual PDFs in a split-pane interface:
- Left: PDF viewer with Google Docs integration
- Right: Chat interface
- Switch between papers with one click
The Impact of MindsDB Features
Before MindsDB: The Traditional Approach
Building this without MindsDB would require:
# Manually generate embeddings
from openai import OpenAI

client = OpenAI(api_key=api_key)

def get_embedding(text):
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

# Manually implement vector search
# (`cursor` is a psycopg2 cursor on the papers database, with pgvector installed)
import pgvector

def search_papers(query, limit=10):
    query_vector = get_embedding(query)
    cursor.execute("""
        SELECT id, metadata,
               1 - (embedding <=> %s::vector) AS similarity
        FROM papers
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """, (query_vector, query_vector, limit))
    return cursor.fetchall()

# Manually implement RAG
def answer_question(question, paper_ids):
    # 1. Retrieve relevant chunks from papers
    chunks = retrieve_relevant_chunks(question, paper_ids)
    # 2. Construct prompt with context
    context = "\n".join(chunks)
    prompt = f"Context: {context}\n\nQuestion: {question}\nAnswer:"
    # 3. Call GPT-4o
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Manually implement hybrid search
def hybrid_search(query, alpha=0.5):
    semantic_results = vector_search(query)
    keyword_results = full_text_search(query)
    return combine_results(semantic_results, keyword_results, alpha)
Problems with this approach:
- 200+ lines of complex code
- Bug-prone embedding management
- Manual vector database optimization
- Difficult to scale
- No built-in reranking
- Complex RAG pipeline implementation
After MindsDB: The Modern Approach
-- Create knowledge base (handles embeddings, storage, indexing)
CREATE KNOWLEDGE_BASE papers_kb
USING model = 'text-embedding-3-small',
storage = postgres_db.papers;
-- Search with hybrid capabilities
SELECT * FROM papers_kb
WHERE content = 'quantum computing'
AND hybrid_search = true
AND hybrid_search_alpha = 0.7;
-- Create AI agent for chat
CREATE AGENT research_agent
USING model = 'gpt-4o',
knowledge_bases = ['paper1_kb', 'paper2_kb'];
-- Get answers
SELECT answer FROM research_agent
WHERE question = 'What are the key findings?';
Benefits:
- Simple, declarative SQL syntax
- Automatic embedding generation and management
- Built-in hybrid search
- Managed RAG pipeline
- Optimized vector search
- Automatic reranking
- Scalable infrastructure
Development time saved: Approximately 2-3 weeks of implementation and testing!
Lessons Learned
1. MindsDB Simplifies AI Pipelines
Key Takeaway: What took weeks to build manually takes hours with MindsDB.
The knowledge base abstraction is incredibly powerful:
- No need to manage embedding generation
- No need to implement vector search algorithms
- No need to build RAG pipelines from scratch
- Focus on user experience, not infrastructure
2. SQL as an AI Interface is Powerful
Key Takeaway: Developers already know SQL. Why not use it for AI?
-- This is AI magic disguised as familiar SQL:
SELECT answer FROM ai_agent WHERE question = 'Summarize this paper';
The learning curve is minimal, but the possibilities are vast.
3. Hybrid Search is Essential
Key Takeaway: Don't force users to choose between semantic and keyword search.
Users want both:
- Semantic for exploratory research
- Keyword for specific citations/terms
- Hybrid for balanced results
MindsDB makes this trivial with the hybrid_search_alpha parameter.
4. Dynamic Resource Creation Unlocks Flexibility
Key Takeaway: Creating KBs and agents on-the-fly enables powerful features.
Rather than one monolithic agent:
- Create focused, specialized agents per session
- Tailor knowledge bases to user selection
- Scale horizontally (more sessions = more agents)
- Clean architecture with clear boundaries
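The flip side of creating resources per session is tearing them down when the session ends. A small sketch of that cleanup is below, assuming the agent and KB names returned by the create_chat_session sketch earlier and MindsDB's DROP statements (worth double-checking the exact syntax against the current docs).

def teardown_chat_session(client, agent_name, kb_names):
    """Remove the per-session agent and its knowledge bases once the chat is over."""
    client.query(f"DROP AGENT {agent_name};")
    for kb_name in kb_names:
        client.query(f"DROP KNOWLEDGE_BASE {kb_name};")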
5. Developer Experience Matters
Key Takeaway: Fast setup = more adoption.
Our automated startup script:
- Eliminates manual configuration
- Reduces errors
- Gets new contributors productive immediately
- Serves as living documentation
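For reference, the startup flow boils down to a few declarative statements run in order. The compressed sketch below uses placeholder connection parameters, and the engine name follows MindsDB's pgvector integration; verify both against the docs for your MindsDB version.

def bootstrap(client):
    """One-shot setup: register Postgres/pgvector, create the main KB, schedule the sync job."""
    # 1. Register the Postgres + pgvector database (parameters are placeholders)
    client.query("""
        CREATE DATABASE my_pgvector
        WITH ENGINE = 'pgvector',
        PARAMETERS = {
            "host": "localhost", "port": 5432,
            "database": "semantica", "user": "postgres", "password": "..."
        };
    """)
    # 2. Create the main knowledge base on top of it (same statement as in section 1)
    client.query("""
        CREATE KNOWLEDGE_BASE my_knowledge_base
        USING model = 'text-embedding-3-small',
              storage = my_pgvector.pgvector_storage_table;
    """)
    # 3. Schedule the daily job that keeps the KB in sync with paper_raw
    client.query("""
        CREATE JOB kb_sync (
            INSERT INTO my_knowledge_base (SELECT * FROM my_pgvector.paper_raw)
        ) EVERY 1 day;
    """)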
Real-World Use Cases
Use Case 1: Literature Review
Scenario: PhD student researching quantum error correction.
Workflow:
- Search: "quantum error correction near-term devices"
- Enable hybrid search (alpha = 0.7) for balanced results
- Filter to 2023-2024, arXiv + patents
- Select 4 most relevant papers
- Initiate chat
- Ask: "What are the different approaches to reducing qubit overhead?"
- Ask: "Which paper reports the best error rates?"
- Ask: "What are the experimental challenges mentioned?"
Outcome: 30-minute conversation replaces hours of reading and note-taking.
Use Case 2: Cross-Domain Research
Scenario: Biomedical engineer exploring AI applications in medicine.
Workflow:
- Search: "machine learning medical diagnostics"
- Select sources: arXiv, bioRxiv, medRxiv
- Hybrid search to catch both ML papers and medical papers
- Find papers bridging CS and medicine
- Chat with selected papers to understand interdisciplinary approaches
Outcome: Discovers connections between computer science and medical research that single-domain search would miss.
Use Case 3: Patent Analysis
Scenario: R&D team checking novelty of invention.
Workflow:
- Search: "graph neural networks semiconductor design"
- Enable patents corpus
- Filter to recent years (2022-2024)
- Review patents in relevant space
- Chat: "What techniques are patented in this domain?"
- Chat: "Are there any patents specifically covering [our approach]?"
Outcome: Efficient prior art search with AI-assisted analysis.
Use Case 4: Teaching & Learning
Scenario: Professor preparing lecture on CRISPR.
Workflow:
- Search: "CRISPR gene therapy clinical trials"
- Select foundational papers + recent advances
- Chat: "Explain the evolution from bench to bedside"
- Chat: "What are the key safety concerns?"
- Chat: "Suggest examples for undergraduate vs graduate courses"
Outcome: AI becomes a teaching assistant, helping structure educational content.
Acknowledgments
This project wouldn't exist without:
- MindsDB Team: For building an incredible AI platform and hosting Hacktoberfest
- OpenAI: For GPT-4o and embedding models
- PostgreSQL & pgvector: For robust vector storage
- FastAPI & React: For excellent developer experiences
- The Open Source Community: For countless libraries and tools
Special thanks to MindsDB for making AI accessible to developers worldwide!
#MindsDB #Hacktoberfest #AI #SemanticSearch #AcademicResearch #OpenSource #RAG #VectorDatabase #GPT4 #Python #React
Made with ❤️ for MindsDB Hacktoberfest 2025
