Aashish Karki

Posted on Oct 28

Building the Academic Research Copilot: From ArXiv to Semantic Search in Minutes

#webdev #ai #tutorial #devchallenge

Finding the right paper shouldn’t feel like searching for a needle in a haystack. Keyword search misses context, titles can be misleading, and abstracts use different vocabulary for the same idea. The Academic Research Copilot addresses this with hybrid semantic search over ArXiv, combining vector embeddings with simple filters, so you can ask real questions and get relevant papers fast.

This post walks through the problem, the architecture, how it was built, and how you can run or extend it yourself.

TL;DR

Ingest ArXiv metadata → DuckDB
Create a MindsDB Knowledge Base → generate embeddings for title + summary
Query with natural language → get semantically similar papers
Serve via FastAPI and a Streamlit UI
Run locally with Docker, using Gemini embeddings

Repo: https://github.com/aashish079/academic-research-copilot

Repo folders to peek at:

src/data/fetch_papers.py – ArXiv → DuckDB
src/knowledge_base/kb_manager.py – KB creation + ingestion
src/knowledge_base/queries.py – semantic/hybrid queries (with a DuckDB fallback)
src/api/routes.py, src/app.py – FastAPI endpoints
src/ui/streamlit_app.py – Streamlit UI

The use case: question-first research

Researchers often start with a question, not a keyword. For example:

“privacy in federated learning”
“diffusion models for medical imaging”
“efficient attention variants in transformers”

Traditional search requires exactly the right words. Semantic search uses embeddings to find conceptually similar content, not just textual matches. Hybrid search then refines results with lightweight metadata filters (authors, year, categories) when needed.

Outcome: faster discovery, better recall, less time sifting PDFs.

Architecture

At a high level:

1) Fetch papers from ArXiv and store them in a local DuckDB papers table.
2) Register that DuckDB file inside MindsDB.
3) Create a Knowledge Base (KB) configured to embed title + summary.
4) Populate the KB using INSERT … SELECT (MindsDB generates embeddings automatically).
5) Expose clean HTTP APIs via FastAPI; Streamlit calls the API and renders results.

Data ingestion: ArXiv → DuckDB

We use the arxiv Python package to fetch results by topic and store them into DuckDB. Each paper is normalized to a consistent schema.

Key fields:

entry_id (primary key)
title
summary (abstract)
authors (comma-separated)
published_date
pdf_url
categories

Snippet (from src/data/fetch_papers.py):

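import arxiv

# Example search; the query and max_results here are illustrative values.
search = arxiv.Search(query="federated learning", max_results=50)

papers = []
# Newer arxiv releases prefer arxiv.Client().results(search).
for result in search.results():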
    papers.append({
        "entry_id": result.entry_id,
        "title": result.title,
        "summary": result.summary.replace("\n", " "),
        "authors": ", ".join([a.name for a in result.authors]),
        "published_date": result.published.date(),
        "pdf_url": result.pdf_url,
        "categories": ", ".join(result.categories),
    })

The script creates the DuckDB table if needed and upserts rows to avoid duplicates.
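
The create-if-needed and upsert step can be sketched as follows. This is a minimal illustration rather than the code in src/data/fetch_papers.py: the function name upsert_papers and the column types are assumptions, and the SQL relies on the ON CONFLICT clause that DuckDB supports. The function accepts any connection object exposing execute and executemany, such as the one returned by duckdb.connect.

```python
from typing import Iterable, Mapping

COLUMNS = ("entry_id", "title", "summary", "authors",
           "published_date", "pdf_url", "categories")

# Column types are assumed; the real script may declare them differently.
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS papers (
    entry_id       VARCHAR PRIMARY KEY,
    title          VARCHAR,
    summary        VARCHAR,
    authors        VARCHAR,
    published_date DATE,
    pdf_url        VARCHAR,
    categories     VARCHAR
)
"""

# On a primary-key conflict, overwrite the stored row with the incoming values.
UPSERT_SQL = (
    f"INSERT INTO papers ({', '.join(COLUMNS)}) "
    f"VALUES ({', '.join('?' for _ in COLUMNS)}) "
    "ON CONFLICT (entry_id) DO UPDATE SET "
    + ", ".join(f"{c} = excluded.{c}" for c in COLUMNS[1:])
)


def upsert_papers(con, papers: Iterable[Mapping[str, object]]) -> int:
    """Create the table if needed, then insert or update each paper by entry_id.

    Returns the number of rows submitted.
    """
    rows = [tuple(p[c] for c in COLUMNS) for p in papers]
    con.execute(CREATE_SQL)
    if rows:  # skip executemany when there is nothing to write
        con.executemany(UPSERT_SQL, rows)
    return len(rows)
```

With DuckDB this would be called as upsert_papers(duckdb.connect("papers.duckdb"), papers), where the file name is a placeholder. Running the fetch twice with the same entry_id values then updates rows in place instead of duplicating them.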

Knowledge Base: embeddings with MindsDB

We connect the DuckDB file inside MindsDB and create a KB named academic_kb whose embeddings are built over title and summary. In this project, we use Google Gemini text-embedding-004 by setting GEMINI_API_KEY.
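
One way to connect the DuckDB file is to send a CREATE DATABASE statement to MindsDB's HTTP SQL endpoint. The sketch below only builds the statement and the request. It assumes MindsDB is listening at its default local address and that the DuckDB file lives at a path you supply; the helper names are hypothetical, and the project itself may do this through the MindsDB SDK instead. The database name duckdb_papers matches the one used in the INSERT statement later in this section.

```python
import json
import urllib.request

MINDSDB_URL = "http://127.0.0.1:47334"  # assumed default local MindsDB address


def register_duckdb_sql(db_name: str, duckdb_path: str) -> str:
    """Build the CREATE DATABASE statement that points MindsDB at a DuckDB file."""
    params = json.dumps({"database": duckdb_path})
    return f"CREATE DATABASE {db_name} WITH ENGINE = 'duckdb', PARAMETERS = {params};"


def sql_request(sql: str, base_url: str = MINDSDB_URL) -> urllib.request.Request:
    """Wrap a SQL string in a POST request for MindsDB's /api/sql/query endpoint."""
    return urllib.request.Request(
        f"{base_url}/api/sql/query",
        data=json.dumps({"query": sql}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To execute it, pass the request to urllib.request.urlopen, for example urlopen(sql_request(register_duckdb_sql("duckdb_papers", "data/papers.duckdb"))), where the file path is a placeholder.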

Create KB (SQL idea shown in README):

CREATE KNOWLEDGE_BASE academic_kb
USING
  embedding_model = {
    "provider": "google",
    "model_name": "text-embedding-004",
    "api_key": "${GEMINI_API_KEY}"
  },
  content_columns = ['title','summary'],
  id_column = 'entry_id';

Populate the KB from DuckDB:

INSERT INTO academic_kb
SELECT entry_id, title, summary, authors, published_date, pdf_url, categories
FROM duckdb_papers.papers;

The KB stores vector embeddings and metadata. Query-time fields such as relevance and distance help rank results.

Fallback behavior: if the KB isn’t available, we degrade to DuckDB text search (LIKE over title/summary/categories) so the app keeps working.
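
A fallback of this kind might look like the sketch below. The function name build_fallback_query is hypothetical and the real logic in src/knowledge_base/queries.py may differ. It returns a parameterized LIKE query over title, summary, and categories, escaping LIKE wildcards so that user text is matched literally.

```python
from typing import List, Tuple


def build_fallback_query(query: str, limit: int = 10) -> Tuple[str, List[str]]:
    """Return (sql, params) for a substring search over the papers table."""
    # Escape the escape character first, then the LIKE wildcards % and _.
    escaped = query.replace("!", "!!").replace("%", "!%").replace("_", "!_")
    pattern = f"%{escaped}%"
    sql = (
        "SELECT entry_id, title, summary, authors, published_date, pdf_url, categories "
        "FROM papers "
        "WHERE title LIKE ? ESCAPE '!' "
        "OR summary LIKE ? ESCAPE '!' "
        "OR categories LIKE ? ESCAPE '!' "
        f"LIMIT {int(limit)}"
    )
    return sql, [pattern, pattern, pattern]
```

The result can be passed straight to a DuckDB connection: con.execute(sql, params).fetchall(). Note that LIKE is case-sensitive in DuckDB; swapping in ILIKE would make the fallback case-insensitive.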

Querying: semantic and hybrid search

All query logic is centralized in src/knowledge_base/queries.py. The app supports:

Basic semantic search (WHERE content = 'your query')
Thresholded semantic search (filter by relevance)
Hybrid search (semantic + metadata filters like author/year/category)

Example semantic search (SQL):

SELECT entry_id, title, summary, authors, published_date, pdf_url, categories,
       distance, relevance
FROM academic_kb
WHERE content = 'privacy in federated learning'
ORDER BY relevance DESC
LIMIT 10;
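
The thresholded variant adds a relevance condition to the same query. A small builder like the one below can generate both forms. This is a sketch: build_semantic_query is a hypothetical helper, and it assumes the KB accepts a relevance >= value filter in the WHERE clause and standard doubled-quote escaping for string literals.

```python
from typing import Optional


def build_semantic_query(query: str,
                         min_relevance: Optional[float] = None,
                         limit: int = 10) -> str:
    """Build a semantic search statement against academic_kb."""
    # Double single quotes so the user text stays inside the SQL literal
    # (assumes standard SQL escaping).
    text = query.replace("'", "''")
    where = f"content = '{text}'"
    if min_relevance is not None:
        # Thresholded search: keep only rows at or above the cutoff.
        where += f" AND relevance >= {float(min_relevance)}"
    return (
        "SELECT entry_id, title, summary, authors, published_date, pdf_url, categories, "
        "distance, relevance "
        "FROM academic_kb "
        f"WHERE {where} "
        "ORDER BY relevance DESC "
        f"LIMIT {int(limit)};"
    )
```

Calling build_semantic_query("privacy in federated learning") reproduces the basic query above, and passing min_relevance=0.5 yields the thresholded form.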

Python usage via the SDK (simplified idea):

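import asyncio
from src.knowledge_base.queries import query_academic_papers
# query_academic_papers is awaited in the app; outside async code, run it in an event loop.
results = asyncio.run(query_academic_papers("privacy in federated learning", limit=10))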

Hybrid constraints are applied either in SQL or post-filtered in Python, depending on the field.
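
The Python post-filtering path could be sketched like this. The function post_filter and its matching rules are assumptions for illustration, not the project's code: substring match on authors, exact match on one of the comma-separated categories, and year taken from published_date. Rows are expected as dictionaries with the same field names as the papers table.

```python
from datetime import date
from typing import Iterable, List, Mapping, Optional


def _year_of(value: object) -> Optional[int]:
    """Extract the year from a date object or an ISO-style string like '2023-05-01'."""
    if isinstance(value, date):
        return value.year
    try:
        return int(str(value)[:4])
    except ValueError:
        return None


def post_filter(results: Iterable[Mapping[str, object]],
                author: Optional[str] = None,
                year: Optional[int] = None,
                category: Optional[str] = None) -> List[Mapping[str, object]]:
    """Keep only the semantic hits that satisfy every metadata filter given."""
    kept = []
    for paper in results:
        # Case-insensitive substring match on the comma-separated author string.
        if author and author.lower() not in str(paper.get("authors", "")).lower():
            continue
        if year is not None and _year_of(paper.get("published_date")) != year:
            continue
        if category:
            cats = [c.strip() for c in str(paper.get("categories", "")).split(",")]
            if category not in cats:
                continue
        kept.append(paper)
    return kept
```

Because the filter runs after the semantic query, the input order (by relevance) is preserved; the trade-off is that a strict filter can leave fewer than the requested number of results.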

Serving: FastAPI + Streamlit

FastAPI endpoints (see src/api/routes.py):

POST /api/search
POST /api/search/semantic
POST /api/search/hybrid
GET /api/papers/{entry_id}
GET /api/health

The Streamlit UI (src/ui/streamlit_app.py) calls those APIs and renders the results list with titles, authors, abstracts, links, and relevance scores.

This separation keeps the UI thin and the backend reusable.
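
Because the backend is plain HTTP, any client can call it, not just Streamlit. The sketch below uses only the standard library. The request body fields query and limit are assumptions about the schema (check the API docs page for the real one), and the helper names are hypothetical; the port matches the API docs URL used when running locally.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # port taken from the local API docs URL


def build_search_request(query: str, limit: int = 10,
                         endpoint: str = "/api/search/semantic") -> urllib.request.Request:
    """Build a POST request for one of the search endpoints."""
    payload = {"query": query, "limit": limit}  # hypothetical body schema
    return urllib.request.Request(
        API_BASE + endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query: str, limit: int = 10):
    """Send the request and decode the JSON response (requires the API to be running)."""
    with urllib.request.urlopen(build_search_request(query, limit), timeout=30) as resp:
        return json.load(resp)
```

With the services up, search("privacy in federated learning") would return whatever JSON the endpoint produces; the response shape is not specified here.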

Running it locally

Choose Docker or bare metal.

Docker (recommended):

# 1) Configure env
cp .env.docker.example .env
# (edit GEMINI_API_KEY)

# 2) Start services
docker compose up --build

# 3) Populate KB (first run)
docker compose exec academic_research_copilot python scripts/populate_kb.py

# 4) Open apps
# UI       → http://localhost:8501
# API Docs → http://localhost:8000/api/docs

Bare metal:

# 1) Configure env
cp .env.example .env
# (edit GEMINI_API_KEY)

# 2) Create & activate venv (zsh/bash)
python -m venv venv
source venv/bin/activate

# 3) Install deps
pip install -r requirements.txt

# 4) Populate KB (first run)
python scripts/populate_kb.py

# 5) Start apps (two terminals)
uvicorn src.app:app --reload
streamlit run src/ui/streamlit_app.py

Tip: The first populate run can take several minutes (fetching papers + generating embeddings). Subsequent runs are much faster.

What makes this effective

Semantic recall: finds conceptually similar papers even when vocabulary differs.
Hybrid control: tighten the net with author/year/category filters as needed.
Local-first: DuckDB + MindsDB run locally and are easy to containerize, with the embedding model set in the KB configuration.
Clean API surface: a small set of focused endpoints for the UI or other clients.

What’s next

Reranking: add a cross-encoder or LLM reranker on the top 20 candidates.
Query understanding: expand to multi-query rewriting or synonyms.
Summarization: on-demand TL;DR and key-takeaways for each paper.
Collections: let users save, label, and export reading lists.

Closing

The Academic Research Copilot is a pragmatic, local-first way to bring semantic search to your research workflow. It’s lightweight, fast to run, and easy to extend. Whether you’re curious about a new topic or deep into a literature review, this setup can save you time and surface more relevant results.

Happy researching! 🎓

DEV Community