Teycir Ben Soltane

Posted on Jun 8

I Built a Semantic arXiv Paper Search Engine on Cloudflare's Edge — Here's Everything I Learned

#ai #webdev #productivity #semantic

Fast hybrid FTS5 + vector search, AI summaries, local Ollama bulk ingestion, and zero login. All running at the edge with Cloudflare D1, Vectorize, Workers AI, and Next.js 16.

If you've ever tried to search arXiv properly, you know the pain. The official search is keyword-only, the UI is frozen in 2004, and there's no AI-assisted understanding to help you quickly grasp whether a paper is even worth reading.

So I built ArxivExplorer — a full-stack semantic paper search engine with AI-generated summaries, author pages, claim classification, bookmarks, and a CLI for AI assistants. It runs entirely on Cloudflare's edge platform with no traditional server and no login required.

This post is a deep technical writeup of every interesting decision I made along the way.

The Stack at a Glance

Next.js 16 (Worker mode via OpenNext)
  └── Cloudflare Worker (frontend)
  └── Cloudflare Worker (API)
  └── Cloudflare Worker (ingest / cron)
        └── Cloudflare D1 (SQLite + FTS5)
        └── Cloudflare Vectorize (768-dim BGE embeddings)
        └── Cloudflare KV (caching layer)
        └── Workers AI (Llama 3.1 + BGE)
        └── Ollama (local bulk processing)

Frontend: Next.js 16, Tailwind, Framer Motion, Three.js
API: Cloudflare Workers (TypeScript)
Database: Cloudflare D1 (SQLite) with FTS5 virtual tables
Vector Search: Cloudflare Vectorize (cosine similarity, 768 dims)
Cache: Cloudflare KV with tiered TTLs
AI (live): Workers AI — llama-3.1-8b-instruct + bge-base-en-v1.5
AI (bulk): Local Ollama — gemma4:e4b (summaries) + nomic-embed-text (embeddings)

Why Deploy Next.js as a Worker, Not Cloudflare Pages?

This was the first non-obvious architectural decision.

Cloudflare Pages unconditionally injects a per-request nonce into script-src at the CDN layer — before your response reaches the browser. There's no way to override this with middleware or a _headers file, because the injection happens downstream.

That nonce breaks the app's own Content Security Policy.

The solution: deploy via OpenNext's Cloudflare adapter in main + assets mode, which produces a standard Worker. The Worker has no such injection and serves the app's CSP intact.

npx opennextjs-cloudflare build
wrangler deploy --config wrangler.jsonc

The output is .open-next/worker.js plus a static asset bundle — clean, no surprises.

The Hybrid Search Algorithm

Pure keyword search misses semantic matches. Pure vector search misses exact terms. The winning approach is a weighted hybrid.

GET /api/search?q=attention+mechanisms

The flow inside search.ts:

1. Normalize & sanitize query
2. Check KV cache (2h TTL) → early return if hit
3. In parallel:
   a. D1 FTS5 keyword search  (top 30, title boosted 10:1:5)
   b. Vectorize semantic search (top 30, query embedding cached 24h)
4. Merge: 25% keyword weight · 75% semantic weight
5. Quality gate: drop results below 70% of the best relative score
6. Return top 20, write to KV

The weights came from empirical tuning. Users searching for "transformer architecture" should surface papers about attention mechanisms and positional encodings, not just literal matches to "transformer architecture." 75% semantic covers that. The 25% keyword weight ensures exact-term papers don't get buried.

The quality gate at step 5 is critical. Without it, a broad query returns papers where the 20th result is barely related to the 1st. Dropping anything below 70% of the top score keeps the result set tight.

const KEYWORD_WEIGHT = 0.25;
const SEMANTIC_WEIGHT = 0.75;
const VECTORIZE_TOP_K = 30;
const FTS_TOP_K = 30;
const MIN_RELATIVE_SCORE = 0.70;
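As a rough illustration of steps 4 to 6, here is a minimal sketch of the merge and quality gate. The Scored shape, the normalizeScores helper, and the divide-by-best normalization are assumptions made for illustration, not the actual search.ts code. It also assumes both inputs are already "higher is better", so raw bm25() values would need negating first.

```typescript
// Illustrative sketch only: names and normalization are assumptions, not the real search.ts.
interface Scored {
  id: string;
  score: number; // higher = better; negate raw bm25() values before passing them in
}

const KEYWORD_WEIGHT = 0.25;
const SEMANTIC_WEIGHT = 0.75;
const MIN_RELATIVE_SCORE = 0.70;
const RESULT_LIMIT = 20;

// Scale each list into [0, 1] by dividing by its best score.
function normalizeScores(hits: Scored[]): Map<string, number> {
  const best = hits.reduce((m, h) => Math.max(m, h.score), 0);
  const out = new Map<string, number>();
  hits.forEach((h) => out.set(h.id, best > 0 ? h.score / best : 0));
  return out;
}

function mergeHybrid(keyword: Scored[], semantic: Scored[]): Scored[] {
  const kw = normalizeScores(keyword);
  const sem = normalizeScores(semantic);
  const ids = Array.from(new Set([...keyword, ...semantic].map((h) => h.id)));
  // A paper missing from one list simply contributes 0 for that side.
  const merged = ids.map((id) => ({
    id,
    score: KEYWORD_WEIGHT * (kw.get(id) ?? 0) + SEMANTIC_WEIGHT * (sem.get(id) ?? 0),
  }));
  merged.sort((a, b) => b.score - a.score);
  const best = merged.length > 0 ? merged[0].score : 0;
  // Quality gate: keep results at or above 70% of the top score, then cap at 20.
  return merged.filter((r) => r.score >= best * MIN_RELATIVE_SCORE).slice(0, RESULT_LIMIT);
}
```

With these weights, a paper found only by the keyword side can score at most 0.25, so it survives the gate only when the best combined score is low. That matches the intent that keywords nudge the ranking rather than drive it.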

FTS5 Configuration

SQLite's FTS5 is genuinely powerful and free. The virtual table with triggers:

CREATE VIRTUAL TABLE papers_fts USING fts5(
  title, abstract, authors,
  content='papers', content_rowid='rowid',
  tokenize='porter unicode61'
);

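-- Keep in sync automatically (only the insert trigger is shown;
-- an external-content table also needs matching AFTER DELETE and AFTER UPDATE triggers)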
CREATE TRIGGER papers_ai AFTER INSERT ON papers BEGIN
  INSERT INTO papers_fts(rowid, title, abstract, authors)
  VALUES (new.rowid, new.title, new.abstract, new.authors);
END;

Porter stemming handles "learning" → "learn", "networks" → "network". The unicode61 tokenizer handles international author names.

For the title boost, FTS5's bm25() function accepts column weights:

SELECT *, bm25(papers_fts, 10, 1, 5) AS rank
FROM papers_fts
WHERE papers_fts MATCH ?
ORDER BY rank
LIMIT 30

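Title weight 10, abstract weight 1, authors weight 5. Most users search by topic more than author, so title relevance dominates. Note that bm25() returns lower (more negative) values for better matches, which is why the ascending ORDER BY rank puts the best hits first.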
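One practical detail behind MATCH ?: binding the parameter protects against SQL injection, but the bound string is still parsed as FTS5 query syntax, so quotes, hyphens, or colons in user input can raise errors or be read as operators. A minimal sketch of one way to neutralize that is to quote every term. The names toFtsMatch and maxTerms are hypothetical, not taken from the project.

```typescript
// Hypothetical helper: turn free text into a safe FTS5 MATCH expression.
// Each term becomes a quoted string; embedded double quotes are doubled,
// which is how FTS5 escapes them. Space-separated strings are ANDed implicitly.
function toFtsMatch(raw: string, maxTerms: number = 8): string {
  return raw
    .split(/\s+/)
    .filter((term) => term.length > 0)
    .slice(0, maxTerms)
    .map((term) => '"' + term.replace(/"/g, '""') + '"')
    .join(" ");
}
```

An empty result means there is nothing to search for, so the caller would skip the FTS5 query entirely rather than bind an empty string.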

The Data Pipeline

Papers go through four stages before they appear in search:

Stage 1: Fetch

An ingest Worker polls arXiv's API on a per-minute cron (* * * * *). Each run processes exactly one pending paper, generating its summary and entity extraction in a single consolidated LLM call and its embedding via the embedding model.

Why one paper per minute? The Workers AI free tier has a daily neuron budget (5,000/day). One paper per run gives 1,440 opportunities per day, but the neuron cost per paper caps actual throughput at ~113 papers/day. A per-minute cron gives finer-grained control than an hourly batch job.

# wrangler.ingest.toml
[triggers]
crons = ["* * * * *"]

[vars]
ARXIV_FETCH_CATEGORIES = "cs.AI,cs.LG"
SUMMARY_MODEL = "@cf/meta/llama-3.1-8b-instruct"
EMBEDDING_MODEL = "@cf/baai/bge-base-en-v1.5"
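The throughput figure is simple division, and writing it out makes the bottleneck visible. In this sketch the budget and run count come from the numbers above, while NEURONS_PER_PAPER is an inferred value (5,000 divided by roughly 113), not a measured cost.

```typescript
// Figures from the post, except NEURONS_PER_PAPER, which is inferred, not measured.
const DAILY_NEURON_BUDGET = 5000;
const CRON_RUNS_PER_DAY = 24 * 60; // "* * * * *" fires 1,440 times a day
const NEURONS_PER_PAPER = 44;      // assumption: 5,000 / ~113 papers

// Throughput is capped by whichever runs out first: neurons or cron slots.
function papersPerDay(budget: number, costPerPaper: number, runsPerDay: number): number {
  return Math.min(Math.floor(budget / costPerPaper), runsPerDay);
}

const dailyThroughput = papersPerDay(DAILY_NEURON_BUDGET, NEURONS_PER_PAPER, CRON_RUNS_PER_DAY);
console.log(dailyThroughput); // 113
```

Under these assumptions the neuron budget is exhausted after about 8% of the day's cron slots, so the remaining runs would find nothing they can afford to process.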

Stage 2: Summarize

A single consolidated prompt per paper generates structured JSON output covering TL;DR, key contributions, methods, limitations, beginner explanation, and technical summary. Keeping it as one prompt minimizes API calls and maintains contextual coherence across sections.

Papers are flagged with summary_ready:

0 = pending
1 = complete
2 = failed (retried after 7 days)
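A minimal sketch of how those three states could drive the "which paper next?" decision. The status codes and the 7-day retry come from the list above. The last_attempt_at column and the function name are hypothetical.

```typescript
// Status codes from the post; last_attempt_at is a hypothetical column.
const SummaryStatus = { Pending: 0, Complete: 1, Failed: 2 } as const;
const RETRY_AFTER_MS = 7 * 24 * 60 * 60 * 1000; // 7 days

interface PaperStatusRow {
  id: string;
  summary_ready: number;
  last_attempt_at: number | null; // ms since epoch, null if never attempted
}

// Pending papers are always due; failed ones only after the retry delay.
function isDueForProcessing(row: PaperStatusRow, nowMs: number): boolean {
  if (row.summary_ready === SummaryStatus.Pending) return true;
  if (row.summary_ready === SummaryStatus.Failed) {
    return row.last_attempt_at === null || nowMs - row.last_attempt_at >= RETRY_AFTER_MS;
  }
  return false; // complete
}
```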

Stage 3: Enrich (optional)

Separate cron jobs and admin-triggered scripts handle:

Semantic Scholar — citation counts (updated hourly)
CrossRef — DOI metadata, journal info, funders
OpenAlex — concepts, affiliations, ROR IDs
Papers With Code — code repositories, SOTA benchmarks (schema ready)

Stage 4: Related Papers

Pre-computes top-8 semantically similar papers per paper using Vectorize and stores them in a related_papers table. This makes the /api/paper/:id/related endpoint a cheap D1 query rather than a live vector lookup.
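To make the pre-computation concrete, here is a small in-memory stand-in for what the Vectorize query returns: the k nearest papers by cosine similarity. It is only a sketch. The real pipeline queries Vectorize rather than looping over vectors, and both function names are hypothetical.

```typescript
// In-memory stand-in for a Vectorize nearest-neighbour query (illustrative only).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Returns the ids of the k most similar papers, excluding the paper itself.
function topRelated(targetId: string, vectors: Record<string, number[]>, k: number = 8): string[] {
  const target = vectors[targetId];
  if (!target) return [];
  return Object.keys(vectors)
    .filter((id) => id !== targetId)
    .map((id) => ({ id, score: cosineSimilarity(target, vectors[id]) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((r) => r.id);
}
```

The returned ids are what would be written to related_papers, one row per pair, so the endpoint never has to touch the vector index at request time.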

Local Bulk Processing with Ollama

Workers AI is great for live inference but rate-limited for bulk ingestion. When a remote rate-limit block hits (especially arXiv's Fastly cluster returning timeouts rather than 429s), the local Ollama pipeline picks up the slack.

# Bulk ingest via local Ollama
npx tsx scripts/bulk-ingest.ts --days 7 --categories cs.LG,cs.CL

# Process pending/failed papers from remote D1
ADMIN_SECRET=<secret> npx tsx scripts/process-pending-local.ts

# Push processed local DB to remote D1 + Vectorize
ADMIN_SECRET=<secret> npx tsx scripts/push-local-to-remote.ts

The push script uses the D1 REST API directly — not wrangler d1 execute per paper. The naive approach of shelling out to wrangler for each paper is ~100× slower and breaks on special characters in paper text. Direct REST calls are fast and handle Unicode cleanly.
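The Unicode point is easier to see with a sketch. Cloudflare's D1 REST API exposes a query endpoint that takes a JSON body with sql and params, so paper text travels as a JSON string and never passes through shell quoting. The helper name and the column names in the usage note are assumptions, not the project's actual push script.

```typescript
// Hypothetical helper: build (but do not send) a parameterized D1 REST query.
interface D1QueryRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

function buildD1Query(
  accountId: string,
  databaseId: string,
  apiToken: string,
  sql: string,
  params: string[]
): D1QueryRequest {
  return {
    url: `https://api.cloudflare.com/client/v4/accounts/${accountId}/d1/database/${databaseId}/query`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
      },
      // JSON.stringify escapes quotes, newlines, and non-ASCII text safely.
      body: JSON.stringify({ sql, params }),
    },
  };
}
```

A caller would pass the result to fetch(req.url, req.init), for example with sql set to "UPDATE papers SET tldr = ? WHERE id = ?" (column names assumed here) and the summary text as the first parameter.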

Local models:
| Role | Model |
|------|-------|
| Summarization | gemma4:e4b (8B, Q4_K_M) |
| Embeddings | nomic-embed-text (137M, F16) |

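The embedding model produces 768-dimensional vectors, which matches the dimensionality of the bge-base-en-v1.5 used in production, so both fit in the same Vectorize index. Matching dimensions alone does not put two models in the same vector space, though: similarity scores are only meaningful when the query and the stored vector come from the same model, so vectors from different models should not be ranked against each other.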

arXiv Rate Limiting: The Subtle Gotcha

arXiv uses two separate Fastly clusters:

arxiv.org — the main site
export.arxiv.org — the API endpoint

When export.arxiv.org blocks you, it manifests as connection timeouts (curl exit code 28, HTTP 000) — not HTTP 429. There's no polite rejection; the TCP connection just hangs until timeout.

The key insight: continued probing during a block extends the block window. The correct response is to back off completely and wait — hence the wait-and-ingest.sh script that polls for connectivity before re-running ingestion.

# wait-and-ingest.sh pattern
while ! curl -sf --max-time 10 "https://export.arxiv.org/search/" > /dev/null; do
  echo "Still blocked, waiting 5 min..."
  sleep 300
done
# Now safe to ingest
npx tsx scripts/bulk-ingest.ts

The script runs as a background nohup process with logs tailed separately — important for long-running jobs on a dev machine that might sleep.
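The same wait loop can be expressed in TypeScript with an upper bound on attempts, which the shell version lacks. This is a sketch under assumptions: the probe and sleep functions are injected by the caller (a real probe would fetch export.arxiv.org with a 10-second timeout and treat a timeout as "blocked"), and the default of 48 attempts is an arbitrary choice.

```typescript
type Probe = () => Promise<boolean>; // resolves true once the endpoint answers in time
type Sleep = (ms: number) => Promise<void>;

// Probe at a fixed, slow interval and give up after maxAttempts.
// Returns the number of probes it took to get through.
async function waitUntilReachable(
  probe: Probe,
  sleep: Sleep,
  intervalMs: number = 5 * 60 * 1000,
  maxAttempts: number = 48
): Promise<number> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (await probe()) return attempt;
    // Back off completely between probes; no retries inside the interval.
    if (attempt < maxAttempts) await sleep(intervalMs);
  }
  throw new Error(`still unreachable after ${maxAttempts} probes`);
}
```

Injecting probe and sleep keeps the loop testable without real network calls or real five-minute waits.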

Caching Strategy

Each kind of cached data gets its own TTL:

Data	TTL	Rationale
Search results	2h	Papers don't change; query results are stable
Query embeddings	24h	Expensive to generate; queries repeat
Paper detail	Indefinite (lazy)	Written to KV on first access, not at ingest
Trending	60min	Popular papers shift hourly
RSS feed	1h	Balance freshness vs. load

The lazy KV write pattern for paper detail is worth emphasizing. Rather than writing every paper to KV at ingest time (expensive, mostly wasted for unpopular papers), the paper detail handler writes to KV on the first cache miss. Subsequent requests hit KV. Popular papers get cached automatically; obscure papers save KV writes.
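A minimal sketch of that lazy write pattern, assuming a tiny KV-like interface and a paper: key prefix. Neither is taken from the actual handler, and KVLike is a stand-in rather than the real Workers KV binding type.

```typescript
// Minimal stand-in for the parts of a KV binding this sketch needs.
interface KVLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string): Promise<void>;
}

// Cache-aside: read KV first, fall back to the database, write back on a miss.
async function getPaperDetail<T>(
  kv: KVLike,
  id: string,
  loadFromDb: (id: string) => Promise<T | null>
): Promise<T | null> {
  const key = `paper:${id}`; // hypothetical key scheme
  const cached = await kv.get(key);
  if (cached !== null) return JSON.parse(cached) as T;

  const fresh = await loadFromDb(id);
  // Only cache real papers, so unknown ids never cost a KV write.
  if (fresh !== null) await kv.put(key, JSON.stringify(fresh));
  return fresh;
}
```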

Security Hardening

Running a public API with AI inference endpoints means rate limiting isn't optional.

Per-IP rate limiting on all public endpoints:

withRateLimit(request, env.CACHE, {
  maxRequests: 60,
  windowSeconds: 60,
  lockoutSeconds: 120,
  namespace: 'search'
}, cors, handler)
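The option names suggest a request counter over a time window with a lockout once the limit is exceeded. Here is a sketch of that reading, using an in-memory Map in place of KV. The internals of withRateLimit are not shown in this post, so everything below (including the function and type names) is an assumption about how such a limiter could work.

```typescript
interface RateLimitOptions {
  maxRequests: number;
  windowSeconds: number;
  lockoutSeconds: number;
}

interface ClientState {
  windowStart: number; // ms since epoch
  count: number;
  lockedUntil: number; // 0 when not locked out
}

// Returns true if the request is allowed. The caller supplies the clock for testability.
function allowRequest(
  store: Map<string, ClientState>,
  clientKey: string,
  opts: RateLimitOptions,
  nowMs: number
): boolean {
  let state = store.get(clientKey);
  if (state && nowMs < state.lockedUntil) return false; // still locked out

  // Start a fresh window for new clients, expired lockouts, and expired windows.
  if (!state || state.lockedUntil > 0 || nowMs - state.windowStart >= opts.windowSeconds * 1000) {
    state = { windowStart: nowMs, count: 0, lockedUntil: 0 };
  }
  state.count += 1;
  if (state.count > opts.maxRequests) {
    state.lockedUntil = nowMs + opts.lockoutSeconds * 1000;
  }
  store.set(clientKey, state);
  return state.lockedUntil === 0;
}
```

One caveat if the counter lives in KV: KV is eventually consistent, so a limiter built on it is approximate under bursts from many locations.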

AI-specific limits on /api/classify-claim: hard character limits plus rate limiting shrink the room for prompt injection and guard against memory exhaustion from oversized inputs.

Timing-safe admin auth:

const secretA = new TextEncoder().encode(providedSecret);
const secretB = new TextEncoder().encode(env.ADMIN_SECRET);
if (secretA.length !== secretB.length ||
    !crypto.subtle.timingSafeEqual(secretA, secretB)) {
  return unauthorized();
}

Input sanitization is centralized in src/shared/sanitize.ts — control character removal, length limits, allowlists for category codes and date filters. Applied to every API endpoint uniformly.
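For a sense of what those three rules look like in code, here is a hedged sketch. The function names, the 200-character limit, and the three-entry category allowlist are illustrative assumptions, not the contents of src/shared/sanitize.ts.

```typescript
// Illustrative values: categories mentioned in this post, and an assumed length limit.
const ALLOWED_CATEGORIES = new Set(["cs.AI", "cs.LG", "cs.CL"]);
const MAX_QUERY_LENGTH = 200;

// Replace control characters, collapse whitespace, and enforce a length limit.
function sanitizeQuery(raw: string): string {
  return raw
    .replace(/[\u0000-\u001f\u007f]/g, " ")
    .replace(/\s+/g, " ")
    .trim()
    .slice(0, MAX_QUERY_LENGTH)
    .trim();
}

// Allowlist: anything not explicitly known is rejected, never "cleaned up".
function sanitizeCategory(raw: string): string | null {
  return ALLOWED_CATEGORIES.has(raw) ? raw : null;
}

// Accept only real calendar dates in YYYY-MM-DD form.
function sanitizeDate(raw: string): string | null {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(raw)) return null;
  const parsed = new Date(raw + "T00:00:00Z");
  if (Number.isNaN(parsed.getTime())) return null;
  // Round-trip check rejects dates that were silently rolled over, like Feb 30.
  return parsed.toISOString().slice(0, 10) === raw ? raw : null;
}
```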

CORS: explicit origin only, wildcard rejected at startup. If ALLOWED_ORIGIN isn't set, the worker refuses to start.

The AI Features

Pre-Generated Summaries

Every paper gets a structured summary generated at ingest time:

tldr — one sentence
key_contributions — JSON array
methods — JSON array
limitations — JSON array
beginner_explain — plain language paragraph
technical_summary — researcher-level paragraph

Having these pre-generated means zero latency for AI content on paper detail pages. The alternative — generating on demand — would add 2–5 seconds to every paper page load.
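Since the summary comes back from an LLM as JSON, it is worth validating before it is stored. A minimal sketch of such a check follows, using the field names listed above. The assumption that each array holds plain strings, and the parseSummary name itself, are mine rather than the project's.

```typescript
// Field names from the list above; array element types are assumed to be strings.
interface PaperSummary {
  tldr: string;
  key_contributions: string[];
  methods: string[];
  limitations: string[];
  beginner_explain: string;
  technical_summary: string;
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((v) => typeof v === "string");
}

// Returns a typed summary, or null if the model output is malformed.
function parseSummary(raw: string): PaperSummary | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const o = data as Record<string, unknown>;
  const tldr = o["tldr"];
  const contributions = o["key_contributions"];
  const methods = o["methods"];
  const limitations = o["limitations"];
  const beginner = o["beginner_explain"];
  const technical = o["technical_summary"];
  if (typeof tldr !== "string" || typeof beginner !== "string" || typeof technical !== "string") return null;
  if (!isStringArray(contributions) || !isStringArray(methods) || !isStringArray(limitations)) return null;
  return {
    tldr,
    key_contributions: contributions,
    methods,
    limitations,
    beginner_explain: beginner,
    technical_summary: technical,
  };
}
```

A null result is a natural trigger for marking the paper as failed (summary_ready = 2) instead of storing a half-formed summary.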

Claim Classification

The /claim route lets users define a scientific claim and find papers that support, contradict, or are neutral to it. Each paper in the result set gets classified by Llama 3.1 with concurrent processing and progress tracking.

POST /api/classify-claim
{ "claim": "RLHF improves instruction following more than SFT alone",
  "paperIds": ["2302.13971", "2305.18290", ...] }

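The "concurrent processing and progress tracking" part can be sketched as a small bounded-concurrency mapper. This is a generic pattern under assumptions, not the project's implementation: the worker passed in stands for the per-paper Llama 3.1 call, and the function name and concurrency limit are hypothetical.

```typescript
type ClaimLabel = "supports" | "contradicts" | "neutral";

// Run worker over items with at most `limit` calls in flight, preserving input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
  onProgress?: (done: number, total: number) => void
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  let done = 0;

  // Each runner pulls the next unclaimed index until the list is exhausted.
  async function run(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i]);
      done++;
      if (onProgress) onProgress(done, items.length);
    }
  }

  const runnerCount = Math.min(Math.max(1, limit), items.length);
  const runners: Promise<void>[] = [];
  for (let n = 0; n < runnerCount; n++) runners.push(run());
  await Promise.all(runners);
  return results;
}
```

A handler could call it with the paperIds array, a limit such as 4, and a worker that resolves to a ClaimLabel for each paper, forwarding onProgress to the client.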
Abstract Search

Users can paste a full abstract to find semantically similar papers — bypassing keyword search entirely. The abstract is embedded directly and queried against Vectorize.

GET /api/search?embedText=<abstract text>

This is genuinely useful for researchers who have a draft abstract and want to find related work before writing their introduction.

Follow-up Questions as Search Links

AI-generated summaries include follow-up research questions. These are rendered as clickable links that trigger a search query. One click from "what papers explore this further" to a results page.

Author Pages and SEO

Each author gets a dedicated page at /author/[name] with citation statistics, timeline visualization, and their full paper list. These pages are server-side rendered with JSON-LD schema markup for Google Scholar compatibility.

The SEO setup:

Dynamic Open Graph and Twitter Card meta tags on all paper pages
Auto-generated sitemap.xml covering all papers, topics, and authors
/ai.txt and /llms.txt routes for LLM tool discovery
ISR with 10-minute revalidation for fast but fresh content

CLI for AI Assistants

A CLI tool (arxiv-cli) designed specifically for AI assistants to search papers programmatically:

arxiv-cli search "diffusion models image generation" 5
arxiv-cli paper 2303.04137
arxiv-cli trending 10
arxiv-cli topic large-language-models 20
arxiv-cli author "Yoshua Bengio" 10

Output is structured plain text optimized for LLM parsing — clean IDs, consistent field labels, no HTML noise. This was inspired by the /llms.txt standard and makes the tool useful in Claude Code, ChatGPT Code Interpreter, and similar environments.

Performance Numbers

From load testing with 100 concurrent requests:

Endpoint	Cache Hit	Cache Miss
Search	<240ms	<400ms
Paper detail	<190ms	<500ms
Trending	~60ms (KV)	~300ms

Cache hit rate is ~85%. The 15% cache misses are new queries or expired entries hitting D1 directly.

0% error rate under 100 concurrent requests. Cloudflare's edge distributes load globally without any autoscaling configuration.

Database Schema Decisions

A few schema choices worth calling out:

authors_normalized: a separate column storing lowercased author names for fast prefix search. Running LOWER(authors) LIKE ? on every query forces a scan of every row, because the function call prevents an ordinary index from being used; a prefix lookup on the indexed normalized column is O(log n).

citation_snapshots — storing historical citation data enables citation velocity calculations. A paper that went from 0 to 50 citations last week is more interesting than one that's had 500 for two years.

topics table — curated topic collections with category mappings. Rather than making topics a UI-only concept, they're first-class database entities. This lets the API serve topic pages with proper SEO and caching.

Single source of truth: migrations/schema.sql is canonical. Incremental migrations layer on top. This avoids schema drift between local and remote D1 databases.
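As an illustration of what citation_snapshots makes possible, here is a sketch of a citations-per-week calculation. The snapshot shape, the function name, and the choice of baseline (the newest snapshot at or before the start of the window) are assumptions, not the project's schema or query.

```typescript
// Hypothetical snapshot shape: one row per paper per measurement.
interface CitationSnapshot {
  takenAt: number; // ms since epoch
  citations: number;
}

const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

// Average citations gained per week over roughly the last `windowWeeks`.
// Returns null when there is not enough history to compute a rate.
function citationsPerWeek(
  snapshots: CitationSnapshot[],
  nowMs: number,
  windowWeeks: number = 1
): number | null {
  const sorted = snapshots
    .filter((s) => s.takenAt <= nowMs)
    .sort((a, b) => a.takenAt - b.takenAt);
  if (sorted.length < 2) return null;

  const latest = sorted[sorted.length - 1];
  const cutoff = nowMs - windowWeeks * WEEK_MS;
  // Baseline: the newest snapshot at or before the window start, else the oldest one.
  let baseline = sorted[0];
  for (const s of sorted) {
    if (s.takenAt <= cutoff) baseline = s;
  }
  const spanWeeks = (latest.takenAt - baseline.takenAt) / WEEK_MS;
  return spanWeeks > 0 ? (latest.citations - baseline.citations) / spanWeeks : null;
}
```

With the two papers from the example above, the one that went from 0 to 50 citations in a week scores 50, while the one sitting at 500 for two years scores 0.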

Lessons Learned

1. arXiv's rate limiting is IP-based, not token-based. There's no way to authenticate to get higher limits. Polite crawling (1 request / 3 seconds, POLITE_EMAIL header) is the only lever.

2. D1 FTS5 + Vectorize is a surprisingly strong combo. For a side project, you don't need Elasticsearch. SQLite's FTS5 handles keyword search very well at this scale, and Vectorize handles semantics. The hybrid beats either alone.

3. Local Ollama for bulk + Workers AI for live is the right split. Local has no rate limits and can process hundreds of papers overnight. Workers AI provides fast, stateless inference at the edge for real-time requests.

4. Pre-generating AI content beats on-demand generation for perceived performance. Users don't wait for AI. Papers that haven't been processed yet show a "Summary pending" indicator rather than blocking the page load.

5. The quality gate on search results matters more than total recall. Returning 20 mediocre results is worse than returning 12 good ones. The MIN_RELATIVE_SCORE threshold was one of the highest-impact UX improvements.

What's Next

Papers With Code integration for benchmark and leaderboard data
Citation velocity tracking (citations per week from snapshots)
Personalized recommendations based on bookmark history
PDF upload for "find papers similar to this"

Open Source

ArxivExplorer is open source under BSL 1.1 (free for personal/academic use, converts to MIT on 2029-06-01).

The repo includes the full Next.js frontend, Cloudflare Workers API and ingest workers, all migration scripts, local Ollama pipeline, CLI tool, and integration test suite.

GitHub: github.com/Teycir/ArxivExplorer

If you're building something in the research discovery space, or have questions about the Cloudflare edge stack, I'd love to hear from you.

Built by Teycir Ben Soltane — security tools, edge apps, and privacy-first software.

Tags: cloudflare, nextjs, ai, typescript, searchengine
