Every morning, I'd open arXiv and PubMed, scroll through dozens of new papers, and try to figure out which ones were actually relevant to my research projects. Most days: 30 minutes in, 2-3 useful papers out. Some days I'd skip it entirely and hope I wasn't missing anything critical.
Google Scholar alerts helped, but they're keyword-based. If a paper uses different terminology but has directly transferable methodology, you'll never see it. I wanted something that understood what I'm actually working on — not just matching words, but evaluating relevance.
So I built Paper Morning.
What it does
You describe your research projects in plain text:
"I'm building a foundation model for endoscopic image analysis, focusing on self-supervised pretraining with limited annotations"
Every day (or every 3 days, or weekly), Paper Morning:
- Pulls new papers from arXiv, PubMed, Semantic Scholar, and Google Scholar
- Sends each abstract to an LLM (Gemini, with Cerebras fallback)
- The LLM scores each paper 1–10 against your project description
- Papers above your threshold get emailed to you with a summary explaining why they're relevant
Why not just use keyword matching?
This is the part that surprised me the most.
A paper about fetal ultrasound AI diagnosis scored 9/10 for my endoscopy project. The clinical domain is completely different, but the training methodology — large-scale multi-hospital image dataset, sensitivity/specificity evaluation, junior vs. senior clinician comparison — transferred directly. Keyword matching would have filtered this out entirely.
The LLM is also surprisingly good at distinguishing "tangentially interesting" (score 5-6) from "directly useful for your work right now" (score 9), as long as your project description gives it enough context.
Architecture
The design is intentionally simple:
┌─────────────────────┐
│ Your project │
│ descriptions │
│ (plain text) │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ Paper collectors │
│ arXiv │ PubMed │
│ Semantic Scholar │
│ Google Scholar │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ LLM scoring │
│ (Gemini / Cerebras) │
│ Score 1-10 per paper│
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ Email digest │
│ Only papers above │
│ your threshold │
│ + why it's relevant │
└─────────────────────┘
The whole thing runs on GitHub Actions. You fork the repo, add your API keys as GitHub Secrets, and it triggers on a cron schedule. No server, no Docker, no infrastructure to keep running around the clock. The free tier of GitHub Actions is more than enough for daily digests.
A few technical decisions worth mentioning
Multi-source aggregation with deduplication. The same paper often appears in more than one database. The tool collects from all four sources and deduplicates by title/DOI before LLM scoring, so you don't waste tokens scoring the same paper twice.
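A minimal sketch of that dedup step, assuming each collector returns dicts with a `title` and an optional `doi` (my field names, not necessarily the tool's):

```python
def dedupe(papers: list[dict]) -> list[dict]:
    """Keep the first copy of each paper, matching on DOI or normalized title."""
    seen_dois, seen_titles, unique = set(), set(), []
    for p in papers:
        # Normalize: lowercase and collapse whitespace so minor
        # formatting differences between sources don't defeat the match.
        title = " ".join(p["title"].lower().split())
        doi = p.get("doi")
        if (doi and doi in seen_dois) or title in seen_titles:
            continue
        if doi:
            seen_dois.add(doi)
        seen_titles.add(title)
        unique.append(p)
    return unique
```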
Non-linear candidate scaling for longer intervals. If you set the schedule to weekly instead of daily, you'll have ~7x more papers to evaluate — and without a cap, a 7x jump in LLM tokens means a 7x jump in cost. The tool applies a non-linear cap (LLM_MAX_CANDIDATES) that scales sub-linearly with the lookback window, keeping costs predictable.
PubMed 429 auto-retry. PubMed's API rate-limits aggressively. The tool has built-in exponential backoff and auto-retry specifically for PubMed's 429 responses. Setting an NCBI_API_KEY helps but isn't required.
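The retry logic amounts to exponential backoff around the fetch call. A generic sketch (the function names and retry budget are mine; the real tool's internals may differ):

```python
import time

def fetch_with_backoff(fetch, max_retries: int = 5,
                       base_delay: float = 1.0, sleep=time.sleep):
    """Call `fetch` (returns (status, body)); on HTTP 429, wait and retry,
    doubling the delay each attempt: 1s, 2s, 4s, ..."""
    for attempt in range(max_retries):
        status, body = fetch()
        if status != 429:
            return status, body
        sleep(base_delay * (2 ** attempt))
    raise RuntimeError("PubMed still rate-limiting after retries")
```

Injecting `sleep` as a parameter keeps the backoff testable without actually waiting.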
Duplicate send prevention. A sent_ids.json file tracks which papers have already been emailed, so you never get the same paper twice even if it stays in the search window.
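The sent_ids.json bookkeeping is just a persisted set of IDs checked before each send. A sketch of the idea (helper names are mine):

```python
import json
from pathlib import Path

def load_sent_ids(state_path: str = "sent_ids.json") -> set[str]:
    """Load previously emailed paper IDs; empty set on first run."""
    path = Path(state_path)
    return set(json.loads(path.read_text())) if path.exists() else set()

def filter_unsent(papers: list[dict], sent: set[str]) -> list[dict]:
    """Drop anything already emailed, even if it's still in the search window."""
    return [p for p in papers if p["id"] not in sent]

def save_sent_ids(sent: set[str], new_ids, state_path: str = "sent_ids.json") -> None:
    """Record the newly emailed IDs -- call this only after the send succeeds."""
    Path(state_path).write_text(json.dumps(sorted(sent | set(new_ids))))
```

Writing the state only after a successful send means a failed email run doesn't silently swallow papers.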
Multilingual output
One config value: OUTPUT_LANGUAGE. Set it to any language and the summaries arrive in that language. The LLM handles the translation naturally as part of the summarization step — no separate translation pipeline needed.
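Because translation rides along with summarization, the config value only has to reach the prompt. Something like this (a sketch of the mechanism; the env-var plumbing and function name are my assumption):

```python
import os

def digest_prompt(summary_request: str) -> str:
    """Append the output-language instruction to the summarization prompt;
    the LLM translates as it summarizes -- no separate translation pass."""
    lang = os.environ.get("OUTPUT_LANGUAGE", "English")
    return f"{summary_request}\nRespond entirely in {lang}."
```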
This was a conscious design choice. Research is global, and many researchers prefer reading summaries in their native language even if the original papers are in English.
What I'd do differently
If I were starting over, I'd probably add a lightweight web UI for viewing digests instead of relying purely on email. Email works fine for daily use, but it's not great for searching past digests or comparing papers across weeks.
I'd also explore using embedding-based pre-filtering before the LLM scoring step. Right now every candidate paper gets a full LLM evaluation, which works but is the main cost driver. A cheap embedding similarity check could eliminate obvious non-matches before they reach the LLM.
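That pre-filter could look something like the sketch below — cosine similarity between the project description's embedding and each abstract's, with a floor so the filter never starves the LLM stage. This is a hypothetical design, not anything in the repo; `embed` stands in for whatever cheap embedding model you'd pick:

```python
import math

def prefilter(papers: list[dict], project_vec: list[float], embed,
              threshold: float = 0.2, keep_at_least: int = 10) -> list[dict]:
    """Keep papers whose abstract embedding is plausibly similar to the
    project embedding; fall back to the top-N if everything scores low."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    scored = sorted(
        ((cosine(embed(p["abstract"]), project_vec), p) for p in papers),
        key=lambda t: t[0], reverse=True,
    )
    kept = [p for score, p in scored if score >= threshold]
    return kept or [p for _, p in scored[:keep_at_least]]
```

Only the survivors would get the full 1–10 LLM evaluation, which should cut the main cost driver without losing the cross-domain matches the LLM is good at.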
Try it
The repo is here: github.com/jeong87/paper-morning
You'll need:
- A free Google Gemini API key
- A Gmail account with an app password
- (Optional) NCBI API key for better PubMed stability
- (Optional) SerpAPI key for Google Scholar
The recommended setup is GitHub Actions mode — fork, add secrets, done. There's also a local mode with a web console if you prefer.
It works for any research field that publishes on arXiv, PubMed, or Google Scholar. I built it for medical AI, but there's nothing domain-specific in the code.
If you run into issues, open a GitHub issue or reach out — happy to help.
Top comments (2)
The fetal ultrasound scoring 9/10 for endoscopy is a perfect example of why semantic scoring beats keywords for research discovery. Curious how the Cerebras fallback compares to Gemini on scoring consistency.
Thanks for the kind words! That's exactly the kind of cross-domain serendipity we designed the LLM scoring for — keyword filters would never surface a fetal ultrasound paper for an endoscopy researcher, but the shared imaging techniques make it genuinely relevant.
On the Cerebras fallback question: honestly, scoring consistency between Gemini and Cerebras is not 1:1. Gemini (especially 3.1-pro with advanced reasoning) tends to produce more nuanced scores with tighter spread, while Cerebras (gpt-oss-120b) can be more generous across the board. The fallback is designed as a reliability safety net (Gemini quota/downtime), not as a scoring-equivalent substitute. If you're running it seriously, I'd recommend keeping Gemini as primary and treating Cerebras scores as approximate.
That said — a comparative scoring benchmark is a great idea for a future release. Would love contributions if you're interested!