A data science team I worked with was paying $200/month for a research monitoring service. It sent them new papers in their field every morning.
I looked at what it actually did: query arXiv, filter by keywords, format as email. That's it.
I replaced it with 50 lines of Python. Here's how.
## The Problem
ML teams need to track new research. Options:
- Semantic Scholar API — great but rate-limited
- Google Scholar — no official API, blocks scrapers
- Paid services ($100-500/mo) — Iris.ai, Connected Papers Pro, etc.
But two APIs give you everything for free: arXiv (2.4M+ papers) and Crossref (140M+ papers).
## The 50-Line Solution
```python
import requests
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta

def search_arxiv(query, max_results=20):
    """Search arXiv for recent papers."""
    url = 'http://export.arxiv.org/api/query'
    params = {
        'search_query': f'all:{query}',
        'sortBy': 'submittedDate',
        'sortOrder': 'descending',
        'max_results': max_results,
    }
    # Passing params lets requests handle URL encoding (e.g. spaces in queries)
    response = requests.get(url, params=params)
    root = ET.fromstring(response.text)
    ns = {'atom': 'http://www.w3.org/2005/Atom'}
    papers = []
    for entry in root.findall('atom:entry', ns):
        papers.append({
            'title': entry.find('atom:title', ns).text.strip().replace('\n', ' '),
            'authors': [a.find('atom:name', ns).text for a in entry.findall('atom:author', ns)],
            'summary': entry.find('atom:summary', ns).text.strip()[:200],
            'published': entry.find('atom:published', ns).text[:10],
            'link': entry.find('atom:id', ns).text,
        })
    return papers

def search_crossref(query, days_back=7, max_results=10):
    """Search Crossref for recent peer-reviewed papers."""
    from_date = (datetime.now() - timedelta(days=days_back)).strftime('%Y-%m-%d')
    url = 'https://api.crossref.org/works'
    params = {
        'query': query,
        'filter': f'from-pub-date:{from_date}',
        'rows': max_results,
        'sort': 'published',
        'order': 'desc',
    }
    # A mailto in the User-Agent puts you in Crossref's "polite" pool (better rate limits)
    headers = {'User-Agent': 'ResearchBot/1.0 (mailto:your@email.com)'}
    response = requests.get(url, params=params, headers=headers)
    data = response.json()
    papers = []
    for item in data.get('message', {}).get('items', []):
        papers.append({
            'title': (item.get('title') or ['Untitled'])[0],
            'authors': [f'{a.get("given", "")} {a.get("family", "")}' for a in item.get('author', [])[:3]],
            'journal': (item.get('container-title') or ['Preprint'])[0],
            'doi': item.get('DOI', 'N/A'),
            'citations': item.get('is-referenced-by-count', 0),
        })
    return papers

def daily_research_digest(topics):
    """Generate a daily digest for multiple research topics."""
    print(f'=== Research Digest — {datetime.now().strftime("%Y-%m-%d")} ===\n')
    for topic in topics:
        print(f'## {topic.upper()}\n')

        # arXiv: latest preprints
        arxiv_papers = search_arxiv(topic, max_results=5)
        print(f'### arXiv Preprints ({len(arxiv_papers)} found)')
        for p in arxiv_papers:
            print(f'  [{p["published"]}] {p["title"]}')
            print(f'  Authors: {", ".join(p["authors"][:3])}')
            print(f'  {p["link"]}\n')

        # Crossref: peer-reviewed papers
        crossref_papers = search_crossref(topic, days_back=7)
        print(f'### Peer-Reviewed ({len(crossref_papers)} found)')
        for p in crossref_papers:
            print(f'  {p["title"]}')
            print(f'  Journal: {p["journal"]} | Citations: {p["citations"]}')
            print(f'  DOI: {p["doi"]}\n')
        print()

# Configure your topics
my_topics = ['transformer architecture', 'reinforcement learning', 'LLM fine-tuning']
daily_research_digest(my_topics)
```
## What This Actually Does
- arXiv API — searches 2.4M+ papers, returns latest preprints in your field. No API key needed. Free.
- Crossref API — searches 140M+ peer-reviewed publications. Includes citation counts, DOIs, journal names. Also free.
- Combines both — you get preprints (bleeding edge) AND peer-reviewed papers (validated research) in one digest.
## Making It Automatic
Save it as `research_digest.py` and add it to cron:

```bash
# Run every morning at 8 AM
0 8 * * * python3 /path/to/research_digest.py >> /path/to/digest.log 2>&1
```
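One gotcha with cron: it will happily re-send the same papers every morning. A small state file fixes that — here's a minimal sketch (the `SEEN_FILE` path and the `filter_new` helper are my additions, not part of the script above):

```python
import json
from pathlib import Path

SEEN_FILE = Path('seen_papers.json')  # hypothetical state file, kept next to the script
SEEN_FILE.unlink(missing_ok=True)     # reset so this demo is deterministic

def filter_new(papers, key='link'):
    """Return only papers not seen before, and record them in SEEN_FILE."""
    seen = set(json.loads(SEEN_FILE.read_text())) if SEEN_FILE.exists() else set()
    fresh = [p for p in papers if p.get(key) not in seen]
    seen.update(p[key] for p in fresh if p.get(key))
    SEEN_FILE.write_text(json.dumps(sorted(seen)))
    return fresh

papers = [{'link': 'http://arxiv.org/abs/2401.00001', 'title': 'A'},
          {'link': 'http://arxiv.org/abs/2401.00002', 'title': 'B'}]
print(len(filter_new(papers)))  # → 2 (first run: both new)
print(len(filter_new(papers)))  # → 0 (second run: already seen)
```

Wrap the `search_arxiv` / `search_crossref` results in `filter_new` before printing and the digest only ever shows papers once.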
Or send it to Slack/Discord:

```python
import requests

def send_to_slack(papers, webhook_url):
    blocks = []
    for p in papers[:10]:
        blocks.append({
            'type': 'section',
            'text': {'type': 'mrkdwn', 'text': f'*{p["title"]}*\n{p.get("link", p.get("doi", ""))}'}
        })
    requests.post(webhook_url, json={'blocks': blocks})
```
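Discord's incoming webhooks take a different payload shape — a plain `content` string rather than Slack's `blocks`, capped at 2000 characters per message. A sketch along the same lines (`build_discord_message` is my helper name, not an API):

```python
import requests

def build_discord_message(papers, limit=10):
    """Build a Discord webhook payload; Discord caps content at 2000 chars."""
    lines = [f'**{p["title"]}**\n{p.get("link", p.get("doi", ""))}' for p in papers[:limit]]
    return {'content': '\n\n'.join(lines)[:2000]}

def send_to_discord(papers, webhook_url):
    requests.post(webhook_url, json=build_discord_message(papers))
```

Splitting the payload builder from the HTTP call also makes the formatting easy to test without a live webhook.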
## The Real Savings
| Service | Cost | What you get |
|---|---|---|
| Iris.ai | $180/mo | AI paper recommendations |
| Connected Papers Pro | $96/mo | Visual paper graphs |
| Semantic Scholar Alert | Free but limited | 3 queries/min |
| This script | $0 | Unlimited queries, customizable |
The paid services add AI summaries and recommendation graphs. But if you just need "show me new papers about X" — that's exactly what the arXiv + Crossref APIs do.
## Extending It
- Add semantic search — use sentence-transformers to rank papers by relevance
- Build a RAG pipeline — embed papers in ChromaDB, query with natural language (full tutorial here)
- Track citations over time — Crossref gives citation counts, great for finding trending papers
- Filter by institution — Crossref metadata includes author affiliations
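To make the first bullet concrete: relevance ranking can be prototyped with zero dependencies by scoring token overlap between an interest description and each abstract, then swapping the scoring function for cosine similarity over sentence-transformers embeddings when you want real semantics. Everything here (`score`, `rank_papers`, the sample data) is my illustration, not part of the script above:

```python
def score(query, text):
    """Crude relevance: fraction of query tokens that appear in the text.
    Swap this for cosine similarity over sentence-transformers embeddings."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

def rank_papers(papers, interest):
    """Sort papers by how well their summary (or title) matches the interest."""
    return sorted(papers,
                  key=lambda p: score(interest, p.get('summary', p.get('title', ''))),
                  reverse=True)

papers = [
    {'title': 'A', 'summary': 'efficient attention mechanisms for long context transformers'},
    {'title': 'B', 'summary': 'soil chemistry of alpine meadows'},
]
top = rank_papers(papers, 'long context transformer attention')
print(top[0]['title'])  # → A
```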
## API Templates
I have ready-to-use templates for 20+ APIs (arXiv, Crossref, npm, Shodan, HIBP, and more): api-scraping-templates
Full list of 77 scraping tools: awesome-web-scraping-2026
What research APIs are you using? I'm building a collection of free data sources — share yours in the comments.