Alex Spinov
I Replaced a $200/Month AI Training Data Pipeline with 50 Lines of Python

A data science team I worked with was paying $200/month for a research monitoring service. It sent them new papers in their field every morning.

I looked at what it actually did: query arXiv, filter by keywords, format as email. That's it.

I replaced it with 50 lines of Python. Here's how.

The Problem

ML teams need to track new research. Options:

  • Semantic Scholar API — great but rate-limited
  • Google Scholar — no official API, blocks scrapers
  • Paid services ($100-500/mo) — Iris.ai, Connected Papers Pro, etc.

But two APIs give you everything for free: arXiv (2.4M+ papers) and Crossref (140M+ papers).

The 50-Line Solution

import requests
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta

def search_arxiv(query, max_results=20):
    """Search arXiv for recent papers."""
    url = 'http://export.arxiv.org/api/query'
    params = {
        'search_query': f'all:{query}',
        'sortBy': 'submittedDate',
        'sortOrder': 'descending',
        'max_results': max_results,
    }
    response = requests.get(url, params=params, timeout=30)
    root = ET.fromstring(response.text)
    ns = {'atom': 'http://www.w3.org/2005/Atom'}

    papers = []
    for entry in root.findall('atom:entry', ns):
        papers.append({
            'title': entry.find('atom:title', ns).text.strip().replace('\n', ' '),
            'authors': [a.find('atom:name', ns).text for a in entry.findall('atom:author', ns)],
            'summary': entry.find('atom:summary', ns).text.strip()[:200],
            'published': entry.find('atom:published', ns).text[:10],
            'link': entry.find('atom:id', ns).text
        })
    return papers

def search_crossref(query, days_back=7, max_results=10):
    """Search Crossref for recent peer-reviewed papers."""
    from_date = (datetime.now() - timedelta(days=days_back)).strftime('%Y-%m-%d')
    url = 'https://api.crossref.org/works'
    params = {
        'query': query,
        'filter': f'from-pub-date:{from_date}',
        'rows': max_results,
        'sort': 'published',
        'order': 'desc',
    }
    response = requests.get(url, params=params, timeout=30,
                            headers={'User-Agent': 'ResearchBot/1.0 (mailto:your@email.com)'})
    data = response.json()

    papers = []
    for item in data.get('message', {}).get('items', []):
        papers.append({
            'title': item.get('title', ['Untitled'])[0],
            'authors': [f'{a.get("given", "")} {a.get("family", "")}'.strip() for a in item.get('author', [])[:3]],
            'journal': item.get('container-title', ['N/A'])[0] if item.get('container-title') else 'Preprint',
            'doi': item.get('DOI', 'N/A'),
            'citations': item.get('is-referenced-by-count', 0)
        })
    return papers

def daily_research_digest(topics):
    """Generate a daily digest for multiple research topics."""
    print(f'=== Research Digest — {datetime.now().strftime("%Y-%m-%d")} ===\n')

    for topic in topics:
        print(f'## {topic.upper()}\n')

        # arXiv: latest preprints
        arxiv_papers = search_arxiv(topic, max_results=5)
        print(f'### arXiv Preprints ({len(arxiv_papers)} found)')
        for p in arxiv_papers:
            print(f'  [{p["published"]}] {p["title"]}')
            print(f'  Authors: {", ".join(p["authors"][:3])}')
            print(f'  {p["link"]}\n')

        # Crossref: peer-reviewed papers
        crossref_papers = search_crossref(topic, days_back=7)
        print(f'### Peer-Reviewed ({len(crossref_papers)} found)')
        for p in crossref_papers:
            print(f'  {p["title"]}')
            print(f'  Journal: {p["journal"]} | Citations: {p["citations"]}')
            print(f'  DOI: {p["doi"]}\n')
        print()

# Configure your topics
my_topics = ['transformer architecture', 'reinforcement learning', 'LLM fine-tuning']
daily_research_digest(my_topics)
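A note on the `ns` dictionary in `search_arxiv`: the arXiv API returns Atom XML, where every tag lives in the Atom namespace, so a plain `find('entry')` returns nothing. Here's the same extraction run offline against a hand-trimmed sample feed (an illustration I wrote, not a captured response):

```python
import xml.etree.ElementTree as ET

# A trimmed illustration of what the arXiv API returns (Atom XML).
SAMPLE = '''<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/1706.03762</id>
    <title>Attention Is All You Need</title>
    <published>2017-06-12T00:00:00Z</published>
    <summary>The dominant sequence transduction models...</summary>
    <author><name>Ashish Vaswani</name></author>
  </entry>
</feed>'''

root = ET.fromstring(SAMPLE)
# Every tag is namespaced, so find('entry') would return None;
# the ns map lets us write 'atom:entry' instead of the full URI each time.
ns = {'atom': 'http://www.w3.org/2005/Atom'}
entry = root.find('atom:entry', ns)
print(entry.find('atom:title', ns).text)           # Attention Is All You Need
print(entry.find('atom:published', ns).text[:10])  # 2017-06-12
```

This is the single most common stumbling block with the arXiv API: forget the namespace map and every `find()` silently returns `None`.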

What This Actually Does

  1. arXiv API — searches 2.4M+ papers, returns latest preprints in your field. No API key needed. Free.
  2. Crossref API — searches 140M+ peer-reviewed publications. Includes citation counts, DOIs, journal names. Also free.
  3. Combines both — you get preprints (bleeding edge) AND peer-reviewed papers (validated research) in one digest.
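One wrinkle with combining both: a preprint often reappears in Crossref once it's published, so the same paper can show up twice. A rough de-duplication sketch (`merge_papers` and `normalize_title` are my names, not part of the script above; it keys on normalized titles, which is crude but catches most repeats):

```python
import re

def normalize_title(title):
    """Lowercase and strip punctuation so near-identical titles compare equal."""
    return re.sub(r'[^a-z0-9 ]', '', title.lower()).strip()

def merge_papers(arxiv_papers, crossref_papers):
    """Combine both result lists, keeping the first occurrence of each title."""
    seen = set()
    merged = []
    for paper in arxiv_papers + crossref_papers:
        key = normalize_title(paper['title'])
        if key not in seen:
            seen.add(key)
            merged.append(paper)
    return merged

# Example with stub records:
a = [{'title': 'Attention Is All You Need'}]
c = [{'title': 'Attention is all you need.'}, {'title': 'Another Paper'}]
print(len(merge_papers(a, c)))  # 2 — the duplicate title is dropped
```

Title matching won't catch papers renamed between preprint and publication, but it's free and has no dependencies.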

Making It Automatic

Save as research_digest.py and add to cron:

# Run every morning at 8 AM
0 8 * * * python3 /path/to/research_digest.py >> /path/to/digest.log 2>&1
Enter fullscreen mode Exit fullscreen mode
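A log file is easy to ignore. If your team lives in email rather than Slack, you can wrap the digest text in a message with just the standard library — a sketch under assumptions (the SMTP host, port, and addresses below are placeholders, and `build_digest_email`/`send_digest` are my names, not part of the script):

```python
import smtplib
from email.mime.text import MIMEText
from datetime import datetime

def build_digest_email(digest_text, sender, recipient):
    """Wrap the plain-text digest in a MIME message."""
    msg = MIMEText(digest_text)
    msg['Subject'] = f'Research Digest — {datetime.now().strftime("%Y-%m-%d")}'
    msg['From'] = sender
    msg['To'] = recipient
    return msg

def send_digest(msg, host='localhost', port=25):
    """Hand the message to an SMTP server (placeholder host — use your relay)."""
    with smtplib.SMTP(host, port) as server:
        server.send_message(msg)

# Build (but don't send) an example message:
msg = build_digest_email('## TRANSFORMER ARCHITECTURE\n...',
                         'bot@example.com', 'team@example.com')
print(msg['To'])  # team@example.com
```

Capture the digest by redirecting the script's stdout (or refactor `daily_research_digest` to return a string instead of printing) and pass it in as `digest_text`.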

Or send to Slack/Discord:

import requests

def send_to_slack(papers, webhook_url):
    blocks = []
    for p in papers[:10]:
        blocks.append({
            'type': 'section',
            'text': {'type': 'mrkdwn', 'text': f'*{p["title"]}*\n{p.get("link", p.get("doi", ""))}'}
        })
    requests.post(webhook_url, json={'blocks': blocks})

The Real Savings

| Service | Cost | What you get |
| --- | --- | --- |
| Iris.ai | $180/mo | AI paper recommendations |
| Connected Papers Pro | $96/mo | Visual paper graphs |
| Semantic Scholar Alert | Free but limited | 3 queries/min |
| This script | $0 | Unlimited queries, customizable |

The paid services add AI summaries and recommendation graphs. But if you just need "show me new papers about X" — that's exactly what the arXiv + Crossref APIs do.

Extending It

  1. Add semantic search — use sentence-transformers to rank papers by relevance
  2. Build a RAG pipeline — embed papers in ChromaDB, query with natural language
  3. Track citations over time — Crossref gives citation counts, great for finding trending papers
  4. Filter by institution — Crossref metadata includes author affiliations
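For idea 1, embeddings are where the real gains are, but even a crude token-overlap score gives you a relevance ordering with zero model downloads. A baseline sketch (this is *not* the sentence-transformers approach, just a dependency-free stand-in that ranks the dicts `search_arxiv` returns; both function names are mine):

```python
def relevance_score(paper, query):
    """Fraction of query tokens found in the paper's title + summary."""
    text = (paper.get('title', '') + ' ' + paper.get('summary', '')).lower()
    tokens = query.lower().split()
    return sum(tok in text for tok in tokens) / len(tokens)

def rank_papers(papers, query):
    """Sort papers by descending keyword overlap with the query."""
    return sorted(papers, key=lambda p: relevance_score(p, query), reverse=True)

papers = [
    {'title': 'Efficient Transformers', 'summary': 'A survey of attention variants.'},
    {'title': 'Robot Grasping', 'summary': 'Manipulation with RL.'},
]
top = rank_papers(papers, 'transformer attention')
print(top[0]['title'])  # Efficient Transformers
```

When you outgrow this, swapping `relevance_score` for cosine similarity over sentence-transformers embeddings is a drop-in change — the ranking function stays the same.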

API Templates

I have ready-to-use templates for 20+ APIs (arXiv, Crossref, npm, Shodan, HIBP, and more): api-scraping-templates

Full list of 77 scraping tools: awesome-web-scraping-2026


What research APIs are you using? I'm building a collection of free data sources — share yours in the comments.
