Alex Spinov
I Replaced a $200/Month AI Training Data Pipeline with 50 Lines of Python

A data science team I worked with was paying $200/month for a research monitoring service. It sent them new papers in their field every morning.

I looked at what it actually did: query arXiv, filter by keywords, format as email. That's it.

I replaced it with 50 lines of Python. Here's how.

The Problem

ML teams need to track new research. Options:

  • Semantic Scholar API — great but rate-limited
  • Google Scholar — no official API, blocks scrapers
  • Paid services ($100-500/mo) — Iris.ai, Connected Papers Pro, etc.

But two APIs give you everything for free: arXiv (2.4M+ papers) and Crossref (140M+ papers).

The 50-Line Solution

import requests
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta

def search_arxiv(query, max_results=20):
    """Search arXiv for recent papers."""
    url = 'http://export.arxiv.org/api/query'
    params = {
        'search_query': f'all:{query}',
        'sortBy': 'submittedDate',
        'sortOrder': 'descending',
        'max_results': max_results,
    }
    response = requests.get(url, params=params, timeout=30)
    root = ET.fromstring(response.text)
    ns = {'atom': 'http://www.w3.org/2005/Atom'}

    papers = []
    for entry in root.findall('atom:entry', ns):
        papers.append({
            'title': entry.find('atom:title', ns).text.strip().replace('\n', ' '),
            'authors': [a.find('atom:name', ns).text for a in entry.findall('atom:author', ns)],
            'summary': entry.find('atom:summary', ns).text.strip()[:200],
            'published': entry.find('atom:published', ns).text[:10],
            'link': entry.find('atom:id', ns).text
        })
    return papers

def search_crossref(query, days_back=7, max_results=10):
    """Search Crossref for recent peer-reviewed papers."""
    from_date = (datetime.now() - timedelta(days=days_back)).strftime('%Y-%m-%d')
    url = 'https://api.crossref.org/works'
    params = {
        'query': query,
        'filter': f'from-pub-date:{from_date}',
        'rows': max_results,
        'sort': 'published',
        'order': 'desc',
    }
    response = requests.get(url, params=params, timeout=30,
                            headers={'User-Agent': 'ResearchBot/1.0 (mailto:your@email.com)'})
    data = response.json()

    papers = []
    for item in data.get('message', {}).get('items', []):
        papers.append({
            'title': item.get('title', ['Untitled'])[0],
            'authors': [f'{a.get("given", "")} {a.get("family", "")}'.strip() for a in item.get('author', [])[:3]],
            'journal': item.get('container-title', ['N/A'])[0] if item.get('container-title') else 'Preprint',
            'doi': item.get('DOI', 'N/A'),
            'citations': item.get('is-referenced-by-count', 0)
        })
    return papers

def daily_research_digest(topics):
    """Generate a daily digest for multiple research topics."""
    print(f'=== Research Digest — {datetime.now().strftime("%Y-%m-%d")} ===\n')

    for topic in topics:
        print(f'## {topic.upper()}\n')

        # arXiv: latest preprints
        arxiv_papers = search_arxiv(topic, max_results=5)
        print(f'### arXiv Preprints ({len(arxiv_papers)} found)')
        for p in arxiv_papers:
            print(f'  [{p["published"]}] {p["title"]}')
            print(f'  Authors: {", ".join(p["authors"][:3])}')
            print(f'  {p["link"]}\n')

        # Crossref: peer-reviewed papers
        crossref_papers = search_crossref(topic, days_back=7)
        print(f'### Peer-Reviewed ({len(crossref_papers)} found)')
        for p in crossref_papers:
            print(f'  {p["title"]}')
            print(f'  Journal: {p["journal"]} | Citations: {p["citations"]}')
            print(f'  DOI: {p["doi"]}\n')
        print()

# Configure your topics
my_topics = ['transformer architecture', 'reinforcement learning', 'LLM fine-tuning']
daily_research_digest(my_topics)
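A note on the `ns` dictionary in `search_arxiv`: the arXiv API returns Atom XML, where every tag lives in the Atom namespace, so a plain `find('entry')` returns nothing. Here's the same extraction run offline against a hand-trimmed sample feed (an illustration I wrote, not a captured response):

```python
import xml.etree.ElementTree as ET

# A trimmed illustration of what the arXiv API returns (Atom XML).
SAMPLE = '''<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/1706.03762</id>
    <title>Attention Is All You Need</title>
    <published>2017-06-12T00:00:00Z</published>
    <summary>The dominant sequence transduction models...</summary>
    <author><name>Ashish Vaswani</name></author>
  </entry>
</feed>'''

root = ET.fromstring(SAMPLE)
# Every tag is namespaced, so find('entry') would return None;
# the ns map lets us write 'atom:entry' instead of the full URI each time.
ns = {'atom': 'http://www.w3.org/2005/Atom'}
entry = root.find('atom:entry', ns)
print(entry.find('atom:title', ns).text)           # Attention Is All You Need
print(entry.find('atom:published', ns).text[:10])  # 2017-06-12
```

This is the single most common stumbling block with the arXiv API: forget the namespace map and every `find()` silently returns `None`.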

What This Actually Does

  1. arXiv API — searches 2.4M+ papers, returns latest preprints in your field. No API key needed. Free.
  2. Crossref API — searches 140M+ peer-reviewed publications. Includes citation counts, DOIs, journal names. Also free.
  3. Combines both — you get preprints (bleeding edge) AND peer-reviewed papers (validated research) in one digest.
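One wrinkle with combining both: a preprint often reappears in Crossref once it's published, so the same paper can show up twice. A rough de-duplication sketch (`merge_papers` and `normalize_title` are my names, not part of the script above; it keys on normalized titles, which is crude but catches most repeats):

```python
import re

def normalize_title(title):
    """Lowercase and strip punctuation so near-identical titles compare equal."""
    return re.sub(r'[^a-z0-9 ]', '', title.lower()).strip()

def merge_papers(arxiv_papers, crossref_papers):
    """Combine both result lists, keeping the first occurrence of each title."""
    seen = set()
    merged = []
    for paper in arxiv_papers + crossref_papers:
        key = normalize_title(paper['title'])
        if key not in seen:
            seen.add(key)
            merged.append(paper)
    return merged

# Example with stub records:
a = [{'title': 'Attention Is All You Need'}]
c = [{'title': 'Attention is all you need.'}, {'title': 'Another Paper'}]
print(len(merge_papers(a, c)))  # 2 — the duplicate title is dropped
```

Title matching won't catch papers renamed between preprint and publication, but it's free and has no dependencies.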

Making It Automatic

Save as research_digest.py and add to cron:

# Run every morning at 8 AM
0 8 * * * python3 /path/to/research_digest.py >> /path/to/digest.log 2>&1
Enter fullscreen mode Exit fullscreen mode
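A log file is easy to ignore. If your team lives in email rather than Slack, you can wrap the digest text in a message with just the standard library — a sketch under assumptions (the SMTP host, port, and addresses below are placeholders, and `build_digest_email`/`send_digest` are my names, not part of the script):

```python
import smtplib
from email.mime.text import MIMEText
from datetime import datetime

def build_digest_email(digest_text, sender, recipient):
    """Wrap the plain-text digest in a MIME message."""
    msg = MIMEText(digest_text)
    msg['Subject'] = f'Research Digest — {datetime.now().strftime("%Y-%m-%d")}'
    msg['From'] = sender
    msg['To'] = recipient
    return msg

def send_digest(msg, host='localhost', port=25):
    """Hand the message to an SMTP server (placeholder host — use your relay)."""
    with smtplib.SMTP(host, port) as server:
        server.send_message(msg)

# Build (but don't send) an example message:
msg = build_digest_email('## TRANSFORMER ARCHITECTURE\n...',
                         'bot@example.com', 'team@example.com')
print(msg['To'])  # team@example.com
```

Capture the digest by redirecting the script's stdout (or refactor `daily_research_digest` to return a string instead of printing) and pass it in as `digest_text`.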

Or send to Slack/Discord:

import requests

def send_to_slack(papers, webhook_url):
    blocks = []
    for p in papers[:10]:
        blocks.append({
            'type': 'section',
            'text': {'type': 'mrkdwn', 'text': f'*{p["title"]}*\n{p.get("link", p.get("doi", ""))}'}
        })
    requests.post(webhook_url, json={'blocks': blocks})

The Real Savings

| Service | Cost | What you get |
| --- | --- | --- |
| Iris.ai | $180/mo | AI paper recommendations |
| Connected Papers Pro | $96/mo | Visual paper graphs |
| Semantic Scholar Alert | Free but limited | 3 queries/min |
| This script | $0 | Unlimited queries, customizable |

The paid services add AI summaries and recommendation graphs. But if you just need "show me new papers about X" — that's exactly what the arXiv + Crossref APIs do.

Extending It

  1. Add semantic search — use sentence-transformers to rank papers by relevance
  2. Build a RAG pipeline — embed papers in ChromaDB, query with natural language
  3. Track citations over time — Crossref gives citation counts, great for finding trending papers
  4. Filter by institution — Crossref metadata includes author affiliations
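For idea 1, embeddings are where the real gains are, but even a crude token-overlap score gives you a relevance ordering with zero model downloads. A baseline sketch (this is *not* the sentence-transformers approach, just a dependency-free stand-in that ranks the dicts `search_arxiv` returns; both function names are mine):

```python
def relevance_score(paper, query):
    """Fraction of query tokens found in the paper's title + summary."""
    text = (paper.get('title', '') + ' ' + paper.get('summary', '')).lower()
    tokens = query.lower().split()
    return sum(tok in text for tok in tokens) / len(tokens)

def rank_papers(papers, query):
    """Sort papers by descending keyword overlap with the query."""
    return sorted(papers, key=lambda p: relevance_score(p, query), reverse=True)

papers = [
    {'title': 'Efficient Transformers', 'summary': 'A survey of attention variants.'},
    {'title': 'Robot Grasping', 'summary': 'Manipulation with RL.'},
]
top = rank_papers(papers, 'transformer attention')
print(top[0]['title'])  # Efficient Transformers
```

When you outgrow this, swapping `relevance_score` for cosine similarity over sentence-transformers embeddings is a drop-in change — the ranking function stays the same.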

API Templates

I have ready-to-use templates for 20+ APIs (arXiv, Crossref, npm, Shodan, HIBP, and more): api-scraping-templates

Full list of 77 scraping tools: awesome-web-scraping-2026


What research APIs are you using? I'm building a collection of free data sources — share yours in the comments.
