DEV Community: Reel Crave

The Smartest Founders I Know Read Research Papers. Here Is How They Do It at Scale.

Reel Crave — Mon, 22 Jun 2026 18:33:37 +0000

There is a reason some people always seem to see what is coming before everyone else.

There is a pattern I have noticed in founders and product teams that consistently build ahead of the curve.

They read research.

Not blog posts summarising research. Not Twitter threads about research. The actual papers. Because by the time a scientific finding becomes a blog post, someone is already building a company on it. By the time it becomes a Twitter thread, that company has raised a round.

The gap between "published in a journal" and "everyone knows about this" is where the real edge lives. And for a long time, exploiting that gap at scale was genuinely hard.

Not anymore.

The Problem With Staying on Top of Research

Reading papers manually does not scale. A fast-moving field produces hundreds of relevant papers every month. You cannot read them all. You cannot even skim them all.

What you can do is build a system that reads them for you, filters for what matters, and surfaces only the signal.

The bottleneck has never been processing or summarisation. LLMs are excellent at that now. The bottleneck has always been getting the papers into your system in the first place, cleanly, reliably, and automatically.

That is the problem ScholarAPI solves.

What This Actually Looks Like in Practice

Say you are a founder building in the longevity space. You want to know every meaningful paper published on senolytics, NAD+ metabolism, or mTOR inhibition within 48 hours of it going live.

Or you are a VC doing diligence on a biotech. You want to understand the research landscape around a specific compound before your partner meeting on Thursday.

Or you are a product team at a healthcare company that needs to monitor clinical trial literature for competitive signals on a therapeutic area you are entering.

In all three cases, the workflow is the same.

Find the papers.
Get the full text.
Process it.
Surface the signal.

ScholarAPI handles the first two steps. Completely.

The Setup

ScholarAPI is a REST API with access to 30 million plus open-access papers from 20,000 plus academic sources. New papers appear in the index within 24 to 48 hours of publication.

The endpoint that makes the monitoring use case work is /list with the indexed_after parameter. You give it a keyword and a timestamp. It returns everything new that matches.

import requests
from datetime import datetime, timedelta, timezone

API_KEY = "sch_xxxxxxxxx"
BASE = "https://scholarapi.net/api/v1"
HEADERS = {"X-API-Key": API_KEY}

def get_new_papers(topic: str, hours_back: int = 48) -> list:
    """Return papers matching `topic` that were indexed in the last `hours_back` hours."""
    since = (datetime.now(timezone.utc) - timedelta(hours=hours_back)).isoformat()

    resp = requests.get(
        f"{BASE}/list",
        headers=HEADERS,
        params={
            "q":             topic,
            "indexed_after": since,
            "has_text":      "true",
            "limit":         50
        }
    )
    resp.raise_for_status()  # surface HTTP errors instead of parsing an error body
    return resp.json().get("results", [])

Then you pull the full text for anything that looks relevant:

def get_full_text(paper_id: str) -> str:
    """Return the extracted full text, or an empty string if the request fails."""
    resp = requests.get(f"{BASE}/text/{paper_id}", headers=HEADERS)
    return resp.text if resp.status_code == 200 else ""

Then you hand it to an LLM and ask it to summarise, extract key findings, flag anything that matches a list of competitor names or compounds you care about, or score relevance to a thesis you are tracking.
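The watchlist part of that step does not need an LLM at all. As a minimal sketch (the `WATCHLIST` terms and the function name are illustrative, not part of ScholarAPI), plain case-insensitive matching over the full text can flag papers before anything is sent to a model:

```python
import re

# Hypothetical watchlist; replace with the names you actually track.
WATCHLIST = ["rapamycin", "NAD+", "senolytic"]


def flag_watchlist_terms(text: str, terms: list[str]) -> dict[str, int]:
    """Count case-insensitive occurrences of each watchlist term in text."""
    hits = {}
    for term in terms:
        # re.escape keeps characters such as '+' literal.
        count = len(re.findall(re.escape(term), text, flags=re.IGNORECASE))
        if count:
            hits[term] = count
    return hits


sample = "Rapamycin extended lifespan; NAD+ levels rose. rapamycin dosing varied."
print(flag_watchlist_terms(sample, WATCHLIST))  # {'rapamycin': 2, 'NAD+': 1}
```

Papers with no hits can be skipped or summarised more cheaply, which keeps the LLM step focused on the ones that matter.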

The whole pipeline runs in under a minute. You can schedule it daily, pipe the output to Slack, Notion, an email digest, whatever your team already uses.
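For the Slack case, the output only needs to become a JSON body with a `text` field, which is what Slack incoming webhooks accept. The sketch below builds that payload and stops there; the `title` and `url` keys are assumptions about the result shape, and the sample records are made up.

```python
import json


def build_digest(topic: str, papers: list[dict], max_items: int = 10) -> dict:
    """Build a JSON-serialisable payload containing a plain-text digest."""
    lines = [f"New papers on {topic}: {len(papers)}"]
    for paper in papers[:max_items]:
        title = paper.get("title") or "(untitled)"
        url = paper.get("url") or ""
        # rstrip drops the trailing space when a record has no URL.
        lines.append(f"- {title} {url}".rstrip())
    return {"text": "\n".join(lines)}


papers = [
    {"title": "Senolytic clearance in aged mice", "url": "https://example.org/p1"},
    {"title": None, "url": None},
]
payload = build_digest("senolytics", papers)
print(json.dumps(payload, indent=2))
```

Sending it is then one call, for example `requests.post(webhook_url, json=payload)` with your own webhook URL.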

Why This Is Different From Google Alerts or RSS Feeds

Google Alerts gives you web mentions. RSS feeds from journal websites give you titles and abstracts, if you are lucky. Neither gives you full text. Neither is programmable in any meaningful way.

ScholarAPI gives you the actual paper content, structured, clean, and queryable. That means you can do things like:

Search across 30 million papers for a specific chemical compound name and get back papers that mention it in the body text, not just the title.

Pull everything published on a topic in the last 30 days and feed it into an embedding pipeline to find semantic clusters you did not know to search for.

Cross-reference author names across papers to map which research groups are most active in a space and track their output over time.

None of that is possible with alerts or RSS. All of it is straightforward once you have programmatic full-text access.
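The author cross-referencing idea, for instance, reduces to counting names across a result set. This is a rough sketch that assumes each result carries an `authors` list of strings; real data would also need name normalisation, which is left out here.

```python
from collections import Counter


def most_active_authors(papers: list[dict], top_n: int = 5) -> list[tuple[str, int]]:
    """Rank authors by how many papers in the result set list them."""
    counts = Counter()
    for paper in papers:
        # A set so an author repeated within one record is counted once.
        for author in {a.strip() for a in (paper.get("authors") or [])}:
            counts[author] += 1
    return counts.most_common(top_n)


# Made-up records for illustration.
papers = [
    {"authors": ["Kim, J.", "Osei, A."]},
    {"authors": ["Kim, J."]},
    {"authors": ["Kim, J.", "Ruiz, M."]},
    {},  # records without an authors field are tolerated
]
print(most_active_authors(papers, top_n=1))  # [('Kim, J.', 3)]
```

Running the same count on monthly slices of results gives a crude view of how a group's output changes over time.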

For the Non-Technical Reader

You do not need to run this code yourself. You need to know it is possible, and either find someone who can build it or use it as context when evaluating tools that claim to do this for you.

The core insight is this: competitive intelligence on emerging research has historically required either expensive proprietary databases or a dedicated person whose job is reading papers all day. ScholarAPI makes the data layer of that problem cheap and accessible. What you do with the data is up to you.

If you work in an industry that is being reshaped by science, which at this point is most industries, the teams that will win are the ones who see the research before it becomes consensus.

That edge is a data pipeline away.

What It Costs

1,000 free credits on signup at scholarapi.net. A search call is 10 credits plus 2 per result returned. Full text is 3 credits per paper at current pricing.

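Monitoring a topic daily, pulling 20 new papers, reading full text on 10 of them: 10 + 2 × 20 = 50 credits for the search plus 10 × 3 = 30 for full text, roughly 80 credits per day, assuming /list is billed like a search call. The $19.90 starter pack lasts over a month of daily monitoring at that rate.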

The Quiet Advantage

The best investment research, the best product decisions, and the best startup ideas I have seen in the last few years all had one thing in common: the person making the call had read something that most people had not gotten to yet.

Research papers are the earliest signal that exists for most scientific and technological shifts. They are public. They are free to access. They are just inconvenient enough to read systematically that most people do not bother.

That inconvenience is the moat. And now there is an API for it.

scholarapi.net

Tags: productivity python startup career

Academic Plagiarism Is a Data Problem. Most Tools Are Solving the Wrong Thing.

Reel Crave — Sat, 13 Jun 2026 14:19:56 +0000

Why the reference corpus matters more than the algorithm, and what actually fixes it.

A professor I know spent three weeks investigating a suspected plagiarism case last year.

Not three weeks because the writing was hard to spot. Three weeks because verifying it, finding the original source, pulling the actual text, comparing it properly, was genuinely painful. She had a hunch. She had a student's submitted paper. What she didn't have was a fast, reliable way to check that hunch against millions of published papers without paying for an enterprise tool her university couldn't afford or manually trawling databases she barely had access to.

She eventually found it. The student had lifted three paragraphs almost verbatim from a 2019 materials science paper published in an open-access journal most people have never heard of.

The tool she used to catch it? Google Scholar and instinct. The time it took? Embarrassing for 2025.

The Actual Problem With Plagiarism Detection

Here is what most people assume: plagiarism detection is an NLP problem. Train a model, compute similarity scores, flag matches above a threshold. Problem solved.

That assumption is wrong, or at least incomplete.

The NLP part is largely solved. Cosine similarity, embedding-based semantic search, and n-gram overlap are all mature techniques. You can implement a basic plagiarism detector in an afternoon.

What you cannot implement in an afternoon is a reference corpus worth checking against.

This is the part nobody talks about. The algorithm is only as good as the documents you compare against. If your reference database covers PubMed and a handful of major journals, you will miss the paper published in a regional open-access journal in 2017. You will miss the conference proceedings. You will miss the preprint that never made it into a major index but circulated widely enough to be plagiarised.

Coverage is everything. And coverage is exactly where most tools quietly fail.

What Founders Building in This Space Actually Need

If you are building a plagiarism detection product, for universities, publishers, academic integrity platforms, or even just internal research quality tools, you have three real problems:

The corpus problem. You need programmatic access to millions of papers. Not metadata. Not abstracts. Full text, because plagiarism hides in body paragraphs, not titles.

The freshness problem. A student plagiarising today might be copying from a paper published last month. Your reference database needs to stay current, not just be a snapshot from three years ago.

The cost problem. Licensing access to academic content at scale from traditional publishers is genuinely expensive and slow. The contracts alone take months.

Open-access literature sidesteps the third problem entirely. And open access has grown dramatically: a significant and increasing share of new research is published open-access, especially in the sciences and medicine. For most plagiarism detection use cases, it is where the viable corpus lives.

Where ScholarAPI Fits

ScholarAPI indexes 30 million plus open-access papers from 20,000 plus academic sources. The key thing for plagiarism use cases specifically is not the search endpoint but the full-text extraction.

Most academic APIs will give you a title, an abstract, maybe a DOI. ScholarAPI gives you the actual paper text, pre-extracted and clean, via a single API call.

curl "https://scholarapi.net/api/v1/text/{paper_id}" \
  -H "X-API-Key: sch_xxxxxxxxx"

That returns the extracted full text of the paper. Not HTML. Not a PDF binary you have to parse yourself. The text, ready to compare against.

For building a plagiarism detection pipeline, this changes the economics completely. Instead of building and maintaining a PDF extraction layer, which is genuinely painful, especially for two-column academic layouts, you get clean text directly. Your engineering effort goes into the comparison logic, which is the interesting part.

The bulk endpoint matters here too. /texts/{ids} lets you pull up to 100 full texts in a single call. When you are checking a submitted manuscript against candidate papers, that means your reference lookup is one request, not a hundred.
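When the candidate list is longer than 100, the IDs have to be split before calling the bulk endpoint. A small helper like the following would do it; the function name and the sample IDs are made up for illustration.

```python
def chunk_ids(paper_ids: list[str], size: int = 100) -> list[list[str]]:
    """Split IDs into consecutive batches of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [paper_ids[i:i + size] for i in range(0, len(paper_ids), size)]


ids = [f"paper-{n}" for n in range(250)]  # made-up IDs for illustration
batches = chunk_ids(ids)
print([len(b) for b in batches])  # [100, 100, 50]

# Each batch becomes one comma-separated path segment for the bulk endpoint.
paths = ["/texts/" + ",".join(batch) for batch in batches]
```

So 250 candidates cost three requests instead of 250.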

A Simple Pipeline That Actually Works

This is not a production system. It is the skeleton of one, enough to show how the pieces fit together.

import requests
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

API_KEY = "sch_xxxxxxxxx"
BASE    = "https://scholarapi.net/api/v1"
HEADERS = {"X-API-Key": API_KEY}


def find_candidates(manuscript_excerpt: str, top_k: int = 20) -> list:
    """Search for papers likely to match the manuscript content."""
    resp = requests.get(f"{BASE}/search", headers=HEADERS, params={
        "q": manuscript_excerpt[:200],  # use a representative excerpt as query
        "limit": top_k
    })
    return resp.json().get("results", [])


def fetch_full_texts(paper_ids: list) -> dict:
    """Bulk fetch full text for candidate papers (up to 100 IDs per call)."""
    ids_str = ",".join(paper_ids[:100])
    resp = requests.get(f"{BASE}/texts/{ids_str}", headers=HEADERS)
    resp.raise_for_status()  # surface HTTP errors instead of parsing an error body
    return resp.json()  # returns {paper_id: full_text}


def check_similarity(manuscript: str, reference_texts: dict) -> list:
    """
    Compute TF-IDF cosine similarity between manuscript
    and each reference paper. Returns ranked results.
    """
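    # Drop references whose text came back empty; the vectorizer needs strings
    reference_texts = {pid: t for pid, t in reference_texts.items() if t}
    if not reference_texts:
        return []
    docs    = [manuscript] + list(reference_texts.values())
    ids     = list(reference_texts.keys())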

    vectorizer = TfidfVectorizer(ngram_range=(2, 4))
    matrix     = vectorizer.fit_transform(docs)

    scores = cosine_similarity(matrix[0:1], matrix[1:]).flatten()

    ranked = sorted(
        zip(ids, scores),
        key=lambda x: x[1],
        reverse=True
    )
    return ranked


def run_check(manuscript: str):
    print("Finding candidate papers...")
    candidates = find_candidates(manuscript)

    if not candidates:
        print("No candidates found.")
        return

    ids   = [p["id"] for p in candidates]
    titles = {p["id"]: p["title"] for p in candidates}

    print(f"Fetching full text for {len(ids)} candidates...")
    texts = fetch_full_texts(ids)

    print("Computing similarity scores...")
    results = check_similarity(manuscript, texts)

    print("\nTop matches:")
    for paper_id, score in results[:5]:
        print(f"  {score:.3f} — {titles.get(paper_id, paper_id)}")


# Example
manuscript_sample = """
The electron transport chain generates ATP through a series of 
oxidation-reduction reactions across the inner mitochondrial membrane...
"""
run_check(manuscript_sample)

The TF-IDF approach with word n-grams from bigrams up to 4-grams catches near-verbatim copying and close paraphrasing reasonably well. For a production system you would swap this for embedding-based similarity (sentence transformers work well here), but the structure stays the same. ScholarAPI handles the corpus. You handle the comparison.
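One limitation is worth a sketch: a document-level score can dilute a few copied paragraphs inside a long manuscript. A common complement, shown here as an illustration and not as part of the pipeline above, is word-shingle containment, meaning the share of the submission's k-word windows that also appear in a source. The function names and the default `k=8` are arbitrary choices.

```python
import re


def word_shingles(text: str, k: int = 8) -> set[tuple[str, ...]]:
    """Return the set of k-word windows in text, lowercased, punctuation dropped."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def containment(submission: str, source: str, k: int = 8) -> float:
    """Fraction of the submission's k-word shingles that also occur in source."""
    sub = word_shingles(submission, k)
    if not sub:
        return 0.0  # submission shorter than k words
    return len(sub & word_shingles(source, k)) / len(sub)


source = ("The electron transport chain generates ATP through a series of "
          "oxidation-reduction reactions across the inner mitochondrial membrane.")
submission = ("Our results show that the electron transport chain generates "
              "ATP through a series of steps.")
score = containment(submission, source, k=5)
print(f"{score:.2f} of the submission's 5-word windows appear in the source")
```

Running this per paragraph rather than per document points a reviewer at the passages to read side by side.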

For Teachers and Institutions Specifically

If you are not a developer but you are reading this because you deal with academic integrity, this section is for you.

The reason tools like Turnitin work for obvious cases is that they have large proprietary databases and student paper repositories. Where they struggle is niche open-access literature, non-English language journals, and recently published papers that have not yet been indexed.

ScholarAPI's index is specifically open-access, which means it covers exactly the blind spot that traditional tools miss. A paper published last month in an open-access biology journal will be in the index within 48 hours. That freshness is not something most institutional tools can match.

If your institution has a developer who can spend a few hours with an API, the cost of building a basic checking tool on top of ScholarAPI is genuinely low. 1,000 free credits on signup at scholarapi.net. A search call costs 10 credits plus 2 per result. Full text retrieval is currently 3 credits per paper at promo pricing. Checking a submitted essay against 50 candidate papers costs roughly 260 credits: 10 + 2 × 50 = 110 for the search and 50 × 3 = 150 for the full texts. The free signup credits cover about three such checks.

That is not a replacement for institutional tools. It is a supplement for the cases those tools miss.

The Honest Bit

ScholarAPI is open-access only. Elsevier, Wiley, Taylor and Francis subscription content is not in there. If the suspected plagiarism source is behind a paywall, this does not help you find it.

But here is the practical reality: most plagiarism in student work comes from accessible sources. Things students could actually read. Open-access papers, preprints, publicly available theses. Subscription-only journal articles from 2011 that require institutional access are rarely the source. They are not readable without credentials. Students plagiarise what they can reach.

Open-access coverage catches most of what matters.

Where This Goes

The plagiarism detection space is quietly getting rebuilt. Embedding models and semantic similarity have made it possible to catch paraphrasing that keyword overlap misses entirely. The missing piece has always been corpus coverage, having enough of the right documents to check against.

That is a data access problem more than an AI problem. And data access problems have boring, practical solutions.

ScholarAPI is one of them. Not glamorous. Not a research breakthrough. Just 30 million papers, clean full text, and an API that works.

Try it at scholarapi.net. The free credits are enough to build something real.

Tags: python webdev career tutorial

How to Build a Clean Academic Dataset Without Losing Your Mind (or Your Weekend)

Reel Crave — Thu, 28 May 2026 11:17:51 +0000

The dataset problem nobody talks about, and the API that quietly solves it.

Everyone has an opinion on which model to fine-tune.

Nobody talks about where the training data actually comes from.

Ask any ML engineer who has built something on scientific literature and you'll hear the same story: the model took two weeks. The dataset took two months. The dataset was the hard part.

I've been there. Cobbling together CSVs from PubMed exports, writing scrapers that broke every time a journal sneezed, hand-cleaning PDF extractions that looked like someone ran a blender through a research paper. It's unglamorous, it's slow, and it's the reason a lot of genuinely good AI projects never ship.

This article is about doing it the right way: building clean, structured, reproducible academic datasets using ScholarAPI. We'll go from zero to a working dataset pipeline, with real code you can run today.

First: Why Academic Datasets Are Uniquely Painful

Most dataset-building tutorials assume you're scraping Reddit or pulling from a nice REST API with a consistent schema. Academic literature is neither of those things.

Here's what you're actually dealing with:

Fragmentation. Research is spread across 20,000+ journals, repositories, preprint servers, and institutional databases. There is no single place to query all of it. PubMed covers medicine. arXiv covers physics and CS. Neither covers materials science, economics, or law particularly well.

Format chaos. The canonical format for academic publishing is PDF, a format designed for print, not machines. Extracting clean text from a PDF is a non-trivial engineering problem. Do it wrong and you get scrambled column layouts, broken equations, and reference lists fused into body text.

No stable programmatic access. Google Scholar is estimated to index around 389 million records. It also has no official API. The moment your scraper gets reliable, Google changes something and you're back to zero.

Legal ambiguity at scale. Using copyrighted content to train models is genuinely complicated. Open-access literature, where authors have explicitly licensed reuse, is the safe zone. But you have to know what you're pulling.

ScholarAPI is built around exactly these constraints: 30M+ open-access papers, pre-extracted full text, structured JSON, stable endpoints. It doesn't solve every problem but it eliminates the ones that kill most projects before they start.

What You're Actually Building

By the end of this article you'll have:

A domain-specific corpus builder: pull N papers on any topic, clean and structured
A streaming dataset pipeline: new papers automatically added as they're published
A Hugging Face-ready dataset: pushed directly to the Hub with proper schema

All three use the same four endpoints:

GET /api/v1/search          # find papers by keyword
GET /api/v1/list            # paginate by date, monitor new content  
GET /api/v1/text/{id}       # clean extracted full text
GET /api/v1/texts/{ids}     # bulk, up to 100 texts in one call

Auth is one header everywhere: X-API-Key: sch_xxxxxxxxx

Get your key at scholarapi.net, 1,000 free credits on signup, enough to pull a few hundred full texts and genuinely evaluate whether this works for your use case.

Part 1: The Domain Corpus Builder

This is the most common use case: you need N papers on a topic, with full text, for fine-tuning, RAG, or evaluation.

import requests
import json
import time
from pathlib import Path

API_KEY = "sch_xxxxxxxxx"
BASE    = "https://scholarapi.net/api/v1"
HEADERS = {"X-API-Key": API_KEY}

def search_papers(query: str, limit: int = 100) -> list[dict]:
    """Search for papers matching a query. Returns metadata list."""
    resp = requests.get(
        f"{BASE}/search",
        headers=HEADERS,
        params={"q": query, "limit": limit}
    )
    resp.raise_for_status()
    return resp.json().get("results", [])


def fetch_texts_bulk(paper_ids: list[str]) -> dict[str, str]:
    """
    Fetch full text for up to 100 papers in one API call.
    Returns {paper_id: full_text} dict.
    """
    # API accepts comma-separated IDs
    ids_str = ",".join(paper_ids[:100])
    resp = requests.get(
        f"{BASE}/texts/{ids_str}",
        headers=HEADERS
    )
    resp.raise_for_status()
    return resp.json()  # {id: text, id: text, ...}


def build_corpus(query: str, target_size: int = 500, output_path: str = "corpus.jsonl") -> int:
    """
    Build a full-text corpus for a given query topic.
    Saves to JSONL, one JSON object per line, easy to stream later.
    """
    print(f"Searching for papers: '{query}'")
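    # This sketch makes one search call capped at 100 results, so a larger
    # target_size needs additional queries or pagination to be reached.
    papers = search_papers(query, limit=min(target_size, 100))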
    print(f"Found {len(papers)} papers in search results")

    # Filter to papers that have full text available
    with_text = [p for p in papers if p.get("has_text")]
    print(f"{len(with_text)} have full text available")

    written = 0
    # Batch into groups of 100 for the bulk endpoint
    with open(output_path, "w") as f:
        for i in range(0, len(with_text), 100):
            batch = with_text[i:i+100]
            ids   = [p["id"] for p in batch]

            texts = fetch_texts_bulk(ids)

            for paper in batch:
                pid  = paper["id"]
                text = texts.get(pid)
                if not text:
                    continue

                record = {
                    "id":             pid,
                    "title":          paper.get("title"),
                    "authors":        paper.get("authors", []),
                    "published_date": paper.get("published_date"),
                    "journal":        paper.get("journal"),
                    "abstract":       paper.get("abstract"),
                    "full_text":      text,
                    "source_url":     paper.get("url"),   # auditable backlink
                    "query":          query,
                }
                f.write(json.dumps(record) + "\n")
                written += 1

            # Be a good API citizen
            time.sleep(0.5)
            print(f"  Batch {i//100 + 1} done — {written} records so far")

    print(f"\nCorpus saved to {output_path} — {written} papers with full text")
    return written


# Run it
build_corpus(
    query="transformer attention mechanism natural language processing",
    target_size=200,
    output_path="nlp_corpus.jsonl"
)

Why JSONL? Because it streams. You can process a 10GB JSONL file line-by-line without loading it into memory. It's also a format the Hugging Face datasets library loads natively. Start with JSONL; you'll thank yourself later.
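To make the streaming point concrete, here is a minimal sketch of a line-by-line reader. The function names are illustrative; `full_text` is the field written by the corpus builder above.

```python
import json
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one record at a time so the whole file is never held in memory."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)


def count_long_records(path: str, min_chars: int = 1000) -> int:
    """Example consumer: count records whose full_text meets a length floor."""
    return sum(
        1 for rec in iter_jsonl(path)
        if len(rec.get("full_text") or "") >= min_chars
    )


# Usage, once a corpus file exists:
# print(count_long_records("nlp_corpus.jsonl"))
```

Because `iter_jsonl` is a generator, memory use stays at roughly one record regardless of file size.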

Why source_url in every record? ScholarAPI includes a backlink to the original paper on every result. Keep it. When someone asks "where did this training data come from," you have a per-record answer. That's the difference between an auditable dataset and a liability.

Part 2: The Streaming Pipeline

Static datasets go stale. If you're building a system that needs to stay current with new research (a literature monitoring agent, a continuously updated RAG knowledge base, an LLM that gets fine-tuned monthly), you need a pipeline, not a one-time dump.

The /list endpoint with indexed_after is what makes this possible. ScholarAPI indexes new papers within 24–48 hours of publication. Here's a pipeline that runs daily and appends only new content:

import requests
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

API_KEY = "sch_xxxxxxxxx"
BASE    = "https://scholarapi.net/api/v1"
HEADERS = {"X-API-Key": API_KEY}

# State file — tracks when we last ran
STATE_FILE = Path(".pipeline_state.json")


def load_state() -> dict:
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    # First run, go back 7 days
    default_since = (datetime.now(timezone.utc) - timedelta(days=7)).isoformat()
    return {"last_run": default_since, "total_records": 0}


def save_state(state: dict):
    STATE_FILE.write_text(json.dumps(state, indent=2))


def fetch_new_papers(keyword: str, since: str) -> list[dict]:
    """Pull all new papers matching keyword since a given timestamp."""
    all_results = []
    cursor      = None

    while True:
        params = {
            "q":             keyword,
            "indexed_after": since,
            "has_text":      "true",
            "limit":         100,
        }
        if cursor:
            params["cursor"] = cursor

        resp = requests.get(f"{BASE}/list", headers=HEADERS, params=params)
        resp.raise_for_status()
        data = resp.json()

        results = data.get("results", [])
        all_results.extend(results)

        # Paginate until exhausted
        cursor = data.get("next_cursor")
        if not cursor or not results:
            break

    return all_results


def run_pipeline(keyword: str, output_file: str = "stream.jsonl"):
    state = load_state()
    since = state["last_run"]
    now   = datetime.now(timezone.utc).isoformat()

    print(f"Pipeline run: {since} → {now}")
    print(f"Keyword: '{keyword}'")

    papers = fetch_new_papers(keyword, since)
    print(f"New papers found: {len(papers)}")

    if not papers:
        print("Nothing new. Updating state and exiting.")
        state["last_run"] = now
        save_state(state)
        return

    # Bulk-fetch full texts in batches of 100
    added = 0
    with open(output_file, "a") as f:  # append mode
        for i in range(0, len(papers), 100):
            batch = papers[i:i+100]
            ids   = [p["id"] for p in batch]
            ids_str = ",".join(ids)

            texts_resp = requests.get(
                f"{BASE}/texts/{ids_str}",
                headers=HEADERS
            )
            texts_resp.raise_for_status()  # fail loudly instead of parsing an error body
            texts = texts_resp.json()

            for paper in batch:
                pid  = paper["id"]
                text = texts.get(pid)
                if not text:
                    continue

                f.write(json.dumps({
                    "id":             pid,
                    "title":          paper.get("title"),
                    "published_date": paper.get("published_date"),
                    "indexed_at":     paper.get("indexed_at"),
                    "full_text":      text,
                    "source_url":     paper.get("url"),
                    "pipeline_run":   now,
                }) + "\n")
                added += 1

    state["last_run"]      = now
    state["total_records"] = state.get("total_records", 0) + added
    save_state(state)

    print(f"Added {added} new records. Total dataset size: {state['total_records']}")


# Run it — or stick this in a cron job / Airflow DAG
run_pipeline(
    keyword="CRISPR gene editing therapy",
    output_file="crispr_stream.jsonl"
)

Cron it at 6am daily:

0 6 * * * /usr/bin/python3 /path/to/pipeline.py >> /var/log/pipeline.log 2>&1

Your dataset grows automatically. Every morning it's slightly smarter than yesterday.

Part 3: Push to Hugging Face

You have a JSONL file. Now make it useful to everyone, including your future self.

from datasets import Dataset, DatasetDict
import json
from pathlib import Path


def jsonl_to_hf_dataset(jsonl_path: str, train_split: float = 0.9) -> DatasetDict:
    """
    Load a JSONL corpus and split it into train/test sets.
    Returns a DatasetDict; pushing to the Hub is done separately by push_to_hub().
    """
    records = []
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))

    print(f"Loaded {len(records)} records from {jsonl_path}")

    # Build HF Dataset
    ds = Dataset.from_list(records)

    # Train/test split
    split     = ds.train_test_split(test_size=1 - train_split, seed=42)
    ds_dict   = DatasetDict({"train": split["train"], "test": split["test"]})

    print(f"Train: {len(ds_dict['train'])} | Test: {len(ds_dict['test'])}")
    return ds_dict


def push_to_hub(ds_dict: DatasetDict, repo_id: str, hf_token: str):
    """Push dataset to Hugging Face Hub."""
    ds_dict.push_to_hub(
        repo_id,
        token=hf_token,
        commit_message="Dataset built with ScholarAPI, open-access full text"
    )
    print(f"Dataset live at: https://huggingface.co/datasets/{repo_id}")


# Full flow
ds = jsonl_to_hf_dataset("nlp_corpus.jsonl")
push_to_hub(
    ds_dict=ds,
    repo_id="your-org/nlp-academic-corpus",
    hf_token="hf_xxxxxxxxxx"
)

That's it. Your dataset is on the Hub, versioned, citable, and searchable.

The Credit Math (Because You're Going to Ask)

Credits aren't opaque. Here's exactly what a dataset build costs:

Action	Credits
`/search` (per call)	10 + 2 per result
`/text/{id}` (single)	3 credits (promo, normally 5)
`/texts/{ids}` (bulk)	3 per paper (promo, normally 5)
`/pdf/{id}`	5 credits (promo, normally 10)

Real example: Building a 500-paper full-text corpus.

5 search calls × (10 + 200) credits = 1,050 credits
500 full texts × 3 credits = 1,500 credits
Total: ~2,550 credits = $19.90 pack

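A 5,000-paper corpus at promo rates works out to about 25,500 credits by the same arithmetic (50 search calls × 210 = 10,500, plus 5,000 full texts × 3 = 15,000), so it needs more than two of the $149 packs (10K credits each).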

Promo pricing on text and PDF endpoints runs until end of June 2026, worth building sooner rather than later.
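The arithmetic in the worked example can be wrapped in a small estimator. This sketch uses the promo prices from the table and assumes every search call returns a full page of results and that every result's full text is fetched; the function name is illustrative.

```python
import math

# Prices as quoted in the table above (promo rates); adjust if they change.
SEARCH_BASE = 10
SEARCH_PER_RESULT = 2
TEXT_PER_PAPER = 3


def estimate_credits(n_papers: int, results_per_call: int = 100) -> int:
    """Estimate credits to search for and fetch the full text of n_papers."""
    calls = math.ceil(n_papers / results_per_call)
    search = calls * SEARCH_BASE + n_papers * SEARCH_PER_RESULT
    return search + n_papers * TEXT_PER_PAPER


print(estimate_credits(500))  # 2550, matching the worked example
```

Real runs will differ somewhat, since some results lack full text and you pay the per-result search fee for papers you end up discarding.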

The Things That Will Trip You Up

has_text is not guaranteed. Even with has_text=true in your list query, a small percentage of papers will return empty text. The PDF exists but extraction failed: a corrupted file, a scanned image-only PDF, an unusual encoding. Build your pipeline to handle None text gracefully. We do this above with the if not text: continue guard.

Deduplication matters. If you run multiple queries on overlapping topics, you'll get duplicate papers with different query labels. Deduplicate by id before training. Don't skip this, duplicates in training data are a quiet way to skew your model.

# Deduplicate a JSONL by paper ID
import json

seen = set()
with open("corpus.jsonl") as f_in, open("corpus_deduped.jsonl", "w") as f_out:
    for line in f_in:
        record = json.loads(line)
        if record["id"] not in seen:
            seen.add(record["id"])
            f_out.write(line)

print(f"Unique papers: {len(seen)}")

Open-access only. Subscription-paywalled content from Elsevier, Wiley, and Taylor & Francis isn't here. Open-access publications from Springer and Nature are. Check your target domain's open-access rate before committing to a corpus size: CS and medicine have excellent OA coverage; some law and humanities journals less so.

Rate limits exist but are generous. Don't hammer the API with 1,000 parallel requests. The time.sleep(0.5) in the corpus builder above is intentional. You'll get cleaner results and avoid any throttling.
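A fixed sleep handles the normal case; for the occasional throttled or failed request, retrying with exponential backoff is the usual complement. The sketch below is generic and not taken from the ScholarAPI docs: it retries on any exception because the article does not say which status code signals throttling, and the function name and defaults are illustrative.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(call: Callable[[], T], retries: int = 4, base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call(); on an exception wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(retries):
        try:
            return call()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
    raise ValueError("retries must be at least 1")


# Usage with the bulk fetch defined earlier, which raises on HTTP errors:
# texts = with_backoff(lambda: fetch_texts_bulk(ids))
```

The `sleep` parameter exists so the delays can be replaced in tests; in normal use the default `time.sleep` applies.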

A Real-World Schema Worth Copying

Here's the schema I actually use in production. Opinionated, tested, ready to go:

RECORD_SCHEMA = {
    # Identity
    "id":             str,   # ScholarAPI paper ID, stable, use as primary key
    "source_url":     str,   # Original journal/repo URL, auditable

    # Content
    "title":          str,
    "abstract":       str,
    "full_text":      str,   # Pre-extracted, clean

    # Metadata
    "authors":        list,  # ["Last, First", ...]
    "published_date": str,   # ISO 8601: "2024-03-15"
    "journal":        str,
    "doi":            str,   # When available

    # Pipeline bookkeeping
    "query":          str,   # Which search query surfaced this paper
    "indexed_at":     str,   # When ScholarAPI indexed it
    "pipeline_run":   str,   # When YOUR pipeline ran, for debugging
}

The pipeline_run field sounds like overkill until you have three months of streaming data and need to diagnose why a batch from February looks different from March. Add it from day one.
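A schema written as a dict of types can also be enforced before records are written. This is a minimal sketch using a trimmed copy of the schema; which fields count as required is my assumption, not something the schema itself states.

```python
# Trimmed copy of the schema above; extend it with the remaining fields.
SCHEMA = {"id": str, "title": str, "full_text": str, "authors": list}
REQUIRED = {"id", "full_text"}  # assumption: fields that must be non-empty


def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field, expected in SCHEMA.items():
        value = record.get(field)
        if value is None or value == "":
            if field in REQUIRED:
                problems.append(f"missing required field: {field}")
        elif not isinstance(value, expected):
            problems.append(
                f"{field}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


print(validate_record({"id": "p1", "full_text": "text", "authors": "Kim, J."}))
# ['authors: expected list, got str']
```

Calling this just before `f.write(...)` and logging the failures gives you a record of what was rejected and why.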

What You Can Build With This

Biomedical QA fine-tuning corpus. Pull 10K papers from oncology, cardiology, and neurology. Split into (context, question, answer) triples using an LLM. Fine-tune a small model. You now have a domain-specific medical QA system trained on peer-reviewed literature.

Cross-disciplinary embedding benchmark. Pull 1K papers across 10 domains. Embed them. Measure how well different embedding models separate domains in latent space. Publish the benchmark. People will cite it.

Hallucination evaluation dataset. Take paper abstracts. Generate LLM summaries. Compare against the actual conclusion sections. You have a grounded hallucination benchmark that's impossible to game because the ground truth is published literature.

Temporal drift dataset. Pull papers from 2018 and 2024 on the same topic. Fine-tune on 2018 data, evaluate on 2024. You now have a dataset that measures how much a field has moved, which is exactly what you need to understand model knowledge cutoffs.
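The temporal split in that last idea is the easiest piece to sketch. Assuming each record carries a `published_date` in ISO 8601 form as in the schema above, and with `split_by_year` as a hypothetical helper name, it could look like this:

```python
def split_by_year(records, train_year, eval_year):
    """Partition records into train/eval lists by the year in published_date.

    Assumes published_date is an ISO 8601 string such as "2024-03-15".
    Records from other years, or with no usable date, are skipped.
    """
    train, evaluation = [], []
    for record in records:
        year = (record.get("published_date") or "")[:4]
        if year == str(train_year):
            train.append(record)
        elif year == str(eval_year):
            evaluation.append(record)
    return train, evaluation
```

Calling `split_by_year(records, 2018, 2024)` gives you the two sides of the drift comparison.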

None of these existed as clean, reproducible pipelines before tools like ScholarAPI made the data layer boring. That's the point.

The Part I Want You to Actually Remember

Dataset quality is a multiplier on everything downstream.

A mediocre model trained on clean, well-structured, domain-specific data will outperform a great model trained on garbage. This is not a controversial opinion. It's something the entire ML community knows and somehow keeps forgetting when it comes time to actually collect the data.

The data layer deserves real engineering. Reproducibility. Versioning. Auditable sources. Deduplication. Clean text, not HTML artifacts.
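Versioning can start very small. One possible approach, sketched here with a hypothetical `corpus_fingerprint` helper, is to hash the corpus contents so that any change to the records produces a different version string:

```python
import hashlib
import json

def corpus_fingerprint(records):
    """Return a SHA-256 hex digest that changes whenever the corpus changes.

    Records are sorted by id and serialized with sorted keys, so the same
    set of records hashes to the same value regardless of input order.
    """
    digest = hashlib.sha256()
    for record in sorted(records, key=lambda r: r["id"]):
        line = json.dumps(record, sort_keys=True, ensure_ascii=False)
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()
```

Store the digest next to each training run and you can tell later whether two runs saw the same data.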

ScholarAPI makes this tractable for academic literature. The endpoints are simple, the text extraction is real, and every record links back to where it came from. That's the baseline you need to build anything you'd actually trust.

The rest is your problem to solve. But at least it's the interesting part.

Start Here

# Get your API key at scholarapi.net, 1,000 free credits

# Your first corpus query:
curl "https://scholarapi.net/api/v1/search?q=your+topic+here&limit=10" \
  -H "X-API-Key: sch_xxxxxxxxx"

Full API reference. No fluff, just endpoints.

If you build something with this, a dataset, a benchmark, a pipeline, drop it in the comments. I'm genuinely curious what people are using academic literature for that I haven't thought of yet.

Tags: python machinelearning datascience api

I Was Scraping Google Scholar at 2am. There Had to Be a Better Way.

Reel Crave — Fri, 22 May 2026 12:58:48 +0000

I just didn't know academic data could be this clean.

Let me paint you a picture.

It's 2am. You have a RAG pipeline half-built. Your Google Scholar scraper, the one you spent three days nursing into existence, is throwing 403s again. You don't know if it's the IP, the headers, the rate limit, or Google quietly changing something in the DOM for the fourth time this month.

You refresh Stack Overflow. You find a six-month-old thread. The top answer is "just use Selenium."

You close the laptop.

I did this for longer than I'd like to admit before I found ScholarAPI. And honestly, writing this feels a little embarrassing, because the fix was so obvious in retrospect that I don't know why nobody talks about it.

Why Academic Data Is Quietly the Hardest Data Problem in AI

Here's something the ML Twitter discourse never covers: the bottleneck in most research-adjacent AI systems isn't the model. It's the input data.

Think about what you actually need when you're building on top of academic literature:

Clean, extracted full text (not HTML soup from a journal that hasn't updated its CSS since 2009)
Stable programmatic access that doesn't break when Elsevier sneezes
Coverage that isn't limited to one database or one discipline
A way to monitor new publications without manually checking 40 tabs

Google Scholar is a search engine. It's not an API. It was never designed for what we're trying to do with it. Scraping it is playing an adversarial game against one of the most well-resourced companies on earth and you are going to lose, eventually, always, at the worst possible time.

Semantic Scholar is genuinely great for metadata. If you need paper titles, authors, abstracts, and citation graphs, stop reading, go use it. It's free, it's well-documented, and it handles bulk metadata beautifully.

But the moment you need full text, meaning actual extracted content that's ready for chunking, embedding, or fine-tuning, things get painful.

That's the gap ScholarAPI fills.

What ScholarAPI Actually Is

ScholarAPI is a REST API that gives you programmatic access to 30M+ open-access papers from 20K+ academic sources worldwide. Not metadata. The actual text.

Four endpoints. That's basically the whole surface area:

GET /api/v1/search      # keyword search across 30M papers
GET /api/v1/list        # paginate by date, monitor new content
GET /api/v1/text/{id}   # clean extracted full text
GET /api/v1/pdf/{id}    # raw PDF binary

Auth is a single header: X-API-Key: sch_xxxxxxxxx

That's it. No OAuth dance. No SDK you have to install. No XML responses from 2003. Just curl.

Let Me Show You What This Actually Looks Like

Use Case 1: Building a Domain-Specific RAG Knowledge Base

Say you're building a RAG system that answers questions about quantum computing research. You need a corpus. Here's how fast that goes:

import requests

API_KEY = "sch_xxxxxxxxx"
BASE = "https://scholarapi.net/api/v1"
headers = {"X-API-Key": API_KEY}

# 1. Search for papers
resp = requests.get(f"{BASE}/search", headers=headers, params={
    "q": "quantum error correction qubit",
    "limit": 50
})
resp.raise_for_status()  # fail loudly on auth or quota errors
papers = resp.json()["results"]

# 2. Pull full text for each
corpus = []
for paper in papers:
    text_resp = requests.get(f"{BASE}/text/{paper['id']}", headers=headers)
    if text_resp.status_code == 200:
        corpus.append({
            "id": paper["id"],
            "title": paper["title"],
            "text": text_resp.text,
            "source_url": paper.get("url")  # every record links back to its source
        })

print(f"Corpus built: {len(corpus)} papers")

You just built a domain-specific corpus. No scraping. No HTML parsing. No Selenium. The text field comes pre-extracted from the PDF. It's not HTML and it's not raw PDF binary; it's actual clean text you can chunk and embed directly.
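"Chunk and embed directly" can be as simple as overlapping word windows. This is a minimal sketch, not a recommendation on chunk size: `chunk_text` is a hypothetical helper, and the 200-word window with 40 words of overlap is an arbitrary starting point.

```python
def chunk_text(text, chunk_words=200, overlap_words=40):
    """Split text into overlapping word windows for embedding."""
    if overlap_words >= chunk_words:
        raise ValueError("overlap_words must be smaller than chunk_words")
    words = text.split()
    step = chunk_words - overlap_words
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Run it over each `text` value in the corpus and keep the paper `id` alongside every chunk, so retrieved passages can still be traced back to their source.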

Want 100 papers in one call instead of looping? There's /texts (plural):

curl "https://scholarapi.net/api/v1/texts/96f3e91,6d0b618,1075ea1" \
  -H "X-API-Key: sch_xxxxxxxxx"

Up to 100 IDs at once. Your API bill and your patience will both thank you.
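For corpora larger than one call, the IDs need batching. A small sketch, assuming the comma-separated URL form shown in the curl example; `batched_text_urls` is a hypothetical helper:

```python
BASE = "https://scholarapi.net/api/v1"
MAX_IDS_PER_CALL = 100  # the limit stated above

def batched_text_urls(paper_ids, batch_size=MAX_IDS_PER_CALL):
    """Yield one /texts URL per group of up to batch_size paper IDs."""
    for start in range(0, len(paper_ids), batch_size):
        batch = paper_ids[start:start + batch_size]
        yield f"{BASE}/texts/{','.join(batch)}"
```

Each yielded URL can then be fetched with the same `X-API-Key` header as every other call.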

Use Case 2: Literature Monitoring Agent

This is the one that genuinely surprised me. The /list endpoint has an indexed_after parameter that turns ScholarAPI into a near-real-time feed of new papers.

Here's a monitoring agent in under 30 lines:

import requests
from datetime import datetime, timedelta, timezone

API_KEY = "sch_xxxxxxxxx"
BASE = "https://scholarapi.net/api/v1"
headers = {"X-API-Key": API_KEY}

def monitor_new_papers(keyword, hours_back=24):
    since = (datetime.now(timezone.utc) - timedelta(hours=hours_back)).isoformat()

    resp = requests.get(f"{BASE}/list", headers=headers, params={
        "q": keyword,
        "indexed_after": since,
        "has_text": "true",   # only papers with extractable full text
        "limit": 100
    })

    papers = resp.json().get("results", [])
    print(f"New papers matching '{keyword}' in last {hours_back}h: {len(papers)}")

    for p in papers:
        print(f"  - {p['title']} ({p['published_date']})")

    return papers

# Run it
new_oncology_papers = monitor_new_papers("KRAS inhibitor lung cancer")

Stick this in a cron job. Point it at whatever niche you're tracking. New papers appear in the index within 24–48 hours of publication.

I built one of these for a researcher tracking a specific drug compound across disciplines. She used to spend Monday mornings trawling through journal alerts. Now she gets a digest at 7am. She called it "life-changing." I called it a Tuesday afternoon.
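A digest like that only works if each run reports papers it hasn't reported before, and overlapping time windows between cron runs will otherwise produce repeats. One way to handle it, sketched with a hypothetical local JSON state file and a hypothetical `filter_unseen` helper:

```python
import json
from pathlib import Path

def filter_unseen(papers, state_path):
    """Return papers whose IDs are not yet in the state file, then record them.

    state_path is a hypothetical local JSON file holding a list of seen IDs.
    """
    path = Path(state_path)
    seen = set(json.loads(path.read_text())) if path.exists() else set()
    fresh = []
    for paper in papers:
        if paper["id"] not in seen:
            seen.add(paper["id"])
            fresh.append(paper)
    path.write_text(json.dumps(sorted(seen)))
    return fresh
```

Pass the list returned by the monitoring function through this before building the digest.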

Use Case 3: Plagiarism Detection at Scale

This one is underrated. The /texts bulk endpoint makes candidate comparison genuinely efficient.

The workflow:

Take a submitted manuscript
Extract meaningful n-grams or embeddings from it
Search ScholarAPI for semantically similar papers
Pull full text for top candidates
Run your similarity score

def check_manuscript(manuscript_text, top_k=20):
    # Reuses `requests`, BASE, and headers from the earlier examples.
    # extract_key_phrases and score_similarity are placeholders for your own
    # NLP pipeline; they are not part of ScholarAPI.
    query = extract_key_phrases(manuscript_text)

    # Find candidate papers
    candidates = requests.get(f"{BASE}/search", headers=headers, params={
        "q": query,
        "limit": top_k
    }).json()["results"]

    # Bulk-fetch full texts (up to 100 IDs per call)
    ids = ",".join([p["id"] for p in candidates])
    texts = requests.get(f"{BASE}/texts/{ids}", headers=headers).json()

    # Now run your similarity scoring against `texts`
    return score_similarity(manuscript_text, texts)

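One API call to get 20 candidate full texts. The economics are easy to estimate too. At current promo pricing (text endpoint is 3 credits until end of June 2026), and assuming each text in a bulk call is billed at the single-text rate, pulling full text for 500 candidates comes to 1,500 credits. That is roughly $22 at the $149 pack's rate of about 1.5 cents per credit, before search costs.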
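The scoring step is left open in the function above. As one illustration, and not necessarily the method you'd ship, here is a word n-gram Jaccard scorer. It assumes the candidate texts have been arranged as a dict of paper ID to full text, which may not match the shape the bulk endpoint actually returns; all three function names are hypothetical.

```python
import re

def word_ngrams(text, n=5):
    """Return the set of lowercase word n-grams in text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def rank_candidates(manuscript_text, candidate_texts, n=5):
    """Score each candidate against the manuscript, highest overlap first.

    candidate_texts is assumed to be a dict of paper ID -> full text; adapt
    this to whatever shape the bulk response really has.
    """
    target = word_ngrams(manuscript_text, n)
    scores = {
        paper_id: jaccard(target, word_ngrams(text, n))
        for paper_id, text in candidate_texts.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Exact n-gram overlap catches copy-paste but not paraphrase; swapping in embedding similarity is the usual next step.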

The Honest Limitations Section (Because You'll Find Them Anyway)

Paywalled content is not accessible. Elsevier, Wiley, Taylor & Francis, none of that. ScholarAPI covers open-access only. This is a real limitation. If your use case requires content behind institutional paywalls, this isn't your tool.

New papers take 24–48 hours to appear. It's not real-time. For most monitoring use cases this is fine. For breaking news workflows, it's not.

It's not free for bulk use. 1,000 free credits on signup (enough to test properly), then paid tiers: $19.90 / $149 / $790 / $4,990 for 1K / 10K / 100K / 1M credits. There's a credit calculator on the pricing page so you know exactly what you're buying.

For context: a /search call costs 10 credits + 2 per result. Full text retrieval is currently 3 credits (promo), normally 5. PDFs are 5 credits (promo), normally 10.
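Those numbers are enough for a back-of-the-envelope estimator. The sketch below only encodes the prices quoted here; it assumes bulk text calls are billed per text at the single-text rate, and that dollar cost can be pro-rated from a pack's price. Both are assumptions, so treat the pricing page as the authority.

```python
SEARCH_BASE_CREDITS = 10   # per /search call
SEARCH_PER_RESULT = 2      # per result returned
TEXT_CREDITS = {"promo": 3, "normal": 5}
PACKS = {1_000: 19.90, 10_000: 149.0, 100_000: 790.0, 1_000_000: 4990.0}

def search_credits(n_results):
    """Credits for one /search call returning n_results papers."""
    return SEARCH_BASE_CREDITS + SEARCH_PER_RESULT * n_results

def text_credits(n_texts, pricing="promo"):
    """Credits for n_texts full-text retrievals (assumed billed per text)."""
    return TEXT_CREDITS[pricing] * n_texts

def credits_to_usd(credits, pack_size=10_000):
    """Pro-rata dollar cost of some credits at a given pack's per-credit rate."""
    return credits * PACKS[pack_size] / pack_size

# One manuscript check: a single search for 20 results plus 20 full texts.
total = search_credits(20) + text_credits(20)
print(total, round(credits_to_usd(total), 2))
```

Change `pack_size` to see how much the per-credit rate drops on the larger packs.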

Who This Is Actually For

This is for you if:

You're building RAG on academic literature and your data pipeline is the bottleneck
You're training or fine-tuning on scientific corpora and need clean text at scale
You're building a literature monitoring agent and Google Scholar alerts make you want to retire
You're building a plagiarism detection tool and need a reference corpus you can actually query programmatically

This is not for you if:

You need one-off manual paper searches (just use Google Scholar)
You need paywalled content (look at institutional API agreements)
You're a student doing a literature review (the free credits cover you, but you're not the target user)

The Thing Nobody Says Out Loud

There's a quiet assumption baked into most AI research tooling: that data collection is someone else's problem. That you'll figure it out. That scraping is fine.

It's not fine. Scrapers break. They break silently. They break in production. They break when your advisor is waiting for results, or when your customer's pipeline just died, or at 2am when you really didn't need one more thing to fix.

The data layer deserves the same engineering discipline as the model layer. You wouldn't run a production system on a model you can't version, query reliably, or scale. You shouldn't run it on a data pipeline you can't trust either.

ScholarAPI is boring in the best possible way. It does one thing. It does it reliably. It has 99.9% uptime. Every record links back to its original source so your data is auditable.

It's the thing I wish existed two years ago.

Getting Started

# 1. Get your free API key at scholarapi.net (1,000 free credits)
# 2. Your first call:

curl "https://scholarapi.net/api/v1/search?q=your+research+topic&limit=10" \
  -H "X-API-Key: sch_xxxxxxxxx"

API docs are here. Clean, complete, no fluff.

If you're building something with it, especially anything RAG-related or for literature monitoring, I'd genuinely like to hear what you're working on. Drop it in the comments.

Tags: api python machinelearning research

Anyone else building training corpora from academic literature?

Reel Crave — Fri, 15 May 2026 19:04:21 +0000

Curious what your data collection pipeline looks like.

I've been pulling from ScholarAPI for domain-specific RAG datasets: medical, materials science, chemistry. The structured JSON + PDF access makes chunking for embeddings cleaner than parsing scraped HTML.

Current setup: ScholarAPI → extract → chunk → embed into Chroma. Works well for domain-specific Q&A.

What are you using?

Genuinely curious if there's something better I'm missing for open-access coverage.

(https://scholarapi.net?via=-asig3)

Resource for anyone building tools for systematic literature reviews/automating paper monitoring: ScholarAPI provides access to 30M+ open-access papers, metadata, full-text, & PDFs via REST API. Useful if you're building research tools. scholarapi.net

Reel Crave — Fri, 15 May 2026 18:50:04 +0000