Devil Scrapes

EU CORDIS API: scrape 57k Horizon Europe grants into clean JSON

Quick answer: CORDIS — the EU's Community Research and Development Information Service — exposes 57,000+ Horizon Europe and Horizon 2020 grant records through a search interface, but publishes no official bulk API. To get the data programmatically you scrape its internal JSON search endpoint. The EU CORDIS Grants Scraper on Apify does exactly that: it accepts a search query, a programme code, a coordinator country, or a list of project IDs, and returns every matching project as typed JSON with 20 fields — funding amounts, coordinator, participants, programme hierarchy, dates, status, and objective. Standard mode costs $3.05 per 1,000 rows; full-abstract mode $5.05 per 1,000 rows.

The European Commission has budgeted roughly €175 billion for research and innovation across Horizon Europe (~€95 billion) and the now-closed Horizon 2020 (~€80 billion). The full record of who got the money, for what, with which partners, and under which call lives in CORDIS: 57,000+ project records.

For the researchers, grant writers, policy analysts, and B2B sales teams who need this in bulk, the web interface is a dead end: no download button, no export API, no bulk dump that stays current. Here is what it takes to get it programmatically, and how I packaged it into a one-call Actor.

What is CORDIS? 🔎

CORDIS is the European Commission's authoritative registry of EU-funded research and innovation projects, managed by the Publications Office of the European Union on behalf of the Commission's research and innovation services, and tracking every major EU research programme since FP1 in the 1980s. The current search index covers two frameworks:

  • Horizon Europe (HORIZON, 2021-2027): the EU's flagship programme, ~€95 billion budget — from individual ERC grants to collaborative actions spanning a dozen countries.
  • Horizon 2020 (H2020, 2014-2020): the predecessor programme, now closed; ~€80 billion disbursed.

Each record carries the full consortium, the programme code hierarchy (e.g. HORIZON.2.5 — Climate, Energy and Mobility), the call identifier, the funding scheme, dates, total cost, the EU grant amount, and the objective — all under CC-BY 4.0, free to use with attribution.

Does CORDIS have an API? 🔌

No, not officially. The European Commission does not publish a documented, versioned API for the CORDIS grant database. The only programmatic surface is the internal JSON search endpoint the website itself uses: https://cordis.europa.eu/search/en?format=json. It has four quirks. It uses Lucene-style query syntax with = as the field separator (not the usual :). It returns hits.hit as a dict instead of a one-element list when a query matches a single result. It needs the query string percent-encoded exactly once, never double-encoded. Finally, num=100 reliably returns only 10 results, so the practical page cap is num=50.

None of it is documented; all of it has to be reverse-engineered from DevTools — exactly why a maintained Actor earns its few dollars per thousand rows.
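The "encoded once, never twice" quirk can be checked offline with the standard library. This is a minimal sketch, not the Actor's code: the q, num, and format parameter names come from the URLs quoted in this post, while the page parameter p and the helper name build_search_url are assumptions for illustration.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://cordis.europa.eu/search/en"

def build_search_url(raw_query: str, page: int = 1, num: int = 50) -> str:
    """Build a search URL, letting urlencode() encode each value exactly once."""
    params = {
        "q": raw_query,    # raw query text, using '=' as the field separator
        "p": page,         # assumption: hypothetical name for the page parameter
        "num": num,        # 50 is the practical page cap described above
        "format": "json",
    }
    return f"{BASE_URL}?{urlencode(params)}"

raw = "contenttype=project"
good = build_search_url(raw)                  # '=' becomes %3D, once
bad = build_search_url(quote(raw, safe=""))   # pre-encoded input is encoded again

print(good)
print(bad)  # contains %253D, the double-encoded form
```

Passing the raw string and leaving the encoding to the client keeps the two cases from being confused.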

What the data looks like 📤

One ResultRow per project, 20 fields. A real record from project 101069357 (Photo2Fuel, Horizon Europe):

{
  "project_id": "101069357",
  "project_acronym": "Photo2Fuel",
  "project_title": "Artificial PHOTOsynthesis to produce FUELs and chemicals",
  "project_url": "https://cordis.europa.eu/project/id/101069357/en",
  "programme": "HORIZON.2.5",
  "programme_display_name": "Climate, Energy and Mobility",
  "call_id": "HORIZON-CL5-2021-D2-01",
  "funding_scheme": "HORIZON-RIA",
  "start_date": "2022-09-01",
  "end_date": "2025-08-31",
  "total_cost_eur": 2493171.25,
  "eu_contribution_eur": 2493171.0,
  "status": "SIGNED",
  "coordinator_organization": "IDENER RESEARCH & DEVELOPMENT AIE",
  "coordinator_country": "ES",
  "participating_organizations": ["FUNDACION TECNALIA RESEARCH & INNOVATION", "UPPSALA UNIVERSITET"],
  "participating_countries": ["DE", "ES", "SE"],
  "objective_summary": "Photo2Fuel will develop technology that converts CO2 into useful fuels and chemicals using non-photosynthetic microorganisms...",
  "keywords": ["solar energy", "bacteria", "archaea", "solar fuels", "CO2 reduction"],
  "scraped_at": "2026-05-16T12:00:00.000Z"
}

All 20 fields are Pydantic-validated before the row is written. status is a strict Literal["SIGNED", "CLOSED", "TERMINATED"], so an unexpected value raises a validation error rather than passing garbage through. scraped_at is always ISO 8601 UTC; arrays are arrays, not comma-separated strings. Rows load straight into Pandas or BigQuery.
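If you want the same guarantees on the consumer side without depending on Pydantic, a dependency-free check is enough. The sketch below is an assumption about how such a check might look, not the Actor's validator; check_row and LIST_FIELDS are hypothetical names.

```python
from typing import Literal, get_args

Status = Literal["SIGNED", "CLOSED", "TERMINATED"]
LIST_FIELDS = ("participating_organizations", "participating_countries", "keywords")

def check_row(row: dict) -> dict:
    """Raise ValueError if status or the list fields have an unexpected shape."""
    if row.get("status") not in get_args(Status):
        raise ValueError(f"unexpected status: {row.get('status')!r}")
    for field in LIST_FIELDS:
        if not isinstance(row.get(field), list):
            raise ValueError(f"{field} should be a list")
    return row
```

Running every dataset item through a check like this before loading it downstream turns a schema drift into an immediate error instead of a quiet data-quality problem.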

The naive approach (and why it falls apart) ⚠️

The obvious first attempt: replay the DevTools XHR call to cordis.europa.eu/search/en with requests.get() and paginate. Three reasons it breaks in practice:

1. The query encoding trap. CORDIS expects the query string to be percent-encoded exactly once, by the HTTP client. If you pass an already-encoded q=contenttype%3Dproject... as a params value, the client encodes it a second time to contenttype%253Dproject... and CORDIS returns HTTP 500. We pass the raw string as a params value and let curl-cffi do exactly one round of encoding. A regression test pins this behaviour, because it breaks only in production.

2. The dict-vs-list anomaly. When a query matches exactly one project, the API returns hits.hit as a plain dict, not a one-element list. A naive for hit in data["hits"]["hit"] then iterates over the keys and silently emits garbage. We call _ensure_list() on every nested association array before iteration.

3. TLS fingerprinting and rate-limit readiness. CORDIS doesn't currently rate-limit datacenter IPs, but we build on curl-cffi with Chrome 131 TLS impersonation (ADR-0002 house default) so the Actor absorbs any future tightening. We retry with exponential backoff on 429 and 503, honouring Retry-After (max 5 attempts); we thread Apify residential proxies on demand via useProxy and rotate the session on every block. We never return an empty dataset silently — a scan that exhausts the source surfaces a clear status message instead of pretending it found data.
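The retry policy in point 3 (exponential backoff on 429 and 503, honouring Retry-After, at most 5 attempts) could look roughly like this. It is a transport-agnostic sketch rather than the Actor's implementation: fetch_with_retry, the send callable, the tuple-shaped response, and the 1-second base delay are all assumptions for illustration.

```python
import time
from typing import Callable, Optional, Tuple

# Hypothetical response shape: (status_code, retry_after_seconds_or_None, body)
Response = Tuple[int, Optional[float], str]

RETRYABLE = (429, 503)

def fetch_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 5,
    base_delay: float = 1.0,  # assumption: the post does not state a base delay
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call send() until it returns 200 or max_attempts is used up."""
    for attempt in range(max_attempts):
        status, retry_after, body = send()
        if status == 200:
            return body
        if status not in RETRYABLE:
            raise RuntimeError(f"non-retryable HTTP {status}")
        if attempt < max_attempts - 1:
            # Prefer the server's Retry-After hint, else back off exponentially.
            delay = retry_after if retry_after is not None else base_delay * 2 ** attempt
            sleep(delay)
    raise RuntimeError(f"still blocked after {max_attempts} attempts")
```

Injecting send and sleep keeps the policy testable without touching the network; in a real client, send would wrap the HTTP call and parse the Retry-After header.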

The Actor ⚙️

Live on the Apify Store: apify.com/DevilScrapes/eu-cordis-grants.

Four input modes, one Actor. Pick exactly one of searchQuery, programmeFilter, countryFilter, or projectIds; a Pydantic XOR validator rejects any input that sets none of them, or more than one, before any network call.
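The exactly-one rule is easy to mirror in your own code before you spend a run on an invalid input. This is a plain-Python sketch of the idea, not the Actor's Pydantic validator; pick_mode is a hypothetical helper.

```python
MODE_KEYS = ("searchQuery", "programmeFilter", "countryFilter", "projectIds")

def pick_mode(run_input: dict) -> str:
    """Return the single input mode that is set, or raise if not exactly one."""
    chosen = [key for key in MODE_KEYS if run_input.get(key)]
    if len(chosen) != 1:
        raise ValueError(
            f"set exactly one of {', '.join(MODE_KEYS)}; got {chosen or 'none'}"
        )
    return chosen[0]
```

Empty strings and empty lists count as "not set" here, which is a design choice of this sketch rather than documented Actor behaviour.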

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("DevilScrapes/eu-cordis-grants").call(
    run_input={
        "searchQuery": "quantum computing",  # free-text search
        "framework": "HORIZON",              # or "H2020" / "ANY"
        "maxProjects": 100,                  # 1-5000, list modes only
    }
    # other modes: {"programmeFilter": "HORIZON.2.5", ...} |
    #   {"countryFilter": "DE", ...} | {"projectIds": ["101069357"]}
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

Set includeFullAbstract: true to deep-fetch the detail page for the full, untruncated objective (the search API truncates it at ~2,000 chars). That issues one extra HTTP GET per row and switches the pay-per-event (PPE) charge from result-row ($0.003) to result-row-detailed ($0.005). Run it via the Apify Console form or the API; results land in the dataset and can be exported as JSON, CSV, Excel, or XML.

What you'd actually use this for 💡

Grant-writer competitive intelligence. Before submitting a proposal to HORIZON-CL5-2021-D2-01 (or any call), pull every project already funded in that programme cluster — who the successful coordinators are, which countries dominate. Five minutes of runtime replaces hours of manual searches.

Research-institution portfolio reporting. Filter by countryFilter=DE (or your ISO-2 code) to enumerate every Horizon Europe project your national research community coordinated, with funding totals and consortium partners. Most research-management systems (Worktribe, Pure, Converis) don't pull this from CORDIS automatically.

Science-policy analysis. Bulk-export H2020 vs HORIZON funding distributions across schemes (RIA, IA, CSA, ERC) — eu_contribution_eur and funding_scheme together give you the full budget geography.
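A funding-distribution comparison like that reduces to a group-and-sum over two fields. Here is a standard-library sketch; totals_by_scheme is a hypothetical helper, and the sample rows are made-up placeholder values, not real CORDIS figures.

```python
from collections import defaultdict

def totals_by_scheme(rows):
    """Sum eu_contribution_eur per funding_scheme, treating missing values as 0."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["funding_scheme"]] += row.get("eu_contribution_eur") or 0.0
    return dict(totals)

# Placeholder rows for illustration only; real rows come from the dataset.
sample_rows = [
    {"funding_scheme": "HORIZON-RIA", "eu_contribution_eur": 1_000_000.0},
    {"funding_scheme": "HORIZON-RIA", "eu_contribution_eur": 500_000.0},
    {"funding_scheme": "HORIZON-CSA", "eu_contribution_eur": 250_000.0},
]

print(totals_by_scheme(sample_rows))
```

The same function works unchanged on items streamed from the dataset iterator, since each item is a plain dict.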

B2B SaaS prospect lists. The coordinator_organization and coordinator_country fields give you the legal name and home country of every research organisation that has won EU funding — enrich with Apollo or LinkedIn for a targeted outreach list of buyers for research software or lab equipment.

Pricing — exact numbers 💰

Pay-per-event — you pay only for rows that land in your dataset.

| Event | Price (USD) | When |
| --- | --- | --- |
| actor-start | $0.05 | Once per run, at boot |
| result-row | $0.003 | Per project row, standard mode |
| result-row-detailed | $0.005 | Per project row, full-abstract mode |

| Run | Rows | Mode | Cost |
| --- | --- | --- | --- |
| 1 project lookup | 1 | standard | $0.053 |
| 50 search results | 50 | standard | $0.20 |
| 500 programme rows | 500 | standard | $1.55 |
| 1,000 country-filtered rows | 1,000 | standard | $3.05 |
| 1,000 rows, full abstracts | 1,000 | detailed | $5.05 |

At scale the per-row charge dominates: $3.05 per 1,000 rows standard. Apify's $5 free trial credit covers your first ~1,600 rows, no credit card required.
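The arithmetic behind those figures is one start fee plus a per-row price. A small estimator, using Decimal to avoid floating-point drift on prices; the function names are hypothetical, and the prices are the ones listed in this section.

```python
from decimal import Decimal

ACTOR_START = Decimal("0.05")
ROW_PRICE = {"standard": Decimal("0.003"), "detailed": Decimal("0.005")}

def estimate_cost(rows: int, mode: str = "standard") -> Decimal:
    """Cost of one run in USD: start fee plus per-row charge."""
    return ACTOR_START + ROW_PRICE[mode] * rows

def rows_for_budget(budget_usd: str, mode: str = "standard") -> int:
    """Largest row count a single run can emit within the budget."""
    remaining = Decimal(budget_usd) - ACTOR_START
    return max(0, int(remaining // ROW_PRICE[mode]))

print(estimate_cost(1000))              # 1,000 standard rows
print(estimate_cost(1000, "detailed"))  # 1,000 full-abstract rows
print(rows_for_budget("5"))             # rows covered by a $5 credit in one run
```

For a single run, the $5 credit works out to 1,650 standard rows, in line with the rounded "~1,600" figure; splitting the work across several runs costs one extra start fee per run.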

The technically interesting part 🔧

The hits.hit dict-vs-list anomaly is load-bearing. CORDIS's search API is an XML-to-JSON serialiser: when a query matches one project it collapses hits.hit to a keyed dict; with two or more it produces an array. The same shape instability affects every nested association array — organizations, programmes, calls, categories. The _ensure_list() normaliser in src/parser.py handles all four before any iteration, so a single-result programme filter doesn't silently return one row with empty fields. Multi-programme projects get the same care: the Actor picks the entry where @attributes.uniqueProgrammePart == "true", so you get the top-level call cluster, not a nested sub-component code.
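Both behaviours fit in a few lines. The sketch below is a guess at what such helpers might look like, not the contents of src/parser.py: ensure_list and pick_programme are illustrative names, and the code key in the sample payload is a hypothetical field.

```python
def ensure_list(value):
    """Normalise None, a single dict, or a list into a list."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]

def pick_programme(programmes):
    """Return the entry flagged uniqueProgrammePart == "true", else the first one."""
    entries = ensure_list(programmes)
    for entry in entries:
        if entry.get("@attributes", {}).get("uniqueProgrammePart") == "true":
            return entry
    return entries[0] if entries else None

# A single-result payload: hits.hit is a dict, not a list.
single = {"hits": {"hit": {"id": "101069357"}}}

naive = list(single["hits"]["hit"])           # iterates over the dict's keys
safe = ensure_list(single["hits"]["hit"])     # one-element list of hits
```

Normalising at the boundary means the rest of the parser can assume lists everywhere and never branch on shape again.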

Limitations 🚧

  • HORIZON and H2020 only. FP7 and earlier frameworks aren't indexed in the current CORDIS search; the Actor returns zero rows for FP7-and-earlier queries.
  • No deliverables, publications, or result documents. These appear on the detail page but not in the search API, so they are out of scope.
  • Country filter is a post-filter, not server-side (a coordinator/country=DE query returns zero results; verified). A country-filter run may scan many more API pages than maxProjects suggests; emitted rows are always ≤ maxProjects.
  • Objective text is truncated in standard mode (~2,000 chars). Set includeFullAbstract: true for the full text ($0.005/row).
  • Page size is fixed at 50. num=100 reliably returns only 10 results (verified); the Actor uses num=50.
  • 7-day default storage retention on Apify's free tier. Export immediately, or open a named dataset to persist.
  • CORDIS data is CC-BY 4.0. Attribution to the European Commission's CORDIS is required when republishing.

FAQ ❓

Is scraping CORDIS legal?
CORDIS data is published by the European Commission under CC-BY 4.0 (EU Open Data policy). The search endpoint is public and unauthenticated; the Actor reads only what the public UI exposes and collects no personal data. Attribution is required when republishing. As always, verify your use case with your legal team.

Is there an official CORDIS API I should use instead?
No. The European Commission publishes no documented, versioned API for the project database. The EU Open Data Portal offers only static bulk CSV/XML dumps updated infrequently — not per-query, not real-time.

Can I export to Google Sheets or a data warehouse?
Yes. Download CSV, Excel, JSON, or XML from the Apify Console, or call the Apify API: GET /datasets/{id}/items?format=csv&clean=true. Or webhook the dataset on ACTOR.RUN.SUCCEEDED into Make, Zapier, or n8n.
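For scripted exports, the items endpoint is just a URL you can hand to any HTTP client or spreadsheet import. A small builder, assuming Apify's public API base https://api.apify.com/v2; dataset_items_url is a hypothetical helper, and abc123 is a placeholder dataset ID.

```python
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"

def dataset_items_url(dataset_id: str, fmt: str = "csv", clean: bool = True) -> str:
    """Build the export URL for a dataset's items in the given format."""
    query = urlencode({"format": fmt, "clean": str(clean).lower()})
    return f"{API_BASE}/datasets/{dataset_id}/items?{query}"

print(dataset_items_url("abc123"))
```

Depending on how the dataset is shared, the request may also need your API token; check the Apify API reference for the current authentication options.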

Why is the country filter slow on large results?
CORDIS offers no query parameter for coordinator country, so the Actor scans full search results and post-filters in process. To collect 500 German-coordinated projects it may scan 1,500+ records. The maxProjects cap applies to emitted rows, not API pages read.
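The scan-then-filter behaviour can be sketched as a generator over result pages. This is an illustration of the approach described, not the Actor's code; filter_by_country is a hypothetical name, and pages stands in for whatever yields lists of parsed rows.

```python
from typing import Iterable, Iterator, List

def filter_by_country(
    pages: Iterable[List[dict]], country: str, max_projects: int
) -> Iterator[dict]:
    """Yield rows whose coordinator_country matches, stopping at max_projects."""
    if max_projects <= 0:
        return
    emitted = 0
    for page in pages:
        for row in page:
            if row.get("coordinator_country") != country:
                continue  # scanned, but not emitted
            yield row
            emitted += 1
            if emitted >= max_projects:
                return  # stop early so no further pages are read
```

If pages is itself a lazy generator that fetches one API page at a time, the early return means the scan stops as soon as the cap is reached, which is why the cap bounds emitted rows rather than pages read.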


The Actor is live and accepting runs now: apify.com/DevilScrapes/eu-cordis-grants. The $5 free trial credit covers roughly 1,600 standard-mode rows before you enter a credit card. Missing a field — organisation type, deliverable links, FP7 support? Drop a comment or open an issue on the Store page.


Built by Devil Scrapes — pay-per-event Apify Actors for builders who need the data, not the drama. 😈
