Can Yılmaz

Posted on • Originally published at apify.com

What I learned scraping JSON-LD Schema & Meta Tag Extractor: schema, gotchas and the tooling that worked

I had a short window this week to evaluate JSON-LD Schema & Meta Tag Extractor as a data source. Here is the condensed write-up of what the data looks like, what surprised me, and the bits of infrastructure that paid off.

The source

JSON-LD Schema & Meta Tag Extractor scrapes Schema.org, OpenGraph and meta tags: it extracts structured data and SEO metadata from any webpage in seconds. The relevant questions for any new source are always: is the markup stable, is pagination sensible, and how aggressively does it rate-limit. For this one, all three answers are "good enough that you can build on it" -- which is honestly more than I can say for a lot of supposedly easy targets.

The schema

What you get back per record:

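  • url -- the URL of the scraped page
  • pageTitle -- page title (string)
  • metaDescription -- meta description (string)
  • jsonLd -- array of JSON-LD blocks found on the page
  • openGraph -- object of OpenGraph properties, such as type, site_name, url, title and description
  • twitter -- object of Twitter card tags, such as card, site, title, description and image
  • scrapeDate -- scrape timestamp in UTC ISO-8601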

Nothing exotic, which is exactly what you want from a feed. Shallow records (one level of nesting under jsonLd, openGraph and twitter), predictable keys, types you can guess from the names.
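To make those guessed types concrete, here is a minimal sketch of the record as a Python TypedDict. The types are inferred from a single sample row rather than from official documentation, and the names PageRecord and og_title are my own, not part of the actor.

```python
from typing import Any, Dict, List, Optional, TypedDict


class PageRecord(TypedDict, total=False):
    """One extractor record. total=False marks every key as optional."""
    url: str
    pageTitle: str
    metaDescription: str
    jsonLd: List[Any]            # JSON-LD blocks; their shape varies by page
    openGraph: Dict[str, str]    # keys as seen in the sample: type, site_name, title
    twitter: Dict[str, str]      # keys as seen in the sample: card, site, image
    scrapeDate: str              # UTC ISO-8601 string


def og_title(record: PageRecord) -> Optional[str]:
    """Read a nested field without assuming that any key is present."""
    return record.get("openGraph", {}).get("title")
```

Marking every key optional is a deliberate choice: it forces callers to use .get() instead of indexing, which matters for records where a field is simply absent.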

Real rows

One record from a sample run, trimmed for the inevitable wall of text:

{
  "url": "https://www.allrecipes.com/recipe/158968/spinach-and-feta-turkey-burgers/",
  "pageTitle": "Spinach and Feta Turkey Burgers Recipe",
  "metaDescription": "These spinach and feta turkey burgers are moist and easy to make in one bowl with simple ingredients, shaped into patties, and cooked on a...",
  "jsonLd": [
    "[... 1 items ...]"
  ],
  "openGraph": {
    "type": "article",
    "site_name": "Allrecipes",
    "url": "https://www.allrecipes.com/recipe/158968/spinach-and-feta-turkey-burgers/",
    "title": "Spinach and Feta Turkey Burgers",
    "description": "These spinach and feta turkey burgers are moist and easy to make in one bowl with simple ingredients, shaped into patties, and cooked on a...",
    "...": "(1 more fields)"
  },
  "twitter": {
    "card": "summary_large_image",
    "site": "@allrecipes",
    "title": "Spinach and Feta Turkey Burgers",
    "description": "These spinach and feta turkey burgers are moist and easy to make in one bowl with simple ingredients, shaped into patties, and cooked on a...",
    "image": "https://www.allrecipes.com/thmb/cpf6Rics5oHGq1TZ1df5fEaImwM=/1500x0/filters:no_upscale():max_bytes(150000):strip_icc()/1360550-582be362ee994..."
  },
  "scrapeDate": "2026-05-15T10:51:38.226Z"
}

Gotchas

A few things I would not have known without actually pulling data:

  • Optional fields disappear instead of being null. Not the end of the world, but it means every loader needs to be tolerant of missing keys.
  • Long-form text fields contain control characters. Newlines, tabs, the occasional rogue carriage return. Strip them at load time unless you actively want them.
  • Timestamps are UTC ISO-8601, which is great, but any local-time dashboard needs an explicit conversion.
  • Some numeric fields are emitted as strings. Cast on load.
  • Re-scraping with overlapping windows creates duplicates. Dedup on the natural ID.
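Most of these gotchas can be handled in one small loader. The sketch below is one way to do it, not the actor's own behaviour: it fills in missing keys, replaces control characters with spaces, parses scrapeDate into a timezone-aware datetime, and dedups on url. Treating url as the natural ID is my assumption, and the function names and the scrapedAt output key are hypothetical.

```python
import re
from datetime import datetime, timezone
from typing import Any, Dict, Iterable, List, Optional

# C0 control characters (this includes \t, \n and \r) plus DEL.
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]+")
_OLDEST = datetime.min.replace(tzinfo=timezone.utc)


def clean_text(value: Optional[str]) -> Optional[str]:
    """Collapse each run of control characters into a single space."""
    if value is None:
        return None
    return _CONTROL_CHARS.sub(" ", value).strip()


def parse_utc(value: str) -> datetime:
    """Parse a stamp like '2026-05-15T10:51:38.226Z' into an aware UTC datetime."""
    # Swapping the trailing 'Z' keeps this working on Python versions before 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00")).astimezone(timezone.utc)


def normalize(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Return a record with every expected key present and text fields cleaned."""
    stamp = raw.get("scrapeDate")
    return {
        "url": raw.get("url"),
        "pageTitle": clean_text(raw.get("pageTitle")),
        "metaDescription": clean_text(raw.get("metaDescription")),
        "jsonLd": raw.get("jsonLd") or [],
        "openGraph": raw.get("openGraph") or {},
        "twitter": raw.get("twitter") or {},
        "scrapedAt": parse_utc(stamp) if stamp else None,
    }


def dedupe_latest(records: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep one normalized record per URL, preferring the most recent scrape."""
    latest: Dict[Any, Dict[str, Any]] = {}
    for rec in records:
        seen = latest.get(rec["url"])
        if seen is None or (rec["scrapedAt"] or _OLDEST) > (seen["scrapedAt"] or _OLDEST):
            latest[rec["url"]] = rec
    return list(latest.values())
```

For a local-time dashboard, the aware datetime converts with the standard library, for example rec["scrapedAt"].astimezone(zoneinfo.ZoneInfo("Europe/Istanbul")) on Python 3.9 or later.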

What I would build next

A few directions this dataset would support nicely:

  • A daily snapshot pipeline that lands raw JSON into object storage, then materialises a curated table for dashboards.
  • A change-detection layer that computes row-level diffs between consecutive scrapes -- great for surfacing new and removed records.
  • A text-extraction layer over the long-form content fields, feeding into search or topic modelling.
  • A small validation suite that runs after every scrape: row count above a floor, key fields present in 100% of rows, timestamp parses cleanly. Cheap to write, catches schema drift in minutes instead of weeks.
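Two of these are small enough to sketch right away: the post-scrape validation suite and a URL-level change detector. This is a rough outline under my own assumptions: url and scrapeDate are treated as the "key fields", records are compared by url only, and the function names are hypothetical.

```python
from datetime import datetime
from typing import Any, Dict, List, Set

REQUIRED_KEYS = ("url", "scrapeDate")  # assumption: what counts as a "key field"


def validate(records: List[Dict[str, Any]], min_rows: int = 1) -> List[str]:
    """Return a list of problems; an empty list means the scrape passed."""
    problems: List[str] = []
    if len(records) < min_rows:
        problems.append(f"row count {len(records)} is below the floor of {min_rows}")
    for i, rec in enumerate(records):
        for key in REQUIRED_KEYS:
            if not rec.get(key):
                problems.append(f"row {i}: missing {key}")
        stamp = rec.get("scrapeDate")
        if stamp:
            try:
                datetime.fromisoformat(stamp.replace("Z", "+00:00"))
            except (ValueError, AttributeError):
                problems.append(f"row {i}: unparseable scrapeDate {stamp!r}")
    return problems


def diff_urls(previous: List[Dict[str, Any]],
              current: List[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Compare two scrapes by URL and report which pages appeared or vanished."""
    before = {r["url"] for r in previous}
    after = {r["url"] for r in current}
    return {"added": after - before, "removed": before - after}
```

Returning a list of problems instead of raising on the first failure means one run reports everything that drifted, which is more useful in a scheduled job's log.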

Cost considerations

Worth thinking about before you commit. The dominant cost on a recurring feed is not the per-record extraction price -- it is the maintenance time when the upstream source changes. A solid heuristic: budget half a day per source per quarter for maintenance work, and twice that for sources with active anti-bot defences. If that maintenance budget is too steep for the value the dataset provides, the project is not a fit.
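The heuristic is easy to turn into a number you can put next to the dataset's value. A minimal sketch, with a hypothetical function name and the half-day figure as the default:

```python
def maintenance_days_per_year(plain_sources: int, defended_sources: int,
                              days_per_quarter: float = 0.5) -> float:
    """Annual maintenance budget; sources with anti-bot defences cost double."""
    per_quarter = (plain_sources * days_per_quarter
                   + defended_sources * days_per_quarter * 2)
    return per_quarter * 4  # four quarters in a year
```

For example, three ordinary sources and two defended ones come to (1.5 + 2.0) days per quarter, or 14 days a year.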

The other cost worth modelling is storage. Raw JSON partitioned by date is cheap if you compress it -- a few cents per gigabyte per month on most clouds -- but it stops being cheap if you forget about retention. Set a lifecycle policy that ages anything older than your useful replay window into a colder tier, and revisit the policy every few months.
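As a local stand-in for that layout, the sketch below writes gzip-compressed JSON Lines into date partitions and lists the partitions that have aged out of the replay window. The dt=YYYY-MM-DD directory naming, the file name and both function names are my assumptions; on a real cloud bucket the ageing step would normally be a lifecycle rule configured on the bucket, not code.

```python
import gzip
import json
from datetime import date, timedelta
from pathlib import Path
from typing import Any, Dict, Iterable, List


def write_partition(root: Path, day: date, records: Iterable[Dict[str, Any]]) -> Path:
    """Write records as gzip-compressed JSON Lines under root/dt=YYYY-MM-DD/."""
    target = root / f"dt={day.isoformat()}" / "records.jsonl.gz"
    target.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(target, "wt", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return target


def expired_partitions(root: Path, today: date, keep_days: int) -> List[Path]:
    """List partition directories that are older than the replay window."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for part in sorted(root.glob("dt=*")):
        part_day = date.fromisoformat(part.name[len("dt="):])
        if part_day < cutoff:
            expired.append(part)
    return expired
```

Because the partition date is in the path, the retention check never has to open a file, and the same prefix convention maps directly onto prefix-based lifecycle rules.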

Bottom line

For an afternoon's evaluation work this was time well spent. The dataset is structurally clean, the scraper handled rate-limits without me having to think about it, and the records are rich enough to start asking real questions immediately. If the upstream source stays stable for a quarter -- which is the realistic horizon for most public sources -- the cost-benefit of integrating this feed is firmly positive.


For live, customizable extractions of this data, the actor that produced the dataset shown above is published on the Apify Store: logiover/json-ld-schema-meta-tag-extractor. It supports JSON, CSV and Excel exports and runs on a schedule.