DEV Community: Omar Eldeeb

How to Scrape YouTube Shorts Data (Exact View, Like & Comment Counts)

Omar Eldeeb — Mon, 06 Jul 2026 18:00:54 +0000

If you have ever tried to figure out how to scrape YouTube Shorts data — the view counts, like counts, publish dates, and channel stats for a batch of Shorts — you have probably discovered that it is more annoying than it looks. Shorts are technically just videos, but YouTube treats them as a distinct surface, and the tools most people reach for either cost quota you do not have or return numbers that are rounded and incomplete.

This post walks through the real options: what the official API does and does not give you, how the "internal API" scraping approach actually works, and a runnable example you can drop into a script today.

Why scraping YouTube Shorts data is harder than it should be

The obvious first stop is the YouTube Data API v3. It is a legitimate, well-documented REST API, and for many jobs it is the right tool. But three things make it awkward specifically for Shorts work:

Quota. Every Google Cloud project gets a default allocation of 10,000 units per day. That sounds like a lot until you learn the pricing: a search.list call costs 100 units, so you get roughly 100 searches per day before you hit quotaExceeded. A plain videos.list read is only 1 unit, but you can only call it once you already have the video IDs — and getting those IDs by keyword or hashtag is the expensive search step. (Google's quota calculator lists the per-method costs.)
No clean Shorts filter. The Data API has no type=short parameter. A Short is returned as an ordinary video resource. You can heuristically guess (short duration, vertical aspect ratio, #shorts in the description) but there is no first-class "give me this channel's Shorts" endpoint.
API key / OAuth setup. You need a Google Cloud project, an enabled API, and a key — and for some data, OAuth. That is fine for a product, but heavy for a one-off dataset.
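To make the quota arithmetic above concrete, here is a minimal Python sketch of a daily budget. The 10,000-unit allocation and the 100-unit and 1-unit costs come from the text; the 50-IDs-per-call batch size for videos.list is an assumption you should confirm against Google's documentation, and daily_budget is a hypothetical helper name.

```python
# Back-of-envelope quota budget for the YouTube Data API v3.
DAILY_QUOTA = 10_000      # default units per project per day (from the text)
SEARCH_COST = 100         # units per search.list call (from the text)
VIDEOS_LIST_COST = 1      # units per videos.list call (from the text)
IDS_PER_VIDEOS_CALL = 50  # assumption: batch size accepted by videos.list


def daily_budget(searches: int) -> dict:
    """Return units spent on searches and what the remainder can hydrate."""
    search_units = searches * SEARCH_COST
    if search_units > DAILY_QUOTA:
        raise ValueError("search calls alone exceed the daily quota")
    remaining = DAILY_QUOTA - search_units
    videos_calls = remaining // VIDEOS_LIST_COST
    return {
        "search_units": search_units,
        "remaining_units": remaining,
        "max_video_lookups": videos_calls * IDS_PER_VIDEOS_CALL,
    }


max_searches = DAILY_QUOTA // SEARCH_COST  # searches per day if you do nothing else
budget = daily_budget(90)
print(max_searches, budget)
```

The takeaway is that the search step, not the stats lookup, is what exhausts the budget.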

To be clear: the official API is not unusable. If you need a handful of videos per day and you already manage an API key, it is a fine, fully-supported choice. The friction is real but bounded — it is the quota and the missing Shorts isolation that push people toward scraping when they need volume or richer data.
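Since there is no first-class Shorts filter, the duration-plus-hashtag guess mentioned above can be sketched in a few lines of Python. This is a heuristic under stated assumptions, not an official classification: it assumes you already have the video's ISO 8601 duration string (the format the Data API uses for durations) and its description, the 60-second default is an assumption you should tune because YouTube's length limit for Shorts has changed over time, and it ignores aspect ratio entirely. Both function names are hypothetical.

```python
import re
from typing import Optional

# Matches simple durations such as PT58S, PT1M5S or PT1H2M3S.
_ISO_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def iso_duration_to_seconds(value: Optional[str]) -> Optional[int]:
    """Parse an ISO 8601 time-only duration; return None if it does not parse."""
    match = _ISO_DURATION.match(value or "")
    if not match:
        return None
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def looks_like_short(duration_iso, description, max_seconds=60) -> str:
    """Return 'no', 'maybe' (short enough) or 'likely' (short and tagged)."""
    secs = iso_duration_to_seconds(duration_iso)
    if secs is None or secs == 0 or secs > max_seconds:
        return "no"
    tagged = "#shorts" in (description or "").lower()
    return "likely" if tagged else "maybe"


print(looks_like_short("PT58S", "60-second garlic butter trick #Shorts"))
print(looks_like_short("PT10M", "#shorts compilation"))
```

Returning three levels instead of a boolean keeps the uncertainty visible, so you can decide downstream whether "maybe" rows are worth keeping.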

The internal-API approach (how scraping actually works)

Here is the part most tutorials get wrong. You do not scrape YouTube Shorts by parsing HTML with regex, and there is no secret public REST endpoint you can curl. What actually powers youtube.com in your browser is an internal JSON API (the same backend the web player talks to as you scroll). When you load a Shorts feed, a channel page, or a search results page, the site sends structured JSON payloads containing every field the UI renders — titles, exact view counts, like counts, publish timestamps, channel subscriber counts, and more.

The scraping approach, conceptually, is:

Issue the same requests the browser issues, with the right context/client parameters, so YouTube returns those JSON payloads instead of a rendered page.
Walk the (deeply nested, frequently-renamed) JSON to pull out the fields you care about.
Rotate residential proxies so you are not making thousands of requests from one datacenter IP.
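The "walk the JSON" step is the one that survives key renames best when it searches by key name rather than by a fixed path. A minimal Python sketch follows; the payload is entirely hypothetical (real key names and nesting differ and change over time), and find_key is an illustrative helper, not part of any library.

```python
from typing import Any, Iterator


def find_key(node: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key`, at any depth, in document order."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


# Hypothetical payload shape, invented for this example.
payload = {
    "contents": {
        "items": [
            {"reel": {"videoId": "abcdEFGhijk", "stats": {"viewCount": "4823117"}}},
            {"reel": {"videoId": "zyxwVUTsrqp", "stats": {"viewCount": "1200"}}},
        ]
    }
}

video_ids = list(find_key(payload, "videoId"))
views = [int(v) for v in find_key(payload, "viewCount")]
print(video_ids, views)
```

A search like this tolerates a wrapper object being added or removed, but it still breaks when the key itself is renamed, which is the maintenance cost described below.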

The payoff over the official API: because you are reading the same data the site itself renders, you get exact counts (e.g. viewCount: 4823117) rather than the rounded values (4.8M) you often see, and you need no API key, no OAuth, and no share of a 10,000-unit quota.

The cost is engineering time. YouTube renames those JSON keys and changes the request envelope regularly, so a scraper you write today will quietly break in a few months. That is the maintenance treadmill you are signing up for if you build it yourself — and the reason a maintained tool is often worth it for this specific target.

Doing it without building the parser yourself

Rather than reverse-engineer the payload walking, you can call a maintained actor that already does it. Here is a complete, runnable Node.js example using the Apify client to scrape Shorts by channel handle, keyword, hashtag, or a direct Short URL — each input line is auto-detected, so you do not configure any "mode":

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

const run = await client.actor('constructive_calm/youtube-shorts-scraper-pro').call({
  // Each line is auto-detected: @handle, keyword, #hashtag, or a Short URL
  startUrls: [
    '@MrBeast',
    'cooking tips',
    '#asmr',
    'https://www.youtube.com/shorts/abcdEFGhijk',
  ],
  maxResults: 25,            // per input line
  publishedAfter: '30 days', // date "2026-01-01" or relative "7 days"
  sortOrder: 'popular',      // newest | popular | oldest
  includeComments: true,
  maxComments: 20,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();

for (const s of items) {
  console.log(
    `${s.viewCount.toLocaleString()} views | ` +
    `virality ${s.viralScore}/100 | ${s.title}`
  );
}

startUrls is the only required input. Everything else has sensible defaults (maxResults is 10, maxComments is 20). If you flip on downloadVideos, each result also gets a videoDownloadUrl pointing at a 360p MP4 saved to storage with a shareable link.

What you get back per Short

Each dataset item is a flat object, which makes it trivial to dump to CSV or load into pandas:

{
  "videoId": "abcdEFGhijk",
  "url": "https://www.youtube.com/shorts/abcdEFGhijk",
  "title": "60-second garlic butter trick",
  "description": "…",
  "hashtags": ["cooking", "shorts"],
  "viewCount": 4823117,
  "likeCount": 312044,
  "commentCount": 1876,
  "durationSeconds": 58,
  "publishedAt": "2026-06-14T09:32:00Z",
  "thumbnail": "https://i.ytimg.com/…",
  "channelName": "Example Kitchen",
  "channelId": "UC…",
  "channelSubscribers": 2450000,
  "channelVerified": true,
  "viralScore": 87,
  "engagementRate": 6.7
}

The first block of fields comes straight from YouTube's own data. The last two — viralScore (a 0–100 index) and engagementRate — are computed by the actor, not values YouTube publishes. They are handy for ranking a batch of Shorts by breakout potential without writing your own formula, but treat them as a derived heuristic, not an official metric.
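Because each item is flat, the CSV export mentioned above needs only the standard library. Here is a small Python sketch, assuming you have already loaded the dataset items as a list of dicts; the pipe separator for the hashtags list and the items_to_csv name are choices made for this example.

```python
import csv
import io


def items_to_csv(items, fields):
    """Serialize flat dataset items to CSV text, joining list values with '|'."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for item in items:
        row = {}
        for name in fields:
            value = item.get(name)  # missing fields become empty cells
            row[name] = "|".join(value) if isinstance(value, list) else value
        writer.writerow(row)
    return buf.getvalue()


# Sample item trimmed from the example output above.
items = [
    {
        "videoId": "abcdEFGhijk",
        "title": "60-second garlic butter trick",
        "hashtags": ["cooking", "shorts"],
        "viewCount": 4823117,
        "likeCount": 312044,
    }
]
fields = ["videoId", "title", "hashtags", "viewCount", "likeCount"]
csv_text = items_to_csv(items, fields)
print(csv_text)
```

Listing the fields explicitly keeps the column order stable even if a later run adds or drops keys.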

A quick note on choosing your approach

A few Shorts, occasionally, and you already have an API key? Use the official Data API. It is free within quota and fully supported.
You need Shorts isolated by channel/keyword/hashtag, exact counts, and volume? Either write an internal-API scraper (and accept the maintenance) or use a maintained one.

If you want to see the exact input/output shape before running anything, there is a free query-builder tool where you paste your inputs and preview the resulting output schema in the browser. It does not fetch live results client-side — YouTube is not CORS-open, so no browser page can — it just helps you construct and understand the request.

To run it for real against live data, the backing actor is youtube-shorts-scraper-pro on Apify. It is free to start, then pay-as-you-go, and it handles the residential-proxy rotation and the JSON-walking so you do not inherit the breakage treadmill.

However you go about it, keep it to public data only — public Shorts, public channel stats, public comments — and you have a clean, repeatable way to pull YouTube Shorts data with exact numbers instead of rounded guesses.

Disclosure: I build and maintain the youtube-shorts-scraper-pro actor referenced above.

How to Extract Saudi Arabia Property Data Across Bayut.sa, Wasalt.sa, Aqar.fm and PropertyFinder.sa

Omar Eldeeb — Tue, 30 Jun 2026 20:40:40 +0000

If you have ever tried to assemble a clean dataset of Saudi Arabia property data, you already know the pain: the same villa in Riyadh shows up on Bayut.sa, on PropertyFinder.sa, and on Aqar.fm — three different listing IDs, three slightly different prices, three different agents. Naively concatenate them and your "market" is 30–40% duplicates. This post walks through how the Saudi portals actually expose their data, and the one field that quietly solves the deduplication problem: the REGA advertisement license number.

I have spent a fair amount of time poking at these portals, so the goal here is to give you the honest, verified mechanics — not a fantasy where everything is a tidy public REST API.

The four portals that matter

For practical coverage of the Kingdom, four sites carry the bulk of supply:

Bayut.sa — the dubizzle/Bayut group's Saudi portal, English + Arabic.
Wasalt.sa — owned by the Public Investment Fund (PIF), strong on sale and off-plan inventory.
Aqar.fm (sa.aqar.fm) — one of the highest-traffic portals in the Kingdom, Arabic-only, very deep on local supply.
PropertyFinder.sa — the Saudi arm of Property Finder, English-first, strong agent coverage.

Each one carries the same core fields per listing: price, area (m²), location, bedrooms/bathrooms, property type, listing purpose (sale / rent / off-plan), and the listing agent or brokerage. That overlap is exactly what makes cross-portal joins both necessary and possible.

How the listings are actually delivered

Here is the first thing worth internalizing: these are modern JavaScript front-ends, but you usually do not need a headless browser to read a listing.

Several of these portals are built as Next.js-style applications. When a Next.js page renders on the server, the framework serializes the props used for rendering into an embedded JSON blob — the __NEXT_DATA__ script tag — so React can hydrate on the client. Per Next.js's own maintainers, this tag is not removable, and "the data contained in there should be what is already presented to your user." Translation for our purposes: the listing data that paints the page is sitting right there in the initial HTML response as structured JSON.

That means a plain HTTP GET, plus a parse of the embedded JSON, often gets you a fully structured listing object — price, beds, baths, geo, agent — without spinning up Playwright. (Some portals use a differently named state object or an embedded data-island instead of __NEXT_DATA__, so always inspect the actual HTML before assuming a selector.)

The REGA join key — the part most people miss

Saudi Arabia regulates real-estate advertising through the Real Estate General Authority (REGA). Under the brokerage regulations, you cannot legally advertise a property without a real-estate advertisement license, and the license number plus its expiry must be displayed on the advertisement. REGA even runs a public Advertisement License Inquiry service so anyone can check a license number's status — active, expired, or cancelled.

The downstream consequence is the useful part. Because the property (via its brokerage contract) gets a REGA advertisement license, and that number must be shown on the ad, the same property advertised on Bayut.sa and PropertyFinder.sa carries the same REGA license number on both. That makes the REGA advertisement number an excellent cross-portal join key for deduplication — far more reliable than fuzzy-matching on price-and-address, which breaks the moment one agent rounds the price or writes "Al Malqa" instead of "Al-Malqa".
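A join key only works if both sides format it identically, so it is worth canonicalizing the license number before joining. The Python sketch below rests on two assumptions that are not confirmed by the portals themselves: that the number is purely numeric, and that some pages may print it in Arabic-Indic digits or with separators. The sample number is made up, and normalize_rega is a hypothetical helper.

```python
# Map Arabic-Indic digits (U+0660 to U+0669) onto ASCII digits.
_ARABIC_INDIC = "".join(chr(0x0660 + i) for i in range(10))
_TO_ASCII = str.maketrans(_ARABIC_INDIC, "0123456789")


def normalize_rega(raw) -> str:
    """Return only the ASCII digits of a license number, or '' if none."""
    if raw is None:
        return ""
    text = str(raw).translate(_TO_ASCII)
    return "".join(ch for ch in text if ch in "0123456789")


# Made-up license number, written three different ways.
arabic_form = "".join(chr(0x0660 + int(d)) for d in "7200012345")
variants = [" 7200012345 ", "7200-012 345", arabic_form]
keys = {normalize_rega(v) for v in variants}
print(keys)
```

If a portal turns out to use alphanumeric license strings, swap the digit filter for an uppercase-and-strip rule instead.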

A few honest caveats:

Not every portal exposes the number with equal prominence. The Next.js-rendered portals (Bayut, PropertyFinder, Wasalt) tend to surface it cleanly; Aqar.fm is more variable, so treat REGA-on-Aqar as best-effort and keep a fuzzy fallback.
A single brokerage contract with a "marketing" scope can cover more than one advertised unit, so in rare cases one license maps to a small bundle rather than exactly one unit. Use it as a strong signal, not an infallible primary key.

A runnable extraction sketch

Here is a self-contained Node.js (18+) example showing the pattern: fetch a page, pull the embedded JSON, normalize to a common shape, and dedupe by REGA number. The selector and JSON paths are intentionally generic — you adapt them to whatever each portal actually emits — but the structure is exactly what you'd ship.

// node >=18, ESM. npm i cheerio
import * as cheerio from "cheerio";

const UA =
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
  "(KHTML, like Gecko) Chrome/124.0 Safari/537.36";

// Pull the embedded Next.js JSON island from a listing/search page.
async function fetchEmbeddedJson(url) {
  const res = await fetch(url, { headers: { "User-Agent": UA } });
  if (!res.ok) throw new Error(`${url} -> ${res.status}`);
  const html = await res.text();
  const $ = cheerio.load(html);

  // Next.js serializes render props here. Some portals use a
  // differently-named tag, so inspect the HTML and adjust.
  const raw = $("#__NEXT_DATA__").contents().text();
  if (!raw) return null;
  return JSON.parse(raw);
}

// Map a portal-specific listing object onto one canonical shape.
function normalize(portal, l) {
  return {
    portal,
    sourceId: l.id ?? l.listingId ?? null,
    rega: (l.permitNumber ?? l.licenseNumber ?? l.regaAdNumber ?? "")
      .toString()
      .trim(),
    purpose: l.purpose ?? l.category ?? null, // sale | rent | off-plan
    type: l.propertyType ?? null,             // villa, apartment, ...
    price: l.price ?? null,
    areaSqm: l.area ?? l.builtUpArea ?? null,
    beds: l.bedrooms ?? null,
    baths: l.bathrooms ?? null,
    city: l.location?.city ?? null,
    district: l.location?.district ?? null,
    agent: l.agent?.name ?? l.contactName ?? null,
  };
}

// Dedupe across portals using the REGA advertisement number.
function dedupeByRega(rows) {
  const byRega = new Map();
  const noKey = [];
  for (const r of rows) {
    if (!r.rega) {
      noKey.push(r); // fall back to fuzzy matching downstream
      continue;
    }
    const seen = byRega.get(r.rega);
    if (!seen) {
      byRega.set(r.rega, { ...r, seenOn: [r.portal] });
    } else {
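      // Record each portal once, even if it lists the same license twice.
      if (!seen.seenOn.includes(r.portal)) seen.seenOn.push(r.portal);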
      // Keep the richest record; here we prefer a non-null price.
      if (seen.price == null && r.price != null) seen.price = r.price;
    }
  }
  return { merged: [...byRega.values()], unmatched: noKey };
}

// --- usage sketch ---
const data = await fetchEmbeddedJson("https://www.example-portal.sa/search");
const listings = data?.props?.pageProps?.listings ?? []; // adapt this path
const rows = listings.map((l) => normalize("examplePortal", l));
const { merged, unmatched } = dedupeByRega(rows);
console.log(`merged=${merged.length} unmatched=${unmatched.length}`);

The shape of data.props.pageProps differs per portal — that path is the single thing you'll spend the most time pinning down. Once you have it, the normalize → dedupeByRega pipeline is portable across all four sites.
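One way to pin that path down faster is to let a script list every array of objects in the embedded JSON and then eyeball the candidates. A minimal Python sketch follows, assuming you have saved the parsed JSON to inspect offline; the sample structure (searchResult.hits) is invented for illustration and find_record_lists is a hypothetical helper.

```python
def find_record_lists(node, path="", min_len=2):
    """Yield (path, length, sample_keys) for every list made only of dicts."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            yield from find_record_lists(value, child, min_len)
    elif isinstance(node, list):
        if len(node) >= min_len and all(isinstance(x, dict) for x in node):
            yield path, len(node), sorted(node[0].keys())[:5]
        for index, item in enumerate(node):
            yield from find_record_lists(item, f"{path}[{index}]", min_len)


# Invented structure standing in for a portal's embedded JSON.
data = {
    "props": {
        "pageProps": {
            "searchResult": {
                "hits": [
                    {"id": 1, "price": 950000},
                    {"id": 2, "price": 1200000},
                ]
            },
            "footerLinks": ["about", "contact"],
        }
    }
}

candidates = list(find_record_lists(data))
print(candidates)
```

The longest list whose sample keys look like listing fields (price, bedrooms, and so on) is usually the one to map through normalize.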

Practical notes from the trenches

Be polite. Throttle, set a real User-Agent, and respect each site's terms. These portals will rate-limit aggressive clients, and Aqar.fm in particular reacts to bursty traffic.
Proxies matter for scale. A single datacenter IP is fine for prototyping; for any real volume you'll want rotating (ideally residential) IPs.
Normalize early. Arabic district names have multiple romanizations — store the original Arabic string and a slugified key, and never join on the display name.
Treat REGA as a strong signal, not gospel. Join on it first, then fuzzy-match the remainder on (city, district, price-bucket, beds, areaSqm).
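For the rows that have no REGA number, the fuzzy fallback in the last note can start as a simple blocking key. The Python sketch below is one possible approach, not a tuned matcher: the 50,000 SAR price step and the 10 m² area step are arbitrary assumptions, the field names mirror the canonical shape used earlier, and slugify and fuzzy_key are illustrative helpers.

```python
import re
import unicodedata


def slugify(text) -> str:
    """Lowercase, drop combining marks, and remove spaces and punctuation."""
    text = unicodedata.normalize("NFKD", text or "")
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[\W_]+", "", text.lower())


def fuzzy_key(row, price_step=50_000, area_step=10):
    """Bucket price and area so small differences land on the same key."""
    price = row.get("price")
    area = row.get("areaSqm")
    return (
        slugify(row.get("city")),
        slugify(row.get("district")),
        None if price is None else int(price // price_step),
        row.get("beds"),
        None if area is None else int(area // area_step),
    )


a = {"city": "Riyadh", "district": "Al Malqa", "price": 2_450_000,
     "beds": 5, "areaSqm": 412}
b = {"city": "riyadh", "district": "Al-Malqa", "price": 2_499_000,
     "beds": 5, "areaSqm": 415}
print(fuzzy_key(a) == fuzzy_key(b))
```

Fixed-width buckets split near-identical values that straddle an edge (2,499,000 and 2,500,000 land in different buckets here), so treat equal keys as candidate pairs to review rather than as confirmed duplicates.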

If you'd rather not build the plumbing

I build data tools, so two honest plugs. To sketch a query — pick the platform, the listing type, and preview the output field shape before you write any code — I made a free query-builder at datatooly.xyz/saudi-real-estate-search. It's a builder, not a live in-browser scraper, so it won't return listings itself.

When you want the data flowing, the Saudi Real Estate Scraper on Apify handles the multi-portal fetch, the embedded-JSON parsing, and the REGA-based dedup for you — free to start, then pay-as-you-go.

Disclosure: I built both the query-builder and the Apify actor linked above.

The takeaway, tool or no tool: in Saudi Arabia, the regulator handed you a free primary key. Use the REGA advertisement license number and your cross-portal dataset gets dramatically cleaner.

How to Extract Dubai Used Car Listings Data from Gulf Marketplaces

Omar Eldeeb — Tue, 30 Jun 2026 20:40:19 +0000

If you need Dubai used car listings data — make, model, year, price, mileage, location, and dealer for thousands of vehicles — the good news is that the major Gulf marketplaces are far friendlier to extraction than most modern web apps. The listings render server-side, and the detail pages carry structured data you can parse without a headless browser. This post walks through the actual shape of that data on DubiCars and Dubizzle Motors, with a runnable example you can adapt today.

I've spent time poking at these portals. Here's what's true, what's easy, and where you'll hit a wall.

The landscape: three platforms, three data shapes

The UAE used-car market is concentrated across a few portals:

DubiCars — the UAE's largest dedicated auto portal, with tens of thousands of used listings live at any time. Listings render server-side, and individual car pages live at predictable URLs like https://www.dubicars.com/2024-toyota-prado-vxr-<id>.html.
Dubizzle Motors — the general-classifieds giant, with tens of thousands of used cars in the UAE. It's a Next.js application, which (as we'll see) is both a gift and a headache.

(YallaMotor is another large regional portal worth knowing about, though we'll focus on DubiCars and Dubizzle here since their extraction patterns cover the two dominant approaches.)

The thing that makes Gulf marketplaces pleasant to work with is that they're SEO-driven. These sites want Google to read their inventory, so they expose it server-side and they annotate it. That's exactly the data you want.

Why JSON-LD is your best friend

Dealerships and marketplaces embed Schema.org structured data in their pages so search engines (and increasingly AI search systems) can render rich results — price, mileage, availability, the works. The Vehicle/Car and Product types cover make, model, model year, mileage (mileageFromOdometer), fuel type, transmission, body type, VIN, and price.

For the data extractor, this is a free lunch. Instead of writing brittle CSS selectors against a layout that changes every quarter, you read a <script type="application/ld+json"> block that the site maintains for its own SEO. It's the most stable surface on the page.

On DubiCars, the listing grid renders server-side — results are visible in raw HTML without executing a line of JavaScript (paginated, so you page through them) — and detail pages carry JSON-LD plus a clean field set: price (shown in AED with an optional USD toggle), year, mileage in km, location (emirate), regional spec (GCC / Japanese / American / Other — a DubiCars-specific field on the page, not a standard schema.org property), and dealer name with a link to the dealer profile.

A runnable example

Here's a self-contained Node.js script (Node 18+, native fetch, no dependencies) that fetches a DubiCars-style detail page and extracts vehicle fields from its JSON-LD. It safely handles both a single object and the @graph array form, and it normalizes the fields you actually care about.

// extract-vehicle.mjs — Node 18+ (native fetch, no deps)

/** Pull every application/ld+json block out of raw HTML. */
function extractJsonLdBlocks(html) {
  const blocks = [];
  const re =
    /<script[^>]+type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  let m;
  while ((m = re.exec(html)) !== null) {
    try {
      blocks.push(JSON.parse(m[1].trim()));
    } catch {
      // Skip malformed blocks rather than crashing the run.
    }
  }
  return blocks;
}

/** Flatten @graph and arrays into a flat list of typed nodes. */
function flattenNodes(blocks) {
  const out = [];
  for (const b of blocks) {
    if (Array.isArray(b)) out.push(...b);
    else if (Array.isArray(b['@graph'])) out.push(...b['@graph']);
    else out.push(b);
  }
  return out;
}

function isVehicleNode(node) {
  const t = node && node['@type'];
  const types = Array.isArray(t) ? t : [t];
  return types.some((x) =>
    ['Car', 'Vehicle', 'Product'].includes(x)
  );
}

function normalize(node) {
  const offer = Array.isArray(node.offers) ? node.offers[0] : node.offers;
  const odo = node.mileageFromOdometer || {};
  return {
    title: node.name ?? null,
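    // brand may be a nested { name } object or a plain string.
    make: node.brand?.name ?? node.manufacturer?.name ??
      (typeof node.brand === 'string' ? node.brand : null),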
    model: node.model ?? null,
    year: node.vehicleModelDate ?? node.productionDate ?? null,
    price: offer?.price ?? null,
    currency: offer?.priceCurrency ?? null,
    mileage: odo.value ?? null,
    mileageUnit: odo.unitCode ?? odo.unitText ?? null,
    fuelType: node.fuelType ?? null,
    transmission: node.vehicleTransmission ?? null,
    bodyType: node.bodyType ?? null,
    vin: node.vehicleIdentificationNumber ?? null,
    url: node.url ?? offer?.url ?? null,
  };
}

async function extractVehicle(url) {
  const res = await fetch(url, {
    headers: {
      // A realistic UA reduces the odds of an interstitial.
      'User-Agent':
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
        '(KHTML, like Gecko) Chrome/124.0 Safari/537.36',
      'Accept-Language': 'en-US,en;q=0.9',
    },
  });
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  const html = await res.text();

  const nodes = flattenNodes(extractJsonLdBlocks(html));
  const vehicle = nodes.find(isVehicleNode);
  if (!vehicle) {
    throw new Error('No Vehicle/Car/Product JSON-LD found on page.');
  }
  return normalize(vehicle);
}

const target = process.argv[2];
if (!target) {
  console.error('Usage: node extract-vehicle.mjs <listing-url>');
  process.exit(1);
}
extractVehicle(target)
  .then((v) => console.log(JSON.stringify(v, null, 2)))
  .catch((e) => {
    console.error('Extraction failed:', e.message);
    process.exit(1);
  });

Run it against a detail URL and you get a clean object. The pattern is deliberately defensive: malformed JSON-LD blocks are skipped, both @graph and bare-object forms are handled, and missing fields come back as null instead of throwing. That matters at scale, because no marketplace annotates every listing identically.

When there's no JSON-LD: read the embedded state

Dubizzle Motors is a Next.js app. Next.js apps serialize their server-fetched data into a <script id="__NEXT_DATA__" type="application/json"> block at the bottom of the HTML — the same data the React app hydrates from. For listing and search pages, that JSON often contains the full result set (and frequently more fields than the visible card shows). The extraction approach is identical to the JSON-LD one: grab the block, JSON.parse it, then walk props.pageProps to find the listings array.

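function extractNextData(html) {
  // Match on the id alone so extra attributes (such as a CSP nonce) or a
  // different attribute order do not break the match.
  const m = html.match(
    /<script[^>]*\bid="__NEXT_DATA__"[^>]*>([\s\S]*?)<\/script>/
  );
  return m ? JSON.parse(m[1]) : null;
}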

The honest catch: Dubizzle sits behind a WAF. Plain datacenter requests get challenged or blocked, and naive fetch calls are often detected by their TLS fingerprint, headers notwithstanding. Reaching it reliably needs a request library that mimics a real browser's TLS handshake plus residential or well-rotated proxies, and a respectful crawl rate. DubiCars is far more forgiving — start there if you're learning.

A few ground rules regardless of platform: keep concurrency low, cache pages you've already fetched, identify yourself honestly in your User-Agent where reasonable, and check each site's terms before you collect at volume.
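The first two ground rules (low concurrency and caching) fit in one small wrapper. Here is a Python sketch under stated assumptions: the fetch function is injected, so you can plug in whatever HTTP client you use; the 2-second default delay is an arbitrary starting point, not a figure published by any of these sites; and PoliteFetcher is a hypothetical class name.

```python
import time
from typing import Callable, Dict, Optional


class PoliteFetcher:
    """Serial fetcher: one request at a time, a minimum gap, and a page cache."""

    def __init__(self, fetch: Callable[[str], str], min_delay: float = 2.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._fetch = fetch
        self._min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._cache: Dict[str, str] = {}
        self._last_request: Optional[float] = None

    def get(self, url: str) -> str:
        if url in self._cache:          # never re-fetch a page we already have
            return self._cache[url]
        if self._last_request is not None:
            wait = self._min_delay - (self._clock() - self._last_request)
            if wait > 0:
                self._sleep(wait)       # enforce the gap between requests
        body = self._fetch(url)
        self._last_request = self._clock()
        self._cache[url] = body
        return body


calls = []


def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP call, so the example runs offline."""
    calls.append(url)
    return f"<html>{url}</html>"


fetcher = PoliteFetcher(fake_fetch, min_delay=0.0)
fetcher.get("https://example.com/a")
fetcher.get("https://example.com/a")   # served from cache
fetcher.get("https://example.com/b")
print(len(calls))
```

The cache here lives in memory only; for multi-day crawls you would persist it to disk so a restart does not re-request pages.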

If you'd rather not maintain the plumbing

Selector drift, WAF challenges, proxy rotation, and bilingual (English/Arabic) field normalization add up. If you just want the data, I built two things to skip the maintenance:

A free Gulf Used Car query builder that lets you pick a platform and preview the output shape before you commit to anything. (It's a query/preview builder — it doesn't scrape live in your browser.)
The Gulf Used Car Scraper on Apify, which handles DubiCars and Dubizzle Motors end to end — JSON-LD parsing, __NEXT_DATA__ extraction, the WAF handling, and normalized output. It's free to start, then pay-as-you-go.

Disclosure: I build and maintain both tools, and the Apify link is an affiliate link.

Wrapping up

The reason Dubai used car listings data is approachable comes down to one thing: these marketplaces publish structured data on purpose. Read the JSON-LD when it's there, fall back to __NEXT_DATA__ when it isn't, write defensive parsers that tolerate missing fields, and treat Dubizzle's WAF with respect. Start with the runnable script above against DubiCars, and you'll have normalized listings flowing in an afternoon.

How to Build an H-1B Salary Database by Employer (the Real Data Source + Python)

Omar Eldeeb — Tue, 30 Jun 2026 20:39:57 +0000

If you've searched for an H-1B salary database by employer and hoped to find a clean JSON API you could just hit, you've probably been disappointed. There isn't one — at least not from the government. But the underlying data is fully public, authoritative, and very queryable once you know where it actually lives. This post shows you the real source, what's in it, and how to query H-1B salary data by employer and role programmatically with Python.

Where the data actually comes from

Every site you've seen with H-1B salaries — h1bdata.info, h1bgrader, and the rest — is re-publishing the same upstream source: the LCA (Labor Condition Application) disclosure data released by the U.S. Department of Labor's Office of Foreign Labor Certification (OFLC).

Here's how it works. Before an employer can file an H-1B petition, it must submit a Labor Condition Application to the DOL attesting to the wage it will pay. By law, OFLC publishes these applications as public disclosure data. The files cover the H-1B, H-1B1, and E-3 programs and are released quarterly on the OFLC performance-data page:

https://www.dol.gov/agencies/eta/foreign-labor/performance

A few things worth being precise about, because most blog posts get them fuzzy:

It's bulk download, not a live API. OFLC posts Excel (.xlsx) files — one per quarter since FY2020, and one per year for older years going back to 2008. There is no real-time query endpoint. You download the file and parse it yourself.
Each row is one application, not one approval. The CASE_STATUS column tells you whether it was Certified, Denied, Withdrawn, etc. If you want "real" offers, filter to Certified.
The wage is a disclosed offered wage, not actual paid salary. It's the wage the employer attested to on the application. That's still the single best public benchmark for "what does Company X pay for Role Y," but it's not the same as a verified paystub.

Each file has 75+ columns. The ones you almost always care about:

Column	Meaning
`CASE_STATUS`	Certified / Denied / Withdrawn
`EMPLOYER_NAME`	The sponsoring employer
`JOB_TITLE`	Employer-supplied job title
`SOC_TITLE`	Standardized occupation (e.g. "Software Developers")
`WAGE_RATE_OF_PAY_FROM` / `..._TO`	Offered wage range
`WAGE_UNIT_OF_PAY`	Year / Month / Bi-Weekly / Week / Hour
`WORKSITE_CITY` / `WORKSITE_STATE`	Where the role sits
`RECEIVED_DATE`	When it was filed

Column names have drifted slightly over the years (older files used names like LCA_CASE_EMPLOYER_NAME), so always check the record layout PDF that ships alongside each year's file.

Querying it in Python

Here's a complete, runnable example. Download a quarterly LCA disclosure .xlsx from the OFLC page first (they're large — tens to over a hundred MB, hundreds of thousands of rows), then point this at it. It filters to certified cases, normalizes everything to an annual wage, and lets you query by employer and SOC role.

import pandas as pd

# Path to a quarterly LCA disclosure file downloaded from
# https://www.dol.gov/agencies/eta/foreign-labor/performance
FILE = "LCA_Disclosure_Data_FY2026_Q1.xlsx"

# Only read the columns we need — the full file has 75+ and is slow to load whole.
COLS = [
    "CASE_STATUS", "EMPLOYER_NAME", "JOB_TITLE", "SOC_TITLE",
    "WAGE_RATE_OF_PAY_FROM", "WAGE_UNIT_OF_PAY",
    "WORKSITE_CITY", "WORKSITE_STATE",
]

df = pd.read_excel(FILE, usecols=COLS, engine="openpyxl")

# 1. Keep only certified applications.
df = df[df["CASE_STATUS"].str.upper() == "CERTIFIED"].copy()

# 2. Normalize every wage to an annual figure. DOL's WAGE_UNIT_OF_PAY values are
# Year / Month / Bi-Weekly / Week / Hour, but the exact label and casing drift
# across years — so match case-insensitively and cover the common variants
# (otherwise an unmatched unit silently becomes NaN and the row is dropped).
PERIODS = {"YEAR": 1, "MONTH": 12, "BI-WEEKLY": 26,
           "WEEK": 52, "WEEKLY": 52, "HOUR": 2080}  # 2080 = 40h * 52w

df["wage"] = pd.to_numeric(df["WAGE_RATE_OF_PAY_FROM"], errors="coerce")
unit = df["WAGE_UNIT_OF_PAY"].astype(str).str.strip().str.upper()
df["annual_wage"] = df["wage"] * unit.map(PERIODS)
df = df.dropna(subset=["annual_wage"])

def query(employer=None, role=None, state=None):
    out = df
    if employer:
        # regex=False: treat "Amazon.com" as literal text, not a regex pattern.
        out = out[out["EMPLOYER_NAME"].str.contains(employer, case=False, na=False, regex=False)]
    if role:
        out = out[out["SOC_TITLE"].str.contains(role, case=False, na=False, regex=False)]
    if state:
        out = out[out["WORKSITE_STATE"].str.upper() == state.upper()]
    return out

# Example: software roles at Stripe in California
res = query(employer="Stripe", role="Software", state="CA")

print(f"{len(res)} matching certified applications")
print(res["annual_wage"].describe()[["count", "mean", "min", "max"]].round(0))
print("Median:", round(res["annual_wage"].median()))

A few practical notes:

pandas.read_excel with usecols keeps memory use down, but openpyxl still has to parse the whole sheet, so the initial load stays slow. For repeated queries, convert the file to Parquet once (df.to_parquet(...)) and read that instead; it is typically far quicker.
Always normalize WAGE_UNIT_OF_PAY. A surprising share of records are filed hourly, and if you don't convert them you'll get a "median salary" of $85 that quietly ruins your analysis.
The example uses WAGE_RATE_OF_PAY_FROM (the floor, and the conservative benchmark). The file also carries WAGE_RATE_OF_PAY_TO for the ceiling of the posted range — add it to COLS if you want both ends.
Employer names are not deduplicated. "Amazon.com Services LLC", "Amazon Web Services", and "Amazon Development Center" are separate strings. Substring matching (as above) is the pragmatic fix; for serious work you'll want a normalization pass.
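As a starting point for the normalization pass mentioned in the last note, here is a small Python sketch that collapses punctuation, casing, and trailing legal suffixes. The suffix list is an assumption and far from exhaustive, and normalize_employer is a hypothetical helper. Note what it does not do: it merges spelling variants of one legal entity, but it will not decide that two differently named subsidiaries belong to the same parent.

```python
import re

# Trailing legal-form tokens to drop (illustrative, not exhaustive).
_SUFFIXES = {"llc", "inc", "corp", "corporation", "co", "ltd", "lp", "llp", "plc"}


def normalize_employer(name) -> str:
    """Lowercase, replace punctuation with spaces, strip trailing suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", (name or "").lower())
    tokens = cleaned.split()
    while tokens and tokens[-1] in _SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


names = ["Amazon.com Services LLC", "AMAZON.COM SERVICES, INC.", "Stripe, Inc."]
normalized = [normalize_employer(n) for n in names]
print(normalized)
```

Grouping on this key before taking a median is usually enough for a first pass; a hand-maintained alias table is still needed to roll subsidiaries up to a parent.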

Scaling it: many quarters, many years

One quarter is a snapshot. To build a real H-1B salary database by employer you'll want to concatenate several years of files so you can see trends and have enough rows per employer/role to compute a stable median:

import glob

frames = []
for f in glob.glob("LCA_Disclosure_Data_FY20*.xlsx"):
    # usecols errors if any name is missing, and column names drift across years —
    # so read the header row first and request only the columns that exist in this file.
    avail = pd.read_excel(f, nrows=0, engine="openpyxl").columns
    frames.append(pd.read_excel(f, usecols=[c for c in COLS if c in avail], engine="openpyxl"))
full = pd.concat(frames, ignore_index=True)

That's also where the work starts to add up: downloading every quarterly file, handling the column-name drift between years, deduplicating employers, and keeping it refreshed each quarter when OFLC posts new data. None of it is hard — it's just plumbing — but it's the part that turns "I found a CSV" into "I have a queryable benchmark."
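The column-name drift is the piece most worth automating first. Below is a minimal Python sketch of a rename map, kept free of pandas so it runs anywhere. Only the LCA_CASE_EMPLOYER_NAME entry comes from the text above; the other legacy names are assumptions to verify against each year's record layout PDF, and rename_map is a hypothetical helper.

```python
# Legacy -> current column names.
LEGACY_NAMES = {
    "LCA_CASE_EMPLOYER_NAME": "EMPLOYER_NAME",  # mentioned in the text
    "STATUS": "CASE_STATUS",                    # assumption, verify per year
    "LCA_CASE_JOB_TITLE": "JOB_TITLE",          # assumption, verify per year
}


def rename_map(columns, wanted):
    """Map each original header to its canonical name if we want that column."""
    mapping = {}
    for col in columns:
        key = col.strip().upper()
        canonical = LEGACY_NAMES.get(key, key)
        if canonical in wanted:
            mapping[col] = canonical
    return mapping


wanted = {"EMPLOYER_NAME", "CASE_STATUS", "JOB_TITLE"}
old_header = ["LCA_CASE_EMPLOYER_NAME", "STATUS", "Job_Title ", "VISA_CLASS"]
mapping = rename_map(old_header, wanted)
print(mapping)
```

In the loop above you would build this mapping from the header row of each file, pass its keys as usecols, and then call rename(columns=mapping) on the frame before concatenating, so every year ends up with the same column names.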

A shortcut if you'd rather skip the plumbing

If you just want to sketch the query you'd run — employer + role + state — before committing to the full ETL, I put together a small free query-builder that lets you compose the lookup and preview the shape of the result: datatooly.xyz/salary-benchmark-lookup. It builds the query for you; it does not run the dataset live in your browser (these files are far too large for that).

And if you want the parsed, normalized, deduplicated data delivered as structured JSON without maintaining the pipeline yourself, I built an Apify actor for exactly this — SalaryBench IQ — which queries the DOL LCA disclosure data by employer/role/state and returns clean records. It's free to start, then pay-as-you-go.

Disclosure: I built both of those tools, so treat that section as me telling you about my own work — the DOL-source facts and the code above stand entirely on their own.

The honest takeaway

There is no official H-1B salary JSON API. But the data behind every H-1B salary site is one well-documented, free, public source: the DOL OFLC LCA disclosure files. Download a quarter, filter to Certified, normalize the wage unit, and you have a defensible salary benchmark by employer and role in about 30 lines of Python. Everything past that — multi-year history, employer normalization, quarterly refresh — is just engineering you can do yourself or offload.

How to Scrape a Telegram Channel Without Login (No API Key, No Phone Number)

Omar Eldeeb — Tue, 30 Jun 2026 20:39:36 +0000

If you want to scrape a Telegram channel without login, you do not need MTProto, you do not need a bot token, and you do not need to hand Telegram a phone number. Every public channel quietly exposes a server-rendered HTML preview that you can fetch with a plain HTTP request and parse with any HTML library. No account to create, and so no account to put at risk.

This is the cleanest, lowest-friction way to pull recent posts from a public channel, and most people never discover it because the Telegram developer docs push you straight toward the full Bot API or the Telethon/MTProto client libraries. Those are powerful, but they're overkill — and a liability — when all you want is the public message feed.

Let me show you exactly how it works, give you a runnable script, and be honest about where this approach hits a wall.

The trick: t.me/s/<channel>

Telegram publishes a public web preview for every public channel at:

https://t.me/s/<channelname>

The /s/ is the important part — it serves the "preview" (the embeddable, indexable version). Hit https://t.me/s/durov in a browser with JavaScript turned off and you'll still see the posts. That's the tell: the HTML is server-rendered, so a basic HTTP client receives the fully-populated page. No headless browser required.

Inside that HTML, each post lives in a predictable structure. The classes are the same ones the official Telegram post widget uses, and have been stable in practice:

.tgme_widget_message — the container for a single post. It carries a data-post attribute like durov/123, where 123 is the message ID.
.tgme_widget_message_text — the post body (with inline HTML for links, bold, etc.).
.tgme_widget_message_date time — a <time> element whose datetime attribute is an ISO-8601 timestamp.
.tgme_widget_message_views — the view count, rendered as a human string like 12.4K.
.tgme_widget_message_photo_wrap / .tgme_widget_message_video_wrap — media wrappers; the image URL is tucked into a background-image CSS rule.

That's everything you need to reconstruct a structured feed.
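Two of those fields need a little cleanup before they are useful: the view count is a display string, and the media URL sits inside an inline style. Here is a rough sketch of both conversions. The helper names are mine, and the K/M/B suffix handling is an assumption based on the 12.4K style shown above, so expect only an approximate integer back.

```python
import re
from typing import Optional

_MULTIPLIER = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}

def parse_views(label: Optional[str]) -> Optional[int]:
    """Convert a display string like '12.4K' into an approximate integer."""
    if not label:
        return None
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMB]?)\s*", label, flags=re.IGNORECASE)
    if not m:
        return None
    return int(round(float(m.group(1)) * _MULTIPLIER[m.group(2).upper()]))

def background_image_url(style: Optional[str]) -> Optional[str]:
    """Pull the URL out of an inline 'background-image:url(...)' rule."""
    m = re.search(r"background-image\s*:\s*url\((['\"]?)(.*?)\1\)", style or "")
    return m.group(2) if m else None

print(parse_views("12.4K"))  # 12400
print(background_image_url("width:100%;background-image:url('https://example.com/a.jpg')"))
```

In the scraper you would feed parse_views the text of .tgme_widget_message_views and feed background_image_url the style attribute of the photo wrapper element.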

A runnable example

Here's a self-contained Python script using requests and beautifulsoup4. It fetches one page of a public channel and extracts each post's ID, text, timestamp, and view count.

import re
import requests
from bs4 import BeautifulSoup

def scrape_channel(channel: str):
    url = f"https://t.me/s/{channel}"
    # Send an explicit User-Agent; if you get a stripped-down response,
    # try a full browser UA string instead.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; my-scraper/1.0)"}
    resp = requests.get(url, headers=headers, timeout=20)
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "html.parser")
    posts = []

    for msg in soup.select(".tgme_widget_message"):
        data_post = msg.get("data-post", "")          # e.g. "durov/123"
        msg_id = data_post.split("/")[-1] if data_post else None

        text_el = msg.select_one(".tgme_widget_message_text")
        text = text_el.get_text("\n", strip=True) if text_el else ""

        time_el = msg.select_one(".tgme_widget_message_date time")
        timestamp = time_el["datetime"] if time_el and time_el.has_attr("datetime") else None

        views_el = msg.select_one(".tgme_widget_message_views")
        views = views_el.get_text(strip=True) if views_el else None

        posts.append({
            "id": msg_id,
            "text": text,
            "datetime": timestamp,
            "views": views,
        })

    return posts


if __name__ == "__main__":
    for p in scrape_channel("telegram"):
        print(f"[{p['datetime']}] ({p['views']} views) #{p['id']}")
        print(p["text"][:200])
        print("-" * 40)

Install the two dependencies and run it:

pip install requests beautifulsoup4
python scrape.py

You'll get the most recent posts on the channel's preview page printed as structured records. No keys, no login, nothing to authorize.

Paginating backwards with ?before=

A single fetch of t.me/s/<channel> returns roughly the latest 16–20 posts. To go further back, use the before= query parameter with a message ID. Telegram then returns the page of posts older than that ID:

https://t.me/s/<channel>?before=<message_id>

So the loop is simple: grab a page, find the smallest data-post message ID on it, then request the next page with ?before=<that_id>. Repeat until you hit your target count or stop getting new posts.

def scrape_paginated(channel: str, max_posts: int = 200):
    collected, before = [], None
    while len(collected) < max_posts:
        url = f"https://t.me/s/{channel}"
        if before:
            url += f"?before={before}"
        # scrape_channel_from_url = the parse logic from the first example,
        # factored out to accept a full URL instead of building it from the channel name
        page = scrape_channel_from_url(url)
        if not page:
            break
        collected.extend(page)
        # oldest message id on this page becomes the next cursor
        ids = [int(p["id"]) for p in page if p["id"] and p["id"].isdigit()]
        if not ids:
            break
        new_before = min(ids)
        if new_before == before:              # no progress -> reached the end
            break
        before = new_before
    return collected[:max_posts]

Add a short time.sleep() between requests to be polite, and dedupe by message ID since page boundaries can overlap by one or two posts.
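Here is one way the loop could look with the delay and the dedupe built in. It is a sketch, not the only shape: fetch_page is a hypothetical callable you supply (any function that takes a URL and returns a list of post dicts with an "id" key, such as the parsing logic from the first example), which also makes the loop easy to test without touching the network.

```python
import time
from typing import Callable, Dict, List, Optional

Post = Dict[str, Optional[str]]

def paginate_channel(
    channel: str,
    fetch_page: Callable[[str], List[Post]],
    max_posts: int = 200,
    delay: float = 1.0,
) -> List[Post]:
    collected: List[Post] = []
    seen = set()          # message IDs already collected
    before = None         # cursor: oldest numeric ID seen so far
    while len(collected) < max_posts:
        url = f"https://t.me/s/{channel}"
        if before is not None:
            url += f"?before={before}"
        page = fetch_page(url)
        # Keep only posts we have not seen (page boundaries can overlap).
        fresh = [p for p in page if p.get("id") and p["id"] not in seen]
        if not fresh:
            break         # nothing new -> reached the end
        for post in fresh:
            seen.add(post["id"])
            collected.append(post)
        numeric_ids = [int(p["id"]) for p in fresh if p["id"].isdigit()]
        if not numeric_ids:
            break
        before = min(numeric_ids)
        time.sleep(delay)  # be polite between requests
    return collected[:max_posts]
```

Stopping when a page contains nothing new covers both the empty-page case and the case where the cursor stops moving.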

Where this approach stops working (the honest limits)

This method is genuinely useful, but it has hard limits you should know up front:

Public channels only. Private channels, and all groups, are invisible to t.me/s/. Groups in particular require Telethon/MTProto with API credentials and membership — there is no login-free path.
Recent messages, not full archives. Paginating with before= walks backwards, but Telegram does not reliably serve unbounded deep history through the preview. You can get a healthy window of recent posts; you cannot count on dumping a channel's entire multi-year backlog this way.
Rendered fields only. You get what the preview shows — text, dates, view counts, media URLs, link previews. You do not get the raw API objects (reactions breakdowns, forward chains, edit history) that MTProto exposes.
Rate limiting. It's an HTTP endpoint like any other. Hammer it from one IP and you'll get throttled. Space out requests and rotate if you're going wide.

If your use case fits inside "recent posts from public channels," none of this matters and you're done. If you need full history at scale, you've outgrown the preview.

When you outgrow the snippet

Two things that saved me time, in increasing order of scale:

To stop hand-checking class names and before= cursors, I built a small free query builder at datatooly.xyz/telegram-channel-search. You type a channel and pick the preview shape you want, and it generates the request configuration for you. It's a builder — it composes the query, it doesn't run a live scrape inside your browser.
When you need full history, many channels in parallel, media downloads, and managed proxies instead of babysitting a requests loop, that's where a hosted actor earns its keep. I run Telegram Intelligence Pro on Apify for exactly that — it's free to start, then pay-as-you-go, so you can try it on one channel before committing to a bulk job.

Disclosure: I built the datatooly tool and the Apify actor linked above; the t.me/s/ technique and the code here work entirely on their own without either.

TL;DR

To scrape a Telegram channel without login: fetch https://t.me/s/<channel>, parse .tgme_widget_message blocks for data-post IDs, .tgme_widget_message_text, the <time datetime> attribute, and .tgme_widget_message_views, then page backwards with ?before=<oldest_id>. No API key, no phone number, no MTProto — just HTTP and an HTML parser. Mind the limits (public channels, recent posts) and reach for a hosted tool only when you actually need scale.

Understat xG Data Export: How to Pull Expected Goals Programmatically (Python + CSV)

Omar Eldeeb — Tue, 30 Jun 2026 20:39:15 +0000

If you've searched for an Understat xG data export and come up empty, here's the honest truth up front: Understat does not publish an official public API. There's no documented REST endpoint, no API key, no /v1/xg route. But the expected-goals data is absolutely reachable programmatically — it's sitting right inside the HTML of every page, and once you know the trick, exporting it to CSV is a dozen lines of Python.

This post walks through exactly how Understat ships its data, the one decoding gotcha that trips most people up, and a runnable scraper that gets you league, player, and match xG into a clean DataFrame you can write straight to CSV.

Where the xG data actually lives

Understat (understat.com) covers six leagues — the Premier League, La Liga, Bundesliga, Serie A, Ligue 1, and the Russian Premier League — for every season since 2014/15. When you open a page like https://understat.com/league/EPL/2023, the tables you see (xG, xGA, shots, points) are not rendered server-side as HTML rows. The page ships a near-empty table and a block of JavaScript that builds it client-side.

If you view source and search for xG, you'll find something like this inside a <script> tag:

var teamsData = JSON.parse('\x7B\x221\x22\x3A\x7B\x22id\x22\x3A\x221\x22...');

That's the whole game. The data is a JSON string passed to JSON.parse(), but every character has been escaped into \xNN hex sequences (and some \uNNNN unicode escapes). A naive json.loads() on that raw string will fail: JSON has no \xNN escape, so the parser rejects the literal backslash-x sequences instead of reading them as the characters they represent.

Which variable holds what:

/league/<LEAGUE>/<SEASON> — datesData (every fixture with team-level xG), teamsData (season table with xG/xGA/npxG/PPDA), and playersData (per-player xG, xA, shots, key passes).
/player/<PLAYER_ID> — that player's match-by-match and grouped shot/xG data.
/match/<MATCH_ID> — shotsData (every shot with xG, x/y coordinates, situation, result) plus rostersData.

So once you can decode one variable, you can decode all of them. The only thing that changes is the URL and the variable name.

The decode that everyone gets wrong

The reliable pattern, confirmed across multiple community scrapers, is:

Find the <script> tag containing your target variable.
Slice out the string between (' and ').
Turn the \x escapes back into real characters with .encode("utf-8").decode("unicode_escape").
Parse the result with json.loads().

That decode("unicode_escape") step is the key line. It converts \x7B into {, \x22 into ", and so on, leaving you with a normal JSON string. One gotcha worth knowing up front: unicode_escape decodes those bytes as latin-1, so accented player names (Ødegaard, Sané) come out as mojibake unless you re-encode to latin-1 and decode as UTF-8 — the scraper below does exactly that.
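To watch the decode work in isolation, here is a tiny offline example. The sample string is one I wrote by hand in the same escaped style (it is not copied from Understat), and decode_understat_payload is just an illustrative name for the two-step decode described above.

```python
import json

def decode_understat_payload(raw: str):
    """Undo \\xNN escaping, repair UTF-8, then parse the JSON."""
    # Step 1: \x7B -> '{', \x22 -> '"', and so on. Each escape becomes one
    # character, so the UTF-8 bytes C3 A9 become two separate characters.
    text = raw.encode("utf-8").decode("unicode_escape")
    # Step 2: map those characters back to bytes and read them as UTF-8.
    text = text.encode("latin-1").decode("utf-8")
    return json.loads(text)

# Hand-made sample: {"player_name":"Leroy Sané"} with the punctuation and the
# two UTF-8 bytes of 'é' written as \xNN escapes.
sample = r"\x7B\x22player_name\x22\x3A\x22Leroy San\xC3\xA9\x22\x7D"

print(decode_understat_payload(sample))  # {'player_name': 'Leroy Sané'}
```

Skip step 2 and the same input parses without error but yields 'Leroy SanÃ©', which is the mojibake mentioned above.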

A runnable Python scraper

Here's a self-contained script that pulls the season player table for a league and writes it to CSV. It only needs requests, beautifulsoup4, and pandas.

import json
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; xg-export/1.0)"}

def get_understat_var(url: str, var_name: str):
    """Pull one JSON variable (e.g. 'playersData') out of an Understat page."""
    resp = requests.get(url, headers=HEADERS, timeout=20)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    # Anchor the match on the variable name, so a script block that defines
    # several variables yields the requested one rather than whichever
    # JSON.parse('...') call happens to come first.
    pattern = re.escape(var_name) + r"\s*=\s*JSON\.parse\('(.*?)'\)"
    for script in soup.find_all("script"):
        m = re.search(pattern, script.string or "", re.DOTALL)
        if not m:
            continue
        raw = m.group(1)
        # Turn \x7B etc. back into real characters, then parse.
        # NOTE: unicode_escape decodes bytes as latin-1, which mangles
        # accented names (Ødegaard, Sané). Re-encode latin-1 -> decode utf-8 to fix it.
        decoded = raw.encode("utf-8").decode("unicode_escape").encode("latin-1").decode("utf-8")
        return json.loads(decoded)
    raise ValueError(f"{var_name} not found at {url}")

def league_players(league: str, season: int) -> pd.DataFrame:
    url = f"https://understat.com/league/{league}/{season}"
    data = get_understat_var(url, "playersData")
    return pd.DataFrame(data)

if __name__ == "__main__":
    df = league_players("EPL", 2023)
    # Keep the columns most people want for an xG export
    cols = ["player_name", "team_title", "games", "time",
            "goals", "xG", "assists", "xA", "shots", "key_passes",
            "npg", "npxG"]
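    # .copy() so the casts below modify this frame, not a view of the original
    # (avoids pandas' SettingWithCopyWarning)
    df = df[cols].copy()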
    # Numeric fields arrive as strings — cast the xG ones
    for c in ["xG", "xA", "npxG"]:
        df[c] = pd.to_numeric(df[c])
    df.sort_values("xG", ascending=False, inplace=True)
    df.to_csv("epl_2023_xg.csv", index=False)
    print(df.head(10).to_string(index=False))

Run it and you get epl_2023_xg.csv — your Understat xG data export to CSV, sorted by expected goals, ready for a notebook or a spreadsheet.

A few notes that save debugging time:

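The numeric fields (xG, xA, shots) come back as strings, not numbers. Cast them before you do math. The script casts xG, xA, and npxG; give shots, goals, and any other column you plan to aggregate the same pd.to_numeric treatment.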
For shot-level data, hit a match page and pull shotsData: get_understat_var("https://understat.com/match/26618", "shotsData"). Each shot carries xG, pitch coordinates X/Y, situation, shotType, and result — everything you need for a shot map.
Be polite: add a delay between requests if you loop over a whole season, and cache responses so you're not re-hitting the site during development.
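For the shot-level case, here is a sketch of flattening shotsData into rows and CSV text using only the standard library. It assumes the payload is a dict with "h" (home) and "a" (away) lists of shot dicts; verify that against a real match page first. The sample values are made up.

```python
import csv
import io
from typing import Dict, List

NUMERIC = ["xG", "X", "Y"]                     # arrive as strings; cast to float
TEXT = ["situation", "shotType", "result"]

def flatten_shots(shots_data: Dict[str, List[dict]]) -> List[dict]:
    """Merge home and away shots into one list of typed rows."""
    rows = []
    for side in ("h", "a"):                    # assumed keys: home / away
        for shot in shots_data.get(side, []):
            row = {"side": side}
            row.update({k: float(shot[k]) for k in NUMERIC})
            row.update({k: shot.get(k) for k in TEXT})
            rows.append(row)
    return rows

def shots_to_csv(rows: List[dict]) -> str:
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["side"] + NUMERIC + TEXT)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

sample = {
    "h": [{"xG": "0.42", "X": "0.88", "Y": "0.51",
           "situation": "OpenPlay", "shotType": "RightFoot", "result": "Goal"}],
    "a": [{"xG": "0.05", "X": "0.70", "Y": "0.30",
           "situation": "SetPiece", "shotType": "Head", "result": "MissedShots"}],
}
print(shots_to_csv(flatten_shots(sample)))
```

In practice you would replace sample with the return value of get_understat_var(..., "shotsData") and write the string to a file.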

Doing it in JavaScript

Same idea in Node — with one catch: JSON.parse itself rejects \x escapes (JSON only permits \uNNNN), so you can't just feed it the raw body. Convert the \xNN byte escapes to real bytes first, then decode them as UTF-8 (this also handles accented names correctly):

const res = await fetch("https://understat.com/league/EPL/2023");
const html = await res.text();
const match = html.match(/var\s+playersData\s*=\s*JSON\.parse\('(.+?)'\)/s);
// \xNN escapes are raw UTF-8 bytes -> rebuild the byte array, then decode UTF-8
const bytes = Uint8Array.from(
  match[1].replace(/\\x([0-9A-Fa-f]{2})/g, (_, h) => String.fromCharCode(parseInt(h, 16))),
  (c) => c.charCodeAt(0)
);
const players = JSON.parse(new TextDecoder("utf-8").decode(bytes));
console.log(players[0].player_name, players[0].xG);

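Note that the snippet calls JSON.parse only once. The regex lifts out the escaped argument of the page's own JSON.parse('...') call, replace() turns each \xNN escape into a single character whose code is that byte, Uint8Array.from collects those codes as bytes, and TextDecoder reads the bytes as UTF-8. Only then does JSON.parse see a normal JSON string.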

When the DIY route stops being worth it

Scraping one league-season is easy. Scraping every shot from every match across six leagues and ten seasons is a different job — you're into rate limits, retries, schema drift, and de-duplication. At that point the script above becomes a small maintenance project.

If you'd rather skip that, two things I built can help. The first is a free Understat query builder — you pick a league and season and it shows you the exact field shape you'll get back, so you can plan your export without reading raw HTML. (It assembles and previews the request; it doesn't scrape live in the browser.) The second, for production-scale pulls, is the Understat Football Analytics actor on Apify, which handles the league/player/match extraction and returns structured rows you can export to CSV, JSON, or Excel. It's free to start, then pay-as-you-go.

Disclosure: I built both of those tools, so treat that last paragraph as the author's pitch — the scraping method above stands entirely on its own and is everything you need to export Understat xG data yourself.

Wrapping up

There's no official Understat API, but there doesn't need to be. The xG, xA, and shot data is embedded as an escaped JSON string in each page; decode the \x escapes with unicode_escape, round-trip the result through latin-1 and UTF-8 so accented names survive, run json.loads, and you have clean structured data. Point the same helper at /league, /player, or /match URLs and you've covered league tables, player seasons, and shot maps, all exportable to CSV in a few lines.

How to Build a Threads Scraper for Meta Profiles and Posts

Omar Eldeeb — Sat, 13 Jun 2026 16:32:35 +0000

If you want to build a Threads scraper, the first thing to get straight is what Threads actually is in 2026 — because the surface has changed under everyone's feet. Threads is Meta's X-competitor, and it is no longer a small experiment: Meta reported it crossed 400 million monthly active users in August 2025. That growth is exactly why marketers, researchers, and data teams suddenly want programmatic access to profiles, posts, and hashtags.

This guide is the honest version. I'll show you what loads without authentication, what doesn't, where Meta's official API helps versus where it doesn't, and a runnable code example you can adapt today.

Fact #1: It's threads.com now, not threads.net

A surprising number of tutorials still hardcode threads.net. That's stale. On April 24, 2025, Meta officially migrated the canonical domain from threads.net to threads.com. Meta didn't own the .com at launch — it belonged to a messaging startup — and acquired it in September 2024 before flipping the canonical domain the following spring. Old threads.net URLs now redirect, but if you're writing a Threads scraper, target threads.com directly so you skip a redirect hop and avoid brittle string matching.

# Do this
PROFILE_URL = "https://www.threads.com/@zuck"

# Not this (redirects, and you may parse a redirect interstitial)
# PROFILE_URL = "https://www.threads.net/@zuck"

Fact #2: There is no open public API for general scraping

This is the part people get wrong in both directions, so let's be precise.

Meta does publish an official Threads API, opened to developers in 2024. It is genuinely useful for some things: publishing posts on behalf of an authenticated user, tokenless oEmbed for embedding public posts, and a limited ability to search public posts by author or media type. But it is not an open data firehose. To use it meaningfully you register a Meta Developer App and go through App Review, and the read surface is narrow and account-scoped — it's built for "let my app post and embed," not "let me pull arbitrary public profiles and their post history at scale."

So when someone says "just use the API," the honest answer is: the official API solves publishing well and bulk public reading poorly. For competitive research, audience analysis, or trend tracking across accounts you don't own, you're going to read the public web surface instead. Which brings us to the good news.

Fact #3: Public profiles and posts render cookie-free

Threads is, relative to Instagram or LinkedIn, friendly to logged-out reading. Public posts render in the initial server-side HTML for unauthenticated visitors. You don't need cookies, a logged-in session, or GraphQL doc_id juggling to read a public profile's recent posts — the data is in the page Meta serves to crawlers.

The cleanest way to trigger that crawler-friendly server-rendered HTML is to identify as Meta's own link-preview crawler, facebookexternalhit. This is the bot Meta runs to build link previews when a URL is shared, and it reliably receives the SSR variant of the page. Combined with structured data embedded in the HTML, you get profile and post fields without browser automation.

Here's a minimal, correct example in Python. It fetches a public profile page with the crawler user-agent and pulls structured data out of the HTML. No login, no headless browser.

import json
import re
import urllib.request

UA = "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"

def fetch_public_profile(username: str) -> str:
    url = f"https://www.threads.com/@{username}"
    req = urllib.request.Request(url, headers={"User-Agent": UA})
    with urllib.request.urlopen(req, timeout=20) as resp:
        return resp.read().decode("utf-8", errors="replace")

def extract_jsonld(html: str):
    """Pull <script type='application/ld+json'> blocks (structured data)."""
    blocks = re.findall(
        r'<script type="application/ld\+json"[^>]*>(.*?)</script>',
        html,
        flags=re.DOTALL,
    )
    out = []
    for b in blocks:
        try:
            out.append(json.loads(b.strip()))
        except json.JSONDecodeError:
            continue
    return out

if __name__ == "__main__":
    html = fetch_public_profile("zuck")
    for obj in extract_jsonld(html):
        # ProfilePage / Person objects carry name, handle, description, etc.
        print(json.dumps(obj, indent=2)[:800])

A few notes so this holds up in production:

Parse the JSON, don't regex the fields. The HTML markup churns constantly; the embedded structured-data and inline JSON blobs are far more stable. Find the script blocks, json.loads them, then walk the objects.
Expect more than one JSON shape. Threads has shipped at least two structured-data layouts over time (a bare Person object and a ProfilePage wrapping mainEntity). Handle both, or your parser silently returns nulls after Meta ships a tweak.
Validate the host. If you accept arbitrary input URLs, make sure the host is exactly threads.com (or www.threads.com). A naive suffix check like "ends with threads.com" will happily accept notthreads.com and open you to SSRF. Match the host, not a substring.
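The last two notes are small enough to sketch directly. Treat this as one possible shape rather than a hardened implementation: the function names are mine, and rejecting anything that is not https is my own extra restriction on top of the host check.

```python
from typing import Optional
from urllib.parse import urlsplit

ALLOWED_HOSTS = {"threads.com", "www.threads.com"}

def is_threads_url(url: str) -> bool:
    """Exact host match, so 'notthreads.com' and 'threads.com.evil.example' fail."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()   # hostname ignores userinfo and port
    return parts.scheme == "https" and host in ALLOWED_HOSTS

def find_person(obj) -> Optional[dict]:
    """Return the Person dict from either structured-data layout, else None."""
    if not isinstance(obj, dict):
        return None
    if obj.get("@type") == "Person":            # layout 1: bare Person
        return obj
    if obj.get("@type") == "ProfilePage":       # layout 2: ProfilePage wrapper
        main = obj.get("mainEntity")
        if isinstance(main, dict) and main.get("@type") == "Person":
            return main
    return None

print(is_threads_url("https://www.threads.com/@zuck"))   # True
print(is_threads_url("https://notthreads.com/@zuck"))    # False
print(find_person({"@type": "ProfilePage",
                   "mainEntity": {"@type": "Person", "name": "Example"}}))
```

You would call is_threads_url before fetching anything a user hands you, and run find_person over each object that extract_jsonld returns.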

Fact #4: Search and reply-trees are the hard part

Here's where logged-out reading hits its ceiling, and where honest expectation-setting matters.

Profiles and a profile's recent posts: easy. Public, in the SSR HTML, cookie-free.

Full reply trees: limited. Without an authenticated session, a post's discussion tree returns only the publicly-indexed posts that reference or quote it (roughly 15–30), not the complete comment list. The deep thread requires a logged-in session, which anonymous crawlers do not have.

Keyword search and hashtags: partial. You can pull top results for a tag or query from the public surface, but the volume and depth are capped by what Threads chooses to expose to logged-out users. Treat search/hashtag as "top sample," not "exhaustive archive," and design your downstream analytics around a sample, not a census.

This isn't a flaw in your code — it's the platform boundary. A good Threads scraper is explicit about which mode returns complete data (profile, posts-by-user) and which returns a public subset (search, hashtag, reply-tree). On the legal side, logged-out scraping of public data has generally been treated more favorably by US case law than authenticated scraping (e.g., hiQ v. LinkedIn, Meta v. Bright Data) — but that's a posture, not legal advice. Read Meta's terms and your own jurisdiction.

Putting it together: modes you actually want

A complete Threads scraper usually exposes these modes:

Profile — handle, bio, follower count, bio links, verification.
Posts by user — recent posts for one or more usernames.
Post detail — a source post plus its public quote-reposts/references.
Search — top results for a keyword (sampled).
Hashtag — top posts for a tag (sampled).
Monitor — emit only posts new since your last run, for ongoing tracking.

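Profile and posts by user return complete public data. Post detail returns only the public references to a post, and search and hashtag return top samples, so treat all three as public subsets. Monitor is a delta by design. Knowing that distinction up front saves you from promising stakeholders an "everything" dataset the platform won't give you.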
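Monitor mode is the easiest of the six to reason about in code: remember which post IDs you have already emitted, and return only the rest. A minimal sketch, assuming each post is a dict with a stable "id" key and that a local JSON file is an acceptable place to keep state (both are my assumptions, not a description of how any particular tool does it):

```python
import json
from pathlib import Path
from typing import Dict, List

def new_posts_only(posts: List[Dict], state_path: str) -> List[Dict]:
    """Return posts not seen in earlier runs and record them as seen."""
    path = Path(state_path)
    seen = set(json.loads(path.read_text())) if path.exists() else set()
    fresh = [p for p in posts if p["id"] not in seen]
    seen.update(p["id"] for p in fresh)
    path.write_text(json.dumps(sorted(seen)))   # persist for the next run
    return fresh
```

The first run emits everything it is given; every later run emits only IDs that were not in the state file. For long-lived monitors you would eventually want to cap or expire the stored IDs.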

A faster path than hand-rolling it

The code above works, but going from "fetches one profile" to "handles both JSON shapes, retries transient failures, rotates IPs when Meta rate-limits, paginates posts, and dedupes a monitor run" is real engineering. If you'd rather skip the maintenance treadmill, I built two things to help.

First, a free Threads query builder. Important honest caveat: it is a query builder, not a live in-browser scraper. Threads isn't CORS-open, so nothing fetches live results in your tab. You pick a mode, type usernames or a query, set limits, and it previews the exact output shape so you know the field structure before you run anything. It's the fastest way to design your schema.

Second, the backing Threads Scraper actor on Apify runs the configured job for real. It uses the cookie-free SSR approach described here (no login, no cookie management), supports all six modes above including monitor-deltas and bio-contact extraction, and is free to start, then pay-as-you-go — the first 50 chargeable events per run are free, so you can validate output on real data before spending anything.

Disclosure: I built the query-builder tool and the Apify actor referenced above.

Whether you hand-roll it with the snippet here or run the actor, the takeaways are the same: target threads.com, expect no open public API, lean on cookie-free SSR for profiles and posts, and treat search/hashtag/reply-trees as public samples rather than complete archives. Build for those boundaries and your scraper stays correct as Threads keeps shipping changes.

How to Build a LinkedIn Profile Scraper: The Honest Technical Guide

Omar Eldeeb — Sat, 13 Jun 2026 16:32:13 +0000

If you have ever tried to build a LinkedIn profile scraper, you have probably discovered that the obvious path — "just call the API" — is a dead end. LinkedIn does not hand out programmatic access to arbitrary member profiles, and most of the tutorials that promise a five-line solution quietly skip the parts that actually matter: the data source, the legal posture, and why naive HTML parsing breaks.

This article is the honest version. I will show you where public profile data genuinely lives, a correct code pattern for reading it, and the legal nuance you need to understand before you point any automation at LinkedIn. No fabricated benchmarks, no "100% undetectable" nonsense.

There is no open public API for profiles

Let's get this out of the way first, because it shapes every decision downstream.

LinkedIn has an API, but it is not a general-purpose way to read other people's profiles. Public/open access to profile data was removed back in 2015. What remains in the self-service developer portal is narrow: "Sign in with LinkedIn" gives you the authenticated user's own name, headline, and photo — and only with their consent. Anything richer (the Profile API, full work history, connections) is gated behind the LinkedIn Partner Program, requires approval, and ships with hard restrictions.

A few details that surprise people:

The Profile API restricts data retention — under the partner terms you generally may not cache or store profile data beyond short, strictly time-limited windows.
The API Terms of Use explicitly prohibit scraping, combining LinkedIn data with other sources, reselling data, and using the API for lead generation.

So if your goal is "read public profile pages at scale for research or enrichment," the official API simply does not offer that product. That is not a loophole you are missing — it is a deliberate design choice. Which leads to the real question: what is publicly available on the page itself?

What a public LinkedIn profile actually exposes

Open a LinkedIn profile in an incognito window — no login — and you will see a public version of the page. That HTML is rendered for search engines and social-preview crawlers, and like most modern sites built for SEO, it embeds structured data using schema.org vocabulary in JSON-LD format.

Concretely, public profile pages carry a <script type="application/ld+json"> block describing a Person (often nested inside a ProfilePage via its mainEntity property). Google has recommended JSON-LD for profile-page structured data since 2017, and LinkedIn populates it, likely because it wants those rich search results.

This matters enormously for a scraper. Instead of writing brittle CSS selectors against an obfuscated, frequently-changing DOM, you parse a machine-readable JSON object that the site publishes for search engines. It is more stable, more complete, and far less likely to silently break on a redesign.

A correct pattern for reading the JSON-LD

Here is a minimal, runnable Node.js example that extracts and parses JSON-LD from an HTML document. I am showing the parsing logic — the part most tutorials get wrong — rather than encouraging you to hammer LinkedIn directly.

import { load } from "cheerio";

/**
 * Extract a Person object from a profile page's JSON-LD.
 * Handles both shapes seen in the wild:
 *   1. A bare Person at the top level
 *   2. A ProfilePage whose `mainEntity` is the Person
 */
function extractPerson(html) {
  const $ = load(html);
  const blocks = $('script[type="application/ld+json"]')
    .map((_, el) => $(el).contents().text())
    .get();

  for (const raw of blocks) {
    let data;
    try {
      data = JSON.parse(raw);
    } catch {
      continue; // skip malformed blocks instead of crashing the run
    }

    // JSON-LD may be a single object or a @graph array
    const nodes = Array.isArray(data)
      ? data
      : Array.isArray(data["@graph"])
        ? data["@graph"]
        : [data];

    for (const node of nodes) {
      if (node["@type"] === "Person") return node;
      if (node["@type"] === "ProfilePage" && node.mainEntity?.["@type"] === "Person") {
        return node.mainEntity;
      }
    }
  }
  return null;
}

const person = extractPerson(html);
if (person) {
  console.log({
    name: person.name,
    headline: person.jobTitle ?? person.description,
    image: typeof person.image === "string" ? person.image : person.image?.contentUrl,
    sameAs: person.sameAs, // linked social/profile URLs
  });
}

Two things to notice. First, the function tolerates both the bare-Person shape and the ProfilePage.mainEntity shape — real pages drift between them, and a scraper that assumes only one will return nulls the day the markup changes. Second, malformed JSON-LD is skipped, not fatal. Defensive parsing is the difference between an enrichment job that quietly drops one row and one that kills the whole batch.

What this snippet does not show is fetching. That is intentional, because how you request the page is where both the engineering and the law get interesting.

The fetching problem (and the trick that helps)

A plain fetch() from a datacenter IP with a generic user agent usually gets you an interstitial or a login wall, not the public HTML. The page you see in incognito is served to recognized crawlers.

The pragmatic approach is to identify your client as one of the social-preview bots LinkedIn already whitelists for link unfurling — for example the facebookexternalhit/1.1 user agent — and route the request through a proxy so you are not firing thousands of calls from one address. That combination tends to return the SSR HTML with the JSON-LD intact, cookie-free (no logged-in session, no fake accounts). That is exactly the technique the actor I mention at the end uses: social-preview UA plus a datacenter proxy, parse the JSON-LD, then augment with a few DOM-extracted engagement counts for recent posts.

The reason this is worth doing carefully rather than aggressively brings us to the part nobody should skip.

The legal reality: hiQ v. LinkedIn

You cannot write honestly about a LinkedIn profile scraper without the hiQ v. LinkedIn saga, and it is routinely misquoted in both directions. Here is what actually happened.

In April 2022, the Ninth Circuit reaffirmed a narrow reading of the Computer Fraud and Abuse Act (CFAA). The core holding: when a site generally permits public access to data, scraping that public data is likely not "access without authorization" under the CFAA. That is the line everyone celebrates — and it is real.

But the story did not end there. In late 2022 the case resolved with a stipulated $500,000 judgment against hiQ. The district court had found that LinkedIn's user agreement — which prohibits scraping and fake accounts — was enforceable as a matter of contract. hiQ also caught CFAA liability tied specifically to using fake accounts to reach password-protected pages.

The honest takeaway is two-sided:

Scraping genuinely public data is, in the Ninth Circuit, unlikely to be a CFAA (anti-hacking) violation.
That is not blanket permission. Terms-of-service breach-of-contract claims are a separate and live risk, logging in or using fake accounts changes the analysis entirely, and this is evolving case law, not a settled, nationwide green light. Privacy regimes like GDPR add another independent layer if you touch EU residents' data.

Treat "public + no login + respect the ToS posture + minimize footprint + know your jurisdiction" as the baseline, and get your own legal advice for anything commercial. Anyone who tells you scraping LinkedIn is flatly "legal" or flatly "illegal" is oversimplifying a genuinely nuanced area.

A faster way to prototype the output

If you want to see the exact shape of the data before writing any code, I built a free LinkedIn Profile Lookup query builder. Important: it is a query builder, not a live scraper — it assembles a ready-to-run input config and previews the JSON output shape (name, headline, work history, education, recent posts, articles) right in the page. It does not fetch live results in your browser. It is just the fastest way to design your query and know what fields you will get back.

When you are ready to actually run extraction at scale, that config drops straight into the LinkedIn Profile Pro actor on Apify, which implements the cookie-free JSON-LD approach described above (social-preview UA + datacenter proxy, with residential fallback). It returns the parsed profile plus up to roughly ten recent posts and articles per profile, and it is free to start, then pay-as-you-go — the first handful of profiles per run cost nothing for testing, and you are not charged for duplicates or invalid slugs.

Disclosure: I built both the query builder and the Apify actor linked above.

Wrapping up

The durable lesson is that a good LinkedIn profile scraper is mostly an exercise in reading the structured data a public page already publishes — not in defeating LinkedIn — and in respecting a legal boundary that is narrower and more nuanced than the headlines suggest. Parse the JSON-LD defensively, handle both Person shapes, stay on genuinely public surfaces, never use fake logins, and keep the ToS and hiQ precedent in mind. Do that, and you have an enrichment pipeline that is both robust and defensible.

Sources: Ninth Circuit / CFAA analysis (Jenner & Block), hiQ settlement and breach-of-contract finding (Privacy World), LinkedIn API Terms of Use, schema.org ProfilePage, Google profile-page structured data.

The SEC EDGAR API: A Practical Guide to Free Filing Data in Python

Omar Eldeeb — Sat, 13 Jun 2026 16:31:52 +0000

The SEC EDGAR API is one of the best-kept secrets in financial data engineering: every mandatory disclosure filed by every U.S. public company, available as clean JSON, for free, with no API key. If you've ever paid for a "fundamentals" data vendor or scraped a brokerage page for a balance sheet, you've been working harder than you need to. The raw, authoritative source — quarterly revenue, insider trades, institutional holdings, 8-K event filings — is sitting on data.sec.gov waiting for an HTTP GET.

The catch is small but absolute, and it trips up almost everyone on their first request. Let's walk through how the API actually works, write a correct, runnable Python example, and cover the one rule that will get your IP blocked if you ignore it.

What "the SEC EDGAR API" actually is

There isn't a single endpoint. "The SEC EDGAR API" is really three free public services that work together:

The structured data API (data.sec.gov) — JSON endpoints for company submissions and XBRL financial facts.
Full-text search (efts.sec.gov) — a keyword search index over the text of every filing submitted since 2001, including exhibits.
The ticker map (company_tickers.json) — a small file that maps stock tickers and company names to the internal IDs the other two services require.

None of them require registration or an API key. All of them require one HTTP header. We'll get to that.

The CIK: EDGAR's primary key

EDGAR doesn't index companies by ticker. It uses a Central Index Key (CIK) — a unique integer assigned to every filer. Apple's CIK is 320193.

Two things bite people here:

You need to translate a ticker (AAPL) into a CIK before you can call most endpoints. That's what company_tickers.json is for.
In API URLs, the CIK must be zero-padded to exactly 10 digits. Apple's 320193 becomes CIK0000320193. Pass the un-padded number and you'll get a 404.

This is the single most common failure when getting started, so bake the padding into a helper and never think about it again.

The one rule: declare a User-Agent or get a 403

The SEC enforces a fair-access policy. Every request must include a User-Agent header that identifies who you are, and the policy asks for a contact — typically your name and email. Send a request without it, or with a generic library default, and EDGAR returns 403 Forbidden and may block your IP for roughly ten minutes.

I confirmed this the hard way while researching this article: an automated fetch of an SEC documentation page with no declared User-Agent came straight back as 403 Forbidden. That's not an edge case — it's the designed behavior.

This rule has a subtle, important consequence: a normal web browser cannot consume these endpoints directly. Browser JavaScript is forbidden by the Fetch spec from setting the User-Agent header — it's a "forbidden header name." So a pure in-browser tool physically cannot make a compliant request to data.sec.gov. Any browser-based EDGAR helper is therefore a query builder or preview — it constructs the right URL for you to run server-side — not a live in-browser fetcher. Keep that distinction in mind; it matters when you choose tooling later.

The other half of fair access is a rate limit of 10 requests per second per IP. Exceed it and you'll see 429 responses and, again, a temporary block. A simple time.sleep(0.1) between calls, or capping yourself a little lower at ~8/s, keeps you safely compliant.
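One way to make the ~8/s cap automatic is a small throttle object that you call before every request. The following is a minimal sketch, not part of any SEC tooling: the class name Throttle and its parameters are illustrative, and the clock and sleep functions are injectable only so the logic can be checked without real waiting.

```python
import time


class Throttle:
    """Enforce a minimum gap between calls (a simple client-side rate limit)."""

    def __init__(self, max_per_second=8.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock   # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None     # time of the previous call, None before the first

    def wait(self):
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Create one Throttle per process and call wait() immediately before each requests.get(). Because the limit is described as per IP, several processes sharing one address would each need a lower cap.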

A correct, runnable Python example

Here's an end-to-end script: resolve a ticker to a CIK, zero-pad it, and pull a specific financial concept (annual revenue) from the XBRL companyconcept endpoint. It needs only the third-party requests library (pip install requests) and follows every fair-access rule.

import time
import requests

# Identify yourself. The SEC fair-access policy requires a descriptive
# User-Agent with a contact. Use your real app name + email.
HEADERS = {"User-Agent": "edgar-demo/1.0 (you@example.com)"}

def get_ticker_cik_map():
    """Download the official ticker -> CIK map."""
    url = "https://www.sec.gov/files/company_tickers.json"
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    # Keys are arbitrary indices; each value has cik_str, ticker, title.
    return {row["ticker"].upper(): row["cik_str"] for row in resp.json().values()}

def cik_padded(cik_int):
    """EDGAR requires the CIK zero-padded to 10 digits."""
    return f"CIK{int(cik_int):010d}"

def get_concept(cik_int, concept, taxonomy="us-gaap"):
    """Fetch one XBRL concept (e.g. Revenues) for a company."""
    url = (
        f"https://data.sec.gov/api/xbrl/companyconcept/"
        f"{cik_padded(cik_int)}/{taxonomy}/{concept}.json"
    )
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    tickers = get_ticker_cik_map()
    cik = tickers["AAPL"]
    print(f"AAPL CIK: {cik} -> {cik_padded(cik)}")

    time.sleep(0.1)  # stay under 10 req/s

    data = get_concept(cik, "RevenueFromContractWithCustomerExcludingAssessedTax")

    # Print annual (10-K) USD figures. Each item in the list is one reported fact.
    for fact in data["units"]["USD"]:
        if fact.get("form") == "10-K" and fact.get("fp") == "FY":
            print(fact["fy"], fact.get("frame", ""), f"${fact['val']:,}")

Two things to notice. First, the User-Agent is doing real work — remove it and every call 403s. Second, XBRL concepts are specific: revenue under modern US-GAAP is usually tagged RevenueFromContractWithCustomerExcludingAssessedTax, not a friendly Revenue. Discovering the right tag for each company is part of the job.
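Tag discovery can be partly automated by trying a short list of candidate tags against a company's full fact set. The sketch below works on already-downloaded JSON from the companyfacts endpoint (data.sec.gov/api/xbrl/companyfacts/). It assumes a facts, then taxonomy, then tag, then units nesting in that response. The candidate list and the function name first_reported_tag are illustrative choices, not an official mapping.

```python
# Candidate revenue tags, newest convention first. These are us-gaap element
# names, but the list is illustrative and not exhaustive.
REVENUE_TAGS = [
    "RevenueFromContractWithCustomerExcludingAssessedTax",
    "Revenues",
    "SalesRevenueNet",
]


def first_reported_tag(company_facts, candidates=REVENUE_TAGS,
                       taxonomy="us-gaap", unit="USD"):
    """Return the first candidate tag with at least one fact in `unit`, else None.

    `company_facts` is the parsed companyfacts JSON; the
    facts -> taxonomy -> tag -> units nesting is an assumption of this sketch.
    """
    concepts = company_facts.get("facts", {}).get(taxonomy, {})
    for tag in candidates:
        facts = concepts.get(tag, {}).get("units", {}).get(unit, [])
        if facts:
            return tag
    return None
```

A company that switched tags over the years may report under more than one candidate, so a fuller version would merge the facts from every matching tag rather than stop at the first.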

The other endpoints worth knowing

Once you're past the User-Agent requirement (there is no real authentication), the API surface is broad:

Submissions — https://data.sec.gov/submissions/CIK##########.json returns a company's filing history: every form type, accession number, and date. This is your entry point for "list all 10-Ks for this company."
Company facts — https://data.sec.gov/api/xbrl/companyfacts/CIK##########.json returns all XBRL facts for a company in one call. Heavy, but great for bulk extraction.
Frames — https://data.sec.gov/api/xbrl/frames/us-gaap/{CONCEPT}/{UNIT}/CY{YEAR}.json flips the axis: one concept across every company for a period. Perfect for cross-sectional analysis ("every filer's 2024 revenue").
Full-text search — https://efts.sec.gov/LATEST/search-index?q=... searches the text of all filings since 2001 by keyword, with filters for form type, date range, and entity. No key, same User-Agent rule.
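As an illustration of the "list all 10-Ks" entry point, here is a small sketch that filters an already-downloaded submissions response by form type. It assumes recent filings arrive as parallel arrays under filings.recent with form, accessionNumber, and filingDate keys. The function name list_filings and the output field names are hypothetical.

```python
def list_filings(submissions, form_type="10-K"):
    """Return one dict per filing of `form_type` found in the recent filings.

    Assumes the submissions JSON stores recent filings column-wise under
    filings.recent (form, accessionNumber, filingDate), one index per filing.
    """
    recent = submissions.get("filings", {}).get("recent", {})
    rows = zip(
        recent.get("form", []),
        recent.get("accessionNumber", []),
        recent.get("filingDate", []),
    )
    return [
        {"form": form, "accession": accession, "filed": filed}
        for form, accession, filed in rows
        if form == form_type
    ]
```

Zipping the columns keeps each filing's fields aligned by index. If one of the arrays were missing, zip would simply produce no rows, which fails quietly rather than raising, so a stricter version might validate the lengths first.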

Where it gets hard (and where a tool helps)

The endpoints are free and well-documented, but turning them into a usable dataset is more work than a single GET. Real projects hit:

XBRL tag archaeology — companies use different, sometimes deprecated, tags for the same concept across years.
Form-specific parsing — Form 4 (insider trades), Form 13F (institutional holdings), and 8-K item codes each have their own nested structures and quirks.
Pagination, rate-limit backoff, and ticker resolution plumbing you rewrite on every project.
The browser problem — you can't prototype a live query from a web UI because of the User-Agent restriction.

If you want to design a query before you write the plumbing, a free SEC EDGAR query builder lets you assemble the right endpoint and parameters and preview the request shape. Because of the User-Agent rule above, it builds and previews the query — it does not execute a live fetch in your browser; you run the generated request server-side.

When you'd rather skip the plumbing entirely, the SEC EDGAR Scraper actor handles the compliant-User-Agent requests, rate limiting, and parsing for you. It exposes nine modes — filings, normalized financials, raw XBRL facts, full-text search, Form 4 insider trades, Form 13F holdings, activist (SC 13D/G) stakes, a latest-filings feed, and parsed 8-K items — with ticker-to-CIK resolution built in and output as JSON, CSV, Excel, or XML. It's free to start (the first 50 chargeable events per run are free), then pay-as-you-go.

The takeaway

The SEC EDGAR API gives you institutional-grade financial data for the price of a well-formed HTTP header. Remember the three rules — declare a User-Agent, zero-pad your CIK to 10 digits, and stay under 10 requests per second — and the entire corpus of U.S. public-company disclosures is yours to query. Start with the company_tickers.json map, graduate to companyconcept for targeted facts or frames for cross-sectional pulls, and reach for the full-text index when you need to find filings by what they say, not just who filed them.

Disclosure: I'm the author of the SEC EDGAR Scraper actor and the linked query builder.

Sources: SEC: Accessing EDGAR Data, SEC: EDGAR APIs, SEC: EDGAR Full Text Search FAQ.

The TikTok Ad Library API: A Developer's Honest Guide to the DSA Commercial Content Library

Omar Eldeeb — Sat, 13 Jun 2026 16:30:29 +0000

If you have ever tried to find a clean, documented TikTok ad library API, you have probably hit a wall of marketing pages, half-answers, and tools that promise "global TikTok ads" without telling you what is actually inside. This guide cuts through that. I will explain exactly what TikTok exposes, what it does not, where the data comes from, and how to query it programmatically without guessing.

The short version: there is a real, public TikTok ad transparency database, but its scope is narrower than most people assume, and there are two very different access paths with very different rules.

What the TikTok Ad Library actually is

TikTok runs a Commercial Content Library at library.tiktok.com. It exists because of the EU's Digital Services Act (DSA), which requires very large online platforms to keep a searchable, public archive of the advertising they serve. So this is not a marketing feature TikTok built for fun — it is a regulatory obligation.

That regulatory origin shapes everything about it, including the single most important fact you need before you write a line of code:

US ads are not in this library. The Commercial Content Library covers ads shown to users in the EU/EEA, plus the UK, Switzerland, and Türkiye. It does not cover the United States, Brazil, India, Mexico, Canada, Japan, or Australia.

In practice the supported set is the 27 EU member states, the three additional EEA countries (Iceland, Liechtenstein, Norway), the UK, Switzerland, and Türkiye — 33 regions in total. If you query an unsupported country, you get an HTTP 400, not an empty result. I have seen plenty of teams burn a sprint building "US TikTok ad monitoring" on top of this data before discovering the US simply is not there. Don't be that team.
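A cheap guard against the HTTP 400 is to validate the region locally before any request goes out. The sketch below hardcodes the 33 regions as ISO 3166-1 alpha-2 codes. That the library uses exactly these codes (for example GB rather than UK) is an assumption here, and the library's own published region list should be treated as authoritative. The name check_region is illustrative.

```python
# ISO 3166-1 alpha-2 codes; assumed to match what the library expects.
EU_MEMBER_STATES = {
    "AT", "BE", "BG", "HR", "CY", "CZ", "DK", "EE", "FI", "FR", "DE", "GR",
    "HU", "IE", "IT", "LV", "LT", "LU", "MT", "NL", "PL", "PT", "RO", "SK",
    "SI", "ES", "SE",
}
EEA_NON_EU = {"IS", "LI", "NO"}
OTHER_COVERED = {"GB", "CH", "TR"}

SUPPORTED_REGIONS = EU_MEMBER_STATES | EEA_NON_EU | OTHER_COVERED


def check_region(code):
    """Return the normalized region code, or raise before any request is sent."""
    normalized = code.strip().upper()
    if normalized not in SUPPORTED_REGIONS:
        raise ValueError(
            f"{normalized!r} is not covered by the Commercial Content Library"
        )
    return normalized
```

Failing locally with a clear message is easier to debug than a bare 400 from the server, especially in a batch job that loops over many countries.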

(Separately, TikTok also runs the Creative Center, a global "top ads" showcase at ads.tiktok.com/business/creativecenter. That is a curated highlight reel, often login-gated, and is a different surface from the DSA library. Keep the two straight — people conflate them constantly.)

The two access paths

There are two ways to programmatically reach Commercial Content Library data, and confusing them is the root of most "the TikTok ad library API doesn't work" complaints.

1. The official Commercial Content API (gated)

TikTok publishes an official Commercial Content API under developers.tiktok.com. It is OAuth-based and free, but it is gated. Per TikTok's own documentation, eligibility is limited to qualifying academic institutions and non-profit researchers in the US, EEA, UK, and Switzerland (plus certain Brazilian researchers studying youth safety). Commercial users, creators, and advertisers are explicitly ineligible. Approved applications get a client key and are tightly rate-limited under a non-commercial-use commitment — TikTok's Research Tools documentation cites a ceiling on the order of 1,000 requests per day, and the Commercial Content endpoints additionally cap a single call at roughly 50 ads, so high-volume pulls mean a lot of paginated calls. TikTok says you typically hear back on a Commercial Content API application within about two working days.

So if you are an academic, this is your path. If you are building anything commercial, you are not eligible — and that is by design, not an oversight.

2. The public library's JSON endpoints

The public-facing library at library.tiktok.com is a normal web app that talks to a JSON backend. Because the library itself is public to everyone regardless of location, those read endpoints are reachable without OAuth. This is the path that powers most third-party tooling for the DSA library.

I want to be precise and honest here: this is the public library data, the same archive any person can browse in their own browser. It is not a private feed, and it is rate-limited at scale. Below is the real request shape, which I verified directly against the live endpoints rather than copying from docs.

A working request

The flow is: discover supported regions, then POST a search scoped to one region and a time window. Time bounds are mandatory and expressed in Unix seconds. Here is a runnable Node example (Node 18+ has fetch built in):

const BASE = "https://library.tiktok.com";

// 1) Supported regions (cache this ~24h)
async function getRegions() {
  const res = await fetch(`${BASE}/api/v1/support-regions`);
  const json = await res.json();
  return json.region_list; // [{ region: "DE", name: "Germany" }, ...]
}

// 2) Keyword search, scoped to ONE region + a time window
async function searchAds({ query, region = "DE", days = 30 }) {
  const end = Math.floor(Date.now() / 1000);
  const start = end - days * 24 * 60 * 60;
  const url =
    `${BASE}/api/v1/search` +
    `?region=${region}&type=1&start_time=${start}&end_time=${end}`;

  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      query,
      query_type: "3",          // STRING. 1=All, 2=AdvName, 3=Keyword
      order: "last_shown_date,desc",
      offset: 0,
      limit: 12,                // server caps page size at 12 regardless
    }),
  });

  // Rate-limit quirk: a soft limit returns HTTP 200 with a PLAIN-TEXT
  // body ("limit exceed"), so res.ok is true but JSON.parse throws.
  const text = await res.text();
  if (/limit\s*exceed|too\s*many/i.test(text)) {
    throw new Error("rate-limited (soft 429) — back off and retry");
  }
  const json = JSON.parse(text);
  return json; // { code: 0, data: [...ads...], total, has_more, search_id }
}

(async () => {
  const data = await searchAds({ query: "skincare", region: "FR", days: 14 });
  console.log(`total=${data.total}, first page=${data.data.length}`);
})();

A few things that will save you hours, all learned the hard way:

query_type is a string ("1"/"2"/"3"), not an integer, and several response fields are typed as strings too, so don't assume numbers.
Page size is server-capped at 12, no matter what limit you send. Paginate with offset plus the search_id cursor from the previous response.
region=ALL is rejected. One ISO region code per call.
The response is flat: data is the array of ads, and total / has_more / search_id sit at the top level.
A soft rate limit returns HTTP 200 with a plain-text body, not JSON. Sniff the body before JSON.parse and treat that as a retryable 429 with exponential backoff.
Video creative URLs are signed and expire (roughly 24h), so store a fetched_at timestamp next to any media URL you keep.
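The pagination and soft-rate-limit points above can be combined into one loop. This is a sketch under stated assumptions: fetch_page is a hypothetical callable that performs the POST and returns the raw response text, and how search_id is sent back to the server is left to that callable because the exact request field is not specified here. The names SoftRateLimit, parse_body, and iter_ads are illustrative.

```python
import json
import re

# Same pattern the body sniffing above looks for.
SOFT_LIMIT = re.compile(r"limit\s*exceed|too\s*many", re.IGNORECASE)


class SoftRateLimit(Exception):
    """An HTTP 200 response whose body is a plain-text rate-limit message."""


def parse_body(text):
    """Decode JSON; if that fails on a rate-limit message, raise SoftRateLimit."""
    try:
        return json.loads(text)
    except ValueError:
        if SOFT_LIMIT.search(text):
            raise SoftRateLimit(text.strip())
        raise


def iter_ads(fetch_page, max_pages=50):
    """Yield ads across pages until has_more is false or a page comes back empty.

    fetch_page(offset, search_id) is hypothetical: it sends the search and
    returns the response body as text.
    """
    offset, search_id = 0, None
    for _ in range(max_pages):
        page = parse_body(fetch_page(offset, search_id))
        ads = page.get("data") or []
        yield from ads
        if not page.get("has_more") or not ads:
            break
        offset += len(ads)                 # advance by what was actually returned
        search_id = page.get("search_id")  # cursor from the previous response
```

Trying the JSON decode first and sniffing only on failure avoids a false alarm when an ad's own text happens to contain "too many". A caller would catch SoftRateLimit, sleep with exponential backoff, and retry the same offset.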

What you get per ad

This is where the DSA library is genuinely interesting. Because the law mandates targeting transparency, each ad detail record exposes far more than a creative thumbnail. Across the search and per-ad detail endpoints you can assemble roughly 32 fields per ad: the creative URLs, advertiser identity and registered business location, the sponsor/payer, first- and last-shown dates, and — the valuable part — audience targeting and reach broken down by region, age bracket, and gender.

That demographic breakdown is richer than the comparable Meta Ad Library, which outside the EU only exposes reach and spend data for political and social-issue ads (inside the EU, the DSA forces broader disclosure). On TikTok's DSA library, the targeting tree is available for commercial ads broadly. If you are doing competitive ad intelligence or studying how a brand splits spend across age and gender in different EU markets, that tree is the whole point.

A faster way to explore before you build

Hand-building region codes, Unix timestamps, and query_type values gets tedious when you are just trying to see whether a brand or keyword has any coverage. I maintain a small free query-builder, TikTok Ad Library Search, that lets you assemble a valid query — keywords plus one of the 33 supported regions — and preview the request you would send. To be clear about what it is: it is a query builder and previewer, not a live in-browser scraper, so it helps you get the parameters right before you run anything.

When you are ready to actually pull ads at volume — paginating past the 12-per-page cap, enriching each ad with its full targeting tree, and handling the soft-rate-limit and signed-URL quirks above — that is what my Apify actor, TikTok Ad Library Pro, automates. It takes keywords, advertiser names, or business IDs, returns the ~32-field records described above across all 33 DSA regions, and is free to start, then pay-as-you-go (the first chargeable events on each run are free, so you can validate it against your own use case before committing).

Disclosure: I built both the free query-builder and the Apify actor linked above.

The honest bottom line

The TikTok ad library API is real and, for the DSA region, surprisingly rich — but it is an EU-transparency tool, not a global ad-spying firehose. Internalize three things and you will avoid every common trap: the data is EU/EEA + UK/CH/TR only (no US), the official API is gated to non-commercial researchers, and the public library's endpoints are usable but rate-limited and full of small typing quirks (string enums, 12-per-page caps, plain-text rate-limit bodies, mandatory Unix-second time bounds). Build with those constraints in mind and the targeting data you get back is some of the most detailed ad transparency available anywhere.

Building a Facebook Ad Library Scraper: API Limits and the Real Approach

Omar Eldeeb — Sat, 13 Jun 2026 16:30:25 +0000

If you want to pull a competitor's running ads programmatically, building a Facebook ad library scraper sounds like it should be a solved problem. Meta has a public Ad Library and an official API, so surely you just grab a token and query? Not quite. The gap between what the official API covers and what most people actually need is the single most expensive misunderstanding in this space, and it sends a lot of developers down a dead end on day one.

This post walks through what's real: where the data lives, exactly what the official API will and won't give you, what the data shape looks like, and what it actually takes to extract commercial ads at scale.

Two different things: the Library vs. the API

The Meta/Facebook Ad Library is a public, browser-accessible database of ads. You can open https://www.facebook.com/ads/library/, pick a country from the dropdown, choose "All ads" for general commercial advertising, type in an advertiser name or keyword, and results load immediately. No login, no account required for commercial ads. For each ad you can see the creative (image, video, or carousel), the primary text and headline, the call-to-action, the advertiser's Page name, which platforms it runs on (Facebook, Instagram, Messenger, Audience Network), its start date, and active/inactive status — including the multiple variations a brand is split-testing at once. It's a genuinely rich competitive-intelligence surface.

The Meta Ad Library API is a separate, gated product — and this is where expectations break.

The API only covers political and issue ads

Here's the fact that isn't obvious until you've already spent an afternoon on it: the official Ad Library API is scoped to ads about social issues, elections, or politics, plus ads delivered to the EU and associated territories. General commercial / "All ads" content is not queryable through the API. The public website lets you browse commercial ads; the API does not let you pull them.

On top of the scope limit, getting access is a process:

Identity verification. You confirm your identity and location at facebook.com/ID, uploading a government ID (passport, national ID, or driver's license) and confirming your country of residence. Approval typically takes one to three business days.
A Meta for Developers app. Once verified, you create an app and add the "Ad Library API" product.
Tokens and permissions. You issue an access token with the appropriate scopes (ads_read, and for the archive, ads_archive).

Worth noting for anyone targeting Europe: as of October 6, 2025, Meta no longer permits political, electoral, or social-issue ads in the EU at all. So the API's "EU-delivered ads" coverage now effectively means the historical archive of those ads — not new ones going forward.

So if your verified token does clear all those hoops and you're researching, say, election spending, a call looks like this:

import requests

# Official Meta Ad Library API — POLITICAL / ISSUE ads ONLY.
# Commercial "All ads" are NOT available through this endpoint.
TOKEN = "YOUR_VERIFIED_ACCESS_TOKEN"

params = {
    "access_token": TOKEN,
    "search_terms": "climate",
    "ad_reached_countries": "['US']",
    "ad_type": "POLITICAL_AND_ISSUE_ADS",  # the only broadly supported type
    "ad_active_status": "ALL",
    "fields": ",".join([
        "id", "page_name", "ad_creative_bodies",
        "ad_delivery_start_time", "publisher_platforms",
        "impressions", "spend",
    ]),
    "limit": 50,
}

resp = requests.get(
    "https://graph.facebook.com/v25.0/ads_archive",  # use the current Graph API version
    params=params,
    timeout=30,
)
resp.raise_for_status()
for ad in resp.json().get("data", []):
    print(ad["id"], ad.get("page_name"), ad.get("ad_delivery_start_time"))

That's the entire official path — and it's a fine path for political-transparency research. But if you're doing competitor analysis, e-commerce product research, or creative inspiration, none of those ads are political, so the API returns nothing useful to you.

The commercial use case = scraping the public Library

When people search for a "facebook ad library scraper," they almost always mean the commercial case: "show me every active ad this brand is running, with the creatives and copy." Since the API doesn't serve that, the only route is extracting it from the public Library website. And the public Library is built to resist exactly that.

What you run into, in roughly the order you'll hit it:

It's a JavaScript application. The ads aren't in the initial HTML. A plain requests.get() returns a shell; you need a real browser engine (Playwright/Puppeteer) that executes JS and lets the results render.
Fingerprint and handshake checks. Meta inspects the TLS handshake, the HTTP/2 settings frame, and the browser fingerprint before serving content. A default headless Chromium gets flagged on the very first navigation, and naive HTTP clients in the got-scraping class get challenged for the same reason.
IP reputation and rate limiting. Requests from datacenter IPs or repetitive patterns get throttled or blocked quickly. Rotating residential proxies are typically required so traffic blends in with organic users.
Shifting selectors. Meta restructures the layout and renames element classes regularly, so brittle CSS selectors break without warning. Extraction logic has to be defensive.

None of this is impossible — it's just real engineering with ongoing maintenance, not a weekend script. Build it yourself and you're signing up to babysit a headless-browser fleet, a proxy budget, and a parser that breaks every time Meta ships a redesign.

What the extracted data actually looks like

Whether you build it or buy it, here's a realistic shape for one commercial ad pulled from the public Library. Designing your downstream code against this shape early saves a lot of rework:

{
  "ad_archive_id": "1234567890123456",
  "page_name": "Acme Outdoor Co.",
  "page_id": "100064123456789",
  "ad_creative": {
    "title": "Built for the Trail",
    "body": "Our lightest pack yet. Free shipping this week only.",
    "cta_text": "Shop Now",
    "link_url": "https://acmeoutdoor.example/packs",
    "images": ["https://scontent.example/ad_img_01.jpg"],
    "videos": []
  },
  "publisher_platforms": ["FACEBOOK", "INSTAGRAM"],
  "ad_delivery_start_time": "2026-05-28",
  "ad_delivery_stop_time": null,
  "is_active": true,
  "ad_snapshot_url": "https://www.facebook.com/ads/library/?id=1234567890123456",
  "country": "US"
}

Note what's present here that the political API doesn't expose for commercial advertisers — the creative assets, CTA, and destination URL — and what's absent: there are no impressions or spend ranges. Those metrics are only published for political/issue ads. For commercial ads, you get creative and delivery metadata, not spend. Knowing that boundary keeps you from promising a stakeholder numbers that don't exist.
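To design downstream code against that shape, it can help to pin it down as a typed record. The following is a minimal sketch using the field names from the sample above. The class name CommercialAd, the flattened attribute names, and the helper ad_from_record are illustrative, and a real extractor's output may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CommercialAd:
    """One commercial ad. There are deliberately no impressions or spend
    fields, because those are only published for political/issue ads."""
    ad_archive_id: str
    page_name: Optional[str]
    page_id: Optional[str]
    title: Optional[str] = None
    body: Optional[str] = None
    cta_text: Optional[str] = None
    link_url: Optional[str] = None
    images: List[str] = field(default_factory=list)
    videos: List[str] = field(default_factory=list)
    platforms: List[str] = field(default_factory=list)
    start_date: Optional[str] = None
    stop_date: Optional[str] = None   # None while the ad is still running
    is_active: bool = False
    country: Optional[str] = None


def ad_from_record(record):
    """Build a CommercialAd from one dict shaped like the sample record."""
    creative = record.get("ad_creative") or {}
    return CommercialAd(
        ad_archive_id=str(record["ad_archive_id"]),  # the only required key
        page_name=record.get("page_name"),
        page_id=record.get("page_id"),
        title=creative.get("title"),
        body=creative.get("body"),
        cta_text=creative.get("cta_text"),
        link_url=creative.get("link_url"),
        images=list(creative.get("images") or []),
        videos=list(creative.get("videos") or []),
        platforms=list(record.get("publisher_platforms") or []),
        start_date=record.get("ad_delivery_start_time"),
        stop_date=record.get("ad_delivery_stop_time"),
        is_active=bool(record.get("is_active")),
        country=record.get("country"),
    )
```

Only ad_archive_id is treated as required. Everything else falls back to None or an empty list, so a record with a missing creative block still parses instead of failing the batch.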

A faster path: query builder + a hosted scraper

If you'd rather not hand-roll the browser-and-proxy stack, two tools shorten the loop. I work on these, so treat this as a disclosure, not a neutral review.

To get the request right before you write any code, the free Facebook Ad Library search builder lets you assemble a search config — keyword, advertiser, country, filters — and preview the output shape you'll get back. It's a query builder: it constructs the configuration and shows you the structure, not a live in-browser scrape (Meta isn't CORS-open, so no browser-side tool can fetch results directly). It's a quick way to nail down your parameters and field expectations up front.

When you're ready to actually pull data, Facebook Ad Library Pro runs the extraction on the Apify platform — search by keyword, advertiser, or country, and get ad creatives, text, platforms, and dates, plus deeper ad-detail scraping, with the headless browser, proxy rotation, and parser maintenance handled for you. It's free to start, then pay-as-you-go through Apify platform credits, so you can validate it against a real competitor before committing budget.

The takeaway

For a Facebook ad library scraper, draw the line clearly: the official Meta Ad Library API is real but narrow, with political and issue ads, ID-verified access, and no commercial coverage. The broad competitor-research use case lives in the public Library, which means JavaScript rendering, fingerprinting, proxies, and shifting selectors. Decide which side of that line your project sits on before you write code, design against the actual data shape (creatives yes, commercial spend no), and you'll skip the most common multi-day detour in this whole space.

App Store Top Charts API: Free, Key-Free, and CORS-Open

Omar Eldeeb — Mon, 01 Jun 2026 08:43:43 +0000

If you've ever wanted an App Store top charts API that you can hit straight from a browser tab (no API key, no OAuth dance, no server proxy), there's good news. Apple still serves a legacy iTunes RSS feed that returns the App Store top charts as plain JSON, and (the part most people miss) it's CORS-open. That means a single fetch() from client-side JavaScript works. No backend required.

This post walks through exactly how the endpoint is shaped, what the JSON looks like, the limits you'll hit, and where it stops being usable from the browser. Everything here is verified against the live feed, with a runnable example you can paste into your console right now.

The endpoint

The pattern is:

https://itunes.apple.com/{cc}/rss/{chart}/limit={N}/json

Three pieces matter:

{cc} — a two-letter country code (us, gb, jp, de, br, …). Charts are per-country, so the US top free list is often very different from Japan's.
{chart} — one of three values:
- topfreeapplications — ranked by download velocity (free apps)
- toppaidapplications — ranked by download velocity (paid apps)
- topgrossingapplications — ranked by revenue, which includes in-app purchases (this is why a free-to-download game with aggressive IAP can top grossing while sitting far down the free chart)
{N} — how many entries you want, e.g. limit=100.

So the US top free applications, top 100, is:

https://itunes.apple.com/us/rss/topfreeapplications/limit=100/json

That's the whole API. No registration.
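If you'd rather not hand-assemble that string, here is a minimal sketch of a URL builder for the pattern above. The helper name buildChartUrl and its validation rules are my own illustration, not something Apple provides; the only facts it encodes are the path shape and the three chart names listed earlier.

```javascript
// Hypothetical helper: builds a chart URL from the three pieces described above.
const CHARTS = [
  "topfreeapplications",
  "toppaidapplications",
  "topgrossingapplications",
];

function buildChartUrl(country, chart, limit) {
  // The country must look like a two-letter code (us, gb, jp, ...).
  if (!/^[a-z]{2}$/i.test(country)) {
    throw new RangeError(`Expected a two-letter country code, got "${country}"`);
  }
  if (!CHARTS.includes(chart)) {
    throw new RangeError(`Unknown chart "${chart}"; expected one of ${CHARTS.join(", ")}`);
  }
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError(`limit must be a positive integer, got ${limit}`);
  }
  return `https://itunes.apple.com/${country.toLowerCase()}/rss/${chart}/limit=${limit}/json`;
}

console.log(buildChartUrl("us", "topfreeapplications", 100));
```

Failing early on a typo in the chart name is the main point here: a bad path otherwise only shows up later as a non-200 response from the feed.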

Why this works from a browser

The thing that makes this endpoint special for front-end developers is that itunes.apple.com returns a permissive Access-Control-Allow-Origin header on these RSS responses. Your browser won't block the cross-origin read. You can build a chart widget, a dashboard, or a quick research tool entirely client-side.

Here's a real, runnable example. Drop it into your browser console or a <script> tag and it logs a table of results as soon as the request resolves:

async function getTopCharts(country = "us", chart = "topfreeapplications", limit = 25) {
  const url = `https://itunes.apple.com/${country}/rss/${chart}/limit=${limit}/json`;
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Apple RSS responded ${res.status}`);

  const data = await res.json();
  const entries = data.feed.entry ?? [];

  return entries.map((entry, i) => ({
    rank: i + 1,
    name: entry["im:name"].label,
    developer: entry["im:artist"].label,
    category: entry.category?.attributes?.label ?? null,
    url: entry.link?.attributes?.href ?? null,
  }));
}

// Top 10 free apps in the US App Store
getTopCharts("us", "topfreeapplications", 10).then(console.table);

And the equivalent with curl and jq, for shell and CI use:

curl -s "https://itunes.apple.com/us/rss/topfreeapplications/limit=10/json" \
  | jq '.feed.entry[] | {name: .["im:name"].label, dev: .["im:artist"].label}'

The JSON shape

The response is a single object with one top-level feed key. The chart itself lives in feed.entry, an array where each element is one ranked app. Position in the array is the rank — index 0 is #1.

Each entry I pulled from the live feed contains these fields:

im:name — the app name (read .label)
im:artist — the developer/publisher (read .label; it may also carry a developer URL in attributes.href)
category — genre, with the human-readable name under attributes.label and the genre ID under attributes.im:id
link — the App Store URL under attributes.href
id — the canonical app store id, including the numeric im:id attribute
im:image — usually three sizes of icon
im:price — formatted price plus an amount/currency attribute pair
summary, rights, title, im:contentType

The slightly awkward part is Apple's namespaced keys (im:name, im:artist) and the consistent { label, attributes } wrapper on almost every field. Once you internalize "the value I want is usually under .label, and the metadata is under .attributes," parsing is trivial. The mapping function above handles it.
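To pull the remaining fields (ID, icon, price), a fuller parser might look like the sketch below. The sample entry is hand-written to match the shape described above, not copied from the feed. The height attribute on images, the smallest-to-largest icon order, and the "lone entry arrives as a bare object" case are assumptions on my part that you should check against a live response.

```javascript
// Hand-written sample in the shape described above; values are illustrative only.
const sampleEntry = {
  "im:name": { label: "Example App" },
  "im:artist": { label: "Example Studio", attributes: { href: "https://example.com/dev" } },
  category: { attributes: { "im:id": "6014", label: "Games" } },
  link: { attributes: { href: "https://example.com/app" } },
  id: { label: "https://example.com/app", attributes: { "im:id": "123456789" } },
  "im:image": [
    { label: "https://example.com/icon-53.png", attributes: { height: "53" } },
    { label: "https://example.com/icon-100.png", attributes: { height: "100" } },
  ],
  "im:price": { label: "Get", attributes: { amount: "0.00", currency: "USD" } },
};

// Assumption: a single entry may come back as an object instead of an array.
function toEntryArray(entry) {
  if (entry == null) return [];
  return Array.isArray(entry) ? entry : [entry];
}

function parseEntry(entry, index) {
  const images = toEntryArray(entry["im:image"]);
  const price = entry["im:price"];
  return {
    rank: index + 1, // position in the array is the rank
    appId: entry.id?.attributes?.["im:id"] ?? null,
    name: entry["im:name"]?.label ?? null,
    developer: entry["im:artist"]?.label ?? null,
    category: entry.category?.attributes?.label ?? null,
    // Assumes icons are listed smallest to largest, so the last one is the biggest.
    icon: images.length ? images[images.length - 1].label : null,
    price: price
      ? {
          display: price.label,
          amount: Number(price.attributes?.amount),
          currency: price.attributes?.currency ?? null,
        }
      : null,
  };
}

console.log(toEntryArray(sampleEntry).map(parseEntry));
```

With real data you would pass data.feed.entry through toEntryArray before mapping, so the parser behaves the same whether the feed returns one app or a hundred.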

The limits (so you don't get surprised)

A few honest constraints worth knowing before you build on this:

The feed caps at 100 entries per chart. You can ask for limit=200, but you'll get at most 100 back. There is no offset/pagination parameter to walk deeper into the rankings. If you need rank 101+, this feed can't give it to you.
It's per-country, one country per request. Want the top charts for 30 markets? That's 30 requests. There's no "all countries" call.
This simple path gives you overall charts only. The three chart types above are the clean, reliable ones; category-scoped charts exist on Apple's side but aren't a first-class part of it.
It's the legacy feed. Apple has a newer marketing-tools feed (more on that next), and while the iTunes RSS endpoint has been stable for years, it's not formally a "supported product." Treat it as best-effort.
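The first two limits can be made explicit in code. Below is a rough sketch that clamps the requested limit to the 100-entry cap and fans out one request per country. The function names and the result shape are hypothetical, and I'm using Promise.allSettled so that one failing market does not sink the whole batch.

```javascript
const FEED_MAX = 100; // per-chart cap described above

// Hypothetical planner: one URL per country, with the limit clamped to the cap.
function planChartRequests(countries, chart, limit) {
  const capped = Math.min(Math.max(1, Math.trunc(limit)), FEED_MAX);
  return countries.map((cc) => ({
    country: cc,
    url: `https://itunes.apple.com/${cc}/rss/${chart}/limit=${capped}/json`,
  }));
}

// Fetch every planned URL and report success or failure per country.
async function fetchCharts(plan, fetchImpl = fetch) {
  const settled = await Promise.allSettled(
    plan.map(async ({ url }) => {
      const res = await fetchImpl(url);
      if (!res.ok) throw new Error(`Apple RSS responded ${res.status}`);
      return res.json();
    })
  );
  return plan.map(({ country }, i) => ({
    country,
    ok: settled[i].status === "fulfilled",
    data: settled[i].status === "fulfilled" ? settled[i].value : null,
    error: settled[i].status === "rejected" ? String(settled[i].reason) : null,
  }));
}

// Asking for 200 still produces limit=100 URLs, one per market.
const plan = planChartRequests(["us", "gb", "jp"], "topfreeapplications", 200);
console.log(plan.map((p) => p.url));
```

Calling fetchCharts(plan) then returns one result object per country, in the same order as the input list.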

For a huge share of use cases — a "what's trending today" widget, competitor monitoring, a side-project leaderboard — 100 apps per chart per country is plenty.
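As one example of that kind of monitoring, here is a sketch that lines up the grossing chart against the free chart, which makes the "tops grossing but sits far down the free chart" pattern easy to spot. It relies on the array position being the rank and on the numeric ID under id.attributes["im:id"]; the function names and the fixtures are made up for illustration.

```javascript
// Build a Map of appId -> rank from a feed.entry array (index 0 is rank 1).
function rankIndex(entries) {
  const ranks = new Map();
  entries.forEach((entry, i) => {
    const id = entry.id?.attributes?.["im:id"];
    if (id) ranks.set(id, i + 1);
  });
  return ranks;
}

// For each app on the grossing chart, report its rank on the free chart (null if absent).
function compareCharts(grossingEntries, freeEntries) {
  const freeRanks = rankIndex(freeEntries);
  return grossingEntries.map((entry, i) => {
    const id = entry.id?.attributes?.["im:id"] ?? null;
    return {
      name: entry["im:name"]?.label ?? null,
      grossingRank: i + 1,
      freeRank: freeRanks.get(id) ?? null,
    };
  });
}

// Tiny hand-made fixtures; real input would be data.feed.entry from two fetches.
const mk = (id, name) => ({
  id: { attributes: { "im:id": id } },
  "im:name": { label: name },
});
const grossing = [mk("1", "Game A"), mk("2", "App B")];
const free = [mk("2", "App B"), mk("3", "App C"), mk("1", "Game A")];

console.table(compareCharts(grossing, free));
```

A null freeRank simply means the app is outside the slice of the free chart you fetched, which is also what you would see for a paid app.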

The newer feed: server-side only

You may run into Apple's newer endpoints at rss.marketingtools.apple.com (also referenced as applemarketingtools.com). These return similar top-charts data and are perfectly usable — but not from a browser. Those endpoints do not send permissive CORS headers, so a client-side fetch() to them will be blocked by the browser's same-origin policy.

So the rule of thumb is simple:

Browser / client-side code → use itunes.apple.com/{cc}/rss/... (CORS-open).
Server-side code (Node, Python, a cron job, a backend route) → either feed works, including the newer marketing-tools one.

Don't try to call rss.marketingtools.apple.com from front-end JavaScript and expect it to work; it won't, and the failure looks like a confusing CORS error rather than a clear message.
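If you want a clearer message than the browser gives you, one option is to wrap the call and translate the failure yourself. This is only a sketch: explainFetchError and fetchJson are names I made up, and the CORS diagnosis is a guess based on the hostname, since fetch() reports a CORS block as a plain TypeError with no further detail.

```javascript
// Hosts of the newer feed mentioned above, which is not CORS-open.
const SERVER_ONLY_HOSTS = ["marketingtools.apple.com", "applemarketingtools.com"];

// Hypothetical helper: turn an opaque fetch failure into a more useful message.
function explainFetchError(err, url) {
  const host = new URL(url).hostname;
  // fetch() rejects with a TypeError for network-level failures, including CORS blocks.
  if (err instanceof TypeError) {
    if (SERVER_ONLY_HOSTS.some((h) => host.endsWith(h))) {
      return `Request to ${host} failed, likely a CORS block: call this feed from a server, or use itunes.apple.com in the browser.`;
    }
    return `Request to ${host} failed at the network level (offline, DNS, or CORS).`;
  }
  return `Request to ${host} failed: ${err.message}`;
}

async function fetchJson(url) {
  try {
    const res = await fetch(url);
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    return await res.json();
  } catch (err) {
    // Keep the original error attached for debugging.
    throw new Error(explainFetchError(err, url), { cause: err });
  }
}
```

The same wrapper works unchanged on the server, where the CORS branch simply never triggers for these hosts.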

What about Google Play?

This is the other honest caveat. There is no equivalent CORS-open, key-free JSON feed for Google Play top charts. Play's charts aren't exposed as a browser-fetchable JSON endpoint the way Apple's RSS is, so any "Play top charts" lookup needs to run server-side (typically through a proxy or a scraping layer) rather than from the browser. If your project needs both stores, plan for an App Store-from-browser / Play-from-server split.

Try it live, then scale it up

If you just want to see the App Store top charts right now without writing a line of code, I built a free tool that runs this exact iTunes RSS feed live in your browser: datatooly.xyz/app-store-top-charts. Pick a country and a chart type and it fetches the real feed client-side — the same endpoint described above, no key, nothing fake. It's a good way to eyeball the data shape before you wire it into your own code.

When you outgrow the 100-app, one-country-at-a-time, no-history ceiling of the raw feed, the same data is available at scale through the App Store + Google Play Rank Tracker actor on Apify. It covers 150+ countries, all chart types plus category charts, per-app enrichment (ratings, reviews, screenshots), rank deltas with risers/fallers and a forecast, scheduled history so you can track movement over time, and JSON/CSV/API output — including Google Play, which (as noted) you can't reach from the browser. It's free to start, then pay-as-you-go.

Disclosure: I built both the free tool and the Apify actor.

TL;DR

Endpoint: https://itunes.apple.com/{cc}/rss/{topfree|toppaid|topgrossing}applications/limit={N}/json
CORS-open → works from a browser, no API key
Parse feed.entry[]; read names/devs under .label, and categories/links/IDs under .attributes
Caps at 100 per chart, one country per call, no deep pagination
Top Free/Paid = download velocity; Top Grossing = revenue incl. IAP
Use the newer rss.marketingtools.apple.com feed server-side only (not CORS-open)
Google Play has no browser-fetchable equivalent — proxy it server-side

Copy the fetch() snippet above and you'll have live App Store top charts in under a minute.