DEV Community: Ken-Mutisya

Two public directories that are quietly great B2B lead sources

Ken-Mutisya — Wed, 15 Jul 2026 19:41:43 +0000

Most teams buy lead lists. The lists are stale, half the emails bounce, and everyone in your niche bought the same file. Meanwhile some of the best lead sources sit in plain sight: public directories that publish structure, freshness signals and even contact routes, free to read for anyone willing to parse them.

I run a fleet of scrapers on Apify (a new one ships every few days, so whatever count I write here is already wrong) and the two directory actors I promoted this week are a good case study in what public data can do for outreach.

Source one: the Y Combinator startup directory. YC publishes every funded company with batch, industry, region, team size and hiring status. That is a prefiltered universe of funded buyers: each company passed a selective screen, raised money recently and is usually growing. The directory itself does not list emails, but almost every startup site does, so the actor filters the directory first and then visits each company website to scrape a contact address. One run returns rows like company, pitch, website, team size, batch and the email it found, with the rows that lack an email flagged honestly. Recruiters filter by hiring status, sellers filter by industry, investors watch specific batches.

Source two: podcast RSS feeds. Every podcast feed carries an owner field, and that field usually holds the exact email the host wants business mail sent to. Search the public podcast directory for a keyword, pull each show feed, read the owner email plus episode count and last episode date, and you have a list of active shows in a niche with reachable hosts. The freshness signal matters as much as the email: a show that has not published in two years wastes your pitch, so the actor surfaces the last episode date and lets you skip dead shows before writing to anyone.
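
Here is a minimal sketch of that feed read, assuming the feed text is already fetched and that regex parsing is good enough for a first pass (a real XML parser is safer). The names readPodcastFeed and isActive are hypothetical, and the one year cutoff is an arbitrary default, not what the actor uses:

```javascript
// Sketch: pull the owner email and newest episode date out of a podcast RSS string.
// Regex parsing is an assumption made for brevity; a real XML parser is more robust.
function readPodcastFeed(xml) {
  // <itunes:owner> wraps <itunes:email> in the podcast RSS namespace.
  const owner = xml.match(/<itunes:owner>([\s\S]*?)<\/itunes:owner>/i)?.[1] ?? '';
  const ownerEmail = owner.match(/<itunes:email>\s*([^<\s]+)\s*<\/itunes:email>/i)?.[1] ?? null;

  // Each episode is an <item>; keep the latest parseable <pubDate> among them.
  const items = xml.match(/<item[\s>][\s\S]*?<\/item>/gi) ?? [];
  const dates = items
    .map(item => item.match(/<pubDate>([^<]+)<\/pubDate>/i)?.[1])
    .map(text => (text ? Date.parse(text.trim()) : NaN))
    .filter(ms => !Number.isNaN(ms));

  const lastEpisodeAt = dates.length ? new Date(Math.max(...dates)).toISOString() : null;
  return { ownerEmail, episodeCount: items.length, lastEpisodeAt };
}

// Hypothetical cutoff helper: a show counts as dead after maxAgeDays without an episode.
function isActive(feed, now, maxAgeDays = 365) {
  if (!feed.lastEpisodeAt) return false;
  return now - Date.parse(feed.lastEpisodeAt) <= maxAgeDays * 86_400_000;
}
```

Passing now in explicitly keeps the freshness check testable and lets you rerun an old list against a fixed date.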

The pattern behind both is the same and it generalizes:

Find a directory that curates for you. Curation is the expensive part of lead generation, and YC (funded companies) or a podcast index (shows by topic) already did it.
Look for the freshness field. Hiring status, last episode date, latest batch: recency separates a lead from a museum piece.
Follow the entity to its own property for contact data. Company sites and RSS feeds publish contact routes on purpose, which makes them fair and reliable in a way harvested emails are not.
Return misses honestly. A row that says no email found is information, not failure, and it keeps your input list reconciled one to one.

Everything here is keyless HTTP against public endpoints, no browser automation and no login walls, which keeps runs fast and costs tiny. On pay per event pricing that matters: a lead costs a few cents only when the row actually delivers.

Both actors are live on Apify under the Scrapemint account as YC Startup Leads and Podcast Host Leads, priced per result with the first rows of every run free, and they sit alongside the rest of the fleet: an email checker, a phone checker, a VAT checker and more list hygiene tools that clean whatever the lead sources produce.

The bigger takeaway is that directories are underrated infrastructure. Everyone scrapes search results and social feeds, which are hostile and noisy, while directories are structured, consented and refreshed by their owners. What public directory in your niche is sitting there unparsed?

"Turning Google Maps Into a Clean Local Lead List, With Emails"

Ken-Mutisya — Sun, 12 Jul 2026 20:37:39 +0000

Every local sales motion starts with the same question: who are all the businesses of type X in city Y, and how do I reach them. Google Maps already knows the answer. The trick is getting it out as structured rows instead of a page you scroll forever.

What one place actually contains

A single Google Maps place is far richer than the pin suggests. For each business you can pull:

{
  "name": "Bright Smile Dental",
  "address": "123 Main St, Austin, TX 78701",
  "phone": "+1 512 555 0100",
  "website": "https://brightsmileaustin.com",
  "rating": 4.7,
  "reviews": 312,
  "hours": { "mon": "9-5", "tue": "9-5" },
  "priceRange": "$$"
}

Two things make this a lead list rather than a directory dump. First, the phone and website are on almost every business, so the rows are actually reachable. Second, the rating and review count let you rank by traction, so you call the busy, established places first.

The missing piece: email

Maps gives you a website but not an email, and email is what most outreach tools want. So the last step is a light enrichment pass: take the website from each place, fetch the home and contact pages, and pull the first email address plus any obvious contact name.

const html = await (await fetch(place.website)).text();
const email = html.match(/[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}/i)?.[0] ?? null;

Run that only for places that have a website and you convert a map of pins into a spreadsheet of name, phone, website, email, rating, and reviews, ready to import.
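
A bare regex will sometimes grab asset names such as logo@2x.png before it reaches a real address. One way to harden the pick, sketched under the assumption that mailto: links are the most trustworthy signal (pickEmail and the extension list are my own illustration, not the actor's exact logic):

```javascript
// Sketch: choose a contact email from page HTML, preferring explicit mailto: links.
// The asset-extension filter is an assumption; tune it for the sites you crawl.
const EMAIL_RE = /[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}/gi;
const ASSET_EXT_RE = /\.(png|jpe?g|gif|svg|webp|css|js)$/i;

function pickEmail(html) {
  // mailto: links are contact routes the site published on purpose.
  const mailto = html.match(/mailto:([a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,})/i)?.[1];
  if (mailto) return mailto.toLowerCase();

  // Otherwise take the first address that is not an asset name like logo@2x.png.
  const candidates = html.match(EMAIL_RE) ?? [];
  const email = candidates.find(c => !ASSET_EXT_RE.test(c));
  return email ? email.toLowerCase() : null;
}
```

Returning null instead of a guess keeps the "no email found" rows honest.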

Query or URL in, rows out

You drive it two ways: a search string like "dentists in Austin" to sweep a whole category in a city, or a specific place URL when you already know the business. Either way you get one clean row per place, and the enrichment is optional so you only pay for emails when you want them.

I packaged the whole flow, Maps parsing plus the website email enrichment, here: https://apify.com/scrapemint/google-maps-scraper

It is one of a set of lead tools I have been writing up one at a time.

"The Shopify App Store Hands You a B2B Lead List, for Free"

Ken-Mutisya — Sun, 12 Jul 2026 20:37:02 +0000

Shopify app developers are a captive market: agencies, tooling makers, and app store optimization services all want to reach them. It turns out the Shopify App Store publishes everything you need to build that lead list, no login and no API key, if you know where to look.

Step 1: the sitemap gives you every app handle

The App Store ships an XML sitemap. Fetch it and you get the canonical handle for every listed app.

GET https://apps.shopify.com/sitemap.xml

It points at child sitemaps, each of which lists app URLs like https://apps.shopify.com/klaviyo-email-marketing. Pull the last path segment and you have a stable id for every app on the platform. No pagination games, no scrolling.
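
A rough sketch of the handle extraction, assuming the child sitemap text is already downloaded and that app pages are the URLs with exactly one path segment (that filter is my assumption, and handlesFromSitemap is a hypothetical name):

```javascript
// Sketch: extract app handles from a child sitemap's XML text.
// Assumes <loc> entries of the form https://apps.shopify.com/<handle>.
function handlesFromSitemap(xml) {
  const handles = new Set();
  for (const [, loc] of xml.matchAll(/<loc>\s*([^<\s]+)\s*<\/loc>/g)) {
    let url;
    try { url = new URL(loc); } catch { continue; } // skip malformed entries
    if (url.hostname !== 'apps.shopify.com') continue;
    const segments = url.pathname.split('/').filter(Boolean);
    // Keep single-segment paths only; skip nested pages and sitemap files.
    if (segments.length === 1 && !segments[0].endsWith('.xml')) handles.add(segments[0]);
  }
  return [...handles];
}
```

The Set removes repeats, so the result can feed the detail page fetches directly.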

Step 2: each listing embeds the developer contact

Open any app detail page and the developer contact is right there in the HTML, in a data attribute the page uses for its own "contact" button:

data-developer-support-email="support@example.com"

Every Shopify app is required to publish a support email, so coverage is effectively total. Alongside it you get the developer name, the partner page, the company website, the rating, the review count, and the pricing tiers, all in the server rendered markup.

const html = await (await fetch(`https://apps.shopify.com/${handle}`)).text();
const email = html.match(/data-developer-support-email="([^"]+)"/)?.[1] ?? null;

Step 3: rank by traction

Because you also get rating and review counts, you can sort the whole list by how much traction each app has, so outreach starts with the developers who are clearly making money and reinvesting in tooling.

That is the whole recipe: one sitemap for the universe of apps, one regex per detail page for the reachable contact, and a sort by reviews. No keys, no proxy, no browser.

I packaged it as a keyword search so you can pull, say, every email marketing or upsell app maker with a public support email in one run. It lives here: https://apify.com/scrapemint/shopify-app-developer-leads

This is one of a batch of keyless lead sources I have been documenting one at a time.

"Every Federal Contract Award Is a Free JSON Row (GovCon Tools Charge $500/mo for This)"

Ken-Mutisya — Fri, 10 Jul 2026 14:12:16 +0000

GovCon market intelligence is one of the priciest alert categories in B2B SaaS: GovWin, HigherGov and friends run from hundreds to thousands per month. The primary source under all of them is USAspending.gov, and its API is public JSON with no key, no registration, and no meaningful rate limits. Here is the part that matters.

Search is one POST

POST https://api.usaspending.gov/api/v2/search/spending_by_award/
Content-Type: application/json

{
  "filters": {
    "time_period": [{
      "start_date": "2026-07-03",
      "end_date": "2026-07-10",
      "date_type": "date_signed"
    }],
    "award_type_codes": ["A", "B", "C", "D"],
    "keywords": ["software"],
    "award_amounts": [{ "lower_bound": 250000 }]
  },
  "fields": ["Award ID", "Recipient Name", "Award Amount", "Description",
             "Awarding Agency", "NAICS", "PSC", "recipient_id"],
  "limit": 100, "page": 1,
  "sort": "Award Amount", "order": "desc"
}

Filters compose: naics_codes (prefixes work, 54 = all professional services), agencies by name, recipient_type_names (small_business, veteran owned...). Two gotchas: the sort field must appear in fields or you get a 400, and keyword search can take 30+ seconds cold, so set your timeout accordingly.
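
To catch the sort gotcha before the API does, a small builder can validate the body locally. This is a sketch: buildAwardSearch and its parameter names are hypothetical, while the filter and field names are the ones in the request above.

```javascript
// Sketch: build the search body and fail fast when sort is missing from fields.
// buildAwardSearch is a hypothetical helper, not part of any USAspending client.
function buildAwardSearch({ startDate, endDate, keywords, minAmount, fields, sort }) {
  if (!fields.includes(sort)) {
    throw new Error(`sort field "${sort}" must also appear in fields`);
  }
  const filters = {
    time_period: [{ start_date: startDate, end_date: endDate, date_type: 'date_signed' }],
    award_type_codes: ['A', 'B', 'C', 'D'],
  };
  // Optional filters are added only when the caller supplies them.
  if (keywords?.length) filters.keywords = keywords;
  if (minAmount != null) filters.award_amounts = [{ lower_bound: minAmount }];
  return { filters, fields, limit: 100, page: 1, sort, order: 'desc' };
}
```

Send the result as JSON.stringify(body) in the POST, with a timeout generous enough for cold keyword searches.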

date_signed is the whole product

The default date filter, action_date, matches every modification to every old contract: option years, funding bumps, admin changes. Alert products built on it are noise. date_type: "date_signed" returns contracts that were actually signed in the window, which is the event people pay to hear about. A company that signed a $5M federal contract this week is hiring against a start date, buying equipment, and looking for subcontractors right now.

The recipient profile is the lead

Each result carries a recipient_id. One more GET turns the award into an addressable company:

GET https://api.usaspending.gov/api/v2/recipient/{recipient_id}/

That returns the registered street address, UEI, business type flags (small business, veteran owned, minority owned, ...) and lifetime prime award totals. The lifetime number is a free qualifier: a first time winner with a big award is setting up everything at once, while a $1B incumbent is a different sales conversation entirely. The endpoint is transiently flaky under concurrency, so retry once before giving up.

What registries this beats

SAM.gov has the richer entity data but requires an API key and an account. The FPDS ATOM feed is XML from 2007. USAspending is the only corner of the federal procurement stack that is modern JSON with zero signup, and it updates within days of signing.

I packaged the whole flow (keyword/NAICS/agency filters, date_signed windows, recipient enrichment, retry logic) into a pay per row actor: Government Contract Winner Leads. It is free to try until July 24. And if you would rather build your own alert, the two requests above are the entire pipeline.

"Stop Scraping WHOIS. The Registries Serve JSON Now"

Ken-Mutisya — Fri, 10 Jul 2026 13:46:50 +0000

Every few months someone asks how to scrape WHOIS at scale, and the answers are always the same: port 43 sockets, regex soup for 200 registrar formats, and rate limits that ban you by lunchtime. Almost nobody mentions that the registries replaced the whole thing years ago with a JSON protocol called RDAP, and it is open, keyless, and standardized.

One GET, structured JSON

GET https://rdap.verisign.com/com/v1/domain/stripe.com
Accept: application/rdap+json

You get events (registration, expiration, last changed), status locks, nameservers, DNSSEC delegation, and the registrar as a structured entity. No text parsing. The dates are ISO 8601. The response for a .com domain comes straight from Verisign, the registry itself, not a reseller or a cached mirror.

{
  "events": [
    { "eventAction": "registration", "eventDate": "1995-09-12T04:00:00Z" },
    { "eventAction": "expiration", "eventDate": "2027-09-11T04:00:00Z" }
  ],
  "status": ["client delete prohibited", "client transfer prohibited"],
  "nameservers": [{ "ldhName": "NS1.STRIPE.COM" }]
}
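
Turning that events array into the fields most workflows want is a few lines. A sketch, assuming a parsed response shaped like the one above (domainDates is a hypothetical name):

```javascript
// Sketch: read registration and expiration dates from an RDAP response and compute age.
function domainDates(rdap, now = Date.now()) {
  // Index events by action, e.g. { registration: '1995-...', expiration: '2027-...' }.
  const byAction = Object.fromEntries(
    (rdap.events ?? []).map(e => [e.eventAction, e.eventDate])
  );
  const registeredAt = byAction['registration'] ?? null;
  const expiresAt = byAction['expiration'] ?? null;
  // Whole days since registration; null when the registry omits the event.
  const ageDays = registeredAt
    ? Math.floor((now - Date.parse(registeredAt)) / 86_400_000)
    : null;
  return { registeredAt, expiresAt, ageDays };
}
```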

Routing: the IANA bootstrap file

Different TLDs live at different registries, and IANA publishes the routing table as plain JSON:

GET https://data.iana.org/rdap/dns.json

It maps about 1,200 TLDs to their RDAP base URLs. Fetch it once, build a Map of TLD to endpoint, and query each domain at its own registry. Concurrency 8 with a 15 second timeout checks 2,000 domains in a couple of minutes, and nobody rate walls you, because you are spreading standard queries across the services built to answer them.

const boot = await (await fetch('https://data.iana.org/rdap/dns.json')).json();
const map = new Map();
for (const [tlds, urls] of boot.services) {
  const base = urls.find(u => u.startsWith('https://'));
  for (const tld of tlds) map.set(tld, base);
}
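
From that map, the per-domain query URL is a lookup plus a path join. A sketch that wraps the loop above in a function and adds the lookup (buildTldMap and rdapUrlFor are hypothetical names; it only handles single-label TLDs, which is how the bootstrap file keys its entries as far as I have seen):

```javascript
// Sketch: turn the bootstrap document into a lookup, then build the query URL per domain.
function buildTldMap(boot) {
  const map = new Map();
  for (const [tlds, urls] of boot.services) {
    const base = urls.find(u => u.startsWith('https://'));
    if (!base) continue; // no HTTPS endpoint listed for this group
    for (const tld of tlds) map.set(tld.toLowerCase(), base);
  }
  return map;
}

function rdapUrlFor(domain, map) {
  const name = domain.trim().toLowerCase().replace(/\.$/, '');
  const tld = name.slice(name.lastIndexOf('.') + 1);
  const base = map.get(tld);
  if (!base) return null; // TLD not in the bootstrap: handle the miss explicitly
  // Normalize the trailing slash in case a base URL lacks one.
  return `${base.replace(/\/?$/, '/')}domain/${name}`;
}
```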

The 404 is a feature

An RDAP 404 from the authoritative registry means the domain is not registered. That turns a registration data checker into a bulk availability checker for free. But only trust the 404 when it comes from the TLD's own registry. Aggregators like rdap.org return 404 both for "domain not found" and "TLD not supported", which will happily tell you github.io is available.
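
In code, that rule is a three-way classification rather than a boolean. A sketch with hypothetical names (classifyRdapStatus and the authoritative flag are mine):

```javascript
// Sketch: only an authoritative 404 counts as "available".
function classifyRdapStatus(status, { authoritative }) {
  if (status === 200) return 'registered';
  if (status === 404) return authoritative ? 'available' : 'unknown';
  return 'error'; // rate limits, server errors: retry or report, never guess
}
```

Set authoritative to true only when the endpoint came from the bootstrap entry for that TLD.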

The gaps are ccTLDs

ICANN requires RDAP for every gTLD, so .com, .net, .org and the hundreds of new TLDs have full coverage. Country TLDs are voluntary: .ai, .tv, .cc and .uk are in the bootstrap, while .io and .sh are missing from it but quietly served by the registry operator at rdap.identitydigital.services/rdap/. Others (.de, .co, .me) still have nothing public. Handle the miss case explicitly instead of guessing.

Parsing the registrar

The registrar hides inside a jCard, which is vCard reimagined as nested JSON arrays. The name is the fn entry, and the IANA registrar ID is in publicIds:

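const reg = data.entities?.find(e => e.roles?.includes('registrar'));
// jCard properties are [name, params, type, value], so the value sits at index 3.
const name = reg?.vcardArray?.[1]?.find(i => i[0] === 'fn')?.[3];
const ianaId = reg?.publicIds?.find(p => p.type === 'IANA Registrar ID')?.identifier;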

One postscript on privacy: registrant name and email are redacted nearly everywhere since GDPR, in RDAP and WHOIS alike. What registries still publish for every domain is the part that matters for most workflows anyway: dates, registrar, locks, nameservers. Domain age alone sorts a lead list by business credibility, flags a phishing lookalike registered last Tuesday, and prices an expired domain hunt.

I packaged this whole flow (bootstrap routing, ccTLD supplements, jCard parsing, availability detection) into a pay per row actor: Domain WHOIS & Age Checker. Paste domains, get rows. It is free to try until July 24. But if you would rather build it yourself, everything above is the entire recipe.

"Every US Nonprofit's Budget Is Public. Here Is How to Build Lead Lists From It"

Ken-Mutisya — Tue, 07 Jul 2026 19:35:58 +0000

If you sell anything to nonprofits — software, fundraising services, consulting — you have one segmentation question that matters more than all others: can this organization afford me? Unusually, the answer is public. Every US nonprofit above a minimal size files a Form 990 with the IRS, and the filings carry total revenue, expenses, and assets.

The friendly way in is ProPublica's Nonprofit Explorer API. No key, no registration.

Search, then enrich

GET https://projects.propublica.org/nonprofits/api/v2/search.json
      ?q=food+bank&state[id]=TX&ntee[id]=5

Returns organizations with EIN, name, city, state, and NTEE category (1=Arts through 9=Mutual Benefit), paginated 25 at a time. Then per organization:

GET https://projects.propublica.org/nonprofits/api/v2/organizations/{ein}.json

The filings_with_data array is the payoff — one entry per e-filed 990, each with tax_prd_yr, totrevenue, totfuncexpns, totassetsend, and a link to the filing PDF. Sort by year descending and you have the latest financials plus a multi-year trend for free.

The three moves that turn data into leads

Filter by revenue band. totrevenue >= 1_000_000 removes the all-volunteer organizations that can't buy anything. This single filter is most of what expensive sector databases sell: GuideStar/Candid Pro starts around $2k/year.
Read the trend, not the snapshot. Five years of totrevenue distinguishes a growing organization (hiring, buying tools) from a shrinking one (cutting). {"2023": 401M, "2022": 388M, "2021": 421M} tells you more than any firmographic field.
Treat missing filings as signal, not noise. Small organizations file postcard 990-Ns with no financial data. If your product needs a real budget behind it, the absence of filing data is the disqualification — skip them (and if you're billing per row like I do, make them free).
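
The three moves above fit in one function over filings_with_data. A sketch, using the field names listed earlier (summarizeFilings, minRevenue and the "growing" rule of latest versus oldest year in the window are my own simplifications):

```javascript
// Sketch: summarize one organization from its filings_with_data array.
function summarizeFilings(filings, { minRevenue = 1_000_000, years = 5 } = {}) {
  // No e-filed 990 data: report it as a status instead of failing.
  if (!filings || filings.length === 0) {
    return { status: 'no_filing_data', qualifies: false };
  }
  const sorted = [...filings].sort((a, b) => b.tax_prd_yr - a.tax_prd_yr);
  const window = sorted.slice(0, years);
  const latest = window[0];
  const oldest = window[window.length - 1];
  const trend = {};
  for (const f of window) trend[f.tax_prd_yr] = f.totrevenue;
  return {
    status: 'ok',
    latestYear: latest.tax_prd_yr,
    latestRevenue: latest.totrevenue,
    trend,
    growing: latest.totrevenue > oldest.totrevenue, // crude: compares window endpoints only
    qualifies: latest.totrevenue >= minRevenue,
  };
}
```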

The workflow

Search fan-out (terms × states × categories) → dedupe by EIN → enrich → filter by band → export. Add a store of seen EINs and a weekly schedule and you get only new organizations matching your profile each run — a prospecting feed instead of a static list. Chain the names into a website contact scraper and the outreach email is one more step.

I packaged this as an Apify actor: Nonprofit Leads Scraper — search filters, 990 enrichment with the five-year trend, revenue-band filtering, cross-run dedupe, at $0.01 per organization row (no-filing orgs free, first 2 rows of every run free). It pairs with the Grant Opportunity Finder I shipped the same week: one side of the sector's money graph is who gives, the other is who raises and spends.

The nonprofit sector runs on transparency rules the for-profit world would never accept. If your ideal customer is a nonprofit, their budget has been sitting in a public API the whole time.

"The $459/mo Grant Alert Industry Runs on a Free Government API"

Ken-Mutisya — Tue, 07 Jul 2026 18:00:19 +0000

Grant discovery platforms are a quiet SaaS category with remarkable pricing: $179 to $459 a month for keyword alerts on grant opportunities. The primary source for US federal grants is grants.gov, and it has a public JSON API with no key, no registration, and no rate-limit drama. Here is the whole thing.

Search is one POST

POST https://api.grants.gov/v1/api/search2
Content-Type: application/json

{
  "keyword": "rural health",
  "oppStatuses": "posted|forecasted",
  "eligibilities": "12",
  "rows": 100,
  "startRecordNum": 0
}

You get hitCount and an oppHits array: opportunity ID, number, title, agency code, open/close dates, status, and CFDA numbers. Filters compose with pipes: agencies (NSF, HHS, USDA...), fundingCategories (ED, ENV, HL...), eligibilities (12 = 501(c)(3) nonprofits, 20 = private universities, 22 = for-profits). Pagination is startRecordNum.

The status values matter more than they look. posted is open-now. forecasted is the interesting one: grants that are announced but not yet open — the window where a grant writer positions a client before the competition starts writing.

Details are one more POST

POST https://api.grants.gov/v1/api/fetchOpportunity
{"opportunityId": 103313}

The synopsis block carries what the alert platforms sell as premium fields: awardCeiling, awardFloor, estimatedFunding, expectedNumberOfAwards, applicantTypes, costSharing, the full description, and — the field that surprised me — agencyContactEmail, the actual program officer inbox. In my verification run, 22 of 25 opportunities had one.

The two gotchas

Relevance order lies. The default ordering happily surfaces a "posted" opportunity from 2012. Sort client-side by openDate descending and drop anything whose closeDate is behind you, or your "new grants" feed opens with the Obama administration.
Money fields are strings with moods. awardCeiling can be a number, an empty string, or the literal string "none". Normalize before you compare.
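
For the money fields, a normalizer along these lines covers the three shapes above. Stripping "$" and commas is an extra assumption on my part, and parseMoney is a hypothetical name:

```javascript
// Sketch: normalize awardCeiling-style values to a number or null.
function parseMoney(value) {
  if (typeof value === 'number') return Number.isFinite(value) ? value : null;
  if (typeof value !== 'string') return null;
  // Assumption: tolerate "$" and thousands separators in case they appear.
  const cleaned = value.replace(/[$,\s]/g, '');
  if (cleaned === '' || cleaned.toLowerCase() === 'none') return null;
  const n = Number(cleaned);
  return Number.isFinite(n) ? n : null;
}
```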

Alerts are dedupe plus a schedule

The product these platforms sell is "tell me only what is new for my profile." That is: store the opportunity IDs you have already returned (a named key-value store on Apify), run the search on a weekly schedule, and skip seen IDs. Twenty lines of code, and the output is rows you can pipe to Slack or a client tracker instead of a marketing email.
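
The core of that dedupe, sketched with an in-memory Set standing in for the key-value store (takeNew and the id property are hypothetical; map them to whatever your rows call the opportunity ID):

```javascript
// Sketch: return only rows whose id has not been seen, and record them as seen.
function takeNew(rows, seenIds) {
  const fresh = [];
  for (const row of rows) {
    const id = String(row.id); // stringify so 103313 and "103313" match
    if (seenIds.has(id)) continue;
    seenIds.add(id);
    fresh.push(row);
  }
  return fresh;
}
```

Load the Set from the store at the start of a run and write [...seenIds] back at the end.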

I packaged all of it as an Apify actor: Grant Opportunity Finder — keywords, agency, category, and eligibility filters, forecasted support, detail enrichment with contact emails, recency sorting, past-deadline filtering, and cross-run dedupe, at $0.01 per opportunity row with the first 2 rows of every run free. The 25-row verification run cost $0.00056 to produce, which tells you everything about the margin structure of the incumbents.

Public data with a subscription moat around it is the most common shape of the alert-tool economy. Sometimes the moat is real (hard scraping, entity resolution). Here, it is one POST request.

"Apple Publishes Its App Store Charts as a Free JSON Feed. Rank Trackers Charge $40/mo for It"

Ken-Mutisya — Tue, 07 Jul 2026 17:44:48 +0000

Every ASO tool sells the same first feature: "track your app's chart position daily." Here is the part they rarely mention — Apple publishes the charts as a public, keyless JSON feed, and it has worked for over a decade.

The feed

GET https://itunes.apple.com/us/rss/topfreeapplications/limit=100/genre=6014/json

Swap the pieces:

Country: us, gb, de, jp, br — any storefront code
Chart: topfreeapplications, toppaidapplications, topgrossingapplications
Genre: Apple's genre IDs (6014 Games, 6015 Finance, 6005 Social Networking); drop the segment for the overall chart
Depth: limit= up to 200

Each entry carries the app ID, name, developer, price with currency, category, release date, and artwork URLs. Position in the array is the rank. No key, no auth, no bot wall — this is a feed Apple built to be consumed.
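
A sketch of a URL builder for the pieces above (chartUrl is a hypothetical helper; the defaults are arbitrary):

```javascript
// Sketch: assemble a chart feed URL from country, chart type, genre and depth.
function chartUrl({ country = 'us', chart = 'topfreeapplications', genre, limit = 100 } = {}) {
  const allowed = ['topfreeapplications', 'toppaidapplications', 'topgrossingapplications'];
  if (!allowed.includes(chart)) throw new Error(`unknown chart: ${chart}`);
  if (!Number.isInteger(limit) || limit < 1 || limit > 200) {
    throw new Error('limit must be an integer from 1 to 200');
  }
  const parts = [`https://itunes.apple.com/${country}/rss/${chart}`, `limit=${limit}`];
  if (genre != null) parts.push(`genre=${genre}`); // omit the segment for the overall chart
  return `${parts.join('/')}/json`;
}
```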

(There is also a newer rss.marketingtools.apple.com API, but it dropped genre filtering, and category charts are where the useful signal lives — a niche app can rank #12 in Finance/DE while being invisible in the overall top 200.)

What turns a feed into a tracker

The feed is a snapshot. The product is the delta. Three things to add:

State between runs. Store {appId: rank} per chart (country × type × genre). On the next run, join against it: previousRank, rankChange, and an isNew flag for chart entries. On Apify this is a named key-value store, so a scheduled run compares automatically.
A tracked-apps mode. Most buyers care about their app, not the whole chart. Filter rows to a list of app IDs — and report "not charting" explicitly (as a free row, since no rank is not data you should pay for).
Rank direction that reads correctly. rankChange = previousRank - rank, so climbing from #47 to #31 is +16. Positive equals good news; your Slack alert should not need a legend.
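
Those three additions can be sketched as two small functions. This assumes the chart arrives as an ordered array of app IDs and the stored state is a plain {appId: rank} object; diffChart and trackApps are hypothetical names:

```javascript
// Sketch: join the current chart against the previous run's {appId: rank} state.
function diffChart(previous, current) {
  const rows = current.map((appId, index) => {
    const rank = index + 1; // position in the array is the rank
    const previousRank = previous[appId] ?? null;
    return {
      appId,
      rank,
      previousRank,
      rankChange: previousRank === null ? null : previousRank - rank, // positive = climbed
      isNew: previousRank === null,
    };
  });
  // State to persist for the next run.
  const nextState = Object.fromEntries(rows.map(r => [r.appId, r.rank]));
  return { rows, nextState };
}

// Tracked-apps mode: one row per requested app, with misses reported explicitly.
function trackApps(rows, trackedIds) {
  return trackedIds.map(id =>
    rows.find(r => r.appId === id) ?? { appId: id, rank: null, status: 'not_charting' });
}
```

New entries get a null rankChange, so a numeric filter such as rankChange > 20 never mistakes them for climbers.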

The rising-apps trick

The same rows serve a second buyer. Scan a full category chart daily and filter isNew == true or rankChange > 20: that is a rising-apps detector, and it surfaces breakout apps days before the roundup articles, in whatever country you point it at. Publishers and investors pay for exactly this view.

The economics

One chart is one HTTP request. Pulling 50 ranks costs a few hundredths of a cent in compute, which is why per-row pricing works against $40+/month subscriptions. I packaged the whole thing as an Apify actor this week: App Store Top Charts Tracker does the fan-out (countries × charts × categories), the state comparison, and the tracked-apps mode, at $0.003 per rank row with the first 2 rows of every run free.

Google Play, for the record, has no equivalent public feed — its charts hide behind an internal RPC. That asymmetry says a lot: the data was never the moat. The scheduling, the state, and the diff were, and those are about 100 lines of code.

"Stop Paying Monthly to Know a Web Page Changed. Diff It Yourself"

Ken-Mutisya — Tue, 07 Jul 2026 16:31:14 +0000

Change-monitoring SaaS has a strange pricing model: you pay every month for the pages that did not change. The actual work — fetch, extract, compare — is close to free, and the hard parts are two design decisions most tools get wrong.

Decision 1: what counts as "the page"

Hash the raw HTML and everything is a change: rotating nonces, cache busters, CSRF tokens, ad slots. The signal is in the content layer:

Strip script, style, nav, header, footer, aside, cookie banners, and aria-hidden nodes.
Scope to main / article / [role="main"] when the page declares one, else body.
Flatten block elements to lines, collapse whitespace per line, drop empties.

Hash that. A pricing page now only "changes" when prices, plans, or copy change — not when the CDN rotates an asset fingerprint. Offer a CSS selector as an override for surgical cases (.pricing-table), but make the no-selector path the default; nobody wants to maintain selectors for 100 monitored URLs.

Decision 2: what a "change" looks like in the output

A screenshot pair is where information goes to die. The useful output is a line diff:

{
  "url": "https://competitor.com/pricing",
  "status": "changed",
  "addedLines": ["Pro plan $49/mo"],
  "removedLines": ["Pro plan $39/mo"],
  "previousChangeAt": "2026-06-20T08:00:00Z",
  "checkedAt": "2026-07-07T08:00:00Z"
}

You do not need a full LCS diff for this. A multiset comparison of lines (count occurrences per line on each side, report the surplus in each direction) catches everything a human cares about on a monitored page, runs in linear time, and never blows up on a 10,000-line page.
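
A sketch of that multiset comparison, together with the line flattening from Decision 1. It assumes the content text has already been extracted from the HTML; toLines and diffLines are hypothetical names:

```javascript
// Sketch: split extracted text into clean lines (collapse whitespace, drop empties).
function toLines(text) {
  return text.split('\n').map(l => l.replace(/\s+/g, ' ').trim()).filter(Boolean);
}

// Count occurrences per line.
function countLines(lines) {
  const counts = new Map();
  for (const line of lines) counts.set(line, (counts.get(line) ?? 0) + 1);
  return counts;
}

// Multiset diff: report the surplus of each line in each direction.
function diffLines(previousLines, currentLines) {
  const before = countLines(previousLines);
  const after = countLines(currentLines);
  const addedLines = [];
  const removedLines = [];
  for (const [line, n] of after) {
    const surplus = n - (before.get(line) ?? 0);
    for (let i = 0; i < surplus; i++) addedLines.push(line);
  }
  for (const [line, n] of before) {
    const surplus = n - (after.get(line) ?? 0);
    for (let i = 0; i < surplus; i++) removedLines.push(line);
  }
  const status = addedLines.length || removedLines.length ? 'changed' : 'unchanged';
  return { status, addedLines, removedLines };
}
```

The trade-off is that pure reordering reads as unchanged, which is usually what you want on a monitored page.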

The state problem

The piece that makes this a product instead of a script is persistence: each URL's last text, hash, and timestamps have to survive between runs. On Apify that is a named key-value store — it lives in your account, so a scheduled run picks up exactly where the last one left off. Key the state by URL plus selector, so changing the selector re-baselines cleanly instead of producing one giant false diff.

Baselines deserve their own status. The first time a URL is seen there is nothing to compare against; report baseline, store the state, charge nothing. A monitoring tool that bills you for learning what a page looks like is charging you for its own setup.

The economics

A plain-HTTP check on a monitored page costs a few thousandths of a cent in compute. That is why per-change pricing works: I packaged this as an Apify actor — Website Change Monitor — where baselines, unchanged checks, and fetch errors are free and you pay $0.01 per detected change. A hundred stable pages on a daily schedule cost exactly $0 until one of them moves. The verification run for this article watched Hacker News and example.com: HN came back changed +28/-28, example.com came back unchanged, and the bill for both checks was four hundredths of a cent.

Wire the dataset into Slack or Sheets with an integration and you have the same alerts the subscription tools sell, except the diff is structured, the state is yours, and the invoice only exists when the web actually changed.

"Your SEO Audit Should Be JSON, Not a 40-Tab Spreadsheet"

Ken-Mutisya — Tue, 07 Jul 2026 15:27:39 +0000

Every agency has the same Monday ritual: open the desktop crawler, wait for it to chew through a client site, export a giant spreadsheet, and copy numbers into a report nobody reads past page two. The data in that workflow is fine. The shape of it is the problem.

An audit is more useful as one JSON row per page with a computed issues array. Here is why, and how to build the checks.

The row, not the report

{
  "url": "https://example.com/pricing",
  "title": "Pricing",
  "titleLength": 7,
  "metaDescription": null,
  "h1Count": 2,
  "wordCount": 96,
  "redirectHops": 0,
  "brokenLinkCount": 1,
  "issues": ["title_too_short", "missing_meta_description",
             "multiple_h1", "thin_content", "broken_internal_links"]
}

Once a page is a row, everything downstream is trivial: filter issues.length > 0, group by issue type, diff this week against last week, pipe into a dashboard, alert on regressions. The "report" becomes a query.

The checks worth automating

Most of a technical audit is a handful of deterministic checks per page:

Title: missing, under ~10 chars, over ~60 chars, duplicated across pages. The duplicate check must run across the whole crawl, not per page — keep a Map of title → first URL and flag later occurrences.
Meta description: missing, over ~160 chars, duplicated (same cross-crawl map trick).
Headings: zero H1s or more than one.
Indexability: noindex in robots meta, missing canonical.
Redirect chains: follow redirects manually (redirect: 'manual') so you can count hops and record each status. A 301 → 302 → 200 chain is invisible if your HTTP client silently follows it.
Broken internal links: collect every internal href during the crawl, HEAD-check each unique URL exactly once, cache the status run-wide. Fall back to a 1-byte ranged GET when a server rejects HEAD with 405.
Content: word count after stripping nav/header/footer/scripts — thin-content thresholds are debatable, but under ~150 words is rarely a page that deserves to rank.
Images: count how many are missing alt text; don't fail the page, report the number.
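
Several of these checks reduce to threshold comparisons on fields the row already carries. A sketch using the thresholds above; issue names that do not appear in the example row (missing_title, title_too_long, meta_description_too_long, missing_h1, redirect_chain, duplicate_title) are my own labels, and "more than one hop" as the chain rule is an assumption:

```javascript
// Sketch: derive the issues array for one page row.
function computeIssues(page) {
  const issues = [];
  if (!page.title) issues.push('missing_title');
  else if (page.title.length < 10) issues.push('title_too_short');
  else if (page.title.length > 60) issues.push('title_too_long');

  if (!page.metaDescription) issues.push('missing_meta_description');
  else if (page.metaDescription.length > 160) issues.push('meta_description_too_long');

  if (page.h1Count === 0) issues.push('missing_h1');
  else if (page.h1Count > 1) issues.push('multiple_h1');

  if (page.wordCount < 150) issues.push('thin_content');
  if (page.redirectHops > 1) issues.push('redirect_chain');
  if (page.brokenLinkCount > 0) issues.push('broken_internal_links');
  return issues;
}

// Cross-crawl check: remember the first URL per title, flag later occurrences.
function flagDuplicateTitles(rows) {
  const firstUrlByTitle = new Map();
  for (const row of rows) {
    if (!row.title) continue;
    if (firstUrlByTitle.has(row.title)) row.issues.push('duplicate_title');
    else firstUrlByTitle.set(row.title, row.url);
  }
  return rows;
}
```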

None of this needs a browser. Most marketing sites, blogs, docs, and stores are server-rendered; a plain HTTP fetch plus an HTML parser audits them accurately, and the whole crawl runs in seconds instead of minutes.

The scheduling trick that makes it an agency product

A one-off audit is a snapshot. The valuable thing is the diff: schedule the crawl weekly per client, store each run's rows, and compare issues arrays. "3 new broken links, 2 pages went noindex after Thursday's deploy" is a report clients act on — and it is 10 lines of comparison code once your audit is rows instead of a spreadsheet.
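
The comparison itself might look like the sketch below, assuming each run is an array of rows with url and issues (diffAudits is a hypothetical name, and pages that vanished between runs are ignored for brevity):

```javascript
// Sketch: compare two audit runs and report new and fixed issues per URL.
function diffAudits(lastRun, thisRun) {
  const previous = new Map(lastRun.map(row => [row.url, new Set(row.issues)]));
  const report = [];
  for (const row of thisRun) {
    const before = previous.get(row.url) ?? new Set(); // new page: everything is new
    const now = new Set(row.issues);
    const newIssues = row.issues.filter(i => !before.has(i));
    const fixedIssues = [...before].filter(i => !now.has(i));
    if (newIssues.length || fixedIssues.length) {
      report.push({ url: row.url, newIssues, fixedIssues });
    }
  }
  return report;
}
```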

I packaged this as an Apify actor this week: SEO Site Audit Scraper does everything above (BFS crawl with include/exclude filters, sitemap seeding, manual redirect chains, cross-crawl duplicate detection, free HEAD-checked broken links) and returns one row per page, pay per page, no seat license. The first 2 pages of every run are free.

The desktop crawlers are good tools. But if your audit ends up in a dashboard, a client report, or a CI check, it should have been JSON from the start.

"Four Remote Job Boards Have Free Public APIs. Here Is One Schema for All of Them"

Ken-Mutisya — Sun, 05 Jul 2026 21:52:06 +0000

If you want remote job data, you do not need to scrape HTML or sign up for anything. Four of the bigger remote job boards publish keyless public feeds. The catch is that they all speak different dialects, so the real work is normalization. Here are the endpoints and the traps.

The four feeds

RemoteOK returns its whole current board as one JSON array:

GET https://remoteok.com/api

The first element is a legal notice, not a job: they ask for a link back with attribution as a condition of using the feed. Skip element zero, and honor the attribution if you republish. Jobs carry salary_min and salary_max as numbers, tags, and ISO dates.

Remotive has the friendliest API of the four, including server side search:

GET https://remotive.com/api/remote-jobs?search=python&limit=100

Salary here is free text ("$120k - $160k"), so do not expect numbers. Attribution with a link back is required here too.

WeWorkRemotely publishes RSS:

GET https://weworkremotely.com/remote-jobs.rss

Two quirks: the company name is not a field, it is baked into the title as Company: Role, so split on the first colon. And useful data hides in nonstandard tags like <region>, <skills>, and <category> that generic RSS parsers drop on the floor.
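
Both quirks are small string jobs. A sketch, assuming you already have the raw XML of one item (splitTitle and readTag are hypothetical helpers, and the CDATA handling is a guess at what the feed may wrap values in):

```javascript
// Sketch: split "Company: Role" on the first colon only.
function splitTitle(title) {
  const i = title.indexOf(':');
  if (i === -1) return { company: null, role: title.trim() };
  return { company: title.slice(0, i).trim(), role: title.slice(i + 1).trim() };
}

// Read a nonstandard tag such as <region> from one item's XML, with or without CDATA.
function readTag(itemXml, tag) {
  const re = new RegExp(`<${tag}>(?:<!\\[CDATA\\[)?([\\s\\S]*?)(?:\\]\\]>)?</${tag}>`, 'i');
  const m = itemXml.match(re);
  return m ? m[1].trim() : null;
}
```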

Himalayas has a proper paginated API with a surprisingly deep catalog (100k+ listings):

GET https://himalayas.app/jobs/api?limit=100&offset=0

It gives structured minSalary/maxSalary with a currency and period, seniority arrays, location restrictions, and even timezone restrictions as UTC offsets. Dates are epoch seconds, not ISO strings.

The normalization layer

The row schema that survived contact with all four sources:

{
    "source": "Remotive",
    "title": "Senior Backend Engineer",
    "company": "Acme Corp",
    "tags": ["python", "aws"],
    "salaryMin": null,
    "salaryMax": null,
    "salaryText": "$120k - $160k",
    "location": "Worldwide",
    "postedAt": "2026-07-03T20:01:13.000Z",
    "applyUrl": "https://...",
    "sourceUrl": "https://..."
}

Rules that mattered in practice:

Keep both salary shapes. Boards with numbers fill salaryMin/salaryMax; boards with prose fill salaryText. Collapsing one into the other loses information either way.
Normalize every date to ISO 8601 at the edge. Epoch seconds, RFC 822 RSS dates, and ISO strings all flow through one converter, so downstream code never branches on source.
Dedupe on lowercased title|company. Companies cross post to multiple boards, and the same listing showing up four times makes the feed look broken.
Carry source and sourceUrl on every row. It satisfies the attribution requirements and it turns out buyers of job data want to know provenance anyway.
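
Two of those rules fit in a few lines each. This is a sketch, not the code from my actor: `to_iso` and `dedupe` are illustrative names, and treating a timestamp without a timezone as UTC is an assumption you may not want.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def to_iso(value) -> str:
    """Epoch seconds, RFC 822 date, or ISO string -> ISO 8601 in UTC."""
    if isinstance(value, (int, float)):
        dt = datetime.fromtimestamp(value, tz=timezone.utc)
    else:
        text = value.strip()
        try:
            dt = datetime.fromisoformat(text.replace("Z", "+00:00"))
        except ValueError:
            dt = parsedate_to_datetime(text)  # RFC 822 style RSS dates
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    iso = dt.astimezone(timezone.utc).isoformat(timespec="milliseconds")
    return iso.replace("+00:00", "Z")


def dedupe(rows: list) -> list:
    """Keep the first row for each lowercased title|company key."""
    seen, out = set(), []
    for row in rows:
        key = f"{row['title'].strip()}|{row['company'].strip()}".lower()
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out


print(to_iso(1783108873))
print(to_iso("Fri, 03 Jul 2026 20:01:13 +0000"))
```

Keeping the first occurrence means the order in which you concatenate the sources decides which board gets credit for a cross-posted listing.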

What this is good for

The obvious build is a job alert pipeline: run it hourly with keywords, diff against what you have seen, push new rows to Slack. The less obvious one is sales intelligence: a company hiring for a role is telling you what they are about to spend money on, and job feeds are the earliest public signal of that.

I packaged the whole thing (four fetchers, normalization, dedupe, keyword and freshness filters) into an actor on Apify if you want it as a scheduled feed. But every endpoint above works with nothing more than fetch, and the boards deserve the link backs their terms ask for.

"arXiv Has One of the Last Truly Open APIs. Here Is How to Build a Paper Monitor on It"

Ken-Mutisya — Sun, 05 Jul 2026 20:44:14 +0000

Every scraping post I write lately is about working around something: bot walls, consent screens, keys that require a developer account. This one is different. arXiv runs one of the last genuinely open APIs on the research web, and it is the right way to keep up with the ~800 AI papers that land there every week.

The whole API is one endpoint

GET https://export.arxiv.org/api/query
      ?search_query=(all:"multi-agent") AND (cat:cs.AI)
      &sortBy=submittedDate&sortOrder=descending
      &start=0&max_results=100

No key. No login. Atom XML out. It is documented, sanctioned for programmatic use, and has been stable for well over a decade, which makes it the opposite of reverse-engineered endpoints that break monthly.
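
The query above is shown unencoded for readability. A small sketch of building the real URL with the standard library; `arxiv_url` is a hypothetical helper name, while the parameters are the ones from the request above.

```python
from urllib.parse import urlencode

BASE = "https://export.arxiv.org/api/query"


def arxiv_url(search_query: str, start: int = 0, max_results: int = 100) -> str:
    params = {
        "search_query": search_query,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "start": start,
        "max_results": max_results,
    }
    # urlencode percent-encodes quotes, colons and parentheses,
    # and turns spaces into "+".
    return BASE + "?" + urlencode(params)


url = arxiv_url('(all:"multi-agent") AND (cat:cs.AI)')
print(url)
```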

The query grammar is small but composes well:

Fields: all: (title + abstract + more), ti:, abs:, au:, cat:
Boolean: AND, OR, ANDNOT, with parentheses
Phrases: double quotes, so all:"retrieval augmented generation" matches the phrase, not the words scattered

So "anything about tool use or agents, in the AI or NLP categories, by these authors" is one query string.
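
Composing that kind of query by hand gets error-prone once there are several topics, so here is a sketch of a tiny builder. `any_of` is a hypothetical helper; cs.CL is the arXiv category for computation and language, which is where most NLP work is filed.

```python
def any_of(field: str, values: list) -> str:
    """OR together several values for one field, quoting phrases."""
    parts = []
    for value in values:
        # Multi-word values need double quotes to match as a phrase.
        term = f'"{value}"' if " " in value else value
        parts.append(f"{field}:{term}")
    return "(" + " OR ".join(parts) + ")"


query = " AND ".join([
    any_of("all", ["tool use", "agents"]),
    any_of("cat", ["cs.AI", "cs.CL"]),
])
print(query)
# (all:"tool use" OR all:agents) AND (cat:cs.AI OR cat:cs.CL)
```

An author clause built with the au: field can be joined on with AND in the same way.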

The three things worth knowing before you build

Politeness is the rate limit. The API guidance asks for about one request every 3 seconds. With max_results=100 per page, that is 2,000 papers a minute, which is more than any monitoring workflow needs.
Sort by submittedDate descending and dedupe by ID. arXiv IDs are versioned (2507.01234v2), so decide whether a revision counts as "new" for you. For monitoring, tracking the ID without the version suffix and diffing daily is usually what people want.
Abstracts are full text in the feed. You do not need to touch a PDF to build a useful pipeline: the abstract is enough for embeddings, topic classification, and digest summaries. That turns "paper monitoring" into a pure JSON problem.
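
A sketch covering all three points, run against a hand-written Atom entry so that it stays offline. The sample values are placeholders, and the function names and the injected `fetch` and `sleep` parameters are hypothetical, added only to make the pacing logic testable.

```python
import re
import time
import xml.etree.ElementTree as ET

ATOM = {"atom": "http://www.w3.org/2005/Atom"}

# Hand-written sample in the Atom shape the API returns (not real data).
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/2507.01234v2</id>
    <title>A Placeholder Title</title>
    <summary>  Placeholder abstract text
      spanning two lines.  </summary>
  </entry>
</feed>"""


def base_id(entry_id: str) -> str:
    """Drop the URL prefix and the version suffix from an entry id."""
    tail = entry_id.rsplit("/abs/", 1)[-1]
    return re.sub(r"v\d+$", "", tail)


def parse_entries(xml_text: str) -> list:
    rows = []
    for entry in ET.fromstring(xml_text).findall("atom:entry", ATOM):
        def text(tag: str) -> str:
            raw = entry.findtext(f"atom:{tag}", default="", namespaces=ATOM)
            return " ".join(raw.split())  # collapse feed line wrapping
        rows.append({"id": base_id(text("id")), "title": text("title"),
                     "abstract": text("summary")})
    return rows


def fetch_politely(urls, fetch, delay_seconds=3.0, sleep=time.sleep) -> list:
    """Call fetch(url) for each URL, pausing between requests."""
    bodies = []
    for i, url in enumerate(urls):
        if i:  # wait between requests, not before the first one
            sleep(delay_seconds)
        bodies.append(fetch(url))
    return bodies


rows = parse_entries(SAMPLE)
print(rows[0]["id"], "|", rows[0]["abstract"])
```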

The workflow that actually keeps you current

Nobody reads listing pages. The pattern that works is a scheduled diff:

Daily run: query your topics, newest first.
Skip every ID you have seen before.
Push what is left to Slack, a newsletter draft, or an embedding index.
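
Those three steps can be sketched with nothing more than a JSON file of seen IDs. This is an illustration and not the actor's dedupe logic: `new_only`, the state file name, and every ID except the one used earlier are placeholders.

```python
import json
import tempfile
from pathlib import Path


def new_only(rows: list, seen_path: Path) -> list:
    """Return rows whose id has not been seen, then persist the new ids."""
    seen = set(json.loads(seen_path.read_text())) if seen_path.exists() else set()
    fresh = []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            fresh.append(row)
    seen_path.write_text(json.dumps(sorted(seen)))
    return fresh


# Two simulated daily runs against a throwaway state file.
with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "seen.json"
    first = new_only([{"id": "2507.01234"}, {"id": "2507.05678"}], state)
    second = new_only([{"id": "2507.01234"}, {"id": "2507.09999"}], state)

print(len(first), [row["id"] for row in second])
```

Whatever `new_only` returns is what gets pushed onward. In a scheduled job the state file would live somewhere persistent instead of a temp directory.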

The result is a feed of only-new, only-relevant papers with abstracts, which is what all the "stay current with AI research" tools sell, built on an API that gives the data away.

I packaged this as an Apify actor this week: arXiv Papers Scraper takes keywords, categories, and authors, handles the pagination and politeness delays, normalizes rows (title, full abstract, authors, categories, dates, PDF link, DOI), and has cross-run dedupe built in for exactly the scheduled-diff workflow above. The first 2 rows of every run are free.

One small irony to end on: the hardest part of building AI research tooling is not the AI. It is that most of the web fights being read by machines. arXiv does not, and it is not a coincidence that it is also the most machine-cited corpus in the field.