<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: NexusFeed</title>
    <description>The latest articles on DEV Community by NexusFeed (@nexusfeed).</description>
    <link>https://dev.to/nexusfeed</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3873684%2F18c0d814-ad2e-4c1c-ac59-93e43e6ab875.png</url>
      <title>DEV Community: NexusFeed</title>
      <link>https://dev.to/nexusfeed</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nexusfeed"/>
    <language>en</language>
    <item>
      <title>The data every AI agent needs but nobody sells cleanly — and what you can build on top of it</title>
      <dc:creator>NexusFeed</dc:creator>
      <pubDate>Tue, 14 Apr 2026 21:53:21 +0000</pubDate>
      <link>https://dev.to/nexusfeed/the-data-every-ai-agent-needs-but-nobody-sells-cleanly-and-what-you-can-build-on-top-of-it-1dia</link>
      <guid>https://dev.to/nexusfeed/the-data-every-ai-agent-needs-but-nobody-sells-cleanly-and-what-you-can-build-on-top-of-it-1dia</guid>
      <description>&lt;p&gt;Freight audit shops charge their customers &lt;strong&gt;3–8% of recovered overcharges&lt;/strong&gt; to reconcile invoices against published carrier rates. The data input to that business — current LTL fuel surcharge percentages from each carrier, refreshed weekly — costs &lt;strong&gt;$0.03 per lookup&lt;/strong&gt; if you know where to get it.&lt;/p&gt;

&lt;p&gt;That gap is the thing I want to talk about. Not the scraping. The &lt;em&gt;gap&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I just shipped &lt;strong&gt;NexusFeed&lt;/strong&gt;, a JSON API that returns two kinds of data the rest of the web sells badly or not at all: LTL fuel surcharges for 10 freight carriers, and liquor-license compliance records for 5 US states. Both are public information. Both are locked behind portals that range from &lt;em&gt;annoying&lt;/em&gt; to &lt;em&gt;actively hostile&lt;/em&gt;. And both are the kind of data that used to require either a sales call to a legacy data vendor or a wet-signed NDA with a compliance firm.&lt;/p&gt;

&lt;p&gt;This post is not about how I got the data. It is about what &lt;em&gt;you&lt;/em&gt; can build on top of it.&lt;/p&gt;

&lt;h2&gt;Why this data is worth your attention&lt;/h2&gt;

&lt;p&gt;Here's the thing that surprised me six months into building this: the hard part was never the scraping.&lt;/p&gt;

&lt;p&gt;The hard part is that &lt;strong&gt;agent-native data doesn't exist yet&lt;/strong&gt; in most B2B verticals. If you want to wire a language model into a freight-audit workflow or a three-tier alcohol compliance check, you have two options. One, pay a legacy vendor $3–15k/month for a CSV drop and a login to their dashboard. Two, scrape it yourself — and discover that five of the fifteen sources I currently cover require non-trivial anti-bot handling, weekly regression testing, and a confidence model so your agent doesn't hallucinate a rate that silently 404'd that morning.&lt;/p&gt;

&lt;p&gt;That second path is what I did, which is why I can now charge &lt;strong&gt;$0.03 per LTL lookup&lt;/strong&gt; and &lt;strong&gt;$0.05 per ABC lookup&lt;/strong&gt; with a &lt;code&gt;_verifiability&lt;/code&gt; block on every response. Both numbers are roughly two orders of magnitude cheaper than the legacy dashboard subscriptions, because the cost structure is marginal per-call instead of seat-based.&lt;/p&gt;

&lt;p&gt;Which means there is a window open right now for someone to build the &lt;em&gt;thin agent layer&lt;/em&gt; on top, and keep most of the margin.&lt;/p&gt;

&lt;h2&gt;Build #1: the freight-audit agent&lt;/h2&gt;

&lt;p&gt;Pick any mid-market 3PL or shipper doing $50M+ in annual freight spend. Industry studies consistently find 3–8% of LTL invoices contain overcharges — wrong fuel surcharge, wrong accessorial, wrong class. Freight-audit firms recover those overcharges and keep a cut, typically 30–50% of the recovered amount.&lt;/p&gt;

&lt;p&gt;The data they need boils down to one question per invoice: what &lt;em&gt;should&lt;/em&gt; the fuel surcharge have been on the day the shipment moved, per that carrier's published rate? Historically that answer lived in a dozen carrier PDFs and a rates analyst's memory.&lt;/p&gt;

&lt;p&gt;With NexusFeed's LTL endpoints it's one HTTP call per carrier-week. Pipe a shipper's invoice CSV into an agent, have the agent pull the correct rate for each row, flag discrepancies above a threshold, draft a dispute letter. &lt;strong&gt;Cost to run the data layer on 10,000 invoices a month: about $300.&lt;/strong&gt; Revenue share on recovered overcharges on that volume: five figures. That's the entire unit economics.&lt;/p&gt;
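&lt;p&gt;The flagging step in that loop reduces to a pure comparison once the published rate is in hand. A minimal sketch, assuming illustrative invoice fields and a one-percentage-point threshold (neither is part of the API):&lt;/p&gt;

```python
def flag_overcharges(invoices, published_pct, threshold_pct=1.0):
    """Return invoice rows whose billed fuel surcharge exceeds the
    carrier's published percentage by more than threshold_pct points.

    invoices: list of dicts with "invoice_id", "carrier", "billed_fsc_pct"
    published_pct: dict mapping carrier name to its published FSC percent
    """
    flagged = []
    for row in invoices:
        expected = published_pct.get(row["carrier"])
        if expected is None:
            continue  # no published rate on file; skip rather than guess
        delta = row["billed_fsc_pct"] - expected
        if delta > threshold_pct:
            flagged.append({**row, "expected_fsc_pct": expected,
                            "delta_pct": round(delta, 2)})
    return flagged
```

&lt;p&gt;Everything downstream, the dispute letter and the customer dashboard, hangs off that flagged list.&lt;/p&gt;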

&lt;p&gt;If you build this, your moat is &lt;em&gt;not&lt;/em&gt; the data — you're renting that. Your moat is the workflow: the invoice parser, the dispute-letter prompt chain, the customer dashboard. Those are the expensive, differentiated pieces.&lt;/p&gt;

&lt;h2&gt;The &lt;code&gt;_verifiability&lt;/code&gt; contract (and why your agent needs it)&lt;/h2&gt;

&lt;p&gt;This is the design decision I am proudest of, and the one I wish more scraping APIs made. It is also what makes "build an agent on top of this" a real proposal instead of a pipe dream.&lt;/p&gt;

&lt;p&gt;Every response from NexusFeed carries a &lt;code&gt;_verifiability&lt;/code&gt; block as a first-class field, not a footnote:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Verifiability&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;source_timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
    &lt;span class="n"&gt;extraction_confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;raw_data_evidence_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;extraction_method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ExtractionMethod&lt;/span&gt;
    &lt;span class="n"&gt;data_freshness_ttl_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;extraction_method&lt;/code&gt; is an enum with five members: &lt;code&gt;api_mirror&lt;/code&gt;, &lt;code&gt;playwright_dom&lt;/code&gt;, &lt;code&gt;structured_parse&lt;/code&gt;, &lt;code&gt;scraper_api&lt;/code&gt;, &lt;code&gt;scraper_api_fallback&lt;/code&gt;. An agent reading the response can see immediately whether the number came from a clean JSON API mirror or from a regex over a rendered portal, and make its own call about how much to trust the answer.&lt;/p&gt;
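&lt;p&gt;In code, that enum is a few lines. The member list below matches the five values above; the comments paraphrase the trust spectrum just described rather than quoting the docs:&lt;/p&gt;

```python
from enum import Enum

class ExtractionMethod(str, Enum):
    # Ordered roughly from most to least trustworthy.
    api_mirror = "api_mirror"                      # clean JSON API mirror
    playwright_dom = "playwright_dom"              # rendered-portal DOM extraction
    structured_parse = "structured_parse"          # parse of a structured document
    scraper_api = "scraper_api"                    # third-party scraping service
    scraper_api_fallback = "scraper_api_fallback"  # last-resort fallback path
```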

&lt;p&gt;&lt;code&gt;extraction_confidence&lt;/code&gt; is never hardcoded. It is the output of one short function, called by every extractor:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compute_confidence&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;required_fields&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;found_fields&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;fallback_triggered&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;fallback_triggered&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
    &lt;span class="n"&gt;found&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;required_fields&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;found_fields&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;found&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;required_fields&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Six required fields, four found, fallback path did not trigger? Confidence is 0.67. Primary extractor failed and the HTML fallback ran? Confidence is 0.0 and the router returns 503 instead of a half-answer.&lt;/p&gt;

&lt;p&gt;Why this matters for &lt;em&gt;you&lt;/em&gt; specifically: an agent making compliance or financial decisions on top of scraped data needs a programmatic honesty signal. Without it, you get silent drift — the source site quietly changes a column, your agent keeps returning plausible-looking answers, and your customer eventually notices. With &lt;code&gt;_verifiability&lt;/code&gt;, you can gate agent actions on &lt;code&gt;confidence &amp;gt;= 0.9&lt;/code&gt;, log the evidence URL for audit, and get paged the moment a source degrades. It's the difference between a demo and a production system.&lt;/p&gt;
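&lt;p&gt;That gate is a few lines of client code. A sketch: the response shape follows the &lt;code&gt;_verifiability&lt;/code&gt; schema above, but the threshold and the exception type are placeholders for your own policy:&lt;/p&gt;

```python
class LowConfidenceData(Exception):
    """Raised when a response is not trustworthy enough to act on."""

def gate(response, min_confidence=0.9):
    """Return the response only if its _verifiability block clears the
    bar; otherwise raise, so the agent escalates instead of guessing."""
    v = response["_verifiability"]
    if v["extraction_confidence"] >= min_confidence:
        return response
    raise LowConfidenceData(
        f"confidence {v['extraction_confidence']} via {v['extraction_method']}; "
        f"evidence: {v['raw_data_evidence_url']}"
    )
```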

&lt;h2&gt;Builds #2 and #3&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The three-tier compliance dashboard.&lt;/strong&gt; Every alcohol brand selling into the US has to track which of their distributors' licenses are current, in which states, with which privileges. Today that work is done by paralegals re-typing data out of five different state portals once a quarter. With NexusFeed's ABC endpoints (CA, TX, NY, FL, IL) it's a nightly cron. The buyer is any beverage-alcohol brand with a compliance or legal-ops function. Pricing anchor: Park Street Compliance charges four figures a month for the human-driven version.&lt;/p&gt;
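&lt;p&gt;That nightly cron is mostly a filter over license records. A sketch, assuming illustrative field names (&lt;code&gt;status&lt;/code&gt;, &lt;code&gt;expires&lt;/code&gt;) rather than the API's actual schema:&lt;/p&gt;

```python
from datetime import timedelta

def licenses_needing_attention(records, today, horizon_days=30):
    """Split distributor license records into hard failures (not active)
    and soft warnings (expiring within the horizon)."""
    failures, warnings = [], []
    cutoff = today + timedelta(days=horizon_days)
    for rec in records:
        if rec["status"] != "active":
            failures.append(rec)
        elif cutoff >= rec["expires"]:
            warnings.append(rec)
    return failures, warnings
```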

&lt;p&gt;&lt;strong&gt;The AI freight broker assistant.&lt;/strong&gt; Brokers spend a large portion of their day quoting shippers, and the quote depends on current fuel surcharge plus accessorials per carrier. An agent that watches a broker's inbox, parses RFQs, pulls current rates for every carrier in their network, and drafts a priced response saves the broker hours per day. The buyer is any brokerage with 5–50 agents — a market with thousands of firms in the US alone. NexusFeed's LTL endpoints are the input layer.&lt;/p&gt;
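&lt;p&gt;The pricing math inside that assistant is the easy part once the surcharge is current. A sketch of the usual LTL structure (linehaul plus a percentage fuel surcharge plus flat accessorials); the formula is industry convention, not something the API prescribes:&lt;/p&gt;

```python
def quote_total(linehaul, fsc_pct, accessorials=()):
    """Price a shipment: linehaul, plus fuel surcharge as a percentage
    of linehaul, plus any flat accessorial charges."""
    fuel = linehaul * fsc_pct / 100.0
    return round(linehaul + fuel + sum(accessorials), 2)
```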

&lt;p&gt;Neither of these is a hypothetical. They're calls I had before I shipped. Both buyers have budget and neither of them wants to be the company that scrapes ODFL's website themselves.&lt;/p&gt;

&lt;h2&gt;What's shipped, and what it costs&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;10 LTL carriers:&lt;/strong&gt; Old Dominion, Saia, Estes, ABF, R+L, TForce, XPO, Southeastern, Averitt, FedEx&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5 ABC states:&lt;/strong&gt; California, Texas, New York, Florida, Illinois&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;230 passing tests&lt;/strong&gt;, FastAPI on Railway, Redis cache-aside (7-day LTL TTL, 24-hour ABC TTL)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stripe metered billing&lt;/strong&gt; per product — &lt;code&gt;ltl_request&lt;/code&gt; at $0.03/call, &lt;code&gt;abc_request&lt;/code&gt; at $0.05/call&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP server&lt;/strong&gt; — install NexusFeed as a tool in Claude Desktop or Cline with one command, and your agent can call every endpoint as a first-class function&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Try it&lt;/h2&gt;

&lt;p&gt;Docs, OpenAPI spec, and the MCP install command are at &lt;strong&gt;&lt;a href="https://docs.nexusfeed.dev" rel="noopener noreferrer"&gt;docs.nexusfeed.dev&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you want a key, the two products are on RapidAPI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://rapidapi.com/ladourv/api/ltl-fuel-surcharge-api" rel="noopener noreferrer"&gt;LTL Fuel Surcharge API&lt;/a&gt;&lt;/strong&gt; — $0.03/request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://rapidapi.com/ladourv/api/abc-license-compliance-api" rel="noopener noreferrer"&gt;ABC License Compliance API&lt;/a&gt;&lt;/strong&gt; — $0.05/request&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What would you build on top of this? I gave you three starting points. If you have a fourth — or you're already building one of these three — drop it in the comments. I'll answer every one, and the best use-case gets a free month on both products to ship it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>showdev</category>
      <category>api</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
