K.SLADE

How I Structured NYC's Open Data for AI Agents Using MCP

NYC gives away some of the best public data in the world. Property ownership, building violations, restaurant health inspections, tax assessments, complaints — all free, all public.

The problem? It's completely unusable by AI agents.

The Data Fragmentation Problem

NYC's data is scattered across at least six agencies, each with its own API, schema, and query patterns:

| Agency | What They Publish | Access Method |
| --- | --- | --- |
| DOF (Finance) | Ownership, tax, sales via ACRIS | SODA + web scrape |
| DOB (Buildings) | Permits, new building apps | SODA + DOB NOW BIS |
| HPD (Housing) | Violations, complaints, registrations | SODA + web |
| OATH/ECB | Administrative penalties | SODA |
| DOHMH (Health) | Restaurant inspections, grades | SODA |
| HCR | Rent stabilization status | FOIL requests |

An AI agent trying to answer a simple question like "is this a bad landlord?" would need to:

  1. Normalize the address (NYC addresses are notoriously inconsistent)
  2. Resolve it to a BBL (Borough-Block-Lot) and BIN (Building ID)
  3. Query DOF for ownership records
  4. Query HPD for housing violations
  5. Query DOB for building violations
  6. Query OATH/ECB for penalty history
  7. Merge all of that into a coherent response

No agent does this today. The friction is too high.
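To make the friction concrete, here's a sketch of what steps 3-7 look like once you have a resolved BBL. The per-agency query functions are stand-ins for real SODA calls, stubbed here with sample data:

```typescript
// Hypothetical sketch of the fan-out/merge an agent would need.
// The query functions below are stubs standing in for real agency APIs.
type Violation = { agency: string; class: string; open: boolean };

async function queryHpdViolations(bbl: string): Promise<Violation[]> {
  return [{ agency: "HPD", class: "C", open: true }]; // stub
}

async function queryDobViolations(bbl: string): Promise<Violation[]> {
  return [{ agency: "DOB", class: "B", open: false }]; // stub
}

async function queryOathPenalties(bbl: string): Promise<Violation[]> {
  return []; // stub
}

// Steps 4-6 in parallel, then step 7: merge into one coherent report.
async function landlordReport(bbl: string) {
  const [hpd, dob, oath] = await Promise.all([
    queryHpdViolations(bbl),
    queryDobViolations(bbl),
    queryOathPenalties(bbl),
  ]);
  const all = [...hpd, ...dob, ...oath];
  return {
    bbl,
    totalViolations: all.length,
    openViolations: all.filter((v) => v.open).length,
  };
}
```

Even this stubbed version hides the hard parts: each real agency API has its own auth, pagination, and field names.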

The Solution: One MCP Server, Four Tools

I built NYC API — an MCP server that aggregates NYC's public data into four tools that any AI agent can call:

resolve_property_identifier

Normalize any NYC address, BBL, or BIN to a canonical form. This is the critical first step — NYC addresses come in dozens of formats ("123 Main St", "123 MAIN STREET", "123 Main St.", "123 Main Street Apt 4B") and you need a canonical identifier to query anything downstream.
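A naive version of that canonicalization might look like the sketch below — the real resolver does far more (BBL/BIN resolution, hyphenated Queens house numbers, multi-address buildings), but this shows why the four formats above can converge on one string:

```typescript
// Naive normalizer sketch — illustrative only, not the server's actual logic.
const SUFFIXES: Record<string, string> = {
  street: "ST", "st.": "ST", st: "ST",
  avenue: "AVE", "ave.": "AVE", ave: "AVE",
  boulevard: "BLVD", road: "RD", place: "PL",
};

function normalizeAddress(raw: string): string {
  return raw
    .toUpperCase()
    .replace(/\s+(APT|UNIT|#)\s*\S+$/i, "") // drop trailing unit designators
    .split(/\s+/)
    .map((tok) => SUFFIXES[tok.toLowerCase()] ?? tok) // canonical suffixes
    .join(" ")
    .trim();
}
```

All four example formats ("123 Main St", "123 MAIN STREET", "123 Main St.", "123 Main Street Apt 4B") normalize to "123 MAIN ST" under this sketch.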

get_property_intelligence

Given a resolved identifier, returns ownership records, zoning classification, tax class, assessed values, sales history, liens, and rent stabilization status. One call replaces what used to be 3-4 separate agency lookups.
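As a rough shape of what a consolidated response looks like — field names here are my illustration, not the server's actual schema:

```typescript
// Hypothetical response shape for a consolidated property lookup.
// Field names are illustrative assumptions, not the real API schema.
interface PropertyIntelligence {
  bbl: string;
  owner: { name: string; type: "individual" | "llc" | "corp" };
  zoning: string;          // e.g. "R6B"
  taxClass: string;        // NYC tax classes 1-4
  assessedValue: number;   // USD
  lastSale?: { date: string; price: number };
  rentStabilized: boolean;
}

const example: PropertyIntelligence = {
  bbl: "3012340056",
  owner: { name: "EXAMPLE REALTY LLC", type: "llc" },
  zoning: "R6B",
  taxClass: "2",
  assessedValue: 1250000,
  lastSale: { date: "2019-06-12", price: 2400000 },
  rentStabilized: true,
};
```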

get_building_violations

DOB, HPD, and OATH/ECB violations with severity scoring and risk indicators. An agent can immediately tell whether a building has serious open violations or a clean record.
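Severity scoring can be as simple as weighting open violations by class. The weights below are illustrative (HPD classes: A = non-hazardous, B = hazardous, C = immediately hazardous), not the server's actual formula:

```typescript
// Sketch of a severity score — weights are illustrative assumptions.
const CLASS_WEIGHT: Record<string, number> = { A: 1, B: 3, C: 7 };

function riskScore(violations: { class: string; open: boolean }[]): number {
  return violations
    .filter((v) => v.open) // only open violations drive current risk
    .reduce((sum, v) => sum + (CLASS_WEIGHT[v.class] ?? 1), 0);
}
```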

get_restaurant_venue_intel

DOHMH health grades, inspection history, violation codes, and permit status. Useful for restaurant discovery agents, food safety research, and mixed-use property analysis.
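DOHMH grades derive from inspection points, where lower is better. The published cutoffs map points to letters roughly like this:

```typescript
// NYC restaurant letter grades from inspection points (lower is better):
// 0-13 → A, 14-27 → B, 28+ → C.
function letterGrade(points: number): "A" | "B" | "C" {
  if (points <= 13) return "A";
  if (points <= 27) return "B";
  return "C";
}
```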

Architecture Decisions

Why MCP (Model Context Protocol)?

MCP is becoming the standard way AI agents discover and invoke external tools. By implementing MCP rather than just a REST API, any MCP-compatible client — Claude Desktop, LangChain, CrewAI, OpenAI Assistants — can plug in with just a URL and API key. No SDK installation, no wrapper code.
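For clients that only speak stdio, a remote server like this is typically bridged with a tool such as mcp-remote. A Claude Desktop entry might look like the fragment below — exact config syntax varies by client, and the header flag is an assumption to verify against the mcp-remote docs:

```json
{
  "mcpServers": {
    "nyc-api": {
      "command": "npx",
      "args": [
        "-y", "mcp-remote", "https://nycapi.app/api/mcp",
        "--header", "Authorization: Bearer YOUR_API_KEY"
      ]
    }
  }
}
```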

Why Streamable HTTP Instead of SSE?

The MCP spec supports two transports: SSE (Server-Sent Events) and Streamable HTTP. I went with Streamable HTTP because the server is deployed on Vercel, which is serverless. SSE requires a persistent connection and server-side session state — neither of which works on serverless.

The implementation is stateless by design:

export async function POST(req: NextRequest) {
  const authResult = await validateApiKey(req, PRODUCT);
  if ("error" in authResult) {
    return authResult.error;
  }

  const server = createServer();
  const transport = new WebStandardStreamableHTTPServerTransport({
    sessionIdGenerator: undefined, // stateless — required for serverless
  });

  await server.connect(transport);
  return transport.handleRequest(req satisfies Request);
}

Each request creates a fresh server + transport pair. No cross-invocation state, no session management, no cleanup. It just works on Vercel's serverless functions.

Why Resources in Addition to Tools?

The MCP spec defines both tools (actions the agent can call) and resources (reference data the agent can read). I added five resources:

  • Capability guide — tells the agent what the server can and can't do
  • Input formatting — explains address formats, BBL structure, BIN ranges
  • Schema examples — sample responses so the agent knows what to expect
  • Coverage notes — which boroughs and data sources are available
  • Credit policy — how credits are consumed per tool call

These resources help agents make better tool-calling decisions without wasting credits on invalid queries.
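In MCP terms, each resource is addressed by a URI and surfaced through resources/list. A sketch of the descriptors (the URIs and scheme here are illustrative, not the server's actual resource list):

```typescript
// Illustrative resource descriptors — URIs are assumptions for the sketch,
// not the server's real resource identifiers.
const resources = [
  { uri: "nycapi://guide/capabilities", name: "Capability guide" },
  { uri: "nycapi://guide/input-formats", name: "Input formatting" },
  { uri: "nycapi://guide/schemas", name: "Schema examples" },
  { uri: "nycapi://guide/coverage", name: "Coverage notes" },
  { uri: "nycapi://guide/credits", name: "Credit policy" },
];
```

Because resources are read (not called), an agent can consult them without consuming credits before deciding which tool to invoke.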

The Data Pipeline

All data flows through NYC's Socrata Open Data API (SODA). The pipeline:

  1. Inbound query — agent sends an address or identifier
  2. Address normalization — shared middleware canonicalizes the input
  3. Parallel SODA queries — multiple datasets are queried simultaneously
  4. Response assembly — results are merged, scored, and structured
  5. Credit deduction — usage is tracked in Supabase

For data sources not available via SODA (ACRIS property sales, rent stabilization status), I use supplementary lookups with appropriate caching.
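A SODA query is just an HTTPS GET with SoQL parameters. For example, pulling open HPD violations for a BBL might look like the sketch below — the dataset id and column names are assumptions to verify against the NYC Open Data portal:

```typescript
// Build a SODA query URL. Dataset id and column names are assumptions
// for illustration — check the dataset's page on NYC Open Data.
function hpdViolationsUrl(bbl: string): string {
  const base = "https://data.cityofnewyork.us/resource/wvxf-dwi5.json";
  const params = new URLSearchParams({
    $where: `bbl='${bbl}' AND violationstatus='Open'`,
    $limit: "100",
  });
  return `${base}?${params}`;
}
```

Step 3's parallelism falls out naturally: each dataset gets its own URL like this, and the server fires them all with a single Promise.all.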

What I'd Do Differently

Address normalization is harder than you think. NYC addresses have edge cases that will break any naive parser — hyphenated Queens addresses (42-15 Crescent St), lettered avenues (Avenue A vs Ave A), and buildings with multiple valid addresses. I spent more time on the normalizer than any other component.

Start with fewer tools. Four tools is manageable, but I could have launched with just resolve_property_identifier + get_property_intelligence and validated demand before building the rest.

Credit-based pricing works for agents. Agents are bursty — they might make 50 calls in a minute during a due diligence workflow, then nothing for days. Per-credit pricing maps to this usage pattern better than flat monthly rates.

Try It

  • Server URL: https://nycapi.app/api/mcp
  • Auth: Bearer token (get a free API key at nycapi.app)
  • Free tier: 50 credits, no card required
  • Paid tiers: Starter ($29/1K credits), Growth ($99/5K), Scale ($249/15K)

If you're building agents that need to reason about physical locations in NYC — real estate, compliance, tenant advocacy, restaurant discovery — I'd love your feedback on the tool design.

What data would you want to see added?


Built by Matchup Labs. Stack: Next.js, TypeScript, Vercel, Supabase, Stripe.
