K.SLADE

How I Structured NYC's Open Data for AI Agents Using MCP

NYC gives away some of the best public data in the world. Property ownership, building violations, restaurant health inspections, tax assessments, complaints — all free, all public.

The problem? It's completely unusable by AI agents.

The Data Fragmentation Problem

NYC's data is scattered across at least six agencies, each with its own API, schema, and query patterns:

| Agency | What They Publish | Access Method |
| --- | --- | --- |
| DOF (Finance) | Ownership, tax, sales via ACRIS | SODA + web scrape |
| DOB (Buildings) | Permits, new building apps | SODA + DOB NOW BIS |
| HPD (Housing) | Violations, complaints, registrations | SODA + web |
| OATH/ECB | Administrative penalties | SODA |
| DOHMH (Health) | Restaurant inspections, grades | SODA |
| HCR | Rent stabilization status | FOIL requests |

An AI agent trying to answer a simple question like "is this a bad landlord?" would need to:

  1. Normalize the address (NYC addresses are notoriously inconsistent)
  2. Resolve it to a BBL (Borough-Block-Lot) and BIN (Building ID)
  3. Query DOF for ownership records
  4. Query HPD for housing violations
  5. Query DOB for building violations
  6. Query OATH/ECB for penalty history
  7. Merge all of that into a coherent response

No agent does this today. The friction is too high.
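To make the friction concrete, here's a sketch of what steps 3-7 look like once you have a resolved BBL. The per-agency query functions are stand-ins for real SODA calls, stubbed here with sample data:

```typescript
// Hypothetical sketch of the fan-out/merge an agent would need.
// The query functions below are stubs standing in for real agency APIs.
type Violation = { agency: string; class: string; open: boolean };

async function queryHpdViolations(bbl: string): Promise<Violation[]> {
  return [{ agency: "HPD", class: "C", open: true }]; // stub
}

async function queryDobViolations(bbl: string): Promise<Violation[]> {
  return [{ agency: "DOB", class: "B", open: false }]; // stub
}

async function queryOathPenalties(bbl: string): Promise<Violation[]> {
  return []; // stub
}

// Steps 4-6 in parallel, then step 7: merge into one coherent report.
async function landlordReport(bbl: string) {
  const [hpd, dob, oath] = await Promise.all([
    queryHpdViolations(bbl),
    queryDobViolations(bbl),
    queryOathPenalties(bbl),
  ]);
  const all = [...hpd, ...dob, ...oath];
  return {
    bbl,
    totalViolations: all.length,
    openViolations: all.filter((v) => v.open).length,
  };
}
```

Even this stubbed version hides the hard parts: each real agency API has its own auth, pagination, and field names.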

The Solution: One MCP Server, Four Tools

I built NYC API — an MCP server that aggregates NYC's public data into four tools that any AI agent can call:

resolve_property_identifier

Normalize any NYC address, BBL, or BIN to a canonical form. This is the critical first step — NYC addresses come in dozens of formats ("123 Main St", "123 MAIN STREET", "123 Main St.", "123 Main Street Apt 4B") and you need a canonical identifier to query anything downstream.
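A naive version of that canonicalization might look like the sketch below — the real resolver does far more (BBL/BIN resolution, hyphenated Queens house numbers, multi-address buildings), but this shows why the four formats above can converge on one string:

```typescript
// Naive normalizer sketch — illustrative only, not the server's actual logic.
const SUFFIXES: Record<string, string> = {
  street: "ST", "st.": "ST", st: "ST",
  avenue: "AVE", "ave.": "AVE", ave: "AVE",
  boulevard: "BLVD", road: "RD", place: "PL",
};

function normalizeAddress(raw: string): string {
  return raw
    .toUpperCase()
    .replace(/\s+(APT|UNIT|#)\s*\S+$/i, "") // drop trailing unit designators
    .split(/\s+/)
    .map((tok) => SUFFIXES[tok.toLowerCase()] ?? tok) // canonical suffixes
    .join(" ")
    .trim();
}
```

All four example formats ("123 Main St", "123 MAIN STREET", "123 Main St.", "123 Main Street Apt 4B") normalize to "123 MAIN ST" under this sketch.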

get_property_intelligence

Given a resolved identifier, returns ownership records, zoning classification, tax class, assessed values, sales history, liens, and rent stabilization status. One call replaces what used to be 3-4 separate agency lookups.
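As a rough shape of what a consolidated response looks like — field names here are my illustration, not the server's actual schema:

```typescript
// Hypothetical response shape for a consolidated property lookup.
// Field names are illustrative assumptions, not the real API schema.
interface PropertyIntelligence {
  bbl: string;
  owner: { name: string; type: "individual" | "llc" | "corp" };
  zoning: string;          // e.g. "R6B"
  taxClass: string;        // NYC tax classes 1-4
  assessedValue: number;   // USD
  lastSale?: { date: string; price: number };
  rentStabilized: boolean;
}

const example: PropertyIntelligence = {
  bbl: "3012340056",
  owner: { name: "EXAMPLE REALTY LLC", type: "llc" },
  zoning: "R6B",
  taxClass: "2",
  assessedValue: 1250000,
  lastSale: { date: "2019-06-12", price: 2400000 },
  rentStabilized: true,
};
```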

get_building_violations

DOB, HPD, and OATH/ECB violations with severity scoring and risk indicators. An agent can immediately tell whether a building has serious open violations or a clean record.
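Severity scoring can be as simple as weighting open violations by class. The weights below are illustrative (HPD classes: A = non-hazardous, B = hazardous, C = immediately hazardous), not the server's actual formula:

```typescript
// Sketch of a severity score — weights are illustrative assumptions.
const CLASS_WEIGHT: Record<string, number> = { A: 1, B: 3, C: 7 };

function riskScore(violations: { class: string; open: boolean }[]): number {
  return violations
    .filter((v) => v.open) // only open violations drive current risk
    .reduce((sum, v) => sum + (CLASS_WEIGHT[v.class] ?? 1), 0);
}
```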

get_restaurant_venue_intel

DOHMH health grades, inspection history, violation codes, and permit status. Useful for restaurant discovery agents, food safety research, and mixed-use property analysis.
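DOHMH grades derive from inspection points, where lower is better. The published cutoffs map points to letters roughly like this:

```typescript
// NYC restaurant letter grades from inspection points (lower is better):
// 0-13 → A, 14-27 → B, 28+ → C.
function letterGrade(points: number): "A" | "B" | "C" {
  if (points <= 13) return "A";
  if (points <= 27) return "B";
  return "C";
}
```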

Architecture Decisions

Why MCP (Model Context Protocol)?

MCP is becoming the standard way AI agents discover and invoke external tools. By implementing MCP rather than just a REST API, any MCP-compatible client — Claude Desktop, LangChain, CrewAI, OpenAI Assistants — can plug in with just a URL and API key. No SDK installation, no wrapper code.
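For clients that only speak stdio, a remote server like this is typically bridged with a tool such as mcp-remote. A Claude Desktop entry might look like the fragment below — exact config syntax varies by client, and the header flag is an assumption to verify against the mcp-remote docs:

```json
{
  "mcpServers": {
    "nyc-api": {
      "command": "npx",
      "args": [
        "-y", "mcp-remote", "https://nycapi.app/api/mcp",
        "--header", "Authorization: Bearer YOUR_API_KEY"
      ]
    }
  }
}
```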

Why Streamable HTTP Instead of SSE?

The MCP spec supports two transports: SSE (Server-Sent Events) and Streamable HTTP. I went with Streamable HTTP because the server is deployed on Vercel, which is serverless. SSE requires a persistent connection and server-side session state — neither of which works on serverless.

The implementation is stateless by design:

export async function POST(req: NextRequest) {
  const authResult = await validateApiKey(req, PRODUCT);
  if ("error" in authResult) {
    return authResult.error;
  }

  const server = createServer();
  const transport = new WebStandardStreamableHTTPServerTransport({
    sessionIdGenerator: undefined, // stateless — required for serverless
  });

  await server.connect(transport);
  return transport.handleRequest(req satisfies Request);
}

Each request creates a fresh server + transport pair. No cross-invocation state, no session management, no cleanup. It just works on Vercel's serverless functions.

Why Resources in Addition to Tools?

The MCP spec defines both tools (actions the agent can call) and resources (reference data the agent can read). I added five resources:

  • Capability guide — tells the agent what the server can and can't do
  • Input formatting — explains address formats, BBL structure, BIN ranges
  • Schema examples — sample responses so the agent knows what to expect
  • Coverage notes — which boroughs and data sources are available
  • Credit policy — how credits are consumed per tool call

These resources help agents make better tool-calling decisions without wasting credits on invalid queries.
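In MCP terms, each resource is addressed by a URI and surfaced through resources/list. A sketch of the descriptors (the URIs and scheme here are illustrative, not the server's actual resource list):

```typescript
// Illustrative resource descriptors — URIs are assumptions for the sketch,
// not the server's real resource identifiers.
const resources = [
  { uri: "nycapi://guide/capabilities", name: "Capability guide" },
  { uri: "nycapi://guide/input-formats", name: "Input formatting" },
  { uri: "nycapi://guide/schemas", name: "Schema examples" },
  { uri: "nycapi://guide/coverage", name: "Coverage notes" },
  { uri: "nycapi://guide/credits", name: "Credit policy" },
];
```

Because resources are read (not called), an agent can consult them without consuming credits before deciding which tool to invoke.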

The Data Pipeline

All data flows through NYC's Socrata Open Data API (SODA). The pipeline:

  1. Inbound query — agent sends an address or identifier
  2. Address normalization — shared middleware canonicalizes the input
  3. Parallel SODA queries — multiple datasets are queried simultaneously
  4. Response assembly — results are merged, scored, and structured
  5. Credit deduction — usage is tracked in Supabase

For data sources not available via SODA (ACRIS property sales, rent stabilization status), I use supplementary lookups with appropriate caching.
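A SODA query is just an HTTPS GET with SoQL parameters. For example, pulling open HPD violations for a BBL might look like the sketch below — the dataset id and column names are assumptions to verify against the NYC Open Data portal:

```typescript
// Build a SODA query URL. Dataset id and column names are assumptions
// for illustration — check the dataset's page on NYC Open Data.
function hpdViolationsUrl(bbl: string): string {
  const base = "https://data.cityofnewyork.us/resource/wvxf-dwi5.json";
  const params = new URLSearchParams({
    $where: `bbl='${bbl}' AND violationstatus='Open'`,
    $limit: "100",
  });
  return `${base}?${params}`;
}
```

Step 3's parallelism falls out naturally: each dataset gets its own URL like this, and the server fires them all with a single Promise.all.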

What I'd Do Differently

Address normalization is harder than you think. NYC addresses have edge cases that will break any naive parser — hyphenated Queens addresses (42-15 Crescent St), lettered avenues (Avenue A vs Ave A), and buildings with multiple valid addresses. I spent more time on the normalizer than any other component.

Start with fewer tools. Four tools is manageable, but I could have launched with just resolve_property_identifier + get_property_intelligence and validated demand before building the rest.

Credit-based pricing works for agents. Agents are bursty — they might make 50 calls in a minute during a due diligence workflow, then nothing for days. Per-credit pricing maps to this usage pattern better than flat monthly rates.

Try It

  • Server URL: https://nycapi.app/api/mcp
  • Auth: Bearer token (get a free API key at nycapi.app)
  • Free tier: 50 credits, no card required
  • Paid tiers: Starter ($29/1K credits), Growth ($99/5K), Scale ($249/15K)

If you're building agents that need to reason about physical locations in NYC — real estate, compliance, tenant advocacy, restaurant discovery — I'd love your feedback on the tool design.

What data would you want to see added?


Built by Matchup Labs. Stack: Next.js, TypeScript, Vercel, Supabase, Stripe.
