DEV Community: GraphPilot

Resolver-level GraphQL security is already too late

Kay Schecker — Mon, 22 Jun 2026 00:00:00 +0000

"We added a depth limit in the resolver, so we're covered." It's one of the first things a team reaches for the moment they realize a GraphQL endpoint accepts arbitrarily shaped queries. It feels like the responsible move, and it's not wrong, exactly. It's just placed at the wrong layer, and that placement is the whole problem. By the time a resolver runs, the request has already crossed your perimeter, been parsed, been planned, and started touching your data. The check fires, but the expensive part already happened. As a first line of defense, the resolver is already too late.

This post is about why your first line of GraphQL defense can't live inside your application, what does still belong there, and what changes when the boundary moves to the edge, in front of the origin.

GraphQL breaks the assumptions perimeter security is built on

Traditional API security leans heavily on the HTTP envelope. A WAF or a CDN looks at the method, the URL, the headers, and makes a decision: this path is rate limited, that one is blocked, this method isn't allowed here. The shape of the request tells you a lot about its intent.

GraphQL throws that away. You have one endpoint, usually /graphql, and almost everything goes through POST with the real request sitting in the body:

POST /graphql
Content-Type: application/json

{
  "query": "
    {
      user(id: 1) {
        friends(first: 100) {
          friends(first: 100) {
            friends(first: 100) {
              name
            }
          }
        }
      }
    }
  "
}

To anything that doesn't parse GraphQL, that request is indistinguishable from a harmless one. Same method, same URL, same headers. A WAF sees a POST to /graphql and shrugs. The intent, and the danger, is entirely in the body. So just like with caching, before you can secure anything you have to teach the boundary to look inside the request. And the moment you accept that, the natural question is where that inspection should happen.

The attacks that actually matter

GraphQL's flexibility is the product. It's also the attack surface. A handful of attack classes show up again and again, and in every case the query body is what reveals them.

Complexity and depth denial-of-service. GraphQL lets a client ask for deeply nested, recursive relationships in a single query. user → friends → friends → friends and so on. Each level multiplies the work. A short, innocent-looking query can expand into millions of resolver calls and a database meltdown. There's no oversized payload to catch; the request is tiny. The cost is in how it unfolds.
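To put a number on "each level multiplies the work", here is a minimal sketch that walks the nested query from earlier in the post and computes a worst-case count of resolved values. The `Selection` class and `worst_case_nodes` function are illustrative names, not part of any GraphQL library, and the estimate assumes every list comes back full.

```python
from dataclasses import dataclass, field


@dataclass
class Selection:
    """One selected field. `first` is the requested page size (1 for single values)."""
    name: str
    first: int = 1
    children: list["Selection"] = field(default_factory=list)


def worst_case_nodes(sel: Selection) -> int:
    """Upper bound on values resolved for this field and everything under it."""
    per_item = 1 + sum(worst_case_nodes(child) for child in sel.children)
    return sel.first * per_item


# The nested query from earlier in the post, written out by hand (no parser involved).
name = Selection("name")
level3 = Selection("friends", first=100, children=[name])
level2 = Selection("friends", first=100, children=[level3])
level1 = Selection("friends", first=100, children=[level2])
query = Selection("user", children=[level1])

print(worst_case_nodes(query))  # 2010101
```

That is 1 user, then 100, 10,000, and 1,000,000 friend objects, plus 1,000,000 `name` values, all from a request body of a few hundred bytes.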

Aliasing and batching abuse. GraphQL aliases let a client request the same field many times under different names in one operation. Combine that with query batching and an attacker can pack hundreds of attempts into a single HTTP request: brute-forcing a coupon or gift-card code through an aliased lookup, enumerating records by guessing IDs, or amplifying one expensive query into hundreds. Per-request rate limiting counts that as one request and waves it through.
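A rough sketch of why one HTTP request is not one attempt: the function below counts batched operations and aliased fields in a request body. It assumes batching is sent as a JSON array of operations (a common convention, not a guarantee), uses a small regex tokenizer instead of a real GraphQL parser, and the `redeem` field is made up for the example.

```python
import json
import re

# Strings are matched first so their contents are never mistaken for syntax.
TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|[A-Za-z_]\w*|[(){}:]')


def count_aliases(query: str) -> int:
    """Count `alias: field` pairs, skipping `name: value` pairs inside argument lists."""
    tokens = TOKEN.findall(query)
    paren_depth = 0
    aliases = 0
    for i, tok in enumerate(tokens):
        if tok == "(":
            paren_depth += 1
        elif tok == ")":
            paren_depth -= 1
        elif tok == ":" and paren_depth == 0 and i > 0:
            # Outside parentheses, a name followed by a colon is an alias.
            if re.match(r"[A-Za-z_]", tokens[i - 1]):
                aliases += 1
    return aliases


def request_weight(body: str) -> tuple[int, int]:
    """Return (operations, aliases) for a single or batched request body."""
    payload = json.loads(body)
    operations = payload if isinstance(payload, list) else [payload]
    aliases = sum(count_aliases(op.get("query", "")) for op in operations)
    return len(operations), aliases


body = json.dumps([
    {"query": '{ a: redeem(code: "A:1") { ok } b: redeem(code: "B:2") { ok } }'},
    {"query": '{ c: redeem(code: "C:3") { ok } }'},
])
print(request_weight(body))  # (2, 3)
```

A per-request limiter sees one request here; a limiter that understands the body sees two operations carrying three guesses.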

Introspection leakage. GraphQL ships with introspection, a query that returns the entire schema. Wonderful in development, a gift to an attacker in production: every type, every field, every mutation, handed over as a map of where to probe next.
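As a minimal sketch of introspection control, the check below flags queries that select the `__schema` or `__type` meta-fields, which are the introspection entry points defined by the GraphQL specification. It deliberately leaves `__typename` alone, since clients commonly request it for ordinary reasons. The tokenizer is a simplification standing in for a real parser, and `requests_introspection` is an illustrative name.

```python
import re

# Strings and comments are consumed as whole tokens so their contents are ignored.
TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|#[^\n]*|[A-Za-z_]\w*')

INTROSPECTION_FIELDS = {"__schema", "__type"}


def requests_introspection(query: str) -> bool:
    """True if the query text names an introspection entry point outside strings/comments."""
    return any(tok in INTROSPECTION_FIELDS for tok in TOKEN.findall(query))


print(requests_introspection("{ __schema { types { name } } }"))      # True
print(requests_introspection("{ user(id: 1) { __typename name } }"))  # False
```

In practice you would likely still allow introspection for trusted callers, such as internal tooling, which is a policy decision this sketch does not cover.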

Broken field-level authorization. This is the quiet one. The query is cheap, well formed, and perfectly legitimate in shape. It just asks for a field the viewer shouldn't see. Authorization in GraphQL is per field and per context, and there is no framework default that gets it right for you. It's where most real GraphQL data exposure actually lives.

Notice what these have in common: none of them is visible from the outside, and none of them is a malformed request. They're valid GraphQL. You can only catch them by understanding the query.

Why resolvers are the wrong place to catch them

So you reach for resolver code, because that's where you understand the query best. Here's why it disappoints.

It's too late. A resolver runs during execution. To reach it, the request already crossed your network boundary, got parsed, got planned, and in many cases already started resolving sibling fields. A depth limit enforced in a resolver is a smoke detector that goes off after the fire reaches the kitchen. For a complexity DoS, the work you were trying to prevent is the work that got you to the resolver in the first place.

It's too scattered. Authorization and limits implemented field by field, service by service, drift. One team checks a permission, another forgets, a third checks a slightly different one. The same rule exists in twelve places in twelve shapes, and the gap is wherever someone was in a hurry. Security that depends on every resolver author remembering the same thing is security with a long tail of holes.

It's not centrally observable. When the defense is smeared across the resolver graph, there's no single place that can answer "how many queries did we reject in the last hour, and why." You can't tune, alert on, or audit a policy that doesn't exist as a policy anywhere.

And underneath all three sits the genuinely hard part: to reject a query safely, you have to estimate what it will cost before you run it. That's not a resolver's job. By the time a resolver could weigh in, you've already started paying.
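The cheapest pre-execution signal is nesting depth, which can be read off the query text without running anything. The sketch below counts nested selection sets while ignoring braces inside arguments and strings. It is an assumption-laden simplification: it does not follow fragment spreads, and `MAX_DEPTH` is an arbitrary example value.

```python
import re

# Strings are matched as whole tokens so braces inside them are never counted.
TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|[(){}]')


def selection_depth(query: str) -> int:
    """Deepest level of nested selection sets, ignoring braces inside argument lists."""
    depth = max_depth = paren_depth = 0
    for tok in TOKEN.findall(query):
        if tok == "(":
            paren_depth += 1
        elif tok == ")":
            paren_depth -= 1
        elif paren_depth == 0 and tok == "{":
            depth += 1
            max_depth = max(max_depth, depth)
        elif paren_depth == 0 and tok == "}":
            depth -= 1
    return max_depth


QUERY = """
{
  user(id: 1) {
    friends(first: 100) {
      friends(first: 100) {
        friends(first: 100) { name }
      }
    }
  }
}
"""

MAX_DEPTH = 4  # hypothetical policy value
print(selection_depth(QUERY))              # 5
print(selection_depth(QUERY) > MAX_DEPTH)  # True, so reject before execution
```

The point of the placement is that this runs on text, before planning or any resolver call, so a rejected query costs only the scan.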

Where it gets nasty

This is the part that doesn't fit on a feature slide:

Static cost vs. real cost. You estimate a query's cost by walking its shape and assigning weights, before execution. But the true cost depends on data: how many friends that user actually has, how big that list really is. Estimate too loosely and you wave through expensive queries; too strictly and you reject legitimate ones. Tuning that boundary is ongoing work, not a constant.
Rate limiting by cost, not count. "100 requests per minute" is meaningless when one request can be a thousand times heavier than another. Useful GraphQL rate limiting is budget-based: you spend from a cost allowance, and a heavy query spends more. That means you need the cost estimate and a shared, fast accounting of spend.
Authorization that's schema-aware. Field-level auth has to understand types, interfaces, and how a field is reached. The same field may be allowed via one path and not another. This is real design, and parts of it (deep business rules) genuinely do belong in your application. The point isn't that the edge does all auth; it's that the edge enforces the structural, declarative parts consistently so resolvers aren't the only thing standing between a query and your data.
Persisted operations as an allowlist. One strong mitigation is to stop accepting arbitrary queries at all: register known operations ahead of time and reject anything not on the list. It closes most of the surface above in one move. It also constrains your clients and needs a deploy-time workflow, so it's a decision, not a free switch. (It pairs neatly with persisted queries on the performance side.)
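The persisted-operations idea in the last item can be sketched in a few lines: operations are registered ahead of time, clients send only an identifier, and anything unknown is refused. Using a SHA-256 hash of the query text as the identifier is an assumption made for this example, and `OperationAllowlist` is a hypothetical name.

```python
import hashlib
from typing import Iterable, Optional


def operation_id(query: str) -> str:
    """Identifier for a registered operation: SHA-256 of its exact text."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


class OperationAllowlist:
    """Maps known operation ids to query text; everything else is rejected."""

    def __init__(self, registered_queries: Iterable[str]) -> None:
        self._by_id = {operation_id(q): q for q in registered_queries}

    def resolve(self, op_id: str) -> Optional[str]:
        """Return the registered query for this id, or None if it is not on the list."""
        return self._by_id.get(op_id)


# Registered at deploy time (hypothetical operation).
PROFILE = "query Profile($id: ID!) { user(id: $id) { name } }"
allowlist = OperationAllowlist([PROFILE])

print(allowlist.resolve(operation_id(PROFILE)) == PROFILE)          # True
print(allowlist.resolve(operation_id("{ __schema { types { name } } }")))  # None
```

Because the client never gets to send query text, deep nesting, alias floods, and introspection are rejected by the same lookup, which is why the tradeoff is in workflow rather than in code.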

None of this is insurmountable. Taken together, though, it's why "we'll just validate queries in our gateway" turns into a quarter of work.

Why the edge is the right layer

Put all of this in front of the origin and the picture gets simpler in exactly the ways that matter.

[Figure: ... in a resolver it's caught only after the cost has been paid.]

It's early. The query is parsed and priced before the origin does any work. A rejected query costs the edge a cheap inspection and costs your database nothing. That's the difference between absorbing an attack and forwarding it to the thing you most need to protect.

It's one place. Depth and complexity budgets, cost-based rate limits, introspection control, and the structural parts of authorization are declared once, in front of every service, instead of reimplemented in each. One policy, consistently applied, is something you can actually reason about.

It's observable. A single chokepoint is where you can count rejections, alert on spikes, and audit what your policy did. You can't tune a defense you can't see.

The honest tradeoff: this logic now runs on the hot path of every request, so it has to be cheap. Spend too much time inspecting each request and you've traded an availability problem for a latency one. Getting that balance right, fast inspection that doesn't become its own bottleneck, is exactly the engineering that makes edge security hard to do well.

What good looks like

Whether you build this or buy it, a serious GraphQL security layer has a few non-negotiables worth insisting on:

Schema-driven policy. Limits, scopes, and field rules declared alongside the schema, so they're reviewable, versioned, and diffable, not buried in scattered code.
Depth and complexity budgets enforced before execution. A query is priced and accepted or rejected up front, not discovered to be expensive halfway through running.
Cost-based rate limiting. Spend from a budget, where a heavy query costs more, instead of counting all requests as equal.
Introspection control and field-level authorization applied by default, so production doesn't hand out its own map and personalized fields aren't exposed by accident.
Observability for what you blocked. Rejected queries, reasons, and rates, visible. "It feels secure" is not a metric.
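The cost-based rate limiting item above can be sketched as a token bucket that is charged by estimated query cost instead of by request count. Everything here is an assumption for illustration: the `CostBudget` name, the capacity and refill numbers, and especially the in-memory dictionary, which stands in for the shared, fast store a real deployment would need.

```python
import time
from typing import Callable, Dict, Tuple


class CostBudget:
    """Per-client budget: queries spend their estimated cost, and the budget refills over time."""

    def __init__(self, capacity: float, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self._state: Dict[str, Tuple[float, float]] = {}  # client -> (remaining, last_seen)

    def try_spend(self, client_id: str, cost: float) -> bool:
        """Charge `cost` to the client if the budget allows it; return whether it was accepted."""
        now = self.clock()
        remaining, last_seen = self._state.get(client_id, (self.capacity, now))
        # Refill for the time elapsed since the last request, capped at capacity.
        remaining = min(self.capacity, remaining + (now - last_seen) * self.refill_per_second)
        allowed = cost <= remaining
        if allowed:
            remaining -= cost
        self._state[client_id] = (remaining, now)
        return allowed


budget = CostBudget(capacity=1000, refill_per_second=10)
print(budget.try_spend("client-a", 900))  # True: a heavy query fits the fresh budget
print(budget.try_spend("client-a", 200))  # False: not enough budget left right now
```

A cheap query and an expensive one now draw down the same allowance at different rates, which is the property a flat "100 requests per minute" rule cannot express.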

Closing

A depth limit in a resolver isn't wrong; it's just standing in the wrong room. GraphQL's dangerous requests look identical from the outside and only reveal themselves inside the query, and by the time a resolver reads that query, the request is already past the boundary you wanted to defend. The controls that actually hold (cost estimation, budget-based limits, introspection control, consistent field authorization) want to live in front of the origin, declared once and watched in one place.

That's part of what we're building GraphPilot to do. It's not open yet; if you run GraphQL in production and want early access, or want to shape it as a design partner, you can join the waitlist. We'd genuinely rather hear which of these attacks is keeping you up than guess, so if you want to compare notes, jump into the comments wherever you found this or contact us.

Why you can't just put a CDN in front of GraphQL

Kay Schecker — Tue, 16 Jun 2026 16:01:19 +0000

It's one of the most obvious ideas the moment a GraphQL API starts buckling under load: "Can't we just put a CDN in front of it?" The question comes up on almost every team scaling GraphQL, whether in architecture reviews, Stack Overflow threads, or Slack. It sounds reasonable: CDNs made REST fast and cheap to scale, so surely the same trick works one layer up. Then you try it, and within an hour you understand why GraphQL caching is its own product category instead of a config flag. This post is about why.

The short version: GraphQL breaks almost every assumption that HTTP caching is built on. And for good reason: that same flexibility is what makes GraphQL so powerful in the first place. Some of the resulting problems you can solve in an afternoon. One of them, invalidation, is genuinely hard, and it's where most home-grown attempts quietly fall apart. But it is solvable with the right tooling.

The first challenge: one endpoint, one verb, everything in the body

HTTP caching is keyed on the request. A CDN looks at the method and the URL, maybe a header or two, and decides whether it has seen this exact thing before. GET /users/1 is trivially cacheable: the URL is the cache key, and Cache-Control tells the CDN how long to keep it.

GraphQL throws that model out. You have a single endpoint, usually /graphql, and almost everything goes through POST with the actual query sitting in the request body:

POST /graphql
Content-Type: application/json

{
  "query": "
    {
      user(id: 1) {
        name
        posts {
          title
        }
      }
    }
  "
}

To a stock CDN, every request to your API looks identical: same method, same URL. It has no idea that one body asks for a user's name and the next triggers a 50-table aggregation. It can't tell two requests apart, so it can't safely cache either. The thing that made REST cacheable, namely a meaningful URL, is gone.

So before you can cache anything, you have to teach the cache to look inside the request.

Making GraphQL cacheable at all (the easy 80%)

This part is mostly solved, and if you only need read caching you can get a long way:

Normalize the query into a stable key. Two requests that ask for the same data can look different on the wire: different whitespace, reordered fields, variables inline versus separated. You parse the query, canonicalize it (sort fields, extract variables, strip noise), and hash the result. Now logically identical requests collapse to the same cache key.
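Here is a deliberately partial sketch of that normalization. It makes the key insensitive to whitespace, commas (which GraphQL treats as insignificant), comments, and variable ordering, then hashes the result. It does not sort fields or lift inline values into variables, since both need a real parser; `cache_key` and the token pattern are illustrative, not taken from any library.

```python
import hashlib
import json
import re
from typing import Any, Dict, Optional

# Strings, comments, names, numbers, the spread operator, then any other punctuation.
TOKEN = re.compile(
    r'"(?:\\.|[^"\\])*"|#[^\n]*|[A-Za-z_]\w*'
    r'|-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?|\.\.\.|[^\s,]'
)


def cache_key(query: str, variables: Optional[Dict[str, Any]] = None) -> str:
    """Hash of the query's token stream plus its variables in sorted-key JSON form."""
    tokens = [t for t in TOKEN.findall(query) if not t.startswith("#")]
    canonical_query = " ".join(tokens)
    canonical_vars = json.dumps(variables or {}, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{canonical_query}\n{canonical_vars}".encode("utf-8")).hexdigest()


compact = "{ user(id: 1) { name, posts { title } } }"
spread_out = """
{
  user(id:1){
    name
    posts{title}
  }
}  # same query, different formatting
"""
print(cache_key(compact) == cache_key(spread_out))  # True
```

Tokenizing instead of stripping whitespace with a blanket regex matters: string contents stay untouched, and `[10, 0]` does not collapse into the same key as `[100]`.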

Get it past the CDN layer. Because bodies and POST are awkward to cache, you lean on GET requests plus persisted queries: the client sends a hash of a known query instead of the full text, and that hash becomes part of a cacheable URL. Automatic Persisted Queries (APQ) are the common flavor.
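A sketch of what the client side of that looks like, following the Automatic Persisted Queries convention of sending a `persistedQuery` extension with a `version` and a `sha256Hash`. The endpoint URL is a placeholder, `persisted_query_url` is an illustrative helper, and the fallback step (the server reporting an unknown hash and the client retrying with the full query text) is not shown.

```python
import hashlib
import json
from typing import Any, Dict, Optional
from urllib.parse import parse_qs, urlencode, urlsplit


def persisted_query_url(endpoint: str, query: str,
                        variables: Optional[Dict[str, Any]] = None) -> str:
    """Build a GET URL that carries the query's hash instead of its text."""
    sha = hashlib.sha256(query.encode("utf-8")).hexdigest()
    extensions = {"persistedQuery": {"version": 1, "sha256Hash": sha}}
    params = {
        # Sorted keys and tight separators keep the URL byte-identical across clients.
        "extensions": json.dumps(extensions, sort_keys=True, separators=(",", ":")),
        "variables": json.dumps(variables or {}, sort_keys=True, separators=(",", ":")),
    }
    return f"{endpoint}?{urlencode(params)}"


QUERY = "{ user(id: 1) { name posts { title } } }"
url = persisted_query_url("https://api.example.com/graphql", QUERY)

# Decode it again to show what the CDN and origin actually receive.
decoded = json.loads(parse_qs(urlsplit(url).query)["extensions"][0])
print(decoded["persistedQuery"]["version"])  # 1
```

The URL is now a stable function of the operation and its variables, which is exactly what a CDN needs in order to treat it as a cache key.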

Assign TTLs from the schema. Not all data ages the same way. A product catalog can be stale for minutes; an account balance cannot. You annotate types and fields with a max age and let the cache apply per-type, per-field TTLs instead of one blunt number for the whole response.
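One common way to turn per-type max ages into a single response TTL is to take the minimum over everything the response contains, so the most volatile piece decides. The sketch below does that per type only. It assumes responses include `__typename`, and the type names and max-age values are invented for the example.

```python
from typing import Any, List

# Hypothetical per-type max ages, as they might be declared alongside the schema.
MAX_AGE_SECONDS = {"Product": 300, "Review": 60, "Account": 0}
DEFAULT_MAX_AGE = 0  # unknown types are treated as uncacheable


def response_ttl(data: Any) -> int:
    """Smallest max age among all typed objects found anywhere in the response."""
    ages: List[int] = []

    def walk(node: Any) -> None:
        if isinstance(node, dict):
            typename = node.get("__typename")
            if typename is not None:
                ages.append(MAX_AGE_SECONDS.get(typename, DEFAULT_MAX_AGE))
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(data)
    return min(ages, default=DEFAULT_MAX_AGE)


catalog = {"product": {"__typename": "Product", "id": 7,
                       "reviews": [{"__typename": "Review", "id": 1}]}}
print(response_ttl(catalog))  # 60
```

Defaulting unknown types to zero is the conservative choice: forgetting an annotation costs hit rate, not correctness.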

Do those three things and you have a working read cache. This is the part vendors demo and the part you can build yourself. It is not the part that hurts.

The hard part: invalidation

TTL-only caching forces an unpleasant trade-off. Short TTLs mean low hit rates; you're barely caching. Long TTLs mean you serve stale data, and "the dashboard showed the old number for five minutes" is the kind of bug that can erode trust fast. What you actually want is to cache aggressively and drop a cached response the instant the underlying data changes.

Here's where GraphQL's flexibility turns against you. In REST, a write maps cleanly to a resource: PUT /users/1 invalidates the cache entry for /users/1. There's a URL to purge. In GraphQL there's no URL, and worse: a single mutation can invalidate fragments of many unrelated cached responses. Change one user's name and you've potentially staled every cached query that embedded that user: a profile page, a comment list, a search result, an admin table, each cached under a different key.

You can't purge by URL because there isn't one. So you invalidate by what's inside the responses instead.

The standard approach is surrogate keys (also called cache tags). When you cache a response, you walk it and record every entity it contains, typically __typename + id, e.g. User:1, Post:42. Those become tags attached to the cached entry. When a mutation comes through, you figure out which entities it touched and purge every cached entry tagged with them. User:1 changes → drop everything tagged User:1, regardless of which query produced it.
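The tagging and purging described above can be sketched with two dictionaries: one holding cached responses, one mapping each tag to the cache keys that carry it. `TaggedCache` and `entity_tags` are illustrative names, and the sketch assumes every entity in a response exposes both `__typename` and `id`.

```python
from collections import defaultdict
from typing import Any, Dict, Optional, Set


def entity_tags(node: Any) -> Set[str]:
    """Collect `Typename:id` for every entity found anywhere in a response."""
    tags: Set[str] = set()
    if isinstance(node, dict):
        if "__typename" in node and "id" in node:
            tags.add(f'{node["__typename"]}:{node["id"]}')
        for value in node.values():
            tags |= entity_tags(value)
    elif isinstance(node, list):
        for item in node:
            tags |= entity_tags(item)
    return tags


class TaggedCache:
    def __init__(self) -> None:
        self._entries: Dict[str, Any] = {}
        self._keys_by_tag: Dict[str, Set[str]] = defaultdict(set)

    def put(self, key: str, response: Any) -> None:
        self._entries[key] = response
        for tag in entity_tags(response):
            self._keys_by_tag[tag].add(key)

    def get(self, key: str) -> Optional[Any]:
        return self._entries.get(key)

    def purge(self, tag: str) -> int:
        """Drop every entry carrying this tag; return how many were actually removed."""
        keys = self._keys_by_tag.pop(tag, set())
        # Other tags may still point at these keys; that only causes harmless no-op purges.
        return sum(1 for key in keys if self._entries.pop(key, None) is not None)


cache = TaggedCache()
cache.put("profile", {"user": {"__typename": "User", "id": 1, "name": "Ada"}})
cache.put("comments", {"post": {"__typename": "Post", "id": 42,
                                "author": {"__typename": "User", "id": 1}}})
cache.put("other", {"user": {"__typename": "User", "id": 2, "name": "Bo"}})

print(cache.purge("User:1"))  # 2
print(cache.get("other") is not None)  # True
```

Two entries cached under unrelated keys go away with one purge call, which is the whole appeal. Note that the lookup is by tag, so an entity that appears in no cached response yet has nothing to purge.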

Conceptually clean. The trouble is in everything the clean version ignores.

Where it gets nasty

This is the part nobody puts in the marketing copy:

Lists and pagination. A mutation creates a new post. The new entity has an ID that doesn't appear in any cached response yet, so tag-based purging can't find the lists it should now appear in. "Invalidate the entity" doesn't help when the problem is "a collection should have grown." You end up needing coarser, list-level invalidation, which claws back some of the hit rate you fought for.
Mutations that don't return the changed entity. Your tagging logic learns which entities to purge by inspecting payloads. A mutation that returns just { success: true } tells you nothing about what it changed. Now you're guessing, or you're forcing schema conventions on every mutation.
Derived and aggregate fields. commentCount, computed totals, or "is this in stock" depend on entities that aren't directly named in the response. The thing that changed and the field that's now wrong don't share an ID.
Authorization and personalization. The same query returns different data per viewer. Cache it naively and you leak one user's data to another. That's a security bug, not a performance one. Cache it correctly and your key space explodes by user, gutting your hit rate. Deciding what's safely shared versus per-viewer is a design problem with no universal answer.
Partial responses and errors. GraphQL can return data and errors in the same 200 response. Do you cache a partial result? A response that succeeded for three fields and failed for one? There's no HTTP status to lean on.
Schema changes. Deploy a new schema and yesterday's cached responses may no longer match today's shape. Your cache has to know when its assumptions expired.
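For the authorization and personalization item above, one possible shape of the shared-versus-per-viewer decision is to make it part of the cache key. The sketch assumes you already know which types a query touches (for example from static analysis against the schema), and the `PER_VIEWER_TYPES` set and key format are invented for illustration.

```python
from typing import Optional, Set

# Hypothetical: types whose data differs per viewer and must never be shared.
PER_VIEWER_TYPES = {"Account", "Cart"}


def scoped_cache_key(query_key: str, touched_types: Set[str],
                     viewer_id: Optional[str]) -> Optional[str]:
    """Return a cache key scoped as shared or per-viewer, or None if caching is unsafe."""
    if not (touched_types & PER_VIEWER_TYPES):
        return f"shared:{query_key}"
    if viewer_id is None:
        # Personalized data with no known viewer: refuse to cache instead of guessing.
        return None
    return f"viewer:{viewer_id}:{query_key}"


print(scoped_cache_key("abc123", {"Product", "Review"}, "u1"))  # shared:abc123
print(scoped_cache_key("abc123", {"Product", "Cart"}, "u1"))    # viewer:u1:abc123
print(scoped_cache_key("abc123", {"Cart"}, None))               # None
```

The key-space explosion the item warns about is visible here: every query that touches a per-viewer type gets one entry per viewer, so keeping that set small is what protects the hit rate.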

The good news: none of these is insurmountable. But together they're why "we'll just add caching to our GraphQL gateway" turns into a quarter of work.

Why the edge makes it harder, and why it's still worth it

Everything above is true even with a single, central cache. Push it to the edge, meaning many points of presence close to users, and you inherit a distributed-systems problem on top of the GraphQL problem.

Now your cache state is spread across many locations. A purge isn't an instant local delete; it has to propagate, and during that window different users can get different answers. A write in one region races against a read in another. The index that maps entities to cached responses has to be consistent enough to be trustworthy and fast enough to consult on the hot path, because at the edge you can't burn real compute per request without giving back the latency you came for.

So why bother? Because when it works, you move read traffic off your origin entirely and serve it from near the user, and your database stops being the thing that falls over at peak. The payoff is real. It's just that the gap between "naive TTL cache" and "correct, invalidating, edge-distributed cache" is exactly where the engineering lives.

What good looks like

Whether you build this or buy it, these are the properties worth insisting on:

Schema-driven cache configuration. Cacheability, TTLs, and scoping declared alongside the schema, so they're reviewable, versioned, and diffable, not scattered across ad-hoc rules.
Entity-level surrogate keys with a real purge API. Tagging plus a way to say "drop everything touching User:1" on demand, not just on a timer.
Explicit auth scoping. A deliberate decision about what's shared versus per-viewer, enforced by default. Personalized data should never be cacheable by accident.
Observability. Hit rate, staleness, and purge lag, visible. You cannot tune an invalidation strategy you can't measure, and "it feels faster" is not a metric.

Closing

GraphQL caching is easy to start and hard to finish. The cacheable-key part is a weekend; correct invalidation across personalized, paginated, edge-distributed responses is why this becomes its own product category and not a feature you flip on in passing. If you've shipped a GraphQL API and watched your origin strain under read traffic, you've felt the pull toward the edge, and probably the pain of doing it right.

That's exactly the problem we're building GraphPilot to solve. It's not open yet. I'd genuinely rather hear which of the edge cases above is hurting you than guess, so drop a comment below: which one bites hardest in your setup? And if you run GraphQL in production and want early access, you can join the waitlist.