xu xu

The Firestore JOIN Trap: What Google's New Pipelines API Costs You That Nobody's Talking About

Your Firebase function is throwing a Maximum batch size exceeded error for the third time this week. You've got two collections — orders and customers — that need to be joined for that dashboard query. The traditional workaround is to duplicate customerName into every orders document. But you're at 50,000 documents now, and that denormalized field is already stale in 12% of your records.

You've heard rumors about the Firestore Pipelines API. A Japanese developer named tomoasleep just published benchmarks on Qiita showing cross-collection JOINs working in production. You've been waiting for this moment.

Stop. Before you refactor your entire data layer around this, I spent a week testing the same API under load. Here's what Google's documentation doesn't tell you.

What the Pipelines API Actually Does

The Firestore Pipelines API (announced at Google Cloud Next '26) allows you to perform JOIN-like operations across multiple collections without duplicating data. Instead of embedding customerName in every order document, you can query across orders and customers in a single pipeline.
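
To make "JOIN-like" concrete, here's a minimal sketch of the result shape a join produces, using plain in-memory arrays. This is not the Pipelines API and makes no Firestore calls; the sample records and the joinOrdersWithCustomers helper are hypothetical, and the field names (customerId, customerName) are assumptions based on the orders/customers example above.

```javascript
// Illustrative in-memory data only; no Firestore calls are made here.
const customers = [
  { id: 'c1', customerName: 'Ada' },
  { id: 'c2', customerName: 'Grace' },
];
const orders = [
  { id: 'o1', customerId: 'c1', total: 40 },
  { id: 'o2', customerId: 'c2', total: 15 },
  { id: 'o3', customerId: 'c9', total: 7 }, // no matching customer
];

// Hash join: index one side by key, then look up each row of the other side.
// (Hypothetical helper, not part of any Firebase SDK.)
function joinOrdersWithCustomers(orderDocs, customerDocs) {
  const byId = new Map(customerDocs.map((c) => [c.id, c]));
  return orderDocs.map((order) => {
    const customer = byId.get(order.customerId);
    // Orders without a matching customer get customerName: null.
    return { ...order, customerName: customer ? customer.customerName : null };
  });
}

const joined = joinOrdersWithCustomers(orders, customers);
console.log(joined);
```

The point of the sketch: customerName lives in exactly one place and is attached at query time, which is the same trade the pipeline makes on the server side.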

The Qiita author tested this against a real-world scenario: a document management system where folders needed to be joined with files to display folder metadata alongside file listings. Their benchmark showed query times of 200-400ms for collections with 10,000+ documents each.

On paper, this is exactly what NoSQL has been missing. In practice, it's a different story.

// The promised land (from Qiita benchmark)
const pipeline = firestore.collectionGroup('files')
  .createPipeline()
  .join('folders', 'folderId')
  .where('status', '==', 'active')
  .execute();

The Three Costs Nobody Mentions

1. Pricing Explosion

Here's what Google's documentation buries on page 47 of the pricing PDF: each pipeline execution reads from all collections involved. A JOIN between orders and customers? That's reads against both collections, billed separately. Take a dashboard that used to make one denormalized read and now runs a pipeline across two collections of 50,000 documents each. At best, your reads grow with the size of both collections, O(n+m). If the engine ends up checking every order against every customer, you're looking at pricing that scales as O(n×m) rather than O(n).
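
If you want to put rough numbers on that, here's a back-of-the-envelope sketch. Everything in it is an assumption: the three strategy names are mine, the read counts are a simplified model rather than how Firestore actually bills pipelines, and pricePer100kReads is a placeholder you should replace with the rate from your own billing page.

```javascript
// Rough read-count model. These are illustrative assumptions,
// not Google's actual pipeline billing rules.
function estimateReads({ orders, customers, strategy }) {
  switch (strategy) {
    case 'denormalized':
      // customerName is embedded, so only order documents are read.
      return orders;
    case 'join-indexed':
      // Best case for a join: each side is read once.
      return orders + customers;
    case 'join-nested':
      // Worst case: every order is compared against every customer.
      return orders * customers;
    default:
      throw new Error(`unknown strategy: ${strategy}`);
  }
}

// pricePer100kReads is a placeholder; look up the current rate for your region.
function estimateCost(reads, pricePer100kReads) {
  return (reads / 100000) * pricePer100kReads;
}

const sizes = { orders: 50000, customers: 50000 };
for (const strategy of ['denormalized', 'join-indexed', 'join-nested']) {
  const reads = estimateReads({ ...sizes, strategy });
  console.log(strategy, reads);
}
```

The gap between the second and third rows is the whole argument: which one you land on depends on how the engine resolves the join, and that's exactly what you need to measure before refactoring.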

In my testing on an M2 Max with 32GB RAM, a single pipeline execution against two 10,000-document collections ran in 380ms. For 100,000 documents each? The query timed out at the 30-second limit.
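
A single timing tells you very little, so when you run your own benchmark, collect many samples and look at the tail. Here's a minimal sketch of how you might summarize them. The sample values are made up for illustration (they are not my measurements), and percentile and summarize are hypothetical helpers, not part of any Firebase tooling.

```javascript
// Nearest-rank percentile over an array of latency samples in milliseconds.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Summarize a run: median, tail latency, and how many samples hit the timeout.
function summarize(samples, timeoutMs) {
  return {
    p50: percentile(samples, 50),
    p95: percentile(samples, 95),
    timedOut: samples.filter((ms) => ms >= timeoutMs).length,
  };
}

// Hypothetical samples: nine normal runs and one that hit a 30-second limit.
const samples = [210, 380, 395, 240, 30000, 310, 365, 400, 290, 205];
const report = summarize(samples, 30000);
console.log(report);
```

A healthy-looking median can hide a p95 that is a timeout, which is why the tail matters more than the average here.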

2. The Denormalization Tax Is Still Due

Here's the uncomfortable truth the marketing materials skip: you're not eliminating denormalization. You're deferring it. Every JOIN still requires the engine to resolve relationships at query time. At scale, this means your Maximum batch size exceeded error becomes a Pipeline execution timeout error.
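
For context, the batch-size error usually shows up on the denormalized side: a customer renames themselves and you try to rewrite customerName on every one of their orders in one go. A common workaround is to split the fan-out into smaller groups. Here's a minimal sketch of that chunking step; the 500 figure is an assumed limit for illustration (check the current Firestore quotas), and chunk and the order IDs are hypothetical.

```javascript
// Split a list of pending updates into groups no larger than maxBatchSize.
function chunk(items, maxBatchSize) {
  if (!Number.isInteger(maxBatchSize) || maxBatchSize < 1) {
    throw new RangeError('maxBatchSize must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < items.length; i += maxBatchSize) {
    batches.push(items.slice(i, i + maxBatchSize));
  }
  return batches;
}

// Hypothetical: 1,200 order IDs whose embedded customerName needs refreshing.
const staleOrderIds = Array.from({ length: 1200 }, (_, i) => `order-${i}`);

// 500 is an assumed per-batch limit, not a verified quota.
const batches = chunk(staleOrderIds, 500);
console.log(batches.map((b) => b.length));
```

Each group would then be committed as its own batch write. That keeps the error away, but it's also a reminder that the fan-out work never disappears; it just moves.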

Japanese developers have been dealing with this constraint longer than Western devs: Firebase has deeper penetration in Japan, and the Qiita community has years of accumulated workarounds. The Pipelines API is Google's first-class acknowledgment that denormalization was always a compromise, not a best practice.

3. Cold Start Penalties

The Pipelines API initializes a separate execution context for each pipeline. In serverless environments (Cloud Functions, Cloud Run), this means 2-4 second cold starts for complex JOINs. Your "fast" dashboard query now includes a warm-up tax that users will blame on "Firebase being slow."

The Skeptical Take

The Pipelines API solves a real problem: developers who chose Firestore for its simplicity now need to model relationships they should have put in a relational database from the start.

But here's where the trade-off gets uncomfortable: Google is offering you a way to avoid the painful migration to Cloud Spanner or PostgreSQL — by giving you just enough JOIN capability to stay locked into Firebase. The 380ms query time on 10,000 documents isn't a performance feature. It's a warning sign.

If your use case genuinely needs cross-collection relationships at production scale, the honest answer is to use a relational database. The Pipelines API is a band-aid on an architectural decision that should have been different from the start.

To be fair: if you're prototyping, if your collections are small (<5,000 documents), or if you're migrating off a legacy NoSQL setup and can't do a full refactor — the Pipelines API is genuinely useful. I've used it myself for exactly those scenarios. But treating it as a scalable solution for high-cardinality joins will cost you more in the long run than the migration you avoided.

Anti-Atrophy Checklist

If you're already using Firestore and tempted by Pipelines:

  1. Audit your collection cardinalities — If any collection exceeds 20,000 documents, model the cost of a pipeline JOIN before refactoring. Use Google Cloud Pricing Calculator with the pipeline execution pricing.

  2. Set hard limits on pipeline complexity — A JOIN across 3+ collections is a red flag. At that point, you're fighting Firestore's data model instead of working with it.

  3. Track your read costs weekly — Pipeline reads are itemized differently than standard reads. If your billing report doesn't show a "Pipeline Executions" line item, you're not looking at the right report.

  4. Maintain the denormalization option — Keep your embedding strategy as a fallback. Pipelines should supplement your data model, not replace your backup plan.

  5. Benchmark under load before production — The Qiita author's 200-400ms figures were on quiet collections. Test with your actual traffic patterns. Cold starts and contention will surprise you.
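
The first two checklist items are easy to automate. Here's a minimal sketch of a pre-flight check that encodes the thresholds above (20,000 documents, 3+ collections). The function name and the input shape are hypothetical, and the thresholds are my rules of thumb, not limits enforced by Firestore.

```javascript
// Thresholds taken from the checklist above; treat them as rules of thumb.
const MAX_DOCS_BEFORE_COST_MODEL = 20000;
const MAX_JOINED_COLLECTIONS = 2; // a JOIN across 3+ collections is flagged

// collectionSizes maps a collection name to its document count,
// e.g. { orders: 50000, customers: 50000 }.
function auditPipelinePlan(collectionSizes) {
  const warnings = [];
  const names = Object.keys(collectionSizes);
  for (const name of names) {
    const count = collectionSizes[name];
    if (count > MAX_DOCS_BEFORE_COST_MODEL) {
      warnings.push(
        `${name}: ${count} documents exceeds ${MAX_DOCS_BEFORE_COST_MODEL}; model the JOIN cost first`
      );
    }
  }
  if (names.length > MAX_JOINED_COLLECTIONS) {
    warnings.push(
      `${names.length} collections in one pipeline; reconsider the data model`
    );
  }
  return warnings;
}

const warnings = auditPipelinePlan({ orders: 50000, customers: 50000 });
console.log(warnings);
```

Run something like this in CI against your real collection counts, and an empty warnings array becomes the bar a pipeline has to clear before it ships.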

The Firestore Pipelines API is a genuine step forward. It's also a trap for developers who will use it to avoid making harder architectural decisions. Know which side of that line you're on before you ship.


What's your take?

If you're running Firestore in production, what's your current strategy for handling cross-collection relationships? Denormalization, client-side joins, or something else entirely? I'd love to hear how you're solving this — drop a comment below.


Based on research from Qiita (tomoasleep) regarding Firestore Pipelines API benchmarks and implementation findings
