Airbnb · Databases · 24 May 2026
Airbnb's identity graph connects 7 billion nodes and 11 billion edges — every user, every device, every listing, every relationship that might reveal a fraudster trying to create a duplicate account or collude on a fake transaction. The third-party vendor powering it required periodic manual reboots to stay stable. Queries that needed 8 hops of graph traversal were hitting 5-second P99 latencies. In 2024, a small team rebuilt the entire thing internally. The results were not incremental.
- 7B nodes, 11B edges
- 5M new edges/day
- P99 read: 5.0s → 2.5s (-49%)
- P95 write: 353ms → 156ms (-56%)
- Write QPS: 10× previous vendor max
- Zero manual reboots required
The Story
The stakes of Airbnb's identity graph are not abstract. When a fraudster creates a second account after being banned, tries to rent a listing to damage it, or coordinates with other accounts to inflate reviews, the first system that needs to detect the connection is the identity graph. It holds the relationships between every user, every device, every verified identity, and every behavioral signal that Airbnb's Trust and Safety team uses to determine whether a new account is truly new or a known bad actor resurfacing. In 2024, this graph contained 7 billion nodes and 11 billion edges, and it was growing by 5 million new edges every day. The vendor powering it required periodic manual reboots to stay stable. That was the state of Airbnb's most critical fraud detection infrastructure when the decision was made to build internally.
🕸️
Airbnb's identity graph is not the only graph at the company — it was simply the first to migrate to the new internal graph infrastructure. The platform also runs inventory knowledge graphs (relationships between listings, amenities, neighborhoods, and availability) and data lineage graphs (tracking how data flows through pipelines for compliance and debugging). All are now converging onto the same JanusGraph-based infrastructure.
The identity graph's architecture progressed through three distinct generations, each solving the previous generation's limit while introducing new constraints. The first generation used a relational database for user and entity data, paired with a key-value store holding JSON-encoded edge lists. This worked at low graph density: when users had few connections, the JSON edge lists were manageable. As density increased and individual users accumulated hundreds or thousands of edges, the edge lists became expensive to read and update. A query that traversed relationships between users had to deserialize large JSON blobs and join across tables. Relational databases are built around tables, rows, and SQL joins, which suits normalized structured data, but every additional hop in a traversal requires another join, so cost climbs quickly with traversal depth.
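A minimal sketch of why JSON-encoded edge lists get costly as density grows. It assumes a plain dict standing in for the key-value store; `kv_store` and `add_edge_json` are hypothetical names for illustration, not Airbnb's actual schema or code.

```python
import json

# Hypothetical stand-in for a key-value store: user id -> JSON-encoded edge list.
kv_store: dict[str, str] = {}


def add_edge_json(user_id: str, neighbor_id: str) -> int:
    """Append one edge; return the number of characters rewritten for this update."""
    edges = json.loads(kv_store.get(user_id, "[]"))  # deserialize the whole blob
    edges.append(neighbor_id)
    blob = json.dumps(edges)                          # reserialize the whole blob
    kv_store[user_id] = blob
    return len(blob)


# The cost of adding ONE edge grows with the number of edges already stored.
sizes = [add_edge_json("user:1", f"device:{i}") for i in range(1000)]
first_write, last_write = sizes[0], sizes[-1]
print(first_write, last_write)
```

Every single-edge update is a read-modify-write of the full list, so the thousandth edge costs far more to add than the first, and a traversal has to pay the same deserialization cost at every hop.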
THE FOUR ANTI-PATTERNS THAT PLAGUED AIRBNB'S GRAPH TEAMS
Before the centralized graph infrastructure, Airbnb teams building graph-based products fell into four documented anti-patterns:
- Relational graphs: modeling nodes and edges in SQL tables, producing expensive joins during traversal.
- Offline graphs: building the graph in the data warehouse, limiting data freshness to daily batch snapshots, which is useless for real-time fraud detection.
- DIY open source: self-managing community versions of graph databases, creating high operational toil and expertise silos.
- Managed PaaS: using third-party vendors, which is better operationally but introduces vendor lock-in, limited tuning access, and performance bottlenecks the team couldn't debug.

The identity graph's 2021 migration to a third-party SaaS solved the operational overhead of DIY open source but introduced the last anti-pattern.
Problem
Generation 1 → 2: Relational DB + KV Store Couldn't Scale Graph Density
The first-generation architecture used a relational database for entity data and a KV store holding JSON-encoded edge lists. As the identity graph grew in density — individual users accumulating hundreds of edges — querying became expensive. JSON deserialization and cross-table joins are not optimized for the multi-hop traversal patterns that fraud detection requires. The architecture became difficult and expensive to scale.
Cause
Generation 2 → 3: SaaS Vendor — Better Scale, Worse Reliability
The 2021 migration to a third-party SaaS graph database improved horizontal scalability. But it introduced new problems: long-tail latency (P99 read latency reaching 5 seconds on 8-hop queries), operational instability requiring periodic manual reboots, limited ability to tune performance for Airbnb's specific query patterns, and no fine-grained access controls. The vendor was a black box the team couldn't debug.
Solution
Generation 3: JanusGraph + DynamoDB, Internally Managed
In 2024, Airbnb built an internal graph infrastructure on JanusGraph (open-source, Apache TinkerPop stack, Gremlin query language) with DynamoDB as the storage backend and OpenSearch for indexing. The pluggable storage architecture meant Airbnb could leverage DynamoDB's operational reliability without reinventing distributed storage — while maintaining full control over the graph logic layer. They forked JanusGraph internally to add custom optimizations.
Result
49% P99 Latency Reduction, 10× Write QPS, Zero Manual Reboots
P99 read end-to-end latency dropped from 5.0s to 2.5s (-49%). P95 from 2.1s to 1.0s (-51%). Write P95 from 353ms to 156ms (-56%). Write QPS during load testing: 10× the previous vendor's maximum. Manual reboots eliminated entirely. Incident investigation time shortened through transparent internal observability. Auto-scaling enabled for the first time.
⚠️
The Long-Tail Latency Problem at High Fanout
The most technically interesting challenge in the identity graph is long-tail latency on high-fanout queries. Long-tail latency means the slowest requests in a system (P95, P99) are dramatically slower than the median, which is particularly damaging for real-time applications where even a small fraction of slow responses degrades user experience. The graph is not uniformly dense: some nodes, such as users who have been on Airbnb for years and made hundreds of bookings, have hundreds or thousands of edges. When a fraud detection query traverses relationships at depth and hits one of these high-fanout nodes, the amount of data retrieved explodes. A query with 4–8 hops that hits a high-fanout node at hop 2 can return orders of magnitude more data than the same query on a sparse node. The P50 latency looks fine; the P99 reveals the reality.
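The effect can be made concrete with a toy graph. The sketch below counts how many edges a breadth-first expansion reads; the graph shape, the vertex names, and the hub degree of 5,000 are all made up for illustration.

```python
from collections import defaultdict


def count_edges_fetched(adjacency, start, hops):
    """Breadth-first expansion; return how many edges are read across all hops."""
    frontier, seen, fetched = {start}, {start}, 0
    for _ in range(hops):
        next_frontier = set()
        for vertex in frontier:
            neighbors = adjacency[vertex]
            fetched += len(neighbors)
            next_frontier.update(n for n in neighbors if n not in seen)
        seen |= next_frontier
        frontier = next_frontier
    return fetched


def build(hub_degree):
    """Same 2-hop query shape; only the degree of vertex b3 changes."""
    adjacency = defaultdict(list)
    adjacency["a"] = ["b1", "b2", "b3"]
    adjacency["b1"] = ["c1", "c2", "c3"]
    adjacency["b2"] = ["c4", "c5", "c6"]
    adjacency["b3"] = [f"x{i}" for i in range(hub_degree)]
    return adjacency


sparse = count_edges_fetched(build(3), "a", hops=2)     # 3 + (3 + 3 + 3)
skewed = count_edges_fetched(build(5000), "a", hops=2)  # 3 + (3 + 3 + 5000)
print(sparse, skewed)
```

The query is identical in both cases; a single dense vertex at the first hop is enough to multiply the work by several hundred, which is the kind of gap that shows up in P99 but not in P50.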
The choice of Gremlin as the query language was a migration enabler. Gremlin is the graph traversal language of the Apache TinkerPop framework, and a query reads like a path through the graph: g.V(userId).out('booked').in('listed') means 'find all users who listed properties that this user has booked'. Both the outgoing vendor system and the incoming JanusGraph implementation support Gremlin, which meant Airbnb could run the same queries against both systems simultaneously during the migration. This shadow traffic approach allowed direct performance benchmarking under real production load before any cutover, a stark contrast to migrations that require rewriting queries for the new system before they can be tested. Query language compatibility was a deliberate evaluation criterion, not an accident.
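To make the traversal semantics concrete, here is a small in-memory imitation of the out() and in() steps. It is a sketch only: the edge triples and ids are invented, the helper functions are not part of any Gremlin client, and it returns a set, whereas a real Gremlin traversal yields a stream that may contain duplicates.

```python
# Edges as (source, label, target) triples; all ids are made up for illustration.
EDGES = [
    ("guest:1", "booked", "listing:10"),
    ("guest:1", "booked", "listing:11"),
    ("host:7", "listed", "listing:10"),
    ("host:8", "listed", "listing:11"),
    ("host:9", "listed", "listing:12"),
]


def out(vertices, label):
    """Follow outgoing edges with the given label (in the spirit of Gremlin's out())."""
    return {dst for src, lbl, dst in EDGES if lbl == label and src in vertices}


def in_(vertices, label):
    """Follow incoming edges with the given label (in the spirit of Gremlin's in())."""
    return {src for src, lbl, dst in EDGES if lbl == label and dst in vertices}


# Equivalent in spirit to: g.V('guest:1').out('booked').in('listed')
hosts = in_(out({"guest:1"}, "booked"), "listed")
print(sorted(hosts))
```

Each step maps a set of vertices to the next set, which is why a traversal reads left to right like a path through the graph.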
ℹ️
Trust Graph: The Fraud Detection Use Case
Airbnb's internal name for the identity graph's fraud detection application is the Trust Graph. It models connections between Airbnb users and detects two primary fraud patterns: account duplication (a banned user creating a new account and re-joining the platform) and collusion (groups of accounts coordinating to execute fraudulent transactions, inflate review scores, or circumvent platform policies). The Trust Graph feeds ML models that learn patterns of fraudulent connectivity — the specific graph structures that appear before fraud events — and score new accounts and transactions in real time. For this use case, query latency directly impacts both fraud detection speed and host/guest experience during booking.
📦
Storage Separation: Why DynamoDB as the Backend
JanusGraph's pluggable storage architecture was the property that made it the right choice for Airbnb. By using DynamoDB as the storage backend, Airbnb decoupled graph logic (JanusGraph) from distributed storage operations (DynamoDB). DynamoDB brings auto-scaling, multi-region replication, and operational reliability that Airbnb's infrastructure team already understood and trusted. JanusGraph handles the graph data model, schema management, and Gremlin query execution. The combination gave Airbnb full control over the graph layer while standing on a storage foundation that didn't need to be invented from scratch.
⚠️
5 Million New Edges Per Day: The Write Problem
The identity graph's write challenge is as significant as its read challenge. Every day, Airbnb adds approximately 5 million new edges — new bookings creating host-guest relationships, new device associations, new identity verification links. Each edge must be ingested in near real-time through asynchronous events and stored durably before downstream fraud models can use them. The vendor system's write P95 of 353ms was tolerable when the graph was smaller. At 11 billion edges growing at 5 million per day, write throughput headroom becomes critical — a system at its write QPS ceiling cannot absorb traffic spikes without dropping ingestion events. The internal solution's 10× write QPS ceiling during load testing created the headroom the vendor never provided.
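A back-of-the-envelope calculation shows why headroom matters more than the daily average. Only the 5 million edges/day figure comes from the article; the write ceilings (200 and 2,000 writes per second) and the 5x burst factor are hypothetical numbers chosen for illustration, and one edge may in practice translate into more than one storage write.

```python
SECONDS_PER_DAY = 24 * 60 * 60
new_edges_per_day = 5_000_000  # figure from the article

average_writes_per_second = new_edges_per_day / SECONDS_PER_DAY


def headroom(ceiling_qps: float, peak_to_average: float) -> float:
    """Ratio of the write ceiling to the expected peak; below 1.0 means events back up."""
    return ceiling_qps / (average_writes_per_second * peak_to_average)


# Hypothetical: a ceiling of 200 writes/s versus a ceiling ten times higher,
# both facing a burst of 5x the daily average.
lower_ceiling = headroom(ceiling_qps=200, peak_to_average=5)
tenfold_ceiling = headroom(ceiling_qps=2000, peak_to_average=5)
print(round(average_writes_per_second, 1), round(lower_ceiling, 2), round(tenfold_ceiling, 2))
```

The daily average works out to roughly 58 edges per second, which sounds small. The risk is in the bursts: under these assumed numbers the lower ceiling falls short of the peak, while a tenfold ceiling absorbs it with room to spare.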
The decision framework Airbnb used to evaluate JanusGraph against alternatives reflects a principled approach to graph database selection that is worth examining. Four requirements shaped the evaluation: scalability for online queries (the system had to handle real-time graph traversal at P95 latencies under 500ms), expressive schema and query capabilities (the Gremlin traversal language and labeled property graph model needed to support the identity graph's data model without structural compromises), fit with Airbnb's infrastructure and operational model (DynamoDB as the storage backend meant the team was standing on infrastructure they already operated), and a visible, extensible codebase — the specific requirement that ruled out the previous vendor and every other black-box alternative. Access to the source code was not a preference; it was a prerequisite.
CONNECTED ACCOUNTS: HOW THE GRAPH DETECTS FRAUD
The specific fraud detection application that depends on the identity graph is called Connected Accounts (also referred to as the Trust Graph). It works by finding structural patterns in the graph that correlate with fraud. A legitimate user typically has one main account, one primary device, and verified identity credentials. A fraudster attempting to re-enter the platform after a ban might create a new account — but often reuses the same phone number, the same payment method, the same device, or overlaps with the banned account's booking history. The Connected Accounts system traverses the graph to find these connections: "this new account shares a device with a banned account, which shared a payment method with another banned account, which has reviewed listings that the new account also reviewed." That traversal pattern — spanning 4–8 hops — is exactly why graph depth performance matters for fraud detection.
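The traversal described above can be sketched as a bounded breadth-first search. This is an illustration under stated assumptions, not the Connected Accounts implementation: the graph, the ids, the `BANNED` set, and the `banned_within` helper are all hypothetical.

```python
from collections import deque

# Undirected toy graph: accounts linked through shared devices and payment methods.
LINKS = {
    "acct:new": ["device:A"],
    "device:A": ["acct:new", "acct:banned1"],
    "acct:banned1": ["device:A", "card:X"],
    "card:X": ["acct:banned1", "acct:banned2"],
    "acct:banned2": ["card:X"],
    "acct:clean": ["device:B"],
    "device:B": ["acct:clean"],
}
BANNED = {"acct:banned1", "acct:banned2"}


def banned_within(start, max_hops):
    """Return {banned_account: hop_distance} for banned accounts within max_hops."""
    distances = {start: 0}
    queue = deque([start])
    hits = {}
    while queue:
        vertex = queue.popleft()
        if distances[vertex] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in LINKS.get(vertex, []):
            if neighbor not in distances:
                distances[neighbor] = distances[vertex] + 1
                queue.append(neighbor)
                if neighbor in BANNED:
                    hits[neighbor] = distances[neighbor]
    return hits


print(banned_within("acct:new", max_hops=4))
```

With a limit of 2 hops only the first banned account is visible; the second one appears at hop 4. A shallow traversal would miss the longer chain, which is why depth performance matters here.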
The Fix
Three JanusGraph Engine Optimizations That Closed the Latency Gap
Deploying stock JanusGraph with DynamoDB would not have been sufficient — Airbnb's query patterns, particularly the high-fanout traversals that caused the worst P99 spikes, required modifications to the JanusGraph engine itself. The team forked JanusGraph internally and made three targeted optimizations: replacing the default locking mechanism with a DynamoDB-native approach, adding parallel execution to the high-fanout data fetching interface, and instrumenting the internal fork with distributed tracing that the open-source version lacked. Together, these changes addressed the specific failure modes that had made the vendor solution unacceptable.
- -49% — P99 end-to-end read latency improvement — from 5.0 seconds on the vendor system to 2.5 seconds on internal JanusGraph infrastructure — directly improving fraud detection response time
- -56% — P95 write latency improvement — from 353ms on the vendor to 156ms internally — enabling faster ingestion of the 5 million new edges added to the graph every day
- 10× — Write QPS ceiling during load testing — the internal JanusGraph infrastructure successfully scaled to ten times the maximum write throughput the vendor could sustain
- 0 — Manual reboots required after migration — the vendor solution required periodic manual instance reboots to maintain optimal performance; the internal solution auto-scales without human intervention
```python
# Conceptual illustration of the three JanusGraph engine optimizations
# that reduced long-tail latency on Airbnb's identity graph.
# Helper names (acquire_distributed_lock, dynamo, airbnb_tracer, ...) are
# illustrative placeholders, not real JanusGraph or DynamoDB APIs.

# OPTIMIZATION 1: Custom transaction strategy using DynamoDB conditional writes
# JanusGraph's default locking: acquire explicit distributed lock before write
# Problem: lock acquisition is a round-trip to the storage backend = overhead

# Old approach (simplified, uses JanusGraph's default locking):
def write_edge_default(tx, src_vertex, dst_vertex, edge_label):
    lock = acquire_distributed_lock(src_vertex, edge_label)  # expensive
    try:
        tx.add_edge(src_vertex, dst_vertex, edge_label)
        tx.commit()
    finally:
        release_lock(lock)

# New approach: DynamoDB conditional writes (atomic compare-and-swap)
# DynamoDB's native conditional writes ensure integrity without separate lock
def write_edge_optimized(tx, src_vertex, dst_vertex, edge_label):
    # Condition: only write if edge doesn't already exist
    # DynamoDB evaluates the condition atomically server-side, no round-trip lock
    tx.add_edge_with_condition(
        src_vertex, dst_vertex, edge_label,
        condition="attribute_not_exists(edge_key)",  # DynamoDB conditional expression
    )  # Lower overhead, same integrity guarantee
    tx.commit()

# OPTIMIZATION 2: Parallel query execution via improved getMultiSlices
# Problem: high-fanout queries (user with 1000+ edges) fetch data serially
# Each 'slice' of edge data retrieved one at a time from DynamoDB

# Before: serial fetching of edge slices
def get_edges_serial(vertex_id, num_slices=50):
    results = []
    for slice_key in compute_slice_keys(vertex_id, num_slices):
        results.append(dynamo.get_item(slice_key))  # Sequential round-trips
    return merge(results)  # N sequential DynamoDB calls

# After: parallel fetching via improved getMultiSlices interface
def get_edges_parallel(vertex_id, num_slices=50):
    slice_keys = compute_slice_keys(vertex_id, num_slices)
    # One batched request instead of N sequential ones
    # (BatchGetItem accepts up to 100 keys per request)
    results = dynamo.batch_get_items(slice_keys)  # Single batched DynamoDB call
    return merge(results)  # 1 call instead of N, critical for high-fanout nodes

# OPTIMIZATION 3: Distributed tracing integrated into JanusGraph fork
# OSS JanusGraph: no tracing instrumentation, so slow queries are hard to profile
# Internal fork: Airbnb's distributed trace context propagated through graph ops
def execute_gremlin_traversal(query, trace_context):
    with airbnb_tracer.start_span('janusgraph.traversal', parent=trace_context) as span:
        span.set_tag('query.hops', count_hops(query))
        span.set_tag('query.fanout', estimated_fanout(query))
        result = janusgraph.execute(query)  # Each DynamoDB call creates child spans
        span.set_tag('result.edges_traversed', result.edge_count)
        return result  # Full trace: query → graph ops → DynamoDB calls → result
```
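The serial-versus-parallel difference in Optimization 2 can be reproduced without any database. The sketch below swaps in a thread pool and a sleeping stand-in for the storage call, so it shows the general technique rather than Airbnb's getMultiSlices change; `fetch_slice` and its 20 ms delay are assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_slice(slice_key):
    """Stand-in for one storage round trip (hypothetical; sleeps instead of calling DynamoDB)."""
    time.sleep(0.02)
    return [f"{slice_key}:edge{i}" for i in range(3)]


def get_edges_serial(slice_keys):
    """One round trip after another: total time grows linearly with the slice count."""
    return [edge for key in slice_keys for edge in fetch_slice(key)]


def get_edges_parallel(slice_keys, max_workers=16):
    """Issue the round trips concurrently; pool.map keeps results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return [edge for edges in pool.map(fetch_slice, slice_keys) for edge in edges]


keys = [f"slice{i}" for i in range(16)]

start = time.perf_counter()
serial = get_edges_serial(keys)
serial_seconds = time.perf_counter() - start

start = time.perf_counter()
parallel = get_edges_parallel(keys)
parallel_seconds = time.perf_counter() - start

print(f"serial {serial_seconds:.3f}s, parallel {parallel_seconds:.3f}s")
```

Both functions return the same edges in the same order. The serial version pays sixteen round trips back to back, while the parallel version pays roughly one, and that gap widens with the number of slices a high-fanout vertex needs.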
GREMLIN QUERY REWRITING: CLIENT-SIDE OPTIMIZATION
Even with a faster JanusGraph engine, Airbnb discovered that identical Gremlin queries produced significantly different performance on JanusGraph compared to the vendor system, because each implements different query planning optimizations over TinkerPop steps. Two specific patterns required client-side rewrites. Path steps (Gremlin's path() and simplePath() operators) are not optimized as batched queries in JanusGraph, causing non-batched storage backend queries that saturate the DynamoDB connection pool. These were replaced with conditional queries ensuring acyclic results. Side-effect aggregation steps produced non-batched substeps in JanusGraph's query planner and were restructured to minimize unoptimized computation. Both changes required deep knowledge of JanusGraph's query planning internals, knowledge only available because Airbnb owns the code.
ℹ️
Schema Enforcement via the Management Service
JanusGraph's open-source version ships with minimal schema management tooling. Airbnb built a Graph Management Service on top of JanusGraph to handle schema enforcement, index management, and schematized Thrift APIs for client access. The management service acts as the control plane for the graph infrastructure: it ensures that vertex and edge schemas are validated before data is written, manages secondary indexes in OpenSearch, and provides the typed API surface that downstream services (fraud detection models, Trust & Safety pipelines) call. This separates schema governance from query execution and prevents different teams' graph data from colliding in a multi-tenant infrastructure.
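A minimal sketch of what validating an edge against a schema before writing might look like. The `EdgeSchema` type, the `uses_device` label, and the `first_seen` property are hypothetical; the article does not describe the Graph Management Service's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeSchema:
    label: str
    source_type: str
    target_type: str
    required_properties: frozenset


# Hypothetical schema registry; labels and property names are illustrative only.
SCHEMAS = {
    "uses_device": EdgeSchema("uses_device", "user", "device", frozenset({"first_seen"})),
}


def validate_edge(label, source_type, target_type, properties):
    """Return a list of violations; an empty list means the write may proceed."""
    schema = SCHEMAS.get(label)
    if schema is None:
        return [f"unknown edge label: {label}"]
    errors = []
    if (source_type, target_type) != (schema.source_type, schema.target_type):
        errors.append(f"{label} must connect {schema.source_type} -> {schema.target_type}")
    missing = schema.required_properties - set(properties)
    if missing:
        errors.append(f"missing properties: {sorted(missing)}")
    return errors


print(validate_edge("uses_device", "user", "device", {"first_seen": "2024-01-01"}))
print(validate_edge("uses_device", "device", "user", {}))
```

Rejecting malformed writes at the control plane keeps one tenant's mistakes out of the shared storage layer, which is the point of separating schema governance from query execution.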
✅
The Shadow Traffic Migration Strategy
Migrating 7 billion nodes and 11 billion edges without downtime required a migration strategy that validated the new system under real production load before any cutover. Airbnb's approach: run both the vendor system and the internal JanusGraph system in parallel, routing the same production queries to both and comparing results. Because both systems use Gremlin, the same queries ran unchanged on both systems simultaneously. This shadow traffic phase provided two things: a performance benchmark under real load (not synthetic tests), and correctness validation (ensuring internal JanusGraph returned the same results as the vendor for the same queries). Only after shadow traffic validated both correctness and performance was production traffic cut over and the vendor deprecated.
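The comparison step of a shadow traffic phase can be sketched as follows. The two backends are plain functions standing in for the vendor system and the internal system, and comparing results as sets is an assumption made here because traversal result order is typically not guaranteed.

```python
import time


def shadow_compare(query, primary, shadow):
    """Run the query on both backends; serve the primary result and report on the shadow."""
    t0 = time.perf_counter()
    primary_result = primary(query)
    primary_ms = (time.perf_counter() - t0) * 1000

    t0 = time.perf_counter()
    try:
        shadow_result = shadow(query)
        matches = set(shadow_result) == set(primary_result)
    except Exception:
        matches = False  # a shadow failure must never affect the caller
    shadow_ms = (time.perf_counter() - t0) * 1000

    report = {"query": query, "matches": matches,
              "primary_ms": primary_ms, "shadow_ms": shadow_ms}
    return primary_result, report


# Hypothetical stand-ins for the two systems.
def vendor_backend(query):
    return ["u1", "u2"]


def internal_backend(query):
    return ["u2", "u1"]  # same vertices, different order


result, report = shadow_compare("g.V('u0').out('knows')", vendor_backend, internal_backend)
print(result, report["matches"])
```

In a production setup the shadow call would normally run asynchronously, off the request path, so that it cannot add latency to the response being served. The sketch runs it inline only to stay short.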
Third-Party Vendor vs Internal JanusGraph: Performance Comparison Across Query Types
| Query Type | Vendor P95 | Internal P95 | P95 Improvement | Vendor P99 | Internal P99 |
|---|---|---|---|---|---|
| 1-hop query | ~180ms | ~65ms | -64% | ~420ms | ~150ms |
| 2-hop query | ~350ms | ~130ms | -63% | ~900ms | ~280ms |
| 2-hop (high fanout) | ~620ms | ~200ms | -68% | ~1,800ms | ~450ms |
| 4-hop query | ~900ms | ~380ms | -58% | ~2,500ms | ~850ms |
| 8-hop query (max depth) | ~2,100ms | ~1,000ms | -52% | ~5,000ms | ~2,500ms |
| Write (edge creation) | ~353ms | ~156ms | -56% | ~800ms | ~360ms |
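The improvement column follows from the two P95 columns as (internal − vendor) / vendor. The short check below recomputes it from the approximate values in the table; it is a verification aid, not part of the original analysis.

```python
# P95 values (ms) copied from the table above: (vendor, internal).
P95_MS = {
    "1-hop": (180, 65),
    "2-hop": (350, 130),
    "2-hop high fanout": (620, 200),
    "4-hop": (900, 380),
    "8-hop": (2100, 1000),
    "write": (353, 156),
}


def percent_change(before, after):
    """Relative change in percent; negative means latency went down."""
    return (after - before) / before * 100


improvements = {name: round(percent_change(v, i)) for name, (v, i) in P95_MS.items()}
print(improvements)
```

The table's values are rounded, so recomputed percentages can differ by a point from figures quoted elsewhere in the article, which were presumably derived from unrounded measurements.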
ℹ️
Two Ingestion Paths: Event-Driven and Bulk Load
The identity graph's data ingestion architecture has two distinct paths. Event-driven ingestion handles the 5 million daily new edges in near real-time — asynchronous events from Airbnb's platform (new booking, new device association, identity verification) trigger graph mutations through the JanusGraph write path within seconds of occurring. Bulk loading handles backfills, historical data migrations, and large-scale data corrections — optimized for high throughput rather than low latency, running as offline jobs that write directly to DynamoDB's storage layer. The two paths are served by separate applications in the identity graph service, ensuring that bulk load operations during data migrations don't contend with real-time fraud detection writes.
Architecture
The architecture of Airbnb's new graph infrastructure has three conceptual layers. The storage layer is DynamoDB for graph data persistence and OpenSearch for secondary indexes — both managed AWS services that auto-scale without Airbnb managing the distributed storage operations. The graph engine layer is Airbnb's internal JanusGraph fork — the Gremlin server that executes traversal queries against the storage layer, with custom optimizations for Airbnb's specific access patterns. The management layer is the Graph Management Service — schema enforcement, index management, multi-tenant namespace isolation, and the Thrift API surface that client services call. Each tenant (the identity graph, inventory knowledge graph, data lineage graph) operates in an isolated namespace within the same infrastructure.
[Diagram: Before: Vendor Graph DB (Black Box, Manual Reboots, P99 at 5 Seconds)]
[Diagram: After: Airbnb Internal Graph Infrastructure (JanusGraph + DynamoDB)]
[Diagram: Why High-Fanout Nodes Cause Long-Tail Latency: The Graph Traversal Problem]
Interactive versions of these diagrams are available on TechLogStack.
JANUSGRAPH'S PLUGGABLE STORAGE: THE ARCHITECTURAL DECISION THAT MADE THIS POSSIBLE
Most graph databases tightly couple the graph logic layer with the storage layer: the query engine and the data store are one system. JanusGraph is architecturally different. It uses a pluggable storage backend, meaning the graph logic layer (Gremlin server, transaction management, schema enforcement) can be decoupled from the distributed storage layer. Airbnb chose DynamoDB as the backend, a service their infrastructure team already operated at scale and trusted. This separation gave Airbnb the ability to iterate on graph engine features without touching storage operations, leverage DynamoDB's auto-scaling for write throughput bursts (like the 5M daily edges), and evolve the storage backend in the future without rewriting the graph layer.
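The pluggable-backend idea can be illustrated with a small interface. This is a sketch of the pattern only: `StorageBackend`, `InMemoryBackend`, and `GraphLayer` are hypothetical names, and the two-method contract is far simpler than JanusGraph's real storage interface.

```python
from typing import Protocol


class StorageBackend(Protocol):
    """The narrow contract the graph layer depends on (hypothetical, not JanusGraph's API)."""

    def put(self, key: str, column: str, value: str) -> None: ...

    def get_slice(self, key: str) -> dict: ...


class InMemoryBackend:
    """One possible backend; a DynamoDB-backed class could satisfy the same contract."""

    def __init__(self):
        self._rows = {}

    def put(self, key, column, value):
        self._rows.setdefault(key, {})[column] = value

    def get_slice(self, key):
        return dict(self._rows.get(key, {}))


class GraphLayer:
    """Graph logic written only against StorageBackend, so backends can be swapped."""

    def __init__(self, backend: StorageBackend):
        self._backend = backend

    def add_edge(self, src, label, dst):
        self._backend.put(src, f"{label}:{dst}", dst)

    def neighbors(self, src, label):
        row = self._backend.get_slice(src)
        return sorted(v for col, v in row.items() if col.startswith(label + ":"))


graph = GraphLayer(InMemoryBackend())
graph.add_edge("user:1", "uses_device", "device:A")
graph.add_edge("user:1", "uses_device", "device:B")
graph.add_edge("user:1", "booked", "listing:10")
print(graph.neighbors("user:1", "uses_device"))
```

Because `GraphLayer` never names a concrete backend, replacing the storage system means writing one new class rather than touching the traversal logic.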
⚠️
The Open Source Observability Gap
One of the specific problems that drove Airbnb to fork JanusGraph internally was the absence of distributed tracing in the open-source version. Without tracing, there was no way to profile which graph queries were slow, which DynamoDB operations within a query were taking the most time, or which high-fanout nodes were causing P99 spikes. Debugging latency issues required guesswork or custom logging that was expensive to build and maintain. The internal fork integrated Airbnb's distributed tracing infrastructure into every graph operation — Gremlin steps, DynamoDB calls, OpenSearch index queries — giving the team the observability needed to find and fix the exact operations driving long-tail latency. You cannot optimize what you cannot measure; you cannot measure what you cannot instrument.
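A toy version of nested spans shows what this kind of instrumentation buys. It is a single-threaded sketch with invented span names, not Airbnb's tracing library; real tracers also propagate context across threads and services.

```python
import time
from contextlib import contextmanager

SPANS = []   # finished spans as (name, parent_name, duration_ms)
_stack = []  # names of currently open spans (single-threaded sketch)


@contextmanager
def span(name):
    """Record a timed span nested under whichever span is currently open."""
    parent = _stack[-1] if _stack else None
    _stack.append(name)
    start = time.perf_counter()
    try:
        yield
    finally:
        _stack.pop()
        SPANS.append((name, parent, (time.perf_counter() - start) * 1000))


def run_traversal():
    with span("gremlin.traversal"):
        for hop in range(2):
            with span(f"storage.fetch.hop{hop}"):
                time.sleep(0.001)  # stand-in for a storage call


run_traversal()
for name, parent, duration_ms in SPANS:
    print(f"{name} (parent={parent}): {duration_ms:.2f} ms")
```

With parent links and per-span durations, a slow query can be broken down into the individual storage calls that made it slow, instead of being a single opaque number.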
Lessons
Airbnb's identity graph migration is a case study in the specific moment when the accumulation of vendor limitations justifies the cost of building internally — and in the engineering decisions that made the build worth the investment. The lessons are as much about when to leave a vendor as about how to build a graph database.
- 01. The signals that a vendor relationship has passed its usefulness are specific: recurring manual operational interventions (reboots), inability to instrument or observe the system's internals, no path to tune performance for your specific access patterns, and P99 latency that is an order of magnitude worse than P50. Each of these individually might be tolerable. All four together — as Airbnb experienced — indicate that the vendor relationship is costing more in operational pain and engineering productivity than an internal solution would cost to build and maintain.
- 02. Pluggable storage backends (an architectural pattern where the database query engine and the distributed storage layer are decoupled through a defined interface, allowing different storage systems to be swapped without changing the query layer) are the property that makes graph databases practical for large-scale production deployments. JanusGraph's DynamoDB backend let Airbnb separate concerns cleanly: Airbnb owns the graph logic layer, AWS owns the distributed storage operations. Build where you have competitive advantage; buy where you don't.
- 03. Shadow traffic is the only honest migration validation strategy for a stateful system that cannot be tested in staging. You cannot reproduce 7 billion nodes, 11 billion edges, and real fraud detection query patterns in a staging environment. Running both old and new systems against the same production queries, comparing outputs and latencies, closes the validation gap. The Gremlin query language compatibility between vendor and JanusGraph was what made shadow traffic feasible here — evaluate migration options partly on query language compatibility.
- 04. High-fanout nodes, sometimes called 'supernodes', are vertices with an unusually large number of edges. They cause disproportionate latency on traversal queries because a single hop to one of them can require fetching thousands of edges, and they are a failure mode of graph databases that doesn't appear until the graph is large and dense. Design your query architecture around the assumption that some nodes will have orders of magnitude more edges than the average. Parallel fetching, fanout budgets, and explicit query limits at high-fanout nodes are the tools that prevent P99 from diverging from P50.
- 05. Fork open-source infrastructure when you have specific, documented performance requirements that the upstream project doesn't address — and when you intend to maintain the fork. Airbnb's JanusGraph fork added parallel query execution, DynamoDB conditional write transactions, and distributed tracing. All three were gap-fills for production requirements the OSS version didn't prioritize. The fork is a commitment — it creates a maintenance obligation and diverges from upstream. Make that decision with eyes open, but don't avoid it when the production requirements are clear.
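The fanout budget mentioned in lesson 04 can be sketched as a per-vertex cap during traversal. The function name, the budget of 100, and the toy graph are assumptions for illustration; the article does not say how Airbnb applies limits.

```python
def traverse_with_budget(adjacency, start, hops, fanout_budget):
    """Expand up to `hops` levels, reading at most `fanout_budget` edges per vertex."""
    frontier, seen, truncated = [start], {start}, []
    for _ in range(hops):
        next_frontier = []
        for vertex in frontier:
            neighbors = adjacency.get(vertex, [])
            if len(neighbors) > fanout_budget:
                truncated.append(vertex)  # surface the cut so callers know results are partial
                neighbors = neighbors[:fanout_budget]
            for neighbor in neighbors:
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return seen - {start}, truncated


adjacency = {
    "a": ["b", "hub"],
    "b": ["c"],
    "hub": [f"x{i}" for i in range(10_000)],
}
reached, truncated = traverse_with_budget(adjacency, "a", hops=2, fanout_budget=100)
print(len(reached), truncated)
```

The cap trades completeness for a bounded worst case, so returning the list of truncated vertices matters: a caller scoring fraud risk needs to know when it is looking at a partial neighborhood.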
✅
7 Billion Nodes, One Platform, Multiple Use Cases
The identity graph was the first use case to migrate to Airbnb's internal graph infrastructure — but the infrastructure was built as a multi-tenant platform from the start. Inventory knowledge graphs (relationships between listings, neighborhoods, experiences, amenities) and data lineage graphs (how data flows through Airbnb's pipelines) are among the next use cases converging onto the same JanusGraph infrastructure. Building the identity graph migration as a paved-path platform rather than a one-off solution means every subsequent team that needs a graph database inherits the operational work Airbnb already did on the first migration — schema tooling, observability, auto-scaling, migration playbook.
THE MULTI-TENANT PLATFORM PAYOFF
Airbnb built the internal graph infrastructure as a paved-path multi-tenant platform from the start — a conscious architectural decision to build once and serve many graph use cases, rather than building a one-off solution for the identity graph. The identity graph was tenant 0: the first adopter that validated the platform under real production load. Inventory knowledge graphs and data lineage graphs are following. Each new tenant inherits the schema tooling, observability, auto-scaling, query optimization work, and migration playbook that the identity graph team built. The marginal cost of the second graph use case is dramatically lower than the first, because the platform absorbs the infrastructure complexity.
Airbnb's fraud detection system runs on a graph of 7 billion nodes and 11 billion edges that required periodic manual reboots to stay stable, until a team rebuilt it internally, cut P99 latency in half, and eliminated the reboots. That raises the question of why they waited until 2024 to do it. The answer is probably 'because that's how long it takes to get frustrated enough with a vendor.'
TechLogStack — built at scale, broken in public, rebuilt by engineers
This case is a plain-English retelling of publicly available engineering material.