TechLogStack

Posted on May 26 • Originally published at techlogstack.com on May 24

Airbnb's Fraud Detection Runs on a Graph of 7 Billion Nodes — Here's Why They Rebuilt It From Scratch

#systemdesign #database #architecture #webdev

7B nodes, 11B edges in Airbnb's identity graph
5M new edges ingested per day
P99 read latency: 5.0s → 2.5s (-49% improvement)
P95 write latency: 353ms → 156ms (-56% improvement)
10× write QPS ceiling vs previous vendor maximum
Zero manual reboots required post-migration

Airbnb's identity graph connects every user, every device, every listing, and every relationship that might reveal a fraudster trying to create a duplicate account or collude on a fake transaction. In 2024, this graph held 7 billion nodes and 11 billion edges — growing by 5 million new edges every day. The third-party vendor powering it required periodic manual reboots to stay stable, and 8-hop graph traversal queries were hitting 5-second P99 latencies. A small team rebuilt the entire thing internally. The results were not incremental.

The Story

The stakes of Airbnb's identity graph are not abstract. When a fraudster creates a second account after being banned, tries to rent a listing to damage it, or coordinates with other accounts to inflate reviews, the first system that needs to detect the connection is the identity graph. It holds the relationships between every user, every device, every verified identity, every behavioral signal that Airbnb's Trust and Safety team uses to determine whether a new account is truly new or a known bad actor resurfacing.

The identity graph's architecture progressed through three distinct generations, each solving the previous generation's limit while introducing new constraints. The first generation used a relational database for user and entity data paired with a key-value store holding JSON-encoded edge lists. This worked at low graph density. As individual users accumulated hundreds or thousands of edges, the JSON edge lists became expensive to read and update — relational databases (database systems built around tables, rows, and SQL joins — optimal for normalised structured data but increasingly expensive as relationship traversal depth grows, because each hop requires an additional join) are not optimised for multi-hop traversal at graph scale.

The Four Anti-Patterns That Plagued Airbnb's Graph Teams

Before the centralised graph infrastructure, teams building graph-based products fell into four documented patterns: Relational graphs — modelling nodes and edges in SQL tables, producing expensive joins during traversal. Offline graphs — building in the data warehouse, limiting freshness to daily batch snapshots. DIY open source — self-managing community graph databases, creating high operational toil. Managed PaaS — third-party vendors with vendor lock-in, limited tuning access, and performance bottlenecks the team couldn't debug.

Problem

Generation 1 → 2: Relational DB + KV Store Couldn't Scale Graph Density

The first-generation architecture used a relational database for entity data and a KV store holding JSON-encoded edge lists. As graph density grew — individual users accumulating hundreds of edges — querying became expensive. JSON deserialisation and cross-table joins are not optimised for the multi-hop traversal patterns that fraud detection requires.

Cause

Generation 2 → 3: SaaS Vendor — Better Scale, Worse Reliability

The 2021 migration to a third-party SaaS graph database improved horizontal scalability but introduced new problems: P99 read latency reaching 5 seconds on 8-hop queries, operational instability requiring periodic manual reboots, no ability to tune performance for Airbnb's specific query patterns, and no fine-grained access controls. The vendor was a black box the team couldn't debug.

Solution

Generation 3: JanusGraph + DynamoDB, Internally Managed

In 2024, Airbnb built an internal graph infrastructure on JanusGraph (open-source, Apache TinkerPop stack, Gremlin query language) with DynamoDB as the storage backend and OpenSearch for indexing. The pluggable storage architecture let Airbnb leverage DynamoDB's operational reliability without reinventing distributed storage — while maintaining full control over the graph logic layer. They forked JanusGraph internally to add custom optimisations.

Result

49% P99 Latency Reduction, 10× Write QPS, Zero Manual Reboots

P99 read end-to-end latency dropped from 5.0s to 2.5s (-49%). P95 from 2.1s to 1.0s (-51%). Write P95 from 353ms to 156ms (-56%). Write QPS during load testing reached 10× the previous vendor's maximum. Manual reboots eliminated entirely. Auto-scaling enabled for the first time.

The Fix

Three JanusGraph Engine Optimisations That Closed the Latency Gap

Deploying stock JanusGraph with DynamoDB would not have been sufficient. Airbnb's query patterns — particularly high-fanout traversals that caused the worst P99 spikes — required modifications to the JanusGraph engine itself. The team forked JanusGraph internally and made three targeted optimisations:

-49% — P99 read latency: 5.0s → 2.5s, directly improving fraud detection response time
-56% — P95 write latency: 353ms → 156ms, enabling faster ingestion of 5M daily new edges
10× — Write QPS ceiling during load testing vs the vendor maximum
0 — Manual reboots required post-migration; the internal solution auto-scales

The choice of Gremlin (a graph traversal language developed as part of the Apache TinkerPop framework — reads like a path through the graph: g.V(userId).out('booked').in('listed') means "find all users who listed properties that this user has booked") as the query language was a deliberate migration enabler. Both the outgoing vendor system and the incoming JanusGraph support Gremlin, which meant Airbnb could run the same queries against both systems simultaneously during migration — direct performance benchmarking under real production load before any cutover.

Connected Accounts: How the Graph Detects Fraud

Airbnb's Trust Graph finds structural patterns that correlate with fraud. A fraudster re-entering after a ban often reuses the same phone number, payment method, or device. The Connected Accounts system traverses the graph to find these connections: "this new account shares a device with a banned account, which shared a payment method with another banned account, which has reviewed listings that the new account also reviewed." That traversal pattern — spanning 4–8 hops — is exactly why graph depth performance matters.

# Three JanusGraph engine optimisations that reduced long-tail latency

# OPTIMISATION 1: DynamoDB conditional writes replace distributed locking
# Old: explicit distributed lock before write = round-trip overhead
def write_edge_default(tx, src_vertex, dst_vertex, edge_label):
    lock = acquire_distributed_lock(src_vertex, edge_label)  # expensive
    try:
        tx.add_edge(src_vertex, dst_vertex, edge_label)
        tx.commit()
    finally:
        release_lock(lock)

# New: DynamoDB evaluates condition atomically server-side — no lock round-trip
def write_edge_optimized(tx, src_vertex, dst_vertex, edge_label):
    tx.add_edge_with_condition(
        src_vertex, dst_vertex, edge_label,
        condition="attribute_not_exists(edge_key)"
    )

# OPTIMISATION 2: Parallel getMultiSlices for high-fanout nodes
# Before: N sequential DynamoDB calls for a user with 1000+ edges
def get_edges_serial(vertex_id, num_slices=50):
    results = []
    for slice_key in compute_slice_keys(vertex_id, num_slices):
        results.append(dynamo.get_item(slice_key))
    return merge(results)

# After: single BatchGetItem — critical for high-fanout nodes
def get_edges_parallel(vertex_id, num_slices=50):
    slice_keys = compute_slice_keys(vertex_id, num_slices)
    results = dynamo.batch_get_items(slice_keys)  # 1 call instead of N
    return merge(results)

# OPTIMISATION 3: Distributed tracing in the internal fork
# OSS JanusGraph: no tracing — impossible to profile slow queries
# Internal fork: Airbnb trace context propagated through every graph op
def execute_gremlin_traversal(query, trace_context):
    with airbnb_tracer.start_span('janusgraph.traversal', parent=trace_context) as span:
        span.set_tag('query.hops', count_hops(query))
        span.set_tag('query.fanout', estimated_fanout(query))
        result = janusgraph.execute(query)
        span.set_tag('result.edges_traversed', result.edge_count)
    return result

The shadow traffic migration strategy

Migrating 7 billion nodes and 11 billion edges without downtime required running both the vendor system and the internal JanusGraph system in parallel, routing the same production queries to both and comparing results. Because both systems use Gremlin, the same queries ran unchanged on both simultaneously. This shadow traffic phase provided a performance benchmark under real load (not synthetic tests) and correctness validation before any cutover. Only after shadow traffic validated both was production traffic cut over and the vendor deprecated.

Architecture

Airbnb's new graph infrastructure has three conceptual layers. The storage layer is DynamoDB for graph data persistence and OpenSearch for secondary indexes — both managed AWS services that auto-scale. The graph engine layer is Airbnb's internal JanusGraph fork — the Gremlin server that executes traversal queries, with custom optimisations for Airbnb's access patterns. The management layer is the Graph Management Service — schema enforcement, index management, multi-tenant namespace isolation, and the Thrift API surface that client services call.

Before: Vendor Graph DB — Black Box, Manual Reboots, P99 at 5 Seconds

View interactive diagram on TechLogStack →

Interactive diagram available on TechLogStack (link above).

After: Airbnb Internal Graph Infrastructure — JanusGraph + DynamoDB

View interactive diagram on TechLogStack →

Interactive diagram available on TechLogStack (link above).

Why High-Fanout Nodes Cause Long-Tail Latency

View interactive diagram on TechLogStack →

Interactive diagram available on TechLogStack (link above).

Performance comparison across query types:

Query Type	Vendor P95	Internal P95	Improvement	Vendor P99	Internal P99
1-hop query	~180ms	~65ms	-64%	~420ms	~150ms
2-hop query	~350ms	~130ms	-63%	~900ms	~280ms
2-hop (high fanout)	~620ms	~200ms	-68%	~1,800ms	~450ms
4-hop query	~900ms	~380ms	-58%	~2,500ms	~850ms
8-hop query (max depth)	~2,100ms	~1,000ms	-52%	~5,000ms	~2,500ms
Write (edge creation)	~353ms	~156ms	-56%	~800ms	~360ms

JanusGraph's Pluggable Storage: The Architectural Decision That Made This Possible

Most graph databases tightly couple the query engine and the storage layer — they are one system. JanusGraph decouples them through a pluggable storage backend. Airbnb chose DynamoDB — infrastructure their team already operated at scale. This gave them full control over the graph logic layer while standing on a storage foundation that didn't need to be invented from scratch. The separation also lets them evolve the storage backend in the future without rewriting the graph layer.

Lessons

Know the signals that a vendor relationship has passed its usefulness. Recurring manual operational interventions, inability to instrument the system's internals, no path to tune performance for your access patterns, and P99 latency an order of magnitude worse than P50 — each individually might be tolerable, but all four together mean the vendor is costing more than an internal solution would cost to build.
Pluggable storage backends are what make graph databases practical at scale. JanusGraph's DynamoDB backend let Airbnb separate concerns cleanly: Airbnb owns the graph logic layer, AWS owns the distributed storage operations. Build where you have competitive advantage; buy where you don't.
Shadow traffic is the only honest migration validation strategy for a stateful system. You cannot reproduce 7 billion nodes and 11 billion edges in staging. Running both old and new systems against the same production queries, comparing outputs and latencies, closes the validation gap. Gremlin compatibility between vendor and JanusGraph made shadow traffic feasible here — evaluate migration options partly on query language compatibility.
High-fanout nodes (vertices with an unusually large number of edges — sometimes called supernodes) are the specific failure mode of graph databases at scale. They don't appear until the graph is large and dense. Design your query architecture around the assumption that some nodes will have orders of magnitude more edges than the average — parallel fetching, fanout budgets, and explicit query limits are the tools that prevent P99 from diverging from P50.
Fork open-source infrastructure when you have specific, documented performance requirements the upstream project doesn't address — and when you intend to maintain the fork. The fork is a commitment that creates a maintenance obligation and diverges from upstream. Make that decision with eyes open, but don't avoid it when the production requirements are clear.

Engineering Glossary

DynamoDB — Amazon's fully managed NoSQL key-value and document database, used by Airbnb as JanusGraph's storage backend. Provides auto-scaling, multi-region replication, and conditional write operations used in Airbnb's optimised transaction strategy.

Gremlin — a graph traversal language developed as part of the Apache TinkerPop framework. Reads like a path through the graph: g.V(userId).out('booked').in('listed') means "find all users who listed properties that this user has booked."

High-fanout node — a vertex in a graph database with an unusually large number of edges, sometimes called a supernode. Causes disproportionate latency on traversal queries because a single hop can require fetching thousands of edges.

JanusGraph — an open-source distributed graph database built on Apache TinkerPop, with a pluggable storage backend that can use Cassandra, DynamoDB, or HBase as the underlying data store.

Long-tail latency — the phenomenon where the slowest requests in a system (P95, P99) are dramatically slower than the median. Particularly damaging for real-time applications where even a small fraction of slow responses degrades user experience.

P99 latency — the response time that 99% of requests complete within. A P99 of 5.0s means 1 in 100 requests takes 5 seconds or longer — directly visible to users at scale.

Pluggable storage backend — an architectural pattern where the database query engine and the distributed storage layer are decoupled through a defined interface, allowing different storage systems to be swapped without changing the query layer.

Shadow traffic — a migration validation strategy where the same production queries are routed to both the old and new systems simultaneously, comparing outputs and latencies before committing to a cutover.

This case is a plain-English retelling of publicly available engineering material.

Read the full case on TechLogStack →

(Interactive diagrams, source links, and the full reader experience)

TechLogStack — built at scale, broken in public, rebuilt by engineers.

DEV Community

Airbnb's Fraud Detection Runs on a Graph of 7 Billion Nodes — Here's Why They Rebuilt It From Scratch

The Story

Problem

Generation 1 → 2: Relational DB + KV Store Couldn't Scale Graph Density

Cause

Generation 2 → 3: SaaS Vendor — Better Scale, Worse Reliability

Solution

Generation 3: JanusGraph + DynamoDB, Internally Managed

Result

49% P99 Latency Reduction, 10× Write QPS, Zero Manual Reboots

The Fix

Three JanusGraph Engine Optimisations That Closed the Latency Gap

Architecture

Before: Vendor Graph DB — Black Box, Manual Reboots, P99 at 5 Seconds

After: Airbnb Internal Graph Infrastructure — JanusGraph + DynamoDB

Why High-Fanout Nodes Cause Long-Tail Latency

Lessons

Engineering Glossary

Top comments (0)