In Part 1, I showed why graph traversal speed is the bottleneck.
In Part 2, I covered hybrid LLM-graph planning for single-system agents.
But here's the next problem:
What happens when your infrastructure spans multiple regions, thousands of services, and millions of state transitions?
A single graph database instance can't handle 10M+ nodes with sub-second query times.
Standard A* search becomes a distributed systems problem.
Let's deep dive on how to architect planet-scale autonomous agents that can:
- Plan across distributed infrastructure graphs
- Query massive graphs efficiently
- Handle cross-region dependencies
- Scale horizontally
The Scale Problem
When Single-Node Graphs Break
Picture the infrastructure graph behind a large Kubernetes estate: multiple regions, thousands of services, and millions of possible state transitions between them. The challenge: a single graph database instance can't answer planning queries over a graph that size in sub-second time, and every traversal the agent runs turns into a distributed systems problem. What you need: a way to split the graph into manageable pieces, plan locally inside each piece, and coordinate across pieces only when a plan has to cross a boundary.
Architecture: Distributed Graph Traversal
The Three-Layer Approach
Part 1: Distributed Graph Traversal
Strategy 1: Graph Partitioning
Partition by service domain:
from collections import defaultdict

class GraphPartitioner:
    def __init__(self, partition_strategy='domain'):
        self.strategy = partition_strategy
        self.partitions = {}
        self.all_edges = []

    def partition_graph(self, services, dependencies):
        """
        Partition infrastructure graph by domain boundaries
        """
        self.all_edges = dependencies
        if self.strategy == 'domain':
            return self._partition_by_domain(services)
        elif self.strategy == 'region':
            return self._partition_by_region(services)
        elif self.strategy == 'hybrid':
            return self._partition_hybrid(services)

    def _partition_by_domain(self, services):
        """
        Group services by business domain
        e.g., payment, auth, search, recommendation
        """
        domains = defaultdict(list)
        for service in services:
            domain = service.domain  # e.g., 'payment', 'auth'
            domains[domain].append(service)

        # Create partition for each domain
        partitions = {}
        for domain, service_list in domains.items():
            partitions[domain] = {
                'services': service_list,
                'internal_edges': self._get_internal_edges(service_list),
                'external_edges': self._get_external_edges(service_list),
                'size': len(service_list)
            }
        return partitions

    def _get_internal_edges(self, services):
        """Edges within the same partition"""
        service_ids = {s.id for s in services}
        return [
            edge for edge in self.all_edges
            if edge.source in service_ids and edge.target in service_ids
        ]

    def _get_external_edges(self, services):
        """Edges crossing partition boundaries"""
        service_ids = {s.id for s in services}
        return [
            edge for edge in self.all_edges
            if (edge.source in service_ids) != (edge.target in service_ids)
        ]
Result: each domain becomes a self-contained partition that a single graph instance can serve, and only the external edges (the cross-domain dependencies) need coordination at plan time.
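To make the data shapes concrete, here is a minimal usage sketch. The Service and Edge namedtuples are illustrative stand-ins for whatever objects your service inventory produces; they only need the attributes the partitioner reads.

from collections import namedtuple

# Illustrative stand-ins; real objects just need .id, .domain, .source, .target
Service = namedtuple('Service', ['id', 'name', 'domain'])
Edge = namedtuple('Edge', ['source', 'target'])

services = [
    Service('svc-1', 'checkout-api', 'payment'),
    Service('svc-2', 'fraud-scorer', 'payment'),
    Service('svc-3', 'token-issuer', 'auth'),
]
dependencies = [
    Edge('svc-1', 'svc-2'),  # stays inside the payment partition
    Edge('svc-1', 'svc-3'),  # crosses the payment/auth boundary
]

partitioner = GraphPartitioner(partition_strategy='domain')
partitions = partitioner.partition_graph(services, dependencies)

for domain, p in partitions.items():
    print(domain, p['size'], len(p['internal_edges']), len(p['external_edges']))
# payment 2 1 1
# auth 1 0 1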
Strategy 2: Distributed A* Search
Multi-partition pathfinding:
import networkx as nx

class DistributedPlanner:
    def __init__(self, partitions, partition_metadata):
        self.partitions = partitions
        self.metadata = partition_metadata
        self.local_planners = {}
        # Create local planner for each partition
        # (LocalPlanner wraps a single partition's graph store, not shown here)
        for partition_id, graph_db_url in partitions.items():
            self.local_planners[partition_id] = LocalPlanner(graph_db_url)

    def distributed_search(self, start_state, goal_state):
        """
        Distributed A* search across graph partitions
        """
        # Step 1: Identify which partitions contain start and goal
        start_partition = self._find_partition(start_state)
        goal_partition = self._find_partition(goal_state)

        # Step 2: If same partition, use local search
        if start_partition == goal_partition:
            return self.local_planners[start_partition].search(
                start_state, goal_state
            )

        # Step 3: Find inter-partition path
        partition_path = self._find_partition_path(
            start_partition, goal_partition
        )

        # Step 4: Search within each partition along the path
        full_path = []
        current_state = start_state
        for i in range(len(partition_path) - 1):
            current_partition = partition_path[i]
            next_partition = partition_path[i + 1]

            # Find exit point from current partition
            exit_state = self._find_best_exit_state(
                current_state, current_partition, next_partition
            )

            # Local search within partition
            local_path = self.local_planners[current_partition].search(
                current_state, exit_state
            )
            full_path.extend(local_path)
            current_state = exit_state

        # Final hop to goal
        final_path = self.local_planners[goal_partition].search(
            current_state, goal_state
        )
        full_path.extend(final_path)
        return full_path

    def _find_partition_path(self, start_partition, goal_partition):
        """
        Find shortest path through partitions using partition graph
        """
        # Build partition-level graph
        partition_graph = nx.DiGraph()
        for partition_id, metadata in self.metadata.items():
            partition_graph.add_node(partition_id)
            # Add edges to connected partitions
            for neighbor_id in metadata['neighbors']:
                weight = len(metadata['external_edges'][neighbor_id])
                partition_graph.add_edge(
                    partition_id, neighbor_id, weight=weight
                )

        # Shortest path at partition level
        return nx.shortest_path(
            partition_graph,
            start_partition,
            goal_partition,
            weight='weight'
        )

    def _find_best_exit_state(self, current_state, from_partition, to_partition):
        """
        Find the best state to transition between partitions
        based on cost and likelihood of success
        """
        exit_states = self.metadata[from_partition]['exit_states'][to_partition]

        # Score each potential exit state
        scores = []
        for exit_state in exit_states:
            # Distance from current to exit
            local_cost = self.local_planners[from_partition].estimate_cost(
                current_state, exit_state
            )
            # Historical success rate of this transition
            success_rate = self._get_transition_success_rate(
                from_partition, to_partition, exit_state
            )
            score = local_cost / (success_rate + 0.01)  # Lower is better
            scores.append((exit_state, score))

        # Return exit state with best score
        return min(scores, key=lambda x: x[1])[0]
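The planner leans on two helpers that are referenced but not shown. Here is a minimal sketch of both, written as additional methods on DistributedPlanner. It assumes the partition metadata carries a 'services' set of service IDs per partition, that states expose a service_id attribute, and that the constructor also initializes self.transition_stats as an empty dict; all three are assumptions, not part of the code above.

    def _find_partition(self, state):
        """Map a state to the partition whose service set contains it."""
        for partition_id, metadata in self.metadata.items():
            if state.service_id in metadata['services']:
                return partition_id
        raise KeyError(f"No partition contains service {state.service_id}")

    def _get_transition_success_rate(self, from_partition, to_partition, exit_state):
        """Historical success rate of a boundary transition."""
        # self.transition_stats is assumed to be a dict of outcome counters
        stats = self.transition_stats.get(
            (from_partition, to_partition, exit_state.id),
            {'ok': 0, 'total': 0},
        )
        if stats['total'] == 0:
            return 0.5  # no history yet: neutral prior
        return stats['ok'] / stats['total']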
Strategy 3: Parallel Query Execution
Execute searches in parallel across partitions:
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

class ParallelGraphQuerier:
    def __init__(self, partition_planners):
        self.planners = partition_planners
        self.executor = ThreadPoolExecutor(max_workers=8)

    async def parallel_search(self, query_plan):
        """
        Execute multiple graph queries in parallel
        """
        tasks = []
        for partition_id, local_query in query_plan.items():
            task = asyncio.create_task(
                self._execute_local_query(partition_id, local_query)
            )
            tasks.append(task)

        # Wait for all queries to complete
        results = await asyncio.gather(*tasks)

        # Merge results
        return self._merge_results(results)

    async def _execute_local_query(self, partition_id, query):
        """
        Execute query against a single partition
        """
        loop = asyncio.get_event_loop()
        # Run blocking graph query in thread pool
        result = await loop.run_in_executor(
            self.executor,
            self.planners[partition_id].execute_query,
            query
        )
        return {
            'partition': partition_id,
            'result': result,
            'timestamp': time.time()
        }

    def _merge_results(self, results):
        """
        Combine results from multiple partitions
        """
        merged_paths = []
        # Sort by partition order in overall path
        sorted_results = sorted(results, key=lambda x: x['partition'])
        for result in sorted_results:
            merged_paths.extend(result['result']['path'])
        return merged_paths
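Driving this from ordinary synchronous code is a one-liner with asyncio.run. The query plan below is a made-up example: local_planners is assumed to be the same partition-to-LocalPlanner mapping used by DistributedPlanner, and the per-partition query dicts are placeholders for whatever your LocalPlanner.execute_query actually accepts.

# local_planners: {partition_id: LocalPlanner} built as in DistributedPlanner (assumed)
querier = ParallelGraphQuerier(partition_planners=local_planners)

# Hypothetical per-partition sub-queries produced by the distributed planner
query_plan = {
    'payment': {'start': 'checkout-api', 'goal': 'payment-gateway'},
    'auth': {'start': 'token-issuer', 'goal': 'session-store'},
}

# asyncio.run bridges the async API from synchronous code
merged_path = asyncio.run(querier.parallel_search(query_plan))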
Part 2: Choosing the Right Graph Database
The Landscape
Commercial Options:
- Neo4j - Industry standard, rich ecosystem
- Amazon Neptune - Managed AWS service
- TigerGraph - Optimized for analytics
- Azure Cosmos DB (Gremlin API) - Azure-native
Custom/Open Source:
- JanusGraph - Distributed, open source
- ArangoDB - Multi-model database
- Redis + Custom Logic - Maximum control
Decision Framework
In the lists below, --> marks a reason to choose the option and --X marks a caveat to weigh before committing.
When to use Neo4j:
--> Need a rich query language (Cypher)
--> Complex pattern matching required
--> Excellent tooling/ecosystem matters
--> Team familiar with property graphs
--X Licensing costs bite at scale (under 100K nodes the free Community Edition works)
--X Vendor lock-in concerns
When to use Neptune:
--> Already on AWS
--> Want a fully managed service
--> Need automatic scaling
--> Integration with AWS services is critical
--X Team must know Gremlin or be willing to learn it
--X Managed-service pricing has to be acceptable
When to use TigerGraph:
--> Deep graph analytics required
--> Real-time recommendations
--> Need distributed traversal out of the box
--> Handling massive graphs (billions of edges)
--X Smaller community than Neo4j or Neptune
--X Requires learning GSQL
When to build custom (Redis + adjacency lists):
--> Simple shortest-path queries dominate
--> Extreme cost sensitivity
--> Need maximum performance for specific query patterns
--> Strong engineering team
--X Limited built-in features; you build your own tooling
--X Requires sustained engineering time investment
Key Evaluation Criteria
1. Query Patterns
Shortest path only --> Custom/Redis
Complex patterns --> Neo4j/Neptune
Analytics --> TigerGraph
Mixed --> Neo4j or hybrid approach
2. Scale
<100K nodes --> Neo4j Community (free)
100K-1M nodes --> Any commercial option
1M-10M nodes --> TigerGraph or distributed Neo4j
10M+ nodes --> TigerGraph or custom distributed
3. Budget
Tight budget --> Custom or Neo4j Community
Moderate --> Neptune or Neo4j Enterprise
Analytics-focused --> TigerGraph
4. Team Expertise
SQL background --> Cypher (Neo4j)
NoSQL background --> Gremlin (Neptune)
Python heavy --> Custom implementation
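If you want the scale guidance above to live in code rather than in a wiki page, one option is to encode it as a simple lookup in your provisioning scripts. The thresholds mirror the list above; the function name is purely illustrative.

def recommend_by_scale(node_count: int) -> str:
    """Encode the scale thresholds above as a lookup (illustrative only)."""
    if node_count < 100_000:
        return "Neo4j Community (free)"
    if node_count < 1_000_000:
        return "Any commercial option"
    if node_count < 10_000_000:
        return "TigerGraph or distributed Neo4j"
    return "TigerGraph or custom distributed"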
Database Implementation Patterns
Neo4j Pattern
from neo4j import GraphDatabase
class Neo4jPlanner:
    def __init__(self, uri, user, password):
        self.driver = GraphDatabase.driver(uri, auth=(user, password))

    def shortest_path(self, start_id, end_id):
        with self.driver.session() as session:
            result = session.run("""
                MATCH (start:Service {id: $start_id}),
                      (end:Service {id: $end_id}),
                      path = shortestPath((start)-[:DEPENDS_ON*]-(end))
                RETURN path, length(path) as cost
            """, start_id=start_id, end_id=end_id)
            record = result.single()
            return {
                'path': record['path'],
                'cost': record['cost']
            }

    def get_neighbors(self, node_id, depth=3):
        with self.driver.session() as session:
            # Variable-length bounds can't be Cypher parameters, so the depth
            # is substituted into the query string before it is sent
            result = session.run("""
                MATCH (start:Service {id: $node_id})-[:DEPENDS_ON*1..{depth}]-(neighbor)
                RETURN DISTINCT neighbor.id as id, neighbor.name as name
            """.replace('{depth}', str(depth)), node_id=node_id)
            return [dict(record) for record in result]

    def pattern_match(self, pattern_query):
        """
        Execute custom Cypher pattern
        Example: Find all services depending on auth
        """
        with self.driver.session() as session:
            result = session.run(pattern_query)
            return [dict(record) for record in result]
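As a usage example, the "services depending on auth" pattern from the docstring could be expressed as below. The connection details and the auth service id are placeholders; only the Service label and DEPENDS_ON relationship come from the snippet above.

planner = Neo4jPlanner("bolt://localhost:7687", "neo4j", "password")

dependents_on_auth = planner.pattern_match("""
    MATCH (s:Service)-[:DEPENDS_ON*1..3]->(auth:Service {id: 'auth-service'})
    RETURN DISTINCT s.id AS id, s.name AS name
""")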
Pros:
- Mature, battle-tested
- Excellent query language (Cypher is SQL-like)
- Rich visualization tools
- Strong consistency guarantees
Cons:
- Memory intensive (budget roughly 10 GB of RAM per million nodes; see the performance section below)
- Licensing costs for clustering
- Vertical scaling limits
Neptune Pattern
from gremlin_python.driver import client, serializer
class NeptunePlanner:
    def __init__(self, endpoint):
        self.client = client.Client(
            f'wss://{endpoint}:8182/gremlin',
            'g',
            message_serializer=serializer.GraphSONSerializersV2d0()
        )

    def shortest_path(self, start_id, end_id):
        query = f"""
            g.V().has('service', 'id', '{start_id}')
                .repeat(out('depends_on').simplePath())
                .until(has('id', '{end_id}'))
                .path()
                .limit(1)
        """
        result = self.client.submit(query).all().result()
        return result[0] if result else None

    def get_neighbors(self, node_id, depth=3):
        query = f"""
            g.V().has('service', 'id', '{node_id}')
                .repeat(out('depends_on'))
                .times({depth})
                .dedup()
                .valueMap()
        """
        return self.client.submit(query).all().result()
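One caveat on the snippet above: interpolating IDs straight into the Gremlin string works, but it is fragile and an injection risk if IDs ever come from user input. The Gremlin Python client's submit() accepts a bindings map, so a safer variant of get_neighbors, sketched here as an extra method on NeptunePlanner, might look like this.

    def get_neighbors_bound(self, node_id, depth=3):
        # Bound variables are resolved server-side instead of via f-strings
        query = """
            g.V().has('service', 'id', node_id)
                .repeat(out('depends_on'))
                .times(max_depth)
                .dedup()
                .valueMap()
        """
        return self.client.submit(
            query, bindings={'node_id': node_id, 'max_depth': depth}
        ).all().result()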
Pros:
- Fully managed (no ops burden)
- Auto-scaling
- AWS integration (IAM, VPC, CloudWatch)
- Multiple APIs (Gremlin, SPARQL)
Cons:
- Gremlin learning curve
- Potentially higher cost than self-hosted
- Less query optimization control
Custom Redis Pattern
import json
from collections import defaultdict

import redis

class RedisGraphPlanner:
    def __init__(self, redis_host='localhost'):
        self.redis = redis.Redis(host=redis_host, decode_responses=True)
        self.adjacency_list = defaultdict(list)

    def load_graph(self, nodes, edges):
        """Load graph into Redis"""
        for edge in edges:
            # Store adjacency list
            key = f"neighbors:{edge['source']}"
            neighbor_data = json.dumps({
                'target': edge['target'],
                'cost': edge.get('cost', 1)
            })
            self.redis.rpush(key, neighbor_data)
            self.redis.expire(key, 3600)  # 1 hour TTL

    def shortest_path(self, start_id, end_id):
        """
        Dijkstra's algorithm using Redis for storage
        """
        # Check cache first
        cache_key = f"path:{start_id}:{end_id}"
        cached = self.redis.get(cache_key)
        if cached:
            return json.loads(cached)

        # Compute path
        distances = {start_id: 0}
        previous = {}
        unvisited = {start_id}

        while unvisited:
            current = min(unvisited, key=lambda x: distances.get(x, float('inf')))
            if current == end_id:
                break
            unvisited.remove(current)

            # Get neighbors from Redis
            neighbors_raw = self.redis.lrange(f"neighbors:{current}", 0, -1)
            neighbors = [json.loads(n) for n in neighbors_raw]

            for neighbor in neighbors:
                distance = distances[current] + neighbor['cost']
                target = neighbor['target']
                if distance < distances.get(target, float('inf')):
                    distances[target] = distance
                    previous[target] = current
                    unvisited.add(target)

        # Reconstruct path
        path = self._reconstruct_path(previous, start_id, end_id)

        # Cache result
        self.redis.set(cache_key, json.dumps(path), ex=600)
        return path

    def _reconstruct_path(self, previous, start, end):
        path = []
        current = end
        while current in previous:
            path.append(current)
            current = previous[current]
        path.append(start)
        path.reverse()
        return path
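A quick smoke test against a local Redis instance, using the edge dictionaries load_graph expects (the nodes argument is unused by load_graph, so an empty list is fine):

planner = RedisGraphPlanner(redis_host='localhost')

edges = [
    {'source': 'web', 'target': 'api', 'cost': 1},
    {'source': 'api', 'target': 'db', 'cost': 2},
    {'source': 'web', 'target': 'cache', 'cost': 1},
]
planner.load_graph(nodes=[], edges=edges)

print(planner.shortest_path('web', 'db'))  # ['web', 'api', 'db']
# Note: rerunning load_graph appends duplicate neighbor entries;
# flush the neighbors:* keys between runs.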
Pros:
- Extreme performance for simple queries
- Full control over caching strategy
- Very cost-effective
- Scales horizontally easily
Cons:
- No complex pattern matching
- More engineering effort
- Limited built-in features
- Need to build tooling
Hybrid Architecture: Best of Both Worlds
The Recommended Production Setup
class HybridGraphStore:
    """
    Combine Redis for hot paths + Neo4j for complex queries
    """
    def __init__(self, redis_cluster, neo4j_cluster):
        # neo4j_cluster is a (uri, user, password) tuple
        self.cache = RedisGraphPlanner(redis_cluster)
        self.graph_db = Neo4jPlanner(*neo4j_cluster)
        self.hot_path_threshold = 100  # queries/hour

    def shortest_path(self, start, end):
        # Track query frequency
        query_key = f"path:{start}:{end}"
        query_count = int(self.cache.redis.get(f"count:{query_key}") or 0)
        self.cache.redis.incr(f"count:{query_key}")
        self.cache.redis.expire(f"count:{query_key}", 3600)

        # Use Redis for hot paths (frequently queried)
        if query_count > self.hot_path_threshold:
            cached_path = self.cache.shortest_path(start, end)
            if cached_path:
                return cached_path

        # Fall back to Neo4j for rare/complex queries
        path = self.graph_db.shortest_path(start, end)

        # Warm the cache for next time
        cache_key = f"path:{start}:{end}"
        self.cache.redis.set(cache_key, json.dumps(path), ex=3600)
        return path

    def pattern_match(self, pattern):
        """Complex patterns always go to Neo4j"""
        return self.graph_db.pattern_match(pattern)

    def update_state(self, node_id, new_state):
        """Write-through: update both stores"""
        # Invalidate Redis cache entries that touch this node
        pattern = f"*:{node_id}:*"
        for key in self.cache.redis.scan_iter(match=pattern):
            self.cache.redis.delete(key)
        # Update Neo4j (update_node_property is a thin Cypher SET wrapper, not shown)
        self.graph_db.update_node_property(node_id, 'state', new_state)
Why this works:
- 80% of queries hit Redis cache (hot paths)
- 20% use Neo4j (complex/rare queries)
- Cost: ~60% less than Neo4j-only
- Performance: Better P95 latency
Performance Expectations
What to Expect from Each Approach
Neo4j (Single Instance):
Graph size: 1M nodes
Shortest path: 50-200ms typical
Complex patterns: 200-500ms typical
Memory: ~10GB for 1M nodes
Neptune:
Graph size: 1M nodes
Shortest path: 100-300ms typical
Complex patterns: 300-800ms typical
Managed service overhead: +20-40% latency vs. self-hosted
Custom Redis:
Graph size: 1M nodes
Shortest path (cached): 5-15ms
Shortest path (uncached): 20-80ms
Cache hit rate: 70-90% typical for production workloads
Hybrid (Redis + Neo4j):
Overall P95: 80-150ms (depending on cache hit rate)
Hot paths (80% of queries): 10-20ms
Cold paths (20% of queries): 150-300ms
Best balance of cost and performance
Cost Considerations
Rough Monthly Cost Estimates
Small Scale (100K nodes, 1K queries/sec):
Neo4j Community (self-hosted): $200-400 (EC2 costs)
Neptune (db.r6g.large): $350-500
Custom Redis: $100-200
Medium Scale (1M nodes, 10K queries/sec):
Neo4j Enterprise (clustered): $1,500-2,500
Neptune (db.r6g.2xlarge): $1,400-2,000
TigerGraph Cloud: $1,200-1,800
Custom Redis Cluster: $400-800
Hybrid (Redis + Neo4j): $800-1,400
Large Scale (10M+ nodes, 50K queries/sec):
Neo4j Enterprise (large cluster): $4,000-8,000
Neptune (db.r6g.8xlarge): $5,000-7,000
TigerGraph: $3,500-6,000
Custom Distributed: $2,000-4,000
Implementation Roadmap
Phase 1: Start Small (Week 1-2)
# Deploy Neo4j Community
docker run -d \
-p 7474:7474 -p 7687:7687 \
-v $PWD/data:/data \
neo4j:latest
# Load your infrastructure graph
python load_graph.py --source infrastructure.json --target neo4j://localhost
# Measure baseline performance
python measure_query_performance.py --queries 1000
Phase 2: Add Caching (Week 3-4)
# Add Redis cache layer
import json
import redis

redis_client = redis.Redis(host='localhost')

# Implement cache-aside pattern
def shortest_path_cached(start, end):
    cache_key = f"path:{start}:{end}"
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)

    # Cache miss - query Neo4j
    path = neo4j_query(start, end)
    redis_client.set(cache_key, json.dumps(path), ex=3600)
    return path
Phase 3: Partition for Scale (Month 2-3)
# Analyze graph for partitioning
partitioner = GraphPartitioner(partition_strategy='domain')
partitions = partitioner.partition_graph(services, dependencies)

# Deploy regional instances
for partition in partitions:
    deploy_regional_graph_store(partition)

# Update planner to use distributed search
planner = DistributedPlanner(partitions, partition_metadata)
Phase 4: Monitor and Optimize (Ongoing)
# Track key metrics
metrics = {
    'query_latency_p95': monitor_latency(),
    'cache_hit_rate': monitor_cache(),
    'cross_partition_queries': monitor_distribution(),
    'cost_per_query': monitor_cost()
}

# Adjust based on data
if metrics['cache_hit_rate'] < 0.7:
    increase_cache_ttl()
if metrics['cross_partition_queries'] > 0.2:
    rebalance_partitions()
Key Takeaways
Distributed graph traversal:
--> Partition by service domain for natural boundaries
--> Use local planning within partitions (80%+ of queries)
--> Coordinate across partitions only when necessary
--> Cache aggressively (70-90% hit rates achievable)
Database selection:
- Start with: Neo4j Community (free, feature-rich)
- Scale to: Hybrid (Redis cache + Neo4j)
- Large scale: TigerGraph or custom distributed
- AWS-native: Neptune if already on AWS
Cost optimization:
- Caching reduces database load by 70-80%
- Partitioning enables horizontal scaling
- Hybrid approach: 50-60% cheaper than single database
- Monitor query patterns to optimize cache strategy
Performance at scale:
- Well-partitioned graphs: <500ms P95 achievable at 10M nodes
- Hot path caching: 10-20ms typical
- Hybrid architecture balances cost and speed
What's Next
Which topic interests you most?
- Multi-agent coordination (preventing conflicts between autonomous systems)
- Real-time graph topology updates (handling infrastructure changes mid-planning)
- Chaos engineering for autonomous systems (testing agent resilience)
Drop a comment with what you'd like to see next.
Try It Yourself
Quick benchmark setup:
- Start with Docker Compose (or the single-container Phase 1 commands above)
- Load the same test data into each candidate store and compare query latencies
- Measure what matters for YOUR workload; a minimal harness is sketched below
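A minimal harness along those lines, assuming the Neo4jPlanner and RedisGraphPlanner classes from earlier and a list of (start, end) pairs sampled from your own graph; the connection details are placeholders.

import statistics
import time

def benchmark(planner, query_pairs, label):
    """Time shortest_path calls and report P50/P95 latency in milliseconds."""
    latencies = []
    for start, end in query_pairs:
        t0 = time.perf_counter()
        planner.shortest_path(start, end)
        latencies.append((time.perf_counter() - t0) * 1000)
    latencies.sort()
    p50 = statistics.median(latencies)
    p95 = latencies[int(len(latencies) * 0.95) - 1]
    print(f"{label}: p50={p50:.1f}ms p95={p95:.1f}ms n={len(latencies)}")

# query_pairs = sample_pairs_from_your_graph()  # your own sampling logic
# benchmark(Neo4jPlanner("bolt://localhost:7687", "neo4j", "password"), query_pairs, "neo4j")
# benchmark(RedisGraphPlanner("localhost"), query_pairs, "redis")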
References
- Part 1: Why Your AI Agent Is Slow
- Part 2: Hybrid LLM-Graph Planning
- Neo4j Operations Manual
- Amazon Neptune Best Practices
- TigerGraph Distributed Query Processing
Graph Partitioning Strategies (Research Overview)
- Comprehensive Survey (ACM Computing Surveys): A Survey of Distributed Graph Algorithms on Massive Graphs (See Section 3 for a detailed breakdown of 1D, 2D, and hybrid partitioning strategies)
- Streaming/Online Partitioning (VLDB): Streaming Graph Partitioning: An Experimental Study (Excellent overview of how to partition graphs "on the fly" as data arrives, which is critical for real-time databases)
- General Reference: Wikipedia: Graph Partitioning
Hit the ❤️ if this helps you architect planet-scale systems.
Share your infrastructure scale and challenges in the comments.