Cherrypick14

Posted on Jun 22

Trade-offs in Indexing Solana at Scale

#architecture #blockchain #database #systemdesign

Building real-time blockchain indexers means wrestling with hard choices: speed vs. simplicity, RPC dependency vs. reliability, and resource costs. This is what I learned.

The Problem Nobody Talks About
Most Solana developers hit the same wall: the official RPC endpoint is slow, rate-limited, and not designed for analytics. So you build an indexer. But then you realize indexing blockchain data isn't like indexing a traditional database.

You're not indexing static data. You're indexing a stream of transactions that never stops, where missed blocks mean data gaps, where network latency translates directly into stale data, and where the failure of a single RPC node can take the whole system down.

After shipping a production Solana indexer in Rust, I learned that every architectural decision is a trade-off. Here are the ones that matter.

Trade-off #1: Monolithic vs. Micro-services

What I chose: Monolithic with concurrent components.

Single Binary:
├── Indexer (fetches blocks)
├── Parser (extracts transactions)
├── Database Writer (persists data)
└── REST API (serves queries)

The trade-off:

Monolithic wins: Single deployment, shared memory, easier debugging, fewer network calls.

Micro-services lose: You get operational complexity you don't need at this stage.

Why it matters: At scale, you think you need micro-services. You don't. Not yet. A single binary with good concurrency (Tokio) scales vertically and is dramatically simpler to operate. Micro-services bring their own failure modes (network latency between services, cascading failures), and for a system this size those are worse than the problems that splitting it up would solve.

When this breaks: Once your indexer needs to ingest 100k+ transactions per second, you might shard by program ID or account and run 3-4 indexers. That's when you revisit this.
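If you do get there, assigning work to shards can be as simple as a stable hash of the program ID modulo the number of indexers. The sketch below is an illustration under assumptions: `shard_for` and the choice of FNV-1a are mine, not part of the indexer. FNV-1a is used instead of std's `DefaultHasher` because `DefaultHasher`'s algorithm is not guaranteed to stay the same across Rust releases, and every indexer has to agree on the mapping.

```rust
/// 64-bit FNV-1a: a tiny, stable, non-cryptographic hash.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf29ce484222325;
    const PRIME: u64 = 0x100000001b3;
    let mut hash = OFFSET_BASIS;
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

/// Hypothetical helper: which of `num_shards` indexers owns this program ID
/// (a base58 string). Every indexer must use the same function and shard count.
fn shard_for(program_id: &str, num_shards: u64) -> u64 {
    assert!(num_shards > 0, "need at least one shard");
    fnv1a_64(program_id.as_bytes()) % num_shards
}
```

For example, `shard_for("TokenkegQfeZyiNwAJbNbGKPFXCWuBvf9Ss623VQ5DA", 4)` always routes the SPL Token program to the same one of four indexers. Changing the shard count remaps most programs, so resharding means a migration either way.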

Trade-off #2: RPC Polling vs. Geyser Plugin

What I chose: RPC polling (with fallback to Geyser).

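// Simple. Reliable. Controllable.
// Every block fetch goes through the shared retry policy (Trade-off #4).
pub async fn get_block(&self, slot: u64) -> Result<Block> {
    let block = self
        .retry_manager
        .execute_with_retry(|| self.rpc_client.get_block(slot))
        .await?; // convert the RPC error into this module's error type
    Ok(block)
}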

The trade-off:

RPC Polling wins: Works with any provider (Helius, Triton, self-hosted), no special setup, easy to wrap in retry logic.
RPC Polling loses: ~200ms latency per block, rate limits, less real-time.
Geyser Plugin wins: Real-time streaming with minimal latency.
Geyser Plugin loses: Only works with your own validator, requires Solana knowledge, complex setup, breaking changes between versions.

Why it matters: RPC polling with exponential back-off is 10x simpler and works with any Solana infrastructure. Yes, you're 200ms behind finality. But you're also not debugging Geyser plugin crashes at 3 AM.

The real insight: Geyser plugins are architecturally superior. They're also operationally a nightmare. The question isn't "which is better?" It's "what can your team actually run?"
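The polling side mostly comes down to bookkeeping: ask the node for its latest slot (the `getSlot` JSON-RPC method), then fetch everything between your cursor and that tip, capped so a single poll never turns into a burst. Here is a sketch of just that bookkeeping with the RPC calls left out; `next_batch` and `max_batch` are hypothetical names, not the project's API.

```rust
use std::ops::Range;

/// Hypothetical helper: given the next slot we still need (`cursor`) and the
/// latest slot reported by the RPC node (`tip`), return the slots to fetch on
/// this poll, at most `max_batch` of them. An empty range means "caught up:
/// sleep briefly and poll again".
fn next_batch(cursor: u64, tip: u64, max_batch: u64) -> Range<u64> {
    if cursor > tip {
        return cursor..cursor; // nothing new yet
    }
    let end = tip.saturating_add(1).min(cursor.saturating_add(max_batch));
    cursor..end
}
```

In the loop, you would fetch each slot in the returned range in order, advance the cursor past it, and sleep when the range is empty. The sleep interval is what sets how far behind the tip you run.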

Trade-off #3: Row-Per-Transaction vs. De-normalized Schema

What I chose: Normalized schema (row per transaction + relationship tables).

-- Normalized approach
CREATE TABLE transactions (
    signature VARCHAR(88) PRIMARY KEY,
    slot BIGINT,
    fee BIGINT,
    success BOOLEAN
);

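CREATE TABLE transaction_accounts (
    transaction_signature VARCHAR(88) REFERENCES transactions,
    account_key VARCHAR(44),
    account_index INTEGER,
    PRIMARY KEY (transaction_signature, account_index)
);

-- The primary key can't serve "all transactions touching account X",
-- so index the account column for that lookup
CREATE INDEX idx_transaction_accounts_account_key
    ON transaction_accounts (account_key);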

The trade-off:

Normalized wins: Query flexibility, storage efficiency, referential integrity through foreign keys, easier updates.
Normalized loses: More joins = slower queries for specific use cases, more writes.
De-normalized wins: Single row per transaction with arrays/JSON, super fast specific queries.
De-normalized loses: Harder to query across relationships, data duplication, harder to maintain.

Why it matters: Every transaction touches 10-20 accounts. Store each separately, and queries are flexible. Store them all in one JSON array, and you can't efficiently query "all transactions touching account X" without scanning.

At scale (1B+ transactions), the de-normalized layout becomes expensive in storage, and a normalized schema with proper indexing is actually faster for cross-account queries like the one above.

What actually works: Normalized in PostgreSQL + caching layer. Don't de-normalize unless you've proven it's your bottleneck.
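To make the normalized layout concrete: a transaction's account list fans out into one `transaction_accounts` row per account, with the account's position in the list stored as `account_index`. This is a minimal sketch, assuming the account keys arrive as base58 strings in message order; the `AccountRow` struct and `account_rows` function are hypothetical.

```rust
/// One row of the `transaction_accounts` table.
#[derive(Debug, Clone, PartialEq)]
struct AccountRow {
    transaction_signature: String,
    account_key: String,
    account_index: i32, // INTEGER column in PostgreSQL
}

/// Fan a transaction's account keys out into rows, preserving message order.
fn account_rows(signature: &str, account_keys: &[&str]) -> Vec<AccountRow> {
    account_keys
        .iter()
        .enumerate()
        .map(|(i, key)| AccountRow {
            transaction_signature: signature.to_string(),
            account_key: key.to_string(),
            account_index: i as i32,
        })
        .collect()
}
```

A query for "all transactions touching account X" then filters `transaction_accounts` by `account_key` and joins back to `transactions` on the signature, rather than scanning a JSON array in every row.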

Trade-off #4: Infinite Retry vs. Graceful Degradation

What I chose: Exponential back-off with 3 attempts, then fail loud.

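use std::future::Future;
use std::time::Duration;

pub struct RetryManager {
    max_attempts: u32,           // 3 attempts
    base_delay: Duration,        // 1 second
}

impl RetryManager {
    // Delay after the n-th failed attempt: base_delay * 2^(n-1)
    fn calculate_delay(&self, attempt: u32) -> Duration {
        self.base_delay * 2u32.pow(attempt - 1)
    }

    // Delays: 1s after the 1st failure, 2s after the 2nd; the 3rd failure is
    // returned immediately (3s of back-off in total before giving up)
    pub async fn execute_with_retry<F, Fut, T, E>(&self, mut operation: F) -> Result<T, E>
    where
        F: FnMut() -> Fut,
        Fut: Future<Output = Result<T, E>>,
    {
        let mut attempt = 1;
        loop {
            match operation().await {
                Ok(result) => return Ok(result),
                // Out of attempts: surface the error instead of retrying forever
                Err(e) if attempt >= self.max_attempts => return Err(e),
                Err(_) => {
                    tokio::time::sleep(self.calculate_delay(attempt)).await;
                    attempt += 1;
                }
            }
        }
    }
}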

The trade-off:

Limited retry wins: Fast failure, clear logs, easier to detect actual issues.
Limited retry loses: Transient network blips cause gaps in indexing.
Infinite retry wins: Recovers from temporary outages automatically.
Infinite retry loses: Masks real problems, memory leaks if not careful, impossible to debug.

Why it matters: Infinite retry hides bugs. A network timeout that happens at 3 AM gets silently retried forever, and your monitoring never alerts. A few seconds of exponential back-off is enough to ride out transient issues but short enough to surface real problems quickly.

The lesson: When in doubt, fail visibly. Let your monitoring system detect it, trigger alerts, and page you. That's how you find real bugs.
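The total time spent backing off follows directly from the attempt count, and it is easy to miscount: the last failed attempt returns immediately, so 3 attempts means only two sleeps. A small sketch that computes the schedule, assuming the delay after the n-th failure is `base_delay * 2^(n-1)`; `backoff_schedule` and `total_backoff` are hypothetical helpers.

```rust
use std::time::Duration;

/// Sleeps performed by a retry loop that makes `max_attempts` attempts and
/// waits `base * 2^(n-1)` after the n-th failure (except after the last one).
fn backoff_schedule(base: Duration, max_attempts: u32) -> Vec<Duration> {
    (1..max_attempts).map(|n| base * 2u32.pow(n - 1)).collect()
}

/// Total time spent sleeping before the final error is returned.
fn total_backoff(base: Duration, max_attempts: u32) -> Duration {
    backoff_schedule(base, max_attempts).iter().sum()
}
```

With a 1s base, 3 attempts give sleeps of 1s and 2s (3s total). A 1s, 2s, 4s schedule (7s total) corresponds to 4 attempts, that is, 3 retries after the first try.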

Trade-off #5: Single-Threaded RPC vs. Concurrent Block Fetching

What I chose: Single RPC client thread + concurrent database writes.

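// Fetch blocks sequentially from RPC
loop {
    let block = rpc_client.get_block(current_slot).await?;

    // Parsing and database writes run concurrently inside process_block
    indexer.process_block(block).await?;

    // Advance only after the block has been fully processed
    current_slot += 1;
}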

The trade-off:

Sequential RPC wins: Ordered data, easier to recover from failures, predictable.
Sequential RPC loses: Can't parallelize RPC calls, slower ingestion.
Concurrent RPC wins: Higher throughput if your RPC provider allows it.
Concurrent RPC loses: Thundering herd on RPC provider, risk of rate limiting, harder to track state.

Why it matters: RPC providers hate thundering herds. Hit them with 100 concurrent requests and they rate-limit you hard. Better to fetch blocks in order (one at a time) and parallelize the work you can control (parsing, database writes).

The exception: If you have a dedicated RPC node, you can fetch 10 blocks ahead concurrently and always have data ready.
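Fetching ahead without giving up ordering comes down to keeping a bounded queue of in-flight requests and always consuming the oldest one first. The sketch below uses plain `std::thread` so it has no dependencies. In a Tokio-based indexer the same idea is usually written with `futures::StreamExt::buffered`, which also yields results in input order. `fetch_in_order` is a hypothetical helper, and `fetch` stands in for the RPC call.

```rust
use std::collections::VecDeque;
use std::ops::Range;
use std::sync::Arc;
use std::thread::{self, JoinHandle};

/// Fetch every slot in `slots` with up to `look_ahead` requests in flight,
/// returning results in slot order regardless of which request finishes first.
fn fetch_in_order<T, F>(slots: Range<u64>, look_ahead: usize, fetch: F) -> Vec<T>
where
    T: Send + 'static,
    F: Fn(u64) -> T + Send + Sync + 'static,
{
    assert!(look_ahead > 0, "look_ahead must be at least 1");
    let fetch = Arc::new(fetch);
    let mut in_flight: VecDeque<JoinHandle<T>> = VecDeque::new();
    let mut next = slots.start;
    let mut results = Vec::new();

    while next < slots.end || !in_flight.is_empty() {
        // Top up the window so up to `look_ahead` fetches are running.
        while next < slots.end && in_flight.len() < look_ahead {
            let f = Arc::clone(&fetch);
            let slot = next;
            in_flight.push_back(thread::spawn(move || (*f)(slot)));
            next += 1;
        }
        // Always wait on the oldest request: this is what keeps output ordered.
        if let Some(handle) = in_flight.pop_front() {
            results.push(handle.join().expect("fetch thread panicked"));
        }
    }
    results
}
```

With `look_ahead = 10`, the next ten blocks are usually already downloaded by the time the indexer needs them, while processing still happens strictly in slot order.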

Trade-off #6: Real-Time API vs. Read Replicas

What I chose: Single PostgreSQL instance with connection pooling.

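use deadpool_postgres::{Pool, Runtime};
use tokio_postgres::NoTls;

// One shared pool: the indexer and every API request use the same database.
// `config` is a deadpool_postgres::Config built from settings.
let pool: Pool = config.create_pool(Some(Runtime::Tokio1), NoTls)?;

pub async fn get_transactions(pool: &Pool, query: TransactionQuery) -> Result<Vec<Transaction>> {
    let client = pool.get().await?; // checks out a connection; returned to the pool on drop
    let rows = client.query(...).await?;
    Ok(rows.iter().map(Transaction::from_row).collect()) // from_row: hypothetical row mapper
}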

The trade-off:

Single DB wins: Simpler infrastructure, consistent reads, easier to reason about state.
Single DB loses: API reads compete with indexing writes for CPU, I/O, and connections (minor); single point of failure.
Read Replicas win: No contention, scales API independently.
Read Replicas lose: Replication lag (you're serving stale data), operational complexity, cost.

Why it matters: At scale, you eventually add read replicas. But before that? A single PostgreSQL instance with connection pooling handles thousands of QPS. You don't need replicas until you prove you do.

Real numbers: PostgreSQL on decent hardware handles roughly 5,000-10,000 queries/sec. That's already a lot.

What Actually Happened at Scale

Building this indexer taught me that you should optimize what's actually slow, not what you think might be slow.

The architecture handles sequential block fetching alongside concurrent parsing and database writes without a problem. Where the real bottlenecks appear depends entirely on your RPC provider and database hardware.

The lesson: Don't overthink this early. The monolithic approach scales further than most people expect. When you actually hit a bottleneck, you'll know it (metrics don't lie). Then you optimize that specific part, whether that's batching database writes, adding connection pooling, or eventually sharding by program ID.

Premature optimization creates complexity you don't need.
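Batching database writes, for example, can mean replacing one `INSERT` per account row with a single multi-row `INSERT`. Here is a sketch that builds the statement text for the `transaction_accounts` table, under assumptions: `batch_insert_sql` is a hypothetical helper, and the values would be bound in the same row-by-row order through whatever PostgreSQL client you use. PostgreSQL's wire protocol caps a statement at 65,535 bind parameters, so large batches have to be split.

```rust
/// Build `INSERT INTO transaction_accounts (...) VALUES ($1, $2, $3), ($4, $5, $6), ...`
/// for `rows` rows. Bind parameters are numbered row by row.
fn batch_insert_sql(rows: usize) -> String {
    const COLUMNS: usize = 3;
    assert!(rows > 0, "an INSERT needs at least one row");
    assert!(rows * COLUMNS <= 65_535, "too many bind parameters for one statement");

    let values: Vec<String> = (0..rows)
        .map(|r| {
            let first = r * COLUMNS + 1;
            format!("(${}, ${}, ${})", first, first + 1, first + 2)
        })
        .collect();

    format!(
        "INSERT INTO transaction_accounts (transaction_signature, account_key, account_index) VALUES {}",
        values.join(", ")
    )
}
```

A transaction touching 15 accounts becomes one round trip instead of 15. Batching across several blocks pushes that ratio further, at the cost of a larger unit to retry when a write fails.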

What This Indexer Gets Right

Graceful shutdown with SIGINT/SIGTERM handling (stops production processes cleanly).
Progress tracking every 100 blocks (you know exactly what's indexed).
Exponential back-off on RPC failures (survives transient network issues).
Connection pooling (doesn't leak database connections).
REST API with pagination (queryable, not just a black box).
41 tests, including property-based tests (they catch edge cases your brain misses).
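The first two items fit together. The main loop checks a shutdown flag between blocks, so a signal never interrupts a block halfway, and it logs progress at a fixed interval. This is a minimal, dependency-free sketch that assumes a signal handler elsewhere sets the flag (in a Tokio binary, that is typically a task awaiting `tokio::signal::ctrl_c()` or `tokio::signal::unix::signal`). `run_indexer` and `process_block` are hypothetical names, and `eprintln!` stands in for the real logger.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

const PROGRESS_INTERVAL: u64 = 100; // log once every 100 indexed blocks

/// Index slots from `start` until `shutdown` is set or `end` is reached.
/// Returns the next slot to index, which is where a restart should resume.
fn run_indexer(
    start: u64,
    end: u64,
    shutdown: Arc<AtomicBool>,
    mut process_block: impl FnMut(u64),
) -> u64 {
    let mut slot = start;
    let mut indexed: u64 = 0;
    // The flag is checked only between blocks, never in the middle of one.
    while slot < end && !shutdown.load(Ordering::SeqCst) {
        process_block(slot);
        slot += 1;
        indexed += 1;
        if indexed % PROGRESS_INTERVAL == 0 {
            eprintln!("progress: {} blocks indexed, next slot {}", indexed, slot);
        }
    }
    slot
}
```

Persisting the returned slot gives a restart point that never lands in the middle of a block.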

One More Thing

The most underrated part of building indexers? Testing with real data.

This is why I built integration tests that actually connect to Solana networks. You can run them against devnet (development), testnet (staging), or mainnet (production readiness check):

# Development: devnet (fast, low activity)
cargo test --test integration_test -- --nocapture

# Staging: testnet (moderate activity, real programs)
SOLANA_NETWORK=testnet cargo test --test integration_test -- --nocapture

# Production check: mainnet (high activity, real edge cases)
SOLANA_NETWORK=mainnet cargo test --test integration_test -- --nocapture
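How the tests pick an endpoint isn't shown here. One plausible approach is to map `SOLANA_NETWORK` to Solana's public RPC URLs and default to devnet when the variable is unset, which matches the first command above. The sketch below is an assumption, not the project's code: `rpc_url_for` and `rpc_url_from_env` are hypothetical helpers, and the public endpoints are heavily rate-limited.

```rust
use std::env;

/// Hypothetical helper: resolve a network name to a public Solana RPC endpoint.
fn rpc_url_for(network: Option<&str>) -> Result<&'static str, String> {
    match network.unwrap_or("devnet") {
        "devnet" => Ok("https://api.devnet.solana.com"),
        "testnet" => Ok("https://api.testnet.solana.com"),
        "mainnet" | "mainnet-beta" => Ok("https://api.mainnet-beta.solana.com"),
        other => Err(format!("unknown SOLANA_NETWORK: {other}")),
    }
}

/// Read SOLANA_NETWORK from the environment (unset means devnet).
fn rpc_url_from_env() -> Result<&'static str, String> {
    let network = env::var("SOLANA_NETWORK").ok();
    rpc_url_for(network.as_deref())
}
```

Failing on an unknown name, rather than silently falling back to devnet, keeps a typo from turning a mainnet check into a devnet run.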

Each network teaches you something different:

Devnet: Does the basic code work?
Testnet: Does it handle real program activity?
Mainnet: Where will it actually break?

The integration tests reveal edge cases that unit tests miss: slot skips, transaction failures with success=true, RPC rate limiting, and network latency.
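Slot skips deserve special handling because they are normal on Solana. A leader can fail to produce a block, and `getBlock` then returns an error for that slot instead of a block. A loop like the one in Trade-off #5 has to treat that error as a reason to advance the cursor, not as a failure. Below is a sketch of that classification, keyed on the JSON-RPC error code. The codes are my reading of the Solana RPC server's custom error codes (-32007 and -32009 for skipped slots, -32004 for a block that is not available yet), so verify them against the `solana-rpc-client-api` version you use. `SlotOutcome` and `classify_get_block_error` are hypothetical.

```rust
/// What the indexer should do after `getBlock` fails for a slot.
#[derive(Debug, PartialEq)]
enum SlotOutcome {
    /// No block was produced for this slot: record the gap and move on.
    Skipped,
    /// The node does not have the block yet: wait and retry the same slot.
    NotYetAvailable,
    /// Anything else: hand it to the retry logic, then fail loudly.
    Error(i64),
}

fn classify_get_block_error(code: i64) -> SlotOutcome {
    match code {
        -32007 | -32009 => SlotOutcome::Skipped,
        -32004 => SlotOutcome::NotYetAvailable,
        other => SlotOutcome::Error(other),
    }
}
```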

The Finished Product

What's production-ready right now:

9.78s build time (no unnecessary dependencies).
41/41 tests passing (property-based + unit tests).
Integration tests against devnet/testnet/mainnet.
Graceful shutdown with signal handling.
REST API with pagination and filtering.
Connection pooling (deadpool-postgres).
Exponential back-off retry logic (covered by tests).
Progress tracking (a log line every 100 indexed blocks).
Single binary (~5MB release build).

Performance depends on:

Your RPC provider's speed (50ms-5s latency).
Your PostgreSQL hardware.
Network conditions on Solana.

The code is designed to scale vertically. Horizontal scaling (multiple indexers) comes later if you need it.

Open source: github.com/Cherrypick14/solana-indexer-rs

Final Thought

Indexing Solana isn't hard. What's hard is admitting that simplicity is a feature, not a limitation.

The fanciest architecture I didn't build would have been more impressive. The one I did build actually works.

Questions? Let's chat in the comments.

Top comments (1)

Valentyn Kit • Jul 8

The RPC-polling-vs-Geyser call is right, but the second-order cost people miss: if you ever need Geyser's latency, you're now running a full validator just to get plugin access, which is a much bigger infra commitment than "harder setup" implies. It's not a lateral trade, it's a different order of magnitude in what you're operating.