lina for Rock'n'Block

Making blockchain data sane with smarter tools

If you’ve ever tried to extract data from a blockchain, you know it’s not exactly plug-and-play. You’re dealing with distributed infrastructure, frequent reorgs, and often incomplete APIs. The data’s all there — somewhere — but getting it out, structured, and production-ready is a project of its own.

In our full deep dive, we explore the whole landscape. This is a shorter version — a technical summary of the different approaches and tools we’ve seen work when you need blockchain data at scale.

The Most Basic Blockchain Data Indexing Solution (And Why It’s Not Enough)

Every blockchain indexing setup starts with the node. It exposes an RPC interface that lets you query raw data from the chain. On Ethereum, the most straightforward way to get started is with eth_getLogs.

How eth_getLogs works

Logs are emitted by smart contracts to provide information for off-chain consumers — they exist for this exact purpose. With eth_getLogs, you can filter by event signature, contract address, and block range. It’s simple, efficient for many use cases, and works reliably over time.

But once things get more complex, eth_getLogs starts to show its limits.

Logs don’t contain everything. If you need transaction metadata — like timestamps, sender info, or execution context — you’ll have to make additional RPC calls like eth_getTransactionByHash. This means multiple queries per event, which slows down the pipeline and introduces inefficiencies, especially when working with high volumes of data.
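
To make that round-trip cost concrete, here is a minimal sketch using raw JSON-RPC over `fetch` (Node 18+). The endpoint URL is a placeholder, and the filter is just an example ERC-20 `Transfer` topic — adapt both to your setup.

```typescript
// Minimal sketch: fetch ERC-20 Transfer logs, then enrich each one with
// transaction metadata via a second RPC call. The endpoint is a placeholder.
const RPC_URL = "https://example-eth-rpc.invalid"; // hypothetical endpoint

async function rpc(method: string, params: unknown[]): Promise<any> {
  const res = await fetch(RPC_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ jsonrpc: "2.0", id: 1, method, params }),
  });
  const { result, error } = await res.json();
  if (error) throw new Error(error.message);
  return result;
}

// keccak256("Transfer(address,address,uint256)")
const TRANSFER_TOPIC =
  "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef";

async function getTransfers(contract: string, fromBlock: string, toBlock: string) {
  const logs = await rpc("eth_getLogs", [
    { address: contract, topics: [TRANSFER_TOPIC], fromBlock, toBlock },
  ]);
  // Every extra piece of context costs another round trip per event.
  for (const log of logs) {
    const tx = await rpc("eth_getTransactionByHash", [log.transactionHash]);
    console.log(log.blockNumber, tx.from, tx.to, log.data);
  }
}
```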

Using eth_getBlockReceipts

To improve on that, Ethereum provides eth_getBlockReceipts, which returns all transaction receipts from a given block. This gives you both the input data (calldata) and the resulting logs in one request. It’s a more complete view of block activity and helps reduce the number of round trips to the node.

Still, there are trade-offs. eth_getBlockReceipts doesn’t support filtering — you can’t ask for just the receipts related to a specific contract or event. So even though it reduces the number of calls, it increases the amount of data you have to process.
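
As a rough illustration of that trade-off, the sketch below (reusing the `rpc` helper from the previous snippet) pulls every receipt in a block and filters client-side — both the filtering work and the bandwidth cost land on you.

```typescript
// Sketch: one call returns every receipt in the block; narrowing down to a
// single contract has to happen client-side, after the data is downloaded.
async function receiptsForContract(blockNumber: string, contract: string) {
  const receipts = await rpc("eth_getBlockReceipts", [blockNumber]);
  return receipts.filter((r: any) =>
    r.logs.some((log: any) => log.address.toLowerCase() === contract.toLowerCase()),
  );
}
```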

This can be especially limiting in protocols like Uniswap V3, where swap events trigger deeper state changes that aren’t captured in logs or receipts. To correctly track LP fees, you need access to updated storage values like FeeGrowth, which aren’t emitted as events and aren’t included in receipts. The only way to get them is by querying the contract’s storage directly — and doing that per transaction doesn’t scale, especially on fast chains with hundreds of swaps in a single block.

Full execution context with debug_traceBlock

For use cases like Uniswap V3 state changes, Ethereum offers a more powerful option: the debug_traceBlock method. It’s part of the debug API, and not all RPC providers expose it — but when available, it gives full execution traces for each transaction in a block. That includes calldata, logs, internal calls, and storage changes.

This lets you extract values like FeeGrowth directly from the execution trace, without making additional storage queries. It also shows the full call tree across contracts, which is essential when you need to understand how different components of a protocol interact in a single transaction.
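
Here is a hedged sketch of what that can look like with Geth-style tracing (again reusing the `rpc` helper): `debug_traceBlockByNumber` with the `prestateTracer` in diff mode returns per-transaction storage changes. Exact availability and response shape vary by client version and provider.

```typescript
// Sketch: trace a whole block with Geth's prestateTracer in diff mode to see
// which storage slots changed — this is where values like feeGrowth surface.
// The response shape differs slightly between client versions.
async function storageDiffsForBlock(blockNumber: string) {
  const traces = await rpc("debug_traceBlockByNumber", [
    blockNumber,
    { tracer: "prestateTracer", tracerConfig: { diffMode: true } },
  ]);
  for (const t of traces) {
    // In diff mode, result.post maps address -> { storage: { slot: newValue }, ... }
    const post = t.result?.post ?? {};
    for (const [address, state] of Object.entries<any>(post)) {
      if (state.storage) console.log(address, state.storage);
    }
  }
}
```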

The limitations of polling the node

Working directly with the node still comes with two major limitations.

First, there’s no real push model. You can’t subscribe to a stream of historical data from a specific block. WebSockets only give you events from the moment you connect, and if the connection drops, you lose data. This makes real-time indexing fragile unless you implement polling logic on the client side.

Second, nodes don’t handle chain reorgs for you. If a block gets orphaned, you won’t be notified. You either have to stick to finalized blocks (which adds delay), or write your own logic to detect and handle reorgs. That’s a significant amount of overhead for something the node already does internally.
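
To show how much of that ends up in client code, here is a simplified polling loop with parent-hash reorg detection, reusing the `rpc` helper from earlier. Production logic would need to walk back more than one block on deep reorgs.

```typescript
// Sketch: client-side polling with basic reorg detection. We remember the hash
// of the last processed block and check that the next block's parentHash still
// points at it. (Simplified: a deep reorg needs to walk back further than one block.)
async function pollBlocks(startBlock: number) {
  let nextNumber = startBlock;
  let lastHash: string | null = null;

  while (true) {
    const block = await rpc("eth_getBlockByNumber", ["0x" + nextNumber.toString(16), false]);
    if (!block) {
      // Block not produced yet — wait and poll again.
      await new Promise((resolve) => setTimeout(resolve, 2000));
      continue;
    }
    if (lastHash && block.parentHash !== lastHash) {
      // Reorg detected: step back and re-process the previous height on the new branch.
      nextNumber -= 1;
      lastHash = null;
      continue;
    }
    // ...index the block here...
    lastHash = block.hash;
    nextNumber += 1;
  }
}
```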

So while node-level indexing using RPC and debug methods is the foundation of many tools, it has clear limits — especially for teams building real-time, reliable, or high-volume data pipelines.

Solving Polling Limitations with Firehose by The Graph

The two main pain points with traditional polling-based blockchain indexers led to a new approach that actually solves them: Firehose from The Graph. Let’s break down how this service works.

The first piece is a modified node

Running a regular blockchain node like the ones discussed earlier doesn’t make much sense—it just moves us back to the inefficient polling model.

Instead, the node is forked and a streaming patch is added that the service can read from. Here’s how it works:

  • When a new block lands on the node, it’s immediately pushed into a pipe.
  • The indexing service reads from this pipe in real time.
  • For Ethereum, this requires a custom fork since there’s no official way to patch nodes for streaming. On Solana, it’s simpler—there’s a Geyser plugin that allows hooking into the node’s events.

Adding historical streaming

Standard nodes aren’t built to stream historical blockchain data from an arbitrary point in the past. Here’s why:

  • Nodes rely on efficient storage, usually on disk, optimized for quick lookups.
  • Streaming historical data means constant heavy reads from storage, which can overload the system.
  • Streaming live data in memory is one thing, but hitting storage nonstop for older blocks creates unpredictable load.

Because of this, nodes don’t support historical streaming out of the box. Firehose addresses this by providing a service that can stream blockchain data from any block height, letting indexers replay the chain as needed.

The second piece is cloud storage

Firehose stores data as flat files—similar to what the node itself uses—which is the smallest efficient unit. It uses S3-compatible cloud storage, which brings some big benefits:

  • Cloud-native and serverless, so there’s no infrastructure to manage or scale.
  • You pay only for what you actually use.
  • No vendor lock-in: almost every cloud provider offers S3-compatible storage with similar APIs, so switching providers is straightforward if a better option comes along.

The final piece is a better API

Regular nodes communicate over JSON-RPC via HTTP, streaming plain text exactly as received, which isn’t very efficient for modern indexing tools.

Firehose uses gRPC, a binary protocol that:

  • Packs data efficiently before streaming.
  • Works across languages—schemas are defined once, then client code is generated in whatever language is needed, eliminating the need to write and maintain a separate client library for every language.

Firehose blockchain indexing service workflow explained

Here’s the basic flow of the Firehose service:

  • A blockchain node is run and modified to enable real-time streaming.
  • The streamed data is pushed into cloud storage buckets (e.g., S3).
  • A streaming interface is built that users connect to for blockchain data indexing.
  • A key part of this interface is the Joined Block Source—a mechanism that automatically switches between data sources depending on user needs.

For example, if a user wants to stream blocks starting from an hour ago (historical data), the service initially fetches data from historical storage (the buckets). Once the user catches up to the latest block (the current block head), the stream switches automatically to real-time data delivered directly from the modified node.
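
Here is a conceptual sketch of that switching logic. The two sources and the head lookup are hypothetical helpers passed in by the caller; the real joined block source lives inside Firehose itself.

```typescript
interface Block {
  number: number;
  hash: string;
}

type BlockSource = (fromBlock: number) => AsyncIterable<Block>;

// Conceptual "joined block source": replay history from bucket storage until
// the chain head is reached, then hand off to the live stream from the node.
async function* joinedBlockSource(
  startBlock: number,
  headNumber: () => Promise<number>,
  historical: BlockSource, // reads merged bundles from bucket storage (hypothetical)
  live: BlockSource,       // real-time stream from the patched node (hypothetical)
) {
  let cursor = startBlock;

  // Phase 1: replay history from flat files until we catch up to the head.
  for await (const block of historical(cursor)) {
    yield block;
    cursor = block.number + 1;
    if (block.number >= (await headNumber())) break; // caught up
  }

  // Phase 2: switch seamlessly to the live stream.
  yield* live(cursor);
}
```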

User benefits of Firehose streaming

  • Cursor-based streaming lets users specify the exact block to start from, enabling precise indexing.
  • Chain-agnostic design works across blockchain networks—the node layer changes, but storage and API remain the same.
  • Immediate reorg notifications ensure consistency across indexers.
  • Unified stream for both historical and live data—no manual switching needed.
  • Reorg logic is fully handled inside Firehose, so clients only need to respond to events.

This architecture removes major pain points in blockchain indexing and delivers a scalable, reliable solution that simplifies how developers and applications consume blockchain data.

How Firehose keeps indexing always up and running

To keep indexing available at all times, the architecture is built to avoid any single point of failure. Here’s how Firehose handles it:

  • At least two nodes stream blocks in parallel. One node is kept as the primary source, and the second acts as a backup. An RPC provider working in polling mode is added as an additional fallback.
  • To handle these data streams efficiently, the reader component is split into at least two instances. These readers independently fetch blocks from different sources and write them into a centralized bucket storage.
  • Each reader exposes a gRPC interface to stream binary block data.

The Firehose component performs the following for end users:

  • Subscribes to multiple live sources to get the freshest data.
  • Merges incoming data streams and performs deduplication.
  • Whichever reader delivers a block first, that block is sent to the user (see the sketch after this list).
  • If the primary node fails, the backup continues streaming without disruption.
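
A minimal sketch of that “first delivery wins” idea, assuming each reader pushes blocks into a shared merger callback:

```typescript
interface StreamedBlock {
  number: number;
  hash: string;
}

// Sketch: every reader pushes blocks into the same merger; whichever source
// delivers a given block first wins, and later duplicates are dropped.
function firstDeliveryWins(
  onBlock: (block: StreamedBlock) => void,
  remember = 1000, // how many recent block keys to keep for deduplication
) {
  const seen = new Set<string>();
  return (block: StreamedBlock) => {
    const key = `${block.number}:${block.hash}`;
    if (seen.has(key)) return; // slower source — already forwarded
    seen.add(key);
    if (seen.size > remember) {
      // Sets iterate in insertion order, so this drops the oldest key.
      seen.delete(seen.values().next().value as string);
    }
    onBlock(block); // first delivery wins
  };
}

// Usage: the primary and backup readers both call the same handler.
const handle = firstDeliveryWins((b) => console.log("forwarding block", b.number));
```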

Handling data duplication in storage

Since all readers write blocks to the same bucket, deduplication at the storage level is essential. To solve this, a dedicated merger service is introduced that:

  • Pulls all blocks from the primary bucket (One blocks bucket).
  • Optimizes storage of finalized blocks by bundling them into groups of 100.
  • Writes these optimized bundles into a separate storage—the Merged blocks bucket.
  • Stores all forked blocks separately in the Forked blocks bucket.

Now, Firehose works with three buckets:

  • One blocks bucket (raw blocks from readers)
  • Merged blocks bucket (deduplicated, optimized bundles)
  • Forked blocks bucket (fork data)

When a large historical range is requested, the service delivers blocks in bundles of 100 instead of one by one, making retrieval faster and more efficient — the bundle math is sketched below.
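
Here is the rough bundle math, assuming bundles are keyed by a zero-padded base block number. The exact file naming is an assumption for illustration, not the actual on-disk format.

```typescript
// Sketch: finalized blocks are grouped into bundles of 100, keyed by the
// bundle's base block number (zero-padded name is an illustrative assumption).
const BUNDLE_SIZE = 100;

function bundleBase(blockNumber: number): number {
  return Math.floor(blockNumber / BUNDLE_SIZE) * BUNDLE_SIZE;
}

function bundleFileName(blockNumber: number): string {
  return String(bundleBase(blockNumber)).padStart(10, "0") + ".bin";
}

// A request for blocks 1,500,000–1,500,499 touches exactly five bundle files.
for (let b = 1_500_000; b < 1_500_500; b += BUNDLE_SIZE) {
  console.log(bundleFileName(b)); // 0001500000.bin ... 0001500400.bin
}
```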

Remaining challenges

Firehose solves key problems related to fetching data directly from nodes and greatly improves service reliability. However, overfetching remains an issue: Firehose currently streams all data without filtering, which isn’t optimal since different applications require different data subsets.

Standard filter presets can’t cover every use case because each app’s needs are unique and often complex.

The simplest and most flexible solution is to let developers write custom filters themselves, streaming only the filtered data their applications actually need and making Firehose more efficient and adaptable. This is where Substreams steps in.

Custom Data Filtering with Substreams

Substreams is an engine that allows developers to upload their own code — essentially a function that takes some input, processes it, and returns a result — compiled to WebAssembly.

In practice, the developer writes a function that takes input (for example, a block) and outputs something specific — like Raydium events. How these Raydium events are extracted from the block depends entirely on the developer’s logic.

The code is written, compiled, and uploaded to the server — from there, the engine runs that function on every block. This means the stream delivers exactly the custom data the application needs, as defined by the developer’s logic.
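
For a feel of the shape of such a module: real Substreams modules are written in Rust, compiled to WebAssembly, and operate on protobuf-defined block types, but conceptually they are just a pure function from a block to app-specific output. The TypeScript sketch below mimics that shape; the simplified Solana types are illustrative, not the actual schema.

```typescript
// Conceptual sketch only — real Substreams modules are Rust compiled to Wasm.
// This just shows the shape of a map module: block in, app-specific data out.
interface Instruction {
  programId: string;
}
interface Transaction {
  signature: string;
  instructions: Instruction[];
}
interface SolanaBlock {
  slot: number;
  transactions: Transaction[];
}
interface RaydiumEvent {
  slot: number;
  signature: string;
}

// Raydium AMM v4 program id (for illustration).
const RAYDIUM_PROGRAM_ID = "675kPX9MHTjS2zt1qfr1NYHuzeLXfQM9H24wFSUt1Mp8";

// The "module": the full block goes in, only the events the app cares about come out.
function mapRaydiumEvents(block: SolanaBlock): RaydiumEvent[] {
  return block.transactions
    .filter((tx) => tx.instructions.some((ix) => ix.programId === RAYDIUM_PROGRAM_ID))
    .map((tx) => ({ slot: block.slot, signature: tx.signature }));
}
```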

How Blockchain Data Streaming Service Architecture Evolves with Substreams

When Substreams is introduced, the architecture shifts as follows:

  • Substreams operates as its own service alongside Firehose.
  • It runs developer-supplied WebAssembly (Wasm) modules, processes incoming block data, and streams back only the filtered, application-specific data.
  • Developers define exactly what data they need.
  • Contracts, events, or on-chain data relevant to the app are specified — no unnecessary data floods the client.

To support this, a Relayer component is introduced:

  • In the original Firehose setup, Firehose was the sole consumer of reader streams and handled deduplication itself. Now that both Firehose and Substreams consume block data, deduplication logic is moved into the Relayer.
  • The Relayer ensures that whichever node delivers the block first is the one whose data gets streamed to clients.

How Substreams Blockchain Data Streaming Service Scales

The Substreams service is built around two core components: the Front Tier and the Worker Pool.

When a user requests processing for a block range — for example, from block 10,000 to 14,999 (5,000 blocks) — the request is sent to the Front Tier.

The Front Tier manages a group of workers (Substreams Tier 2). Each worker can handle up to 16 concurrent tasks. The Front Tier splits the requested range into smaller segments of about 1,000 blocks each and distributes these segments across the workers.

Each worker processes its assigned block segment and writes the resulting data into a dedicated Substreams store bucket. This bucket serves as a cache layer that stores processed data for quick access and efficient retrieval — its importance will be covered in more detail when discussing data bundling.

Instead of streaming data directly back to the Front Tier, the workers stream progress updates. These updates indicate when a segment finishes processing or if an error occurs (e.g., a function revert), since user-defined logic might occasionally fail.

The Front Tier ensures strict ordering by waiting for the first segment to finish before streaming its data to the user. It then moves sequentially through each segment, waiting for completion before sending its data. This guarantees a reliable, ordered data stream from the start to the end of the requested block range.
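
Here is a simplified sketch of that fan-out-then-ordered-delivery pattern. `processSegment` stands in for a Tier 2 worker call and is hypothetical.

```typescript
// Sketch: split a requested range into ~1,000-block segments, process them in
// parallel, and still deliver results to the user in strict order.
const SEGMENT_SIZE = 1_000;

interface Segment {
  start: number;
  end: number; // inclusive
}

function splitRange(start: number, end: number): Segment[] {
  const segments: Segment[] = [];
  for (let s = start; s <= end; s += SEGMENT_SIZE) {
    segments.push({ start: s, end: Math.min(s + SEGMENT_SIZE - 1, end) });
  }
  return segments;
}

async function streamRangeInOrder(
  start: number,
  end: number,
  processSegment: (seg: Segment) => Promise<Uint8Array>, // hypothetical worker call
  send: (data: Uint8Array) => void,
) {
  // Kick off all segments in parallel...
  const pending = splitRange(start, end).map((seg) => processSegment(seg));
  // ...but emit strictly in order: wait for segment N before sending N+1.
  for (const result of pending) {
    send(await result);
  }
}

// Example: blocks 10,000–14,999 become five segments of 1,000 blocks each.
console.log(splitRange(10_000, 14_999).length); // 5
```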

How Modules Work in Substreams

Here’s a breakdown of the kinds of modules that can be loaded into Substreams and how they help with scaling:

Module Outputs Caching

A module can be configured to accept the output of another, already-cached module instead of raw blocks. Referencing that cached module in a request works like this:

For example, an existing module — built previously — takes blocks from the merged blocks bucket as input. Its job is to extract all Uniswap V3 events within each block. It doesn’t modify data, just filters it down, so the output is smaller than the original block data. Essentially, it contains only the Uniswap V3 events, not the entire block data.

This filtered data is then stored in the Substreams Store Bucket. When writing a module, it can be specified to take another module’s output (the Uniswap V3 events) as input instead of raw blocks. The server recognizes it can pull pre-filtered data directly from the cache, saving compute resources.

Since billing is based on the amount of data retrieved, accessing already filtered data from the cache not only streamlines the developer’s workflow but also reduces costs.

Index Modules

Index modules differ from regular ones by producing a standardized kind of output: for every block, they emit a list of keys — markers — that make it quick to check whether the block holds the data needed.

This means the index module takes raw blocks, scans them, and builds an index showing which contracts were touched or what log topics appeared in that block.

How Filters Use Indexes to Cut Down Data

For example, a module called Filtered Transactions uses the index output to narrow down blocks. The module’s manifest specifies “I want to use this index,” adding a filter like “Show me Raydium transactions.”

The server pulls cached indexes, figures out which blocks contain Raydium transactions, and only sends those blocks to the Filtered Transactions module. This avoids wasting time checking every block.

If someone has already filtered Raydium transactions before, that output is likely cached, so instead of re-running the index the filtered result can be served immediately.
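
A toy sketch of the index-then-filter idea follows. The key format is an assumption, but the point stands: the filter only ever touches blocks whose index keys match.

```typescript
// Sketch: an index module emits a small set of keys per block; a filter module
// consults those keys to skip blocks that cannot possibly match.
type BlockIndex = Map<number, Set<string>>; // block number -> keys seen in it

function blocksMatching(index: BlockIndex, wantedKey: string): number[] {
  const hits: number[] = [];
  for (const [blockNumber, keys] of index) {
    if (keys.has(wantedKey)) hits.push(blockNumber);
  }
  return hits;
}

// Only matching blocks are ever handed to the filtered-transactions module.
const index: BlockIndex = new Map([
  [100, new Set(["program:Raydium", "program:TokenProgram"])],
  [101, new Set(["program:TokenProgram"])],
]);
console.log(blocksMatching(index, "program:Raydium")); // [100]
```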

Streaming Blockchain Indexed Data into a Database

At this stage, the goal is to transfer all data processed by Substreams into a database. This is done via SQL Sink, an open-source tool developed by The Graph.

Connecting Substreams to a Database via SQL Sink

SQL Sink connects to the Substreams server and consumes data streams. It requires data modules to emit data in a specific format that maps blockchain data to database operations. This format includes commands like insert, upsert, update, and delete along with their primary keys and associated data.

This design delegates all data transformation logic to Substreams modules, enabling SQL Sink to efficiently execute database operations. Users only need to implement modules that produce data in the required format.

Data Processing Workflow

SQL Sink processes database commands by distributing data across tables as defined by modules.
To handle chain reorganizations (reorgs), every database operation is logged in a History table.

When a reorg occurs, operations linked to invalid blocks are rolled back using the History table, keeping the database consistent.
While SQL Sink currently supports basic commands (insert, upsert, update, delete), it can be forked and extended to support additional operations like increments. Users can create custom modules and handlers to translate these into SQL commands.
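
To make the rollback idea concrete, here is a simplified in-memory sketch of the pattern: apply each operation, record what it overwrote keyed by block number, and undo in reverse order when a reorg invalidates those blocks. The `Op` shape is an assumption, not SQL Sink’s actual wire format.

```typescript
// Sketch: every operation is applied to its table and also appended to a
// history log, so a reorg can be rolled back block by block.
type Op =
  | { kind: "insert" | "upsert" | "update"; table: string; key: string; data: Record<string, unknown> }
  | { kind: "delete"; table: string; key: string };

interface HistoryEntry {
  blockNumber: number;
  op: Op;
  previous?: Record<string, unknown>; // value overwritten by this op, if any
}

const tables = new Map<string, Map<string, Record<string, unknown>>>();
const history: HistoryEntry[] = [];

function apply(blockNumber: number, op: Op) {
  const table = tables.get(op.table) ?? tables.set(op.table, new Map()).get(op.table)!;
  history.push({ blockNumber, op, previous: table.get(op.key) });
  if (op.kind === "delete") table.delete(op.key);
  else table.set(op.key, op.data); // simplified: update overwrites instead of merging
}

// Undo everything at or above the first invalidated block of a reorg.
function rollback(invalidFromBlock: number) {
  while (history.length && history[history.length - 1].blockNumber >= invalidFromBlock) {
    const { op, previous } = history.pop()!;
    const table = tables.get(op.table)!;
    if (previous === undefined) table.delete(op.key);
    else table.set(op.key, previous);
  }
}
```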

Users are not limited to SQL Sink alone; they can build custom sinks tailored to their needs using the core data streams and parallel processing provided by Substreams.

Comparison with Subgraphs

Subgraphs provide a self-contained package where users supply compiled WebAssembly code defining all logic to handle events and transactions.

Unlike Substreams, subgraphs do not maintain their own block storage. Instead, they query nodes directly for block data as needed, which keeps setup and deployment simple — a key advantage.

However, subgraphs lack data parallelization—they must sync blocks sequentially, which can cause bottlenecks. They work well on networks like Ethereum but are less practical for high-throughput chains such as Solana.

Why Indexing Still Holds Back Blockchain Growth

Despite the rise of new Layer 1 and high-performance chains, indexing infrastructure remains a major bottleneck. Many networks lack native, reliable, and scalable indexing tools.

This results in significant challenges:

  • Accessing blockchain data at scale remains complex.
  • Developers spend time and resources on infrastructure rather than app development.
  • Protocol teams repeatedly solve the same indexing problems.
  • Poor indexing slows adoption by making new networks harder to build on.

Substreams is designed as a high-throughput data indexing framework enabling blockchains to natively provide production-grade data infrastructure.

Key benefits include:

  • Real-time and historical blockchain data streaming.
  • Cursor-based access that enables parallel processing.
  • A modular architecture allowing developers to write custom filters.
  • Caching and deduplication that reduce costs and improve performance.

By integrating Substreams, blockchains can provide developers with efficient, structured, and streamable access to blockchain data without sacrificing scalability.

About Rock’n’Block

Rock’n’Block is a Web3-native development company. We build backend infrastructure and indexing pipelines for projects and protocols across multiple blockchain ecosystems.
Our work spans real-time and historical data processing, with production-ready systems tailored to handle high throughput and complex queries.
Focus areas:

  • Firehose, Substreams, and custom indexing pipelines
  • EVM chains, Solana, TON
  • Scalable architecture for developer tooling and dApp infrastructure

Case study: How we built a blockchain data streaming service for Blum → https://rocknblock.io/portfolio/blum
We’ve contributed to over 300 projects that collectively reached 71M+ users, raised $160M+, and hit $2.4B+ in peak market cap. Our role is to handle the backend complexity so teams can move faster and ship with confidence.
