DEV Community

Quame Jnr
Quame Jnr

Posted on

Why Columnar Storage Makes Analytics Faster

Traditional databases store data row by row, which works well for many applications. But for analytics, most queries only touch one or two columns. Storing data by column is often much more efficient because each column is stored contiguously in memory, making it faster to load and process.

// Row storage
Row 0: [1, "Alice", "Engineering", 95000]
Row 1: [2, "Bob", "Sales", 87000]
Row 2: [3, "Carol", "Engineering", 102000]
Enter fullscreen mode Exit fullscreen mode
// Columnar storage
ids:         [1, 2, 3]
names:       ["Alice", "Bob", "Carol"]
departments: ["Engineering", "Sales", "Engineering"]
salaries:    [95000, 87000, 102000]
Enter fullscreen mode Exit fullscreen mode

A query like the one below becomes much faster since only the relevant columns are loaded and processed:

SELECT AVG(salary) FROM employees WHERE department = 'Engineering'
Enter fullscreen mode Exit fullscreen mode

Why Queries Are Faster with Columns

  • Memory & cache efficiency
    • Each column is stored contiguously, so loading a single column is fast.
    • Smaller memory footprint means only the data you need is loaded. This also fits better in CPU caches.
  • Compression & storage
    • Columns contain the same data type, often with repeated values, making them ideal for compression. For example, a column storing categories, gender or nationality will have low cardinality
    • Algorithms like run-length encoding (RLE) or dictionary encoding handle this repetitive data extremely efficiently.
  • Vectorized processing (SIMD)
    • CPUs can apply the same operation to multiple values in a column at once, speeding up aggregation and analytics.

This is one reason OLAP systems handle large-scale analytics much more efficiently than OLTP systems.

Top comments (0)