Why Columnar Storage Makes Analytics Faster

#analytics #database #dataengineering #performance

Traditional databases store data row by row, which works well for many applications. But for analytics, most queries only touch one or two columns. Storing data by column is often much more efficient because each column is stored contiguously in memory, making it faster to load and process.

// Row storage
Row 0: [1, "Alice", "Engineering", 95000]
Row 1: [2, "Bob", "Sales", 87000]
Row 2: [3, "Carol", "Engineering", 102000]

// Columnar storage
ids:         [1, 2, 3]
names:       ["Alice", "Bob", "Carol"]
departments: ["Engineering", "Sales", "Engineering"]
salaries:    [95000, 87000, 102000]

A query like the one below becomes much faster since only the relevant columns are loaded and processed:

SELECT AVG(salary) FROM employees WHERE department = 'Engineering'

Why Queries Are Faster with Columns

Memory & cache efficiency
- Each column is stored contiguously, so loading a single column is fast.
- Smaller memory footprint means only the data you need is loaded. This also fits better in CPU caches.
Compression & storage
- Columns contain the same data type, often with repeated values, making them ideal for compression. For example, a column storing categories, gender or nationality will have low cardinality
- Algorithms like run-length encoding (RLE) or dictionary encoding handle this repetitive data extremely efficiently.
Vectorized processing (SIMD)
- CPUs can apply the same operation to multiple values in a column at once, speeding up aggregation and analytics.