<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gaurav Ramesh</title>
    <description>The latest articles on DEV Community by Gaurav Ramesh (@outofdesk).</description>
    <link>https://dev.to/outofdesk</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F412882%2F351275be-36ec-4436-8965-e5290af6183e.jpg</url>
      <title>DEV Community: Gaurav Ramesh</title>
      <link>https://dev.to/outofdesk</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/outofdesk"/>
    <language>en</language>
    <item>
      <title>The Engineering Behind Fast Analytics: Columnar Storage Explained</title>
      <dc:creator>Gaurav Ramesh</dc:creator>
      <pubDate>Tue, 08 Jul 2025 15:28:47 +0000</pubDate>
      <link>https://dev.to/outofdesk/the-engineering-behind-fast-analytics-columnar-storage-explained-1j23</link>
      <guid>https://dev.to/outofdesk/the-engineering-behind-fast-analytics-columnar-storage-explained-1j23</guid>
      <description>&lt;h3&gt;
  
  
  A Very Short History of Columnar Stores
&lt;/h3&gt;

&lt;p&gt;The idea of storing data in columns is not new. It was arguably first introduced comprehensively in 1985 by G.P. Copeland and S.N. Khoshafian. Their paper, "A &lt;strong&gt;Decomposition Storage Model (DSM)&lt;/strong&gt;"[1], proposed storing data in binary relations, pairing each attribute value with the record's identifier. This approach organized data by columns rather than rows, which they argued offered simplicity and better retrieval performance for queries involving a subset of attributes, though it required more storage space overall.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MonetDB&lt;/strong&gt;[2] implemented these ideas by 1999, becoming one of the first systems to embrace columnar architecture for analytical workloads and showcasing its effectiveness. &lt;strong&gt;C-Store&lt;/strong&gt;[3], developed in the mid-2000s, marked another crucial milestone, introducing concepts, explained later in this post, that are now standard in modern columnar storage systems.&lt;/p&gt;

&lt;p&gt;The late 2000s and early 2010s saw a rise in developments in this area, with projects like &lt;strong&gt;Apache Parquet&lt;/strong&gt;[4] (influenced by &lt;strong&gt;Google's Dremel&lt;/strong&gt;[5] paper) bringing columnar storage to the Hadoop ecosystem.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Core Concept: Columnar vs. Row-Oriented Storage
&lt;/h3&gt;

&lt;p&gt;Traditional row-oriented databases store all data for a single entity together. The term &lt;em&gt;row&lt;/em&gt; in a row-oriented system reflects this conceptual model: each record is laid out contiguously, like a sentence written left to right in a notebook. In contrast, columnar stores organize data by columns, with each column containing the values of a single attribute across all rows. This seemingly simple change has profound implications for analytical performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ea6y7xkcd7u1ltecoo7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ea6y7xkcd7u1ltecoo7.png" alt="Row vs Columnar Storage" width="695" height="663"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are two key concepts to know when discussing transactional and analytical systems: &lt;strong&gt;predicates&lt;/strong&gt; and &lt;strong&gt;projections&lt;/strong&gt;. &lt;strong&gt;Predicates&lt;/strong&gt; are the conditions by which you filter the entities (rows) you want (think of them as the &lt;code&gt;WHERE&lt;/code&gt; clause in an SQL query). &lt;strong&gt;Projections&lt;/strong&gt; are the fields (columns) that you want in the response (think of them as the column names you list in a &lt;code&gt;SELECT&lt;/code&gt; statement).&lt;/p&gt;

&lt;p&gt;If you think of your data as a list of rows, vertically stacked, predicates slice it horizontally, and projections slice it vertically. Transactional queries often rely on predicates to filter rows, with projections spanning the entire row, i.e. all the columns. Here's an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Example #1&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1234&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; 

&lt;span class="c1"&gt;-- Example #2 &lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;user&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1234&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Projections in analytical queries involve a small subset of fields of the entity being queried. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;user&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_orders&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;user_aggregates&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1234&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Consider a table with 50 columns and millions of rows. In a row-oriented system, if you need only 3 columns, the database would still have to read all 50 columns for each row. With columnar storage, only the 3 relevant columns are accessed, massively reducing the I/O overhead, i.e. the amount of data you deal with while processing analytical queries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hq7g8vwuqk378hfnxh4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hq7g8vwuqk378hfnxh4.png" alt="Projections and Predicates" width="800" height="609"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Techniques Powering Columnar Stores
&lt;/h3&gt;

&lt;p&gt;This idea of storing data in columns opens up new avenues for further optimizations. Here's a mental model to make sense of the following techniques: &lt;strong&gt;think of query execution as a pipeline that passes data through various stages, potentially transforming it at each one&lt;/strong&gt;. And it's a two-way pipeline at that: all the way from the client that wants the data to the system that computes and serves that data, and back. In each direction, the places that benefit from optimization are the network, the CPU, memory, and the disk.&lt;/p&gt;

&lt;p&gt;In transactional systems (OLTP), the primary means of improving query performance is &lt;em&gt;indexing&lt;/em&gt;, which helps you get to the data you need faster, potentially entirely from memory. That's sufficient because a transactional query usually deals with one entity at a time. Analytical queries, by contrast, deal with large volumes of data, so reducing the data you work with at each stage of the pipeline is the primary way to improve performance. The smaller the data you work with, the lower the cost, and the faster the pipeline.&lt;/p&gt;

&lt;p&gt;Here are the primary ways of optimizing analytical pipelines. We'll look at each of them in this post.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Representing/encoding the data as efficiently as possible&lt;/strong&gt; (data compression/column-specific compression)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filtering the data as much and as early as possible&lt;/strong&gt; (column pruning, predicate pushdown)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expanding the data as late as possible&lt;/strong&gt; (direct operation on compressed data, late materialization)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Faster processing of the data&lt;/strong&gt; (vectorized processing, efficient joins)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A word of caution - although these are described as distinct techniques here, in reality they are much more intertwined. The boundaries between where one ends and another begins are not as clear, and they often rely on each other for maximum gains.&lt;/p&gt;

&lt;h4&gt;
  
  
  Data Compression/Column-specific Compression
&lt;/h4&gt;

&lt;p&gt;Columnar storage enables effective compression. Because the values within a single column are of the same type and have similar characteristics, compression algorithms can achieve higher compression ratios. Techniques such as dictionary encoding, run-length encoding (RLE), bit packing, and delta encoding are commonly used in modern columnar stores.&lt;/p&gt;

&lt;p&gt;Let's take an example to understand some of them - Say you have a data store for analyzing traffic on your website and are tracking the source from which a user entered your site. You might have noticed that when you click on a link in your email, the link usually opens up with a &lt;em&gt;utm_source=email&lt;/em&gt;, or &lt;em&gt;utm_source=newsletter&lt;/em&gt;, for example. The &lt;em&gt;utm_source&lt;/em&gt; generally has a limited set of values that identifies the channel through which the user visited your site. The details of that source - domain, URL, time, cookies of the user - are tracked separately. Because each post or page could have thousands or millions of visits, your analytics database will have as many entries for each page, but with only a handful of values for the &lt;em&gt;utm_source&lt;/em&gt; column.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxlz0nz93cpog3j1idig.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxlz0nz93cpog3j1idig.png" alt="Compression Techniques" width="735" height="915"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The string values of the source column can be &lt;strong&gt;dictionary encoded&lt;/strong&gt; using integer values that pack tighter and are easier to work with (e.g. source email = 1, source twitter = 2, and so on). Once that's done, if consecutive entries in the DB have the same value, which could happen in instances like when you send out a campaign or a newsletter, the column can be further compressed using &lt;strong&gt;run-length encoding&lt;/strong&gt;. If a thousand consecutive entries have the same value for source, say 1, it can be stored as (1, 1000), rather than storing the same value a thousand times. Furthermore, if the encoded values span a range far smaller than a 32-bit integer can hold, &lt;strong&gt;bit packing&lt;/strong&gt; compresses them even more by storing only as many bits as the largest value needs. In our case, if we have 200 different values for &lt;em&gt;source&lt;/em&gt;, we only need 8 bits, rather than 32!&lt;/p&gt;
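&lt;p&gt;Here's a minimal Python sketch of dictionary encoding followed by run-length encoding for the &lt;em&gt;utm_source&lt;/em&gt; example above (the values and code assignments are illustrative, not any engine's actual on-disk format):&lt;/p&gt;

```python
# Hypothetical sketch: dictionary-encode a string column, then
# run-length encode the resulting integer codes.

def dictionary_encode(values):
    # Map each distinct string to a small integer code (1, 2, 3, ...).
    codes = {}
    encoded = []
    for v in values:
        if v not in codes:
            codes[v] = len(codes) + 1
        encoded.append(codes[v])
    return codes, encoded

def run_length_encode(encoded):
    # Collapse runs of identical codes into (code, run_length) pairs.
    runs = []
    for code in encoded:
        if runs and runs[-1][0] == code:
            runs[-1] = (code, runs[-1][1] + 1)
        else:
            runs.append((code, 1))
    return runs

sources = ["email"] * 3 + ["twitter"] * 2 + ["email"]
codes, encoded = dictionary_encode(sources)
runs = run_length_encode(encoded)
# codes -> {'email': 1, 'twitter': 2}
# runs  -> [(1, 3), (2, 2), (1, 1)]
# Bit packing: with 200 distinct sources, each code fits in 8 bits,
# since (200).bit_length() == 8, instead of a full 32-bit integer.
```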

&lt;h4&gt;
  
  
  Column Pruning
&lt;/h4&gt;

&lt;p&gt;One of the direct results of storing data in a columnar fashion is that it makes it easy to eliminate entire columns not required for processing. Simply put, your query only ever touches the columns referenced in its SELECT, WHERE, GROUP BY, ORDER BY, or JOIN clauses. Depending on the complexity of the query, a whole lot of data never even enters the query execution pipeline.&lt;/p&gt;

&lt;p&gt;Consider a &lt;em&gt;users&lt;/em&gt; table that has 25 columns and this query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;phone&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;num_orders&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query only needs 5 of the 25 columns (four projected, one in the predicate), so column pruning potentially eliminates 80% of the I/O overhead by skipping the other 20 columns entirely.&lt;/p&gt;

&lt;h4&gt;
  
  
  Predicate Pushdown
&lt;/h4&gt;

&lt;p&gt;The idea behind predicate pushdown is to shrink the data footprint at the lowest possible level of execution. Columnar stores do this by evaluating query predicates (think WHERE clauses) at the storage layer, thus removing the overhead of dealing with unnecessary data in memory and at the CPU. This is possible because columnar stores organize the data in blocks or chunks, with each block carrying metadata about the range of values it contains (also called zone maps): min/max values, null counts, and so on.&lt;/p&gt;

&lt;p&gt;So when a query comes in, the metadata can be used to skip entire blocks: only blocks that could contain matching data are read from disk. Within the selected blocks, the filtering of specific values happens during decompression.&lt;/p&gt;

&lt;p&gt;Example - Consider a simple query&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'New York'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F96741m7z1yl0r09wbmk2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F96741m7z1yl0r09wbmk2.png" alt="Predicate Pushdown" width="800" height="700"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Without predicate pushdown, it'd read all column values for the &lt;em&gt;age&lt;/em&gt; and &lt;em&gt;city&lt;/em&gt; columns, and then filter the data in memory. With predicate pushdown, it'd check block-level metadata for the age and city columns, skipping blocks whose maximum age is 50 or below, and blocks whose range of values cannot contain New York. For the remaining blocks, it'd apply the filters during decompression.&lt;/p&gt;
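&lt;p&gt;Here's a minimal sketch of that block-skipping logic, assuming each block's metadata carries a min/max for &lt;em&gt;age&lt;/em&gt; and the set of distinct &lt;em&gt;city&lt;/em&gt; values (real engines store zone maps in format-specific ways; the block contents below are made up):&lt;/p&gt;

```python
# Hypothetical zone-map metadata for three blocks of the users table.
blocks = [
    {"age": (18, 45), "city": {"Boston", "Chicago"}},   # block 0
    {"age": (51, 70), "city": {"New York", "Boston"}},  # block 1
    {"age": (30, 60), "city": {"Chicago", "Seattle"}},  # block 2
]

def blocks_to_read(blocks):
    # Keep a block only if it could satisfy age > 50 AND city = 'New York'.
    selected = []
    for i, meta in enumerate(blocks):
        lo, hi = meta["age"]
        if hi > 50 and "New York" in meta["city"]:
            selected.append(i)
    return selected

# Only block 1 survives; blocks 0 and 2 are never read from disk.
```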

&lt;h4&gt;
  
  
  Direct Operation on Compressed Data
&lt;/h4&gt;

&lt;p&gt;Storing data by columns also makes it easier to operate on partially compressed data, which further reduces the I/O overhead. Consider a sample query&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1002&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's say that the data before compression looks something like this -&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;department: 1001, 1001, 1001, 1002, 1002, 1002, 1002, 1002, 1003, 1003
salary: 100000, 110000, 100000, 100000, 95000, 95000, 100000, 100000, 100000, 100000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After dictionary encoding and run-length encoding, the department column might look like this&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dictionary: {1001: 1, 1002: 2, 1003: 3}
department: 1, 1, 1, 2, 2, 2, 2, 2, 3, 3 (dictionary encoded)
department: (1, 3), (2, 5), (3, 2) (run-length encoded)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similarly, salaries will look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;salary: (100000, 1), (110000, 1), (100000, 2), (95000, 2), (100000, 4)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While evaluating the &lt;code&gt;department&lt;/code&gt; predicate, the first three and the last two rows of the column can be skipped immediately, merely by looking at the run-length encoded data (because they contain departments we're not interested in). The selection can then be encoded as a bitmap like &lt;code&gt;0001111100&lt;/code&gt;, where the 0s mark the rows to be excluded from the aggregation and the 1s mark the rows to be included. Now the bitmap can be used to sum the salary column. The first two run-length encoded entries can be skipped outright. The third entry, (100000, 2), straddles the boundary: only its second row falls within the selection, so it must be partially expanded, contributing 100000. The fourth entry, (95000, 2), lies entirely within the selection and can be multiplied right away: 95000 * 2 = 190000. The last entry, (100000, 4), also needs partial expansion, because only its first two rows are included, giving 100000 * 2 = 200000. The final sum, 490000, is thus derived by adding three values rather than individually adding the five selected salaries.&lt;/p&gt;
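&lt;p&gt;The walk described above can be sketched in Python, operating on the run-length encoded columns directly (a simplified illustration, not how any particular engine implements it):&lt;/p&gt;

```python
# Sum an RLE-compressed value column for rows matching an RLE-compressed
# predicate column, without fully decompressing either side.

dept_runs = [(1, 3), (2, 5), (3, 2)]      # dictionary-encoded: 1002 -> 2
salary_runs = [(100000, 1), (110000, 1), (100000, 2), (95000, 2), (100000, 4)]

def rle_filtered_sum(pred_runs, value_runs, wanted):
    # Turn matching predicate runs into half-open row intervals [start, end).
    intervals, row = [], 0
    for code, length in pred_runs:
        if code == wanted:
            intervals.append((row, row + length))
        row += length
    # Walk the value runs, adding only the overlapping part of each run.
    total, row = 0, 0
    for value, length in value_runs:
        for start, end in intervals:
            overlap = min(row + length, end) - max(row, start)
            if overlap > 0:
                total += value * overlap
        row += length
    return total

# Matching interval is rows [3, 8): 100000*1 + 95000*2 + 100000*2 = 490000
```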

&lt;h4&gt;
  
  
  Late Materialization, aka Late Tuple Reconstruction
&lt;/h4&gt;

&lt;p&gt;In the spirit of minimizing the data you work with, alluded to earlier, the idea behind late materialization is to expand the data only when you need to. While predicate pushdown lets us operate on just the data we want, based on the predicates in the query, late materialization delays assembling the projected fields until we have determined which rows need to be returned. This further reduces the overhead of processing unnecessary data at each stage.&lt;/p&gt;

&lt;p&gt;Take the same query we considered in the predicate pushdown section&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'New York'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can work with just the age and city columns through the entire pipeline, up until we're ready to send the results back to the user; only at that stage do we fetch the &lt;code&gt;name&lt;/code&gt; column from the store.&lt;/p&gt;
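&lt;p&gt;A tiny sketch of the idea, using the predicate pushdown section's filters (the column contents are made up): the pipeline carries only row positions until the very end, and the projected column is materialized only for the surviving rows.&lt;/p&gt;

```python
# Columns as independent arrays; 'name' is touched only at the last step.
age = [25, 55, 62, 41, 58]
city = ["Boston", "New York", "Chicago", "New York", "New York"]
name = ["Ana", "Ben", "Cy", "Dee", "Eli"]

# Evaluate predicates over the filter columns, keeping positions, not rows.
positions = [i for i in range(len(age)) if age[i] > 50 and city[i] == "New York"]

# Materialize the projected column only for the surviving positions.
result = [name[i] for i in positions]
# positions -> [1, 4]; result -> ['Ben', 'Eli']
```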

&lt;h4&gt;
  
  
  Vectorized Processing
&lt;/h4&gt;

&lt;p&gt;Vectorized processing in columnar databases operates on batches of data rather than individual values, leading to significant performance improvements.&lt;/p&gt;

&lt;p&gt;SIMD (Single Instruction, Multiple Data) is a parallel processing technique invented to efficiently process large &lt;em&gt;arrays of similar data&lt;/em&gt; (mathematically called vectors) that require the &lt;em&gt;same operation to be performed on each element&lt;/em&gt;. It's available in most modern CPU architectures. Let's focus on the two main parts of the problem it solves - &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Large arrays of similar data - When data is stored by columns, this is exactly what you get while processing a single column&lt;/li&gt;
&lt;li&gt;Same operation to be performed on each element - This is true for most analytical queries, where you want to apply a predicate on the values of a specific column&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Consider this query&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1234&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To find the desired rows, you're looking at the column &lt;code&gt;user_id&lt;/code&gt; and performing the same operation, an equality check, on all the values in that column. In traditional processing, you'd perform these steps sequentially - &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check if the value of the &lt;code&gt;user_id&lt;/code&gt; column for a given row is equal to &lt;code&gt;1234&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If it is, add the row's &lt;code&gt;price&lt;/code&gt; to the sum
&lt;/li&gt;
&lt;li&gt;Repeat for the next row&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With vectorized processing, a chunk of data, say 1000 &lt;code&gt;user_id&lt;/code&gt; values, is loaded into memory at once and compared to the value &lt;code&gt;1234&lt;/code&gt; in parallel using SIMD. A bitmask of matches is then used to sum the corresponding &lt;code&gt;price&lt;/code&gt; values, also in parallel.&lt;/p&gt;
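&lt;p&gt;Here's what that looks like using NumPy's array operations as a stand-in for SIMD instructions (the data is made up; real engines run this over compressed column chunks):&lt;/p&gt;

```python
import numpy as np

# One chunk of the user_id and price columns.
user_id = np.array([1234, 99, 1234, 42, 1234])
price = np.array([10.0, 5.0, 7.5, 3.0, 2.5])

# One comparison over the whole chunk produces a boolean match mask...
mask = user_id == 1234          # [ True, False,  True, False,  True]
# ...which selects the matching prices and sums them in bulk.
total = price[mask].sum()       # 10.0 + 7.5 + 2.5 = 20.0
```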

&lt;h4&gt;
  
  
  Efficient Join Implementations
&lt;/h4&gt;

&lt;p&gt;Columnar storage enables advanced join implementations beyond the traditional hash or merge joins. One example of such a technique is a &lt;strong&gt;semi-join&lt;/strong&gt;, which determines whether one table has matching values in the join column of another table, without needing to return all column values from the second table. It uses Bloom filters to achieve this. They are probabilistic data structures that efficiently check for the &lt;em&gt;potential&lt;/em&gt; existence of values, i.e. answer set-membership queries. When asked whether a value exists in a set, they never produce false negatives (they can say with certainty that a value does not exist), but might produce false positives (they might say a value exists when it does not). Instead of storing the actual join keys (which could be millions of values taking gigabytes), a Bloom filter uses just a few megabytes to represent the same set with high accuracy.&lt;/p&gt;

&lt;p&gt;Here's how it would work for a query like this&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'EMEA'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The engine first filters customers by the &lt;code&gt;region&lt;/code&gt; column, say 10,000 out of a million customers. For those 10,000 customers, it builds a bloom filter that takes only a few megabytes. It then scans the &lt;code&gt;orders&lt;/code&gt; table (say 100 million records), tests each &lt;code&gt;customer_id&lt;/code&gt; value against the bloom filter, skips orders from customers that can't match, and performs the more traditional and expensive hash join on only the remaining orders.&lt;/p&gt;
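&lt;p&gt;A toy Bloom filter and semi-join probe, sketched in Python (the filter size, hash scheme, and data are all illustrative; production engines use carefully tuned filters):&lt;/p&gt;

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        # Derive k bit positions per key from salted SHA-256 digests.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # No false negatives; rare false positives are possible.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

# Build the filter from customer ids that pass the region = 'EMEA' predicate...
emea_ids = [7, 21, 42]
bf = BloomFilter()
for cid in emea_ids:
    bf.add(cid)

# ...then probe it while scanning orders, skipping customer_ids that can't match.
orders = [(1, 42), (2, 13), (3, 7), (4, 99)]   # (order_id, customer_id)
candidates = [o for o in orders if bf.might_contain(o[1])]
# Candidates are guaranteed to include orders 1 and 3; any false positives
# that slip through are removed by the hash join that follows.
```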

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Combining all the above techniques, columnar data stores not only save storage space on disk but also reduce I/O overhead, both of which translate into cost savings for the organization. By working on less data, faster, they also offer significant, scalable performance gains for analytical workloads. They have been gaining wide adoption in areas such as web analytics, business intelligence, machine learning infrastructure, log and event analysis, and real-time analytics, to name a few.&lt;/p&gt;

&lt;p&gt;If you are a data practitioner who already works with columnar stores, I hope that the knowledge of the internals helps you squeeze the most juice out of them, and optimize them for your use cases. If you are an application developer building analytical products, think about your stack and consider introducing a columnar data store where you require performance and scalability. This post should help you make a case for why columnar storage might make sense for your needs. If you're an engineering leader considering adding a columnar datastore to your stack, knowledge of the above techniques should help you evaluate the trade-offs and make the right strategic decisions for your organization.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;The post was originally published &lt;a href="https://outofdesk.netlify.app/blog/engineering-fast-analytics-columnar-storage-explained" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[1] &lt;a href="https://dl.acm.org/doi/10.1145/971699.318923" rel="noopener noreferrer"&gt;https://dl.acm.org/doi/10.1145/971699.318923&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;[2] &lt;a href="http://sites.computer.org/debull/A12mar/monetdb.pdf" rel="noopener noreferrer"&gt;http://sites.computer.org/debull/A12mar/monetdb.pdf&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;[3] &lt;a href="https://dl.acm.org/doi/10.5555/1083592.1083658" rel="noopener noreferrer"&gt;https://dl.acm.org/doi/10.5555/1083592.1083658&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;[4] &lt;a href="https://parquet.apache.org/" rel="noopener noreferrer"&gt;https://parquet.apache.org/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;[5] &lt;a href="https://research.google/pubs/dremel-interactive-analysis-of-web-scale-datasets-2/" rel="noopener noreferrer"&gt;https://research.google/pubs/dremel-interactive-analysis-of-web-scale-datasets-2/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>analytics</category>
      <category>columnar</category>
      <category>database</category>
      <category>performance</category>
    </item>
    <item>
      <title>Book Notes: Cloud Native Patterns</title>
      <dc:creator>Gaurav Ramesh</dc:creator>
      <pubDate>Mon, 07 Feb 2022 02:41:21 +0000</pubDate>
      <link>https://dev.to/outofdesk/book-notes-cloud-native-patterns-ihl</link>
      <guid>https://dev.to/outofdesk/book-notes-cloud-native-patterns-ihl</guid>
      <description>&lt;p&gt;I have worked on cloud systems for a few years now, but as is said, &lt;em&gt;“Fish don’t know they’re in water”&lt;/em&gt;, when you’re deep into something it’s hard to zoom out and look at the big picture, to see how ideas and practices are connected to each other and how they share common goals.&lt;/p&gt;

&lt;p&gt;This is my attempt to do exactly that: zoom out, learn more about and reflect on the things I’ve worked on, and make sense of them as a coherent whole, rather than a bunch of loosely and randomly connected ideas.&lt;/p&gt;

&lt;p&gt;I’m starting with a book called &lt;em&gt;&lt;a href="https://www.manning.com/books/cloud-native-patterns"&gt;“Cloud Native Patterns”, by Cornelia Davis&lt;/a&gt;&lt;/em&gt;, and I plan to use this space to document things I read, learn and think about along the way. As such, I expect this to be a constantly changing and evolving post, much like the cloud systems that the book talks about.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Basics
&lt;/h2&gt;




&lt;p&gt;&lt;strong&gt;Cloud-first before cloud-native&lt;/strong&gt; – Cloud-first is a good precursor to understanding what cloud-native means and what it has to offer. Although the way people define a cloud-first strategy varies widely, it is essentially thinking of the cloud as the primary medium of running your software, as opposed to the older approach that tied all your systems to a set of servers with dedicated resources in a specific physical location that you owned or managed. Cloud, more than anything, is a layer of abstraction built to free developers’ minds from the concerns of how to run systems and promote thinking about what to build. Hence moving to the cloud doesn’t necessarily mean taking your systems off-premises; it’s about how you think (or rather, how much you don’t have to think) about them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Natural Side-effects of moving to the cloud&lt;/strong&gt; – Think of taking care of your babies at home versus sending them to daycare. A natural and inevitable side-effect of sending your baby away for a few hours is that you lose some control. It’s neither good nor bad, it just is. So while it relieves you of some duties and responsibilities, it also raises the stakes in other ways. Similarly, when you take your systems from your premises to the cloud, you lose some control. But it doesn’t magically solve all your problems; it solves some, and throws other, newer problems at you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inherent characteristics of the cloud&lt;/strong&gt; – Going back to the daycare example, it’s a more dynamic place than your home. There are other actors at play – other babies, the care-givers, other babies’ parents – and the challenges of an unfamiliar setting. In other words, it’s a chaotic place. Things are not always predictable and are bound to go wrong, in one way or the other. Because of the sheer number of unknowns, learning about all of them, let alone preventing all the problems, is not an option. The cloud is exactly like that too. There are other systems at play, their dependencies, and the shared infrastructure, all of it, needless to say, bound by the laws of physics. It sounds scary at first, but accepting these truths enables us to think about how to build systems that function amidst the chaos, despite all the failings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So why go to the cloud&lt;/strong&gt; – If you lose control of and visibility into your systems, why on Earth would you want to migrate? Because absolute control and perfection are delusional goals to build your life or business upon. And while you lose some control over how your systems work together, you gain control over what your systems do, and how they provide value to your customers. If you had all the time in the world, you would have taken care of your babies at home. Not that you don’t like them going out, but you would like to be around them as much as you can. But time is finite, and society and its changing needs throw a lot at you to juggle. To make the best use of your time, you prioritize things and delegate or outsource some responsibilities. As technology has come to dominate most people’s lives, the changing needs of society often translate to changing needs of software. And as the demands on your business increase, you’re compelled to prioritize not only your time, but also the limited material resources you possess.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Desirable characteristics of a successful business&lt;/strong&gt; – In order to serve the ever-changing needs of society, certain characteristics naturally emerge as desirable for running a successful business. Simply put, a business should provide the value promised to all its customers, anywhere and anytime, and evolve quickly to address their needs.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;All its customers&lt;/em&gt; – In the world of the Internet, your customers come from different backgrounds, cultures, ages, and genders; they use different devices, and operate under widely different constraints. So accessibility, multi-language, and multi-device support now become first-class citizens.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Anywhere and anytime&lt;/em&gt; – When your customers are based in different geographical locations across continents, compliance with legal and other policies, and the ability to operate across time zones (essentially, being available around the clock) are of prime importance.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Evolve quickly&lt;/em&gt; – Evolving quickly to serve such a diverse group of customers under such a unique set of constraints means having shorter, more accurate feedback cycles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud-native&lt;/strong&gt; – An umbrella term for the set of ideas, principles, processes, and tools that aims to marry the inherent characteristics of the cloud with the desirable characteristics of today’s businesses in a seamless manner.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommended Reading
&lt;/h2&gt;




&lt;p&gt;&lt;a href="https://www.manning.com/books/cloud-native-patterns"&gt;Cloud Native Patterns, by Cornelia Davis&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.microsoft.com/en-us/dotnet/architecture/cloud-native/definition"&gt;What is Cloud Native&lt;/a&gt;&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>cloudnative</category>
      <category>engineering</category>
      <category>backend</category>
    </item>
    <item>
      <title>A Story About Simple and Obvious Things</title>
      <dc:creator>Gaurav Ramesh</dc:creator>
      <pubDate>Sun, 28 Jun 2020 14:42:20 +0000</pubDate>
      <link>https://dev.to/outofdesk/a-story-about-simple-and-obvious-things-33d0</link>
      <guid>https://dev.to/outofdesk/a-story-about-simple-and-obvious-things-33d0</guid>
      <description>&lt;p&gt;One afternoon on a weekday, I had a meeting with a few people at work to release the last piece of a project that was meant to change the way our clients interacted with the backend systems. It was a fairly simple change and was expected to be done in a few minutes. Not all people invited were strictly required in the meeting, so while a few joined, others kept an eye on their respective systems for signals and were kept abreast of the progress via Slack.&lt;/p&gt;

&lt;p&gt;So there I was, in a room with a few people, talking to a few on a video call, and chatting with a couple more on the side.&lt;/p&gt;

&lt;p&gt;Five minutes into the meeting, we did the change and shifted our attention to the monitoring systems to see early signs of success, or to minimize damage in the remote possibility of failure. Twenty minutes later, we still didn’t see any activity that we’d expected or hoped for.&lt;/p&gt;

&lt;p&gt;The nature of the change was such that we’d expected it to take a short period to propagate but didn’t know how long exactly. So we didn’t sweat. We had theories about why it might be taking longer than usual. About thirty minutes later, we decided to involve somebody from the iOS team to simulate tests from the clients’ perspective to get some validation.&lt;/p&gt;

&lt;p&gt;They took their sweet time to prepare for the test, and after about fifteen minutes, finally ran it. They saw some errors. When we dug into it, we learned that the clients had been seeing those errors the whole time since I made the change — so for about fifty minutes. What was weird was that no alarms were going off and no pages were triggered, so we doubted whether the clients had seen any errors at all, or suffered any negative impact. In any case, we reverted the change, just to be safe. The errors, as you’d guess, went away.&lt;/p&gt;




&lt;p&gt;We’ve all been part of something like this. Your first big screw-up is almost a rite of passage to a long career in software engineering.&lt;/p&gt;

&lt;p&gt;When I first learned about the impact of the incident, I felt embarrassed and a little nervous. Not because it could have been avoided — that realization only came later — but because I instantly started questioning my abilities. I was trusted with it, and I was incapable of executing it. It was my lack of knowledge, I thought, that was the problem, and my self-doubt started to eat me up. At some point, I even went into denial and started blaming other people for it, at least in my mind, if not in reality.&lt;/p&gt;

&lt;p&gt;After it was all dealt with, my manager called me on my phone to have a chat. He was kind and responsible enough to reflect on it and think of ways he could have helped me. But there was one thing he mentioned which guided my thinking in the right direction.&lt;/p&gt;

&lt;p&gt;He asked me what my plan was before I started to work on it. How dare he ask me what the plan was, I thought. Was he asking if I had a plan at all? I’d obviously have had a plan.&lt;/p&gt;

&lt;p&gt;Only I didn’t.&lt;/p&gt;

&lt;p&gt;I went into the meeting assuming that I knew all that had to be done, and I didn’t anticipate anything going wrong. So it wasn’t a gap in my knowledge that was the problem, but a lack of thought and foresight about the potential consequences and impact of the change on the system as a whole, both good and bad.&lt;/p&gt;

&lt;p&gt;There were many pieces to look after, and even though each of those tasks was fairly simple, every one of them was also crucial to success, and because of their fragmented nature, it was really easy to miss one or two.&lt;/p&gt;

&lt;p&gt;Atul Gawande, in his book &lt;a href="https://www.amazon.com/Checklist-Manifesto-How-Things-Right/dp/0312430000"&gt;The Checklist Manifesto: How to Get Things Right&lt;/a&gt;, mentions an essay from the 1970s by philosophers Samuel Gorovitz and Alasdair MacIntyre, which talks about human fallibility. One reason they talk about is “necessary fallibility”, which is when things go wrong because they’re beyond our capacity to control. These, by definition, can’t be prevented from happening.&lt;/p&gt;

&lt;p&gt;But he identifies two other reasons why things go wrong — &lt;em&gt;ignorance&lt;/em&gt;, the gaps in our knowledge, and &lt;em&gt;ineptitude&lt;/em&gt;, our failure to apply the knowledge correctly.&lt;/p&gt;

&lt;p&gt;Ignorance, if we’re willing to learn, can be reduced over time; the more time we invest in something, the more we learn about it, and the more knowledge we gain. But competence doesn’t just come with time. Getting rid of ineptitude requires persistent effort and deliberate practice. In fact, there’s a case to be made that the more knowledge we possess, the more likely we are to blind ourselves to a holistic, systems-thinking approach unless we’re constantly reevaluating ourselves.&lt;/p&gt;

&lt;p&gt;The more I thought about it, the more I realized that my mistake had a meta nature to it. First, despite having the necessary knowledge to execute the task at hand, I had failed to come up with a plan, a checklist of things I should have had in place before I jumped in. Second, and the meta part: I knew the importance of checklists, but had failed to put that knowledge, too, into practice.&lt;/p&gt;




&lt;p&gt;After this incident, I finally started taking the idea of checklists, and the habit of note-taking, seriously.&lt;/p&gt;

&lt;p&gt;In that spirit, here’s a list of things I’ve learned and tried to follow since:&lt;/p&gt;

&lt;h4&gt;
  
  
  Meetings
&lt;/h4&gt;

&lt;p&gt;Take notes before, during, and after meetings. &lt;/p&gt;

&lt;p&gt;Before a meeting, take note of the agenda: what you intend to learn, what you want to say, and what you expect to get out of the meeting.&lt;/p&gt;

&lt;p&gt;During a meeting, take note of what people say (along with their names, if possible), any keywords or jargon mentioned so you can read about them later, questions asked, observations, and decisions made.&lt;/p&gt;

&lt;p&gt;After a meeting, list the important things, then organize and summarize the ideas so you can make sense of them in the future. Put the notes in chronological order, relative to the notes of other meetings on a related or similar topic.&lt;/p&gt;

&lt;h4&gt;
  
  
  Projects
&lt;/h4&gt;

&lt;p&gt;Other than a project’s shared documents (planning document, design document, and so on), it’s immensely helpful to maintain separate notes for yourself about the things you’re working on. Just as in software systems, where you design different data models depending on the use case and requirements, you can create multiple documents based on the nature of their use.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Quick Reference” document for things you’d frequently need while working on a project, for example, necessary links and resources, usernames and passwords, contact info of people important to the project, and so on.&lt;/li&gt;
&lt;li&gt;“Questions” document for things you need clarification on.&lt;/li&gt;
&lt;li&gt;“Details” document for the technical details of the project.&lt;/li&gt;
&lt;li&gt;“Planning” document for milestones and timelines associated with the project.&lt;/li&gt;
&lt;li&gt;“Checklist” for the list of things you’re working on, will work on, and have already completed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I usually tag them by their project name, and their type, so they’re all searchable with one keyword.&lt;/p&gt;

&lt;h4&gt;
  
  
  Releases
&lt;/h4&gt;

&lt;p&gt;This usually breaks out from one of the “Planning” items in the project documents: one document for every release, with the list of things to take care of before and after the release.&lt;/p&gt;

&lt;h4&gt;
  
  
  Audience
&lt;/h4&gt;

&lt;p&gt;Remember that all of these are for your benefit and sanity. The primary audience is you. Don’t get too worked up about the format or industry conventions. If it works for you, it’s good enough. It doesn’t even have to be in full sentences or grammatically correct, as long as the information contained is accurate. Pictures, links, references: anything is fair game. When it’s time to share or present the information to people, you can clean it up, organize it, and proofread it.&lt;/p&gt;

&lt;h4&gt;
  
  
  Progress
&lt;/h4&gt;

&lt;p&gt;While stand-up, stand-down, and check-in meetings are great for letting other people know what you’re working on or blocked by, keeping those things as a list will help you go through pending items in a streamlined, efficient, and timely manner. You’ll also get a sense of accomplishment, and a feeling of closure, from seeing things come off your checklist.&lt;/p&gt;

&lt;h4&gt;
  
  
  Connections
&lt;/h4&gt;

&lt;p&gt;Fragmented bits of data put together in one place help you keep your focus on the bigger picture, and help you find patterns and relationships between seemingly disparate ideas.&lt;/p&gt;




&lt;p&gt;While this list looks fairly simple and the things in it seem obvious, if there’s one thing I’ve learned from my experience, it’s that things that seem simple and obvious are not necessarily easy, or so obvious.&lt;/p&gt;




&lt;h2&gt;
  
  
  Recommended Reading
&lt;/h2&gt;

&lt;p&gt;"Avoidable failures are common and persistent, not to mention demoralizing and frustrating, across many fields -- from medicine to finance, business to government. And the reason is increasingly evident: the volume and complexity of what we know has exceeded our individual ability to deliver its benefits correctly, safely, or reliably. Knowledge has both saved us and burdened us."&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Checklist-Manifesto-How-Things-Right/dp/0312430000"&gt;The Checklist Manifesto: How to Get Things Right&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://apps.apple.com/us/app/bear/id1016366447"&gt;Bear Notes&lt;/a&gt; - After trying hundreds of note apps, I've found a sweet spot of functionality and aesthetics with this one.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://outofdesk.github.io/work/productivity/learning/2020/06/27/simple-and-obvious.html"&gt;Original Post&lt;/a&gt; - Original post on my blog, which was edited a little for dev.to&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>productivity</category>
      <category>learn</category>
      <category>notetaking</category>
      <category>writing</category>
    </item>
  </channel>
</rss>
