<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: andrewlegacci</title>
    <description>The latest articles on DEV Community by andrewlegacci (@andrewlegacci).</description>
    <link>https://dev.to/andrewlegacci</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F74781%2F1f0e7c81-ac9e-4756-891d-f88edbf6c437.png</url>
      <title>DEV Community: andrewlegacci</title>
      <link>https://dev.to/andrewlegacci</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/andrewlegacci"/>
    <language>en</language>
    <item>
      <title>A timeline of the history of blockchain</title>
      <dc:creator>andrewlegacci</dc:creator>
      <pubDate>Sun, 15 Mar 2026 01:26:00 +0000</pubDate>
      <link>https://dev.to/andrewlegacci/a-timeline-of-the-history-of-blockchain-50o5</link>
      <guid>https://dev.to/andrewlegacci/a-timeline-of-the-history-of-blockchain-50o5</guid>
      <description>&lt;p&gt;Here is a concise historical timeline of major events in blockchains and crypto, from pre‑Bitcoin ideas through to recent L1s and key forks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pre‑Bitcoin foundations (1980s–2007)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1983&lt;/strong&gt; – David Chaum proposes eCash, an anonymous cryptographic electronic money system, and later implements it via DigiCash in 1995, using blind signatures to enable privacy‑preserving digital payments.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Late 1990s&lt;/strong&gt; – Concepts like Wei Dai’s b‑money (1998) and Nick Szabo’s bit gold outline ideas for decentralized digital cash and proof‑of‑work style systems, influencing later designs like Bitcoin.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Early 2000s&lt;/strong&gt; – Cypherpunk and cryptography mailing lists become the main venues where digital cash, censorship‑resistant systems, and strong cryptography are actively discussed, setting the social and technical context for Bitcoin.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Birth of Bitcoin (2008–2012)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;18 Aug 2008&lt;/strong&gt; – The domain &lt;em&gt;bitcoin.org&lt;/em&gt; is registered, which later becomes the main informational site for Bitcoin.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;31 Oct 2008&lt;/strong&gt; – Satoshi Nakamoto publishes the Bitcoin whitepaper, “Bitcoin: A Peer‑to‑Peer Electronic Cash System”, to a cryptography mailing list, outlining a decentralized, proof‑of‑work‑based currency.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3 Jan 2009&lt;/strong&gt; – Satoshi mines the Bitcoin “genesis block” (block 0), marking the live launch of the Bitcoin blockchain and network.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2009–2010&lt;/strong&gt; – Early Bitcoin clients and nodes appear; the first exchanges and markets form around BTC as a digital‑native asset (e.g., early OTC trades, then Mt. Gox in 2010).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2010&lt;/strong&gt; – The “Bitcoin pizza” transaction (10,000 BTC for two pizzas) becomes one of the first widely cited real‑world Bitcoin purchases, illustrating emergent value.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apr 2011&lt;/strong&gt; – Namecoin launches as one of the first Bitcoin‑fork‑style altcoins, aiming to provide a decentralized DNS and identity system.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Oct 2011&lt;/strong&gt; – Litecoin launches, reusing Bitcoin’s codebase but switching to the scrypt proof‑of‑work function and faster block times, becoming an early major altcoin.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aug 2012&lt;/strong&gt; – Peercoin launches with an early hybrid proof‑of‑work / proof‑of‑stake design, anticipating later PoS systems.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Early altcoins and ecosystem growth (2013–2014)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2013&lt;/strong&gt; – A broad wave of altcoins appears, experimenting with parameters (block time, supply schedule, privacy features) and building early exchange infrastructure and market cycles.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2013–2014&lt;/strong&gt; – Centralized exchanges (e.g., Mt. Gox) dominate liquidity; the 2013 Bitcoin bull run and the 2014 Mt. Gox collapse highlight custody risk and bring stronger regulatory attention.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2014&lt;/strong&gt; – On‑chain data, block explorers, and early wallets mature, making it easier for non‑technical users to interact with Bitcoin and altcoins.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Ethereum and smart contract era (2015–2017)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2015&lt;/strong&gt; – Ethereum mainnet launches (Frontier phase), introducing a general‑purpose smart contract platform and the Ethereum Virtual Machine (EVM), expanding blockchains beyond currency into programmable dApps and tokens.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2016 (Homestead)&lt;/strong&gt; – The first planned Ethereum hard fork removes certain “canary contracts” and improves early protocol and Solidity tooling; the Mist wallet and dApp browser arrives around the same time, hardening Ethereum as a production platform.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2016 (The DAO &amp;amp; fork)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;The DAO is created as an on‑chain investment vehicle on Ethereum.
&lt;/li&gt;
&lt;li&gt;After a major exploit, the Ethereum community splits over how to respond, leading to the DAO fork:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ethereum (ETH)&lt;/strong&gt; – chain that implemented a state change to reverse DAO theft.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ethereum Classic (ETC)&lt;/strong&gt; – chain that continued without reversal, emphasizing immutability.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;2017 (Byzantium)&lt;/strong&gt; – Ethereum’s Byzantium upgrade enhances privacy primitives and performance, and is part of a broader roadmap of iterative hard forks.
&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;2015–2017 (ICO boom)&lt;/strong&gt; – Teams issue ERC‑20 tokens on Ethereum to raise capital, driving a massive influx of users, developers, and speculation; Ethereum solidifies as the second‑largest crypto asset by market cap.
&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Major Bitcoin forks and scaling debates (2015–2018)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2015–2017&lt;/strong&gt; – The Bitcoin block‑size debate intensifies over how to scale throughput while preserving decentralization and node accessibility.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aug 2017 (SegWit)&lt;/strong&gt; – Bitcoin activates Segregated Witness, a soft fork that changes transaction structure, enables more efficient block usage, and lays groundwork for second‑layer protocols like the Lightning Network.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aug 2017 (Bitcoin Cash)&lt;/strong&gt; – Bitcoin Cash (BCH) hard‑forks from Bitcoin, choosing larger block sizes to pursue on‑chain scaling.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2018 (Bitcoin SV and others)&lt;/strong&gt; – Subsequent forks and protocol variants (e.g., Bitcoin SV from BCH) emerge, reflecting differing philosophies on scaling, governance, and protocol complexity.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Institutionalization, DeFi, and global regulation (2018–2020)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2018–2019&lt;/strong&gt; – More regulated custodians, futures markets, and institutional‑grade products appear, moving crypto from a retail‑only phenomenon toward broader financial integration.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2019 (Istanbul)&lt;/strong&gt; – Ethereum’s Istanbul upgrade focuses on gas cost optimizations, DoS resistance, better interoperability with chains like Zcash (via a new Blake2 precompile), and cheaper verification for zk‑proof‑based L2 systems.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2019 (early DeFi)&lt;/strong&gt; – The concept of “DeFi” (decentralized finance) gains traction on Ethereum, with protocols for decentralized exchanges, lending, and synthetic assets forming an on‑chain financial stack.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Ethereum 2.0 roadmap and PoS transition (2020–2023)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dec 2020 (Beacon Chain)&lt;/strong&gt; – The Ethereum Beacon Chain launches as a separate PoS consensus chain, beginning the multi‑year transition from proof‑of‑work to proof‑of‑stake.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aug 2021 (London / EIP‑1559)&lt;/strong&gt; – The London upgrade introduces a fee burn mechanism and changes the ETH fee market, reducing fee volatility and adding a structural burn to ETH supply.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sep 2021 (China ban)&lt;/strong&gt; – China declares all cryptocurrency transactions illegal, completing a long‑running crackdown that had already restricted exchanges and mining, and pushing more mining and activity offshore.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sep 2022 (The Merge)&lt;/strong&gt; – Ethereum fully transitions from PoW to PoS by merging the Beacon Chain with the main execution layer, reducing energy usage and carbon emissions by orders of magnitude.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apr 2023 (Shapella)&lt;/strong&gt; – The Shapella (Shanghai + Capella) hard fork enables withdrawals of staked ETH, completing the main PoS transition phase and unlocking validator exit and partial withdrawals.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Recent Ethereum upgrades and roadmap (2024–2025)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mar 2024 (Dencun)&lt;/strong&gt; – The Dencun upgrade introduces blob transactions (EIP‑4844, “proto‑danksharding”), a dedicated data‑availability lane for rollups that makes L2 solutions cheaper and more efficient.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2025 (Pectra, Fusaka, etc.)&lt;/strong&gt; – Ethereum adopts a twice‑yearly hard fork cadence; upgrades like Pectra (Prague/Electra) and Fusaka improve user experience and validator operations (e.g., EIP‑7702, which lets regular accounts temporarily behave like smart contracts).
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  New L1 ecosystems and trends (2017–2025)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2017–2018 (new L1s)&lt;/strong&gt; – New L1s such as EOS and others launch with different consensus models (e.g., DPoS, BFT variants) and throughput claims, experimenting with governance and fee structures.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2018–2020 (privacy chains)&lt;/strong&gt; – Privacy‑focused chains (e.g., Monero, Zcash) refine confidential transaction primitives, ring signatures, and zk‑SNARKs, emphasizing fungibility and privacy in the blockchain stack.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2020–2021 (high‑throughput L1s)&lt;/strong&gt; – High‑throughput L1s and sidechains (e.g., Solana, Binance Smart Chain, Avalanche) gain traction with EVM compatibility, faster confirmation times, and lower fees, catalyzing alternative DeFi and NFT ecosystems.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2021–2022 (NFT boom)&lt;/strong&gt; – Ethereum and EVM‑compatible chains see explosive growth in non‑fungible tokens, on‑chain art, gaming assets, and NFT marketplaces.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Regulatory and macro milestones (2013–2025)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2013–2020&lt;/strong&gt; – Multiple jurisdictions begin issuing guidance and regulations around exchanges, KYC/AML, and token classification, moving crypto from a legal gray area toward more formal regulatory treatment.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2021 (nation‑state adoption)&lt;/strong&gt; – El Salvador adopts Bitcoin as legal tender and issues a state‑run wallet, marking the first nation‑state Bitcoin adoption.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2021–2023 (rules and CBDCs)&lt;/strong&gt; – Many countries refine crypto tax rules, clarify stablecoin treatment, and debate central bank digital currencies (CBDCs), while addressing risks from DeFi, stablecoins, and systemic exposure.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2021–2022 (mining geography)&lt;/strong&gt; – China’s bans on mining and transactions reshape mining geography and industrial structure, particularly for Bitcoin and pre‑Merge Ethereum.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Compact reference table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Period / Year&lt;/th&gt;
&lt;th&gt;Event / Theme&lt;/th&gt;
&lt;th&gt;Example chains / actions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1983–1998&lt;/td&gt;
&lt;td&gt;Early digital cash, cryptography roots&lt;/td&gt;
&lt;td&gt;Chaum’s eCash, DigiCash, b‑money, bit gold&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2008–2012&lt;/td&gt;
&lt;td&gt;Bitcoin invention and early altcoins&lt;/td&gt;
&lt;td&gt;Bitcoin whitepaper &amp;amp; genesis; Namecoin, Litecoin, Peercoin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2013–2014&lt;/td&gt;
&lt;td&gt;Altcoin wave, exchange era&lt;/td&gt;
&lt;td&gt;Mt. Gox cycle, early privacy and parameter‑tweaked coins&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2015–2017&lt;/td&gt;
&lt;td&gt;Ethereum &amp;amp; ICO boom&lt;/td&gt;
&lt;td&gt;Ethereum launch, DAO fork, Byzantium, ERC‑20 ICOs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2015–2018&lt;/td&gt;
&lt;td&gt;Bitcoin scaling debates and forks&lt;/td&gt;
&lt;td&gt;SegWit, Bitcoin Cash, later BSV and variants&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2018–2020&lt;/td&gt;
&lt;td&gt;Institutional entry, DeFi seeds&lt;/td&gt;
&lt;td&gt;Early DeFi on Ethereum, Istanbul upgrade&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2020–2023&lt;/td&gt;
&lt;td&gt;PoS and rollup era&lt;/td&gt;
&lt;td&gt;Beacon Chain, London, The Merge, Shapella&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2017–2025&lt;/td&gt;
&lt;td&gt;New L1s, NFTs, high‑throughput chains&lt;/td&gt;
&lt;td&gt;Solana, BSC, Avalanche, NFT markets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2013–2025&lt;/td&gt;
&lt;td&gt;Regulation and state‑level responses&lt;/td&gt;
&lt;td&gt;China bans, tax and KYC rules, El Salvador, CBDC debates&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
      <category>blockchain</category>
      <category>crypto</category>
    </item>
    <item>
      <title>Kafka partitioning: the part that decides whether your design works</title>
      <dc:creator>andrewlegacci</dc:creator>
      <pubDate>Mon, 09 Mar 2026 01:06:00 +0000</pubDate>
      <link>https://dev.to/andrewlegacci/kafka-partitioning-the-part-that-decides-whether-your-design-works-2ba2</link>
      <guid>https://dev.to/andrewlegacci/kafka-partitioning-the-part-that-decides-whether-your-design-works-2ba2</guid>
      <description>&lt;p&gt;Once you know the basics of Kafka, partitioning is where the real design work starts. Topics, consumers, offsets, and producer APIs are straightforward enough. Partitioning is where correctness, throughput, and operational behavior meet. It is also where teams make decisions that look harmless early on and become expensive later.&lt;/p&gt;

&lt;p&gt;The first thing to keep in mind is that partitioning is not just a scaling mechanism. It is also Kafka’s ordering model. Kafka only guarantees order within a single partition. That means every decision about how records are assigned to partitions is also a decision about which records are ordered relative to each other, and which are not.&lt;/p&gt;

&lt;h2&gt;
  
  
  How records get routed to partitions
&lt;/h2&gt;

&lt;p&gt;When a producer writes a record to Kafka, the partition is chosen in one of a few ways.&lt;/p&gt;

&lt;p&gt;The common case is keyed routing. If the record has a key, the producer hashes that key and uses the result to choose a partition. The exact hashing details depend on the client and partitioner implementation, but the general idea is stable: the same key maps to the same partition, as long as the topic’s partition count stays the same. That is what gives you per-key ordering. All events for &lt;code&gt;account-123&lt;/code&gt; or &lt;code&gt;order-987&lt;/code&gt; land in one partition, so consumers see them in the order they were written.&lt;/p&gt;

&lt;p&gt;If the record has a null key, Kafka does not have any domain signal to preserve. In that case producers typically distribute records across partitions in a round-robin or sticky way, depending on the client implementation. The practical outcome is the same for design purposes: records are spread out to balance load, but there is no ordering guarantee between related records because Kafka has no notion of which records are related.&lt;/p&gt;
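
&lt;p&gt;As a rough sketch of both cases with the Java client (the topic and key names here are illustrative, not from any particular system):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer&amp;lt;String, String&amp;gt; producer = new KafkaProducer&amp;lt;&amp;gt;(props)) {
            // Same key, same partition (while the partition count is stable):
            // account-123 gets one ordered stream.
            producer.send(new ProducerRecord&amp;lt;&amp;gt;("account-events", "account-123", "AccountDebited"));
            producer.send(new ProducerRecord&amp;lt;&amp;gt;("account-events", "account-123", "AccountCredited"));

            // Null key: no ordering signal, so the client spreads records
            // across partitions (sticky or round-robin, client-dependent).
            producer.send(new ProducerRecord&amp;lt;&amp;gt;("account-events", null, "MetricsSnapshot"));
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;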

&lt;p&gt;There is also the option of a custom partitioner. This lets you override the default routing logic and choose partitions yourself. That can be useful when the built-in behavior does not match your needs, such as routing based on a derived field, adding affinity rules, or implementing a load-spreading strategy for skewed keys. But a custom partitioner is not a free optimization knob. It becomes part of your correctness model. Once you own partition routing, you also own the consequences.&lt;/p&gt;
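
&lt;p&gt;For a concrete feel, here is a hedged sketch of a partitioner that routes on a derived field; the composite key format (&lt;code&gt;tenant-42|order-987&lt;/code&gt;) is invented for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

// Sketch: keys look like "tenant-42|order-987" and we want per-tenant
// affinity, so we hash only the tenant portion of the key.
public class TenantAffinityPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionCountForTopic(topic);
        String tenant = key.toString().split("\\|", 2)[0]; // derived routing field
        return Utils.toPositive(Utils.murmur2(tenant.getBytes())) % numPartitions;
    }

    @Override public void close() {}
    @Override public void configure(Map&amp;lt;String, ?&amp;gt; configs) {}
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Every event for a tenant now lands on one partition regardless of which entity it belongs to. That is an ordering contract you now own, including what happens to it if keys ever stop following the assumed format.&lt;/p&gt;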

&lt;h2&gt;
  
  
  The partition key is really a domain decision
&lt;/h2&gt;

&lt;p&gt;It is easy to talk about partition keys as a technical setting. In practice, key choice is a domain modeling decision disguised as infrastructure.&lt;/p&gt;

&lt;p&gt;The key defines your unit of ordering. That is the real rule. If two events must be observed in order relative to each other, they need to land in the same partition. If they do not need a shared order, they can be partitioned independently.&lt;/p&gt;

&lt;p&gt;Take bank account events. Suppose you emit &lt;code&gt;AccountDebited&lt;/code&gt;, &lt;code&gt;AccountCredited&lt;/code&gt;, and &lt;code&gt;AccountClosed&lt;/code&gt;. If a consumer builds account balances or checks business invariants, order matters per account. A debit followed by a close is different from a close followed by a debit. In that case, the obvious partition key is account ID. That keeps all events for a single account in one partition and gives you a clean per-account sequence.&lt;/p&gt;

&lt;p&gt;Now imagine choosing customer region as the partition key instead. That may look reasonable from a load distribution perspective, but it breaks the ordering model you actually need. Two events for the same account could end up in different partitions if the routing logic is not aligned with account identity, and now consumers no longer have a reliable per-account event stream. The problem is not that Kafka failed. The problem is that the topic’s partitioning scheme encoded the wrong domain boundary.&lt;/p&gt;

&lt;p&gt;The same thing shows up with order processing. If the business rule is that all events for one order must remain ordered, key by order ID. If what actually needs ordering is the aggregate of all orders for a warehouse or a merchant, then one of those may be the correct key instead. There is no universally correct answer. The correct answer is whichever entity defines the ordering boundary your system depends on.&lt;/p&gt;

&lt;p&gt;That is why “just use user ID” or “just use tenant ID” is not a design principle. It is a guess. Sometimes it is the right guess. Often it is not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bad key choices fail in subtle ways
&lt;/h2&gt;

&lt;p&gt;The tricky part is that a poor partition key does not always fail loudly. The system still runs. Records still arrive. Consumers still process them. The failure shows up as rare inconsistencies, impossible state transitions, or compensating logic that seems to grow for no clear reason.&lt;/p&gt;

&lt;p&gt;A classic symptom is when a downstream service starts needing extra reads, locks, or deduplication tricks to recover order that Kafka never promised in the first place. Another is when teams start saying things like “this event is usually earlier, but sometimes it arrives later.” That is often not a transport delay problem. It is a partitioning problem.&lt;/p&gt;

&lt;p&gt;If your correctness depends on a particular order, that order must be encoded in the partition key. There is no middleware setting later that will restore it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hotspots and disproportionately popular keys
&lt;/h2&gt;

&lt;p&gt;Even with a correct key, load distribution may still be poor. Some keys are much hotter than others.&lt;/p&gt;

&lt;p&gt;Suppose you key events by merchant ID and one merchant handles half the traffic in the system. All of that merchant’s records map to one partition. That partition becomes a hotspot. Producer throughput for that key is bounded by one partition’s write path, and consumer throughput for that key is bounded by one active consumer in the group. The rest of the partitions may be idle while one partition is overloaded.&lt;/p&gt;

&lt;p&gt;This is the main tradeoff in Kafka partitioning. Ordering and even distribution often pull in opposite directions. The more you concentrate related records to preserve order, the more you risk skew.&lt;/p&gt;

&lt;p&gt;There is no universal fix. Sometimes the hotspot is acceptable because correctness matters more than balanced throughput. Sometimes the domain lets you shard further. If order only matters per account, and a merchant contains many independent accounts, then account ID may distribute better than merchant ID. In other cases teams deliberately split a hot key into subkeys such as &lt;code&gt;merchant-123#0&lt;/code&gt;, &lt;code&gt;merchant-123#1&lt;/code&gt;, and so on. That can spread load, but it also weakens ordering from “per merchant” to “per merchant shard.” If you do that, you are changing the contract, not just tuning performance.&lt;/p&gt;
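
&lt;p&gt;A minimal sketch of that subkey split (the shard count and key format are invented for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Splitting a hot merchant key into N subkeys. Ordering is now only
// guaranteed per shard ("merchant-123#0", "merchant-123#1", ...),
// not per merchant: that is the contract change.
public final class ShardedKeys {
    static String shardedKey(String merchantId, String orderId, int shards) {
        int shard = Math.floorMod(orderId.hashCode(), shards); // stable, so one order stays ordered
        return merchantId + "#" + shard;
    }
}

// usage: producer.send(new ProducerRecord&amp;lt;&amp;gt;("orders",
//     shardedKey("merchant-123", orderId, 4), payload));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;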

&lt;p&gt;Another mitigation is separating workloads. A topic used for strict transactional processing may need one key strategy, while an analytics-oriented topic can use a different one because its ordering requirements are weaker. That is often cleaner than trying to make one topic satisfy incompatible requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Partition count is a mostly one-way door
&lt;/h2&gt;

&lt;p&gt;Partition count looks like a capacity knob, but it is closer to a design commitment.&lt;/p&gt;

&lt;p&gt;You can increase the number of partitions later, and teams often do. The problem is that changing partition count changes the key-to-partition mapping for hashed keys. A key that used to land in partition 3 may now land in partition 11. From that point onward, new records for the same key go to a different partition than older records.&lt;/p&gt;

&lt;p&gt;That matters because ordering is only meaningful within a partition. Once the mapping changes, you no longer have one continuous ordered stream for that key across old and new data. During migration windows or with in-flight data, consumers may observe sequences that are ordered within each partition but not globally across the old and new placement. If your application assumes a single uninterrupted per-key stream, increasing partitions is not operationally neutral.&lt;/p&gt;
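
&lt;p&gt;A toy calculation shows the effect (using a stand-in hash, not Kafka’s actual murmur2-based default):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public class RemapSketch {
    public static void main(String[] args) {
        String key = "account-123";
        int hash = Math.abs(key.hashCode()); // stand-in for the client's key hash

        int before = hash % 12; // partition while the topic had 12 partitions
        int after  = hash % 16; // partition after growing to 16

        // Whenever before != after, new records for account-123 go to a
        // different partition than its older records, splitting the
        // per-key ordered stream across old and new data.
        System.out.println(before + " -&amp;gt; " + after);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;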

&lt;p&gt;This is why partition count is a mostly one-way door. You can raise it, but the consequences are real. Reducing partitions is even harder and generally treated as impractical. So plan ahead. Pick a partition count that gives you room for growth, not just today’s traffic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Partitions are also your unit of parallelism
&lt;/h2&gt;

&lt;p&gt;Kafka scales consumers at the partition level. Within one consumer group, a partition is assigned to only one active consumer at a time. That means partitions define the maximum parallelism of that group.&lt;/p&gt;

&lt;p&gt;If a topic has twelve partitions, the most useful number of active consumers in one group is twelve. If you run fewer, some consumers will process multiple partitions. If you run more, the extras sit idle because there is no partition left to own. Adding consumers beyond the partition count does not increase throughput. It just increases noise, resource use, and group management churn.&lt;/p&gt;

&lt;p&gt;This is worth keeping in mind when people say they want “more consumer scalability.” Often what they really need is more partitions, assuming the key design can tolerate it. But as already noted, more partitions are not a free change. Parallelism, ordering, and future flexibility are all tied together.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rebalancing and the pause you eventually notice
&lt;/h2&gt;

&lt;p&gt;Consumer groups are dynamic. Instances join, leave, crash, restart, and deploy. Whenever group membership changes, Kafka reassigns partitions among consumers. That process is called rebalancing.&lt;/p&gt;

&lt;p&gt;A rebalance also happens in other situations, such as topic metadata changes or partition count changes. During a rebalance, consumers briefly stop normal processing while the group agrees on a new assignment. In healthy systems this is usually short, but it is still visible. You may see brief processing pauses, lag spikes, or latency jumps around deployments and failures.&lt;/p&gt;

&lt;p&gt;This is one reason noisy consumer fleets can hurt stability. Frequent restarts mean frequent rebalances. Kafka has improved this over time, and cooperative rebalancing reduces disruption by avoiding full stop-the-world reassignment in many cases. It helps, but it does not eliminate the fact that partition ownership is a coordinated group activity with short pauses built into the model.&lt;/p&gt;

&lt;h2&gt;
  
  
  The design question that matters
&lt;/h2&gt;

&lt;p&gt;Partitioning is where Kafka stops being generic infrastructure and starts reflecting your domain. The partition key decides which records stay together, which ones can be processed in parallel, where hotspots appear, and what kind of ordering your consumers can rely on. Partition count decides how much parallelism you can get and how much future flexibility you have without remapping keys and complicating ordering.&lt;/p&gt;

&lt;p&gt;Before creating a topic, the most important question is simple: what is my unit of ordering, and does my partition key reflect that?&lt;/p&gt;

</description>
      <category>kafka</category>
    </item>
    <item>
      <title>Kafka for Software Engineers</title>
      <dc:creator>andrewlegacci</dc:creator>
      <pubDate>Mon, 09 Mar 2026 00:07:00 +0000</pubDate>
      <link>https://dev.to/andrewlegacci/kafka-for-software-engineers-1eel</link>
      <guid>https://dev.to/andrewlegacci/kafka-for-software-engineers-1eel</guid>
      <description>&lt;p&gt;Kafka is often described as a message broker, an event bus, or a streaming platform. Those descriptions are not wrong, but they can be misleading if they make it sound like Kafka is just another queue with more knobs.&lt;/p&gt;

&lt;p&gt;The most useful way to understand Kafka is this: Kafka is a durable, distributed log that many producers can append to and many consumers can read from at their own pace. A lot of its design starts making sense once you focus on the log part.&lt;/p&gt;

&lt;p&gt;If you have only built request-response systems, Kafka can feel unfamiliar at first. In a typical HTTP setup, one service calls another service directly and waits for an answer. That works well for many problems, but it creates tight coupling in time and availability. The caller has to know who to call, the callee has to be up, and the two sides need to agree on how fast the interaction should happen.&lt;/p&gt;

&lt;p&gt;Kafka solves a different kind of problem. It helps when systems need to exchange facts about things that happened, without every producer and consumer needing to talk to each other directly.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem Kafka solves
&lt;/h2&gt;

&lt;p&gt;Imagine an e-commerce system. An order is placed. Several things may need to happen next: inventory should update, payment should be processed, an email should be sent, analytics should record the purchase, fraud checks may run, and a warehouse system may start fulfillment.&lt;/p&gt;

&lt;p&gt;You could wire all of that directly into the order service. But then the order service becomes responsible for too much. It needs to know about every downstream system, handle their failures, deal with latency, and evolve whenever a new consumer appears.&lt;/p&gt;

&lt;p&gt;Kafka changes the shape of the integration. Instead of the order service calling everything else, it publishes an event like &lt;code&gt;OrderPlaced&lt;/code&gt;. Other systems subscribe to that stream and react independently.&lt;/p&gt;

&lt;p&gt;That gives you looser coupling in a few important ways. Producers do not need to know which consumers exist. Consumers can be added later without changing producers. Consumers can fall behind temporarily and catch up later. Events can also be replayed, which is a major difference from many queue-based systems.&lt;/p&gt;

&lt;p&gt;That last point matters. In a queue, the main model is often “work to be done.” In Kafka, the model is closer to “a history of facts.” Consumers read that history and derive their own view of the world from it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The log abstraction
&lt;/h2&gt;

&lt;p&gt;The log abstraction is the core idea.&lt;/p&gt;

&lt;p&gt;A log is an ordered sequence of records that can only be appended to. You do not normally insert into the middle. You do not update old entries in place. You keep adding new records to the end.&lt;/p&gt;

&lt;p&gt;Kafka stores records in topics, and each topic is split into partitions. A partition is an append-only ordered log. Every record in a partition gets an offset, which is basically its position in that log.&lt;/p&gt;

&lt;p&gt;This is a very different mental model from a traditional queue. In many queue systems, once a message is consumed, it is gone. In Kafka, records stay in the log for some retention period, or until storage rules delete them. Consumption does not remove the record. A consumer tracks its own position, usually by storing offsets.&lt;/p&gt;

&lt;p&gt;That means two consumers can read the same topic independently. One consumer might be at offset 1000, another at 4000. That is normal. Kafka is not asking “has this message been consumed?” in a global sense. It is asking “where is each consumer in the log?”&lt;/p&gt;

&lt;p&gt;This is what makes replay possible. If a bug in your analytics service caused bad calculations for the last two days, you can fix the code and replay from an earlier offset. That is a powerful capability, and it is one of the reasons Kafka shows up in data pipelines, event-driven systems, and audit-style architectures.&lt;/p&gt;
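
&lt;p&gt;As a hedged sketch of that replay with the Java client (the topic, group, and two-day window are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

public class ReplaySketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "analytics-replay");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer&amp;lt;String, String&amp;gt; consumer = new KafkaConsumer&amp;lt;&amp;gt;(props)) {
            consumer.subscribe(List.of("orders"));
            consumer.poll(Duration.ofSeconds(1)); // join the group so partitions get assigned

            // Ask the broker which offset each assigned partition had two days ago.
            long twoDaysAgo = Instant.now().minus(Duration.ofDays(2)).toEpochMilli();
            Map&amp;lt;TopicPartition, Long&amp;gt; request = new HashMap&amp;lt;&amp;gt;();
            for (TopicPartition tp : consumer.assignment()) request.put(tp, twoDaysAgo);

            // Rewind each partition, then reprocess from there with the fixed code.
            consumer.offsetsForTimes(request).forEach((tp, offset) -&amp;gt; {
                if (offset != null) consumer.seek(tp, offset.offset());
            });
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;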

&lt;h2&gt;
  
  
  Core architecture
&lt;/h2&gt;

&lt;p&gt;At a high level, Kafka has producers, brokers, topics, partitions, and consumers.&lt;/p&gt;

&lt;p&gt;Producers write records to topics. A record usually has a key, a value, a timestamp, and some metadata. The key is especially important because it often determines which partition the record goes to. If you use a customer ID as the key, all records for that customer can be routed to the same partition, preserving order for that key.&lt;/p&gt;

&lt;p&gt;Brokers are the Kafka servers that store data and serve reads and writes. A Kafka cluster has multiple brokers, and topic partitions are distributed across them.&lt;/p&gt;

&lt;p&gt;Topics are logical streams of records. You can think of a topic as a named feed such as &lt;code&gt;orders&lt;/code&gt;, &lt;code&gt;payments&lt;/code&gt;, or &lt;code&gt;user-signups&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Partitions are where ordering and scalability meet. Ordering in Kafka is guaranteed within a partition, not across an entire topic. This is an important detail. If a topic has six partitions, Kafka can scale reads and writes across them, but there is no single total order across all six. There is only an order inside each partition.&lt;/p&gt;

&lt;p&gt;Consumers read records from topics. Consumers are usually organized into consumer groups. Within a consumer group, each partition is assigned to one consumer instance at a time. That lets a group scale out horizontally. If a topic has eight partitions, up to eight consumers in the same group can process in parallel. If you run more than eight, some will sit idle.&lt;/p&gt;
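
&lt;p&gt;A minimal sketch of one consumer instance in a group (names are illustrative); running more copies of this process splits the topic’s partitions among them, up to the partition count:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OrderConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "fulfillment"); // instances sharing this id share the partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer&amp;lt;String, String&amp;gt; consumer = new KafkaConsumer&amp;lt;&amp;gt;(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords&amp;lt;String, String&amp;gt; records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord&amp;lt;String, String&amp;gt; record : records) {
                    // partition() and offset() identify the record's place in the log.
                    System.out.printf("p%d@%d %s=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;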

&lt;p&gt;This is one of Kafka’s central tradeoffs. Partitions give you throughput and parallelism, but they also define your ordering boundaries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why partitioning matters
&lt;/h2&gt;

&lt;p&gt;Partitioning is not just a storage detail. It affects correctness.&lt;/p&gt;

&lt;p&gt;Suppose you are processing bank account events and order matters per account. If &lt;code&gt;AccountDebited&lt;/code&gt; and &lt;code&gt;AccountCredited&lt;/code&gt; for the same account can land in different partitions, consumers may see them in different relative orders. That can break assumptions.&lt;/p&gt;

&lt;p&gt;The usual answer is to choose a partition key that matches your ordering needs. If order matters per account, key by account ID. If order matters per tenant, key by tenant ID. You are deciding what unit of ordering you care about.&lt;/p&gt;

&lt;p&gt;This is one reason Kafka design discussions often turn into domain discussions. Picking a key is really about deciding which events belong to the same ordered stream.&lt;/p&gt;

&lt;h2&gt;
  
  
  Delivery semantics and duplicates
&lt;/h2&gt;

&lt;p&gt;Kafka often gets described with phrases like at-most-once, at-least-once, and exactly-once. These are worth understanding without overcomplicating them.&lt;/p&gt;

&lt;p&gt;At-most-once means records may be lost but not redelivered. At-least-once means records are not lost in normal operation, but duplicates can happen. Exactly-once is the strongest and most complicated model, and in practice it depends on more than just Kafka itself.&lt;/p&gt;

&lt;p&gt;For most engineers, the important point is this: assume duplicates are possible unless you have designed very carefully around them. Consumer logic should often be idempotent, meaning processing the same event twice should not produce a bad result.&lt;/p&gt;
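
&lt;p&gt;A toy sketch of that idea, assuming each event carries a unique ID; a real system would persist the seen IDs or use an idempotent upsert instead of an in-memory set:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import java.util.HashSet;
import java.util.Set;

// Duplicate-safe handler: processing the same event twice is a no-op.
public class IdempotentHandler {
    private final Set&amp;lt;String&amp;gt; processedEventIds = new HashSet&amp;lt;&amp;gt;(); // in-memory for the sketch

    public void handle(String eventId, Runnable effect) {
        if (!processedEventIds.add(eventId)) {
            return; // seen before: a retry or redelivery, so do nothing
        }
        effect.run();
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;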

&lt;p&gt;That advice applies even outside Kafka, but Kafka makes it especially relevant because retries and rebalances are normal parts of distributed systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kafka is not just async HTTP
&lt;/h2&gt;

&lt;p&gt;One mistake is to treat Kafka as if it were just asynchronous RPC. That misses the point.&lt;/p&gt;

&lt;p&gt;If service A publishes an event only because it expects service B to act immediately in a request-like chain, you have preserved much of the same coupling, just with more moving parts. Kafka is most useful when the event is meaningful on its own. “Order placed” is a fact. “Please call this specific service later” is usually not.&lt;/p&gt;

&lt;p&gt;That distinction matters because events should represent something that happened in the domain, not just a transport mechanism between services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Schemas, compatibility, and evolution
&lt;/h2&gt;

&lt;p&gt;Once multiple systems rely on the same event stream, schemas matter. If one team changes the shape of an event carelessly, other teams can break.&lt;/p&gt;

&lt;p&gt;That is why Kafka setups often include schema discipline, whether through JSON conventions, Avro, Protobuf, or a schema registry. The exact tooling varies, but the principle is the same: event formats are contracts, and those contracts evolve over time.&lt;/p&gt;

&lt;p&gt;Experienced engineers usually already know this lesson from APIs. Kafka makes it even more important because consumers may lag behind, replay old data, or be owned by different teams.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Kafka is a good fit
&lt;/h2&gt;

&lt;p&gt;Kafka fits well when you have multiple consumers for the same stream of events, when replay matters, when systems need to be decoupled in time, or when throughput is high enough that a distributed log is worth the operational cost.&lt;/p&gt;

&lt;p&gt;It is especially useful for audit trails, integration between services, event-driven workflows, change-data-capture pipelines, and stream processing.&lt;/p&gt;

&lt;p&gt;It is not always the right answer. If one service just needs to call another and get a response, plain HTTP may be simpler. If you only need a small background job queue, a simpler queue may be enough. Kafka adds real operational and conceptual complexity, so it is worth using when its specific strengths matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  A practical mental model
&lt;/h2&gt;

&lt;p&gt;If you are new to Kafka, the best mental model is not “queue” but “shared history.”&lt;/p&gt;

&lt;p&gt;Producers append facts to a durable log. Consumers read that history and build their own outcomes from it. Offsets let each consumer decide where it is in the stream. Partitions let the system scale, but they also define where ordering exists. Consumer groups let you distribute work across instances. Schemas keep the shared contract stable enough for many systems to coexist.&lt;/p&gt;

&lt;p&gt;Once you see Kafka as a distributed log rather than a fancy mailbox, the rest becomes easier to reason about. It is a system for recording streams of events durably and letting many independent consumers make use of them without every integration turning into a web of direct calls.&lt;/p&gt;

&lt;p&gt;That is the problem Kafka solves, and that is why the log abstraction is the center of the whole design.&lt;/p&gt;

</description>
      <category>kafka</category>
    </item>
    <item>
      <title>Understanding `GROUP BY` in SQL</title>
      <dc:creator>andrewlegacci</dc:creator>
      <pubDate>Sun, 08 Mar 2026 10:26:00 +0000</pubDate>
      <link>https://dev.to/andrewlegacci/understanding-group-by-in-sql-1e3c</link>
      <guid>https://dev.to/andrewlegacci/understanding-group-by-in-sql-1e3c</guid>
      <description>&lt;p&gt;&lt;code&gt;GROUP BY&lt;/code&gt; is one of those SQL clauses that starts out feeling simple and then becomes confusing as soon as real queries get involved. The short version is that it lets you take many rows and treat some of them as belonging to the same group, usually so you can calculate something for each group.&lt;/p&gt;

&lt;p&gt;That idea matters more than the syntax. &lt;code&gt;GROUP BY&lt;/code&gt; is not mainly about sorting, filtering, or removing duplicates. It is about collapsing rows into groups based on shared values, then producing one result row per group.&lt;/p&gt;

&lt;h2&gt;
  
  
  What &lt;code&gt;GROUP BY&lt;/code&gt; does
&lt;/h2&gt;

&lt;p&gt;Imagine a table called &lt;code&gt;orders&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;order_id | user_id | status  | amount
---------|---------|---------|-------
1        | 10      | paid    | 50
2        | 10      | paid    | 20
3        | 11      | pending | 15
4        | 12      | paid    | 40
5        | 11      | paid    | 30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you run this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;you get one row per order. If instead you run this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;you get one row per distinct &lt;code&gt;status&lt;/code&gt; value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;paid
pending
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At first glance, that can look similar to &lt;code&gt;DISTINCT&lt;/code&gt;, and in this case the result is similar. But &lt;code&gt;GROUP BY&lt;/code&gt; becomes useful when you combine it with aggregate functions like &lt;code&gt;COUNT&lt;/code&gt;, &lt;code&gt;SUM&lt;/code&gt;, &lt;code&gt;AVG&lt;/code&gt;, &lt;code&gt;MIN&lt;/code&gt;, or &lt;code&gt;MAX&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;order_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you one row for each status, plus the number of rows inside each group:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;status  | order_count
--------|------------
paid    | 4
pending | 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So the pattern is this: define groups, then calculate something per group.&lt;/p&gt;

&lt;h2&gt;
  
  
  The underlying idea
&lt;/h2&gt;

&lt;p&gt;The mental model that helps most is to think of SQL working in stages.&lt;/p&gt;

&lt;p&gt;First, SQL starts with rows from a table. Then &lt;code&gt;GROUP BY&lt;/code&gt; partitions those rows into buckets based on the column or columns you specify. After that, aggregate functions are applied within each bucket. The final result has one row per bucket, not one row per original input row.&lt;/p&gt;

&lt;p&gt;If you group by one column, rows with the same value in that column go together. If you group by multiple columns, rows only go together when all those values match.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the groups are based on the combination of &lt;code&gt;user_id&lt;/code&gt; and &lt;code&gt;status&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That means these rows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;user_id | status
--------|--------
10      | paid
10      | paid
11      | pending
11      | paid
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;become groups like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(10, paid)
(11, pending)
(11, paid)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each distinct combination becomes its own group.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why SQL is strict about grouped queries
&lt;/h2&gt;

&lt;p&gt;A common source of confusion is this kind of query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In most SQL systems, this is invalid, or at least unsafe, because once rows are grouped by &lt;code&gt;user_id&lt;/code&gt;, there may be many &lt;code&gt;created_at&lt;/code&gt; values inside each group. SQL needs to know which one you want.&lt;/p&gt;

&lt;p&gt;That is why grouped queries usually follow this rule: every selected column must either be part of the &lt;code&gt;GROUP BY&lt;/code&gt; clause, or be wrapped in an aggregate function.&lt;/p&gt;

&lt;p&gt;This works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_orders&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This also works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;largest_order&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But this does not make sense in standard SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;because a user may have many amounts, and &lt;code&gt;GROUP BY&lt;/code&gt; produces one row per user.&lt;/p&gt;

&lt;p&gt;That rule is not arbitrary. It comes directly from what grouping means. After grouping, individual rows are no longer the main unit. Groups are.&lt;/p&gt;

&lt;h2&gt;
  
  
  Examples that show when &lt;code&gt;GROUP BY&lt;/code&gt; is useful
&lt;/h2&gt;

&lt;p&gt;A common use is counting rows per category.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_orders&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells you how many orders each user has placed.&lt;/p&gt;

&lt;p&gt;Another common use is summing values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_spent&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you get total spending per user.&lt;/p&gt;

&lt;p&gt;You can also group by dates or derived values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_orders&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That gives daily totals.&lt;/p&gt;

&lt;p&gt;And you can combine grouping with filtering. For example, if you only care about paid orders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_paid&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'paid'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;WHERE&lt;/code&gt; clause filters rows before grouping happens. That matters. You are grouping only the rows that survive the filter.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to use &lt;code&gt;GROUP BY&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Use &lt;code&gt;GROUP BY&lt;/code&gt; when your question is about categories, buckets, or summaries rather than individual rows.&lt;/p&gt;

&lt;p&gt;If you want to know how many users signed up each day, how much revenue each product generated, or how many orders each status has, that is a grouping problem. The same goes for average salary per department, maximum score per player, or number of tickets per support agent.&lt;/p&gt;

&lt;p&gt;A good test is to ask whether your result should contain one row per original record, or one row per logical group. If it is one row per group, &lt;code&gt;GROUP BY&lt;/code&gt; is probably involved.&lt;/p&gt;

&lt;h2&gt;
  
  
  When not to use it
&lt;/h2&gt;

&lt;p&gt;Do not use &lt;code&gt;GROUP BY&lt;/code&gt; just because you want unique rows. Sometimes people reach for it when &lt;code&gt;DISTINCT&lt;/code&gt; is simpler and clearer.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;is usually better than:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both may return the same values, but &lt;code&gt;DISTINCT&lt;/code&gt; says exactly what you mean: give me unique &lt;code&gt;user_id&lt;/code&gt; values. &lt;code&gt;GROUP BY&lt;/code&gt; suggests you are preparing to aggregate.&lt;/p&gt;

&lt;p&gt;Also do not use &lt;code&gt;GROUP BY&lt;/code&gt; when you still need row-level detail. Once you group, you lose the original per-row shape unless you use more advanced techniques like window functions or subqueries.&lt;/p&gt;

&lt;p&gt;For example, if you want every order row plus the total number of orders for that user, a plain &lt;code&gt;GROUP BY&lt;/code&gt; is not enough, because it collapses rows. That is often a sign you want a window function instead.&lt;/p&gt;
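
&lt;p&gt;A sketch of that window-function version (exact syntax varies slightly by database):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT order_id,
       user_id,
       amount,
       COUNT(*) OVER (PARTITION BY user_id) AS total_orders_for_user
FROM orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Every order row survives, and each one carries its user’s total alongside it, because the window function aggregates without collapsing rows.&lt;/p&gt;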

&lt;h2&gt;
  
  
  &lt;code&gt;GROUP BY&lt;/code&gt; vs &lt;code&gt;DISTINCT&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;DISTINCT&lt;/code&gt; removes duplicate rows from the result set. &lt;code&gt;GROUP BY&lt;/code&gt; forms groups, usually so aggregates can be computed.&lt;/p&gt;

&lt;p&gt;That difference is easier to see with examples.&lt;/p&gt;

&lt;p&gt;This:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;returns unique statuses.&lt;/p&gt;

&lt;p&gt;This:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;also returns one row per status, but it does so by grouping rows.&lt;/p&gt;

&lt;p&gt;And this is where &lt;code&gt;GROUP BY&lt;/code&gt; goes beyond &lt;code&gt;DISTINCT&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;order_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;DISTINCT&lt;/code&gt; cannot do that by itself. It can remove duplicates, but it does not summarize each group with counts or sums.&lt;/p&gt;

&lt;p&gt;So a rough rule is this: use &lt;code&gt;DISTINCT&lt;/code&gt; when you want uniqueness, use &lt;code&gt;GROUP BY&lt;/code&gt; when you want summaries per category.&lt;/p&gt;

&lt;h2&gt;
  
  
  A note on &lt;code&gt;HAVING&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;HAVING&lt;/code&gt; often appears next to &lt;code&gt;GROUP BY&lt;/code&gt;, so it is worth covering here even though it solves a different problem.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;WHERE&lt;/code&gt; filters rows before grouping. &lt;code&gt;HAVING&lt;/code&gt; filters groups after grouping.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_orders&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns only users who have at least two orders.&lt;/p&gt;

&lt;p&gt;You cannot do that with &lt;code&gt;WHERE COUNT(*) &amp;gt;= 2&lt;/code&gt;, because aggregates are computed after rows are grouped. &lt;code&gt;HAVING&lt;/code&gt; exists for conditions on aggregated results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common mistakes
&lt;/h2&gt;

&lt;p&gt;One mistake is selecting columns that are neither grouped nor aggregated. That usually means the query does not match the shape of the result you are asking for.&lt;/p&gt;

&lt;p&gt;Another is using &lt;code&gt;GROUP BY&lt;/code&gt; when &lt;code&gt;DISTINCT&lt;/code&gt; would be clearer. The query may still work, but it makes the intent less obvious.&lt;/p&gt;

&lt;p&gt;A third is forgetting that grouping changes the meaning of the result. Once rows are grouped, you are no longer dealing with individual records. You are dealing with summaries of sets of records.&lt;/p&gt;

&lt;p&gt;It is also easy to confuse &lt;code&gt;ORDER BY&lt;/code&gt; and &lt;code&gt;GROUP BY&lt;/code&gt;. &lt;code&gt;ORDER BY&lt;/code&gt; sorts rows. &lt;code&gt;GROUP BY&lt;/code&gt; combines rows into groups. A grouped result can still be sorted afterward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;order_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;order_count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, the rows are grouped first, and then the grouped result is sorted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing thought
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;GROUP BY&lt;/code&gt; is best understood as a change in level of detail. A normal query works at the row level. A grouped query works at the group level.&lt;/p&gt;

&lt;p&gt;Once that clicks, a lot of SQL becomes easier to reason about. You stop memorizing syntax and start asking a simpler question: am I trying to return rows, or am I trying to return summaries of rows?&lt;/p&gt;

&lt;p&gt;If the answer is summaries, &lt;code&gt;GROUP BY&lt;/code&gt; is usually the tool.&lt;/p&gt;

</description>
      <category>sql</category>
    </item>
  </channel>
</rss>
