<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ayush Shrivastav</title>
    <description>The latest articles on DEV Community by Ayush Shrivastav (@ayushshrivastav).</description>
    <link>https://dev.to/ayushshrivastav</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F900851%2Fc3794f90-b512-4612-96cc-b4407b3be6e3.jpeg</url>
      <title>DEV Community: Ayush Shrivastav</title>
      <link>https://dev.to/ayushshrivastav</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ayushshrivastav"/>
    <language>en</language>
    <item>
      <title>Why Production Databases Break Normalization (And Why That's Okay)</title>
      <dc:creator>Ayush Shrivastav</dc:creator>
      <pubDate>Wed, 11 Mar 2026 18:36:31 +0000</pubDate>
      <link>https://dev.to/ayushshrivastav/why-production-databases-break-normalization-and-why-thats-okay-3hih</link>
      <guid>https://dev.to/ayushshrivastav/why-production-databases-break-normalization-and-why-thats-okay-3hih</guid>
      <description>&lt;p&gt;If you've taken any database course, you've been taught that normalization is the right way to design a schema. No duplicate data. Clean relationships. Every table with a single responsibility.&lt;/p&gt;

&lt;p&gt;Then you get a job at a real company, open the codebase, and see this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;messages&lt;/span&gt;
&lt;span class="c1"&gt;--------&lt;/span&gt;
&lt;span class="n"&gt;message_id&lt;/span&gt;
&lt;span class="n"&gt;channel_id&lt;/span&gt;
&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="n"&gt;user_name&lt;/span&gt;
&lt;span class="n"&gt;user_avatar&lt;/span&gt;
&lt;span class="n"&gt;message_body&lt;/span&gt;
&lt;span class="n"&gt;created_at&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;user_name&lt;/code&gt;? &lt;code&gt;user_avatar&lt;/code&gt;? Right there in the messages table?&lt;/p&gt;

&lt;p&gt;Your first instinct might be to flag it as a bug. But here's the thing — this is not a mistake. Systems like this run at Slack, Discord, Instagram, and Twitter. These are deliberate decisions made by engineers who understand normalization perfectly. They just chose not to use it.&lt;/p&gt;

&lt;p&gt;This piece explains why.&lt;/p&gt;




&lt;h2&gt;
  
  
  First, what normalization actually solves
&lt;/h2&gt;

&lt;p&gt;Normalization isn't just an academic exercise. It solves a real problem: when the same information lives in more than one place, updates become dangerous.&lt;/p&gt;

&lt;p&gt;Say you store a user's name in the &lt;code&gt;users&lt;/code&gt; table and also inside each message they've sent. If the user changes their name, you now have to update two places. If one update fails halfway through, half your messages show the old name and half show the new one. That's a data consistency bug — and it's genuinely bad.&lt;/p&gt;

&lt;p&gt;A normalized schema avoids this entirely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;users&lt;/span&gt;          &lt;span class="n"&gt;messages&lt;/span&gt;
&lt;span class="c1"&gt;-----          --------&lt;/span&gt;
&lt;span class="n"&gt;id&lt;/span&gt;             &lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="n"&gt;name&lt;/span&gt;           &lt;span class="n"&gt;sender_id&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="n"&gt;avatar&lt;/span&gt;         &lt;span class="n"&gt;body&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now there's one source of truth. Loading a message with a username requires a JOIN:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;avatar&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sender_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;channel_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is correct, clean, and totally fine — until you're running it 300,000 times per second.&lt;/p&gt;




&lt;h2&gt;
  
  
  The actual bottleneck at scale
&lt;/h2&gt;

&lt;p&gt;Here's something most database courses don't tell you: production systems are not evenly split between reads and writes. Most systems are &lt;em&gt;overwhelmingly&lt;/em&gt; read-heavy.&lt;/p&gt;

&lt;p&gt;Twitter's public architecture data puts this into perspective. At 150 million active users, they were handling roughly 300,000 read requests per second for home timelines, versus about 6,000 write requests per second for new tweets. That's a 50-to-1 read/write ratio. For something like a message inbox or social feed, it can be even more extreme.&lt;/p&gt;

&lt;p&gt;When reads outnumber writes by that margin, every millisecond you can shave off a read multiplies across a huge volume. A JOIN that costs 10ms doesn't sound bad in isolation. But when you need to execute it 300,000 times a second across a distributed database cluster, that cost becomes the bottleneck.&lt;/p&gt;

&lt;p&gt;And JOINs are expensive in ways that aren't obvious. They require two table scans (or index lookups), loading data from potentially different disk pages, merging the results in memory, and sorting them. At modest table sizes this is fine. At hundreds of millions of rows spread across a sharded cluster, the database is doing significant work for every single request.&lt;/p&gt;
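&lt;p&gt;You can see the structural difference without any production traffic. A minimal sketch using SQLite (illustrative schema and data, not any particular company's): &lt;code&gt;EXPLAIN QUERY PLAN&lt;/code&gt; shows the join touching both tables, while a single-table read touches one.&lt;/p&gt;

```python
import sqlite3

# Illustrative schema only; the point is the shape of the query plans.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, avatar TEXT);
    CREATE TABLE messages (id INTEGER PRIMARY KEY, channel_id INTEGER,
                           sender_id INTEGER, body TEXT, created_at TEXT);
    CREATE INDEX idx_channel ON messages (channel_id, created_at);
""")

# Normalized read: every request pays for the extra users lookup.
join_plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT messages.body, users.name, users.avatar
    FROM messages JOIN users ON users.id = messages.sender_id
    WHERE messages.channel_id = 42
    ORDER BY messages.created_at DESC LIMIT 50
""").fetchall()

# A denormalized read (user_name stored on messages) touches one table.
flat_plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT body FROM messages
    WHERE channel_id = 42
    ORDER BY created_at DESC LIMIT 50
""").fetchall()

for _, _, _, detail in join_plan:
    print(detail)  # one SEARCH/SCAN step per table in the join
```

&lt;p&gt;Multiply the extra plan step by 300,000 requests per second and the JOIN stops being free.&lt;/p&gt;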




&lt;h2&gt;
  
  
  What denormalization actually is
&lt;/h2&gt;

&lt;p&gt;Denormalization is the decision to intentionally store the same data in more than one place to make reads faster.&lt;/p&gt;

&lt;p&gt;Going back to that messages table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;messages&lt;/span&gt;
&lt;span class="c1"&gt;--------&lt;/span&gt;
&lt;span class="n"&gt;message_id&lt;/span&gt;
&lt;span class="n"&gt;channel_id&lt;/span&gt;
&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="n"&gt;user_name&lt;/span&gt;       &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="n"&gt;duplicated&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt;
&lt;span class="n"&gt;user_avatar&lt;/span&gt;     &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="n"&gt;duplicated&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt;
&lt;span class="n"&gt;message_body&lt;/span&gt;
&lt;span class="n"&gt;created_at&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now loading a channel's message history looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;message_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_avatar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message_body&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;channel_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No join. One table. One index scan. The database does a fraction of the work.&lt;/p&gt;

&lt;p&gt;This is faster, cheaper in CPU and disk I/O, easier to cache, and simpler to scale horizontally — because no cross-table coordination is needed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Discord: 4 trillion messages and zero joins
&lt;/h2&gt;

&lt;p&gt;Discord is probably the most documented example of this pattern at scale.&lt;/p&gt;

&lt;p&gt;When Discord launched in 2015, they stored messages in MongoDB. By late 2015, with 100 million messages in storage, queries were becoming slow and unpredictable as data grew beyond what fit in RAM. They migrated to Apache Cassandra, a database that forces you into denormalized design — there are no server-side JOINs in Cassandra at all.&lt;/p&gt;

&lt;p&gt;Their message schema looked something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;channel_id&lt;/span&gt;  &lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bucket&lt;/span&gt;      &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;message_id&lt;/span&gt;  &lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;author_id&lt;/span&gt;   &lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt;     &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;channel_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;message_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;CLUSTERING&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message_id&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key design decision is &lt;code&gt;(channel_id, bucket)&lt;/code&gt; as the composite partition key. All messages in a channel, within a ~10-day time window (the "bucket"), live on the same database node. A query for "last 50 messages in channel X" hits exactly one node, reads from one partition, and returns. No joins, no cross-node coordination.&lt;/p&gt;
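&lt;p&gt;The bucket itself is just a time window derived from the message's snowflake ID. A rough sketch of the idea (the ~10-day window matches Discord's write-up; the epoch constant and function name here are illustrative assumptions, not Discord's actual code):&lt;/p&gt;

```python
# Sketch of Discord-style time bucketing. The 10-day window comes from
# their engineering blog; everything else is an illustrative stand-in.
DISCORD_EPOCH_MS = 1_420_070_400_000   # 2015-01-01 UTC, in milliseconds
BUCKET_MS = 10 * 24 * 60 * 60 * 1000   # one partition covers ~10 days

def bucket_for(snowflake_id: int) -> int:
    """Map a snowflake message ID to its time-window partition bucket."""
    ms_since_epoch = snowflake_id >> 22  # snowflakes embed a timestamp
    return ms_since_epoch // BUCKET_MS

# "Last 50 messages in channel X" then reads exactly one partition:
#   SELECT * FROM messages
#   WHERE channel_id = ? AND bucket = ?
#   ORDER BY message_id DESC LIMIT 50;
```

&lt;p&gt;For quiet channels a reader may have to step back through earlier buckets to fill its page of 50, but each step is still a single-partition read.&lt;/p&gt;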

&lt;p&gt;By 2022, Discord had grown to 177 Cassandra nodes handling trillions of messages. JVM garbage collection pauses were causing unpredictable latency spikes. They migrated to ScyllaDB (a C++ reimplementation of Cassandra with no JVM), dropping from 177 nodes to 72 while also cutting p99 insert latency from a variable 5–70ms range down to a steady 5ms.&lt;/p&gt;

&lt;p&gt;The schema design stayed fundamentally the same. Denormalized. Query-driven. No joins.&lt;/p&gt;




&lt;h2&gt;
  
  
  Twitter: precomputing your entire timeline
&lt;/h2&gt;

&lt;p&gt;Twitter's approach to denormalization is even more aggressive.&lt;/p&gt;

&lt;p&gt;In a normalized system, showing you your home timeline would work like this: query all users you follow, fetch their recent tweets, merge and sort by time. At scale, this is a scatter-gather operation — hitting potentially thousands of small queries per timeline load. Completely impractical.&lt;/p&gt;

&lt;p&gt;What Twitter does instead is called fan-out on write. When you post a tweet, a background process immediately pushes your tweet ID into the Redis timeline cache of every one of your followers. If you have 20,000 followers, that's 20,000 cache writes triggered by your single post.&lt;/p&gt;

&lt;p&gt;The read path becomes trivially simple. When you open your timeline, Twitter reads your pre-built list from Redis. Your 800-tweet timeline is already there, pre-sorted, waiting. The read takes about 5 milliseconds.&lt;/p&gt;

&lt;p&gt;This is denormalization taken to its logical extreme. The tweet ID is physically duplicated across N followers' data stores — not in the database sense, but in the cache sense. Write time is more expensive. Read time is nearly free.&lt;/p&gt;

&lt;p&gt;One obvious problem: Lady Gaga had 31 million followers at one point. A single tweet from her would require 31 million cache writes. Twitter handles this with a hybrid approach — for accounts above a certain follower threshold, they skip the fan-out and instead merge the celebrity's tweets at read time. Most users get the precomputed timeline. Celebrities are handled as a special case.&lt;/p&gt;
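&lt;p&gt;The hybrid fan-out described above can be sketched like this (in-memory dicts stand in for Redis, and the follower threshold is made up for illustration; Twitter's real cutoff isn't public):&lt;/p&gt;

```python
from collections import defaultdict

FANOUT_LIMIT = 10_000  # illustrative threshold, not Twitter's real cutoff

followers = defaultdict(set)     # author -> ids of their followers
following = defaultdict(set)     # user -> ids of authors they follow
timelines = defaultdict(list)    # user -> precomputed tweet ids, newest first
celeb_posts = defaultdict(list)  # celebrity author -> their recent tweet ids

def is_celebrity(author):
    return len(followers[author]) > FANOUT_LIMIT

def post_tweet(author, tweet_id):
    if is_celebrity(author):
        celeb_posts[author].insert(0, tweet_id)  # skip fan-out entirely
    else:
        for fan in followers[author]:            # fan-out on write
            timelines[fan].insert(0, tweet_id)

def read_timeline(user):
    merged = list(timelines[user])               # the cheap, precomputed part
    for author in following[user]:
        if is_celebrity(author):                 # special-case merge at read time
            merged = celeb_posts[author] + merged
    return merged
```

&lt;p&gt;Most reads never enter the merge loop; only followers of very large accounts pay the read-time cost.&lt;/p&gt;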




&lt;h2&gt;
  
  
  Instagram's like count problem
&lt;/h2&gt;

&lt;p&gt;Instagram ran into a specific version of this problem around engagement metrics.&lt;/p&gt;

&lt;p&gt;In a normalized schema, counting likes looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;likes&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;post_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;12345&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On a busy post, the &lt;code&gt;likes&lt;/code&gt; table has millions of rows. Even with an index, counting them is work the database has to do on every request — and Instagram serves hundreds of millions of users who are all loading posts and seeing like counts constantly.&lt;/p&gt;

&lt;p&gt;When a celebrity posted something, the spike in traffic would hammer this query on every load. The database struggled.&lt;/p&gt;

&lt;p&gt;The fix was straightforward: add a &lt;code&gt;like_count&lt;/code&gt; column directly to the &lt;code&gt;posts&lt;/code&gt; table. When someone likes a post, you write to &lt;code&gt;likes&lt;/code&gt; and you also increment &lt;code&gt;like_count&lt;/code&gt; in &lt;code&gt;posts&lt;/code&gt;. Now the read path for a like count is a single indexed lookup returning a single value.&lt;/p&gt;
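&lt;p&gt;The dual write is typically wrapped in a single transaction so the counter can't drift from the rows it summarizes. A minimal SQLite sketch (illustrative schema, not Instagram's actual stack):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, like_count INTEGER DEFAULT 0);
    CREATE TABLE likes (post_id INTEGER, user_id INTEGER,
                        PRIMARY KEY (post_id, user_id));
    INSERT INTO posts (id) VALUES (12345);
""")

def like_post(post_id, user_id):
    with conn:  # both writes commit or roll back together
        conn.execute("INSERT INTO likes VALUES (?, ?)", (post_id, user_id))
        conn.execute("UPDATE posts SET like_count = like_count + 1 WHERE id = ?",
                     (post_id,))

like_post(12345, 7)
like_post(12345, 8)

# Read path is now a single-row lookup, not a COUNT(*) over millions of rows.
count = conn.execute("SELECT like_count FROM posts WHERE id = 12345").fetchone()[0]
print(count)  # 2
```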

&lt;p&gt;The tradeoff: writes are slightly more complex, and the &lt;code&gt;like_count&lt;/code&gt; column is technically derived data (it duplicates what you could compute from the &lt;code&gt;likes&lt;/code&gt; table). But the query went from being expensive to being instant. That's a tradeoff Instagram very deliberately made.&lt;/p&gt;




&lt;h2&gt;
  
  
  Microservices make denormalization unavoidable
&lt;/h2&gt;

&lt;p&gt;There's another context where denormalization isn't really optional: distributed microservice architectures.&lt;/p&gt;

&lt;p&gt;If your user data lives in a User Service and your message data lives in a Message Service, there is no JOIN. You can't run a SQL query that spans two independently deployed, separately scaled services with separate databases.&lt;/p&gt;

&lt;p&gt;You have a few options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Load the message, then make an HTTP call to the User Service to get the username. This adds a network round-trip to every message load — not great when loading 50 messages at once.&lt;/li&gt;
&lt;li&gt;Batch the calls. Slightly better but still adds latency and coupling.&lt;/li&gt;
&lt;li&gt;Store the username inside the message when it's created. No cross-service call needed at read time.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Option 3 is denormalization, and it's the most common approach in real microservice architectures. The Message Service stores enough user data to render messages without calling anyone else. When a user changes their name, the message records don't immediately reflect it — but that's an acceptable tradeoff for most systems. Yesterday's chat messages showing your old username for a few seconds is not a critical business failure.&lt;/p&gt;
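&lt;p&gt;In code, option 3 means the Message Service snapshots the user fields it needs at write time. A minimal sketch (service shape, field names, and data are all illustrative):&lt;/p&gt;

```python
# A dict stands in for a remote User Service; in reality this lookup
# would be an HTTP or RPC call made once, when the message is created.
user_service = {
    101: {"name": "ayush", "avatar": "avatar-v1.png"},
}

messages = []

def create_message(channel_id, user_id, body):
    user = user_service[user_id]        # one cross-service call, at write time
    messages.append({
        "channel_id": channel_id,
        "user_id": user_id,
        "user_name": user["name"],      # denormalized snapshot
        "user_avatar": user["avatar"],  # denormalized snapshot
        "body": body,
    })

def render_channel(channel_id):
    # Read path: no call to the User Service at all.
    return [(m["user_name"], m["body"]) for m in messages
            if m["channel_id"] == channel_id]
```

&lt;p&gt;Note the tradeoff this encodes: if the user renames themselves after posting, old messages keep the snapshot until some propagation mechanism updates them.&lt;/p&gt;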




&lt;h2&gt;
  
  
  The real cost of denormalization
&lt;/h2&gt;

&lt;p&gt;Denormalization is not free. The engineering tradeoff is real, and it's worth being honest about it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update complexity.&lt;/strong&gt; If a user changes their avatar, and their avatar URL is embedded in thousands of message records, you now have to update all of those records. In some systems this is acceptable (the old avatar is just stale, not wrong). In others you need a propagation mechanism — typically an event stream that broadcasts the change and downstream services update their own copies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Eventual consistency.&lt;/strong&gt; In distributed systems, that propagation doesn't happen instantly. There's a window where your data is inconsistent — the users table has the new name, but message records still show the old one. Systems that use this pattern explicitly accept eventual consistency as a design choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Storage cost.&lt;/strong&gt; Storing the same data multiple times costs more. At small scale this barely matters. At Discord's scale (petabytes of messages), duplicating even a small amount of data per message adds up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Schema evolution.&lt;/strong&gt; If a denormalized field changes meaning or format, you have to update it everywhere it's been copied — which might mean migrating billions of rows.&lt;/p&gt;




&lt;h2&gt;
  
  
  How production systems handle both
&lt;/h2&gt;

&lt;p&gt;Most well-designed systems don't pick one approach and abandon the other. They use normalization where correctness matters most and denormalization where read performance matters most.&lt;/p&gt;

&lt;p&gt;A common architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Normalized database (source of truth)
           ↓
     Event stream (Kafka)
           ↓
  Denormalized read tables
           ↓
       API queries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The normalized database is where you write to. It's correct, consistent, and acts as the ground truth for your data. When something changes here, an event gets published.&lt;/p&gt;

&lt;p&gt;Downstream services consume those events and update their own denormalized projections — tables or caches optimized for specific read patterns. These projections might live in Redis, Cassandra, Elasticsearch, or a separate PostgreSQL schema. They're not the source of truth, but they're what your APIs query because they're fast.&lt;/p&gt;
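&lt;p&gt;A minimal sketch of that event-driven flow (in-memory structures stand in for the database, Kafka, and the projection store; names and event shapes are illustrative):&lt;/p&gt;

```python
# Normalized source of truth, an event log, and a denormalized projection.
users = {101: {"name": "ayush"}}  # normalized, where writes go
event_log = []                    # stands in for Kafka
message_view = []                 # denormalized read projection

def rename_user(user_id, new_name):
    users[user_id]["name"] = new_name     # 1. write to the source of truth
    event_log.append({"type": "user_renamed",
                      "user_id": user_id,
                      "name": new_name})  # 2. publish the change as an event

def project(event):
    # A downstream consumer updates its own copy of the duplicated field.
    if event["type"] == "user_renamed":
        for row in message_view:
            if row["user_id"] == event["user_id"]:
                row["user_name"] = event["name"]

message_view.append({"user_id": 101, "user_name": "ayush", "body": "hi"})
rename_user(101, "ayush s")
for event in event_log:  # in reality consumers apply events asynchronously
    project(event)
```

&lt;p&gt;Between the write and the projection catching up, the view is stale; that window is exactly the eventual consistency these systems accept by design.&lt;/p&gt;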

&lt;p&gt;This pattern has a name — CQRS (Command Query Responsibility Segregation) — and it's what Netflix, LinkedIn, Uber, and most large-scale platforms use in one form or another. Netflix processes 140 million hours of viewing data per day through Kafka, with multiple downstream consumers each building their own optimized projection for recommendations, billing, and analytics.&lt;/p&gt;




&lt;h2&gt;
  
  
  The insight worth taking away
&lt;/h2&gt;

&lt;p&gt;Database theory teaches you how to design schemas that are correct. Production engineering teaches you how to design schemas that are fast under real load.&lt;/p&gt;

&lt;p&gt;Normalization is right in the sense that it prevents a class of consistency bugs. It belongs in your schema wherever data correctness is non-negotiable.&lt;/p&gt;

&lt;p&gt;Denormalization is right in the sense that JOINs at scale are expensive, network round-trips add up, and precomputed data is faster than computed data. It belongs in your read paths, your caches, and your projections.&lt;/p&gt;

&lt;p&gt;The engineers at Discord, Twitter, Instagram, and Netflix aren't ignoring normalization. They understand it well enough to know when breaking it is the right call.&lt;/p&gt;

&lt;p&gt;That's the difference between knowing the rules and knowing when to apply them.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Real-world data referenced in this post comes from Discord's engineering blog (message storage architecture), Twitter's 2012 architecture documentation, Instagram's engineering posts on sharding and denormalization, and Netflix's TechBlog on viewing data storage.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>database</category>
      <category>sql</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>SonarQube Basic Understanding</title>
      <dc:creator>Ayush Shrivastav</dc:creator>
      <pubDate>Tue, 25 Apr 2023 07:01:54 +0000</pubDate>
      <link>https://dev.to/ayushshrivastav/sonar-cude-basic-understanding-3pgl</link>
      <guid>https://dev.to/ayushshrivastav/sonar-cude-basic-understanding-3pgl</guid>
      <description>&lt;h2&gt;
  
  
  Advantages of SonarQube
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;SonarQube detects bugs in the code automatically and alerts developers so issues can be resolved before rolling out to production.&lt;/li&gt;
&lt;li&gt;It aids developers in reducing code redundancy and complexity.&lt;/li&gt;
&lt;li&gt;It reduces maintenance time and production cost, and makes code easier to read and understand.&lt;/li&gt;
&lt;li&gt;Developers receive regular feedback on coding standards and quality issues, which helps improve their programming skills.&lt;/li&gt;
&lt;li&gt;It is scalable.&lt;/li&gt;
&lt;li&gt;It enables continuous code quality management and decreases the cost and risk associated with software management.&lt;/li&gt;
&lt;li&gt;SonarQube provides additional value and professional support. Services including development, technical support, consulting and training are designed to help companies get long term benefits.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Disadvantages of SonarQube
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The learning curve is steep&lt;/li&gt;
&lt;li&gt;Report generation is often very time-consuming&lt;/li&gt;
&lt;li&gt;Initial setup is quite complicated&lt;/li&gt;
&lt;li&gt;It sometimes reports false positives (issues that are not actually problems) and can also miss real issues (false negatives).&lt;/li&gt;
&lt;li&gt;Plugins for some languages are available only in commercial versions of the platform.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Evidence of Stability
&lt;/h2&gt;

&lt;p&gt;SonarQube is known for its stability, partly because it does not release updates too frequently. For instance, the SonarQube Scanner has undergone only three major releases over the past six to seven years.&lt;br&gt;
&lt;a href="https://www.npmjs.com/package/sonarqube-scanner?activeTab=versions"&gt;sonarqube-scanner npm&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Competitors of SonarQube
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Embold&lt;/li&gt;
&lt;li&gt;Checkmarx SAST&lt;/li&gt;
&lt;li&gt;Synopsys Coverity&lt;/li&gt;
&lt;li&gt;Micro Focus Fortify Static Code Analyzer (SCA) &lt;/li&gt;
&lt;li&gt;Veracode Static Analysis&lt;/li&gt;
&lt;li&gt;Snyk Code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The above are some well-known static code analysis tools; numerous others are available as well.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the best time to add SonarQube?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The best time to add SonarQube to your project is as early as possible in the development process. This allows you to continually analyze and improve the quality of your code throughout the project's lifecycle. By identifying and addressing potential issues early on, you can reduce the risk of errors and technical debt down the line. Integrating SonarQube into your project from the beginning can help ensure that you are delivering a high-quality product.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  If we add SonarQube after some development has already happened, what are the strategies for refactoring old code?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;One strategy for refactoring old code is to focus on the critical issues identified by SonarQube first, such as security vulnerabilities or major code smells. Once these issues have been addressed, you can move on to tackling less critical issues.&lt;/li&gt;
&lt;li&gt;Another strategy is to prioritize refactoring code that is frequently modified or has a high level of complexity. This can help reduce technical debt and make the codebase more maintainable in the long run.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's important to keep in mind that refactoring should be done incrementally and carefully to avoid introducing new issues or disrupting the functionality of the code. With SonarQube's guidance and the help of experienced developers, you can gradually improve the quality of your codebase and ensure that it meets high standards of maintainability and reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to reduce technical debt with SonarQube?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prioritize issues:&lt;/strong&gt; Focus on addressing critical issues first, such as security vulnerabilities or major code smells. These issues can have a significant impact on the quality of your codebase and should be tackled as soon as possible.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Address code duplication:&lt;/strong&gt; Code duplication can increase technical debt and make your codebase more difficult to maintain. Use SonarQube's code duplication detection to identify areas where code can be consolidated or refactored.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Improve code coverage:&lt;/strong&gt; Low code coverage can indicate that important parts of your codebase are not being tested. Use SonarQube's test coverage analysis to identify areas where additional tests are needed and ensure that your codebase is adequately covered.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitor code complexity:&lt;/strong&gt; High code complexity can make your codebase difficult to understand and maintain. Use SonarQube's code complexity analysis to identify areas where code can be simplified or refactored.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Set coding standards:&lt;/strong&gt; Define coding standards for your project and use SonarQube to enforce them. This can help ensure that your codebase is consistent and maintainable.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What risks does SonarQube pose to our project?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Integration complexity:&lt;/strong&gt; Integrating SonarQube into your development workflow can be complex, especially if you have a large or complex codebase. There may be some initial setup and configuration required, which can be time-consuming.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;False positives:&lt;/strong&gt; SonarQube's analysis is based on a set of rules and algorithms that may not always accurately reflect the specific needs of your project. This can result in false positives, where issues are flagged that are not actually problems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance impact:&lt;/strong&gt; Running code analysis with SonarQube can take time and resources, especially on larger codebases. This can impact the performance of your development environment and slow down your development process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintenance overhead:&lt;/strong&gt; SonarQube requires ongoing maintenance and configuration to ensure that it continues to provide accurate results. This can add an additional overhead to your development process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Increase in development time:&lt;/strong&gt; Resolving issues during development adds time up front, though it typically saves time later by preventing rework.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to overcome risk?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;To reduce false positives, customize SonarQube's rule sets to fit your project's specific needs. You can also create your own custom rules and modify existing rules to better suit your project.&lt;/li&gt;
&lt;li&gt;To minimize the performance impact of running code analysis with SonarQube, use a dedicated server for analysis, optimize SonarQube's configuration for your specific environment, and consider running analysis during off-hours.&lt;/li&gt;
&lt;li&gt;Use SonarLint, a SonarQube plugin, to perform real-time code analysis directly in your integrated development environment (IDE). This will allow you to address issues as you write code, reducing the time required to fix issues later on.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What benefits will we get?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Improved code quality.&lt;/li&gt;
&lt;li&gt;Increased productivity by decreasing the time spent on manual code review.&lt;/li&gt;
&lt;li&gt;Consistent coding standards.&lt;/li&gt;
&lt;li&gt;Improved developer coding skills and practices.&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Complete Roadmap to Learn React</title>
      <dc:creator>Ayush Shrivastav</dc:creator>
      <pubDate>Sun, 31 Jul 2022 10:41:00 +0000</pubDate>
      <link>https://dev.to/ayushshrivastav/complete-roadmap-to-learn-react-3n71</link>
      <guid>https://dev.to/ayushshrivastav/complete-roadmap-to-learn-react-3n71</guid>
      <description>&lt;p&gt;If you are new to React JS then you would have faced a problem that what to learn in react and what not to, what is more important and on which concept I need to focus more all these question pops up in mind every time we start learning new things.&lt;/p&gt;

&lt;p&gt;In this blog we will discuss a complete roadmap to learn React: which parts deserve more of your focus, plus a task at the end covering all the React concepts.&lt;/p&gt;

&lt;p&gt;If you follow this roadmap, by the end you will be able to say in any interview that you know React JS.&lt;/p&gt;

&lt;p&gt;Let's not waste time and start with the prerequisites for learning React.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites for Learning React&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HTML (basic)&lt;/li&gt;
&lt;li&gt;CSS (basic)&lt;/li&gt;
&lt;li&gt;JavaScript (intermediate)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Yes, that's all you need to know before learning React.&lt;/p&gt;

&lt;p&gt;If you know the prerequisites, we are good to go. Let's first see an overview of the topics, and then we will discuss each one in detail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Topics To learn in React&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Basic React Concepts&lt;/li&gt;
&lt;li&gt;Advanced React Concepts&lt;/li&gt;
&lt;li&gt;Hooks&lt;/li&gt;
&lt;li&gt;React-Router&lt;/li&gt;
&lt;li&gt;Store Management (Redux)&lt;/li&gt;
&lt;li&gt;API Calls in React (Axios)&lt;/li&gt;
&lt;li&gt;A task on Complete React Concept&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now let's discuss each topic in detail, including how and where you can learn these concepts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Basic React Concepts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For the basic React concepts, we just have to focus on the following things.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is JSX ?&lt;/li&gt;
&lt;li&gt;How an Element is Rendered?&lt;/li&gt;
&lt;li&gt;What are components? (functional and class components)&lt;/li&gt;
&lt;li&gt;What is state and why is it required?&lt;/li&gt;
&lt;li&gt;What are props?&lt;/li&gt;
&lt;li&gt;How can we pass props from parent to child component?&lt;/li&gt;
&lt;li&gt;Conditional Rendering&lt;/li&gt;
&lt;/ul&gt;
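To make the components-and-props idea concrete, here is a minimal conceptual sketch. It deliberately avoids JSX and the React library so it runs anywhere; in real React the component would return JSX, and the names (`Greeting`, `name`) are just illustrative:

```javascript
// Conceptual sketch only: a React function component is, at its core,
// a function from a props object to UI. Real React returns JSX, not strings.
function Greeting(props) {
  // "props" is a plain object passed in by the parent component.
  return `Hello, ${props.name}!`;
}

// A parent passing props down to a child — conceptually what the JSX
// form (Greeting with a name attribute) does under the hood.
const output = Greeting({ name: "Ayush" });
console.log(output); // Hello, Ayush!
```

Once this mental model clicks, state is just data that React remembers between such calls and that triggers a re-render when it changes.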

&lt;p&gt;Focus on these concepts the most, because they will form the base of everything you build in React.&lt;/p&gt;

&lt;p&gt;After learning these concepts, let's move on to the advanced concepts of React.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Advanced React Concepts&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to render a list with the map method in JSX, and why is the key important in lists?&lt;/li&gt;
&lt;li&gt;Handling Forms in React&lt;/li&gt;
&lt;li&gt;How to pass data from a child component to a parent component? (Lifting State Up)&lt;/li&gt;
&lt;li&gt;How to use references (refs) in React?&lt;/li&gt;
&lt;li&gt;Fragments&lt;/li&gt;
&lt;li&gt;Component lifecycle methods&lt;/li&gt;
&lt;li&gt;Higher-order components&lt;/li&gt;
&lt;li&gt;Context&lt;/li&gt;
&lt;li&gt;Difference between controlled and uncontrolled forms&lt;/li&gt;
&lt;/ul&gt;
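Of these, the higher-order component pattern is the one that tends to confuse beginners, so here is a hedged sketch. Components are modelled as plain props-to-string functions so the example runs without React; real HOCs take and return JSX-rendering components, and `withLoading` / `UserList` are made-up names for illustration:

```javascript
// A higher-order component (HOC) is just a function that takes a
// component and returns a new, enhanced component.
function withLoading(Component) {
  return function WrappedComponent(props) {
    if (props.isLoading) {
      return "Loading..."; // real code would return a spinner element
    }
    return Component(props); // delegate to the wrapped component
  };
}

const UserList = (props) => `Users: ${props.users.join(", ")}`;
const UserListWithLoading = withLoading(UserList);

console.log(UserListWithLoading({ isLoading: true, users: [] }));         // Loading...
console.log(UserListWithLoading({ isLoading: false, users: ["Ayush"] })); // Users: Ayush
```

The same wrapping idea is how patterns like authentication guards and data-fetching wrappers were commonly built before hooks.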

&lt;p&gt;Congrats! Now you know React JS and its functionality.&lt;br&gt;
It's time to strengthen your concepts and learn about hooks in React.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Hooks&lt;/strong&gt;&lt;br&gt;
Moving forward, hooks are one of the most important concepts in React. Let's take a look.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What are hooks?&lt;/li&gt;
&lt;li&gt;Some important hooks: useState, useEffect, and useMemo.&lt;/li&gt;
&lt;li&gt;The important rules of hooks.&lt;/li&gt;
&lt;li&gt;How to make a custom hook.&lt;/li&gt;
&lt;/ul&gt;
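A good way to understand why the rules of hooks exist (only call hooks at the top level, always in the same order) is a toy re-implementation of useState. This is a teaching sketch, not how React is actually implemented internally, but it shows the core trick: hook state lives in an array, and React relies on a stable call order to know which slot belongs to which hook:

```javascript
// Toy useState: state is kept in an array, indexed by call order.
const hookStates = [];
let hookIndex = 0;

function useState(initialValue) {
  const i = hookIndex++; // this hook's slot for the current render
  if (hookStates[i] === undefined) {
    hookStates[i] = initialValue; // first render: seed the slot
  }
  const setState = (newValue) => {
    hookStates[i] = newValue; // later renders read the updated slot
  };
  return [hookStates[i], setState];
}

// Each "render" resets the index — which is exactly why hooks must be
// called unconditionally and in the same order every time.
function render() {
  hookIndex = 0;
  const [count, setCount] = useState(0);
  const [name] = useState("Ayush");
  return { count, name, setCount };
}

let ui = render();
ui.setCount(ui.count + 1); // state update
ui = render();             // re-render picks up the new value
console.log(ui.count, ui.name); // 1 Ayush
```

If a hook were called inside an `if`, the indices would shift between renders and the wrong state would be returned — that is the bug the rules of hooks prevent.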

&lt;p&gt;&lt;strong&gt;4. React Router&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is Routing?&lt;/li&gt;
&lt;li&gt;How Routing is Done in React?&lt;/li&gt;
&lt;li&gt;Difference between Link and NavLink.&lt;/li&gt;
&lt;li&gt;useNavigate, useParams, and useSearchParams hooks.&lt;/li&gt;
&lt;li&gt;Nested Routing.&lt;/li&gt;
&lt;li&gt;Parameterised Routing.&lt;/li&gt;
&lt;li&gt;Conditional Routing.&lt;/li&gt;
&lt;/ul&gt;
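To demystify parameterised routing, here is a simplified sketch of what a router does under the hood: it matches a route pattern like "/users/:id" against the current URL and extracts the dynamic segments as params (conceptually what useParams() returns). This is an illustration only, not React Router's real matching algorithm, and `matchPath` is a made-up helper name:

```javascript
// Match a pattern such as "/users/:id" against a concrete pathname,
// returning extracted params on success or null on mismatch.
function matchPath(pattern, pathname) {
  const patternParts = pattern.split("/").filter(Boolean);
  const pathParts = pathname.split("/").filter(Boolean);
  if (patternParts.length !== pathParts.length) return null;

  const params = {};
  for (const [i, part] of patternParts.entries()) {
    if (part.startsWith(":")) {
      params[part.slice(1)] = pathParts[i]; // dynamic segment -> param
    } else if (part !== pathParts[i]) {
      return null; // static segment must match exactly
    }
  }
  return params;
}

console.log(matchPath("/users/:id", "/users/42")); // { id: '42' }
console.log(matchPath("/users/:id", "/posts/42")); // null
```

Nested and conditional routing build on the same matching idea, just applied to route trees and guarded by app state.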

&lt;p&gt;&lt;strong&gt;5. Store Management (Redux)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why is a centralized state (store) required?&lt;/li&gt;
&lt;li&gt;Basic information about Redux and Redux Toolkit&lt;/li&gt;
&lt;li&gt;What is a reducer in Redux?&lt;/li&gt;
&lt;li&gt;What are actions and action types?&lt;/li&gt;
&lt;li&gt;How to read the store through the useSelector hook?&lt;/li&gt;
&lt;li&gt;How to dispatch an action?&lt;/li&gt;
&lt;li&gt;Middleware for Redux and why it is required (Thunk or Saga)?&lt;/li&gt;
&lt;/ul&gt;
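The reducer/action/dispatch vocabulary is easier to absorb with a minimal sketch of the Redux core idea: a reducer is a pure function from (state, action) to new state, and the store holds the state and runs every dispatched action through the reducer. Real apps should use the redux or @reduxjs/toolkit packages; this toy store only illustrates the mechanics, and the action types are made up:

```javascript
// A reducer: pure function (state, action) -> newState. It never
// mutates state; it returns a new object.
const initialState = { count: 0 };

function counterReducer(state = initialState, action) {
  switch (action.type) {
    case "INCREMENT":
      return { ...state, count: state.count + 1 };
    case "ADD":
      return { ...state, count: state.count + action.payload };
    default:
      return state; // unknown actions leave state unchanged
  }
}

// A toy store: holds state, exposes getState and dispatch.
function createStore(reducer) {
  let state = reducer(undefined, { type: "@@INIT" }); // seed initial state
  return {
    getState: () => state,
    dispatch: (action) => { state = reducer(state, action); },
  };
}

const store = createStore(counterReducer);
store.dispatch({ type: "INCREMENT" });
store.dispatch({ type: "ADD", payload: 4 });
console.log(store.getState().count); // 5
```

In a React app, useSelector is essentially a subscription to getState(), and useDispatch hands you the store's dispatch function; middleware like Thunk sits between dispatch and the reducer to handle async work.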

&lt;p&gt;&lt;strong&gt;6. API Calls in React (Axios)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is an API, and how can we fire a request with fetch?&lt;/li&gt;
&lt;li&gt;How is Axios better than fetch?&lt;/li&gt;
&lt;li&gt;How can we fire GET, POST, PUT, and DELETE requests with Axios?&lt;/li&gt;
&lt;li&gt;What is an Axios interceptor and why is it required?&lt;/li&gt;
&lt;/ul&gt;
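One concrete answer to "how is Axios better than fetch": fetch() does not reject on HTTP error statuses like 404 or 500 (only on network failure), while Axios rejects for you. The sketch below adds that Axios-like behaviour to fetch; the response objects are faked so it runs without a network, and `handleResponse` is a made-up helper name:

```javascript
// Treat HTTP error statuses as errors, the way Axios does by default.
function handleResponse(response) {
  if (!response.ok) {
    throw new Error(`HTTP error ${response.status}`);
  }
  return response.json();
}

// With real fetch (browsers / Node 18+), usage would look like:
//   fetch("https://example.com/api/users")
//     .then(handleResponse)
//     .then((users) => console.log(users))
//     .catch((err) => console.error(err.message));

// Faked response objects to demonstrate both paths without a network:
const okResponse = { ok: true, status: 200, json: () => ({ id: 1 }) };
const badResponse = { ok: false, status: 404, json: () => ({}) };

console.log(handleResponse(okResponse)); // { id: 1 }

try {
  handleResponse(badResponse);
} catch (err) {
  console.log(err.message); // HTTP error 404
}
```

Axios interceptors generalise this idea: they let you run code like this on every request or response in one central place, e.g. attaching auth tokens or mapping errors.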

&lt;p&gt;&lt;strong&gt;Task&lt;/strong&gt;&lt;br&gt;
Follow each step, and move to the next step only after building and verifying the previous one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Step 1&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a React app using the Create React App CLI command.&lt;/li&gt;
&lt;li&gt;Make 3 screens (Login page, Dashboard, Users).&lt;/li&gt;
&lt;li&gt;The Login page should not have a navbar.&lt;/li&gt;
&lt;li&gt;All other pages share the same navbar, with the user's information shown in one corner of it.&lt;/li&gt;
&lt;li&gt;You can choose your style of UI and be creative with it.&lt;/li&gt;
&lt;li&gt;Add Routing to your App.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;- Step 2&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Make a JSON web server for fake data to use as an API.&lt;/li&gt;
&lt;li&gt;Fill up the JSON server with some demo user data, for example:
{
 "user_name": "Ayush Shrivastav",
 "user_email": "ayushshrivastava076@gmail.com",
 "password": "789365ayush",
 "role": "admin" (or "user"),
 "designation": "CEO"
}&lt;/li&gt;
&lt;li&gt;Now use this data for the user login with the help of an API call.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>react</category>
      <category>learnreact</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
