DEV Community

Cover image for Best Practices Working with Billion-row Tables in Databases
Nixon Islam
Nixon Islam

Posted on

Best Practices Working with Billion-row Tables in Databases

This whole post is about a video from Hussein Nasser I saw from youtube. This is just the summarise version of the video along with my key takeaways from the video

Learning:

The discussion is on

  • How to handle data in billion row based table
  • What kind of approach can be taken
  • How to redesign the table to handle 2 billion rows based table

Summary:

Here the discussion is being started with how a twitter follower table has been designed. A simple approach, make a table with which person follows whom. A 2-3 columns based rows. But it will be a huge table in the long term in sense if we use this for twitter. So what can be done

  1. Do normal query without the concept of indexing. Just brute forcing the data without the concept of anything. Do multi threading, multi processing and find the data from the table using lots of machines (map reduce)
  2. Use indexing on the table and find the data from the table using the indexed data.
  3. Now you have billions of data so indexing is huge, to search that use database partitioning in the same disk. Use pair partitioning
  4. Now to optimize more use sharding in the system(multiple host). But it adds more complexity in the system. The client needs to be aware of the shard info before querying, then needs to find the proper partition for the query, then make an actual query. Which makes another layer of logics along with business logics.
  5. Another way is to redesign the system like in the profile table add 2 more rows to hold db columns like follower count, followers(in json). In that way a profile holds all information about followers of a profile. The problem might generate how to write/edit this data. But that is another kind of system design like querying, CQRS, event based solution. It solves current issues.

Image of Timescale

Timescale – the developer's data platform for modern apps, built on PostgreSQL

Timescale Cloud is PostgreSQL optimized for speed, scale, and performance. Over 3 million IoT, AI, crypto, and dev tool apps are powered by Timescale. Try it free today! No credit card required.

Try free

Top comments (0)

Billboard image

Use Playwright to test. Use Playwright to monitor.

Join Vercel, CrowdStrike, and thousands of other teams that run end-to-end monitors on Checkly's programmable monitoring platform.

Get started now!

👋 Kindness is contagious

Engage with a sea of insights in this enlightening article, highly esteemed within the encouraging DEV Community. Programmers of every skill level are invited to participate and enrich our shared knowledge.

A simple "thank you" can uplift someone's spirits. Express your appreciation in the comments section!

On DEV, sharing knowledge smooths our journey and strengthens our community bonds. Found this useful? A brief thank you to the author can mean a lot.

Okay