
Henry Fradley

Dealing with huge amounts of data in a SQL database!

So I was recently tasked with creating a Postgres database that contains 10 million unique records. I also need to be able to perform realistic queries against this database in a max time of 50ms.

First, let's talk about the seeding script. A regular seeding script with individual INSERT statements may work for a few thousand records, but for 10 million it simply is not going to work. So instead we are going to write the data to a CSV file and then copy that CSV into our database. CSV stands for comma-separated values, and a CSV file is essentially just a text file with one line per row and a text value for each column of your table. For this process we will be using fs, so be sure to require it at the top of your seeding file.
So let's make a 10 million record CSV file! First, set up a write stream to the correct path; this is where your CSV will be generated. Then write a header line with the column names of your table. Next, write a function that runs a loop with faker.js, writing a new line of fake data for every row of the CSV file, as in the sketch below. To be safe, it may be smart to practice with a 100 record CSV before you scale up to 10 million records.
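Here's a minimal sketch of that seeding script. The columns (id, name, price), the faker calls, and the output path are all hypothetical; swap in whatever matches your schema. One thing worth noting: writeStream.write() returns false when Node's internal buffer fills up, so the loop pauses and resumes on the 'drain' event instead of trying to buffer 10 million rows in memory at once.

```javascript
const fs = require('fs');
const faker = require('faker'); // npm install faker

// hypothetical output path and columns
const writeStream = fs.createWriteStream('listings.csv');
writeStream.write('id,name,price\n'); // header line with your column names

function writeRows(i, max) {
  let ok = true;
  while (i < max && ok) {
    const row = `${i},${faker.commerce.productName()},${faker.commerce.price()}\n`;
    ok = writeStream.write(row); // false means the buffer is full
    i++;
  }
  if (i < max) {
    // wait for the stream to flush before writing more rows
    writeStream.once('drain', () => writeRows(i, max));
  } else {
    writeStream.end();
  }
}

writeRows(0, 10000000);
```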

Postgres time! Now we will create the table that this data is going to live in; be sure to add all of the correct columns. It may be smart to leave foreign keys out of the table creation, since constraint checks slow down a bulk load, and you can add them afterwards.
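Assuming the same hypothetical columns as the CSV above, the table might look like this, with no foreign keys or indexes yet:

```sql
-- hypothetical table matching the CSV header: id,name,price
CREATE TABLE listings (
  id    INTEGER,
  name  TEXT,
  price NUMERIC
);
```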
Once the table is ready, just run a COPY statement in the psql shell that writes every line of your CSV into the table.
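Something like this, assuming the hypothetical file path and table from above. The HEADER option tells Postgres to skip the column-name line we wrote first. (COPY reads the file from the server's filesystem; if you're loading from a client machine, psql's \copy variant does the same thing client-side.)

```sql
COPY listings FROM '/path/to/listings.csv' WITH (FORMAT csv, HEADER true);
```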

Now we have data! So let's worry about query times. You can add timing to the psql shell with \timing and see exactly how long your queries are taking. My initial query is a SELECT with an inner join across 2 tables, and it is taking about 700ms. That is incredibly long, so we are going to do something called indexing. Select the proper column to index and add an index; for my example it is a listing id on my photos table that corresponds to a listing on the listings table. The new query time after indexing is 0.8ms!
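For reference, the shape of that fix in psql might look like the following. The table and column names (photos.listing_id referencing listings.id) are just my example's; the point is to index the column the join condition filters on, so Postgres can look up matching rows directly instead of scanning the whole table.

```sql
\timing on

-- the kind of join that was slow: one listing plus all of its photos
SELECT *
FROM listings
JOIN photos ON photos.listing_id = listings.id
WHERE listings.id = 4242;

-- index the join column so Postgres can seek instead of scanning millions of rows
CREATE INDEX idx_photos_listing_id ON photos (listing_id);
```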
