Discussion on: How can We implement Data Structures and Algorithms in Backend Frameworks to reach O(log(n)) Run Time ?

View post

Hi Divyesh,

I'll try to give you a practical answer.

Most ORM (ActiveRecord for Rails, Django's and SQLAlchemy for Python) are lazy. This means that when you build a query such query is not going to be executed until the last moment.

Using Django's example: SomeClass.objects.all() won't hit the db unless you iterate on it, neither does .filter(). You can check here: docs.djangoproject.com/en/2.0/ref/...

Most developers, at least those with a little bit of experience, tend to query using pagination if they have to query an immense amount of data from the DB. The usual tabular UI or the infinite scrolling are using cursor pagination under the hood.

So if you have 1 million objects but you paginate by 50, the most you're going to query is 50 objects at a time. There are different ways of pagination: you can use numerical ids, strings that have a natural order or some custom other ways. You can also do it with Django: docs.djangoproject.com/en/2.0/topi...

So there's no reason, unless you have special needs, for you to issue a raw SQL query for it.

Let's talk about databases that are (at least PostgreSQL and I guess MySQL) super optimised for this type of access, provided you have the right things in place.

ORMs are pretty good at generating the correct SQL. If you have performance problems (on selects) most of the times you are either filtering on an unindexed column or you have issues with the setup of your table or database. There are obviously cases where even with all the indexes and optimizations in place you're killing your DB but for a lot of situations that's the last of the issues to solve.

How do you know if a query is optimized? You can use the awesome EXPLAIN ANALYZE in PostgreSQL. It will tell you a lot of things: which algorithm is going to use to select your data, if it will hit the indexes and it gives you a numerical cost of execution.

How do you get the SQL from a Django query? You can either use django debug toolbar which shows you all the queries that built the page or you can put a breakpoint right before the query and inspect it (google it, I'm not sure which is the correct answer for Django 2). Django debug toolbar is definitely the best for this.

Long story short: trust the database, you don't need to replicate the database in your application.

If you need to cache because the DB is still too slow, then cache.

You'll notice I mostly avoided the topic of data structures and algorithms because I think you are mixing multiple things that can or can't be related, depending on what your goal is. A database is a giant data structure with A LOT of algorithms and optimizations. PostgreSQL has been around 21 years :D

BTW if you want to learn more about how to create a database and you have a bit of spare time I think SQLite source code should still be small enough, I haven't used it years though so I might be wrong.

Divyesh Parmar • Apr 12 '18

So are those pagination simply stored like say Binary Tree, or like Stack like how PDFs work. BTW thanks for bringing up that infinite scrolling topic, I didn't know about it at all and was really curious all the time how does it work.

Wow you really explained super amazing I'm noting this all down in my OneNote, for the future knowledge whenever I get job and start working this is some dope shit I should be knowing

What is the cache process like? I know the word cache due to Computer Organization subject in sophomore year college. But how does it actually work in real world application (may be for a social media site, or is it like whenever, even if I don't have internet, i start chrome and it can load up all the sites I have visited so those must have been cached in browser's memory, right?)

Yeah I mean you really explained that there is no need of touching those DS & Algorithms things to touch, but what worried me is why is every company (may be especially here in India) always ask for DS & Algorithms problem solving questions in Interview?

rhymes • Apr 12 '18

Database pagination is an abstraction on the traversing a data structure, in this case it's done inside the database but you can create a paginator on an array for example. Pagination means dividing the results in "pages" of the same size and using an offset to move the cursor.

DBMSs are very complex structures, I'm quite sure they don't use a single data structure, but many. If you really want to dig in the insides of a DMBS you could start with an overview of how one of them phisically stores data: Introduction to PostgreSQL physical storage

A cache in a real word application is not very different from the concept of cache in general. The purpose of a cache is to be a temporary store of information that is expensive to compute. You have caches in the CPU, you have caches in the database, you can have caches in your program, you can have caching at the HTTP level. They all do, in different ways, the same thing: return data instead of going to the source to recompute it.

Browsers have caches too, some sites employ offline caching so to be partially usable even without connection.

Anyway I didn't say you don't need to know about data structures and algorithms, you do. I'm saying there's no point into replicating the functionality of the database in your code. The reason why companies ask them of you is because they are useful in designing programs and algorithms. It's just that they might not in your particular example: extracting data from the database in Django.

Still, you need to know in general how a database works and yes, sometimes, about caching too.

Good luck with your learning!

Divyesh Parmar • Apr 13 '18

THank you man! THanks a ton

Mohamed Saad • Apr 17 '18

Hey rhymes

I don't think the problem mentioned here is choosing between ORM and raw SQL.

The main problem is how to store data and how to retrieve them.

First, you mentioned the first 50 item ahead of the whole data then reload it with another 50 and so on. This is okay of course. However the 50 data is previously organized in a way that let those 50 comes first from all those main data.

I have a lot of cases in my mind . I will try to add them as I remember to discuss

1 - If someone signs up in your website : it will add his email after the last signed up person.... 1,2,3,4,5,6,7,8,9 ,.............., 5,000,000

How can I authenticate the id=4,000,000 for example ( mr.example@gmail.com)
as I understand it will loop through all emails till it finds ( mr.example@gmail.com) then compare the hashed value of the given password with the stored one in the database. Correct me if I am wrong.

2- Imagine searching "Data structures" term in udemy. It will get all general data structures courses then data structures using specific language like java or c++ or python. this query will also maintain the first courses to be the highest rated.

3- friends of friends in facebook

I do think there are a lot of cases and as I mentioned I will add some later on.

I'm open for discussions anytime

rhymes • Apr 17 '18 • Edited

There are a few different things:

to solve the problem of "find the nth million user and check its password", assuming you have it's email or id or whatever you can use indexes. Most databases have it. Instead of traversing millions of rows you just find the right one through the index. PostgreSQL has different types of indexes: postgresql.org/docs/10/static/inde...
that's probably implemented using a full text search with weights: Full Text Search Explained
that's one of the hardest things to implement at scale. That's why Facebook is as big as Facebook ;-) Jokes aside, I would look into graph theory. You could probably implement a super simple version with a traditional database but it would probably explode if you have billions of users like FB. You might get by using Redis but you'll still have scaling problems. There's a discussion on the topic here: quora.com/How-does-Facebook-mainta...

1 and 2 are "free" in basically all relation database systems. 3 requires non trivial logic and implementation.