loading...

How can We implement Data Structures and Algorithms in Backend Frameworks to reach O(log(n)) Run Time ?

oathkeeper profile image Divyesh Parmar ・2 min read

Backend frameworks/platforms like Node.js(EcmaScript) and Django(Python).

for example, I do think that calling SomeClass.objects.all().filter() will loop the whole data ( O(N) Linear Way) until it gets the full results . I think it would be okay if the data is kept small up to medium but for large data; I don't think so. Imagine million of these data and thousands of requests per day. How can We handle those request in an efficient way

  • I can call raw SQL in Django (suppose I'm working with Django) if I need, that can allow for more efficient request with large amount of data

  • I believe in that too.However if I need to make that, first I need to organize the data in sql appropriately ( using Data Structures techniques like Hash Tables or Balanced Binary Tree) then I can quickly retrieve them later using Algorithms. What I'm asking is " Am I thinking Right?" "is what I want to accomplish is right and achievable ?" " Do Big Companies implement their own DS and Algorithms in that way ?" I've seen many people teaching DS and Algorithms separately and talk a lot about them and their power. But I've never seen both of them discussed and implemented in any framework by Teachers!!

From What I see , I started to believe that No Body Really care and I don't know Why?

  • Being out of college with CS, and always only worked my way around DS & Algorithms mostly through Online Judges I don't know how to merge my knowledge into this?

I hope my question doesn't sound vague or irrelevant I'm in learning phase, so please ignore any immaturity. Any explanation/advice/guidance is appreciated.

Discussion

pic
Editor guide
Collapse
rhymes profile image
rhymes

Hi Divyesh,

I'll try to give you a practical answer.

Most ORM (ActiveRecord for Rails, Django's and SQLAlchemy for Python) are lazy. This means that when you build a query such query is not going to be executed until the last moment.

Using Django's example: SomeClass.objects.all() won't hit the db unless you iterate on it, neither does .filter(). You can check here: docs.djangoproject.com/en/2.0/ref/...

Most developers, at least those with a little bit of experience, tend to query using pagination if they have to query an immense amount of data from the DB. The usual tabular UI or the infinite scrolling are using cursor pagination under the hood.

So if you have 1 million objects but you paginate by 50, the most you're going to query is 50 objects at a time. There are different ways of pagination: you can use numerical ids, strings that have a natural order or some custom other ways. You can also do it with Django: docs.djangoproject.com/en/2.0/topi...

So there's no reason, unless you have special needs, for you to issue a raw SQL query for it.

Let's talk about databases that are (at least PostgreSQL and I guess MySQL) super optimised for this type of access, provided you have the right things in place.

ORMs are pretty good at generating the correct SQL. If you have performance problems (on selects) most of the times you are either filtering on an unindexed column or you have issues with the setup of your table or database. There are obviously cases where even with all the indexes and optimizations in place you're killing your DB but for a lot of situations that's the last of the issues to solve.

How do you know if a query is optimized? You can use the awesome EXPLAIN ANALYZE in PostgreSQL. It will tell you a lot of things: which algorithm is going to use to select your data, if it will hit the indexes and it gives you a numerical cost of execution.

How do you get the SQL from a Django query? You can either use django debug toolbar which shows you all the queries that built the page or you can put a breakpoint right before the query and inspect it (google it, I'm not sure which is the correct answer for Django 2). Django debug toolbar is definitely the best for this.

Long story short: trust the database, you don't need to replicate the database in your application.

If you need to cache because the DB is still too slow, then cache.

You'll notice I mostly avoided the topic of data structures and algorithms because I think you are mixing multiple things that can or can't be related, depending on what your goal is. A database is a giant data structure with A LOT of algorithms and optimizations. PostgreSQL has been around 21 years :D

BTW if you want to learn more about how to create a database and you have a bit of spare time I think SQLite source code should still be small enough, I haven't used it years though so I might be wrong.

Collapse
mohamed profile image
Mohamed Saad

Hey rhymes

I don't think the problem mentioned here is choosing between ORM and raw SQL.

The main problem is how to store data and how to retrieve them.

First, you mentioned the first 50 item ahead of the whole data then reload it with another 50 and so on. This is okay of course. However the 50 data is previously organized in a way that let those 50 comes first from all those main data.

I have a lot of cases in my mind . I will try to add them as I remember to discuss

1 - If someone signs up in your website : it will add his email after the last signed up person.... 1,2,3,4,5,6,7,8,9 ,.............., 5,000,000

How can I authenticate the id=4,000,000 for example ( mr.example@gmail.com)
as I understand it will loop through all emails till it finds ( mr.example@gmail.com) then compare the hashed value of the given password with the stored one in the database. Correct me if I am wrong.

2- Imagine searching "Data structures" term in udemy. It will get all general data structures courses then data structures using specific language like java or c++ or python. this query will also maintain the first courses to be the highest rated.

3- friends of friends in facebook

I do think there are a lot of cases and as I mentioned I will add some later on.

I'm open for discussions anytime

Collapse
rhymes profile image
rhymes

There are a few different things:

  1. to solve the problem of "find the nth million user and check its password", assuming you have it's email or id or whatever you can use indexes. Most databases have it. Instead of traversing millions of rows you just find the right one through the index. PostgreSQL has different types of indexes: postgresql.org/docs/10/static/inde...

  2. that's probably implemented using a full text search with weights: Full Text Search Explained

  3. that's one of the hardest things to implement at scale. That's why Facebook is as big as Facebook ;-) Jokes aside, I would look into graph theory. You could probably implement a super simple version with a traditional database but it would probably explode if you have billions of users like FB. You might get by using Redis but you'll still have scaling problems. There's a discussion on the topic here: quora.com/How-does-Facebook-mainta...

1 and 2 are "free" in basically all relation database systems. 3 requires non trivial logic and implementation.

Collapse
oathkeeper profile image
Divyesh Parmar Author

So are those pagination simply stored like say Binary Tree, or like Stack like how PDFs work. BTW thanks for bringing up that infinite scrolling topic, I didn't know about it at all and was really curious all the time how does it work.

Wow you really explained super amazing I'm noting this all down in my OneNote, for the future knowledge whenever I get job and start working this is some dope shit I should be knowing

What is the cache process like? I know the word cache due to Computer Organization subject in sophomore year college. But how does it actually work in real world application (may be for a social media site, or is it like whenever, even if I don't have internet, i start chrome and it can load up all the sites I have visited so those must have been cached in browser's memory, right?)

Yeah I mean you really explained that there is no need of touching those DS & Algorithms things to touch, but what worried me is why is every company (may be especially here in India) always ask for DS & Algorithms problem solving questions in Interview?

Collapse
rhymes profile image
rhymes

Database pagination is an abstraction on the traversing a data structure, in this case it's done inside the database but you can create a paginator on an array for example. Pagination means dividing the results in "pages" of the same size and using an offset to move the cursor.

DBMSs are very complex structures, I'm quite sure they don't use a single data structure, but many. If you really want to dig in the insides of a DMBS you could start with an overview of how one of them phisically stores data: Introduction to PostgreSQL physical storage

A cache in a real word application is not very different from the concept of cache in general. The purpose of a cache is to be a temporary store of information that is expensive to compute. You have caches in the CPU, you have caches in the database, you can have caches in your program, you can have caching at the HTTP level. They all do, in different ways, the same thing: return data instead of going to the source to recompute it.

Browsers have caches too, some sites employ offline caching so to be partially usable even without connection.

Anyway I didn't say you don't need to know about data structures and algorithms, you do. I'm saying there's no point into replicating the functionality of the database in your code. The reason why companies ask them of you is because they are useful in designing programs and algorithms. It's just that they might not in your particular example: extracting data from the database in Django.

Still, you need to know in general how a database works and yes, sometimes, about caching too.

Good luck with your learning!

Thread Thread
oathkeeper profile image
Divyesh Parmar Author

THank you man! THanks a ton

Collapse
nestedsoftware profile image
Nested Software

I think that ORMs generally support query objects that allow the filtering to happen in the relational database, for e.g. sqlalchemy (python):

>>> s = select([users, addresses]).where(users.c.id == addresses.c.user_id)
Collapse
oathkeeper profile image
Divyesh Parmar Author

ohh okay! Can you explain more on what is ORM basically? I mean why would we need it?

Collapse
nestedsoftware profile image
Nested Software

It depends on the particular ORM, but here are some features that I think are fairly common:

  • You don't have to manually write SQL code to access your database. Instead you declare how your objects are mapped to the db and then create expressions like the one I showed above in your target language. This can speed up your development and give you some flexibility as well.
  • I think it is not uncommon for ORMs to have built-in caching that you can configure. That way you don't have to write a lot of your own custom code to deal with caching.
  • ORMs also usually support a transaction or unit of work. This again means you don't have to write as much custom code to deal with, say, multiple threads accessing the same entities. You don't have to write the code to deal with what these different threads will see. There are usually configuration options that allow you to make reasonable choices.

The basic point is that it is middleware code that allows you to do less work to get your code integrated with a database. You can certainly manage all of the persistence yourself using a lower-level library, but I think it is probably a good idea to use an ORM in many cases.

Thread Thread
oathkeeper profile image
Divyesh Parmar Author

ohh So ORMs are implemented really well we are just putting a layer over it. Thanks