First of all before we dive in... What are trees? Obviously in computer science. A tree is just a data structure which is composed of nodes. Every tree has a root node. The root node has zero or more children nodes. Each child node has zero or more child nodes, and so on.
Okay, so what does this have to do with databases anyway?? Well i saw this in PgAdmin after running migrations some time ago.. Something similar to this:
CREATE INDEX nameIndex ON table USING btree (name);
Wait what?... What's btree?.. Well a quick google should tell you that, B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time.. Well again, another word here.. Logarithmic time..
lets visit the picture every programmer must have..
Its in the green so i think its good news right?..
A stackoverflow answer I liked for this was:
"O(log N) basically means time goes up linearly while the n goes up exponentially. So if it takes 1 second to compute 10 elements, it will take 2 seconds to compute 100 elements, 3 seconds to compute 1000 elements, and so on."
it is also known that logarithmic function is the opposite of an exponential function. When they say something grows exponentially, it is being multiplied. When something grows logarithmically, it is being divided. :)
Feel like Einstein yet?
Okay lets go on dear Einsteins.
So let's say you need to store a list of some numbers in some array and later you will want to search through it.. Damn me. I am going to use a list / array for that.. yeah so here we go..
var searchMeBoo = new List<int>() { 11,6,7,10,15, 17, 19 };
The list above contains seven items. So now i need to search for a particular number.. which sadly is at the end of the list which is 19. I will need to check the whole list till the very end before i get the match.We can denote this worst case as O(n) in asymptotic notation. This simply means if your list size is "n" at most, you need to do "n" number of searches to find a given value in an array.
Now lets be smart.. Let us improve the time needed to search this thing.
Lets make sure the data is going to be sorted. Whenever we put something in the List we have to make sure it keeps it order. If not call some .Sort() or sorted() on it in your language so we can get going. So now searching on such a sorted thing should begin in the middle. Then compare the selected value to the value being searched,
If the selected value is greater than search value, ignore the left side of the array and search the value on the right side and vice versa.
Now lets see it:
11,6,7,10,15, 17, 19 -> Sort -> 6,7,10,11,15, 17, 19
Here, we try to search number 19 from the List 6,7,10,11,15, 17,19 . If you do a normal search, then it will take seven units of time to search since the element is in the seventh position. But in the binary search, it will take only three searches, Because you will start your search in the middle which is 11..and compare with 19..which does not match ..so (404 Not found lol) Then since its greater than 11 you will then focus on the other half of the List, Then move to 17. Which 19 is still greater. Then FINALLY.. move to the last element which is 19. which is found lol..
We moved three times instead of seven times!!!
Well that was exciting.. Such an algorithm has a time complexity of O(Log n). Great lets move on.
Okay for some simple clarity, A B-tree is a balanced tree and not a binary tree. Now lets dive in. Take this image for example:
Remember nodes in the first picture in the post above? Well you could take a look again. Now over here, the branch node contains the biggest value in the leaf nodes, Lets take something like 57, which corresponds to the biggest value in 55,57,57 in the leaf node. Same applies to 83, 46 and 53. So in the algorithm a branch "layer" is made up of such values until all leaf nodes are covered by a branch node.
The next layer is built similarly, but on top of the first branch node level. The procedure repeats until all keys fit into a single node, the root node. The structure is a balanced search tree because the tree depth is equal at every position; the distance between root node and leaf nodes is the same everywhere.
Okay so now lets say we were gonna look for 57
Now lets traverse the tree to find that number. We are going to start at the root node on the left-hand. Now each value in there is processed in ascending order until a value is greater than or equal to the search term which is 57 in our case is found. In the figure it is the entry 83. The database follows the reference to the corresponding branch node which contains 57 which makes the (>=) comparison "True", it then moves down to get the data - simply.
Through this I have learnt that the B-tree enables the database to find a leaf node quickly :).
Anyway for deeper understanding please visit https://winand.at/ and grab the book “SQL Performance Explained” The author is the guy with all the knowledge. Markus Winand @MarkusWinand on twitter.
Top comments (11)
There's this new implementation in C# which allows you to implement a loop in an array from both the front and behind at the same time. That'll mean that for every list of n elements the worst case scenario will occur at n/2.for asymptotic relations these constants don't really matter though but do you think an implementation lik e this ideal. In the case of search for 19 it'll find 19 in the first attempt.
Send me the link to this implementation.. will check it
The section titled indexes and ranges
docs.microsoft.com/en-us/dotnet/cs...
This is how I used I to implement the aforementioned
dev-to-uploads.s3.amazonaws.com/i/...
Nice! I wrote same to see what was going on, it was what i guessed. Was the numbers intentional? As in repeated in descending order from the number 5? dev-to-uploads.s3.amazonaws.com/i/...
Yes it was. I was just comparing to see opposite ends were the same. Typically in a search like this you will most like have a worst case. The smaller the number the harder to find but the since it's index is far. And the higher the number the higher you are in you array and closer to your destination so the best case scenario mostly be bit so great ever with binary search tree
Well making double of the numbers means more space will be taken.. So on large datasets it will take more memory. (space complexity). Now lets imagine if the number is not in there... for example looking for 10. You will iterate through the whole numbers. from index 1 to .. arr.Length. Which is what you will likely not prefer.
Oh okay, so that means like nothing will really change in time complexity as opposed to running two different for loops from the opposite ends of the array; the only difference will be in space complexity of the variables that hold the arr.index if either situation, I variable in the first (new implementation) and 2 variables in using traditional method of two loops
Awesome. Looking forward to experimenting on Postgres and how it can optimize database designs and structure esp handling big data.
Looking forward to your next article
Cool seems I should make a post on scaling postgres.
You definitely should.