<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: debricked</title>
    <description>The latest articles on DEV Community by debricked (@debrickedab).</description>
    <link>https://dev.to/debrickedab</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F313595%2Fb9db1b1e-a02d-470c-99fc-347ccf84539c.png</url>
      <title>DEV Community: debricked</title>
      <link>https://dev.to/debrickedab</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/debrickedab"/>
    <language>en</language>
    <item>
      <title>Cloning All GitHub Data – How, Why and Everything in Between</title>
      <dc:creator>debricked</dc:creator>
      <pubDate>Fri, 07 May 2021 11:26:11 +0000</pubDate>
      <link>https://dev.to/debrickedab/cloning-all-github-data-how-why-and-everything-in-between-3am0</link>
      <guid>https://dev.to/debrickedab/cloning-all-github-data-how-why-and-everything-in-between-3am0</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;a href="https://debricked.com"&gt;Debricked&lt;/a&gt; has achieved a not so small feat – we are now able to actively keep and maintain a clone of all data on GitHub! For what reason? You may ask. To understand all the why’s and how’s we have interviewed our Head of Data Science, Emil Wåréus.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Before we start with the questions, who are we talking to?&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;My name is Emil and I’m the Head of Data Science at Debricked. My team of five data engineers and I are the masters behind everything related to data. Also, I was the second employee at Debricked!&lt;/p&gt;

&lt;h2&gt;Debricked Cloning GitHub Data – Why?&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Let’s start with the million dollar question: why would anyone want a copy of all GitHub data?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The short answer is – to have a better and faster representation of the data that we need to service our customers. You see, we want to do big computations on all open source. Yes! You heard that right. On &lt;em&gt;all&lt;/em&gt; open source.&lt;/p&gt;

&lt;p&gt;If we only wanted to monitor a couple of thousand open source projects, we could do it through the standard API calls GitHub provides.&lt;/p&gt;

&lt;p&gt;But our products and solutions are not meant to give customers partial coverage; they are supposed to be extensive. Therefore we decided to index all 28M projects on GitHub, and that’s not the end of it. Soon we will be adding other large hosting platforms, such as GitLab, and more.&lt;/p&gt;

&lt;p&gt;But doing this – cloning all of GitHub, that is – poses quite an interesting challenge because of the many different data structures and relational dependencies in the data. Some are loosely coupled and some are tightly coupled.&lt;/p&gt;

&lt;p&gt;As a result, huge challenges arise regarding the time complexity of calculations on such a large dataset. For these reasons we decided to go on a journey and see if we could create a local mirror of GitHub, kept up to date on an hourly basis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why did you choose to store the data locally?&lt;/strong&gt;&lt;br&gt;
Haha! Yes, that may sound a bit unintuitive in the era of free cloud credits. But the truth is that it saves us a lot of money. Because our RAM and CPU load is high at all times, constantly churning through data, on-premises hardware is more cost-efficient.&lt;/p&gt;

&lt;p&gt;Cloud is great when you have variable loads, such as ML training. But when you are pushing a model like ours, scraping data with a lot of database interactions, it is a lot cheaper to host it on-prem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For all the hardware geeks out there, what does a setup like this look like?&lt;/strong&gt;&lt;br&gt;
There are a couple of things at play. First, we have a relational database that is not too large, meaning not in the terabyte range. Second, in terms of cloning all the repositories, we are looking at about 20 terabytes of data. Third, we have a graph database in which we want to understand how open source projects relate to one another.&lt;/p&gt;

&lt;p&gt;For this we use Neo4j. Essentially we are doing graph computations to answer questions such as “what are the root dependencies” and “which open source projects impact which others”.&lt;/p&gt;

&lt;p&gt;The final component of the puzzle is to do the actual scraping and analytics. Here we have a cluster with a couple of hundred pods working in parallel to both gather and analyse the incoming data. &lt;/p&gt;

&lt;p&gt;What people don’t realise is that there is a lot of complexity to this. Just cloning a repository is not enough. We need the history of it and we need to create multiple snapshots of the data for different points in time. This multiplies the problem by many factors.&lt;/p&gt;

&lt;p&gt;And before I forget, all of that storage is of course on PCI-Express NVMe SSDs sending Gigabytes of data for analysis every second. We must also thank AMD for creating such nice server CPUs with many cores! 🙂 &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TWgbD4mT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wpqdrf3gxr8cmqodfszr.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TWgbD4mT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wpqdrf3gxr8cmqodfszr.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Emil, smiling when thinking about his beautiful server.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;20 terabytes sounds like a lot of data – what does it entail exactly?&lt;/strong&gt;&lt;br&gt;
There are about 10 000 – 30 000 new pull requests on GitHub each hour. This is 5-10 times more than about 5 years ago, and that’s only the open source contributions on GitHub! We are monitoring more than 40M repositories, 100M issues, 80M pull requests and 12M active users.&lt;/p&gt;

&lt;p&gt;Thanks to our technology, both hardware and software, we are able to keep our model up to date within the hour. So, whenever someone makes a pull request, comments, stars something or follows someone new we can collect that data with minimal time lag.&lt;/p&gt;

&lt;h3&gt;A New Way of Identifying Vulnerabilities&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;How did you go about setting this up? Did you wake up one day and think “I need to clone GitHub”? What triggered it?&lt;/strong&gt;&lt;br&gt;
Haha, no. Of course not! We started scraping GitHub early in Debricked’s history. It all began when we realised that a lot of vulnerabilities are discussed in issues, and we wanted to link and classify those. Our research showed that this data was highly relevant.&lt;/p&gt;

&lt;p&gt;About 2% of all issues are related to security, but only 0.2% are correctly labeled with a security tag. It is not until much later in time that those discussions are publicly disclosed as vulnerabilities through NVD and other sources of curated vulnerability information. &lt;/p&gt;

&lt;p&gt;To address this, we developed our own machine learning models to classify issues as security-related or not. This way, customers with strict security requirements get information about issues that may pose a risk but have not yet turned into officially disclosed vulnerabilities.&lt;/p&gt;

&lt;p&gt;I must brag and say that we have achieved state of the art in the security text/issue classification space and have surpassed previous research in the area. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How many vulnerabilities have you discovered this way so far?&lt;/strong&gt;&lt;br&gt;
About 200 000. In contrast, NVD (the National Vulnerability Database), probably the most popular database for vulnerability information, has about 130 000 entries, many of which are not directly related to open source. We are also investigating how we could properly disclose this data to the public.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How did this later on translate to gathering all of the Github data?&lt;/strong&gt;&lt;br&gt;
We realised that this data (issues and vulnerabilities) is not independent of pull requests and the like, which could provide our models with additional signal. For this reason we started to scrape commits, releases, code diffs etc., and perform lots of interesting calculations on the data.&lt;/p&gt;

&lt;p&gt;We discovered that with stronger links between vulnerabilities and the original source (the open source projects) we could increase the precision of our data dramatically. This turned out to be a quite technical machine learning solution.&lt;/p&gt;

&lt;p&gt;We match the software bill of materials to vulnerabilities and leverage all of our different data points, which gives us superior data quantity and quality in terms of accurately finding vulnerabilities in our customers’ proprietary code.&lt;/p&gt;

&lt;h3&gt;Open Source Security – More Than Just Vulnerabilities&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;But the data is not only used for security, right?&lt;/strong&gt;&lt;br&gt;
Correct! As our customers will know, we also provide health data on open source projects. For example, we look at the trend of super users’ commits to a certain repository to build scores that give us intel about the project’s well-being.&lt;/p&gt;

&lt;p&gt;This is a good example of how complex the data can be. You want to monitor the contributors who know their way around a project and see that they keep contributing. This, in combination with all the other data we have on a project, can give us an indication of its health.&lt;/p&gt;

&lt;p&gt;So, in essence data has become our core here. Gathering, understanding and enriching data from GitHub and other sources has enabled Debricked to create services unheard of before.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This all sounds great! But what does this mean for current Debricked customers and potential future users?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The most impactful part of it is that you get more accurate information on vulnerabilities. We correctly map vulnerabilities to repositories at GitHub, commits, issues etc. which in turn increases our confidence that a vulnerability actually affects an open source project, and gives us insight into how it is affected.&lt;/p&gt;

&lt;p&gt;This is a general problem in our industry: the precision of the vulnerabilities you are presented with in Software Composition Analysis tools, vulnerability scanners and similar products.&lt;/p&gt;

&lt;p&gt;It is a difficult challenge to match vulnerability descriptions and vulnerability information from, for example, email threads and other highly unstructured data sources to the actual software they affect. And then to find out if you are using that open source component, determine how you are using it, and establish whether you are in fact affected by the vulnerability.&lt;/p&gt;

&lt;p&gt;Good tools have 85-95% precision (the share of reported vulnerabilities that are true positives). While we have seen competitors and free tools go as low as 60%, Debricked currently has a 90-98% precision rate on the languages we fully support. We manage this while being one tenth of the size of some of our competitors. This would not be possible without the data, combined with our algorithms and the raw talent in the team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A little bird whispered that this might open doors for some interesting new functionalities?&lt;/strong&gt;&lt;br&gt;
Yes! I can’t tell you all the secrets just yet… But by having all this data we are able to determine if a vulnerability affects a certain class or function. A vulnerability is only relevant if you are using the class or function containing the vulnerability in your software.&lt;/p&gt;

&lt;p&gt;This can remove second order false positives and increase the accuracy of your internal triaging of the vulnerability. You will know exactly where in your code the vulnerability is, and whether the code is called. This greatly reduces the number of vulnerabilities you actually have to check, which will save a lot of time. So, stay tuned!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks a lot for your time, Emil! Is there anything else you would add?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. Keep an eye on our blog for some cool insights later on, both about this and other fun stuff! 🙂&lt;/p&gt;

</description>
      <category>devops</category>
      <category>github</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Debugging Docker performance issues on Dell XPS 15 7590</title>
      <dc:creator>debricked</dc:creator>
      <pubDate>Fri, 21 Feb 2020 13:00:54 +0000</pubDate>
      <link>https://dev.to/debrickedab/debugging-docker-performance-issues-on-dell-xps-15-7590-5bbm</link>
      <guid>https://dev.to/debrickedab/debugging-docker-performance-issues-on-dell-xps-15-7590-5bbm</guid>
      <description>&lt;p&gt;We recently bought a new set of laptops to Debricked, and since I’m a huge fan of Arch Linux, I decided to use it as my main operating system. The laptop in question, a Dell XPS 15 7590, is equipped with both an Intel CPU with built-in graphics, as well as a dedicated Nvidia GPU. You may wonder why this is important to mention, but at the end of this article, you’ll know. For now, let’s discuss the performance issues I quickly discovered.&lt;/p&gt;

&lt;h1&gt;Docker performance issues&lt;/h1&gt;

&lt;p&gt;Our development environment makes heavy use of Docker containers, with local folders bind mounted into the containers. While my new installation seemed snappy enough during regular browsing, I immediately noticed something was wrong once I had set up my development environment. One of the first steps is to recreate our test database. This usually takes around 3-4 minutes, but I admit I was eagerly looking forward to benchmarking how fast this would be on my shiny new laptop.&lt;/p&gt;

&lt;p&gt;After a minute I started to realize that something was terribly wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;16 minutes and 46 seconds later&lt;/strong&gt;, the script had finished, and I was disappointed. Recreating the database was almost five times as slow as on my old laptop! Running the script again using &lt;code&gt;time ./recreate_database.sh&lt;/code&gt;, I got the following output:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;real 16m45.615s&lt;br&gt;
user  2m16.547s&lt;br&gt;
sys  13m3.622s&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;What stands out is the extreme amount of time spent in kernel space. What is happening here? A quick check on my old laptop, for reference, showed that it only spent 4 seconds in kernel space for the same script. Clearly something was way off with the configuration of my new installation.&lt;/p&gt;

&lt;h1&gt;Debug all the things&lt;/h1&gt;

&lt;h2&gt;Disk I/O&lt;/h2&gt;

&lt;p&gt;My initial thought was that the underlying issue was with my partitioning and disk setup. All that time in the kernel must be spent on something, and I/O wait seemed like the most likely candidate. I started by checking the Docker storage driver, since I’ve heard that the wrong driver can severely affect the performance, but no, the storage driver was &lt;code&gt;overlay2&lt;/code&gt;, just as it was supposed to be.&lt;/p&gt;
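&lt;p&gt;For reference, the active storage driver can be checked in one command (a sketch, assuming the Docker CLI is installed and the daemon is running):&lt;/p&gt;

```shell
# Print the storage driver Docker is currently using.
# On a modern installation this is expected to report "overlay2".
docker info --format '{{.Driver}}'
```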

&lt;p&gt;The partition layout of the laptop was a fairly ordinary LUKS+LVM+XFS layout, to get a flexible partition scheme with full-disk encryption. I didn’t see a particular reason why this wouldn’t work, but I tested several different options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using ext4 instead of XFS,&lt;/li&gt;
&lt;li&gt;Creating an unencrypted partition outside LVM,&lt;/li&gt;
&lt;li&gt;Using a ramdisk.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After using a ramdisk, and still getting an execution time of 16 minutes, I realised that clearly disk I/O can’t be the culprit.&lt;/p&gt;
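&lt;p&gt;For completeness, the ramdisk experiment can be reproduced roughly like this (a sketch; it requires root, and the 8G size is an arbitrary illustration, not the value used in the original test):&lt;/p&gt;

```shell
# Mount a tmpfs "ramdisk" and run the workload from it instead of the SSD,
# which takes the disk out of the equation entirely. Requires root.
mkdir -p /mnt/ramdisk
mount -t tmpfs -o size=8G tmpfs /mnt/ramdisk
# Point the Docker bind mounts at /mnt/ramdisk, rerun the workload,
# then clean up with: umount /mnt/ramdisk
```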

&lt;h2&gt;What’s really going on in the kernel?&lt;/h2&gt;

&lt;p&gt;After some searching, I found the very neat &lt;code&gt;perf top&lt;/code&gt; tool, which allows profiling of running threads, including threads in the kernel itself. Very useful for what I was trying to do!&lt;/p&gt;

&lt;p&gt;Firing up &lt;code&gt;perf top&lt;/code&gt; at the same time as running the &lt;code&gt;recreate_database.sh&lt;/code&gt; script yielded the following very interesting results, as can be seen in the screenshot below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cp2dTI82--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://debricked.com/blog/wp-content/uploads/2020/02/dockerlinus-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cp2dTI82--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://debricked.com/blog/wp-content/uploads/2020/02/dockerlinus-1.png" alt="Debugging Docker Performance Issues on Dell XPS"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That’s a lot of time spent in &lt;code&gt;read_hpet&lt;/code&gt;, which reads the HPET, the High Precision Event Timer. A quick check showed that none of my other computers exhibited the same behaviour. Finally I had some clue on how to proceed.&lt;/p&gt;
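&lt;p&gt;A quick way to see which clocksource the kernel is actually using, and which ones it considers usable, is to read sysfs (assumes a Linux system):&lt;/p&gt;

```shell
# Show the clocksource currently in use. "tsc" is the fast common case;
# "hpet" is much slower to read, since every read is an MMIO access.
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
# List all clocksources the kernel considers usable on this machine.
cat /sys/devices/system/clocksource/clocksource0/available_clocksource
```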

&lt;h2&gt;The solution&lt;/h2&gt;

&lt;p&gt;While reading up on the HPET was interesting on its own, it didn’t really give me an immediate answer to what was happening. However, in my aimless, almost desperate, searching I did stumble upon a couple of threads discussing the performance impact of having HPET either enabled or disabled when gaming.&lt;/p&gt;

&lt;p&gt;While not exactly related to my problem – I simply wanted my Docker containers to work, not to do any high performance gaming – I did start to wonder which of the graphics cards was actually being used on my system. After installing Arch, the graphical interface had worked from the start without any configuration, so I hadn’t actually selected which driver to use: the one for the integrated GPU, or the one for the dedicated Nvidia card.&lt;/p&gt;

&lt;p&gt;After running &lt;code&gt;lsmod&lt;/code&gt; to list the currently loaded kernel modules, I discovered that modules for both cards were in fact loaded, in this case both &lt;code&gt;i915&lt;/code&gt; and &lt;code&gt;nouveau&lt;/code&gt;. Now, I have no real use for the dedicated graphics card, so having it enabled would probably just draw extra power. I therefore blacklisted the modules related to &lt;code&gt;nouveau&lt;/code&gt; by adding the following lines to &lt;code&gt;/etc/modprobe.d/blacklist.conf&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;blacklist nouveau&lt;br&gt;
blacklist rivafb&lt;br&gt;
blacklist nvidiafb&lt;br&gt;
blacklist rivatv&lt;br&gt;
blacklist nv&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Upon rebooting the computer, I confirmed that only the &lt;code&gt;i915&lt;/code&gt; module was loaded. To my great surprise, I also noticed that &lt;code&gt;perf top&lt;/code&gt; no longer showed any significant time spent in &lt;code&gt;read_hpet&lt;/code&gt;. I immediately tried recreating the database again, and finally I got the performance boost I wanted from my new laptop, as can be seen below:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;real 2m36.722s&lt;br&gt;
user 1m39.528s&lt;br&gt;
sys  0m2.674s&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;As you can see, almost no time is spent in kernel space, and the total time is now faster than the 3-4 minutes of my old laptop. Finally, to confirm, I removed the modules from the blacklist again, and after a reboot the problem was back! Clearly the loading of &lt;code&gt;nouveau&lt;/code&gt; causes a lot of overhead, for some reason still unknown to me.&lt;/p&gt;
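&lt;p&gt;To confirm the state after each reboot, it is enough to check whether the module shows up in the loaded-module list (a sketch; &lt;code&gt;/proc/modules&lt;/code&gt; is the same data that &lt;code&gt;lsmod&lt;/code&gt; reads):&lt;/p&gt;

```shell
# Report whether the nouveau kernel module is currently loaded.
if grep -q '^nouveau ' /proc/modules; then
  echo "nouveau is loaded"
else
  echo "nouveau is not loaded"
fi
```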

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;So there you go, apparently having the wrong graphics drivers loaded can make your Docker containers unbearably slow. Hopefully this post can help someone else in the same position as me to get their development environment up and running at full speed.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>performance</category>
      <category>nvidia</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
