DEV Community

Gabor Szabo


Billions of unnecessary files in GitHub

While looking for easy assignments for the Open Source Development Course, I found something very troubling, which is also an opportunity for a lot of teaching and a lot of practice.

Some files don't need to be in git

Common sense dictates that we rarely need to include generated files in our git repository. There is no point in keeping them in version control as they can be generated again. (The exception might be if the generation takes a lot of time or can be done only during certain phases of the moon.)

Neither is there a need to store 3rd party libraries in our git repository. Instead, we store a list of our dependencies with the required versions and then download and install them. (Well, the rightfully paranoid might download and save a copy of every 3rd party library they use to ensure it can never disappear, but you'll see we are not talking about that.)

.gitignore

The way to make sure that neither we nor anyone else adds these files to the git repository by mistake is to create a file called .gitignore, list in it patterns that match the files we would like to exclude from git, and add the .gitignore file itself to our repository. git will then ignore those files, as long as they are not already tracked. They won't even show up when you run git status.

The format of the .gitignore file is described in the documentation of .gitignore.

In a nutshell:

/output.txt

Ignore the output.txt file in the root of the project.

output.txt

Ignore output.txt anywhere in the project (in the root or in any subdirectory).

*.txt

Ignore all the files with the .txt extension.

venv

Ignore anything called venv, whether a folder or a file, anywhere in the project. (Write venv/ to match only folders.)

There are more. Check the documentation of .gitignore!
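To make the four patterns above concrete, here is a minimal Python sketch of how they behave. It is a deliberately simplified model and not git's real matcher: the function name `is_ignored` is made up for this illustration, and it handles only the pattern shapes listed above (no negation, no `**`, no trailing `/`, no slashes in the middle of a pattern).

```python
from fnmatch import fnmatchcase


def is_ignored(path, patterns):
    """Simplified check: would any of the patterns ignore this path?

    Assumption: paths use "/" separators and are relative to the project
    root. Only the four pattern shapes shown above are supported.
    """
    parts = path.split("/")
    for pattern in patterns:
        if pattern.startswith("/"):
            # Anchored pattern: only an entry directly in the project root.
            if fnmatchcase(parts[0], pattern[1:]):
                return True
        elif any(fnmatchcase(part, pattern) for part in parts):
            # Unanchored pattern: a file or folder with that name at any
            # depth; everything inside a matching folder is ignored too.
            return True
    return False


print(is_ignored("output.txt", ["/output.txt"]))      # True
print(is_ignored("sub/output.txt", ["/output.txt"]))  # False
print(is_ignored("sub/output.txt", ["output.txt"]))   # True
print(is_ignored("docs/notes.txt", ["*.txt"]))        # True
print(is_ignored("src/venv/bin/python", ["venv"]))    # True
```

For real decisions, ask git itself: `git check-ignore -v <path>` reports which pattern, if any, matches a given path.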

Not knowing about .gitignore

Apparently a lot of people using git and GitHub don't know about .gitignore.

The evidence:

Python developers use something called virtualenv to make it easy to use different dependencies in different projects. When they create a virtualenv they usually configure it to install all the 3rd party libraries in a folder called venv. We should not include this folder in git. And yet:

There are 452M hits for this search: venv

In a similar way NodeJS developers install their dependencies in a folder called node_modules. There are 2B responses for this search: node_modules

Finally, if you use the Finder application on macOS and open a folder, it will create a file called .DS_Store that holds nothing but Finder's own display settings for that folder. This file is really not needed anywhere. And yet I saw many copies of it on GitHub. Unfortunately, so far I could not figure out how to search for them. The closest I found is this search.
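If you would like to check your own projects for the three offenders mentioned above, a short script can do it. The following is only a sketch: the function name `find_unwanted` and the set `UNWANTED_NAMES` are made up for this example, and it only looks at the working directory, not at what is already committed.

```python
import os
import tempfile
from pathlib import Path

# Names taken from the examples in this post; extend the set as needed.
UNWANTED_NAMES = {"venv", "node_modules", ".DS_Store"}


def find_unwanted(root):
    """Return relative paths under root whose name is in UNWANTED_NAMES."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Never look inside git's own metadata directory.
        if ".git" in dirnames:
            dirnames.remove(".git")
        for name in list(dirnames):
            if name in UNWANTED_NAMES:
                found.append(os.path.relpath(os.path.join(dirpath, name), root))
                dirnames.remove(name)  # no need to descend into it
        for name in filenames:
            if name in UNWANTED_NAMES:
                found.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(Path(p).as_posix() for p in found)


# Small self-contained demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "app" / "node_modules" / "some_package").mkdir(parents=True)
    (base / "venv" / "lib").mkdir(parents=True)
    (base / "app" / ".DS_Store").write_bytes(b"")
    (base / "app" / "main.py").write_text("print('hello')\n")
    demo_result = find_unwanted(base)

print(demo_result)  # ['app/.DS_Store', 'app/node_modules', 'venv']
```

Each path it reports is a candidate for a line in .gitignore.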

Misunderstanding .gitignore

There are also many people who misunderstand the way .gitignore works. I can understand why, as the wording of the usual explanation is a bit ambiguous. What we usually say is:

If you'd like to make sure that git will ignore the __pycache__ folder then you need to put it in .gitignore.

A better way would be to say this:

If you'd like to make sure that git will ignore the __pycache__ folder then you need to put its name in the .gitignore file.

Without that, people might end up creating a folder called .gitignore and moving all their __pycache__ folders into this .gitignore folder. You can see it in this search.
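The difference between the two readings is easy to show in code. In this sketch the helper names `check_gitignore` and `write_gitignore` are invented for the illustration; the point is only that .gitignore has to be a plain text file containing names and patterns, not a folder containing the things to be ignored.

```python
import tempfile
from pathlib import Path


def check_gitignore(repo_root):
    """Classify the .gitignore entry in repo_root as 'file', 'folder' or 'missing'."""
    entry = Path(repo_root) / ".gitignore"
    if entry.is_file():
        return "file"
    if entry.is_dir():
        return "folder"  # the misunderstanding described above
    return "missing"


def write_gitignore(repo_root, patterns):
    """Create .gitignore as a plain text file with one pattern per line."""
    entry = Path(repo_root) / ".gitignore"
    entry.write_text("\n".join(patterns) + "\n")
    return entry


# Demo on throwaway directories: one wrong layout, one right layout.
with tempfile.TemporaryDirectory() as tmp:
    wrong = Path(tmp) / "wrong"
    (wrong / ".gitignore" / "__pycache__").mkdir(parents=True)

    right = Path(tmp) / "right"
    right.mkdir()
    write_gitignore(right, ["__pycache__/"])

    statuses = (check_gitignore(wrong), check_gitignore(right), check_gitignore(tmp))
    content = (right / ".gitignore").read_text()

print(statuses)       # ('folder', 'file', 'missing')
print(repr(content))  # '__pycache__/\n'
```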

Help

Can you suggest other common cases of unnecessary files in git that should be ignored?

Can you help me create the search for .DS_Store in GitHub?

Updates

More cases, based on the comments:

  • .o files, the result of compiling C and C++ code: .o
  • .class files, the result of compiling Java code: .class
  • .pyc files, compiled Python code, usually stored in the __pycache__ folder mentioned earlier: .pyc

How to create a .gitignore file?

A follow-up post:

Latest comments (50)


cubiclesocial

The real question is how much money keeping all this cruft is costing Microsoft/GitHub. Storage is cheap but it isn't free! And Enterprise storage is more expensive than consumer-grade hardware.

IMO, the real reason to only send over the minimal amount of stuff to GitHub is to keep those deltas tiny. When someone clones a repo for the first time, it has to pull everything. That large binary blob that is no longer used anywhere in the project, or that was accidentally committed one time? That comes down (compressed, of course) in the git deltas/packs. If that "someone" is an automated build process that runs regularly and starts from scratch each time, then that's going to draw down a lot of network traffic. And some places still charge $ per GB of transfer.

I think part of the problem is that people can't see all of the files that will be included with git add . when they do a straight git status before the add operation. Directories can be seen but not the files (at least in the versions of git I use). When I do my very first commit for a project, I do a git add ., check each line with git status to make sure I really do want it in there, and then I commit/push. If anything looks out of place, I git reset to undo the add operation and then adjust .gitignore to exclude what shouldn't be included and then try again. The first commit is everything.
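For what it's worth, git can list every untracked file individually instead of only the top-level directory: `git status --untracked-files=all` (short form `-uall`). A minimal sketch of using that from a script follows; the function names are made up for this example, and the parser assumes simple paths that git does not need to quote.

```python
import subprocess


def parse_untracked(porcelain_output):
    """Extract untracked paths from `git status --porcelain` output."""
    return [
        line[3:]
        for line in porcelain_output.splitlines()
        if line.startswith("?? ")
    ]


def list_untracked(repo_dir="."):
    """List every untracked file individually, not just the top folder."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "status", "--porcelain", "--untracked-files=all"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_untracked(result.stdout)


# Hand-written sample in the porcelain format, for illustration only.
SAMPLE = " M README.md\n?? venv/bin/python\n?? venv/pyvenv.cfg\n?? notes.txt\n"
print(parse_untracked(SAMPLE))
# ['venv/bin/python', 'venv/pyvenv.cfg', 'notes.txt']
```

Calling `list_untracked()` inside a repository before the first `git add .` shows exactly which files would be picked up.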

Another possible source of this problem is Microsoft Windows. On Windows, users can't create a file with just an extension in File Explorer. That means the first .gitignore file on the system has to be created via other means. It's an extra step and inertia is tough to overcome for some folks. Mac and Linux and other Unix-ey systems don't have goofy restrictions on filenames.

Thomas Bnt

In a similar way NodeJS developers install their dependencies in a folder called node_modules. There are 2B responses for this search: node_modules

Only cr*p ! 🤯

Posandu

The .gitignore folder search link is wrong. It should have the query .gitignore/ not .gitignore

https://github.com/search?q=path%3A.gitignore%2F&type=code

Gabor Szabo

Yours looks more correct, but I get the same results for both searches. Currently I get for both searches: 1,386,986 code results

Posandu

Weird, I get two different results. 🤣

[Screenshot: search results for .gitignore/]

[Screenshot: search results for .gitignore]

Gabor Szabo

You use night-mode and I use day-mode. That must be the explanation. 🤔

Also I have a menu at the top, next to the search box, with items such as "Pull requests", "Issues", ... and you don't. Either it is some configuration or we are on different branches of their A/B testing.

Z2Flow

Thanks Gabor.
Totally off-topic, but your namesake's music is fantastic.
Check out the "Dreams" album.

Gabor Szabo

It is indeed off-topic and yes I like his music :)

Ethan Azariah

I was quite used to configuring everything by text file when I first encountered Git in 2005, but I still needed a little help and a little practice to get used to .gitignore. :) I think the most help was seeing examples in other people's projects; that's what usually works for me.

Alex Oladele

I manage a GitHub Enterprise instance for work and this is soooo incredibly important actually. The files you commit to git really build up over time. Even if you "remove" a file in a subsequent commit, it is still in git history, which means you're still cloning down giant repo history whenever you clone. You might think: "oh well so what? What's the big deal? This is a normal part of the development cycle"

Let's couple these large repo clones with automation that triggers multiple times a day. Now let's say that a bunch of other people are also doing automated clones of repos with large git histories. The amount of network traffic that this generates is actually significant and starts to impact performance for everyone. Not to mention that the code has to live on a server somewhere, so it's likely costing your company a lot of money just to be able to host it.

General advice whether you're using GitHub Enterprise or not:

  1. Utilize a .gitignore from the start! Be overzealous in adding things to your .gitignore file because it's likely safer for you. I use Toptal to generate my gitignores personally

  2. If you're committing files or scripts that are larger than 100 MB, just go ahead and use git-lfs to commit them. That way you keep your repo history small

  3. Try to only retain source code in git. Of course there will be times where you need to store images and maybe some documents, but really try to limit the amount of non source-code files that you store in git. Images and other non text-based files can't be diffed with git so they're essentially just reuploaded to git. This builds up very quickly

  4. Weirdly enough, making changes to a bunch of minified files can actually be more harmful to git due to the way it diffs. If git has to search for a change in a single line of text, it still has to change that entire (single) line. Having spacing in your code makes it easier to diff things with git since only a small part of the file has to change instead of the entire file.

  5. If you pushed a large file to git and realized that you truly do not need it in git, use BFG repo cleaner to remove it from your git history. This WILL mess with your git history, so I wouldn't use it lightly, but it's an incredibly powerful and useful tool for completely removing large files from git history.

  6. Utilize git-sizer to see how large your repo truly is. Cloning your repo and then looking at the size on disk is probably misleading because it's likely not factoring in git history.

  7. Review your automation that interacts with your version control platform. Do you really need to clone this repo 10 times an hour? Does it really make a difference to the outcome if you limited it to half that amount? A lot of times you can reduce the number of git operations you're making which just helps the server overall
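Point 2 in the list above can be checked before the first commit. Here is a rough sketch that lists files in a working directory above a size limit; the function name `find_large_files` is invented for this example, and the default limit simply takes the 100 MB figure from the list, read here as 100 × 1024 × 1024 bytes.

```python
import os
import tempfile
from pathlib import Path

DEFAULT_LIMIT = 100 * 1024 * 1024  # the 100 MB figure from the list above


def find_large_files(root, limit_bytes=DEFAULT_LIMIT):
    """Return (relative_path, size) for files larger than limit_bytes."""
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        if ".git" in dirnames:
            dirnames.remove(".git")  # skip git's own object store
        for name in filenames:
            full = Path(dirpath) / name
            if full.is_symlink():
                continue  # do not follow links out of the project
            size = full.stat().st_size
            if size > limit_bytes:
                large.append((full.relative_to(root).as_posix(), size))
    return sorted(large)


# Demo on a throwaway tree, with a tiny limit so the files stay small.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "assets").mkdir()
    (base / "assets" / "big.bin").write_bytes(b"\0" * 2000)
    (base / "small.txt").write_bytes(b"\0" * 10)
    demo_large = find_large_files(base, limit_bytes=1000)

print(demo_large)  # [('assets/big.bin', 2000)]
```

This only inspects the current files on disk. It says nothing about large files that are already buried in the history, which is what git-sizer from point 6 is for.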

Jakub Narębski

There is the gitignore.io service, which can be used to generate a good .gitignore file for the programming language and/or framework that you use, and a per-user or per-repository ignore file for the IDE you use.

Anu Shibin Joseph Raj

Scaffolding tools like Spring Initializr, the Vite scaffolder, the Eclipse Maven project initialisation wizard, etc. automatically generate a .gitignore file. I always find the entries in these auto-generated files very helpful.

Damien Sedgwick

Nice article! I created a little tool to help generate .gitignore files from the command line.

I did a small write up about it here: dev.to/damiensedgwick/how-to-gener...
