As I was looking for easy assignments for the Open Source Development Course I found something very troubling which is also an opportunity for a lo...
Did you know the command `git clean -Xfd` will remove all files from your project that match the current contents of your .gitignore file? I love this trick.

Be careful with this one. Some of my repos have bits and pieces I expressly never commit and that are in .gitignore, but I don't want to branch/stash those things either. Things like files with sensitive configuration information or credentials in them that exist for development/testing purposes but should never reach GitHub.
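If you do try this, a dry run first is safer. A minimal sketch in a throwaway repo (the `scratch-clean` name and files are made up for illustration):

```shell
# Throwaway repo to try `git clean -X` safely (all names are made up)
git init -q scratch-clean
printf 'build/\n' > scratch-clean/.gitignore
mkdir scratch-clean/build
touch scratch-clean/build/output.o scratch-clean/src.c
git -C scratch-clean clean -Xnd   # -n = dry run: only LISTS what would go
git -C scratch-clean clean -Xfd   # -f = really delete the ignored files
```

Note that `-X` only touches ignored files: `src.c` above is untracked but not ignored, so it survives, whereas lowercase `-x` would remove untracked files too.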
Maybe use environment variables in your IDE? Or, if you're on Linux, you can set those values automatically when you enter the folder with `cd`. This is much safer in both situations: you will never commit this data and will never delete it.

For things like syncing it between devices, use a password manager (like Bitwarden).
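One common way to get the "set values when you `cd` in" behavior is direnv. A sketch of a hypothetical `.envrc` (the variable names and values are invented for illustration):

```shell
# .envrc in the project root -- requires direnv (direnv.net)
# direnv exports these when you `cd` into the folder and unloads
# them when you leave; add `.envrc` itself to .gitignore so the
# secrets never reach GitHub.
export API_KEY="dev-only-placeholder"     # hypothetical secret
export DATABASE_URL="sqlite:///dev.db"    # hypothetical connection string
```

Remember to run `direnv allow` once after creating or editing the file.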
The thing with repos is that `git clean -Xfd` should not be dangerous; if it is, then you have important information that should be stored elsewhere, NOT on the filesystem.
Please learn to use a proper gpg-agent or something.
The filesystem should really be ephemeral.
Information has to be stored somewhere. And that means everything winds up stored in a file system somewhere.
I was just looking for this command. I wanted to remove some of the SQLite files from GitHub.
This won't remove the already-committed files from GitHub. It removes files from your local disk that should not be committed to git.
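To also stop tracking a file that is already committed (while keeping it on disk), `git rm --cached` is the usual move. A sketch in a throwaway repo, with `db.sqlite` standing in for any accidentally committed file:

```shell
# Demo repo; db.sqlite is a made-up stand-in for the committed file
git init -q scratch-untrack
git -C scratch-untrack config user.email dev@example.com
git -C scratch-untrack config user.name Dev
touch scratch-untrack/db.sqlite
git -C scratch-untrack add db.sqlite
git -C scratch-untrack commit -qm "oops: committed a database file"
echo "db.sqlite" >> scratch-untrack/.gitignore
git -C scratch-untrack rm --cached db.sqlite  # untrack, but keep on disk
git -C scratch-untrack add .gitignore
git -C scratch-untrack commit -qm "stop tracking db.sqlite"
```

The file is still reachable in older commits, though; purging it from history entirely needs a tool like git filter-repo or BFG.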
Lifesaver!
Game engine projects often have very large cache folders that contain auto-generated files which should not be checked into repositories. There are well-established .gitignore files to help keep these out of GitHub, but people all too often don't use them.
For Unity projects, "Library" off the root is a cache (hard to search for that one, it's too generic).
For Unreal, "DerivedDataCache" is another (search link)
There are also Visual Studio's debug symbol files with the extension .pdb. These can get pretty damn big and often show up in repos when they shouldn't: search link
Thanks! That actually gave me the idea to open the recommended gitignore files and use those as the criteria for searches.
See also gitignore generators like gitignore.io. For example, this generated .gitignore has some interesting ones like `*.log` and `*.tmp`.

Does GitHub really store duplicate files?
I don't know how GitHub stores the files, but I am primarily interested in the health of each individual project. Having these files stored and then probably updated later will cause misunderstandings and make it harder to track changes.
Duplicate or not, git clone will create them. 😞
I am not sure I understand what you meant by this comment.
It doesn't matter whether GitHub stores it in duplicate or not, because git clone will create it unnecessarily on the client side.
Right
He wants to force people to do what he wants, assuming they're dumb.
UPD sorry, wrong pic
I am sorry, but it is unclear what you mean by that comment and what that image refers to. Could you elaborate, please?
He demonstrates how to lighten your open source projects with the use of `.gitignore`. 👍🏼

At no time does he point at people and tell them that. Why do you think like that? 🤔
I manage a GitHub Enterprise instance for work, and this is soooo incredibly important actually. The files you commit to git really build up over time. Even if you "remove" a file in a subsequent commit, it is still in the git history, which means you're still cloning down a giant repo history whenever you clone. You might think: "Oh well, so what? What's the big deal? This is a normal part of the development cycle."
Let's couple these large repo clones with automation that triggers multiple times a day. Now let's say that a bunch of other people are also doing automated clones of repos with large git histories. The amount of network traffic that this generates is actually significant and starts to impact performance for everyone. Not to mention that the code has to live on a server somewhere, so it's likely costing your company a lot of money just to be able to host it.
**General advice whether you're using GitHub Enterprise or not:**

Utilize a .gitignore from the start! Be overzealous in adding things to your .gitignore file because it's likely safer for you. I use Toptal to generate my gitignores personally.
If you're committing files or scripts that are larger than 100 MB, just go ahead and use git-lfs to commit them. You're minimizing your repo history that way.
Try to only retain source code in git. Of course, there will be times when you need to store images and maybe some documents, but really try to limit the amount of non-source-code files that you store in git. Images and other non-text-based files can't be diffed with git, so they're essentially just re-uploaded to git. This builds up very quickly.
Weirdly enough, making changes to a bunch of minified files can actually be harder on git due to the way it diffs. Git diffs line by line, so if a minified file is one giant line, a change anywhere in it means the entire line (effectively the whole file) gets stored again. Having spacing in your code makes it easier to diff things with git, since only a small part of the file has to change instead of the entire file.
If you pushed a large file to git and realized that you truly do not need it in git, use BFG repo cleaner to remove it from your git history. This WILL mess with your git history, so I wouldn't use it lightly, but it's an incredibly powerful and useful tool for completely removing large files from git history.
Utilize git-sizer to see how large your repo truly is. Cloning your repo and then looking at the size on disk is probably misleading because it's likely not factoring in git history.
Review your automation that interacts with your version control platform. Do you really need to clone this repo 10 times an hour? Does it really make a difference to the outcome if you limited it to half that amount? A lot of times you can reduce the number of git operations you're making which just helps the server overall
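On the git-lfs point above: running `git lfs track "*.psd"` (the pattern is just an example) records the rule in a `.gitattributes` file, which does get committed, and the resulting entry looks like this:

```
# .gitattributes line written by `git lfs track "*.psd"` (example pattern)
*.psd filter=lfs diff=lfs merge=lfs -text
```

From then on, matching files are stored as small LFS pointers in the repo instead of full blobs in every clone's history.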
I was really shocked; I read your article 3 times and opened the node_modules search to believe it.
Wow, GitHub should start to alert these people!
github.com/community/community#mak...
Ignored files are usually build artifacts and machine-generated files that can be derived from your repository source or should otherwise not be committed. Some common examples are: dependency caches, such as the contents of /node_modules or /packages; compiled code, such as .o, .pyc, and .class files.
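Pulling together the common cases mentioned across this thread, a starter `.gitignore` along these lines covers most of the offenders (adjust per project; this is only an illustrative sketch):

```
# Dependency caches
node_modules/
# Compiled artifacts
*.o
*.pyc
*.class
build/
# OS/editor noise
.DS_Store
# Local secrets
.env
# Logs and temp files
*.log
*.tmp
```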
I've updated the article based on your suggestions. Thanks.
Good explanation of .gitignore
Don’t forget those .env files as well!
GitHub’s extension search parameter doesn’t require the dot, so your .DS_Store search should work if you make that small change
extension:DS_Store
https://github.com/search?q=extension%3ADS_Store&type=Code
I was quite used to configuring everything by text file when I first encountered Git in 2005, but I still needed a little help and a little practice to get used to .gitignore. :) I think the most help was seeing examples in other peoples' projects; that's what usually works for me.
The .gitignore folder search link is wrong. It should have the query `.gitignore/`, not `.gitignore`:
https://github.com/search?q=path%3A.gitignore%2F&type=code
Yours looks more correct, but I get the same results for both searches. Currently I get for both searches:
1,386,986 code results
Weird, I get two different results. 🤣
.gitignore/
.gitignore
You use night-mode and I use day-mode. That must be the explanation. 🤔
Also I have a menu at the top, next to the search box with items such as "Pull requests", "Issues", ... and you don't. Either some configuration or we are on different branches of their AB testing.
There is the gitignore.io service that can be used to generate a good `.gitignore` file for the programming language and/or framework that you use, and a per-user or per-repository ignore file for the IDE you use.

Could GitHub inform users about it?
It could or there could be a bot doing that, however there might be some legitimate cases when you want to store these files. (I don't have an example now, but I would not rule that out.)
Also, when you create a new repository, GitHub offers to add a .gitignore file based on the files here:
![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i7iisv5e2ewxsq5d9vuk.png)
Good explanation!
Thanks Gabor.
Totally off-topic, but your namesake's music is fantastic.
Checkout the "Dreams" album.
It is indeed off-topic and yes I like his music :)
I somewhat disagree with this take; checking in files like node_modules ensures that everyone is installing the exact same dependencies.
Tools like Yarn even encourage this, which is why they came up with Yarn PnP, where package archives are stored in your project directory.
Tools like npm, yarn and pnpm have lock files for this. They will ensure everyone uses the same exact versions of dependencies.
There is absolutely no need to ever commit your node_modules directory if you’re not paranoid about packages disappearing.
And if you really care about that last thing, there are better options than caching every file from every package.
I would not use the word "paranoid" here. I think it is a totally valid concern that we should not blindly assume that all the external packages and services will be always there when we need them.
Anyway I am not familiar with those better options. Would you have the time to point at some of those tools?
I don’t personally use them but there are tools that analyze your code and only cache the pieces of your dependencies that you actually use. A sort of tree shaking for dependencies. There also some monorepo caching strategies that enable using multiple versions of dependencies at the same time which do the same thing.
But I don’t use any of these as package and lock files have always worked out for me. Deprecated/broken deps have even helped me find better packages and optimize/get rid of old code, so I actually like the changing nature of package files.
Some google searches might help you find out more.
Very much agreed here. Proper lock file management will ensure that everyone is installing the same dependencies across the board. If you run into a situation where a dependency has been deleted or is no longer available as a package, then the better solution is to find an alternative to it.
Good point; however, I think among those the only really important one is the last one: to ensure that external decisions (e.g. the removal of a dependency from npm, or of a specific version of a dependency) won't impact you.
For that, IMHO, it would be a better idea to create a local copy of a subsection of npm - all the files the project or the company needs. However I don't know if there is an easy solution for that.
In addition, this advice might need to be different for libraries and for end-user projects.
The real question is how much money keeping all this cruft is costing Microsoft/GitHub. Storage is cheap but it isn't free! And Enterprise storage is more expensive than consumer-grade hardware.
IMO, the real reason to only send over the minimal amount of stuff to GitHub is to keep those deltas tiny. When someone clones a repo for the first time, it has to pull everything. That large binary blob no longer being used anywhere in the project or was accidentally committed one time? That comes down (compressed, of course) in the git deltas/packs. If that "someone" is an automated build process that runs regularly and starts from scratch each time, then that's going to draw down a lot of network traffic. And some places still charge $ per GB of transfer.
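For those automated from-scratch clones, a shallow clone (`--depth 1`) avoids pulling the whole history down every time. A sketch using a local repo as a stand-in for the remote (all names are invented):

```shell
# Build a tiny "origin" with two commits, then shallow-clone only the tip
git init -q scratch-origin
git -C scratch-origin config user.email dev@example.com
git -C scratch-origin config user.name Dev
echo one > scratch-origin/file.txt
git -C scratch-origin add file.txt
git -C scratch-origin commit -qm "first"
echo two > scratch-origin/file.txt
git -C scratch-origin commit -qam "second"
# file:// forces the network-style transport, so --depth applies
git clone -q --depth 1 "file://$PWD/scratch-origin" scratch-shallow
git -C scratch-shallow rev-list --count HEAD   # prints 1: history stayed behind
```

In CI this is typically `git clone --depth 1 <url>`; most hosted CI systems expose the depth as a setting.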
I think part of the problem is that people can't see all of the files that will be included with `git add .` when they do a straight `git status` before the add operation. Directories can be seen, but not the files (at least in the versions of git I use). When I do my very first commit for a project, I do a `git add .`, check each line with `git status` to make sure I really do want it in there, and then I commit/push. If anything looks out of place, I `git reset` to undo the add operation, adjust `.gitignore` to exclude what shouldn't be included, and then try again. The first commit is everything.

Another possible source of this problem is Microsoft Windows. On Windows, users can't create a file with just an extension in File Explorer. That means the first `.gitignore` file on the system has to be created via other means. It's an extra step, and inertia is tough to overcome for some folks. Mac and Linux and other Unix-ey systems don't have goofy restrictions on filenames.

Nice article! I created a little tool to help generate `.gitignore` files from the command line. I did a small write-up about it here: dev.to/damiensedgwick/how-to-gener...
Using scaffolding tools like Spring Initializr, the Vite scaffolder, the Eclipse Maven project initialisation wizard, etc. automatically generates a .gitignore file. I always find the entries in these auto-generated files very helpful.
Only cr*p ! 🤯
Generate your standard .gitignore from github.com/michaelliao/gitignore-o... and use it for every project; a good solution helps you understand things well.