I recently organized my pinned repositories on GitHub and noticed that the language shown for one of my repositories didn't quite seem right. It indicated HTML, but I was expecting JavaScript because it was a vanilla JavaScript frontend and there were more lines of JavaScript code than HTML.
To really set the scene, here's a screenshot of my pinned repos with the incorrectly labeled repo (IMO) in question:
I did some digging to figure out how GitHub determines the language for a repository and how I can change the language shown.
GitHub and the Linguist Library
GitHub indicates it uses the open source Linguist Library to determine the file language for syntax highlighting and repository statistics.
Once you push changes to a repository on GitHub, the Linguist does its thing with a low-priority background job that will go through all of the files to determine the language of each file. Some things to note:
- all of the languages it knows about are listed in languages.yml
- excluded files include binary data, vendored code, generated code, documentation, and files in either data (e.g. SQL) or prose (e.g. Markdown) languages, with any explicit language overrides taken into account.
To determine the language for each remaining file, the Linguist employs the seven strategies listed below, applied in the order shown. Each step will either identify the exact language or will reduce the number of possible languages that get passed down to the next strategy.
- Vim or Emacs modeline
- commonly used filename
- shell shebang
- file extension
- XML header
- heuristics
- naïve Bayesian classification
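To make the "identify or narrow down" idea concrete, here's a minimal Python sketch of that cascade. It is not Linguist's actual code (Linguist is written in Ruby): the lookup tables, the function names, and the single heuristic rule are all made up for illustration, and only four of the seven strategies are mimicked.

```python
import os
from typing import Callable, List, Optional

# Toy lookup tables (assumptions for illustration; the real data lives in languages.yml).
FILENAMES = {"Makefile": ["Makefile"], "Dockerfile": ["Dockerfile"]}
INTERPRETERS = {"node": ["JavaScript"], "python3": ["Python"], "bash": ["Shell"]}
EXTENSIONS = {".js": ["JavaScript"], ".html": ["HTML"], ".h": ["C", "C++", "Objective-C"]}

# A strategy gets the file name, its content, and the candidates found so far,
# and returns the languages it thinks are possible (empty list = no opinion).
Strategy = Callable[[str, str, List[str]], List[str]]


def narrow(found: List[str], candidates: List[str]) -> List[str]:
    """Keep only languages that earlier strategies still consider possible."""
    return [lang for lang in found if lang in candidates] if candidates else found


def by_filename(name: str, content: str, candidates: List[str]) -> List[str]:
    return narrow(FILENAMES.get(os.path.basename(name), []), candidates)


def by_shebang(name: str, content: str, candidates: List[str]) -> List[str]:
    lines = content.splitlines()
    if not lines or not lines[0].startswith("#!"):
        return []
    tokens = lines[0][2:].split()
    if not tokens:
        return []
    interpreter = os.path.basename(tokens[0])
    if interpreter == "env" and len(tokens) > 1:
        interpreter = tokens[1]  # e.g. "#!/usr/bin/env node"
    return narrow(INTERPRETERS.get(interpreter, []), candidates)


def by_extension(name: str, content: str, candidates: List[str]) -> List[str]:
    ext = os.path.splitext(name)[1].lower()
    return narrow(EXTENSIONS.get(ext, []), candidates)


def by_heuristics(name: str, content: str, candidates: List[str]) -> List[str]:
    # Hypothetical rule: "@interface" settles an ambiguous header as Objective-C.
    if "Objective-C" in candidates and "@interface" in content:
        return ["Objective-C"]
    return []


STRATEGIES: List[Strategy] = [by_filename, by_shebang, by_extension, by_heuristics]


def detect(name: str, content: str) -> Optional[str]:
    """Run the strategies in order; stop as soon as exactly one language is left."""
    candidates: List[str] = []
    for strategy in STRATEGIES:
        found = strategy(name, content, candidates)
        if len(found) == 1:
            return found[0]
        if found:
            candidates = found  # pass the narrowed list to the next strategy
    return None  # still ambiguous; the real Linguist would fall back to its classifier
```

In this toy version, `detect("index.html", "")` is settled by the extension alone, while a `.h` file leaves three candidates after the extension step and needs the heuristic to pick one.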
The results are then used to produce the language stats bar that shows the languages and their respective percentages that make up the repository. The percentage is determined by the bytes of code for each language as indicated by the List Languages API. The language shown for each of my pinned repos up top is the majority language, meaning the one with the largest share of bytes.
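Here's a rough Python sketch of that arithmetic. The List Languages endpoint (`GET /repos/{owner}/{repo}/languages`) returns a JSON object mapping each language to its byte count; the numbers below are invented for illustration and are not the real figures from my repo, and the function names are my own.

```python
from typing import Dict, Optional


def language_percentages(byte_counts: Dict[str, int]) -> Dict[str, float]:
    """Turn per-language byte counts into percentages of the total."""
    total = sum(byte_counts.values())
    if total == 0:
        return {}
    return {lang: round(100 * n / total, 1) for lang, n in byte_counts.items()}


def primary_language(byte_counts: Dict[str, int]) -> Optional[str]:
    """The language with the most bytes, i.e. the one shown on a pinned repo."""
    return max(byte_counts, key=byte_counts.get) if byte_counts else None


# Hypothetical byte counts, in the same shape the List Languages API returns.
sample = {"HTML": 6000, "JavaScript": 3000, "CSS": 1000}

print(language_percentages(sample))  # {'HTML': 60.0, 'JavaScript': 30.0, 'CSS': 10.0}
print(primary_language(sample))      # HTML

# Dropping HTML from the stats changes both the shares and the top language.
without_html = {lang: n for lang, n in sample.items() if lang != "HTML"}
print(language_percentages(without_html))  # {'JavaScript': 75.0, 'CSS': 25.0}
print(primary_language(without_html))      # JavaScript
```

Because the share is measured in bytes rather than lines, a repo can have more lines of one language and still be labeled as another.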
Also, I was today years old when I found out about the language stats bar. If you’re wondering where it is, it’s the colorful bar up at the top of your repository just under the commits/branches/etc. bar. Those colors indicate the languages that make up your repo, and you can click on it to get the full breakdown. 🤯
Changing the Repo Language Shown
Now that we know the background of how GitHub determines the repository language, I’ll show you how to change the language shown using gitattributes.
- Create a .gitattributes file at the top level of your repo
- Edit the file and add the line below, putting the file extension of the language(s) you want ignored before linguist-detectable=false. Since I want HTML ignored, I’ve included HTML below.
*.html linguist-detectable=false
- Add, commit, and push the changes
And voila, the language is changed to JavaScript!
Resources
About Repository Languages
Linguist
How Do I Change the Category?
Top comments (9)
How do I exclude a folder/file from being used for language determination?
Example: I have a .obsidian folder (i.e. a folder created by a tool), and it has JS and other stuff. I don't want this to be counted, as it's just a tool, not the content of the repo, per se.
This should do the trick:
.obsidian/** linguist-vendored
It looks like this was published a bit too early. I'd like to know how to do it though!
Hi, thanks for pointing that out! There was an issue with dev.to showing an old draft since the whole post was there when I initially published it. But it’s been updated so check it out!
Ah I've had this kind of issues, too. Very frustrating!
This is the file I was looking for. It explains the options available in .gitattributes
Thanks for this!!!
thanks a lot!
Tysm !!!