DEV Community

Cover image for How to detect code language in browser
Ray-D-Song
Ray-D-Song

Posted on

How to detect code language in browser

Repository: https://github.com/ray-d-song/guesslang-js

Demo: https://ray-d-song.github.io/guesslang-js/

Recently, I'm working on a project called EchoRSS, and I have a very wanted feature, which is to intercept external links in subscriptions (read full text, quote, etc.) and display them directly on the current page.

There is a problem that the returned HTML code block loses the language annotation (or the language was not annotated on the pre and code tags in the original code block), so it cannot be highlighted using tools like shiki or prism.js.

I found three solutions to detect code language:

1. linguist

This is a Ruby project deployed on the server, and Github uses it to detect the language composition of the repository. If you need extremely high accuracy and can be calculated on the server, this is the best solution.

2. hljs

highlight.js is a very famous web code highlighting library, and it is also the only library that provides automatic code detection.

The principle is very simple, which is to enumerate the keywords of the language, and then match them one by one with the text, and finally see which one has the highest matching degree.

hljs has four problems.

  • It requires a very long code length, and most languages require at least 300 characters to achieve a relatively good accuracy.
  • The part that detects the language is not a separate module, but tightly coupled with the parser and render, and the code is also very imperative, making it difficult to extract useful parts.
  • If you don't extract the detection module, the original format (line breaks and indentation) of the code will be lost when using hljs to highlight.
  • It requires a lot of regular matching, the performance is poor, and because of reason 2, it cannot be run in a web worker.

3. guesslang

guesslang is a machine learning project based on tensorflow.js.

Microsoft ported this project to node.js in 2021 and added the automatic language detection function to vscode.

A Vietnamese guy hieplpvip three years ago also ported this project to the browser, but there are also three problems:

  • Memory leak, memory leak...
  • Only supports the <script> tag to introduce the umd format, does not support esm, does not support bundle
  • Similarly, because of reason 2, it does not support web worker

The guy has not maintained this project, and the feat request to support esm in March has not been replied.

So I extracted the detection module from hljs, and forked guesslang-js to fix the above problems, and in the end guesslang won, the result is this:
https://github.com/ray-d-song/guesslang-js

I think it's too much to talk about, maybe someone will need it in the future, so I'll post it.

If someone knows tensorflow.js, I hope they can recommend some learning materials, I want to further modify it to web gpu calculation to improve efficiency.

Top comments (0)