DEV Community

Kevin Naidoo
Hugging Face and Kaggle? (Machine Learning)

Did you know there is a whole community out there, similar to GitHub, but for data science?

Most developers are familiar with OpenAI. Since ChatGPT exploded onto the scene, this is all you hear about on Twitter, YouTube, and various other platforms.

As a backend engineer who mostly deals with web-related engineering, machine learning and data science are not my areas of expertise.

Still, I'm always experimenting with new technologies. For the past few years I've gone deep into the machine-learning world to see how I can bring some of that technology into my regular webdev space.

DISCLAIMER: This is not a Hugging Face or Kaggle sponsored article. I have personally used several models and datasets from these platforms in my own projects and it's been a great developer experience for a non-data scientist.

What is Hugging Face?

Before ChatGPT became a thing, you would need to build models yourself using PyTorch or TensorFlow, two powerful libraries that can be used for various tasks. Some examples:

  • Text classification - analyze reviews and figure out the sentiment of the review.
  • Image recognition.
  • Image classification.
  • Text generation.
  • Image generation.

And the list goes on...

Now, if you are a developer like me, you are probably not too familiar with all the terminology and math involved in training and working with data science models.

This is where Hugging Face shines for developers: it is like a GitHub for models. You can find many thousands of pre-built machine learning models, already trained on data from platforms such as Quora, GitHub, Reddit, and more.
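To make the text classification example from the list above a little more concrete, here is a minimal sketch using the `pipeline` helper from Hugging Face's `transformers` library. That library is not used elsewhere in this article, and the function name `review_sentiment` is made up for illustration. Treat it as a starting point, not a recommended setup.

```python
def review_sentiment(review):
    """Return the predicted sentiment label and confidence score for one review."""
    # Imported lazily: needs `pip install transformers` plus a backend such as PyTorch
    from transformers import pipeline

    # With no model argument, the library picks a default sentiment model
    # and downloads it the first time this runs
    classifier = pipeline("sentiment-analysis")

    # The pipeline returns a list with one dict per input text
    result = classifier(review)[0]
    return result["label"], result["score"]
```

Calling `review_sentiment("The delivery was quick and the shoes fit perfectly")` should give back a label and a score between 0 and 1. In a real service you would build the classifier once and reuse it, since loading a model on every call is slow.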

How do I use Hugging Face models?

If you have ever worked with PyTorch or TensorFlow, you will know that, for a developer, it's such a pain. It's almost like writing config files in code.

Hugging Face, on the other hand, provides a consistent API that allows you to swap models in and out quite easily.

Example:


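from sentence_transformers import SentenceTransformer

sentences = ["Some random sentence here", "Some random sentence 2 here"]

# Downloads the model from the Hugging Face Hub on first use, then loads it from the local cache
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# encode() returns one embedding vector per input sentence
embeddings = model.encode(sentences)
print(embeddings)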

This model basically takes in a sentence or list of sentences and converts them into vector embeddings.

With vector embeddings, you can build a smarter search: you store the vectors in a backend database like Qdrant or Redis, convert each search term into a vector with the same model, and then compare it against the vectors in the DB using cosine similarity. (This is very oversimplified, but you get the idea.)
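The comparison step can be sketched in plain Python. This is only an illustration: the three-dimensional vectors below are made up to stand in for real embeddings, the in-memory list stands in for a database like Qdrant or Redis, and the `search` helper is a hypothetical name. A real vector database does this ranking for you, and far more efficiently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def search(query_vector, stored, top_k=2):
    """Rank stored (label, vector) pairs by similarity to the query vector."""
    scored = [(label, cosine_similarity(query_vector, vec)) for label, vec in stored]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Made-up vectors standing in for embeddings produced by model.encode()
stored_vectors = [
    ("red summer dress", [0.9, 0.1, 0.0]),
    ("blue denim jeans", [0.1, 0.9, 0.1]),
    ("python tutorial", [0.0, 0.1, 0.9]),
]

# Made-up vector standing in for the embedded search term
query = [0.8, 0.2, 0.0]

results = search(query, stored_vectors)
for label, score in results:
    print(f"{score:.3f}  {label}")
```

The closer the score is to 1, the more the two vectors point in the same direction, which is what makes the match "semantic" instead of keyword-based.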

The beauty of Hugging Face is that you can swap out "sentence-transformers/all-mpnet-base-v2" for any other compatible model, e.g. "paraphrase-MiniLM-L6-v2". All you need to do is change the argument passed to "SentenceTransformer".

With the power of sentence transformers, you can now experiment with multiple models without having to make major changes to your code every time.
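One way to keep that swap down to a one-line change is to treat the model name as a parameter. The sketch below assumes the `sentence-transformers` package is installed; the names `DEFAULT_MODEL` and `embed` are hypothetical helpers, not part of the library.

```python
DEFAULT_MODEL = "sentence-transformers/all-mpnet-base-v2"


def embed(sentences, model_name=DEFAULT_MODEL):
    """Encode sentences with whichever sentence-transformers model is named."""
    # Imported lazily so this file can be loaded even where the library is missing
    from sentence_transformers import SentenceTransformer

    # Loading is the slow part; a real application would cache the model object
    model = SentenceTransformer(model_name)
    return model.encode(sentences)
```

Trying a different model is then just `embed(sentences, model_name="paraphrase-MiniLM-L6-v2")`. One caveat: vectors produced by different models are not comparable with each other, so if you change models you need to re-encode anything you have already stored.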

What is Kaggle?

While pre-trained models are great, sometimes you are going to need something custom. Perhaps you want to extend a Hugging Face model so that it can differentiate between different items of clothing.

This is called fine-tuning: you use a pre-existing model as a base and feed it extra data relevant to your use case.

The problem is: where do you get this data from?

Kaggle is the answer. On Kaggle, a ton of data scientists and programmers share data and even code examples for free.

In my example above relating to fashion, you can simply search their collection for fashion datasets, e.g.:

https://www.kaggle.com/search?q=fashion+dataset+datasetFileTypes%3Acsv

This will give you a list of datasets with CSV files that have been pre-formatted with various useful pieces of information, which you can import and use to fine-tune your own models.
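As a rough sketch of the import step, the snippet below reads a CSV into (text, label) pairs using only Python's standard library. The column names `product_name` and `category` are hypothetical, as is the inline sample data. Every Kaggle dataset has its own layout, so check the actual headers first.

```python
import csv
import io


def load_examples(csv_file, text_column, label_column):
    """Read (text, label) pairs from an open CSV file, skipping incomplete rows."""
    examples = []
    for row in csv.DictReader(csv_file):
        text = (row.get(text_column) or "").strip()
        label = (row.get(label_column) or "").strip()
        if text and label:
            examples.append((text, label))
    return examples


# Inline sample standing in for a downloaded file; the columns are made up
sample_csv = io.StringIO(
    "product_name,category\n"
    "Slim fit denim jeans,trousers\n"
    "Floral summer dress,dress\n"
    ",dress\n"
)

examples = load_examples(sample_csv, "product_name", "category")
print(examples)
```

For a real download you would pass an opened file instead, for example `open("fashion.csv", newline="", encoding="utf-8")`, where `fashion.csv` is a placeholder name. The resulting pairs are the kind of labeled data a fine-tuning run expects.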

Kaggle also provides full models and code examples. This will surely save you a lot of time compared to writing scrapers.

Conclusion

Both Kaggle and Hugging Face are awesome tools with great communities. For developers, these are amazing resources at our disposal for free.

Hopefully, this article will help you look beyond OpenAI. While OpenAI and LLMs like ChatGPT are great tools, you don't always need that level of processing power.

Smaller, compact sentence transformers can get a lot done and solve niche machine-learning problems, even on limited hardware without a GPU.
