DEV Community

Piotr


Sentence Transformers for the Polish language

In this tutorial, I'll show you how to generate embeddings for Polish sentences using Sentence Transformers. I won't explain how they work; there are many great articles on that already.

What we'll need is the Sentence Transformers library (the sentence-transformers package), which loads models from the Hugging Face Hub:

pip install sentence-transformers

The code is simple: we import the library, create a model, and ask it for embeddings.

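from sentence_transformers import SentenceTransformer

# Load the Polish sentence-embedding model (downloaded on first use)
model = SentenceTransformer('Voicelab/sbert-base-cased-pl')
# encode() takes a list of strings and returns one embedding vector per string
embeddings = model.encode(["Ten tekst zostanie zakodowany"])  # "This text will be encoded"
print(embeddings)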

And that's it. This is the output of the model:

[[ 7.74895132e-01  7.00104088e-02 -5.02209544e-01 -2.06187874e-01
  -1.28363922e-01  1.18705399e-01 -1.88303709e-01 -9.09971595e-02
...

To use a different model, replace Voicelab/sbert-base-cased-pl with a model from this list, which is pre-filtered for the Polish language.

These embeddings can be pretty useful: we can use them for classification, similarity search, and so on.
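To make the similarity search idea concrete, here's a minimal sketch that ranks stored vectors by cosine similarity to a query vector. The helper top_k_similar and the 3-dimensional toy vectors are my own illustration (assumptions, not part of the library); real embeddings from model.encode would simply be longer rows.

```python
import numpy as np

def top_k_similar(query_vec, corpus_vecs, k=2):
    """Return (row index, cosine similarity) for the k corpus rows closest to query_vec."""
    corpus = np.asarray(corpus_vecs, dtype=float)
    query = np.asarray(query_vec, dtype=float)
    # Normalise to unit length so that a dot product equals cosine similarity
    corpus_unit = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = corpus_unit @ query_unit
    # Sort by descending score and keep the first k rows
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]

# Made-up vectors standing in for real sentence embeddings
toy_corpus = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
])
toy_query = np.array([1.0, 0.05, 0.0])

results = top_k_similar(toy_query, toy_corpus, k=2)
print(results)
```

With real data you would pass model.encode(sentences) as the corpus and model.encode([query])[0] as the query vector.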

Example of usage

I have a list of sentences and want to know which ones are the most similar. How could I do that? As you can guess, with embeddings. We'll compute a pairwise similarity matrix over the sentence embeddings and look at which pairs score highest.

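sentences = [
    "Pożar w mieście. Zginęło 10 osób.",  # "Fire in the city. 10 people died."
    "Wypadek pod wiaduktem kolejowym.",  # "Accident under the railway viaduct."
    "W Poniedziałek odbędzie się konferencja naukowa",  # "A scientific conference will take place on Monday"
    "Magia potrafi wzniecać pożary",  # "Magic can start fires"
]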

embeddings = model.encode(sentences)

I'll use cosine similarity as the measure of how close two sentences are.

import seaborn as sns
from sklearn.metrics import pairwise

# Plot the pairwise cosine similarity between all sentence embeddings
sns.heatmap(pairwise.cosine_similarity(embeddings, embeddings))

Heatmap of the similarity matrix
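A quick note on terminology: cosine similarity and cosine distance are related but not the same number. Here's a rough sketch, using made-up 2-dimensional vectors instead of real embeddings, that computes the similarity matrix by hand with NumPy and derives the distance as 1 minus the similarity. The helper name cosine_similarity_matrix is just for illustration.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    vecs = np.asarray(vectors, dtype=float)
    # Scale every row to unit length; dot products then equal cosines
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    return unit @ unit.T

# Made-up vectors: the first and third are orthogonal, the second lies between them
toy_vectors = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 2.0],
])

similarity = cosine_similarity_matrix(toy_vectors)
distance = 1.0 - similarity  # cosine distance: 0 means same direction

print(np.round(similarity, 3))
print(np.round(distance, 3))
```

On a similarity heatmap the diagonal is always the maximum (every sentence is identical to itself), while on a distance heatmap it would be zero.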

This heatmap suggests that the model behaves sensibly: it found similarity between the sentences containing pożar (fire) and wypadek (accident), which both describe an accident.
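If you'd rather not eyeball the plot, the most similar pairs can be read straight off the matrix. Here's a minimal sketch, assuming a square, symmetric similarity matrix; the helper rank_pairs, the labels and the numbers are made up for illustration and are not the model's actual output.

```python
import numpy as np

def rank_pairs(similarity, labels):
    """List every distinct pair of items, most similar first."""
    sim = np.asarray(similarity, dtype=float)
    # Upper triangle without the diagonal: each pair once, no self-matches
    rows, cols = np.triu_indices(sim.shape[0], k=1)
    order = np.argsort(-sim[rows, cols])
    return [
        (labels[rows[i]], labels[cols[i]], float(sim[rows[i], cols[i]]))
        for i in order
    ]

# Made-up similarity values for three items
toy_similarity = np.array([
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.3],
    [0.2, 0.3, 1.0],
])
toy_labels = ["A", "B", "C"]

ranked = rank_pairs(toy_similarity, toy_labels)
for first, second, score in ranked:
    print(f"{first} <-> {second}: {score:.2f}")
```

With the real data, you would pass pairwise.cosine_similarity(embeddings) as the matrix and sentences as the labels.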

This work was prepared as part of the project "Hackathon Open Gov Data oraz stworzenie innowacyjnych aplikacji, z wykorzystaniem technologii GPU" ("Hackathon Open Gov Data and the creation of innovative applications using GPU technology"), co-financed by the Minister of Education and Science from the state budget under the programme "Studenckie koła naukowe tworzą innowacje" ("Student research clubs create innovations").
