Finding meaning in text, an experiment in document clustering

#beginners #datascience

Problem

For an assignment in the University of British Columbia's CPSC 330 course in Applied Machine Learning, we were tasked with categorizing titles pulled from a sample of Food.com recipes. The goal was simple: use a subset of the bank's 180,000+ recipes to find categories of recipes based purely on their titles. Achieving that goal was the real challenge, because the nature of the data (text) forced many different considerations into the modeling process.

The data

From our sample of recipes, we pulled a smaller subset of data consisting of 9100 words. We did this by removing duplicate entries, NaNs, and short names (< 5 characters), and by keeping only observations with tags that were among the top 300 tags in our sample. Below we can see what this unprocessed data of title names looks like, along with a visualization of the words within the dataset.

Index	Recipe Name
42	i yam what i yam muffins
101	to your health muffins
129	250 00 chocolate chip cookies
138	lplermagronen
163	california roll salad
...	...
231430	zucchini wheat germ cookies
231514	zucchini blueberry bread
231547	zucchini salsa burgers
231596	zuppa toscana
231629	zydeco salad

We found that the shortest name in our subsample was "bread", and the longest was "baked tomatoes with a parmesan cheese crust and balsamic drizzle". The most common words included "chicken" and "cake". You can refer to the wordcloud visualization above to get a better picture of the data we were working with.
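The filtering described earlier could be sketched roughly as follows. This is a minimal illustration, not the assignment's actual code: the column names `name` and `tags`, the helper `filter_recipes`, and the reading of "top 300 tags" as "at least one tag among the 300 most frequent" are all assumptions.

```python
from collections import Counter

import pandas as pd


def filter_recipes(df, min_len=5, top_n_tags=300):
    """Drop NaNs, duplicates, and short names, then keep rows with a top-N tag.

    Assumes a 'name' column of strings and a 'tags' column holding a list
    of tag strings per recipe (both column names are hypothetical).
    """
    out = df.dropna(subset=["name"]).drop_duplicates(subset=["name"])
    # Names shorter than min_len characters are removed.
    out = out[out["name"].str.len() >= min_len]
    # Count how often each tag appears across the remaining recipes.
    tag_counts = Counter(tag for tags in out["tags"] for tag in tags)
    top_tags = {tag for tag, _ in tag_counts.most_common(top_n_tags)}
    # Keep recipes carrying at least one of the most common tags.
    keep = out["tags"].apply(lambda tags: any(t in top_tags for t in tags))
    return out[keep]
```

Counting tags after the other filters (rather than before) is also a choice made here for simplicity; the original order of operations may have differed.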

Now we need to decide how best to represent these recipe names as features for our model.

Representing text

As we are trying to go from a sample of words to sensible categories of those words, we needed a way to accurately represent the text to our models. In class we learned that one of the go-to methods for exploratory encoding of text is bag-of-words encoding with a CountVectorizer, which counts how often each word occurs.

The easiest route

Using a CountVectorizer with bag-of-words encoding is intentionally shallow. We do not capture any meaning of the words, just their frequency, and we pass this on to our model. We lose all the nuances that word meaning affords us (the humans), but at the same time we get a feature set that the model can interpret and make connections from, even if those connections are rudimentary at best. From our testing on Wikipedia data (the toy corpus we tried models on at the start of hw6!), we can see that bag-of-words encoding does produce clusters, but they do not scream sensible right away. For example, the queries "Quantum Computer", "Environmental protection", "Renewable Energy", and "Climate Change" were merged into one cluster by this model.

Working with pre-trained models and text embeddings

If bag-of-words encoding does not do the job, then what might? We explored sentence embedding representations next. Using the pre-trained text embedding model 'all-MiniLM-L6-v2' from the Sentence Transformers package, we can convert the text into vector representations that may be significantly richer in detail in the eyes of our model. By using a pre-trained model, we essentially get the advantage of a model that already knows the "meaning" behind the words it is looking at, and it can attach relevant weights that help our model find patterns in the words to cluster on. To get an idea of what this vector array looks like, take a look at the table below, which is a sample of our Wikipedia data after encoding.

	0	1	2	3	4
0	-0.005857	-0.004795	-0.000976	0.011121	0.005294
1	-0.122905	-0.092450	0.085379	-0.009761	0.001422
2	-0.056721	-0.049697	-0.014780	0.022572	0.051773
3	-0.068892	0.010519	-0.064375	0.028483	-0.128162
4	-0.012700	0.101830	0.066676	-0.007987	0.140040
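A minimal sketch of how such an embedding table might be produced is shown below. The helper `embed_titles` and its optional `model` argument are hypothetical conveniences introduced here; only the model name 'all-MiniLM-L6-v2' and the Sentence Transformers package come from the text above.

```python
import numpy as np


def embed_titles(titles, model=None):
    """Return one embedding vector per title as a 2-D NumPy array.

    `model` can be any object with an `encode(list_of_str)` method. When it
    is omitted, the pre-trained 'all-MiniLM-L6-v2' model is loaded, which
    requires the sentence-transformers package and a one-time download.
    """
    if model is None:
        # Imported lazily so the helper can also be used with a stand-in encoder.
        from sentence_transformers import SentenceTransformer

        model = SentenceTransformer("all-MiniLM-L6-v2")
    return np.asarray(model.encode(list(titles)))
```

Calling `embed_titles(["zuppa toscana", "zydeco salad"])` with the default model should return an array with one row per title and 384 columns, since this model produces 384-dimensional vectors; the table above shows only the first five.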

Using these sentence embeddings with k-means clustering as before, we see a marked improvement in our clusters. The clusters make more immediate sense: unlike before, the queries "Quantum Computer", "Environmental protection", "Renewable Energy", and "Climate Change" are no longer one cluster! The model smartens up and excludes "Quantum Computer" from the grouping, instead placing that query with others such as "Unsupervised learning" and "Deep learning".

Is k-means the only way?

Now that we have elected to use sentence embeddings for our text encoding, we consider our method of clustering itself. While we have been using k-means clustering from the beginning of this project, there are many other methods we learned over the duration of CPSC 330 that we can consider. To save time I will gloss over the model specifics, but we tested our embeddings with DBSCAN using cosine distances (as opposed to k-means, which uses Euclidean distances) and with hierarchical clustering. In the plot below you can see the clusters that were identified by each model.

In the end I selected hierarchical clustering as my preferred clustering method, as it presented the most sensible results in my opinion.
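For readers who want to try the same comparison, the three methods can be run side by side in scikit-learn. This is only a sketch: the data are synthetic stand-ins for sentence embeddings, and every hyperparameter here (`eps`, `min_samples`, `n_clusters`, Ward linkage) is an assumption for illustration, not a setting used in the assignment.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans

rng = np.random.default_rng(0)
# Stand-in for sentence embeddings: two well-separated groups of vectors.
centers = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
X = np.vstack([c + 0.05 * rng.standard_normal((20, 3)) for c in centers])

# k-means needs the number of clusters up front and works with Euclidean distance.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN finds dense regions under cosine distance; label -1 marks noise points.
dbscan_labels = DBSCAN(eps=0.1, min_samples=5, metric="cosine").fit_predict(X)

# Agglomerative (hierarchical) clustering merges points bottom-up.
hier_labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)

for name, labels in [("k-means", kmeans_labels),
                     ("DBSCAN", dbscan_labels),
                     ("hierarchical", hier_labels)]:
    print(name, sorted(set(labels.tolist())))
```

Note that DBSCAN does not take a cluster count at all, so the number of clusters it returns depends on `eps` and `min_samples`, which usually need tuning on real embeddings.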

Modelling on food

Now that we have selected our method of encoding data and our clustering method, we can go ahead and start work on the recipes! After fitting our models we can take a look at what clusters each one produces. There is a general consensus between the models to categorize by type of food, whether it be hard, liquid, sweet, or spicy. Below we see the clusters that each model managed to identify.

Cluster	K-Means	DBSCAN	Hierarchical
-1 / Noise	—	Abstract (Filet sorrentino, Jalousie)	—
0	Abstract (Cosmo, Love wrap)	Savory (Chili, Meatloaf, Pork chops)	—
1	Sweet baked (Oatmeal, Chocolate chip, Sugar cookies)	Identical/Near-identical naming ("After eight")	Fruits / Snacks (Apple, Peach, Fruit dip)
2	Sweet baked and complex (Streusel, Cheesecake, Lemon cake)	"Chex Mix" variations	Baked and savory (Focaccia, Pesto, Yuca)
3	Salads (Potato salad, Vinaigrette, Lemon dressing)	Juices and "Go go" drinks (V8, Juice)	Savory vegetable and pasta (Stuffed peppers, Pizza, Kale soup)
4	Sweet baked (Zucchini muffins, Banana bread, Irish brown bread)	Spam-based (Island spam, Spam hotdish)	Comfort (Pork chop casserole, Cheesy potatoes)
5	Alcoholic drinks (Margarita, Martini, Wine-based)	Ethnic cooking (Mujadara)	Sweet drinks (Martini, Mango, Cranberry)
6	Sweet dessert (Brownies, Cupcakes, Trifle)	Specific repeated names (Pompagne)	Abstracts (Lullaby, Dream, Cloud)
7	Savory (Salsa, Souffle, Focaccia)	—	Meat dishes (Chicken broccoli, Beef patties, Flank steak)
8	Meat dishes (Enchiladas, Cashew chicken, Buffalo meatloaf)	—	Sweet mixed drinks (Kahlua mousse, Lemonade cocktail)
9	—	—	Cakes and sweet breads (Toffee cookies, Sweet potato cake, Swirl cake)
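A summary like the table above can be assembled by collecting a few example titles per cluster label and then interpreting each group by eye. The helper below, `titles_by_cluster`, is a hypothetical sketch of that collection step; the category names themselves still have to come from a human reading the examples.

```python
from collections import defaultdict


def titles_by_cluster(titles, labels, max_examples=3):
    """Group titles by cluster label, keeping the first few examples of each."""
    groups = defaultdict(list)
    for title, label in zip(titles, labels):
        key = int(label)  # cluster labels from scikit-learn are NumPy integers
        if len(groups[key]) < max_examples:
            groups[key].append(title)
    # Sort by label so noise (-1, if present) comes first.
    return dict(sorted(groups.items()))


examples = titles_by_cluster(
    ["zuppa toscana", "zydeco salad", "california roll salad"],
    [0, 1, 1],
)
for label, sample in examples.items():
    print(label, sample)
```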

DBSCAN seems to be a very face-value model, with some weird clusters like the "Chex Mix" one. Hierarchical and k-means both do a decent job of separating the foods into different categories, but from my visual inspection the results from the hierarchical model seem more "sensible" in that they match my intuition for what the clusters should look like.

All models have some sort of "abstract" name cluster, and all have at least one sweet cluster. The results from k-means show some redundant clustering and also further confirm my preference towards the hierarchical model.

Who wins?

Given the cluster results and my interpretations of them, hierarchical clustering wins for interpretability. But if I'm being honest, the real lesson here is that representation matters more than algorithm choice - all three embedding-based methods broadly agreed on what the clusters should look like, while bag-of-words failed regardless of which algorithm we threw at it.

That said, this approach is not without its quirks. Recipe names don't always reflect their content; take "California roll salad" for example, which lands with salads, not Japanese food. This makes sense when you think about it: the model is just reading the words, and "salad" is right there in the name. It has no way of knowing what a California roll actually is or what cuisine it belongs to. It just sees the word "salad" and groups accordingly.

This ties into a broader limitation with our embedding model. While 'all-MiniLM-L6-v2' is a powerful pre-trained model, it was trained on general English text rather than food-specific language. It understands that "chicken" and "beef" are related, sure, but it might not pick up on the nuances between something like "braised" and "stewed" the way a food-specific model might. A model fine-tuned on recipe or culinary data could potentially produce even tighter, more meaningful clusters.

Finally, we should acknowledge that our sample itself is doing some heavy lifting here. By filtering down to only recipes with tags in the top 300, we are inherently making our data more well-behaved. The clusters we see look pretty clean, but that's at least in part because we've already removed a lot of the noise and edge cases that would exist in the full 180,000+ recipe dataset. On a messier, more complete sample, our models would likely struggle more and the clusters would not look nearly as tidy.