Been working on hybrid search (lexical + vector) for a while and accidentally discovered something fun: when you use good embeddings, you can literally search with emojis.
Not as a gimmick - it actually works because the embedding model (BGE-M3, 1024 dimensions) learned semantic relationships between concepts and their emoji representations.
Try it yourself
These are live search engines running on real e-commerce data:
Type 🔑 (key emoji) → get actual keys: https://search.opensolr.com/dedeman?q=🔑
Type 🚲 (bike) → get bicycles and accessories: https://search.opensolr.com/dedeman?q=🚲
Type 🖨️📄 (printer + paper) → get printer supplies: https://search.opensolr.com/b2b?q=🖨️📄
This one's my favorite - type "cute domestic pet earrings" on a jewelry store: https://search.opensolr.com/rueb?q=cute+domestic+pet+earrings
(it finds cat and dog earrings even though the product titles are in a completely different language)
How it actually works
The pipeline is:
- Crawl website → extract text with Trafilatura
- Generate 1024D embeddings via BGE-M3
- Store in Solr with both text + vectors
- At query time: run lexical search + KNN vector search
- Combine scores (hybrid approach)
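The steps above can be sketched end-to-end in a few lines. This is a toy sketch, not the real implementation: `embed()` is a hash-based stand-in for BGE-M3 (deterministic but not semantic, and 8-D instead of 1024-D) so the code runs without the model, and the in-memory `index` list stands in for Solr.

```python
# Toy sketch of the crawl -> embed -> index -> hybrid-query pipeline.
# embed() is a stand-in for BGE-M3; the real setup uses 1024-D vectors.
import hashlib
import math

DIM = 8  # real setup: 1024

def embed(text: str) -> list[float]:
    """Stand-in for BGE-M3: hash-derived pseudo-embedding, L2-normalized."""
    h = hashlib.sha256(text.encode()).digest()
    v = [b / 255.0 for b in h[:DIM]]
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Index time: each crawled page becomes a doc with both text and a vector.
index = []

def add_doc(doc_id: str, text: str) -> None:
    index.append({"id": doc_id, "text": text, "vec": embed(text)})

# Query time: lexical match + cosine similarity, combined into one score.
def search(query: str, k: int = 5) -> list[str]:
    qv = embed(query)
    scored = []
    for doc in index:
        lexical = sum(t in doc["text"].lower() for t in query.lower().split())
        vector = sum(a * b for a, b in zip(qv, doc["vec"]))  # cosine (unit vecs)
        # Squash the unbounded lexical score into [0, 1) before summing,
        # so neither signal dominates the other.
        scored.append((vector + lexical / (lexical + 6), doc["id"]))
    return [d for _, d in sorted(scored, reverse=True)[:k]]

add_doc("p1", "Yale door key brass")
add_doc("p2", "Mountain bike 29 inch")
print(search("key"))
```

In the real system the lexical side is Solr's edismax scoring and the vector side is a KNN search over stored embeddings; the sketch just shows how the two signals meet in one ranked list.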
The emoji thing works because BGE-M3 was trained on massive multilingual text data - and emoji are just Unicode characters that show up in that text. The model learned that 🔑 and "key" and "Schlüssel" (German) and "cheie" (Romanian) all land close together in embedding space.
So when someone searches 🚲, the embedding is close to "bicycle", "bike", "Fahrrad", "bicicletă", etc.
The weird part
Cross-language search just... works. The Romanian e-commerce site has products in Romanian, but you can search in English or with emojis and it finds relevant stuff. No translation layer, no language detection preprocessing - the embeddings handle it.
Same with conceptual queries. "things to wear around neck" finds necklaces, pendants, chains - even though no product has "things to wear around neck" in the title.
Stack details for the curious
- Embeddings: BGE-M3 (BAAI), 1024 dimensions
- Inference: Running on RTX 4000 Ada, ~2-5ms per query
- Search: Solr 9.6 with dense vector support
- Crawling: Custom PHP + Python (Playwright for JS-heavy sites, Trafilatura for extraction)
- Extra features: VADER for sentiment, langid for language detection, custom price extraction
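For reference, Solr 9's dense vector support means declaring a DenseVectorField in the schema. A sketch of what that looks like - the field name `embeddings` matches the query shown later, but the exact type name and `cosine` similarity function here are assumptions, not the production schema:

```xml
<!-- Sketch of a Solr 9 schema fragment; type name and similarity are assumed -->
<fieldType name="knn_vector_1024" class="solr.DenseVectorField"
           vectorDimension="1024" similarityFunction="cosine"/>
<field name="embeddings" type="knn_vector_1024" indexed="true" stored="false"/>
```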
Query latency is ~40-50ms total including embedding generation.
Hybrid vs pure vector
Pure vector search is cool but has issues:
- Exact matches sometimes rank lower than "similar" results
- Product codes/SKUs get weird results
- Users expect "nike shoes" to prioritize exact Nike matches
Hybrid fixes this. Lexical handles exact matches, vectors handle the "I don't know the exact word but I know what I want" queries.
The Solr query can be seen in the debug view (bottom-right button), which shows the actual vector and lexical query functions:
vectorQuery = {!knn f=embeddings topK=250}[-0.032, 0.009, -0.049, ...]
lexicalQuery = {!edismax qf="title^550 description^450 uri^1 text^0.1"
pf="title^1100 description^900" ...}
q = {!func}sum(
product(1, query($vectorQuery)),
product(1, div(query($lexicalQuery), sum(query($lexicalQuery), 6)))
)
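The interesting part of that `{!func}` is the `div(q, sum(q, 6))` wrapper: the KNN score is already bounded, but edismax scores are unbounded, so the lexical score gets squashed into [0, 1) before the two are summed. A quick sketch of that math (the constant 6 is taken from the query above; the sample scores are made up):

```python
# Reimplementation of the score combination from the Solr function query:
#   combined = vector_score + lexical_score / (lexical_score + k)
# lexical/(lexical + k) maps any nonnegative lexical score into [0, 1),
# so an exact match can add at most ~1.0 on top of the vector score.
def combined_score(vector_score: float, lexical_score: float, k: float = 6.0) -> float:
    return vector_score + lexical_score / (lexical_score + k)

print(combined_score(0.82, 45.0))  # strong exact lexical match
print(combined_score(0.82, 0.0))   # pure semantic match, no lexical hit
```

The effect: a strong exact match reliably outranks a merely-similar result, but a document with zero lexical overlap can still surface on vector similarity alone - which is exactly the hybrid behavior described above.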
Bonus: AI-generated hints
Added an experimental feature where the search can explain results. Search "measure 🔥" on a technical documentation site and it tells you which specific device to use for measuring temperature/fire:
https://search.opensolr.com/fluke?q=measure+🔥
It pulls context from indexed PDFs and generates a recommendation using a local LLM running on the same GPU.
Anyway, thought some of you might find the emoji thing interesting. The cross-language aspect was unexpected - I didn't build it for that, it just emerged from using multilingual embeddings.
Happy to answer questions about the setup or hybrid search in general.