Been working on hybrid search (lexical + vector) for a while and accidentally discovered something fun: when you use good embeddings, you can literally search with emojis.
Not as a gimmick - it actually works because the embedding model (BGE-M3, 1024 dimensions) learned semantic relationships between concepts and their emoji representations.
Try it yourself
These are live search engines running on real e-commerce data:
Type 🔑 (key emoji) → get actual keys: https://search.opensolr.com/dedeman?q=🔑
Type 🚲 (bike) → get bicycles and accessories: https://search.opensolr.com/dedeman?q=🚲
Type 🖨️📄 (printer + paper) → get printer supplies: https://search.opensolr.com/b2b?q=🖨️📄
This one's my favorite - type "cute domestic pet earrings" on a jewelry store: https://search.opensolr.com/rueb?q=cute+domestic+pet+earrings
(it finds cat and dog earrings even though the product titles are in a completely different language)
How it actually works
The pipeline is:
- Crawl website → extract text with Trafilatura
- Generate 1024D embeddings via BGE-M3
- Store in Solr with both text + vectors
- At query time: run lexical search + KNN vector search
- Combine scores (hybrid approach)
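The steps above can be sketched end-to-end in a few lines. This is a toy sketch, not the real implementation: `embed()` is a hash-based stand-in for BGE-M3 (deterministic but not semantic, and 8-D instead of 1024-D) so the code runs without the model, and the in-memory `index` list stands in for Solr.

```python
# Toy sketch of the crawl -> embed -> index -> hybrid-query pipeline.
# embed() is a stand-in for BGE-M3; the real setup uses 1024-D vectors.
import hashlib
import math

DIM = 8  # real setup: 1024

def embed(text: str) -> list[float]:
    """Stand-in for BGE-M3: hash-derived pseudo-embedding, L2-normalized."""
    h = hashlib.sha256(text.encode()).digest()
    v = [b / 255.0 for b in h[:DIM]]
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Index time: each crawled page becomes a doc with both text and a vector.
index = []

def add_doc(doc_id: str, text: str) -> None:
    index.append({"id": doc_id, "text": text, "vec": embed(text)})

# Query time: lexical match + cosine similarity, combined into one score.
def search(query: str, k: int = 5) -> list[str]:
    qv = embed(query)
    scored = []
    for doc in index:
        lexical = sum(t in doc["text"].lower() for t in query.lower().split())
        vector = sum(a * b for a, b in zip(qv, doc["vec"]))  # cosine (unit vecs)
        # Squash the unbounded lexical score into [0, 1) before summing,
        # so neither signal dominates the other.
        scored.append((vector + lexical / (lexical + 6), doc["id"]))
    return [d for _, d in sorted(scored, reverse=True)[:k]]

add_doc("p1", "Yale door key brass")
add_doc("p2", "Mountain bike 29 inch")
print(search("key"))
```

In the real system the lexical side is Solr's edismax scoring and the vector side is a KNN search over stored embeddings; the sketch just shows how the two signals meet in one ranked list.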
The emoji thing works because BGE-M3 was trained on massive multilingual text data - and emoji are just Unicode characters that show up in that text. The model learned that 🔑 and "key" and "Schlüssel" (German) and "cheie" (Romanian) all land close together in embedding space.
So when someone searches 🚲, the embedding is close to "bicycle", "bike", "Fahrrad", "bicicletă", etc.
The weird part
Cross-language search just... works. The Romanian e-commerce site has products in Romanian, but you can search in English or with emojis and it finds relevant stuff. No translation layer, no language detection preprocessing - the embeddings handle it.
Same with conceptual queries. "things to wear around neck" finds necklaces, pendants, chains - even though no product has "things to wear around neck" in the title.
Stack details for the curious
- Embeddings: BGE-M3 (BAAI), 1024 dimensions
- Inference: Running on RTX 4000 Ada, ~2-5ms per query
- Search: Solr 9.6 with dense vector support
- Crawling: Custom PHP + Python (Playwright for JS-heavy sites, Trafilatura for extraction)
- Extra features: VADER for sentiment, langid for language detection, custom price extraction
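For reference, Solr 9's dense vector support means declaring a DenseVectorField in the schema. A sketch of what that looks like - the field name `embeddings` matches the query shown later, but the exact type name and `cosine` similarity function here are assumptions, not the production schema:

```xml
<!-- Sketch of a Solr 9 schema fragment; type name and similarity are assumed -->
<fieldType name="knn_vector_1024" class="solr.DenseVectorField"
           vectorDimension="1024" similarityFunction="cosine"/>
<field name="embeddings" type="knn_vector_1024" indexed="true" stored="false"/>
```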
Query latency is ~40-50ms total including embedding generation.
Hybrid vs pure vector
Pure vector search is cool but has issues:
- Exact matches sometimes rank lower than "similar" results
- Product codes/SKUs get weird results
- Users expect "nike shoes" to prioritize exact Nike matches
Hybrid fixes this. Lexical handles exact matches, vectors handle the "I don't know the exact word but I know what I want" queries.
The Solr query can be seen in the debug view (bottom-right button), which shows the actual vector and lexical query functions:
vectorQuery = {!knn f=embeddings topK=250}[-0.032, 0.009, -0.049, ...]
lexicalQuery = {!edismax qf="title^550 description^450 uri^1 text^0.1"
pf="title^1100 description^900" ...}
q = {!func}sum(
product(1, query($vectorQuery)),
product(1, div(query($lexicalQuery), sum(query($lexicalQuery), 6)))
)
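The interesting part of that `{!func}` is the `div(q, sum(q, 6))` wrapper: the KNN score is already bounded, but edismax scores are unbounded, so the lexical score gets squashed into [0, 1) before the two are summed. A quick sketch of that math (the constant 6 is taken from the query above; the sample scores are made up):

```python
# Reimplementation of the score combination from the Solr function query:
#   combined = vector_score + lexical_score / (lexical_score + k)
# lexical/(lexical + k) maps any nonnegative lexical score into [0, 1),
# so an exact match can add at most ~1.0 on top of the vector score.
def combined_score(vector_score: float, lexical_score: float, k: float = 6.0) -> float:
    return vector_score + lexical_score / (lexical_score + k)

print(combined_score(0.82, 45.0))  # strong exact lexical match
print(combined_score(0.82, 0.0))   # pure semantic match, no lexical hit
```

The effect: a strong exact match reliably outranks a merely-similar result, but a document with zero lexical overlap can still surface on vector similarity alone - which is exactly the hybrid behavior described above.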
Bonus: AI-generated hints
Added an experimental feature where the search can explain results. Search "measure 🔥" on a technical documentation site and it tells you which specific device to use for measuring temperature/fire:
https://search.opensolr.com/fluke?q=measure+🔥
It pulls context from indexed PDFs and generates a recommendation using a local LLM running on the same GPU.
Anyway, thought some of you might find the emoji thing interesting. The cross-language aspect was unexpected - I didn't build it for that, it just emerged from using multilingual embeddings.
Happy to answer questions about the setup or hybrid search in general.