<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: marplex</title>
    <description>The latest articles on DEV Community by marplex (@marplex).</description>
    <link>https://dev.to/marplex</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F851956%2F2b93d969-a66a-4888-92af-7c079d6a984b.jpg</url>
      <title>DEV Community: marplex</title>
      <link>https://dev.to/marplex</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/marplex"/>
    <language>en</language>
    <item>
      <title>Visually Multilingual: Introducing mcdse-2b</title>
      <dc:creator>marplex</dc:creator>
      <pubDate>Sun, 27 Oct 2024 14:09:10 +0000</pubDate>
      <link>https://dev.to/marplex/visually-multilingual-introducing-mcdse-2b-41gj</link>
      <guid>https://dev.to/marplex/visually-multilingual-introducing-mcdse-2b-41gj</guid>
      <description>&lt;p&gt;Today, I'm introducing a new experimental multilingual embedding model for flexible visual document retrieval. &lt;a href="https://huggingface.co/marco/mcdse-2b-v1" rel="noopener noreferrer"&gt;mcdse-2b-v1 (🤗)&lt;/a&gt; builds upon &lt;a href="https://huggingface.co/MrLight/dse-qwen2-2b-mrl-v1" rel="noopener noreferrer"&gt;MrLight/dse-qwen2-2b-mrl-v1&lt;/a&gt; and it is trained using the &lt;a href="https://arxiv.org/abs/2406.11251" rel="noopener noreferrer"&gt;DSE approach&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This model allows you to embed page/slide screenshots and query them using natural language. Whether it's tables, graphs, charts, schemas, images, or text, mcdse-2b-v1 encodes everything into a single embedding vector, eliminating the need for traditional OCR, document layout analysis, reading order detection, chunking, table/formula extraction, and so on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fio9u3kg1con21orp6hfh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fio9u3kg1con21orp6hfh.png" alt="image/png" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Strong metrics on 🇮🇹 Italian, 🇪🇸 Spanish, 🇬🇧 English, 🇫🇷 French and 🇩🇪 German&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Matryoshka Representation Learning:&lt;/strong&gt; embeddings can efficiently scale from 1536 to 256 dimensions. You can reduce the size 6x and still keep 95% of the embedding quality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Exceptional on binarization&lt;/strong&gt;: 768d binary vectors keep 99% retrieval quality of the base 1536d float vectors. Using binary vectors, you can encode &lt;strong&gt;100 million multilingual pages in just 10GB&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fast vLLM inference:&lt;/strong&gt; run inference on vLLM and efficiently serve embeddings at scale in production. Check the Deployment section to learn more.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My benchmarks aren't flawless, so &lt;strong&gt;I encourage you to test the model on your own data&lt;/strong&gt;. This is an early version with plenty of room for improvement. Even so, the results point to a strong multilingual retriever that adapts remarkably well to various memory/speed requirements.&lt;/p&gt;

&lt;h2&gt;Training&lt;/h2&gt;

&lt;p&gt;mcdse-2b is trained from &lt;a href="https://huggingface.co/MrLight/dse-qwen2-2b-mrl-v1" rel="noopener noreferrer"&gt;MrLight/dse-qwen2-2b-mrl-v1&lt;/a&gt; using low-rank adapters (&lt;a href="https://arxiv.org/abs/2106.09685" rel="noopener noreferrer"&gt;LoRA&lt;/a&gt;) on a multilingual corpus of documents. I have trained it on 8xRTX3090 using the &lt;a href="https://arxiv.org/abs/2406.11251" rel="noopener noreferrer"&gt;DSE&lt;/a&gt; approach with the following parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Epochs = 1&lt;/li&gt;
&lt;li&gt;Warmup ratio = 0.1&lt;/li&gt;
&lt;li&gt;Learning rate = 1e-5&lt;/li&gt;
&lt;li&gt;Optimizer = adamw_torch&lt;/li&gt;
&lt;li&gt;Schedule = linear&lt;/li&gt;
&lt;li&gt;Total batch size = 16&lt;/li&gt;
&lt;li&gt;LoRA
&lt;ul&gt;
&lt;li&gt;Alpha = 64&lt;/li&gt;
&lt;li&gt;R = 16&lt;/li&gt;
&lt;li&gt;Dropout = 0.1&lt;/li&gt;
&lt;li&gt;DoRA = True&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;/ul&gt;
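
&lt;p&gt;For readers who want to reproduce a similar setup, the adapter hyperparameters above might translate into a Hugging Face PEFT config roughly like this. This is a hedged sketch: the target modules are my assumption, not taken from the actual training code.&lt;/p&gt;

```python
# Hypothetical PEFT adapter config mirroring the hyperparameters above.
# target_modules is an assumption; the real training code may differ.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,              # LoRA rank (R = 16)
    lora_alpha=64,     # scaling factor (Alpha = 64)
    lora_dropout=0.1,  # Dropout = 0.1
    use_dora=True,     # weight-decomposed LoRA (DoRA = True)
    target_modules="all-linear",  # assumption, not from the source
)
```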

&lt;h3&gt;Dataset&lt;/h3&gt;

&lt;p&gt;The dataset comprises 24K PDF documents automatically scraped from the public internet. Random pages were extracted from each document, converted into compressed JPEG images, and filtered to remove blank pages and duplicates. The resulting page screenshots are unique and span a wide range of topics.&lt;/p&gt;

&lt;p&gt;I used gemini-1.5-flash-002 to generate queries based on each image. Gemini was instructed to come up with three types of queries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A broad topical query: summarizing the overall theme of the document.&lt;/li&gt;
&lt;li&gt;A specific detailed question: capturing subtle nuances within the content.&lt;/li&gt;
&lt;li&gt;A visual query: focusing on visual elements such as charts, graphs, images, or signatures.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The entire training and evaluation datasets were generated for just €2&lt;/strong&gt; (thanks, Gemini Flash!)&lt;/p&gt;

&lt;p&gt;Each image was then classified by its text density on a scale from 0 to 2. I used the &lt;a href="https://huggingface.co/omoured/YOLOv10-Document-Layout-Analysis" rel="noopener noreferrer"&gt;omoured YOLOv10n&lt;/a&gt; model, fine-tuned on DocLayNet, to detect areas such as figures versus text. Based on the proportions of these areas, I heuristically calculated the text density. I plan to use this classification to improve the model's performance on text-dense documents.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;0 = only visuals&lt;/li&gt;
&lt;li&gt;1 = a mix of visuals and text&lt;/li&gt;
&lt;li&gt;2 = only text&lt;/li&gt;
&lt;/ul&gt;
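
&lt;p&gt;As a rough illustration, such a density score can be derived from the detected region areas like this. This is my own sketch of the idea; the actual heuristic may differ.&lt;/p&gt;

```python
def text_density(regions):
    """Map detected layout regions to a 0-2 text-density score.

    regions: list of (label, area) pairs, e.g. from a DocLayNet-style
    layout detector. Illustrative heuristic, not the exact one used
    to build the dataset.
    """
    total = sum(area for _, area in regions)
    text_area = sum(area for label, area in regions if label == "text")
    fraction = text_area / max(total, 1e-9)
    # 0 = only visuals, 1 = a mix of visuals and text, 2 = only text
    return round(2 * fraction)
```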

&lt;p&gt;The eval and train datasets are not yet published. I'm very willing to open-source them, but I'm still unsure how to do it properly without violating any licenses. If you can help, please reach out!&lt;/p&gt;

&lt;h3&gt;Train Runs&lt;/h3&gt;

&lt;p&gt;The model was sequentially trained for each language in the following order:&lt;br&gt;
1) French: 6k samples&lt;br&gt;
2) Spanish: 6k samples&lt;br&gt;
3) Italian: 6k samples&lt;br&gt;
4) German: 6k samples&lt;/p&gt;

&lt;p&gt;This order was determined by the base model's retrieval performance in these languages, the first being the best performing. My intuition is that, given the small dataset, starting with the stronger languages could help balance overall improvements across the model.&lt;/p&gt;

&lt;p&gt;Before reaching this final checkpoint, I conducted multiple runs to test various strategies and validate some of my intuitions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Language order:&lt;/strong&gt; I swapped the order of the last two languages and found that training German last improved its performance &lt;em&gt;on evaluations&lt;/em&gt; by 1.7%, while maintaining similar scores across the other languages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model initialization:&lt;/strong&gt; I initialized the model with 10k mmarco pairs for each language. This resulted in worse performance across all languages, particularly with lower-dimensional embeddings. For example, French NDCG@5 using 512-dimensional embeddings dropped by 2% when trained with mmarco.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Different image resize algorithm:&lt;/strong&gt; I developed a custom resize function (&lt;code&gt;custom_resize&lt;/code&gt;) that strictly preserves the image's aspect ratio while scaling it down to fit within &lt;code&gt;min_pixels&lt;/code&gt; and &lt;code&gt;max_pixels&lt;/code&gt;. All evaluations used the standard resize function from &lt;a href="https://github.com/QwenLM/Qwen2-VL/blob/3bf7dbd7877892934bd7f8f4b00cd23cc2b35e4a/qwen-vl-utils/src/qwen_vl_utils/vision_process.py#L53" rel="noopener noreferrer"&gt;qwen_vl_utils&lt;/a&gt;. Models trained with the custom resize function outperformed the standard method, with an average +1.7% NDCG@5 improvement (1536 dimensions). It would be interesting to explore training a ColQwen model with this &lt;code&gt;custom_resize&lt;/code&gt; function.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resize function&lt;/th&gt;
&lt;th&gt;Avg&lt;/th&gt;
&lt;th&gt;English&lt;/th&gt;
&lt;th&gt;Italian&lt;/th&gt;
&lt;th&gt;Spanish&lt;/th&gt;
&lt;th&gt;French&lt;/th&gt;
&lt;th&gt;German&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;qwen2_vl_utils&lt;/td&gt;
&lt;td&gt;80.8&lt;/td&gt;
&lt;td&gt;80.2&lt;/td&gt;
&lt;td&gt;80.5&lt;/td&gt;
&lt;td&gt;79.6&lt;/td&gt;
&lt;td&gt;81&lt;/td&gt;
&lt;td&gt;82.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;custom_resize&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;82.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80.8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;81.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80.7&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;84.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;83.8&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+1.7%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+0.7%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+0.9%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+1.4%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+4.0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+1.4%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
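
&lt;p&gt;A minimal sketch of what such an aspect-ratio-preserving resize might compute. This is my reconstruction, not the exact custom_resize code; the pixel budgets mirror Qwen2-VL-style defaults and are an assumption.&lt;/p&gt;

```python
import math

def custom_resize_dims(width, height,
                       min_pixels=256 * 28 * 28,
                       max_pixels=960 * 28 * 28):
    """Return a target (width, height) whose area fits within
    [min_pixels, max_pixels] while strictly preserving the input
    aspect ratio. Illustrative reconstruction only; the pixel
    budgets are assumptions, not values from the source."""
    area = width * height
    # Clamp the target area into the budget, then scale both sides
    # uniformly so the aspect ratio never changes.
    target_area = max(min_pixels, min(max_pixels, area))
    scale = math.sqrt(target_area / area)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    return new_w, new_h
```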
&lt;h2&gt;Evaluations&lt;/h2&gt;

&lt;p&gt;Given the scarcity of public datasets for multilingual document image retrieval, the model has been evaluated on a custom-built dataset specifically designed to benchmark its performance across languages.&lt;/p&gt;

&lt;p&gt;This evaluation dataset was created with the same methodologies and pipelines as the training dataset. However, the document topics are generally different, and no images are shared between the training and evaluation datasets, to avoid any evaluation contamination. NDCG scores were calculated by running 100 unique queries against a 1K-document index for each language.&lt;/p&gt;
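
&lt;p&gt;For reference, when each query has exactly one relevant page (as when every query is generated from a single page screenshot), NDCG@5 reduces to a simple position-discounted score. This is a simplified sketch of the metric, not the author's eval code.&lt;/p&gt;

```python
import math

def ndcg_at_5(ranked_ids, relevant_id):
    """NDCG@5 under the single-relevant-document assumption:
    DCG = 1/log2(rank + 2) for a hit at 0-based rank, ideal DCG = 1."""
    try:
        rank = ranked_ids[:5].index(relevant_id)  # 0-based position in top 5
    except ValueError:
        return 0.0  # relevant page not retrieved in the top 5
    return 1.0 / math.log2(rank + 2)
```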
&lt;h3&gt;Matryoshka Representation Learning&lt;/h3&gt;

&lt;p&gt;This model is trained with Matryoshka Representation Learning (&lt;a href="https://arxiv.org/abs/2205.13147" rel="noopener noreferrer"&gt;MRL&lt;/a&gt;) on the following dimensions: 1536, 1024, 768, 512, 384, 256. The loss function used during training is calibrated to track performance across all these dimensions, leading the model to frontload the most important identifying information. This effectively allows you to shrink the embedding dimensions according to your scale and budget.&lt;/p&gt;
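
&lt;p&gt;In practice, using a smaller MRL dimension just means truncating the vector and re-normalizing it. A generic sketch of how Matryoshka embeddings are typically consumed, not code from this model's repository:&lt;/p&gt;

```python
import math

def truncate_embedding(vec, dim):
    """Shrink a Matryoshka embedding to its first `dim` components and
    L2-normalize again so cosine/dot-product search still behaves."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / max(norm, 1e-12) for x in head]
```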

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwd250hlapkcjp61bni1n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwd250hlapkcjp61bni1n.png" alt="average ndcg matryoshka float" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Average NDCG@5 for each dimension. Interestingly, the model shows improvements even in English, a language that wasn't included in the training set. It performs &lt;strong&gt;6% better at 256 dimensions&lt;/strong&gt;, with an overall improvement of 4% on average across all dimensions. Evaluations were conducted using FAISS with IndexFlatL2.&lt;/p&gt;
&lt;h4&gt;NDCG@5 (float)&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Average&lt;/th&gt;
&lt;th&gt;English&lt;/th&gt;
&lt;th&gt;Italian&lt;/th&gt;
&lt;th&gt;Spanish&lt;/th&gt;
&lt;th&gt;French&lt;/th&gt;
&lt;th&gt;German&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1536 dimensions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dse-qwen2-2b-mrl-v1&lt;/td&gt;
&lt;td&gt;79.5&lt;/td&gt;
&lt;td&gt;79.2&lt;/td&gt;
&lt;td&gt;80.2&lt;/td&gt;
&lt;td&gt;77.9&lt;/td&gt;
&lt;td&gt;80.6&lt;/td&gt;
&lt;td&gt;79.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mcdse-2b-v1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;82.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80.8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;81.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80.7&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;84.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;83.8&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+3.28%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+1.98%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+1.23%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+3.47%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+4.62%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+5.01%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1024 dimensions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dse-qwen2-2b-mrl-v1&lt;/td&gt;
&lt;td&gt;78.3&lt;/td&gt;
&lt;td&gt;78.8&lt;/td&gt;
&lt;td&gt;78.5&lt;/td&gt;
&lt;td&gt;76.5&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;td&gt;77.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mcdse-2b-v1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;81.7&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80.1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;84&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;84.3&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+4.23%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+1.75%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+2.12%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+4.49%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+4.76%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+8.07%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;768 dimensions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dse-qwen2-2b-mrl-v1&lt;/td&gt;
&lt;td&gt;77.8&lt;/td&gt;
&lt;td&gt;78.4&lt;/td&gt;
&lt;td&gt;78.3&lt;/td&gt;
&lt;td&gt;75.6&lt;/td&gt;
&lt;td&gt;80.8&lt;/td&gt;
&lt;td&gt;75.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mcdse-2b-v1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;81.1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;79.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;79.9&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;79.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;83.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;83.3&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+4.02%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+1.51%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+2.00%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+4.55%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+3.00%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+8.88%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;512 dimensions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dse-qwen2-2b-mrl-v1&lt;/td&gt;
&lt;td&gt;76.2&lt;/td&gt;
&lt;td&gt;77.6&lt;/td&gt;
&lt;td&gt;75.9&lt;/td&gt;
&lt;td&gt;73.1&lt;/td&gt;
&lt;td&gt;79.2&lt;/td&gt;
&lt;td&gt;75.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mcdse-2b-v1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;79.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;78.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;79.1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;75.8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;81.4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;81.7&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+3.91%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+1.15%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+4.05%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+3.56%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+2.70%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+7.96%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;384 dimensions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dse-qwen2-2b-mrl-v1&lt;/td&gt;
&lt;td&gt;75.7&lt;/td&gt;
&lt;td&gt;76.2&lt;/td&gt;
&lt;td&gt;75.5&lt;/td&gt;
&lt;td&gt;74.6&lt;/td&gt;
&lt;td&gt;78.4&lt;/td&gt;
&lt;td&gt;74&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mcdse-2b-v1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;78.8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;77.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;78.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;76.1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80.4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;81.4&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+3.86%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+1.68%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+3.82%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+1.97%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+2.49%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+9.09%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;256 dimensions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dse-qwen2-2b-mrl-v1&lt;/td&gt;
&lt;td&gt;73.5&lt;/td&gt;
&lt;td&gt;74.5&lt;/td&gt;
&lt;td&gt;73.6&lt;/td&gt;
&lt;td&gt;70.6&lt;/td&gt;
&lt;td&gt;74.8&lt;/td&gt;
&lt;td&gt;73.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mcdse-2b-v1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;78.1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;78.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;77.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;76.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80.1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;77.9&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+5.89%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+5.10%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+5.15%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+7.35%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+6.62%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+5.26%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;Binary Embeddings&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqj47sbqsumfmcm2se0o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqj47sbqsumfmcm2se0o.png" alt="average ndcg matryoshka binary" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;mcdse-2b-v1 clearly performs better on binarization, especially at lower dimensions. The model is &lt;strong&gt;23% better on 256 dimensions&lt;/strong&gt;, with an average improvement of 13% overall. Evaluations were conducted using FAISS with IndexBinaryFlat. But why are binary embeddings superior?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;NDCG@5&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Memory needed for 100M embeddings&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;dse-qwen2-2b-mrl-v1&lt;/strong&gt; (float16)&lt;/td&gt;
&lt;td&gt;79.5&lt;/td&gt;
&lt;td&gt;286 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;mcdse-2b-v1&lt;/strong&gt; (binary)&lt;/td&gt;
&lt;td&gt;80.6&lt;/td&gt;
&lt;td&gt;18 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This table shows that mcdse-2b-v1's &lt;strong&gt;binary embeddings are 1% better than the base model's 1536-dimensional float vectors&lt;/strong&gt; while reducing memory consumption by 16x. On top of that, binary embeddings can be searched roughly 40x faster with Hamming distance, since comparing two binary vectors takes just two CPU instructions (xor, popcnt).&lt;/p&gt;
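
&lt;p&gt;The binarize-and-search step behind these numbers can be sketched as follows. This is a generic illustration of sign-binarization and Hamming distance, not the exact evaluation code.&lt;/p&gt;

```python
import math

def binarize(vec):
    """Pack a float vector into one integer, one bit per dimension:
    1 where the component is non-negative. math.copysign treats 0.0
    as positive, which is a convention choice in this sketch."""
    packed = 0
    for x in vec:
        bit = int((math.copysign(1.0, x) + 1.0) / 2.0)
        packed = packed * 2 + bit
    return packed

def hamming(a, b):
    """Hamming distance between two packed binary embeddings:
    XOR, then count set bits (the xor + popcnt the text mentions)."""
    return bin(a ^ b).count("1")
```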
&lt;h4&gt;NDCG@5 (binary)&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Average&lt;/th&gt;
&lt;th&gt;English&lt;/th&gt;
&lt;th&gt;Italian&lt;/th&gt;
&lt;th&gt;Spanish&lt;/th&gt;
&lt;th&gt;French&lt;/th&gt;
&lt;th&gt;German&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1536 dimensions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dse-qwen2-2b-mrl-v1&lt;/td&gt;
&lt;td&gt;75.0&lt;/td&gt;
&lt;td&gt;75.8&lt;/td&gt;
&lt;td&gt;75.4&lt;/td&gt;
&lt;td&gt;72.4&lt;/td&gt;
&lt;td&gt;78.1&lt;/td&gt;
&lt;td&gt;73.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mcdse-2b-v1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;79.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;76.9&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;81.9&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;83.7&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80.8&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+6.93%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+4.65%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+1.95%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+11.60%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+6.69%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+9.41%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1024 dimensions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dse-qwen2-2b-mrl-v1&lt;/td&gt;
&lt;td&gt;72.2&lt;/td&gt;
&lt;td&gt;74.8&lt;/td&gt;
&lt;td&gt;71&lt;/td&gt;
&lt;td&gt;70.8&lt;/td&gt;
&lt;td&gt;74.6&lt;/td&gt;
&lt;td&gt;69.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mcdse-2b-v1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;79.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;78.4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;75.4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80.8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;82.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;79.5&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+9.05%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+4.59%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+5.84%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+12.38%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+9.69%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+12.45%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;768 dimensions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dse-qwen2-2b-mrl-v1&lt;/td&gt;
&lt;td&gt;70.1&lt;/td&gt;
&lt;td&gt;71.7&lt;/td&gt;
&lt;td&gt;69.3&lt;/td&gt;
&lt;td&gt;69.8&lt;/td&gt;
&lt;td&gt;73.7&lt;/td&gt;
&lt;td&gt;65.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mcdse-2b-v1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;78.8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;77.1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;75.4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;83&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;78.5&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+11.07%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+7.00%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+8.09%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+12.75%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+11.20%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+16.05%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;512 dimensions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dse-qwen2-2b-mrl-v1&lt;/td&gt;
&lt;td&gt;66.5&lt;/td&gt;
&lt;td&gt;70&lt;/td&gt;
&lt;td&gt;65.4&lt;/td&gt;
&lt;td&gt;63.7&lt;/td&gt;
&lt;td&gt;70.2&lt;/td&gt;
&lt;td&gt;63&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mcdse-2b-v1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;76.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;74.8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;74.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;77.7&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80.9&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;75.3&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+13.21%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+6.42%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+11.86%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+18.02%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+13.23%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+16.33%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;384 dimensions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dse-qwen2-2b-mrl-v1&lt;/td&gt;
&lt;td&gt;61.1&lt;/td&gt;
&lt;td&gt;62.7&lt;/td&gt;
&lt;td&gt;58.5&lt;/td&gt;
&lt;td&gt;58.6&lt;/td&gt;
&lt;td&gt;65.1&lt;/td&gt;
&lt;td&gt;60.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mcdse-2b-v1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;74.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;74.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;71.4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;77.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;75.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;73&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+17.67%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+15.84%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+18.07%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+24.09%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+13.43%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+16.71%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;256 dimensions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dse-qwen2-2b-mrl-v1&lt;/td&gt;
&lt;td&gt;54.3&lt;/td&gt;
&lt;td&gt;59&lt;/td&gt;
&lt;td&gt;56.5&lt;/td&gt;
&lt;td&gt;53.6&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;td&gt;49.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mcdse-2b-v1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;70.9&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;72.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;66.4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;73.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;72.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;69.2&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+23.31%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+18.73%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+14.91%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+27.07%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+27.00%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+28.32%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;ShiftProject&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://huggingface.co/datasets/vidore/shiftproject_test" rel="noopener noreferrer"&gt;vidore/shiftproject_test&lt;/a&gt; dataset is part of the ViDoRe Benchmark. It contains French queries and documents about the environment sourced from the &lt;a href="https://theshiftproject.org/" rel="noopener noreferrer"&gt;Shift Project&lt;/a&gt;. Queries were generated with Claude-3 Sonnet using a French translation of the same prompt used to generate queries for the scraped documents of &lt;a href="https://huggingface.co/datasets/vidore/colpali_train_set" rel="noopener noreferrer"&gt;vidore/colpali_train_set&lt;/a&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;ShiftProject (NDCG@5)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;dse-qwen2-2b-mrl-v1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80.8&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mcdse-2b-v1&lt;/td&gt;
&lt;td&gt;78.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;-2.80%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is the NDCG@5 on the ShiftProject dataset, with 1536 float dimensions and evaluated using at most 960 image patches. &lt;/p&gt;

&lt;p&gt;I expected mcdse-2b-v1 to score higher than the base model; instead, it's about 3% worse.&lt;br&gt;
The base model was trained on the &lt;a href="https://huggingface.co/datasets/vidore/colpali_train_set" rel="noopener noreferrer"&gt;colpali train set&lt;/a&gt;, so I thought it may have been over-optimized for "Claude-3 Sonnet like" queries. To investigate this, I regenerated the ShiftProject dataset queries using gemini-1.5-flash-002 and my prompts.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;ShiftProject_Gemini (NDCG@5)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;dse-qwen2-2b-mrl-v1&lt;/td&gt;
&lt;td&gt;67&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mcdse-2b-v1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;70.8&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+5.37%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The scores change wildly, but in this case mcdse-2b-v1 is 5% better. These results suggest two possible causes:&lt;/p&gt;

&lt;p&gt;1) The base model is over-optimized for "Claude-3 Sonnet like" queries&lt;br&gt;
2) My model is over-optimized for "gemini-1.5-flash-002 like" queries&lt;/p&gt;

&lt;p&gt;In both scenarios, I believe mcdse-2b-v1 has mitigated these over-optimizations by learning a broader query distribution.&lt;/p&gt;

&lt;p&gt;My generated Gemini queries come in two formats: questions and queries. The &lt;a href="https://huggingface.co/datasets/vidore/colpali_train_set" rel="noopener noreferrer"&gt;colpali_train_set&lt;/a&gt; generated queries are questions only. I also tested both models on just the Gemini queries and just the Gemini questions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnkuep5qgb49ub41glo6w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnkuep5qgb49ub41glo6w.png" alt="image/png" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;ShiftProject_GeminiQuestions (NDCG@5)&lt;/th&gt;
&lt;th&gt;ShiftProject_GeminiQueries (NDCG@5)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;dse-qwen2-2b-mrl-v1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;74.8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;58.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mcdse-2b-v1&lt;/td&gt;
&lt;td&gt;69.5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;63.5&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;-7.63%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+7.72%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The base model is 7% better on Gemini questions and 7% worse on Gemini queries. The average scores between queries and questions are nearly identical (66.7 and 66.5). This suggests that my model has mitigated the previously mentioned over-optimizations and is generally better at understanding a wider variety of queries. Training on more multilingual data will probably raise this average and eventually improve performance on ShiftProject.&lt;/p&gt;
&lt;h3&gt;
  
  
  Cohere Embed v3 Image
&lt;/h3&gt;

&lt;p&gt;I conducted some preliminary (and rushed) tests using the recently announced Cohere &lt;a href="https://docs.cohere.com/v2/changelog/embed-v3-is-multimodal" rel="noopener noreferrer"&gt;embed-multilingual-v3.0 multimodal&lt;/a&gt; embeddings on a smaller version of the English dataset. That model achieved an NDCG@5 score of 71, while mcdse-2b-v1 scored around 84. I'm working on more comprehensive evaluations for this model.&lt;/p&gt;

&lt;p&gt;&lt;a id="deployment"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Deployment
&lt;/h2&gt;

&lt;p&gt;On HuggingFace Transformers, you can expect to encode ~3 images/s on an RTX 3090 (35 TFLOPS) with a batch size of 32. A more common inference-side GPU like the RTX 4000 Ada should deliver roughly the same throughput.&lt;/p&gt;
&lt;h3&gt;
  
  
  vLLM
&lt;/h3&gt;

&lt;p&gt;vLLM officially supports Qwen2VL for generation only, so I added a new model class &lt;code&gt;Qwen2VLForEmbeddingGeneration&lt;/code&gt; to support embedding tasks. Running inference on vLLM should be ~5x faster than HuggingFace Transformers.&lt;/p&gt;
&lt;h4&gt;
  
  
  Download the new model class
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/marplex/mcdse &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;mcdse
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  Download mcdse-2b-v1 for local inference
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;huggingface_hub&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;snapshot_download&lt;/span&gt;
&lt;span class="nf"&gt;snapshot_download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repo_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;marco/mcdse-2b-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;local_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/model/mcdse-2b-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  Edit config.json
&lt;/h4&gt;

&lt;p&gt;Replace &lt;code&gt;Qwen2VLForConditionalGeneration&lt;/code&gt; with &lt;code&gt;Qwen2VLForEmbeddingGeneration&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s1"&gt;'s/Qwen2VLForConditionalGeneration/Qwen2VLForEmbeddingGeneration/g'&lt;/span&gt; /path/to/model/mcdse-2b-v1/config.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Check &lt;code&gt;vllm/main.py&lt;/code&gt; for local inference
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#vllm/main.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;qwen2_vl_dse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Qwen2VLForEmbeddingGeneration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_query_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_document_prompt&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vllm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ModelRegistry&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LLM&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;

&lt;span class="n"&gt;ModelRegistry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen2VLForEmbeddingGeneration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Qwen2VLForEmbeddingGeneration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/model/mcdse-2b-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;limit_mm_per_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Encode queries
&lt;/span&gt;&lt;span class="n"&gt;query_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_query_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Quali erano le passività totali al 31 dicembre 2017?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;multi_modal_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;]}})&lt;/span&gt;
&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="c1"&gt;#1536 dimensional embedding
&lt;/span&gt;
&lt;span class="c1"&gt;# Encode documents
&lt;/span&gt;&lt;span class="n"&gt;dummy_document_image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RGB&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;document_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_document_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dummy_document_image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;document_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;multi_modal_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;]}})&lt;/span&gt;
&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="c1"&gt;#1536 dimensional embedding
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
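
&lt;p&gt;Once queries and pages are encoded, retrieval is just a similarity search over the embedding vectors. A minimal sketch with NumPy, using toy vectors (in practice you'd plug in the 1536-dimensional outputs of &lt;code&gt;llm.encode&lt;/code&gt; above):&lt;/p&gt;

```python
import numpy as np

def top_k(query_emb, doc_embs, k=5):
    """Rank document embeddings by cosine similarity to the query."""
    q = np.asarray(query_emb, dtype=np.float32)
    d = np.asarray(doc_embs, dtype=np.float32)
    # Normalize so the dot product equals cosine similarity
    q /= np.linalg.norm(q)
    d /= np.linalg.norm(d, axis=1, keepdims=True)
    scores = d @ q
    idx = np.argsort(-scores)[:k]
    return idx.tolist(), scores[idx].tolist()

# Toy 4-dimensional "embeddings": document 0 matches the query exactly
idx, scores = top_k([1, 0, 0, 0], [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 0]], k=2)
print(idx)  # [0, 2]
```

For larger collections you would hand these vectors to a vector database, but the ranking logic stays the same.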



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This is my first time training a model; it was challenging but incredibly fun. I don't think I could have ever done this without the amazing work of the HuggingFace team and contributors. I also want to thank &lt;a href="https://twitter.com/ManuelFaysse" rel="noopener noreferrer"&gt;Manuel Faysse&lt;/a&gt;, &lt;a href="https://twitter.com/tonywu_71" rel="noopener noreferrer"&gt;Tony Wu&lt;/a&gt;, and the entire ViDoRe team for their work on &lt;a href="https://arxiv.org/abs/2407.01449" rel="noopener noreferrer"&gt;ColPali&lt;/a&gt;, and &lt;a href="https://x.com/xueguang_ma" rel="noopener noreferrer"&gt;Xueguang Ma&lt;/a&gt; for all his work on the Tevatron codebase and for training a very strong base model. I was also inspired by &lt;a href="https://x.com/bclavie" rel="noopener noreferrer"&gt;Benjamin Clavié&lt;/a&gt; and his impressive model announcements.&lt;/p&gt;

&lt;p&gt;I hope this model proves useful for your retrieval and RAG pipelines. As mentioned in the beginning, my benchmarks are far from perfect, and results in real-world scenarios may vary. I encourage you to test it on your own use cases. Overall, a significant advantage of visual retrieval is that you can scrap your complex indexing pipeline by simply embedding the page. This is the future!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>vectordatabase</category>
    </item>
    <item>
      <title>Hono on Azure Functions</title>
      <dc:creator>marplex</dc:creator>
      <pubDate>Wed, 08 May 2024 14:39:36 +0000</pubDate>
      <link>https://dev.to/marplex/hono-on-azure-functions-15g</link>
      <guid>https://dev.to/marplex/hono-on-azure-functions-15g</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/Marplex/hono-azurefunc-adapter" rel="noopener noreferrer"&gt;hono-azurefunc-adapter&lt;/a&gt; is one of the simplest yet incredibly useful js library I have ever written.&lt;/p&gt;

&lt;p&gt;Hono is a web application framework built on web standards. It's incredibly fast and lightweight. Because it's built using web standard APIs, the same code will run on multiple runtimes (Cloudflare, Fastly, Deno, Bun, AWS, or Node.js).&lt;/p&gt;

&lt;p&gt;For platforms that don't directly support web standards, Hono comes with adapters. For example, running in Node.js requires &lt;a href="https://github.com/honojs/node-server" rel="noopener noreferrer"&gt;an adapter&lt;/a&gt; that converts requests and responses into node types and objects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building the Hono adapter
&lt;/h2&gt;

&lt;p&gt;There are a lot of community-made adapters for running Hono in many more environments. Unfortunately, no one had ever made one for Azure Functions, so I decided to build it, free and open source.&lt;/p&gt;

&lt;p&gt;The entire library is just 54 lines of code.&lt;br&gt;
It's simple and maintainable, yet it lets you port APIs built with Hono to the powerful Azure Functions platform with minimal or no code rewrites.&lt;/p&gt;
&lt;h2&gt;
  
  
  Simplicity wins
&lt;/h2&gt;

&lt;p&gt;It is incredible to think of how many new possibilities this library unlocks with just 54 lines of code. &lt;/p&gt;

&lt;p&gt;It's true, simple things are always the most difficult.&lt;br&gt;
Although &lt;a href="https://github.com/Marplex/hono-azurefunc-adapter" rel="noopener noreferrer"&gt;hono-azurefunc-adapter&lt;/a&gt; now appears clean and concise, it took a while to get to this point. I spent a lot of time polishing, refactoring, and rethinking how to accomplish the same things with fewer lines of code. I had to dig deep into how the (partly documented) Azure Functions API works.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lgrapmkzfnxoxdz4s63.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lgrapmkzfnxoxdz4s63.png" alt="Hono github star history" width="800" height="571"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hono has rapidly become one of the most widely used frameworks for building JavaScript web APIs. Hats off to &lt;a href="https://github.com/yusukebe" rel="noopener noreferrer"&gt;yusukebe&lt;/a&gt; and all the other contributors! Now it's finally possible to run it on Azure Functions, effortlessly, with just &lt;code&gt;azureHonoHandler(honoApp.fetch)&lt;/code&gt;!&lt;/p&gt;
&lt;h2&gt;
  
  
  How to use
&lt;/h2&gt;

&lt;p&gt;It's very simple. Install &lt;a href="https://github.com/Marplex/hono-azurefunc-adapter" rel="noopener noreferrer"&gt;hono-azurefunc-adapter&lt;/a&gt; with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm i @marplex/hono-azurefunc-adapter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now create the http trigger for Azure Functions. &lt;code&gt;honoApp&lt;/code&gt; is your exported Hono application object.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;honoApp&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;./app&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;azureHonoHandler&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@marplex/hono-azurefunc-adapter&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@azure/functions&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;http&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;httpTrigger&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;methods&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;GET&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;DELETE&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;HEAD&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PATCH&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PUT&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;authLevel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;anonymous&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;{*proxy}&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;azureHonoHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;honoApp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it, you're done.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;There are some limitations and other things you should keep in mind when running Hono inside Azure Functions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Route Prefix&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The default Azure Functions route prefix is &lt;code&gt;/api&lt;/code&gt;. Be sure to start all your Hono routes with &lt;code&gt;/api&lt;/code&gt;, or change the default route prefix in &lt;code&gt;host.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"extensions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"http"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"routePrefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Crypto&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In Node &amp;lt;=18 environments, if you are using &lt;code&gt;hono/bearer-auth&lt;/code&gt; or any other library that uses crypto, be sure to define &lt;code&gt;global.crypto = require("crypto");&lt;/code&gt; before registering the http trigger.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Request signal&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Azure Functions does not expose any signal or event for listening to HTTP request interruptions. &lt;code&gt;c.req.raw.signal&lt;/code&gt; is useless; it's never aborted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I think Azure is one of the most trusted enterprise-ready cloud providers. By building &lt;a href="https://github.com/Marplex/hono-azurefunc-adapter" rel="noopener noreferrer"&gt;hono-azurefunc-adapter&lt;/a&gt;, I hope this will finally allow many to port the same popular Hono APIs to Azure Functions, especially for private enterprise needs.&lt;/p&gt;




&lt;p&gt;hono-azurefunc-adapter is available on NPM and GitHub Packages. This project is fully open source and MIT licensed, so do what you want! Contributions are welcome 🥳&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/Marplex/hono-azurefunc-adapter" rel="noopener noreferrer"&gt;https://github.com/Marplex/hono-azurefunc-adapter&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;NPM: &lt;a href="https://www.npmjs.com/package/@marplex/hono-azurefunc-adapter" rel="noopener noreferrer"&gt;https://www.npmjs.com/package/@marplex/hono-azurefunc-adapter&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub Package: &lt;a href="https://github.com/Marplex/hono-azurefunc-adapter/pkgs/npm/hono-azurefunc-adapter" rel="noopener noreferrer"&gt;https://github.com/Marplex/hono-azurefunc-adapter/pkgs/npm/hono-azurefunc-adapter&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>javascript</category>
      <category>serverless</category>
      <category>typescript</category>
      <category>azurefunctions</category>
    </item>
    <item>
      <title>Italian Laws Unigram Viewer on the Edge With Cloudflare Pages</title>
      <dc:creator>marplex</dc:creator>
      <pubDate>Mon, 17 Jul 2023 20:05:49 +0000</pubDate>
      <link>https://dev.to/marplex/italian-laws-unigram-viewer-on-the-edge-with-cloudflare-pages-18b9</link>
      <guid>https://dev.to/marplex/italian-laws-unigram-viewer-on-the-edge-with-cloudflare-pages-18b9</guid>
      <description>&lt;p&gt;Months ago I shared my Italian law mapping project, where I mapped 13K Italian laws and extracted relationships between them (&lt;a href="https://labs.marcocimolai.xyz/tessuto-normativo" rel="noopener noreferrer"&gt;labs.marcocimolai.xyz/tessuto-normativo&lt;/a&gt;). It went viral on Reddit, LinkedIn and has been covered by some of Italy's leading newspapers. Today, I will share my journey of building and deploying a "Google NGram viewer" for Italian laws.&lt;/p&gt;

&lt;p&gt;Let's start with the basic idea: you search for a word and the site returns how many Italian laws containing that word have been published for each year, from the Constitution to 2022.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmg36p5flbu80x5y6c6m9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmg36p5flbu80x5y6c6m9.png" alt="Google NGram Viewer" width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I didn't know much about search engines or information retrieval. So, as usual, my journey began with a Google search.&lt;/p&gt;

&lt;p&gt;That's where I found the well-known Apache Lucene. I started digging in and learning all about it. I discovered that there are many other optimizations and pre-processing steps that are essential for serving a search endpoint, and that there is a project called Solr that does all of this for me.&lt;/p&gt;

&lt;p&gt;Solr is a search engine built on top of Apache Lucene, and it comes with sane defaults and a handy HTTP API. The first part is indexing and processing the laws: I used the &lt;code&gt;pysolr&lt;/code&gt; client to loop through each law and add it to the index. The document format only contains the text (not stored) and the publication date (stored), which is all I need to reconstruct the term usage plot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pysolr&lt;/span&gt;
&lt;span class="n"&gt;solr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pysolr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Solr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8983/solr/norms&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;always_commit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;norms&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;solr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's true that I left every configuration at its default (so it may not be very optimized for my use case), but I didn't expect the process to be so fast and easy. Performing queries was also very straightforward: I just needed to retrieve the date field with no additional scoring.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lire&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;solr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%s&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fl&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rows&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;15000&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After processing this data with pandas, here's the result of the query. The Y axis represents how many norms containing the term "lire" were published with respect to the total count of published norms, that is &lt;em&gt;term_occurrences / total_norms&lt;/em&gt;.&lt;/p&gt;
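
&lt;p&gt;That normalization step can be sketched in a few lines of pandas. The numbers and yearly totals below are made up for illustration; in practice the dates come from the Solr results above:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical inputs: dates of the norms matching the query,
# and the total number of norms published each year
dates = ["1950-03-01", "1950-07-12", "1951-01-20"]
total_norms = pd.Series({1950: 400, 1951: 380})

# Count matches per year, then divide by the yearly totals
matches = pd.to_datetime(pd.Series(dates)).dt.year.value_counts()
usage = (matches / total_norms).fillna(0)  # term_occurrences / total_norms
print(usage.to_dict())  # 1950 -> 0.005, 1951 -> ~0.0026
```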

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5n2i6w3kjdaanybyyo1s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5n2i6w3kjdaanybyyo1s.png" alt="" width="565" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fschlec7pl5hzm9q6f3iq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fschlec7pl5hzm9q6f3iq.png" alt="" width="565" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Thinking on problems
&lt;/h3&gt;

&lt;p&gt;Thanks to Solr, I had just indexed Italian norms, performed queries, and retrieved the term usage graph: exactly what I needed. Well, not quite, because there are also some downsides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It introduces additional server costs and maintenance&lt;/li&gt;
&lt;li&gt;It's not easily scalable&lt;/li&gt;
&lt;li&gt;Solr does far more than what I need to achieve&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The previous tool was deployed entirely on Cloudflare Pages with no server compute. The graph is downloaded, opened and processed client side, hassle free and with zero deployment costs.&lt;/p&gt;

&lt;p&gt;With this new Solr architecture, I have to run and maintain an external server. In addition, this custom solution is not easily scalable, difficult to distribute, and as it stands, acts as a single point of failure. During high demand spikes (which were very common in my previous project), I doubted that Solr would be able to serve all those users.&lt;/p&gt;

&lt;p&gt;On top of that, Solr does some pre-processing on the words, such as lemmatization and stemming, and it stores word positions and multiple dates for the same year. Solr is great, but overkill for my needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution number two
&lt;/h2&gt;

&lt;p&gt;I need a simple inverted index without stemming/lemmatization, where documents are the pre-computed term usage graphs.&lt;/p&gt;

&lt;p&gt;The key part here is &lt;em&gt;pre-computed&lt;/em&gt;. In games, lighting has long been too expensive to compute in real time (though that's changing). That's why games have traditionally used &lt;em&gt;baked lighting&lt;/em&gt;: light is pre-calculated and applied to the world textures.&lt;/p&gt;

&lt;p&gt;My idea is similar: querying a large corpus of text is too demanding (in terms of computation, time, and cost) for real-time use. So I will do the search on the precomputed results and deliver them already &lt;em&gt;baked&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Building the inverted index
&lt;/h3&gt;

&lt;p&gt;An inverted index consists of two things: terms and documents. In this case, terms are words that appear in each law, and documents are usage distributions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7p2g1xjyed5cad9w1r1q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7p2g1xjyed5cad9w1r1q.png" alt="Inverted index structure, each term is associated to its usage graph" width="398" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I have started by extracting tokens from a sample law using &lt;code&gt;nltk&lt;/code&gt;. The tokens are processed and words are filtered (e.g. by removing stopwords or odd characters). As mentioned before, I don't need to do any stemming/lemmatization since I need to search for exact terms.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.corpus&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;stopwords&lt;/span&gt; 
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.tokenize&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;word_tokenize&lt;/span&gt; 
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;nltk&lt;/span&gt;

&lt;span class="n"&gt;stop_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stopwords&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;words&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;italian&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;#extract tokens from the norm text
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;word_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;word_tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;italian&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;word_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="c1"&gt;#remove special characters
&lt;/span&gt;    &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;process_word&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;#skip stop words
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stop_words&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="k"&gt;continue&lt;/span&gt;

    &lt;span class="c1"&gt;#skip invalid tokens (weird strings such as html)
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="nf"&gt;is_valid_token&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
      &lt;span class="k"&gt;continue&lt;/span&gt;

    &lt;span class="c1"&gt;#skip tokens that can't be stored in less than 32 bytes.
&lt;/span&gt;    &lt;span class="c1"&gt;#Note: remember this step for later
&lt;/span&gt;    &lt;span class="n"&gt;token_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt;

    &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;tokenize(norm)&lt;/code&gt; function returns the list of filtered tokens extracted from the input text. The next step is to build the inverted index.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;norm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;urn:nir:stato:legge:1967-03-09;150.txt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1967&lt;/span&gt;

&lt;span class="n"&gt;inverted_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

&lt;span class="c1"&gt;#Only take unique tokens.
# If a word occurs just once in the norm, it's counted in the final result.
&lt;/span&gt;&lt;span class="n"&gt;unique_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;unique_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;#Get the count of this token
&lt;/span&gt;  &lt;span class="n"&gt;freq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inverted_index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;

  &lt;span class="c1"&gt;#Add +1 to the count
&lt;/span&gt;  &lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

  &lt;span class="c1"&gt;#Assign it back to the inverted index
&lt;/span&gt;  &lt;span class="n"&gt;inverted_index&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;freq&lt;/span&gt;

&lt;span class="c1"&gt;#The sorted keys of the inverted index is our vocabulary
&lt;/span&gt;&lt;span class="n"&gt;vocab&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inverted_index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code is a simplified version of the final result. Here the year is fixed, but the final code can also update the index at any given year.&lt;/p&gt;
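&lt;p&gt;As a sketch, the per-year extension could look like this (the &lt;code&gt;index_norms&lt;/code&gt; helper and the simplified tokenizer are illustrative, not the actual production code; it assumes norms are processed in chronological order):&lt;/p&gt;

```python
#Hypothetical extension of the loop above: index many norms across years.
#The simplified tokenizer stands in for the article's tokenize().
def tokenize(text):
    return text.lower().split()

def index_norms(norms):
    """norms: iterable of (text, year) pairs, sorted by year."""
    inverted_index = {}
    for text, year in norms:
        for token in set(tokenize(text)):
            freq = inverted_index.setdefault(token, [])
            #Bump the count for this year, or start a new [year, count] pair
            if freq and freq[-1][0] == year:
                freq[-1][1] += 1
            else:
                freq.append([year, 1])
    return inverted_index
```

&lt;p&gt;For example, indexing two 1967 norms that both contain "lavoro" maps that token to &lt;code&gt;[[1967, 2]]&lt;/code&gt;.&lt;/p&gt;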

&lt;p&gt;With the code above, I extracted 407 unique tokens from &lt;code&gt;legge:1967-03-09;150&lt;/code&gt;. Here are some examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;periodo&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ciascun&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;termine&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;incarichi&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;liquidazione&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;esso&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;qualsiasi&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;agrarie&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;leggi&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🧊 Freezing the index
&lt;/h3&gt;

&lt;p&gt;I am able to process norms and create inverted indexes, but the results are only available in RAM and in Python data structures (dictionaries and lists). To use this index online, I need to export it to a file. I call this the freezing part.&lt;/p&gt;

&lt;p&gt;The simplest solution would be to serialize the dictionary into JSON, then have the user download it and start looking up terms offline. This way I don't have to maintain any servers and everything is automatically hosted and distributed by Cloudflare or some other CDN.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbx557l3sue8qea6h1peq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbx557l3sue8qea6h1peq.png" alt="representation of the proposed flow, from python dictionary, to json, to javascript hashmap" width="538" height="138"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unfortunately, there are multiple problems with this approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JSON is a text format and produces files that are far too large, with many repeated bytes; we mostly need to store numbers.&lt;/li&gt;
&lt;li&gt;The entire JSON file has to be parsed before any search can be performed&lt;/li&gt;
&lt;li&gt;Users download every document, even though they will probably not search all of them.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I came up with a better solution that is more efficient and separates the index from the documents. The idea is to have an &lt;code&gt;index.bin&lt;/code&gt; and a &lt;code&gt;documents.bin&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;index.bin&lt;/code&gt; file contains the ordered list of tokens (the vocabulary). Each token is stored in 32 bytes, padded with trailing zeros if necessary. The most important property is that every record is byte-aligned, which makes it possible to binary search the vocabulary without reading or parsing the entire file.&lt;/p&gt;
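&lt;p&gt;Writing and searching such a fixed-width index could look like this in Python (a sketch; the file layout matches the description above, but the helper names are mine, not the production script's):&lt;/p&gt;

```python
import bisect

TOKEN_SIZE = 32  #each record is a token padded with trailing zeros

def freeze_index(vocab, path):
    """Write the sorted vocabulary as fixed-width 32-byte records."""
    with open(path, 'wb') as f:
        for token in vocab:
            f.write(token.encode('utf-8').ljust(TOKEN_SIZE, b'\x00'))

class RecordView:
    """Expose index.bin as a random-access sequence of 32-byte records,
    so bisect can binary search it without reading the whole file."""
    def __init__(self, f):
        f.seek(0, 2)  #seek to the end to count records
        self.f, self.length = f, f.tell() // TOKEN_SIZE
    def __len__(self):
        return self.length
    def __getitem__(self, i):
        self.f.seek(i * TOKEN_SIZE)
        return self.f.read(TOKEN_SIZE)

def find_token(path, token):
    """Return the token's position in the vocabulary, or -1 if absent."""
    target = token.encode('utf-8').ljust(TOKEN_SIZE, b'\x00')
    with open(path, 'rb') as f:
        view = RecordView(f)
        pos = bisect.bisect_left(view, target)
        if pos != len(view) and view[pos] == target:
            return pos
    return -1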

&lt;p&gt;The &lt;code&gt;documents.bin&lt;/code&gt; file contains the term usage, each stored in 152 bytes. Why 152? Because I want to analyze norms from the Constitution (1947) to the present day (2023), exactly 76 years. Each year's term count is an unsigned 2 byte integer. So 2 bytes times 76 years equals 152 bytes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj7vqjsmov5m1zannv7xh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj7vqjsmov5m1zannv7xh.png" alt="representation of the new solution, building two files index and documents" width="607" height="204"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This way of storing documents consumes a lot of space (many term usage distributions contain a lot of zeros) but again, it is byte aligned and can be easily indexed by reading at an offset.&lt;/p&gt;

&lt;p&gt;In production, I decided to index 53,036 laws. This resulted in a vocabulary size of 196,082 tokens, a 6MB index and a 28MB documents file.The entire script ran in about 16 minutes.&lt;/p&gt;

&lt;p&gt;The compressed &lt;code&gt;index.bin&lt;/code&gt; is about 800kb. The compressed &lt;code&gt;documents.bin&lt;/code&gt; is about 1.8MB. This is not bad when you consider that the original Solr index took up more than 100 MB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;________________________________________________________&lt;/span&gt;
&lt;span class="s"&gt;Executed in   16.75 mins    fish           external&lt;/span&gt;
   &lt;span class="s"&gt;usr time  727.07 secs    0.00 micros  727.07 secs&lt;/span&gt;
   &lt;span class="s"&gt;sys time    4.71 secs  779.00 micros    4.71 secs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;To recap, storing the inverted index in two parts and in binary format is way better because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users will only download the index file (smaller and can be compressed with gzip/brotli)&lt;/li&gt;
&lt;li&gt;No need to read and parse the entire index to perform search&lt;/li&gt;
&lt;li&gt;Documents are retrieved on-demand, &lt;em&gt;only the parts that users need&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As I said before, these two files will be distributed by a CDN. But how can the client download only  a tiny part of the &lt;code&gt;documents.bin&lt;/code&gt; file? After all, CDNs only serve static files and don't perform any kind of computation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Leveraging HTTP functionalities
&lt;/h3&gt;

&lt;p&gt;Introducing Range headers. Not every server supports it (GitHub does), but basically it allows you to download specific parts of a file. This is mainly used for watching MP4 videos on the web (without having to use HLS or DASH). It's useful because you don't have to download the whole file just to watch a tiny part of it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;Range&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;byteFrom&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;byteTo&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This suits my needs perfectly. Finally, the whole system can be divided into three steps:&lt;/p&gt;

&lt;p&gt;1) Perform binary search on &lt;code&gt;index.bin&lt;/code&gt;&lt;br&gt;
2) Find the offset&lt;br&gt;
3) Download the specific &lt;code&gt;documents.bin&lt;/code&gt; part with Range header&lt;/p&gt;

&lt;h2&gt;
  
  
  More problems, thank you CORS
&lt;/h2&gt;

&lt;p&gt;Before using this solution with Range headers, I tested if GitHub was accepting them, and it was. I also checked to see if it was serving files with the &lt;code&gt;Access-Control-Allow-Origin: *&lt;/code&gt; header (to access it from other domains), and it was. I did this with Postman and everything worked as expected.&lt;/p&gt;

&lt;p&gt;I only found the problem when I tried to run the app with files on GitHub (not from a local server). Little do I know, browsers do not just only look at domain origin to allow cross origin requests. During the CORS preflight, they also check for headers, methods, credentials and other options. In particular, GitHub does not accept any other additional header, including Range :(&lt;/p&gt;

&lt;h2&gt;
  
  
  ➗ Divide and conquer
&lt;/h2&gt;

&lt;p&gt;Again, I had to find another solution, without resorting to hosting the files on my server.&lt;/p&gt;

&lt;p&gt;I've decided to split the documents.bin file into chunks. By choosing the number of chunks, I can reduce the load on the client to a reasonable amount. Too many chunks and the user has a higher chance of typing words that are in different files, too few and the user waits longer to download larger files.&lt;/p&gt;

&lt;p&gt;I decided to split it into 10 parts, each of which weighs about 200kb compressed. The client knows what files to download simply by doing &lt;code&gt;word_position / ( 196082 / 10)&lt;/code&gt;, where &lt;code&gt;196082&lt;/code&gt; is the vocabulary size.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzt35e5krxojv2e3ipq4t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzt35e5krxojv2e3ipq4t.png" alt="One index file indexes into multiple chunks of the documents" width="607" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Many ups and one big down
&lt;/h2&gt;

&lt;p&gt;After all this long ping-pong between solutions and problems, I think I've found the (almost) perfect solution. It has numerous advantages and reduces maintenance and costs. The only major drawback is that it only searches for single words (unigrams).&lt;/p&gt;

&lt;p&gt;I made a conscious decision to store usage graphs without considering word positions, which limits the ability to search for sequences of words (phrases). Adding this feature would have significantly increased the index and document size, making it challenging to deliver to clients.&lt;/p&gt;

&lt;p&gt;I can easily make changes to the Python script to include n-grams, which include bigrams (two-word combinations). However, bigrams are less unique than unigrams because they are simply combinations of two single words. As a result, they are more sparse and diverse, resulting in much larger indexes.&lt;/p&gt;

&lt;p&gt;When I tried to extract bigrams from 53036 norms, the vocabulary grew to 38 million bigrams, resulting in a 293MB &lt;code&gt;index.bin&lt;/code&gt; and a 696MB &lt;code&gt;documents.bin&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I'm still exploring new methods for searching phrases with static files and no server computation, as this remains an ongoing area of development.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This was like a challenge, trying to reduce cost and maintenance time as much as possible. I wanted to push myself to new limits, building custom and "low level" stuff with technologies I'd never really understood before.&lt;/p&gt;

&lt;p&gt;So this is the complete architecture. One &lt;code&gt;index.bin&lt;/code&gt; and 10 chunked documents files are enough to perform search over 50,000 Italian laws.&lt;/p&gt;

&lt;p&gt;You can view the final result here &lt;a href="https://labs.marcocimolai.xyz/term-trend" rel="noopener noreferrer"&gt;labs.marcocimolai.xyz/term-trend&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdbrknax3lhinqx13lp4f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdbrknax3lhinqx13lp4f.png" alt="screenshot of the final result webpage" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>python</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Add some snow in your WPF apps</title>
      <dc:creator>marplex</dc:creator>
      <pubDate>Mon, 16 Jan 2023 21:44:53 +0000</pubDate>
      <link>https://dev.to/marplex/add-some-snow-in-your-wpf-apps-3dck</link>
      <guid>https://dev.to/marplex/add-some-snow-in-your-wpf-apps-3dck</guid>
      <description>&lt;p&gt;I always loved how Telegram changes its style during Christmas and winter. And I wanted it too on some of WPF apps that I maintain.&lt;/p&gt;

&lt;p&gt;So I started building &lt;a href="https://github.com/Marplex/WpfSnowfall" rel="noopener noreferrer"&gt;WpfSnowfall&lt;/a&gt;, a WPF snowfall user control. It is super simple to use and fully customizable, it even comes with different types of snowflakes (what a feature)!&lt;/p&gt;

&lt;p&gt;You can use it to add some detail and quality on your apps, adding a touch of snow during winter times.&lt;/p&gt;

&lt;h2&gt;
  
  
  Now, show me the code!
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1) Import &lt;code&gt;WpfSnowfall&lt;/code&gt; from NuGet&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dotnet add package WpfSnowfall &lt;span class="nt"&gt;--version&lt;/span&gt; 1.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2) Profit&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;sf:Snowfall&lt;/span&gt;
    &lt;span class="na"&gt;EmissionRate=&lt;/span&gt;&lt;span class="s"&gt;"5"&lt;/span&gt;
    &lt;span class="na"&gt;Fill=&lt;/span&gt;&lt;span class="s"&gt;"White"&lt;/span&gt;
    &lt;span class="na"&gt;ScaleFactor=&lt;/span&gt;&lt;span class="s"&gt;"1.1"&lt;/span&gt;
    &lt;span class="na"&gt;OpacityFactor=&lt;/span&gt;&lt;span class="s"&gt;"1"&lt;/span&gt;
    &lt;span class="na"&gt;ParticleSpeed=&lt;/span&gt;&lt;span class="s"&gt;"1"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it! You can configure the snowflake color, opacity, speed, size and amount.&lt;/p&gt;




&lt;h2&gt;
  
  
  Behind the scenes
&lt;/h2&gt;

&lt;p&gt;Under the hood, snowflakes are rendered as vectors. They are animated separately using the good old Storyboard and animations from &lt;code&gt;System.Windows.Media.Animation&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Basically, here's the simple version of the entire user control:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;//Initial snowflake transform&lt;/span&gt;
&lt;span class="n"&gt;RotateTransform&lt;/span&gt; &lt;span class="n"&gt;rotateTransform&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rotateAmount&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;ScaleTransform&lt;/span&gt; &lt;span class="n"&gt;scaleTransform&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;TranslateTransform&lt;/span&gt; &lt;span class="n"&gt;translateTransform&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;initialX&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;initialY&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;//Spawn snowflake&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;flake&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Snowflake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Generate&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;flake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RenderTransform&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;TransformGroup&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Children&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;TransformCollection&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;rotateTransform&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scaleTransform&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;translateTransform&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="n"&gt;Children&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flake&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;//Create transform animations&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;xAnimation&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;GenerateAnimation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xAmount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flake&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"RenderTransform.Children[2].X"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;yAnimation&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;GenerateAnimation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yAmount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flake&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"RenderTransform.Children[2].Y"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;rotateAnimation&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;GenerateAnimation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rotateAmount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flake&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"RenderTransform.Children[0].Angle"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;//Start the animations&lt;/span&gt;
&lt;span class="n"&gt;Storyboard&lt;/span&gt; &lt;span class="n"&gt;story&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;story&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Children&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xAnimation&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;story&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Children&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yAnimation&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;story&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Children&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rotateAnimation&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;flake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Loaded&lt;/span&gt; &lt;span class="p"&gt;+=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sender&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;story&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Begin&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;//Remove snowflake when animation stops&lt;/span&gt;
&lt;span class="n"&gt;story&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Completed&lt;/span&gt; &lt;span class="p"&gt;+=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sender&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Children&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Remove&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flake&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;WpfSnowfall is available on GitHub and licensed under the MIT license, so do whatever you want (or leave a star 😀)!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Marplex/WpfSnowfall" rel="noopener noreferrer"&gt;https://github.com/Marplex/WpfSnowfall&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>brightdatachallenge</category>
      <category>devchallenge</category>
      <category>challenge</category>
      <category>support</category>
    </item>
    <item>
      <title>How I Built Skillbit: Linktree, but for Your Skills</title>
      <dc:creator>marplex</dc:creator>
      <pubDate>Thu, 01 Sep 2022 15:59:03 +0000</pubDate>
      <link>https://dev.to/marplex/how-i-built-skillbit-linktree-but-for-your-skills-2fd9</link>
      <guid>https://dev.to/marplex/how-i-built-skillbit-linktree-but-for-your-skills-2fd9</guid>
      <description>&lt;p&gt;I read a lot of personal portfolio pages, and almost all of them had the classical "My Skills" section. I wanted to give this opportunity to everyone, that's why I've decided to build Skillbit: the easiest and fastest way to have your personal "My Skills" section on the internet.&lt;/p&gt;

&lt;p&gt;It's easier if you see it in action: skillb.it/marplex&lt;/p&gt;

&lt;p&gt;Despite my previous experience working on "indie apps" (a few years ago I built &lt;a href="https://dreambox.one" rel="noopener noreferrer"&gt;dreambox.one&lt;/a&gt;, an AI-assisted Android dream journal), Skillbit was my very first project built entirely for the web.&lt;/p&gt;

&lt;p&gt;Skillbit is a Remix React app that runs on Cloudflare Pages, written in TypeScript. I chose this tech stack because it seemed hassle-free and I wanted to test the real capabilities of running apps on the edge.&lt;/p&gt;

&lt;p&gt;The overall structure of the app is pretty simple, but there were quite a few components to set up (easily, fortunately).&lt;/p&gt;

&lt;p&gt;If there's a phrase that can describe the entire architecture, it's probably:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Minimum effort, maximum effect&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzoswaw1byxf0f2vukvzk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzoswaw1byxf0f2vukvzk.png" alt="Skillbit architecture" width="800" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Database
&lt;/h2&gt;

&lt;p&gt;First things first: the database. I used PostgreSQL, as it is extremely versatile and lets you write complex in-database functions.&lt;/p&gt;

&lt;p&gt;Since Cloudflare can only communicate with external services through HTTP(S), I had to expose my database with an API.&lt;/p&gt;

&lt;h2&gt;
  
  
  API
&lt;/h2&gt;

&lt;p&gt;A PostgREST server lets Cloudflare Workers connect and talk to the database. It was super easy to set up: I configured the required roles/permissions, and that was it.&lt;/p&gt;

&lt;h2&gt;
  
  
  postgrest-js
&lt;/h2&gt;

&lt;p&gt;I used &lt;a href="https://github.com/supabase/postgrest-js" rel="noopener noreferrer"&gt;postgrest-js&lt;/a&gt; to communicate with my PostgREST endpoint. The library is easy to use and does everything for you.&lt;/p&gt;

&lt;p&gt;Unfortunately, it is not well suited to complex queries. In those cases, I simply called database functions that encapsulate the complex flows (login, registration, adding new skills, ...).&lt;/p&gt;
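&lt;p&gt;For context, PostgREST exposes each database function as a POST endpoint under /rpc/. As a rough sketch (the base URL and the add_skill function here are hypothetical, just for illustration), such a call looks like this:&lt;/p&gt;

```typescript
// Sketch only: builds the request shape PostgREST expects for calling
// a database function. Base URL and function name are made up.
function buildRpcRequest(baseUrl: string, fn: string, args: object) {
  return {
    url: baseUrl + "/rpc/" + fn,   // PostgREST RPC convention
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(args),    // function arguments go in the JSON body
  };
}

// e.g. calling a hypothetical add_skill() function defined in the database
const req = buildRpcRequest("https://api.example.com", "add_skill", {
  user_id: 42,
  skill: "typescript",
});
```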

&lt;h2&gt;
  
  
  User management
&lt;/h2&gt;

&lt;p&gt;This was my biggest concern before starting to build Skillbit. Managing users and coding every authentication flow is a pain in the a**.&lt;/p&gt;

&lt;p&gt;Following my &lt;em&gt;"Minimum effort, maximum effect"&lt;/em&gt; principle, &lt;strong&gt;Firebase seemed an obvious choice&lt;/strong&gt;. In fact, that's what I ended up using.&lt;/p&gt;

&lt;p&gt;Of course, nothing is as easy as it seems. It turns out that the Firebase JS SDK does not work on Cloudflare Workers, only on Node.js.&lt;/p&gt;

&lt;p&gt;After hours of trying to solve this problem, in the midst of my desperation, I finally decided to build a wrapper around the Firebase REST APIs and package it as a JavaScript library.&lt;/p&gt;

&lt;p&gt;Although I had found a solution, I thought that creating this library was not in accordance with my principle... so why not make it open source and maybe save other people some time?&lt;/p&gt;

&lt;p&gt;And that's exactly what I did; you can view &lt;a href="https://github.com/Marplex/flarebase-auth" rel="noopener noreferrer"&gt;flarebase-auth&lt;/a&gt; on my GitHub profile. I also made a post that explains more about the inner workings of this library.&lt;/p&gt;


&lt;div class="ltag__link"&gt;
  &lt;a href="/marplex" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F851956%2F2b93d969-a66a-4888-92af-7c079d6a984b.jpg" alt="marplex"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="/marplex/firebase-authentication-on-cloudflare-workers-24o3" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Firebase Authentication on Cloudflare Workers&lt;/h2&gt;
      &lt;h3&gt;marplex ・ Jul 26 '22&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#firebase&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#javascript&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#typescript&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#serverless&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


&lt;h2&gt;
  
  
  React and Remix
&lt;/h2&gt;

&lt;p&gt;This was my first time using React and my first time using Remix. I just have to say that it is a joy to develop applications with these technologies. They are easy to learn, and everything seems to work the first time; it's a magical feeling.&lt;/p&gt;

&lt;p&gt;If you want to know more, I made two posts about my first time experience with React and Remix.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://levelup.gitconnected.com/i-changed-my-mind-on-react-js-4ecf4b73e14" rel="noopener noreferrer"&gt;I Changed My Mind on React.JS&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://javascript.plainenglish.io/one-month-with-remix-and-react-ba3659c299a2" rel="noopener noreferrer"&gt;One Month With Remix and React&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Give it a try!
&lt;/h2&gt;

&lt;p&gt;Now that you know how it's built, give it a try and create your Skillbit! Of course, we are developers; we find bugs everywhere. If you find one, please report it to me and I'll fix it (hopefully).&lt;/p&gt;

&lt;p&gt;Oh, and did I mention that Skillbit is completely free?&lt;/p&gt;

&lt;p&gt;I don't want to monetize this project, at least for now. Even though I might do it in the future, I would still do it ethically.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://skillb.it" rel="noopener noreferrer"&gt;https://skillb.it&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://skillb.it/maplex" rel="noopener noreferrer"&gt;https://skillb.it/marplex&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>webdev</category>
      <category>react</category>
      <category>showdev</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Foundation, Isaac Asimov and Software Engineering</title>
      <dc:creator>marplex</dc:creator>
      <pubDate>Fri, 26 Aug 2022 15:17:02 +0000</pubDate>
      <link>https://dev.to/marplex/foundation-isaac-asimov-and-software-engineering-3kd</link>
      <guid>https://dev.to/marplex/foundation-isaac-asimov-and-software-engineering-3kd</guid>
      <description>&lt;p&gt;The world of Isaac Asimov’s Foundation is incredible. He made predictions that, at first glance, seems very off and unlikely. But when you start to notice the bigger picture, you realize how these predictions are built on a foundation that is now real and solid.&lt;/p&gt;

&lt;p&gt;The fact that the universe depicted is so futuristic and so connected to today’s reality fascinates me. Isaac Asimov is a tremendous thinker; I admire all his work and his astounding long-term thinking. He is a true genius.&lt;/p&gt;

&lt;p&gt;Portable nuclear energy, memory degradation, using data to predict actions, the role of religion, economy and knowledge… It feels so fictional and so real at the same time.&lt;/p&gt;

&lt;p&gt;At first sight, this is probably unrelated to software engineering and programming. But when you look closer, Foundation is actually a great source of insights that we can apply as developers.&lt;/p&gt;

&lt;p&gt;This book is an incredible opportunity to learn, it’s like you traveled to the future and then came back to the present with precious knowledge.&lt;/p&gt;

&lt;h1&gt;
  
  
  Research and development
&lt;/h1&gt;

&lt;p&gt;Energy is an important part of the story, like many other elements, it is a fundamental resource for the success of the civilization. Foundation specifically talks about nuclear energy, but it’s totally different from what we have today.&lt;/p&gt;

&lt;p&gt;Thanks to research and development, nuclear reactors will be small, efficient and portable. And because of that, all electronic appliances will embed them as the main source of power. We will see nuclear washing machines, nuclear ovens, nuclear fridges….&lt;/p&gt;

&lt;p&gt;Continuous improvement transforms products as they become better and better. The key takeaway here is to think about what something could be, not what it is.&lt;/p&gt;

&lt;p&gt;This fits well with the current blockchain/cryptocurrency situation. Right now it is still unstable, slow and inefficient. If the fundamentals are solid, research and development will refine this raw technology and turn it into what we initially envisioned. So, in order to succeed, we just need three things: a great and solid idea, optimism, and time.&lt;/p&gt;

&lt;h1&gt;
  
  
  Preserving knowledge
&lt;/h1&gt;

&lt;p&gt;When they say “knowledge is power”, it really is. Isaac Asimov shows us how knowledge can be used to maintain or create power. He also shows us the importance of not losing it, not forgetting it.&lt;/p&gt;

&lt;p&gt;A lack of documentation and history preservation can completely erase our knowledge. What was once common sense and taken for granted will become myth and mystical magic, a legend.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9mvog3n2ahn3dj0xpfk.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9mvog3n2ahn3dj0xpfk.jpeg" alt="Image description" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That’s why writing software documentation is important. Our superpower, as humans, is working together. If we want to keep doing this, we must keep track of our knowledge, even if it seems trivial. It’s like a long-term investment; it will help in the future.&lt;/p&gt;

&lt;h1&gt;
  
  
  Data science
&lt;/h1&gt;

&lt;p&gt;Psychohistory is a fictional science in Isaac Asimov’s Foundation universe which combines history, sociology, and mathematical statistics to make general predictions about the future behavior of very large groups of people.&lt;/p&gt;

&lt;p&gt;For me, this is Asimov’s biggest and most fascinating “prediction”. If anything, he was even a bit conservative. Making predictions with AI and statistical models is now the hottest topic in tech. Right now, as I’m writing, we’re probably already using this “science”. I think marketing is what will become psychohistory. We already use data, statistics, sociology and psychology to create ad copy, drive people’s choices and predict market (large groups of people) outcomes.&lt;/p&gt;

&lt;p&gt;Again, the key takeaway here is that collecting and using data gives power. This data has to be stored and kept for the future. Any type of information will become, someday, valuable.&lt;/p&gt;

&lt;p&gt;That's why collecting logs &amp;amp; crashes and monitoring usage and user behavior (possibly anonymized) is important to keep heading in the right direction. Without this type of data, it's almost impossible to know where to focus our development and how to improve the product.&lt;/p&gt;

&lt;h1&gt;
  
  
  The butterfly effect
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5vuj7cm7xgqk7rda2rt.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5vuj7cm7xgqk7rda2rt.jpg" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the point that may summarize the entire book. There are a lot of huge time skips in Foundation, and this makes centuries feel like days. You become aware of how little things become big and how important stuff becomes irrelevant.&lt;/p&gt;

&lt;p&gt;This is the power of the butterfly effect. If you see your choices in this perspective, you will see that anything can happen. What sticks longer are the first principles, the foundations.&lt;/p&gt;

&lt;p&gt;That’s why I consider these two mental models (long-term and first-principles thinking) very powerful during decision making. &lt;/p&gt;

&lt;p&gt;Little choices at the beginning of our development journey can become strong blockers or incredible features. It's safer to work in "cycles" to make better decisions (TDD, Agile, ...).&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Isaac Asimov is a genius. He came up with these concepts and ideas in 1942, 80 years ago! It's really fascinating to understand how concepts and best practices that we use every day to develop software were already forged almost a century ago.&lt;/p&gt;

&lt;p&gt;Before watching the Apple TV series, I urge you to read the original Foundation books. They better explore all the connections, implications, causes and effects. I hope this 80-year-old science fiction universe brings you knowledge and wisdom.&lt;/p&gt;

</description>
      <category>writing</category>
      <category>computerscience</category>
      <category>sciencefiction</category>
    </item>
    <item>
      <title>Firebase Authentication on Cloudflare Workers</title>
      <dc:creator>marplex</dc:creator>
      <pubDate>Tue, 26 Jul 2022 09:40:23 +0000</pubDate>
      <link>https://dev.to/marplex/firebase-authentication-on-cloudflare-workers-24o3</link>
      <guid>https://dev.to/marplex/firebase-authentication-on-cloudflare-workers-24o3</guid>
      <description>&lt;p&gt;Firebase is super easy to use. The provided SDK is available for almost every language and platform. The one that is currently missing is the Admin SDK for the web.&lt;/p&gt;

&lt;p&gt;Actually, it is available for JavaScript, but it's built to run on Node. Some environments don't support this platform and instead rely on standard Web APIs.&lt;/p&gt;

&lt;p&gt;One of these is Cloudflare Workers. If you try to use the Node Admin SDK on these workers, it simply won't work because of missing libraries.&lt;/p&gt;

&lt;p&gt;The point is that I desperately needed it for my current personal project. I started searching the Internet for an existing solution... but nothing, zero results.&lt;/p&gt;

&lt;p&gt;So, I decided to build my own library.&lt;/p&gt;

&lt;h2&gt;
  
  
  Say hello to flarebase-auth
&lt;/h2&gt;

&lt;p&gt;As you noticed from the name of the library, it only covers the authentication part.&lt;/p&gt;

&lt;p&gt;I used standard Web APIs such as fetch() and WebCrypto. The most common thing I had to do was JWT token generation/validation. I worked with the &lt;a href="https://github.com/panva/jose" rel="noopener noreferrer"&gt;jose&lt;/a&gt; library (the only dependency in the project) because it is cross-platform and also works with the WebCrypto API.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;flarebase-auth&lt;/code&gt; is quite simple and lives mainly in two files: &lt;code&gt;google-oauth.ts&lt;/code&gt; and &lt;code&gt;flarebase-auth.ts&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;google-oauth.ts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All code related to validating and generating Google OAuth 2.0 tokens is written inside this file. Since almost every request has to be authenticated, I've used this quite extensively.&lt;/p&gt;

&lt;p&gt;Generating an OAuth 2.0 token is a two-step process. First, you sign a JWT with your Google service account private key. Then, you pass this JWT to &lt;code&gt;https://oauth2.googleapis.com/token&lt;/code&gt; and retrieve the access token. The process is implemented in the &lt;a href="https://github.com/Marplex/flarebase-auth/blob/2e3fa6705f7b053ed39ef4fe16dbf9d118fa6f15/src/lib/google-oauth.ts#L10-L45" rel="noopener noreferrer"&gt;&lt;code&gt;getAuthToken()&lt;/code&gt;&lt;/a&gt; method.&lt;/p&gt;
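&lt;p&gt;For reference, the JWT signed in the first step carries a small claim set defined by Google's service-account flow: the service account email as issuer, the token endpoint as audience, a scope, and a lifetime of at most one hour. A minimal sketch of building it (the email and scope values below are only examples):&lt;/p&gt;

```typescript
// Sketch of the claim set signed in step one, before it is handed to a
// signing library like jose. Values follow Google's service-account flow.
function buildServiceAccountClaims(serviceAccountEmail: string, scope: string) {
  const iat = Math.floor(Date.now() / 1000); // issued-at, in seconds
  return {
    iss: serviceAccountEmail,                   // who is requesting the token
    scope: scope,                               // which APIs the token may call
    aud: "https://oauth2.googleapis.com/token", // the token endpoint itself
    iat: iat,
    exp: iat + 3600,                            // max lifetime: one hour
  };
}

// Example values, not real credentials
const claims = buildServiceAccountClaims(
  "my-sa@my-project.iam.gserviceaccount.com",
  "https://www.googleapis.com/auth/identitytoolkit"
);
```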

&lt;p&gt;&lt;strong&gt;flarebase-auth.ts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where the actual core library lives. The goal is to implement every method that you would normally use with &lt;code&gt;getAuth()&lt;/code&gt; in the Firebase Admin SDK.&lt;/p&gt;

&lt;p&gt;Right now, I've written just these methods, as they are sufficient to build a basic login/sign-up system: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;createSessionCookie()&lt;/li&gt;
&lt;li&gt;verifySessionCookie()&lt;/li&gt;
&lt;li&gt;signInWithEmailAndPassword()&lt;/li&gt;
&lt;li&gt;signUpWithEmailAndPassword()&lt;/li&gt;
&lt;li&gt;changePassword()&lt;/li&gt;
&lt;li&gt;lookupUser()&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Using the library
&lt;/h2&gt;

&lt;p&gt;You may wonder, how can I use it? Here's an example, let's start by creating the FlarebaseAuth instance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;FlarebaseAuth&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;flarebase-auth&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;auth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FlarebaseAuth&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Firebase api key&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Firebase project id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;privateKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Firebase private key or service account private key&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;serviceAccountEmail&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Firebase service account email&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you're ready to do the real stuff! For example, here's how you can sign in users with email and password.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;//Sign in with username and password&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;signInWithEmailAndPassword&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;my@email.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;supersecurepassword&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;userEmail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;email&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;refreshToken&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;token&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;refreshToken&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The library is tested using a dummy Firebase project with a dummy user. Later I discovered that there's a Firebase Authentication Emulator that was made specifically for debugging purposes.&lt;br&gt;
Right now, I'll stick with the test Firebase project and continue implementing other methods. If you want to add this feature, you're more than welcome to create a pull request!&lt;/p&gt;

&lt;p&gt;&lt;code&gt;flarebase-auth&lt;/code&gt; also supports caching: you can use &lt;code&gt;CloudflareKv&lt;/code&gt; to automatically store OAuth 2.0 tokens until expiration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;FlarebaseAuth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;CloudflareKv&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;flarebase-auth&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;auth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FlarebaseAuth&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Firebase api key&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Firebase project id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;privateKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Firebase private key or service account private key&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;serviceAccountEmail&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Firebase service account email&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

  &lt;span class="na"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CloudflareKv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;NAMESPACE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
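&lt;p&gt;The snippet below is a hypothetical sketch, not CloudflareKv's real interface, but it shows the idea behind the cache: serve the stored token while it is still valid, and fall back to a fresh request once it expires.&lt;/p&gt;

```typescript
// Hypothetical sketch of "store tokens until expiration".
// Not the real CloudflareKv API; an in-memory Map stands in for KV.
class TokenCache {
  private store = new Map();

  // the clock is injectable so expiry is easy to simulate
  constructor(private now: () => number = Date.now) {}

  get(key: string): string | null {
    const entry = this.store.get(key);
    if (!entry) return null;                              // never cached
    if (entry.expiresAt > this.now()) return entry.value; // still valid
    return null;                                          // expired: refetch
  }

  put(key: string, value: string, ttlSeconds: number) {
    this.store.set(key, { value, expiresAt: this.now() + ttlSeconds * 1000 });
  }
}

let fakeNow = 0;
const cache = new TokenCache(() => fakeNow);
cache.put("oauth-token", "token-abc", 3600); // OAuth tokens last one hour

const fresh = cache.get("oauth-token"); // "token-abc", served from cache
fakeNow = 3601 * 1000;
const stale = cache.get("oauth-token"); // null, time to call Google again
```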



&lt;h2&gt;
  
  
  Next steps for &lt;code&gt;flarebase-auth&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Although I’m now successfully using this library for my current project, there are still a lot of improvements and new features to implement. Here’s a list of things I want to add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extend caching support for public keys (token validation)&lt;/li&gt;
&lt;li&gt;Implement sendEmailVerification()&lt;/li&gt;
&lt;li&gt;Implement confirmEmailVerification()&lt;/li&gt;
&lt;li&gt;Implement deleteAccount()&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;flarebase-auth&lt;/code&gt; is available on &lt;a href="https://www.npmjs.com/package/@marplex/flarebase-auth" rel="noopener noreferrer"&gt;NPM&lt;/a&gt; and &lt;a href="https://github.com/Marplex/flarebase-auth/packages/1517813" rel="noopener noreferrer"&gt;GitHub Packages&lt;/a&gt;. This project is fully open source and MIT licensed, so do whatever you want! Contributions are welcome 🥳&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/Marplex/flarebase-auth" rel="noopener noreferrer"&gt;https://github.com/Marplex/flarebase-auth&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://firebase.google.com/docs/reference/rest/auth" rel="noopener noreferrer"&gt;Firebase Auth REST API documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>firebase</category>
      <category>javascript</category>
      <category>typescript</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Only one open-source project can be saved for future humanity. Which one would you choose?</title>
      <dc:creator>marplex</dc:creator>
      <pubDate>Wed, 04 May 2022 17:10:04 +0000</pubDate>
      <link>https://dev.to/marplex/only-one-open-source-project-can-be-saved-for-future-humanity-which-one-would-you-choose-14c6</link>
      <guid>https://dev.to/marplex/only-one-open-source-project-can-be-saved-for-future-humanity-which-one-would-you-choose-14c6</guid>
      <description>&lt;p&gt;My answer? Git.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>opensource</category>
    </item>
    <item>
      <title>LiveData: Bringing the best of Android to .NET</title>
      <dc:creator>marplex</dc:creator>
      <pubDate>Mon, 25 Apr 2022 15:06:41 +0000</pubDate>
      <link>https://dev.to/marplex/livedata-bringing-the-best-of-android-to-net-f14</link>
      <guid>https://dev.to/marplex/livedata-bringing-the-best-of-android-to-net-f14</guid>
      <description>&lt;p&gt;I’ve been building Android apps for years; the development experience and community built around it is fantastic. There are a lot of open source libraries and projects from where you can learn from. Thanks to Android Jetpack and Google pushing MVVM design pattern adoption, almost every app follows the same rules and uses the same robust core libraries.&lt;br&gt;
Microsoft vs Google&lt;/p&gt;

&lt;p&gt;I can’t say the same for .NET and Microsoft. When I started working on WPF apps, I immediately felt “uncomfortable”. The community is smaller and there are far fewer open source projects; Microsoft suggests using MVVM, but you often need to break the pattern because some controls and classes are not built for it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3x22oqpruws74s6bzjmo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3x22oqpruws74s6bzjmo.png" alt="" width="640" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For years, Microsoft and its closed-source philosophy slowed down the evolution of the .NET ecosystem. Now, after shifting towards a more "open" approach, Microsoft is trying to rebuild a strong developer community around C# and the .NET framework. They're releasing more open source libraries ("CommunityToolkit" clearly signals the new strategy) and extending support to other platforms such as Linux. Nevertheless, Microsoft is still years behind the Android developer experience.&lt;/p&gt;
&lt;h2&gt;
  
  
  Notify in the multiverse of madness
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnx8rc0vbtunawo5q9yim.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnx8rc0vbtunawo5q9yim.png" alt="Doctor Strange notifying every property" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When I started to develop WPF apps and write my first view model, I encountered what I call the "Notify madness" problem. Let me explain: have a look at this view model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ViewModel&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ObservableObject&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

  &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Marco"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;Name&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;get&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;SetProperty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ref&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;value&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I find it absurd that you need six lines of code just to define a single notifiable property. Less than a year ago, Microsoft released &lt;a href="https://devblogs.microsoft.com/ifdef-windows/windows-community-toolkit-7-1-preview-release/" rel="noopener noreferrer"&gt;MvvmToolkit 7.1 Preview&lt;/a&gt;, which finally introduced source generators. Things are much better now: you just add an attribute to a field and all of that code is generated for you.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;partial&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ViewModel&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ObservableObject&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ObservableProperty&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That said, this feature only arrived less than a year ago, at a point when Android had long had LiveData and Kotlin Flow, and code generators had been used there for years to cut boilerplate.&lt;/p&gt;

&lt;p&gt;Another aspect I didn’t like about building view models was mapped properties. Every time the source property changed, you had to remember to notify every other property that depended on it. With the latest MvvmToolkit releases you can use codegen with &lt;code&gt;[AlsoNotifyChangeFor]&lt;/code&gt; to notify dependent properties automatically. But mapped properties should update by themselves: I don’t want to keep adding (and keep forgetting) the plumbing that propagates new values.&lt;/p&gt;
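&lt;p&gt;For reference, here is roughly what that looks like with the toolkit’s codegen (a sketch based on the MvvmToolkit preview attributes; the &lt;code&gt;Greeting&lt;/code&gt; property is a made-up example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;public partial class ViewModel : ObservableObject {

  // Setting Name also raises PropertyChanged for Greeting
  [ObservableProperty]
  [AlsoNotifyChangeFor(nameof(Greeting))]
  private string name;

  public string Greeting =&amp;gt; $"Hello {Name}!";

}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It works, but every dependent property has to be listed by hand on the source field, which is exactly the kind of bookkeeping that gets forgotten.&lt;/p&gt;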

&lt;p&gt;That’s why I’ve taken inspiration from &lt;a href="https://developer.android.com/topic/libraries/architecture/livedata" rel="noopener noreferrer"&gt;Android LiveData&lt;/a&gt; and built a similar library for .NET. And because I always find creative names, I’ve called it…&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynro3plcs5uupw34fzxn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynro3plcs5uupw34fzxn.png" alt="LiveData library cover image" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Bringing the best of Android to .NET&lt;/h2&gt;

&lt;p&gt;I built LiveData to simplify both plain and mapped properties, with first-class support for async operations. Anyway:&lt;/p&gt;

&lt;p&gt;Talk is cheap. Show me the code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LiveDataViewModel&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="n"&gt;LiveData&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Marco"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="n"&gt;LiveData&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;HelloMessage&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="nf"&gt;LiveDataViewModel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;HelloMessage&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;$"Hello &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;!"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, every property is defined in a single line, and the internal Value is notified automatically when it changes. Mapped properties are notified automatically too, so you can no longer ship an unfinished or broken UI just because you forgot to call notify().&lt;/p&gt;
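&lt;p&gt;To make that concrete, here is a usage sketch (assuming &lt;code&gt;LiveData&amp;lt;T&amp;gt;&lt;/code&gt; exposes a settable &lt;code&gt;Value&lt;/code&gt; property that raises change notifications, as the view model above suggests):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;var vm = new LiveDataViewModel();

// Update the source property...
vm.Name.Value = "Anna";

// ...and the mapped property follows along, notifying any bound UI.
// vm.HelloMessage.Value is now "Hello Anna!"; no manual notify() call needed.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;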

&lt;p&gt;Most of the time you will find yourself dealing with async tasks (like if you’re using &lt;a href="https://github.com/reactiveui/refit" rel="noopener noreferrer"&gt;Refit&lt;/a&gt;). LiveData automatically transforms async functions into bindable properties. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;//Map a string "SearchQuery" into an asynchronously retrieved list of users&lt;/span&gt;
&lt;span class="n"&gt;Users&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SearchQuery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;MapAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;SearchUsers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="c1"&gt;//Convert an async function to a LiveData&lt;/span&gt;
&lt;span class="n"&gt;LiveData&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;IsVisible&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncTask&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ToLiveData&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a single line, async functions can be mapped and transformed into LiveData objects that automatically update the UI. Finally, you can chain all of these transformations to create complex reactive properties in just a few lines of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;//Concatenate transformation functions&lt;/span&gt;
&lt;span class="n"&gt;FinalLiveData&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Debounce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;800&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"Hello"&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To recap, here are all the advantages of using LiveData:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One-line notifiable properties&lt;/li&gt;
&lt;li&gt;You don’t have to remember to notify after every change&lt;/li&gt;
&lt;li&gt;Mapped properties are notified automatically&lt;/li&gt;
&lt;li&gt;Seamless async Task support&lt;/li&gt;
&lt;li&gt;Easily create complex reactive properties&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;What’s next for LiveData&lt;/h2&gt;

&lt;p&gt;Although I’ve used LiveData in multiple projects, there are still a lot of improvements and new features to implement. Here’s a list of things I want to add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Better lifecycle management&lt;/li&gt;
&lt;li&gt;Better exception support&lt;/li&gt;
&lt;li&gt;Custom thread pools for async functions&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Links&lt;/h2&gt;

&lt;p&gt;LiveData is free, open source, and licensed under MIT. Contributions are welcome 🥳&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/Marplex/LiveData" rel="noopener noreferrer"&gt;https://github.com/Marplex/LiveData&lt;/a&gt;&lt;br&gt;
NuGet: &lt;a href="https://www.nuget.org/packages/LiveData/" rel="noopener noreferrer"&gt;https://www.nuget.org/packages/LiveData/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dotnet</category>
      <category>csharp</category>
      <category>wpf</category>
    </item>
  </channel>
</rss>
