<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: marplex</title>
    <description>The latest articles on DEV Community by marplex (@marplex).</description>
    <link>https://dev.to/marplex</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F851956%2F2b93d969-a66a-4888-92af-7c079d6a984b.jpg</url>
      <title>DEV Community: marplex</title>
      <link>https://dev.to/marplex</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/marplex"/>
    <language>en</language>
    <item>
      <title>Visually Multilingual: Introducing mcdse-2b</title>
      <dc:creator>marplex</dc:creator>
      <pubDate>Sun, 27 Oct 2024 14:09:10 +0000</pubDate>
      <link>https://dev.to/marplex/visually-multilingual-introducing-mcdse-2b-41gj</link>
      <guid>https://dev.to/marplex/visually-multilingual-introducing-mcdse-2b-41gj</guid>
      <description>&lt;p&gt;Today, I'm introducing a new experimental multilingual embedding model for flexible visual document retrieval. &lt;a href="https://huggingface.co/marco/mcdse-2b-v1" rel="noopener noreferrer"&gt;mcdse-2b-v1 (🤗)&lt;/a&gt; builds upon &lt;a href="https://huggingface.co/MrLight/dse-qwen2-2b-mrl-v1" rel="noopener noreferrer"&gt;MrLight/dse-qwen2-2b-mrl-v1&lt;/a&gt; and it is trained using the &lt;a href="https://arxiv.org/abs/2406.11251" rel="noopener noreferrer"&gt;DSE approach&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This model allows you to embed page/slide screenshots and query them using natural language. Whether it's tables, graphs, charts, schemas, images, or text, mcdse-2b-v1 encodes everything into a single embedding vector, eliminating the need for traditional OCR, document layout analysis, reading order detection, chunking, table/formula extraction, and so on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fio9u3kg1con21orp6hfh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fio9u3kg1con21orp6hfh.png" alt="image/png" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Strong metrics on 🇮🇹 Italian, 🇪🇸 Spanish, 🇬🇧 English, 🇫🇷 French and 🇩🇪 German&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Matryoshka Representation Learning:&lt;/strong&gt; embeddings can efficiently scale from 1536 to 256 dimensions. You can reduce the size 6x and still keep 95% of the embedding quality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Exceptional on binarization&lt;/strong&gt;: 768d binary vectors keep 99% retrieval quality of the base 1536d float vectors. Using binary vectors, you can encode &lt;strong&gt;100 million multilingual pages in just 10GB&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fast vLLM inference:&lt;/strong&gt; run inference on vLLM and efficiently serve embeddings at scale in production. Check the Deployment section to learn more.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My benchmarks aren't flawless, so &lt;strong&gt;I encourage you to test the model on your own data&lt;/strong&gt;. This is an early version with plenty of room for improvement. Even so, the results point to a strong multilingual retriever that adapts remarkably well to various memory/speed requirements.&lt;/p&gt;

&lt;h2&gt;Training&lt;/h2&gt;

&lt;p&gt;mcdse-2b is trained from &lt;a href="https://huggingface.co/MrLight/dse-qwen2-2b-mrl-v1" rel="noopener noreferrer"&gt;MrLight/dse-qwen2-2b-mrl-v1&lt;/a&gt; using low-rank adapters (&lt;a href="https://arxiv.org/abs/2106.09685" rel="noopener noreferrer"&gt;LoRA&lt;/a&gt;) on a multilingual corpus of documents. I have trained it on 8xRTX3090 using the &lt;a href="https://arxiv.org/abs/2406.11251" rel="noopener noreferrer"&gt;DSE&lt;/a&gt; approach with the following parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Epochs = 1&lt;/li&gt;
&lt;li&gt;Warmup ratio = 0.1&lt;/li&gt;
&lt;li&gt;Learning rate = 1e-5&lt;/li&gt;
&lt;li&gt;Optimizer = adamw_torch&lt;/li&gt;
&lt;li&gt;Schedule = linear&lt;/li&gt;
&lt;li&gt;Total batch size = 16&lt;/li&gt;
&lt;li&gt;LoRA
&lt;ul&gt;
&lt;li&gt;Alpha = 64&lt;/li&gt;
&lt;li&gt;R = 16&lt;/li&gt;
&lt;li&gt;Dropout = 0.1&lt;/li&gt;
&lt;li&gt;DoRA = True&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;/ul&gt;
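
&lt;p&gt;For readers who want to reproduce a similar setup, the adapter hyperparameters above might translate into a Hugging Face PEFT config roughly like this. This is a hedged sketch: the target modules are my assumption, not taken from the actual training code.&lt;/p&gt;

```python
# Hypothetical PEFT adapter config mirroring the hyperparameters above.
# target_modules is an assumption; the real training code may differ.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,              # LoRA rank (R = 16)
    lora_alpha=64,     # scaling factor (Alpha = 64)
    lora_dropout=0.1,  # Dropout = 0.1
    use_dora=True,     # weight-decomposed LoRA (DoRA = True)
    target_modules="all-linear",  # assumption, not from the source
)
```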

&lt;h3&gt;Dataset&lt;/h3&gt;

&lt;p&gt;The dataset comprises 24K PDF documents automatically scraped from the public internet. Random pages were extracted from each document, converted into compressed JPEG images, and filtered to remove blank pages and duplicates. The resulting page screenshots are unique and span a wide range of topics.&lt;/p&gt;

&lt;p&gt;I used gemini-1.5-flash-002 to generate queries based on each image. Gemini was instructed to come up with three types of queries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A broad topical query: summarizing the overall theme of the document.&lt;/li&gt;
&lt;li&gt;A specific detailed question: capturing subtle nuances within the content.&lt;/li&gt;
&lt;li&gt;A visual query: focusing on visual elements such as charts, graphs, images, or signatures.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The entire training and evaluation datasets were generated for just €2&lt;/strong&gt; (thanks, Gemini Flash!)&lt;/p&gt;

&lt;p&gt;Each image was then classified by its text density on a scale from 0 to 2. I used the &lt;a href="https://huggingface.co/omoured/YOLOv10-Document-Layout-Analysis" rel="noopener noreferrer"&gt;omoured YOLOv10n&lt;/a&gt; model, fine-tuned on DocLayNet, to detect areas such as figures versus text. Based on the proportions of these areas, I heuristically calculated the text density. I plan to use this classification to improve the model's performance on text-dense documents.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;0 = only visuals&lt;/li&gt;
&lt;li&gt;1 = a mix of visuals and text&lt;/li&gt;
&lt;li&gt;2 = only text&lt;/li&gt;
&lt;/ul&gt;
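
&lt;p&gt;As a rough illustration, such a density score can be derived from the detected region areas like this. This is my own sketch of the idea; the actual heuristic may differ.&lt;/p&gt;

```python
def text_density(regions):
    """Map detected layout regions to a 0-2 text-density score.

    regions: list of (label, area) pairs, e.g. from a DocLayNet-style
    layout detector. Illustrative heuristic, not the exact one used
    to build the dataset.
    """
    total = sum(area for _, area in regions)
    text_area = sum(area for label, area in regions if label == "text")
    fraction = text_area / max(total, 1e-9)
    # 0 = only visuals, 1 = a mix of visuals and text, 2 = only text
    return round(2 * fraction)
```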

&lt;p&gt;The eval and train datasets are not yet published. I'm very willing to open-source them, but I'm still unsure how to do it properly without violating any licenses. If you can help, please reach out!&lt;/p&gt;

&lt;h3&gt;Train Runs&lt;/h3&gt;

&lt;p&gt;The model was sequentially trained for each language in the following order:&lt;br&gt;
1) French: 6k samples&lt;br&gt;
2) Spanish: 6k samples&lt;br&gt;
3) Italian: 6k samples&lt;br&gt;
4) German: 6k samples&lt;/p&gt;

&lt;p&gt;This order was determined by the base model's retrieval performance in these languages, the first being the best performing. My intuition is that, given the small dataset, starting with the stronger languages could help balance overall improvements across the model.&lt;/p&gt;

&lt;p&gt;Before reaching this final checkpoint, I conducted multiple runs to test various strategies and validate some of my intuitions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Language order:&lt;/strong&gt; I swapped the order of the last two languages and found that training German last improved its performance &lt;em&gt;on evaluations&lt;/em&gt; by 1.7%, while maintaining similar scores across the other languages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model initialization:&lt;/strong&gt; I initialized the model with 10k mmarco pairs for each language. This resulted in worse performance across all languages, particularly with lower-dimensional embeddings. For example, French NDCG@5 using 512-dimensional embeddings dropped by 2% when trained with mmarco.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Different image resize algorithm:&lt;/strong&gt; I developed a custom resize function (&lt;code&gt;custom_resize&lt;/code&gt;) that strictly preserves the image's aspect ratio while scaling it down to fit within &lt;code&gt;min_pixels&lt;/code&gt; and &lt;code&gt;max_pixels&lt;/code&gt;. All evaluations used the standard resize function from &lt;a href="https://github.com/QwenLM/Qwen2-VL/blob/3bf7dbd7877892934bd7f8f4b00cd23cc2b35e4a/qwen-vl-utils/src/qwen_vl_utils/vision_process.py#L53" rel="noopener noreferrer"&gt;qwen_vl_utils&lt;/a&gt;. Models trained with the custom resize function outperformed the standard method, with an average +1.7% NDCG@5 improvement (1536 dimensions). It would be interesting to explore training a ColQwen model with this &lt;code&gt;custom_resize&lt;/code&gt; function.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resize function&lt;/th&gt;
&lt;th&gt;Avg&lt;/th&gt;
&lt;th&gt;English&lt;/th&gt;
&lt;th&gt;Italian&lt;/th&gt;
&lt;th&gt;Spanish&lt;/th&gt;
&lt;th&gt;French&lt;/th&gt;
&lt;th&gt;German&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;qwen2_vl_utils&lt;/td&gt;
&lt;td&gt;80.8&lt;/td&gt;
&lt;td&gt;80.2&lt;/td&gt;
&lt;td&gt;80.5&lt;/td&gt;
&lt;td&gt;79.6&lt;/td&gt;
&lt;td&gt;81&lt;/td&gt;
&lt;td&gt;82.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;custom_resize&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;82.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80.8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;81.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80.7&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;84.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;83.8&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+1.7%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+0.7%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+0.9%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+1.4%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+4.0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+1.4%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
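
&lt;p&gt;A minimal sketch of what such an aspect-ratio-preserving resize might compute. This is my reconstruction, not the exact custom_resize code; the pixel budgets mirror Qwen2-VL-style defaults and are an assumption.&lt;/p&gt;

```python
import math

def custom_resize_dims(width, height,
                       min_pixels=256 * 28 * 28,
                       max_pixels=960 * 28 * 28):
    """Return a target (width, height) whose area fits within
    [min_pixels, max_pixels] while strictly preserving the input
    aspect ratio. Illustrative reconstruction only; the pixel
    budgets are assumptions, not values from the source."""
    area = width * height
    # Clamp the target area into the budget, then scale both sides
    # uniformly so the aspect ratio never changes.
    target_area = max(min_pixels, min(max_pixels, area))
    scale = math.sqrt(target_area / area)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    return new_w, new_h
```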
&lt;h2&gt;Evaluations&lt;/h2&gt;

&lt;p&gt;Given the scarcity of public datasets for multilingual document image retrieval, the model has been evaluated on a custom-built dataset specifically designed to benchmark its performance across languages.&lt;/p&gt;

&lt;p&gt;This evaluation dataset was created with the same methodologies and pipelines as the training dataset. However, the document topics are generally different, and no images are shared between the training and evaluation datasets, to avoid any evaluation contamination. NDCG scores were calculated by running 100 unique queries against a 1K-document index for each language.&lt;/p&gt;
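
&lt;p&gt;For reference, when each query has exactly one relevant page (as when every query is generated from a single page screenshot), NDCG@5 reduces to a simple position-discounted score. This is a simplified sketch of the metric, not the author's eval code.&lt;/p&gt;

```python
import math

def ndcg_at_5(ranked_ids, relevant_id):
    """NDCG@5 under the single-relevant-document assumption:
    DCG = 1/log2(rank + 2) for a hit at 0-based rank, ideal DCG = 1."""
    try:
        rank = ranked_ids[:5].index(relevant_id)  # 0-based position in top 5
    except ValueError:
        return 0.0  # relevant page not retrieved in the top 5
    return 1.0 / math.log2(rank + 2)
```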
&lt;h3&gt;Matryoshka Representation Learning&lt;/h3&gt;

&lt;p&gt;This model is trained with Matryoshka Representation Learning (&lt;a href="https://arxiv.org/abs/2205.13147" rel="noopener noreferrer"&gt;MRL&lt;/a&gt;) on the following dimensions: 1536, 1024, 768, 512, 384, 256. The loss function used during training is calibrated to track performance across all these dimensions, leading the model to frontload the most important identifying information. This effectively allows you to shrink the embedding dimensions according to your scale and budget.&lt;/p&gt;
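
&lt;p&gt;In practice, using a smaller MRL dimension just means truncating the vector and re-normalizing it. A generic sketch of how Matryoshka embeddings are typically consumed, not code from this model's repository:&lt;/p&gt;

```python
import math

def truncate_embedding(vec, dim):
    """Shrink a Matryoshka embedding to its first `dim` components and
    L2-normalize again so cosine/dot-product search still behaves."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / max(norm, 1e-12) for x in head]
```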

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwd250hlapkcjp61bni1n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwd250hlapkcjp61bni1n.png" alt="average ndcg matryoshka float" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Average NDCG@5 for each dimension. Interestingly, the model shows improvements even in English, a language that wasn't included in the training set. It performs &lt;strong&gt;6% better at 256 dimensions&lt;/strong&gt;, with an overall improvement of 4% on average across all dimensions. Evaluations were conducted using FAISS with IndexFlatL2.&lt;/p&gt;
&lt;h4&gt;NDCG@5 (float)&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Average&lt;/th&gt;
&lt;th&gt;English&lt;/th&gt;
&lt;th&gt;Italian&lt;/th&gt;
&lt;th&gt;Spanish&lt;/th&gt;
&lt;th&gt;French&lt;/th&gt;
&lt;th&gt;German&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1536 dimensions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dse-qwen2-2b-mrl-v1&lt;/td&gt;
&lt;td&gt;79.5&lt;/td&gt;
&lt;td&gt;79.2&lt;/td&gt;
&lt;td&gt;80.2&lt;/td&gt;
&lt;td&gt;77.9&lt;/td&gt;
&lt;td&gt;80.6&lt;/td&gt;
&lt;td&gt;79.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mcdse-2b-v1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;82.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80.8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;81.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80.7&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;84.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;83.8&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+3.28%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+1.98%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+1.23%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+3.47%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+4.62%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+5.01%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1024 dimensions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dse-qwen2-2b-mrl-v1&lt;/td&gt;
&lt;td&gt;78.3&lt;/td&gt;
&lt;td&gt;78.8&lt;/td&gt;
&lt;td&gt;78.5&lt;/td&gt;
&lt;td&gt;76.5&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;td&gt;77.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mcdse-2b-v1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;81.7&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80.1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;84&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;84.3&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+4.23%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+1.75%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+2.12%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+4.49%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+4.76%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+8.07%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;768 dimensions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dse-qwen2-2b-mrl-v1&lt;/td&gt;
&lt;td&gt;77.8&lt;/td&gt;
&lt;td&gt;78.4&lt;/td&gt;
&lt;td&gt;78.3&lt;/td&gt;
&lt;td&gt;75.6&lt;/td&gt;
&lt;td&gt;80.8&lt;/td&gt;
&lt;td&gt;75.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mcdse-2b-v1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;81.1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;79.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;79.9&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;79.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;83.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;83.3&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+4.02%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+1.51%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+2.00%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+4.55%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+3.00%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+8.88%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;512 dimensions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dse-qwen2-2b-mrl-v1&lt;/td&gt;
&lt;td&gt;76.2&lt;/td&gt;
&lt;td&gt;77.6&lt;/td&gt;
&lt;td&gt;75.9&lt;/td&gt;
&lt;td&gt;73.1&lt;/td&gt;
&lt;td&gt;79.2&lt;/td&gt;
&lt;td&gt;75.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mcdse-2b-v1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;79.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;78.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;79.1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;75.8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;81.4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;81.7&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+3.91%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+1.15%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+4.05%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+3.56%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+2.70%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+7.96%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;384 dimensions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dse-qwen2-2b-mrl-v1&lt;/td&gt;
&lt;td&gt;75.7&lt;/td&gt;
&lt;td&gt;76.2&lt;/td&gt;
&lt;td&gt;75.5&lt;/td&gt;
&lt;td&gt;74.6&lt;/td&gt;
&lt;td&gt;78.4&lt;/td&gt;
&lt;td&gt;74&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mcdse-2b-v1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;78.8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;77.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;78.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;76.1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80.4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;81.4&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+3.86%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+1.68%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+3.82%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+1.97%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+2.49%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+9.09%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;256 dimensions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dse-qwen2-2b-mrl-v1&lt;/td&gt;
&lt;td&gt;73.5&lt;/td&gt;
&lt;td&gt;74.5&lt;/td&gt;
&lt;td&gt;73.6&lt;/td&gt;
&lt;td&gt;70.6&lt;/td&gt;
&lt;td&gt;74.8&lt;/td&gt;
&lt;td&gt;73.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mcdse-2b-v1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;78.1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;78.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;77.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;76.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80.1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;77.9&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+5.89%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+5.10%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+5.15%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+7.35%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+6.62%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+5.26%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;Binary Embeddings&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqj47sbqsumfmcm2se0o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqj47sbqsumfmcm2se0o.png" alt="average ndcg matryoshka binary" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;mcdse-2b-v1 clearly performs better on binarization, especially at lower dimensions. The model is &lt;strong&gt;23% better on 256 dimensions&lt;/strong&gt;, with an average improvement of 13% overall. Evaluations were conducted using FAISS with IndexBinaryFlat. But why are binary embeddings superior?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;NDCG@5&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Memory needed for 100M embeddings&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;dse-qwen2-2b-mrl-v1&lt;/strong&gt; (float16)&lt;/td&gt;
&lt;td&gt;79.5&lt;/td&gt;
&lt;td&gt;286 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;mcdse-2b-v1&lt;/strong&gt; (binary)&lt;/td&gt;
&lt;td&gt;80.6&lt;/td&gt;
&lt;td&gt;18 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This table shows that mcdse-2b-v1's &lt;strong&gt;binary embeddings are 1% better than the base model's 1536-dimensional float vectors&lt;/strong&gt; while reducing memory consumption by 16x. On top of that, binary embeddings can be searched roughly 40x faster with Hamming distance, since comparing two binary vectors takes just two CPU instructions (xor, popcnt).&lt;/p&gt;
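
&lt;p&gt;The binarize-and-search step behind these numbers can be sketched as follows. This is a generic illustration of sign-binarization and Hamming distance, not the exact evaluation code.&lt;/p&gt;

```python
import math

def binarize(vec):
    """Pack a float vector into one integer, one bit per dimension:
    1 where the component is non-negative. math.copysign treats 0.0
    as positive, which is a convention choice in this sketch."""
    packed = 0
    for x in vec:
        bit = int((math.copysign(1.0, x) + 1.0) / 2.0)
        packed = packed * 2 + bit
    return packed

def hamming(a, b):
    """Hamming distance between two packed binary embeddings:
    XOR, then count set bits (the xor + popcnt the text mentions)."""
    return bin(a ^ b).count("1")
```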
&lt;h4&gt;NDCG@5 (binary)&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Average&lt;/th&gt;
&lt;th&gt;English&lt;/th&gt;
&lt;th&gt;Italian&lt;/th&gt;
&lt;th&gt;Spanish&lt;/th&gt;
&lt;th&gt;French&lt;/th&gt;
&lt;th&gt;German&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1536 dimensions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dse-qwen2-2b-mrl-v1&lt;/td&gt;
&lt;td&gt;75.0&lt;/td&gt;
&lt;td&gt;75.8&lt;/td&gt;
&lt;td&gt;75.4&lt;/td&gt;
&lt;td&gt;72.4&lt;/td&gt;
&lt;td&gt;78.1&lt;/td&gt;
&lt;td&gt;73.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mcdse-2b-v1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;79.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;76.9&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;81.9&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;83.7&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80.8&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+6.93%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+4.65%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+1.95%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+11.60%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+6.69%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+9.41%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1024 dimensions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dse-qwen2-2b-mrl-v1&lt;/td&gt;
&lt;td&gt;72.2&lt;/td&gt;
&lt;td&gt;74.8&lt;/td&gt;
&lt;td&gt;71&lt;/td&gt;
&lt;td&gt;70.8&lt;/td&gt;
&lt;td&gt;74.6&lt;/td&gt;
&lt;td&gt;69.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mcdse-2b-v1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;79.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;78.4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;75.4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80.8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;82.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;79.5&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+9.05%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+4.59%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+5.84%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+12.38%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+9.69%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+12.45%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;768 dimensions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dse-qwen2-2b-mrl-v1&lt;/td&gt;
&lt;td&gt;70.1&lt;/td&gt;
&lt;td&gt;71.7&lt;/td&gt;
&lt;td&gt;69.3&lt;/td&gt;
&lt;td&gt;69.8&lt;/td&gt;
&lt;td&gt;73.7&lt;/td&gt;
&lt;td&gt;65.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mcdse-2b-v1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;78.8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;77.1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;75.4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;83&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;78.5&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+11.07%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+7.00%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+8.09%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+12.75%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+11.20%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+16.05%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;512 dimensions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dse-qwen2-2b-mrl-v1&lt;/td&gt;
&lt;td&gt;66.5&lt;/td&gt;
&lt;td&gt;70&lt;/td&gt;
&lt;td&gt;65.4&lt;/td&gt;
&lt;td&gt;63.7&lt;/td&gt;
&lt;td&gt;70.2&lt;/td&gt;
&lt;td&gt;63&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mcdse-2b-v1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;76.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;74.8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;74.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;77.7&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80.9&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;75.3&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+13.21%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+6.42%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+11.86%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+18.02%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+13.23%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+16.33%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;384 dimensions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dse-qwen2-2b-mrl-v1&lt;/td&gt;
&lt;td&gt;61.1&lt;/td&gt;
&lt;td&gt;62.7&lt;/td&gt;
&lt;td&gt;58.5&lt;/td&gt;
&lt;td&gt;58.6&lt;/td&gt;
&lt;td&gt;65.1&lt;/td&gt;
&lt;td&gt;60.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mcdse-2b-v1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;74.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;74.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;71.4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;77.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;75.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;73&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+17.67%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+15.84%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+18.07%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+24.09%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+13.43%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+16.71%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;256 dimensions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dse-qwen2-2b-mrl-v1&lt;/td&gt;
&lt;td&gt;54.3&lt;/td&gt;
&lt;td&gt;59&lt;/td&gt;
&lt;td&gt;56.5&lt;/td&gt;
&lt;td&gt;53.6&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;td&gt;49.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mcdse-2b-v1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;70.9&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;72.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;66.4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;73.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;72.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;69.2&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+23.31%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+18.73%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+14.91%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+27.07%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+27.00%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+28.32%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;ShiftProject&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://huggingface.co/datasets/vidore/shiftproject_test" rel="noopener noreferrer"&gt;vidore/shiftproject_test&lt;/a&gt; dataset is part of the ViDoRe Benchmark. It contains French queries and documents about the environment sourced from the &lt;a href="https://theshiftproject.org/" rel="noopener noreferrer"&gt;Shift Project&lt;/a&gt;. Queries were generated with Claude-3 Sonnet using a French translation of the same prompt used to generate queries for the scraped documents of &lt;a href="https://huggingface.co/datasets/vidore/colpali_train_set" rel="noopener noreferrer"&gt;vidore/colpali_train_set&lt;/a&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;ShiftProject (NDCG@5)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;dse-qwen2-2b-mrl-v1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80.8&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mcdse-2b-v1&lt;/td&gt;
&lt;td&gt;78.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;-2.80%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is the NDCG@5 on the ShiftProject dataset, with 1536 float dimensions and evaluated using at most 960 image patches. &lt;/p&gt;

&lt;p&gt;I expected mcdse-2b-v1 to score higher than the base model; instead, it's about 3% worse.&lt;br&gt;
The base model was trained on the &lt;a href="https://huggingface.co/datasets/vidore/colpali_train_set" rel="noopener noreferrer"&gt;colpali train set&lt;/a&gt;, so I thought it may have been over-optimized for "Claude-3 Sonnet like" queries. To investigate this, I regenerated the ShiftProject dataset queries using gemini-1.5-flash-002 and my prompts.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;ShiftProject_Gemini (NDCG@5)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;dse-qwen2-2b-mrl-v1&lt;/td&gt;
&lt;td&gt;67&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mcdse-2b-v1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;70.8&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+5.37%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The scores change wildly, but in this case mcdse-2b-v1 is 5% better. These results suggest two possible causes:&lt;/p&gt;

&lt;p&gt;1) The base model is over-optimized for "Claude-3 Sonnet like" queries&lt;br&gt;
2) My model is over-optimized for "gemini-1.5-flash-002 like" queries&lt;/p&gt;

&lt;p&gt;In both scenarios, I believe mcdse-2b-v1 has mitigated these over-optimizations by learning a broader query distribution.&lt;/p&gt;

&lt;p&gt;My generated Gemini queries come in two formats: questions and queries. The &lt;a href="https://huggingface.co/datasets/vidore/colpali_train_set" rel="noopener noreferrer"&gt;colpali_train_set&lt;/a&gt; generated queries are questions only. I also tested both models on just the Gemini queries and just the Gemini questions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnkuep5qgb49ub41glo6w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnkuep5qgb49ub41glo6w.png" alt="image/png" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;ShiftProject_GeminiQuestions (NDCG@5)&lt;/th&gt;
&lt;th&gt;ShiftProject_GeminiQueries (NDCG@5)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;dse-qwen2-2b-mrl-v1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;74.8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;58.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mcdse-2b-v1&lt;/td&gt;
&lt;td&gt;69.5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;63.5&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;-7.63%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+7.72%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The base model is 7% better on Gemini questions and 7% worse on Gemini queries. The average scores between queries and questions are nearly identical (66.7 and 66.5). This suggests that my model has mitigated the previously mentioned over-optimizations and is generally better at understanding a wider variety of queries. Training on more multilingual data will probably raise this average and eventually improve performance on ShiftProject.&lt;/p&gt;
&lt;h3&gt;
  
  
  Cohere Embed v3 Image
&lt;/h3&gt;

&lt;p&gt;I conducted some preliminary (and rushed) tests using the recently announced Cohere &lt;a href="https://docs.cohere.com/v2/changelog/embed-v3-is-multimodal" rel="noopener noreferrer"&gt;embed-multilingual-v3.0 multimodal&lt;/a&gt; embeddings on a smaller version of the English dataset. That model achieved an NDCG@5 score of 71, while mcdse-2b-v1 scored around 84. I'm working on more comprehensive evaluations for this model.&lt;/p&gt;

&lt;p&gt;&lt;a id="deployment"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Deployment
&lt;/h2&gt;

&lt;p&gt;On HuggingFace Transformers, you can expect to encode ~3 images/s on an RTX 3090 (35 TFLOPS) with a batch size of 32. A more common inference-side GPU like the RTX 4000 Ada should deliver roughly the same throughput.&lt;/p&gt;
&lt;h3&gt;
  
  
  vLLM
&lt;/h3&gt;

&lt;p&gt;vLLM officially supports Qwen2VL for generation only, so I added a new model class &lt;code&gt;Qwen2VLForEmbeddingGeneration&lt;/code&gt; to support embedding tasks. Running inference on vLLM should be ~5x faster than HuggingFace Transformers.&lt;/p&gt;
&lt;h4&gt;
  
  
  Download the new model class
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/marplex/mcdse &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;mcdse
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  Download mcdse-2b-v1 for local inference
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;huggingface_hub&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;snapshot_download&lt;/span&gt;
&lt;span class="nf"&gt;snapshot_download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repo_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;marco/mcdse-2b-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;local_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/model/mcdse-2b-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  Edit config.json
&lt;/h4&gt;

&lt;p&gt;Replace &lt;code&gt;Qwen2VLForConditionalGeneration&lt;/code&gt; with &lt;code&gt;Qwen2VLForEmbeddingGeneration&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s1"&gt;'s/Qwen2VLForConditionalGeneration/Qwen2VLForEmbeddingGeneration/g'&lt;/span&gt; /path/to/model/mcdse-2b-v1/config.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Check &lt;code&gt;vllm/main.py&lt;/code&gt; for local inference
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#vllm/main.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;qwen2_vl_dse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Qwen2VLForEmbeddingGeneration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_query_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_document_prompt&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vllm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ModelRegistry&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LLM&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;

&lt;span class="n"&gt;ModelRegistry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen2VLForEmbeddingGeneration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Qwen2VLForEmbeddingGeneration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/model/mcdse-2b-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;limit_mm_per_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Encode queries
&lt;/span&gt;&lt;span class="n"&gt;query_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_query_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Quali erano le passività totali al 31 dicembre 2017?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;multi_modal_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;]}})&lt;/span&gt;
&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="c1"&gt;#1536 dimensional embedding
&lt;/span&gt;
&lt;span class="c1"&gt;# Encode documents
&lt;/span&gt;&lt;span class="n"&gt;dummy_document_image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RGB&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;document_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_document_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dummy_document_image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;document_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;multi_modal_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;]}})&lt;/span&gt;
&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="c1"&gt;#1536 dimensional embedding
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
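
&lt;p&gt;Once queries and pages are encoded, retrieval is just a similarity search over the embedding vectors. A minimal sketch with NumPy, using toy vectors (in practice you'd plug in the 1536-dimensional outputs of &lt;code&gt;llm.encode&lt;/code&gt; above):&lt;/p&gt;

```python
import numpy as np

def top_k(query_emb, doc_embs, k=5):
    """Rank document embeddings by cosine similarity to the query."""
    q = np.asarray(query_emb, dtype=np.float32)
    d = np.asarray(doc_embs, dtype=np.float32)
    # Normalize so the dot product equals cosine similarity
    q /= np.linalg.norm(q)
    d /= np.linalg.norm(d, axis=1, keepdims=True)
    scores = d @ q
    idx = np.argsort(-scores)[:k]
    return idx.tolist(), scores[idx].tolist()

# Toy 4-dimensional "embeddings": document 0 matches the query exactly
idx, scores = top_k([1, 0, 0, 0], [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 0]], k=2)
print(idx)  # [0, 2]
```

For larger collections you would hand these vectors to a vector database, but the ranking logic stays the same.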



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This is my first time training a model; it was challenging but incredibly fun. I don't think I could have ever done this without the amazing work of the HuggingFace team and contributors. I also want to thank &lt;a href="https://twitter.com/ManuelFaysse" rel="noopener noreferrer"&gt;Manuel Faysse&lt;/a&gt;, &lt;a href="https://twitter.com/tonywu_71" rel="noopener noreferrer"&gt;Tony Wu&lt;/a&gt;, and the entire ViDoRe team for their work on &lt;a href="https://arxiv.org/abs/2407.01449" rel="noopener noreferrer"&gt;ColPali&lt;/a&gt;, and &lt;a href="https://x.com/xueguang_ma" rel="noopener noreferrer"&gt;Xueguang Ma&lt;/a&gt; for all his work on the Tevatron codebase and for training a very strong base model. I was also inspired by &lt;a href="https://x.com/bclavie" rel="noopener noreferrer"&gt;Benjamin Clavié&lt;/a&gt; and his impressive model announcements.&lt;/p&gt;

&lt;p&gt;I hope this model proves useful for your retrieval and RAG pipelines. As mentioned in the beginning, my benchmarks are far from perfect, and results in real-world scenarios may vary. I encourage you to test it on your own use cases. Overall, a significant advantage of visual retrieval is that you can scrap your complex indexing pipeline by simply embedding the page. This is the future!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>vectordatabase</category>
    </item>
    <item>
      <title>Hono on Azure Functions</title>
      <dc:creator>marplex</dc:creator>
      <pubDate>Wed, 08 May 2024 14:39:36 +0000</pubDate>
      <link>https://dev.to/marplex/hono-on-azure-functions-15g</link>
      <guid>https://dev.to/marplex/hono-on-azure-functions-15g</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/Marplex/hono-azurefunc-adapter" rel="noopener noreferrer"&gt;hono-azurefunc-adapter&lt;/a&gt; is one of the simplest yet incredibly useful js library I have ever written.&lt;/p&gt;

&lt;p&gt;Hono is a web application framework built on web standards. It's incredibly fast and lightweight. Because it's built using web standard APIs, the same code will run on multiple runtimes (Cloudflare, Fastly, Deno, Bun, AWS, or Node.js).&lt;/p&gt;

&lt;p&gt;For platforms that don't directly support web standards, Hono comes with adapters. For example, running in Node.js requires &lt;a href="https://github.com/honojs/node-server" rel="noopener noreferrer"&gt;an adapter&lt;/a&gt; that converts requests and responses into node types and objects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building the Hono adapter
&lt;/h2&gt;

&lt;p&gt;There are a lot of community-made adapters for running Hono in many more environments. Unfortunately, no one had ever made one for Azure Functions, so I decided to build it, free and open source.&lt;/p&gt;

&lt;p&gt;The entire library is just 54 lines of code.&lt;br&gt;
It's simple and maintainable, yet it lets you port APIs built with Hono to the powerful Azure Functions platform with minimal or no code rewrites.&lt;/p&gt;
&lt;h2&gt;
  
  
  Simplicity wins
&lt;/h2&gt;

&lt;p&gt;It is incredible to think of how many new possibilities this library unlocks with just 54 lines of code. &lt;/p&gt;

&lt;p&gt;It's true, simple things are always the most difficult.&lt;br&gt;
Although &lt;a href="https://github.com/Marplex/hono-azurefunc-adapter" rel="noopener noreferrer"&gt;hono-azurefunc-adapter&lt;/a&gt; now appears clean and concise, it took a while to get to this point. I spent a lot of time polishing, refactoring, and rethinking how to accomplish the same things with fewer lines of code. I had to dig deep into how the (partly documented) Azure Functions API works.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lgrapmkzfnxoxdz4s63.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lgrapmkzfnxoxdz4s63.png" alt="Hono github star history" width="800" height="571"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hono has rapidly become one of the most widely used frameworks for building JavaScript web APIs. Hats off to &lt;a href="https://github.com/yusukebe" rel="noopener noreferrer"&gt;yusukebe&lt;/a&gt; and all the other contributors! Now it's finally possible to run it on Azure Functions, effortlessly, with just &lt;code&gt;azureHonoHandler(honoApp.fetch)&lt;/code&gt;!&lt;/p&gt;
&lt;h2&gt;
  
  
  How to use
&lt;/h2&gt;

&lt;p&gt;It's very simple. Install &lt;a href="https://github.com/Marplex/hono-azurefunc-adapter" rel="noopener noreferrer"&gt;hono-azurefunc-adapter&lt;/a&gt; with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm i @marplex/hono-azurefunc-adapter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now create the http trigger for Azure Functions. &lt;code&gt;honoApp&lt;/code&gt; is your exported Hono application object.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;honoApp&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;./app&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;azureHonoHandler&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@marplex/hono-azurefunc-adapter&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@azure/functions&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;http&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;httpTrigger&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;methods&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;GET&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;DELETE&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;HEAD&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PATCH&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PUT&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;authLevel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;anonymous&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;{*proxy}&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;azureHonoHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;honoApp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it, you're done.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;There are some limitations and other things you should keep in mind when running Hono inside Azure Functions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Route Prefix&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The default Azure Functions route prefix is &lt;code&gt;/api&lt;/code&gt;. Be sure to start all your Hono routes with &lt;code&gt;/api&lt;/code&gt;, or change the default route prefix in &lt;code&gt;host.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"extensions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"http"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"routePrefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Crypto&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In Node &amp;lt;=18 environments, if you are using &lt;code&gt;hono/bearer-auth&lt;/code&gt; or any other library that uses crypto, be sure to define &lt;code&gt;global.crypto = require("crypto");&lt;/code&gt; before registering the http trigger.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Request signal&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Azure Functions does not expose any signal or event for listening to HTTP request interruptions. &lt;code&gt;c.req.raw.signal&lt;/code&gt; is useless; it's never aborted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I think Azure is one of the most trusted enterprise-ready cloud providers. By building &lt;a href="https://github.com/Marplex/hono-azurefunc-adapter" rel="noopener noreferrer"&gt;hono-azurefunc-adapter&lt;/a&gt;, I hope this will finally allow many to port the same popular Hono APIs to Azure Functions, especially for private enterprise needs.&lt;/p&gt;




&lt;p&gt;hono-azurefunc-adapter is available on NPM and GitHub Packages. This project is fully open source and MIT licensed, so do what you want! Contributions are welcome 🥳&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/Marplex/hono-azurefunc-adapter" rel="noopener noreferrer"&gt;https://github.com/Marplex/hono-azurefunc-adapter&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;NPM: &lt;a href="https://www.npmjs.com/package/@marplex/hono-azurefunc-adapter" rel="noopener noreferrer"&gt;https://www.npmjs.com/package/@marplex/hono-azurefunc-adapter&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub Package: &lt;a href="https://github.com/Marplex/hono-azurefunc-adapter/pkgs/npm/hono-azurefunc-adapter" rel="noopener noreferrer"&gt;https://github.com/Marplex/hono-azurefunc-adapter/pkgs/npm/hono-azurefunc-adapter&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>javascript</category>
      <category>serverless</category>
      <category>typescript</category>
      <category>azurefunctions</category>
    </item>
    <item>
      <title>Italian Laws Unigram Viewer on the Edge With Cloudflare Pages</title>
      <dc:creator>marplex</dc:creator>
      <pubDate>Mon, 17 Jul 2023 20:05:49 +0000</pubDate>
      <link>https://dev.to/marplex/italian-laws-unigram-viewer-on-the-edge-with-cloudflare-pages-18b9</link>
      <guid>https://dev.to/marplex/italian-laws-unigram-viewer-on-the-edge-with-cloudflare-pages-18b9</guid>
      <description>&lt;p&gt;Months ago I shared my Italian law mapping project, where I mapped 13K Italian laws and extracted relationships between them (&lt;a href="https://labs.marcocimolai.xyz/tessuto-normativo" rel="noopener noreferrer"&gt;labs.marcocimolai.xyz/tessuto-normativo&lt;/a&gt;). It went viral on Reddit, LinkedIn and has been covered by some of Italy's leading newspapers. Today, I will share my journey of building and deploying a "Google NGram viewer" for Italian laws.&lt;/p&gt;

&lt;p&gt;Let's start with the basic idea: you search for a word and the site returns how many Italian laws containing that word have been published for each year, from the Constitution to 2022.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmg36p5flbu80x5y6c6m9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmg36p5flbu80x5y6c6m9.png" alt="Google NGram Viewer" width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I didn't know much about search engines or information retrieval. So, as usual, my journey began with a Google search.&lt;/p&gt;

&lt;p&gt;That's where I found the well-known Apache Lucene. I started digging in and learning all about it. I discovered that there are many other optimizations and pre-processing steps that are essential for serving a search endpoint, and that there is a project called Solr that does all of this for me.&lt;/p&gt;

&lt;p&gt;Solr is a search engine built on top of Apache Lucene, and it comes with sane defaults and a handy HTTP API. The first part is indexing and processing the laws: I used the &lt;code&gt;pysolr&lt;/code&gt; client to loop through each law and add it to the index. The document format only contains the text (not stored) and the publication date (stored), which is all I need to reconstruct the term usage plot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pysolr&lt;/span&gt;
&lt;span class="n"&gt;solr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pysolr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Solr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8983/solr/norms&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;always_commit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;norms&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;solr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's true that I left every configuration at its default (so it may not be very optimized for my use case), but I didn't expect the process to be so fast and easy. Performing queries was also very straightforward: I just needed to retrieve the date field with no additional scoring.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lire&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;solr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%s&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fl&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rows&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;15000&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After processing this data with pandas, here's the result of the query. The Y axis represents how many norms containing the term "lire" were published with respect to the total count of published norms, that is &lt;em&gt;term_occurrences / total_norms&lt;/em&gt;.&lt;/p&gt;
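
&lt;p&gt;That normalization step can be sketched in a few lines of pandas. The numbers and yearly totals below are made up for illustration; in practice the dates come from the Solr results above:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical inputs: dates of the norms matching the query,
# and the total number of norms published each year
dates = ["1950-03-01", "1950-07-12", "1951-01-20"]
total_norms = pd.Series({1950: 400, 1951: 380})

# Count matches per year, then divide by the yearly totals
matches = pd.to_datetime(pd.Series(dates)).dt.year.value_counts()
usage = (matches / total_norms).fillna(0)  # term_occurrences / total_norms
print(usage.to_dict())  # 1950 -> 0.005, 1951 -> ~0.0026
```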

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5n2i6w3kjdaanybyyo1s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5n2i6w3kjdaanybyyo1s.png" alt="" width="565" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fschlec7pl5hzm9q6f3iq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fschlec7pl5hzm9q6f3iq.png" alt="" width="565" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Thinking on problems
&lt;/h3&gt;

&lt;p&gt;Thanks to Solr, I had just indexed Italian norms, performed queries, and retrieved the term usage graph: exactly what I needed. Well, not quite, because there are also some downsides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It introduces additional server costs and maintenance&lt;/li&gt;
&lt;li&gt;It's not easily scalable&lt;/li&gt;
&lt;li&gt;Solr does far more than what I need to achieve&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The previous tool was deployed entirely on Cloudflare Pages with no server compute. The graph is downloaded, opened and processed client side, hassle free and with zero deployment costs.&lt;/p&gt;

&lt;p&gt;With this new Solr architecture, I have to run and maintain an external server. In addition, this custom solution is not easily scalable, difficult to distribute, and as it stands, acts as a single point of failure. During high demand spikes (which were very common in my previous project), I doubted that Solr would be able to serve all those users.&lt;/p&gt;

&lt;p&gt;On top of that, Solr does some pre-processing on the words, such as lemmatization and stemming, and it stores word positions and multiple dates for the same year. Solr is great, but overkill for my needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution number two
&lt;/h2&gt;

&lt;p&gt;I need a simple inverted index without stemming/lemmatization, where documents are the pre-computed term usage graphs.&lt;/p&gt;

&lt;p&gt;The key part here is &lt;em&gt;pre-computed&lt;/em&gt;. In games, lighting has long been too expensive to compute in real time (though that's changing). That's why games have traditionally used &lt;em&gt;baked lighting&lt;/em&gt;: light is pre-calculated and applied to the world textures.&lt;/p&gt;

&lt;p&gt;My idea is similar: querying a large corpus of text is too demanding (in terms of computation, time, and cost) for real-time use. So I will do the search on the precomputed results and deliver them already &lt;em&gt;baked&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Building the inverted index
&lt;/h3&gt;

&lt;p&gt;An inverted index consists of two things: terms and documents. In this case, terms are words that appear in each law, and documents are usage distributions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7p2g1xjyed5cad9w1r1q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7p2g1xjyed5cad9w1r1q.png" alt="Inverted index structure, each term is associated to its usage graph" width="398" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I have started by extracting tokens from a sample law using &lt;code&gt;nltk&lt;/code&gt;. The tokens are processed and words are filtered (e.g. by removing stopwords or odd characters). As mentioned before, I don't need to do any stemming/lemmatization since I need to search for exact terms.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.corpus&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;stopwords&lt;/span&gt; 
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.tokenize&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;word_tokenize&lt;/span&gt; 
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;nltk&lt;/span&gt;

&lt;span class="n"&gt;stop_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stopwords&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;words&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;italian&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;#extract tokens from the norm text
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;word_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;word_tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;italian&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;word_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="c1"&gt;#remove special characters
&lt;/span&gt;    &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;process_word&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;#skip stop words
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stop_words&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="k"&gt;continue&lt;/span&gt;

    &lt;span class="c1"&gt;#skip invalid tokens (weird strings such as html)
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="nf"&gt;is_valid_token&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
      &lt;span class="k"&gt;continue&lt;/span&gt;

    &lt;span class="c1"&gt;#skip tokens that can't be stored in less than 32 bytes.
&lt;/span&gt;    &lt;span class="c1"&gt;#Note: remember this step for later
&lt;/span&gt;    &lt;span class="n"&gt;token_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt;

    &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;tokenize(norm)&lt;/code&gt; function returns the list of filtered tokens extracted from the input text. The next step is to build the inverted index.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;norm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;urn:nir:stato:legge:1967-03-09;150.txt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1967&lt;/span&gt;

&lt;span class="n"&gt;inverted_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

&lt;span class="c1"&gt;#Only take unique tokens.
# If a word occurs just once in the norm, it's counted in the final result.
&lt;/span&gt;&lt;span class="n"&gt;unique_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;unique_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;#Get the count of this token
&lt;/span&gt;  &lt;span class="n"&gt;freq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inverted_index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;

  &lt;span class="c1"&gt;#Add +1 to the count
&lt;/span&gt;  &lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

  &lt;span class="c1"&gt;#Assign it back to the inverted index
&lt;/span&gt;  &lt;span class="n"&gt;inverted_index&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;freq&lt;/span&gt;

&lt;span class="c1"&gt;#The sorted keys of the inverted index is our vocabulary
&lt;/span&gt;&lt;span class="n"&gt;vocab&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inverted_index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code is a simplified version of the final result. Here the year is fixed, but the final code can also update the index at any given year.&lt;/p&gt;
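&lt;p&gt;As a sketch, the per-year extension could look like this (the &lt;code&gt;index_norms&lt;/code&gt; helper and the simplified tokenizer are illustrative, not the actual production code; it assumes norms are processed in chronological order):&lt;/p&gt;

```python
#Hypothetical extension of the loop above: index many norms across years.
#The simplified tokenizer stands in for the article's tokenize().
def tokenize(text):
    return text.lower().split()

def index_norms(norms):
    """norms: iterable of (text, year) pairs, sorted by year."""
    inverted_index = {}
    for text, year in norms:
        for token in set(tokenize(text)):
            freq = inverted_index.setdefault(token, [])
            #Bump the count for this year, or start a new [year, count] pair
            if freq and freq[-1][0] == year:
                freq[-1][1] += 1
            else:
                freq.append([year, 1])
    return inverted_index
```

&lt;p&gt;For example, indexing two 1967 norms that both contain "lavoro" maps that token to &lt;code&gt;[[1967, 2]]&lt;/code&gt;.&lt;/p&gt;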

&lt;p&gt;With the code above, I extracted 407 unique tokens from &lt;code&gt;legge:1967-03-09;150&lt;/code&gt;. Here are some examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;periodo&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ciascun&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;termine&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;incarichi&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;liquidazione&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;esso&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;qualsiasi&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;agrarie&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;leggi&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🧊 Freezing the index
&lt;/h3&gt;

&lt;p&gt;I am able to process norms and create inverted indexes, but the results are only available in RAM and in Python data structures (dictionaries and lists). To use this index online, I need to export it to a file. I call this the freezing part.&lt;/p&gt;

&lt;p&gt;The simplest solution would be to serialize the dictionary into JSON, then have the user download it and start looking up terms offline. This way I don't have to maintain any servers and everything is automatically hosted and distributed by Cloudflare or some other CDN.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbx557l3sue8qea6h1peq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbx557l3sue8qea6h1peq.png" alt="representation of the proposed flow, from python dictionary, to json, to javascript hashmap" width="538" height="138"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unfortunately, there are multiple problems with this approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JSON is a text format and produces files that are far too large, with many repeated bytes; we mostly need to store numbers.&lt;/li&gt;
&lt;li&gt;The entire JSON file has to be parsed before any search can be performed&lt;/li&gt;
&lt;li&gt;Users download every document, even though they will probably not search all of them.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I came up with a better solution that is more efficient and separates the index from the documents. The idea is to have an &lt;code&gt;index.bin&lt;/code&gt; and a &lt;code&gt;documents.bin&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;index.bin&lt;/code&gt; file contains the ordered list of tokens (the vocabulary). Each token is stored in 32 bytes, padded with trailing zeros if necessary. The most important property is that every record is byte-aligned, which makes it possible to binary search the vocabulary without reading or parsing the entire file.&lt;/p&gt;
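&lt;p&gt;Writing and searching such a fixed-width index could look like this in Python (a sketch; the file layout matches the description above, but the helper names are mine, not the production script's):&lt;/p&gt;

```python
import bisect

TOKEN_SIZE = 32  #each record is a token padded with trailing zeros

def freeze_index(vocab, path):
    """Write the sorted vocabulary as fixed-width 32-byte records."""
    with open(path, 'wb') as f:
        for token in vocab:
            f.write(token.encode('utf-8').ljust(TOKEN_SIZE, b'\x00'))

class RecordView:
    """Expose index.bin as a random-access sequence of 32-byte records,
    so bisect can binary search it without reading the whole file."""
    def __init__(self, f):
        f.seek(0, 2)  #seek to the end to count records
        self.f, self.length = f, f.tell() // TOKEN_SIZE
    def __len__(self):
        return self.length
    def __getitem__(self, i):
        self.f.seek(i * TOKEN_SIZE)
        return self.f.read(TOKEN_SIZE)

def find_token(path, token):
    """Return the token's position in the vocabulary, or -1 if absent."""
    target = token.encode('utf-8').ljust(TOKEN_SIZE, b'\x00')
    with open(path, 'rb') as f:
        view = RecordView(f)
        pos = bisect.bisect_left(view, target)
        if pos != len(view) and view[pos] == target:
            return pos
    return -1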

&lt;p&gt;The &lt;code&gt;documents.bin&lt;/code&gt; file contains the term usage, each stored in 152 bytes. Why 152? Because I want to analyze norms from the Constitution (1947) to the present day (2023), exactly 76 years. Each year's term count is an unsigned 2 byte integer. So 2 bytes times 76 years equals 152 bytes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj7vqjsmov5m1zannv7xh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj7vqjsmov5m1zannv7xh.png" alt="representation of the new solution, building two files index and documents" width="607" height="204"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This way of storing documents consumes a lot of space (many term usage distributions contain a lot of zeros) but again, it is byte aligned and can be easily indexed by reading at an offset.&lt;/p&gt;

&lt;p&gt;In production, I decided to index 53,036 laws. This resulted in a vocabulary size of 196,082 tokens, a 6MB index and a 28MB documents file.The entire script ran in about 16 minutes.&lt;/p&gt;

&lt;p&gt;The compressed &lt;code&gt;index.bin&lt;/code&gt; is about 800kb. The compressed &lt;code&gt;documents.bin&lt;/code&gt; is about 1.8MB. This is not bad when you consider that the original Solr index took up more than 100 MB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;________________________________________________________&lt;/span&gt;
&lt;span class="s"&gt;Executed in   16.75 mins    fish           external&lt;/span&gt;
   &lt;span class="s"&gt;usr time  727.07 secs    0.00 micros  727.07 secs&lt;/span&gt;
   &lt;span class="s"&gt;sys time    4.71 secs  779.00 micros    4.71 secs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;To recap, storing the inverted index in two parts and in binary format is way better because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users will only download the index file (smaller and can be compressed with gzip/brotli)&lt;/li&gt;
&lt;li&gt;No need to read and parse the entire index to perform search&lt;/li&gt;
&lt;li&gt;Documents are retrieved on-demand, &lt;em&gt;only the parts that users need&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As I said before, these two files will be distributed by a CDN. But how can the client download only  a tiny part of the &lt;code&gt;documents.bin&lt;/code&gt; file? After all, CDNs only serve static files and don't perform any kind of computation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Leveraging HTTP functionalities
&lt;/h3&gt;

&lt;p&gt;Introducing Range headers. Not every server supports it (GitHub does), but basically it allows you to download specific parts of a file. This is mainly used for watching MP4 videos on the web (without having to use HLS or DASH). It's useful because you don't have to download the whole file just to watch a tiny part of it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;Range&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;byteFrom&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;byteTo&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This suits my needs perfectly. Finally, the whole system can be divided into three steps:&lt;/p&gt;

&lt;p&gt;1) Perform binary search on &lt;code&gt;index.bin&lt;/code&gt;&lt;br&gt;
2) Find the offset&lt;br&gt;
3) Download the specific &lt;code&gt;documents.bin&lt;/code&gt; part with Range header&lt;/p&gt;

&lt;h2&gt;
  
  
  More problems, thank you CORS
&lt;/h2&gt;

&lt;p&gt;Before using this solution with Range headers, I tested if GitHub was accepting them, and it was. I also checked to see if it was serving files with the &lt;code&gt;Access-Control-Allow-Origin: *&lt;/code&gt; header (to access it from other domains), and it was. I did this with Postman and everything worked as expected.&lt;/p&gt;

&lt;p&gt;I only found the problem when I tried to run the app with files on GitHub (not from a local server). Little do I know, browsers do not just only look at domain origin to allow cross origin requests. During the CORS preflight, they also check for headers, methods, credentials and other options. In particular, GitHub does not accept any other additional header, including Range :(&lt;/p&gt;

&lt;h2&gt;
  
  
  ➗ Divide and conquer
&lt;/h2&gt;

&lt;p&gt;Again, I had to find another solution, without resorting to hosting the files on my server.&lt;/p&gt;

&lt;p&gt;I've decided to split the documents.bin file into chunks. By choosing the number of chunks, I can reduce the load on the client to a reasonable amount. Too many chunks and the user has a higher chance of typing words that are in different files, too few and the user waits longer to download larger files.&lt;/p&gt;

&lt;p&gt;I decided to split it into 10 parts, each of which weighs about 200kb compressed. The client knows what files to download simply by doing &lt;code&gt;word_position / ( 196082 / 10)&lt;/code&gt;, where &lt;code&gt;196082&lt;/code&gt; is the vocabulary size.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzt35e5krxojv2e3ipq4t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzt35e5krxojv2e3ipq4t.png" alt="One index file indexes into multiple chunks of the documents" width="607" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Many ups and one big down
&lt;/h2&gt;

&lt;p&gt;After all this long ping-pong between solutions and problems, I think I've found the (almost) perfect solution. It has numerous advantages and reduces maintenance and costs. The only major drawback is that it only searches for single words (unigrams).&lt;/p&gt;

&lt;p&gt;I made a conscious decision to store usage graphs without considering word positions, which limits the ability to search for sequences of words (phrases). Adding this feature would have significantly increased the index and document size, making it challenging to deliver to clients.&lt;/p&gt;

&lt;p&gt;I can easily make changes to the Python script to include n-grams, which include bigrams (two-word combinations). However, bigrams are less unique than unigrams because they are simply combinations of two single words. As a result, they are more sparse and diverse, resulting in much larger indexes.&lt;/p&gt;

&lt;p&gt;When I tried to extract bigrams from 53036 norms, the vocabulary grew to 38 million bigrams, resulting in a 293MB &lt;code&gt;index.bin&lt;/code&gt; and a 696MB &lt;code&gt;documents.bin&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I'm still exploring new methods for searching phrases with static files and no server computation, as this remains an ongoing area of development.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This was like a challenge, trying to reduce cost and maintenance time as much as possible. I wanted to push myself to new limits, building custom and "low level" stuff with technologies I'd never really understood before.&lt;/p&gt;

&lt;p&gt;So this is the complete architecture. One &lt;code&gt;index.bin&lt;/code&gt; and 10 chunked documents files are enough to perform search over 50,000 Italian laws.&lt;/p&gt;

&lt;p&gt;You can view the final result here &lt;a href="https://labs.marcocimolai.xyz/term-trend" rel="noopener noreferrer"&gt;labs.marcocimolai.xyz/term-trend&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdbrknax3lhinqx13lp4f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdbrknax3lhinqx13lp4f.png" alt="screenshot of the final result webpage" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>python</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Add some snow in your WPF apps</title>
      <dc:creator>marplex</dc:creator>
      <pubDate>Mon, 16 Jan 2023 21:44:53 +0000</pubDate>
      <link>https://dev.to/marplex/add-some-snow-in-your-wpf-apps-3dck</link>
      <guid>https://dev.to/marplex/add-some-snow-in-your-wpf-apps-3dck</guid>
      <description>&lt;p&gt;I always loved how Telegram changes its style during Christmas and winter. And I wanted it too on some of WPF apps that I maintain.&lt;/p&gt;

&lt;p&gt;So I started building &lt;a href="https://github.com/Marplex/WpfSnowfall" rel="noopener noreferrer"&gt;WpfSnowfall&lt;/a&gt;, a WPF snowfall user control. It is super simple to use and fully customizable, it even comes with different types of snowflakes (what a feature)!&lt;/p&gt;

&lt;p&gt;You can use it to add some detail and quality on your apps, adding a touch of snow during winter times.&lt;/p&gt;

&lt;h2&gt;
  
  
  Now, show me the code!
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1) Import &lt;code&gt;WpfSnowfall&lt;/code&gt; from NuGet&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dotnet add package WpfSnowfall &lt;span class="nt"&gt;--version&lt;/span&gt; 1.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2) Profit&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;sf:Snowfall&lt;/span&gt;
    &lt;span class="na"&gt;EmissionRate=&lt;/span&gt;&lt;span class="s"&gt;"5"&lt;/span&gt;
    &lt;span class="na"&gt;Fill=&lt;/span&gt;&lt;span class="s"&gt;"White"&lt;/span&gt;
    &lt;span class="na"&gt;ScaleFactor=&lt;/span&gt;&lt;span class="s"&gt;"1.1"&lt;/span&gt;
    &lt;span class="na"&gt;OpacityFactor=&lt;/span&gt;&lt;span class="s"&gt;"1"&lt;/span&gt;
    &lt;span class="na"&gt;ParticleSpeed=&lt;/span&gt;&lt;span class="s"&gt;"1"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it! You can configure the snowflake color, opacity, speed, size and amount.&lt;/p&gt;




&lt;h2&gt;
  
  
  Behind the scenes
&lt;/h2&gt;

&lt;p&gt;Under the hood, snowflakes are rendered as vectors. They are animated separately using the good old Storyboard and animations from &lt;code&gt;System.Windows.Media.Animation&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Basically, here's the simple version of the entire user control:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;//Initial snowflake transform&lt;/span&gt;
&lt;span class="n"&gt;RotateTransform&lt;/span&gt; &lt;span class="n"&gt;rotateTransform&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rotateAmount&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;ScaleTransform&lt;/span&gt; &lt;span class="n"&gt;scaleTransform&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;TranslateTransform&lt;/span&gt; &lt;span class="n"&gt;translateTransform&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;initialX&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;initialY&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;//Spawn snowflake&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;flake&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Snowflake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Generate&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;flake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RenderTransform&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;TransformGroup&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Children&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;TransformCollection&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;rotateTransform&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scaleTransform&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;translateTransform&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="n"&gt;Children&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flake&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;//Create transform animations&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;xAnimation&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;GenerateAnimation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xAmount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flake&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"RenderTransform.Children[2].X"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;yAnimation&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;GenerateAnimation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yAmount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flake&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"RenderTransform.Children[2].Y"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;rotateAnimation&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;GenerateAnimation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rotateAmount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flake&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"RenderTransform.Children[0].Angle"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;//Start the animations&lt;/span&gt;
&lt;span class="n"&gt;Storyboard&lt;/span&gt; &lt;span class="n"&gt;story&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;story&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Children&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xAnimation&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;story&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Children&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yAnimation&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;story&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Children&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rotateAnimation&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;flake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Loaded&lt;/span&gt; &lt;span class="p"&gt;+=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sender&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;story&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Begin&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;//Remove snowflake when animation stops&lt;/span&gt;
&lt;span class="n"&gt;story&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Completed&lt;/span&gt; &lt;span class="p"&gt;+=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sender&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Children&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Remove&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flake&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;WpfSnowfall is available on GitHub and licensed under the MIT license, so do whatever you want (or leave a star 😀)!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Marplex/WpfSnowfall" rel="noopener noreferrer"&gt;https://github.com/Marplex/WpfSnowfall&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>brightdatachallenge</category>
      <category>devchallenge</category>
      <category>challenge</category>
      <category>support</category>
    </item>
    <item>
      <title>How I Built Skillbit: Linktree, but for Your Skills</title>
      <dc:creator>marplex</dc:creator>
      <pubDate>Thu, 01 Sep 2022 15:59:03 +0000</pubDate>
      <link>https://dev.to/marplex/how-i-built-skillbit-linktree-but-for-your-skills-2fd9</link>
      <guid>https://dev.to/marplex/how-i-built-skillbit-linktree-but-for-your-skills-2fd9</guid>
      <description>&lt;p&gt;I read a lot of personal portfolio pages, and almost all of them had the classical "My Skills" section. I wanted to give this opportunity to everyone, that's why I've decided to build Skillbit: the easiest and fastest way to have your personal "My Skills" section on the internet.&lt;/p&gt;

&lt;p&gt;It's easier if you see it in action: skillb.it/marplex&lt;/p&gt;

&lt;p&gt;Despite my previous experience working on "indie apps" (a few years ago I built &lt;a href="https://dreambox.one" rel="noopener noreferrer"&gt;dreambox.one&lt;/a&gt;, an AI-assisted Android dream journal), Skillbit was my very first project built entirely for the web.&lt;/p&gt;

&lt;p&gt;Skillbit is a Remix React app that runs on Cloudflare Pages, written in TypeScript. I chose this tech stack because it seemed hassle-free and I wanted to test the real capabilities of running apps on the edge.&lt;/p&gt;

&lt;p&gt;The overall structure of the app is pretty simple, but there were quite a few components to set up (easily, fortunately).&lt;/p&gt;

&lt;p&gt;If there's a phrase that can describe the entire architecture, it's probably:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Minimum effort, maximum effect&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzoswaw1byxf0f2vukvzk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzoswaw1byxf0f2vukvzk.png" alt="Skillbit architecture" width="800" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Database
&lt;/h2&gt;

&lt;p&gt;First things first: the database. I used PostgreSQL, as it is extremely versatile and lets you write complex in-database functions.&lt;/p&gt;

&lt;p&gt;Since Cloudflare can only communicate with external services through HTTP(S), I had to expose my database with an API.&lt;/p&gt;

&lt;h2&gt;
  
  
  API
&lt;/h2&gt;

&lt;p&gt;A PostgREST server lets Cloudflare Workers connect and talk to the database. It was super easy to set up: I configured the required roles/permissions, and that was it.&lt;/p&gt;

&lt;h2&gt;
  
  
  postgrest-js
&lt;/h2&gt;

&lt;p&gt;I used &lt;a href="https://github.com/supabase/postgrest-js" rel="noopener noreferrer"&gt;postgrest-js&lt;/a&gt; to communicate with my PostgREST endpoint. The library is easy to use and does everything for you.&lt;/p&gt;

&lt;p&gt;Unfortunately, it is not well suited to complex queries. In those cases, I simply called database functions that encapsulate the complex flows (login, registration, adding new skills, ...).&lt;/p&gt;
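&lt;p&gt;For context, PostgREST exposes each database function as a POST endpoint under /rpc/. As a rough sketch (the base URL and the add_skill function here are hypothetical, just for illustration), such a call looks like this:&lt;/p&gt;

```typescript
// Sketch only: builds the request shape PostgREST expects for calling
// a database function. Base URL and function name are made up.
function buildRpcRequest(baseUrl: string, fn: string, args: object) {
  return {
    url: baseUrl + "/rpc/" + fn,   // PostgREST RPC convention
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(args),    // function arguments go in the JSON body
  };
}

// e.g. calling a hypothetical add_skill() function defined in the database
const req = buildRpcRequest("https://api.example.com", "add_skill", {
  user_id: 42,
  skill: "typescript",
});
```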

&lt;h2&gt;
  
  
  User management
&lt;/h2&gt;

&lt;p&gt;This was my biggest concern before starting to build Skillbit. Managing users and coding every authentication flow is a pain in the a**.&lt;/p&gt;

&lt;p&gt;Following my &lt;em&gt;"Minimum effort, maximum effect"&lt;/em&gt; principle, &lt;strong&gt;Firebase seemed an obvious choice&lt;/strong&gt;. In fact, that's what I ended up using.&lt;/p&gt;

&lt;p&gt;Of course, nothing is as easy as it seems. It turns out that the Firebase JS SDK does not work on Cloudflare Workers, only on Node.js.&lt;/p&gt;

&lt;p&gt;After hours of trying to solve this problem, in the midst of my desperation, I finally decided to build a wrapper around the Firebase REST APIs and package it as a JavaScript library.&lt;/p&gt;

&lt;p&gt;Although I had found a solution, I thought that creating this library was not in accordance with my principle... so why not make it open source and maybe save other people some time?&lt;/p&gt;

&lt;p&gt;And that's exactly what I did; you can view &lt;a href="https://github.com/Marplex/flarebase-auth" rel="noopener noreferrer"&gt;flarebase-auth&lt;/a&gt; on my GitHub profile. I also made a post that explains more about the inner workings of this library.&lt;/p&gt;


&lt;div class="ltag__link"&gt;
  &lt;a href="/marplex" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F851956%2F2b93d969-a66a-4888-92af-7c079d6a984b.jpg" alt="marplex"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="/marplex/firebase-authentication-on-cloudflare-workers-24o3" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Firebase Authentication on Cloudflare Workers&lt;/h2&gt;
      &lt;h3&gt;marplex ・ Jul 26 '22&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#firebase&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#javascript&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#typescript&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#serverless&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


&lt;h2&gt;
  
  
  React and Remix
&lt;/h2&gt;

&lt;p&gt;This was my first time using React and my first time using Remix. I just have to say that it is a joy to develop applications with these technologies. They are easy to learn, and everything seems to work the first time; it's a magical feeling.&lt;/p&gt;

&lt;p&gt;If you want to know more, I made two posts about my first time experience with React and Remix.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://levelup.gitconnected.com/i-changed-my-mind-on-react-js-4ecf4b73e14" rel="noopener noreferrer"&gt;I Changed My Mind on React.JS&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://javascript.plainenglish.io/one-month-with-remix-and-react-ba3659c299a2" rel="noopener noreferrer"&gt;One Month With Remix and React&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Give it a try!
&lt;/h2&gt;

&lt;p&gt;Now that you know how it's built, give it a try and create your Skillbit! Of course, we are developers; we find bugs everywhere. If you find one, please report it to me and I'll fix it (hopefully).&lt;/p&gt;

&lt;p&gt;Oh, and did I mention that Skillbit is completely free?&lt;/p&gt;

&lt;p&gt;I don't want to monetize this project, at least for now. Even though I might do it in the future, I would still do it ethically.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://skillb.it" rel="noopener noreferrer"&gt;https://skillb.it&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://skillb.it/maplex" rel="noopener noreferrer"&gt;https://skillb.it/marplex&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>webdev</category>
      <category>react</category>
      <category>showdev</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Foundation, Isaac Asimov and Software Engineering</title>
      <dc:creator>marplex</dc:creator>
      <pubDate>Fri, 26 Aug 2022 15:17:02 +0000</pubDate>
      <link>https://dev.to/marplex/foundation-isaac-asimov-and-software-engineering-3kd</link>
      <guid>https://dev.to/marplex/foundation-isaac-asimov-and-software-engineering-3kd</guid>
      <description>&lt;p&gt;The world of Isaac Asimov’s Foundation is incredible. He made predictions that, at first glance, seems very off and unlikely. But when you start to notice the bigger picture, you realize how these predictions are built on a foundation that is now real and solid.&lt;/p&gt;

&lt;p&gt;The fact that the universe depicted is so futuristic and so connected to today’s reality fascinates me. Isaac Asimov is a tremendous thinker; I admire all his work and his astounding long-term thinking. He is a true genius.&lt;/p&gt;

&lt;p&gt;Portable nuclear energy, memory degradation, using data to predict actions, the role of religion, economy and knowledge… It feels so fictional and so real at the same time.&lt;/p&gt;

&lt;p&gt;At first sight, this is probably unrelated to software engineering and programming. But when you look closer, Foundation is actually a great source of insights that we can apply as developers.&lt;/p&gt;

&lt;p&gt;This book is an incredible opportunity to learn, it’s like you traveled to the future and then came back to the present with precious knowledge.&lt;/p&gt;

&lt;h1&gt;
  
  
  Research and development
&lt;/h1&gt;

&lt;p&gt;Energy is an important part of the story, like many other elements, it is a fundamental resource for the success of the civilization. Foundation specifically talks about nuclear energy, but it’s totally different from what we have today.&lt;/p&gt;

&lt;p&gt;Thanks to research and development, nuclear reactors will be small, efficient and portable. And because of that, all electronic appliances will embed them as the main source of power. We will see nuclear washing machines, nuclear ovens, nuclear fridges….&lt;/p&gt;

&lt;p&gt;Continuous improvement transforms products as they become better and better. The key takeaway here is to think about what something could be, not what it is.&lt;/p&gt;

&lt;p&gt;This fits well with the current blockchain/cryptocurrency situation. Right now it is still unstable, slow and inefficient. If the fundamentals are solid, research and development will refine this raw technology and turn it into what we initially envisioned. So, in order to succeed, we just need three things: a great and solid idea, optimism, and time.&lt;/p&gt;

&lt;h1&gt;
  
  
  Preserving knowledge
&lt;/h1&gt;

&lt;p&gt;When they say “knowledge is power”, it really is. Isaac Asimov shows us how knowledge can be used to maintain or create power. He also shows us the importance of not losing it, not forgetting it.&lt;/p&gt;

&lt;p&gt;A lack of documentation and history preservation can completely erase our knowledge. What was once common sense and taken for granted will become myth and mystical magic, a legend.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9mvog3n2ahn3dj0xpfk.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9mvog3n2ahn3dj0xpfk.jpeg" alt="Image description" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That’s why writing software documentation is important. Our superpower, as humans, is working together. If we want to keep doing this, we must keep track of our knowledge, even if it seems trivial. It’s like a long-term investment; it will help in the future.&lt;/p&gt;

&lt;h1&gt;
  
  
  Data science
&lt;/h1&gt;

&lt;p&gt;Psychohistory is a fictional science in Isaac Asimov’s Foundation universe which combines history, sociology, and mathematical statistics to make general predictions about the future behavior of very large groups of people.&lt;/p&gt;

&lt;p&gt;For me, this is Asimov’s biggest and most fascinating “prediction”. If anything, he was even a bit conservative. Making predictions with AI and statistical models is now the hottest topic in tech. Right now, as I’m writing, we’re probably already using this “science”. I think marketing is what will become psychohistory. We already use data, statistics, sociology and psychology to create ad copy, drive people’s choices and predict market (large groups of people) outcomes.&lt;/p&gt;

&lt;p&gt;Again, the key takeaway here is that collecting and using data gives power. This data has to be stored and kept for the future. Any type of information will become, someday, valuable.&lt;/p&gt;

&lt;p&gt;That's why collecting logs &amp;amp; crashes and monitoring usage and user behavior (possibly anonymized) is important to keep heading in the right direction. Without this type of data, it's almost impossible to know where to focus our development and how to improve the product.&lt;/p&gt;

&lt;h1&gt;
  
  
  The butterfly effect
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5vuj7cm7xgqk7rda2rt.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5vuj7cm7xgqk7rda2rt.jpg" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the point that may summarize the entire book. There are a lot of huge time skips in Foundation, and this makes centuries feel like days. You become aware of how little things become big and how important stuff becomes irrelevant.&lt;/p&gt;

&lt;p&gt;This is the power of the butterfly effect. If you see your choices in this perspective, you will see that anything can happen. What sticks longer are the first principles, the foundations.&lt;/p&gt;

&lt;p&gt;That’s why I consider these two mental models (long-term and first-principles thinking) very powerful during decision making. &lt;/p&gt;

&lt;p&gt;Little choices at the beginning of our development journey can become strong blockers or incredible features. It's safer to work in "cycles" to make better decisions (TDD, Agile, ...).&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Isaac Asimov is a genius. He came up with these concepts and ideas in 1942, 80 years ago! It's really fascinating to understand how concepts and best practices that we use every day to develop software were already forged almost a century ago.&lt;/p&gt;

&lt;p&gt;Before watching the Apple TV series, I urge you to read the original Foundation books. They better explore all the connections, implications, causes and effects. I hope this 80-year-old science fiction universe brings you knowledge and wisdom.&lt;/p&gt;

</description>
      <category>writing</category>
      <category>computerscience</category>
      <category>sciencefiction</category>
    </item>
    <item>
      <title>Firebase Authentication on Cloudflare Workers</title>
      <dc:creator>marplex</dc:creator>
      <pubDate>Tue, 26 Jul 2022 09:40:23 +0000</pubDate>
      <link>https://dev.to/marplex/firebase-authentication-on-cloudflare-workers-24o3</link>
      <guid>https://dev.to/marplex/firebase-authentication-on-cloudflare-workers-24o3</guid>
      <description>&lt;p&gt;Firebase is super easy to use. The provided SDK is available for almost every language and platform. The one that is currently missing is the Admin SDK for the web.&lt;/p&gt;

&lt;p&gt;Actually, it is available for JavaScript, but it's built to run on Node. Some environments don't support this platform and instead rely on standard Web APIs.&lt;/p&gt;

&lt;p&gt;One of these is Cloudflare Workers. If you try to use the Node Admin SDK on these workers, it simply won't work because of missing libraries.&lt;/p&gt;

&lt;p&gt;The point is that I desperately needed it for my current personal project. I started searching the Internet for an existing solution... but nothing, zero results.&lt;/p&gt;

&lt;p&gt;So, I decided to build my own library.&lt;/p&gt;

&lt;h2&gt;
  
  
  Say hello to flarebase-auth
&lt;/h2&gt;

&lt;p&gt;As you noticed from the name of the library, it only covers the authentication part.&lt;/p&gt;

&lt;p&gt;I used standard Web APIs such as fetch() and WebCrypto. The most common thing I had to do was JWT token generation/validation. I worked with the &lt;a href="https://github.com/panva/jose" rel="noopener noreferrer"&gt;jose&lt;/a&gt; library (the only dependency in the project) because it is cross-platform and also works with the WebCrypto API.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;flarebase-auth&lt;/code&gt; is quite simple and lives mainly in two files: &lt;code&gt;google-oauth.ts&lt;/code&gt; and &lt;code&gt;flarebase-auth.ts&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;google-oauth.ts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All code related to validating and generating Google OAuth 2.0 tokens is written inside this file. Since almost every request has to be authenticated, I've used this quite extensively.&lt;/p&gt;

&lt;p&gt;Generating an OAuth 2.0 token is a two-step process. First, you sign a JWT with your Google service account private key. Then, you pass this JWT to &lt;code&gt;https://oauth2.googleapis.com/token&lt;/code&gt; and retrieve the access token. The process is implemented in the &lt;a href="https://github.com/Marplex/flarebase-auth/blob/2e3fa6705f7b053ed39ef4fe16dbf9d118fa6f15/src/lib/google-oauth.ts#L10-L45" rel="noopener noreferrer"&gt;&lt;code&gt;getAuthToken()&lt;/code&gt;&lt;/a&gt; method.&lt;/p&gt;
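&lt;p&gt;For reference, the JWT signed in the first step carries a small claim set defined by Google's service-account flow: the service account email as issuer, the token endpoint as audience, a scope, and a lifetime of at most one hour. A minimal sketch of building it (the email and scope values below are only examples):&lt;/p&gt;

```typescript
// Sketch of the claim set signed in step one, before it is handed to a
// signing library like jose. Values follow Google's service-account flow.
function buildServiceAccountClaims(serviceAccountEmail: string, scope: string) {
  const iat = Math.floor(Date.now() / 1000); // issued-at, in seconds
  return {
    iss: serviceAccountEmail,                   // who is requesting the token
    scope: scope,                               // which APIs the token may call
    aud: "https://oauth2.googleapis.com/token", // the token endpoint itself
    iat: iat,
    exp: iat + 3600,                            // max lifetime: one hour
  };
}

// Example values, not real credentials
const claims = buildServiceAccountClaims(
  "my-sa@my-project.iam.gserviceaccount.com",
  "https://www.googleapis.com/auth/identitytoolkit"
);
```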

&lt;p&gt;&lt;strong&gt;flarebase-auth.ts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where the actual core library lives. The goal is to implement every method that you would normally use with &lt;code&gt;getAuth()&lt;/code&gt; in the Firebase Admin SDK.&lt;/p&gt;

&lt;p&gt;Right now, I've written just these methods, as they are sufficient to build a basic login/sign-up system: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;createSessionCookie()&lt;/li&gt;
&lt;li&gt;verifySessionCookie()&lt;/li&gt;
&lt;li&gt;signInWithEmailAndPassword()&lt;/li&gt;
&lt;li&gt;signUpWithEmailAndPassword()&lt;/li&gt;
&lt;li&gt;changePassword()&lt;/li&gt;
&lt;li&gt;lookupUser()&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Using the library
&lt;/h2&gt;

&lt;p&gt;You may wonder, how can I use it? Here's an example, let's start by creating the FlarebaseAuth instance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;FlarebaseAuth&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;flarebase-auth&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;auth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FlarebaseAuth&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Firebase api key&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Firebase project id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;privateKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Firebase private key or service account private key&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;serviceAccountEmail&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Firebase service account email&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you're ready to do the real stuff! For example, here's how you can sign in users with email and password.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;//Sign in with username and password&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;signInWithEmailAndPassword&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;my@email.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;supersecurepassword&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;userEmail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;email&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;refreshToken&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;token&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;refreshToken&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The library is tested using a dummy Firebase project with a dummy user. Later I discovered that there's a Firebase Authentication Emulator that was made specifically for debugging purposes.&lt;br&gt;
Right now, I'll stick with the test Firebase project and continue implementing other methods. If you want to add this feature, you're more than welcome to create a pull request!&lt;/p&gt;

&lt;p&gt;&lt;code&gt;flarebase-auth&lt;/code&gt; also supports caching: you can use &lt;code&gt;CloudflareKv&lt;/code&gt; to automatically store OAuth 2.0 tokens until expiration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;FlarebaseAuth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;CloudflareKv&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;flarebase-auth&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;auth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FlarebaseAuth&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Firebase api key&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Firebase project id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;privateKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Firebase private key or service account private key&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;serviceAccountEmail&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Firebase service account email&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

  &lt;span class="na"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CloudflareKv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;NAMESPACE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
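&lt;p&gt;The snippet below is a hypothetical sketch, not CloudflareKv's real interface, but it shows the idea behind the cache: serve the stored token while it is still valid, and fall back to a fresh request once it expires.&lt;/p&gt;

```typescript
// Hypothetical sketch of "store tokens until expiration".
// Not the real CloudflareKv API; an in-memory Map stands in for KV.
class TokenCache {
  private store = new Map();

  // the clock is injectable so expiry is easy to simulate
  constructor(private now: () => number = Date.now) {}

  get(key: string): string | null {
    const entry = this.store.get(key);
    if (!entry) return null;                              // never cached
    if (entry.expiresAt > this.now()) return entry.value; // still valid
    return null;                                          // expired: refetch
  }

  put(key: string, value: string, ttlSeconds: number) {
    this.store.set(key, { value, expiresAt: this.now() + ttlSeconds * 1000 });
  }
}

let fakeNow = 0;
const cache = new TokenCache(() => fakeNow);
cache.put("oauth-token", "token-abc", 3600); // OAuth tokens last one hour

const fresh = cache.get("oauth-token"); // "token-abc", served from cache
fakeNow = 3601 * 1000;
const stale = cache.get("oauth-token"); // null, time to call Google again
```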



&lt;h2&gt;
  
  
  Next steps for &lt;code&gt;flarebase-auth&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Although I’m now successfully using this library for my current project, there are still a lot of improvements and new features to implement. Here’s a list of things I want to add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extend caching support for public keys (token validation)&lt;/li&gt;
&lt;li&gt;Implement sendEmailVerification()&lt;/li&gt;
&lt;li&gt;Implement confirmEmailVerification()&lt;/li&gt;
&lt;li&gt;Implement deleteAccount()&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;flarebase-auth&lt;/code&gt; is available on &lt;a href="https://www.npmjs.com/package/@marplex/flarebase-auth" rel="noopener noreferrer"&gt;NPM&lt;/a&gt; and &lt;a href="https://github.com/Marplex/flarebase-auth/packages/1517813" rel="noopener noreferrer"&gt;GitHub Packages&lt;/a&gt;. This project is fully open source and MIT licensed, so do whatever you want! Contributions are welcome 🥳&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/Marplex/flarebase-auth" rel="noopener noreferrer"&gt;https://github.com/Marplex/flarebase-auth&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://firebase.google.com/docs/reference/rest/auth" rel="noopener noreferrer"&gt;Firebase Auth REST API documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>firebase</category>
      <category>javascript</category>
      <category>typescript</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Only one open-source project can be saved for future humanity. Which one would you choose?</title>
      <dc:creator>marplex</dc:creator>
      <pubDate>Wed, 04 May 2022 17:10:04 +0000</pubDate>
      <link>https://dev.to/marplex/only-one-open-source-project-can-be-saved-for-future-humanity-which-one-would-you-choose-14c6</link>
      <guid>https://dev.to/marplex/only-one-open-source-project-can-be-saved-for-future-humanity-which-one-would-you-choose-14c6</guid>
      <description>&lt;p&gt;My answer? Git.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>opensource</category>
    </item>
    <item>
      <title>LiveData: Bringing the best of Android to .NET</title>
      <dc:creator>marplex</dc:creator>
      <pubDate>Mon, 25 Apr 2022 15:06:41 +0000</pubDate>
      <link>https://dev.to/marplex/livedata-bringing-the-best-of-android-to-net-f14</link>
      <guid>https://dev.to/marplex/livedata-bringing-the-best-of-android-to-net-f14</guid>
      <description>&lt;p&gt;I’ve been building Android apps for years; the development experience and community built around it is fantastic. There are a lot of open source libraries and projects from where you can learn from. Thanks to Android Jetpack and Google pushing MVVM design pattern adoption, almost every app follows the same rules and uses the same robust core libraries.&lt;br&gt;
Microsoft vs Google&lt;/p&gt;

&lt;p&gt;I can’t say the same for .NET and Microsoft. When I started working on WPF apps, I immediately felt “uncomfortable”. The community is smaller and there are far fewer open source projects; Microsoft suggests using MVVM, but you often need to break the pattern because some controls and classes are not built for it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3x22oqpruws74s6bzjmo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3x22oqpruws74s6bzjmo.png" alt="" width="640" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For years, Microsoft and its closed-source philosophy slowed down the evolution of the .NET ecosystem. Now, after shifting towards a more "open" approach, Microsoft is trying to rebuild a strong developer community around C# and the .NET framework. They're releasing more open source libraries ("CommunityToolkit" clearly signals the new strategy) and extending support to other platforms such as Linux. Nevertheless, Microsoft is still years behind the Android developer experience.&lt;/p&gt;
&lt;h2&gt;
  
  
  Notify in the multiverse of madness
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnx8rc0vbtunawo5q9yim.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnx8rc0vbtunawo5q9yim.png" alt="Doctor Strange notifying every property" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When I started to develop WPF apps and write my first view model, I encountered what I call the "Notify madness" problem. Let me explain: have a look at this view model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ViewModel&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ObservableObject&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

  &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Marco"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;Name&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;get&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;SetProperty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ref&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;value&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I find it absurd that you need six lines of code just to define a single notifiable property. Less than a year ago, Microsoft released &lt;a href="https://devblogs.microsoft.com/ifdef-windows/windows-community-toolkit-7-1-preview-release/" rel="noopener noreferrer"&gt;MvvmToolkit 7.1 Preview&lt;/a&gt;, which finally introduced source generators. Things are much better now: you just add an attribute to a field and all of that code is generated for you.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;partial&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ViewModel&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ObservableObject&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ObservableProperty&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That said, this feature only arrived less than a year ago, at a point when Android had long had LiveData and Kotlin Flow, and code generators had been used there for years to cut boilerplate.&lt;/p&gt;

&lt;p&gt;Another aspect I didn’t like about building view models was mapped properties. Every time the source property changed, you had to remember to notify every other property that depended on it. With the latest MvvmToolkit releases you can use codegen with &lt;code&gt;[AlsoNotifyChangeFor]&lt;/code&gt; to notify dependent properties automatically. But mapped properties should update by themselves: I don’t want to keep adding (and keep forgetting) the plumbing that propagates new values.&lt;/p&gt;
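&lt;p&gt;For reference, here is roughly what that looks like with the toolkit’s codegen (a sketch based on the MvvmToolkit preview attributes; the &lt;code&gt;Greeting&lt;/code&gt; property is a made-up example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;public partial class ViewModel : ObservableObject {

  // Setting Name also raises PropertyChanged for Greeting
  [ObservableProperty]
  [AlsoNotifyChangeFor(nameof(Greeting))]
  private string name;

  public string Greeting =&amp;gt; $"Hello {Name}!";

}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It works, but every dependent property has to be listed by hand on the source field, which is exactly the kind of bookkeeping that gets forgotten.&lt;/p&gt;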

&lt;p&gt;That’s why I’ve taken inspiration from &lt;a href="https://developer.android.com/topic/libraries/architecture/livedata" rel="noopener noreferrer"&gt;Android LiveData&lt;/a&gt; and built a similar library for .NET. And because I always find creative names, I’ve called it…&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynro3plcs5uupw34fzxn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynro3plcs5uupw34fzxn.png" alt="LiveData library cover image" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Bringing the best of Android to .NET&lt;/h2&gt;

&lt;p&gt;I built LiveData to simplify both plain and mapped properties, with first-class support for async operations. Anyway:&lt;/p&gt;

&lt;p&gt;Talk is cheap. Show me the code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LiveDataViewModel&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="n"&gt;LiveData&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Marco"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="n"&gt;LiveData&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;HelloMessage&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="nf"&gt;LiveDataViewModel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;HelloMessage&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;$"Hello &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;!"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, every property is defined in a single line, and the internal Value is notified automatically when it changes. Mapped properties are notified automatically too, so you can no longer ship an unfinished or broken UI just because you forgot to call notify().&lt;/p&gt;
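&lt;p&gt;To make that concrete, here is a usage sketch (assuming &lt;code&gt;LiveData&amp;lt;T&amp;gt;&lt;/code&gt; exposes a settable &lt;code&gt;Value&lt;/code&gt; property that raises change notifications, as the view model above suggests):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;var vm = new LiveDataViewModel();

// Update the source property...
vm.Name.Value = "Anna";

// ...and the mapped property follows along, notifying any bound UI.
// vm.HelloMessage.Value is now "Hello Anna!"; no manual notify() call needed.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;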

&lt;p&gt;Most of the time you will find yourself dealing with async tasks (like if you’re using &lt;a href="https://github.com/reactiveui/refit" rel="noopener noreferrer"&gt;Refit&lt;/a&gt;). LiveData automatically transforms async functions into bindable properties. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;//Map a string "SearchQuery" into an asynchronously retrieved list of users&lt;/span&gt;
&lt;span class="n"&gt;Users&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SearchQuery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;MapAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;SearchUsers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="c1"&gt;//Convert an async function to a LiveData&lt;/span&gt;
&lt;span class="n"&gt;LiveData&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;IsVisible&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncTask&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ToLiveData&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a single line, async functions can be mapped and transformed into LiveData objects that automatically update the UI. Finally, you can chain all of these transformations to create complex reactive properties in just a few lines of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;//Concatenate transformation functions&lt;/span&gt;
&lt;span class="n"&gt;FinalLiveData&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Debounce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;800&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"Hello"&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To recap, here are all the advantages of using LiveData:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One-line notifiable properties&lt;/li&gt;
&lt;li&gt;You don’t have to remember to notify after every change&lt;/li&gt;
&lt;li&gt;Mapped properties are notified automatically&lt;/li&gt;
&lt;li&gt;Seamless async Task support&lt;/li&gt;
&lt;li&gt;Easily create complex reactive properties&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;What’s next for LiveData&lt;/h2&gt;

&lt;p&gt;Although I’ve used LiveData in multiple projects, there are still a lot of improvements and new features to implement. Here’s a list of things I want to add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Better lifecycle management&lt;/li&gt;
&lt;li&gt;Better exception support&lt;/li&gt;
&lt;li&gt;Custom thread pools for async functions&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Links&lt;/h2&gt;

&lt;p&gt;LiveData is free, open source, and licensed under MIT. Contributions are welcome 🥳&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/Marplex/LiveData" rel="noopener noreferrer"&gt;https://github.com/Marplex/LiveData&lt;/a&gt;&lt;br&gt;
NuGet: &lt;a href="https://www.nuget.org/packages/LiveData/" rel="noopener noreferrer"&gt;https://www.nuget.org/packages/LiveData/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dotnet</category>
      <category>csharp</category>
      <category>wpf</category>
    </item>
  </channel>
</rss>
