<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yuan</title>
    <description>The latest articles on DEV Community by Yuan (@yuan_dev).</description>
    <link>https://dev.to/yuan_dev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2812547%2Fe6ec20e2-f957-43c3-bfb1-351e8ba13725.png</url>
      <title>DEV Community: Yuan</title>
      <link>https://dev.to/yuan_dev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yuan_dev"/>
    <language>en</language>
    <item>
      <title>RawWeb Updates: SimHash and Meilisearch</title>
      <dc:creator>Yuan</dc:creator>
      <pubDate>Sat, 26 Apr 2025 04:57:25 +0000</pubDate>
      <link>https://dev.to/yuan_dev/rawweb-updates-simhash-and-meilisearch-4po9</link>
      <guid>https://dev.to/yuan_dev/rawweb-updates-simhash-and-meilisearch-4po9</guid>
      <description>&lt;p&gt;Over the past two weeks, I've made two significant changes to &lt;a href="https://rawweb.org/" rel="noopener noreferrer"&gt;RawWeb&lt;/a&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Introduced SimHash for document deduplication.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Migrated from Elasticsearch to Meilisearch&lt;/strong&gt; to lower operational costs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The implementation went smoothly. I successfully migrated and cleaned up 56k similar documents, but I also ran into some memory and performance challenges with Meilisearch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Document Deduplication
&lt;/h2&gt;

&lt;p&gt;Previously, I simply used URLs as the unique constraint, which often led to discovering a large number of duplicate documents during maintenance.&lt;/p&gt;

&lt;p&gt;Common reasons for duplication:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Non-standardized URLs

&lt;ul&gt;
&lt;li&gt;Inconsistent case sensitivity&lt;/li&gt;
&lt;li&gt;Inconsistent trailing slashes&lt;/li&gt;
&lt;li&gt;Useless query parameters&lt;/li&gt;
&lt;li&gt;Different number or order of query parameters&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;A single blog having multiple domains&lt;/li&gt;

&lt;li&gt;Blogs changing their path structure, e.g., from &lt;code&gt;/blog/1&lt;/code&gt; to &lt;code&gt;/posts/1&lt;/code&gt;
&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;URL-based deduplication is simple to implement but has significant limitations and can't cover all edge cases. To solve the duplication problem fundamentally, we need to work with the content itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  SimHash
&lt;/h3&gt;

&lt;p&gt;SimHash is a locality-sensitive hashing algorithm. It reflects the features of different parts of a text and allows for efficient similarity assessment using the &lt;a href="https://en.wikipedia.org/wiki/Hamming_distance" rel="noopener noreferrer"&gt;Hamming distance&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For example, consider the following two strings. Their SimHash values differ by only a few bits, while their md5 hashes are completely different.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;data&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;simhash&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;md5&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;i think &lt;strong&gt;a&lt;/strong&gt; is the best&lt;/td&gt;
&lt;td&gt;10000110110000000000000000101001&lt;/td&gt;
&lt;td&gt;846ff6bebe901ead008e9c0e01a87470&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;i think &lt;strong&gt;b&lt;/strong&gt; is the best&lt;/td&gt;
&lt;td&gt;10000010110000000000100001101010&lt;/td&gt;
&lt;td&gt;ba1d2dc00d0a23dbb2001d570f03fb19&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Compared to other text similarity calculation methods, Simhash's advantages are its lightweight computation, storage, and comparison. It also has a strong endorsement: Google's web crawler uses SimHash to identify duplicate web pages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Calculating Hash and Hamming Distance
&lt;/h3&gt;

&lt;p&gt;Implementing SimHash is simple. I relied entirely on Claude Sonnet 3.5 to implement the basic SimHash and Hamming distance calculations, then used test cases to evaluate the results.&lt;/p&gt;

&lt;p&gt;Key points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use a 64-bit hash value. This is a balanced length.&lt;/li&gt;
&lt;li&gt;Use the fnv hash algorithm. It's simple, efficient, and distributes well.&lt;/li&gt;
&lt;li&gt;Split the hash value into four 16-bit segments for database storage. This provides more flexibility when comparing Hamming distances.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;simhash&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"backend/core/pkgs/simhash/tokenizer"&lt;/span&gt;
    &lt;span class="s"&gt;"hash/fnv"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="c"&gt;// HASH_BITS represents the number of bits in a SimHash value&lt;/span&gt;
    &lt;span class="n"&gt;HASH_BITS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;64&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// CalculateSimHash calculates the SimHash value of a text&lt;/span&gt;
&lt;span class="c"&gt;// SimHash is a hashing algorithm used for calculating text similarity, effectively detecting the degree of similarity between texts&lt;/span&gt;
&lt;span class="c"&gt;// Algorithm steps:&lt;/span&gt;
&lt;span class="c"&gt;// 1. Tokenize the text&lt;/span&gt;
&lt;span class="c"&gt;// 2. Calculate the hash value for each token&lt;/span&gt;
&lt;span class="c"&gt;// 3. Merge all token hash values into a feature vector&lt;/span&gt;
&lt;span class="c"&gt;// 4. Derive the final SimHash value from the feature vector&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;CalculateSimHash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;uint64&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;words&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HASH_BITS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// Calculate hash for each token and update weights&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;words&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;getHash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c"&gt;// Update weights based on each bit position&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;HASH_BITS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="kt"&gt;uint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Generate the final simhash value&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;simhash&lt;/span&gt; &lt;span class="kt"&gt;uint64&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;HASH_BITS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;simhash&lt;/span&gt; &lt;span class="o"&gt;|=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="kt"&gt;uint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;simhash&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// getHash calculates the hash value of a string&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;getHash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;uint64&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;fnv&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New64a&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Write&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sum64&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// HammingDistance calculates the Hamming distance between two simhash values&lt;/span&gt;
&lt;span class="c"&gt;// Hamming distance is the number of different characters at corresponding positions in two equal-length strings&lt;/span&gt;
&lt;span class="c"&gt;// In SimHash, a smaller Hamming distance indicates higher similarity between two texts&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;HammingDistance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hash2&lt;/span&gt; &lt;span class="kt"&gt;uint64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;xor&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;hash1&lt;/span&gt; &lt;span class="o"&gt;^&lt;/span&gt; &lt;span class="n"&gt;hash2&lt;/span&gt;
    &lt;span class="n"&gt;distance&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;

    &lt;span class="c"&gt;// Count the number of different bits (Brian Kernighan algorithm, better performance)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;xor&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;distance&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;
        &lt;span class="n"&gt;xor&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;=&lt;/span&gt; &lt;span class="n"&gt;xor&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// SplitSimHash splits a simhash value into four 16-bit parts&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;SplitSimHash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt; &lt;span class="kt"&gt;uint64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;uint16&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;uint16&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;uint16&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;48&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="m"&gt;0xFFFF&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="kt"&gt;uint16&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="m"&gt;0xFFFF&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="kt"&gt;uint16&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="m"&gt;0xFFFF&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="kt"&gt;uint16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="m"&gt;0xFFFF&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// MergeSimHash merges four 16-bit parts into a single simhash value&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;MergeSimHash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;uint16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;uint64&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;48&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kt"&gt;uint64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// IsSimilar determines whether two texts are similar&lt;/span&gt;
&lt;span class="c"&gt;// threshold represents the similarity threshold, typically a value between 3-10&lt;/span&gt;
&lt;span class="c"&gt;// Returns true if the two texts are similar, false if not&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;IsSimilar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hash2&lt;/span&gt; &lt;span class="kt"&gt;uint64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;HammingDistance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hash2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="n"&gt;pgsql&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;hamming_distance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;simhash1_1&lt;/span&gt; &lt;span class="nb"&gt;SMALLINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;simhash1_2&lt;/span&gt; &lt;span class="nb"&gt;SMALLINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;simhash1_3&lt;/span&gt; &lt;span class="nb"&gt;SMALLINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;simhash1_4&lt;/span&gt; &lt;span class="nb"&gt;SMALLINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;simhash2_1&lt;/span&gt; &lt;span class="nb"&gt;SMALLINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;simhash2_2&lt;/span&gt; &lt;span class="nb"&gt;SMALLINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;simhash2_3&lt;/span&gt; &lt;span class="nb"&gt;SMALLINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;simhash2_4&lt;/span&gt; &lt;span class="nb"&gt;SMALLINT&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;RETURNS&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;
    &lt;span class="n"&gt;PARALLEL&lt;/span&gt; &lt;span class="n"&gt;SAFE&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt;
    &lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="n"&gt;bit_count&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;simhash1_1&lt;/span&gt; &lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="n"&gt;simhash2_1&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;BIT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
           &lt;span class="n"&gt;bit_count&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;simhash1_2&lt;/span&gt; &lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="n"&gt;simhash2_2&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;BIT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
           &lt;span class="n"&gt;bit_count&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;simhash1_3&lt;/span&gt; &lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="n"&gt;simhash2_3&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;BIT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
           &lt;span class="n"&gt;bit_count&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;simhash1_4&lt;/span&gt; &lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="n"&gt;simhash2_4&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;BIT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="err"&gt;$$&lt;/span&gt; &lt;span class="k"&gt;LANGUAGE&lt;/span&gt; &lt;span class="n"&gt;plpgsql&lt;/span&gt; &lt;span class="k"&gt;IMMUTABLE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Further Optimizing Tokenization
&lt;/h3&gt;

&lt;p&gt;The tokens generated by the tokenizer are the basic units for calculating SimHash. The higher the quality of tokenization, the more accurately SimHash can reflect the document's features.&lt;/p&gt;

&lt;p&gt;I tested several types of tokenizers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Based on language features

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/meilisearch/charabia" rel="noopener noreferrer"&gt;Charabia&lt;/a&gt; works quite well, maintained by the Meilisearch team.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/go-ego/gse" rel="noopener noreferrer"&gt;gse&lt;/a&gt; seems sufficient for Chinese tokenization based on test cases, but the overall experience isn't as good as Charabia.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Whitespace: Suitable for languages like English that use spaces as separators, but requires additional implementation for normalization, stopword removal, etc.&lt;/li&gt;

&lt;li&gt;Unicode: Intended as a tokenizer for CJK languages, but the tokenization quality was not ideal.&lt;/li&gt;

&lt;li&gt;N-gram: Considered as a general-purpose tokenizer, but the quality fluctuates significantly.&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Overall, Charabia produced the best results. However, it's a Rust project, while the RawWeb backend stack is Go. This requires using CGO to call Charabia (calling via Go's &lt;code&gt;exec&lt;/code&gt; package is at least 10x slower than calling via CGO), which introduces cross-compilation complexity.&lt;/p&gt;

&lt;p&gt;I'm not familiar with Rust or CGO, so most of the following code was generated by Claude Sonnet 3.5/3.7, with some adjustments based on the actual situation.&lt;/p&gt;

&lt;p&gt;First, expose Charabia's &lt;code&gt;Tokenize&lt;/code&gt; method with a simple Rust function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;charabia&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Tokenize&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;libc&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="nb"&gt;c_char&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;serde_json&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;ffi&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;CStr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CString&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;tokenize_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;input&lt;/span&gt;
        &lt;span class="nf"&gt;.tokenize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;.filter&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="nf"&gt;.is_word&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="nf"&gt;.map&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="nf"&gt;.lemma&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.trim&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="nf"&gt;.filter&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="nf"&gt;.is_empty&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="nf"&gt;.collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="cd"&gt;/// Tokenizes the input string and returns a JSON string containing the tokens&lt;/span&gt;
&lt;span class="cd"&gt;///&lt;/span&gt;
&lt;span class="cd"&gt;/// # Safety&lt;/span&gt;
&lt;span class="cd"&gt;///&lt;/span&gt;
&lt;span class="cd"&gt;/// This function is unsafe because it deals with raw pointers&lt;/span&gt;
&lt;span class="nd"&gt;#[no_mangle]&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;unsafe&lt;/span&gt; &lt;span class="k"&gt;extern&lt;/span&gt; &lt;span class="s"&gt;"C"&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="nb"&gt;c_char&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="nb"&gt;c_char&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// C stuff ...&lt;/span&gt;

    &lt;span class="c1"&gt;// Tokenize the input&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenize_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_str&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// C stuff ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cargo.toml:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# ...&lt;/span&gt;

&lt;span class="nn"&gt;[lib]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"charabia_rs"&lt;/span&gt;
&lt;span class="py"&gt;crate-type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"cdylib"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"staticlib"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nn"&gt;[dependencies]&lt;/span&gt;
&lt;span class="py"&gt;charabia&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="py"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"0.9.3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="py"&gt;default-features&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="py"&gt;features&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="s"&gt;"chinese-segmentation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c"&gt;# disable chinese-normalization (https://github.com/meilisearch/charabia/issues/331)&lt;/span&gt;
    &lt;span class="c"&gt;#"german-segmentation",&lt;/span&gt;
    &lt;span class="s"&gt;"japanese"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For local testing, just run &lt;code&gt;cargo build --release&lt;/code&gt;. But cross-platform compilation is much more complicated. Fortunately, the &lt;a href="https://ziglang.org/" rel="noopener noreferrer"&gt;Zig&lt;/a&gt; toolchain greatly simplifies C cross-compilation, eliminating the need for musl libc!&lt;/p&gt;

&lt;p&gt;Install Zig and &lt;a href="https://github.com/rust-cross/cargo-zigbuild" rel="noopener noreferrer"&gt;zigbuild&lt;/a&gt;, then compile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo zigbuild &lt;span class="nt"&gt;--release&lt;/span&gt; &lt;span class="nt"&gt;--target&lt;/span&gt; aarch64-unknown-linux-gnu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After compiling the Rust code into a &lt;code&gt;.so&lt;/code&gt; file, call its exported method in RawWeb. Need to configure linking to the correct &lt;code&gt;.so&lt;/code&gt; during cross-compilation and loading the &lt;code&gt;.so&lt;/code&gt; file from &lt;code&gt;./lib&lt;/code&gt; when the application starts online:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// #cgo linux,amd64 LDFLAGS: -L${SRCDIR}/charabia-rs/target/x86_64-unknown-linux-gnu/release -lcharabia_rs -Wl,-rpath,./lib&lt;/span&gt;
&lt;span class="c"&gt;// #cgo linux,arm64 LDFLAGS: -L${SRCDIR}/charabia-rs/target/aarch64-unknown-linux-gnu/release -lcharabia_rs -Wl,-rpath,./lib&lt;/span&gt;
&lt;span class="c"&gt;// #cgo LDFLAGS: -L${SRCDIR}/charabia-rs/target/release -lcharabia_rs&lt;/span&gt;
&lt;span class="c"&gt;// #include &amp;lt;stdlib.h&amp;gt;&lt;/span&gt;
&lt;span class="c"&gt;// #include &amp;lt;stdint.h&amp;gt;&lt;/span&gt;
&lt;span class="c"&gt;//&lt;/span&gt;
&lt;span class="c"&gt;// typedef void* charabia_result_t;&lt;/span&gt;
&lt;span class="c"&gt;//&lt;/span&gt;
&lt;span class="c"&gt;// extern char* tokenize(const char* input);&lt;/span&gt;
&lt;span class="c"&gt;// extern void free_tokenize_result(char* ptr);&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="s"&gt;"C"&lt;/span&gt;

&lt;span class="c"&gt;// Tokenize tokenizes the given text using the Rust implementation via cgo&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;Tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// C stuff ...&lt;/span&gt;

    &lt;span class="n"&gt;cResult&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cText&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// C stuff ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also using Zig for cross-compiling the Go code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;CGO_ENABLED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="nv"&gt;GOOS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;linux &lt;span class="nv"&gt;GOARCH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;arm64 &lt;span class="nv"&gt;CC&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"zig cc -target aarch64-linux"&lt;/span&gt; &lt;span class="nv"&gt;CXX&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"zig c++ -target aarch64-linux"&lt;/span&gt; go build ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, during deployment, place &lt;code&gt;libcharabia_rs.so&lt;/code&gt; into the &lt;code&gt;./lib/&lt;/code&gt; directory so it can be loaded.&lt;/p&gt;

&lt;h3&gt;
  
  
  Filtering Similar Content
&lt;/h3&gt;

&lt;p&gt;According to resources, a Hamming distance of less than 3 for a 64-bit SimHash can generally identify similar content.&lt;/p&gt;

&lt;p&gt;However, due to limitations in tokenization quality, content length, etc., I observed false positives even with a Hamming distance threshold of 1 in my test cases. Additionally, my server has low specs. Calculating and comparing the Hamming distance for one document against 700k records takes about 1.2 seconds. At this rate, a full comparison would take 10 days, which is unacceptable.&lt;/p&gt;

&lt;p&gt;Therefore, for now, I only filtered documents with identical hash values. This avoids calculating Hamming distance and allows the search to use database indexes, making it very fast. Ultimately, I cleaned up 56,000 similar documents. This number was much higher than I expected. Given that I encountered SimHash collisions during testing, I reasonably suspect there might be quite a few false positives among them. Further optimization of tokenization and token weighting is needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://moz.com/devblog/near-duplicate-detection" rel="noopener noreferrer"&gt;Near-Duplicate Detection&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://ben-whitmore.com/simhash-and-solving-the-hamming-distance-problem-explained/" rel="noopener noreferrer"&gt;SimHash and solving the hamming distance problem: explained&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/33026.pdf" rel="noopener noreferrer"&gt;Detecting Near-Duplicates for Web Crawling - Google Research&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Migrating to Meilisearch
&lt;/h2&gt;

&lt;p&gt;The full-text search engine I previously used was Elasticsearch. It's feature-rich and battle-tested in countless production environments.&lt;/p&gt;

&lt;p&gt;As RawWeb's features and data volume stabilized, I realized I wasn't using most of Elasticsearch's capabilities, yet I still had to bear the extra operational costs (actually, it had been very stable since deployment, but if such a behemoth encountered problems one day, I wasn't sure if I had the ability or energy to fix them). Also, the &lt;code&gt;elasticsearch-go&lt;/code&gt; client is very difficult to use.&lt;/p&gt;

&lt;p&gt;Meilisearch is a more lightweight alternative, with most features working out of the box. The migration process was very smooth, although a few unexpected issues popped up.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multilingual Documents
&lt;/h3&gt;

&lt;p&gt;Referencing &lt;a href="https://w3techs.com/technologies/overview/content_language" rel="noopener noreferrer"&gt;W3Techs' statistics on content languages on the internet&lt;/a&gt;, RawWeb specifically tags content in English, Chinese, German, French, Spanish, Russian, and Japanese to enable filtering search results by language.&lt;/p&gt;

&lt;p&gt;In Elasticsearch, I used separate fields like &lt;code&gt;content_en&lt;/code&gt;, &lt;code&gt;content_zh&lt;/code&gt;, etc., with dedicated tokenizers. Theoretically, this step could be simplified in Meilisearch because it can automatically detect content language. However, I ended up splitting the content into multiple indexes because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RawWeb's existing natural language detection module can automatically sample based on document title and content length, switching precision modes automatically, which is more efficient than full-text detection.&lt;/li&gt;
&lt;li&gt;To filter search results by language, I need to add a &lt;code&gt;lang&lt;/code&gt; field in Meilisearch to mark the document's language. So, besides Meilisearch, I still need to perform natural language detection once.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Splitting different languages into separate indexes also aligns with Meilisearch's official recommendations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Issue 1: High Storage Space Usage
&lt;/h3&gt;

&lt;p&gt;The PostgreSQL database size is about 2.4GB. After importing documents, the Meilisearch database size grew to about 23GB (with &lt;code&gt;searchableAttributes&lt;/code&gt; and &lt;code&gt;filterableAttributes&lt;/code&gt; configured correctly).&lt;/p&gt;

&lt;p&gt;Initially, I didn't realize the implication about disk usage mentioned in the &lt;a href="https://www.meilisearch.com/docs/learn/engine/storage#measured-disk-usage" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;, which led to the hard drive filling up. Fortunately, hard drive space is the cheapest cloud resource, so expanding it wasn't expensive.&lt;/p&gt;

&lt;p&gt;Besides this, there's another potential issue: Meilisearch doesn't release disk space after deleting documents (&lt;a href="https://www.meilisearch.com/docs/learn/engine/storage#database-size" rel="noopener noreferrer"&gt;docs&lt;/a&gt;). Reclaiming space might require using snapshots (&lt;a href="https://github.com/meilisearch/meilisearch/discussions/3156#discussioncomment-4262368" rel="noopener noreferrer"&gt;related discussion&lt;/a&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  Issue 2: Memory Usage Limit Ineffective
&lt;/h3&gt;

&lt;p&gt;Meilisearch is deployed on a low-spec server with 2 vCPUs and 4GB RAM. Since Elasticsearch previously ran fine on a server with the same configuration, I assumed Meilisearch would be smooth sailing too. After indexing all documents, I went to sleep peacefully (I later realized they were probably just queued in Meilisearch's task queue at that time).&lt;/p&gt;

&lt;p&gt;I woke up to find the server's CPU maxed out and disk read speeds exceeding 1GB/s, causing the entire system to freeze. After a forced reboot, I checked the system logs and the only abnormality found was an OOM error from the Meilisearch container. I then used &lt;code&gt;MEILI_MAX_INDEXING_MEMORY&lt;/code&gt; to limit indexing memory usage to 2GB. However, the next day, it experienced OOM and maxed-out CPU again.&lt;/p&gt;

&lt;p&gt;Looking through the documentation, I found the &lt;code&gt;MEILI_EXPERIMENTAL_REDUCE_INDEXING_MEMORY_USAGE&lt;/code&gt; parameter. Although experimental, I tried it and found it worked really well. CPU and disk I/O were no longer aggressive.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faez174je5chk114xtq3a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faez174je5chk114xtq3a.png" alt="monitor" width="800" height="327"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Issue 3: Very Slow Document Deletion Leading to Task Backlog
&lt;/h3&gt;

&lt;p&gt;To clean up documents in Meilisearch that were deleted from the database, the operation for each batch during synchronization was:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Delete documents in Meilisearch within the range &lt;code&gt;id &amp;gt;= ? AND id &amp;lt;= ?&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; Add new documents.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After deploying, data synchronization ran into problems. Investigation revealed that Meilisearch had accumulated 133k tasks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/tasks?statuses=enqueued,processing&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"results"&lt;/span&gt;&lt;span class="p"&gt;:[{&lt;/span&gt;&lt;span class="nl"&gt;"uid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;16354&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"batchUid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"indexUid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"items_es"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"enqueued"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"documentAdditionOrUpdate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"canceledBy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"details"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"receivedDocuments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"indexedDocuments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="nl"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"duration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"enqueuedAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"2025-04-12T04:59:27.183657254Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"startedAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"finishedAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="nl"&gt;"total"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;13385&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"limit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"from"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;16354&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;16334&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Observing task execution, I found that document deletion operations were extremely slow. A range deletion involving up to 1,000 documents took nearly 20 minutes. Furthermore, because deletion and addition operations were interspersed during synchronization, Meilisearch couldn't automatically merge adjacent tasks.&lt;/p&gt;

&lt;p&gt;I couldn't find similar issues online; this seems to be a problem unique to my setup. I suspect it's because of the limited memory and the fact that Hetzner's expanded storage performance is only 1/10th of the original disk performance (&lt;a href="https://pcr.cloud-mercato.com/providers/hetzner/flavors/cpx11/performance/storage-bandwidth" rel="noopener noreferrer"&gt;benchmark&lt;/a&gt;). I'll retest this once I get sponsorship funds to upgrade to a higher-spec server.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The main goals of this update have been achieved. I will continue debugging and optimizing.&lt;/p&gt;

&lt;p&gt;Additionally, the implementation process exposed some operational issues, such as excessive downtime and lack of server resource alerts (I removed New Relic monitoring during the last refactor). These will be targets for the next round of optimization.&lt;/p&gt;

</description>
      <category>go</category>
      <category>meilisearch</category>
      <category>rawweb</category>
    </item>
  </channel>
</rss>
