<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Arian Stolwijk</title>
    <description>The latest articles on DEV Community by Arian Stolwijk (@arian).</description>
    <link>https://dev.to/arian</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F152656%2F7b0bcf65-7984-4f86-b0d8-3d281361f26c.jpg</url>
      <title>DEV Community: Arian Stolwijk</title>
      <link>https://dev.to/arian</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/arian"/>
    <language>en</language>
    <item>
      <title>Our web performance got worse by switching from Cloudflare Proxy to AWS CloudFront</title>
      <dc:creator>Arian Stolwijk</dc:creator>
      <pubDate>Fri, 07 Feb 2025 15:48:50 +0000</pubDate>
      <link>https://dev.to/arian/cloudfront-vs-cloudflare-proxy-latency-4125</link>
      <guid>https://dev.to/arian/cloudfront-vs-cloudflare-proxy-latency-4125</guid>
      <description>&lt;p&gt;In an effort to improve worldwide, multi-region latency, I set up CloudFront in front of our Application Load Balancer in AWS.&lt;/p&gt;

&lt;p&gt;Previously we put the public DNS name of the ALB (in Europe) in a CNAME record on Cloudflare and enabled the (free) Cloudflare Proxy. But we noticed higher latency from the US and Asia. So our idea was to use CloudFront instead, as everything would stay inside AWS and we would have more tools to tweak the caching policies of different pages.&lt;/p&gt;

&lt;p&gt;Surprisingly, the first result we got is that CloudFront is slower than the basic Cloudflare setup, especially from the US and Sydney!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqj7y11xjl3w823evjts.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqj7y11xjl3w823evjts.png" alt="Page load times from different locations" width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We measured this with automated headless browsers in different regions that visit the homepage and report &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Performance_API" rel="noopener noreferrer"&gt;Performance API&lt;/a&gt; statistics, using &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html" rel="noopener noreferrer"&gt;CloudWatch Synthetics Canaries&lt;/a&gt;.&lt;/p&gt;
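&lt;p&gt;As an illustration of what we do with the reported statistics, here is a minimal sketch that aggregates time-to-first-byte samples per region. The region names and numbers are made up for the example, not our real measurements.&lt;/p&gt;

```python
# Sketch: summarize Performance API navigation timings collected by the
# canaries per region. The numbers below are made-up illustrations.
from statistics import median, quantiles

# ttfb_ms[region] = list of timeToFirstByte samples reported by a canary
ttfb_ms = {
    "eu-west-1": [120, 135, 128, 140, 122],
    "us-east-1": [310, 295, 330, 305, 290],
    "ap-southeast-2": [480, 455, 510, 470, 495],
}

for region, samples in ttfb_ms.items():
    p50 = median(samples)
    p90 = quantiles(samples, n=10)[-1]  # 90th percentile estimate
    print(f"{region}: p50={p50:.0f}ms p90={p90:.0f}ms")
```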

&lt;p&gt;We're still looking for settings to improve global website performance. At the very least, smart caching, as explained in &lt;a href="https://lucvandonkersgoed.com/2024/11/16/making-aws-news-stupid-fast-with-smart-caching/" rel="noopener noreferrer"&gt;Making AWS News stupid fast with smart caching&lt;/a&gt;, is on our roadmap.&lt;/p&gt;

&lt;p&gt;Let me know if you have any ideas!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>webperf</category>
    </item>
    <item>
      <title>Building a Recommender System using vector databases</title>
      <dc:creator>Arian Stolwijk</dc:creator>
      <pubDate>Sun, 21 Jan 2024 10:23:51 +0000</pubDate>
      <link>https://dev.to/arian/vector-database-recommender-system-4800</link>
      <guid>https://dev.to/arian/vector-database-recommender-system-4800</guid>
      <description>&lt;p&gt;Dense vector embeddings make it easy to find similar documents in a dataset. Given a document in the embedding space, just look at the documents that are close to it: they are probably related. This is called a k-nearest neighbors (KNN) search. Classic recommender systems such as collaborative filtering require a lot of user data and training, but with pre-trained embedding models you don't need any training to recommend related documents.&lt;/p&gt;

&lt;p&gt;This works well for item-to-item recommendations. But what if you want to recommend something to a specific user? For user-to-items recommendations you need some interaction data again. One strategy is to build a user-embedding from the mean of the embeddings of the documents the user interacted with. A KNN search with this user-embedding will return the results the user is most likely interested in. Erika Cardenas explained this in further detail in her &lt;a href="https://haystackconf.com/us2023/talk-20/"&gt;Haystack US 2023 talk&lt;/a&gt;. If you already have a vector database, the user quickly gets nice recommendations after just a few clicks, without a lot of extra systems.&lt;/p&gt;
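&lt;p&gt;A minimal numpy sketch of this user-embedding idea, with toy 4-dimensional vectors (a real system would use model-generated embeddings and a vector database):&lt;/p&gt;

```python
# Sketch of the user-embedding idea: average the embeddings of the
# documents a user interacted with, then do a brute-force cosine KNN.
# Toy 4-dimensional vectors for illustration only.
import numpy as np

doc_embeddings = np.array([
    [1.0, 0.0, 0.0, 0.0],   # doc 0
    [0.9, 0.1, 0.0, 0.0],   # doc 1 (similar to doc 0)
    [0.0, 0.0, 1.0, 0.0],   # doc 2 (unrelated)
])

clicked = [0, 1]                       # documents the user interacted with
user_emb = doc_embeddings[clicked].mean(axis=0)

# cosine similarity of the user embedding against every document
norms = np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(user_emb)
scores = doc_embeddings @ user_emb / norms
ranking = np.argsort(-scores)          # best match first
print(ranking)
```

&lt;p&gt;The brute-force cosine ranking stands in for what the vector database's KNN query would do.&lt;/p&gt;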

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2on2bh59wa5ovagwio0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2on2bh59wa5ovagwio0.png" alt="user embeddings" width="800" height="545"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The problem I'm trying to solve goes a bit deeper. I've got a lot of documents that have a certain label (e.g. the brand of a product, a color, a language, the author of an article). When the user selects that label we want to show the most interesting documents. But at the moment we only know whether a document has the label or not, which doesn't say anything about how we should rank those documents. Also, the documents are refreshed regularly and the number of documents is quite large compared to the number of clicks, so an individual document only gets a few clicks. What we want is the best documents ranked first. Using a similar approach to the user-embedding, we can create a 'label-embedding' from the documents users clicked while searching in this category.&lt;/p&gt;

&lt;p&gt;One problem with this, contrary to user-embeddings, which can be limited to averaging the last few document embeddings the user clicked, is that we would eventually take the average embedding of many clicked documents. Does that average actually represent the most popular type of document, or is it some 'random' point in the embedding space?&lt;/p&gt;

&lt;p&gt;Instead, in the list of all clicked documents there are probably a few clusters of related documents. The idea is that instead of one average embedding we get a few embeddings, one per cluster; for each embedding we do a KNN search and finally combine the results.&lt;/p&gt;

&lt;h2&gt;Implementation&lt;/h2&gt;

&lt;p&gt;From a high level the pipeline looks as shown in this image. First we collect the relevant click logs to see which documents were clicked, then we fetch the embeddings for those documents and create clusters. For each cluster we do a KNN search to find more related documents (even documents that don't have any interactions yet), combine these lists into a single final list, and present that as the ranking to the user.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7qlrjfo6mopc6tw0a0v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7qlrjfo6mopc6tw0a0v.png" alt="Pipeline" width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Clustering&lt;/h3&gt;

&lt;p&gt;The idea of clustering dense embedding vectors closely resembles what BERTopic does, as described in &lt;a href="https://arxiv.org/pdf/2203.05794.pdf"&gt;BERTopic: Neural topic modeling with a class-based TF-IDF procedure&lt;/a&gt; by Maarten Grootendorst. From documents it creates embeddings using an embedding model, clusters the embeddings, and finally extracts a topic description for each cluster. The first and last steps are not necessary for us: we already have the embeddings, and we don't really need to label the topics. But we can look at how it creates clusters from the embeddings.&lt;/p&gt;

&lt;p&gt;It uses two steps: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dimension reduction&lt;/strong&gt; is necessary to prevent the &lt;em&gt;curse of dimensionality&lt;/em&gt;. Dense vectors have 384, 768, or sometimes more than 4000 dimensions, which is too much for clustering algorithms to work well. Dimension reduction projects them down to a much lower number of dimensions, such as 5 or 10, that the clustering algorithm can handle well. If you reduce further, to 2 or 3 dimensions, you can also render the vectors in a 2- or 3-dimensional graph. BERTopic uses &lt;a href="https://umap-learn.readthedocs.io/en/latest/"&gt;UMAP&lt;/a&gt; over other methods such as PCA or t-SNE because UMAP has been shown to preserve more of the local and global features of high-dimensional data in the projected lower dimensions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Clustering&lt;/strong&gt;. For clustering, BERTopic uses the &lt;a href="https://hdbscan.readthedocs.io/en/latest/"&gt;HDBSCAN&lt;/a&gt; algorithm, which assigns a cluster label to each (reduced) embedding. Contrary to the KMeans algorithm, the number of clusters is dynamic. Compared to the DBSCAN algorithm, it works better when the (reduced) embeddings are not homogeneously distributed in the embedding space, so it can also create distinct clusters in dense areas.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After this process we have cluster labels for all clicked documents, and we can use them to create the 'cluster-embeddings' that we will use as center points for our KNN search. I found two methods:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Find the &lt;strong&gt;centroid embedding&lt;/strong&gt;. This is possible when using the &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans"&gt;KMeans clustering algorithm&lt;/a&gt;. However, the clustering is done on the reduced-dimension vectors, so the centroid would need to be mapped back to an original document (embedding).&lt;/li&gt;
&lt;li&gt;Take the &lt;strong&gt;mean of all embeddings&lt;/strong&gt; that fall in the cluster. As HDBSCAN can create non-linear clusters (e.g. an inner and an outer circle as two clusters), taking the mean seems a bit odd. But as the means are taken over the original high-dimensional vectors, it works OK in practice.&lt;/li&gt;
&lt;/ol&gt;
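&lt;p&gt;A small sketch of the second method, assuming the HDBSCAN labels have already been computed (they are hard-coded here; label -1 is HDBSCAN's noise bucket): the cluster-embedding is the mean of the original high-dimensional embeddings per cluster.&lt;/p&gt;

```python
# Sketch of method 2: given cluster labels from HDBSCAN (hard-coded here),
# compute a cluster-embedding as the mean of the ORIGINAL high-dimensional
# document embeddings in each cluster. Label -1 is HDBSCAN's noise bucket.
import numpy as np

embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
    [0.5, 0.5, 0.5],
])
labels = np.array([0, 0, 1, 1, -1])    # pretend HDBSCAN output

cluster_embeddings = {
    label: embeddings[labels == label].mean(axis=0)
    for label in set(labels.tolist()) if label != -1
}
for label, emb in sorted(cluster_embeddings.items()):
    print(label, np.round(emb, 2))
```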

&lt;h3&gt;KNN Search&lt;/h3&gt;

&lt;p&gt;For the KNN search we use a vector database. A vector database stores the documents with their embeddings and provides a way to query them. A naive way to find the documents most similar to a given vector would be to calculate the similarity of every document with the query vector and take the &lt;em&gt;k&lt;/em&gt; vectors with the highest similarity. This scales linearly and isn't suited for bigger datasets. Vector databases implement Approximate Nearest Neighbor (ANN) algorithms, which make querying much more efficient (at the cost of being &lt;em&gt;approximate&lt;/em&gt;). The most popular algorithm is Hierarchical Navigable Small World (HNSW), which is implemented in vector databases like Qdrant, Weaviate, and &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html"&gt;Elasticsearch&lt;/a&gt; (which we use).&lt;/p&gt;

&lt;p&gt;Note that the vector database needs to support some form of filtering, as the results must match the initial search constraints (in our case, the document must have the label).&lt;/p&gt;

&lt;p&gt;The top-k documents closest to the cluster-embedding are retrieved with the KNN search. These are, however, the top-k closest to that single embedding, and may not be representative of the entire space of the cluster. Especially with e-commerce products there are a lot of variants of the same product, while it might be more relevant to get a diverse set of recommendations for this cluster. The &lt;a href="https://www.cs.cmu.edu/~jgc/publication/The_Use_MMR_Diversity_Based_LTMIR_1998.pdf"&gt;Maximal Marginal Relevance&lt;/a&gt; (MMR) algorithm can diversify the ranking of the retrieved documents. It iteratively moves documents from the original ranked list into a new list, picking the document that best matches the cluster-embedding (or, more generally, the query vector) while penalizing documents that are similar to those already in the resulting list: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71t6jd3q5wxmlv14q2ma.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71t6jd3q5wxmlv14q2ma.png" alt="MMR" width="800" height="103"&gt;&lt;/a&gt;&lt;/p&gt;
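&lt;p&gt;A minimal sketch of MMR with numpy and cosine similarity (the lambda weight and toy vectors are illustrative choices, not from the paper):&lt;/p&gt;

```python
# Sketch of Maximal Marginal Relevance: iteratively pick the candidate
# that matches the query vector best while penalizing similarity to the
# documents already selected. lam trades relevance vs. diversity.
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr(query, candidates, k=2, lam=0.5):
    relevance = [cos(query, c) for c in candidates]
    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) != k:
        def score(i):
            penalty = max((cos(candidates[i], candidates[j]) for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * penalty
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = np.array([1.0, 0.0])
candidates = np.array([[1.0, 0.0], [0.99, 0.01], [0.6, 0.4]])
print(mmr(query, candidates, lam=0.9))  # relevance-heavy: keeps the near-duplicate
print(mmr(query, candidates, lam=0.4))  # diversity-heavy: picks the distinct one
```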

&lt;h3&gt;Fusing results from different clusters&lt;/h3&gt;

&lt;p&gt;After calculating the clusters and cluster-embeddings, and retrieving and diversifying candidate documents, we have a list of documents for each cluster. However, we want to present the user with a single flat list. Reciprocal Rank Fusion (RRF) is a straightforward algorithm that has recently been implemented in some vector databases alongside hybrid search. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foikksb569uc1mz3h4n8d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foikksb569uc1mz3h4n8d.png" alt="Score_RRF(d) = 1 / (k + rank(d))" width="746" height="168"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;RRF calculates a score based on the rank of a document in each list. In our case we have a list of documents for each cluster, and the position in that list is the rank of that document. &lt;em&gt;k&lt;/em&gt; is a tunable parameter, usually set to 60, that dampens the impact of the top ranks. As each document belongs to only one cluster, every candidate document appears in just one of the lists. So initially the score for the first document of every list will be &lt;code&gt;1/(60 + 1)&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;However, some clusters are more important than others: a cluster is bigger because more users clicked on documents like the ones in that cluster, so we assume those documents are more relevant. Using the relative sizes of the clusters we can add a weight in the numerator of the fraction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        num_docs_in_cluster
w = a + -------------------
           total_docs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can also add a tunable parameter &lt;code&gt;a&lt;/code&gt; that makes the cluster size weight more or less important.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                w
Score(d) =  ----------
            k + rank(d)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
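&lt;p&gt;Putting the two formulas together, a small sketch of the weighted fusion (the doc ids, cluster sizes, and the value of &lt;code&gt;a&lt;/code&gt; are made up for the example):&lt;/p&gt;

```python
# Sketch of the weighted RRF fusion: one ranked list of doc ids per
# cluster, weight w derived from the relative cluster size, and the
# score computed as w / (k + rank).
def fuse(cluster_lists, cluster_sizes, k=60, a=0.5):
    total = sum(cluster_sizes.values())
    scores = {}
    for cluster, docs in cluster_lists.items():
        w = a + cluster_sizes[cluster] / total
        for rank, doc in enumerate(docs, start=1):
            scores[doc] = scores.get(doc, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

cluster_lists = {"c1": ["d1", "d2"], "c2": ["d3", "d4"]}
cluster_sizes = {"c1": 90, "c2": 10}   # c1 got far more clicks
print(fuse(cluster_lists, cluster_sizes))  # → ['d1', 'd2', 'd3', 'd4']
```

&lt;p&gt;With these weights, both documents of the big cluster outrank the first document of the small cluster; with equal sizes the lists would interleave.&lt;/p&gt;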



&lt;p&gt;After computing the score for all documents we can sort them and get the final, diverse, sorted list of relevant documents.&lt;/p&gt;

&lt;h2&gt;Production concerns&lt;/h2&gt;

&lt;p&gt;Processing all the click logs and calculating the clusters is fairly compute-intensive and would not work for real-time traffic. But we can periodically precompute the cluster-embeddings. That's not a lot of data: basically just one vector per cluster. The KNN search in the vector database and the re-ranking and fusing with MMR and RRF can then be done in your web application.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Vector databases with embeddings make it possible to build a recommender system quickly without a lot of data. Furthermore, they can be used to create rankings in a search engine when there is no natural ranking because documents either match or not (a product is on sale or not, an article is written by this author or not, etc.) and there is no way yet to tell whether one matches more than another. With click logs we can create an embedding of all clicked documents and retrieve the related documents that are most clicked. When there are likely several types of documents in the click logs, we create clusters using the UMAP dimension-reduction and HDBSCAN clustering algorithms. Finally, with a KNN search, optionally some diversification of the results using MMR, and a modified RRF score to combine the results of each cluster, documents similar to clicked documents get a higher ranking than documents that are less clicked.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;small&gt;Cross-posted on &lt;a href="https://www.aryweb.nl/2024/01/21/vector-database-recommender-system.html"&gt;https://www.aryweb.nl/2024/01/21/vector-database-recommender-system.html&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>vectordatabase</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>elasticsearch</category>
    </item>
    <item>
      <title>ElasticSearch Boolean Query Performance</title>
      <dc:creator>Arian Stolwijk</dc:creator>
      <pubDate>Mon, 08 Jan 2024 16:03:25 +0000</pubDate>
      <link>https://dev.to/arian/elasticsearch-boolean-query-performance-5e0</link>
      <guid>https://dev.to/arian/elasticsearch-boolean-query-performance-5e0</guid>
      <description>&lt;p&gt;A Boolean Query caused performance issues. The problem was that an empty &lt;code&gt;filter&lt;/code&gt; clause behaved differently from a non-empty &lt;code&gt;filter&lt;/code&gt; clause containing &lt;code&gt;match_all&lt;/code&gt;, in combination with &lt;code&gt;should&lt;/code&gt; clauses, because of tricky &lt;code&gt;minimum_should_match&lt;/code&gt; behavior. I'll try to explain what happened and how to fix it.&lt;/p&gt;

&lt;p&gt;A few weeks ago I changed an ElasticSearch query in the application I was working on. The new query was structured in such a way that it should return better search results and make it easier to tweak which fields contribute to the scores of the documents. The changes were reviewed, tested, and merged, and everything looked good.&lt;/p&gt;

&lt;p&gt;Then we deployed these changes to production. It still looked good, and with the live data it did give better results given the search inputs.&lt;/p&gt;

&lt;p&gt;Then, during the day, traffic increased. And it didn't look good anymore. Our monitoring systems notified us that the application could no longer handle all requests and the Load Balancer was queuing requests. Response times got really high! It turned out ElasticSearch was overloaded.&lt;/p&gt;

&lt;p&gt;We quickly rolled back the change, and things stabilized. We moved the new version of the query behind a feature flag so we could dynamically enable or disable it later. So much for the stressful part. Now: why was the performance so different?&lt;/p&gt;

&lt;h3&gt;Compound Queries&lt;/h3&gt;

&lt;p&gt;For the search functionality of the application we want to search documents for a specific search term the user enters. Aside from a name or description, each document also has fields like ratings or views that should affect the scoring.&lt;/p&gt;

&lt;p&gt;ElasticSearch provides the &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html"&gt;Function Score Query&lt;/a&gt; for this: an inner query that produces scored results, whose scores are then adjusted by some function (e.g. multiplying by another field of the document). So the resulting query is a query with a query inside it. The documents are then filtered using the &lt;code&gt;min_score&lt;/code&gt; field and sorted by the new score.&lt;/p&gt;

&lt;h3&gt;Boolean Query&lt;/h3&gt;

&lt;p&gt;Another type of compound query is the &lt;em&gt;Boolean Query&lt;/em&gt;. This query combines queries of four types, and each type takes a list of queries.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;filter&lt;/em&gt;: the document either matches a query or not. It doesn't affect scoring.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;should&lt;/em&gt;: a matching query contributes to the score, e.g. a score for how well a field matches a search term. If any subquery in the &lt;em&gt;should&lt;/em&gt; list matches, the document is included.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;must&lt;/em&gt;: almost the same as &lt;em&gt;should&lt;/em&gt;, but documents must match all subqueries.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;must_not&lt;/em&gt;: documents matching these queries are excluded. Scores are ignored.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An example of a query like this is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"bool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"filter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"term"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"production"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"should"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"term"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env1"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"term"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"deployed"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It reads like: filter all documents that have &lt;code&gt;production&lt;/code&gt; in the &lt;code&gt;tags&lt;/code&gt; field. Then documents with &lt;code&gt;env1&lt;/code&gt; or &lt;code&gt;deployed&lt;/code&gt; in the &lt;code&gt;tags&lt;/code&gt; field get a higher score.&lt;/p&gt;

&lt;p&gt;This is almost exactly what we had, except embedded in a Function Score query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"function_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"bool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"filter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"term"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"production"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"should"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"term"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env1"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"term"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"deployed"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"min_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;Filters&lt;/h4&gt;

&lt;p&gt;The code dynamically added filters to the bool query. However, in the default case it didn't add any filters but returned a &lt;code&gt;match_all&lt;/code&gt; query instead, assuming these two cases would be identical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"bool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"filter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"match_all"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"should"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"term"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env1"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"bool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"filter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"should"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"term"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env1"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;But it is not!&lt;/strong&gt; The results are the same, but how the query is executed is not.&lt;/p&gt;

&lt;p&gt;The devil is in the details of the &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html#bool-min-should-match"&gt;documentation&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If the bool query includes at least one should clause and no must or filter clauses, the default value is 1. Otherwise, the default value is 0.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So having &lt;code&gt;0&lt;/code&gt; or &lt;code&gt;1&lt;/code&gt; items in the &lt;code&gt;filter&lt;/code&gt; clause changes the behavior! We always want documents to match the &lt;code&gt;should&lt;/code&gt; clause. With one item in the &lt;code&gt;filter&lt;/code&gt; clause, a non-matching document is no longer filtered out by the bool query; it is only filtered much later, by the outer function query (&lt;code&gt;min_score&lt;/code&gt;).&lt;/p&gt;
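&lt;p&gt;The documented default can be captured as a one-line rule (a sketch of the documented behavior, not Elasticsearch code):&lt;/p&gt;

```kotlin
// Default value of minimum_should_match, per the Elasticsearch bool query docs:
// 1 if there is at least one should clause and no must or filter clauses,
// 0 otherwise.
fun defaultMinimumShouldMatch(shouldCount: Int, mustOrFilterCount: Int): Int =
    if (shouldCount > 0 && mustOrFilterCount == 0) 1 else 0
```

&lt;p&gt;With one item in the &lt;code&gt;filter&lt;/code&gt; clause the default silently drops from &lt;code&gt;1&lt;/code&gt; to &lt;code&gt;0&lt;/code&gt;, which is exactly what bit us here.&lt;/p&gt;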

&lt;h4&gt;
  
  
  Use &lt;code&gt;"minimum_should_match": 1&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The quickest solution would be to add &lt;code&gt;"minimum_should_match": 1&lt;/code&gt; to the query, as that ensures each document is only included if it matches at least one item in the &lt;code&gt;should&lt;/code&gt; clause:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"bool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"filter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"match_all"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"should"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"term"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env1"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"minimum_should_match"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Solution: use the &lt;code&gt;must&lt;/code&gt; clause
&lt;/h4&gt;

&lt;p&gt;A better solution in our case is to use &lt;code&gt;must&lt;/code&gt;. That ensures each document matches the subquery, and that the subquery contributes to the score.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"bool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"filter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"match_all"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"must"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"term"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env1"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And even better: don't treat &lt;code&gt;match_all: {}&lt;/code&gt; as an 'identity' filter. Leave the &lt;code&gt;filter&lt;/code&gt; clause empty, and only add an item when we really want to filter something (e.g. language, ...):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"bool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"filter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"must"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"term"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env1"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Debugging
&lt;/h3&gt;

&lt;p&gt;Finally, some things we did to debug this.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Kibana&lt;/em&gt; is really useful as a UI to experiment with ElasticSearch queries and explore the data. Especially the "Dev Tools console".&lt;/p&gt;

&lt;p&gt;You can execute a query, and the JSON ElasticSearch returns contains the &lt;code&gt;took&lt;/code&gt; property: how long the query took, in milliseconds. It's a rough number, but it gives an indication of the order of magnitude. Before, the query usually took ~100ms; after, less than ~10ms!&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.elastic.co/guide/en/kibana/current/xpack-profiler.html"&gt;&lt;em&gt;Search Profiler&lt;/em&gt;&lt;/a&gt; is also a useful tool. It gives insight into which parts of a (compound) query take the most time. In our case a lot of time was spent in &lt;code&gt;next_doc&lt;/code&gt;, which makes sense given that the bool query didn't filter out the documents that scored &lt;code&gt;0&lt;/code&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This post was also published on my &lt;a href="https://www.aryweb.nl/2021/05/30/elasticsearch-bool-query-performance.html"&gt;personal blog&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
    </item>
    <item>
      <title>Creating efficient File Storage of Uploads for Web Applications</title>
      <dc:creator>Arian Stolwijk</dc:creator>
      <pubDate>Fri, 18 Dec 2020 10:05:17 +0000</pubDate>
      <link>https://dev.to/arian/creating-efficient-file-storage-of-uploads-for-web-applications-3clg</link>
      <guid>https://dev.to/arian/creating-efficient-file-storage-of-uploads-for-web-applications-3clg</guid>
      <description>&lt;p&gt;At my company, &lt;a href="https://symbaloo.com"&gt;Symbaloo&lt;/a&gt;, we have various features that need storage of files, mainly images uploaded by users. Files are stored on the filesystem of the servers. The files are usually referenced from the database to a path on the file system. It's important to do this correctly, so we're sure the paths in the database point to existing files on the file system, and files on the file system have a reference in the database. It's bad if users see an &lt;code&gt;&amp;lt;img&amp;gt;&lt;/code&gt; tag with a file that doesn't exist! On the other hand, to keep hosting costs down we don't want to store more than necessary. In this post I will explain how we solved this problem and now store files efficiently.&lt;/p&gt;

&lt;p&gt;We'll solve the problem and discuss:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  File naming using SHA-256 hashes&lt;/li&gt;
&lt;li&gt;  Storing references in a database&lt;/li&gt;
&lt;li&gt;  Garbage collecting files&lt;/li&gt;
&lt;li&gt;  Safely deleting files&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  File naming scheme using SHA-256 hashes
&lt;/h3&gt;

&lt;p&gt;A simple way to name files would be to create a database entry, which has an &lt;code&gt;id&lt;/code&gt; field, and name the file &lt;code&gt;[id].png&lt;/code&gt; (for example &lt;code&gt;/some/path/1337.png&lt;/code&gt;). A problem, however, is that we would need to know the &lt;code&gt;id&lt;/code&gt; upfront, or rename the file after inserting the entry. Another problem is that when we use a CDN with caching, the cache needs to be purged on every change, or users will get out-of-date content. Using a &lt;a href="https://en.wikipedia.org/wiki/Universally_unique_identifier"&gt;UUID&lt;/a&gt; would help, but we can do better!&lt;/p&gt;

&lt;p&gt;We can look at the file content and derive a unique file name from it, using a &lt;a href="https://en.wikipedia.org/wiki/SHA-2"&gt;SHA-256 hash&lt;/a&gt;. This is inspired by the way &lt;a href="https://git-scm.com/book/en/v2/Git-Internals-Git-Objects"&gt;git stores objects&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This has two benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The same files will get the same hash, so you won't store duplicate files. This is especially useful for cases where many database entries refer to the same file.&lt;/li&gt;
&lt;li&gt;  File paths are based on the content, so it's perfect for generating URLs to these files in combination with CDNs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Hashing A File
&lt;/h4&gt;

&lt;p&gt;Let's review what it means to hash a file. A hash function maps an input to a fixed-size output; running the function with the same input always returns the same hash. A hash function is also one-way: it's practically impossible to recover the original input from the hash.&lt;/p&gt;

&lt;p&gt;SHA-256 hashes are always 32 bytes long (256 bits is 32 bytes of 8 bits 🤯), no matter how long the input is. Other hash functions can have different lengths. For example, hashes from the SHA-1 hash function, which &lt;code&gt;git&lt;/code&gt; uses, are 20 bytes long.&lt;/p&gt;

&lt;p&gt;On the command line, you can test it using the &lt;code&gt;sha256sum&lt;/code&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"hello"&lt;/span&gt; | &lt;span class="nb"&gt;sha256sum
&lt;/span&gt;2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824  -
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result here is the &lt;em&gt;hex&lt;/em&gt; representation of the 32 bytes. The returned string is 64 characters long, so each pair of characters represents one byte in hexadecimal notation: from &lt;code&gt;0&lt;/code&gt; as &lt;code&gt;00&lt;/code&gt; up to &lt;code&gt;255&lt;/code&gt; as &lt;code&gt;ff&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In Java or Kotlin (JVM) a &lt;code&gt;ByteArray&lt;/code&gt; can be hashed using &lt;code&gt;MessageDigest&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nc"&gt;ByteArray&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nc"&gt;ByteArray&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;
    &lt;span class="nc"&gt;MessageDigest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getInstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SHA-256"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="nd"&gt;@sha256&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nc"&gt;ByteArray&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toHexString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;
    &lt;span class="nf"&gt;joinToString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s"&gt;"%02x"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"hello"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toByteArray&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toHexString&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="c1"&gt;// prints "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To hash complete files it's better to use &lt;code&gt;DigestOutputStream&lt;/code&gt; instead, which computes the hash while uploading/downloading/moving/editing the file, without loading the entire file into memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;inputStream&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"hello"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toByteArray&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;inputStream&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;outputStream&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ByteArrayOutputStream&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;digestStream&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DigestOutputStream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputStream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;MessageDigest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getInstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SHA-256"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;inputStream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copyTo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;digestStream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;digestStream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messageDigest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toHexString&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="c1"&gt;// prints "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Hash to file name
&lt;/h4&gt;

&lt;p&gt;Just 32 bytes isn't a file name yet. The hex representation of the bytes is a good string representation of the hash. But browsing the file system can be slow when thousands of files sit in a single directory, and some (older) file systems even have a limit on how many files a directory can contain. We can introduce some subdirectories by taking the first bytes of the 32-byte hash and using them as directory names. Using two levels of nesting, a file containing the string &lt;code&gt;hello&lt;/code&gt; would be saved to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2c/f2/4dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To format the hashes, and to parse a file path back into the 32-byte hash, we can use these functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;toFileSystemPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ByteArray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;suffix&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s"&gt;"bytes size must be 32 bytes"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;hash&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toHexString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"/$prefix/${hash.substring(0..1)}/${hash.substring(2..3)}"&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;filename&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"${hash.drop(4)}$suffix"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;"$path/$filename"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;parseFileSystemPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;suffix&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;ByteArray&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;p&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Pattern&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;quote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;s&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Pattern&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;quote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;suffix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;regex&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Regex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"^/$p/([0-9a-f]{2})/([0-9a-f]{2})/([0-9a-f]{60})$s$"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="py"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="py"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="py"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;regex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;matchEntire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?.&lt;/span&gt;&lt;span class="n"&gt;destructured&lt;/span&gt; &lt;span class="o"&gt;?:&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;"$a$b$c"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hexStringToByteArray&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;toFileSystemPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"hello"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256AsBytes&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;prefix&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"image"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;".png"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;// prints "/image/2c/f2/4dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824.png"&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;hash&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parseFileSystemPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"image"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;".png"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;toHexString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;// prints "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Storing File Reference in the database
&lt;/h3&gt;

&lt;p&gt;If we let users upload all kinds of files without keeping track of them tightly, we end up with many files on the file system that may or may not be used. To keep track, we store the file names in a database table. To check whether a specific file is actually used, we need to query the database with the file name in the &lt;code&gt;WHERE&lt;/code&gt; clause. If the table contains many entries this would be really slow without an index. So let's see how that looks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="nv"&gt;`tablename`&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="c1"&gt;-- other columns&lt;/span&gt;
  &lt;span class="nv"&gt;`file`&lt;/span&gt; &lt;span class="nb"&gt;BINARY&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`fileId`&lt;/span&gt; &lt;span class="nb"&gt;BINARY&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="nv"&gt;`fileId`&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fileId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As the file name is a 32-byte hash, we can use the &lt;code&gt;BINARY(32)&lt;/code&gt; data type. The table has a second, 4-byte column that acts as the index. This second column is a slight storage overhead, but smaller than indexing the complete 32 bytes for the fast table lookup.&lt;/p&gt;

&lt;p&gt;With the index we can quickly check if the file has a reference in the database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;tablename&lt;/span&gt; &lt;span class="n"&gt;di&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;di&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fileId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;fileId&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;di&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;fileId&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copyOfRange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;fileId&lt;/code&gt; index is not unique, so the lookup may return multiple entries. But with 4 bytes there are 256^4, i.e. more than 4*10^9, possible values. So as long as the number of entries in your database is in that order or less, collisions won't be a problem at all.&lt;/p&gt;
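&lt;p&gt;As a quick sanity check (a sketch, not part of our codebase): with uniformly distributed hashes, the expected number of rows sharing one &lt;code&gt;fileId&lt;/code&gt; prefix is simply the row count divided by 2^32.&lt;/p&gt;

```kotlin
// Expected number of rows that share a single 4-byte fileId prefix,
// assuming hashes (and thus prefixes) are uniformly distributed.
fun expectedRowsPerPrefix(totalRows: Long): Double =
    totalRows.toDouble() / (1L shl 32)
```

&lt;p&gt;Even with on the order of a billion entries, the index narrows a lookup down to comparing just a handful of full 32-byte hashes.&lt;/p&gt;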

&lt;h3&gt;
  
  
  Garbage collecting files
&lt;/h3&gt;

&lt;p&gt;In reality, it's super hard to keep the files on the file system exactly in sync with the entries in the database: entries could be deleted by some part of your application that doesn't check the files, entries might be deleted directly (manually) from the database, or they might be removed by some database cleanup batch process.&lt;/p&gt;

&lt;p&gt;Fortunately, with this database scheme, the file naming, and the nested folders, it becomes pretty straightforward to check a given file against the database and delete it if it's obsolete.&lt;/p&gt;

&lt;p&gt;Modern programming languages often use garbage collection: as a programmer you don't need to clean up memory manually; the runtime checks objects and frees the memory of those that are no longer used. For our files we can do something similar: traverse the file system in the background at given intervals and check whether the files are still referenced.&lt;/p&gt;

&lt;p&gt;Our garbage collection background job performs the following steps at given intervals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  pick a random nested folder (the file paths are two levels deep)&lt;/li&gt;
&lt;li&gt;  list all files in that folder&lt;/li&gt;
&lt;li&gt;  parse the file paths into 32-byte hashes using the &lt;code&gt;parseFileSystemPath&lt;/code&gt; function&lt;/li&gt;
&lt;li&gt;  query the database for the hashes that are still referenced&lt;/li&gt;
&lt;li&gt;  delete the files whose hashes are not in the database&lt;/li&gt;
&lt;/ul&gt;
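&lt;p&gt;The steps above can be sketched in Kotlin like this (a hedged sketch: &lt;code&gt;isReferenced&lt;/code&gt; is a hypothetical callback wrapping the database query, and file extensions and the lock check from the next section are left out):&lt;/p&gt;

```kotlin
import java.io.File

// Sketch of one garbage-collection pass over the two-level ab/cd/rest layout.
// `isReferenced` is a hypothetical callback wrapping the database lookup.
fun collectGarbage(root: File, isReferenced: (String) -> Boolean): Int {
    var deleted = 0
    // Pick a random two-level nested folder, e.g. 2c/f2
    val level1 = root.listFiles()?.filter { it.isDirectory }?.randomOrNull() ?: return 0
    val level2 = level1.listFiles()?.filter { it.isDirectory }?.randomOrNull() ?: return 0
    for (file in level2.listFiles().orEmpty().filter { it.isFile }) {
        // Rebuild the full hex hash from the two directory names plus the file name
        val hash = level1.name + level2.name + file.nameWithoutExtension
        if (!isReferenced(hash) && file.delete()) deleted++
    }
    return deleted
}
```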

&lt;h3&gt;
  
  
  Safely deleting files
&lt;/h3&gt;

&lt;p&gt;There is one tricky moment when files are created and deleted at the same time: a web request has just stored a file under its hashed name on the file system, but hasn't stored the reference in the database yet. At that same moment, the garbage collector (or another request) checks the database, finds no reference, and concludes the file can be deleted. That's bad: the just-created file is gone, while the original request stores its reference in the database anyway, leaving us in an inconsistent state. The newly created file must be protected from other processes.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;.lock&lt;/code&gt; file can protect the new created file from deletion until a reference is saved in the database:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The &lt;code&gt;OutputStream&lt;/code&gt; that the &lt;code&gt;DigestOutputStream&lt;/code&gt; writes to points at a temporary file (named e.g. using &lt;code&gt;UUID.randomUUID().toString()&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;  After the write is done, we know the hash.&lt;/li&gt;
&lt;li&gt;  Before moving the temporary file to its final location, we first create an empty file &lt;code&gt;[hash].lock&lt;/code&gt;. If the lock file already exists, we append a counter to the name so the creating process holds a unique lock.&lt;/li&gt;
&lt;li&gt;  We move the temporary file to its final location.&lt;/li&gt;
&lt;li&gt;  We store the file name / hash in the database.&lt;/li&gt;
&lt;li&gt;  The &lt;code&gt;[hash].lock&lt;/code&gt; file is deleted.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And when deleting a file:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  We first check whether a &lt;code&gt;[hash].lock&lt;/code&gt; file exists&lt;/li&gt;
&lt;li&gt;  If it does, we assume another process is creating the file, and we don't delete it.&lt;/li&gt;
&lt;/ul&gt;
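&lt;p&gt;A minimal sketch of the creation side of this protocol (assumptions: &lt;code&gt;saveReferenceInDatabase&lt;/code&gt; is a hypothetical callback for the database insert, and the counter for an already-existing lock is omitted):&lt;/p&gt;

```kotlin
import java.io.File

// Sketch of the lock-file protocol above: protect a freshly written file from
// the garbage collector until its reference is stored in the database.
// `saveReferenceInDatabase` is a hypothetical callback for the database insert.
fun storeSafely(root: File, tempFile: File, hashHex: String,
                saveReferenceInDatabase: (String) -> Unit): File {
    val finalFile = File(root, "${hashHex.substring(0, 2)}/${hashHex.substring(2, 4)}/${hashHex.drop(4)}")
    finalFile.parentFile.mkdirs()
    val lock = File(root, "$hashHex.lock")
    lock.createNewFile()                  // from now on the GC must not touch this hash
    try {
        tempFile.renameTo(finalFile)      // move the temp file to its final location
        saveReferenceInDatabase(hashHex)  // now the GC can find the reference
    } finally {
        lock.delete()                     // the reference exists; the lock can go
    }
    return finalFile
}
```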

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Using file hashes together with a database index we can store files efficiently. The file system stays clean, because we can safely check whether a file is referenced in the database, and garbage collection catches anything that slips through.&lt;/p&gt;

&lt;p&gt;Have you ever built something similar, or do you know of existing packages or frameworks that do this? Let me know in the comments!&lt;/p&gt;

</description>
      <category>kotlin</category>
      <category>java</category>
      <category>webdev</category>
      <category>filestorage</category>
    </item>
  </channel>
</rss>
