<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Arian Stolwijk</title>
    <description>The latest articles on DEV Community by Arian Stolwijk (@arian).</description>
    <link>https://dev.to/arian</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F152656%2F7b0bcf65-7984-4f86-b0d8-3d281361f26c.jpg</url>
      <title>DEV Community: Arian Stolwijk</title>
      <link>https://dev.to/arian</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/arian"/>
    <language>en</language>
    <item>
      <title>Our web performance got worse by switching from Cloudflare Proxy to AWS CloudFront</title>
      <dc:creator>Arian Stolwijk</dc:creator>
      <pubDate>Fri, 07 Feb 2025 15:48:50 +0000</pubDate>
      <link>https://dev.to/arian/cloudfront-vs-cloudflare-proxy-latency-4125</link>
      <guid>https://dev.to/arian/cloudfront-vs-cloudflare-proxy-latency-4125</guid>
      <description>&lt;p&gt;In an effort to improve worldwide, multi-region latency, I set up CloudFront in front of our Application Load Balancer in AWS.&lt;/p&gt;

&lt;p&gt;Previously we put the public DNS name of the ALB (in Europe) in a CNAME record on Cloudflare and enabled the (free) Cloudflare Proxy. But we noticed higher latency from the US and Asia. So our idea was to use CloudFront instead, as everything would stay inside AWS and we would have more tools to tweak the caching policies of different pages.&lt;/p&gt;

&lt;p&gt;Surprisingly, the first result we got is that CloudFront is slower than the basic Cloudflare setup, especially from the US and Sydney!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqj7y11xjl3w823evjts.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqj7y11xjl3w823evjts.png" alt="Page load times from different locations" width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We measured this with automated headless browsers in different regions that visit the homepage and report &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Performance_API" rel="noopener noreferrer"&gt;Performance API&lt;/a&gt; statistics, using &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html" rel="noopener noreferrer"&gt;CloudWatch Synthetics Canaries&lt;/a&gt;.&lt;/p&gt;
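&lt;p&gt;As an illustration of what we do with the reported statistics, here is a minimal sketch that aggregates time-to-first-byte samples per region. The region names and numbers are made up for the example, not our real measurements.&lt;/p&gt;

```python
# Sketch: summarize Performance API navigation timings collected by the
# canaries per region. The numbers below are made-up illustrations.
from statistics import median, quantiles

# ttfb_ms[region] = list of timeToFirstByte samples reported by a canary
ttfb_ms = {
    "eu-west-1": [120, 135, 128, 140, 122],
    "us-east-1": [310, 295, 330, 305, 290],
    "ap-southeast-2": [480, 455, 510, 470, 495],
}

for region, samples in ttfb_ms.items():
    p50 = median(samples)
    p90 = quantiles(samples, n=10)[-1]  # 90th percentile estimate
    print(f"{region}: p50={p50:.0f}ms p90={p90:.0f}ms")
```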

&lt;p&gt;We're still looking for settings to improve global website performance. At the very least, smart caching, as explained in &lt;a href="https://lucvandonkersgoed.com/2024/11/16/making-aws-news-stupid-fast-with-smart-caching/" rel="noopener noreferrer"&gt;Making AWS News stupid fast with smart caching&lt;/a&gt;, is on our roadmap.&lt;/p&gt;

&lt;p&gt;Let me know if you have any ideas!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>webperf</category>
    </item>
    <item>
      <title>Building a Recommender System using vector databases</title>
      <dc:creator>Arian Stolwijk</dc:creator>
      <pubDate>Sun, 21 Jan 2024 10:23:51 +0000</pubDate>
      <link>https://dev.to/arian/vector-database-recommender-system-4800</link>
      <guid>https://dev.to/arian/vector-database-recommender-system-4800</guid>
      <description>&lt;p&gt;Dense vector embeddings make it easy to find similar documents in a dataset. Given a document in the embedding space, just look at the documents that are close to it: they are probably related. This is called a k-nearest neighbors (KNN) search. Classic recommender systems such as collaborative filtering require a lot of user data and training, but with pre-trained embedding models you don't need any training to recommend related documents.&lt;/p&gt;

&lt;p&gt;This works well for item-to-item recommendations. But what if you want to recommend something to a specific user? For user-to-items recommendations you need some interaction data again. One strategy is to build a user-embedding from the mean of the embeddings of the documents the user interacted with. A KNN search with this user-embedding will return the results the user is most likely interested in. Erika Cardenas explained this in further detail in her &lt;a href="https://haystackconf.com/us2023/talk-20/"&gt;Haystack US 2023 talk&lt;/a&gt;. If you already have a vector database, the user quickly gets nice recommendations after just a few clicks, without a lot of extra systems.&lt;/p&gt;
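&lt;p&gt;A minimal numpy sketch of this user-embedding idea, with toy 4-dimensional vectors (a real system would use model-generated embeddings and a vector database):&lt;/p&gt;

```python
# Sketch of the user-embedding idea: average the embeddings of the
# documents a user interacted with, then do a brute-force cosine KNN.
# Toy 4-dimensional vectors for illustration only.
import numpy as np

doc_embeddings = np.array([
    [1.0, 0.0, 0.0, 0.0],   # doc 0
    [0.9, 0.1, 0.0, 0.0],   # doc 1 (similar to doc 0)
    [0.0, 0.0, 1.0, 0.0],   # doc 2 (unrelated)
])

clicked = [0, 1]                       # documents the user interacted with
user_emb = doc_embeddings[clicked].mean(axis=0)

# cosine similarity of the user embedding against every document
norms = np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(user_emb)
scores = doc_embeddings @ user_emb / norms
ranking = np.argsort(-scores)          # best match first
print(ranking)
```

&lt;p&gt;The brute-force cosine ranking stands in for what the vector database's KNN query would do.&lt;/p&gt;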

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2on2bh59wa5ovagwio0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2on2bh59wa5ovagwio0.png" alt="user embeddings" width="800" height="545"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The problem I'm trying to solve goes a bit deeper. I've got a lot of documents that have a certain label (e.g. the brand of a product, a color, a language, the author of an article). When the user selects that label we want to show the most interesting documents. But at the moment we only know whether a document has the label or not, which doesn't say anything about how we should rank those documents. Also, the documents are refreshed regularly and the number of documents is quite large compared to the number of clicks, so an individual document only gets a few clicks. What we want is the best documents ranked first. Using a similar approach to the user-embedding, we can create a 'label-embedding' from the documents users clicked while searching in this category.&lt;/p&gt;

&lt;p&gt;One problem with this, contrary to user-embeddings, which can be limited to averaging the last few document embeddings the user clicked, is that we would eventually take the average embedding of many clicked documents. Does that average actually represent the most popular type of document, or is it some 'random' point in the embedding space?&lt;/p&gt;

&lt;p&gt;Instead, in the list of all clicked documents there are probably a few clusters of related documents. The idea is that instead of one average embedding we get a few embeddings, one per cluster; for each embedding we do a KNN search and finally combine the results.&lt;/p&gt;

&lt;h2&gt;Implementation&lt;/h2&gt;

&lt;p&gt;From a high level the pipeline looks as shown in this image. First we collect the relevant click logs to see which documents were clicked, then we fetch the embeddings for those documents and create clusters. For each cluster we do a KNN search to find more related documents (even documents that don't have any interactions yet), combine these lists into a single final list, and present that as the ranking to the user.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7qlrjfo6mopc6tw0a0v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7qlrjfo6mopc6tw0a0v.png" alt="Pipeline" width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Clustering&lt;/h3&gt;

&lt;p&gt;The idea of clustering dense embedding vectors closely resembles what BERTopic does, as described in &lt;a href="https://arxiv.org/pdf/2203.05794.pdf"&gt;BERTopic: Neural topic modeling with a class-based TF-IDF procedure&lt;/a&gt; by Maarten Grootendorst. From documents it creates embeddings using an embedding model, clusters the embeddings, and finally extracts a topic description for each cluster. The first and last steps are not necessary for us: we already have the embeddings, and we don't really need to label the topics. But we can look at how it creates clusters from the embeddings.&lt;/p&gt;

&lt;p&gt;It uses two steps: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dimension reduction&lt;/strong&gt; is necessary to prevent the &lt;em&gt;curse of dimensionality&lt;/em&gt;. Dense vectors have 384, 768, or sometimes more than 4000 dimensions, which is too much for clustering algorithms to work well. Dimension reduction projects them down to a much lower number of dimensions, such as 5 or 10, that the clustering algorithm can handle well. If you reduce further, to 2 or 3 dimensions, you can also render the vectors in a 2- or 3-dimensional graph. BERTopic uses &lt;a href="https://umap-learn.readthedocs.io/en/latest/"&gt;UMAP&lt;/a&gt; over other methods such as PCA or t-SNE because UMAP has been shown to preserve more of the local and global features of high-dimensional data in the projected lower dimensions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Clustering&lt;/strong&gt;. For clustering, BERTopic uses the &lt;a href="https://hdbscan.readthedocs.io/en/latest/"&gt;HDBSCAN&lt;/a&gt; algorithm, which assigns a cluster label to each (reduced) embedding. Contrary to the KMeans algorithm, the number of clusters is dynamic. Compared to the DBSCAN algorithm, it works better when the (reduced) embeddings are not homogeneously distributed in the embedding space, so it can also create distinct clusters in dense areas.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After this process we have cluster labels for all clicked documents, and we can use them to create the 'cluster-embeddings' that we will use as center points for our KNN search. I found two methods:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Find the &lt;strong&gt;centroid embedding&lt;/strong&gt;. This is possible when using the &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans"&gt;KMeans clustering algorithm&lt;/a&gt;. However, the clustering is done on the reduced-dimension vectors, so the centroid would need to be mapped back to an original document (embedding).&lt;/li&gt;
&lt;li&gt;Take the &lt;strong&gt;mean of all embeddings&lt;/strong&gt; that fall in the cluster. As HDBSCAN can create non-linear clusters (e.g. an inner and an outer circle as two clusters), taking the mean seems a bit odd. But as the means are taken over the original high-dimensional vectors, it works OK in practice.&lt;/li&gt;
&lt;/ol&gt;
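&lt;p&gt;A small sketch of the second method, assuming the HDBSCAN labels have already been computed (they are hard-coded here; label -1 is HDBSCAN's noise bucket): the cluster-embedding is the mean of the original high-dimensional embeddings per cluster.&lt;/p&gt;

```python
# Sketch of method 2: given cluster labels from HDBSCAN (hard-coded here),
# compute a cluster-embedding as the mean of the ORIGINAL high-dimensional
# document embeddings in each cluster. Label -1 is HDBSCAN's noise bucket.
import numpy as np

embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
    [0.5, 0.5, 0.5],
])
labels = np.array([0, 0, 1, 1, -1])    # pretend HDBSCAN output

cluster_embeddings = {
    label: embeddings[labels == label].mean(axis=0)
    for label in set(labels.tolist()) if label != -1
}
for label, emb in sorted(cluster_embeddings.items()):
    print(label, np.round(emb, 2))
```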

&lt;h3&gt;KNN Search&lt;/h3&gt;

&lt;p&gt;For the KNN search we use a vector database. A vector database stores the documents with their embeddings and provides a way to query them. A naive way to find the documents most similar to a given vector would be to calculate the similarity of every document with the query vector and take the &lt;em&gt;k&lt;/em&gt; vectors with the highest similarity. This scales linearly and isn't suited for bigger datasets. Vector databases implement Approximate Nearest Neighbor (ANN) algorithms, which make querying much more efficient (at the cost of being &lt;em&gt;approximate&lt;/em&gt;). The most popular algorithm is Hierarchical Navigable Small World (HNSW), which is implemented in vector databases like Qdrant, Weaviate, and &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html"&gt;Elasticsearch&lt;/a&gt; (which we use).&lt;/p&gt;

&lt;p&gt;Note that the vector database needs to support some form of filtering, as the results must match the initial search constraints (in our case, the document must have the label).&lt;/p&gt;

&lt;p&gt;The top-k documents closest to the cluster-embedding are retrieved with the KNN search. These are, however, the top-k closest to that single embedding, and may not be representative of the entire space of the cluster. Especially with e-commerce products there are a lot of variants of the same product, while it might be more relevant to get a diverse set of recommendations for this cluster. The &lt;a href="https://www.cs.cmu.edu/~jgc/publication/The_Use_MMR_Diversity_Based_LTMIR_1998.pdf"&gt;Maximal Marginal Relevance&lt;/a&gt; (MMR) algorithm can diversify the ranking of the retrieved documents. It iteratively moves documents from the original ranked list into a new list, picking the document that best matches the cluster-embedding (or, more generally, the query vector) while penalizing documents that are similar to those already in the resulting list: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71t6jd3q5wxmlv14q2ma.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71t6jd3q5wxmlv14q2ma.png" alt="MMR" width="800" height="103"&gt;&lt;/a&gt;&lt;/p&gt;
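&lt;p&gt;A minimal sketch of MMR with numpy and cosine similarity (the lambda weight and toy vectors are illustrative choices, not from the paper):&lt;/p&gt;

```python
# Sketch of Maximal Marginal Relevance: iteratively pick the candidate
# that matches the query vector best while penalizing similarity to the
# documents already selected. lam trades relevance vs. diversity.
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr(query, candidates, k=2, lam=0.5):
    relevance = [cos(query, c) for c in candidates]
    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) != k:
        def score(i):
            penalty = max((cos(candidates[i], candidates[j]) for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * penalty
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = np.array([1.0, 0.0])
candidates = np.array([[1.0, 0.0], [0.99, 0.01], [0.6, 0.4]])
print(mmr(query, candidates, lam=0.9))  # relevance-heavy: keeps the near-duplicate
print(mmr(query, candidates, lam=0.4))  # diversity-heavy: picks the distinct one
```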

&lt;h3&gt;Fusing results from different clusters&lt;/h3&gt;

&lt;p&gt;After calculating the clusters and cluster-embeddings, and retrieving and diversifying candidate documents, we have a list of documents for each cluster. However, we want to present the user with a single flat list. Reciprocal Rank Fusion (RRF) is a straightforward algorithm that has recently been implemented in some vector databases alongside hybrid search. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foikksb569uc1mz3h4n8d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foikksb569uc1mz3h4n8d.png" alt="Score_RRF(d) = 1 / (k + rank(d))" width="746" height="168"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;RRF calculates a score based on the rank of a document in each list. In our case we have a list of documents for each cluster, and the position in that list is the rank of that document. &lt;em&gt;k&lt;/em&gt; is a tunable parameter, usually set to 60, that dampens the impact of the top ranks. As each document belongs to only one cluster, every candidate document appears in just one of the lists. So initially the score for the first document of every list will be &lt;code&gt;1/(60 + 1)&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;However, some clusters are more important than others: a cluster is bigger because more users clicked on documents like the ones in that cluster, so we assume those documents are more relevant. Using the relative sizes of the clusters we can add a weight in the numerator of the fraction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        num_docs_in_cluster
w = a + -------------------
           total_docs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can also add a tunable parameter &lt;code&gt;a&lt;/code&gt; that makes the cluster size weight more or less important.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                w
Score(d) =  ----------
            k + rank(d)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
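&lt;p&gt;Putting the two formulas together, a small sketch of the weighted fusion (the doc ids, cluster sizes, and the value of &lt;code&gt;a&lt;/code&gt; are made up for the example):&lt;/p&gt;

```python
# Sketch of the weighted RRF fusion: one ranked list of doc ids per
# cluster, weight w derived from the relative cluster size, and the
# score computed as w / (k + rank).
def fuse(cluster_lists, cluster_sizes, k=60, a=0.5):
    total = sum(cluster_sizes.values())
    scores = {}
    for cluster, docs in cluster_lists.items():
        w = a + cluster_sizes[cluster] / total
        for rank, doc in enumerate(docs, start=1):
            scores[doc] = scores.get(doc, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

cluster_lists = {"c1": ["d1", "d2"], "c2": ["d3", "d4"]}
cluster_sizes = {"c1": 90, "c2": 10}   # c1 got far more clicks
print(fuse(cluster_lists, cluster_sizes))  # → ['d1', 'd2', 'd3', 'd4']
```

&lt;p&gt;With these weights, both documents of the big cluster outrank the first document of the small cluster; with equal sizes the lists would interleave.&lt;/p&gt;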



&lt;p&gt;After computing the score for all documents we can sort them and get the final, diverse, sorted list of relevant documents.&lt;/p&gt;

&lt;h2&gt;Production concerns&lt;/h2&gt;

&lt;p&gt;Processing all the click logs and calculating the clusters is fairly compute-intensive and would not work for real-time traffic. But we can periodically precompute the cluster-embeddings. That's not a lot of data: basically just one vector per cluster. The KNN search in the vector database and the re-ranking and fusing with MMR and RRF can then be done in your web application.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Vector databases with embeddings make it possible to build a recommender system quickly without a lot of data. Furthermore, they can be used to create rankings in a search engine when there is no natural ranking because documents either match or not (a product is on sale or not, an article is written by this author or not, etc.) and there is no way yet to tell whether one matches more than another. With click logs we can create an embedding of all clicked documents and retrieve the related documents that are most clicked. When there are likely several types of documents in the click logs, we create clusters using the UMAP dimension-reduction and HDBSCAN clustering algorithms. Finally, with a KNN search, optionally some diversification of the results using MMR, and a modified RRF score to combine the results of each cluster, documents similar to clicked documents get a higher ranking than documents that are less clicked.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;small&gt;Cross-posted on &lt;a href="https://www.aryweb.nl/2024/01/21/vector-database-recommender-system.html"&gt;https://www.aryweb.nl/2024/01/21/vector-database-recommender-system.html&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>vectordatabase</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>elasticsearch</category>
    </item>
    <item>
      <title>ElasticSearch Boolean Query Performance</title>
      <dc:creator>Arian Stolwijk</dc:creator>
      <pubDate>Mon, 08 Jan 2024 16:03:25 +0000</pubDate>
      <link>https://dev.to/arian/elasticsearch-boolean-query-performance-5e0</link>
      <guid>https://dev.to/arian/elasticsearch-boolean-query-performance-5e0</guid>
      <description>&lt;p&gt;A Boolean Query caused performance issues. The problem was that an empty &lt;code&gt;filter&lt;/code&gt; clause behaved differently from a non-empty &lt;code&gt;filter&lt;/code&gt; clause containing &lt;code&gt;match_all&lt;/code&gt;, in combination with &lt;code&gt;should&lt;/code&gt; clauses, because of tricky &lt;code&gt;minimum_should_match&lt;/code&gt; behavior. I'll try to explain what happened and how to fix it.&lt;/p&gt;

&lt;p&gt;A few weeks ago I changed an ElasticSearch query in the application I was working on. The new query was structured in such a way that it should return better search results and make it easier to tweak which fields contribute to the scores of the documents. The changes were reviewed, tested, and merged, and everything looked good.&lt;/p&gt;

&lt;p&gt;Then we deployed these changes to production. It still looked good, and with the live data it did give better results given the search inputs.&lt;/p&gt;

&lt;p&gt;Then, during the day, traffic increased. And it didn't look good anymore. Our monitoring systems notified us that the application could no longer handle all requests and the Load Balancer was queuing requests. Response times got really high! It turned out ElasticSearch was overloaded.&lt;/p&gt;

&lt;p&gt;We quickly rolled back the change, and things stabilized. We moved the new version of the query behind a feature flag so we could dynamically enable or disable it later. So much for the stressful part. Now: why was the performance so different?&lt;/p&gt;

&lt;h3&gt;Compound Queries&lt;/h3&gt;

&lt;p&gt;For the search functionality of the application we want to search documents for a specific search term the user enters. Aside from a name or description, each document also has fields like ratings or views that should affect the scoring.&lt;/p&gt;

&lt;p&gt;ElasticSearch provides the &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html"&gt;Function Score Query&lt;/a&gt; for this: an inner query that produces scored results, whose scores are then adjusted by some function (e.g. multiplying by another field of the document). So the resulting query is a query with a query inside it. The documents are then filtered using the &lt;code&gt;min_score&lt;/code&gt; field and sorted by the new score.&lt;/p&gt;

&lt;h3&gt;Boolean Query&lt;/h3&gt;

&lt;p&gt;Another type of compound query is the &lt;em&gt;Boolean Query&lt;/em&gt;. This query combines queries of four types, and each type takes a list of queries.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;filter&lt;/em&gt;: the document either matches a query or not. It doesn't affect scoring.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;should&lt;/em&gt;: a matching query contributes to the score, e.g. a score for how well a field matches a search term. If any subquery in the &lt;em&gt;should&lt;/em&gt; list matches, the document is included.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;must&lt;/em&gt;: almost the same as &lt;em&gt;should&lt;/em&gt;, but documents must match all subqueries.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;must_not&lt;/em&gt;: documents matching these queries are excluded. Scores are ignored.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An example of a query like this is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"bool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"filter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"term"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"production"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"should"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"term"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env1"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"term"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"deployed"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It reads like: filter all documents that have &lt;code&gt;production&lt;/code&gt; in the &lt;code&gt;tags&lt;/code&gt; field. Then documents with &lt;code&gt;env1&lt;/code&gt; or &lt;code&gt;deployed&lt;/code&gt; in the &lt;code&gt;tags&lt;/code&gt; field get a higher score.&lt;/p&gt;

&lt;p&gt;This is almost exactly what we had, except embedded in a Function Score query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"function_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"bool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"filter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"term"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"production"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"should"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"term"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env1"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"term"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"deployed"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"min_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;Filters&lt;/h4&gt;

&lt;p&gt;The code dynamically added filters to the bool query. However, in the default case it didn't add any filters but returned a &lt;code&gt;match_all&lt;/code&gt; query instead, assuming these two cases would be identical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"bool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"filter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"match_all"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"should"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"term"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env1"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"bool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"filter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"should"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"term"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env1"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;But it is not!&lt;/strong&gt; The results are the same, but how the query is executed is not.&lt;/p&gt;

&lt;p&gt;The devil is in the details of the &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html#bool-min-should-match"&gt;documentation&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If the bool query includes at least one should clause and no must or filter clauses, the default value is 1. Otherwise, the default value is 0.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So having &lt;code&gt;0&lt;/code&gt; or &lt;code&gt;1&lt;/code&gt; items in the &lt;code&gt;filter&lt;/code&gt; clause changes the behavior! We always want documents to match the &lt;code&gt;should&lt;/code&gt; clause. With one item in the &lt;code&gt;filter&lt;/code&gt; clause, a non-matching document is no longer filtered out by the bool query; it is only filtered much later, by the outer function query (&lt;code&gt;min_score&lt;/code&gt;).&lt;/p&gt;
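&lt;p&gt;The documented default can be captured as a one-line rule (a sketch of the documented behavior, not Elasticsearch code):&lt;/p&gt;

```kotlin
// Default value of minimum_should_match, per the Elasticsearch bool query docs:
// 1 if there is at least one should clause and no must or filter clauses,
// 0 otherwise.
fun defaultMinimumShouldMatch(shouldCount: Int, mustOrFilterCount: Int): Int =
    if (shouldCount > 0 && mustOrFilterCount == 0) 1 else 0
```

&lt;p&gt;With one item in the &lt;code&gt;filter&lt;/code&gt; clause the default silently drops from &lt;code&gt;1&lt;/code&gt; to &lt;code&gt;0&lt;/code&gt;, which is exactly what bit us here.&lt;/p&gt;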

&lt;h4&gt;
  
  
  Use &lt;code&gt;"minimum_should_match": 1&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The quickest solution would be to add &lt;code&gt;"minimum_should_match": 1&lt;/code&gt; to the query, as that ensures each document is only included if it matches at least one item in the &lt;code&gt;should&lt;/code&gt; clause:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"bool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"filter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"match_all"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"should"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"term"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env1"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"minimum_should_match"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Solution: use the &lt;code&gt;must&lt;/code&gt; clause
&lt;/h4&gt;

&lt;p&gt;A better solution in our case is to use &lt;code&gt;must&lt;/code&gt;. That ensures each document matches the subquery, and that the subquery contributes to the score.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"bool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"filter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"match_all"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"must"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"term"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env1"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And even better: don't treat &lt;code&gt;match_all: {}&lt;/code&gt; as an 'identity' filter. Leave the &lt;code&gt;filter&lt;/code&gt; clause empty, and only add an item when we really want to filter something (e.g. language, ...):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"bool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"filter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"must"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"term"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env1"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Debugging
&lt;/h3&gt;

&lt;p&gt;Finally, some things we did to debug this.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Kibana&lt;/em&gt; is really useful as a UI to experiment with ElasticSearch queries and explore the data. Especially the "Dev Tools console".&lt;/p&gt;

&lt;p&gt;You can execute a query, and the JSON ElasticSearch returns contains the &lt;code&gt;took&lt;/code&gt; property: how long the query took, in milliseconds. It's a rough number, but it gives an indication of the order of magnitude. Before, the query usually took ~100ms; after, less than ~10ms!&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.elastic.co/guide/en/kibana/current/xpack-profiler.html"&gt;&lt;em&gt;Search Profiler&lt;/em&gt;&lt;/a&gt; is also a useful tool. It gives insight into which parts of a (compound) query take the most time. In our case a lot of time was spent in &lt;code&gt;next_doc&lt;/code&gt;, which makes sense given that the bool query didn't filter out the documents that scored &lt;code&gt;0&lt;/code&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This post was also published on my &lt;a href="https://www.aryweb.nl/2021/05/30/elasticsearch-bool-query-performance.html"&gt;personal blog&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
    </item>
    <item>
      <title>Creating efficient File Storage of Uploads for Web Applications</title>
      <dc:creator>Arian Stolwijk</dc:creator>
      <pubDate>Fri, 18 Dec 2020 10:05:17 +0000</pubDate>
      <link>https://dev.to/arian/creating-efficient-file-storage-of-uploads-for-web-applications-3clg</link>
      <guid>https://dev.to/arian/creating-efficient-file-storage-of-uploads-for-web-applications-3clg</guid>
      <description>&lt;p&gt;At my company, &lt;a href="https://symbaloo.com"&gt;Symbaloo&lt;/a&gt;, we have various features that need storage of files, mainly images uploaded by users. Files are stored on the filesystem of the servers. The files are usually referenced from the database to a path on the file system. It's important to do this correctly, so we're sure the paths in the database point to existing files on the file system, and files on the file system have a reference in the database. It's bad if users see an &lt;code&gt;&amp;lt;img&amp;gt;&lt;/code&gt; tag with a file that doesn't exist! On the other hand, to keep hosting costs down we don't want to store more than necessary. In this post I will explain how we solved this problem and now store files efficiently.&lt;/p&gt;

&lt;p&gt;We'll solve the problem and discuss:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  File naming using SHA-256 hashes&lt;/li&gt;
&lt;li&gt;  Storing references in a database&lt;/li&gt;
&lt;li&gt;  Garbage collecting files&lt;/li&gt;
&lt;li&gt;  Safely deleting files&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  File naming scheme using SHA-256 hashes
&lt;/h3&gt;

&lt;p&gt;A simple way to name files would be to create a database entry, which has an &lt;code&gt;id&lt;/code&gt; field, and name the file &lt;code&gt;[id].png&lt;/code&gt; (for example &lt;code&gt;/some/path/1337.png&lt;/code&gt;). A problem, however, is that we would need to know the &lt;code&gt;id&lt;/code&gt; upfront, or rename the file after inserting the entry. Another problem is that when we use a CDN with caching, the cache needs to be purged on every change, or users will get out-of-date content. Using a &lt;a href="https://en.wikipedia.org/wiki/Universally_unique_identifier"&gt;UUID&lt;/a&gt; would help, but we can do better!&lt;/p&gt;

&lt;p&gt;We can look at the file content and derive a unique file name from it, using a &lt;a href="https://en.wikipedia.org/wiki/SHA-2"&gt;SHA-256 hash&lt;/a&gt;. This is inspired by the way &lt;a href="https://git-scm.com/book/en/v2/Git-Internals-Git-Objects"&gt;git stores objects&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This has two benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The same files will get the same hash, so you won't store duplicate files. This is especially useful for cases where many database entries refer to the same file.&lt;/li&gt;
&lt;li&gt;  File paths are based on the content, so it's perfect for generating URLs to these files in combination with CDNs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Hashing A File
&lt;/h4&gt;

&lt;p&gt;Let's review what it means to hash a file. A hash function maps an input to a fixed-size output; running the function with the same input always returns the same hash. A hash function is also one-way: it's practically impossible to recover the original input from the hash.&lt;/p&gt;

&lt;p&gt;SHA-256 hashes are always 32 bytes long (256 bits is 32 bytes of 8 bits 🤯), no matter how long the input is. Other hash functions can have different lengths. For example, hashes from the SHA-1 hash function, which &lt;code&gt;git&lt;/code&gt; uses, are 20 bytes long.&lt;/p&gt;

&lt;p&gt;On the command line, you can test it using the &lt;code&gt;sha256sum&lt;/code&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"hello"&lt;/span&gt; | &lt;span class="nb"&gt;sha256sum
&lt;/span&gt;2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824  -
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result here is the &lt;em&gt;hex&lt;/em&gt; representation of the 32 bytes. The returned string is 64 characters long, so each pair of characters represents one byte in hexadecimal notation: from &lt;code&gt;0&lt;/code&gt; as &lt;code&gt;00&lt;/code&gt; up to &lt;code&gt;255&lt;/code&gt; as &lt;code&gt;ff&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In Java or Kotlin (JVM) a &lt;code&gt;ByteArray&lt;/code&gt; can be hashed using &lt;code&gt;MessageDigest&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nc"&gt;ByteArray&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nc"&gt;ByteArray&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;
    &lt;span class="nc"&gt;MessageDigest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getInstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SHA-256"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="nd"&gt;@sha256&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nc"&gt;ByteArray&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toHexString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;
    &lt;span class="nf"&gt;joinToString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s"&gt;"%02x"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"hello"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toByteArray&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toHexString&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="c1"&gt;// prints "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To hash complete files it's better to use &lt;code&gt;DigestOutputStream&lt;/code&gt; instead, which computes the hash while uploading/downloading/moving/editing the file, without loading the entire file into memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;inputStream&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"hello"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toByteArray&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;inputStream&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;outputStream&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ByteArrayOutputStream&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;digestStream&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DigestOutputStream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputStream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;MessageDigest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getInstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SHA-256"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;inputStream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copyTo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;digestStream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;digestStream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messageDigest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toHexString&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="c1"&gt;// prints "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Hash to file name
&lt;/h4&gt;

&lt;p&gt;Just 32 bytes isn't a file name yet. The hex representation of the bytes is a good string representation of the hash. But browsing the file system can be slow when thousands of files sit in a single directory, and some (older) file systems even have a limit on how many files a directory can contain. We can introduce some subdirectories by taking the first bytes of the 32-byte hash and using them as directory names. Using two levels of nesting, a file containing the string &lt;code&gt;hello&lt;/code&gt; would be saved to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2c/f2/4dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To format the hashes, and to parse a file path back into the 32-byte hash, we can use these functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;toFileSystemPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ByteArray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;suffix&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s"&gt;"bytes size must be 32 bytes"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;hash&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toHexString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"/$prefix/${hash.substring(0..1)}/${hash.substring(2..3)}"&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;filename&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"${hash.drop(4)}$suffix"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;"$path/$filename"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;parseFileSystemPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;suffix&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;ByteArray&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;p&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Pattern&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;quote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;s&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Pattern&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;quote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;suffix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;regex&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Regex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"^/$p/([0-9a-f]{2})/([0-9a-f]{2})/([0-9a-f]{60})$s$"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="py"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="py"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="py"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;regex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;matchEntire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?.&lt;/span&gt;&lt;span class="n"&gt;destructured&lt;/span&gt; &lt;span class="o"&gt;?:&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;"$a$b$c"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hexStringToByteArray&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;toFileSystemPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"hello"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256AsBytes&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;prefix&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"image"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;".png"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;// prints "/image/2c/f2/4dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824.png"&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;hash&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parseFileSystemPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"image"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;".png"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;toHexString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;// prints "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Storing File Reference in the database
&lt;/h3&gt;

&lt;p&gt;If we let users upload all kinds of files without keeping track of them tightly, we end up with many files on the file system that may or may not be used. To keep track, we store the file names in a database table. To check whether a specific file is actually used, we need to query the database with the file name in the &lt;code&gt;WHERE&lt;/code&gt; clause. If the table contains many entries this would be really slow without an index. So let's see how that looks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="nv"&gt;`tablename`&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="c1"&gt;-- other columns&lt;/span&gt;
  &lt;span class="nv"&gt;`file`&lt;/span&gt; &lt;span class="nb"&gt;BINARY&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`fileId`&lt;/span&gt; &lt;span class="nb"&gt;BINARY&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="nv"&gt;`fileId`&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fileId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As the file name is a 32-byte hash, we can use the &lt;code&gt;BINARY(32)&lt;/code&gt; data type. The table has a second, 4-byte column that acts as the index. This second column is a slight storage overhead, but smaller than indexing the complete 32 bytes for the fast table lookup.&lt;/p&gt;

&lt;p&gt;With the index we can quickly check if the file has a reference in the database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;tablename&lt;/span&gt; &lt;span class="n"&gt;di&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;di&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fileId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;fileId&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;di&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;fileId&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copyOfRange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;fileId&lt;/code&gt; index is not unique, so the lookup may return multiple entries. But with 4 bytes there are 256^4, i.e. more than 4*10^9, possible values. So as long as the number of entries in your database is in that order or less, collisions won't be a problem at all.&lt;/p&gt;
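&lt;p&gt;As a quick sanity check (a sketch, not part of our codebase): with uniformly distributed hashes, the expected number of rows sharing one &lt;code&gt;fileId&lt;/code&gt; prefix is simply the row count divided by 2^32.&lt;/p&gt;

```kotlin
// Expected number of rows that share a single 4-byte fileId prefix,
// assuming hashes (and thus prefixes) are uniformly distributed.
fun expectedRowsPerPrefix(totalRows: Long): Double =
    totalRows.toDouble() / (1L shl 32)
```

&lt;p&gt;Even with on the order of a billion entries, the index narrows a lookup down to comparing just a handful of full 32-byte hashes.&lt;/p&gt;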

&lt;h3&gt;
  
  
  Garbage collecting files
&lt;/h3&gt;

&lt;p&gt;In reality, it's super hard to keep the files on the file system exactly in sync with the entries in the database: entries could be deleted by some part of your application that doesn't check the files, entries might be deleted directly (manually) from the database, or they might be removed by some database cleanup batch process.&lt;/p&gt;

&lt;p&gt;Fortunately, with this database scheme, the file naming, and the nested folders, it becomes pretty straightforward to check a given file against the database and delete it if it's obsolete.&lt;/p&gt;

&lt;p&gt;Modern programming languages often use garbage collection: as a programmer you don't need to clean up memory manually; the runtime checks objects and frees the memory of those that are no longer used. For our files we can do something similar: traverse the file system in the background at given intervals and check whether the files are still referenced.&lt;/p&gt;

&lt;p&gt;Our garbage collection background job performs the following steps at given intervals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  pick a random nested folder (the file paths are two levels deep)&lt;/li&gt;
&lt;li&gt;  list all files in that folder&lt;/li&gt;
&lt;li&gt;  parse the file paths into 32-byte hashes using the &lt;code&gt;parseFileSystemPath&lt;/code&gt; function&lt;/li&gt;
&lt;li&gt;  query the database for the hashes that are still referenced&lt;/li&gt;
&lt;li&gt;  delete the files whose hashes are not in the database&lt;/li&gt;
&lt;/ul&gt;
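&lt;p&gt;The steps above can be sketched in Kotlin like this (a hedged sketch: &lt;code&gt;isReferenced&lt;/code&gt; is a hypothetical callback wrapping the database query, and file extensions and the lock check from the next section are left out):&lt;/p&gt;

```kotlin
import java.io.File

// Sketch of one garbage-collection pass over the two-level ab/cd/rest layout.
// `isReferenced` is a hypothetical callback wrapping the database lookup.
fun collectGarbage(root: File, isReferenced: (String) -> Boolean): Int {
    var deleted = 0
    // Pick a random two-level nested folder, e.g. 2c/f2
    val level1 = root.listFiles()?.filter { it.isDirectory }?.randomOrNull() ?: return 0
    val level2 = level1.listFiles()?.filter { it.isDirectory }?.randomOrNull() ?: return 0
    for (file in level2.listFiles().orEmpty().filter { it.isFile }) {
        // Rebuild the full hex hash from the two directory names plus the file name
        val hash = level1.name + level2.name + file.nameWithoutExtension
        if (!isReferenced(hash) && file.delete()) deleted++
    }
    return deleted
}
```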

&lt;h3&gt;
  
  
  Safely deleting files
&lt;/h3&gt;

&lt;p&gt;There is one tricky moment when files are created and deleted at the same time: a web request has just stored a file under its hashed name on the file system, but hasn't stored the reference in the database yet. At that same moment, the garbage collector (or another request) checks the database, finds no reference, and concludes the file can be deleted. That's bad: the just-created file is gone, while the original request stores its reference in the database anyway, leaving us in an inconsistent state. The newly created file must be protected from other processes.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;.lock&lt;/code&gt; file can protect the new created file from deletion until a reference is saved in the database:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The &lt;code&gt;OutputStream&lt;/code&gt; that the &lt;code&gt;DigestOutputStream&lt;/code&gt; writes to points at a temporary file (named e.g. using &lt;code&gt;UUID.randomUUID().toString()&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;  After the write is done, we know the hash.&lt;/li&gt;
&lt;li&gt;  Before moving the temporary file to its final location, we first create an empty file &lt;code&gt;[hash].lock&lt;/code&gt;. If the lock file already exists, we append a counter to the name so the creating process holds a unique lock.&lt;/li&gt;
&lt;li&gt;  We move the temporary file to its final location.&lt;/li&gt;
&lt;li&gt;  We store the file name / hash in the database.&lt;/li&gt;
&lt;li&gt;  The &lt;code&gt;[hash].lock&lt;/code&gt; file is deleted.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And when deleting a file:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  We first check whether a &lt;code&gt;[hash].lock&lt;/code&gt; file exists&lt;/li&gt;
&lt;li&gt;  If it does, we assume another process is creating the file, and we don't delete it.&lt;/li&gt;
&lt;/ul&gt;
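&lt;p&gt;A minimal sketch of the creation side of this protocol (assumptions: &lt;code&gt;saveReferenceInDatabase&lt;/code&gt; is a hypothetical callback for the database insert, and the counter for an already-existing lock is omitted):&lt;/p&gt;

```kotlin
import java.io.File

// Sketch of the lock-file protocol above: protect a freshly written file from
// the garbage collector until its reference is stored in the database.
// `saveReferenceInDatabase` is a hypothetical callback for the database insert.
fun storeSafely(root: File, tempFile: File, hashHex: String,
                saveReferenceInDatabase: (String) -> Unit): File {
    val finalFile = File(root, "${hashHex.substring(0, 2)}/${hashHex.substring(2, 4)}/${hashHex.drop(4)}")
    finalFile.parentFile.mkdirs()
    val lock = File(root, "$hashHex.lock")
    lock.createNewFile()                  // from now on the GC must not touch this hash
    try {
        tempFile.renameTo(finalFile)      // move the temp file to its final location
        saveReferenceInDatabase(hashHex)  // now the GC can find the reference
    } finally {
        lock.delete()                     // the reference exists; the lock can go
    }
    return finalFile
}
```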

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Using file hashes together with a database index we can store files efficiently. The file system stays clean, because we can safely check whether a file is referenced in the database, and garbage collection catches anything that slips through.&lt;/p&gt;

&lt;p&gt;Have you ever built something similar, or do you know of existing packages or frameworks that do this? Let me know in the comments!&lt;/p&gt;

</description>
      <category>kotlin</category>
      <category>java</category>
      <category>webdev</category>
      <category>filestorage</category>
    </item>
  </channel>
</rss>
