<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sam Stoelinga</title>
    <description>The latest articles on DEV Community by Sam Stoelinga (@samos123).</description>
    <link>https://dev.to/samos123</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F305210%2F84f0238d-d14f-411a-aaa6-c9fac0307bdb.jpeg</url>
      <title>DEV Community: Sam Stoelinga</title>
      <link>https://dev.to/samos123</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/samos123"/>
    <language>en</language>
    <item>
      <title>Don't use a K8s Service for LLM Serving!</title>
      <dc:creator>Sam Stoelinga</dc:creator>
      <pubDate>Tue, 11 Mar 2025 07:25:05 +0000</pubDate>
      <link>https://dev.to/samos123/dont-use-a-k8s-service-for-llm-serving-1a2j</link>
      <guid>https://dev.to/samos123/dont-use-a-k8s-service-for-llm-serving-1a2j</guid>
      <description>&lt;p&gt;Relying solely on standard Kubernetes Services for load balancing can lead to suboptimal performance when serving LLMs. That's because engines like vLLM provide prefix caching, which can speed up inference. However, when you have multiple instances serving the same model, you need to make sure requests with the same prompt prefix go to the same vLLM instance. That's why a standard K8s Service won't work:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmw124ir39gnfis6ufrv4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmw124ir39gnfis6ufrv4.png" alt="Why hashing is important" width="800" height="316"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How LLM engines use caching&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;LLM engines use Key-Value (KV) caches to store the attention state computed for input prompts. This "prefix caching" speeds up responses when requests share a common prefix. Standard Kubernetes Services, however, distribute requests randomly across replicas, causing cache misses and slower responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Standard Load Balancing Falls Short&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Random request distribution leads to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Frequent Cache Evictions&lt;/strong&gt;: Caches are cleared often, reducing efficiency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Increased Latency&lt;/strong&gt;: More time is needed to process requests without cache benefits.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;A Smarter Approach: Prompt Prefix Consistent Hashing with Bounded Loads (CHWBL)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Prompt-prefix-based CHWBL offers a better solution by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Maximizing Cache Use&lt;/strong&gt;: Similar requests go to the same LLM replica, keeping caches relevant.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Balancing Load&lt;/strong&gt;: Ensures no single replica is overwhelmed, maintaining performance.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
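The CHWBL idea can be sketched in a few lines of Python. This is an illustrative toy, not KubeAI's actual implementation; the replica names, virtual-node count, prefix length, and load-bound factor of 1.25 are made-up example values:

```python
import hashlib
from bisect import bisect_right

class CHWBL:
    """Toy consistent hashing with bounded loads, keyed on a prompt prefix."""

    def __init__(self, replicas, vnodes=100, bound=1.25):
        self.bound = bound  # max allowed load = bound * average load
        self.load = {r: 0 for r in replicas}
        # Place vnodes points per replica on the hash ring.
        self.ring = sorted(
            (self._hash(f"{r}-{i}"), r)
            for r in replicas for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.sha256(s.encode()).hexdigest(), 16)

    def pick(self, prompt, prefix_len=16):
        # Hash only the prompt prefix, so requests sharing a prefix map to
        # the same point on the ring (and thus the same warm KV cache).
        h = self._hash(prompt[:prefix_len])
        total = sum(self.load.values())
        limit = self.bound * (total + 1) / len(self.load)
        i = bisect_right(self.keys, h) % len(self.ring)
        # Walk the ring until we find a replica under the load bound.
        for _ in range(len(self.ring)):
            replica = self.ring[i][1]
            if self.load[replica] + 1 <= max(limit, 1):
                self.load[replica] += 1
                return replica
            i = (i + 1) % len(self.ring)
        raise RuntimeError("no replica under load bound")

    def done(self, replica):
        # Call when a request finishes, so the load accounting stays correct.
        self.load[replica] -= 1
```

Two requests that share a prefix land on the same replica (as long as it is under the load bound), while the bound prevents a hot prefix from overwhelming a single replica.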

&lt;p&gt;&lt;strong&gt;Real-World Benefits&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Implementing CHWBL has shown:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8j2pcvzqjwj720tswtk0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8j2pcvzqjwj720tswtk0.png" alt="Serving LLM LB benchmarks" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;95% Faster Initial Responses&lt;/strong&gt;: Quicker start to data processing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;127% Increase in Throughput&lt;/strong&gt;: More requests handled efficiently.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For effective LLM serving, move beyond standard Kubernetes Services. Adopting advanced load balancing like CHWBL can significantly enhance performance and user satisfaction. &lt;/p&gt;

&lt;p&gt;Source: &lt;a href="https://www.kubeai.org/blog/2025/02/26/llm-load-balancing-at-scale-chwbl/" rel="noopener noreferrer"&gt;Prefix Aware Load Balancing&lt;/a&gt; on the KubeAI blog&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>Tutorial: Deploying Llama 3.1 405B on GKE Autopilot with 8 x A100 80GB</title>
      <dc:creator>Sam Stoelinga</dc:creator>
      <pubDate>Tue, 08 Oct 2024 02:44:38 +0000</pubDate>
      <link>https://dev.to/samos123/tutorial-deploying-llama-31-405b-on-gke-autopilot-with-8-x-a100-80gb-a97</link>
      <guid>https://dev.to/samos123/tutorial-deploying-llama-31-405b-on-gke-autopilot-with-8-x-a100-80gb-a97</guid>
      <description>&lt;p&gt;Tutorial on how to deploy the Llama 3.1 405B model on GKE Autopilot with 8 x A100 80GB GPUs using &lt;a href="https://github.com/substratusai/kubeai" rel="noopener noreferrer"&gt;KubeAI&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We're using fp8 (8-bit) precision for this model, which reduces the GPU memory required and lets us serve the model on a single machine.&lt;/p&gt;
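A quick back-of-the-envelope check shows why fp8 makes this possible (rough numbers only; real usage adds activation memory, CUDA overhead, and the KV cache):

```python
# Why fp8 lets a 405B-parameter model fit on one 8 x A100 80GB machine.
params = 405e9
weights_fp16_gb = params * 2 / 1e9   # 2 bytes/param -> ~810 GB, too big
weights_fp8_gb = params * 1 / 1e9    # 1 byte/param  -> ~405 GB
total_gpu_gb = 8 * 80                # 640 GB across the machine

assert weights_fp16_gb > total_gpu_gb  # fp16 weights alone don't fit
assert weights_fp8_gb < total_gpu_gb   # fp8 leaves room for the KV cache
```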

&lt;p&gt;Create a GKE Autopilot cluster&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud container clusters create-auto cluster-1 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the helm repo for KubeAI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm repo add kubeai https://www.kubeai.org
helm repo update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a values file for KubeAI with required settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; &amp;gt; kubeai-values.yaml
resourceProfiles:
  nvidia-gpu-a100-80gb:
    imageName: "nvidia-gpu"
    limits:
      nvidia.com/gpu: "1"
    requests:
      nvidia.com/gpu: "1"
      # Each A100 80GB GPU gets 10 CPU and 12Gi memory
      cpu: 10
      memory: 12Gi
    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Equal"
        value: "present"
        effect: "NoSchedule"
    nodeSelector:
      cloud.google.com/gke-accelerator: "nvidia-a100-80gb"
      cloud.google.com/gke-spot: "true"
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install KubeAI with Helm:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm upgrade &lt;span class="nt"&gt;--install&lt;/span&gt; kubeai kubeai/kubeai &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-f&lt;/span&gt; ./kubeai-values.yaml &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--wait&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deploy Llama 3.1 405B by creating a KubeAI Model object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama-3.1-405b-instruct-fp8-a100
spec:
  features: [TextGeneration]
  owner:
  url: hf://neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8
  engine: VLLM
  env:
    VLLM_ATTENTION_BACKEND: FLASHINFER
  args:
    - --max-model-len=65536
    - --max-num-batched-tokens=65536
    - --gpu-memory-utilization=0.98
    - --tensor-parallel-size=8
    - --enable-prefix-caching
    - --disable-log-requests
    - --max-num-seqs=128
    - --kv-cache-dtype=fp8
    - --enforce-eager
    - --enable-chunked-prefill=false
    - --num-scheduler-steps=8
  targetRequests: 128
  minReplicas: 1
  maxReplicas: 1
  resourceProfile: nvidia-gpu-a100-80gb:8
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pod takes about 15 minutes to start up. Wait for the model pod to be ready:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-w&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the pod is ready, the model is ready to serve requests.&lt;/p&gt;

&lt;p&gt;Set up a port-forward to the KubeAI service on localhost port 8000:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl port-forward service/kubeai 8000:80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Send a request to the model to test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt; curl &lt;span class="nt"&gt;-v&lt;/span&gt; http://localhost:8000/openai/v1/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model": "llama-3.1-405b-instruct-fp8-a100", "prompt": "Who was the first president of the United States?", "max_tokens": 40}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's run a benchmark using the vLLM benchmarking script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/vllm-project/vllm.git
&lt;span class="nb"&gt;cd &lt;/span&gt;vllm/benchmarks
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
python3 benchmark_serving.py &lt;span class="nt"&gt;--backend&lt;/span&gt; openai &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--base-url&lt;/span&gt; http://localhost:8000/openai &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--dataset-name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sharegpt &lt;span class="nt"&gt;--dataset-path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ShareGPT_V3_unfiltered_cleaned_split.json &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--model&lt;/span&gt; llama-3.1-405b-instruct-fp8-a100 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--seed&lt;/span&gt; 12345 &lt;span class="nt"&gt;--tokenizer&lt;/span&gt; neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This was the output of the benchmarking script on 8 x A100 80GB GPUs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  410.49
Total input tokens:                      232428
Total generated tokens:                  173391
Request throughput (req/s):              2.44
Output token throughput (tok/s):         422.40
Total Token throughput (tok/s):          988.63
---------------Time to First Token----------------
Mean TTFT (ms):                          136607.47
Median TTFT (ms):                        125998.27
P99 TTFT (ms):                           335309.25
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          302.24
Median TPOT (ms):                        267.34
P99 TPOT (ms):                           1427.52
---------------Inter-token Latency----------------
Mean ITL (ms):                           249.94
Median ITL (ms):                         128.63
P99 ITL (ms):                            1240.35
==================================================
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hope this is helpful to other folks struggling to get Llama 3.1 405B up and running on GKE. Similar steps work for GKE Standard as long as you create your a2-ultragpu-8g node pools in advance.&lt;/p&gt;

</description>
      <category>gke</category>
      <category>kubernetes</category>
      <category>ai</category>
    </item>
    <item>
      <title>Infinity embeddings on Kubernetes with KubeAI</title>
      <dc:creator>Sam Stoelinga</dc:creator>
      <pubDate>Wed, 25 Sep 2024 13:41:20 +0000</pubDate>
      <link>https://dev.to/samos123/infinity-embeddings-on-kubernetes-with-kubeai-2a4a</link>
      <guid>https://dev.to/samos123/infinity-embeddings-on-kubernetes-with-kubeai-2a4a</guid>
      <description>&lt;p&gt;Just merged and released the &lt;a href="https://github.com/substratusai/kubeai/pull/197" rel="noopener noreferrer"&gt;Infinity support PR&lt;/a&gt; in KubeAI, adding Infinity as an embedding engine. You can now get embeddings running on your Kubernetes clusters with an OpenAI-compatible API.&lt;/p&gt;

&lt;p&gt;Infinity is a high performance and low latency embeddings engine: &lt;a href="https://github.com/michaelfeil/infinity" rel="noopener noreferrer"&gt;https://github.com/michaelfeil/infinity&lt;/a&gt;&lt;br&gt;
KubeAI is a Kubernetes Operator for running OSS ML serving engines: &lt;a href="https://github.com/substratusai/kubeai" rel="noopener noreferrer"&gt;https://github.com/substratusai/kubeai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How to use this?&lt;/p&gt;

&lt;p&gt;Run on any K8s cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add kubeai https://www.kubeai.org
helm install kubeai kubeai/kubeai --wait --timeout 10m
cat &amp;gt; model-values.yaml &amp;lt;&amp;lt; EOF
catalog:
  bge-embed-text-cpu:
    enabled: true
    features: ["TextEmbedding"]
    owner: baai
    url: "hf://BAAI/bge-small-en-v1.5"
    engine: Infinity
    resourceProfile: cpu:1
    minReplicas: 1
EOF
helm install kubeai-models kubeai/models -f ./model-values.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Forward the kubeai service to localhost:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl port-forward svc/kubeai 8000:80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Afterwards you can use the OpenAI Python client to get embeddings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from openai import OpenAI
# Assumes port-forward of kubeai service to localhost:8000.
client = OpenAI(api_key="ignored", base_url="http://localhost:8000/openai/v1")
response = client.embeddings.create(
    input="Your text goes here.",
    model="bge-embed-text-cpu"
)
print(response)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
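The response contains vectors (e.g. response.data[0].embedding); to actually use them you typically compare vectors with cosine similarity. A minimal pure-Python helper, independent of the running server:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

Embed two texts with the client above and compare: values near 1.0 mean the texts are semantically similar, values near 0 mean they are unrelated.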



&lt;p&gt;What’s next?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Support for autoscaling based on Infinity reported metrics.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Introducing KubeAI: Open AI Inference Operator</title>
      <dc:creator>Sam Stoelinga</dc:creator>
      <pubDate>Mon, 16 Sep 2024 23:31:16 +0000</pubDate>
      <link>https://dev.to/samos123/introducing-kubeai-open-ai-inference-operator-4hm3</link>
      <guid>https://dev.to/samos123/introducing-kubeai-open-ai-inference-operator-4hm3</guid>
      <description>&lt;p&gt;We recently launched KubeAI. The goal of KubeAI is to get LLMs, embedding models, and speech-to-text models running on Kubernetes with ease.&lt;/p&gt;

&lt;p&gt;KubeAI provides an OpenAI compatible API endpoint which makes it work out of the box with most software that works with the OpenAI APIs.&lt;/p&gt;

&lt;p&gt;Repo on GitHub: &lt;a href="https://github.com/substratusai/kubeai" rel="noopener noreferrer"&gt;substratusai/kubeai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd75lnti2ecnd3bzej1ww.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd75lnti2ecnd3bzej1ww.png" alt="Image description" width="780" height="528"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When it comes to LLMs, KubeAI directly operates vLLM and Ollama servers in isolated Pods, configured and optimized on a model-by-model basis. You get metrics-based auto scaling out of the box (including scale-from-zero). When you hear scale-from-zero in Kubernetes-land you probably think Knative and Istio - but not in KubeAI! We made an early design decision to avoid any external dependencies (Kubernetes is complicated enough as-is).&lt;/p&gt;

&lt;p&gt;We are hoping to release more functionality soon. Next up: model caching, metrics and dashboard.&lt;/p&gt;

&lt;p&gt;If you need any help or have any feedback, reach out directly, here, or via the channels listed in the repo. We are currently making it our priority to assist the project’s early adopters. So far users have seen success in use cases ranging from processing large scale batches in the cloud to running lightweight inference at the edge.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>How I deployed a global website speed checking service to 25 locations while costing less than $5/yr using GCP Cloud Run</title>
      <dc:creator>Sam Stoelinga</dc:creator>
      <pubDate>Mon, 28 Feb 2022 22:57:19 +0000</pubDate>
      <link>https://dev.to/samos123/how-i-deployed-a-global-website-speed-checking-service-to-25-locations-while-costing-less-than-5yr-using-gcp-cloud-run-5a05</link>
      <guid>https://dev.to/samos123/how-i-deployed-a-global-website-speed-checking-service-to-25-locations-while-costing-less-than-5yr-using-gcp-cloud-run-5a05</guid>
      <description>&lt;p&gt;I created &lt;a href="https://websu.io" rel="noopener noreferrer"&gt;https://websu.io&lt;/a&gt;, an open-source webpage speed monitoring tool; a key feature is the ability to test your webpage speed from locations across the world. This post describes the architecture of Websu and how it allowed me to easily deploy to 25 locations across the world for $3.73/yr.&lt;/p&gt;

&lt;p&gt;The easiest approach would be to spin up a server in each location and run speed tests there when users request them. That, however, comes with a large downside: you pay for each server even when no one is submitting requests to that specific location. Had I done this, it would have cost me 25 * ~$5/month = $1,500/year even without any users, which wasn't feasible for a pure hobby project. Note that I don't like spending money; I am Dutch after all, and there is a reason the saying "going Dutch" exists for splitting the bill.&lt;/p&gt;

&lt;p&gt;So what comes to the mind of a scrappy Dutchman? Serverless with Cloud Run! I split up Websu into 3 components to make it possible to run from any location:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;websu-ui: React/Nextjs front-end that consumes a REST API. OSS version here: &lt;a href="https://github.com/websu-io/websu-ui" rel="noopener noreferrer"&gt;https://github.com/websu-io/websu-ui&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;websu-api: The API front-end that takes in the end-user requests. Source repo: &lt;a href="https://github.com/websu-io/websu" rel="noopener noreferrer"&gt;https://github.com/websu-io/websu&lt;/a&gt; (note: this project came out of me learning to code, so excuse the rough edges).&lt;/li&gt;
&lt;li&gt;lighthouse-server: GRPC server that takes in lighthouse requests and responds with the raw lighthouse json output. Source repo: &lt;a href="https://github.com/websu-io/websu/tree/master/pkg/lighthouse" rel="noopener noreferrer"&gt;https://github.com/websu-io/websu/tree/master/pkg/lighthouse&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The websu-ui and websu-api are deployed to a single Cloud Run region, while the lighthouse-server is deployed to every region Cloud Run supports. This way, Cloud Run runs 0 lighthouse-server instances in any of the 25 regions while idle, so I'm not billed. Only when an end user tests page speed from a specific location, e.g. Japan, does Cloud Run spin up an instance automatically. The cold-start delay has been very minimal for me and definitely worth the savings. You can see a screenshot of my Cloud Run and Artifact Registry costs here:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8j15wakorma0tx7ffmfd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8j15wakorma0tx7ffmfd.png" alt="Websu Cloud Run costs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To make my life easier, I also wrote a simple python script to manage the deployment of new lighthouse-servers: &lt;a href="https://github.com/websu-io/websu/blob/master/scripts/deploy-lighthouse-servers.py" rel="noopener noreferrer"&gt;https://github.com/websu-io/websu/blob/master/scripts/deploy-lighthouse-servers.py&lt;/a&gt;&lt;/p&gt;
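The script essentially loops over regions and deploys the same container image to each one; a simplified sketch of the idea (the image name and region list here are illustrative, not the actual values the linked script uses):

```python
import subprocess

# Hypothetical image path for illustration; see the linked script for the real one.
IMAGE = "us-docker.pkg.dev/my-project/websu/lighthouse-server:latest"

def deploy_cmd(region):
    # min-instances=0 is what keeps idle regions free: no traffic, no bill.
    return [
        "gcloud", "run", "deploy", "lighthouse-server",
        "--image", IMAGE,
        "--region", region,
        "--platform", "managed",
        "--min-instances", "0",
        "--allow-unauthenticated",
    ]

def deploy_all(regions, run=subprocess.run):
    # `run` is injectable so the command construction can be tested without gcloud.
    for region in regions:
        run(deploy_cmd(region), check=True)
```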

&lt;p&gt;During this time I was able to serve 3.1k users (see the analytics screenshot below) across 25 locations for $3.73/yr instead of $1,500/yr. I hope this post gives people insight into new use cases for serverless.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fut5a2vy6egkf52giwgxi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fut5a2vy6egkf52giwgxi.png" alt="Websu visitor analytics"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The big spike was due to Korben.info mentioning Websu: &lt;a href="https://korben.info/script-mesurer-performances-site-web.html" rel="noopener noreferrer"&gt;https://korben.info/script-mesurer-performances-site-web.html&lt;/a&gt;, and I was happy that Cloud Run could easily sustain it.&lt;/p&gt;

&lt;p&gt;Note/Disclaimer: I'm excluding the cost of the VM running MongoDB, which costs me about $10/month, because I would have incurred that cost with or without Cloud Run. Opinions are my own and not the views of my employer.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>gcp</category>
      <category>serverless</category>
    </item>
  </channel>
</rss>
