<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Wayne</title>
    <description>The latest articles on DEV Community by Wayne (@wheynelau).</description>
    <link>https://dev.to/wheynelau</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3898242%2F77ad6a26-606a-4f53-a83c-55494768faf9.jpeg</url>
      <title>DEV Community: Wayne</title>
      <link>https://dev.to/wheynelau</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/wheynelau"/>
    <language>en</language>
    <item>
      <title>Learnings of the Poor</title>
      <dc:creator>Wayne</dc:creator>
      <pubDate>Sun, 26 Apr 2026 06:44:18 +0000</pubDate>
      <link>https://dev.to/wheynelau/learnings-of-the-poor-2086</link>
      <guid>https://dev.to/wheynelau/learnings-of-the-poor-2086</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Necessity is the mother of invention&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I was already GPU poor, but a recent job change combined with rising component prices has also made me RAM and NVMe poor.&lt;/p&gt;

&lt;p&gt;While I am nowhere close to the optimisation experts of the 90s and early 2000s, I took this time to brush up on some fundamentals and key concepts in Python. As the saying goes:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Premature optimisation is the root of all evil"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We are not looking for very deep, low-level optimisations; these changes aim to follow the Pareto Principle, where 80% of the outcome comes from 20% of the effort. The changes below may or may not be exactly 20% effort, but I would consider them low-effort.&lt;/p&gt;

&lt;p&gt;As such, there won't be any discussion of performance profiling, such as identifying hot loops, cache misses, memory reallocations, and so on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Iterators
&lt;/h2&gt;

&lt;p&gt;Frankly, I think this is an important concept that carries over well regardless of language. Understanding iterators also helps when you need to think about channels, which are very important in Go.&lt;/p&gt;

&lt;p&gt;The typical approach collects results at every stage into lists:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;first_filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;second_processing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;write_processed_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The issue: if &lt;code&gt;data.jsonl&lt;/code&gt; is bigger than your RAM, you run out of memory (OOM) very quickly. Using &lt;code&gt;yield&lt;/code&gt; instead keeps memory usage low:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections.abc&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Iterator&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;read_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Iterator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;first_filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Iterator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Iterator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;is_good&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each function in the pipeline takes an &lt;code&gt;Iterator[dict]&lt;/code&gt; and yields records one at a time. Memory usage drops significantly.&lt;/p&gt;
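
&lt;p&gt;For completeness, here is a minimal sketch of the writing end of the pipeline, mirroring the function names used above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from collections.abc import Iterator

def write_processed_data(records: Iterator[dict], file: str) -&amp;gt; None:
    with open(file, "w") as f:
        for record in records:
            # json.dumps has no trailing newline, so add one per JSONL record
            f.write(json.dumps(record) + "\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
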

&lt;p&gt;Caveats:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Files are held open throughout the pipeline, so unintentional edits or moves will break it.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;json.dumps&lt;/code&gt; does not add a trailing newline, so &lt;code&gt;f.write(json.dumps(record) + '\n')&lt;/code&gt; is intentional when writing JSONL.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Learning points
&lt;/h3&gt;

&lt;p&gt;I find that iterators are a stepping stone to understanding pipelines, channels, and pub/sub patterns. When you understand iterators, you understand the bottlenecks of your code. Fundamentally, these patterns are all iterators that consume and yield.&lt;/p&gt;

&lt;p&gt;If the processing stage (&lt;code&gt;second_processing&lt;/code&gt;) is slow at 1 line per second while reading and filtering run at 4 lines per second, the pipeline is bounded by 1 line per second. The solution is more processing workers bridged through queues or channels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Read-worker-1 -&amp;gt; Filter-worker-1 -&amp;gt; Process-worker-{1..4} -&amp;gt; Write-worker-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
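
&lt;p&gt;One low-effort way to get there in Python, without a full pipeline framework, is &lt;code&gt;multiprocessing.Pool.imap&lt;/code&gt;, which fans out only the slow stage while the rest of the pipeline stays lazy. A rough sketch, reusing the generators above and with &lt;code&gt;second_processing_one&lt;/code&gt; as a hypothetical per-record version of the slow step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from multiprocessing import Pool

def second_processing_one(record: dict) -&amp;gt; dict:
    # placeholder for the slow per-record work
    return record

if __name__ == "__main__":
    data = read_file("data.jsonl")   # generators from the section above
    data = first_filter(data)
    with Pool(processes=4) as pool:
        # unlike map, imap does not materialise the whole input;
        # results are yielded in order as workers finish
        processed = pool.imap(second_processing_one, data, chunksize=16)
        write_processed_data(processed, "output.jsonl")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;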



&lt;h2&gt;
  
  
  Compression
&lt;/h2&gt;

&lt;p&gt;In my &lt;a href="https://wheynelau.dev/posts/compression-with-ztsd/" rel="noopener noreferrer"&gt;Compression&lt;/a&gt; post, I mentioned that you should benchmark to know whether your use case benefits from compression. For write-once, read-many scenarios, higher compression levels may help.&lt;/p&gt;

&lt;p&gt;Here is a measurement for an IO-constrained scenario (reading a JSONL file from NAS):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;ZST: 100000it [00:05, 17220.01it/s]  (9.47 MB/s)
Raw: 100000it [00:40, 2492.39it/s]  (11.15 MB/s)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because the data is compressed, you can read more data per buffer: more lines fit in each MB of compressed JSONL than in its raw form.&lt;/p&gt;
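
&lt;p&gt;As a minimal sketch of the read side, assuming the file was written with zstd and using the &lt;code&gt;python-zstandard&lt;/code&gt; package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import io
import json
from collections.abc import Iterator

import zstandard as zstd

def read_zst_jsonl(file: str) -&amp;gt; Iterator[dict]:
    with open(file, "rb") as fh:
        dctx = zstd.ZstdDecompressor()
        with dctx.stream_reader(fh) as reader:
            # wrap the binary stream so it can be iterated line by line
            for line in io.TextIOWrapper(reader, encoding="utf-8"):
                yield json.loads(line)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;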

&lt;h2&gt;
  
  
  Less is more
&lt;/h2&gt;

&lt;p&gt;Less work means more efficient processing. It's about eliminating wasted work, not just adding caches everywhere.&lt;/p&gt;

&lt;p&gt;If filtering takes 1s per line and processing takes 5s per line:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Process then filter on 10000 lines: &lt;code&gt;10000 * 6s = 60000s&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Filter then process on 10000 lines (50% bad): &lt;code&gt;10000 * 1s + 5000 * 5s = 35000s&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No complex code, no need for compiled languages. Algorithmic complexity matters too. Choosing the right data structure — a set for membership checks instead of a list, a deque instead of a list for queue operations — can eliminate entire classes of wasted work regardless of language.&lt;/p&gt;
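
&lt;p&gt;A quick illustration of the data-structure point (exact numbers will vary by machine):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import timeit

items = list(range(100_000))
as_list = items
as_set = set(items)

# membership in a list is O(n); in a set it is O(1) on average
print(timeit.timeit(lambda: 99_999 in as_list, number=1_000))
print(timeit.timeit(lambda: 99_999 in as_set, number=1_000))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;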

&lt;p&gt;The full version with code examples and benchmarks is on &lt;a href="https://wheynelau.dev/posts/2026-03-27-learnings-of-the-poor/" rel="noopener noreferrer"&gt;my blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>optimization</category>
      <category>iterators</category>
      <category>programming</category>
    </item>
    <item>
      <title>How to Benchmark LLM Inference Performance: TTFT, ITL, and Throughput Metrics</title>
      <dc:creator>Wayne</dc:creator>
      <pubDate>Sun, 26 Apr 2026 05:05:46 +0000</pubDate>
      <link>https://dev.to/wheynelau/how-to-benchmark-llm-inference-performance-ttft-itl-and-throughput-metrics-416p</link>
      <guid>https://dev.to/wheynelau/how-to-benchmark-llm-inference-performance-ttft-itl-and-throughput-metrics-416p</guid>
      <description>&lt;p&gt;When deploying large language models to production, measuring performance accurately is critical. Whether you're using vLLM, SGLang, TensorRT-LLM, or a custom inference stack, you need to understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Throughput&lt;/strong&gt;: How many requests per second can your system handle?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency metrics&lt;/strong&gt;: Time to First Token (TTFT), Inter-Token Latency (ITL), and end-to-end latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token generation speed&lt;/strong&gt;: Tokens per second under different concurrency levels&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tail latency&lt;/strong&gt;: P95 and P99 values that affect user experience&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this post, I'll walk through the key metrics for benchmarking language models and share why I built &lt;a href="https://github.com/wheynelau/llmperf-rs" rel="noopener noreferrer"&gt;llmperf-rs&lt;/a&gt;, a Rust-based benchmarking tool that takes a different approach to measuring these metrics.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Existing Tools
&lt;/h2&gt;

&lt;p&gt;While working with &lt;a href="https://github.com/ray-project/llmperf" rel="noopener noreferrer"&gt;ray-project/llmperf&lt;/a&gt; (now archived), I noticed that Inter-Token Latency (ITL) was calculated by averaging per-request first, then aggregating those averages. This approach works well for many use cases, but I needed to preserve individual latency spikes during testing.&lt;/p&gt;

&lt;p&gt;There's also &lt;a href="https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_analyzer/genai-perf/README.html" rel="noopener noreferrer"&gt;genai-perf&lt;/a&gt;, which is very comprehensive. My only issue was running it on Ubuntu 22.04 without Docker. As of this update, they've sunsetted &lt;code&gt;genai-perf&lt;/code&gt; in favor of &lt;a href="https://github.com/ai-dynamo/aiperf" rel="noopener noreferrer"&gt;aiperf&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.vllm.ai/en/latest/benchmarking/cli/#dataset-overview" rel="noopener noreferrer"&gt;vllm-bench&lt;/a&gt; is solid too, but requires installing &lt;code&gt;vllm&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The goal was to build a simple binary that runs almost anywhere with minimal dependencies. It was also a learning project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Metrics
&lt;/h2&gt;

&lt;p&gt;This is a summary of the full &lt;a href="https://github.com/wheynelau/llmperf-rs/blob/master/docs/metrics.md" rel="noopener noreferrer"&gt;metrics documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Time To First Token (TTFT)
&lt;/h3&gt;

&lt;p&gt;TTFT measures how quickly the model begins responding after receiving your request. For interactive applications, this is the perceived latency before the user sees any output. It's also important for RAG-based applications where a large chunk of processing happens at the prefill stage.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;TTFT = first_token_timestamp - request_start_timestamp&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Lower is better.&lt;/p&gt;
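
&lt;p&gt;llmperf-rs measures this in Rust, but a hand-rolled check looks roughly like the sketch below, assuming an OpenAI-compatible streaming endpoint (the URL and model name are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

from openai import OpenAI

# hypothetical local OpenAI-compatible endpoint, e.g. vLLM
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    # the first chunk may be a role-only delta; a stricter version would
    # wait for the first chunk that actually carries content
    ttft = time.perf_counter() - start
    break
print(f"TTFT: {ttft:.3f}s")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;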

&lt;h3&gt;
  
  
  Inter-Token Latency (ITL)
&lt;/h3&gt;

&lt;p&gt;ITL is the time between consecutive tokens during generation. Spikes can reveal multiple issues, most commonly network problems. ITL is usually consistent because of how the KV cache and the decode computation work.&lt;/p&gt;

&lt;p&gt;When testing against vLLM, I noticed that high ITL spikes happen when you benchmark close to the context limit. I suspect this is due to vLLM's eviction of requests if they exceed the KV cache size.&lt;/p&gt;

&lt;p&gt;For example, if 3 requests come in, each using &lt;code&gt;0.8x&lt;/code&gt; of the context for the prompt and &lt;code&gt;0.2x&lt;/code&gt; for generation, but the GPU only has room for &lt;code&gt;2.8x&lt;/code&gt; of context, one of the requests will be preempted.&lt;/p&gt;

&lt;p&gt;Aggregation: concatenate ALL ITL values across all responses, then compute statistics. Each response produces &lt;code&gt;(N-1)&lt;/code&gt; ITL values (where &lt;code&gt;N&lt;/code&gt; is the token count). By aggregating raw values instead of per-request averages, you preserve the true distribution including outliers.&lt;/p&gt;
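
&lt;p&gt;A small numerical illustration of why pooling matters (the timestamps are made up):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import statistics

# per-token arrival timestamps (seconds) for two hypothetical streamed responses
responses = [
    [0.00, 0.02, 0.04, 0.06, 0.08],
    [0.00, 0.02, 0.04, 0.52, 0.54],   # one 0.48s stall mid-stream
]

per_request_means = []
pooled_itls = []
for ts in responses:
    gaps = [b - a for a, b in zip(ts, ts[1:])]   # N tokens -&amp;gt; N-1 gaps
    per_request_means.append(statistics.mean(gaps))
    pooled_itls.extend(gaps)

print(statistics.mean(per_request_means))   # averaging first dilutes the stall
print(max(pooled_itls))                     # pooled raw gaps keep the 0.48s spike
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;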

&lt;h3&gt;
  
  
  Throughput Metrics
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Prefill TPS&lt;/strong&gt; — tokens processed per second during the prefill phase:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Prefill TPS = input_tokens / TTFT&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;However, prefill TPS doesn't accurately reflect system performance because TTFT includes queue wait time, not just actual processing time. When a server is under load, your request might sit in a queue waiting for resources. The lower prefill TPS in that case reflects queue contention, not the system's processing capability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decode TPS&lt;/strong&gt; — tokens generated per second during the decode phase:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Decode TPS = output_tokens / (final_time - decode_start_time)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This is the generation speed: how fast the model produces output.&lt;/p&gt;
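
&lt;p&gt;Putting the two formulas together on some hypothetical measurements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# hypothetical measurements for a single request (seconds)
request_start = 0.00
first_token_time = 0.35    # includes any time spent queueing before prefill
final_token_time = 4.35
input_tokens = 1024
output_tokens = 256

ttft = first_token_time - request_start
prefill_tps = input_tokens / ttft                                   # ~2926 tok/s
decode_tps = output_tokens / (final_token_time - first_token_time)  # 64 tok/s
print(prefill_tps, decode_tps)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;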

&lt;h2&gt;
  
  
  What Matters Most
&lt;/h2&gt;

&lt;p&gt;For production serving, focus on &lt;strong&gt;TTFT&lt;/strong&gt;, &lt;strong&gt;ITL stats&lt;/strong&gt;, and maybe &lt;strong&gt;RPM&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TTFT&lt;/strong&gt; measures how quickly users see their first token — this is the perceived responsiveness of your system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ITL statistics&lt;/strong&gt; reveal decode-phase issues that throughput metrics hide. The 99th percentile and max ITL values expose preemption events from KV cache limits and network issues between components.&lt;/p&gt;

&lt;p&gt;ITL matters less for batch jobs or non-streaming APIs where users don't watch tokens arrive in real-time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Token Counting
&lt;/h2&gt;

&lt;p&gt;Accurate metrics require accurate token counts. llmperf-rs handles this in two ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;API response&lt;/strong&gt; — Most OpenAI-compatible endpoints return token counts in the &lt;code&gt;usage&lt;/code&gt; field. By default, llmperf-rs uses this as priority.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tokenizer&lt;/strong&gt; — For exact input counts, pass a HuggingFace tokenizer. Note that chat templates may cause &amp;lt;10 token variance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The original llmperf uses a single tokenizer for all models. Different models use different tokenizers, so llmperf-rs lets you specify the correct one or rely on API-reported counts.&lt;/p&gt;

&lt;p&gt;For example, Llama-2 has a vocab size of 32000, while Qwen3-4B has 151936. In my own testing, setting input tokens to 8192 against a Qwen endpoint while using the default llama tokenizer returned values around 7363-7376 tokens.&lt;/p&gt;
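
&lt;p&gt;A minimal check of input token counts with a HuggingFace tokenizer (the model name is just an example and should match the endpoint you are benchmarking):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from transformers import AutoTokenizer

prompt = "some benchmark prompt " * 100

# the tokenizer should match the model behind the endpoint
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")
print(len(tok.encode(prompt)))   # counts differ across vocabularies
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;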

&lt;h2&gt;
  
  
  Validating Your Results
&lt;/h2&gt;

&lt;p&gt;All benchmark runs should end with &lt;code&gt;finish_reason = length&lt;/code&gt; (meaning the model hit the &lt;code&gt;max_tokens&lt;/code&gt; limit). If you see &lt;code&gt;finish_reason = stop&lt;/code&gt;, the model stopped early, which skews metrics like RPM and E2E latency: a higher share of early stops produces higher RPM and lower latency simply because responses are shorter.&lt;/p&gt;
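
&lt;p&gt;A quick sanity check against an OpenAI-compatible endpoint (non-streaming; the URL and model name are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import Counter

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

reasons = Counter()
for _ in range(10):
    resp = client.chat.completions.create(
        model="my-model",
        messages=[{"role": "user", "content": "Write a long story."}],
        max_tokens=256,
    )
    reasons[resp.choices[0].finish_reason] += 1
print(reasons)   # ideally every run ends with finish_reason == "length"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;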

&lt;h2&gt;
  
  
  When to Use llmperf-rs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use llmperf-rs when:&lt;/strong&gt; running benchmarks with minimal dependencies, testing OpenAI-compatible endpoints, wanting low overhead (Rust, no Ray/ZMQ), or needing a quick way to test endpoints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consider alternatives when:&lt;/strong&gt; you need GPU-level metrics (use trtllm-bench or aiperf), testing vLLM-specific features, requiring extensive reporting dashboards, or needing distributed testing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why ITL Matters Even When Throughput Looks Good
&lt;/h2&gt;

&lt;p&gt;High throughput with bad ITL means tokens arrive in bursts, and chat users notice the choppy streaming. ITL spikes (p99 &amp;gt;100ms) often indicate preemption, network issues, or other problems. For non-user-facing use cases like agentic coding, throughput may matter more than ITL specifics.&lt;/p&gt;

&lt;p&gt;The full version with code examples, benchmarks, and installation instructions is on &lt;a href="https://wheynelau.dev/posts/2025-12-15-benchmarking-performance/" rel="noopener noreferrer"&gt;my blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>benchmarking</category>
      <category>rust</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
