<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alexander Matveev</title>
    <description>The latest articles on DEV Community by Alexander Matveev (@bizoxe).</description>
    <link>https://dev.to/bizoxe</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4006385%2F01d59766-81b7-424d-bae4-a2179dcde584.jpg</url>
      <title>DEV Community: Alexander Matveev</title>
      <link>https://dev.to/bizoxe</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bizoxe"/>
    <language>en</language>
    <item>
      <title>Analyzing and Troubleshooting Bottlenecks in FastAPI: Optimizing Auth Flow, Cryptography, and Data Serialization</title>
      <dc:creator>Alexander Matveev</dc:creator>
      <pubDate>Sun, 28 Jun 2026 12:07:39 +0000</pubDate>
      <link>https://dev.to/bizoxe/analyzing-and-troubleshooting-bottlenecks-in-fastapi-optimizing-auth-flow-cryptography-and-data-41li</link>
      <guid>https://dev.to/bizoxe/analyzing-and-troubleshooting-bottlenecks-in-fastapi-optimizing-auth-flow-cryptography-and-data-41li</guid>
      <description>&lt;p&gt;This article is intended for those who, like me, are interested in profiling and performance optimization.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Key Takeaways (TL;DR):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Offload CPU-bound crypto:&lt;/strong&gt; Move synchronous operations like &lt;code&gt;Argon2id&lt;/code&gt; into a &lt;code&gt;ThreadPoolExecutor&lt;/code&gt; to stop them from blocking the asyncio event loop.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Choose the right algorithm:&lt;/strong&gt; &lt;code&gt;Ed25519&lt;/code&gt; is much faster for &lt;em&gt;signing&lt;/em&gt; tokens than &lt;code&gt;RSA-2048&lt;/code&gt;, but slower for &lt;em&gt;verification&lt;/em&gt;. This trade-off can be mitigated by caching validated tokens.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Use &lt;code&gt;msgspec&lt;/code&gt; for serialization:&lt;/strong&gt; Native &lt;code&gt;msgspec&lt;/code&gt; serialization can be up to 5x faster than FastAPI's default with Pydantic, drastically reducing CPU load on data-heavy responses.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&lt;strong&gt;Preface&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To begin with, I would like to touch upon the subject of development culture in high-level languages.&lt;/p&gt;

&lt;p&gt;Python is built for maximum productivity: it hides everything "superfluous" from the developer — memory management, register operations, threads, and system calls. This often creates a false impression that the code runs "by itself," regardless of the environment.&lt;/p&gt;

&lt;p&gt;Python owes much of its popularity to its low barrier to entry. It is often called simple, but that does not mean the language is easy to master: behind the external simplicity lies the colossally complex work of the interpreter. The language can be pictured as an onion: layer by layer, you delve into its inner workings. It was initially perceived as a simple scripting tool, but modern Python and the Python of 15 years ago are effectively different languages, not in syntax but in how they are used. The shift from synchronous "scripting" to asynchronous services with deep static typing has radically changed the requirements for developers. Previously, the barrier to entry was defined by basic syntax knowledge; today it also includes an understanding of concurrency, system abstractions, and static analysis tools. In other words, the barrier has risen significantly.&lt;br&gt;
Python is a tool that allows for the creation of complex systems, and that is precisely what makes it treacherous for those who do not look deeper.&lt;/p&gt;

&lt;p&gt;When we write in high-level languages, we are in a "cozy bubble" of abstractions. Profiling and benchmarking are tools that pierce this bubble.&lt;/p&gt;

&lt;p&gt;Here is what they provide, at the very least, in this context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Awareness of the cost of abstractions&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
We often overlook the fact that even a simple &lt;code&gt;list.append()&lt;/code&gt; operation or a function call in Python rests on an entire stack: memory allocation, type checking, GIL handling, and system calls. Through benchmarking, we learn to measure not the "beauty of the code" but the real cost in CPU time and bytes of memory. We begin to see how non-obvious things (for example, the method of data serialization or unnecessary allocations in a loop) can slow down the system.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Development of systems thinking&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Most developers perceive code as an autonomous entity. Profiling forces one to see the "submerged part":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OS Interaction:&lt;/strong&gt; I/O profiling often shows that latency arises not because of the Python code, but because of how Linux manages context switching or how data is buffered.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database Connection:&lt;/strong&gt; We begin to understand that "fast" Python code is just a transport for queries, and the "bottleneck" is often the execution plan of an SQL query or the absence of indexes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network Stack:&lt;/strong&gt; Load testing shows system behavior under the pressure of TCP connections — this is critical for operation, even though it lies outside the logic of the application itself.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Correct application of optimizations&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Without profiling tools, we can only guess which part of the codebase is slow and whether it is truly the "bottleneck." A profiler (e.g., &lt;code&gt;py-spy&lt;/code&gt;) provides objective data, as long as the profiling itself is set up correctly. This eliminates guesswork and directs optimization to exactly where it is needed.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most important thing benchmarks provide is curiosity. When you see that your code is running slowly, you go digging deeper: you start studying the layout of data structures in memory, looking into system calls (&lt;code&gt;strace&lt;/code&gt;), and hardware architecture (CPU cache, memory management).&lt;/p&gt;



&lt;p&gt;Let us touch slightly on solving the problem of slow systems through faster hardware.&lt;/p&gt;

&lt;p&gt;When we encounter such problems, the obvious solution is to buy more powerful equipment. In the cloud, switching to a machine with more cores, more disk space, or more RAM takes minutes or even seconds. Given that developer time is expensive, upgrading hardware is often viewed as the simplest and fastest fix. In the long run, however, you risk ending up with a system that is both slow and extremely expensive. Moreover, more powerful equipment does not always solve a performance problem; it all depends on where the bottleneck is. If you simply need more RAM, or your application's work can be executed in parallel, then an upgrade might improve the situation; if the bottleneck is elsewhere, it may not help at all.&lt;/p&gt;

&lt;p&gt;Using a hardware-first approach can entail long-term costs exceeding the price of the equipment itself. These include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Culture of inefficiency.&lt;/strong&gt; When developers have access to unlimited resources, they lose the motivation to write resource-efficient code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Horizontal scaling.&lt;/strong&gt; It is one thing if the code requires data processing a couple of times — spending an extra $5-10 is not a problem. But if this same task is executed, for example, 1000 times a month, a multiplicative effect occurs and expenses can reach significant sums.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vertical scaling.&lt;/strong&gt; When further scaling on a single machine becomes a problem, you have to move to a distributed system, which often entails significant changes to the codebase and/or increased debugging complexity. Inefficient code therefore reaches these architectural turning points at earlier stages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From a business perspective, we face an inevitable trade-off: you can spend money on equipment or spend developer time on writing more efficient code. Both cost options are evaluated, and the one that is lower in the current situation is chosen. However, the problem can be looked at from another angle: all other things being equal, a more efficient program is better than a less efficient one. Faster hardware may be a suitable solution in many cases, but it is also worth considering how to make the application more efficient by default.&lt;/p&gt;

&lt;p&gt;If we assume that slow or inefficient software is inevitable, we never even consider how to improve a program's performance. The ability to write resource-efficient code is not a constant but a skill that can be developed. By practicing profiling and building that skill, you will end up creating systems that are faster and consume fewer resources from the start.&lt;/p&gt;

&lt;p&gt;The reader might have a number of questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;is optimization at early stages premature?&lt;/li&gt;
&lt;li&gt;does writing efficient code require more effort and is it harder to maintain?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We all know Knuth's famous saying: "premature optimization is the root of all evil." Let us rephrase: "doing something at the wrong time is not the best option." The full quote is: "We should forget about small efficiencies, say about 97% of the time; premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%." Premature optimization here is understood as the attempt to improve code where the impact on overall performance is negligible, while sacrificing implementation cleanliness. If you dig deeper, developing fast code from the very beginning can bring huge benefits.&lt;/p&gt;

&lt;p&gt;I will give a simple example: choosing &lt;code&gt;msgspec&lt;/code&gt; and &lt;code&gt;msgspec.Struct&lt;/code&gt; models for serialization at the API output layer instead of standard serialization in a FastAPI application from the very beginning. This provides a significant performance boost under high load. We immediately lay the foundation for a high-throughput system capable of "digesting" a large number of requests (the article will break down the efficiency of native &lt;code&gt;msgspec.json&lt;/code&gt; serialization vs the standard FastAPI framework solution). On the other hand, if you have a small application and/or a large volume of data is not expected, then such optimization is likely premature until you truly hit a limit.&lt;/p&gt;

&lt;p&gt;Let us also consider a small example of how architectural performance influences the development cycle using integration testing as an example.&lt;/p&gt;

&lt;p&gt;Suppose the application architecture (for example, due to blocking I/O, lack of indexing, or inefficient serialization patterns) makes running tests "heavy" — running the full test suite takes, for instance, 10 minutes. This creates so-called "infrastructure friction": the developer is forced to accumulate a "batch" of changes before each run to avoid wasting time waiting. In such a model, the cost of an error increases critically. If a bug pops up during the run, finding it in a "batch" of changes is much harder than in an isolated change. Conversely, if tests run quickly enough, the developer moves to a "short cycle." This forms a habit of taking small, safe steps: made a change — ran tests — got a result. This does not just speed up the work; it changes the approach to design, allowing more experiments to be made with less risk of breaking the system.&lt;/p&gt;

&lt;p&gt;Does writing faster code require more time, and is it harder to maintain? Let us compare two examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request_body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Next, we need to manually check the keys, types, etc.
&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# ... validation code
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;msgspec&lt;/span&gt;

&lt;span class="c1"&gt;# We define the structure once
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;User&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msgspec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Struct&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;

&lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msgspec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request_body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Writing a model with &lt;code&gt;msgspec.Struct&lt;/code&gt; is comparable in the amount of code to manual dictionary checking. We described the data structure, obtained automatic validation, static typing, and a performance boost. The cost of maintaining such code is even lower because it becomes more transparent for the whole team: the data contract is described explicitly, not hidden in the form of string keys throughout the code.&lt;/p&gt;

&lt;p&gt;The reader might object that these are just examples specially chosen to illustrate my point of view. And that is true. One can find many examples where writing faster code requires more effort, leads to less maintainable code, or both. Nevertheless, there clearly are situations where the fast version of the code is just as simple to write and just as easy to maintain as the slow one.&lt;/p&gt;

&lt;p&gt;If we are not accustomed to paying attention to speed, there is a high probability that our code could have been made faster without additional effort. The problems we are solving are likely not unique, which means that someone has probably already written a library or tool, or documented a pattern, that makes faster code easy to write.&lt;/p&gt;




&lt;p&gt;This article will outline my personal experience within the framework of a training project. Also, I draw the reader's attention to the fact that this is not an attempt to imitate a production-level standard.&lt;/p&gt;

&lt;p&gt;I intentionally chose the authentication/authorization module for several reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;heavy mathematics is present (synchronous password hashing for users, token issuance using the RSA-2048 algorithm);&lt;/li&gt;
&lt;li&gt;it fits perfectly into the load testing methodology (hypothesis based on theoretical knowledge — measurements — confirmation/refutation of the hypothesis — optimization — re-measurements).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The test bench (AMD FX-8320/HDD) was chosen not only due to its accessibility but also based on the fact that "weak" hardware makes it easier to highlight problem areas that would have been nullified on more powerful and modern hardware.&lt;br&gt;
All tests were conducted on a single host machine (one node, loopback/synthetic testing).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuration:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Server:&lt;/strong&gt; Uvicorn 0.40.0

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Flags:&lt;/strong&gt; &lt;code&gt;--loop=uvloop&lt;/code&gt; &lt;code&gt;--http=httptools&lt;/code&gt; &lt;code&gt;--backlog=2048&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OS Tuning:&lt;/strong&gt; &lt;code&gt;ulimit -n 65535&lt;/code&gt;, &lt;code&gt;somaxconn=2048&lt;/code&gt;, &lt;code&gt;overcommit_memory=1&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Environment:&lt;/strong&gt; &lt;a href="https://github.com/bizoxe/iron-track/blob/main/benchmarks/SPEC.md" rel="noopener noreferrer"&gt;Test Bench Specification&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Quick Jump:&lt;/strong&gt; The Hunt | Round 1: Optimizations | Deep Dive: Serialization | Round 2: Final Polish | The Final Scorecard | Bonus: Logging Cost | Bonus: DBMS Planner&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4&gt;
  
  
  Note: Justification for the relevance of wrk2 metrics (System Jitter)
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;Three series of three runs each were executed in TTY mode against the &lt;code&gt;GET /ping&lt;/code&gt; endpoint (which returns &lt;code&gt;b"OK"&lt;/code&gt;), with the application and the load generator pinned to separate cores (0,1 and 6,7) and uvicorn logs redirected to &lt;code&gt;/dev/null&lt;/code&gt;. All of them revealed identical peak latencies (Max Latency ~80ms). Three independent series monitored with sar (TTY mode, uvicorn logs -&amp;gt; &lt;code&gt;/dev/null&lt;/code&gt;) also confirmed that the system anomalies are repeatable: short-term spikes in the task queue (&lt;code&gt;runq-sz&lt;/code&gt; from 3 to 4) and bursts of context switches (&lt;code&gt;cswch/s&lt;/code&gt; ~11k), even though the CPU had ample headroom (~75% &lt;code&gt;idle&lt;/code&gt;). Because the application was allocated only 2 physical cores, run-queue values &amp;gt;2 mathematically confirm a state of preemption. This links the high values of Max Latency, P99, and StdDeviation to the Linux scheduler and hardware interrupt handling rather than to the efficiency of the application code (&lt;a href="https://github.com/bizoxe/iron-track/blob/benchmarks/benchmarks/results/auth-serialization/test-ping/sar-metrics/without-logging/reports/system.txt" rel="noopener noreferrer"&gt;sar-logs&lt;/a&gt;).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;The test results will include Max Latency, P99, and StdDeviation values. However, when analyzing the results, we will primarily focus on &lt;strong&gt;Mean Latency&lt;/strong&gt; and &lt;strong&gt;P90&lt;/strong&gt; for the following reasons:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;&lt;strong&gt;On lightweight endpoints&lt;/strong&gt;, system jitter manifests in a classic way: Max Latency and P99 disproportionately spike upwards relative to the smooth P90 plateau, artificially inflating &lt;code&gt;StdDeviation&lt;/code&gt;.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;&lt;strong&gt;On heavy (CPU-bound) endpoints&lt;/strong&gt;, the overhead from the OS scheduler is smoothed out against the long execution time of the business logic itself. In this case, the percentile distribution appears smooth, but &lt;code&gt;Max Latency&lt;/code&gt; and &lt;code&gt;StdDeviation&lt;/code&gt; still fluctuate because of hardware-level jitter.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;
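&lt;p&gt;A minimal sketch of why the tail metrics react to jitter while P90 stays flat. The latency values are synthetic (they are not taken from the benchmark), and the nearest-rank &lt;code&gt;percentile&lt;/code&gt; helper is a hypothetical illustration, not the way wrk2 builds its histogram:&lt;/p&gt;

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Synthetic data: 98 requests at a steady 2 ms plus 2 scheduler-induced spikes of 80 ms.
latencies_ms = [2.0] * 98 + [80.0] * 2

p90 = percentile(latencies_ms, 90)
p99 = percentile(latencies_ms, 99)
max_latency = max(latencies_ms)
std_dev = statistics.pstdev(latencies_ms)

print(f"P90={p90} ms  P99={p99} ms  Max={max_latency} ms  StdDev={std_dev:.2f} ms")
```

&lt;p&gt;Two outliers out of a hundred samples are enough to pull P99 and Max up to the spike value and to inflate the standard deviation, while P90 still reports the steady-state latency.&lt;/p&gt;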


&lt;h2&gt;
  
  
  Bottleneck Analysis
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Hypothesis for &lt;code&gt;/api/v1/access/signup&lt;/code&gt; (Registration)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Premise:&lt;/strong&gt; Using synchronous password hashing at the ORM level (&lt;code&gt;advanced-alchemy&lt;/code&gt; | &lt;code&gt;pwdlib.hashers.argon2.Argon2Hasher&lt;/code&gt;) requires significant CPU computational power, and the subsequent data writing involves disk I/O operations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hypothesis Formulation:&lt;/strong&gt; Due to the cooperative nature of asynchrony in Python, performing heavy synchronous hash calculations (&lt;code&gt;Argon2id&lt;/code&gt;) in the main thread under load from &lt;code&gt;wrk2&lt;/code&gt; will completely block the Event Loop and the application server worker for the duration of the mathematical operation. This will stop the processing of incoming network events, causing other clients' requests to queue up at the system socket level. The bottleneck will not be the disk subsystem, but the monopolization of the single Event Loop thread by synchronous code (lack of offloading the task to &lt;code&gt;run_in_executor&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Objective:&lt;/strong&gt; Determine whether async offloading of CPU-bound tasks (delegating computations to a &lt;code&gt;ThreadPoolExecutor&lt;/code&gt;) is required for user password hashing, or if the current synchronous execution is optimal.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  2. Hypothesis for &lt;code&gt;/api/v1/access/signin&lt;/code&gt; (Login)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Premise:&lt;/strong&gt; User authentication involves synchronous password verification (see &lt;code&gt;/api/v1/access/signup&lt;/code&gt; above), as well as the generation of two tokens (access/refresh) using &lt;code&gt;RSA-2048&lt;/code&gt; asymmetric cryptography, which requires significant CPU computational power. Searching for the user in the database (I/O) has low latency due to the use of an index (the &lt;code&gt;email&lt;/code&gt; column).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hypothesis Formulation:&lt;/strong&gt; See point one. On this endpoint, the situation will worsen: the worker will hang first on password verification and then be blocked twice more by RSA-2048 signing, once for the access token and once for the refresh token. The total time a single request monopolizes the thread will increase several times over.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Objective:&lt;/strong&gt; Verify the hypothesis that the sequential execution of CPU-bound operations (password verification + RSA signatures) is a critical bottleneck, and evaluate the feasibility of switching to signature algorithms with lower computational costs (&lt;code&gt;Ed25519&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  3. Hypothesis for &lt;code&gt;/api/v1/access/me&lt;/code&gt; (Get User Profile)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Premise:&lt;/strong&gt; The endpoint relies on the &lt;code&gt;get_current_active_user&lt;/code&gt; dependency. On each request, the JWT token is first decoded (RSA-2048 asymmetric algorithm via &lt;code&gt;decode_jwt&lt;/code&gt;). After successful validation, the cache is checked (decorator &lt;code&gt;@cache&lt;/code&gt; from &lt;code&gt;fastapi_cache.decorator&lt;/code&gt;). On a cache hit, the bytes are retrieved and deserialized (calling &lt;code&gt;msgspec.msgpack.decode&lt;/code&gt; + &lt;code&gt;UserAuth.model_validate&lt;/code&gt;). On a cache miss, a database query is executed, followed by converting the ORM object to &lt;code&gt;UserAuth&lt;/code&gt; and saving it to the cache (serialized using &lt;code&gt;msgspec.msgpack.encode&lt;/code&gt; + &lt;code&gt;jsonable_encoder&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hypothesis Formulation:&lt;/strong&gt; With a cache, we reduce the load on the database but transfer part of the load to the CPU (deserialization). However, as the load on the endpoint increases, the cryptographic decoding and verification of the JWT (RSA-2048) will place a noticeable load on the CPU and reduce overall throughput. Deserialization will not have a significant impact.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Objective:&lt;/strong&gt; Evaluate the computational costs of decoding and cryptographically verifying access tokens when using the &lt;code&gt;RSA-2048&lt;/code&gt; algorithm.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Authentication/Authorization chain execution flow&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[HTTP Request]
     │
     ▼
[Dependency Depends(access_token)]
     │
     ▼
[1. Decode token: get_payload_from_token()]
     │
     ├── ❌ decode_jwt() using PyJWT - CPU-bound (RSA-2048 modular arithmetic)
     └── Structure and exception check
     │
     ▼
[2. Get user: Authenticate.get_current_user()]
     │
     └── Pass token_payload["sub"] to _get_user_from_payload() method
     │
     ▼
[3. Check cache / Get data]
     │
     ├── ✔️ Cache hit: Deserialization (MsgPackCoderUserAuth) — CPU-bound
     │    └─ (msgspec.msgpack.decode + UserAuth.model_validate)
     │
     └── ❌ Cache miss: I/O-bound
          ├── DB query (users_service.get)
          └── Convert to UserAuth schema (users_service.to_schema) and write to cache (msgspec.msgpack.encode + jsonable_encoder)
     │
     ▼
[4. Additional checks (depending on the endpoint)]
     │
     ├── get_current_active_user — check is_active
     └── superuser_required / trainer_required — check is_superuser or role_slug
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
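&lt;p&gt;Step 1 of this flow runs the signature check on every request. The TL;DR mentions caching validated tokens as a mitigation; the following is only a sketch of that idea under loud assumptions. An HMAC from the standard library stands in for the asymmetric signature, and &lt;code&gt;sign&lt;/code&gt;, &lt;code&gt;verify&lt;/code&gt;, &lt;code&gt;verify_cached&lt;/code&gt;, and &lt;code&gt;SECRET&lt;/code&gt; are hypothetical names that do not come from the project:&lt;/p&gt;

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-secret"   # hypothetical key, for illustration only
_verified = {}            # token -> (payload, expires_at)
verify_calls = 0          # counts how often the expensive check actually runs


def sign(payload):
    """Issue a toy token: base64(JSON body) + "." + hex HMAC signature."""
    body = base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()
    signature = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}.{signature}"


def verify(token):
    """Stand-in for the costly signature verification step."""
    global verify_calls
    verify_calls += 1
    body, _, signature = token.rpartition(".")
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("invalid signature")
    return json.loads(base64.urlsafe_b64decode(body))


def verify_cached(token, now=None):
    """Return the payload, skipping verification for tokens already validated."""
    now = time.time() if now is None else now
    cached = _verified.get(token)
    if cached is not None and cached[1] > now:
        return cached[0]
    payload = verify(token)
    if now >= payload["exp"]:
        raise ValueError("token expired")
    _verified[token] = (payload, payload["exp"])
    return payload


token = sign({"sub": "42", "exp": time.time() + 60})
first = verify_cached(token)
second = verify_cached(token)   # served from the cache, no second signature check
print(first["sub"], verify_calls)
```

&lt;p&gt;The cache entry is only trusted until the token's own &lt;code&gt;exp&lt;/code&gt; time. This sketch never evicts entries; a real cache would need a size bound or a TTL-based store.&lt;/p&gt;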



&lt;h3&gt;
  
  
  4. Hypothesis for &lt;code&gt;MsgSpecJSONResponse (msgspec.json)&lt;/code&gt; (Serialization)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Premise:&lt;/strong&gt; The project implements a custom &lt;code&gt;MsgSpecJSONResponse&lt;/code&gt;, but for versatility and compatibility with Pydantic models, we are forced to use &lt;code&gt;jsonable_encoder&lt;/code&gt; as an intermediate step before serialization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hypothesis Formulation:&lt;/strong&gt; On high-load endpoints or when serializing large volumes of data, the overhead of &lt;code&gt;jsonable_encoder&lt;/code&gt; will cause noticeable latency and increase the CPU load. It is expected that the overhead of &lt;code&gt;jsonable_encoder&lt;/code&gt; will be partially compensated by the &lt;code&gt;msgspec&lt;/code&gt; library.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note 1:&lt;/strong&gt; How jsonable_encoder works: &lt;code&gt;jsonable_encoder&lt;/code&gt; traverses all objects, including lists and nested dictionaries, checks types, and converts complex structures (e.g., datetime, UUID, Pydantic models) into standard Python types. New intermediate objects are created for each such operation. This leads to excessive memory allocation and additional load on the garbage collector.&lt;/p&gt;
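&lt;p&gt;To make the cost of that extra traversal concrete, here is a small sketch using only the standard library. &lt;code&gt;to_jsonable&lt;/code&gt; is a deliberately simplified stand-in for the kind of pass described above (it is not FastAPI's implementation), and &lt;code&gt;encode_two_pass&lt;/code&gt; / &lt;code&gt;encode_single_pass&lt;/code&gt; are hypothetical helper names:&lt;/p&gt;

```python
import json
import uuid
from datetime import datetime, timezone


def to_jsonable(obj):
    """Simplified stand-in for a jsonable_encoder-style pass (not FastAPI's code)."""
    if isinstance(obj, dict):
        return {str(key): to_jsonable(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_jsonable(item) for item in obj]
    if isinstance(obj, datetime):
        return obj.isoformat()
    if isinstance(obj, uuid.UUID):
        return str(obj)
    return obj


def encode_two_pass(obj):
    # Pass 1 copies the whole tree into plain types; pass 2 serializes the copy.
    return json.dumps(to_jsonable(obj))


def _default(obj):
    # Called by json.dumps only for values it cannot serialize itself.
    if isinstance(obj, datetime):
        return obj.isoformat()
    if isinstance(obj, uuid.UUID):
        return str(obj)
    raise TypeError(f"unsupported type: {type(obj).__name__}")


def encode_single_pass(obj):
    # One traversal: no intermediate copy of dicts and lists is built.
    return json.dumps(obj, default=_default)


record = {
    "id": uuid.UUID(int=1),
    "created_at": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "tags": ["admin", "trainer"],
    "profile": {"age": 30},
}
two_pass = encode_two_pass(record)
single_pass = encode_single_pass(record)
print(two_pass == single_pass)
```

&lt;p&gt;Both paths produce the same JSON, but the two-pass version allocates a new dict or list for every container in the input before serialization even starts. That duplicated tree is the overhead the hypothesis is about.&lt;/p&gt;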

&lt;p&gt;&lt;strong&gt;Note 2:&lt;/strong&gt; Before the release of Pydantic v2, using &lt;code&gt;jsonable_encoder&lt;/code&gt; + &lt;code&gt;orjson&lt;/code&gt; did provide a performance boost because, despite the overhead of &lt;code&gt;jsonable_encoder&lt;/code&gt;, the final byte assembly was faster than the standard mechanisms of the FastAPI framework.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Objective:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Evaluate the efficiency of serialization via &lt;code&gt;MsgSpecJSONResponse&lt;/code&gt; compared to the standard "out-of-the-box" serialization of Pydantic models by the FastAPI framework.&lt;/li&gt;
&lt;li&gt;Determine if it makes sense to introduce separate &lt;code&gt;msgspec.Struct&lt;/code&gt; "output" schemas into the project just for serialization to achieve maximum performance.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Round 1: The First Wave of Optimizations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Note 1:&lt;/strong&gt; Profiling with py-spy, baseline tests, and tests after the first optimization were carried out with the default &lt;code&gt;Argon2id&lt;/code&gt; settings.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Note 2:&lt;/strong&gt; The first optimization included:

&lt;ul&gt;
&lt;li&gt;Changing the &lt;strong&gt;PyJWT&lt;/strong&gt; library to &lt;strong&gt;joserfc&lt;/strong&gt; and the token signing algorithm from &lt;code&gt;RSA-2048&lt;/code&gt; to &lt;code&gt;Ed25519&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The built-in synchronous password hashing at the ORM level of the advanced-alchemy library was moved to a &lt;code&gt;ThreadPoolExecutor(max_workers=2)&lt;/code&gt;. The value &lt;code&gt;max_workers=2&lt;/code&gt; was chosen strictly for the 2 physical cores of the test bench (running the Uvicorn server with core affinity to 0,1).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
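&lt;p&gt;The offloading pattern from the second point can be sketched as follows. This is not the project's code: &lt;code&gt;hashlib.pbkdf2_hmac&lt;/code&gt; is used as a standard-library stand-in for &lt;code&gt;Argon2id&lt;/code&gt;, and &lt;code&gt;hash_password&lt;/code&gt;, &lt;code&gt;hash_password_async&lt;/code&gt;, and &lt;code&gt;heartbeat&lt;/code&gt; are hypothetical names chosen for illustration:&lt;/p&gt;

```python
import asyncio
import hashlib
from concurrent.futures import ThreadPoolExecutor

# Two workers, mirroring the two cores given to the application in the test setup.
_executor = ThreadPoolExecutor(max_workers=2)


def hash_password(password, salt):
    """CPU-bound, synchronous hashing (PBKDF2 here as a stdlib stand-in for Argon2id)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)


async def hash_password_async(password, salt):
    """Run the hash in the thread pool so the event loop keeps serving other tasks."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_executor, hash_password, password, salt)


async def main():
    salt = b"fixed-demo-salt"  # a real application would use a random per-user salt
    ticks = 0

    async def heartbeat():
        # Keeps counting only while the event loop is free to schedule it.
        nonlocal ticks
        while True:
            await asyncio.sleep(0.001)
            ticks += 1

    beat = asyncio.create_task(heartbeat())
    digest = await hash_password_async("s3cret", salt)
    beat.cancel()
    return digest, ticks


digest, ticks = asyncio.run(main())
print(len(digest), ticks)
```

&lt;p&gt;If the hash were called directly inside the coroutine, &lt;code&gt;heartbeat&lt;/code&gt; could not run until it finished. With the executor, the tick counter should keep advancing while the worker thread computes the digest, provided the hashing function releases the GIL during its native computation.&lt;/p&gt;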

&lt;p&gt;&lt;strong&gt;Preliminary Analysis of Event Loop Blocking:&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Before proceeding to analyze the flame graphs from the &lt;code&gt;py-spy&lt;/code&gt; profiler and load testing, we will switch &lt;code&gt;asyncio&lt;/code&gt; to &lt;code&gt;debug&lt;/code&gt; mode at the FastAPI application level to record event loop slowdowns at runtime:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@asynccontextmanager&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lifespan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Enable asyncio debug mode
&lt;/span&gt;    &lt;span class="n"&gt;loop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_running_loop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;loop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_debug&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Set the threshold to 100 ms (0.1 sec)
&lt;/span&gt;    &lt;span class="n"&gt;loop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;slow_callback_duration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;
    &lt;span class="k"&gt;yield&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Results of logging "heavy" callbacks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/api/v1/access/signup&lt;/code&gt;&lt;/strong&gt;: Thread blocking during synchronous password hashing was &lt;strong&gt;~150 ms&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/api/v1/access/signin&lt;/code&gt;&lt;/strong&gt;: The total blocking time for password verification and the sequential issuance of an access/refresh token pair reaches &lt;strong&gt;~350–400 ms&lt;/strong&gt; (of which: password verification — &lt;strong&gt;~150 ms&lt;/strong&gt;, token generation — &lt;strong&gt;~250 ms&lt;/strong&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The figures above are wall-clock time (total elapsed real time) as reported by &lt;code&gt;asyncio&lt;/code&gt; debug mode. They do not measure the cryptographic computation in isolation; they measure how long the main Python thread was held by computation without returning control to the Event Loop. They demonstrate how heavy CPU-bound code stalls the asynchronous runtime.&lt;/p&gt;
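&lt;p&gt;The effect can be reproduced outside the application with a minimal, self-contained sketch. It assumes a &lt;code&gt;time.sleep(0.15)&lt;/code&gt; call as a stand-in for a synchronous hash; &lt;code&gt;ListHandler&lt;/code&gt;, &lt;code&gt;blocking_handler&lt;/code&gt;, and &lt;code&gt;run_demo&lt;/code&gt; are hypothetical names used only for illustration. In debug mode, asyncio reports slow callbacks through the &lt;code&gt;asyncio&lt;/code&gt; logger, which the sketch captures:&lt;/p&gt;

```python
import asyncio
import logging
import time


class ListHandler(logging.Handler):
    """Collects log messages so they can be inspected after the run."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


async def blocking_handler() -> None:
    # Stand-in for a synchronous hash: holds the thread without awaiting.
    time.sleep(0.15)


async def main() -> None:
    loop = asyncio.get_running_loop()
    loop.slow_callback_duration = 0.1  # 100 ms threshold
    await asyncio.create_task(blocking_handler())


def run_demo() -> list[str]:
    handler = ListHandler()
    logger = logging.getLogger("asyncio")
    logger.addHandler(handler)
    try:
        # debug=True enables asyncio debug mode for this run.
        asyncio.run(main(), debug=True)
    finally:
        logger.removeHandler(handler)
    # Keep only the slow-callback reports.
    return [message for message in handler.messages if "took" in message]


if __name__ == "__main__":
    for message in run_demo():
        print(message)
```

&lt;p&gt;Each captured message names the task that held the loop and how long it ran, which is the kind of entry the figures above were taken from.&lt;/p&gt;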

&lt;h4&gt;
  
  
  Flame Graph Analysis
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Conditions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Application Server:&lt;/strong&gt; Uvicorn, 1 worker, mapped to 1 physical module (core 0).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;py-spy (&lt;code&gt;record&lt;/code&gt; mode, no core affinity) | wrk2 (mapped to 1 physical module (core 7))&lt;/strong&gt;:&lt;/li&gt;
&lt;li&gt;Endpoint &lt;code&gt;/api/v1/access/signup&lt;/code&gt;: py-spy -&amp;gt; &lt;code&gt;--rate 100&lt;/code&gt;; wrk2 -&amp;gt; RPS 4, 1 thread.&lt;/li&gt;
&lt;li&gt;Endpoint &lt;code&gt;/api/v1/access/signin&lt;/code&gt;: py-spy -&amp;gt; &lt;code&gt;--rate 100&lt;/code&gt; (increased to 150 after optimization); wrk2 -&amp;gt; RPS 2, 1 thread.&lt;/li&gt;
&lt;li&gt;Endpoint &lt;code&gt;/api/v1/access/me&lt;/code&gt;: py-spy -&amp;gt; &lt;code&gt;--rate 150&lt;/code&gt;; wrk2 -&amp;gt; RPS 300, 1 thread.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker containers:&lt;/strong&gt; PostgreSQL, PGBouncer (cores 2,3) | Valkey (cores 4,5) were running. &lt;code&gt;ulimit (nofile=65535)&lt;/code&gt; was set for all containers.&lt;/li&gt;
&lt;li&gt;OS limits were raised (see &lt;code&gt;OS Tuning&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Baseline Analysis:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Endpoint &lt;code&gt;/api/v1/access/signup&lt;/code&gt;: Synchronous hashing of the user password accounted for &lt;strong&gt;~22-25%&lt;/strong&gt; of CPU time. The flame graph clearly shows a deep call stack from the ORM integration level (&lt;code&gt;advanced_alchemy/types/password...&lt;/code&gt;) down to the low-level &lt;code&gt;argon2.low_level.hash_secret&lt;/code&gt; function, which monopolizes CPU time within the main thread.&lt;/li&gt;
&lt;li&gt;Endpoint &lt;code&gt;/api/v1/access/signin&lt;/code&gt;: The flame graph shows the cumulative effect of blocking the Event Loop with two heavy CPU-bound operations within a single request. Synchronous password verification via &lt;code&gt;argon2.low_level.verify_secret&lt;/code&gt; takes &lt;strong&gt;~10–12%&lt;/strong&gt; of the worker's CPU time. The bulk of the execution profile comes from the sequential issuance of a pair of JWT tokens (access/refresh) signed with the RSA-2048 algorithm (&lt;code&gt;jwt.algorithms.prepare_key&lt;/code&gt; / &lt;code&gt;encode_jwt&lt;/code&gt;), at &lt;strong&gt;~32–34%&lt;/strong&gt; of CPU time per token (~66% of the entire graph width in total).&lt;/li&gt;
&lt;li&gt;Endpoint &lt;code&gt;/api/v1/access/me&lt;/code&gt;: Decoding and verifying the signature of the incoming JWT token using the RSA-2048 algorithm (&lt;code&gt;decode_jwt&lt;/code&gt; -&amp;gt; &lt;code&gt;pyjwt.verify&lt;/code&gt;) takes &lt;strong&gt;~10–12%&lt;/strong&gt; of CPU time. By comparison, the Valkey cache backend layer and the deserialization of user data with &lt;code&gt;msgspec&lt;/code&gt; take up a minimal share of the CPU.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Artifacts:&lt;/strong&gt; &lt;a href="https://github.com/bizoxe/iron-track/blob/benchmarks/benchmarks/docs/assets/auth-serialization/signup.svg" rel="noopener noreferrer"&gt;signup.svg&lt;/a&gt; | &lt;a href="https://github.com/bizoxe/iron-track/blob/benchmarks/benchmarks/docs/assets/auth-serialization/signin.svg" rel="noopener noreferrer"&gt;signin.svg&lt;/a&gt; | &lt;a href="https://github.com/bizoxe/iron-track/blob/benchmarks/benchmarks/docs/assets/auth-serialization/me.svg" rel="noopener noreferrer"&gt;me.svg&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Analysis after the first optimization:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Endpoint &lt;code&gt;/api/v1/access/signin&lt;/code&gt;: The flame graph clearly shows two isolated towers. Synchronous password hashing (&lt;code&gt;Argon2id&lt;/code&gt;), wrapped in the pwdlib interface, has moved completely out of the main Event Loop into a dedicated &lt;code&gt;ThreadPoolExecutor&lt;/code&gt; (&lt;code&gt;crypto_executor&lt;/code&gt;). Token generation and signing operations no longer appear in the profile at all.&lt;/li&gt;
&lt;li&gt;Endpoint &lt;code&gt;/api/v1/access/me&lt;/code&gt;: Decoding and verifying the token signature via &lt;code&gt;joserfc&lt;/code&gt; together take ~20-22% of CPU (compared to ~10-12% for &lt;code&gt;PyJWT&lt;/code&gt;). The load is split between the computational verification of the cryptographic signature (verify) and the internal deserialization/claims validation mechanisms (&lt;code&gt;deserialize_compact&lt;/code&gt;, &lt;code&gt;validate_compact&lt;/code&gt;). &lt;strong&gt;My initial working hypothesis (from analyzing the me-joserfc.svg graph)&lt;/strong&gt; was that this increase comes from the library's deep OOP layer (the &lt;code&gt;JWTClaimsRegistry&lt;/code&gt; class), where the call to &lt;code&gt;.validate(claims)&lt;/code&gt; creates intermediate objects, looks up validators dynamically, and inspects data types, consuming a significant share of processor time.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Important clarification based on the final profiling (endpoint &lt;code&gt;/api/v1/access/me&lt;/code&gt;):&lt;/strong&gt;&lt;br&gt;
Subsequent flame graph analysis after implementing the "fast-path" showed that the Python wrapper for claims validation (the &lt;code&gt;exp&lt;/code&gt; and &lt;code&gt;iat&lt;/code&gt; checks) itself added minimal overhead.&lt;br&gt;
Most of the CPU time within these ~20-22% is spent on the &lt;strong&gt;mathematics of verifying the Ed25519 cryptographic signature&lt;/strong&gt;. RSA-2048 verification with a public key is very cheap for the processor because the public exponent is small (typically 65537), whereas Ed25519 verification requires elliptic-curve scalar multiplications that cost at least as much as signing.&lt;br&gt;
Thus, the manual "fast-path" made the code cleaner and removed the library's unnecessary abstractions, but it did not significantly reduce the ~20-22% CPU plateau on the &lt;code&gt;/api/v1/access/me&lt;/code&gt; endpoint.&lt;/p&gt;
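&lt;p&gt;For illustration, a manual claims check of this kind can be sketched as follows. This is not the project's code: &lt;code&gt;claims_are_valid&lt;/code&gt; and its &lt;code&gt;leeway&lt;/code&gt; parameter are hypothetical, and the sketch assumes the signature has already been verified and that &lt;code&gt;exp&lt;/code&gt; and &lt;code&gt;iat&lt;/code&gt; are numeric Unix timestamps.&lt;/p&gt;

```python
import time
from typing import Any, Mapping, Optional


def claims_are_valid(
    claims: Mapping[str, Any],
    now: Optional[float] = None,
    leeway: float = 0.0,
) -> bool:
    """Check exp and iat by hand; call only after signature verification."""
    current = time.time() if now is None else now
    exp = claims.get("exp")
    iat = claims.get("iat")
    # Reject tokens whose timestamps are missing or not numeric.
    if not isinstance(exp, (int, float)) or not isinstance(iat, (int, float)):
        return False
    # Expired: the current time has reached exp (plus the allowed leeway).
    if current >= exp + leeway:
        return False
    # Issued in the future: iat lies beyond the current time plus leeway.
    if iat > current + leeway:
        return False
    return True


if __name__ == "__main__":
    sample = {"sub": "user-1", "iat": 100, "exp": 200}
    print(claims_are_valid(sample, now=150))
    print(claims_are_valid(sample, now=250))
```

&lt;p&gt;A check like this is only a few comparisons per request, which is consistent with the observation that the remaining cost lies in signature verification rather than in claims validation.&lt;/p&gt;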

&lt;p&gt;&lt;strong&gt;Artifacts:&lt;/strong&gt; &lt;a href="https://github.com/bizoxe/iron-track/blob/benchmarks/benchmarks/docs/assets/auth-serialization/signin-optimized.svg" rel="noopener noreferrer"&gt;signin-optimized.svg&lt;/a&gt; | &lt;a href="https://github.com/bizoxe/iron-track/blob/benchmarks/benchmarks/docs/assets/auth-serialization/me-joserfc.svg" rel="noopener noreferrer"&gt;me-joserfc.svg&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Hypothesis Confirmation
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Endpoint &lt;code&gt;/api/v1/access/signup&lt;/code&gt;:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Flame Graph analysis (baseline) confirmed the hypothesis: &lt;code&gt;Argon2id&lt;/code&gt; monopolizes the Event Loop.&lt;/li&gt;
&lt;li&gt;System artifacts analysis: &lt;code&gt;sar&lt;/code&gt; data confirms critical load on the scheduler during registration peaks: &lt;code&gt;runq-sz&lt;/code&gt; spikes of up to 10 were recorded, well above the number of allocated cores, which indicates acute CPU contention. The high &lt;code&gt;cswch/s&lt;/code&gt; rate (up to ~5335) indicates excessive context switching as the OS tries to process incoming requests while the main Event Loop is blocked by synchronous &lt;code&gt;Argon2id&lt;/code&gt; computations (&lt;a href="https://github.com/bizoxe/iron-track/blob/benchmarks/benchmarks/results/auth-serialization/test-signup/baseline/sar-metrics/reports/system.txt" rel="noopener noreferrer"&gt;sar-logs&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Endpoint &lt;code&gt;/api/v1/access/signin&lt;/code&gt;:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Flame Graph analysis (baseline) partially confirmed my hypothesis: The Event Loop blocking is cumulative. Although password verification (&lt;code&gt;Argon2id&lt;/code&gt;) makes a significant contribution, the main load is generated by the sequential generation of JWT tokens via RSA-2048.&lt;/li&gt;
&lt;li&gt;System artifacts analysis: &lt;code&gt;sar&lt;/code&gt; data confirms that during authentication, the system enters a "pulsating" load mode: alternating periods of deep Event Loop blocking (generating RSA signatures) and sharp processing bursts (processing the accumulated queue). Spikes in &lt;code&gt;runq-sz&lt;/code&gt; to 10 and &lt;code&gt;cswch/s&lt;/code&gt; to ~6592 clearly illustrate the degradation of response time when several CPU-bound tasks are executed simultaneously (&lt;a href="https://github.com/bizoxe/iron-track/blob/benchmarks/benchmarks/results/auth-serialization/test-signin/baseline/sar-metrics/reports/system.txt" rel="noopener noreferrer"&gt;sar-logs&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Endpoint &lt;code&gt;/api/v1/access/me&lt;/code&gt;:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Flame Graph analysis (baseline) confirmed the hypothesis: Decoding and verifying the cryptographic signature of access tokens take significantly more processor time than deserializing user data. The flame graph alone is sufficient to confirm the hypothesis.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;The methodology for the runs (wrk2) and sar metric collection are described below.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Tests
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Conditions:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Run methodology&lt;/strong&gt;: Before the start of each test series (Baseline / Optimization), the database was completely cleared. Testing was performed as a series of 4 consecutive runs: the first to record system metrics (sar), the subsequent three to average the performance results. The total size of the &lt;code&gt;user_account&lt;/code&gt; table within one series did not exceed 1000 records.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;wrk2 (cores 6,7)&lt;/strong&gt;:&lt;/li&gt;
&lt;li&gt;Endpoint &lt;code&gt;/api/v1/access/signup&lt;/code&gt;: 5 RPS, 2 threads and 4 connections.&lt;/li&gt;
&lt;li&gt;Endpoint &lt;code&gt;/api/v1/access/signin&lt;/code&gt;: 3 RPS, 2 threads and 4 connections.&lt;/li&gt;
&lt;li&gt;Endpoint &lt;code&gt;/api/v1/access/me&lt;/code&gt;: 600 RPS, 2 threads and 4 connections.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application Server:&lt;/strong&gt; Uvicorn, 2 workers, mapped to 1 physical module (cores 0,1).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker containers:&lt;/strong&gt; PostgreSQL, PGBouncer (cores 2,3) | Valkey (cores 4,5) were running. &lt;code&gt;ulimit (nofile=65535)&lt;/code&gt; was set for all containers.&lt;/li&gt;
&lt;li&gt;OS limits were raised (see &lt;code&gt;OS Tuning&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Round 1 Results
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Technical note:&lt;/strong&gt; For an objective assessment of efficiency, we rely on &lt;strong&gt;Mean Latency&lt;/strong&gt;, &lt;strong&gt;P90&lt;/strong&gt;, and &lt;strong&gt;CPU Load&lt;/strong&gt; (see &lt;code&gt;Justification for the relevance of wrk2 metrics&lt;/code&gt;).&lt;/p&gt;

&lt;h4&gt;
  
  
  Endpoint &lt;code&gt;/api/v1/access/signup&lt;/code&gt;
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Baseline&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Optimization 1&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Difference (Delta)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mean Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;662.76 ms&lt;/td&gt;
&lt;td&gt;656.63 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-0.9% (-6.13 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P90 (90%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;674.30 ms&lt;/td&gt;
&lt;td&gt;679.93 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+0.8% (+5.63 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P99 (99%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;679.93 ms&lt;/td&gt;
&lt;td&gt;684.54 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+0.7% (+4.61 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;StdDev&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;9.79 ms&lt;/td&gt;
&lt;td&gt;24.50 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+150.3% (+14.71 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Max Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;682.50 ms&lt;/td&gt;
&lt;td&gt;684.54 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+0.3% (+2.04 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU Load (Core 0)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;57.55%&lt;/td&gt;
&lt;td&gt;58.29%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+0.74 pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU Load (Core 1)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;57.66%&lt;/td&gt;
&lt;td&gt;58.41%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+0.75 pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
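
&lt;p&gt;The values in the Difference (Delta) column can be reproduced with simple arithmetic: the relative change is taken against the baseline, and the absolute change is the optimized value minus the baseline. A small sketch, with &lt;code&gt;delta&lt;/code&gt; and &lt;code&gt;format_delta&lt;/code&gt; as hypothetical helper names:&lt;/p&gt;

```python
def delta(baseline: float, optimized: float) -> tuple[float, float]:
    """Return (relative change in percent, absolute change) versus baseline."""
    absolute = optimized - baseline
    relative = absolute / baseline * 100.0
    return round(relative, 1), round(absolute, 2)


def format_delta(baseline: float, optimized: float, unit: str = "ms") -> str:
    relative, absolute = delta(baseline, optimized)
    return f"{relative:+.1f}% ({absolute:+.2f} {unit})"


if __name__ == "__main__":
    # Mean Latency row of the signup table: 662.76 ms -> 656.63 ms.
    print(format_delta(662.76, 656.63))  # -0.9% (-6.13 ms)
    # StdDev row of the signup table: 9.79 ms -> 24.50 ms.
    print(format_delta(9.79, 24.50))  # +150.3% (+14.71 ms)
```

&lt;p&gt;For the CPU Load rows of the Round 1 tables, only the absolute difference between the two percentages is reported.&lt;/p&gt;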

&lt;p&gt;&lt;strong&gt;Conclusion (Architectural isolation via ThreadPoolExecutor):&lt;/strong&gt; CPU load and &lt;code&gt;P90/Mean Latency&lt;/code&gt; are essentially unchanged (all within ±1%), because the computational cost of &lt;code&gt;Argon2id&lt;/code&gt; stayed the same. The benefit of this optimization is purely &lt;em&gt;architectural&lt;/em&gt;: the Event Loop is no longer blocked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Baseline artifacts:&lt;/strong&gt; &lt;a href="https://github.com/bizoxe/iron-track/tree/benchmarks/benchmarks/results/auth-serialization/test-signup/baseline/wrk2-logs" rel="noopener noreferrer"&gt;wrk2-logs directory&lt;/a&gt; | &lt;a href="https://github.com/bizoxe/iron-track/tree/benchmarks/benchmarks/results/auth-serialization/test-signup/baseline/sar-metrics" rel="noopener noreferrer"&gt;sar-metrics directory&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimization 1 artifacts:&lt;/strong&gt; &lt;a href="https://github.com/bizoxe/iron-track/tree/benchmarks/benchmarks/results/auth-serialization/test-signup/optimization-01/wrk2-logs" rel="noopener noreferrer"&gt;wrk2-logs directory&lt;/a&gt; | &lt;a href="https://github.com/bizoxe/iron-track/tree/benchmarks/benchmarks/results/auth-serialization/test-signup/optimization-01/sar-metrics" rel="noopener noreferrer"&gt;sar-metrics directory&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Endpoint &lt;code&gt;/api/v1/access/signin&lt;/code&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Baseline&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Optimization 1&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Difference (Delta)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mean Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;857.39 ms&lt;/td&gt;
&lt;td&gt;664.35 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-22.5% (-193.04 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P90 (90%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1100.00 ms (1.10s)&lt;/td&gt;
&lt;td&gt;689.15 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-37.4% (-410.85 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P99 (99%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1100.00 ms (1.10s)&lt;/td&gt;
&lt;td&gt;693.76 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-36.9% (-406.24 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;StdDev&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;227.92 ms&lt;/td&gt;
&lt;td&gt;26.36 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-88.4% (-201.56 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Max Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1110.00 ms (1.11s)&lt;/td&gt;
&lt;td&gt;704.51 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-36.5% (-405.49 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU Load (Core 0)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;58.76%&lt;/td&gt;
&lt;td&gt;35.88%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-22.88 pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU Load (Core 1)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;59.04%&lt;/td&gt;
&lt;td&gt;35.68%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-23.36 pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Conclusion (Changing token signing cryptography):&lt;/strong&gt; We observe a purely &lt;em&gt;computational&lt;/em&gt; benefit. The drop in CPU utilization on cores 0 and 1 from ~59% to ~36% and the corresponding decrease in &lt;code&gt;Mean/P90 Latency&lt;/code&gt; directly confirm that replacing the resource-intensive RSA-2048 signing eliminated the main bottleneck of the endpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Baseline artifacts:&lt;/strong&gt; &lt;a href="https://github.com/bizoxe/iron-track/tree/benchmarks/benchmarks/results/auth-serialization/test-signin/baseline/wrk2-logs" rel="noopener noreferrer"&gt;wrk2-logs directory&lt;/a&gt; | &lt;a href="https://github.com/bizoxe/iron-track/tree/benchmarks/benchmarks/results/auth-serialization/test-signin/baseline/sar-metrics" rel="noopener noreferrer"&gt;sar-metrics directory&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimization 1 artifacts:&lt;/strong&gt; &lt;a href="https://github.com/bizoxe/iron-track/tree/benchmarks/benchmarks/results/auth-serialization/test-signin/optimization-01/wrk2-logs" rel="noopener noreferrer"&gt;wrk2-logs directory&lt;/a&gt; | &lt;a href="https://github.com/bizoxe/iron-track/tree/benchmarks/benchmarks/results/auth-serialization/test-signin/optimization-01/sar-metrics" rel="noopener noreferrer"&gt;sar-metrics directory&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Endpoint &lt;code&gt;/api/v1/access/me&lt;/code&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Baseline&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Optimization 1&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Difference (Delta)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mean Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3.70 ms&lt;/td&gt;
&lt;td&gt;4.23 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+14.3% (+0.53 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P90 (90%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4.36 ms&lt;/td&gt;
&lt;td&gt;4.61 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+5.7% (+0.25 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P99 (99%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5.27 ms&lt;/td&gt;
&lt;td&gt;5.73 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+8.7% (+0.46 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;StdDev&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.55 ms&lt;/td&gt;
&lt;td&gt;0.43 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-21.8% (-0.12 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Max Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7.92 ms&lt;/td&gt;
&lt;td&gt;7.44 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-6.1% (-0.48 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU Load (Core 0)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;44.56%&lt;/td&gt;
&lt;td&gt;54.30%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+9.74 pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU Load (Core 1)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;45.12%&lt;/td&gt;
&lt;td&gt;54.26%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+9.14 pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Conclusion (Changing cryptography for decoding/verifying token signatures):&lt;/strong&gt; Response time does not degrade critically (Mean Latency rises by 0.53 ms), but CPU load increases by ~9.5 percentage points after the optimization. When analyzing the flame graph &lt;a href="https://github.com/bizoxe/iron-track/blob/benchmarks/benchmarks/docs/assets/auth-serialization/me-joserfc.svg" rel="noopener noreferrer"&gt;me-joserfc.svg&lt;/a&gt;, I hypothesized that validation via the &lt;code&gt;JWTClaimsRegistry&lt;/code&gt; class introduces significant overhead. The &lt;code&gt;get_current_user&lt;/code&gt; dependency is called in all protected endpoints, so it makes sense to replace validation via &lt;code&gt;JWTClaimsRegistry&lt;/code&gt; with a manual "fast-path" branch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical note:&lt;/strong&gt; In the final optimization, validation via the &lt;code&gt;JWTClaimsRegistry&lt;/code&gt; class was removed. Subsequent profiling via py-spy and flame graph analysis showed that my assumption about the impact of &lt;code&gt;JWTClaimsRegistry&lt;/code&gt; was incorrect. The main CPU time is spent on verifying token signatures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Baseline artifacts:&lt;/strong&gt; &lt;a href="https://github.com/bizoxe/iron-track/tree/benchmarks/benchmarks/results/auth-serialization/test-me/baseline/wrk2-logs" rel="noopener noreferrer"&gt;wrk2-logs directory&lt;/a&gt; | &lt;a href="https://github.com/bizoxe/iron-track/tree/benchmarks/benchmarks/results/auth-serialization/test-me/baseline/sar-metrics" rel="noopener noreferrer"&gt;sar-metrics directory&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimization 1 artifacts:&lt;/strong&gt; &lt;a href="https://github.com/bizoxe/iron-track/tree/benchmarks/benchmarks/results/auth-serialization/test-me/optimization-01/wrk2-logs" rel="noopener noreferrer"&gt;wrk2-logs directory&lt;/a&gt; | &lt;a href="https://github.com/bizoxe/iron-track/tree/benchmarks/benchmarks/results/auth-serialization/test-me/optimization-01/sar-metrics" rel="noopener noreferrer"&gt;sar-metrics directory&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Preliminary summary:&lt;/strong&gt; We offloaded password hashing and verification to a &lt;code&gt;ThreadPoolExecutor&lt;/code&gt; and migrated the token signing algorithm from RSA-2048 to Ed25519, thereby eliminating the blocking of the Event Loop by heavy computational operations. After the switch to the &lt;code&gt;joserfc&lt;/code&gt; library, the flame graph revealed an unexpected increase in processor time for decoding and verifying token signatures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Main conclusion (Event loop blocking):&lt;/strong&gt; Wrapping password hashing/verification in an asynchronous wrapper (offloading) does not make the computation itself disappear. For a single isolated user, the response time on the &lt;code&gt;signup&lt;/code&gt;/&lt;code&gt;signin&lt;/code&gt; endpoints even increased slightly compared to the synchronous approach, because the thread pool adds its own overhead for context switching and management.&lt;br&gt;
But we gained an &lt;em&gt;architectural benefit on the scale of the entire system&lt;/em&gt;:&lt;br&gt;
With synchronous hashing, the event loop is blocked for the entire duration of the computation, so the server worker is physically unable to process concurrent requests from other clients, and requests queue up. Moving this logic to a thread leaves the Event Loop free. A specific heavy request waits for its turn in the pool, but the server continues to process light/medium traffic concurrently and without delay on the same worker.&lt;/p&gt;
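
&lt;p&gt;The offloading pattern can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's implementation: PBKDF2 from &lt;code&gt;hashlib&lt;/code&gt; stands in for &lt;code&gt;Argon2id&lt;/code&gt;, the pool size is arbitrary, and &lt;code&gt;hash_password_async&lt;/code&gt; and &lt;code&gt;heartbeat&lt;/code&gt; are hypothetical names. The heartbeat task plays the role of light traffic served by the same worker while the hash is being computed.&lt;/p&gt;

```python
import asyncio
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

# Dedicated pool for CPU-bound crypto work (the size here is an assumption).
crypto_executor = ThreadPoolExecutor(max_workers=2, thread_name_prefix="crypto")


def hash_password(password: str, salt: bytes) -> bytes:
    # Stand-in for Argon2id: PBKDF2 is also a CPU-bound key derivation.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)


async def hash_password_async(password: str, salt: bytes) -> bytes:
    loop = asyncio.get_running_loop()
    # The hash runs in a worker thread; the coroutine only awaits the result.
    return await loop.run_in_executor(crypto_executor, hash_password, password, salt)


async def heartbeat(stop: asyncio.Event, ticks: list[int]) -> None:
    # Simulates light requests handled by the event loop during hashing.
    while not stop.is_set():
        ticks.append(1)
        await asyncio.sleep(0.005)


async def main() -> tuple[bytes, int]:
    stop = asyncio.Event()
    ticks: list[int] = []
    beat = asyncio.create_task(heartbeat(stop, ticks))
    digest = await hash_password_async("s3cret", os.urandom(16))
    stop.set()
    await beat
    return digest, len(ticks)


if __name__ == "__main__":
    digest, tick_count = asyncio.run(main())
    print(f"digest bytes: {len(digest)}, heartbeat ticks during hashing: {tick_count}")
```

&lt;p&gt;The heartbeat keeps ticking while the hash is computed because the work happens in another thread. The gain is largest when the hashing library releases the GIL during the computation, which C-backed implementations typically do.&lt;/p&gt;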

&lt;h2&gt;
  
  
  Deep Dive: The Serialization Trap
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Intermediate Test: Evaluating Serialization Efficiency (&lt;code&gt;FastAPI + Pydantic&lt;/code&gt; vs. &lt;code&gt;msgspec&lt;/code&gt;)
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Theoretical Premise
&lt;/h4&gt;

&lt;p&gt;FastAPI's out-of-the-box response for a Pydantic model relies on the &lt;code&gt;json.dumps&lt;/code&gt; serializer from the Python standard library (or on Pydantic's built-in mechanisms), which needs more CPU cycles to validate and convert complex types (e.g., &lt;code&gt;UUID&lt;/code&gt;, &lt;code&gt;datetime&lt;/code&gt;) than native &lt;code&gt;msgspec.json&lt;/code&gt;. A custom &lt;code&gt;MsgSpecJSONResponse&lt;/code&gt; implementation serializes data directly into bytes, avoiding the intermediate overhead of the standard mechanism.&lt;/p&gt;

&lt;h4&gt;
  
  
  Test Objective
&lt;/h4&gt;

&lt;p&gt;Compare the serialization efficiency of the standard FastAPI mechanism and a custom implementation (&lt;code&gt;msgspec.Struct&lt;/code&gt; + &lt;code&gt;MsgSpecJSONResponse&lt;/code&gt; with native &lt;code&gt;msgspec.json&lt;/code&gt;) on a real data profile.&lt;/p&gt;

&lt;h4&gt;
  
  
  Test Description
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Implementation:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Two models with identical fields were implemented: &lt;code&gt;ExerciseReadPydantic&lt;/code&gt; and &lt;code&gt;ExerciseReadStruct&lt;/code&gt;, which have complex types (&lt;code&gt;UUID&lt;/code&gt;, &lt;code&gt;datetime&lt;/code&gt;) and nested objects.&lt;/li&gt;
&lt;li&gt;Two endpoints were implemented: &lt;code&gt;GET /serialization-pydantic&lt;/code&gt; and &lt;code&gt;GET /serialization-msgspec&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Both endpoints return the same dataset: &lt;code&gt;list[ExerciseReadPydantic]&lt;/code&gt; | &lt;code&gt;list[ExerciseReadStruct]&lt;/code&gt; of 50 objects.&lt;/li&gt;
&lt;li&gt;To eliminate the influence of database I/O on Latency, no database query was performed.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conditions:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Approach 1 (out-of-the-box):&lt;/strong&gt; FastAPI + &lt;code&gt;Pydantic&lt;/code&gt; model (returned via the framework's standard JSON response).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Approach 2 (custom):&lt;/strong&gt; &lt;code&gt;msgspec.Struct&lt;/code&gt; + custom &lt;code&gt;MsgSpecJSONResponse&lt;/code&gt; class.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;wrk2&lt;/strong&gt;: 600 RPS, 2 threads and 6 connections, mapped to 1 physical module (cores 6,7).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application Server:&lt;/strong&gt; Uvicorn, 2 workers, mapped to 1 physical module (cores 0,1).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker containers:&lt;/strong&gt; No Docker containers were running.&lt;/li&gt;
&lt;li&gt;Open file limits have been increased (&lt;code&gt;ulimit -n 65535&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Comparison Summary
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;msgspec&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Pydantic v2&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Difference (Delta)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mean Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.72 ms&lt;/td&gt;
&lt;td&gt;7.73 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+349.4% (+6.01 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P90 (90%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.52 ms&lt;/td&gt;
&lt;td&gt;9.01 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+257.5% (+6.49 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P99 (99%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.99 ms&lt;/td&gt;
&lt;td&gt;69.76 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+2233.1% (+66.77 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;StdDev&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.55 ms&lt;/td&gt;
&lt;td&gt;10.15 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+1745.5% (+9.60 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Max Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3.71 ms&lt;/td&gt;
&lt;td&gt;90.24 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+2332.3% (+86.53 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU Load (Core 0)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;11.66%&lt;/td&gt;
&lt;td&gt;63.40%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+443.7% (+51.74 pp)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU Load (Core 1)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;11.72%&lt;/td&gt;
&lt;td&gt;64.36%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+449.1% (+52.64 pp)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Technical note:&lt;/strong&gt; For an objective assessment of computational efficiency, we rely on &lt;strong&gt;Mean Latency&lt;/strong&gt;, &lt;strong&gt;P90&lt;/strong&gt;, and &lt;strong&gt;CPU Load&lt;/strong&gt;, which better reflect the actual load on the system (see &lt;code&gt;Justification for the relevance of wrk2 metrics&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt; On the current hardware, &lt;code&gt;msgspec&lt;/code&gt; is about 4.5 times faster by Mean Latency (1.72 ms vs 7.73 ms) and about 3.6 times faster at P90. The difference in CPU Load (~11.7% vs ~64%) confirms that &lt;code&gt;msgspec&lt;/code&gt; uses computational resources much more efficiently, which is an important factor for service scalability under high load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Artifacts:&lt;/strong&gt; &lt;a href="https://github.com/bizoxe/iron-track/blob/benchmarks/benchmarks/ext/auth-serialization/serialization.py" rel="noopener noreferrer"&gt;Source code&lt;/a&gt; | &lt;a href="https://github.com/bizoxe/iron-track/tree/benchmarks/benchmarks/results/auth-serialization/test-serialization" rel="noopener noreferrer"&gt;wrk2-logs directory&lt;/a&gt; | &lt;a href="https://github.com/bizoxe/iron-track/tree/benchmarks/benchmarks/results/auth-serialization/test-serialization/sar-metrics" rel="noopener noreferrer"&gt;sar-metrics directory&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Test of &lt;code&gt;jsonable_encoder&lt;/code&gt;'s impact on serialization (&lt;code&gt;MsgSpecJSONResponse (msgspec.json)&lt;/code&gt;)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Implementation and Conditions:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;code&gt;Pydantic&lt;/code&gt; model was implemented: &lt;code&gt;ExerciseReadPydantic&lt;/code&gt;, which has complex types (&lt;code&gt;UUID&lt;/code&gt;, &lt;code&gt;datetime&lt;/code&gt;) and nested objects.&lt;/li&gt;
&lt;li&gt;Two endpoints were implemented:&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GET /serialization-pydantic&lt;/code&gt; - standard &lt;code&gt;FastAPI&lt;/code&gt; serialization mechanism.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GET /serialization-jsonable-encoder&lt;/code&gt; - serialization via the &lt;code&gt;MsgSpecJSONResponse&lt;/code&gt; class (&lt;code&gt;jsonable_encoder&lt;/code&gt; -&amp;gt; &lt;code&gt;msgspec.json&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Both endpoints return the same dataset: &lt;code&gt;list[ExerciseReadPydantic]&lt;/code&gt; of 50 objects.&lt;/li&gt;
&lt;li&gt;To eliminate the influence of database I/O on Latency, no database query was performed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;wrk2&lt;/strong&gt;: 600 RPS, 2 threads and 6 connections, mapped to 1 physical module (cores 6,7).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application Server:&lt;/strong&gt; Uvicorn, 2 workers, mapped to 1 physical module (cores 0,1).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker containers:&lt;/strong&gt; No Docker containers were running.&lt;/li&gt;
&lt;li&gt;Open file limits have been increased (&lt;code&gt;ulimit -n 65535&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Test Results
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;The test results for the &lt;code&gt;GET /serialization-pydantic&lt;/code&gt; endpoint can be found in the section above: Evaluating Serialization Efficiency (&lt;code&gt;FastAPI + Pydantic&lt;/code&gt; vs. &lt;code&gt;msgspec&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Testing the &lt;code&gt;GET /serialization-jsonable-encoder&lt;/code&gt; endpoint at 600 RPS and 6 connections caused critical performance degradation: response times climbed into the range of seconds, indicating that the endpoint was saturated. A series of follow-up tests showed that it handles the load without degradation only at a much lower rate: ~150 RPS with 4 connections.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis partially confirmed:&lt;/strong&gt; An increase in latency and CPU load was expected, but the overhead of &lt;code&gt;jsonable_encoder&lt;/code&gt; was supposed to be offset by the speed of the &lt;code&gt;msgspec&lt;/code&gt; library. The tests showed instead that &lt;code&gt;jsonable_encoder&lt;/code&gt; introduces overhead large enough to dominate the result, so further testing and profiling with py-spy would add nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt; The call to &lt;code&gt;jsonable_encoder&lt;/code&gt; should be removed from the current implementation of the &lt;code&gt;MsgSpecJSONResponse&lt;/code&gt; class, which should rely on native serialization (&lt;code&gt;msgspec.json&lt;/code&gt;) instead. This requires two sets of schemas in the project: Pydantic for "input" (validation) and &lt;code&gt;msgspec.Struct&lt;/code&gt; for "output" (serialization).&lt;br&gt;
Maintaining two sets of schemas complicates the project, but the cost is offset by better scalability under high load.&lt;/p&gt;
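&lt;p&gt;To make the cost of an extra conversion pass more tangible, here is a rough standard-library analogy. It is not FastAPI's &lt;code&gt;jsonable_encoder&lt;/code&gt; and not &lt;code&gt;msgspec&lt;/code&gt;; the &lt;code&gt;Exercise&lt;/code&gt; dataclass and all function names are hypothetical. The two-pass route rebuilds the whole object tree from JSON-safe types before encoding, while the one-pass route lets the encoder convert special types only when it meets them.&lt;/p&gt;

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from uuid import UUID, uuid4


@dataclass
class Exercise:  # hypothetical stand-in for ExerciseReadPydantic
    id: UUID
    name: str
    created_at: datetime
    tags: list[str] = field(default_factory=list)


def to_jsonable(value):
    """Extra pass: rebuild the whole tree using only JSON-safe types."""
    if isinstance(value, dict):
        return {key: to_jsonable(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(item) for item in value]
    if isinstance(value, UUID):
        return str(value)
    if isinstance(value, datetime):
        return value.isoformat()
    return value


def encode_two_pass(items):
    # Convert everything up front, then encode the converted copy.
    return json.dumps([to_jsonable(asdict(item)) for item in items])


def _fallback(value):
    # Called by the encoder only for values it cannot serialize itself.
    if isinstance(value, UUID):
        return str(value)
    if isinstance(value, datetime):
        return value.isoformat()
    raise TypeError(f"Unsupported type: {type(value).__name__}")


def encode_one_pass(items):
    # No up-front conversion: special types are handled while encoding.
    return json.dumps([asdict(item) for item in items], default=_fallback)


if __name__ == "__main__":
    now = datetime(2026, 1, 1, tzinfo=timezone.utc)
    items = [Exercise(uuid4(), "squat", now, ["legs"]) for _ in range(50)]
    print(encode_two_pass(items) == encode_one_pass(items))
```

&lt;p&gt;Both functions produce the same JSON; the difference is only in how many times the data is walked and copied, which is the kind of overhead the benchmark above exposes at scale.&lt;/p&gt;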

&lt;p&gt;&lt;strong&gt;Artifacts:&lt;/strong&gt; &lt;a href="https://github.com/bizoxe/iron-track/blob/benchmarks/benchmarks/ext/auth-serialization/serialization.py" rel="noopener noreferrer"&gt;Source code&lt;/a&gt; | &lt;a href="https://github.com/bizoxe/iron-track/tree/benchmarks/benchmarks/results/auth-serialization/test-serialization/test-jsonable_enc" rel="noopener noreferrer"&gt;wrk2-logs directory&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Round 2: Applying the Final Optimizations
&lt;/h2&gt;

&lt;p&gt;This optimization includes the changes made in the first stage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Isolation of password hashing via &lt;code&gt;ThreadPoolExecutor&lt;/code&gt;, now with the &lt;strong&gt;&lt;code&gt;Argon2id&lt;/code&gt;&lt;/strong&gt; parameters set explicitly (&lt;code&gt;parallelism=1&lt;/code&gt;, &lt;code&gt;time_cost=3&lt;/code&gt;, &lt;code&gt;memory_cost=65536&lt;/code&gt;). Time and memory cost stay at their default values, but forcing &lt;code&gt;parallelism=1&lt;/code&gt; keeps each hash from spawning 4 computational threads, which reduces context switching on the CPU under high load.&lt;/li&gt;
&lt;li&gt;Replacing the &lt;strong&gt;PyJWT&lt;/strong&gt; library with &lt;strong&gt;joserfc&lt;/strong&gt; and switching the token signing algorithm from &lt;code&gt;RSA-2048&lt;/code&gt; to &lt;code&gt;Ed25519&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
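&lt;p&gt;The thread-pool isolation mentioned above can be sketched as follows. This is a minimal illustration, not the project's code: PBKDF2 from the standard library stands in for &lt;code&gt;Argon2id&lt;/code&gt;, and the pool size, iteration count, and function names are assumptions chosen for the example.&lt;/p&gt;

```python
import asyncio
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

# Dedicated pool for hashing; the size of 2 is an assumption for this sketch.
HASH_POOL = ThreadPoolExecutor(max_workers=2, thread_name_prefix="pw-hash")


def hash_password(password: str, salt: bytes) -> bytes:
    # Stand-in for Argon2id: a synchronous, CPU-bound key derivation function.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)


async def hash_password_async(password: str, salt: bytes) -> bytes:
    loop = asyncio.get_running_loop()
    # The event loop keeps serving other coroutines while a worker thread hashes.
    return await loop.run_in_executor(HASH_POOL, hash_password, password, salt)


async def main() -> None:
    salt = os.urandom(16)
    first, second = await asyncio.gather(
        hash_password_async("s3cret", salt),
        hash_password_async("s3cret", salt),
    )
    print(first == second)  # same password and salt give the same digest


if __name__ == "__main__":
    asyncio.run(main())
```

&lt;p&gt;Calling &lt;code&gt;hash_password&lt;/code&gt; directly inside a coroutine would block every other request for the duration of the hash; routing it through &lt;code&gt;run_in_executor&lt;/code&gt; confines that cost to a worker thread.&lt;/p&gt;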

&lt;p&gt;&lt;strong&gt;Key Changes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Endpoint &lt;code&gt;/api/v1/access/signup&lt;/code&gt;:&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ORM-level optimization:&lt;/strong&gt; The &lt;code&gt;auto_refresh=False&lt;/code&gt; parameter was introduced when calling the user creation method in &lt;code&gt;advanced-alchemy&lt;/code&gt;. This eliminated the redundant hidden &lt;code&gt;SELECT&lt;/code&gt; query (a repeated round-trip to the DB) after the &lt;code&gt;INSERT&lt;/code&gt; operation, as all necessary auto-generated fields are read from the session context.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transport layer optimization:&lt;/strong&gt; Pydantic models were dropped entirely from the output layer in favor of &lt;code&gt;msgspec.Struct&lt;/code&gt; structures. Instead of the standard FastAPI serializer, the native &lt;code&gt;MsgSpecJSONResponse&lt;/code&gt; is used, which removes the heavy &lt;code&gt;jsonable_encoder&lt;/code&gt; from the serialization path.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Endpoint &lt;code&gt;/api/v1/access/signin&lt;/code&gt;:&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Elimination of redundant queries and limitation of field selection (ORM layer):&lt;/strong&gt; In the &lt;code&gt;UserService.authenticate&lt;/code&gt; method, the data loading strategy was optimized. The &lt;code&gt;noload(m.User.role)&lt;/code&gt; directive was applied, which canceled the default &lt;code&gt;selectinload&lt;/code&gt; strategy for this relationship. The &lt;code&gt;load_only&lt;/code&gt; directive was also used, limiting the initial query to the &lt;code&gt;user_account&lt;/code&gt; table to strictly the fields necessary for validation (&lt;code&gt;id&lt;/code&gt;, &lt;code&gt;email&lt;/code&gt;, &lt;code&gt;is_active&lt;/code&gt;, &lt;code&gt;password&lt;/code&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Endpoint &lt;code&gt;/api/v1/access/me&lt;/code&gt;:&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Elimination of redundant queries and limitation of field selection (ORM layer):&lt;/strong&gt; In the &lt;code&gt;_get_user_from_payload&lt;/code&gt; method, the data loading strategy was optimized. Instead of two separate queries (the primary one to the user and the automatic &lt;code&gt;selectinload&lt;/code&gt; to their role), a single SQL query with a &lt;code&gt;LEFT JOIN&lt;/code&gt; via &lt;code&gt;joinedload(User.role)&lt;/code&gt; is now executed. At the same time, &lt;code&gt;load_only&lt;/code&gt; at both levels limits the selection to the absolute minimum (&lt;code&gt;Role.slug&lt;/code&gt; and 5 basic user fields).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transport layer optimization:&lt;/strong&gt; The &lt;code&gt;UserAuth&lt;/code&gt; transport model was converted from a Pydantic model to a lightweight &lt;code&gt;msgspec.Struct&lt;/code&gt; structure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Switch to In-Memory Caching:&lt;/strong&gt; The &lt;code&gt;fastapi-cache2&lt;/code&gt; library was replaced with &lt;code&gt;cashews&lt;/code&gt; with local caching enabled (&lt;code&gt;client_side=True&lt;/code&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Refactoring the validation layer:&lt;/strong&gt; Token validation via the &lt;code&gt;JWTClaimsRegistry&lt;/code&gt; class in the &lt;code&gt;get_payload_from_token&lt;/code&gt; function was replaced with a straightforward manual "fast-path".&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Flame graph analysis after removing &lt;code&gt;JWTClaimsRegistry&lt;/code&gt;:&lt;/strong&gt; A comparison of the &lt;a href="https://github.com/bizoxe/iron-track/blob/benchmarks/benchmarks/docs/assets/auth-serialization/me-joserfc.svg" rel="noopener noreferrer"&gt;me-joserfc.svg&lt;/a&gt; and &lt;a href="https://github.com/bizoxe/iron-track/blob/benchmarks/benchmarks/docs/assets/auth-serialization/me-final.svg" rel="noopener noreferrer"&gt;me-final.svg&lt;/a&gt; profiles shows that removing &lt;code&gt;JWTClaimsRegistry&lt;/code&gt; eliminated the small "fringe" of Python calls at the tails of the &lt;code&gt;validate_compact&lt;/code&gt; function, flattening the stack. However, the width of the entire JWT branch remained unchanged (~21%). This indicates that the overhead of Python's object abstractions in this node was close to zero and that nearly all of the CPU time goes to the Ed25519 signature mathematics, which can only be avoided architecturally (by caching). My assumption about the influence of &lt;code&gt;JWTClaimsRegistry&lt;/code&gt; was therefore incorrect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key architectural shift (Caching access tokens):&lt;/strong&gt; Although the targeted optimization removed the Python object overhead, the endpoint is still limited by the mathematics of verifying the Ed25519 cryptographic signature. The most effective remedy is to cache valid tokens by their &lt;code&gt;jti&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;General conclusion on token cryptography:&lt;/strong&gt;&lt;br&gt;
During the tests, we ran into an apparent paradox of cryptography. The asymmetric &lt;strong&gt;RSA-2048&lt;/strong&gt; algorithm is heavy when generating tokens (signing), but when validating them (decoding/verifying) it turned out to be faster and "lighter" than the modern Ed25519.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flame graph analysis of access token caching:&lt;/strong&gt; In the &lt;a href="https://github.com/bizoxe/iron-track/blob/benchmarks/benchmarks/docs/assets/auth-serialization/me-cached.svg" rel="noopener noreferrer"&gt;me-cached.svg&lt;/a&gt; flame graph, Ed25519 signature verification, which previously took ~20-22% of processor time, is no longer a bottleneck (the CPU-bound operation is skipped on a cache hit).&lt;br&gt;
The load shifts to the IO-bound area (waiting for data) and the application's business logic. Deserializing the access token takes ~5% of processor time, and deserializing user data ~6%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summary:&lt;/strong&gt; Thanks to the introduction of token caching by jti, the CPU load during validation has been reduced from ~22% to zero (assuming a cache hit), moving the endpoint to an IO-bound state with a total CPU cost for deserialization of ~11%.&lt;/p&gt;
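&lt;p&gt;The idea of verifying a token once and then serving its claims from memory can be sketched as below. This is a minimal illustration under stated assumptions, not the project's &lt;code&gt;cashews&lt;/code&gt; setup: it uses a plain dictionary with no size limit, the &lt;code&gt;verify&lt;/code&gt; callable is a hypothetical hook for the expensive signature check, and the cache key is a SHA-256 digest of the whole token rather than the &lt;code&gt;jti&lt;/code&gt;, so that a lookup can only match a token that has already been verified byte for byte.&lt;/p&gt;

```python
import hashlib
import time
from typing import Any, Callable, Dict, Optional, Tuple

Claims = Dict[str, Any]


class VerifiedTokenCache:
    """Remembers the claims of tokens that already passed signature verification."""

    def __init__(self, verify: Callable[[str], Claims]) -> None:
        self._verify = verify  # hypothetical hook for the expensive Ed25519 check
        self._entries: Dict[str, Tuple[float, Claims]] = {}

    def get_claims(self, token: str, now: Optional[float] = None) -> Claims:
        now = time.time() if now is None else now
        key = hashlib.sha256(token.encode()).hexdigest()
        entry = self._entries.get(key)
        if entry is not None:
            expires_at, claims = entry
            if expires_at > now:
                return claims  # cache hit: no signature mathematics
            del self._entries[key]  # drop the expired entry
        claims = self._verify(token)  # cache miss: pay for verification once
        expires_at = float(claims["exp"])
        if now >= expires_at:
            raise ValueError("token expired")
        self._entries[key] = (expires_at, claims)
        return claims
```

&lt;p&gt;Entries never outlive the token's own &lt;code&gt;exp&lt;/code&gt; claim, so a cache hit cannot extend a token's lifetime. A production cache would also need eviction and a way to revoke tokens early.&lt;/p&gt;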

&lt;p&gt;&lt;strong&gt;The fundamental difference between the two algorithms is explained below:&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Security Performance Analysis: RSA-2048 vs Ed25519
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;The RSA algorithm is &lt;strong&gt;asymmetric&lt;/strong&gt; not only in its key logic but also in its computational load.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Signing (Private Key):&lt;/strong&gt; The processor raises a number to the giant power of a 2048-bit secret exponent $d$. This involves thousands of heavy multiplication cycles of large numbers, which heavily burn the CPU when issuing tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification (Public Key):&lt;/strong&gt; Here the public exponent is almost always the small fixed constant &lt;strong&gt;&lt;code&gt;65537&lt;/code&gt;&lt;/strong&gt; ($2^{16} + 1$). To raise the signature value to this power, the processor needs only &lt;strong&gt;17 modular multiplications&lt;/strong&gt; (16 squarings and 1 multiplication) using the binary exponentiation algorithm.&lt;/li&gt;
&lt;/ul&gt;
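&lt;p&gt;The operation counts above can be checked with a small sketch of left-to-right binary exponentiation that counts modular multiplications. The modulus and the 2048-bit exponent below are arbitrary illustrative numbers, not a real RSA key, and real libraries add further tricks (CRT, windowing, Montgomery arithmetic), so the counts show the asymmetry rather than actual timings.&lt;/p&gt;

```python
def modexp_count(base: int, exponent: int, modulus: int) -> tuple[int, int]:
    """Square-and-multiply; returns (result, number of modular multiplications).

    Assumes exponent is at least 1.
    """
    result = base % modulus
    ops = 0
    for bit in bin(exponent)[3:]:  # skip the "0b" prefix and the leading 1 bit
        result = (result * result) % modulus  # one squaring per remaining bit
        ops += 1
        if bit == "1":
            result = (result * base) % modulus  # extra multiply for each set bit
            ops += 1
    return result, ops


if __name__ == "__main__":
    modulus = 2**2048 - 159          # arbitrary 2048-bit odd number
    private_like = 2**2048 - 1       # worst case: 2048 bits, all set
    _, verify_ops = modexp_count(3, 65537, modulus)
    _, sign_ops = modexp_count(3, private_like, modulus)
    print(verify_ops)  # 17
    print(sign_ops)    # 4094
```

&lt;p&gt;A typical private exponent has roughly half of its bits set, which puts the count near 3,000 multiplications on 2048-bit numbers, against 17 for the public exponent.&lt;/p&gt;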

&lt;p&gt;&lt;strong&gt;Profiling summary:&lt;/strong&gt; Despite the fact that the RSA signature verification operation is mathematically cheap, the full JWT processing stack (parsing, decoding, cryptographic verification) in the baseline version consumes ~10–12% of processor time on the endpoint.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ed25519 algorithm: there is no exponentiation to giant powers; all work is based on &lt;strong&gt;scalar multiplication of points on a curve&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Signing (Private Key):&lt;/strong&gt; The algorithm multiplies a fixed base point of the curve. For it, pre-computation tables ("cheat sheets") are pre-wired into the libraries. The processor digests this task instantly, and issuing tokens ceases to be a bottleneck.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification (Public Key):&lt;/strong&gt; The processor needs to perform &lt;strong&gt;two scalar multiplications&lt;/strong&gt;. One of them involves the public key, for which a "cheat sheet" cannot be precomputed and shipped with the library. The processor has to carry out the full elliptic-curve arithmetic from scratch.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Profiling summary:&lt;/strong&gt; The Ed25519 verification operation mathematically requires more processor cycles than RSA-2048 verification. This is confirmed by the increase in CPU time from ~12% to ~22%. However, it was this transition that made it possible to move away from "heavy" RSA operations at the token generation stage (signing) in other parts of the system. In the context of the /me endpoint, we compensated for this cryptographic overhead by introducing caching by jti, turning a CPU-bound operation into an IO-bound one and cutting the token-related share of processor time by ~11 percentage points (from ~22% to ~11%).&lt;/p&gt;

&lt;h3&gt;
  
  
  The Final Scorecard: Before and After
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Technical note:&lt;/strong&gt; For an objective assessment of efficiency, we rely on &lt;strong&gt;Mean Latency&lt;/strong&gt;, &lt;strong&gt;P90&lt;/strong&gt;, and &lt;strong&gt;CPU Load&lt;/strong&gt; (see &lt;code&gt;Justification for the relevance of wrk2 metrics&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Before starting the final series of runs, the database was cleared.&lt;/p&gt;

&lt;h4&gt;
  
  
  Endpoint &lt;code&gt;/api/v1/access/signup&lt;/code&gt;
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Baseline&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Optimization 1&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Final&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Delta (Baseline → Final)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mean Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;662.76 ms&lt;/td&gt;
&lt;td&gt;656.63 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;601.42 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-9.26% (-61.34 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P90 (90%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;674.30 ms&lt;/td&gt;
&lt;td&gt;679.93 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;654.34 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-2.96% (-19.96 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P99 (99%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;679.93 ms&lt;/td&gt;
&lt;td&gt;684.54 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;662.02 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-2.63% (-17.91 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;StdDev&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;9.79 ms&lt;/td&gt;
&lt;td&gt;24.50 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;52.76 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+438.9% (+42.97 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Max Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;682.50 ms&lt;/td&gt;
&lt;td&gt;684.54 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;661.50 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-3.08% (-21.00 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU Load (Core 0)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;57.55%&lt;/td&gt;
&lt;td&gt;58.29%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;55.46%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-3.63% (-2.09%)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU Load (Core 1)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;57.66%&lt;/td&gt;
&lt;td&gt;58.41%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;56.51%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-2.00% (-1.15%)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Brief analysis:&lt;/strong&gt; A decrease in Mean Latency of ~9% and a slight decrease in CPU load are observed (Delta Baseline → Final).&lt;br&gt;
The introduction of the &lt;code&gt;auto_refresh=False&lt;/code&gt; parameter at the ORM level and the switch to native &lt;code&gt;msgspec.json&lt;/code&gt; serialization in the final optimization provided a performance boost. However, the dominant factor (P90/P99) remains the &lt;code&gt;Argon2id&lt;/code&gt; cryptography.&lt;br&gt;
It should be noted that a small amount of data is serialized on this endpoint, and the difference between the custom implementation and the standard serialization of the FastAPI framework is not so obvious in this case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final optimization artifacts:&lt;/strong&gt; &lt;a href="https://github.com/bizoxe/iron-track/tree/benchmarks/benchmarks/results/auth-serialization/test-signup/final/wrk2-logs" rel="noopener noreferrer"&gt;wrk2-logs directory&lt;/a&gt; | &lt;a href="https://github.com/bizoxe/iron-track/tree/benchmarks/benchmarks/results/auth-serialization/test-signup/final/sar-metrics" rel="noopener noreferrer"&gt;sar-metrics directory&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Endpoint &lt;code&gt;/api/v1/access/signin&lt;/code&gt;
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Baseline&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Optimization 1&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Final&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Delta (Baseline → Final)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mean Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;857.39 ms&lt;/td&gt;
&lt;td&gt;664.35 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;520.95 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-39.2% (-336.44 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P90 (90%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1100.00 ms&lt;/td&gt;
&lt;td&gt;689.15 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;676.86 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-38.5% (-423.14 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P99 (99%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1100.00 ms&lt;/td&gt;
&lt;td&gt;693.76 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;683.52 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-37.9% (-416.48 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;StdDev&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;227.92 ms&lt;/td&gt;
&lt;td&gt;26.36 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;110.52 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-51.5% (-117.40 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Max Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1110.00 ms&lt;/td&gt;
&lt;td&gt;704.51 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;686.08 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-38.2% (-423.92 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU Load (Core 0)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;58.76%&lt;/td&gt;
&lt;td&gt;35.88%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;33.31%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-43.3% (-25.45%)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU Load (Core 1)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;59.04%&lt;/td&gt;
&lt;td&gt;35.68%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;34.18%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-42.1% (-24.86%)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Brief analysis:&lt;/strong&gt; In the final optimization, the loading strategy was changed: limiting the selection of &lt;code&gt;User&lt;/code&gt; model fields and applying the &lt;code&gt;noload(m.User.role)&lt;/code&gt; directive. A decrease in Mean Latency of ~39% is observed (Delta Baseline → Final). Although ORM optimization improved the average latency by reducing I/O and memory overhead, the tail latencies P90/P99 between the first and final optimizations show diminishing returns.&lt;br&gt;
As with the &lt;code&gt;/api/v1/access/signup&lt;/code&gt; endpoint, the determining factor remains the &lt;code&gt;Argon2id&lt;/code&gt; cryptography.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final optimization artifacts:&lt;/strong&gt; &lt;a href="https://github.com/bizoxe/iron-track/tree/benchmarks/benchmarks/results/auth-serialization/test-signup/final/wrk2-logs" rel="noopener noreferrer"&gt;wrk2-logs directory&lt;/a&gt; | &lt;a href="https://github.com/bizoxe/iron-track/tree/benchmarks/benchmarks/results/auth-serialization/test-signup/final/sar-metrics" rel="noopener noreferrer"&gt;sar-metrics directory&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Endpoint &lt;code&gt;/api/v1/access/me&lt;/code&gt;
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Baseline&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Optimization 1&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Final&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Delta (Baseline → Final)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mean Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3.70 ms&lt;/td&gt;
&lt;td&gt;4.23 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.35 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-36.5% (-1.35 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P90 (90%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4.36 ms&lt;/td&gt;
&lt;td&gt;4.61 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.01 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-31.0% (-1.35 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P99 (99%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5.27 ms&lt;/td&gt;
&lt;td&gt;5.73 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.90 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-26.0% (-1.37 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;StdDev&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.55 ms&lt;/td&gt;
&lt;td&gt;0.43 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.51 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-7.3% (-0.04 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Max Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7.92 ms&lt;/td&gt;
&lt;td&gt;7.44 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.91 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-38.0% (-3.01 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU Load (Core 0)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;44.56%&lt;/td&gt;
&lt;td&gt;54.30%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;28.93%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-35.1% (-15.63%)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU Load (Core 1)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;45.12%&lt;/td&gt;
&lt;td&gt;54.26%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;30.19%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-33.1% (-14.93%)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Brief analysis:&lt;/strong&gt; During the first optimization, it was found that verifying access token signatures via the &lt;code&gt;Ed25519&lt;/code&gt; algorithm introduces significant overhead. Optimization was performed at the transport and ORM layers. However, a significant performance boost and reduction in CPU load occurred due to local caching of access tokens (see flame graph analysis &lt;code&gt;me-cached.svg&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final optimization artifacts:&lt;/strong&gt; &lt;a href="https://github.com/bizoxe/iron-track/tree/benchmarks/benchmarks/results/auth-serialization/test-signup/final/wrk2-logs" rel="noopener noreferrer"&gt;wrk2-logs directory&lt;/a&gt; | &lt;a href="https://github.com/bizoxe/iron-track/tree/benchmarks/benchmarks/results/auth-serialization/test-signup/final/sar-metrics" rel="noopener noreferrer"&gt;sar-metrics directory&lt;/a&gt;&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;The section below presents additional benchmarks and profiling results. The studies are intended to quantify infrastructure overhead (logging), as well as to demonstrate the effect of changing planner parameters (&lt;code&gt;random_page_cost&lt;/code&gt;) on PostgreSQL query performance.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Bonus Round: The Hidden Cost of Logging
&lt;/h2&gt;

&lt;h4&gt;
  
  
  1. Theoretical premise
&lt;/h4&gt;

&lt;p&gt;Structured logging using &lt;code&gt;StructLogMiddleware&lt;/code&gt; requires additional I/O operations and processor time (calculating time, extracting headers, forming a JSON structure).&lt;/p&gt;
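&lt;p&gt;To show where this per-request work comes from, here is a minimal sketch of a pure ASGI logging middleware built only on the standard library. It is not the project's &lt;code&gt;StructLogMiddleware&lt;/code&gt;; the class name &lt;code&gt;TimingLogMiddleware&lt;/code&gt;, the logger name, and the logged fields are assumptions for illustration.&lt;/p&gt;

```python
import json
import logging
import time

logger = logging.getLogger("access")  # hypothetical logger name


class TimingLogMiddleware:
    """Pure ASGI middleware that emits one JSON log line per HTTP request."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        started = time.perf_counter()  # time measurement
        status = {"code": 500}  # assume failure until the app starts a response

        async def send_wrapper(message):
            if message["type"] == "http.response.start":
                status["code"] = message["status"]
            await send(message)

        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            headers = dict(scope.get("headers", []))  # header extraction
            record = {
                "method": scope["method"],
                "path": scope["path"],
                "status": status["code"],
                "user_agent": headers.get(b"user-agent", b"").decode("latin-1"),
                "duration_ms": round((time.perf_counter() - started) * 1000, 3),
            }
            logger.info(json.dumps(record))  # JSON formatting plus log output
```

&lt;p&gt;Every step in the &lt;code&gt;finally&lt;/code&gt; block runs on each request regardless of how cheap the endpoint is, which is why the cost is most visible on a trivial handler.&lt;/p&gt;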

&lt;h4&gt;
  
  
  2. Test objective
&lt;/h4&gt;

&lt;p&gt;The purpose of this test is to determine the overhead that logging middleware adds to the life cycle of each application request using the example of the most lightweight endpoint.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Test description
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Endpoint:&lt;/strong&gt; &lt;code&gt;GET /ping&lt;/code&gt; (returns &lt;code&gt;PlainTextResponse&lt;/code&gt;, &lt;code&gt;b"OK"&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conditions:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Without using &lt;code&gt;StructLogMiddleware&lt;/code&gt; (pure FastAPI).&lt;/li&gt;
&lt;li&gt;With &lt;code&gt;StructLogMiddleware&lt;/code&gt; enabled, format &lt;code&gt;json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;wrk2&lt;/strong&gt;: 1000 RPS, 2 threads and 10 connections, mapped to 1 physical module (cores 6,7).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application Server:&lt;/strong&gt; Uvicorn, 2 workers, mapped to 1 physical module (cores 0,1).&lt;/li&gt;
&lt;li&gt;To eliminate the influence of rendering Uvicorn logs in the terminal, the output was redirected to &lt;code&gt;/dev/null&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker containers:&lt;/strong&gt; No Docker containers were running.&lt;/li&gt;
&lt;li&gt;Open file limits have been increased (&lt;code&gt;ulimit -n 65535&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  4. Comparison summary
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Pure Ping&lt;/th&gt;
&lt;th&gt;Ping + StructLog&lt;/th&gt;
&lt;th&gt;Difference (Overhead)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mean Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.26 ms&lt;/td&gt;
&lt;td&gt;2.83 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+25.22% (+0.57 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P90 (90%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3.29 ms&lt;/td&gt;
&lt;td&gt;3.56 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+8.21% (+0.27 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P99 (99%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4.12 ms&lt;/td&gt;
&lt;td&gt;4.41 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+7.04% (+0.29 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;StdDev&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.78 ms&lt;/td&gt;
&lt;td&gt;3.01 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+285.90% (+2.23 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Max Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5.26 ms&lt;/td&gt;
&lt;td&gt;82.37 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+1465.97% (+77.11 ms)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU Load (Core 0)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;23.43%&lt;/td&gt;
&lt;td&gt;38.44%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+64.06%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU Load (Core 1)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;24.01%&lt;/td&gt;
&lt;td&gt;37.96%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+58.10%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Technical note:&lt;/strong&gt; For an objective assessment of efficiency, we rely on &lt;strong&gt;Mean Latency&lt;/strong&gt;, &lt;strong&gt;P90&lt;/strong&gt;, and &lt;strong&gt;CPU Load&lt;/strong&gt; (see &lt;code&gt;Justification for the relevance of wrk2 metrics&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt; Judging by Mean Latency/P90, logging adds little overhead (+0.57 ms/+0.27 ms). CPU Load, however, rises from ~23–24% to ~38%, which is about +14–15 percentage points of a core's capacity in absolute terms.&lt;/p&gt;
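&lt;p&gt;The two ways of reading the CPU figures (absolute change versus relative change) are easy to mix up, so here is a small helper that computes both from the values in the table above. The function name &lt;code&gt;delta&lt;/code&gt; is only illustrative.&lt;/p&gt;

```python
def delta(baseline: float, final: float) -> tuple[float, float]:
    """Return (absolute change, relative change in percent) from baseline to final."""
    absolute = final - baseline
    relative = absolute / baseline * 100
    return round(absolute, 2), round(relative, 2)


# CPU Load (Core 0): 23.43% without logging, 38.44% with it.
points, percent = delta(23.43, 38.44)
print(points, percent)  # 15.01 64.06
```

&lt;p&gt;The first number is the change in core utilization itself; the second is the growth relative to the baseline, which is the figure shown in the table's overhead column.&lt;/p&gt;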

&lt;p&gt;&lt;strong&gt;Artifacts:&lt;/strong&gt; &lt;a href="https://github.com/bizoxe/iron-track/tree/benchmarks/benchmarks/results/auth-serialization/test-ping" rel="noopener noreferrer"&gt;wrk2-logs directory&lt;/a&gt; | &lt;a href="https://github.com/bizoxe/iron-track/tree/benchmarks/benchmarks/results/auth-serialization/test-ping/sar-metrics" rel="noopener noreferrer"&gt;sar-metrics directory&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  5. Flame Graph Analysis
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Conditions:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;py-spy&lt;/strong&gt;: &lt;code&gt;record&lt;/code&gt; mode, &lt;code&gt;--rate 150&lt;/code&gt;, mapped to 1 physical module (cores 2,3).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;wrk2&lt;/strong&gt;: 600 RPS, 2 threads and 4 connections, mapped to 1 physical module (cores 6,7).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application Server:&lt;/strong&gt; Uvicorn, 1 worker, mapped to 1 physical module (core 0).&lt;/li&gt;
&lt;li&gt;Open file limits have been increased (&lt;code&gt;ulimit -n 65535&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Analysis:&lt;/strong&gt; &lt;a href="https://github.com/bizoxe/iron-track/blob/benchmarks/benchmarks/docs/assets/auth-serialization/ping-with-logging.svg" rel="noopener noreferrer"&gt;ping-with-logging.svg&lt;/a&gt;:&lt;br&gt;
In total, the logging middleware frame accounts for ~34% of processor time: ~23% is the request passing through the FastAPI layers and executing the endpoint logic (&lt;code&gt;await self.app&lt;/code&gt;), and ~11% is the logging itself, that is, forming the log record and handing it to the non-blocking output (&lt;code&gt;logger.info&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;The figure of ~34% may seem huge, but it should be borne in mind that the profiler shows us the &lt;strong&gt;share of time&lt;/strong&gt; spent in a particular node of the stack &lt;strong&gt;relative to the total time&lt;/strong&gt; of a particular request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here is an example:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For &lt;code&gt;/ping&lt;/code&gt;: The middleware takes, say, 0.05 ms out of a total time of 0.2 ms → 25%.&lt;/li&gt;
&lt;li&gt;For &lt;code&gt;/signin&lt;/code&gt;: The middleware takes the same 0.05 ms out of a total time of 500 ms → 0.01%.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thus, the &lt;strong&gt;absolute time&lt;/strong&gt; (CPU cost of executing the &lt;code&gt;StructLogMiddleware&lt;/code&gt; code) is the same in both cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note on data interpretation:&lt;/strong&gt; The percentages shown reflect the relative cost of performing operations on a trivial endpoint. The absolute cost (Fixed Cost) of logging in this benchmark is ~0.57 ms (Mean Latency). As the business logic of the endpoint becomes more complex (e.g., when performing cryptographic operations or complex SQL queries), the relative contribution of logging will tend to &amp;lt;1%.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bonus Round: Taming the Database Planner
&lt;/h2&gt;

&lt;h4&gt;
  
  
  1. Theoretical premise
&lt;/h4&gt;

&lt;p&gt;When working with a PostgreSQL DBMS on hard disk drives (HDDs), it is critically important to consider the physical limitations of the disk subsystem, in particular the delays that arise when moving the magnetic heads.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Test objective
&lt;/h4&gt;

&lt;p&gt;Let's compare the default value &lt;code&gt;random_page_cost=4.0&lt;/code&gt; with &lt;code&gt;random_page_cost=1.0&lt;/code&gt; and see how the setting affects the query planner (Cost-Based Optimizer) for a selection that makes up about 1% of the table (4,937 rows out of 500,000).&lt;br&gt;
The &lt;code&gt;exercises&lt;/code&gt; table was chosen for the experiment. In both runs, &lt;code&gt;seq_page_cost&lt;/code&gt; was left at its default value of &lt;code&gt;1.0&lt;/code&gt;.&lt;br&gt;
The queries and main (default) PostgreSQL settings can be viewed here: &lt;a href="https://github.com/bizoxe/iron-track/blob/benchmarks/benchmarks/docs/assets/resources/postgresql-random-page-cost-analysis.md" rel="noopener noreferrer"&gt;postgresql-random-page-cost-analysis.md&lt;/a&gt;.&lt;/p&gt;
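
&lt;p&gt;As a rough sketch of how such a comparison could be scripted, the helper below only builds the SQL statements to run in a single session. The function name &lt;code&gt;build_plan_comparison&lt;/code&gt; is hypothetical, and the &lt;code&gt;SELECT&lt;/code&gt; is an assumption based on the table and predicate mentioned in this article; the exact benchmark queries are in the linked file:&lt;/p&gt;

```python
def build_plan_comparison(query, random_page_costs):
    """Return SQL statements that run one query under several random_page_cost values."""
    statements = []
    for cost in random_page_costs:
        # SET changes the parameter for the current session only.
        statements.append(f"SET random_page_cost = {cost};")
        # ANALYZE executes the query; BUFFERS adds shared hit/read counters.
        statements.append(f"EXPLAIN (ANALYZE, BUFFERS) {query}")
    # Restore the configured default afterwards.
    statements.append("RESET random_page_cost;")
    return statements


# Assumed query shape: the table and predicate come from the article,
# the exact SELECT used in the benchmark may differ.
QUERY = "SELECT * FROM exercises WHERE is_system_default IS TRUE;"

script = build_plan_comparison(QUERY, [4.0, 1.0])
print("\n".join(script))
```

&lt;p&gt;The printed statements can be pasted into &lt;code&gt;psql&lt;/code&gt;. Because &lt;code&gt;SET&lt;/code&gt; is session-scoped, the experiment does not touch the server-wide configuration.&lt;/p&gt;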

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; All measurements were made on data already in the cache (&lt;code&gt;shared_buffers&lt;/code&gt;). This isolates the planner's choice of access path from the physical delay of reading from disk and makes the optimizer's decision-making logic easier to see.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Comparison summary
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Selection characteristics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total table size: 500,000 rows.&lt;/li&gt;
&lt;li&gt;Number of rows satisfying the condition (&lt;code&gt;is_system_default IS TRUE&lt;/code&gt;): 4,937 (less than 1% of the total).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;With &lt;code&gt;random_page_cost = 4.0&lt;/code&gt;:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Plan:&lt;/strong&gt; Bitmap Heap Scan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution Time:&lt;/strong&gt; 11.560 ms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Planner logic:&lt;/strong&gt; Because random data access is priced high (4.0), the planner first builds a bitmap of the matching pages and then reads those heap pages in physical order instead of jumping between them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics:&lt;/strong&gt; &lt;code&gt;Buffers: shared hit=3861&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;With &lt;code&gt;random_page_cost = 1.0&lt;/code&gt;:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Plan:&lt;/strong&gt; Index Scan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution Time:&lt;/strong&gt; 7.794 ms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Planner logic:&lt;/strong&gt; The cost of random access is equated to that of sequential access, so the planner estimates that reading directly through the index is cheaper. Execution time drops from 11.560 ms to 7.794 ms, almost 1.5 times faster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics:&lt;/strong&gt; &lt;code&gt;Buffers: shared hit=3967&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
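
&lt;p&gt;The two runs can be compared numerically with a few lines of arithmetic. This is only a sketch over the figures reported above; the dictionary layout is an illustrative choice:&lt;/p&gt;

```python
# Figures copied from the two EXPLAIN runs summarized above.
bitmap = {"plan": "Bitmap Heap Scan", "time_ms": 11.560, "shared_hit": 3861}
index = {"plan": "Index Scan", "time_ms": 7.794, "shared_hit": 3967}

speedup = bitmap["time_ms"] / index["time_ms"]
time_saved_percent = (1 - index["time_ms"] / bitmap["time_ms"]) * 100
extra_buffers = index["shared_hit"] - bitmap["shared_hit"]

print(f"Speedup: {speedup:.2f}x")
print(f"Time saved: {time_saved_percent:.1f}%")
print(f"Extra buffer hits with Index Scan: {extra_buffers}")
```

&lt;p&gt;Note that the Index Scan touches slightly more buffers yet finishes faster. Since every page was served from the cache, the difference here comes from how the plan is executed rather than from disk I/O.&lt;/p&gt;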

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Thank you for joining me on this deep dive into performance optimization. My goal was not just to share a few tricks, but to demonstrate a methodology of forming hypotheses, measuring, and drawing evidence-based conclusions. I hope this journey was as insightful for you to read as it was for me to conduct.&lt;/p&gt;

&lt;p&gt;All the code, benchmarks, and artifacts discussed in this article are part of the open-source &lt;strong&gt;IronTrack&lt;/strong&gt; project. I invite you to explore the repository, check out the implementation details, and perhaps even run the benchmarks yourself.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://github.com/bizoxe/iron-track" rel="noopener noreferrer"&gt;► Explore the IronTrack Project on GitHub&lt;/a&gt;
&lt;/h3&gt;

</description>
      <category>fastapi</category>
      <category>performance</category>
      <category>backend</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
