<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Douglas Carmo</title>
    <description>The latest articles on DEV Community by Douglas Carmo (@douglas_carmo_cd84c5548f2).</description>
    <link>https://dev.to/douglas_carmo_cd84c5548f2</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3785349%2F44526b73-c6f8-4b5d-b177-25538207b999.png</url>
      <title>DEV Community: Douglas Carmo</title>
      <link>https://dev.to/douglas_carmo_cd84c5548f2</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/douglas_carmo_cd84c5548f2"/>
    <language>en</language>
    <item>
      <title>Don't Lock Your API - Lock Your Scheduler Instead</title>
      <dc:creator>Douglas Carmo</dc:creator>
      <pubDate>Tue, 23 Jun 2026 18:09:23 +0000</pubDate>
      <link>https://dev.to/douglas_carmo_cd84c5548f2/dont-lock-your-api-lock-your-scheduler-instead-2i15</link>
      <guid>https://dev.to/douglas_carmo_cd84c5548f2/dont-lock-your-api-lock-your-scheduler-instead-2i15</guid>
      <description>&lt;p&gt;&lt;em&gt;How inverting responsibility eliminated distributed lock contention from a high-throughput settlement pipeline&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;When building the ingestion layer of a real-time interbank settlement pipeline, I faced a classic distributed systems problem: how do you guarantee that all orders belonging to the same settlement window are grouped together before being published to Kafka, without killing your API throughput?&lt;/p&gt;

&lt;p&gt;The naive answer involves a distributed lock on the hot path. I went a different direction.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;The system receives settlement orders via a REST API built with Spring Boot 3.5 and Java 21. Orders must be grouped by &lt;strong&gt;settlement window&lt;/strong&gt; (a time-based boundary, e.g., 17:00) before being published as a batch to Kafka. The consumer downstream expects a complete, immutable batch, not a partial one.&lt;/p&gt;

&lt;p&gt;The constraint: orders can arrive at any rate. The window must close atomically. No order should be published before the window closes, and no order from a closed window should sneak into the next batch.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Naive Approach and Why It Fails
&lt;/h2&gt;

&lt;p&gt;The first instinct is to coordinate at ingestion time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST /orders → acquire distributed lock → check window → persist → release lock
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works at low volume. At scale, it becomes your bottleneck:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every incoming request competes for the same lock&lt;/li&gt;
&lt;li&gt;Lock contention grows with throughput&lt;/li&gt;
&lt;li&gt;Latency spikes under load, exactly when you need stability most&lt;/li&gt;
&lt;li&gt;Redis or ZooKeeper becomes a single point of failure on the &lt;strong&gt;critical path&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You've introduced coordination overhead into the one place that should be as fast as possible: the API entry point.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Inversion
&lt;/h2&gt;

&lt;p&gt;Instead of coordinating at ingestion, I moved all coordination &lt;strong&gt;out of the hot path entirely&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The architecture splits responsibility across two distinct phases:&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1 — API: Persist and Forget
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@PostMapping&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/orders"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;ResponseEntity&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;receiveOrder&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;@RequestBody&lt;/span&gt; &lt;span class="nc"&gt;SettlementOrderRequest&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;settlementOrderService&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;persist&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;OrderStatus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;PENDING&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ResponseEntity&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;accepted&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API does exactly one thing: persist the order with status &lt;code&gt;PENDING&lt;/code&gt;. No window check. No lock. No coordination. Just a fast write via HikariCP and a 202 back to the client.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 2 — Scheduler: Close, Batch, Publish
&lt;/h3&gt;

&lt;p&gt;The scheduler entry point iterates over all participants. Before the loop, the JPA first-level cache is cleared to avoid stale reads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Scheduled&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cron&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"0 0 17 * * MON-FRI"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@ShedLock&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"settlement-window-close"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lockAtMostFor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"PT10M"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;closeWindowAndPublish&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;windowKey&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;entityManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;clear&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// flush JPA first-level cache before processing&lt;/span&gt;

    &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Participant&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;participants&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;participantPort&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findAll&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getType&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="nc"&gt;ParticipantType&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;BACEN&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toList&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Participant&lt;/span&gt; &lt;span class="n"&gt;participant&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;participants&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;process&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;today&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;participant&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Critical error processing participant [{}]. Skipping."&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;participant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getIspb&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each participant is processed in complete isolation via &lt;code&gt;REQUIRES_NEW&lt;/code&gt;, so a failure in one does not roll back the others:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Transactional&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;propagation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Propagation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;REQUIRES_NEW&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;SettlementWindow&lt;/span&gt; &lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;LocalDate&lt;/span&gt; &lt;span class="n"&gt;today&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Participant&lt;/span&gt; &lt;span class="n"&gt;participant&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isOpen&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clock&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;rejectPendingOrders&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;today&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;participant&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// explicit cutoff rejection&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batchPort&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;existsActiveBatch&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;today&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;participant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getId&lt;/span&gt;&lt;span class="o"&gt;()))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;warn&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Batch already exists for window [{}] participant [{}]. Skipping."&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getPartitioningKey&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;participant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getIspb&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;SettlementOrder&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pending&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orderPort&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findPendingForWindow&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;today&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;participant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getIspb&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pending&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isEmpty&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;SettlementOrder&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;batched&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pending&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nl"&gt;SettlementOrder:&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;toList&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="nc"&gt;FileBatch&lt;/span&gt; &lt;span class="n"&gt;savedBatch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;batchPort&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;save&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FileBatch&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;today&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batched&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;participant&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="n"&gt;orderPort&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;updateStatusBatch&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batched&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;associateWithBatch&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;savedBatch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getId&lt;/span&gt;&lt;span class="o"&gt;())).&lt;/span&gt;&lt;span class="na"&gt;toList&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;

    &lt;span class="c1"&gt;// Kafka is fired only after the transaction commits successfully&lt;/span&gt;
    &lt;span class="nc"&gt;TransactionSynchronizationManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;registerSynchronization&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;TransactionSynchronization&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nd"&gt;@Override&lt;/span&gt;
        &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;afterCommit&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;publisherPort&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;publish&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;savedBatch&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;});&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The scheduler owns the window boundary. It runs once per window, outside the request lifecycle, under zero user-facing load pressure. ShedLock with Redis (&lt;code&gt;SET NX PX&lt;/code&gt;) ensures only one instance executes across the entire ECS/EKS cluster, but this lock fires &lt;strong&gt;once per window&lt;/strong&gt;, not once per request.&lt;/p&gt;
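
&lt;p&gt;To make the once-per-window behaviour concrete, here is a minimal in-memory sketch of the set-if-absent-with-expiry semantics that &lt;code&gt;SET NX PX&lt;/code&gt; provides. It is a toy model, not ShedLock or Redis client code: the class &lt;code&gt;ToyWindowLock&lt;/code&gt; and its method names are hypothetical, and the clock is passed in explicitly so the behaviour is deterministic.&lt;/p&gt;

```java
// Toy model of "SET key value NX PX ttl": acquire only if absent or expired.
// Hypothetical class for illustration; not ShedLock or Redis client code.
final class ToyWindowLock {
    private String holder;          // null means the lock is free
    private long expiresAtMillis;   // absolute expiry of the current holder

    // Returns true only for the first caller within the TTL.
    synchronized boolean tryAcquire(String instanceId, long nowMillis, long ttlMillis) {
        boolean free = holder == null || nowMillis >= expiresAtMillis;
        if (!free) {
            return false; // another instance already owns this window close
        }
        holder = instanceId;
        expiresAtMillis = nowMillis + ttlMillis;
        return true;
    }

    public static void main(String[] args) {
        ToyWindowLock lock = new ToyWindowLock();
        long tenMinutes = 10 * 60 * 1000L; // mirrors lockAtMostFor = "PT10M"
        System.out.println(lock.tryAcquire("instance-a", 0L, tenMinutes));         // true
        System.out.println(lock.tryAcquire("instance-b", 1_000L, tenMinutes));     // false
        System.out.println(lock.tryAcquire("instance-b", tenMinutes, tenMinutes)); // true
    }
}
```

&lt;p&gt;The second caller loses the race without blocking, which is the property that matters here: the losing instances simply skip the run instead of queueing behind the winner.&lt;/p&gt;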

&lt;h3&gt;
  
  
  Four Details That Matter
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. &lt;code&gt;entityManager.clear()&lt;/code&gt; before the loop&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some operations in the batching flow rely on JPQL bulk updates. Because JPQL bulk updates bypass Hibernate's managed entities and go directly to the database, the persistence context may hold stale state after those updates: Hibernate still believes the in-memory copies of those entities are current. Clearing the persistence context before processing ensures that subsequent reads are served from the database rather than from outdated managed entities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. &lt;code&gt;REQUIRES_NEW&lt;/code&gt; per participant&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each participant gets its own isolated transaction. If processing fails for participant A, participants B through N are unaffected and their transactions commit independently. Without this, a single failure rolls back the entire batch cycle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. &lt;code&gt;afterCommit()&lt;/code&gt; before publishing to Kafka&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is a correctness guarantee, not a performance optimization. Publishing inside the transaction would risk sending a Kafka message for a batch that never actually committed to the database: a phantom message that consumers would try to process against data that doesn't exist. &lt;code&gt;afterCommit()&lt;/code&gt; ensures the database transaction has committed before Kafka ever sees the event.&lt;/p&gt;
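
&lt;p&gt;The ordering guarantee can be sketched with a toy transaction that runs its callbacks only on commit. This is an illustrative model under simplifying assumptions, not Spring's implementation; &lt;code&gt;ToyTransaction&lt;/code&gt; and its methods are hypothetical names.&lt;/p&gt;

```java
// Toy model of "run this only after a successful commit".
// Hypothetical class for illustration; Spring's TransactionSynchronization
// is the real mechanism used in the article's code.
final class ToyTransaction {
    private Runnable afterCommit = () -> { };

    // Queue a callback; callbacks run in registration order on commit.
    void registerAfterCommit(Runnable callback) {
        Runnable previous = afterCommit;
        afterCommit = () -> { previous.run(); callback.run(); };
    }

    // Commit succeeded: now it is safe to notify the outside world.
    void commit() {
        Runnable toRun = afterCommit;
        afterCommit = () -> { };
        toRun.run();
    }

    // Rollback: queued callbacks are discarded and never run.
    void rollback() {
        afterCommit = () -> { };
    }

    public static void main(String[] args) {
        StringBuilder published = new StringBuilder();

        ToyTransaction failed = new ToyTransaction();
        failed.registerAfterCommit(() -> published.append("batch-1;"));
        failed.rollback(); // nothing is published for the failed batch

        ToyTransaction ok = new ToyTransaction();
        ok.registerAfterCommit(() -> published.append("batch-2;"));
        ok.commit();

        System.out.println(published); // prints "batch-2;"
    }
}
```

&lt;p&gt;One consequence of this ordering: a failure inside the callback cannot roll the database back, so a publish that fails after commit needs its own retry or reconciliation path.&lt;/p&gt;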

&lt;p&gt;&lt;strong&gt;4. Explicit &lt;code&gt;rejectPendingOrders()&lt;/code&gt; for late orders&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Orders that arrive after the window closes are not silently ignored; they are explicitly marked as &lt;code&gt;REJECTED_CUTOFF&lt;/code&gt;. This makes the system auditable: at any point you can query exactly which orders missed which window and why.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Works: The Key Insight
&lt;/h2&gt;

&lt;p&gt;The distributed lock didn't disappear; it moved.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concern&lt;/th&gt;
&lt;th&gt;Naive Approach&lt;/th&gt;
&lt;th&gt;Inverted Approach&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lock location&lt;/td&gt;
&lt;td&gt;API hot path&lt;/td&gt;
&lt;td&gt;Scheduler (cold path)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lock frequency&lt;/td&gt;
&lt;td&gt;Every request&lt;/td&gt;
&lt;td&gt;Once per window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lock contention&lt;/td&gt;
&lt;td&gt;High under load&lt;/td&gt;
&lt;td&gt;Zero&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API latency&lt;/td&gt;
&lt;td&gt;Unpredictable&lt;/td&gt;
&lt;td&gt;Consistent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Throughput impact&lt;/td&gt;
&lt;td&gt;Degrades with volume&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;ShedLock on a scheduler that fires once per settlement window has &lt;strong&gt;negligible throughput impact&lt;/strong&gt;. A distributed lock on an API endpoint receiving thousands of requests per minute does not.&lt;/p&gt;




&lt;h2&gt;
  
  
  Kafka Partitioning by Window
&lt;/h2&gt;

&lt;p&gt;Once the batch reaches Kafka, ordering guarantees are maintained by partitioning on the window key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;kafkaTemplate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;send&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"str.batch.emission"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;windowKey&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// windowKey: "STR-D0-17h00", "STR-D1-17h00", etc.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



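&lt;p&gt;Every batch published with the same window key is routed to the same partition, and Kafka preserves order within a partition. This guarantees temporal ordering for each window &lt;strong&gt;without any additional locking mechanism&lt;/strong&gt;. Kafka's own partition semantics do the work. Note that distinct window keys can still hash to the same partition, so a partition is not necessarily dedicated to a single window.&lt;/p&gt;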
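
&lt;p&gt;Key-based routing can be sketched as "hash the key, take it modulo the partition count". The example below is a simplified stand-in: Kafka's default partitioner applies murmur2 to the serialized key, whereas this sketch uses &lt;code&gt;Arrays.hashCode&lt;/code&gt; to stay self-contained. &lt;code&gt;ToyPartitioner&lt;/code&gt; and the partition count of 6 are assumptions for illustration.&lt;/p&gt;

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Simplified stand-in for key-based partitioning: hash(key) mod partitions.
// Kafka's default partitioner uses murmur2 over the serialized key; the
// Arrays.hashCode call below is an assumption made to keep this self-contained.
final class ToyPartitioner {
    static int partitionFor(String key, int partitionCount) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return Math.floorMod(Arrays.hashCode(keyBytes), partitionCount);
    }

    public static void main(String[] args) {
        int partitions = 6; // hypothetical topic size
        // The same window key always yields the same partition.
        System.out.println(partitionFor("STR-D0-17h00", partitions));
        System.out.println(partitionFor("STR-D0-17h00", partitions));
        // A different key may or may not share that partition.
        System.out.println(partitionFor("STR-D1-17h00", partitions));
    }
}
```

&lt;p&gt;Because the result depends on the partition count, the mapping stays stable only while the topic keeps the same number of partitions.&lt;/p&gt;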




&lt;h2&gt;
  
  
  Status Flow
&lt;/h2&gt;

&lt;p&gt;The order lifecycle makes the phase separation explicit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PENDING  →  BATCHED  →  EMITTED
  (API)    (Scheduler)  (Consumer)
             ↓
       REJECTED_CUTOFF
      (window closed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each transition is unambiguous. At any point you can query exactly where in the pipeline an order is and why it got there.&lt;/p&gt;
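
&lt;p&gt;One way to keep those transitions unambiguous in code is an explicit transition check. The sketch below is an assumption read off the diagram above (pending orders are either batched or rejected at cutoff), not the project's actual enum; &lt;code&gt;ToyOrderStatus&lt;/code&gt; and &lt;code&gt;canTransitionTo&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```java
// Sketch of the order lifecycle as an explicit transition check.
// The transition rules are an assumption read off the status diagram.
enum ToyOrderStatus {
    PENDING, BATCHED, EMITTED, REJECTED_CUTOFF;

    boolean canTransitionTo(ToyOrderStatus next) {
        if (this == PENDING) {
            return next == BATCHED || next == REJECTED_CUTOFF;
        }
        if (this == BATCHED) {
            return next == EMITTED;
        }
        return false; // EMITTED and REJECTED_CUTOFF are terminal
    }

    public static void main(String[] args) {
        System.out.println(PENDING.canTransitionTo(BATCHED));  // true
        System.out.println(BATCHED.canTransitionTo(PENDING));  // false
        System.out.println(EMITTED.canTransitionTo(BATCHED));  // false
    }
}
```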




&lt;h2&gt;
  
  
  What About Late Orders?
&lt;/h2&gt;

&lt;p&gt;Orders that arrive after the scheduler closes the window are not silently dropped or deferred; they are explicitly rejected with status &lt;code&gt;REJECTED_CUTOFF&lt;/code&gt;. This is intentional on two levels: the STR protocol requires complete, bounded batches, and a silent failure is always worse than an explicit one. The system knows exactly which orders missed which window, and so does the operator.&lt;/p&gt;




&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;If you find yourself reaching for a distributed lock on an API hot path to coordinate time-based grouping, ask whether the coordination can be deferred to a scheduled process instead.&lt;/p&gt;

&lt;p&gt;Moving the lock from the hot path to the cold path cost nothing in correctness and eliminated the primary throughput bottleneck. The API became a pure write endpoint. The scheduler became the sole owner of window semantics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lock the thing that controls the boundary. Not the thing that feeds it.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article is part of a series on the STR-XML-Pipeline, a high-throughput interbank settlement system built with Spring Boot 3.5, Java 21, Apache Kafka, PostgreSQL 16, Redis 7, and AWS S3 &amp;amp; Fargate.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>java</category>
      <category>springboot</category>
      <category>kafka</category>
      <category>postgres</category>
    </item>
    <item>
      <title>When NASDAQ Freezes: Chaos Engineering a Stock Quotes API with Java and ToxiProxy</title>
      <dc:creator>Douglas Carmo</dc:creator>
      <pubDate>Sun, 22 Feb 2026 20:16:56 +0000</pubDate>
      <link>https://dev.to/douglas_carmo_cd84c5548f2/when-nasdaq-freezes-chaos-engineering-a-stock-quotes-api-with-java-and-toxiproxy-3nmo</link>
      <guid>https://dev.to/douglas_carmo_cd84c5548f2/when-nasdaq-freezes-chaos-engineering-a-stock-quotes-api-with-java-and-toxiproxy-3nmo</guid>
      <description>&lt;p&gt;I wanted to understand what really happens to a distributed system when things go wrong. So I broke it on purpose.&lt;/p&gt;

&lt;p&gt;This article walks through a chaos engineering experiment I built around a real-time stock quotes API for NASDAQ-listed companies. The stack involves Java 21, Spring Boot 4, Redis, PostgreSQL, Resilience4j and ToxiProxy — and the results were more interesting than I expected.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is Chaos Engineering?
&lt;/h2&gt;

&lt;p&gt;Chaos Engineering is the practice of intentionally injecting failures into a system to observe how it behaves under stress. The idea, popularized by Netflix with their Chaos Monkey tool, is simple: if failures are going to happen in production anyway, it's better to discover how your system reacts in a controlled experiment than during a real outage.&lt;/p&gt;

&lt;p&gt;In a financial context like stock quotes, the stakes are clear. Imagine a system that feeds real-time prices to traders during market open at 9:30 AM EST — and suddenly Redis starts responding in 2 seconds instead of 6 milliseconds. What happens? Does the system degrade gracefully, or does it collapse?&lt;/p&gt;

&lt;p&gt;That's what I set out to find out.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The project simulates a production-like environment where every external dependency passes through ToxiProxy — a programmable network proxy that can inject latency, bandwidth limits, connection resets and more.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client
  └─► Nginx (port 8000 / 20000)
        └─► Backend (Spring Boot :8080)
              ├─► ToxiProxy :20001 ──► PostgreSQL :5432
              ├─► ToxiProxy :20002 ──► Redis :6379
              └─► ToxiProxy :20003 ──► finnhub.io:443
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key detail: the backend never connects directly to PostgreSQL, Redis or Finnhub. Every call passes through ToxiProxy, so we can degrade any dependency at any time without touching application code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tech stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Java 21 + Spring Boot 4&lt;/li&gt;
&lt;li&gt;PostgreSQL 17 (persistence)&lt;/li&gt;
&lt;li&gt;Redis 7 (cache via Spring Cache + Lettuce)&lt;/li&gt;
&lt;li&gt;Resilience4j (Circuit Breaker + Retry)&lt;/li&gt;
&lt;li&gt;ToxiProxy (chaos injection)&lt;/li&gt;
&lt;li&gt;Nginx (reverse proxy + SSL termination for Finnhub)&lt;/li&gt;
&lt;li&gt;Finnhub API (real market data)&lt;/li&gt;
&lt;li&gt;Docker + Docker Compose&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How the Cache Works
&lt;/h2&gt;

&lt;p&gt;The application fetches stock quotes from Finnhub, saves them to PostgreSQL and caches them in Redis using &lt;code&gt;@Cacheable&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt; &lt;span class="nd"&gt;@Cacheable&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"quotes"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"#symbol"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;StockQuote&lt;/span&gt; &lt;span class="nf"&gt;getQuote&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Cache MISS for symbol: {}"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="nc"&gt;FinnhubQuoteResponse&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;finnhubClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fetchQuote&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Objects&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;equals&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;currentPrice&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="nc"&gt;BigDecimal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ZERO&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;AssetNotFoundException&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Asset not found for symbol: "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;

        &lt;span class="nc"&gt;StockQuote&lt;/span&gt; &lt;span class="n"&gt;quote&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stockQuoteMapper&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toEntity&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;stockQuoteRepository&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;save&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quote&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
 &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The flow is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1st call (cache miss):&lt;/strong&gt; Finnhub → PostgreSQL → Redis → response&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2nd call (cache hit):&lt;/strong&gt; Redis → response (PostgreSQL and Finnhub are never touched)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A background scheduler also refreshes all quotes every 60 seconds, so the cache stays warm automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Circuit Breaker
&lt;/h2&gt;

&lt;p&gt;The Finnhub client is wrapped with Resilience4j annotations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@CircuitBreaker&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"financialApi"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fallbackMethod&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"fetchQuoteFallback"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@Retry&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"financialApi"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;FinnhubQuoteResponse&lt;/span&gt; &lt;span class="nf"&gt;fetchQuote&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;finnhubWebClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;uri&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/quote"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;queryParam&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"symbol"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;queryParam&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"token"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;token&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;retrieve&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;bodyToMono&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;FinnhubQuoteResponse&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Duration&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ofSeconds&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;block&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;FinnhubQuoteResponse&lt;/span&gt; &lt;span class="nf"&gt;fetchQuoteFallback&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Fallback triggered for symbol: {} - reason: {}"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getMessage&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;FinnhubQuoteResponse&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;empty&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Circuit Breaker is configured to open when 50% of calls fail or are slow (above 3 seconds), evaluated over a sliding window of 5 calls once at least 3 calls have been recorded:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;resilience4j&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;circuitbreaker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;registerHealthIndicator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;slidingWindowSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
        &lt;span class="na"&gt;minimumNumberOfCalls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
        &lt;span class="na"&gt;permittedNumberOfCallsInHalfOpenState&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
        &lt;span class="na"&gt;automaticTransitionFromOpenToHalfOpenEnabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;waitDurationInOpenState&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
        &lt;span class="na"&gt;slowCallRateThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;
        &lt;span class="na"&gt;failureRateThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;
        &lt;span class="na"&gt;eventConsumerBufferSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;instances&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;financialApi&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;baseConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
        &lt;span class="na"&gt;waitDurationInOpenState&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;60s&lt;/span&gt;
        &lt;span class="na"&gt;recordExceptions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;org.springframework.web.client.HttpServerErrorException&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;java.util.concurrent.TimeoutException&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;java.io.IOException&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
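
&lt;p&gt;To make these thresholds concrete, here is a minimal sketch of a count-based sliding window using the values from the YAML above (window of 5, minimum of 3 calls, 50% slow-call threshold). It is a simplified model written for illustration, not Resilience4j's actual implementation, and the class and method names are hypothetical:&lt;/p&gt;

```java
// Simplified model of a count-based sliding window (illustration only,
// not Resilience4j's real implementation). All names are hypothetical.
class SlidingWindowSketch {
    private final boolean[] slow;          // ring buffer: true = call was slow
    private final int minimumCalls;
    private final double thresholdPercent;
    private int recorded;                  // number of calls currently in the window
    private int next;                      // next ring-buffer slot to overwrite

    SlidingWindowSketch(int windowSize, int minimumCalls, double thresholdPercent) {
        this.slow = new boolean[windowSize];
        this.minimumCalls = minimumCalls;
        this.thresholdPercent = thresholdPercent;
    }

    // Records one call outcome, overwriting the oldest once the window is full.
    void record(boolean wasSlow) {
        slow[next] = wasSlow;
        next = (next + 1) % slow.length;
        recorded = Math.min(recorded + 1, slow.length);
    }

    double slowCallRatePercent() {
        if (recorded == 0) {
            return 0.0;
        }
        int slowCalls = 0;
        for (boolean s : slow) {           // slots never written stay false
            if (s) {
                slowCalls++;
            }
        }
        return 100.0 * slowCalls / recorded;
    }

    // The breaker may only open after minimumCalls outcomes have been recorded.
    boolean shouldOpen() {
        if (minimumCalls > recorded) {
            return false;
        }
        return slowCallRatePercent() >= thresholdPercent;
    }

    public static void main(String[] args) {
        SlidingWindowSketch window = new SlidingWindowSketch(5, 3, 50.0);
        window.record(true);
        window.record(true);
        System.out.println(window.shouldOpen()); // false: only 2 of the 3 required calls
        window.record(true);
        System.out.println(window.shouldOpen()); // true: 3 calls recorded, 100% slow
    }
}
```

&lt;p&gt;In this model, three slow calls in a row are enough to trip the breaker, because the minimum of 3 calls is reached well before the window of 5 fills up.&lt;/p&gt;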






&lt;h2&gt;
  
  
  Chaos Scripts
&lt;/h2&gt;

&lt;p&gt;Three scripts control the experiment:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;setup-toxiproxy.sh&lt;/code&gt;&lt;/strong&gt; — creates the proxies, no chaos yet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Configuring Toxiproxy proxies..."&lt;/span&gt;

curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8474/proxies &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"name":"postgres_proxy","listen":"0.0.0.0:20001","upstream":"stock-quotes-postgres:5432","enabled":true}'&lt;/span&gt;

curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8474/proxies &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"name":"redis_proxy","listen":"0.0.0.0:20002","upstream":"stock-quotes-redis:6379","enabled":true}'&lt;/span&gt;


curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8474/proxies &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "name":"finnhub_proxy",
    "listen":"0.0.0.0:20003",
    "upstream":"finnhub.io:443",
    "enabled":true
  }'&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Toxiproxy configuration complete."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;inject-chaos.sh&lt;/code&gt;&lt;/strong&gt; — injects 2000ms ±500ms latency on all three proxies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt;

&lt;span class="nv"&gt;TOXIPROXY_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:8474"&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Injecting chaos: 2000ms latency (±500ms jitter) to postgres_proxy, redis_proxy and finnhub_proxy..."&lt;/span&gt;

curl &lt;span class="nt"&gt;--fail&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TOXIPROXY_URL&lt;/span&gt;&lt;span class="s2"&gt;/proxies/postgres_proxy/toxics"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"name":"latency_downstream","type":"latency","stream":"downstream","attributes":{"latency":2000,"jitter":500}}'&lt;/span&gt;

curl &lt;span class="nt"&gt;--fail&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TOXIPROXY_URL&lt;/span&gt;&lt;span class="s2"&gt;/proxies/redis_proxy/toxics"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"name":"latency_downstream","type":"latency","stream":"downstream","attributes":{"latency":2000,"jitter":500}}'&lt;/span&gt;

curl &lt;span class="nt"&gt;--fail&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TOXIPROXY_URL&lt;/span&gt;&lt;span class="s2"&gt;/proxies/finnhub_proxy/toxics"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"name":"latency_downstream","type":"latency","stream":"downstream","attributes":{"latency":2000,"jitter":500}}'&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Chaos injected. Monitor your Resilience4j Circuit Breaker status."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;remove-chaos.sh&lt;/code&gt;&lt;/strong&gt; — removes all toxics, system recovers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt;

&lt;span class="nv"&gt;TOXIPROXY_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:8474"&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Removing chaos from postgres_proxy, redis_proxy and finnhub_proxy..."&lt;/span&gt;

curl &lt;span class="nt"&gt;--fail&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; DELETE &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TOXIPROXY_URL&lt;/span&gt;&lt;span class="s2"&gt;/proxies/postgres_proxy/toxics/latency_downstream"&lt;/span&gt;

curl &lt;span class="nt"&gt;--fail&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; DELETE &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TOXIPROXY_URL&lt;/span&gt;&lt;span class="s2"&gt;/proxies/redis_proxy/toxics/latency_downstream"&lt;/span&gt;

curl &lt;span class="nt"&gt;--fail&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; DELETE &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TOXIPROXY_URL&lt;/span&gt;&lt;span class="s2"&gt;/proxies/finnhub_proxy/toxics/latency_downstream"&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Chaos removed. System should be recovering — watch the Circuit Breaker close."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Experiment Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Baseline — No Chaos
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# First call (cold cache)&lt;/span&gt;
curl &lt;span class="nt"&gt;-w&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Total time: %{time_total}s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; http://localhost:8080/api/v1/stock-quotes/get-by-symbol/AAPL
&lt;span class="c"&gt;# Total time: 0.029s&lt;/span&gt;

&lt;span class="c"&gt;# Second call (Redis cache hit)&lt;/span&gt;
curl &lt;span class="nt"&gt;-w&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Total time: %{time_total}s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; http://localhost:8080/api/v1/stock-quotes/get-by-symbol/AAPL
&lt;span class="c"&gt;# Total time: 0.006s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Redis responding in 6ms is the happy path. The PostgreSQL metrics point the same way: the database itself responds in under 2ms, so most of the remaining overhead comes from the ToxiProxy layer, even with no chaos injected.&lt;/p&gt;

&lt;h3&gt;
  
  
  With Chaos — 2000ms ±500ms on All Services
&lt;/h3&gt;

&lt;p&gt;Here's where things get interesting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./scripts/inject-chaos.sh

&lt;span class="c"&gt;# First call&lt;/span&gt;
curl &lt;span class="nt"&gt;-w&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Total time: %{time_total}s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; http://localhost:8080/api/v1/stock-quotes/get-by-symbol/AAPL
&lt;span class="c"&gt;# Total time: 9.4s&lt;/span&gt;

&lt;span class="c"&gt;# Second call (expected Redis hit...)&lt;/span&gt;
curl &lt;span class="nt"&gt;-w&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Total time: %{time_total}s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; http://localhost:8080/api/v1/stock-quotes/get-by-symbol/AAPL
&lt;span class="c"&gt;# Total time: 10.8s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The second call was slower than the first.&lt;/strong&gt; That was unexpected.&lt;/p&gt;

&lt;p&gt;The reason: with 2000ms of latency on the Redis proxy, the write to Redis after the first call was delayed so much that by the time the second request arrived, the cache entry hadn't been written yet. Both calls went all the way to Finnhub and PostgreSQL — each adding their own 2000ms latency.&lt;/p&gt;

&lt;p&gt;This is the invisible failure mode: &lt;strong&gt;Redis latency doesn't just slow down cache reads, it breaks the cache population itself.&lt;/strong&gt; The system continued to function, but silently lost all caching benefits. Every request was hitting the full stack.&lt;/p&gt;
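
&lt;p&gt;The timing behind this explanation can be modeled with plain arithmetic. The sketch below is a deliberately simplified model, not code from the project: it assumes a cache entry only becomes readable once the delayed write has landed, and every timestamp in it is a made-up illustrative value.&lt;/p&gt;

```java
// Simplified timing model (illustration only; all timestamps are hypothetical).
class CachePopulationTiming {

    // A cache write issued at writeIssuedAtMs becomes visible only after the
    // cache latency has elapsed.
    static long visibleAtMs(long writeIssuedAtMs, long cacheLatencyMs) {
        return writeIssuedAtMs + cacheLatencyMs;
    }

    // A read is a hit only if it happens at or after the entry became visible.
    static boolean isHit(long readAtMs, long visibleAtMs) {
        return readAtMs >= visibleAtMs;
    }

    public static void main(String[] args) {
        long writeIssuedAt = 100;  // first request issues the cache write
        long secondReadAt = 600;   // second request checks the cache shortly after

        // Healthy cache: a few milliseconds of latency, the entry is there in time.
        System.out.println(isHit(secondReadAt, visibleAtMs(writeIssuedAt, 5)));    // true

        // 2000ms injected latency: the entry lands at 2100ms, so the read at 600ms misses.
        System.out.println(isHit(secondReadAt, visibleAtMs(writeIssuedAt, 2000))); // false
    }
}
```

&lt;p&gt;Under this model a miss costs more than one slow read: the request falls through to the full stack and pays the injected latency on every hop.&lt;/p&gt;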

&lt;h3&gt;
  
  
  Real Latency Numbers per Asset
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fro0kphojj64by87oilcc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fro0kphojj64by87oilcc.png" alt="Chaos Engineering — Latency per Asset" width="800" height="453"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Figure 1:&lt;/strong&gt; &lt;em&gt;Comparison between healthy Redis hits (green) and system performance under 2000ms ±500ms injected latency (red). Note that even a cold cache miss under normal conditions takes only 29ms, proving that chaos doesn't just slow down the system—it fundamentally breaks the efficiency of the architecture.&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Asset&lt;/th&gt;
&lt;th&gt;No Chaos (Redis hit)&lt;/th&gt;
&lt;th&gt;With Chaos&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AAPL&lt;/td&gt;
&lt;td&gt;5.9ms&lt;/td&gt;
&lt;td&gt;1823.9ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GOOGL&lt;/td&gt;
&lt;td&gt;6.7ms&lt;/td&gt;
&lt;td&gt;1870.6ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NVDA&lt;/td&gt;
&lt;td&gt;7.4ms&lt;/td&gt;
&lt;td&gt;2279.5ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AMZN&lt;/td&gt;
&lt;td&gt;7.2ms&lt;/td&gt;
&lt;td&gt;2295.1ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;META&lt;/td&gt;
&lt;td&gt;6.4ms&lt;/td&gt;
&lt;td&gt;1522.2ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MSFT&lt;/td&gt;
&lt;td&gt;6.0ms&lt;/td&gt;
&lt;td&gt;2445.1ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AMD&lt;/td&gt;
&lt;td&gt;5.6ms&lt;/td&gt;
&lt;td&gt;2180.9ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;On average, a healthy Redis hit (about 6.5ms) is roughly &lt;strong&gt;300x faster&lt;/strong&gt; than the same request under chaos (about 2,060ms).&lt;/p&gt;
&lt;/blockquote&gt;
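
&lt;p&gt;The ratio can be checked directly against the table. The snippet below is a small sanity check with the seven measurements copied from the table above; the class name is hypothetical.&lt;/p&gt;

```java
// Recomputes the speed-up from the latency table (all values in milliseconds).
class LatencyRatio {

    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        // Order: AAPL, GOOGL, NVDA, AMZN, META, MSFT, AMD
        double[] redisHit = {5.9, 6.7, 7.4, 7.2, 6.4, 6.0, 5.6};
        double[] withChaos = {1823.9, 1870.6, 2279.5, 2295.1, 1522.2, 2445.1, 2180.9};

        double healthy = mean(redisHit);
        double degraded = mean(withChaos);

        System.out.printf("healthy mean:  %.1f ms%n", healthy);
        System.out.printf("degraded mean: %.1f ms%n", degraded);
        System.out.printf("ratio:         %.0fx%n", degraded / healthy);
    }
}
```

&lt;p&gt;The means come out at about 6.5ms and 2,060ms, a ratio of roughly 320x. Every chaos measurement also falls inside the 1500ms to 2500ms range that a 2000ms ±500ms toxic would add on a single hop.&lt;/p&gt;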
&lt;h3&gt;
  
  
  Circuit Breaker Opens
&lt;/h3&gt;

&lt;p&gt;After enough slow calls, the Circuit Breaker opened automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ERROR: Fallback triggered for symbol: INTC - reason: CircuitBreaker 'financialApi' is OPEN
WARN:  Financial API Circuit Breaker state changed: OPEN -&amp;gt; HALF_OPEN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Querying the actuator endpoint while the breaker is open shows the slow-call rate that tripped it:&lt;/p&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8080/actuator/circuitbreakers | jq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"circuitBreakers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"financialApi"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"bufferedCalls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"failedCalls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"failureRate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0.0%"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"failureRateThreshold"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"50.0%"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"notPermittedCalls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"slowCallRate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"100.0%"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"slowCallRateThreshold"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"50.0%"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"slowCalls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"slowFailedCalls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"state"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"OPEN"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Recovery
&lt;/h3&gt;

&lt;p&gt;After removing chaos:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./scripts/remove-chaos.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Circuit Breaker transitioned automatically: &lt;strong&gt;OPEN → HALF_OPEN → CLOSED&lt;/strong&gt;. The system recovered without any manual intervention.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Redis latency silently breaks your cache.&lt;/strong&gt; This is the most important finding. The system didn't throw errors — it just stopped caching, and every request paid the full cost. Without observability, this would be invisible in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Circuit Breakers are essential, but need correct wiring.&lt;/strong&gt; Getting the &lt;code&gt;@CircuitBreaker&lt;/code&gt; annotation to work required: creating an interface for the client class (so Spring AOP could create a proxy), aligning the instance name across the annotation, YAML config and registry, and using the right exception types. The annotation is simple — the configuration is not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. ToxiProxy gives you a realistic baseline before chaos.&lt;/strong&gt; Even with zero toxics configured, every call still crosses an extra TCP hop through ToxiProxy, so the baseline numbers above (29ms cold, 6ms on a Redis hit) already include proxy overhead. This is your real baseline, not localhost-to-localhost speed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The scheduler is also affected.&lt;/strong&gt; The background job that refreshes quotes every 60 seconds suffered the same latency — meaning cached data would eventually become stale even if the circuit breaker protected the API endpoint.&lt;/p&gt;




&lt;h2&gt;
  
  
  Running the Project
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Doug16Yanc/stock-quotes.git
&lt;span class="nb"&gt;cd &lt;/span&gt;stock-quotes
&lt;span class="nb"&gt;cp&lt;/span&gt; .env-example .env
&lt;span class="c"&gt;# fill in your Finnhub API key and credentials&lt;/span&gt;

&lt;span class="c"&gt;# start infrastructure&lt;/span&gt;
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt; postgres redis toxiproxy

&lt;span class="c"&gt;# configure proxies&lt;/span&gt;
./scripts/setup-toxiproxy.sh

&lt;span class="c"&gt;# start application&lt;/span&gt;
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt; backend nginx

&lt;span class="c"&gt;# test&lt;/span&gt;
curl http://localhost:8080/api/v1/stock-quotes/get-by-symbol/AAPL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What's Next? (Hardening the Resilience)
&lt;/h2&gt;

&lt;p&gt;Now that we’ve proven how the system behaves under stress and confirmed that the PostgreSQL fallback keeps the lights on, the next steps to achieve true high availability are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Stale-While-Revalidate Pattern: Implement a strategy where Redis serves "stale" (slightly outdated) data instantly while a background thread fetches the update, eliminating user-facing latency during cache refreshes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bulkhead Isolation: Isolate resource pools (threads and connections) to ensure that a catastrophic slowdown in the Finnhub API doesn't exhaust the backend's thread pool, which could otherwise starve unrelated endpoints.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deep Observability: Integrate Micrometer, Prometheus, and Grafana to visualize Circuit Breaker state transitions in real-time and create alerts based on "Anomalous Cache Miss Rates."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stress &amp;amp; Load Testing: Use k6 to simulate a massive volume of concurrent requests during chaos injection to observe the "Thundering Herd" effect—where multiple requests try to rebuild the cache simultaneously.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
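
&lt;p&gt;As one illustration of the list above, bulkhead isolation can be sketched with a plain &lt;code&gt;java.util.concurrent.Semaphore&lt;/code&gt;. This is a minimal, hypothetical sketch rather than the project's code or Resilience4j's own bulkhead module; the &lt;code&gt;RemoteCall&lt;/code&gt; interface, the permit count and the fallback value are assumptions made for the example.&lt;/p&gt;

```java
import java.util.concurrent.Semaphore;

// Minimal semaphore bulkhead (illustration only; all names are hypothetical).
class SemaphoreBulkhead {

    // Stand-in for a call to a slow external API.
    interface RemoteCall {
        String run();
    }

    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    // Runs the call if a permit is free; otherwise returns the fallback at once
    // instead of queueing behind the slow dependency.
    String execute(RemoteCall call, String fallback) {
        if (!permits.tryAcquire()) {
            return fallback;
        }
        try {
            return call.run();
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) {
        SemaphoreBulkhead bulkhead = new SemaphoreBulkhead(1);

        // The outer call holds the only permit, so the nested call is rejected.
        String result = bulkhead.execute(
                () -> "outer saw: " + bulkhead.execute(() -> "live quote", "rejected"),
                "rejected");
        System.out.println(result); // outer saw: rejected

        // Once the permit has been released, calls go through again.
        System.out.println(bulkhead.execute(() -> "live quote", "rejected")); // live quote
    }
}
```

&lt;p&gt;Rejecting immediately keeps request threads free for endpoints that do not depend on the slow API, which is the isolation the bulkhead item describes.&lt;/p&gt;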

&lt;p&gt;If you found this useful or have questions, drop a comment. The full source code is on GitHub: &lt;a href="https://github.com/Doug16Yanc/stock-quotes" rel="noopener noreferrer"&gt;Doug16Yanc/stock-quotes&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Breaking things on purpose is the best way to learn how they work.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>java</category>
      <category>springboot</category>
      <category>docker</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
