<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Venkat Raman</title>
    <description>The latest articles on DEV Community by Venkat Raman (@venkat2811).</description>
    <link>https://dev.to/venkat2811</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F971654%2F0803cead-b30a-4206-b8f3-8f648f2f848f.jpg</url>
      <title>DEV Community: Venkat Raman</title>
      <link>https://dev.to/venkat2811</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/venkat2811"/>
    <language>en</language>
    <item>
      <title>The power of Mechanical Sympathy in Software Engineering</title>
      <dc:creator>Venkat Raman</dc:creator>
      <pubDate>Thu, 18 Apr 2024 18:34:00 +0000</pubDate>
      <link>https://dev.to/venkat2811/the-power-of-mechanical-sympathy-in-software-engineering-4ng0</link>
      <guid>https://dev.to/venkat2811/the-power-of-mechanical-sympathy-in-software-engineering-4ng0</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxbrt7v9hr3yy6iffurk0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxbrt7v9hr3yy6iffurk0.png" alt="Image description" width="800" height="570"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;Modern programming languages, compilers, and frameworks abstract away underlying complexities and details, allowing developers to focus on building systems and applications that solve business problems. This design enables engineers to specialize and build expertise in specific layers, pushing boundaries. However, when tasked with solving problems that stretch hardware capabilities to the maximum, understanding the underlying architecture and its complexities becomes crucial. Novel software paradigms that dramatically increase system performance, with real-world implications, arise from such scenarios.&lt;/p&gt;

&lt;p&gt;Flash Attention is one such algorithm that made huge waves in the NLP community, especially around the Transformer architecture. I first encountered Flash Attention in 2022, when it dramatically improved inference speeds of Stable Diffusion models for image generation. Recently revisiting the paper reminded me of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The 'Locality of Reference' principle from my Computer Architecture class at university.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;'LMAX Disruptor', the underlying library used in my GSoC project.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this post, we'll explore these concepts and appreciate how having mechanical sympathy makes us better engineers. To &lt;a href="https://martinfowler.com/articles/lmax.html"&gt;quote&lt;/a&gt; Martin Fowler, &lt;em&gt;"The phrase Martin Thompson likes to use is 'mechanical sympathy.' The term comes from race car driving and reflects the driver having an innate feel for the car, enabling them to get the best out of it."&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Locality of Reference
&lt;/h1&gt;

&lt;p&gt;Locality Of Reference (LOR) is a principle in computer architecture that refers to the tendency of programs to access data and instructions that are close to each other in memory. As we saw in a previous blog &lt;a href="https://venkat.eu/cpu-gpu-the-basics-and-high-level-overview#heading-core-in-a-modern-cpu"&gt;post&lt;/a&gt;, CPU &amp;amp; GPU cores make use of registers and layers of caches for faster data access &amp;amp; processing. Here are the key LOR types that processors exploit for better performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Temporal Locality -&lt;/strong&gt; Tendency of programs to access the same memory location repeatedly within a short time. Eg: a += 10 -&amp;gt; reading the value of a and saving the result back to a. It is beneficial to keep a close to the processor to avoid costly (slow) accesses to main memory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Spatial Locality -&lt;/strong&gt; Tendency of programs to access memory locations near the data that is currently being accessed. Eg: two variables a and b declared in a program will sit close together in the main memory page when the program is loaded. So, during the fetch cycle, when a is read from main memory (as part of a cache line), b will likely be in the same cache line and will already be available in cache.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sequential Locality -&lt;/strong&gt; Tendency of programs to access memory locations sequentially. Eg: array elements are stored sequentially in memory. When a program iterates over an array, reading the first element also brings the next contiguous elements (as part of the cache line) from main memory into cache.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Instruction Locality -&lt;/strong&gt; Similar to the data LOR types above, instructions are also prefetched and made available in caches.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LtfS2nQf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713197476776/a502d56f-9481-49c4-a3d2-ebf7bf958b35.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LtfS2nQf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713197476776/a502d56f-9481-49c4-a3d2-ebf7bf958b35.png" alt="" width="800" height="640"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, if a data load happens for a single element in a cache line, all elements in that cache line are loaded, resulting in quicker access to the subsequent elements.&lt;/p&gt;
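&lt;p&gt;A quick way to feel this effect is to traverse the same 2D data in two orders. The toy script below is my own illustration (not from the matmul example that follows); in CPython the gap is smaller than in C, because list elements are boxed objects and the column-major walk also pays extra indexing overhead, but the access-order idea is the same.&lt;/p&gt;

```python
from time import perf_counter

n = 1500
# n x n grid stored row by row, like the matmul example below
grid = [[i * n + j for j in range(n)] for i in range(n)]

def sum_row_major(g):
    # inner loop walks one row at a time: neighbouring elements
    total = 0
    for row in g:
        for x in row:
            total += x
    return total

def sum_col_major(g):
    # inner loop jumps between rows: strided, cache-unfriendly order
    total = 0
    for j in range(n):
        for i in range(n):
            total += g[i][j]
    return total

t0 = perf_counter(); row_sum = sum_row_major(grid); t1 = perf_counter()
t2 = perf_counter(); col_sum = sum_col_major(grid); t3 = perf_counter()
assert row_sum == col_sum          # same work, different access order
print(f"row-major: {t1 - t0:.3f}s  column-major: {t3 - t2:.3f}s")
```

&lt;p&gt;Typically the row-major walk comes out faster; the exact numbers depend on your CPU, cache sizes, and interpreter.&lt;/p&gt;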

&lt;h2&gt;
  
  
  Matrix Multiplication
&lt;/h2&gt;

&lt;p&gt;Matrix multiplication is a classic example with which we can quickly see the impact of the LOR principle. Here is a simple program that does matmul in Python without any libraries (tqdm is used only for the progress bar).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import random
from time import time
from tqdm import tqdm  # progress bar only

n = 500

A = [[random.random()
      for row in range(n)]
      for col in range(n)]

B = [[random.random()
      for row in range(n)]
      for col in range(n)]

C = [[0 for row in range(n)]
     for col in range(n)]

print("calculating ... \n")

start = time()
# inefficient: inner loop scans B column-wise (strided access)
for i in tqdm(range(n)):
    for j in range(n):
        for k in range(n):
            C[i][j] += A[i][k] * B[k][j]
# efficient: inner loop scans B and C row-wise (sequential access)
#for i in tqdm(range(n)):
#    for k in range(n):
#        for j in range(n):
#            C[i][j] += A[i][k] * B[k][j]
end = time()

print("%0.6f"%(end-start))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above Python program can be sped up further in several ways (changing the programming language, compiler optimizations, parallel computation, tiling, vectorization, AVX, CUDA, etc.), which are out of scope for this post. If interested in those, refer to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;MIT OpenCourseWare - Performance Engineering - &lt;a href="https://youtu.be/o7h_sYMk_oc?si=mWgShE48VgXARZEz"&gt;Matrix Multiplication&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Vipul Vaibhaw's &lt;a href="https://vaibhaw-vipul.medium.com/matrix-multiplication-optimizing-the-code-from-6-hours-to-1-sec-70889d33dcfa"&gt;summary&lt;/a&gt; &amp;amp; &lt;a href="https://github.com/vaibhawvipul/performance-engineering"&gt;repo&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tao Xu's &lt;a href="https://xta0.me/2021/07/12/MIT-6172-1.html"&gt;summary&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Running the inefficient &amp;amp; efficient versions of the above program on my Ubuntu workstation &amp;amp; benchmarking with Cachegrind gives:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ valgrind --tool=cachegrind python matmul_inefficient.py
==253768== Cachegrind, a cache and branch-prediction profiler
==253768== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==253768== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==253768== Command: python matmul_inefficient.py
==253768== 
--253768-- warning: L3 cache found, using its data for the LL simulation.
calculating ... 

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [14:33&amp;lt;00:00,  1.75s/it]
873.798730
==253768== 
==253768== I   refs:      314,734,342,652
==253768== I1  misses:          5,738,193
==253768== LLi misses:            870,629
==253768== I1  miss rate:            0.00%
==253768== LLi miss rate:            0.00%
==253768== 
==253768== D   refs:      150,606,141,341  (105,453,303,262 rd   + 45,152,838,079 wr)
==253768== D1  misses:        622,837,260  (    616,546,831 rd   +      6,290,429 wr)
==253768== LLd misses:          2,065,607  (      1,493,478 rd   +        572,129 wr)
==253768== D1  miss rate:             0.4% (            0.6%     +            0.0%  )
==253768== LLd miss rate:             0.0% (            0.0%     +            0.0%  )
==253768== 
==253768== LL refs:           628,575,453  (    622,285,024 rd   +      6,290,429 wr)
==253768== LL misses:           2,936,236  (      2,364,107 rd   +        572,129 wr)
==253768== LL miss rate:              0.0% (            0.0%     +            0.0%  )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ valgrind --tool=cachegrind python matmul_efficient.py
==296074== Cachegrind, a cache and branch-prediction profiler
==296074== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==296074== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==296074== Command: python matmul_efficient.py
==296074== 
--296074-- warning: L3 cache found, using its data for the LL simulation.
calculating ... 

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [14:31&amp;lt;00:00,  1.74s/it]
871.885507
==296074== 
==296074== I   refs:      318,987,466,754
==296074== I1  misses:          4,224,884
==296074== LLi misses:            832,073
==296074== I1  miss rate:            0.00%
==296074== LLi miss rate:            0.00%
==296074== 
==296074== D   refs:      151,347,143,927  (106,200,231,179 rd   + 45,146,912,748 wr)
==296074== D1  misses:        218,499,487  (    216,816,521 rd   +      1,682,966 wr)
==296074== LLd misses:          2,111,315  (      1,539,359 rd   +        571,956 wr)
==296074== D1  miss rate:             0.1% (            0.2%     +            0.0%  )
==296074== LLd miss rate:             0.0% (            0.0%     +            0.0%  )
==296074== 
==296074== LL refs:           222,724,371  (    221,041,405 rd   +      1,682,966 wr)
==296074== LL misses:           2,943,388  (      2,371,432 rd   +        571,956 wr)
==296074== LL miss rate:              0.0% (            0.0%     +            0.0%  )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My workstation is a powerful machine, and 500x500 is a small matrix, so treat the L3 cache as main memory and the L1 cache as cache memory. &lt;strong&gt;The D1 miss rate of the inefficient version is 0.4%, while the efficient version's is 0.1%, resulting in a runtime improvement of ~2s.&lt;/strong&gt; Let's apply sequential locality to a small matrix (for ease of visualization) and &lt;strong&gt;see how changing the loop order gives this performance gain.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OXWbSdRq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713216601332/8fc10765-3377-4354-bdf6-ab97ca03067d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OXWbSdRq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713216601332/8fc10765-3377-4354-bdf6-ab97ca03067d.png" alt="" width="800" height="736"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As seen above, the memory access pattern for matrix B is inefficient on the left. Just by changing the iteration order, the access pattern for matrix B is fixed and we get a free performance boost. Thus, having mechanical sympathy for the underlying hardware architecture helps improve matmul performance.&lt;/p&gt;

&lt;h1&gt;
  
  
  LMAX Disruptor
&lt;/h1&gt;

&lt;p&gt;When it was announced in the early 2010s, it made the rounds in the Java world and in HPC trading firms. It was later adopted in &lt;a href="https://logging.apache.org/log4j/2.x/manual/async.html"&gt;Log4j&lt;/a&gt; and at &lt;a href="https://www.nasdaq.com/docs/2023/03/16/NBV-FOSS-LIST_NBV.pdf"&gt;Nasdaq&lt;/a&gt;. Exchange and brokerage workloads demand millisecond and microsecond latencies. They usually run on beefy bare-metal hardware, as the performance impact of running on VMs is too costly. These services are written in a thread-per-core model (because context switching and L1/L2 cache invalidations are expensive), unlike traditional web servers that operate on a thread-per-request model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; LMAX Disruptor is a high-performance inter-thread communication library. Using it in the wrong way can cause significant performance degradation. As a rule of thumb, if a problem can be solved by scaling out instead of scaling up, the Disruptor need not be used.&lt;/p&gt;

&lt;p&gt;Here is a high level overview of LMAX Exchange.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with traditional queues
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--c8Ls6Flb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713281905756/2742059e-3695-4cbc-ab7f-bbb90837bc23.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--c8Ls6Flb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713281905756/2742059e-3695-4cbc-ab7f-bbb90837bc23.png" alt="" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above diagram shows the high-level LMAX system receiving market data, doing auxiliary processing and core business logic processing, and then sending orders to market. The Replicator, Journaller &amp;amp; Un-Marshaller can process in parallel, but queues are still needed for ordered processing. So we have the receiver acting as producer, and the replicator, journaller &amp;amp; un-marshaller acting as consumers, contending over a shared resource - the queue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3VkD-pjH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713283721242/b06b34f0-67c0-46ed-80fd-07de4307987d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3VkD-pjH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713283721242/b06b34f0-67c0-46ed-80fd-07de4307987d.png" alt="" width="800" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rW8ik4Aw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713283671964/0f5961b7-ea29-4a1c-8d61-cf0bea44a273.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rW8ik4Aw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713283671964/0f5961b7-ea29-4a1c-8d61-cf0bea44a273.png" alt="" width="800" height="254"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we saw in the matmul section, it is likely that the &lt;strong&gt;tail &amp;amp; head variables fall within the same cache line&lt;/strong&gt;. The producer thread adds at the end of the queue and the consumer thread consumes from the beginning of the queue. When these threads run on different cores, their &lt;strong&gt;L1 &amp;amp; L2 caches need to be invalidated each time the producer / consumer updates the state of the queue&lt;/strong&gt;. The LMAX team observed that their producer &amp;amp; consumer were running at the same rate, and that significant time was spent keeping the L1 &amp;amp; L2 caches up-to-date rather than doing actual producing &amp;amp; consuming.&lt;/p&gt;

&lt;h2&gt;
  
  
  How LMAX Disruptor is so Fast
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Lock-Free RingBuffer
&lt;/h3&gt;

&lt;p&gt;A RingBuffer (circular queue) is also a queue that operates in FIFO fashion. The key differences between a RingBuffer &amp;amp; a traditional queue are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;When values are consumed, they are not removed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When the end of the buffer is reached, the writer wraps around to the beginning and values are overwritten.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In LMAX Disruptor's RingBuffer implementation, the 'head' &amp;amp; 'tail' are managed outside the buffer. Instead of a blanket lock, this allows adding a value at the end of the queue while consumption is happening at the beginning, &amp;amp; vice versa.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--s9SC22fb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713368090245/767b14d3-34e9-4014-892b-bb77cfc0b681.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--s9SC22fb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713368090245/767b14d3-34e9-4014-892b-bb77cfc0b681.png" alt="" width="800" height="975"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's look at the highlighted sequence snapshot in the above diagram.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1)&lt;/strong&gt; Fast consumer 2 has processed up to buffer location 5 &amp;amp; asks the cursor for the next location. The cursor provides location 6, as location 0 has already been processed by consumer 2. Consumer 2 fetches the value at buffer location 6 and starts processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2)&lt;/strong&gt; The producer barrier, which tracks the consumer 1 &amp;amp; 2 sequences, knows that consumer 1 is done only up to buffer location 1, and consumer 2 is done up to buffer location 5. So only the value at location 0 can be overwritten.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3)&lt;/strong&gt; The producer can therefore write only one value, at location 0. The producer prepares the new value, e.g. by fetching the latest value from the network.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4)&lt;/strong&gt; Once the new value is ready, the producer asks the producer barrier to commit. The value is updated in the buffer, and the sequence is updated from 0 to 7.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5)&lt;/strong&gt; Slow consumer 1, which is done processing buffer location 1, asks for the next value and gets location 7 from the cursor. Consumer 1 gets all entries in locations 3-7 and works on processing them.&lt;/p&gt;

&lt;p&gt;Consumers update their respective consumer sequences after processing. A buffer location can be overwritten only once it has been processed by all consumers; the producer barrier keeps track of all consumer sequences.&lt;/p&gt;

&lt;p&gt;Batching can be done on both producer and consumer sequences (not in scope of this post; see references).&lt;/p&gt;

&lt;p&gt;Buffer sequences increase monotonically, as this provides an easier way of tracking consumer and producer buffer locations.&lt;/p&gt;
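&lt;p&gt;To make the walkthrough concrete, here is a minimal single-threaded Python sketch of the sequencing idea (my own toy illustration, not the Disruptor API): sequences grow monotonically, the slot is derived from the sequence, and the producer refuses to overwrite a slot until every consumer has passed it.&lt;/p&gt;

```python
class ToyRingBuffer:
    """Toy sketch of Disruptor-style sequencing (not the real API).

    Sequences increase monotonically; slot = sequence % capacity, so
    consumed entries are never removed, only eventually overwritten.
    Assumes at least one consumer is registered before publishing."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity   # statically allocated up front
        self.cursor = -1                 # last published sequence
        self.consumer_seqs = {}          # consumer name -> last processed seq

    def add_consumer(self, name):
        self.consumer_seqs[name] = -1

    def publish(self, value):
        next_seq = self.cursor + 1
        # producer barrier: the slowest consumer gates overwriting
        slowest = min(self.consumer_seqs.values())
        if next_seq - slowest > self.capacity:
            raise RuntimeError("buffer full: a consumer is lagging")
        self.slots[next_seq % self.capacity] = value
        self.cursor = next_seq           # 'commit': make the entry visible

    def consume(self, name):
        if self.consumer_seqs[name] == self.cursor:
            return None                  # nothing new for this consumer
        self.consumer_seqs[name] += 1
        return self.slots[self.consumer_seqs[name] % self.capacity]
```

&lt;p&gt;With capacity 4, after publishing four values a fifth publish raises until the consumer catches up - the role the producer barrier plays in the walkthrough above.&lt;/p&gt;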

&lt;h3&gt;
  
  
  Static Memory Allocation &amp;amp; Delayed Garbage Collection
&lt;/h3&gt;

&lt;p&gt;The RingBuffer array is statically allocated with dummy values. Producers write to the next available buffer location using the cursor, and consumers consume previously unconsumed buffer locations using the cursor. Once a value is overwritten, there is no longer any reference to it and it is easily garbage collected (GC'd).&lt;/p&gt;

&lt;p&gt;In Java 8's GC there are 4 memory spaces: Young Generation (Eden, Survivor spaces), Old Generation (Tenured Generation), Metaspace (non-heap memory) &amp;amp; Code Cache (JIT compiler related).&lt;/p&gt;

&lt;p&gt;Since the RingBuffer array itself is statically allocated and lives for the application's lifetime, it is tenured once and not repeatedly GC'd. The values in the buffer are written and consumed quickly and will be GC'd in an Eden cycle (quick and cheap), &lt;strong&gt;hence avoiding large GC pauses (survivor and old-gen collections).&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Avoiding False Sharing in Cache Lines
&lt;/h3&gt;

&lt;p&gt;In matmul, we saw that variables in a program can share the same cache line. In the Disruptor, we have the cursor and the sequence barriers for both producer &amp;amp; consumers. Since we don't want producer and consumer threads to be affected by updates to each other's variables (unlike ArrayBlockingQueue), we have to &lt;strong&gt;add padding so that each variable occupies an entire cache line.&lt;/strong&gt; That way, when the producer updates the cursor, consumer caches need not be refreshed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--n5Jrk_He--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713354498472/75a9bd87-fbce-46e2-9bc3-4c5381f1e853.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--n5Jrk_He--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713354498472/75a9bd87-fbce-46e2-9bc3-4c5381f1e853.png" alt="" width="800" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we don't do this, then when the producer thread updates the cursor, consumer caches need to be refreshed, as they share the same cache line. This is called &lt;a href="https://mechanical-sympathy.blogspot.com/2011/07/false-sharing.html"&gt;&lt;strong&gt;False Sharing&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Java 8 has the &lt;a href="https://mechanical-sympathy.blogspot.com/2011/07/false-sharing.html"&gt;@Contended annotation&lt;/a&gt; for this.&lt;/p&gt;

&lt;h3&gt;
  
  
  Producer &amp;amp; Consumer Sequence Barriers
&lt;/h3&gt;

&lt;p&gt;A CPU core performs several optimizations using instruction pipelining, reordering, etc., as long as a reorder or concurrent execution in the core's execution units doesn't change the outcome of the program.&lt;/p&gt;

&lt;p&gt;Java provides the volatile keyword, which acts as a special type of barrier known as a write / store barrier. There are also other &lt;a href="https://mechanical-sympathy.blogspot.com/2011/07/memory-barriersfences.html"&gt;types of memory barriers and fences&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Below are two programs: one where counter is not volatile, and one where it is. Arithmetic operations happen in the ALU of a CPU core, and the core operates on values from registers.&lt;/p&gt;

&lt;p&gt;In the first program, once counter is loaded into a register, the 10 loop iterations happen, and each change to counter is saved only in the register. Once the iteration is done, during the "write-back" cycle, the value is written back to the L1 cache, and the memory unit takes care of propagating this change to the other cache levels and to main memory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public class LoopCounterExample {
    public static void main(String[] args) {
        int iterations = 10;
        int counter = 0;

        for (int i = 0; i &amp;lt; iterations; i++) {
            counter++;
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public class LoopCounterExample {
    // volatile is only valid on fields, not on local variables
    private static volatile int counter = 0;

    public static void main(String[] args) {
        int iterations = 10;

        for (int i = 0; i &amp;lt; iterations; i++) {
            counter++;
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the second program, every update to counter is written back from the register to the L1 cache, and the memory unit takes care of invalidating any other cached reference to this value. This has a significant performance cost, but it buys us shared state that is visible across multiple threads.&lt;/p&gt;

&lt;p&gt;In the case of the Disruptor, the cursor, consumer sequences &amp;amp; producer sequences use memory barriers &amp;amp; fences, which offer finer control than the volatile keyword. This is done using &lt;a href="https://www.baeldung.com/java-variable-handles"&gt;Java VarHandle&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;These techniques offer finer control than the ReentrantLock used in ArrayBlockingQueue. Producers and consumers can write to and consume from the ring buffer at the same time, and can be confident that a value read from a buffer location is always the latest, because barriers &amp;amp; fences guarantee:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Anything that happened before the barrier call is flushed out (the producer adds the newly produced value at location 0 and only then increments the cursor from 6 to 7).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A value updated by one thread is immediately visible to all threads (the value of the cursor to the consumer barrier, and the values of the consumer sequences to the producer barrier).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
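&lt;p&gt;These two guarantees are what make the write-then-publish protocol safe. The toy Python sketch below (my own illustration, not Disruptor code) follows the same protocol: the producer writes the slot first and only then advances the cursor, and it never laps the consumer. Note that in CPython the GIL happens to provide the visibility that barriers &amp;amp; fences provide in Java, so this demonstrates the protocol, not real fence semantics.&lt;/p&gt;

```python
import threading

CAP = 64             # ring capacity (slots)
N = 5_000            # events to transfer
buffer = [None] * CAP
cursor = -1          # last published sequence: written by producer only
consumer_seq = -1    # last consumed sequence: written by consumer only

def producer():
    global cursor
    for seq in range(N):
        while seq - consumer_seq > CAP:    # never lap the consumer
            pass                           # busy-wait (toy; real code spins smarter)
        buffer[seq % CAP] = seq * 2        # 1) write the value into the slot...
        cursor = seq                       # 2) ...then publish the sequence

def consumer(out):
    global consumer_seq
    while consumer_seq < N - 1:
        if cursor > consumer_seq:          # published => the slot write is visible
            nxt = consumer_seq + 1
            out.append(buffer[nxt % CAP])
            consumer_seq = nxt             # frees the slot for reuse

received = []
p = threading.Thread(target=producer)
c = threading.Thread(target=consumer, args=(received,))
p.start(); c.start(); p.join(); c.join()
assert received == [i * 2 for i in range(N)]
print("transferred", len(received), "events in order")
```

&lt;p&gt;Because the value is in the slot before the cursor moves, a consumer that sees the new cursor can safely read the slot - the Python analogue of guarantee one above.&lt;/p&gt;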

&lt;h3&gt;
  
  
  Avoiding Context Switching
&lt;/h3&gt;

&lt;p&gt;Even without the optimizations below, the Disruptor's performance is significantly higher than an ArrayBlockingQueue's (see the perf benchmark section below). I found these optimizations very interesting (feel free to skip them and jump to the perf section). They were done for the LMAX matching engine service, which has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;1 inbound Disruptor with 1 producer thread and 3 consumer threads&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;3 outbound Disruptors, each with 1 producer thread (one of the consumer threads of the inbound Disruptor) and 3 consumer threads.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Yellow arrows indicate critical threads that need a dedicated CPU core (for peak performance)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--B6jy3e-Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713345020785/3973c64b-f4e2-4ad4-8d3d-e2eeda96e8fa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--B6jy3e-Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713345020785/3973c64b-f4e2-4ad4-8d3d-e2eeda96e8fa.png" alt="" width="800" height="544"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;isolcpus=0,2,4,6,8,24,26,28,30,32
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The isolcpus kernel boot parameter isolates CPUs from the kernel scheduler. The above diagram shows 10 CPU cores (20 hyper-threads) isolated: the OS will not schedule any process or thread on these cores. (Plugging my previous post here if you want to understand &lt;a href="https://venkat.eu/cpu-gpu-the-basics-and-high-level-overview#heading-core-in-a-modern-cpu"&gt;cpu cores&lt;/a&gt; and &lt;a href="https://venkat.eu/cpu-gpu-the-basics-and-high-level-overview#heading-hyper-threading"&gt;hyper-threading&lt;/a&gt;.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cset set --set=/system --cpu=18,20,...,46 
$ cset set --set=/app --cpu=0,2,...,40
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This partitions system resources: separate CPU sets for the system and for the app.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cset proc --move -k --threads --force --from-set=/ --to-set=/system
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command moves kernel threads from the default CPU set to the "/system" CPU set. Kernel threads are system-level threads managed by the kernel itself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cset proc --exec /app taskset -cp 10,12...38,40 java &amp;lt;args&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command executes the Java application within the "/app" CPU set using the taskset command. The taskset -cp option specifies which CPUs the process is allowed to run on; in this case, the Java application is allowed to run on CPUs 10, 12, ..., 38, and 40.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sched_setaffinity(0);
sched_setaffinity(2); ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each Java thread is pinned to a dedicated core in application code.&lt;/p&gt;
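&lt;p&gt;On Linux, the same affinity call is exposed to Python directly via the standard library, so the pinning idea can be sketched in a few lines (an illustration of the technique, not LMAX's code):&lt;/p&gt;

```python
import os

# os.sched_setaffinity is a Linux-only stdlib wrapper around sched_setaffinity(2)
if hasattr(os, "sched_setaffinity"):
    allowed = os.sched_getaffinity(0)      # 0 = the calling process
    target = {min(allowed)}                # pick one core to pin to
    os.sched_setaffinity(0, target)        # pin: run only on that core
    assert os.sched_getaffinity(0) == target
    os.sched_setaffinity(0, allowed)       # restore, since this is just a demo
    print(f"pinned to core {min(allowed)}, then restored {sorted(allowed)}")
else:
    print("sched_setaffinity is not available on this OS")
```

&lt;p&gt;Combined with isolcpus, pinning like this gives a critical thread a core that nothing else will touch.&lt;/p&gt;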

&lt;h2&gt;
  
  
  Performance Benchmark
&lt;/h2&gt;

&lt;p&gt;I've briefly covered the principles and techniques through which the LMAX Disruptor delivers its performance gains. I'd like to call out that I've used a mix of Disruptor 1.0 &amp;amp; 2.0 terminology above to more easily communicate the problem and the underlying principles. For a more detailed understanding, see the sources in the reference section.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FtWDm6Wp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713344471873/2eda67d9-dcc3-42f1-8c46-b51c4ab4c0eb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FtWDm6Wp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713344471873/2eda67d9-dcc3-42f1-8c46-b51c4ab4c0eb.png" alt="" width="800" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xHLJInVK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713344487654/89bde732-10e6-42d5-abe6-2f1e5f9a23e7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xHLJInVK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713344487654/89bde732-10e6-42d5-abe6-2f1e5f9a23e7.png" alt="" width="800" height="253"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Source: LMAX perf test &lt;a href="https://lmax-exchange.github.io/disruptor/disruptor.html#_throughput_performance_testing"&gt;Throughput&lt;/a&gt; &amp;amp; &lt;a href="https://lmax-exchange.github.io/disruptor/disruptor.html#_latency_performance_testing"&gt;Latency&lt;/a&gt;. The above benchmarks were done &lt;strong&gt;without context switching optimizations&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Thus, having mechanical sympathy for the underlying hardware architecture helps to speed up inter-thread messaging and achieve peak performance.&lt;/p&gt;

&lt;h1&gt;
  
  
  Flash Attention
&lt;/h1&gt;

&lt;p&gt;So far in this post, we looked at loop ordering (LOR) in matmul &amp;amp; the Disruptor, and saw how understanding the underlying CPU architecture helps extract maximum performance. In this section, we'll look at &lt;a href="https://github.com/Dao-AILab/flash-attention"&gt;Flash Attention&lt;/a&gt; - "A new attention algorithm that computes exact attention with far fewer memory accesses."&lt;/p&gt;

&lt;p&gt;In my previous &lt;a href="https://venkat.eu/cpu-gpu-the-basics-and-high-level-overview#heading-bandwidth-compute-intensity-amp-latency"&gt;post&lt;/a&gt;, we looked at HBM memory and the compute intensity of the A100 GPU, using a 2x2 matmul as an example. Flash Attention's optimization leads to direct performance gains primarily in the &lt;a href="https://horace.io/brrr_intro.html"&gt;bandwidth &amp;amp; overhead bound&lt;/a&gt; regimes, rather than in the compute bound regime.&lt;/p&gt;

&lt;p&gt;As of April 2024, I don't have the deep expertise / understanding to explain the attention layer of Transformers in detail. Refer to Jay Alammar's amazing &lt;a href="https://jalammar.github.io/illustrated-transformer/"&gt;post&lt;/a&gt; or the high-quality video from &lt;a href="https://youtu.be/eMlx5fFNoYc?si=wmNOf5977RvETfaZ"&gt;3Blue1Brown&lt;/a&gt; for that. I also cannot do a better job than Aleksa Gordi in explaining the step-by-step changes in the Flash Attention 1 algorithm with supporting math. Refer to his excellent &lt;a href="https://gordicaleksa.medium.com/eli5-flash-attention-5c44017022ad"&gt;post&lt;/a&gt; for that. Below, I try to provide a practical, high-level FlashAttention 1 explanation w.r.t. the underlying hardware - the CUDA Ampere architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Paper Title: Fast and Memory-Efficient Exact Attention with IO-Awareness&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exact Attention:&lt;/strong&gt; It does not use sparse-matrix / approximation methods to speed up the attention calculation. Those techniques, when used, result in models of poorer quality. Flash Attention 1 uses the exact attention calculation, so there is no reduction in quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fast &amp;amp; Memory-Efficient:&lt;/strong&gt; The space complexity of vanilla self-attention is quadratic in sequence length (O(N^2)), while the algorithmic optimization brings it down to linear (O(N)). This reduction in space complexity means far fewer accesses to slow HBM, freeing memory bandwidth and lowering the required compute intensity [more data is fed to the beast - CUDA &amp;amp; Tensor cores :) from caches], resulting in an improvement in speed.&lt;/p&gt;
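&lt;p&gt;A quick back-of-the-envelope sketch (my own illustrative numbers, not from the paper) of why the quadratic score matrix hurts at longer sequence lengths:&lt;/p&gt;

```python
# Illustrative: memory needed to materialize the N x N attention score matrix
# in fp16 (2 bytes), per head, vs. the O(N) statistics FlashAttention keeps.
def attention_matrix_bytes(seq_len, bytes_per_elem=2):
    return seq_len * seq_len * bytes_per_elem

for n in (1024, 4096, 16384):
    mb = attention_matrix_bytes(n) / 1e6
    print(f"N={n}: {mb:.1f} MB for the N x N score matrix")
# At N=16384 the score matrix alone (~537 MB per head) dwarfs the ~20 MB of
# combined L1/shared SRAM on an A100, forcing round-trips to HBM.
```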

&lt;p&gt;&lt;strong&gt;IO Awareness:&lt;/strong&gt; An NVIDIA A100 SXM GPU has 40-80 GB of HBM (VRAM / DRAM) &amp;amp; 88.1 MB of SRAM in total, shared across all SMs (256 KB of registers and 192 KB of L1 cache per SM -&amp;gt; ~27.8 MB for registers and ~20.3 MB for L1 cache combined, plus 40 MB of shared L2 cache).&lt;/p&gt;
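&lt;p&gt;As a rough sanity check, those totals can be approximately reconstructed from the per-SM figures (assuming the A100's 108 SMs; the exact published numbers differ slightly depending on unit conventions):&lt;/p&gt;

```python
# Rough reconstruction of the A100's on-chip SRAM budget (assumed: 108 SMs,
# 256 KB register file and 192 KB combined L1/shared memory per SM, 40 MB L2).
SMS = 108
REG_KB_PER_SM = 256
L1_KB_PER_SM = 192
L2_MB = 40

reg_mb = SMS * REG_KB_PER_SM * 1024 / 1e6   # ~28.3 MB
l1_mb = SMS * L1_KB_PER_SM * 1024 / 1e6     # ~21.2 MB
total_mb = reg_mb + l1_mb + L2_MB           # ~89.5 MB, close to the figure above
print(f"registers ~{reg_mb:.1f} MB, L1 ~{l1_mb:.1f} MB, total ~{total_mb:.1f} MB")
```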

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_vy_ope2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713393370860/653b373a-d333-4d4f-962c-e70c61d92cc0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_vy_ope2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713393370860/653b373a-d333-4d4f-962c-e70c61d92cc0.png" alt="" width="331" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diagram source is NVIDIA. It shows the required &lt;a href="https://venkat.eu/cpu-gpu-the-basics-and-high-level-overview#heading-bandwidth-compute-intensity-amp-latency"&gt;Compute Intensity&lt;/a&gt; for an FMA operation in CUDA &amp;amp; Tensor Cores - i.e., how much compute must be done per byte to make the read operations worth the cost. Apart from matmul, there are not many computations with a high enough compute intensity to make reads from slower memory worth the cost. So, model implementations must try to keep the required compute intensity as low as possible, i.e., read and write from caches and registers.&lt;/p&gt;
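&lt;p&gt;The break-even point can be sketched with rough peak numbers (assumed here: ~312 TFLOPS FP16 tensor-core throughput and ~1.6 TB/s HBM bandwidth for an A100 SXM):&lt;/p&gt;

```python
# Illustrative break-even compute intensity: FLOPs that must be performed per
# byte read from HBM so that compute, not bandwidth, is the bottleneck.
PEAK_FLOPS = 312e12        # assumed FP16 tensor-core peak
HBM_BYTES_PER_S = 1.6e12   # assumed HBM2e bandwidth

breakeven = PEAK_FLOPS / HBM_BYTES_PER_S
print(f"~{breakeven:.0f} FLOPs per HBM byte to stay compute bound")
# Elementwise ops like softmax's exp/sum do only a few FLOPs per byte, so they
# are hopelessly bandwidth bound whenever their operands live in HBM.
```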

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jyAW5EKB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713393576279/943b784c-71f0-4ec5-bdf0-0925869d3356.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jyAW5EKB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713393576279/943b784c-71f0-4ec5-bdf0-0925869d3356.png" alt="" width="248" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diagram source is the Flash Attention paper. In the above attention diagram, for the native attention implementation in PyTorch, we can see that only ~4ms out of 17ms is spent on the matmul operation (compute bound). The rest of the operations are not that compute heavy, but because of frequent reads and writes to HBM, the effective bandwidth is significantly reduced, resulting in wasted GPU compute cycles and higher latency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Standard Self-Attention
&lt;/h2&gt;

&lt;p&gt;I'm providing just the high-level self-attention operations needed to understand FlashAttention 1. Refer to &lt;a href="https://youtu.be/eMlx5fFNoYc?si=J7vohprdDIL3Yl4i"&gt;3Blue1Brown's video&lt;/a&gt; for a detailed explanation.&lt;/p&gt;

&lt;p&gt;Q1.K1 to Qn.Kn are the dot products that make up the matrix multiplication of the Q &amp;amp; K matrices. The division (by the square root of the key dimension) is for numeric stability (not critical for this post).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SFInPuXa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713436124446/30074419-5b17-45b1-af2a-7909cd23d2bd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SFInPuXa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713436124446/30074419-5b17-45b1-af2a-7909cd23d2bd.png" alt="" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The resulting values from the matmul range from -infinity to +infinity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9DkPdLp9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713438892339/1f3dc4c4-1e7c-46e2-9a56-1769aa1769ec.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9DkPdLp9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713438892339/1f3dc4c4-1e7c-46e2-9a56-1769aa1769ec.png" alt="" width="800" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since the matrix column values are used for predicting the next token, we need a probability distribution. &lt;strong&gt;The softmax operation is applied to every column of the result embedding matrix. The denominator needs the sum of all elements in a given column.&lt;/strong&gt; See the sample program below and results, written with help from ChatGPT.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uqt4EvIG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713437654620/e2641b51-76cf-4c39-a9cd-62c8e518fa3c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uqt4EvIG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713437654620/e2641b51-76cf-4c39-a9cd-62c8e518fa3c.png" alt="" width="800" height="448"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch
import torch.nn.functional as F

def traditional_softmax(matrix, column_index):
    column = matrix[:, column_index]
    softmax_column = F.softmax(column, dim=0)
    return softmax_column

# Example usage
matrix = torch.tensor([[-0.8, -5.0, 5.0, 1.5, 3.4, -2.3, 2.5],
                       [-0.2,  2.3, 3.5, 1.8, 0.9, -1.5, 0.5]], dtype=torch.float32)
column_index = 2
softmax_result = traditional_softmax(matrix, column_index)
print("Softmax for column", column_index, ":", softmax_result)

# result
# Softmax for column 2 : tensor([0.8176, 0.1824])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The token with a high probability score gets more "attention".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1PGk1e_I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713439214683/e66f171f-8f2c-43aa-9257-f1c2efa932ed.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1PGk1e_I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713439214683/e66f171f-8f2c-43aa-9257-f1c2efa932ed.png" alt="" width="800" height="539"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So far, we have briefly seen that the matmul of matrices Q &amp;amp; K followed by the softmax operation gives a result matrix with probability distributions. Masking is applied before softmax to prevent future tokens from influencing previous tokens (refer to the video). Below we see the outcome when the result matrix after softmax is multiplied with the V matrix.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TlBk4eol--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713440356830/a04458b1-ea48-4877-8909-797619cc70f9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TlBk4eol--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713440356830/a04458b1-ea48-4877-8909-797619cc70f9.png" alt="" width="800" height="473"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is how LLMs understand the importance of words and sentences in different parts of the text. These steps are done for each layer of the model.&lt;/p&gt;

&lt;p&gt;Below is the standard self-attention implementation, which performs the above-mentioned calculations for each input token in every layer of a transformer model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FRrISKFd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713435814025/c1a72af6-e7cb-4daf-a9c6-c2f7b5e768bc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FRrISKFd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713435814025/c1a72af6-e7cb-4daf-a9c6-c2f7b5e768bc.png" alt="" width="714" height="174"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Diagram source: &lt;a href="https://arxiv.org/pdf/2205.14135.pdf"&gt;Flash attention paper&lt;/a&gt;. One can quickly see several reads and writes being done to HBM without taking the bandwidth and compute intensity of the underlying GPU architecture into account.&lt;/p&gt;
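&lt;p&gt;The standard algorithm above can be sketched in a few lines (a NumPy sketch of the textbook computation, not the paper's implementation; in a real model, each intermediate here is a round-trip to HBM):&lt;/p&gt;

```python
# Textbook (non-flash) self-attention: it materializes the full N x N score
# matrix S and probability matrix P, which is exactly the HBM traffic that
# FlashAttention avoids. Row-wise softmax is the usual convention; it is the
# transpose of the column view used above.
import numpy as np

def standard_attention(q, k, v):
    d = q.shape[-1]
    s = q @ k.T / np.sqrt(d)                       # N x N scores
    e = np.exp(s - s.max(axis=-1, keepdims=True))  # numerically stable exp
    p = e / e.sum(axis=-1, keepdims=True)          # row-wise softmax
    return p @ v                                   # N x d output

rng = np.random.default_rng(0)
n, d = 4, 8
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = standard_attention(q, k, v)
print(out.shape)  # (4, 8)
```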

&lt;h2&gt;
  
  
  Flash Attention Optimizations
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jLPplnQt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713444358865/dbda08e6-b5ca-40fe-b823-d5cd2d5fffe6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jLPplnQt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713444358865/dbda08e6-b5ca-40fe-b823-d5cd2d5fffe6.png" alt="" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Diagram source: &lt;a href="https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention"&gt;HuggingFace TGI&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Tiled Matrix Multiplication
&lt;/h3&gt;

&lt;p&gt;We are going to revisit... caches! (you guessed it :)) Refer to MIT OpenCourseWare's &lt;a href="https://youtu.be/o7h_sYMk_oc?si=cU0QMBGQ3uG-MidG&amp;amp;t=2427"&gt;matmul with tiling&lt;/a&gt;. This is the critical change.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NoHJ3VTN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713456152244/2f71e9d5-c9a3-4162-af68-af3faba00810.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NoHJ3VTN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713456152244/2f71e9d5-c9a3-4162-af68-af3faba00810.png" alt="" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--a8VKQRCh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713456265658/28adc0a1-8125-494f-96ef-b1e18a797c82.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--a8VKQRCh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713456265658/28adc0a1-8125-494f-96ef-b1e18a797c82.png" alt="" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the first slide, the entire matrix B is loaded, as all columns are needed. This is not a very efficient use of memory bandwidth. &lt;strong&gt;As we saw earlier, in self-attention there are 3 matrix multiplications and one softmax&lt;/strong&gt; (the next section covers online softmax, so for now assume that not all columns are needed for the softmax calculation).&lt;/p&gt;

&lt;p&gt;Once tiling is done, some HBM bandwidth is freed up, and some L1 &amp;amp; L2 cache memory is also freed up. This is used to do the softmax operation once Q.K for the block is done. Once softmax is done, we do another matrix multiplication with the V block. The result is then written back to HBM. This is called &lt;strong&gt;"kernel fusion"&lt;/strong&gt;, i.e., a single CUDA kernel performs all 3 operations.&lt;/p&gt;
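&lt;p&gt;Tiling itself can be sketched in plain Python (an illustrative cache-blocking sketch, not FlashAttention's CUDA kernel):&lt;/p&gt;

```python
# Cache-blocking (tiling): multiply block-by-block so each tile of A, B and C
# fits in fast memory (SRAM on a GPU, L1/L2 on a CPU) while it is being reused.
import numpy as np

def tiled_matmul(a, b, tile=2):
    n, k = a.shape
    k2, m = b.shape
    assert k == k2
    c = np.zeros((n, m))
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # One small tile-sized multiply-accumulate; in a fused CUDA
                # kernel this is where softmax and the V matmul would also run.
                c[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
    return c

a = np.arange(16.0).reshape(4, 4)
b = np.eye(4)
print(np.allclose(tiled_matmul(a, b), a @ b))  # True
```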

&lt;p&gt;&lt;strong&gt;A side note:&lt;/strong&gt; I would imagine some kind of tiling was already happening in transformer models before FlashAttention, because CUDA thread blocks &amp;amp; warps are &lt;a href="https://youtu.be/n6M8R8-PlnE?si=9i7BXDsAkFFnlMZ0&amp;amp;t=1291"&gt;designed to do parallel operations&lt;/a&gt; on every memory page read. I haven't looked into FlashAttention 2, but from reading the abstract, I think this is being done. Again, this highly emphasizes the need for optimizations with Mechanical Sympathy :)&lt;/p&gt;

&lt;h3&gt;
  
  
  Online Softmax Calculation
&lt;/h3&gt;

&lt;p&gt;Earlier, we saw that softmax needs the sum of all elements in a given column. In online softmax calculation, computations are performed on columns of smaller matrix blocks, reducing the memory footprint in SRAM. With each block calculation in Flash Attention, the maximum score within the block is tracked and saved.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;m&lt;/em&gt;(&lt;em&gt;x&lt;/em&gt;) (Maximum Score): The highest value within a block of scores.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;f(x)&lt;/em&gt; (Exponential Function): Transforms scores into positive values by exponentiating the difference between each score and the maximum score across all blocks, which gives numerical stability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;l(x)&lt;/em&gt; (Sum of Exponential Scores): The sum of exponential values obtained from applying the exponential function to each score within a block, used for softmax probability computation.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See the sample program and results below, written with help from ChatGPT.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch
import torch.nn.functional as F

def flash_attention_softmax(matrix, column_index, block_sizes):
    # Step 1: Extract the column vector
    column = matrix[:, column_index]

    # Step 2: Compute the total size of the concatenated vector
    total_size = column.size(0)

    # Step 3: Split the concatenated vector into blocks
    blocks = torch.split(column, block_sizes)

    # Step 4: Compute the maximum value within each block (𝑚(𝑥))
    max_values = [torch.max(block) for block in blocks]

    # Step 5: Compute the global maximum value across all blocks
    global_max = torch.max(torch.stack(max_values))

    numerator = torch.zeros_like(column)
    offset = 0
    for block in blocks:
        # Step 6: Compute numerator for each block (𝑓(𝑥))
        # (use a running offset so variable block sizes are handled correctly)
        numerator[offset:offset + block.size(0)] = torch.exp(block - global_max)
        offset += block.size(0)

    # Step 7: Compute the sum of exponentials (ℓ(𝑥))
    denominator = torch.sum(numerator)

    # Step 8: Compute softmax probabilities for each block
    softmax_probabilities = numerator / denominator

    return softmax_probabilities

# Example usage
matrix = torch.tensor([[-0.8, -5.0, 5.0, 1.5, 3.4, -2.3, 2.5],
                       [-0.2,  2.3, 3.5, 1.8, 0.9, -1.5, 0.5]], dtype=torch.float32)
column_index = 2
block_sizes = [1, 1]  # Splitting the column into individual elements

print("Matrix:")
print(matrix)

print("\nColumn:")
column = matrix[:, column_index]
print(column)

print("\nBlocks after splitting:")
blocks = torch.split(column, block_sizes)
print(blocks)

print("\nMax values within each block:")
max_values = [torch.max(block) for block in blocks]
print(max_values)

print("\nGlobal maximum value across all blocks:")
global_max = torch.max(torch.stack(max_values))
print(global_max)

softmax_result = flash_attention_softmax(matrix, column_index, block_sizes)
print("\nSoftmax for column", column_index, ":", softmax_result)

# Matrix:
# tensor([[-0.8000, -5.0000,  5.0000,  1.5000,  3.4000, -2.3000,  2.5000],
#         [-0.2000,  2.3000,  3.5000,  1.8000,  0.9000, -1.5000,  0.5000]])

# Column:
# tensor([5.0000, 3.5000])

# Blocks after splitting:
# (tensor([5.]), tensor([3.5000]))

# Max values within each block:
# [tensor(5.), tensor(3.5000)]

# Global maximum value across all blocks:
# tensor(5.)

# Softmax for column 2 : tensor([0.8176, 0.1824])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
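&lt;p&gt;The sample above still computes the global maximum up front. The key trick from the online softmax paper is a merge rule that updates m(x) and l(x) in a single streaming pass, rescaling the running sum whenever a new maximum appears (a minimal Python sketch, not the paper's CUDA implementation):&lt;/p&gt;

```python
# One-pass online softmax normalizer: maintain the running max m and the
# running sum l of exp(x - m); when m grows, rescale l by exp(m_old - m_new).
import math

def online_normalizer(xs):
    m = float("-inf")
    l = 0.0
    for x in xs:
        m_new = max(m, x)
        l = l * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return m, l

m, l = online_normalizer([5.0, 3.5])
probs = [math.exp(x - m) / l for x in [5.0, 3.5]]
print([round(p, 4) for p in probs])  # [0.8176, 0.1824]
```

The final probabilities match the two-pass result exactly, which is why FlashAttention can process Q.K blocks as they stream through SRAM without ever holding a full column.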



&lt;h2&gt;
  
  
  To summarize:
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zUpzpsmG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713460702067/a76eeb50-c175-4473-aae7-ecef9fceda28.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zUpzpsmG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713460702067/a76eeb50-c175-4473-aae7-ecef9fceda28.png" alt="" width="800" height="642"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(Although I haven't gone into the Transformer attention mechanism and its math, or the Flash Attention algorithm and its math, I hope that at a high level I was able to communicate the essence of the Flash Attention 1 optimizations.)&lt;/p&gt;

&lt;p&gt;Tri Dao, et al., with their combined research / expertise &amp;amp; very good understanding of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Transformer Attention mechanism and the math behind it&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;NVIDIA Ampere series GPU architecture &amp;amp; CUDA parallel programming&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Earlier research works - &lt;a href="https://arxiv.org/abs/2112.05682"&gt;Self-attention Does Not Need O(n2) Memory&lt;/a&gt; by Google Researchers with &lt;a href="https://github.com/google-research/google-research/tree/master/memory_efficient_attention"&gt;reference implementation&lt;/a&gt; in JAX for TPU &amp;amp; &lt;a href="https://arxiv.org/abs/1805.02867"&gt;Online normalizer calculation for softmax&lt;/a&gt; by NVIDIA researchers with &lt;a href="https://github.com/NVIDIA/online-softmax"&gt;reference implementation&lt;/a&gt; in CUDA&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;have shown Mechanical Sympathy to extract the best out of NVIDIA Ampere GPU hardware architecture.&lt;/p&gt;

&lt;h1&gt;
  
  
  Outro
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Implementing a 4096x4096 matmul in C and changing the loop order provides a &lt;a href="https://xta0.me/2021/07/12/MIT-6172-1.html"&gt;461% improvement&lt;/a&gt; in GFLOPS utilization compared to a C implementation with an inefficient loop order. This is done purely by exploiting CPU cache line behavior.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The P99 latency improvement of the Disruptor over ArrayBlockingQueue is &lt;a href="https://lmax-exchange.github.io/disruptor/disruptor.html#_latency_performance_testing"&gt;99%&lt;/a&gt; &amp;amp; enabled LMAX Exchange to handle 6M &lt;a href="https://www.infoq.com/presentations/lmax-trading-architecture/"&gt;order matching engine TPS&lt;/a&gt; on a single machine. This was achieved primarily by using granular inter-thread messaging that allows concurrent reads and writes to the buffer, and efficient use of the CPU cache line.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;FlashAttention trains Transformers faster than existing baselines: &lt;a href="https://arxiv.org/pdf/2205.14135.pdf"&gt;15%&lt;/a&gt; end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the MLPerf 1.1 training speed record, 3x speedup on GPT-2 (seq. length 1K), and 2.4x speedup on long-range arena (seq. length 1K-4K).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this post, we saw examples of Mechanical Sympathy being applied to a wide range of problems requiring different skill sets and expertise, with real-world impact.&lt;/p&gt;

&lt;p&gt;The Deep Learning space is still in its nascent phase. People with expertise in several backgrounds (Data Engineering, Model Training, Deep Learning Algorithms, Compilers, Hardware Interfaces - CPUs, GPUs, Accelerators, Model Inference, Distributed Systems, Infrastructure, Mathematics, Physics, Chemistry) are all working within their domains, and rightfully so. The current cost of training &amp;amp; inference for quality models is prohibitively high. Given that LLMs are going to be human companions like the laptop and the smartphone, several optimizations will be required, some of which will be solved by engineers with a very good understanding of the underlying hardware and architecture.&lt;/p&gt;

&lt;p&gt;It's interesting that FlashAttention 1 was done in &lt;a href="https://youtu.be/IoMSGuiwV3g?si=WvjirXW7hduMTfmd&amp;amp;t=877"&gt;2-3 months&lt;/a&gt;. In 2023, they also published Flash Attention 2 with better parallelism and work partitioning (efficient use of CUDA thread blocks &amp;amp; warps), resulting in optimizations primarily in the compute bound regime. I cannot imagine the breakthroughs we would see if more Deep Learning / Transformer algorithm experts/researchers and CUDA architects like &lt;a href="https://www.nvidia.com/en-us/on-demand/search/?facet.mimetype[]=event%20session&amp;amp;layout=list&amp;amp;page=1&amp;amp;q=Stephen%20Jones%20%28SW%29&amp;amp;sort=date&amp;amp;sortDir=desc"&gt;Stephen Jones&lt;/a&gt; worked on optimizing existing layers and algorithms for a couple of years or so. I'm highlighting CUDA here as NVIDIA is the market leader. Intel, AMD, and other transformer accelerators' computing-platform teams should also spend more effort optimizing model implementations for their respective hardware.&lt;/p&gt;

&lt;h1&gt;
  
  
  References:
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;MIT OpenCourseWare - Performance Engineering - Matrix Multiplication &lt;a href="https://youtu.be/o7h_sYMk_oc?si=mWgShE48VgXARZEz"&gt;https://youtu.be/o7h_sYMk_oc?si=mWgShE48VgXARZEz&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Vipul Vaibhaw's MIT matmul summary - &lt;a href="https://vaibhaw-vipul.medium.com/matrix-multiplication-optimizing-the-code-from-6-hours-to-1-sec-70889d33dcfa"&gt;https://vaibhaw-vipul.medium.com/matrix-multiplication-optimizing-the-code-from-6-hours-to-1-sec-70889d33dcfa&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tao Xu's MIT matmul summary - &lt;a href="https://xta0.me/2021/07/12/MIT-6172-1.html"&gt;https://xta0.me/2021/07/12/MIT-6172-1.html&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;InfoQ Martin Thompson and Michael Barker on building HPC fintech handling over 100k TPS at LMAX - &lt;a href="https://www.infoq.com/presentations/LMAX"&gt;https://www.infoq.com/presentations/LMAX&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;InfoQ Sam Adams on LMAX Exchange Architecture &lt;a href="https://www.infoq.com/presentations/lmax-trading-architecture/"&gt;https://www.infoq.com/presentations/lmax-trading-architecture/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Trisha Gee on LMAX Disruptor Internals - &lt;a href="https://mechanitis.blogspot.com/2011/06/dissecting-disruptor-whats-so-special.html"&gt;RingBuffer&lt;/a&gt;, &lt;a href="https://mechanitis.blogspot.com/2011/07/dissecting-disruptor-why-its-so-fast.html"&gt;LocksAreBad&lt;/a&gt;, &lt;a href="https://mechanitis.blogspot.com/2011/08/dissecting-disruptor-why-its-so-fast.html"&gt;MemoryBarriers&lt;/a&gt;, &lt;a href="https://mechanitis.blogspot.com/2011/06/dissecting-disruptor-how-do-i-read-from.html"&gt;Consumer&lt;/a&gt;, &lt;a href="https://mechanitis.blogspot.com/2011/07/dissecting-disruptor-writing-to-ring.html"&gt;Producer&lt;/a&gt;, &lt;a href="https://mechanitis.blogspot.com/2011/08/disruptor-20-all-change-please.html"&gt;Disruptor 2.0&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;LMAX Exchange Disruptor - &lt;a href="https://lmax-exchange.github.io/disruptor/#_read_this_first"&gt;https://lmax-exchange.github.io/disruptor/#_read_this_first&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Martin Thompson on Memory Barriers &amp;amp; Fences - &lt;a href="https://mechanical-sympathy.blogspot.com/2011/07/memory-barriersfences.html"&gt;https://mechanical-sympathy.blogspot.com/2011/07/memory-barriersfences.html&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Guy Nir on LMAX Disruptor - &lt;a href="https://www.slideshare.net/slideshow/the-edge-2012-disruptor-guy-raz-nir-published/22790571"&gt;https://www.slideshare.net/slideshow/the-edge-2012-disruptor-guy-raz-nir-published/22790571&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Martin Fowler on LMAX Disruptor: &lt;a href="https://martinfowler.com/articles/lmax.html"&gt;https://martinfowler.com/articles/lmax.html&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tri Dao, et al., FLASH ATTENTION 1 &amp;amp; 2: &lt;a href="https://github.com/Dao-AILab/flash-attention"&gt;https://github.com/Dao-AILab/flash-attention&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Aleksa Gordi on ELI5 - FLASH ATTENTION 1: &lt;a href="https://gordicaleksa.medium.com/eli5-flash-attention-5c44017022ad"&gt;https://gordicaleksa.medium.com/eli5-flash-attention-5c44017022ad&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jay Alammar - &lt;a href="https://jalammar.github.io/illustrated-transformer/"&gt;The Illustrated Transformer&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>java</category>
      <category>python</category>
    </item>
    <item>
      <title>CPU &amp; GPU - The Basics</title>
      <dc:creator>Venkat Raman</dc:creator>
      <pubDate>Mon, 08 Apr 2024 11:51:00 +0000</pubDate>
      <link>https://dev.to/venkat2811/cpu-gpu-the-basics-5c5g</link>
      <guid>https://dev.to/venkat2811/cpu-gpu-the-basics-5c5g</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj6pk1v3jm2ec5c6t4hmo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj6pk1v3jm2ec5c6t4hmo.png" alt="Image description" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;In this article, we'll go through some fundamental low-level details to understand why GPUs are good at graphics, neural network, and deep learning tasks, while CPUs are good at a wide range of sequential, complex, general-purpose computing tasks. There were several topics that I had to research to get a more granular understanding for this post, some of which I will just mention in passing. This is done deliberately to focus on the absolute basics of CPU &amp;amp; GPU processing.&lt;/p&gt;

&lt;h1&gt;
  
  
  Von Neumann Architecture
&lt;/h1&gt;

&lt;p&gt;Earlier computers were dedicated devices. Hardware circuits and logic gates were programmed to do a specific set of things. If something new had to be done, circuits needed to be rewired. "Something new" could be as simple as doing mathematical calculations for two different equations. During WWII, Alan Turing was working on a programmable machine to beat the Enigma machine, having earlier published his "Turing Machine" paper. Around the same time, John von Neumann and other researchers were working on an idea which fundamentally proposed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Instruction and data should be stored in shared memory (Stored program).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Processing and memory units should be separate.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Control unit takes care of reading data &amp;amp; instructions from memory to do calculations using processing unit.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Bottleneck
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Processing bottleneck - Only one instruction and its operands can be in a processing unit (physical logic gates) at a time. Instructions are executed sequentially, one after another. Over the years, the focus and improvements have been on making processors smaller, with faster clock cycles and an increasing number of cores.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Memory bottleneck - As processors grew faster, the speed and amount of data that could be transferred between memory and the processing unit became a bottleneck. Memory is several orders of magnitude slower than the CPU. Over the years, the focus and improvements have been on making memory denser and smaller.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  CPUs
&lt;/h1&gt;

&lt;p&gt;We know that everything in our computer is binary. Strings, images, video, audio, the OS, application programs, etc., are all represented as 1s &amp;amp; 0s. CPU architecture (RISC, CISC, etc.) specifications have instruction sets (x86, x86-64, ARM, etc.), which CPU manufacturers must comply with &amp;amp; which are available for the OS to interface with the hardware.&lt;/p&gt;

&lt;p&gt;The OS &amp;amp; application programs, including data, are translated into instruction-set instructions and binary data for processing in the CPU. At the chip level, processing is done with transistors and logic gates. If you execute a program to add two numbers, the addition (the "processing") is done at a logic gate in the processor.&lt;/p&gt;

&lt;p&gt;In a CPU following the Von Neumann architecture, when we are adding two numbers, a single add instruction runs on two numbers in the circuit. For a fraction of that millisecond, only the add instruction was being executed in the (execution) core of the processing unit! This detail always fascinated me.&lt;/p&gt;

&lt;h3&gt;
  
  
  Core in a modern CPU
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--G7kM52e5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712387837513/3eccc30d-7a30-4190-9188-dedc858a3a78.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--G7kM52e5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712387837513/3eccc30d-7a30-4190-9188-dedc858a3a78.png" alt="" width="800" height="947"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The components in the above diagram are self-evident. For a more detailed explanation, refer to this excellent &lt;a href="https://www.redhat.com/sysadmin/cpu-components-functionality"&gt;article&lt;/a&gt;. In modern CPUs, a single physical core can contain more than one integer ALU, floating-point ALU, etc. Again, these units are physical logic gates.&lt;/p&gt;

&lt;p&gt;We need to understand the 'hardware thread' in a CPU core to better appreciate GPUs. &lt;strong&gt;A hardware thread is a unit of compute that the execution units of a CPU core can perform every single CPU clock cycle.&lt;/strong&gt; &lt;strong&gt;It represents the smallest unit of work that can be executed in a core.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Instruction cycle
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lbQ7FhIS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712432020670/3f4787fe-43fd-4871-bb3f-7d96bad86f3e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lbQ7FhIS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712432020670/3f4787fe-43fd-4871-bb3f-7d96bad86f3e.png" alt="" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above diagram illustrates the CPU instruction cycle (machine cycle): the series of steps a CPU performs to execute a single instruction (e.g., c = a + b).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fetch:&lt;/strong&gt; The program counter (a special register in the CPU core) keeps track of which instruction to fetch next. The instruction is fetched into the instruction register. For simple operations, the corresponding data is fetched as well.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Decode:&lt;/strong&gt; The instruction is decoded to determine the operator and operands.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Execute:&lt;/strong&gt; Based on the operation specified, the appropriate processing unit is chosen and the instruction is executed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory Access:&lt;/strong&gt; If the instruction is complex or additional data is needed (several factors can cause this), memory access happens before execute. (Omitted from the above diagram for simplicity.) &lt;strong&gt;For a complex instruction, the initial data is available in the compute unit's data registers, but completing the instruction requires fetching data from the L1 and L2 caches. This means there can be a short wait before the compute unit executes, and the hardware thread holds the compute unit for that entire wait.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Write Back:&lt;/strong&gt; If execution produces output (e.g., c = a + b), the output is written back to a register, cache, or memory. (Omitted from the above diagram, and from the rest of the post, for simplicity.)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
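&lt;p&gt;The steps above can be sketched as a toy register machine in Python. The instruction format here is hypothetical, purely for illustration of the fetch-decode-execute-write-back loop:&lt;/p&gt;

```python
# A toy fetch-decode-execute loop for a register machine.
# Instructions are (opcode, dest, src1, src2) tuples; "program" plays memory.
program = [
    ("ADD", "c", "a", "b"),   # c = a + b
    ("SUB", "d", "c", "a"),   # d = c - a
]
registers = {"a": 2, "b": 3, "c": 0, "d": 0}
pc = 0  # program counter: which instruction to fetch next

while pc < len(program):
    instruction = program[pc]       # Fetch: read the instruction at PC
    op, dest, s1, s2 = instruction  # Decode: split into operator and operands
    if op == "ADD":                 # Execute: route to the right "unit"
        result = registers[s1] + registers[s2]
    elif op == "SUB":
        result = registers[s1] - registers[s2]
    registers[dest] = result        # Write back the result
    pc += 1

print(registers)  # {'a': 2, 'b': 3, 'c': 5, 'd': 3}
```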

&lt;p&gt;In the above diagram, compute happens only at t2. The rest of the time, the core is idle (no useful work is being done).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Modern CPUs have hardware components that enable the fetch, decode, and execute steps to happen concurrently in each clock cycle, so a single hardware thread can now perform a computation every clock cycle.&lt;/strong&gt; This is called instruction pipelining.&lt;/p&gt;

&lt;p&gt;Fetch, decode, memory access, and write back are handled by other components in the CPU. For lack of a better word, these are called "pipeline threads". A pipeline thread becomes a hardware thread when it is in the execute stage of an instruction cycle.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lwhdNkov--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712432055203/d1fc48e1-32b7-4fe1-b06a-8ecc647f7dd0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lwhdNkov--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712432055203/d1fc48e1-32b7-4fe1-b06a-8ecc647f7dd0.png" alt="" width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, from t2 onward we get compute output every cycle. Previously, we got compute output only once every three cycles. &lt;strong&gt;Pipelining improves compute throughput. This is one of the techniques for managing the processing bottleneck in the Von Neumann architecture.&lt;/strong&gt; There are other optimizations as well, such as out-of-order execution, branch prediction, and speculative execution.&lt;/p&gt;
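&lt;p&gt;A quick back-of-the-envelope calculation shows why this matters, assuming an idealized 3-stage (fetch / decode / execute) core with no stalls:&lt;/p&gt;

```python
# Throughput comparison: N instructions through a 3-stage core,
# without and with pipelining (idealized, no stalls or hazards).
def cycles_unpipelined(n, stages=3):
    return n * stages        # each instruction occupies the core for all stages

def cycles_pipelined(n, stages=3):
    return stages + (n - 1)  # fill the pipeline once, then 1 result per cycle

n = 100
print(cycles_unpipelined(n))  # 300
print(cycles_pipelined(n))    # 102
```

&lt;p&gt;For a large number of instructions, throughput approaches one result per cycle instead of one per three cycles.&lt;/p&gt;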

&lt;h3&gt;
  
  
  Hyper-Threading
&lt;/h3&gt;

&lt;p&gt;This is the last CPU concept I want to discuss before we conclude and move on to GPUs. As clock speeds increased, processors became very fast and efficient. As application complexity increased, CPU compute cores were underutilized and spent more and more time waiting on memory access.&lt;/p&gt;

&lt;p&gt;So we are back to the memory bottleneck: the compute unit spends time on memory access instead of doing useful work. Memory is several orders of magnitude slower than the CPU, and the gap is not going to close anytime soon. The idea was to duplicate some units within a single CPU core to increase effective memory bandwidth and keep data ready, so that the compute units stay utilized while a thread is awaiting memory access.&lt;/p&gt;

&lt;p&gt;Hyper-threading was introduced by Intel in 2002 in Xeon and Pentium 4 processors. Prior to hyper-threading, there was only one hardware thread per core. With hyper-threading, there are two hardware threads per core. What does that mean? Some of the circuitry is duplicated: certain registers, the program counter, the fetch unit, the decode unit, etc.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rzT26WTy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712358597762/0afc525c-8033-4c0f-9f16-146edf21dd03.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rzT26WTy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712358597762/0afc525c-8033-4c0f-9f16-146edf21dd03.png" alt="" width="800" height="618"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above diagram shows just the new circuit elements in a CPU core with hyper-threading. &lt;strong&gt;This is how a single physical core appears as 2 cores to the operating system. If you had a 4-core processor with hyper-threading enabled, the OS sees it as 8 cores&lt;/strong&gt;. Naturally, the L1 - L3 cache sizes increase to accommodate the additional registers. Note that the execution units are shared.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8pL_MNDx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712432094887/076256b3-0999-4c6d-b907-3b1e2b9585ca.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8pL_MNDx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712432094887/076256b3-0999-4c6d-b907-3b1e2b9585ca.png" alt="" width="800" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Assume we have processes P1 and P2 computing a=b+c and d=e+f. These can execute concurrently in a single clock cycle because of HW threads 1 and 2. With a single HW thread, as we saw earlier, this would not be possible. &lt;strong&gt;Here we are increasing memory bandwidth within a core by adding an additional hardware thread so that the processing units can be utilized efficiently. This improves compute concurrency.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some interesting scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The CPU has only one integer ALU. Either HW Thread 1 or HW Thread 2 must wait one clock cycle and proceed with its compute in the next cycle.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The CPU has one integer ALU and one floating-point ALU. HW Thread 1 and HW Thread 2 can do addition concurrently, using the ALU and FPU respectively.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;All available ALUs are being utilized by HW Thread 1. HW Thread 2 must wait until an ALU is available. (Not applicable to the addition example above, but this can happen with other instructions.)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
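&lt;p&gt;The first two scenarios can be modeled with a tiny issue simulator. This is a deliberately simplified, hypothetical model (real schedulers are far more sophisticated), just to show why a second hardware thread only helps when there is a free execution unit:&lt;/p&gt;

```python
# Toy model: each cycle, two HW threads try to issue their next op;
# each execution unit can serve at most one op per cycle.
def run(ops_t1, ops_t2, units):
    queues, cycles = [list(ops_t1), list(ops_t2)], 0
    while any(queues):
        free = dict(units)  # units available this cycle, e.g. {"ALU": 1}
        for q in queues:
            if q and free.get(q[0], 0) > 0:
                free[q[0]] -= 1
                q.pop(0)    # this thread's op issues this cycle
        cycles += 1
    return cycles

# One integer ALU shared by two integer adds: one thread waits a cycle.
print(run(["ALU"], ["ALU"], {"ALU": 1}))            # 2
# An ALU plus an FPU: integer and floating-point adds run concurrently.
print(run(["ALU"], ["FPU"], {"ALU": 1, "FPU": 1}))  # 1
```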

&lt;h3&gt;
  
  
  Why is the CPU so good at traditional desktop / server computing?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;High clock speeds - higher than GPU clock speeds. Combined with instruction pipelining, this makes CPUs extremely good at sequential tasks. &lt;strong&gt;Optimized for latency.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Diverse applications and computation needs - Personal computers and servers run a wide range of applications with diverse computation needs. This results in a complex instruction set; the CPU has to be good at many things.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multitasking and multiprocessing - With so many apps on our computers, CPU workloads demand context switching. The caching system and memory access are set up to support this. When a process is scheduled on a CPU hardware thread, it has all the necessary data ready and executes its compute instructions quickly, one by one.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  CPU Drawbacks
&lt;/h3&gt;

&lt;p&gt;Check this &lt;a href="https://medium.com/analytics-vidhya/using-pytorch-and-cuda-for-large-computation-in-google-colabs-f1c026c17673"&gt;article&lt;/a&gt; and also try the &lt;a href="https://colab.research.google.com/drive/1nw34aks9SdMwHXl9Gf5T9GPxRB9BIIyr"&gt;Colab notebook&lt;/a&gt;. It shows how matrix multiplication is a parallelizable task and how parallel compute cores can speed up the calculation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Extremely good at sequential tasks, but not at parallel tasks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Complex instruction set and complex memory access patterns.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The CPU also spends a lot of energy on context switching and control unit activities, in addition to compute.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
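&lt;p&gt;The parallelism in matrix multiplication is easy to see in code: every output cell is an independent dot product of one row and one column, so all cells can in principle be computed simultaneously. A small illustrative sketch using a thread pool (on a GPU, each cell would map to its own hardware thread):&lt;/p&gt;

```python
# Every output cell C[i][j] = dot(row i of A, column j of B) is independent,
# so each one can be submitted as its own task.
from concurrent.futures import ThreadPoolExecutor

def dot(row, col):
    return sum(r * c for r, c in zip(row, col))

def matmul_parallel(A, B):
    cols = list(zip(*B))  # columns of B
    with ThreadPoolExecutor() as pool:
        # one independent task per output cell - no task depends on another
        futures = [[pool.submit(dot, row, col) for col in cols] for row in A]
    return [[f.result() for f in row] for row in futures]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_parallel(A, B))  # [[19, 22], [43, 50]]
```

&lt;p&gt;On a CPU with a handful of cores, only a few of these tasks run at once; a GPU with thousands of compute units can run them all in parallel, which is where the speedup comes from.&lt;/p&gt;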

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Instruction pipelining improves compute throughput.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Increasing memory bandwidth improves compute concurrency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CPUs are good at sequential tasks (optimized for latency). They are not good at massively parallel tasks, which require a large number of compute units and hardware threads that CPUs lack (they are not optimized for throughput). These are not available because CPUs are built for general-purpose computing and have complex instruction sets.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  GPUs
&lt;/h1&gt;

&lt;p&gt;As computing power increased, so did the demand for graphics processing. Tasks like UI rendering and gaming require parallel operations, driving the need for numerous ALUs and FPUs at the circuit level. CPUs, designed for sequential tasks, couldn't handle these parallel workloads effectively. Thus, GPUs were developed to fulfill the demand for parallel processing in graphics tasks, later paving the way for their adoption in accelerating deep learning algorithms.&lt;/p&gt;

&lt;p&gt;I would highly recommend:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Watching this &lt;a href="https://youtu.be/C8YtdC8mxTU?si=OdrFXUFMLBhuZF34"&gt;video&lt;/a&gt; that explains parallel tasks involved in Video Games rendering.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reading this &lt;a href="https://jalammar.github.io/illustrated-transformer/"&gt;blog post&lt;/a&gt; to understand parallel tasks involved in a transformer. There are other deep learning architectures like CNNs, RNNs as well. Since LLMs are taking over the world, high level understanding of parallelism in matrix multiplications required for transformer tasks would set a good context for the remainder of this post. (At a later time, I plan to fully understand transformer &amp;amp; share a digestible high-level overview of what happens in transformer layers of a small GPT model.)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example CPU vs GPU spec
&lt;/h3&gt;

&lt;p&gt;The cores, hardware threads, clock speed, memory bandwidth, and on-chip memory of CPUs and GPUs differ significantly. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Intel Xeon 8280: see its &lt;a href="https://www.intel.com/content/www/us/en/products/sku/192478/intel-xeon-platinum-8280-processor-38-5m-cache-2-70-ghz/specifications.html"&gt;specifications&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Nvidia A100 80GB SXM: see its &lt;a href="https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf"&gt;datasheet&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Core in a modern GPU
&lt;/h3&gt;

&lt;p&gt;The terminology we saw for CPUs doesn't always translate directly to GPUs. Here we'll look at the components and cores of an NVIDIA A100 GPU. One thing that surprised me while researching this article: CPU vendors don't publish how many ALUs, FPUs, etc., are available in the execution units of a core. NVIDIA, by contrast, is very transparent about core counts, and the CUDA framework gives complete flexibility and access at the circuit level.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vc0BlCTS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712503965633/0b04a91e-6790-4f56-9a05-ddf3bcb3c02f.png" alt="" width="800" height="404"&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the GPU side of the above diagram, we can see that there is no L3 cache, a smaller L2 cache, smaller but far more numerous control units and L1 caches, and a large number of processing units.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---39tT0_9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712535081640/3c8499bf-4e86-4496-ac86-fd19fc9f86f0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---39tT0_9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712535081640/3c8499bf-4e86-4496-ac86-fd19fc9f86f0.png" alt="" width="800" height="912"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IRAoL_Y---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712535131260/d20f6b3b-f97d-4efa-82fd-ab64ab6f09ba.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IRAoL_Y---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712535131260/d20f6b3b-f97d-4efa-82fd-ab64ab6f09ba.png" alt="" width="800" height="1070"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are the GPU components in the above diagrams and their CPU equivalents, for an initial understanding. I haven't done CUDA programming, so comparing with CPU equivalents helps build that initial intuition; CUDA programmers understand this very well.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Multiple Streaming Multiprocessors &amp;lt;&amp;gt; Multi Core CPU&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Streaming Multiprocessor (SM) &amp;lt;&amp;gt; CPU Core&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Streaming processor (SP)/ CUDA Core &amp;lt;&amp;gt; ALU / FPU in execution units of a CPU Core&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tensor Core (capable of doing 4x4 FP64 operations in a single instruction) &amp;lt;&amp;gt; SIMD execution units in a modern CPU core (eg: AVX-512)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hardware Thread (doing compute in CUDA or Tensor Cores in a single clock cycle) &amp;lt;&amp;gt; Hardware Thread (doing compute in execution units [ALUs, FPUs, etc.,] in a single clock cycle)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;HBM / VRAM / DRAM / GPU Memory &amp;lt;&amp;gt; RAM&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;On-chip memory/SRAM (Registers, L1, L2 cache) &amp;lt;&amp;gt; On-chip memory/SRAM (Registers, L1, L2, L3 cache)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Moving data &amp;amp; memory bandwidth
&lt;/h3&gt;

&lt;p&gt;Graphics and deep learning tasks demand SIMD/SIMT (single instruction, multiple data / multiple threads) execution, i.e., reading and working on large amounts of data with a single instruction.&lt;/p&gt;
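&lt;p&gt;A minimal illustration of the SIMD/SIMT idea in plain Python (conceptual only; real SIMD happens in hardware lanes, not in map()): one instruction, applied across many data elements.&lt;/p&gt;

```python
# ONE operation ("multiply by weight, add bias") applied to MANY elements.
# On a GPU, each element would map to a thread executing the same
# instruction in lockstep.
data = [1.0, 2.0, 3.0, 4.0]
weight, bias = 0.5, 1.0

# Scalar (CPU-style): one element per instruction, executed sequentially.
scalar_out = []
for x in data:
    scalar_out.append(x * weight + bias)

# SIMT-style: the same single operation mapped across all elements at once.
simt_out = list(map(lambda x: x * weight + bias, data))

print(simt_out)  # [1.5, 2.0, 2.5, 3.0]
assert scalar_out == simt_out
```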

&lt;p&gt;We discussed instruction pipelining and hyper-threading for CPUs; GPUs have these capabilities too. The implementation and mechanics differ slightly, but the principles are the same.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bJiOhehm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712561411743/c07ce188-29ba-4fc1-9529-33c53ccefaaa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bJiOhehm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712561411743/c07ce188-29ba-4fc1-9529-33c53ccefaaa.png" alt="" width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unlike CPUs, GPUs (via CUDA) provide direct access to pipeline threads (which fetch data from memory and utilize the memory bandwidth). GPU schedulers work by first filling the compute units (including the associated shared L1 cache and the registers that hold compute operands), then the "pipeline threads" that fetch data into registers from HBM. Again, I want to emphasize that CPU application programmers don't think about this, and specs for "pipeline threads" and the number of compute units per core are not published. NVIDIA not only publishes these, but also gives programmers complete control over them.&lt;/p&gt;

&lt;p&gt;I will go into more detail in a dedicated post about the CUDA programming model and about "batching" as a model-serving optimization technique, where we can see how beneficial this is.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4eKzxDQT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712562279792/0ccc0f09-30f9-4287-8c4c-79aee036124d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4eKzxDQT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712562279792/0ccc0f09-30f9-4287-8c4c-79aee036124d.png" alt="" width="749" height="249"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above diagram depicts hardware thread execution in a CPU core and a GPU core. Recall the "memory access" step we discussed earlier in CPU pipelining; this diagram shows it. The CPU's complex memory management keeps this wait small (a few clock cycles) when fetching data from the L1 cache into registers. When data needs to be fetched from L3 or main memory, the other thread, whose data is already in registers (as we saw in the hyper-threading section), gets control of the execution units.&lt;/p&gt;

&lt;p&gt;In GPUs, because of oversubscription (a high number of pipeline threads and registers) and a simple instruction set, a large amount of data is already available in registers, pending execution. Pipeline threads waiting for execution become hardware threads and can execute as often as every clock cycle, because pipeline threads in GPUs are lightweight.&lt;/p&gt;

&lt;h4&gt;
  
  
  Bandwidth, Compute Intensity &amp;amp; Latency
&lt;/h4&gt;

&lt;p&gt;What's our goal?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Fully utilize the hardware resources (compute units) every clock cycle to get the best out of the GPU.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To keep the compute units busy, we need to feed them enough data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--h8S5Yl3k--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712569518138/3760def5-b2e9-4719-ae18-ffa1f4751f49.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--h8S5Yl3k--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712569518138/3760def5-b2e9-4719-ae18-ffa1f4751f49.png" alt="" width="800" height="873"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the main reason why the latency of multiplying small matrices is more or less the same on a CPU and a GPU. &lt;a href="https://ashanpriyadarshana.medium.com/cuda-gpu-memory-architecture-8c3ac644bd64"&gt;Try it out&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Tasks need to be parallel enough, and the data needs to be large enough, to saturate the compute FLOPs and the memory bandwidth. If a single task is not big enough, multiple such tasks need to be packed together to saturate memory and compute and fully utilize the hardware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compute Intensity = FLOPs / Bandwidth&lt;/strong&gt;, i.e., the ratio of the amount of work the compute units can do per second to the amount of data memory can deliver per second.&lt;/p&gt;
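&lt;p&gt;As a worked example with rough, illustrative numbers (approximate A100-class figures for intuition only, not exact specs):&lt;/p&gt;

```python
# Compute Intensity = FLOPs / Bandwidth, with ballpark A100-class numbers.
peak_flops = 19.5e12  # ~19.5 TFLOPS FP64 tensor-core rate (approximate)
bandwidth = 2.0e12    # ~2 TB/s HBM bandwidth (approximate)

flops_per_byte = peak_flops / bandwidth
print(round(flops_per_byte, 2))  # 9.75
```

&lt;p&gt;Under these assumptions, roughly 10 operations must be performed per byte moved from HBM (around 78 per 8-byte FP64 value) just to keep the compute units from idling, which is why data reuse in registers and L1 is so important.&lt;/p&gt;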

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Htn5n1uN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712535632913/e5a8a5ba-4376-4780-b38d-35ddc83e3bb6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Htn5n1uN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712535632913/e5a8a5ba-4376-4780-b38d-35ddc83e3bb6.png" alt="" width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above diagram, we see that compute intensity increases as we move to higher-latency, lower-bandwidth memory. &lt;strong&gt;We want this number to be as small as possible so that compute is fully utilized.&lt;/strong&gt; For that, we need to keep as much data as possible in L1 cache / registers so that compute can happen quickly. If we fetch a single piece of data from HBM, few workloads perform the ~100 operations on it needed to make the fetch worthwhile; if we don't do those 100 operations, the compute units sit idle. &lt;strong&gt;This is where the high number of threads and registers in GPUs comes into play: keeping as much data as possible in L1/registers keeps the compute intensity low and keeps the parallel cores busy.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There is a 4X difference in compute intensity between CUDA and Tensor cores because a CUDA core can do only a single 1x1 FP64 multiply-accumulate per instruction, whereas a Tensor core can do a 4x4 FP64 MMA instruction per clock cycle.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;p&gt;A high number of compute units (CUDA and Tensor cores), a high number of threads and registers (oversubscription), a reduced instruction set, no L3 cache, HBM, and a simple, high-throughput memory access pattern (compared to the CPU's context switching, multi-layer caching, memory paging, TLB, etc.) are the principles that make GPUs so much better than CPUs at parallel computing (graphics rendering, deep learning, etc.).&lt;/p&gt;

&lt;h1&gt;
  
  
  Beyond GPUs
&lt;/h1&gt;

&lt;p&gt;GPUs were first created for handling graphics processing tasks. AI researchers then started taking advantage of CUDA and its direct access to powerful parallel processing via CUDA cores. NVIDIA GPUs have Texture Processing, Ray Tracing, Raster, and PolyMorph engines (call them graphics-specific instruction sets). With increasing adoption in AI, Tensor cores, which are good at 4x4 matrix calculations (the MMA instruction), are being added, dedicated to deep learning.&lt;/p&gt;

&lt;p&gt;Since 2017, NVIDIA has been increasing the number of Tensor cores with each architecture. But these GPUs also remain good at graphics processing: although the instruction set and its complexity are much smaller than a CPU's, the GPU is not fully dedicated to deep learning (especially the transformer architecture).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2307.08691"&gt;FlashAttention 2&lt;/a&gt;, a software layer optimization (mechanical sympathy for attention layer's memory access pattern) for transformer architecture provides 2X speedup in tasks.&lt;/p&gt;

&lt;p&gt;With our in-depth, first-principles understanding of CPUs and GPUs, we can understand the need for transformer accelerators: dedicated chips (circuits built only for transformer operations) with an even larger number of compute units for parallelism, a reduced instruction set, no L1/L2 caches, and massive on-chip memory replacing HBM, with memory units optimized for the memory access pattern of the transformer architecture. After all, LLMs are new companions for humans (after the web and mobile), and they need dedicated chips for efficiency and performance.&lt;/p&gt;

&lt;h4&gt;
  
  
  Some AI Accelerators:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://aws.amazon.com/machine-learning/inferentia/"&gt;AWS Inferentia&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/tpu"&gt;Google TPU&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.cerebras.net/blog/cerebras-cs3"&gt;Cerebras CS3&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;...&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Transformer Accelerators:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://wow.groq.com/why-groq/"&gt;Groq LPU&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.etched.com/"&gt;Transformers Etched into silicon&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;...&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  FPGA based Transformer Accelerators:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.achronix.com/blog/fpga-accelerated-large-language-models-used-chatgpt"&gt;Achronix&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;...&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  References:
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Von_Neumann_architecture"&gt;https://en.wikipedia.org/wiki/Von_Neumann_architecture&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://chsasank.com/llm-system-design.html"&gt;https://chsasank.com/llm-system-design.html&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.redhat.com/sysadmin/cpu-components-functionality"&gt;https://www.redhat.com/sysadmin/cpu-components-functionality&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.wixstatic.com/ugd/56440f_e458602dcb0c4af9aaeb7fdaa34bb2b4.pdf"&gt;https://docs.wixstatic.com/ugd/56440f_e458602dcb0c4af9aaeb7fdaa34bb2b4.pdf&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.nand2tetris.org/course"&gt;https://www.nand2tetris.org/course&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://cpu.land/"&gt;https://cpu.land/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Hyper-threading"&gt;https://en.wikipedia.org/wiki/Hyper-threading&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How do Video Game Graphics Work? - &lt;a href="https://youtu.be/C8YtdC8mxTU?si=OdrFXUFMLBhuZF34"&gt;https://youtu.be/C8YtdC8mxTU?si=OdrFXUFMLBhuZF34&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CPU vs GPU vs TPU vs DPU vs QPU - &lt;a href="https://www.youtube.com/watch?v=r5NQecwZs1A"&gt;https://www.youtube.com/watch?v=r5NQecwZs1A&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How GPU Computing Works | GTC 2021 | Stephen Jones - &lt;a href="https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31151/"&gt;https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31151/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Compute Intensity - &lt;a href="https://www.linkedin.com/pulse/threads-tensor-cores-beyond-unveiling-dynamics-gpu-memory-florit-smg2c/"&gt;https://www.linkedin.com/pulse/threads-tensor-cores-beyond-unveiling-dynamics-gpu-memory-florit-smg2c/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How CUDA Programming Works | GTC Fall 2022 | Stephen Jones - &lt;a href="https://www.nvidia.com/en-us/on-demand/session/gtcfall22-a41101/"&gt;https://www.nvidia.com/en-us/on-demand/session/gtcfall22-a41101/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Why use GPU with Neural Networks? - &lt;a href="https://www.youtube.com/watch?v=GRRMi7UfZHg"&gt;https://www.youtube.com/watch?v=GRRMi7UfZHg&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CUDA Hardware | Tom Nurkkala | Taylor University Lecture - &lt;a href="https://www.youtube.com/watch?v=kUqkOAU84bA"&gt;https://www.youtube.com/watch?v=kUqkOAU84bA&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://ashanpriyadarshana.medium.com/cuda-gpu-memory-architecture-8c3ac644bd64"&gt;https://ashanpriyadarshana.medium.com/cuda-gpu-memory-architecture-8c3ac644bd64&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://colab.research.google.com/drive/1nw34aks9SdMwHXl9Gf5T9GPxRB9BIIyr"&gt;https://colab.research.google.com/drive/1nw34aks9SdMwHXl9Gf5T9GPxRB9BIIyr&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://developer.nvidia.com/blog/cuda-refresher-reviewing-the-origins-of-gpu-computing/"&gt;https://developer.nvidia.com/blog/cuda-refresher-reviewing-the-origins-of-gpu-computing/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>gpu</category>
      <category>mlops</category>
      <category>llm</category>
    </item>
    <item>
      <title>OS Error: Too many open files. Understanding file and socket descriptors.</title>
      <dc:creator>Venkat Raman</dc:creator>
      <pubDate>Tue, 26 Mar 2024 15:12:19 +0000</pubDate>
      <link>https://dev.to/venkat2811/os-error-too-many-open-files-understanding-file-and-socket-descriptors-1o89</link>
      <guid>https://dev.to/venkat2811/os-error-too-many-open-files-understanding-file-and-socket-descriptors-1o89</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgdnattb4v46i9jm9123.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgdnattb4v46i9jm9123.png" alt="banner" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;Engineers who've built, deployed, and operated backend services will have encountered this error. It usually means your service is serving real user requests - yay 🎉! One possible scenario is that you need to fine-tune the server's OS configuration to scale up; the other is that there is a resource leak in your system.&lt;/p&gt;

&lt;p&gt;I've encountered this error four times so far in my 8-year career. I wanted to write about it and share, as it was always interesting.&lt;/p&gt;

&lt;p&gt;Let's start with resource leakage.&lt;/p&gt;
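&lt;p&gt;The failure mode itself is easy to reproduce: every socket (or file) you open consumes a file descriptor, and once the process hits its descriptor limit, the OS refuses to hand out more. A minimal sketch on Linux/macOS that lowers the limit and then leaks sockets until the error appears (illustrative only; don't do this in production):&lt;/p&gt;

```python
# Leak sockets until the process hits its file-descriptor limit.
import resource
import socket

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (64, hard))  # lower the soft limit

sockets, error = [], None
try:
    while True:
        sockets.append(socket.socket())  # "leaked": opened but never closed
except OSError as e:
    error = e  # EMFILE: the OS refused to allocate another descriptor
finally:
    for s in sockets:
        s.close()
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))  # restore limit

print(error)  # e.g. [Errno 24] Too many open files
```

&lt;p&gt;A long-running service that forgets to close connections does exactly this, just slowly, over days, which is why a restart "fixes" it every time.&lt;/p&gt;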

&lt;h2&gt;
  
  
  Service written in kotlin using ktor
&lt;/h2&gt;

&lt;p&gt;In late 2019, our team wanted to experiment with Kotlin and Ktor as an alternative to Spring Boot. We wanted to quickly try it out in a simple microservice receiving just a few hundred requests a day. &lt;em&gt;Our app was deployed and had been serving customer requests for a few days without any issue. There were no server restarts since deployment.&lt;/em&gt; One morning, 500 service error alerts were triggered. Looking at the server logs, requests were being accepted but erroring out in one of the processing steps. I reproduced the issue in the prod environment to gather recent logs and restarted the service. A bug was filed but not yet prioritised (not a critical service receiving millions of requests). A couple of days went by, and the same issue started occurring in the evening. Our friend, the server restart, solved it again 😊&lt;/p&gt;

&lt;p&gt;By this time, a few L1 customer tickets had also been filed, and I started looking into the issue. Prometheus showed that memory usage increased over time and flatlined around the time the service started rejecting requests. The logs also showed that the errors originated in one of our processing steps where the Ktor OkHttp client was used. We found &lt;a href="https://github.com/ktorio/ktor/issues/1009"&gt;this issue&lt;/a&gt; on GitHub, and upgrading the library solved the problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Service written in Python using LangChain &amp;amp; the OpenAI lib
&lt;/h2&gt;

&lt;p&gt;LangChain is a framework for developing applications (RAG &amp;amp; AI agents) powered by language models. &lt;em&gt;Our app was deployed and had been serving customer requests for a few days without any issue. There were no server restarts since deployment&lt;/em&gt; (see the pattern?). One afternoon in early 2024, 500 service error alerts were triggered. Looking at the server logs, requests were being rejected with &lt;code&gt;OS Error: Too many open files&lt;/code&gt;. A good old server restart quickly fixed the error and the service resumed serving user requests. My immediate hunch (from the Ktor issue a few years earlier) was that there was an underlying resource leak.&lt;/p&gt;

&lt;p&gt;I wanted to reproduce the issue in the staging environment. A quick Google search surfaced &lt;a href="https://github.com/langchain-ai/langchain/issues/13509"&gt;this&lt;/a&gt; issue. So I monitored the following while simulating a few hundred requests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Processes grouped by name &amp;amp; connection status, sorted by count&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Local and remote addr of connections in &lt;code&gt;CLOSE_WAIT&lt;/code&gt; status&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
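&lt;p&gt;For reference, here's a minimal sketch of the kind of check involved (my own illustration, not the exact commands from the original investigation, and Linux-specific): parsing &lt;code&gt;/proc/net/tcp&lt;/code&gt; to list connections stuck in &lt;code&gt;CLOSE_WAIT&lt;/code&gt;. In practice, tools like &lt;code&gt;ss&lt;/code&gt;, &lt;code&gt;lsof&lt;/code&gt; or osquery report the same information.&lt;/p&gt;

```python
CLOSE_WAIT = "08"  # TCP state code for CLOSE_WAIT in /proc/net/tcp

def _decode(addr_port):
    # /proc/net/tcp stores addresses as "AABBCCDD:PPPP": a little-endian
    # hex IPv4 address and a hex port number.
    addr_hex, port_hex = addr_port.split(":")
    octets = [str(int(addr_hex[i:i + 2], 16)) for i in range(6, -2, -2)]
    return ".".join(octets), int(port_hex, 16)

def close_wait_connections(proc_net_tcp="/proc/net/tcp"):
    """Return (local, remote) address pairs for sockets stuck in CLOSE_WAIT."""
    conns = []
    with open(proc_net_tcp) as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.split()
            if fields[3] == CLOSE_WAIT:
                conns.append((_decode(fields[1]), _decode(fields[2])))
    return conns
```

&lt;p&gt;A steadily growing result from a check like this, with remote addresses pointing at one external service, is exactly the signature of a client-library leak.&lt;/p&gt;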

&lt;p&gt;And the remote address matched OpenAI's API domain. Since LangChain uses the LLM provider's client lib to connect to and interact with the models, the leak had to be in the OpenAI client lib. A quick search of the openai GitHub issues showed that it had &lt;a href="https://github.com/openai/openai-python/issues/933#issuecomment-1857718386"&gt;already&lt;/a&gt; been addressed and fixed. So our fix was to upgrade the underlying openai lib version. The fix was verified in staging and rolled out to customers.&lt;/p&gt;

&lt;p&gt;There is a small difference in how the 500 service errors were triggered in the two services above. The Kotlin service using the Ktor server accepted the request and errored out in the processing step that used the Ktor OkHttp client. The Python service using the Flask server errored out while accepting the request for processing. I will punt on this for now and cover it in a separate post, as it deals with differences between server frameworks.&lt;/p&gt;

&lt;p&gt;Before fine-tuning server configuration to scale up, let's understand network connections.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding connections &amp;amp; OS files
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Opening a file
&lt;/h3&gt;

&lt;p&gt;When a process opens a file, a file descriptor is created: an entry in the process's file descriptor table that refers to the kernel's open file table entry, which tracks the file position (offset), access mode (read, write, or both), and file status flags (such as whether the file is open for appending or is non-blocking). When the file is closed, the file descriptor is released, freeing the associated system resources and removing its entry from the process's file descriptor table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lw-nwzEM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1711449713803/08a3389d-f6c4-4350-8fd4-eed120d328a2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lw-nwzEM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1711449713803/08a3389d-f6c4-4350-8fd4-eed120d328a2.png" alt="two processes opening the same file" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  IPC via shared message queue
&lt;/h3&gt;

&lt;p&gt;Processes on the same machine often use Inter-Process Communication (IPC) mechanisms like message queues for data exchange. Message queues are associated with unique identifiers, akin to file descriptors, enabling processes to access them using standard file I/O operations. They provide synchronization and data buffering, facilitating asynchronous communication and enabling processes to operate independently without waiting for message exchange.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lwn4vuiO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1711458906492/033f9da6-5774-4063-9131-5b1e637de6f3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lwn4vuiO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1711458906492/033f9da6-5774-4063-9131-5b1e637de6f3.png" alt="two processes communicating via shared message queue" width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Client server communication via HTTP
&lt;/h3&gt;

&lt;p&gt;Similarly, in client-server communication, when an HTTP request is made, the library (running in a process or thread) creates a network file descriptor with the following metadata: the network socket type (TCP or UDP), the local and remote addresses and ports, socket options (such as whether the socket is reusable, or whether it's in blocking or non-blocking mode), and a reference to the corresponding socket data structures in the operating system's networking stack.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3BHmuB-C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1711465716668/909893cf-fcb1-4ad1-a3a4-2d6e68d6d1e4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3BHmuB-C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1711465716668/909893cf-fcb1-4ad1-a3a4-2d6e68d6d1e4.png" alt="A process using two network sockets" width="800" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Message queue descriptors (with the open MQ descriptor table) and socket descriptors (with the open socket descriptor table) are all treated as file descriptors in a file descriptor table by the OS (Linux and POSIX). So far we've covered a high-level overview of file descriptors; see the references at the end for more details.&lt;/p&gt;

&lt;h3&gt;
  
  
  ulimit
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;ulimit&lt;/code&gt; is a command-line utility in Unix-like operating systems used to control and report resource limits for processes. It can be used to set limits on the maximum number of file descriptors that a process can open. This is important for preventing resource exhaustion and ensuring system stability. By adjusting the &lt;code&gt;nofile&lt;/code&gt; (or &lt;code&gt;open files&lt;/code&gt;) limit with &lt;code&gt;ulimit&lt;/code&gt;, one can control how many files a process can have open simultaneously, including regular files, directories, pipes, and sockets.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Soft Limit&lt;/strong&gt; : The limit currently enforced for the process. When a process tries to open more file descriptors than the soft limit allows, the call fails with an error such as &lt;code&gt;EMFILE&lt;/code&gt; ("Too many open files").&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hard Limit&lt;/strong&gt; : The ceiling for the soft limit. An unprivileged process can raise its soft limit only up to the hard limit; raising the hard limit itself requires privileges.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
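&lt;p&gt;From inside a process (on Unix-like systems), these two limits can be inspected with Python's standard &lt;code&gt;resource&lt;/code&gt; module; &lt;code&gt;RLIMIT_NOFILE&lt;/code&gt; is the open-file-descriptor limit that &lt;code&gt;ulimit -n&lt;/code&gt; reports. A quick sketch:&lt;/p&gt;

```python
import resource  # Unix-only standard library module

def nofile_limits():
    """Return the (soft, hard) limits on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

&lt;p&gt;A process can raise its own soft limit up to the hard limit with &lt;code&gt;resource.setrlimit&lt;/code&gt;; raising the hard limit requires privileges.&lt;/p&gt;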

&lt;h2&gt;
  
  
  Load testing my GSoC project
&lt;/h2&gt;

&lt;p&gt;In my final semester I worked on building an HTTP Load Balancer on top of the WSO2 gateway. The gateway core uses LMAX Disruptor - a high-performance inter-thread messaging library developed at the LMAX Exchange, renowned for its "mechanical sympathy" approach. It facilitates low-latency, high-throughput messaging between threads, crucial for real-time financial trading systems, by minimizing contention and maximizing CPU cache efficiency. I will discuss this in a separate blog post.&lt;/p&gt;

&lt;p&gt;I wanted to run some benchmarks to see how my load balancer fared against nginx, and quickly started hitting the too many open files error. I had to change OS configuration to increase the number of concurrent connections.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# /etc/security/limits.conf
#
#Each line describes a limit for a user in the form:
#
#&amp;lt;domain&amp;gt;        &amp;lt;type&amp;gt;  &amp;lt;item&amp;gt;  &amp;lt;value&amp;gt;
#

*         hard    nofile      500000
*         soft    nofile      500000
root      hard    nofile      500000
root      soft    nofile      500000

# End of file
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#
# /etc/sysctl.conf - Configuration file for setting system variables
# See /etc/sysctl.d/ for additional system variables.
# See sysctl.conf (5) for information.
#

net.ipv4.netfilter.ip_conntrack_max = 32768
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_orphan_retries = 1
net.ipv4.tcp_fin_timeout = 5
net.ipv4.tcp_max_orphans = 32768
net.ipv4.ip_local_port_range = 1025    61000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also refer to these configurations here: &lt;a href="https://github.com/wso2-incubator/HTTP-Load-balancer/tree/master/performance-benchmark/test-bed"&gt;https://github.com/wso2-incubator/HTTP-Load-balancer/tree/master/performance-benchmark/test-bed&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimizing resources and auto scaling policy for spring boot microservice
&lt;/h2&gt;

&lt;p&gt;I wanted to evaluate how many concurrent connections a single container of ours could handle, to optimize the autoscaling policy. The default ECS container ulimit is 1024. After ~600 parallel user requests, with memory &amp;amp; CPU utilization at ~50% &amp;amp; ~30% respectively, I started seeing &lt;code&gt;too many open files&lt;/code&gt; errors. The p99 for these 600 parallel user requests was 2s. I increased the container ulimit to 2400, and also increased the DB &amp;amp; HTTP connection pool sizes (I'll write about why connection pooling is important in a separate post). With the increased limits &amp;amp; optimizations, the benchmark showed more than 90% memory and 60% CPU utilization. Based on these, autoscaling was set to trigger at 85% memory utilization.&lt;/p&gt;
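&lt;p&gt;On ECS, the per-container limit is set via the &lt;code&gt;ulimits&lt;/code&gt; field of the task definition. A rough illustration (the container name is made up; see the ECS Ulimit API reference in the references section for the exact schema):&lt;/p&gt;

```json
{
  "containerDefinitions": [
    {
      "name": "spring-boot-service",
      "ulimits": [
        { "name": "nofile", "softLimit": 2400, "hardLimit": 2400 }
      ]
    }
  ]
}
```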

&lt;p&gt;Thanks for reading !&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://man7.org/tlpi/download/TLPI-52-POSIX_Message_Queues.pdf"&gt;https://man7.org/tlpi/download/TLPI-52-POSIX_Message_Queues.pdf&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.usna.edu/Users/cs/wcbrown/courses/IC221/classes/L09/Class.html"&gt;https://www.usna.edu/Users/cs/wcbrown/courses/IC221/classes/L09/Class.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.codequoi.com/en/handling-a-file-by-its-descriptor-in-c/"&gt;https://www.codequoi.com/en/handling-a-file-by-its-descriptor-in-c/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_Ulimit.html"&gt;https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_Ulimit.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://osquery.io/schema/5.11.0/#process_open_files"&gt;https://osquery.io/schema/5.11.0/#process_open_files&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://osquery.io/schema/5.11.0/#process_open_sockets"&gt;https://osquery.io/schema/5.11.0/#process_open_sockets&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://osquery.io/schema/5.11.0/#file"&gt;https://osquery.io/schema/5.11.0/#file&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://osquery.io/schema/5.11.0/#device_file"&gt;https://osquery.io/schema/5.11.0/#device_file&lt;/a&gt;&lt;/p&gt;

</description>
      <category>openfileserror</category>
      <category>ktor</category>
      <category>openai</category>
      <category>langchain</category>
    </item>
    <item>
      <title>HTTP Load Balancer on Top of WSO2 Gateway — Part 2: Performance Benchmarks</title>
      <dc:creator>Venkat Raman</dc:creator>
      <pubDate>Fri, 19 Aug 2016 13:00:54 +0000</pubDate>
      <link>https://dev.to/venkat2811/http-load-balancer-on-top-of-wso2-gateway-part-2-performance-benchmarks-638</link>
      <guid>https://dev.to/venkat2811/http-load-balancer-on-top-of-wso2-gateway-part-2-performance-benchmarks-638</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ldJ8Kl3D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668358081421/n6O9TKZrX.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ldJ8Kl3D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668358081421/n6O9TKZrX.png" alt="banner-gsoc2016_2.png" width="640" height="191"&gt;&lt;/a&gt;In my &lt;a href="https://venkat2811.blogspot.in/2016/08/http-load-balancer-on-top-of-wso2.html"&gt;&lt;strong&gt;previous post&lt;/strong&gt;&lt;/a&gt;, I discussed the Load Balancer Engine Architecture and its features. In this post, I'll be discussing performance benchmarks.&lt;/p&gt;

&lt;p&gt;Kindly note that the underlying carbon-gateway-framework is under development, and hence more features and improvements will be coming to this LB.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance Benchmarks
&lt;/h3&gt;

&lt;p&gt;In this performance test, five instances of a simple service built with the Netty framework were used. Each instance is a &lt;strong&gt;fast backend (0s delay)&lt;/strong&gt; with a &lt;strong&gt;response of size 1KB&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One million (1,000,000) requests&lt;/strong&gt; were sent at different &lt;strong&gt;concurrency levels (500 to 12,000)&lt;/strong&gt; to the Netty backend, Nginx (open source version) and GW-LB using ApacheBench, via an automated script.&lt;/p&gt;

&lt;p&gt;Benchmarks were conducted in Round-Robin algorithm mode with no persistence policies.&lt;/p&gt;
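&lt;p&gt;For readers unfamiliar with it, round-robin in its simplest form just hands each request to the next backend in a fixed cyclic order. A minimal sketch of my own (not the GW-LB implementation):&lt;/p&gt;

```python
import itertools

class RoundRobin:
    """Pick backends in a fixed cyclic order, with no persistence."""

    def __init__(self, endpoints):
        self._cycle = itertools.cycle(endpoints)

    def next_endpoint(self):
        return next(self._cycle)
```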

&lt;p&gt;More details can be found &lt;a href="https://github.com/wso2-incubator/HTTP-Load-balancer/tree/master/performance-benchmark#performance-test-using-high-performance-netty-back-end"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test Bed
&lt;/h3&gt;

&lt;p&gt;These are the configurations used during benchmarking.&lt;/p&gt;

&lt;h4&gt;
  
  
  VM Details
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Guest OS:&lt;/strong&gt; Ubuntu 64-bit 16.04 VM&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RAM:&lt;/strong&gt; 8 GB&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CPU cores:&lt;/strong&gt; 4&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;JVM Version:&lt;/strong&gt; 1.8.0_91&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Java Runtime:&lt;/strong&gt; Java(TM) SE Runtime Environment (build 1.8.0_91-b14)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Java HotSpot:&lt;/strong&gt; Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Host Machine Details
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Host OS:&lt;/strong&gt; OS X El Capitan Version 10.11.5 (15F34) MacBook Pro (Mid 2015)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hypervisor:&lt;/strong&gt; VMware Fusion Professional Version 8.1.1 (3771013)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Processor:&lt;/strong&gt; 2.5 GHz Intel Core i7&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory:&lt;/strong&gt; 16 GB 1600 MHz DDR3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; &lt;strong&gt;HTTP Load Balancer on Top of WSO2 Gateway&lt;/strong&gt; is referred to as &lt;strong&gt;GW-LB&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Throughput Test
&lt;/h3&gt;

&lt;p&gt;Tests were run twice. The average throughput across both runs, for each concurrency level, was calculated and plotted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 1&lt;/strong&gt; shows throughput comparison between Open Source Nginx and GW-LB and &lt;strong&gt;Figure 1.1&lt;/strong&gt; shows throughput comparison along with Netty backend.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QiIFeJWW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355838343/g4AHVA72J.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QiIFeJWW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355838343/g4AHVA72J.png" alt="" width="600" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 1 : Nginx Open Source Version vs GW-LB&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Bp4VYn0L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355839793/n_1zahJDN.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Bp4VYn0L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355839793/n_1zahJDN.png" alt="" width="600" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 1.1 : Nginx Open Source Version vs GW-LB along with Netty Back-End&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Latency Test
&lt;/h3&gt;

&lt;p&gt;Tests were run twice. The mean latency across both runs, for each concurrency level, was calculated and plotted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--G2ZwXLmy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355840902/tQkvfshSr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--G2ZwXLmy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355840902/tQkvfshSr.png" alt="" width="600" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 2: Nginx Open Source Version vs GW-LB along with Netty BE&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory Test
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Java Flight Recorder (JFR)&lt;/strong&gt; was enabled while starting the LB server, and the recording was stopped after the load test ended. The resulting JFR recording contains the memory usage details.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 3&lt;/strong&gt; shows &lt;strong&gt;Committed, Reserved and Used Heap&lt;/strong&gt; values.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uWzbEi_U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355842109/4WVeCjdak.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uWzbEi_U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355842109/4WVeCjdak.png" alt="" width="800" height="610"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 3: Committed vs Reserved vs Used Heap memory values of GW-LB&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It took three code reviews and several rounds of performance benchmarking to get these results. Guidance and suggestions from my mentors and community members were very helpful. I enjoyed the discussions with my mentors, and it was a great learning experience too! I loved this phase of my GSoC project more than any other, because our discussions were very good and I could sense that my mentors were as eager and motivated as me to achieve good performance.&lt;/p&gt;

&lt;p&gt;I also kept track of the performance improvements we made by recording them in an issue, in the hope it might be helpful one day. You can find it &lt;a href="https://github.com/Venkat2811/product-http-load-balancer/issues/5"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sample Configuration
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You can find a sample config &lt;a href="https://github.com/wso2-incubator/HTTP-Load-balancer#sample-configuration"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Also find more samples &lt;a href="https://github.com/wso2-incubator/HTTP-Load-balancer/tree/master/product/carbon-home/samples"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kindly note that the DSL and the underlying Gateway Framework are evolving and will change over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Javadoc
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You can find javadoc &lt;a href="https://github.com/wso2-incubator/HTTP-Load-balancer/blob/master/docs/javadoc_http_load_balancer.zip"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thanks for reading and Happy Coding !&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://venkat.eu/http-load-balancer-on-top-of-wso2-gateway-part-2-performance-benchmarks-e06cc63e256d"&gt;HTTP Load Balancer on Top of WSO2 Gateway — Part 2: Performance Benchmarks&lt;/a&gt; &lt;em&gt;on August 18, 2016.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>HTTP Load Balancer on Top of WSO2 Gateway — Part 1: Project Repository, Architecture and Features</title>
      <dc:creator>Venkat Raman</dc:creator>
      <pubDate>Thu, 18 Aug 2016 15:04:55 +0000</pubDate>
      <link>https://dev.to/venkat2811/http-load-balancer-on-top-of-wso2-gateway-part-1-project-repository-architecture-and-features-bm8</link>
      <guid>https://dev.to/venkat2811/http-load-balancer-on-top-of-wso2-gateway-part-1-project-repository-architecture-and-features-bm8</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XcuItZxk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668356758609/EX8b6jpm0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XcuItZxk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668356758609/EX8b6jpm0.png" alt="banner-gsoc2016_2.png" width="640" height="191"&gt;&lt;/a&gt;It's been almost four months, and it has been an amazing journey! At this point, I would like to thank my mentors &lt;a href="https://github.com/isururanawaka"&gt;&lt;strong&gt;Isuru Ranawaka&lt;/strong&gt;&lt;/a&gt; and &lt;a href="https://github.com/kasun04"&gt;&lt;strong&gt;Kasun Indrasiri&lt;/strong&gt;&lt;/a&gt; and WSO2 community members, especially &lt;a href="https://github.com/bsenduran"&gt;&lt;strong&gt;Senduran&lt;/strong&gt;&lt;/a&gt; and &lt;a href="https://github.com/isudana"&gt;&lt;strong&gt;Isuru Udana&lt;/strong&gt;&lt;/a&gt;, for continuously mentoring, supporting and guiding me throughout this project.&lt;/p&gt;

&lt;p&gt;Google announced the list of accepted mentoring organizations, and I was looking for projects related to networking and Java. I came across WSO2's &lt;a href="https://docs.wso2.com/display/GSoC/Project+Proposals+for+2016#ProjectProposalsfor2016-Proposal8:%5BESB/GW%5DHTTPLoadbalancerontopofWSO2Gateway"&gt;&lt;strong&gt;idea list&lt;/strong&gt;&lt;/a&gt; and was pretty excited to see an HTTP Load Balancer as a project idea. I had a good understanding of load balancers and was eager to get into their internals and develop one. So I contacted my mentors, and they gave me an idea of how to get started with the WSO2 stack. They asked me to come up with the set of features that I was willing to develop as part of this project. I also got great help and guidance from WSO2 community members right from the time of writing the proposal. With their guidance and suggestions, I was able to come up with a basic architecture, a set of features, and a tentative timeline.&lt;/p&gt;

&lt;p&gt;Once the selected project proposals were announced, my mentor gave me a clear idea of what was expected and how to proceed with the project. Here are my previous blog posts on the &lt;a href="https://venkat2811.blogspot.in/2016/05/gsoc-community-bonding-period.html"&gt;&lt;strong&gt;community bonding period&lt;/strong&gt;&lt;/a&gt; and &lt;a href="https://venkat2811.blogspot.in/2016/07/gsoc-mid-term-evaluation.html"&gt;&lt;strong&gt;mid-term evaluations&lt;/strong&gt;&lt;/a&gt;. I've completed all the features that I had committed to in my project proposal. Based on the performance benchmarks I have done (discussed in the &lt;a href="https://venkat2811.blogspot.in/2016/08/http-load-balancer-on-top-of-wso2_18.html"&gt;&lt;strong&gt;next post&lt;/strong&gt;&lt;/a&gt;), this &lt;strong&gt;Load Balancer performs better than Nginx (Open Source Version)&lt;/strong&gt;. My mentors are also happy with the outcome. There is a lot more to be done to take this LB to production, such as performance improvements, the ability to function in multi-level mode, etc., and moreover the underlying Carbon Gateway Framework is continuously evolving. Even after the GSoC period, I'll be contributing to this project to make it production ready.&lt;/p&gt;

&lt;p&gt;In this post, I'll be discussing the High Level Architecture, Engine Architecture, Message Flow and Load Balancer-specific features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; &lt;strong&gt;HTTP Load Balancer on Top of WSO2 Gateway&lt;/strong&gt; is referred to as &lt;strong&gt;GW-LB&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Project Repository
&lt;/h3&gt;

&lt;p&gt;This GSoC project has been added to the &lt;a href="https://github.com/wso2-incubator/HTTP-Load-balancer"&gt;&lt;strong&gt;WSO2 Incubator&lt;/strong&gt;&lt;/a&gt;, and I've been given membership of the WSO2 Incubator Organization :D !&lt;/p&gt;

&lt;p&gt;Since GW-LB has a standalone run-time, it is developed and managed as a separate project. You can also find the project in my &lt;a href="https://github.com/Venkat2811/product-http-load-balancer"&gt;&lt;strong&gt;personal repository&lt;/strong&gt;&lt;/a&gt;, from which it was added to the WSO2 Incubator.&lt;/p&gt;

&lt;p&gt;The Carbon Gateway Framework with ANTLR grammar support for the LB can be found &lt;a href="https://github.com/Venkat2811/carbon-gateway-framework-with-LB"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt;. As the DSL and gateway framework are evolving, there will be some changes to the grammar in the future. &lt;strong&gt;Please find my commits for handling LB-specific configurations and ANTLR grammar support&lt;/strong&gt; &lt;a href="https://github.com/Venkat2811/carbon-gateway-framework-with-LB/commits/master?author=Venkat2811"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/wso2-incubator/HTTP-Load-balancer#building-product"&gt;&lt;strong&gt;README&lt;/strong&gt;&lt;/a&gt; file has instructions to build and work with the product. You could also simply extract this &lt;a href="https://github.com/Venkat2811/HTTP-Load-Balancer-Zip-File/blob/master/wso2gwlbserver-1.0.0-SNAPSHOT.zip"&gt;&lt;strong&gt;file&lt;/strong&gt;&lt;/a&gt; and try it out !&lt;/p&gt;

&lt;h3&gt;
  
  
  High Level Architecture
&lt;/h3&gt;

&lt;p&gt;&lt;a href="http://wso2.com/blogs/thesource/2016/07/gateway-server-framework/"&gt;&lt;strong&gt;WSO2 Gateway Framework&lt;/strong&gt;&lt;/a&gt; is a high performance, lightweight, low-latency messaging framework based on standard gateway pattern. Its &lt;strong&gt;Netty based non-blocking IO and Disruptor (ring-buffer) architecture&lt;/strong&gt; makes it the fastest open-source gateway available. Benchmarks[1] show that the performance of gateway is very high when compared to other solutions and is close to the direct netty based backend (without any intermediate gateway).&lt;/p&gt;

&lt;p&gt;GW-LB makes use of WSO2's &lt;a href="https://github.com/wso2/carbon-gateway-framework"&gt;&lt;strong&gt;Carbon Gateway Framework&lt;/strong&gt;&lt;/a&gt;, &lt;a href="https://github.com/wso2/carbon-transports/"&gt;&lt;strong&gt;Carbon Transports&lt;/strong&gt;&lt;/a&gt; and &lt;a href="https://github.com/wso2/carbon-messaging"&gt;&lt;strong&gt;Carbon Messaging&lt;/strong&gt;&lt;/a&gt;. These are highly modular and easily extensible, as they are OSGi bundles and are part of the &lt;a href="http://wso2.com/products/carbon/"&gt;&lt;strong&gt;WSO2 Carbon Platform&lt;/strong&gt;&lt;/a&gt;. This LB is itself an OSGi bundle built on top of the carbon gateway framework. When all these bundles are packaged together along with the &lt;a href="https://github.com/wso2/carbon-kernel"&gt;&lt;strong&gt;Carbon Kernel&lt;/strong&gt;&lt;/a&gt;, they form the &lt;strong&gt;GW-LB Server&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Carbon Gateway Framework provides &lt;strong&gt;configuration management and basic mediation capabilities&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Carbon Transports acts as the &lt;strong&gt;transport layer&lt;/strong&gt; within the WSO2 stack.&lt;/li&gt;
&lt;li&gt;Within the WSO2 stack, messages (requests / responses) are mediated in the form of &lt;strong&gt;Carbon Messages&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When a request reaches &lt;strong&gt;Carbon Transports (WSO2-Netty Listener)&lt;/strong&gt;, the additional layer required for mediation is added and it becomes a Carbon Message. Similarly, after mediation, when the Carbon Message reaches &lt;strong&gt;Carbon Transports (WSO2-Netty Sender)&lt;/strong&gt;, all Carbon Message related details are removed and the message is sent to the corresponding endpoint. The same happens in reverse when a response arrives from the back-end. This flow is clearly shown in &lt;strong&gt;Figure 1&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YU5oSiJs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355861612/wn6ShKcui.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YU5oSiJs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355861612/wn6ShKcui.png" alt="" width="640" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 1: High Level Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Engine Architecture
&lt;/h3&gt;

&lt;p&gt;This LB has been built keeping in mind the modular and extensible nature of the WSO2 product stack. WSO2 uses &lt;a href="http://www.antlr.org/"&gt;&lt;strong&gt;ANTLR4&lt;/strong&gt;&lt;/a&gt; to develop a domain-specific language &lt;strong&gt;(DSL)&lt;/strong&gt; for its carbon gateway framework. This DSL is used to configure and define mediation rules for the various products built using this framework, including this LB.&lt;/p&gt;

&lt;p&gt;The Gateway Framework and DSL are continuously evolving, and the LB Engine is completely decoupled from the DSL. Developers can also easily develop their own LB algorithms and persistence policies and plug them into this LB.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 2&lt;/strong&gt; clearly depicts the modules that are specific to LB Engine, Carbon Gateway Framework and Carbon Transports.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--30H2Q4De--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355863156/YZVZ6qWQv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--30H2Q4De--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355863156/YZVZ6qWQv.png" alt="" width="640" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 2: Engine Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Message Flow
&lt;/h3&gt;

&lt;p&gt;As mentioned above, modules within the WSO2 stack communicate via Carbon Messages. Refer to &lt;strong&gt;Figure 3&lt;/strong&gt; for a clear picture of how a message flows through the various LB modules.&lt;/p&gt;

&lt;h3&gt;
  
  
  Request Flow: From Client -&amp;gt; LB -&amp;gt; Back-End
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;When a &lt;strong&gt;client's request&lt;/strong&gt; reaches the &lt;strong&gt;WSO2-Netty Listener&lt;/strong&gt;, it is transformed into a &lt;strong&gt;Carbon Message&lt;/strong&gt;. This carbon message then reaches the Inbound Endpoint.&lt;/li&gt;
&lt;li&gt;This &lt;strong&gt;carbon message&lt;/strong&gt; then flows &lt;strong&gt;via Pipeline&lt;/strong&gt; and reaches &lt;strong&gt;LB Mediator&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Each &lt;strong&gt;LB Outbound Endpoint&lt;/strong&gt; has its own &lt;strong&gt;LB Endpoint Call Mediator&lt;/strong&gt;. LB Mediator uses this LB Endpoint Call Mediator to forward request to the corresponding LB Outbound Endpoint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If there is no persistence policy, LB Algorithm&lt;/strong&gt; returns the &lt;strong&gt;name of LB Outbound Endpoint&lt;/strong&gt; to which LB Mediator has to forward the request.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;If there is any persistence policy, LB Mediator&lt;/strong&gt; &lt;strong&gt;takes the appropriate action&lt;/strong&gt; (discussed later) to find the &lt;strong&gt;name of the LB Outbound Endpoint&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LB Mediator&lt;/strong&gt; then passes the carbon message to &lt;strong&gt;LB Endpoint Call Mediator&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;This &lt;strong&gt;LB Endpoint Call Mediator&lt;/strong&gt; creates a &lt;strong&gt;LB Mediator Call Back&lt;/strong&gt; and forwards the carbon message to &lt;strong&gt;LB Outbound Endpoint&lt;/strong&gt; which in-turn forwards message to &lt;strong&gt;Outbound Endpoint&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outbound Endpoint&lt;/strong&gt; then forwards carbon message to &lt;strong&gt;back-end service&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;When the &lt;strong&gt;carbon message&lt;/strong&gt; reaches the &lt;strong&gt;WSO2-Netty Sender&lt;/strong&gt;, it is transformed back into the &lt;strong&gt;original client request&lt;/strong&gt; and sent to the &lt;strong&gt;corresponding back-end service&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_nTHDM6K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355864408/2wymPWgw4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_nTHDM6K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355864408/2wymPWgw4.png" alt="" width="640" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 3: Message Flow&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Response Flow: From Back-End -&amp;gt; LB -&amp;gt; Client
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;When the &lt;strong&gt;response from the back-end&lt;/strong&gt; reaches the &lt;strong&gt;WSO2-Netty Sender&lt;/strong&gt;, it is transformed into a &lt;strong&gt;Carbon Message&lt;/strong&gt; and then its corresponding &lt;strong&gt;LB Mediator Callback&lt;/strong&gt; is invoked.&lt;/li&gt;
&lt;li&gt;Based on the configured &lt;strong&gt;session persistence policy&lt;/strong&gt;, the &lt;strong&gt;LB Mediator Callback&lt;/strong&gt; takes the action required for session persistence and forwards the carbon message to the &lt;strong&gt;Response Mediator.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Response Mediator&lt;/strong&gt; then forwards message to &lt;strong&gt;Pipeline&lt;/strong&gt; which in-turn forwards message to &lt;strong&gt;Inbound Endpoint&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inbound Endpoint&lt;/strong&gt; then forwards message to corresponding &lt;strong&gt;client&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;When the &lt;strong&gt;carbon message&lt;/strong&gt; reaches the &lt;strong&gt;WSO2-Netty Listener&lt;/strong&gt;, it is transformed back into the &lt;strong&gt;original back-end response&lt;/strong&gt; and sent to the &lt;strong&gt;corresponding client&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Outbound Endpoints
&lt;/h3&gt;

&lt;p&gt;Back-end service endpoints are mapped as Outbound Endpoints in the Carbon Gateway Framework. The LB Engine requires a few additional attributes on these Outbound Endpoints for load balancing. &lt;strong&gt;Figure 4&lt;/strong&gt; explains the differences between an Outbound Endpoint, an LB Outbound Endpoint, a Weighted LB Outbound Endpoint, and an LB Outbound Endpoint for the Least Response Time algorithm.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rQovv6dD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355865588/WEMTZ-9zS.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rQovv6dD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355865588/WEMTZ-9zS.png" alt="" width="640" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 4: Different Outbound Endpoints in GW-LB&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Features
&lt;/h3&gt;

&lt;p&gt;This LB supports various load balancing algorithms, session persistence policies, health checking and redirection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Algorithms
&lt;/h3&gt;

&lt;p&gt;This LB supports both weighted and non-weighted algorithms. In non-weighted algorithms, all Outbound Endpoints are considered to be of equal weight. In weighted algorithms, a weight can be configured for each Outbound Endpoint. If no weight is specified for an endpoint, a default weight of 1 is used.&lt;/p&gt;

&lt;h3&gt;
  
  
  Non-Weighted (Simple) Algorithms
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1) Round-Robin
&lt;/h4&gt;

&lt;p&gt;The LB Mediator forwards requests to Outbound Endpoints in a round-robin fashion. If there is a persistence policy, the LB Mediator forwards the request to the Outbound Endpoint based on it.&lt;/p&gt;
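
&lt;p&gt;As a rough illustration (a Python sketch with hypothetical names, not the LB Mediator's actual code), round-robin selection simply cycles an index over the configured Outbound Endpoints:&lt;/p&gt;

```python
# Illustrative sketch: round-robin selection cycles an index
# over the configured outbound endpoints, wrapping around.
class RoundRobin:
    def __init__(self, endpoints):
        self.endpoints = list(endpoints)
        self.cursor = 0

    def next(self):
        endpoint = self.endpoints[self.cursor % len(self.endpoints)]
        self.cursor += 1
        return endpoint

rr = RoundRobin(["A", "B", "C"])
print([rr.next() for _ in range(4)])  # ['A', 'B', 'C', 'A']
```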

&lt;h4&gt;
  
  
  2) Random
&lt;/h4&gt;

&lt;p&gt;The LB Mediator forwards requests to Outbound Endpoints in a random fashion. If there is a persistence policy, the LB Mediator forwards the request to the Outbound Endpoint based on it.&lt;/p&gt;

&lt;h4&gt;
  
  
  3) Strict Client IP Hashing
&lt;/h4&gt;

&lt;p&gt;The LB looks for the client's IP address in the incoming request headers (request headers are available in the Carbon Message). As of now, the LB looks for the following headers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a)&lt;/strong&gt; X-Forwarded-For&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;b)&lt;/strong&gt; Client-IP&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;c)&lt;/strong&gt; Remote-Addr&lt;/p&gt;

&lt;p&gt;These are configurable, and more headers can be added if necessary. If the LB cannot retrieve the client's IP, or if that IP is not valid, the LB sends an internal server error response to the client. &lt;strong&gt;In this algorithm mode, the persistence policy should be NO_PERSISTENCE, and a request will be load balanced only if a valid client IP is available&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Also, the LB uses scalable and efficient &lt;a href="http://www.tom-e-white.com/2007/11/consistent-hashing.html"&gt;&lt;strong&gt;consistent hashing&lt;/strong&gt;&lt;/a&gt; rather than simple modulo hashing.&lt;/p&gt;

&lt;p&gt;The advantage of consistent hashing is that if a particular node goes down, only the clients that maintained sessions with that node are remapped to other nodes. The session persistence (affinity) of other clients is unaffected. When that node comes back to a healthy state, only the clients that were remapped are mapped back to it. With modulo hashing, by contrast, all clients would be remapped and their sessions lost, creating a bad user experience.&lt;/p&gt;
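
&lt;p&gt;The remapping behavior described above can be sketched with a small consistent-hash ring (a hypothetical Python illustration using virtual nodes; class and method names are mine, not the LB's actual implementation):&lt;/p&gt;

```python
import bisect
import hashlib

# Illustrative consistent-hash ring. Each node is placed at several points
# ("virtual nodes") on a hash ring; a client IP maps to the first node
# clockwise from its hash. Removing a node only remaps the clients that
# were on that node's ring segments.
class ConsistentHashRing:
    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self.ring = {}          # hash position -> node name
        self.sorted_keys = []   # sorted hash positions
        for node in nodes:
            self.add(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.replicas):
            h = self._hash(f"{node}#{i}")
            self.ring[h] = node
            bisect.insort(self.sorted_keys, h)

    def remove(self, node):
        for i in range(self.replicas):
            h = self._hash(f"{node}#{i}")
            del self.ring[h]
            self.sorted_keys.remove(h)

    def node_for(self, client_ip):
        h = self._hash(client_ip)
        idx = bisect.bisect(self.sorted_keys, h) % len(self.sorted_keys)
        return self.ring[self.sorted_keys[idx]]

ring = ConsistentHashRing(["ep1", "ep2", "ep3"])
before = {ip: ring.node_for(ip) for ip in ("10.0.0.1", "10.0.0.2", "10.0.0.3")}
ring.remove("ep2")
# Only clients that were mapped to ep2 change nodes; the rest keep their mapping.
```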

&lt;h4&gt;
  
  
  4) Least Response Time
&lt;/h4&gt;

&lt;p&gt;A running average of response time is calculated for each endpoint on every request. After a fixed WINDOW of requests has elapsed, the load distribution across endpoints is decided based on their response times. &lt;strong&gt;A higher response time for an endpoint indicates a higher load on it&lt;/strong&gt;. So the LB tries to reduce an endpoint's response time by forwarding fewer requests to it and sending more requests to the endpoint with the least response time. In doing so, the LB achieves an even load distribution based on the endpoints' response times.&lt;/p&gt;
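
&lt;p&gt;The idea can be sketched as follows (an illustrative Python snippet; the incremental running-average formula and all names are assumptions for illustration, not the LB's actual code):&lt;/p&gt;

```python
# Illustrative sketch: keep a running average response time per endpoint
# and direct the next request to the endpoint with the lowest average.
class LeastResponseTime:
    def __init__(self, endpoints):
        self.avg = {ep: 0.0 for ep in endpoints}
        self.count = {ep: 0 for ep in endpoints}

    def record(self, ep, response_ms):
        # incremental running average: new_avg = old_avg + (x - old_avg) / n
        self.count[ep] += 1
        self.avg[ep] += (response_ms - self.avg[ep]) / self.count[ep]

    def choose(self):
        # forward to the endpoint with the least average response time
        return min(self.avg, key=self.avg.get)

lrt = LeastResponseTime(["A", "B"])
lrt.record("A", 120); lrt.record("A", 80)   # avg A = 100 ms
lrt.record("B", 40); lrt.record("B", 60)    # avg B = 50 ms
print(lrt.choose())  # B
```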

&lt;h3&gt;
  
  
  Weighted Algorithms
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1) Weighted Round-Robin
&lt;/h4&gt;

&lt;p&gt;The LB Mediator forwards requests to Outbound Endpoints in a round-robin fashion while taking the endpoints' weights into account. For example, suppose endpoints A, B, and C have weights of 3, 2, and 5 respectively. In a total of 10 requests:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a)&lt;/strong&gt; The first 3 requests go to endpoints A, B, C.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;b)&lt;/strong&gt; The next 3 requests also go to endpoints A, B, C. Now 2 requests have been forwarded to endpoint B, so it will not be considered until a total of 10 requests have elapsed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;c)&lt;/strong&gt; Endpoint A receives the next request. Now 3 requests have been forwarded to endpoint A, so it will not be considered until a total of 10 requests have elapsed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;d)&lt;/strong&gt; The remaining 3 requests are forwarded to endpoint C. Now a total of 10 requests have elapsed.&lt;/p&gt;

&lt;p&gt;The cycle begins again.&lt;/p&gt;

&lt;p&gt;Since endpoints are weighted and weights represent processing power, requests forwarded to endpoints due to a persistence policy are also counted toward their weights.&lt;/p&gt;
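
&lt;p&gt;The cycle walked through above (A=3, B=2, C=5) can be sketched like this (a hypothetical Python illustration, not the LB's code): cycle through endpoints in order, skipping any whose weight is exhausted, until the total weight of requests has been served, then reset.&lt;/p&gt;

```python
# Illustrative weighted round-robin matching the example above.
class WeightedRoundRobin:
    def __init__(self, weights):              # e.g. {"A": 3, "B": 2, "C": 5}
        self.weights = dict(weights)
        self.remaining = dict(weights)        # budget left in the current cycle
        self.order = list(weights)
        self.i = 0

    def next(self):
        if all(v == 0 for v in self.remaining.values()):
            self.remaining = dict(self.weights)   # the cycle begins again
        while True:
            ep = self.order[self.i % len(self.order)]
            self.i += 1
            if self.remaining[ep] > 0:            # skip exhausted endpoints
                self.remaining[ep] -= 1
                return ep

wrr = WeightedRoundRobin({"A": 3, "B": 2, "C": 5})
print([wrr.next() for _ in range(10)])
# ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'C', 'C', 'C']
```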

&lt;h4&gt;
  
  
  2) Weighted Random
&lt;/h4&gt;

&lt;p&gt;Similar to weighted round-robin, but the order of endpoints is chosen randomly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Easily Extensible Nature
&lt;/h3&gt;

&lt;p&gt;Custom Load Balancing Algorithms (Simple or Weighted) can be easily written by implementing corresponding interfaces.&lt;/p&gt;

&lt;h3&gt;
  
  
  Session Persistence
&lt;/h3&gt;

&lt;p&gt;Client IP Hashing, Application Cookie, and LB Inserted Cookie are the three persistence policies supported by this LB as of now.&lt;/p&gt;

&lt;h3&gt;
  
  
  1) Client IP Hashing
&lt;/h3&gt;

&lt;p&gt;Similar to the Strict Client IP Hashing algorithm, with the only difference that if the LB can't find a valid client IP in the request headers, the request is still load balanced using the configured load balancing algorithm. It also uses scalable &lt;strong&gt;consistent hashing&lt;/strong&gt; rather than modulo hashing.&lt;/p&gt;

&lt;h3&gt;
  
  
  2) Application Cookie
&lt;/h3&gt;

&lt;p&gt;The LB inserts its own cookie inside the cookie inserted by the application server. When the client sends a request, the LB looks for a cookie in the specified format (the LB-inserted cookie) and, based on the cookie's value, forwards the request to the corresponding back-end, maintaining persistence. The LB also removes its inserted cookie before forwarding the request to the Outbound Endpoint.&lt;/p&gt;

&lt;p&gt;The cookie expiration value is controlled by the back-end application, not the LB. If no cookie is available in the response sent by the back-end, the LB inserts its own cookie to maintain persistence. This will be a session cookie, i.e., session persistence is maintained while the client's browser is open; once it is closed, persistence is lost. Also, this custom LB-inserted cookie is removed before the request is forwarded to the back-end.&lt;/p&gt;

&lt;h3&gt;
  
  
  3) LB Inserted Cookie
&lt;/h3&gt;

&lt;p&gt;This persistence policy comes in handy when the back-end application is not inserting a cookie but persistence still has to be maintained. It works similarly to the Application Cookie policy, the only difference being that the inserted cookie is a session cookie.&lt;/p&gt;

&lt;h3&gt;
  
  
  Health Checking and Redirection
&lt;/h3&gt;

&lt;p&gt;This LB supports both active and passive health checking modes; health checking can also be disabled if it is not necessary. Passive health checking is the default mode, as it doesn't introduce any additional overhead on back-end services or the network. Whether active or passive, health checking requires the following parameters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a) Request Timeout:&lt;/strong&gt; The time interval after which a request is marked as timed out if no response has been received.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;b) Health Check Interval:&lt;/strong&gt; The time interval between two health checks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;c) Unhealthy Retries:&lt;/strong&gt; The number of times requests have to fail (time out) consecutively before an endpoint is marked unhealthy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;d) Healthy Retries:&lt;/strong&gt; The number of times the LB must successfully establish a connection to the server's port before marking it healthy again.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For each timed-out request, the LB sends &lt;strong&gt;HTTP Status Code: 504, Gateway Timeout&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;If all Outbound Endpoints are unhealthy and unavailable, the LB sends &lt;strong&gt;HTTP Status Code: 503, Service Unavailable&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Passive Health Check
&lt;/h3&gt;

&lt;p&gt;In this mode, the LB doesn't send any additional connection probes to check whether an endpoint is healthy. It simply keeps track of consecutive failed (timed-out) requests to an endpoint. If the Unhealthy Retries count is reached, that endpoint is marked as unhealthy and no more requests are forwarded to it until it is back in a healthy state.&lt;/p&gt;
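
&lt;p&gt;The bookkeeping can be sketched as follows (a hypothetical Python illustration, not the LB's actual classes): count consecutive timeouts per endpoint and mark it unhealthy once the Unhealthy Retries threshold is reached.&lt;/p&gt;

```python
# Illustrative passive health check: no extra probes, just counting
# consecutive timed-out requests per endpoint.
class PassiveHealthCheck:
    def __init__(self, unhealthy_retries=3):
        self.unhealthy_retries = unhealthy_retries
        self.failures = {}       # endpoint -> consecutive timeout count
        self.unhealthy = set()

    def on_timeout(self, ep):
        self.failures[ep] = self.failures.get(ep, 0) + 1
        if self.failures[ep] >= self.unhealthy_retries:
            self.unhealthy.add(ep)   # stop forwarding requests to this endpoint

    def on_success(self, ep):
        self.failures[ep] = 0        # a success resets the consecutive count

    def is_healthy(self, ep):
        return ep not in self.unhealthy

hc = PassiveHealthCheck(unhealthy_retries=3)
for _ in range(3):
    hc.on_timeout("ep1")
print(hc.is_healthy("ep1"))  # False
```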

&lt;h3&gt;
  
  
  Active Health Check
&lt;/h3&gt;

&lt;p&gt;The LB periodically sends connection probes to check whether an endpoint is healthy. In this case, both consecutive failed requests and consecutive failed connection probes to an endpoint are taken into account. If the Unhealthy Retries count is reached, that endpoint is marked as unhealthy and no more requests are forwarded to it until it is back in a healthy state.&lt;/p&gt;

&lt;h3&gt;
  
  
  BackToHealthyHandler
&lt;/h3&gt;

&lt;p&gt;BackToHealthyHandler is a thread scheduled to run every &lt;strong&gt;Health Check Interval&lt;/strong&gt;. It sends connection probes to unhealthy endpoints and tries to establish connections. If it succeeds in establishing a connection Healthy Retries number of times in a row, that endpoint is marked as healthy again and requests are forwarded to it once more.&lt;/p&gt;
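
&lt;p&gt;The handler's recovery logic might look roughly like this (a Python sketch with a stubbed-out probe; all names are illustrative, and in the real LB this logic runs as a scheduled thread):&lt;/p&gt;

```python
# Illustrative BackToHealthyHandler tick: probe unhealthy endpoints and,
# after Healthy Retries consecutive successful connections, report them
# as healthy again.
class BackToHealthyHandler:
    def __init__(self, probe, healthy_retries=2):
        self.probe = probe                  # callable: endpoint -> bool (connection ok?)
        self.healthy_retries = healthy_retries
        self.successes = {}                 # endpoint -> consecutive successful probes

    def run_once(self, unhealthy):
        """One scheduled tick; returns endpoints to mark healthy again."""
        recovered = []
        for ep in unhealthy:
            if self.probe(ep):
                self.successes[ep] = self.successes.get(ep, 0) + 1
                if self.successes[ep] >= self.healthy_retries:
                    recovered.append(ep)
                    self.successes[ep] = 0
            else:
                self.successes[ep] = 0      # probes must succeed consecutively
        return recovered

handler = BackToHealthyHandler(probe=lambda ep: True, healthy_retries=2)
print(handler.run_once({"ep1"}))  # [] -- first successful probe
print(handler.run_once({"ep1"}))  # ['ep1'] -- second consecutive success
```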

&lt;p&gt;In my &lt;a href="https://venkat2811.blogspot.in/2016/08/http-load-balancer-on-top-of-wso2_18.html"&gt;&lt;strong&gt;next post&lt;/strong&gt;&lt;/a&gt;, I'll discuss the performance benchmark results of this load balancer.&lt;/p&gt;

&lt;p&gt;Thanks for reading!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://venkat.eu/http-load-balancer-on-top-of-wso2-gateway-part-1-project-repository-architecture-and-features-d4df775af48e"&gt;https://venkat.eu/http-load-balancer-on-top-of-wso2-gateway-part-1-project-repository-architecture-and-features-d4df775af48e&lt;/a&gt; &lt;em&gt;on Aug 18, 2016.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>GSoC — Mid Term Evaluation</title>
      <dc:creator>Venkat Raman</dc:creator>
      <pubDate>Sun, 03 Jul 2016 13:00:23 +0000</pubDate>
      <link>https://dev.to/venkat2811/gsoc-mid-term-evaluation-4cji</link>
      <guid>https://dev.to/venkat2811/gsoc-mid-term-evaluation-4cji</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ie0_1E2T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355846593/f7J6R0mUq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ie0_1E2T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355846593/f7J6R0mUq.png" alt="" width="640" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, all GSoCers were eagerly waiting for the Mid-Term Evaluation results (27 June, 19:00 UTC). Google gives one week (June 20–27) for students and mentors to submit their evaluations. Students who clear the evaluations are rewarded and can continue to code for the next six weeks until the final evaluations!! Most of us cleared the mid-term evaluation with flying colors, and member activity in our FB group spiked as expected. We posted our mentors' feedback, and it was fun and exciting to read others' feedback too :D&lt;/p&gt;

&lt;p&gt;Most students would have gone through code reviews &amp;amp; demos before the mid-term. My mentors have been very busy and couldn't conduct any demos or code reviews. I keep them posted on a weekly basis through the mailing list, and they are very well aware of the features implemented so far.&lt;/p&gt;

&lt;p&gt;Yesterday, my mentor was free and we had a demo/discussion on Hangouts for about one and a half hours. My mentor was very happy with the outcome; he thanked me for my contribution so far and asked me to keep up the same pace and complete the project successfully. I felt so happy and humbled that my six weeks of hard work is definitely going to make a difference!! We also discussed the further road map in detail. We are yet to have our code review.&lt;/p&gt;

&lt;h4&gt;
  
  
  Progress so far:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;ANTLR 4 grammar support for reading Load balancer specific configurations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Algorithms:&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;a)&lt;/strong&gt; Round Robin&lt;br&gt;&lt;br&gt;
&lt;strong&gt;b)&lt;/strong&gt; Random&lt;br&gt;&lt;br&gt;
&lt;strong&gt;c)&lt;/strong&gt; Strict Client IP Hashing&lt;br&gt;&lt;br&gt;
&lt;strong&gt;d)&lt;/strong&gt; Least Response Time&lt;br&gt;&lt;br&gt;
&lt;strong&gt;e)&lt;/strong&gt; Weighted Round Robin&lt;br&gt;&lt;br&gt;
&lt;strong&gt;f)&lt;/strong&gt; Weighted Random&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Session Persistence:&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;a)&lt;/strong&gt; Client IP Hashing&lt;br&gt;&lt;br&gt;
&lt;strong&gt;b)&lt;/strong&gt; Application Cookie&lt;br&gt;&lt;br&gt;
&lt;strong&gt;c)&lt;/strong&gt; LB Cookie&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Health Checking &amp;amp; Redirection:&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;a)&lt;/strong&gt; Active&lt;br&gt;&lt;br&gt;
&lt;strong&gt;b)&lt;/strong&gt; Passive&lt;/p&gt;

&lt;h4&gt;
  
  
  Todo List:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;SSL Support: Both SSL Offloading and End-To-End.&lt;/li&gt;
&lt;li&gt;Performance evaluation.&lt;/li&gt;
&lt;li&gt;Unit testing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can also find a detailed list of my progress &lt;a href="https://github.com/Venkat2811/product-http-load-balancer"&gt;here&lt;/a&gt; under the Mid Term label of the README.&lt;/p&gt;

&lt;p&gt;Also find my previous post about writing a good GSoC proposal &lt;a href="https://venkat2811.blogspot.in/2016/06/writing-good-google-summer-of-code-gsoc.html"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Thanks for reading !!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://venkat.eu/gsoc-mid-term-evaluation"&gt;https://venkat.eu/gsoc-mid-term-evaluation&lt;/a&gt; &lt;em&gt;on July 3, 2016.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Writing a good Google Summer of Code (GSoC) Proposal</title>
      <dc:creator>Venkat Raman</dc:creator>
      <pubDate>Sun, 26 Jun 2016 13:00:47 +0000</pubDate>
      <link>https://dev.to/venkat2811/writing-a-good-google-summer-of-code-gsoc-proposal-4e64</link>
      <guid>https://dev.to/venkat2811/writing-a-good-google-summer-of-code-gsoc-proposal-4e64</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OUrCRCvu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355850602/STjTHxsU3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OUrCRCvu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355850602/STjTHxsU3.png" alt="" width="640" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Getting selected for GSoC has various aspects, the most important of which is writing a good proposal. Students aspiring to participate in GSoC for the first time find it very difficult to write a proposal; being a first-time GSoCer, I went through the same. A few organizations like &lt;a href="https://www.kde.org/"&gt;KDE&lt;/a&gt; have their standard GSoC &lt;a href="https://community.kde.org/GSoC#Student_proposal_guidelines"&gt;template&lt;/a&gt; that students can follow, but most organizations don't have a standard template, so students find it confusing and difficult. You can also find a few sample proposals online and use them as references. Students are allowed to submit a maximum of 5 proposals, but only one project is allocated per student.&lt;/p&gt;

&lt;p&gt;For those who don't know what GSoC is, have a look at my previous &lt;a href="https://venkat2811.blogspot.in/2016/06/what-is-google-summer-of-code-gsoc.html"&gt;post&lt;/a&gt;. This &lt;a href="https://developers.google.com/open-source/gsoc/timeline"&gt;timeline&lt;/a&gt; will give you an idea of how it works.&lt;/p&gt;

&lt;p&gt;Here are a few things one must consider before writing a proposal.&lt;/p&gt;

&lt;h3&gt;
  
  
  Selecting an Organization:
&lt;/h3&gt;

&lt;p&gt;Students should know that selecting a proposal is completely in the organization's hands. Google doesn't take part in selecting proposals, but it is responsible for allocating the maximum number of projects per organization for that year. My organization &lt;a href="http://wso2.com/"&gt;WSO2&lt;/a&gt; started participating in GSoC in &lt;a href="https://docs.wso2.com/display/GSoC/WSO2+GSoC+Project+Proposals"&gt;2014&lt;/a&gt;. Google doesn't provide stats on the number of proposals received per organization versus the number selected, but it does provide the success rate (number of proposals selected versus successfully completed projects).&lt;/p&gt;

&lt;p&gt;For instance, the success rate of WSO2 was 5/6 in 2014 and 9/10 in 2015. This year (2016), 14 proposals were selected; next year around 20 proposals might get selected. If you look at this (especially as a first-time aspirant), selecting an organization is very crucial. If an organization had more than 5 projects with a good success rate in the previous year, you are good to go.&lt;/p&gt;

&lt;p&gt;Kindly note that I am not against selecting new organizations, just that Google allocates them fewer projects. If you have the relevant skill sets and are comfortable with the technologies required for the project, you can proceed. In such cases, try to submit more than one proposal (have a backup).&lt;/p&gt;

&lt;p&gt;Certain organizations like KDE, Apache, etc., are highly competitive. Though they select (Google allows them) 30–40 proposals, there will be 2 or more students competing for a single project. In such cases, either more than one proposal (the best proposals) for a single project gets selected (which happens very rarely) or only one gets selected (which happens in most cases). It's always good to have a backup while applying to such big organizations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Relevant Skill Sets:
&lt;/h3&gt;

&lt;p&gt;Students should have relevant skill sets. While organizations don't expect students to be extremely proficient with the technologies (programming language, domain, etc.), they do expect good enough skills that students will be able to manage and deliver. Note that just because Google has allotted a maximum number of projects, organizations don't try to use that allotment fully. If an organization feels that your proposal is not good or that you don't have the relevant skill sets, it will not select your proposal, because its success rate is very important.&lt;/p&gt;

&lt;h3&gt;
  
  
  Communicating with Organization:
&lt;/h3&gt;

&lt;p&gt;Frequent communication with the organization is very important. Here are the various phases where students can communicate.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google starts accepting applications from organizations in October or November every year. Students can look at the organizations selected in the previous year and follow them, because organizations selected in the previous year are most likely to be selected the next year as well.&lt;/li&gt;
&lt;li&gt;Once the &lt;strong&gt;idea list is put up (by the organization)&lt;/strong&gt;, students can contact the organization's mentors and members. They will offer guidance, and students can start contributing from November or December itself. This is very helpful for &lt;strong&gt;first-time aspirants&lt;/strong&gt; who wish to work with organizations like Apache, KDE, etc. Note that the more trust you build (by proving yourself) with your organization's members and mentors, the better the chance that your proposal will be selected.&lt;/li&gt;
&lt;li&gt;Once &lt;strong&gt;Google announces the selected organizations&lt;/strong&gt;, look for projects matching your skill sets. You will have one month to discuss with organizations and submit proposals.&lt;/li&gt;
&lt;li&gt;Factors like how often you communicate and how interested and dedicated you are matter a lot.&lt;/li&gt;
&lt;li&gt;Mentors help you write your proposal. They correct your mistakes and are very clear in letting you know what is expected over the project duration.&lt;/li&gt;
&lt;li&gt;Since we will be working on their existing projects, mentors insist that we try out their products, frameworks, etc. They also help if we run into difficulties. Once you do this, you'll get a clear idea of how to proceed with the technical implementation of your project. &lt;strong&gt;Yes, your proposal must include relevant architecture diagrams, technical implementation details, deliverables, and a tentative timeline for those deliverables&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The mentor organization and Google take &lt;strong&gt;one month&lt;/strong&gt; to announce the selected proposals. Once your proposal is selected, it becomes a GSoC project!! During that month, &lt;strong&gt;organizations expect you to communicate with mentors and start working on small tasks and deliverables. Submitting patches will also help&lt;/strong&gt;. Since the organization's success rate is at stake, writing a good proposal with a crystal-clear road map, along with regular commitment and contribution to your organization, is very important. It also builds trust in you, and you'll have an edge if anyone is competing with you for the same project.&lt;/li&gt;
&lt;li&gt;Remember that &lt;strong&gt;communication is key. Never hesitate to ask questions. Don't think you might be asking stupid questions; that's never the case. Mentors know that we are students, after all.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;While approaching a mentor for the first time, give a brief introduction about yourself. This will help mentors understand where you stand.&lt;/li&gt;
&lt;li&gt;Once your project gets selected, mentors will guide you through.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is my &lt;a href="https://github.com/Venkat2811/product-http-load-balancer"&gt;project&lt;/a&gt; and &lt;a href="http://docs.google.com/document/d/1Agl-Y_UKM5eMon8IDZa02aDGj0Yh_vZ7kFC2b1IzFI4/edit?usp=sharing"&gt;proposal&lt;/a&gt; for your reference. Here is the project &lt;a href="https://docs.wso2.com/display/GSoC/Project+Proposals+for+2016#ProjectProposalsfor2016-Proposal8:%5BESB/GW%5DHTTPLoadbalancerontopofWSO2Gateway"&gt;idea&lt;/a&gt; from WSO2.&lt;/p&gt;

&lt;p&gt;Feel free to contact me if you have any questions.&lt;/p&gt;

&lt;p&gt;Happy Coding !!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://venkat.eu/writing-a-good-google-summer-of-code-gsoc-proposal-6f040e217af4"&gt;https://venkat.eu/writing-a-good-google-summer-of-code-gsoc-proposal-6f040e217af4&lt;/a&gt; &lt;em&gt;on June 26, 2016.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>What is Google Summer of Code (GSoC) ?</title>
      <dc:creator>Venkat Raman</dc:creator>
      <pubDate>Fri, 24 Jun 2016 13:00:43 +0000</pubDate>
      <link>https://dev.to/venkat2811/what-is-google-summer-of-code-gsoc--16ig</link>
      <guid>https://dev.to/venkat2811/what-is-google-summer-of-code-gsoc--16ig</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7ECJUIdt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668357680215/pEMzFrFMy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7ECJUIdt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668357680215/pEMzFrFMy.png" alt="banner-gsoc2016_2.png" width="640" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://summerofcode.withgoogle.com/"&gt;Google Summer of Code&lt;/a&gt; is a global program focused on bringing more student developers into open source software development. More importantly Open source organizations find it prestigious for being selected by Google to be a part of this program. Yes !! Google selects organizations based on certain criteria. Not all open source organizations are selected. And organizations also find it as the best place to attract young and good talent.&lt;/p&gt;

&lt;p&gt;For students, participating and successfully completing the project is a great advantage in various ways. Here are a few from my experience.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The level of exposure is very high. Student can learn a lot right from writing a proposal.&lt;/li&gt;
&lt;li&gt;Most of the projects involves lot of coding. Student get to Code a lot. Yea lots of Code!!&lt;/li&gt;
&lt;li&gt;Students will have complete ownership of their project. They get to think about design aspects and plan a lot. However, when students get stuck, mentors and community members will be ready to lend a helping hand.&lt;/li&gt;
&lt;li&gt;Since most mentors and students will be in different countries, they collaborate through GitHub. Students get a very good knowledge in working in a collaborated environment using a VCS.&lt;/li&gt;
&lt;li&gt;Most of the projects are intended to solve real world problems and most of the code that students write will be used. There will be a great satisfaction of making a valuable contribution.&lt;/li&gt;
&lt;li&gt;Since people from different location and culture will be a part of organization, students will learn to adapt and communicate with them effectively. So students communication skills will improve a lot.&lt;/li&gt;
&lt;li&gt;Students community is the best part. It is very exciting. Students from all over the world collaborate in Facebook. Yes, we have separate group for GSoC 2016 in Facebook, LinkedIn, Telegram. Its the very exciting part. Students share their experience, offer support, help and are very encouraging. Its always fun to interact with new people.&lt;/li&gt;
&lt;li&gt;GSoC project will add a great value to the resume. Students will code and learn a lot more than they would have done in college.&lt;/li&gt;
&lt;li&gt;Students need not be great algorithmic geeks; students with good analytical and programming skills can participate and successfully complete the project.&lt;/li&gt;
&lt;li&gt;And last but not least, Google always keeps things exciting. Students get a welcome package containing a GSoC sticker, pen, and diary, along with $500 after the community bonding period. Students who pass the mid-term evaluation get $2250, and those who successfully complete the project get $2750, a certificate from Google, and a GSoC T-shirt.&lt;/li&gt;
&lt;li&gt;Also, a few organizations insist on weekly or monthly blog posts about the project, so students get a good chance to start blogging.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can read about writing a good GSoC proposal in my next &lt;a href="https://venkat2811.blogspot.in/2016/06/writing-good-google-summer-of-code-gsoc.html"&gt;post&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I'm a final-year student, and I am in my mid-term evaluation period (at the time of writing this post). I'm already missing GSoC, since I won't be able to participate next year.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://venkat.eu/what-is-google-summer-of-code-gsoc-fc721e631e27"&gt;https://venkat.eu/what-is-google-summer-of-code-gsoc-fc721e631e27&lt;/a&gt; &lt;em&gt;on June 24, 2016.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>GSoC — Community Bonding Period</title>
      <dc:creator>Venkat Raman</dc:creator>
      <pubDate>Mon, 23 May 2016 13:00:46 +0000</pubDate>
      <link>https://dev.to/venkat2811/gsoc-community-bonding-period-1mcn</link>
      <guid>https://dev.to/venkat2811/gsoc-community-bonding-period-1mcn</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QcxhD--Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355858026/6k2Bc01MG.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QcxhD--Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355858026/6k2Bc01MG.png" alt="" width="640" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This year, the community bonding period was from April 22nd to May 22nd. Though I was communicating with my mentor and community members frequently even before the GSoC project results were announced (yeah, I was pretty confident!), this period was very helpful and useful.&lt;/p&gt;

&lt;p&gt;I took time off from April 29th to May 10th for my semester exams, and my mentors were comfortable with it. We had a Hangouts session before that, and my mentor explained what was expected during this community bonding period. There was a major change in the code base between the time I wrote my proposal and now. Here is the &lt;a href="https://github.com/wso2/product-gw/"&gt;previous code base&lt;/a&gt; and here is the &lt;a href="https://github.com/wso2/carbon-gateway-framework"&gt;new one&lt;/a&gt;. Big change, right? Yes, I felt the same and I panicked!! But my mentor and members of the org were very helpful and understood my situation. They guided me through it very well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why a Load Balancer ?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My project is to build an HTTP Load Balancer on top of WSO2 Gateway. When I tell this to my friends, a few are astonished (yes, I wanted to build a kick-ass product!) and a few ask me, "Why a load balancer? It's already there, why re-invent the wheel?". First of all, I chose this organization because &lt;a href="http://wso2.com/products/enterprise-service-bus/"&gt;WSO2 ESB&lt;/a&gt; is the backbone of eBay! Yes, it helps eBay handle &lt;a href="http://wso2.com/casestudies/ebay-uses-100-open-source-wso2-esb-to-process-more-than-1-billion-transactions-per-day/"&gt;more than 1 billion transactions per day&lt;/a&gt;!! Pretty cool, huh? So what does the gateway have to do with it? This gateway framework will be used to build the next-gen ESB.&lt;/p&gt;

&lt;p&gt;WSO2 Gateway is a high-performance, lightweight, low-latency messaging gateway based on the standard gateway pattern. Its Netty-based non-blocking IO and Disruptor (ring buffer) architecture makes it the fastest open-source gateway available. The &lt;a href="http://www.slideshare.net/kasun04/wso2-gateway?qid=9c8d89a4-e982-4883-87a8-ac2dca7bf223&amp;amp;v=&amp;amp;b=&amp;amp;from_search=1"&gt;performance benchmarks&lt;/a&gt; (slides 13, 14, and 15) show that the gateway's throughput is very high compared to other solutions and is close to that of a direct Netty-based backend (without any intermediate gateway).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Features:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Support for different Load Balancing Algorithms&lt;/li&gt;
&lt;li&gt;Full Compliance with HTTP specification&lt;/li&gt;
&lt;li&gt;Support for SSL&lt;/li&gt;
&lt;li&gt;Session Persistence&lt;/li&gt;
&lt;li&gt;Health Checking and Redirection&lt;/li&gt;
&lt;/ul&gt;
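&lt;p&gt;To give a feel for the first feature above, here is a minimal round-robin selector in Java. This is only my rough sketch of the general technique, not the actual WSO2 Gateway code; the class and method names are made up for illustration.&lt;/p&gt;

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal round-robin endpoint selector sketch: each call to pick()
// returns the next backend in order, wrapping around at the end.
// AtomicInteger keeps the counter safe under concurrent requests.
class RoundRobin {
    private final String[] endpoints;
    private final AtomicInteger next = new AtomicInteger(0);

    RoundRobin(String[] endpoints) {
        this.endpoints = endpoints;
    }

    String pick() {
        // floorMod keeps the index non-negative even after the
        // counter overflows Integer.MAX_VALUE.
        int i = Math.floorMod(next.getAndIncrement(), endpoints.length);
        return endpoints[i];
    }
}
```

&lt;p&gt;Other algorithms (weighted round-robin, least connections, etc.) would just swap out the pick() logic while keeping the same selector interface.&lt;/p&gt;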

&lt;p&gt;So, I am building a pretty cool LB based on non-blocking IO to achieve high performance and low-latency mediation. If everything goes well, and I hope it does, one day this LB will be used by many organizations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accomplishments so far:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understanding the &lt;a href="https://github.com/wso2/carbon-gateway-framework"&gt;carbon-gateway-framework&lt;/a&gt; code base and trying out a few samples of &lt;a href="https://github.com/wso2/product-integration-server"&gt;product-integration-server&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="http://www.antlr.org/"&gt;Antlr 4&lt;/a&gt; grammar support for reading Load Balancer specific configurations.&lt;/li&gt;
&lt;li&gt;Repository structure for a &lt;a href="https://github.com/Venkat2811/product-http-load-balancer"&gt;standalone HTTP Load Balancer Server&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can find my code in the links mentioned above. Thanks for reading.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://venkat.eu/gsoc-community-bonding-period-b0d8f36d918"&gt;https://venkat.eu/gsoc-community-bonding-period-b0d8f36d918&lt;/a&gt; &lt;em&gt;on May 23, 2016.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
