<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Venkat Raman</title>
    <description>The latest articles on DEV Community by Venkat Raman (@venkat2811).</description>
    <link>https://dev.to/venkat2811</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F971654%2F0803cead-b30a-4206-b8f3-8f648f2f848f.jpg</url>
      <title>DEV Community: Venkat Raman</title>
      <link>https://dev.to/venkat2811</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/venkat2811"/>
    <language>en</language>
    <item>
      <title>The power of Mechanical Sympathy in Software Engineering</title>
      <dc:creator>Venkat Raman</dc:creator>
      <pubDate>Thu, 18 Apr 2024 18:34:00 +0000</pubDate>
      <link>https://dev.to/venkat2811/the-power-of-mechanical-sympathy-in-software-engineering-4ng0</link>
      <guid>https://dev.to/venkat2811/the-power-of-mechanical-sympathy-in-software-engineering-4ng0</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxbrt7v9hr3yy6iffurk0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxbrt7v9hr3yy6iffurk0.png" alt="Image description" width="800" height="570"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;Modern programming languages, compilers, and frameworks abstract away underlying complexities and details, allowing developers to focus on building systems and applications that solve business problems. This design enables engineers to specialize and build expertise in specific layers, pushing boundaries. However, when tasked with solving problems that stretch hardware capabilities to the maximum, understanding the underlying architecture and its complexities becomes crucial. Novel software paradigms that dramatically increase system performance, with real-world implications, arise from such scenarios.&lt;/p&gt;

&lt;p&gt;Flash Attention is one such algorithm that made huge waves in the NLP community, especially around the Transformer architecture. I first encountered Flash Attention in 2022, when it dramatically improved inference speeds of Stable Diffusion models for image generation. Recently revisiting the paper reminded me of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The 'Locality of Reference' principle from my Computer Architecture class at university.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;'LMAX Disruptor', the underlying library used in my GSoC project.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this post, we'll explore these concepts and appreciate how having mechanical sympathy makes us better engineers. To &lt;a href="https://martinfowler.com/articles/lmax.html"&gt;quote&lt;/a&gt; Martin Fowler, &lt;em&gt;"The phrase Martin Thompson likes to use is 'mechanical sympathy.' The term comes from race car driving and reflects the driver having an innate feel for the car, enabling them to get the best out of it."&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Locality of Reference
&lt;/h1&gt;

&lt;p&gt;Locality Of Reference (LOR) is a principle in computer architecture that refers to the tendency of programs to access data and instructions that are close to each other in memory. As we saw in a previous blog &lt;a href="https://venkat.eu/cpu-gpu-the-basics-and-high-level-overview#heading-core-in-a-modern-cpu"&gt;post&lt;/a&gt;, CPU &amp;amp; GPU cores make use of registers and layers of caches for faster data access &amp;amp; processing. Here are the key LOR types that processors exploit for better performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Temporal Locality -&lt;/strong&gt; Tendency of programs to access the same memory location repeatedly within a short time. Eg: a += 10 -&amp;gt; reading the value of a and saving the result back to a. It is beneficial to keep a close to the processor to avoid costly (slow) accesses to main memory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Spatial Locality -&lt;/strong&gt; Tendency of programs to access memory locations near the data that is currently being accessed. Eg: two variables a and b declared in a program will sit close together in the main memory page when the program is loaded. So, during the fetch cycle, when a is read from main memory (as part of a cache line), b will likely be in the same cache line and will already be available in cache.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sequential Locality -&lt;/strong&gt; Tendency of programs to access memory locations sequentially. Eg: array elements are stored sequentially in memory. When a program iterates over an array, reading the first element also brings the next contiguous elements (as part of the cache line) from main memory into cache.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Instruction Locality -&lt;/strong&gt; Similar to the data LOR types above, instructions are also prefetched and made available in caches.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LtfS2nQf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713197476776/a502d56f-9481-49c4-a3d2-ebf7bf958b35.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LtfS2nQf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713197476776/a502d56f-9481-49c4-a3d2-ebf7bf958b35.png" alt="" width="800" height="640"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, if a data load happens for a single element in a cache line, all elements in that cache line are loaded, resulting in quicker access to the subsequent elements.&lt;/p&gt;
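&lt;p&gt;A quick way to feel this effect is to traverse the same 2D data in two orders. The toy script below is my own illustration (not from the matmul example that follows); in CPython the gap is smaller than in C, because list elements are boxed objects and the column-major walk also pays extra indexing overhead, but the access-order idea is the same.&lt;/p&gt;

```python
from time import perf_counter

n = 1500
# n x n grid stored row by row, like the matmul example below
grid = [[i * n + j for j in range(n)] for i in range(n)]

def sum_row_major(g):
    # inner loop walks one row at a time: neighbouring elements
    total = 0
    for row in g:
        for x in row:
            total += x
    return total

def sum_col_major(g):
    # inner loop jumps between rows: strided, cache-unfriendly order
    total = 0
    for j in range(n):
        for i in range(n):
            total += g[i][j]
    return total

t0 = perf_counter(); row_sum = sum_row_major(grid); t1 = perf_counter()
t2 = perf_counter(); col_sum = sum_col_major(grid); t3 = perf_counter()
assert row_sum == col_sum          # same work, different access order
print(f"row-major: {t1 - t0:.3f}s  column-major: {t3 - t2:.3f}s")
```

&lt;p&gt;Typically the row-major walk comes out faster; the exact numbers depend on your CPU, cache sizes, and interpreter.&lt;/p&gt;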

&lt;h2&gt;
  
  
  Matrix Multiplication
&lt;/h2&gt;

&lt;p&gt;Matrix multiplication is a classic example with which we can quickly see the impact of the LOR principle. Here is a simple program that does matmul in Python without any libraries (tqdm is used only for the progress bar).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import random
from time import time
from tqdm import tqdm  # progress bar only

n = 500

A = [[random.random()
      for row in range(n)]
      for col in range(n)]

B = [[random.random()
      for row in range(n)]
      for col in range(n)]

C = [[0 for row in range(n)]
     for col in range(n)]

print("calculating ... \n")

start = time()
# inefficient: inner loop scans B column-wise (strided access)
for i in tqdm(range(n)):
    for j in range(n):
        for k in range(n):
            C[i][j] += A[i][k] * B[k][j]
# efficient: inner loop scans B and C row-wise (sequential access)
#for i in tqdm(range(n)):
#    for k in range(n):
#        for j in range(n):
#            C[i][j] += A[i][k] * B[k][j]
end = time()

print("%0.6f"%(end-start))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above Python program can be sped up further in several ways (changing the programming language, compiler optimizations, parallel computation, tiling, vectorization, AVX, CUDA, etc.), which are out of scope for this post. If interested in those, refer to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;MIT OpenCourseWare - Performance Engineering - &lt;a href="https://youtu.be/o7h_sYMk_oc?si=mWgShE48VgXARZEz"&gt;Matrix Multiplication&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Vipul Vaibhaw's &lt;a href="https://vaibhaw-vipul.medium.com/matrix-multiplication-optimizing-the-code-from-6-hours-to-1-sec-70889d33dcfa"&gt;summary&lt;/a&gt; &amp;amp; &lt;a href="https://github.com/vaibhawvipul/performance-engineering"&gt;repo&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tao Xu's &lt;a href="https://xta0.me/2021/07/12/MIT-6172-1.html"&gt;summary&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Running the inefficient &amp;amp; efficient versions of the above program on my Ubuntu workstation &amp;amp; benchmarking with Cachegrind gives:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ valgrind --tool=cachegrind python matmul_inefficient.py
==253768== Cachegrind, a cache and branch-prediction profiler
==253768== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==253768== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==253768== Command: python matmul_inefficient.py
==253768== 
--253768-- warning: L3 cache found, using its data for the LL simulation.
calculating ... 

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [14:33&amp;lt;00:00,  1.75s/it]
873.798730
==253768== 
==253768== I   refs:      314,734,342,652
==253768== I1  misses:          5,738,193
==253768== LLi misses:            870,629
==253768== I1  miss rate:            0.00%
==253768== LLi miss rate:            0.00%
==253768== 
==253768== D   refs:      150,606,141,341  (105,453,303,262 rd   + 45,152,838,079 wr)
==253768== D1  misses:        622,837,260  (    616,546,831 rd   +      6,290,429 wr)
==253768== LLd misses:          2,065,607  (      1,493,478 rd   +        572,129 wr)
==253768== D1  miss rate:             0.4% (            0.6%     +            0.0%  )
==253768== LLd miss rate:             0.0% (            0.0%     +            0.0%  )
==253768== 
==253768== LL refs:           628,575,453  (    622,285,024 rd   +      6,290,429 wr)
==253768== LL misses:           2,936,236  (      2,364,107 rd   +        572,129 wr)
==253768== LL miss rate:              0.0% (            0.0%     +            0.0%  )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ valgrind --tool=cachegrind python matmul_efficient.py
==296074== Cachegrind, a cache and branch-prediction profiler
==296074== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==296074== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==296074== Command: python matmul_efficient.py
==296074== 
--296074-- warning: L3 cache found, using its data for the LL simulation.
calculating ... 

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [14:31&amp;lt;00:00,  1.74s/it]
871.885507
==296074== 
==296074== I   refs:      318,987,466,754
==296074== I1  misses:          4,224,884
==296074== LLi misses:            832,073
==296074== I1  miss rate:            0.00%
==296074== LLi miss rate:            0.00%
==296074== 
==296074== D   refs:      151,347,143,927  (106,200,231,179 rd   + 45,146,912,748 wr)
==296074== D1  misses:        218,499,487  (    216,816,521 rd   +      1,682,966 wr)
==296074== LLd misses:          2,111,315  (      1,539,359 rd   +        571,956 wr)
==296074== D1  miss rate:             0.1% (            0.2%     +            0.0%  )
==296074== LLd miss rate:             0.0% (            0.0%     +            0.0%  )
==296074== 
==296074== LL refs:           222,724,371  (    221,041,405 rd   +      1,682,966 wr)
==296074== LL misses:           2,943,388  (      2,371,432 rd   +        571,956 wr)
==296074== LL miss rate:              0.0% (            0.0%     +            0.0%  )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My workstation is a powerful machine, and 500x500 is a small matrix, so treat the L3 cache as main memory and the L1 cache as cache memory. &lt;strong&gt;The D1 miss rate of the inefficient version is 0.4%, while the efficient version's is 0.1%, resulting in a runtime improvement of ~2s.&lt;/strong&gt; Let's apply sequential locality to a small matrix (for ease of visualization) and &lt;strong&gt;see how changing the loop order gives this performance gain.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OXWbSdRq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713216601332/8fc10765-3377-4354-bdf6-ab97ca03067d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OXWbSdRq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713216601332/8fc10765-3377-4354-bdf6-ab97ca03067d.png" alt="" width="800" height="736"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As seen above, the memory access pattern for matrix B is inefficient on the left. Just by changing the iteration order, the access pattern for matrix B is fixed and we get a free performance boost. Thus, having mechanical sympathy for the underlying hardware architecture helps improve matmul performance.&lt;/p&gt;

&lt;h1&gt;
  
  
  LMAX Disruptor
&lt;/h1&gt;

&lt;p&gt;When it was announced in the early 2010s, it made the rounds in the Java world and in HPC trading firms. It was later adopted in &lt;a href="https://logging.apache.org/log4j/2.x/manual/async.html"&gt;Log4j&lt;/a&gt; and at &lt;a href="https://www.nasdaq.com/docs/2023/03/16/NBV-FOSS-LIST_NBV.pdf"&gt;Nasdaq&lt;/a&gt;. Exchange and brokerage workloads demand millisecond and microsecond latencies. They usually run on beefy bare-metal hardware, as the performance impact of running on VMs is too costly. These services are written in a thread-per-core model (because context switching and L1/L2 cache invalidations are expensive), unlike traditional web servers that operate on a thread-per-request model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; LMAX Disruptor is a high-performance inter-thread communication library. Using it in the wrong way can cause significant performance degradation. As a rule of thumb, if a problem can be solved by scaling out instead of scaling up, the Disruptor need not be used.&lt;/p&gt;

&lt;p&gt;Here is a high level overview of LMAX Exchange.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with traditional queues
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--c8Ls6Flb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713281905756/2742059e-3695-4cbc-ab7f-bbb90837bc23.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--c8Ls6Flb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713281905756/2742059e-3695-4cbc-ab7f-bbb90837bc23.png" alt="" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above diagram shows the high-level LMAX system receiving market data, doing auxiliary processing and core business logic processing, and then sending orders to market. The Replicator, Journaller &amp;amp; Un-Marshaller can process in parallel, but queues are still needed for ordered processing. So we have the receiver acting as producer, and the replicator, journaller &amp;amp; un-marshaller acting as consumers, contending over a shared resource - the queue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3VkD-pjH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713283721242/b06b34f0-67c0-46ed-80fd-07de4307987d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3VkD-pjH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713283721242/b06b34f0-67c0-46ed-80fd-07de4307987d.png" alt="" width="800" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rW8ik4Aw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713283671964/0f5961b7-ea29-4a1c-8d61-cf0bea44a273.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rW8ik4Aw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713283671964/0f5961b7-ea29-4a1c-8d61-cf0bea44a273.png" alt="" width="800" height="254"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we saw in the matmul section, it is likely that the &lt;strong&gt;tail &amp;amp; head variables fall within the same cache line&lt;/strong&gt;. The producer thread adds at the end of the queue and the consumer thread consumes from the beginning of the queue. When these threads run on different cores, their &lt;strong&gt;L1 &amp;amp; L2 caches need to be invalidated each time the producer / consumer updates the state of the queue&lt;/strong&gt;. The LMAX team observed that their producer &amp;amp; consumer were running at the same rate, and that significant time was spent keeping the L1 &amp;amp; L2 caches up-to-date rather than doing actual producing &amp;amp; consuming.&lt;/p&gt;

&lt;h2&gt;
  
  
  How LMAX Disruptor is so Fast
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Lock-Free RingBuffer
&lt;/h3&gt;

&lt;p&gt;A RingBuffer (circular queue) is also a queue that operates in FIFO fashion. The key differences between a RingBuffer &amp;amp; a traditional queue are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;When values are consumed, they are not removed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When the end of the buffer is reached, the writer wraps around to the beginning and values are overwritten.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In LMAX Disruptor's RingBuffer implementation, the 'head' &amp;amp; 'tail' are managed outside the buffer. Instead of a blanket lock, this allows adding a value at the end of the queue while consumption is happening at the beginning, &amp;amp; vice versa.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--s9SC22fb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713368090245/767b14d3-34e9-4014-892b-bb77cfc0b681.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--s9SC22fb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713368090245/767b14d3-34e9-4014-892b-bb77cfc0b681.png" alt="" width="800" height="975"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's look at the highlighted sequence snapshot in the above diagram.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1)&lt;/strong&gt; Fast consumer 2 has processed up to buffer location 5 &amp;amp; asks the cursor for the next location. The cursor provides location 6, as location 0 has already been processed by consumer 2. Consumer 2 fetches the value at buffer location 6 and starts processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2)&lt;/strong&gt; The producer barrier, which tracks the consumer 1 &amp;amp; 2 sequences, knows that consumer 1 is done only up to buffer location 1, and consumer 2 is done up to buffer location 5. So only the value at location 0 can be overwritten.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3)&lt;/strong&gt; The producer can therefore write only one value, at location 0. The producer prepares the new value, e.g. by fetching the latest value from the network.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4)&lt;/strong&gt; Once the new value is ready, the producer asks the producer barrier to commit. The value is updated in the buffer, and the sequence is updated from 0 to 7.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5)&lt;/strong&gt; Slow consumer 1, which is done processing buffer location 1, asks for the next value and gets location 7 from the cursor. Consumer 1 gets all entries in locations 3-7 and works on processing them.&lt;/p&gt;

&lt;p&gt;Consumers update their respective consumer sequences after processing. A buffer location can be overwritten only once it has been processed by all consumers; the producer barrier keeps track of all consumer sequences.&lt;/p&gt;

&lt;p&gt;Batching can be done on both producer and consumer sequences (not in scope of this post; see references).&lt;/p&gt;

&lt;p&gt;Buffer sequences increase monotonically, as this provides an easier way of tracking consumer and producer buffer locations.&lt;/p&gt;
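&lt;p&gt;To make the walkthrough concrete, here is a minimal single-threaded Python sketch of the sequencing idea (my own toy illustration, not the Disruptor API): sequences grow monotonically, the slot is derived from the sequence, and the producer refuses to overwrite a slot until every consumer has passed it.&lt;/p&gt;

```python
class ToyRingBuffer:
    """Toy sketch of Disruptor-style sequencing (not the real API).

    Sequences increase monotonically; slot = sequence % capacity, so
    consumed entries are never removed, only eventually overwritten.
    Assumes at least one consumer is registered before publishing."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity   # statically allocated up front
        self.cursor = -1                 # last published sequence
        self.consumer_seqs = {}          # consumer name -> last processed seq

    def add_consumer(self, name):
        self.consumer_seqs[name] = -1

    def publish(self, value):
        next_seq = self.cursor + 1
        # producer barrier: the slowest consumer gates overwriting
        slowest = min(self.consumer_seqs.values())
        if next_seq - slowest > self.capacity:
            raise RuntimeError("buffer full: a consumer is lagging")
        self.slots[next_seq % self.capacity] = value
        self.cursor = next_seq           # 'commit': make the entry visible

    def consume(self, name):
        if self.consumer_seqs[name] == self.cursor:
            return None                  # nothing new for this consumer
        self.consumer_seqs[name] += 1
        return self.slots[self.consumer_seqs[name] % self.capacity]
```

&lt;p&gt;With capacity 4, after publishing four values a fifth publish raises until the consumer catches up - the role the producer barrier plays in the walkthrough above.&lt;/p&gt;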

&lt;h3&gt;
  
  
  Static Memory Allocation &amp;amp; Delayed Garbage Collection
&lt;/h3&gt;

&lt;p&gt;The RingBuffer array is statically allocated with dummy values. Producers write to the next available buffer location using the cursor, and consumers consume previously unconsumed buffer locations using the cursor. Once a value is overwritten, there is no longer any reference to it and it is easily garbage collected (GC'd).&lt;/p&gt;

&lt;p&gt;In Java 8's GC there are 4 memory spaces: Young Generation (Eden, Survivor spaces), Old Generation (Tenured Generation), Metaspace (non-heap memory) &amp;amp; Code Cache (JIT compiler related).&lt;/p&gt;

&lt;p&gt;Since the RingBuffer array itself is statically allocated and lives for the application's lifetime, it is tenured once and not repeatedly GC'd. The values in the buffer are written and consumed quickly and will be GC'd in an Eden cycle (quick and cheap), &lt;strong&gt;hence avoiding large GC pauses (survivor and old-gen collections).&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Avoiding False Sharing in Cache Lines
&lt;/h3&gt;

&lt;p&gt;In matmul, we saw that variables in a program can share the same cache line. In the Disruptor, we have the cursor and the sequence barriers for both producer &amp;amp; consumers. Since we don't want producer and consumer threads to be affected by updates to each other's variables (unlike ArrayBlockingQueue), we have to &lt;strong&gt;add padding so that each variable occupies an entire cache line.&lt;/strong&gt; That way, when the producer updates the cursor, consumer caches need not be refreshed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--n5Jrk_He--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713354498472/75a9bd87-fbce-46e2-9bc3-4c5381f1e853.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--n5Jrk_He--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713354498472/75a9bd87-fbce-46e2-9bc3-4c5381f1e853.png" alt="" width="800" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we don't do this, then when the producer thread updates the cursor, consumer caches need to be refreshed, as they share the same cache line. This is called &lt;a href="https://mechanical-sympathy.blogspot.com/2011/07/false-sharing.html"&gt;&lt;strong&gt;False Sharing&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Java 8 has the &lt;a href="https://mechanical-sympathy.blogspot.com/2011/07/false-sharing.html"&gt;@Contended annotation&lt;/a&gt; for this.&lt;/p&gt;

&lt;h3&gt;
  
  
  Producer &amp;amp; Consumer Sequence Barriers
&lt;/h3&gt;

&lt;p&gt;A CPU core performs several optimizations using instruction pipelining, reordering, etc., as long as a reorder or concurrent execution in the core's execution units doesn't change the outcome of the program.&lt;/p&gt;

&lt;p&gt;Java provides the volatile keyword, which acts as a special type of barrier known as a write / store barrier. There are also other &lt;a href="https://mechanical-sympathy.blogspot.com/2011/07/memory-barriersfences.html"&gt;types of memory barriers and fences&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Below are two programs: one where counter is not volatile, and one where it is. Arithmetic operations happen in the ALU of a CPU core, and the core operates on values from registers.&lt;/p&gt;

&lt;p&gt;In the first program, once counter is loaded into a register, the 10 loop iterations happen, and each change to counter is saved only in the register. Once the iteration is done, during the "write-back" cycle, the value is written back to the L1 cache, and the memory unit takes care of propagating this change to the other cache levels and to main memory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public class LoopCounterExample {
    public static void main(String[] args) {
        int iterations = 10;
        int counter = 0;

        for (int i = 0; i &amp;lt; iterations; i++) {
            counter++;
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public class LoopCounterExample {
    // volatile is only valid on fields, not on local variables
    private static volatile int counter = 0;

    public static void main(String[] args) {
        int iterations = 10;

        for (int i = 0; i &amp;lt; iterations; i++) {
            counter++;
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the second program, every update to counter is written back from the register to the L1 cache, and the memory unit takes care of invalidating any other cached reference to this value. This has a significant performance cost, but it buys us shared state that is visible across multiple threads.&lt;/p&gt;

&lt;p&gt;In the case of the Disruptor, the cursor, consumer sequences &amp;amp; producer sequences use memory barriers &amp;amp; fences, which offer finer control than the volatile keyword. This is done using &lt;a href="https://www.baeldung.com/java-variable-handles"&gt;Java VarHandle&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;These techniques offer finer control than the ReentrantLock used in ArrayBlockingQueue. Producers and consumers can write to and consume from the ring buffer at the same time, and can be confident that a value read from a buffer location is always the latest, because barriers &amp;amp; fences guarantee:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Anything that happened before the barrier call is flushed out (the producer adds the newly produced value at location 0 and only then increments the cursor from 6 to 7).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A value updated by one thread is immediately visible to all threads (the value of the cursor to the consumer barrier, and the values of the consumer sequences to the producer barrier).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
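&lt;p&gt;These two guarantees are what make the write-then-publish protocol safe. The toy Python sketch below (my own illustration, not Disruptor code) follows the same protocol: the producer writes the slot first and only then advances the cursor, and it never laps the consumer. Note that in CPython the GIL happens to provide the visibility that barriers &amp;amp; fences provide in Java, so this demonstrates the protocol, not real fence semantics.&lt;/p&gt;

```python
import threading

CAP = 64             # ring capacity (slots)
N = 5_000            # events to transfer
buffer = [None] * CAP
cursor = -1          # last published sequence: written by producer only
consumer_seq = -1    # last consumed sequence: written by consumer only

def producer():
    global cursor
    for seq in range(N):
        while seq - consumer_seq > CAP:    # never lap the consumer
            pass                           # busy-wait (toy; real code spins smarter)
        buffer[seq % CAP] = seq * 2        # 1) write the value into the slot...
        cursor = seq                       # 2) ...then publish the sequence

def consumer(out):
    global consumer_seq
    while consumer_seq < N - 1:
        if cursor > consumer_seq:          # published => the slot write is visible
            nxt = consumer_seq + 1
            out.append(buffer[nxt % CAP])
            consumer_seq = nxt             # frees the slot for reuse

received = []
p = threading.Thread(target=producer)
c = threading.Thread(target=consumer, args=(received,))
p.start(); c.start(); p.join(); c.join()
assert received == [i * 2 for i in range(N)]
print("transferred", len(received), "events in order")
```

&lt;p&gt;Because the value is in the slot before the cursor moves, a consumer that sees the new cursor can safely read the slot - the Python analogue of guarantee one above.&lt;/p&gt;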

&lt;h3&gt;
  
  
  Avoiding Context Switching
&lt;/h3&gt;

&lt;p&gt;Even without the optimizations below, the Disruptor's performance is significantly higher than an ArrayBlockingQueue's (see the perf benchmark section below). I found these optimizations very interesting (feel free to skip them and jump to the perf section). They were done for the LMAX matching engine service, which has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;1 inbound Disruptor with 1 producer thread and 3 consumer threads&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;3 outbound Disruptors, each with 1 producer thread (one of the consumer threads of the inbound Disruptor) and 3 consumer threads.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Yellow arrows indicate critical threads that need a dedicated CPU core (for peak performance)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--B6jy3e-Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713345020785/3973c64b-f4e2-4ad4-8d3d-e2eeda96e8fa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--B6jy3e-Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713345020785/3973c64b-f4e2-4ad4-8d3d-e2eeda96e8fa.png" alt="" width="800" height="544"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;isolcpus=0,2,4,6,8,24,26,28,30,32
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The isolcpus kernel boot parameter isolates CPUs from the kernel scheduler. The above diagram shows 10 CPU cores (20 hyper-threads) isolated: the OS will not schedule any process or thread on these cores. (Plugging my previous post here if you want to understand &lt;a href="https://venkat.eu/cpu-gpu-the-basics-and-high-level-overview#heading-core-in-a-modern-cpu"&gt;cpu cores&lt;/a&gt; and &lt;a href="https://venkat.eu/cpu-gpu-the-basics-and-high-level-overview#heading-hyper-threading"&gt;hyper-threading&lt;/a&gt;.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cset set --set=/system --cpu=18,20,...,46 
$ cset set --set=/app --cpu=0,2,...,40
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This partitions system resources: separate CPU sets for the system and for the app.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cset proc --move -k --threads --force --from-set=/ --to-set=/system
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command moves kernel threads from the default CPU set to the "/system" CPU set. Kernel threads are system-level threads managed by the kernel itself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cset proc --exec /app taskset -cp 10,12...38,40 java &amp;lt;args&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command executes the Java application within the "/app" CPU set using the taskset command. The taskset -cp option specifies which CPUs the process is allowed to run on; in this case, the Java application is allowed to run on CPUs 10, 12, ..., 38, and 40.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sched_setaffinity(0);
sched_setaffinity(2); ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each Java thread is pinned to a dedicated core in application code.&lt;/p&gt;
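&lt;p&gt;On Linux, the same affinity call is exposed to Python directly via the standard library, so the pinning idea can be sketched in a few lines (an illustration of the technique, not LMAX's code):&lt;/p&gt;

```python
import os

# os.sched_setaffinity is a Linux-only stdlib wrapper around sched_setaffinity(2)
if hasattr(os, "sched_setaffinity"):
    allowed = os.sched_getaffinity(0)      # 0 = the calling process
    target = {min(allowed)}                # pick one core to pin to
    os.sched_setaffinity(0, target)        # pin: run only on that core
    assert os.sched_getaffinity(0) == target
    os.sched_setaffinity(0, allowed)       # restore, since this is just a demo
    print(f"pinned to core {min(allowed)}, then restored {sorted(allowed)}")
else:
    print("sched_setaffinity is not available on this OS")
```

&lt;p&gt;Combined with isolcpus, pinning like this gives a critical thread a core that nothing else will touch.&lt;/p&gt;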

&lt;h2&gt;
  
  
  Performance Benchmark
&lt;/h2&gt;

&lt;p&gt;I've briefly covered the principles and techniques through which the LMAX Disruptor delivers its performance gains. I'd like to call out that I've used a mix of Disruptor 1.0 &amp;amp; 2.0 terminology above to more easily communicate the problem and the underlying principles. For a more detailed understanding, see the sources in the reference section.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FtWDm6Wp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713344471873/2eda67d9-dcc3-42f1-8c46-b51c4ab4c0eb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FtWDm6Wp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713344471873/2eda67d9-dcc3-42f1-8c46-b51c4ab4c0eb.png" alt="" width="800" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xHLJInVK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713344487654/89bde732-10e6-42d5-abe6-2f1e5f9a23e7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xHLJInVK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713344487654/89bde732-10e6-42d5-abe6-2f1e5f9a23e7.png" alt="" width="800" height="253"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Source: LMAX perf test &lt;a href="https://lmax-exchange.github.io/disruptor/disruptor.html#_throughput_performance_testing"&gt;Throughput&lt;/a&gt; &amp;amp; &lt;a href="https://lmax-exchange.github.io/disruptor/disruptor.html#_latency_performance_testing"&gt;Latency&lt;/a&gt;. The above benchmarks were done &lt;strong&gt;without context switching optimizations&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Thus, having mechanical sympathy for the underlying hardware architecture helps to speed up inter-thread messaging and achieve peak performance.&lt;/p&gt;

&lt;h1&gt;
  
  
  Flash Attention
&lt;/h1&gt;

&lt;p&gt;So far in this post, we looked at loop ordering (LOR) in matmul &amp;amp; the Disruptor, and saw how understanding the underlying CPU architecture helps extract maximum performance. In this section, we'll look at &lt;a href="https://github.com/Dao-AILab/flash-attention"&gt;Flash Attention&lt;/a&gt; - "A new attention algorithm that computes exact attention with far fewer memory accesses."&lt;/p&gt;

&lt;p&gt;In my previous &lt;a href="https://venkat.eu/cpu-gpu-the-basics-and-high-level-overview#heading-bandwidth-compute-intensity-amp-latency"&gt;post&lt;/a&gt;, we looked at HBM memory and the compute intensity of the A100 GPU, using a 2x2 matmul as an example. Flash Attention's optimization leads to direct performance gains primarily in the &lt;a href="https://horace.io/brrr_intro.html"&gt;bandwidth &amp;amp; overhead bound&lt;/a&gt; regimes, rather than in the compute bound regime.&lt;/p&gt;

&lt;p&gt;As of April 2024, I don't have the deep expertise / understanding to explain the attention layer of Transformers in detail. Refer to Jay Alammar's amazing &lt;a href="https://jalammar.github.io/illustrated-transformer/"&gt;post&lt;/a&gt; or the high-quality video from &lt;a href="https://youtu.be/eMlx5fFNoYc?si=wmNOf5977RvETfaZ"&gt;3Blue1Brown&lt;/a&gt; for that. I also cannot do a better job than Aleksa Gordi in explaining the step-by-step changes in the Flash Attention 1 algorithm with supporting math. Refer to his excellent &lt;a href="https://gordicaleksa.medium.com/eli5-flash-attention-5c44017022ad"&gt;post&lt;/a&gt; for that. Below, I try to provide a practical, high-level FlashAttention 1 explanation w.r.t. the underlying hardware - the CUDA Ampere architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Paper Title: Fast and Memory-Efficient Exact Attention with IO-Awareness&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exact Attention:&lt;/strong&gt; It does not use sparse-matrix / approximation methods to speed up the attention calculation. Those techniques, when used, result in models of poorer quality. Flash Attention 1 uses the exact attention calculation, so there is no reduction in quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fast &amp;amp; Memory-Efficient:&lt;/strong&gt; The space complexity of vanilla self-attention is quadratic in sequence length (O(N^2)), while the algorithmic optimization brings it down to linear (O(N)). This reduction in space complexity means far fewer accesses to slow HBM, freeing memory bandwidth and lowering the required compute intensity [more data is fed to the beast - CUDA &amp;amp; Tensor cores :) from caches], resulting in an improvement in speed.&lt;/p&gt;
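&lt;p&gt;A quick back-of-the-envelope sketch (my own illustrative numbers, not from the paper) of why the quadratic score matrix hurts at longer sequence lengths:&lt;/p&gt;

```python
# Illustrative: memory needed to materialize the N x N attention score matrix
# in fp16 (2 bytes), per head, vs. the O(N) statistics FlashAttention keeps.
def attention_matrix_bytes(seq_len, bytes_per_elem=2):
    return seq_len * seq_len * bytes_per_elem

for n in (1024, 4096, 16384):
    mb = attention_matrix_bytes(n) / 1e6
    print(f"N={n}: {mb:.1f} MB for the N x N score matrix")
# At N=16384 the score matrix alone (~537 MB per head) dwarfs the ~20 MB of
# combined L1/shared SRAM on an A100, forcing round-trips to HBM.
```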

&lt;p&gt;&lt;strong&gt;IO Awareness:&lt;/strong&gt; An NVIDIA A100 SXM GPU has 40-80 GB of HBM (VRAM / DRAM) &amp;amp; 88.1 MB of SRAM in total, shared across all SMs (256 KB of registers and 192 KB of L1 cache per SM -&amp;gt; ~27.8 MB for registers and ~20.3 MB for L1 cache combined, plus 40 MB of shared L2 cache).&lt;/p&gt;
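&lt;p&gt;As a rough sanity check, those totals can be approximately reconstructed from the per-SM figures (assuming the A100's 108 SMs; the exact published numbers differ slightly depending on unit conventions):&lt;/p&gt;

```python
# Rough reconstruction of the A100's on-chip SRAM budget (assumed: 108 SMs,
# 256 KB register file and 192 KB combined L1/shared memory per SM, 40 MB L2).
SMS = 108
REG_KB_PER_SM = 256
L1_KB_PER_SM = 192
L2_MB = 40

reg_mb = SMS * REG_KB_PER_SM * 1024 / 1e6   # ~28.3 MB
l1_mb = SMS * L1_KB_PER_SM * 1024 / 1e6     # ~21.2 MB
total_mb = reg_mb + l1_mb + L2_MB           # ~89.5 MB, close to the figure above
print(f"registers ~{reg_mb:.1f} MB, L1 ~{l1_mb:.1f} MB, total ~{total_mb:.1f} MB")
```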

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_vy_ope2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713393370860/653b373a-d333-4d4f-962c-e70c61d92cc0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_vy_ope2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713393370860/653b373a-d333-4d4f-962c-e70c61d92cc0.png" alt="" width="331" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diagram source is NVIDIA. It shows the required &lt;a href="https://venkat.eu/cpu-gpu-the-basics-and-high-level-overview#heading-bandwidth-compute-intensity-amp-latency"&gt;Compute Intensity&lt;/a&gt; for an FMA operation in CUDA &amp;amp; Tensor Cores - i.e., how much compute must be done per byte to make the read operations worth the cost. Apart from matmul, there are not many computations with a high enough compute intensity to make reads from slower memory worth the cost. So, model implementations must try to keep the required compute intensity as low as possible, i.e., read and write from caches and registers.&lt;/p&gt;
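&lt;p&gt;The break-even point can be sketched with rough peak numbers (assumed here: ~312 TFLOPS FP16 tensor-core throughput and ~1.6 TB/s HBM bandwidth for an A100 SXM):&lt;/p&gt;

```python
# Illustrative break-even compute intensity: FLOPs that must be performed per
# byte read from HBM so that compute, not bandwidth, is the bottleneck.
PEAK_FLOPS = 312e12        # assumed FP16 tensor-core peak
HBM_BYTES_PER_S = 1.6e12   # assumed HBM2e bandwidth

breakeven = PEAK_FLOPS / HBM_BYTES_PER_S
print(f"~{breakeven:.0f} FLOPs per HBM byte to stay compute bound")
# Elementwise ops like softmax's exp/sum do only a few FLOPs per byte, so they
# are hopelessly bandwidth bound whenever their operands live in HBM.
```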

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jyAW5EKB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713393576279/943b784c-71f0-4ec5-bdf0-0925869d3356.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jyAW5EKB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713393576279/943b784c-71f0-4ec5-bdf0-0925869d3356.png" alt="" width="248" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diagram source is the Flash Attention paper. In the above attention diagram, for the native attention implementation in PyTorch, we can see that only ~4ms out of 17ms is spent on the matmul operation (compute bound). The rest of the operations are not that compute heavy, but because of frequent reads and writes to HBM, the effective bandwidth is significantly reduced, resulting in wasted GPU compute cycles and higher latency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Standard Self-Attention
&lt;/h2&gt;

&lt;p&gt;I'm providing just the high-level self-attention operations needed to understand FlashAttention 1. Refer to &lt;a href="https://youtu.be/eMlx5fFNoYc?si=J7vohprdDIL3Yl4i"&gt;3Blue1Brown's video&lt;/a&gt; for a detailed explanation.&lt;/p&gt;

&lt;p&gt;Q1.K1 to Qn.Kn are the dot products that make up the matrix multiplication of the Q &amp;amp; K matrices. The division (by the square root of the key dimension) is for numeric stability (not critical for this post).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SFInPuXa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713436124446/30074419-5b17-45b1-af2a-7909cd23d2bd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SFInPuXa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713436124446/30074419-5b17-45b1-af2a-7909cd23d2bd.png" alt="" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The resulting values from the matmul range from -infinity to +infinity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9DkPdLp9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713438892339/1f3dc4c4-1e7c-46e2-9a56-1769aa1769ec.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9DkPdLp9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713438892339/1f3dc4c4-1e7c-46e2-9a56-1769aa1769ec.png" alt="" width="800" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since the matrix column values are used for predicting the next token, we need a probability distribution. &lt;strong&gt;The softmax operation is applied to every column of the result embedding matrix. The denominator needs the sum of all elements in a given column.&lt;/strong&gt; See the sample program below and results, written with help from ChatGPT.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uqt4EvIG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713437654620/e2641b51-76cf-4c39-a9cd-62c8e518fa3c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uqt4EvIG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713437654620/e2641b51-76cf-4c39-a9cd-62c8e518fa3c.png" alt="" width="800" height="448"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch
import torch.nn.functional as F

def traditional_softmax(matrix, column_index):
    column = matrix[:, column_index]
    softmax_column = F.softmax(column, dim=0)
    return softmax_column

# Example usage
matrix = torch.tensor([[-0.8, -5.0, 5.0, 1.5, 3.4, -2.3, 2.5],
                       [-0.2,  2.3, 3.5, 1.8, 0.9, -1.5, 0.5]], dtype=torch.float32)
column_index = 2
softmax_result = traditional_softmax(matrix, column_index)
print("Softmax for column", column_index, ":", softmax_result)

# result
# Softmax for column 2 : tensor([0.8176, 0.1824])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The token with a high probability score gets more "attention".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1PGk1e_I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713439214683/e66f171f-8f2c-43aa-9257-f1c2efa932ed.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1PGk1e_I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713439214683/e66f171f-8f2c-43aa-9257-f1c2efa932ed.png" alt="" width="800" height="539"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So far, we have briefly seen that the matmul of matrices Q &amp;amp; K followed by the softmax operation gives a result matrix with probability distributions. Masking is applied before softmax to prevent future tokens from influencing previous tokens (refer to the video). Below we see the outcome when the result matrix after softmax is multiplied with the V matrix.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TlBk4eol--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713440356830/a04458b1-ea48-4877-8909-797619cc70f9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TlBk4eol--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713440356830/a04458b1-ea48-4877-8909-797619cc70f9.png" alt="" width="800" height="473"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is how LLMs understand the importance of words and sentences in different parts of the text. These steps are done for each layer of the model.&lt;/p&gt;

&lt;p&gt;Below is the standard self-attention implementation, which performs the above-mentioned calculations for each input token in every layer of a transformer model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FRrISKFd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713435814025/c1a72af6-e7cb-4daf-a9c6-c2f7b5e768bc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FRrISKFd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713435814025/c1a72af6-e7cb-4daf-a9c6-c2f7b5e768bc.png" alt="" width="714" height="174"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Diagram source: &lt;a href="https://arxiv.org/pdf/2205.14135.pdf"&gt;Flash attention paper&lt;/a&gt;. One can quickly see several reads and writes being done to HBM without taking the bandwidth and compute intensity of the underlying GPU architecture into account.&lt;/p&gt;
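&lt;p&gt;The standard algorithm above can be sketched in a few lines (a NumPy sketch of the textbook computation, not the paper's implementation; in a real model, each intermediate here is a round-trip to HBM):&lt;/p&gt;

```python
# Textbook (non-flash) self-attention: it materializes the full N x N score
# matrix S and probability matrix P, which is exactly the HBM traffic that
# FlashAttention avoids. Row-wise softmax is the usual convention; it is the
# transpose of the column view used above.
import numpy as np

def standard_attention(q, k, v):
    d = q.shape[-1]
    s = q @ k.T / np.sqrt(d)                       # N x N scores
    e = np.exp(s - s.max(axis=-1, keepdims=True))  # numerically stable exp
    p = e / e.sum(axis=-1, keepdims=True)          # row-wise softmax
    return p @ v                                   # N x d output

rng = np.random.default_rng(0)
n, d = 4, 8
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = standard_attention(q, k, v)
print(out.shape)  # (4, 8)
```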

&lt;h2&gt;
  
  
  Flash Attention Optimizations
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jLPplnQt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713444358865/dbda08e6-b5ca-40fe-b823-d5cd2d5fffe6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jLPplnQt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713444358865/dbda08e6-b5ca-40fe-b823-d5cd2d5fffe6.png" alt="" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Diagram source: &lt;a href="https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention"&gt;HuggingFace TGI&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Tiled Matrix Multiplication
&lt;/h3&gt;

&lt;p&gt;We are going to revisit... caches! (you guessed it :)) Refer to MIT OpenCourseWare's &lt;a href="https://youtu.be/o7h_sYMk_oc?si=cU0QMBGQ3uG-MidG&amp;amp;t=2427"&gt;matmul with tiling&lt;/a&gt;. This is the critical change.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NoHJ3VTN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713456152244/2f71e9d5-c9a3-4162-af68-af3faba00810.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NoHJ3VTN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713456152244/2f71e9d5-c9a3-4162-af68-af3faba00810.png" alt="" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--a8VKQRCh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713456265658/28adc0a1-8125-494f-96ef-b1e18a797c82.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--a8VKQRCh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713456265658/28adc0a1-8125-494f-96ef-b1e18a797c82.png" alt="" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the first slide, the entire matrix B is loaded, as all columns are needed. This is not a very efficient use of memory bandwidth. &lt;strong&gt;As we saw earlier, in self-attention there are 3 matrix multiplications and one softmax&lt;/strong&gt; (the next section covers online softmax, so for now assume that not all columns are needed for the softmax calculation).&lt;/p&gt;

&lt;p&gt;Once tiling is done, some HBM bandwidth is freed up, and some L1 &amp;amp; L2 cache memory is also freed up. This is used to do the softmax operation once Q.K for the block is done. Once softmax is done, we do another matrix multiplication with the V block. The result is then written back to HBM. This is called &lt;strong&gt;"kernel fusion"&lt;/strong&gt;, i.e., a single CUDA kernel performs all 3 operations.&lt;/p&gt;
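&lt;p&gt;Tiling itself can be sketched in plain Python (an illustrative cache-blocking sketch, not FlashAttention's CUDA kernel):&lt;/p&gt;

```python
# Cache-blocking (tiling): multiply block-by-block so each tile of A, B and C
# fits in fast memory (SRAM on a GPU, L1/L2 on a CPU) while it is being reused.
import numpy as np

def tiled_matmul(a, b, tile=2):
    n, k = a.shape
    k2, m = b.shape
    assert k == k2
    c = np.zeros((n, m))
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # One small tile-sized multiply-accumulate; in a fused CUDA
                # kernel this is where softmax and the V matmul would also run.
                c[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
    return c

a = np.arange(16.0).reshape(4, 4)
b = np.eye(4)
print(np.allclose(tiled_matmul(a, b), a @ b))  # True
```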

&lt;p&gt;&lt;strong&gt;A side note:&lt;/strong&gt; I would imagine some kind of tiling was already happening in transformer models before FlashAttention, because CUDA thread blocks &amp;amp; warps are &lt;a href="https://youtu.be/n6M8R8-PlnE?si=9i7BXDsAkFFnlMZ0&amp;amp;t=1291"&gt;designed to do parallel operations&lt;/a&gt; on every memory page read. I haven't looked into FlashAttention 2, but from reading the abstract, I think this is being done. Again, this highly emphasizes the need for optimizations with Mechanical Sympathy :)&lt;/p&gt;

&lt;h3&gt;
  
  
  Online Softmax Calculation
&lt;/h3&gt;

&lt;p&gt;Earlier, we saw that softmax needs the sum of all elements in a given column. In online softmax calculation, computations are performed on columns of smaller matrix blocks, reducing the memory footprint in SRAM. With each block calculation in Flash Attention, the maximum score within the block is tracked and saved.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;m&lt;/em&gt;(&lt;em&gt;x&lt;/em&gt;) (Maximum Score): The highest value within a block of scores.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;f(x)&lt;/em&gt; (Exponential Function): Transforms scores into positive values by exponentiating the difference between each score and the maximum score across all blocks, which gives numerical stability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;l(x)&lt;/em&gt; (Sum of Exponential Scores): The sum of exponential values obtained from applying the exponential function to each score within a block, used for softmax probability computation.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See the sample program and results below, written with help from ChatGPT.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch
import torch.nn.functional as F

def flash_attention_softmax(matrix, column_index, block_sizes):
    # Step 1: Extract the column vector
    column = matrix[:, column_index]

    # Step 2: Compute the total size of the concatenated vector
    total_size = column.size(0)

    # Step 3: Split the concatenated vector into blocks
    blocks = torch.split(column, block_sizes)

    # Step 4: Compute the maximum value within each block (𝑚(𝑥))
    max_values = [torch.max(block) for block in blocks]

    # Step 5: Compute the global maximum value across all blocks
    global_max = torch.max(torch.stack(max_values))

    numerator = torch.zeros_like(column)
    offset = 0
    for block in blocks:
        # Step 6: Compute numerator for each block (𝑓(𝑥))
        # (use a running offset so variable block sizes are handled correctly)
        numerator[offset:offset + block.size(0)] = torch.exp(block - global_max)
        offset += block.size(0)

    # Step 7: Compute the sum of exponentials (ℓ(𝑥))
    denominator = torch.sum(numerator)

    # Step 8: Compute softmax probabilities for each block
    softmax_probabilities = numerator / denominator

    return softmax_probabilities

# Example usage
matrix = torch.tensor([[-0.8, -5.0, 5.0, 1.5, 3.4, -2.3, 2.5],
                       [-0.2,  2.3, 3.5, 1.8, 0.9, -1.5, 0.5]], dtype=torch.float32)
column_index = 2
block_sizes = [1, 1]  # Splitting the column into individual elements

print("Matrix:")
print(matrix)

print("\nColumn:")
column = matrix[:, column_index]
print(column)

print("\nBlocks after splitting:")
blocks = torch.split(column, block_sizes)
print(blocks)

print("\nMax values within each block:")
max_values = [torch.max(block) for block in blocks]
print(max_values)

print("\nGlobal maximum value across all blocks:")
global_max = torch.max(torch.stack(max_values))
print(global_max)

softmax_result = flash_attention_softmax(matrix, column_index, block_sizes)
print("\nSoftmax for column", column_index, ":", softmax_result)

# Matrix:
# tensor([[-0.8000, -5.0000,  5.0000,  1.5000,  3.4000, -2.3000,  2.5000],
#         [-0.2000,  2.3000,  3.5000,  1.8000,  0.9000, -1.5000,  0.5000]])

# Column:
# tensor([5.0000, 3.5000])

# Blocks after splitting:
# (tensor([5.]), tensor([3.5000]))

# Max values within each block:
# [tensor(5.), tensor(3.5000)]

# Global maximum value across all blocks:
# tensor(5.)

# Softmax for column 2 : tensor([0.8176, 0.1824])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
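&lt;p&gt;The sample above still computes the global maximum up front. The key trick from the online softmax paper is a merge rule that updates m(x) and l(x) in a single streaming pass, rescaling the running sum whenever a new maximum appears (a minimal Python sketch, not the paper's CUDA implementation):&lt;/p&gt;

```python
# One-pass online softmax normalizer: maintain the running max m and the
# running sum l of exp(x - m); when m grows, rescale l by exp(m_old - m_new).
import math

def online_normalizer(xs):
    m = float("-inf")
    l = 0.0
    for x in xs:
        m_new = max(m, x)
        l = l * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return m, l

m, l = online_normalizer([5.0, 3.5])
probs = [math.exp(x - m) / l for x in [5.0, 3.5]]
print([round(p, 4) for p in probs])  # [0.8176, 0.1824]
```

The final probabilities match the two-pass result exactly, which is why FlashAttention can process Q.K blocks as they stream through SRAM without ever holding a full column.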



&lt;h2&gt;
  
  
  To summarize:
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zUpzpsmG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713460702067/a76eeb50-c175-4473-aae7-ecef9fceda28.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zUpzpsmG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1713460702067/a76eeb50-c175-4473-aae7-ecef9fceda28.png" alt="" width="800" height="642"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(Although I haven't gone into the Transformer attention mechanism and its math, or the Flash Attention algorithm and its math, I hope that at a high level I was able to communicate the essence of the Flash Attention 1 optimizations.)&lt;/p&gt;

&lt;p&gt;Tri Dao, et al., with their combined research / expertise &amp;amp; very good understanding of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Transformer Attention mechanism and the math behind it&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;NVIDIA Ampere series GPU architecture &amp;amp; CUDA parallel programming&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Earlier research works - &lt;a href="https://arxiv.org/abs/2112.05682"&gt;Self-attention Does Not Need O(n2) Memory&lt;/a&gt; by Google Researchers with &lt;a href="https://github.com/google-research/google-research/tree/master/memory_efficient_attention"&gt;reference implementation&lt;/a&gt; in JAX for TPU &amp;amp; &lt;a href="https://arxiv.org/abs/1805.02867"&gt;Online normalizer calculation for softmax&lt;/a&gt; by NVIDIA researchers with &lt;a href="https://github.com/NVIDIA/online-softmax"&gt;reference implementation&lt;/a&gt; in CUDA&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;have shown Mechanical Sympathy to extract the best out of NVIDIA Ampere GPU hardware architecture.&lt;/p&gt;

&lt;h1&gt;
  
  
  Outro
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Implementing a 4096x4096 matmul in C and changing the loop order provides a &lt;a href="https://xta0.me/2021/07/12/MIT-6172-1.html"&gt;461% improvement&lt;/a&gt; in GFLOPS utilization compared to a C implementation with an inefficient loop order. This is done purely by exploiting CPU cache line behavior.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The P99 latency improvement of the Disruptor over ArrayBlockingQueue is &lt;a href="https://lmax-exchange.github.io/disruptor/disruptor.html#_latency_performance_testing"&gt;99%&lt;/a&gt; &amp;amp; enabled LMAX Exchange to handle 6M &lt;a href="https://www.infoq.com/presentations/lmax-trading-architecture/"&gt;order matching engine TPS&lt;/a&gt; on a single machine. This was achieved primarily by using granular inter-thread messaging that allows concurrent reads and writes to the buffer, and efficient use of the CPU cache line.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;FlashAttention trains Transformers faster than existing baselines: &lt;a href="https://arxiv.org/pdf/2205.14135.pdf"&gt;15%&lt;/a&gt; end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the MLPerf 1.1 training speed record, 3x speedup on GPT-2 (seq. length 1K), and 2.4x speedup on long-range arena (seq. length 1K-4K).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this post, we saw examples of Mechanical Sympathy being applied to a wide range of problems requiring different skill sets and expertise, with real-world impact.&lt;/p&gt;

&lt;p&gt;The Deep Learning space is still in its nascent phase. People with expertise in several backgrounds (Data Engineering, Model Training, Deep Learning Algorithms, Compilers, Hardware Interfaces - CPUs, GPUs, Accelerators, Model Inference, Distributed Systems, Infrastructure, Mathematics, Physics, Chemistry) are all working within their domains, and rightfully so. The current cost of training &amp;amp; inference for quality models is prohibitively high. Given that LLMs are going to be human companions like the laptop and the smartphone, several optimizations will be required, some of which will be solved by engineers with a very good understanding of the underlying hardware and architecture.&lt;/p&gt;

&lt;p&gt;It's interesting that FlashAttention 1 was done in &lt;a href="https://youtu.be/IoMSGuiwV3g?si=WvjirXW7hduMTfmd&amp;amp;t=877"&gt;2-3 months&lt;/a&gt;. In 2023, they also published Flash Attention 2 with better parallelism and work partitioning (efficient use of CUDA thread blocks &amp;amp; warps), resulting in optimizations primarily in the compute bound regime. I cannot imagine the breakthroughs we would see if more Deep Learning / Transformer algorithm experts/researchers and CUDA architects like &lt;a href="https://www.nvidia.com/en-us/on-demand/search/?facet.mimetype[]=event%20session&amp;amp;layout=list&amp;amp;page=1&amp;amp;q=Stephen%20Jones%20%28SW%29&amp;amp;sort=date&amp;amp;sortDir=desc"&gt;Stephen Jones&lt;/a&gt; worked on optimizing existing layers and algorithms for a couple of years or so. I'm highlighting CUDA here as NVIDIA is the market leader. Intel, AMD, and other transformer accelerators' computing-platform teams should also spend more effort optimizing model implementations for their respective hardware.&lt;/p&gt;

&lt;h1&gt;
  
  
  References:
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;MIT OpenCourseWare - Performance Engineering - Matrix Multiplication &lt;a href="https://youtu.be/o7h_sYMk_oc?si=mWgShE48VgXARZEz"&gt;https://youtu.be/o7h_sYMk_oc?si=mWgShE48VgXARZEz&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Vipul Vaibhaw's MIT matmul summary - &lt;a href="https://vaibhaw-vipul.medium.com/matrix-multiplication-optimizing-the-code-from-6-hours-to-1-sec-70889d33dcfa"&gt;https://vaibhaw-vipul.medium.com/matrix-multiplication-optimizing-the-code-from-6-hours-to-1-sec-70889d33dcfa&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tao Xu's MIT matmul summary - &lt;a href="https://xta0.me/2021/07/12/MIT-6172-1.html"&gt;https://xta0.me/2021/07/12/MIT-6172-1.html&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;InfoQ Martin Thompson and Michael Barker on building HPC fintech handling over 100k TPS at LMAX - &lt;a href="https://www.infoq.com/presentations/LMAX"&gt;https://www.infoq.com/presentations/LMAX&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;InfoQ Sam Adams on LMAX Exchange Architecture &lt;a href="https://www.infoq.com/presentations/lmax-trading-architecture/"&gt;https://www.infoq.com/presentations/lmax-trading-architecture/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Trisha Gee on LMAX Disruptor Internals - &lt;a href="https://mechanitis.blogspot.com/2011/06/dissecting-disruptor-whats-so-special.html"&gt;RingBuffer&lt;/a&gt;, &lt;a href="https://mechanitis.blogspot.com/2011/07/dissecting-disruptor-why-its-so-fast.html"&gt;LocksAreBad&lt;/a&gt;, &lt;a href="https://mechanitis.blogspot.com/2011/08/dissecting-disruptor-why-its-so-fast.html"&gt;MemoryBarriers&lt;/a&gt;, &lt;a href="https://mechanitis.blogspot.com/2011/06/dissecting-disruptor-how-do-i-read-from.html"&gt;Consumer&lt;/a&gt;, &lt;a href="https://mechanitis.blogspot.com/2011/07/dissecting-disruptor-writing-to-ring.html"&gt;Producer&lt;/a&gt;, &lt;a href="https://mechanitis.blogspot.com/2011/08/disruptor-20-all-change-please.html"&gt;Disruptor 2.0&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;LMAX Exchange Disruptor - &lt;a href="https://lmax-exchange.github.io/disruptor/#_read_this_first"&gt;https://lmax-exchange.github.io/disruptor/#_read_this_first&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Martin Thompson on Memory Barriers &amp;amp; Fences - &lt;a href="https://mechanical-sympathy.blogspot.com/2011/07/memory-barriersfences.html"&gt;https://mechanical-sympathy.blogspot.com/2011/07/memory-barriersfences.html&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Guy Nir on LMAX Disruptor - &lt;a href="https://www.slideshare.net/slideshow/the-edge-2012-disruptor-guy-raz-nir-published/22790571"&gt;https://www.slideshare.net/slideshow/the-edge-2012-disruptor-guy-raz-nir-published/22790571&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Martin Fowler on LMAX Disruptor: &lt;a href="https://martinfowler.com/articles/lmax.html"&gt;https://martinfowler.com/articles/lmax.html&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tri Dao, et al., FLASH ATTENTION 1 &amp;amp; 2: &lt;a href="https://github.com/Dao-AILab/flash-attention"&gt;https://github.com/Dao-AILab/flash-attention&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Aleksa Gordi on ELI5 - FLASH ATTENTION 1: &lt;a href="https://gordicaleksa.medium.com/eli5-flash-attention-5c44017022ad"&gt;https://gordicaleksa.medium.com/eli5-flash-attention-5c44017022ad&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jay Alammar - &lt;a href="https://jalammar.github.io/illustrated-transformer/"&gt;The Illustrated Transformer&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>java</category>
      <category>python</category>
    </item>
    <item>
      <title>CPU &amp; GPU - The Basics</title>
      <dc:creator>Venkat Raman</dc:creator>
      <pubDate>Mon, 08 Apr 2024 11:51:00 +0000</pubDate>
      <link>https://dev.to/venkat2811/cpu-gpu-the-basics-5c5g</link>
      <guid>https://dev.to/venkat2811/cpu-gpu-the-basics-5c5g</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj6pk1v3jm2ec5c6t4hmo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj6pk1v3jm2ec5c6t4hmo.png" alt="Image description" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;In this article, we'll go through some fundamental low-level details to understand why GPUs are good at graphics, neural network, and deep learning tasks, while CPUs are good at a wide range of sequential, complex, general-purpose computing tasks. There were several topics that I had to research to get a more granular understanding for this post, some of which I will just mention in passing. This is done deliberately to focus on the absolute basics of CPU &amp;amp; GPU processing.&lt;/p&gt;

&lt;h1&gt;
  
  
  Von Neumann Architecture
&lt;/h1&gt;

&lt;p&gt;Earlier computers were dedicated devices. Hardware circuits and logic gates were programmed to do a specific set of things. If something new had to be done, circuits needed to be rewired. "Something new" could be as simple as doing mathematical calculations for two different equations. During WWII, Alan Turing was working on a programmable machine to beat the Enigma machine, having earlier published his "Turing Machine" paper. Around the same time, John von Neumann and other researchers were working on an idea which fundamentally proposed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Instruction and data should be stored in shared memory (Stored program).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Processing and memory units should be separate.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Control unit takes care of reading data &amp;amp; instructions from memory to do calculations using processing unit.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Bottleneck
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Processing bottleneck - Only one instruction and its operands can be in a processing unit (physical logic gates) at a time. Instructions are executed sequentially, one after another. Over the years, the focus and improvements have been on making processors smaller, with faster clock cycles and an increasing number of cores.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Memory bottleneck - As processors grew faster, the speed and amount of data that could be transferred between memory and the processing unit became a bottleneck. Memory is several orders of magnitude slower than the CPU. Over the years, the focus and improvements have been on making memory denser and smaller.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  CPUs
&lt;/h1&gt;

&lt;p&gt;We know that everything in our computer is binary. Strings, images, video, audio, the OS, application programs, etc., are all represented as 1s &amp;amp; 0s. CPU architecture (RISC, CISC, etc.) specifications have instruction sets (x86, x86-64, ARM, etc.), which CPU manufacturers must comply with &amp;amp; which are available for the OS to interface with the hardware.&lt;/p&gt;

&lt;p&gt;The OS &amp;amp; application programs, including data, are translated into instruction-set instructions and binary data for processing in the CPU. At the chip level, processing is done with transistors and logic gates. If you execute a program to add two numbers, the addition (the "processing") is done at a logic gate in the processor.&lt;/p&gt;

&lt;p&gt;In a CPU following the Von Neumann architecture, when we are adding two numbers, a single add instruction runs on two numbers in the circuit. For a fraction of that millisecond, only the add instruction was being executed in the (execution) core of the processing unit! This detail always fascinated me.&lt;/p&gt;

&lt;h3&gt;
  
  
  Core in a modern CPU
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--G7kM52e5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712387837513/3eccc30d-7a30-4190-9188-dedc858a3a78.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--G7kM52e5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712387837513/3eccc30d-7a30-4190-9188-dedc858a3a78.png" alt="" width="800" height="947"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The components in the above diagram are self-evident. For a more detailed explanation, refer to this excellent &lt;a href="https://www.redhat.com/sysadmin/cpu-components-functionality"&gt;article&lt;/a&gt;. In modern CPUs, a single physical core can contain more than one integer ALU, floating-point ALU, etc. Again, these units are physical logic gates.&lt;/p&gt;

&lt;p&gt;We need to understand the 'hardware thread' in a CPU core to better appreciate GPUs. &lt;strong&gt;A hardware thread is a unit of compute that the execution units of a CPU core can perform every single CPU clock cycle.&lt;/strong&gt; &lt;strong&gt;It represents the smallest unit of work that can be executed in a core.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Instruction cycle
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lbQ7FhIS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712432020670/3f4787fe-43fd-4871-bb3f-7d96bad86f3e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lbQ7FhIS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712432020670/3f4787fe-43fd-4871-bb3f-7d96bad86f3e.png" alt="" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above diagram illustrates the CPU instruction cycle (machine cycle): the series of steps a CPU performs to execute a single instruction (e.g., c = a + b).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fetch:&lt;/strong&gt; The program counter (a special register in the CPU core) keeps track of which instruction to fetch next. The instruction is fetched into the instruction register. For simple operations, the corresponding data is fetched as well.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Decode:&lt;/strong&gt; The instruction is decoded to determine the operator and operands.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Execute:&lt;/strong&gt; Based on the operation specified, the appropriate processing unit is chosen and the instruction is executed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory Access:&lt;/strong&gt; If the instruction is complex or additional data is needed (several factors can cause this), memory access happens before execute. (Omitted from the above diagram for simplicity.) &lt;strong&gt;For a complex instruction, the initial data is available in the compute unit's data registers, but completing the instruction requires fetching data from the L1 and L2 caches. This means there can be a short wait before the compute unit executes, and the hardware thread holds the compute unit for that entire wait.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Write Back:&lt;/strong&gt; If execution produces output (e.g., c = a + b), the output is written back to a register, cache, or memory. (Omitted from the above diagram, and from the rest of the post, for simplicity.)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
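&lt;p&gt;The steps above can be sketched as a toy register machine in Python. The instruction format here is hypothetical, purely for illustration of the fetch-decode-execute-write-back loop:&lt;/p&gt;

```python
# A toy fetch-decode-execute loop for a register machine.
# Instructions are (opcode, dest, src1, src2) tuples; "program" plays memory.
program = [
    ("ADD", "c", "a", "b"),   # c = a + b
    ("SUB", "d", "c", "a"),   # d = c - a
]
registers = {"a": 2, "b": 3, "c": 0, "d": 0}
pc = 0  # program counter: which instruction to fetch next

while pc < len(program):
    instruction = program[pc]       # Fetch: read the instruction at PC
    op, dest, s1, s2 = instruction  # Decode: split into operator and operands
    if op == "ADD":                 # Execute: route to the right "unit"
        result = registers[s1] + registers[s2]
    elif op == "SUB":
        result = registers[s1] - registers[s2]
    registers[dest] = result        # Write back the result
    pc += 1

print(registers)  # {'a': 2, 'b': 3, 'c': 5, 'd': 3}
```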

&lt;p&gt;In the above diagram, compute happens only at t2. The rest of the time, the core is idle (no useful work is being done).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Modern CPUs have hardware components that enable the fetch, decode, and execute steps to happen concurrently in each clock cycle, so a single hardware thread can now perform a computation every clock cycle.&lt;/strong&gt; This is called instruction pipelining.&lt;/p&gt;

&lt;p&gt;Fetch, decode, memory access, and write back are handled by other components in the CPU. For lack of a better word, these are called "pipeline threads". A pipeline thread becomes a hardware thread when it is in the execute stage of an instruction cycle.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lwhdNkov--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712432055203/d1fc48e1-32b7-4fe1-b06a-8ecc647f7dd0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lwhdNkov--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712432055203/d1fc48e1-32b7-4fe1-b06a-8ecc647f7dd0.png" alt="" width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, from t2 onward we get compute output every cycle. Previously, we got compute output only once every three cycles. &lt;strong&gt;Pipelining improves compute throughput. This is one of the techniques for managing the processing bottleneck in the Von Neumann architecture.&lt;/strong&gt; There are other optimizations as well, such as out-of-order execution, branch prediction, and speculative execution.&lt;/p&gt;
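&lt;p&gt;A quick back-of-the-envelope calculation shows why this matters, assuming an idealized 3-stage (fetch / decode / execute) core with no stalls:&lt;/p&gt;

```python
# Throughput comparison: N instructions through a 3-stage core,
# without and with pipelining (idealized, no stalls or hazards).
def cycles_unpipelined(n, stages=3):
    return n * stages        # each instruction occupies the core for all stages

def cycles_pipelined(n, stages=3):
    return stages + (n - 1)  # fill the pipeline once, then 1 result per cycle

n = 100
print(cycles_unpipelined(n))  # 300
print(cycles_pipelined(n))    # 102
```

&lt;p&gt;For a large number of instructions, throughput approaches one result per cycle instead of one per three cycles.&lt;/p&gt;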

&lt;h3&gt;
  
  
  Hyper-Threading
&lt;/h3&gt;

&lt;p&gt;This is the last CPU concept I want to discuss before we conclude and move on to GPUs. As clock speeds increased, processors became very fast and efficient. As application complexity increased, CPU compute cores were underutilized and spent more and more time waiting on memory access.&lt;/p&gt;

&lt;p&gt;So we are back to the memory bottleneck: the compute unit spends time on memory access instead of doing useful work. Memory is several orders of magnitude slower than the CPU, and the gap is not going to close anytime soon. The idea was to duplicate some units within a single CPU core to increase effective memory bandwidth and keep data ready, so that the compute units stay utilized while a thread is awaiting memory access.&lt;/p&gt;

&lt;p&gt;Hyper-threading was introduced by Intel in 2002 in Xeon and Pentium 4 processors. Prior to hyper-threading, there was only one hardware thread per core. With hyper-threading, there are two hardware threads per core. What does that mean? Some of the circuitry is duplicated: certain registers, the program counter, the fetch unit, the decode unit, etc.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rzT26WTy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712358597762/0afc525c-8033-4c0f-9f16-146edf21dd03.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rzT26WTy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712358597762/0afc525c-8033-4c0f-9f16-146edf21dd03.png" alt="" width="800" height="618"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above diagram shows just the new circuit elements in a CPU core with hyper-threading. &lt;strong&gt;This is how a single physical core appears as 2 cores to the operating system. If you had a 4-core processor with hyper-threading enabled, the OS sees it as 8 cores&lt;/strong&gt;. Naturally, the L1 - L3 cache sizes increase to accommodate the additional registers. Note that the execution units are shared.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8pL_MNDx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712432094887/076256b3-0999-4c6d-b907-3b1e2b9585ca.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8pL_MNDx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712432094887/076256b3-0999-4c6d-b907-3b1e2b9585ca.png" alt="" width="800" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Assume we have processes P1 and P2 computing a=b+c and d=e+f. These can execute concurrently in a single clock cycle because of HW threads 1 and 2. With a single HW thread, as we saw earlier, this would not be possible. &lt;strong&gt;Here we are increasing memory bandwidth within a core by adding an additional hardware thread so that the processing units can be utilized efficiently. This improves compute concurrency.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some interesting scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The CPU has only one integer ALU. Either HW Thread 1 or HW Thread 2 must wait one clock cycle and proceed with its compute in the next cycle.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The CPU has one integer ALU and one floating-point ALU. HW Thread 1 and HW Thread 2 can do addition concurrently, using the ALU and FPU respectively.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;All available ALUs are being utilized by HW Thread 1. HW Thread 2 must wait until an ALU is available. (Not applicable to the addition example above, but this can happen with other instructions.)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
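&lt;p&gt;The first two scenarios can be modeled with a tiny issue simulator. This is a deliberately simplified, hypothetical model (real schedulers are far more sophisticated), just to show why a second hardware thread only helps when there is a free execution unit:&lt;/p&gt;

```python
# Toy model: each cycle, two HW threads try to issue their next op;
# each execution unit can serve at most one op per cycle.
def run(ops_t1, ops_t2, units):
    queues, cycles = [list(ops_t1), list(ops_t2)], 0
    while any(queues):
        free = dict(units)  # units available this cycle, e.g. {"ALU": 1}
        for q in queues:
            if q and free.get(q[0], 0) > 0:
                free[q[0]] -= 1
                q.pop(0)    # this thread's op issues this cycle
        cycles += 1
    return cycles

# One integer ALU shared by two integer adds: one thread waits a cycle.
print(run(["ALU"], ["ALU"], {"ALU": 1}))            # 2
# An ALU plus an FPU: integer and floating-point adds run concurrently.
print(run(["ALU"], ["FPU"], {"ALU": 1, "FPU": 1}))  # 1
```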

&lt;h3&gt;
  
  
  Why is the CPU so good at traditional desktop / server computing?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;High clock speeds - higher than GPU clock speeds. Combined with instruction pipelining, this makes CPUs extremely good at sequential tasks. &lt;strong&gt;Optimized for latency.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Diverse applications and computation needs - Personal computers and servers run a wide range of applications with diverse computation needs. This results in a complex instruction set; the CPU has to be good at many things.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multitasking and multiprocessing - With so many apps on our computers, CPU workloads demand context switching. The caching system and memory access are set up to support this. When a process is scheduled on a CPU hardware thread, it has all the necessary data ready and executes its compute instructions quickly, one by one.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  CPU Drawbacks
&lt;/h3&gt;

&lt;p&gt;Check this &lt;a href="https://medium.com/analytics-vidhya/using-pytorch-and-cuda-for-large-computation-in-google-colabs-f1c026c17673"&gt;article&lt;/a&gt; and also try the &lt;a href="https://colab.research.google.com/drive/1nw34aks9SdMwHXl9Gf5T9GPxRB9BIIyr"&gt;Colab notebook&lt;/a&gt;. It shows how matrix multiplication is a parallelizable task and how parallel compute cores can speed up the calculation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Extremely good at sequential tasks, but not at parallel tasks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Complex instruction set and complex memory access patterns.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The CPU also spends a lot of energy on context switching and control unit activities, in addition to compute.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
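&lt;p&gt;The parallelism in matrix multiplication is easy to see in code: every output cell is an independent dot product of one row and one column, so all cells can in principle be computed simultaneously. A small illustrative sketch using a thread pool (on a GPU, each cell would map to its own hardware thread):&lt;/p&gt;

```python
# Every output cell C[i][j] = dot(row i of A, column j of B) is independent,
# so each one can be submitted as its own task.
from concurrent.futures import ThreadPoolExecutor

def dot(row, col):
    return sum(r * c for r, c in zip(row, col))

def matmul_parallel(A, B):
    cols = list(zip(*B))  # columns of B
    with ThreadPoolExecutor() as pool:
        # one independent task per output cell - no task depends on another
        futures = [[pool.submit(dot, row, col) for col in cols] for row in A]
    return [[f.result() for f in row] for row in futures]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_parallel(A, B))  # [[19, 22], [43, 50]]
```

&lt;p&gt;On a CPU with a handful of cores, only a few of these tasks run at once; a GPU with thousands of compute units can run them all in parallel, which is where the speedup comes from.&lt;/p&gt;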

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Instruction pipelining improves compute throughput.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Increasing memory bandwidth improves compute concurrency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CPUs are good at sequential tasks (optimized for latency). They are not good at massively parallel tasks, which require a large number of compute units and hardware threads that CPUs lack (they are not optimized for throughput). These are not available because CPUs are built for general-purpose computing and have complex instruction sets.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  GPUs
&lt;/h1&gt;

&lt;p&gt;As computing power increased, so did the demand for graphics processing. Tasks like UI rendering and gaming require parallel operations, driving the need for numerous ALUs and FPUs at the circuit level. CPUs, designed for sequential tasks, couldn't handle these parallel workloads effectively. Thus, GPUs were developed to fulfill the demand for parallel processing in graphics tasks, later paving the way for their adoption in accelerating deep learning algorithms.&lt;/p&gt;

&lt;p&gt;I would highly recommend:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Watching this &lt;a href="https://youtu.be/C8YtdC8mxTU?si=OdrFXUFMLBhuZF34"&gt;video&lt;/a&gt; that explains parallel tasks involved in Video Games rendering.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reading this &lt;a href="https://jalammar.github.io/illustrated-transformer/"&gt;blog post&lt;/a&gt; to understand parallel tasks involved in a transformer. There are other deep learning architectures like CNNs, RNNs as well. Since LLMs are taking over the world, high level understanding of parallelism in matrix multiplications required for transformer tasks would set a good context for the remainder of this post. (At a later time, I plan to fully understand transformer &amp;amp; share a digestible high-level overview of what happens in transformer layers of a small GPT model.)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example CPU vs GPU spec
&lt;/h3&gt;

&lt;p&gt;The cores, hardware threads, clock speed, memory bandwidth, and on-chip memory of CPUs and GPUs differ significantly. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Intel Xeon 8280: see its &lt;a href="https://www.intel.com/content/www/us/en/products/sku/192478/intel-xeon-platinum-8280-processor-38-5m-cache-2-70-ghz/specifications.html"&gt;specifications&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Nvidia A100 80GB SXM: see its &lt;a href="https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf"&gt;datasheet&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Core in a modern GPU
&lt;/h3&gt;

&lt;p&gt;The terminology we saw for CPUs doesn't always translate directly to GPUs. Here we'll look at the components and cores of an NVIDIA A100 GPU. One thing that surprised me while researching this article: CPU vendors don't publish how many ALUs, FPUs, etc., are available in the execution units of a core. NVIDIA, by contrast, is very transparent about core counts, and the CUDA framework gives complete flexibility and access at the circuit level.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vc0BlCTS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712503965633/0b04a91e-6790-4f56-9a05-ddf3bcb3c02f.png" alt="" width="800" height="404"&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the GPU side of the above diagram, we can see that there is no L3 cache, a smaller L2 cache, smaller but far more numerous control units and L1 caches, and a large number of processing units.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---39tT0_9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712535081640/3c8499bf-4e86-4496-ac86-fd19fc9f86f0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---39tT0_9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712535081640/3c8499bf-4e86-4496-ac86-fd19fc9f86f0.png" alt="" width="800" height="912"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IRAoL_Y---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712535131260/d20f6b3b-f97d-4efa-82fd-ab64ab6f09ba.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IRAoL_Y---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712535131260/d20f6b3b-f97d-4efa-82fd-ab64ab6f09ba.png" alt="" width="800" height="1070"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are the GPU components in the above diagrams and their CPU equivalents, for an initial understanding. I haven't done CUDA programming, so comparing with CPU equivalents helps build that initial intuition; CUDA programmers understand this very well.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Multiple Streaming Multiprocessors &amp;lt;&amp;gt; Multi Core CPU&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Streaming Multiprocessor (SM) &amp;lt;&amp;gt; CPU Core&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Streaming processor (SP)/ CUDA Core &amp;lt;&amp;gt; ALU / FPU in execution units of a CPU Core&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tensor Core (capable of doing 4x4 FP64 operations in a single instruction) &amp;lt;&amp;gt; SIMD execution units in a modern CPU core (eg: AVX-512)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hardware Thread (doing compute in CUDA or Tensor Cores in a single clock cycle) &amp;lt;&amp;gt; Hardware Thread (doing compute in execution units [ALUs, FPUs, etc.,] in a single clock cycle)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;HBM / VRAM / DRAM / GPU Memory &amp;lt;&amp;gt; RAM&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;On-chip memory/SRAM (Registers, L1, L2 cache) &amp;lt;&amp;gt; On-chip memory/SRAM (Registers, L1, L2, L3 cache)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Moving data &amp;amp; memory bandwidth
&lt;/h3&gt;

&lt;p&gt;Graphics and deep learning tasks demand SIMD/SIMT (single instruction, multiple data / multiple threads) execution, i.e., reading and working on large amounts of data with a single instruction.&lt;/p&gt;
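&lt;p&gt;A minimal illustration of the SIMD/SIMT idea in plain Python (conceptual only; real SIMD happens in hardware lanes, not in map()): one instruction, applied across many data elements.&lt;/p&gt;

```python
# ONE operation ("multiply by weight, add bias") applied to MANY elements.
# On a GPU, each element would map to a thread executing the same
# instruction in lockstep.
data = [1.0, 2.0, 3.0, 4.0]
weight, bias = 0.5, 1.0

# Scalar (CPU-style): one element per instruction, executed sequentially.
scalar_out = []
for x in data:
    scalar_out.append(x * weight + bias)

# SIMT-style: the same single operation mapped across all elements at once.
simt_out = list(map(lambda x: x * weight + bias, data))

print(simt_out)  # [1.5, 2.0, 2.5, 3.0]
assert scalar_out == simt_out
```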

&lt;p&gt;We discussed instruction pipelining and hyper-threading for CPUs; GPUs have these capabilities too. The implementation and mechanics differ slightly, but the principles are the same.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bJiOhehm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712561411743/c07ce188-29ba-4fc1-9529-33c53ccefaaa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bJiOhehm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712561411743/c07ce188-29ba-4fc1-9529-33c53ccefaaa.png" alt="" width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unlike CPUs, GPUs (via CUDA) provide direct access to pipeline threads (which fetch data from memory and utilize the memory bandwidth). GPU schedulers work by first filling the compute units (including the associated shared L1 cache and the registers that hold compute operands), then the "pipeline threads" that fetch data into registers from HBM. Again, I want to emphasize that CPU application programmers don't think about this, and specs for "pipeline threads" and the number of compute units per core are not published. NVIDIA not only publishes these, but also gives programmers complete control over them.&lt;/p&gt;

&lt;p&gt;I will go into more detail in a dedicated post about the CUDA programming model and about "batching" as a model-serving optimization technique, where we can see how beneficial this is.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4eKzxDQT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712562279792/0ccc0f09-30f9-4287-8c4c-79aee036124d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4eKzxDQT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712562279792/0ccc0f09-30f9-4287-8c4c-79aee036124d.png" alt="" width="749" height="249"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above diagram depicts hardware thread execution in a CPU core and a GPU core. Recall the "memory access" step we discussed earlier in CPU pipelining; this diagram shows it. The CPU's complex memory management keeps this wait small (a few clock cycles) when fetching data from the L1 cache into registers. When data needs to be fetched from L3 or main memory, the other thread, whose data is already in registers (as we saw in the hyper-threading section), gets control of the execution units.&lt;/p&gt;

&lt;p&gt;In GPUs, because of oversubscription (a high number of pipeline threads and registers) and a simple instruction set, a large amount of data is already available in registers, pending execution. Pipeline threads waiting for execution become hardware threads and can execute as often as every clock cycle, because pipeline threads in GPUs are lightweight.&lt;/p&gt;

&lt;h4&gt;
  
  
  Bandwidth, Compute Intensity &amp;amp; Latency
&lt;/h4&gt;

&lt;p&gt;What's our goal?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Fully utilize the hardware resources (compute units) every clock cycle to get the best out of the GPU.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To keep the compute units busy, we need to feed them enough data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--h8S5Yl3k--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712569518138/3760def5-b2e9-4719-ae18-ffa1f4751f49.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--h8S5Yl3k--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712569518138/3760def5-b2e9-4719-ae18-ffa1f4751f49.png" alt="" width="800" height="873"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the main reason why the latency of multiplying small matrices is more or less the same on a CPU and a GPU. &lt;a href="https://ashanpriyadarshana.medium.com/cuda-gpu-memory-architecture-8c3ac644bd64"&gt;Try it out&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Tasks need to be parallel enough, and the data needs to be large enough, to saturate the compute FLOPs and the memory bandwidth. If a single task is not big enough, multiple such tasks need to be packed together to saturate memory and compute and fully utilize the hardware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compute Intensity = FLOPs / Bandwidth&lt;/strong&gt;, i.e., the ratio of the amount of work the compute units can do per second to the amount of data memory can deliver per second.&lt;/p&gt;
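&lt;p&gt;As a worked example with rough, illustrative numbers (approximate A100-class figures for intuition only, not exact specs):&lt;/p&gt;

```python
# Compute Intensity = FLOPs / Bandwidth, with ballpark A100-class numbers.
peak_flops = 19.5e12  # ~19.5 TFLOPS FP64 tensor-core rate (approximate)
bandwidth = 2.0e12    # ~2 TB/s HBM bandwidth (approximate)

flops_per_byte = peak_flops / bandwidth
print(round(flops_per_byte, 2))  # 9.75
```

&lt;p&gt;Under these assumptions, roughly 10 operations must be performed per byte moved from HBM (around 78 per 8-byte FP64 value) just to keep the compute units from idling, which is why data reuse in registers and L1 is so important.&lt;/p&gt;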

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Htn5n1uN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712535632913/e5a8a5ba-4376-4780-b38d-35ddc83e3bb6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Htn5n1uN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1712535632913/e5a8a5ba-4376-4780-b38d-35ddc83e3bb6.png" alt="" width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above diagram, we see that compute intensity increases as we move to higher-latency, lower-bandwidth memory. &lt;strong&gt;We want this number to be as small as possible so that compute is fully utilized.&lt;/strong&gt; For that, we need to keep as much data as possible in L1 cache / registers so that compute can happen quickly. If we fetch a single piece of data from HBM, few workloads perform the ~100 operations on it needed to make the fetch worthwhile; if we don't do those 100 operations, the compute units sit idle. &lt;strong&gt;This is where the high number of threads and registers in GPUs comes into play: keeping as much data as possible in L1/registers keeps the compute intensity low and keeps the parallel cores busy.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There is a 4X difference in compute intensity between CUDA and Tensor cores because a CUDA core can do only a single 1x1 FP64 multiply-accumulate per instruction, whereas a Tensor core can do a 4x4 FP64 MMA instruction per clock cycle.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;p&gt;A high number of compute units (CUDA and Tensor cores), a high number of threads and registers (oversubscription), a reduced instruction set, no L3 cache, HBM, and a simple, high-throughput memory access pattern (compared to the CPU's context switching, multi-layer caching, memory paging, TLB, etc.) are the principles that make GPUs so much better than CPUs at parallel computing (graphics rendering, deep learning, etc.).&lt;/p&gt;

&lt;h1&gt;
  
  
  Beyond GPUs
&lt;/h1&gt;

&lt;p&gt;GPUs were first created for handling graphics processing tasks. AI researchers then started taking advantage of CUDA and its direct access to powerful parallel processing via CUDA cores. NVIDIA GPUs have Texture Processing, Ray Tracing, Raster, and PolyMorph engines (call them graphics-specific instruction sets). With increasing adoption in AI, Tensor cores, which are good at 4x4 matrix calculations (the MMA instruction), are being added, dedicated to deep learning.&lt;/p&gt;

&lt;p&gt;Since 2017, NVIDIA has been increasing the number of Tensor cores with each architecture. But these GPUs also remain good at graphics processing: although the instruction set and its complexity are much smaller than a CPU's, the GPU is not fully dedicated to deep learning (especially the transformer architecture).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2307.08691"&gt;FlashAttention 2&lt;/a&gt;, a software layer optimization (mechanical sympathy for attention layer's memory access pattern) for transformer architecture provides 2X speedup in tasks.&lt;/p&gt;

&lt;p&gt;With our in-depth, first-principles understanding of CPUs and GPUs, we can understand the need for transformer accelerators: dedicated chips (circuits built only for transformer operations) with an even larger number of compute units for parallelism, a reduced instruction set, no L1/L2 caches, and massive on-chip memory replacing HBM, with memory units optimized for the memory access pattern of the transformer architecture. After all, LLMs are new companions for humans (after the web and mobile), and they need dedicated chips for efficiency and performance.&lt;/p&gt;

&lt;h4&gt;
  
  
  Some AI Accelerators:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://aws.amazon.com/machine-learning/inferentia/"&gt;AWS Inferentia&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/tpu"&gt;Google TPU&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.cerebras.net/blog/cerebras-cs3"&gt;Cerebras CS3&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;...&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Transformer Accelerators:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://wow.groq.com/why-groq/"&gt;Groq LPU&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.etched.com/"&gt;Transformers Etched into silicon&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;...&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  FPGA based Transformer Accelerators:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.achronix.com/blog/fpga-accelerated-large-language-models-used-chatgpt"&gt;Achronix&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;...&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  References:
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Von_Neumann_architecture"&gt;https://en.wikipedia.org/wiki/Von_Neumann_architecture&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://chsasank.com/llm-system-design.html"&gt;https://chsasank.com/llm-system-design.html&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.redhat.com/sysadmin/cpu-components-functionality"&gt;https://www.redhat.com/sysadmin/cpu-components-functionality&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.wixstatic.com/ugd/56440f_e458602dcb0c4af9aaeb7fdaa34bb2b4.pdf"&gt;https://docs.wixstatic.com/ugd/56440f_e458602dcb0c4af9aaeb7fdaa34bb2b4.pdf&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.nand2tetris.org/course"&gt;https://www.nand2tetris.org/course&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://cpu.land/"&gt;https://cpu.land/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Hyper-threading"&gt;https://en.wikipedia.org/wiki/Hyper-threading&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How do Video Game Graphics Work? - &lt;a href="https://youtu.be/C8YtdC8mxTU?si=OdrFXUFMLBhuZF34"&gt;https://youtu.be/C8YtdC8mxTU?si=OdrFXUFMLBhuZF34&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CPU vs GPU vs TPU vs DPU vs QPU - &lt;a href="https://www.youtube.com/watch?v=r5NQecwZs1A"&gt;https://www.youtube.com/watch?v=r5NQecwZs1A&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How GPU Computing Works | GTC 2021 | Stephen Jones - &lt;a href="https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31151/"&gt;https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31151/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Compute Intensity - &lt;a href="https://www.linkedin.com/pulse/threads-tensor-cores-beyond-unveiling-dynamics-gpu-memory-florit-smg2c/"&gt;https://www.linkedin.com/pulse/threads-tensor-cores-beyond-unveiling-dynamics-gpu-memory-florit-smg2c/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How CUDA Programming Works | GTC Fall 2022 | Stephen Jones - &lt;a href="https://www.nvidia.com/en-us/on-demand/session/gtcfall22-a41101/"&gt;https://www.nvidia.com/en-us/on-demand/session/gtcfall22-a41101/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Why use GPU with Neural Networks? - &lt;a href="https://www.youtube.com/watch?v=GRRMi7UfZHg"&gt;https://www.youtube.com/watch?v=GRRMi7UfZHg&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CUDA Hardware | Tom Nurkkala | Taylor University Lecture - &lt;a href="https://www.youtube.com/watch?v=kUqkOAU84bA"&gt;https://www.youtube.com/watch?v=kUqkOAU84bA&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://ashanpriyadarshana.medium.com/cuda-gpu-memory-architecture-8c3ac644bd64"&gt;https://ashanpriyadarshana.medium.com/cuda-gpu-memory-architecture-8c3ac644bd64&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://colab.research.google.com/drive/1nw34aks9SdMwHXl9Gf5T9GPxRB9BIIyr"&gt;https://colab.research.google.com/drive/1nw34aks9SdMwHXl9Gf5T9GPxRB9BIIyr&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://developer.nvidia.com/blog/cuda-refresher-reviewing-the-origins-of-gpu-computing/"&gt;https://developer.nvidia.com/blog/cuda-refresher-reviewing-the-origins-of-gpu-computing/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>gpu</category>
      <category>mlops</category>
      <category>llm</category>
    </item>
    <item>
      <title>OS Error: Too many open files. Understanding file and socket descriptors.</title>
      <dc:creator>Venkat Raman</dc:creator>
      <pubDate>Tue, 26 Mar 2024 15:12:19 +0000</pubDate>
      <link>https://dev.to/venkat2811/os-error-too-many-open-files-understanding-file-and-socket-descriptors-1o89</link>
      <guid>https://dev.to/venkat2811/os-error-too-many-open-files-understanding-file-and-socket-descriptors-1o89</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgdnattb4v46i9jm9123.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgdnattb4v46i9jm9123.png" alt="banner" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;Engineers who've built, deployed, and operated backend services will have encountered this error. It usually means your service is serving real user requests - yay 🎉! One possible scenario is that you need to fine-tune the server's OS configuration to scale up; the other is that there is a resource leak in your system.&lt;/p&gt;

&lt;p&gt;I've encountered this error four times so far in my 8-year career. I wanted to write about it and share, as it was always interesting.&lt;/p&gt;

&lt;p&gt;Let's start with resource leakage.&lt;/p&gt;
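&lt;p&gt;The failure mode itself is easy to reproduce: every socket (or file) you open consumes a file descriptor, and once the process hits its descriptor limit, the OS refuses to hand out more. A minimal sketch on Linux/macOS that lowers the limit and then leaks sockets until the error appears (illustrative only; don't do this in production):&lt;/p&gt;

```python
# Leak sockets until the process hits its file-descriptor limit.
import resource
import socket

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (64, hard))  # lower the soft limit

sockets, error = [], None
try:
    while True:
        sockets.append(socket.socket())  # "leaked": opened but never closed
except OSError as e:
    error = e  # EMFILE: the OS refused to allocate another descriptor
finally:
    for s in sockets:
        s.close()
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))  # restore limit

print(error)  # e.g. [Errno 24] Too many open files
```

&lt;p&gt;A long-running service that forgets to close connections does exactly this, just slowly, over days, which is why a restart "fixes" it every time.&lt;/p&gt;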

&lt;h2&gt;
  
  
  Service written in kotlin using ktor
&lt;/h2&gt;

&lt;p&gt;In late 2019, our team wanted to experiment with Kotlin and Ktor as an alternative to Spring Boot. We wanted to quickly try it out in a simple microservice receiving just a few hundred requests a day. &lt;em&gt;Our app was deployed and had been serving customer requests for a few days without any issue. There were no server restarts since deployment.&lt;/em&gt; One morning, 500 service error alerts were triggered. Looking at the server logs, requests were being accepted but erroring out in one of the processing steps. I reproduced the issue in the prod environment to gather recent logs and restarted the service. A bug was filed but not yet prioritised (not a critical service receiving millions of requests). A couple of days went by, and the same issue started occurring in the evening. Our friend, the server restart, solved it again 😊&lt;/p&gt;

&lt;p&gt;By this time, a few L1 customer tickets had also been filed, and I started looking into the issue. Prometheus showed that memory usage increased over time and flatlined around the time the service started rejecting requests. The logs also showed that the errors originated in one of our processing steps where the Ktor OkHttp client was used. We found &lt;a href="https://github.com/ktorio/ktor/issues/1009"&gt;this issue&lt;/a&gt; on GitHub, and upgrading the library solved the problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Service written in Python using LangChain &amp;amp; the OpenAI lib
&lt;/h2&gt;

&lt;p&gt;LangChain is a framework for developing applications (RAG &amp;amp; AI agents) powered by language models. &lt;em&gt;Our app was deployed and had been serving customer requests for a few days without any issue. There were no server restarts since deployment&lt;/em&gt; (see the pattern?). One afternoon in early 2024, 500 service error alerts were triggered. Looking at the server logs, requests were being rejected with &lt;code&gt;OS Error: Too many open files&lt;/code&gt;. A good old server restart quickly fixed the error and the service resumed serving user requests. My immediate hunch (from the Ktor issue a few years earlier) was that there was an underlying resource leak.&lt;/p&gt;

&lt;p&gt;I wanted to reproduce the issue in the staging environment. A quick Google search surfaced &lt;a href="https://github.com/langchain-ai/langchain/issues/13509"&gt;this&lt;/a&gt; issue. So I monitored the following while simulating a few hundred requests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Processes grouped by name &amp;amp; connection status, sorted by count&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Local and remote addr of connections in &lt;code&gt;CLOSE_WAIT&lt;/code&gt; status&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
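&lt;p&gt;For reference, here's a minimal sketch of the kind of check involved (my own illustration, not the exact commands from the original investigation, and Linux-specific): parsing &lt;code&gt;/proc/net/tcp&lt;/code&gt; to list connections stuck in &lt;code&gt;CLOSE_WAIT&lt;/code&gt;. In practice, tools like &lt;code&gt;ss&lt;/code&gt;, &lt;code&gt;lsof&lt;/code&gt; or osquery report the same information.&lt;/p&gt;

```python
CLOSE_WAIT = "08"  # TCP state code for CLOSE_WAIT in /proc/net/tcp

def _decode(addr_port):
    # /proc/net/tcp stores addresses as "AABBCCDD:PPPP": a little-endian
    # hex IPv4 address and a hex port number.
    addr_hex, port_hex = addr_port.split(":")
    octets = [str(int(addr_hex[i:i + 2], 16)) for i in range(6, -2, -2)]
    return ".".join(octets), int(port_hex, 16)

def close_wait_connections(proc_net_tcp="/proc/net/tcp"):
    """Return (local, remote) address pairs for sockets stuck in CLOSE_WAIT."""
    conns = []
    with open(proc_net_tcp) as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.split()
            if fields[3] == CLOSE_WAIT:
                conns.append((_decode(fields[1]), _decode(fields[2])))
    return conns
```

&lt;p&gt;A steadily growing result from a check like this, with remote addresses pointing at one external service, is exactly the signature of a client-library leak.&lt;/p&gt;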

&lt;p&gt;And the remote address matched OpenAI's API domain. Since LangChain uses the LLM provider's client lib to connect to and interact with the models, the leak had to be in the OpenAI client lib. A quick search of the openai GitHub issues showed that it had &lt;a href="https://github.com/openai/openai-python/issues/933#issuecomment-1857718386"&gt;already&lt;/a&gt; been addressed and fixed. So our fix was to upgrade the underlying openai lib version. The fix was verified in staging and rolled out to customers.&lt;/p&gt;

&lt;p&gt;There is a small difference in how the 500 service errors were triggered in the two services above. The Kotlin service using the Ktor server accepted the request and errored out in the processing step that used the Ktor OkHttp client. The Python service using the Flask server errored out while accepting the request for processing. I will punt on this for now and cover it in a separate post, as it deals with differences between server frameworks.&lt;/p&gt;

&lt;p&gt;Before fine-tuning server configuration to scale up, let's understand network connections.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding connections &amp;amp; OS files
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Opening a file
&lt;/h3&gt;

&lt;p&gt;When a process opens a file, a file descriptor is created: an entry in the process's file descriptor table that refers to the kernel's open file table entry, which tracks the file position (offset), access mode (read, write, or both), and file status flags (such as whether the file is open for appending or is non-blocking). When the file is closed, the file descriptor is released, freeing the associated system resources and removing its entry from the process's file descriptor table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lw-nwzEM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1711449713803/08a3389d-f6c4-4350-8fd4-eed120d328a2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lw-nwzEM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1711449713803/08a3389d-f6c4-4350-8fd4-eed120d328a2.png" alt="two processes opening the same file" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  IPC via shared message queue
&lt;/h3&gt;

&lt;p&gt;Processes on the same machine often use Inter-Process Communication (IPC) mechanisms like message queues for data exchange. Message queues are associated with unique identifiers, akin to file descriptors, enabling processes to access them using standard file I/O operations. They provide synchronization and data buffering, facilitating asynchronous communication and enabling processes to operate independently without waiting for message exchange.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lwn4vuiO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1711458906492/033f9da6-5774-4063-9131-5b1e637de6f3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lwn4vuiO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1711458906492/033f9da6-5774-4063-9131-5b1e637de6f3.png" alt="two processes communicating via shared message queue" width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Client server communication via HTTP
&lt;/h3&gt;

&lt;p&gt;Similarly, in client-server communication, when an HTTP request is made, the library (running in a process or thread) creates a network file descriptor with the following metadata: the network socket type (TCP or UDP), the local and remote addresses and ports, socket options (such as whether the socket is reusable, or whether it's in blocking or non-blocking mode), and a reference to the corresponding socket data structures in the operating system's networking stack.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3BHmuB-C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1711465716668/909893cf-fcb1-4ad1-a3a4-2d6e68d6d1e4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3BHmuB-C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1711465716668/909893cf-fcb1-4ad1-a3a4-2d6e68d6d1e4.png" alt="A process using two network sockets" width="800" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Message queue descriptors (with the open MQ descriptor table) and socket descriptors (with the open socket descriptor table) are all treated as file descriptors in a file descriptor table by the OS (Linux and POSIX). So far we've covered a high-level overview of file descriptors; see the references at the end for more details.&lt;/p&gt;

&lt;h3&gt;
  
  
  ulimit
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;ulimit&lt;/code&gt; is a command-line utility in Unix-like operating systems used to control and report resource limits for processes. It can be used to set limits on the maximum number of file descriptors that a process can open. This is important for preventing resource exhaustion and ensuring system stability. By adjusting the &lt;code&gt;nofile&lt;/code&gt; (or &lt;code&gt;open files&lt;/code&gt;) limit with &lt;code&gt;ulimit&lt;/code&gt;, one can control how many files a process can have open simultaneously, including regular files, directories, pipes, and sockets.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Soft Limit&lt;/strong&gt; : The limit currently enforced for the process. When a process tries to open more file descriptors than the soft limit allows, the call fails with an error such as &lt;code&gt;EMFILE&lt;/code&gt; ("Too many open files").&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hard Limit&lt;/strong&gt; : The ceiling for the soft limit. An unprivileged process can raise its soft limit only up to the hard limit; raising the hard limit itself requires privileges.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
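&lt;p&gt;From inside a process (on Unix-like systems), these two limits can be inspected with Python's standard &lt;code&gt;resource&lt;/code&gt; module; &lt;code&gt;RLIMIT_NOFILE&lt;/code&gt; is the open-file-descriptor limit that &lt;code&gt;ulimit -n&lt;/code&gt; reports. A quick sketch:&lt;/p&gt;

```python
import resource  # Unix-only standard library module

def nofile_limits():
    """Return the (soft, hard) limits on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

&lt;p&gt;A process can raise its own soft limit up to the hard limit with &lt;code&gt;resource.setrlimit&lt;/code&gt;; raising the hard limit requires privileges.&lt;/p&gt;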

&lt;h2&gt;
  
  
  Load testing my GSoC project
&lt;/h2&gt;

&lt;p&gt;In my final semester I worked on building an HTTP Load Balancer on top of the WSO2 gateway. The gateway core uses LMAX Disruptor - a high-performance inter-thread messaging library developed at the LMAX Exchange, renowned for its "mechanical sympathy" approach. It facilitates low-latency, high-throughput messaging between threads, crucial for real-time financial trading systems, by minimizing contention and maximizing CPU cache efficiency. I will discuss this in a separate blog post.&lt;/p&gt;

&lt;p&gt;I wanted to run some benchmarks to see how my load balancer fared against nginx, and quickly started hitting the too many open files error. I had to change OS configuration to increase the number of concurrent connections.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# /etc/security/limits.conf
#
#Each line describes a limit for a user in the form:
#
#&amp;lt;domain&amp;gt;        &amp;lt;type&amp;gt;  &amp;lt;item&amp;gt;  &amp;lt;value&amp;gt;
#

*         hard    nofile      500000
*         soft    nofile      500000
root      hard    nofile      500000
root      soft    nofile      500000

# End of file
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#
# /etc/sysctl.conf - Configuration file for setting system variables
# See /etc/sysctl.d/ for additional system variables.
# See sysctl.conf (5) for information.
#

net.ipv4.netfilter.ip_conntrack_max = 32768
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_orphan_retries = 1
net.ipv4.tcp_fin_timeout = 5
net.ipv4.tcp_max_orphans = 32768
net.ipv4.ip_local_port_range = 1025    61000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also refer to these configurations here: &lt;a href="https://github.com/wso2-incubator/HTTP-Load-balancer/tree/master/performance-benchmark/test-bed"&gt;https://github.com/wso2-incubator/HTTP-Load-balancer/tree/master/performance-benchmark/test-bed&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimizing resources and auto scaling policy for spring boot microservice
&lt;/h2&gt;

&lt;p&gt;I wanted to evaluate how many concurrent connections a single container of ours could handle, to optimize the autoscaling policy. The default ECS container ulimit is 1024. After ~600 parallel user requests, with memory &amp;amp; CPU utilization at ~50% &amp;amp; ~30% respectively, I started seeing &lt;code&gt;too many open files&lt;/code&gt; errors. The p99 for these 600 parallel user requests was 2s. I increased the container ulimit to 2400, and also increased the DB &amp;amp; HTTP connection pool sizes (I'll write about why connection pooling is important in a separate post). With the increased limits &amp;amp; optimizations, the benchmark showed more than 90% memory and 60% CPU utilization. Based on these, autoscaling was set to trigger at 85% memory utilization.&lt;/p&gt;
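&lt;p&gt;On ECS, the per-container limit is set via the &lt;code&gt;ulimits&lt;/code&gt; field of the task definition. A rough illustration (the container name is made up; see the ECS Ulimit API reference in the references section for the exact schema):&lt;/p&gt;

```json
{
  "containerDefinitions": [
    {
      "name": "spring-boot-service",
      "ulimits": [
        { "name": "nofile", "softLimit": 2400, "hardLimit": 2400 }
      ]
    }
  ]
}
```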

&lt;p&gt;Thanks for reading !&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://man7.org/tlpi/download/TLPI-52-POSIX_Message_Queues.pdf"&gt;https://man7.org/tlpi/download/TLPI-52-POSIX_Message_Queues.pdf&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.usna.edu/Users/cs/wcbrown/courses/IC221/classes/L09/Class.html"&gt;https://www.usna.edu/Users/cs/wcbrown/courses/IC221/classes/L09/Class.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.codequoi.com/en/handling-a-file-by-its-descriptor-in-c/"&gt;https://www.codequoi.com/en/handling-a-file-by-its-descriptor-in-c/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_Ulimit.html"&gt;https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_Ulimit.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://osquery.io/schema/5.11.0/#process_open_files"&gt;https://osquery.io/schema/5.11.0/#process_open_files&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://osquery.io/schema/5.11.0/#process_open_sockets"&gt;https://osquery.io/schema/5.11.0/#process_open_sockets&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://osquery.io/schema/5.11.0/#file"&gt;https://osquery.io/schema/5.11.0/#file&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://osquery.io/schema/5.11.0/#device_file"&gt;https://osquery.io/schema/5.11.0/#device_file&lt;/a&gt;&lt;/p&gt;

</description>
      <category>openfileserror</category>
      <category>ktor</category>
      <category>openai</category>
      <category>langchain</category>
    </item>
    <item>
      <title>HTTP Load Balancer on Top of WSO2 Gateway — Part 2: Performance Benchmarks</title>
      <dc:creator>Venkat Raman</dc:creator>
      <pubDate>Fri, 19 Aug 2016 13:00:54 +0000</pubDate>
      <link>https://dev.to/venkat2811/http-load-balancer-on-top-of-wso2-gateway-part-2-performance-benchmarks-638</link>
      <guid>https://dev.to/venkat2811/http-load-balancer-on-top-of-wso2-gateway-part-2-performance-benchmarks-638</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ldJ8Kl3D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668358081421/n6O9TKZrX.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ldJ8Kl3D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668358081421/n6O9TKZrX.png" alt="banner-gsoc2016_2.png" width="640" height="191"&gt;&lt;/a&gt;In my &lt;a href="https://venkat2811.blogspot.in/2016/08/http-load-balancer-on-top-of-wso2.html"&gt;&lt;strong&gt;previous post&lt;/strong&gt;&lt;/a&gt;, I discussed the Load Balancer Engine Architecture and its features. In this post, I'll be discussing performance benchmarks.&lt;/p&gt;

&lt;p&gt;Kindly note that the underlying carbon-gateway-framework is under development, and hence more features and improvements will be coming to this LB.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance Benchmarks
&lt;/h3&gt;

&lt;p&gt;In this performance test, five instances of a simple service built with the Netty framework were used. Each instance is a &lt;strong&gt;fast backend (0s delay)&lt;/strong&gt; with a &lt;strong&gt;response of size 1KB&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One million (1,000,000) requests&lt;/strong&gt; were sent at different &lt;strong&gt;concurrency levels (500 to 12,000)&lt;/strong&gt; to the Netty backend, Nginx (open source version) and GW-LB using ApacheBench, via an automated script.&lt;/p&gt;

&lt;p&gt;Benchmarks were conducted in Round-Robin algorithm mode with no persistence policies.&lt;/p&gt;
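&lt;p&gt;For readers unfamiliar with it, round-robin in its simplest form just hands each request to the next backend in a fixed cyclic order. A minimal sketch of my own (not the GW-LB implementation):&lt;/p&gt;

```python
import itertools

class RoundRobin:
    """Pick backends in a fixed cyclic order, with no persistence."""

    def __init__(self, endpoints):
        self._cycle = itertools.cycle(endpoints)

    def next_endpoint(self):
        return next(self._cycle)
```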

&lt;p&gt;More details can be found &lt;a href="https://github.com/wso2-incubator/HTTP-Load-balancer/tree/master/performance-benchmark#performance-test-using-high-performance-netty-back-end"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test Bed
&lt;/h3&gt;

&lt;p&gt;These are the configurations used during benchmarking.&lt;/p&gt;

&lt;h4&gt;
  
  
  VM Details
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Guest OS:&lt;/strong&gt; Ubuntu 64-bit 16.04 VM&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RAM:&lt;/strong&gt; 8 GB&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CPU cores:&lt;/strong&gt; 4&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;JVM Version:&lt;/strong&gt; 1.8.0_91&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Java Runtime:&lt;/strong&gt; Java(TM) SE Runtime Environment (build 1.8.0_91-b14)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Java HotSpot:&lt;/strong&gt; Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Host Machine Details
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Host OS:&lt;/strong&gt; OS X El Capitan Version 10.11.5 (15F34) MacBook Pro (Mid 2015)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hypervisor:&lt;/strong&gt; VMware Fusion Professional Version 8.1.1 (3771013)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Processor:&lt;/strong&gt; 2.5 GHz Intel Core i7&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory:&lt;/strong&gt; 16 GB 1600 MHz DDR3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; &lt;strong&gt;HTTP Load Balancer on Top of WSO2 Gateway&lt;/strong&gt; is referred to as &lt;strong&gt;GW-LB&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Throughput Test
&lt;/h3&gt;

&lt;p&gt;Tests were run twice. The average throughput across both runs, for each concurrency level, was calculated and plotted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 1&lt;/strong&gt; shows throughput comparison between Open Source Nginx and GW-LB and &lt;strong&gt;Figure 1.1&lt;/strong&gt; shows throughput comparison along with Netty backend.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QiIFeJWW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355838343/g4AHVA72J.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QiIFeJWW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355838343/g4AHVA72J.png" alt="" width="600" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 1 : Nginx Open Source Version vs GW-LB&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Bp4VYn0L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355839793/n_1zahJDN.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Bp4VYn0L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355839793/n_1zahJDN.png" alt="" width="600" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 1.1 : Nginx Open Source Version vs GW-LB along with Netty Back-End&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Latency Test
&lt;/h3&gt;

&lt;p&gt;Tests were run twice. The mean latency across both runs, for each concurrency level, was calculated and plotted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--G2ZwXLmy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355840902/tQkvfshSr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--G2ZwXLmy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355840902/tQkvfshSr.png" alt="" width="600" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 2: Nginx Open Source Version vs GW-LB along with Netty BE&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory Test
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Java Flight Recorder (JFR)&lt;/strong&gt; was enabled while starting the LB server, and the recording was stopped after the load test ended. The resulting JFR recording contains the memory usage details.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 3&lt;/strong&gt; shows &lt;strong&gt;Committed, Reserved and Used Heap&lt;/strong&gt; values.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uWzbEi_U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355842109/4WVeCjdak.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uWzbEi_U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355842109/4WVeCjdak.png" alt="" width="800" height="610"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 3: Committed vs Reserved vs Used Heap memory values of GW-LB&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It took three code reviews and several rounds of performance benchmarking to get these results. Guidance and suggestions from my mentors and community members were very helpful. I enjoyed the discussions with my mentors, and it was a great learning experience too! I loved this phase of my GSoC project more than any other, because our discussions were very good and I could sense that my mentors were as eager and motivated as me to achieve good performance.&lt;/p&gt;

&lt;p&gt;I also kept track of the performance improvements we made by recording them in an issue, in the hope it might be helpful one day. You can find it &lt;a href="https://github.com/Venkat2811/product-http-load-balancer/issues/5"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sample Configuration
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You can find a sample config &lt;a href="https://github.com/wso2-incubator/HTTP-Load-balancer#sample-configuration"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Also find more samples &lt;a href="https://github.com/wso2-incubator/HTTP-Load-balancer/tree/master/product/carbon-home/samples"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kindly note that the DSL and the underlying Gateway Framework are evolving and will change over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Javadoc
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You can find javadoc &lt;a href="https://github.com/wso2-incubator/HTTP-Load-balancer/blob/master/docs/javadoc_http_load_balancer.zip"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thanks for reading and Happy Coding !&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://venkat.eu/http-load-balancer-on-top-of-wso2-gateway-part-2-performance-benchmarks-e06cc63e256d"&gt;HTTP Load Balancer on Top of WSO2 Gateway — Part 2: Performance Benchmarks&lt;/a&gt; &lt;em&gt;on August 18, 2016.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>HTTP Load Balancer on Top of WSO2 Gateway — Part 1: Project Repository, Architecture and Features</title>
      <dc:creator>Venkat Raman</dc:creator>
      <pubDate>Thu, 18 Aug 2016 15:04:55 +0000</pubDate>
      <link>https://dev.to/venkat2811/http-load-balancer-on-top-of-wso2-gateway-part-1-project-repository-architecture-and-features-bm8</link>
      <guid>https://dev.to/venkat2811/http-load-balancer-on-top-of-wso2-gateway-part-1-project-repository-architecture-and-features-bm8</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XcuItZxk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668356758609/EX8b6jpm0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XcuItZxk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668356758609/EX8b6jpm0.png" alt="banner-gsoc2016_2.png" width="640" height="191"&gt;&lt;/a&gt;It's been almost four months, and it has been an amazing journey! At this point, I would like to thank my mentors &lt;a href="https://github.com/isururanawaka"&gt;&lt;strong&gt;Isuru Ranawaka&lt;/strong&gt;&lt;/a&gt; and &lt;a href="https://github.com/kasun04"&gt;&lt;strong&gt;Kasun Indrasiri&lt;/strong&gt;&lt;/a&gt; and WSO2 community members, especially &lt;a href="https://github.com/bsenduran"&gt;&lt;strong&gt;Senduran&lt;/strong&gt;&lt;/a&gt; and &lt;a href="https://github.com/isudana"&gt;&lt;strong&gt;Isuru Udana&lt;/strong&gt;&lt;/a&gt;, for continuously mentoring, supporting and guiding me throughout this project.&lt;/p&gt;

&lt;p&gt;Google announced the list of accepted mentoring organizations, and I was looking for projects related to networking and Java. I came across WSO2's &lt;a href="https://docs.wso2.com/display/GSoC/Project+Proposals+for+2016#ProjectProposalsfor2016-Proposal8:%5BESB/GW%5DHTTPLoadbalancerontopofWSO2Gateway"&gt;&lt;strong&gt;idea list&lt;/strong&gt;&lt;/a&gt; and was pretty excited to see an HTTP Load Balancer as a project idea. I had a good understanding of load balancers and was eager to get into their internals and develop one. So I contacted my mentors, and they gave me an idea of how to get started with the WSO2 stack. They asked me to come up with the set of features that I was willing to develop as part of this project. I also got great help and guidance from WSO2 community members right from the time of writing the proposal. With their guidance and suggestions, I was able to come up with a basic architecture, a set of features, and a tentative timeline.&lt;/p&gt;

&lt;p&gt;Once the selected project proposals were announced, my mentor gave me a clear idea of what was expected and how to proceed with the project. Here are my previous blog posts on the &lt;a href="https://venkat2811.blogspot.in/2016/05/gsoc-community-bonding-period.html"&gt;&lt;strong&gt;community bonding period&lt;/strong&gt;&lt;/a&gt; and &lt;a href="https://venkat2811.blogspot.in/2016/07/gsoc-mid-term-evaluation.html"&gt;&lt;strong&gt;mid-term evaluations&lt;/strong&gt;&lt;/a&gt;. I've completed all the features that I had committed to in my project proposal. Based on the performance benchmarks I have done (discussed in the &lt;a href="https://venkat2811.blogspot.in/2016/08/http-load-balancer-on-top-of-wso2_18.html"&gt;&lt;strong&gt;next post&lt;/strong&gt;&lt;/a&gt;), this &lt;strong&gt;Load Balancer performs better than Nginx (Open Source Version)&lt;/strong&gt;. My mentors are also happy with the outcome. There is a lot more to be done to take this LB to production, such as performance improvements, the ability to function in multi-level mode, etc., and moreover the underlying Carbon Gateway Framework is continuously evolving. Even after the GSoC period, I'll be contributing to this project to make it production ready.&lt;/p&gt;

&lt;p&gt;In this post, I'll be discussing the High Level Architecture, Engine Architecture, Message Flow and Load Balancer-specific features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; &lt;strong&gt;HTTP Load Balancer on Top of WSO2 Gateway&lt;/strong&gt; is referred to as &lt;strong&gt;GW-LB&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Project Repository
&lt;/h3&gt;

&lt;p&gt;This GSoC project has been added to the &lt;a href="https://github.com/wso2-incubator/HTTP-Load-balancer"&gt;&lt;strong&gt;WSO2 Incubator&lt;/strong&gt;&lt;/a&gt;, and I've been given membership of the WSO2 Incubator Organization :D !&lt;/p&gt;

&lt;p&gt;Since GW-LB has a standalone run-time, it is developed and managed as a separate project. You can also find the project in my &lt;a href="https://github.com/Venkat2811/product-http-load-balancer"&gt;&lt;strong&gt;personal repository&lt;/strong&gt;&lt;/a&gt;, from which it was added to the WSO2 Incubator.&lt;/p&gt;

&lt;p&gt;The Carbon Gateway Framework with ANTLR grammar support for the LB can be found &lt;a href="https://github.com/Venkat2811/carbon-gateway-framework-with-LB"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt;. As the DSL and gateway framework are evolving, there will be some changes to the grammar in the future. &lt;strong&gt;Please find my commits for handling LB-specific configurations and ANTLR grammar support&lt;/strong&gt; &lt;a href="https://github.com/Venkat2811/carbon-gateway-framework-with-LB/commits/master?author=Venkat2811"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/wso2-incubator/HTTP-Load-balancer#building-product"&gt;&lt;strong&gt;README&lt;/strong&gt;&lt;/a&gt; file has instructions to build and work with the product. You could also simply extract this &lt;a href="https://github.com/Venkat2811/HTTP-Load-Balancer-Zip-File/blob/master/wso2gwlbserver-1.0.0-SNAPSHOT.zip"&gt;&lt;strong&gt;file&lt;/strong&gt;&lt;/a&gt; and try it out !&lt;/p&gt;

&lt;h3&gt;
  
  
  High Level Architecture
&lt;/h3&gt;

&lt;p&gt;&lt;a href="http://wso2.com/blogs/thesource/2016/07/gateway-server-framework/"&gt;&lt;strong&gt;WSO2 Gateway Framework&lt;/strong&gt;&lt;/a&gt; is a high performance, lightweight, low-latency messaging framework based on standard gateway pattern. Its &lt;strong&gt;Netty based non-blocking IO and Disruptor (ring-buffer) architecture&lt;/strong&gt; makes it the fastest open-source gateway available. Benchmarks[1] show that the performance of gateway is very high when compared to other solutions and is close to the direct netty based backend (without any intermediate gateway).&lt;/p&gt;

&lt;p&gt;GW-LB makes use of WSO2's &lt;a href="https://github.com/wso2/carbon-gateway-framework"&gt;&lt;strong&gt;Carbon Gateway Framework&lt;/strong&gt;&lt;/a&gt;, &lt;a href="https://github.com/wso2/carbon-transports/"&gt;&lt;strong&gt;Carbon Transports&lt;/strong&gt;&lt;/a&gt; and &lt;a href="https://github.com/wso2/carbon-messaging"&gt;&lt;strong&gt;Carbon Messaging&lt;/strong&gt;&lt;/a&gt;. These are highly modular and easily extensible, as they are OSGi bundles and are part of the &lt;a href="http://wso2.com/products/carbon/"&gt;&lt;strong&gt;WSO2 Carbon Platform&lt;/strong&gt;&lt;/a&gt;. This LB is itself an OSGi bundle built on top of the carbon gateway framework. When all these bundles are packaged together along with the &lt;a href="https://github.com/wso2/carbon-kernel"&gt;&lt;strong&gt;Carbon Kernel&lt;/strong&gt;&lt;/a&gt;, they form the &lt;strong&gt;GW-LB Server&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Carbon Gateway Framework provides &lt;strong&gt;configuration management and basic mediation capabilities&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Carbon Transports acts as the &lt;strong&gt;transport layer&lt;/strong&gt; within the WSO2 stack.&lt;/li&gt;
&lt;li&gt;Within the WSO2 stack, messages (requests / responses) are mediated in the form of &lt;strong&gt;Carbon Messages&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When a request reaches &lt;strong&gt;Carbon Transports (WSO2-Netty Listener)&lt;/strong&gt;, the additional layer required for mediation is added and it becomes a Carbon Message. Similarly, after mediation, when the Carbon Message reaches &lt;strong&gt;Carbon Transports (WSO2-Netty Sender)&lt;/strong&gt;, all Carbon Message related details are removed and the message is sent to the corresponding endpoint. The same happens in reverse when a response arrives from the back-end. This flow is clearly shown in &lt;strong&gt;Figure 1&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YU5oSiJs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355861612/wn6ShKcui.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YU5oSiJs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355861612/wn6ShKcui.png" alt="" width="640" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 1: High Level Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Engine Architecture
&lt;/h3&gt;

&lt;p&gt;This LB has been built keeping in mind the modular and extensible nature of the WSO2 product stack. WSO2 uses &lt;a href="http://www.antlr.org/"&gt;&lt;strong&gt;ANTLR4&lt;/strong&gt;&lt;/a&gt; to develop a domain-specific language &lt;strong&gt;(DSL)&lt;/strong&gt; for its carbon gateway framework. This DSL is used to configure and define mediation rules for the various products built using this framework, including this LB.&lt;/p&gt;

&lt;p&gt;The Gateway Framework and DSL are continuously evolving, and the LB Engine is completely decoupled from the DSL. Developers can also easily develop their own LB algorithms and persistence policies and plug them into this LB.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 2&lt;/strong&gt; clearly depicts the modules that are specific to LB Engine, Carbon Gateway Framework and Carbon Transports.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--30H2Q4De--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355863156/YZVZ6qWQv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--30H2Q4De--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355863156/YZVZ6qWQv.png" alt="" width="640" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 2: Engine Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Message Flow
&lt;/h3&gt;

&lt;p&gt;As mentioned above, modules within the WSO2 stack communicate via Carbon Messages. Refer to &lt;strong&gt;Figure 3&lt;/strong&gt; for a clear picture of how a message flows through the various LB modules.&lt;/p&gt;

&lt;h3&gt;
  
  
  Request Flow: From Client -&amp;gt; LB -&amp;gt; Back-End
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;When a &lt;strong&gt;client's request&lt;/strong&gt; reaches the &lt;strong&gt;WSO2-Netty Listener&lt;/strong&gt;, it is transformed into a &lt;strong&gt;Carbon Message&lt;/strong&gt;. This carbon message then reaches the Inbound Endpoint.&lt;/li&gt;
&lt;li&gt;This &lt;strong&gt;carbon message&lt;/strong&gt; then flows &lt;strong&gt;via Pipeline&lt;/strong&gt; and reaches &lt;strong&gt;LB Mediator&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Each &lt;strong&gt;LB Outbound Endpoint&lt;/strong&gt; has its own &lt;strong&gt;LB Endpoint Call Mediator&lt;/strong&gt;. LB Mediator uses this LB Endpoint Call Mediator to forward request to the corresponding LB Outbound Endpoint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If there is no persistence policy, LB Algorithm&lt;/strong&gt; returns the &lt;strong&gt;name of LB Outbound Endpoint&lt;/strong&gt; to which LB Mediator has to forward the request.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;If there is any persistence policy, LB Mediator&lt;/strong&gt; &lt;strong&gt;takes the appropriate action&lt;/strong&gt; (discussed later) to find the &lt;strong&gt;name of the LB Outbound Endpoint&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LB Mediator&lt;/strong&gt; then passes the carbon message to &lt;strong&gt;LB Endpoint Call Mediator&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;This &lt;strong&gt;LB Endpoint Call Mediator&lt;/strong&gt; creates a &lt;strong&gt;LB Mediator Call Back&lt;/strong&gt; and forwards the carbon message to &lt;strong&gt;LB Outbound Endpoint&lt;/strong&gt; which in-turn forwards message to &lt;strong&gt;Outbound Endpoint&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outbound Endpoint&lt;/strong&gt; then forwards carbon message to &lt;strong&gt;back-end service&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;When the &lt;strong&gt;carbon message&lt;/strong&gt; reaches the &lt;strong&gt;WSO2-Netty Sender&lt;/strong&gt;, it is transformed back into the &lt;strong&gt;original client request&lt;/strong&gt; and sent to the &lt;strong&gt;corresponding back-end service&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_nTHDM6K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355864408/2wymPWgw4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_nTHDM6K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355864408/2wymPWgw4.png" alt="" width="640" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 3: Message Flow&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Response Flow: From Back-End -&amp;gt; LB -&amp;gt; Client
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;When the &lt;strong&gt;response from the back-end&lt;/strong&gt; reaches the &lt;strong&gt;WSO2-Netty Sender&lt;/strong&gt;, it is transformed into a &lt;strong&gt;Carbon Message&lt;/strong&gt; and then its corresponding &lt;strong&gt;LB Mediator Callback&lt;/strong&gt; is invoked.&lt;/li&gt;
&lt;li&gt;Based on the configured &lt;strong&gt;session persistence policy&lt;/strong&gt;, the &lt;strong&gt;LB Mediator Callback&lt;/strong&gt; takes the action required for session persistence and forwards the carbon message to the &lt;strong&gt;Response Mediator.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Response Mediator&lt;/strong&gt; then forwards message to &lt;strong&gt;Pipeline&lt;/strong&gt; which in-turn forwards message to &lt;strong&gt;Inbound Endpoint&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inbound Endpoint&lt;/strong&gt; then forwards message to corresponding &lt;strong&gt;client&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;When the &lt;strong&gt;carbon message&lt;/strong&gt; reaches the &lt;strong&gt;WSO2-Netty Listener&lt;/strong&gt;, it is transformed back into the &lt;strong&gt;original back-end response&lt;/strong&gt; and sent to the &lt;strong&gt;corresponding client&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Outbound Endpoints
&lt;/h3&gt;

&lt;p&gt;Back-end service endpoints are mapped as Outbound Endpoints in the Carbon Gateway Framework. The LB Engine requires a few additional attributes on these Outbound Endpoints for load balancing. &lt;strong&gt;Figure 4&lt;/strong&gt; explains the differences between an Outbound Endpoint, an LB Outbound Endpoint, a Weighted LB Outbound Endpoint, and an LB Outbound Endpoint for the Least Response Time algorithm.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rQovv6dD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355865588/WEMTZ-9zS.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rQovv6dD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355865588/WEMTZ-9zS.png" alt="" width="640" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 4: Different Outbound Endpoints in GW-LB&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Features
&lt;/h3&gt;

&lt;p&gt;This LB supports various load balancing algorithms, session persistence policies, health checking and redirection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Algorithms
&lt;/h3&gt;

&lt;p&gt;This LB supports both weighted and non-weighted algorithms. In non-weighted algorithms, all Outbound Endpoints are considered to be of equal weight. In weighted algorithms, a weight can be configured for each Outbound Endpoint. If no weight is specified for an endpoint, a default weight of 1 is used.&lt;/p&gt;

&lt;h3&gt;
  
  
  Non-Weighted (Simple) Algorithms
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1) Round-Robin
&lt;/h4&gt;

&lt;p&gt;The LB Mediator forwards requests to Outbound Endpoints in a round-robin fashion. If there is a persistence policy, the LB Mediator forwards the request to the Outbound Endpoint based on it.&lt;/p&gt;
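
&lt;p&gt;As a rough illustration (a Python sketch with hypothetical names, not the LB Mediator's actual code), round-robin selection simply cycles an index over the configured Outbound Endpoints:&lt;/p&gt;

```python
# Illustrative sketch: round-robin selection cycles an index
# over the configured outbound endpoints, wrapping around.
class RoundRobin:
    def __init__(self, endpoints):
        self.endpoints = list(endpoints)
        self.cursor = 0

    def next(self):
        endpoint = self.endpoints[self.cursor % len(self.endpoints)]
        self.cursor += 1
        return endpoint

rr = RoundRobin(["A", "B", "C"])
print([rr.next() for _ in range(4)])  # ['A', 'B', 'C', 'A']
```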

&lt;h4&gt;
  
  
  2) Random
&lt;/h4&gt;

&lt;p&gt;The LB Mediator forwards requests to Outbound Endpoints in a random fashion. If there is a persistence policy, the LB Mediator forwards the request to the Outbound Endpoint based on it.&lt;/p&gt;

&lt;h4&gt;
  
  
  3) Strict Client IP Hashing
&lt;/h4&gt;

&lt;p&gt;The LB looks for the client's IP address in the incoming request headers (request headers are available in the Carbon Message). As of now, the LB looks for the following headers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a)&lt;/strong&gt; X-Forwarded-For&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;b)&lt;/strong&gt; Client-IP&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;c)&lt;/strong&gt; Remote-Addr&lt;/p&gt;

&lt;p&gt;These are configurable, and more headers can be added if necessary. If the LB cannot retrieve the client's IP, or if that IP is not valid, the LB sends an internal server error response to the client. &lt;strong&gt;In this algorithm mode, the persistence policy should be NO_PERSISTENCE, and a request will be load balanced only if a valid client IP is available&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Also, the LB uses scalable and efficient &lt;a href="http://www.tom-e-white.com/2007/11/consistent-hashing.html"&gt;&lt;strong&gt;consistent hashing&lt;/strong&gt;&lt;/a&gt; rather than simple modulo hashing.&lt;/p&gt;

&lt;p&gt;The advantage of consistent hashing is that if a particular node goes down, only the clients that maintained sessions with that node are remapped to other nodes. The session persistence (affinity) of other clients is unaffected. When that node comes back to a healthy state, only the clients that were remapped are mapped back to it. With modulo hashing, by contrast, all clients would be remapped and their sessions lost, creating a bad user experience.&lt;/p&gt;
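
&lt;p&gt;The remapping behavior described above can be sketched with a small consistent-hash ring (a hypothetical Python illustration using virtual nodes; class and method names are mine, not the LB's actual implementation):&lt;/p&gt;

```python
import bisect
import hashlib

# Illustrative consistent-hash ring. Each node is placed at several points
# ("virtual nodes") on a hash ring; a client IP maps to the first node
# clockwise from its hash. Removing a node only remaps the clients that
# were on that node's ring segments.
class ConsistentHashRing:
    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self.ring = {}          # hash position -> node name
        self.sorted_keys = []   # sorted hash positions
        for node in nodes:
            self.add(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.replicas):
            h = self._hash(f"{node}#{i}")
            self.ring[h] = node
            bisect.insort(self.sorted_keys, h)

    def remove(self, node):
        for i in range(self.replicas):
            h = self._hash(f"{node}#{i}")
            del self.ring[h]
            self.sorted_keys.remove(h)

    def node_for(self, client_ip):
        h = self._hash(client_ip)
        idx = bisect.bisect(self.sorted_keys, h) % len(self.sorted_keys)
        return self.ring[self.sorted_keys[idx]]

ring = ConsistentHashRing(["ep1", "ep2", "ep3"])
before = {ip: ring.node_for(ip) for ip in ("10.0.0.1", "10.0.0.2", "10.0.0.3")}
ring.remove("ep2")
# Only clients that were mapped to ep2 change nodes; the rest keep their mapping.
```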

&lt;h4&gt;
  
  
  4) Least Response Time
&lt;/h4&gt;

&lt;p&gt;A running average of response time is calculated for each endpoint on every request. After a fixed WINDOW of requests has elapsed, the load distribution across endpoints is decided based on their response times. &lt;strong&gt;A higher response time for an endpoint indicates a higher load on it&lt;/strong&gt;. So the LB tries to reduce an endpoint's response time by forwarding fewer requests to it and sending more requests to the endpoint with the least response time. In doing so, the LB achieves an even load distribution based on the endpoints' response times.&lt;/p&gt;
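
&lt;p&gt;The idea can be sketched as follows (an illustrative Python snippet; the incremental running-average formula and all names are assumptions for illustration, not the LB's actual code):&lt;/p&gt;

```python
# Illustrative sketch: keep a running average response time per endpoint
# and direct the next request to the endpoint with the lowest average.
class LeastResponseTime:
    def __init__(self, endpoints):
        self.avg = {ep: 0.0 for ep in endpoints}
        self.count = {ep: 0 for ep in endpoints}

    def record(self, ep, response_ms):
        # incremental running average: new_avg = old_avg + (x - old_avg) / n
        self.count[ep] += 1
        self.avg[ep] += (response_ms - self.avg[ep]) / self.count[ep]

    def choose(self):
        # forward to the endpoint with the least average response time
        return min(self.avg, key=self.avg.get)

lrt = LeastResponseTime(["A", "B"])
lrt.record("A", 120); lrt.record("A", 80)   # avg A = 100 ms
lrt.record("B", 40); lrt.record("B", 60)    # avg B = 50 ms
print(lrt.choose())  # B
```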

&lt;h3&gt;
  
  
  Weighted Algorithms
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1) Weighted Round-Robin
&lt;/h4&gt;

&lt;p&gt;The LB Mediator forwards requests to Outbound Endpoints in a round-robin fashion while taking the endpoints' weights into account. For example, suppose endpoints A, B, and C have weights of 3, 2, and 5 respectively. In a total of 10 requests:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a)&lt;/strong&gt; The first 3 requests go to endpoints A, B, C.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;b)&lt;/strong&gt; The next 3 requests also go to endpoints A, B, C. Now 2 requests have been forwarded to endpoint B, so it will not be considered until a total of 10 requests have elapsed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;c)&lt;/strong&gt; Endpoint A receives the next request. Now 3 requests have been forwarded to endpoint A, so it will not be considered until a total of 10 requests have elapsed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;d)&lt;/strong&gt; The remaining 3 requests are forwarded to endpoint C. Now a total of 10 requests have elapsed.&lt;/p&gt;

&lt;p&gt;The cycle begins again.&lt;/p&gt;

&lt;p&gt;Since endpoints are weighted and weights represent processing power, requests forwarded to endpoints due to a persistence policy are also counted toward their weights.&lt;/p&gt;
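
&lt;p&gt;The cycle walked through above (A=3, B=2, C=5) can be sketched like this (a hypothetical Python illustration, not the LB's code): cycle through endpoints in order, skipping any whose weight is exhausted, until the total weight of requests has been served, then reset.&lt;/p&gt;

```python
# Illustrative weighted round-robin matching the example above.
class WeightedRoundRobin:
    def __init__(self, weights):              # e.g. {"A": 3, "B": 2, "C": 5}
        self.weights = dict(weights)
        self.remaining = dict(weights)        # budget left in the current cycle
        self.order = list(weights)
        self.i = 0

    def next(self):
        if all(v == 0 for v in self.remaining.values()):
            self.remaining = dict(self.weights)   # the cycle begins again
        while True:
            ep = self.order[self.i % len(self.order)]
            self.i += 1
            if self.remaining[ep] > 0:            # skip exhausted endpoints
                self.remaining[ep] -= 1
                return ep

wrr = WeightedRoundRobin({"A": 3, "B": 2, "C": 5})
print([wrr.next() for _ in range(10)])
# ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'C', 'C', 'C']
```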

&lt;h4&gt;
  
  
  2) Weighted Random
&lt;/h4&gt;

&lt;p&gt;Similar to weighted round-robin, but the order of endpoints is chosen randomly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Easily Extensible Nature
&lt;/h3&gt;

&lt;p&gt;Custom Load Balancing Algorithms (Simple or Weighted) can be easily written by implementing corresponding interfaces.&lt;/p&gt;

&lt;h3&gt;
  
  
  Session Persistence
&lt;/h3&gt;

&lt;p&gt;Client IP Hashing, Application Cookie, and LB Inserted Cookie are the three persistence policies supported by this LB as of now.&lt;/p&gt;

&lt;h3&gt;
  
  
  1) Client IP Hashing
&lt;/h3&gt;

&lt;p&gt;Similar to the Strict Client IP Hashing algorithm, with the only difference that if the LB can't find a valid client IP in the request headers, the request is still load balanced using the configured load balancing algorithm. It also uses scalable &lt;strong&gt;consistent hashing&lt;/strong&gt; rather than modulo hashing.&lt;/p&gt;

&lt;h3&gt;
  
  
  2) Application Cookie
&lt;/h3&gt;

&lt;p&gt;The LB inserts its own cookie inside the cookie inserted by the application server. When the client sends a request, the LB looks for a cookie in the specified format (the LB-inserted cookie) and, based on the cookie's value, forwards the request to the corresponding back-end, maintaining persistence. The LB also removes its inserted cookie before forwarding the request to the Outbound Endpoint.&lt;/p&gt;

&lt;p&gt;The cookie expiration value is controlled by the back-end application, not the LB. If no cookie is available in the response sent by the back-end, the LB inserts its own cookie to maintain persistence. This will be a session cookie, i.e., session persistence is maintained while the client's browser is open; once it is closed, persistence is lost. Also, this custom LB-inserted cookie is removed before the request is forwarded to the back-end.&lt;/p&gt;

&lt;h3&gt;
  
  
  3) LB Inserted Cookie
&lt;/h3&gt;

&lt;p&gt;This persistence policy comes in handy when the back-end application is not inserting a cookie but persistence still has to be maintained. It works similarly to the Application Cookie policy, the only difference being that the inserted cookie is a session cookie.&lt;/p&gt;

&lt;h3&gt;
  
  
  Health Checking and Redirection
&lt;/h3&gt;

&lt;p&gt;This LB supports both active and passive health checking modes; health checking can also be disabled if it is not necessary. Passive health checking is the default mode, as it doesn't introduce any additional overhead on back-end services or the network. Whether active or passive, health checking requires the following parameters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a) Request Timeout:&lt;/strong&gt; The time interval after which a request is marked as timed out if no response has been received.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;b) Health Check Interval:&lt;/strong&gt; The time interval between two health checks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;c) Unhealthy Retries:&lt;/strong&gt; The number of times requests have to fail (time out) consecutively before an endpoint is marked unhealthy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;d) Healthy Retries:&lt;/strong&gt; The number of times the LB must successfully establish a connection to the server's port before marking it healthy again.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For each timed-out request, the LB sends &lt;strong&gt;HTTP Status Code: 504, Gateway Timeout&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;If all Outbound Endpoints are unhealthy and unavailable, the LB sends &lt;strong&gt;HTTP Status Code: 503, Service Unavailable&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Passive Health Check
&lt;/h3&gt;

&lt;p&gt;In this mode, the LB doesn't send any additional connection probes to check whether an endpoint is healthy. It simply keeps track of consecutive failed (timed-out) requests to an endpoint. If the Unhealthy Retries count is reached, that endpoint is marked as unhealthy and no more requests are forwarded to it until it is back in a healthy state.&lt;/p&gt;
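
&lt;p&gt;The bookkeeping can be sketched as follows (a hypothetical Python illustration, not the LB's actual classes): count consecutive timeouts per endpoint and mark it unhealthy once the Unhealthy Retries threshold is reached.&lt;/p&gt;

```python
# Illustrative passive health check: no extra probes, just counting
# consecutive timed-out requests per endpoint.
class PassiveHealthCheck:
    def __init__(self, unhealthy_retries=3):
        self.unhealthy_retries = unhealthy_retries
        self.failures = {}       # endpoint -> consecutive timeout count
        self.unhealthy = set()

    def on_timeout(self, ep):
        self.failures[ep] = self.failures.get(ep, 0) + 1
        if self.failures[ep] >= self.unhealthy_retries:
            self.unhealthy.add(ep)   # stop forwarding requests to this endpoint

    def on_success(self, ep):
        self.failures[ep] = 0        # a success resets the consecutive count

    def is_healthy(self, ep):
        return ep not in self.unhealthy

hc = PassiveHealthCheck(unhealthy_retries=3)
for _ in range(3):
    hc.on_timeout("ep1")
print(hc.is_healthy("ep1"))  # False
```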

&lt;h3&gt;
  
  
  Active Health Check
&lt;/h3&gt;

&lt;p&gt;The LB periodically sends connection probes to check whether an endpoint is healthy. In this case, both consecutive failed requests and consecutive failed connection probes to an endpoint are taken into account. If the Unhealthy Retries count is reached, that endpoint is marked as unhealthy and no more requests are forwarded to it until it is back in a healthy state.&lt;/p&gt;

&lt;h3&gt;
  
  
  BackToHealthyHandler
&lt;/h3&gt;

&lt;p&gt;BackToHealthyHandler is a thread scheduled to run every &lt;strong&gt;Health Check Interval&lt;/strong&gt;. It sends connection probes to unhealthy endpoints and tries to establish connections. If it succeeds in establishing a connection Healthy Retries number of times in a row, that endpoint is marked as healthy again and requests are forwarded to it once more.&lt;/p&gt;
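
&lt;p&gt;The handler's recovery logic might look roughly like this (a Python sketch with a stubbed-out probe; all names are illustrative, and in the real LB this logic runs as a scheduled thread):&lt;/p&gt;

```python
# Illustrative BackToHealthyHandler tick: probe unhealthy endpoints and,
# after Healthy Retries consecutive successful connections, report them
# as healthy again.
class BackToHealthyHandler:
    def __init__(self, probe, healthy_retries=2):
        self.probe = probe                  # callable: endpoint -> bool (connection ok?)
        self.healthy_retries = healthy_retries
        self.successes = {}                 # endpoint -> consecutive successful probes

    def run_once(self, unhealthy):
        """One scheduled tick; returns endpoints to mark healthy again."""
        recovered = []
        for ep in unhealthy:
            if self.probe(ep):
                self.successes[ep] = self.successes.get(ep, 0) + 1
                if self.successes[ep] >= self.healthy_retries:
                    recovered.append(ep)
                    self.successes[ep] = 0
            else:
                self.successes[ep] = 0      # probes must succeed consecutively
        return recovered

handler = BackToHealthyHandler(probe=lambda ep: True, healthy_retries=2)
print(handler.run_once({"ep1"}))  # [] -- first successful probe
print(handler.run_once({"ep1"}))  # ['ep1'] -- second consecutive success
```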

&lt;p&gt;In my &lt;a href="https://venkat2811.blogspot.in/2016/08/http-load-balancer-on-top-of-wso2_18.html"&gt;&lt;strong&gt;next post&lt;/strong&gt;&lt;/a&gt;, I'll discuss the performance benchmark results of this load balancer.&lt;/p&gt;

&lt;p&gt;Thanks for reading!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://venkat.eu/http-load-balancer-on-top-of-wso2-gateway-part-1-project-repository-architecture-and-features-d4df775af48e"&gt;https://venkat.eu/http-load-balancer-on-top-of-wso2-gateway-part-1-project-repository-architecture-and-features-d4df775af48e&lt;/a&gt; &lt;em&gt;on Aug 18, 2016.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>GSoC — Mid Term Evaluation</title>
      <dc:creator>Venkat Raman</dc:creator>
      <pubDate>Sun, 03 Jul 2016 13:00:23 +0000</pubDate>
      <link>https://dev.to/venkat2811/gsoc-mid-term-evaluation-4cji</link>
      <guid>https://dev.to/venkat2811/gsoc-mid-term-evaluation-4cji</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ie0_1E2T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355846593/f7J6R0mUq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ie0_1E2T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355846593/f7J6R0mUq.png" alt="" width="640" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, all GSoCers were eagerly waiting for the Mid-Term Evaluation results (27 June, 19:00 UTC). Google gives one week (June 20–27) for students and mentors to submit their evaluations. Students who clear the evaluations are rewarded and can continue to code for the next six weeks until the final evaluations!! Most of us cleared the mid-term evaluation with flying colors, and member activity in our FB group spiked as expected. We posted our mentors' feedback, and it was fun and exciting to read others' feedback too :D&lt;/p&gt;

&lt;p&gt;Most students would have gone through code reviews &amp;amp; demos before the mid-term. My mentors have been very busy and couldn't conduct any demos or code reviews. I keep them posted on a weekly basis through the mailing list, and they are very well aware of the features implemented so far.&lt;/p&gt;

&lt;p&gt;Yesterday, my mentor was free and we had a demo/discussion on Hangouts for about one and a half hours. My mentor was very happy with the outcome; he thanked me for my contribution so far and asked me to keep up the same pace and complete the project successfully. I felt so happy and humbled that my six weeks of hard work is definitely going to make a difference!! We also discussed the further road map in detail. We are yet to have our code review.&lt;/p&gt;

&lt;h4&gt;
  
  
  Progress so far:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;ANTLR 4 grammar support for reading Load balancer specific configurations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Algorithms:&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;a)&lt;/strong&gt; Round Robin&lt;br&gt;&lt;br&gt;
&lt;strong&gt;b)&lt;/strong&gt; Random&lt;br&gt;&lt;br&gt;
&lt;strong&gt;c)&lt;/strong&gt; Strict Client IP Hashing&lt;br&gt;&lt;br&gt;
&lt;strong&gt;d)&lt;/strong&gt; Least Response Time&lt;br&gt;&lt;br&gt;
&lt;strong&gt;e)&lt;/strong&gt; Weighted Round Robin&lt;br&gt;&lt;br&gt;
&lt;strong&gt;f)&lt;/strong&gt; Weighted Random&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Session Persistence:&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;a)&lt;/strong&gt; Client IP Hashing&lt;br&gt;&lt;br&gt;
&lt;strong&gt;b)&lt;/strong&gt; Application Cookie&lt;br&gt;&lt;br&gt;
&lt;strong&gt;c)&lt;/strong&gt; LB Cookie&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Health Checking &amp;amp; Redirection:&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;a)&lt;/strong&gt; Active&lt;br&gt;&lt;br&gt;
&lt;strong&gt;b)&lt;/strong&gt; Passive&lt;/p&gt;

&lt;h4&gt;
  
  
  Todo List:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;SSL Support: Both SSL Offloading and End-To-End.&lt;/li&gt;
&lt;li&gt;Performance evaluation.&lt;/li&gt;
&lt;li&gt;Unit testing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can also find a detailed list of my progress &lt;a href="https://github.com/Venkat2811/product-http-load-balancer"&gt;here&lt;/a&gt; under the Mid Term label of the README.&lt;/p&gt;

&lt;p&gt;Also find my previous post about writing a good GSoC proposal &lt;a href="https://venkat2811.blogspot.in/2016/06/writing-good-google-summer-of-code-gsoc.html"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Thanks for reading !!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://venkat.eu/gsoc-mid-term-evaluation"&gt;https://venkat.eu/gsoc-mid-term-evaluation&lt;/a&gt; &lt;em&gt;on July 3, 2016.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Writing a good Google Summer of Code (GSoC) Proposal</title>
      <dc:creator>Venkat Raman</dc:creator>
      <pubDate>Sun, 26 Jun 2016 13:00:47 +0000</pubDate>
      <link>https://dev.to/venkat2811/writing-a-good-google-summer-of-code-gsoc-proposal-4e64</link>
      <guid>https://dev.to/venkat2811/writing-a-good-google-summer-of-code-gsoc-proposal-4e64</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OUrCRCvu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355850602/STjTHxsU3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OUrCRCvu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355850602/STjTHxsU3.png" alt="" width="640" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Getting selected for GSoC has various aspects, the most important of which is writing a good proposal. Students aspiring to participate in GSoC for the first time find it very difficult to write a proposal; being a first-time GSoCer, I went through the same. A few organizations like &lt;a href="https://www.kde.org/"&gt;KDE&lt;/a&gt; have their standard GSoC &lt;a href="https://community.kde.org/GSoC#Student_proposal_guidelines"&gt;template&lt;/a&gt; that students can follow, but most organizations don't have a standard template, so students find it confusing and difficult. You can also find a few sample proposals online and use them as references. Students are allowed to submit a maximum of 5 proposals, but only one project is allocated per student.&lt;/p&gt;

&lt;p&gt;For those who don't know what GSoC is, have a look at my previous &lt;a href="https://venkat2811.blogspot.in/2016/06/what-is-google-summer-of-code-gsoc.html"&gt;post&lt;/a&gt;. This &lt;a href="https://developers.google.com/open-source/gsoc/timeline"&gt;timeline&lt;/a&gt; will give you an idea of how it works.&lt;/p&gt;

&lt;p&gt;Here are a few things one must consider before writing a proposal.&lt;/p&gt;

&lt;h3&gt;
  
  
  Selecting an Organization:
&lt;/h3&gt;

&lt;p&gt;Students should know that selecting a proposal is completely in the organization's hands. Google doesn't take part in selecting proposals, but it is responsible for allocating the maximum number of projects per organization for that year. My organization &lt;a href="http://wso2.com/"&gt;WSO2&lt;/a&gt; started participating in GSoC in &lt;a href="https://docs.wso2.com/display/GSoC/WSO2+GSoC+Project+Proposals"&gt;2014&lt;/a&gt;. Google doesn't provide stats on the number of proposals received per organization versus the number selected, but it does provide the success rate (number of proposals selected versus successfully completed projects).&lt;/p&gt;

&lt;p&gt;For instance, the success rate of WSO2 was 5/6 in 2014 and 9/10 in 2015. This year (2016), 14 proposals were selected; next year around 20 proposals might get selected. If you look at this (especially as a first-time aspirant), selecting an organization is very crucial. If an organization had more than 5 projects with a good success rate in the previous year, you are good to go.&lt;/p&gt;

&lt;p&gt;Kindly note that I am not against selecting new organizations, just that Google allocates them fewer projects. If you have the relevant skill sets and are comfortable with the technologies required for the project, you can proceed. In such cases, try to submit more than one proposal (have a backup).&lt;/p&gt;

&lt;p&gt;Certain organizations like KDE, Apache, etc., are highly competitive. Though they select (Google allows them) 30–40 proposals, there will be 2 or more students competing for a single project. In such cases, either more than one proposal (the best proposals) for a single project gets selected (which happens very rarely) or only one gets selected (which happens in most cases). It's always good to have a backup while applying to such big organizations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Relevant Skill Sets:
&lt;/h3&gt;

&lt;p&gt;Students should have relevant skill sets. While organizations don't expect students to be extremely proficient with the technologies (programming language, domain, etc.), they do expect good enough skills that students will be able to manage and deliver. Note that just because Google has allotted a maximum number of projects, organizations don't try to use that allotment fully. If an organization feels that your proposal is not good or that you don't have the relevant skill sets, it will not select your proposal, because its success rate is very important.&lt;/p&gt;

&lt;h3&gt;
  
  
  Communicating with Organization:
&lt;/h3&gt;

&lt;p&gt;Frequent communication with the organization is very important. Here are the various phases where students can communicate.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google starts accepting applications from organizations in October or November every year. Students can look at the organizations selected in the previous year and follow them, because organizations selected in the previous year are most likely to be selected the next year as well.&lt;/li&gt;
&lt;li&gt;Once the &lt;strong&gt;idea list is put up (by the organization)&lt;/strong&gt;, students can contact the organization's mentors and members. They will offer guidance, and students can start contributing from November or December itself. This is very helpful for &lt;strong&gt;first-time aspirants&lt;/strong&gt; who wish to work with organizations like Apache, KDE, etc. Note that the more trust you build (by proving yourself) with your organization's members and mentors, the better the chance that your proposal will be selected.&lt;/li&gt;
&lt;li&gt;Once &lt;strong&gt;Google announces the selected organizations&lt;/strong&gt;, look for projects matching your skill sets. You will have one month to discuss with organizations and submit proposals.&lt;/li&gt;
&lt;li&gt;Factors like how often you communicate and how interested and dedicated you are matter a lot.&lt;/li&gt;
&lt;li&gt;Mentors help you write your proposal. They correct your mistakes and are very clear in letting you know what is expected over the project duration.&lt;/li&gt;
&lt;li&gt;Since we will be working on their existing projects, mentors insist that we try out their products, frameworks, etc. They also help if we run into difficulties. Once you do this, you'll get a clear idea of how to proceed with the technical implementation of your project. &lt;strong&gt;Yes, your proposal must include relevant architecture diagrams, technical implementation details, deliverables, and a tentative timeline for those deliverables&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The mentor organization and Google take &lt;strong&gt;one month&lt;/strong&gt; to announce the selected proposals. Once your proposal is selected, it becomes a GSoC project!! During that month, &lt;strong&gt;organizations expect you to communicate with mentors and start working on small tasks and deliverables. Submitting patches will also help&lt;/strong&gt;. Since the organization's success rate is at stake, writing a good proposal with a crystal-clear road map, along with regular commitment and contribution to your organization, is very important. It also builds trust in you, and you'll have an edge if anyone is competing with you for the same project.&lt;/li&gt;
&lt;li&gt;Remember that &lt;strong&gt;communication is key. Never hesitate to ask questions. Don't think you might be asking stupid questions; that's never the case. Mentors know that we are students, after all.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;While approaching a mentor for the first time, give a brief introduction about yourself. This will help mentors understand where you stand.&lt;/li&gt;
&lt;li&gt;Once your project gets selected, mentors will guide you through.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is my &lt;a href="https://github.com/Venkat2811/product-http-load-balancer"&gt;project&lt;/a&gt; and &lt;a href="http://docs.google.com/document/d/1Agl-Y_UKM5eMon8IDZa02aDGj0Yh_vZ7kFC2b1IzFI4/edit?usp=sharing"&gt;proposal&lt;/a&gt; for your reference. Here is the project &lt;a href="https://docs.wso2.com/display/GSoC/Project+Proposals+for+2016#ProjectProposalsfor2016-Proposal8:%5BESB/GW%5DHTTPLoadbalancerontopofWSO2Gateway"&gt;idea&lt;/a&gt; from WSO2.&lt;/p&gt;

&lt;p&gt;Feel free to contact me if you have any questions.&lt;/p&gt;

&lt;p&gt;Happy Coding !!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://venkat.eu/writing-a-good-google-summer-of-code-gsoc-proposal-6f040e217af4"&gt;https://venkat.eu/writing-a-good-google-summer-of-code-gsoc-proposal-6f040e217af4&lt;/a&gt; &lt;em&gt;on June 26, 2016.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>What is Google Summer of Code (GSoC) ?</title>
      <dc:creator>Venkat Raman</dc:creator>
      <pubDate>Fri, 24 Jun 2016 13:00:43 +0000</pubDate>
      <link>https://dev.to/venkat2811/what-is-google-summer-of-code-gsoc--16ig</link>
      <guid>https://dev.to/venkat2811/what-is-google-summer-of-code-gsoc--16ig</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7ECJUIdt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668357680215/pEMzFrFMy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7ECJUIdt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668357680215/pEMzFrFMy.png" alt="banner-gsoc2016_2.png" width="640" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://summerofcode.withgoogle.com/"&gt;Google Summer of Code&lt;/a&gt; is a global program focused on bringing more student developers into open source software development. More importantly Open source organizations find it prestigious for being selected by Google to be a part of this program. Yes !! Google selects organizations based on certain criteria. Not all open source organizations are selected. And organizations also find it as the best place to attract young and good talent.&lt;/p&gt;

&lt;p&gt;For students, participating and successfully completing the project is a great advantage in various ways. Here are a few from my experience.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The level of exposure is very high. Student can learn a lot right from writing a proposal.&lt;/li&gt;
&lt;li&gt;Most of the projects involves lot of coding. Student get to Code a lot. Yea lots of Code!!&lt;/li&gt;
&lt;li&gt;Students will have complete ownership of their project. They get to think about design aspects and plan a lot. However, when students get stuck, mentors and community members will be ready to lend a helping hand.&lt;/li&gt;
&lt;li&gt;Since most mentors and students will be in different countries, they collaborate through GitHub. Students get a very good knowledge in working in a collaborated environment using a VCS.&lt;/li&gt;
&lt;li&gt;Most of the projects are intended to solve real world problems and most of the code that students write will be used. There will be a great satisfaction of making a valuable contribution.&lt;/li&gt;
&lt;li&gt;Since people from different location and culture will be a part of organization, students will learn to adapt and communicate with them effectively. So students communication skills will improve a lot.&lt;/li&gt;
&lt;li&gt;Students community is the best part. It is very exciting. Students from all over the world collaborate in Facebook. Yes, we have separate group for GSoC 2016 in Facebook, LinkedIn, Telegram. Its the very exciting part. Students share their experience, offer support, help and are very encouraging. Its always fun to interact with new people.&lt;/li&gt;
&lt;li&gt;GSoC project will add a great value to the resume. Students will code and learn a lot more than they would have done in college.&lt;/li&gt;
&lt;li&gt;Students need not be great algorithmic geeks; students with good analytical and programming skills can participate and successfully complete the project.&lt;/li&gt;
&lt;li&gt;And last but not least, Google always keeps things exciting. Students get a welcome package containing a GSoC sticker, pen, and diary, along with $500 after the community bonding period. Students who pass the mid-term evaluation get $2250, and those who successfully complete the project get $2750, a certificate from Google, and a GSoC T-shirt.&lt;/li&gt;
&lt;li&gt;Also, a few organizations insist on weekly or monthly blog posts about the project, so students get a good chance to start blogging.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can read about writing a good GSoC proposal in my next &lt;a href="https://venkat2811.blogspot.in/2016/06/writing-good-google-summer-of-code-gsoc.html"&gt;post&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I'm a final-year student, and I am in my mid-term evaluation period (at the time of writing this post). I'm already missing GSoC, since I won't be able to participate next year.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://venkat.eu/what-is-google-summer-of-code-gsoc-fc721e631e27"&gt;https://venkat.eu/what-is-google-summer-of-code-gsoc-fc721e631e27&lt;/a&gt; &lt;em&gt;on June 24, 2016.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>GSoC — Community Bonding Period</title>
      <dc:creator>Venkat Raman</dc:creator>
      <pubDate>Mon, 23 May 2016 13:00:46 +0000</pubDate>
      <link>https://dev.to/venkat2811/gsoc-community-bonding-period-1mcn</link>
      <guid>https://dev.to/venkat2811/gsoc-community-bonding-period-1mcn</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QcxhD--Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355858026/6k2Bc01MG.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QcxhD--Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1668355858026/6k2Bc01MG.png" alt="" width="640" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This year, the community bonding period was from April 22nd to May 22nd. Though I was communicating with my mentor and community members frequently even before the GSoC project results were announced (yeah, I was pretty confident!), this period was very helpful and useful.&lt;/p&gt;

&lt;p&gt;I took time off from April 29th to May 10th for my semester exams, and my mentors were comfortable with it. We had a Hangouts session before that, and my mentor explained what was expected during this community bonding period. There was a major change in the code base between the time I wrote my proposal and now. Here is the &lt;a href="https://github.com/wso2/product-gw/"&gt;previous code base&lt;/a&gt; and here is the &lt;a href="https://github.com/wso2/carbon-gateway-framework"&gt;new one&lt;/a&gt;. Big change, right? Yes, I felt the same and I panicked!! But my mentor and members of the org were very helpful and understood my situation. They guided me through it very well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why a Load Balancer ?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My project is to build an HTTP Load Balancer on top of WSO2 Gateway. When I tell this to my friends, a few are astonished (yes, I wanted to build a kick-ass product!) and a few ask me, "Why a load balancer? It's already there, why re-invent the wheel?". First of all, I chose this organization because &lt;a href="http://wso2.com/products/enterprise-service-bus/"&gt;WSO2 ESB&lt;/a&gt; is the backbone of eBay! Yes, it helps eBay handle &lt;a href="http://wso2.com/casestudies/ebay-uses-100-open-source-wso2-esb-to-process-more-than-1-billion-transactions-per-day/"&gt;more than 1 billion transactions per day&lt;/a&gt;!! Pretty cool, huh? So what does the gateway have to do with it? This gateway framework will be used to build the next-gen ESB.&lt;/p&gt;

&lt;p&gt;WSO2 Gateway is a high-performance, lightweight, low-latency messaging gateway based on the standard gateway pattern. Its Netty-based non-blocking IO and Disruptor (ring buffer) architecture makes it the fastest open-source gateway available. The &lt;a href="http://www.slideshare.net/kasun04/wso2-gateway?qid=9c8d89a4-e982-4883-87a8-ac2dca7bf223&amp;amp;v=&amp;amp;b=&amp;amp;from_search=1"&gt;performance benchmarks&lt;/a&gt; (slides 13, 14, and 15) show that the gateway's throughput is very high compared to other solutions and is close to that of a direct Netty-based backend (without any intermediate gateway).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Features:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Support for different Load Balancing Algorithms&lt;/li&gt;
&lt;li&gt;Full Compliance with HTTP specification&lt;/li&gt;
&lt;li&gt;Support for SSL&lt;/li&gt;
&lt;li&gt;Session Persistence&lt;/li&gt;
&lt;li&gt;Health Checking and Redirection&lt;/li&gt;
&lt;/ul&gt;
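&lt;p&gt;To give a feel for the first feature above, here is a minimal round-robin selector in Java. This is only my rough sketch of the general technique, not the actual WSO2 Gateway code; the class and method names are made up for illustration.&lt;/p&gt;

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal round-robin endpoint selector sketch: each call to pick()
// returns the next backend in order, wrapping around at the end.
// AtomicInteger keeps the counter safe under concurrent requests.
class RoundRobin {
    private final String[] endpoints;
    private final AtomicInteger next = new AtomicInteger(0);

    RoundRobin(String[] endpoints) {
        this.endpoints = endpoints;
    }

    String pick() {
        // floorMod keeps the index non-negative even after the
        // counter overflows Integer.MAX_VALUE.
        int i = Math.floorMod(next.getAndIncrement(), endpoints.length);
        return endpoints[i];
    }
}
```

&lt;p&gt;Other algorithms (weighted round-robin, least connections, etc.) would just swap out the pick() logic while keeping the same selector interface.&lt;/p&gt;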

&lt;p&gt;So, I am building a pretty cool LB based on non-blocking IO to achieve high performance and low-latency mediation. If everything goes well, and I hope it does, one day this LB will be used by many organizations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accomplishments so far:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understanding the &lt;a href="https://github.com/wso2/carbon-gateway-framework"&gt;carbon-gateway-framework&lt;/a&gt; code base and trying out a few samples of &lt;a href="https://github.com/wso2/product-integration-server"&gt;product-integration-server&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="http://www.antlr.org/"&gt;Antlr 4&lt;/a&gt; grammar support for reading Load Balancer specific configurations.&lt;/li&gt;
&lt;li&gt;Repository structure for a &lt;a href="https://github.com/Venkat2811/product-http-load-balancer"&gt;standalone HTTP Load Balancer Server&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can find my code in the links mentioned above. Thanks for reading.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://venkat.eu/gsoc-community-bonding-period-b0d8f36d918"&gt;https://venkat.eu/gsoc-community-bonding-period-b0d8f36d918&lt;/a&gt; &lt;em&gt;on May 23, 2016.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
