<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aditya Pandey</title>
    <description>The latest articles on DEV Community by Aditya Pandey (@pandeyaditya0002).</description>
    <link>https://dev.to/pandeyaditya0002</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1279348%2Fcb495017-a4d9-42c8-8856-dec108124c07.jpeg</url>
      <title>DEV Community: Aditya Pandey</title>
      <link>https://dev.to/pandeyaditya0002</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pandeyaditya0002"/>
    <language>en</language>
    <item>
      <title>How Cursor Shipped its Coding Agent to Production</title>
      <dc:creator>Aditya Pandey</dc:creator>
      <pubDate>Thu, 26 Feb 2026 15:54:48 +0000</pubDate>
      <link>https://dev.to/pandeyaditya0002/how-cursor-shipped-its-coding-agent-to-production-3km</link>
      <guid>https://dev.to/pandeyaditya0002/how-cursor-shipped-its-coding-agent-to-production-3km</guid>
      <description>&lt;p&gt;On October 29, 2025, Cursor shipped Cursor 2.0 and introduced Composer, its first agentic coding model. Cursor claims Composer is 4x faster than similarly intelligent models, with most turns completing in under 30 seconds. For more clarity and detail, we worked with Lee Robinson at Cursor on this article.&lt;/p&gt;

&lt;p&gt;Shipping a reliable coding agent requires a lot of systems engineering. Cursor’s engineering team has shared technical details and challenges from building Composer and shipping their coding agent into production. This article breaks down those engineering challenges and how they solved them.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is a Coding Agent?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To understand coding agents, we first need to look at how AI coding has evolved.&lt;/p&gt;

&lt;p&gt;AI in software development has evolved in three waves. First, we treated general-purpose LLMs like a coding partner. You copied code, pasted it into ChatGPT, asked for a fix, and manually applied the changes. It was helpful, but disconnected.&lt;/p&gt;

&lt;p&gt;In the second wave, tools like Copilot and Cursor Tab brought AI directly into the editor. To power these tools, specialized models were developed for fast, inline autocomplete. They helped developers type faster, but they were limited to the specific file being edited.&lt;/p&gt;

&lt;p&gt;More recently, the focus has shifted to coding agents. They don’t just suggest code; they handle coding requests end-to-end. They can search your repo, edit multiple files, run terminal commands, and iterate on errors until the build and tests pass. We are currently living through this third wave.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvahlzbwn47yo9cexpr2z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvahlzbwn47yo9cexpr2z.png" alt=" " width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A coding agent is not a single model. It is a system built around a model with tool access, an iterative execution loop, and mechanisms to retrieve relevant code. The model, often referred to as an agentic coding model, is a specialized LLM trained to reason over codebases, use tools, and work effectively inside an agentic system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F38a64xr0gjgihozd2uhk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F38a64xr0gjgihozd2uhk.png" alt=" " width="800" height="824"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is easy to confuse agentic coding models with coding agents. The agentic coding model is like the brain. It has the intelligence to reason, write code, and use tools. The coding agent is the body. It has the “hands” to execute tools, manage context, and ensure it reaches a working solution by iterating until the build and tests pass.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdc2qhca2t4q3tmdjgg7z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdc2qhca2t4q3tmdjgg7z.png" alt=" " width="800" height="551"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AI labs first train an agentic coding model, then wrap it in an agent system, also known as a harness, to create a coding agent. For example, OpenAI Codex is a coding agent environment powered by the GPT-5.2-Codex model, and Cursor’s coding agent can run on multiple frontier models, including its own agentic coding model, Composer. In the next section, we take a closer look at Cursor’s coding agent and Composer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fekk71nsxx2cud7og9hf1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fekk71nsxx2cud7og9hf1.png" alt=" " width="800" height="503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;System Architecture&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A production-ready coding agent is a complex system composed of several critical components working in unison. While the model provides the intelligence, the surrounding infrastructure is what enables it to interact with files, run commands, and maintain safety. The next figure shows the key components of Cursor’s agent system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe3lu6zufpkdk80amk3iv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe3lu6zufpkdk80amk3iv.png" alt=" " width="800" height="621"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Router&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Cursor integrates multiple agentic models, including its own specialized Composer model. For efficiency, the system offers an “Auto” mode that acts as a router. It dynamically analyzes the complexity of each request to choose the best model for the job.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;LLM (agentic coding model)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The heart of the system is the agentic coding model. In Cursor’s agent, that model can be Composer or any other frontier coding model picked by the router. Unlike a standard LLM trained just to predict the next token of text, this model is trained on trajectories: sequences of actions that show the model how and when to use available tools to solve a problem.&lt;/p&gt;

&lt;p&gt;Creating this model is often the heaviest lift in building a coding agent. It requires massive data preparation, training, and testing to ensure the model doesn’t just write code, but understands the process of coding (e.g., “search first, then edit, then verify”). Once this model is ready and capable of reasoning, the rest of the work shifts to system engineering to provide the environment it needs to operate.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Tools&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Composer is connected to a tool harness inside Cursor’s agent system, with more than ten tools available. These tools cover the core operations needed for coding such as searching the codebase, reading and writing files, applying edits, and running terminal commands.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhd5vd9reiyyteq5tjfq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhd5vd9reiyyteq5tjfq.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Context Retrieval&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Real codebases are too large to fit into a single prompt. The context retrieval system searches the codebase to pull in the most relevant code snippets, documentation, and definitions for the current step, so the model has what it needs without overflowing the context window.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkhvtfwliilhxfrca6uk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkhvtfwliilhxfrca6uk.png" alt=" " width="800" height="551"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Orchestrator&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The orchestrator is the control loop that runs the agent. The model decides what to do next and which tool to use, and the orchestrator executes that tool call, collects the result such as search hits, file contents, or test output, rebuilds the working context with the new information, and sends it back to the model for the next step. This iterative loop is what turns the system from a chatbot into an agent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq2hcklkw11q2306nugkg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq2hcklkw11q2306nugkg.png" alt=" " width="800" height="307"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One common way to implement this loop is the ReAct pattern, where the model alternates between reasoning steps and tool actions based on the observations it receives.&lt;/p&gt;
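
&lt;p&gt;As a rough illustration, the orchestrator loop can be sketched in a few lines of Python. This is a minimal, hypothetical sketch: the tool names, message format, and “finish” convention are illustrative assumptions, not Cursor’s actual implementation.&lt;/p&gt;

```python
# Minimal ReAct-style control loop: the model picks an action, the
# orchestrator runs the matching tool and feeds the observation back.
def run_agent(model, tools, task, max_steps=10):
    context = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = model(context)  # e.g. {"tool": "search", "args": {...}}
        if action["tool"] == "finish":
            return action["args"]["answer"]
        observation = tools[action["tool"]](**action["args"])
        context.append({"role": "tool", "content": observation})
    return None  # give up after max_steps
```

&lt;p&gt;Everything difficult lives inside &lt;code&gt;model&lt;/code&gt; and &lt;code&gt;tools&lt;/code&gt;; the loop itself stays deliberately simple so each step can be logged, inspected, and replayed.&lt;/p&gt;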

&lt;h2&gt;
  
  
  &lt;strong&gt;Sandbox (execution environment)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Agents need to run builds, tests, linters, and scripts to verify their work. However, giving an AI unrestricted access to your terminal is a security risk. To solve this, tool calls are executed in a Sandbox. This secure and isolated environment uses strict guardrails to ensure that the user’s host machine remains safe even if the agent attempts to run a destructive command. Cursor offers the flexibility to run these sandboxes either locally or remotely on a cloud virtual machine.&lt;/p&gt;

&lt;p&gt;Note that these are the core building blocks you will see in most coding agents. Different labs may add more components on top, such as long-term memory, policy and safety layers, specialized planning modules, or collaboration features, depending on the capabilities they want to support.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Production Challenges&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;On paper, tools, memory, orchestration, routing, and sandboxing look like a straightforward blueprint. In production, the constraints are harsher. A model that can write good code is still useless if edits do not apply cleanly, if the system is too slow to iterate, or if verification is unsafe or too expensive to run frequently.&lt;/p&gt;

&lt;p&gt;Cursor’s experience highlights three engineering hurdles that general-purpose models do not solve out of the box: reliable editing, compounded latency, and sandboxing at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Challenge 1: The “Diff Problem”&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;General-purpose models are trained primarily to generate text. They struggle significantly when asked to perform edits on existing files.&lt;/p&gt;

&lt;p&gt;This is known as the “Diff Problem.” When a model is asked to edit code, it has to locate the right lines, preserve indentation, and output a rigid diff format. If it hallucinates line numbers or drifts in formatting, the patch fails even when the underlying logic is correct. Worse, a patch can apply incorrectly, which is harder to detect and more expensive to clean up. In production, incorrect edits are often worse than no edits because they reduce trust and increase cleanup time.&lt;/p&gt;

&lt;p&gt;A common way to mitigate the diff problem is to train on edit trajectories. For example, you can structure training data as triples like (original_code, edit_command, final_code), which teaches the model how an edit instruction should change the file while preserving everything else.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8rlf919yw7n2z7oj291h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8rlf919yw7n2z7oj291h.png" alt=" " width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;
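
&lt;p&gt;Concretely, a single training example in this shape might look like the following sketch. The field names and serialization are illustrative assumptions, not Cursor’s actual training schema:&lt;/p&gt;

```python
# One hypothetical edit-trajectory example: the model sees the original
# code plus an edit instruction, and learns to produce the edited file.
example = {
    "original_code": "def add(a, b):\n    return a - b\n",
    "edit_command": "Fix the bug: add() should return the sum, not the difference.",
    "final_code": "def add(a, b):\n    return a + b\n",
}

def to_prompt_and_target(ex):
    """Serialize the triple into a (prompt, target) pair for training."""
    prompt = (
        "ORIGINAL:\n" + ex["original_code"] +
        "EDIT:\n" + ex["edit_command"] + "\n"
    )
    return prompt, ex["final_code"]
```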

&lt;p&gt;Another critical step is teaching the model to use specific editing tools such as search and replace. Cursor emphasized that these two tools were significantly harder to teach than other tools. To solve this, they ensured their training data contained a high volume of trajectories specifically focused on search and replace tool usage, forcing the model to over-learn the mechanical constraints of these operations. Cursor utilized a cluster of tens of thousands of GPUs to train the Composer model, ensuring these precise editing behaviors were fundamentally baked into the weights.&lt;/p&gt;
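
&lt;p&gt;One reason search and replace is hard for models is that the edit only lands if the model reproduces the existing code exactly. A minimal sketch of such a tool (illustrative, not Cursor’s implementation) makes the constraint visible:&lt;/p&gt;

```python
# A strict search-and-replace edit tool: the search string must match
# exactly once, so any drift in whitespace or wording fails loudly
# instead of applying the patch in the wrong place.
def search_replace(source, search, replace):
    count = source.count(search)
    if count != 1:
        raise ValueError(f"search string matched {count} times; need exactly 1")
    return source.replace(search, replace)
```

&lt;p&gt;Failing loudly on zero or multiple matches is the point: an ambiguous edit is rejected rather than silently applied in the wrong location.&lt;/p&gt;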

&lt;h2&gt;
  
  
  &lt;strong&gt;Challenge 2: Latency Compounds&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In a chat interface, a user might tolerate a short pause. In an agent loop, latency compounds. A single task might require the agent to plan, search, edit, and test across many iterations. If each step takes a few seconds, the end-to-end time quickly becomes frustrating.&lt;/p&gt;

&lt;p&gt;Cursor treats speed as a core product strategy. To make the coding agent fast, they employ three key techniques:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mixture of Experts (MoE) architecture&lt;/li&gt;
&lt;li&gt;Speculative decoding&lt;/li&gt;
&lt;li&gt;Context compaction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffahw18w2om78dnxf5qgk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffahw18w2om78dnxf5qgk.png" alt=" " width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MoE architecture:&lt;/strong&gt; Composer is a MoE language model. MoE modifies the Transformer by making some feed-forward computation conditional. Instead of sending every token through the same dense MLP, the model routes each token to a small number of specialized MLP experts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3lqg0bwwf66pmha66mnb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3lqg0bwwf66pmha66mnb.png" alt=" " width="800" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MoE can improve both capacity and efficiency by activating only a few experts per token, which can yield better quality at similar latency, or similar quality at lower latency, especially at deployment scale. However, MoE often introduces additional engineering challenges and complexity. If every token goes to the same expert, that expert becomes a bottleneck while others sit idle. This causes high tail latency.&lt;/p&gt;

&lt;p&gt;Teams typically address this with a combination of techniques. During training they add load-balancing losses to encourage the router to spread traffic across experts. During serving, they enforce capacity limits and reroute overflow. At the infrastructure level, they reduce cross-GPU communication overhead by batching and routing work to keep data movement predictable.&lt;/p&gt;
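
&lt;p&gt;The serving-side capacity limit can be sketched as a toy router: top-1 routing with overflow rerouted to the next-best expert. A real MoE router operates on tensors inside the model, so treat this purely as the shape of the idea:&lt;/p&gt;

```python
# Toy capacity-limited routing: each token goes to its highest-scoring
# expert unless that expert is full, in which case it overflows to the
# next-best one. This keeps any single expert from becoming a hotspot.
def route_tokens(scores_per_token, capacity):
    load = {}
    assignment = []
    for scores in scores_per_token:
        ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
        for expert in ranked:
            if load.get(expert, 0) < capacity:
                load[expert] = load.get(expert, 0) + 1
                assignment.append(expert)
                break
    return assignment
```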

&lt;p&gt;&lt;strong&gt;Speculative Decoding:&lt;/strong&gt; Generation is sequential. Agents spend a lot of time producing plans, tool arguments, diffs, and explanations, and generating these token by token is slow. Speculative decoding reduces latency by using a smaller draft model to propose tokens that a larger model can verify quickly. When the draft is correct, the system accepts multiple tokens at once, reducing the number of expensive decoding steps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkepnviwwc3ba90fezytl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkepnviwwc3ba90fezytl.png" alt=" " width="800" height="307"&gt;&lt;/a&gt;&lt;br&gt;
Since code has a very predictable structure, such as imports, brackets, and standard syntax, waiting for a large model like Composer to generate every single character is inefficient. Cursor confirms they use speculative decoding and trained specialized “draft” models that predict the next few tokens rapidly. This allows Composer to generate code much faster than the standard token-by-token generation rate.&lt;/p&gt;
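
&lt;p&gt;The acceptance logic at the heart of speculative decoding can be sketched with two stand-in “models” (deterministic functions here; real systems compare token probabilities, so this shows only the control flow):&lt;/p&gt;

```python
# The draft model proposes k tokens; the target model verifies them and we
# keep the longest agreeing prefix. On the first disagreement we take the
# target's token and stop, so output always matches what the target would say.
def speculative_step(draft_next, target_next, prefix, k=4):
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        proposed.append(token)
        ctx.append(token)
    accepted, ctx = [], list(prefix)
    for token in proposed:
        expected = target_next(ctx)
        accepted.append(expected)
        ctx.append(expected)
        if expected != token:
            break  # draft diverged; stop accepting
    return accepted
```

&lt;p&gt;When the draft agrees for several tokens in a row, one expensive verification pass yields multiple tokens, which is where the speedup comes from.&lt;/p&gt;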

&lt;p&gt;&lt;strong&gt;Context Compaction:&lt;/strong&gt; Agents also generate a lot of text that is useful once but costly to keep around, such as tool outputs, logs, stack traces, intermediate diffs, and repeated snippets. If the system keeps appending everything, prompts bloat and latency increases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F650g8hwuirlu35d0a0f1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F650g8hwuirlu35d0a0f1.png" alt=" " width="800" height="551"&gt;&lt;/a&gt;&lt;br&gt;
Context compaction addresses this by summarizing the working state and keeping only what is relevant for the next step. Instead of carrying full logs forward, the system retains stable signals like failing test names, error types, and key stack frames. It compresses or drops stale context, deduplicates repeated snippets, and keeps raw artifacts outside the prompt unless they are needed again. Many advanced coding agents like OpenAI’s Codex and Cursor rely on context compaction to stay fast and reliable when reaching the context window limit.&lt;/p&gt;

&lt;p&gt;Context compaction improves both latency and quality. Fewer tokens reduce compute per call, and less noise reduces the chance the model drifts or latches onto outdated information.&lt;/p&gt;
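
&lt;p&gt;A toy version of this idea, applied to test output: keep only the stable failure signals and drop the noise. The filtering regex is an illustrative assumption; real compaction is far more sophisticated and often model-driven.&lt;/p&gt;

```python
import re

# Keep only lines that carry durable signal (failures, error types,
# assertions) and cap the total, instead of carrying the full log forward.
def compact_test_output(raw_log, max_lines=5):
    signal = [
        line.strip()
        for line in raw_log.splitlines()
        if re.search(r"FAILED|Error|assert", line)
    ]
    return "\n".join(signal[:max_lines])
```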

&lt;p&gt;Put together, these three techniques target different sources of compounded latency. MoE reduces per-call serving cost, speculative decoding reduces generation time, and context compaction reduces repeated prompt processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Challenge 3: Sandboxing at Scale&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Coding agents do not only generate text. They execute code. They run builds, tests, linters, formatters, and scripts as part of the core loop. That requires an execution environment that is isolated, resource-limited, and safe by default.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttr96cbwo0b6tifjqti5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttr96cbwo0b6tifjqti5.png" alt=" " width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In Cursor’s flow, the agent provisions a sandboxed workspace from a specific repository snapshot, executes tool calls inside that workspace, and feeds results back into the model. At a small scale, sandboxing is mostly a safety feature. At large scale, it becomes a performance and infrastructure constraint.&lt;/p&gt;

&lt;p&gt;Two major issues dominate when training the model:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Provisioning time becomes the bottleneck&lt;/strong&gt;. The model may generate a solution in milliseconds, but creating a secure, isolated environment can take much longer. If sandbox startup dominates, the system cannot iterate quickly enough to feel usable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Concurrency makes startup overhead a bottleneck at scale&lt;/strong&gt;. Spinning up thousands of sandboxes at once is challenging, and it becomes even harder during training: teaching the model to call tools at scale requires running hundreds of thousands of concurrent sandboxed coding environments in the cloud.&lt;/p&gt;

&lt;p&gt;These challenges pushed the Cursor team to build custom sandboxing infrastructure. They rewrote their VM scheduler to handle bursty demand, like when an agent needs to spin up thousands of sandboxes in a short time. Cursor treats sandboxes as core serving infrastructure, with an emphasis on fast provisioning and aggressive recycling so tool runs can start quickly and sandbox startup time does not dominate the time to a verified fix.&lt;/p&gt;

&lt;p&gt;For safety, Cursor defaults to a restricted Sandbox Mode for agent terminal commands. Commands run in an isolated environment with network access blocked by default and filesystem access limited to the workspace and /tmp/. If a command fails because it needs broader access, the UI lets the user skip it or intentionally re-run it outside the sandbox.&lt;/p&gt;

&lt;p&gt;The key takeaway is not to treat sandboxes as just containers. Treat them as a system that needs its own scheduler, capacity planning, and performance tuning.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Cursor shows that modern coding agents are not just better text generators. They are systems built to edit real repositories, run tools, and verify results. Cursor paired a specialized MoE model with a tool harness, latency-focused serving, and sandboxed execution so the agent can follow a practical loop: inspect the code, make a change, run checks, and iterate until the fix is verified.&lt;/p&gt;

&lt;p&gt;Cursor’s experience shipping Composer to production points to three repeatable lessons that matter for most coding agents:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tool use must be baked into the model.&lt;/strong&gt; Prompting alone is not enough for reliable tool calling inside long loops. The model needs to learn tool usage as a core behavior, especially for editing operations like search and replace where small mistakes can break the edit.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Adoption is the ultimate metric.&lt;/strong&gt; Offline benchmarks are useful, but a coding agent lives or dies by user trust. A single risky edit or broken build can stop users from relying on the tool, so evaluation has to reflect real usage and user acceptance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Speed is part of the product.&lt;/strong&gt; Latency shapes daily usage. You do not need a frontier model for every step. Routing smaller steps to fast models while reserving larger models for harder planning turns responsiveness into a core feature, not just an infrastructure metric.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Coding agents are still evolving, but the trend is promising. With rapid advances in model training and system engineering, we are moving toward a future where they become much faster and more effective.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>coding</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>How Transformers Architecture Powers Modern LLMs</title>
      <dc:creator>Aditya Pandey</dc:creator>
      <pubDate>Mon, 23 Feb 2026 11:46:15 +0000</pubDate>
      <link>https://dev.to/pandeyaditya0002/how-transformers-architecture-powers-modern-llms-4pco</link>
      <guid>https://dev.to/pandeyaditya0002/how-transformers-architecture-powers-modern-llms-4pco</guid>
      <description>&lt;p&gt;When we interact with modern large language models like GPT, Claude, or Gemini, we are witnessing a process fundamentally different from how humans form sentences. While we naturally construct thoughts and convert them into words, LLMs operate through a cyclical conversion process.&lt;/p&gt;

&lt;p&gt;Understanding this process reveals both the capabilities and limitations of these powerful systems.&lt;/p&gt;

&lt;p&gt;At the heart of most modern LLMs lies an architecture called a transformer. Introduced in 2017, transformers are sequence prediction algorithms built from neural network layers. The architecture has three essential components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An embedding layer that converts tokens into numerical representations.&lt;/li&gt;
&lt;li&gt;Multiple transformer layers where computation happens.&lt;/li&gt;
&lt;li&gt;An output layer that converts results back into text.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See the diagram below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feg2dfe45d02rdr0stew1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feg2dfe45d02rdr0stew1.jpg" alt=" " width="800" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Transformers process all words simultaneously rather than one at a time, enabling them to learn from massive text datasets and capture complex word relationships.&lt;/p&gt;

&lt;p&gt;In this article, we will look at how the transformer architecture works in a step-by-step manner.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fib24ce4jr5xrku8k9bpo.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fib24ce4jr5xrku8k9bpo.jpg" alt=" " width="800" height="683"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 1: From Text to Tokens&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before any computation can happen, the model must convert text into a form it can work with. This begins with tokenization, where text gets broken down into fundamental units called tokens. These are not always complete words. They can be subwords, word fragments, or even individual characters.&lt;/p&gt;

&lt;p&gt;Consider this example input: “I love transformers!” The tokenizer might break this into: [“I”, “ love”, “ transform”, “ers”, “!”]. Notice that “transformers” became two separate tokens. Each unique token in the vocabulary gets assigned a unique integer ID:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“I” might be token 150&lt;/li&gt;
&lt;li&gt;“love” might be token 8942&lt;/li&gt;
&lt;li&gt;“transform” might be token 3301&lt;/li&gt;
&lt;li&gt;“ers” might be token 1847&lt;/li&gt;
&lt;li&gt;“!” might be token 254&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These IDs are arbitrary identifiers with no inherent relationships. Tokens 150 and 151 are not similar just because their numbers are close. The overall vocabulary typically contains 50,000 to 100,000 unique tokens that the model learned during training.&lt;/p&gt;
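
&lt;p&gt;A toy lookup table makes the mapping concrete (the IDs are the illustrative ones above, not a real tokenizer’s vocabulary):&lt;/p&gt;

```python
# Encode maps token strings to IDs; decode inverts the table.
vocab = {"I": 150, " love": 8942, " transform": 3301, "ers": 1847, "!": 254}

def encode(tokens):
    return [vocab[t] for t in tokens]

def decode(ids):
    inverse = {i: t for t, i in vocab.items()}
    return "".join(inverse[i] for i in ids)
```

&lt;p&gt;Real tokenizers build this table with algorithms like byte-pair encoding, but the final artifact is still a lookup between strings and integers.&lt;/p&gt;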

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 2: Converting Tokens to Embeddings&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Neural networks cannot work directly with token IDs because they are just fixed identifiers. Each token ID gets mapped to a vector, a list of continuous numbers usually containing hundreds or thousands of dimensions. These are called embeddings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhswp0jwuqd6ctcc4lwsq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhswp0jwuqd6ctcc4lwsq.jpg" alt=" " width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is a simplified example with five dimensions (real models may use 768 to 4096):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token “dog” becomes [0.23, -0.67, 0.45, 0.89, -0.12]&lt;/li&gt;
&lt;li&gt;Token “wolf” becomes [0.25, -0.65, 0.47, 0.91, -0.10]&lt;/li&gt;
&lt;li&gt;Token “car” becomes [-0.82, 0.34, -0.56, 0.12, 0.78]&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Notice how “dog” and “wolf” have similar numbers, while “car” is completely different. This creates a semantic space where related concepts cluster together.&lt;/p&gt;
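
&lt;p&gt;We can check this with cosine similarity, the standard way to measure closeness in embedding space:&lt;/p&gt;

```python
import math

# Cosine similarity: near 1.0 means same direction, near 0 unrelated,
# negative means the vectors point in opposing directions.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

dog = [0.23, -0.67, 0.45, 0.89, -0.12]
wolf = [0.25, -0.65, 0.47, 0.91, -0.10]
car = [-0.82, 0.34, -0.56, 0.12, 0.78]
```

&lt;p&gt;On these toy vectors, &lt;code&gt;cosine(dog, wolf)&lt;/code&gt; comes out close to 1, while &lt;code&gt;cosine(dog, car)&lt;/code&gt; is negative.&lt;/p&gt;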

&lt;p&gt;Why do we need multiple dimensions? With just one number per word, we quickly run into contradictions. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“stock” equals 5.2 (financial term)&lt;/li&gt;
&lt;li&gt;“capital” equals 5.3 (similar financial term)&lt;/li&gt;
&lt;li&gt;“rare” equals -5.2 (antonym: uncommon)&lt;/li&gt;
&lt;li&gt;“debt” equals -5.3 (antonym of capital)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, “rare” and “debt” both have similar negative values, implying they are related, which makes no sense. Hundreds of dimensions allow the model to represent complex relationships without such contradictions.&lt;/p&gt;

&lt;p&gt;In this space, we can perform mathematical operations. The embedding for “king” minus “man” plus “woman” approximately equals “queen.” These relationships emerge during training from patterns in text data.&lt;/p&gt;
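&lt;p&gt;The analogy can be checked with plain list arithmetic. A minimal sketch with made-up 3-dimensional vectors (real embeddings are learned and have hundreds of dimensions):&lt;/p&gt;

```python
# Toy word vectors illustrating the "king minus man plus woman" analogy.
# All values are made up for illustration.
king  = [0.8, 0.9, 0.1]
man   = [0.7, 0.1, 0.1]
woman = [0.7, 0.1, 0.9]
queen = [0.8, 0.9, 0.9]

result = [k - m + w for k, m, w in zip(king, man, woman)]
print(result)  # approximately [0.8, 0.9, 0.9], i.e. the "queen" vector
```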

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 3: Adding Positional Information&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Transformers do not inherently understand word order. Without additional information, “The dog chased the cat” and “The cat chased the dog” would look identical because both contain the same tokens.&lt;/p&gt;

&lt;p&gt;The solution is positional embeddings. Every position gets mapped to a position vector, just like tokens get mapped to meaning vectors.&lt;/p&gt;

&lt;p&gt;For the token “dog” appearing at position 2, it might look like the following:&lt;/p&gt;

&lt;p&gt;Word embedding: [0.23, -0.67, 0.45, 0.89, -0.12]&lt;/p&gt;

&lt;p&gt;Position 2 embedding: [0.05, 0.12, -0.08, 0.03, 0.02]&lt;/p&gt;

&lt;p&gt;Combined (element-wise sum): [0.28, -0.55, 0.37, 0.92, -0.10]&lt;/p&gt;

&lt;p&gt;This combined embedding captures both the meaning of the word and its position in the sequence. This is what flows into the transformer layers.&lt;/p&gt;
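&lt;p&gt;The combination is a plain element-wise sum, which can be sketched directly from the numbers above:&lt;/p&gt;

```python
# Element-wise sum of a toy word embedding and a toy positional embedding,
# using the illustrative values from the example above.
word_embedding     = [0.23, -0.67, 0.45, 0.89, -0.12]
position_embedding = [0.05, 0.12, -0.08, 0.03, 0.02]

combined = [w + p for w, p in zip(word_embedding, position_embedding)]
print([round(x, 2) for x in combined])  # [0.28, -0.55, 0.37, 0.92, -0.1]
```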

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 4: The Attention Mechanism in Transformer Layers&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The transformer layers implement the attention mechanism, which is the key innovation that makes these models so powerful. Each transformer layer operates using three components for every token: queries, keys, and values. We can think of this as a fuzzy dictionary lookup where the model compares what it is looking for (the query) against all possible answers (the keys) and returns weighted combinations of the corresponding values.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4sdsgy8anpevevwv7wo.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4sdsgy8anpevevwv7wo.jpg" alt=" " width="800" height="1009"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let us walk through a concrete example. Consider the sentence: “The cat sat on the mat because it was comfortable.”&lt;/p&gt;

&lt;p&gt;When the model processes the word “it,” it needs to determine what “it” refers to. Here is what happens:&lt;/p&gt;

&lt;p&gt;First, the embedding for “it” generates a query vector asking essentially, “What noun am I referring to?”&lt;/p&gt;

&lt;p&gt;Next, this query is compared against the keys from all previous tokens. Each comparison produces a similarity score. For example:&lt;/p&gt;

&lt;p&gt;“The” (article) generates score: 0.05&lt;/p&gt;

&lt;p&gt;“cat” (noun) generates score: 8.3&lt;/p&gt;

&lt;p&gt;“sat” (verb) generates score: 0.2&lt;/p&gt;

&lt;p&gt;“on” (preposition) generates score: 0.03&lt;/p&gt;

&lt;p&gt;“the” (article) generates score: 0.04&lt;/p&gt;

&lt;p&gt;“mat” (noun) generates score: 4.1&lt;/p&gt;

&lt;p&gt;“because” (conjunction) generates score: 0.1&lt;/p&gt;

&lt;p&gt;The raw scores are then converted into attention weights that sum to 1.0. For example:&lt;/p&gt;

&lt;p&gt;“cat” receives attention weight: 0.75 (75 percent)&lt;/p&gt;

&lt;p&gt;“mat” receives attention weight: 0.20 (20 percent)&lt;/p&gt;

&lt;p&gt;All other tokens: 0.05 total (5 percent combined)&lt;/p&gt;

&lt;p&gt;Finally, the model takes the value vectors from each token and combines them using these weights. For example:&lt;/p&gt;

&lt;p&gt;Output = (0.75 × Value_cat) + (0.20 × Value_mat) + (0.03 × Value_the) + ...&lt;/p&gt;

&lt;p&gt;The value from “cat” contributes 75 percent to the output, “mat” contributes 20 percent, and everything else is nearly ignored. This weighted combination becomes the new representation for “it” that captures the contextual understanding that “it” most likely refers to “cat.”&lt;/p&gt;
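&lt;p&gt;The score-to-weight step is a softmax, and the output is a weighted sum of value vectors. A small Python sketch using the scores above; the weights it computes differ from the rounded illustrative percentages in the text, but “cat” still dominates:&lt;/p&gt;

```python
import math

# Softmax turns raw query-key scores into attention weights that sum to 1.
# Scores follow the example above.
tokens = ["The", "cat", "sat", "on", "the", "mat", "because"]
scores = [0.05, 8.3, 0.2, 0.03, 0.04, 4.1, 0.1]

exps = [math.exp(s) for s in scores]
total = sum(exps)
weights = [e / total for e in exps]

for token, weight in zip(tokens, weights):
    print(f"{token}: {weight:.4f}")

# Toy 2-d value vectors, one per token; the new representation for "it"
# is their weighted combination.
values = [[0.1, 0.0], [1.0, 0.2], [0.0, 0.3], [0.0, 0.1],
          [0.1, 0.0], [0.4, 0.9], [0.0, 0.2]]
output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(2)]
print(output)  # dominated by the value vector for "cat"
```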

&lt;p&gt;This attention process happens in every transformer layer, but each layer learns to detect different patterns.&lt;/p&gt;

&lt;p&gt;Early layers learn basic patterns like grammar and common word pairs. When processing “cat,” these layers might heavily attend to “The” because they learn that articles and their nouns are related.&lt;/p&gt;

&lt;p&gt;Middle layers learn sentence structure and relationships between phrases. They might figure out that “cat” is the subject of “sat” and that “on the mat” forms a prepositional phrase indicating location.&lt;/p&gt;

&lt;p&gt;Deep layers extract abstract meaning. They might understand that this sentence describes a physical situation and implies the cat is comfortable or resting.&lt;/p&gt;

&lt;p&gt;Each layer refines the representation progressively. The output of one layer becomes the input for the next, with each layer adding more contextual understanding.&lt;/p&gt;

&lt;p&gt;Importantly, only the final transformer layer needs to predict an actual token. All intermediate layers perform the same attention operations but simply transform the representations to be more useful for downstream layers. A middle layer does not output token predictions. Instead, it outputs refined vector representations that flow to the next layer.&lt;/p&gt;

&lt;p&gt;This stacking of many layers, each specializing in different aspects of language understanding, is what enables LLMs to capture complex patterns and generate coherent text.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 5: Converting Back to Text&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;After flowing through all layers, the final vector must be converted to text. The unembedding layer compares this vector against every token embedding and produces scores.&lt;/p&gt;

&lt;p&gt;For example, to complete “I love to eat,” the unembedding might produce:&lt;/p&gt;

&lt;p&gt;“pizza”: 65.2&lt;/p&gt;

&lt;p&gt;“tacos”: 64.8&lt;/p&gt;

&lt;p&gt;“sushi”: 64.1&lt;/p&gt;

&lt;p&gt;“food”: 58.3&lt;/p&gt;

&lt;p&gt;“barbeque”: 57.9&lt;/p&gt;

&lt;p&gt;“car”: -12.4&lt;/p&gt;

&lt;p&gt;“42”: -45.8&lt;/p&gt;

&lt;p&gt;These arbitrary scores get converted to probabilities using softmax:&lt;/p&gt;

&lt;p&gt;“pizza”: 28.3 percent&lt;/p&gt;

&lt;p&gt;“tacos”: 24.1 percent&lt;/p&gt;

&lt;p&gt;“sushi”: 18.9 percent&lt;/p&gt;

&lt;p&gt;“food”: 7.2 percent&lt;/p&gt;

&lt;p&gt;“barbeque”: 6.1 percent&lt;/p&gt;

&lt;p&gt;“car”: 0.0001 percent&lt;/p&gt;

&lt;p&gt;“42”: 0.0000001 percent&lt;/p&gt;

&lt;p&gt;Tokens with similar scores (65.2 versus 64.8) receive similar probabilities (28.3 versus 24.1 percent), while low-scoring tokens get near-zero probabilities.&lt;/p&gt;

&lt;p&gt;The model does not select the highest probability token. Instead, it randomly samples from this distribution. Think of a roulette wheel where each token gets a slice proportional to its probability. Pizza gets 28.3 percent, tacos get 24.1 percent, and 42 gets a microscopic slice.&lt;/p&gt;

&lt;p&gt;The reason for this randomness is that always picking the highest-probability token (“pizza” here) would create repetitive, unnatural output. Random sampling weighted by probability allows selection of “tacos,” “sushi,” or “barbeque,” producing varied, natural responses. Occasionally, a lower-probability token gets picked, leading to creative outputs.&lt;/p&gt;
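&lt;p&gt;Softmax plus roulette-wheel sampling fits in a few lines of Python. The scores are the illustrative ones above; the computed probabilities will not match the rounded percentages in the text exactly:&lt;/p&gt;

```python
import math
import random

# Unembedding scores from the example above. Softmax converts them into a
# probability distribution, and the next token is sampled from it rather
# than chosen greedily.
scores = {
    "pizza": 65.2, "tacos": 64.8, "sushi": 64.1,
    "food": 58.3, "barbeque": 57.9, "car": -12.4, "42": -45.8,
}

max_score = max(scores.values())  # subtract the max for numerical stability
exps = {t: math.exp(s - max_score) for t, s in scores.items()}
total = sum(exps.values())
probs = {t: e / total for t, e in exps.items()}

for token, p in probs.items():
    print(f"{token}: {p:.6f}")

# Roulette-wheel sampling: each token gets a slice proportional to its
# probability, so "pizza" is likely but never guaranteed.
random.seed(0)
print(random.choices(list(probs), weights=list(probs.values()), k=1)[0])
```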

&lt;h2&gt;
  
  
  &lt;strong&gt;The Iterative Generation Loop&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The generation process repeats for every token. Let us walk through an example where the initial prompt is “The capital of France.” Here’s how different cycles go through the transformer:&lt;/p&gt;

&lt;p&gt;Cycle 1:&lt;/p&gt;

&lt;p&gt;Input: [”The”, “capital”, “of”, “France”]&lt;/p&gt;

&lt;p&gt;Process through all layers&lt;/p&gt;

&lt;p&gt;Sample: “is” (80 percent)&lt;/p&gt;

&lt;p&gt;Output so far: “The capital of France is”&lt;/p&gt;

&lt;p&gt;Cycle 2:&lt;/p&gt;

&lt;p&gt;Input: [”The”, “capital”, “of”, “France”, “is”] (includes the new token)&lt;/p&gt;

&lt;p&gt;Process through all layers (5 tokens now)&lt;/p&gt;

&lt;p&gt;Sample: “Paris” (92 percent)&lt;/p&gt;

&lt;p&gt;Output so far: “The capital of France is Paris”&lt;/p&gt;

&lt;p&gt;Cycle 3:&lt;/p&gt;

&lt;p&gt;Input: [”The”, “capital”, “of”, “France”, “is”, “Paris”] (6 tokens)&lt;/p&gt;

&lt;p&gt;Process through all layers&lt;/p&gt;

&lt;p&gt;Sample: “.” (65 percent)&lt;/p&gt;

&lt;p&gt;Output so far: “The capital of France is Paris.”&lt;/p&gt;

&lt;p&gt;Cycle 4:&lt;/p&gt;

&lt;p&gt;Input: [”The”, “capital”, “of”, “France”, “is”, “Paris”, “.”] (7 tokens)&lt;/p&gt;

&lt;p&gt;Process through all layers&lt;/p&gt;

&lt;p&gt;Sample: [EoS] token (88 percent)&lt;/p&gt;

&lt;p&gt;Stop the loop&lt;/p&gt;

&lt;p&gt;Final output: “The capital of France is Paris.”&lt;/p&gt;

&lt;p&gt;The [EoS] or end-of-sequence token signals completion. Each cycle processes all previous tokens. This is why generation can slow as responses lengthen.&lt;/p&gt;

&lt;p&gt;This is called autoregressive generation because each output depends on all previous outputs. If an unusual token gets selected (perhaps “chalk” with 0.01 percent probability in “I love to eat chalk”), all subsequent tokens will be influenced by this choice.&lt;/p&gt;
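&lt;p&gt;The cycles above can be sketched as a simple loop. Here the function standing in for the full transformer stack is a hypothetical hard-coded lookup table; in a real model every cycle runs all layers over the whole sequence:&lt;/p&gt;

```python
# A sketch of autoregressive generation. `next_token_distribution` is a
# toy stand-in for the transformer: it maps a token sequence to a
# next-token distribution (values are illustrative).
def next_token_distribution(tokens):
    table = {
        ("The", "capital", "of", "France"): {"is": 0.80, "was": 0.20},
        ("The", "capital", "of", "France", "is"): {"Paris": 0.92, "a": 0.08},
        ("The", "capital", "of", "France", "is", "Paris"): {".": 0.65, ",": 0.35},
        ("The", "capital", "of", "France", "is", "Paris", "."): {"[EoS]": 0.88, "It": 0.12},
    }
    return table[tuple(tokens)]

tokens = ["The", "capital", "of", "France"]
while True:
    distribution = next_token_distribution(tokens)
    # For a deterministic sketch we take the most likely token; a real
    # decoder samples from the distribution as described above.
    token = max(distribution, key=distribution.get)
    if token == "[EoS]":
        break  # the end-of-sequence token stops the loop
    tokens.append(token)

text = " ".join(tokens).replace(" .", ".")
print(text)  # The capital of France is Paris.
```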

&lt;h2&gt;
  
  
  &lt;strong&gt;Training Versus Inference: Two Different Modes&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The transformer flow operates in two contexts: training and inference.&lt;/p&gt;

&lt;p&gt;During training, the model learns language patterns from billions of text examples. It starts with random weights and gradually adjusts them. Here is how training works:&lt;/p&gt;

&lt;p&gt;Training text: “The cat sat on the mat.”&lt;/p&gt;

&lt;p&gt;Model receives: “The cat sat on the”&lt;/p&gt;

&lt;p&gt;With random initial weights, the model might predict:&lt;/p&gt;

&lt;p&gt;“banana”: 25 percent&lt;/p&gt;

&lt;p&gt;“car”: 22 percent&lt;/p&gt;

&lt;p&gt;“mat”: 3 percent (correct answer has low probability)&lt;/p&gt;

&lt;p&gt;“elephant”: 18 percent&lt;/p&gt;

&lt;p&gt;The training process calculates the error (mat should have been higher) and uses backpropagation to adjust every weight:&lt;/p&gt;

&lt;p&gt;Embeddings for “on” and “the” get adjusted&lt;/p&gt;

&lt;p&gt;Attention weights in every transformer layer get adjusted&lt;/p&gt;

&lt;p&gt;Unembedding layer gets adjusted&lt;/p&gt;

&lt;p&gt;Each adjustment is tiny (for example, a weight moving from 0.245 to 0.247), but it accumulates across billions of examples. After seeing “sat on the” followed by “mat” thousands of times in different contexts, the model learns this pattern. Training takes weeks on thousands of GPUs and costs millions of dollars. Once complete, weights are frozen.&lt;/p&gt;
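&lt;p&gt;The error signal is typically a cross-entropy loss: large when the correct next token gets a low probability, shrinking as training pushes that probability up. A minimal sketch using the illustrative probabilities from the example:&lt;/p&gt;

```python
import math

# Cross-entropy loss for a single prediction: the negative log of the
# probability assigned to the correct token. Probabilities are the
# illustrative numbers from the text.
def cross_entropy(prob_of_correct_token):
    return -math.log(prob_of_correct_token)

loss_early = cross_entropy(0.03)  # random weights: "mat" at 3 percent
loss_late = cross_entropy(0.85)   # after training: "mat" at 85 percent
print(round(loss_early, 3), round(loss_late, 3))
```

&lt;p&gt;Backpropagation computes how each weight contributed to this loss and nudges every weight to reduce it slightly.&lt;/p&gt;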

&lt;p&gt;During inference, the transformer runs with frozen weights:&lt;/p&gt;

&lt;p&gt;User query: “Complete this: The cat sat on the”&lt;/p&gt;

&lt;p&gt;The model processes the input with its learned weights and outputs: “mat” (85 percent), “floor” (8 percent), “chair” (3 percent). It samples “mat” and returns it. No weight changes occur.&lt;/p&gt;

&lt;p&gt;The model used its learned knowledge but did not learn anything new. The conversations do not update model weights. To teach the model new information, we would need to retrain it with new data, which requires substantial computational resources.&lt;/p&gt;

&lt;p&gt;See the diagram below that shows the various steps in an LLM execution flow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxehbr7i7jibbmo15m47.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxehbr7i7jibbmo15m47.jpg" alt=" " width="800" height="872"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The transformer architecture provides an elegant solution to understanding and generating human language. By converting text to numerical representations, using attention mechanisms to capture relationships between words, and stacking many layers to learn increasingly abstract patterns, transformers enable modern LLMs to produce coherent and useful text.&lt;/p&gt;

&lt;p&gt;This process involves seven key steps that repeat for every generated token: tokenization, embedding creation, positional encoding, processing through transformer layers with attention mechanisms, unembedding to scores, sampling from probabilities, and decoding back to text. Each step builds on the previous one, transforming raw text into mathematical representations that the model can manipulate, then back into human-readable output.&lt;/p&gt;

&lt;p&gt;Understanding this process reveals both the capabilities and limitations of these systems. In essence, LLMs are sophisticated pattern-matching machines that predict the most likely next token based on patterns learned from massive datasets.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>deeplearning</category>
      <category>llm</category>
    </item>
    <item>
      <title>Understanding Database Types</title>
      <dc:creator>Aditya Pandey</dc:creator>
      <pubDate>Tue, 07 Oct 2025 05:59:24 +0000</pubDate>
      <link>https://dev.to/pandeyaditya0002/understanding-database-types-4l4e</link>
      <guid>https://dev.to/pandeyaditya0002/understanding-database-types-4l4e</guid>
      <description>&lt;p&gt;The success of a software application often hinges on the choice of the right databases. As developers, we're faced with a vast array of database options. It is crucial for us to understand the differences between these options and how to select the ones that best align with our project's requirements. A complex application usually uses several different databases, each catering to a specific aspect of the application’s needs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0wv0efk90gtbkglvwfk.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0wv0efk90gtbkglvwfk.jpg" alt=" " width="800" height="705"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this comprehensive three-part series, we’ll explore the art of database selection. We’ll arm ourselves with the knowledge necessary to make informed decisions when faced with the challenge of choosing databases for various components of our application. We will dive into the process of database selection, examining the various types of databases, discussing factors that influence database performance and cost, and guiding ourselves toward the best choices for our application while balancing essential tradeoffs. &lt;/p&gt;

&lt;p&gt;Throughout the series, we’ll outline the key steps in the database selection process and review case studies that showcase successful database selection in practice. By the end of this series, we aim to empower ourselves with the knowledge and confidence needed to master the art of selecting the right combination of databases for our complex applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq0c2xdvqtuc3lpb9u9hq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq0c2xdvqtuc3lpb9u9hq.jpg" alt=" " width="800" height="1119"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Understanding Database Types&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To make the best decision for our projects, it is essential to understand the various types of databases available in the market. In this section, we explore the key characteristics of different database types, including popular options for each, and compare their use cases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3fbh8bje203vkhsco6b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3fbh8bje203vkhsco6b.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Relational Databases&lt;/strong&gt;&lt;br&gt;
Relational databases are based on the relational model, which organizes data into tables with rows and columns. These databases have been the standard choice for many applications due to their robust consistency, support for complex queries, and adherence to ACID properties (Atomicity, Consistency, Isolation, Durability). Key features and benefits of relational databases include:&lt;/p&gt;

&lt;p&gt;Structured data organization: Data in relational databases is stored in tables with a predefined schema, enforcing a consistent structure throughout the database. This organization makes it easier to manage and maintain data, especially when dealing with large amounts of structured data.&lt;/p&gt;

&lt;p&gt;Relationships and referential integrity: The relationships between tables in a relational database are defined by primary and foreign keys, ensuring referential integrity. This feature allows for efficient querying of related data and supports complex data relationships.&lt;/p&gt;

&lt;p&gt;SQL support: Relational databases use Structured Query Language (SQL) for querying, manipulating, and managing data. SQL is a powerful and widely adopted language that enables developers to perform complex queries and data manipulations.&lt;/p&gt;

&lt;p&gt;Transactions and ACID properties: Relational databases support transactions, which are sets of related operations that either succeed or fail as a whole. This feature ensures the ACID properties – Atomicity, Consistency, Isolation, and Durability – are maintained, guaranteeing data consistency and integrity.&lt;/p&gt;

&lt;p&gt;Indexing and optimization: Relational databases offer various indexing techniques and query optimization strategies, which help improve query performance and reduce resource consumption.&lt;/p&gt;
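&lt;p&gt;Transactions and indexing can be demonstrated with SQLite from Python’s standard library. This is a minimal sketch; the table and column names are made up for illustration:&lt;/p&gt;

```python
import sqlite3

# In-memory SQLite database with a hypothetical accounts table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES (1, 100.0), (2, 50.0)")
conn.commit()

# Atomicity: a transfer either fully succeeds or fully fails.
try:
    with conn:  # commits on success, rolls back on any exception
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
        raise RuntimeError("simulated failure mid-transaction")
except RuntimeError:
    pass

# Both balances are unchanged: the partial update was rolled back.
print(conn.execute("SELECT balance FROM accounts ORDER BY id").fetchall())

# An index speeds up lookups on a frequently queried column.
conn.execute("CREATE INDEX idx_accounts_balance ON accounts (balance)")
```

&lt;p&gt;Client libraries for MySQL, PostgreSQL, and other relational databases expose the same commit and rollback semantics.&lt;/p&gt;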

&lt;p&gt;Relational databases also have some drawbacks:&lt;/p&gt;

&lt;p&gt;Limited scalability: Scaling relational databases horizontally (adding more nodes) can be challenging, especially when compared to some NoSQL databases that are designed for distributed environments.&lt;/p&gt;

&lt;p&gt;Rigidity: The predefined schema in relational databases can make it difficult to adapt to changing requirements, as altering the schema may require significant modifications to existing data and applications.&lt;/p&gt;

&lt;p&gt;Performance issues with large datasets: As the volume of data grows, relational databases may experience performance issues, particularly when dealing with complex queries and large-scale data manipulations.&lt;/p&gt;

&lt;p&gt;Inefficient for unstructured or semi-structured data: Relational databases are designed for structured data, which may not be suitable for managing unstructured or semi-structured data, such as social media data or sensor data.&lt;/p&gt;

&lt;p&gt;Popular relational databases include MySQL, PostgreSQL, Microsoft SQL Server, and Oracle. Each of these options has its unique features, strengths, and weaknesses, making them suitable for different use cases and requirements. When considering a relational database, it is essential to evaluate the specific needs of the application in terms of data consistency, support for complex queries, and scalability, among other factors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb5o2loi7y8e67blgq6sv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb5o2loi7y8e67blgq6sv.jpg" alt=" " width="800" height="201"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>beginners</category>
      <category>database</category>
    </item>
    <item>
      <title>Password, Session, Cookie, Token, JWT, SSO, OAuth - Authentication Explained - Part 2</title>
      <dc:creator>Aditya Pandey</dc:creator>
      <pubDate>Thu, 18 Sep 2025 11:25:04 +0000</pubDate>
      <link>https://dev.to/pandeyaditya0002/password-session-cookie-token-jwt-sso-oauth-authentication-explained-part-2-2gep</link>
      <guid>https://dev.to/pandeyaditya0002/password-session-cookie-token-jwt-sso-oauth-authentication-explained-part-2-2gep</guid>
      <description>&lt;p&gt;**&lt;/p&gt;

&lt;h2&gt;
  
  
  Passwordless Authentication
&lt;/h2&gt;

&lt;p&gt;We have covered three types of authentication so far: HTTP basic authentication, session-cookie authentication, and token-based authentication. They all require a password. However, there are other ways to prove your identity without a password. &lt;/p&gt;

&lt;p&gt;When it comes to authentication, there are three factors to consider:&lt;/p&gt;

&lt;p&gt;Knowledge factors: something you know, such as a password&lt;/p&gt;

&lt;p&gt;Ownership factors: something you own, such as a device or phone number&lt;/p&gt;

&lt;p&gt;Inherence factors: something unique to you, such as your biometric features&lt;/p&gt;

&lt;p&gt;Passwords fall under “something you know”. One-Time Passwords (OTP) prove that the user owns a cell phone or a device, while biometric authentication proves "something unique to you".&lt;/p&gt;

&lt;h2&gt;
  
  
  One-Time Passwords (OTP)
&lt;/h2&gt;

&lt;p&gt;One-Time Passwords (OTP) are widely used as a more secure method of authentication. Unlike static passwords, which can be reused, OTPs are valid for a limited time, typically a few minutes. This means that even if someone intercepts an OTP, they can’t use it to log in later. Additionally, OTPs require “something you own” as well as “something you know” to log in. This can be a cell phone number or email address that the user has access to, making it harder for hackers to steal.&lt;/p&gt;

&lt;p&gt;However, it's important to note that using SMS as the delivery method for OTPs can be less secure than other methods. This is because SMS messages can be intercepted or redirected by hackers, particularly if the user's phone number has been compromised. In some cases, attackers have been able to hijack phone numbers by convincing the mobile carrier to transfer the number to a new SIM card. Once the attacker has control of the number, they can intercept any OTPs sent via SMS. For this reason, it's recommended to use alternative delivery methods, such as email or mobile apps, whenever possible.&lt;/p&gt;

&lt;p&gt;Here’s how OTPs work in more detail:&lt;/p&gt;

&lt;p&gt;Step 1: The user wants to log in to a website and is asked to enter a username, cell phone number, or email.&lt;/p&gt;

&lt;p&gt;Step 2: The server generates an OTP with an expiration time.&lt;/p&gt;

&lt;p&gt;Step 3: The server sends the OTP to the user’s device via SMS or email.&lt;/p&gt;

&lt;p&gt;Step 4: The user enters the OTP received in the login box.&lt;/p&gt;

&lt;p&gt;Steps 5 and 6: The server compares the generated OTP with the one the user entered. If they match, login is granted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpn67xb3p197mmmmv8is9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpn67xb3p197mmmmv8is9.png" alt=" " width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alternatively, a hardware or software key can be used to generate OTPs for multi-factor authentication (MFA). For example, Google 2FA uses a software key that generates a new OTP every 30 seconds. When logging in, users enter their password and the current OTP displayed on their device. This adds an extra layer of security as hackers would need access to the user’s device to steal the OTP. More on MFA later.&lt;/p&gt;
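&lt;p&gt;The 30-second software key works by computing an HMAC over the current time step and truncating it to six digits, the HOTP/TOTP construction from RFC 4226 and RFC 6238. A minimal sketch with a made-up shared secret:&lt;/p&gt;

```python
import hashlib
import hmac
import struct
import time

def totp(secret, timestamp, step=30, digits=6):
    """Time-based OTP: HMAC-SHA1 over the 30-second time step, truncated."""
    counter = int(timestamp) // step
    msg = struct.pack("!Q", counter)                # big-endian 8-byte counter
    digest = hmac.new(secret, msg, hashlib.sha1).digest()
    offset = digest[-1] % 16                        # dynamic truncation offset
    code = int.from_bytes(digest[offset:offset + 4], "big") % 2**31
    return str(code % 10**digits).zfill(digits)

secret = b"shared-secret-from-enrollment"  # hypothetical, set up once per user
print(totp(secret, time.time()))  # 6-digit code for the current 30-second window
```

&lt;p&gt;Both the server and the user’s device derive the same code from the shared secret and the clock, so no code ever travels over SMS.&lt;/p&gt;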

&lt;h2&gt;
  
  
  SSO (Single Sign-On)
&lt;/h2&gt;

&lt;p&gt;Single Sign-On (SSO) is a user authentication method that allows us to access multiple systems or applications with a single set of credentials. SSO streamlines the login process, providing a seamless user experience across various platforms.&lt;/p&gt;

&lt;p&gt;The SSO process mainly relies on a Central Authentication Service (CAS) server. Here's a step-by-step breakdown of the SSO process:&lt;/p&gt;

&lt;p&gt;When we attempt to log in to an application, such as Gmail, we're redirected to the CAS server.&lt;/p&gt;

&lt;p&gt;The CAS server verifies our login credentials and creates a Ticket Granting Ticket (TGT). This TGT is then stored in a Ticket Granting Cookie (TGC) on our browser, representing our global session.&lt;/p&gt;

&lt;p&gt;CAS generates a Service Ticket (ST) for our visit to Gmail and redirects us back to Gmail with the ST.&lt;/p&gt;

&lt;p&gt;Gmail uses the ST to validate our login with the CAS server. After validation, we can access Gmail.&lt;/p&gt;

&lt;p&gt;When we want to access another application, like YouTube, the process is simplified:&lt;/p&gt;

&lt;p&gt;Since we already have a TGC from our Gmail login, CAS recognizes our authenticated status.&lt;/p&gt;

&lt;p&gt;CAS generates a new ST for YouTube access, and we can use YouTube without inputting our credentials again.&lt;/p&gt;

&lt;p&gt;This process reduces the need to remember and enter multiple sets of credentials for different applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh1hrpgn0hcrj3lr5c6s5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh1hrpgn0hcrj3lr5c6s5.png" alt=" " width="800" height="717"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are different protocols that facilitate SSO:&lt;/p&gt;

&lt;p&gt;SAML (Security Assertion Markup Language) is widely used in enterprise applications. SAML communicates authentication and authorization data in an XML format.&lt;/p&gt;

&lt;p&gt;OIDC (OpenID Connect) is popular in consumer applications. OIDC handles authentication through JSON Web Tokens (JWT) and builds on the OAuth 2.0 framework. More on this in the next section.&lt;/p&gt;

&lt;p&gt;For new applications, OIDC is the preferred choice. It supports various client types, including web-based, mobile, and JavaScript clients.&lt;/p&gt;

&lt;p&gt;SSO offers a streamlined and secure authentication method, providing a seamless user experience by requiring only one set of credentials for multiple applications. This approach enhances security through the use of strong, unique passwords and reduced phishing risks. It also minimizes administrative burdens for IT departments.&lt;/p&gt;

&lt;h2&gt;
  
  
  OAuth 2.0 and OpenID Connect (OIDC)
&lt;/h2&gt;

&lt;p&gt;While OAuth 2.0 is primarily an authorization framework, it can be used in conjunction with OpenID Connect (OIDC) for authentication purposes. OIDC is an authentication layer built on top of OAuth 2.0, enabling the verification of a user's identity and granting controlled access to protected resources.&lt;/p&gt;

&lt;p&gt;When using "Sign in with Google" or similar features, OAuth 2.0 and OIDC work together to streamline the authentication process. OIDC provides user identity data in the form of a standardized JSON Web Token (JWT). This token contains information about the authenticated user, allowing the third-party application to create a user profile without requiring a separate registration process.&lt;/p&gt;

&lt;p&gt;In this setup, OAuth 2.0 provides "secure delegated access" by issuing short-lived tokens instead of passwords, allowing third-party services to access protected resources with the resource owner's permission. This method enhances security, as the third-party service does not handle or store the user's password directly.&lt;/p&gt;

&lt;p&gt;The diagram below shows how the protocol works in the “Sign in with Google” scenario.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F86q87i6w15qnl1fwol98.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F86q87i6w15qnl1fwol98.png" alt=" " width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the “Sign in with Google” example, OAuth 2.0 defines four roles:&lt;/p&gt;

&lt;p&gt;Resource owner: The end user, who controls access to their personal data.&lt;/p&gt;

&lt;p&gt;Resource server: The Google server hosting user profiles as protected resources. It uses access tokens to respond to protected resource requests, ensuring that only authorized services can access the data.&lt;/p&gt;

&lt;p&gt;Client: The device (PC or smartphone) making requests on behalf of the resource owner. This device represents the third-party application seeking access to the user's data.&lt;/p&gt;

&lt;p&gt;Authorization server: The Google authorization server that issues tokens to clients, managing the secure exchange of tokens between the resource server and the client.&lt;/p&gt;

&lt;p&gt;OAuth 2.0 offers four types of authorization grants to accommodate different situations:&lt;/p&gt;

&lt;p&gt;Authorization code grant: The most complete and versatile mode, suitable for most application types. More details below.&lt;/p&gt;

&lt;p&gt;Implicit grant: Designed for applications with only a frontend, such as single-page applications or mobile apps. This is no longer recommended. More details below.&lt;/p&gt;

&lt;p&gt;Resource owner password credentials grant: Used when users trust a third-party application with their credentials, such as a trusted mobile app.&lt;/p&gt;

&lt;p&gt;Client credentials grant: Suitable for cases without a frontend, like command-line tools or server-to-server communication, where resource owner interaction is not needed.&lt;/p&gt;

&lt;p&gt;The standard provides multiple modes to cater to different application scenarios and requirements, ensuring flexibility and adaptability for diverse situations.&lt;/p&gt;

&lt;p&gt;The authorization code grant is one example worth examining. The specifications for the other three grant types are available in RFC 6749.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqvhri5vh5msn3n8l2dnz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqvhri5vh5msn3n8l2dnz.png" alt=" " width="800" height="574"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Password, Session, Cookie, Token, JWT, SSO, OAuth - Authentication Explained - Part 1</title>
      <dc:creator>Aditya Pandey</dc:creator>
      <pubDate>Thu, 18 Sep 2025 10:14:18 +0000</pubDate>
      <link>https://dev.to/pandeyaditya0002/password-session-cookie-token-jwt-sso-oauth-authentication-explained-part-1-4pm3</link>
      <guid>https://dev.to/pandeyaditya0002/password-session-cookie-token-jwt-sso-oauth-authentication-explained-part-1-4pm3</guid>
      <description>&lt;p&gt;When we use various applications and websites, three essential security steps are continuously at play:&lt;/p&gt;

&lt;p&gt;Identity &lt;/p&gt;

&lt;p&gt;Authentication&lt;/p&gt;

&lt;p&gt;Authorization&lt;/p&gt;

&lt;p&gt;The diagram below shows where these methods apply in a typical website architecture and their meanings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4p7gajbbmapy0w29kdq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4p7gajbbmapy0w29kdq.png" alt=" " width="800" height="555"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this 2-part series, we dive into different authentication methods, including passwords, sessions, cookies, tokens, JWTs (JSON Web Tokens), SSO (Single Sign-On), and OAuth2. We discuss the problems each method solves and how to choose the right authentication method for our needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Password Authentication&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Password authentication is a fundamental and widely used mechanism for verifying a user's identity on websites and applications. In this method, users enter their unique username and password combination to gain access to protected resources. The entered credentials are checked against stored user information in the system, and if they match, the user is granted access.&lt;/p&gt;

&lt;p&gt;While password authentication is a foundational method for user verification, it has some limitations. Users may forget their passwords, and managing unique usernames and passwords for multiple websites can be challenging. Furthermore, password-based systems can be vulnerable to attacks, such as brute-force or dictionary attacks, if proper security measures aren't in place.&lt;/p&gt;

&lt;p&gt;To address these issues, modern systems often implement additional security measures, such as multi-factor authentication, or use other authentication mechanisms (e.g., session-cookie or token-based authentication) to complement or replace password-based authentication for subsequent access to protected resources.&lt;/p&gt;

&lt;p&gt;In this section, we will cover password-based authentication first to understand its history and how it functions.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;HTTP Basic Access Authentication&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;HTTP basic access authentication requires a web browser to provide a username and a password when requesting a protected resource. The credentials are encoded with the Base64 algorithm and sent in the Authorization HTTP header using the Basic scheme.&lt;/p&gt;

&lt;p&gt;Here's how it typically works:&lt;/p&gt;

&lt;p&gt;The client sends a request to access a protected resource on the server.&lt;/p&gt;

&lt;p&gt;If the client has not yet provided any authentication credentials, the server responds with a 401 Unauthorized status code and includes the WWW-Authenticate: Basic header to indicate that it requires basic authentication.&lt;/p&gt;

&lt;p&gt;The client then prompts the user to enter their username and password, which are combined into a single string in the format username:password.&lt;/p&gt;

&lt;p&gt;The combined string is Base64 encoded and included in the "Authorization: Basic" header in the subsequent request to the server, e.g., Authorization: Basic dXNlcm5hbWU6cGFzc3dvcmQ=.&lt;/p&gt;

&lt;p&gt;Upon receiving the request, the server decodes the Base64-encoded credentials and separates the username and password. The server then checks the provided credentials against its user database or authentication service.&lt;/p&gt;

&lt;p&gt;If the credentials match, the server grants access to the requested resource. If not, the server responds with a 401 Unauthorized status code.&lt;/p&gt;
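&lt;p&gt;Using the example credentials above, the whole exchange looks like this on the wire (the host, path, and realm are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;GET /protected HTTP/1.1
Host: www.example.com

HTTP/1.1 401 Unauthorized
WWW-Authenticate: Basic realm="Protected Area"

GET /protected HTTP/1.1
Host: www.example.com
Authorization: Basic dXNlcm5hbWU6cGFzc3dvcmQ=

HTTP/1.1 200 OK
&lt;/code&gt;&lt;/pre&gt;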

&lt;p&gt;HTTP Basic Access Authentication has limitations. Base64 is an encoding, not encryption, so the username and password can be trivially decoded. Most websites therefore use TLS (Transport Layer Security) to encrypt the traffic between the browser and the server; without it, users' credentials are exposed to interception and man-in-the-middle attacks.&lt;/p&gt;

&lt;p&gt;With HTTP Basic Access Authentication, the browser sends the Authorization header with the necessary credentials for each request to protected resources within the same domain. This provides a smoother user experience, without repeatedly entering the username and password. But, as each website maintains its own usernames and passwords, users may find it difficult to remember their credentials.&lt;/p&gt;

&lt;p&gt;This authentication mechanism is obsolete for modern websites.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpem6dtkgsmmo0hp2z94n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpem6dtkgsmmo0hp2z94n.png" alt=" " width="800" height="559"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;Session-Cookie Authentication&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Session-cookie authentication addresses HTTP basic access authentication's inability to track user login status. A session ID is generated to track the user's status during their visit. This session ID is recorded both server-side and in the client’s cookie, serving as an authentication mechanism. It is called a session-cookie because it is a cookie with the session ID stored inside. Users must still provide their username and password initially, after which the server creates a session for the user's visit. Subsequent requests include the cookie, allowing the server to compare client-side and server-side session IDs.&lt;/p&gt;

&lt;p&gt;Let’s see how it works:&lt;/p&gt;

&lt;p&gt;The client sends a request to access a protected resource on the server. If the client has not yet authenticated, the server responds with a login prompt. The client submits their username and password to the server.&lt;/p&gt;

&lt;p&gt;The server verifies the provided credentials against its user database or authentication service. If the credentials match, the server generates a unique session ID and creates a corresponding session in the server-side storage (e.g., server memory, database, or session server).&lt;/p&gt;

&lt;p&gt;The server sends the session ID to the client as a cookie, typically with a Set-Cookie header.&lt;/p&gt;

&lt;p&gt;The client stores the session cookie.&lt;/p&gt;

&lt;p&gt;For subsequent requests, the client sends the cookie along with the request headers.&lt;/p&gt;

&lt;p&gt;The server checks the session ID in the cookie against the stored session data to authenticate the user.&lt;/p&gt;

&lt;p&gt;If validated, the server grants access to the requested resource. When the user logs out or after a predetermined expiration time, the server invalidates the session, and the client deletes the session cookie.&lt;/p&gt;
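&lt;p&gt;A typical exchange looks like this (the endpoint, cookie name, and session ID are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;POST /login HTTP/1.1
Host: www.example.com
Content-Type: application/json

{"username": "alice", "password": "correct-horse"}

HTTP/1.1 200 OK
Set-Cookie: SESSIONID=abc123; HttpOnly; Secure

GET /profile HTTP/1.1
Host: www.example.com
Cookie: SESSIONID=abc123
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The HttpOnly and Secure flags keep the cookie out of reach of JavaScript and off plain-HTTP connections, respectively.&lt;/p&gt;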

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy119ksf3143mooqxbvp8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy119ksf3143mooqxbvp8.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>🌐 Why Are RESTful APIs So Popular?</title>
      <dc:creator>Aditya Pandey</dc:creator>
      <pubDate>Fri, 20 Jun 2025 08:15:08 +0000</pubDate>
      <link>https://dev.to/pandeyaditya0002/why-are-restful-apis-so-popular-45go</link>
      <guid>https://dev.to/pandeyaditya0002/why-are-restful-apis-so-popular-45go</guid>
      <description>&lt;p&gt;REST is the most common communication standard between computers over the internet. What is it? Why is it so popular?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8vdregakd64ub8hus7q.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8vdregakd64ub8hus7q.jpg" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The common API standard used by most mobile and web applications to talk to the servers is called REST. It stands for REpresentational State Transfer.&lt;/p&gt;

&lt;p&gt;REST is not a specification. It is a loose set of rules that has been the de facto standard for building web APIs since the early 2000s.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feg0a3lw2vknhyq3b94jp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feg0a3lw2vknhyq3b94jp.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An API that follows the REST standard is called a RESTful API. Some real-life examples are Twilio, Stripe, and Google Maps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faaoap16uskld7pt3rjrn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faaoap16uskld7pt3rjrn.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s look at the basics of REST. A RESTful API organizes resources into a set of unique URIs or Uniform Resource Identifiers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzhxabgcc12n7qyhln19c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzhxabgcc12n7qyhln19c.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Resources should be grouped by noun, not verb. For example, an API to get all products should be:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fslrolfpdvy6pzztetzi2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fslrolfpdvy6pzztetzi2.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A client interacts with a resource by making a request to the endpoint for the resource over HTTP. The request has a very specific format. &lt;/p&gt;

&lt;p&gt;POST /products HTTP/1.1&lt;/p&gt;

&lt;p&gt;The first line contains the URI for the resource we’d like to access, preceded by an HTTP verb that tells the server what we want to do with the resource.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fayyuaenfl10mh6nsk8zx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fayyuaenfl10mh6nsk8zx.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You might have heard of the acronym CRUD. This is what it stands for.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4wa8lbp1096ae8uid1v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4wa8lbp1096ae8uid1v.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A request may also include an optional HTTP body that carries a custom payload of data, usually encoded in JSON.&lt;/p&gt;
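&lt;p&gt;Putting the pieces together, a request to create a product might look like this (the host and field names are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;POST /products HTTP/1.1
Host: api.example.com
Content-Type: application/json

{
  "name": "Wireless Mouse",
  "price": 29.99
}
&lt;/code&gt;&lt;/pre&gt;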

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc7cklcytw46rwozn8ue8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc7cklcytw46rwozn8ue8.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The server receives the request, processes it, and formats the result into a response.&lt;/p&gt;

&lt;p&gt;The first line of the response contains the HTTP status code to tell the client what happened to the request.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2gvh22jtzetcviuzmj2t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2gvh22jtzetcviuzmj2t.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A well-implemented RESTful API returns proper HTTP status codes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5po3zkog0k8jfcvzw7ud.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5po3zkog0k8jfcvzw7ud.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A well-behaved client could choose to retry a request that failed with a 500-level status code.&lt;/p&gt;

&lt;p&gt;We said “could choose to retry” because some actions are not idempotent and those require extra care when retrying. When an API is idempotent, making multiple identical requests has the same effect as making a single request. This is usually not the case for a POST request to create a new resource.&lt;/p&gt;
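&lt;p&gt;One common mitigation, popularized by payment APIs such as Stripe, is to have the client attach an idempotency key so the server can recognize and discard duplicate submissions. A sketch (the endpoint, header name, and payload are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;POST /orders HTTP/1.1
Host: api.example.com
Idempotency-Key: 8e03978e-40d5-43e8-bc93-6894a57f9324
Content-Type: application/json

{"product_id": 123, "quantity": 1}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If the same key arrives twice, the server returns the stored result of the first attempt instead of creating a second order.&lt;/p&gt;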

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6626ydxoyvakakc7xdp3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6626ydxoyvakakc7xdp3.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The response body is optional; when present, it contains the data payload, usually formatted in JSON.&lt;/p&gt;

&lt;p&gt;There is a critical attribute of REST that is worth discussing more.&lt;/p&gt;

&lt;p&gt;A REST implementation should be stateless: the two parties don’t need to store any information about each other, and every request-response pair is independent of all others.&lt;/p&gt;

&lt;p&gt;This leads to web applications that are easy to scale and well-behaved.&lt;/p&gt;

&lt;p&gt;There are two finer points to discuss to round out a well-behaved RESTful API.&lt;/p&gt;

&lt;p&gt;If an API endpoint returns a huge amount of data, use pagination.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj075jj5v0yc0mf1ghwxp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj075jj5v0yc0mf1ghwxp.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A common pagination scheme uses limit and offset. Here is an example:&lt;/p&gt;
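&lt;p&gt;A request for the third page of 25 products might look like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;GET /products?limit=25&amp;amp;offset=50 HTTP/1.1
&lt;/code&gt;&lt;/pre&gt;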

&lt;p&gt;If they are not specified, the server should assume sensible default values.&lt;/p&gt;

&lt;p&gt;Lastly, versioning of an API is very important.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4p37pz60pa6yvbpkdh4s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4p37pz60pa6yvbpkdh4s.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Versioning allows an implementation to provide backward compatibility so that if we introduce breaking changes from one version to another, consumers get enough time to move to the next version.&lt;/p&gt;

&lt;p&gt;There are many ways to version an API. The most straightforward is to prefix the version before the resource on the URI. For instance:&lt;/p&gt;
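&lt;p&gt;With URI versioning, clients on the old version keep working while new clients adopt the new one (the domain is illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;https://api.example.com/v1/products
https://api.example.com/v2/products
&lt;/code&gt;&lt;/pre&gt;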

&lt;p&gt;There are other popular API options like GraphQL and gRPC. We will discuss those and compare them separately.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fluf1tdbf8xsb1k4eou0p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fluf1tdbf8xsb1k4eou0p.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>🐦 How Would You Design Twitter? (Plus: Threads vs Processes, Choosing Databases, and Unique ID Generation)</title>
      <dc:creator>Aditya Pandey</dc:creator>
      <pubDate>Mon, 19 May 2025 10:17:10 +0000</pubDate>
      <link>https://dev.to/pandeyaditya0002/how-would-you-design-twitter-plus-threads-vs-processes-choosing-databases-and-unique-id-39np</link>
      <guid>https://dev.to/pandeyaditya0002/how-would-you-design-twitter-plus-threads-vs-processes-choosing-databases-and-unique-id-39np</guid>
      <description>&lt;p&gt;In this deep-dive post, we explore system design insights, foundational CS concepts, and architecture patterns from real-world use cases. Let’s unpack 👇&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjs9yx4hpoqhfg0qc3fn9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjs9yx4hpoqhfg0qc3fn9.jpg" alt="Image description" width="800" height="844"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  💡 Interview Essential: Process vs Thread
&lt;/h3&gt;

&lt;p&gt;Understanding the difference between &lt;strong&gt;processes&lt;/strong&gt; and &lt;strong&gt;threads&lt;/strong&gt; is a must-have for any backend or systems engineer.&lt;/p&gt;

&lt;p&gt;🔹 A &lt;strong&gt;Program&lt;/strong&gt; is just a passive set of instructions on disk.&lt;/p&gt;

&lt;p&gt;🔹 A &lt;strong&gt;Process&lt;/strong&gt; is a program in action — it’s loaded into memory, with its own resources (stack, registers, etc.)&lt;/p&gt;

&lt;p&gt;🔹 A &lt;strong&gt;Thread&lt;/strong&gt; is the smallest unit of execution, running within a process — multiple threads can share memory and resources.&lt;/p&gt;

&lt;p&gt;Key differences:&lt;/p&gt;

&lt;p&gt;🔹 Processes are isolated; threads run within the same memory space.&lt;br&gt;
🔹 Context switching is heavier for processes than threads.&lt;br&gt;
🔹 Threads allow faster communication but require careful synchronization.&lt;br&gt;
🔹 Creating processes is resource-intensive; threads are lightweight.&lt;/p&gt;

&lt;p&gt;💬 &lt;strong&gt;Over to you&lt;/strong&gt;:&lt;br&gt;
1️⃣ How do coroutines differ from threads in languages like Go or Python?&lt;br&gt;
2️⃣ How would you list all running processes in Linux?&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx8m49x448dodt2c40dzu.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx8m49x448dodt2c40dzu.jpg" alt="Image description" width="800" height="1085"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  🛠️ System Design Interview: Design Twitter
&lt;/h3&gt;

&lt;p&gt;Based on a 2013 Twitter tech talk, here’s how a tweet travels through Twitter’s architecture:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Life of a Tweet&lt;/strong&gt;&lt;br&gt;
1️⃣ Tweet comes in via the Write API&lt;br&gt;
2️⃣ Routed to the &lt;strong&gt;Fanout&lt;/strong&gt; service&lt;br&gt;
3️⃣ Stored and processed in &lt;strong&gt;Redis cache&lt;/strong&gt;&lt;br&gt;
4️⃣ Timeline service locates the relevant Redis shard&lt;br&gt;
5️⃣ User pulls the timeline via the Timeline service&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Search &amp;amp; Discovery&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Ingester&lt;/strong&gt;: Tokenizes tweets for indexing&lt;br&gt;
🔹 &lt;strong&gt;Earlybird&lt;/strong&gt;: Stores the searchable index&lt;br&gt;
🔹 &lt;strong&gt;Blender&lt;/strong&gt;: Builds search and discovery timelines&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Push Compute&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;🔹 HTTP Push&lt;br&gt;
🔹 Mobile Push&lt;/p&gt;

&lt;p&gt;🔍 &lt;em&gt;Note: Based on Twitter’s 2013 architecture — still valuable for understanding scalable social media backends.&lt;/em&gt; &lt;a href="https://bit.ly/3vNfjRp" rel="noopener noreferrer"&gt;Original Talk&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;💬 What are the &lt;strong&gt;architecture differences&lt;/strong&gt; between LinkedIn and Twitter? How do their use cases influence design?&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6wbhs9wtio7dopkp92ir.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6wbhs9wtio7dopkp92ir.jpg" alt="Image description" width="800" height="1120"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  🧩 Choosing the Right Database – A Visual Guide
&lt;/h3&gt;

&lt;p&gt;Databases are not one-size-fits-all. Always choose the right DB for the workload:&lt;/p&gt;

&lt;p&gt;Common types:&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Relational&lt;/strong&gt; (SQL) – Great for structured data and ACID compliance&lt;br&gt;
🔹 &lt;strong&gt;Key-Value / In-Memory&lt;/strong&gt; – Speed first (e.g., Redis)&lt;br&gt;
🔹 &lt;strong&gt;Time Series&lt;/strong&gt; – Optimized for time-stamped data&lt;br&gt;
🔹 &lt;strong&gt;Document / JSON&lt;/strong&gt; – Flexible schema (e.g., MongoDB)&lt;br&gt;
🔹 &lt;strong&gt;Graph&lt;/strong&gt; – Best for relationships (e.g., Neo4j)&lt;br&gt;
🔹 &lt;strong&gt;Blob / Text Search / Geospatial / Ledger&lt;/strong&gt; – Specialized needs&lt;/p&gt;

&lt;p&gt;💬 Which databases have you used? How did they perform for your workload?&lt;/p&gt;

&lt;p&gt;Thanks to &lt;strong&gt;Satish Chandra Gupta&lt;/strong&gt; for the visual inspiration!&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjdrao8qbtxda9lkwdg74.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjdrao8qbtxda9lkwdg74.jpg" alt="Image description" width="800" height="892"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  🔐 Unique ID Generator – A Must for Scalable Systems
&lt;/h3&gt;

&lt;p&gt;Large-scale systems like Facebook, Twitter, and LinkedIn need &lt;strong&gt;unique IDs&lt;/strong&gt; that meet tough requirements:&lt;/p&gt;

&lt;p&gt;🔹 Globally unique&lt;br&gt;
🔹 Roughly time-sorted&lt;br&gt;
🔹 Numeric-only&lt;br&gt;
🔹 64-bit&lt;br&gt;
🔹 Low-latency &amp;amp; scalable&lt;/p&gt;

&lt;p&gt;Think of this as the backbone of tweet IDs, post IDs, user IDs. The implementation details vary, but the goal remains the same — fast, distributed, and conflict-free identity.&lt;/p&gt;
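&lt;p&gt;Twitter's Snowflake scheme is the classic example: it packs a timestamp, a machine ID, and a per-machine sequence counter into a single 64-bit integer, which makes IDs numeric, roughly time-sorted, and generatable on each machine without coordination:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;| 1 bit unused | 41 bits timestamp (ms) | 10 bits machine ID | 12 bits sequence |
&lt;/code&gt;&lt;/pre&gt;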

&lt;p&gt;💬 What kind of ID generation strategies have you used (UUIDs, Snowflake, etc.)?&lt;/p&gt;




&lt;p&gt;Let’s keep learning from real-world architectures and build scalable, resilient systems together!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>📄 How Would You Design Google Docs? (Plus: Deployment Strategies, Trends &amp; a Book Giveaway!)</title>
      <dc:creator>Aditya Pandey</dc:creator>
      <pubDate>Mon, 19 May 2025 09:51:19 +0000</pubDate>
      <link>https://dev.to/pandeyaditya0002/how-would-you-design-google-docs-plus-deployment-strategies-trends-a-book-giveaway-bde</link>
      <guid>https://dev.to/pandeyaditya0002/how-would-you-design-google-docs-plus-deployment-strategies-trends-a-book-giveaway-bde</guid>
      <description>&lt;p&gt;In this edition, we dive into real-world system design, safe deployment strategies, a signed book giveaway, and the latest trends in software architecture 👇&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftafvhphe6wkiziknquuq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftafvhphe6wkiziknquuq.png" alt="Image description" width="800" height="910"&gt;&lt;/a&gt;&lt;br&gt;
🚀 How to Deploy Services Without Downtime&lt;br&gt;
Deploying services can be risky. Choosing the right deployment strategy matters:&lt;/p&gt;

&lt;p&gt;🔹 Multi-Service Deployment&lt;br&gt;
Simple to implement, but high risk — all services are upgraded at once, and rollbacks are complex.&lt;/p&gt;

&lt;p&gt;🔹 Blue-Green Deployment&lt;br&gt;
Two identical environments: “blue” for staging, “green” for production. After testing, traffic is routed to the new version. Easier rollback, but expensive.&lt;/p&gt;

&lt;p&gt;🔹 Canary Deployment&lt;br&gt;
Roll out updates gradually to small user groups. Safer and cheaper than blue-green but harder to monitor.&lt;/p&gt;

&lt;p&gt;🔹 A/B Testing&lt;br&gt;
Multiple versions run simultaneously for user segments. Great for experimentation — but needs careful handling to avoid accidental exposure.&lt;/p&gt;

&lt;p&gt;💬 Over to you – Which strategy do you use in production? Any horror stories?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzornwcr2sl963pxdyw26.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzornwcr2sl963pxdyw26.jpg" alt="Image description" width="800" height="1073"&gt;&lt;/a&gt;&lt;br&gt;
🧠 Google Docs: Real-Time Collaborative Editing Architecture&lt;br&gt;
Designing a real-time editor like Google Docs isn’t trivial:&lt;/p&gt;

&lt;p&gt;1️⃣ Clients send edits via WebSocket.&lt;br&gt;
2️⃣ WebSocket Server manages real-time communication.&lt;br&gt;
3️⃣ Operations go to a Message Queue for durability.&lt;br&gt;
4️⃣ A File Operation Server applies collaboration algorithms.&lt;br&gt;
5️⃣ Data stored: metadata, content, and edit history.&lt;/p&gt;

&lt;p&gt;Conflict resolution algorithms include:&lt;br&gt;
🔹 Operational Transformation (used by Google Docs)&lt;br&gt;
🔹 Differential Synchronization&lt;br&gt;
🔹 CRDT (actively researched)&lt;/p&gt;

&lt;p&gt;💬 Have you ever faced issues using Google Docs? What do you think caused them?&lt;/p&gt;

&lt;p&gt;📊 Software Architecture Trends – What’s Changing?&lt;br&gt;
Insights from InfoQ’s Architecture &amp;amp; Design Trends Report:&lt;/p&gt;

&lt;p&gt;🔹 "Data + Architecture" – Architects now consider data pipelines, quality &amp;amp; traceability alongside systems.&lt;/p&gt;

&lt;p&gt;🔹 Architecture is becoming a shared responsibility — not just for those with “architect” in their title.&lt;/p&gt;

&lt;p&gt;🔹 Asynchronous collaboration (like ADRs) is a positive shift from remote work culture.&lt;/p&gt;

&lt;p&gt;🔹 Better distributed teams = Better distributed systems.&lt;/p&gt;

&lt;p&gt;💬 What trends are you seeing in 2022 and beyond?&lt;/p&gt;

&lt;p&gt;Let’s connect and share insights on system design, cloud architecture, and engineering leadership!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>📦 At Most Once, At Least Once, Exactly Once: What Do These Really Mean?</title>
      <dc:creator>Aditya Pandey</dc:creator>
      <pubDate>Mon, 05 May 2025 10:57:15 +0000</pubDate>
      <link>https://dev.to/pandeyaditya0002/at-most-once-at-least-once-exactly-once-what-do-these-really-mean-1a39</link>
      <guid>https://dev.to/pandeyaditya0002/at-most-once-at-least-once-exactly-once-what-do-these-really-mean-1a39</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzxlumg8rt3nqu40qeiy2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzxlumg8rt3nqu40qeiy2.png" alt="Image description" width="800" height="486"&gt;&lt;/a&gt;&lt;br&gt;
In today’s distributed system architectures, we break large systems into small, independent services. These services need a reliable way to talk to each other — and &lt;strong&gt;message queues&lt;/strong&gt; or &lt;strong&gt;event streaming platforms&lt;/strong&gt; play a critical role in enabling this communication.&lt;/p&gt;

&lt;p&gt;But here's the key question:&lt;br&gt;
👉 &lt;strong&gt;How reliably is a message delivered from sender to receiver?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s break down the &lt;strong&gt;three core delivery semantics&lt;/strong&gt; you’ll encounter in real-world systems 👇&lt;/p&gt;




&lt;h3&gt;
  
  
  1️⃣ &lt;strong&gt;At-Most-Once&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;🔹 Messages are delivered &lt;strong&gt;zero or one time&lt;/strong&gt;&lt;br&gt;
🔹 &lt;strong&gt;No retries&lt;/strong&gt;, so if something fails — the message is &lt;strong&gt;lost&lt;/strong&gt;&lt;br&gt;
🔹 Simple, fast, but &lt;strong&gt;no guarantee&lt;/strong&gt; of delivery&lt;/p&gt;

&lt;p&gt;💡 &lt;strong&gt;Use case&lt;/strong&gt;: Monitoring metrics, logs, or telemetry where occasional loss is acceptable.&lt;/p&gt;




&lt;h3&gt;
  
  
  2️⃣ &lt;strong&gt;At-Least-Once&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;🔹 Messages are &lt;strong&gt;never lost&lt;/strong&gt;&lt;br&gt;
🔹 But they &lt;strong&gt;may be delivered multiple times&lt;/strong&gt;&lt;br&gt;
🔹 Systems must be able to &lt;strong&gt;deduplicate&lt;/strong&gt; on the consumer side&lt;/p&gt;

&lt;p&gt;💡 &lt;strong&gt;Use case&lt;/strong&gt;: Order processing, notifications, analytics — where duplicates can be filtered or ignored.&lt;/p&gt;
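&lt;p&gt;A minimal sketch of that consumer-side deduplication, assuming each message carries a producer-assigned unique id (in production the seen-id set would live in Redis or a database table, not process memory):&lt;/p&gt;

```python
processed_ids = set()   # in production: Redis, or a unique-constraint DB table
results = []

def handle(message):
    """Process a message idempotently: skip ids we've already seen."""
    if message["id"] in processed_ids:
        return                          # duplicate redelivery, ignore it
    processed_ids.add(message["id"])
    results.append(message["payload"])  # the real side effect goes here

# The broker redelivers message 1 after a missed ack:
for msg in [{"id": 1, "payload": "order-created"},
            {"id": 2, "payload": "order-paid"},
            {"id": 1, "payload": "order-created"}]:   # duplicate
    handle(msg)

assert results == ["order-created", "order-paid"]
```

&lt;p&gt;This is how at-least-once delivery plus an idempotent consumer approximates exactly-once behavior in practice.&lt;/p&gt;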




&lt;h3&gt;
  
  
  3️⃣ &lt;strong&gt;Exactly Once&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;🔹 Each message is delivered &lt;strong&gt;only once&lt;/strong&gt;, &lt;strong&gt;no duplicates&lt;/strong&gt;, &lt;strong&gt;no loss&lt;/strong&gt;&lt;br&gt;
🔹 Sounds perfect — but it’s &lt;strong&gt;very hard to implement&lt;/strong&gt;&lt;br&gt;
🔹 Adds &lt;strong&gt;complexity, latency&lt;/strong&gt;, and often performance trade-offs&lt;/p&gt;

&lt;p&gt;💡 &lt;strong&gt;Use case&lt;/strong&gt;: Financial transactions, trading systems, accounting, where a duplicated or lost message is unacceptable and operations cannot simply be made &lt;strong&gt;idempotent&lt;/strong&gt; downstream.&lt;/p&gt;




&lt;h3&gt;
  
  
  🧠 So... Why Does It Matter?
&lt;/h3&gt;

&lt;p&gt;Choosing the right delivery guarantee isn’t just about tech — it’s about your &lt;strong&gt;use case&lt;/strong&gt; and &lt;strong&gt;business priorities&lt;/strong&gt;.&lt;br&gt;
Sometimes speed matters more than precision. Other times, &lt;strong&gt;a single duplicate message could cost thousands&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  💭 Bonus Insight:
&lt;/h3&gt;

&lt;p&gt;📌 &lt;strong&gt;Message Queue vs Event Streaming Platform?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Message Queues&lt;/strong&gt; (like RabbitMQ, SQS): Focus on reliability and order for point-to-point communication.&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Event Streaming Platforms&lt;/strong&gt; (like Kafka, Pulsar): Optimized for broadcasting, storing, and replaying high-throughput event logs. Ideal for &lt;strong&gt;event-driven systems&lt;/strong&gt; and &lt;strong&gt;real-time analytics&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;What’s your go-to strategy for delivery semantics in distributed systems?&lt;br&gt;
Let’s discuss in the comments 💬&lt;/p&gt;

</description>
    </item>
    <item>
      <title>🧩 Vertical vs Horizontal Partitioning: How Large Systems Manage Data at Scale</title>
      <dc:creator>Aditya Pandey</dc:creator>
      <pubDate>Fri, 02 May 2025 13:11:15 +0000</pubDate>
      <link>https://dev.to/pandeyaditya0002/vertical-vs-horizontal-partitioning-how-large-systems-manage-data-at-scale-1gen</link>
      <guid>https://dev.to/pandeyaditya0002/vertical-vs-horizontal-partitioning-how-large-systems-manage-data-at-scale-1gen</guid>
      <description>&lt;p&gt;As systems grow, managing data efficiently becomes essential. One of the key strategies is &lt;strong&gt;partitioning&lt;/strong&gt; — splitting large datasets to improve performance, scalability, and manageability.&lt;/p&gt;

&lt;p&gt;Let’s break down the two most common types of partitioning and why they matter 👇&lt;/p&gt;




&lt;h3&gt;
  
  
  🔄 Types of Data Partitioning
&lt;/h3&gt;

&lt;p&gt;🔹 &lt;strong&gt;Vertical Partitioning&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
→ Moves specific &lt;strong&gt;columns&lt;/strong&gt; into separate tables&lt;br&gt;&lt;br&gt;
→ All tables contain the &lt;strong&gt;same number of rows&lt;/strong&gt;, but fewer columns&lt;br&gt;&lt;br&gt;
→ Ideal when different parts of an app only access certain attributes&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Horizontal Partitioning (Sharding)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
→ Splits tables into smaller sets of &lt;strong&gt;rows&lt;/strong&gt; across multiple databases&lt;br&gt;&lt;br&gt;
→ All shards have the &lt;strong&gt;same columns&lt;/strong&gt;, but fewer rows&lt;br&gt;&lt;br&gt;
→ Common in large-scale systems like social networks, ecommerce platforms, etc.&lt;/p&gt;




&lt;h3&gt;
  
  
  📍 Horizontal Partitioning in Detail
&lt;/h3&gt;

&lt;p&gt;Once your database is horizontally partitioned, you need a way to decide &lt;strong&gt;where each piece of data should go&lt;/strong&gt;. This is where &lt;strong&gt;routing algorithms&lt;/strong&gt; come in:&lt;/p&gt;

&lt;p&gt;🔢 &lt;strong&gt;Routing Strategies:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;1️⃣ &lt;strong&gt;Range-based Sharding&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
→ Rows are split based on ordered values (e.g., ID, timestamp)&lt;br&gt;&lt;br&gt;
→ Example: User IDs 1–2 in Shard 1, User IDs 3–4 in Shard 2&lt;/p&gt;

&lt;p&gt;2️⃣ &lt;strong&gt;Hash-based Sharding&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
→ Applies a hash function on key columns (e.g., &lt;code&gt;User ID % 2&lt;/code&gt;)&lt;br&gt;&lt;br&gt;
→ Example: IDs 1 &amp;amp; 3 in Shard 1, IDs 2 &amp;amp; 4 in Shard 2&lt;br&gt;&lt;br&gt;
→ More balanced, but can be harder to query sequentially&lt;/p&gt;
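&lt;p&gt;The hash-based example above can be sketched in a few lines; the routing function is the whole trick, and the result matches the split described (odd IDs together, even IDs together). Note that real systems often use consistent hashing instead of a plain modulo, so that adding a shard doesn't remap almost every row.&lt;/p&gt;

```python
NUM_SHARDS = 2

def route(user_id: int) -> int:
    """Hash-based routing: User ID % 2, as in the example above."""
    return user_id % NUM_SHARDS   # remainder picks the shard

shards = {0: [], 1: []}
for uid in [1, 2, 3, 4]:
    shards[route(uid)].append(uid)

# Odd ids land on one shard, even ids on the other:
assert shards == {1: [1, 3], 0: [2, 4]}
```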




&lt;h3&gt;
  
  
  ✅ Benefits of Partitioning
&lt;/h3&gt;

&lt;p&gt;🔹 &lt;strong&gt;Enables horizontal scaling&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
→ Easily add more servers to spread the load&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Improves performance&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
→ Smaller datasets = faster queries = better user experience&lt;/p&gt;




&lt;h3&gt;
  
  
  ⚠️ Trade-offs to Watch Out For
&lt;/h3&gt;

&lt;p&gt;🔹 &lt;strong&gt;Complex queries (e.g., ORDER BY)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
→ May need to merge and sort data from multiple shards at the application level&lt;/p&gt;
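&lt;p&gt;That application-level merge can be done efficiently when each shard returns its rows already sorted; a sketch using Python's standard-library &lt;code&gt;heapq.merge&lt;/code&gt; (the shard data here is made up for illustration):&lt;/p&gt;

```python
import heapq

# Each shard answers its own "ORDER BY created_at" locally;
# the application merges the pre-sorted streams into a global ordering.
shard_1 = [(1, "a"), (4, "d"), (5, "e")]   # (created_at, row)
shard_2 = [(2, "b"), (3, "c"), (6, "f")]

merged = list(heapq.merge(shard_1, shard_2))
assert [ts for ts, _ in merged] == [1, 2, 3, 4, 5, 6]
```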

&lt;p&gt;🔹 &lt;strong&gt;Hotspots and uneven distribution&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
→ One shard might handle much more traffic than others (aka “hotspot” problem)&lt;/p&gt;




&lt;h3&gt;
  
  
  💡 Why It Matters
&lt;/h3&gt;

&lt;p&gt;If you're building or working with:&lt;/p&gt;

&lt;p&gt;🚀 Scalable architectures&lt;br&gt;&lt;br&gt;
📊 Distributed databases&lt;br&gt;&lt;br&gt;
📦 Microservices that handle large datasets  &lt;/p&gt;

&lt;p&gt;…you’ll likely encounter partitioning decisions. Knowing when and how to use &lt;strong&gt;vertical&lt;/strong&gt; vs &lt;strong&gt;horizontal partitioning&lt;/strong&gt; can make or break your system's performance.&lt;/p&gt;




&lt;p&gt;Have you faced challenges with sharding or uneven data distribution? Share your experience or tips in the comments 👇&lt;/p&gt;

</description>
    </item>
    <item>
      <title>💸 How Does Currency Conversion Really Work When You Pay Internationally?</title>
      <dc:creator>Aditya Pandey</dc:creator>
      <pubDate>Wed, 30 Apr 2025 07:00:14 +0000</pubDate>
      <link>https://dev.to/pandeyaditya0002/how-does-currency-conversion-really-work-when-you-pay-internationally-2ldm</link>
      <guid>https://dev.to/pandeyaditya0002/how-does-currency-conversion-really-work-when-you-pay-internationally-2ldm</guid>
      <description>&lt;p&gt;Ever wondered what happens behind the scenes when you pay $100 in &lt;strong&gt;USD&lt;/strong&gt; and the seller receives &lt;strong&gt;EUR&lt;/strong&gt; on the other side of the world?&lt;br&gt;&lt;br&gt;
Let’s demystify &lt;strong&gt;foreign exchange (forex)&lt;/strong&gt; from a &lt;em&gt;system architecture&lt;/em&gt; perspective. 🧠🌍&lt;/p&gt;




&lt;h3&gt;
  
  
  📦 Real-World Example: USD → EUR
&lt;/h3&gt;

&lt;p&gt;Let’s say &lt;strong&gt;Bob&lt;/strong&gt; (a buyer in the U.S.) pays &lt;strong&gt;$100 USD&lt;/strong&gt;, and &lt;strong&gt;Alice&lt;/strong&gt; (a seller in Europe) wants to receive &lt;strong&gt;EUR&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
Here’s what typically happens under the hood, using &lt;strong&gt;PayPal&lt;/strong&gt; as a third-party payment provider:&lt;/p&gt;

&lt;p&gt;1️⃣ &lt;strong&gt;Bob sends $100 USD&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
→ Money moves from Bob’s bank (&lt;strong&gt;Bank B&lt;/strong&gt;) to &lt;strong&gt;PayPal’s account&lt;/strong&gt; in &lt;strong&gt;Bank P1&lt;/strong&gt; (USD account).&lt;/p&gt;

&lt;p&gt;2️⃣ &lt;strong&gt;PayPal initiates currency conversion&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
→ It works with its forex partner (&lt;strong&gt;Bank E&lt;/strong&gt;) and sends the $100 to its USD account in &lt;strong&gt;Bank E&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;3️⃣ &lt;strong&gt;Currency exchange takes place&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
→ Bank E's &lt;strong&gt;funding pool&lt;/strong&gt; exchanges $100 USD for &lt;strong&gt;€88 EUR&lt;/strong&gt; (rate-dependent).&lt;/p&gt;

&lt;p&gt;4️⃣ &lt;strong&gt;PayPal’s EUR account&lt;/strong&gt; in &lt;strong&gt;Bank E&lt;/strong&gt; is credited with €88 EUR.&lt;/p&gt;

&lt;p&gt;5️⃣ PayPal moves €88 to its &lt;strong&gt;EUR account&lt;/strong&gt; in &lt;strong&gt;Bank P2&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;6️⃣ Finally, &lt;strong&gt;Alice’s account&lt;/strong&gt; in &lt;strong&gt;Bank A&lt;/strong&gt; receives the €88 EUR.&lt;/p&gt;

&lt;p&gt;📌 Result: Seamless user experience — but a &lt;em&gt;multi-layered technical operation&lt;/em&gt;.&lt;/p&gt;
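&lt;p&gt;The arithmetic in step 3️⃣ is simple; the implied rate in this example is 0.88 EUR per USD (an illustrative number — real quotes move constantly, and providers typically add a spread or fee on top, which this sketch omits):&lt;/p&gt;

```python
def convert(amount_usd: float, rate_eur_per_usd: float) -> float:
    """Convert USD to EUR at a given rate, rounded to cents.
    Spread and fees are deliberately left out of this sketch."""
    return round(amount_usd * rate_eur_per_usd, 2)

assert convert(100, 0.88) == 88.0   # Bob's $100 becomes Alice's EUR 88
```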




&lt;h3&gt;
  
  
  🌐 What Powers This Behind the Scenes?
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;Forex Market&lt;/strong&gt; is a massive, hierarchical engine of liquidity:&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Retail Layer&lt;/strong&gt; – Where platforms like PayPal operate&lt;br&gt;&lt;br&gt;
→ They often &lt;strong&gt;pre-purchase currencies&lt;/strong&gt; to avoid delays and reduce volatility.&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Wholesale Market&lt;/strong&gt; – Managed by large banks and forex providers&lt;br&gt;&lt;br&gt;
→ Handles bulk currency swaps and orders from many retail sources.&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Top-Level Participants&lt;/strong&gt; – Major multinational banks&lt;br&gt;&lt;br&gt;
→ These are the ultimate liquidity providers, moving billions daily.&lt;/p&gt;

&lt;p&gt;📈 When liquidity in Bank E's funding pool runs low, it goes upstream:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📤 Sell USD to the &lt;strong&gt;wholesale market&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;💶 Buy EUR in return&lt;/li&gt;
&lt;li&gt;🔁 Those EURs are cycled back to refill the retail layer&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  🧠 Why This Matters
&lt;/h3&gt;

&lt;p&gt;Understanding how FX works helps you see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How &lt;strong&gt;payment gateways&lt;/strong&gt;, &lt;strong&gt;banks&lt;/strong&gt;, and &lt;strong&gt;FX providers&lt;/strong&gt; cooperate
&lt;/li&gt;
&lt;li&gt;Where &lt;strong&gt;delays&lt;/strong&gt;, &lt;strong&gt;fees&lt;/strong&gt;, and &lt;strong&gt;rate differences&lt;/strong&gt; arise
&lt;/li&gt;
&lt;li&gt;Why platforms hedge currency risk using pre-purchased reserves
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;💬 Have you ever experienced a surprise fee or rate while making an international payment?  &lt;/p&gt;

&lt;p&gt;Let’s chat about what might have been happening in the background!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>💾 Storage Systems Demystified: Block vs File vs Object</title>
      <dc:creator>Aditya Pandey</dc:creator>
      <pubDate>Wed, 30 Apr 2025 06:35:55 +0000</pubDate>
      <link>https://dev.to/pandeyaditya0002/storage-systems-demystified-block-vs-file-vs-object-3j28</link>
      <guid>https://dev.to/pandeyaditya0002/storage-systems-demystified-block-vs-file-vs-object-3j28</guid>
      <description>&lt;p&gt;Every modern application relies on some form of &lt;strong&gt;data storage&lt;/strong&gt; — but not all storage systems are created equal.&lt;br&gt;&lt;br&gt;
In this post, let’s break down the &lt;strong&gt;three foundational types of storage systems&lt;/strong&gt; you’ll come across in backend architecture, cloud platforms, and infrastructure design.&lt;/p&gt;




&lt;h3&gt;
  
  
  🔹 1️⃣ Block Storage: The Performance Powerhouse
&lt;/h3&gt;

&lt;p&gt;🗓️ &lt;strong&gt;Introduced in the 1960s&lt;/strong&gt;, block storage is still the foundation of most modern storage tech.  &lt;/p&gt;

&lt;p&gt;📦 It presents raw data blocks to the operating system. These blocks can then be formatted into a file system or managed directly by high-performance applications (like databases or virtual machines).&lt;/p&gt;

&lt;p&gt;🔌 &lt;strong&gt;Access types&lt;/strong&gt;:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Directly attached (e.g. HDDs, SSDs)
&lt;/li&gt;
&lt;li&gt;Network-attached via protocols like &lt;strong&gt;iSCSI&lt;/strong&gt; or &lt;strong&gt;Fibre Channel&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🔐 &lt;strong&gt;Key traits&lt;/strong&gt;:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extremely flexible and fast
&lt;/li&gt;
&lt;li&gt;Managed by one server at a time (not shared)
&lt;/li&gt;
&lt;li&gt;Ideal for apps requiring fine-tuned performance control
&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  🔹 2️⃣ File Storage: Familiar &amp;amp; User-Friendly
&lt;/h3&gt;

&lt;p&gt;📁 Built on top of block storage, file storage organizes data as &lt;strong&gt;files within hierarchical directories&lt;/strong&gt; — just like your computer's file system.&lt;/p&gt;

&lt;p&gt;📡 It’s commonly shared across systems using network protocols like &lt;strong&gt;NFS&lt;/strong&gt; (Linux) or &lt;strong&gt;SMB/CIFS&lt;/strong&gt; (Windows).&lt;/p&gt;

&lt;p&gt;🧩 &lt;strong&gt;Use cases&lt;/strong&gt;:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Office file sharing
&lt;/li&gt;
&lt;li&gt;User directories
&lt;/li&gt;
&lt;li&gt;General-purpose collaboration storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;💡 &lt;strong&gt;Why it matters&lt;/strong&gt;: File storage abstracts away block-level complexity, making it easy for multiple users/systems to interact with data.&lt;/p&gt;




&lt;h3&gt;
  
  
  🔹 3️⃣ Object Storage: Scalable &amp;amp; Durable at Scale
&lt;/h3&gt;

&lt;p&gt;☁️ Popularized by the cloud, object storage was built for &lt;strong&gt;scale, resilience, and cost-efficiency&lt;/strong&gt;, not performance.&lt;/p&gt;

&lt;p&gt;📦 It stores data as &lt;strong&gt;objects&lt;/strong&gt; in a flat namespace (no folders or hierarchies), each with its own metadata and unique ID.&lt;/p&gt;

&lt;p&gt;🌐 Access is API-driven, usually via REST — ideal for cloud-native applications.&lt;/p&gt;
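&lt;p&gt;A toy in-memory model of that design — a flat namespace keyed by unique IDs, with metadata attached to each object (real services like S3 expose the same shape over a REST API; the &lt;code&gt;put_object&lt;/code&gt;/&lt;code&gt;get_object&lt;/code&gt; names here are just illustrative):&lt;/p&gt;

```python
import uuid

store = {}   # flat namespace: object id maps to data plus metadata, no folders

def put_object(data: bytes, metadata: dict) -> str:
    """Store an object and return its unique id."""
    object_id = str(uuid.uuid4())
    store[object_id] = {"data": data, "metadata": metadata}
    return object_id

def get_object(object_id: str) -> bytes:
    return store[object_id]["data"]

oid = put_object(b"cat.jpg bytes", {"content-type": "image/jpeg"})
assert get_object(oid) == b"cat.jpg bytes"
assert store[oid]["metadata"]["content-type"] == "image/jpeg"
```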

&lt;p&gt;🛠️ &lt;strong&gt;Use cases&lt;/strong&gt;:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backup &amp;amp; archival
&lt;/li&gt;
&lt;li&gt;Media storage
&lt;/li&gt;
&lt;li&gt;Cold data retention
&lt;/li&gt;
&lt;li&gt;Static content hosting (e.g. images, videos)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⚖️ &lt;strong&gt;Tradeoffs&lt;/strong&gt;: Lower performance, but unmatched scalability and reliability (think AWS S3, Azure Blob, GCP Cloud Storage).&lt;/p&gt;




&lt;h3&gt;
  
  
  🧠 TL;DR: When to Use What?
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Storage Type&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;Example Tools&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🧱 Block&lt;/td&gt;
&lt;td&gt;High-performance apps (DBs, VMs)&lt;/td&gt;
&lt;td&gt;SSD, iSCSI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;📁 File&lt;/td&gt;
&lt;td&gt;Shared files &amp;amp; collaboration&lt;/td&gt;
&lt;td&gt;NFS, SMB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🌐 Object&lt;/td&gt;
&lt;td&gt;Scalable cold data storage&lt;/td&gt;
&lt;td&gt;S3, Azure Blob&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;💬 &lt;strong&gt;Which storage type is your team using the most right now?&lt;/strong&gt; Have you encountered tradeoffs or surprises when scaling?&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
