<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Lokeswaran Aruljothi</title>
    <description>The latest articles on DEV Community by Lokeswaran Aruljothi (@lokeswaran-aj).</description>
    <link>https://dev.to/lokeswaran-aj</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F693549%2F73f73598-d499-4b2b-bef5-4b1454c6ebd4.jpeg</url>
      <title>DEV Community: Lokeswaran Aruljothi</title>
      <link>https://dev.to/lokeswaran-aj</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lokeswaran-aj"/>
    <language>en</language>
    <item>
      <title>How Large Language Models Work: A Simple Overview for Beginners</title>
      <dc:creator>Lokeswaran Aruljothi</dc:creator>
      <pubDate>Fri, 02 Jan 2026 15:03:05 +0000</pubDate>
      <link>https://dev.to/lokeswaran-aj/how-large-language-models-work-a-simple-overview-for-beginners-1558</link>
      <guid>https://dev.to/lokeswaran-aj/how-large-language-models-work-a-simple-overview-for-beginners-1558</guid>
      <description>&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;p&gt;When you give text to an LLM, it first splits it into tokens using &lt;strong&gt;tokenization&lt;/strong&gt;, and those tokens are converted into &lt;strong&gt;embeddings&lt;/strong&gt;. Each token compares itself with other tokens using &lt;strong&gt;attention&lt;/strong&gt; to build meaningful context. The model then generates raw scores for all possible next tokens and converts them into a &lt;strong&gt;probability distribution&lt;/strong&gt;. Finally, &lt;strong&gt;sampling&lt;/strong&gt; selects the next token. This &lt;strong&gt;autoregressive loop&lt;/strong&gt; continues until the LLM produces a complete response.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is an LLM?
&lt;/h3&gt;

&lt;p&gt;A large language model is, at its core, a next-token prediction system. Given some input text, it predicts the most likely next token, appends it to the input, and repeats this process in a loop to generate a response. At each step, the model effectively asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What is the most probable next token that will follow this input?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;An LLM does not understand text or language in the way humans do. The text you provide is first converted into numbers, and the model operates entirely on those numerical representations. Rather than understanding meaning, the model learns statistical patterns in language and uses those patterns to predict what token is likely to come next.&lt;/p&gt;

&lt;p&gt;Since an LLM operates only on numbers, the text we provide cannot be used directly. The first step is to break text into smaller units called tokens, which act as the building blocks for everything that follows. This process is known as &lt;strong&gt;tokenization&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Tokenization: Breaking text into model-friendly pieces
&lt;/h3&gt;

&lt;p&gt;Each language model has a fixed vocabulary, which is a list of tokens it knows how to work with. Every token is mapped to a numeric value called a token ID. Tokens can represent full words, parts of words, punctuation, whitespace, or even common character sequences. Internally, the model never sees raw text; it sees only these token IDs.&lt;/p&gt;

&lt;p&gt;When text is passed to the model, it is split into tokens based on the tokenizer and vocabulary used by that specific model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9l0rf12wv67z0r02imv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9l0rf12wv67z0r02imv.png" alt="Tokenization" width="800" height="569"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, in the image above, the sentence is tokenized using the &lt;a href="https://tiktokenizer.vercel.app/" rel="noopener noreferrer"&gt;GPT‑4o tokenizer&lt;/a&gt;. Notice how the word “&lt;em&gt;microservices&lt;/em&gt;” is split into multiple tokens rather than treated as a single word. This happens because tokenizers often use subword units to efficiently handle large and diverse vocabularies.&lt;/p&gt;

&lt;p&gt;Because different models use different tokenizers and vocabularies, the same sentence can be split into tokens differently across models. This is why token counts vary and why the same input may consume more or fewer tokens depending on the model being used.&lt;/p&gt;
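&lt;p&gt;To make this concrete, here is a minimal sketch of greedy longest-match subword tokenization. The vocabulary below is invented purely for illustration; real tokenizers such as GPT‑4o's are learned from data with algorithms like byte pair encoding and have vocabularies of tens of thousands of entries.&lt;/p&gt;

```python
# Toy greedy longest-match subword tokenizer (illustration only).
# This vocabulary is made up; real vocabularies are learned, not hand-written.
VOCAB = ["un", "believ", "able", "micro", "service", "s", " "]
TOKEN_IDS = {tok: i for i, tok in enumerate(VOCAB)}

def tokenize(text, vocab=frozenset(VOCAB)):
    tokens = []
    i = 0
    while i != len(text):
        # try the longest matching piece first, then progressively shorter ones
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # unknown character: fall back to a single-character token
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("microservices"))  # ['micro', 'service', 's']
```

&lt;p&gt;Mapping each piece through &lt;code&gt;TOKEN_IDS&lt;/code&gt; then yields the numeric token IDs the model actually consumes.&lt;/p&gt;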

&lt;p&gt;On their own, tokens are still just numbers with no inherent meaning. To represent relationships and semantics, these tokens are next converted into &lt;strong&gt;embeddings&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Embeddings: Turning tokens into vectors
&lt;/h3&gt;

&lt;p&gt;The tokenizer converts input text into tokens and maps them to token IDs. However, these token IDs are just numbers and carry no inherent meaning. The model cannot reason or operate using these raw IDs alone. This is where embeddings come into the picture.&lt;/p&gt;

&lt;p&gt;An embedding is a vector representation of a token. Each token ID is mapped to a dense vector that captures semantic information. Tokens with similar meanings tend to have similar embeddings, which allows the model to reason about language in a more meaningful way.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhz7uisrlq8rlirztykcw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhz7uisrlq8rlirztykcw.png" alt="Vector embedding" width="800" height="756"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The image above shows a simplified visualization of embeddings. Words like “king” and “prince” appear close to each other, as do “father” and “son”, while “mother” and “daughter” form a separate group. This illustrates how embeddings can capture semantic similarities and relationships. In reality, embeddings exist in a multi-dimensional space, not just two or three dimensions, and the axes shown here are only for intuition.&lt;/p&gt;

&lt;p&gt;You can think of an embedding as a point in a multi-dimensional space. Tokenization gives a token ID, and that ID is used to look up the corresponding embedding vector. At this stage, each token is represented independently. Tokens do not yet interact with one another, and no context has been applied.&lt;/p&gt;
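&lt;p&gt;A common way to quantify "similar embeddings" is cosine similarity, which measures how closely two vectors point in the same direction. The sketch below uses tiny invented 3-dimensional vectors purely for intuition; real embeddings have hundreds or thousands of dimensions.&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|); values near 1 mean similar direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented toy embeddings, just to illustrate the idea
embeddings = {
    "king":     [0.90, 0.80, 0.10],
    "prince":   [0.85, 0.75, 0.20],
    "daughter": [0.10, 0.20, 0.90],
}

print(cosine_similarity(embeddings["king"], embeddings["prince"]))    # high
print(cosine_similarity(embeddings["king"], embeddings["daughter"]))  # low
```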

&lt;p&gt;Along with token embeddings, positional information is added so the model knows the order of tokens in the sequence. This ensures that the model can distinguish between different word orders, such as “the sky is blue” and “blue is the sky”.&lt;/p&gt;
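&lt;p&gt;Positional information can be injected in several ways. One classic approach, from the original Transformer paper, adds fixed sinusoidal encodings to the token embeddings; many newer models use learned or rotary position embeddings instead. A minimal sketch of the sinusoidal variant:&lt;/p&gt;

```python
import math

def positional_encoding(pos, dim):
    # Sinusoidal scheme: even indices use sine, odd indices use cosine,
    # at geometrically decreasing frequencies across the dimensions.
    pe = []
    for i in range(dim):
        angle = pos / (10000 ** ((2 * (i // 2)) / dim))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

# Each position gets a distinct pattern, so "the sky is blue" and
# "blue is the sky" produce different inputs to the model.
print(positional_encoding(0, 8))
print(positional_encoding(1, 8))
```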

&lt;p&gt;At this point, tokens have semantic meaning, but they still do not understand the full context of the sentence. Each token is aware only of itself. To understand language, tokens must relate to one another and determine which other tokens are relevant. This is achieved using &lt;strong&gt;attention&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Attention: Relating tokens to each other
&lt;/h3&gt;

&lt;p&gt;Attention allows each token to look at other tokens in the input and decide how important they are for understanding the current context. Instead of treating all tokens equally, the model learns which tokens should influence each other and by how much. This process enables the model to build context-aware representations of each token.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fycx09kwt3eelpwr3mfvt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fycx09kwt3eelpwr3mfvt.png" alt="each token sends connections to other tokens. Stronger connections indicate that one token considers another token more relevant when building its contextual meaning. This process happens for every token in the input sequence" width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the visualization above, each token sends connections to other tokens. Stronger connections indicate that one token considers another token more relevant when building its contextual meaning. This process happens for every token in the input sequence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Queries, Keys, and Values
&lt;/h3&gt;

&lt;p&gt;To compute attention, each token is transformed into three different representations called &lt;strong&gt;query&lt;/strong&gt;, &lt;strong&gt;key&lt;/strong&gt;, and &lt;strong&gt;value&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &lt;strong&gt;query&lt;/strong&gt; represents what the current token is looking for.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;strong&gt;key&lt;/strong&gt; represents what other tokens offer.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;strong&gt;value&lt;/strong&gt; represents the information carried by those tokens.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The query of one token is compared with the keys of all other tokens to measure relevance. Based on this relevance, the corresponding values are combined with different strengths to produce a new, context-aware representation for the token. This is the core idea behind attention. The goal is not to perform a lookup, but to determine which tokens matter most in a given context.&lt;/p&gt;
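&lt;p&gt;The query/key/value idea can be written out directly. Below is a bare-bones scaled dot-product attention over tiny hand-written vectors; it omits the learned projection matrices, masking, and batching that real models add:&lt;/p&gt;

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    d = len(queries[0])
    outputs = []
    for q in queries:
        # 1. compare this token's query with every key (dot product)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        # 2. turn the scores into attention weights that sum to 1
        weights = softmax(scores)
        # 3. blend the values according to those weights
        context = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(context)
    return outputs

x = [[1.0, 0.0], [0.0, 1.0]]  # two toy token vectors
print(scaled_dot_product_attention(x, x, x))
```

&lt;p&gt;Each output row is a weighted mix of the value vectors, which is exactly the "context-aware representation" described above.&lt;/p&gt;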

&lt;h3&gt;
  
  
  Multi-head self-attention
&lt;/h3&gt;

&lt;p&gt;This attention process does not happen just once. Instead, it runs multiple times in parallel using different attention heads. Each head can focus on different patterns in the sentence, such as grammatical structure, relationships between words, or long-range dependencies. The outputs from all heads are then combined to form a richer representation of each token. This mechanism is known as &lt;strong&gt;multi-head self-attention&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Role of the feedforward network
&lt;/h3&gt;

&lt;p&gt;After attention mixes information across tokens, the resulting representations pass through a small feedforward neural network, often called a multi-layer perceptron. This step processes each token independently and helps refine its representation further. While attention handles relationships between tokens, this network helps transform and polish the information within each token.&lt;/p&gt;

&lt;p&gt;At the end of this stage, each token is no longer isolated. It now carries information about its own meaning as well as its relationship to other tokens in the sentence. These context-aware token representations (new embeddings) are then used to compute scores for all possible next tokens in the vocabulary, which leads into &lt;strong&gt;probability estimation&lt;/strong&gt; and token selection.&lt;/p&gt;




&lt;h3&gt;
  
  
  From representations to probabilities
&lt;/h3&gt;

&lt;p&gt;Now that each token has a context-aware representation produced by attention, the model uses these representations to decide what should come next. For the current position, the model generates a score for every possible token in its vocabulary. These raw scores are called logits.&lt;/p&gt;

&lt;p&gt;Logits represent how likely each token is to be the next token, relative to the others. A higher logit means the model considers that token more likely. However, logits are not probabilities. They are not normalized, do not fall within a fixed range, and are difficult to interpret directly.&lt;/p&gt;

&lt;p&gt;To convert these logits into something meaningful, the model applies a function called &lt;a href="https://en.wikipedia.org/wiki/Softmax_function" rel="noopener noreferrer"&gt;softmax&lt;/a&gt; that transforms them into a probability distribution. After this step, each possible token is assigned a probability, and all probabilities sum to 1. This distribution represents how likely the model believes each token is to be chosen next.&lt;/p&gt;
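&lt;p&gt;Softmax itself is only a few lines. Here is a sketch applied to invented logits for four candidate tokens:&lt;/p&gt;

```python
import math

def softmax(logits):
    m = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.1, 2.3, 1.0, -0.5]  # invented raw scores for four tokens
probs = softmax(logits)
# probs is a valid distribution: every entry is positive, they sum to 1,
# and the ordering of the logits is preserved
print(probs)
```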

&lt;p&gt;At this point, the model knows the likelihood of every possible next token. The remaining question is how to select one token from this distribution, which is handled by the &lt;strong&gt;sampling&lt;/strong&gt; step.&lt;/p&gt;




&lt;h3&gt;
  
  
  Sampling: choosing the next token
&lt;/h3&gt;

&lt;p&gt;After the model computes a probability distribution for the next token, it must select one token to generate. If the model always chooses the token with the highest probability, it will produce the same output every time. This often leads to repetitive and less natural output. &lt;strong&gt;Sampling&lt;/strong&gt; is used to introduce controlled randomness into the generation process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Greedy decoding (baseline)
&lt;/h3&gt;

&lt;p&gt;In greedy decoding, the model always picks the token with the highest probability. This approach is fully deterministic and produces consistent outputs. While it can be useful for tasks that require precision, it is generally not suitable for open-ended tasks such as creative writing or conversation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Temperature
&lt;/h3&gt;

&lt;p&gt;Temperature controls how sharp or flat the probability distribution is. It does not change which tokens are possible, only how likely they are relative to one another.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A &lt;strong&gt;lower temperature&lt;/strong&gt; makes the distribution sharper, causing high-probability tokens to dominate. This results in more confident and less random output.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A &lt;strong&gt;higher temperature&lt;/strong&gt; flattens the distribution, allowing lower-probability tokens to be selected more often. This increases diversity and randomness.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj7jdmwqeickpsgvl8kck.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj7jdmwqeickpsgvl8kck.png" alt="A lower temperature makes the distribution sharper, causing high-probability tokens to dominate. This results in more confident and less random output." width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnh1wjnvxrtr5q04jgdtt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnh1wjnvxrtr5q04jgdtt.png" alt="A higher temperature flattens the distribution, allowing lower-probability tokens to be selected more often. This increases diversity and randomness." width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;
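&lt;p&gt;Numerically, temperature simply divides the logits before softmax is applied. A minimal sketch with invented logits:&lt;/p&gt;

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, temperature):
    # temperature under 1 sharpens the distribution, over 1 flattens it
    return softmax([x / temperature for x in logits])

logits = [2.0, 1.0, 0.1]
sharp = apply_temperature(logits, 0.5)  # top token dominates more
flat = apply_temperature(logits, 2.0)   # probability spreads out
print(sharp)
print(flat)
```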

&lt;h3&gt;
  
  
  Top-k sampling
&lt;/h3&gt;

&lt;p&gt;Top-k sampling further restricts the choice of the next token by considering only the &lt;strong&gt;k most probable tokens&lt;/strong&gt;. All other tokens are ignored. This prevents extremely unlikely tokens from being selected while still allowing some diversity among the most likely options.&lt;/p&gt;
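&lt;p&gt;A sketch of the filtering step, using an invented distribution over a 4-token vocabulary (real vocabularies have tens of thousands of entries):&lt;/p&gt;

```python
def top_k_filter(probs, k):
    # keep only the k most probable token indices and renormalize
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = ranked[:k]
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

probs = [0.5, 0.3, 0.15, 0.05]  # invented probabilities
print(top_k_filter(probs, 2))   # only tokens 0 and 1 survive
```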

&lt;h3&gt;
  
  
  Top-p (nucleus sampling)
&lt;/h3&gt;

&lt;p&gt;Top-p sampling, also known as nucleus sampling, selects the smallest set of tokens whose &lt;strong&gt;cumulative probability exceeds a threshold p&lt;/strong&gt;. Unlike Top-k, this set can change dynamically depending on how confident the model is. This makes Top-p more adaptive and often better at balancing focus and diversity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvhrsy5yrgymt32ebsdem.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvhrsy5yrgymt32ebsdem.png" alt="In this example, only the most likely tokens whose probabilities sum to the chosen threshold are considered for sampling." width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this example, only the most likely tokens whose probabilities sum to the &lt;strong&gt;0.8&lt;/strong&gt; threshold are considered for sampling.&lt;/p&gt;
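&lt;p&gt;Nucleus sampling can be sketched the same way: walk the tokens from most to least probable, keep them until the cumulative probability reaches the threshold, then renormalize. The distribution below is invented, with the same 0.8 threshold:&lt;/p&gt;

```python
def top_p_filter(probs, p):
    # accumulate tokens in order of probability until the threshold p is reached
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

probs = [0.5, 0.3, 0.15, 0.05]      # invented probabilities
print(top_p_filter(probs, 0.8))     # tokens 0 and 1 already cover 0.8
```

&lt;p&gt;Unlike Top-k, the number of surviving tokens changes with the shape of the distribution: a confident model may keep one token, an uncertain one may keep many.&lt;/p&gt;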

&lt;h3&gt;
  
  
  Putting it together
&lt;/h3&gt;

&lt;p&gt;Temperature, Top-k, and Top-p are often used together. They control &lt;strong&gt;how randomness is applied&lt;/strong&gt;, not &lt;strong&gt;what the model knows&lt;/strong&gt;. The underlying probabilities remain the same, but different sampling settings lead to different generation behavior.&lt;/p&gt;

&lt;p&gt;Once a token is selected, it is appended to the input, and the entire process repeats. This is how LLMs generate text one token at a time until a complete response is produced.&lt;/p&gt;
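&lt;p&gt;The whole autoregressive loop can be mimicked with a toy "model" that maps the last token to hand-written next-token probabilities. Everything here, the vocabulary and the numbers, is invented for illustration; a real LLM conditions on the entire context via attention, not a lookup on the last token:&lt;/p&gt;

```python
import random

# Hand-written next-token distributions (a real model computes these
# from the full context, as described in the sections above)
TOY_MODEL = {
    "the": {"sky": 0.6, "cat": 0.4},
    "sky": {"is": 1.0},
    "cat": {"is": 1.0},
    "is":  {"blue": 0.7, "clear": 0.3},
}

def generate(prompt_token, max_new_tokens=4, seed=0):
    rng = random.Random(seed)
    output = [prompt_token]
    for _ in range(max_new_tokens):
        dist = TOY_MODEL.get(output[-1])
        if dist is None:
            break  # no known continuation: stop, like an end-of-sequence token
        candidates = list(dist)
        weights = [dist[c] for c in candidates]
        # sampling step: pick the next token according to its probability
        output.append(rng.choices(candidates, weights=weights)[0])
    return " ".join(output)

print(generate("the"))
```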




&lt;h3&gt;
  
  
  A simple mental model
&lt;/h3&gt;

&lt;p&gt;At its core, a large language model is a system designed to predict what comes next. Everything you have seen in this blog, from tokenization, embeddings, attention, probabilities, to sampling, exists to support that single goal. The apparent intelligence of an LLM emerges not from understanding language like a human, but from repeatedly applying this prediction process at scale.&lt;/p&gt;

&lt;p&gt;Once you internalize this mental model, many behaviors of LLMs start to make sense: why phrasing matters, why responses can vary, and why models sometimes sound confident yet incorrect. They are not reasoning about truth, but generating the most likely continuation based on patterns learned from data.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>openai</category>
    </item>
    <item>
      <title>RAG in production</title>
      <dc:creator>Lokeswaran Aruljothi</dc:creator>
      <pubDate>Mon, 09 Dec 2024 17:24:18 +0000</pubDate>
      <link>https://dev.to/lokeswaran-aj/rag-in-production-395o</link>
      <guid>https://dev.to/lokeswaran-aj/rag-in-production-395o</guid>
      <description></description>
      <category>rag</category>
    </item>
    <item>
      <title>Finally built a portfolio website</title>
      <dc:creator>Lokeswaran Aruljothi</dc:creator>
      <pubDate>Thu, 15 Aug 2024 14:15:06 +0000</pubDate>
      <link>https://dev.to/lokeswaran-aj/finally-built-a-portfolio-website-36fg</link>
      <guid>https://dev.to/lokeswaran-aj/finally-built-a-portfolio-website-36fg</guid>
      <description>&lt;h2&gt;
  
  
  Hello world
&lt;/h2&gt;

&lt;p&gt;As a software engineer, I have always wanted to build a portfolio website. After multiple incomplete attempts, I finally built one: &lt;a href="https://lokeswaran.vercel.app/" rel="noopener noreferrer"&gt;https://lokeswaran.vercel.app/&lt;/a&gt;. This time, I took a different approach and embraced minimal design, drawing inspiration from the portfolios of &lt;a href="https://leerob.io/" rel="noopener noreferrer"&gt;@leerob&lt;/a&gt; and &lt;a href="https://ibelick.com/" rel="noopener noreferrer"&gt;@ibelick&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Minimal Design?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Focus:&lt;/strong&gt; Prioritizing content and functionality over flashy elements.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clarity:&lt;/strong&gt; A simple interface that guides users intuitively.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Completion:&lt;/strong&gt; Stripping away the non-essentials led to a project that truly reflects my skills and style.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Here's what you'll find:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;My Journey: The highs and lows that brought me here.&lt;/li&gt;
&lt;li&gt;Projects: Showcasing problem-solving skills.&lt;/li&gt;
&lt;li&gt;Blog: Thoughts on tech trends and software engineering.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Techs I used to build this
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://nextjs.org/" rel="noopener noreferrer"&gt;Next.js&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://tailwindcss.com/" rel="noopener noreferrer"&gt;TailwindCSS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/hashicorp/next-mdx-remote" rel="noopener noreferrer"&gt;next-mdx-remote&lt;/a&gt; for mdx blogging&lt;/li&gt;
&lt;li&gt;&lt;a href="https://buttons.ibelick.com/" rel="noopener noreferrer"&gt;Ibelick button&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://rehype-pretty.pages.dev/" rel="noopener noreferrer"&gt;rehype-pretty-code&lt;/a&gt; plugin for syntax highlighting in the blog&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://vercel.com/" rel="noopener noreferrer"&gt;Vercel&lt;/a&gt; for deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Things I learnt while building this
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Generating dynamic metadata for my blog posts for SEO&lt;/li&gt;
&lt;li&gt;Creating Open Graph images and generating dynamic OG images in Next.js&lt;/li&gt;
&lt;li&gt;Using &lt;a href="https://github.com/hashicorp/next-mdx-remote" rel="noopener noreferrer"&gt;next-mdx-remote&lt;/a&gt; to render MDX files&lt;/li&gt;
&lt;li&gt;Using the &lt;a href="https://rehype-pretty.pages.dev/" rel="noopener noreferrer"&gt;rehype-pretty-code&lt;/a&gt; plugin to render code snippets with the one-dark theme and syntax highlighting in the blog&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're facing struggles with building your portfolio, my advice is simple: &lt;strong&gt;KISS&lt;/strong&gt;(Keep It Simple, Stupid).&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Graph Problems</title>
      <dc:creator>Lokeswaran Aruljothi</dc:creator>
      <pubDate>Sun, 31 Dec 2023 08:10:31 +0000</pubDate>
      <link>https://dev.to/lokeswaran-aj/graph-problems-23ed</link>
      <guid>https://dev.to/lokeswaran-aj/graph-problems-23ed</guid>
      <description>&lt;p&gt;Graph problems play a pivotal role in various fields, from computer science to logistics. Understanding and solving these problems require a diverse set of algorithms. In this blog post, we'll explore some fundamental graph problems, shedding light on their significance and presenting solutions that cater to different scenarios.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shortest Path Problem:
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gIKEv4q---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6edg9m3ah8lvm919vhi2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gIKEv4q---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6edg9m3ah8lvm919vhi2.png" alt="Shortest Path Problem" width="800" height="274"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The shortest path problem revolves around finding the most efficient route between two nodes in a graph. Various methods can tackle this problem, depending on the nature of the graph:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;BFS (Unweighted Graph):&lt;/strong&gt; Ideal for unweighted graphs, BFS systematically explores neighbors to discover the shortest path.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dijkstra's Algorithm:&lt;/strong&gt; Efficient for graphs with non-negative weights, Dijkstra's algorithm optimally finds the shortest path.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bellman Ford:&lt;/strong&gt; A versatile algorithm that accommodates graphs with negative weights.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Floyd-Warshall:&lt;/strong&gt; Suitable for all-pairs shortest path problems, handling both positive and negative weights.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A*:&lt;/strong&gt; Combines aspects of Dijkstra's and greedy algorithms, optimizing pathfinding with heuristics.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
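&lt;p&gt;For the unweighted case, BFS can be sketched in a few lines. The graph and node names below are hypothetical:&lt;/p&gt;

```python
from collections import deque

def bfs_shortest_path(graph, start, goal):
    # graph: dict mapping each node to a list of its neighbors (unweighted)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path  # BFS explores layer by layer, so this path is shortest
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # goal unreachable from start

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(bfs_shortest_path(graph, "A", "D"))  # ['A', 'B', 'D']
```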

&lt;h2&gt;
  
  
  Connectivity:
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--w-UitOrL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8ubysbcd1tue3idvca6z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--w-UitOrL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8ubysbcd1tue3idvca6z.png" alt="Connectivity" width="800" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Determining connectivity involves establishing whether a path exists between two nodes. Here are methods tailored for different scenarios:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Union Find Data Structure:&lt;/strong&gt; Efficiently detects connectivity in disjoint sets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DFS (Depth-First Search):&lt;/strong&gt; Systematically explores the graph's depth to ascertain connectivity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;BFS (Breadth-First Search):&lt;/strong&gt; Unveils connected nodes layer by layer, aiding in connectivity analysis.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
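&lt;p&gt;The union-find structure is compact enough to sketch in full. This version uses path halving, one of several common compression variants:&lt;/p&gt;

```python
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        # follow parent pointers to the root, halving the path as we go
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_a] = root_b

    def connected(self, a, b):
        return self.find(a) == self.find(b)

uf = UnionFind(5)
uf.union(0, 1)
uf.union(1, 2)
print(uf.connected(0, 2))  # True
print(uf.connected(0, 3))  # False
```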

&lt;h2&gt;
  
  
  Negative Cycle:
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CZFtRJyQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ba9po612cxy3zpkyc7yx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CZFtRJyQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ba9po612cxy3zpkyc7yx.png" alt="Negative Cycle" width="800" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Identifying negative cycles in weighted directed graphs is crucial. The following algorithms are adept at handling this task:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bellman Ford:&lt;/strong&gt; Detects negative cycles and their locations in the graph.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Floyd-Warshall:&lt;/strong&gt; Extensively applicable, it identifies negative cycles in the graph.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Strongly Connected Components:
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fiWz9pXM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rdmmxafbd70182mx8syc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fiWz9pXM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rdmmxafbd70182mx8syc.png" alt="Strongly Connected Components" width="800" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Strongly connected components (SCC) are self-contained graph portions in directed graphs. The following algorithms excel in identifying SCC:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tarjan's Algorithm:&lt;/strong&gt; Effectively identifies SCC in linear time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kosaraju's Algorithm:&lt;/strong&gt; Divides the graph into SCC, offering a comprehensive understanding.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Traveling Salesman Problem:
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JwfahKRc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yny6p0jc5v6wcrjpq9qc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JwfahKRc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yny6p0jc5v6wcrjpq9qc.png" alt="Travel Salesman Problem" width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Traveling Salesman Problem (TSP) challenges us to find the shortest route that visits every city exactly once and returns to the starting city. As an NP-hard problem, TSP requires sophisticated approaches:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Held-Karp with DP (Dynamic Programming):&lt;/strong&gt; Optimally solves smaller subproblems, building up to the complete solution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Branch and Bound:&lt;/strong&gt; Systematically explores potential solutions, pruning branches for efficiency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Other Approximation Algorithms like Ant Colony Optimization:&lt;/strong&gt; Leverages heuristic methods for practical solutions.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Bridges and Articulation Points:
&lt;/h2&gt;

&lt;p&gt;Bridges and articulation points highlight vulnerabilities in a graph:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FV3aSdrt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lxascbvqstnr430l3gob.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FV3aSdrt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lxascbvqstnr430l3gob.png" alt="Bridges and Articulation Points" width="800" height="522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bridges:&lt;/strong&gt; Removing them increases the number of connected components, exposing critical edges.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HLOFHVao--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lnaxadlpbb5kqw1p9t3m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HLOFHVao--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lnaxadlpbb5kqw1p9t3m.png" alt="Articulation Points" width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Articulation Points:&lt;/strong&gt; Their removal amplifies the number of connected components, signaling key nodes in the graph's structure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These indicators often reveal weak points, bottlenecks, or vulnerabilities in a graph.&lt;/p&gt;

&lt;h2&gt;
  
  
  Minimum Spanning Tree:
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_y1dUcLA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6ugfzdjlqnugf6fppwfh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_y1dUcLA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6ugfzdjlqnugf6fppwfh.png" alt="Minimum Spanning Tree" width="800" height="531"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A Minimum Spanning Tree (MST) connects all vertices with the least total edge weight and no cycles. Algorithms for finding MST include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kruskal's Algorithm:&lt;/strong&gt; Sorts edges by weight and greedily adds the smallest edge that does not create a cycle, typically using a union-find structure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prim's Algorithm:&lt;/strong&gt; Grows the MST from a starting vertex, at each step adding the lightest edge that connects a new vertex to the tree.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Boruvka's Algorithm:&lt;/strong&gt; Repeatedly adds the cheapest outgoing edge of every component, merging components until a single spanning tree remains.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
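&lt;p&gt;To make the first of these concrete, here is a minimal sketch of Kruskal's algorithm with union-find (path compression); the &lt;code&gt;(weight, u, v)&lt;/code&gt; edge-list format is an assumption for illustration:&lt;/p&gt;

```python
# A minimal sketch of Kruskal's algorithm, assuming edges are given
# as (weight, u, v) tuples over vertices 0..num_vertices-1.
def kruskal(num_vertices, edges):
    parent = list(range(num_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    mst_weight, mst_edges = 0, []
    for weight, u, v in sorted(edges):  # smallest weights first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # skip edges that would form a cycle
            parent[root_u] = root_v
            mst_weight += weight
            mst_edges.append((u, v, weight))
    return mst_weight, mst_edges
```

For example, `kruskal(4, [(1, 0, 1), (2, 1, 2), (3, 2, 3), (4, 0, 3)])` keeps the three cheapest edges and skips the weight-4 edge that would close a cycle.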

&lt;h2&gt;
  
  
  Network Flow (Max Flow):
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lIXlP-k2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/plrtwgmb9gp84mqmsxgw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lIXlP-k2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/plrtwgmb9gp84mqmsxgw.png" alt="Network Flow" width="800" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Determining the maximum flow through a network is crucial in various applications:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ford-Fulkerson Algorithm:&lt;/strong&gt; Repeatedly finds augmenting paths in the residual graph and pushes flow along them until none remain.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Edmonds-Karp Algorithm:&lt;/strong&gt; A specialization of Ford-Fulkerson that uses BFS to find the shortest augmenting path, guaranteeing polynomial running time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dinic's Algorithm:&lt;/strong&gt; Builds level graphs with BFS and pushes blocking flows, making each augmentation phase considerably more efficient.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
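&lt;p&gt;As a sketch of the second approach, here is a compact Edmonds-Karp implementation, assuming capacities are given as a dict-of-dicts adjacency map:&lt;/p&gt;

```python
from collections import deque

# A minimal sketch of Edmonds-Karp (Ford-Fulkerson with BFS), assuming
# capacity is a dict-of-dicts: capacity[u][v] = edge capacity.
def max_flow(capacity, source, sink):
    # Build the residual graph, adding reverse edges with zero capacity.
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0)

    flow = 0
    while True:
        # BFS for the shortest augmenting path in the residual graph.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow  # no augmenting path remains

        # Collect the path edges and find the bottleneck capacity.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)

        # Push flow and update residual capacities in both directions.
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        flow += bottleneck
```

The reverse (residual) edges are what allow the algorithm to "undo" a suboptimal early routing of flow.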

&lt;p&gt;Understanding these graph problems and their solutions equips you with a powerful toolkit for tackling diverse challenges in computer science, optimization, and beyond. Whether you're navigating paths, exploring connectivity, or optimizing flows, these algorithms form the backbone of computational problem-solving.&lt;/p&gt;

&lt;p&gt;Connect with me on &lt;a href="https://www.linkedin.com/in/lokeswaran-aj/"&gt;LinkedIn&lt;/a&gt; &amp;amp; &lt;a href="https://twitter.com/lokeswaran_aj"&gt;X&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>javascript</category>
      <category>leetcode</category>
      <category>programming</category>
    </item>
    <item>
      <title>Backtracking made simple</title>
      <dc:creator>Lokeswaran Aruljothi</dc:creator>
      <pubDate>Mon, 25 Dec 2023 12:48:58 +0000</pubDate>
      <link>https://dev.to/lokeswaran-aj/backtracking-made-simple-5bjl</link>
      <guid>https://dev.to/lokeswaran-aj/backtracking-made-simple-5bjl</guid>
      <description>&lt;p&gt;Yesterday, I could not solve a single backtracking problem in &lt;a href="https://leetcode.com/tag/backtracking/"&gt;Leetcode&lt;/a&gt;. But I watched some &lt;a href="https://www.youtube.com/watch?v=Zq4upTEaQyM"&gt;YouTube videos&lt;/a&gt; to understand the algorithm, and today I am able to solve backtracking leetcode medium problems within 10 minutes. In this blog, I will tell you the trick that I learned to solve any backtracking problems and apply the trick to leetcode problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction:
&lt;/h2&gt;

&lt;p&gt;Backtracking is a general algorithm for finding &lt;strong&gt;all (or some)&lt;/strong&gt; solutions to a computational problem that incrementally builds candidates for solutions and abandons a candidate ("backtracks") as soon as it determines that the candidate cannot possibly be completed to a valid solution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vAG-N1xs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pv7fohi3bdvhkfm110hv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vAG-N1xs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pv7fohi3bdvhkfm110hv.jpg" alt="backtracking" width="800" height="501"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  We can easily come up with the backtracking solution by answering the following three questions:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;What is the goal of the problem?&lt;/li&gt;
&lt;li&gt;What are the choices that we can make?&lt;/li&gt;
&lt;li&gt;What is the constraint when making a choice?&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Here's a Python template for a backtracking algorithm:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;backtrack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;problem&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="c1"&gt;# check if the goal is reached
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;is_solution&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;problem&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;process_solution&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;problem&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;next_candidate&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;generate_candidates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;problem&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

        &lt;span class="c1"&gt;# Check the constraint
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;is_valid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next_candidate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;problem&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

            &lt;span class="c1"&gt;# Make a choice
&lt;/span&gt;            &lt;span class="nf"&gt;make_choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;next_candidate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;problem&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Recursively call the backtracking method
&lt;/span&gt;            &lt;span class="nf"&gt;backtrack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next_candidate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;problem&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Undo the choice (backtrack)
&lt;/span&gt;            &lt;span class="nf"&gt;undo_choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;next_candidate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;problem&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Let's solve the &lt;a href="https://leetcode.com/problems/combination-sum/description/"&gt;39. Combination Sum&lt;/a&gt; problem:
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Given an array of &lt;strong&gt;distinct&lt;/strong&gt; integers &lt;code&gt;candidates&lt;/code&gt; and a target integer &lt;code&gt;target&lt;/code&gt;, return a list of all &lt;strong&gt;unique combinations&lt;/strong&gt; of &lt;code&gt;candidates&lt;/code&gt; where the chosen numbers sum to &lt;code&gt;target&lt;/code&gt;. You may return the combinations in &lt;strong&gt;any order&lt;/strong&gt;.&lt;br&gt;
The &lt;strong&gt;same&lt;/strong&gt; number may be chosen from &lt;code&gt;candidates&lt;/code&gt; an &lt;strong&gt;unlimited number of times&lt;/strong&gt;. Two combinations are unique if the frequency of at least one of the chosen numbers is different.&lt;br&gt;
&lt;strong&gt;Example 1:&lt;/strong&gt;&lt;br&gt;
Input: candidates = [2,3,6,7], target = 7&lt;br&gt;
Output: [[2,2,3],[7]]&lt;br&gt;
Explanation:&lt;br&gt;
2 and 3 are candidates, and 2 + 2 + 3 = 7. Note that 2 can be used multiple times.&lt;br&gt;
7 is a candidate, and 7 = 7.&lt;br&gt;
These are the only two combinations.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now, let's answer the above three questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The goal is to find all the combinations of candidates or numbers that sum up to the &lt;code&gt;target&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The choices are all the numbers in the &lt;code&gt;candidates&lt;/code&gt; list. We can use the same number many times.&lt;/li&gt;
&lt;li&gt;The constraint is that if the sum of the current list exceeds the target, we cannot choose that number.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Solution:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Solution&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;combinationSum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
       &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;combinations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

       &lt;span class="c1"&gt;# calling the backtracking helper method
&lt;/span&gt;       &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;backtracking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
       &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;combinations&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;backtracking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;startIndex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;currentSum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;currentCombination&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Check if we reach the goal
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;currentSum&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;combinations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;currentCombination&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;startIndex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
            &lt;span class="n"&gt;currentSum&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

            &lt;span class="c1"&gt;# Check the constraint
&lt;/span&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;currentSum&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;currentSum&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;

            &lt;span class="c1"&gt;# Make a choice and call the backtracking function again
&lt;/span&gt;            &lt;span class="n"&gt;currentCombination&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;backtracking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;currentSum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;currentCombination&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Undo the choice
&lt;/span&gt;            &lt;span class="n"&gt;currentCombination&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;currentSum&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above code, I first initialized the &lt;code&gt;combinations&lt;/code&gt; list to hold the results. Then I called the &lt;code&gt;backtracking&lt;/code&gt; method, which takes 5 arguments:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;candidates - the input candidates list&lt;/li&gt;
&lt;li&gt;target - the target number&lt;/li&gt;
&lt;li&gt;startIndex - the index in &lt;code&gt;candidates&lt;/code&gt; from which choices can be made&lt;/li&gt;
&lt;li&gt;currentSum - the sum of the elements chosen so far&lt;/li&gt;
&lt;li&gt;currentCombination - the current combination list&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In the &lt;code&gt;backtracking&lt;/code&gt; method, I first checked if I had reached the &lt;strong&gt;goal&lt;/strong&gt;, i.e., if the current sum was equal to the target; if so, I added a copy of &lt;code&gt;currentCombination&lt;/code&gt; to the &lt;code&gt;combinations&lt;/code&gt; list. A for loop then runs from &lt;code&gt;startIndex&lt;/code&gt; to the last element of &lt;code&gt;candidates&lt;/code&gt;. For each element, I check the &lt;strong&gt;constraint&lt;/strong&gt;, i.e., if &lt;code&gt;currentSum&lt;/code&gt; exceeds the &lt;code&gt;target&lt;/code&gt;, I undo the addition and skip that choice. Otherwise, I make the &lt;strong&gt;choice&lt;/strong&gt; of adding the element to my &lt;code&gt;currentCombination&lt;/code&gt; list and call the &lt;code&gt;backtracking&lt;/code&gt; method recursively.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Since we can use the same element multiple times, I have passed the current index as the value for &lt;code&gt;startIndex&lt;/code&gt;. If we can't use the same element, then pass &lt;code&gt;index+1&lt;/code&gt; as the value for &lt;code&gt;startIndex&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Finally, I &lt;strong&gt;undo the choice&lt;/strong&gt; by removing the element that was added and subtracting it from the running sum.&lt;/p&gt;
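&lt;p&gt;The same logic can be condensed into a standalone function you can run directly (a sketch equivalent to the class above, with the goal, constraint, choice, and undo steps marked):&lt;/p&gt;

```python
# Condensed standalone version of the combination-sum backtracking.
def combination_sum(candidates, target):
    combinations = []

    def backtrack(start_index, current_sum, current):
        if current_sum == target:              # goal reached
            combinations.append(current.copy())
            return
        for i in range(start_index, len(candidates)):
            next_sum = current_sum + candidates[i]
            if next_sum <= target:             # constraint
                current.append(candidates[i])  # make the choice
                backtrack(i, next_sum, current)  # same i: reuse allowed
                current.pop()                  # undo the choice

    backtrack(0, 0, [])
    return combinations

print(combination_sum([2, 3, 6, 7], 7))  # [[2, 2, 3], [7]]
```

Passing `i` (rather than `i + 1`) into the recursive call is what allows the same number to be reused.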

&lt;h2&gt;
  
  
  Time and Space Complexities:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Time complexity: O(N*2^N)&lt;/li&gt;
&lt;li&gt;Space complexity: O(N)
where N is the length of the &lt;code&gt;candidates&lt;/code&gt; list.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Connect with me on &lt;a href="https://www.linkedin.com/in/lokeswaran-aj/"&gt;LinkedIn&lt;/a&gt; and &lt;a href="https://twitter.com/lokeswaran_aj"&gt;Twitter&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>leetcode</category>
      <category>python</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
