A deep dive into the core mechanics of modern LLMs, explaining the essential concepts that separate a casual user from a true practitioner
LLM Architectures
Its core innovation is the attention mechanism, which allows the model to weigh the importance of different words in the input text when processing and generating language. This architecture is the backbone of most modern LLMs, including models like GPT and BERT. The Transformer is composed of two primary building blocks: Encoders and Decoders. Different models use these blocks in different combinations to achieve their specific capabilities.
Before moving forward let’s make sure the terms are clear, let’s define them:
- Text/Document: The full sequence of words you are working with.
- Token: The smallest unit the model processes. After tokenization, a sentence is broken down into these pieces. A token is often a word (like “They”) or a sub-word (like “ing” in “running”).
- Embedding: A numerical vector that represents the semantic meaning of a token or a sequence of tokens.
Encoders and Decoders
AI model types has different capabilities, i.e. embedding, text generation, text to image generation, etc.
The models of each types have variaty in sizes (number of parameters).
Typically, Encoders are used for embedding, and Decoders are used for generation.
Encoders
What is Embedding?
Model converts a sequence of words to an embedding (a vector representation of words).
The sequence of words is “They sent me a”. This sentence is tokenized (tokenization means a character sequence — typically a sentence — , is broken into small pieces — mostly simple words) into chunks.
Each token and the whole sentence will be embedded, a vector representation is created from them.
What are Vector Embeddings and why are they useful?
A vector embedding is a powerful concept where a word, sentence, or even an entire document is converted into a numerical representation — a list of numbers called a vector. This vector is designed to capture the rich semantic meaning and context of the original text.
Imagine a vast, multi-dimensional space (often with hundreds of dimensions, such as 300 or more). In this space, every concept has a specific location, represented by its vector. The key principle is that semantically similar concepts will have vectors that are close to each other.
It’s not that a single dimension represents a simple, human-readable trait like ‘kindness’. Instead, meaning is encoded in the vector’s overall position and its relationships with other vectors. For example:
- The vectors for “polite” and “courteous” would be located very close together.
- The vectors for “king” and “queen” would also be near each other.
- Furthermore, the model learns complex relationships. The vector relationship between “king” and “queen” is very similar to the relationship between “man” and “woman”.
The primary application of this is enabling similarity search. By storing these embeddings in a specialized vector database, we can find documents or pieces of text that are semantically similar to a user’s query. Instead of just matching keywords, a similarity search finds content that matches the meaning and intent behind the query, leading to much more relevant and intelligent search results.
Decoders
These kind of models take a sequence of words and output the next word. This is based on probability of the vocabulary which model computes.
Important to understand that the, Decoder only produced a single token at a time! We can invoke a decoder to generate as many new tokens as we want.
In another words, to generate a sequence of new tokens first, we need to feed decoder model with initial sequence of tokens (prompt) and invoke the model to produce the next token.
Encoders — Decoders
This kind of models encodes a sequence of words and uses the encoding to output a next word.
Encoders — Decoders models are typically utilized for sequence-to-sequence tasks, like translation.
Traslation workflow is that we send the English tokens to the model, they are gotten by the encoder which embeds tokens and whole sentence. And then the embeddings get passed to the decoder. There you can notice there is a self-referential loops to the decoder. After generationg a token, that token will be passed back to the decoder.
Architectures at a glance
| Task | Encoders | Decoders | Encoder-decoder | 
|---|---|---|---|
| Embedding text | Yes | No | No | 
| Abstractive QA | No | Yes | Yes | 
| Extractive QA | Yes | Maybe | Yes | 
| Translation | No | Maybe | Yes | 
| Creative writing | No | Yes | No | 
| Abstractive Summarization | No | Yes | Yes | 
| Extractive Summarization | Yes | Maybe | Yes | 
| Chat | No | Yes | No | 
| Forecasting | No | No | No | 
| Code | No | Yes | Yes | 
Prompting and Prompt Engineering
In-context Learning and Few-shot Prompting
- In-context learning — conditaioning (prompting) an LLM with instructions and/or demonstrations of the task it is meant to complete
- k-shot prompting — explicitly provideing k examples of the intended task in the prompt
Here you can see a k-shot prompting example where we tell to the model to translate by providing some examples (in this case three-shot).
Take away:
Few-shot prompting is widely belived to improve results over 0-shot prompting.
Advanced Prompting Strategies
Chain-of-Thought (CoT) — Prompt the LLM to emit intermediate reasoning steps
How can this work at all? Remember, when generating text the model is working word by word, 1 word at a time. It doesn’t have a high-level plan about how to solve the problem.
This is precisely why Chain-of-Thought (CoT) is so effective. By explicitly instructing the model to “think step-by-step,” we force it to generate its reasoning process as part of the output. Each new word it generates is conditioned on the reasoning steps it has already written down. This creates a logical sequence that guides the model toward a more accurate conclusion. Instead of trying to jump straight to the answer — which is difficult for complex problems — the model externalizes its thought process, allowing it to break the problem down and build upon its own intermediate conclusions. It’s like a student showing their work on a math problem; writing down the steps helps avoid errors.
Least-to-most — Prompt the LLM to decompose the problem and solve, easy-first
This strategy builds on Chain-of-Thought by prompting the model to first break a complex problem into a series of simpler subproblems and then solve them in sequence. This is particularly useful for tasks where one step logically depends on the answer to a previous one. The key is to guide the model to tackle the easiest parts first, creating a foundation for solving the more difficult parts. This reduces the cognitive load and improves the chances of arriving at a correct final answer.
Example:
Query: “If a car travels at 60 mph for 30 minutes and then gets stuck in traffic for 15 minutes before traveling at 40 mph for another 15 minutes, what is its average speed for the entire trip?”Least-to-Most Prompting:
Prompt 1 (Decomposition): “Break down the problem of calculating the car’s average speed into smaller steps.”
Model’s Response: “1. Calculate the distance traveled in the first leg. 2. Calculate the distance traveled in the second leg. 3. Calculate the total distance. 4. Calculate the total time. 5. Divide total distance by total time.”
Prompt 2 (Solving): “Great. Now solve each step.”
Model solves sequentially, leading to the correct average speed.
Step-back — Prompt the LLM to indentify high-level concepts pertinent to a specific task
Step-Back prompting encourages the model to generalize and abstract away from the specific details of a question to consider the broader principles or concepts at play. Instead of getting bogged down by the specifics, the model is asked to “take a step back” and think about the fundamental knowledge required to answer the question. It then uses this high-level understanding to formulate a more robust and accurate answer. This is especially effective for complex reasoning tasks and for questions where the details might be misleading.
Issues with Prompting
Prompting models, while a powerful tool, carries several inherent risks that are important to understand and manage. These risks range from security vulnerabilities to ethical dilemmas.
Prompt Injection
Prompt injection is one of the most significant security risks. It is a form of attack where a malicious actor intentionally crafts a prompt to trick an AI model into ignoring its original instructions and developer guidelines. The model often cannot distinguish between developer instructions and user input, allowing an attacker to take control with a cleverly crafted command. For example, it can be manipulated to leak confidential data or generate misinformation and harmful content. This type of attack does not require advanced technical skills; it is carried out simply by deceiving the model using natural language.
Inherent Biases and Prejudices
Generative models operate based on the patterns present in their training data. If this data contains social, cultural, or other prejudices, the model will not only reproduce but can also amplify these biases. This can manifest, for instance, in the model consistently associating certain professions with a specific gender or reinforcing racial and cultural stereotypes. Such biased outputs can perpetuate discriminatory practices in areas like hiring or lending.
Misinformation and “Hallucinations”
Generative AI models are prone to confidently stating falsehoods. This phenomenon is known as “hallucination.” Since models fundamentally predict the next most likely word based on statistical patterns, their responses do not necessarily have a factual basis. This can be particularly dangerous when users rely on AI-generated content for critical decisions, such as for financial or medical advice. Malicious actors can also intentionally use these models to create convincing fake news and disinformation campaigns.
Data Security and Privacy Risks
When users write prompts, they might inadvertently include confidential or personally identifiable information (PII). There is a risk that the model could memorize this information and later reveal it in a response to another user. The risk is especially high with cloud-based or third-party AI services, where input data may not be handled properly, leading to privacy breaches and legal violations (such as GDPR).
Generating Malicious Content
Without proper safety restrictions, models can be used to create offensive, inappropriate, or illegal content. Attackers can generate malicious code or sophisticated phishing emails to target other users or systems, using the AI as a tool to scale their malicious activities.
Training
While prompting is powerful, it can be insufficient when you need an LLM to become a true expert in a specific domain or perform a highly specialized task. This is where **training **comes in.
As opposed to prompting (which gives the model context), training permanently changes the model’s internal parameters. Think of it as teaching the model a new skill, not just giving it notes for a single test. At a high level, the process involves:
- Giving the model an input.
- Letting it guess the corresponding output (e.g., a sentence completion or an answer).
- Comparing its guess to the “correct” answer and slightly adjusting its parameters so it does better next time.
These adjustments change how the model “thinks” and hopefully improve its performance on your specific task. There are several ways to do this, ranging from massive undertakings to highly efficient tweaks.
Continual Pre-training: Expanding the Knowledge Base
- What it is: This technique continues the initial, broad training of an LLM, but using a large corpus of text from a new, specialized domain (e.g., legal documents, medical research, or your company’s internal wiki). You are still just asking the model to predict the next word, but on this new, focused data.
- Parameters Modified: All of them.
- Data Required: A large amount of unlabeled domain-specific text.
- Best for: When the model lacks fundamental knowledge about a specific field. You aren’t teaching it a task, you are teaching it a subject.
Full Fine-Tuning (FT): Teaching a Specific Skill
- What it is: This is the classic way to train a model for a specific task. You take a pre-trained model and train it further on a dataset of examples that show exactly what you want it to do (e.g., thousands of question-answer pairs for a customer service bot).
- Parameters Modified: All of them.
- Data Required: A high-quality, labeled, task-specific dataset.
- Best for: Achieving the highest possible performance on a specific task when you have a large budget and a good dataset. However, it is computationally expensive and risks “catastrophic forgetting,” where the model loses some of its general capabilities.
Parameter-Efficient Fine-Tuning (PEFT): The Smart Middle Ground
To get the benefits of fine-tuning without the enormous cost, several PEFT methods have emerged. The core idea is to freeze the original LLM’s billions of parameters and only train a small number of new or specific ones.
Method A: LoRA (Low-Rank Adaptation)
- What it is: LoRA is a popular PEFT method where small, trainable “adapter” layers are inserted into the model. The original model remains frozen, and only these tiny new layers are trained. It’s like adding specialized tuning knobs to a complex engine instead of rebuilding it.
- Parameters Modified: Only a tiny fraction of new, added parameters.
- Data Required: Labeled, task-specific data (often less than full FT).
Method B: Soft Prompting (or Prompt Tuning)
- What it is: This technique focuses on the input. It freezes the entire model and instead learns a special “soft prompt” — a sequence of numerical values that are prepended to your actual prompt. You can think of these as perfect, computer-generated keywords that are learned during training to steer the model toward the correct output for your task.
- Parameters Modified: A small number of new parameters that represent the soft prompt.
- Data Required: Labeled, task-specific data.
| Training Style | Parameters Modified | Data | Cost | Use Case | 
|---|---|---|---|---|
| Cont. Pre-training | All | Unlabeled | Very High | Adapting to a new knowledge domain. | 
| Full Fine-Tuning | All | Labeled | High | Max performance on a specific task | 
| PEFT (e.g., LoRA) | Few (new) | Labeled | Low | Cost-effective task specialization | 
| Soft Prompting | Few (new) | Labeled | Very Low | Efficiently tuning for many tasks | 
Decoding
Decoding is the process an LLM uses to select words from a probability distribution to generate text. After the model processes an input, it doesn’t know the “right” word; it only knows the probability of every word in its vocabulary being the next one.
Let’s get the example:
“I wrote to the zoo to send me a pet. They sent me a _______”
The model produces a probability distribution over its entire vocabulary, which might look something like this:
“lion = 0.03”; “elephant = 0.02”; “dog = 0.45”; “cat = 0.4”; “panther = 0.05”; “aligator = 0.01”; …
The key question is: "how do we pick a word from this list?" This choice happens iteratively:
- The model computes the probability distribution.
- A word is selected using a decoding strategy.
- The chosen word is appended to the input text.
- The process repeats until the model generates an end-of-sequence (EOS) token or reaches its maximum length.
There are two main families of decoding strategies:
1.) Greedy Decoding: The Direct Path
This is the simplest and most direct strategy. At each step, we simply pick the word with the highest probability.
In the example above, it selects the “dog” (0.45 probability).
The next input becomes:
“I wrote to the zoo to send me a pet. They sent me a dog ______”
The model then generates a new distribution. Let’s say the highest probability is now for the End-of-Sequence token (EOS = 0.99). Greedy decoding selects it, and the generation stops.
Distribution: 
“EOS = 0.99”; “elephant = 0.001”; “dog = 0.001”; “cat = 0.001”; “panther = 0.005”; “aligator = 0.01”; …
Pros: Fast, predictable, and produces the most “likely” output.
Cons: Can be repetitive and boring. It might miss a more creative or coherent sentence by always choosing the locally optimal word, without considering the global sentence structure.
2.) Sampling: Introducing Controlled Randomness
To produce more creative and human-like text, we can introduce randomness. Instead of always picking the top word, we sample from the probability distribution. Several parameters control this process.
Temperature
Temperature is the most important parameter for controlling randomness. It is a value typically between 0.0 and 2.0. It “re-shapes” the probability distribution before sampling.
- Low Temperature (e.g., 0.2): This makes the distribution “peakier.” The probability of high-probability words (like “dog” and “cat”) gets boosted, while low-probability words are suppressed even further. As temperature approaches 0, it becomes identical to greedy decoding. Use when: You want factual, grounded, and predictable answers.
- High Temperature (e.g., > 1.0): This “flattens” the distribution, making the probabilities of words more uniform. Rare words have a higher chance of being selected. Use when: You want creative, diverse, and sometimes surprising output, like for writing a story or brainstorming ideas.
Top-K and Top-P (Nucleus) Sampling
These methods are often used with temperature to further refine the word selection. They prevent the model from picking truly nonsensical words by first filtering the vocabulary list.
- Top-K Sampling: Consider only the K most likely words. For example, if K=3, you would only sample from “dog”, “cat”, and “panther”, ignoring all others.
- Top-P (Nucleus) Sampling: A more dynamic approach. You choose the smallest set of top words whose cumulative probability is greater than P. For example, if P=0.90, you would take “dog” (0.45) and “cat” (0.40), as their sum is 0.85. Since that’s less than 0.90, you’d add the next word, “panther” (0.05), bringing the total to 0.90. You then sample only from that small group.
Take away:
Decoding is a balance between coherence and creativity. For factual applications, use greedy decoding or sampling with a low temperature. For creative tasks, increase the temperature and use Top-K or Top-P to guide the randomness and prevent nonsensical outputs.
Thanks for reading this article, if you have any questions feel free to reach me out!
The LLM application possibilities are coming soon, follow me to get notified.
 






 
    
Top comments (0)