Understanding Large Language Models: A Developer's Guide

Large Language Models (LLMs) have transformed how we build applications. From ChatGPT to GitHub Copilot, these models power the AI revolution. But how do they actually work? More importantly, as a developer, how do you choose between training your own model, fine-tuning an existing one, or just using prompt engineering?

This guide demystifies LLMs from a developer's perspective—no advanced math degree required.

What You'll Learn

By the end of this article, you'll understand:

  • The fundamental architecture that powers all modern LLMs
  • How transformers process and generate text
  • The critical differences between training, fine-tuning, and prompt engineering
  • When to use each approach for your specific use case
  • Practical implementation strategies with real code examples

Target Audience: Developers with basic AI knowledge who want to understand LLMs deeply enough to make informed architectural decisions.

The LLM Foundation: What Makes Them Different

Beyond Traditional ML Models

Traditional machine learning models are specialists. You train a spam classifier, and it classifies spam. Train an image classifier, and it classifies images. LLMs are different—they're generalists.

graph LR
    A[Spam Email] --> B[Spam Model]
    B --> C["Spam or Not Spam"]

    style A fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style B fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style C fill:#000,stroke:#fff,stroke-width:2px,color:#fff

    subgraph "Traditional ML: One task only"
    A
    B
    C
    end
graph LR
    A[Any Text] --> B[LLM GPT-4]
    B --> C[Translation]
    B --> D[Summarization]
    B --> E[Code Generation]
    B --> F[Q&A, Analysis]
    B --> G[... and more]

    style A fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style B fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style C fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style D fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style E fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style F fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style G fill:#000,stroke:#fff,stroke-width:2px,color:#fff

    subgraph "LLMs: Multiple capabilities"
    A
    B
    C
    D
    E
    F
    G
    end

The Three Defining Characteristics

1. Scale

GPT-1 (2018):    117 million parameters
GPT-2 (2019):    1.5 billion parameters
GPT-3 (2020):    175 billion parameters
GPT-4 (2023):    ~1.76 trillion parameters (estimated)

For context: Your brain has ~86 billion neurons

2. Pre-training on Massive Data

Training Data Sources:
- Books: Millions of volumes
- Web Pages: Billions of pages
- Code Repositories: Terabytes of code
- Scientific Papers: Millions of articles
- Social Media: Filtered conversations

Total: Trillions of words

3. Emergent Abilities

As LLMs scale, they gain abilities they weren't explicitly trained for:

# Not explicitly trained for these, but can do them:
abilities = [
    "Few-shot learning",      # Learn from examples in prompt
    "Chain-of-thought reasoning",  # Break down complex problems
    "Code interpretation",    # Understand and generate code
    "Multilingual translation",    # Translate between languages
    "Mathematical reasoning",  # Solve math problems
    "Creative writing"        # Generate stories, poems
]

# These emerge naturally from scale + training

How LLMs Work Under the Hood

The Core Concept: Next Token Prediction

At their heart, LLMs do one thing: predict the next token.

Input:  "The cat sat on the"
Model:  "mat" (probability: 0.4)
        "floor" (probability: 0.3)
        "chair" (probability: 0.2)
        ...

Chosen: "mat"
Next Input: "The cat sat on the mat"
Model:  "." (probability: 0.5)
        "and" (probability: 0.3)
        ...

This simple process, repeated billions of times during training, creates the illusion of understanding.
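
To make that loop concrete, here's a toy sketch of the predict-sample-append cycle. The vocabulary and the `toy_next_token_logits` "model" are invented purely for illustration; a real LLM produces logits over roughly 50,000 tokens at every step.

import numpy as np

# Toy vocabulary and a fake "model" that scores the next token.
# Illustrative only; a real LLM scores ~50K tokens at each step.
vocab = ["the", "cat", "sat", "on", "mat", "."]

def toy_next_token_logits(context):
    """Pretend model: always favors the continuation 'the cat sat on the mat .'"""
    target = ["the", "cat", "sat", "on", "the", "mat", "."]
    next_word = target[len(context)] if len(context) < len(target) else "."
    return np.array([3.0 if word == next_word else 0.0 for word in vocab])

context = ["the"]
for _ in range(6):
    logits = toy_next_token_logits(context)
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax turns logits into probabilities
    next_token = vocab[int(np.argmax(probs))]      # greedy: pick the most likely token
    context.append(next_token)

print(" ".join(context))  # the cat sat on the mat .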

The Training Pipeline

Here's what happens when training an LLM:

graph TD
    A[Step 1: DATA COLLECTION<br/>Raw Text from Internet<br/>'The quick brown fox jumps...'] --> B[Step 2: TOKENIZATION<br/>Convert to tokens and IDs<br/>tokens: The, quick, brown, fox<br/>IDs: 123, 456, 789, 101]
    B --> C[Step 3: EMBEDDING<br/>Each token → high-dimensional vector<br/>The → 0.2, 0.5, 0.1, ...]
    C --> D[Step 4: TRANSFORMER PROCESSING<br/>Self-attention + Feed-forward<br/>Layers process context]
    D --> E[Step 5: PREDICTION<br/>Output probabilities for next token<br/>jumps 80%, runs 15%...]
    E --> F[Step 6: LOSS CALCULATION<br/>Compare prediction to actual word<br/>Actual: jumps<br/>Calculate error cross-entropy]
    F --> G[Step 7: BACKPROPAGATION<br/>Update all 175B parameters<br/>Reduce prediction error]
    G -.->|Repeat billions of times| A

    style A fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style B fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style C fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style D fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style E fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style F fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style G fill:#000,stroke:#fff,stroke-width:2px,color:#fff

    classDef default font-size:11px

Tokenization Deep Dive

Understanding tokenization is crucial for working with LLMs:

# Example tokenization (simplified)
text = "Hello, world! How are you?"

# Byte-Pair Encoding (BPE) - Common approach
tokens = ["Hello", ",", " world", "!", " How", " are", " you", "?"]

# Converted to IDs
token_ids = [15496, 11, 995, 0, 1374, 389, 345, 30]

# Key insights:
# 1. Spaces are part of tokens (" world" not "world")
# 2. Punctuation can be separate tokens
# 3. Common words = single token
# 4. Rare words = multiple tokens

# Example with a rare word:
rare_word = "antidisestablishmentarianism"
tokens = ["ant", "id", "ise", "stablish", "ment", "arian", "ism"]
# 7 tokens for one word!

This is why token limits matter:

# GPT-4 context window: 8,192 tokens
# Approximate conversion: 1 token ≈ 0.75 words

max_words = 8192 * 0.75  # ~6,144 words
max_pages = max_words / 250  # ~24 pages (single-spaced)

# But technical text uses MORE tokens:
code = "function calculateTotal(items) { return items.reduce((sum, item) => sum + item.price, 0); }"
# ~30 tokens for this JavaScript snippet
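
To see how a real tokenizer splits your text (and to count tokens before sending a request), you can use the tiktoken package, assuming it's installed; exact splits vary by model and encoding.

# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4 / GPT-3.5-turbo

code = "function calculateTotal(items) { return items.reduce((sum, item) => sum + item.price, 0); }"
token_ids = enc.encode(code)

print(len(token_ids))                                # actual token count for this snippet
print([enc.decode([tid]) for tid in token_ids[:8]])  # first few tokens as text pieces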

The Inference Process

When you use an LLM, here's what happens:

def llm_inference_simplified(prompt, model, max_tokens=100):
    """
    Simplified view of LLM inference

    Args:
        prompt: User input text
        model: Pre-trained LLM
        max_tokens: Maximum tokens to generate
    """
    # 1. Tokenize input
    tokens = tokenize(prompt)

    # 2. Convert to embeddings
    embeddings = model.embed(tokens)

    generated_tokens = []

    # 3. Generate tokens one at a time
    for _ in range(max_tokens):
        # Run through transformer layers
        output = model.forward(embeddings)

        # Get probability distribution for next token
        next_token_probs = output.get_next_token_distribution()

        # Sample next token (with temperature, top-p, etc.)
        next_token = sample(next_token_probs)

        # Check for stop condition
        if next_token == END_TOKEN:
            break

        generated_tokens.append(next_token)

        # Add to context for next iteration
        embeddings = update_context(embeddings, next_token)

    # 4. Decode tokens back to text
    output_text = detokenize(generated_tokens)

    return output_text

# Real usage:
response = llm_inference_simplified(
    prompt="Explain recursion in Python",
    model=gpt4_model,
    max_tokens=200
)
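
The `sample` call above hides the decoding strategy. Here's one possible sketch of temperature plus top-p (nucleus) sampling; the logits at the bottom are placeholder values, not real model output.

import numpy as np

def sample_next_token(logits, temperature=0.8, top_p=0.9):
    """Temperature + top-p (nucleus) sampling over raw logits."""
    # Temperature: below 1 sharpens the distribution, above 1 flattens it
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Top-p: keep the smallest set of tokens whose cumulative probability >= top_p
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]

    # Renormalize over the surviving tokens and sample
    kept_probs = probs[keep] / probs[keep].sum()
    return int(np.random.choice(keep, p=kept_probs))

# Placeholder logits for a 6-token vocabulary
logits = np.array([2.5, 1.8, 0.4, 0.1, -1.0, -2.0])
print(sample_next_token(logits))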

Memory and Context Windows

LLMs don't have "memory" like databases—they have context windows:

graph TB
    subgraph "Context Window: 8,192 tokens max"
    A["Your Prompt<br/>tokens 1-100"]
    B["Previous Conversation<br/>tokens 101-500"]
    C["System Instructions<br/>tokens 501-600"]
    D["Available Space for Response<br/>tokens 601-8192"]
    end

    A --> B
    B --> C
    C --> D

    E["If you exceed 8,192 tokens:<br/>• Old messages get truncated<br/>• Model forgets early conversation<br/>• You need to re-inject important context"]

    D -.-> E

    style A fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style B fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style C fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style D fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style E fill:#000,stroke:#fff,stroke-width:2px,color:#fff

Practical implications:

# Problem: Long conversation exceeds context
conversation_history = []

for user_message in user_messages:
    conversation_history.append(user_message)

    # Calculate total tokens
    total_tokens = count_tokens(conversation_history)

    if total_tokens > MAX_CONTEXT - BUFFER:
        # Strategy 1: Truncate oldest messages
        conversation_history = conversation_history[-10:]

        # Strategy 2: Summarize conversation
        summary = summarize_conversation(conversation_history[:-5])
        conversation_history = [summary] + conversation_history[-5:]

        # Strategy 3: Extract key information
        key_facts = extract_key_information(conversation_history)
        conversation_history = [key_facts] + conversation_history[-5:]

    response = llm.generate(conversation_history)

Transformers Architecture Explained

The Revolution: Self-Attention

Before transformers (2017), we had RNNs and LSTMs that processed text sequentially. Transformers process all tokens simultaneously using self-attention.

graph LR
    subgraph "RNN: Sequential - slow, can't parallelize"
    A1[The] --> A2[cat]
    A2 --> A3[sat]
    A3 --> A4[on]
    A4 --> A5[the]
    A5 --> A6[mat]
    end

    style A1 fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style A2 fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style A3 fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style A4 fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style A5 fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style A6 fill:#000,stroke:#fff,stroke-width:2px,color:#fff
graph TB
    A["[The, cat, sat, on, the, mat]"]
    B["All at once!<br/>fast, highly parallelizable"]

    A --> B

    style A fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style B fill:#000,stroke:#fff,stroke-width:2px,color:#fff

    subgraph "Transformer: Parallel"
    A
    B
    end

Transformer Block Anatomy

A transformer consists of repeated blocks:

graph TD
    A[Input Embeddings] --> B["1. MULTI-HEAD SELF-ATTENTION<br/>• Query, Key, Value transformations<br/>• Attention scores computation<br/>• 12-96 attention heads parallel"]
    B --> C["2. ADD & NORMALIZE<br/>Residual Connection<br/>output = LayerNorm(input + attention)"]
    C --> D["3. FEED-FORWARD NETWORK<br/>• Two linear layers with activation<br/>• Processes each position independently"]
    D --> E["4. ADD & NORMALIZE<br/>Residual Connection<br/>output = LayerNorm(input + ffn)"]
    E --> F[Next Block or Output Layer]
    F -.->|Repeat 12-96 times| A

    style A fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style B fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style C fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style D fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style E fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style F fill:#000,stroke:#fff,stroke-width:2px,color:#fff

    G["Typical LLM: 12-96 blocks stacked<br/>GPT-3: 96 layers<br/>GPT-4: ~120 layers estimated"]

    style G fill:#000,stroke:#fff,stroke-width:2px,color:#fff
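
In code, one block is only a few lines once attention and the feed-forward network exist. A schematic sketch (the `attention` and `ffn` callables stand in for the components covered in the next two sections):

import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def transformer_block(x, attention, ffn):
    """One block: attention and feed-forward, each wrapped in a residual connection + LayerNorm."""
    x = layer_norm(x + attention(x))  # 1-2: multi-head self-attention, add & normalize
    x = layer_norm(x + ffn(x))        # 3-4: feed-forward network, add & normalize
    return x

# Depth is just repetition:
# for _ in range(96):               # GPT-3 stacks 96 of these blocks
#     x = transformer_block(x, attention, ffn)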

Self-Attention Mechanism (Detailed)

Let's walk through exactly how self-attention works:

import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention_step_by_step(tokens, d_model=512):
    """
    Self-attention mechanism explained step-by-step

    Args:
        tokens: Input token embeddings [seq_len, d_model]
        d_model: Embedding dimension (e.g., 512, 768, 1024)
    """
    seq_len = len(tokens)
    d_k = d_model // 8  # Dimension per head (if 8 heads)

    # Step 1: Create Q, K, V matrices
    # These are learned parameters
    W_q = np.random.randn(d_model, d_k)  # Query weight matrix
    W_k = np.random.randn(d_model, d_k)  # Key weight matrix
    W_v = np.random.randn(d_model, d_k)  # Value weight matrix

    # Step 2: Compute Q, K, V for each token
    Q = tokens @ W_q  # [seq_len, d_k]
    K = tokens @ W_k  # [seq_len, d_k]
    V = tokens @ W_v  # [seq_len, d_k]

    # Step 3: Calculate attention scores
    # "How much should each token attend to every other token?"
    scores = Q @ K.T  # [seq_len, seq_len]

    # Example for 4 tokens:
    # scores = [
    #   [q1·k1, q1·k2, q1·k3, q1·k4],  # Token 1's attention to all
    #   [q2·k1, q2·k2, q2·k3, q2·k4],  # Token 2's attention to all
    #   [q3·k1, q3·k2, q3·k3, q3·k4],  # Token 3's attention to all
    #   [q4·k1, q4·k2, q4·k3, q4·k4],  # Token 4's attention to all
    # ]

    # Step 4: Scale scores (prevents gradients from exploding)
    scores = scores / np.sqrt(d_k)

    # Step 5: Apply causal mask (for autoregressive models)
    # Prevent tokens from attending to future tokens
    mask = np.triu(np.ones((seq_len, seq_len)), k=1) * -1e9
    scores = scores + mask

    # Now scores look like:
    # [
    #   [q1·k1, -inf,  -inf,  -inf ],  # Can only see token 1
    #   [q2·k1, q2·k2, -inf,  -inf ],  # Can see tokens 1-2
    #   [q3·k1, q3·k2, q3·k3, -inf ],  # Can see tokens 1-3
    #   [q4·k1, q4·k2, q4·k3, q4·k4],  # Can see all tokens
    # ]

    # Step 6: Softmax to get attention weights
    attention_weights = softmax(scores, axis=-1)

    # Step 7: Weighted sum of values
    output = attention_weights @ V  # [seq_len, d_k]

    return output, attention_weights

# Visualizing attention for "The cat sat on the mat"
tokens_text = ["The", "cat", "sat", "on", "the", "mat"]

# Random vectors stand in for the learned token embeddings
token_embeddings = np.random.randn(len(tokens_text), 512)
output, attention_weights = self_attention_step_by_step(token_embeddings)

print("Attention Weights Matrix:")
print("        ", "  ".join(tokens_text))
for i, token in enumerate(tokens_text):
    weights = attention_weights[i]
    # Print only non-masked positions
    visible_weights = weights[:i+1]
    print(f"{token:6s}", " ".join(f"{w:.2f}" for w in visible_weights))

# Output example:
#         The   cat   sat   on    the   mat
# The     1.00
# cat     0.30  0.70
# sat     0.20  0.50  0.30
# on      0.10  0.20  0.40  0.30
# the     0.15  0.15  0.25  0.35  0.10
# mat     0.10  0.25  0.20  0.15  0.10  0.20

Multi-Head Attention

Instead of one attention mechanism, transformers use many parallel ones:

class MultiHeadAttention:
    def __init__(self, d_model=768, num_heads=12):
        """
        Multi-head attention allows the model to jointly attend
        to information from different representation subspaces

        Args:
            d_model: Total embedding dimension (768 for BERT-base, 1024 for GPT-2 medium)
            num_heads: Number of parallel attention heads (typically 8-16)
        """
        self.num_heads = num_heads
        self.d_model = d_model
        self.d_k = d_model // num_heads  # Dimension per head

        # Each head has its own Q, K, V projections
        # (randomly initialized here; learned during training)
        self.W_q = [np.random.randn(d_model, self.d_k)
                    for _ in range(num_heads)]
        self.W_k = [np.random.randn(d_model, self.d_k)
                    for _ in range(num_heads)]
        self.W_v = [np.random.randn(d_model, self.d_k)
                    for _ in range(num_heads)]

        # Output projection
        self.W_o = np.random.randn(d_model, d_model)

    def forward(self, x):
        """
        Process input through all attention heads
        x: [seq_len, d_model]
        """
        # Run attention for each head in parallel
        head_outputs = []

        for i in range(self.num_heads):
            # Each head learns different patterns:
            # Head 1: Subject-verb relationships
            # Head 2: Object relationships
            # Head 3: Positional patterns
            # Head 4: Semantic similarity
            # ... etc

            Q = x @ self.W_q[i]
            K = x @ self.W_k[i]
            V = x @ self.W_v[i]

            # Scaled dot-product attention (softmax as defined earlier)
            weights = softmax(Q @ K.T / np.sqrt(self.d_k), axis=-1)
            head_outputs.append(weights @ V)

        # Concatenate all heads
        concatenated = np.concatenate(head_outputs, axis=-1)  # [seq_len, d_model]

        # Final linear projection
        output = concatenated @ self.W_o

        return output

# Why multiple heads?
# Different heads learn different relationships:

"""
Example attention patterns in GPT-3:

Head 1 (Syntax):
"The cat" → focuses on article-noun agreement
"sat on" → focuses on verb-preposition pairing

Head 2 (Semantics):
"cat" → attends to "animal", "pet" concepts
"sat" → attends to "action", "position" concepts

Head 3 (Long-range):
"the mat" at end → attends back to "cat" at beginning
Links subject to distant objects

Head 4 (Position):
Each token → attends most to neighbors
Captures local context
"""

Feed-Forward Network

After attention, each token passes through a feed-forward network:

def gelu(x):
    """Tanh approximation of the GELU activation used in GPT-style models."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

class FeedForwardNetwork:
    def __init__(self, d_model=768, d_ff=3072):
        """
        Position-wise feed-forward network
        Typically d_ff = 4 * d_model

        Args:
            d_model: Input/output dimension
            d_ff: Hidden layer dimension (4x larger)
        """
        self.W1 = np.random.randn(d_model, d_ff)
        self.W2 = np.random.randn(d_ff, d_model)
        self.bias1 = np.zeros(d_ff)
        self.bias2 = np.zeros(d_model)

    def forward(self, x):
        """
        Two-layer fully connected network with activation

        x: [batch_size, seq_len, d_model]
        """
        # First layer with GELU activation
        hidden = gelu(x @ self.W1 + self.bias1)  # [batch, seq, d_ff]

        # Second layer back to d_model
        output = hidden @ self.W2 + self.bias2   # [batch, seq, d_model]

        return output

# Why the expansion to 4x size?
# The 4x expansion (768 → 3072) allows the network to:
# 1. Learn complex non-linear transformations
# 2. Specialize different neurons for different patterns
# 3. Create rich representations

# Example of what FFN learns:
"""
Input: "bank" (ambiguous)
Context: "river bank"

FFN transforms:
[0.2, 0.5, 0.3, ...]  (generic "bank" embedding)
        ↓
[0.8, 0.1, 0.2, ...]  (contextual "river bank" embedding)

The FFN "contextualizes" the embedding based on surrounding attention
"""

Positional Encoding

Transformers have no inherent sense of position, so we add it:

def positional_encoding(seq_len, d_model):
    """
    Add positional information to embeddings
    Uses sine and cosine functions of different frequencies

    Args:
        seq_len: Sequence length
        d_model: Embedding dimension
    """
    position = np.arange(seq_len)[:, np.newaxis]
    div_term = np.exp(np.arange(0, d_model, 2) * 
                     -(np.log(10000.0) / d_model))

    pos_encoding = np.zeros((seq_len, d_model))

    # Even dimensions: sine
    pos_encoding[:, 0::2] = np.sin(position * div_term)

    # Odd dimensions: cosine
    pos_encoding[:, 1::2] = np.cos(position * div_term)

    return pos_encoding

# Why sine/cosine?
# 1. Values are bounded [-1, 1]
# 2. Pattern is continuous and smooth
# 3. Model can learn relative positions
# 4. Works for any sequence length

# Modern LLMs often use learned positional embeddings instead
def learned_positional_encoding(seq_len, d_model):
    """
    Alternative: Learn position embeddings during training
    Used by GPT models
    """
    # Trainable embedding matrix (randomly initialized here; learned in practice)
    position_embeddings = np.random.randn(seq_len, d_model)

    return position_embeddings[np.arange(seq_len)]

Complete Transformer Architecture

Putting it all together:

graph TD
    A["INPUT<br/>Translate English to French: Hello"] --> B["TOKENIZATION<br/>Translate, English, to, ..."]
    B --> C["TOKEN EMBEDDINGS learned<br/>Each token → 768-dimensional vector"]
    C --> D["+ POSITIONAL ENCODING<br/>Add position information"]
    D --> E["TRANSFORMER BLOCK 1<br/>├─ Multi-Head Attention 12 heads<br/>├─ Add & Normalize<br/>├─ Feed-Forward Network<br/>└─ Add & Normalize"]
    E --> F["TRANSFORMER BLOCK 2<br/>... same structure"]
    F --> G["... repeat 12-96 times"]
    G --> H["OUTPUT LAYER<br/>Project to vocabulary size<br/>vocab_size probabilities"]
    H --> I["SAMPLING<br/>Choose next token: Bonjour"]

    style A fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style B fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style C fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style D fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style E fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style F fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style G fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style H fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style I fill:#000,stroke:#fff,stroke-width:2px,color:#fff

    J["Parameters breakdown for GPT-3 175B params:<br/>• Embedding layer: 50,257 × 12,288 = 617M<br/>• 96 transformer blocks × ~1.8B each = 173B<br/>• Output layer: 12,288 × 50,257 = 617M<br/>Total: ~175 billion parameters"]

    style J fill:#000,stroke:#fff,stroke-width:2px,color:#fff
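
As a sanity check on that breakdown, you can roughly reproduce GPT-3's parameter count from its published dimensions. This back-of-the-envelope calculation ignores biases, layer norms, and positional embeddings:

d_model = 12_288       # GPT-3 embedding dimension
n_layers = 96          # number of transformer blocks
vocab_size = 50_257    # BPE vocabulary size

embedding = vocab_size * d_model                  # token embedding matrix, ~617M
attention_per_block = 4 * d_model * d_model       # Q, K, V and output projections
ffn_per_block = 2 * d_model * (4 * d_model)       # two linear layers with 4x expansion
per_block = attention_per_block + ffn_per_block   # ~1.8B per block

total = embedding + n_layers * per_block
print(f"{total / 1e9:.0f}B parameters")           # ≈ 175B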

Training vs Fine-Tuning vs Prompt Engineering

This is where theory meets practice. Let's break down each approach.

Training from Scratch

What it is: Building and training a completely new LLM.

# Conceptual training loop
def train_llm_from_scratch():
    """
    Training a new LLM from scratch
    Requirements: Massive compute, data, time, money
    """
    # Initialize model with random weights
    model = TransformerLLM(
        vocab_size=50000,
        d_model=12288,      # GPT-3 size
        num_layers=96,
        num_heads=96
    )  # ~175 billion parameters

    # Prepare massive dataset
    dataset = load_training_data([
        "CommonCrawl",      # 400B tokens
        "WebText2",         # 19B tokens
        "Books1",          # 12B tokens
        "Books2",          # 55B tokens
        "Wikipedia",       # 3B tokens
    ])  # Total: ~500B tokens

    # Training configuration
    optimizer = AdamW(learning_rate=0.0001)
    batch_size = 3_200_000  # tokens per batch (3.2M, as used for GPT-3)
    num_epochs = 1  # One pass through all data

    # Resource requirements (rough public estimates)
    gpus = 10_000               # V100-class GPUs (Microsoft Azure supercomputer scale)
    training_time_days = 34     # order-of-magnitude estimate
    cost_estimate = 4_600_000   # USD, widely cited estimate for GPT-3

    # Training loop
    for epoch in range(num_epochs):
        for batch in dataset.batches(batch_size):
            # Forward pass
            predictions = model(batch.input_tokens)

            # Calculate loss
            loss = cross_entropy_loss(predictions, batch.target_tokens)

            # Backward pass (gradient calculation)
            gradients = compute_gradients(loss)

            # Update 175 billion parameters
            optimizer.step(gradients)

    return model

# Reality check:
costs = {
    "GPT-3 training": "$4.6M",
    "GPT-4 training": "$100M+ (estimated)",
    "Llama 2 70B": "$1.7M",
    "Your startup budget": "????"
}

When to use:

  • ❌ Almost never for most developers
  • ✅ If you're a large research lab
  • ✅ If you have unique, massive proprietary datasets
  • ✅ If you need a model with specific architectural features

Pros:

  • Complete control over architecture
  • Can optimize for specific domain from ground up
  • No dependency on existing models

Cons:

  • Costs millions of dollars
  • Requires months of compute time
  • Needs massive datasets (hundreds of billions of tokens)
  • Requires world-class ML expertise
  • High risk of failure

Fine-Tuning

What it is: Taking a pre-trained model and adapting it to your specific use case.

# Fine-tuning example
def fine_tune_llm(base_model, custom_dataset):
    """
    Fine-tuning adapts a pre-trained model to your domain
    Much more practical than training from scratch
    """
    # Start with pre-trained model (GPT-3, Llama, etc.)
    model = load_pretrained_model("gpt-3.5-turbo")
    # Already knows language, general knowledge, reasoning

    # Your custom dataset (much smaller!)
    training_data = [
        {
            "prompt": "Diagnose this medical symptom: headache and fever",
            "completion": "Differential diagnosis includes: 1. Viral infection..."
        },
        # ... 1,000-100,000 examples
    ]

    # Fine-tuning configuration
    config = {
        "learning_rate": 0.00001,  # Much lower than pre-training
        "batch_size": 32,
        "num_epochs": 3,
        "freeze_layers": 80,  # Freeze most layers, train top 16
    }

    # Resource requirements (much more reasonable!)
    gpus_needed = 1  # Single A100
    training_time = "4-48 hours"
    cost = "$50-$5,000"

    # Training loop
    for epoch in range(config["num_epochs"]):
        for batch in training_data.batches(config["batch_size"]):
            # Forward pass
            outputs = model(batch["prompt"])

            # Calculate loss (only on your data)
            loss = compute_loss(outputs, batch["completion"])

            # Backward pass (only update unfrozen layers)
            update_parameters(loss, config["freeze_layers"])

    return model

# Popular fine-tuning approaches:

# 1. Full fine-tuning (update all parameters)
full_ft = FineTuning(
    model=base_model,
    update_all_layers=True,
    cost="High",
    quality="Best"
)

# 2. LoRA (Low-Rank Adaptation) - Most popular!
lora_ft = LoRA(
    model=base_model,
    rank=8,  # Add small trainable matrices
    update_fraction=0.01,  # Only 1% of parameters
    cost="Low",
    quality="Very Good"
)

# 3. Adapter layers
adapter_ft = AdapterLayers(
    model=base_model,
    adapter_size=64,
    insert_after_each_layer=True,
    cost="Medium",
    quality="Good"
)

Types of Fine-Tuning:

# 1. SUPERVISED FINE-TUNING (SFT)
# Train on input-output pairs
sft_data = [
    {"input": "Summarize this article: ...", "output": "Summary: ..."},
    {"input": "Translate to Spanish: ...", "output": "Spanish text..."},
]

# 2. REINFORCEMENT LEARNING FROM HUMAN FEEDBACK (RLHF)
# The secret sauce behind ChatGPT
rlhf_process = """
Step 1: Collect human preferences
  Model output A vs Model output B → Humans pick better one

Step 2: Train reward model
  Learn to predict human preferences

Step 3: Optimize policy
  Use PPO (Proximal Policy Optimization) to maximize reward
"""

# 3. INSTRUCTION TUNING
# Teach model to follow instructions
instruction_data = [
    {
        "instruction": "Write a poem about coding",
        "input": "",
        "output": "In lines of code, so clear and bright..."
    },
    {
        "instruction": "Explain {concept} to a beginner",
        "input": "concept: recursion",
        "output": "Recursion is when a function calls itself..."
    }
]
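
Step 2 of the RLHF recipe, the reward model, boils down to a pairwise preference loss: push the score of the human-preferred response above the rejected one. A minimal sketch of that objective (the scores below are placeholders for reward-model outputs):

import numpy as np

def reward_model_loss(score_chosen, score_rejected):
    """
    Pairwise preference loss used to train RLHF reward models:
    -log(sigmoid(r_chosen - r_rejected)).
    The loss is small when the preferred response scores higher.
    """
    return -np.log(1 / (1 + np.exp(-(score_chosen - score_rejected))))

# Placeholder reward scores for two candidate responses to the same prompt
print(reward_model_loss(score_chosen=2.1, score_rejected=0.3))  # small loss: ranking is correct
print(reward_model_loss(score_chosen=0.3, score_rejected=2.1))  # large loss: ranking is wrong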

When to use:

  • ✅ You have 1,000-100,000 quality examples
  • ✅ Your domain has specific terminology/patterns
  • ✅ You need consistent formatting or style
  • ✅ You want to reduce hallucinations in your domain
  • ✅ Budget: $100-$10,000

Pros:

  • Much cheaper than training from scratch
  • Faster (hours to days vs. months)
  • Excellent results for specific domains
  • Retains general knowledge while adding specialization

Cons:

  • Still requires curated dataset
  • Can forget pre-trained knowledge (catastrophic forgetting)
  • Needs technical expertise
  • Ongoing maintenance as base models update

Prompt Engineering

What it is: Designing inputs to get desired outputs, without changing the model.

# Prompt engineering examples
class PromptEngineer:
    """
    Get better results through clever prompting
    No training required!
    """

    def basic_prompt(self, question):
        """Basic approach - often fails"""
        return f"{question}"

    def few_shot_prompt(self, question):
        """Provide examples in the prompt"""
        return f"""
I'll show you examples, then answer the question:

Example 1:
Q: What is 2+2?
A: Let me break this down: 2 + 2 = 4

Example 2:
Q: What is 5*3?
A: Let me break this down: 5 * 3 = 15

Now answer:
Q: {question}
A: Let me break this down:
"""

    def chain_of_thought_prompt(self, question):
        """Encourage step-by-step reasoning"""
        return f"""
{question}

Let's approach this step-by-step:
1) First, let's understand what we're being asked
2) Then, let's break down the problem
3) Finally, we'll arrive at the answer

Step 1:
"""

    def role_based_prompt(self, question, role="expert"):
        """Assign the model a role/persona"""
        return f"""
You are a world-class {role} with deep expertise.
A student asks you: {question}

You respond with clear, accurate, detailed information:
"""

    def structured_output_prompt(self, data):
        """Get consistent structured outputs"""
        return f"""
Analyze the following and return JSON:

Input: {data}

Return format:
{{
  "sentiment": "positive|negative|neutral",
  "confidence": 0.0-1.0,
  "key_entities": ["entity1", "entity2"],
  "summary": "brief summary"
}}

JSON:
"""

    def retrieval_augmented_generation(self, question, context):
        """RAG: Provide relevant context"""
        return f"""
Use the following context to answer the question.
If you cannot answer from the context, say so.

Context:
{context}

Question: {question}

Answer based on the context:
"""

# Advanced prompt patterns

# 1. Tree of Thoughts
tot_prompt = """
Problem: {problem}

Generate 3 different approaches:

Approach 1:
[reasoning]
[evaluation: score 1-10]

Approach 2:
[reasoning]
[evaluation: score 1-10]

Approach 3:
[reasoning]
[evaluation: score 1-10]

Best approach: [choose highest scoring]
Final answer:
"""

# 2. ReAct (Reasoning + Acting)
react_prompt = """
You can use these tools:
- search(query): Search the web
- calculate(expression): Perform math
- final_answer(answer): Return final answer

Question: What is the population of Paris times 2?

Thought: I need to find Paris's population first
Action: search("population of Paris 2024")
Observation: Paris has 2.2 million inhabitants

Thought: Now I need to multiply by 2
Action: calculate("2.2 * 2")
Observation: 4.4

Thought: I have the answer
Action: final_answer("4.4 million")
"""

# 3. Constitutional AI (Self-Critique)
constitutional_prompt = """
Question: {question}

Initial Answer: {initial_answer}

Now critique your answer:
1. Is it accurate?
2. Is it helpful?
3. Is it harmless?
4. Could it be misunderstood?

Critique:

Revised Answer:
"""

When to use:

  • ✅ Quick prototyping
  • ✅ Budget: $0-$100
  • ✅ Don't have training data
  • ✅ Need flexibility (easy to iterate)
  • ✅ Working on general-purpose tasks

Pros:

  • Zero cost (besides API usage)
  • Instant iteration
  • No technical ML expertise needed
  • Works with any model
  • Easy to A/B test

Cons:

  • Less consistent than fine-tuning
  • Token costs for long prompts
  • Requires careful engineering
  • Limited by context window
  • Can be fragile to minor changes

Decision Framework: Which Approach to Use

The Decision Tree

flowchart TD
    A[Start Here] --> B{Do you have millions<br/>of dollars and months<br/>of time?}
    B -->|Yes| C[Train from Scratch<br/>Research labs only]
    B -->|No| D{Do you have 1,000+<br/>high-quality examples<br/>in your domain?}
    D -->|Yes| E[Fine-Tune<br/>Best ROI]
    D -->|No| F[Use Prompt<br/>Engineering]

    F --> G[• RAG for facts<br/>• Few-shot learning<br/>• Clever prompts]
    E --> H[Consider:<br/>• Full FT<br/>• LoRA<br/>• RLHF]

    style A fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style B fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style C fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style D fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style E fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style F fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style G fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    style H fill:#000,stroke:#fff,stroke-width:2px,color:#fff

Detailed Comparison Matrix

Criteria            Prompt Engineering   Fine-Tuning     Training from Scratch
Cost                $0-$100              $100-$10K       $1M-$100M
Time                Minutes              Hours-Days      Months
Data Needed         0-10 examples        1K-100K         100B+ tokens
Expertise           Basic                Intermediate    Expert
Consistency         Medium               High            Highest
Flexibility         Highest              Medium          Lowest
Domain Adaptation   Limited              Excellent       Complete
Maintenance         Easy                 Medium          Complex

Real-World Use Cases

# Use Case 1: Customer Support Chatbot
use_case_support = {
    "approach": "Fine-Tuning (LoRA)",
    "why": """
    - Have 10,000 support conversation logs
    - Need consistent brand voice
    - Domain-specific terminology
    - Cost-effective for high volume
    """,
    "implementation": """
    1. Prepare conversation dataset
    2. Fine-tune Llama 2 with LoRA
    3. Deploy with caching
    4. Monitor and iterate
    """
}

# Use Case 2: Document Summarization
use_case_summarization = {
    "approach": "Prompt Engineering + RAG",
    "why": """
    - Documents vary widely
    - No training data
    - Need flexibility
    - Quick deployment
    """,
    "implementation": """
    1. Extract key sections
    2. Use structured prompt
    3. Add examples in prompt
    4. Validate output format
    """
}

# Use Case 3: Medical Diagnosis Assistant
use_case_medical = {
    "approach": "Fine-Tuning (Full) + RLHF",
    "why": """
    - High stakes (accuracy critical)
    - 50,000 expert-annotated cases
    - Specialized medical terminology
    - Need to reduce hallucinations
    """,
    "implementation": """
    1. Full fine-tune on medical corpus
    2. RLHF with doctor feedback
    3. Extensive validation
    4. Human-in-the-loop deployment
    """
}

# Use Case 4: Code Generation IDE Plugin
use_case_coding = {
    "approach": "Fine-Tuning (specialized)",
    "why": """
    - Specific codebase patterns
    - Internal libraries/APIs
    - Need context awareness
    - Consistent code style
    """,
    "implementation": """
    1. Train on company codebase
    2. Fine-tune for internal APIs
    3. Add RAG for documentation
    4. Continuous learning from reviews
    """
}

Practical Implementation Guide

Setting Up Your First Fine-Tuning Job

Here's a complete example using OpenAI's API:

from openai import OpenAI

client = OpenAI(api_key="your-api-key")

# Step 1: Prepare training data
training_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a technical documentation expert."},
            {"role": "user", "content": "Explain API rate limiting"},
            {"role": "assistant", "content": "API rate limiting is a technique..."}
        ]
    },
    # ... more examples (minimum 10, recommended 100-1000)
]

# Save to JSONL format
import json
with open("training_data.jsonl", "w") as f:
    for item in training_data:
        f.write(json.dumps(item) + "\n")

# Step 2: Upload training file
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune"
)

# Step 3: Create fine-tuning job
fine_tune_job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 4,
        "learning_rate_multiplier": 0.1
    }
)

# Step 4: Monitor training
import time

while True:
    job = client.fine_tuning.jobs.retrieve(fine_tune_job.id)
    print(f"Status: {job.status}")

    if job.status == "succeeded":
        print(f"Fine-tuned model: {job.fine_tuned_model}")
        break
    elif job.status == "failed":
        print(f"Failed: {job.error}")
        break

    time.sleep(60)

# Step 5: Use fine-tuned model
response = client.chat.completions.create(
    model=job.fine_tuned_model,
    messages=[
        {"role": "system", "content": "You are a technical documentation expert."},
        {"role": "user", "content": "Explain webhook security"}
    ]
)

print(response.choices[0].message.content)

Building a RAG System (Prompt Engineering Approach)

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

class RAGSystem:
    """
    Retrieval-Augmented Generation
    Combines document search with LLM generation
    """

    def __init__(self):
        # Embedding model for semantic search
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        self.documents = []
        self.embeddings = None

    def add_documents(self, documents):
        """Add documents to knowledge base"""
        self.documents = documents
        self.embeddings = self.embedder.encode(documents)

    def retrieve(self, query, top_k=3):
        """Find most relevant documents"""
        query_embedding = self.embedder.encode([query])
        similarities = cosine_similarity(query_embedding, self.embeddings)[0]

        # Get top-k most similar
        top_indices = np.argsort(similarities)[-top_k:][::-1]

        return [self.documents[i] for i in top_indices]

    def generate_answer(self, query, client):
        """Generate answer using retrieved context"""
        # Retrieve relevant documents
        context_docs = self.retrieve(query, top_k=3)
        context = "\n\n".join(context_docs)

        # Create prompt with context
        prompt = f"""
Use the following context to answer the question. 
If the answer isn't in the context, say so.

Context:
{context}

Question: {query}

Answer:
"""

        # Generate with LLM
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.3  # Lower for factual accuracy
        )

        return response.choices[0].message.content

# Usage example
rag = RAGSystem()

# Add your knowledge base
documents = [
    "Python is a high-level programming language created by Guido van Rossum.",
    "Machine learning is a subset of AI that learns from data.",
    "Neural networks are inspired by biological neural networks.",
    # ... add hundreds or thousands of documents
]

rag.add_documents(documents)

# Query
answer = rag.generate_answer(
    "Who created Python?",
    client=OpenAI(api_key="your-key")
)
print(answer)
# Output: "Python was created by Guido van Rossum."

Advanced Fine-Tuning with LoRA

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, LoraConfig, TaskType

# Load base model
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,  # Rank of update matrices (higher = more capacity)
    lora_alpha=32,  # Scaling factor
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # Which layers to add LoRA to
)

# Wrap model with LoRA
model = get_peft_model(model, lora_config)

# Print trainable parameters
model.print_trainable_parameters()
# Output: trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06%

# Training (simplified)
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-4,
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # your tokenized dataset, prepared beforehand
)

trainer.train()

# Save only LoRA weights (tiny file size!)
model.save_pretrained("./lora-weights")
# Instead of 13GB, you save ~20MB

Monitoring and Evaluation

class LLMEvaluator:
    """
    Evaluate your LLM implementation
    """

    def evaluate_accuracy(self, test_cases):
        """Test on known Q&A pairs"""
        correct = 0
        total = len(test_cases)

        for test in test_cases:
            response = self.get_model_response(test["question"])
            if self.is_correct(response, test["expected"]):
                correct += 1

        return correct / total

    def evaluate_consistency(self, prompts, num_samples=5):
        """Test output consistency"""
        results = {}

        for prompt in prompts:
            responses = [
                self.get_model_response(prompt) 
                for _ in range(num_samples)
            ]

            # Calculate similarity between responses
            similarity = self.calculate_response_similarity(responses)
            results[prompt] = similarity

        return results

    def evaluate_latency(self):
        """Measure response time"""
        import time

        start = time.time()
        self.get_model_response("Test prompt")
        end = time.time()

        return end - start

    def evaluate_cost(self, num_requests, avg_tokens):
        """Estimate total cost for a batch of requests"""
        # Example GPT-4 pricing per 1K tokens; assumes roughly equal
        # input and output token counts per request
        input_cost_per_1k = 0.03
        output_cost_per_1k = 0.06

        total_cost = (
            (avg_tokens / 1000) * input_cost_per_1k +
            (avg_tokens / 1000) * output_cost_per_1k
        ) * num_requests

        return total_cost

# Usage
evaluator = LLMEvaluator()

metrics = {
    "accuracy": evaluator.evaluate_accuracy(test_cases),
    "consistency": evaluator.evaluate_consistency(test_prompts),
    "latency": evaluator.evaluate_latency(),
    "cost": evaluator.evaluate_cost(1000, 500)
}

print(f"Metrics: {metrics}")

Key Takeaways

Let's recap the essential concepts:

Understanding LLMs

  1. Core Principle: LLMs predict next tokens using massive scale, transformers, and pre-training
  2. Not Magic: They're pattern matching machines trained on internet-scale text
  3. Context Window: Limited "memory"—manage carefully in applications
  4. Emergent Abilities: Scale unlocks capabilities not explicitly programmed

Transformer Architecture

  1. Self-Attention: Allows parallel processing and long-range dependencies
  2. Multi-Head: Different heads learn different patterns
  3. Positional Encoding: Adds sequence information
  4. Layer Stacking: Depth enables complex representations

Choosing Your Approach

decision_guide = {
    "Prompt Engineering": {
        "when": "Quick projects, no training data, high flexibility needed",
        "cost": "$",
        "time": "Hours",
        "best_for": ["Prototyping", "General tasks", "Low volume"]
    },

    "Fine-Tuning": {
        "when": "Have 1K+ examples, need consistency, domain-specific",
        "cost": "$$",
        "time": "Days",
        "best_for": ["Production apps", "Custom domains", "Brand voice"]
    },

    "Training from Scratch": {
        "when": "Research lab with millions in funding",
        "cost": "$$$$$",
        "time": "Months",
        "best_for": ["Novel architectures", "Massive proprietary data"]
    }
}

Practical Guidelines

  1. Start Simple: Begin with prompt engineering, add complexity as needed
  2. Measure Everything: Track accuracy, cost, latency, consistency
  3. Iterate Rapidly: LLMs are sensitive—small changes can have big impacts
  4. Use RAG: Often better than fine-tuning for factual knowledge
  5. Consider LoRA: Best cost/performance trade-off for fine-tuning

Common Pitfalls to Avoid

pitfalls = {
    "Over-engineering": "Don't fine-tune when prompt engineering works",
    "Under-testing": "Test edge cases—LLMs can be unpredictable",
    "Ignoring costs": "Token costs add up fast at scale",
    "Prompt brittleness": "Test prompt variations thoroughly",
    "Context overflow": "Monitor token usage in conversations",
    "Hallucinations": "Always validate factual claims",
    "Security": "Sanitize inputs to prevent prompt injection"
}

Next Steps

Now that you understand LLMs:

Immediate Actions

  1. Experiment: Try different models (GPT-4, Claude, Llama 2) with same prompts
  2. Build: Create a simple RAG system with your own documents
  3. Measure: Benchmark costs and performance for your use case
  4. Learn: Dive deeper into specific topics that interest you

Advanced Topics

Once you've mastered the basics:

  • Instruction tuning techniques
  • RLHF implementation details
  • Mixture of Experts (MoE) architectures
  • Quantization and optimization
  • Multi-modal models (vision + text)

Conclusion

Large Language Models represent a paradigm shift in how we build intelligent applications. Understanding how they work—from transformer architecture to training approaches—empowers you to make informed decisions about when and how to use them.

Remember:

  • LLMs are tools, not magic
  • Start with prompt engineering, scale to fine-tuning as needed
  • Measure everything: accuracy, cost, latency, user satisfaction
  • The field evolves rapidly—stay curious and keep experimenting

The best way to truly understand LLMs is to build with them. Start small, iterate quickly, and don't be afraid to experiment.


What's your experience with LLMs? Are you using prompt engineering, fine-tuning, or both? Share your challenges and successes in the comments!

If you found this guide helpful, follow me for more deep dives into AI development. Next up: "Building Production-Ready RAG Systems."


Cover image: Photo by Google DeepMind on Unsplash
