SATINATH MONDAL

The Math Behind Generative AI: Simple (No PhD Required)

If you've ever wondered how ChatGPT "understands" your questions or how DALL-E creates images from text, you're about to find out. But don't worry—we're leaving the complex calculus at the door. This article breaks down the core mathematical concepts powering generative AI into digestible, visual explanations that actually make sense.

What You'll Learn

By the end of this article, you'll understand:

  • How attention mechanisms help AI "focus" on important information
  • Why embedding spaces are like giving words GPS coordinates
  • How temperature controls the creativity vs. consistency trade-off
  • Practical implications for prompt engineering and AI application development

Prerequisites: Basic familiarity with AI concepts. No advanced math required!

The Foundation: Why Math Matters

Before we dive in, let's address the elephant in the room: Why should you care about the math?

Understanding these concepts helps you:

  1. Write better prompts - Know what the model "sees" in your input
  2. Debug AI behavior - Understand why you get certain outputs
  3. Optimize performance - Make informed decisions about parameters
  4. Build better applications - Choose the right tools and configurations

Think of it like driving a car. You don't need to be a mechanic, but knowing how the engine, brakes, and steering work makes you a better driver.

Attention Mechanisms: Teaching AI to Focus

The Problem Attention Solves

Imagine reading this sentence: "The cat sat on the mat because it was comfortable."

What does "it" refer to? You instantly know it's the cat (or possibly the mat). How? Your brain automatically pays attention to relevant context. AI models need to do the same thing.

How Attention Works (Simplified)

Let's break down the attention mechanism step by step:

Step 1: Query, Key, Value (QKV)

Think of attention like a database lookup:

┌─────────────────────────────────────────┐
│  "The cat sat on the mat"               │
└─────────────────────────────────────────┘
         │
         ├──> Query (Q):  "What am I looking for?"
         ├──> Key (K):    "What information do I have?"
         └──> Value (V):  "What are the actual values?"

For each word, the model creates three vectors:

  • Query: "What information do I need?"
  • Key: "What information do I have?"
  • Value: "The actual information content"
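
As a rough sketch of where those three vectors come from: in a transformer, each one is the token's embedding multiplied by a learned weight matrix. The matrices below (`W_q`, `W_k`, `W_v`) and the 3-dimensional embedding are made-up illustrative values, not weights from any real model, and the code is plain Python so it runs without dependencies.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

# Hypothetical 3-dimensional embedding for the token "cat"
cat_embedding = [0.8, 0.2, 0.6]

# Hypothetical projection matrices (2 x 3); a real model learns these in training
W_q = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 0.0]]
W_k = [[0.0, 1.0, 0.0],
       [0.0, 0.0, 1.0]]
W_v = [[0.5, 0.5, 0.0],
       [0.0, 0.5, 0.5]]

query = matvec(W_q, cat_embedding)  # what "cat" is looking for
key = matvec(W_k, cat_embedding)    # what "cat" offers for matching
value = matvec(W_v, cat_embedding)  # the content "cat" passes along

print(query, key, value)
```

The same embedding produces three different vectors because each matrix emphasizes different parts of it.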

Step 2: Calculating Attention Scores

The model compares the Query of one word with the Keys of every word in the sequence (including its own):

# Simplified attention calculation
import numpy as np

def softmax(scores):
    """Convert raw scores to probabilities that sum to 1"""
    exp_scores = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return exp_scores / exp_scores.sum()

def simple_attention(query, keys, values):
    """
    Compute the attention output for a single query

    Args:
        query: What we're looking for (vector)
        keys: What information is available (list of vectors)
        values: The actual content (list of vectors)
    """
    scores = []

    # Calculate similarity between query and each key
    for key in keys:
        # Dot product measures similarity
        # (real transformers also divide by the square root of the key dimension)
        score = np.dot(query, key)
        scores.append(score)

    # Normalize scores to probabilities (softmax)
    attention_weights = softmax(np.array(scores))

    # Weighted sum of values
    output = sum(weight * np.asarray(value)
                 for weight, value in zip(attention_weights, values))

    return output

# Illustrative attention weights for "it" looking at context
# (made-up numbers, not from a real model; they sum to 1):
# "The"   -> 0.05
# "cat"   -> 0.45  ← High attention!
# "sat"   -> 0.10
# "on"    -> 0.05
# "the"   -> 0.05
# "mat"   -> 0.30  ← Some attention
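
In matrix form, the same three steps (score, normalize, combine) are usually written as scaled dot-product attention. Here Q, K and V stack the query, key and value vectors for all tokens, and d_k is the key dimension. The division by the square root of d_k is part of the standard formulation and is left out of the simplified code above.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```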

Step 3: Visual Representation

Here's how attention flows when processing "The cat sat on the mat":

          Attention Weights (darker = stronger)

        The  cat  sat  on  the  mat
The     ██   ░░   ░░   ░░   ░░   ░░
cat     ░░   ███  ░░   ░░   ░░   ░░
sat     ░░   ██   ███  ██   ░░   ░░
on      ░░   ░░   ██   ███  ░░   ██
the     ░░   ░░   ░░   ░░   ███  ██
mat     ░░   ██   ░░   ░░   ██   ███

Legend: ███ Strong  ██ Medium  ░░ Weak

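Each row shows what a word pays attention to (the pattern is illustrative, not taken from a real model). Notice how "sat" attends to "cat" (its subject) and "on" (the preposition that follows), while "mat" attends back to "cat" and "the".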
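
A grid like this can be reproduced in a few lines. The sketch below uses made-up 2-dimensional vectors for three tokens and, as a simplifying assumption, uses the same vector as both query and key. It only aims to show that each row is a softmax over similarity scores and therefore sums to 1.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical toy vectors; real models use learned, high-dimensional ones
tokens = ["cat", "sat", "mat"]
vectors = {"cat": [2.0, 0.0], "sat": [1.0, 1.0], "mat": [0.0, 2.0]}

# Row i holds the weights token i assigns to every token (itself included)
attention_matrix = [
    softmax([dot(vectors[q], vectors[k]) for k in tokens])
    for q in tokens
]

for token, row in zip(tokens, attention_matrix):
    print(f"{token:>4}", " ".join(f"{w:.2f}" for w in row))
```

With these toy vectors, "cat" and "mat" attend mostly to themselves, while "sat" sits halfway between them and spreads its attention evenly.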

Multi-Head Attention: Multiple Perspectives

Real transformer models use multi-head attention. Think of it as having multiple sets of eyes, each free to learn a different pattern. The roles below are an idealized illustration; the heads in a trained model are rarely this tidy:

Head 1: Focuses on subject-verb relationships
Head 2: Focuses on object relationships  
Head 3: Focuses on temporal/spatial relationships
Head 4: Focuses on semantic similarity
... (8 heads in the original Transformer, 12-16 in BERT; larger models use more)
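
One way to picture the mechanics: each token's vector is cut into equal slices, every head runs attention on its own slice, and the head outputs are concatenated. The sketch below assumes 2 heads over 4-dimensional toy vectors and skips the learned projection matrices that real transformers apply before and after this step.

```python
import math

def softmax(scores):
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head attention: weighted sum of values by query-key similarity."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def split_heads(vector, num_heads):
    """Cut a vector into num_heads equal slices."""
    size = len(vector) // num_heads
    return [vector[i * size:(i + 1) * size] for i in range(num_heads)]

def multi_head_attention(query, keys, values, num_heads=2):
    q_heads = split_heads(query, num_heads)
    k_heads = [split_heads(k, num_heads) for k in keys]
    v_heads = [split_heads(v, num_heads) for v in values]
    output = []
    for h in range(num_heads):
        head_out = attend(q_heads[h],
                          [k[h] for k in k_heads],
                          [v[h] for v in v_heads])
        output.extend(head_out)  # concatenate the head outputs
    return output

# Hypothetical 4-dimensional toy vectors for three tokens
token_vectors = [[1.0, 0.0, 0.0, 1.0],
                 [0.0, 1.0, 1.0, 0.0],
                 [1.0, 1.0, 0.0, 0.0]]
result = multi_head_attention(token_vectors[0], token_vectors, token_vectors)
print(result)
```

Because each head sees only its own slice, the two heads can weight the same tokens differently.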

Why This Matters for You

Understanding attention helps you:

  1. Write better prompts: Place important context near your question
  2. Understand context limits: In long prompts, details far from where they are needed are more likely to be missed
  3. Debug outputs: Know what the model "looked at" when generating responses

Pro Tip: When writing long prompts, put the most critical information at the beginning or the end. Models tend to use information buried in the middle of a long context less reliably.

Embedding Spaces: Giving Meaning Coordinates

From Words to Numbers

Computers can't understand words directly—they need numbers. But not just any numbers. We need numbers that capture meaning.

The Embedding Space Concept

Think of an embedding space as a semantic map where similar concepts are close together:

      Dimension 2 (Formality)
           ↑
    Formal │     CEO ●
           │          
           │     Manager ●
           │               ● Developer
           │          
    Casual │     Boss ●     
           │               ● Programmer
           │          
           │     Coder ●
           └────────────────────────→
                              Dimension 1
                           (Technical)

In this simplified 2D space:

  • X-axis (Dimension 1): Technical vs. Non-technical
  • Y-axis (Dimension 2): Formal vs. Casual

Real embeddings have hundreds or thousands of dimensions, capturing nuances like:

  • Sentiment (positive/negative)
  • Domain (medical, legal, technical)
  • Part of speech (noun, verb)
  • Abstraction level (concrete, abstract)

How Embeddings Are Created

Here's a simplified example of how text becomes embeddings:

# Simplified embedding concept
def create_embedding(text, model):
    """
    Convert text to a high-dimensional vector

    Real models use neural networks trained on massive datasets
    This is a conceptual example
    """
    # Tokenize text
    tokens = tokenize(text)  # ["cat", "sat", "mat"]

    # Each token gets a vector (e.g., 768 dimensions for BERT)
    embeddings = []
    for token in tokens:
        # Look up or compute embedding vector
        embedding = model.encode(token)
        embeddings.append(embedding)

    return embeddings

# Example output (simplified to 4 dimensions):
# "cat" -> [0.8, 0.2, 0.6, 0.1]
# "dog" -> [0.7, 0.3, 0.5, 0.2]  # Similar to cat!
# "car" -> [0.1, 0.8, 0.2, 0.9]  # Very different
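
For token-level embeddings, the "look up" step is literally a table lookup: each token selects one row of a learned matrix. The vocabulary and table below are invented for illustration and reuse the 4-dimensional numbers from the example output above.

```python
# Hypothetical vocabulary mapping tokens to row indices
vocab = {"cat": 0, "dog": 1, "car": 2}

# Hypothetical embedding table: one row per token
# (in a real model these rows are learned during training)
embedding_table = [
    [0.8, 0.2, 0.6, 0.1],  # cat
    [0.7, 0.3, 0.5, 0.2],  # dog
    [0.1, 0.8, 0.2, 0.9],  # car
]

def embed(tokens):
    """Map each token to its row in the embedding table."""
    return [embedding_table[vocab[token]] for token in tokens]

vectors = embed(["cat", "car"])
print(vectors)
```

Sentence-level models go a step further and combine the token vectors into a single vector for the whole text.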

Measuring Similarity: Cosine Similarity

To find similar words, we measure the angle between vectors:

import numpy as np

def cosine_similarity(vec1, vec2):
    """
    Calculate similarity between two vectors
    Returns value between -1 (opposite) and 1 (identical)
    """
    dot_product = np.dot(vec1, vec2)
    magnitude1 = np.linalg.norm(vec1)
    magnitude2 = np.linalg.norm(vec2)

    return dot_product / (magnitude1 * magnitude2)

# Example usage
cat_embedding = [0.8, 0.2, 0.6, 0.1]
dog_embedding = [0.7, 0.3, 0.5, 0.2]
car_embedding = [0.1, 0.8, 0.2, 0.9]

print(cosine_similarity(cat_embedding, dog_embedding))  # ≈ 0.98 - Very similar!
print(cosine_similarity(cat_embedding, car_embedding))  # ≈ 0.36 - Different

Semantic Search in Action

This is how embedding-based search works:

User Query: "How to train a neural network?"
           ↓
      [Embedding]
           ↓
    ┌──────────────────────────────┐
    │  Find nearest neighbors in   │
    │  embedding space             │
    └──────────────────────────────┘
           ↓
    Results ranked by distance:
    1. "Neural network training guide" (distance: 0.12)
    2. "Deep learning tutorial" (distance: 0.18)
    3. "Machine learning basics" (distance: 0.24)
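
The ranking step in that diagram can be sketched as follows. The document vectors are hand-written stand-ins (a real system would get them from an embedding model), and "distance" is taken to be cosine distance, meaning 1 minus cosine similarity, which is one common choice.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: 0 means same direction, larger means less similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings; real ones come from a trained model
documents = {
    "Neural network training guide": [0.9, 0.8, 0.1],
    "Deep learning tutorial": [0.7, 0.9, 0.3],
    "Cooking pasta at home": [0.1, 0.0, 0.9],
}
query_vector = [0.9, 0.7, 0.1]  # stand-in for "How to train a neural network?"

# Sort document titles from nearest to farthest
ranked = sorted(documents,
                key=lambda title: cosine_distance(query_vector, documents[title]))

for title in ranked:
    distance = cosine_distance(query_vector, documents[title])
    print(f"{title} (distance: {distance:.2f})")
```

Production systems replace the full sort with an approximate nearest-neighbor index, but the ranking idea is the same.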

Vector Math: The Magic of Embeddings

One of the coolest properties of embeddings is vector arithmetic:

# Famous example:
king - man + woman ≈ queen

# More examples:
paris - france + italy ≈ rome
walking - walk + swim ≈ swimming
bigger - big + small ≈ smaller

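This works, approximately, because relationships such as gender, country-capital, or verb tense show up as roughly consistent directions in the embedding space. The answer is the nearest word vector to the computed point rather than an exact match, and not every analogy holds.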
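
To make the arithmetic concrete, here is a toy version with hand-built vectors whose three dimensions stand for (royalty, male, female). These vectors are constructed so the analogy comes out exactly; learned embeddings only satisfy it approximately.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-built toy vectors: (royalty, male, female)
vectors = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.0, 0.1, 0.1],
}

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine_similarity(target, vectors[word]))

print(analogy("king", "man", "woman"))  # queen
```

Excluding the input words from the candidates is the usual convention; without it, the nearest vector is often one of the inputs.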

Why This Matters for You

Understanding embeddings helps you:

  1. Build better semantic search: Use embeddings instead of keyword matching
  2. Understand AI "understanding": Know what "similar" means to the model
  3. Optimize RAG applications: Choose the right embedding model for your domain
  4. Debug retrieval issues: Understand why certain documents are retrieved

Real-World Application:

# Using embeddings for semantic search
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Create embeddings for documents
documents = [
    "Python is a programming language",
    "Machine learning uses algorithms",
    "Neural networks are inspired by the brain"
]

doc_embeddings = model.encode(documents)

# Search query
query = "What is AI?"
query_embedding = model.encode(query)

# Score every document against the query
similarities = cosine_similarity([query_embedding], doc_embeddings)[0]

# Print documents from most to least similar
# (the exact ranking depends on the embedding model)
for idx in similarities.argsort()[::-1]:
    print(f"{similarities[idx]:.3f}  {documents[idx]}")

Temperature and Sampling: Controlling Creativity

The Probability Distribution Problem

When generating text, AI models don't just pick the "best" word—they work with probabilities:

Model's prediction for next word after "The cat":

sat     ████████████████ 40%
walked  ██████████ 25%
jumped  ███████ 18%
ran     ████ 10%
flew    ██ 5%
...     █ 2%

Question: How do we choose the next word?
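
The simplest answer is to draw a word at random in proportion to its probability. A minimal sketch using the distribution above, with the 2% tail folded into a single placeholder token (an assumption made so the weights cover the whole distribution):

```python
import random
from collections import Counter

# Distribution from the example; "<other>" stands in for the remaining 2%
next_word_probs = {
    "sat": 0.40, "walked": 0.25, "jumped": 0.18,
    "ran": 0.10, "flew": 0.05, "<other>": 0.02,
}

rng = random.Random(42)  # fixed seed so runs are repeatable
words = list(next_word_probs)
weights = list(next_word_probs.values())

# Draw 10,000 next words and count how often each one appears
samples = rng.choices(words, weights=weights, k=10_000)
counts = Counter(samples)

for word in words:
    print(f"{word:8s} {counts[word] / len(samples):.1%}")
```

The observed frequencies track the probabilities: likely words dominate, but unlikely ones still appear now and then.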

Temperature: The Creativity Knob

Temperature controls how "random" or "creative" the output is:

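import numpy as np

def softmax(logits):
    """Convert logits to probabilities that sum to 1"""
    exp_logits = np.exp(logits - np.max(logits))  # subtract max for numerical stability
    return exp_logits / exp_logits.sum()

def apply_temperature(logits, temperature):
    """
    Adjust probability distribution based on temperature

    Args:
        logits: Raw model scores for each possible next token
        temperature: Float value greater than 0 (typically 0.1 to 2.0)
            - Low (0.1-0.5): More deterministic, focused
            - Medium (0.7-1.0): Balanced
            - High (1.5-2.0): More random, creative
    """
    # Temperature 0 would divide by zero; it is usually handled as greedy decoding
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")

    # Scale logits by temperature
    adjusted_logits = np.asarray(logits, dtype=float) / temperature

    # Convert to probabilities
    probabilities = softmax(adjusted_logits)

    return probabilities

# Example with temperature variations:
original_probs = [0.4, 0.25, 0.18, 0.10, 0.05, 0.02]
logits = np.log(original_probs)  # logits that reproduce these probabilities at T = 1

# Temperature = 0.5 (Low - More confident)
# apply_temperature(logits, 0.5) ≈ [0.60, 0.23, 0.12, 0.04, 0.01, 0.00]
# The top choice becomes even more dominant

# Temperature = 2.0 (High - More random)
# apply_temperature(logits, 2.0) ≈ [0.28, 0.22, 0.19, 0.14, 0.10, 0.06]
# More even distribution, more randomness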

Visual Comparison of Temperatures

Temperature = 0.1 (Near-deterministic)
sat     ████████████████████████████████████████ ~99%
walked  ▏ ~1%
jumped  ▏ <0.1%
...

Temperature = 1.0 (Balanced)
sat     ████████████████ 40%
walked  ██████████ 25%
jumped  ███████ 18%
...

Temperature = 2.0 (Creative)
sat     ███████████ ~28%
walked  █████████ ~22%
jumped  ████████ ~19%
ran     ██████ ~14%
...
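
One way to put a number on "more random" is the entropy of the distribution after temperature scaling: lower entropy means the probability mass sits on fewer tokens. The sketch below treats the example probabilities as the model's output at temperature 1 and, as an assumption, folds the tail into a sixth token.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a distribution: raise each p to the power 1/T, then renormalize.

    This is equivalent to dividing the logits by T before the softmax.
    """
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

def entropy_bits(probs):
    """Shannon entropy in bits; higher means more spread out."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

base_probs = [0.40, 0.25, 0.18, 0.10, 0.05, 0.02]

for temperature in (0.1, 0.5, 1.0, 2.0):
    adjusted = apply_temperature(base_probs, temperature)
    print(f"T={temperature}: top={adjusted[0]:.2f}, "
          f"entropy={entropy_bits(adjusted):.2f} bits")
```

Entropy rises with temperature, which is the numeric counterpart of the bars flattening out.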

Sampling Strategies

Beyond temperature, there are multiple ways to sample from the probability distribution:

1. Greedy Sampling (Temperature = 0)

Always pick the highest probability word:

import numpy as np

def greedy_sampling(probabilities):
    """Always select the most likely token"""
    return int(np.argmax(probabilities))

# Result: Deterministic but potentially repetitive
# "The cat sat on the mat. The cat sat on the mat..."

2. Top-K Sampling

Only consider the K most likely tokens:

import numpy as np

def top_k_sampling(probabilities, k=40):
    """
    Sample from only the top K most likely tokens

    Args:
        probabilities: Full probability distribution
        k: Number of top tokens to consider (default: 40)
    """
    # Accept plain lists as well as arrays
    probabilities = np.asarray(probabilities, dtype=float)

    # Get top K indices (argsort is ascending, so take the last k)
    top_k_indices = np.argsort(probabilities)[-k:]

    # Create new distribution with only top K
    top_k_probs = probabilities[top_k_indices]

    # Renormalize
    top_k_probs = top_k_probs / np.sum(top_k_probs)

    # Sample from reduced distribution
    return np.random.choice(top_k_indices, p=top_k_probs)

# Filters out unlikely tokens while maintaining diversity

3. Top-P (Nucleus) Sampling

Consider tokens until cumulative probability reaches P:

import numpy as np

def top_p_sampling(probabilities, p=0.9):
    """
    Sample from smallest set of tokens with cumulative probability >= p

    Args:
        probabilities: Full probability distribution
        p: Cumulative probability threshold (default: 0.9)
    """
    # Accept plain lists as well as arrays
    probabilities = np.asarray(probabilities, dtype=float)

    # Sort probabilities in descending order
    sorted_indices = np.argsort(probabilities)[::-1]
    sorted_probs = probabilities[sorted_indices]

    # Find cumulative probabilities
    cumsum = np.cumsum(sorted_probs)

    # Find cutoff where cumsum >= p; the clamp covers p = 1.0,
    # where rounding can leave the total just below the threshold
    cutoff_idx = min(int(np.searchsorted(cumsum, p)) + 1, len(cumsum))

    # Use only these top tokens
    nucleus_indices = sorted_indices[:cutoff_idx]
    nucleus_probs = sorted_probs[:cutoff_idx]

    # Renormalize and sample
    nucleus_probs = nucleus_probs / np.sum(nucleus_probs)
    return np.random.choice(nucleus_indices, p=nucleus_probs)

# Dynamically adjusts number of tokens based on distribution
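
That dynamic behavior is the main difference from top-k: the nucleus grows or shrinks with the model's confidence. A small sketch with two invented distributions over a 6-token vocabulary:

```python
def nucleus_size(probabilities, p=0.9):
    """Count how many of the most likely tokens are needed to reach cumulative probability p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probabilities, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probabilities)  # rounding kept the total just under p

# Hypothetical distributions over the same 6-token vocabulary
confident = [0.92, 0.03, 0.02, 0.01, 0.01, 0.01]  # model is nearly sure
uncertain = [0.20, 0.18, 0.17, 0.16, 0.15, 0.14]  # model is unsure

print(nucleus_size(confident))  # 1
print(nucleus_size(uncertain))  # 6
```

Top-k with k=3 would keep exactly three candidates in both cases, which is too many for the confident distribution and too few for the uncertain one.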

Practical Comparison

Prompt: "Write a story about a dragon"

Temperature=0.1, Greedy:
"Once upon a time, there was a dragon. The dragon lived in a cave.
The dragon was very large..."
→ Safe, predictable, possibly boring

Temperature=0.7, Top-P=0.9:
"In the misty peaks of Mount Kazak, there dwelt a dragon named Ember.
Unlike her fearsome kin, Ember had a peculiar hobby..."
→ Balanced creativity and coherence

Temperature=1.5, Top-K=50:
"Dragons! Flying purple guardians of the ancient moon crystals, 
dancing between quantum dimensions while singing operatic melodies..."
→ Creative but potentially incoherent

Choosing the Right Settings

| Use Case         | Temperature | Sampling        | Why                              |
|------------------|-------------|-----------------|----------------------------------|
| Code generation  | 0.1-0.3     | Greedy/Top-K=10 | Need correctness, not creativity |
| Creative writing | 0.7-1.2     | Top-P=0.9       | Balance creativity and coherence |
| Brainstorming    | 1.2-2.0     | Top-P=0.95      | Maximum diversity of ideas       |
| Factual Q&A      | 0.1-0.5     | Top-K=40        | Accuracy over creativity         |
| Chat assistant   | 0.7-0.9     | Top-P=0.9       | Natural but focused responses    |

Implementation Example

# Practical example using the OpenAI Python SDK (v1.x client interface)
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

def generate_with_control(prompt, use_case="balanced"):
    """Generate text with appropriate temperature settings"""

    # Note: OpenAI's documentation suggests adjusting temperature or top_p,
    # not both; the pairs below are kept to illustrate the idea
    settings = {
        "code": {"temperature": 0.2, "top_p": 0.1},
        "creative": {"temperature": 1.0, "top_p": 0.95},
        "balanced": {"temperature": 0.7, "top_p": 0.9},
        "factual": {"temperature": 0.3, "top_p": 0.5}
    }

    # Fall back to the balanced settings for unknown use cases
    config = settings.get(use_case, settings["balanced"])

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=config["temperature"],
        top_p=config["top_p"]
    )

    return response.choices[0].message.content

# Usage examples
code = generate_with_control("Write a Python function to sort a list", "code")
story = generate_with_control("Write a short story about AI", "creative")
answer = generate_with_control("What is machine learning?", "factual")

Why This Matters for You

Understanding temperature and sampling helps you:

  1. Control output quality: Match settings to your use case
  2. Debug unexpected outputs: Too random? Lower temperature. Too repetitive? Raise it.
  3. Optimize costs: Lower temperature makes output more reproducible, which can reduce retries; it does not change the price per token
  4. Build better applications: Implement dynamic temperature based on context

Putting It All Together

Now let's see how all three concepts work together in a real generative AI system:

The Complete Pipeline

User Input: "Write a poem about coding"
           ↓
    ┌─────────────────┐
    │  1. Embedding   │  Convert text to vectors
    └─────────────────┘
           ↓
    ┌─────────────────┐
    │  2. Attention   │  Focus on relevant context
    └─────────────────┘
           ↓
    ┌─────────────────┐
    │  3. Processing  │  Generate probability distribution
    └─────────────────┘
           ↓
    ┌─────────────────┐
    │  4. Sampling    │  Choose next token (temperature)
    └─────────────────┘
           ↓
    Output: "In lines of logic, bright and true..."
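
The four boxes can be strung together in a toy forward pass. Everything below is invented for illustration: the 4-word vocabulary, the 2-dimensional embeddings, and the shortcut of reusing the embedding table to score output tokens. A real model stacks many attention layers with learned weights.

```python
import math
import random

def softmax(scores):
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical vocabulary and embedding table
vocab = ["the", "programmer", "wrote", "code"]
embeddings = {"the": [0.1, 0.3], "programmer": [0.9, 0.4],
              "wrote": [0.7, 0.8], "code": [0.8, 0.2]}

def next_token(context, temperature=0.7, rng=random):
    # 1. Embedding: convert each context token to a vector
    vectors = [embeddings[token] for token in context]

    # 2. Attention: the last token queries every token in the context
    query = vectors[-1]
    weights = softmax([dot(query, key) for key in vectors])
    summary = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(2)]

    # 3. Processing: score every vocabulary token against the context summary
    logits = [dot(summary, embeddings[token]) for token in vocab]

    # 4. Sampling: apply temperature, then draw one token
    probs = softmax([logit / temperature for logit in logits])
    return rng.choices(vocab, weights=probs, k=1)[0], probs

token, probs = next_token(["the", "programmer"], rng=random.Random(0))
print(token, [round(p, 2) for p in probs])
```

Appending the chosen token to the context and calling `next_token` again gives the generation loop.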

A Detailed Example

Let's trace through generating one sentence:

Input: "The programmer"

Step 1: Embeddings

"The"        → [0.1, 0.3, 0.2, ...]  (768 dimensions)
"programmer" → [0.4, 0.8, 0.1, ...]  (768 dimensions)

Step 2: Attention

Computing attention for next word:
- Query: built from the last token, "programmer"
- Keys and Values: built from the tokens in the context ("The", "programmer")
- Illustrative attention weights:
  - "programmer" (0.8)
  - "The" (0.2)
- Patterns learned from training data (typical verbs, professional scenarios)
  live in the model's weights, not in the keys

Step 3: Prediction

Model outputs probability distribution:
"wrote"      → 0.25
"debugged"   → 0.18
"solved"     → 0.15
"created"    → 0.12
"fixed"      → 0.10
...

Step 4: Sampling (Temperature=0.7)

Adjusted probabilities (assuming the remaining tokens are "refactored" 0.08, "tested" 0.07, "deployed" 0.05):
"wrote"      → 0.31
"debugged"   → 0.20
"solved"     → 0.15
...

Selected: "wrote" (sampled from this distribution)

Step 5: Repeat

Now the input is "The programmer wrote"
→ Process continues for next word...

Interactive Visualization

Here's how you can experiment with these concepts:

# Complete example: Building a mini text generator
import numpy as np

class SimpleTextGenerator:
    def __init__(self, temperature=0.7, top_p=0.9):
        self.temperature = temperature
        self.top_p = top_p

    def get_next_token_probabilities(self, context):
        """Simulate model prediction (normally from neural network)"""
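        # A real model would compute this from `context`; this demo ignores
        # the context and returns the same fixed distribution every time,
        # so the generated text is just a sequence of sampled verbs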
        vocab = {
            "wrote": 0.25,
            "debugged": 0.18,
            "solved": 0.15,
            "created": 0.12,
            "fixed": 0.10,
            "refactored": 0.08,
            "tested": 0.07,
            "deployed": 0.05
        }
        return vocab

    def apply_temperature(self, token_probs):
        """Apply temperature scaling to a {token: probability} dict"""
        # Convert to numpy array
        tokens = list(token_probs.keys())
        probs = np.array(list(token_probs.values()))

        # Temperature scaling
        if self.temperature != 1.0:
            # Recover logits up to a constant (log of the probabilities)
            logits_array = np.log(probs + 1e-10)
            # Scale by temperature
            logits_array = logits_array / self.temperature
            # Back to probabilities (subtract max for numerical stability)
            probs = np.exp(logits_array - np.max(logits_array))
            probs = probs / np.sum(probs)

        return dict(zip(tokens, probs))

    def top_p_filter(self, probs):
        """Apply nucleus sampling"""
        tokens = list(probs.keys())
        prob_values = np.array(list(probs.values()))

        # Sort by probability
        sorted_indices = np.argsort(prob_values)[::-1]
        sorted_probs = prob_values[sorted_indices]

        # Cumulative sum
        cumsum = np.cumsum(sorted_probs)

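        # Find nucleus; the clamp covers top_p = 1.0, where rounding
        # can leave the cumulative total just below the threshold
        cutoff = min(int(np.searchsorted(cumsum, self.top_p)) + 1, len(cumsum))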

        # Filter
        nucleus_indices = sorted_indices[:cutoff]
        nucleus_probs = sorted_probs[:cutoff]
        nucleus_probs = nucleus_probs / np.sum(nucleus_probs)

        # Reconstruct dictionary
        return {tokens[i]: nucleus_probs[j] 
                for j, i in enumerate(nucleus_indices)}

    def generate_token(self, context):
        """Generate next token with temperature and sampling"""
        # Get base probabilities
        probs = self.get_next_token_probabilities(context)

        # Apply temperature
        probs = self.apply_temperature(probs)

        # Apply top-p sampling
        probs = self.top_p_filter(probs)

        # Sample
        tokens = list(probs.keys())
        probabilities = list(probs.values())
        chosen = np.random.choice(tokens, p=probabilities)

        return chosen, probs

    def generate_text(self, prompt, num_tokens=10):
        """Generate multiple tokens"""
        text = prompt

        for _ in range(num_tokens):
            token, probs = self.generate_token(text)
            text += " " + token

            # Print probabilities (for learning)
            print(f"\nContext: '{text}'")
            print("Top probabilities:")
            sorted_probs = sorted(probs.items(), 
                                 key=lambda x: x[1], 
                                 reverse=True)[:3]
            # Use a separate name so the chosen `token` is not overwritten
            for candidate, prob in sorted_probs:
                print(f"  {candidate}: {prob:.2%}")

        return text

# Experiment with different temperatures
print("=== Temperature = 0.1 (Focused) ===")
generator_low = SimpleTextGenerator(temperature=0.1, top_p=0.9)
result_low = generator_low.generate_text("The programmer", num_tokens=5)
print(f"\nResult: {result_low}")

print("\n=== Temperature = 1.5 (Creative) ===")
generator_high = SimpleTextGenerator(temperature=1.5, top_p=0.9)
result_high = generator_high.generate_text("The programmer", num_tokens=5)
print(f"\nResult: {result_high}")

Key Takeaways

Let's recap what we've learned:

1. Attention Mechanisms

  • What: A way for models to focus on relevant context
  • How: Query-Key-Value mechanism with weighted combinations
  • Why it matters: Enables understanding of long-range dependencies
  • Practical tip: Place important context at the start and end of prompts

2. Embedding Spaces

  • What: High-dimensional numerical representations of text
  • How: Neural networks map text to vectors where similar concepts are close
  • Why it matters: Enables semantic understanding and similarity search
  • Practical tip: Use embeddings for semantic search instead of keyword matching

3. Temperature and Sampling

  • What: Methods for controlling randomness in output generation
  • How: Temperature scales probabilities; sampling strategies filter options
  • Why it matters: Controls creativity vs. coherence trade-off
  • Practical tip: Lower temperature for code/facts, higher for creative content

Quick Reference Guide

# Your go-to settings cheat sheet

# For accuracy (code, facts, translations):
temperature = 0.1-0.3
top_p = 0.1-0.5
strategy = "greedy or top-k with k=10"

# For balanced output (chat, Q&A):
temperature = 0.7-0.9
top_p = 0.9
strategy = "top-p (nucleus)"

# For creativity (stories, brainstorming):
temperature = 1.0-1.5
top_p = 0.95
strategy = "top-p with high diversity"

# For maximum exploration:
temperature = 1.5-2.0
top_p = 0.95-1.0
strategy = "top-k with k=100"

Next Steps

Now that you understand the math behind generative AI, here are some ways to apply this knowledge:

  1. Experiment: Try different temperature settings in your prompts (most APIs support this)
  2. Build: Create a semantic search system using embeddings
  3. Optimize: Tune parameters for your specific use case
  4. Learn more: Explore transformer architectures and self-attention in depth

Recommended Resources

Tools to Try

  • Embeddings: sentence-transformers, OpenAI embeddings API
  • Visualization: TensorBoard, LangSmith, W&B
  • Experimentation: Hugging Face Transformers, LangChain

Conclusion

The math behind generative AI might seem complex at first, but it boils down to three key concepts:

  1. Attention: Teaching AI to focus on what matters
  2. Embeddings: Representing meaning as coordinates in space
  3. Sampling: Controlling the creativity-coherence balance

You don't need a PhD to work with AI—you just need to understand these fundamental concepts and how to apply them. Whether you're building RAG applications, fine-tuning models, or just writing better prompts, this knowledge gives you superpowers.

Remember: AI is a tool, and understanding how it works makes you a better craftsperson.


What's your experience with these concepts? Have you experimented with temperature settings or built semantic search? Share your thoughts and questions in the comments below!

If you found this helpful, follow me for more deep dives into AI concepts explained simply. Next up: "Understanding Token Limits and Context Windows."


Cover image: Photo by DeepMind on Unsplash
