
Thousand Miles AI

The LLM Interview Cheat Sheet — 10 Questions That Actually Come Up

You've used ChatGPT, built a RAG pipeline, maybe even fine-tuned a model. But can you explain how attention actually works when the interviewer asks? Here are 10 LLM questions that keep showing up in interviews — with answers that actually make sense.


It's 10 PM the night before your Google / Meta / OpenAI LLM engineer interview. You're scrolling through your notes on transformers, and your mind goes blank when you try to explain self-attention out loud. You panic. You Google "explain attention mechanisms" and spend the next hour reading academic papers that feel like they're written in a different language.

By midnight, you're convinced you don't know anything.

Here's the truth: you probably know more than you think. You've fine-tuned models, built RAG pipelines, maybe even experimented with prompt engineering. But when an interviewer asks "How does self-attention work?" or "When would you use fine-tuning vs RAG?", panic takes over and you blank out.

This post is your cheat sheet. Not the academic definitions. The answers that actually work in an interview — clear, concise, and confident.

Why You Should Care

LLM roles are exploding right now. Google, Meta, OpenAI, Anthropic, Microsoft — they're all hiring ML engineers who can talk intelligently about transformers, RAG, fine-tuning, and hallucination. These aren't niche roles anymore. They're the growth area in tech.

These 10 questions (or variations of them) are the gatekeepers for these roles. They appear across companies because they separate people who understand LLMs from people who just know how to use them.

The good news: these questions have predictable answers. You just need to know how to explain them.


The 10 Questions (+ Answers You Can Deliver)

1. Explain self-attention. Why can't you just use RNNs?

Why they ask: This is the foundation of everything. If you can't explain this clearly, everything else falls apart.

The answer:
Self-attention lets a token look at every other token in the sequence at once and assign weights to determine which ones matter. It answers: "Given this token, which other tokens should I pay attention to?"

Here's the concrete difference:

  • RNN (old way): Processes tokens one at a time, left to right. Token at position 10 struggles to "remember" token at position 1 because information has to flow through 9 steps. Long dependencies get lost.
  • Self-attention (new way): Token at position 10 directly computes its similarity to all other tokens (positions 1–9) and decides their importance instantly. No information decay.

The formula you don't need to memorize, but should understand:

Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V

Translation: Take the dot product of your query (Q) with every key (K), scale by sqrt(d_k) so the scores don't blow up, normalize with softmax so the weights sum to 1, then take the weighted sum of the values (V).

The gotcha: Interviewers might ask "What's the computational cost?" Answer: O(n²) where n is sequence length. That's why long context windows are expensive, and why companies invest in optimized attention (multi-query attention, FlashAttention).
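The formula maps almost line-for-line to code. Here's a minimal pure-Python sketch of single-head attention (lists instead of real tensors, and no learned projection matrices):

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, written out for lists of vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:                      # one row of attention weights per query token
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)          # weights over all tokens, summing to 1
        out.append([sum(wj * v[i] for wj, v in zip(w, V))
                    for i in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0]]         # 2 tokens, d_k = 2
out = attention(Q, Q, [[1.0, 2.0], [3.0, 4.0]])
print(len(out), len(out[0]))  # 2 2 -- one output vector per query token
```

Note the nested loop over queries and keys: that's the O(n²) cost made visible.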


2. What is positional encoding and why do we need it?

Why they ask: Transformers are permutation-invariant (word order doesn't matter by default). They want to know if you understand why that's broken and how we fix it.

The answer:
Self-attention doesn't inherently know position. If you feed "dog bit man" or "man bit dog", the attention mechanism computes the same weights. The model needs to know which word is first, second, third.

Positional encoding adds information about position to each token's embedding. The most common method (from the original paper) uses sin/cos waves at different frequencies:

  • Low frequencies encode large-scale positions (is this early or late in the sequence?)
  • High frequencies encode local positions (is this token next to another one?)

This way, the model can learn relationships like "noun at position 2, verb at position 4" instead of just "noun, verb".

The gotcha: There's no single best positional encoding. Some models use learned positional embeddings. Others use relative position bias. What matters is that you know the problem exists.
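To make the sin/cos idea concrete, here's a small sketch of the original sinusoidal encoding (pure Python; the base of 10000 comes from the paper, and the table would simply be added to the token embeddings):

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding from the original transformer paper: even dims
    use sin, odd dims use cos, at geometrically spaced frequencies."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # higher i -> lower frequency -> encodes coarser position info
            angle = pos / (10000 ** ((i // 2 * 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

pe = positional_encoding(seq_len=16, d_model=8)
print(pe[0])  # position 0 -> [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```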


3. Self-attention, multi-head attention — what's the difference?

Why they ask: This trips up a lot of candidates. They use the term "attention head" without understanding what it does.

The answer:
Self-attention is the basic mechanism (Q, K, V multiply and softmax).

Multi-head attention is running the same self-attention operation multiple times in parallel, each with different weight matrices, then combining the results.

Why? Because different "heads" can learn different patterns:

  • One head might learn to focus on nearby words (local grammar)
  • Another head might learn to focus on distant words (long-range references)
  • A third head might learn to focus on certain semantic relationships

Think of it like having 8 different "experts" all looking at the same input but with different lenses.

Formula-wise: Instead of one attention output, you get multiple outputs and concatenate them.

The gotcha: Having 8 heads doesn't mean 8× the understanding. Empirically, 8–16 heads work well. More isn't always better (there are diminishing returns).
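The split-into-heads / concatenate-back plumbing is the part candidates fumble. Here's just that reshaping, with toy lists instead of tensors (the per-head attention math itself is omitted):

```python
def split_heads(x, h):
    """Split each token's d_model vector into h chunks of d_model // h dims."""
    d = len(x[0]) // h
    return [[row[i * d:(i + 1) * d] for row in x] for i in range(h)]

def concat_heads(heads):
    """Concatenate per-head outputs back into one d_model vector per token."""
    return [sum((head[t] for head in heads), []) for t in range(len(heads[0]))]

x = [[float(i) for i in range(8)] for _ in range(3)]  # 3 tokens, d_model = 8
heads = split_heads(x, h=2)                           # 2 heads, 4 dims each
print(len(heads), len(heads[0]), len(heads[0][0]))    # 2 3 4
print(concat_heads(heads) == x)                       # True -- split/concat is lossless
```

The key point: each head sees a lower-dimensional slice, runs its own attention, and the concatenated result is the same size as the input.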


4. Explain the transformer architecture in 30 seconds.

Why they ask: They want to know if you can break down complexity. If you ramble for 5 minutes, they think you don't understand the core.

The answer (say this fast):
Transformer has two parts: encoder and decoder.

Encoder: Takes input text, runs it through self-attention (to let tokens attend to each other), then through a feed-forward network. Do this 12–24 times (stacking layers). Output: rich representation of the input.

Decoder: Takes target tokens, runs self-attention (but masked so it can't look ahead), then cross-attention (attends to encoder output), then feed-forward. Do this 12–24 times. Output: next token prediction.

In one sentence: "Stack self-attention and feed-forward layers, apply masking in the decoder, and train to predict the next token."


5. What is tokenization and why does it matter?

Why they ask: Tokenization is the first step. Get it wrong and everything downstream breaks. They want to know if you've thought about this.

The answer:
Tokenization converts raw text into tokens (usually subwords) that the model can process.

"Hello world" might become ["Hel", "lo", "world"] or ["Hello", "world"] depending on the tokenizer.

Why subwords instead of just words?

  • Rare words: If the tokenizer has never seen "pneumonia", breaking it into ["pneu", "monia"] lets the model handle it anyway
  • Efficiency: Fewer tokens = faster processing
  • Spelling variations: "color" and "colour" map to similar tokens

Two main approaches:

  • BPE (Byte Pair Encoding): Used by GPT models. Starts from characters and repeatedly merges the most frequent adjacent pair
  • WordPiece: Used by BERT. Similar idea, but picks merges that most increase the likelihood of the training data rather than raw pair frequency

The gotcha: Different models use different tokenizers. GPT-4 uses a different tokenizer than GPT-3. This matters for token counting, context window size, and fine-tuning.
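Being able to sketch one BPE merge step is a nice flex here. A toy version (illustrative only — real tokenizers work at the byte level and keep the learned merge list for encoding new text):

```python
from collections import Counter

def bpe_merge_step(words):
    """One BPE training step: find the most frequent adjacent symbol pair
    across the corpus and merge it everywhere. `words` maps a word
    (as a tuple of current symbols) to its corpus frequency."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    if not pairs:
        return words, None
    best = max(pairs, key=pairs.get)
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                out.append(word[i] + word[i + 1])   # merge the winning pair
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged, best

vocab = {("l","o","w"): 5, ("l","o","w","e","r"): 2, ("n","e","w","e","s","t"): 6}
vocab, pair = bpe_merge_step(vocab)
print(pair)  # ('w', 'e') -- the most frequent adjacent pair (8 occurrences)
```

Run this repeatedly and frequent substrings like "low" or "est" become single tokens, which is exactly why rare words still decompose gracefully.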


6. Explain the difference between fine-tuning and RAG. When would you use each?

Why they ask: This separates people building LLM products from people who understand the tradeoffs. It's a systems thinking question.

The answer:

| Aspect | Fine-tuning | RAG |
| --- | --- | --- |
| What it does | Adjusts model weights on your task-specific data | Retrieves relevant docs and adds them to the prompt before generation |
| Cost | Expensive (GPU hours, time) | Cheap (just needs retrieval + inference) |
| Speed | Slow to deploy | Fast to iterate |
| Knowledge cutoff | Frozen at training time (can be months/years old) | Can include live, up-to-date information |
| When to use | Specific writing style, domain-specific reasoning, behavior you can't prompt into the model | Factual Q&A, company docs, changing information |

The real answer: Most of the time, start with RAG. It's faster to build and easier to maintain. Use fine-tuning only when:

  1. You have lots of labeled examples (1000+)
  2. You need consistent style/format
  3. RAG isn't getting you there
  4. You have the infrastructure to maintain a custom model

Example: Customer support chatbot? RAG + the company's knowledge base. Custom code generation for your codebase? Fine-tuning might be worth it.
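At its core, RAG is just retrieve-then-prompt. Here's a deliberately naive sketch — word-overlap scoring stands in for embedding-based vector search, and the document strings are made up for illustration:

```python
def retrieve(query, docs, k=1):
    """Rank docs by naive word overlap with the query (stand-in for vector search)."""
    def score(doc):
        q, d = set(query.lower().split()), set(doc.lower().split())
        return len(q & d)
    return sorted(docs, key=score, reverse=True)[:k]

def build_prompt(query, docs, k=1):
    """Stuff the top-k retrieved docs into the prompt as grounding context."""
    context = "\n".join(retrieve(query, docs, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Our refund window is 30 days from purchase.",
    "Standard shipping takes 3 to 5 business days.",
]
print(build_prompt("What is the refund window", docs))
```

Swap in an embedding model for `score` and an LLM call on the prompt, and you have the real pipeline — the architecture doesn't change.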


7. What causes hallucination in LLMs and how do you prevent it?

Why they ask: Hallucination is the biggest issue in production LLM systems. They want to know if you've dealt with it.

The answer:

What is hallucination? The model generates confident, fluent text that's completely false. Not random gibberish — plausible-sounding facts that are wrong.

Why it happens:

  • The model predicts the next most-likely token based on pattern matching, not factual knowledge
  • It hasn't learned the boundary between "I know this" and "I'm guessing"
  • It's trained to be coherent, not accurate

How to prevent it (in order of effectiveness):

  1. RAG (best solution): Give the model a document to read from. Now it can only hallucinate based on what's in that document. Most controllable.

  2. Prompt engineering: Explicit instructions like "Only answer based on the provided context" or "If unsure, say 'I don't know'" help a bit. But models still hallucinate.

  3. Fine-tuning on high-quality data: Train the model on examples where it's penalized for hallucinating. Helps but doesn't fully solve it.

  4. Fact-checking layer: After generation, run the output through a separate fact-checker (another model or rule-based system).

  5. Temperature control: Lower temperature makes the model more confident in likely tokens, reduces randomness. But doesn't fix hallucination.

The honest answer: You can't eliminate hallucination. You can reduce it. RAG is your best bet.
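The temperature point from above is easy to demo. A quick sketch showing how dividing logits by T reshapes the softmax distribution:

```python
import math

def softmax_with_temperature(logits, T):
    """Divide logits by T before softmax. T < 1 sharpens the distribution
    (more deterministic); T > 1 flattens it (more random)."""
    scaled = [l / T for l in logits]
    m = max(scaled)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, T=1.0))   # moderate spread over 3 tokens
print(softmax_with_temperature(logits, T=0.2))   # nearly all mass on the top token
```

This is why low temperature reduces randomness but can't fix hallucination: if the most likely token is wrong, sharpening just makes the model more confidently wrong.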



8. How would you evaluate an LLM's quality?

Why they ask: Generating text is easy. Knowing if it's good is hard. They want to know if you've thought about measurement.

The answer:
Depends on the task. There's no one metric.

Automatic metrics (cheap, noisy):

  • BLEU, ROUGE: Compare generated text to reference text word-by-word. Works for translation, summarization. Penalizes paraphrasing. Bad for open-ended tasks.
  • BERTScore: Uses embeddings instead of exact word match. More forgiving. Better than BLEU.
  • Exact Match (EM), F1: For QA. Did the model extract the right answer?

Manual evaluation (expensive, signal-rich):

  • Human raters: Have people score outputs (1–5) on relevance, accuracy, tone. Gold standard. Requires budget.
  • Rubric-based: Define criteria (factuality, clarity, completeness) and score against them.

LLM-as-a-Judge (emerging, controversial):

  • Use a strong LLM (GPT-4) to score outputs from a weaker LLM. Fast and surprisingly good, but can be circular (errors compound).

Business metrics:

  • For a chatbot: user satisfaction, conversation length, return rate
  • For a code generator: does generated code compile? Does it pass tests?

The honest answer: Use multiple signals. No single metric tells the full story.
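EM and F1 are simple enough to compute by hand. A SQuAD-style sketch (whitespace tokenization and lowercasing only; real evaluation scripts also strip punctuation and articles):

```python
from collections import Counter

def exact_match(pred, gold):
    """1 if prediction and reference match after basic normalization, else 0."""
    return int(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    """Token-overlap F1, as in SQuAD-style QA evaluation."""
    p, g = pred.lower().split(), gold.lower().split()
    common = Counter(p) & Counter(g)          # multiset intersection of tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))                    # 1
print(round(token_f1("in Paris France", "Paris"), 2))   # 0.5
```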


9. Explain what's happening in a forward pass through a transformer.

Why they ask: They want to verify you can trace through the actual computation. Not just regurgitate definitions.

The answer:


Step by step:

  1. Tokenization: "Hello world" → 101, 7592, 2088
  2. Embedding: Each token ID maps to a d-dimensional vector (e.g., 768D for BERT)
  3. Positional encoding: Add sin/cos vectors so the model knows position
  4. Transformer block: Run through self-attention, feed-forward, repeat 12+ times. Each layer transforms the embeddings, extracting deeper meaning
  5. Output layer: Linear layer that converts final embeddings to logits (scores) for each possible next token
  6. Softmax: Convert logits to probabilities summing to 1
  7. Sampling: Draw the next token from that distribution (or greedily take the highest-probability one)
  8. Repeat: Feed the new token back in, keep going

The gotcha: During inference, you don't recompute attention over all previous tokens every time (too expensive). You use KV caching: store the keys and values from previous tokens, reuse them, only compute for new tokens.
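The KV-cache trick is easier to see in code: store each token's key and value once, and at every decoding step run only the new token's query against the cache. A toy single-head sketch with identity "projections" (illustrative, not a real transformer):

```python
import math

def attend(q, Ks, Vs):
    """One query against the cached keys/values: softmax(q·K / sqrt(d)) · V."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in Ks]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    w = [x / total for x in w]
    return [sum(wj * v[i] for wj, v in zip(w, Vs)) for i in range(d)]

# KV cache: each step appends the new token's k, v and attends over the cache,
# so per-step work is O(n) instead of recomputing all n^2 pairs from scratch.
K_cache, V_cache = [], []
for step, x in enumerate([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]):
    q = k = v = x        # toy "projections": identity instead of learned matrices
    K_cache.append(k)
    V_cache.append(v)
    out = attend(q, K_cache, V_cache)
    print(step, [round(o, 3) for o in out])
```

The memory cost of the cache (keys and values for every layer and every past token) is exactly why long-context inference is expensive.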


10. What's the difference between base models and instruction-tuned models? Why do we need both?

Why they ask: This is about understanding the training pipeline and product strategy. It separates engineers from researchers.

The answer:

Base models (like GPT-3, LLaMA):

  • Trained on next-token prediction on huge internet text
  • Excellent at patterns and language
  • Terrible at following instructions
  • If you ask "Tell me a joke", it will continue text in a way that follows common patterns, not necessarily tell a joke
  • Useful for: creative writing, text completion, in-context learning

Instruction-tuned models (like ChatGPT, Llama-2-Chat):

  • Take a base model
  • Fine-tune it on (instruction, response) pairs where responses are aligned with what users want
  • Also fine-tune with RLHF (Reinforcement Learning from Human Feedback) to penalize bad outputs
  • Follows instructions reliably
  • Useful for: chatbots, Q&A, customer support

Why both exist:

  • Base models are research tools. They're the raw material.
  • Instruction-tuned models are products. They're what users interact with.
  • Sometimes you want a base model (if you're doing research or building something unusual). Usually you want instruction-tuned (if you're shipping to users).

The gotcha: Fine-tuning an instruction-tuned model on new data can degrade instruction-following. This is catastrophic forgetting. You need to be careful about the training setup.


Common Gotchas (Things Candidates Mess Up)

  1. Confusing attention with RNNs: Attention is not sequential. RNNs are sequential. Don't say "attention is better because it's faster at each step" — say "it's faster overall because steps are parallelizable".

  2. Overstating transformer improvements: Transformers are great at long context, but they have O(n²) memory. This is a real limitation. Don't pretend it doesn't exist.

  3. Assuming fine-tuning is always the answer: Most people reach for fine-tuning too early. RAG, prompting, and in-context learning go further than most engineers think.

  4. Saying "more parameters = better": Scaling helps, but data quality and training setup matter just as much. A 7B model trained right beats a 70B model trained poorly.

  5. Forgetting the practical constraints: Interviewers care about inference cost and latency. Academic perfection doesn't matter if you can't serve it.

  6. Not understanding your own tools: If you've used OpenAI API, know its pricing, latency, rate limits. If you've fine-tuned on Hugging Face, know how long it takes and what it costs. Specifics matter.


What to Do Next

  1. Practice explaining these answers out loud. Not reading — speaking. Your brain works differently. You'll stumble on things you thought you understood.

  2. Build something. Try a RAG system, fine-tune a model on your own data, or build a chatbot. Theory is fine, but interviewers test your judgment. You get that from building.

  3. Read the original papers lightly. Not cover-to-cover. Read "Attention is All You Need" (Vaswani et al., 2017) for context. Skim the abstract and architecture section. You don't need to memorize it.

  4. Know your specific tech stack. If you're interviewing at a company, know what models they use. Google? PaLM and Gemini. Meta? LLaMA. OpenAI? GPT-4. Anthropic? Claude. Know the positioning. It shows you've done your homework.

  5. Practice system design questions. "Design a chatbot for a healthcare provider" or "Design a code generation service." These combine everything. Most interviews include one.


The Night Before

You're going to be nervous. That's normal. Everyone is.

The difference between people who pass and people who don't isn't knowledge — it's clarity. You probably know 80% of what you need. You just need to deliver it with confidence.

Before bed, do this:

  • Read through your answers once (not for hours — 20 minutes max)
  • Do a practice explanation out loud for each question
  • Go to sleep knowing that you've prepped well

During the interview:

  • If they ask something you don't know, say "I don't know that specific detail, but here's how I'd think about it." Then reason through it. Reasoning is more valuable than memorization.
  • If you blank out on a question, pause for 5 seconds. Think. Then answer. Silence is okay. Rambling is bad.

You've got this. Good luck.


Author: thousandmiles-ai-admin
