Google SRE NALSD Round — A Real Interview Walkthrough

Non-Abstract Large System Design (NALSD), As It Actually Happens


Interview Context (Implicit, Not Spoken)

  • Role: Google SRE
  • Round: Non-Abstract Large System Design (NALSD)
  • Duration: ~45 minutes
  • Interview Goal: Evaluate whether the candidate can reason about an existing, large production system, identify bottlenecks, and propose incremental, realistic improvements under constraints.

No whiteboard theatrics. No “design YouTube from scratch.”


Correct Definition (Google-specific)

At Google SRE, NALSD unequivocally means:

Non-Abstract Large System Design

It is not an acronym expansion of Networking / Application / Linux / System Design.
Those are evaluation dimensions often exercised during the round, but they are not what NALSD stands for.

What you are preparing for — and what Google interviewers explicitly call this round — is Non-Abstract Large System Design.


What “Non-Abstract” Means at Google (This Matters)

In a Google NALSD round:

  • You do not design Twitter / YouTube / Uber
  • You do not start from first principles
  • You do not invent components freely

Instead, you are given:

  • An already-existing, large production system
  • A concrete failure, constraint, or scaling problem
  • Partial, messy, real-world signals

Your task is to reason inside constraints, not to architect from scratch.

This is why Google explicitly distinguishes NALSD from:

  • General System Design
  • HLD interviews
  • “Design X from scratch” questions

What Google Evaluates in NALSD (Officially Observed)

Interviewers score you on:

  1. Understanding an existing system
  2. Identifying bottlenecks and failure modes
  3. Incremental, realistic design changes
  4. Trade-offs under real constraints
  5. Operational correctness at scale

The "Non-Abstract" in NALSD specifically refers to concrete resource estimation (Disk I/O, Network Bandwidth, RAM, Cores).

If your answer feels “clean” or “idealized”, it is usually wrong.


Now: Google SRE NALSD — High-Probability Scenario


The Scenario

Interviewer opens calmly:

Interviewer:
"Let’s talk about a system you already own.

You run a globally deployed service handling 100,000 queries per second (QPS), distributed across 3 regions.

  • Payload: Small (2KB).
  • Processing Time: Average 10ms per request.
  • Infrastructure: Standard 16-core VMs.

It has been stable for months. Recently, during peak traffic hours, users experience request timeouts. Off-peak traffic is fine.

There were no recent code deploys.

You cannot redesign the system from scratch.”

The shift from “building” to “fixing” is the core of NALSD, and this constraint is the #1 reason people fail the round: they try to re-architect. Notice that the candidate below proposes admission control and load shedding. These are operational fixes, not architectural rewrites, and that is exactly what a Staff SRE would do.

The interviewer then stops talking. This “silence” tactic is the most realistic detail in the whole scenario: most candidates panic here. Knowing that the silence is coming calms your nerves and lets you use the pause to structure your thinking.

This pause is deliberate.


Phase 1: Candidate Establishes Non-Abstract Grounding

Strong candidate response (measured, not rushed):

Candidate:
“Understood. I’ll treat this as an existing production system and focus on incremental diagnosis and improvements.

Before proposing solutions, I’d like to clarify the current architecture and failure characteristics.”

The interviewer nods. This is already a positive signal.


Phase 2: Clarifying Questions (What Google Actually Wants)

Candidate proceeds methodically:

Candidate:
“First, at a high level:

  • Is this a user-facing service behind a global load balancer?
  • Are requests synchronous RPCs end-to-end?
  • What does ‘timeout’ mean here — client-side, load balancer, or backend?”

Interviewer answers concisely:

Interviewer:
“Yes — user traffic hits a global frontend, which routes to backend services via RPC.

Timeouts are occurring at the backend RPC layer.”

No extra hints. No rescue.


Candidate continues:

Candidate:
“During peak traffic, do we see elevated error rates across all regions or only some?”

Interviewer:
“Across all regions, but more pronounced in a few.”

This introduces non-uniformity, a classic Google signal.


Candidate narrows scope:

Candidate:
“Are latency distributions affected, or only tail latency?”

Interviewer:
“Primarily tail latency. Median latency is mostly unchanged.”

This is critical. The candidate pauses briefly — intentionally.


Phase 3: Candidate Frames the Problem (Out Loud)

Candidate:
“So we have:

  • A previously stable global system
  • Tail latency degradation under peak load
  • No recent code changes
  • Backend RPC timeouts

That suggests a capacity or contention issue, not a functional bug.”

The interviewer does not react. This is expected.


Phase 4: Hypothesis-Driven Exploration (Core of NALSD)

Candidate explicitly states their approach:

Candidate:
“I’ll reason through this in layers:

  1. Traffic patterns and load behavior
  2. Backend capacity and queuing
  3. Dependency amplification
  4. System-level safeguards like load shedding”

This verbal structuring is important. Google scores how you think.


"The Math Check"

This is the "Magic Moment" in NALSD. This shows the candidate verifying physical capacity before guessing software bugs.

Phase 4.1: The "Non-Abstract" Math Check

This is the step most candidates miss. You must verify if the math works.

Candidate:
"Before we dig into queues, I need to check if we are physically hitting a hardware wall.

If we have 100k QPS total across 3 regions, that is roughly 33k QPS per region.

If one region fails (N+1 redundancy), the remaining two must handle 50k QPS each.

Let's look at CPU:

  • 50,000 requests/sec * 0.01 seconds (processing time) = 500 vCPUs needed per region.
  • Do we currently have 500 vCPUs provisioned per region?"

Interviewer:
"We currently have 40 machines per region, 16 cores each."

Candidate:
"40 machines * 16 cores = 640 cores.

Okay, so strictly speaking, we have the raw CPU capacity (640 available > 500 needed). But 500/640 is roughly 78% utilization.

At 78% average CPU, any micro-bursts will cause queuing. This confirms why we see timeouts only at peak—we are running too hot on CPU."
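A quick sketch of that back-of-the-envelope check in code. All of the inputs are the numbers stated in the scenario; the only assumption is the N+1 stance that one region can be lost:

```python
# Back-of-the-envelope capacity check using the scenario's numbers.
TOTAL_QPS = 100_000          # global peak traffic
REGIONS = 3
PROCESSING_TIME_S = 0.010    # 10 ms of CPU work per request
MACHINES_PER_REGION = 40
CORES_PER_MACHINE = 16

# N+1: if one region is lost, the two survivors absorb its traffic.
qps_per_region_after_failover = TOTAL_QPS / (REGIONS - 1)          # 50,000 QPS

cores_needed = qps_per_region_after_failover * PROCESSING_TIME_S   # 500 cores busy
cores_available = MACHINES_PER_REGION * CORES_PER_MACHINE          # 640 cores
utilization = cores_needed / cores_available                       # ~0.78

print(f"cores needed:    {cores_needed:.0f}")
print(f"cores available: {cores_available}")
print(f"utilization:     {utilization:.0%}")   # ~78%: too hot to absorb bursts
```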


Phase 5: Investigating Load and Queuing

Candidate:

Candidate:
“First, during peak traffic, do backend request queues grow noticeably?”

Interviewer:
“Yes. Queue depth increases during peak.”

That confirms contention.


Candidate:
“Do backends reject requests early when overloaded, or do they queue until timeout?”

Interviewer:
“They queue.”

This is a red flag.


Candidate articulates the risk:

Candidate:
“Queuing under overload often worsens tail latency.

Instead of failing fast, we allow work to pile up, which increases response times for all users.”

This is textbook Google reasoning — calm, factual, precise.
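To make that risk concrete, here is a rough sketch using the scenario's numbers (a 16-core machine, 10 ms per request) and an assumed burst that exceeds a single machine's capacity by about 6%:

```python
# Rough sketch: why queuing to timeout inflates tail latency.
CORES = 16
SERVICE_TIME_S = 0.010                       # 10 ms per request (from the scenario)
capacity_qps = CORES / SERVICE_TIME_S        # 1,600 QPS one machine can sustain

burst_qps = 1_700                            # assumed burst arrival rate
excess_qps = burst_qps - capacity_qps        # 100 extra requests per second pile up

for burst_seconds in (1, 5, 10):
    queue_depth = excess_qps * burst_seconds
    # Each queued request waits for queue_depth / CORES requests ahead of it.
    added_wait_s = (queue_depth / CORES) * SERVICE_TIME_S
    print(f"after {burst_seconds:>2}s of burst: queue={queue_depth:>5.0f}, "
          f"added wait ~{added_wait_s * 1000:.0f} ms")

# Ten seconds of a ~6% overload adds ~625 ms of waiting, which is already past
# a 500 ms timeout; every user sees the degradation, not just the excess traffic.
```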


Phase 6: Non-Abstract Design Constraints

Candidate checks boundaries:

Candidate:
“Are we allowed to:

  • Change request admission behavior?
  • Add caching layers?
  • Adjust client retry behavior?”

Interviewer:
“You can make incremental changes. You cannot change the RPC framework itself.”

This keeps it non-abstract.


Phase 7: Incremental Design Improvements (What Google Expects)

Candidate proposes graduated mitigations, not a single fix:

Candidate:
“I would approach this in stages.”


Stage 1: Bounded Queues (With RAM Calculation)

Candidate:
"First, we must stop the bleeding. The servers are thrashing because they accept work they can't finish.

I propose implementing a Bounded Queue (Leaky Bucket) at the application layer.

I need to size this queue. We don't want requests waiting more than 500ms (our max timeout).

  • Math: At 50k QPS (peak per region) / 40 machines = 1,250 QPS per machine.
  • Max Queue Depth: 1,250 QPS * 0.5s = 625 requests.
  • RAM Impact: 625 requests * 2KB payload = ~1.2MB.

This is negligible RAM. We can safely set a hard cap of 625 pending requests. Any request above this is rejected immediately (503 Service Unavailable) to save the CPU for requests we can serve."
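A minimal sketch of that bounded queue at the application layer, assuming a 16-core machine and the 625-request cap derived above. The handle function and the HTTP-style status codes are illustrative stand-ins, not the API of any particular RPC framework:

```python
import queue
import threading
import time

MAX_QUEUE_DEPTH = 625            # 1,250 QPS per machine * 0.5 s max acceptable wait
NUM_WORKERS = 16                 # one worker per core on the scenario's 16-core VMs

pending = queue.Queue(maxsize=MAX_QUEUE_DEPTH)

def admit(request):
    """Admission control: accept if the bounded queue has room, otherwise fail fast."""
    try:
        pending.put_nowait(request)
        return 202               # accepted for processing
    except queue.Full:
        return 503               # overloaded: reject now instead of timing out later

def handle(request):
    time.sleep(0.010)            # stand-in for the ~10 ms of real work per request

def worker():
    while True:
        handle(pending.get())
        pending.task_done()

for _ in range(NUM_WORKERS):
    threading.Thread(target=worker, daemon=True).start()
```

With the cap in place, excess work is rejected up front instead of accumulating past the timeout budget.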


Stage 2: Load Shedding Based on Importance

Candidate:
“Second, if requests are not equal, I’d implement priority-based shedding:

  • Preserve critical user flows
  • Shed best-effort traffic first”

Interviewer:
“How would you decide priorities?”

Candidate:
“Based on user impact and SLO alignment — not request volume.”

This directly aligns with Google SRE doctrine.
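One hedged way to express that policy in code. The criticality labels, the utilization thresholds, and the choice of utilization as the overload signal are all assumptions for illustration; in a real service they would come from request metadata and existing monitoring:

```python
# Sketch: criticality-based load shedding. Priorities reflect user impact and SLOs,
# never raw request volume.
CRITICALITY_ORDER = ("best_effort", "batch", "default", "critical")

def min_admitted_criticality(utilization):
    """The busier we are, the more criticality a request needs to be admitted."""
    if utilization < 0.80:
        return "best_effort"     # healthy: admit everything
    if utilization < 0.90:
        return "batch"           # shed best-effort traffic first
    if utilization < 0.95:
        return "default"
    return "critical"            # near saturation: only SLO-critical flows survive

def should_admit(request_criticality, utilization):
    floor = min_admitted_criticality(utilization)
    return CRITICALITY_ORDER.index(request_criticality) >= CRITICALITY_ORDER.index(floor)

print(should_admit("best_effort", utilization=0.85))   # False: shed under load
print(should_admit("critical", utilization=0.97))      # True: always admitted
```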


Stage 3: Reduce Dependency Amplification

Candidate:
“Next, I’d analyze downstream dependencies:

  • Are we fan-out heavy?
  • Does one slow dependency delay the entire request?”

Interviewer:
“Yes, there is fan-out.”

Candidate:
“Then partial responses or degraded modes could significantly reduce tail latency under load.”
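A sketch of the partial-response idea for a fan-out call. The dependency names, the stub fetchers, and the 300 ms per-dependency budget are all assumptions; the point is that a slow dependency degrades its own fragment of the response instead of stalling the whole request:

```python
import concurrent.futures
import time

# Stub dependencies for illustration; in the real service these would be RPC clients.
def fetch_profile(user_id):
    return {"name": f"user-{user_id}"}

def fetch_recs(user_id):
    time.sleep(2)                # simulated slow backend
    return ["slow", "recs"]

def fetch_ads(user_id):
    return ["ad-1"]

def degraded_default(name):
    return None                  # or a cached / empty fragment

BACKENDS = {"profile": fetch_profile, "recs": fetch_recs, "ads": fetch_ads}
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)

def assemble_response(user_id, per_dep_timeout_s=0.3):
    """Fan out to every dependency, but cap how long we wait for each one."""
    futures = {name: _pool.submit(fn, user_id) for name, fn in BACKENDS.items()}
    results = {}
    for name, future in futures.items():
        try:
            results[name] = future.result(timeout=per_dep_timeout_s)
        except concurrent.futures.TimeoutError:
            results[name] = degraded_default(name)   # degrade instead of stalling
    return results

print(assemble_response(42))     # returns in ~0.3 s even though 'recs' takes 2 s
```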


Phase 8: Observability and Proof

Interviewer challenges:

Interviewer:
“How do you prove this is the right fix?”

Candidate:
“I’d look for:

  • Correlation between queue depth and tail latency
  • Improvement in p99 latency after enabling admission control
  • Stable CPU usage but reduced request backlog”

No guessing. Only measurable signals.
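As a sketch of the first signal, correlating queue depth with p99 latency is straightforward once both series are exported from monitoring. The samples below are made up purely for illustration:

```python
# Made-up per-minute samples; in practice these come from the monitoring system.
from statistics import correlation   # Python 3.10+

queue_depth = [12, 15, 40, 180, 420, 610, 590, 230, 60, 18]
p99_ms      = [45, 48, 60, 140, 320, 480, 470, 190, 70, 50]

r = correlation(queue_depth, p99_ms)
print(f"Pearson correlation: {r:.2f}")   # near 1.0 suggests the queue drives the tail
```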


Phase 9: Long-Term Design Hardening

Candidate zooms out — but not abstractly:

Candidate:
“Longer term, I’d ensure:

  • Explicit SLOs tied to tail latency
  • Load tests that simulate peak bursts
  • Alerts on queue growth, not just CPU or error rate”

This is the “queuing” trap: most candidates blame CPU or memory, but queuing is the hidden killer in distributed systems. Calling out the queue shows deep system intuition.

This shows ownership, not firefighting.
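For the last bullet in that list, here is a sketch of what "alert on queue growth, not just CPU" could mean as a rule. The window and threshold are made-up values that would be tuned against the SLO:

```python
def queue_growth_alert(queue_depth_samples, window=5, min_growth_per_sample=50):
    """Fire when queue depth has risen monotonically over the last `window` samples
    and the average growth exceeds `min_growth_per_sample` requests per sample."""
    recent = queue_depth_samples[-window:]
    if len(recent) < window:
        return False
    deltas = [after - before for before, after in zip(recent, recent[1:])]
    growing = all(d > 0 for d in deltas)
    avg_growth = sum(deltas) / len(deltas)
    return growing and avg_growth > min_growth_per_sample

# CPU can look "fine" while the backlog quietly climbs toward the timeout budget.
print(queue_growth_alert([40, 120, 210, 330, 480]))    # True: sustained growth
print(queue_growth_alert([480, 460, 450, 470, 455]))   # False: noisy but flat
```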


Phase 10: Interviewer Ends the Scenario

Interviewer:
“That’s sufficient. Do you have any questions for me?”

The interview ends quietly. No praise. No verdict.


What the Interviewer Actually Evaluated

They were not testing:

  • Whether you know buzzwords
  • Whether you can redesign the system

They evaluated:

  • Can you reason inside constraints?
  • Do you prioritize tail latency over averages?
  • Do you understand overload behavior?
  • Do you make incremental, defensible changes?

Why This Is a True Google NALSD Reference Scenario

  • Existing system
  • Realistic failure mode
  • No clean solution
  • Trade-offs explicitly discussed
  • Operational correctness prioritized

This is exactly how strong Google SRE candidates pass NALSD.


🚀 Want to Simulate the Full Loop?

Reading one scenario gives you context. Practicing ten of them gives you mastery.

This walkthrough is just one chapter from the NALSD Practice Playbook, part of the Complete SRE Career Launchpad.

Most candidates memorize answers. This bundle teaches you the Google-style mental models required to pass the hardest rounds:

  • 📘 The NALSD Playbook: 10+ deep-dive scenarios including Control Plane Failures, Regional Latency Spikes, and Packet Loss under Load. Each comes with a "Strong vs. Exceptional" scoring rubric.
  • 🐧 Linux Internals & Troubleshooting: The 20 commands that solve 80% of production incidents, from kernel panics to CPU throttling.
  • 🧠 The SRE Mindset: How to speak fluently in SLOs, Error Budgets, and Blameless Postmortems during the behavioral round.
  • 🐍 Production-Grade Coding: Python & Go workbooks that focus on concurrency, safety, and automation—not just algorithms.

You don't need more generic advice. You need a structured simulation of the actual job.

👉 Get the Complete SRE Career Launchpad Here

Stop guessing. Start architecting.
