ZeeshanAli-0704

Posted on Mar 15 • Edited on Jun 27

Frontend System Design: Performance Monitoring & Observability

#frontend #interview #systemdesignwithzeeshanali #fsdzeeshan

Performance Monitoring & Observability

A Complete Frontend System Design Guide — Theory, Patterns & Interview Focus

You can build the most elegant UI in the world, but if it takes 5 seconds to load or janks on every scroll, users leave. Performance monitoring isn't a one-time audit — it's a continuous feedback loop that tells you when things are slow, why they're slow, and for whom.

This guide covers Core Web Vitals, identifying and fixing bottlenecks, monitoring strategies, error tracking, performance budgets, and the tools & techniques (Chrome DevTools, Lighthouse, RUM) that make all of it actionable — all from a conceptual, interview-ready perspective.

Core Web Vitals What They Are and Why They Matter
Identifying Performance Bottlenecks
Fixing Performance Issues A Systematic Approach
Chrome DevTools and Available Tools
Real User Monitoring (RUM) vs Synthetic Monitoring
Error Tracking and Logging
Performance Budgets
The Three Pillars of Observability
Tradeoffs Gotchas and Common Pitfalls
Interview Questions and Answers ⬆ Back to Top

Core Web Vitals What They Are and Why They Matter

Core Web Vitals are a set of user-centric performance metrics defined by Google. Since June 2021, they are Google ranking signals, making them critical for both UX and SEO. They answer three fundamental questions about user experience:

┌─────────────────────────────────────────────────────────┐
│                   Core Web Vitals                       │
│                                                         │
│   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   │
│   │     LCP     │   │     INP     │   │     CLS     │   │
│   │  (Loading)  │   │(Responsive) │   │ (Stability) │   │
│   │             │   │             │   │             │   │
│   │ "How fast   │   │ "How fast   │   │ "How stable │   │
│   │  does the   │   │  does it    │   │  is the     │   │
│   │  main       │   │  react to   │   │  visual     │   │
│   │  content    │   │  my input?" │   │  layout?"   │   │
│   │  appear?"   │   │             │   │             │   │
│   └─────────────┘   └─────────────┘   └─────────────┘   │
└─────────────────────────────────────────────────────────┘

The 75th Percentile Rule

Google evaluates Core Web Vitals at the 75th percentile of all page visits. This means 75% of your users must have a "good" experience for you to pass. If your median LCP is 2.0s but your p75 is 3.5s, you fail the assessment.

LCP — Largest Contentful Paint

What it measures: The time from page load start to when the largest visible content element (hero image, heading text, video poster) finishes rendering in the viewport.

Rating	Time
Good	≤ 2.5s
Needs Improvement	2.5s – 4.0s
Poor	> 4.0s

LCP candidate elements: <img>, <video> poster, elements with background-image, and block-level text elements like <h1>, <p>.

What causes poor LCP and how to fix it:

Cause	Why It's Slow	Fix
Slow server response (high TTFB)	Server takes too long to return HTML	CDN, edge caching, optimize backend queries
Render-blocking CSS/JS	Large CSS/JS files block the parser and delay paint	Inline critical CSS, defer non-critical CSS/JS
Slow resource loading	Hero image is unoptimized (2MB PNG)	Compress images, use WebP/AVIF, responsive `srcset`
Client-side rendering	SPA waits for JS → API → render cycle	Use SSR/SSG, stream HTML from server
Lazy-loaded LCP element	Hero image has `loading="lazy"` — browser delays it	Never lazy-load above-the-fold content
No resource prioritization	Browser treats LCP image same as everything else	Use `fetchpriority="high"` on the LCP image, `<link rel="preload">`
Redirect chains	Multiple 301/302 redirects before actual page	Minimize redirects — each adds a full round trip

INP — Interaction to Next Paint

What it measures: INP measures the latency of all user interactions (clicks, taps, key presses) throughout the page's lifetime, and reports a value representing the worst interaction latency. It replaced FID in March 2024.

Rating	Time
Good	≤ 200ms
Needs Improvement	200ms – 500ms
Poor	> 500ms

Why INP replaced FID:

Aspect	FID (Deprecated)	INP (Current)
What's measured	Delay of first input only	All interactions throughout page life
Scope	Single event	Full session
Includes processing time?	No — only input delay	Yes — input delay + processing + presentation delay
Real-world coverage	Misses 99% of interactions	Comprehensive

The three phases of an interaction:

┌───────────────┬─────────────────────┬──────────────────────┐
│  Input Delay  │  Processing Time    │  Presentation Delay  │
│  (main thread │  (event handlers    │  (render, layout,    │
│   was busy)   │   executing)        │   paint, composite)  │
└───────────────┴─────────────────────┴──────────────────────┘
               ← ──── Total INP Latency ──── →

Input Delay: User clicked but the main thread was blocked by another task — the event handler is queued.
Processing Time: How long your event handler code takes to execute.
Presentation Delay: After handler finishes, how long until the browser paints the visual update.

What causes poor INP and how to fix it:

Cause	Why It's Slow	Fix
Long tasks on main thread	Heavy JS blocks the thread during interaction	Break tasks with `scheduler.yield()` or `setTimeout` chunking
Large DOM size	10,000+ nodes — layout/style recalculation is expensive	Virtualize long lists, simplify DOM structure
Expensive re-renders	Framework re-renders entire component tree	Memoize components, colocate state, use selectors
Forced synchronous layout	Reading layout props after writes causes "layout thrashing"	Batch DOM reads before writes
Third-party scripts	Ad/analytics scripts competing for main thread	Load async, defer, or move to Web Workers
No yielding in long handlers	Single handler runs 300ms without breaks	Yield to browser between chunks of work

CLS — Cumulative Layout Shift

What it measures: CLS quantifies how much visible content shifts unexpectedly during the page's lifetime. It measures visual stability — content jumping around without user interaction.

Rating	Score
Good	≤ 0.1
Needs Improvement	0.1 – 0.25
Poor	> 0.25

How CLS is calculated:

Layout Shift Score = Impact Fraction × Distance Fraction

Impact Fraction: Percentage of viewport affected by the shift
Distance Fraction: Greatest distance any element moved (as fraction of viewport)

CLS groups shifts into session windows (bursts within 1 second, max 5-second window). The largest session window is reported.

What causes poor CLS and how to fix it:

Cause	Why It Shifts	Fix
Images without dimensions	Image loads and pushes content down	Always set `width`/`height` or use CSS `aspect-ratio`
Ads/embeds without reserved space	Ad slot loads a 300px banner, pushing everything	Reserve space with min-height/aspect-ratio placeholders
Dynamically injected content	Cookie banner or notification pushes page down	Use `transform` animations, fixed/sticky positioning
Web font swap (FOUT)	Fallback font renders → web font loads → text reflows	Use `font-display: optional`, preload fonts, use `size-adjust`
Client-side rendering	Skeleton loads, then real content arrives with different dimensions	SSR/SSG, skeleton screens with correct dimensions
CSS animations using layout properties	Animating `top`/`left`/`width`/`height` triggers layout	Use `transform` (`translateY`, `scale`) — no layout cost

Supplementary Metrics Worth Knowing

Beyond the three Core Web Vitals, these metrics provide additional context:

Metric	What It Measures	Good Threshold
TTFB (Time to First Byte)	Server responsiveness — time until first byte of HTML arrives	≤ 800ms
FCP (First Contentful Paint)	Time until first text/image is painted on screen	≤ 1.8s
TTI (Time to Interactive)	Time until page is fully interactive (no long tasks blocking)	≤ 3.8s
TBT (Total Blocking Time)	Total time the main thread was blocked between FCP and TTI	≤ 200ms
Speed Index	How quickly visible content is populated (visual completeness)	≤ 3.4s

Interview insight: TTFB is not a Core Web Vital, but it's the root cause of most LCP issues. If TTFB is 2 seconds, your LCP can never be below 2 seconds. Always check TTFB first when debugging LCP.

⬆ Back to Top

Identifying Performance Bottlenecks

The hardest part of performance work isn't fixing issues — it's finding them. Here's a systematic approach to identifying bottlenecks across the loading, rendering, and interaction lifecycle.

The Performance Investigation Funnel

Step 1: What's the symptom?
         │
         ├── Page loads slowly        → Focus on loading metrics (LCP, TTFB, FCP)
         ├── Page feels janky/laggy   → Focus on interaction metrics (INP, TBT)
         ├── Content jumps around     → Focus on stability metrics (CLS)
         └── Specific action is slow  → Focus on that interaction's trace
         │
Step 2: Where is the bottleneck?
         │
         ├── Server response (TTFB)   → Backend/CDN issue
         ├── Resource downloading     → Large assets, no compression, no CDN
         ├── Render blocking          → CSS/JS blocking first paint
         ├── Main thread busy         → Heavy JS execution, long tasks
         ├── Layout thrashing         → DOM reads/writes interleaved
         └── Memory leaks             → Memory growing over time
         │
Step 3: Why is it slow?
         │
         Use Chrome DevTools to profile and trace the root cause

The Six Categories of Frontend Bottlenecks

┌──────────────────────────────────────────────────────────────────┐
│                Frontend Performance Bottlenecks                   │
│                                                                   │
│  ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌──────────────┐  │
│  │  Network   │ │  Parsing   │ │  Rendering │ │  JavaScript  │  │
│  │            │ │  & Loading │ │  & Layout  │ │  Execution   │  │
│  │ • TTFB     │ │ • Large    │ │ • Complex  │ │ • Long tasks │  │
│  │ • Large    │ │   HTML     │ │   selectors│ │ • Unoptimized│  │
│  │   payloads │ │ • Render-  │ │ • Forced   │ │   frameworks │  │
│  │ • No CDN   │ │   blocking │ │   sync     │ │ • Memory     │  │
│  │ • No       │ │   resources│ │   layout   │ │   leaks      │  │
│  │   compress │ │ • Too many │ │ • Excessive│ │ • No code    │  │
│  │ • DNS/TLS  │ │   requests │ │   repaints │ │   splitting  │  │
│  └────────────┘ └────────────┘ └────────────┘ └──────────────┘  │
│                                                                   │
│  ┌────────────┐ ┌──────────────────────────────────────────────┐ │
│  │  Memory    │ │           Third-Party Scripts                 │ │
│  │            │ │                                               │ │
│  │ • Detached │ │ • Analytics, ads, chat widgets, tag managers │ │
│  │   DOM nodes│ │ • Competing for main thread time             │ │
│  │ • Event    │ │ • Blocking critical rendering                │ │
│  │   listener │ │ • Uncontrolled payload sizes                 │ │
│  │   leaks    │ │                                               │ │
│  └────────────┘ └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘

How to Identify Each Bottleneck Type

1. Network Bottlenecks

Symptoms: High TTFB, slow resource loading, waterfalls with large gaps.

How to find:

Chrome DevTools → Network panel → Sort by "Time" column → Look for requests > 1s
Check the Waterfall column — see if resources are loading sequentially instead of parallel
Look at TTFB (green bar in waterfall) — if it's large, the problem is server-side
Check Size column — large uncompressed assets are a red flag
Filter by "JS" or "CSS" — check if too many files are being loaded

Signs in the waterfall:
| Pattern | What It Means |
|---|---|
| Long green bar (TTFB) | Server is slow to respond |
| Long blue bar (download) | Asset is too large |
| Resources loading one after another | No parallelism — likely dependency chain or HTTP/1.1 |
| Many small requests | Consider bundling or using HTTP/2 multiplexing |
| Resources from many different domains | High DNS/TLS overhead per domain |

2. Render-Blocking Resources

Symptoms: Long gap between TTFB and FCP. Page is white for too long.

How to find:

Lighthouse audit → "Eliminate render-blocking resources" — lists every CSS/JS file that blocks rendering
DevTools → Performance panel → Record → Look for long "Parse HTML" → "Recalculate Style" → "Layout" before first paint
DevTools → Coverage tab (Ctrl+Shift+P → "Show Coverage") — shows what percentage of each CSS/JS file is actually used on this page

Key insight: A CSS file is render-blocking by default. A <script> tag without async/defer is parser-blocking. Both delay first paint.

3. Main Thread / JavaScript Bottlenecks

Symptoms: Poor INP, janky scrolling, slow interactions, high TBT.

How to find:

DevTools → Performance panel → Record an interaction → Look for Long Tasks (red-flagged bars > 50ms)
The "Main" thread flame chart shows exactly which functions are consuming time
Bottom-Up tab → Sort by "Self Time" to find the most expensive individual functions
Call Tree tab → Shows the call hierarchy, find root causes
Check for "Recalculate Style" and "Layout" entries — signs of layout thrashing

What Long Tasks look like:

Main Thread Timeline
─────────────────────────────────────────────
│▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│ 120ms Long Task (red flag!)
│   ▓▓▓│ 20ms task
│ ▓▓▓▓▓▓▓▓▓│ 60ms Long Task
│▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│ 200ms Long Task (very bad!)
─────────────────────────────────────────────
Any task > 50ms is a "Long Task" that blocks interactions

4. Layout & Rendering Bottlenecks

Symptoms: CLS issues, janky animations, slow scroll.

How to find:

DevTools → Performance panel → Look for frequent "Layout" entries (purple bars) — each is a forced reflow
DevTools → Rendering tab → Enable "Layout Shift Regions" — flashes blue on areas that shift
DevTools → Rendering tab → Enable "Paint flashing" — green flashes show what's being repainted
DevTools → Layers panel — shows which elements have their own compositing layer

Layout thrashing pattern: Reading then writing to DOM repeatedly forces the browser to recalculate layout between each read-write pair. The fix is to batch all reads first, then all writes.

5. Memory Issues

Symptoms: Page gets slower over time, eventually crashes. Tab's memory in Task Manager keeps growing.

How to find:

DevTools → Memory panel → Take heap snapshots at different points → Compare for growing objects
DevTools → Performance Monitor (real-time) → Watch "JS heap size" and "DOM Nodes" over time
Chrome Task Manager (Shift+Esc) → Check if your tab's memory keeps growing

Common causes:
| Cause | What Happens | Fix |
|---|---|---|
| Detached DOM nodes | Removed elements still referenced in JS | Null references when removing elements |
| Event listeners not cleaned up | Each re-render adds new listeners without removing old ones | Use removeEventListener, or cleanup in React's useEffect return |
| Growing arrays/maps | Data accumulates without bounds | Set limits, use LRU eviction |
| Closures retaining large scope | Functions hold references to outer scope variables | Be mindful of what closures capture |
| setInterval without clearInterval | Timers keep running even after component unmounts | Always clear in cleanup/teardown |

6. Third-Party Script Issues

Symptoms: Everything seems optimized but Lighthouse still shows poor scores. Long Tasks you don't recognize.

How to find:

DevTools → Performance panel → Look at Long Tasks → Check if the source URL is a third-party domain
DevTools → Network panel → Filter by "Third-party" → Check sizes and timing
Lighthouse → "Reduce the impact of third-party code" audit
Use the web-vitals attribution build to see if third-party scripts cause poor INP/LCP
Chrome → chrome://inspect/#devices → Check Service Workers for third-party scripts

⬆ Back to Top

Fixing Performance Issues A Systematic Approach

Once you've identified the bottleneck, here's a structured approach to fixing each category:

Loading Performance Fixes (Improve LCP, TTFB, FCP)

Problem	Fix	Impact
High TTFB	Deploy CDN, enable edge caching, optimize backend queries, use streaming SSR	High
Large JS bundles	Code splitting (route-based), tree shaking, lazy imports for below-fold features	High
Large images	Compress, use WebP/AVIF, responsive `srcset`, proper sizing	High
Render-blocking CSS	Inline critical CSS (above-fold styles), defer rest with `media="print"` hack or async loading	High
Render-blocking JS	Add `defer` or `async` attribute, move scripts to bottom of body	Medium
No resource hints	`<link rel="preload">` for LCP image, `<link rel="preconnect">` for critical origins	Medium
Too many HTTP requests	Bundle small files, use HTTP/2 (automatic multiplexing), use image sprites or SVG sprites	Medium
No compression	Enable Brotli or Gzip on server/CDN	Medium
Redirect chains	Eliminate unnecessary 301/302 redirects	Low-Medium

Interaction Performance Fixes (Improve INP, TBT)

Problem	Fix	Impact
Long event handlers	Break into async chunks with `scheduler.yield()` or `setTimeout(fn, 0)`	High
Heavy computation	Move to Web Worker (runs off main thread)	High
Framework over-rendering	Memoize components (`React.memo`, `useMemo`), colocate state closer to where it's used	High
Large DOM	Virtualize long lists (react-window, TanStack Virtual), simplify nested structures	Medium
Layout thrashing	Batch DOM reads before writes, use `requestAnimationFrame` for DOM mutations	Medium
Frequent forced reflows	Avoid reading layout properties (`offsetHeight`, `getBoundingClientRect`) inside write loops	Medium
Third-party scripts	Load non-essential scripts with `defer`/`async`, use `Partytown` to run them in a Web Worker	Medium

Visual Stability Fixes (Improve CLS)

Problem	Fix	Impact
Images without dimensions	Always set `width`/`height` attributes or CSS `aspect-ratio`	High
Dynamic content insertion	Reserve space with placeholders, use `min-height` on containers	High
Font swapping	Use `font-display: optional` (no shift) or `size-adjust` to match fallback metrics	Medium
Injected ads/embeds	Reserve exact space with aspect-ratio containers	Medium
CSS animations using layout properties	Use `transform` and `opacity` only — they skip layout and paint phases	Medium
Late-loading above-fold content	Prioritize critical content loading, SSR/SSG	Medium

The Fix Priority Framework

When multiple issues exist, prioritize by impact × effort:

                        High Impact
                            │
         ┌──────────────────┼──────────────────┐
         │                  │                  │
         │   QUICK WINS     │   BIG PROJECTS   │
         │   (Do first)     │   (Plan & do)    │
         │                  │                  │
         │  • Preload LCP   │  • SSR/SSG       │
Low ─────│  • Image formats │  • Code splitting│───── High
Effort   │  • Add defer/    │  • Web Workers   │  Effort
         │    async         │  • Virtualization │
         │  • Set img dims  │  • Architecture  │
         │                  │    changes       │
         │   FILL-INS       │   AVOID          │
         │   (Do if time)   │   (Low ROI)      │
         │                  │                  │
         │  • DNS prefetch  │  • Rewriting in  │
         │  • Minor CSS     │    new framework │
         │    cleanup       │  • Over-          │
         │                  │    optimization  │
         └──────────────────┼──────────────────┘
                            │
                        Low Impact

⬆ Back to Top

Chrome DevTools and Available Tools

Chrome DevTools — Your Primary Performance Tool

Chrome DevTools has several panels specifically designed for performance investigation. Understanding which panel to use for which problem is crucial.

DevTools Panel	What It Shows	When to Use
Performance	Flame chart of main thread activity, Long Tasks, network waterfall, screenshots	Profiling specific interactions, diagnosing INP/TBT, finding expensive functions
Network	All network requests, waterfall timing, sizes, headers	Debugging loading issues, finding large assets, checking compression/caching
Lighthouse	Full page audit — Performance, Accessibility, SEO, Best Practices scores	Quick overall health check, getting actionable recommendations
Memory	Heap snapshots, allocation timelines, garbage collection	Debugging memory leaks, finding detached DOM nodes
Coverage	Shows how much of each CSS/JS file is actually used	Finding dead code to remove or defer
Rendering	Layout shift visualization, paint flashing, layer borders, FPS meter	Debugging CLS, finding unnecessary repaints, checking compositing
Application	Storage (Cache, IndexedDB, localStorage), Service Workers	Debugging caching strategies, checking storage usage
Performance Monitor	Real-time CPU, heap size, DOM nodes, event listeners, style recalculations	Quick live overview of performance health

How to Use the Performance Panel (Step by Step)

The Performance panel is the most powerful tool for diagnosing bottlenecks:

Open DevTools → Performance tab
Click Record (or Ctrl+E) → Perform the slow action → Stop recording
Read the flame chart top-down:
- Network row — shows when resources loaded
- Frames row — shows frame rate (red = dropped frames)
- Main row — the flame chart of all main thread activity
- Red corners on tasks = Long Tasks (> 50ms) — these are your targets
Click on a Long Task → See the call stack in the Bottom-Up / Call Tree tabs
Bottom-Up tab (sort by Self Time) → Find which functions consumed the most time
Summary tab → Shows time breakdown: Scripting, Rendering, Painting, Idle

What to look for in the flame chart:

Pattern	Meaning	Fix
Wide yellow blocks	Long JavaScript execution	Break up tasks, optimize algorithm, Web Worker
Wide purple blocks	Layout/style recalculation	Reduce DOM complexity, batch DOM writes
Wide green blocks	Paint/composite	Reduce painted area, promote to GPU layer
Many narrow purple + yellow alternating	Layout thrashing	Batch reads before writes
Large red-cornered blocks	Long Tasks blocking interactions	Yield to browser between chunks

How to Use the Network Panel Effectively

Technique	How	What It Reveals
Sort by Size	Click "Size" column header	Largest assets to optimize
Sort by Time	Click "Time" column header	Slowest requests
Check Waterfall	Look at the colored bars	Where time is spent (DNS, TLS, TTFB, download)
Use Throttling	Change "No throttling" to "Slow 3G" or "Fast 3G"	How your site feels on slow networks
Check Disable cache	Tick the checkbox	See first-visit experience
Filter by type	Click JS, CSS, Img, Font, etc.	Focus on specific resource types
Check gzip/Brotli	Compare "Size" vs "Content" columns	Is compression enabled?
Right-click → "Block request URL"	Block specific scripts	Test impact of removing third-party scripts

How to Use the Coverage Tab

Ctrl+Shift+P → Type "Show Coverage" → Enter
Click the reload button in the Coverage panel
See a list of all CSS/JS files with percentage used

Usage	What It Means	Action
< 30% used	Heavy dead code	Extract critical portion, lazy-load the rest
30-70% used	Some dead code	Consider code splitting per route
> 70% used	Mostly needed	Focus optimization elsewhere

How to Use the Rendering Tab for CLS

Ctrl+Shift+P → "Show Rendering"
Enable "Layout Shift Regions" — blue rectangles flash where layout shifts occur
Slowly scroll and interact with the page — watch for unexpected blue flashes
Enable "Paint Flashing" — green rectangles show what's being repainted (excessive repaints = performance drain)

The Full Toolkit Beyond Chrome DevTools

Tool	Type	Best For
Lighthouse (in DevTools or CLI)	Lab / Synthetic	Quick audit, actionable recommendations, scores
PageSpeed Insights	Lab + Field (CrUX)	Checking real-user data alongside lab results
WebPageTest	Lab (real devices)	Detailed waterfall analysis, filmstrip view, multi-step testing
Chrome UX Report (CrUX)	Field (28-day rolling)	Google's official real-user performance data per origin
Google Search Console	Field (CrUX)	SEO impact of Core Web Vitals, URL-level reports
`web-vitals` JS library	Field (RUM)	Custom RUM collection in your own analytics pipeline
Bundle analyzers (`webpack-bundle-analyzer`, `source-map-explorer`)	Build-time	Visualizing what's in your JS bundles, finding bloat
`size-limit` / `bundlesize`	CI/CD	Enforcing bundle size budgets in pull requests
Sentry Performance	RUM + Tracing	Real-user performance monitoring with error correlation
Datadog RUM / New Relic Browser	RUM	Enterprise-grade real-user monitoring with dashboards
DebugBear / SpeedCurve / Calibre	Synthetic + Field	Continuous monitoring with historical trends

Lab vs Field Data — Critical Distinction

Aspect	Lab Data	Field Data
Source	Simulated on controlled device	Real users in production
Tools	Lighthouse, WebPageTest, DevTools	CrUX, web-vitals, Datadog, Sentry
Reproducible	Yes — same conditions every run	No — varies per user
Covers real diversity	No — fixed device/network	Yes — all devices, networks, locations
When available	Anytime (dev, CI/CD)	Only after deployment with real traffic
Use for debugging	Yes — drill into specific issues	Harder — aggregated data
Use for Google ranking	No — Google uses field data only	Yes — CrUX field data drives ranking

Interview insight: Google uses field data (CrUX) for ranking, not lab data. Lighthouse can be 100/100 but if real users on slow phones have poor LCP, your ranking still suffers. That's why RUM matters.

⬆ Back to Top

Real User Monitoring (RUM) vs Synthetic Monitoring

Synthetic Monitoring

Runs automated tests against your site from controlled environments — specific devices, networks, and locations — on a scheduled basis or in CI/CD.

┌──────────────────────────────────────────────────┐
│              Synthetic Monitoring                  │
│                                                   │
│  Scheduled Agent ──→ Load Page ──→ Collect Metrics│
│  (headless browser)  (controlled    (LCP, FCP,    │
│                       conditions)    waterfall)    │
│                                                   │
│  ✅ Same device, network, location every time      │
│  ✅ Reproducible — great for regression detection  │
│  ❌ Doesn't represent real user diversity          │
└──────────────────────────────────────────────────┘

When it runs: Scheduled intervals (hourly/daily), CI/CD pipeline (every PR), or on-demand audits.

Real User Monitoring (RUM)

Collects performance data from actual users as they interact with your site. Every page load, every interaction, every error — from real devices, networks, and locations.

┌──────────────────────────────────────────────────┐
│           Real User Monitoring (RUM)              │
│                                                   │
│  Real User ──→ Loads Page ──→ Browser APIs        │
│  (varied device,  (real network   collect data    │
│   location,        conditions)    automatically)  │
│   browser)              │                         │
│                         ▼                         │
│               Beacon/Fetch ──→ Analytics Backend  │
│               (send data home)  (aggregate, dash) │
│                                                   │
│  ✅ Represents EVERY real user's experience        │
│  ✅ Catches long-tail issues (slow 3G in India)   │
│  ❌ Noisy data — requires percentile analysis     │
└──────────────────────────────────────────────────┘

Head-to-Head Comparison

Dimension	Synthetic	RUM
Data source	Simulated agents	Real users in production
Environment	Controlled (fixed device/network)	Diverse (every user's real conditions)
Consistency	Highly reproducible	Noisy, varies per user
Coverage	Only pages/flows you configure	Every page every user visits
Catches regressions	Proactively (before users see them)	Reactively (after deployment)
Third-party impact	May miss real behavior	Captures actual third-party impact
Geographic coverage	Limited test locations	Global (wherever users are)
Cost	Fixed per test run	Scales with traffic volume

When to Use Which

Scenario	Best Approach	Why
Pre-deployment regression check	Synthetic	Catch issues before users see them
Understand p75/p95 user experience	RUM	Real data from real users
Monitor uptime & availability	Synthetic	Runs even when no users are online
Debug by region/device/connection	RUM	Segment by real user dimensions
A/B test performance impact	RUM	Measure real-world impact
CI/CD quality gate	Synthetic	Automated, deterministic pass/fail
Catch long-tail issues (5% on slow 3G)	RUM	Won't show in synthetic at all

The Ideal Strategy — Use Both

Development ──→ CI/CD Pipeline ──→ Production

Lighthouse        Lighthouse CI       RUM (all users)
in DevTools       assertions          │
(manual)          (fail build if      ├─ Dashboards (p75/p95)
                   LCP > 2.5s)       ├─ Segment by region, device
                                      ├─ Alerts on regressions
                  Scheduled           └─ Correlate with errors
                  Synthetic
                  (hourly)

Key insight: Synthetic catches regressions early (before users see them). RUM shows the real impact on actual users. Together, they provide complete visibility.

How RUM Data Collection Works (Conceptual)

The core pattern for collecting real user performance data:

Instrument the page — add the web-vitals library (or equivalent) to every page
Collect metrics — CWV values fire when available (LCP on navigation, CLS on page hide, INP on page hide)
Buffer events — batch multiple metrics together to reduce network calls
Send via sendBeacon — navigator.sendBeacon() is the recommended method because it survives page unload (the user leaving the page). Regular fetch() might be cancelled.
Enrich with context — attach user agent, connection type (navigator.connection.effectiveType), device memory, geographic info (from server), and page URL
Aggregate on the backend — compute p50, p75, p95 per route, per device type, per connection speed, per region
Alert on threshold breaches — if p75 LCP crosses 2.5s, alert the team

Why sendBeacon? When a user clicks a link or closes the tab, the browser cancels pending fetch() requests. sendBeacon() is specifically designed to survive page transitions — it queues the request and the browser completes it even after navigation.

⬆ Back to Top

Error Tracking and Logging

Errors in production are inevitable. The difference between a minor issue and a catastrophe is how quickly you detect, diagnose, and resolve them.

Why Frontend Error Tracking Matters

90% of frontend errors go unreported by users — they just leave
JavaScript errors can completely break an SPA's functionality
Silent failures (API errors swallowed by catch blocks) go unnoticed without tracking
Cross-browser issues — an error in Safari might not reproduce in Chrome
Third-party scripts throw errors outside your control

Without error tracking, you're flying blind — relying on user complaints to discover issues that may have been happening for days.

Types of Frontend Errors

Error Type	Example	How It's Caught
Unhandled exceptions	`TypeError: Cannot read property 'x' of undefined`	`window.onerror` or `window.addEventListener('error')`
Unhandled promise rejections	Failed `fetch()` without `.catch()`	`window.addEventListener('unhandledrejection')`
Network errors	API returns 500, CORS failure, timeout	`fetch` wrapper or Axios interceptor
Resource load failures	CSS/JS/image 404	`window.addEventListener('error', fn, true)` (capture phase — resource errors don't bubble)
Framework/render errors	Error during React render crashes component tree	Error Boundaries (React), `Vue.config.errorHandler` (Vue)
Custom business logic	User can't checkout, payment fails	Manual error capture call in application logic

How Error Tracking Tools Work (Conceptual)

Error Occurs in Browser
         │
         ▼
┌──────────────────────┐
│  Error Tracking SDK  │
│  (e.g., Sentry)      │
│                       │
│  1. Capture the error │
│  2. Extract stack     │
│     trace             │
│  3. Collect context:  │
│     • Breadcrumbs     │
│       (user actions   │
│       leading up)     │
│     • Browser/OS/     │
│       device info     │
│     • User identity   │
│     • Current URL     │
│     • Console logs    │
│  4. Apply source maps │
│     (minified →       │
│      readable code)   │
│  5. Send to backend   │
└──────────┬────────────┘
           │
           ▼
┌──────────────────────┐
│  Error Tracking       │
│  Backend              │
│                       │
│  1. Deduplicate —     │
│     group similar     │
│     errors into one   │
│     "issue"           │
│  2. Link to release/  │
│     commit that       │
│     introduced it     │
│  3. Track frequency   │
│     and affected      │
│     user count        │
│  4. Send alerts       │
│     (Slack, PagerDuty)│
└──────────────────────┘

Key Error Tracking Concepts

Concept	What It Means	Why It Matters
Issue grouping	Thousands of the same `TypeError` get grouped into one "issue"	Without grouping, you'd see 10,000 separate errors instead of one issue affecting 10,000 users
Breadcrumbs	Automatic trail of user actions before the error (clicks, navigations, API calls, console logs)	Shows you what the user did that led to the error — like a flight recorder
Source maps	Map minified production code back to original source	Without them, stack traces show `a.js:1:42305` instead of `CartTotal.tsx:42`
Release tracking	Tag errors with the deployment version (`v1.2.3`)	Know exactly which deploy introduced the regression
Suspect commits	Tool identifies which Git commit likely caused the error	Go from "we have a bug" to "this commit changed the broken code" in seconds
Session replay	Video-like recording of the user's session (DOM changes, clicks, scrolls)	See exactly what the user experienced — invaluable for reproducing bugs
Sampling	Only capture a percentage of events (e.g., 10% of performance traces)	Control costs — you don't need 100% of identical errors

Error Tracking Tools Comparison

Feature	Sentry	LogRocket	Datadog
Primary focus	Error tracking & alerting	Session replay & debugging	Full-stack observability
Error grouping	Excellent	Basic	Good
Session replay	Yes (built-in)	Best-in-class	Yes
Source maps	Yes	No	Yes
Release tracking	Yes (suspect commits)	No	Yes
Performance (RUM)	Yes	Basic	Yes
Alerting	Advanced (Slack, PagerDuty, rules)	Basic	Advanced
Best for	Error detection & alerting	Understanding user experience	Full-stack teams

Practical recommendation: Use an error tracker (like Sentry) with session replay for error-specific sessions. The combination of "what broke" (stack trace) + "what the user saw" (replay) is the fastest path to diagnosing production issues.

Alert Tiers — The Traffic Light System

Not all errors are equal. Structure your alerts to avoid alert fatigue:

Tier	Condition	Alert Channel	Response
P0 — Critical	Error rate > 5% of sessions, payment/auth flow broken	PagerDuty → On-call engineer	< 15 min
P1 — High	New error in critical flow, > 100 events/hour	Slack #incidents	< 1 hour
P2 — Medium	New error in non-critical flow, < 100 events/hour	Slack #frontend-errors	< 1 day
P3 — Low	Cosmetic errors, third-party script errors	Weekly digest email	Next sprint

Structured Logging Principles

Effective frontend logging follows these principles:

Always include context — timestamp, URL, user ID, session ID, error severity
Use structured format — JSON objects, not string concatenation. Structured logs are searchable and aggregatable.
Include breadcrumbs — log navigation events, API calls, and user actions so you can reconstruct what happened
Sanitize sensitive data — never log passwords, tokens, credit card numbers, or PII
Use sendBeacon for delivery — survives page unload, won't block the main thread
Set severity levels — debug, info, warn, error, fatal — filter by level in production
Sample in production — you don't need 100% of logs. Sample info at 10%, capture all error and fatal.

⬆ Back to Top

Performance Budgets

A performance budget is an agreed-upon set of limits on metrics that affect user experience. Like a financial budget, it forces trade-off conversations and prevents the gradual performance degradation that happens in every growing project.

The Boiling Frog Problem

Without budgets, performance degrades invisibly:

Sprint 1:   Bundle = 150 KB   ← "Fast enough"
Sprint 5:   Bundle = 220 KB   ← "Just one more library..."
Sprint 10:  Bundle = 380 KB   ← "When did this get slow?"
Sprint 15:  Bundle = 550 KB   ← "We need a performance sprint"

With a budget (200 KB limit):
Sprint 5:   Bundle = 195 KB   ← ⚠️ Warning: approaching budget
Sprint 6:   PR adds 15 KB    ← ❌ BUILD FAILS — exceeds 200 KB
            → Developer must optimize or justify before merging

Types of Performance Budgets

Budget Type	What It Limits	Examples
Quantity-based	Number/size of resources	Max 170 KB JS (gzipped), max 50 KB CSS, max 5 third-party scripts
Timing-based	User-centric timing metrics	LCP ≤ 2.5s, FCP ≤ 1.5s, TTI ≤ 3.5s
Score-based	Audit tool scores	Lighthouse Performance ≥ 90, Accessibility ≥ 95

Recommended Budget Template

Category	Metric	Budget	Enforcement
Loading	LCP	≤ 2.5s	Lighthouse CI assertion
Interactivity	INP	≤ 200ms	RUM alerting
Stability	CLS	≤ 0.1	Lighthouse CI assertion
JS bundle	Main bundle (gzipped)	≤ 170 KB	CI size check tool
CSS	Total CSS (gzipped)	≤ 50 KB	CI size check tool
Images	Hero/LCP image	≤ 100 KB	CI check or manual review
Fonts	Total web fonts	≤ 75 KB, max 2 families	Manual review
Third-party	Total third-party JS	≤ 50 KB	Lighthouse audit
Total page	Total page weight	≤ 500 KB	Lighthouse CI
Lighthouse	Performance score	≥ 90	Lighthouse CI assertion

How to Enforce Performance Budgets

Level	Tool / Method	What It Does
IDE	"Import Cost" VS Code extension	Shows bundle size impact of every import inline as you code
Build	Webpack `performance.hints: 'error'` / Vite `chunkSizeWarningLimit`	Fails build if bundles exceed configured size
CI/CD	`bundlesize` or `size-limit`	Checks gzipped bundle sizes against budgets on every PR
CI/CD	Lighthouse CI assertions	Runs Lighthouse on every PR, fails if metrics exceed thresholds
Production	RUM alerting	Alerts when real-user p75 metrics cross budget thresholds

How to Set Performance Budgets

Benchmark current state — run Lighthouse and record your current metrics
Check competitors — benchmark 2-3 competitors for the same metrics
Set goals 20% better than current — or match Google's "Good" thresholds as minimum
Get team buy-in — budgets without enforcement are just wishes
Enforce in CI/CD — automated enforcement removes debates on every PR
Review quarterly — adjust budgets as the app evolves

⬆ Back to Top

The Three Pillars of Observability

Frontend observability goes beyond performance — it's about having complete visibility into what your application is doing in production.

The Three Pillars

Pillar	What It Captures	Frontend Examples
Metrics	Numeric measurements over time	LCP values, error rates, API latency, JS heap size
Logs	Discrete events with context	Errors, user actions, state changes, API responses
Traces	End-to-end request flow across services	User click → API call → backend processing → response → UI update

Why all three matter — they answer different questions:

Metrics tell you SOMETHING is wrong:
  → "Error rate spiked to 5% at 14:30"

Logs tell you WHAT went wrong:
  → "TypeError: Cannot read property 'price' of null in CartTotal.tsx:42"

Traces tell you WHERE in the chain it broke:
  → "User click → /api/cart (200 OK, but price=null) → CartTotal render → crash"

The Frontend Observability Pipeline

┌───────────────────────────────────────────────────────────┐
│                        Browser                             │
│                                                            │
│  Performance APIs ──→ web-vitals ──→ RUM Collector         │
│  Error events     ──→ Sentry SDK ──→ Error Tracker         │
│  User actions     ──→ Session Replay ──→ Replay Store      │
│  Custom events    ──→ Analytics SDK ──→ Event Stream        │
│                                                            │
└────────────────────────┬──────────────────────────────────┘
                         │ (sendBeacon / fetch)
                         ▼
┌───────────────────────────────────────────────────────────┐
│                   Ingestion Layer                          │
│            (API Gateway / Event Queue)                     │
└──────┬──────────────┬──────────────┬──────────────────────┘
       │              │              │
┌──────▼─────┐ ┌──────▼─────┐ ┌─────▼────────┐
│  Metrics   │ │   Logs     │ │   Traces     │
│  Store     │ │   Store    │ │   Store      │
└──────┬─────┘ └──────┬─────┘ └──────┬───────┘
       │              │              │
┌──────▼──────────────▼──────────────▼───────────────────────┐
│              Visualization & Alerting                       │
│                                                             │
│  • Dashboards (p75/p95 by route, device, region)           │
│  • Alerts (Slack/PagerDuty on threshold breaches)          │
│  • SLO tracking (99.5% of loads with LCP < 2.5s)          │
└─────────────────────────────────────────────────────────────┘

Distributed Tracing for Frontend

Distributed tracing connects a user interaction in the browser to all the backend services involved. It works by propagating a trace ID across service boundaries:

Browser (frontend span)
  └─→ trace-id header ──→ API Gateway (span)
      └─→ Order Service (span)
          └─→ Payment Service (span)
              └─→ Database query (span)

Full trace: Click → Validate → API → Payment → DB → Response → UI Update

This is how you answer "why was this checkout call slow for this user?" — you see the entire chain with timing for each step. Tools like Sentry Tracing, Datadog APM, and OpenTelemetry enable this.

⬆ Back to Top

Tradeoffs Gotchas and Common Pitfalls

Common Mistakes

Pitfall	What Goes Wrong	Prevention
Optimizing without measuring	Guessing where slowness is → wasting effort on wrong things	Always profile first — measure, then optimize
Measuring median instead of p75	Median hides long-tail issues that affect 25%+ of users	Google uses p75. Always track p75 and p95.
Lab-only testing	Lighthouse is 100 but real users on slow phones have poor experience	Combine Lighthouse (lab) with RUM (field)
No performance budget	Performance degrades gradually with every sprint	Set budgets, enforce in CI/CD
Alert fatigue	Too many alerts → team ignores all of them	Tier alerts (P0-P3), filter known noise
Not uploading source maps	Production stack traces are unreadable minified code	Upload source maps to your error tracker on every deploy
Over-sampling in RUM	Sending every metric from every user = high cost, no extra value	Sample performance traces at 10-20%, capture all errors at 100%
Ignoring third-party scripts	Your code is fast but GTM + ads + chat widgets add 500ms	Audit third-party impact regularly, use Lighthouse's third-party audit
Testing on developer machines only	Fast MacBook Pro ≠ budget Android phone on 3G	Use CPU/network throttling in DevTools, test on real devices
Chasing Lighthouse score	100/100 Lighthouse doesn't mean real users are happy	Lighthouse is a starting point. RUM is the truth.

Platform-Specific Considerations

Platform	Consideration
Mobile (general)	Always test with CPU throttling (4x–6x slowdown). Median phones are much slower than dev machines.
iOS Safari	No Background Sync, no Periodic Sync, limited Service Worker support. Test on real iOS devices — simulators don't accurately reflect performance.
Low-end Android	2-4 GB RAM, slow CPUs. These are your p75/p95 users. Budget Android phones represent the majority of global web traffic.
Slow networks	Always test on "Fast 3G" and "Slow 3G" throttling. Most of the world isn't on fast broadband.
High-DPI displays	Serving 2x/3x images without responsive sizing burns bandwidth and hurts LCP.

⬆ Back to Top

Interview Questions and Answers

Q1: What are Core Web Vitals and why do they matter?

A: Core Web Vitals are three Google-defined metrics that measure real-world user experience:

LCP (Largest Contentful Paint) — loading speed. Measures when the largest content element is rendered. Target: ≤ 2.5s.
INP (Interaction to Next Paint) — responsiveness. Measures how quickly the page responds to user interactions. Target: ≤ 200ms.
CLS (Cumulative Layout Shift) — visual stability. Measures unexpected layout shifts. Target: ≤ 0.1.

They matter because: (1) Google uses them as SEO ranking signals since June 2021, (2) they're measured at the 75th percentile of real user data, and (3) they directly correlate with user engagement metrics like bounce rate and conversion.

Q2: Why did INP replace FID?

A: FID (First Input Delay) only measured the delay of the first interaction, and it only captured input delay (not processing or presentation time). This meant:

It missed slow interactions later in the page lifecycle (e.g., a button that becomes slow after data loads)
A page could have great FID but terrible responsiveness on the 10th interaction
It didn't account for how long event handlers took to execute

INP measures all interactions throughout the entire page lifecycle and captures the full latency (input delay + processing time + presentation delay). It reports the worst interaction, giving a much more comprehensive picture of responsiveness.

Q3: How would you improve LCP on a slow-loading page?

A: I'd follow a systematic approach:

Check TTFB first — if the server takes 2s to respond, LCP can never be under 2s. Fix with CDN, caching, backend optimization.
Identify the LCP element — use DevTools Performance panel or web-vitals attribution to find which element is the LCP candidate.
Preload the LCP resource — if it's an image, add <link rel="preload"> and fetchpriority="high". If it's text, ensure fonts are preloaded.
Remove render-blocking resources — inline critical CSS, defer non-critical CSS/JS.
Never lazy-load the LCP element — loading="lazy" on the hero image is a common mistake.
Optimize the resource itself — compress images to WebP/AVIF, use responsive srcset, proper sizing.
Consider SSR/SSG — if the page is client-side rendered, the LCP has to wait for JS → fetch → render. Server rendering eliminates that chain.

Q4: How would you improve INP?

A: INP issues come from three phases — I'd diagnose which phase is the problem:

Input Delay is high → The main thread was busy when the user interacted. Break up Long Tasks, defer non-critical work, reduce third-party script impact.
Processing Time is high → Event handlers take too long. Optimize the handler, yield to the browser with scheduler.yield() between chunks, move heavy computation to a Web Worker.
Presentation Delay is high → The visual update after the handler is expensive. Reduce DOM complexity, avoid layout thrashing, simplify CSS selectors.

I'd use the web-vitals attribution build to identify which interaction is the slowest and which phase dominates, then profile that specific interaction in DevTools Performance panel.

Q5: How do you prevent CLS?

Always set dimensions on images/videos — use width/height attributes or CSS aspect-ratio so the browser reserves space before loading
Reserve space for dynamic content — ads, embeds, cookie banners should have min-height or aspect-ratio containers
Use font-display: optional for web fonts — prevents font swap layout shifts entirely (or use size-adjust to match fallback font metrics)
Use transform for animations — never animate top, left, width, height (layout triggers). transform and opacity skip layout and paint entirely.
SSR/SSG — avoid the client-side rendering pattern where skeleton → data → layout shift

Q6: What's the difference between RUM and Synthetic monitoring?

A: Synthetic monitoring runs automated tests from controlled environments (fixed device, network, location). It's reproducible, proactive (catches issues before users see them), and great for CI/CD quality gates. But it doesn't represent real user diversity.

RUM (Real User Monitoring) collects data from actual users in production — real devices, real networks, real locations. It catches long-tail issues (the 5% of users on slow 3G in India), captures third-party script impact, and shows actual user experience. But it's reactive (data only after deployment) and noisy (requires percentile analysis).

Best practice: Use both together. Synthetic in CI/CD to prevent regressions. RUM in production to validate real impact. Google uses field data (RUM/CrUX) for ranking, not lab data.

Q7: What is a performance budget and how do you enforce it?

A: A performance budget is an agreed-upon threshold for metrics that a page must not exceed — like a financial budget for performance. Examples: max 170 KB JS (gzipped), LCP ≤ 2.5s, Lighthouse Performance ≥ 90.

Without budgets, performance degrades gradually as features are added — the "boiling frog" problem.

Enforcement happens at multiple levels:

IDE level — "Import Cost" extension shows bundle size impact while coding
Build level — Webpack/Vite config fails the build on oversized bundles
CI/CD level — bundlesize or size-limit checks bundle sizes on every PR; Lighthouse CI asserts metric thresholds
Production level — RUM dashboards alert when p75 metrics cross budgets

The key is automated enforcement — budgets without CI/CD checks are just suggestions that get ignored.

Q8: How would you investigate a report that "the app is slow"?

A: I'd follow the investigation funnel:

Quantify "slow" — which metric is poor? Check RUM dashboards for LCP/INP/CLS p75. Is it a loading problem, interaction problem, or visual stability problem?
Segment the data — is it slow for all users or specific segments? Filter by device type, connection speed, geographic region, browser, OS. Often "slow" affects budget phones on 3G, not developer MacBooks.
Reproduce locally — try to reproduce with DevTools throttling (CPU 4x slowdown + Fast 3G). Record a Performance trace.
Analyze the trace — look for Long Tasks (red flags), heavy Layout/Style recalculations (purple), large network requests. Use Bottom-Up view to find expensive functions.
Check third-parties — are third-party scripts (analytics, ads, chat widgets) contributing to main thread blocking?
Prioritize fixes — use the impact × effort matrix. Quick wins first (preload LCP, add image dimensions, defer scripts), then bigger architectural changes (code splitting, SSR, Web Workers).
Validate the fix — deploy, monitor RUM for p75 improvement, confirm with users.

Q9: What are the three pillars of observability and how do they apply to frontend?

Metrics — numeric measurements over time. Frontend: Core Web Vitals (LCP, INP, CLS), error rates, API latency, JS heap size. Used for dashboards and alerting.
Logs — discrete events with context. Frontend: error stack traces with breadcrumbs, user action logs, API request/response details. Used for debugging specific issues.
Traces — end-to-end request flows across services. Frontend: tracing from user click → API call → backend processing → database → response → UI update, with timing at each step. Used for diagnosing latency in distributed systems.

Metrics tell you something is wrong. Logs tell you what went wrong. Traces tell you where in the chain it broke.

In practice: web-vitals + analytics for metrics, Sentry for logs + errors, Sentry Tracing or OpenTelemetry for traces.

Q10: How do you prevent cache from growing unbounded in a Service Worker?

A: Three approaches:

Entry limit — keep max N items per cache (e.g., 100 images). Evict oldest on overflow (FIFO/LRU).
TTL (Time-to-Live) — add a timestamp when caching. On read, check age and delete expired entries.
Version cleanup — on Service Worker activate, delete all caches not in the current version's allow-list.

Q11: What's the difference between lab data and field data?

A: Lab data (Lighthouse, WebPageTest) runs on a simulated device with fixed conditions. It's reproducible and great for debugging, but doesn't represent real user diversity. Field data (CrUX, RUM) is collected from actual users in production — real devices, networks, and locations. It's noisy but represents the truth.

Critical distinction: Google uses field data (CrUX) for SEO ranking, not lab data. You can score 100 on Lighthouse but still fail Google's assessment if real users on slow phones have poor metrics.

Q12: How would you set up error tracking for a production React app?

A: The core components:

Global error handlers — window.onerror for uncaught exceptions, unhandledrejection listener for promise rejections, capture-phase error listener for resource failures
React Error Boundaries — wrap critical routes/features with Error Boundaries to catch render errors and show fallback UI
Error tracking service (e.g., Sentry) — captures errors with full context: stack trace, breadcrumbs (user actions leading to error), user info, browser/OS, release version
Source maps — upload on every deploy so stack traces show original source, not minified code
Alert tiers — P0 (critical flows broken → PagerDuty), P1 (new error in important flow → Slack), P2-P3 (lower severity → digest)
Noise filtering — ignore known harmless errors (ResizeObserver loop limit exceeded, browser extension errors)
Session replay on errors — automatically record sessions where errors occur for easy reproduction

Q13: What tools would you use for a complete frontend performance monitoring strategy?

Stage	Tool	Purpose
Development	Chrome DevTools (Performance, Network, Lighthouse panels)	Profile and debug locally
IDE	Import Cost extension	See bundle size impact of imports
Build	Bundle analyzer (webpack-bundle-analyzer)	Visualize what's in the bundle
CI/CD	Lighthouse CI + `size-limit`/`bundlesize`	Automated regression prevention
Production — RUM	`web-vitals` library + analytics backend	Real user Core Web Vitals
Production — Errors	Sentry (with session replay)	Error tracking, alerting, debugging
Production — Field data	CrUX / PageSpeed Insights	Google's official field data
Ongoing	Scheduled synthetic tests (WebPageTest, DebugBear)	Baseline monitoring, competitor benchmarking

Q14: A page has good lab scores but poor field metrics. Why?

A: This is a very common scenario. Possible reasons:

User diversity — lab tests on fast machines; real users on budget phones with slow CPUs and 3G connections
Third-party scripts — lab might not fully load all third-party scripts (analytics, ads, chat widgets) that compete for main thread time in production
Geographic distance — lab test from a server close to CDN; real users far from CDN edge nodes experience high TTFB
Caching state — lab tests with warm cache; many real users are first-time visitors with cold cache
Dynamic content — lab tests might get different content (fewer ads, no personalization, no A/B test variants)
Long-tail interactions — lab tests the initial load; real users interact for minutes, hitting slow INP interactions later
Browser/OS diversity — lab uses latest Chrome; real users include older Safari, Firefox, webviews with different JS engines

Fix: Segment RUM data by device type, connection speed, and geography to find which user cohorts have poor experience, then optimize for those conditions.

For related topics, see companion articles: **Critical-Rendering-Path.md, **css-js-ui-optimization.md, **network-optimization.md, **image-optimization.md, and **Web_Worker_SharedWorker_and_Service_Worker.md.

⬆ Back to Top

More Details:

Get all articles related to system design
Hashtag: SystemDesignWithZeeshanAli

systemdesignwithzeeshanali

Git: https://github.com/ZeeshanAli-0704/front-end-system-design

⬆ Back to Top

Top comments (1)

Some comments may only be visible to logged-in visitors. Sign in to view all comments.

Performance Monitoring & Observability

A Complete Frontend System Design Guide — Theory, Patterns & Interview Focus

Table of Contents

Core Web Vitals What They Are and Why They Matter

The 75th Percentile Rule

LCP — Largest Contentful Paint

INP — Interaction to Next Paint

CLS — Cumulative Layout Shift

Supplementary Metrics Worth Knowing

Identifying Performance Bottlenecks

The Performance Investigation Funnel

The Six Categories of Frontend Bottlenecks

How to Identify Each Bottleneck Type

1. Network Bottlenecks

2. Render-Blocking Resources

3. Main Thread / JavaScript Bottlenecks

4. Layout & Rendering Bottlenecks

5. Memory Issues

6. Third-Party Script Issues

Fixing Performance Issues A Systematic Approach

Loading Performance Fixes (Improve LCP, TTFB, FCP)

Interaction Performance Fixes (Improve INP, TBT)

Visual Stability Fixes (Improve CLS)

The Fix Priority Framework

Chrome DevTools and Available Tools

Chrome DevTools — Your Primary Performance Tool

How to Use the Performance Panel (Step by Step)

How to Use the Network Panel Effectively

How to Use the Coverage Tab

How to Use the Rendering Tab for CLS

The Full Toolkit Beyond Chrome DevTools

Lab vs Field Data — Critical Distinction

Real User Monitoring (RUM) vs Synthetic Monitoring

Synthetic Monitoring

Real User Monitoring (RUM)

Head-to-Head Comparison

When to Use Which

The Ideal Strategy — Use Both

How RUM Data Collection Works (Conceptual)

Error Tracking and Logging

Why Frontend Error Tracking Matters

Types of Frontend Errors

How Error Tracking Tools Work (Conceptual)

Key Error Tracking Concepts

Error Tracking Tools Comparison

Alert Tiers — The Traffic Light System

Structured Logging Principles

Performance Budgets

The Boiling Frog Problem

Types of Performance Budgets

Recommended Budget Template

How to Enforce Performance Budgets

How to Set Performance Budgets

The Three Pillars of Observability

The Three Pillars

The Frontend Observability Pipeline

Distributed Tracing for Frontend

Tradeoffs Gotchas and Common Pitfalls

Common Mistakes

Platform-Specific Considerations

Interview Questions and Answers

Q1: What are Core Web Vitals and why do they matter?

Q2: Why did INP replace FID?

Q3: How would you improve LCP on a slow-loading page?

Q4: How would you improve INP?

Q5: How do you prevent CLS?

Q6: What's the difference between RUM and Synthetic monitoring?

Q7: What is a performance budget and how do you enforce it?

Q8: How would you investigate a report that "the app is slow"?

Q9: What are the three pillars of observability and how do they apply to frontend?

Q10: How do you prevent cache from growing unbounded in a Service Worker?

Q11: What's the difference between lab data and field data?

Q12: How would you set up error tracking for a production React app?

Q13: What tools would you use for a complete frontend performance monitoring strategy?

Q14: A page has good lab scores but poor field metrics. Why?