ZeeshanAli-0704

Posted on Jun 29

Frontend System Design: Google Sheets

Frontend + Backend System Design: Google Sheets (Collaborative Spreadsheet)

0. Architecture Diagram
1. Concept, Idea, Product Overview
2. Requirements
- 2.1 Functional Requirements
- 2.2 Non Functional Requirements
3. Scope Clarification (Interview Scoping)
4. High Level Architecture
5. Grid Rendering and Virtualization (Deep Dive)
6. Formula Engine and Dependency Graph (Deep Dive)
7. High Level Data Flow Explanation
8. Data Modelling
9. State Management Strategy
10. Real Time Collaboration and Conflict Handling
11. High Level API Design
12. Caching, Chunking, and Large Dataset Strategy
13. Cross Cutting Non Functional Concerns
14. Edge Cases and Tradeoffs
15. Interview Summary

0. Architecture Diagram

1. Concept, Idea, Product Overview

1.1 Product Description

Google Sheets is a collaborative, browser-based spreadsheet platform that combines a massive interactive grid with formula-based computation and real-time multi-user editing.
It is not just a table UI. It is a reactive compute system where cell values form a dependency graph that updates continuously as users edit data.
Target users include analysts, finance teams, operations teams, educators, and general users creating lightweight databases and dashboards.
Primary use case: edit large sheets with low interaction latency while maintaining consistency across collaborators.

1.2 Key User Personas

Analyst / Power User: Works on large sheets with thousands of formulas, filters, and aggregations. Expects smooth scroll, rapid recalculation, and keyboard-heavy workflows.
Collaborative Team Member: Co-edits shared sheets with multiple users, expects live updates, conflict-safe edits, and version-aware reconciliation.
Casual User: Uses smaller sheets for planning or tracking, expects intuitive editing, instant feedback, and low complexity.

1.3 Core User Flows (High Level)

Primary Flow: Edit and Recompute:
1. User edits a cell (for example, A1=500).
2. UI updates optimistically.
3. Client formula engine performs lightweight local evaluation.
4. Sync engine sends versioned operation to server.
5. Server recalculates impacted dependency subgraph and returns deltas.
6. Client reconciles optimistic state with authoritative patch.
Secondary Flow: Scroll Large Sheet:
1. User scrolls across large row/column ranges.
2. Grid computes visible window with overscan.
3. Recycled DOM/canvas nodes are rebound to new row/column indexes.
4. Additional chunks are lazily fetched for non-cached regions.
Secondary Flow: Real-Time Collaboration:
1. Multiple users edit overlapping ranges.
2. Server orders and merges operations.
3. Clients receive cell-level deltas and presence updates.
4. All replicas converge to the same sheet state.

2. Requirements

2.1 Functional Requirements

Spreadsheet Grid:
- Support up to 1,048,576 rows and 18,278 columns per sheet.
- Cell editing, selection, copy/paste, fill handle, keyboard navigation.
- Frozen rows/columns and variable row heights/column widths.
Formula System:
- Parse and evaluate formulas with cell/range references.
- Maintain dependency graph and incremental recomputation.
- Surface formula errors (#REF!, #DIV/0!, circular reference).
Collaboration:
- Real-time co-editing with operation ordering.
- Presence indicators (active users, selections, cursors).
- Conflict handling with deterministic convergence.
Persistence:
- Reliable save and restore of workbook/sheet/cell state.
- Version history pointers and operation log for replay.
Data Handling:
- Delta-based updates instead of full-sheet payloads.
- Lazy loading/chunked fetch for large datasets.

2.2 Non Functional Requirements

Performance: Scroll should remain near 60 FPS in typical datasets; local edit-to-paint latency should stay under one frame budget where possible.
Scalability: Must support large workbooks, high formula density, and concurrent editors.
Consistency: Clients must converge to server-authoritative state despite offline/reconnect scenarios.
Reliability: Retry, replay, and idempotent operation handling on flaky networks.
Security: Access control, authenticated sessions, and encrypted transport.

3. Scope Clarification (Interview Scoping)

3.1 In Scope

Virtualized grid rendering strategy.
Windowing + recycling for smooth scrolling.
Formula parsing, AST model, dependency graph recomputation.
Cell update sync flow and versioned reconciliation.
Collaboration model and backend storage split.

3.2 Out of Scope

Enterprise permissions UX and admin governance workflows.
Advanced BI/chart engine internals.
Print/export pipeline details.
Billing, tenancy partitioning, and legal retention policies.

3.3 Assumptions

Authentication and basic authorization already exist.
Collaboration service supports low-latency pub/sub.
Calculation service can execute recalculation in topological batches.
Network protocol supports operation versioning and delta patches.

4. High Level Architecture

4.1 Overall Approach

Browser client handles rendering and optimistic interactions.
Server side is authoritative for merge ordering and final computation.
Formula evaluation is split:
- Client: lightweight evaluation for immediate feedback.
- Server: canonical recomputation and conflict-safe final state.
Data transfer is patch-first, not snapshot-first, for steady-state interactions.

4.2 Major Architectural Layers

flowchart TD
    U[User Browser] --> FE[Sheets Frontend]
    FE --> GR[Grid Renderer]
    FE --> FEG[Client Formula Engine]
    FE --> SYNC[Sync Engine]
    FE --> COLLAB[Collaboration Server]

    COLLAB --> CALC[Calculation Server]
    COLLAB --> DOC[Document Store]
    COLLAB --> META[Metadata DB]

Layer responsibilities:

Grid Renderer: virtual viewport computation, windowing, node recycling, hit-testing.
Client Formula Engine: parse + validate + fast local eval.
Sync Engine: operation batching, version tracking, retry/replay.
Collaboration Server: operation ordering, merge, fanout.
Calculation Server: dependency graph maintenance + incremental recomputation.
Document Store: workbook/sheet/cell durable state.
Metadata DB: ACL, ownership, sharing, history pointers.

4.3 External Integrations

Identity provider for authentication and session validation.
Notification/presence channels for active collaborators.
Optional audit pipeline for operation history and governance.

5. Grid Rendering and Virtualization (Deep Dive)

5.1 Why Naive Rendering Fails

A full DOM for millions of rows and thousands of columns is infeasible due to memory, layout, and paint overhead.

5.2 Visible Range Computation

For fixed sizes:

rowStart = floor(scrollTop / rowHeight)
rowVisible = ceil(viewportHeight / rowHeight)
rowEnd = rowStart + rowVisible - 1
colStart = floor(scrollLeft / colWidth)
colVisible = ceil(viewportWidth / colWidth)
colEnd = colStart + colVisible - 1

Only this range (plus buffer) is mounted.

Two practical notes:

Clamp bounds to sheet limits so indices never go negative or exceed max rows/cols.
Recompute only on animation frame during fast scroll to avoid over-rendering on every raw scroll event.

5.3 Windowing (Overscan)

Render beyond the exact viewport to reduce churn.

Visible rows: 100..130
Window rows (with overscan): 80..150

Formulas:

windowStart = max(0, start - overscan)
windowEnd = min(total - 1, end + overscan)

Apply for rows and columns.

How to choose overscan:

Low overscan: lower memory, but more frequent mounts during fast scroll.
High overscan: smoother scroll, but more paint/layout work per frame.
Typical strategy: adaptive overscan based on scroll velocity (increase overscan for fast fling, decrease when idle).

5.4 Scroll Lifecycle: From Scroll Event to New Window

When user scrolls, the renderer typically follows this loop:

Capture scrollTop and scrollLeft from the scroll container.
Compute visible row/column range.
Expand to window range using overscan.
Diff old window vs new window.
Rebind/reposition pooled nodes for incoming cells.
Release or park nodes no longer needed.
Apply style transform offsets in one batch paint.

flowchart LR
    S[Scroll event] --> R[Compute visible range]
    R --> W[Expand to window with overscan]
    W --> D[Diff old vs new window]
    D --> B[Reuse or create nodes]
    B --> P[Position with transforms]
    P --> C[Commit batched update]

Pseudo-flow (conceptual):

onScroll() {
  scheduleOnAnimationFrame(() => {
    const visible = computeVisible(scrollTop, scrollLeft);
    const window = applyOverscan(visible, overscan);
    const diff = diffRanges(prevWindow, window);
    recycleAndBind(diff);
    commitPositions(window);
    prevWindow = window;
  });
}

5.5 Node Reuse and Old DOM Removal

There are two common strategies for old nodes:

Recycle pool (preferred): remove mapping from old cell, keep node object, reuse for incoming cell.
Hard unmount: remove node from DOM tree and create later (simpler, but higher churn).

Recommended lifecycle:

Build set of cells that exited window.
Detach their logical mapping from node ids.
Push freed nodes to reusable pool.
For cells entering window, pop from pool and rebind content.
If pool is empty, create new nodes.
If pool exceeds threshold, trim extras (actual DOM removal) to control memory.

Before/after idea:

Before: node1 -> row100,col2, node2 -> row101,col2
Scroll down far
After: node1 -> row500,col2, node2 -> row501,col2

Key implementation detail:

Keep node position updates as transform translations instead of full layout-affecting style writes whenever possible.

Practical React example (conceptual, simplified):

import React, { useMemo, useRef, useState } from "react";

type CellKey = string; // "row:col"

type CellVM = {
  key: CellKey;
  row: number;
  col: number;
  value: string;
  top: number;
  left: number;
};

type Range = { rowStart: number; rowEnd: number; colStart: number; colEnd: number };

const keyOf = (row: number, col: number) => `${row}:${col}`;

function rangeKeys(r: Range): CellKey[] {
  const out: CellKey[] = [];
  for (let row = r.rowStart; row <= r.rowEnd; row++) {
    for (let col = r.colStart; col <= r.colEnd; col++) {
      out.push(keyOf(row, col));
    }
  }
  return out;
}

export function VirtualGrid({
  rowHeight,
  colWidth,
  maxPoolSize = 400,
}: {
  rowHeight: number;
  colWidth: number;
  maxPoolSize?: number;
}) {
  // nodeId -> current cell key mapping
  const [nodeMap, setNodeMap] = useState<Map<number, CellKey>>(new Map());
  // free reusable node ids
  const recyclePoolRef = useRef<number[]>([]);
  const nextNodeIdRef = useRef(1);
  const prevWindowRef = useRef<Range | null>(null);

  function bindWindow(nextWindow: Range) {
    // Step 1: Read previous window state.
    const prevWindow = prevWindowRef.current;

    // Step 2: Build previous and next key sets.
    const prevKeys = new Set(prevWindow ? rangeKeys(prevWindow) : []);
    const nextKeys = new Set(rangeKeys(nextWindow));

    // Step 3: Diff windows into leaving and entering cell sets.
    const leaving = [...prevKeys].filter((k) => !nextKeys.has(k));
    const entering = [...nextKeys].filter((k) => !prevKeys.has(k));

    // Step 4: Update node bindings in one state transition.
    setNodeMap((prevMap) => {
      const map = new Map(prevMap);

      // Step 4.1: Release leaving cells and push their node ids to recycle pool.
      if (leaving.length > 0) {
        for (const [nodeId, cellKey] of map.entries()) {
          if (leaving.includes(cellKey)) {
            map.delete(nodeId);
            recyclePoolRef.current.push(nodeId);
          }
        }
      }

      // Step 4.2: Bind entering cells: reuse pooled ids first, create only if needed.
      for (const cellKey of entering) {
        const reused = recyclePoolRef.current.pop();
        const nodeId = reused ?? nextNodeIdRef.current++;
        map.set(nodeId, cellKey);
      }

      // Step 4.3: Optional cleanup: trim recycle pool to cap memory.
      if (recyclePoolRef.current.length > maxPoolSize) {
        recyclePoolRef.current.length = maxPoolSize;
      }

      return map;
    });

    // Step 5: Commit this window as previous for next diff.
    prevWindowRef.current = nextWindow;
  }

  // Step 6: Build render list from current node bindings.
  const cells: CellVM[] = useMemo(() => {
    const out: CellVM[] = [];
    for (const [nodeId, cellKey] of nodeMap.entries()) {
      const [rowStr, colStr] = cellKey.split(":");
      const row = Number(rowStr);
      const col = Number(colStr);
      out.push({
        key: `node-${nodeId}`,
        row,
        col,
        value: `R${row}C${col}`,
        top: row * rowHeight,
        left: col * colWidth,
      });
    }
    return out;
  }, [nodeMap, rowHeight, colWidth]);

  // Step 7: On scroll, compute nextWindow and call bindWindow(nextWindow).
  // bindWindow({ rowStart: 100, rowEnd: 150, colStart: 2, colEnd: 8 });

  return (
    <div style={{ position: "relative" }}>
      {cells.map((c) => (
        <div
          key={c.key}
          style={{
            position: "absolute",
            width: colWidth,
            height: rowHeight,
            transform: `translate(${c.left}px, ${c.top}px)`,
            willChange: "transform",
          }}
        >
          {c.value}
        </div>
      ))}
    </div>
  );
}

What this snippet demonstrates:

Leaving cells are unmapped and their node ids are pushed to recyclePoolRef.
Entering cells consume pooled ids first, so old DOM nodes are reused.
Pool trimming (maxPoolSize) represents deliberate old-node cleanup to cap memory.
Stable React keys are tied to node ids (node-<id>), not cell positions, enabling reuse.

First render walkthrough with mock data:

Assume:

Total grid size: 10,000 rows x 200 cols
rowHeight = 24px, colWidth = 100px
Viewport size: 480px height x 500px width
Overscan: 5 rows, 2 cols

Step 1: initial visible range at top-left (scrollTop=0, scrollLeft=0)

rowVisible = ceil(480 / 24) = 20
colVisible = ceil(500 / 100) = 5
Visible rows: 0..19
Visible cols: 0..4

Step 2: compute initial window with overscan

Window rows: max(0, 0-5) .. (19+5) => 0..24 (25 rows)
Window cols: max(0, 0-2) .. (4+2) => 0..6 (7 cols)
Total mounted cells: 25 x 7 = 175

Step 3: first-time creation behavior

prevWindow = null, so all 175 cells are "entering"
Recycle pool is empty
Create node ids 1..175
Bind each node id to one cell key (example: node-1 -> 0:0, node-2 -> 0:1, ...)
React renders 175 absolutely-positioned cell nodes

Step 4: user scrolls down and right (scrollTop=240, scrollLeft=200)

New visible rows: 10..29
New visible cols: 2..6
New window rows: 5..34 (30 rows)
New window cols: 0..8 (9 cols)
New window cell count: 30 x 9 = 270

Diff against old window (rows 0..24, cols 0..6):

Intersection cells remain bound and untouched
Leaving cells are unmapped and moved to recycle pool
Entering cells use pooled node ids first, then create only the extra required ids

Why this is efficient:

We do not recreate every cell on each scroll
Most updates are remaps + transform moves
Hard removal happens only when pool exceeds cap

Tiny mock data generator (for local testing):

function mockValue(row: number, col: number): string {
  return `R${row}C${col}`;
}

function getCellData(row: number, col: number) {
  return {
    key: `${row}:${col}`,
    value: mockValue(row, col),
  };
}

5.6 Frozen Rows/Columns and Variable Sizes

Frozen panes are rendered as separate synchronized layers.
Variable dimensions require prefix sums / cumulative offset maps for fast index-to-pixel and pixel-to-index conversions.

For variable row heights, use binary search over prefix sums to map scrollTop to rowStart efficiently.

5.7 Virtualization vs Windowing (What Is the Difference)

These terms are related, but not identical:

Virtualization: overall technique of rendering only a tiny subset of a huge logical list/grid.
Windowing: one implementation strategy inside virtualization where we render a moving contiguous range (window), usually with overscan.

In short:

Virtualization = big concept.
Windowing = concrete mechanism to realize virtualization.

Practical spreadsheet stack:

Virtualization decides "do not render all cells".
Windowing chooses "render this moving range + buffer".
Recycling optimizes "reuse nodes as the range moves".

6. Formula Engine and Dependency Graph (Deep Dive)

6.1 Formula Parse Pipeline

When a user enters =A1+B1:

Tokenize and parse expression.
Build AST.
Resolve references (cell/range names).
Evaluate against current context.

AST example:

    +
   / \
 A1   B1

6.2 Dependency Graph Model

Directed graph where edges point from precedent to dependent.

Example:

graph LR
    A1 --> C1
    B1 --> C1
    C1 --> D1

6.3 Incremental Recalculation Strategy

If A1 changes:

Mark impacted descendants dirty.
Recompute in topological order.
Emit only changed cells as delta patch.

Avoid full-sheet recomputation.

6.4 Circular References and Error Propagation

Detect cycles during dependency updates.
Mark cycle participants with deterministic circular-reference error state.
Propagate dependent errors with stable precedence rules.

7. High Level Data Flow Explanation

7.1 Initial Load Flow

Client loads workbook metadata and initial viewport chunks.
Renderer mounts virtual window.
Formula cache warms for visible cells.
Collaboration channel subscribes to document updates.

7.2 Cell Edit Flow (Optimistic + Authoritative Reconcile)

sequenceDiagram
    participant UI as Frontend UI
    participant SY as Sync Engine
    participant CS as Collaboration Server
    participant DG as Calc Server

    UI->>SY: Local change event (A1=500)
    SY->>UI: Optimistic paint
    SY->>CS: Send op with base version
    CS->>DG: Apply change and find impacted cells
    DG-->>CS: Recomputed delta (A1, C1, D1...)
    CS-->>SY: Authoritative patch + new version
    SY-->>UI: Reconcile local state and repaint viewport

7.3 Error and Retry Flow

If request fails, operation remains in pending queue.
Sync engine retries with backoff.
On reconnect, pending operations are replayed with version checks.
Diverged optimistic state is corrected by authoritative patch.

8. Data Modelling

8.1 Core Entities

Workbook
Sheet
Cell
Operation (edit, format, structural change)
Dependency edge

8.2 Data Shape

{
  "id": "A1",
  "value": 10,
  "formula": "=B1+C1",
  "type": "number",
  "format": "general",
  "version": 42
}

Notes:

value: computed/display value.
formula: expression entered by user.
version: monotonic version used for reconciliation.

8.3 Backend Storage Split

Metadata DB (SQL): owner, ACL, sharing, lifecycle metadata.
Cell State Store (distributed KV/document):
- Key: spreadsheetId:sheetId:cellId
- Value: formula, computed value, format, version
Append Log (optional): replay/history/audit support.

9. State Management Strategy

9.1 State Classification

UI State: selection, active cell, editing mode, scroll offsets.
Data State: visible cells, formula map, dependency metadata.
Network State: pending operations, in-flight requests, connection state.

9.2 State Ownership

Grid subsystem owns viewport and recycle pool state.
Formula subsystem owns parsed expressions and local eval cache.
Sync subsystem owns operation queue, versions, and reconciliation markers.

9.3 Persistence and Recovery

Persist lightweight session cache for faster resume.
Recover with server snapshot + operation replay after reconnect.

10. Real Time Collaboration and Conflict Handling

10.1 Operation Model

Each operation includes:

Document/sheet/cell target
Intent (set value, set formula, format, structural change)
Base revision/version
Operation id (idempotency)

10.2 OT vs CRDT (Spreadsheet Context)

OT: central ordering + transform logic, simpler integration with authoritative server pipeline.
CRDT: peer-friendly conflict-free merges, often higher metadata complexity.

For spreadsheet workloads with central authoritative compute, OT-style sequencing with deterministic server merge is commonly pragmatic.

10.3 Presence, Cursor, and Selection Sync

Presence channel broadcasts user online status.
Cursor/selection updates are ephemeral and throttled.
Presence events are separated from durable cell operations.

11. High Level API Design

11.1 Required APIs

GET /workbooks/:id -> workbook metadata
GET /sheets/:id/chunks?range=... -> chunked cell data
POST /operations/batch -> submit versioned operations
GET /revisions/:workbookId?since=... -> incremental sync
WS /collab/:workbookId -> real-time patches and presence

11.2 Request and Response Shape

Request includes baseVersion, opId, and operation payload.
Response includes accepted version and minimal delta patch.

11.3 Versioning and Idempotency

Reject or rebase stale operations.
Deduplicate retries using operation id.
Ensure replay safety after reconnect.

12. Caching, Chunking, and Large Dataset Strategy

Lazy fetch only visible and near-visible chunks.
Keep LRU cache of recently visited chunks.
Prefetch in scroll direction to hide network latency.
Use delta compression for patches.
Optional hybrid storage path:
- Row/cell oriented for transactional edits.
- Columnar paths for heavy analytical scans.

13. Cross Cutting Non Functional Concerns

13.1 Security

Authenticated APIs and WebSocket channels.
Workbook/sheet ACL enforcement server-side.
TLS in transit and strict audit trails for critical actions.

13.2 Accessibility

Full keyboard navigation for cell traversal and editing.
Screen reader announcements for active cell and formula errors.
Focus management during edit mode transitions.

13.3 Performance Optimization

Virtualization + overscan + recycling.
Batched paints and minimized style/layout thrash.
Incremental graph recompute and delta fanout.

13.4 Observability and Reliability

Metrics: edit latency, recalc latency, patch size, dropped frames, reconnect rate.
Tracing across UI edit -> sync -> calc -> patch apply.
Alerting on recalculation backlog and collaboration lag.

14. Edge Cases and Tradeoffs

Very large paste operations: chunk parse/apply to avoid UI lockups.
Formula storms: throttle non-critical recalcs; prioritize visible range updates.
Concurrent edits on same cell: deterministic server ordering and last accepted version policy.
Offline edits: queue operations and reconcile on reconnect.
Tradeoff: larger overscan improves smoothness but increases memory and CPU cost.
Tradeoff: aggressive local eval improves UX but increases potential rollback frequency.

15. Interview Summary

Problem	Primary Solution
Millions of cells	Virtualized rendering
Scroll jank	Windowing + DOM recycling
Fast formula feedback	Client-side parse/eval + AST
Correct dependent updates	Server-maintained dependency graph
High update throughput	Incremental recomputation + deltas
Huge datasets	Lazy chunk loading + cache
Multi-user editing	Versioned operations with deterministic merge

Final Mental Model

Google Sheets behaves like a virtualized UI over a distributed reactive compute graph.
Frontend guarantees interaction speed, while backend guarantees convergence and correctness.

More Details:

Get all articles related to system design
Hashtag: SystemDesignWithZeeshanAli

systemdesignwithzeeshanali

Git: https://github.com/ZeeshanAli-0704/front-end-system-design