Introduction
Google Drive seems simple — upload a file and access it anywhere. But at 1 billion users, it hides some of the most elegant distributed systems engineering in the industry: resumable uploads, intelligent deduplication, delta sync, real time collaboration, and offline catch up, all working seamlessly together. This post walks through the design one interview question at a time, including wrong turns and how to navigate out of them.
Challenge 1: Resumable Uploads
Interview Question: User uploads a 5GB video file. Halfway through the upload their internet connection drops. Without smart design they must restart the entire upload from scratch. How do you design uploads so they resume exactly where they left off?
Navigation: The key insight is that you never need to treat a large file as a single atomic upload. If you split it into smaller independent pieces and track which pieces succeeded, you only need to retry the failed pieces.
Solution: Chunked upload with client side state tracking and checksum validation.
Upload flow:
- Client splits 5GB file into chunks — typically 5MB each — producing roughly 1000 chunks
- Client maintains upload state locally tracking each chunk as PENDING, UPLOADED, or FAILED
- Client uploads chunks sequentially or in parallel
- Connection drops — client knows exactly which chunks succeeded from local state
- Connection restores — client resumes from first failed or pending chunk
- All chunks uploaded — server merges into single complete file
- Server computes checksum of merged file and compares with client computed checksum
- Checksum match — file integrity confirmed, upload complete
- Checksum mismatch — corruption detected — per-chunk checksums identify which chunks are corrupt, and only those are re-uploaded
Zero re-uploading of already completed chunks regardless of how many times the connection drops.
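The flow above can be sketched as client-side state tracking. This is a minimal illustration, not Google's implementation — the `send_chunk` callable stands in for a hypothetical network layer, and chunk sizes are shrunk for readability:

```python
import hashlib

CHUNK_SIZE = 5 * 1024 * 1024  # 5MB, matching the example above


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Split a file's bytes into fixed-size chunks."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


class ResumableUpload:
    """Client-side upload state: each chunk is PENDING, UPLOADED, or FAILED."""

    def __init__(self, data: bytes, chunk_size: int = CHUNK_SIZE):
        self.chunks = split_into_chunks(data, chunk_size)
        self.state = ["PENDING"] * len(self.chunks)
        self.checksum = hashlib.sha256(data).hexdigest()  # for the final integrity check

    def remaining(self) -> list[int]:
        """Indices of chunks that still need uploading — the resume point."""
        return [i for i, s in enumerate(self.state) if s != "UPLOADED"]

    def upload(self, send_chunk) -> None:
        """Attempt every remaining chunk; send_chunk may raise on network failure."""
        for i in self.remaining():
            try:
                send_chunk(i, self.chunks[i])
                self.state[i] = "UPLOADED"
            except ConnectionError:
                self.state[i] = "FAILED"  # retried on the next upload() call
```

Because the state lives on the client, a reconnect simply calls `upload()` again and only the non-`UPLOADED` chunks travel over the wire.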
Key Insight: Chunking transforms a fragile all-or-nothing upload into a resumable checkpoint based process. Client side state tracking means the server never needs to tell the client where to resume — the client already knows.
Challenge 2: Storage Deduplication
Interview Question: User A uploads a 5GB video file. User B — a completely different person — uploads the exact same 5GB file. A naive design stores two complete 5GB copies. With 1 billion users uploading popular files millions of times, this wastes petabytes of storage. How do you detect identical files and avoid storing duplicates?
Solution: Content addressable storage using file hashing.
- Client computes SHA256 hash of the file before uploading
- Client sends hash to server first
- Server checks hash against metadata database
- Hash exists — file already stored — create pointer to existing file — skip upload entirely
- Hash not found — proceed with upload — store file — save hash to metadata database
Storage savings at scale:
- 1 million users upload same movie trailer — 500MB each
- Without deduplication — 1 million copies — 500TB of storage
- With deduplication — 1 copy plus 1 million pointers — 500MB total
Key Insight: Content addressable storage uses the file's own content as its address. Identical content produces identical hash — identical hash means content already exists — no need to store it twice.
Challenge 3: Chunk Level Deduplication
Interview Question: Computing SHA256 of a 5GB file takes several seconds. Can deduplication be done more efficiently — and can it save even more storage?
Navigation: Since files are already split into chunks for resumable uploads, compute hash per chunk rather than per file. Two completely different files might share identical chunks — same embedded image, same opening credits, same boilerplate header.
Solution: Chunk level hash based deduplication with pre-upload hash check.
- Client computes hash for every chunk before uploading anything
- Client sends all chunk hashes to server in one request
- Server checks each hash against chunk database
- Server responds with which chunks already exist and which need uploading
- Client uploads only the chunks the server does not already have
Example result:
- 1000 chunk file
- Server already has 950 chunks from other files
- Client uploads only 50 new chunks — 250MB instead of 5GB
- Upload completes in seconds instead of minutes
This technique — reporting an upload as complete without transmitting the chunks the server already has — caused a famous controversy when Dropbox implemented it in 2011. Users believed their files were being fully uploaded while Dropbox silently skipped chunks it already had. The technique is legitimate but raised important transparency questions.
Key Insight: Chunk level deduplication saves more storage than file level deduplication and dramatically reduces upload time. A 5GB file might require uploading only a few hundred megabytes of genuinely new data.
Challenge 4: Deduplication Security — Hash Probing Attack
Interview Question: Cross-user chunk deduplication leaks information. How?
The attack — Hash Probing:
- Attacker has a known file — say contraband content
- Attacker computes SHA256 hash of that file
- Attacker sends hash to Google Drive server without uploading the file
- Server responds — chunk already exists, no upload needed
- Attacker now knows someone on Google Drive has that exact file
- Identified a user possessing specific content without downloading anything
This is called a Hash Probing Attack — using the deduplication mechanism as a detection oracle. Dropbox was caught vulnerable to this attack in 2011 and quietly changed their approach.
Solution: Salted hash with userID — deduplicate within user only.
Wrong approach — per user deduplication without salt:
- User A and User B upload same file — two separate copies stored
- Eliminates cross-user privacy risk but wastes storage
Better approach — salted hash:
chunk_hash = SHA256(chunk_data + userID)
- Same chunk from User A produces different hash than User B
- Hash probing defeated — the server matches hashes only within the requesting user's own account, so a probe can never reveal what another user has stored
- User A uploads same file twice — same salted hash — deduplicated to one copy
- Cross-user deduplication eliminated — privacy preserved
- Within-user deduplication fully preserved — storage still saved for same user's duplicate files
Alternative — Convergent Encryption:
Derive each chunk's encryption key from the chunk's own content hash, then encrypt before uploading. Identical plaintext always produces identical ciphertext, so cross-user deduplication still works on encrypted data and the provider never sees plaintext. The trade-off: because dedup remains cross-user, convergent encryption is still susceptible to probing for known files.
Key Insight: Cross-user deduplication leaks information about what other users have stored. Salted hashing with userID preserves within-user deduplication while making cross-user hash probing attacks impossible.
Challenge 5: Delta Sync — Only Upload What Changed
Interview Question: User edits a 100MB PowerPoint file — changes a single slide — maybe 50KB of actual changes. Without smart design Google Drive uploads the entire 100MB file again on every save. With 1 billion users constantly editing files this wastes petabytes of unnecessary uploads per day. How do you sync only the actual changes?
Solution: Delta sync using chunk hash comparison.
- File already split into chunks from upload
- Client maintains hash of every chunk locally
- User saves edited file
- Client recomputes hash for every chunk
- Compares new hashes against stored hashes
- Unchanged chunk — same hash — skip entirely
- Changed chunk — different hash — upload only this chunk
- Server updates metadata with new chunk hash
- Other devices notified to download only the changed chunk
Result for 100MB PowerPoint with one changed slide:
- 1000 chunks total — 100KB each in this example (finer-grained than the 5MB upload chunks, for smaller diffs)
- 1 chunk changed — 100KB
- Upload 100KB instead of 100MB
- 99.9 percent bandwidth saving on every incremental edit
Key Insight: Delta sync combined with chunk level hashing means editing a large file costs almost nothing in bandwidth. Only genuinely new bytes ever travel over the network.
Challenge 6: Real Time Sync Notifications
Interview Question: File changes on laptop. Phone needs to know instantly. How does the server notify the phone — and what happens if the phone is offline when the change happens?
Solution: Three tier notification strategy based on device state.
Tier 1 — App actively open — WebSocket persistent connection:
- Google Drive app open on phone — WebSocket connection maintained
- File changes on laptop — change event published to Kafka
- Notification Service consumes from Kafka
- Pushes file change notification to phone via WebSocket instantly
- Sub 100ms notification delivery — seamless real time sync experience
Tier 2 — App closed or backgrounded — FCM push notification:
- Phone app not running — no WebSocket connection
- Notification Service sends FCM push notification
- FCM wakes up app — app connects and syncs changed chunks
- Standard mobile push notification flow
Tier 3 — Device offline — Change Log with ordered event storage:
- Phone offline for hours or days
- Every file change stored as ordered event in Change Log on server
- Phone comes back online — app sends last sync timestamp to server
- Server returns all changes since that timestamp in chronological order
- App applies changes sequentially — fully caught up regardless of how long it was offline
Key Insight: WebSocket for active app, FCM for background app, and Change Log for offline devices covers every possible device state. No file change is ever missed regardless of connectivity.
Challenge 7: Change Log Retention and Long Term Offline Recovery
Interview Question: User has Google Drive on 5 devices. Tablet not used for 6 months. Thousands of missed changes. Do you replay 6 months of individual chunk changes — and how long do you keep the Change Log?
Wrong Approach: Keep Change Log forever and replay all events for any offline device.
Why It Fails: 1 billion users making constant edits generates petabytes of change events over years. Replaying 6 months of events for a returning device is wasteful when a simpler full state sync achieves the same result more efficiently.
Solution: 30 day TTL on Change Log with full state sync fallback.
Device offline less than 30 days:
- Change Log has all events within retention window
- Device connects — sends last sync timestamp
- Server replays all missed changes in order
- Device fully synced with minimal data transfer
Device offline more than 30 days:
- Change Log expired via TTL — events gone
- App performs full state sync instead
- Client sends current file metadata hashes to server
- Server compares with current state
- Server returns list of files that differ
- Client downloads only differing files — not all files
- Device fully synced regardless of how long it was offline
Key Insight: 30 day TTL bounds Change Log storage to a predictable size. Devices offline longer than retention window fall back to full state sync — which is actually more efficient than replaying months of stale intermediate events.
Challenge 8: Real Time Collaborative Editing
Interview Question: User A and User B both edit the same Google Doc simultaneously. User A types "Hello" at position 10. User B simultaneously types "World" at position 10. Both changes arrive at the server at the same millisecond. How does Google Docs resolve this without asking users to manually resolve conflicts?
Wrong Approach: Lock the section being edited so only one user can type at a time.
Why It Fails: Locking blocks collaborators from typing while someone else holds the lock. With 10 million concurrent editors, lock contention creates a terrible experience. Users stare at frozen cursors waiting for locks to release. Google Docs never blocks you — you can always type freely.
Solution: Operational Transformation — OT algorithm.
Core insight: Instead of sending the final text, send the operation — what changed and where.
User A sends: INSERT "Hello" at position 10
User B sends: INSERT "World" at position 10
Both arrive at server simultaneously. Server applies User A's operation first:
- Original document: "The quick brown fox"
- After User A: "The quick Hello brown fox"
Now User B's operation says INSERT "World" at position 10 — but position 10 has shifted because User A inserted 5 characters before it.
OT transforms User B's operation:
- Original position: 10
- User A inserted 5 characters at position 10
- Transformed position: 10 plus 5 equals 15
- Transformed operation: INSERT "World" at position 15
Final document: "The quick Hello World brown fox"
Both users see identical document. No conflict popup. No blocking. Fully seamless.
Modern alternative — CRDT Conflict Free Replicated Data Types:
- Every character assigned a globally unique ID — not just a position number
- Position derived from character relationships — not absolute index
- Insertions and deletions commute — order of application does not matter
- Used by Figma, Notion, and modern collaborative tools
- More robust than OT for complex multi-user scenarios
Key Insight: Operational Transformation allows simultaneous edits by transforming operations relative to each other rather than preventing conflicts. The result is the seamless real time collaboration experience users expect — no locks, no conflict popups, no blocked cursors.
Full Architecture Summary
Resumable uploads — Client side chunk state tracking with checksum validation
Storage deduplication — Chunk level SHA256 hash based content addressable storage
Dedup security — Salted hash with userID prevents cross-user hash probing attacks
Delta sync — Upload only changed chunks via hash comparison
Real time notifications — WebSocket for active app, FCM push for backgrounded app
Offline catch up — Change Log with 30 day TTL, full state sync beyond retention
Collaborative editing — Operational Transformation with position adjustment
Final Thoughts
Google Drive is a masterclass in applying the same core techniques recursively at every layer. Chunking solves resumable uploads, deduplication, delta sync, and parallel processing all at once. Hashing solves content addressability, change detection, and integrity validation simultaneously. TTL solves Change Log retention the same way it solved cache eviction, lock expiry, and presence detection in every other design in this series.
The most important lesson is that elegant systems reuse simple primitives everywhere. Once you understand chunking and hashing deeply, an enormous range of distributed systems problems become variations of the same theme.
Happy building. 🚀