Vikas Kumar

Design HLD - Dropbox | Image Upload Service

Requirements

Functional Requirements

  1. Support image upload and download across devices.
  2. Identify and manage exact duplicate images.
  3. Ensure safe retry of upload operations.
  4. Support image transformations (e.g., thumbnails).
  5. Provide secure image access.
  6. Support automatic synchronization across user devices.
  7. Support safe image deletion.

Non-Functional Requirements

  1. Highly available and fault tolerant.
  2. Low-latency and high-throughput operations.
  3. High scalability with growing traffic.
  4. Durable and reliable file storage.
  5. Secure storage and access control.
  6. Support large file uploads up to 50 GB.
  7. Cost-efficient at scale.

Key Concepts You Must Know

These concepts are referenced throughout the design below.

Object Storage vs Metadata Storage

Object storage is a distributed storage system optimized for storing large, unstructured binary data, while metadata storage is a structured data store used to manage information about those objects.

  • Databases are optimized for small, structured records and queries, not large files.
  • Object storage systems are optimized for durability, scalability, and cost, but not for complex querying.
  • Separating image bytes from metadata allows each system to do what it is best at.

Analogy (Library Model)
Object storage is the warehouse storing heavy books. Metadata storage is the catalog system telling you what the book is and where it lives.

Example
Metadata DB → image_id, owner_id, size, hash, storage_path
Object Store → actual image bytes

Multipart / Resumable Uploads

Multipart uploads divide large files into smaller parts that can be uploaded independently and reassembled by the storage system.

  • Large uploads are prone to network failures and timeouts.
  • Chunking allows retries at a fine-grained level instead of restarting the entire upload.
  • Upload state is tracked via an upload session.

Analogy (Shipping Boxes)
Instead of shipping one huge box, ship many small boxes. If one box is lost, only that box is resent.

Example
UploadSession ID
→ Chunk 1 uploaded
→ Chunk 2 uploaded
→ Chunk 3 failed → retry
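
A minimal client-side sketch of chunked upload with per-chunk retries. This is illustrative only: upload_chunk is a hypothetical function standing in for the real chunk-upload API, and the 5 MB chunk size is an assumption.

CHUNK_SIZE = 5 * 1024 * 1024  # assumed 5 MB per chunk

def upload_file(path, session_id, upload_chunk, max_retries=3):
    """Upload a file chunk by chunk; retry only the chunks that fail."""
    failed = []
    with open(path, "rb") as f:
        chunk_number = 0
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            for _ in range(max_retries):
                if upload_chunk(session_id, chunk_number, data):  # True on success
                    break
            else:
                failed.append(chunk_number)  # resume later with only these chunks
            chunk_number += 1
    return failed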

Signed / Time-Bound URLs

Signed URLs provide temporary, secure access to private objects by embedding authentication information into the URL itself.

  • The backend validates access and generates a URL with an expiry time and signature.
  • Storage systems trust the signature and serve the object directly.
  • This avoids routing large downloads through application servers.

Analogy (Hotel Key Card)
A hotel card opens your room only for a limited time. After checkout, it stops working automatically.

Example
GET /image/123
→ Backend returns signed URL (expires in 5 min)
→ Client downloads from storage
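
For illustration (the design itself is storage-agnostic), a backend using AWS S3 could mint such a URL with boto3; the bucket name here is a placeholder:

import boto3

s3 = boto3.client("s3")

def get_download_url(storage_path: str, expires_in: int = 300) -> str:
    # The expiry and signature are embedded in the URL; S3 serves the object
    # directly, so the bytes never pass through our application servers.
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "image-blobs", "Key": storage_path},
        ExpiresIn=expires_in,  # 5 minutes
    )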

Content-Based Deduplication

Content-based deduplication eliminates redundant data by identifying identical content using cryptographic hashes.

  • Before storing an image, the system computes its hash.
  • If the hash already exists, storage is skipped and a new reference is created.
  • Multiple users can reference the same underlying object.

Analogy (Pointer to Same File)
Instead of saving the same file twice, create another pointer to it.

Example
Hash(H1) exists
→ ref_count++
→ no new storage write
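
A rough sketch of that decision at upload-completion time. The metadata-store helpers (get_blob, increment_ref, create_blob, create_image) are hypothetical names, not a real API:

def finalize_upload(owner_id, content_hash, new_object_key, meta):
    blob = meta.get_blob(content_hash)
    if blob is not None:
        meta.increment_ref(content_hash)        # same bytes already stored
        storage_path = blob["storage_path"]     # no new storage write
    else:
        meta.create_blob(content_hash, new_object_key, ref_count=1)
        storage_path = new_object_key
    # Each user still gets their own Image row pointing at the shared blob.
    # (A production system would guard this with a conditional insert to avoid races.)
    return meta.create_image(owner_id, content_hash, storage_path)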

Cryptographic Hash (SHA-256)

SHA-256 is a cryptographic hash function that produces a fixed-length, collision-resistant fingerprint for any input.

  • Same input always produces the same hash.
  • Any change in input produces a drastically different hash.
  • Collision probability is negligible for practical systems.

Analogy (DNA for Files)
A file's hash is like its DNA: a compact fingerprint that uniquely identifies its exact contents.

Example
image.jpg → SHA-256 → 256-bit hash
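
For example, with Python's hashlib the hash can be computed in a streaming fashion, so the full file never has to fit in memory:

import hashlib

def sha256_of_file(path: str, chunk_size: int = 4 * 1024 * 1024) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)       # hash one chunk at a time
    return h.hexdigest()          # 64 hex characters = 256 bits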

Idempotent Operations

Idempotency ensures that repeating an operation produces the same final state as executing it once.

  • Network failures often cause retries.
  • Without idempotency, retries can corrupt data or create duplicates.
  • Idempotency is usually enforced using unique request IDs.

Analogy (Light Switch)
Turning the light ON multiple times keeps it ON.

Example
DELETE image/123
→ deleted = true
→ retry DELETE → no change
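
A minimal sketch of an idempotent delete handler; the db object and its methods are hypothetical stand-ins for the metadata store:

def delete_image(image_id, db):
    image = db.get(image_id)
    if image is None or image["status"] == "deleted":
        # Retry of an earlier delete: same response, no further state change.
        return {"status": "deleted"}
    db.update(image_id, status="deleted")   # soft delete only (phase 1)
    return {"status": "deleted"}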

Two-Phase Deletion

Two-phase deletion separates logical deletion from physical deletion to ensure safety and consistency.

  • Immediate physical deletion is risky in distributed systems.
  • Soft delete hides the image immediately.
  • Hard delete is done later by a background process.

Analogy (Recycle Bin)
You delete a file → it goes to trash → later permanently removed.

Example
Phase 1: deleted = true
Phase 2: GC job removes blob
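
A sketch of the second phase as a background job, assuming hypothetical metadata and object-store helpers and the ref_count semantics described later in the schema:

def gc_pass(meta, object_store, grace_days=30):
    """Phase 2: physically remove blobs that no live image references anymore."""
    for image in meta.list_images(status="deleted", older_than_days=grace_days):
        remaining = meta.decrement_ref(image["content_hash"])
        meta.purge_image(image["image_id"])
        if remaining == 0:
            blob = meta.get_blob(image["content_hash"])
            object_store.delete(blob["storage_path"])   # drop the bytes
            meta.purge_blob(image["content_hash"])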


Capacity Estimation

Key Assumptions

  • DAU (Daily Active Users): ~10 million
  • Uploads per user per day: ~2 images
  • Average image size: ~5 MB
  • Traffic pattern: Read-heavy (images viewed more than uploaded)
  • System scale: Large-scale, distributed system assumed

Upload Volume Estimation

Total uploads per day => 10M users × 2 uploads = ~20M images/day
Total data uploaded per day => 20M images × 5 MB ≈ ~100 TB/day

Throughput Estimation (QPS)

Write Traffic - Average write QPS (Queries Per Second) => 20M / 86,400 ≈ 230, i.e. ~200+ uploads/sec
Read Traffic - Reads are assumed ~5× writes => Average read QPS: ~1,000+/sec

Metadata Size Estimation

Metadata per image: ~100 bytes (IDs, hash, timestamps, flags)
Metadata per day => 20M × 100 B ≈ ~2 GB/day
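
The same back-of-the-envelope numbers written out as a quick script (rough daily averages, ignoring peak factors):

dau = 10_000_000
uploads_per_user = 2
avg_image_mb = 5
metadata_bytes = 100

uploads_per_day = dau * uploads_per_user                        # 20M images/day
upload_tb_per_day = uploads_per_day * avg_image_mb / 1_000_000  # ~100 TB/day
write_qps = uploads_per_day / 86_400                            # ~230/sec
read_qps = 5 * write_qps                                        # ~1,000+/sec
metadata_gb_per_day = uploads_per_day * metadata_bytes / 1e9    # ~2 GB/day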


Core Entities

  • User: Represents a system user who uploads, owns, and accesses images.
  • Image: Represents a logical image uploaded by a user; stores ownership and state, not the raw image bytes.
  • ImageObject (ImageBlob): Represents the actual binary image file stored in object storage; can be shared across multiple images due to deduplication.
  • ImageVariant: Represents derived versions of an image such as thumbnails or resized formats.
  • UploadSession: Represents an in-progress multipart upload and enables safe retries and resumable uploads.

Database Design

Users Table

Represents system users.

User
----
user_id (PK)
email
created_at
status

Used for

  • Ownership
  • Sharing
  • Access control

Image (Asset) Table

Represents a user-visible image.

Image
-----
image_id (PK)
owner_id (FK → User)
content_hash
name
size
visibility
status (active / deleted)
created_at
updated_at

Key Points

  • One row per user image.
  • Multiple images can reference the same content hash.
  • Soft delete is handled via status.

ImageContent (Blob) Table

Represents the actual stored image content.

ImageContent
------------
content_hash (PK)
storage_path
size
ref_count
created_at


Key Points

  • One row per unique image content.
  • ref_count tracks how many images reference this blob.
  • Enables safe deduplication and deletion.

ImageVariant Table

Represents thumbnails or resized versions.

ImageVariant
------------
variant_id (PK)
content_hash (FK → ImageContent)
variant_type (thumbnail_small, large, etc.)
storage_path
created_at

Key Points

  • Variants are tied to content, not individual users.
  • Generated asynchronously.

UploadSession Table

Tracks multipart uploads.

UploadSession
-------------
upload_session_id (PK)
owner_id
content_fingerprint
status (uploading / completed)
created_at
expires_at

Optional (if chunk-level tracking is needed)

UploadChunk
-----------
upload_session_id (FK)
chunk_number
status (uploaded / pending)
etag

Key Points

  • Enables resumable uploads.
  • Prevents restarting large uploads.

Indexing Strategy

| Access Pattern    | Index                  |
| ----------------- | ---------------------- |
| Fetch user images | (owner_id, created_at) |
| Dedup lookup      | content_hash           |
| Cleanup jobs      | status + ref_count     |
| Sync              | updated_at             |

Indexes are chosen based on actual query patterns, not theoretical normalization.

Consistency Model

  • Strong consistency for metadata updates (uploads, deletes).
  • Eventual consistency for sync across devices, variant availability, and background cleanup.

This balances correctness with scalability.

Transactions & Conditional Writes

  • Deduplication uses conditional inserts on content_hash.
  • Reference counts are updated atomically.
  • Prevents race conditions when multiple users upload the same image.

Failure Handling at DB Level

  • If metadata write fails → upload not finalized.
  • Orphaned blobs are cleaned by background jobs.
  • DB failures degrade performance, not correctness.

API / Endpoints

Start Upload → POST: /uploads

Initializes a new upload session and returns the chunk size and session ID.

Request

{
  "file_name": "photo.jpg",
  "file_size": 50000000,
  "mime_type": "image/jpeg"
}

Response

{
  "upload_session_id": "us_123",
  "chunk_size": 5000000
}

Upload Chunk → PUT: /uploads/{upload_session_id}/chunks/{chunk_number}

Uploads a single chunk of the file and supports safe retries.

Request

Raw binary chunk data

Response

{
  "chunk_number": 3,
  "status": "uploaded"
}

Chunk number = position of this piece in the file (0,1,2,…)

Complete Upload → POST: /uploads/{upload_session_id}/complete

Finalizes the upload, assembles chunks, checks deduplication, and creates the image.

Response

{
  "image_id": "img_456",
  "status": "completed"
}

Get Image → GET: /images/{image_id}

Returns a time-bound signed URL to securely download the image.

Response

{
  "download_url": "https://signed-url",
  "expires_in": 300
}

Get Image Metadata → GET: /images/{image_id}/metadata

Fetches lightweight metadata without downloading the image.

Response

{
  "image_id": "img_456",
  "owner_id": "user_1",
  "size": 50000000,
  "status": "active",
  "created_at": "2026-02-05T10:00:00Z"
}

Update Image Metadata → PATCH: /images/{image_id}

Updates image metadata such as name or visibility.

Request

{
  "name": "vacation_photo.jpg",
  "visibility": "private"
}

Response

{
  "status": "updated"
}

Image Variants (Thumbnails) → GET: /images/{image_id}/variants/{variant_type}

Returns a signed URL for a specific image variant (e.g., thumbnail).

Response

{
  "download_url": "https://signed-url",
  "variant": "thumbnail_small"
}


Soft Delete → DELETE: /images/{image_id}

Soft-deletes the image by marking it as deleted in metadata.

Response

{
  "status": "deleted"
}

Hard Delete (Internal) → POST: /internal/images/{image_id}/cleanup

Permanently removes the image from storage after safety checks.

Response

{
  "status": "permanently_deleted"
}

Sync API (Multi-Device) → GET: /sync?since=timestamp

Returns images added, updated, or deleted since the last sync.

Response

{
  "added": ["img_789"],
  "updated": ["img_456"],
  "deleted": ["img_123"]
}
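
Tying the endpoints above together, a client upload might look roughly like this (Python with requests; the host is a placeholder and error handling/retries are omitted for brevity):

import os
import requests

BASE = "https://api.example.com"   # placeholder host

def upload_image(path, auth):
    # 1. Start an upload session
    session = requests.post(f"{BASE}/uploads", headers=auth, json={
        "file_name": os.path.basename(path),
        "file_size": os.path.getsize(path),
        "mime_type": "image/jpeg",
    }).json()
    sid, chunk_size = session["upload_session_id"], session["chunk_size"]
    # 2. Upload chunks one by one
    with open(path, "rb") as f:
        chunk_number = 0
        while chunk := f.read(chunk_size):
            requests.put(f"{BASE}/uploads/{sid}/chunks/{chunk_number}",
                         headers=auth, data=chunk)
            chunk_number += 1
    # 3. Finalize: the server assembles chunks, dedups, and creates the image
    return requests.post(f"{BASE}/uploads/{sid}/complete", headers=auth).json()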

System Components

1. Client (Web / Mobile)

  • Provides UI for users to upload, download, view, and delete images.
  • Splits large images into fixed-size chunks and uploads them independently.
  • Retries only failed chunks during network failures.
  • Maintains local image state and syncs changes with the server.

2. Load Balancer & API Gateway

  • Acts as the single entry point for all client requests.
  • Authenticates users and enforces authorization rules.
  • Applies rate limiting and routes requests to backend services.
  • Shields backend services from direct internet exposure.

3. Image Service (Application Layer)

  • Stateless service that orchestrates all workflows.
  • Creates and manages upload sessions.
  • Generates signed URLs for secure upload and download.
  • Validates permissions and updates image metadata.
  • Coordinates deduplication, deletion, and sync logic.
  • Never handles raw image bytes directly.

4. Metadata Database

  • Persists all image-related metadata and relationships.
  • Stores ownership, content hash, object location, reference counts, and lifecycle state.
  • Serves as the source of truth for deduplication, access control, synchronization, and deletion safety.

5. Object Storage

  • Stores the actual image binaries and transformed variants.
  • Images are addressed using their content hash.
  • Guarantees high durability and virtually unlimited scale.
  • Supports large objects (up to 50 GB).

6. Image Processing Service (Async Workers)

  • Consumes upload-completion events.
  • Generates thumbnails and other image variants asynchronously.
  • Writes transformed images back to object storage.
  • Updates metadata once processing completes.
  • Scales independently from user traffic.

7. CDN (Content Delivery Network)

  • Caches images and thumbnails close to end users.
  • Serves read-heavy traffic efficiently.
  • Uses signed URLs to ensure only authorized access.
  • Reduces load on object storage and backend services.

8. Sync / Notification Layer

  • Observes metadata changes in the system.
  • Notifies connected devices of updates using push (WebSockets/SSE) for actively connected clients and polling for the rest.
  • Enables eventual consistency across all devices.

High-Level Flows

Flow 1: Image Upload

  • Client requests an upload session from the Image Service.
  • Image Service returns chunk size and signed upload URLs.
  • Client uploads image chunks directly to object storage.
  • On completion, the Image Service computes the SHA-256 hash, checks for duplicates, and creates or updates metadata.
  • Image becomes available across devices.

Flow 2: Retry / Resume Upload

  • If a chunk upload fails, the client retries only that chunk.
  • Upload session tracks completed chunks.
  • Duplicate chunk uploads are ignored.
  • Ensures idempotent and reliable uploads.

Flow 3: Image Download

  • Client requests access to an image.
  • Image Service verifies ownership or shared access.
  • A time-bound signed URL is generated.
  • Client downloads the image from CDN or object storage.

Flow 4: Deduplication

  • SHA-256 hash uniquely identifies image content.
  • If a matching hash exists, no new blob is stored and the reference count is incremented.
  • If not, the image is stored as a new object.
  • Each user receives an independent asset reference.

Flow 5: Image Transformation

  • Upload completion emits an asynchronous event.
  • Image processing workers generate thumbnails and variants.
  • Variants are stored as separate objects.
  • Metadata is updated to reference new variants.

Flow 6: Multi-Device Synchronization

  • Metadata updates record change timestamps or versions.
  • Other devices fetch changes via sync APIs or receive push notifications.
  • Devices apply updates locally.
  • System converges using eventual consistency.
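
A minimal polling variant of this flow, assuming the /sync endpoint shown earlier and a locally persisted cursor (apply_changes is a hypothetical local callback):

import time
import requests

def sync_loop(base_url, auth, apply_changes, since, interval=30):
    while True:
        delta = requests.get(f"{base_url}/sync", headers=auth,
                             params={"since": since}).json()
        # Apply metadata deltas locally; image bytes are fetched lazily on view.
        apply_changes(delta["added"], delta["updated"], delta["deleted"])
        since = time.time()   # in practice, use a server-provided cursor/version
        time.sleep(interval)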

Flow 7: Image Deletion (Two-Phase)

  • User deletes image → metadata is marked as deleted.
  • Image is immediately hidden from all devices.
  • Background job checks reference count.
  • Image blob is permanently removed only when no references remain.

Deep Dives - Functional Requirements

1. Support Image Upload and Download Across Devices

  • Clients (web, mobile, desktop) upload images using direct-to-object-storage uploads via signed URLs.
  • Large files are split into chunks and uploaded independently.
  • Downloads use time-bound signed URLs and are served via CDN.
  • This allows seamless access from any device with low latency and high throughput.

2. Identify and Manage Exact Duplicate Images

  • The system computes a SHA-256 hash of image content during upload.
  • This hash uniquely identifies the image bytes.
  • If the hash already exists, the image blob is not stored again.
  • A new metadata reference (asset) is created pointing to the existing content.

3. Ensure Safe Retry of Upload Operations

  • Uploads use multipart (chunked) uploads.
  • Each chunk is uploaded independently and tracked via an upload session.
  • Failed chunks are retried without re-uploading completed chunks.
  • Operations are idempotent, preventing duplicate writes.

4. Support Image Transformations (e.g., Thumbnails)

  • After upload completion, an event is emitted.
  • Asynchronous workers generate thumbnails and other image variants.
  • Transformed images are stored separately and linked via metadata.
  • This keeps uploads fast and processing scalable.

5. Provide Secure Image Access

  • All images are stored in private object storage.
  • Access is granted using short-lived signed URLs after permission checks.
  • URLs expire automatically, limiting unauthorized access.
  • CDN integration ensures fast and secure delivery.

6. Support Automatic Synchronization Across User Devices

  • Metadata is the source of truth for image state.
  • Clients sync changes using polling or push notifications (WebSocket/SSE).
  • Only deltas (added, updated, deleted images) are synced.
  • Ensures eventual consistency across all devices.

7. Support Safe Image Deletion

  • Deletion is handled using a two-phase delete.
  • First, the image is soft-deleted in metadata and hidden immediately.
  • A background job deletes the image blob only when no references remain.
  • This prevents accidental data loss and works with deduplication.

Deep Dives - Non-Functional Requirements

1. High Availability & Fault Tolerance

  • All backend services are stateless and deployed across multiple availability zones.
  • Metadata and storage systems are replicated.
  • Idempotent APIs ensure retries don’t corrupt state.
  • Availability: 99.9%+ (system remains usable despite node/AZ failures)

2. Low Latency & High Throughput

  • Uploads and downloads go directly to object storage using signed URLs.
  • CDN serves read traffic close to users.
  • Duplicate uploads are short-circuited before storing data.
  • Heavy work (thumbnails, scans) runs asynchronously.
  • Duplicate upload latency: < 50 ms (no file transfer)
  • Image read latency (CDN): ~5–20 ms

3. High Scalability with Growing Traffic

  • Stateless services scale horizontally.
  • Metadata, storage, and processing scale independently.
  • Sharding by user/content hash avoids hotspots.
  • Scaling model: Linear (add instances → increase capacity)

4. Durable & Reliable File Storage

  • Images are stored in object storage with built-in replication.
  • Content-addressed (hash-based) storage ensures immutability.
  • Metadata is persisted in a replicated database.
  • Durability: Object storage-grade (11 nines)

5. Secure Storage & Access Control

  • All data encrypted in transit and at rest.
  • Storage buckets remain private.
  • Access granted via short-lived signed URLs after permission checks.
  • Signed URL validity: 5–10 minutes

6. Support Large File Uploads (Up to 50 GB)

  • Files are uploaded using multipart (chunked) uploads.
  • Clients retry only failed chunks.
  • Upload state tracked via upload sessions.
  • Max file size: 50 GB (network-bound, not server-bound)

7. Cost Efficiency at Scale

  • Exact deduplication stores identical images only once.
  • CDN reduces repeated reads from storage.
  • Lifecycle rules clean up unused data.
  • Storage savings via dedup: Significant (workload-dependent)

Trade-Offs

1. Object Storage vs Database for Image Bytes

Choice: Store image bytes in object storage, not in a database.

Pros

  • Handles very large files efficiently
  • High durability and low cost
  • Scales independently from metadata

Cons

  • No complex querying on image data
  • Requires separate metadata store

Why This Works

Databases are optimized for small, structured data. Object storage is purpose-built for large blobs and is the industry standard for this use case.

2. Content-Based Deduplication (SHA-256)

Choice: Deduplicate images using cryptographic hashes.

Pros

  • Massive storage savings
  • Simple, deterministic duplicate detection
  • Enables safe reference counting

Cons

  • Hash computation adds CPU overhead
  • Only detects exact duplicates (not visually similar images)

Why This Works

Exact deduplication is reliable, fast, and sufficient for most storage optimization needs. Near-duplicate detection can be added later asynchronously.

3. Multipart Uploads vs Single Upload

Choice: Use multipart (chunked) uploads.

Pros

  • Supports very large files (up to 50 GB)
  • Allows resumable uploads
  • Improves user experience and reliability

Cons

  • More complex client logic
  • Requires tracking upload state

Why This Works

Single uploads do not scale for large files and fail badly under unreliable networks. Chunking is the industry-standard solution.

4. Direct-to-Object Storage Uploads

Choice: Clients upload/download directly from object storage using signed URLs.

Pros

  • Very high throughput
  • Backend stays lightweight and scalable
  • Lower infrastructure cost

Cons

  • Less visibility into byte-level progress on backend
  • Requires careful security handling

Why This Works

Keeping application servers out of the data path is critical for performance and cost at scale.

5. Asynchronous Image Processing

Choice: Generate thumbnails and variants asynchronously.

Pros

  • Faster upload completion
  • Better system throughput
  • Easy horizontal scaling

Cons

  • Variants are not immediately available
  • Requires eventual consistency handling

Why This Works

Users care more about upload completion than immediate thumbnails. Async processing optimizes both latency and scale.

6. Two-Phase Deletion

Choice: Soft delete first, hard delete later.

Pros

  • Prevents accidental data loss
  • Works safely with deduplication
  • Enables recovery and auditing

Cons

  • Requires background cleanup jobs
  • Storage freed with a delay

Why This Works

Immediate deletion is dangerous in distributed systems. Two-phase deletion is safer and widely used.

7. Eventual Consistency for Sync

Choice: Use eventual consistency for multi-device synchronization.

Pros

  • High availability and scalability
  • Reduced coordination overhead
  • Better performance under load

Cons

  • Temporary inconsistencies across devices
  • Requires conflict resolution logic

Why This Works

Strong consistency is unnecessary for file sync and would significantly reduce system availability and throughput.

8. Signed URLs as Bearer Tokens

Choice: Use short-lived signed URLs for access control.

Pros

  • Simple and scalable access control
  • Works seamlessly with CDN
  • No backend involvement during download

Cons

  • URLs can be shared while valid
  • Requires short expiration windows

Why This Works

Short-lived URLs significantly reduce risk while enabling high-performance delivery. Additional restrictions can be layered if needed.


Frequently Asked Questions in Interviews

Q. Why do production systems strictly separate binary storage from metadata storage?

Relational and NoSQL databases are optimized for small, mutable records with indexing and transactions. Storing large binaries:

  • Pollutes the buffer cache
  • Increases replication lag
  • Makes backups and restores slow
  • Raises cost per GB significantly

Object storage is optimized for immutable large objects, providing:

  • Multi-AZ replication by default
  • High write throughput
  • Lifecycle policies (cold storage, deletion)
  • No need for manual sharding

The metadata DB stores only pointers (object_key, hash, size) — never raw bytes.

Q. What does a real metadata schema look like?

A minimal but scalable model:

Blob Table (Content-level)

hash (PK)
object_key
size
ref_count
created_at

Image Table (Ownership-level)

image_id (PK)
user_id (indexed)
hash (FK)
visibility / ACL
created_at
deleted_at

This allows:

  • Exact deduplication
  • Independent ownership
  • Safe deletion via reference counting

Q. Why are uploads designed as direct-to-object-storage in real systems?

Because backend servers:

  • Are expensive per byte
  • Are limited by NIC bandwidth
  • Add failure points

In production, backend servers act as a control plane:

  • Issue upload credentials
  • Validate metadata
  • Finalize uploads

All file bytes flow directly from client → object storage.

Q. How are signed uploads implemented technically?

Backend:

  • Initiates multipart upload with object storage
  • Generates signed URLs for each part
  • Returns upload session metadata to client

Client:

  • Uploads parts directly using signed URLs
  • Retries failed parts independently
  • Calls “complete upload” API after all parts succeed

The backend never touches file bytes.
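
As one concrete illustration (the answer above is storage-agnostic), with AWS S3 and boto3 the control-plane steps could look like this; the bucket name is a placeholder:

import boto3

s3 = boto3.client("s3")
BUCKET = "image-blobs"   # placeholder

def start_signed_upload(object_key, num_parts, expires_in=900):
    mpu = s3.create_multipart_upload(Bucket=BUCKET, Key=object_key)
    upload_id = mpu["UploadId"]
    part_urls = [
        s3.generate_presigned_url(
            "upload_part",
            Params={"Bucket": BUCKET, "Key": object_key,
                    "UploadId": upload_id, "PartNumber": n},
            ExpiresIn=expires_in,
        )
        for n in range(1, num_parts + 1)   # S3 part numbers start at 1
    ]
    return upload_id, part_urls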

Q. How is the entire upload workflow made idempotent?

Idempotency is enforced at three layers:

  • The upload session ID uniquely identifies an upload attempt
  • Chunk uploads are keyed by (session_id, part_number)
  • The completion step uses a conditional update:

UPDATE uploads
SET status = COMPLETED
WHERE session_id = X AND status != COMPLETED

Retries are safe at every step.

Q. What happens if object storage succeeds but metadata commit fails?

The upload remains in a COMPLETED_IN_STORAGE but PENDING_METADATA state.

A background reconciler:

  • Scans incomplete uploads
  • Verifies object existence
  • Retries metadata commit
  • Expires uploads past their TTL

No user-visible corruption occurs.

Q. Why is content-addressed storage used instead of IDs?

IDs identify ownership, not content.

Content hashes provide:

  • Deterministic identity
  • Deduplication
  • Integrity verification

Using IDs alone makes deduplication race-prone and expensive.

Q. When and how is the hash computed?

Client computes hash while chunking the file (streaming).
This avoids loading the full file into memory.

Optionally:

  • Backend verifies hash asynchronously for trust
  • Upload path is never blocked on verification

Q. How do you safely deduplicate under concurrent uploads?

Blob creation uses conditional insert:

INSERT INTO blobs (hash, ...)
IF NOT EXISTS

Outcomes:

  • One writer wins
  • Others reuse the existing blob
  • The reference count increment is atomic

No locks, no race conditions.
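
The same idea expressed as a PostgreSQL-style upsert. This is a sketch only: it folds blob creation and the reference-count increment into one atomic statement, assuming the Blob table above and a psycopg2-style connection:

def get_or_create_blob(conn, content_hash, object_key, size):
    with conn.cursor() as cur:
        # Only one concurrent writer creates the row; everyone else hits
        # the conflict branch and bumps ref_count atomically.
        cur.execute(
            """INSERT INTO blobs (hash, object_key, size, ref_count)
               VALUES (%s, %s, %s, 1)
               ON CONFLICT (hash) DO UPDATE
                   SET ref_count = blobs.ref_count + 1
               RETURNING object_key""",
            (content_hash, object_key, size),
        )
        winning_key = cur.fetchone()[0]   # the existing key if we lost the race
    conn.commit()
    return winning_key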

Q. How do you avoid hot-hash contention?

  • Shard the blob table by hash prefix
  • Cache hash existence in Redis
  • Use Bloom filters to skip DB hits on negative lookups

This keeps deduplication fast even for viral content.

Q. Why are multipart uploads mandatory?

Single uploads fail due to:

  • Client timeouts
  • Gateway size limits
  • Network instability

Multipart uploads allow:

  • Parallelism
  • Resume from failure
  • Independent retries per chunk

Q. How is resume implemented without backend state?

  • Object storage tracks uploaded parts.
  • The client queries the uploaded-part list and uploads only the missing chunks.

Backend state is optional — object storage is the source of truth.
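
With S3, for instance, the uploaded-part list can be fetched directly from object storage (boto3 sketch; bucket, key, and upload_id come from the upload session):

import boto3

s3 = boto3.client("s3")

def missing_parts(bucket, key, upload_id, total_parts):
    done = set()
    for page in s3.get_paginator("list_parts").paginate(
            Bucket=bucket, Key=key, UploadId=upload_id):
        for part in page.get("Parts", []):
            done.add(part["PartNumber"])
    # Resume by re-uploading only the parts that never made it.
    return [n for n in range(1, total_parts + 1) if n not in done]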

Q. What happens if an application server crashes mid-request?

  • Nothing breaks. All servers are stateless.
  • Requests retry against another instance.
  • No in-memory state is required for recovery.

Q. How does the system survive AZ or region failures?

  • App servers: multi-AZ autoscaling
  • Metadata DB: replicas + failover
  • Object storage: multi-AZ by default
  • CDN serves cached content during partial outages

Availability degrades gracefully, not catastrophically.

Q. Why is eventual consistency chosen?

Strong consistency requires cross-region coordination, increasing latency and reducing availability.

Eventual consistency:

  • Matches user expectations for file systems
  • Improves availability
  • Enables global scaling

Correctness is preserved at the metadata layer.

Q. How do multiple devices stay in sync?

Devices sync metadata deltas, not binaries:

  • Polling or push notifications
  • Only changed image IDs are fetched
  • Actual images are downloaded lazily

This minimizes bandwidth and latency.

Q. How is access control enforced technically?

  • Buckets are private
  • The backend validates ACLs
  • Signed URLs are scoped to object + operation + expiry

Clients never receive long-lived credentials.

Q. What prevents signed URL abuse?

  • Short expiration (minutes)
  • Single-object scope
  • Optional IP or device binding
  • Read-only vs write-only URLs

Even leaked URLs have a minimal blast radius.

Q. What are the largest cost optimizations in practice?

  • Exact deduplication (storage)
  • CDN caching (egress)
  • Avoiding backend data transfer
  • Lifecycle rules for cold data

These dwarf micro-optimizations.

Q. Why not aggressively compress images?

JPEG/PNG/WebP are already compressed.
Extra compression:

  • Increases CPU cost
  • Adds latency
  • Saves negligible space

Compression is applied selectively, not globally.

Q. What bottleneck appears first at scale?

Metadata write throughput.
Solved via:

  • Sharding
  • Batching
  • Async writes
  • Cache-first lookups

Q. What changes at 10× or 100× scale?

Architecture remains unchanged.
We add:

  • More shards
  • More async workers
  • More regions

No redesign — only capacity expansion.

High-Level Summary

This system allows users to upload, store, and sync images across devices at scale. Images are stored using content-addressed object storage to enable exact deduplication, while metadata drives access control, synchronization, and lifecycle management. Large uploads are handled using multipart uploads, and all heavy processing is done asynchronously to keep latency low.

Feel free to ask questions or share your thoughts — happy to discuss!

