Jatin Gupta

Posted on Jun 13

How Does File Upload Work at Scale?

#softwaredevelopment #distributedsystems #aws #google

Ever wondered what actually happens after you click the "Upload" button?
You select a file, and within seconds it appears in Google Drive or Amazon S3. But behind that simple button is a highly optimized distributed system designed to handle millions of uploads every day.

The System Design Behind Google Drive & Amazon S3 Uploads

The Naive Approach ❌
A beginner might think the process is simply:

Client
    |
    |
Upload File
    |
    v
Server
    |
    |
Store File
    |
    v
Storage

Seems easy...
But imagine:

1 GB video
10 million users
Slow internet
Network failures
Server crashes

This architecture would fail quickly.

Problems:

Server bandwidth becomes bottleneck
High CPU usage
Upload restarts if connection drops
Difficult to scale

So companies use a much smarter architecture.

Step 1: User Selects a File
When you choose a file,
the client immediately gathers metadata:

{
  filename: "vacation.mp4",
  size: 2100000000,   // bytes (about 2.1 GB)
  type: "video/mp4"
}

Notice:
The actual file is not uploaded yet.
Only metadata is prepared.
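As a rough illustration of this step, the sketch below collects the same three fields without reading the file's contents. The function name `gather_metadata` is a hypothetical helper, not part of any upload SDK, and the MIME type is only guessed from the file extension.

```python
import mimetypes
import os


def gather_metadata(path: str) -> dict:
    """Collect upload metadata without reading the file's contents."""
    content_type, _ = mimetypes.guess_type(path)
    return {
        "filename": os.path.basename(path),
        "size": os.path.getsize(path),  # size in bytes
        # Fall back to a generic binary type when the extension is unknown.
        "type": content_type or "application/octet-stream",
    }
```

Calling `gather_metadata("vacation.mp4")` on a real file would return a dictionary shaped like the example above, with `size` expressed in bytes.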

Step 2: Client Sends Metadata to API

Client
      |
      | filename
      | size
      | contentType
      |
      v
API Gateway

The API validates:

User authentication
Storage quota
File type
Permissions

If everything is valid, it proceeds.
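The four checks can be sketched as one function. This is a minimal sketch under assumptions: the `User` fields, the `ALLOWED_TYPES` allow-list, and the error strings are all hypothetical, and a real API would read them from its auth and billing systems.

```python
from dataclasses import dataclass

# Hypothetical allow-list; a real service would configure this per product.
ALLOWED_TYPES = {"video/mp4", "image/png", "application/pdf"}


@dataclass
class User:
    authenticated: bool
    quota_bytes: int
    used_bytes: int
    can_upload: bool


def validate_upload(user: User, size: int, content_type: str) -> list[str]:
    """Return validation errors; an empty list means the upload may proceed."""
    errors = []
    if not user.authenticated:
        errors.append("not authenticated")
    if user.used_bytes + size > user.quota_bytes:
        errors.append("storage quota exceeded")
    if content_type not in ALLOWED_TYPES:
        errors.append("file type not allowed")
    if not user.can_upload:
        errors.append("permission denied")
    return errors
```

Returning every error at once, rather than stopping at the first, lets the client show the user everything that needs fixing in a single round trip.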

Step 3: Backend Generates a Pre-Signed URL
Instead of sending the file through the application server,
the backend generates a secure, short-lived upload URL for the storage service and returns it to the client.

                Backend

                     |
                     |
          Generate Upload URL
                     |
                     |
                     v

                  Amazon S3

Example (simplified; real S3 URLs carry the signature and expiry in X-Amz-* query parameters):

https://bucket.s3.amazonaws.com/file123
?signature=abcxyz
&expires=600

This is called a Pre-Signed URL.
It is:

Temporary
Secure
Limited permission
Expires automatically
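To make the idea concrete, here is a simplified sketch of how a signed, expiring URL can work. It is not Amazon's actual scheme (S3 uses AWS Signature Version 4); it uses a plain HMAC to show why the link is temporary and limited in permission. `SECRET_KEY`, `presign_url`, and `verify` are hypothetical names, and unlike the example URL above, `expires` here is an absolute Unix timestamp.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import urlencode

SECRET_KEY = b"server-side-secret"  # hypothetical; never sent to the client


def _sign(method: str, key: str, expires_at: int) -> str:
    # The signature covers the method, the object key, and the expiry time.
    message = f"{method}\n{key}\n{expires_at}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def presign_url(base_url: str, key: str, ttl_seconds: int = 600,
                now: Optional[float] = None) -> str:
    """Build a URL that authorizes a PUT to `key` until it expires."""
    expires_at = int((time.time() if now is None else now) + ttl_seconds)
    query = urlencode({"expires": expires_at,
                       "signature": _sign("PUT", key, expires_at)})
    return f"{base_url}/{key}?{query}"


def verify(method: str, key: str, expires_at: int, signature: str,
           now: Optional[float] = None) -> bool:
    """What the storage side would check before accepting the request."""
    current = time.time() if now is None else now
    if current > expires_at:
        return False  # the link has expired
    return hmac.compare_digest(signature, _sign(method, key, expires_at))
```

Because the method and key are part of the signed message, a URL issued for uploading one object cannot be reused to read it or to write a different object.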

Why Use Pre-Signed URLs?
Without them:

Client
     |
Application Server
     |
Storage

Every byte passes through your server.
Bad idea.

With pre-signed URLs:

Client ----------------------> Storage

The application server only handles authorization.
The heavy file upload goes directly to cloud storage.

Benefits:

Less server load
Lower cost
Better scalability
Faster uploads

Step 4: Client Uploads Directly to S3
Now the client uploads directly:

Client
      |
      |
      | 2 GB file
      |
      |
      v
Amazon S3

The backend is no longer in the data path.
This is a large part of why such systems can support millions of users simultaneously.
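On the client side, the direct upload is just an HTTP PUT to the pre-signed URL. The sketch below uses only the standard library; `build_upload_request` and `upload` are hypothetical helper names, and a real client would stream a multi-gigabyte file from disk rather than hold it in memory as `bytes`.

```python
import urllib.request


def build_upload_request(presigned_url: str, data: bytes,
                         content_type: str) -> urllib.request.Request:
    """Prepare an HTTP PUT that sends the file bytes straight to storage."""
    return urllib.request.Request(
        presigned_url,
        data=data,
        method="PUT",
        headers={"Content-Type": content_type},
    )


def upload(presigned_url: str, data: bytes, content_type: str) -> int:
    """Send the request and return the HTTP status code."""
    request = build_upload_request(presigned_url, data, content_type)
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request) as response:
        return response.status
```

Note that no application credentials appear anywhere in this code: the authorization lives entirely in the URL's query string.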

Step 5: Storage Returns Success
After upload:

S3
  |
200 OK
  |
  v
Client

The client now knows the upload succeeded.

Step 6: Metadata is Saved
Once the client reports success (or the storage service sends an event notification), the application server stores information like:

Files Table
-----------------------------------
id
userId
filename
storageKey
size
mimeType
createdAt
updatedAt
-----------------------------------

Notice:
The database stores metadata, not the actual file.
The actual file remains inside object storage.
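One possible shape for that table, sketched with SQLite so it runs anywhere. The column types, constraints, and the `save_metadata` helper are assumptions for illustration; only the column names come from the table above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE files (
    id          INTEGER PRIMARY KEY,
    userId      INTEGER NOT NULL,
    filename    TEXT    NOT NULL,
    storageKey  TEXT    NOT NULL UNIQUE,  -- where the bytes live in object storage
    size        INTEGER NOT NULL,         -- bytes
    mimeType    TEXT    NOT NULL,
    createdAt   TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updatedAt   TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP
)
"""


def save_metadata(db, user_id, filename, storage_key, size, mime_type) -> int:
    """Insert one metadata row and return its id; file bytes are never stored here."""
    cursor = db.execute(
        "INSERT INTO files (userId, filename, storageKey, size, mimeType) "
        "VALUES (?, ?, ?, ?, ?)",
        (user_id, filename, storage_key, size, mime_type),
    )
    db.commit()
    return cursor.lastrowid


db = sqlite3.connect(":memory:")
db.execute(SCHEMA)
file_id = save_metadata(db, 1, "vacation.mp4", "uploads/1/file123",
                        2_100_000_000, "video/mp4")
```

The row for a 2.1 GB video is only a few dozen bytes; `storageKey` is the pointer that lets the application find the object later.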

Final Architecture

Client ---- Metadata ------> API
                              |
                   Generate Pre-Signed URL
                              |
                              v
Client ---- Upload File ---> S3
                              |
                        Save Metadata
                              |
                              v
                          Database

What Happens if Internet Disconnects?
Suppose:

Uploading...

███████████░░░░░░░░
         60%

The internet connection drops.
Without special handling:

Start Again ❌

Uploading a 5 GB file again is frustrating.
Modern systems avoid this using Resumable Uploads.

Resumable Upload

Instead of one huge file,
the client divides it into chunks.
Example:

File
  |
-----------------------------
Chunk 1
Chunk 2
Chunk 3
Chunk 4
Chunk 5
-----------------------------

For example, 20 MB each.

Upload Process

Chunk 1 ✅
Chunk 2 ✅
Chunk 3 ✅
Chunk 4 ❌
Chunk 5 ❌

Connection lost.

Later:

Reconnect
    |
Resume
    |
Chunk 4 ✅
Chunk 5 ✅

Only missing chunks are uploaded.
Huge bandwidth savings.
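The resume logic can be sketched in a few lines. This is a minimal sketch, assuming the client keeps a set of chunk indexes that were already accepted; `send` is a hypothetical callback that performs the network call and raises if the connection drops.

```python
CHUNK_SIZE = 20 * 1024 * 1024  # 20 MB per chunk


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Cut the file into fixed-size pieces; the last one may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def resume_upload(chunks, uploaded: set, send) -> None:
    """Send only the chunks whose index is not yet in `uploaded`."""
    for index, chunk in enumerate(chunks):
        if index in uploaded:
            continue  # already stored on the server; skip it
        send(index, chunk)   # may raise if the connection drops
        uploaded.add(index)  # record progress only after success
```

Calling `resume_upload` again after a failure picks up where the previous call stopped. A production client would persist `uploaded` to disk, or ask the server which chunks it already holds, so that progress also survives an application restart.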

Multipart Upload in Amazon S3
Amazon S3 supports Multipart Upload:

Initialize Upload
        |
Upload Part 1
Upload Part 2
Upload Part 3
Upload Part 4
        |
Complete Upload

Internally, S3 assembles all parts into a single object.

Advantages:

Retry individual parts
Parallel uploads
Better reliability
Faster performance
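The three-phase protocol can be modelled with a small in-memory stand-in. `FakeMultipartStore` is a hypothetical class for illustration and is not the real S3 API; among other things, the real service also has the client send back a list of part numbers and ETags when completing the upload, which is omitted here.

```python
import uuid


class FakeMultipartStore:
    """In-memory stand-in for an object store; not the real S3 API."""

    def __init__(self):
        self.pending = {}  # upload_id -> {"key": str, "parts": {part_number: bytes}}
        self.objects = {}  # key -> assembled bytes

    def initiate(self, key: str) -> str:
        upload_id = uuid.uuid4().hex
        self.pending[upload_id] = {"key": key, "parts": {}}
        return upload_id

    def upload_part(self, upload_id: str, part_number: int, data: bytes) -> None:
        # Re-sending a part number overwrites it, so retries are safe.
        self.pending[upload_id]["parts"][part_number] = data

    def complete(self, upload_id: str) -> bytes:
        upload = self.pending.pop(upload_id)
        # Parts are joined by part number, not by arrival order.
        body = b"".join(data for _, data in sorted(upload["parts"].items()))
        self.objects[upload["key"]] = body
        return body
```

Because each part is addressed by its number, parts can arrive in any order and a failed part can be retried on its own without touching the others.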

Parallel Upload
Instead of:

Chunk1
  ↓
Chunk2
  ↓
Chunk3
  ↓
Chunk4

Systems do:

Chunk1 ----->
Chunk2 ----->
Chunk3 ----->
Chunk4 ----->

All at once.
This can significantly reduce upload time when a single connection does not use all of the available bandwidth.
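A thread pool is one simple way to send chunks concurrently. In this sketch, `send` is a hypothetical function that uploads one chunk and returns whatever the server responds with; the worker count of 4 is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def upload_in_parallel(chunks, send, max_workers: int = 4) -> list:
    """Upload all chunks concurrently and return the results in chunk order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(send, index, chunk)
                   for index, chunk in enumerate(chunks)]
        # .result() re-raises any exception from the worker thread.
        return [future.result() for future in futures]
```

Threads suit this job because each worker spends most of its time waiting on the network rather than using the CPU.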

What About Very Large Files?

For files like:

10 GB
20 GB
100 GB

Systems use:
Chunking
Multipart upload
Retry logic
Checksum verification
Background processing

Together, these keep uploads reliable even over unstable networks.
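Retry logic and checksum verification often go together. The sketch below assumes a hypothetical `send` function that uploads one chunk and returns the SHA-256 checksum the server computed; the retry count and backoff delays are illustrative defaults, not recommendations.

```python
import hashlib
import time


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def send_with_retry(send, chunk: bytes, attempts: int = 3,
                    base_delay: float = 0.5) -> str:
    """Send a chunk, retrying on connection errors, then verify its checksum."""
    expected = sha256_hex(chunk)
    for attempt in range(attempts):
        try:
            received = send(chunk)  # assumed to return the server's checksum
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
            continue
        if received != expected:
            raise ValueError("checksum mismatch: chunk was corrupted in transit")
        return received
    raise RuntimeError("attempts must be at least 1")
```

Doubling the delay after each failure avoids hammering a server that is already struggling, while the checksum comparison catches corruption that a successful HTTP status alone would not reveal.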

How Sync Works Across Multiple Devices

Suppose you upload from your laptop.

     Laptop
        |
  Cloud Storage
      /   \
  Phone   Tablet

When the upload completes:

Metadata is updated
Sync service detects changes
Other devices receive notifications
Only changed files are downloaded

That's why your phone quickly shows the new file without manually refreshing.
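The "only changed files" step boils down to comparing two indexes. This is a minimal sketch, assuming each device keeps a map of filename to version (a counter or a content hash); `files_to_download` is a hypothetical helper and real sync services also handle deletions, renames, and conflicts.

```python
def files_to_download(server_index: dict, local_index: dict) -> list:
    """Return the names of files whose server version differs from the local copy.

    Both indexes map filename -> version (a counter or content hash).
    """
    return sorted(
        name
        for name, version in server_index.items()
        if local_index.get(name) != version
    )
```

A file that is missing locally has no entry in `local_index`, so it is treated the same way as a file whose version changed.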

Why Don't Companies Store Files in Databases?
Imagine storing a 2 GB video directly inside MySQL or PostgreSQL.

Problems:

Massive database growth
Slow backups
Expensive replication
Poor performance

Instead:

Database stores:

- filename
- owner
- path
- size
- permissions

Object Storage stores:

- Actual binary file

This separation makes systems scalable and easier to maintain.

Real Production Flow

                    User
                      |
                 Select File
                      |
                Send Metadata
                      |
                      v
                 API Gateway
                      |
               Authentication
                      |
           Generate Pre-Signed URL
                      |
                      v
               Object Storage <-------- Direct Upload
                      |
               Upload Success
                      |
                Save Metadata
                      |
                      v
                  Database
                      |
             Notify Sync Service
                      |
           -----------------------
           |                     |
        Laptop                Mobile
           |                     |
       Synced ✅             Synced ✅

Interview Questions

Q1. Why shouldn't files pass through the application server?

Because it creates a bandwidth bottleneck, increases server cost, and limits scalability. Direct uploads to object storage are more efficient.

Q2. What is a Pre-Signed URL?

A temporary, secure URL generated by the backend that allows a client to upload directly to object storage without exposing permanent credentials.

Q3. Why store metadata in a database instead of the file itself?

Databases are optimized for structured data and queries, while object storage is optimized for storing large binary files reliably and cost-effectively.

Q4. What is Multipart Upload?

It splits a large file into multiple parts that can be uploaded independently and then combined by the storage service into one object.

Q5. What is Resumable Upload?

A mechanism where interrupted uploads continue from the last successfully uploaded chunk instead of restarting from zero.