surajrkhonde

Posted on Jul 2 • Edited on Jul 9

AWS for Newbies — Episode 2

#aws #webdev #s3 #programming

Edit: @nazar_boyko caught a real gap in the original version of this piece — the SHA-256 hash used for deduplication was accepted as a client-supplied value and never verified against the actual uploaded bytes, meaning a client could poison a dedup key with mismatched content.
This has been fixed: the hash is now bound into the signed S3 request via ChecksumSHA256, so S3 itself rejects any upload that doesn't match. Thanks for the sharp read. 🙏

An Uncle–Nephew Chronicle

A ~1 hour read. Continues directly from Episode 1 (IAM + EC2). Every term is defined the moment it shows up.

👦 Nephew: Uncle, I've been thinking about what you asked last time — where does the uploaded file actually go if my server is disposable. I looked it up. It's S3, right?

👨‍🦳 Uncle: Right. And today we're not doing the toy version. We're doing the version a real production app needs: files organized by type, no duplicate files silently eating your storage bill, file size actually enforced instead of "hoped for," and a presigned URL flow with a real expiry policy — not the 2-minute throwaway example from the basics. By the end of this, you'll have the full Node.js code for it, not just diagrams.

👦 Nephew: Let's go. Start from the beginning — what is S3, really?

Part 1 — S3: Storage That Isn't "A Folder on a Server"

👨‍🦳 Uncle: S3 stands for Simple Storage Service. It is object storage — a place to store files (images, PDFs, videos, backups, logs — anything as raw bytes) completely separate from any server. Unlike EC2's disk (EBS), which is tied to one specific virtual machine, S3 exists independently. Ten different servers, or zero servers, can all read and write to the same S3 storage. That's exactly the property we needed after last episode's problem — files that survive even if the server that received them is gone.

In S3's vocabulary:

A bucket is the top-level container — think of it as one storage account/warehouse. Bucket names must be globally unique across all of AWS, not just your account.
An object is a single stored file, along with its metadata (size, content type, upload date, permissions).
A key is the object's full path-like name inside the bucket, e.g. images/profile-abc123.jpg.

👦 Nephew: So a "key" is basically the file path?

👨‍🦳 Uncle: Functionally, yes — but here's a fact that surprises almost everyone: S3 doesn't actually have real folders. It's a flat structure of objects, each with a long key string. When you see images/profile.jpg displayed with a little folder icon in the AWS console, that's the console being helpful — it's just splitting the key string on the / character and drawing a folder illustration for you. Underneath, there is no such thing as an "images folder" object. It's purely a naming convention, called a prefix.

👦 Nephew: Wait, that actually matters for what I want to do — separating PDFs, text files, and images into their own "folders."

👨‍🦳 Uncle: It matters a lot, and it's good news — it means organizing by file type costs you nothing extra. You just design your key naming convention deliberately, and S3 will happily group them for you visually and let you list/filter by prefix efficiently.
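To make the "folders are really prefixes" idea concrete, here is a minimal sketch that mimics what the console does with a flat list of keys. `groupByTopLevelPrefix` is a hypothetical helper for illustration, not an S3 API; against a real bucket the equivalent grouping comes from listing objects with a prefix and a `/` delimiter.

```javascript
// Hypothetical helper: groups flat S3-style keys by their first "/"-separated
// segment, the same way the console draws "folders".
function groupByTopLevelPrefix(keys) {
  const groups = new Map();
  for (const key of keys) {
    const slash = key.indexOf("/");
    // Keys without any "/" sit at the bucket "root" (empty prefix).
    const prefix = slash === -1 ? "" : key.slice(0, slash + 1);
    if (!groups.has(prefix)) groups.set(prefix, []);
    groups.get(prefix).push(key);
  }
  return groups;
}

const keys = ["images/a1.jpg", "images/b2.jpg", "documents/c3.pdf", "readme.txt"];
const grouped = groupByTopLevelPrefix(keys);
console.log([...grouped.keys()]); // the "folders" a console-style view would show
```

Nothing here creates a folder. The grouping falls out of the key strings alone, which is why a deliberate naming convention is all the organization you need.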

Part 2 — Designing the Key Structure (Folder-Wise Organization)

👨‍🦳 Uncle: Let's design it properly instead of winging it. A clean, type-separated key convention looks like this:

documents/{hash}.pdf
images/{hash}.jpg
text-files/{hash}.txt

Or, if you also want per-user isolation (very common in real apps):

users/{userId}/images/{hash}.jpg
users/{userId}/documents/{hash}.pdf

👦 Nephew: Why the hash instead of the original filename, like resume.pdf?

👨‍🦳 Uncle: Three solid reasons. First — two different users might both upload a file called resume.pdf; using the raw filename risks silent overwrites unless you're careful. Second — filenames can contain characters that misbehave in URLs. Third, and most important for today's topic: the hash is how we detect duplicate files. Which brings us to the real meat of today's lesson.

Part 3 — Deduplication: Don't Store the Same File Twice

👨‍🦳 Uncle: Imagine 500 users all upload the exact same company logo, or the same PDF brochure gets re-uploaded across 50 different form submissions. Without deduplication, you're paying S3 storage costs for 500 identical copies of the same bytes. We fix this with content hashing.

👦 Nephew: Meaning?

👨‍🦳 Uncle: SHA-256 is a cryptographic hash function: an algorithm that takes any input (in our case, a file's raw bytes) and produces a fixed-length 256-bit value, usually written as a 64-character hexadecimal string (called a hash or digest), that is essentially a unique fingerprint of that exact content. Two important properties matter to us:

The same file content always produces the exact same hash, no matter who uploads it or what they named it.
Even a single-bit difference in the file produces a completely different hash. So it's not "similar files get similar hashes" — it's "identical content, identical hash; anything else, unrelated-looking hash."

The chance of two genuinely different files accidentally producing the same SHA-256 hash is astronomically small. Content-addressed systems lean on this kind of property every day: Git, for example, identifies every object by a hash of its content (SHA-1 by default, with SHA-256 available as a newer object format).
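The two properties above are easy to check for yourself. The sketch below uses only Node's built-in crypto module; `sha256Hex` is a small helper name chosen for this example.

```javascript
const crypto = require("crypto");

// Helper for this example: SHA-256 of a buffer, as a hex string.
function sha256Hex(data) {
  return crypto.createHash("sha256").update(data).digest("hex");
}

const original = Buffer.from("the same company logo bytes");
const copy = Buffer.from(original);    // identical content in a separate buffer
const flipped = Buffer.from(original);
flipped[0] ^= 0x01;                    // flip a single bit in the first byte

console.log(sha256Hex(original) === sha256Hex(copy));    // identical content, identical hash
console.log(sha256Hex(original) === sha256Hex(flipped)); // one bit changed, unrelated hash
console.log(sha256Hex(original).length);                 // length of the hex digest
```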

👦 Nephew: So the plan is: compute the hash, and if we've seen that hash before, don't store the file again?

👨‍🦳 Uncle: Exactly. Let's build it.

3.1 — Computing a SHA-256 hash in Node.js

Node has hashing built into its standard library — no extra package needed. Here's the core building block, written to handle large files efficiently by streaming the file instead of loading the whole thing into memory at once:

const crypto = require("crypto");
const fs = require("fs");

function hashFileStream(filePath) {
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash("sha256");
    const stream = fs.createReadStream(filePath);

    stream.on("data", (chunk) => hash.update(chunk));
    stream.on("end", () => resolve(hash.digest("hex")));
    stream.on("error", reject);
  });
}

// Usage:
// const fileHash = await hashFileStream("/tmp/uploaded-file.pdf");
// e.g. "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08" (64 hex characters)

👦 Nephew: Why stream it instead of just crypto.createHash('sha256').update(buffer).digest('hex') on the whole file at once?

👨‍🦳 Uncle: Because if someone uploads a 200 MB video file, loading the entire thing into memory just to hash it can spike your server's RAM and slow everything else down — especially if ten uploads happen at once. Streaming reads the file in small chunks, feeds each chunk into the hash calculation, and never holds the whole file in memory. Small files barely notice the difference; large files, it's the difference between a smooth server and a crashed one.
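A quick way to convince yourself that streaming changes nothing about the result: feeding the hash one chunk at a time gives the same digest as hashing the whole buffer in one call. This sketch simulates the stream's chunks with buffer slices; `hashInChunks` is a name made up for the example.

```javascript
const crypto = require("crypto");

// Feed the hash one chunk at a time, as a read stream's "data" events would.
function hashInChunks(buffer, chunkSize) {
  const hash = crypto.createHash("sha256");
  for (let offset = 0; offset < buffer.length; offset += chunkSize) {
    hash.update(buffer.subarray(offset, offset + chunkSize));
  }
  return hash.digest("hex");
}

const data = crypto.randomBytes(1024 * 1024); // 1 MiB stand-in for an uploaded file
const oneShot = crypto.createHash("sha256").update(data).digest("hex");
const chunked = hashInChunks(data, 64 * 1024); // 64 KiB, the default chunk size of fs.createReadStream
console.log(oneShot === chunked);
```

Only the peak memory differs: the chunked version needs one chunk at a time, the one-shot version needs the whole file.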

3.2 — Checking for duplicates before storing

Now, the hash alone is only useful if you remember which hashes you've already stored. That's a job for your database, not S3 itself.

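CREATE TABLE files (
  id            SERIAL PRIMARY KEY,
  sha256_hash   VARCHAR(64) NOT NULL UNIQUE,
  s3_key        TEXT NOT NULL,
  file_type     VARCHAR(20) NOT NULL,   -- 'pdf' | 'image' | 'text'
  size_bytes    BIGINT NOT NULL,
  uploaded_by   INTEGER REFERENCES users(id),
  status        VARCHAR(20) NOT NULL DEFAULT 'pending',  -- 'pending' | 'confirmed' (used from Part 7 on)
  expires_at    TIMESTAMP,              -- set for pending uploads so stale rows can be cleaned up (Part 8)
  created_at    TIMESTAMP DEFAULT now()
);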

Notice sha256_hash has a UNIQUE constraint. That single line is doing a lot of work — even if two upload requests race each other at the exact same millisecond, the database itself will reject the second insert of the same hash, so you can't accidentally create a duplicate even under concurrent load.

The check-then-act flow in code:

async function findOrRegisterFile(fileHash, fileType, sizeBytes, extension, userId) {
  // 1. Have we already stored this exact content?
  const existing = await db.query(
    "SELECT s3_key FROM files WHERE sha256_hash = $1",
    [fileHash]
  );

  if (existing.rows.length > 0) {
    // Duplicate! Don't upload again — just reuse the existing object.
    return { isDuplicate: true, s3Key: existing.rows[0].s3_key };
  }

  // 2. New file — decide its key, based on type, using the hash itself
  const folder = { pdf: "documents", image: "images", text: "text-files" }[fileType];
  const s3Key = `${folder}/${fileHash}.${extension}`;

  await db.query(
    `INSERT INTO files (sha256_hash, s3_key, file_type, size_bytes, uploaded_by)
     VALUES ($1, $2, $3, $4, $5)`,
    [fileHash, s3Key, fileType, sizeBytes, userId]
  );

  return { isDuplicate: false, s3Key };
}

👦 Nephew: So if it's a duplicate, we just... don't touch S3 at all? We just point the new "upload" record at the old object?

👨‍🦳 Uncle: Exactly right. The user experience looks identical — "your file uploaded successfully" — but behind the scenes, you saved storage cost, saved upload bandwidth, and saved processing time (if you were going to compress/resize it), all because the bytes were already sitting in S3 from someone else's earlier upload.

Quick flag before we move on: the findOrRegisterFile sketch above is the concept — it's missing a couple of hardening details a production version needs (what stops the hash itself from being a lie, and what happens if two of these race each other). We'll build the real, hardened version in Part 7, once the presigned URL pieces are on the table too.

Part 4 — Enforcing File Size (At Every Layer, Not Just One)

👨‍🦳 Uncle: Here's a mistake I see constantly: a developer checks file size once, on the frontend, in JavaScript, and calls it done. That check is trivially bypassed — anyone can call your API directly with curl or Postman, skipping your frontend entirely. Real file-size enforcement is layered, the same "reject bad traffic as early as possible" principle from our security-groups discussion.

Layer 1 — Client-side (UX only, not security):
Reject obviously oversized files before even starting an upload, so the user gets instant feedback instead of waiting for a slow upload to fail.

Layer 2 — Backend validation, before generating any upload permission:

const MAX_SIZE_BYTES = 5 * 1024 * 1024; // 5 MB

function validateFileSize(sizeBytes) {
  if (sizeBytes > MAX_SIZE_BYTES) {
    const err = new Error("File exceeds the 5MB limit");
    err.statusCode = 413; // "Payload Too Large" — the correct HTTP status for this
    throw err;
  }
}
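One subtlety with the comparison above: `sizeBytes > MAX_SIZE_BYTES` is false for `NaN`, `undefined`, and negative numbers, so a malformed value passes the check. A minimal sketch of a stricter parser follows; `parseSizeBytes` is a hypothetical helper, and the choice of 400 for malformed input is an assumption of this example.

```javascript
const MAX_SIZE_BYTES = 5 * 1024 * 1024; // 5 MB, same limit as above

// Hypothetical helper: turns whatever the client sent into a safe byte count,
// or throws an error carrying an HTTP status code.
function parseSizeBytes(value) {
  // JSON bodies give numbers; form fields and query strings give strings.
  const size = typeof value === "string" && value.trim() !== "" ? Number(value) : value;
  if (!Number.isSafeInteger(size) || size <= 0) {
    const err = new Error("sizeBytes must be a positive integer");
    err.statusCode = 400; // malformed request, not "too large"
    throw err;
  }
  if (size > MAX_SIZE_BYTES) {
    const err = new Error("File exceeds the 5MB limit");
    err.statusCode = 413;
    throw err;
  }
  return size;
}

console.log(parseSizeBytes("1024")); // a numeric string is accepted and normalized
```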

Layer 3 — Enforced by S3 itself, at the moment of upload, via the values bound into the presigned request's signature. This is the layer most beginners don't know exists, and it's the one that actually matters for direct-to-S3 uploads: your backend never sees the file bytes in that flow, so layer 2 alone can be lied to. We'll wire this up properly in Part 7.

Layer 4 — Load balancer / API Gateway request size limits, as a blunt outer boundary against absurdly oversized requests hitting your infrastructure at all.

👦 Nephew: So the frontend check is basically just "be nice to the user," and the real enforcement happens server-side and inside the S3 request itself.

👨‍🦳 Uncle: Correctly understood.

Part 5 — Permissions: Locking S3 Down Properly

👨‍🦳 Uncle: Now let's get the access model right, because this is where careless setups leak private user files to the entire internet — a mistake that's made the news more than once.

5.1 — Block Public Access (keep it ON)

When you create the bucket, AWS shows a setting called "Block all public access." Leave it enabled. With this on, no object in the bucket can be made public by accident, whether through a misconfigured bucket policy or an ACL mistake. Attempts to add a public ACL or a public bucket policy are rejected, and any public grants that already exist are ignored. Public exposure should always be an intentional, narrow exception (which we handle via presigned URLs, or a CDN in front, covered in a future episode), never the default state of the bucket.

5.2 — IAM Policies: deciding what your backend/app can do

Access to S3 is controlled through IAM policies — JSON documents describing which actions are allowed on which resources. The key actions you'll use:

Action	What it allows
`s3:PutObject`	Uploading (writing) a new object
`s3:GetObject`	Downloading (reading) an object
`s3:DeleteObject`	Deleting an object
`s3:ListBucket`	Listing what objects exist in the bucket

A properly scoped policy for your backend's role looks like this — notice it's restricted to one specific bucket, not "all S3 buckets everywhere":

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": "arn:aws:s3:::my-app-uploads/*"
    }
  ]
}

That Resource line, with the /* at the end, means "any object key inside the my-app-uploads bucket" — not the bucket's own settings, not other buckets. This is the Principle of Least Privilege again, applied to a service instead of a person.
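The table above also lists `s3:ListBucket`, which the policy does not grant. If you ever add it, note that it is a bucket-level action, so its Resource is the bucket ARN without the `/*`. The sketch below builds both shapes; `buildUploadPolicy` and its `allowList` flag are names invented for this example.

```javascript
// Hypothetical helper: builds the policy document for a given bucket name.
function buildUploadPolicy(bucketName, { allowList = false } = {}) {
  const bucketArn = `arn:aws:s3:::${bucketName}`;
  const statements = [
    {
      Effect: "Allow",
      Action: ["s3:PutObject", "s3:GetObject"],
      Resource: `${bucketArn}/*`, // object-level actions target keys inside the bucket
    },
  ];
  if (allowList) {
    statements.push({
      Effect: "Allow",
      Action: ["s3:ListBucket"],
      Resource: bucketArn, // bucket-level action: no "/*" here
    });
  }
  return { Version: "2012-10-17", Statement: statements };
}

console.log(JSON.stringify(buildUploadPolicy("my-app-uploads", { allowList: true }), null, 2));
```

One practical consequence of leaving `s3:ListBucket` out: a HeadObject or GetObject call on a key that does not exist comes back as 403 rather than 404, so "missing" and "forbidden" look the same to your backend.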

5.3 — Attach this via an IAM Role, not access keys

If your backend runs on EC2 (from Episode 1) or Lambda, attach this permission through an IAM Role, exactly like we discussed last time — never paste an Access Key and Secret Key into your .env file for this. The role gives your running server temporary, auto-rotating credentials behind the scenes, with zero secrets to leak.

Part 6 — The Node.js Setup

👨‍🦳 Uncle: Let's install what we need. AWS's modern JavaScript SDK is modular — you install just the pieces you need, not one giant package.

npm install @aws-sdk/client-s3 @aws-sdk/s3-request-presigner

Set up the client once, and reuse it everywhere:

// s3Client.js
const { S3Client } = require("@aws-sdk/client-s3");

const s3 = new S3Client({
  region: "ap-south-1", // keep this the same region as your EC2/RDS — lower latency, lower cost
});

module.exports = s3;

Notice — no access key, no secret key in this code. If this runs on an EC2 instance with the IAM role from Part 5 attached, the SDK automatically discovers and uses those temporary credentials. This is the payoff of setting up roles properly back in Episode 1.

Part 7 — The Full End-to-End Upload Flow

👨‍🦳 Uncle: Now let's assemble everything — hashing, deduplication, size limits, folder-wise keys, and permissions — into one coherent request flow. This is the production pattern, not the toy version.

1. Client picks a file → computes its size and SHA-256 hash locally
2. Client calls  POST /api/uploads/request-url  { fileName, fileType, sizeBytes, sha256Hash }
3. Backend:
     a. Validates the user is authenticated & rate-limit not exceeded
     b. Validates sizeBytes <= 5MB          (Layer 2 size check)
     c. Checks sha256Hash against the database
          → if it exists already: respond immediately, "already uploaded",
            return the existing s3Key. No S3 call needed at all.
          → if new: build the S3 key using the folder-by-type convention
            and the hash, e.g.  images/9f86d0...jpg
     d. Generates a PRESIGNED URL for that exact key, with:
          - a Content-Length condition (enforces size at the S3 level)
          - a Content-Type condition (enforces file type at the S3 level)
          - a ChecksumSHA256 condition (enforces the CONTENT ITSELF at the S3
            level — see 7.3, this is what stops a client from lying about
            the hash it's dedup-ing against)
          - an expiry time appropriate to the use case (see Part 8)
     e. Registers a row in the database (status: "pending")
4. Client uploads the raw file bytes DIRECTLY to S3, using that presigned URL
     — the backend server never touches the file bytes.
5. Client notifies backend: "upload finished" (with the s3Key)
6. Backend calls HeadObject on that S3 key, to CONFIRM the object genuinely
   exists and matches the expected size and checksum. This defeats a client
   that lied about having uploaded something.
7. Backend marks the database row "confirmed", and the file is now live.

👦 Nephew: Why does the backend need to double-check in step 6? Didn't we already validate everything?

👨‍🦳 Uncle: Because a presigned URL, once issued, is a bit like handing someone a signed blank cheque with a spending limit written on it — you've limited what they can do with it, but you still want proof of what actually happened before you update your own records. Never trust the client's word that an upload succeeded — verify against S3 directly.
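Step 1 of the flow has the client hashing the file before it asks for a URL. A minimal sketch of that using the Web Crypto API, which is available in browsers and in recent Node versions. `sha256HexOfBytes` is a name made up here, and the request body in the comment simply mirrors the fields from step 2.

```javascript
// Web Crypto is a global in browsers and recent Node versions;
// older Node versions expose the same API through node:crypto.
const subtle = globalThis.crypto?.subtle ?? require("node:crypto").webcrypto.subtle;

// Hypothetical client-side helper: SHA-256 of a file's bytes, as lowercase hex.
async function sha256HexOfBytes(bytes) {
  const digest = await subtle.digest("SHA-256", bytes); // resolves to an ArrayBuffer
  return Array.from(new Uint8Array(digest), (b) => b.toString(16).padStart(2, "0")).join("");
}

// In a browser, with `file` taken from an <input type="file">:
//   const sha256Hash = await sha256HexOfBytes(await file.arrayBuffer());
//   const body = { fileName: file.name, fileType: "image", sizeBytes: file.size, sha256Hash };
```

`file.arrayBuffer()` reads the whole file into memory, which is fine at the 5 MB limit used here. Web Crypto has no incremental digest, so much larger files would need a different, chunk-by-chunk hashing approach on the client.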

7.1 — Code: requesting the upload URL

const { PutObjectCommand } = require("@aws-sdk/client-s3");
const { getSignedUrl } = require("@aws-sdk/s3-request-presigner");
const s3 = require("./s3Client");

const MAX_SIZE_BYTES = 5 * 1024 * 1024; // 5MB

// Note: the extension is decided HERE, server-side — never taken from the
// client. A client-supplied "extension" string is an easy injection vector
// (mismatched content, or a crafted string messing with downstream tooling).
const ALLOWED_TYPES = {
  pdf: { contentType: "application/pdf", folder: "documents", ext: "pdf" },
  image: { contentType: "image/jpeg", folder: "images", ext: "jpg" }, // extend for png etc.
  text: { contentType: "text/plain", folder: "text-files", ext: "txt" },
};

async function requestUploadUrl({ fileType, sizeBytes, sha256HashHex, userId }) {
  if (sizeBytes > MAX_SIZE_BYTES) {
    const err = new Error("File exceeds 5MB limit");
    err.statusCode = 413;
    throw err;
  }

  const typeConfig = ALLOWED_TYPES[fileType];
  if (!typeConfig) {
    const err = new Error("Unsupported file type");
    err.statusCode = 400;
    throw err;
  }

  // Dedup check
  const existing = await db.query(
    "SELECT s3_key FROM files WHERE sha256_hash = $1", [sha256HashHex]
  );
  if (existing.rows.length > 0) {
    return { duplicate: true, s3Key: existing.rows[0].s3_key };
  }

  const s3Key = `${typeConfig.folder}/${sha256HashHex}.${typeConfig.ext}`;

  // S3's checksum parameter wants the hash as BASE64, not hex —
  // easy conversion, easy to forget:
  const sha256HashBase64 = Buffer.from(sha256HashHex, "hex").toString("base64");

  const command = new PutObjectCommand({
    Bucket: "my-app-uploads",
    Key: s3Key,
    ContentType: typeConfig.contentType,
    ContentLength: sizeBytes,         // enforced as part of the signature
    ChecksumSHA256: sha256HashBase64, // binds the hash into the signature itself (see 7.3)
    ChecksumAlgorithm: "SHA256",
  });

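  // Short expiry: this is a normal user-initiated upload,
  // it should happen within a couple of minutes.
  // By default the presigner leaves Content-Type out of the signature and moves
  // x-amz-* headers into the query string; the two options below keep Content-Type
  // and the checksum as signed headers that the client must send unchanged.
  const uploadUrl = await getSignedUrl(s3, command, {
    expiresIn: 120,
    signableHeaders: new Set(["content-type"]),
    unhoistableHeaders: new Set(["x-amz-checksum-sha256"]),
  });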

  try {
    await db.query(
      `INSERT INTO files (sha256_hash, s3_key, file_type, size_bytes, uploaded_by, status)
       VALUES ($1, $2, $3, $4, $5, 'pending')`,
      [sha256HashHex, s3Key, fileType, sizeBytes, userId]
    );
  } catch (e) {
    // Race condition: two requests for the same new file arrived almost
    // simultaneously, both passed the SELECT check above before either
    // INSERT landed. The UNIQUE constraint catches it here (Postgres error
    // code 23505) — treat it exactly like a normal dedup hit.
    if (e.code === "23505") {
      const winner = await db.query(
        "SELECT s3_key FROM files WHERE sha256_hash = $1", [sha256HashHex]
      );
      return { duplicate: true, s3Key: winner.rows[0].s3_key };
    }
    throw e;
  }

  return { duplicate: false, uploadUrl, s3Key };
}
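Because `sha256HashHex` arrives from the client and is interpolated straight into the S3 key, it is worth checking its shape before using it. `Buffer.from(value, "hex")` does not throw on bad input; it silently stops decoding at the first invalid character. A minimal sketch of a guard, where `parseSha256Hex` is a hypothetical helper and the 400 status is an assumption of this example:

```javascript
const SHA256_HEX = /^[0-9a-f]{64}$/;

// Hypothetical guard: accepts only a well-formed SHA-256 hex digest and
// returns it in both encodings the flow needs (hex for the key and database,
// Base64 for the S3 checksum parameter).
function parseSha256Hex(value) {
  const hex = typeof value === "string" ? value.toLowerCase() : "";
  if (!SHA256_HEX.test(hex)) {
    const err = new Error("sha256Hash must be 64 hexadecimal characters");
    err.statusCode = 400;
    throw err;
  }
  return { hex, base64: Buffer.from(hex, "hex").toString("base64") };
}

const sample = require("crypto").createHash("sha256").update("test").digest("hex");
console.log(parseSha256Hex(sample.toUpperCase()));
```

Lowercasing first also keeps the dedup lookup consistent, since the same digest written in upper and lower case would otherwise look like two different hashes to the UNIQUE column.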

7.2 — Code: confirming after upload

const { HeadObjectCommand } = require("@aws-sdk/client-s3");

async function confirmUpload(s3Key, expectedSizeBytes, expectedHashBase64) {
  const head = await s3.send(
    new HeadObjectCommand({
      Bucket: "my-app-uploads",
      Key: s3Key,
      ChecksumMode: "ENABLED", // ask S3 to return the stored checksum too
    })
  );

  if (head.ContentLength !== expectedSizeBytes) {
    throw new Error("Uploaded file size mismatch — possible tampering");
  }

  // Belt-and-suspenders: S3 already refuses the PUT itself if the bytes don't
  // match the signed ChecksumSHA256 (see 7.3), so this should never actually
  // fire — but checking it here means we're not silently trusting that the
  // S3-side enforcement was wired up correctly either.
  if (head.ChecksumSHA256 !== expectedHashBase64) {
    throw new Error("Stored object checksum does not match expected hash");
  }

  await db.query(
    "UPDATE files SET status = 'confirmed' WHERE s3_key = $1", [s3Key]
  );

  return { confirmed: true };
}
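`confirmUpload` takes the expected size and checksum as arguments. A reasonable source for them is the pending database row rather than anything the client sends at confirmation time. The sketch below assumes the row shape from the `files` table (`size_bytes`, `sha256_hash`) and assumes a driver like node-postgres, which returns BIGINT columns as strings by default, so a strict `!==` against a number would always report a mismatch. `headMatchesRow` is a hypothetical helper.

```javascript
// Hypothetical pure helper: compares HeadObject output with the database row.
// `row` is assumed to look like { size_bytes, sha256_hash } as stored earlier.
function headMatchesRow(head, row) {
  const expectedSize = Number(row.size_bytes); // BIGINT may arrive as the string "1024"
  const expectedChecksum = Buffer.from(row.sha256_hash, "hex").toString("base64");
  return head.ContentLength === expectedSize && head.ChecksumSHA256 === expectedChecksum;
}

// Usage with made-up values:
const exampleDigest = require("crypto").createHash("sha256").update("hello").digest();
const exampleRow = { size_bytes: "5", sha256_hash: exampleDigest.toString("hex") };
const exampleHead = { ContentLength: 5, ChecksumSHA256: exampleDigest.toString("base64") };
console.log(headMatchesRow(exampleHead, exampleRow));
```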

👦 Nephew: Wait, Uncle — something's bugging me about this whole thing. That sha256HashHex in requestUploadUrl — it comes from the client. We use it to decide the dedup, and we use it to name the S3 key. But nothing actually re-checks that the bytes landing in S3 hash to that value. Couldn't a client just claim the hash of some harmless PDF while uploading something else entirely to that key? HeadObject only confirms size, not content — so it wouldn't catch it. And because the key is derived from that false hash, it permanently squats there — the next honest person who uploads the real file just gets silently deduped onto the poisoned object.

👨‍🦳 Uncle: ...I'm glad you said that out loud, because you're completely right, and it's exactly the kind of hole that hides in plain sight — I even said "never trust the client" three sections ago and then let the hash itself slide through as an unverified claim. Let's fix it properly.

7.3 — Closing the Hole: Making S3 Verify the Hash, Not Just Store It

👨‍🦳 Uncle: The mistake in the earlier version was treating the SHA-256 as a label the client hands us — good enough to pick a folder and a dedup row, but never actually checked against the real bytes. That's backwards for the one piece of data the whole scheme depends on being honest.

The fix isn't "download the object back and re-hash it after upload." That defeats the entire reason we're using presigned URLs — direct-to-S3 upload — for anything but tiny files, and it's still a race: a poisoned object is live in S3, possibly already deduped against, before your re-check even runs.

The real fix is to stop asking the client to be honest, and make S3 itself the judge — by folding the checksum into the signed request, the same way we already fold in size and content type. That's exactly what the ChecksumSHA256 and ChecksumAlgorithm fields do in the updated 7.1 code above: the hash becomes part of the cryptographic signature, not a free-text value the client can swap out. If the bytes that actually arrive don't hash to what was signed, S3 rejects the PUT outright — the object never gets created, and no dedup row can ever point at poisoned content, because there's nothing there to point at.

👦 Nephew: So dedup and content-integrity are now backed by the same guarantee, instead of dedup quietly assuming integrity was someone else's problem.

👨‍🦳 Uncle: Exactly that. Two things worth flagging so you don't trip on them:

S3 wants the checksum as Base64, not hex. We compute and store the hash as a hex string (that's what hashFileStream() gives us, and it's the natural form for a database column and a URL-safe key), but the ChecksumSHA256 field on the S3 command needs Base64, hence the one-line Buffer.from(hex, "hex").toString("base64") conversion in 7.1. Miss that conversion and S3 rejects the uploads, because the value it receives is not the Base64 digest of the bytes.
The client must compute the hash before asking for the URL. We already needed this for the dedup check, so it's not new work — just now the honesty of that number actually matters, and S3 enforces it instead of politely trusting it.

And notice what this quietly buys us for free: the extension-trust fix from 7.1 (deriving the file extension from a server-side ALLOWED_TYPES map instead of a client-supplied string) and the race-safe insert (catching the 23505 unique-violation instead of assuming the SELECT-then-INSERT gap can never be hit) — both of those were sitting in the same code path as this hash issue, quietly assuming good faith from the client in places where "never trust the client" should have applied uniformly. One hole rarely travels alone; worth re-reading a flow end-to-end once you find the first one.
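Since the whole argument rests on what is actually bound into the signature, it helps to look at the URL your setup produces. A SigV4 presigned URL names its signed headers in the `X-Amz-SignedHeaders` query parameter, and every query parameter is itself covered by the signature. Which values end up where depends on the SDK version and presigner options, so treat the sketch below as an inspection aid. Both helper names and the sample URL are made up for illustration.

```javascript
// Lists the header names a SigV4 presigned URL was signed with,
// read from its X-Amz-SignedHeaders query parameter.
function signedHeaderNames(presignedUrl) {
  const value = new URL(presignedUrl).searchParams.get("X-Amz-SignedHeaders") || "";
  return value.split(";").filter(Boolean).map((name) => name.toLowerCase());
}

// Hypothetical helper: reports how a given value is bound to the signature,
// as a signed header, as a signed query parameter, or not at all.
function describeBinding(presignedUrl, headerName) {
  const wanted = headerName.toLowerCase();
  if (signedHeaderNames(presignedUrl).includes(wanted)) return "signed-header";
  const url = new URL(presignedUrl);
  const inQuery = [...url.searchParams.keys()].some((k) => k.toLowerCase() === wanted);
  return inQuery ? "query-parameter" : "absent";
}

// Fabricated sample URL, shortened; a real one carries more X-Amz-* parameters.
const sampleUrl =
  "https://my-app-uploads.s3.ap-south-1.amazonaws.com/images/abc.jpg" +
  "?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-SignedHeaders=content-length%3Bhost" +
  "&x-amz-checksum-sha256=abc%3D&X-Amz-Signature=deadbeef";

for (const name of ["content-length", "content-type", "x-amz-checksum-sha256"]) {
  console.log(name, "=>", describeBinding(sampleUrl, name));
}
```

Anything reported as a signed header must be sent by the client with exactly the signed value. Anything reported as absent is not enforced by S3 for that URL, whatever the backend intended.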

👦 Nephew: This is the cleanest way I've seen this explained — and now I actually believe the "never trust the client" line means what it says. Okay — the thing I originally asked about: the 5-hour expiry.

Part 8 — Presigned URLs: Choosing the Right Expiry for Production

👨‍🦳 Uncle: Good, let's slow down here, because "just set expiresIn to a big number" is the wrong way to think about it. The expiry time should match why the URL exists.

8.1 — Why expiry length varies by use case

Use case	Typical expiry	Why
User uploads a small profile picture	1–2 minutes	The action is immediate — pick a file, upload starts right away
User uploads a large video on a slow connection	Several hours (e.g. 5 hours)	The transfer itself may genuinely take a long time; the URL must stay valid for the entire upload duration, not just the moment it starts
Admin generates a downloadable report link to share	Hours to a day	The recipient may not click it immediately
Internal service-to-service file access	Minutes	Tightly scoped, machine-triggered, no reason to linger

👦 Nephew: So a 5-hour expiry isn't inherently risky — it depends on what it's for.

👨‍🦳 Uncle: Correct — but it does widen the window during which that specific URL, if leaked, remains usable. So when you deliberately choose a long expiry, you compensate with tighter guardrails elsewhere, which we've already built:

An exact Content-Length and a Content-Type baked into the signed request, so even a leaked URL can't be abused to upload an oversized file or a disguised file type.
One object key per presigned URL — the signature is tied to one specific S3 key, not a whole folder, so a leaked URL can't be used to overwrite arbitrary other files.
Server-side confirmation (HeadObject), so even if someone uploads something using a leaked URL, your system won't treat it as trusted, confirmed data without matching your expected metadata.
Short-lived by default, long only when justified — don't reach for 5 hours out of laziness; reach for it because the use case (e.g. a large video upload) genuinely needs that window.
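One way to keep "long only when justified" honest is to make the expiry a named, reviewed choice instead of a number typed at the call site. The sketch below picks one illustrative value from each range in the table; the use-case names and exact values are assumptions for this example. The 7-day cap reflects the maximum lifetime of a SigV4 presigned URL.

```javascript
const MAX_PRESIGN_SECONDS = 7 * 24 * 60 * 60; // SigV4 presigned URLs cannot outlive 7 days

// Illustrative values picked from the ranges in the table above; tune per product.
const EXPIRY_SECONDS = {
  profilePicture: 2 * 60,
  largeVideo: 5 * 60 * 60,
  sharedReport: 24 * 60 * 60,
  internalService: 5 * 60,
};

function expiryFor(useCase) {
  if (!Object.prototype.hasOwnProperty.call(EXPIRY_SECONDS, useCase)) {
    throw new Error(`Unknown use case: ${useCase}`);
  }
  return Math.min(EXPIRY_SECONDS[useCase], MAX_PRESIGN_SECONDS);
}

console.log(expiryFor("largeVideo")); // seconds to pass as expiresIn
```

A caveat worth checking for the long case: a URL signed with temporary credentials, such as those from an IAM role, stops working when those credentials expire, even if the requested expiry is later. A 5-hour URL issued from a role-based backend may therefore live for less than 5 hours.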

8.2 — Code: generating a 5-hour presigned URL for a large-file production case

async function requestLargeUploadUrl({ fileType, sizeBytes, sha256HashHex, userId }) {
  const MAX_LARGE_FILE_BYTES = 500 * 1024 * 1024; // 500MB, for e.g. video

  if (sizeBytes > MAX_LARGE_FILE_BYTES) {
    const err = new Error("File exceeds the maximum allowed size");
    err.statusCode = 413;
    throw err;
  }

  const s3Key = `videos/${userId}/${sha256HashHex}.mp4`; // extension fixed server-side, same reasoning as 7.1
  const sha256HashBase64 = Buffer.from(sha256HashHex, "hex").toString("base64");

  const command = new PutObjectCommand({
    Bucket: "my-app-uploads",
    Key: s3Key,
    ContentType: "video/mp4",
    ContentLength: sizeBytes,
    ChecksumSHA256: sha256HashBase64, // same integrity guarantee as the short-lived flow
    ChecksumAlgorithm: "SHA256",
  });

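  // 5 hours = 5 * 60 * 60 seconds.
  // signableHeaders / unhoistableHeaders keep Content-Type and the checksum as
  // signed headers the client must send unchanged (the presigner's defaults
  // leave Content-Type unsigned and move x-amz-* headers into the query string).
  const uploadUrl = await getSignedUrl(s3, command, {
    expiresIn: 5 * 60 * 60,
    signableHeaders: new Set(["content-type"]),
    unhoistableHeaders: new Set(["x-amz-checksum-sha256"]),
  });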

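  try {
    await db.query(
      `INSERT INTO files (sha256_hash, s3_key, file_type, size_bytes, uploaded_by, status, expires_at)
       VALUES ($1, $2, $3, $4, $5, 'pending', NOW() + INTERVAL '5 hours')`,
      [sha256HashHex, s3Key, fileType, sizeBytes, userId]
    );
  } catch (e) {
    // sha256_hash is UNIQUE across all users, so the same video registered
    // earlier raises Postgres error 23505. Treat it as a dedup hit, as in 7.1.
    if (e.code === "23505") {
      const winner = await db.query(
        "SELECT s3_key FROM files WHERE sha256_hash = $1", [sha256HashHex]
      );
      return { duplicate: true, s3Key: winner.rows[0].s3_key };
    }
    throw e;
  }

  return { duplicate: false, uploadUrl, s3Key, expiresInSeconds: 5 * 60 * 60 };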
}

👦 Nephew: What's that expires_at column doing in the database — isn't the S3 URL's own expiry enough?

👨‍🦳 Uncle: Sharp question. The S3-side expiry stops the URL from working after 5 hours — but it doesn't clean up your own database. If the upload never happens, you'd otherwise have a pending row sitting forever, pointing at an object that will never exist. Tracking expires_at yourself lets a background job periodically clean up abandoned pending uploads — good hygiene for a production system.
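A minimal sketch of that background job follows. It assumes the `files` table has the `status` and `expires_at` columns used in the insert above, and that `db` exposes `query(sql)` returning `{ rowCount }` the way node-postgres does. `deleteAbandonedUploads` is a hypothetical name.

```javascript
// Hypothetical cleanup job: removes pending rows whose upload window has lapsed.
async function deleteAbandonedUploads(db) {
  const result = await db.query(
    `DELETE FROM files
      WHERE status = 'pending'
        AND expires_at IS NOT NULL
        AND expires_at < NOW()`
  );
  return result.rowCount; // number of abandoned rows removed
}

// Possible wiring, e.g. every 15 minutes:
//   setInterval(() => deleteAbandonedUploads(db).catch(console.error), 15 * 60 * 1000);
```

Rows without an `expires_at` are left alone by this query, so the short-lived flow would need to set the column too if you want its abandoned rows swept by the same job.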

One more honest caveat for this specific large-file case: a single PutObject presigned URL works fine up to a few hundred MB, but for anything approaching or exceeding S3's 5GB single-PUT ceiling — or for genuinely unreliable long uploads where you don't want one dropped connection to waste hours of transfer — the real production tool is a multipart upload, where the file is split into parts, each part gets its own presigned URL, and a failed part can be retried without restarting the whole file. That's a deliberately separate topic; today's single-PUT flow is the right building block to understand first.
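To give a feel for the "split into parts" step, here is a small sketch of the arithmetic only, with no S3 calls. It respects two S3 multipart limits: every part except the last must be at least 5 MiB, and an upload can have at most 10,000 parts. `planParts` and the 16 MiB default are choices made for this example.

```javascript
const MIN_PART_BYTES = 5 * 1024 * 1024; // S3 minimum for every part except the last
const MAX_PARTS = 10000;                // S3 maximum number of parts per upload

// Returns the byte ranges (end exclusive) for each 1-based part number.
function planParts(sizeBytes, partBytes = 16 * 1024 * 1024) {
  if (!Number.isSafeInteger(sizeBytes) || sizeBytes <= 0) {
    throw new Error("sizeBytes must be a positive integer");
  }
  if (partBytes < MIN_PART_BYTES) throw new Error("partBytes is below the 5 MiB minimum");
  const count = Math.ceil(sizeBytes / partBytes);
  if (count > MAX_PARTS) throw new Error("too many parts; use a larger partBytes");

  const parts = [];
  for (let i = 0; i < count; i++) {
    const start = i * partBytes;
    const end = Math.min(start + partBytes, sizeBytes); // last part may be shorter
    parts.push({ partNumber: i + 1, start, end });
  }
  return parts;
}

console.log(planParts(40 * 1024 * 1024)); // a 40 MiB file in 16 MiB parts
```

Each range would then get its own presigned URL, and a failed part is retried on its own range instead of restarting the file.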

Part 9 — Wiring In the Rest of the Protection (Recap From Last Time, Now Complete)

👨‍🦳 Uncle: Let's connect this to the earlier security layering, so you see the whole chain end to end for a production deployment:

Internet
   ↓
AWS WAF               → blocks malicious/abnormal traffic patterns, rate-limits by IP
   ↓
Load Balancer          → max request size, connection limits, timeouts
   ↓
Application rate limit → e.g. max 20 upload-requests/min per user
   ↓
Backend validates size, type, dedup hash → issues presigned URL (scoped, size-capped, time-capped)
   ↓
Client uploads DIRECTLY to S3            → backend never touches the raw bytes
   ↓
Backend confirms via HeadObject           → verifies before trusting the upload

Nothing in this chain relies on a single point of trust. Even if one layer is bypassed, the next one still holds.

Part 10 — A Few More Production Essentials (Quick Hits)

👨‍🦳 Uncle: A handful of settings that separate a "working demo" from a "production-grade" S3 setup — worth knowing even if we don't deep-dive each one today:

Server-side encryption (SSE-S3 or SSE-KMS) — S3 encrypts new objects at rest with SSE-S3 by default. If you need keys you control, set SSE-KMS as the bucket's default encryption so you don't have to remember it per-upload.
Versioning — keeps prior versions of an object if it's ever overwritten, protecting against accidental deletes or bad overwrites.
CORS configuration — since the browser is uploading directly to S3 (a different origin than your app's domain), S3's bucket needs a CORS (Cross-Origin Resource Sharing) policy explicitly allowing your frontend's domain to make that PUT request. Without this, browsers block the upload even with a valid presigned URL.
Lifecycle rules — automatically move old, rarely-accessed objects to cheaper storage tiers (like S3 Glacier) after a set number of days, or delete genuinely temporary files automatically.
Deny insecure transport — a bucket policy statement that rejects any request not made over HTTPS, so nothing is ever transmitted in plaintext.
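Two of these settings are short enough to sketch. The functions below build a CORS rule set in the JSON shape the S3 console accepts and a bucket policy statement that denies non-HTTPS requests. The function names, the example origin, and the permissive `AllowedHeaders: ["*"]` are assumptions for illustration; tighten them to fit your app.

```javascript
// Hypothetical helper: CORS rules allowing browser PUTs from one frontend origin.
function buildCorsRules(frontendOrigin) {
  return [
    {
      AllowedOrigins: [frontendOrigin],
      AllowedMethods: ["PUT"],
      AllowedHeaders: ["*"],   // lets the preflight pass for Content-Type and checksum headers
      ExposeHeaders: ["ETag"],
      MaxAgeSeconds: 3000,     // how long the browser may cache the preflight result
    },
  ];
}

// Hypothetical helper: bucket policy statement rejecting any non-HTTPS request.
function buildDenyInsecureTransportStatement(bucketName) {
  const bucketArn = `arn:aws:s3:::${bucketName}`;
  return {
    Sid: "DenyInsecureTransport",
    Effect: "Deny",
    Principal: "*",
    Action: "s3:*",
    Resource: [bucketArn, `${bucketArn}/*`],
    Condition: { Bool: { "aws:SecureTransport": "false" } },
  };
}

console.log(JSON.stringify(buildCorsRules("https://app.example.com"), null, 2));
console.log(JSON.stringify(buildDenyInsecureTransportStatement("my-app-uploads"), null, 2));
```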

Part 11 — Recap

👨‍🦳 Uncle: Let's replay it:

S3 is object storage, fully decoupled from any server — the fix for the "disposable server" problem from Episode 1.
"Folders" in S3 are an illusion — really just prefixes inside a flat key namespace. We used that to organize uploads by type: documents/, images/, text-files/.
SHA-256 gives every file a content fingerprint. We hash the file, check it against a UNIQUE-constrained database column, and skip storing (and re-uploading) anything we've already got.
File size is enforced at every layer — client (UX only), backend (before issuing any URL), and S3 itself (via a signed Content-Length condition) — never trusting just one.
Permissions stay locked down: Block Public Access ON, tightly scoped IAM policies, delivered via an IAM Role rather than hardcoded keys.
The full flow: request URL → backend validates + dedups + signs → client uploads directly to S3 → backend confirms via HeadObject before trusting anything.
The dedup hash itself is verified, not just trusted — ChecksumSHA256 is baked into the signed request, so S3 rejects any upload whose actual bytes don't match the claimed hash. A client can't poison a dedup key with mismatched content, because the object never gets created if the checksum doesn't match. Extensions are decided server-side too, never taken from client input, for the same "never trust the client" reason.
Presigned URL expiry should match the use case — short (minutes) for quick interactive uploads, longer (hours, like 5 hours) specifically for large files needing a long transfer window — always paired with size/type/checksum conditions baked into the signature itself, never an open-ended, unrestricted permission slip.

Part 12 — Glossary (New Terms From Today)

Term	Plain-English definition
S3	Simple Storage Service — AWS's object storage, independent of any single server.
Bucket	The top-level, globally-uniquely-named container for objects in S3.
Object	A single stored file plus its metadata.
Key	An object's full name/path string inside a bucket — S3 has no real folders, only key-string prefixes.
Prefix	The "folder-like" portion of a key before a `/`, used to visually and logically group objects.
SHA-256	A cryptographic hash function producing a fixed-length fingerprint of a file's exact content — identical content always yields an identical hash.
Deduplication (dedup)	Detecting and avoiding storage of identical file content more than once.
Streaming (hash computation)	Processing a file in small chunks rather than loading it entirely into memory — critical for large files.
Block Public Access	A set of S3 safety settings, applied per bucket or account-wide, that override any ACL or bucket policy granting public access, so objects cannot be made public by accident.
IAM Policy	A JSON document defining exactly which actions are allowed on which resources.
`s3:PutObject` / `s3:GetObject`	The specific IAM permissions for uploading and downloading S3 objects respectively.
Presigned URL	A temporary, cryptographically signed URL that grants limited, time-boxed permission to upload or download one specific S3 object, without exposing AWS credentials.
Content-Length	The request header stating the body's byte size. When it is included in a presigned PUT's signature, the upload is pinned to that exact size, and S3 rejects an upload that doesn't match.
ChecksumSHA256 / ChecksumAlgorithm	S3 parameters that bind a specific SHA-256 value into the signed request itself — S3 rejects the upload if the actual bytes don't hash to that value, turning a client-claimed hash into a server-verified one.
HeadObject	An S3 API call that retrieves an object's metadata (size, checksum, existence) without downloading its actual content — used here to verify an upload really happened and matches expectations.
CORS	Cross-Origin Resource Sharing — the browser security mechanism that must be explicitly configured on the S3 bucket to allow direct browser-to-S3 uploads from your app's domain.
Server-side encryption (SSE)	S3 automatically encrypting stored objects at rest.
Versioning	An S3 bucket feature that retains prior versions of an object instead of overwriting it permanently.
Lifecycle rule	An automated policy to transition or delete objects after a set time period.
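The HeadObject entry above is the one term that describes a check rather than a thing, so here is a small sketch of that check. The function takes a plain object shaped like the response of the SDK v3 `HeadObjectCommand` (`ContentLength`, `ChecksumSHA256`) and compares it with what the backend signed. The function name and the `expected` fields are hypothetical. Note that `ChecksumSHA256` only appears in the response when the call passes `ChecksumMode: "ENABLED"`.

```javascript
// Compares a HeadObject-style response with the values the backend signed.
// Returns every problem found instead of stopping at the first one.
function verifyUploadedObject(head, expected) {
  const problems = [];

  if (head.ContentLength !== expected.sizeBytes) {
    problems.push(
      `size mismatch: got ${head.ContentLength}, expected ${expected.sizeBytes}`
    );
  }

  if (!head.ChecksumSHA256) {
    // Missing checksum: either ChecksumMode was not enabled on the call,
    // or the object was uploaded without a SHA-256 checksum.
    problems.push("no SHA-256 checksum on the object");
  } else if (head.ChecksumSHA256 !== expected.sha256Base64) {
    problems.push("checksum mismatch");
  }

  return { ok: problems.length === 0, problems };
}

// Usage with made-up values standing in for a real HeadObject response.
const result = verifyUploadedObject(
  { ContentLength: 5, ChecksumSHA256: "LPJNul+wow4m6DsqxbninhsWHlwfp0JecwQzYpOLmCQ=" },
  { sizeBytes: 5, sha256Base64: "LPJNul+wow4m6DsqxbninhsWHlwfp0JecwQzYpOLmCQ=" }
);
console.log(result.ok); // true
```

Only after a check like this passes should the backend record the hash as a trusted dedup key.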

Part 13 — What's Next

👦 Nephew: Uncle, this actually feels like a real production system now — not a tutorial toy. What's after this?

👨‍🦳 Uncle: Think about it yourself again, same as last time. You've now got files landing safely in S3, deduplicated, size-checked, permission-locked. But — a user in Chennai and a user in Delhi both request the same image. Where is that image physically served from, and how fast does each of them get it?

👦 Nephew: ...Straight from the S3 bucket in Mumbai, I'd guess? Which means the Delhi user gets it faster than someone further away?

👨‍🦳 Uncle: Exactly the gap we'll close next — CloudFront, AWS's CDN (Content Delivery Network), which caches your S3 content at edge locations around the world so nobody's request has to travel all the way back to Mumbai every time. And while we're there, we'll also cover what happens after upload — using Lambda to automatically process, resize, or validate a file the moment it lands in S3, without you running a single always-on server for that job.

👦 Nephew: Episode 3, then.

👨‍🦳 Uncle: Episode 3.

End of Episode 2. Next up — Episode 3: CloudFront (CDN) & Lambda-Powered Post-Upload Processing.

Top comments (2)

Nazar Boyko • Jul 2

The dedup design is clean, and it fits the "never trust the client" spine you run through the whole piece, which is exactly why one step caught my eye. The sha256Hash that decides dedup comes from the client, but nothing re-checks that the bytes actually landing in S3 hash to that value. So a client could claim the hash of a harmless PDF while uploading something else to that key, and the next person who uploads the real file gets deduped onto the poisoned object. HeadObject confirms the size but not the content, so it wouldn't catch it. Do you hash the bytes server-side after the upload, or is that saved for a later episode?

surajrkhonde • Jul 2

You're right — I missed this when I wrote it. I was so focused on "never trust the client" for size and type that I let the hash itself slide through as a client-supplied value without applying the same rule to it. The dedup key ends up resting on an unverified claim, which undercuts the whole spine of the piece. Good catch — making the correction now: binding the SHA-256 into the signed request via ChecksumSHA256 so S3 verifies it against the actual bytes at upload time, instead of just trusting what the client sent. Appreciate you flagging it.