Martin Kambla

Delivering E2EE media without blowing up Postgres

When I launched my messenger, the media upload path looked like this:

client → encrypt → POST /media/upload → INSERT INTO media (ciphertext BYTEA)

Functionally it worked: 2 MB was fine, 25 MB sometimes timed out, and at 100 MB things blew up. Here's a brief lessons-learned about taking Postgres BYTEA to scale, and the architecture I ended up with: shipping 200 MB encrypted video without the server ever seeing plaintext.

Why BYTEA was good at first

I've always preferred Postgres to any other DB system. I'm sure other databases have their benefits in different scenarios, but when I'm dealing with serious applications I usually reach for Postgres.

And for messaging, the attraction was specific. If the ciphertext lives next to the row pointing at it, there's one write, one transaction, one thing to fail. No dangling references between the DB and a blob store if one side of the write fails. Simple.
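That path really was minimal. Roughly, sketched with plain JDBC (the id column and the function are illustrative; the real table just needs the ciphertext BYTEA column from the snippet above):

import java.sql.Connection

// The original write path: one statement, one transaction, ciphertext straight into BYTEA.
fun insertMedia(conn: Connection, mediaId: String, ciphertext: ByteArray) {
    conn.prepareStatement(
        "INSERT INTO media (id, ciphertext) VALUES (?, ?)"
    ).use { stmt ->
        stmt.setString(1, mediaId)
        stmt.setBytes(2, ciphertext)  // the whole blob rides through the driver, the heap, and the WAL
        stmt.executeUpdate()
    }
}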

What breaks?

Then I tried to ship a 50 MB video.

WAL explosion. Every write to a BYTEA column goes through the Write-Ahead Log. A 50 MB attachment is 50 MB of WAL. All of that has to reach my read replica too, and a single upload was pushing replication 60+ seconds behind.

Backup time. pg_dump with BYTEA attachments turns into a slow march. At 100 GB of media my nightly dump was taking over an hour. And every byte of that was stuff the database didn't need to be parsing — it's opaque ciphertext.

Memory pressure. The server was reading the full ciphertext into JVM heap to stream it back to the client on download. One concurrent 100 MB download = 100 MB of heap. Three concurrent downloads on a 1 GB instance and you're in swap.

Connection hold time. An upload that takes 90 seconds holds a DB connection for 90 seconds. With a default pool of 20 connections, ten concurrent uploaders and everything else blocks.

TOAST limitations. Postgres stores large column values in a separate "TOAST" table with its own indexing and compression. TOAST has a hard 1 GB limit per value. I wasn't near that, but I was building a path that led there.

The pattern was clear: Postgres is a great metadata store, a bad blob store. I needed to split them.

The architecture I moved to

client
  │
  │  1. POST /media/reserve   → server issues upload URL + media_id;
  │                              client generates a random media key locally
  │                              (the key is stored by the client, never sent to the server)
  │
  │  2. encrypt(file, key)    → ciphertext bytes
  │
  │  3. PUT {presigned S3 URL} → DO Spaces / S3
  │
  │  4. POST /media/finalize  → server marks media_id as ready,
  │                              records size + sha256(ciphertext) only
  │
  │  5. send message with media_id + encrypted key (wrapped for recipient)

The key property: the server never sees plaintext, and never sees the media key. The server knows a ciphertext blob exists at spaces://quldra-media/{media_id}. It knows the blob's size and hash. That's it.
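Concretely, the per-blob state on the server is tiny. A sketch of the shape (field and type names are illustrative, not the actual schema):

// Everything the server knows about an attachment: no keys, no plaintext.
data class MediaRecord(
    val mediaId: String,      // also the object key: spaces://quldra-media/{media_id}
    val sizeBytes: Long,      // ciphertext size confirmed at finalize
    val sha256: String,       // client-reported hash of the ciphertext
    val status: MediaStatus   // RESERVED until finalize succeeds
)

enum class MediaStatus { RESERVED, READY }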

On the recipient side, the message carries the media key wrapped for the recipient's device key. The recipient unwraps it, downloads the ciphertext directly from DO Spaces via a short-lived presigned URL, and decrypts locally.
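Put together, the receive path looks roughly like this on the JVM/Android side. It's a sketch: unwrapMediaKey() and streamDecryptToFile() stand in for the client's actual crypto helpers, and IncomingAttachment is an assumed shape, not the real wire format:

import io.ktor.client.HttpClient
import io.ktor.client.request.prepareGet
import io.ktor.client.statement.bodyAsChannel
import io.ktor.util.cio.writeChannel
import io.ktor.utils.io.copyAndClose
import java.io.File

suspend fun receiveAttachment(http: HttpClient, msg: IncomingAttachment): File {
    // 1. Unwrap the media key with this device's private key; the server never had it
    val mediaKey: ByteArray = unwrapMediaKey(msg.wrappedKey)

    // 2. Stream the ciphertext straight from object storage via a short-lived presigned URL
    val cipherFile = File.createTempFile("media-", ".enc")
    http.prepareGet(msg.presignedGetUrl).execute { response ->
        response.bodyAsChannel().copyAndClose(cipherFile.writeChannel())
    }

    // 3. Decrypt locally; a Poly1305 tag mismatch throws, so tampering surfaces here
    val plainFile = File.createTempFile("media-", ".dec")
    streamDecryptToFile(cipherFile, plainFile, mediaKey, msg.nonce)
    return plainFile
}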

What I had to fix in the client

Timeout discipline

Default HTTP client timeouts are a lie for large uploads. The Ktor client defaults worked fine for API calls and quietly retried uploads that crossed 120 seconds, which turned into silent data loss — the client thought it had retried, the server had no idea anything had happened.

I split the timeout regime:

val mediaUploadClient = HttpClient(CIO) {
    install(HttpTimeout) {
        requestTimeoutMillis = 10 * 60 * 1000  // 10 minutes
        connectTimeoutMillis = 30_000
        socketTimeoutMillis = 10 * 60 * 1000
    }
    // CRITICAL: retry disabled on /media/upload
    // The upload is not idempotent from the client's perspective —
    // a retry means re-encrypting with a potentially different random nonce,
    // which creates an orphan blob in Spaces.
    install(HttpRequestRetry) {
        retryIf { _, _ -> false }              // overrides the default retry-on-server-error behaviour
        retryOnExceptionIf { _, _ -> false }   // and the default retry-on-exception behaviour
    }
}

Streaming encryption

You can't hold 200 MB of plaintext in memory on a mid-range Android device. Encryption has to stream:

import org.bouncycastle.crypto.modes.ChaCha20Poly1305
import org.bouncycastle.crypto.params.AEADParameters
import org.bouncycastle.crypto.params.KeyParameter

// Source/Sink here are simple byte-stream abstractions;
// ProgressTrackingSource below implements Source.
fun streamEncryptToFile(
    input: Source,
    output: Sink,
    key: ByteArray,
    nonce: ByteArray
) {
    // BouncyCastle's streaming ChaCha20-Poly1305 AEAD with a 128-bit tag
    val cipher = ChaCha20Poly1305()
    cipher.init(true, AEADParameters(KeyParameter(key), 128, nonce))

    val buffer = ByteArray(64 * 1024)
    while (true) {
        val n = input.read(buffer)
        if (n <= 0) break
        val chunk = ByteArray(cipher.getOutputSize(n))
        val written = cipher.processBytes(buffer, 0, n, chunk, 0)
        output.write(chunk, 0, written)
    }
    // doFinal flushes any buffered ciphertext and appends the Poly1305 tag
    val tag = ByteArray(cipher.getOutputSize(0))
    val tagWritten = cipher.doFinal(tag, 0)
    output.write(tag, 0, tagWritten)
}

64 KB chunks are the sweet spot on Android. Smaller and the per-chunk syscall overhead dominates; larger and you start allocating in a way that pressures GC.

Progress reporting

Users expect a progress bar for a 200 MB upload. Presigned S3 PUTs don't give you one by default — the network layer doesn't know anything about your application-level progress. I wrapped the Source that streams the ciphertext file:

class ProgressTrackingSource(
    private val delegate: Source,
    private val totalBytes: Long,
    private val onProgress: (bytesRead: Long, total: Long) -> Unit
) : Source {
    private var bytesRead = 0L
    override fun read(buffer: ByteArray): Int {
        val n = delegate.read(buffer)
        if (n > 0) {
            bytesRead += n
            onProgress(bytesRead, totalBytes)
        }
        return n
    }
}

And threaded the onProgress callback into a StateFlow the UI could render.
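A minimal sketch of that wiring (the holder class name is illustrative):

import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.StateFlow
import kotlinx.coroutines.flow.asStateFlow

// Exposes upload progress as a fraction in [0, 1] for the UI to collect.
class UploadProgress {
    private val _fraction = MutableStateFlow(0f)
    val fraction: StateFlow<Float> = _fraction.asStateFlow()

    // Handed to ProgressTrackingSource as its onProgress callback.
    fun onProgress(bytesRead: Long, total: Long) {
        if (total > 0) _fraction.value = bytesRead.toFloat() / total
    }
}

// Usage: ProgressTrackingSource(cipherSource, totalBytes, uploadProgress::onProgress)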

What I had to fix on the server

Presigned URLs, not streaming through the server

The wrong way:

client → POST /media/upload (200 MB body) → server → write to S3

This makes the server a bottleneck on every upload. The correct way:

client → POST /media/reserve (metadata only) → server
server → presign a PUT URL against S3 → return to client
client → PUT to S3 directly (server not involved)
client → POST /media/finalize (metadata only) → server

Server-side, presigning with the AWS SDK is a few lines:

// (endpoint/region configuration for DO Spaces omitted here)
val presigner = S3Presigner.create()
val presigned = presigner.presignPutObject {
    it.signatureDuration(Duration.ofMinutes(15))
        .putObjectRequest { req ->
            req.bucket("quldra-media").key(mediaId)
        }
}
val uploadUrl = presigned.url()  // returned to the client in the reserve response

The server does no I/O during the upload. No connection held. No heap pressure. It issues a URL, then hears back 90 seconds later when the finalize request comes in.
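For completeness, the reserve endpoint is just as thin. A rough sketch; ReserveRequest, ReserveResponse and mediaRepo.createPending are illustrative names, not the exact production code:

post("/media/reserve") {
    // Client sends only metadata: declared ciphertext size, mime type, etc.
    val req = call.receive<ReserveRequest>()
    val mediaId = UUID.randomUUID().toString()

    // Record a pending row so /media/finalize has something to validate against
    mediaRepo.createPending(mediaId, req.declaredSize)

    val presigned = presigner.presignPutObject {
        it.signatureDuration(Duration.ofMinutes(15))
            .putObjectRequest { obj -> obj.bucket("quldra-media").key(mediaId) }
    }

    // Note: no media key in the response; the client generates that locally
    call.respond(ReserveResponse(mediaId = mediaId, uploadUrl = presigned.url().toString()))
}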

Verifying what landed

The finalize step is where the server confirms the upload actually happened:

post("/media/finalize") {
    val req = call.receive<FinalizeRequest>()

    val head = s3.headObject {
        it.bucket("quldra-media").key(req.mediaId)
    }

    require(head.contentLength() == req.expectedSize) {
        "uploaded size mismatch"
    }
    // We can't verify sha256 without reading the object — we trust the client
    // for the hash value it reports. The hash is only used for dedup + UX,
    // not security (the ciphertext is already AEAD-authenticated).

    mediaRepo.markReady(req.mediaId, req.expectedSize, req.sha256)
    call.respond(HttpStatusCode.OK)
}

The HEAD request is cheap and confirms the blob exists with the expected size. The client's reported hash goes into the DB for deduplication and UI purposes. The ciphertext itself is already AEAD-authenticated — if someone tampers with the blob in storage, the recipient's decryption will fail with a tag mismatch, so I don't need the server to guarantee integrity.

What it looks like now

After the migration:

  • Database size dropped from 180 GB to 11 GB overnight. Backups are fast again.
  • WAL rate dropped by a factor of ~60.
  • Memory per concurrent upload on the server: effectively zero.
  • Maximum upload size went from "25 MB before things get flaky" to 200 MB reliably.
  • Replication lag on the read replica went from 60s+ during upload spikes to under 1s sustained.

The simplicity argument for BYTEA was real — and wrong at this scale. One extra system to operate (DO Spaces) is the right trade for keeping the database focused on what it's good at.

The piece I'd do differently

I'd have set a size threshold from day one: under 1 MB goes into Postgres, over 1 MB goes to object storage, metadata identical in both cases. That way the architecture supports both paths and you pick per-blob based on actual size, not "we decided in advance." It's more code but it's also the path I'd eventually want anyway.
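Sketched out, the dispatch would look something like this (the threshold constant and the two upload helpers are hypothetical, not existing Quldra code):

// Hypothetical size-based dispatch: small ciphertexts stay inline in Postgres,
// everything else goes through reserve → presigned PUT → finalize.
const val INLINE_THRESHOLD_BYTES = 1L * 1024 * 1024  // 1 MB

sealed interface StoredMedia {
    data class Inline(val mediaId: String) : StoredMedia   // BYTEA row
    data class Remote(val mediaId: String) : StoredMedia   // object in Spaces
}

suspend fun storeCiphertext(mediaId: String, ciphertext: ByteArray): StoredMedia =
    if (ciphertext.size <= INLINE_THRESHOLD_BYTES) {
        uploadInline(mediaId, ciphertext)              // small: single POST, server writes BYTEA
        StoredMedia.Inline(mediaId)
    } else {
        reserveAndUploadToSpaces(mediaId, ciphertext)  // large: the flow described above
        StoredMedia.Remote(mediaId)
    }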


This is post 3 of a short series on the tech behind Quldra, a post-quantum single-device messenger built in Kotlin Multiplatform. Previous posts covered "My road to ML-KEM-768 over X25519 for my messaging app" and "Device distinct messaging: why I killed multi-device and how fingerprint hashing enforces it".
