Junxiao Shi

Posted on Jul 11, 2021 • Originally published at yoursunny.com

Deep Atlantic Storage: Streaming Bits

#go #python #cpp

This post is originally published on yoursunny.com blog https://yoursunny.com/t/2021/das-stream-bits/

I'm bored on 4th of July holiday, so I made a wacky webpage: Deep Atlantic Storage.
It is described as a free file storage service, where you can upload any file to be stored deep in the Atlantic Ocean, with no size limit and content restriction whatsoever.
How does it work, and how can I afford to provide it?

This article is the third of a 3-part series that reveals the secrets behind Deep Atlantic Storage.
The first part revealed that the uploaded file is sorted which drastically reduces its storage demand, and introduced the bit sorting algorithm.
The second part covered how to process the upload in the browser using Web Workers.
Now I'd continue from there, and explain where I store the files and how I offer downloads with reasonable costs.

Storage in the URL

Deep Atlantic Storage sorts the bits in every uploaded file.
After sorting, each file can be represented by two numbers: the number of 0 bits, the number of 1 bits.
Given these two numbers, the sorted file can be reconstructed.

I could make a database or use one of those fancy NoSQL thingy to store those numbers that represent the files, but I prefer my websites stateless so that I don't need to take backups.
Therefore, I decided to encode those numbers in the URI.

Each URI created by the Deep Atlantic Storage looks like this:

https://summer-host-storage.yoursunny.dev/6705fb/18fa05/%E2%98%80%EF%B8%8F.bin
                                          ^        ^          ^
                                          zeros    ones       filename
                                          (in hexadecimal)    (URI encoded)

Whenever someone requests this URI, the server program can extract the number of 0 bits and 1 bits from the URI, and reconstruct the sorted file accordingly.
By having all the information in the URI itself, I can run a storage service without maintaining any server-side storage.

Bit Counts » Byte Array

Given the number of 0 bits and 1 bits in a file, how to generate a file with those bits?
The naive way is:

Generate a string of "0" and "1" repeated for requested length.
Parse the string into an array of bytes.
Write those bytes to a file.

The code looks like this:

def makeFileFromBitCounts(filename: str, cnt0: int, cnt1: int) -> None:
    bitString = '0' * cnt0 + '1' * cnt1
    byteCount = (cnt0 + cnt1) // 8
    arrayOfBytes = int(bitString, base=2).to_bytes(byteCount, byteorder='big')
    with open(filename, mode='wb') as f:
        f.write(arrayOfBytes)

However, this approach is vastly inefficient in memory usage:

The array of bytes has the length of the entire file, and it is allocated in memory.
Even worse, the bit string has 8x the file size, demanding even more memory.

Bit Counts » Stream

Instead of reconstructing the file as a byte array in the memory, I can write the bits to an output stream (which could be a file or a network socket) as I generate them.
However, there's one little problem: in most programming languages, the stream I/O functions are byte-oriented, not bit-oriented.
In other words, I can only write bytes, not bits, to the stream.

Looking at the structure of a sorted file in bytes, it has three parts:

A consecutive run of 0x00 bytes.
Possibly, one byte that is neither 0x00 nor 0xFF.
A consecutive run of 0xFF bytes.

In order to write the file to a stream using byte-oriented API, I need to calculate three things:

The number of 0x00 bytes.
Whether the middle byte exists, and its value.
The number of 0xFF bytes.

Since each byte has 8 bits, suppose there are cnt0 zero bits and cnt1 one bits:

The number of 0x00 bytes should be cnt0 ÷ 8, rounded down.
The number of 0xFF bytes should be cnt1 ÷ 8, rounded down.
If the whole bytes did not cover all the bits, there would be a middle byte.
- This byte should contain cnt0 mod 8 zeros and cnt1 mod 8 ones.
- Since cnt0 + cnt1 is divisible by 8, (cnt0 mod 8) + (cnt1 mod 8) is expected to equal 8.
- Bit numbering within a byte can go either direction, but I picked "Most Significant Bit first" order.

The following code does the calculations and writes the bits to an output stream (try on Coliru):

void
streamFileFromBitCounts(std::ostream& os, size_t cnt0, size_t cnt1)
{
  assert((cnt0 + cnt1) % 8 == 0);

  size_t bytes0 = cnt0 / 8;
  size_t bytes1 = cnt1 / 8;

  std::optional<uint8_t> middleByte;
  size_t middle0 = cnt0 % 8;
  size_t middle1 = cnt1 % 8;
  if (middle0 + middle1 > 0) {
    assert(middle0 + middle1 == 8);
    middleByte = 0xFF >> middle0;
  }

  for (size_t i = 0; i < bytes0; ++i) {
    os.put(static_cast<uint8_t>(0x00));
  }
  if (middleByte.has_value()) {
    os.put(middleByte.value());
  }
  for (size_t i = 0; i < bytes1; ++i) {
    os.put(static_cast<uint8_t>(0xFF));
  }
}

Deep Atlantic Storage runs this logic on a server rather than in the browser.
While Blob API and URL.createObjectURL() allow generating a downloadable file from JavaScript, the entire file must be allocated in memory.
Browser support for streaming from JavaScript is still at an early stage.

Stream » HTTP Server

Deep Atlantic Storage offers file downloads from an HTTP server.
The server, dubbed Deep Atlantic Storage app, is built upon the logic above, with a few improvements:

The response is piped through a gzip compressor, in order to save network egress from my server.
For long runs of 0x00 and 0xFF bytes, instead of writing one byte at a time, the program writes them in pages of 1MB each.
A small delay is added after writing each page to throttle the downloads and avoid consuming too much CPU.
If the client has disconnected, stop writing and return right away.

This is the HTTP handler implementation:

var (
  buf0 = make([]byte, 1048576)               // create 1MB page filled with 0x00 bytes
  buf1 = bytes.Repeat([]byte{0xFF}, 1048576) // create 1MB page filled with 0xFF bytes
)

func handler(w http.ResponseWriter, req *http.Request) {
  // parse URI and extract number of zeros and ones
  var cnt0, cnt1 int
  var name string
  if _, e := fmt.Sscanf(req.URL.Path, "/%x/%x/%s", &cnt0, &cnt1, &name); e != nil {
    w.WriteHeader(http.StatusNotFound)
    return
  }

  // tell browser/client that:
  // (1) the response is binary file, not text or HTML
  // (2) the response is gzip compressed
  // (3) browser should prompt for a download instead of display the file
  w.Header().Set("Content-Type", "application/octet-stream")
  w.Header().Set("Content-Encoding", "gzip")
  w.Header().Set("Content-Disposition", "attachment")

  // calculate number of 0x00 and 0xFF pages and leftover bytes, and the middle byte
  bytes0, bytes1 := cnt0/8, cnt1/8
  middle0, middle1 := cnt0%8, cnt1%8
  pages0, pages1 := bytes0/len(buf0), bytes1/len(buf1)
  bytes0 -= pages0 * len(buf0)
  bytes1 -= pages1 * len(buf1)

  z, _ := gzip.NewWriterLevel(w, 1) // make gzip compressor, fastest mode
  defer z.Close()                   // close and flush when it's done

  writePages := func(page []byte, count int) error { // subroutine to write some pages
    ctx := req.Context()
    for i := 0; i < count; i++ {
      z.Write(page) // pipe one page to the gzip compressor
      select {
      case <-ctx.Done(): // client disconnected
        return ctx.Err() // return non-nil error
      case <-time.After(time.Millisecond): // delay 1ms and continue writing
      }
    }
    return nil
  }

  if e := writePages(buf0, pages0); e != nil { // write 0x00 pages
    return // stop if client has disconnected
  }
  z.Write(buf0[:bytes0]) // write leftover 0x00 bytes
  if middle0+middle1 > 0 { // write the middle byte, if there is one
    z.Write([]byte{0xFF >> middle0})
  }
  z.Write(buf1[:bytes1]) // write leftover 0xFF bytes
  writePages(buf1, pages1) // write 0xFF pages
}

The actual server has some additional logic for verifying the HTTP request verb and Accept-Encoding: gzip header.
You can see its source code in this Gist.

HTTP Server » HTTPS Server

It's 2021 and every website is expected to have HTTPS.
Moreover, the .dev domain I'm using is designated as a "secure namespace" such that most browsers won't even connect to it over plain HTTP.
Therefore, I need to deploy Deep Atlantic Storage app over HTTPS.

I turned to my favorite TLS termination solution, Caddy server, for this task.
The Caddyfile uses a reverse_proxy directive to the HTTP server:

https://summer-host-storage.yoursunny.dev {
  reverse_proxy http://127.0.0.1:3333 {
    flush_interval 1s
    transport http {
      compression off
    }
  }
}

Caddy by default would make request to the HTTP backend server with Accept-Encoding: gzip request header, and then decompress the response if the incoming request does not accept gzip compression.
This is undesirable for Deep Atlantic Storage because my server can possibly send large responses, consuming too much egress bandwidth.
Therefore, I explicitly disabled HTTP transport compression, so that the incoming request is served only if the client accepts gzip compression, and the outgoing response is always compressed.

Conclusion

This article is the final installment of a 3-part series that reveals the secrets behind Deep Atlantic Storage.
Given the bit counts generated by Web Workers, I built and deployed a server that reconstructs the file as gzip'ed HTTP response.

While Deep Atlantic Storage started as a joke, it is a nice exercise to learn the algorithms and APIs around bits, bytes, and blobs.
I enjoyed building the app, and I hope you learned something from these articles.

DEV Community