Stream all things - Node.js Design Patterns Book
Streams are how you move data from point A to point B without nuking your infrastructure.
It doesn’t matter if you’re in CRUD, building infra, or working low-level on raw sockets; your job as a backend engineer boils down to efficient data movement. And streams are how you do it.
Take Google’s Distributed File System: built on streams, moving terabytes. Most of the Node.js ecosystem is also built on streams: HTTP, file I/O, sockets, everywhere you look.
Unlike buffered APIs that load everything into memory (and blow up at scale), streams let you process data in chunks; 64KB, 64MB, you decide.
If you're serious about backend, this is one of those “separate the noobs from the real ones” moments. In this post, we’ll generate a 10GB file using streams; then read it back, live.
Let’s move some data.
Streams: A Crash Course
To understand streams, you first need to understand buffered APIs.
When an API is buffered, you don’t see a single byte until the entire operation is done. Doesn’t matter if it’s synchronous:
fs.readFileSync("hugefile.txt");
Or asynchronous:
fs.readFile("hugefile.txt", () => {});
Either way, if it’s reading a 10GB file, you don’t get a single byte back until the whole thing has been read into memory. No shortcuts.
Why is that bad? Because memory is finite. Pulling an entire 10GB file into RAM is asking for trouble.
So enter streams. Instead of reading the whole file at once, streams read it in chunks; and they keep track of where they left off. You get data piece by piece, until you hit EOF (end of file).
This isn’t magic; it’s how your disk (SSD or HDD) works under the hood. Files are split into pages, just like databases do. But that’s a rabbit hole for another post.
What matters is: you don’t gobble, you sip. That’s how you build fast systems.
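To make that concrete, here’s a minimal sketch (reusing the hugefile.txt name from above as a placeholder) of sipping a file in 64KB chunks with a readable stream:
// Read a big file in 64KB chunks instead of loading it all at once.
const fs = require("fs");
const stream = fs.createReadStream("hugefile.txt", { highWaterMark: 64 * 1024 }); // 64KB per chunk
stream.on("data", (chunk) => {
  console.log(`got ${chunk.length} bytes`); // never more than 64KB at a time
});
stream.on("end", () => console.log("hit EOF"));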
Manual Streaming Example
Here’s an example of reading a file manually, chunk by chunk, using buffers:
const fs = require("fs");
const fd = fs.openSync("./access.bin"); // file descriptor (pointer to the file in the kernel)
const chunksize = Buffer.alloc(4); // 4 bytes (32 bits) for the size header
fs.readSync(fd, chunksize, 0, 4); // read the first 4 bytes
const size = chunksize.readInt32BE(0); // total payload size, stored big-endian
console.log(size);
const all = Buffer.alloc(size);
fs.readSync(fd, all, 0, size); // read the remaining `size` bytes from the current position
console.log(all.toString());
What’s going on here?
- We open a file and get a file descriptor: a pointer to the actual I/O resource managed by the kernel.
- We allocate 4 bytes of memory to read a header (this file stores its total size in the first 4 bytes).
- We use that header to read the rest of the file manually.
That’s the core idea: small chunks.
Note: files don’t normally start with a 4-byte (32-bit) size header; I generated this one myself.
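For reference, here’s a hypothetical sketch of how such a file could be produced: write the payload length as a 4-byte big-endian header, then the payload itself (the payload string here is made up):
const fs = require("fs");
const payload = Buffer.from("some log data"); // placeholder payload
const header = Buffer.alloc(4);
header.writeInt32BE(payload.length, 0); // first 4 bytes = payload size
fs.writeFileSync("./access.bin", Buffer.concat([header, payload]));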
Now let’s scale up and generate a 10GB access log file.
Generating a 10GB Access Log File
Access logs store metadata about every request: IP, timestamp, URL, HTTP verb, status code.
Example log line:
4443.db38.9c21.a24e - - 2021-01-13 23:45:12 "POST file97.txt HTTP/1.1" 404
Let’s build a generator that streams millions of these lines into a file:
const fs = require("fs");
let fileContent = ``;
const startTime = Date.now();
let writingCount = 0;
fs.writeFileSync("./access.log", "");
const fileStream = fs.createWriteStream("./access.log");
function bytesToMegabytes(bytes) {
return Math.round(bytes / 1048576);
}
function* genLog() {
for (;;) {
const date = GetRandomDate(1451606400000, Date.now());
const verb = GenerateVerb();
const urlPath = GenerateURLPath();
const statusCode = GenerateStatusCode();
const ipAddress = GenerateIP();
fileContent += `${ipAddress} - - ${date} "${verb} ${urlPath} HTTP/1.1" ${statusCode}\n`;
writingCount++;
// Every million lines, check size
if (writingCount % 1_000_000 === 0) {
const fileLength = fs.statSync("access.log").size;
if (bytesToMegabytes(fileLength) > 10000) { // Tweak here for a smaller file
console.log("Finished generating access.log");
const endTime = Date.now();
console.log(
`Generated ${bytesToMegabytes(fileLength)}MB in ${endTime - startTime}ms`
);
fs.rename("./access.log", `./access-${writingCount}.log`, console.log); // rename to show the number of generated lines
break;
}
process.stdout.write(`Wrote ${bytesToMegabytes(fileLength)}MB\n`);
yield fileContent; // hand the accumulated chunk to the write driver (which resets fileContent)
}
}
}
A few things happening here:
- genLog() is a generator: a pausable function (see the tiny demo below if that’s new to you).
- Every million lines, it yields a chunk of logs.
- We stop writing once we hit ~10GB (you can tweak that).
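If generators are new to you, here’s a tiny standalone demo of the “pausable” part (unrelated to the log code):
// A generator pauses at every `yield` and only resumes when .next() is called.
function* count() {
  yield 1;
  yield 2;
}
const it = count();
console.log(it.next()); // { value: 1, done: false }
console.log(it.next()); // { value: 2, done: false }
console.log(it.next()); // { value: undefined, done: true }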
The write driver looks like this:
const g = genLog();
function write() {
  const data = g.next();
  fileContent = ""; // reset the shared buffer; data.value still holds the yielded chunk
  if (data.done) return fileStream.end(); // flush and close the stream once the generator finishes
  if (!fileStream.write(data.value)) {
    // internal buffer is full: pause until the disk catches up
    fileStream.once("drain", write);
  } else {
    process.nextTick(write); // keep writing without growing the call stack
  }
}
write();
We use a generator, together with the 'drain' event, to handle backpressure.
What’s Backpressure?
Backpressure happens when your writable destination (e.g. the disk) can’t keep up with your readable source (e.g. our generator). Chunks pile up in the stream’s internal buffer, and memory keeps growing.
It’s like a dam: water rushes in, but the gates are slow to open. Overflow = flood = crash.
So when .write() returns false, we pause. We only resume once the 'drain' event fires.
if (!fileStream.write(data.value)) {
fileStream.once("drain", write);
}
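As an aside, if you’d rather not wire up 'drain' yourself, Node can do the same backpressure handling for you. Here’s a sketch, assuming Node 12+ and a generator that yields chunk strings and manages its own buffer reset:
// Readable.from() wraps the generator; pipeline() propagates backpressure and errors.
const { Readable, pipeline } = require("stream");
pipeline(
  Readable.from(genLog()), // chunks are pulled only as fast as the file can absorb them
  fs.createWriteStream("./access.log"),
  (err) => {
    if (err) console.error("pipeline failed", err);
    else console.log("pipeline done");
  }
);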
The Helpers (Random Log Generators)
function GetRandomDate(start, end) {
  // start and end are already in milliseconds, so no extra conversion is needed
  const timestamp = Math.floor(Math.random() * (end - start)) + start;
  return new Date(timestamp)
    .toISOString()
    .replace("T", " ")
    .slice(0, 19);
}
function GenerateIP() {
// randomly mix decimal and 4-digit hex parts, producing fake "IPs" like 4443.db38.9c21.a24e
const ipParts = [];
for (let i = 0; i < 4; i++) {
if (Math.random() < 0.5) {
ipParts.push(Math.floor(Math.random() * 256));
} else {
const hexPart = Math.floor(Math.random() * 65536).toString(16);
ipParts.push(hexPart.padStart(4, "0"));
}
}
return ipParts.reverse().join(".");
}
function GenerateVerb() {
const verbs = ["GET", "POST", "PUT", "DELETE"];
return verbs[Math.floor(Math.random() * verbs.length)];
}
function GenerateURLPath() {
const pathParts = [];
for (let i = 0; i < Math.floor(Math.random() * 10) + 1; i++) {
if (Math.random() < 0.5) {
pathParts.push("/" + Math.floor(Math.random() * 100));
} else {
pathParts.push(`file${Math.floor(Math.random() * 100)}.txt`);
}
}
return pathParts.join("/");
}
function GenerateStatusCode() {
const statusCodes = [200, 404, 500];
return statusCodes[Math.floor(Math.random() * statusCodes.length)];
}
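If you want to eyeball the format before generating gigabytes, a quick sanity check could look like this:
// Print a single log line assembled from the helpers above.
console.log(
  `${GenerateIP()} - - ${GetRandomDate(1451606400000, Date.now())} ` +
    `"${GenerateVerb()} ${GenerateURLPath()} HTTP/1.1" ${GenerateStatusCode()}`
);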
Reading the 10GB File and Counting the Verbs
Node.js gives us the tools to stream files. So let’s use them.
Create a new file to read access.log:
const fs = require("fs");
const readStream = fs.createReadStream("access.log"); // point this at your generated (renamed) log file
const verbs = ["GET", "POST", "PUT", "DELETE"];
const verbsCount = {};
function parseVerbs(line) {
for (let i = 0; i < verbs.length; i++) {
if (line.includes(verbs[i])) {
verbsCount[verbs[i]] = (verbsCount[verbs[i]] || 0) + 1;
break;
}
}
}
let totalLines = 0;
// process the stream
readStream.on("data", (chunk) => {
const lines = chunk.toString().split("\n");
totalLines += lines.length;
lines.forEach(parseVerbs);
});
readStream.on("end", () => {
console.log(totalLines, verbsCount);
});
What’s happening here:
- We create a readable stream from the file: fs.createReadStream("access.log")
- We read it in chunks, convert each chunk to a string, split it by newline, and parse each line: chunk.toString().split("\n").forEach(parseVerbs)
We don’t handle lines that get split across chunk boundaries; that’s good enough for a rough count like this one (one way to handle it is sketched below).
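If exact counts matter, one option is to let the built-in readline module re-assemble complete lines for you; a sketch:
// readline emits one "line" event per complete line, regardless of chunk boundaries.
const readline = require("readline");
const rl = readline.createInterface({ input: fs.createReadStream("access.log") });
rl.on("line", (line) => {
  totalLines++;
  parseVerbs(line);
});
rl.on("close", () => console.log(totalLines, verbsCount));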
Run it, and you’ll get a count of how many times each verb showed up.
Example result from a 2GB file:
24032308 { PUT: 6003016, DELETE: 5995355, GET: 5999245, POST: 6001263 }
Simple. Streamable. Scalable.
Ever wondered what it really takes to build low-level Node.js tooling or distributed systems from scratch?
- Learn raw TCP
- Go over a message broker in pure JavaScript
- Go from "can code" to "can engineer"
Check out: How to Go from a 6.5 to an 8.5 Developer
—
Or maybe you're ready to master the dark art of authentication?
- From salting and peppering to mock auth libraries
- Understand tokens, sessions, and identity probes
- Confidently use (or ditch) auth-as-a-service, or roll your own