Trying to predict the performance of file reads/writes

Hi! Let's say you want to read or write to a text file. Maybe you are trying to persist application data, read file input or write output to a file. Will it be fast or slow?

Could we estimate how long it could take?


If you don't want to read, just jump to the conclusion at the bottom.

Long story short, I'm writing this article out of frustration because, after a lot of trial and error, I realized performance can vary a lot from system to system. A couple of microseconds on one system could mean a couple of milliseconds, or even whole seconds, on another. That's thousands to hundreds of thousands of times slower! smh smh...

If things start to get slow, you could use background threads or start reading/writing in batches of data. But we'll get into that later.

Let's figure out if we even have to worry about things getting too slow.
(Note: I'm a newbie at this stuff, so please correct me if you need to)

Okay. Let's start with how reading from a file works.

How does reading/writing from a file work?

The high level idea begins with your programming language. Pick your favorite programming language (that has file I/O). There is probably a read/write method/function in there.

But everything boils down to system calls. System calls are the interface programs use to ask the operating system to talk to the hardware on their behalf, with the OS acting as a safety layer. (So you don't corrupt your system accidentally lol)

For reading, it's ssize_t read(int fd, void *buf, size_t count), and for writing, ssize_t write(int fd, const void *buf, size_t count).
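
To make that concrete, here's a minimal sketch in Python using the os module, which wraps these syscalls almost 1:1. The filename is just a placeholder.

import os

# os.open/os.read/os.close are thin wrappers over the open/read/close syscalls
fd = os.open("filename.txt", os.O_RDONLY)   # placeholder file
try:
    data = os.read(fd, 4096)   # ask the kernel for up to 4096 bytes
    print(f"read {len(data)} bytes")
finally:
    os.close(fd)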

Python

Let's look at an example of file reading in Python:

with open('filename.txt', 'r') as file:
        # Read the first char
        first_char = file.read(1)

Python is an interpreted language, meaning an interpreter is required to execute your Python logic. I dug a little into CPython, the reference Python interpreter codebase. (Turns out CPython compiles Python source into bytecode, which is then interpreted by the Python Virtual Machine (PVM).) Any C extension code is already compiled to machine code and gets called directly at runtime.

I found that under the hood of the file I/O logic sits the sneaky read system call, used on both Windows and Linux:

#ifdef MS_WINDOWS
        _doserrno = 0;
        n = read(fd, buf, (int)count);
        // read() on a non-blocking empty pipe fails with EINVAL, which is
        // mapped from the Windows error code ERROR_NO_DATA.
        if (n < 0 && errno == EINVAL) {
            if (_doserrno == ERROR_NO_DATA) {
                errno = EAGAIN;
            }
        }
#else
        n = read(fd, buf, count);
#endif

Java

In Java, there are a lot of ways to read files. For example, you could use a FileInputStream.

        try (FileInputStream fileInputStream = new FileInputStream(filePath)) {
            int byteData;
            while ((byteData = fileInputStream.read()) != -1) {
                System.out.print((char) byteData);  
            }

        } catch (IOException e) {
            e.printStackTrace();
        }

Now as you may know, Java is a compiled language: the Java compiler turns your source into bytecode stored in .class files. When it's time to execute, the Java Virtual Machine interprets that bytecode and JIT-compiles the hot paths into machine code. Like Python's C extensions, native code reached through the Java Native Interface (JNI) is already machine code and is called directly at runtime.

If you dig deep into the Java Development Kit codebase, you can see the JNI implementation of FileInputStream, which has the read syscall hidden in its read logic:

ssize_t
handleRead(FD fd, void *buf, jint len)
{
    ssize_t result;
    RESTARTABLE(read(fd, buf, len), result);
    return result;
}

C++

In C/C++, you can directly use the read syscall. But even if you don't, standard library constructs like std::ifstream also use read under the hood.

I wasn't able to pin down the exact read call in the std::ifstream implementation, but I suspect you'd have to look inside the bits directory of GCC's libstdc++. (Let me know if you find it! Do it as homework hehe.)

So why am I showing you all this? I suggest you try finding some of these implementations in the interpreters/compilers yourself lol.

If you do, you will probably notice that the read and write syscalls are hidden under a lot of other clutter and logic.

In this blog, I'll discuss the performance of the read and write syscalls themselves rather than each language's higher-level functions. That way we avoid whatever overhead the language adds, if any.

Other ways to write

Okayyy so I lied. write isn't the only way to write to a file. Turns out you can also use fprintf, fflush, and fsync. (I've seen a SQL implementation use this.)

So what's the difference?

Using fprintf, fflush, and fsync splits writing into three steps, respectively:

  1. fprintf writes your data into an in-process buffer (the stdio buffer)
  2. fflush pushes that buffer into your OS's cache (the page cache)
  3. fsync asks the OS to transfer its cached data to the disk drive and actually write it out (this can involve flushing the drive's entire write cache)

fsync blocks until your disk signals it is done transferring/writing.

This could be useful if you have a lot of modifications you want to make, but you don't want to save them to disk yet. (Maybe you want to make your batch modifications into a giant transaction.)
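
If you want to see that three-step pattern in action, here's a rough Python analogue (Python's buffered file object plays the role of the stdio buffer that fprintf writes into; the filename is made up):

import os

with open("data.txt", "a") as f:
    f.write("a batch of modifications\n")   # step 1: lands in the user-space buffer
    f.flush()                                # step 2: push the buffer into the OS's cache
    os.fsync(f.fileno())                     # step 3: block until the OS/disk say it's on stable storage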

The issue is that now you may have to flush the drive's entire write cache, which could be like 64MB or 128MB! Here is a nice blog with more info.

However, if we use write, we can limit our writes to just the data we are sending. This would make the write faster than our three-step fprintf/fflush/fsync process.
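
For comparison, a bare write through Python's os.write hands exactly our bytes to the OS with no user-space buffering in between (again, the filename is made up):

import os

fd = os.open("data.txt", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
try:
    os.write(fd, b"just the bytes we care about\n")   # one write syscall
finally:
    os.close(fd)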

If you use the three-step process, just keep in mind how much data you may be flushing, aka your disk drive's cache size.

You can find your disk's cache size by looking at the disk's specification sheet.

What kind of disk do I have?

So, if you don't know what disk you have (like I didn't), let's figure it out.

If you type lsblk in your Linux terminal, you might see something that looks like this or similar to this:

NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
sda           8:0    0   1.8T  0 disk
├─sda1        8:1    0  1000M  0 part  /boot/efi
├─sda2        8:2    0   600M  0 part  /boot
└─sda3        8:3    0   1.8T  0 part  /
zram0       252:0    0     8G  0 disk  [SWAP]
nvme1n1     259:0    0   1.8T  0 disk
├─nvme1n1p1 259:1    0  1000M  0 part
│ └─md125     9:125  0 999.9M  0 raid1
├─nvme1n1p2 259:2    0   600M  0 part
│ └─md127     9:127  0   599M  0 raid1
└─nvme1n1p3 259:3    0   1.8T  0 part
  └─md126     9:126  0   1.8T  0 raid1
...

sda is a disk device. There may be an sdb or sdc and so on. If the TYPE of any of these devices says raid (like the md entries above), they are probably part of some kind of hardware or software RAID configuration.

Disks in a mirrored RAID configuration (like the raid1 above) are basically copying each other: if you write to the array, it'll be written to all of them. It's a way to protect your files against a disk dying.

But remember that if you have a RAID configuration, each disk may have different specifications. Your writes and reads are going to be as slow as the slowest one, because the RAID controller has to write to all of them.

Overhead from the RAID controller itself is usually not the bottleneck, but hardware and software RAID can perform slightly differently: a hardware controller has its own dedicated processor, while software RAID has to borrow time from your (possibly busy) CPU.

Each disk may have a different mountpoint. If that is the case, you only care about the disk(s) whose mountpoint contains the file you intend to read/write. You can see this in the MOUNTPOINTS column.

Ok, final thing to note from the command. lsblk can also tell you whether a drive is rotational (the RO column above is actually read-only; add the ROTA column, e.g. lsblk -o NAME,ROTA, and a 1 means rotational). A rotational hard drive (HDD) is mechanical, and as a result HDDs tend to be much slower than flash-based SSDs. The difference can be orders of magnitude for reads/writes, as we'll see later.

Okay... I'll stop stalling. Let's see what disk you have. Just modify the command to lsblk -io NAME,MODEL.

Here is what I get:

NAME        MODEL
sda         PERC H730P Adp
|-sda1
|-sda2
`-sda3
zram0
nvme1n1     Samsung SSD 970 EVO 2TB
|-nvme1n1p1
| `-md125
|-nvme1n1p2
| `-md127
`-nvme1n1p3
  `-md126
...

Now you have to look up that model and find your disk's specifications.

Understanding your disk specs

If I look up PERC H730P Adp, it turns out this is one of Dell's RAID controllers. Here is a snapshot of some of the specs:

PERC H730P Adp Specs

This RAID controller has a huge disk cache of 2GB! And it has a data transfer rate of 12 Gbps. As you can see, it is pretty fast.

If I wanted to load the Bee Movie script (80,000 characters, ~80KB), it would take about 50 microseconds to transfer at that rate!
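
Here's the back-of-the-envelope math for that (12 Gbps is bits per second, so multiply the bytes by 8):

size_bytes = 80_000        # the Bee Movie script, ~80KB
rate_bits  = 12e9          # 12 Gbps transfer rate
print(size_bytes * 8 / rate_bits * 1e6)   # ~53 microseconds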

Note: RAID controllers can sometimes effectively ignore fsync operations. The controller may acknowledge the write as soon as the data is sitting in its cache, and only lazily write it out to the actual disk devices later.

Great, now what about the other disk?

Digging deeper

Let's search up the Samsung SSD 970 EVO 2TB.

Samsung SSD 970 EVO 2TB Specs

Here is what we care about: sequential and random access operations/data transfers. They usually come in units of IOPS (Input/Output Operations Per Second) or bits/bytes per second.

Sequential, as the name implies, means reading or writing one contiguous chunk, like dumping the whole Bee Movie script at once. If I wanted to modify scattered parts of a file, that would be random access. Generally, sequential is faster since the data is physically located close together.

Here we have sequential write at 2500 MB/s, but random write at 480,000 IOPS for a queue depth of 32 (32 writes in flight at the same time). This seems kind of dumb; why are they in two different units?

Also, why are reads faster than writes? How fast is 2500 MB/s???

No need to fear, I'm here to save you.

What are QDs?

QDs are queue depths. Basically, when your disk spec says QD32 or QD1, it refers to having 32 read/write requests (or just 1) queued up and waiting at a time. This is important because disks can often service multiple requests at once, which is why QD32 can be a lot faster than QD1.

If we are writing our Bee Movie Script all at once, we'd be QD1. However, if we use fsync or write multiple times, then we would build a queue of requests.

A nice way to estimate QD1 from QD32 is by taking 10%-20% of its IOPS. If you know a better way, let me know in the comments!
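
Applying that rule of thumb to the 970 EVO's random write number (this is just my rough heuristic, not something from the spec sheet):

qd32_iops = 480_000
print(0.10 * qd32_iops, 0.20 * qd32_iops)   # ~48,000 to ~96,000 IOPS at QD1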

How fast is 2500 MB/s?

You have a Bee Movie script of 80,000 characters. That is 80KB. 80KB / 2500 MB/s is roughly 32 microseconds.
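
Or, as a quick sanity check in Python:

size_bytes = 80_000      # ~80KB
rate_bps   = 2_500e6     # 2500 MB/s sequential write, in bytes per second
print(size_bytes / rate_bps * 1e6)   # ~32 microseconds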

Easy peasy lemon squeasy.

Why are reads faster than writes?

Let's explore how writing/reading disks work at a high level to understand this.

Disks organize their storage in regions called sectors. HDD sectors were originally 512 bytes. Now, sectors tend to be 4096 bytes as hardware has advanced.

The minimum the disk can theoretically read or write at a time is one sector. If I want to read 1 byte of data, the disk has to read the entire sector containing that byte. If I want to write 1 byte, it has to read the entire sector, apply the change, and then write it back (a 2-step read-modify-write!).

Okay, I lied a little again. You can't always write a single sector. Our OSes have file systems. File systems operate with blocks rather than sectors. Multiple sectors make up a block. If I want to modify 1 byte, I'd have to actually modify the entire block.

Blocks typically range from 1KB to 8KB, and they must be at least as large as a disk sector.
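
If you're curious what block size your own filesystem uses, you can ask from Python. The sysfs path below is Linux-specific, and the device name is just an example:

import os

# Block size of the filesystem that '/' lives on (often 4096)
print(os.statvfs("/").f_bsize)

# Physical sector size of a particular disk (Linux only; swap in your device name)
with open("/sys/block/sda/queue/physical_block_size") as f:
    print(f.read().strip())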

PS: Blocks are different from OS pages. Pages in OS are like blocks but for accessing physical RAM.

IOPS vs transfer speed (bytes per second)

Great we went over blocks and sectors!

You probably noticed that the random access specs operate in IOPS. If I want to compare it to sequential reads/writes, I'll have to convert it into bytes per second.

I mentioned that disks operate in sectors. Each input/output operation in these random access specs is over a 4KB chunk, which is the I/O size used for the Samsung SSD 970 EVO 2TB's random numbers.

So if random writes are 480,000 IOPS, that is 480,000 x 4KB per second. This is roughly 2,000 MB/s.

Boom! Random writes are slower than sequential writes. (2000 MB/s < 2500 MB/s).

Randomly writing the Bee Movie Script is roughly 40 microseconds.
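
The same conversion as a quick calculation:

iops    = 480_000          # random write IOPS at QD32
io_size = 4_096            # each random I/O in the spec is a 4KB chunk
rate    = iops * io_size   # roughly 2 GB/s

size_bytes = 80_000        # the Bee Movie script again
print(size_bytes / rate * 1e6)   # ~40 microseconds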

Great! We looked at an SSD. Now, so that you can feel my pain, let's look at an HDD.

Comparing an HDD

Let's pretend we have a RAID setup with that Samsung SSD and an HDD, for example the ST9250610NS. Here are the specs:

ST9250610NS HDD Specs

It looks a bit different, but remember that HDDs are mechanical. Parts have to physically move, and that takes time. We see that a write and a read have an average access time of 8.5 and 9.5 milliseconds, respectively.

This average time is for a single sector. A single sector in this disk is 512 bytes according to the specs.

It also mentions a transfer rate of 115 MB/s. Let's test that. If we have 512 bytes/9.5 ms, we get ~50KB/second.

HUHHH??!!?!?!? That doesn't match 115 MB/s!

That's because this average read/write time includes the seek time and rotational latency, i.e. the time it takes for the mechanical parts to move into position, on top of the actual data transfer. (I suspect sequential reads/writes get much closer to 115 MB/s, since there would be very little seeking.)

Okay, let's do this again.

If I want to write the Bee Movie Script, 80,000 chars/bytes, it would take about 1.6 seconds if we operated at 50KB/second.
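
Same estimate in code, treating every operation as a full random access:

sector_bytes = 512
avg_access_s = 9.5e-3                   # ~9.5 ms per operation
rate = sector_bytes / avg_access_s      # ~54 KB/s

size_bytes = 80_000
print(size_bytes / rate)                # ~1.5 seconds (about 1.6 s if you round the rate down to 50 KB/s)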

LOOK AT THAT! We went from 30-40 microseconds to 1.6 seconds from SSD to HDD! That's a roughly 50,000x latency increase. FEEL THE PAINNNN AHHHHHHHH!

Remember, since we are pretending this is a RAID device, the SSD might complete a write pretty fast, but we would have to wait for the HDD to finish before the array can signal completion.

OH! By the way, this hard drive has a 64MB cache. If you used fsync, your large write may take a long time.

The Conclusion

I hope you felt my pain. jkjk.

But save yourself this pain and predict your read/write latencies.

  1. Find out how many bytes you want to read/write
  2. Find out if you are using write or fsync or read or if there is any overhead
  3. Find out if they are sequential/random
  4. Find out if you have a RAID setup and where the file is mounted
  5. Find out what kind of disk you have and its specs (IOPS/transfer rates)

In the end, the estimation formula is essentially bytes / rate = latency.

For fun, you could try estimating your own read/write speeds and see if your read/write reflects that.
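
If you want a starting point, here's a tiny Python helper that just encodes bytes / rate = latency. The example rates are the spec-sheet numbers from above; plug in your own:

def estimate_latency_s(size_bytes, rate_bytes_per_s):
    # bytes / rate = latency
    return size_bytes / rate_bytes_per_s

bee_movie = 80_000
print(estimate_latency_s(bee_movie, 2_500e6))   # SSD sequential write: ~32 microseconds
print(estimate_latency_s(bee_movie, 50e3))      # HDD random access: ~1.6 seconds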

Caveats

Using a networked file system has its own fun. Maybe I'll come back to this topic another time. There might be more involved than just network latencies. If you know, drop a comment lol.

Okay, I'm done now. Peace!

Bye Bye!
