Axel N'cho

In-depth file management in Python: Underlying tooling and advanced functionalities

File management in Python extends beyond simple read/write operations. Understanding the underlying memory-disk interactions and system calls, together with advanced features like buffering, locking, and memory mapping, is essential for optimizing performance and avoiding pitfalls.

1. File handling and system calls

When you use Python’s open() function, the interpreter interacts with the OS through system calls like open(), read(), write(), and close(). These system calls interface with the kernel to manage file I/O efficiently.

For example, the following Python code:

with open("example.txt", "w") as file:
    file.write("Hello, World!")

translates to (on Unix systems):

  1. open("example.txt", O_WRONLY | O_CREAT | O_TRUNC, 0666) - Opens or creates the file.
  2. write(fd, "Hello, World!", 13) – Writes 13 bytes to the file.
  3. close(fd) – Closes the file descriptor.

On Windows, the equivalent calls (CreateFile, WriteFile, CloseHandle) differ slightly.
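You can observe this mapping directly with the thin wrappers in the os module, which issue the same system calls without Python’s buffering layer:

import os

# Low-level equivalent of the buffered example above, working on a raw file descriptor
fd = os.open("example.txt", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o666)
os.write(fd, b"Hello, World!")  # writes 13 bytes through the write() system call
os.close(fd)                    # releases the file descriptor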

2. Buffering, cache layers, and the role of flush()

Buffering and cache layers

File I/O in Python involves multiple layers of caching:

  1. Application-level buffer (Python’s internal buffer): stores data before passing it to the OS.
  2. OS buffer (Page cache): temporarily holds data in memory before writing to disk.
  3. Disk cache (Drive controller): hardware-level caching before physical storage.

Python's flush() and fsync() interact differently with these layers.
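You can control the application-level buffer directly through the buffering parameter of open(); a quick sketch:

# Tuning Python's application-level buffer via open()'s buffering parameter
unbuffered = open("raw.bin", "wb", buffering=0)    # binary mode only: each write() goes straight to the OS
line_buffered = open("log.txt", "w", buffering=1)  # text mode only: the buffer is flushed at every newline
block_buffered = open("data.txt", "w")             # default: fixed-size block buffering

line_buffered.write("flushed as soon as the line ends\n")

for f in (unbuffered, line_buffered, block_buffered):
    f.close()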

The role of flush()

The flush() method forces Python to move data from its internal buffer to the OS buffer, but it does not guarantee immediate disk persistence.

file = open("example.txt", "w")
file.write("Important data")
file.flush()  # Transfers data to OS buffer

flush() vs fsync()

  • flush(): moves data from Python’s internal buffer to the OS buffer but doesn’t force a physical disk write.
  • os.fsync(fd): ensures data is written from the OS buffer to disk, reducing crash risks (works on Unix and Windows).

Example ensuring persistence:

import os
file = open("example.txt", "w")
file.write("Critical log entry")
file.flush()
os.fsync(file.fileno())  # Ensures data is physically written to disk
file.close()

Multi-threading and multi-processing considerations

When multiple threads or processes write to the same file, buffer inconsistencies may occur. Without proper handling, some writes may be lost or interleaved incorrectly.

Scenario: multiple processes writing to the same file

import os
from multiprocessing import Process

def write_data():
    with open("shared.txt", "a") as file:
        file.write(f"Process {os.getpid()} writing...\n")  # newline keeps entries from running together
        file.flush()
        os.fsync(file.fileno())  # Ensures data persistence

if __name__ == "__main__":
    processes = [Process(target=write_data) for _ in range(5)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

In this case, flush() ensures immediate buffer transfer to the OS, and fsync() guarantees that all processes persist their changes.

Handling cache conflicts

If multiple processes access the same file, OS-level caching can cause inconsistencies: without synchronization between the cache layers, different processes might see outdated data unless proper mechanisms are used.

Recall that flush() moves data from Python’s internal buffer to the OS buffer but does not guarantee a physical write to disk. If multiple processes write to the same file and only flush() is called, without fsync(), the data remains in the OS buffer and could be lost if the system crashes before the OS writes it out.

However, when performance is critical and occasional data loss is acceptable (e.g., logging systems or temporary file writes), and the processes only work with the in-memory version of the file during long processing, calling only flush() reduces I/O overhead and improves efficiency by letting the OS handle disk writes asynchronously.
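A minimal sketch of that trade-off, using a hypothetical append-only logger (the class name and file path are illustrative) where losing the last few entries in a crash is tolerable:

class FastLogger:
    """Hypothetical flush-only logger: fast, but the tail of the log may be lost in a crash."""
    def __init__(self, path):
        self.file = open(path, "a")

    def log(self, message):
        self.file.write(message + "\n")
        self.file.flush()  # hand off to the OS buffer; no fsync, so no forced disk write

    def close(self):
        self.file.close()

logger = FastLogger("app.log")
logger.log("request handled")
logger.close()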

3. File descriptors

Every open file is associated with a file descriptor, an integer representing an open file at the OS level on Unix. Windows instead uses an object called a handle in much the same way, though its C runtime exposes file-descriptor-like integers on top of handles, which is why file.fileno() also works there. On Unix, the default per-process file descriptor limit is typically 1024 (adjustable with ulimit); as far as I know, there is no comparably low limit on Windows.

with open("example.txt", "w") as file:
    print(file.fileno())  # Outputs the file descriptor number

This descriptor is used in system calls like fsync().
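On Unix, you can inspect (and, up to the hard limit, raise) the per-process descriptor limit with the resource module; a quick sketch:

import resource  # Unix only

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"Soft limit: {soft}, hard limit: {hard}")  # the soft limit is often 1024 by default

# Raise the soft limit (capped at the hard limit) for descriptor-hungry programs
resource.setrlimit(resource.RLIMIT_NOFILE, (min(4096, hard), hard))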

4. File locking mechanisms

When working with shared files, locking prevents race conditions.

  • Advisory locks: fcntl.lockf() and fcntl.flock() (Unix only): for processes that cooperate "peacefully". The kernel keeps track of the locks but doesn't enforce them - it's up to the applications to honor them, which spares the kernel from having to resolve situations like deadlocks.
  • Mandatory locks: msvcrt.locking() (Windows): kernel-enforced file locking. (Linux's mandatory locking via fcntl() required a special mount option and has been deprecated and removed in recent kernels.)

Example of advisory locking (Unix):

import fcntl
with open("shared.txt", "w") as file:
    fcntl.flock(file, fcntl.LOCK_EX)  # Exclusive lock
    file.write("Locked access\n")
    fcntl.flock(file, fcntl.LOCK_UN)  # Release lock

Locks are essential in multi-threaded or multi-process environments.
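On Windows, the counterpart is msvcrt.locking(), which locks a byte range starting at the current file position. A minimal sketch, using the first byte of the file as the lock region by convention:

import msvcrt  # Windows only

with open("shared.txt", "a") as file:
    file.seek(0)
    msvcrt.locking(file.fileno(), msvcrt.LK_LOCK, 1)  # lock 1 byte at offset 0 (retries, then raises OSError)
    try:
        file.write("Locked access\n")  # append mode: the write still goes to the end of the file
        file.flush()
    finally:
        file.seek(0)  # unlocking must target the same region that was locked
        msvcrt.locking(file.fileno(), msvcrt.LK_UNLCK, 1)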

5. Memory-mapped files (mmap)

For efficient file access, mmap allows mapping a file’s content directly into memory (Unix and Windows).

Memory-mapped file objects behave like both bytearray objects and file objects. You can use mmap objects in most places where bytearray objects are expected; for example, you can use the re module to search through a memory-mapped file. You can also change a single byte with obj[index] = 97, or change a subsequence by assigning to a slice: obj[i1:i2] = b'...'. You can also read and write data starting at the current file position, and seek() through the file to different positions.
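A minimal sketch (Unix and Windows), assuming a small binary file named example.bin:

import mmap

# Create a sample file to map
with open("example.bin", "wb") as f:
    f.write(b"Hello, memory-mapped world!")

with open("example.bin", "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as mm:  # length 0 maps the whole file
        print(mm[:5])              # slice it like a bytearray -> b'Hello'
        mm[0:5] = b"HELLO"         # in-place edit; the replacement must keep the same length
        print(mm.find(b"world"))   # search without issuing read() calls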

Benefits:

  • Speeds up large file access by avoiding repeated I/O calls.
  • Enables direct memory access without copying buffers.

6. Asynchronous file I/O

For non-blocking file operations, use asyncio together with the third-party aiofiles package (cross-platform; install it with pip install aiofiles):

import asyncio
import aiofiles

async def read_file():
    async with aiofiles.open("example.txt", "r") as f:
        content = await f.read()
        print(content)

asyncio.run(read_file())

Async file handling is useful in high-performance applications.
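If you'd rather avoid a third-party dependency, a similar effect can be approximated by off-loading the blocking call to a worker thread with asyncio.to_thread() (Python 3.9+); a minimal sketch:

import asyncio

def blocking_read(path):
    with open(path, "r") as f:
        return f.read()

async def main():
    # The blocking read runs in a worker thread, so the event loop stays responsive
    content = await asyncio.to_thread(blocking_read, "example.txt")
    print(content)

asyncio.run(main())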

7. Working with temporary files

Use tempfile to create temporary files (cross-platform):

import tempfile
with tempfile.NamedTemporaryFile() as temp:
    temp.write(b"Temporary content")
    print(f"Temp file created at: {temp.name}")

Temporary files are auto-deleted upon closure unless the constructor's delete parameter is set to False.
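A short sketch of keeping the file past closure with delete=False (manual cleanup then becomes your responsibility):

import os
import tempfile

temp = tempfile.NamedTemporaryFile(delete=False)  # survives closure
temp.write(b"Persisted content")
temp.close()
print(os.path.exists(temp.name))  # True: the file is still on disk
os.remove(temp.name)              # clean up manually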

8. Avoiding disk fragmentation

Disk fragmentation occurs when files or pieces of files get scattered throughout your disks. Not only do hard disks get fragmented, but removable storage can also become fragmented. This can cause poor disk performance and overall system degradation.

You can preallocate disk space for a file and then write it sequentially without changing its size, which reduces the risk of disk fragmentation. I found an answer on Stack Overflow addressing the matter. Briefly, it's possible with the msvcrt module on Windows, but apparently not on Unix: according to that answer, os.posix_fallocate() sets the file's apparent length to the length you give it but may allocate it as a sparse extent on disk, so writing to multiple files simultaneously can still result in fragmentation. Ironic, isn't it, that Windows has a file management feature that POSIX lacks?
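For reference, this is how the Unix call is issued (with the caveat above that the filesystem decides how the space is physically laid out); a minimal sketch:

import os

# Ask the filesystem to reserve 10 MiB for the file up front (Unix only)
fd = os.open("preallocated.bin", os.O_WRONLY | os.O_CREAT, 0o644)
os.posix_fallocate(fd, 0, 10 * 1024 * 1024)  # offset 0, length 10 MiB
os.close(fd)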

9. Best practices for efficient file management

  • Use buffered I/O wisely: avoid excessive flushing (flush()/fsync()) unless necessary.
  • Leverage memory-mapped files: optimize access for large files.
  • Use locking in concurrent environments: prevent race conditions.
  • Prefer asynchronous file handling: improve performance in I/O-bound tasks.
  • Use context managers (with ... statement): ensure proper resource cleanup.

Conclusion

Beyond basic file operations, Python provides deep interaction with system-level file handling through buffering, memory-mapped files, asynchronous I/O, and file locking. Understanding these mechanisms helps in writing efficient, robust, and scalable file management code.
