Problem and Motivation
After years away, I decided to return to my favorite programming language: Java. My main goal was to deepen my understanding of concurrency and threading at the operating system level — especially to see how concepts like goroutines in Go or coroutines in Kotlin work under the hood and how they can be improved.
To start my journey, I picked up the book Operating Systems: Three Easy Pieces. After reading through the key sections, I felt it was time to put these concepts to the test with some practical coding. Instead of building a basic counter (which felt too trivial), I searched for a more realistic and challenging problem.
That’s when I came across the 1 Billion Row Challenge on YouTube. It inspired me to create my own version: counting the occurrence of 200 million words in a text file.
Why This Challenge is Exciting 🤔
There’s an obvious solution to get you started quickly.
The file size—about 1.5GB—isn’t impossible to handle, but it’s large enough to see real performance impacts as the code improves.
You can apply advanced concepts like multithreading and parallel processing.
You’ll get hands-on experience with Java’s
niopackage, Streams, and efficient file reading techniques.
Step 1: The Proof of Concept (POC) 🛠️
Before diving into complex optimizations, the first step is to implement a robust Proof of Concept (POC). This provides a baseline, allowing me to measure performance and verify the accuracy of the output after each iteration of improvement.
The initial implementation strategy is straightforward:
- Read all lines from the file.
- Split strings using the space delimiter.
-
Aggregate word counts using a
HashMap<String, Int>.
Path dir = Path.of("./src");
Path file = dir.resolve("words.txt");
List<String> lines = Files.readAllLines(file);
for (String line : lines) {
calc(line);
}
void calc(String line) {
if (line == null) return;
String[] parts = line.trim().split(" ");
for (String part : parts) {
if (this.wordsMap.containsKey(part) ) {
this.wordsMap.put(part, this.wordsMap.get(part) + 1);
} else {
this.wordsMap.put(part, 1);
}
}
}
in this step I learned : how to work with nio package (File, Path). it is very simple and like NodeJS fs module.
The performance result was:
| Scenario | Time(s) | Memory(Kb) |
|---|---|---|
| readAllLines | 7.6 | 1829121 |
so I decided to use Stream, to learn it and how will affects on performance
try (Stream<String> lines = Files.lines(file)) {
lines.forEach(wordCounter::calc);
}
Stream doesn't load whole of files, therefore consuming memory reduced incredibly.
| Scenario | Time(s) | Memory(Kb) |
|---|---|---|
| Stream | 7.2 | 661 |
Step 2: Scaling with Concurrency: Applying OS Fundamentals đź§µ
Now, it’s time to apply the core concepts from Operating Systems: Three Easy Pieces: Implementing Multithreading.
To improve performance, I decided to partition the workload across a fixed pool of 8 threads. Each thread is responsible for a specific segment of the data, performing its calculations independently before merging the final results. This is a classic “Thread-local” or “Divide and Conquer” strategy.
The “Shared-Nothing” Advantage
In this approach, each thread operates within its own boundaries. By using thread-local aggregation, we achieve several key benefits:
- No Contention: We avoid locking shared memory during the heavy lifting.
- Isolation: Each sub-task is isolated, preventing synchronization overhead.
- CPU Efficiency: By reducing the need for locks, the process becomes truly CPU-bound rather than stalling on memory barriers or I/O waits.
The goal here was to maximize CPU utilization while keeping the implementation clean and scalable. 🚀
// Create 8 Task for 8 Threads
List<WordCounterTask> tasks = new ArrayList<>();
int PARTITION_SIZE = 8; // near to available Thread
for (int i = 0; i < PARTITION_SIZE; i++) {
int from = index * i;
int to = from + index;
if (i == PARTITION_SIZE - 1) {
to = lines.size();
}
List<String> subListLines = lines.subList(from, to);
WordCounterTask task = new WordCounterTask(subListLines);
tasks.add(task);
}
// Starting 8 Thread
for (int i = 0; i < PARTITION_SIZE; i++) {
Thread t = new Thread(tasks.get(i));
threads.add(t);
t.start();
}
for (Thread t : threads) {
t.join();
}
// Aggergate the results in to a HashMap<String, Integer>
HashMap<String, Integer> results = new HashMap<>();
tasks.forEach(task -> {
task.getWordsMap().forEach((key, value) -> {
if(results.containsKey(key)) {
results.put(key, results.get(key) + value);
} else {
results.put(key, value);
}
});
});
and the result was which I expect: reducing the calculation time
| Scenario | Time(s) | Memory(Kb) |
|---|---|---|
| Multi-Thread + Merging result(Avoiding Lock) | 2.315 | 1829121 |
Step 3: Eliminating split() and Regex — Reducing String Overhead 🚀
The next optimization target was string allocation overhead.
In the previous implementations, I relied on String.split() (which internally uses regex) and stored String objects as keys in a HashMap. While convenient, this approach is expensive:
-
split()uses regular expressions, which add unnecessary overhead. - Each word becomes a new
Stringobject. - Each
Stringallocates achar[]internally. - Massive GC pressure is created when processing hundreds of millions of words.
For a 1.5GB file with ~200 million words, this becomes a serious bottleneck.
Before
for (String line : this.lines) {
String[] parts = line.trim().split(" ");// Heavy operation in hot path loop
for (String part : parts) {
if (this.wordsMap.containsKey(part)) {
this.wordsMap.put(part, this.wordsMap.get(part) + 1);
} else {
this.wordsMap.put(part, 1);
}
}
}
After
StringBuilder key = new StringBuilder();
for (int i = 0; i < line.length(); i++) {
// O(n)
if (line.charAt(i) == ' ' && !key.isEmpty()) {
updateWords(key.toString());
key.setLength(0);
continue;
}
key.append(line.charAt(i));
}
if (!key.isEmpty()) {
updateWords(key.toString());
}
void updateWords(String key) {
// in the first solution I used containsKey + get
// it is 2 lookups -> 1 for containsKey and 1 for get --> slow
// The fasted way is as below Integer value = wordsMap.get(key);
if (value != null) {
this.wordsMap.put(key, value + 1);
} else {
this.wordsMap.put(key, 1);
}
}
and the result:
| Scenario | Time(s) |
|---|---|
| string + 8 thread | 2.682 |
| stringBuilder + no split() + 8 thread | 1.4 (1.4 ~ 1.7) |
Step 4: Going String-less — Optimizing for the Hardware 🏎️
Even StringBuilder—while efficient—still contributes to object churn. To truly unlock performance, I set a new goal: a zero-string allocation solution.
This journey pushed me into the weeds of systems architecture, forcing me to grapple with:
- Bitwise Operations: Understanding how data is stored at the lowest level.
- SIMD & SWAR (SIMD Within A Register): Leveraging CPU-level parallelism.
- Cache Locality: Optimizing for L1/L2 cache hits to keep the CPU fed.
- Instruction Pipelines: How the CPU executes code and why “branchy” code kills performance.
The realization: In performance-critical Java, if you are creating objects, you are losing.
Shifting from Strings to Bytes ⚡
Instead of relying on the high-level String abstraction, I pivoted to processing raw byte[] buffers.
By treating the input as a stream of bytes, I can:
- Reduce GC Pressure: No more transient object allocations for every word.
-
Use Memory Directly: Accessing bytes directly from a
MappedByteBufferallows for zero-copy-like mechanics. - Optimize Parsing: Identifying delimiters (like spaces or newlines) becomes a simple comparison of byte values rather than a heavy regex invocation.
byte[] bufferRef = line.getBytes()
Each word is represented by a range in the underlying buffer. For example, the word “apple” may occupy bytes from index 0 to 5. Since I’m using a thread-local approach, I can build a hash directly from the buffer reference, the word’s starting offset, and its length.
With this approach, I no longer need to convert the bytes into a String or compare characters one by one. Instead, I compare raw bytes, which is significantly cheaper. I also generate a long hash value, which is much more efficient than using a String as the key.
class Word {
public byte[] bufferRef;
public int start;
public int end;
private final int hash;
public int length;
public Word(byte[] buffer, int start, int end) {
//......
}
static int computeHash(byte[] buffer, int start, int end) {
int hash = 1;
for (int i = start; i < end; i++) {
hash = 31 * hash + buffer[i];
}
return hash;
}
@Override
public int hashCode() {
return hash;
}
@Override
public boolean equals(Object obj) {
if (obj == null) return false;
if (obj == this) return true;
if (obj.getClass() != this.getClass()) return false;
Word other = (Word) obj;
if (this.length != other.length) return false;
for (int i = 0; i < this.length; i++) {
if (this.bufferRef[i + start] != other.bufferRef[i + other.start]) {
return false;
}
}
return true;
}
@Override
public String toString() {
StringBuilder sb = new StringBuilder();
for (int i = start; i < end; i++) {
sb.append((char)bufferRef[i]);
}
return sb.toString();
}
}
and the calculation method changed to:
byte[] bufferRef = line.getBytes();
int start = 0;
for (int i = 0; i < bufferRef.length; i++) {
if (bufferRef[i] == 0x20) {
Word word = new Word(bufferRef, start, i);
wordsMap.merge(word, 1, Integer::sum);
start = i+1;
}
}
if ((bufferRef.length) - start > 0) {
Word word = new Word(bufferRef, start, bufferRef.length);
wordsMap.merge(word, 1, Integer::sum);
}
As you can see the sting operation is gone, and results:
| Scenario | Time(s) |
|---|---|
| string less (bytes + offset + length) as hash | 1.1~1.2 |
Amazing 🤩
Step 5: Optimizing Memory Usage with MappedByteBuffer đź§©
One major challenge in handling huge files is keeping memory consumption under control.
What Is MappedByteBuffer?
A MappedByteBuffer is a specialized subclass of ByteBuffer. It allows you to map a region of a file directly into the operating system’s virtual memory space—enabling zero-copy file access and letting the OS handle paging behind the scenes. This dramatically reduces heap usage in your Java application.
Memory-Efficient File Splitting
Instead of splitting the file by lines, I segmented the file based on its byte size. For example, with 8 threads, the file is divided into 8 regions (one per thread). Each thread processes its own file segment independently using a dedicated MappedByteBuffer.
However, segmenting by size introduces a practical problem: lines may be split across chunks. To ensure data integrity, I expanded each chunk slightly as needed so that no line was cut in half. While that means the buckets might not be perfectly equal in size, each thread still gets a clean, processable region of data.
try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
long fileSize = channel.size(); // ~1.5GB
long chunks = fileSize / PARTITION_SIZE;
long from = 0;
for (int i = 0; i < PARTITION_SIZE; i++) {
long to = findNewLinePosition(channel, from + chunks);
if (i == PARTITION_SIZE - 1) {
to = fileSize;
}
System.out.println(from + " -> " + to);
long chunkSize = to - from;
MappedByteBuffer buffer = channel.map(
FileChannel.MapMode.READ_ONLY, from, chunkSize
);
tasks.add(new OptimizedWordCounterTask3(buffer, chunkSize));
from = to;
}
}
And calculation method:
int start = 0;
for (int i = 0; i < this.size; i++) {
byte b = this.mbb.get(i); // Stream buffers
if (b == 0x20 || b == '\n') {
if (i > start) {
WordMappedByteBuffer word = new WordMappedByteBuffer(this.mbb, start, i);
wordsMap.merge(word, 1, Integer::sum);
}
start = i + 1;
}
}
if (this.size - start > 0) {
WordMappedByteBuffer word = new WordMappedByteBuffer(this.mbb, start, (int)size);
wordsMap.merge(word, 1, Integer::sum);
}
The result for memory consumption is huge:
| Scenario | Time(s) | Memory(KB) |
|---|---|---|
| MappedBufferByte | 1.1~1.2 | 900~920 |
Conclusion
My tests were on this Machine: Macbook Pro M4, CPU: 10 core, RAM: 16GB
and the results are:
| Scenario | Time(s) | Memory(KB) |
|---|---|---|
| readAllLines | 7.6 | 1829121 |
| string + 8 thread | 2.682 | 1829121 |
| stringBuilder + 8 thread | 1.4 (1.4 ~ 1.7) | 1829121 |
| string less (bytes + offset + length) as hash | 1.1~1.2 | 1829121 |
| MappedBufferByte | 1.1~1.2 | 900~920 |
What I Really Gained From This Practice 🎓
The most valuable result is what I learned along the way.
Through this single exercise, I deepened my understanding of:
- Concurrency and multithreading – how to design thread-local workloads, avoid contention, and think in terms of CPU cores instead of just “loops.”
-
Java NIO – using
FileChannel,ByteBuffer, and especiallyMappedByteBufferto work efficiently with large files. - Low-level performance concepts – bytes vs strings, bitwise operations, SIMD/SWAR, and how modern CPUs actually execute and optimize instructions.
- Cache behavior (L1/L2) – why memory layout and access patterns matter as much as the algorithm itself.
- Senior-engineer mindset – how experienced engineers approach high-performance problems: measure first, build a baseline, remove abstractions layer by layer, and align the code with how the hardware works.
All of this came from just one practice problem. The performance gains were great—but the learning was even better. 🚀
Next
I don't understand the SWAR concept yet and how can to implement it directly. so, the next steps for me are:
- Learning SWAR
- Learning about Vector API in Java (Real SIMD)
- How these concept can improve the performance.
Please share your idea with me!
Also the Github repo for this project is here, you can read it and learn from it, also contribute in it with your solutions.
Find Me Here
LinkedIn: https://www.linkedin.com/in/eliasmohammadi
Github: https://www.github.com/eliasmohammadi
Devto: https://dev.to/elias_mohammadi
Top comments (0)