elias mohammadi

Posted on May 19

From Strings to Silicon: How I Optimized a Java Word Count by 7X 🚀

#java #performance #learning #software

Problem and Motivation

After years away, I decided to return to my favorite programming language: Java. My main goal was to deepen my understanding of concurrency and threading at the operating system level — especially to see how concepts like goroutines in Go or coroutines in Kotlin work under the hood and how they can be improved.

To start my journey, I picked up the book Operating Systems: Three Easy Pieces. After reading through the key sections, I felt it was time to put these concepts to the test with some practical coding. Instead of building a basic counter (which felt too trivial), I searched for a more realistic and challenging problem.

That’s when I came across the 1 Billion Row Challenge on YouTube. It inspired me to create my own version: counting the occurrence of 200 million words in a text file.

Why This Challenge is Exciting 🤔

There’s an obvious solution to get you started quickly.
The file size—about 1.5GB—isn’t impossible to handle, but it’s large enough to see real performance impacts as the code improves.
You can apply advanced concepts like multithreading and parallel processing.
You’ll get hands-on experience with Java’s nio package, Streams, and efficient file reading techniques.

Step 1: The Proof of Concept (POC) 🛠️

Before diving into complex optimizations, the first step is to implement a robust Proof of Concept (POC). This provides a baseline, allowing me to measure performance and verify the accuracy of the output after each iteration of improvement.

The initial implementation strategy is straightforward:

Read all lines from the file.
Split strings using the space delimiter.
Aggregate word counts using a HashMap<String, Int>.

    Path dir = Path.of("./src");  
    Path file = dir.resolve("words.txt");
    List<String> lines = Files.readAllLines(file);

    for (String line : lines) {  
            calc(line);  
        }

void calc(String line) {  
    if (line == null) return;  
    String[] parts = line.trim().split(" ");  
    for (String part : parts) {  
        if (this.wordsMap.containsKey(part) ) {  
            this.wordsMap.put(part, this.wordsMap.get(part) + 1);  
        } else {  
            this.wordsMap.put(part, 1);  
        }  
    }  
}

in this step I learned : how to work with nio package (File, Path). it is very simple and like NodeJS fs module.

The performance result was:

Scenario	Time(s)	Memory(Kb)
readAllLines	7.6	1829121

so I decided to use Stream, to learn it and how will affects on performance

try (Stream<String> lines = Files.lines(file)) {  
           lines.forEach(wordCounter::calc);  
}

Stream doesn't load whole of files, therefore consuming memory reduced incredibly.

Scenario	Time(s)	Memory(Kb)
Stream	7.2	661

Step 2: Scaling with Concurrency: Applying OS Fundamentals 🧵

Now, it’s time to apply the core concepts from Operating Systems: Three Easy Pieces: Implementing Multithreading.

To improve performance, I decided to partition the workload across a fixed pool of 8 threads. Each thread is responsible for a specific segment of the data, performing its calculations independently before merging the final results. This is a classic “Thread-local” or “Divide and Conquer” strategy.

The “Shared-Nothing” Advantage

In this approach, each thread operates within its own boundaries. By using thread-local aggregation, we achieve several key benefits:

No Contention: We avoid locking shared memory during the heavy lifting.
Isolation: Each sub-task is isolated, preventing synchronization overhead.
CPU Efficiency: By reducing the need for locks, the process becomes truly CPU-bound rather than stalling on memory barriers or I/O waits.

The goal here was to maximize CPU utilization while keeping the implementation clean and scalable. 🚀

// Create 8 Task for 8 Threads
List<WordCounterTask> tasks = new ArrayList<>();  
int PARTITION_SIZE = 8; // near to available Thread  
for (int i = 0; i < PARTITION_SIZE; i++) {  

    int from = index * i;  
    int to = from + index;  
    if (i == PARTITION_SIZE - 1) {  
        to = lines.size();  
    }  
    List<String> subListLines = lines.subList(from, to);  
    WordCounterTask task = new WordCounterTask(subListLines);  
    tasks.add(task);  
}

// Starting 8 Thread
for (int i = 0; i < PARTITION_SIZE; i++) {  
    Thread t = new Thread(tasks.get(i));  
    threads.add(t);  
    t.start();  
}  
for (Thread t : threads) {  
    t.join();  
}

// Aggergate the results in to a HashMap<String, Integer>
HashMap<String, Integer> results = new HashMap<>();  
tasks.forEach(task -> {  
    task.getWordsMap().forEach((key, value) -> {  
        if(results.containsKey(key)) {  
            results.put(key, results.get(key) + value);  
        } else {  
            results.put(key, value);  
        }  
    });  
});

and the result was which I expect: reducing the calculation time

Scenario	Time(s)	Memory(Kb)
Multi-Thread + Merging result(Avoiding Lock)	2.315	1829121

Step 3: Eliminating `split()` and Regex — Reducing String Overhead 🚀

The next optimization target was string allocation overhead.

In the previous implementations, I relied on String.split() (which internally uses regex) and stored String objects as keys in a HashMap. While convenient, this approach is expensive:

split() uses regular expressions, which add unnecessary overhead.
Each word becomes a new String object.
Each String allocates a char[] internally.
Massive GC pressure is created when processing hundreds of millions of words.

For a 1.5GB file with ~200 million words, this becomes a serious bottleneck.

Before

for (String line : this.lines) {  
    String[] parts = line.trim().split(" ");// Heavy operation in hot path loop
    for (String part : parts) {  
        if (this.wordsMap.containsKey(part)) {  
            this.wordsMap.put(part, this.wordsMap.get(part) + 1);  
        } else {  
            this.wordsMap.put(part, 1);  
        }  
    }  
}

After

StringBuilder key = new StringBuilder();  
for (int i = 0; i < line.length(); i++) {  
    // O(n)  
    if (line.charAt(i) == ' ' && !key.isEmpty()) {  
        updateWords(key.toString());  
        key.setLength(0);  
        continue;  
    }  
    key.append(line.charAt(i));  
}  
if (!key.isEmpty()) {  
    updateWords(key.toString());  
}

void updateWords(String key) {  
    // in the first solution I used containsKey + get  
    // it is 2 lookups -> 1 for containsKey and 1 for get --> slow    
    // The fasted way is as below    Integer value = wordsMap.get(key);  
    if (value != null) {  
        this.wordsMap.put(key, value + 1);  
    } else {  
        this.wordsMap.put(key, 1);  
    }  
}

and the result:

Scenario	Time(s)
string + 8 thread	2.682
stringBuilder + no split() + 8 thread	1.4 (1.4 ~ 1.7)

Step 4: Going String-less — Optimizing for the Hardware 🏎️

Even StringBuilder—while efficient—still contributes to object churn. To truly unlock performance, I set a new goal: a zero-string allocation solution.

This journey pushed me into the weeds of systems architecture, forcing me to grapple with:

Bitwise Operations: Understanding how data is stored at the lowest level.
SIMD & SWAR (SIMD Within A Register): Leveraging CPU-level parallelism.
Cache Locality: Optimizing for L1/L2 cache hits to keep the CPU fed.
Instruction Pipelines: How the CPU executes code and why “branchy” code kills performance.

The realization: In performance-critical Java, if you are creating objects, you are losing.

Shifting from Strings to Bytes ⚡

Instead of relying on the high-level String abstraction, I pivoted to processing raw byte[] buffers.

By treating the input as a stream of bytes, I can:

Reduce GC Pressure: No more transient object allocations for every word.
Use Memory Directly: Accessing bytes directly from a MappedByteBuffer allows for zero-copy-like mechanics.
Optimize Parsing: Identifying delimiters (like spaces or newlines) becomes a simple comparison of byte values rather than a heavy regex invocation.

byte[] bufferRef = line.getBytes()

Each word is represented by a range in the underlying buffer. For example, the word “apple” may occupy bytes from index 0 to 5. Since I’m using a thread-local approach, I can build a hash directly from the buffer reference, the word’s starting offset, and its length.

With this approach, I no longer need to convert the bytes into a String or compare characters one by one. Instead, I compare raw bytes, which is significantly cheaper. I also generate a long hash value, which is much more efficient than using a String as the key.

class Word {  
    public byte[] bufferRef;  
    public int start;  
    public int end;  
    private final int hash;  
    public int length;  
    public Word(byte[] buffer, int start, int end) {  
       //......
    }  

    static int computeHash(byte[] buffer, int start, int end) {  
        int hash = 1;  
        for (int i = start; i < end; i++) {  
            hash = 31 * hash + buffer[i];  
        }  
        return hash;  
    }  

    @Override  
    public int hashCode() {  
        return hash;  
    }  

    @Override  
    public boolean equals(Object obj) {  
        if (obj == null) return false;  
        if (obj == this) return true;  
        if (obj.getClass() != this.getClass()) return false;  
        Word other = (Word) obj;  
        if (this.length != other.length) return false;  
        for (int i = 0; i < this.length; i++) {  
            if (this.bufferRef[i + start] != other.bufferRef[i + other.start]) {
                return false;
            }  
        }  
        return true;  
    }  

    @Override  
    public String toString() {  
        StringBuilder sb = new StringBuilder();  
        for (int i = start; i < end; i++) {  
            sb.append((char)bufferRef[i]);  
        }  
        return sb.toString();  
    }  
}

and the calculation method changed to:

byte[] bufferRef = line.getBytes();  
int start = 0;  
for (int i = 0; i < bufferRef.length; i++) {  
    if (bufferRef[i] == 0x20) {  
        Word word = new Word(bufferRef, start, i);  
        wordsMap.merge(word, 1, Integer::sum);  
        start = i+1;  
    }  
}  
if ((bufferRef.length) - start > 0) {  
    Word word = new Word(bufferRef, start, bufferRef.length);  
    wordsMap.merge(word, 1, Integer::sum);  
}

As you can see the sting operation is gone, and results:

Scenario	Time(s)
string less (bytes + offset + length) as hash	1.1~1.2

Amazing 🤩

Step 5: Optimizing Memory Usage with MappedByteBuffer 🧩

One major challenge in handling huge files is keeping memory consumption under control.

What Is `MappedByteBuffer`?

A MappedByteBuffer is a specialized subclass of ByteBuffer. It allows you to map a region of a file directly into the operating system’s virtual memory space—enabling zero-copy file access and letting the OS handle paging behind the scenes. This dramatically reduces heap usage in your Java application.

Memory-Efficient File Splitting

Instead of splitting the file by lines, I segmented the file based on its byte size. For example, with 8 threads, the file is divided into 8 regions (one per thread). Each thread processes its own file segment independently using a dedicated MappedByteBuffer.

However, segmenting by size introduces a practical problem: lines may be split across chunks. To ensure data integrity, I expanded each chunk slightly as needed so that no line was cut in half. While that means the buckets might not be perfectly equal in size, each thread still gets a clean, processable region of data.

try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {  

    long fileSize = channel.size(); // ~1.5GB  
    long chunks = fileSize / PARTITION_SIZE;  
    long from = 0;  
    for (int i = 0; i < PARTITION_SIZE; i++) {  
        long to = findNewLinePosition(channel, from + chunks);  
        if (i == PARTITION_SIZE - 1) {  
            to = fileSize;  
        }  

        System.out.println(from + " -> " + to);  
        long chunkSize = to - from;  
        MappedByteBuffer buffer = channel.map(  
                FileChannel.MapMode.READ_ONLY, from, chunkSize  
        );  

        tasks.add(new OptimizedWordCounterTask3(buffer, chunkSize));  
        from = to;  
    }  
}

And calculation method:

    int start = 0;  
    for (int i = 0; i < this.size; i++) {  
        byte b = this.mbb.get(i);  // Stream buffers
        if (b == 0x20 || b == '\n') {  
            if (i > start) {  
                WordMappedByteBuffer word = new WordMappedByteBuffer(this.mbb, start, i);  
                wordsMap.merge(word, 1, Integer::sum);  

            }  
            start = i + 1;  
        }  
    }  

    if (this.size - start > 0) {  
        WordMappedByteBuffer word = new WordMappedByteBuffer(this.mbb, start, (int)size);  
        wordsMap.merge(word, 1, Integer::sum);  
    }

The result for memory consumption is huge:

Scenario	Time(s)	Memory(KB)
MappedBufferByte	1.1~1.2	900~920

Conclusion

My tests were on this Machine: Macbook Pro M4, CPU: 10 core, RAM: 16GB
and the results are:

Scenario	Time(s)	Memory(KB)
readAllLines	7.6	1829121
string + 8 thread	2.682	1829121
stringBuilder + 8 thread	1.4 (1.4 ~ 1.7)	1829121
string less (bytes + offset + length) as hash	1.1~1.2	1829121
MappedBufferByte	1.1~1.2	900~920

What I Really Gained From This Practice 🎓

The most valuable result is what I learned along the way.

Through this single exercise, I deepened my understanding of:

Concurrency and multithreading – how to design thread-local workloads, avoid contention, and think in terms of CPU cores instead of just “loops.”
Java NIO – using FileChannel, ByteBuffer, and especially MappedByteBuffer to work efficiently with large files.
Low-level performance concepts – bytes vs strings, bitwise operations, SIMD/SWAR, and how modern CPUs actually execute and optimize instructions.
Cache behavior (L1/L2) – why memory layout and access patterns matter as much as the algorithm itself.
Senior-engineer mindset – how experienced engineers approach high-performance problems: measure first, build a baseline, remove abstractions layer by layer, and align the code with how the hardware works.

All of this came from just one practice problem. The performance gains were great—but the learning was even better. 🚀

I don't understand the SWAR concept yet and how can to implement it directly. so, the next steps for me are:

Learning SWAR
Learning about Vector API in Java (Real SIMD)
How these concept can improve the performance.

Please share your idea with me!

Also the Github repo for this project is here, you can read it and learn from it, also contribute in it with your solutions.

Find Me Here

LinkedIn: https://www.linkedin.com/in/eliasmohammadi
Github: https://www.github.com/eliasmohammadi
Devto: https://dev.to/elias_mohammadi

DEV Community

From Strings to Silicon: How I Optimized a Java Word Count by 7X 🚀

Problem and Motivation

Why This Challenge is Exciting 🤔

Step 1: The Proof of Concept (POC) 🛠️

Step 2: Scaling with Concurrency: Applying OS Fundamentals 🧵

The “Shared-Nothing” Advantage

Step 3: Eliminating `split()` and Regex — Reducing String Overhead 🚀

Step 4: Going String-less — Optimizing for the Hardware 🏎️

Shifting from Strings to Bytes ⚡

Step 5: Optimizing Memory Usage with MappedByteBuffer 🧩

What Is `MappedByteBuffer`?

Memory-Efficient File Splitting

Conclusion

What I Really Gained From This Practice 🎓

Next

Find Me Here

Top comments (0)

Problem and Motivation

Why This Challenge is Exciting 🤔

Step 1: The Proof of Concept (POC) 🛠️

Step 2: Scaling with Concurrency: Applying OS Fundamentals 🧵

The “Shared-Nothing” Advantage

Step 3: Eliminating split() and Regex — Reducing String Overhead 🚀

Step 4: Going String-less — Optimizing for the Hardware 🏎️

Shifting from Strings to Bytes ⚡

Step 5: Optimizing Memory Usage with MappedByteBuffer 🧩

What Is MappedByteBuffer?

Memory-Efficient File Splitting

Conclusion

What I Really Gained From This Practice 🎓

Next

Find Me Here

Step 3: Eliminating `split()` and Regex — Reducing String Overhead 🚀

What Is `MappedByteBuffer`?