<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mehran</title>
    <description>The latest articles on DEV Community by Mehran (@mehrant).</description>
    <link>https://dev.to/mehrant</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2990366%2F4dd12fc4-fd22-4b56-86dd-6da3bc77664d.jpg</url>
      <title>DEV Community: Mehran</title>
      <link>https://dev.to/mehrant</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mehrant"/>
    <language>en</language>
    <item>
      <title>Check this out if you haven't already :)</title>
      <dc:creator>Mehran</dc:creator>
      <pubDate>Thu, 04 Sep 2025 08:48:04 +0000</pubDate>
      <link>https://dev.to/mehrant/-38fe</link>
      <guid>https://dev.to/mehrant/-38fe</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/mehrant" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2990366%2F4dd12fc4-fd22-4b56-86dd-6da3bc77664d.jpg" alt="mehrant"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/mehrant/learning-rust-by-building-a-high-performance-key-value-database-a-c-developers-honest-take-2jam" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Learning Rust by Building a High-Performance Key-Value Database: A C Developer's Honest Take&lt;/h2&gt;
      &lt;h3&gt;Mehran ・ Sep 2&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#rust&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#database&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#performance&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#c&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>rust</category>
      <category>database</category>
      <category>performance</category>
      <category>c</category>
    </item>
    <item>
      <title>Learning Rust by Building a High-Performance Key-Value Database: A C Developer's Honest Take</title>
      <dc:creator>Mehran</dc:creator>
      <pubDate>Tue, 02 Sep 2025 12:23:32 +0000</pubDate>
      <link>https://dev.to/mehrant/learning-rust-by-building-a-high-performance-key-value-database-a-c-developers-honest-take-2jam</link>
      <guid>https://dev.to/mehrant/learning-rust-by-building-a-high-performance-key-value-database-a-c-developers-honest-take-2jam</guid>
      <description>&lt;p&gt;I spent the past two months learning Rust by building FeOx, a key-value database that achieves 3.7M SET/s and 5M GET/s through a Redis-compatible server (2.5x faster than Redis on the same benchmark). &lt;/p&gt;

&lt;p&gt;Coming from C, here's what I actually learned, both good and frustrating.&lt;/p&gt;

&lt;h2&gt;Memory Management: Same Problems, Different Compiler&lt;/h2&gt;

&lt;p&gt;In C, I'd use reference counting with atomic operations. In Rust, &lt;code&gt;Arc&amp;lt;T&amp;gt;&lt;/code&gt; does the same thing. The generated assembly is nearly identical. The difference? In C, if I forget an increment or decrement, I find out in production when something crashes. In Rust, that mistake won't even compile.&lt;/p&gt;
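To see the equivalence in miniature (a minimal sketch, not FeOx code): cloning an `Arc` is the atomic increment, and the drop the compiler inserts is the decrement you can no longer forget.

```rust
use std::sync::Arc;

fn main() {
    // Arc::clone is the moral equivalent of an atomic refcount increment in C.
    let shared = Arc::new(vec![1u8, 2, 3]);
    let reader = Arc::clone(&shared); // count is now 2
    assert_eq!(Arc::strong_count(&shared), 2);

    // Dropping a handle is the decrement; the compiler inserts it for us,
    // so a forgotten release or a double decrement cannot compile.
    drop(reader);
    assert_eq!(Arc::strong_count(&shared), 1);
}
```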

&lt;p&gt;The ownership model does change how you structure shared data. In my code, records are &lt;code&gt;Arc&amp;lt;Record&amp;gt;&lt;/code&gt; because they're shared between the hash table and skip list. Each Record contains &lt;code&gt;RwLock&amp;lt;Option&amp;lt;Bytes&amp;gt;&amp;gt;&lt;/code&gt; for the value (since it can be cleared from memory after being written to disk), and AtomicU64 for various fields. In C, I used &lt;code&gt;atomic_t&lt;/code&gt;, &lt;code&gt;spinlock_t&lt;/code&gt;, or RCU based on the access pattern. The difference is that Rust encodes these synchronization choices directly in the type system, so you can't accidentally access an &lt;code&gt;Arc&amp;lt;RwLock&amp;lt;T&amp;gt;&amp;gt;&lt;/code&gt; without proper locking. This catches bugs at compile time that would be runtime races in C.&lt;/p&gt;
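A stripped-down sketch of that record layout might look like the following (field names are illustrative, and `Vec<u8>` stands in for the real code's `Bytes`):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, RwLock};

// Hypothetical shape of a shared record, following the description above:
// the value can be dropped from memory after it is persisted, while
// metadata stays atomically updatable without taking the lock.
struct Record {
    value: RwLock<Option<Vec<u8>>>, // `Bytes` in the real code
    version: AtomicU64,
}

fn evict(rec: &Arc<Record>) {
    // The type system forces us through the lock; this field cannot be raced.
    *rec.value.write().unwrap() = None;
    rec.version.fetch_add(1, Ordering::Release);
}

fn main() {
    let rec = Arc::new(Record {
        value: RwLock::new(Some(b"hello".to_vec())),
        version: AtomicU64::new(0),
    });
    let in_index = Arc::clone(&rec); // shared by hash table and skip list
    evict(&rec);
    assert!(in_index.value.read().unwrap().is_none());
    assert_eq!(in_index.version.load(Ordering::Acquire), 1);
}
```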

&lt;h2&gt;Concurrency: Same Algorithms, Different Errors&lt;/h2&gt;

&lt;p&gt;The concurrency story is similar. RCU in C means calling &lt;code&gt;rcu_read_lock()&lt;/code&gt;, doing your work, calling &lt;code&gt;rcu_read_unlock()&lt;/code&gt;, and deferring deletion with &lt;code&gt;call_rcu()&lt;/code&gt;. Crossbeam in Rust follows the same pattern: &lt;code&gt;epoch::pin()&lt;/code&gt;, do work, let the guard drop automatically, and defer with &lt;code&gt;guard.defer()&lt;/code&gt;. &lt;/p&gt;
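The parallel is close enough to line up call-for-call. A sketch (not compilable as-is; the left column follows the liburcu API, the right column crossbeam-epoch):

```
/* C with RCU */                     // Rust with crossbeam-epoch
rcu_read_lock();                     let guard = epoch::pin();
p = rcu_dereference(head);           let p = head.load(Ordering::Acquire, &guard);
use(p);                              use(p);
rcu_read_unlock();                   // read side ends when `guard` drops
call_rcu(&old->rcu, free_cb);        unsafe { guard.defer_destroy(old) };
```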

&lt;p&gt;It's the same algorithm. The performance is identical. The difference is that in Rust, if I try to access data after moving it, the compiler stops me. In C, I'd find out when it crashes, or worse, when it corrupts memory silently. Both require understanding epoch-based reclamation. Rust doesn't make the concept easier, just safer.&lt;/p&gt;

&lt;h2&gt;Where Rust Genuinely Shined&lt;/h2&gt;

&lt;p&gt;Pattern matching for protocol parsing was legitimately cleaner than nested switch statements. Parsing Redis commands became almost elegant instead of the usual maze of conditionals. Error propagation with &lt;code&gt;?&lt;/code&gt; beats &lt;code&gt;goto cleanup&lt;/code&gt; patterns any day. The code flows naturally instead of jumping around.&lt;/p&gt;
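To make that concrete, here's a toy command parser in the same spirit; the command set and error type are invented for illustration, not FeOx's actual protocol code.

```rust
// A toy Redis-style command parser: one `match` over a slice replaces
// the nested switch/strcmp maze from C.
#[derive(Debug, PartialEq)]
enum Command<'a> {
    Get(&'a str),
    Set(&'a str, &'a str),
    Del(&'a str),
}

fn parse<'a>(parts: &[&'a str]) -> Result<Command<'a>, String> {
    match *parts {
        ["GET", key] => Ok(Command::Get(key)),
        ["SET", key, value] => Ok(Command::Set(key, value)),
        ["DEL", key] => Ok(Command::Del(key)),
        [cmd, ..] => Err(format!("unknown command: {cmd}")),
        [] => Err("empty command".into()),
    }
}

fn main() {
    assert_eq!(parse(&["SET", "k", "v"]), Ok(Command::Set("k", "v")));
    assert!(parse(&["FLUSH"]).is_err());
}
```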

&lt;p&gt;Enums with associated data are genuinely useful. My &lt;code&gt;Operation&lt;/code&gt; enum can be Insert, Update, or Delete, each carrying different data. My error type is an enum where each variant includes specific context about what failed. In C, enums are just integer constants; you need separate structs and manual plumbing to achieve the same thing.&lt;/p&gt;
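A minimal sketch of such an enum, with variant names following the prose rather than FeOx's actual types:

```rust
// Each variant carries exactly the data it needs; in C this would be a
// tag enum plus a union plus manual discipline to keep them in sync.
enum Operation {
    Insert { key: String, value: Vec<u8> },
    Update { key: String, value: Vec<u8> },
    Delete { key: String },
}

fn describe(op: &Operation) -> String {
    match op {
        Operation::Insert { key, value } => format!("insert {key} ({} bytes)", value.len()),
        Operation::Update { key, value } => format!("update {key} ({} bytes)", value.len()),
        Operation::Delete { key } => format!("delete {key}"),
    }
}

fn main() {
    let op = Operation::Delete { key: "user:1".into() };
    assert_eq!(describe(&op), "delete user:1");
}
```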

&lt;p&gt;The &lt;code&gt;Result&amp;lt;T, E&amp;gt;&lt;/code&gt; and &lt;code&gt;Option&amp;lt;T&amp;gt;&lt;/code&gt; types make error handling composable. Instead of error codes that break function composition, I can chain operations with &lt;code&gt;map&lt;/code&gt;, &lt;code&gt;and_then&lt;/code&gt;, and &lt;code&gt;?&lt;/code&gt;. The real win isn't preventing forgotten checks; it's being able to transform and propagate errors through a pipeline of operations without manual boilerplate at every step.&lt;/p&gt;
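A small, self-contained example of that chaining style (the store and key names are made up):

```rust
use std::collections::HashMap;

// Option -> Result -> transform, with no manual error plumbing between steps.
fn lookup_doubled(store: &HashMap<String, String>, key: &str) -> Result<usize, String> {
    store
        .get(key)
        .ok_or_else(|| format!("missing key: {key}"))
        .and_then(|v| v.parse::<usize>().map_err(|e| e.to_string()))
        .map(|n| n * 2)
}

fn main() {
    let mut store = HashMap::new();
    store.insert("count".to_string(), "21".to_string());
    assert_eq!(lookup_doubled(&store, "count"), Ok(42));
    assert!(lookup_doubled(&store, "absent").is_err());
}
```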

&lt;p&gt;The built-in testing framework made a real difference. I'll be honest: my C projects often skip tests because setting up a testing framework is friction. In Rust, &lt;code&gt;cargo test&lt;/code&gt; just works, so I actually wrote tests. Traits are nice for generic code too. In C, I'd use function pointers in structs, which works but gets messy fast. Rust's approach generates better code through monomorphization, though you pay for it with longer compile times.&lt;/p&gt;
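A tiny illustration of the trait-versus-function-pointer point (the types are invented): the compiler monomorphizes `total_len` for each concrete `T`, so the calls are direct rather than indirect.

```rust
// In C this would be a struct of function pointers consulted at runtime;
// here each instantiation of `total_len` is compiled with static calls.
trait Encode {
    fn encoded_len(&self) -> usize;
}

struct Ping;
struct Echo(String);

impl Encode for Ping {
    fn encoded_len(&self) -> usize { 4 }
}
impl Encode for Echo {
    fn encoded_len(&self) -> usize { 5 + self.0.len() }
}

fn total_len<T: Encode>(items: &[T]) -> usize {
    items.iter().map(Encode::encoded_len).sum()
}

fn main() {
    assert_eq!(total_len(&[Ping, Ping]), 8);
    assert_eq!(total_len(&[Echo("hi".into())]), 7);
}
```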

&lt;h2&gt;Where Rust's Tooling Truly Excels&lt;/h2&gt;

&lt;p&gt;This is where the Rust ecosystem shines. In C, I'd write comments and hope they stay accurate. Maybe set up Doxygen if I'm feeling motivated. In Rust, &lt;code&gt;cargo doc&lt;/code&gt; generates functional HTML documentation from doc comments, with examples that are compile-tested. The examples in my docs actually run as tests; they literally cannot go stale without breaking the build.&lt;/p&gt;
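The mechanism is simple: examples inside `///` doc comments are extracted and run by `cargo test`. A sketch (the crate name `mylib` and the function are hypothetical):

```rust
/// Doubles a value.
///
/// The fenced example below would be compiled and executed as a doc-test
/// by `cargo test` when this lives in a library crate:
///
/// ```
/// assert_eq!(mylib::double(21), 42);
/// ```
pub fn double(x: u64) -> u64 {
    x * 2
}

fn main() {
    assert_eq!(double(21), 42);
}
```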

&lt;p&gt;Benchmarking was transformative. In C, I'd write custom timing loops, worry about compiler optimizations invalidating my measurements, and manually handle warm-up effects. With Criterion.rs, I just write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="nf"&gt;.iter&lt;/span&gt;&lt;span class="p"&gt;(||&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="nf"&gt;.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;black_box&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It handles warmup, statistical analysis, outlier detection, and generates HTML reports with graphs showing the distribution of timings. It even detects performance regressions between runs automatically. My C projects never had this level of rigor. Too much friction to set up.&lt;/p&gt;

&lt;h2&gt;Where I Fought the Language&lt;/h2&gt;

&lt;p&gt;Self-referential structures remain a pain point. In C, you just store a pointer. In Rust, you need &lt;code&gt;Pin&lt;/code&gt;, &lt;code&gt;PhantomData&lt;/code&gt;, or &lt;code&gt;unsafe&lt;/code&gt;. I ended up redesigning around indices instead of pointers. Probably better architecture, but it was forced by the language, not chosen for its merits.&lt;/p&gt;
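The index-based redesign looks roughly like this (a generic sketch, not FeOx's actual structures): nodes refer to each other by position in a `Vec` instead of by pointer, which sidesteps the self-referential borrow entirely.

```rust
// An arena-style linked list: `next` is an index into `nodes`,
// so no node ever holds a reference into its own container.
struct Node {
    value: u64,
    next: Option<usize>, // index, not a pointer
}

struct List {
    nodes: Vec<Node>,
    head: Option<usize>,
}

impl List {
    fn push_front(&mut self, value: u64) {
        let idx = self.nodes.len();
        self.nodes.push(Node { value, next: self.head });
        self.head = Some(idx);
    }

    fn sum(&self) -> u64 {
        let mut total = 0;
        let mut cur = self.head;
        while let Some(i) = cur {
            total += self.nodes[i].value;
            cur = self.nodes[i].next;
        }
        total
    }
}

fn main() {
    let mut list = List { nodes: Vec::new(), head: None };
    list.push_front(1);
    list.push_front(2);
    assert_eq!(list.sum(), 3);
}
```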

&lt;p&gt;The compiler can't see every invariant you know holds. Here's a concrete example: I know a HashMap entry exists because I just checked it, but I can't prove it to the compiler without either &lt;code&gt;unwrap()&lt;/code&gt; (which could panic) or &lt;code&gt;unsafe&lt;/code&gt; (which defeats the purpose). These situations are frustrating because you know the code is correct, but you can't express that knowledge in the type system.&lt;/p&gt;
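One idiom that sometimes sidesteps the check-then-unwrap dance is the `HashMap` entry API: a single lookup whose returned handle carries the proof of existence (illustrative example, not a claim about what FeOx does):

```rust
use std::collections::HashMap;

// Instead of contains_key() followed by get_mut().unwrap(), `entry`
// does one lookup and hands back a slot that is guaranteed to exist.
fn bump(counts: &mut HashMap<String, u64>, key: &str) -> u64 {
    let slot = counts.entry(key.to_string()).or_insert(0);
    *slot += 1;
    *slot
}

fn main() {
    let mut counts = HashMap::new();
    assert_eq!(bump(&mut counts, "get"), 1);
    assert_eq!(bump(&mut counts, "get"), 2);
}
```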

&lt;p&gt;Async is still rough around the edges. You can't easily mix sync and async code, and the ecosystem is fragmented between tokio, async-std, and smol. When I did try tokio for networking, I found it painfully slow for my use case. The overhead of the async runtime and tokio-util's codecs added measurable latency compared to raw mio with manual buffer management. For a database that needs predictable sub-millisecond latencies, I just used threads and mio's event loop directly.&lt;/p&gt;

&lt;p&gt;Perhaps most frustrating: ownership forces unnecessary clones. In C, I'd store a pointer to the same record in both the hash table and RB tree. Same memory, multiple references. In Rust, I had to clone keys because both the hash table and skip list want to own them. With &lt;code&gt;Bytes&lt;/code&gt;, the clone is cheap (just an Arc increment), but it's still an atomic operation that C doesn't need. The C version just stored pointers everywhere. Yes, it's more dangerous, but it's also more efficient.&lt;/p&gt;

&lt;h2&gt;Performance: No Magic Bullets&lt;/h2&gt;

&lt;p&gt;Let's be clear: same algorithms give same performance. Lock-free structures still need careful design. Cache-line alignment still matters; it's just &lt;code&gt;#[repr(align(64))]&lt;/code&gt; instead of &lt;code&gt;__attribute__((aligned(64)))&lt;/code&gt;. SIMD still requires &lt;code&gt;unsafe&lt;/code&gt;. When you look at the hot path in assembly, it's remarkably similar between both languages once you strip away the syntax.&lt;/p&gt;
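The alignment attribute is a one-line translation (sketch; the wrapped field is illustrative): pad each hot counter to its own cache line so concurrent writers don't false-share.

```rust
use std::sync::atomic::AtomicU64;

// Same trick as __attribute__((aligned(64))) in C: the struct occupies
// a full cache line, so neighboring counters never share one.
#[repr(align(64))]
struct PaddedCounter(AtomicU64);

fn main() {
    assert_eq!(std::mem::align_of::<PaddedCounter>(), 64);
    assert_eq!(std::mem::size_of::<PaddedCounter>(), 64);
}
```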

&lt;p&gt;The performance I achieved didn't come from Rust being faster. It came from implementing the same optimizations I would in C. The difference is that Rust caught my mistakes at compile time instead of runtime.&lt;/p&gt;

&lt;h2&gt;Build System and Dependencies&lt;/h2&gt;

&lt;p&gt;Cargo is genuinely better than Makefiles. No contest there. Dependencies just work, cross-compilation is straightforward, and the tooling is consistent across platforms. But there's a trade-off: you end up with more transitive dependencies. What would be a focused 50-file C project pulls in about 160 dependencies in Rust. Compile times suffer accordingly.&lt;/p&gt;

&lt;h2&gt;My Impressions After Two Months&lt;/h2&gt;

&lt;p&gt;Rust caught real bugs that would've been subtle crashes in C. That's valuable. But it also forced rewrites of valid code just to satisfy the borrow checker. Sometimes I knew my code was correct, but I spent an hour restructuring it to prove that to the compiler.&lt;/p&gt;

&lt;p&gt;The language didn't make me a better programmer or magically improve performance. It just moved errors from runtime to compile time. Whether that's worth the learning curve depends on your project. For a database where correctness matters? Probably yes. For a weekend prototype? Probably no.&lt;/p&gt;

&lt;p&gt;The most honest assessment I can give: Rust is a trade-off, not a pure upgrade. You trade development speed for correctness. You trade simplicity for safety. You trade compile time for runtime reliability. Whether those trades are worth it depends entirely on what you're building and what keeps you up at night.&lt;/p&gt;

&lt;p&gt;FeOx is on GitHub if you want to see the code. I'm curious: those who've done similar ports from C to Rust, did you find the same friction points? Or did I miss some idiom that would have made things smoother?&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;FeOx DB&lt;/strong&gt;: &lt;a href="https://github.com/mehrantsi/feoxdb" rel="noopener noreferrer"&gt;https://github.com/mehrantsi/feoxdb&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FeOx Server&lt;/strong&gt;: &lt;a href="https://github.com/mehrantsi/feox-server" rel="noopener noreferrer"&gt;https://github.com/mehrantsi/feox-server&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rust</category>
      <category>database</category>
      <category>performance</category>
      <category>c</category>
    </item>
    <item>
      <title>Have you tried the Remote Memory MCP by HPKV? It's more than store-and-retrieve: it provides automatic contextual summarization and semantic search to models based on the stored memories.</title>
      <dc:creator>Mehran</dc:creator>
      <pubDate>Sat, 05 Jul 2025 15:20:38 +0000</pubDate>
      <link>https://dev.to/mehrant/have-you-tried-the-remote-memory-mcp-by-hpkv-its-more-than-a-store-and-retrieve-it-provides-jcg</link>
      <guid>https://dev.to/mehrant/have-you-tried-the-remote-memory-mcp-by-hpkv-its-more-than-a-store-and-retrieve-it-provides-jcg</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/mehrant" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2990366%2F4dd12fc4-fd22-4b56-86dd-6da3bc77664d.jpg" alt="mehrant"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/mehrant/remote-memory-mcp-server-for-cursor-ide-1l40" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Remote Memory MCP Server for Cursor IDE&lt;/h2&gt;
      &lt;h3&gt;Mehran ・ Apr 18&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#cursor&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#mcp&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#programming&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#ai&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>cursor</category>
      <category>mcp</category>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>Memory MCP with Semantic Search Support</title>
      <dc:creator>Mehran</dc:creator>
      <pubDate>Sat, 07 Jun 2025 18:58:03 +0000</pubDate>
      <link>https://dev.to/mehrant/-k1o</link>
      <guid>https://dev.to/mehrant/-k1o</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/mehrant/remote-memory-mcp-server-for-cursor-ide-1l40" class="crayons-story__hidden-navigation-link"&gt;Remote Memory MCP Server for Cursor IDE&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/mehrant" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2990366%2F4dd12fc4-fd22-4b56-86dd-6da3bc77664d.jpg" alt="mehrant profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/mehrant" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Mehran
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Mehran
                
              
              &lt;div id="story-author-preview-content-2416116" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/mehrant" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2990366%2F4dd12fc4-fd22-4b56-86dd-6da3bc77664d.jpg" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Mehran&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/mehrant/remote-memory-mcp-server-for-cursor-ide-1l40" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Apr 18 '25&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/mehrant/remote-memory-mcp-server-for-cursor-ide-1l40" id="article-link-2416116"&gt;
          Remote Memory MCP Server for Cursor IDE
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/cursor"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;cursor&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/mcp"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;mcp&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/programming"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;programming&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ai&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/mehrant/remote-memory-mcp-server-for-cursor-ide-1l40" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;4&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/mehrant/remote-memory-mcp-server-for-cursor-ide-1l40#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            6 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
      <category>cursor</category>
      <category>mcp</category>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>JSONK: A High-Performance JSON Library for Linux Kernel Space</title>
      <dc:creator>Mehran</dc:creator>
      <pubDate>Mon, 02 Jun 2025 12:33:33 +0000</pubDate>
      <link>https://dev.to/mehrant/jsonk-a-json-library-for-linux-kernel-space-3el9</link>
      <guid>https://dev.to/mehrant/jsonk-a-json-library-for-linux-kernel-space-3el9</guid>
      <description>&lt;p&gt;When you're deep in kernel development and need to handle structured data, you quickly realize that the kernel's ecosystem has a glaring omission: there's virtually no mature JSON processing library available. This became painfully apparent during a recent project where I needed reliable JSON handling within a kernel module.&lt;/p&gt;

&lt;p&gt;The options were limited. I could either cobble together a basic parser from scratch, risking bugs and incomplete functionality, or find some way to bridge user-space libraries with kernel code, introducing complexity and performance overhead. Neither felt right for production code that needed to be both fast and bulletproof.&lt;/p&gt;

&lt;p&gt;After searching through the available kernel JSON libraries, the landscape was surprisingly empty. Most implementations were either academic exercises or incomplete attempts that handled only basic cases. None offered the robustness needed for real-world kernel module development. This gap led to the creation of JSONK.&lt;/p&gt;

&lt;h2&gt;The Kernel Space Reality&lt;/h2&gt;

&lt;p&gt;Working in kernel space means playing by different rules. You don't have the luxury of standard library functions or the safety net of process boundaries. When your code crashes, it doesn't just kill a process - it can bring down the entire system. Memory management becomes critical because leaks accumulate over time and can destabilize the machine. Concurrency is everywhere, with multiple threads potentially accessing your data structures simultaneously.&lt;/p&gt;

&lt;p&gt;These constraints make JSON processing particularly challenging. User-space libraries assume they can allocate memory freely, handle errors by throwing exceptions, and rely on process isolation for safety. None of these assumptions hold in kernel space.&lt;/p&gt;

&lt;p&gt;The existing kernel JSON libraries I found reflected these challenges poorly. Most were incomplete implementations that worked for simple cases but failed on edge conditions. Few handled concurrent access properly, and none provided the atomic update capabilities that kernel modules often need for maintaining consistent state.&lt;/p&gt;

&lt;h2&gt;Building JSONK&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/mehrantsi/jsonk" rel="noopener noreferrer"&gt;JSONK&lt;/a&gt; emerged from these practical needs. Rather than building yet another minimal parser, I focused on creating a library that could handle real-world kernel module requirements: full RFC 8259 compliance, robust error handling, safe memory management, and atomic operations for consistent updates.&lt;/p&gt;

&lt;p&gt;The design started with memory management. In kernel space, every allocation matters, and cleanup must be guaranteed. JSONK uses reference counting to ensure that JSON values are properly freed when no longer needed, even in complex scenarios where multiple parts of the code hold references to the same data. This prevents the use-after-free vulnerabilities that can crash the kernel.&lt;/p&gt;

&lt;h3&gt;Core Features&lt;/h3&gt;

&lt;p&gt;The library provides several key capabilities designed for kernel space:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RFC 8259 Compliance&lt;/strong&gt;: Complete JSON specification support including all data types, proper string escaping, unicode sequences, and number parsing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Atomic JSON Patching&lt;/strong&gt;: Apply partial updates to JSON objects with rollback safety - either all changes succeed or none are applied&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reference Counting&lt;/strong&gt;: Automatic memory management to prevent use-after-free vulnerabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Path-Based Access&lt;/strong&gt;: Navigate nested structures using dot notation like "user.profile.name"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory Safety&lt;/strong&gt;: Built-in limits to prevent DoS attacks in kernel space&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For parsing, the library implements a single-pass algorithm that builds the JSON structure directly without intermediate representations. This approach minimizes memory allocations and provides predictable performance characteristics. The parser handles all JSON data types correctly, including proper string escaping, number parsing with scientific notation, and nested structures with configurable depth limits.&lt;/p&gt;

&lt;p&gt;One feature that sets JSONK apart is atomic JSON patching. This allows you to apply partial updates to JSON objects with rollback safety - either all changes succeed, or none are applied. This capability is crucial for kernel modules that need to maintain consistent configuration state or handle concurrent updates safely.&lt;/p&gt;

&lt;h2&gt;Performance in Practice&lt;/h2&gt;

&lt;p&gt;Testing JSONK on Linux 6.8.0 reveals performance characteristics that work well for kernel space applications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Small JSON Parsing (~833 bytes, 10 objects)&lt;/strong&gt;: 425K ops/sec (337 MB/s throughput)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medium JSON Parsing (~8KB, 100 objects)&lt;/strong&gt;: 13.8K ops/sec (859 MB/s throughput)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large JSON Parsing (~1MB, 200 objects)&lt;/strong&gt;: 1.12K ops/sec (972 MB/s throughput)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JSON Serialization&lt;/strong&gt;: 5.94M ops/sec (872 MB/s throughput)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JSON Patching&lt;/strong&gt;: 827K ops/sec (42 MB/s throughput)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Excellent scaling (53-220 ns per element as size increases)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Small JSON documents around 833 bytes with 10 objects parse at 425,000 operations per second, delivering 337 MB/s throughput. This performance makes the library suitable for high-frequency operations or real-time scenarios where consistent sub-millisecond response times are required.&lt;/p&gt;

&lt;p&gt;As document size increases to around 8KB with 100 objects, the library maintains excellent throughput at 859 MB/s despite the lower operation rate of 13,800 operations per second. This demonstrates efficient scaling as the parser handles larger, more complex structures. For large documents around 1MB with 200 objects, the library achieves 972 MB/s throughput at 1,120 operations per second, showing that bulk data processing remains highly efficient.&lt;/p&gt;

&lt;p&gt;The scalability metrics reveal excellent per-element performance, ranging from 53 to 220 nanoseconds per element as document size increases. This consistent scaling behavior makes performance predictable across different workload sizes.&lt;/p&gt;

&lt;p&gt;JSON serialization performs exceptionally well at 5.94 million operations per second with 872 MB/s throughput. The atomic patching operations achieve 827,000 operations per second, which is excellent considering the additional safety guarantees and complexity involved.&lt;/p&gt;

&lt;h2&gt;Security Considerations&lt;/h2&gt;

&lt;p&gt;Security in kernel space requires careful attention to several attack vectors. JSONK addresses these concerns through multiple layers of protection.&lt;/p&gt;

&lt;p&gt;The library implements strict input validation to prevent malformed JSON from causing buffer overflows or memory corruption. All string operations use bounded functions, and buffer sizes are validated before any memory operations. The parser rejects JSON that exceeds configured limits for nesting depth, object member counts, and array sizes, preventing resource exhaustion attacks.&lt;/p&gt;

&lt;p&gt;Memory management uses reference counting to prevent use-after-free vulnerabilities, which are particularly dangerous in kernel space where they can lead to privilege escalation. The atomic patching system ensures that partial updates cannot leave data structures in inconsistent states that might be exploitable.&lt;/p&gt;

&lt;p&gt;Input sanitization handles control characters and validates Unicode escape sequences to prevent injection attacks. The library also implements bounds checking on all array and object access operations to prevent buffer overruns that could corrupt kernel memory.&lt;/p&gt;

&lt;p&gt;For denial-of-service protection, JSONK enforces limits on parsing time and memory usage. Large or deeply nested JSON documents are rejected before they can consume excessive system resources. These limits are configurable but have safe defaults that prevent most resource exhaustion scenarios.&lt;/p&gt;

&lt;h2&gt;Real-World Integration&lt;/h2&gt;

&lt;p&gt;Using JSONK in a kernel module is straightforward. The library provides a clean API that feels familiar to developers who have worked with JSON libraries in other contexts, while respecting kernel space conventions.&lt;/p&gt;

&lt;h3&gt;Basic Usage Example&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;"include/jsonk.h"&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="c1"&gt;// Parse JSON string&lt;/span&gt;
&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;json_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;test&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;:42}"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;jsonk_value&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jsonk_parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strlen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_str&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="c1"&gt;// Access object members&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;jsonk_member&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;name_member&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jsonk_object_find_member&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;object&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name_member&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;name_member&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;JSONK_VALUE_STRING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;printk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Name: %s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name_member&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Apply a patch atomically&lt;/span&gt;
&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;patch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;:100,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;new_field&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;added&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;}"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;result_len&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;ret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jsonk_apply_patch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strlen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_str&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                           &lt;span class="n"&gt;patch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strlen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;patch&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                           &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;result_len&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Clean up with reference counting&lt;/span&gt;
&lt;span class="n"&gt;jsonk_value_put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The reference counting model means you explicitly manage object lifetimes with the &lt;code&gt;jsonk_value_get()&lt;/code&gt; and &lt;code&gt;jsonk_value_put()&lt;/code&gt; functions. This might seem cumbersome compared to garbage-collected environments, but it provides the predictable behavior that kernel code requires. You know exactly when memory is allocated and freed, which is essential for debugging and ensuring system stability.&lt;/p&gt;
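&lt;p&gt;The lifetime discipline is easy to model. The toy class below mirrors the get/put rules in Python purely for illustration; JSONK implements this in C with atomic counters, so only the discipline, not the API, is shown:&lt;/p&gt;

```python
class RefCounted:
    """Minimal model of the get/put lifetime rules described above.
    (Illustrative only; the method names echo jsonk_value_get/put.)"""
    def __init__(self):
        self.refs = 1          # the creator holds the first reference
        self.freed = False

    def get(self):             # take an additional reference
        self.refs += 1
        return self

    def put(self):             # drop a reference
        self.refs -= 1
        if self.refs == 0:
            self.freed = True  # last put releases the object

v = RefCounted()
shared = v.get()   # hand a second reference to another owner
v.put()            # original owner is done; object still alive
print(v.freed)     # False: 'shared' still holds a reference
shared.put()
print(v.freed)     # True: last reference dropped, memory released
```

&lt;p&gt;The rule of thumb is the usual one: every &lt;code&gt;get&lt;/code&gt; must eventually be matched by a &lt;code&gt;put&lt;/code&gt;, and the creator's implicit reference counts too.&lt;/p&gt;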

&lt;h3&gt;
  
  
  Concurrent Access Pattern
&lt;/h3&gt;

&lt;p&gt;For concurrent access, JSONK takes a practical approach. Rather than implementing internal locking that might not match your module's synchronization strategy, the library requires you to handle synchronization yourself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;jsonk_value&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;shared_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="nf"&gt;DEFINE_SPINLOCK&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state_lock&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;update_shared_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;jsonk_value&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;new_value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;spin_lock_irqsave&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;state_lock&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shared_state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;jsonk_set_value_by_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shared_state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strlen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;new_value&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;spin_unlock_irqrestore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;state_lock&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;jsonk_value&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nf"&gt;get_shared_state_copy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;jsonk_value&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;spin_lock_irqsave&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;state_lock&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shared_state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;copy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jsonk_value_get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shared_state&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// Get reference&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;spin_unlock_irqrestore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;state_lock&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// Caller must call jsonk_value_put(copy)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you complete control over locking granularity and avoids potential deadlocks from conflicting lock hierarchies.&lt;/p&gt;

&lt;p&gt;The atomic patching feature proves particularly useful for configuration management. You can parse a configuration JSON, apply updates from user space or other modules, and ensure that either all changes take effect or none do. This prevents the intermediate states that could leave your module in an inconsistent configuration.&lt;/p&gt;
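&lt;p&gt;The commit discipline behind atomic patching fits in a few lines. This is an illustrative Python sketch of the all-or-nothing pattern, not JSONK's actual implementation, which operates on serialized JSON buffers via &lt;code&gt;jsonk_apply_patch()&lt;/code&gt;; the validation callback here is a stand-in for whatever checks your module needs:&lt;/p&gt;

```python
def apply_patch(config, patch, validate):
    """Toy atomic patch: either every change lands, or none does."""
    candidate = dict(config)      # build the result in scratch space,
    candidate.update(patch)       # never mutating the live config
    if not validate(candidate):
        return config             # reject: caller keeps the old state
    return candidate              # commit: swap in the new state

cfg = {"value": 42}
ok = apply_patch(cfg, {"value": 100, "new_field": "added"},
                 validate=lambda c: isinstance(c["value"], int))
print(ok)          # {'value': 100, 'new_field': 'added'}

bad = apply_patch(cfg, {"value": "oops"},
                  validate=lambda c: isinstance(c["value"], int))
print(bad is cfg)  # True: the invalid patch left the config untouched
```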

&lt;h2&gt;
  
  
  Design Constraints and Trade-offs
&lt;/h2&gt;

&lt;p&gt;JSONK makes deliberate trade-offs for kernel space operation. The library imposes limits on nesting depth, object member counts, and array sizes. These limits prevent resource exhaustion attacks and ensure predictable memory usage, but they might constrain applications that need to process arbitrarily complex JSON structures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Current Limitations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Maximum nesting depth&lt;/strong&gt;: 32 levels (configurable)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maximum object members&lt;/strong&gt;: 1000 per object (configurable)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maximum array elements&lt;/strong&gt;: 10000 per array (configurable)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Number precision&lt;/strong&gt;: 64-bit integers and basic decimal support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unicode handling&lt;/strong&gt;: Escape sequences stored literally for kernel efficiency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Number handling focuses on 64-bit integers and basic decimal support rather than arbitrary precision arithmetic. This covers the vast majority of kernel use cases while avoiding the complexity and performance overhead of full-featured number processing. Similarly, Unicode escape sequences are stored literally rather than being converted to UTF-8, which maintains compatibility while avoiding complex character encoding logic in kernel space.&lt;/p&gt;

&lt;p&gt;The library doesn't include features like JSON Schema validation or advanced path expressions. These capabilities could be added, but they would increase complexity and memory usage for functionality that most kernel modules don't need. The current feature set focuses on the core operations that kernel code actually requires.&lt;/p&gt;

&lt;h2&gt;
  
  
  Integration and Build System
&lt;/h2&gt;

&lt;p&gt;JSONK is designed for easy integration with existing kernel module development workflows:&lt;/p&gt;

&lt;h3&gt;
  
  
  Module Dependencies
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="c"&gt;# In your module's Makefile
&lt;/span&gt;&lt;span class="nv"&gt;obj-m&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; your_module.o
&lt;span class="nv"&gt;your_module-objs&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; your_source.o

&lt;span class="c"&gt;# Declare dependency
&lt;/span&gt;&lt;span class="nl"&gt;MODULE_SOFTDEP("pre&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;jsonk");&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Build and Test Workflow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build and test workflow&lt;/span&gt;
make clean &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make
make test-basic   &lt;span class="c"&gt;# Basic functionality tests&lt;/span&gt;
make test-perf    &lt;span class="c"&gt;# Performance benchmarks&lt;/span&gt;
make test-atomic  &lt;span class="c"&gt;# Atomic patching tests&lt;/span&gt;

&lt;span class="c"&gt;# View results&lt;/span&gt;
dmesg | &lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-50&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The build system provides targets for compilation, testing, and module management, making it easy to integrate JSONK into existing development workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Current Status and Availability
&lt;/h2&gt;

&lt;p&gt;JSONK is available now under the GPL-2.0 license at &lt;a href="https://github.com/mehrantsi/jsonk" rel="noopener noreferrer"&gt;https://github.com/mehrantsi/jsonk&lt;/a&gt;. The repository includes the complete library source, examples, and a full test suite that validates both functionality and performance.&lt;/p&gt;

&lt;p&gt;The library has been tested on Linux 6.8.0 and should work on any reasonably recent kernel version. The code follows kernel coding conventions and has been designed to integrate cleanly with existing kernel module development workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Feedback and Contributions
&lt;/h2&gt;

&lt;p&gt;The project welcomes feedback and contributions from the kernel development community. Real-world usage will undoubtedly reveal areas for improvement and additional features that would benefit the broader ecosystem. But even in its current form, JSONK provides capabilities that simply weren't available before in kernel space, opening up new possibilities for kernel module development.&lt;/p&gt;

</description>
      <category>linux</category>
      <category>kernel</category>
      <category>programming</category>
      <category>json</category>
    </item>
    <item>
      <title>Redis vs. HPKV: 100M keys over network</title>
      <dc:creator>Mehran</dc:creator>
      <pubDate>Tue, 27 May 2025 23:28:04 +0000</pubDate>
      <link>https://dev.to/mehrant/redis-vs-hpkv-100m-keys-over-network-11aa</link>
      <guid>https://dev.to/mehrant/redis-vs-hpkv-100m-keys-over-network-11aa</guid>
      <description>&lt;h2&gt;
  
  
  Why another Redis vs. HPKV article?
&lt;/h2&gt;

&lt;p&gt;It's been less than 3 months since we officially released &lt;a href="https://hpkv.io/" rel="noopener noreferrer"&gt;HPKV&lt;/a&gt; and we've been working on a lot of new features and improvements. We've received a lot of questions about how HPKV compares to other KV stores and how it performs in different scenarios. Since Redis is the most popular KV store and a great choice for many use cases, it's naturally the first thing anyone would compare HPKV to.&lt;/p&gt;

&lt;p&gt;Back in February, just before we officially went out of beta, we published a blog post on &lt;a href="https://hpkv.io/blog/2025/02/redis-vs-hpkv-benchmark" rel="noopener noreferrer"&gt;Redis vs. HPKV&lt;/a&gt; where we compared the performance of Redis and HPKV on a single node locally. This was a great way to show the performance of HPKV and how it compares to Redis; however, we wanted to show how this translates to an over-the-network scenario as well as talk about a few details that caused a number of questions and confusion. So here we are.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why should you care?
&lt;/h2&gt;

&lt;p&gt;Performance is not about the numbers; it's about the context. In other words, what is the performance-to-cost ratio? How much does it cost to get that performance? To maintain it? To scale it?&lt;/p&gt;

&lt;p&gt;Given enough scale, shaving microseconds can save you a lot of money.&lt;br&gt;
In other words, you can translate microseconds to dollars. More on that later.&lt;/p&gt;
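&lt;p&gt;A quick back-of-envelope calculation shows the shape of that translation. Every number below is hypothetical, purely for illustration:&lt;/p&gt;

```python
# Back-of-envelope: how per-operation latency turns into hardware.
def cores_needed(per_op_us, target_ops=1_000_000):
    """Cores required to sustain target_ops ops/sec when each
    operation occupies one core for per_op_us microseconds."""
    return target_ops * per_op_us / 1_000_000

print(cores_needed(60))   # 60.0 cores at 60 us/op
print(cores_needed(50))   # 50.0 cores at 50 us/op: shaving 10 us
                          # per op is ~17% less hardware at this scale
```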
&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;The setup is simple: we have 2 nodes, one is used as a server and the other is used as a client. Both machines are located in the EU and the same region, but not in the same datacenter; however, the latency between the two machines is less than 1ms.&lt;/p&gt;
&lt;h3&gt;
  
  
  Machine A (Server):
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Intel Core i9-13900&lt;/li&gt;
&lt;li&gt;128 GB DDR5 ECC RAM&lt;/li&gt;
&lt;li&gt;SAMSUNG PM9A3 1.92 TB NVMe SSD&lt;/li&gt;
&lt;li&gt;Ubuntu 24.04 LTS - 6.8.0-60-generic&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Machine B (Client):
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Intel Core i5-12500&lt;/li&gt;
&lt;li&gt;128 GB DDR4 ECC RAM&lt;/li&gt;
&lt;li&gt;SAMSUNG 980 PRO 512 GB NVMe SSD&lt;/li&gt;
&lt;li&gt;Ubuntu 24.04 LTS - 6.8.0-60-generic&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Network:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;1 Gbps Ethernet&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Software:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Redis 7.0.15&lt;/li&gt;
&lt;li&gt;HPKV 1.17&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Methodology &amp;amp; Considerations
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Although we used the same machine for both HPKV and Redis, we ensured that only one of the two is running at a time.&lt;/li&gt;
&lt;li&gt;No other application/service was running on the machines, only the OS and the KV store.&lt;/li&gt;
&lt;li&gt;We followed the best practices mentioned in the &lt;a href="https://redis.io/docs/latest/operate/oss_and_stack/management/optimization/benchmarks/" rel="noopener noreferrer"&gt;Redis Benchmarking Guide&lt;/a&gt; for Redis testing.&lt;/li&gt;
&lt;li&gt;For HPKV we used the &lt;a href="https://hpkv.io/docs/rioc-api" rel="noopener noreferrer"&gt;RIOC&lt;/a&gt; benchmarking tool, which was configured to match the Redis benchmark parameters. HPKV also provides a local vectored call interface which can bring the single operation latency close to the 300ns range for GET, but since Redis is operating as a server and to keep the comparison fair, we used RIOC for this benchmark. You can find the code for the benchmarking tool &lt;a href="https://github.com/hpkv-io/rioc/blob/master/src/rioc_bench.c" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;HPKV provides both in-memory and disk persistence. For disk persistence comparison, we used Redis with &lt;a href="https://redis.io/docs/latest/operate/oss_and_stack/management/persistence/" rel="noopener noreferrer"&gt;AOF enabled&lt;/a&gt;; however, there are some fundamental differences between the two that we will discuss later.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Understanding HPKV local performance
&lt;/h2&gt;

&lt;p&gt;Before we dive into the over-the-network comparison, we need to understand how HPKV performs locally. Benchmarking with HPKV's local vectored call interface is a great way to understand HPKV's strengths and the careful design choices we made to achieve them.&lt;/p&gt;

&lt;p&gt;HPKV is highly optimized for local performance. It's designed to be a high-performance, low-latency, low-cost KV store. It uses a combination of techniques to achieve this, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Advanced lock-free design for highly concurrent read operations&lt;/li&gt;
&lt;li&gt;Fine-grained locking for highly concurrent write operations&lt;/li&gt;
&lt;li&gt;Custom memory allocator for low-latency memory allocation&lt;/li&gt;
&lt;li&gt;CPU cache-friendly design&lt;/li&gt;
&lt;li&gt;Custom file system for low-latency disk access&lt;/li&gt;
&lt;li&gt;Fast hybrid memory-disk architecture for low-memory environments&lt;/li&gt;
&lt;li&gt;And more...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One of the key differences between HPKV and Redis is that Redis is only a server: the only way to communicate with it is through the network. HPKV is a local-first KV store, and it's RIOC that provides a network interface to it. This design choice allows HPKV to achieve low latency and high performance for applications running on the same machine, but make no mistake, HPKV is no slouch when it comes to over-the-network performance.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you'd like to learn more about RIOC, we have a blog post that explains it in detail &lt;a href="https://hpkv.io/blog/2025/03/high-performance-secure-networking-rioc" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Without further ado, let's dive into the benchmarks.&lt;/p&gt;
&lt;h2&gt;
  
  
  Local Benchmark Tool
&lt;/h2&gt;

&lt;p&gt;The local benchmark tool is a simple, single-threaded tool that writes, reads, and deletes a number of unique keys, each exactly once. This eliminates data skew and cache effects.&lt;/p&gt;
&lt;h3&gt;
  
  
  400M keys, single thread, local benchmark
&lt;/h3&gt;

&lt;p&gt;The following shows the result of a single-threaded benchmark over 400M keys, with 25-byte keys and 25-byte values. HPKV is running in memory-only mode with hash buckets set to 2^28.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4giefgszuzakn5w5eci.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4giefgszuzakn5w5eci.webp" alt="400M keys local benchmark" width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, HPKV is fast! The read performance is quite stable with the number of keys growing beyond 1 billion on a single node, due to its careful hash table design, and the other operations (INSERT, DELETE, RANGE, ATOMIC_INC/DEC, etc.) scale logarithmically with the number of keys, even in hybrid disk mode.&lt;/p&gt;

&lt;p&gt;Please note that the &lt;code&gt;0.500&lt;/code&gt; at the end of the P50, P95 and P99 latency values is an artifact of the histogram binning in the test program.&lt;/p&gt;
&lt;h3&gt;
  
  
  100M keys, single thread, hybrid memory-disk, local benchmark
&lt;/h3&gt;

&lt;p&gt;The hybrid memory-disk mode in HPKV implements an asynchronous write-behind buffering system with durability characteristics similar to Redis AOF in everysec mode. Upon each write request, HPKV immediately updates in-memory data structures and enqueues a write buffer entry to per-CPU buffers. Background worker threads asynchronously process these entries, performing batched disk writes with immediate synchronization. This architecture provides strong consistency guarantees through memory-resident data while ensuring eventual persistence through lock-free, work-queue-based disk operations.&lt;/p&gt;

&lt;p&gt;After a successful write to the disk, the value of the KV pair is removed from memory to keep the memory footprint low. Upon a read request, the value is read from the disk and cached in memory with an LRU eviction policy, memory pressure detection and intelligent prefetching.&lt;/p&gt;
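&lt;p&gt;The read path described above is easy to model with a toy LRU cache. This Python sketch is illustrative only; the real implementation is in-kernel and also handles memory-pressure detection and prefetching:&lt;/p&gt;

```python
from collections import OrderedDict

class HybridReadCache:
    """Toy model of the hybrid read path: values live on disk,
    and reads populate a bounded in-memory LRU cache."""
    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk                    # stands in for the on-disk store
        self.cache = OrderedDict()
        self.disk_reads = 0

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)     # cache hit: mark most-recent
            return self.cache[key]
        self.disk_reads += 1                # cache miss: fetch from "disk"
        value = self.disk[key]
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least-recently-used
        return value

c = HybridReadCache(2, {"a": 1, "b": 2, "c": 3})
c.get("a"); c.get("b"); c.get("a")   # third read is a cache hit
print(c.disk_reads)                  # 2
c.get("c")                           # evicts "b" (least recently used)
print("b" in c.cache, "a" in c.cache)  # False True
```

&lt;p&gt;The point of the model: repeated reads of a hot key cost a single disk read, while the memory footprint stays bounded by the cache capacity.&lt;/p&gt;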

&lt;p&gt;With this mechanism, HPKV can operate on machines with very low memory. For example, you can easily run HPKV on a t4g.nano instance with 1GB of memory!&lt;/p&gt;

&lt;p&gt;Now the same test as above but in hybrid memory-disk mode.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9nohtorxxep7p4w5uqw.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9nohtorxxep7p4w5uqw.webp" alt="400M keys local persisted" width="800" height="316"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are two important things to note here:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The test is reading each key only once, so the latency number is the time it takes to read the value from the disk.&lt;/li&gt;
&lt;li&gt;Since the test is running in single-thread mode, the performance is effectively limited by the disk random read speed at QD1 4KB. &lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;Here you can see the effect of the highly optimized custom file system that HPKV uses. 12 microseconds equals 83K IOPS or 320 MiB/s at QD1 4KB. This is very close to the theoretical max of the disk used in the test.&lt;/p&gt;
&lt;/blockquote&gt;
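&lt;p&gt;The arithmetic behind that statement checks out:&lt;/p&gt;

```python
# A 12 us average read at queue depth 1 means one 4 KiB read
# completes every 12 microseconds.
latency_s = 12e-6
iops = 1 / latency_s
print(int(iops))                      # 83333 IOPS
mib_per_s = iops * 4096 / (1024 * 1024)
print(int(mib_per_s))                 # ~325 MiB/s
```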

&lt;p&gt;HPKV's disk write speed is essentially limited by the disk's random write speed at QD32 4KB (430K IOPS or 1.6 GiB/s), since the write buffer system automatically scales up and spins up more workers to keep up with write requests. During the above test, the write phase took 100 seconds to complete and the full flush to disk took an additional 130 seconds.&lt;/p&gt;

&lt;p&gt;Now with local performance out of the way, let's dive into the over-the-network performance.&lt;/p&gt;
&lt;h2&gt;
  
  
  100M keys over the network (in-memory)
&lt;/h2&gt;

&lt;p&gt;The following shows the benchmark result for 100M keys with 100-byte values, 50 clients, and 16 pipelined operations, sent over the network.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F343drmf4v94xo89c0i0d.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F343drmf4v94xo89c0i0d.webp" alt="100M keys network" width="588" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, during the GET operation, HPKV was essentially limited by the network speed.&lt;/p&gt;

&lt;p&gt;It's worth noting that the Redis benchmark tool is used with the following parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;redis-benchmark -h xxx.xxx.xxx.xxx -n 100000000 -r 100000000 -t set,get -P 16 -q -c 50 -d 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We're using the &lt;code&gt;-r&lt;/code&gt; parameter to generate random keys in an effort to make the comparison fair to the RIOC benchmark tool that uses unique sequential keys; otherwise, &lt;code&gt;redis-benchmark&lt;/code&gt; would only try to update the same key over and over again.&lt;/p&gt;

&lt;h2&gt;
  
  
  100M keys over the network (hybrid memory-disk)
&lt;/h2&gt;

&lt;p&gt;The following shows the benchmark result for 100M keys with 100-byte values, 50 clients, and 16 pipelined operations, sent over the network. HPKV is running in hybrid memory-disk mode with hash buckets set to 2^28.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ge5z6kz6a1y50epto17.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ge5z6kz6a1y50epto17.webp" alt="100M keys network persisted" width="599" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this test you can see the effect of disk write speed on Redis write performance, while HPKV still performs the same as in in-memory mode. As for reads, the first pass is limited by the disk's random read speed and the number of parallel reads the RIOC server performs; on the second pass, with all keys cached in memory, performance matches in-memory mode.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance-to-cost ratio
&lt;/h2&gt;

&lt;p&gt;At the beginning of this article, we mentioned that performance is not about the numbers, it's about the context. What HPKV offers is a superior performance-to-cost ratio. &lt;/p&gt;

&lt;p&gt;On one side of the spectrum, HPKV can run in environments with very low memory and still provide excellent performance, only limited by disk speed, and on the other side of the spectrum, it can operate in memory-only mode and provide excellent nanosecond-range performance during local operation.&lt;/p&gt;

&lt;p&gt;This will be the focus of the next article, showing how much you can save by using HPKV instead of Redis, without compromising performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beta Testing HPKV Business Plan - For Free!
&lt;/h2&gt;

&lt;p&gt;As you might know, we haven't yet officially opened our Business plan, which offers RIOC; however, we're looking for a few early adopters and beta testers, free of charge.&lt;/p&gt;

&lt;p&gt;If this is something that interests you, please contact me with your use case at &lt;a href="mailto:mehran@hpkv.io"&gt;mehran@hpkv.io&lt;/a&gt; and I'll get back to you as soon as possible.&lt;/p&gt;

</description>
      <category>redis</category>
      <category>performance</category>
      <category>database</category>
      <category>network</category>
    </item>
    <item>
      <title>Searching among 3.2 Billion Common Crawl URLs with &lt;10µs lookup time and on a 48€/month server</title>
      <dc:creator>Mehran</dc:creator>
      <pubDate>Wed, 07 May 2025 20:19:17 +0000</pubDate>
      <link>https://dev.to/mehrant/searching-among-32-billion-common-crawl-urls-with-10us-lookup-time-and-on-a-48eumonth-server-7g0</link>
      <guid>https://dev.to/mehrant/searching-among-32-billion-common-crawl-urls-with-10us-lookup-time-and-on-a-48eumonth-server-7g0</guid>
      <description>&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;At its lowest level, the essence of computer science is manipulating data through logical operations to perform calculations, and every CS-related company in the world is racing to do more of it in less time.&lt;/p&gt;

&lt;p&gt;The challenge? Scale!&lt;/p&gt;

&lt;p&gt;As datasets grow linearly, the computational resources needed frequently grow super-linearly. For example, many graph algorithms and machine learning operations scale as O(n²) or worse with data size. A seemingly modest 10x increase in data can suddenly demand 100x or 1000x more computation. This scaling wall creates enormous technical and economic pressure on companies handling large datasets, forcing innovations in algorithms, hardware architectures, and distributed systems just to keep pace with expanding data volumes. The pursuit of efficient scaling drives much of the industry’s research and development spending as organizations struggle against these fundamental computational limits.&lt;/p&gt;
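&lt;p&gt;To make that concrete, a quick calculation (illustrative only):&lt;/p&gt;

```python
# If an algorithm is O(n^2), multiplying the data by 10
# multiplies the work by 100; for O(n^3) it's 1000x.
def work(n, exponent=2):
    return n ** exponent

n = 1_000_000
print(work(10 * n) / work(n))                          # 100.0
print(work(10 * n, exponent=3) / work(n, exponent=3))  # 1000.0
```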

&lt;h2&gt;
  
  
  How hard can it be?
&lt;/h2&gt;

&lt;p&gt;Two weeks ago, I was chatting with a friend about SEO, specifically about whether a given domain had been crawled by &lt;a href="https://commoncrawl.org/" rel="noopener noreferrer"&gt;Common Crawl&lt;/a&gt; and, if so, which URLs. After searching for a while, I realized there is no “true” search on the Common Crawl Index where you can get the list of URLs for a domain, or search for a term and get a list of domains whose URLs contain that term.&lt;br&gt;
Common Crawl is an &lt;a href="https://commoncrawl.github.io/cc-crawl-statistics/plots/crawlsize" rel="noopener noreferrer"&gt;extremely large&lt;/a&gt; dataset of more than 3 billion pages. Storing the URLs alone would require &amp;gt;400GB of storage, and finding a term like 'product' among them, or running a search like '&lt;a href="https://domain.tld/*search*" rel="noopener noreferrer"&gt;https://domain.tld/*search*&lt;/a&gt;', would require significant resources and time.&lt;br&gt;
In most cases, storing such a large dataset and, crucially, enabling search over it within a reasonable time is out of the budget of a fun weekend project; or is it?&lt;br&gt;
What if I did the sensible thing and pre-computed reverse indexes, as well as extracting domains from the URLs, to enable searching the way described above? Surely others have done that! No? But then how computationally expensive is that operation, and how do you store and serve the result to keep it reasonably priced and fast?&lt;br&gt;
On the surface, this sounds like a good fit for a KV store: terms are keys and domains are values, and there is a second set where domains/sub-domains are keys and URLs are values, right?&lt;br&gt;
Well, I set out to answer these questions for myself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pre-computation vs. Realtime
&lt;/h2&gt;

&lt;p&gt;There are various challenges involved with large datasets, one of which is deciding where and when to spend your computational resources. The god of large data is Time, and it demands its sacrifice. The only choice? Whether to “pay now” or “pay later”.&lt;br&gt;
Precomputation is the strategic investment approach — spending computational resources upfront to create optimized data structures and indexes that enable lightning-fast queries later. Like meal prepping on Sunday, you suffer once to enjoy quick access throughout the week. Realtime computation, on the other hand, is the just-in-time approach — calculating results on demand when a query arrives. This saves you upfront costs but can leave users drumming their fingers while waiting for results. With massive datasets like Common Crawl, the realtime approach would require either supercomputer-level hardware or users with the patience of digital monks. The pre-computation strategy, while initially resource-intensive, can transform an impossible problem into one solvable on modest hardware — if you’re clever about your data structures and algorithms.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Crawl Analyzer
&lt;/h2&gt;

&lt;p&gt;I started by creating a &lt;a href="https://github.com/mehrantsi/common-crawl-analyzer" rel="noopener noreferrer"&gt;set of tools (Github)&lt;/a&gt; to help me pre-compute the Common Crawl data. My idea was to create three tools: one to extract domains, one to extract URLs, and one to perform term-frequency analysis. The first two are reasonably simple and relatively fast, even on a modern laptop, but the real challenge was the term-frequency analysis. The idea is simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;For any given URL, tokenize it.&lt;/li&gt;
&lt;li&gt;Ignore all resource-identification patterns such as integers and UUIDs.&lt;/li&gt;
&lt;li&gt;Perform &lt;a href="https://en.wikipedia.org/wiki/Stemming" rel="noopener noreferrer"&gt;Stemming&lt;/a&gt; on each term.&lt;/li&gt;
&lt;li&gt;Create a map between terms and domains.&lt;/li&gt;
&lt;li&gt;Take the top 80% of the terms and sort each term’s domains by frequency in descending order.&lt;/li&gt;
&lt;li&gt;Create a CSV file of term, frequency, and domains.&lt;/li&gt;
&lt;/ol&gt;
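The per-URL steps above can be sketched as follows. This is a minimal illustration, not the implementation from the linked repo: the tokenizer, the resource-ID filter, and the toy suffix-stripping stemmer are simplified stand-ins.

```python
import re
from collections import Counter, defaultdict
from urllib.parse import urlparse

UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I
)

def stem(term):
    # Toy stemmer: strips a few common English suffixes (a stand-in
    # for a real stemming algorithm such as Porter's).
    for suffix in ("ing", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def terms_of(url):
    parts = urlparse(url)
    # 1. Tokenize the URL path on non-alphanumeric boundaries.
    tokens = re.split(r"[^0-9a-zA-Z]+", parts.path.lower())
    terms = []
    for tok in tokens:
        # 2. Ignore resource identifiers: empty tokens, integers, UUIDs.
        if not tok or tok.isdigit() or UUID_RE.match(tok):
            continue
        # 3. Stem the surviving term.
        terms.append(stem(tok))
    return parts.netloc, terms

def build_index(urls):
    # 4. Map each term to the domains it appears in, with frequencies.
    index = defaultdict(Counter)
    for url in urls:
        domain, terms = terms_of(url)
        for term in terms:
            index[term][domain] += 1
    return index

index = build_index([
    "https://shop.example.com/products/12345",
    "https://example.org/blog/testing-products",
])
# 5. Sort each term's domains by frequency, descending.
ranked = {t: [d for d, _ in c.most_common()] for t, c in index.items()}
```

From here, writing the term/frequency/domains CSV (step 6) is a straightforward walk over `ranked`.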

&lt;p&gt;Simple enough, right? Well, try doing that for 3+ billion URLs! It would require hundreds of GB of RAM and hundreds of hours of processing.&lt;br&gt;
I spent some time optimizing the algorithm to, among other things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Parallelize operations where it makes sense.&lt;/li&gt;
&lt;li&gt;Create intermediate merge indexes at each stage to avoid full scans.&lt;/li&gt;
&lt;li&gt;Create checkpoints, so processing can be resumed if anything goes wrong.&lt;/li&gt;
&lt;li&gt;Only load the working batch in memory.&lt;/li&gt;
&lt;/ol&gt;
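Points 3 and 4 boil down to a checkpoint-and-batch loop. Here is a minimal sketch of that pattern; the function and file names are illustrative, and the real tooling in the repo is considerably more involved:

```python
import json
import os
import tempfile

def process_in_batches(items, batch_size, work, checkpoint_path):
    # Resume from the last completed batch if a checkpoint exists.
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_batch"]
    n_batches = (len(items) + batch_size - 1) // batch_size
    for batch_no in range(start, n_batches):
        # Only the working batch is materialized in memory.
        batch = items[batch_no * batch_size : (batch_no + 1) * batch_size]
        work(batch_no, batch)
        # Record progress so a crash resumes here, not from scratch.
        with open(checkpoint_path, "w") as f:
            json.dump({"next_batch": batch_no + 1}, f)

processed = []
ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
process_in_batches(list(range(10)), 4, lambda n, b: processed.extend(b), ckpt)
```

With real data, `items` would be a lazy iterator over index files rather than an in-memory list, but the resume logic stays the same.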

&lt;p&gt;With these optimizations, I could process all 3.2 billion URLs with a max RAM usage of 15 GB, in ~72 hours, on my laptop (MacBook Pro M3 Max — 64 GB). David can indeed take on Goliath.&lt;/p&gt;

&lt;h2&gt;
  
  
  Storage and Serving
&lt;/h2&gt;

&lt;p&gt;After pre-computing the data, I ended up with 137 million domains, 3.2 billion URLs and -thanks to &lt;a href="https://en.wikipedia.org/wiki/Zipf%27s_law" rel="noopener noreferrer"&gt;Zipf’s law&lt;/a&gt;- 168,000 terms covering the top 80% of term occurrences, out of 17+ billion computed, sanitized terms. My idea now was to store this data in a KV store and add an API on top. The user can either search for, say, “https://sub.domain.tld/*performance*”, where I extract “sub.domain.tld”, use it as the key, retrieve the list of URLs for that domain, and perform a wildcard search in memory to keep things simple; or search for a term like “product”, where I retrieve the list of domains for the stemmed input, and the user can click on a domain to get that domain’s URL list. This means 100+ million keys, capped at 4MB max values, occupying ~400GB of raw data, on a single node. I still wanted lookups in the order of tens of microseconds, so that the bottleneck becomes API throughput and network latency; because in the era of instant gratification, multi-second searches feel like watching paint dry.&lt;/p&gt;
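The in-memory wildcard step described above can be sketched like this; `lookup` stands in for the actual KV GET, and the stored URL list is illustrative:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def wildcard_search(pattern, lookup):
    # Extract the (sub)domain part of the pattern and use it as the KV key.
    domain = urlparse(pattern).netloc
    urls = lookup(domain)  # one KV GET: domain -> list of URLs
    # Filter the retrieved list in memory with the wildcard pattern.
    return [u for u in urls if fnmatch(u, pattern)]

# Toy stand-in for the KV store:
store = {"sub.domain.tld": [
    "https://sub.domain.tld/docs/performance-tuning",
    "https://sub.domain.tld/about",
]}
hits = wildcard_search("https://sub.domain.tld/*performance*", store.get)
```

Because the value for one domain is at most 4MB, this linear in-memory filter stays cheap relative to the network round trip.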

&lt;h2&gt;
  
  
  Microseconds matter!
&lt;/h2&gt;

&lt;p&gt;You might argue that if we’re serving data over the network, whatever performance we gain by carefully choosing and tuning our storage layer is dwarfed by the network latency, which looks like an eternity compared to the actual data retrieval. Well, it depends on your load. In other words: how many requests are you planning to serve in any given second? How much of your data retrieval is CPU bound and how much is I/O bound? All those microseconds you save, over tens of thousands of requests per second, add up and become significant enough to force you to scale horizontally, multiplying your runtime cost. Those short microseconds can save you thousands of dollars.&lt;/p&gt;
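To make that concrete, here is a back-of-the-envelope calculation. The worker count and target load are illustrative, and it assumes lookups are fully serialized per worker:

```python
import math

def max_throughput(lookup_seconds, workers):
    # Each fully-serialized worker serves at most 1/lookup_seconds requests/s.
    return workers / lookup_seconds

def nodes_needed(target_rps, per_node_rps):
    return math.ceil(target_rps / per_node_rps)

# ~4ms per uncached read vs. ~10µs, with 16 workers per node:
slow_node = max_throughput(4e-3, 16)    # ~4,000 req/s per node
fast_node = max_throughput(10e-6, 16)   # ~1,600,000 req/s per node
# Serving 50,000 req/s takes 13 nodes with the slow store, 1 with the fast one.
```

The per-node runtime cost then multiplies by that node count, which is exactly where the saved microseconds turn into saved dollars.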

&lt;h2&gt;
  
  
  Redis
&lt;/h2&gt;

&lt;p&gt;Storing 400GB of raw data in Redis would cost you a minimum of &lt;a href="https://redis.io/pricing/calculator/" rel="noopener noreferrer"&gt;$6,800 per month&lt;/a&gt; with no high availability. This is extremely expensive and, to make matters worse, the read latency on such a large dataset is 4–5ms. That might sound reasonable, but given enough load, your KV store quickly becomes a bottleneck and forces you to set up more nodes, doubling and tripling the runtime cost. It’s like trying to fit an elephant into a studio apartment — technically possible, but your landlord (and wallet) will hate you.&lt;/p&gt;

&lt;h2&gt;
  
  
  RocksDB
&lt;/h2&gt;

&lt;p&gt;RocksDB is a disk-based KV store, which can reduce our cost quite a bit. Here we don’t need servers with very large amounts of RAM, so we can try a cheaper machine like an &lt;a href="https://www.hetzner.com/dedicated-rootserver/ex44/" rel="noopener noreferrer"&gt;EX44&lt;/a&gt; from Hetzner, at around 48€ per month (incl. VAT). RocksDB performs well on this machine, and since it’s embedded and carries no network overhead, we observe similar (4–5ms uncached) or better performance when the value is cached (~1ms). This is a great performance-to-cost ratio compared to Redis and makes it more economically feasible to add nodes to support higher load. Think of RocksDB as the sensible mid-size sedan of the database world — not flashy, but it gets the job done without emptying your bank account.&lt;/p&gt;

&lt;h2&gt;
  
  
  HPKV
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://hpkv.io/" rel="noopener noreferrer"&gt;HPKV&lt;/a&gt; is a high-performance KV store that it doesn’t use the term “high-performance” lightly! Take a look at its in-memory performance for that matter. HPKV is an attempt to close the gap between in-memory and disk-based KV stores and promises the best performance to cost ratio in the market.&lt;br&gt;
HPKV can perform really well on the same machine as RocksDB (48€ per month EX44 from Hetzner), using only 10GB of RAM and automatically adjusting cache size usage based on access patterns and &lt;a href="https://x.com/mltoosi/status/1914244856630608078" rel="noopener noreferrer"&gt;extremely fast random disk reads&lt;/a&gt; (10–14µs for QD1–4KB) thanks to its custom filesystem. This means HPKV can provide Lookup times of around 8µs average and disk read time of 1–2ms for 4MB values. This puts HPKV at ~4 times the cost to performance ratio of RocksDB. If RocksDB is a sedan, HPKV is that one friend who somehow modified their Toyota to outperform a Ferrari while still getting fantastic gas mileage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it for yourself!
&lt;/h2&gt;

&lt;p&gt;After these tests, I created a simple API and a simple search page and hosted them on Cloudflare Pages+Workers. Give it a try at &lt;a href="https://search.hpkv.io/" rel="noopener noreferrer"&gt;search.hpkv.io&lt;/a&gt; and let me know what you think!&lt;br&gt;
Who knew you could wrangle 3.2 billion URLs into submission with just 48€ a month? That’s less than what most people spend on coffee!&lt;/p&gt;

</description>
      <category>computerscience</category>
      <category>redis</category>
      <category>commoncrawl</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Remote Memory MCP Server for Cursor IDE</title>
      <dc:creator>Mehran</dc:creator>
      <pubDate>Fri, 18 Apr 2025 11:30:23 +0000</pubDate>
      <link>https://dev.to/mehrant/remote-memory-mcp-server-for-cursor-ide-1l40</link>
      <guid>https://dev.to/mehrant/remote-memory-mcp-server-for-cursor-ide-1l40</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3p6n7qeg1ijockp620h5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3p6n7qeg1ijockp620h5.jpg" alt="Memory MCP Server" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The hallmark of modern LLMs is their ability to follow conversations naturally. But even the most advanced models have a fundamental limitation: their context window. Once information falls outside this window, it's forgotten. For AI assistants and agents to be truly useful, they need persistent memory systems - the ability to recall past interactions, preferences, and context over extended periods.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Memory Problem for AI Systems
&lt;/h2&gt;

&lt;p&gt;As a developer working with LLMs, you've likely experienced this frustrating scenario: you're building a feature with an AI assistant, and halfway through, it confidently references a function you've "already defined" - except that function doesn't exist. Or perhaps it suggests using a library method that sounds plausible but was completely hallucinated. These aren't just occasional quirks - they're symptoms of the fundamental memory problem in LLMs.&lt;/p&gt;

&lt;p&gt;Without persistent memory, LLMs operate with a kind of functional amnesia that severely limits their usefulness in real development workflows:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hallucinating Non-existent Code&lt;/strong&gt;: "You can simply use the &lt;code&gt;parseConfigWithFallback()&lt;/code&gt; function" - except no such function exists.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forgetting Failed Approaches&lt;/strong&gt;: Explaining why a certain approach won't work, only to have the model suggest the exact same approach again an hour later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inconsistent Project Understanding&lt;/strong&gt;: The model gives architecture recommendations based on one understanding of your project, then contradicts itself in the next session with completely different assumptions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reinventing the Wheel&lt;/strong&gt;: You've established a project-specific pattern for handling certain tasks, but the model keeps suggesting alternative approaches because it's forgotten what was already decided.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forgetting API Limitations&lt;/strong&gt;: "Let's use the streaming API for this" - after you've already explained twice that the API doesn't support streaming.&lt;/p&gt;

&lt;p&gt;These memory-related failures become exponentially more problematic in long-running projects. Without the ability to remember past interactions, LLMs can't truly function as collaborative partners in development. Each new conversation essentially resets their understanding, forcing developers to repeatedly re-establish context, correct the same misconceptions, and rebuild shared knowledge.&lt;/p&gt;

&lt;p&gt;The standard solution has been to expand context windows - from 32K tokens to 200K and beyond. But this approach has significant limitations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: Larger contexts mean higher inference costs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relevance&lt;/strong&gt;: Most historical content isn't relevant to the current query&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Focus&lt;/strong&gt;: Too much context creates "needle in a haystack" problems for the model&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What's needed is selective, persistent memory that can be retrieved based on relevance rather than recency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing the MCP Memory Server
&lt;/h2&gt;

&lt;p&gt;Today, we're excited to announce the general availability of our &lt;strong&gt;MCP Memory Server&lt;/strong&gt; - a ready-to-use system that gives LLMs true long-term memory. Built on HPKV and Nexus Search, it implements the Model Context Protocol (MCP) to seamlessly integrate with supported models.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is MCP?
&lt;/h3&gt;

&lt;p&gt;The Model Context Protocol (MCP) is an emerging standard for extending AI model capabilities through external services. Our MCP Memory Server implements a specialized protocol for memory management, allowing any compatible AI system to store and retrieve memories without needing to handle the storage and semantic search logic themselves.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works: The MCP Memory API
&lt;/h2&gt;

&lt;p&gt;The idea is simple: give the LLM the necessary tools to store and retrieve memories. But this is not a simple KV operation, as the model cannot "remember" what key it used in the first place, nor can it know which keys are relevant.&lt;/p&gt;

&lt;p&gt;Behind the scenes, we leverage HPKV's Nexus Search, which provides semantic and natural-language search on top of the KV pairs you store in HPKV. When the model stores a memory with a key and a value, Nexus Search creates embeddings and vectorizes the memory together with the value. Later, when the model searches for a memory, it can either search keys semantically, performing a similarity search to get a list of keys to retrieve, or ask a natural-language question, in which case Nexus Search retrieves the relevant information, feeds it to an internal LLM together with the question, and answers based on contextual summarization. The best part? Since this is all remote, these memories are shared across tools and devices!&lt;/p&gt;

&lt;p&gt;The MCP Memory Server provides four key functions:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Store Memory
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: Creates a new memory entry from a conversation exchange&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key features&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Organizes memories by project and session&lt;/li&gt;
&lt;li&gt;Maintains sequential ordering with sequence numbers&lt;/li&gt;
&lt;li&gt;Stores both user requests and assistant responses&lt;/li&gt;
&lt;li&gt;Supports optional metadata for better retrieval&lt;/li&gt;
&lt;li&gt;Automatically creates a key in the format: &lt;code&gt;project_name_date_session_name_sequence_number&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
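The documented key format, &lt;code&gt;project_name_date_session_name_sequence_number&lt;/code&gt;, can be illustrated with a small helper. Note this helper is purely illustrative and not part of the server API:

```python
from datetime import date

def memory_key(project, session, sequence, day=None):
    # Hypothetical helper: builds a key in the documented format
    # project_name_date_session_name_sequence_number.
    day = day or date.today().isoformat()
    return f"{project}_{day}_{session}_{sequence}"

key = memory_key("hpkv", "refactor-auth", 3, day="2025-04-18")
# e.g. "hpkv_2025-04-18_refactor-auth_3"
```

The date and sequence number in the key give memories a natural chronological ordering within each project and session.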

&lt;h3&gt;
  
  
  2. Search Memory (Semantic Query)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: Performs AI-powered natural language search over stored memories&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key features&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses semantic understanding to find relevant past exchanges&lt;/li&gt;
&lt;li&gt;Returns a generated summary of relevant information&lt;/li&gt;
&lt;li&gt;Includes source memory keys with confidence scores&lt;/li&gt;
&lt;li&gt;Handles complex natural language queries&lt;/li&gt;
&lt;li&gt;Intelligently combines information from multiple memory entries&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Search Keys (Vector Similarity)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: Finds semantically similar memory keys based on vector similarity&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key features&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Returns a ranked list of memory keys matching the query&lt;/li&gt;
&lt;li&gt;Configurable number of results with the &lt;code&gt;topK&lt;/code&gt; parameter&lt;/li&gt;
&lt;li&gt;Adjustable similarity threshold with &lt;code&gt;minScore&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Faster than full semantic search when you only need keys&lt;/li&gt;
&lt;li&gt;Perfect for finding related conversations without generating summaries&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Get Memory (Exact Key Retrieval)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: Retrieves a specific memory by its exact key&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key features&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Direct access to a specific memory entry&lt;/li&gt;
&lt;li&gt;Returns the complete memory object including metadata&lt;/li&gt;
&lt;li&gt;Useful when you already know which memory you need&lt;/li&gt;
&lt;li&gt;Can be combined with search keys for two-stage retrieval&lt;/li&gt;
&lt;/ul&gt;
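The two-stage retrieval mentioned above (search keys first, then fetch by exact key) can be sketched as follows; `search_keys` and `get_memory` are injected stand-ins for the actual MCP tool calls, and the stored memory is made up for illustration:

```python
def two_stage_retrieve(query, search_keys, get_memory, top_k=3, min_score=0.5):
    # Stage 1: vector-similarity search returns (key, score) pairs.
    candidates = search_keys(query, top_k=top_k)
    # Stage 2: fetch full memory objects only for keys above the threshold.
    return [get_memory(key) for key, score in candidates if score >= min_score]

# Toy stand-ins for the real MCP tools:
memories = {"proj_2025-04-18_auth_1": {"user": "use JWT", "assistant": "ok"}}
fake_search = lambda q, top_k: [("proj_2025-04-18_auth_1", 0.91)]
result = two_stage_retrieve("auth decisions", fake_search, memories.get)
```

Splitting the cheap key search from the full retrieval is what makes this faster than running a full semantic search when you only need a handful of memories.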

&lt;h2&gt;
  
  
  Real-World Implementation: Cursor IDE
&lt;/h2&gt;

&lt;p&gt;One of the first major use cases of the MCP Memory Server is in Cursor IDE, where AI coding assistance requires persistent understanding of project context, user preferences, and past interactions.&lt;/p&gt;

&lt;p&gt;With the MCP Memory Server, Cursor's AI assistant can:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Remember project structures and conventions across sessions&lt;/li&gt;
&lt;li&gt;Recall user preferences for coding style and patterns&lt;/li&gt;
&lt;li&gt;Reference previous explanations and decisions&lt;/li&gt;
&lt;li&gt;Build on past problem-solving approaches&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Cursor will use semantic memory search to intelligently retrieve relevant past conversations based on what the developer is currently working on - without forcing them to manually manage or reference this context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;The MCP Memory Server is available now for all HPKV users. Here's how to start using it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sign up for an &lt;a href="https://hpkv.io/signup" rel="noopener noreferrer"&gt;HPKV account&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Generate an API key in your dashboard&lt;/li&gt;
&lt;li&gt;Integrate the MCP Memory tools in your AI application&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Adding MCP Memory to Cursor IDE
&lt;/h3&gt;

&lt;p&gt;Edit your &lt;code&gt;mcp.json&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"hpkv-memory-server"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://memory.hpkv.io/sse"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After adding the MCP Memory Server, you'll be notified to login to your HPKV account. Once you do, you'll see a list of your API keys. Select the one you want to use to authenticate with the MCP Memory Server.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adding MCP Memory to Claude Code
&lt;/h3&gt;

&lt;p&gt;Use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude mcp add &lt;span class="nt"&gt;-s&lt;/span&gt; user &lt;span class="nt"&gt;-t&lt;/span&gt; sse hpkv-memory-server https://memory.hpkv.io/sse
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, open Claude Code, type &lt;code&gt;/mcp&lt;/code&gt;, and follow the instructions to authenticate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cursor Rules/CLAUDE.md
&lt;/h2&gt;

&lt;p&gt;To integrate the MCP Memory Server seamlessly with Cursor, we created a &lt;a href="https://raw.githubusercontent.com/hpkv-io/memory-mcp-server/refs/heads/main/memory_tool_usage_guide.mdc" rel="noopener noreferrer"&gt;rule document&lt;/a&gt; that you can add to your Cursor project with the rule type set to &lt;code&gt;Always&lt;/code&gt;, or simply append to the end of your &lt;code&gt;CLAUDE.md&lt;/code&gt; file in the case of Claude Code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3e0cer8iai94e1p5pzi5.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3e0cer8iai94e1p5pzi5.webp" alt="Cursor Project Rules" width="800" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can find more information on &lt;a href="https://github.com/hpkv-io/memory-mcp-server" rel="noopener noreferrer"&gt;Github&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond the Code: New Interaction Paradigms
&lt;/h2&gt;

&lt;p&gt;The MCP Memory Server enables entirely new interaction paradigms:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Truly Personalized Experiences&lt;/strong&gt;: Systems that remember user preferences, past challenges, and successful solutions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Continuous Learning Agents&lt;/strong&gt;: Agents that improve over time by remembering what worked and what didn't&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cross-Session Coherence&lt;/strong&gt;: Maintaining consistent understanding and personality across multiple interactions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Self-Reflection Capabilities&lt;/strong&gt;: Agents that can review their past actions and refine their approach&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Try It Yourself!
&lt;/h2&gt;

&lt;p&gt;Want to try it yourself? The MCP Memory Server is available on all HPKV plans, including our &lt;a href="https://hpkv.io/pricing" rel="noopener noreferrer"&gt;free tier&lt;/a&gt; with 100 calls/month. For production applications, our Pro and Business tiers provide higher limits and advanced features.&lt;/p&gt;

&lt;p&gt;With the free tier, the memories are stored for 30 days only.&lt;/p&gt;

</description>
      <category>cursor</category>
      <category>mcp</category>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>Would like to hear your thoughts on this :)</title>
      <dc:creator>Mehran</dc:creator>
      <pubDate>Thu, 03 Apr 2025 10:17:54 +0000</pubDate>
      <link>https://dev.to/mehrant/would-like-to-hear-your-thoughts-on-this--2e3a</link>
      <guid>https://dev.to/mehrant/would-like-to-hear-your-thoughts-on-this--2e3a</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/mehrant/nexus-search-rag-powered-semantic-search-for-hpkv-58c4" class="crayons-story__hidden-navigation-link"&gt;Nexus Search: RAG-Powered Semantic Search for HPKV&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/mehrant" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2990366%2F4dd12fc4-fd22-4b56-86dd-6da3bc77664d.jpg" alt="mehrant profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/mehrant" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Mehran
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Mehran
                
              
              &lt;div id="story-author-preview-content-2371726" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/mehrant" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2990366%2F4dd12fc4-fd22-4b56-86dd-6da3bc77664d.jpg" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Mehran&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/mehrant/nexus-search-rag-powered-semantic-search-for-hpkv-58c4" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Apr 1 '25&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/mehrant/nexus-search-rag-powered-semantic-search-for-hpkv-58c4" id="article-link-2371726"&gt;
          Nexus Search: RAG-Powered Semantic Search for HPKV
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ai&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/database"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;database&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/rag"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;rag&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
            &lt;a href="https://dev.to/mehrant/nexus-search-rag-powered-semantic-search-for-hpkv-58c4#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            8 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
      <category>ai</category>
      <category>database</category>
      <category>rag</category>
    </item>
    <item>
      <title>AI-powered semantic search in a KV database</title>
      <dc:creator>Mehran</dc:creator>
      <pubDate>Tue, 01 Apr 2025 17:24:49 +0000</pubDate>
      <link>https://dev.to/mehrant/ai-powered-semantic-search-in-a-kv-database-m10</link>
      <guid>https://dev.to/mehrant/ai-powered-semantic-search-in-a-kv-database-m10</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/mehrant/nexus-search-rag-powered-semantic-search-for-hpkv-58c4" class="crayons-story__hidden-navigation-link"&gt;Nexus Search: RAG-Powered Semantic Search for HPKV&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/mehrant" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2990366%2F4dd12fc4-fd22-4b56-86dd-6da3bc77664d.jpg" alt="mehrant profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/mehrant" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Mehran
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Mehran
                
              
              &lt;div id="story-author-preview-content-2371726" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/mehrant" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2990366%2F4dd12fc4-fd22-4b56-86dd-6da3bc77664d.jpg" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Mehran&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/mehrant/nexus-search-rag-powered-semantic-search-for-hpkv-58c4" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Apr 1 '25&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/mehrant/nexus-search-rag-powered-semantic-search-for-hpkv-58c4" id="article-link-2371726"&gt;
          Nexus Search: RAG-Powered Semantic Search for HPKV
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ai&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/database"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;database&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/rag"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;rag&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
            &lt;a href="https://dev.to/mehrant/nexus-search-rag-powered-semantic-search-for-hpkv-58c4#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            8 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
      <category>ai</category>
      <category>database</category>
      <category>rag</category>
    </item>
    <item>
      <title>Nexus Search: RAG-Powered Semantic Search for HPKV</title>
      <dc:creator>Mehran</dc:creator>
      <pubDate>Tue, 01 Apr 2025 17:22:01 +0000</pubDate>
      <link>https://dev.to/mehrant/nexus-search-rag-powered-semantic-search-for-hpkv-58c4</link>
      <guid>https://dev.to/mehrant/nexus-search-rag-powered-semantic-search-for-hpkv-58c4</guid>
      <description>&lt;p&gt;In traditional key-value stores, finding data relies on knowing exact keys or key patterns. This works well for structured, predictable access patterns, but falls short when dealing with unstructured content or natural language queries. This article introduces Nexus Search, our solution for adding semantic understanding to HPKV.&lt;/p&gt;

&lt;h2&gt;What is Nexus Search?&lt;/h2&gt;

&lt;p&gt;Nexus Search adds Retrieval Augmented Generation (RAG) capabilities to HPKV, enabling semantic search and AI-powered question answering over your key-value data. Unlike traditional key-based access, Nexus Search understands the meaning of your content, allowing natural language queries and intelligent information retrieval.&lt;/p&gt;

&lt;p&gt;At its core, Nexus Search combines two powerful capabilities:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Semantic Search&lt;/strong&gt;: Finding records based on meaning rather than exact key matches&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Question Answering&lt;/strong&gt;: Generating natural language responses by analyzing relevant records&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These capabilities unlock entirely new ways to interact with your data, transforming HPKV from a simple storage system into an intelligent knowledge base.&lt;/p&gt;

&lt;h2&gt;The Architecture&lt;/h2&gt;

&lt;p&gt;Nexus Search operates alongside your existing HPKV storage, adding a semantic layer without compromising the performance characteristics that make HPKV valuable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw04l2t5d1i37wmbuglmk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw04l2t5d1i37wmbuglmk.png" alt="Architecture" width="764" height="864"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you write data to HPKV, Nexus Search automatically processes your content:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Your key-value data is stored in HPKV as usual&lt;/li&gt;
&lt;li&gt;The text content is converted into vector embeddings (numerical representations that capture meaning)&lt;/li&gt;
&lt;li&gt;These embeddings are stored in a specialized vector database&lt;/li&gt;
&lt;li&gt;When you search or query, your input is converted to the same vector format&lt;/li&gt;
&lt;li&gt;The system finds the most similar vectors to your query&lt;/li&gt;
&lt;li&gt;For queries, an AI model generates a natural language response based on the retrieved content&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This process happens automatically in the background whenever you add or update data through the standard HPKV API.&lt;/p&gt;
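&lt;p&gt;The matching in steps 4–6 can be pictured as comparing embedding vectors. The sketch below is illustrative only: the embedding model and vector index are internal to Nexus Search, but cosine similarity is the standard way such similarity scores are computed.&lt;/p&gt;

```javascript
// Illustrative sketch: cosine similarity between two embedding vectors.
// Scores like 0.87 in the API responses come from a comparison of this kind;
// the actual model and index are internal to Nexus Search.
function cosineSimilarity(a, b) {
  const dot = a.reduce(function (sum, x, i) { return sum + x * b[i]; }, 0);
  const normA = Math.hypot.apply(null, a);
  const normB = Math.hypot.apply(null, b);
  return dot / (normA * normB);
}

// Identical vectors score near 1.0; unrelated (orthogonal) vectors score 0.
const same = cosineSimilarity([0.1, 0.7, 0.2], [0.1, 0.7, 0.2]);
const different = cosineSimilarity([1, 0], [0, 1]); // 0
```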

&lt;h2&gt;Implementation Details&lt;/h2&gt;

&lt;p&gt;Nexus Search exposes two main endpoints:&lt;/p&gt;

&lt;h3&gt;Search Endpoint&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;/search&lt;/code&gt; endpoint finds records semantically similar to a given query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST /search
Content-Type: application/json
X-Api-Key: YOUR_API_KEY

{
  "query": "Your natural language search query",
  "topK": 5,         // Optional: number of results to return (default: 5)
  "minScore": 0.5    // Optional: minimum similarity score threshold (default: 0.5)
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response includes matching keys and their similarity scores:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"results"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"article:123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.87&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"product:456"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.76&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
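
&lt;p&gt;Calling &lt;code&gt;/search&lt;/code&gt; from JavaScript follows the same &lt;code&gt;fetch&lt;/code&gt; pattern shown for &lt;code&gt;/query&lt;/code&gt; later in this article. As a minimal sketch, the helper below only assembles the request described above; the host, headers, and payload shape are taken from the examples in this post:&lt;/p&gt;

```javascript
// Sketch: build the /search request described above.
// Host, headers, and body fields mirror the examples in this article.
function buildSearchRequest(query, topK, minScore) {
  return {
    url: "https://nexus.hpkv.io/search",
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "X-Api-Key": "YOUR_API_KEY"
      },
      body: JSON.stringify({ query: query, topK: topK, minScore: minScore })
    }
  };
}

// Usage (network call commented out so the sketch stays self-contained):
const req = buildSearchRequest("a laptop suitable for video editing", 5, 0.5);
// const res = await fetch(req.url, req.options);
// const results = (await res.json()).results;
```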



&lt;h3&gt;Query Endpoint&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;/query&lt;/code&gt; endpoint provides AI-generated answers to questions about your data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST /query
Content-Type: application/json
X-Api-Key: YOUR_API_KEY

{
  "query": "Your natural language question",
  "topK": 5,         // Optional: number of relevant records to use (default: 5)
  "minScore": 0.5    // Optional: minimum similarity score threshold (default: 0.5)
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response includes both the answer and the source records used to generate it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"answer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The AI-generated answer to your question based on your data"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sources"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"article:123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.87&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"product:456"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.76&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Practical Use Cases&lt;/h2&gt;

&lt;p&gt;Let's explore two practical applications of Nexus Search: log analysis and semantic product filtering.&lt;/p&gt;

&lt;h3&gt;Log Analysis&lt;/h3&gt;

&lt;p&gt;Application logs contain valuable information, but finding relevant events can be challenging, especially when you don't know the exact patterns to search for. Nexus Search transforms log analysis by allowing natural language queries over log data.&lt;/p&gt;

&lt;p&gt;Consider this sample log data stored in HPKV:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;key: 20231004123456, value: "2023-10-04 12:34:56 INFO [WebServer] GET /index.html 200 123.456.789.012"
key: 20231004123510, value: "2023-10-04 12:35:10 ERROR [WebServer] GET /nonexistent.html 404 123.456.789.012"
key: 20231004123600, value: "2023-10-04 12:36:00 INFO [AuthService] User 'john_doe' authenticated successfully from IP 123.456.789.012"
key: 20231004123605, value: "2023-10-04 12:36:05 WARN [AuthService] Authentication failed for user 'john_doe' from IP 123.456.789.012"
key: 20231004123700, value: "2023-10-04 12:37:00 INFO [OrderService] User 'john_doe' placed an order with ID 12345"
key: 20231004123705, value: "2023-10-04 12:37:05 ERROR [OrderService] Failed to process order for user 'john_doe': Insufficient funds"
key: 20231004123805, value: "2023-10-04 12:38:05 ERROR [Database] Query failed: INSERT INTO orders ... - Duplicate entry"
key: 20231004123900, value: "2023-10-04 12:39:00 INFO [PaymentGateway] Initiated payment for order 12345"
key: 20231004123910, value: "2023-10-04 12:39:10 ERROR [PaymentGateway] Payment failed for order 12345: Invalid card number"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With traditional key-value access, you would need to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Know which keys to check or scan a range of keys&lt;/li&gt;
&lt;li&gt;Manually filter the results for relevant content&lt;/li&gt;
&lt;li&gt;Piece together related events from different logs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With Nexus Search, you can simply ask natural language questions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Query: "What happened with user john_doe's order?"&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://nexus.hpkv.io/query&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;X-Api-Key&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;YOUR_API_KEY&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;What happened with user john_doe's order?&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"answer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"According to the log, here's what happened with user john_doe's order:

1. At 12:37:00, user john_doe placed an order with ID 12345.
2. At 12:37:05, the order failed to process due to insufficient funds.
3. At 12:39:00, the payment gateway initiated payment for the order.
4. At 12:39:10, the payment failed due to an invalid card number.

So, unfortunately, the order was not successfully processed and paid for."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sources"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"20231004123700"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.8413863&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"20231004123705"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.8051659&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"20231004123900"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.7073802&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"20231004123910"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.6628850&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nexus Search automatically identifies relevant log entries, understands their relationships, and synthesizes a coherent explanation—without requiring you to know log formats or key patterns.&lt;/p&gt;

&lt;p&gt;Our users have applied this pattern to analyze:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Application error logs to troubleshoot issues&lt;/li&gt;
&lt;li&gt;Access logs to identify security concerns&lt;/li&gt;
&lt;li&gt;Transaction logs to trace order problems&lt;/li&gt;
&lt;li&gt;System metrics to investigate performance issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach drastically reduces time-to-resolution for complex issues that span multiple systems or time periods.&lt;/p&gt;

&lt;h3&gt;Semantic Product Filtering&lt;/h3&gt;

&lt;p&gt;E-commerce platforms typically store product information across various systems. While structured queries work for exact attribute matching, finding products based on natural language descriptions is much harder.&lt;/p&gt;

&lt;p&gt;Consider an e-commerce site with product specifications stored in HPKV:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;key:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;product:&lt;/span&gt;&lt;span class="mi"&gt;1001&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"UltraBook Pro X1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Laptops"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"specs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"processor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Intel Core i7-1280P, 14 cores (6P+8E), up to 4.8GHz"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"memory"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"32GB DDR5-4800"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"storage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1TB NVMe SSD, PCIe Gen4x4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"display"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"14-inch OLED, 2880x1800, 90Hz, 400 nits, 100% DCI-P3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"graphics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Intel Iris Xe Graphics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"battery"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"72Wh, up to 15 hours"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ports"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"2x Thunderbolt 4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1x USB-A 3.2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"HDMI 2.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"3.5mm combo jack"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.3kg"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;key:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;product:&lt;/span&gt;&lt;span class="mi"&gt;1002&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PowerBook Studio"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Laptops"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"specs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"processor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AMD Ryzen 9 6900HX, 8 cores, up to 4.9GHz"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"memory"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"64GB DDR5-5200"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"storage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2TB NVMe SSD, RAID 0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"display"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"16-inch Mini-LED, 3456x2234, 120Hz, 1000 nits, 100% DCI-P3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"graphics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AMD Radeon 680M + NVIDIA RTX 3080 Ti 16GB"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"battery"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"90Wh, up to 12 hours"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ports"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"3x USB-C 4.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1x USB-A 3.2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SD Card Reader"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"HDMI 2.1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"3.5mm combo jack"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2.2kg"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;key:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;product:&lt;/span&gt;&lt;span class="mi"&gt;1003&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"MacBook Air"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Laptops"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"specs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"processor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Apple M2, 8-core CPU, 8-core GPU"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"memory"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"16GB unified memory"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"storage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"512GB SSD"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"display"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"13.6-inch IPS, 2560x1664, 60Hz, 500 nits, P3 wide color"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"graphics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Integrated 8-core GPU"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"battery"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"52.6Wh, up to 18 hours"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ports"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"2x Thunderbolt 3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"MagSafe 3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"3.5mm headphone jack"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.24kg"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Traditional KV store operations would require one of the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Maintaining separate indices for each queryable attribute&lt;/li&gt;
&lt;li&gt;Scanning all products and filtering in application code&lt;/li&gt;
&lt;li&gt;Implementing a specialized search engine alongside the KV store&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With Nexus Search, you can simply search using natural language:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Find a laptop suitable for video editing&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://nexus.hpkv.io/search&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;X-Api-Key&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;YOUR_API_KEY&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;powerful laptop for professional video editing and rendering&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"results"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"product:1002"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.89&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"product:1001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.76&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"product:1003"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.64&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The results are ranked by semantic relevance to the query. The PowerBook Studio ranks highest because its specs (powerful discrete GPU, large amount of RAM, high core count CPU) align well with video editing requirements, even though the product description never explicitly mentions "video editing."&lt;/p&gt;

&lt;p&gt;Once you have the matching keys, you can retrieve the full product details using standard HPKV operations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Get full details for the best matching product&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;productKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;productDetails&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;hpkvClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;productKey&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nf"&gt;displayProduct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;productDetails&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach combines the benefits of fast key-value lookups with the flexibility of semantic search:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Store all product data in HPKV as usual&lt;/li&gt;
&lt;li&gt;Use Nexus Search to find relevant products based on natural language&lt;/li&gt;
&lt;li&gt;Retrieve and display only the needed products&lt;/li&gt;
&lt;/ol&gt;
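&lt;p&gt;As a rough sketch, these three steps can be combined into one helper. The &lt;code&gt;search&lt;/code&gt; and &lt;code&gt;get&lt;/code&gt; functions stand in for the Nexus Search request and the HPKV client call from the earlier snippets; injecting them this way is purely an illustration pattern, not part of the HPKV API:&lt;/p&gt;

```javascript
// Steps 1-3 as one flow: data is assumed stored already (step 1);
// `search` performs the Nexus Search request and `get` the HPKV lookup,
// as in the earlier snippets. Both are injected here for illustration.
async function findProducts(search, get, query, limit = 3) {
  const results = await search(query);   // step 2: semantic search
  const top = results.slice(0, limit);   // keep only the products we need
  return Promise.all(                    // step 3: fetch the full records
    top.map(async r => ({ ...r, product: JSON.parse(await get(r.key)) }))
  );
}
```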

&lt;p&gt;For e-commerce applications, this enables powerful features like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Natural language search ("show me laptops good for college students")&lt;/li&gt;
&lt;li&gt;Semantic filtering ("lightweight laptops with good battery life")&lt;/li&gt;
&lt;li&gt;Feature-based comparison ("laptops with the best display for photo editing")&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Performance and Implementation Considerations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Vector Embedding Process
&lt;/h3&gt;

&lt;p&gt;When you store text data in HPKV, Nexus Search processes it as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Text extraction from the value (supporting JSON, plain text, and other formats)&lt;/li&gt;
&lt;li&gt;Text normalization and preprocessing&lt;/li&gt;
&lt;li&gt;Chunking for long content (with configurable overlap)&lt;/li&gt;
&lt;li&gt;Vector embedding generation&lt;/li&gt;
&lt;li&gt;Storage in the vector database with reference to the original HPKV key&lt;/li&gt;
&lt;/ol&gt;
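&lt;p&gt;Step 3 (chunking with overlap) can be sketched as follows. The chunk size and overlap values here are illustrative defaults, not Nexus Search's actual configuration:&lt;/p&gt;

```javascript
// Split long text into word-based chunks that overlap, so sentences that
// straddle a chunk boundary still appear intact in at least one chunk.
// chunkSize and overlap are measured in words; the defaults are illustrative.
function chunkText(text, chunkSize = 200, overlap = 40) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= words.length) break; // last chunk reached the end
  }
  return chunks;
}
```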

&lt;p&gt;This process happens asynchronously to avoid impacting HPKV's performance characteristics. There is typically a delay of a few seconds between data writes and when the content becomes searchable.&lt;/p&gt;
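&lt;p&gt;Because of that delay, a write-then-search sequence can briefly miss fresh data. One way to cope is to retry the search with a short backoff until results appear; here is a minimal sketch, where the retry counts and delays are arbitrary choices rather than service recommendations:&lt;/p&gt;

```javascript
// Retry a search until it returns results or the retry budget is exhausted.
// `runSearch` is any async function returning an array of results, e.g. a
// wrapper around the /search request shown earlier.
async function searchWhenIndexed(runSearch, { attempts = 5, delayMs = 1000 } = {}) {
  for (let i = 0; i < attempts; i++) {
    const results = await runSearch();
    if (results.length > 0) return results;
    // Not indexed yet; wait before trying again.
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }
  return []; // content never became searchable within the budget
}
```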

&lt;h3&gt;
  
  
  Limitations
&lt;/h3&gt;

&lt;p&gt;While powerful, Nexus Search has some limitations to be aware of:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The embedding model is optimized for English text; non-English content may return noticeably less accurate results&lt;/li&gt;
&lt;li&gt;Up to 20 results can be returned with a single search request&lt;/li&gt;
&lt;li&gt;Processing delay between data writes and search availability&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Real-World Performance
&lt;/h2&gt;

&lt;p&gt;Performance varies with data volume and query complexity, while capabilities and rate limits depend on subscription tier:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Context Tokens&lt;/th&gt;
&lt;th&gt;Output Tokens&lt;/th&gt;
&lt;th&gt;Request Limits&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;24K&lt;/td&gt;
&lt;td&gt;1K&lt;/td&gt;
&lt;td&gt;100 calls/month, 12 req/min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pro&lt;/td&gt;
&lt;td&gt;24K&lt;/td&gt;
&lt;td&gt;5K&lt;/td&gt;
&lt;td&gt;500 calls/month, 24 req/min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Business&lt;/td&gt;
&lt;td&gt;80K&lt;/td&gt;
&lt;td&gt;10K&lt;/td&gt;
&lt;td&gt;5000 calls/month, 60 req/min, agent mode&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enterprise&lt;/td&gt;
&lt;td&gt;110K&lt;/td&gt;
&lt;td&gt;50K&lt;/td&gt;
&lt;td&gt;Unlimited calls, 120 req/min, agent mode&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For most applications, we see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Search latency: 200-500ms&lt;/li&gt;
&lt;li&gt;Query latency: 500ms-2s (depending on complexity)&lt;/li&gt;
&lt;li&gt;Indexing throughput: ~20MB/minute&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our ongoing optimizations focus on:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Improving embedding quality for specialized domains&lt;/li&gt;
&lt;li&gt;Adding support for more languages and content types&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Implementation Example
&lt;/h2&gt;

&lt;p&gt;Here's a complete example showing how to store data and query it with Nexus Search:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Store data in HPKV first&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;baseUrl&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/record`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;X-Api-Key&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;YOUR_API_KEY&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;article:databases&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;High performance databases offer exceptional speed and reliability. They typically achieve sub-millisecond response times and can handle millions of operations per second. This makes them ideal for real-time applications, financial systems, gaming backends, and other use cases where latency matters.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Wait for a moment to allow indexing (in production, data is indexed asynchronously)&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="c1"&gt;// Query the data using Nexus Search&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://nexus.hpkv.io/query&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;X-Api-Key&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;YOUR_API_KEY&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;What applications benefit from high performance databases?&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// Output: "High performance databases are ideal for real-time applications, financial systems, gaming backends, and other use cases where low latency is critical."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Nexus Search represents our approach to bridging the gap between high-performance key-value storage and semantic understanding. By adding RAG capabilities to HPKV, we've enabled entirely new ways to interact with your data without sacrificing the performance and simplicity that made HPKV valuable in the first place.&lt;/p&gt;

&lt;p&gt;The possibilities go far beyond the examples we've shared here. Nexus Search can power use cases such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer support knowledge bases&lt;/li&gt;
&lt;li&gt;Internal documentation search&lt;/li&gt;
&lt;li&gt;Compliance monitoring across large document sets&lt;/li&gt;
&lt;li&gt;User-generated content moderation&lt;/li&gt;
&lt;li&gt;Technical troubleshooting assistants&lt;/li&gt;
&lt;li&gt;Research paper analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We've built Nexus Search with the same commitment to performance, reliability, and security that guides all our work at HPKV. The system is designed to scale with your needs, from small datasets to enterprise-scale knowledge bases.&lt;/p&gt;

&lt;p&gt;We're just beginning to explore the possibilities of combining high-performance storage with AI-powered search and retrieval. As we continue to refine and expand Nexus Search, we welcome your feedback, questions, and use cases to help guide our development priorities.&lt;/p&gt;

&lt;p&gt;Nexus Search is available on all HPKV subscription tiers, with features and limits varying by tier. Visit our &lt;a href="https://hpkv.io/pricing" rel="noopener noreferrer"&gt;pricing page&lt;/a&gt; for details, or dive into the &lt;a href="https://hpkv.io/docs/nexus-search" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; to get started. &lt;/p&gt;

</description>
      <category>ai</category>
      <category>database</category>
      <category>rag</category>
    </item>
    <item>
      <title>High-Performance Secure Networking with RIOC</title>
      <dc:creator>Mehran</dc:creator>
      <pubDate>Sat, 29 Mar 2025 16:24:00 +0000</pubDate>
      <link>https://dev.to/mehrant/high-performance-secure-networking-with-rioc-3kco</link>
      <guid>https://dev.to/mehrant/high-performance-secure-networking-with-rioc-3kco</guid>
      <description>&lt;p&gt;In distributed systems, network communication often becomes the bottleneck that limits overall application performance. We've seen this firsthand while building &lt;a href="https://hpkv.io" rel="noopener noreferrer"&gt;HPKV&lt;/a&gt;, our high-performance key-value store. This article explores how we approached networking challenges in RIOC (Remote I/O Control), the networking layer that powers HPKV's distributed capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is RIOC?
&lt;/h2&gt;

&lt;p&gt;RIOC is a client-server protocol implementation designed specifically for interfacing with high-performance storage systems like HPKV. What sets it apart isn't just raw performance, but how it balances three critical requirements:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Latency and Throughput&lt;/strong&gt;: Zero-copy operations, vectored I/O, and batch processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Consistency&lt;/strong&gt;: Atomic operations with proper memory barriers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transport Security&lt;/strong&gt;: TLS 1.3 with mutual authentication (mTLS)&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;At its core, RIOC consists of client and server components communicating over a binary protocol. The high-level architecture looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F09zf4yakfvc2e45xbj6w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F09zf4yakfvc2e45xbj6w.png" alt="High Level Architecture" width="800" height="994"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance Optimizations
&lt;/h2&gt;

&lt;p&gt;Network performance is often overlooked, but it's a critical component in any distributed system. RIOC implements several key optimizations that work together to achieve exceptional performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Vectored I/O: Reducing System Call Overhead
&lt;/h3&gt;

&lt;p&gt;Traditional network programs use sequential send/recv calls to transmit data. For operations involving multiple data segments (headers, keys, values), this can lead to multiple system calls and context switches.&lt;/p&gt;

&lt;p&gt;Vectored I/O is a powerful technique that allows a single system call to send or receive multiple non-contiguous data buffers. RIOC uses the &lt;code&gt;writev&lt;/code&gt; and &lt;code&gt;readv&lt;/code&gt; system calls to transmit multiple memory segments in one call, without first copying them into a contiguous buffer.&lt;/p&gt;

&lt;p&gt;Here's how it works in RIOC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Example from RIOC implementation&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;iovec&lt;/span&gt; &lt;span class="n"&gt;iovs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;  &lt;span class="c1"&gt;// batch_header + op_header + key + value&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;iov_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Setup batch header&lt;/span&gt;
&lt;span class="n"&gt;iovs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;iov_count&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;iov_base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;batch_header&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;iovs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;iov_count&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;iov_len&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_header&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;iov_count&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Setup op header&lt;/span&gt;
&lt;span class="n"&gt;iovs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;iov_count&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;iov_base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;op_header&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;iovs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;iov_count&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;iov_len&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;op_header&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;iov_count&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Setup key&lt;/span&gt;
&lt;span class="n"&gt;iovs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;iov_count&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;iov_base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;iovs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;iov_count&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;iov_len&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;key_len&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;iov_count&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Setup value if present&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;value_len&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;iovs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;iov_count&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;iov_base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;iovs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;iov_count&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;iov_len&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;value_len&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;iov_count&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Send all segments in a single system call&lt;/span&gt;
&lt;span class="n"&gt;writev&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iovs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iov_count&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The benefits of this approach are significant:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reduced System Call Overhead&lt;/strong&gt;: A single call to &lt;code&gt;writev&lt;/code&gt; replaces multiple calls to &lt;code&gt;send&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No Extra Memory Copies&lt;/strong&gt;: Data is sent directly from its original locations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fewer Context Switches&lt;/strong&gt;: Minimizing transitions between user space and kernel space&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved Packet Efficiency&lt;/strong&gt;: The TCP/IP stack can optimize packet boundaries&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;RIOC's implementation also includes size-based optimizations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For transfers under 4KB, it uses a stack-allocated buffer to minimize heap allocations&lt;/li&gt;
&lt;li&gt;For larger transfers, it dynamically adjusts to use vectored I/O with prefetching hints&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Zero-Copy Data Transfers
&lt;/h3&gt;

&lt;p&gt;A natural extension of vectored I/O is the concept of zero-copy data transfers. Traditional network programming often involves multiple data copies:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;From application buffer to socket buffer&lt;/li&gt;
&lt;li&gt;From socket buffer to network interface&lt;/li&gt;
&lt;li&gt;From network interface to socket buffer on the receiving side&lt;/li&gt;
&lt;li&gt;From socket buffer to application buffer on the receiving side&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;RIOC minimizes these copies in several ways:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Example of zero-copy receive for large values&lt;/span&gt;
&lt;span class="kt"&gt;ssize_t&lt;/span&gt; &lt;span class="nf"&gt;rioc_zero_copy_recv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;rioc_value&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Allocate memory once, to be used directly by application&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aligned_alloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RIOC_CACHE_LINE_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Read directly into application memory&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;recv_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For operations where the client will be accessing the data immediately, RIOC can also leverage techniques like:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Memory mapping&lt;/strong&gt;: For very large values, memory-mapped I/O can be used to avoid buffer copying&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scatter-gather DMA&lt;/strong&gt;: When supported by the hardware, RIOC can work with the network stack to use DMA operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Buffer ownership transfer&lt;/strong&gt;: Using smart pointers or reference counting to transfer buffer ownership instead of copying content&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While zero-copy is powerful, it's not always the most efficient approach for small data sizes due to the overhead of memory management and registration. RIOC dynamically selects the appropriate strategy based on operation size and access patterns.&lt;/p&gt;
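&lt;p&gt;That dynamic selection reduces to a simple threshold check on the value size. The sketch below is illustrative rather than RIOC's actual code; the constant &lt;code&gt;RIOC_ZERO_COPY_THRESHOLD&lt;/code&gt; and the names &lt;code&gt;rioc_recv_strategy&lt;/code&gt;/&lt;code&gt;rioc_pick_strategy&lt;/code&gt; are assumptions:&lt;/p&gt;

```c
#include <stddef.h>

/* Hypothetical cutoff: below this, plain buffered copying beats the
 * setup cost of zero-copy (allocation, alignment, registration). */
#define RIOC_ZERO_COPY_THRESHOLD (64 * 1024)

enum rioc_recv_strategy {
    RIOC_RECV_COPY,       /* small values: read into a pooled buffer, then copy */
    RIOC_RECV_ZERO_COPY   /* large values: read directly into application memory */
};

/* Pick a receive strategy from the value size alone. */
static inline enum rioc_recv_strategy rioc_pick_strategy(size_t value_size) {
    return value_size >= RIOC_ZERO_COPY_THRESHOLD ? RIOC_RECV_ZERO_COPY
                                                  : RIOC_RECV_COPY;
}
```

&lt;p&gt;A real implementation would also weigh access patterns (as the text notes), but the size cutoff alone already captures most of the win.&lt;/p&gt;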

&lt;h3&gt;
  
  
  Socket Tuning: Beyond Default Parameters
&lt;/h3&gt;

&lt;p&gt;TCP's default settings are designed for general-purpose internet communication, prioritizing compatibility and reliability over raw performance. For high-performance local or data center networking, these defaults can be limiting.&lt;/p&gt;

&lt;p&gt;RIOC applies careful socket tuning to maximize throughput and minimize latency:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// TCP_NODELAY: Disable Nagle's algorithm&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;flag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;setsockopt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IPPROTO_TCP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TCP_NODELAY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="c1"&gt;// Increase socket buffers to 1MB&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;buffer_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// 1MB&lt;/span&gt;
&lt;span class="n"&gt;setsockopt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SOL_SOCKET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SO_RCVBUF&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;buffer_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffer_size&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="n"&gt;setsockopt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SOL_SOCKET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SO_SNDBUF&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;buffer_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffer_size&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="c1"&gt;// Set low-latency type-of-service&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;tos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;IPTOS_LOWDELAY&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;setsockopt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IPPROTO_IP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IP_TOS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;tos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tos&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="c1"&gt;// Enable TCP Quick ACK on Linux&lt;/span&gt;
&lt;span class="cp"&gt;#ifdef TCP_QUICKACK
&lt;/span&gt;&lt;span class="n"&gt;setsockopt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IPPROTO_TCP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TCP_QUICKACK&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="cp"&gt;#endif
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's examine why each option matters:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TCP_NODELAY&lt;/strong&gt;: Disables Nagle's algorithm, which otherwise would buffer small packets to reduce header overhead. While this can improve efficiency for some workloads, it introduces latency by delaying transmission until either a full packet can be sent or a timeout occurs. For RIOC's low-latency requirements, disabling this is crucial.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Socket Buffer Sizing&lt;/strong&gt;: Default socket buffers (often ~128KB) can limit throughput, especially on high-bandwidth networks. By increasing to 1MB, RIOC ensures the TCP window can scale appropriately, keeping the network pipe full.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;IPTOS_LOWDELAY&lt;/strong&gt;: This sets the Type of Service (ToS) field in IP packets to request low-latency handling from network equipment. While not all network devices honor this, those that do will prioritize these packets over bulk transfers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TCP_QUICKACK&lt;/strong&gt;: On Linux, this disables delayed ACKs, so TCP acknowledgments are sent immediately rather than waiting to piggyback on data packets or on a timeout. Note that the flag is not sticky: the kernel may fall back out of quickack mode after subsequent receives, so latency-sensitive code typically re-applies it.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
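&lt;p&gt;One practical note: the snippets above drop the return values of &lt;code&gt;setsockopt&lt;/code&gt;. A hedged sketch of a tuning helper with error handling (the function name &lt;code&gt;rioc_tune_socket&lt;/code&gt; is ours; the options are exactly those discussed, assuming a POSIX socket API):&lt;/p&gt;

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <netinet/ip.h>
#include <sys/socket.h>

/* Apply the low-latency options discussed above.
 * Returns 0 on success, -1 if a required option fails (errno set by setsockopt). */
static int rioc_tune_socket(int fd) {
    int flag = 1;
    int buffer_size = 1024 * 1024;   /* 1MB send/receive buffers */
    int tos = IPTOS_LOWDELAY;

    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag)) != 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &buffer_size, sizeof(buffer_size)) != 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &buffer_size, sizeof(buffer_size)) != 0)
        return -1;

    /* ToS is advisory and not honored everywhere; treat it as best-effort. */
    (void)setsockopt(fd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos));

#ifdef TCP_QUICKACK
    /* Linux-only, and not sticky; callers typically re-apply after receives. */
    (void)setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &flag, sizeof(flag));
#endif
    return 0;
}
```

&lt;p&gt;Treating the advisory options as best-effort keeps the helper portable while still failing loudly when the throughput-critical buffer sizing is rejected.&lt;/p&gt;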

&lt;h3&gt;
  
  
  TCP_CORK: Strategic Packet Coalescing
&lt;/h3&gt;

&lt;p&gt;For operations that involve sending multiple segments that should logically be processed together, RIOC uses TCP_CORK to control packet boundaries.&lt;/p&gt;

&lt;p&gt;Unlike Nagle's algorithm (which TCP_NODELAY disables), TCP_CORK gives the application explicit control over when packets are sent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Enable TCP_CORK&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;flag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;setsockopt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IPPROTO_TCP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TCP_CORK&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="c1"&gt;// Send multiple segments...&lt;/span&gt;
&lt;span class="n"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Disable TCP_CORK to flush any remaining data&lt;/span&gt;
&lt;span class="n"&gt;flag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;setsockopt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IPPROTO_TCP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TCP_CORK&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This technique has several advantages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reduced Packet Count&lt;/strong&gt;: Data is coalesced into fewer, larger packets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better Utilization&lt;/strong&gt;: More data per packet means better amortization of TCP/IP header overhead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lower Processing Cost&lt;/strong&gt;: Network equipment processes fewer packets&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;RIOC applies TCP_CORK selectively based on operation size:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Small operations (≤4KB) are sent immediately&lt;/li&gt;
&lt;li&gt;Larger operations use TCP_CORK to optimize packet boundaries&lt;/li&gt;
&lt;li&gt;The cork is always explicitly removed when transmission is complete, preventing indefinite delays&lt;/li&gt;
&lt;/ul&gt;
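&lt;p&gt;Putting that policy together, size-aware corked sends might look like the sketch below. The threshold constant and the helper names (&lt;code&gt;set_cork&lt;/code&gt;, &lt;code&gt;send_op&lt;/code&gt;) are illustrative, not RIOC's:&lt;/p&gt;

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <stddef.h>

#define RIOC_CORK_THRESHOLD 4096   /* small ops (<=4KB) are sent immediately */

static void set_cork(int fd, int on) {
#ifdef TCP_CORK
    /* Best-effort: fails harmlessly on non-TCP sockets. */
    (void)setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
#else
    (void)fd; (void)on;   /* TCP_CORK is Linux-specific */
#endif
}

/* Send header + key + value, corking only when the total payload is large
 * enough for coalescing to pay off. Returns 0 on success, -1 on error. */
static int send_op(int fd, const void *hdr, size_t hn,
                   const void *key, size_t kn,
                   const void *val, size_t vn) {
    int cork = (hn + kn + vn) > RIOC_CORK_THRESHOLD;
    if (cork) set_cork(fd, 1);

    int ok = send(fd, hdr, hn, 0) == (ssize_t)hn
          && send(fd, key, kn, 0) == (ssize_t)kn
          && send(fd, val, vn, 0) == (ssize_t)vn;

    if (cork) set_cork(fd, 0);   /* always uncork to flush pending data */
    return ok ? 0 : -1;
}
```

&lt;p&gt;Uncorking in every path, even after an error, is what enforces the "always explicitly removed" rule from the list above.&lt;/p&gt;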

&lt;h3&gt;
  
  
  Lockless Data Structures
&lt;/h3&gt;

&lt;p&gt;Contention on locks can severely impact performance in multi-threaded systems. In high-performance networking, every microsecond counts, and traditional lock-based synchronization can introduce significant overhead. RIOC employs several lockless techniques to minimize contention:&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Lock-Free Sequence Numbers
&lt;/h4&gt;

&lt;p&gt;For client request tracking, RIOC uses atomic sequence counters that can be accessed without locks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Lockless sequence counter for client requests&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;rioc_client&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Other fields...&lt;/span&gt;
    &lt;span class="n"&gt;atomic_uint64_t&lt;/span&gt; &lt;span class="n"&gt;sequence&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// Atomic sequence counter&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="c1"&gt;// Atomically increment sequence number without locks&lt;/span&gt;
&lt;span class="kt"&gt;uint64_t&lt;/span&gt; &lt;span class="nf"&gt;rioc_get_next_sequence&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;rioc_client&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;atomic_fetch_add_explicit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;sequence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_order_relaxed&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach allows multiple threads to generate unique sequence numbers without contention, which is crucial for high-throughput scenarios.&lt;/p&gt;
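&lt;p&gt;The guarantee is easy to verify: with an atomic fetch-and-add, no increment can be lost, so N threads each taking M numbers must leave the counter at exactly N&amp;times;M. A minimal standalone demonstration (the struct is trimmed to just the counter, and the names are ours):&lt;/p&gt;

```c
#include <stdatomic.h>
#include <stdint.h>
#include <pthread.h>

struct seq_client { _Atomic uint64_t sequence; };

static uint64_t next_sequence(struct seq_client *c) {
    /* Relaxed is enough: we only need atomicity, not ordering with other data. */
    return atomic_fetch_add_explicit(&c->sequence, 1, memory_order_relaxed);
}

static void *worker(void *arg) {
    struct seq_client *c = arg;
    for (int i = 0; i < 100000; i++)
        next_sequence(c);
    return NULL;
}

/* Run 4 threads taking 100000 numbers each; the final counter value
 * proves no increment was lost to a race. */
static uint64_t run_demo(void) {
    struct seq_client c = { 0 };
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, &c);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&c.sequence);
}
```

&lt;p&gt;With a plain (non-atomic) counter the same demo would intermittently come up short, which is exactly the contention bug the atomic version removes.&lt;/p&gt;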

&lt;h4&gt;
  
  
  2. Single-Producer, Single-Consumer Queues
&lt;/h4&gt;

&lt;p&gt;For passing data between dedicated threads, RIOC implements true lockless SPSC (Single-Producer, Single-Consumer) queues:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;spsc_queue&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;atomic_size_t&lt;/span&gt; &lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;atomic_size_t&lt;/span&gt; &lt;span class="n"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;QUEUE_SIZE&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="n"&gt;__attribute__&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;aligned&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RIOC_CACHE_LINE_SIZE&lt;/span&gt;&lt;span class="p"&gt;)));&lt;/span&gt;

&lt;span class="c1"&gt;// Producer: Add item to queue without locks&lt;/span&gt;
&lt;span class="n"&gt;bool&lt;/span&gt; &lt;span class="nf"&gt;spsc_enqueue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;spsc_queue&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;tail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;atomic_load_explicit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_order_relaxed&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;next_tail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tail&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;QUEUE_SIZE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Check if queue is full&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next_tail&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;atomic_load_explicit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_order_acquire&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// Queue full&lt;/span&gt;

    &lt;span class="c1"&gt;// Add item and update tail&lt;/span&gt;
    &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;atomic_store_explicit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;next_tail&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_order_release&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Consumer: Remove item from queue without locks&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;spsc_dequeue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;spsc_queue&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;head&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;atomic_load_explicit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_order_relaxed&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Check if queue is empty&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;atomic_load_explicit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_order_acquire&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// Queue empty&lt;/span&gt;

    &lt;span class="c1"&gt;// Get item and update head&lt;/span&gt;
    &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="n"&gt;atomic_store_explicit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;QUEUE_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_order_release&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This implementation uses memory ordering primitives to ensure correctness without locks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;memory_order_relaxed&lt;/code&gt; for accesses where only atomicity matters, not ordering relative to other memory operations&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;memory_order_acquire&lt;/code&gt; so that reads after the load observe everything written before the matching release store&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;memory_order_release&lt;/code&gt; so that all writes before the store become visible to any thread that acquire-loads the same variable&lt;/li&gt;
&lt;/ul&gt;
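&lt;p&gt;This acquire/release pairing is what makes the queue correct: the release store to &lt;code&gt;tail&lt;/code&gt; publishes the item slot written just before it, and the matching acquire load on the consumer side observes that write. The same handoff in its smallest form (a single-slot, single-shot version; names ours):&lt;/p&gt;

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;                 /* plain, non-atomic data */
static atomic_bool ready = false;   /* publication flag */

/* Producer: write the data, then release-store the flag.
 * The release fences the payload write before the flag becomes visible. */
static void publish(int value) {
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer: acquire-load the flag. If it reads true, every write made
 * before the matching release (here, the payload) is guaranteed visible. */
static bool try_consume(int *out) {
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

&lt;p&gt;If both operations were relaxed instead, the consumer could see &lt;code&gt;ready == true&lt;/code&gt; yet read a stale &lt;code&gt;payload&lt;/code&gt;; the SPSC queue above avoids exactly that by releasing on &lt;code&gt;tail&lt;/code&gt; and acquiring on &lt;code&gt;head&lt;/code&gt;/&lt;code&gt;tail&lt;/code&gt;.&lt;/p&gt;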

&lt;h4&gt;
  
  
  3. Read-Copy-Update (RCU) for Connection Management
&lt;/h4&gt;

&lt;p&gt;RIOC uses a simplified RCU-like pattern for managing active connections, allowing lookups to proceed without locking while updates happen concurrently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Connection table with lockless reads&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;connection_table&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;atomic_ptr_t&lt;/span&gt; &lt;span class="n"&gt;connections&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;MAX_CONNECTIONS&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;  &lt;span class="c1"&gt;// Array of atomic pointers&lt;/span&gt;
    &lt;span class="c1"&gt;// Other fields...&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="c1"&gt;// Lookup connection without locking&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;find_connection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;connection_table&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;connection_id_t&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;MAX_CONNECTIONS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Atomic read with acquire semantics to ensure we see a complete structure&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;atomic_load_explicit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;connections&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;memory_order_acquire&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Add new connection (requires synchronization for writers)&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;add_connection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;connection_table&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;connection_id_t&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;connection&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Synchronize writers (omitted for brevity)&lt;/span&gt;

    &lt;span class="c1"&gt;// Ensure connection is fully initialized before publishing&lt;/span&gt;
    &lt;span class="n"&gt;atomic_store_explicit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;connections&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_order_release&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Remove connection&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;remove_connection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;connection_table&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;connection_id_t&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Synchronize writers (omitted for brevity)&lt;/span&gt;

    &lt;span class="c1"&gt;// Set to NULL with release semantics&lt;/span&gt;
    &lt;span class="n"&gt;atomic_store_explicit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;connections&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_order_release&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// The actual connection cleanup is deferred until all readers are done&lt;/span&gt;
    &lt;span class="n"&gt;schedule_deferred_cleanup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;find_connection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach allows connection lookups to proceed at full speed without locking, while the less-frequent operations of adding and removing connections can use heavier synchronization.&lt;/p&gt;
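&lt;p&gt;The subtle part is "deferred until all readers are done." One common scheme is epoch-based reclamation: readers record the epoch in which they entered, and a retired pointer is freed only after every reader has left that epoch. A heavily simplified single-reader sketch of the idea (all names are ours; real implementations such as kernel RCU or liburcu are far more involved):&lt;/p&gt;

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

static _Atomic unsigned long global_epoch = 1;
static _Atomic unsigned long reader_epoch = 0;  /* 0 = reader not in a critical section */

struct retired { void *ptr; unsigned long epoch; };

/* Reader side: bracket each lockless lookup. */
static void reader_enter(void) {
    atomic_store_explicit(&reader_epoch,
        atomic_load_explicit(&global_epoch, memory_order_acquire),
        memory_order_release);
}
static void reader_exit(void) {
    atomic_store_explicit(&reader_epoch, 0, memory_order_release);
}

/* Writer side: retire a pointer under the current epoch and bump the epoch. */
static struct retired retire(void *ptr) {
    struct retired r = { ptr,
        atomic_fetch_add_explicit(&global_epoch, 1, memory_order_acq_rel) };
    return r;
}

/* Free only once the reader has left, or re-entered in a later epoch,
 * meaning it can no longer hold the retired pointer. */
static bool try_reclaim(struct retired *r) {
    unsigned long re = atomic_load_explicit(&reader_epoch, memory_order_acquire);
    if (re != 0 && re <= r->epoch)
        return false;   /* reader may still be using the old pointer */
    free(r->ptr);
    r->ptr = NULL;
    return true;
}
```

&lt;p&gt;Generalizing to many readers means scanning one per-thread epoch slot per reader, which keeps the read path lock-free while pushing all the cost onto the infrequent reclaim path.&lt;/p&gt;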

&lt;h4&gt;
  
  
  4. Concurrent Statistics Collection
&lt;/h4&gt;

&lt;p&gt;RIOC maintains various statistics counters that are updated by multiple threads without locking:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;rioc_stats&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;atomic_uint64_t&lt;/span&gt; &lt;span class="n"&gt;operations_total&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;atomic_uint64_t&lt;/span&gt; &lt;span class="n"&gt;bytes_sent&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;atomic_uint64_t&lt;/span&gt; &lt;span class="n"&gt;bytes_received&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;atomic_uint64_t&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="c1"&gt;// More counters...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="n"&gt;__attribute__&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;aligned&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RIOC_CACHE_LINE_SIZE&lt;/span&gt;&lt;span class="p"&gt;)));&lt;/span&gt;

&lt;span class="c1"&gt;// Increment operation counter from any thread&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;count_operation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;rioc_stats&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;atomic_fetch_add_explicit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;operations_total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_order_relaxed&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Add to bytes sent counter&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;add_bytes_sent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;rioc_stats&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;atomic_fetch_add_explicit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;bytes_sent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_order_relaxed&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Snapshot statistics safely&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;snapshot_stats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;rioc_stats&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;rioc_stats_snapshot&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Use memory barrier to ensure consistent view&lt;/span&gt;
    &lt;span class="n"&gt;atomic_thread_fence&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memory_order_acquire&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;operations_total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;atomic_load_explicit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;operations_total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_order_relaxed&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;bytes_sent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;atomic_load_explicit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;bytes_sent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_order_relaxed&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;bytes_received&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;atomic_load_explicit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;bytes_received&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_order_relaxed&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;errors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;atomic_load_explicit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_order_relaxed&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// Copy other counters...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By using atomic operations with appropriate memory ordering, these counters can be updated at very high frequency without locks. Each individual counter remains exact; the snapshot is best-effort rather than a single atomic point in time across all counters, which is an acceptable trade-off for statistics.&lt;/p&gt;
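As a standalone illustration (not RIOC's code), the sketch below shows why relaxed fetch-and-add counters stay exact under concurrency: each increment is a single indivisible read-modify-write, so no updates are lost even though no lock is taken.

```c
// Standalone sketch: two threads bump a shared counter with
// memory_order_relaxed; the final total is still exact because each
// fetch-add is an indivisible read-modify-write.
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_ulong operations_total;

static void *bump(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        atomic_fetch_add_explicit(&operations_total, 1, memory_order_relaxed);
    }
    return NULL;
}

// Run two incrementing threads and return the final count
unsigned long run_counter_demo(void) {
    pthread_t t1, t2;
    atomic_store(&operations_total, 0);
    pthread_create(&t1, NULL, bump, NULL);
    pthread_create(&t2, NULL, bump, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return atomic_load(&operations_total);
}
```

Relaxed ordering is enough here because only the counter's own value matters, not its ordering relative to other memory.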

&lt;h4&gt;
  
  
  5. Hybrid Approaches for Complex Data Structures
&lt;/h4&gt;

&lt;p&gt;For more complex data structures like the multi-producer, multi-consumer work queue, RIOC uses a hybrid approach combining atomic operations with minimal locking:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Ring buffer with atomic head/tail pointers&lt;/span&gt;
&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;head&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;atomic_load_explicit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_order_acquire&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;tail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;atomic_load_explicit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_order_acquire&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Check if empty without locking&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Empty queue handling&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Only lock for modifications, not for simple checks&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;need_to_modify&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;pthread_mutex_lock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;mutex&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// Double-check conditions after acquiring lock (avoid TOCTOU)&lt;/span&gt;
    &lt;span class="n"&gt;head&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;atomic_load_explicit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_order_relaxed&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;tail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;atomic_load_explicit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_order_relaxed&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;still_need_to_modify&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Modify queue state&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;pthread_mutex_unlock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;mutex&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Although not completely lockless, this approach minimizes the duration of critical sections and avoids locking altogether for read-only operations, significantly reducing contention.&lt;/p&gt;

&lt;h4&gt;
  
  
  Choosing the Right Approach
&lt;/h4&gt;

&lt;p&gt;RIOC's approach to concurrency balances correctness, performance, and code complexity:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fully Lockless Algorithms&lt;/strong&gt;: Used for simple patterns with clear ownership (sequence numbers, SPSC queues, statistics)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RCU-like Patterns&lt;/strong&gt;: Used for read-heavy data structures where readers should never block&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-grained Locking&lt;/strong&gt;: Used where truly lockless algorithms would be too complex or error-prone&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Atomic Operations&lt;/strong&gt;: Used throughout to ensure proper memory ordering and visibility&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While fully lockless algorithms can provide the highest theoretical performance, they often come with increased complexity and subtle correctness issues. RIOC pragmatically chooses the appropriate synchronization mechanism based on the specific requirements of each component.&lt;/p&gt;
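To make the RCU-like pattern from the list above concrete, here is a minimal sketch (hypothetical `config` structure, not RIOC's actual code): a writer builds a fresh copy and publishes it with a release store, readers take lock-free acquire loads, and safe reclamation of the old copy — the hard part of real RCU — is deliberately omitted.

```c
// Minimal sketch of an RCU-like publish/read pattern (illustrative names,
// not RIOC's structures). Readers never block: they acquire-load the
// current pointer. Writers copy, modify, then release-store the new copy.
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

struct config { int max_batch; };

static _Atomic(struct config *) current_config;

// Reader: lock-free snapshot of the active config
const struct config *config_read(void) {
    return atomic_load_explicit(&current_config, memory_order_acquire);
}

// Writer: copy, modify, publish
void config_update(int new_max_batch) {
    struct config *fresh = malloc(sizeof *fresh);
    const struct config *old = config_read();
    if (old) memcpy(fresh, old, sizeof *fresh);
    fresh->max_batch = new_max_batch;
    atomic_store_explicit(&current_config, fresh, memory_order_release);
    // NOTE: freeing `old` here would race with in-flight readers; a real
    // implementation defers the free until a grace period has passed.
}
```

The release/acquire pair guarantees that a reader who sees the new pointer also sees the fully initialized contents behind it.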

&lt;h3&gt;
  
  
  Cache-Conscious Design
&lt;/h3&gt;

&lt;p&gt;Modern CPUs rely heavily on cache efficiency. RIOC's design considers cache behavior at multiple levels:&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Cache Line Alignment
&lt;/h4&gt;

&lt;p&gt;To prevent false sharing (where threads inadvertently contend for the same cache line), RIOC aligns critical data structures to cache line boundaries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Cache line size detection&lt;/span&gt;
&lt;span class="cp"&gt;#ifndef RIOC_CACHE_LINE_SIZE
#define RIOC_CACHE_LINE_SIZE 64
#endif
&lt;/span&gt;
&lt;span class="c1"&gt;// Apply alignment to structures&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;rioc_connection&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Fields frequently accessed together&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="n"&gt;__attribute__&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;aligned&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RIOC_CACHE_LINE_SIZE&lt;/span&gt;&lt;span class="p"&gt;)));&lt;/span&gt;

&lt;span class="c1"&gt;// Pad data structures to prevent false sharing&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;work_queue&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;atomic_size_t&lt;/span&gt; &lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;pad1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;RIOC_CACHE_LINE_SIZE&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;atomic_size_t&lt;/span&gt;&lt;span class="p"&gt;)];&lt;/span&gt;  &lt;span class="c1"&gt;// Padding&lt;/span&gt;
    &lt;span class="n"&gt;atomic_size_t&lt;/span&gt; &lt;span class="n"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;pad2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;RIOC_CACHE_LINE_SIZE&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;atomic_size_t&lt;/span&gt;&lt;span class="p"&gt;)];&lt;/span&gt;  &lt;span class="c1"&gt;// Padding&lt;/span&gt;
    &lt;span class="c1"&gt;// Rest of structure...&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2. Data Locality
&lt;/h4&gt;

&lt;p&gt;RIOC organizes data structures to keep related data together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Group related fields for better locality&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;rioc_batch_op&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Header and key fields that are accessed together during parsing&lt;/span&gt;
    &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;rioc_op_header&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;RIOC_MAX_KEY_SIZE&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

    &lt;span class="c1"&gt;// Value pointer and metadata accessed during data processing&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;value_ptr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;value_offset&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Response fields accessed together&lt;/span&gt;
    &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;rioc_response&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  3. Prefetching
&lt;/h4&gt;

&lt;p&gt;For predictable access patterns, RIOC uses explicit prefetching to hide memory latency:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Prefetch next batch operation&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;rioc_batch_op&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;ops&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

    &lt;span class="c1"&gt;// Prefetch the next operation if available&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;__builtin_prefetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;ops&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Process current operation...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Platform-Specific Optimizations
&lt;/h3&gt;

&lt;p&gt;RIOC is designed to work well across platforms, but it also includes targeted optimizations for specific environments:&lt;/p&gt;

&lt;h4&gt;
  
  
  Linux-Specific Optimizations
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cp"&gt;#ifdef __linux__
&lt;/span&gt;    &lt;span class="c1"&gt;// Use Linux-specific socket options&lt;/span&gt;
    &lt;span class="cp"&gt;#ifdef TCP_QUICKACK
&lt;/span&gt;        &lt;span class="n"&gt;setsockopt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IPPROTO_TCP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TCP_QUICKACK&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="cp"&gt;#endif
&lt;/span&gt;
    &lt;span class="cp"&gt;#ifdef SO_BUSY_POLL
&lt;/span&gt;        &lt;span class="c1"&gt;// Reduce latency with busy polling for Linux kernels &amp;gt;= 3.11&lt;/span&gt;
        &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;busy_poll&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// 50 microseconds&lt;/span&gt;
        &lt;span class="n"&gt;setsockopt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SOL_SOCKET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SO_BUSY_POLL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;busy_poll&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;busy_poll&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="cp"&gt;#endif
&lt;/span&gt;
    &lt;span class="c1"&gt;// CPU affinity for performance-critical threads&lt;/span&gt;
    &lt;span class="n"&gt;cpu_set_t&lt;/span&gt; &lt;span class="n"&gt;cpuset&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;CPU_ZERO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;cpuset&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;CPU_SET&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target_cpu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;cpuset&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;pthread_setaffinity_np&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pthread_self&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cpu_set_t&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;cpuset&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="cp"&gt;#endif
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  macOS/BSD-Specific Optimizations
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cp"&gt;#if defined(__APPLE__) || defined(__FreeBSD__)
&lt;/span&gt;    &lt;span class="c1"&gt;// Use kqueue for event notification&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;kq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kqueue&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;kevent&lt;/span&gt; &lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;EV_SET&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EVFILT_READ&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EV_ADD&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;kevent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="cp"&gt;#endif
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Windows-Specific Optimizations
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cp"&gt;#ifdef _WIN32
&lt;/span&gt;    &lt;span class="c1"&gt;// Use Windows-specific socket options&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// 1 second in milliseconds&lt;/span&gt;
    &lt;span class="n"&gt;setsockopt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SOL_SOCKET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SO_RCVTIMEO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

    &lt;span class="c1"&gt;// Use I/O completion ports for scalable I/O&lt;/span&gt;
    &lt;span class="n"&gt;HANDLE&lt;/span&gt; &lt;span class="n"&gt;iocp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CreateIoCompletionPort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;INVALID_HANDLE_VALUE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;CreateIoCompletionPort&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;HANDLE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iocp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ULONG_PTR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="cp"&gt;#endif
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Thread-per-Connection Model with Worker Pool
&lt;/h3&gt;

&lt;p&gt;RIOC's server architecture uses a sophisticated threading model that balances throughput, latency, and resource utilization.&lt;/p&gt;

&lt;p&gt;Traditional server designs follow either a thread-per-connection model (simple but resource-intensive) or an event-loop model (efficient but complex to implement, with potentially higher latency). RIOC takes a hybrid approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Connection acceptance loop&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;server_running&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;client_fd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;accept&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;server_fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...);&lt;/span&gt;

    &lt;span class="c1"&gt;// Create client context&lt;/span&gt;
    &lt;span class="n"&gt;client_ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;create_client_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client_fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...);&lt;/span&gt;

    &lt;span class="c1"&gt;// Queue for worker thread rather than handling inline&lt;/span&gt;
    &lt;span class="n"&gt;work_queue_push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client_ctx&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Worker thread function&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;worker_thread_func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;arg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;server_running&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Get next client context from queue&lt;/span&gt;
        &lt;span class="n"&gt;client_ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;work_queue_pop&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;client_ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;// Process client requests&lt;/span&gt;
        &lt;span class="n"&gt;process_client_requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client_ctx&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The implementation uses a cache-line-aligned work queue whose head and tail indices are atomic. It is not fully lock-free — a mutex and condition variables coordinate blocking producers and consumers — but the alignment and atomic indices keep contention to a minimum:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;work_queue&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Align head and tail to different cache lines to prevent false sharing&lt;/span&gt;
    &lt;span class="n"&gt;atomic_size_t&lt;/span&gt; &lt;span class="n"&gt;head&lt;/span&gt; &lt;span class="n"&gt;RIOC_ALIGNED&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;pad1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;CACHE_LINE_SIZE&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;atomic_size_t&lt;/span&gt;&lt;span class="p"&gt;)];&lt;/span&gt;
    &lt;span class="n"&gt;atomic_size_t&lt;/span&gt; &lt;span class="n"&gt;tail&lt;/span&gt; &lt;span class="n"&gt;RIOC_ALIGNED&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;pad2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;CACHE_LINE_SIZE&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;atomic_size_t&lt;/span&gt;&lt;span class="p"&gt;)];&lt;/span&gt;
    &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;client_context&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;clients&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;MAX_QUEUE_SIZE&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="c1"&gt;// Mutex and condition variables for synchronization&lt;/span&gt;
    &lt;span class="n"&gt;pthread_mutex_t&lt;/span&gt; &lt;span class="n"&gt;mutex&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;pthread_cond_t&lt;/span&gt; &lt;span class="n"&gt;not_empty&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;pthread_cond_t&lt;/span&gt; &lt;span class="n"&gt;not_full&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="n"&gt;RIOC_ALIGNED&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This design provides several benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Connection/Processing Separation&lt;/strong&gt;: Connection handling is decoupled from request processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Controlled Concurrency&lt;/strong&gt;: The worker pool size limits the number of concurrent operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficient CPU Utilization&lt;/strong&gt;: Work is distributed across available cores without oversubscription&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduced Contention&lt;/strong&gt;: Cache-line alignment and atomic operations minimize lock contention&lt;/li&gt;
&lt;/ol&gt;
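A simplified push/pop pair for a queue of this shape might look as follows (illustrative `demo_queue`, not RIOC's exact implementation; capacity checks omitted for brevity): the atomic indices support the lock-free emptiness test, while the mutex and condition variable cover only mutation and blocking.

```c
// Simplified push/pop sketch for a hybrid work queue: atomic head/tail
// indices allow lock-free emptiness checks; the mutex and condition
// variable are held only while actually mutating or blocking.
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define QUEUE_CAP 64

struct demo_queue {
    atomic_size_t head;              // next slot to pop
    atomic_size_t tail;              // next slot to push
    void *items[QUEUE_CAP];
    pthread_mutex_t mutex;
    pthread_cond_t not_empty;
};

void demo_queue_push(struct demo_queue *q, void *item) {
    pthread_mutex_lock(&q->mutex);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    q->items[tail % QUEUE_CAP] = item;
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->mutex);
}

void *demo_queue_pop(struct demo_queue *q) {
    pthread_mutex_lock(&q->mutex);
    // Block until the emptiness test says there is work
    while (atomic_load_explicit(&q->head, memory_order_acquire) ==
           atomic_load_explicit(&q->tail, memory_order_acquire)) {
        pthread_cond_wait(&q->not_empty, &q->mutex);
    }
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    void *item = q->items[head % QUEUE_CAP];
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    pthread_mutex_unlock(&q->mutex);
    return item;
}
```

Consumers that only want to peek can read `head` and `tail` without touching the mutex at all, which is where the contention savings come from.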

&lt;p&gt;For memory barriers, RIOC uses C11's atomic operations to ensure proper ordering:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Store with release semantics&lt;/span&gt;
&lt;span class="n"&gt;atomic_store_explicit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;work_queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;next_tail&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_order_release&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Load with acquire semantics to ensure visibility of previous stores&lt;/span&gt;
&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;head&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;atomic_load_explicit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;work_queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_order_acquire&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Secure Communication with mTLS
&lt;/h2&gt;

&lt;p&gt;Security shouldn't come at the expense of performance. RIOC implements TLS 1.3 with mutual authentication (mTLS) for secure client-server communication.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is mTLS?
&lt;/h3&gt;

&lt;p&gt;Traditional TLS primarily authenticates the server to the client. Mutual TLS (mTLS) extends this by also authenticating the client to the server:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr45rw2hps10fof3kqjue.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr45rw2hps10fof3kqjue.png" alt="mTLS" width="800" height="501"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This bidirectional verification ensures that both parties are who they claim to be, which is essential in distributed systems where nodes need to trust each other.&lt;/p&gt;

&lt;h3&gt;
  
  
  RIOC's TLS Implementation
&lt;/h3&gt;

&lt;p&gt;RIOC implements TLS using OpenSSL with the following key features:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;TLS 1.3 Only&lt;/strong&gt;: Enforces the use of the latest TLS protocol version
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;   &lt;span class="n"&gt;SSL_CTX_set_min_proto_version&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TLS1_3_VERSION&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
   &lt;span class="n"&gt;SSL_CTX_set_max_proto_version&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TLS1_3_VERSION&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Client and Server Verification&lt;/strong&gt;: Optional but recommended mutual authentication
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;   &lt;span class="c1"&gt;// When verification is enabled&lt;/span&gt;
   &lt;span class="n"&gt;SSL_CTX_set_verify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SSL_VERIFY_PEER&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;SSL_VERIFY_FAIL_IF_NO_PEER_CERT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Hostname/IP Validation&lt;/strong&gt;: Ensures the certificate matches the expected hostname
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;   &lt;span class="n"&gt;X509_VERIFY_PARAM&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;param&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SSL_get0_param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ssl&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
   &lt;span class="n"&gt;X509_VERIFY_PARAM_set1_host&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;param&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hostname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strlen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hostname&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Performance Implications of TLS
&lt;/h3&gt;

&lt;p&gt;TLS traditionally adds overhead, but RIOC mitigates this through several approaches:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Chunked TLS I/O&lt;/strong&gt;: Uses chunk sizes just under the 16KB TLS record limit for encryption/decryption
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;   &lt;span class="cp"&gt;#define RIOC_TLS_CHUNK_SIZE 16000  // Slightly less than 16KB for TLS overhead
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Session Reuse&lt;/strong&gt;: Maintains TLS sessions for repeated connections between the same endpoints, avoiding expensive handshakes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Modern Ciphers&lt;/strong&gt;: TLS 1.3 permits only efficient AEAD cipher suites, such as ChaCha20-Poly1305 and AES-GCM&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Batch Processing for Amplified Performance
&lt;/h2&gt;

&lt;p&gt;Each individual operation carries fixed overhead: a network round trip, protocol framing, and system calls. RIOC's batch API amortizes these costs by grouping multiple operations into a single request.&lt;/p&gt;

&lt;p&gt;The implementation uses a structured approach to minimize memory allocations and maximize locality:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;rioc_batch_op&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;rioc_op_header&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;RIOC_MAX_KEY_SIZE&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;  &lt;span class="c1"&gt;// Fixed buffer for key&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;value_ptr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;              &lt;span class="c1"&gt;// Pointer to value in shared buffer&lt;/span&gt;
    &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;value_offset&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;          &lt;span class="c1"&gt;// Offset in batch buffer&lt;/span&gt;
    &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;rioc_response&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Pre-allocated response&lt;/span&gt;
    &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;iovec&lt;/span&gt; &lt;span class="n"&gt;iov&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;RIOC_MAX_IOV&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt; &lt;span class="c1"&gt;// Pre-allocated IOVs&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="n"&gt;__attribute__&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;aligned&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RIOC_CACHE_LINE_SIZE&lt;/span&gt;&lt;span class="p"&gt;)));&lt;/span&gt;

&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;rioc_batch&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;rioc_batch_header&lt;/span&gt; &lt;span class="n"&gt;batch_header&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;rioc_batch_op&lt;/span&gt; &lt;span class="n"&gt;ops&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;RIOC_MAX_BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;value_buffer&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;           &lt;span class="c1"&gt;// Single buffer for all values&lt;/span&gt;
    &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;value_buffer_size&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;value_buffer_used&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;iov_count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="n"&gt;__attribute__&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;aligned&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RIOC_CACHE_LINE_SIZE&lt;/span&gt;&lt;span class="p"&gt;)));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This structure optimizes memory usage in several ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Single Allocation&lt;/strong&gt;: Values are stored in a contiguous buffer, reducing fragmentation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-allocated IOVs&lt;/strong&gt;: Each operation has pre-allocated I/O vectors, avoiding dynamic allocations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache Alignment&lt;/strong&gt;: Structures are aligned to cache lines to prevent false sharing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key Size Limits&lt;/strong&gt;: Fixed-size key buffers avoid dynamic allocation for common cases&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The batch API provides both synchronous and asynchronous interfaces:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Freh2zp5wkdqsggkr5eu3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Freh2zp5wkdqsggkr5eu3.png" alt="RIOC API" width="800" height="1112"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-world Performance Considerations
&lt;/h2&gt;

&lt;p&gt;While the optimizations described above yield significant benefits, several real-world factors influence RIOC's performance:&lt;/p&gt;

&lt;h3&gt;
  
  
  Network Latency
&lt;/h3&gt;

&lt;p&gt;As expected, the deployment environment drastically affects observed latency:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8dngufhkdjs2hv3gjl09.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8dngufhkdjs2hv3gjl09.png" alt="Network Latency" width="800" height="177"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When latency increases, batching becomes even more critical for maintaining throughput. Our tests show:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Same Machine&lt;/strong&gt;: Single operations take ~15μs, while batched operations amortize to &amp;lt;1μs per operation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local Network&lt;/strong&gt;: Single operations take ~500-800μs, while batched operations drop to ~100μs per operation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Region&lt;/strong&gt;: Batching becomes essential, reducing effective per-operation time by up to 90%&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Memory Management
&lt;/h3&gt;

&lt;p&gt;RIOC uses thread-local buffers and alignment to optimize memory access:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kr"&gt;__thread&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;recv_buffer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="n"&gt;__attribute__&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;aligned&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RIOC_CACHE_LINE_SIZE&lt;/span&gt;&lt;span class="p"&gt;)));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prevents false sharing between threads and reduces cache line bouncing. The &lt;code&gt;__thread&lt;/code&gt; qualifier ensures each thread has its own copy of the buffer, eliminating contention and synchronization needs.&lt;/p&gt;

&lt;p&gt;The implementation also strategically uses prefetching hints to mitigate memory latency:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Prefetch the next IOV entry before processing&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;curr_iovcnt&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;__builtin_prefetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;curr_iov&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// Read, high temporal locality&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building high-performance networking for distributed systems requires careful attention to both low-level details and higher-level architectural choices. In RIOC, we've tried to balance several competing concerns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Performance through vectored I/O, zero-copy transfers, socket tuning, and batching&lt;/li&gt;
&lt;li&gt;Security through TLS 1.3 and mutual authentication&lt;/li&gt;
&lt;li&gt;Reliability through timeout handling and error recovery&lt;/li&gt;
&lt;li&gt;Maintainability through clear architecture and platform abstractions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While we're pleased with the results achieved so far, we recognize that networking performance optimization is never truly "finished." There's always room for improvement, and we continue to learn and refine our approach as we gather more real-world usage data and as the underlying platforms evolve.&lt;/p&gt;

&lt;p&gt;Many of the techniques described here weren't novel inventions—they build on the excellent work of others in the systems and networking communities. Our contribution has been to synthesize these approaches into a coherent system that addresses the specific needs of high-performance distributed key-value stores.&lt;/p&gt;

&lt;p&gt;If you're interested in exploring RIOC further or contributing to its development, you can find the code (except the server component) in our &lt;a href="https://github.com/hpkv-io/rioc" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; repository as part of the HPKV project. We welcome feedback, questions, and contributions as we continue this journey.&lt;/p&gt;

</description>
      <category>networking</category>
      <category>security</category>
      <category>database</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
