Stability Over Raw Speed: The "Arena" Architecture
The biggest enemy of stability in Go databases is the Garbage Collector (GC). If you store 1 million vectors as separate slice objects, the GC has to scan 1 million pointers. This causes "Stop-the-World" pauses, making latency spike unpredictably.
To fix this, I didn't use fancy tricks. I used contiguous memory.
I implemented a Vector Arena. Instead of allocating millions of small objects, Pomai allocates massive, flat arrays of float32.
// From packages/ds/vector/arena.go
type VectorArena struct {
    // A flat slice of chunks. Reading inside a chunk is thread-safe.
    chunks [][]float32
    // ...
}
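To show how lookups stay allocation-free, here is a minimal sketch (my illustration, not the actual Pomai code) of reading vector i out of the arena, assuming fixed-dimension vectors and a fixed number of vectors per chunk:
// Hypothetical sketch, not from the repo. Assumes every vector has
// `dims` floats and each chunk holds `perChunk` vectors.
func (a *VectorArena) Get(i, dims, perChunk int) []float32 {
    chunk := a.chunks[i/perChunk]
    off := (i % perChunk) * dims
    return chunk[off : off+dims] // sub-slice: no copy, no allocation
}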
The Result:
- Zero Pointer Chasing: the GC sees one big object, not millions.
- CPU Cache Friendly: data is laid out sequentially.
- Stability: on my Dell, I can load vectors and search them without random CPU spikes.
Respecting the CPU Cache (False Sharing)
When running on a dual-core or quad-core laptop, concurrency contention can kill performance.
In the core Store structure, I tracked hits and misses using atomic counters. But there was a hidden problem: False Sharing. If these two counters sit on the same 64-byte cache line, Core A updating hits invalidates that line for Core B updating misses. They fight over the bus.
I fixed this by forcing memory padding, ensuring they live on different cache lines:
// From internal/engine/core/store.go
type Store struct {
    // ... config fields ...

    // Padding A: separate hits from the preceding fields.
    _    [56]byte
    hits atomic.Uint64

    // Padding B: CRITICAL. Ensure hits and misses are never neighbors in the L1 cache.
    _      [56]byte
    misses atomic.Uint64
}
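If you prefer not to hardcode the 56-byte padding, the golang.org/x/sys/cpu package exposes a CacheLinePad type that does the same job on any architecture. A sketch of the equivalent layout (my variation, not the code in store.go):
import (
    "sync/atomic"

    "golang.org/x/sys/cpu"
)

type paddedCounters struct {
    hits   atomic.Uint64
    _      cpu.CacheLinePad // keeps misses on its own cache line
    misses atomic.Uint64
}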
This small change didn't make the database "magically faster," but it made the CPU usage flat and predictable under load.
Survival Mode: Adaptive Tuning
My laptop doesn't have infinite resources. If a container or process starts eating too much RAM, the OS invokes the OOM Killer.
Pomai Cache includes a SysAdapt module. On startup, it inspects the environment (Cgroups or /proc/meminfo).
- If RAM is tight: it aggressively lowers the GOGC percentage to force more frequent collections.
- If the CPU is choking: the AutoTuner detects high latency in vector searches and slightly reduces the search precision (ef_search).
It trades a bit of recall accuracy for survival, prioritizing a live, responsive process over a perfect one.
// From internal/engine/core/sysadapt.go
func ApplySystemAdaptive() {
    // Detects cgroup limits (Docker/K8s) or host memory.
    memLimit := detectCgroupMemoryLimit()
    // Heuristic: if memory per core is low, throttle parallelism.
    if memLimit > 0 {
        // ... tune GOMAXPROCS and GCPercent automatically
    }
}
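For reference, here is roughly what that tuning can look like with the standard library alone. The thresholds below are illustrative assumptions, not Pomai's real heuristics:
import (
    "runtime"
    "runtime/debug"
)

// Illustrative sketch only: the cutoffs are assumptions.
func tuneForLowMemory(memLimit int64) {
    if memLimit > 0 && memLimit < 2<<30 { // under ~2 GiB
        debug.SetGCPercent(50)                  // collect more aggressively
        debug.SetMemoryLimit(memLimit * 9 / 10) // leave headroom for the OS
        runtime.GOMAXPROCS(2)                   // throttle parallelism
    }
}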
Hybrid Storage: "Granules" & Compression
Storing large objects (images, audio) in RAM is expensive. I implemented PGUS (Pomai Granular Unified Storage). It breaks large values into fixed-size "granules" (like chunks).
But here is the cool part: It uses Entropy-based Compression (PEC).
Before storing, it calculates the entropy of the data chunk:
- High entropy (likely already compressed, e.g., JPG): store raw and save CPU.
- Low entropy (JSON, logs): compress with Snappy/Zstd and save RAM.
This keeps the memory footprint on my laptop low without wasting CPU cycles trying to compress incompressible data.
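The decision logic is simple to sketch. Here is a minimal version; the 7.5 bits/byte threshold is my assumption, and I use Snappy since the post mentions it:
import (
    "math"

    "github.com/golang/snappy" // one of the codecs mentioned above
)

// shannonEntropy returns bits per byte, from 0 (constant) to 8 (random).
func shannonEntropy(b []byte) float64 {
    if len(b) == 0 {
        return 0
    }
    var freq [256]int
    for _, c := range b {
        freq[c]++
    }
    h, n := 0.0, float64(len(b))
    for _, f := range freq {
        if f > 0 {
            p := float64(f) / n
            h -= p * math.Log2(p)
        }
    }
    return h
}

func encodeGranule(b []byte) []byte {
    if shannonEntropy(b) > 7.5 { // assumption: illustrative threshold
        return b // likely already compressed (JPG, MP4): store raw
    }
    return snappy.Encode(nil, b)
}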
A First Benchmark
I ran a benchmark on my Dell Latitude E5440:
- Workload: mixed vector search + key-value operations.
- Throughput: ~5,000 requests/second.
- Errors: 0.
- Latency: < 2 ms (p50).
Under the Hood: How It Actually Works
You might be wondering: "Okay, it stores data, but how does a request actually flow through the system?"
Here is the lifecycle of a request in Pomai Cache, designed for zero-allocation performance:
The Network Layer (gnet): Unlike standard Go net/http, which spawns a goroutine per connection (expensive at scale), Pomai uses gnet, an event-loop networking library built on epoll/kqueue. It handles thousands of concurrent connections on a single thread before passing data to the worker pool.
Zero-Copy Protocol: The binary protocol is simple: [MagicByte][OpCode][KeyLen][ValLen][Key][Value]. The parser doesn't allocate new strings for every key. It slices the bytes directly from the network buffer.
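To make that concrete, here is a sketch of such a parser. The field widths and magic value are my assumptions; Pomai's real wire format may differ:
import "encoding/binary"

const magic = 0x70 // assumption: illustrative value only

// parseFrame returns key/value as sub-slices of buf: zero copies.
func parseFrame(buf []byte) (op byte, key, val []byte, ok bool) {
    const hdr = 1 + 1 + 4 + 4 // magic + opcode + keyLen + valLen
    if len(buf) < hdr || buf[0] != magic {
        return 0, nil, nil, false
    }
    op = buf[1]
    klen := int(binary.BigEndian.Uint32(buf[2:6]))
    vlen := int(binary.BigEndian.Uint32(buf[6:10]))
    if len(buf) < hdr+klen+vlen {
        return 0, nil, nil, false // wait for more bytes
    }
    key = buf[hdr : hdr+klen]
    val = buf[hdr+klen : hdr+klen+vlen]
    return op, key, val, true
}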
The Routing (Sharding): To avoid a single global lock, the key space is divided into 2048 Shards (configurable): ShardID = hash(key) & (ShardCount - 1). Writes to different shards never block each other, so up to 2048 writes can proceed simultaneously.
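The routing itself is a few lines. A sketch using FNV-1a (which hash Pomai actually uses is an assumption on my part):
import "hash/fnv"

const shardCount = 2048 // must be a power of two for the mask trick

func shardFor(key []byte) uint32 {
    h := fnv.New32a()
    h.Write(key)
    return h.Sum32() & (shardCount - 1)
}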
The "Brain" (Background Agents): While your data is being read/written, several background agents are watching:
- AutoTuner: Monitors latency. If it sees slow Vector Searches, it tells the HNSW index to be "less precise but faster".
- Eviction Manager: Instead of scanning all keys (O(N)), it uses random sampling (like Redis) but weighted by our PPE algorithm (Predicted Next Access). A sketch of the sampling loop follows below.
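Here is that sampling loop, with the scoring reduced to "evict whatever is predicted to be needed furthest in the future". The real PPE weighting is not shown in the post, so this is my simplification:
type entry struct {
    value         []byte
    predictedNext int64 // unix nanos, from the EMA predictor described below
}

// evictCandidate inspects at most sampleSize entries and returns the key
// predicted to be idle the longest. Go's randomized map iteration order
// gives us a cheap random sample.
func evictCandidate(shard map[string]entry, sampleSize int) string {
    var victim string
    var farthest int64 = -1
    seen := 0
    for k, e := range shard {
        if e.predictedNext > farthest {
            farthest, victim = e.predictedNext, k
        }
        seen++
        if seen == sampleSize {
            break
        }
    }
    return victim
}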
Getting Started: Try It on Your Machine
You don't need a cluster to test this. It compiles into a single binary.
Prerequisites
- Go 1.22 or higher (for the latest runtime optimizations).
- Make (optional). The Makefile isn't finished yet, so just run the commands below directly with go.
Build from Source
Clone the repo and build the binary:
git clone https://github.com/AutoCookies/pomai-cache.git
cd pomai-cache
# Build the optimized binary
go build -ldflags="-s -w" -o pomai-server ./cmd/server/main.go
(The -s -w flags strip debug information to make the binary smaller).
Running "Survival Mode" (Low RAM)
If you are running on a limited laptop like mine (or a small Docker container), use these flags to prevent OOM:
# Limits RAM to 4GB, uses WAL for durability
./pomai-server \
  --persistence=wal \
  --data-dir=./data \
  --mem-limit=4GB \
  --gomaxprocs=2
Running "Performance Mode" (Server)
# Uses all cores, larger write buffer for disk IO
./pomai-server \
  --persistence=wal \
  --write-buffer=10000 \
  --flush-interval=100ms \
  --cache-shards=4096
Running as a Pure RAM Cache
./pomai-server \
  --write-buffer=10000 \
  --flush-interval=100ms \
  --cache-shards=4096
(Without the --persistence flag, Pomai runs as a pure in-RAM cache.)
Benchmark It Yourself
Don't take my word for it. I included a benchmarking tool in the repo:
# Build the benchmark tool
go build -o pomai-bench ./cmd/pomai-bench/main.go
# Run a mixed workload (Vector Search + KV)
./pomai-bench -mode=ai -clients=50 -requests=100000
You should see the "Zombie Mode" kick in if you push it too hard!
The "Secret Sauce": Self-Made Algorithms
I didn't just copy standard algorithms. To make Pomai "Autonomous," I had to invent my own heuristics. Here are the three pillars of its intelligence:
PPPC 3.0 (Pomai Predictive Pruning Cleaner)
Standard TTL (Time-To-Live) is dumb—it deletes data when the timer runs out, even if that data is part of a critical context.
PPPC 3.0 is smarter. It uses a "Peeling Strategy":
- It predicts the "Next Access Time" for every key using an Exponential Moving Average (EMA).
- When memory is low, instead of deleting a whole Graph Cluster, it "peels" the outer layers: the nodes that are least connected and predicted to be cold.
- Result: it keeps the "Core Context" (the seed of the pomegranate) alive while sacrificing the less important edges.
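A minimal sketch of the EMA-based predictor; the smoothing factor alpha = 0.2 is my assumption:
type accessStats struct {
    lastAccess int64   // unix nanos of the most recent access
    emaGapNs   float64 // smoothed interval between accesses
}

const alpha = 0.2 // assumption: illustrative smoothing factor

// touch records an access and updates the moving average of the gap.
func (s *accessStats) touch(now int64) {
    if s.lastAccess != 0 {
        gap := float64(now - s.lastAccess)
        s.emaGapNs = alpha*gap + (1-alpha)*s.emaGapNs
    }
    s.lastAccess = now
}

// predictedNextAccess estimates when the key will be needed again.
func (s *accessStats) predictedNextAccess() int64 {
    return s.lastAccess + int64(s.emaGapNs)
}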
PIE (Pomai Intelligent Eviction)
How do you tune a database? Usually, you edit a config.yaml. Pomai tunes itself using Reinforcement Learning (Multi-Armed Bandit).
- The Agent: continuously monitors the "Reward" function: HitRate / Latency.
- The Action: it dynamically adjusts ef_search (HNSW precision) and the number of eviction samples.
If the server is idle, it increases precision for better Recall. If it's under attack, it lowers precision to survive. A sketch of this loop follows below.
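Here is that loop sketched as an epsilon-greedy bandit. The arms, epsilon, and bookkeeping are my assumptions about how such a tuner could look, not Pomai's actual code:
import "math/rand"

// All three slices must have the same length, one entry per arm.
type tuner struct {
    arms   []int     // candidate ef_search values, e.g. 32, 64, 128
    reward []float64 // running average reward per arm
    count  []int
}

// pick explores a random arm with probability eps, otherwise exploits
// the best-known arm.
func (t *tuner) pick(eps float64) int {
    if rand.Float64() < eps {
        return rand.Intn(len(t.arms))
    }
    best := 0
    for i := range t.reward {
        if t.reward[i] > t.reward[best] {
            best = i
        }
    }
    return best
}

// update feeds back the observed reward (the post's HitRate / Latency).
func (t *tuner) update(arm int, hitRate, latencyMs float64) {
    t.count[arm]++
    r := hitRate / latencyMs
    t.reward[arm] += (r - t.reward[arm]) / float64(t.count[arm])
}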
PMAC (Pomai Multi-Agent Clustering)
In manager.go, I didn't use Raft or Paxos (too heavy). I built a Gossip-based Agent System.
- Geo-Latency Aware: nodes ping each other. If Node A and Node B are physically close (<5ms), they automatically form a "shard group" to replicate data faster.
- PLBR (Probabilistic Burst Replication): if a key becomes "Hot" (accessed > 1000 times/s), the owner node probabilistically "bursts" (replicates) that key to random peers to spread the load instantly (sketched below).
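A sketch of the burst decision: the 1000 req/s threshold comes from the description above, while the fan-out probability is my assumption:
import "math/rand"

// maybeBurst replicates a hot key to a random subset of peers.
func maybeBurst(accessesPerSec float64, peers []string, replicate func(peer string)) {
    if accessesPerSec < 1000 { // hotness threshold from the post
        return
    }
    p := min(1.0, accessesPerSec/10000) // assumption: hotter keys burst wider
    for _, peer := range peers {
        if rand.Float64() < p {
            replicate(peer)
        }
    }
}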
The Verdict: It Just Works
We often obsess over theoretical maximums—"can it do 1 million IOPS?"—but rarely talk about reliability on constrained hardware.
I ran the final benchmark on my Dell Latitude E5440 (Intel Core i5-4300U, DDR3 RAM). I pushed it with 50 concurrent clients doing a mix of Vector Searches and Key-Value writes using the pomai-bench tool included in the repo.
The Results:
- Throughput: ~5,048 requests/second.
- Bandwidth: ~17.28 MB/s.
- Average latency: 1.664 ms.
- Total errors: 0.
The most important number there isn't the 5,000 req/s. It's the 0 errors.
Despite the heavy load, the SysAdapt module kept the Garbage Collector in check, and the VectorArena prevented memory fragmentation. The CPU usage was high but flat—no jagged spikes that usually freeze the OS.
Pomai Cache proves that you don't need a $10,000 server to run a modern, AI-native database. You just need to respect the hardware, align your memory, and stop fighting the CPU cache.
Some benchmarks I ran on my old laptop (screenshots in the original post): Graph mode, Hash mode, and KV mode (key-value).
What’s Next?
Pomai is stable, but it's still evolving. My goal isn't to replace Redis or Postgres, but to offer a simpler, all-in-one alternative for AI Agents and Edge deployments.
Here is what I am working on next to make it even better:
PQL (Pomai Query Language): Currently, you use API methods. I am building a SQL-like parser to allow complex queries like SEARCH VECTOR ... FILTER GRAPH ... in a single network call.
Transactions: Adding multi-shard ACID guarantees for financial-grade data integrity.
WASM Runtime: Allowing you to push small Go/Rust functions directly into Pomai to run logic next to your data (Zero-Latency).
It's not breaking any world records. But it runs smoothly on hardware from 2013, it handles Vectors, Graphs, and KV data in a single binary, and it doesn't crash when I open a browser tab alongside it. There are other modes as well (ai-mode, plg-mode, pic-mode), but I don't think they are fully optimized yet.
For me, that's the definition of Production Grade.
If you are interested in seeing how I implemented the HNSW Index or the Gossip Protocol in Go, check out the repo.
Repo Link is here:
AutoCookies/pomai-cache
Pomai Cache — Production-Grade AI-Native In-Memory Data Platform
Pomai Cache is a hybrid in-memory data platform engineered for modern AI and real-time systems. It unifies key-value caching, vector search, time-series, graph relationships, and matrix operations in a single binary with adaptive runtime tuning, predictive eviction, and production-grade persistence and clustering features.
Happy Coding!







