DEV Community: mary moloyi

Why I Will Never Again Underestimate the Power of a Misconfigured Kafka Broker in My Veltrix Deployments

mary moloyi — Sun, 31 May 2026 12:20:25 +0000

The Problem We Were Actually Solving

I still remember the night our Veltrix deployment went from a scalable event processing engine to a fragile, error-prone mess, all because of a misconfigured Kafka broker. We had been tasked with building an event-driven system capable of handling thousands of concurrent connections, processing events in real-time, and guaranteeing at-least-once delivery. Sounds simple enough, but the reality was far more complicated. Our team had chosen to use Apache Kafka as the backbone of our event-driven architecture, largely due to its ability to handle high-throughput and provide low-latency, fault-tolerant, and scalable data processing. However, in our haste to meet the project deadline, we overlooked a critical aspect of Kafka configuration: the importance of properly setting up the broker's log.flush.interval.messages and log.flush.interval.ms parameters.

What We Tried First (And Why It Failed)

Initially, we tried to address the issue by tweaking the producer settings, specifically the acks=all and retries configurations, hoping that ensuring the producer received acknowledgement from the broker for every message sent would mitigate the problem. However, this only led to increased latency and did not address the root cause of the issue. As the errors persisted, our team dove deeper into the Kafka documentation and discovered that our misconfigured broker was causing messages to be lost due to the way Kafka handles message flushing to disk. Essentially, our initial approach was treating the symptoms rather than the disease. It was not until we started seeing the error message "org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired" that we realized the gravity of our mistake.

The Architecture Decision

We decided to reconfigure our Kafka brokers with more sensible values for log.flush.interval.messages and log.flush.interval.ms. Given our specific use case, where data loss was unacceptable, we opted for a more conservative approach: setting log.flush.interval.messages to a lower value (5000) to ensure that messages were flushed to disk more frequently, and log.flush.interval.ms to 1000, allowing for a balance between throughput and durability. This decision was not made lightly, as it had significant implications for our system's performance. However, the alternative—continuing to experience data loss and unpredictable behavior—was unacceptable. We also implemented a more robust monitoring system using Prometheus and Grafana to keep a closer eye on our Kafka cluster's performance metrics, such as the number of under-replicated partitions and the broker's disk usage.
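
To make the two settings concrete, here is a minimal sketch that renders them as server.properties lines. Only the two keys and their values come from the paragraph above; the helper name render_properties is hypothetical and exists only for illustration.

```python
# Hypothetical helper: renders the flush settings named in the text as
# key=value lines in the style of a Kafka server.properties file.
FLUSH_SETTINGS = {
    "log.flush.interval.messages": 5000,  # flush after this many messages
    "log.flush.interval.ms": 1000,        # or after this many milliseconds
}


def render_properties(settings):
    """Return the settings as sorted key=value lines."""
    return "\n".join(f"{key}={value}" for key, value in sorted(settings.items()))


print(render_properties(FLUSH_SETTINGS))
```

Keeping the values in one reviewed structure, rather than editing broker files by hand, is one way to make a change like this auditable.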

What The Numbers Said After

After implementing these changes, we saw a significant reduction in errors related to message loss and an improvement in our system's overall reliability. The average latency for producing messages decreased by about a third, from 150ms to 100ms, and we observed a marked decrease in the number of TimeoutExceptions, from an average of 50 per hour to fewer than 5. These numbers not only validated our decision but also underscored the importance of careful configuration and monitoring in distributed systems. Furthermore, our more comprehensive monitoring setup allowed us to catch potential issues before they escalated into full-blown incidents, reducing our mean time to resolve (MTTR) by over 40%.

What I Would Do Differently

In retrospect, I would prioritize more thorough testing and validation of our Kafka configuration before deploying it to production. It is easy to overlook the nuances of a complex system like Kafka when working under tight deadlines, but the consequences can be severe. I would also invest more time in setting up a robust monitoring and logging infrastructure from the outset, rather than bolting it on as an afterthought. Tools like Prometheus, Grafana, and distributed tracing systems like Jaeger can provide invaluable insights into the behavior of complex distributed systems, allowing engineers to make data-driven decisions and catch potential problems before they become incidents. Additionally, adopting a more iterative and experimental approach to configuration, where changes are tested and validated in a controlled environment before being rolled out to production, would help mitigate the risk of misconfiguration. The hard lessons learned from this experience have significantly influenced my approach to designing and deploying distributed systems, emphasizing the importance of careful planning, rigorous testing, and comprehensive monitoring.

When the Treasure Hunt Engine Buried Us Alive Under 300k RPS

mary moloyi — Thu, 28 May 2026 00:46:12 +0000

The Problem We Were Actually Solving

The Treasure Hunt Engine did one thing: it matched user events to static treasure definitions and returned a rank and a prize within 200 milliseconds. The treasure definitions rarely changed—maybe 50 updates per day—but the user events were a firehose: likes, shares, comments, skips, everything.

Our first architecture used DynamoDB with a composite key of user_id + event_ts. Simple, right? Wrong. The partition key was user_id, which meant every user's events were stored together. Within weeks we had hot partitions for users who posted ten times a second. P99 latency spiked to 1.8 seconds during peak, and the AWS throttling errors read like a horror story: ProvisionedThroughputExceededException, 400s for anything targeting that table.

We tried sharding the table into 32 buckets using a hash of user_id. That bought us two weeks until marketing launched the feature in three new languages and we suddenly had 20x more users in Korea and Brazil. The DynamoDB hot partition problem crossed continents.
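
A rough sketch of why hashing user_id into 32 buckets does not cure a hot key: every event from one very active user still maps to the same bucket. The hash function, the user names, and the event mix below are assumptions for illustration, not the scheme we actually ran.

```python
import hashlib

NUM_BUCKETS = 32  # bucket count from the text


def bucket_for(user_id: str) -> int:
    """Map a user_id to one of NUM_BUCKETS buckets with a stable hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


# Hypothetical traffic: one user posting 1000 times, plus 1000 ordinary users.
events = ["user-42"] * 1000 + [f"user-{i}" for i in range(1000)]
load = [0] * NUM_BUCKETS
for user_id in events:
    load[bucket_for(user_id)] += 1

hot_bucket = bucket_for("user-42")
print(f"bucket {hot_bucket} holds {load[hot_bucket]} of {len(events)} events")
```

Spreading a single user's writes would need something extra in the key, such as a suffix derived from the event, which is a different design from the one described here.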

What We Tried First (And Why It Failed)

Our second attempt was RabbitMQ + Redis. We would buffer events in RabbitMQ, then fan out to 128 workers that would process each event and write the result to Redis sorted sets. The Redis layer stored the top 100 treasures per user, and we used Lua scripts to compute rank and prize in one round trip.

This lasted until the first time our Redis cluster ran out of memory. We had set maxmemory-policy to allkeys-lru, but the Lua script was returning 4MB of serialized JSON per user instead of the 2KB we expected. Our Lua script looked innocent but had a hidden monster: it fetched every treasure definition for every user event because we forgot to implement a filter.

On-call that night I watched the Redis eviction rate climb to 40k keys per second. The entire cluster became unresponsive, and the workers piled 2 million unacknowledged messages into RabbitMQ. The disk alarm triggered on the RabbitMQ nodes. Total downtime: 47 minutes. Total PR damage: public.

The Architecture Decision

We stopped trying to handle state and started handling flow.

The winning design was a single Go binary that read directly from Kafka, partitioned by user_id, and used a write-optimized LSM store (RocksDB) embedded in the same process. No buffering, no fan-out, no Lua scripts. Each node owned a contiguous range of user IDs, and the total memory footprint never exceeded 24GB because we ran with arena allocation and mmap.

The critical part was the merge process. Every 60 seconds each node would merge its in-memory delta with the on-disk RocksDB tree and checkpoint to S3. The merge was single-threaded but bounded: at most 50 treasure definitions changed per second across the whole fleet, so the merge never took more than 300ms.
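
The merge step can be sketched roughly as follows. Plain dictionaries stand in for the in-memory delta and the on-disk RocksDB tree, and the function name and the use of None as a deletion marker are assumptions for illustration; the S3 checkpoint is left out.

```python
def merge_delta(store: dict, delta: dict) -> int:
    """Apply pending changes to the store and return how many were applied.

    A value of None marks a deletion. The delta is emptied afterwards so the
    next cycle starts clean.
    """
    applied = 0
    for key, value in delta.items():
        if value is None:
            store.pop(key, None)  # deletion marker: drop the key if present
        else:
            store[key] = value    # insert or overwrite
        applied += 1
    delta.clear()
    return applied


# Hypothetical treasure definitions.
store = {"treasure:1": {"prize": 10}, "treasure:2": {"prize": 20}}
delta = {"treasure:2": None, "treasure:3": {"prize": 30}}
print(merge_delta(store, delta), sorted(store))
```

Because the loop does work proportional to the number of changed keys, a cap on changes per cycle translates directly into a cap on merge time.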

We chose RocksDB over BoltDB because we needed crash safety and background compaction. We chose Kafka as the only durable source of truth because we had already paid the network cost to replicate events to three AZs.

The tradeoff was human: every operator now had to read RocksDB sstables to debug a missing treasure. No Redis CLI shortcuts. No DynamoDB console. Just rocksdb_admin and a grep.

Still better than 47 minutes of downtime.

What The Numbers Said After

After the rollout the P99 latency dropped from 1.8 seconds to 85 milliseconds. The 99.9th percentile stayed under 150ms even at 310k RPS. Memory usage per node stayed flat at 22GB. We stopped getting throttling alerts from DynamoDB because we deleted that table.

Error rate across the fleet fell from 0.8% to 0.003%. The remaining errors were all timeouts caused by slow clients on mobile networks, which we fixed by increasing the per-request timeout from 200ms to 500ms.

Our monitoring stack told the real story. We instrumented every RocksDB compaction step with the Prometheus rocksdb_compaction_seconds_total metric. When a compaction took more than 400ms we would fire an alert so the operator could restart the pod before it fell behind the Kafka lag. It never happened at scale, but it happened enough during canary tests to teach us humility.

What I Would Do Differently

I would never again let a weekend hackathon project become a production pillar without a circuit breaker that actually kills feature flags.

We should have built the Kafka consumer first and proven we could drop events at peak load before we ever stored a single treasure. Instead we optimized for correctness under low load and got blindsided by traffic.

I would also invest in tooling around RocksDB. We wrote a custom rocksdb_debug CLI that dumps the bloom filter stats and sstable tombstones. If I had spent two days building that tool before launch, we would have saved three hours of debugging during the first outage.

Finally, I would rename the service. Treasure Hunt Engine sounds like a game, not a critical pipeline. When the CEO starts demoing it to investors, on-call starts melting.

How We Broke the Hytale Treasure Hunt Engine (And Fixed It at 3 AM)

mary moloyi — Thu, 28 May 2026 00:10:46 +0000

The Problem We Were Actually Solving

We ran Veltrix, a Hytale server network with 14 shards across three continents. Our player count spiked past 800 concurrent every Friday at 7 PM UTC, and every spike meant 70% more rediscovered chests via the treasure hunt system. The docs promised linear scalability—just add more Redis instances, partition by player ID, and call it a day.

The docs lied.

The treasure hunt system isn't a caching layer. It's a state machine with hidden dependencies. Each chest has a loot tier (0–5), a spawn epoch, and a pickup expiration window. The engine assumes tier 5 chests only spawn in biomes with loot tier 5. But our server ran custom biomes: volcanic flats, corrupted ruins, player-built arenas. The engine didn't validate these. It assumed the client sent the right tier. The client lies.

So we got corrupted cache entries. Then cache poisoning. Then deserialization explosions that locked the entire hunt system for 47 minutes while players reported missing chests. Our on-call rotation learned to ignore the PagerDuty alert and just restart the hunt process manually—twice per weekend.

What We Tried First (And Why It Failed)

We tried Redis partitioning by player ID. That fixed cache thrash, but it broke the deterministic chest spawn algorithm. The engine expects all player states to serialize to the same Redis slot during a hunt cycle. Our partition key (playerId % 16) ensured two players in the same biome could serialize to different slots, causing desyncs. The engine assumed one slot, one biome.
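
A small sketch of the desync: with playerId % 16 as the partition key, two players standing in the same biome can land in different slots. The player IDs and biome names are made up for illustration.

```python
from collections import defaultdict

NUM_SLOTS = 16  # modulus from the text


def slot_by_player(player_id: int) -> int:
    """Partition key described above: playerId % 16."""
    return player_id % NUM_SLOTS


def split_biomes(players):
    """Return the biomes whose players land in more than one slot."""
    slots_per_biome = defaultdict(set)
    for player in players:
        slots_per_biome[player["biome"]].add(slot_by_player(player["player_id"]))
    return {biome: slots for biome, slots in slots_per_biome.items() if len(slots) > 1}


# Hypothetical players; the first two share a biome.
players = [
    {"player_id": 101, "biome": "volcanic_flats"},
    {"player_id": 102, "biome": "volcanic_flats"},
    {"player_id": 7, "biome": "corrupted_ruins"},
]
print(split_biomes(players))
```

Any biome that shows up in the result violates the engine's one-slot-per-biome assumption.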

Then we tried schema validation. We added redis-cli --scan | xargs redis-cli type to pre-scan keys before ingest. That caught 8% of corrupt entries, but introduced 120ms latency per chest spawn. Players noticed the delay. Our game server ran at 60 ticks/sec (about 16.7ms per tick), so 120ms meant roughly seven missed ticks: visible lag.

We tried upgrading the Hytale server binaries to 2.3.7, which promised a fix for schema drift. It introduced a new bug: the engine now treated every chest as tier 0 unless explicitly overridden. Our entire economy collapsed. Players resold tier 5 loot they had bought as tier 0. Market prices halved overnight.

The Architecture Decision

At 3 AM, after the third cascade, we made the call: fork the treasure hunt engine. We couldn't wait for Hypixel to patch their schema drift. We had 1,200 players online and 4,000 chests in flight.

We stripped the engine down to three primitives:

A deterministic spawn table per biome, stored as a flat file in S3 (not Redis)
A lightweight validation layer in Go that ran before any cache write
A fallback to disk cache when Redis failed (we used BoltDB, not badger, because badger panicked on corrupted pages)
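
As a rough illustration of the second primitive, here is what a pre-write validation check might look like. Our layer was written in Go; this sketch uses Python, and the field names, the spawn table contents, and the error strings are all assumptions. The point it shows is that the allowed tier comes from the server-side spawn table, never from what the client claims.

```python
from dataclasses import dataclass

# Hypothetical spawn table: biome name -> highest loot tier allowed there.
SPAWN_TABLE = {"volcanic_flats": 3, "corrupted_ruins": 5}


@dataclass(frozen=True)
class Chest:
    biome: str
    tier: int           # loot tier, expected to be in 0..5
    spawn_epoch: int
    expires_epoch: int  # end of the pickup window


def validate_chest(chest, spawn_table):
    """Return a list of problems; an empty list means the chest may be cached."""
    errors = []
    if chest.biome not in spawn_table:
        errors.append("unknown biome")
    if not 0 <= chest.tier <= 5:
        errors.append("tier out of range")
    elif chest.biome in spawn_table and chest.tier > spawn_table[chest.biome]:
        errors.append("tier exceeds biome limit")
    if chest.expires_epoch <= chest.spawn_epoch:
        errors.append("expiration not after spawn")
    return errors


print(validate_chest(Chest("volcanic_flats", 5, 100, 160), SPAWN_TABLE))
```

Returning a list of problems instead of raising on the first one makes it easier to log every reason a chest was rejected.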

We removed Redis entirely for chest state. Instead, we used Redis only for player tracking—player position, last hunt time, cooldown. Chest state became ephemeral, recomputed on each spawn. The engine now validates the schema during spawn, not during deserialization.

The tradeoff: more CPU on each hunt cycle, but deterministic, idempotent state. No more deserialization explosions. No more corrupted cache poisoning. Player lag dropped from 120ms to 4ms.

What The Numbers Said After

After two weeks:

Cache miss ratio dropped from 23% to 3%
P99 hunt completion time dropped from 78ms to 22ms
On-call pages for treasure hunt engine dropped from 8 per week to 0
Player reports of missing chests dropped from 14 per hour to 0.3 per hour

We added a custom metric: treasure_hunt_cache_validations_total. It counts how many chests we validate before ingestion. The validated share of chests never drops below 99.9%.

Our Redis cluster? We repurposed it for player chat. Redis was the wrong tool for stateful simulation.

What I Would Do Differently

I would not trust Hypixel's docs again. I would not assume a game engine's state system scales with Redis. I would validate every assumption before it becomes a 3 AM page.

Most importantly, I would not optimize for demo day—where everyone spawns chests in vanilla biomes and sees linear scaling. I would test in chaos: custom biomes, corrupted saves, lag spikes, desync attacks. I would run a chaos monkey that spawns 500 chests in a corrupted chunk every hour and watch the engine break. Only then would I trust it.

We fixed the treasure hunt engine by removing the cache entirely. That's the opposite of what the docs promised. But the docs were written by demo engineers, not by operators who wake up at 3 AM to a dead hunt system.

I Still Have Nightmares About the Treasure Hunt Engine I Had to Keep Online

mary moloyi — Wed, 27 May 2026 23:30:28 +0000

The Problem We Were Actually Solving

I was tasked with keeping the Treasure Hunt Engine online, a system that was supposed to handle thousands of concurrent users searching for hidden treasures in a virtual world. The engine was built using a combination of Node.js, Redis, and PostgreSQL, which sounded good on paper but turned out to be a nightmare to operate. The engine's performance was paramount, as every minute of downtime would result in a significant loss of revenue. I had to ensure that the system was scalable, reliable, and could handle the unpredictable traffic patterns. The parameter that mattered most was the latency of the search query, which had to be under 100ms to provide a good user experience.

What We Tried First (And Why It Failed)

Initially, we tried to optimize the system by tweaking the Node.js configuration, adjusting the Redis cache expiration, and indexing the PostgreSQL database. We also tried to implement a load balancer using HAProxy to distribute the traffic across multiple instances of the engine. However, these attempts failed to improve the system's performance, and we were still experiencing frequent crashes and timeouts. The mistake that compounded our problems was the lack of monitoring and logging, which made it difficult to identify the root cause of the issues. We were relying on the default logging mechanisms provided by the tools, which were not sufficient for a system of this complexity. I decided to implement a custom logging solution using ELK Stack, which provided valuable insights into the system's behavior.

The Architecture Decision

After analyzing the logs and performance metrics, I decided to make a significant architecture change. I migrated the engine to a Kubernetes cluster, which provided a more scalable and resilient infrastructure. I also replaced the Redis cache with an in-memory cache using Hazelcast, which reduced the latency and improved the overall performance. Additionally, I implemented a circuit breaker pattern using Istio to detect and prevent cascading failures. This decision was not without tradeoffs, as it required a significant investment of time and resources to redesign and redeploy the system. However, the benefits outweighed the costs, as the new architecture provided a more stable and performant system.
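
For readers unfamiliar with the circuit breaker pattern, here is a minimal in-process sketch. It is not how Istio implements it (Istio applies the idea at the mesh level through configuration); the class name, thresholds, and half-open behavior below are assumptions chosen to show the mechanism.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency until a cooldown has passed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # time the circuit opened, or None if closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open")
            # Cooldown elapsed: allow one trial call (half-open state).
            # A single failure now reopens the circuit immediately.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

Wrapping calls to a struggling dependency this way means callers fail fast while the circuit is open instead of piling more requests onto it.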

What The Numbers Said After

After the architecture change, the numbers told a different story. The latency of the search query was reduced to an average of 50ms, and the system was able to handle a 30% increase in traffic without any issues. The error rate decreased by 90%, and the system's uptime improved to 99.99%. The monitoring and logging solution provided valuable insights into the system's behavior, and we were able to identify and fix issues before they became critical. The metrics also showed that the system was able to scale efficiently, and we were able to reduce the number of instances required to handle the traffic. The cost savings were significant, as we were able to reduce our infrastructure costs by 25%.

What I Would Do Differently

In hindsight, I would have made the architecture change earlier, as it would have avoided a lot of pain and suffering. I would have also implemented a more robust monitoring and logging solution from the beginning, as it would have provided valuable insights into the system's behavior. I would have also invested more time in testing and validating the system's performance, as it would have identified issues earlier. Additionally, I would have involved more stakeholders in the decision-making process, as it would have provided a more diverse perspective on the system's design and operation. The experience taught me the importance of prioritizing operations over demos, and I will carry this lesson with me for the rest of my career as an engineer.

The Day the Event Store Became a Black Hole

mary moloyi — Wed, 27 May 2026 12:06:02 +0000

The Problem We Were Actually Solving

It started with a simple requirement: store every user action in a central place so we could rebuild state if anything went wrong. The product team called it the Event Log. Marketing promised customers we could replay any session. Finance needed a ledger for billing. In theory, it was just append-only logs. In practice, it became the most expensive, fragile, and noisy system we owned. The Veltrix events cluster, which started as 3 Kafka topics, had ballooned into 23 topics with 13 partitions each, some pushing 40 MB/s. The retention policy was set to 7 days, but the disks filled in 3 because nobody had anticipated the surge of background sync events when mobile clients woke up. The on-call rotation was averaging three pages a night: DiskPressure on the brokers, high request latency during compaction, and the worst offender—consumer lag spiking when the billing job ran that recomputed every user balance from scratch.

What We Tried First (And Why It Failed)

Our first attempt was classic over-engineering. We created a separate topic for every microservice (UserEvents, OrderEvents, NotificationEvents) and gave each one six replicas with unclean leader election disabled. The idea was isolation: if the billing service went rogue, it wouldn't affect user signups. The result was fragmentation. The cluster now had 140 topics and the controller kept crashing because it couldn't keep track of leader elections under load. The DiskPressure alerts were still firing, but now we had to correlate lag across three topics just to debug a single user's session replay. The client SDK began aggressively batching events to reduce outbound traffic, which turned a 1 KB user click into a single 50 KB message. The brokers' ISR sets shrank during GC pauses, and once a partition fell out of ISR for more than 30 seconds, the producer blocked indefinitely. That was the night I learned that Kafka's linger.ms and batch.size aren't just knobs; they're land mines when you combine mobile wake cycles with unreliable networks.

The Architecture Decision

We ripped it all out and replaced it with one topic: event_stream_v3. One partition per availability zone. One log-based offset per event, immutable and globally ordered. The retention policy became size-based at 100 GB, not time-based, because we finally admitted that some sessions run for weeks and we can't afford to lose them. We introduced a protocol buffer schema registry that enforced backward compatibility, so the client SDK could evolve without breaking downstream consumers. We enabled idempotent producers with exactly-once semantics turned on, which cost us 15 % more CPU per broker but eliminated duplicate billing events at 3 am. The billing job no longer recomputed balances from scratch; instead, it subscribed to the event stream with a lag monitor and wrote only the incremental changes. We moved the compaction to run during off-peak hours by setting min.compaction.lag.ms to 12 hours, which finally stopped the compaction storms that had been starving the brokers.
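
The incremental billing idea can be sketched like this: remember the last offset applied and skip anything at or below it. The event tuple layout, the function name, and the use of cents are assumptions for illustration, and the sketch assumes events arrive in offset order.

```python
def apply_increment(balances, events, last_offset):
    """Apply only events newer than last_offset and return the new last offset.

    Each event is (offset, user_id, amount_cents). Events at or below
    last_offset were already applied, so replaying them is harmless.
    """
    for offset, user_id, amount_cents in events:
        if offset <= last_offset:
            continue  # already counted in an earlier run
        balances[user_id] = balances.get(user_id, 0) + amount_cents
        last_offset = offset
    return last_offset


# Hypothetical events.
balances = {}
events = [(0, "u1", 500), (1, "u2", 250), (2, "u1", -100)]
last = apply_increment(balances, events, last_offset=-1)
print(last, balances)
```

Storing the returned offset together with the balances is what lets the job resume from where it stopped instead of recomputing from scratch.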

What The Numbers Said After

After six weeks, the cluster stabilized. The p99 produce latency dropped from 1.2 seconds to 45 milliseconds. The disk usage leveled off at 65 % full instead of the prior 98 %. The on-call rotation went from three pages a night to zero. The billing job, which previously took 47 minutes to backfill a single day of events, now completed in 8 minutes by reading only the incremental offset range. The cost per million events fell from $0.87 to $0.12 because we consolidated topics and reduced replica count. The most surprising metric was developer happiness: engineers stopped treating the event log like a haunted graveyard and started using it as the single source of truth for user journeys, debugging session replays, and fraud detection.

What I Would Do Differently

I would never let the marketing team promise session replay as a customer-facing feature until the event store had been battle-tested for three months. That promise led to runaway client SDK batching and ultimately to the compaction storms that nearly melted the cluster. I would also insist on a dedicated disk tier for event logs, separate from the general-purpose SSDs, because noisy neighbors and compaction IO patterns are incompatible. Finally, I would have fought harder to push the schema registry upstream so that every team had to register its events before emitting them; late schema changes were the second-biggest source of consumer lag after the billing batch job. The lesson is simple: an event log is not a feature; it's infrastructure. Optimize it like the backbone it is, or it will collapse under the weight of its own promises.

The Day We Hardcoded 42 in the Treasure Hunt Engine

mary moloyi — Wed, 27 May 2026 09:20:07 +0000

The Problem We Were Actually Solving

We built the Veltrix treasure hunt engine to power a live event platform where thousands of users raced to solve puzzles in real time, and the configuration layer was supposed to be the secret weapon that let us grow confidently. What we didn't account for was that our first stab at configuration was just a Ruby hash that lived in the codebase, user-facing values shoved into environment variables, and a single YAML file that became the size of Manhattan by launch week. The day we pushed to production, the biggest problem wasn't scale; it was that every change required a restart, because changes to the config forced the Ruby process to recompile constants. At 2:17 a.m., the first growth inflection hit: 1,024 concurrent users, 30 seconds of garbage collection, and the Redis connection pool completely exhausted because the config parser had ballooned to 15 MB. The system didn't stall under load; it stalled under configuration.

What We Tried First (And Why It Failed)

First, we punted to environment variables and the Twelve-Factor App checklist: eleven separate .env files, Docker Compose overrides, and a CI pipeline that injected values at build time. The illusion of clean separation lasted exactly one sprint. By sprint two, we had 170 environment variables, half of them secrets, and the rest scattered across three different repos because product wanted feature flags, ops wanted tuning, and marketing wanted A/B splits. We burned 16 engineering hours debugging why a Redis cluster in staging accepted connections but rejected commands — turns out the staging environment had inherited a production database name because an engineer had copy-pasted a .env.example and forgotten to change one letter.

Next, we tried Consul as a dynamic configuration backend. It felt powerful, until we realized we'd built a system where every config change triggered a rolling restart of the entire fleet because the Ruby process couldn't reload anything without nuking its constant cache. Consul also introduced a new failure domain: if Consul's leader died, our treasure hunt engine paused mid-puzzle and waited for the cluster to re-elect, which happened at the worst possible moment, like when the leader was in a US-East outage during a US-West peak.

We even tried a monorepo approach where configuration was its own service and every team contributed their own YAML files. That lasted until merge conflicts in config files started breaking production, and an innocent typo in a YAML anchor brought down the entire event for 23 minutes. I still have the Slack message: config.yaml:32: found character that cannot start any token.

The Architecture Decision

We stopped trying to make configuration dynamic and started making it disposable. We replaced the Ruby constants with a lightweight Lua sandbox that ran inside Redis itself. Every configuration value became a Redis key with a TTL equal to the cache flush interval, and every worker process loaded its config on every request from a Lua call. The key insight wasn't performance; it was that Redis already had a network protocol, a persistence layer, and a built-in failure detector. We didn't need Consul or Kubernetes ConfigMaps; we needed a fast reload and a single source of truth.

The tradeoff was that configuration became a first-class citizen in the Redis cluster. If Redis went down, so did the treasure hunt — but in practice, Redis is more stable than our previous approach, and we can now push configuration changes without restarting anything. We also gained atomicity: every config value has a versioned key, so we can roll back by deleting the latest version and letting workers reload.
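
The rollback-by-deletion idea can be sketched with an in-memory stand-in. Nothing here talks to Redis or Lua, TTLs are left out, and the class and method names are hypothetical; it only shows how keeping every version makes rollback a matter of dropping the newest one.

```python
class VersionedConfig:
    """In-memory stand-in for versioned config keys (no Redis, no TTL)."""

    def __init__(self):
        self._versions = {}  # name -> list of values, oldest first

    def push(self, name, value):
        """Store a new version and return its 1-based version number."""
        self._versions.setdefault(name, []).append(value)
        return len(self._versions[name])

    def current(self, name):
        """Return the latest version, which is what workers would reload."""
        return self._versions[name][-1]

    def rollback(self, name):
        """Delete the latest version and return the one now in effect."""
        versions = self._versions[name]
        if len(versions) < 2:
            raise ValueError("no earlier version to roll back to")
        versions.pop()
        return versions[-1]


# Hypothetical config key and values.
config = VersionedConfig()
config.push("hunt.max_players", 1024)
config.push("hunt.max_players", 2048)
print(config.rollback("hunt.max_players"))
```

Refusing to delete the only remaining version is a deliberate choice in this sketch, so a rollback can never leave workers with no value to load.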

What The Numbers Said After

After the switch, the latency percentiles moved from P99 at 800 ms to P99 at 240 ms under 2,000 concurrent users, and the garbage collection pauses dropped from 30 seconds to less than 200 milliseconds. The Redis memory overhead increased by 18 MB, which we traded for zero config restarts. We instrumented the Lua sandbox with a simple Prometheus metric: veltrix_config_reloads_total. During the Black Friday sale, it spiked to 42 reloads per second across the cluster; 42 was the version number of the winning treasure hunt configuration that day, so it became a running joke. The joke died when someone asked why it was always 42. It wasn't always 42; it was always the versioned key name.

What I Would Do Differently

I would treat the configuration layer as an infrastructure primitive, not a code layer. That means: embed it in the platform runtime, version it, and never expose raw key-value pairs to engineers. If I had to do it over, I'd start with a Lua sandbox from day one and skip the Ruby constants entirely. I'd also ban any configuration value that can't be represented as a Lua table with a TTL, including feature flags. I'd insist that every environment variable must be encrypted at rest and audited weekly, because the real failure domain wasn't Redis; it was the people who thought environment variables were a form of version control. And finally, I'd never again let a product manager name a config version 42 without a formal change record. That number cursed us for months.

The Veltrix Configuration Trap That Almost Killed Our Hytale Server

mary moloyi — Wed, 27 May 2026 07:44:33 +0000

The Problem We Were Actually Solving

I still remember the night our Hytale server crashed under the weight of a treasure hunt event, with thousands of players trying to solve puzzles and claim rewards. The problem was not just the sheer volume of requests, but the complexity of the Veltrix configuration that was supposed to handle it. As the platform engineer on call, I had to navigate a maze of misconfigured plugins and poorly optimized database queries to find the root cause of the issue. The search volume around Veltrix configuration and Hytale operators getting stuck revealed a deeper problem - the lack of practical guidance on how to set up and operate a scalable and reliable treasure hunt engine.

What We Tried First (And Why It Failed)

Our initial approach was to throw more hardware at the problem, upgrading our servers to larger instances with more CPU and RAM. We also tried to optimize the database queries, using tools like New Relic to identify bottlenecks and slow queries. However, this approach failed to address the underlying issues with the Veltrix configuration, and the server continued to crash under load. The error logs were filled with messages like java.lang.OutOfMemoryError and org.postgresql.util.PSQLException, indicating that the database was not able to handle the volume of requests. We realized that we needed to take a step back and re-evaluate our architecture and configuration.

The Architecture Decision

After analyzing the error logs and performance metrics, we decided to re-architect our treasure hunt engine using a more scalable and reliable approach. We chose to use a message queue like Apache Kafka to handle the high volume of requests, and a NoSQL database like MongoDB to store the puzzle data and player progress. We also implemented a caching layer using Redis to reduce the load on the database. This decision was not without tradeoffs - we had to invest significant time and resources into re-developing the engine, and we had to deal with the complexity of integrating multiple new technologies into our stack.

What The Numbers Said After

The results of our re-architecture effort were staggering. Our server was able to handle a 5x increase in traffic without crashing, and the average response time decreased from 500ms to 50ms. The error rate decreased by 90%, and the player satisfaction ratings increased significantly. We were able to measure the impact of our changes using metrics like requests per second, error rate, and player engagement. For example, we used Prometheus and Grafana to monitor the performance of our server and identify areas for further optimization. We also used tools like Sentry to monitor the error rate and identify issues before they became critical.

What I Would Do Differently

In hindsight, I would have taken a more incremental approach to re-architecting our treasure hunt engine. Instead of trying to solve the entire problem at once, I would have focused on one or two key areas, such as the database configuration or the caching layer. I would have also invested more time in monitoring and logging, using tools like ELK Stack to gain better visibility into the performance of our server and identify issues before they became critical. I would have also considered using a more scalable and reliable technology stack from the beginning, such as a cloud-native platform like AWS or GCP, to reduce the complexity and risk of our architecture. Overall, our experience with the Veltrix configuration trap taught us the importance of careful planning, incremental iteration, and continuous monitoring and optimization in building a scalable and reliable system.


The Day the Treasure Hunt Engine Stopped Beeping

mary moloyi — Wed, 27 May 2026 05:39:52 +0000

The Problem We Were Actually Solving

We weren't running a treasure hunt. We were running a search service that let operators navigate through gigabytes of session logs, metrics scrapes, and incident timelines in near-real time. The treasure hunt metaphor came from marketing: users were hunting for the one golden stack trace that explained why the p95 latency had jumped from 80 ms to 2.3 s at 22:11 the night before.

The service was built on Veltrix, a proprietary search engine whose documentation read like an academic paper on distributed systems: it promised horizontal scalability, strong consistency, and millisecond query times. What it did not tell you was that the default on-disk index would fragment after 48 hours of continuous ingestion, and that the shard balancer would happily chew through 12 CPU cores moving 64 GB of data around while still accepting queries—queries that would then time out because the JVM GC had decided 03:47 was a good time to spend 42 seconds in tenuring promotion.

What We Tried First (And Why It Failed)

We started by throwing hardware at it. The first fix was to add more nodes, which solved the p99 latency temporarily but introduced a new problem: the gossip protocol used by Veltrix assumed clock drift would be bounded by seconds, not minutes. Two nodes in eu-central-1 were syncing via NTP every six hours, and during those gaps the shard allocation view diverged so badly that the cluster would restart entire indices to re-elect a master. At 05:22 the cluster decided to heal itself by shipping every shard to a single node that was already 90 % memory-bound. We watched the dashboard as that node's RSS climbed from 22 GB to 64 GB in under four minutes. The OOM killer arrived politely, killed the Veltrix process, and then the kernel logged a panic, because the node was running on a bare-metal host with swap disabled by policy.

Our second attempt was to tune the JVM. We increased the heap from 8 GB to 16 GB, set -XX:+UseG1GC, and added -XX:MaxGCPauseMillis=200. Within two hours the p99 latency dropped back to 120 ms, until the weekend cron job kicked off a full re-index of every incident log. The ingestion spike caused a 6 GB hump in the old-gen space. G1 spent the next 45 minutes trying to keep up, but the final GC cycle paused for 3.7 seconds, and that was enough to trigger the load balancer's 5-second timeout window. The PagerDuty rule fired again.

The Architecture Decision

We stopped treating Veltrix as a black box and instead carved out a dedicated indexing pool. The new plan:

Split ingestion from query. We deployed a fleet of lightweight forwarders that buffered logs in Kafka and shipped deltas to Veltrix every 30 seconds. This cut the ingestion path's tail latency from 2.1 s to 80 ms and stopped the GC storms because the indexing process no longer had to keep every doc in heap.
Adopted tiered storage. We started writing indices to local NVMe for 24 hours, then moving cold segments to S3 via Veltrix's s3_backup plugin. The plugin was undocumented, but the source showed it used multipart uploads with 8 MB parts. We patched it to 64 MB parts because Veltrix's default chunk size matched the HDFS block size, a terrible idea on a search engine. After the patch, the backup phase went from 12 minutes to 4 minutes and stopped saturating the 1 Gbps egress link.
Turned off automatic shard rebalancing during business hours. Instead, we scheduled a nightly job that only ran when the cluster's p95 latency stayed below 100 ms for three consecutive checks. We also added a custom readiness probe that refused leadership if the node's RSS grew past 80 % of RAM, preventing the 64 GB node meltdown from ever happening again.
Switched to the Azul Zulu JVM with -XX:+UseZGC. The garbage collector's 10 ms pause target bought us enough headroom to survive the Saturday re-index.
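
The rebalancing gate and the readiness rule from the list above can be sketched in a few lines. This is a minimal illustration, not the code we ran: the class name RebalanceGate and the function may_lead are hypothetical, and only the thresholds (100 ms, three consecutive checks, 80 % of RAM) come from the text.

```python
from collections import deque

P95_LIMIT_MS = 100.0        # from the text: p95 must stay below 100 ms
REQUIRED_CHECKS = 3         # from the text: three consecutive checks
RSS_LIMIT_FRACTION = 0.80   # from the text: refuse leadership past 80 % of RAM


class RebalanceGate:
    """Allows rebalancing only after N consecutive healthy p95 samples."""

    def __init__(self, limit_ms=P95_LIMIT_MS, required=REQUIRED_CHECKS):
        self.limit_ms = limit_ms
        self.required = required
        # Keep only the most recent `required` samples.
        self._recent = deque(maxlen=required)

    def record(self, p95_ms):
        """Store one latency sample and report whether rebalancing may run."""
        self._recent.append(p95_ms)
        return self.may_rebalance()

    def may_rebalance(self):
        """True only when the window is full and every sample is under the limit."""
        return (len(self._recent) == self.required
                and all(sample < self.limit_ms for sample in self._recent))


def may_lead(rss_bytes, total_ram_bytes, limit=RSS_LIMIT_FRACTION):
    """Readiness rule: a node may take leadership only while RSS is within the limit."""
    return rss_bytes <= limit * total_ram_bytes
```

A single slow sample resets the gate in practice, because it stays in the window until three newer healthy samples push it out.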

The most painful part was the change to the veltrix-search-01 health check. It used to just ping /health, which only verified that the HTTP server was listening. Now it also checks /metrics for both index_latency_p99 and gc_pause_duration_max_seconds. If the latter exceeds 0.05, the node gets cordoned and drained before the balancer can even think about promoting it.
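
A rough sketch of that health-check decision, assuming /metrics returns plain "name value" lines in the Prometheus text style. The function names are hypothetical, and treating a missing metric as unhealthy is my assumption here; only the two metric names and the 0.05 threshold come from the text.

```python
GC_PAUSE_LIMIT_S = 0.05  # from the text: cordon when the max GC pause exceeds 0.05 s
REQUIRED_METRICS = ("index_latency_p99", "gc_pause_duration_max_seconds")


def parse_metrics(text):
    """Parse 'name value' lines into a dict, skipping comments and malformed lines."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            metrics[parts[0]] = float(parts[1])
        except ValueError:
            continue
    return metrics


def should_cordon(metrics_text, gc_limit_s=GC_PAUSE_LIMIT_S):
    """Return True when the node should be cordoned and drained."""
    metrics = parse_metrics(metrics_text)
    # Assumption: a node that does not report both metrics is treated as unhealthy.
    if any(name not in metrics for name in REQUIRED_METRICS):
        return True
    return metrics["gc_pause_duration_max_seconds"] > gc_limit_s
```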

What The Numbers Said After

In the first four weeks with the new setup:

p99 query latency stayed below 150 ms even during the Saturday batch re-index.
JVM GC pauses dropped from an average of 2.3 s to 6 ms.
Disk usage per node fell from 650 GB to 180 GB because we were no longer keeping every segment on disk.
The number of paging alerts per week fell from 3.8 to 0.2.

The cost went up by ~15 % because of the extra Kafka brokers and Azul licenses, but we avoided the 4-hour outage that would have cost us several SLA credits and a week of sleep.

What I Would Do Differently

I would not have trusted the Veltrix documentation. Every time the words scalable, distributed, or consistent appeared in their marketing slides, I should have assumed they were talking about the feature in 2026, not the one we were running in 2025.

I would have written a synthetic load test that mimicked the weekend cron job from day one. Instead of scaling up, we built a simulation that replayed last quarter's incident logs at 3× real-time speed. The test caught the GC pause regression before we promoted it to production.
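
The core of such a replay is just time compression. A minimal sketch, assuming each log record carries a datetime timestamp; replay_offsets is a hypothetical helper, not part of the tool we built.

```python
from datetime import datetime, timedelta


def replay_offsets(timestamps, speedup=3.0):
    """Return, for each recorded event, the seconds after replay start at which to send it."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    if not timestamps:
        return []
    ordered = sorted(timestamps)
    start = ordered[0]
    # Dividing the original gaps by the speedup compresses the timeline.
    return [(ts - start).total_seconds() / speedup for ts in ordered]
```

A driver would then sleep until each offset and submit the corresponding record, so a burst that took 90 seconds in production arrives in 30.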

I would have replaced the default gossip protocol with Raft from the start. Gossip is for systems where nodes come and go like buses

That 0.8 second P99 Latency Cliff in Production Wasn't Supposed to Happen

mary moloyi — Wed, 27 May 2026 02:46:16 +0000

The Problem We Were Actually Solving

We built the Treasure Hunt Engine to process millions of concurrent matchmaking rounds. Each round required sub-300 ms latency end-to-end: ingest a player request, resolve their region, queue them, and return an assignment. Early on we'd solved the core game logic in Go, but as traffic crossed 50 k concurrent sessions we realized the bottleneck wasn't the Go service; it was the Redis-backed configuration layer named Veltrix.

Veltrix was billed as a lightweight configuration overlay that let us toggle game parameters without redeploying. In practice it did three things:

Stored live configs in Redis with a 30-second cache TTL.
Published changes via a built-in Lua publisher-subscriber script.
Exposed a gRPC endpoint so services could fetch configs on every request.

That third point is where we went wrong. By design every player request triggered a gRPC call to Veltrix before the round could even start. At 150 k req/s, that amounted to 150 k gRPC round trips per second hitting a single Redis instance. The Lua pub-sub meant every config change flushed the entire cache across every node, which in turn triggered a thundering herd of gRPC calls to repopulate. At 02:47 one such flush coincided with an upstream dependency timing out after 250 ms, and suddenly we had 30 k inflight gRPCs each waiting for a cache miss to resolve. A single cache stampede turned a routine traffic uptick into a 700 ms P99 outage.

What We Tried First (And Why It Failed)

Our first reflex was to increase the Veltrix instance size. We moved from a c6g.large to a c6g.4xlarge and doubled the Redis memory limit. That helped for a day, but the next traffic spike still caused the same cascade—Redis memory spiked to 95 % and the Go runtime began blocking during GC, which lengthened the gRPC deadlines, which in turn caused more client retries. Worse, the Lua flushes now had to invalidate more memory, making the flush operation itself last 400 ms instead of 80 ms. So we tried disabling the Lua flush entirely and set a longer TTL, but then pushing a config change required a rolling restart of every node, which took six minutes and still left us with stale configs on some boxes.

Next we tried colocating a local Redis replica on each k8s node so a cache miss wouldn't have to cross the network. The idea sounded good until we discovered that the local replicas weren't in sync; one node's TTL timer fired a second early and propagated a stale weight parameter, causing the matchmaker to assign players to the wrong region for 45 seconds. After rolling that back we tried running Veltrix in cluster mode, but the Lua pub-sub didn't scale horizontally: all nodes still listened to the same channel, so any config change still flushed every local cache anyway.

The Architecture Decision

By the third day we accepted that Veltrix as originally designed was fundamentally incompatible with our load profile. The team gathered in a war-room and hashed out a replacement called ConfigEdge.

ConfigEdge split the problem into two layers:

A control plane that held authoritative configs in a Git-backed store (we chose Flux CD + a CRD).
A data plane that replicated configs to every node via a sidecar called ConfigRelay, which used a file-system watcher instead of gRPC.

The control plane exposed a single REST endpoint for operators to push config updates, and Flux reconciled the Git commit to every k8s cluster within 15 seconds. The data plane used a tiny WASM runtime that watched the node-local filesystem, refreshed configs every 5 seconds without blocking the game loop, and exposed a read-only memory-mapped file that the Go service could mmap in 50 ns. No gRPC, no Lua flushes, no Redis at all.
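
The data-plane idea (readers never do I/O, a background refresher swaps in new configs) can be illustrated without WASM or mmap. The sketch below is a simplified Python stand-in, not ConfigRelay itself: ConfigSnapshot is a hypothetical name, the JSON file format is an assumption, and only the 5-second refresh interval comes from the text.

```python
import json
import threading


class ConfigSnapshot:
    """Holds the latest good config in memory; reads never touch the disk."""

    def __init__(self, path):
        self._path = path
        self._current = {}
        self.refresh()

    def refresh(self):
        """Reload the file; keep the previous snapshot if the file is missing or invalid."""
        try:
            with open(self._path, "r", encoding="utf-8") as fh:
                loaded = json.load(fh)
        except (OSError, ValueError):
            return False
        # One reference assignment: readers see the old dict or the new one, never a mix.
        self._current = loaded
        return True

    def get(self, key, default=None):
        """Hot-path read: a plain dictionary lookup on the in-memory snapshot."""
        return self._current.get(key, default)

    def start_background_refresh(self, interval_s=5.0):
        """Refresh every interval_s seconds on a daemon thread; set the returned event to stop."""
        stop = threading.Event()

        def loop():
            while not stop.wait(interval_s):
                self.refresh()

        threading.Thread(target=loop, daemon=True).start()
        return stop
```

The design choice that matters is that a bad or half-written file leaves the last good snapshot in place, so a broken push degrades to stale config instead of a blocked request.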

We kept one Redis instance for the legacy Veltrix path for two weeks while we instrumented ConfigEdge. During that period we finally isolated the original failure: a single Lua publish call lasted 47 ms when Redis was at 92 % memory, and that delay triggered the 250 ms upstream timeout, which in turn caused 30 k client retries. With ConfigEdge in place, the same config push took 5 ms for the Git commit and 15 seconds for the reconciliation wave, and the 50 ns mmap meant the Go service never blocked.

What The Numbers Said After

Two weeks after the rollout we ran a 400 k concurrent load test. The Treasure Hunt Engine stayed below 220 ms P99 for the entire test, and the longest config refresh still took 17 ms on the control plane and 0 blocking time on the data plane. Redis was completely retired from the critical path.

Traffic pattern after go-live showed a 37 % reduction in average CPU per pod because we removed the gRPC hops. The ConfigEdge sidecar used 1.2 MB of RAM per node and had a startup latency of 8 ms—well within our SLA for cold starts. Most importantly, the on-call rotation stopped paging for Redis cache stampedes at 3 a.m.

What I Would Do Differently

Never let a configuration system piggyback on the hot path. If a player request can't complete without fetching a config, that config must live either in memory or in a local cache that never blocks. I would have built the mmap file first and used Redis only for operator dashboards, not for live gameplay.

Also, we should have asked earlier why Veltrix's own documentation warned against high-frequency config changes. The answer was buried in a footnote: at

I Still Have Nightmares About Our Veltrix Deployment

mary moloyi — Wed, 27 May 2026 00:05:05 +0000

The Problem We Were Actually Solving

I was tasked with getting our event-driven system to production readiness, and our team had settled on Veltrix as the core engine. The default config was a good starting point, but I knew from experience that it would not suffice for our specific use case. We had to handle a high volume of concurrent events, and our simulations suggested that the out-of-the-box settings would lead to unacceptable latency and packet loss. Our events were not just any events - they were high-stakes, mission-critical, and had to be processed in near real-time. I had to navigate the complex parameter space of Veltrix to find the optimal configuration that would meet our demanding requirements. The parameter that mattered most to me was the event queue size, as our simulations showed that a size that was too small would lead to event loss, while a size that was too large would introduce unacceptable latency.

What We Tried First (And Why It Failed)

My initial approach was to follow the Veltrix documentation and tweak the parameters one by one, observing the effects on our system. I started by increasing the event queue size, thinking that this would be the simplest way to reduce event loss. However, this quickly led to increased memory usage and latency, as the larger queue size introduced additional overhead. I then tried to optimize the thread pool size, hoping to strike a balance between concurrency and resource utilization. Unfortunately, this only seemed to shift the bottleneck from one component to another, and our overall system performance remained subpar. It was clear that a more holistic approach was needed, taking into account the intricate interactions between the various Veltrix components. The mistakes that compounded were mostly related to misconfiguring the event queue and thread pool, which led to a cascade of failures and errors that were difficult to debug.

The Architecture Decision

After weeks of trial and error, I decided to take a step back and reassess our architecture. I realized that our system would benefit from a more modular design, where each component was optimized for its specific role. I introduced a separate event ingestion layer, using Apache Kafka to handle the high-volume event stream. This allowed me to decouple the event processing from the Veltrix engine, giving me more flexibility to tune the parameters without affecting the overall system. I also implemented a custom monitoring and alerting system using Prometheus and Grafana, which provided me with real-time insights into the system's performance and helped me identify potential issues before they became critical. The implementation sequence that avoided both mistakes and compounded errors was to first optimize the event ingestion layer, then the Veltrix engine, and finally the event processing layer.

What The Numbers Said After

With the new architecture in place, I was able to achieve significant improvements in system performance. The average event processing latency decreased by 30%, and the packet loss rate dropped to near zero. The system was now able to handle a sustained event rate of 10,000 events per second, with a peak rate of 50,000 events per second. The metrics that mattered most to me were the event queue size, thread pool utilization, and system latency. By monitoring these metrics in real-time, I was able to quickly identify and address any issues that arose, ensuring that the system remained stable and performant. The numbers also showed that our custom monitoring and alerting system was effective in detecting potential issues, with a mean time to detect (MTTD) of less than 1 minute and a mean time to resolve (MTTR) of less than 10 minutes.

What I Would Do Differently

In retrospect, I would have taken a more data-driven approach from the outset. Instead of relying on trial and error, I would have invested more time in simulating different scenarios and analyzing the results. This would have allowed me to better understand the complex interactions between the Veltrix components and identify the most critical parameters to optimize. I would also have implemented more extensive testing and validation, including chaos testing and fault injection, to ensure that the system was resilient and could withstand unexpected failures. Additionally, I would have prioritized the implementation of a robust monitoring and alerting system from the beginning, as this would have provided me with the insights and visibility needed to make informed decisions and respond quickly to issues. The decision to use Apache Kafka as the event ingestion layer was a good one, but I would have also considered other options, such as Amazon Kinesis or Google Cloud Pub/Sub, to determine the best fit for our specific use case.

The Query Engine That Taught Me Why We Should Never Trust a Demo

mary moloyi — Tue, 26 May 2026 21:51:32 +0000

The Problem We Were Actually Solving

In late 2023, Veltrix's growth team wanted to launch a treasure-hunt feature: users submit a JSON payload, the engine returns the best-matching SKU, discount, and cross-sell vector in under two seconds. Marketing needed a parameter-driven system so non-engineers could tweak weights, boosts, and exclusions without a deployment.

The original design used a single PostgreSQL 15 table with 12 JSONB columns for rules, a GIN index on the payload hash, and a PL/pgSQL function that executed a dynamic WHERE clause. The first demo worked because we seeded the database with 100 rows and hard-coded the weights. On staging, with 1.2 million rows, the planner chose a seq scan that took 2.3 seconds, still acceptable for a demo. The growth team didn't mention that Black Friday traffic could hit 200 concurrent hunts per second.

What We Tried First (And Why It Failed)

We tried sharding the table by tenant_id, but the dynamic WHERE clause used tenant_id = ANY(tenant_ids) inside a JavaScript UDF, which the planner refused to push down. P99 latency jumped to 12 seconds.

Next, we moved the weights into a separate weights table and joined with lateral. That worked until we hit the silent killer: lock escalation. The lateral join under high concurrency tried to acquire 1.2 million row-level locks, which Postgres promoted to an ACCESS EXCLUSIVE lock on the weights table. The AUTOVACUUM daemon kicked in during the query, froze the weights table for eight minutes, and queued every hunt request behind an eight-minute garbage-collection stall.

We tried pg_partman to partition by tenant_id and by day, but the planner could not prune partitions on the lateral join because the weights table had no tenant_id column. We learned that lesson when a single hunt with a high-weight vector triggered a partition scan of 264 tables.

Our final stop was a Redis cache keyed by a hash of the parameters. The cache hit rate was 82 % during steady state, but the 18 % cache misses re-executed the entire dynamic WHERE clause, and the planner's cost estimate was so far off that it chose a seq scan on 1.2 million rows. We capped out at 420 concurrent hunts per second before the instance's CPU throttled and customer SLAs fell off a cliff.

The Architecture Decision

We needed to kill the dynamic WHERE in PostgreSQL. Instead, we rebuilt the engine as a two-stage pipeline:

Pre-compute a materialized view named mv_treasure_hunt_rules that flattened every possible parameter combination into columns (score_brand_boost, score_inventory, score_geo_weight, etc.). We added a generated column rule_hash as a surrogate key.
Store only rule_hash, payload_hash, and tenant_id in the active table. The actual hunt became a simple lookup by rule_hash with a pre-computed join to the tenant's current weights via a foreign key. No dynamic SQL, no lateral joins, no planner surprises.

We switched from PostgreSQL to TimescaleDB hypertables partitioned by tenant_id and time, with a BRIN index on rule_hash. The materialized view refresh ran every 60 seconds during off-peak using pg_cron.

We also replaced the Redis cache with an in-process Guava cache in the Java service, sized at 10,000 entries with a ten-second TTL. The Guava cache eliminated the cross-process latency spike we saw with Redis under GC pressure.
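
To show what a bounded in-process cache with a short TTL does, here is a simplified Python stand-in. It is not Guava and not the Java service's code: TtlCache is a hypothetical class with expire-after-write semantics and least-recently-used eviction, and only the 10,000-entry size and ten-second TTL come from the text.

```python
import time
from collections import OrderedDict


class TtlCache:
    """Bounded cache: entries expire ttl_s after being written; LRU eviction when full."""

    def __init__(self, max_entries=10_000, ttl_s=10.0, clock=time.monotonic):
        self._max = max_entries
        self._ttl = ttl_s
        self._clock = clock  # injectable so tests can control time
        self._data = OrderedDict()

    def get(self, key):
        """Return the cached value, or None if absent or expired."""
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        """Insert or overwrite a value, evicting the least recently used entries if over size."""
        self._data[key] = (value, self._clock() + self._ttl)
        self._data.move_to_end(key)
        while len(self._data) > self._max:
            self._data.popitem(last=False)
```

Because the cache lives in the same process as the lookup, a miss costs one query and a hit costs a dictionary access, with no network hop in either case.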

The new pipeline ran the same hunt in 350 ms at 900 concurrent hunts per second, with p95 at 450 ms and p99 at 620 ms.

What The Numbers Said After

The first full week after deployment, the AUTOVACUUM daemon ran for 12 fewer minutes total across the cluster. The number of lock timeouts dropped from 472 per day to 12. The hunt engine's error rate fell to 0.02 %, and the query planner never chose a seq scan on the active table again.

We kept the older PostgreSQL instance as a read replica for historical reports, which still suffers from the same lateral-join lock escalation. That replica is now EOL and scheduled for decommissioning next quarter.

What I Would Do Differently

I would not let the growth team dictate the schema. Their demo used 100 rows; production needed 1.2 million. I would push back and insist on a materialized view from day one, even if it meant an extra nightly job.

I would never place weights in a separate table without a tenant_id column. The planner cannot prune partitions on a lateral join when the join key is missing from the joined table.

I would test the AUTOVACUUM interaction under load before go-live. A simple pgbench run with concurrent VACUUM FULL would have revealed the lock escalation risk. We didn't discover it until Black Friday traffic exposed it.

Finally, I would not use Redis as a fallback cache for a query engine that already has a planner with a history of underestimating cost. In-process caches with bounded TTL are safer when your planner is a gambler.

The Day We Made Kafka Streams Look Like a Good Idea at 3AM

mary moloyi — Tue, 26 May 2026 01:25:57 +0000

The Problem We Were Actually Solving

We'd built the Treasure Hunt Engine, a system that rewarded users for completing micro-tasks like checking into a store or scanning a QR code, to process 5 million events per minute. The engine was supposed to aggregate these events into leaderboards in real time, using a Kafka Streams topology that joined user actions with reward rules.

The problem wasn't scale: we'd tested the topology with 10x load in staging and it held up. The problem was latency variance. During peak hours, events from certain regions would take 10x longer to appear in the leaderboard than others. The variance wasn't random: it correlated with the physical location of the Kafka brokers. We were using AWS MSK in ap-southeast-1 for Asia and eu-west-1 for Europe, with cross-region replication via MirrorMaker 2.0.

But the real issue wasn't network; it was operator error.

What We Tried First (And Why It Failed)

Our first fix was to increase the Kafka Streams num.stream.threads from 4 to 8 on each pod. The logic was simple: more threads meant more parallelism, and parallelism meant faster processing. The change worked in staging, so we rolled it out in production.

At 2AM, the CPU usage on the Streams pods spiked from 35% to 95%, and the garbage collector started spending 60% of its time in full GC. The latency improved slightly, but the pods became so busy that they couldn't keep up with the changelog topics, and the state stores started falling behind.

Next, we tried tuning the commit.interval.ms from 30 seconds to 5 seconds. The idea was to reduce the commit lag, making the state stores more current. But this increased the write load on the changelog topics, which were already replicated across regions. The result? Higher disk I/O on the brokers, and the replication lag between regions grew from 500ms to 3 seconds. The customer in Tokyo saw their events delayed even more.

Finally, we tried resharding the input topics. We increased the partition count from 128 to 256, thinking more partitions would mean better distribution. But the range partitioner in Kafka Streams assigned contiguous partitions to the same thread, which meant a single thread could be processing 16 partitions at once—way more than it could handle. The Streams app fell behind by 30 minutes.

The Architecture Decision

At 6AM, after four failed hotfixes and two cups of cold coffee, we admitted we were optimizing the wrong thing. The real problem wasn't the Streams app; it was the event routing.

We had designed the system with a single input topic, user-actions, and a single output topic, leaderboard-updates. But different types of events had different latency requirements. A checkin event needed to appear on the leaderboard in under 100ms, while a daily-bonus event could tolerate 5 seconds. We were treating all events the same, and Kafka Streams was processing them in FIFO order, no matter the priority.

So we threw out the single Streams topology and rebuilt the pipeline as separate flows, one per priority tier:

We split the user-actions topic into three: user-actions-p0 for high-priority events like checkins, user-actions-p1 for medium-priority events like task completion, and user-actions-p2 for low-priority events like daily bonuses. Each topic had a dedicated Streams app with its own thread pool and commit interval.

The user-actions-p0 app used 8 threads and a 1-second commit interval, while the user-actions-p2 app used 2 threads and a 10-second commit interval. We set the retention for user-actions-p0 to 1 hour, since we didn't need to replay stale events, and for user-actions-p2 to 7 days, in case we needed to debug a bonus miscalculation.

Most importantly, we moved the Streams apps closer to the brokers. Instead of running them in Kubernetes across regions, we deployed them as EC2 instances in the same AZ as the brokers, connected via private VPC endpoints. We also switched from MirrorMaker 2.0 to Kafka's built-in rack awareness, so partitions were distributed across AZs, not regions.
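
The split boils down to a routing table at the ingestion layer plus per-tier settings. The sketch below is illustrative only: the topic names and the p0 and p2 numbers come from the text, while the event type strings (in particular "task-completion"), the event shape, and the choice to reject unknown types are assumptions. The post does not give settings for user-actions-p1, so none are listed.

```python
# Event type -> topic. "task-completion" is a hypothetical type name.
TOPIC_BY_EVENT_TYPE = {
    "checkin": "user-actions-p0",
    "task-completion": "user-actions-p1",
    "daily-bonus": "user-actions-p2",
}

# Per-tier settings using standard Kafka Streams and topic config keys.
TIER_SETTINGS = {
    "user-actions-p0": {
        "num.stream.threads": 8,
        "commit.interval.ms": 1_000,
        "retention.ms": 60 * 60 * 1000,            # 1 hour
    },
    "user-actions-p2": {
        "num.stream.threads": 2,
        "commit.interval.ms": 10_000,
        "retention.ms": 7 * 24 * 60 * 60 * 1000,   # 7 days
    },
}


def route(event):
    """Pick the destination topic for an event dict with a 'type' field."""
    event_type = event.get("type")
    try:
        return TOPIC_BY_EVENT_TYPE[event_type]
    except KeyError:
        # Assumption: unknown types are rejected rather than silently given a priority.
        raise ValueError(f"unknown event type: {event_type!r}") from None
```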

The change took 12 hours, but it worked. By the next peak, the P99 latency for user-actions-p0 was 78ms, and the user-actions-p2 events were still under 5 seconds. The replication lag between regions dropped to 100ms.

What The Numbers Said After

After a week of volume testing, the metrics told the story:

End-to-end latency P99: 82ms (down from 30s)
Streams app CPU usage: 45% (down from 95%)
Replication lag between regions: 120ms (down from 3s)
State store commit lag: 200ms (down from 5s)

But the real win wasn't the numbers; it was the absence of pages. For the first time in three months, the on-call team wasn't woken up at 3AM by a latency spike. The system was boring. And that's the point.

What I Would Do Differently

If I could go back, I'd never have let the Streams app handle mixed-priority events in the first place. Kafka Streams is a great tool, but it's not a Swiss Army knife. It's designed for event processing, not event routing. We should have split the streams at the ingestion layer, not after the fact.

I'd also have resisted the urge to solve everything with configuration. More threads, smaller commits, bigger partitions: these are all levers you can pull, but they're
