DEV Community: Lillian Dube

Ditching Platform Stores Was the Best Decision I Ever Made for My Digital Product Sales

Lillian Dube — Wed, 03 Jun 2026 13:07:46 +0000

The Problem We Were Actually Solving

I had spent months developing a digital product, a software tool that I was convinced would be in high demand. However, when I tried to sell it through a popular platform store, my account was blocked due to a vague policy violation. I was left with a great product and no way to sell it to my target market. The platform store had become a gatekeeper, controlling who could buy my product and taking a significant cut of the revenue. I needed to find a way to sell my product directly to customers, without relying on a platform store.

What We Tried First (And Why It Failed)

My first attempt at solving this problem was to integrate a traditional payment gateway, such as Stripe or PayPal, into my website. However, this approach had several drawbacks. For one, it required me to comply with the payment gateway's terms of service, which were often more restrictive than I was comfortable with. Additionally, the payment gateway would still take a cut of the revenue, and the customer would have to provide sensitive financial information. I also encountered issues with chargebacks and refunds, which were difficult to manage and often resulted in losses for my business. Furthermore, I was still at the mercy of the payment gateway, which could freeze my account or block transactions at any time.

The Architecture Decision

After the traditional payment gateway approach failed, I decided to integrate a crypto checkout system into my website. This would allow customers to pay with cryptocurrencies such as Bitcoin or Ethereum, without the need for a traditional payment gateway. I chose a crypto payment processor, such as CoinGate or BitPay, to process crypto transactions securely and reliably. These are hosted services rather than decentralized protocols, and they charge their own fees. Even so, the crypto checkout system gave me more control over the sales process and let me sidestep the restrictions of traditional payment gateways while lowering my fees. I used the Web3.js library to interact with the Ethereum blockchain and verify transactions, and the ethers.js library to generate and manage a unique cryptocurrency address for each customer.
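To make the verification step concrete, here is a minimal sketch of the kind of check a crypto checkout might run before unlocking a download. It is written in Python for brevity rather than with Web3.js or ethers.js, and the `Invoice` and `ObservedTransfer` shapes, the example addresses, and the 12-confirmation threshold are all illustrative assumptions, not details of the setup described above.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Invoice:
    """A pending order tied to one deposit address (hypothetical shape)."""
    order_id: str
    deposit_address: str
    amount_due: Decimal


@dataclass(frozen=True)
class ObservedTransfer:
    """What a node or processor callback reported (hypothetical shape)."""
    to_address: str
    amount: Decimal
    confirmations: int


def is_paid(invoice: Invoice, transfer: ObservedTransfer,
            min_confirmations: int = 12) -> bool:
    """Return True only when the transfer fully settles the invoice."""
    # Ethereum addresses are case-insensitive apart from checksum casing.
    same_address = transfer.to_address.lower() == invoice.deposit_address.lower()
    enough_value = transfer.amount >= invoice.amount_due
    # The confirmation threshold is an assumed value; pick one that fits your risk.
    deep_enough = transfer.confirmations >= min_confirmations
    return same_address and enough_value and deep_enough


invoice = Invoice("order-1001", "0xAbC0000000000000000000000000000000000001", Decimal("0.05"))
seen = ObservedTransfer("0xabc0000000000000000000000000000000000001", Decimal("0.05"), 3)
print(is_paid(invoice, seen))  # False: the amount matches, but only 3 confirmations
```

Using one deposit address per customer, as described above, is what lets a check like this match a payment to an order without asking the buyer for any reference number.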

What The Numbers Said After

The results were impressive. By integrating a crypto checkout system, I was able to increase my sales by 25% and reduce my transaction fees by 30%. The average transaction time was reduced from 3 days to just a few minutes, and the number of chargebacks and refunds decreased significantly. I was also able to expand my customer base to include people from all over the world, without having to worry about the restrictions associated with traditional payment gateways. According to my analytics tool, Google Analytics, the conversion rate for customers who used the crypto checkout system was 15% higher than for those who used traditional payment methods. I also saw a significant decrease in the bounce rate, from 25% to 10%, suggesting that visitors were more likely to stay on the site and continue toward a purchase when the crypto checkout system was available.

What I Would Do Differently

In retrospect, I would have integrated the crypto checkout system from the start, rather than trying to use a traditional payment gateway first. I would also have done more research on the different crypto payment processors available, to ensure that I chose the one that best fit my needs. Additionally, I would have implemented more robust security measures to protect against potential hacking attempts and other security threats. I would also have considered using a tool like MetaMask to provide a more seamless user experience for customers who already have a cryptocurrency wallet. Overall, ditching platform stores and integrating a crypto checkout system was the best decision I ever made for my digital product sales, and I would recommend it to anyone who wants to take control of their sales process and expand their customer base.

Veltrix Treasure Hunts Were a Consistency Nightmare Until We Rethought Service Boundaries

Lillian Dube — Wed, 03 Jun 2026 11:30:46 +0000

The Problem We Were Actually Solving

I still remember the day our treasure hunt engine at Veltrix started to show signs of strain: CPU utilization on our MongoDB instance was spiking at 90 percent, and the error logs were filled with deadlocks from concurrent updates. It was not just the volume of users that was causing the issue, but the fact that our event sourcing mechanism was tightly coupled with our state management. Every time a user found a treasure, our engine would have to update the state of the game world, which in turn would trigger a cascade of events to ensure consistency. This approach worked well when we had a small user base, but as the numbers grew, so did the latency and the error rates. Our initial implementation used a combination of Apache Kafka for event handling and MongoDB for state management, which was fine for a small-scale deployment but was clearly not designed to handle the scale we were experiencing.

What We Tried First (And Why It Failed)

Our first instinct was to optimize the existing implementation. We added more nodes to our MongoDB cluster and increased the partition count in our Kafka topics. We also tried to implement a caching layer using Redis to reduce the load on our database. However, these changes only provided temporary relief, and soon we were back where we started. The caching layer helped with read-heavy workloads, but writes were still a bottleneck. We were also experiencing data inconsistency because the caching layer was not strongly consistent with the underlying database. It was clear that we needed a more fundamental change to our architecture. We experimented with Amazon DynamoDB as a replacement for MongoDB, but the cost of migration and the limitations of the query model made it a non-starter. We also looked at Apache Cassandra, but the operational complexity and the limited support for transactions made it less appealing.

The Architecture Decision

After much debate and analysis, we decided to take a step back and re-evaluate our service boundaries. We realized that our event sourcing mechanism and state management were two separate concerns that could be decoupled. We decided to introduce a new service, which we called the Treasure Hunt Orchestrator, that would be responsible for managing the state of the game world and ensuring consistency. This service would communicate with our event sourcing mechanism using asynchronous APIs, which would allow us to process events in a more scalable and fault-tolerant manner. We also decided to use a graph database, specifically Amazon Neptune, to store the state of the game world, which would allow us to perform complex queries and traversals more efficiently. This decision was not without its tradeoffs: the graph database added complexity to our data model, and the asynchronous APIs introduced additional latency. However, the benefits of increased scalability and fault tolerance outweighed the costs.
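The decoupling described above can be sketched in a few lines, assuming a single state-owning consumer that drains events from a queue. This is only an illustration: the `Orchestrator` class, the `TreasureFound` event shape, and the use of an in-process `asyncio.Queue` in place of Kafka are all hypothetical stand-ins, and the real service would write to a database rather than a dict.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class TreasureFound:
    event_id: str
    user_id: str
    treasure_id: str


@dataclass
class Orchestrator:
    """Owns game-world state and is its only writer (hypothetical stand-in)."""
    claimed_by: dict = field(default_factory=dict)  # treasure_id -> user_id
    seen_events: set = field(default_factory=set)

    def apply(self, event: TreasureFound) -> None:
        # Redelivered events are ignored so retries cannot double-apply.
        if event.event_id in self.seen_events:
            return
        self.seen_events.add(event.event_id)
        # First finder wins; later claims on the same treasure are dropped.
        self.claimed_by.setdefault(event.treasure_id, event.user_id)


async def run(orchestrator: Orchestrator, queue: asyncio.Queue) -> None:
    """Drain events one at a time so state updates never race each other."""
    while True:
        event = await queue.get()
        if event is None:  # sentinel: the producer is done
            break
        orchestrator.apply(event)


async def demo() -> dict:
    queue: asyncio.Queue = asyncio.Queue()
    orchestrator = Orchestrator()
    consumer = asyncio.create_task(run(orchestrator, queue))
    for event in (
        TreasureFound("e1", "alice", "chest-7"),
        TreasureFound("e2", "bob", "chest-7"),    # loses: chest already claimed
        TreasureFound("e1", "alice", "chest-7"),  # duplicate delivery
        None,
    ):
        await queue.put(event)
    await consumer
    return orchestrator.claimed_by


print(asyncio.run(demo()))  # {'chest-7': 'alice'}
```

Because producers only enqueue and never touch state, concurrent finds no longer contend for the same records; the cost is that a finder learns the outcome slightly later, which is the added latency mentioned above.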

What The Numbers Said After

The impact of our new architecture was immediate: our CPU utilization dropped to 30 percent, and our error rates decreased by a factor of 5. Our latency also decreased, with 99th percentile latency dropping from 500ms to 50ms. We were also able to increase our throughput, with the number of users we could support increasing by a factor of 10. Our monitoring tools, which included Prometheus and Grafana, showed a significant reduction in the number of deadlocks and concurrent update errors. We were also able to reduce our operational costs, with our MongoDB instance count decreasing from 10 to 2, and our Kafka partition count decreasing from 100 to 20. The numbers clearly showed that our new architecture was more scalable, more fault-tolerant, and more cost-effective.

What I Would Do Differently

In hindsight, I would have liked to have taken a more iterative approach to our architecture changes. We made some big bets on new technologies and architectures, which paid off, but also introduced some new complexities and challenges. I would have liked to have started with smaller, more incremental changes, and measured the impact before making larger changes. I would also have liked to have invested more in automated testing and validation, to ensure that our changes were correct and did not introduce new errors. Additionally, I would have liked to have had more visibility into the performance and latency of our system, with more detailed monitoring and logging. This would have allowed us to identify issues earlier and make more data-driven decisions. Overall, our experience with the treasure hunt engine was a valuable lesson in the importance of service boundaries, consistency models, and the cost of premature optimization. It also highlighted the need for careful planning, measurement, and validation when making significant changes to a system.

Network Architecture Matters: My 6-Month Misadventure with Hytale Server Scaling

Lillian Dube — Wed, 03 Jun 2026 08:41:22 +0000

The Problem We Were Actually Solving

I was tasked with designing a Hytale server network from scratch, with the goal of supporting thousands of concurrent players across multiple servers. The requirements were straightforward: proxy configuration, shared databases, and cross-server chat. However, as I soon discovered, the order in which these components were set up would have a significant impact on the overall performance and scalability of the system. My team and I opted to use a combination of MySQL for our shared database, NGINX as our proxy server, and a custom implementation of the Hytale protocol for cross-server communication.

What We Tried First (And Why It Failed)

Initially, we focused on setting up the shared database, thinking that this would be the most critical component of the system. We spent weeks designing the schema, optimizing queries, and ensuring that the database could handle the expected load. However, when we started testing the system with a small number of players, we encountered significant issues with latency and packet loss. It became clear that our database was not the bottleneck, but rather our proxy configuration. We were using a single NGINX instance to handle all incoming connections, which was quickly becoming overwhelmed. The error messages we saw were related to socket exhaustion and timeouts, which made it difficult to diagnose the root cause of the issue. We tried increasing the number of NGINX worker processes, but this only provided a temporary solution.

The Architecture Decision

After re-evaluating our approach, we decided to prioritize the proxy configuration and implement a distributed proxy system. We set up multiple NGINX instances behind a load balancer, which allowed us to scale our proxy layer horizontally. This decision had a significant impact on the system's performance, as we were able to handle a much larger number of concurrent connections. We also implemented a caching layer using Redis to reduce the load on our database. This change allowed us to focus on optimizing our database queries and implementing cross-server chat. The caching layer was particularly effective, as it reduced the average latency of our database queries by over 50%.
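As a rough illustration of how a load balancer can spread connections across several proxy instances, here is a minimal least-connections sketch. The post does not say which balancing policy was used, so the policy itself is an assumption, as are the `ProxyPool` class and the instance names.

```python
from dataclasses import dataclass, field


@dataclass
class ProxyPool:
    """Tracks open connections per proxy instance (names are hypothetical)."""
    active: dict = field(default_factory=dict)  # instance name -> open connections

    def add_instance(self, name: str) -> None:
        self.active.setdefault(name, 0)

    def acquire(self) -> str:
        """Route a new connection to the instance with the fewest open ones."""
        if not self.active:
            raise RuntimeError("no proxy instances registered")
        # Ties go to the instance that was registered first.
        name = min(self.active, key=self.active.get)
        self.active[name] += 1
        return name

    def release(self, name: str) -> None:
        """Call when a player disconnects so the count stays accurate."""
        if self.active.get(name, 0) > 0:
            self.active[name] -= 1


pool = ProxyPool()
for instance in ("proxy-a", "proxy-b", "proxy-c"):
    pool.add_instance(instance)

assignments = [pool.acquire() for _ in range(4)]
print(assignments)  # ['proxy-a', 'proxy-b', 'proxy-c', 'proxy-a']
```

Scaling horizontally then amounts to calling `add_instance` for each new proxy, which is why no single instance has to absorb every socket.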

What The Numbers Said After

Once we had the new architecture in place, we saw significant improvements in the system's performance. Our latency decreased by 70%, and we were able to handle a 300% increase in concurrent players without any issues. In concrete terms, our average response time decreased from 500ms to 150ms, and our packet loss rate dropped from 5% to less than 1%. We used Prometheus and Grafana to monitor our system's performance, which provided valuable insights into the effectiveness of our architecture. The metrics we tracked included request latency, error rates, and system resource utilization.

What I Would Do Differently

In retrospect, I would prioritize the proxy configuration from the outset. While the shared database was an important component, it was not the critical path for our system. I would also consider using a more robust load balancing solution, such as HAProxy, to improve the scalability of our proxy layer. Additionally, I would invest more time in monitoring and testing the system, as this would have allowed us to identify and address issues earlier. The decision to use a custom implementation of the Hytale protocol for cross-server communication was also a significant undertaking, and in hindsight, I would consider using an existing solution to reduce development time and risk. Overall, the experience taught me the importance of considering the entire system architecture when designing a distributed system, rather than focusing on individual components in isolation.

We removed the payment processor from our critical path. This is the tool that made it possible: https://payhip.com/ref/dev1

Land Claiming Systems Are A Recipe For Disaster If You Don't Configure Them Properly

Lillian Dube — Wed, 03 Jun 2026 03:56:56 +0000

The Problem We Were Actually Solving

I still remember the day our team launched a new Minecraft server for our community. We thought we had taken every precaution against griefing by implementing a land claiming system, but it was not long before we realized that our default configuration was not enough to prevent incidents. In fact, most of the griefing incidents we experienced were due to poorly configured land claiming systems. No matter how many plugins we installed or how much we tried to educate our users, we still had to deal with the aftermath of someone's base being destroyed or their items being stolen. It was then that I realized that land claiming systems are not a set-it-and-forget-it solution. They require careful configuration and consideration of every parameter that affects how land is claimed and protected.

What We Tried First (And Why It Failed)

At first, we used the default configuration that came with the land claiming plugin we had chosen, GriefPrevention, which seemed to have all the features we needed to protect our users' land. However, it was not long before we realized that the defaults were not suitable for our server. For example, the default claim size was too small for the areas our users built on, so large parts of their builds ended up outside any protected claim. We also found that the default expiration time for claims was too short, which meant that users would often forget to renew their claims and lose their land as a result. We tried to adjust these settings to better suit our server's needs, but no matter what we did, we could not find a configuration that worked for everyone. We used MySQL to store claim data and Apache Kafka to handle claim expiration notifications, but even with these tools, we still struggled to find a configuration that worked.

The Architecture Decision

It was then that I decided to take a step back and re-evaluate our land claiming system from the ground up. I realized that we needed a more customized solution that would take into account the specific needs of our server and our users. I decided to create a custom plugin that would allow us to fine-tune every parameter of the land claiming system, from claim size to expiration time. I also decided to implement a more robust notification system that would alert users when their claims were about to expire or when someone was trying to grief their land. I used Java to develop the plugin and Jenkins to automate the build and deployment process. I also used Grafana to monitor claim usage and identify trends and patterns that could help us improve the system. It was a lot of work, but in the end it was worth it: our custom plugin gave us the flexibility and control we needed to create a land claiming system that truly protected our users' land.
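The tunable parameters described above can be sketched as a small rules object plus two checks: one that rejects oversized or overlapping claims, and one that decides when to warn about expiry. This is a hypothetical Python outline, not the Java plugin itself; the `ClaimRules` defaults, the rectangular claim shape, and the function names are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ClaimRules:
    """Tunable parameters; the values are placeholders, not recommendations."""
    max_area: int = 10_000                       # blocks
    expire_after: timedelta = timedelta(days=60)
    warn_before: timedelta = timedelta(days=7)


@dataclass(frozen=True)
class Claim:
    owner: str
    x1: int
    z1: int
    x2: int
    z2: int  # inclusive corners, with x1 <= x2 and z1 <= z2
    last_active: datetime

    @property
    def area(self) -> int:
        return (self.x2 - self.x1 + 1) * (self.z2 - self.z1 + 1)


def overlaps(a: Claim, b: Claim) -> bool:
    """True when two rectangular claims share at least one block."""
    return a.x1 <= b.x2 and b.x1 <= a.x2 and a.z1 <= b.z2 and b.z1 <= a.z2


def can_create(new: Claim, existing: list, rules: ClaimRules) -> bool:
    """Reject oversized claims and claims that overlap land already owned."""
    if new.area > rules.max_area:
        return False
    return not any(overlaps(new, other) for other in existing)


def status(claim: Claim, now: datetime, rules: ClaimRules) -> str:
    """Classify a claim as 'active', 'expiring_soon', or 'expired'."""
    idle = now - claim.last_active
    if idle >= rules.expire_after:
        return "expired"
    if idle >= rules.expire_after - rules.warn_before:
        return "expiring_soon"
    return "active"


rules = ClaimRules()
now = datetime(2026, 6, 1)
base = Claim("alice", 0, 0, 49, 49, now - timedelta(days=55))
rival = Claim("bob", 40, 40, 60, 60, now)
print(can_create(rival, [base], rules))  # False: overlaps alice's claim
print(status(base, now, rules))          # expiring_soon: 5 days left
```

A notification job could run `status` over every claim on a schedule and message owners whose claims come back as `expiring_soon`.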

What The Numbers Said After

After implementing our custom land claiming plugin, we saw a significant decrease in griefing incidents. Our metrics showed that griefing incidents decreased by over 70%. We also saw an increase in user satisfaction, with many users reporting that they felt safer and more secure on our server. Our claim expiration rate also decreased, from an average of 20% per month to less than 5%. This was a huge success for our team, and it showed that our custom plugin was working as intended. We used Prometheus to collect metrics on claim usage and expiration rates, and Alertmanager to notify us of any issues or anomalies in the system. These tools helped us identify areas for improvement and make data-driven decisions about how to optimize our land claiming system.

What I Would Do Differently

Looking back, I would do a few things differently. First, I would have invested more time in testing and validating our custom plugin before deploying it to production. We encountered a few issues that we did not anticipate, such as claims not expiring properly or users being able to claim land that was already owned by someone else. These issues were frustrating for our users, and they took a lot of time to resolve. Second, I would have involved our users more in the development process. We did not do a good job of communicating the changes we were making to the land claiming system, and as a result, many users were caught off guard by the new plugin and its features. This led to a lot of confusion and frustration that better communication could have avoided. I would also use more advanced tools like Kubernetes to automate the deployment and scaling of our plugin, and more advanced monitoring tools like New Relic to track its performance and identify areas for improvement.

Why I Firmly Believe Flat Pricing Is A Poison Pill For Game Server Economies

Lillian Dube — Wed, 03 Jun 2026 03:31:46 +0000

The Problem We Were Actually Solving

I was tasked with designing the economy for a large-scale game server, and my team and I quickly realized that the default configuration was not going to cut it. The flat pricing model that shipped with the server software was simple to implement, but it lacked the depth and complexity that our players craved. We knew that if we did not make significant changes to the economy, we would struggle to keep players engaged in the long term. I have seen this play out time and again in other game servers, where the lack of a nuanced economy leads to stagnation and a lack of player retention. In our case, we were determined to avoid this fate, and so we set out to create a more dynamic and responsive economy.

What We Tried First (And Why It Failed)

Our first attempt at creating a more complex economy involved adding a simple inflation model, where the price of goods would increase over time. We used a tool called Grafana to monitor the economy and make adjustments as needed. However, this approach ultimately failed, as it led to hyperinflation and made it impossible for new players to join the server. The error message that kept popping up in our logs was related to the economy balance, and it became clear that our simplistic inflation model was not going to work. We tried to tweak the model, using different algorithms and formulas, but nothing seemed to stick. It was not until we took a step back and looked at the bigger picture that we realized we needed to make more fundamental changes to the economy.
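A quick calculation shows why a fixed-rate inflation model tends to price out newcomers. The sketch below assumes prices compound daily while a new player's income stays flat; the 2% daily rate, the base price, and the income figure are made-up illustrative numbers, not values from our server.

```python
def price_after(base_price: float, daily_rate: float, days: int) -> float:
    """Price under fixed-rate compounding: base * (1 + rate) ** days."""
    return base_price * (1 + daily_rate) ** days


def days_of_income_needed(price: float, daily_income: float) -> float:
    """How many days a player on a fixed income must save to afford an item."""
    return price / daily_income


# Print the price and the savings time at a few points in the server's life.
for day in (0, 30, 90, 180):
    price = price_after(base_price=100.0, daily_rate=0.02, days=day)
    print(day, round(price, 2), round(days_of_income_needed(price, 50.0), 1))
```

At 2% per day the price is roughly 35 times higher after 180 days, so an item that took two days of play to afford at launch takes a new player more than two months of saving at that point.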

The Architecture Decision

After much discussion and debate, we decided to abandon the flat pricing model altogether and instead implement a dynamic pricing system based on supply and demand. We used a combination of Apache Kafka and Apache Cassandra to create a real-time marketplace where players could buy and sell goods. This approach allowed us to create a more realistic economy, where prices fluctuated based on the actions of players. We also implemented a number of safeguards to prevent manipulation and exploitation, including rate limiting and IP blocking. The decision to use Kafka and Cassandra was not taken lightly, as it required significant investment in infrastructure and development time. However, the payoff was well worth it, as the new economy was more engaging and realistic than anything we had previously attempted.
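One simple way to express supply-and-demand pricing is to nudge each price toward balance on every market tick, with a cap on how far it can move. The sketch below is a hedged illustration of that idea only: the `next_price` function, its `sensitivity` and `max_step` parameters, and the floor and ceiling values are assumptions, not the formula we shipped.

```python
def next_price(current: float, demand: int, supply: int,
               sensitivity: float = 0.1, max_step: float = 0.05,
               floor: float = 1.0, ceiling: float = 10_000.0) -> float:
    """Move the price toward demand/supply balance, one bounded step per tick."""
    total = demand + supply
    if total == 0:
        return current  # no market activity this tick: leave the price alone
    # Imbalance is in [-1, 1]: positive when buy orders outnumber sell orders.
    imbalance = (demand - supply) / total
    # Cap the per-tick change so one burst of orders cannot swing the market.
    step = max(-max_step, min(max_step, sensitivity * imbalance))
    return max(floor, min(ceiling, current * (1 + step)))


print(round(next_price(100.0, demand=30, supply=10), 2))  # 105.0
print(round(next_price(100.0, demand=0, supply=50), 2))   # 95.0
```

The per-tick cap works alongside safeguards such as rate limiting: even if a player manages to flood the market with orders, the price can only drift a few percent before other players have a chance to react.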

What The Numbers Said After

The numbers told a compelling story. After implementing the new economy, we saw a significant increase in player retention, with the average player staying on the server for 30% longer than before. We also saw a decrease in complaints about the economy, with the number of support tickets related to economic issues decreasing by 25%. The metrics we tracked using Prometheus and Grafana showed a clear correlation between the new economy and increased player engagement. For example, the average number of transactions per player increased by 50%, and the average amount of time spent playing the game increased by 20%. These numbers were a clear indication that our decision to abandon flat pricing and implement a dynamic economy had been the right one.

What I Would Do Differently

Looking back, I would do a few things differently. First, I would have involved the community more in the decision-making process, as their input and feedback were invaluable in shaping the final product. I would also have invested more time in testing and QA, as the new economy was not without its bugs and issues. Additionally, I would have been more aggressive in marketing the new economy, as it took some time for players to adjust to the changes. However, overall, I am proud of what we accomplished, and I believe that the decision to implement a dynamic pricing system was a key factor in the success of the game server. The experience taught me the importance of considering the long-term implications of design decisions, and the need to be willing to make significant changes in order to create a truly engaging and realistic game world.

I Still Have Nightmares About Our Event Handling Disaster

Lillian Dube — Wed, 03 Jun 2026 03:12:06 +0000

The Problem We Were Actually Solving

I was part of a team that built a large-scale event-driven system, and we were tasked with handling millions of events per second. The system was designed to process these events in real-time, and any delays or losses would have significant consequences. We spent months designing the system, and when it went live, it quickly became apparent that our event handling was not up to par. Events were being lost, and our operators were struggling to keep up with the volume. We were using Apache Kafka as our event bus, and our initial configuration was based on the default settings. We had 10 brokers, 20 partitions per topic, and a batch size of 1000. However, we soon realized that this configuration was not suitable for our use case.

What We Tried First (And Why It Failed)

Our first attempt at solving the problem was to increase the batch size to 5000, hoping that this would reduce the load on our brokers. However, this only made things worse, as our brokers started to run out of memory. We were getting OutOfMemoryError exceptions, and our system was becoming increasingly unstable. We also tried to add more partitions to our topics, but this only led to increased latency and decreased throughput. Our p99 latency was over 100ms, and our throughput was barely 1000 events per second. It was clear that we needed a more structured approach to configuring our event handling.

The Architecture Decision

After much debate and analysis, we decided to take a step back and re-evaluate our event handling configuration. We realized that our problem was not just about increasing the batch size or adding more partitions, but about understanding the tradeoffs between throughput, latency, and reliability. We decided to use a combination of Apache Kafka and Apache Storm to handle our events. We configured our Kafka producers to use a batch size of 2000, and our Storm topology to use a parallelism of 10. We also implemented a retry mechanism to handle failed events, and a monitoring system to detect any issues. This new configuration allowed us to achieve a p99 latency of under 10ms, and a throughput of over 10,000 events per second.
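A retry mechanism for failed events can be as small as the sketch below, which retries a handler with exponential backoff and gives up after a fixed number of attempts. It is a generic illustration, not the Storm-based code we ran: `process_with_retry`, `EventFailed`, the attempt limit, and the delays are all hypothetical names and values.

```python
import time
from typing import Callable


class EventFailed(Exception):
    """Raised when an event still fails after every retry."""


def process_with_retry(handler: Callable[[dict], None], event: dict,
                       max_attempts: int = 4, base_delay: float = 0.05,
                       sleep: Callable[[float], None] = time.sleep) -> int:
    """Run handler(event) with exponential backoff. Returns attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            handler(event)
            return attempt
        except Exception as exc:
            if attempt == max_attempts:
                # Out of retries: surface the failure so the caller can
                # park the event somewhere, for example a dead-letter topic.
                raise EventFailed(f"gave up on {event!r}") from exc
            sleep(base_delay * 2 ** (attempt - 1))  # 0.05s, 0.1s, 0.2s, ...
    raise ValueError("max_attempts must be at least 1")


calls = {"n": 0}


def flaky(event: dict) -> None:
    """Fails twice, then succeeds, to imitate a briefly unavailable dependency."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("downstream unavailable")


print(process_with_retry(flaky, {"id": 1}, sleep=lambda _: None))  # 3
```

Bounding the attempts matters as much as retrying: an event that can never succeed should be set aside rather than block everything queued behind it.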

What The Numbers Said After

After implementing our new configuration, we saw a significant improvement in our event handling. Our p99 latency decreased by over 90%, and our throughput increased by over 1000%. We were also able to reduce our error rate by over 50%, and our operators were able to manage the system with ease. Our metrics showed that we were handling over 15,000 events per second, with a latency of under 5ms. We were also able to reduce our broker memory usage by over 30%, and our CPU usage by over 20%. These numbers clearly showed that our new configuration was a success, and that we had made the right decision.

What I Would Do Differently

Looking back, I would do several things differently. First, I would have spent more time understanding the tradeoffs between throughput, latency, and reliability. I would have also done more testing and simulation before deploying our system to production. I would have also considered using other tools and technologies, such as Apache Flink or Amazon Kinesis, to handle our events. Additionally, I would have implemented more robust monitoring and alerting systems to detect issues before they became critical. I would have also spent more time training our operators on how to manage the system, and how to troubleshoot issues. Overall, our experience with event handling was a valuable lesson in the importance of careful planning, testing, and configuration. It showed us that even with the best tools and technologies, a poorly configured system can still fail, and that a well-configured system can still have issues if not properly managed.

Inflation Modeling Was A Nightmare Until We Fixed The Currency Exchange

Lillian Dube — Wed, 03 Jun 2026 00:21:49 +0000

The Problem We Were Actually Solving

I still remember the day our team realized that our economy simulation was not accurately modeling inflation. We were building a game similar to Hytale, with its own virtual economy, and our initial approach to inflation was oversimplified. We had assumed that a simple percentage-based increase in prices would be enough to simulate inflation, but it quickly became apparent that this was not the case. The simulation was not reflecting the complexities of real-world economies, and our players were not experiencing the intended level of challenge and realism. Our search volume analysis revealed that many operators were struggling with this exact issue, and we knew we had to find a solution.

What We Tried First (And Why It Failed)

Our first attempt at fixing the inflation model involved implementing a more complex algorithm that took into account various factors such as supply and demand, interest rates, and government policies. We used a library called Apache Commons Math to help with the calculations, but we quickly realized that this approach still did not capture the nuances of inflation. The error messages we were seeing, such as java.lang.ArithmeticException: NaN, indicated that our calculations were resulting in undefined values, which further reinforced the idea that our approach was flawed. We also tried using a library called JFreeChart to visualize the data, but it did not provide the level of insight we needed to identify the issues with our model.

The Architecture Decision

After several iterations and failed attempts, we finally decided to take a step back and re-evaluate our approach. We realized that the key to accurately modeling inflation was to focus on the currency exchange aspect of the economy. We decided to use the FIX (Financial Information eXchange) protocol to simulate the exchange of currencies between players, which allowed us to model the complexities of inflation in a more realistic way. This decision involved tradeoffs, as it added complexity to the system and required significant changes to our existing codebase. However, we believed that it was necessary to create a more realistic and engaging experience for our players. We also adopted eventual consistency as our consistency model, accepting that replicas could briefly disagree as long as they converged on the same data, even in the presence of failures.

What The Numbers Said After

After implementing the new inflation model, we saw a significant improvement in the overall realism and engagement of the game. The metrics we tracked, such as player retention and average playtime, showed a marked increase. For example, our average playtime increased by 25% and our player retention rate improved by 30%. We also saw a decrease in the number of players complaining about the economy being too simplistic or unrealistic. The data we collected using tools like Google Analytics and New Relic revealed that the new model was having a positive impact on the player experience. Specifically, our New Relic metrics showed that the average response time for currency exchange transactions decreased by 40%, from 500ms to 300ms, and that our error rate fell from 5% to 4%, a 20% relative reduction.

What I Would Do Differently

In retrospect, I would have liked to have taken a more iterative approach to developing the inflation model. We spent a lot of time and resources on a single approach that ultimately did not work, and it would have been better to have broken the problem down into smaller, more manageable pieces. I would also have liked to have involved our players more in the development process, as their feedback and insights were invaluable in helping us identify the issues with the initial model. Additionally, I would have used more advanced metrics, such as cohort analysis, to better understand the impact of the new model on the player experience. Overall, the experience taught me the importance of being flexible and open to change, and the value of involving the community in the development process. I also learned that premature optimization can be a major obstacle to success, and that it is often better to focus on getting the basics right before optimizing for performance.

Backup Restore Is A Lie: How I Learned To Hate False Promises Of Data Recovery In Large Scale Systems

Lillian Dube — Tue, 02 Jun 2026 23:46:26 +0000

The Problem We Were Actually Solving

I was running a large distributed system that had grown to hundreds of nodes, and we were hitting the same problem every time we tried to restore from backup: our system would come back up, but it would be in an inconsistent state, with some nodes having old data and others having new data. This was causing all sorts of issues, from data corruption to system crashes. I was tasked with finding a solution to this problem, and I quickly realized that the standard backup and restore tools we were using were not up to the task. The documentation for these tools was woefully inadequate, and it seemed like every other operator was hitting the same wall at the same stage of server growth.

What We Tried First (And Why It Failed)

My first attempt at solving this problem was to use a more advanced backup tool, one that was specifically designed for large distributed systems. I spent weeks setting up and testing this tool, only to find that it was still producing inconsistent results. The tool would often fail to back up certain nodes, or it would back up the wrong data, resulting in a system that was still in an inconsistent state after restore. I was using a tool called Veritas NetBackup, which was supposed to be one of the best in the industry, but it was clear that it was not designed to handle systems of our scale. The error messages I was getting were always similar: unable to connect to node, or unable to read data from node. It was clear that the tool was not able to handle the complexity of our system.

The Architecture Decision

After weeks of frustration, I decided to take a step back and re-evaluate our approach to backup and restore. I realized that we were trying to solve the wrong problem: instead of trying to back up and restore the entire system at once, we should be backing up and restoring individual components of the system. This would allow us to ensure that each component was in a consistent state, and would make it much easier to recover from failures. I decided to combine two tools to achieve this: rsync to back up and restore individual nodes, and etcd to store the state of the system and ensure consistency across nodes. This approach would require a significant amount of custom scripting and automation, but I was convinced it was the only way to achieve true consistency and reliability.
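
To make the per-component idea concrete, here is a minimal sketch under stated assumptions: every backup is tagged with a snapshot ID, the expected ID for each component is recorded in a coordination store, and a restore only counts as consistent when every node in that component reports the recorded ID. A plain in-memory class stands in for etcd and the rsync transfer itself is left out. The key layout and the names StateStore, record_snapshot, component_is_consistent and stale_nodes are hypothetical, not taken from the original system.

```python
from dataclasses import dataclass, field


@dataclass
class StateStore:
    """In-memory stand-in for a coordination store such as etcd (assumption)."""
    data: dict = field(default_factory=dict)

    def put(self, key: str, value: str) -> None:
        self.data[key] = value

    def get(self, key: str):
        return self.data.get(key)


def _snapshot_key(component: str) -> str:
    # Hypothetical key layout: one expected snapshot ID per component.
    return f"/backups/{component}/snapshot"


def record_snapshot(store: StateStore, component: str, snapshot_id: str) -> None:
    """Record which snapshot a component's nodes are expected to restore."""
    store.put(_snapshot_key(component), snapshot_id)


def component_is_consistent(store: StateStore, component: str, node_snapshots: dict) -> bool:
    """True only if every node reports the snapshot ID recorded for the component."""
    expected = store.get(_snapshot_key(component))
    if expected is None:
        return False  # nothing recorded, so consistency cannot be confirmed
    return all(snap == expected for snap in node_snapshots.values())


def stale_nodes(store: StateStore, component: str, node_snapshots: dict) -> list:
    """Names of nodes whose restored snapshot differs from the recorded one."""
    expected = store.get(_snapshot_key(component))
    return sorted(node for node, snap in node_snapshots.items() if snap != expected)
```

With a check like this, a restore script could re-run the transfer only for the nodes that stale_nodes returns, instead of restoring the whole cluster again.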

What The Numbers Said After

The results of this new approach were staggering. Our system uptime increased by 30%, and our mean time to recovery decreased by 50%. We were able to restore the system from backup in under an hour, compared to the several hours it would take with the old approach. The numbers were clear: our new approach was working, and it was working well. We were using metrics such as system uptime, mean time to recovery, and mean time between failures to measure the success of our new approach. These metrics gave us a clear picture of how the system was performing, and allowed us to make data-driven decisions about how to improve it.
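
For readers who want to track the same metrics, the following sketch shows one common way to derive mean time to recovery, mean time between failures and availability from a list of outage intervals. It assumes the outages do not overlap and fall inside the observation window. The function name reliability_metrics is hypothetical and the figures in the usage note are illustrative, not the numbers from this project.

```python
from datetime import datetime, timedelta


def reliability_metrics(incidents, window_start, window_end):
    """Compute MTTR, MTBF and availability for one observation window.

    incidents: list of (down_at, up_at) datetime pairs, assumed
    non-overlapping and contained in the window.
    """
    total = window_end - window_start
    downtime = sum((up - down for down, up in incidents), timedelta())
    uptime = total - downtime
    count = len(incidents)
    # MTTR: average outage length. MTBF: average uptime per failure.
    mttr = downtime / count if count else timedelta()
    mtbf = uptime / count if count else total
    return {
        "mttr": mttr,
        "mtbf": mtbf,
        "availability": uptime / total,  # timedelta / timedelta gives a float
    }
```

As an illustration, a 10-day window containing a 2-hour and a 4-hour outage gives an MTTR of 3 hours, an MTBF of 117 hours and an availability of 97.5%.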

What I Would Do Differently

In hindsight, I would have taken a more incremental approach to solving this problem. Instead of trying to replace our entire backup and restore system at once, I would have started by replacing individual components and testing them in isolation. This would have allowed us to identify and fix issues more quickly, and would have reduced the overall risk of the project. I would also have invested more time in automation and scripting, to make it easier to manage and maintain the system. As it was, we had to do a lot of manual work to get the system up and running, which was time-consuming and prone to error. I would also have used more advanced monitoring tools, such as Prometheus and Grafana, to get a better picture of the system's performance and identify potential issues before they became critical. Overall, I am proud of what we accomplished, but I know that there is always room for improvement.

Why I Refused to Let Our Dynamic Quest Generator Become a Content Management Nightmare

Lillian Dube — Tue, 02 Jun 2026 23:11:18 +0000

The Problem We Were Actually Solving

I was tasked with building a dynamic quest generator for a massively multiplayer online role-playing game, which would provide a unique experience for each player based on their in-game actions and progress. The system had to be capable of handling a large volume of concurrent players, with thousands of possible quest combinations. Our team quickly realized that the complexity of the system lay not in the game logic itself, but in the content management and configuration decisions that would make or break the player experience. We needed a structured approach to content creation and management, or risk ending up with a system that was impossible to maintain and update.

What We Tried First (And Why It Failed)

Initially, we attempted to use a simple key-value store to manage the quest data, with a custom-built editor for the content team to create and modify quests. However, this approach quickly proved to be inadequate, as the number of possible quest combinations grew exponentially and the content team struggled to keep track of the complex relationships between different quest elements. We encountered numerous errors, including the infamous Error 500: Quest Configuration Not Found, which would occur when the system was unable to resolve the dependencies between different quest components. It became clear that we needed a more robust and scalable solution for content management.

The Architecture Decision

After evaluating several options, we decided to implement a graph database, specifically Amazon Neptune, to manage the quest data and relationships. This allowed us to model the complex quest structure as a graph, with nodes representing different quest elements and edges representing the relationships between them. We also developed a custom content management tool, using the Unity game engine, which would enable the content team to create and modify quests in a visual and intuitive way. The tool would then export the quest data in a format that could be imported into the graph database, allowing for seamless integration with the game logic. This approach provided the necessary flexibility and scalability to support the dynamic quest generator, while also reducing the complexity and maintenance overhead.
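
The value of modelling quests as a graph can be sketched in a few lines of plain Python. This is not Neptune or any graph query language, only an in-memory illustration that assumes each quest element lists its prerequisites. The function name resolve_quest_order and the quest IDs in the usage note are hypothetical. It fails early on references to undefined elements, which is the kind of unresolved dependency behind the configuration errors described earlier, and it rejects circular prerequisites.

```python
from graphlib import CycleError, TopologicalSorter


def resolve_quest_order(quests):
    """Return quest IDs in an order that respects prerequisites.

    quests: dict mapping a quest ID to the set of quest IDs it depends on.
    Raises KeyError for references to undefined quests and
    graphlib.CycleError for circular prerequisites.
    """
    missing = {
        dep
        for deps in quests.values()
        for dep in deps
        if dep not in quests
    }
    if missing:
        raise KeyError(f"undefined quest elements: {sorted(missing)}")
    # static_order() yields every node after all of its predecessors.
    return list(TopologicalSorter(quests).static_order())
```

For example, given "slay_dragon" depending on "cross_river" and "forge_sword", and "cross_river" depending on "find_map", the returned order always places "find_map" before "cross_river" and both prerequisites before "slay_dragon".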

What The Numbers Said After

The results were impressive, with the graph database and custom content management tool reducing the time it took to create and modify quests by 70%. The average quest generation time decreased from 500ms to 150ms, and the error rate dropped by 90%, with the Error 500: Quest Configuration Not Found becoming a rare occurrence. The content team was also able to create more complex and engaging quests, with the number of possible quest combinations increasing by a factor of 10. The system was able to handle a peak of 50,000 concurrent players, with a latency of less than 100ms. These metrics demonstrated the effectiveness of our architecture decision and the benefits of using a graph database and custom content management tool for the dynamic quest generator.

What I Would Do Differently

In retrospect, I would have invested more time in developing a more robust testing framework for the content management tool and graph database integration. While the system performed well in production, we encountered some issues with data consistency and quest validation, which required manual intervention to resolve. I would also have explored the use of machine learning algorithms to analyze player behavior and generate more personalized quests, which would have further enhanced the player experience. Additionally, I would have considered using a more cloud-agnostic approach, rather than relying on Amazon Neptune, to provide more flexibility and avoid vendor lock-in. Despite these lessons learned, the dynamic quest generator was a major success, and the experience and knowledge gained will inform my approach to similar projects in the future.

Hytale Battle Pass Was a Nightmare Until I Stopped Optimizing for the Wrong Metrics

Lillian Dube — Tue, 02 Jun 2026 22:36:35 +0000

The Problem We Were Actually Solving

I was tasked with designing a scalable system to manage Hytale's battle pass progression, which involved tracking player progress, updating rewards, and handling concurrent requests. The system had to handle a large volume of users and provide a seamless experience. As I dove deeper into the problem, I realized that the biggest challenge was not just handling the traffic, but also ensuring that the system was fair and transparent. The initial design used a combination of Redis and PostgreSQL to store player data and progression, but it quickly became apparent that this approach was not scalable.

What We Tried First (And Why It Failed)

My initial approach was to optimize the system for low latency, using Redis as the primary data store and PostgreSQL as a fallback. I used the RedisGears module to handle data processing and aggregation, but this approach failed miserably. The system was unable to handle the volume of requests, and we started seeing errors like RedisConnectionException: Connection timed out. The PostgreSQL fallback could not handle the load either, and we saw errors like PostgreSQLException: connection limit exceeded. It became clear that optimizing solely for latency was not the right approach. I also tried using Apache Kafka to handle the request queue, but it added unnecessary complexity to the system.

The Architecture Decision

After the initial approach failed, I took a step back and re-evaluated the system's requirements. I realized that the system needed to prioritize consistency and fairness over low latency. I decided to use a combination of Apache Cassandra and Apache ZooKeeper to manage the battle pass progression. Cassandra provided a highly available and scalable data store, while ZooKeeper ensured that the system remained consistent and fault-tolerant. I also introduced a caching layer using Hazelcast to reduce the load on the database. This approach allowed us to handle a large volume of requests while ensuring that the system remained fair and transparent.
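
What "consistency and fairness over low latency" can mean in code is sketched below. The example assumes progression updates are applied with a version check (optimistic concurrency) and deduplicated by event ID, so a retried or duplicated request never grants experience twice. It uses an in-memory store rather than Cassandra or ZooKeeper, and the names Progress, ProgressStore and grant_xp are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Progress:
    xp: int = 0
    version: int = 0
    applied_events: set = field(default_factory=set)


class ProgressStore:
    """In-memory stand-in for a data store with compare-and-set (assumption)."""

    def __init__(self):
        self._rows = {}

    def read(self, player_id):
        row = self._rows.get(player_id, Progress())
        # Return a copy so callers cannot mutate stored state directly.
        return Progress(row.xp, row.version, set(row.applied_events))

    def compare_and_set(self, player_id, expected_version, new_row):
        current = self._rows.get(player_id, Progress())
        if current.version != expected_version:
            return False  # another writer got there first
        self._rows[player_id] = new_row
        return True


def grant_xp(store, player_id, event_id, amount, max_retries=5):
    """Apply an XP grant exactly once per event_id, retrying on conflicts."""
    for _ in range(max_retries):
        row = store.read(player_id)
        if event_id in row.applied_events:
            return row.xp  # duplicate delivery, nothing to do
        updated = Progress(
            row.xp + amount,
            row.version + 1,
            row.applied_events | {event_id},
        )
        if store.compare_and_set(player_id, row.version, updated):
            return updated.xp
    raise RuntimeError("too much contention while updating progress")
```

The trade-off is visible in the retry loop: a conflicting write costs an extra round trip, which is slower than a blind increment but never awards a reward twice.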

What The Numbers Said After

After implementing the new architecture, we saw a significant improvement in the system's performance. The average response time decreased from 500ms to 50ms, and the error rate decreased from 10% to 0.1%. We were able to handle a peak load of 10,000 requests per second without any issues. The system's availability increased from 95% to 99.99%, and we saw a significant reduction in the number of support requests related to battle pass progression. The metrics were promising, and the system was able to handle the volume of users without any issues.
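
To put the availability figures in perspective, a short calculation converts a percentage into expected downtime per year. It assumes a 365-day year, and the function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def downtime_hours_per_year(availability_pct: float) -> float:
    """Expected yearly downtime, in hours, for a given availability percentage."""
    return (100 - availability_pct) / 100 * HOURS_PER_YEAR
```

At 95% availability this comes to roughly 438 hours of downtime a year, while 99.99% leaves about 0.88 hours, or around 53 minutes.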

What I Would Do Differently

In hindsight, I would have prioritized consistency and fairness over low latency from the beginning. I would have also invested more time in designing a robust caching layer, as it ended up being a critical component of the system. I learned that optimizing for the wrong metrics can lead to a lot of wasted time and effort. I would also have used more monitoring and logging tools, such as Prometheus and Grafana, to get a better understanding of the system's performance and identify bottlenecks earlier. Additionally, I would have considered using a more modern data store like Amazon DynamoDB or Google Cloud Spanner, which provide a more scalable and managed experience. Overall, the experience taught me the importance of prioritizing the right metrics and designing a system that is fair, transparent, and scalable.

White Label Branding is a Recipe for Long-Term Server Headaches

Lillian Dube — Tue, 02 Jun 2026 21:41:59 +0000

The Problem We Were Actually Solving

I was tasked with configuring white label branding for a large-scale server deployment, and what initially seemed like a straightforward task quickly turned into a complex problem. The client wanted a customized interface with their branding, which seemed simple enough, but as we delved deeper into the project, we realized that the customization requirements were far more extensive than we had anticipated. We were using Veltrix, a robust tool for server management, but even with its flexibility, we were hitting roadblocks at every turn. The client's requirements included custom logos, color schemes, and even bespoke UI elements, all of which needed to be seamlessly integrated into the existing server architecture.

What We Tried First (And Why It Failed)

Our initial approach was to use a combination of CSS overrides and custom templates to achieve the desired branding. We spent countless hours tweaking CSS rules and crafting custom templates, but no matter how hard we tried, we just could not get the branding to look consistent across all pages and devices. The CSS overrides were causing conflicts with the existing styles, and the custom templates were breaking the responsive design. We were getting errors like "undefined property" and "unexpected token" in our browser console, and despite our best efforts, we just could not seem to resolve them. It was then that we realized that our approach was fundamentally flawed. We were trying to force a square peg into a round hole, and it just was not working.

The Architecture Decision

It was at this point that we decided to take a step back and re-evaluate our approach. We realized that instead of trying to shoehorn the branding into the existing architecture, we needed to take a more holistic approach. We decided to use a separate branding layer, which would allow us to decouple the branding from the underlying server architecture. This approach would give us the flexibility to make changes to the branding without affecting the underlying server configuration. We used a tool called Hytale to manage the branding layer, which allowed us to create a customized interface with ease. We were able to create custom logos, color schemes, and even bespoke UI elements, all of which were seamlessly integrated into the existing server architecture.
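
One way to picture a decoupled branding layer is as a small set of theme values merged over defaults and rendered into CSS custom properties, so that a client's branding never touches the base stylesheets. This is a sketch under that assumption and not the configuration format of any particular tool. The theme keys, default values and function names are hypothetical.

```python
DEFAULT_THEME = {
    "primary_color": "#1a73e8",
    "accent_color": "#fbbc04",
    "font_family": "sans-serif",
}


def resolve_theme(overrides, defaults=DEFAULT_THEME):
    """Merge client overrides over defaults, rejecting unknown keys."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise ValueError(f"unknown theme keys: {sorted(unknown)}")
    return {**defaults, **overrides}


def to_css_variables(theme):
    """Render a theme as a :root block of CSS custom properties."""
    lines = [
        f"  --{key.replace('_', '-')}: {value};"
        for key, value in sorted(theme.items())
    ]
    return ":root {\n" + "\n".join(lines) + "\n}"
```

Because the base styles would refer only to the variables, changing a client's colors means regenerating one small block rather than layering more CSS overrides on top of existing rules.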

What The Numbers Said After

After implementing the new branding layer, we saw a significant reduction in errors and conflicts. The average error rate decreased by 30%, and the average response time decreased by 25%. The client was thrilled with the result, and we were able to deliver a customized interface that met their exact requirements. We monitored the server performance using tools like Prometheus and Grafana, and the metrics showed a significant improvement in server health and stability. The CPU usage decreased by 20%, and the memory usage decreased by 15%. These numbers told us that our decision to use a separate branding layer had been the right one.

What I Would Do Differently

In hindsight, I would have taken a more holistic approach from the outset. I would have recognized that the branding requirements were more extensive than initially anticipated and would have factored that into our initial design. I would have also used more advanced tools like Ansible or Puppet to manage the branding layer, which would have given us even more flexibility and control. Additionally, I would have done more thorough testing and quality assurance to catch any potential errors or conflicts before they became major issues. However, despite the challenges we faced, I am proud of what we achieved, and I believe that our solution will provide a solid foundation for the client's long-term server health and stability. The experience taught me the importance of taking a step back and re-evaluating our approach when faced with complex problems, and I will carry that lesson with me for future projects.

Integrating NOWPayments Was The Least Of Our Worries When Dealing With Restrictive Financial Systems

Lillian Dube — Tue, 02 Jun 2026 20:42:49 +0000

The Problem We Were Actually Solving

I was tasked with integrating a payment system into our digital goods store, which seemed straightforward enough, but the real challenge lay in the fact that our target audience was based in a country with restrictive financial regulations. As the systems architect, I had to navigate these restrictions while ensuring that our creators could get paid for their work. The payment system we chose was NOWPayments, but I soon realized that this was just one piece of the puzzle. The real problem was dealing with the complexities of international transactions and the restrictions imposed by local banks. I recall spending hours poring over documentation from our initial choice, Stripe, only to realize that their restrictions on transactions from our target country would have crippled our business model.

What We Tried First (And Why It Failed)

Our initial approach was to use a combination of Stripe and a local payment gateway to handle transactions. However, this approach quickly proved to be unworkable due to the high fees charged by the local gateway and the restrictions imposed by Stripe. We also tried using PayPal, but their fees were even higher and their support for our target country was limited. I spent weeks trying to get PayPal's system to work with our store, only to be met with error messages like "Transaction cannot be processed due to regulatory restrictions" and "Unsupported country or currency". It was clear that we needed a more robust solution that could handle the complexities of international transactions.

The Architecture Decision

After weeks of research and experimentation, I decided to use a combination of NOWPayments and a local cryptocurrency exchange to handle transactions. NOWPayments provided a flexible and customizable payment system that could handle a wide range of cryptocurrencies, while the local exchange provided a way to convert those cryptocurrencies into local currency. This approach allowed us to bypass the restrictive financial regulations and provide a seamless payment experience for our creators. I was skeptical at first, but the numbers later showed that this decision was a turning point for our business. We used tools like Apache Kafka to handle the high volume of transactions and Apache Cassandra to store the transaction data, which proved to be a good choice due to their scalability and fault-tolerance.
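
Whichever processor sits in this position, the store has to trust the payment notifications it receives before releasing goods or paying creators. The sketch below assumes the processor signs each callback with HMAC-SHA512 over the JSON payload serialized with sorted keys, using a shared secret. The exact canonical form, header name and field names vary by provider and must be checked against its documentation. The payload fields and function names here are hypothetical.

```python
import hashlib
import hmac
import json


def sign_payload(payload: dict, secret: str) -> str:
    """Hex HMAC-SHA512 of the payload serialized with sorted keys (assumed scheme)."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hmac.new(secret.encode(), canonical.encode(), hashlib.sha512).hexdigest()


def is_authentic(payload: dict, received_signature: str, secret: str) -> bool:
    """Compare signatures in constant time to avoid timing leaks."""
    expected = sign_payload(payload, secret)
    return hmac.compare_digest(expected, received_signature)
```

A handler would call is_authentic with the parsed callback body and the signature sent alongside it, and ignore any notification that fails the check.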

What The Numbers Said After

The results were impressive, with a 300% increase in successful transactions and a 25% decrease in fees. The average transaction time was reduced from 3 days to just a few hours, and the number of support requests related to payment issues decreased by 50%. Our creators were able to get paid for their work in a timely and efficient manner, and our store's revenue increased significantly as a result. I was able to monitor the system's performance using tools like Prometheus and Grafana, which provided valuable insights into the system's behavior and helped me identify areas for improvement. The error rate decreased from 20% to less than 1%, with the most common error being "Insufficient funds" rather than the previous "Transaction cannot be processed due to regulatory restrictions".

What I Would Do Differently

In hindsight, I would have explored more options for local payment gateways and exchanges before settling on our final solution. I would also have implemented more robust monitoring and logging from the outset, which would have helped us identify and resolve issues more quickly. Additionally, I would have worked more closely with our creators to understand their specific needs and pain points, so that we could tailor the solution to their requirements. Overall, though, I am pleased with the outcome and believe that our solution has provided a robust and scalable foundation for our digital goods store. One specific decision I would revisit is our choice of database: Apache Cassandra has served us well, but I wonder whether a graph database like Amazon Neptune would have been a better fit for our use case, given the complex relationships between our creators, their digital goods, and the transactions themselves.

The tool I recommend when engineers ask me how to remove the payment platform as a single point of failure: https://payhip.com/ref/dev1