<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Miahlouge</title>
    <description>The latest articles on DEV Community by Miahlouge (@smiah).</description>
    <link>https://dev.to/smiah</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3045563%2F4606dc0e-69ab-488d-bacf-7f1b153e26cd.png</url>
      <title>DEV Community: Miahlouge</title>
      <link>https://dev.to/smiah</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/smiah"/>
    <language>en</language>
    <item>
      <title>Redundancy vs. Replication in a Distributed System</title>
      <dc:creator>Miahlouge</dc:creator>
      <pubDate>Tue, 22 Apr 2025 18:16:21 +0000</pubDate>
      <link>https://dev.to/smiah/redundancy-vs-replication-4phh</link>
      <guid>https://dev.to/smiah/redundancy-vs-replication-4phh</guid>
      <description>&lt;p&gt;𝗥𝗲𝗱𝘂𝗻𝗱𝗮𝗻𝗰𝘆 - backup systems to avoid downtime.&lt;br&gt;
𝗥𝗲𝗽𝗹𝗶𝗰𝗮𝘁𝗶𝗼𝗻 - backup data to avoid data loss.&lt;/p&gt;

&lt;p&gt;Redundancy and replication are both strategies for increasing the reliability and availability of systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Redundancy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Duplicating resources to ensure uninterrupted operation in case of failure.&lt;/p&gt;

&lt;p&gt;Redundancy is the practice of duplicating critical components or systems to increase reliability and availability. If one part fails, another seamlessly takes over. This applies to servers, databases, network devices, and even entire data centers.&lt;/p&gt;

&lt;p&gt;𝗕𝗲𝗻𝗲𝗳𝗶𝘁𝘀&lt;br&gt;
→ Prevents downtime by switching to backup systems.&lt;br&gt;
→ Keeps system running even during component failure.&lt;br&gt;
→ Enables recovery in case of hardware or regional outages.&lt;br&gt;
→ Often combined with load balancing to handle traffic across redundant systems.&lt;/p&gt;

&lt;p&gt;𝗪𝗵𝗲𝗻 𝗧𝗼 𝗨𝘀𝗲 𝗥𝗲𝗱𝘂𝗻𝗱𝗮𝗻𝗰𝘆&lt;br&gt;
• You’re building systems that must run 24/7.&lt;br&gt;
• User experience and trust are tightly tied to uptime (e.g. banking, trading).&lt;br&gt;
• Legal or SLA (Service Level Agreement) requirements demand it.&lt;br&gt;
• Preparation for hardware failures, data center outages, or regional disasters.&lt;br&gt;
• Maintenance without downtime ensures availability during updates or changes.&lt;/p&gt;

&lt;p&gt;𝗘𝘅𝗮𝗺𝗽𝗹𝗲: 𝗣𝗮𝘆𝗺𝗲𝗻𝘁 𝗦𝗲𝗿𝘃𝗶𝗰𝗲&lt;/p&gt;

&lt;p&gt;A payment platform has two payment processing servers in the same data center. If one server fails during a credit card transaction, the load balancer automatically reroutes the request to the backup server.&lt;/p&gt;
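&lt;p&gt;A minimal sketch of that failover routing (server names and the health-check callable are illustrative assumptions, not a real load balancer):&lt;/p&gt;

```python
import random

# Hypothetical sketch: route a payment request to any healthy server,
# so a failed server is transparently skipped.
def route_payment(servers, is_healthy):
    healthy = [s for s in servers if is_healthy(s)]
    if not healthy:
        raise RuntimeError("no payment server available")
    return random.choice(healthy)
```

&lt;p&gt;Real load balancers detect failures via periodic health checks rather than a per-request callable, but the rerouting idea is the same.&lt;/p&gt;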

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7f5w4fkgkrdno6myixkd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7f5w4fkgkrdno6myixkd.png" alt="Redundancy vs. Replication" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Replication&lt;/strong&gt;&lt;br&gt;
Creating multiple, identical copies of data or resources.&lt;/p&gt;

&lt;p&gt;Replication ensures that data exists in more than one place whether across databases, servers, or regions. It helps systems stay available even when part of the infrastructure becomes unreachable.&lt;/p&gt;

&lt;p&gt;𝗕𝗲𝗻𝗲𝗳𝗶𝘁𝘀&lt;/p&gt;

&lt;p&gt;→ Provides high availability of data, even during failures&lt;br&gt;
→ Improves read performance by allowing distributed access&lt;br&gt;
→ Supports disaster recovery and backup strategies&lt;br&gt;
→ Enables data locality for global applications&lt;br&gt;
→ Protects against data loss&lt;/p&gt;

&lt;p&gt;𝗪𝗵𝗲𝗻 𝗧𝗼 𝗨𝘀𝗲 𝗥𝗲𝗽𝗹𝗶𝗰𝗮𝘁𝗶𝗼𝗻&lt;/p&gt;

&lt;p&gt;• You need data access across multiple regions or data centers.&lt;br&gt;
• You want to scale reads across replicas.&lt;br&gt;
• You’re building fault-tolerant and distributed databases.&lt;br&gt;
• Regulatory requirements demand backups or geo-redundancy.&lt;br&gt;
• Real-time analytics or reporting systems require up-to-date data from production.&lt;/p&gt;

&lt;p&gt;𝗘𝘅𝗮𝗺𝗽𝗹𝗲: 𝗣𝗮𝘆𝗺𝗲𝗻𝘁 𝗦𝗲𝗿𝘃𝗶𝗰𝗲&lt;br&gt;
The platform stores user balances and transaction records in a database that is replicated across multiple regions (e.g., Frankfurt and Amsterdam). If the primary database becomes unavailable, the system can read from the replica.&lt;/p&gt;
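&lt;p&gt;A toy sketch of that read fallback (the two callables stand in for real database clients; names are hypothetical):&lt;/p&gt;

```python
# Illustrative sketch: read from the primary, fall back to a replica on failure.
def read_balance(user_id, primary, replica):
    try:
        return primary(user_id)
    except ConnectionError:
        # The replica may lag slightly behind the primary (eventual consistency).
        return replica(user_id)
```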

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Redundancy is like an insurance policy in system design: spare components that keep the system running when something fails. Replication, on the other hand, ensures that data stays accessible and consistent across systems, enhancing availability, performance, and disaster recovery.&lt;/p&gt;

</description>
      <category>redundancy</category>
      <category>replication</category>
      <category>distributedsystems</category>
      <category>miahlouge</category>
    </item>
    <item>
      <title>Rate Limiting in Distributed Systems</title>
      <dc:creator>Miahlouge</dc:creator>
      <pubDate>Wed, 16 Apr 2025 12:40:45 +0000</pubDate>
      <link>https://dev.to/smiah/rate-limiting-in-distributed-system-3h59</link>
      <guid>https://dev.to/smiah/rate-limiting-in-distributed-system-3h59</guid>
      <description>&lt;p&gt;Rate limiting, or throttling, is a mechanism that rejects a request when a specific quota is exceeded. It is a technique to control how many requests a client can make to a service over a given time window; when a client exceeds their quota, subsequent requests are rejected or delayed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7kf2kxw600emvqhtrv37.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7kf2kxw600emvqhtrv37.png" alt="Introduction" width="800" height="493"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Throttling and Rate Limiting patterns are key strategies for protecting services against overuse and abuse. They ensure fair resource allocation and maintain system stability under varying load conditions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Rate Limiting?&lt;/strong&gt;&lt;br&gt;
Rate limiting defines quotas on resource usage, typically measured by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Number of requests (e.g., 100 requests per minute)&lt;/li&gt;
&lt;li&gt;Data volume (e.g., 1MB per second)&lt;/li&gt;
&lt;li&gt;Concurrency (e.g., 10 parallel connections)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh93y9uzzay7ke7watax7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh93y9uzzay7ke7watax7.png" alt="Concurrent request limiters manage resource contention for CPU-intensive API endpoints." width="800" height="411"&gt;&lt;/a&gt;&lt;br&gt;
If a service allows 10 requests per second per API key, and a particular key makes 12 requests in one second, then 2 of those requests will be rejected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Implement Rate Limiting?&lt;/strong&gt;&lt;br&gt;
Rate limiting helps keep systems fair and available for everyone. By controlling how many requests a single user or bot can make, it prevents any one source from overwhelming the service and ensures that all users get a fair chance to access resources. It also adds a layer of security by blocking brute-force attacks and botnets that try to exploit the system through excessive or malicious requests.&lt;/p&gt;

&lt;p&gt;Attackers may mount such brute-force attacks from a single machine, or from many machines that together form a network of bots known as a botnet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Types of Rate Limiting&lt;/strong&gt;&lt;br&gt;
Rate limiting can be broadly categorized into two types: single-process rate limiting and distributed rate limiting. We will start with a single-process implementation and then extend it to a distributed one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single-Process Rate Limiting&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq4xvu8gfno8x083k2edb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq4xvu8gfno8x083k2edb.png" alt="Single-Process Rate Limiting" width="800" height="395"&gt;&lt;/a&gt;&lt;br&gt;
Tracking Timestamps&lt;br&gt;
We could store a list of timestamps for each API key and periodically clean out any that are older than the quota interval (e.g., 1 minute). But as the number of requests grows, this approach becomes memory intensive.&lt;/p&gt;

&lt;p&gt;Memory-Efficient Alternative: Bucketing&lt;br&gt;
A more scalable approach is to divide time into fixed intervals (buckets) such as one-minute buckets and use counters instead of raw timestamps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9ys3u786buryw4blzxw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9ys3u786buryw4blzxw.png" alt="Buckets divide time into 1-minute intervals, which keep track of the number of requests seen." width="800" height="386"&gt;&lt;/a&gt;&lt;br&gt;
How It Works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each incoming request is mapped to a time bucket based on its timestamp.&lt;/li&gt;
&lt;li&gt;The corresponding bucket’s counter is incremented.&lt;/li&gt;
&lt;li&gt;Only the relevant bucket windows are retained.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
If a request arrives at 12:00:18, it is counted under the bucket for 12:00.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4hjodt4kbtrya4760djz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4hjodt4kbtrya4760djz.png" alt="When a new request comes in, its timestamp is used to determine the bucket it belongs to." width="800" height="395"&gt;&lt;/a&gt;&lt;br&gt;
This method compresses request information efficiently, using constant space per API key, regardless of the number of requests.&lt;/p&gt;
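&lt;p&gt;A rough single-process sketch of bucketed counting (bucket width and the eviction policy are illustrative choices):&lt;/p&gt;

```python
from collections import defaultdict

BUCKET_SECONDS = 60  # 1-minute buckets

def bucket_of(timestamp):
    # Map a timestamp (in seconds) to the start of its 1-minute bucket.
    return int(timestamp) // BUCKET_SECONDS * BUCKET_SECONDS

class BucketCounter:
    def __init__(self):
        self.counts = defaultdict(int)  # bucket start -> request count

    def record(self, timestamp):
        bucket = bucket_of(timestamp)
        self.counts[bucket] += 1
        # Only the current and previous buckets matter for a 1-minute window,
        # so older buckets are evicted, keeping space constant per API key.
        cutoff = bucket - BUCKET_SECONDS
        for old in [b for b in self.counts if cutoff > b]:
            del self.counts[old]
```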

&lt;p&gt;Now, with this memory-efficient setup, how can we enforce rate limits?&lt;/p&gt;

&lt;p&gt;Enforcing Limits with a Sliding Window&lt;br&gt;
We can implement rate limiting using a sliding window that moves across buckets in real time, tracking the requests within it. The window's length aligns with the quota time unit, like 1 minute. However, because the sliding window can overlap with multiple buckets, we must calculate a weighted sum of bucket counters to determine the requests within the window.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvg9ia6xkpdwihj1wnnvh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvg9ia6xkpdwihj1wnnvh.png" alt="A bucket's weight proportional to its overlap with the sliding window." width="800" height="383"&gt;&lt;/a&gt;&lt;br&gt;
To compute this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Calculate a weighted sum of the overlapping bucket counters.&lt;/li&gt;
&lt;li&gt;Weight each bucket by how much of it falls within the sliding window.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1yi4f7txbzf105w9qjr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1yi4f7txbzf105w9qjr.png" alt="sliding window" width="800" height="381"&gt;&lt;/a&gt;&lt;br&gt;
This approximation improves with smaller buckets (e.g., 10-second intervals).&lt;/p&gt;
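&lt;p&gt;The weighted-sum check can be sketched like this (counters come from the bucketing step; the numbers are illustrative):&lt;/p&gt;

```python
def requests_in_window(now, counts, window=60):
    # counts maps bucket start -> request count; buckets are `window` seconds wide.
    current = int(now) // window * window
    previous = current - window
    # Weight the previous bucket by the fraction of it still inside the window.
    prev_weight = 1.0 - (now - current) / window
    return counts.get(current, 0) + prev_weight * counts.get(previous, 0)

def allow(now, counts, limit, window=60):
    # Reject once the (approximate) count in the trailing window reaches the limit.
    return limit > requests_in_window(now, counts, window)
```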

&lt;p&gt;&lt;strong&gt;Distributed Rate Limiting&lt;/strong&gt;&lt;br&gt;
Things get tricky in distributed systems with multiple servers, especially across regions. If each server has its own rate limiter, it can lead to two main issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inconsistent limits&lt;/li&gt;
&lt;li&gt;Race Conditions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When multiple processes handle requests, local rate limiting isn’t enough. We need shared state to coordinate limits globally across nodes. A shared data store is needed to track total requests per API key.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqlxkf9kwkvogvxv2y7r.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqlxkf9kwkvogvxv2y7r.jpg" alt="Servers batch bucket updates in memory for some time, and flush them asynchoronously to the data store at the end of it." width="800" height="398"&gt;&lt;/a&gt;&lt;br&gt;
Shared Data Store Approach&lt;br&gt;
A central data store (e.g., Redis, DynamoDB, Memcached) can store counters for each API key and bucket.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Store two counters per API key (current and previous buckets).&lt;/li&gt;
&lt;li&gt;Each request updates the counter via atomic operations like INCR.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bud9ardi8p16pr7xjzy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bud9ardi8p16pr7xjzy.png" alt="central data store" width="800" height="712"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Challenges &amp;amp; Optimizations&lt;br&gt;
Concurrency Issues&lt;br&gt;
Simultaneous updates from multiple nodes can lead to race conditions. Transactions solve this but are slow and resource-heavy; atomic operations (e.g., INCR, GETSET, compare-and-swap) offer a lighter-weight way to update counters safely.&lt;/p&gt;

&lt;p&gt;Race Condition Example:&lt;/p&gt;

&lt;p&gt;Without atomic operation&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F349561ky9v90gs62d6rw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F349561ky9v90gs62d6rw.png" alt="Without atomic operation" width="800" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With Atomic Operation (INCR in Redis)&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqzrpoiwrmvc7uwnnwepl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqzrpoiwrmvc7uwnnwepl.png" alt="With Atomic Operation" width="800" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Performance Bottlenecks&lt;br&gt;
Writing to the shared store on every request introduces latency and load.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz9resju0kww8d7nbl5ls.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz9resju0kww8d7nbl5ls.png" alt="Without Batching (Write on Every Request)." width="800" height="255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Batch updates in memory.&lt;/li&gt;
&lt;li&gt;Periodically flush them asynchronously to the shared store.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fezh0fhncavf1iejfwpds.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fezh0fhncavf1iejfwpds.png" alt="With Batching (Buffered Updates + Periodic Flush)." width="800" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Servers batch bucket updates in memory for some time, and flush them asynchronously to the data store at the end of it. This significantly reduces write frequency and improves throughput.&lt;/p&gt;
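&lt;p&gt;A minimal sketch of batched updates (the dict stands in for a shared store such as Redis; a background timer would call flush periodically):&lt;/p&gt;

```python
import threading
from collections import defaultdict

class BatchedFlusher:
    # Accumulate counter updates in memory and write them to the shared store
    # in one pass per flush, instead of one write per request.
    def __init__(self, store):
        self.store = store  # stand-in for a shared store such as Redis
        self.pending = defaultdict(int)
        self.lock = threading.Lock()

    def record(self, key):
        with self.lock:
            self.pending[key] += 1

    def flush(self):
        # In a real system, a background thread would call this on a timer.
        with self.lock:
            batch, self.pending = self.pending, defaultdict(int)
        for key, delta in batch.items():
            self.store[key] = self.store.get(key, 0) + delta
```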

&lt;p&gt;Store Downtime&lt;br&gt;
What if the central data store becomes unavailable?&lt;/p&gt;

&lt;p&gt;Enter the CAP theorem: you must choose between consistency and availability during network faults.&lt;/p&gt;

&lt;p&gt;Safer fallback:&lt;/p&gt;

&lt;p&gt;Continue to serve requests using the last known state from the data store.&lt;/p&gt;

&lt;p&gt;Avoid outright rejections due to temporary unreachability — especially if the alternative is degraded business operations.&lt;/p&gt;

&lt;p&gt;This compromise favors availability over strict consistency, which is often acceptable in real-world systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate Limiting Considerations&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Are you safely hooking rate limiters into your middleware stack?&lt;br&gt;
Make sure failures (like bugs or Redis downtime) don’t break your API—catch exceptions and let requests pass through if needed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Are you showing clear rate limit errors to users?&lt;br&gt;
Choose between HTTP 429 or 503 based on context, and return clear, actionable messages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Can you safely turn off the rate limiters if necessary?&lt;br&gt;
Use feature flags as escape valves and set up alerts to monitor how often limiters trigger.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Did you test each rate limiter in dark mode to see the impact?&lt;br&gt;
Ensure your limits keep the API stable without disrupting users. You might need to collaborate with users to adjust their usage patterns.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;When Not to Use Rate Limiting?&lt;/strong&gt;&lt;br&gt;
Rate limiting might not be necessary for internal services within a fully trusted environment where all clients are known, controlled, and operate at predictable loads. In such cases, adding rate limiting could introduce unnecessary complexity and latency without much benefit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Rate limiting and throttling protect systems from overuse and ensure fair access for all users. In distributed systems things get more complex: race conditions and added latency become real concerns. Bucketing, sliding windows, and atomic updates address correctness, while batching and asynchronous writes reduce load, keeping systems fast and reliable.&lt;/p&gt;

&lt;p&gt;𝗜𝗻𝘀𝗽𝗶𝗿𝗮𝘁𝗶𝗼𝗻𝘀 𝗮𝗻𝗱 𝗥𝗲𝗳𝗲𝗿𝗲𝗻𝗰𝗲𝘀&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Rate Limiting Pattern, Microsoft&lt;/li&gt;
&lt;li&gt;Scaling your API with rate limiters, Stripe&lt;/li&gt;
&lt;li&gt;Understanding Distributed Systems by Roberto Vitillo&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>distributedsystems</category>
      <category>ratelimiting</category>
      <category>miahlouge</category>
    </item>
    <item>
      <title>𝗖𝗼𝗺𝗯𝗶𝗻𝗲 𝗕𝗼𝘁𝗵: 𝗠𝗼𝘀𝘁 𝗦𝘂𝗰𝗰𝗲𝘀𝘀𝗳𝘂𝗹 𝗔𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗲𝘀 𝗗𝗼</title>
      <dc:creator>Miahlouge</dc:creator>
      <pubDate>Mon, 14 Apr 2025 14:53:07 +0000</pubDate>
      <link>https://dev.to/smiah/-5cn0</link>
      <guid>https://dev.to/smiah/-5cn0</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/smiah/streaming-vs-queuing-what-happens-if-you-choose-wrong-2h6e" class="crayons-story__hidden-navigation-link"&gt;Streaming vs Queuing: What Happens If You Choose Wrong?&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/smiah" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3045563%2F4606dc0e-69ab-488d-bacf-7f1b153e26cd.png" alt="smiah profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/smiah" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Miahlouge
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Miahlouge
                
              
              &lt;div id="story-author-preview-content-2402860" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/smiah" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3045563%2F4606dc0e-69ab-488d-bacf-7f1b153e26cd.png" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Miahlouge&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/smiah/streaming-vs-queuing-what-happens-if-you-choose-wrong-2h6e" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Apr 12 '25&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/smiah/streaming-vs-queuing-what-happens-if-you-choose-wrong-2h6e" id="article-link-2402860"&gt;
          Streaming vs Queuing: What Happens If You Choose Wrong?
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/streaming"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;streaming&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/queuing"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;queuing&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/eventdriven"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;eventdriven&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/miahlouge"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;miahlouge&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
            &lt;a href="https://dev.to/smiah/streaming-vs-queuing-what-happens-if-you-choose-wrong-2h6e#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            1 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
      <category>streaming</category>
      <category>queuing</category>
      <category>eventdriven</category>
    </item>
    <item>
      <title>Load isn’t the problem. Mismanaging it is.</title>
      <dc:creator>Miahlouge</dc:creator>
      <pubDate>Mon, 14 Apr 2025 14:52:15 +0000</pubDate>
      <link>https://dev.to/smiah/load-isnt-the-problem-mismanaging-it-is-3h5b</link>
      <guid>https://dev.to/smiah/load-isnt-the-problem-mismanaging-it-is-3h5b</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/smiah/load-balancing-vs-load-shedding-vs-load-leveling-48ff" class="crayons-story__hidden-navigation-link"&gt;Load Balancing vs Load Shedding vs Load Leveling&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/smiah" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3045563%2F4606dc0e-69ab-488d-bacf-7f1b153e26cd.png" alt="smiah profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/smiah" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Miahlouge
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Miahlouge
                
              
              &lt;div id="story-author-preview-content-2402841" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/smiah" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3045563%2F4606dc0e-69ab-488d-bacf-7f1b153e26cd.png" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Miahlouge&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/smiah/load-balancing-vs-load-shedding-vs-load-leveling-48ff" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Apr 12 '25&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/smiah/load-balancing-vs-load-shedding-vs-load-leveling-48ff" id="article-link-2402841"&gt;
          Load Balancing vs Load Shedding vs Load Leveling
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/softwareengineering"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;softwareengineering&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/distributedsystems"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;distributedsystems&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/highavailability"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;highavailability&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/miahlouge"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;miahlouge&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/smiah/load-balancing-vs-load-shedding-vs-load-leveling-48ff" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;1&lt;span class="hidden s:inline"&gt; reaction&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/smiah/load-balancing-vs-load-shedding-vs-load-leveling-48ff#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            2 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
      <category>softwareengineering</category>
      <category>distributedsystems</category>
      <category>highavailability</category>
    </item>
    <item>
      <title>What Is Database Partitioning?</title>
      <dc:creator>Miahlouge</dc:creator>
      <pubDate>Mon, 14 Apr 2025 05:05:40 +0000</pubDate>
      <link>https://dev.to/smiah/database-partitioning-29cc</link>
      <guid>https://dev.to/smiah/database-partitioning-29cc</guid>
      <description>&lt;p&gt;Partitioning is the process of dividing a large database table into smaller, more manageable pieces. Database partitioning can be broadly categorized into two types:&lt;/p&gt;

&lt;p&gt;𝗛𝗼𝗿𝗶𝘇𝗼𝗻𝘁𝗮𝗹 𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻𝗶𝗻𝗴 - It divides a large table by rows, distributing them across multiple storage nodes based on a partition key such as region (East, West, and South).&lt;/p&gt;

&lt;p&gt;𝗩𝗲𝗿𝘁𝗶𝗰𝗮𝗹 𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻𝗶𝗻𝗴 - It splits a table by columns, for example separating sensitive columns from core data based on access patterns.&lt;/p&gt;

&lt;p&gt;𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻𝗶𝗻𝗴 𝗞𝗲𝘆 𝗕𝗲𝗻𝗲𝗳𝗶𝘁𝘀&lt;br&gt;
→ Distributes data across multiple storage nodes for better scalability.&lt;br&gt;
→ Enhances data manageability by segmenting large datasets.&lt;br&gt;
→ Enables parallel query execution, improving performance.&lt;br&gt;
→ Optimizes physical data storage structure for efficient access.&lt;/p&gt;

&lt;p&gt;𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻𝗶𝗻𝗴 𝗗𝗶𝘀𝗮𝗱𝘃𝗮𝗻𝘁𝗮𝗴𝗲𝘀&lt;br&gt;
→ 𝗜𝗻𝗰𝗿𝗲𝗮𝘀𝗲𝗱 𝗰𝗼𝗺𝗽𝗹𝗲𝘅𝗶𝘁𝘆 in schema design and query logic.&lt;br&gt;
→ 𝗜𝗺𝗽𝗿𝗼𝗽𝗲𝗿 𝗽𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻𝗶𝗻𝗴 can cause data skew or hotspots.&lt;br&gt;
→ 𝗖𝗿𝗼𝘀𝘀-𝗽𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻 𝗾𝘂𝗲𝗿𝗶𝗲𝘀 may be slower.&lt;br&gt;
→ 𝗠𝗮𝗶𝗻𝘁𝗮𝗶𝗻𝗶𝗻𝗴 𝗰𝗼𝗻𝘀𝗶𝘀𝘁𝗲𝗻𝗰𝘆 across partitions is harder.&lt;/p&gt;
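As a toy illustration of horizontal partitioning, a region key can route rows to separate storage targets (the node dictionaries and row shape below are illustrative assumptions, not a real storage engine):

```python
# Route rows to storage nodes by a partition key (region) -- a toy
# illustration of horizontal partitioning, not a real storage engine.

PARTITIONS = {"East": [], "West": [], "South": []}

def insert_row(row):
    """Place a row in the partition matching its region key."""
    region = row["region"]
    if region not in PARTITIONS:
        raise ValueError("no partition for region: " + region)
    PARTITIONS[region].append(row)

def query_region(region):
    """A query filtered on the partition key touches one node only."""
    return PARTITIONS[region]

insert_row({"id": 1, "region": "East", "amount": 40})
insert_row({"id": 2, "region": "West", "amount": 25})
insert_row({"id": 3, "region": "East", "amount": 10})

print(len(query_region("East")))  # 2
```

Note how a query scoped to one region never touches the other partitions, which is where the scalability and parallelism benefits come from.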

</description>
      <category>miahlouge</category>
      <category>database</category>
      <category>partitioning</category>
    </item>
    <item>
      <title>What is Database Indexing?</title>
      <dc:creator>Miahlouge</dc:creator>
      <pubDate>Mon, 14 Apr 2025 05:01:21 +0000</pubDate>
      <link>https://dev.to/smiah/what-is-indexing-34dc</link>
      <guid>https://dev.to/smiah/what-is-indexing-34dc</guid>
      <description>&lt;p&gt;Indexing is a database optimization technique that creates specialized lookup structures, such as B-Trees, to speed up data retrieval without modifying the underlying data.&lt;/p&gt;

&lt;p&gt;𝗜𝗻𝗱𝗲𝘅𝗶𝗻𝗴 𝗞𝗲𝘆 𝗕𝗲𝗻𝗲𝗳𝗶𝘁𝘀&lt;br&gt;
→ Speeds up data retrieval by reducing search time.&lt;br&gt;
→ Enables efficient lookup structures without altering original data.&lt;br&gt;
→ Keeps the underlying data intact, ensuring data consistency.&lt;br&gt;
→ Optimizes query execution paths for faster access.&lt;/p&gt;

&lt;p&gt;𝗜𝗻𝗱𝗲𝘅𝗶𝗻𝗴 𝗗𝗶𝘀𝗮𝗱𝘃𝗮𝗻𝘁𝗮𝗴𝗲𝘀&lt;br&gt;
→ Increases storage requirements due to additional data structures.&lt;br&gt;
→ Slows down write operations (INSERT, UPDATE, DELETE).&lt;br&gt;
→ Over-indexing can hurt more than help.&lt;/p&gt;
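The idea can be sketched with a plain dictionary standing in for the lookup structure (a real database would use a B-Tree; the rows and email column here are made up for illustration):

```python
# A toy index: a dict mapping a column value to row positions, so lookups
# avoid scanning every row -- the same role a B-Tree index plays in a
# real database (a hash map is used here only for brevity).

rows = [
    {"id": 10, "email": "a@example.com"},
    {"id": 20, "email": "b@example.com"},
    {"id": 30, "email": "c@example.com"},
]

# Building the index costs extra storage and must be updated on writes --
# the trade-off listed under the disadvantages above.
email_index = {row["email"]: i for i, row in enumerate(rows)}

def find_by_email(email):
    """O(1) lookup via the index instead of an O(n) table scan."""
    i = email_index.get(email)
    return rows[i] if i is not None else None

print(find_by_email("b@example.com")["id"])  # 20
```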

</description>
      <category>database</category>
      <category>indexing</category>
      <category>miahlouge</category>
    </item>
    <item>
      <title>Distributed Transactions: 2PC vs 3PC vs Saga</title>
      <dc:creator>Miahlouge</dc:creator>
      <pubDate>Sun, 13 Apr 2025 12:41:42 +0000</pubDate>
      <link>https://dev.to/smiah/distributed-transactions-2pc-vs-3pc-vs-saga-5c34</link>
      <guid>https://dev.to/smiah/distributed-transactions-2pc-vs-3pc-vs-saga-5c34</guid>
      <description>&lt;p&gt;Distributed transactions are a complex topic in their own right; even seasoned professionals sometimes misunderstand them.&lt;/p&gt;

&lt;p&gt;▢ 𝟮𝗣𝗖 - atomic but blocking, commit or abort in two steps.&lt;br&gt;
▢ 𝟯𝗣𝗖 - splits commit into two, reduces blocking and handles partial failures.&lt;br&gt;
▢ 𝗦𝗮𝗴𝗮 - a sequence of local transactions that breaks a transaction into multiple steps.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwyul92nume59zqhn5qok.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwyul92nume59zqhn5qok.png" alt="𝟮-𝗣𝗵𝗮𝘀𝗲 𝗖𝗼𝗺𝗺𝗶𝘁" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
𝗪𝗵𝗲𝗻 𝘁𝗼 𝘂𝘀𝗲 𝟮-𝗣𝗵𝗮𝘀𝗲 𝗖𝗼𝗺𝗺𝗶𝘁&lt;br&gt;
→ When strict consistency is needed, all participants commit or abort.&lt;br&gt;
→ For simple, low-latency systems with minimal crash risk.&lt;/p&gt;
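A toy coordinator for the two-phase protocol just described might look like this (the participant class and in-memory vote flags are illustrative assumptions, not a real transaction manager):

```python
# Minimal 2-phase-commit sketch: every participant must vote yes in the
# prepare phase, otherwise the coordinator tells all of them to abort.

class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):  # phase 1: vote
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):   # phase 2: commit
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    if all(p.prepare() for p in participants):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"

debit = Participant("debit-account")
credit = Participant("credit-account", can_commit=False)
print(two_phase_commit([debit, credit]))  # aborted -- one vote was "no"
```

Because the coordinator blocks until every vote arrives, a crashed participant stalls the whole transaction, which is exactly the blocking weakness 3PC tries to address.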

&lt;p&gt;𝗘𝘅𝗮𝗺𝗽𝗹𝗲: 𝗕𝗮𝗻𝗸 𝗙𝘂𝗻𝗱 𝗧𝗿𝗮𝗻𝘀𝗳𝗲𝗿 𝗕𝗲𝘁𝘄𝗲𝗲𝗻 𝗔𝗰𝗰𝗼𝘂𝗻𝘁𝘀 𝗶𝗻 𝘁𝗵𝗲 𝗦𝗮𝗺𝗲 𝗕𝗮𝗻𝗸&lt;br&gt;
• Transfer involves debiting one account and crediting another.&lt;br&gt;
• If either fails, the entire transaction must roll back.&lt;br&gt;
• Atomicity is a must, and latency is acceptable.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnshugsqwobimlew0th1v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnshugsqwobimlew0th1v.png" alt="𝟯-𝗣𝗵𝗮𝘀𝗲 𝗖𝗼𝗺𝗺𝗶𝘁" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
𝗪𝗵𝗲𝗻 𝘁𝗼 𝘂𝘀𝗲 𝟯-𝗣𝗵𝗮𝘀𝗲 𝗖𝗼𝗺𝗺𝗶𝘁&lt;br&gt;
→ When minimizing blocking is key, and partial failures must be avoided.&lt;br&gt;
→ Where fault tolerance takes priority over message overhead and complexity.&lt;/p&gt;

&lt;p&gt;𝗘𝘅𝗮𝗺𝗽𝗹𝗲: 𝗖𝗿𝗼𝘀𝘀-𝗥𝗲𝗴𝗶𝗼𝗻 𝗟𝗲𝗱𝗴𝗲𝗿 𝗦𝘆𝗻𝗰𝗵𝗿𝗼𝗻𝗶𝘇𝗮𝘁𝗶𝗼𝗻&lt;br&gt;
Synchronizing transaction records between European and Asian data centers.&lt;br&gt;
• Each region prepares and pre-commits.&lt;br&gt;
• Final commit is sent when all regions are ready.&lt;br&gt;
• Handles network partition or coordinator crash more gracefully than 2PC.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fszgckolrhakc2dmjl11k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fszgckolrhakc2dmjl11k.png" alt="𝗦𝗮𝗴𝗮" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
𝗪𝗵𝗲𝗻 𝘁𝗼 𝘂𝘀𝗲 𝗦𝗮𝗴𝗮&lt;br&gt;
→ Performance matters more than strict consistency.&lt;br&gt;
→ Long-running distributed transactions where full rollback isn't practical.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxnnpim7ub511qwlps4kj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxnnpim7ub511qwlps4kj.png" alt="𝗦𝗮𝗴𝗮 𝗖𝗵𝗼𝗿𝗲𝗼𝗴𝗿𝗮𝗽𝗵𝘆" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxcxkulygletvnx3uczd7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxcxkulygletvnx3uczd7.png" alt="𝗦𝗮𝗴𝗮 𝗢𝗿𝗰𝗵𝗲𝘀𝘁𝗿𝗮𝘁𝗶𝗼𝗻" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
𝗘𝘅𝗮𝗺𝗽𝗹𝗲: 𝗜𝗻𝘁𝗲𝗿𝗻𝗮𝘁𝗶𝗼𝗻𝗮𝗹 𝗙𝘂𝗻𝗱 𝗧𝗿𝗮𝗻𝘀𝗳𝗲𝗿 (𝗦𝗪𝗜𝗙𝗧)&lt;br&gt;
Transferring funds from a bank in Germany to one in Singapore via SWIFT or similar clearing systems.&lt;br&gt;
• Debit sender → Local transaction &lt;br&gt;
• Notify intermediary → Local transaction&lt;br&gt;
• Credit receiver → Local transaction&lt;br&gt;
• If the final step fails, compensation (e.g., refund sender) is triggered.&lt;/p&gt;
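The SWIFT example above can be sketched as a saga with compensating actions (the step names follow the example, but the functions are illustrative stand-ins):

```python
# A saga as a sequence of local steps, each paired with a compensating
# action. If a step fails, completed steps are undone in reverse order.

log = []

def step(name, ok=True):
    def action():
        if not ok:
            raise RuntimeError(name + " failed")
        log.append(name)
    def compensate():
        log.append("undo " + name)
    return action, compensate

def run_saga(steps):
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
        return "completed"
    except RuntimeError:
        for compensate in reversed(done):  # roll back via compensation
            compensate()
        return "compensated"

saga = [step("debit sender"), step("notify intermediary"),
        step("credit receiver", ok=False)]
print(run_saga(saga))  # compensated
print(log)             # shows the two undo steps, newest first
```

Unlike 2PC, nothing blocks while waiting for votes; the price is that intermediate states are visible until compensation runs.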

</description>
      <category>distributedsystems</category>
      <category>distributedtransactions</category>
      <category>miahlouge</category>
    </item>
    <item>
      <title>Downstream Resiliency in Distributed System</title>
      <dc:creator>Miahlouge</dc:creator>
      <pubDate>Sun, 13 Apr 2025 01:23:47 +0000</pubDate>
      <link>https://dev.to/smiah/downstream-resiliency-of-distributed-system-12ml</link>
      <guid>https://dev.to/smiah/downstream-resiliency-of-distributed-system-12ml</guid>
      <description>&lt;p&gt;Downstream &lt;a href="https://dev.to/smiah/what-is-resiliency-engineering-3bjc"&gt;resiliency&lt;/a&gt; ensures that a component can continue to function correctly even if the components it relies on experience issues.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8msg50ua0bi8asj348ym.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8msg50ua0bi8asj348ym.jpg" alt="Downstream Resiliency" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;𝗧𝗶𝗺𝗲𝗼𝘂𝘁&lt;/strong&gt;&lt;br&gt;
Before we start, let’s answer the simple question: "Why timeout?".&lt;/p&gt;

&lt;p&gt;A successful response, even if it takes time, is better than a timeout error. Hmm… not always, it depends.&lt;/p&gt;

&lt;p&gt;When a network call is made, it’s best practice to configure a timeout. If the call is made without a timeout, there is a chance it will never return. Network calls that don’t return lead to resource leaks.&lt;/p&gt;

&lt;p&gt;Modern HTTP clients in ecosystems such as Java and .NET usually ship with default timeouts. For example, .NET Core's HttpClient has a default timeout of 100 seconds. However, some clients, such as Go's http.Client, have no default timeout for network requests. In such cases, it is best practice to configure one explicitly.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to configure timeout and not breach the SLA?
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxhmbhjsork89u80ml8l.jpg" alt="SLA" width="800" height="153"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Option 1&lt;/strong&gt;: Share Your Time Budget&lt;br&gt;
Divide your SLA between services, e.g., 500ms for Order Service and 500ms for Payment Service. This prevents SLA breaches but may cause false positive timeouts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option 2&lt;/strong&gt;: Use a TimeLimiter&lt;br&gt;
Wrap calls in a time limiter, setting a shared max timeout (e.g., 1s) while allowing flexibility (e.g., 700ms per service) to handle varying response times efficiently.&lt;/p&gt;
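One way to sketch this time-budget idea: fix a deadline once at the edge and give each downstream call only the time that remains (the 1-second SLA and the service stubs below are assumptions for illustration):

```python
# Deadline propagation sketch: a single request-wide deadline is set at
# the edge, and each downstream call receives only the remaining budget.

import time

def call_with_budget(fn, deadline):
    remaining = deadline - time.monotonic()
    if remaining > 0:
        return fn(timeout=remaining)  # pass the remaining budget downstream
    raise TimeoutError("time budget exhausted before the call")

def order_service(timeout):
    return "order-ok"    # a real client would apply this timeout

def payment_service(timeout):
    return "payment-ok"

deadline = time.monotonic() + 1.0        # 1s SLA for the whole request
r1 = call_with_budget(order_service, deadline)
r2 = call_with_budget(payment_service, deadline)
print(r1, r2)
```

A slow first call automatically shrinks the budget for the second, so the overall SLA is never exceeded by the sum of the parts.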

&lt;ul&gt;
&lt;li&gt;How do we determine a good timeout duration?
One way is to base it on the desired false timeout rate. For example, if 0.1% of downstream requests are allowed to time out, configure the timeout based on the 99.9th percentile of response time.&lt;/li&gt;
&lt;/ul&gt;
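The percentile-based approach can be sketched as follows (the nearest-rank method and the sample latencies are illustrative choices):

```python
# Pick a timeout from observed latencies: if 0.1% false timeouts are
# acceptable, use the 99.9th percentile of response times.

def percentile(samples, p):
    """Nearest-rank percentile (p in 0..100) of a list of numbers."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100.0 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [12, 15, 14, 18, 22, 16, 13, 250, 17, 19]  # fake samples
timeout_ms = percentile(latencies_ms, 99.9)
print(timeout_ms)  # 250 -- the slowest observed call dominates the tail
```

In production the samples would come from the monitoring described below, recomputed periodically as traffic patterns change.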

&lt;p&gt;Good monitoring tracks the entire lifecycle of a network call. Measure integration points carefully. This helps with debugging production issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;𝗥𝗲𝘁𝗿𝘆 𝗦𝘁𝗿𝗮𝘁𝗲𝗴𝗶𝗲𝘀&lt;/strong&gt;&lt;br&gt;
When a network request fails or times out, the client has two options: fail fast or retry the request. If the failure is temporary, retrying with backoff can resolve the issue. However, if the downstream service is overwhelmed, immediate retries can worsen the problem. To prevent this, retries should be delayed with progressively increasing intervals until either a maximum retry limit is reached or sufficient time has passed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1le6l80ll4iisazuw4t.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1le6l80ll4iisazuw4t.jpg" alt="𝗥𝗲𝘁𝗿𝘆 𝗦𝘁𝗿𝗮𝘁𝗲𝗴𝗶𝗲𝘀" width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This approach incorporates techniques such as Exponential Backoff, Cap, Random Jitter, and Retry Queue, ensuring the system remains resilient while avoiding additional strain on the downstream service.&lt;/p&gt;

&lt;p&gt;𝗘𝘅𝗽𝗼𝗻𝗲𝗻𝘁𝗶𝗮𝗹 𝗕𝗮𝗰𝗸𝗼𝗳𝗳&lt;br&gt;
Exponential backoff is a technique where the retry delay increases exponentially after each failure.&lt;/p&gt;

&lt;p&gt;backoff = backOffMin * (backOffFactor ^ attempt)&lt;/p&gt;

&lt;p&gt;For an initial backoff of 2 seconds and a backoff factor of 2:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1st retry: 2×2^1 = 4 seconds&lt;/li&gt;
&lt;li&gt;2nd retry: 2×2^2 = 8 seconds&lt;/li&gt;
&lt;li&gt;3rd retry: 2×2^3 = 16 seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means that after each failed attempt, the time to wait before retrying increases exponentially. Exponential backoff can cause multiple clients to retry simultaneously, leading to load spikes on the downstream service. To solve this, we can cap the maximum retry delay to prevent excessive waiting times.&lt;/p&gt;

&lt;p&gt;𝗖𝗮𝗽𝗽𝗲𝗱 𝗘𝘅𝗽𝗼𝗻𝗲𝗻𝘁𝗶𝗮𝗹 𝗕𝗮𝗰𝗸𝗼𝗳𝗳&lt;br&gt;
Capped exponential backoff builds upon exponential backoff by introducing a maximum limit (cap) for the retry delay. This prevents the delay from growing indefinitely while ensuring retries happen within a reasonable timeframe.&lt;/p&gt;

&lt;p&gt;backoff = backOffMin * (backOffFactor ^ attempt)&lt;/p&gt;

&lt;p&gt;However, the cap limits the maximum delay. For an initial backoff of 2 seconds, a backoff factor of 2, and a cap of 8 seconds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1st retry: 2×2^1 = 4 seconds&lt;/li&gt;
&lt;li&gt;2nd retry: 2×2^2 = 8 seconds&lt;/li&gt;
&lt;li&gt;3rd retry: min(2×2^3, 8) = 8 seconds (capped)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Capping the delay ensures retries don't extend indefinitely, striking a balance between efficiency and resilience.&lt;/p&gt;

&lt;p&gt;𝗥𝗮𝗻𝗱𝗼𝗺 𝗝𝗶𝘁𝘁𝗲𝗿 𝘄𝗶𝘁𝗵 𝗖𝗮𝗽𝗽𝗲𝗱 𝗘𝘅𝗽𝗼𝗻𝗲𝗻𝘁𝗶𝗮𝗹 𝗕𝗮𝗰𝗸𝗼𝗳𝗳&lt;br&gt;
This method enhances capped exponential backoff by adding randomness to the delay, preventing synchronized retries and reducing the risk of traffic spikes. Random jitter spreads out retry attempts over time, improving system stability.&lt;/p&gt;

&lt;p&gt;delay = random(0, min(cap, backOffMin * (backOffFactor ^ attempt)))&lt;/p&gt;

&lt;p&gt;For an initial backoff of 2 seconds, a backoff factor of 2, and a cap of 8 seconds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1st retry: Random value between 0 and 2×2^1 = 4 seconds&lt;/li&gt;
&lt;li&gt;2nd retry: Random value between 0 and 2×2^2 = 8 seconds&lt;/li&gt;
&lt;li&gt;3rd retry: Random value between 0 and min(2×2^3, 8) = 8 seconds (capped)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The addition of randomness avoids "retry storms," where multiple clients retry at the same time, and spreads out load more evenly to protect the downstream service.&lt;/p&gt;
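The three delay formulas from this section can be collected in a few lines (using the backOffMin = 2s, factor = 2, and cap = 8s values from the examples, with the retry attempt counted from 1 as in the formula):

```python
# Plain exponential backoff, a cap, then full jitter on top of the cap.

import random

BACKOFF_MIN, FACTOR, CAP = 2.0, 2.0, 8.0

def exponential(attempt):
    return BACKOFF_MIN * (FACTOR ** attempt)

def capped(attempt):
    return min(CAP, exponential(attempt))

def capped_with_jitter(attempt, rng=random.random):
    return rng() * capped(attempt)      # uniform in [0, capped delay)

print([capped(a) for a in (1, 2, 3)])   # [4.0, 8.0, 8.0]
```

Each client draws its own random delay, so a batch of failures spreads its retries across the whole window instead of stampeding at once.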

&lt;p&gt;&lt;strong&gt;𝗥𝗲𝘁𝗿𝘆 𝗔𝗺𝗽𝗹𝗶𝗳𝗶𝗰𝗮𝘁𝗶𝗼𝗻&lt;/strong&gt;&lt;br&gt;
Suppose a user request goes through a chain: the client calls Your Awesome Service, which calls Order Service, which then calls Payment Service. If the request from Order Service to Payment Service fails, should Order Service retry? Retrying could delay Your Awesome Service’s response, risking its timeout. If Your Awesome Service retries, the client might timeout too, amplifying retries across the chain. This can overload deeper services like Payment Service. For long chains, retrying at one level and failing fast elsewhere is often better.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmoy1qtz7s52euiw55w7x.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmoy1qtz7s52euiw55w7x.jpg" alt="𝗥𝗲𝘁𝗿𝘆 𝗔𝗺𝗽𝗹𝗶𝗳𝗶𝗰𝗮𝘁𝗶𝗼𝗻" width="800" height="127"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;𝗙𝗮𝗹𝗹𝗯𝗮𝗰𝗸 𝗣𝗹𝗮𝗻&lt;/strong&gt;&lt;br&gt;
Fallback plans act as a backup when retries fail. Imagine a courier who can’t deliver your package after trying once. Instead of repeatedly attempting the same thing, they switch to a "Plan B"—like leaving the package in front of the door, or at a nearby kiosk or post office. Similarly, in systems, this means using an alternative option, such as cached data or another provider, when the primary service isn’t working. The system then notifies users or logs the change, just like the courier leaving you a note or sending a text. This way, resources aren't wasted on endless retries, and the system remains resilient by relying on a practical backup solution.&lt;/p&gt;
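A minimal fallback sketch, assuming a cached last-known-good value as Plan B (the provider, the cache contents, and the currency pair are made up for illustration):

```python
# Fallback "Plan B": try the primary provider, and on failure serve a
# cached value instead of retrying forever.

cache = {"rate:USD-EUR": 0.91}   # last known good value

def primary_rate_provider(pair):
    raise ConnectionError("provider down")   # simulate an outage

def get_rate(pair):
    try:
        return primary_rate_provider(pair), "live"
    except ConnectionError:
        # Plan B: degrade gracefully and record which path served it.
        return cache[pair], "cached"

value, source = get_rate("rate:USD-EUR")
print(value, source)  # 0.91 cached
```

Tagging the result with its source is the systems equivalent of the courier's note: callers know they received degraded data.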

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4100obxhvp7xn6t9ok7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4100obxhvp7xn6t9ok7.jpg" alt="𝗙𝗮𝗹𝗹𝗯𝗮𝗰𝗸 𝗣𝗹𝗮𝗻" width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;𝗖𝗶𝗿𝗰𝘂𝗶𝘁 𝗕𝗿𝗲𝗮𝗸𝗲𝗿𝘀&lt;/strong&gt;&lt;br&gt;
When a downstream service fails persistently, retries slow down the caller and can spread slowness system-wide. A circuit breaker detects such failures, blocks requests to avoid slowdowns, and fails fast instead. It has three states: closed (passes calls, tracks failures), open (blocks calls), and half-open (tests recovery).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rs2vk7tbeqrovtz3n6o.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rs2vk7tbeqrovtz3n6o.jpg" alt="𝗖𝗶𝗿𝗰𝘂𝗶𝘁 𝗕𝗿𝗲𝗮𝗸𝗲𝗿𝘀" width="800" height="541"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If failures exceed a threshold, it opens; after a delay, it tests in half-open mode. Success closes it; failure reopens it. This protects the system, enabling graceful degradation for non-critical dependencies. Timing and thresholds depend on context and past data.&lt;/p&gt;
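A minimal circuit breaker with the three states described above might look like this (the threshold, reset delay, and explicit clock parameter are illustrative assumptions):

```python
# Circuit breaker sketch: closed (count failures), open (fail fast),
# half-open (allow one probe). Time is passed in explicitly for clarity.

class CircuitBreaker:
    def __init__(self, threshold=3, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.state = "closed"
        self.opened_at = None

    def call(self, fn, now):
        if self.state == "open":
            if now - self.opened_at >= self.reset_after:
                self.state = "half-open"       # allow one probe call
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.threshold:
                self.state = "open"
                self.opened_at = now
            raise
        self.failures = 0
        self.state = "closed"                  # success closes the breaker
        return result

breaker = CircuitBreaker(threshold=2, reset_after=30.0)

def flaky():
    raise IOError("downstream unavailable")

for t in (0, 1):                               # two failures trip it
    try:
        breaker.call(flaky, now=t)
    except IOError:
        pass
print(breaker.state)  # open
```

Once open, calls fail instantly without touching the downstream service; after `reset_after` seconds a single probe decides whether to close it again.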

&lt;p&gt;&lt;strong&gt;𝗖𝗼𝗻𝗰𝗹𝘂𝘀𝗶𝗼𝗻&lt;/strong&gt;&lt;br&gt;
Downstream resiliency is a critical aspect of Resiliency Engineering, ensuring components can adapt and recover gracefully from failures in dependent systems. By implementing effective strategies, systems can remain robust and reliable, even in the face of unforeseen disruptions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;𝗜𝗻𝘀𝗽𝗶𝗿𝗮𝘁𝗶𝗼𝗻𝘀 𝗮𝗻𝗱 𝗥𝗲𝗳𝗲𝗿𝗲𝗻𝗰𝗲𝘀&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.smiah.wiki/p/downstream-resiliency-in-distributed-systems" rel="noopener noreferrer"&gt;𝗗𝗼𝘄𝗻𝘀𝘁𝗿𝗲𝗮𝗺 𝗥𝗲𝘀𝗶𝗹𝗶𝗲𝗻𝗰𝘆&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://engineering.zalando.com/posts/2023/07/all-you-need-to-know-about-timeouts.html" rel="noopener noreferrer"&gt;All you need to know about timeouts&lt;/a&gt;, Zalando Engineering Blog&lt;/li&gt;
&lt;li&gt;Understanding Distributed Systems by Roberto Vitillo.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>distributedsystems</category>
      <category>performance</category>
      <category>microservices</category>
      <category>miahlouge</category>
    </item>
    <item>
      <title>Resiliency in Distributed Systems</title>
      <dc:creator>Miahlouge</dc:creator>
      <pubDate>Sun, 13 Apr 2025 00:49:43 +0000</pubDate>
      <link>https://dev.to/smiah/what-is-resiliency-engineering-3bjc</link>
      <guid>https://dev.to/smiah/what-is-resiliency-engineering-3bjc</guid>
      <description>&lt;p&gt;Resiliency Engineering is the practice of designing and building systems to achieve resiliency, ensuring they can handle failures, adapt to disruptions, and recover gracefully without major downtime.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Anything that can go wrong will go wrong.&lt;br&gt;
Murphy’s Law&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;𝗪𝗵𝗮𝘁 𝗶𝘀 𝗥𝗲𝘀𝗶𝗹𝗶𝗲𝗻𝗰𝘆?&lt;br&gt;
Before understanding Resiliency Engineering, it is necessary to understand what Resiliency is. Resiliency is an outcome, not a practice. It is the ability of a system to handle failures, adapt to disruptions, and maintain functionality under pressure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg6hapwatcddaqcpsnqre.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg6hapwatcddaqcpsnqre.jpeg" alt="𝗥𝗲𝘀𝗶𝗹𝗶𝗲𝗻𝗰𝘆 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;𝗪𝗵𝗮𝘁 𝗶𝘀 𝗥𝗲𝘀𝗶𝗹𝗶𝗲𝗻𝗰𝘆 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴?&lt;br&gt;
Resiliency Engineering is the practice of designing and building systems to achieve resiliency. It involves strategies like fault tolerance, redundancy, self-healing mechanisms, and failure recovery to ensure systems remain stable and reliable even in unpredictable conditions.&lt;/p&gt;

&lt;p&gt;𝗧𝘆𝗽𝗲𝘀 𝗼𝗳 𝗥𝗲𝘀𝗶𝗹𝗶𝗲𝗻𝗰𝘆 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴&lt;br&gt;
Resiliency engineering can be broadly categorized into three types: proactive, reactive, and adaptive resiliency.&lt;/p&gt;

&lt;p&gt;𝗣𝗿𝗼𝗮𝗰𝘁𝗶𝘃𝗲 𝗥𝗲𝘀𝗶𝗹𝗶𝗲𝗻𝗰𝘆&lt;br&gt;
Proactive resiliency prevents failures before they happen, keeping systems stable and reliable. It ensures smooth operations by distributing traffic, limiting overload, and maintaining backups. These techniques are collectively known as Upstream Resiliency.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Load Balancing, Load Shedding &amp;amp; Load Leveling – Distribute traffic efficiently and prevent overload.&lt;/li&gt;
&lt;li&gt;Throttling &amp;amp; Rate Limiting – Control excessive requests to maintain system stability.&lt;/li&gt;
&lt;li&gt;Chaos Engineering – Inject controlled failures to test and improve system resilience.&lt;/li&gt;
&lt;li&gt;Redundancy &amp;amp; Replication – Ensure backup systems are active to prevent downtime.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;𝗥𝗲𝗮𝗰𝘁𝗶𝘃𝗲 𝗥𝗲𝘀𝗶𝗹𝗶𝗲𝗻𝗰𝘆&lt;br&gt;
Reactive Resiliency ensures systems recover quickly with minimal impact when failures occur. These techniques are collectively known as Downstream Resiliency.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Timeout - Setting a timeout ensures operations don’t hang indefinitely.&lt;/li&gt;
&lt;li&gt;Retry Strategies &amp;amp; Retry Amplification – Reattempt failed operations with increasing delays to reduce strain and avoid simultaneous retries.&lt;/li&gt;
&lt;li&gt;Fallback Plan &amp;amp; Failover Mechanisms – Offer alternative flows and switch to backup systems seamlessly.&lt;/li&gt;
&lt;li&gt;Circuit Breakers – Prevent repeated failures from overwhelming services while avoiding unnecessary retries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;𝗔𝗱𝗮𝗽𝘁𝗶𝘃𝗲 𝗥𝗲𝘀𝗶𝗹𝗶𝗲𝗻𝗰𝘆&lt;br&gt;
Adaptive Resiliency bridges Upstream and Downstream Resiliency by learning from failures and continuously improving system resilience.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Observability &amp;amp; Monitoring – Track failures in real time for better insights.&lt;/li&gt;
&lt;li&gt;Chaos Engineering – Identify weaknesses and enhance system robustness.&lt;/li&gt;
&lt;li&gt;Automated Scaling – Dynamically adjust resources based on demand.&lt;/li&gt;
&lt;li&gt;Machine Learning &amp;amp; AI – Predict and prevent failures before they happen.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;𝗖𝗼𝗿𝗲 𝗖𝗼𝗻𝗰𝗲𝗽𝘁𝘀 𝗼𝗳 𝗥𝗲𝘀𝗶𝗹𝗶𝗲𝗻𝗰𝘆 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴&lt;br&gt;
Building resilient systems requires key principles that ensure systems can withstand failures, adapt to disruptions, and recover quickly. These core concepts provide the foundation for designing resilient architectures.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;To engineer resiliency, systems must be built with key principles:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fault Tolerance – The ability to operate even when components fail.&lt;/li&gt;
&lt;li&gt;Redundancy – Backup systems that take over in case of failure.&lt;/li&gt;
&lt;li&gt;Failover &amp;amp; Recovery – Mechanisms to switch to a working state quickly.&lt;/li&gt;
&lt;li&gt;Observability &amp;amp; Monitoring – Real-time insights into system health.&lt;/li&gt;
&lt;li&gt;Chaos Testing – Simulating failures to test system robustness.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;𝗖𝗼𝗻𝗰𝗹𝘂𝘀𝗶𝗼𝗻&lt;br&gt;
A truly resilient system integrates all three—proactively preventing failures, reacting gracefully when they occur, and continuously adapting to become stronger over time.&lt;/p&gt;

&lt;p&gt;𝗜𝗻𝘀𝗽𝗶𝗿𝗮𝘁𝗶𝗼𝗻𝘀 𝗮𝗻𝗱 𝗥𝗲𝗳𝗲𝗿𝗲𝗻𝗰𝗲𝘀&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.smiah.wiki/p/what-is-resilency-engineering" rel="noopener noreferrer"&gt;What is resiliency engineering?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://youtu.be/td4AIKYr16Q" rel="noopener noreferrer"&gt;Resiliency Engineering myth-busting&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>distributedsystems</category>
      <category>microservices</category>
      <category>miahlouge</category>
    </item>
    <item>
      <title>Database Optimization: Partitioning vs Indexing</title>
      <dc:creator>Miahlouge</dc:creator>
      <pubDate>Sun, 13 Apr 2025 00:33:36 +0000</pubDate>
      <link>https://dev.to/smiah/partitioning-vs-indexing-ddd</link>
      <guid>https://dev.to/smiah/partitioning-vs-indexing-ddd</guid>
      <description>&lt;p&gt;𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻𝗶𝗻𝗴 - Horizontal partitioning divides large tables across multiple storage nodes based on region, such as East, West, and South.&lt;/p&gt;

&lt;p&gt;Vertical partitioning, on the other hand, separates sensitive data from core data based on access patterns.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jafnjuxc9ui0rp3yr8m.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jafnjuxc9ui0rp3yr8m.jpg" alt="Partitioning" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;𝗜𝗻𝗱𝗲𝘅𝗶𝗻𝗴 - Creates specialized lookup structures (B-Trees).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faw0922lavnab9ofpgev9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faw0922lavnab9ofpgev9.jpg" alt="𝗜𝗻𝗱𝗲𝘅𝗶𝗻𝗴" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻𝗶𝗻𝗴 𝗞𝗲𝘆 𝗕𝗲𝗻𝗲𝗳𝗶𝘁𝘀&lt;br&gt;
→ Distributes data across multiple storage nodes for better scalability.&lt;br&gt;
→ Enhances data manageability by segmenting large datasets.&lt;br&gt;
→ Enables parallel query execution, improving performance.&lt;br&gt;
→ Optimizes physical data storage structure for efficient access.&lt;/p&gt;

&lt;p&gt;𝗜𝗻𝗱𝗲𝘅𝗶𝗻𝗴 𝗞𝗲𝘆 𝗕𝗲𝗻𝗲𝗳𝗶𝘁𝘀&lt;br&gt;
→ Speeds up data retrieval by reducing search time.&lt;br&gt;
→ Enables efficient lookup structures without altering original data.&lt;br&gt;
→ Keeps the underlying data intact, ensuring data consistency.&lt;br&gt;
→ Optimizes query execution paths for faster access.&lt;/p&gt;

</description>
      <category>database</category>
      <category>partitioning</category>
      <category>indexing</category>
      <category>miahlouge</category>
    </item>
    <item>
      <title>Streaming vs Queuing: What Happens If You Choose Wrong?</title>
      <dc:creator>Miahlouge</dc:creator>
      <pubDate>Sat, 12 Apr 2025 20:56:12 +0000</pubDate>
      <link>https://dev.to/smiah/streaming-vs-queuing-what-happens-if-you-choose-wrong-2h6e</link>
      <guid>https://dev.to/smiah/streaming-vs-queuing-what-happens-if-you-choose-wrong-2h6e</guid>
      <description>&lt;p&gt;𝗦𝘁𝗿𝗲𝗮𝗺𝗶𝗻𝗴 - a continuous flow of events, processed in real time.&lt;br&gt;
𝗤𝘂𝗲𝘂𝗶𝗻𝗴 - messages stored in a queue, processed sequentially.&lt;/p&gt;

&lt;p&gt;Choosing between streaming and queuing isn’t just about picking Kafka over RabbitMQ. It’s about making an architectural decision that will define how your system scales, evolves, and handles data over time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9g2ht1iw4b2twmhpx0hp.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9g2ht1iw4b2twmhpx0hp.jpeg" alt="Streaming vs Queuing" width="800" height="523"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Pick the wrong one, and you’ll feel the consequences for years.&lt;/p&gt;

&lt;p&gt;𝗪𝗵𝗲𝗻 𝗧𝗼 𝗖𝗵𝗼𝗼𝘀𝗲 𝗦𝗧𝗥𝗘𝗔𝗠𝗜𝗡𝗚:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need historical data replay (e.g., debugging or analytics).&lt;/li&gt;
&lt;li&gt;Your system requires event order guarantees (e.g., processing transactions sequentially).&lt;/li&gt;
&lt;li&gt;Multiple consumers need to read the same event independently.&lt;/li&gt;
&lt;li&gt;You’re handling high-throughput data flows that need efficient processing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;𝗪𝗵𝗲𝗻 𝗧𝗼 𝗖𝗵𝗼𝗼𝘀𝗲 𝗤𝗨𝗘𝗨𝗜𝗡𝗚:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need guaranteed task completion (e.g., order processing or background jobs).&lt;/li&gt;
&lt;li&gt;Each task must be processed by only one consumer (e.g., no need for data replay).&lt;/li&gt;
&lt;li&gt;The message should be consumed once and discarded after processing.&lt;/li&gt;
&lt;li&gt;Built-in retries and dead-letter queues offer automatic failure handling.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;𝗖𝗼𝗺𝗯𝗶𝗻𝗲 𝗕𝗼𝘁𝗵: 𝗠𝗼𝘀𝘁 𝗦𝘂𝗰𝗰𝗲𝘀𝘀𝗳𝘂𝗹 𝗔𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗲𝘀 𝗗𝗼&lt;br&gt;
The most powerful systems leverage both—streaming for real-time processing and queuing for task completion. An example might be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Real-time tracking systems (using streaming for events like user activity or sensor data).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Task-based systems (using queuing to ensure reliable processing, such as background jobs or transactional workflows).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;𝗞𝗲𝘆 𝗥𝗲𝗮𝘀𝗼𝗻𝘀 𝗳𝗼𝗿 𝗠𝗶𝘀𝘂𝘀𝗲&lt;br&gt;
 → Events ≠ Tasks – Queues handle tasks (e.g., payments), while streams handle continuous data (e.g., market prices).&lt;/p&gt;

&lt;p&gt;→ Latency Matters – Queues add delays; streams process in real-time.&lt;/p&gt;

&lt;p&gt;→ No Replay – Queues discard messages; streams allow reprocessing.&lt;/p&gt;

&lt;p&gt;→ Tooling Bias – Teams stick to familiar queues instead of streaming solutions like Kafka or Pulsar.&lt;/p&gt;

&lt;p&gt;This awesome diagram is by &lt;a class="mentioned-user" href="https://dev.to/boyney123"&gt;@boyney123&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>streaming</category>
      <category>queuing</category>
      <category>eventdriven</category>
      <category>miahlouge</category>
    </item>
    <item>
      <title>Load Balancing vs Load Shedding vs Load Leveling</title>
      <dc:creator>Miahlouge</dc:creator>
      <pubDate>Sat, 12 Apr 2025 20:42:18 +0000</pubDate>
      <link>https://dev.to/smiah/load-balancing-vs-load-shedding-vs-load-leveling-48ff</link>
      <guid>https://dev.to/smiah/load-balancing-vs-load-shedding-vs-load-leveling-48ff</guid>
      <description>&lt;p&gt;𝗟𝗼𝗮𝗱 𝗕𝗮𝗹𝗮𝗻𝗰𝗶𝗻𝗴 – spreads incoming traffic across nodes to avoid bottlenecks.&lt;br&gt;
𝗟𝗼𝗮𝗱 𝗟𝗲𝘃𝗲𝗹𝗶𝗻𝗴 – smooths out spikes by queuing work for later processing.&lt;br&gt;
𝗟𝗼𝗮𝗱 𝗦𝗵𝗲𝗱𝗱𝗶𝗻𝗴 – drops non-critical requests to keep the core alive.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjt5gimsfr0a89gft1nmq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjt5gimsfr0a89gft1nmq.png" alt="𝗟𝗼𝗮𝗱 𝗕𝗮𝗹𝗮𝗻𝗰𝗶𝗻𝗴 𝘃𝘀 𝗟𝗼𝗮𝗱 𝗦𝗵𝗲𝗱𝗱𝗶𝗻𝗴 𝘃𝘀 𝗟𝗼𝗮𝗱 𝗟𝗲𝘃𝗲𝗹𝗶𝗻𝗴" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pick the wrong one, and you’re either wasting resources or crashing hard.&lt;/p&gt;

&lt;p&gt;𝗪𝗵𝗲𝗻 𝗧𝗼 𝗨𝘀𝗲 𝗟𝗢𝗔𝗗 𝗕𝗔𝗟𝗔𝗡𝗖𝗜𝗡𝗚&lt;br&gt;
→ Evenly distribute requests to ensure consistent response times.&lt;br&gt;
→ Scale horizontally during peak hours (e.g. market open/close).&lt;br&gt;
→ Prevent single points of failure with smart routing.&lt;/p&gt;

&lt;p&gt;Distributes traffic across servers to prevent overload and keep performance steady. Smart routing avoids bottlenecks and single points of failure. Works best when resources are healthy and scalable.&lt;/p&gt;
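&lt;p&gt;The simplest form of this smart routing is round-robin. A minimal sketch, assuming healthy, equally sized backend nodes (the node names are placeholders):&lt;/p&gt;

```python
import itertools

# Minimal round-robin load balancer sketch: rotate through a fixed pool
# of nodes so each one receives an even share of the traffic.
class RoundRobinBalancer:
    def __init__(self, nodes):
        self.pool = itertools.cycle(nodes)  # endless rotation over the pool

    def route(self, request):
        return next(self.pool)              # pick the next node in turn


lb = RoundRobinBalancer(["node-1", "node-2", "node-3"])
print([lb.route(f"req-{i}") for i in range(6)])
# requests alternate node-1, node-2, node-3, node-1, ...
```

Real balancers layer health checks and weighting on top of this, but the core idea is the same: no single node takes the full load.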

&lt;p&gt;𝗪𝗵𝗲𝗻 𝗧𝗼 𝗨𝘀𝗲 𝗟𝗢𝗔𝗗 𝗟𝗘𝗩𝗘𝗟𝗜𝗡𝗚&lt;br&gt;
→ Request peaks that come in waves? Buffer them.&lt;br&gt;
→ Use queues to decouple services (e.g., order matching from settlement).&lt;br&gt;
→ Works when a little delay is acceptable for long-term system health.&lt;/p&gt;

&lt;p&gt;A messaging channel is set up between clients and the service. This channel helps manage the flow of requests, allowing the service to handle them at its own pace. But beware of queue growth and latency creep.&lt;/p&gt;
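&lt;p&gt;A sketch of that buffering channel, with a capacity cap to guard against the runaway queue growth just mentioned (names are illustrative, and a real system would use a broker rather than an in-memory deque):&lt;/p&gt;

```python
from collections import deque

# Load-leveling sketch: a bounded buffer sits between bursty producers
# and a slower service, which drains it at its own pace. The capacity
# cap keeps queue growth and latency from running away.
class LevelingBuffer:
    def __init__(self, capacity):
        self.queue = deque()
        self.capacity = capacity

    def submit(self, request):
        if len(self.queue) == self.capacity:
            return False            # full: caller must back off or retry
        self.queue.append(request)
        return True                 # spike absorbed, processed later

    def drain(self, batch_size):
        batch = []
        while self.queue and len(batch) != batch_size:
            batch.append(self.queue.popleft())  # service sets the pace
        return batch
```

The `submit` rejection path matters: without it, leveling silently degrades into an unbounded queue, which is exactly the failure mode called out below under where teams get it wrong.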

&lt;p&gt;𝗪𝗵𝗲𝗻 𝗧𝗼 𝗨𝘀𝗲 𝗟𝗢𝗔𝗗 𝗦𝗛𝗘𝗗𝗗𝗜𝗡𝗚&lt;br&gt;
→ Market crash? Order spike? Can't handle everything?&lt;br&gt;
→ Prioritize order execution over non-critical analytics or notifications.&lt;br&gt;
→ Drop 20% of traffic if it means saving the core 80%.&lt;/p&gt;

&lt;p&gt;A server can become overwhelmed with requests, leading to slow performance or even unavailability. To manage this, it can reject excess requests when it reaches its capacity. Better to shed load than go down completely.&lt;/p&gt;
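&lt;p&gt;A minimal admission-control sketch of that idea: once in-flight work hits capacity, non-critical requests are dropped while critical ones (order execution) still get through. The names and the critical/non-critical split are illustrative:&lt;/p&gt;

```python
# Load-shedding sketch: reject low-priority requests once in-flight
# work reaches capacity, so critical traffic keeps flowing.
class Shedder:
    def __init__(self, capacity):
        self.capacity = capacity
        self.in_flight = 0

    def admit(self, critical):
        if self.in_flight >= self.capacity and not critical:
            return False           # shed: drop the non-critical request
        self.in_flight += 1        # critical work may overshoot capacity
        return True

    def done(self):
        self.in_flight -= 1        # a request finished, free a slot


shed = Shedder(capacity=2)
print(shed.admit(critical=False))  # True
print(shed.admit(critical=False))  # True
print(shed.admit(critical=False))  # False - shed
print(shed.admit(critical=True))   # True - critical still gets through
```

That last line is the 80/20 trade-off from above in code: dropping the analytics request costs little, while the order still executes.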

&lt;p&gt;𝗪𝗵𝗲𝗿𝗲 𝗜𝘁 𝗔𝗹𝗹 𝗖𝗼𝗺𝗲𝘀 𝗧𝗼𝗴𝗲𝘁𝗵𝗲𝗿&lt;br&gt;
In a stock exchange system:&lt;br&gt;
 • Balance incoming traffic to keep nodes healthy.&lt;br&gt;
 • Shed non-critical updates during volatile spikes.&lt;br&gt;
 • Level settlement processing using queues to avoid crashes.&lt;/p&gt;

&lt;p&gt;𝗪𝗵𝗲𝗿𝗲 𝗧𝗲𝗮𝗺𝘀 𝗢𝗳𝘁𝗲𝗻 𝗚𝗲𝘁 𝗜𝘁 𝗪𝗿𝗼𝗻𝗴&lt;br&gt;
 • Relying only on balancing—when all nodes are overloaded, it fails.&lt;br&gt;
 • Over-shedding—leads to lost revenue and frustrated users.&lt;br&gt;
 • Leveling without limits—queues grow endlessly, latency explodes.&lt;/p&gt;

&lt;p&gt;Load isn’t the problem. Mismanaging it is.&lt;/p&gt;

</description>
      <category>softwareengineering</category>
      <category>distributedsystems</category>
      <category>highavailability</category>
      <category>miahlouge</category>
    </item>
  </channel>
</rss>
