<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Asad Raheem</title>
    <description>The latest articles on DEV Community by Asad Raheem (@tehmas).</description>
    <link>https://dev.to/tehmas</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F230481%2Fb6535b39-e795-472c-a946-521a4a04c743.jpeg</url>
      <title>DEV Community: Asad Raheem</title>
      <link>https://dev.to/tehmas</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tehmas"/>
    <language>en</language>
    <item>
      <title>How I spent 10 months identifying &amp; reducing technical debt?</title>
      <dc:creator>Asad Raheem</dc:creator>
      <pubDate>Thu, 16 Jul 2020 17:46:05 +0000</pubDate>
      <link>https://dev.to/tehmas/how-i-spent-10-months-identifying-reducing-technical-debt-4e5l</link>
      <guid>https://dev.to/tehmas/how-i-spent-10-months-identifying-reducing-technical-debt-4e5l</guid>
      <description>&lt;p&gt;Edit: Oh, I forgot to mention. Following these simple approaches, I have been able to improve performance by up to 70% and increased scalability to a whole new level by utilizing event-driven architecture.&lt;/p&gt;

&lt;p&gt;It's October 2019, and I finally have time to start lowering the technical-debt ratio. But where do I start? How do I identify problems? How do I prioritize them? Which tenant's experience should I focus on?&lt;/p&gt;

&lt;h1&gt;Identification&lt;/h1&gt;

&lt;p&gt;The ultimate goal is to improve user experience by improving performance and scalability, so it makes sense to prioritize the features customers use most. One approach would be to conduct surveys, but that doesn't seem practical. Where can I get such data? Telemetry (Application Insights).&lt;/p&gt;

&lt;p&gt;Since my cloud platform (Azure) caps how much data can be queried through its portal, I wrote a Python script to download and process three months of data for tenants, both individually and collectively. I computed the 50th, 90th, and 99th percentiles and the average of each service's response time. After filtering the data with these stats, I prioritized and selected services for improvement.&lt;/p&gt;
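&lt;p&gt;&lt;em&gt;A minimal sketch of how the percentile step in such a Python script might look; the function name and the nearest-rank method are my own illustration, not taken from the original script:&lt;/em&gt;&lt;/p&gt;

```python
# Sketch: compute p50/p90/p99 and the mean of one service's response times
# from exported telemetry. Nearest-rank percentiles; names are illustrative.
import statistics

def response_time_stats(durations_ms):
    """Return (p50, p90, p99, mean) for a list of response times in ms."""
    ordered = sorted(durations_ms)

    def pct(p):
        # nearest-rank index into the sorted list
        k = max(0, min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1))))
        return ordered[k]

    return pct(50), pct(90), pct(99), statistics.mean(ordered)
```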

&lt;h1&gt;Analyzing &amp;amp; Improving&lt;/h1&gt;

&lt;p&gt;I analyzed each selected service along with its related services. Re-designing and revamping each service entirely was not practical, so I proceeded with the following approaches:&lt;/p&gt;

&lt;h2&gt;Reducing DB Round Trips&lt;/h2&gt;

&lt;p&gt;A database round trip is very costly. I minimized round trips by either fetching all required data in a single trip or serving it from Redis Cache.&lt;/p&gt;
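&lt;p&gt;&lt;em&gt;A sketch of the cache-aside idea, assuming StackExchange.Redis; the key name, TTL, and LoadRecentOrdersFromDb are illustrative placeholders:&lt;/em&gt;&lt;/p&gt;

```csharp
// Sketch: serve hot data from Redis so repeated requests avoid the database.
// The key name, TTL, and LoadRecentOrdersFromDb are illustrative placeholders.
using System;
using StackExchange.Redis;

public static class OrderCache
{
    static readonly ConnectionMultiplexer Redis =
        ConnectionMultiplexer.Connect("localhost");

    public static string GetRecentOrdersJson()
    {
        IDatabase db = Redis.GetDatabase();
        RedisValue cached = db.StringGet("orders:recent");
        if (cached.HasValue)
            return cached;                        // cache hit: no DB round trip

        // Cache miss: one query that fetches everything the caller needs.
        string json = LoadRecentOrdersFromDb();
        db.StringSet("orders:recent", json, TimeSpan.FromMinutes(10));
        return json;
    }

    static string LoadRecentOrdersFromDb()
    {
        return "[]";  // stand-in for the real single-round-trip query
    }
}
```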

&lt;h2&gt;Optimizing Queries&lt;/h2&gt;

&lt;p&gt;The ORM I was using generates reasonably optimized queries, but in some cases a query required a full table scan or fetched too much data. I introduced non-clustered indexes on tables that were read far more often than they were written, and split overly broad queries where that reduced the data fetched.&lt;/p&gt;
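&lt;p&gt;&lt;em&gt;For illustration, a non-clustered index on a read-heavy table might look like this in T-SQL; the table and column names are hypothetical:&lt;/em&gt;&lt;/p&gt;

```sql
-- Hypothetical read-heavy table: index the filter column and include the
-- columns the query selects, so a full scan becomes an index seek.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
ON dbo.Orders (CustomerId)
INCLUDE (Status, CreatedAt);
```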

&lt;h2&gt;Stored Procedures&lt;/h2&gt;

&lt;p&gt;Sometimes it's not possible to reduce database round trips when several tables need to be accessed or operated on. Although, from my perspective, stored procedures reduce maintainability, I used them in such situations, and slow services became high-performing ones. I also changed the parameters of existing stored procedures: instead of calling a stored procedure repeatedly for different IDs, I passed all the IDs in one go as a comma-separated string.&lt;/p&gt;
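&lt;p&gt;&lt;em&gt;A sketch of the comma-separated-IDs call; the procedure and parameter names are hypothetical:&lt;/em&gt;&lt;/p&gt;

```csharp
// Sketch: one stored-procedure call for many IDs instead of one call per ID.
// The procedure and parameter names are hypothetical.
using System.Data;
using System.Data.SqlClient;

public static class ItemRepository
{
    // Pure helper: builds a "3,17,42" style list the procedure splits server-side.
    public static string BuildIdList(int[] ids)
    {
        return string.Join(",", ids);
    }

    public static void LoadItems(SqlConnection conn, int[] ids)
    {
        using (SqlCommand cmd = new SqlCommand("usp_GetItemsByIds", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;
            cmd.Parameters.AddWithValue("@Ids", BuildIdList(ids));  // all IDs in one go
            using (SqlDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    // materialize each row here
                }
            }
        }
    }
}
```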

&lt;h2&gt;Sync to Async&lt;/h2&gt;

&lt;p&gt;Communicating synchronously with external resources blocks the thread. Wherever possible, I transitioned to asynchronous APIs. &lt;/p&gt;
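&lt;p&gt;&lt;em&gt;A minimal before/after sketch with HttpClient; the class and URL handling are illustrative:&lt;/em&gt;&lt;/p&gt;

```csharp
// Sketch: awaiting an external call instead of blocking on it.
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class StatusChecker
{
    static readonly HttpClient Client = new HttpClient();

    // Before: Client.GetStringAsync(url).Result blocks a thread-pool thread
    // for the whole call (and risks deadlocks in some contexts).
    // After: await releases the thread while the request is in flight.
    public static async Task PrintStatusAsync(string url)
    {
        string body = await Client.GetStringAsync(url);
        Console.WriteLine(body.Length);
    }
}
```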

&lt;h2&gt;Bulk Operations&lt;/h2&gt;

&lt;p&gt;I used a third-party library to bulk insert/update large numbers of records.&lt;/p&gt;
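&lt;p&gt;&lt;em&gt;The post doesn't name the library; SqlBulkCopy from the base class library illustrates the same idea:&lt;/em&gt;&lt;/p&gt;

```csharp
// Sketch: one streamed bulk insert instead of N individual INSERT round trips.
// (The post used an unnamed third-party library; SqlBulkCopy shows the technique.)
using System.Data;
using System.Data.SqlClient;

public static class BulkWriter
{
    public static void InsertAll(SqlConnection conn, DataTable rows)
    {
        using (SqlBulkCopy bulk = new SqlBulkCopy(conn))
        {
            bulk.DestinationTableName = rows.TableName;
            bulk.BatchSize = 5000;        // tuning value; adjust to the workload
            bulk.WriteToServer(rows);
        }
    }
}
```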

&lt;h2&gt;Breaking Services&lt;/h2&gt;

&lt;p&gt;Some services were returning too much unrelated data, which delayed the front-end application from rendering its components even though little of that data needed to be shown. I broke such services apart, reusing parts of the original response model to minimize the impact on the front-end application.&lt;/p&gt;

&lt;h2&gt;Concurrency Issues&lt;/h2&gt;

&lt;p&gt;It turns out customers sometimes use a feature under unexpected conditions. I needed to resolve concurrency issues without hurting scalability and performance, especially for power-user features. I used a serverless compute service (Function App) to handle this.&lt;/p&gt;

&lt;h2&gt;Event-Triggered Tasks&lt;/h2&gt;

&lt;p&gt;A user shouldn't necessarily have to wait for an entire service to complete. Sometimes a job is complex and unavoidably takes time because of the several stages involved. I broke such services into phases and used a serverless compute service (Function App) along with real-time messaging (Azure SignalR) so that the user doesn't have to wait on the same web page. This also reduced load on the server and improved scalability.&lt;/p&gt;
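&lt;p&gt;&lt;em&gt;A sketch of one phase as a queue-triggered Function; the function and queue names are hypothetical, and the SignalR notification is only indicated in a comment:&lt;/em&gt;&lt;/p&gt;

```csharp
// Sketch: one phase of a multi-stage job as a queue-triggered Azure Function.
// Function and queue names are hypothetical.
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class ReportPhaseOne
{
    [FunctionName("ReportPhaseOne")]
    public static void Run(
        [QueueTrigger("report-requests")] string requestJson,
        [Queue("report-phase-two")] out string nextPhaseMessage,
        ILogger log)
    {
        log.LogInformation("Phase 1 started");
        // ...do the first stage of the work here...
        nextPhaseMessage = requestJson;  // hand off to the next phase's queue
        // The final phase could push an Azure SignalR message so the user
        // is notified without waiting on the page.
    }
}
```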

&lt;h2&gt;Transition to New Technologies&lt;/h2&gt;

&lt;p&gt;The ORM I was using proved significantly slower than a newer one. I also needed newer capabilities such as dependency injection in Function Apps. Wherever it was easily possible, I transitioned to newer technologies for better performance.&lt;/p&gt;

&lt;h2&gt;Domain-Driven Design&lt;/h2&gt;

&lt;p&gt;Initializing a huge data-model context takes time, which is even more problematic on the Consumption plan because of cold starts. I started dividing the context domain-wise so that each one contains only a small set of related tables. This began a transition from a monolithic approach to a microservices one, which will require incremental steps to roll out across the entire product. Nonetheless, I now have a framework in place for doing so.&lt;/p&gt;

&lt;p&gt;I was also developing new features in the meantime. It was fun, and I'm still reducing technical debt whenever I get the chance. This experience has improved my API design and coding skills.&lt;/p&gt;

&lt;p&gt;I hope this helps.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>api</category>
      <category>beginners</category>
      <category>performance</category>
    </item>
    <item>
      <title>Why Redis Cache times out in Azure Function App on Consumption Plan? - A Journey</title>
      <dc:creator>Asad Raheem</dc:creator>
      <pubDate>Tue, 14 Jul 2020 16:08:06 +0000</pubDate>
      <link>https://dev.to/tehmas/why-redis-cache-times-out-in-azure-function-app-on-consumption-plan-a-journey-4o4j</link>
      <guid>https://dev.to/tehmas/why-redis-cache-times-out-in-azure-function-app-on-consumption-plan-a-journey-4o4j</guid>
      <description>&lt;p&gt;I decided to move a power-user feature to an Azure Function App. Redis Cache was extensively being used. In a controlled environment, it resulted in better scalability and performance.&lt;/p&gt;

&lt;p&gt;The problem?&lt;/p&gt;

&lt;p&gt;Redis time-out exceptions were being thrown in production. Always? No. Sometimes? Yes, and that was an even bigger problem because it made the root cause difficult to trace.&lt;/p&gt;

&lt;p&gt;I was following the approach mentioned in Microsoft &lt;a href="https://docs.microsoft.com/en-us/azure/azure-cache-for-redis/cache-dotnet-how-to-use-azure-redis-cache#connect-to-the-cache"&gt;documentation&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;Lazy&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ConnectionMultiplexer&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;lazyConnection&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;Lazy&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ConnectionMultiplexer&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(()&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;cacheConnection&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ConfigurationManager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AppSettings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"CacheConnection"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;ToString&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ConnectionMultiplexer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cacheConnection&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;ConnectionMultiplexer&lt;/span&gt; &lt;span class="n"&gt;Connection&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;get&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;lazyConnection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;First Hunch&lt;/h1&gt;

&lt;p&gt;Redis server load might have exceeded the plan's capacity. To my surprise, that was not the case: Redis hardly ever exceeded 10% server load.&lt;/p&gt;

&lt;h1&gt;Second Hunch&lt;/h1&gt;

&lt;p&gt;The Redis server is &lt;a href="https://docs.microsoft.com/en-us/azure/azure-cache-for-redis/cache-faq#prior-cache-offering-faqs"&gt;single-threaded&lt;/a&gt;, so an object in the cache might be too large.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Avoid using certain Redis commands that take a long time to complete, unless you fully understand the impact of these commands. For example, do not run the KEYS command in production. Depending on the number of keys, it could take a long time to return. Redis is a single-threaded server and it processes commands one at a time. If you have other commands issued after KEYS, they will not be processed until Redis processes the KEYS command. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That was also not the case.&lt;/p&gt;

&lt;h1&gt;Third Hunch&lt;/h1&gt;

&lt;p&gt;Another feature synchronously accessing Redis for a large object might have been causing the issue, but that just didn't make sense: such features weren't used frequently.&lt;/p&gt;

&lt;h1&gt;Fourth Hunch&lt;/h1&gt;

&lt;p&gt;Noisy neighbors. I was using the Azure Redis Cache Standard tier C0 plan, and it turns out C0 plans &lt;a href="https://docs.microsoft.com/en-us/azure/azure-cache-for-redis/cache-best-practices"&gt;aren't meant for production use&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The Basic tier is a single node system with no data replication and no SLA. Also, use at least a C1 cache. C0 caches are meant for simple dev/test scenarios since they have a shared CPU core, little memory, and are prone to "noisy neighbor" issues.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I upgraded the plan and waited patiently. The issue still wasn't resolved.&lt;/p&gt;

&lt;h1&gt;Time for Experimentation&lt;/h1&gt;

&lt;p&gt;I built a test harness that generated a large number of asynchronous requests accessing Redis Cache through the same lazy-initialization pattern.&lt;/p&gt;
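&lt;p&gt;&lt;em&gt;Roughly, the harness did something like this; the names and iteration count are illustrative, and RedisStore.Connection stands for the lazy-initialized property shown earlier:&lt;/em&gt;&lt;/p&gt;

```csharp
// Sketch: many concurrent accesses, mimicking a cold-start burst of queue
// messages. RedisStore.Connection stands for the lazy-initialized property
// shown earlier; the class name and iteration count are illustrative.
using System.Threading.Tasks;

public static class Harness
{
    public static void Run()
    {
        Parallel.For(0, 500, i =>
        {
            var db = RedisStore.Connection.GetDatabase();
            db.StringIncrement("load-test-counter");
        });
    }
}
```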

&lt;p&gt;Voila! The much-awaited timeout finally occurred on my local system. It happened when multiple threads were trying to access the cache: because of the lazy-loading pattern above, every request was trying to initialize the cache connection concurrently. According to the &lt;a href="https://docs.microsoft.com/en-us/dotnet/api/system.threading.lazythreadsafetymode?view=netcore-3.1"&gt;documentation&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The Lazy instance is not thread safe; if the instance is accessed from multiple threads, its behavior is undefined. Use this mode only when high performance is crucial and the Lazy instance is guaranteed never to be initialized from more than one thread. If you use a Lazy constructor that specifies an initialization method (valueFactory parameter), and if that initialization method throws an exception (or fails to handle an exception) the first time you call the Value property, then the exception is cached and thrown again on subsequent calls to the Value property.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But how was this occurring in production? The answer: the Consumption plan.&lt;/p&gt;

&lt;p&gt;A function app on the Consumption plan is not always running. The Redis connection was initialized whenever the function was triggered by an Azure Storage Queue message, and the problem occurred when the function app received a burst of messages while it wasn't already running or was scaling out.&lt;/p&gt;

&lt;h1&gt;Solution&lt;/h1&gt;

&lt;p&gt;Pass a &lt;code&gt;LazyThreadSafetyMode&lt;/code&gt; value to the constructor. Yes, that's it. Besides &lt;code&gt;None&lt;/code&gt;, there are two options: &lt;code&gt;PublicationOnly&lt;/code&gt; and &lt;code&gt;ExecutionAndPublication&lt;/code&gt;. For my use case, I needed &lt;code&gt;PublicationOnly&lt;/code&gt;, as described in the &lt;a href="https://docs.microsoft.com/en-us/dotnet/api/system.threading.lazythreadsafetymode?view=netcore-3.1"&gt;documentation&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When multiple threads try to initialize a Lazy instance simultaneously, all threads are allowed to run the initialization method (or the parameterless constructor, if there is no initialization method). The first thread to complete initialization sets the value of the Lazy instance. That value is returned to any other threads that were simultaneously running the initialization method, unless the initialization method throws exceptions on those threads. Any instances of T that were created by the competing threads are discarded.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;Lazy&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ConnectionMultiplexer&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;lazyConnection&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;Lazy&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ConnectionMultiplexer&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(()&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;cacheConnection&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ConfigurationManager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AppSettings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"CacheConnection"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;ToString&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ConnectionMultiplexer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cacheConnection&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;LazyThreadSafetyMode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PublicationOnly&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;ConnectionMultiplexer&lt;/span&gt; &lt;span class="n"&gt;Connection&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;get&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;lazyConnection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fix itself was simple, but figuring out the exact conditions in the production environment was difficult.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: In the code snippets above, &lt;code&gt;ConfigurationManager&lt;/code&gt; is used to access app settings; I kept it to stay consistent with the documentation. Since Azure Functions v2, &lt;code&gt;Environment.GetEnvironmentVariable&lt;/code&gt; should be used.&lt;/em&gt;&lt;/p&gt;
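&lt;p&gt;&lt;em&gt;The equivalent settings read on Functions v2+ would look like this (the wrapper class is my own illustration):&lt;/em&gt;&lt;/p&gt;

```csharp
// On Azure Functions v2+, app settings surface as environment variables.
// The wrapper class is illustrative; only the Environment call matters.
using System;

public static class Settings
{
    public static string CacheConnection()
    {
        return Environment.GetEnvironmentVariable("CacheConnection");
    }
}
```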

&lt;p&gt;I hope this helps.&lt;/p&gt;

</description>
      <category>azure</category>
      <category>redis</category>
      <category>serverless</category>
      <category>dotnet</category>
    </item>
  </channel>
</rss>
