Have you ever felt like you just needed another “You” to get everything done in time? That’s how we, at Carnot, felt when our customer base increased at a rate, way beyond our expectation. It was like taking on a sea swimming challenge just after finishing your first lessons with floaters.
For the backend team, it meant having to support 20 times more traffic without impacting customer experience in any way. It was a challenge, and it did not help the fact that our entire backend team consisted of only three people. But you know what they say — “No pressure, no diamonds!”
We learned a lot of empowering lessons from those demanding times. One such lesson was — Auto Scaling, which is arguably the most important cloud management tool.
In the early days, when we knew little about handling traffic peaks, we had a monolithic architecture. A socket server would allow our IoT devices to connect & send data, which was then queued in a Redis cache until it got processed by our main server.
As our clients increased, the entire cloud pipeline was affected by the slowest component. The Redis cache which was used only for queuing soon had a large queue and ran out of memory.
At that time, not upgrading the cache for better memory specification would have meant losing valuable customer data. Hence, we selected the Redis plan so as to support data retention in the queue even during the peak traffic time.
The system was stable with this. The only problem was that we were paying a lot more than what we were utilizing. On deeper analysis, we found that the so-called “peak traffic” would occur only during certain hours of the day. 75% of the time, the extra memory opted, was unused. But we were still paying for it.
Our top reasons to have a system - an auto-scaling feature on all our Redis instances were:
- Dynamic Load Management - Helps to cater to the dynamic load on your application. Scales in accordance to the load & demand.
- Avoid Data Drop - Helps to ensure timely scaling to avoid redis data drop. Scales prematurely by monitoring key metrics and mitigates data drop of new incoming information
- Avoid Downtime - Helps to avoid downtime on your application. Monitors key metrics to ensure your redis is always accessible.
- Cost Saving - Finally, saves cost by scaling down in low-load scenarios
Frankly, “auto-scaling” was a buzz word for us at that time. We had read a lot about it and understood that it helps to automatically detect your traffic & scale your cloud components. We knew that we absolutely needed this. But we could not find an easy plug and play solution for our scenario — Redis scaling on the Heroku platform.
So, we decided to do what we do best at Carnot — Start from basics and build your own in-house system
What we needed was a mechanism to
- Monitor the key metrics for Redis health check
- Detect incoming traffic (or load) on servers
- Create a cost-function to get the cheapest possible plan for current traffic
- Change the plan as detected by the cost-function
Well, that’s what we built for one Redis, then the other and another; until we created a standalone plug-and-play solution of our own — for any Redis on Heroku.
- Memory consumption
- Clients connected
We designed a metric collector that would connect to the concerned Redis and collect these metrics periodically using
INFO command. The frequency of metrics collection was kept configurable (with minimum of 30 sec) and could be selected as per the expected rate of traffic change. The metric collector would store these metrics in a connected database.
We also pre-defined scaling thresholds for the Redis that needed monitoring & scaling. The traffic detector would actually store these thresholds in database and would compare the metrics collected by metric collector against the thresholds to decide whether scaling was needed. Example - if a threshold of 80% is set on memory consumption, and current consumed memory has been above 80% for a while, "scale-up" decision would be given.
For our cost function, we had all the available Redis plans stored in database along with details like maximum offered memory, maximum offered client connections, cost, etc. To save cost, we needed to scale down the Redis without causing too frequent back-to-back "scale-up", "scale-down" decisions. This is why we added
- Certain wait period between consecutive scaling decisions
- Additional buffer on top of configured threshold for "scale-down" decision
- Check of memory consumption percentage on the lower plan before deciding to "scale-down"
Example - if a threshold of 80% is set on memory consumption, and the current consumed memory has been less than 75% of the memory offered by lower plan, "scale-down" action was decided.
The actual scaling was done using stored details of the Redis and the Heroku Platform APIs.
We have made this entire system open source. If you are facing a similar issue, feel free to set it up in your account and share your experience. Any issues/suggestions to improve the system are most welcome.
Follow below guidelines to quickly set this auto scaling module on your heroku account.
HEROKU_API_KEY: API key for your heroku account
EMAIL_HOST: Host name for your email
EMAIL_PORT: Port for your email
EMAIL_HOST_PASSWORD: Host password
EMAIL_HOST_USER: Host username
SERVER_EMAIL: Server email
RECIPIENTS: Comma separated emails of notification recipients
ENABLE_EMAILS: A global flag to enable / disable all types of mails
EMAIL_PREFIX: Email subject prefix for possible filter addition
N_RETENTION_DAYS: Number of days to retain metrics logs
initialize.sh is run from console.
This does 3 things:
- Applies all DB migrations
- Creates superuser for django admin panel
- Adds static info to database
python add_redis_mon.py from console
This will prompt you for all information related to redis instance for which auto scaling needs to be enabled.
Name: An identifier name for your redis insatnce. Can be alphanumeric
Heroku App Name: This is the name of heroku application to which your redis instance is attached
Redis Heroku Name: This is the URL name by which your heroku application identifies the redis. (Example: REDISCLOUD_ORANGE_URL or HEROKU_REDIS_BLANK_URL)
Metric Rate: Choose rate of metrics collection from given options. Select a faster rate only if your application demands a quick action.
Avg memory percentage: Choose the avg memory usage percentage you wish to maintain for you redis instance
Avg connection percentage: Choose the avg number of client connections you wish to maintain with the redis instance
Scaling Enable- If set, auto scaling action would be taken to maintain the avg metric percentages
Min Plan- Choose lower plan limit for scaling. If not set, lowest available plan will be selected
Max Plan- Choose upper plan limit for scaling. If not set, highest available plan will be selected
Scaling Rate- Choose rate at which scaling check should be performed
Threshold Breach- Email notification to configured recipients whenever any threshold is crossed
Scaling Success- Email notification whenever a successful scaling action is performed
Scaling Failure- Email notification whenever there is a failure in scaling action
To setup the project locally, follow this reference guide
While working so much with Redis caching, queuing and more specifically with Redis auto-scaling, we realized that Redis Cloud was the best suited solution for our used case.
With other service providers we observed one of these problems:
- Slowed plan update - the upgrade or downgrade of Redis plan would take a long time to execute compared to what our application could tolerate. This used to make Redis unreachable during command execution and would cause bottlenecks in our pipelines.
- Redis URL change - with some service providers, the Redis URL (creds) would change, and this would impact some of our internal applications that had used URL to connect to Redis. The Redis would not be accessible to these applications causing them to crash.
- Cost - many of the service providers that offered faster plan update were expensive
Overall, we found that Redis service by Redis Cloud would scale quickly while maintaining all the data. We ended up migrating all our caches to Redis Cloud
After sugar, spice & everything nice and an accidental chemical X, we are now capable of supporting over a million IoT nodes. We have definitely come a long way from those early times and learned a lot along the way.
As we grew, most of our systems moved away from Heroku (to AWS) in order to save costs. But, for any start-ups/developers out there who are dealing with around thousand client nodes and use Heroku as their preferred choice of platform (for its obvious ease of set-up), we hope this blog would be helpful.
We are trying to fix some broken benches in the Indian agriculture ecosystem through technology, to improve farmers’ income. If you share the same passion join us in the pursuit, or simply drop us a line on firstname.lastname@example.org
This post is in collaboration with Redis.