- "It doesn't work at all, my children can't connect to the online class!"
- "We're very sorry about that it's an exceptional situation, we're working on the problem with our tech team and will keep you posted."
These are the kind of tweets & LinkedIn messages that exploded during those recent COVID times. Indeed, online services, especially meeting utilities or edtech programs, had to face particularly sustained usage, enduring an unplanned and lasting peak load. I've seen only a few able to maintain an acceptable quality of service (at least Slack and Zoom shined with their professionalism, from what I could see), most of the others having to put their clients in a connection queue, or even shut down their entire service for an endless "maintenance".
Even when we're understanding, when facing an online service availability problem we're very quick to switch. After all, why shouldn't we benefit from the plethora of offerings at our disposal? And after the crisis has passed, it's very likely that we'll see no reason to change again, no reason to go back to the fancy startup product we were using at the origin. For a startup, that's a lot of clients lost, maybe enough to never get traction back. A terrible death for a business, another virus victim.
But 🧐
- Couldn't those businesses have anticipated more?
- Couldn't they have had a stronger tech stack from the start, exceptional event or not?
- Once facing the problem, couldn't they have adapted better and faster?
The scaling maturity model
First, I want to introduce what I will call a company's "scaling maturity". Scaling is the art of adapting (automatically or not) your tech stack to handle the incoming demand. And to be fair, Zoom and Slack are far more mature companies than (for example) recently-born edtech startups.
Let's analyze them using the scaling maturity model.
1 - Nominal usage load: Slack and Zoom already had big traffic before; the peak represents a small percentage of their nominal traffic, whereas for a startup it may be 100 or 1000 times bigger than usual.
2 - Product maturity: They've had time to learn the specifics of their usage, the way their data is accessed, their systems' points of failure, etc.
3 - Technical skills: They probably have a bigger and more experienced tech team.
To summarize: they know a lot about how their product is used and where they are going, so they know where to focus and what kind of effort is needed to adapt to the extra demand. Plus, their current architecture can already sustain some load.
On the other side, young online services were simply overwhelmed, calling for help, desperately trying to find solutions to shard and replicate their existing relational databases (more on that later).
Let's take a hypothetical example: an edtech startup offering innovative online class services.
1 - Nominal usage load: Some clients like what they do, it's the future, and they believe in their capability to grow and offer more features over time. So the load is light for now, with steady growth expected; they've opted for a few cheap OVH servers.
2 - Product maturity: It's basically innovative education R&D, so the product is very young.
3 - Technical skills: Interns, probably young employees, sometimes founder-made prototypes. At this point, salaries weigh a lot...
Here's a preliminary answer to the first question: these businesses couldn't have anticipated more, and the smaller ones weren't even expected to... Indeed, if you want to design a product with "infinite" (or easy-to-add) scaling from scratch, it will cost you a lot of money and time. Two scarce resources you'd rationally prefer to invest in other things when you're a small business (finding product-market fit, adding features, growing, ...).
Covid-19 is a very good example of a black swan event: an event that is very rare, has massive implications, and for which, therefore, companies haven't planned.
Indeed, this point is pretty obvious. We'll call it the "bird pause" 😉
I've recently talked with several young edtech companies urgently looking for solutions. They were all facing the same issue: their relational database (MySQL or PostgreSQL) could not scale anymore. They were trying to add caches, spin up new replica nodes, and refactor their apps and services, but their final blocker was the database... The only remaining solution was to scale writes horizontally (to shard). So they were looking for some smart proxy solution to put in front, allowing the database to scale seamlessly. Not that straightforward though... And not that cheap.
And I'm not even talking about managing the migration in that context!
To give an idea of how hard it is to shard an existing database: the more complex and global your querying is (like cross-account data aggregation), the smarter and more expensive the front proxy will have to be. It goes from quite simple hash-distribution routing to a very complex (and hard-to-scale itself) query planner.
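To make that concrete, here is a minimal sketch (shard names and the `execute_on` / `merge` helpers are hypothetical) of the "simple" end of the spectrum, and of why global queries are where things get painful:

```python
import hashlib

# Hypothetical shard identifiers - in reality, connection strings to separate databases.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(account_id: str) -> str:
    """Stable hash-distribution routing: the same account always lands on the same shard."""
    digest = hashlib.md5(account_id.encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:4], "big") % len(SHARDS)]

def run_account_query(account_id: str, query: str):
    # Single-account query: one shard, one round-trip. This is the easy case.
    return execute_on(shard_for(account_id), query)  # execute_on is a hypothetical helper

def run_global_query(query: str):
    # Cross-account aggregation: scatter-gather over every shard, then merge the
    # partial results. Joins, sorting, and pagination across shards are what turn
    # a dumb proxy into a full-blown (and expensive) query planner.
    return merge(execute_on(shard, query) for shard in SHARDS)  # merge is hypothetical too
```

Note that this naive modulo also makes re-sharding painful: adding a fifth shard remaps most keys, which is why real proxies tend to use consistent hashing or lookup tables.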
In light of the scaling maturity model, it's quite clear we can't blame them for not having scaling mechanisms in place beforehand, but they could at least have planned better for their next scaling step.
Then... How to scale efficiently as a startup?
Let's first deconstruct what is to be "scaled":
- server capacity: costs will force you to size it correctly (CPU / RAM / storage). Servers can be physical, virtual, or containers.
- data access patterns: know your usage, avoid shared state and global querying, prefer immutability.
- human intervention & maintenance: "the more you automate, the faster you iterate" (GitHub, CI/CD, Terraform, ...).
- code refactoring: scaling often means pre-computing things, using smart caching, adding more synchronization; all of this has to be coded and maintained at some point (see the caching sketch after this list).
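As an illustration of the "smart caching" part, here is a minimal read-through cache sketch (class name and TTL value are mine, purely illustrative) that spares the database one round-trip per hit:

```python
import time

class ReadThroughCache:
    """Caches the result of an expensive loader (e.g. a DB query) for a given TTL."""

    def __init__(self, loader, ttl_seconds: float = 60.0):
        self.loader = loader        # function(key) -> value, hits the real datastore
        self.ttl = ttl_seconds
        self._store = {}            # key -> (value, expiry timestamp)

    def get(self, key):
        value, expiry = self._store.get(key, (None, 0.0))
        if time.monotonic() < expiry:
            return value            # cache hit: no database round-trip
        value = self.loader(key)    # cache miss: fetch fresh data and memoize it
        self._store[key] = (value, time.monotonic() + self.ttl)
        return value

# Usage: cached = ReadThroughCache(load_class_from_db, ttl_seconds=30),
# where load_class_from_db is your (hypothetical) database accessor.
```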
Note: if your product does not have any real traction yet, you probably shouldn't think about this at all for now, a good old MVC prototype with whatever tech you already know will do the job.
If you already have some usage and a first idea of where your business is going, then there are several possibilities. If you have a strong technical founder at your disposal, they probably already know what to do. Otherwise, especially if you did get some funding, it would be a good idea to get someone experienced enough to help you set up the infrastructure at the beginning.
Let's explore then two different strategies.
The infinite scaling strategy
Well, I prefer to debunk the myth immediately: it's theoretically possible to approach such an architecture, but it will be very expensive (in time and money, again things you don't really have as a startup), and possibly quite rigid.
The key here would be to use a lot of high-level managed services that run on giant cloud infrastructures. The services you choose should be 100% dynamic, by which I mean they scale transparently: you shouldn't have to manage any physical nodes directly. They should ideally include transparent replication (to scale reads) and sharding capabilities (to scale writes), and be able to be geographically dispatched across several regions of the planet.
Here are some managed service examples:
- Relational datastore: Google Cloud Spanner
- NoSQL real-time sync datastore: Google Firestore
- Replicated cache: AWS ElastiCache
- Distributed file storage: AWS S3
- Data streaming: Google Pub/Sub
- Data warehouse: Google BigQuery
Yes, that's a lot of GCP services, to be honest they are ahead..!
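To give a feel for what "no nodes to manage" means, here is a minimal Firestore sketch (collection and field names are hypothetical); replication and scaling happen entirely on Google's side:

```python
# pip install google-cloud-firestore
from google.cloud import firestore

db = firestore.Client()  # credentials are picked up from the environment

# Write a document - no capacity planning, no shard keys, no replica setup.
db.collection("online_classes").document("math-101").set({
    "teacher": "Ms. Doe",      # hypothetical sample data
    "connected_students": 28,
})

# Read it back; the code stays the same whether you store 10 or 10 million documents.
snapshot = db.collection("online_classes").document("math-101").get()
print(snapshot.to_dict())
```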
Actually, I have to nuance the cost curve a bit:
- Price is actually super low at the very beginning: services often offer a free plan for a limited volume or during a limited period.
- On the other end, costs can decrease when you reach a big volume, thanks to degressive pricing or provider-specific cost optimization techniques.
The step by step scaling strategy
This is the usual, and probably the most efficient, strategy: you make do with what you have at first (skills, people, ...), but you always stay one step ahead; you are conscious of what your system's points of failure are and how to resolve them, and you plan carefully for the next system migrations. This specifically requires:
- To develop a strong monitoring and alerting pipeline. You have plenty of tools to do that easily nowadays (a minimal instrumentation sketch follows this list).
- To test every migration beforehand. The more distributed the system, the harder its problems are to predict. It's easier to just test it.
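As a sketch of the monitoring side, here is what minimal instrumentation with the Prometheus Python client could look like (endpoint name and simulated latency are made up for the example):

```python
# pip install prometheus-client
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["endpoint"])
LATENCY = Histogram("http_request_latency_seconds", "Request latency", ["endpoint"])

def handle_request(endpoint: str) -> None:
    REQUESTS.labels(endpoint=endpoint).inc()
    with LATENCY.labels(endpoint=endpoint).time():   # records the duration on exit
        time.sleep(random.uniform(0.01, 0.1))        # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        handle_request("/join-class")
```

Once Prometheus scrapes these metrics, alerting on a latency spike or an error-rate jump is a few lines of configuration away.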
As examples, the next scaling step could be to:
- Shard the database or the streaming pipeline.
- Introduce automatic scaling of your HTTP services (sketched right after this list).
- Refactor your application to isolate your data in silos, introduce some kind of load balancing, and replicate parts of your system.
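For the auto-scaling item, here is a sketch using a boto3 target-tracking policy on a hypothetical EC2 Auto Scaling group named "web-workers"; the group then adds or removes instances on its own to keep average CPU around the target:

```python
# pip install boto3
import boto3

autoscaling = boto3.client("autoscaling")

# Target-tracking policy: keep the group's average CPU around 60%.
# "web-workers" is a hypothetical Auto Scaling group name.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-workers",
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,
    },
)
```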
The core principle is to be cost-efficient and smart: to scale efficiently, you have to scale in the right proportion and at the right time. Too soon and it will cost you too much; too late and you'll lose some quality of service... And this is why having good (automated) feedback on your current architecture is a real plus.
Cloud providers often offer pre-activated monitoring metrics. For example, if you use AWS Elastic Beanstalk environments for your web services, it automatically provides the following:
- requests count
- average latency
- load
- CPU
- network in/out
- ...
It's then super fast to visualize them in dashboards, enable alerting, or even activate auto-scaling based on CPU percentage or network usage... And you now have the tools to scale your web services correctly. After all, more knowledge at the beginning means better decisions in the end.
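To illustrate the alerting part, here is a sketch of creating a CloudWatch alarm with boto3 (the alarm name, threshold, and SNS topic ARN are hypothetical):

```python
# pip install boto3
import boto3

cloudwatch = boto3.client("cloudwatch")

# Fire when average CPU stays above 80% for two consecutive 5-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    # Hypothetical SNS topic that notifies the team (email, Slack, pager, ...).
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:ops-alerts"],
)
```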
Of course, smart scaling is specific scaling, so every case has a different story..! (databases, queues, pubsubs, data analytics, ...).
To conclude, the right solution probably lies somewhere between the two strategies. It depends on your budget, the technical complexity of your product, and the skills you have at your disposal...
At least I hope you have new ideas now about how to better anticipate scale!
Scale safe 👋