We've been building our product for the better part of two years now. The game-changer for me personally has been building observability in right from the start.
We instrument almost everything, so we can ask questions about our system later. This includes collecting events from both our client apps and server APIs. Some real questions we've asked along the way:
- Is it just me, or did our mobile app's boot time just go through the roof? (cause: third-party API acting up)
- Why is the P99 latency of our notification API so high? (cause: data doesn't fit into Postgres' in-memory cache anymore)
- Why is our overall API latency high sporadically? (cause: noisy neighbors on Heroku)
Being able to ask and answer these questions easily means we can fix things quickly and prioritize technical improvements before things get out of control.
For example, since our team is small, we've been using Heroku to keep operations relatively simple. Heroku has been a great choice for us so far, but our observability tooling also reveals the shortcomings of a shared environment. In the screenshot below, we see how a single dyno (i.e., a process on Heroku) runs amok before it's terminated and replaced by a new process:
Still, Heroku is good enough for us for now, and if we decide to jump ship, we have the data to back the decision up.
Which tools you end up using is another matter, but we've been happy paying customers of Honeycomb since November 2018.