More and more companies are shifting to cloud-native, microservices-based architectures. Having an application monitoring tool is critical in this world, because you can't just log into a machine and figure out what's going wrong.
We have spent the last couple of years learning about application monitoring and observability, and about the key features an observability tool needs to enable fast resolution of issues.
In our opinion, a good observability tool should have:
- Out of the box application metrics
- A way to go from metrics to traces to find why an issue is happening
- Seamless flow between metrics, traces & logs — the three pillars of observability
- Filtering of traces based on tags and other filters
- Ability to set dynamic thresholds for alerts
- Transparency in pricing
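To make the "dynamic thresholds" point concrete: instead of alerting on a fixed number, the threshold is derived from recent data, so it adapts as traffic patterns change. A minimal sketch of one common approach (mean plus a few standard deviations); the function names are ours for illustration, not any tool's API:

```python
import statistics

def dynamic_threshold(history, sigma=3.0):
    """Alert threshold derived from recent samples: mean + sigma * stdev.

    `history` is a list of recent metric values (e.g. p99 latency per minute).
    """
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    return mean + sigma * stdev

def should_alert(value, history, sigma=3.0):
    """Fire only when the new value exceeds the data-derived threshold."""
    return value > dynamic_threshold(history, sigma)
```

With a flat history of `[100] * 10`, the threshold is 100, so 101 alerts and 99 does not; a noisier history widens the band automatically.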
We found that though there are open-source tools like Prometheus and Jaeger, they don't provide the polished user experience that SaaS products do. It takes a lot of time and effort to get them working, figure out long-term storage, and so on. And if you want correlated metrics and traces, that doesn't work out of the box, as Prometheus metrics and Jaeger traces use different formats.
SaaS tools like DataDog and NewRelic do a much better job at many of these aspects:
- They are easy to set up and get started with
- Provide out-of-box application metrics
- Provide correlation between metrics and traces
But they have the following issues:
- Crazy node-based pricing, which doesn't make sense for today's microservices architectures. Any node that is live for more than 8 hours in a month is charged, so this is unsuitable for spiky workloads
- Very costly: custom metrics alone cost $5 per 100 metrics
- They are cloud-only, so not suitable for companies that have concerns about sending data outside their infra
- For any small feature, you are dependent on their roadmap. We think this is an unnecessary restriction for a product used by developers. A product used by developers should be extensible
To fill this gap we built SigNoz, an open-source alternative to DataDog.
Some of the key features which make us vastly superior to current open-source products:
Get p90 and p99 latencies, requests per second (RPS), error rates, and top endpoints for a service out of the box.
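For readers newer to these terms: p90 and p99 are latency percentiles over a window of requests, i.e. the value that 90% (or 99%) of requests stay under. A minimal nearest-rank sketch of the computation (illustrative only; SigNoz computes these aggregates for you):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample value that is >= p% of samples."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

# Hypothetical request latencies for one service, in milliseconds.
latencies_ms = [12, 15, 20, 22, 30, 45, 80, 120, 300, 950]
p90 = percentile(latencies_ms, 90)  # -> 300
p99 = percentile(latencies_ms, 99)  # -> 950
```

Note how the p99 is dominated by the single slowest request; this is why percentiles, not averages, are what you want to watch for user-facing latency.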
Found something suspicious in a metric? Just click that point in the graph and get details of the traces that may be causing the issue. Seamless and intuitive.
For example, you can find the latency experienced by customers who have `customer_type` set to `premium`.
Create custom metrics from filtered traces to get metrics for any type of request. Want the p99 latency of `customer_type: premium` requests that are seeing `status_code: 400`? Just set the filters, and you have the graph. Boom!
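Conceptually, a custom metric from filtered traces is just: select the spans whose tags match your filters, then aggregate a field of the survivors. A small sketch with hypothetical span records (our illustration, not SigNoz's data model or API):

```python
# Hypothetical span records; a real trace store holds much richer data.
spans = [
    {"tags": {"customer_type": "premium", "status_code": 400}, "duration_ms": 130},
    {"tags": {"customer_type": "premium", "status_code": 200}, "duration_ms": 40},
    {"tags": {"customer_type": "free", "status_code": 400}, "duration_ms": 55},
]

def filter_spans(spans, **tag_filters):
    """Keep only spans whose tags match every key/value filter."""
    return [
        s for s in spans
        if all(s["tags"].get(k) == v for k, v in tag_filters.items())
    ]

# Durations of premium requests that returned a 400: feed these into any
# aggregation (p99, mean, count) to get the custom metric.
premium_400 = filter_spans(spans, customer_type="premium", status_code=400)
durations = [s["duration_ms"] for s in premium_400]  # -> [130]
```

The same two steps (filter by tags, then aggregate) underlie any tag-scoped metric you might chart.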
You can drill down into how many events each application is sending and at what granularity, so that you can adjust your sampling rate as needed and not get a shock at the end of the month (as often happens with SaaS vendors).
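Adjusting a sampling rate typically means head-based sampling: deciding per trace whether to keep it, so that roughly a fixed fraction of traces survive. A deterministic sketch based on hashing the trace ID (our illustration of the general technique, not SigNoz's sampler):

```python
def keep_trace(trace_id_hash, rate):
    """Head-based sampling: keep roughly `rate` fraction of traces.

    Deciding from a hash of the trace ID (rather than random()) keeps all
    spans of one trace consistently in or out, even across services.
    """
    return (trace_id_hash % 10_000) < rate * 10_000
```

At `rate=0.1`, exactly 10% of the 10,000 hash buckets are kept, so ingest volume (and cost) scales down proportionally while sampled traces stay complete.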
A detailed flamegraph to find the exact cause of an issue and which of the underlying requests is responsible. Is it a SQL query gone rogue, or a Redis operation causing the problem?
Check out our GitHub repo and give it a try. We would love feedback on what you like or what doesn't make sense. We are also active on Slack, so give us a shout there and we would be happy to answer any questions or help you set things up.