DEV Community: Re Alvarez-Parmar

Closing observability gaps with custom metrics

Re Alvarez-Parmar — Sun, 29 May 2022 07:01:46 +0000

Which application metrics should you collect?

I frequently work with customers who are in the middle of breaking their monolithic applications into smaller microservices. Many teams also see this migration as an opportunity to make their applications more observable. As a result, customers ask which metrics they should monitor for a typical cloud native application.

Previously, when a customer asked me how to instrument a service, I pointed them to the well-known USE and RED methods. But I felt the response wasn’t thorough. A list of specific metrics to monitor can be helpful for teams building cloud native applications. This post is an attempt to provide a list of metrics to collect in a typical application. Not all the metrics listed below apply to every application type. For example, batch-like workloads rarely serve traffic and, as a result, don’t need to keep a count of requests served.

The goal of this document is to help developers come up with the golden signals for their applications.

Golden Signals is a term first used in the Google SRE handbook. The golden signals are four metrics (latency, traffic, errors, and saturation) that give you a very good idea of the real health and performance of your application as seen by the actors interacting with that service, whether they are end users or other services in your microservice application.

Observability

Cloud best practices recommend building systems that are observable. While the word observability (or “/O11y/” as it is popularly known) doesn’t have an official definition, it is the measure of a system’s ability to expose its internal state. The three pillars of observability are logs, metrics, and traces.

Modern systems are designed to produce logs, /emit/ metrics, and provide traces to help developers and operators understand their internal state.

Push vs Pull

Emitting metrics by exposing them on an externally accessible HTTP endpoint is gaining wider adoption thanks to developers adopting Prometheus for monitoring. In this model, Prometheus pulls metrics by scraping the application’s /metrics endpoint.

When you run Node Exporter, it publishes metrics at http://localhost:9100/metrics
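To make the pull model concrete, here is a minimal sketch of an application exposing its own /metrics endpoint using only the Python standard library. The metric name `requests_total` and the `METRICS` dictionary are hypothetical, and a real service would more likely use a Prometheus client library than hand-render the text format.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical in-memory metrics: name -> (type, help text, value).
# A real service would update these values as it handles work.
METRICS = {
    "requests_total": ("counter", "Total requests served.", 0),
}


def render_metrics(metrics):
    """Render metrics in the Prometheus text exposition format."""
    lines = []
    for name, (metric_type, help_text, value) in sorted(metrics.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current metric values when a scraper requests /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it, you could pass `MetricsHandler` to `http.server.HTTPServer` on a port of your choice and point a Prometheus scrape job at it.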

Observability tools aggregate and analyze data from different sources to help you detect issues and identify bottlenecks. The goal is to use these signals to improve the system’s reliability and prevent downtime.

AIOps products like Amazon DevOps Guru can also detect anomalies using your application's logs, metrics, and traces (and other sources) and give you early signals to prevent a potential disruption.

Metrics to collect

For an application to function as designed, the application and its underlying system have to be /healthy/. Host metrics inform the operator of the host’s and infrastructure resource usage, like CPU, memory, I/O, etc. If you use Prometheus, Node Exporter collects this information automatically for you.

Host metrics rarely differ. Whether we run a process on an EC2 instance or a Raspberry Pi, we’re interested in the same metrics.

Unlike host metrics, application metrics are unique to each microservice. Application metrics should give the operator the information needed to do these things:

Identify future areas of improvement through code-specific measurements. Application performance monitoring (APM) tools provide measurements over a segment of time that developers can analyze.
Troubleshoot failures, and prevent them from recurring, when the system fails.
In some cases, provide early signals to the business. For example, if the application exposes the number of /orders/ it has processed in the last 60 minutes, the business can track it using the monitoring system rather than querying a relational database.

Application performance monitoring (APM) companies like New Relic and Datadog have products that aggregate application metrics using SDKs or agents. However, what they will not collect are the business-specific metrics that only your application cares about.

In order to create a list of relevant metrics for an application, its architects need to determine a signal for each of its key functions. The hallmark of a microservice is that it does /one thing well/, so it shouldn’t have many key functions. Start by white-boarding the functions implemented in the code and creating a list of metrics that would help you gauge its performance (or its availability at the least).

Most measurements you’ll do will fall under one of these categories:

Counter

As the name suggests, this value is incremented each time an event occurs (for example, when a function runs), and it only ever goes up while the process is running. Example: total requests served.

Histogram

Histograms are charts that show the frequency of the occurrence of several ranges of values. A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets.

Gauge

This type of metric tracks a value that can increase or decrease over time. Example: number of threads.
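The three metric types can be sketched in a few lines of plain Python. These classes are illustrative only and are not the API of any particular client library; in particular, this histogram stores per-bucket counts, whereas Prometheus histograms expose cumulative buckets.

```python
import bisect


class Counter:
    """A value that only goes up, such as total requests served."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """A value that can go up or down, such as the number of threads."""

    def __init__(self):
        self.value = 0

    def set(self, value):
        self.value = value

    def inc(self, amount=1):
        self.value += amount

    def dec(self, amount=1):
        self.value -= amount


class Histogram:
    """Counts observations in configurable buckets (inclusive upper bounds)."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One count per bucket, plus a final overflow bucket.
        self.counts = [0] * (len(self.upper_bounds) + 1)
        self.total = 0.0

    def observe(self, value):
        # bisect_left finds the first bucket whose upper bound is >= value.
        index = bisect.bisect_left(self.upper_bounds, value)
        self.counts[index] += 1
        self.total += value
```

For example, a `Histogram([0.1, 0.5, 1.0])` could hold request durations in seconds, with anything slower than one second landing in the overflow bucket.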

With that background, let’s go through the list of common custom metrics developers use.

Network activity

These are the obvious metrics to track for any application that serves traffic. Network metrics tell you how much load is placed on the system. Over time, these data points assist you when devising the scaling strategy for the system.

Things you should include are:

Request count by API type or page
Requests total
Transactions
Concurrent, expired, and rejected sessions
A watermark that records maximum concurrent sessions
Average processing time
A count by error type
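Several items in this list can be tracked by one small object that wraps request handling. The sketch below is an assumption about how you might structure it (the class and attribute names are made up), and it is not thread-safe, so a real server would need a lock around the updates.

```python
from collections import Counter


class RequestMetrics:
    """Tracks request counts, concurrency, latency, and errors for one service."""

    def __init__(self):
        self.requests_total = 0
        self.requests_by_route = Counter()
        self.errors_by_type = Counter()
        self.in_flight = 0
        self.max_in_flight = 0  # high watermark of concurrent requests
        self.total_seconds = 0.0

    def start(self, route):
        """Call when a request arrives."""
        self.requests_total += 1
        self.requests_by_route[route] += 1
        self.in_flight += 1
        self.max_in_flight = max(self.max_in_flight, self.in_flight)

    def finish(self, seconds, error_type=None):
        """Call when a request completes, successfully or not."""
        self.in_flight -= 1
        self.total_seconds += seconds
        if error_type is not None:
            self.errors_by_type[error_type] += 1

    def average_seconds(self):
        """Average processing time over completed requests."""
        completed = self.requests_total - self.in_flight
        return self.total_seconds / completed if completed else 0.0
```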

Resource usage

It is a best practice to monitor a system’s /saturation/, which is a measure of your system’s resource consumption. Every resource has a /breaking point/, beyond which additional stress causes performance degradation. Scalable and reliable systems are designed to never breach the breaking point.

However, simply collecting overall resource saturation at the application level is insufficient. You also need to look deeper, at the thread or resource pool level.

Consider collecting these metrics:

Number of processors, system CPU load, process CPU load, available memory, used memory, available swap, used swap, open file descriptor count.
Total resources consumed by connection pools, thread pools, and any other resource pools.
Total started thread count, current thread count, current busy threads, keep alive count, poller thread count, and connection count.
Objects created, destroyed, and checked out; high-water mark; number of times checked out.
Number of threads blocked waiting for a resource, number of times a thread has blocked waiting

Common frameworks like Tomcat, Flask, etc. support exporting pre-defined metrics. For example, JMX already exposes a bunch of these metrics. See AWS CloudWatch documentation.

Users

Besides serving the intended audience, internet-facing web servers get flooded with requests from bots and scripts. These automated requests can overload the system if unauthenticated requests are improperly handled (for example, attempting to process an unauthenticated request instead of redirecting it to the authentication service).

Here are user related metrics to collect:

Authenticated and unauthenticated requests
Demographics and usage patterns
Unsuccessful login attempts

Some of these metrics may also come from your Load Balancer or ingress.

Business transaction (for each type)

If your application follows the microservices approach, then the code fulfills one function, at least that’s the idea. What are the key performance indicators for your app’s function? Define them and track these metrics.

Should future releases cause performance regression, you’ll be able to detect it. Tracking these business metrics will help you track trends easily and avoid a cascading failure.

Here are common things that services care about:

Orders, messages, requests, transactions processed
Success and failure rates. For a retailer, this could be the conversion rate.
Service level agreements (like average transaction response time)

If you still need help with identifying key metrics, ask yourself this question: In what ways can my application negatively affect the business even when it might appear to be healthy?

Database connections

Along with monitoring your database instances using database monitoring tools, consider collecting database connection health metrics in your application. This is especially helpful if your application uses a shared database. If your application encounters database connection errors but the database remains operational for other applications, you know the problem is on the application side, and not the database.

Consider recording these database-related metrics:

A count of SQLException thrown
Number of queries (concurrent or maximum)
Average query run time

Data consumption

Wherever you’re persisting data, you need to ensure that you’re not going to go over your quotas and run out of space. Besides monitoring on-disk and in-memory data volumes, don’t forget to monitor the data your application stores in databases and caches.

Cache health

Speaking of cache, it is a best practice to monitor these metrics:

Items in cache
Get and set latency
Hit and miss rates
Items flushed

Also, consider using an external cache such as Redis or Memcached.
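For an in-process cache, these numbers are cheap to collect yourself. Here is a minimal sketch of a dictionary-backed cache that counts hits, misses, and flushed items; the class is hypothetical and leaves out eviction and latency timing to stay short.

```python
class InstrumentedCache:
    """A dictionary-backed cache that records its own health metrics."""

    def __init__(self):
        self._data = {}
        self.hits = 0
        self.misses = 0
        self.flushed = 0

    def get(self, key, default=None):
        if key in self._data:
            self.hits += 1
            return self._data[key]
        self.misses += 1
        return default

    def set(self, key, value):
        self._data[key] = value

    def flush(self):
        """Remove every item and record how many were dropped."""
        self.flushed += len(self._data)
        self._data.clear()

    def stats(self):
        lookups = self.hits + self.misses
        return {
            "items": len(self._data),
            "hits": self.hits,
            "misses": self.misses,
            "hit_rate": self.hits / lookups if lookups else 0.0,
            "items_flushed": self.flushed,
        }
```

The dictionary returned by `stats()` maps directly onto gauges (items, hit rate) and counters (hits, misses, items flushed).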

External services

Keeping track of how downstream services perform is also useful in understanding issues. Along with using timeouts, retries (preferably with exponential backoff), and circuit breakers, consider monitoring these metrics for every external service that your service depends on to function properly:

Circuit breaker status
Count of timeouts, requests
Average response time or latency
Responses by type
Network errors, protocol errors
Requests in flight
A high watermark of concurrent requests.
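As a rough illustration, a circuit breaker can keep several of these counters itself. This sketch assumes that only `TimeoutError` counts as a failure and that the breaker opens after a fixed number of consecutive timeouts; the thresholds, names, and injectable `clock` are my own choices, not those of any specific resilience library.

```python
import time


class CircuitBreaker:
    """Wraps calls to an external service and exposes status and counts."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at = None
        self.requests = 0   # calls actually sent downstream
        self.timeouts = 0   # calls that timed out
        self.rejected = 0   # calls refused while the circuit was open

    @property
    def status(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_after:
            return "half-open"
        return "open"

    def call(self, func):
        if self.status == "open":
            self.rejected += 1
            raise RuntimeError("circuit open")
        self.requests += 1
        try:
            result = func()
        except TimeoutError:
            # Only timeouts count as failures in this sketch.
            self.timeouts += 1
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure streak.
        self.consecutive_failures = 0
        self.opened_at = None
        return result
```

Publishing `status` as a gauge and `requests`, `timeouts`, and `rejected` as counters covers the first two items in the list above.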

Granularity in metrics collection

The frequency at which you publish and collect metrics depends on your business requirements. For a retailer, knowing traffic patterns by the hour and day is useful in scaling capacity. Similarly, a travel company’s traffic patterns are influenced by holiday schedules.

Amazon EC2 provides instance metrics at 1-minute intervals when detailed monitoring is enabled, which is a good start for critical metrics.

Remember that there’s a cost attached to exposing, collecting, and analyzing metrics. Collecting unnecessary information in metrics can put a strain on the system and slow down troubleshooting.

Consider giving the operator the control over the metrics your code should generate. This way, you can turn on specific metrics whenever needed.
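One simple way to give the operator that control is an environment variable that lists the metrics to record. The variable name `ENABLED_METRICS` and the registry class below are assumptions for illustration; a configuration file or feature flag service would work just as well.

```python
import os


def enabled_metrics(environ=os.environ):
    """Parse a comma-separated list of metric names chosen by the operator."""
    raw = environ.get("ENABLED_METRICS", "")
    return {name.strip() for name in raw.split(",") if name.strip()}


class MetricsRegistry:
    """Records only the metrics that the operator has switched on."""

    def __init__(self, enabled):
        self.enabled = set(enabled)
        self.values = {}

    def inc(self, name, amount=1):
        if name not in self.enabled:
            return  # disabled metrics cost almost nothing
        self.values[name] = self.values.get(name, 0) + amount
```

With `ENABLED_METRICS=orders_total,cache_hits` set, calls that increment any other metric become no-ops until the operator adds that name to the list.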

Conclusion

Which metrics to collect is a question that only those most familiar with the code can answer. This post provides a list of metrics for you to get started.

Are there any metrics that I have overlooked? Let me know at @realz.

References

Push Vs. Pull In Monitoring Systems – Giedrius Statkevičius

SRE Metrics: Four Golden Signals of Monitoring

Learning Modern Linux

Release It!

Appendix

Instrumentation

Instrumentation is the way to measure an application’s performance. It is highly useful in profiling and troubleshooting. There are two common strategies for instrumentation:

Auto instrumentation. This is generally done using a library like the OpenTelemetry API and SDK. For more see “What Is Auto-Instrumentation?”.
Custom instrumentation. Whenever your instrumentation needs are not met by auto instrumentation, you will also need to generate custom metrics.

Connecting Kubernetes clusters across VPCs

Re Alvarez-Parmar — Thu, 31 Mar 2022 22:41:11 +0000

A few months ago, someone asked me the best way to connect services running in different Amazon Elastic Kubernetes Service (EKS) clusters in two different VPCs. Thinking about connecting network resources across AWS VPCs reminded me of the early days of AWS, when you needed to implement a complex hub-and-spoke architecture to connect Amazon Virtual Private Clouds (VPCs). Thankfully, AWS has newer services that simplify VPC interconnectivity.

AWS customers can connect networked resources in different AWS accounts using VPC peering, AWS Transit Gateway, VPC sharing, AWS PrivateLink, or 3rd-party solutions.

Given that there are so many options, I wasn’t sure which solution to recommend. The question I was posed needed a better understanding of AWS networking services. Here’s the research I did to understand the most optimal approach for connecting Kubernetes hosted services running in separate VPCs.

Note: I am not an AWS networking expert. This post was distilled from AWS documentation and this AWS whitepaper.

Symmetric and Asymmetric flow

Before picking a solution to interconnect services across different VPCs, you must consider your data connection requirements. There are two types of connectivity patterns between a set of network resources.

In the first scenario, a client initiates a connection with the server to send requests, but the server never initiates a connection with the client. A typical example is a traditional three-tier web app that stores data in a database. In such a scenario, the backend connects to the database. The database never establishes a connection with the backend. I will call this type of connection flow asymmetric flow in this blog.

Symmetric flow is the opposite. In this data connectivity pattern, either side (client or server) can initiate a connection.

Customers looking to connect Kubernetes-hosted applications running in different AWS accounts will have to start by determining the data connection patterns their applications and services will use.

Connecting services hosted in different AWS accounts

The type of connectivity your services require influences how you can connect services in different VPCs.

When interconnecting private services that need symmetric flow, any service has to be able to initiate a connection with another service. Service discovery is a prerequisite for connectivity. Services have to know how to connect to downstream services. Once that problem is resolved (using DNS, or Consul etc.), there are two primary ways to interconnect services:

VPC Peering (including TGW) or VPC Sharing
AWS PrivateLink

The key difference between the two approaches is that once VPC peering or VPC sharing is set up, network resources (EC2 instances, pods, containers) in the VPCs can interconnect by default.

AWS PrivateLink provides secure access to services hosted in other VPCs without peering or sharing VPCs. You do this by configuring an interface endpoint to access the service in the other VPC. Because access control is more fine-grained, you may have to create an interface endpoint powered by PrivateLink for each Kubernetes hosted service that uses a different NLB.

In the fullness of time, most large enterprises will (many already do) use a combination of AWS networking services depending on the use case. In the next section, we review AWS network services and determine the scenarios in which they are a good fit.

🛟 VPC Peering

When services have to consume other services running in different VPCs, requiring symmetric flow, VPC peering is the easiest way to provide interconnectivity.

You can connect the VPCs in which your EKS clusters reside as long as you had the foresight and luxury of pre-planning VPC CIDRs in advance (notice the intentional redundancy 🙂), and don’t have too many VPCs to interconnect or require transitive routing.
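Since VPC peering requires non-overlapping CIDR blocks, it helps to check your address plan before you start. Here is a small sketch using Python’s standard `ipaddress` module; the function name and the CIDR values in the usage note are made up for illustration.

```python
import ipaddress
from itertools import combinations


def overlapping_cidrs(cidrs):
    """Return every pair of CIDR blocks that overlap each other."""
    networks = [ipaddress.ip_network(cidr) for cidr in cidrs]
    return [
        (str(a), str(b))
        for a, b in combinations(networks, 2)
        if a.overlaps(b)
    ]
```

For example, `overlapping_cidrs(["10.0.0.0/16", "10.1.0.0/16", "10.0.128.0/17"])` reports the first and third blocks as a conflicting pair, because the /17 sits inside the first /16.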

VPC peering is the preferred method to connect VPCs when there are fewer than 10 VPCs (source). However, most enterprises have complicated, extensive networks and sub-networks, which inevitably leads to hundreds of VPCs. Interconnecting them using VPC peering is a difficulty that AWS Transit Gateway intends to solve.

🚉 AWS Transit Gateway

AWS Transit Gateway (TGW) is another option to connect VPCs wherever services need to interconnect with symmetric flow.

TGW is designed to simplify creating and managing multiple VPC peering connections at scale. It can act as the central router for large-scale, enterprise-grade, globally-distributed networks.

TGW is the default choice for EKS users that have to connect their clusters with more than 10 VPCs and need symmetric flow.

Are there reasons for not using TGW?

The good ole’ VPC peering still has some tricks up its sleeve that TGW hasn’t mastered yet. Here are a few reasons TGW may not be the right choice for you:

Lower cost — With VPC peering you only pay for data transfer charges. Transit Gateway has an hourly charge per attachment in addition to the data transfer fees. For example, in US-East-1:

No bandwidth limits - With Transit Gateway, the maximum bandwidth (burst) per Availability Zone per VPC connection is 50 Gbps. VPC peering has no aggregate bandwidth limit. Individual instance network performance limits and flow limits (10 Gbps within a placement group and 5 Gbps otherwise) apply to both options. Only VPC peering supports placement groups.
Latency - Unlike VPC peering, Transit Gateway is an additional hop between VPCs.
Security Groups compatibility - Security groups referencing works with intra-Region VPC peering.

Both TGW and VPC peering support inter-Region connectivity, which is helpful when your EKS clusters reside in different AWS Regions.

🫱🏽‍🫲🏼 Amazon VPC Sharing

Here’s a third way to provide symmetric flow: share a VPC with multiple AWS accounts.

AWS Resource Access Manager (RAM) allows you to share VPCs (among other things) with other AWS accounts within your AWS Organization.

RAM allows network administrators to create and manage VPCs centrally. Shared VPC enables network resources in different AWS accounts to communicate seamlessly as if they were on the same VPC and account. You can still control traffic using security groups and network ACLs.

You can share your VPCs to leverage the implicit routing within a VPC for applications that require a high degree of interconnectivity and are within the same trust boundaries. This reduces the number of VPCs that you create and manage while using separate accounts for billing and access control.

VPC sharing benefits:

Simplified design — no complexity around inter-VPC connectivity
Fewer managed VPCs
Segregation of duties between network teams and application owners
Better IPv4 address utilization
Lower costs — no data transfer charges between instances belonging to different accounts within the same Availability Zone

According to the whitepaper, customers can use VPC sharing in conjunction with TGW to optimize for cost and performance. VPC sharing has a few limitations, so make sure you’re accounting for them in your design.

🎯 AWS PrivateLink

Are you itching to know the options if you need asymmetric flow connectivity? No? Let me tell you anyway. 😄

AWS PrivateLink provides private IP connectivity between VPCs so that clients can connect with services hosted in other VPCs.

The key benefit over VPC sharing or peering is that PrivateLink connects services even when they run in different VPCs with overlapping CIDRs. It is also simpler to set up as it doesn’t require changes to route tables, subnets, or TGWs.

The AWS Prescriptive Guidance has a guide for using PrivateLink and NLB with EKS.

Symmetric flow with PrivateLink

If you choose to interconnect services using PrivateLink, you can still provide support for symmetric flow. You’d have to add another PrivateLink to support connections originating from the opposite end.

Cost comparison

I researched PrivateLink pricing and failed to come up with a fair comparison with Transit Gateway. Unfortunately, there are too many vectors to provide a generalized price comparison. I recommend involving an AWS Solutions Architect for a detailed analysis.

Conclusion

AWS provides many methods to connect services running in different VPCs. This post reviews the options available for EKS customers.

You can simplify network topologies by interconnecting shared Amazon VPCs using connectivity features, such as AWS PrivateLink, transit gateways, and VPC peering.

Here’s a rudimentary decision tree to help you get started.
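One way to read the guidance in this post is as a short function. This is my own rough sketch of the trade-offs discussed above, not an authoritative rule set: the parameter names are hypothetical, the 10-VPC threshold comes from the rule of thumb quoted earlier, and cost, bandwidth, and latency are ignored.

```python
def suggest_connectivity(
    symmetric_flow,
    overlapping_cidrs,
    vpc_count,
    needs_transitive_routing=False,
    same_trust_boundary=False,
):
    """Suggest a starting point for connecting services across VPCs."""
    # PrivateLink works even when VPC CIDRs overlap; symmetric flow then
    # needs one PrivateLink per direction.
    if overlapping_cidrs:
        return "AWS PrivateLink"
    # Client-to-server only: expose the service through an interface endpoint.
    if not symmetric_flow:
        return "AWS PrivateLink"
    # Highly interconnected apps inside one trust boundary can share a VPC.
    if same_trust_boundary:
        return "VPC sharing"
    # A handful of VPCs with no transitive routing: peering is simplest.
    if vpc_count < 10 and not needs_transitive_routing:
        return "VPC peering"
    return "AWS Transit Gateway"
```

For instance, two EKS clusters in two VPCs with non-overlapping CIDRs and symmetric flow would land on VPC peering, while the same requirement across dozens of VPCs would land on Transit Gateway.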

Did I get anything wrong? Please tweet me your feedback at @realz.

What are Rollups 🍣

Re Alvarez-Parmar — Tue, 22 Mar 2022 20:32:49 +0000

Increasing the transactional throughput of public blockchains is a key focus for blockchain researchers today. EIP-4844 just came out, and it’s old news that rollups will play a huge role in the future of scaling Ethereum. Ethereum’s co-founder Vitalik Buterin described the concept of rollups back in 2014. Last year, Vitalik claimed rollups to be the “only choice” for making gas fees more affordable.

Rollups offer faster and cheaper transactions for dApp developers and their customers. This post summarizes my research on rollups, and a few things dApp developers should know when picking the right blockchain protocol.

Why do we need rollups? 🛼

Rollups are in the short and medium-term, and possibly in the long term, the only trustless scaling solution for Ethereum.

Thanks, Vitalik, for that intro. If you’re familiar with Ethereum, you also know about the current gas price crisis. Basically, if you transfer coins on Ethereum, be prepared to fork over up to $10 in fees to send $1. The last sentence may not be an exaggeration. 😔

Rollups are a layer 2 scaling solution to drastically reduce the gas price on the Ethereum mainnet. Rollups are supposed to provide a way to reduce the costs and latency of decentralized applications (dApps) for users and developers.

In layer 2 scaling solutions, web3 apps send transactions to nodes that are part of the layer 2 network. The network batches transactions into groups before anchoring (publishing) them to layer 1, after which they are secured by layer 1 since they are publicly verifiable and cannot be altered. Thus rollups offer faster execution by executing transactions off-chain and publishing the proof of transactions on-chain.

Rollups move computation (and state storage) off-chain, but keep some data per transaction on-chain.

A rolled-up transaction could include tens of thousands of transactions, which means tens of thousands of transactions can be recorded on the mainchain for the price of one. With compression, the more layer 2 transactions you can bundle into a single layer 1 transaction, the cheaper it is to store proof of each transaction.

Jag Sidhu writes “some Ethereum engineers got these individual account updates down to a few bytes (8–12 bytes depending on the implementation) which means that a block with 1 megabyte of bandwidth would be able to roughly process 83k — 125k account adjustments per block and around 5500 to 8300 TPS theoretically assuming 15 second block times.”
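The arithmetic in that quote is easy to reproduce. The sketch below assumes that “1 megabyte” means 1,000,000 bytes and that the whole block is filled with account updates, which is a theoretical upper bound rather than a measured figure.

```python
BLOCK_BYTES = 1_000_000      # assumed: 1 megabyte of block bandwidth
BLOCK_TIME_SECONDS = 15      # block time used in the quote


def updates_per_block(bytes_per_update):
    """How many account updates fit in one block."""
    return BLOCK_BYTES // bytes_per_update


def transactions_per_second(bytes_per_update):
    """Theoretical throughput if every block is full of updates."""
    return updates_per_block(bytes_per_update) / BLOCK_TIME_SECONDS
```

At 12 bytes per update this gives about 83k updates per block and roughly 5,500 TPS; at 8 bytes it gives 125k updates per block and roughly 8,300 TPS, which matches the range in the quote.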

Types of rollups

The paper subtly groups rollups into two categories: Arbitrum and Optimism, with gas fees ~3-8x lower than L1, and ZK-rollups, with gas fees ~40-100x lower than Ethereum mainnet.

So, why do Arbitrum and Optimism provide single-digit gains while ZK-rollups provide gains of up to 100x? That’s because there are two types of rollups: Optimistic and ZK.

Optimistic rollups

Arbitrum and Optimism are layer 2 protocols that use an “optimistic rollup” (OR) to scale Ethereum. An optimistic rollup network assumes that transactions are valid by default and only performs calculations, via a fraud proof, in the event of a challenge.

In other words, when an application transacts on an optimistic rollup network like Arbitrum, the actual transfer of funds (from accountA to accountB) happens on Arbitrum. The transaction is then published on Ethereum mainnet.

Remember that an optimistic rollup network assumes all transactions are valid, at least initially. So what happens if a transaction is invalid?

This is indeed a problem with optimistic rollups. Because every transaction is assumed valid, optimistic rollups impose a withdrawal delay (7-14 days) while the network waits for someone to challenge the state of the network.

Optimistic Rollups rely on fraud proofs to avoid re-computations. The state is proposed to Ethereum by a “bonded” actor. Anyone who wants to challenge the actor may claim a bounty by proving that the state update is inaccurate. To accomplish this, the challenger must provide the data required by the smart contract to prove the inaccuracy. This thread goes over the key difference between Optimism and Arbitrum fraud proof mechanism.
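The idea behind a challenge can be shown with a toy model. In this sketch, a proposer publishes the transfers in a batch along with a claimed post-state root, and a challenger re-executes the batch to check the claim. Everything here is simplified and hypothetical: the “state root” is a plain SHA-256 hash of the balances rather than a Merkle root, and real fraud proofs on Optimism or Arbitrum are resolved by smart contracts on layer 1, not by a Python function.

```python
import hashlib
import json


def state_root(balances):
    """A stand-in for a state root: a hash of the sorted account balances."""
    encoded = json.dumps(balances, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def apply_batch(balances, transfers):
    """Apply (sender, recipient, amount) transfers and return the new state."""
    new_balances = dict(balances)
    for sender, recipient, amount in transfers:
        if new_balances.get(sender, 0) < amount:
            raise ValueError(f"insufficient funds for {sender}")
        new_balances[sender] = new_balances.get(sender, 0) - amount
        new_balances[recipient] = new_balances.get(recipient, 0) + amount
    return new_balances


def is_fraudulent(pre_state, transfers, claimed_root):
    """Re-execute the batch and compare against the proposer's claim."""
    try:
        expected_root = state_root(apply_batch(pre_state, transfers))
    except ValueError:
        return True  # the batch contains an invalid transfer
    return expected_root != claimed_root
```

An honest proposer’s claim survives the check, while a proposer who publishes a root for balances that the transfers do not produce would be caught by anyone willing to re-run the batch.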

ZK-rollups don’t have the withdrawal time constraint because they include a validity proof.

ZK Rollups

For Ethereum, and EVM-compatible chains, to become the world’s next distributed computing platform, gas prices have to be massively reduced, until it is cheap enough to do things at internet scale. ZK rollups (ZKR) could be the key to achieving that level of scalability.

ZK rollups like zkSync are popular because they don’t have the withdrawal time problem that optimistic rollups do. Withdrawal times in zkSync, a ZK rollup live on Ethereum mainnet, are 10 minutes to 7 hours during low usage. Moreover, ZK rollups get cheaper and faster as usage increases, so in the future things will become faster.

But, what does ZK stand for?

ZK rollups are based on the concept of provers and verifiers. ZK stands for Zero Knowledge.

ZKRs “roll up” off-chain transactions and generate a cryptographic proof known as a zk-SNARK. The acronym zk-SNARK stands for “Zero-Knowledge Succinct Non-Interactive Argument of Knowledge”. The zk-SNARK is the proof of validity of the transactions in the form of a hash and is eventually placed on the main chain.

A special ZK rollup smart contract, which resides on layer 1, maintains the status of the transfers made on the rollup chain. The status can only be updated with a validity proof: the zk-SNARK. The zk-SNARK is a hash that represents the blockchain’s validity status.

“Zero-knowledge” proofs allow one party (the prover) to prove to another (the verifier) that a statement is true without revealing any information beyond the validity of the statement itself. For example, given the hash of a random number, the prover could convince the verifier that there indeed exists a number with this hash value, without disclosing what that random number is.

In a zero-knowledge “Proof of Knowledge” the prover can convince the verifier not only that the number exists, but that they in fact know such a number – again, without revealing any information about the number.
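To get a feel for a proof of knowledge, here is a toy version of the Schnorr identification protocol, in which the prover shows that they know a secret exponent without revealing it. To be clear, this is not a zk-SNARK: it is interactive, it uses a deliberately tiny group (p = 23, q = 11, g = 2) that offers no security, and all function names are my own. It only illustrates the prover and verifier roles.

```python
import secrets

P = 23  # prime modulus (toy size, insecure)
Q = 11  # prime order of the subgroup generated by G
G = 2   # generator of that subgroup: pow(2, 11, 23) == 1


def public_key(secret):
    """The value the verifier knows: g^secret mod p."""
    return pow(G, secret, P)


def commit(nonce):
    """Prover's first message: a commitment to a fresh random nonce."""
    return pow(G, nonce, P)


def respond(secret, nonce, challenge):
    """Prover's answer; the random nonce masks the secret."""
    return (nonce + challenge * secret) % Q


def verify(public, commitment, challenge, response):
    """Verifier checks g^response == commitment * public^challenge (mod p)."""
    return pow(G, response, P) == (commitment * pow(public, challenge, P)) % P


def run_round(secret):
    """One full round with fresh randomness on both sides."""
    nonce = secrets.randbelow(Q)      # chosen by the prover
    challenge = secrets.randbelow(Q)  # chosen by the verifier
    response = respond(secret, nonce, challenge)
    return verify(public_key(secret), commit(nonce), challenge, response)
```

The verifier only ever sees the public key, the commitment, and the response, yet an honest prover always passes the check because g^(nonce + challenge × secret) equals commitment × public^challenge.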

zk-SNARK’s succinct proofs are only a few hundred bytes and can be verified within a few milliseconds. The ZK proof mathematically proves that no fraud has occurred.

zkSync is a ZKR live on Ethereum mainnet. Immutable X and Loopring also use ZKR. Zcash is the first widespread application of zk-SNARKs. Polygon is focused on Zero-Knowledge (ZK) cryptography as the end game for blockchain scaling. There's a lot of innovation happening in this space. L2beat.com provides details about Ethereum layer 2 scaling solutions.

ZKR core components

ZKRs execute transactions on a sidechain and roll them up onto the mainchain. ZKRs use two types of actors, transactors and relayers, to achieve this.

Transactors create and broadcast transaction data (indexed address, value, network fee, and nonce) to the network. A transactor corresponds to an external account on Ethereum. Smart contracts then record addresses in one Merkle tree and transaction values in another.
Relayers collect a large number of transactions to create rollups. Relayers generate the ZK proof that captures the blockchain state before and after each transaction. The resulting changes reach the mainchain as a verifiable hash. Although anyone can become a relayer, they must first stake cryptocurrency in a smart contract to ensure honesty.

This “state” is essentially a database which represents new balances and adjustments to accounts as users transact with their accounts inside of the rollup.

How do rollups reduce gas?

Rollups don’t actually reduce the gas on Ethereum. Recall that a rollup is a layer 2 sidechain; when using a rollup, you won’t be sending transactions on Ethereum mainnet; instead transactions will be submitted to the L2.

Users of a dApp running the ZK-Rollup scheme will pay less in transaction fees.

Are Optimistic Rollups a temporary solution?

This seems to be a common question in the community. If ZKR are faster, then why even bother with OR?

Optimistic rollups have a first-mover advantage. The main reason ORs were more popular in the past is that, until recently, ZKRs didn’t support Solidity smart contracts. ZKRs have to generate validity proofs, and the earliest iterations were not EVM and Solidity compatible. That changed in 2021. Now you can take your Solidity smart contract and deploy it on a ZKR with a few (relatively minor) changes.

In February 2022, zkSync 2.0 became available on Ethereum’s testnet. It includes zkEVM, a virtual machine that executes smart contracts in a way that is compatible with zero-knowledge-proof computation.

Only time will tell who wins.

How will the merge affect this?

Simply put, it will not. I’ll provide a detailed answer in another post.

Topics we skipped

In optimistic rollups, a sequencer is a specially designated full node that controls the ordering of transactions that are ready to be rolled up. Sequencers bundle transactions and submit both the transaction data and the new L2 state root to L1. Kyle Charbonnet has explained Optimism’s optimistic rollup implementation in detail here.

ZK-STARKs (Zero-Knowledge Scalable Transparent ARguments of Knowledge) are another topic we skipped. The proof system used in ZK-SNARKs requires a trusted party, or parties, to initially set up the ZK proof system. A dishonest trusted party could compromise the privacy of the system. ZK-STARKs improve on this technology by removing the need for a trusted setup.

Conclusion

Blockchain is a fast-moving space. Millions of dollars continue to be funneled into building scalable future blockchain networks. It’s hard to tell if ZKRs will be the silver bullet to address Ethereum’s data availability and scaling problems. In the short term, it does look like ZKRs are a step in the right direction.