TCP Observability for Microservices (Part II)

In a microservices architecture, application performance is not determined solely by how fast your code executes. It is equally dependent on the health of the network plumbing that connects your services. When an API call traverses a mesh of dozens of independent services, the lifecycle of TCP connections, from establishment to teardown, becomes a critical pillar of observability.

The most damaging issues in these environments belong to a class of "invisible problems": performance degradation, connection pool exhaustion, and tail latency. Their impact often remains hidden until system pressure increases, at which point they trigger cascading failures. Key metrics such as connection idle time and termination origin (did the client or server close first?) are often the difference between a five-minute fix and a week-long debugging nightmare.

When we talk about connection pool exhaustion or tail latency, we are not merely talking about application code. We are talking about TCP state management.


Why Application-Level Monitoring Isn't Enough

HTTP status codes are useful for spotting surface-level application errors, but they often mask deeper transport and timing failures in distributed systems. In persistent protocols and modern microservices, many decisive performance signals do not live in the response code; they live in the transport flow.

WebSockets: Managing Persistence

WebSockets are designed to be long-lived, yet without TCP-level monitoring they can degrade into "zombie" connections: sockets that are technically open but no longer transmitting meaningful data.

  • Idle Time: Continuous monitoring reveals whether a connection has been idle for too long. Excessive idle time often leads to silent drops by load balancers, firewalls, or NAT gateways long before the application notices.
  • Termination Origin: Distinguishing whether the client or server closed the connection is vital. It helps you diagnose whether a frontend app is crashing, a backend service is hitting a timeout, or an intermediary is forcefully resetting the flow (a packet-level check is sketched after this list).
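
A minimal way to make the termination origin visible without touching the application is to watch teardown flags at the packet level. The sketch below is only illustrative: the loopback interface and port 8080 are assumptions, so substitute the interface and port your WebSocket endpoint actually uses (the filter matches IPv4 traffic).

# Print only teardown packets; the side that sends the first FIN (or RST)
# is the closure originator.
tcpdump -i lo -nn 'tcp port 8080 and (tcp[tcpflags] & (tcp-fin|tcp-rst) != 0)'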

HTTP/2: Multiplexing and Stream Management

The most efficient pattern for HTTP messaging is reusing a single connection for multiple exchanges, whether as sequential request-response cycles or pipelined requests. This principle is even more central to HTTP/2, which multiplexes multiple concurrent streams over a single TCP connection.

  • Keep-Alives and PINGs: HTTP/2 uses PING frames to maintain connection liveness and measure latency. TCP-level monitoring helps confirm whether long-lived connections are truly stable or being pruned by intermediaries.
  • Connection Reuse: Analyzing TCP behavior ensures that clients are actually reusing existing connections instead of repeatedly opening new ones, which would negate one of HTTP/2's main performance benefits (a quick client-side check is sketched after this list).
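
One rough client-side sanity check for reuse, assuming a curl build with HTTP/2 support and a hypothetical local endpoint at https://localhost:8443 (-k only because a self-signed test certificate is assumed):

# Fetch two paths in one invocation; when the second request rides the same TCP
# connection, curl's verbose output includes a "Re-using existing connection" line.
curl -skv --http2 https://localhost:8443/a https://localhost:8443/b 2>&1 | grep -i 're-using'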

The Three Pillars of Connection Health

To achieve reliable, high-performance systems, your observability stack must be able to answer three fundamental questions:

  1. Connection Utilization: Is the connection being efficiently reused across multiple requests, or is churn degrading performance?
  2. Termination Source: Which side initiated the shutdown, the client or the server?
  3. Closure Timing: When was the connection closed, and did it happen earlier than expected?

Hands-On: Tracking TCP with Justniffer

One powerful tool for this level of visibility is Justniffer. It captures network traffic and produces logs focused on connection lifecycle events, making it useful for diagnosing transport-layer behavior without application instrumentation.

You can use the following command to track connection timing, idle periods, and the side that closed the connection:

justniffer -i lo -l "%request.timestamp %dest.ip %dest.port \
    c=%connection.time i=%idle.time.0 req=%request.time resp=%response.time \
    i2=%idle.time.1 s=%session.time %connection %close.originator \
    req_h=%request.header.connection resp_h=%response.header.connection"
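
If you also want traffic to watch while the capture runs, the commands below roughly reproduce the patterns analyzed in the scenarios that follow. This is a sketch with assumptions: some HTTP/1.1 service is already listening on 127.0.0.1:8080 and curl is used as the client.

# Scenario 1 pattern: a fresh connection per request, torn down via Connection: close.
for i in $(seq 4); do curl -s -H 'Connection: close' http://127.0.0.1:8080/ > /dev/null; done

# Scenario 2 pattern: keep-alive is requested, but each curl process exits after
# one request and closes its socket, so no connection is ever reused.
for i in $(seq 5); do curl -s -H 'Connection: keep-alive' http://127.0.0.1:8080/ > /dev/null; done

# Scenario 3 pattern: one client process, several URLs, one reused connection.
curl -s http://127.0.0.1:8080/ http://127.0.0.1:8080/ http://127.0.0.1:8080/ > /dev/null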

Field Reference

Field               Meaning
c                   TCP connection setup/handshake time
i                   Idle time before the request
req                 Request transfer time
resp                Response transfer time
i2                  Idle time after the response
s                   Total session time
connection          Connection state (start, continue, last, or unique)
close.originator    Which side closed the connection (client or server)
req_h / resp_h      The Connection header values from the request and response

Scenario 1: Requests Over Non-Persistent Connections

2026-05-12 08:57:54.651504 127.0.0.1 8080 c=0.000070 i=0.000074 req=0.000000 resp=0.005880 i2=0.000376 s=0.006400 unique server req_h=close resp_h=close
2026-05-12 08:57:56.172010 127.0.0.1 8080 c=0.000074 i=0.000073 req=0.000000 resp=0.005958 i2=0.000448 s=0.006553 unique server req_h=close resp_h=close
2026-05-12 08:57:56.891523 127.0.0.1 8080 c=0.000049 i=0.000055 req=0.000000 resp=0.006007 i2=0.000379 s=0.006490 unique server req_h=close resp_h=close
2026-05-12 08:57:58.791617 127.0.0.1 8080 c=0.000067 i=0.000056 req=0.000000 resp=0.005404 i2=0.000367 s=0.005885 unique server req_h=close resp_h=close

What this tells us:

  • Connection churn: Every request is marked unique, meaning a brand-new TCP connection is established and torn down for each HTTP request.
  • Zero reuse: The c value (connection setup time) is present on every line, confirming repeated TCP handshakes. This is inefficient in high-throughput paths (a SYN-counting check is sketched after this list).
  • Explicit closure: Both request and response headers specify Connection: close, and the log shows the server as the closure originator (server).
  • Performance impact: Individual request durations are low, but repeated setup/teardown adds avoidable overhead that can inflate tail latency under load.
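
To corroborate the churn independently of justniffer, you can count client-initiated handshakes at the packet level. A sketch, again assuming the loopback interface and port 8080:

# Each matched packet is a new handshake attempt (SYN set, ACK clear); with healthy
# reuse you would see one per pooled connection, not one per request.
tcpdump -i lo -nn 'tcp dst port 8080 and tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn'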

Scenario 2: Keep-Alive Requested, But No Reuse Observed

2026-05-12 09:10:23.468279 127.0.0.1 8080 c=0.000088 i=0.000068 req=0.000000 resp=0.005478 i2=0.000459 s=0.006093 unique client req_h=keep-alive resp_h=-
2026-05-12 09:10:24.766013 127.0.0.1 8080 c=0.000046 i=0.000060 req=0.000000 resp=0.005507 i2=0.000818 s=0.006431 unique client req_h=keep-alive resp_h=-
2026-05-12 09:10:26.701355 127.0.0.1 8080 c=0.000079 i=0.000066 req=0.000000 resp=0.005523 i2=0.000399 s=0.006067 unique client req_h=keep-alive resp_h=-
2026-05-12 09:10:27.285707 127.0.0.1 8080 c=0.000070 i=0.000063 req=0.000000 resp=0.005881 i2=0.000490 s=0.006504 unique client req_h=keep-alive resp_h=-
2026-05-12 09:10:28.261914 127.0.0.1 8080 c=0.000081 i=0.000078 req=0.000000 resp=0.006131 i2=0.000628 s=0.006918 unique client req_h=keep-alive resp_h=-

What this tells us:

  • Observed mismatch: The client requests Connection: keep-alive (req_h=keep-alive), yet every entry is still unique, so the connection is not reused.
  • Important protocol nuance: resp_h=- means no explicit Connection header was captured in the response. In HTTP/1.1, that does not automatically mean keep-alive was refused, because persistence is the default unless Connection: close is sent.
  • Client-side closure: The client closes each connection. This can happen when the client does not pool connections, intentionally closes after each request, or uses short idle/age limits.
  • Likely causes to verify: client library behavior, reverse proxy connection policy, protocol version negotiation (HTTP/1.0 vs HTTP/1.1), and intermediary timeouts (a quick protocol check is sketched after this list).
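
A quick way to rule the protocol version in or out is to look at the status line the server actually returns. A sketch against the same assumed local endpoint:

# An HTTP/1.0 status line would explain the lack of reuse: persistence is not the
# default in HTTP/1.0, so connections are torn down after each exchange.
curl -sv -o /dev/null http://127.0.0.1:8080/ 2>&1 | grep -E '^< HTTP/'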

Scenario 3: Requests Over a Functional Keep-Alive Connection

2026-05-12 09:11:22.933485 127.0.0.1 8080 c=0.000065 i=0.000071 req=0.000000 resp=0.006105 i2=1.410593 s=0.006241 start - req_h=keep-alive resp_h=-
2026-05-12 09:11:24.350183 127.0.0.1 8080 c=- i=- req=0.000000 resp=0.005272 i2=1.940395 s=1.422106 continue - req_h=keep-alive resp_h=-
2026-05-12 09:11:26.295850 127.0.0.1 8080 c=- i=- req=0.000000 resp=0.006187 i2=1.887490 s=3.368688 continue - req_h=keep-alive resp_h=-
2026-05-12 09:11:28.189527 127.0.0.1 8080 c=- i=- req=0.000000 resp=0.005752 i2=0.466600 s=5.261930 continue - req_h=keep-alive resp_h=-
2026-05-12 09:11:28.661879 127.0.0.1 8080 c=- i=- req=0.000000 resp=0.005523 i2=1.417703 s=7.151756 last client req_h=keep-alive resp_h=-

What this tells us:

  • The start entry: The first request bears the TCP handshake cost (c=0.000065).
  • Effective reuse: The continue entries show c=-, confirming that subsequent requests reused the same TCP connection.
  • Idle time analysis: The i2 values (post-response idle time) show roughly 0.47 to 1.94 seconds of wait time between requests. This is reasonable, but it should remain below intermediary idle timeouts.
  • Session-time consistency check: The final s=7.151756 roughly matches the elapsed time from the first request to the point the connection actually closed (the last request's timestamp plus its response and trailing idle time), confirming that the session timer accumulates across reused requests.
  • The last entry: The final line marks session end (last), and the client initiated closure.
  • Diagnosis: This pattern typically indicates healthy keep-alive reuse followed by a normal client-side shutdown.

From Logs to Action: A Troubleshooting Matrix

Pattern: Connection Churn
Observation: Every request is unique and pays the connection-setup (c) cost on every line
Likely Cause: Keep-alive disabled on the client or the server
Action: Enable keep-alive in both application and proxy configs

Pattern: No Observed Reuse
Observation: Client requests keep-alive, but connections stay unique and the client closes them
Likely Cause: Client not pooling, short client timeouts, intermediary policy, or protocol mismatch
Action: Verify client pool settings, HTTP version, and proxy timeout/connection policy
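
After applying a fix, the same capture can validate it: reused sessions show up as start, continue, and last entries, while churn shows up as unique. The sketch below summarizes connection states over a short window; the 30-second duration and the loopback interface are arbitrary choices.

# Count connection states seen during a 30-second capture; a healthy pool shows
# mostly start/continue/last, while churn shows mostly unique.
timeout 30 justniffer -i lo -l "%connection" | sort | uniq -c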

Conclusion

TCP observability bridges the gap between vague complaints like "the network is slow" and precise conclusions such as "connections are not being reused because the client pool closes them after 5 seconds." Application-level metrics tell you that a request failed; transport-layer observability tells you why the underlying path behaved that way.

By tracking idle times, handshake costs, and termination origins with tools like Justniffer, you can:

  • Optimize keep-alive strategies to reduce connection churn.
  • Validate that your HTTP/2 clients are actually reusing long-lived connections.
  • Ensure WebSockets remain alive and are not silently dropped by intermediaries.
  • Tune client and server pool configurations based on measured connection lifecycles instead of guesswork.

In modern distributed systems, if you are not observing the transport layer, you are flying blind.

