Debugging TCP by Building Your Own Logging Proxy in 400 Lines of Rust
Every backend engineer has been stuck on the same question: "is my client actually sending what I think it's sending?" The bytes are right there on the wire. Why is it so hard to see them?
I've been debugging a lot of raw TCP protocols lately (Postgres, Redis, a small custom RPC framing) and I got tired of the gap between the tools I had. So I wrote a 400-line Rust program that fills it. This post is about why the existing tools weren't enough, what I ended up building, and the handful of interesting design calls that came up along the way.
GitHub: https://github.com/sen-ltd/tcp-proxy
The tools that didn't fit
Before writing anything, I went through what I already had.
tcpdump / Wireshark. These are amazing when you need to know what's on the wire in the packet sense, but that's exactly the problem. TCP segments are not application messages. A single SET key value Redis command might arrive across two segments; a lazy keepalive shows up interleaved; retransmits make everything look wrong until you read the sequence numbers. Reading a session back out of a pcap is a full minute of Wireshark clicking even when you know exactly what you're looking for. And if you're on a machine without root, you can't capture at all.
socat. socat will happily forward a TCP port. What it will not do is log the bytes that pass through. You can bolt on tee or ship everything to a file, but now you've lost the framing between the two directions: which chunks were requests, which were responses, where the session boundaries are.
mitmproxy. Incredible tool. HTTP-only. If I'm debugging Postgres wire protocol, it has nothing to say.
Custom one-off. I've written a ten-line tokio script for this three times now. Each time it's almost good enough, and each time the next problem needs "can you just add colored direction markers" or "can you truncate the hex dump so a 4 MB blob doesn't scroll me off the screen."
So I wrote the tool I wish I'd had.
What I built
tcp-proxy is a CLI. You give it --listen and --forward. It accepts TCP connections on the listen side, connects to the forward side, and ferries bytes between them, while logging every chunk, in both directions, as a classic hexdump -C-style hex+ASCII dump, tagged with a per-connection session ID.
tcp-proxy --listen 127.0.0.1:6380 --forward 127.0.0.1:6379
Point your Redis client at :6380 and the session shows up in the proxy's output:
listening on 127.0.0.1:6380, forwarding to 127.0.0.1:6379
[sid=0001] accepted from 127.0.0.1:55318
[sid=0001] C->S (14 bytes)
00000000  2a 31 0d 0a 24 34 0d 0a 50 49 4e 47 0d 0a        |*1..$4..PING..  |
[sid=0001] S->C (7 bytes)
00000000  2b 50 4f 4e 47 0d 0a                             |+PONG..         |
[sid=0001] closed: eof
That's it. That's the whole tool, conceptually. The rest of this post is about the handful of things that were less obvious than I expected once I started writing it.
Design call #1: don't use copy_bidirectional
The first thing a reasonable Rust person reaches for in this situation is tokio::io::copy_bidirectional. It takes two AsyncRead + AsyncWrite halves and shuffles bytes between them forever. One function call and you've got a proxy.
It's useless for this problem.
The reason is that copy_bidirectional keeps its buffers internal. The bytes do pass through your process, but at no point are they exposed to you where you can do something about them. You do not get a hook. You cannot say "also, while you're moving these 4096 bytes, hand them to me so I can print them." There's no middle.
So I wrote the copy loop by hand. It looks like this:
async fn tee_loop<R, W>(
    src: &mut R,
    dst: &mut W,
    arrow: Arrow,
    sid: &str,
    cfg: &SessionConfig,
    sink: Arc<dyn LogSink>,
    shutdown: Arc<AtomicBool>,
) -> TeeOutcome
where
    R: AsyncReadExt + Unpin,
    W: AsyncWriteExt + Unpin,
{
    let mut buf = vec![0u8; 8 * 1024];
    loop {
        if shutdown.load(Ordering::Relaxed) {
            return TeeOutcome::Shutdown;
        }
        let read = match timeout(cfg.idle_timeout, src.read(&mut buf)).await {
            Ok(Ok(n)) => n,
            Ok(Err(_)) => return TeeOutcome::Error,
            Err(_) => return TeeOutcome::Timeout,
        };
        if read == 0 {
            let _ = dst.shutdown().await;
            return TeeOutcome::Eof;
        }
        let chunk = &buf[..read];
        // Log first. Then forward. If we reversed the order, a slow
        // upstream would make hex dumps appear to lag the connection.
        let record = crate::logger::format_chunk(
            cfg.format, sid, arrow, chunk, cfg.max_dump, cfg.no_dump,
        );
        sink.write_blocking(&record);
        if dst.write_all(chunk).await.is_err() {
            return TeeOutcome::Error;
        }
    }
}
One connection gets two of these loops running concurrently: one for client-to-server, one for server-to-client. Each owns one half of each socket (TcpStream::into_split gives you an owned read half and an owned write half that can live in different tasks). They share the log sink behind an Arc. When either loop ends, handle() awaits both and emits a single closed event with a reason derived from both outcomes.
The "log before you forward" ordering is the one thing I almost got wrong. If you log after the write_all, and the upstream is slow, your hex dump gets delayed by the upstream's writeback speed, and the log starts to lie about when the client sent bytes. Logging first makes "tcp-proxy saw these bytes at T" and "tcp-proxy shipped these bytes at T+ε" accurate in that order, which is what a person debugging would expect.
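The post doesn't show how handle() collapses the two loops' outcomes into the single closed: reason. Here is one plausible sketch; the precedence order (and the function name close_reason) are my assumptions, not the repo's:

```rust
/// Outcome of one tee loop, mirroring the TeeOutcome in the post.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum TeeOutcome {
    Eof,
    Error,
    Timeout,
    Shutdown,
}

/// Collapse the two directions' outcomes into one close reason.
/// Precedence is an assumption: an error on either half wins,
/// then idle timeout, then shutdown; only clean halves report "eof".
fn close_reason(a: TeeOutcome, b: TeeOutcome) -> &'static str {
    use TeeOutcome::*;
    match (a, b) {
        (Error, _) | (_, Error) => "error",
        (Timeout, _) | (_, Timeout) => "idle_timeout",
        (Shutdown, _) | (_, Shutdown) => "shutdown",
        _ => "eof",
    }
}

fn main() {
    // Matches the sample log: a clean EOF on both halves logs "closed: eof".
    assert_eq!(close_reason(TeeOutcome::Eof, TeeOutcome::Eof), "eof");
    assert_eq!(close_reason(TeeOutcome::Eof, TeeOutcome::Timeout), "idle_timeout");
    assert_eq!(close_reason(TeeOutcome::Error, TeeOutcome::Shutdown), "error");
    println!("ok");
}
```

The nice property of a total function over the outcome pair is that there is no "unknown" close reason: whatever combination the two loops return, the log always has something definite to say.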
Design call #2: pure formatter, impure sink
The hex dump formatter at the core of the logger is a pure function: it takes a &[u8] and a limit and returns a String.
pub fn format_chunk(bytes: &[u8], limit: usize) -> String {
    let mut out = String::new();
    let shown = if limit == 0 { bytes.len() } else { limit.min(bytes.len()) };
    let head = &bytes[..shown];
    for (line_idx, row) in head.chunks(16).enumerate() {
        let address = line_idx * 16;
        render_line(&mut out, address, row);
        out.push('\n');
    }
    if shown < bytes.len() {
        out.push_str(&format!("... {} more bytes ...\n", bytes.len() - shown));
    }
    out
}
Zero I/O, zero locking, zero tokio in the whole hexdump module. This matters for two reasons.
Testability. Unit tests for the hex dump layout assert on exact string output. Known input bytes in, known expected characters out. I can test "the ASCII panel pads short last rows to 16 chars wide" in one assertion without spinning up a runtime or mocking a writer.
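That assertion style is easy to show with a self-contained, simplified stand-in for render_line (the real repo's exact column layout may differ; this is a sketch of the testing idea, not its code):

```rust
/// Simplified stand-in for the post's render_line: 8-hex-digit offset,
/// sixteen 3-char hex cells, and an ASCII panel padded to 16 columns.
fn render_line(out: &mut String, addr: usize, row: &[u8]) {
    out.push_str(&format!("{:08x}  ", addr));
    for i in 0..16 {
        match row.get(i) {
            Some(b) => out.push_str(&format!("{:02x} ", b)),
            None => out.push_str("   "), // empty cell keeps columns aligned
        }
    }
    out.push('|');
    for i in 0..16 {
        let c = match row.get(i) {
            Some(&b) if (0x20..0x7f).contains(&b) => b as char,
            Some(_) => '.', // non-printable bytes become dots
            None => ' ',    // pad short last rows to 16 chars wide
        };
        out.push(c);
    }
    out.push_str("|\n");
}

fn main() {
    let mut s = String::new();
    render_line(&mut s, 0, b"PING\r\n");
    // Known input bytes in, known expected characters out: the expected
    // string is built with the same column widths the function promises.
    let expected = format!(
        "{:08x}  {}{}|{:<16}|\n",
        0, "50 49 4e 47 0d 0a ", " ".repeat(30), "PING..",
    );
    assert_eq!(s, expected);
    println!("{}", s);
}
```

No runtime, no mock writer: the whole layout contract is one string comparison.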
Atomic log records. The session handler builds the entire log record (the [sid=0001] C->S (14 bytes) header, a newline, and the full multi-line hex dump) into a single String, and then hands that string to the LogSink behind a mutex in one write_blocking call. A 4 MB chunk being hex-dumped by session 0001 cannot interleave with a 12-byte Redis ping from session 0002, because both log records are fully built before either holds the lock.
If I'd done it the more "obvious" way and had the hex dump write straight to stdout line-by-line, I'd have had to either hold a global stdout mutex for the entire duration of a dump (blocking all other sessions' logs) or accept interleaved output (unreadable). Building the string first makes both problems go away.
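The no-interleaving property can be demonstrated with a plain Mutex sink and threads standing in for sessions (a sketch of the pattern, not the repo's code; the names are mine):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Spawn `sessions` writers that each push `per` fully built multi-line
/// records into a shared sink, taking the lock only for the final push.
fn run_sessions(sessions: usize, per: usize) -> Vec<String> {
    let sink: Arc<Mutex<Vec<String>>> = Arc::new(Mutex::new(Vec::new()));
    let mut handles = Vec::new();
    for sid in 1..=sessions {
        let sink = Arc::clone(&sink);
        handles.push(thread::spawn(move || {
            for i in 0..per {
                // Build the whole record first (header + body lines)...
                let record = format!("[sid={:04}] chunk {}\nline-a\nline-b\n", sid, i);
                // ...then hold the lock only for a single append.
                sink.lock().unwrap().push(record);
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    Arc::try_unwrap(sink).unwrap().into_inner().unwrap()
}

fn main() {
    let records = run_sessions(4, 100);
    // Records from different sessions interleave with each other,
    // but every individual record is intact end-to-end.
    assert_eq!(records.len(), 400);
    assert!(records.iter().all(|r| r.starts_with("[sid=") && r.ends_with("line-b\n")));
    println!("ok");
}
```

The lock is held for one push, never while formatting, so a slow 4 MB dump in one session never blocks another session's log line.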
Design call #3: graceful shutdown without ceremony
You press Ctrl-C on this thing while it's forwarding connections. What should happen?
The rules I wanted:
- Stop accepting new connections immediately.
- Let in-flight connections finish whatever they're doing right now.
- But not forever: docker stop shouldn't hang.
The pattern I landed on is embarrassingly simple:
let shutdown = Arc::new(AtomicBool::new(false));
let shutdown_sig = shutdown.clone();

// Signal task: flip the flag on SIGINT / SIGTERM.
tokio::spawn(async move {
    let ctrl_c = async { let _ = signal::ctrl_c().await; };
    let term = async {
        if let Ok(mut s) = signal::unix::signal(SignalKind::terminate()) {
            s.recv().await;
        }
    };
    tokio::select! { _ = ctrl_c => {}, _ = term => {} }
    shutdown_sig.store(true, Ordering::Relaxed);
});

// Accept loop: breaks when the flag is set.
loop {
    if shutdown.load(Ordering::Relaxed) { break; }
    let accept = tokio::select! {
        r = listener.accept() => r,
        _ = wait_shutdown(shutdown.clone()) => break,
    };
    // ... spawn session handler ...
}

// Drain loop: wait for in-flight to finish, bounded by a deadline.
let deadline = Duration::from_secs(5);
let _ = tokio::time::timeout(deadline, async {
    while in_flight.load(Ordering::Relaxed) > 0 {
        drained.notified().await;
    }
}).await;
Session handlers check the same AtomicBool on each pass through the tee loop. Because the tee loop calls timeout(cfg.idle_timeout, src.read(...)), any stuck read is bounded by --timeout, so even a connection where neither side is saying anything can't hold shutdown forever: the idle timeout fires and the loop exits with TeeOutcome::Timeout, then the handler logs closed: idle_timeout and decrements the in-flight counter.
The Notify on the in-flight counter is what lets the main task wake up when the last connection finishes, instead of polling.
There's no fancy futures library here. An atomic bool, an atomic counter, and a Notify. The nice thing about signals is they're rare; you don't need them to be efficient, you need them to be correct.
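The same drain shape works outside tokio too. Here is a blocking analogue where a Condvar stands in for Notify and a mutex-guarded counter stands in for the atomic; it's a sketch under those substitutions, not the tool's code:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Wait for `workers` simulated in-flight connections to finish,
/// bounded by `deadline`. Returns true if the counter reached zero.
fn drain(workers: usize, deadline: Duration) -> bool {
    // Counter + Condvar: the blocking analogue of the post's
    // atomic in-flight counter + tokio::sync::Notify.
    let state = Arc::new((Mutex::new(workers), Condvar::new()));

    for _ in 0..workers {
        let state = Arc::clone(&state);
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(20)); // simulated in-flight work
            let (lock, cvar) = &*state;
            *lock.lock().unwrap() -= 1;
            cvar.notify_all(); // wake the drain loop instead of polling
        });
    }

    // Drain: sleep until the counter hits zero, bounded by the deadline.
    let (lock, cvar) = &*state;
    let guard = lock.lock().unwrap();
    let (guard, result) = cvar
        .wait_timeout_while(guard, deadline, |n| *n > 0)
        .unwrap();
    *guard == 0 && !result.timed_out()
}

fn main() {
    assert!(drain(3, Duration::from_secs(5)));
    println!("drained cleanly");
}
```

Either way the structure is the same: a flag to stop accepting, a counter of live sessions, and one wakeup primitive so the main task sleeps instead of spinning.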
Tradeoffs
I'm going to list what this tool does not do, because that's more useful than listing what it does.
No TLS termination. tcp-proxy sees the bytes exactly as they're on the wire, same as tcpdump. If the protocol is TLS-wrapped, you get ciphertext, which is useless for debugging the application layer. For that you want mitmproxy, or dumping session keys into SSLKEYLOGFILE and reading them back in Wireshark. Pretending to be a TLS terminator would triple the complexity of this tool and make a fork-and-customize much harder for the downstream user.
No record/replay. The log is append-only text. If you want to replay a session against a different upstream, or even just rewind a few seconds, this isn't the tool. (It would be a satisfying next project, though: tcp-proxy writes a binary journal, tcp-replay reads it back with configurable pacing.)
Reads are capped at the buffer size. Each tee loop has an 8 KiB buffer, so a chunk up to 8 KiB arrives in one read. If the upstream or client hands over more than that at once, the read() call takes however much fits, and the rest shows up on the next iteration. The --max-dump flag is what keeps log output bounded: the full chunk is still forwarded, but only the first N bytes are hex-dumped.
Byte-accurate, not frame-aware. It doesn't know about Redis's RESP framing, Postgres's message envelopes, or HTTP chunked encoding. If a RESP *3\r\n$3\r\nSET\r\n... arrives split across two kernel reads, you'll see two chunks of partial frame. A protocol-aware debugger would reframe; this one doesn't, and I think that's the right call for a general-purpose tool. You can always | less the output and visually reassemble.
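The read-chunking point above is observable with nothing but std: a 20 000-byte payload read through an 8 KiB buffer arrives as three chunks, and nothing is lost between iterations. (The exact sizes assume a source that always fills the buffer; a real socket read may return less.)

```rust
use std::io::Read;

/// Read `data` through a `buf_len`-byte buffer and record each chunk size,
/// the way one pass of the tee loop would see them.
fn chunk_sizes(data: &[u8], buf_len: usize) -> Vec<usize> {
    let mut src = data; // &[u8] implements Read
    let mut buf = vec![0u8; buf_len];
    let mut sizes = Vec::new();
    loop {
        let n = src.read(&mut buf).unwrap();
        if n == 0 {
            break; // EOF: the whole payload has been consumed
        }
        sizes.push(n);
    }
    sizes
}

fn main() {
    let payload = vec![0xab_u8; 20_000];
    let sizes = chunk_sizes(&payload, 8 * 1024);
    // 8192 + 8192 + 3616 = 20 000: split across reads, never dropped.
    assert_eq!(sizes, vec![8192, 8192, 3616]);
    assert_eq!(sizes.iter().sum::<usize>(), payload.len());
    println!("{:?}", sizes);
}
```

This is also why a single protocol frame can show up as two logged chunks: the chunk boundary is wherever read() happened to stop, not where the frame ends.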
Try it in 30 seconds
# Clone and run via Docker. The image is under 10 MB.
git clone https://github.com/sen-ltd/tcp-proxy && cd tcp-proxy
docker build -t tcp-proxy .
# Run against whatever is already listening locally. For example,
# if you have Redis on :6379, point tcp-proxy at a new port and
# connect clients there instead.
docker run --rm --network host tcp-proxy \
--listen 127.0.0.1:6380 --forward 127.0.0.1:6379
# In another shell:
redis-cli -p 6380 PING
redis-cli -p 6380 SET hello world
redis-cli -p 6380 GET hello
You'll see every byte. That's the whole point.
What I'd change next
If I were to push this further, the things on the list are: optional ANSI color (printable ASCII in one color, control chars in another); a --replay FILE mode that reads a captured journal instead of proxying live; a --stats panel that shows per-session bytes/sec in a corner of the terminal; and maybe a --redact /PATTERN/ flag that blanks out matching bytes in the log without affecting the forwarded traffic.
None of those are needed for the "what are my bytes doing" question, though. The 400-line version already answers that one, and that's the version I actually keep in my tmux pane.
SEN 合同会社 builds small, sharp developer tools in public. This is entry #167 in the 100+ series.
