There’s a special kind of pain only backend engineers know: everything in your service looks perfectly optimized — goroutines tight, DB tuned, caches warm — yet your p95 latency stubbornly stays above the SLA budget.
That was me two years ago.
We had a Go service pushing ~40k responses/sec at peak. CPU usage was rising faster than traffic. Latency graphs showed small but consistent spikes around serialization. At first I ignored them — “JSON is fine”, I told myself. “Encoding isn’t that expensive.”
I was wrong.
Serialization turned out to be one of the most underestimated sources of latency in Go systems. Fixing it didn’t just remove a bottleneck — it reduced our infrastructure cost by ~28%.
This article is the deep dive I wish I had back then.
🟢 1. The Silent Killer: Serialization on the Hot Path
Serialization is always on the critical path:
- reading from DB → serialize to cache
- writing to Kafka → serialize message
- sending HTTP response → serialize to JSON
- distributed systems → serialize across RPC
- snapshotting → serialize to disk
Even small inefficiencies compound massively under load.
In our service, each serialization step was adding:
- +0.4 ms p50
- +1.2 ms p95
At our ~40k responses/sec, that 0.4 ms per call works out to roughly 16 CPU-seconds of encoding work every wall-clock second → a CPU furnace.
🟡 2. Where Serialization Lag Comes From (Real Causes)
Let’s break it down.
1) Reflection Overhead (JSON)
Go's encoding/json is fantastic for convenience… and terrible for performance.
Reflection dominates the flamegraph:
```text
reflect.Value.Interface
reflect.Value.Field
encodeState.reflectValue
```
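You can see this locally with a micro-benchmark (`Payload` here is a made-up stand-in, not our real struct; drop it in a `_test.go` file and run `go test -bench=. -benchmem`). The allocs/op column tells the reflection story:

```go
package payload

import (
	"encoding/json"
	"testing"
)

// Payload is a stand-in for a typical API response struct.
type Payload struct {
	ID    int64    `json:"id"`
	Name  string   `json:"name"`
	Tags  []string `json:"tags"`
	Score float64  `json:"score"`
}

func BenchmarkJSONMarshal(b *testing.B) {
	p := Payload{ID: 1, Name: "order", Tags: []string{"a", "b"}, Score: 0.92}
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		if _, err := json.Marshal(&p); err != nil {
			b.Fatal(err)
		}
	}
}
```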
2) Excessive Allocations
Dynamic encoding = tons of small allocations → GC pressure → latency spikes.
3) String Encoding Costs
Converting everything to string (numbers, booleans, timestamps) is expensive.
4) Deep Struct Trees
Nested structs = recursive reflection = death by a thousand cuts.
5) Repeated Schema Discovery
JSON repeatedly re-discovers field names and tags.
6) Large Payloads
Serialization grows linearly with size — and sometimes worse.
🔵 3. Flamegraph Example: What We Saw in Production
This is a simplified mock of the actual flamegraph segment:
```text
45% CPU  encoding/json.Marshal
28%      reflect.Value.Field
 7%      encodeState.string
 5%      encodeState.structEncoder
 4%      map iteration
```
This explains EVERYTHING:
- Almost half of CPU load was just serializing.
- Nothing else in the system consumed more CPU.
When serialization dominates, optimizing queries or handlers is meaningless.
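To capture this view from your own service, the stock net/http/pprof setup is enough (the port and duration below are just examples):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// Expose the profiler on a side port, then grab a 30s CPU profile with:
	//   go tool pprof -http=:8080 http://localhost:6060/debug/pprof/profile?seconds=30
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```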
🟣 4. Step-by-Step: How We Reduced Serialization Lag
Now — the real meat.
Here’s exactly what worked (in descending order of impact).
🔥 Step 1 — Swap JSON for a Faster Library
This alone gave us a 40–55% improvement.
Options (best → good):
- jsoniter (drop-in replacement, faster)
- EasyJSON (code generation, zero reflection)
- GoJay (stream-based, absurdly fast)
Why it works:
These libraries eliminate reflection (via code generation) or reduce it significantly.
Example:

```go
import jsoniter "github.com/json-iterator/go"

// Drop-in: same Marshal/Unmarshal surface as encoding/json.
var json = jsoniter.ConfigCompatibleWithStandardLibrary

data, err := json.Marshal(payload)
```
Latency drop: ~35–45%
CPU drop: ~30%
Almost no code changes.
🟢 Step 2 — Use Binary Formats (MsgPack / Protobuf)
The biggest structural win.
Switching to MessagePack or Protobuf improves:
- encode speed
- decode speed
- payload size
- GC pressure
- memory locality
Gains we measured:
| Format | Improvement |
|---|---|
| MessagePack | ~3× faster serialization |
| Protobuf | ~6× faster serialization |
When to choose what:
- MessagePack: quick win, minimal friction
- Protobuf: long-term maximum performance
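A minimal sketch with MessagePack, assuming github.com/vmihailenco/msgpack/v5 (the `Payload` type is illustrative):

```go
package main

import (
	"fmt"

	"github.com/vmihailenco/msgpack/v5"
)

type Payload struct {
	ID   int64  `msgpack:"id"`
	Name string `msgpack:"name"`
}

func main() {
	// Encode to a compact binary form: no text conversion for numbers.
	data, err := msgpack.Marshal(&Payload{ID: 42, Name: "order"})
	if err != nil {
		panic(err)
	}

	var out Payload
	if err := msgpack.Unmarshal(data, &out); err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes, decoded: %+v\n", len(data), out)
}
```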
🟡 Step 3 — Preallocate Buffers
This is a trick few Go engineers use.
Avoid allocating a fresh output slice on every call:

```go
data, _ := json.Marshal(v) // allocates a new []byte every time
```

Prefer encoding into a preallocated buffer:

```go
buf := make([]byte, 0, 1024) // capacity sized for a typical payload
enc := json.NewEncoder(bytes.NewBuffer(buf))
```

Or, for protobuf, append into a reusable slice (google.golang.org/protobuf; msgpack encoders can write into a reused buffer the same way):

```go
buf := make([]byte, 0, 512)
data, err := proto.MarshalOptions{}.MarshalAppend(buf, payload)
```
Why it works:
You remove reallocations → fewer copies → fewer GC cycles.
Typical improvement: 5–15%.
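A sketch of taking this one step further with a pooled buffer (`encodePooled` is an illustrative helper name, not from our codebase):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"sync"
)

// bufPool hands out reusable encode buffers instead of allocating per call.
var bufPool = sync.Pool{
	New: func() any { return bytes.NewBuffer(make([]byte, 0, 1024)) },
}

func encodePooled(v any) ([]byte, error) {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // pooled buffers come back dirty
	defer bufPool.Put(buf)

	if err := json.NewEncoder(buf).Encode(v); err != nil { // Encode appends '\n'
		return nil, err
	}
	// Copy out before the buffer returns to the pool and gets reused.
	out := make([]byte, buf.Len())
	copy(out, buf.Bytes())
	return out, nil
}

func main() {
	data, err := encodePooled(map[string]int{"a": 1})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s", data)
}
```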
🔵 Step 4 — Remove Unnecessary Fields (The Ruthless Pass)
We audited the payload.
We asked:
“Does the client actually need this?”
Turns out 20–35% of fields were never used.
Removing unused data reduced encoding time by ~12% and network weight by ~25%.
This is the easiest optimization mentally — the hardest politically.
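In Go the mechanics are mostly struct tags; a sketch of the pattern (`Order` and its fields are made up):

```go
type Order struct {
	ID    int64   `json:"id"`
	Total float64 `json:"total"`

	// Internal-only: json:"-" drops the field from the payload entirely.
	InternalRef string `json:"-"`

	// Rarely populated: omitempty skips it when empty.
	CouponCode string `json:"coupon_code,omitempty"`
}
```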
🟣 Step 5 — Avoid Encoding Large Collections Repeatedly
If a payload contains:
- a big list
- nested structures
- computed metadata
…cache the encoded version.
Example. Instead of re-encoding on every request:

```go
data, _ := json.Marshal(bigList) // same bytes recomputed thousands of times
```

encode once and serve the cached bytes:

```go
cachedBytes.Store(key, encoded) // e.g. a sync.Map of key -> []byte
```
We saw 20% latency reduction on endpoints using heavy lists.
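A minimal, runnable version of that pattern with sync.Map (`encodedList` is an illustrative helper; cache invalidation is left out of the sketch):

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

var encodedCache sync.Map // cache key -> pre-encoded []byte

// encodedList returns cached bytes when present and encodes on a miss.
// Concurrent misses may encode twice; the last Store wins, which is harmless here.
func encodedList(key string, bigList []string) ([]byte, error) {
	if v, ok := encodedCache.Load(key); ok {
		return v.([]byte), nil
	}
	data, err := json.Marshal(bigList)
	if err != nil {
		return nil, err
	}
	encodedCache.Store(key, data)
	return data, nil
}

func main() {
	data, _ := encodedList("top-sellers", []string{"a", "b", "c"})
	fmt.Println(string(data))
}
```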
🔥 Step 6 — Use Streaming Encoders
For large responses (10KB+), use streaming:
```go
enc := json.NewEncoder(w) // streams straight into the ResponseWriter
if err := enc.Encode(object); err != nil {
	log.Printf("encode: %v", err) // headers are already sent; just log
}
```
This prevents huge temporary buffers → improves memory locality.
Improvement: 10–20% on large payloads.
🧠 Step 7 — Flatten Structs
Deep nested structures cause:
- recursive reflection
- many allocations
- excessive pointer chasing
Flattening them (denormalizing slightly) improved performance by 5–10%.
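A before/after sketch (these shapes are hypothetical):

```go
type Contact struct {
	Email string `json:"email"`
	Phone string `json:"phone"`
}

type Profile struct {
	Contact *Contact `json:"contact"`
}

// Before: every level means another recursive reflection pass and,
// with pointers, another allocation and cache miss.
type UserNested struct {
	Profile *Profile `json:"profile"`
}

// After: the same fields, one flat struct, one pass.
type UserFlat struct {
	Email string `json:"email"`
	Phone string `json:"phone"`
}
```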
🧪 Step 8 — Avoid Interfaces in Hot Structs
Go's encoders must inspect the dynamic type behind every interface{} at runtime → slow path.
If your struct has:

```go
Meta map[string]interface{}
```

you are sabotaging your own performance.
Replace it with typed maps or concrete structs (generics can help where the shape is parametric).
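A before/after sketch (Meta's real shape depends on your domain):

```go
// Before: the encoder must inspect the dynamic type of every value.
type EventBefore struct {
	Meta map[string]interface{} `json:"meta"`
}

// After: a typed map keeps the encoder on the static fast path.
type EventAfter struct {
	Meta map[string]string `json:"meta"`
}
```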
📉 Actual Final Results After Applying All Steps
Our real improvements after a 3-week serialization refactor:
| Metric | Before | After |
|---|---|---|
| p50 latency | ~2.4 ms | ~1.1 ms |
| p95 latency | ~6.7 ms | ~2.4 ms |
| CPU usage | baseline | −28% |
| GC pauses | frequent | rare |
| Node count | 12 | 8 |
Those savings were worth thousands of dollars monthly.
Serialization was the hidden bottleneck all along.
🧠 Senior-Level Lessons Learned
- Serialization lies on the exact hot path → always profile it.
- JSON is never “free”.
- Libraries matter more than you expect.
- Buffer allocations are invisible killers.
- Binary formats boost performance across the stack.
- Schema control (protobuf) = speed + clarity + evolution.
- Interfaces kill serialization speed.
- CPU flamegraphs don’t lie.
🎯 Real Recommendations (Practical)
Best “drop-in” upgrade
→ jsoniter
Easiest, big gains, minimal risk.
Best long-term upgrade
→ Protobuf
For serious scalability.
Best mid-level win
→ MessagePack
Perfect when you want speed without .proto overhead.
Best structural improvement
→ removing fields, flattening structs, reducing nesting.
📚 If You Want to Go Deeper
These Educative courses shaped my understanding of performance and distributed systems:
Highly recommended (affiliate friendly).