DEV Community: Kshitij Sharma

Why It Is Important to Build a Good API Rather Than a Clean API

Kshitij Sharma — Thu, 23 Apr 2026 09:25:56 +0000

When Your “Clean” REST API Becomes a Production Nightmare

#api #backend #systemdesign #webdev

Kshitij Sharma — Thu, 23 Apr 2026 04:41:21 +0000

Everything looked perfect on paper.

Clean endpoints
Nice resource naming
Proper HTTP methods

Then production hit:

Clients started retrying aggressively
Data inconsistencies appeared
Versioning became a mess
One change broke 3 consumers

That’s when reality kicks in:

REST API design is not about elegance — it’s about survivability under change.

The Real Constraints of REST APIs in Production

You’re not designing endpoints.
You’re designing contracts under uncertainty.

What actually shapes your API:

Multiple clients (web, mobile, third-party)
Network unreliability
Backward compatibility pressure
Partial failures
Latency budgets
Data ownership boundaries

Ignoring these = brittle APIs that collapse under scale.

Resource Modeling Is Where Most People Fail

Everyone talks about /users and /orders.

That’s surface-level.

The real question:

What is the lifecycle of your resource?

Bad Design (naive CRUD mindset)

POST /orders
GET /orders/:id
PUT /orders/:id
DELETE /orders/:id

Looks fine. Completely wrong for real systems.

Why?

Orders aren’t freely mutable
State transitions matter (created → paid → shipped)
Business rules are ignored

Model State Transitions Explicitly

Better:

POST   /orders
POST   /orders/:id/pay
POST   /orders/:id/ship
POST   /orders/:id/cancel

Now:

You encode business logic in the API
You prevent invalid transitions
You reduce client-side bugs
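One way to back these endpoints is a single transition table that every handler consults. This is a minimal sketch, not a full workflow engine: the `TRANSITIONS` table and `applyTransition` are hypothetical names, and the rule that `cancel` is allowed from `created` and `paid` is an assumption made for illustration.

```javascript
// Hypothetical transition table: action -> allowed source states and resulting state.
const TRANSITIONS = new Map([
  ['pay', { from: ['created'], to: 'paid' }],
  ['ship', { from: ['paid'], to: 'shipped' }],
  ['cancel', { from: ['created', 'paid'], to: 'cancelled' }], // assumed business rule
]);

// Returns a copy of the order in its next state, or throws if the action
// is unknown or not legal from the order's current state.
function applyTransition(order, action) {
  const rule = TRANSITIONS.get(action);
  if (!rule) {
    throw new Error(`Unknown action: ${action}`);
  }
  if (!rule.from.includes(order.status)) {
    throw new Error(`Cannot ${action} an order in state "${order.status}"`);
  }
  return { ...order, status: rule.to };
}
```

Each `POST /orders/:id/<action>` handler can then call `applyTransition(order, action)` and translate a thrown error into a client error, so the rules live in one place.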

Idempotency: The Thing That Saves You From Chaos

Most APIs break under retries.

Reality:

Clients retry
Proxies retry
Load balancers retry

If your endpoint isn’t idempotent → duplicate operations.

Real Failure Case

Payment API:

POST /payments

Client times out → retries → duplicate charge.

Congrats, you just lost user trust.

Fix: Idempotency Keys

POST /payments
Idempotency-Key: 8f3a-xyz-123

Server logic:

if (exists(idempotencyKey)) {
  return previousResponse(idempotencyKey);
}

const result = processPayment();
storeResult(idempotencyKey, result);
return result;
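The check-then-store logic above still has a gap: two retries that arrive at the same time can both pass the existence check before either stores a result. A hedged sketch of one way to close it is to store the in-flight promise itself. `runOnce` and the in-memory `Map` are hypothetical; a real service would need shared, durable storage with an expiry.

```javascript
// In-memory store for illustration only (assumption: single process, no expiry).
const inFlightOrDone = new Map();

// Runs `operation` at most once per key. Concurrent callers with the same key
// share one promise instead of racing past the existence check.
function runOnce(idempotencyKey, operation) {
  if (inFlightOrDone.has(idempotencyKey)) {
    return inFlightOrDone.get(idempotencyKey);
  }
  const promise = Promise.resolve()
    .then(operation)
    .catch((err) => {
      // Forget failed attempts so the client can retry with the same key.
      inFlightOrDone.delete(idempotencyKey);
      throw err;
    });
  inFlightOrDone.set(idempotencyKey, promise);
  return promise;
}
```

Whether a failed attempt should be forgotten or replayed is a design choice; forgetting it is the assumption made here.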

Partial Failure Handling (The Silent Killer)

Your API calls:

DB
Cache
External service

One fails.

Now what?

Most APIs:

Return 500 and pray.

That’s not a strategy.

Better Approach: Explicit Failure Semantics

Return partial success where valid
Use compensating actions
Log correlation IDs

Example:

{
  "status": "partial_success",
  "data": {...},
  "failed_dependencies": ["inventory-service"]
}
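A response like this can be assembled with `Promise.allSettled`, which waits for every call and reports each outcome separately. The sketch below assumes a hypothetical `dependencies` object that maps a dependency name to an async function; `gatherWithPartialFailure` is not from any library.

```javascript
// Calls every dependency and reports which ones failed instead of
// failing the whole request on the first error.
async function gatherWithPartialFailure(dependencies) {
  const names = Object.keys(dependencies);
  // The async wrapper turns synchronous throws into rejections as well.
  const results = await Promise.allSettled(
    names.map(async (name) => dependencies[name]())
  );

  const data = {};
  const failed = [];
  results.forEach((result, i) => {
    if (result.status === 'fulfilled') {
      data[names[i]] = result.value;
    } else {
      failed.push(names[i]);
    }
  });

  let status = 'success';
  if (names.length > 0 && failed.length === names.length) {
    status = 'failure';
  } else if (failed.length > 0) {
    status = 'partial_success';
  }
  return { status, data, failed_dependencies: failed };
}
```

Only use this where a partial answer is actually valid for the client; a payment that half-succeeded is not one of those cases.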

Versioning: Where APIs Go to Die

Naive approach:

/v1/users
/v2/users

Problem:

You now maintain 2 systems forever
Clients don’t migrate

Better Strategy: Evolution Over Versioning

Add fields, don’t remove
Use default values
Deprecate gradually
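On the server side, "add fields with defaults" can be as small as the sketch below. The field names `currency` and `giftWrap` are hypothetical examples of fields added after the first release.

```javascript
// Hypothetical defaults for request fields introduced after the first release.
const REQUEST_DEFAULTS = { currency: 'USD', giftWrap: false };

// Older clients omit the newer fields; fill them in rather than rejecting the request.
// Values the client did send always win over the defaults.
function normalizeOrderRequest(body) {
  return { ...REQUEST_DEFAULTS, ...body };
}
```

An old client sending `{ items: [1] }` and a new client sending `{ items: [1], currency: 'EUR' }` both get a complete object, so no version bump is needed.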

When Versioning Is Actually Needed

Breaking contract changes
Semantic shifts (not just fields)

Even then:

Prefer header-based versioning

Accept: application/vnd.myapi.v2+json
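Reading the version out of that header takes a few lines. This is a sketch under the assumption that the media type always follows the `application/vnd.myapi.vN+json` pattern shown above; `resolveApiVersion` is a hypothetical helper and it ignores quality values in the `Accept` header.

```javascript
// Extracts N from application/vnd.myapi.vN+json.
// Falls back to defaultVersion when the header is missing or does not match.
function resolveApiVersion(acceptHeader, defaultVersion = 1) {
  const match = /application\/vnd\.myapi\.v(\d+)\+json/.exec(acceptHeader || '');
  return match ? Number(match[1]) : defaultVersion;
}
```

Defaulting to the oldest supported version keeps existing clients working when they send a plain `application/json`.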

Overfetching vs Underfetching

Classic REST problem.

Overfetching

GET /users/:id

Returns:

Name
Email
Address
Preferences
Activity logs

Client only needs name.

Waste:

bandwidth
latency

Underfetching

Client needs:

user
orders
payments

Makes 3 calls.

If the calls run sequentially, their round trips add up and latency grows with every dependency.

Practical Fix: Controlled Expansion

GET /users/:id?include=orders,payments

Trade-off:

More complex backend
Better client efficiency
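The "controlled" part matters: an unrestricted `include` lets any client trigger arbitrary joins. A minimal sketch of parsing the parameter against an allowlist follows; `ALLOWED_INCLUDES` and `parseIncludes` are hypothetical names.

```javascript
// Relations a client is allowed to expand (assumption: matches the example above).
const ALLOWED_INCLUDES = new Set(['orders', 'payments']);

// Turns "orders,payments" into ['orders', 'payments'].
// Throws on anything outside the allowlist so the handler can answer with a 400.
function parseIncludes(includeParam) {
  if (typeof includeParam !== 'string' || includeParam.trim() === '') {
    return [];
  }
  const requested = includeParam.split(',').map((s) => s.trim()).filter(Boolean);
  const unknown = requested.filter((name) => !ALLOWED_INCLUDES.has(name));
  if (unknown.length > 0) {
    throw new Error(`Unsupported include: ${unknown.join(', ')}`);
  }
  return [...new Set(requested)]; // drop duplicates, keep order
}
```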

Implementation: What a Production-Ready API Looks Like

Express.js Example (Opinionated Structure)

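const crypto = require('node:crypto');
const express = require('express');

const app = express();
app.use(express.json()); // parse JSON bodies so req.body is populated

// Middleware: request ID for tracing
app.use((req, res, next) => {
  req.id = crypto.randomUUID();
  next();
});

// Idempotency store (in-memory for the example; use shared storage with a TTL in production)
const store = new Map();

// processPayment, getOrder and shipOrder are application functions defined elsewhere
app.post('/payments', async (req, res, next) => {
  const key = req.get('Idempotency-Key');
  if (!key) {
    return res.status(400).json({ error: 'Idempotency-Key header is required' });
  }

  // Replay the stored response instead of charging again
  if (store.has(key)) {
    return res.json(store.get(key));
  }

  try {
    const result = await processPayment(req.body);
    store.set(key, result);
    res.json(result);
  } catch (err) {
    next(err); // Express 4 does not catch rejections from async handlers
  }
});

// Explicit state transition
app.post('/orders/:id/ship', async (req, res, next) => {
  try {
    const order = await getOrder(req.params.id);
    if (!order) {
      return res.status(404).json({ error: 'Order not found' });
    }

    // 409 Conflict: the request is well-formed but not valid for the current state
    if (order.status !== 'paid') {
      return res.status(409).json({ error: 'Invalid state' });
    }

    await shipOrder(order);
    res.json({ status: 'shipped' });
  } catch (err) {
    next(err);
  }
});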

Common Mistakes That Kill APIs

❌ Treating REST Like CRUD

You ignore:

business logic
state transitions
invariants

❌ Ignoring Timeouts and Retries

Your system works… until network instability hits.

❌ No Observability

No:

request IDs
structured logs
tracing

Debugging becomes guessing.
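The cheapest fix is one JSON log line per event that always carries the request ID. A minimal sketch, assuming logs go to stdout and that `logEvent` is a hypothetical helper rather than any particular logging library:

```javascript
// Emits one JSON object per line so log tooling can filter by requestId.
// Returns the line as well, which makes the helper easy to test.
function logEvent(requestId, message, fields = {}) {
  // Spread fields first so callers cannot overwrite ts, requestId or message.
  const entry = { ...fields, ts: new Date().toISOString(), requestId, message };
  const line = JSON.stringify(entry);
  console.log(line);
  return line;
}
```

With this in place, `logEvent(req.id, 'payment accepted', { orderId })` ties every line back to a single request.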

❌ Tight Coupling to DB Schema

Changing DB → breaks API

Fix:

API is a contract, not a reflection of your database
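In practice that means an explicit mapping layer between rows and responses. The column names below (`user_id`, `full_name`, `email_address`, `password_hash`) are hypothetical; the point is that renaming a column changes this one function and nothing the client sees.

```javascript
// Maps a database row to the public response shape.
// Only fields listed here ever leave the service.
function toUserResponse(row) {
  return {
    id: String(row.user_id),
    name: row.full_name,
    email: row.email_address,
  };
}
```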

❌ Misusing HTTP Status Codes

People do:

200 OK (with error inside body)

Or:

500 for everything

Both are wrong.
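A small mapping from error type to status code avoids both habits. The error classes here are hypothetical application types, and the mapping is a sketch of one reasonable convention rather than a complete list.

```javascript
// Hypothetical application error types.
class ValidationError extends Error {}
class NotFoundError extends Error {}
class ConflictError extends Error {}

// Client mistakes get 4xx; only genuinely unexpected failures fall through to 500.
function statusFor(err) {
  if (err instanceof ValidationError) return 400;
  if (err instanceof NotFoundError) return 404;
  if (err instanceof ConflictError) return 409;
  return 500;
}
```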

Trade-offs You Can’t Escape

Flexibility vs Simplicity

Flexible APIs → harder to maintain
Simple APIs → limited use cases

Performance vs Consistency

Strong consistency → slower
Eventual consistency → complex

Versioning vs Evolution

Versioning → fragmentation
Evolution → constraints on change

Abstraction vs Control

High abstraction → easy usage
Low abstraction → better performance

What a Mature REST API Actually Looks Like

Explicit state transitions
Idempotent operations
Backward-compatible changes
Observability baked in
Controlled data fetching
Failure-aware responses

Final Reality Check

If your API:

breaks on retries
can’t evolve without versioning chaos
hides business logic
lacks observability

It’s not production-ready.

Key Takeaways

REST is not CRUD — it’s contract design under failure
Idempotency is non-negotiable
State transitions must be explicit
Versioning is a last resort, not default
Most failures come from network behavior, not code
API design is about handling bad conditions, not ideal flows

If you design APIs assuming everything works perfectly,
your system will fail the moment it doesn’t.

Understand the Mechanism Behind an API That Fails "Randomly"

Kshitij Sharma — Wed, 22 Apr 2026 17:41:09 +0000

When Your API “Randomly” Starts Timing Out

#backend #webdev #networking #distributedsystems

Kshitij Sharma — Wed, 22 Apr 2026 17:30:48 +0000

You deploy a perfectly fine service. Load tests passed. Latency looked clean. Then production hits—and suddenly:

P95 latency spikes
Requests hang without logs
CPU is fine, memory is fine… but users are screaming

This isn’t a “bug.” This is you not understanding the actual HTTP request lifecycle beyond the textbook diagram.

If you don’t know what really happens between a client sending a request and your handler returning a response, you’re flying blind.

HTTP Request Lifecycle — What Actually Happens Under the Hood

Forget diagrams like Client → Server → Response. That’s marketing-level abstraction.

A real request goes through:

1. Connection Establishment

DNS resolution
TCP handshake (3-way)
TLS handshake (if HTTPS)
Connection pooling / reuse (keep-alive)

2. Kernel → User Space Transition

NIC receives packet → kernel buffer
Socket read readiness via epoll/kqueue
Data copied into user space buffers

3. HTTP Parsing

Raw bytes → protocol parsing (headers, method, path)
Chunked decoding / content-length validation
Header normalization

4. Routing & Middleware Chain

Path matching (often regex or trie-based)
Middleware execution (auth, logging, rate limiting)

5. Business Logic Execution

DB calls
External APIs
CPU-bound work

6. Response Construction

Serialization (JSON, protobuf, etc.)
Compression (gzip, brotli)

7. Write Back to Socket

Kernel send buffer
TCP congestion control
Potential partial writes

8. Connection Lifecycle Decision

Keep-alive reuse vs close
Idle timeout tracking

Miss any one of these layers, and you’ll misdiagnose production issues.

Where Systems Actually Break

Let’s cut the theory. Real failures:

🔴 1. Head-of-Line Blocking in Connection Pools

You think you're async, but your HTTP client pool is exhausted.

Result:

Requests queue waiting for a free connection
Latency explodes without CPU increase
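In Node.js the outbound pool is an `http.Agent`, and the queue is visible if you look for it. The sketch below uses the agent's `sockets`, `freeSockets` and `requests` maps; the limit of 50 is illustrative, not a recommendation, and `poolStats` is a hypothetical helper.

```javascript
const http = require('node:http');

// Bounded, reusable pool for outbound calls.
const agent = new http.Agent({ keepAlive: true, maxSockets: 50 });

// Sums the per-host lists that http.Agent exposes.
function countAll(perHost) {
  return Object.values(perHost).reduce((sum, list) => sum + list.length, 0);
}

// agent.requests holds requests still waiting for a socket: the queueing signal.
function poolStats(a = agent) {
  return {
    active: countAll(a.sockets),
    idle: countAll(a.freeSockets),
    queued: countAll(a.requests),
  };
}
```

Exporting `queued` as a metric shows pool exhaustion directly, where CPU and memory graphs stay flat.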

🔴 2. Slow Clients = Resource Leaks

If a client reads slowly:

Your server keeps buffers open
Threads/event-loop slots remain occupied

This is the same resource-exhaustion pattern that slowloris and slow-read attacks exploit.
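Node's `http.Server` exposes timeouts that bound how long a slow peer can hold a connection. A sketch with illustrative values follows; `createGuardedServer` is a hypothetical helper, and the right numbers depend on your traffic.

```javascript
const http = require('node:http');

// Creates a server with explicit limits on how long a client may take.
function createGuardedServer(handler) {
  const server = http.createServer(handler);
  server.headersTimeout = 10_000;  // max time to receive the complete request headers
  server.requestTimeout = 30_000;  // max time to receive the entire request
  server.keepAliveTimeout = 5_000; // idle time allowed between requests on one connection
  server.setTimeout(60_000);       // socket inactivity timeout
  return server;
}
```

These limits protect the app process, but a reverse proxy in front is still the better place to absorb slow clients.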

🔴 3. Middleware Abuse

Stacking 10 middlewares sounds clean.

Reality:

Each adds latency
Each may block (logging, auth calls)
Hard to reason about ordering

🔴 4. TLS Handshake Overhead

Without reuse:

Every new connection pays extra round trips on top of the TCP handshake: about 1 RTT with TLS 1.3, 2 RTT with TLS 1.2
CPU spikes due to crypto

🔴 5. Kernel Buffer Backpressure

Your app “sent” the response.

Kernel says:

Nope, buffer full. Try later.

If you ignore this:

Writes block
Event loop stalls
Throughput collapses

Architecture Decisions That Actually Matter

1. Thread-per-request vs Event Loop

Thread-per-request (e.g., classic Java)

Pros:

Simpler mental model
Blocking code is fine

Cons:

Context switching overhead
Memory per thread (~1MB stack)

Event-driven (Node.js, Netty, Go runtime hybrid)

Pros:

High concurrency
Efficient IO

Cons:

Blocking = catastrophic
Debugging harder

2. Reverse Proxy in Front (NGINX / Envoy)

You should not expose your app server directly.

Why:

Handles TLS termination
Absorbs slow clients
Better connection management

3. Connection Reuse Strategy

Bad:

New TCP per request

Better:

HTTP/1.1 keep-alive

Best:

HTTP/2 multiplexing

Trade-off:

HTTP/2 removes head-of-line blocking at the HTTP layer but is still exposed to it at the TCP layer: one lost packet stalls every multiplexed stream
QUIC (HTTP/3) fixes it but adds complexity

Implementation: What This Looks Like in Code

Example: Minimal HTTP Server (Node.js — showing lifecycle touchpoints)

const http = require('http');

const server = http.createServer((req, res) => {
  // 1. Request received (already parsed by Node's HTTP parser)

  // 2. Middleware simulation
  const start = Date.now();

  if (req.headers['x-block']) {
    // simulate bad middleware
    while (Date.now() - start < 100) {}
  }

  // 3. Business logic
  setTimeout(() => {
    const responseBody = JSON.stringify({ ok: true });

    // 4. Response write
    res.setHeader('Content-Type', 'application/json');
    res.setHeader('Content-Length', Buffer.byteLength(responseBody));

    res.write(responseBody);

    // 5. End response (flush to kernel)
    res.end();
  }, 10);
});

// 6. Connection-level tuning
server.keepAliveTimeout = 5000;
server.headersTimeout = 6000;

server.listen(3000);

Where This Code Lies to You

You don’t see TCP
You don’t see kernel buffers
You don’t control backpressure explicitly
You don’t see partial writes

That abstraction is convenient—and dangerous.

Advanced Concern: Backpressure Handling

Most people ignore this. That’s why systems collapse under load.

Example (Node.js stream backpressure):

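function writeResponse(res, data) {
  return new Promise((resolve) => {
    // write() returns false once the stream's internal buffer passes highWaterMark
    if (res.write(data)) {
      resolve();
    } else {
      // Stop producing until the buffered data has been flushed
      res.once('drain', resolve);
    }
  });
}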

If you ignore this:

Memory spikes
Latency spikes
Eventually crashes
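Hand-written drain handling is easy to get wrong, so when the data already comes from a stream or an iterable, `pipeline` from `node:stream/promises` is often the simpler option: it pauses the source whenever the destination's buffer is full. A minimal sketch, where `streamChunks` is a hypothetical helper and `destination` could be an `http.ServerResponse`:

```javascript
const { Readable } = require('node:stream');
const { pipeline } = require('node:stream/promises');

// Streams chunks to a writable destination while respecting backpressure.
// pipeline() also ends the destination and propagates errors from either side.
async function streamChunks(chunks, destination) {
  await pipeline(Readable.from(chunks), destination);
}
```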

Failure Case: Timeout Mismatch Hell

You configure:

Load balancer timeout: 60s
App server timeout: 30s
DB timeout: 10s

What happens?

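DB times out at 10s → app retries, and three attempts consume the whole 30s app budget
Retries not bounded by the app deadline keep the handler running → LB kills the connection at 60s
Client retries → duplicate work on top of work still in flight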

Result:

Cascade failure
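One way to keep retries from outliving the caller is to give the whole operation a single deadline and shrink each attempt to fit what is left. This is a sketch under assumptions: `attempt(timeoutMs)` is a hypothetical async function that enforces its own per-call timeout, and `retryWithinDeadline` is not from any library.

```javascript
// Retries `attempt` while staying inside one overall deadline.
async function retryWithinDeadline(attempt, { deadlineMs, attemptTimeoutMs, maxAttempts }) {
  const startedAt = Date.now();
  let lastError;
  for (let i = 0; i < maxAttempts; i += 1) {
    const remaining = deadlineMs - (Date.now() - startedAt);
    if (remaining <= 0) {
      break; // budget spent: stop instead of outliving the caller
    }
    try {
      // Never give one attempt more time than the overall budget has left.
      return await attempt(Math.min(attemptTimeoutMs, remaining));
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError ?? new Error('Deadline exceeded before any attempt ran');
}
```

With the numbers above, a handler could pass `deadlineMs` a little under 30000 so it always answers before its own server timeout, and long before the load balancer gives up.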

Trade-offs You Can’t Avoid

Latency vs Throughput

Small buffers → lower latency, more syscalls
Large buffers → better throughput, worse tail latency

Simplicity vs Control

Frameworks hide complexity
But you lose control over:
- connection reuse
- backpressure
- parsing behavior

CPU vs Network Efficiency

Compression saves bandwidth
Costs CPU
Under load, CPU becomes bottleneck
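A common middle ground is to compress only when it is likely to pay off. The sketch below uses Node's `zlib` through `util.promisify`; the 1 KiB threshold is an illustrative assumption, `maybeCompress` is a hypothetical helper, and the `Accept-Encoding` check is simplified (it ignores quality values such as `gzip;q=0`).

```javascript
const zlib = require('node:zlib');
const { promisify } = require('node:util');

const gzip = promisify(zlib.gzip); // async variant, runs off the main thread
const MIN_COMPRESS_BYTES = 1024;   // illustrative threshold, tune against real payloads

// Compresses only when the client accepts gzip and the body is large enough
// for the bandwidth saving to outweigh the CPU cost.
async function maybeCompress(body, acceptEncoding = '') {
  const raw = Buffer.from(body);
  const clientAcceptsGzip = /\bgzip\b/i.test(acceptEncoding);
  if (!clientAcceptsGzip || raw.length < MIN_COMPRESS_BYTES) {
    return { body: raw, encoding: 'identity' };
  }
  return { body: await gzip(raw), encoding: 'gzip' };
}
```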

Keep-Alive vs Resource Locking

Keep-alive reduces handshake overhead
But holds connections longer
Risk: connection pool exhaustion

Final System Design (What Actually Works in Production)

A sane architecture:

Client
  ↓
CDN (optional)
  ↓
Reverse Proxy (NGINX / Envoy)
  ↓
App Server (stateless, event-driven)
  ↓
Service Layer
  ↓
Database / Cache

Key rules:

Terminate TLS early
Enforce timeouts at every layer
Use connection pooling aggressively
Monitor queueing, not just CPU
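For the last rule, Node ships a direct measure of queueing inside the process: `monitorEventLoopDelay` from `node:perf_hooks`. A minimal sketch, where `startLoopDelayMonitor` is a hypothetical wrapper and the 20 ms resolution is an illustrative choice:

```javascript
const { monitorEventLoopDelay } = require('node:perf_hooks');

// Starts sampling event loop delay and returns helpers to read and stop it.
function startLoopDelayMonitor(resolutionMs = 20) {
  const histogram = monitorEventLoopDelay({ resolution: resolutionMs });
  histogram.enable();
  return {
    // The histogram reports nanoseconds; convert to milliseconds for dashboards.
    snapshot: () => ({
      p50Ms: histogram.percentile(50) / 1e6,
      p99Ms: histogram.percentile(99) / 1e6,
      maxMs: histogram.max / 1e6,
    }),
    stop: () => histogram.disable(),
  };
}
```

A rising `p99Ms` while CPU looks healthy usually means something is blocking the loop or work is piling up behind it.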

Key Takeaways (No Fluff)

HTTP lifecycle is mostly not in your code—it’s in the kernel and network stack
Most latency issues are queueing problems, not computation problems
Backpressure is real; ignoring it will kill your system
Middleware is not free—treat it like production code, not decoration
Timeouts must be aligned across layers or you create cascading failures
Keep-alive and pooling are double-edged swords

If you still think HTTP is just “request comes in, response goes out,” you’re not ready to debug production systems.