Why I Stopped Believing 'Best Practices' and Started Trusting 'Works For Us'

#architecture #discuss #softwareengineering #systemdesign

I spent 18 months building the 'perfect' architecture. Then I watched a customer delete it in 20 minutes and replace it with a cron job. Here's what I learned about the 'best practice' trap — and why boring technology often wins.

The demo that didn't land

We were eighteen months into building layline.io when we got our first serious enterprise prospect. A Fortune 500 logistics company. Their data team had reviewed our architecture, liked the batch-plus-streaming approach, and scheduled a full-day workshop to dive deep.

We prepared for weeks. We built a demo that showed off everything: complex event processing, automatic backpressure handling, schema evolution. It was, by every textbook definition, a best practice architecture. Distributed. Fault-tolerant. Built to scale horizontally. The kind of system you'd draw on a whiteboard during a conference talk.

The workshop went well. The engineers asked good questions. Then, in the last thirty minutes, the senior architect leaned back and said something I'll never forget: "This is impressive. But we run everything on a single server with cron jobs, and it works. What would we actually gain from all this complexity?"

I had a hundred answers ready. Scalability. Resilience. Future-proofing. But I could see in his face that he wasn't asking for a technology comparison. He was asking me to justify why his current reality — boring, simple, working — was insufficient.

I couldn't. Not honestly.

The architecture I deleted

Three months later, I was in a different room with a different customer. This one was a mid-sized fintech. They'd been running a Kafka-based streaming pipeline for two years. It was falling over constantly. They'd hired consultants, upgraded hardware, rewritten their consumer logic twice. The system was "correct" by every distributed systems textbook. It was also a nightmare to operate.

In the meeting, their lead engineer showed me the architecture diagram. It was beautiful. Twelve microservices, three different persistence layers, a custom operational data store for state management. They'd followed every pattern from the Confluent blog and the Martin Kleppmann book.

"What if," I asked, "you just wrote the events to a file and processed them in batches?"

He stared at me. "That's... not streaming."

"No," I agreed. "But you're processing events hourly anyway because your downstream system can't handle real-time updates. You're paying the operational cost of a streaming architecture to achieve batch semantics."

They didn't buy layline.io that day. But six weeks later, I got an email. They'd deleted the entire architecture. Replaced it with a single process that read files and wrote to a database. A cron job, basically. Their p99 latency went from 200ms to five minutes, which didn't matter because their downstream system only took updates hourly. Their operational incidents went from three per week to zero. Their engineering team went from firefighting to shipping features.

The "wrong" architecture was better because it matched their actual constraints, not their aspirational ones.
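A minimal sketch of what "a single process that read files and wrote to a database" can look like, assuming CSV input and SQLite. The table layout, the column names (event_id, payload) and the function names are hypothetical illustrations, not the customer's actual code.

```python
import csv
import sqlite3
from pathlib import Path


def load_batch(conn: sqlite3.Connection, csv_path: Path) -> int:
    """Load one file of events into the database and return the row count.

    Assumed (hypothetical) layout: a header row with event_id and payload.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "event_id TEXT PRIMARY KEY, payload TEXT)"
    )
    with csv_path.open(newline="") as handle:
        rows = [(r["event_id"], r["payload"]) for r in csv.DictReader(handle)]
    # One transaction per file: either the whole batch lands or none of it.
    # INSERT OR REPLACE keyed on event_id makes re-running a file harmless.
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO events (event_id, payload) VALUES (?, ?)",
            rows,
        )
    return len(rows)


def process_directory(conn: sqlite3.Connection, inbox: Path) -> dict[str, int]:
    """Process every CSV file in the inbox in name order."""
    return {p.name: load_batch(conn, p) for p in sorted(inbox.glob("*.csv"))}
```

Scheduled from cron, a script like this has very few moving parts: the input files are the queue, and a failed run can simply be repeated because each file loads in a single transaction.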

The best practice trap

Here's what I've learned from 25 years of building and selling data infrastructure: best practices are context-dependent by definition, but they're marketed as universal truths.

The streaming-first architecture that Netflix needs is not the architecture a 50-person SaaS company needs. The microservices approach that lets Amazon deploy 10,000 times per day is not what your team of four engineers needs. The AI agent framework that raised $50 million in VC funding is not what your cron-based ETL needs.

But you wouldn't know that from reading industry content. Every vendor blog post, every conference talk, every architecture blueprint shows the same progression: start simple, then "graduate" to complexity as you grow. The implication is clear: simple is for beginners. Complexity is for serious practitioners.

This is backwards. Complexity is a liability that should be added reluctantly, not a badge of honor that should be pursued eagerly.

What "works for us" actually looks like

I've started asking customers a different question in early conversations: "What's the simplest thing that could work for your actual workload?" Not your projected workload in three years. Not your aspirational real-time use case that the CEO mentioned once. Your actual workload, today.

The answers are consistently surprising:

A healthcare company processing a million patient records per day does it with a single-threaded Python script that runs for four hours every night. It's been running for six years without modification. Why? Because the records arrive via FTP at 2 AM, and the doctors don't look at the dashboards until 8 AM.

A retail company processing point-of-sale data from 2,000 stores uses a three-node Kafka cluster. Not because they need the throughput (they could fit a day's events in a single file) but because their existing team knew Kafka and didn't have time to learn something new during their busiest season.

A logistics company tracking container ships in real time uses... a spreadsheet. The operations team updates it manually. They tried building an automated pipeline twice. Both times, the automated system failed in ways that were harder to debug than the spreadsheet. The spreadsheet is "wrong" in a dozen ways, but it's inspectably wrong. You can see the errors.

None of these are "best practices." All of them are correct for their context.

The AI agent hype cycle

If you want to see the best practice trap in its most aggressive form, watch how the data engineering industry is currently responding to AI agents.

Every competitor blog I read lately — Airbyte, Confluent, Kestra — is positioning their product as "AI agent ready." There are deep dives on Model Context Protocol, ontologies for agents, context window management. The implicit message: if you're not architecting for AI agents right now, you're falling behind.

I asked a customer last week if they were looking at AI agents for their data pipelines. "We spent six months trying to get an LLM to generate SQL," he said. "It was 70% accurate on simple queries and 30% accurate on complex ones. The 30% was subtle enough that we didn't catch it until the CEO saw a wrong number in a board deck. We're back to engineers writing SQL."

This isn't an argument against AI. It's an argument against defaulting to AI because it's the current best practice. The teams that benefit from AI agents today have specific characteristics: high query volumes, relatively simple schemas, tolerance for occasional errors, and engineering resources to validate outputs. If that doesn't describe your situation, AI agents aren't your solution yet — no matter how many vendor blog posts suggest otherwise.
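To make "engineering resources to validate outputs" concrete, here is one possible sketch: run a generated query and a hand-written reference query against a small fixture database and compare the rows. The orders table, both queries and the same_result helper are hypothetical examples, not anything the customer described.

```python
import sqlite3


def same_result(conn: sqlite3.Connection, candidate_sql: str, reference_sql: str) -> bool:
    """Return True when both queries yield the same rows, ignoring row order."""
    candidate = sorted(conn.execute(candidate_sql).fetchall(), key=repr)
    reference = sorted(conn.execute(reference_sql).fetchall(), key=repr)
    return candidate == reference


# Hypothetical fixture: three orders, one of them refunded.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER, amount INTEGER, status TEXT);
    INSERT INTO orders VALUES (1, 100, 'paid'), (2, 50, 'refunded'), (3, 25, 'paid');
    """
)

reference = "SELECT SUM(amount) FROM orders WHERE status = 'paid'"
generated = "SELECT SUM(amount) FROM orders"  # plausible, but counts the refund

print(same_result(conn, generated, reference))  # False: 175 vs 125
```

A check like this only catches mistakes the fixture data happens to exercise, which is part of why validation takes real engineering time.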

How to actually evaluate technology

So if "best practice" isn't a reliable guide, what is?

Here's the framework I use now, both for my own architectural decisions and when advising customers:

Start with your actual constraints. How much data? What arrival patterns? What latency requirements? What team size and expertise? What budget for operations? The answers to these questions eliminate 90% of "industry standard" architectures immediately.
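A quick back-of-the-envelope calculation often settles the "how much data" question. The sketch below uses the figures from the healthcare example (a million records, a four-hour nightly run); the required_rate helper is a hypothetical name.

```python
def required_rate(records: int, window_hours: float) -> float:
    """Average records per second needed to finish within the window."""
    return records / (window_hours * 3600)


# A million records processed during a four-hour nightly run.
nightly = required_rate(1_000_000, 4)

# The same volume spread evenly over a full day.
all_day = required_rate(1_000_000, 24)

print(f"{nightly:.1f} records/s over 4 hours")   # 69.4
print(f"{all_day:.1f} records/s over 24 hours")  # 11.6
```

Tens of records per second is a rate a single process can often sustain, which is worth knowing before comparing cluster sizes.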

Optimize for debugging, not for elegance. The architecture that produces clean diagrams is often the one that's hardest to debug at 2 AM. Prefer systems where you can trace a single record from source to destination without crossing three different abstraction layers.
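One way to keep a record traceable is to build the pipeline from plain functions and keep a helper that replays a single record through every stage. This is only a sketch; the stage names and the record fields (id, amount) are hypothetical.

```python
from typing import Any, Callable

Record = dict[str, Any]
Stage = Callable[[Record], Record]


def parse(record: Record) -> Record:
    """Turn the raw amount string into a number."""
    return {**record, "amount": float(record["amount"])}


def to_cents(record: Record) -> Record:
    """Add the amount in whole cents."""
    return {**record, "amount_cents": round(record["amount"] * 100)}


PIPELINE: list[Stage] = [parse, to_cents]


def trace(record: Record, stages: list[Stage]) -> list[tuple[str, Record]]:
    """Run one record through every stage, keeping each intermediate result."""
    steps = [("source", record)]
    for stage in stages:
        record = stage(record)
        steps.append((stage.__name__, record))
    return steps


for name, state in trace({"id": "r1", "amount": "12.50"}, PIPELINE):
    print(name, state)
```

When a number looks wrong at 2 AM, the printed steps show which stage changed it, with no dashboards or cross-service log correlation involved.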

Measure operational cost in team attention, not just infrastructure dollars. A distributed system that runs itself but requires a senior engineer to be on call can be more expensive than a single server that needs occasional restarts but can be managed by a junior hire.

Plan for the migration you'll actually do, not the migration you should do. Every team has legacy systems they'll never retire. Design for graceful coexistence with old technology rather than revolutionary replacement of it.

When in doubt, start boring. You can always add complexity. Removing it is much harder. The teams I see succeeding are the ones that add technology reluctantly, with clear evidence that simpler approaches have been exhausted.

The counter-argument I'm not making

I want to be clear about what I'm not saying. I'm not arguing for technical conservatism or against trying new things. Some problems genuinely do require complex, distributed, real-time architectures. If you're processing payments at scale, you need exactly-once processing semantics, or idempotent handling that gives you the same effect. If you're serving ML features with sub-100ms latency, you need streaming. If you're Netflix, you need what Netflix needs.

But most companies aren't Netflix. Most data pipelines don't need to handle 10,000 events per second. Most teams don't have a platform engineering group to manage the operational burden of "modern" data infrastructure.

The uncomfortable truth is that the industry has conflated "what successful tech companies do" with "what you should do." Successful tech companies have endless engineering resources, high tolerance for operational pain, and business models that require real-time everything. Your company probably doesn't. Your architecture shouldn't pretend otherwise.

Where layline.io fits (and where it doesn't)

I'll close with something that might surprise you: layline.io is not the right choice for every data integration problem.

If you have a few batch jobs that run reliably on a schedule, and your team is comfortable with your current setup, you probably don't need us. Seriously. The operational overhead of learning a new platform isn't worth it if your current reality is stable and understood.

Where we add value is when you've outgrown simple approaches but want to avoid the complexity tax of stitching together multiple specialized tools. When you need both batch and streaming in the same system. When your team is tired of maintaining separate orchestration, transformation, and monitoring layers. When you want to consolidate around one model instead of managing a coordination seam between three different tools.

Even then, I'd rather you start with a proof of concept that processes a single day's data than an ambitious migration plan. Prove that the simpler approach works for your actual workload before committing to the complex one.