Ranga Bashyam G

Architecture, Deployment & Observability - The Part Nobody Warns You About

Have you ever wondered where the actual problem starts in a software lifecycle?

Most people say requirement gathering. Understanding the goals, the vision, the stakeholder expectations... yes, those matter, absolutely. But honestly? That's not where things fall apart.

The actual mess starts at technical planning. Architecting the solution, then trying to execute that architecture on real infrastructure that never behaves the way your diagram assumed, that's where the cracks appear first. And once the foundation has cracks, no amount of clean code or good intentions covers it up.


Architecture Is a Negotiation, Not a Blueprint


Here's what I've learned after architecting multiple tools and services: architecture is never purely a technical exercise. It's always a conversation between what the business expects, what the infrastructure can actually handle, and what your team can realistically build and maintain without burning out in 3 months.

Every decision you make cascades. You pick Kafka today; two years later your team is debugging consumer lag at 2 AM. You throw together quick Airflow DAGs for speed; now scaling them is a 6-month refactoring nightmare. You choose a managed cloud service to save time; now you're locked into their pricing decisions forever.

There is no perfect architecture. The job is picking the right trade-off for the right context and being honest about what you're giving up.

The trade-offs nobody wants to have (but you have to):

  • Consistency vs availability - pick a side, document why
  • Stateless vs stateful - each has infra implications your ops team will live with long after you move on
  • Managed cloud services vs self-hosted - a cost vs control conversation that needs actual numbers, not vibes
  • Microservices vs monolith vs modular monolith - "it depends" is fine, but what it depends on needs an answer

The engineers who become architects aren't the ones who know every pattern. They're the ones who know which pattern not to use in a given situation. That's the real experience.


Making Your Client Understand: This Is the Hardest Part

Okay so here's the thing nobody prepares you for.

You've done the analysis. You know exactly what the limitations are. You know why the proposed approach won't scale the way the client thinks it will. You know the infra constraints are real. Now you have to explain that to someone who paid good money and has expectations you can't fully meet with the given resources.

If you keep explaining the scarce resources in purely technical terms, you lose the bid. They undervalue you. Not because you're wrong, but because you failed to translate the constraint into something they actually understand.

Lead with outcomes. "This approach handles a 10x traffic spike without manual intervention" lands better than "we're implementing HPA with custom metrics on Kubernetes." Both mean the same thing. Only one keeps the client in the room.

The real game is making your client understand the boundaries while also showing them the best of what's possible within those boundaries. That takes experience. Technical knowledge. And honestly, a lot of patience too.


Deployment: Where Architecture Meets Real Life

Architecture on paper and architecture in production are two very different things.

I've seen beautiful designs fall apart the first time they hit real network latency, real disk I/O, and real users doing things nobody anticipated in the design session.

Kubernetes is the de-facto orchestration layer now. It's powerful. It's also unforgiving. If you don't understand what you're deploying into (node affinity, resource requests vs limits, pod disruption budgets), Kubernetes will punish you with cryptic errors and silent failures at the worst possible time.
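Requests and limits aren't just numbers in a manifest: together they determine a pod's QoS class, which decides eviction order when a node comes under memory pressure. A minimal Python sketch of the classification rules, simplified from the real Kubernetes behavior:

```python
def qos_class(containers):
    """Classify a pod's Kubernetes QoS class from its containers'
    resource requests and limits (simplified sketch of the rules).

    Guaranteed: every container has cpu+memory limits, and requests
    (if set) equal those limits. BestEffort: nothing set anywhere.
    Everything else: Burstable.
    """
    any_set = False
    guaranteed = True
    for c in containers:
        req, lim = c.get("requests", {}), c.get("limits", {})
        if req or lim:
            any_set = True
        for res in ("cpu", "memory"):
            if res not in lim:
                guaranteed = False          # missing limit disqualifies
            elif res in req and req[res] != lim[res]:
                guaranteed = False          # request != limit disqualifies
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

A BestEffort pod is the first thing the kubelet evicts under pressure, which is exactly the kind of "silent failure at the worst possible time" this section is about.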

Single Cloud vs Multi-Cloud - Companies Are Splitting on This

Nowadays companies are moving in two directions. One is a specific cloud-oriented approach, the other is multi-cloud. Both are valid. Both have real costs that go beyond the invoice.

Single cloud gives you depth, tighter integrations, simpler operational model, better managed service compatibility. You're betting on one vendor's roadmap and pricing history though.

Multi-cloud gives you resilience and leverage... no single provider outage takes you down, no pricing lock-in. The cost is complexity. Your infra team is now managing abstractions across two or three different API paradigms, IAM models, and networking topologies at the same time.

Only a disciplined, well-budgeted team runs the cloud optimally. The cloud is pay-as-you-go, but waste is also pay-as-you-go. And waste compounds faster than most teams realize.

AI Workloads Changed the Deployment Problem

People used to worry about hosting AI models locally: the workload, stability, scalability, performance, latency, accessibility, and above all privacy. So everyone moved to the cloud. Now the problem is: the cloud is pay-as-you-go and the charges are heavier than expected.

GPU-backed compute is expensive. Cold start latency on inference endpoints is different from a stateless REST API. Token-by-token generation means time-to-first-token and total generation time are completely different signals with completely different infra implications.
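The TTFT vs total-time distinction can be made concrete. A small sketch that times both signals; the token iterator and its latencies here are simulated, not a real inference client:

```python
import time

def measure_stream(token_iter):
    """Measure time-to-first-token (TTFT) and total generation time
    for any iterable of streamed tokens."""
    start = time.monotonic()
    ttft = None
    tokens = 0
    for _ in token_iter:
        if ttft is None:
            ttft = time.monotonic() - start   # first token arrived
        tokens += 1
    return {"ttft_s": ttft,
            "total_s": time.monotonic() - start,
            "tokens": tokens}

def fake_stream():
    """Simulated stream: slow first token (prefill / cold start),
    fast per-token decode afterwards."""
    time.sleep(0.05)
    yield "Hello"
    for _ in range(4):
        time.sleep(0.005)
        yield "world"

stats = measure_stream(fake_stream())
```

On a real endpoint, TTFT dominates perceived responsiveness while total time drives throughput and cost. Alerting on only one of them hides regressions in the other.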

The API call looks cheap. The infrastructure required to make that API call reliable, fast, and cost-efficient at scale, that's where the real engineering work is hiding.


Observability: You Don't Know What You Don't Measure

Here's the uncomfortable truth. You don't actually know what your system is doing in production. You know what you think it's doing. You know what it did in staging. You know what it looked like in the load test.

Production is a different animal.

Observability is not monitoring. Monitoring tells you something is wrong, it's the alarm. Observability is the ability to ask arbitrary questions about your system's internal state based on the signals it produces. It's how you go from "something is broken" to "here is exactly why, and here is exactly where."

Metrics, Logs, Traces... All Three, No Skipping

  • Metrics - what is happening right now. Your SLA dashboard, your capacity planning input, your early warning system
  • Logs - what happened. Invaluable for debugging, but expensive and noisy at scale if you're not intentional about log levels and sampling
  • Traces - how it happened. The full request journey across distributed services. In a microservices world, traces are non-negotiable. Without them, debugging a latency spike across four services and two external APIs is just educated guesswork
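A trace is, at minimum, an ID that survives service hops. A sketch of structured logs carrying a shared trace ID so a request journey can be reconstructed; the field names are illustrative, not any specific tracing standard:

```python
import json
import time
import uuid

def new_trace_id():
    """Generate a trace ID once, at the edge of the system."""
    return uuid.uuid4().hex

def log_event(service, trace_id, message, **fields):
    """Emit one structured log line. Because every line carries the
    trace_id, logs from different services can be joined back into
    a single request journey."""
    record = {"ts": time.time(), "service": service,
              "trace_id": trace_id, "message": message, **fields}
    print(json.dumps(record))
    return record

trace = new_trace_id()
e1 = log_event("api-gateway", trace, "request received", path="/checkout")
e2 = log_event("payments", trace, "charge created", latency_ms=42)
```

Real systems propagate this via headers (the W3C `traceparent` format, OpenTelemetry SDKs), but the principle is the same: without a shared ID, "the full request journey" is a grep exercise.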

The mistake most teams make is treating observability as something you bolt on after the system is built. By then the instrumentation is an afterthought, naming conventions are all over the place, and you're collecting the signals that were easy to add, not the ones that are actually useful.

Observability for AI Systems Is a Different Problem

Traditional APM tooling wasn't built for LLM workloads. Latency behaves differently. But beyond infrastructure metrics, AI systems need semantic observability.

Did the model return something useful? Was the retrieved context in the RAG pipeline actually relevant? Is the prompt structure degrading as edge cases accumulate over time?

CPU utilization and memory graphs can't answer those questions. You need eval pipelines, response quality sampling, feedback loops embedded into the product itself. That's a different layer of observability and most teams aren't thinking about it yet.
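Semantic observability can start very small: sample a slice of production responses and score them against a crude relevance proxy. A sketch under those assumptions; the scoring function is deliberately naive, and a real pipeline would use labeled data or an LLM judge:

```python
import random

def sample_for_eval(responses, rate=0.1, seed=7):
    """Pick roughly `rate` of production responses for offline
    quality evaluation. A fixed seed keeps samples reproducible."""
    rng = random.Random(seed)
    return [r for r in responses if rng.random() < rate]

def keyword_relevance(answer, expected_terms):
    """Crude relevance proxy: fraction of expected terms that appear
    in the answer. A stand-in for a real judge, nothing more."""
    answer_lower = answer.lower()
    hits = sum(term.lower() in answer_lower for term in expected_terms)
    return hits / len(expected_terms)
```

Even a proxy this blunt, run continuously over sampled traffic, surfaces the slow degradation that CPU and memory graphs will never show.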

Your Cloud Bill Is a Signal Too

In cloud-native environments, an unexpected cost spike is often the first indicator of a misconfiguration or a runaway process. Engineers who treat FinOps as someone else's problem eventually end up in a very awkward conversation with leadership trying to explain why infra costs tripled.

Tagging resources, attributing costs to services and teams, anomaly alerts on spend: this is observability work. It belongs in the same operational posture as your Prometheus dashboards.
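A spend-anomaly check doesn't need a FinOps platform to start. A minimal sketch that flags any day whose spend exceeds a multiple of the trailing average; the window and threshold are illustrative, not recommendations:

```python
def spend_anomalies(daily_spend, window=7, threshold=1.5):
    """Flag any day whose spend exceeds `threshold` times the trailing
    `window`-day average. Returns (day_index, spend, baseline) tuples."""
    alerts = []
    for i in range(window, len(daily_spend)):
        baseline = sum(daily_spend[i - window:i]) / window
        if daily_spend[i] > threshold * baseline:
            alerts.append((i, daily_spend[i], round(baseline, 2)))
    return alerts

# A week of steady spend, then a runaway process on day 8.
history = [100, 102, 99, 101, 100, 103, 98, 100, 310]
```

Wire the same check to per-team cost tags and it becomes attribution, not just detection: the alert tells you whose runaway process tripled the bill.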


Where It All Ties Together

The cloud is more capable than ever. AI capabilities are an API call away. Kubernetes lets you orchestrate globally. The tooling exists and it's genuinely impressive.

But tooling is not a substitute for craft.

The engineer who can design a system that survives its own success, deploy it reproducibly and observably, and instrument it to give genuine insight into its actual behavior: that engineer is rare. And that combination is what actually moves organizations forward.

The gap between a system that technically works and a system that is production-ready, cost-efficient, observable, and maintainable is not small. In most projects, that gap is the majority of the actual engineering effort.

Requirement gathering gave you a direction. Architecture, deployment, and observability are the journey.

Anyone can deploy a service. Fewer can architect one that lasts. Fewer still can tell you, at any moment, exactly how that service is behaving and why.

That's the skill set. That's the discipline. And as AI workloads and cloud-native systems keep evolving, the engineers who invest in all three, not just the one they find most interesting, are the ones building the infrastructure the next decade runs on.


The cloud isn't magic. Kubernetes isn't magic. AI isn't magic. The craft is understanding deeply enough to make it look that way.
~ Ranga Bashyam

