DEV Community: Berik Ashimov

Why Developers Are Looking Beyond the Cloud

Berik Ashimov — Sat, 23 May 2026 19:13:55 +0000

For years, the default advice was simple: move everything to the cloud.

Need compute? Use the cloud.
Need storage? Use the cloud.
Need databases, queues, observability, networking, backups, and security? Use managed services.

That advice made sense for a long time. Cloud platforms changed how software was built. They made infrastructure programmable, global, fast to deploy, and accessible to small teams without large capital investment. A startup could launch worldwide without buying a single server. A developer could create a production environment in minutes.

But something interesting is happening now.

Many engineering teams are quietly becoming more selective. They are not abandoning the cloud, but they are no longer treating it as the automatic answer to every infrastructure problem. Bare metal is coming back. Private cloud is being reconsidered. Edge infrastructure is growing. Developers are paying more attention to cost, latency, data control, portability, and operational independence.

The new trend is not anti-cloud. It is cloud maturity.

The cloud was never magic

The cloud solved real problems, but it also created new ones.

At first, the value was obvious. Instead of waiting weeks for hardware, teams could deploy instantly. Instead of guessing capacity, they could scale on demand. Instead of building everything from scratch, they could consume managed services.

But over time, many organizations discovered that convenience has a price.

Cloud bills can become unpredictable. Managed services can create deep vendor lock-in. Data transfer fees can turn architecture decisions into financial surprises. Debugging distributed systems across many managed components can become harder than managing a smaller, simpler platform.

The cloud did not remove infrastructure complexity. It changed where that complexity lives.

For a small team, that tradeoff is often worth it. For a growing company with stable workloads, high traffic, heavy storage, or strict compliance requirements, the equation starts to change.

Predictable workloads do not always need elastic pricing

One of the strongest arguments for the cloud is elasticity. If demand changes constantly, paying for resources as you use them is powerful.

But not every workload is unpredictable.

Many applications have steady traffic. Internal platforms, analytics pipelines, databases, monitoring systems, storage clusters, CI runners, media processing jobs, and enterprise services often run 24 hours a day with fairly known capacity needs.

In these cases, renting infrastructure forever may not be the most efficient model.

A dedicated server that runs at high utilization can be dramatically more cost-effective than equivalent cloud resources. A small private cluster can handle predictable workloads with lower monthly cost and more control. Even colocation, which once felt old-fashioned, is becoming attractive again for teams that understand their baseline capacity.
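To make the comparison concrete, here is a minimal break-even sketch in Python. The prices are hypothetical placeholders, not quotes from any provider, and the model deliberately ignores egress fees, staffing, and committed-use discounts, all of which can move the result either way.

```python
def breakeven_utilization(dedicated_monthly: float, cloud_hourly: float,
                          hours_per_month: float = 730.0) -> float:
    """Fraction of the month an on-demand instance must run before a
    flat-rate dedicated server becomes the cheaper option."""
    if cloud_hourly <= 0:
        raise ValueError("cloud_hourly must be positive")
    return dedicated_monthly / (cloud_hourly * hours_per_month)


# Hypothetical prices, for illustration only.
DEDICATED_MONTHLY = 200.0  # flat monthly fee for a dedicated server
CLOUD_HOURLY = 0.50        # on-demand hourly rate for a comparable instance

ratio = breakeven_utilization(DEDICATED_MONTHLY, CLOUD_HOURLY)
print(f"Dedicated is cheaper above {ratio:.0%} utilization")
```

With these made-up numbers the dedicated server wins once the workload runs for more than roughly 55 percent of the month. A service that runs around the clock is well past that point, while a job that runs a few hours a day is not.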

This does not mean every company should buy racks of servers. It means engineers should ask better questions.

Is this workload variable or stable?
Does it need global elasticity or predictable capacity?
Are we paying for convenience, or are we paying because nobody reviewed the architecture?
Would a hybrid model reduce cost without increasing risk?

The future belongs to teams that can answer these questions clearly.

Latency is becoming a product feature

Infrastructure decisions are no longer only about cost. They are also about user experience.

Modern users expect applications to feel instant. A few hundred milliseconds can change how a product feels. For gaming, collaboration tools, trading platforms, video, IoT, payment systems, and real-time dashboards, latency is not a technical detail. It is part of the product.

This is one reason edge infrastructure is growing.

When compute moves closer to users, applications can respond faster. When data processing happens near the source, systems can avoid unnecessary round trips. When services are deployed regionally, reliability improves because one central location is no longer the only path.

The cloud is still important here, but it is not the whole story. Edge nodes, regional data centers, CDN compute, private points of presence, and local processing are becoming part of modern architecture.

The best infrastructure is not always the biggest. Sometimes it is simply closer.

Developers want portability again

A few years ago, many teams were comfortable building deeply around one cloud provider. Managed databases, serverless functions, proprietary queues, cloud-specific IAM, hosted observability, and provider-specific deployment models made development faster.

But the cost of that speed often appeared later.

When everything depends on one provider’s unique services, moving becomes difficult. Even testing locally can become complicated. Engineers may understand the application code but not the platform behavior around it. A simple migration can turn into a long and expensive rewrite.

This is why portability is becoming valuable again.

Containers, Kubernetes, Terraform, OpenTofu, PostgreSQL, object storage with compatible APIs, standard observability formats, and open networking tools are popular because they give teams options. They allow workloads to run in more than one environment. They also reduce the fear of being trapped.

Portability does not mean pretending every platform is identical. It means designing systems so that the business has choices.

A portable system can run in the cloud today, on bare metal tomorrow, and across multiple regions when needed. That flexibility is not just technical. It is strategic.

Compliance and data control matter more than ever

For regulated industries, infrastructure is not only about performance and cost. It is also about control.

Finance, healthcare, telecom, government, and critical infrastructure companies must think carefully about where data lives, how it moves, who can access it, how logs are stored, and how systems are audited.

Cloud providers offer strong security capabilities, but responsibility still belongs to the organization using them. Misconfigured storage, overly broad permissions, unclear network boundaries, and weak audit processes can still create serious risk.

Some workloads are easier to govern when they run in a more controlled environment. Others benefit from managed cloud security services. The point is not that one model is always better. The point is that infrastructure should match the regulatory reality of the application.

For developers, this means security and compliance can no longer be treated as something that happens after deployment. They must be part of architecture from the beginning.

Where does the data live?
Who can access it?
What happens during an incident?
Can we prove how the system behaves?
Can we rebuild it from code?
Can we audit every important change?

These questions are becoming normal engineering questions.

Simplicity is becoming a competitive advantage

One of the biggest lessons of the last decade is that complexity scales faster than teams expect.

A system can start with a simple architecture and slowly turn into dozens of services, queues, functions, databases, secrets, dashboards, policies, and deployment pipelines. Each part may make sense individually, but together they become hard to reason about.

This is why many senior engineers are rediscovering the value of boring infrastructure.

Boring does not mean outdated. It means understandable, reliable, documented, and easy to operate.

A boring PostgreSQL database is often better than five specialized data services.
A boring Linux server can be better than a fragile chain of managed components.
A boring deployment pipeline can be better than a platform nobody fully understands.
A boring network design can be better than a clever one that fails in strange ways.

The best infrastructure is not the one that looks most impressive in a diagram. It is the one that survives real production pressure.

The rise of hybrid thinking

The most practical teams are not choosing between cloud and bare metal as a matter of ideology. They are using both.

A common pattern is emerging:

Cloud for burst capacity, global reach, managed services, and fast experimentation.
Bare metal or private cloud for stable workloads, databases, storage, CI, and cost-sensitive systems.
Edge infrastructure for latency-sensitive workloads.
Open-source tooling for portability and control.
Automation everywhere to keep operations repeatable.

This hybrid approach requires more engineering discipline, but it can produce better results. Teams get the speed of cloud where it matters and the efficiency of owned or dedicated infrastructure where it makes sense.

The key is automation. Without automation, hybrid infrastructure becomes chaos. With automation, it becomes a flexible platform.

Infrastructure as code, Git-based workflows, monitoring, configuration management, immutable images, automated backups, and clear runbooks are what make this model realistic.

Infrastructure knowledge is becoming valuable again

For a while, some people believed developers would no longer need to understand infrastructure deeply. Managed platforms would hide the details.

That prediction did not fully come true.

Modern software still depends on networking, storage, Linux, DNS, TLS, routing, load balancing, databases, queues, observability, and security. When something breaks, abstraction is helpful only until it hides the root cause.

The best engineers today are not just writing application code. They understand how their code behaves in production. They know what happens when latency increases, when disk I/O is saturated, when DNS fails, when a certificate expires, when a region goes down, or when a deployment creates unexpected load.

Infrastructure knowledge is no longer only for system administrators. It is becoming a core engineering skill again.

The future is not cloud first. It is architecture first

The next phase of infrastructure will not be defined by one platform.

It will be defined by better decision-making.

Cloud is excellent for many things. Bare metal is excellent for many things. Edge is excellent for many things. Managed services, open source platforms, private clouds, and colocation all have a place.

The mistake is choosing any of them automatically.

The better approach is architecture first:

Understand the workload.
Understand the business constraints.
Understand the cost model.
Understand the security requirements.
Understand the operational capacity of the team.
Then choose the infrastructure that fits.

This is a more mature way to build systems. It is also more honest.

Infrastructure is not just a place where applications run. It is part of the product, part of the cost structure, part of the security model, and part of the user experience.

The cloud era taught developers to move fast. The next era will teach them to choose wisely.

BGP Edge Hygiene at a PCI-Regulated Fintech: IRR + RPKI in Production

Berik Ashimov — Fri, 15 May 2026 17:23:20 +0000

A single hijacked prefix can route a chunk of payment traffic into a stranger's network for half an hour before anyone notices. For a payment provider, that is not a routing incident. It is a regulatory event, an exposed-traffic incident, and an auditor knocking on Monday morning.

This post walks through the BGP edge hygiene we ran in production at a national fintech: what we filtered, how we automated it, what broke, and a copy-paste checklist at the end.

The threat model in 200 words

If you run a public-facing AS, the internet routing system trusts you and your peers to announce only what you should announce. That trust is not enforced by default. Five classes of problem will hurt you:

Route hijacks where a remote AS originates your prefix and pulls traffic away.
Route leaks where a transit customer accidentally re-announces full tables to a peer.
Sub-prefix hijacks, more-specific announcements that win longest-prefix-match.
BGP optimizer leaks, a known class of incident where a vendor box generates synthetic more-specifics and a misconfigured peer re-advertises them.
Static fat-fingers from your own ops team.

For a payment provider the consequences are not just availability. They include confidentiality risk on traffic that touches the cardholder data environment, PCI DSS scope expansion, and customer trust damage that lasts longer than any outage.

We treated the BGP edge as a perimeter control, not a routing function. That framing pulled the security team, the compliance team, and the network team into the same review cycle.

Five layers of edge hygiene

We layered five filters on every external eBGP session. Each one alone is insufficient. Together they cut routing-related incidents to near zero over the year we measured.

Layer 1: max-prefix per session

The cheapest, most effective control. If a peer accidentally leaks full tables to us, we want to tear the session down, not crash the box.

Cisco IOS-XR:

router bgp 65000
 neighbor 203.0.113.1
  remote-as 64500
  address-family ipv4 unicast
   maximum-prefix 5000 80 restart 5
  !

Junos:

protocols {
    bgp {
        group EBGP-PEERS {
            neighbor 203.0.113.1 {
                family inet {
                    unicast {
                        prefix-limit {
                            maximum 5000;
                            teardown 80 idle-timeout 5;
                        }
                    }
                }
            }
        }
    }
}

The 80% warning threshold is the important part. You want a syslog message before you go down, not after.
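The relationship between the hard limit and the warning threshold can be sketched in a few lines of Python. This is an illustrative model only, not how either vendor implements it: the function name is hypothetical, and the exact boundary behavior (at the threshold versus above it) varies by platform.

```python
def prefix_limit_state(received: int, maximum: int = 5000, warn_pct: int = 80) -> str:
    """Classify a session by received prefix count.

    Defaults mirror the example configs: limit 5000, warning at 80%.
    """
    warn_at = maximum * warn_pct // 100  # 4000 with the defaults
    if received > maximum:
        return "teardown"  # session is torn down and held idle
    if received > warn_at:
        return "warning"   # log and alert, session stays up
    return "ok"


for count in (3500, 4200, 5100):
    print(count, prefix_limit_state(count))
```

The useful band is the one between 4000 and 5000 prefixes: the session is still up, but someone has been told that the peer is growing toward the limit.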

Layer 2: AS-path filters

Reject any prefix whose AS-path contains your own ASN. This catches the surprisingly common case where a peer somewhere has a stale route through your AS and tries to send it back to you.

Junos:

policy-options {
    as-path OWN-AS-IN-PATH ".* 65000 .*";
    policy-statement DENY-OWN-AS {
        from as-path OWN-AS-IN-PATH;
        then reject;
    }
}

Also reject paths containing private ASNs (64512 to 65534 and 4200000000 to 4294967294) on peering sessions where they have no business appearing.
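The same two checks are easy to express offline, for example to test a captured AS-path before trusting a regex change. A minimal Python sketch, assuming the path is already parsed into a list of integers; the function names are hypothetical.

```python
from typing import List, Optional

OWN_ASN = 65000  # the example ASN used in the configs above

# Private-use ASN ranges (16-bit and 32-bit), as listed in the text.
PRIVATE_ASN_RANGES = ((64512, 65534), (4200000000, 4294967294))


def is_private_asn(asn: int) -> bool:
    return any(low <= asn <= high for low, high in PRIVATE_ASN_RANGES)


def as_path_reject_reason(as_path: List[int], own_asn: int = OWN_ASN) -> Optional[str]:
    """Return why a path should be rejected, or None if it passes both checks."""
    # Own-ASN check runs first. Note the example ASN 65000 is itself private.
    if own_asn in as_path:
        return "own-asn-in-path"
    if any(is_private_asn(asn) for asn in as_path):
        return "private-asn-in-path"
    return None


print(as_path_reject_reason([64500, 65000, 64501]))
print(as_path_reject_reason([64500, 64512]))
print(as_path_reject_reason([64500, 64501]))
```

Running a handful of known-good and known-bad paths through a helper like this before pushing a filter change is a cheap way to catch an overly greedy pattern.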

Layer 3: IRR-based prefix filters

For every peer that is not a Tier 1 transit, build a per-peer prefix filter from their published as-set in IRR. If a prefix is not in their published origin set, drop it.

Generate with bgpq4:

bgpq4 -h whois.radb.net -4 -A -l PEER-EXAMPLE-IN AS-EXAMPLE

By default bgpq4 emits a Cisco IOS-style prefix-list; add -X for IOS-XR or -J for Junos output. Pin to a specific IRR source (RADb, RIPE, ARIN) per peer, using the -S option to restrict which sources are consulted, to avoid junk from less-trusted databases.
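What the resulting filter does to an incoming announcement can be modeled in Python. This is a sketch under assumptions: the filter is represented as hypothetical (prefix, maximum length) pairs rather than parsed bgpq4 output, the prefixes are placeholders, and the helper names are made up for illustration.

```python
import ipaddress


def build_filter(entries):
    """entries: iterable of (cidr, max_len) pairs taken from a peer's IRR data."""
    return [(ipaddress.ip_network(cidr), max_len) for cidr, max_len in entries]


def permitted(prefix: str, peer_filter) -> bool:
    """True if the announced prefix sits inside an allowed block and is not
    more specific than that entry's maximum length."""
    net = ipaddress.ip_network(prefix)
    for allowed, max_len in peer_filter:
        if (net.version == allowed.version
                and net.prefixlen <= max_len
                and net.subnet_of(allowed)):
            return True
    return False


# Placeholder data: one aggregate that may be deaggregated down to /24.
peer_filter = build_filter([("198.18.0.0/15", 24)])

print(permitted("198.18.4.0/24", peer_filter))  # inside the aggregate
print(permitted("198.18.4.0/25", peer_filter))  # too specific
print(permitted("192.0.2.0/24", peer_filter))   # not registered by this peer
```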

Layer 4: RPKI Route Origin Validation

RPKI ROV is the only one of these five layers backed by cryptography: it checks each route's origin AS against signed ROAs. It validates the origin only, not the rest of the AS-path. Configure a local validator (Routinator, rpki-client, or OctoRPKI), feed it to the routers, and drop invalid at the edge.

Cisco IOS-XR:

router bgp 65000
 rpki server 192.0.2.10
  transport tcp port 3323
  refresh-time 600
 !
 address-family ipv4 unicast
  bgp bestpath origin-as use validity
 !

route-policy RPKI-DROP-INVALID
  if validation-state is invalid then
    drop
  else
    pass
  endif
end-policy

Apply the policy inbound on every eBGP session. Do not settle for "prefer valid"; drop invalid. Halfway measures here are how partial hijacks slip through.
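For readers who have not worked with validation states before, the decision the router makes for each route can be sketched in Python. This follows the usual origin validation logic in simplified form (it ignores AS_SET origins), and the VRP entries are hypothetical placeholders rather than real ROA data.

```python
import ipaddress
from typing import Iterable, NamedTuple


class VRP(NamedTuple):
    """Validated ROA payload: prefix, maximum length, authorized origin AS."""
    prefix: str
    max_length: int
    asn: int


def rov_state(prefix: str, origin_asn: int, vrps: Iterable[VRP]) -> str:
    route = ipaddress.ip_network(prefix)
    covered = False
    for vrp in vrps:
        vrp_net = ipaddress.ip_network(vrp.prefix)
        if route.version != vrp_net.version or not route.subnet_of(vrp_net):
            continue  # this VRP does not cover the route
        covered = True
        if vrp.asn == origin_asn and route.prefixlen <= vrp.max_length:
            return "valid"
    # Covered by at least one VRP but matched none: invalid.
    return "invalid" if covered else "not-found"


vrps = [VRP("198.18.0.0/22", 22, 64500)]  # placeholder ROA data

print(rov_state("198.18.0.0/22", 64500, vrps))
print(rov_state("198.18.1.0/24", 64500, vrps))
print(rov_state("198.18.1.0/24", 64501, vrps))
print(rov_state("198.19.0.0/24", 64501, vrps))
```

Note the second case: the right origin AS announcing a more-specific than the ROA's maximum length is still invalid. That is the property that makes drop-invalid effective against sub-prefix hijacks.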

Run at least two validators. We ran three (Routinator + rpki-client + a vendor appliance) and reconciled them in monitoring. If one validator goes stale, you do not want it silently flipping prefixes to NotFound.
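How the reconciliation is done is a local choice. One possible approach, sketched in Python under the assumption that each validator's output has already been exported and parsed into a set of (prefix, max length, ASN) tuples, is to flag every entry that not all validators agree on. The validator names and data below are placeholders.

```python
from typing import Dict, List, Set, Tuple

VRPTuple = Tuple[str, int, int]  # (prefix, max_length, asn)


def reconcile(validator_vrps: Dict[str, Set[VRPTuple]]) -> Dict[VRPTuple, List[str]]:
    """Return VRPs that are missing from at least one validator,
    mapped to the names of the validators that do have them."""
    all_vrps: Set[VRPTuple] = set().union(*validator_vrps.values())
    disagreements = {}
    for vrp in all_vrps:
        seen_by = sorted(name for name, vrps in validator_vrps.items() if vrp in vrps)
        if len(seen_by) != len(validator_vrps):
            disagreements[vrp] = seen_by
    return disagreements


a = ("198.18.0.0/22", 22, 64500)
b = ("198.18.4.0/24", 24, 64501)
snapshot = {
    "routinator": {a, b},
    "rpki-client": {a, b},
    "appliance": {a},  # stale: has not picked up b yet
}

print(reconcile(snapshot))
```

A small, short-lived disagreement is normal because validators refresh on different schedules. A growing or persistent one is the signal worth alerting on.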

Layer 5: bogons, reserved, and martians

Drop RFC1918, RFC6598 (100.64.0.0/10), documentation prefixes (192.0.2.0/24, 198.51.100.0/24, 203.0.113.0/24), default route (unless you genuinely accept it), and anything with a prefix length above your accepted maximum (typically /24 for IPv4 and /48 for IPv6).

This catches misconfigurations from your peers more often than malice. It is also what auditors look for in PCI DSS Requirement 1 reviews.
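The checks in this layer can be sketched in Python as follows. The bogon list contains only the IPv4 ranges named above and is not a complete bogon feed; the function name and return strings are hypothetical.

```python
import ipaddress
from typing import Optional

# Only the ranges named in the text; a production list would be longer.
BOGONS_V4 = [ipaddress.IPv4Network(n) for n in (
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",      # RFC1918
    "100.64.0.0/10",                                      # RFC6598
    "192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24",  # documentation
)]


def bogon_reject_reason(prefix: str, max_len: int = 24,
                        accept_default: bool = False) -> Optional[str]:
    """Return a rejection reason for an IPv4 prefix, or None if it passes."""
    net = ipaddress.IPv4Network(prefix)
    if net.prefixlen == 0:
        return None if accept_default else "default-route"
    if net.prefixlen > max_len:
        return "too-specific"
    for bogon in BOGONS_V4:
        if net.subnet_of(bogon):
            return f"bogon:{bogon}"
    return None


for candidate in ("10.1.0.0/16", "8.8.8.0/25", "0.0.0.0/0", "8.8.8.0/24"):
    print(candidate, bogon_reject_reason(candidate))
```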

Automating it with Ansible

Manual prefix-list updates do not scale past a handful of peers and they rot fast. IRR data changes daily. We built a small Ansible role that does the boring part.

Pipeline:

Pull each peer's as-set from RADb or RIPE using bgpq4.
Render the per-peer prefix-list with a Jinja2 template.
Diff against the running config. If the delta exceeds a threshold (say, plus or minus 10%), flag for human review instead of pushing.
Push via Netmiko or NAPALM, with commit confirmed rollback on Junos and a commit replace with diff preview on IOS-XR.
Re-run nightly via cron. IRR data drifts continuously.

Minimal Ansible task:

- name: Generate prefix-list for peer
  ansible.builtin.command: >
    bgpq4 -h whois.radb.net -4 -A
    -l "{{ peer.name }}-IN"
    {{ peer.as_set }}
  register: prefix_list
  changed_when: false

- name: Render config snippet
  ansible.builtin.template:
    src: peer_filter.j2
    dest: "/var/network-configs/{{ peer.name }}.conf"
  vars:
    prefix_list_body: "{{ prefix_list.stdout }}"

- name: Sanity check delta
  ansible.builtin.script: scripts/diff_check.py "{{ peer.name }}.conf"
  register: delta
  failed_when: delta.rc not in [0, 2]

- name: Apply config
  when: delta.rc == 0
  junipernetworks.junos.junos_config:
    src: "/var/network-configs/{{ peer.name }}.conf"
    comment: "auto-update prefix-list for {{ peer.name }}"
    confirm: 5

The confirm: 5 on Junos is the safety net. If the commit breaks the session and you cannot re-commit within 5 minutes, the box rolls back automatically.

Store every generated config in git. When something goes sideways at 03:00, you want to see what changed in the last 24 hours, not guess.
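The diff_check.py script called in the task is not shown in this post. A minimal sketch of the core decision in Python might look like the following, assuming exit code 0 means "safe to push" and 2 means "needs human review", which is consistent with the failed_when and when conditions in the task. The function names and the way entries are counted are assumptions.

```python
def relative_delta(old_count: int, new_count: int) -> float:
    """Size of the change relative to the previous prefix count."""
    if old_count == 0:
        # Going from an empty filter to anything is always worth a look.
        return 0.0 if new_count == 0 else float("inf")
    return abs(new_count - old_count) / old_count


def delta_exit_code(old_count: int, new_count: int, threshold: float = 0.10) -> int:
    """0 = within threshold, push automatically; 2 = flag for human review."""
    return 2 if relative_delta(old_count, new_count) > threshold else 0


print(delta_exit_code(1000, 1050))  # small drift
print(delta_exit_code(1000, 1200))  # large growth
print(delta_exit_code(1000, 700))   # large shrinkage
```

A real script would count entries in the previous and newly rendered config files and pass the result to sys.exit(). Shrinkage deserves the same scrutiny as growth, since a filter that suddenly loses entries will start dropping legitimate routes.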

Three things production taught us

Lesson 1: max-prefix saves you from your peers

A regional peer added a new aggregate to their as-set in IRR. Our nightly job picked it up and added it to the filter. The following morning their upstream withdrew the route entirely. Our session would have accepted the larger filter and idled around the new boundary. The 80% warning threshold on max-prefix fired an alert hours before the session reached the actual limit. We caught the IRR drift before anyone noticed at the routing layer.

The lesson: IRR data is necessary but not authoritative. Always wrap it in a max-prefix that reflects the peer's actual size plus some headroom, not their theoretical max.
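One way to derive such a limit is from the peer's observed prefix count rather than from the size of their IRR filter. A minimal Python sketch, assuming a 25 percent headroom and rounding up to the next multiple of 50; both values are arbitrary illustrations, not recommendations from our deployment.

```python
import math


def suggested_max_prefix(observed_peak: int, headroom: float = 0.25,
                         round_to: int = 50) -> int:
    """Observed peak plus headroom, rounded up to a tidy number."""
    raw = observed_peak * (1 + headroom)
    return int(math.ceil(raw / round_to) * round_to)


print(suggested_max_prefix(380))
print(suggested_max_prefix(4000))
```

Recomputing this periodically from monitoring data keeps the limit close to reality as the peer grows, while still leaving room for normal churn.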

Lesson 2: "prefer valid" is not "drop invalid"

Early in the rollout we configured RPKI to prefer valid over invalid rather than dropping invalid outright. The reasoning seemed sound at the time: do not break things on day one.

Then a partial hijack happened. A misconfigured AS announced a /24 with an invalid origin, carved out of a /22 that was covered by a ROA. "Prefer valid" only ranks competing paths for the same prefix. The valid /22 stayed in the table, but no valid /24 existed to compete with the invalid one, so the /24 was installed and won longest-prefix match. Traffic for that /24 went to the hijacker for about 12 minutes.

We flipped to drop-invalid on every session that night. We lost a small handful of legitimate prefixes whose ROAs were misconfigured by their owners. We sent them email. They fixed it. Net incident count went down, not up.
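The difference between the two policies is easy to reproduce in a toy forwarding model. The Python sketch below uses placeholder prefixes and next-hop labels, not the prefixes from the actual incident, and reduces route selection to longest-prefix match.

```python
import ipaddress
from typing import List, Optional, Tuple

Route = Tuple[str, str, str]  # (prefix, rov_state, next_hop)


def forwarding_choice(dest_ip: str, routes: List[Route],
                      drop_invalid: bool) -> Optional[str]:
    """Next hop chosen by longest-prefix match after the import policy runs."""
    dest = ipaddress.ip_address(dest_ip)
    installed = [
        (ipaddress.ip_network(prefix), next_hop)
        for prefix, state, next_hop in routes
        if not (drop_invalid and state == "invalid")
    ]
    matching = [(net, nh) for net, nh in installed if dest in net]
    if not matching:
        return None
    return max(matching, key=lambda item: item[0].prefixlen)[1]


routes = [
    ("198.18.0.0/22", "valid", "legitimate-origin"),
    ("198.18.1.0/24", "invalid", "hijacker"),
]

print(forwarding_choice("198.18.1.10", routes, drop_invalid=False))
print(forwarding_choice("198.18.1.10", routes, drop_invalid=True))
```

With drop_invalid disabled the /24 is the only route for its own prefix, so there is nothing for "prefer valid" to prefer, and the more-specific wins. With drop_invalid enabled the /24 never reaches the table and the /22 carries the traffic.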

Lesson 3: pre-commit diff validation catches typos

An AS-path regex update almost dropped a legitimate transit prefix because the regex matched too aggressively. The Ansible diff-check script flagged a delta of minus 1,200 prefixes against the previous run. We caught it before push.

If your automation does not have a "this change looks too big, ask a human" gate, build one. The cost is one extra step per change. The benefit is not paging your team at 02:00 because a regex ate the internet.

How this maps to PCI DSS

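If you operate in a PCI environment, the BGP edge is in scope, even if you sometimes argue otherwise. The relevant controls map cleanly. The requirement numbers below follow PCI DSS v3.2.1; v4.0 renumbers some of them (intrusion detection is 11.5 and change management is 6.5 there), so check against the version you are assessed on: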

Requirement 1.2 (network segmentation): A clean edge defines your perimeter. RPKI drop-invalid is the cleanest possible argument that you reject routes whose origin fails validation.
Requirement 10 (logging): Every RPKI validation state transition, every max-prefix warning, and every prefix-list change must reach your SIEM. Auditors will ask.
Requirement 11.4 (intrusion detection): Anomalous prefix-count deltas, sudden session flaps, and unexpected origin AS changes are network IDS signals. Wire them to alerting.
Requirement 6.4 (change management): Git-backed configs, pre-commit diff validation, and commit-confirm rollback are change management. Show the auditor the commit log and the rollback playbook.

Done right, network hygiene becomes compliance-by-design, not a separate workstream.
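For the logging requirement, the RPKI state transitions need to arrive at the SIEM in a form it can parse. A minimal Python sketch, assuming two snapshots of validation state keyed by (prefix, origin ASN) and a hypothetical JSON event schema; the field names are illustrative, not a standard.

```python
import json
from typing import Dict, List, Tuple

RouteKey = Tuple[str, int]  # (prefix, origin_asn)


def rov_transitions(previous: Dict[RouteKey, str],
                    current: Dict[RouteKey, str]) -> List[str]:
    """Return one JSON line per route whose validation state changed."""
    events = []
    for route, new_state in sorted(current.items()):
        old_state = previous.get(route, "unseen")
        if old_state != new_state:
            events.append(json.dumps({
                "event": "rpki_state_change",
                "prefix": route[0],
                "origin_asn": route[1],
                "old": old_state,
                "new": new_state,
            }, sort_keys=True))
    return events


previous = {("198.18.0.0/22", 64500): "valid"}
current = {
    ("198.18.0.0/22", 64500): "invalid",
    ("198.18.4.0/24", 64501): "not-found",
}

for line in rov_transitions(previous, current):
    print(line)
```

Each line can be shipped through whatever log forwarder already feeds the SIEM. A valid-to-invalid transition on a prefix you carry traffic for is also a reasonable alerting trigger.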

The take-home checklist

If you are setting this up in your own environment:

[ ] max-prefix per eBGP session, with 80% warning threshold
[ ] AS-path filter rejecting your own ASN in the path
[ ] AS-path filter rejecting private ASNs on public sessions
[ ] IRR prefix filter per non-transit peer, regenerated daily
[ ] RPKI ROV with drop-invalid (not prefer-valid)
[ ] At least two RPKI validators, reconciled in monitoring
[ ] Bogon and reserved prefix drops at the edge
[ ] Maximum prefix length filter (typically /24 v4, /48 v6)
[ ] Pre-commit diff validation with a "too big" abort
[ ] commit-confirm or equivalent rollback on push
[ ] Git-backed configs with audit log
[ ] Logging RPKI state transitions to SIEM
[ ] Alerting on prefix-count delta over 10 percent
[ ] Quarterly review of every peer as-set
[ ] Runbook for sudden BGP session flap

If you implement the first six items, you have removed the largest sources of BGP edge risk for your AS. Everything else is operational polish.

Final note

None of this is novel work. IRR has been around since the 1990s, RPKI since the early 2010s. What is novel is that for most payment providers, fintechs, and regulated networks, this is still not the default configuration. The gap between "everyone should be doing this" and "everyone is doing this" is where edge incidents come from.

If your AS is announced on the public internet and you cannot tick off most of the checklist above, you have homework.

I write about production network reliability, BGP edge security, and infrastructure automation. Find me on LinkedIn or reach out at berik@ashimov.com.