137Foundry

Posted on Jun 6

10 Operational Continuity Risks Every Small Engineering Team Should Track

#business #devops #productivity

Small engineering teams tend to underestimate operational continuity risk. The argument goes that the company is too small to need a formal risk register, that the team is small enough to know its own dependencies, and that any real failure will be obvious. In practice none of these holds. The company outgrew "small" two quarters ago, the team is too busy to keep every dependency in working memory, and the obvious failure modes are usually obvious only in retrospect.

Below are ten operational continuity risks that show up over and over in small to mid-size engineering organizations. Most teams are tracking two or three of them. The rest are usually invisible until they hit production.


1. Single owner for a critical service

The most common continuity risk is a service that only one engineer actually understands: a bus factor of one. If that engineer leaves, the working knowledge of the service leaves with them. If they go on a two-week vacation, anything that breaks in the service will sit broken until they get back.

The fix is a deliberate cross-training rotation. Two other engineers should be able to deploy, debug, and modify the service before you call the risk closed. The 137Foundry advisory practice usually finds three or four single-owner services on a first walkthrough of a small engineering org.
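
One way to keep this visible is a small coverage check. The sketch below assumes a hand-maintained map of which engineers can deploy, debug, and modify each service; the service names, engineer names, and the `under_covered` helper are all hypothetical. A service counts as covered only when three people (the owner plus two others) are listed.

```python
from typing import Dict, List, Set

# Hypothetical map: service -> engineers who can deploy, debug, and modify it.
CAPABLE: Dict[str, Set[str]] = {
    "billing-api": {"dana"},
    "auth": {"dana", "lee", "sam"},
    "search-indexer": {"lee", "sam"},
}


def under_covered(capable: Dict[str, Set[str]], required: int = 3) -> List[str]:
    """Return services with fewer than `required` capable engineers, worst first."""
    short = [
        (len(people), name)
        for name, people in capable.items()
        if len(people) < required
    ]
    return [name for _, name in sorted(short)]


if __name__ == "__main__":
    for service in under_covered(CAPABLE):
        print(f"needs cross-training: {service}")
```

The list doubles as a cross-training backlog: the first entry is the service closest to a bus factor of one.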

2. Critical webhook or integration with no retry handling

If your billing or onboarding flow depends on a webhook from a third party, and the webhook delivery has no idempotent retry handling on your side, you have a continuity risk. The third party will fail. The retry will arrive at a state where the operation has half-completed. The reconciliation will be manual.

The Wikipedia article on idempotence covers the underlying property; in practice, the fix is to design every webhook receiver to be safely re-runnable on any payload.
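
A minimal sketch of a re-runnable receiver follows, assuming the third party sends a unique event ID with each delivery. The payload fields (`event_id`, `invoice_id`, `status`), the table names, and the use of SQLite are illustrative assumptions, not any particular provider's format. The key idea is that the "already processed" marker and the business write commit in the same transaction, so a crash halfway through rolls back both and the retry starts clean.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the two hypothetical tables this sketch needs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS invoices "
        "(invoice_id TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )
    return conn


def handle_webhook(conn: sqlite3.Connection, payload: dict) -> str:
    """Apply an event at most once; return 'applied' or 'duplicate'."""
    with conn:  # one transaction: commit on success, roll back on exception
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (payload["event_id"],),
        )
        if cur.rowcount == 0:
            # This event ID was already recorded, so this is a retry.
            return "duplicate"
        conn.execute(
            "INSERT OR REPLACE INTO invoices (invoice_id, status) VALUES (?, ?)",
            (payload["invoice_id"], payload["status"]),
        )
    return "applied"


if __name__ == "__main__":
    store = open_store()
    event = {"event_id": "evt_1", "invoice_id": "inv_9", "status": "paid"}
    print(handle_webhook(store, event))
    print(handle_webhook(store, event))  # same delivery again
```

Side effects outside the database, such as sending an email, are not covered by the transaction and would need their own deduplication.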

3. Credentials with no rotation policy

API keys, database passwords, and integration tokens that have not rotated since the company started carry continuity risk on two fronts. Copies are likely still sitting somewhere a former employee can reach, and the procedure to actually rotate them is untested.

Rotation does not have to be aggressive. Annual rotation with a known runbook is far better than monthly rotation that nobody actually runs. The point is that the rotation procedure is exercised, not that the rotation is frequent.
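
A rotation policy is easier to keep when something flags overdue credentials. The sketch below assumes the team records a last-rotated date per credential; the inventory entries and the `overdue` helper are hypothetical, and the one-year limit mirrors the annual cadence mentioned above.

```python
from datetime import date, timedelta
from typing import Dict, List

MAX_AGE = timedelta(days=365)  # assumption: annual rotation

# Hypothetical inventory: credential name -> date it was last rotated.
INVENTORY: Dict[str, date] = {
    "payments-api-key": date(2023, 1, 10),
    "db-password": date(2024, 3, 1),
}


def overdue(
    last_rotated: Dict[str, date], today: date, max_age: timedelta = MAX_AGE
) -> List[str]:
    """Return names of credentials older than `max_age`, sorted by name."""
    return sorted(
        name for name, rotated in last_rotated.items() if today - rotated > max_age
    )


if __name__ == "__main__":
    for name in overdue(INVENTORY, today=date.today()):
        print(f"rotation overdue: {name}")
```

Run on a schedule, a check like this turns "we should rotate that someday" into a dated reminder to exercise the runbook.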

4. No documented recovery path for the primary database

Most engineering teams have backups. Most engineering teams have not actually restored from those backups within the last six months. The continuity risk is not the absence of backups; it is the absence of a tested restore procedure.

A quarterly restore drill into a staging environment closes this risk. The drill takes a few hours and surfaces problems that would otherwise become disasters during a real incident.
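
The shape of a drill can be sketched in a few lines. The example below uses SQLite purely as a stand-in, with an in-memory database playing the role of staging; a production database would use its own backup and restore tooling. The `restore_drill` function and the minimum-row checks are hypothetical. What matters is the sequence: restore the backup somewhere disposable, then assert that the restored data is actually usable.

```python
import sqlite3
from typing import Dict


def restore_drill(backup_path: str, min_rows: Dict[str, int]) -> Dict[str, bool]:
    """Restore a backup into a scratch database and run basic sanity checks."""
    source = sqlite3.connect(backup_path)
    scratch = sqlite3.connect(":memory:")  # stand-in for a staging environment
    try:
        source.backup(scratch)  # copy the backup into the scratch database
        (verdict,) = scratch.execute("PRAGMA integrity_check").fetchone()
        results = {"integrity": verdict == "ok"}
        for table, minimum in min_rows.items():
            try:
                (count,) = scratch.execute(
                    f'SELECT COUNT(*) FROM "{table}"'
                ).fetchone()
                results[table] = count >= minimum
            except sqlite3.OperationalError:
                # Table is missing from the restored copy.
                results[table] = False
        return results
    finally:
        source.close()
        scratch.close()
```

Any `False` in the result is exactly the kind of problem the drill exists to surface before a real incident does.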

5. Single point of failure on a deployment pipeline

Deployment pipelines that rely on one specific engineer's local setup, one specific token, or one specific CI service contract carry continuity risk. If the deployment pipeline goes down, every shipping change waits.

The fix is to make deployment reproducible from a known set of secrets and configurations, with two or three engineers able to run a deployment without needing to ask anyone.


6. Vendor lock-in on a critical service with no exit plan

Vendor lock-in that nobody has thought through is a slow-motion continuity risk. The vendor might raise prices. The vendor might shut down a service tier. The vendor might be acquired, and the product changed or discontinued.

The fix is not to avoid vendor lock-in. Lock-in often makes economic sense. The fix is to know what an exit would cost, in time and money, and to have written down a sketch of how it would work. OWASP (the Open Worldwide Application Security Project, formerly the Open Web Application Security Project) covers some of the security dimensions of vendor reliance; the operational dimensions are usually company-specific.

7. No on-call rotation or unclear escalation path

If an incident happens at 2 AM, who gets paged? If the first person does not respond in fifteen minutes, who is next? If the issue cannot be solved by the on-call engineer, who has the authority to make business-impact decisions?

A team without clear answers to these three questions has a continuity risk that materializes the first time something serious happens off-hours.
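
Writing the answers down as data makes gaps obvious. The sketch below encodes a hypothetical three-step policy and answers "who should be paged right now?" for an alert that has gone unacknowledged for a given number of minutes. The contact names and the second wait time are assumptions; the fifteen-minute first step follows the question above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EscalationStep:
    contact: str
    wait_minutes: int  # how long to wait for an acknowledgement before moving on


# Hypothetical policy. The last step is the final stop, so its wait is unused.
POLICY: List[EscalationStep] = [
    EscalationStep("primary on-call", 15),
    EscalationStep("secondary on-call", 15),
    EscalationStep("engineering lead (business-impact decisions)", 0),
]


def who_to_page(policy: List[EscalationStep], minutes_unacknowledged: int) -> str:
    """Return the contact responsible after this many unacknowledged minutes."""
    if not policy:
        raise ValueError("escalation policy has no steps")
    elapsed = 0
    for step in policy[:-1]:
        elapsed += step.wait_minutes
        if minutes_unacknowledged < elapsed:
            return step.contact
    return policy[-1].contact


if __name__ == "__main__":
    for minutes in (0, 15, 30):
        print(minutes, "->", who_to_page(POLICY, minutes))
```

If the team cannot fill in the last row of the policy, that is the third question still unanswered.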

8. Critical data flow with no monitoring

Background jobs, data syncs, and scheduled tasks that "just work" until they don't are a common source of silent failure. The continuity risk is not that they fail; it is that they fail silently for days before anyone notices.

The fix is dead simple. Every critical job emits a heartbeat, and a monitor pages someone if two consecutive heartbeats are missed. The discipline is in actually implementing it across every critical flow.
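
A minimal in-process sketch of the idea, assuming each job calls `beat` when it finishes and something else calls `jobs_to_page` on a timer. The `HeartbeatMonitor` class and the job name are hypothetical, and a real setup would more likely use a hosted monitoring service than hand-rolled code. A job is flagged once two full expected intervals have passed without a heartbeat.

```python
from datetime import datetime, timedelta
from typing import Dict, List


class HeartbeatMonitor:
    def __init__(self, missed_before_page: int = 2) -> None:
        self.missed_before_page = missed_before_page
        self.intervals: Dict[str, timedelta] = {}
        self.last_seen: Dict[str, datetime] = {}

    def register(self, job: str, interval: timedelta, now: datetime) -> None:
        """Start tracking a job that should report once per `interval`."""
        self.intervals[job] = interval
        self.last_seen[job] = now

    def beat(self, job: str, now: datetime) -> None:
        """Record that the job ran successfully."""
        self.last_seen[job] = now

    def jobs_to_page(self, now: datetime) -> List[str]:
        """Return jobs whose last heartbeat is at least N whole intervals old."""
        return sorted(
            job
            for job, interval in self.intervals.items()
            if (now - self.last_seen[job]) // interval >= self.missed_before_page
        )


if __name__ == "__main__":
    start = datetime(2024, 1, 1)
    monitor = HeartbeatMonitor()
    monitor.register("nightly-sync", timedelta(hours=24), start)
    print(monitor.jobs_to_page(start + timedelta(hours=49)))
```

Waiting for the second miss trades a little detection latency for fewer false pages from a single slow run.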

9. Documentation gaps in onboarding-critical paths

If a new engineer cannot get the development environment running without three direct messages to the senior engineer, the onboarding process is a continuity risk. The senior engineer who answers those messages will eventually leave, or get promoted into a role that leaves no time to answer them.

The fix is to test onboarding documentation against actual new engineers and fix every place the documentation fails. The exercise takes one day and pays back over years.

10. No risk register that integrates with sprint planning

The meta-risk that contains all the others is the absence of a working risk register that actually influences engineering decisions. Without it, the other nine risks remain individually tracked at best and forgotten at worst.

The longer 137Foundry guide on building a technology risk register your team will actually use walks through what a working register looks like and how to build one in two sessions. The shorter version is that the register has to be specific, owned, costed, and reviewed on a real cadence.
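
As a rough illustration of those four properties, a register row might look like the sketch below. The field names are one interpretation, not the guide's schema: "costed" is read here as an estimate of the effort needed to close the risk, and the six-week cadence in the example is an assumption.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class RiskEntry:
    description: str           # specific: names a service or flow, not a theme
    owner: str                 # owned: one named person
    cost_estimate_days: float  # costed: rough effort to close the risk
    review_every: timedelta    # reviewed: the agreed cadence
    last_reviewed: date

    def review_due(self, today: date) -> bool:
        return today - self.last_reviewed >= self.review_every


def overdue_reviews(register: List[RiskEntry], today: date) -> List[RiskEntry]:
    """Return rows whose review date has passed."""
    return [entry for entry in register if entry.review_due(today)]


if __name__ == "__main__":
    register = [
        RiskEntry(
            description="billing-api has a single owner",
            owner="dana",
            cost_estimate_days=5.0,
            review_every=timedelta(weeks=6),
            last_reviewed=date(2024, 1, 1),
        ),
    ]
    for entry in overdue_reviews(register, today=date.today()):
        print(f"review overdue: {entry.description} (owner: {entry.owner})")
```

Surfacing overdue rows at sprint planning is one simple way to keep the register from going stale.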


How to use this list

Read through the ten risks. Mark each one as "we have this handled," "we have this partly handled," or "we are exposed here." The honest version of this exercise usually surfaces three to five risks that the team has not actually addressed.

Pick the top two by impact. Schedule a focused engineering effort on each one in the next quarter. The effort does not have to be huge. Most of these risks close with a few days of focused work plus a follow-up review six weeks later to confirm the fix held.
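
The triage itself fits in a few lines. In the sketch below, the three statuses mirror the three labels above, while the 1-to-5 impact score and the sample assessment are hypothetical; any consistent scale the team agrees on would work.

```python
from enum import Enum
from typing import List, Tuple


class Status(Enum):
    HANDLED = "we have this handled"
    PARTLY = "we have this partly handled"
    EXPOSED = "we are exposed here"


# Hypothetical self-assessment: (risk, status, impact on a 1-5 scale).
ASSESSMENT: List[Tuple[str, Status, int]] = [
    ("single owner for a critical service", Status.EXPOSED, 4),
    ("credentials with no rotation policy", Status.PARTLY, 3),
    ("no tested database restore", Status.EXPOSED, 5),
    ("no on-call rotation", Status.HANDLED, 5),
]


def pick_focus(assessment: List[Tuple[str, Status, int]], n: int = 2) -> List[str]:
    """Return the `n` highest-impact risks that are not fully handled."""
    open_risks = [
        (impact, name)
        for name, status, impact in assessment
        if status is not Status.HANDLED
    ]
    open_risks.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in open_risks[:n]]


if __name__ == "__main__":
    for risk in pick_focus(ASSESSMENT):
        print(f"next quarter: {risk}")
```

Handled risks drop out regardless of impact, so the output is always the short list of work that still needs scheduling.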

If you have a working risk register, add the surviving risks to it as concrete rows. If you do not, this list of ten is a reasonable seed for the first version. The National Institute of Standards and Technology cybersecurity resources offer additional frameworks for more comprehensive coverage when the organization is ready for the next layer.

The point of the exercise is not to be paranoid. It is to be slightly less surprised when something goes wrong. Every one of these ten risks materializes for someone every year. The teams that survive the materialization without a multi-week incident are the ones that took the boring preparation work seriously, before the alarm went off.

A small engineering team can carry every one of these risks for a long time before any of them fire. That is the trap. The cost of preparation is small and continuous. The cost of an unprepared incident is large and concentrated. Choose accordingly.
