I have been on both sides of the cloud architect interview table. As a hiring manager at Lockheed Martin and Cigna Healthcare, I conducted over 200 technical interviews for cloud architecture roles. As a candidate, I went through interview loops at three Fortune 500 companies and two government contractors.
Each question below includes the answer I would accept from a senior candidate, with the depth and specificity that separates a hire from a rejection.
Foundational Architecture Questions
1. What is the difference between high availability and fault tolerance?
High availability minimizes downtime through redundancy. A system that targets 99.99% availability is allowed about 52.6 minutes of downtime per year. It may experience brief interruptions during failover but recovers quickly.
Fault tolerance means the system continues operating without any interruption when a component fails. Fault tolerance is more expensive because it requires active-active redundancy rather than active-passive.
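The downtime figure for a given availability target is simple arithmetic, and I expect a senior candidate to be able to derive it. A minimal sketch, assuming a 365-day year; the function name is illustrative:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the downtime budget per year for an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = downtime_minutes_per_year(target)
        print(f"{target}% -> {budget:.1f} minutes/year")
```

Each additional "nine" cuts the downtime budget by a factor of ten, which is why the cost of redundancy climbs so quickly.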
2. Explain the CAP theorem and how it applies to cloud database selection.
The CAP theorem states that a distributed system can guarantee at most two of three properties: Consistency, Availability, and Partition tolerance.
In practice, partition tolerance is non-negotiable. The real choice is:
- CP systems (DynamoDB with strongly consistent reads, Cloud Spanner): sacrifice availability during partitions. Use for financial transactions.
- AP systems (DynamoDB with eventually consistent reads, Cassandra at low consistency levels): sacrifice consistency during partitions. Use for social feeds, session stores.
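The CP/AP trade-off can be illustrated with a toy model. This is a sketch only: the `Replica` class and `PartitionError` are hypothetical names, not any database's API, and a real system decides this with quorums rather than a boolean flag.

```python
class PartitionError(Exception):
    """Raised when a CP replica refuses to answer during a partition."""


class Replica:
    """Toy replica that behaves as CP or AP when cut off from its peers."""

    def __init__(self, mode: str):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.value = None
        self.partitioned = False  # True while peers are unreachable

    def write(self, value):
        # CP: reject the write, because peers cannot acknowledge it.
        if self.partitioned and self.mode == "CP":
            raise PartitionError("cannot replicate write to peers")
        # AP: accept locally; peers may diverge until the partition heals.
        self.value = value

    def read(self):
        # CP: refuse rather than risk returning a stale value.
        if self.partitioned and self.mode == "CP":
            raise PartitionError("cannot guarantee the latest value")
        # AP: always answer, possibly with stale data.
        return self.value


if __name__ == "__main__":
    cp, ap = Replica("CP"), Replica("AP")
    for replica in (cp, ap):
        replica.write("balance=100")
        replica.partitioned = True
    print(ap.read())  # AP still answers during the partition
    try:
        cp.read()
    except PartitionError as exc:
        print(f"CP replica unavailable: {exc}")
```

The CP replica returns an error instead of a possibly wrong balance, which is the behavior you want for money; the AP replica keeps serving, which is the behavior you want for a feed.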
3. How do you design a multi-region active-active architecture?
Key challenges: data replication, conflict resolution, and routing.
- Data layer: globally distributed database (DynamoDB Global Tables, CockroachDB, Cloud Spanner) or cross-region replication with conflict resolution
- Application layer: identical stacks per region with feature flags for regional rollouts
- Routing: Route 53 latency-based routing or Cloudflare load balancing
- Conflict resolution: last-writer-wins, vector clocks, or application-level merge logic
- Testing: regular "region evacuation" drills
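Two of the conflict resolution strategies above can be sketched in a few lines. This is an illustration under assumptions: versions are modeled as `(timestamp, region, value)` tuples and vector clocks as plain dicts, and both function names are hypothetical.

```python
def last_writer_wins(a: tuple, b: tuple) -> tuple:
    """Pick the version with the higher timestamp.

    Each version is (timestamp, region, value). Ties on timestamp are
    broken by region name so every region picks the same winner.
    """
    return max(a, b, key=lambda version: (version[0], version[1]))


def compare_vector_clocks(a: dict, b: dict) -> str:
    """Classify how two vector clocks relate to each other."""
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "concurrent"  # true conflict: needs merge logic
    if a_ahead:
        return "a_after_b"
    if b_ahead:
        return "b_after_a"
    return "equal"


if __name__ == "__main__":
    us = (1700000010, "us-east-1", "cart=[book]")
    eu = (1700000012, "eu-west-1", "cart=[pen]")
    print(last_writer_wins(us, eu))
    print(compare_vector_clocks({"us": 2, "eu": 1}, {"us": 1, "eu": 2}))
```

Last-writer-wins is simple but silently discards one of two concurrent writes and depends on clock synchronization. Vector clocks detect the concurrency but leave the merge to the application.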
4. How would you migrate a monolithic application to microservices?
I use the Strangler Fig pattern, not a big-bang rewrite:
- Map domains using domain-driven design. Identify bounded contexts
- Extract incrementally, starting with the domain that has the clearest API boundary
- Separate the shared database into per-service databases with eventual consistency through events
- Introduce an API gateway to route between monolith and new services
- Implement distributed tracing before extracting services
- Budget 6-18 months. Teams that try to finish in 3 months tend to end up with a distributed monolith
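The routing step at the heart of the Strangler Fig pattern can be sketched as a prefix table that grows as services are extracted. This is a minimal sketch; the hostnames and path prefixes are placeholders, and in production this logic would live in the API gateway's configuration rather than in application code.

```python
# Hypothetical upstreams; replace with real service addresses.
MONOLITH_UPSTREAM = "http://monolith.internal"

# Grows one entry at a time as each bounded context is extracted.
EXTRACTED_SERVICES = {
    "/billing": "http://billing.internal",
    "/billing/invoices": "http://invoices.internal",
}


def resolve_upstream(path: str) -> str:
    """Route to an extracted service if one owns the path, else the monolith."""
    # Check longer prefixes first so the most specific service wins.
    for prefix in sorted(EXTRACTED_SERVICES, key=len, reverse=True):
        if path == prefix or path.startswith(prefix + "/"):
            return EXTRACTED_SERVICES[prefix]
    return MONOLITH_UPSTREAM


if __name__ == "__main__":
    for path in ("/billing/invoices/42", "/billing/refunds", "/orders/7"):
        print(path, "->", resolve_upstream(path))
```

Because unmatched paths fall through to the monolith, each extraction is a one-line change that can be reverted by removing the entry.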
5. Containers vs. serverless -- when do you choose each?
| Dimension | Containers | Serverless |
|---|---|---|
| Startup time | Seconds to minutes | Milliseconds to seconds |
| Max execution | Unlimited | 15 minutes (Lambda) |
| Cost model | Per-hour (even when idle) | Per-invocation + duration |
| State | Stateful possible | Stateless by design |
| Best for | Long-running services | Event-driven processing, variable-load APIs |
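The cost model row is the one I ask candidates to quantify. A rough sketch of the comparison follows; every price and workload number in the usage section is a made-up placeholder, not a quote from any provider, and real bills include items (load balancers, data transfer, free tiers) that this ignores.

```python
HOURS_PER_MONTH = 730  # common billing approximation


def container_monthly_cost(instances: int, price_per_instance_hour: float) -> float:
    """Always-on cost: you pay for every hour, busy or idle."""
    return instances * price_per_instance_hour * HOURS_PER_MONTH


def serverless_monthly_cost(
    invocations: int,
    avg_seconds: float,
    memory_gb: float,
    price_per_million_requests: float,
    price_per_gb_second: float,
) -> float:
    """Usage-based cost: a per-request fee plus memory x duration."""
    request_cost = invocations / 1_000_000 * price_per_million_requests
    compute_cost = invocations * avg_seconds * memory_gb * price_per_gb_second
    return request_cost + compute_cost


if __name__ == "__main__":
    # Placeholder prices for illustration only.
    containers = container_monthly_cost(instances=2, price_per_instance_hour=0.05)
    for invocations in (1_000_000, 50_000_000, 500_000_000):
        serverless = serverless_monthly_cost(
            invocations,
            avg_seconds=0.2,
            memory_gb=0.5,
            price_per_million_requests=0.20,
            price_per_gb_second=0.00002,
        )
        cheaper = "serverless" if serverless < containers else "containers"
        print(f"{invocations:>11,} calls/month -> {cheaper}")
```

The shape of the result matters more than the numbers: serverless cost scales linearly with traffic from zero, while container cost is flat, so there is a break-even volume above which always-on capacity is cheaper.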
Originally published at Citadel Cloud Management.