Creating a software system is a lot like constructing a building. If the foundation is not solid, structural problems can undermine the integrity and function of the building.
In this article, we're going to talk about the design principles we can follow to build future-proof, large-scale software. The concepts come from the *AWS Well-Architected Framework* whitepaper, which describes architectural best practices for designing and operating reliable, secure, efficient, and cost-effective systems in the cloud. It provides a way to consistently measure your architectures against best practices and identify areas for improvement. I'll try to summarize the whitepaper here.
So let's first quickly sum up the Guiding Design Principles:
- Stop guessing capacity needs: Scale up & down as required
- Automate everything: Automated systems ensure consistency & reliability
- Test at scale: Test an accurate replica of production on-demand
- Adapt & Evolve: Adapt the architecture as needed to meet new challenges
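To make the first principle concrete, here is a minimal sketch of scaling up and down from a measured metric instead of guessing capacity. It is only an illustration, not the whitepaper's or any AWS service's actual algorithm; the function name `desired_capacity`, the CPU-style metric, and the bounds are assumptions made for this example.

```python
import math


def desired_capacity(current_instances: int, metric_value: float,
                     target_value: float, min_instances: int,
                     max_instances: int) -> int:
    """Return an instance count that moves the metric toward its target.

    Hypothetical proportional rule: if average CPU is 90% and the target
    is 60%, we need roughly 90/60 = 1.5x the current fleet.
    """
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    wanted = math.ceil(current_instances * metric_value / target_value)
    # Clamp so we never scale below a safe floor or above a cost ceiling.
    return max(min_instances, min(max_instances, wanted))


# Load is high: grow from 4 to 6 instances.
print(desired_capacity(4, 90.0, 60.0, min_instances=2, max_instances=10))
# Load is low: shrink, but not below the floor of 2.
print(desired_capacity(4, 15.0, 60.0, min_instances=2, max_instances=10))
```

The floor and ceiling are the part worth keeping in any real setup: the floor protects availability, the ceiling protects the bill.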
The framework is based on 5 pillars:
1). Operational Excellence
2). Cost optimization
3). Reliability
4). Performance Efficiency
5). Security
Operational Excellence
The main emphasis of this pillar is: Does your architecture work? Will it continue to?
Let's look at this pillar's specific principles:
- All operations are code
- Documentation is updated automatically
- Make smaller changes you can roll back
- Iterate...a lot
- Expect things to go sideways
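The "smaller changes you can roll back" idea above can be sketched in a few lines. This is a simplified illustration under stated assumptions: `apply` and `is_healthy` are hypothetical callables standing in for whatever your deployment tooling and health checks actually are.

```python
def deploy_with_rollback(current_version, new_version, apply, is_healthy):
    """Apply new_version and revert to current_version if it is unhealthy.

    `apply` and `is_healthy` are hypothetical hooks supplied by the caller.
    Returns the version that is left running.
    """
    apply(new_version)
    if is_healthy():
        return new_version
    # The change went sideways: put the previous version back.
    apply(current_version)
    return current_version


# Example run with stand-in hooks: the health check fails, so we roll back.
applied = []
running = deploy_with_rollback(
    "v1", "v2",
    apply=applied.append,
    is_healthy=lambda: False,
)
print(running, applied)
```

Because the operation is code, the rollback path gets exercised and reviewed like any other change instead of living in someone's head.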
Cost Optimization
Emphasis: Spend only what you have to
Pillar-specific principles:
- Consumption based pricing
- Measure efficiency constantly
Reliability
Emphasis: **Will this system work consistently & recover quickly?**
Pillar-specific principles:
- Recover from issues automatically
- Scale horizontally first for resilience
- Reduce idle resources
- Manage change through automation
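One common way to "recover from issues automatically" is to retry a failing call with exponential backoff. The sketch below is a generic illustration, not something prescribed by the whitepaper; `call_with_retries` and its parameters are names chosen for this example, and `sleep` is injectable so the behaviour can be checked without waiting.

```python
import time


def call_with_retries(operation, max_attempts=4, base_delay=0.1,
                      sleep=time.sleep):
    """Call operation(), retrying on any exception with doubling delays.

    Waits base_delay, then 2x, then 4x ... between attempts, and re-raises
    the last error once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Example: an operation that fails twice before succeeding.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


delays = []
print(call_with_retries(flaky, sleep=delays.append), delays)
```

In a real system you would usually catch only errors known to be transient and add some random jitter to the delay, so that many clients do not retry in lockstep.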
Performance Efficiency
Emphasis: Remove bottlenecks, reduce waste
Pillar-specific principles:
- Reduce latency
- Serverless
Security
Emphasis: *Does this system work only as intended?*
Pillar-specific principles:
- Automate security tasks
- Encrypt data in transit and at rest
- Know who did what when
- Identities have the least privileges required
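Least privilege and automated security tasks combine nicely: a small script can flag policies that grant more than they should. The sketch below assumes a policy shaped like an IAM JSON policy document (a `Statement` list with `Effect`, `Action`, and `Resource`); the helper names, the `Sid` values, and the bucket name are hypothetical, and this is nowhere near a full policy analyzer.

```python
def _as_list(value):
    """IAM policy fields may be a single string or a list of strings."""
    return [value] if isinstance(value, str) else list(value)


def overly_broad_statements(policy: dict) -> list:
    """Return Allow statements that use a wildcard action or resource."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    flagged = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action", []))
        resources = _as_list(stmt.get("Resource", []))
        wildcard_action = any(a == "*" or a.endswith(":*") for a in actions)
        wildcard_resource = any(r == "*" for r in resources)
        if wildcard_action or wildcard_resource:
            flagged.append(stmt)
    return flagged


# Hypothetical policy: one narrowly scoped statement, one far too broad.
example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "ReadReports", "Effect": "Allow",
         "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::example-bucket/reports/*"},
        {"Sid": "TooBroad", "Effect": "Allow",
         "Action": "s3:*", "Resource": "*"},
    ],
}
print([s["Sid"] for s in overly_broad_statements(example_policy)])
```

Running a check like this on every change is one way to keep "identities have the least privileges required" from eroding over time.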
Operational Excellence In Depth
Operational excellence is the ability to run systems and gain insights into their operations in order to deliver business value, and to continuously improve supporting processes and procedures.
The 3 Phases of Operational Excellence
Prepare-Prioritize: Prioritize to align with business priorities
- What is the business goal?
- What are the critical pieces needed to meet that goal?
- Any compliance restrictions/requirements?
- Dependencies between services?
Design your architecture to support business priorities
- Is the design observable?
- Are your logs & observations actionable?
Is your workload ready to go live?
- Are your processes consistent?
- Is operational code properly managed?
- Are tests in place?
- Is failure anticipated?
- Ensure your workload is actually working
Shit happens. Be ready.
- Anticipate planned & unplanned events
- Respond in code
- Connect observations with 3rd party tools as needed
Evolve
- Learn from success & failure
- Post-event, have runbooks changed?
- Test assumptions
- Experiment early and often to find better solutions
Cost
- Use the appropriate resources & configurations
- Provision to current needs with an eye to future
- Right size to lowest resource that meets needs
- Use data to choose purchase options
- Optimize by geography
- Optimize data transfer
- Know how much you're spending and where
- Continuously work to maximize value delivered
- Align utilization with requirements
- Report and validate findings
- Evaluate new services for value
**Awareness of spend is key to maximizing value**
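"Right size to lowest resource that meets needs" can be expressed as a simple selection over a catalog. The sketch below is illustrative only: the instance names, sizes, and hourly prices in `CATALOG` are made up for the example and are not real AWS offerings or prices.

```python
from typing import NamedTuple, Optional


class InstanceType(NamedTuple):
    name: str
    vcpus: int
    memory_gib: float
    hourly_cost: float


# Hypothetical catalog with invented prices, for illustration only.
CATALOG = [
    InstanceType("small", 2, 4, 0.05),
    InstanceType("medium", 4, 8, 0.10),
    InstanceType("large", 8, 16, 0.20),
]


def right_size(vcpus_needed: int, memory_needed_gib: float,
               catalog=CATALOG) -> Optional[InstanceType]:
    """Return the cheapest instance type that satisfies both requirements,
    or None if nothing in the catalog is big enough."""
    candidates = [
        t for t in catalog
        if t.vcpus >= vcpus_needed and t.memory_gib >= memory_needed_gib
    ]
    return min(candidates, key=lambda t: t.hourly_cost, default=None)


# A workload needing 3 vCPUs and 6 GiB lands on "medium", not "large".
print(right_size(3, 6))
```

The inputs should come from measured utilization rather than estimates, which is where "know how much you're spending and where" feeds back into the choice.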
Reliability
Reliability is the ability of a system to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions.
- Scale horizontally first for resilience
- Reduce idle resources
- Manage change through automation
Limits: Understand default & requested resource limits
Networking: Understand topology, bandwidth & latency
Availability: Ensure your application is ready for business use
- Can users access your application?
- Can you deploy without issue?
- Can you push issues to planned downtime?
- Can your application withstand partial outages?
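A quick calculation shows why horizontal scaling helps an application withstand partial outages. The sketch below uses the standard availability formulas and assumes component failures are independent, which real systems only approximate; the function names are chosen for this example.

```python
import math


def serial_availability(components):
    """Availability of a chain where every component must be up."""
    return math.prod(components)


def redundant_availability(replicas):
    """Availability of a group where at least one replica must be up."""
    return 1 - math.prod(1 - a for a in replicas)


# Two 99%-available services in a chain are less available than either one.
print(serial_availability([0.99, 0.99]))
# Two 99%-available replicas behind a load balancer are far more available.
print(redundant_availability([0.99, 0.99]))
```

Every hard dependency you add multiplies availability down, while every independent replica you add shrinks the chance that everything is down at once.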
Performance Efficiency
Selection:
- Is this the optimal solution for this workload?
- What type of compute best suits it?
- Which data store is ideal for this workload?
- Does your network design complement compute & data store choices?
Review:
- Continuously ensure choices work for your workload
- Is infrastructure stored as code?
- Are deployments simple & automated?
- Can benchmarks be taken automatically?
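As a rough idea of what "benchmarks taken automatically" can look like, here is a small sketch built on Python's standard `timeit` module. The `workload` function and the time budget are placeholders for this example; a real benchmark would exercise your own code path in a production-like environment.

```python
import timeit


def best_time_per_call(func, number=1000, repeat=5):
    """Run func `number` times per round for `repeat` rounds and return
    the best observed time per call, in seconds."""
    totals = timeit.repeat(func, number=number, repeat=repeat)
    return min(totals) / number


def within_budget(seconds_per_call, budget_seconds):
    """True if the measured time does not exceed the agreed budget."""
    return seconds_per_call <= budget_seconds


def workload():
    # Placeholder for the code path you actually care about.
    return sum(range(100))


measured = best_time_per_call(workload)
print(f"{measured:.9f} s per call, within budget: "
      f"{within_budget(measured, budget_seconds=0.001)}")
```

Hooking a check like this into the deployment pipeline turns a performance regression into a failed build instead of a surprise in production.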
Monitoring:
- Use active & passive monitoring where appropriate
- Understand the five phases of monitoring (Generation, Aggregation, Real-time Processing, Storage, Analysis)
- Create actionable metrics
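An actionable metric is one you can attach a clear decision to. The sketch below computes a latency percentile with the nearest-rank method and alerts only when it crosses a threshold; the function names, the p99 default, and the threshold values are assumptions for this example, not recommendations from the whitepaper.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def should_alert(latencies_ms, threshold_ms, p=99):
    """Alert when the p-th percentile latency exceeds the threshold."""
    return percentile(latencies_ms, p) > threshold_ms


# Example with latencies of 1..100 ms: p99 is 99 ms.
latencies = list(range(1, 101))
print(percentile(latencies, 99), should_alert(latencies, threshold_ms=95))
```

Alerting on a high percentile rather than the average keeps the signal tied to what the slowest users actually experience.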
Trade-offs: You can't have it all