Building production-grade infrastructure is difficult. And stressful. And time-consuming. By production-grade infrastructure, I mean the servers, data stores, load balancers, security functionality, monitoring and alerting tools, build pipelines, and all the other pieces of your technology that are necessary to run a business. Your company is placing a bet on you: it’s betting that your infrastructure won’t fall over if traffic goes up, or lose your data if there’s an outage, or allow that data to be compromised when hackers try to break in. If that bet doesn’t work out, your company can go out of business. The checklist below summarizes the tasks that go into production-grade infrastructure.

Task | Description | Example tools |
---|---|---|
Install | Install the software binaries and all dependencies. | Bash, Chef, Ansible, Puppet |
Configure | Configure the software at runtime. Includes port settings, TLS certs, service discovery, leaders, followers, replication, etc. | Bash, Chef, Ansible, Puppet |
Provision | Provision the infrastructure. Includes servers, load balancers, network configuration, firewall settings, IAM permissions, etc. | Terraform, CloudFormation |
Deploy | Deploy the service on top of the infrastructure. Roll out updates with no downtime. Includes blue-green, rolling, and canary deployments. | Terraform, CloudFormation, Kubernetes, ECS |
High availability | Withstand outages of individual processes, servers, services, data centers, and regions. | Multidatacenter, multiregion, replication, auto scaling, load balancing |
Scalability | Scale up and down in response to load. Scale horizontally (more servers) and/or vertically (bigger servers). | Auto scaling, replication, sharding, caching, divide and conquer |
Performance | Optimize CPU, memory, disk, network, and GPU usage. Includes query tuning, benchmarking, load testing, and profiling. | Dynatrace, Valgrind, VisualVM, ab, JMeter |
Networking | Configure static and dynamic IPs, ports, service discovery, firewalls, DNS, SSH access, and VPN access. | VPCs, firewalls, routers, DNS registrars, OpenVPN |
Security | Encryption in transit (TLS) and on disk, authentication, authorization, secrets management, server hardening. | ACM, Let’s Encrypt, KMS, Cognito, Vault, CIS |
Metrics | Availability metrics, business metrics, app metrics, server metrics, events, observability, tracing, and alerting. | CloudWatch, Datadog, New Relic, Honeycomb |
Logs | Rotate logs on disk. Aggregate log data to a central location. | CloudWatch Logs, ELK, Sumo Logic, Papertrail |
Backup and Restore | Make backups of DBs, caches, and other data on a scheduled basis. Replicate to separate region/account. | RDS, ElastiCache, replication |
Cost optimization | Pick proper Instance types, use spot and reserved Instances, use auto scaling, and nuke unused resources. | Auto scaling, spot Instances, reserved Instances |
Documentation | Document your code, architecture, and practices. Create playbooks to respond to incidents. | READMEs, wikis, Slack |
Tests | Write automated tests for your infrastructure code. Run tests after every commit and nightly. | Terratest, InSpec, Serverspec, kitchen-terraform |
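
One way to put the checklist above to work is to track, for each piece of infrastructure, which tasks have been addressed and which are still open. The following is a minimal sketch in Python, not part of the original checklist: the names `CHECKLIST`, `missing_tasks`, and `coverage` are hypothetical, and the task names are simply copied from the first column of the table.

```python
# Hypothetical sketch: the checklist tasks from the table, in table order.
CHECKLIST = [
    "Install", "Configure", "Provision", "Deploy", "High availability",
    "Scalability", "Performance", "Networking", "Security", "Metrics",
    "Logs", "Backup and Restore", "Cost optimization", "Documentation",
    "Tests",
]


def missing_tasks(completed):
    """Return the checklist tasks not yet covered, in checklist order.

    Matching ignores case and surrounding whitespace. Names that are not
    on the checklist are ignored.
    """
    done = {task.strip().lower() for task in completed}
    return [task for task in CHECKLIST if task.lower() not in done]


def coverage(completed):
    """Return the fraction of checklist tasks covered (0.0 to 1.0)."""
    return 1 - len(missing_tasks(completed)) / len(CHECKLIST)


if __name__ == "__main__":
    # Assumed example: a service that has only the first four tasks done.
    done = ["Install", "Configure", "Provision", "Deploy"]
    print("Still missing:", ", ".join(missing_tasks(done)))
    print(f"Coverage: {coverage(done):.0%}")
```

Running a report like this for each service makes it harder to forget the less visible tasks, such as backups, cost optimization, and documentation, that tend to be skipped when a deadline is close.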