As engineers, we often hear "safety" and "security" used interchangeably, but they represent fundamentally different concerns that require distinct approaches.
Understanding this distinction is crucial for building resilient systems that protect both your users and your organization.
The Core Difference
Security is about protecting systems from malicious actors who intentionally try to cause harm, steal data, or disrupt operations.
Safety is about protecting systems and users from unintended failures, bugs, or accidents that could cause harm, even when everyone has good intentions.
Think of it this way: Security asks "What if someone tries to break this?" while Safety asks "What if something goes wrong?"
For Software Engineers
Security Concerns
Software engineers must defend against adversaries actively trying to exploit vulnerabilities.
Key Security Concepts
1. Input Validation and Sanitization
Malicious users will try to inject harmful code or manipulate your system through user inputs.
// ❌ INSECURE: SQL Injection vulnerability
const getUserData = (userId) => {
    return db.query(`SELECT * FROM users WHERE id = ${userId}`);
}
// ✅ SECURE: Parameterized queries
const getUserData = (userId) => {
    return db.query('SELECT * FROM users WHERE id = ?', [userId]);
}
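Parameterized queries handle the injection side; validating the input before it reaches the query is a separate step. Here is a minimal Python sketch of both together, using the standard-library `sqlite3` module. The `users` table, `get_user_data`, and `demo_connection` are hypothetical names made up for this illustration, not part of any real schema.

```python
import sqlite3


def get_user_data(conn: sqlite3.Connection, user_id: str):
    """Return the (id, name) row for user_id, or None if no such user exists."""
    # Validation: reject anything that is not a plain ASCII decimal ID.
    if not (user_id.isascii() and user_id.isdigit()):
        raise ValueError("user_id must be a non-negative integer")
    # Parameter binding: the driver passes the value separately from the SQL text.
    cursor = conn.execute(
        "SELECT id, name FROM users WHERE id = ?", (int(user_id),)
    )
    return cursor.fetchone()


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory database with one hypothetical user for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (1, "alice"))
    return conn
```

An input such as `"1 OR 1=1"` is rejected by the validation step before any SQL runs, and even if validation were skipped, the bound parameter would be treated as data rather than as SQL.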
2. Authentication and Authorization
Ensure users are who they claim to be (authentication) and can only access what they should (authorization).
# ❌ INSECURE: No permission check
@app.route('/api/user/<user_id>/delete', methods=['DELETE'])
def delete_user(user_id):
    User.delete(user_id)
    return {'status': 'deleted'}
# ✅ SECURE: Proper authorization
# <int:user_id> makes Flask pass an int, so the comparison below is not str vs. int
# (assumes current_user.id is an integer)
@app.route('/api/user/<int:user_id>/delete', methods=['DELETE'])
@require_auth
def delete_user(user_id):
    if not current_user.is_admin() and current_user.id != user_id:
        abort(403)  # from flask import abort; returns 403 Forbidden instead of a 500
    User.delete(user_id)
    return {'status': 'deleted'}
3. Secrets Management
This should be 101-level material for both software and cloud engineers, but it doesn't hurt to repeat it: never hardcode credentials or expose sensitive data.
// ❌ INSECURE: Hardcoded credentials
const apiKey = "sk_live_51HxYz2KzP9876543210";
// ✅ SECURE: Environment variables with secret management
const apiKey = process.env.STRIPE_API_KEY;
// Loaded from vault/secret manager in production
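Reading a secret from the environment is only half of the job: the service should also refuse to start when the secret is missing, and it should never print the value. A minimal Python sketch of that idea follows. `require_secret`, `mask`, and `MissingSecretError` are hypothetical helper names invented for this example.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is absent from the environment."""


def require_secret(name: str) -> str:
    """Return the secret stored in environment variable `name`, or fail fast."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Name the variable, never the value, so logs stay free of secrets.
        raise MissingSecretError(f"required secret {name!r} is not set")
    return value


def mask(secret: str) -> str:
    """Return a log-safe form of a secret that shows only its last 4 characters."""
    if len(secret) <= 4:
        return "****"
    return "*" * (len(secret) - 4) + secret[-4:]
```

Calling `require_secret` once at startup turns a missing configuration value into an immediate, obvious failure rather than a confusing error on the first request.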
4. Dependency Security
Third-party libraries can introduce vulnerabilities.
# Regular security audits
npm audit
pip-audit
Security Checklist for Software Engineers
- [ ] All user inputs are validated and sanitized
- [ ] SQL injection prevention via parameterized queries
- [ ] XSS protection implemented (content security policy, output encoding)
- [ ] CSRF tokens on state-changing operations
- [ ] Secure password storage (bcrypt, Argon2)
- [ ] Multi-factor authentication supported
- [ ] API rate limiting implemented
- [ ] Dependencies regularly scanned for vulnerabilities
- [ ] Secrets stored in environment variables or secret managers
- [ ] HTTPS enforced everywhere
- [ ] Security headers configured (HSTS, X-Frame-Options, etc.)
- [ ] Logging excludes sensitive data
- [ ] Regular penetration testing or security reviews
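To make the password storage item concrete, here is a minimal Python sketch. The checklist names bcrypt and Argon2, which need third-party packages; this example swaps in scrypt because it ships in the standard library's `hashlib` (on builds linked against OpenSSL 1.1 or newer). The work factors and the function names `hash_password` and `verify_password` are assumptions for illustration, not recommendations tuned for any particular system.

```python
import hashlib
import hmac
import os

# Work factors are assumptions for this sketch; tune them for your hardware.
SCRYPT_PARAMS = {"n": 2**14, "r": 8, "p": 1, "dklen": 32}


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest) for storage; never store the password itself."""
    salt = os.urandom(16)  # fresh random salt per password
    digest = hashlib.scrypt(password.encode("utf-8"), salt=salt, **SCRYPT_PARAMS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest with the stored salt and compare it safely."""
    candidate = hashlib.scrypt(password.encode("utf-8"), salt=salt, **SCRYPT_PARAMS)
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(candidate, expected)
```

The per-password salt means two users with the same password get different digests, and the deliberately expensive hash slows down offline guessing if the database leaks.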
Safety Concerns
Software engineers must also ensure systems fail gracefully and don't harm users through bugs or design flaws.
Key Safety Concepts
1. Error Handling and Graceful Degradation
Systems should handle failures without causing cascading problems or data loss.
# ❌ UNSAFE: Unhandled exception crashes the service
def process_payment(amount, user_id):
    user = get_user(user_id)
    payment = charge_card(user.card_token, amount)
    update_balance(user_id, amount)
    return payment
# ✅ SAFE: Proper error handling with rollback
def process_payment(amount, user_id):
    try:
        user = get_user(user_id)
        if not user:
            return {'error': 'User not found', 'status': 'failed'}
        payment = charge_card(user.card_token, amount)
        try:
            update_balance(user_id, amount)
        except Exception as e:
            # Rollback the charge if balance update fails
            refund_charge(payment.id)
            log_error(f"Payment processing failed: {e}")
            return {'error': 'Processing error', 'status': 'failed'}
        return {'status': 'success', 'payment_id': payment.id}
    except Exception as e:
        log_error(f"Payment error: {e}")
        return {'error': 'Service temporarily unavailable', 'status': 'failed'}
2. Race Conditions and Concurrency
Multiple operations happening simultaneously can lead to data corruption.
// ❌ UNSAFE: Race condition
var balance = 1000
func withdraw(amount int) {
    if balance >= amount {
        // Another goroutine might modify balance here!
        balance -= amount
    }
}
// ✅ SAFE: Using mutex for synchronization
var (
    balance = 1000
    mu      sync.Mutex
)
func withdraw(amount int) bool {
    mu.Lock()
    defer mu.Unlock()
    if balance >= amount {
        balance -= amount
        return true
    }
    return false
}
3. Data Validation for Integrity
Validate data not just for security, but to prevent logical errors.
// ❌ UNSAFE: No bounds checking
function calculateDiscount(price: number, discountPercent: number): number {
return price * (discountPercent / 100);
}
// ✅ SAFE: Validate business logic constraints
function calculateDiscount(price: number, discountPercent: number): number {
if (price < 0) {
throw new Error('Price cannot be negative');
}
if (discountPercent < 0 || discountPercent > 100) {
throw new Error('Discount must be between 0 and 100');
}
return price * (discountPercent / 100);
}
4. Circuit Breakers and Timeouts
Prevent cascading failures when dependencies fail.
const CircuitBreaker = require('opossum');
const options = {
timeout: 3000, // If function takes longer than 3s, trigger a failure
errorThresholdPercentage: 50, // Open circuit if 50% of requests fail
resetTimeout: 30000 // After 30s, try again
};
async function callExternalAPI(data) {
const response = await fetch('https://api.example.com/data', {
method: 'POST',
body: JSON.stringify(data)
});
return response.json();
}
const breaker = new CircuitBreaker(callExternalAPI, options);
// If the API is down, circuit opens and fails fast
breaker.fire(requestData)
.then(result => console.log(result))
.catch(err => console.log('Service degraded, using fallback'));
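To show what a circuit breaker does internally, here is a minimal state-machine sketch in Python. It is an illustration under simplifying assumptions, not how opossum is implemented: it counts consecutive failures instead of an error percentage, it is not thread-safe, and the names `CircuitBreaker`, `CircuitOpenError`, `max_failures`, and `reset_timeout` are chosen for this example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` instead of waiting on a dependency that is known to be failing, which is what keeps one slow service from tying up every caller upstream.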
Safety Checklist for Software Engineers
- [ ] Comprehensive error handling on all critical paths
- [ ] Database transactions with proper rollback mechanisms
- [ ] Timeouts configured for all external calls
- [ ] Circuit breakers for downstream dependencies
- [ ] Input validation for business logic (not just security)
- [ ] Race condition prevention (locks, atomic operations)
- [ ] Idempotency for critical operations
- [ ] Graceful degradation when services fail
- [ ] Dead letter queues for failed async operations
- [ ] Comprehensive logging and monitoring
- [ ] Feature flags for safe rollouts
- [ ] Automated testing including edge cases
- [ ] Health checks and readiness probes
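Idempotency is the one checklist item above without an example, so here is a minimal in-memory sketch in Python. The idea: the caller attaches a unique key to each logical operation, and a retry with the same key returns the stored result instead of running the operation again. `IdempotentProcessor` is a hypothetical class for illustration; a production version would keep the keys in a shared, durable store rather than a dict.

```python
import threading


class IdempotentProcessor:
    """Run an operation at most once per idempotency key (in-memory sketch)."""

    def __init__(self):
        self._results = {}
        self._lock = threading.Lock()

    def run(self, key, operation):
        # Holding the lock during the operation keeps the sketch simple but
        # serializes all calls; a real store would lock per key.
        with self._lock:
            if key in self._results:
                return self._results[key]  # replay: return the stored result
            result = operation()
            self._results[key] = result
            return result
```

If `operation` raises, nothing is stored, so a later retry with the same key runs it again. That is usually what you want for a charge that never went through.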
For Infrastructure Specialists and Cloud Engineers
Security Concerns
Infrastructure teams must protect the entire system perimeter and prevent unauthorized access.
Key Security Concepts
1. Network Segmentation and Least Privilege
Isolate resources and grant minimal necessary permissions.
# Terraform example: Secure VPC setup
resource "aws_security_group" "web_tier" {
name = "web-tier"
description = "Web tier security group"
# Only allow HTTPS from internet
ingress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
# No SSH from internet
# SSH only from bastion host in private subnet
}
resource "aws_security_group" "database_tier" {
name = "database-tier"
description = "Database tier security group"
# Only allow MySQL from app tier
ingress {
from_port = 3306
to_port = 3306
protocol = "tcp"
security_groups = [aws_security_group.app_tier.id]
}
# No direct internet access
}
2. Identity and Access Management (IAM)
Principle of least privilege for cloud resources.
# ❌ INSECURE: Overly permissive IAM policy
AWSTemplateFormatVersion: '2010-09-09'
Resources:
DeveloperRole:
Type: AWS::IAM::Role
Properties:
ManagedPolicyArns:
- arn:aws:iam::aws:policy/AdministratorAccess # TOO BROAD!
# ✅ SECURE: Scoped permissions
AWSTemplateFormatVersion: '2010-09-09'
Resources:
DeveloperRole:
Type: AWS::IAM::Role
Properties:
Policies:
- PolicyName: DeveloperPolicy
PolicyDocument:
Statement:
- Effect: Allow
Action:
- s3:GetObject
- s3:PutObject
Resource:
- arn:aws:s3:::my-app-bucket/*
- Effect: Allow
Action:
- logs:CreateLogGroup
- logs:CreateLogStream
- logs:PutLogEvents
Resource: arn:aws:logs:*:*:log-group:/aws/lambda/my-app-*
3. Secrets and Encryption
Protect data at rest and in transit.
# Kubernetes example: Referencing secrets from a pod
apiVersion: v1
kind: Secret
metadata:
  name: database-credentials
type: Opaque
data:
  # base64 is an encoding, not encryption: keep real values out of version control
  username: YWRtaW4=
  password: cGFzc3dvcmQ=
---
apiVersion: v1
kind: Pod
metadata:
  name: app-pod
spec:
  containers:
    - name: app
      image: myapp:latest
      env:
        - name: DB_USERNAME
          valueFrom:
            secretKeyRef:
              name: database-credentials
              key: username
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: database-credentials
              key: password
4. Security Monitoring and Intrusion Detection
Detect and respond to threats in real-time.
# Terraform: CloudWatch metric filter and alarm on CloudTrail logs
resource "aws_cloudwatch_log_metric_filter" "unauthorized_api_calls" {
name = "UnauthorizedAPICalls"
log_group_name = "/aws/cloudtrail/organization"
pattern = "{ ($.errorCode = \"*UnauthorizedOperation\") || ($.errorCode = \"AccessDenied*\") }"
metric_transformation {
name = "UnauthorizedAPICalls"
namespace = "Security/Metrics"
value = "1"
}
}
resource "aws_cloudwatch_metric_alarm" "unauthorized_api_calls_alarm" {
alarm_name = "UnauthorizedAPICallsAlarm"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = "1"
metric_name = "UnauthorizedAPICalls"
namespace = "Security/Metrics"
period = "300"
statistic = "Sum"
threshold = "5"
alarm_description = "Triggers when unauthorized API calls exceed threshold"
alarm_actions = [aws_sns_topic.security_alerts.arn]
}
Security Checklist for Infrastructure Teams
- [ ] Network segmentation implemented (VPCs, subnets, security groups)
- [ ] Principle of least privilege for all IAM roles and policies
- [ ] MFA enforced for privileged accounts
- [ ] Secrets managed via vault/secrets manager (not in code)
- [ ] Encryption at rest enabled for all data stores
- [ ] TLS/SSL enforced for all data in transit
- [ ] Regular security patching automated
- [ ] Bastion hosts or VPN for administrative access
- [ ] Audit logging enabled (CloudTrail, Cloud Audit Logs)
- [ ] Intrusion detection system deployed
- [ ] DDoS protection configured
- [ ] Regular vulnerability scanning
- [ ] Container image scanning in CI/CD
- [ ] Web Application Firewall (WAF) configured
- [ ] Backup encryption enabled
Safety Concerns
Infrastructure teams must ensure systems remain available and resilient to failures.
Key Safety Concepts
1. High Availability and Redundancy
Eliminate single points of failure.
# Terraform: Multi-AZ deployment for high availability
resource "aws_autoscaling_group" "app" {
name = "app-asg"
vpc_zone_identifier = [
aws_subnet.private_a.id,
aws_subnet.private_b.id,
aws_subnet.private_c.id
]
min_size = 3
max_size = 10
desired_capacity = 3
# Spread instances across availability zones
health_check_type = "ELB"
health_check_grace_period = 300
launch_template {
id = aws_launch_template.app.id
version = "$Latest"
}
target_group_arns = [aws_lb_target_group.app.arn]
}
resource "aws_lb" "app" {
name = "app-lb"
load_balancer_type = "application"
# Deploy across multiple AZs
subnets = [
aws_subnet.public_a.id,
aws_subnet.public_b.id,
aws_subnet.public_c.id
]
enable_deletion_protection = true
}
2. Disaster Recovery and Backups
Ensure data can be recovered and services restored.
# Kubernetes: Automated backup with Velero
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *" # Daily at 2 AM
  template:
    includedNamespaces:
      - production
      - staging
    storageLocation: aws-backup
    volumeSnapshotLocations:
      - aws-snapshots
    ttl: 720h # 30 days retention
# Terraform: RDS automated backups
resource "aws_db_instance" "production" {
  identifier = "production-db"
  # A retention period above 0 enables automated backups and point-in-time recovery
  backup_retention_period = 30
  backup_window           = "03:00-04:00"
  # Copy instance tags to snapshots so backups stay traceable
  # (this does not copy backups to another region; that needs separate configuration)
  copy_tags_to_snapshot = true
  # Export database logs to CloudWatch Logs (unrelated to backups, useful for audits)
  enabled_cloudwatch_logs_exports = ["audit", "error", "general", "slowquery"]
}
3. Resource Limits and Auto-scaling
Prevent resource exhaustion and ensure capacity.
# Kubernetes: Resource limits and HPA
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-app
spec:
replicas: 3
template:
spec:
containers:
- name: app
image: myapp:latest
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: web-app-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: web-app
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
4. Chaos Engineering and Testing
Proactively test system resilience.
# Chaos Mesh: Simulating pod failures
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: pod-failure-test
namespace: chaos-testing
spec:
action: pod-failure
mode: one
selector:
namespaces:
- production
labelSelectors:
app: web-service
duration: "30s"
scheduler:
cron: "@every 2h"
---
# Network chaos: Simulating latency
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: network-delay-test
namespace: chaos-testing
spec:
action: delay
mode: all
selector:
namespaces:
- production
labelSelectors:
app: api-service
delay:
latency: "100ms"
correlation: "100"
jitter: "0ms"
duration: "5m"
Safety Checklist for Infrastructure Teams
- [ ] Multi-AZ/region deployment for critical services
- [ ] Automated backups with tested recovery procedures
- [ ] Auto-scaling configured for compute resources
- [ ] Resource quotas and limits enforced
- [ ] Health checks and liveness probes configured
- [ ] Load balancers with proper health checks
- [ ] Database replication and failover tested
- [ ] Disaster recovery runbooks documented and tested
- [ ] Monitoring and alerting for resource exhaustion
- [ ] Rate limiting at infrastructure level
- [ ] Canary deployments or blue-green deployment strategy
- [ ] Rollback procedures tested and automated
- [ ] Chaos engineering tests run regularly
- [ ] Capacity planning based on metrics
- [ ] Graceful shutdown handling for pods/instances
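Graceful shutdown also needs cooperation from the application: Kubernetes sends SIGTERM to a pod's containers before killing them, and the process has to notice. Below is a minimal Python sketch of that application side. `GracefulShutdown` and `work_loop` are hypothetical names for this example, and the list standing in for a work queue is a simplification.

```python
import signal
import threading


class GracefulShutdown:
    """Turn SIGTERM/SIGINT into a flag that worker loops can poll."""

    def __init__(self):
        self._stop = threading.Event()

    def install(self):
        # signal.signal must be called from the main thread.
        signal.signal(signal.SIGTERM, self._handle)
        signal.signal(signal.SIGINT, self._handle)

    def _handle(self, signum, frame):
        self._stop.set()

    @property
    def stopping(self):
        return self._stop.is_set()


def work_loop(queue, shutdown):
    """Process items until the queue is empty or shutdown is requested."""
    done = []
    while queue and not shutdown.stopping:
        # The item in progress always finishes; only new work is refused.
        done.append(queue.pop(0))
    return done
```

The handler only sets a flag, so the unit of work already in progress finishes normally and the loop simply stops picking up new items.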
Improving Safety and Security at Your Company
This is not a comprehensive list of what to implement or how, but it should surface a few "oh, we forgot this" items. Some of these topics belong to your internal IT or security team, depending on how big your company is and how responsibilities are split.
Immediate Actions
For Everyone:
- Enable MFA on all accounts
- Audit and rotate credentials
- Review and update dependencies
- Set up basic monitoring and alerting
For Software Engineers:
- Add input validation to critical endpoints
- Implement proper error handling
- Add health check endpoints
For Infrastructure Teams:
- Review IAM policies for over-privileged access
- Enable audit logging
- Verify backup processes work
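A first pass at reviewing IAM policies for over-privileged access can be scripted. The Python sketch below flags `Allow` statements that use a bare `*` or a service-wide wildcard such as `s3:*`. It is only a heuristic under stated assumptions: it takes a policy document already parsed into a dict, it ignores `NotAction`, conditions, and partial wildcards, and `find_overly_broad_statements` is a hypothetical helper name.

```python
def find_overly_broad_statements(policy: dict) -> list[dict]:
    """Return Allow statements with a '*' action or resource, or a 'service:*' action."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM allows a single statement object
        statements = [statements]

    def as_list(value):
        # Action and Resource may each be a string or a list of strings.
        if value is None:
            return []
        return value if isinstance(value, list) else [value]

    flagged = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = as_list(statement.get("Action"))
        resources = as_list(statement.get("Resource"))
        if "*" in actions or "*" in resources or any(a.endswith(":*") for a in actions):
            flagged.append(statement)
    return flagged
```

Anything this flags still needs a human decision, since some wildcards are legitimate, but it gives the review a place to start.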
Short-term Improvements
For Software Engineers:
- Implement automated security scanning in CI/CD (and if you don't know how to do it, don't suffer in silence: ask your infra folks for help!)
- Add comprehensive test coverage for edge cases
- Implement circuit breakers for external dependencies
- Set up proper secrets management
- Add structured logging with correlation IDs
For Infrastructure Teams:
- Implement network segmentation
- Set up automated patching
- Configure auto-scaling
- Implement blue-green or canary deployments
- Set up cross-region backups
Long-term Strategic Initiatives
Organization-wide:
- Establish a security champions program
- (If the budget allows) Implement a bug bounty program
- Conduct regular disaster recovery drills
- Implement chaos engineering practices
- Create incident response playbooks
- Regular security and safety training
- Implement observability stack (metrics, logs, traces)
- Conduct penetration testing
- Establish SRE practices and SLOs
Culture Building:
- Blameless post-mortems for incidents
- Security and safety in code review checklists
- Threat modeling for new features
- Regular game days for failure scenarios
- Share lessons learned across teams
"Real-world" examples
Infrastructure Setup
# Kubernetes deployment with both safety and security
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  replicas: 5 # Safety: Multiple replicas
  # Required by apps/v1: ties the Deployment to its pods
  selector:
    matchLabels:
      app: payment-service
  strategy:
    type: RollingUpdate # Safety: Zero-downtime deployments
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: payment-service # Also matched by the PodDisruptionBudget below
    spec:
      # Security: Run as non-root
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      containers:
        - name: payment-service
          image: payment-service:v1.2.3
          # Security: Read-only filesystem
          securityContext:
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
          # Safety: Resource limits prevent resource exhaustion
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
          # Safety: Liveness probe ensures unhealthy containers restart
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          # Safety: Readiness probe prevents traffic to unready containers
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
          # Security: Secrets injected from a Kubernetes Secret (ideally synced from a vault)
          env:
            - name: PAYMENT_GATEWAY_API_KEY
              valueFrom:
                secretKeyRef:
                  name: payment-secrets
                  key: gateway-api-key
            - name: ENCRYPTION_KEY
              valueFrom:
                secretKeyRef:
                  name: payment-secrets
                  key: encryption-key
          # Safety: Graceful shutdown
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 15"] # Wait for connections to drain
# Security: Network policies restrict access
# Safety: Pod disruption budget ensures availability during maintenance
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-service-pdb
spec:
  minAvailable: 3 # Safety: Always keep 3 pods running
  selector:
    matchLabels:
      app: payment-service
Safety and security are complementary but distinct disciplines. Security protects against malicious actors, while safety protects against failures and accidents. Both are essential for building trustworthy systems.
Remember:
- Security = Protecting against adversaries
- Safety = Protecting against failures
The best engineering teams excel at both. Start with the checklists above, implement improvements incrementally, and build a culture where both safety and security are everyone's responsibility.
Further Reading
- OWASP Top Ten - Essential security vulnerabilities
- Google SRE Book - Safety and reliability practices
- AWS Well-Architected Framework - Cloud architecture best practices
- Chaos Engineering: System Resiliency in Practice
- NIST Cybersecurity Framework - Comprehensive security guidance