Anderson Leite

Posted on Nov 16

Safety vs Security in Software: A Practical Guide for Engineers and Infrastructure Teams

#safety #security #infrastructureascode #softwareengineering

As engineers, we often hear "safety" and "security" used interchangeably, but they represent fundamentally different concerns that require distinct approaches.

Understanding this distinction is crucial for building resilient systems that protect both your users and your organization.

The Core Difference

Security is about protecting systems from malicious actors who intentionally try to cause harm, steal data, or disrupt operations.

Safety is about protecting systems and users from unintended failures, bugs, or accidents that could cause harm, even when everyone has good intentions.

Think of it this way: Security asks "What if someone tries to break this?" while Safety asks "What if something goes wrong?"

For Software Engineers

Security Concerns

Software engineers must defend against adversaries actively trying to exploit vulnerabilities.

Key Security Concepts

1. Input Validation and Sanitization

Malicious users will try to inject harmful code or manipulate your system through user inputs.

// ❌ INSECURE: SQL injection vulnerability
const getUserData = (userId) => {
  return db.query(`SELECT * FROM users WHERE id = ${userId}`);
}

// ✅ SECURE: Parameterized queries
const getUserData = (userId) => {
  return db.query('SELECT * FROM users WHERE id = ?', [userId]);
}
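The same idea carries over to other languages. Here is a minimal Python sketch using the standard-library sqlite3 module; the users table, its columns, and the get_user_data name are made up for illustration. It shows that a bound parameter is treated purely as data, so an injection payload simply matches nothing:

```python
import sqlite3


def get_user_data(conn: sqlite3.Connection, user_id) -> list:
    # The driver binds user_id as a value; it is never spliced into the SQL text.
    cursor = conn.execute("SELECT id, name FROM users WHERE id = ?", (user_id,))
    return cursor.fetchall()


# Hypothetical in-memory table, only for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])

legit = get_user_data(conn, 1)             # one matching row
attack = get_user_data(conn, "1 OR 1=1")   # payload is compared as a literal value
print(legit, attack)
```

With string interpolation, the second call would have returned every row in the table.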

2. Authentication and Authorization

Ensure users are who they claim to be (authentication) and can only access what they should (authorization).

# ❌ INSECURE: No permission check
@app.route('/api/user/<user_id>/delete', methods=['DELETE'])
def delete_user(user_id):
    User.delete(user_id)
    return {'status': 'deleted'}

# ✅ SECURE: Proper authorization
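@app.route('/api/user/<int:user_id>/delete', methods=['DELETE'])
@require_auth
def delete_user(user_id):
    # <int:user_id> parses the ID as an integer so it compares correctly with
    # current_user.id (assumed to be an integer here)
    if not current_user.is_admin() and current_user.id != user_id:
        abort(403)  # flask.abort: respond 403 Forbidden instead of an unhandled 500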
    User.delete(user_id)
    return {'status': 'deleted'}

3. Secrets Management

This should be 101 material for both software and cloud engineers, but it doesn't hurt to repeat it: never hardcode credentials or expose sensitive data.

// ❌ INSECURE: Hardcoded credentials
const apiKey = "sk_live_51HxYz2KzP9876543210";

// ✅ SECURE: Environment variables with secret management
const apiKey = process.env.STRIPE_API_KEY;
// Loaded from vault/secret manager in production
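One detail that is easy to miss: an environment variable that is silently missing can be worse than one that fails loudly. A small Python sketch of a fail-fast lookup follows; the require_secret helper and MissingSecretError are hypothetical names, not part of any library:

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is absent from the environment."""


def require_secret(name: str) -> str:
    value = os.environ.get(name, "")
    if not value:
        # Report only the variable name; never echo secret values into logs.
        raise MissingSecretError(f"required secret {name} is not set")
    return value
```

Calling something like require_secret("STRIPE_API_KEY") once at startup means a misconfigured deployment stops immediately instead of failing on the first payment.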

4. Dependency Security

Third-party libraries can introduce vulnerabilities.

# Regular security audits
npm audit
pip-audit

Security Checklist for Software Engineers

[ ] All user inputs are validated and sanitized
[ ] SQL injection prevention via parameterized queries
[ ] XSS protection implemented (content security policy, output encoding)
[ ] CSRF tokens on state-changing operations
[ ] Secure password storage (bcrypt, Argon2)
[ ] Multi-factor authentication supported
[ ] API rate limiting implemented
[ ] Dependencies regularly scanned for vulnerabilities
[ ] Secrets stored in environment variables or secret managers
[ ] HTTPS enforced everywhere
[ ] Security headers configured (HSTS, X-Frame-Options, etc.)
[ ] Logging excludes sensitive data
[ ] Regular penetration testing or security reviews
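
Rate limiting tends to stay an abstract checkbox, so here is a rough Python sketch of one common approach, a token bucket. The class name and parameters are illustrative assumptions, it keeps state in memory for a single process, and it is not thread-safe as written:

```python
import time
from typing import Callable


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock          # injectable so the behaviour can be tested
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

In a real API you would typically keep one bucket per client or API key in a shared store, and answer with HTTP 429 when allow() returns False.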

Safety Concerns

Software engineers must also ensure systems fail gracefully and don't harm users through bugs or design flaws.

Key Safety Concepts

1. Error Handling and Graceful Degradation

Systems should handle failures without causing cascading problems or data loss.

# ❌ UNSAFE: Unhandled exception crashes the service
def process_payment(amount, user_id):
    user = get_user(user_id)
    payment = charge_card(user.card_token, amount)
    update_balance(user_id, amount)
    return payment

# ✅ SAFE: Proper error handling with rollback
def process_payment(amount, user_id):
    try:
        user = get_user(user_id)
        if not user:
            return {'error': 'User not found', 'status': 'failed'}

        payment = charge_card(user.card_token, amount)

        try:
            update_balance(user_id, amount)
        except Exception as e:
            # Rollback the charge if balance update fails
            refund_charge(payment.id)
            log_error(f"Payment processing failed: {e}")
            return {'error': 'Processing error', 'status': 'failed'}

        return {'status': 'success', 'payment_id': payment.id}
    except Exception as e:
        log_error(f"Payment error: {e}")
        return {'error': 'Service temporarily unavailable', 'status': 'failed'}

2. Race Conditions and Concurrency

Multiple operations happening simultaneously can lead to data corruption.

// ❌ UNSAFE: Race condition
var balance = 1000

func withdraw(amount int) {
    if balance >= amount {
        // Another goroutine might modify balance here!
        balance -= amount
    }
}

// ✅ SAFE: Using mutex for synchronization
var (
    balance = 1000
    mu      sync.Mutex
)

func withdraw(amount int) bool {
    mu.Lock()
    defer mu.Unlock()

    if balance >= amount {
        balance -= amount
        return true
    }
    return false
}
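
For comparison, a Python sketch of the same check-then-update pattern guarded by a lock. The Account class and the run_concurrent_withdrawals helper are invented for this example:

```python
import threading


class Account:
    def __init__(self, balance: int) -> None:
        self._balance = balance
        self._lock = threading.Lock()

    def withdraw(self, amount: int) -> bool:
        # The check and the update happen under one lock, so no other
        # thread can change the balance between them.
        with self._lock:
            if self._balance >= amount:
                self._balance -= amount
                return True
            return False

    @property
    def balance(self) -> int:
        with self._lock:
            return self._balance


def run_concurrent_withdrawals(account: Account, amount: int, attempts: int) -> int:
    """Start `attempts` threads that each try one withdrawal; return how many succeeded."""
    results = []

    def worker() -> None:
        results.append(account.withdraw(amount))

    threads = [threading.Thread(target=worker) for _ in range(attempts)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sum(results)
```

Starting from 1000 and attempting fifty withdrawals of 100, exactly ten can succeed and the balance never goes negative.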

3. Data Validation for Integrity

Validate data not just for security, but to prevent logical errors.

// ❌ UNSAFE: No bounds checking
function calculateDiscount(price: number, discountPercent: number): number {
  return price * (discountPercent / 100);
}

// ✅ SAFE: Validate business logic constraints
function calculateDiscount(price: number, discountPercent: number): number {
  if (price < 0) {
    throw new Error('Price cannot be negative');
  }
  if (discountPercent < 0 || discountPercent > 100) {
    throw new Error('Discount must be between 0 and 100');
  }
  return price * (discountPercent / 100);
}

4. Circuit Breakers and Timeouts

Prevent cascading failures when dependencies fail.

const CircuitBreaker = require('opossum');

const options = {
  timeout: 3000, // If function takes longer than 3s, trigger a failure
  errorThresholdPercentage: 50, // Open circuit if 50% of requests fail
  resetTimeout: 30000 // After 30s, try again
};

async function callExternalAPI(data) {
  const response = await fetch('https://api.example.com/data', {
    method: 'POST',
    body: JSON.stringify(data)
  });
  return response.json();
}

const breaker = new CircuitBreaker(callExternalAPI, options);

// If the API is down, circuit opens and fails fast
breaker.fire(requestData)
  .then(result => console.log(result))
  .catch(err => console.log('Service degraded, using fallback'));
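
To make the mechanics less of a black box, here is a simplified Python sketch of what a circuit breaker does internally. This is not how opossum is implemented: it counts consecutive failures rather than an error percentage, and every name in it is an illustrative assumption.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, func: Callable[..., Any], failure_threshold: int = 3,
                 reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._func = func
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = self._func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get CircuitOpenError immediately and the struggling dependency receives no traffic, which is what stops the failure from cascading.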

Safety Checklist for Software Engineers

[ ] Comprehensive error handling on all critical paths
[ ] Database transactions with proper rollback mechanisms
[ ] Timeouts configured for all external calls
[ ] Circuit breakers for downstream dependencies
[ ] Input validation for business logic (not just security)
[ ] Race condition prevention (locks, atomic operations)
[ ] Idempotency for critical operations
[ ] Graceful degradation when services fail
[ ] Dead letter queues for failed async operations
[ ] Comprehensive logging and monitoring
[ ] Feature flags for safe rollouts
[ ] Automated testing including edge cases
[ ] Health checks and readiness probes
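
Idempotency is one of the less obvious items on this list. A bare-bones Python sketch of the idea, assuming the client sends a unique key with each logical request; the IdempotentExecutor name is hypothetical, and the in-memory dict stands in for what would need to be a shared, durable store with locking in production:

```python
from typing import Any, Callable, Dict


class IdempotentExecutor:
    """Run an operation at most once per idempotency key (in-memory sketch)."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}

    def run(self, key: str, operation: Callable[[], Any]) -> Any:
        if key in self._results:
            # A retry of a request we already handled: return the stored
            # result instead of repeating the side effect.
            return self._results[key]
        result = operation()
        self._results[key] = result
        return result
```

If a client times out and retries a payment with the same key, the charge runs once and both attempts see the same result.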

For Infrastructure Specialists and Cloud Engineers

Security Concerns

Infrastructure teams must protect the entire system perimeter and prevent unauthorized access.

Key Security Concepts

1. Network Segmentation and Least Privilege

Isolate resources and grant minimal necessary permissions.

# Terraform example: Secure VPC setup
resource "aws_security_group" "web_tier" {
  name        = "web-tier"
  description = "Web tier security group"

  # Only allow HTTPS from internet
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # No SSH from internet
  # SSH only from bastion host in private subnet
}

resource "aws_security_group" "database_tier" {
  name        = "database-tier"
  description = "Database tier security group"

  # Only allow MySQL from app tier
  ingress {
    from_port       = 3306
    to_port         = 3306
    protocol        = "tcp"
    security_groups = [aws_security_group.app_tier.id]
  }

  # No direct internet access
}
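
Rules like these drift over time, so it can help to check them automatically. Below is a small Python sketch of such a check. The IngressRule shape is a simplification invented for this example, not the format of any AWS API response, and the default "only 443 may be public" policy is an assumption:

```python
import ipaddress
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class IngressRule:
    port: int
    cidr: str


def publicly_exposed(rules: List[IngressRule],
                     allowed_public_ports: FrozenSet[int] = frozenset({443})
                     ) -> List[IngressRule]:
    """Return rules open to the whole internet on ports other than the allowed ones."""
    exposed = []
    for rule in rules:
        network = ipaddress.ip_network(rule.cidr)
        # A prefix length of 0 (0.0.0.0/0 or ::/0) means "any address".
        if network.prefixlen == 0 and rule.port not in allowed_public_ports:
            exposed.append(rule)
    return exposed
```

Fed with the rules from your real security groups, a check like this could run in CI and fail the build when someone opens SSH to the world.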

2. Identity and Access Management (IAM)

Principle of least privilege for cloud resources.

# ❌ INSECURE: Overly permissive IAM policy
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  DeveloperRole:
    Type: AWS::IAM::Role
    Properties:
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AdministratorAccess  # TOO BROAD!

# ✅ SECURE: Scoped permissions
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  DeveloperRole:
    Type: AWS::IAM::Role
    Properties:
      Policies:
        - PolicyName: DeveloperPolicy
          PolicyDocument:
            Statement:
              - Effect: Allow
                Action:
                  - s3:GetObject
                  - s3:PutObject
                Resource:
                  - arn:aws:s3:::my-app-bucket/*
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                Resource: arn:aws:logs:*:*:log-group:/aws/lambda/my-app-*
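
Spotting over-broad policies by eye gets harder as their number grows. A rough Python sketch of a lint pass over a policy document, already parsed into a dict, might look like this. The function name is made up, and the "wildcard action or wildcard resource" rule is a deliberately crude heuristic rather than a full policy analysis:

```python
from typing import Any, Dict, List


def overly_broad_statements(policy: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return Allow statements that use a wildcard action or a "*" resource."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear without a list
        statements = [statements]

    findings = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        resources = statement.get("Resource", [])
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        wildcard_action = any(a == "*" or a.endswith(":*") for a in actions)
        if wildcard_action or "*" in resources:
            findings.append(statement)
    return findings
```

The scoped policy above passes this check, while an AdministratorAccess-style statement (Action "*" on Resource "*") is flagged.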

3. Secrets and Encryption

Protect data at rest and in transit.

# Kubernetes example: Using secrets properly
apiVersion: v1
kind: Secret
metadata:
  name: database-credentials
type: Opaque
data:
  username: YWRtaW4=  # base64 encoded, NOT encrypted; keep real values out of version control
  password: cGFzc3dvcmQ=

---
apiVersion: v1
kind: Pod
metadata:
  name: app-pod
spec:
  containers:
  - name: app
    image: myapp:latest
    env:
    - name: DB_USERNAME
      valueFrom:
        secretKeyRef:
          name: database-credentials
          key: username
    - name: DB_PASSWORD
      valueFrom:
        secretKeyRef:
          name: database-credentials
          key: password
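
A caveat worth making concrete: the values in a Secret manifest are only base64 encoded, and anyone who can read the manifest can reverse them. A short Python illustration using the two values from the example above (the decode_secret_data helper is just a name chosen for this sketch):

```python
import base64
from typing import Dict


def decode_secret_data(data: Dict[str, str]) -> Dict[str, str]:
    """Decode the base64 values of a Secret's `data` field."""
    return {key: base64.b64decode(value).decode("utf-8")
            for key, value in data.items()}


manifest_data = {"username": "YWRtaW4=", "password": "cGFzc3dvcmQ="}
decoded = decode_secret_data(manifest_data)
print(decoded)  # {'username': 'admin', 'password': 'password'}
```

This is why the manifest itself should not live in a repository in plain form, and why encryption at rest for the cluster's secret storage is worth enabling.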

4. Security Monitoring and Intrusion Detection

Detect and respond to threats in real-time.

# Terraform: CloudWatch metric filter and alarm on CloudTrail logs
resource "aws_cloudwatch_log_metric_filter" "unauthorized_api_calls" {
  name           = "UnauthorizedAPICalls"
  log_group_name = "/aws/cloudtrail/organization"

  pattern = "{ ($.errorCode = \"*UnauthorizedOperation\") || ($.errorCode = \"AccessDenied*\") }"

  metric_transformation {
    name      = "UnauthorizedAPICalls"
    namespace = "Security/Metrics"
    value     = "1"
  }
}

resource "aws_cloudwatch_metric_alarm" "unauthorized_api_calls_alarm" {
  alarm_name          = "UnauthorizedAPICallsAlarm"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "1"
  metric_name         = "UnauthorizedAPICalls"
  namespace           = "Security/Metrics"
  period              = "300"
  statistic           = "Sum"
  threshold           = "5"
  alarm_description   = "Triggers when unauthorized API calls exceed threshold"
  alarm_actions       = [aws_sns_topic.security_alerts.arn]
}
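
To see what the filter and alarm do with actual events, here is an approximate Python model of the logic. It imitates the pattern with shell-style wildcards from the standard-library fnmatch module, which is an approximation of CloudWatch's filter syntax and not the service's real matching engine; the function names are invented:

```python
from fnmatch import fnmatchcase
from typing import Any, Dict, Iterable

# Mirrors the two errorCode conditions in the metric filter pattern.
PATTERNS = ("*UnauthorizedOperation", "AccessDenied*")


def is_unauthorized(event: Dict[str, Any]) -> bool:
    code = event.get("errorCode", "")
    return any(fnmatchcase(code, pattern) for pattern in PATTERNS)


def alarm_fires(events: Iterable[Dict[str, Any]], threshold: int = 5) -> bool:
    """True when matching events in one evaluation period exceed the threshold."""
    matches = sum(1 for event in events if is_unauthorized(event))
    return matches > threshold  # GreaterThanThreshold, as in the alarm above
```

Note that the comparison is strictly greater than, so five unauthorized calls within the 300-second period do not trigger the alarm, but six do.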

Security Checklist for Infrastructure Teams

[ ] Network segmentation implemented (VPCs, subnets, security groups)
[ ] Principle of least privilege for all IAM roles and policies
[ ] MFA enforced for privileged accounts
[ ] Secrets managed via vault/secrets manager (not in code)
[ ] Encryption at rest enabled for all data stores
[ ] TLS/SSL enforced for all data in transit
[ ] Regular security patching automated
[ ] Bastion hosts or VPN for administrative access
[ ] Audit logging enabled (CloudTrail, Cloud Audit Logs)
[ ] Intrusion detection system deployed
[ ] DDoS protection configured
[ ] Regular vulnerability scanning
[ ] Container image scanning in CI/CD
[ ] Web Application Firewall (WAF) configured
[ ] Backup encryption enabled

Safety Concerns

Infrastructure teams must ensure systems remain available and resilient to failures.

Key Safety Concepts

1. High Availability and Redundancy

Eliminate single points of failure.

# Terraform: Multi-AZ deployment for high availability
resource "aws_autoscaling_group" "app" {
  name                = "app-asg"
  vpc_zone_identifier = [
    aws_subnet.private_a.id,
    aws_subnet.private_b.id,
    aws_subnet.private_c.id
  ]

  min_size         = 3
  max_size         = 10
  desired_capacity = 3

  # Instances are spread across the AZs of the subnets above; the load
  # balancer's health checks decide when an instance gets replaced
  health_check_type         = "ELB"
  health_check_grace_period = 300

  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }

  target_group_arns = [aws_lb_target_group.app.arn]
}

resource "aws_lb" "app" {
  name               = "app-lb"
  load_balancer_type = "application"

  # Deploy across multiple AZs
  subnets = [
    aws_subnet.public_a.id,
    aws_subnet.public_b.id,
    aws_subnet.public_c.id
  ]

  enable_deletion_protection = true
}

2. Disaster Recovery and Backups

Ensure data can be recovered and services restored.

# Kubernetes: Automated backup with Velero
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  template:
    includedNamespaces:
    - production
    - staging
    storageLocation: aws-backup
    volumeSnapshotLocations:
    - aws-snapshots
    ttl: 720h  # 30 days retention

# Terraform: RDS automated backups
resource "aws_db_instance" "production" {
  identifier = "production-db"

  # A retention period above 0 enables automated backups and point-in-time recovery
  backup_retention_period = 30
  backup_window           = "03:00-04:00"

  # Copy the instance's tags onto its snapshots
  # (cross-region backup copies have to be configured separately)
  copy_tags_to_snapshot = true

  # Export database logs to CloudWatch Logs
  enabled_cloudwatch_logs_exports = ["audit", "error", "general", "slowquery"]
}

3. Resource Limits and Auto-scaling

Prevent resource exhaustion and ensure capacity.

# Kubernetes: Resource limits and HPA
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
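spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app  # Required in apps/v1; must match the pod template labels
  template:
    metadata:
      labels:
        app: web-app
    spec: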
      containers:
      - name: app
        image: myapp:latest
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5

---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

4. Chaos Engineering and Testing

Proactively test system resilience.

# Chaos Mesh: Simulating pod failures
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-test
  namespace: chaos-testing
spec:
  action: pod-failure
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: web-service
  duration: "30s"
  scheduler:
    cron: "@every 2h"

---
# Network chaos: Simulating latency
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-test
  namespace: chaos-testing
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-service
  delay:
    latency: "100ms"
    correlation: "100"
    jitter: "0ms"
  duration: "5m"

Safety Checklist for Infrastructure Teams

[ ] Multi-AZ/region deployment for critical services
[ ] Automated backups with tested recovery procedures
[ ] Auto-scaling configured for compute resources
[ ] Resource quotas and limits enforced
[ ] Health checks and liveness probes configured
[ ] Load balancers with proper health checks
[ ] Database replication and failover tested
[ ] Disaster recovery runbooks documented and tested
[ ] Monitoring and alerting for resource exhaustion
[ ] Rate limiting at infrastructure level
[ ] Canary deployments or blue-green deployment strategy
[ ] Rollback procedures tested and automated
[ ] Chaos engineering tests run regularly
[ ] Capacity planning based on metrics
[ ] Graceful shutdown handling for pods/instances

Improving Safety and Security at Your Company

This is not a comprehensive list of what you should implement, but it can trigger an "oh, we forgot this" moment. Some of these topics may belong to your internal IT or security team, depending on how big your company is and how responsibilities are split.

Immediate Actions

For Everyone:

Enable MFA on all accounts
Audit and rotate credentials
Review and update dependencies
Set up basic monitoring and alerting

For Software Engineers:

Add input validation to critical endpoints
Implement proper error handling
Add health check endpoints

For Infrastructure Teams:

Review IAM policies for over-privileged access
Enable audit logging
Verify backup processes work

Short-term Improvements

For Software Engineers:

Implement automated security scanning in CI/CD (and if you don't know how to do it, don't suffer in silence: ask your infra folks for help!)
Add comprehensive test coverage for edge cases
Implement circuit breakers for external dependencies
Set up proper secrets management
Add structured logging with correlation IDs

For Infrastructure Teams:

Implement network segmentation
Set up automated patching
Configure auto-scaling
Implement blue-green or canary deployments
Set up cross-region backups

Long-term Strategic Initiatives

Organization-wide:

Establish security champions program
Implement bug bounty programs (if the budget allows)
Conduct regular disaster recovery drills
Implement chaos engineering practices
Create incident response playbooks
Regular security and safety training
Implement observability stack (metrics, logs, traces)
Conduct penetration testing
Establish SRE practices and SLOs

Culture Building:

Blameless post-mortems for incidents
Security and safety in code review checklists
Threat modeling for new features
Regular game days for failure scenarios
Share lessons learned across teams

"Real-world" examples

Infrastructure Setup

# Kubernetes deployment with both safety and security
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
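spec:
  replicas: 5  # Safety: Multiple replicas
  selector:
    matchLabels:
      app: payment-service  # Required in apps/v1; must match the pod template labels
  strategy:
    type: RollingUpdate  # Safety: Zero-downtime deployments
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: payment-service  # Also the label the PodDisruptionBudget selects on
    spec: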
      # Security: Run as non-root
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000

      containers:
      - name: payment-service
        image: payment-service:v1.2.3

        # Security: Read-only filesystem
        securityContext:
          readOnlyRootFilesystem: true
          allowPrivilegeEscalation: false
          capabilities:
            drop:
              - ALL

        # Safety: Resource limits prevent resource exhaustion
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "1000m"

        # Safety: Liveness probe ensures unhealthy containers restart
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3

        # Safety: Readiness probe prevents traffic to unready containers
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          failureThreshold: 3

        # Security: Secrets injected from a Kubernetes Secret (ideally synced from a vault), never hardcoded
        env:
        - name: PAYMENT_GATEWAY_API_KEY
          valueFrom:
            secretKeyRef:
              name: payment-secrets
              key: gateway-api-key
        - name: ENCRYPTION_KEY
          valueFrom:
            secretKeyRef:
              name: payment-secrets
              key: encryption-key

        # Safety: Graceful shutdown
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 15"]  # Wait for connections to drain

      # Security: Add a NetworkPolicy (not shown here) to restrict traffic to these pods
      # Safety: The PodDisruptionBudget below keeps pods available during maintenance
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-service-pdb
spec:
  minAvailable: 3  # Safety: Always keep 3 pods running
  selector:
    matchLabels:
      app: payment-service

Conclusion

Safety and security are complementary but distinct disciplines. Security protects against malicious actors, while safety protects against failures and accidents. Both are essential for building trustworthy systems.

Remember:

Security = Protecting against adversaries
Safety = Protecting against failures

The best engineering teams excel at both. Start with the checklists above, implement improvements incrementally, and build a culture where both safety and security are everyone's responsibility.

DEV Community
