
Aisalkyn Aidarova


Cloud Service Models — Full SRE Lecture: IaaS, PaaS, SaaS


🌐 The Big Picture First

Think of cloud service models as a spectrum of responsibility. The further right you move, the less you manage, but the less control you have.

YOUR RESPONSIBILITY
◄────────────────────────────────────────────►
Maximum                                Minimum

On-Premises → IaaS → PaaS → Serverless → SaaS

A useful analogy is pizza:

On-Premises  = Make pizza at home (you own everything)
IaaS         = Order dough & ingredients (you cook it)
PaaS         = Order pizza kit (just assemble & bake)
SaaS         = Order delivery (just eat it)

🏗️ Layer 1 — IaaS (Infrastructure as a Service)

What it is

You rent raw infrastructure — servers, storage, networking — from a cloud provider. The provider manages the physical hardware. You manage everything above the hypervisor.

Responsibility Split

Provider manages:        You manage:
─────────────────        ───────────────────────
Physical servers    →    Operating System
Data centers        →    Runtime & middleware
Networking HW       →    Applications
Hypervisor          →    Data
Storage HW          →    Security patches
                    →    Scaling
                    →    Backups
                    →    Monitoring

Real Examples

Provider   IaaS Products
────────   ─────────────────────────────────────
AWS        EC2, EBS, VPC, S3
GCP        Compute Engine, Cloud Storage
Azure      Virtual Machines, Azure Blob Storage

IaaS Use Cases

  • Lift & shift migrations from on-prem
  • Custom OS configurations needed
  • High performance computing (HPC)
  • Full control over networking required
  • Legacy applications that can't be containerized

IaaS Code Example — Terraform EC2

# You provision and own this instance: classic IaaS
resource "aws_instance" "web" {
  # AMI IDs are region-specific; this one must be a Debian/Ubuntu image
  # because the bootstrap script below uses apt-get
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.medium"

  # YOU are responsible for everything inside this machine
  user_data = <<-EOF
    #!/bin/bash
    apt-get update
    apt-get install -y nginx
    systemctl enable --now nginx
  EOF

  tags = {
    Name = "company-web-server"
  }
}

🚀 Layer 2 — PaaS (Platform as a Service)

What it is

The provider manages OS, runtime, middleware, and scaling. You focus purely on writing and deploying your application code.

Responsibility Split

Provider manages:        You manage:
─────────────────        ───────────────────────
Physical servers    →    Application code
Data centers        →    Data
Hypervisor          →    User access
Operating System    →    Configurations
Runtime             →    (sometimes) scaling rules
Middleware          →
Patching            →
Scaling infra       →

Real Examples

Provider   PaaS Products
────────   ─────────────────────────────────────
AWS        Elastic Beanstalk, RDS, Lambda
GCP        App Engine, Cloud Run, Cloud SQL
Azure      Azure App Service, Azure SQL Database
Others     Heroku, Render, Railway

PaaS Use Cases

  • Startups moving fast without dedicated DevOps
  • Managed databases (RDS handles patching, backups)
  • Web apps where you don't care about OS
  • Rapid prototyping

PaaS Code Example — AWS Elastic Beanstalk

# You just push code — platform handles the rest
# .ebextensions/app.config

option_settings:
  aws:autoscaling:asg:
    MinSize: 2
    MaxSize: 10
  aws:elasticbeanstalk:environment:
    EnvironmentType: LoadBalanced
  aws:ec2:instances:
    InstanceTypes: t3.medium

# No manual OS management and no nginx config
# Patching is automatic once managed platform updates are enabled

💻 Layer 3 — SaaS (Software as a Service)

What it is

A fully managed application delivered over the internet. You don't manage infrastructure, OS, runtime, or the app itself. You just use it.

Responsibility Split

Provider manages:        You manage:
─────────────────        ───────────────────────
Everything               Your data
Infrastructure      →    User access/permissions
OS & runtime        →    Configurations within app
Application         →    Integrations
Updates & patches   →
Scaling             →
Security            →

Real Examples

Category        SaaS Products
─────────────   ─────────────────────────────
Monitoring      Datadog, New Relic, PagerDuty
Communication   Slack, Gmail, Zoom
CI/CD           GitHub Actions, CircleCI
Security        Okta, CrowdStrike
Storage         Dropbox, Google Drive

SaaS from SRE perspective

You don't manage the app BUT you must manage:
✅ API integrations with your systems
✅ SSO/SAML configuration
✅ Data retention policies
✅ Vendor SLA monitoring
✅ Cost & license management
✅ Data backup (vendor may not guarantee YOUR data)

🔄 Shared Responsibility Model — Deep Dive

SREs need to understand this split in detail, because it decides who can fix what during an incident.

                    ON-PREM   IaaS    PaaS    SaaS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Applications          YOU      YOU     YOU    VENDOR
Data                  YOU      YOU     YOU     YOU ⚠️
Runtime               YOU      YOU    VENDOR  VENDOR
Middleware            YOU      YOU    VENDOR  VENDOR
OS                    YOU      YOU    VENDOR  VENDOR
Virtualization        YOU     VENDOR  VENDOR  VENDOR
Servers               YOU     VENDOR  VENDOR  VENDOR
Storage               YOU     VENDOR  VENDOR  VENDOR
Networking            YOU     VENDOR  VENDOR  VENDOR
Data Center           YOU     VENDOR  VENDOR  VENDOR
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⚠️ Data is always YOUR responsibility, regardless of model. Even in SaaS, if the vendor loses your data, recovering from that loss is operationally your problem.
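
The matrix above can also be kept as data, so tooling or runbooks can ask "who owns this layer?" instead of relying on memory. The sketch below is one possible encoding; the names `LAYERS`, `CUSTOMER_OWNED`, and `owner` are illustrative and not part of any cloud provider's API.

```python
# Layers ordered top (application) to bottom (data center), as in the matrix.
LAYERS = [
    "Applications", "Data", "Runtime", "Middleware", "OS",
    "Virtualization", "Servers", "Storage", "Networking", "Data Center",
]

# For each model, the set of layers the customer still owns.
CUSTOMER_OWNED = {
    "on-prem": set(LAYERS),
    "iaas": {"Applications", "Data", "Runtime", "Middleware", "OS"},
    "paas": {"Applications", "Data"},
    "saas": {"Data"},
}


def owner(model: str, layer: str) -> str:
    """Return "YOU" or "VENDOR" for a layer under a given service model."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return "YOU" if layer in CUSTOMER_OWNED[model.lower()] else "VENDOR"


if __name__ == "__main__":
    for model in CUSTOMER_OWNED:
        mine = [layer for layer in LAYERS if owner(model, layer) == "YOU"]
        print(f"{model:8} you own: {', '.join(mine)}")
```

Note that "Data" appears in every set, which is the same point the warning makes.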


🧠 What an SRE Must Know About Each Model


IaaS — SRE Responsibilities

1. OS Hardening & Patching

# You own this with IaaS
# Automated patching with SSM
aws ssm send-command \
  --document-name "AWS-RunPatchBaseline" \
  --targets "Key=tag:Environment,Values=production" \
  --parameters '{"Operation":["Install"]}'

2. Auto Scaling & Self Healing

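resource "aws_autoscaling_group" "web" {
  # launch_template, vpc_zone_identifier, and target_group_arns omitted for brevity
  min_size         = 2
  max_size         = 20
  desired_capacity = 4

  # Self-healing: instances that fail the load balancer health check are replaced
  health_check_type         = "ELB"
  health_check_grace_period = 300

  # Rolling replacement of instances when the launch template changes
  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 50
    }
  }
}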

3. Monitoring You Must Set Up Yourself

# Prometheus scrape config for IaaS EC2
scrape_configs:
  - job_name: 'ec2-instances'
    ec2_sd_configs:
      - region: us-east-1
        port: 9100  # node_exporter port
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Environment]
        target_label: environment

4. Backup Strategy

# EBS snapshots — YOUR responsibility in IaaS
aws ec2 create-snapshot \
  --volume-id vol-xxxxxxxx \
  --description "Daily backup $(date +%Y-%m-%d)"
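
Creating snapshots is only half of a backup strategy; you also need to decide which ones to expire. Here is a minimal sketch of the retention logic, assuming each snapshot is a dict with `SnapshotId` and a timezone-aware `StartTime` (the field names EC2's DescribeSnapshots response uses). The function name and the 7-day window are illustrative.

```python
from datetime import datetime, timedelta, timezone


def expired_snapshots(snapshots, retention_days, now=None):
    """Return IDs of snapshots older than the retention window."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    return [s["SnapshotId"] for s in snapshots if s["StartTime"] < cutoff]


if __name__ == "__main__":
    now = datetime(2024, 1, 31, tzinfo=timezone.utc)
    sample = [
        {"SnapshotId": "snap-old", "StartTime": now - timedelta(days=10)},
        {"SnapshotId": "snap-new", "StartTime": now - timedelta(days=2)},
    ]
    # Only the 10-day-old snapshot falls outside a 7-day window.
    print(expired_snapshots(sample, retention_days=7, now=now))
```

The function only selects candidates. Deleting them is a separate, deliberate step that you would want behind a dry-run flag.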

PaaS — SRE Responsibilities

1. Monitor What the Platform Exposes

# RDS is PaaS — you monitor metrics, not the OS
import boto3

cloudwatch = boto3.client('cloudwatch')

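# Key RDS metrics to alert on (thresholds are starting points, not AWS defaults)
rds_metrics = [
    'CPUUtilization',        # > 80% = alert
    'FreeStorageSpace',      # reported in bytes; alert below 20% of allocated storage
    'DatabaseConnections',   # near max_connections = alert
    'ReadLatency',           # reported in seconds; > 0.020 (20 ms) = investigate
    'WriteLatency',          # reported in seconds; > 0.020 (20 ms) = investigate
    'ReplicaLag',            # > 30 s = alert (read replicas only)
]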
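
A list of metric names does not alert on anything by itself. As a rough sketch, the thresholds can be turned into keyword arguments for CloudWatch's `put_metric_alarm` call. The helper `build_rds_alarm`, the instance name `prod-db`, and the period and evaluation settings are assumptions for illustration; the block only builds the parameters, so it runs without AWS credentials.

```python
def build_rds_alarm(db_instance_id, metric, threshold, comparison,
                    period=300, evaluation_periods=3):
    """Build keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": f"rds-{db_instance_id}-{metric}",
        "Namespace": "AWS/RDS",
        "MetricName": metric,
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Average",
        "Period": period,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold,
        "ComparisonOperator": comparison,
    }


# (metric, threshold, comparison); RDS reports latency in seconds.
ALARM_RULES = [
    ("CPUUtilization", 80.0, "GreaterThanThreshold"),
    ("ReadLatency", 0.020, "GreaterThanThreshold"),
    ("WriteLatency", 0.020, "GreaterThanThreshold"),
    ("ReplicaLag", 30.0, "GreaterThanThreshold"),
]

alarms = [build_rds_alarm("prod-db", m, t, c) for m, t, c in ALARM_RULES]

# To create them for real (requires boto3 and AWS credentials):
#   import boto3
#   cloudwatch = boto3.client("cloudwatch")
#   for alarm in alarms:
#       cloudwatch.put_metric_alarm(**alarm)
```

`FreeStorageSpace` and `DatabaseConnections` are left out on purpose: their thresholds depend on the allocated storage and the instance's connection limit, so they cannot be hard-coded.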

2. Understand Platform Limits

AWS RDS Limits you MUST know as SRE:
- Max connections per instance type
- Storage autoscaling thresholds
- Failover time (~60-120 seconds)
- Backup retention (1-35 days)
- Maintenance windows impact

If you don't know these → you'll miss incidents
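
The connection limit is the one that most often turns into an incident, and the arithmetic is simple enough to check ahead of time. This is a minimal sketch with hypothetical numbers: `max_connections=400` and a pool size of 20 are placeholders, so look up the real limit for your instance class.

```python
def connection_headroom(max_connections, app_instances, pool_size, reserved=10):
    """Return spare connections after every app instance fills its pool.

    A negative result means the fleet can exhaust the database at peak.
    `reserved` keeps a few connections free for admin and monitoring use.
    """
    demand = app_instances * pool_size
    return max_connections - reserved - demand


if __name__ == "__main__":
    # Hypothetical numbers: look up the real limit for your instance class.
    for instances in (4, 10, 20):
        spare = connection_headroom(max_connections=400,
                                    app_instances=instances, pool_size=20)
        status = "OK" if spare >= 0 else "AT RISK"
        print(f"{instances:>2} instances: headroom {spare:>4} -> {status}")
```

Running the check against the autoscaling maximum, not the current instance count, shows whether a scale-out event could itself cause the outage.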

3. Runbook for PaaS Failures

## RDS Failover Runbook

1. Alert fires: RDS_ReplicaLag > 30s
2. Check: AWS Console → RDS → Events
3. If primary unhealthy and Multi-AZ is enabled → failover triggers automatically
4. Expected downtime: 60-120 seconds
5. Verify: application reconnects (check connection pooling)
6. Notify: stakeholders if > 2 min downtime
7. Postmortem: if failover was unexpected
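
Step 5 only holds if the application actually retries its connection for longer than the failover takes. Below is a sketch of retry with capped exponential backoff, assuming `connect` is any zero-argument callable that raises `ConnectionError` on failure. The function name and the 180-second default (chosen to outlast a 60-120 second failover) are illustrative.

```python
import time


def connect_with_backoff(connect, max_wait=180.0, base_delay=1.0,
                         max_delay=30.0, sleep=time.sleep):
    """Call `connect` until it succeeds or the wait budget is used up.

    Delays double after each failed attempt and are capped at `max_delay`.
    The last ConnectionError is re-raised once the next delay would push
    the total scheduled wait past `max_wait`.
    """
    waited, delay = 0.0, base_delay
    while True:
        try:
            return connect()
        except ConnectionError:
            if waited + delay > max_wait:
                raise
            sleep(delay)
            waited += delay
            delay = min(delay * 2, max_delay)
```

Passing `sleep` as a parameter makes the behaviour testable without waiting. Most database drivers and connection pools offer their own retry settings, which are usually preferable to a hand-rolled loop.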

SaaS — SRE Responsibilities

1. Vendor SLA Tracking

# Track your SaaS vendors' uptime against their SLA
# (sla_target values are illustrative; use the figures from your contracts)
vendors = {
    "datadog": {
        "sla_target": 99.9,
        "status_page": "https://status.datadoghq.com",
        "impact": "CRITICAL"  # no monitoring if down
    },
    "pagerduty": {
        "sla_target": 99.9,
        "status_page": "https://status.pagerduty.com",
        "impact": "CRITICAL"  # no alerting if down
    },
    "github": {
        "sla_target": 99.9,
        "status_page": "https://githubstatus.com",
        "impact": "HIGH"  # no deploys if down
    }
}
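
An SLA percentage is easier to reason about once it is converted to minutes. The sketch below does that conversion and checks observed downtime against it; the function names and the 30-day window are assumptions for illustration, since vendors define their own measurement periods.

```python
def allowed_downtime_minutes(sla_percent, days=30):
    """Minutes of downtime an SLA permits over a window of `days`."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - sla_percent) / 100.0


def sla_breached(sla_percent, observed_downtime_minutes, days=30):
    """True when observed downtime exceeds what the SLA allows."""
    return observed_downtime_minutes > allowed_downtime_minutes(sla_percent, days)


if __name__ == "__main__":
    for sla in (99.0, 99.9, 99.99):
        print(f"{sla}% over 30 days -> {allowed_downtime_minutes(sla):.1f} min")
```

At 99.9% the vendor can be down for roughly 43 minutes in a 30-day month without breaching, which is worth knowing before you page anyone about an SLA violation.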

2. SaaS Dependency Risk

As SRE you must ask:
❓ What happens if Datadog goes down?
   → Do we have fallback monitoring?

❓ What happens if PagerDuty goes down?
   → Do we have SMS/phone tree backup?

❓ What happens if GitHub goes down?
   → Can we still deploy hotfixes?

❓ What happens if Okta goes down?
   → Can engineers still access production?
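
These questions have a numeric side. If a workflow needs several vendors up at the same time, their availabilities multiply; a fallback works the other way. The sketch below assumes failures are independent, which real vendors only approximate, and the function names are illustrative.

```python
from math import prod


def serial_availability(availabilities):
    """Availability (%) of a path that needs every dependency up at once."""
    return prod(a / 100.0 for a in availabilities) * 100.0


def with_fallback(primary, fallback):
    """Availability (%) when either the primary or the fallback works."""
    p_both_down = (1 - primary / 100.0) * (1 - fallback / 100.0)
    return (1 - p_both_down) * 100.0


if __name__ == "__main__":
    chain = serial_availability([99.9, 99.9, 99.9])
    print(f"three 99.9% vendors in series: {chain:.3f}%")
    print(f"99.9% primary + 99.0% fallback: {with_fallback(99.9, 99.0):.3f}%")
```

Three 99.9% vendors in series land at about 99.7%, while even a mediocre fallback lifts a single critical dependency well above its own SLA.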

3. Data Backup for SaaS

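#!/bin/bash
# Even SaaS data needs a backup; the vendor may not restore it for you
# Example: mirror every repo in a GitHub org (requires an authenticated gh CLI)
set -euo pipefail

ORGS=("company-org")
for org in "${ORGS[@]}"; do
  # gh lists only 30 repos by default, so raise the limit
  repos=$(gh repo list "$org" --limit 1000 --json name -q '.[].name')
  for repo in $repos; do
    dest="/backups/github/$org/$repo.git"
    if [ -d "$dest" ]; then
      # Existing mirror: fetch new refs instead of re-cloning
      git -C "$dest" remote update --prune
    else
      git clone --mirror "https://github.com/$org/$repo.git" "$dest"
    fi
  done
done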

📊 SLO/SLI Design Per Model

This is where SRE expertise really shows:

IaaS — You define AND measure everything:
  SLI: Custom metrics from your app + infra
  SLO: 99.9% availability (you control this)
  Error Budget: You own it fully

PaaS — Platform gives you some metrics:
  SLI: Mix of platform metrics + app metrics
  SLO: Limited by platform's own SLA
  Error Budget: Platform failures count against YOU

SaaS — You mostly observe:
  SLI: API response times, login success rate
  SLO: Constrained by vendor SLA
  Error Budget: Vendor downtime burns YOUR budget
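
To make "vendor downtime burns YOUR budget" concrete, here is a small sketch that computes error-budget burn from request counts and splits it by cause. The function name, the `own` and `vendor` labels, and the sample numbers are all hypothetical.

```python
def error_budget_report(slo_percent, total_requests, failed_by_source):
    """Summarise error-budget burn from request counts.

    `failed_by_source` maps a label such as "own" or "vendor" to the number
    of failed requests attributed to it. The SLO must be below 100%.
    """
    allowed_failures = total_requests * (100.0 - slo_percent) / 100.0
    failed = sum(failed_by_source.values())
    return {
        "sli_percent": 100.0 * (total_requests - failed) / total_requests,
        "allowed_failures": allowed_failures,
        "budget_used_percent": 100.0 * failed / allowed_failures,
        "burn_by_source_percent": {
            src: 100.0 * n / allowed_failures
            for src, n in failed_by_source.items()
        },
    }


if __name__ == "__main__":
    report = error_budget_report(
        slo_percent=99.9,
        total_requests=1_000_000,
        failed_by_source={"own": 200, "vendor": 500},
    )
    for key, value in report.items():
        print(key, value)
```

In this made-up month the vendor alone consumes about half the budget, even though the SLO is still met. That is the number to bring to a vendor review.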

🔥 Real Incident Scenarios by Model

IaaS Incident

Alert: High CPU on EC2 fleet (95%)
SRE Actions:
1. SSH into instance → top → find runaway process
2. Check ASG → is it scaling?
3. Check ALB → redistribute traffic
4. Patch if OS-level issue
5. You have FULL access to diagnose

Resolution time: Fast if skilled, slow if not

PaaS Incident

Alert: RDS connections maxed out
SRE Actions:
1. Check CloudWatch → DatabaseConnections metric
2. Check application → connection pool config
3. Scale instance type (takes minutes; expect a brief outage or failover)
4. Add read replica to distribute load
5. You CANNOT ssh into RDS — limited visibility

Resolution time: Dependent on platform tooling

SaaS Incident

Alert: Datadog not receiving metrics
SRE Actions:
1. Check status.datadoghq.com
2. Check your agent → is it running?
3. If vendor issue → wait + use backup monitoring
4. You have ZERO control over their infrastructure

Resolution time: Entirely up to vendor
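
The triage steps above boil down to two checks and a decision. The sketch below shows one possible ordering, with the local agent checked first because it is the only part you can fix yourself. The function name and the wording of the actions are illustrative.

```python
def triage_missing_metrics(vendor_status_ok: bool, agent_running: bool) -> str:
    """Map the two checks from the incident steps to a next action."""
    if not agent_running:
        return "restart the local agent and check its logs"
    if not vendor_status_ok:
        return "vendor incident: switch to backup monitoring and wait"
    return "agent and vendor look healthy: check network egress and API keys"


if __name__ == "__main__":
    print(triage_missing_metrics(vendor_status_ok=False, agent_running=True))
```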

💡 Key SRE Takeaways

Topic          IaaS               PaaS            SaaS
────────────   ────────────────   ─────────────   ────────────────
Toil level     High               Medium          Low
Control        Full               Partial         None
Blast radius   You caused it      Shared          Vendor caused it
MTTR           You control        Partly you      Vendor controls
Cost model     Pay per resource   Pay per usage   Pay per seat
Scaling        Manual/ASG         Auto            Automatic
Patching       You                Vendor          Vendor
Debugging      Full access        Limited         API/logs only

🎓 Senior SRE Mental Model

With six years of experience, you should think about it like this:

IaaS = Maximum flexibility, maximum toil
       → Use when you NEED control
       → Automate everything or drown

PaaS = Sweet spot for most workloads
       → Understand platform limits deeply
       → Know exactly what you can't control

SaaS = Treat vendors like internal services
       → Track their SLAs
       → Build fallbacks for critical ones
       → Own YOUR data always

Modern SRE reality:
Most companies use ALL THREE simultaneously
Your job = understand the boundary of responsibility
           at each layer and build reliability
           within those constraints
