InstaDevOps

Posted on • Originally published at instadevops.com

Platform Engineering: Building Internal Developer Platforms

Introduction

Developers at top tech companies deploy code with a simple git push. They provision databases with a Slack command. They view production metrics in seconds. Meanwhile, at many companies, developers wait days for infrastructure tickets, struggle with complex deployment processes, and can't debug production issues without asking DevOps for help.

Closing that gap is the promise of Platform Engineering: building internal developer platforms that give developers self-service capabilities while maintaining security, reliability, and cost control. It means treating your infrastructure as a product, with developers as your customers.

In this comprehensive guide, we'll explore how to build effective internal developer platforms that accelerate development without sacrificing control.

What is Platform Engineering?

Definition

Platform Engineering is the discipline of designing and building toolchains and workflows that enable self-service capabilities for software engineering organizations in the cloud-native era.

Traditional DevOps:
Developer → Ticket → DevOps Team → Manual Setup → Developer Can Use
            (Days)

Platform Engineering:
Developer → Self-Service Portal → Automated Provisioning → Immediate Use
            (Minutes)

Platform vs. DevOps

DevOps Team:
- Reactive: Responds to developer requests
- Services: "We'll set that up for you"
- Bottleneck: Limited team capacity
- Focus: Keeping systems running

Platform Team:
- Proactive: Builds self-service tools
- Products: "Here's how to do it yourself"
- Scalable: Automation serves more teams without adding headcount
- Focus: Developer productivity

Evolution:
DevOps (2010s) → SRE (2015+) → Platform Engineering (2020+)

Why Platform Engineering Matters

Developer Productivity

Without Platform:

Developer needs database:
1. Create Jira ticket (15 min)
2. Wait for DevOps review (1-3 days)
3. Back-and-forth on requirements (2 days)
4. DevOps provisions database (1 day)
5. Developer configures connection (30 min)

Total time: 4-6 days
DevOps time: 2 hours
Context switches: 5+

With Platform:

Developer needs database:
1. Run: platform create database --type postgres --size small
2. Database provisioned automatically (5 min)
3. Credentials in secret store
4. Connection string in environment variables

Total time: 5 minutes
DevOps time: 0 (fully automated)
Context switches: 0

Productivity gain: over 1,000x faster (4-6 days vs. 5 minutes)

Cost Reduction

Scenario: 50 developers, each needs 2 environments/month

Manual approach:
- 100 environment requests/month
- 2 hours per request (DevOps time)
- 200 hours/month = 1.25 FTE
- Cost: $15,000/month in DevOps time

Platform approach:
- Self-service automation
- 1 hour/month maintenance
- 0.006 FTE
- Cost: about $100/month (rounded up)

Savings: $14,900/month = $178,800/year
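
The arithmetic behind these figures can be reproduced in a few lines. This is a sketch under two assumptions that the scenario implies but does not state: a 160-hour working month per FTE, and a $75 hourly rate (derived from $15,000 for 200 hours). At that rate one hour of maintenance costs $75, so the $100 platform cost is taken as quoted.

```python
HOURS_PER_FTE_MONTH = 160  # assumption: 40 hours/week x 4 weeks
HOURLY_RATE = 75           # assumption: $15,000 / 200 hours


def monthly_cost(requests: int, hours_per_request: float) -> dict:
    """Hours, FTE fraction, and dollar cost of handling requests by hand."""
    hours = requests * hours_per_request
    return {
        "hours": hours,
        "fte": hours / HOURS_PER_FTE_MONTH,
        "cost": hours * HOURLY_RATE,
    }


manual = monthly_cost(requests=100, hours_per_request=2)
platform_cost = 100  # as quoted in the scenario

monthly_savings = manual["cost"] - platform_cost
annual_savings = monthly_savings * 12

print(manual)
print(monthly_savings, annual_savings)
```

Changing `requests` or `hours_per_request` shows how quickly the manual cost grows with team size, while the platform cost stays roughly flat.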

Standardization and Compliance

Without Platform:
- Each team configures infrastructure differently
- Security policies inconsistently applied
- Cost optimization varies by team
- Audit trail incomplete

With Platform:
- Standardized infrastructure patterns
- Security policies enforced automatically
- Cost controls built-in
- Complete audit trail
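
As one illustration of security policies being enforced automatically, a platform can check every provisioning request against a small set of rules before anything is created. The sketch below is hypothetical: the `ProvisionRequest` fields and the three rules are assumptions chosen for the example, not part of any particular tool.

```python
from dataclasses import dataclass, field

# Assumed tagging policy for this example
REQUIRED_TAGS = {"Application", "Environment", "CostCenter"}


@dataclass
class ProvisionRequest:
    app_name: str
    environment: str
    encrypted: bool = True
    publicly_accessible: bool = False
    tags: dict = field(default_factory=dict)


def policy_violations(req: ProvisionRequest) -> list[str]:
    """Return human-readable violations; an empty list means compliant."""
    violations = []
    if not req.encrypted:
        violations.append("storage must be encrypted")
    if req.publicly_accessible:
        violations.append("resources must not be publicly accessible")
    missing = REQUIRED_TAGS - set(req.tags)
    if missing:
        violations.append(f"missing tags: {', '.join(sorted(missing))}")
    return violations
```

Running the same function in the CLI, the portal backend, and CI keeps the rules identical no matter how a request arrives.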

Core Components of Internal Developer Platforms

1. Service Catalog

Self-service menu of available services:

# platform-catalog.yaml

services:
  # Databases
  - name: PostgreSQL Database
    id: postgres
    description: "Managed PostgreSQL database"
    sizes:
      - small: db.t3.small (2 vCPU, 2GB RAM)
      - medium: db.t3.medium (2 vCPU, 4GB RAM)
      - large: db.t3.large (2 vCPU, 8GB RAM)
    backup: Automatic daily backups
    cost: From $30/month

  # Kubernetes
  - name: Kubernetes Namespace
    id: k8s-namespace
    description: "Isolated Kubernetes namespace"
    features:
      - Resource quotas
      - Network policies
      - Automatic Ingress
      - Monitoring enabled
    cost: Free (pay for resources used)

  # Cache
  - name: Redis Cache
    id: redis
    description: "Managed Redis cache"
    sizes:
      - small: cache.t3.micro (0.5GB)
      - medium: cache.t3.small (1.5GB)
      - large: cache.t3.medium (3GB)
    cost: From $15/month

  # Message Queue
  - name: Message Queue
    id: rabbitmq
    description: "RabbitMQ message broker"
    cost: From $25/month
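
Platform tooling needs a machine-readable view of this catalog to turn a request such as "postgres, medium" into a concrete instance class. The sketch below mirrors two catalog entries as a Python dictionary to stay dependency-free; the `CATALOG` structure and the `resolve` function are assumptions for illustration.

```python
# Instance classes copied from the catalog entries above
CATALOG = {
    "postgres": {"small": "db.t3.small", "medium": "db.t3.medium", "large": "db.t3.large"},
    "redis": {"small": "cache.t3.micro", "medium": "cache.t3.small", "large": "cache.t3.medium"},
}


def resolve(service_id: str, size: str = "small") -> str:
    """Translate a catalog service and size into a concrete instance class."""
    sizes = CATALOG.get(service_id)
    if sizes is None:
        raise ValueError(f"unknown service {service_id!r}; choose from {sorted(CATALOG)}")
    if size not in sizes:
        raise ValueError(f"unknown size {size!r} for {service_id}; choose from {list(sizes)}")
    return sizes[size]


print(resolve("postgres", "medium"))
```

Rejecting unknown services and sizes at this layer gives developers a clear error before any infrastructure code runs.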

2. Infrastructure as Code Templates

# modules/postgres-database/main.tf

variable "app_name" {
  description = "Application name"
  type        = string
}

variable "environment" {
  description = "Environment (dev/staging/prod)"
  type        = string
}

variable "size" {
  description = "Database size (small/medium/large)"
  type        = string
  default     = "small"
}

locals {
  instance_types = {
    small  = "db.t3.small"
    medium = "db.t3.medium"
    large  = "db.t3.large"
  }

  storage_sizes = {
    small  = 20
    medium = 100
    large  = 500
  }
}

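# RDS instance with best practices
# Assumes aws_db_subnet_group.main, aws_security_group.db, aws_kms_key.db, and
# aws_iam_role.rds_monitoring are defined elsewhere in this module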
resource "aws_db_instance" "main" {
  identifier = "${var.app_name}-${var.environment}"

  # Instance configuration
  engine               = "postgres"
  engine_version       = "14.6"
  instance_class       = local.instance_types[var.size]
  allocated_storage    = local.storage_sizes[var.size]

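  # Credentials (stored in Secrets Manager)
  # RDS database names and usernames allow letters, digits, and underscores only
  db_name  = replace(var.app_name, "-", "_")
  username = replace(var.app_name, "-", "_")
  password = random_password.db_password.result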

  # Network
  db_subnet_group_name   = aws_db_subnet_group.main.name
  vpc_security_group_ids = [aws_security_group.db.id]

  # Backup
  backup_retention_period = var.environment == "prod" ? 30 : 7
  backup_window          = "03:00-04:00"
  maintenance_window     = "Mon:04:00-Mon:05:00"

  # Encryption
  storage_encrypted = true
  kms_key_id       = aws_kms_key.db.arn

  # Monitoring
  enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]
  monitoring_interval             = 60
  monitoring_role_arn            = aws_iam_role.rds_monitoring.arn

  # Tags
  tags = {
    Application = var.app_name
    Environment = var.environment
    ManagedBy   = "platform-engineering"
    CostCenter  = var.app_name
  }
}

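# Generate the master password (RDS rejects "/", "@", double quotes, and spaces)
resource "random_password" "db_password" {
  length           = 32
  special          = true
  override_special = "!#%&*()-_=+"
}

# Store password in Secrets Manager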
resource "aws_secretsmanager_secret" "db_password" {
  name = "${var.app_name}/${var.environment}/database"
}

resource "aws_secretsmanager_secret_version" "db_password" {
  secret_id = aws_secretsmanager_secret.db_password.id
  secret_string = jsonencode({
    username = aws_db_instance.main.username
    password = random_password.db_password.result
    host     = aws_db_instance.main.address
    port     = aws_db_instance.main.port
    database = aws_db_instance.main.db_name
  })
}

output "connection_secret" {
  value       = aws_secretsmanager_secret.db_password.arn
  description = "ARN of secret containing database credentials"
}

3. CLI Tool

#!/usr/bin/env python3
# platform CLI

import json
import subprocess

import boto3
import click

@click.group()
def cli():
    """Platform Engineering CLI"""
    pass

@cli.command()
@click.option('--type', required=True, type=click.Choice(['postgres', 'mysql', 'mongodb']))
@click.option('--size', default='small', type=click.Choice(['small', 'medium', 'large']))
@click.option('--name', required=True, help='Application name')
@click.option('--env', required=True, type=click.Choice(['dev', 'staging', 'prod']))
def create_database(type, size, name, env):
    """Create a managed database"""
    click.echo(f"Creating {type} database for {name}-{env}...")

    # Run Terraform
    result = subprocess.run([
        'terraform', 'apply',
        '-auto-approve',
        f'-var=app_name={name}',
        f'-var=environment={env}',
        f'-var=size={size}',
        f'-target=module.{type}_database'
    ], capture_output=True, text=True)

    if result.returncode == 0:
        click.echo(f"✓ Database created successfully!")
        click.echo(f"\nConnection details stored in AWS Secrets Manager:")
        click.echo(f"Secret: {name}/{env}/database")
        click.echo(f"\nRetrieve with: platform get-secret {name}/{env}/database")
    else:
        # Raise so the CLI exits non-zero and scripts can detect the failure
        raise click.ClickException(f"Failed to create database:\n{result.stderr}")

@cli.command()
@click.argument('secret_name')
def get_secret(secret_name):
    """Retrieve secret from Secrets Manager"""
    sm = boto3.client('secretsmanager')

    try:
        response = sm.get_secret_value(SecretId=secret_name)
        secret = json.loads(response['SecretString'])

        click.echo(f"\nDatabase Connection Information:")
        click.echo(f"  Host: {secret['host']}")
        click.echo(f"  Port: {secret['port']}")
        click.echo(f"  Database: {secret['database']}")
        click.echo(f"  Username: {secret['username']}")
        click.echo(f"  Password: {secret['password']}")
        click.echo(f"\nConnection String:")
        click.echo(f"  postgresql://{secret['username']}:{secret['password']}@{secret['host']}:{secret['port']}/{secret['database']}")
    except Exception as e:
        # Raise so the CLI exits non-zero and scripts can detect the failure
        raise click.ClickException(f"Error retrieving secret: {e}")

@cli.command()
@click.option('--name', required=True)
@click.option('--env', required=True)
def create_namespace(name, env):
    """Create Kubernetes namespace with standard config"""
    namespace = f"{name}-{env}"

    click.echo(f"Creating namespace {namespace}...")

    # Create namespace manifest
    manifest = f"""
apiVersion: v1
kind: Namespace
metadata:
  name: {namespace}
  labels:
    app: {name}
    environment: {env}
    managed-by: platform
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: {namespace}-quota
  namespace: {namespace}
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    persistentvolumeclaims: "10"
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: {namespace}
spec:
  podSelector: {{}}
  policyTypes:
  - Ingress
  - Egress
"""

    # Apply
    result = subprocess.run(
        ['kubectl', 'apply', '-f', '-'],
        input=manifest.encode(),
        capture_output=True
    )

    if result.returncode == 0:
        click.echo(f"✓ Namespace {namespace} created")
        click.echo(f"\nDeploy your app with:")
        click.echo(f"  kubectl apply -f deployment.yaml -n {namespace}")
    else:
        # Raise so the CLI exits non-zero and scripts can detect the failure
        raise click.ClickException(f"Failed to create namespace:\n{result.stderr.decode()}")

if __name__ == '__main__':
    cli()
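
The connection string printed by `get-secret` interpolates the password directly, which produces an invalid URL when the password contains characters such as `@` or `/`. A minimal sketch of a safer builder using only the standard library follows; the function name `build_postgres_url` is an assumption for illustration, and the secret's keys follow the JSON written by the Terraform module above.

```python
from urllib.parse import quote


def build_postgres_url(secret: dict) -> str:
    """Build a PostgreSQL URL with the user and password percent-encoded."""
    user = quote(secret["username"], safe="")
    password = quote(secret["password"], safe="")
    return (
        f"postgresql://{user}:{password}"
        f"@{secret['host']}:{secret['port']}/{secret['database']}"
    )


# Invented sample values for illustration
example = {
    "username": "my_app",
    "password": "p@ss/word",
    "host": "db.internal",
    "port": 5432,
    "database": "my_app",
}
print(build_postgres_url(example))
# postgresql://my_app:p%40ss%2Fword@db.internal:5432/my_app
```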

4. Portal/UI

// Internal developer portal using React

import React, { useState } from 'react';
import { Alert, Button, Form, Input, Select, notification } from 'antd';

const DatabaseProvisioning: React.FC = () => {
  const [loading, setLoading] = useState(false);

  const onFinish = async (values: any) => {
    setLoading(true);

    try {
      const response = await fetch('/api/provision/database', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(values)
      });

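      if (!response.ok) {
        // fetch() resolves on HTTP errors, so raise one for the catch block
        throw new Error(`Provisioning request failed with status ${response.status}`);
      }

      notification.success({
        message: 'Database Created',
        description: `Your ${values.type} database is being provisioned. You'll receive credentials via Slack in ~5 minutes.`
      });
    } catch (error) {
      notification.error({
        message: 'Provisioning Failed',
        // `error` is typed as unknown under strict TypeScript
        description: error instanceof Error ? error.message : String(error)
      });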
    } finally {
      setLoading(false);
    }
  };

  return (
    <Form onFinish={onFinish} layout="vertical">
      <Form.Item 
        name="type" 
        label="Database Type"
        rules={[{ required: true }]}
      >
        <Select>
          <Select.Option value="postgres">PostgreSQL</Select.Option>
          <Select.Option value="mysql">MySQL</Select.Option>
          <Select.Option value="mongodb">MongoDB</Select.Option>
        </Select>
      </Form.Item>

      <Form.Item 
        name="size" 
        label="Size"
        rules={[{ required: true }]}
      >
        <Select>
          <Select.Option value="small">Small (2GB RAM) - $30/mo</Select.Option>
          <Select.Option value="medium">Medium (4GB RAM) - $60/mo</Select.Option>
          <Select.Option value="large">Large (8GB RAM) - $120/mo</Select.Option>
        </Select>
      </Form.Item>

      <Form.Item 
        name="appName" 
        label="Application Name"
        rules={[{ required: true }]}
      >
        <Input placeholder="my-app" />
      </Form.Item>

      <Form.Item 
        name="environment" 
        label="Environment"
        rules={[{ required: true }]}
      >
        <Select>
          <Select.Option value="dev">Development</Select.Option>
          <Select.Option value="staging">Staging</Select.Option>
          <Select.Option value="prod">Production</Select.Option>
        </Select>
      </Form.Item>

      <Form.Item>
        <Button type="primary" htmlType="submit" loading={loading}>
          Create Database
        </Button>
      </Form.Item>

      <Alert 
        message="Automatic Features"
        description={
          <ul>
            <li>Automatic backups (daily)</li>
            <li>Encryption at rest</li>
            <li>Monitoring and alerting</li>
            <li>Credentials in AWS Secrets Manager</li>
          </ul>
        }
        type="info"
      />
    </Form>
  );
};
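
The form posts its values to `/api/provision/database`, whose implementation is not shown. As a sketch of what such an endpoint might check before starting any provisioning, the hypothetical `validate_payload` function below mirrors the form's four fields. The naming rule for `appName` is an assumption, not something the article specifies.

```python
import re

# Allowed values mirror the options offered by the form
ALLOWED = {
    "type": {"postgres", "mysql", "mongodb"},
    "size": {"small", "medium", "large"},
    "environment": {"dev", "staging", "prod"},
}
# Assumed convention: lowercase, starts with a letter, 3-40 characters
APP_NAME_RE = re.compile(r"[a-z][a-z0-9-]{1,38}[a-z0-9]")


def validate_payload(payload: dict) -> dict:
    """Return a mapping of field name to error message (empty if valid)."""
    errors = {}
    for field_name, allowed in ALLOWED.items():
        value = payload.get(field_name)
        if not isinstance(value, str) or value not in allowed:
            errors[field_name] = f"must be one of {sorted(allowed)}"
    app_name = payload.get("appName")
    if not isinstance(app_name, str) or not APP_NAME_RE.fullmatch(app_name):
        errors["appName"] = "must be 3-40 lowercase letters, digits, or hyphens"
    return errors
```

Validating on the server as well as in the form matters because the CLI and other clients can call the same endpoint without going through the UI.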

5. Golden Paths

Pre-built templates for common use cases:

# Golden path: New web application
platform create app web-app \
  --template node-api \
  --name my-api \
  --env dev

# Creates:
# ✓ Kubernetes namespace
# ✓ PostgreSQL database
# ✓ Redis cache
# ✓ CI/CD pipeline
# ✓ Monitoring dashboards
# ✓ Log aggregation
# ✓ SSL certificate
# ✓ DNS entry

# Golden path: Batch job
platform create app batch-job \
  --template python-worker \
  --name data-processor \
  --env prod \
  --schedule "0 2 * * *"  # Daily at 2 AM

# Creates:
# ✓ Kubernetes CronJob
# ✓ S3 bucket for data
# ✓ IAM role with least privilege
# ✓ CloudWatch alarms
# ✓ Dead letter queue
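
One way to think about a golden path is as a named bundle of resources that the platform expands into individual provisioning steps. The sketch below assumes a hypothetical `GOLDEN_PATHS` registry whose contents are taken from the two examples above; the resource identifiers and the `plan` function are invented for illustration.

```python
# Template name -> resources it provisions (identifiers are hypothetical)
GOLDEN_PATHS = {
    "node-api": [
        "k8s-namespace", "postgres", "redis", "ci-cd-pipeline",
        "monitoring-dashboards", "log-aggregation", "ssl-certificate", "dns-entry",
    ],
    "python-worker": [
        "k8s-cronjob", "s3-bucket", "iam-role", "cloudwatch-alarms", "dead-letter-queue",
    ],
}


def plan(template: str, name: str, env: str) -> list[str]:
    """List the resources a golden path would create, without creating them."""
    resources = GOLDEN_PATHS.get(template)
    if resources is None:
        raise ValueError(f"unknown template {template!r}; choose from {sorted(GOLDEN_PATHS)}")
    return [f"{name}-{env}/{resource}" for resource in resources]


for step in plan("python-worker", "data-processor", "prod"):
    print(step)
```

Keeping the plan separate from execution makes a dry-run mode cheap to offer, so developers can see what a golden path will create before committing to it.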

Implementation Roadmap

Phase 1: Foundation (Month 1-2)

Week 1-2: Discovery
□ Interview developers (what do they need most?)
□ Analyze ticket queues (what's requested most?)
□ Measure current lead times
□ Identify quick wins

Week 3-4: Core Infrastructure
□ Set up Terraform modules
□ Create first golden path (e.g., database)
□ Build CLI tool
□ Documentation

Week 5-6: Alpha Testing
□ Select 2-3 friendly teams
□ Gather feedback
□ Iterate quickly

Week 7-8: Rollout
□ Internal marketing
□ Training sessions
□ Office hours for support
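
The ticket-queue analysis in the discovery step can start as a simple frequency count. The sketch below assumes tickets have been exported as records with a `category` field; the sample data is invented for illustration.

```python
from collections import Counter

# Invented sample export; real data would come from the ticketing system
tickets = [
    {"id": 1, "category": "database"},
    {"id": 2, "category": "namespace"},
    {"id": 3, "category": "database"},
    {"id": 4, "category": "dns"},
    {"id": 5, "category": "database"},
    {"id": 6, "category": "namespace"},
]


def top_requests(records: list[dict], n: int = 3) -> list[tuple[str, int]]:
    """Return the n most frequently requested categories with their counts."""
    return Counter(record["category"] for record in records).most_common(n)


print(top_requests(tickets, 2))
# [('database', 3), ('namespace', 2)]
```

The categories at the top of this list are natural candidates for the first golden path.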

Phase 2: Expansion (Month 3-4)

□ Add more services (Redis, message queues)
□ Build web portal
□ Implement cost tracking
□ Add observability
□ Create more golden paths

Phase 3: Maturity (Month 5-6)

□ Advanced features (auto-scaling, disaster recovery)
□ Self-service monitoring
□ Policy as code
□ Developer analytics
□ Integration with existing tools

Platform Team Structure

Small Company (<50 engineers):
1-2 Platform Engineers
- Build and maintain platform
- Developer support
- 80% automation, 20% tickets

Medium Company (50-200 engineers):
3-5 Platform Engineers
- Platform development
- SRE responsibilities  
- Developer enablement
- On-call rotation

Large Company (>200 engineers):
8-15 Platform Engineers organized into:
- Developer Experience team
- Infrastructure Automation team
- SRE team
- Security & Compliance team

Measuring Success

Key Metrics

# Platform KPIs

metrics = {
    # Speed
    'deployment_frequency': {
        'before': '2 per week',
        'after': '50 per day',
        'improvement': '175x'  # 50 per day is 350 per week, vs. 2 per week
    },

    'lead_time_for_changes': {
        'before': '3 days',
        'after': '2 hours',
        'improvement': '36x'
    },

    # Efficiency
    'provisioning_time': {
        'before': '4 days (database)',
        'after': '5 minutes',
        'improvement': '1,152x'
    },

    'devops_tickets': {
        'before': '200 per month',
        'after': '20 per month',
        'improvement': '90% reduction'
    },

    # Quality
    'mttr': {
        'before': '4 hours',
        'after': '30 minutes',
        'improvement': '8x'
    },

    'change_failure_rate': {
        'before': '20%',
        'after': '5%',
        'improvement': '75% reduction'
    },

    # Cost
    'cloud_spend_efficiency': {
        'before': '$150 per developer per month',
        'after': '$90 per developer per month',
        'improvement': '40% reduction'
    }
}
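
Improvement factors are easier to trust when they are computed from raw numbers in a common unit rather than typed in by hand. The following sketch recomputes the ratios from the before and after values listed above; the unit conversions (24-hour days and a 7-day week) are assumptions.

```python
def speedup(before: float, after: float) -> float:
    """Ratio of before to after, both expressed in the same unit."""
    return before / after


def reduction_pct(before: float, after: float) -> float:
    """Percentage decrease from before to after."""
    return (before - after) * 100 / before


# Frequency improves as after / before: 50 per day (7-day week) vs. 2 per week
deploy_gain = (50 * 7) / 2

lead_time_gain = speedup(before=3 * 24, after=2)          # hours
provisioning_gain = speedup(before=4 * 24 * 60, after=5)  # minutes
mttr_gain = speedup(before=4 * 60, after=30)              # minutes

ticket_cut = reduction_pct(before=200, after=20)
failure_rate_cut = reduction_pct(before=20, after=5)
spend_cut = reduction_pct(before=150, after=90)

print(deploy_gain, lead_time_gain, provisioning_gain, mttr_gain)
print(ticket_cut, failure_rate_cut, spend_cut)
```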

Best Practices

1. Treat Platform as a Product

Product thinking:
- Developers are your customers
- Gather feedback regularly
- Prioritize features based on impact
- Measure adoption and satisfaction
- Iterate quickly

Example feedback loop:
1. Weekly developer surveys
2. Monthly roadmap review
3. Quarterly platform retrospective
4. Track NPS (Net Promoter Score)
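
Net Promoter Score is computed from 0-10 survey responses as the percentage of promoters (scores of 9-10) minus the percentage of detractors (scores of 0-6). A minimal sketch, with invented sample responses:

```python
def nps(scores: list[int]) -> float:
    """Net Promoter Score: % promoters (9-10) minus % detractors (0-6)."""
    if not scores:
        raise ValueError("at least one survey response is required")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return (promoters - detractors) * 100 / len(scores)


responses = [10, 9, 9, 8, 7, 6, 10, 3, 9, 8]  # invented sample data
print(nps(responses))
# 30.0
```

The result ranges from -100 to 100. Scores of 7 and 8 count as passives: they add to the total but to neither group.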

2. Document Everything

# Documentation structure

/docs
├── getting-started/
│   ├── quickstart.md
│   ├── concepts.md
│   └── tutorials/
├── services/
│   ├── databases.md
│   ├── kubernetes.md
│   └── ci-cd.md
├── golden-paths/
│   ├── web-api.md
│   ├── background-job.md
│   └── static-website.md
└── runbooks/
    ├── troubleshooting.md
    └── incident-response.md

3. Provide Escape Hatches

Allow advanced users to customize:

# Standard path (recommended)
platform create database --type postgres --size small

# Advanced path (override defaults)
platform create database \
  --type postgres \
  --instance-class db.r5.2xlarge \
  --storage 1000 \
  --iops 10000 \
  --custom-config postgres-optimized.yaml

# Expert path (full Terraform)
terraform apply -var-file=custom.tfvars
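
Under the hood, an escape hatch can be as simple as layering explicit overrides on top of a size preset. The sketch below assumes hypothetical preset values (storage sizes borrowed from the Terraform module earlier) and option names that mirror the flags above.

```python
# Presets per size; iops is None when the default storage settings apply
PRESETS = {
    "small": {"instance_class": "db.t3.small", "storage": 20, "iops": None},
    "medium": {"instance_class": "db.t3.medium", "storage": 100, "iops": None},
    "large": {"instance_class": "db.t3.large", "storage": 500, "iops": None},
}


def effective_config(size: str = "small", **overrides) -> dict:
    """Start from the preset for `size`, then apply any explicit overrides."""
    config = dict(PRESETS[size])
    unknown = set(overrides) - set(config)
    if unknown:
        raise ValueError(f"unsupported overrides: {sorted(unknown)}")
    # Ignore overrides left unset (None) so the preset value survives
    config.update({key: value for key, value in overrides.items() if value is not None})
    return config


print(effective_config("small"))
print(effective_config("small", instance_class="db.r5.2xlarge", storage=1000, iops=10000))
```

Rejecting unknown override names keeps the advanced path within what the platform can still validate and support.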

Common Pitfalls

1. Building Too Much Too Fast

❌ Bad: Build everything before launch
- 20 services in catalog
- Full-featured web UI
- Complex approval workflows
- 6 months development
- Low adoption (too complex)

✅ Good: Start small, iterate
- 2-3 most requested services
- Simple CLI tool
- Automated approval for dev
- 1 month to first user
- High adoption (solves real pain)

2. Not Involving Developers

❌ Bad: Platform team builds in isolation
- "We know what developers need"
- No user research
- Low adoption

✅ Good: Developer-centric design
- Interview developers
- Beta test with friendly teams
- Gather feedback continuously
- Iterate based on usage data

3. No Migration Path

❌ Bad: "Everyone must use platform immediately"
- Force migration
- Break existing workflows
- Developer resistance

✅ Good: Gradual adoption
- New projects use platform
- Existing projects migrate when convenient
- Provide migration tools
- Support both approaches during transition

Conclusion

Platform Engineering is about empowering developers with self-service capabilities while maintaining control, security, and cost efficiency. Done right, it dramatically accelerates development velocity and reduces operational toil.

Key principles:

  1. Treat it as a product - Developers are your customers
  2. Start small - Ship value quickly, iterate
  3. Automate everything - Eliminate toil and tickets
  4. Document thoroughly - Make it easy to use
  5. Measure success - Track metrics that matter

The best platform is one that disappears—developers barely notice it because everything just works. That's the goal.

Need help building an internal developer platform? InstaDevOps provides expert consulting for platform engineering, developer experience, and infrastructure automation. Contact us for a free consultation.


