Introduction
Developers at top tech companies deploy code with a simple git push. They provision databases with a Slack command. They view production metrics in seconds. Meanwhile, at many companies, developers wait days for infrastructure tickets, struggle with complex deployment processes, and can't debug production issues without asking DevOps for help.
Closing that gap is the promise of Platform Engineering: building internal developer platforms that give developers self-service capabilities while maintaining security, reliability, and cost control. It means treating your infrastructure as a product, with developers as your customers.
In this comprehensive guide, we'll explore how to build effective internal developer platforms that accelerate development without sacrificing control.
What is Platform Engineering?
Definition
Platform Engineering is the discipline of designing and building toolchains and workflows that enable self-service capabilities for software engineering organizations in the cloud-native era.
Traditional DevOps:
Developer → Ticket → DevOps Team → Manual Setup → Developer Can Use
(Days)
Platform Engineering:
Developer → Self-Service Portal → Automated Provisioning → Immediate Use
(Minutes)
Platform vs. DevOps
DevOps Team:
- Reactive: Responds to developer requests
- Services: "We'll set that up for you"
- Bottleneck: Limited team capacity
- Focus: Keeping systems running
Platform Team:
- Proactive: Builds self-service tools
- Products: "Here's how to do it yourself"
- Scalable: Automation absorbs more requests without adding headcount
- Focus: Developer productivity
Evolution:
DevOps → SRE → Platform Engineering
(2010s) (2015+) (2020+)
Why Platform Engineering Matters
Developer Productivity
Without Platform:
Developer needs database:
1. Create Jira ticket (15 min)
2. Wait for DevOps review (1-3 days)
3. Back-and-forth on requirements (2 days)
4. DevOps provisions database (1 day)
5. Developer configures connection (30 min)
Total time: 4-6 days
DevOps time: 2 hours
Context switches: 5+
With Platform:
Developer needs database:
1. Run: platform create database --type postgres --size small
2. Database provisioned automatically (5 min)
3. Credentials in secret store
4. Connection string in environment variables
Total time: 5 minutes
DevOps time: 0 (fully automated)
Context switches: 0
Productivity gain: over 1,000x faster in elapsed time (4-6 days vs. 5 minutes)
Cost Reduction
Scenario: 50 developers, each needs 2 environments/month
Manual approach:
- 100 environment requests/month
- 2 hours per request (DevOps time)
- 200 hours/month = 1.25 FTE
- Cost: $15,000/month in DevOps time
Platform approach:
- Self-service automation
- 1 hour/month maintenance
- 0.006 FTE
- Cost: $100/month
Savings: $14,900/month = $178,800/year
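The arithmetic behind this scenario can be reproduced with a short sketch. The 160-hour working month and the $75 hourly rate are assumptions inferred from the figures above (200 hours = 1.25 FTE, $15,000 for 200 hours); they are not stated by the source. At that implied rate one maintenance hour comes to $75, so the $100 figure above appears to be rounded up.

```python
# Assumptions inferred from the scenario, not stated in the article:
HOURS_PER_FTE_MONTH = 160   # 200 hours = 1.25 FTE implies a 160-hour month
HOURLY_RATE = 75.0          # $15,000 / 200 hours


def fte(hours: float) -> float:
    """Convert monthly hours of work into full-time equivalents."""
    return hours / HOURS_PER_FTE_MONTH


def monthly_cost(hours: float, hourly_rate: float = HOURLY_RATE) -> float:
    """Monthly cost of the given number of DevOps hours."""
    return hours * hourly_rate


# 50 developers x 2 environments each x 2 hours of DevOps time per request
manual_hours = 50 * 2 * 2
platform_hours = 1  # monthly maintenance of the automation

manual_cost = monthly_cost(manual_hours)
platform_cost = monthly_cost(platform_hours)

print(f"Manual:   {manual_hours} h, {fte(manual_hours):.2f} FTE, ${manual_cost:,.0f}/month")
print(f"Platform: {platform_hours} h, {fte(platform_hours):.3f} FTE, ${platform_cost:,.0f}/month")
print(f"Savings:  ${manual_cost - platform_cost:,.0f}/month")
```

Plugging in your own request volume and hourly rate is usually more persuasive than the illustrative numbers.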
Standardization and Compliance
Without Platform:
- Each team configures infrastructure differently
- Security policies inconsistently applied
- Cost optimization varies by team
- Audit trail incomplete
With Platform:
- Standardized infrastructure patterns
- Security policies enforced automatically
- Cost controls built-in
- Complete audit trail
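One way to read "security policies enforced automatically" is as a check that runs before anything is provisioned. The sketch below is illustrative only: the required tags mirror the tags used in the Terraform module later in this guide, while the function name and the shape of the resource dictionary are hypothetical.

```python
from typing import Any, Dict, List

# Tags every resource must carry (taken from the Terraform module's tags block)
REQUIRED_TAGS = {"Application", "Environment", "ManagedBy", "CostCenter"}


def policy_violations(resource: Dict[str, Any]) -> List[str]:
    """Return human-readable policy violations for a resource description.

    An empty list means the resource passes. The resource format is a
    hypothetical plain dict, not a real Terraform or cloud API object.
    """
    violations = []
    missing = REQUIRED_TAGS - set(resource.get("tags", {}))
    if missing:
        violations.append("missing tags: " + ", ".join(sorted(missing)))
    if not resource.get("storage_encrypted", False):
        violations.append("storage must be encrypted")
    return violations
```

A platform would typically run a check like this in the provisioning pipeline and refuse to apply when the list is non-empty.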
Core Components of Internal Developer Platforms
1. Service Catalog
Self-service menu of available services:
# platform-catalog.yaml
services:
  # Databases
  - name: PostgreSQL Database
    id: postgres
    description: "Managed PostgreSQL database"
    sizes:
      - small: db.t3.small (2 vCPU, 2GB RAM)
      - medium: db.t3.medium (2 vCPU, 4GB RAM)
      - large: db.t3.large (2 vCPU, 8GB RAM)
    backup: Automatic daily backups
    cost: From $30/month
  # Kubernetes
  - name: Kubernetes Namespace
    id: k8s-namespace
    description: "Isolated Kubernetes namespace"
    features:
      - Resource quotas
      - Network policies
      - Automatic Ingress
      - Monitoring enabled
    cost: Free (pay for resources used)
  # Cache
  - name: Redis Cache
    id: redis
    description: "Managed Redis cache"
    sizes:
      - small: cache.t3.micro (0.5GB)
      - medium: cache.t3.small (1.5GB)
      - large: cache.t3.medium (3GB)
    cost: From $15/month
  # Message Queue
  - name: Message Queue
    id: rabbitmq
    description: "RabbitMQ message broker"
    cost: From $25/month
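To show how tooling might consume a catalog like this, here is a minimal sketch that resolves a service id and size to an instance type. The catalog is mirrored as a plain Python dictionary so the example has no dependencies; in practice it would be loaded from the YAML file. The function name `resolve_instance` is hypothetical.

```python
from typing import Dict

# Hand-copied subset of the catalog above (service id -> size -> instance type)
CATALOG: Dict[str, Dict[str, str]] = {
    "postgres": {
        "small": "db.t3.small",
        "medium": "db.t3.medium",
        "large": "db.t3.large",
    },
    "redis": {
        "small": "cache.t3.micro",
        "medium": "cache.t3.small",
        "large": "cache.t3.medium",
    },
}


def resolve_instance(service_id: str, size: str = "small") -> str:
    """Map a catalog service id and size to its instance type."""
    if service_id not in CATALOG:
        raise ValueError(
            f"unknown service {service_id!r}; choose from {sorted(CATALOG)}"
        )
    sizes = CATALOG[service_id]
    if size not in sizes:
        raise ValueError(
            f"unknown size {size!r} for {service_id}; choose from {sorted(sizes)}"
        )
    return sizes[size]


print(resolve_instance("postgres", "medium"))
```

Rejecting unknown ids and sizes up front gives developers a clear error instead of a failed provisioning run.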
2. Infrastructure as Code Templates
# modules/postgres-database/main.tf
# Note: aws_db_subnet_group.main, aws_security_group.db, aws_kms_key.db, and
# aws_iam_role.rds_monitoring are assumed to be defined elsewhere in this module.
variable "app_name" {
  description = "Application name"
  type        = string
}
variable "environment" {
  description = "Environment (dev/staging/prod)"
  type        = string
}
variable "size" {
  description = "Database size (small/medium/large)"
  type        = string
  default     = "small"
}
locals {
  instance_types = {
    small  = "db.t3.small"
    medium = "db.t3.medium"
    large  = "db.t3.large"
  }
  storage_sizes = {
    small  = 20
    medium = 100
    large  = 500
  }
  # RDS database names cannot contain hyphens, so derive a safe identifier
  db_identifier = replace(var.app_name, "-", "_")
}
# Generated master password; special characters are disabled to avoid
# characters that RDS rejects in passwords (such as "/", "@", and quotes)
resource "random_password" "db_password" {
  length  = 32
  special = false
}
# RDS Instance with best practices
resource "aws_db_instance" "main" {
  identifier = "${var.app_name}-${var.environment}"
  # Instance configuration
  engine            = "postgres"
  engine_version    = "14.6"
  instance_class    = local.instance_types[var.size]
  allocated_storage = local.storage_sizes[var.size]
  # Database and credentials (also stored in Secrets Manager below)
  db_name  = local.db_identifier
  username = local.db_identifier
  password = random_password.db_password.result
  # Network
  db_subnet_group_name   = aws_db_subnet_group.main.name
  vpc_security_group_ids = [aws_security_group.db.id]
  # Backup
  backup_retention_period = var.environment == "prod" ? 30 : 7
  backup_window           = "03:00-04:00"
  maintenance_window      = "Mon:04:00-Mon:05:00"
  # Encryption
  storage_encrypted = true
  kms_key_id        = aws_kms_key.db.arn
  # Monitoring
  enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]
  monitoring_interval             = 60
  monitoring_role_arn             = aws_iam_role.rds_monitoring.arn
  # Tags
  tags = {
    Application = var.app_name
    Environment = var.environment
    ManagedBy   = "platform-engineering"
    CostCenter  = var.app_name
  }
}
# Store connection details in Secrets Manager
resource "aws_secretsmanager_secret" "db_password" {
  name = "${var.app_name}/${var.environment}/database"
}
resource "aws_secretsmanager_secret_version" "db_password" {
  secret_id = aws_secretsmanager_secret.db_password.id
  secret_string = jsonencode({
    username = aws_db_instance.main.username
    password = random_password.db_password.result
    host     = aws_db_instance.main.address
    port     = aws_db_instance.main.port
    database = aws_db_instance.main.db_name
  })
}
output "connection_secret" {
  value       = aws_secretsmanager_secret.db_password.arn
  description = "ARN of secret containing database credentials"
}
3. CLI Tool
#!/usr/bin/env python3
# platform CLI
import json
import subprocess
import sys

import boto3
import click


@click.group()
def cli():
    """Platform Engineering CLI"""


@cli.group()
def create():
    """Create platform resources (database, namespace)"""


# Registered as "platform create database" to match the usage shown in this guide
@create.command('database')
@click.option('--type', 'db_type', required=True, type=click.Choice(['postgres', 'mysql', 'mongodb']))
@click.option('--size', default='small', type=click.Choice(['small', 'medium', 'large']))
@click.option('--name', required=True, help='Application name')
@click.option('--env', required=True, type=click.Choice(['dev', 'staging', 'prod']))
def create_database(db_type, size, name, env):
    """Create a managed database"""
    click.echo(f"Creating {db_type} database for {name}-{env}...")
    # Run Terraform against the module for the requested database type
    result = subprocess.run([
        'terraform', 'apply',
        '-auto-approve',
        f'-var=app_name={name}',
        f'-var=environment={env}',
        f'-var=size={size}',
        f'-target=module.{db_type}_database'
    ], capture_output=True, text=True)
    if result.returncode == 0:
        click.echo("✓ Database created successfully!")
        click.echo("\nConnection details stored in AWS Secrets Manager:")
        click.echo(f"Secret: {name}/{env}/database")
        click.echo(f"\nRetrieve with: platform get-secret {name}/{env}/database")
    else:
        click.echo("✗ Failed to create database")
        click.echo(result.stderr)
        sys.exit(1)  # Non-zero exit so scripts and CI can detect the failure


@cli.command('get-secret')
@click.argument('secret_name')
def get_secret(secret_name):
    """Retrieve secret from Secrets Manager"""
    sm = boto3.client('secretsmanager')
    try:
        response = sm.get_secret_value(SecretId=secret_name)
        secret = json.loads(response['SecretString'])
        # Note: this prints the password to the terminal
        click.echo("\nDatabase Connection Information:")
        click.echo(f" Host: {secret['host']}")
        click.echo(f" Port: {secret['port']}")
        click.echo(f" Database: {secret['database']}")
        click.echo(f" Username: {secret['username']}")
        click.echo(f" Password: {secret['password']}")
        click.echo("\nConnection String:")
        click.echo(f" postgresql://{secret['username']}:{secret['password']}@{secret['host']}:{secret['port']}/{secret['database']}")
    except Exception as e:
        click.echo(f"✗ Error retrieving secret: {e}")
        sys.exit(1)


@create.command('namespace')
@click.option('--name', required=True)
@click.option('--env', required=True)
def create_namespace(name, env):
    """Create Kubernetes namespace with standard config"""
    namespace = f"{name}-{env}"
    click.echo(f"Creating namespace {namespace}...")
    # Namespace, resource quota, and default-deny network policy in one manifest
    manifest = f"""
apiVersion: v1
kind: Namespace
metadata:
  name: {namespace}
  labels:
    app: {name}
    environment: {env}
    managed-by: platform
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: {namespace}-quota
  namespace: {namespace}
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    persistentvolumeclaims: "10"
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: {namespace}
spec:
  podSelector: {{}}
  policyTypes:
    - Ingress
    - Egress
"""
    # Apply the manifest by piping it to kubectl on stdin
    result = subprocess.run(
        ['kubectl', 'apply', '-f', '-'],
        input=manifest,
        capture_output=True,
        text=True
    )
    if result.returncode == 0:
        click.echo(f"✓ Namespace {namespace} created")
        click.echo("\nDeploy your app with:")
        click.echo(f" kubectl apply -f deployment.yaml -n {namespace}")
    else:
        click.echo("✗ Failed to create namespace")
        click.echo(result.stderr)
        sys.exit(1)


if __name__ == '__main__':
    cli()
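The CLI interpolates the application name directly into a Kubernetes manifest and a Terraform variable, so it is worth validating the name before using it. The sketch below checks the combined name against the RFC 1123 DNS label rules that Kubernetes applies to namespace names (lowercase alphanumerics and hyphens, starting and ending with an alphanumeric, at most 63 characters). The helper name `namespace_name` is hypothetical and not part of the CLI above.

```python
import re

# RFC 1123 DNS label, the format Kubernetes requires for namespace names
_DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")
MAX_LABEL_LENGTH = 63


def namespace_name(app: str, env: str) -> str:
    """Build '<app>-<env>' and reject anything that is not a valid DNS label."""
    candidate = f"{app}-{env}"
    if len(candidate) > MAX_LABEL_LENGTH:
        raise ValueError(
            f"namespace {candidate!r} exceeds {MAX_LABEL_LENGTH} characters"
        )
    if not _DNS_LABEL.fullmatch(candidate):
        raise ValueError(
            f"namespace {candidate!r} must contain only lowercase letters, "
            "digits, and hyphens, and start and end with a letter or digit"
        )
    return candidate


print(namespace_name("my-api", "dev"))
```

Because `fullmatch` rejects newlines, colons, and spaces, this also keeps a crafted name from injecting extra YAML into the manifest.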
4. Portal/UI
// Internal developer portal using React
import React, { useState } from 'react';
import { Alert, Button, Form, Input, Select, notification } from 'antd';

// Shape of the form values sent to the provisioning API
interface DatabaseRequest {
  type: string;
  size: string;
  appName: string;
  environment: string;
}

const DatabaseProvisioning: React.FC = () => {
  const [loading, setLoading] = useState(false);
  const onFinish = async (values: DatabaseRequest) => {
    setLoading(true);
    try {
      const response = await fetch('/api/provision/database', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(values)
      });
      if (!response.ok) {
        // Surface HTTP errors instead of failing silently
        throw new Error(`Request failed with status ${response.status}`);
      }
      notification.success({
        message: 'Database Created',
        description: `Your ${values.type} database is being provisioned. You'll receive credentials via Slack in ~5 minutes.`
      });
    } catch (error) {
      notification.error({
        message: 'Provisioning Failed',
        description: error instanceof Error ? error.message : String(error)
      });
    } finally {
      setLoading(false);
    }
  };
  return (
    <Form onFinish={onFinish} layout="vertical">
      <Form.Item
        name="type"
        label="Database Type"
        rules={[{ required: true }]}
      >
        <Select>
          <Select.Option value="postgres">PostgreSQL</Select.Option>
          <Select.Option value="mysql">MySQL</Select.Option>
          <Select.Option value="mongodb">MongoDB</Select.Option>
        </Select>
      </Form.Item>
      <Form.Item
        name="size"
        label="Size"
        rules={[{ required: true }]}
      >
        <Select>
          <Select.Option value="small">Small (2GB RAM) - $30/mo</Select.Option>
          <Select.Option value="medium">Medium (4GB RAM) - $60/mo</Select.Option>
          <Select.Option value="large">Large (8GB RAM) - $120/mo</Select.Option>
        </Select>
      </Form.Item>
      <Form.Item
        name="appName"
        label="Application Name"
        rules={[{ required: true }]}
      >
        <Input placeholder="my-app" />
      </Form.Item>
      <Form.Item
        name="environment"
        label="Environment"
        rules={[{ required: true }]}
      >
        <Select>
          <Select.Option value="dev">Development</Select.Option>
          <Select.Option value="staging">Staging</Select.Option>
          <Select.Option value="prod">Production</Select.Option>
        </Select>
      </Form.Item>
      <Form.Item>
        <Button type="primary" htmlType="submit" loading={loading}>
          Create Database
        </Button>
      </Form.Item>
      <Alert
        message="Automatic Features"
        description={
          <ul>
            <li>Automatic backups (daily)</li>
            <li>Encryption at rest</li>
            <li>Monitoring and alerting</li>
            <li>Credentials in AWS Secrets Manager</li>
          </ul>
        }
        type="info"
      />
    </Form>
  );
};

export default DatabaseProvisioning;
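The form posts its values to `/api/provision/database`, and the server should not trust what the browser sends. As a rough sketch of the server side, assuming a Python backend (the article does not say what the API is written in), the payload could be validated like this before any provisioning starts. The field names match the form; `DatabaseRequest` and `parse_database_request` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Dict

# Allowed values, mirroring the options offered by the form
ALLOWED = {
    "type": ("postgres", "mysql", "mongodb"),
    "size": ("small", "medium", "large"),
    "environment": ("dev", "staging", "prod"),
}


@dataclass(frozen=True)
class DatabaseRequest:
    type: str
    size: str
    app_name: str
    environment: str


def parse_database_request(payload: Dict[str, Any]) -> DatabaseRequest:
    """Validate a decoded JSON body and return a typed request.

    Collects every problem so the caller can report them all at once.
    """
    errors = []
    for field, choices in ALLOWED.items():
        if payload.get(field) not in choices:
            errors.append(f"{field} must be one of {', '.join(choices)}")
    app_name = payload.get("appName")
    if not isinstance(app_name, str) or not app_name.strip():
        errors.append("appName is required")
    if errors:
        raise ValueError("; ".join(errors))
    return DatabaseRequest(
        type=payload["type"],
        size=payload["size"],
        app_name=app_name.strip(),
        environment=payload["environment"],
    )
```

A web framework handler would call this on the request body and turn a `ValueError` into a 400 response.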
5. Golden Paths
Pre-built templates for common use cases:
# Golden path: New web application
platform create app web-app \
--template node-api \
--name my-api \
--env dev
# Creates:
# ✓ Kubernetes namespace
# ✓ PostgreSQL database
# ✓ Redis cache
# ✓ CI/CD pipeline
# ✓ Monitoring dashboards
# ✓ Log aggregation
# ✓ SSL certificate
# ✓ DNS entry
# Golden path: Batch job
platform create app batch-job \
--template python-worker \
--name data-processor \
--env prod \
--schedule "0 2 * * *" # Daily at 2 AM
# Creates:
# ✓ Kubernetes CronJob
# ✓ S3 bucket for data
# ✓ IAM role with least privilege
# ✓ CloudWatch alarms
# ✓ Dead letter queue
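Internally, a golden path can be little more than a named list of resources to create. The following sketch models the two paths above as data and produces a dry-run plan; it does not provision anything, and the `plan` function and output format are hypothetical.

```python
from typing import Dict, List

# Resources each golden path creates, copied from the listings above
GOLDEN_PATHS: Dict[str, List[str]] = {
    "web-app": [
        "Kubernetes namespace",
        "PostgreSQL database",
        "Redis cache",
        "CI/CD pipeline",
        "Monitoring dashboards",
        "Log aggregation",
        "SSL certificate",
        "DNS entry",
    ],
    "batch-job": [
        "Kubernetes CronJob",
        "S3 bucket for data",
        "IAM role with least privilege",
        "CloudWatch alarms",
        "Dead letter queue",
    ],
}


def plan(path: str, name: str, env: str) -> List[str]:
    """Return a dry-run description of what a golden path would create."""
    if path not in GOLDEN_PATHS:
        raise ValueError(
            f"unknown golden path {path!r}; choose from {sorted(GOLDEN_PATHS)}"
        )
    return [
        f"[{name}-{env}] would create: {resource}"
        for resource in GOLDEN_PATHS[path]
    ]


for line in plan("batch-job", "data-processor", "prod"):
    print(line)
```

Keeping paths as data makes it cheap to add a new one and to show developers exactly what they will get before anything is created.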
Implementation Roadmap
Phase 1: Foundation (Month 1-2)
Week 1-2: Discovery
□ Interview developers (what do they need most?)
□ Analyze ticket queues (what's requested most?)
□ Measure current lead times
□ Identify quick wins
Week 3-4: Core Infrastructure
□ Set up Terraform modules
□ Create first golden path (e.g., database)
□ Build CLI tool
□ Documentation
Week 5-6: Alpha Testing
□ Select 2-3 friendly teams
□ Gather feedback
□ Iterate quickly
Week 7-8: Rollout
□ Internal marketing
□ Training sessions
□ Office hours for support
Phase 2: Expansion (Month 3-4)
□ Add more services (Redis, message queues)
□ Build web portal
□ Implement cost tracking
□ Add observability
□ Create more golden paths
Phase 3: Maturity (Month 5-6)
□ Advanced features (auto-scaling, disaster recovery)
□ Self-service monitoring
□ Policy as code
□ Developer analytics
□ Integration with existing tools
Platform Team Structure
Small Company (<50 engineers):
1-2 Platform Engineers
- Build and maintain platform
- Developer support
- 80% automation, 20% tickets
Medium Company (50-200 engineers):
3-5 Platform Engineers
- Platform development
- SRE responsibilities
- Developer enablement
- On-call rotation
Large Company (>200 engineers):
8-15 Platform Engineers organized into:
- Developer Experience team
- Infrastructure Automation team
- SRE team
- Security & Compliance team
Measuring Success
Key Metrics
# Platform KPIs
metrics = {
    # Speed
    'deployment_frequency': {
        'before': '2 per week',
        'after': '50 per day',
        'improvement': '175x'  # 50/day x 7 days = 350/week vs. 2/week
    },
    'lead_time_for_changes': {
        'before': '3 days',
        'after': '2 hours',
        'improvement': '36x'
    },
    # Efficiency
    'provisioning_time': {
        'before': '4 days (database)',
        'after': '5 minutes',
        'improvement': '1,152x'
    },
    'devops_tickets': {
        'before': '200 per month',
        'after': '20 per month',
        'improvement': '90% reduction'
    },
    # Quality
    'mttr': {
        'before': '4 hours',
        'after': '30 minutes',
        'improvement': '8x'
    },
    'change_failure_rate': {
        'before': '20%',
        'after': '5%',
        'improvement': '75% reduction'
    },
    # Cost
    'cloud_spend_efficiency': {
        'before': '$150 per developer per month',
        'after': '$90 per developer per month',
        'improvement': '40% reduction'
    }
}
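Improvement figures like these are easy to get wrong by hand, so it helps to compute them from the raw before and after values. This is a small sketch with hypothetical helper names; note that 50 deployments per day against 2 per week works out to 175x on a seven-day week (350 per week divided by 2).

```python
MINUTES_PER_HOUR = 60
MINUTES_PER_DAY = 24 * MINUTES_PER_HOUR


def speedup(before: float, after: float) -> float:
    """How many times smaller 'after' is than 'before' (same units)."""
    return before / after


def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction from 'before' to 'after'."""
    return 100 * (before - after) / before


# Time-based metrics, converted to minutes
print("lead time:   ", speedup(3 * MINUTES_PER_DAY, 2 * MINUTES_PER_HOUR))
print("provisioning:", speedup(4 * MINUTES_PER_DAY, 5))
print("mttr:        ", speedup(4 * MINUTES_PER_HOUR, 30))

# Deployment frequency is a rate, so the ratio is after / before per week
print("deployments: ", (50 * 7) / 2)

# Percentage-based metrics
print("tickets:     ", reduction_pct(200, 20))
print("failure rate:", reduction_pct(20, 5))
print("cloud spend: ", reduction_pct(150, 90))
```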
Best Practices
1. Treat Platform as a Product
Product thinking:
- Developers are your customers
- Gather feedback regularly
- Prioritize features based on impact
- Measure adoption and satisfaction
- Iterate quickly
Example feedback loop:
1. Weekly developer surveys
2. Monthly roadmap review
3. Quarterly platform retrospective
4. Track NPS (Net Promoter Score)
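For the NPS step, the standard calculation is the percentage of promoters (scores of 9 or 10) minus the percentage of detractors (scores of 0 to 6) on a 0-10 "how likely are you to recommend this" question. A minimal sketch, with made-up survey responses:

```python
from typing import Iterable


def net_promoter_score(scores: Iterable[int]) -> float:
    """Compute NPS from 0-10 survey scores.

    Promoters score 9-10, detractors 0-6, passives 7-8 count only
    toward the total. The result ranges from -100 to 100.
    """
    scores = list(scores)
    if not scores:
        raise ValueError("at least one survey response is required")
    if any(not 0 <= s <= 10 for s in scores):
        raise ValueError("scores must be between 0 and 10")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)


# Illustrative responses, not real survey data
responses = [10, 9, 8, 7, 6, 3, 10, 9, 9, 5]
print(net_promoter_score(responses))
```

With small teams a handful of responses can swing the score a lot, so the trend over several surveys matters more than any single value.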
2. Document Everything
# Documentation structure
/docs
├── getting-started/
│ ├── quickstart.md
│ ├── concepts.md
│ └── tutorials/
├── services/
│ ├── databases.md
│ ├── kubernetes.md
│ └── ci-cd.md
├── golden-paths/
│ ├── web-api.md
│ ├── background-job.md
│ └── static-website.md
└── runbooks/
├── troubleshooting.md
└── incident-response.md
3. Provide Escape Hatches
Allow advanced users to customize:
# Standard path (recommended)
platform create database --type postgres --size small
# Advanced path (override defaults)
platform create database \
--type postgres \
--instance-class db.r5.2xlarge \
--storage 1000 \
--iops 10000 \
--custom-config postgres-optimized.yaml
# Expert path (full Terraform)
terraform apply -var-file=custom.tfvars
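Under the hood, the standard and advanced paths can share one code path: start from the size preset and layer explicit overrides on top. The sketch below assumes presets taken from the Terraform module's size tables and an allow-list of overridable settings; the names `resolve_config` and `ALLOWED_OVERRIDES` are hypothetical.

```python
from typing import Any, Dict

# Presets mirroring the instance types and storage sizes in the Terraform module
SIZE_PRESETS: Dict[str, Dict[str, Any]] = {
    "small": {"instance_class": "db.t3.small", "storage": 20},
    "medium": {"instance_class": "db.t3.medium", "storage": 100},
    "large": {"instance_class": "db.t3.large", "storage": 500},
}

# Settings advanced users may override; anything else is rejected
ALLOWED_OVERRIDES = {"instance_class", "storage", "iops"}


def resolve_config(size: str = "small", **overrides: Any) -> Dict[str, Any]:
    """Merge explicit overrides over the preset for the given size."""
    if size not in SIZE_PRESETS:
        raise ValueError(f"unknown size {size!r}; choose from {sorted(SIZE_PRESETS)}")
    unknown = set(overrides) - ALLOWED_OVERRIDES
    if unknown:
        raise ValueError(f"unsupported overrides: {', '.join(sorted(unknown))}")
    config = dict(SIZE_PRESETS[size])  # copy so the preset is never mutated
    # Options left unset (None) fall back to the preset value
    config.update({k: v for k, v in overrides.items() if v is not None})
    return config


print(resolve_config("small"))
print(resolve_config("small", instance_class="db.r5.2xlarge", storage=1000, iops=10000))
```

The allow-list is the design choice that keeps the escape hatch from becoming a way around the platform's guardrails.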
Common Pitfalls
1. Building Too Much Too Fast
❌ Bad: Build everything before launch
- 20 services in catalog
- Full-featured web UI
- Complex approval workflows
- 6 months development
- Low adoption (too complex)
✅ Good: Start small, iterate
- 2-3 most requested services
- Simple CLI tool
- Automated approval for dev
- 1 month to first user
- High adoption (solves real pain)
2. Not Involving Developers
❌ Bad: Platform team builds in isolation
- "We know what developers need"
- No user research
- Low adoption
✅ Good: Developer-centric design
- Interview developers
- Beta test with friendly teams
- Gather feedback continuously
- Iterate based on usage data
3. No Migration Path
❌ Bad: "Everyone must use platform immediately"
- Force migration
- Break existing workflows
- Developer resistance
✅ Good: Gradual adoption
- New projects use platform
- Existing projects migrate when convenient
- Provide migration tools
- Support both approaches during transition
Conclusion
Platform Engineering is about empowering developers with self-service capabilities while maintaining control, security, and cost efficiency. Done right, it dramatically accelerates development velocity and reduces operational toil.
Key principles:
- Treat it as a product - Developers are your customers
- Start small - Ship value quickly, iterate
- Automate everything - Eliminate toil and tickets
- Document thoroughly - Make it easy to use
- Measure success - Track metrics that matter
The best platform is one that disappears—developers barely notice it because everything just works. That's the goal.
Need help building an internal developer platform? InstaDevOps provides expert consulting for platform engineering, developer experience, and infrastructure automation. Contact us for a free consultation.
Originally published at instadevops.com