<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: lssh</title>
    <description>The latest articles on DEV Community by lssh (@lbcristaldo).</description>
    <link>https://dev.to/lbcristaldo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3763164%2F4102bafe-b360-4777-b64f-42c3d373bd0e.jpg</url>
      <title>DEV Community: lssh</title>
      <link>https://dev.to/lbcristaldo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lbcristaldo"/>
    <language>en</language>
    <item>
      <title>AlmaLinux: From Firmware Preparation to Continuous Auditing</title>
      <dc:creator>lssh</dc:creator>
      <pubDate>Sat, 28 Feb 2026 05:59:16 +0000</pubDate>
      <link>https://dev.to/lbcristaldo/almalinux-from-firmware-preparation-to-continuous-auditing-38c1</link>
      <guid>https://dev.to/lbcristaldo/almalinux-from-firmware-preparation-to-continuous-auditing-38c1</guid>
      <description>&lt;h1&gt;
  
  
  Setting Up an AlmaLinux System: From Hardware Planning to Enterprise Hardening
&lt;/h1&gt;

&lt;p&gt;Setting up an AlmaLinux system follows a structured process that spans from hardware planning to post-installation hardening, including resilience, reproducibility, and continuous auditing strategies. This guide integrates enterprise security best practices aligned with CIS, DISA STIG, and HIPAA benchmarks.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Preparation and Firmware (BIOS/UEFI)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1.1 System Requirements
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Minimum RAM:&lt;/strong&gt; 1.5 GB (4 GB or more recommended for production)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disk space:&lt;/strong&gt; 10 GB minimum, 20 GB recommended for general use&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supported architectures:&lt;/strong&gt; x86_64 (Intel/AMD), aarch64 (ARM64), ppc64le (PowerPC), s390x (IBM Z)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  1.2 BIOS/UEFI Configuration
&lt;/h3&gt;

&lt;p&gt;It's essential to decide between legacy BIOS and UEFI before starting. UEFI is required for Secure Boot, a feature AlmaLinux supports to ensure only signed, authorized kernels and modules are loaded. Verify the boot order, and disable Secure Boot only if unsigned drivers cause conflicts. The firmware choice also determines the partitioning scheme: MBR for legacy BIOS, GPT for UEFI. Verify compatibility of critical controllers (Intel 8254x NICs, Atheros chips, storage HBAs), as some require additional firmware in isolated environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.3 ISO Download and Verification
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Available types:&lt;/strong&gt; boot (network installation), minimal (standalone base), dvd (full packages)&lt;/li&gt;
&lt;li&gt;Import the AlmaLinux public key and verify the SHA256 checksum of the downloaded image to ensure integrity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Boot media creation:&lt;/strong&gt; &lt;code&gt;dd&lt;/code&gt; on Linux/macOS, or Rufus/Fedora Media Writer on Windows&lt;/li&gt;
&lt;li&gt;USB of at least 8 GB (12 GB recommended for convenience)&lt;/li&gt;
&lt;/ul&gt;
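
&lt;p&gt;The verification flow can be sketched as follows. The key and checksum filenames are placeholders for whatever release you actually downloaded; the last three lines are a self-contained illustration of how &lt;code&gt;sha256sum -c&lt;/code&gt; behaves.&lt;/p&gt;

```shell
# Hypothetical filenames -- substitute the key and CHECKSUM file of your release:
#   gpg --import RPM-GPG-KEY-AlmaLinux      # import the AlmaLinux public key
#   gpg --verify CHECKSUM                   # confirm the checksum file is signed
#   sha256sum -c CHECKSUM --ignore-missing  # compare the ISO hash against it

# Self-contained demo of the checksum step:
echo "demo image content" > alma-demo.iso
sha256sum alma-demo.iso > CHECKSUM.demo
sha256sum -c CHECKSUM.demo    # prints "alma-demo.iso: OK" when the hash matches
```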




&lt;h2&gt;
  
  
  2. Configuration in the Anaconda Installer
&lt;/h2&gt;

&lt;p&gt;Once booted from the USB, Anaconda will guide the configuration through the "Installation Summary."&lt;/p&gt;

&lt;h3&gt;
  
  
  2.1 Localization and Time
&lt;/h3&gt;

&lt;p&gt;Configure keyboard, language support, and time zone. Enable network time (NTP) for precise synchronization, which is critical for forensic log consistency.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.2 Software Selection
&lt;/h3&gt;

&lt;p&gt;Base environment: Server with GUI, Server (no GUI), Minimal Installation, or Virtualization Host. Select add-ons according to the server role, applying the principle of minimal attack surface.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.3 Partitioning Strategy with LVM
&lt;/h3&gt;

&lt;p&gt;Custom partitioning with LVM (Logical Volume Manager) is fundamental for operational resilience. LVM allows expanding disks online without downtime and applying restrictive mount options per volume.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Separate &lt;code&gt;/var&lt;/code&gt;, &lt;code&gt;/tmp&lt;/code&gt;, &lt;code&gt;/home&lt;/code&gt;, and &lt;code&gt;/var/log&lt;/code&gt; into independent LVM volumes&lt;/li&gt;
&lt;li&gt;Apply options in &lt;code&gt;/etc/fstab&lt;/code&gt;: for &lt;code&gt;/tmp&lt;/code&gt; and &lt;code&gt;/var/tmp&lt;/code&gt; use &lt;code&gt;noexec&lt;/code&gt; (blocks binary execution), &lt;code&gt;nosuid&lt;/code&gt; (ignores set-user-ID), &lt;code&gt;nodev&lt;/code&gt; (ignores special devices)&lt;/li&gt;
&lt;li&gt;For &lt;code&gt;/var/log&lt;/code&gt;, isolation prevents a disk-filling DoS attack from compromising the root partition&lt;/li&gt;
&lt;li&gt;The XFS filesystem (default in AlmaLinux) allows online growth with &lt;code&gt;xfs_growfs&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
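
&lt;p&gt;An illustrative &lt;code&gt;/etc/fstab&lt;/code&gt; excerpt for the layout above. The volume group and logical volume names are examples; adapt them to your own scheme:&lt;/p&gt;

```
# /etc/fstab -- example entries (device paths are illustrative)
/dev/mapper/vg_sys-tmp     /tmp      xfs  defaults,noexec,nosuid,nodev  0 0
/dev/mapper/vg_sys-vartmp  /var/tmp  xfs  defaults,noexec,nosuid,nodev  0 0
/dev/mapper/vg_sys-varlog  /var/log  xfs  defaults,nodev                0 0
/dev/mapper/vg_sys-home    /home     xfs  defaults,nosuid,nodev         0 0
```

&lt;p&gt;After editing, validate the file with &lt;code&gt;findmnt --verify&lt;/code&gt; before rebooting, since a typo here can leave the system unbootable.&lt;/p&gt;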

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; unlike XFS, ext3/ext4 filesystems reserve 5% of blocks exclusively for root by default, leaving room to recover if a user fills the disk.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  2.4 Encryption
&lt;/h3&gt;

&lt;p&gt;Enable "Encrypt my data" in Anaconda to protect data at rest using LUKS encryption with a strong passphrase.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.5 Network and Hostname
&lt;/h3&gt;

&lt;p&gt;Enable detected network interfaces and assign an FQDN (Fully Qualified Domain Name) hostname.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.6 Security Profile (SCAP Compliance)
&lt;/h3&gt;

&lt;p&gt;From Anaconda, SCAP policies can be applied that automate the initial hardening of the system, establishing the security baseline from the first boot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Available profiles:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CIS Benchmark Level 1 or Level 2&lt;/li&gt;
&lt;li&gt;DISA STIG&lt;/li&gt;
&lt;li&gt;HIPAA&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;CIS Level 2 is the most restrictive and suitable for environments with strict compliance requirements.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  2.7 User Settings
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Define a strong password for root&lt;/li&gt;
&lt;li&gt;Create a regular user and mark them as Administrator for sudo access&lt;/li&gt;
&lt;li&gt;Root should never be used for routine operations&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2.8 FIPS (Highly Regulated Environments)
&lt;/h3&gt;

&lt;p&gt;In environments that require it, enabling FIPS mode ensures the system uses only certified cryptographic algorithms.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Important warning:&lt;/strong&gt; FIPS must be enabled from the installer boot using the &lt;code&gt;fips=1&lt;/code&gt; kernel parameter, not as a post-installation step. Post-installation conversion can break software that depends on non-certified algorithms (legacy MD5 implementations, certain OpenSSL versions). Cold-initialized FIPS mode is significantly more reliable.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  3. Post-Installation and Hardening Phase
&lt;/h2&gt;

&lt;p&gt;After the first reboot, proactive maintenance and layered security steps are executed.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 Initial Update
&lt;/h3&gt;

&lt;p&gt;Run &lt;code&gt;sudo dnf update&lt;/code&gt; immediately to apply the latest security patches and errata (ALSA). Audit active services with &lt;code&gt;systemctl list-units --type=service&lt;/code&gt; and disable everything not necessary for the server role (minimal attack surface principle).&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 SELinux Management
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Verify that SELinux is in &lt;strong&gt;Enforcing&lt;/strong&gt; mode&lt;/li&gt;
&lt;li&gt;Never disable it as a solution to problems&lt;/li&gt;
&lt;li&gt;Fix incorrect labels with &lt;code&gt;restorecon&lt;/code&gt;, not by disabling protection&lt;/li&gt;
&lt;li&gt;Adjust behaviors with SELinux booleans before resorting to policy changes&lt;/li&gt;
&lt;li&gt;Analyze denials with &lt;code&gt;ausearch -m avc&lt;/code&gt; or &lt;code&gt;sealert&lt;/code&gt; for precise diagnosis&lt;/li&gt;
&lt;/ul&gt;
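
&lt;p&gt;A typical triage session, as a sketch. The &lt;code&gt;/srv/www&lt;/code&gt; path and the boolean are examples, not prescriptions, and these commands require root on a SELinux-enabled host:&lt;/p&gt;

```shell
getenforce                                 # should print "Enforcing"
restorecon -Rv /srv/www                    # relabel a mislabeled tree in place
getsebool -a | grep httpd                  # list booleans relevant to the service
setsebool -P httpd_can_network_connect on  # persistently adjust one behavior
ausearch -m avc -ts recent                 # review recent denials for diagnosis
```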

&lt;h3&gt;
  
  
  3.3 SSH Management
&lt;/h3&gt;

&lt;p&gt;SSH configuration is one of the most critical attack surfaces on any network-exposed server.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Disable root login: &lt;code&gt;PermitRootLogin no&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Ensure only Protocol 2 is used (the default; protocol 1 was removed from modern OpenSSH)&lt;/li&gt;
&lt;li&gt;Force public key authentication and disable password authentication&lt;/li&gt;
&lt;li&gt;Restrict access with &lt;code&gt;AllowUsers&lt;/code&gt; or &lt;code&gt;AllowGroups&lt;/code&gt; in &lt;code&gt;/etc/ssh/sshd_config&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Consider changing the default port (22) to reduce noise in access logs&lt;/li&gt;
&lt;/ul&gt;
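
&lt;p&gt;The corresponding &lt;code&gt;/etc/ssh/sshd_config&lt;/code&gt; excerpt might look like this (the group name is an example):&lt;/p&gt;

```
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
AllowGroups sshusers
```

&lt;p&gt;Run &lt;code&gt;sshd -t&lt;/code&gt; to syntax-check the file, and keep an existing session open while restarting the service so a mistake doesn't lock you out.&lt;/p&gt;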

&lt;h3&gt;
  
  
  3.4 Firewall Configuration
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Verify that &lt;code&gt;firewalld&lt;/code&gt; is active and enabled at boot&lt;/li&gt;
&lt;li&gt;Apply the principle of least network privilege: only strictly necessary services (SSH, HTTP/S, etc.) with open ports&lt;/li&gt;
&lt;li&gt;Review active zones and connections with &lt;code&gt;firewall-cmd --list-all&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.5 Auditing with auditd
&lt;/h3&gt;

&lt;p&gt;SELinux and firewalld protect the system, but &lt;code&gt;auditd&lt;/code&gt; provides the forensic event logging required by frameworks like CIS or STIG. Without &lt;code&gt;auditd&lt;/code&gt;, a system may be secure but not auditable.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Verify that &lt;code&gt;auditd&lt;/code&gt; is active and configured to record critical security events&lt;/li&gt;
&lt;li&gt;Configure a remote log server via Syslog-ng with SSL/stunnel tunnels — this ensures logs are tamper-proof even if the host is compromised (auditd is useless if an attacker with root access can wipe local records)&lt;/li&gt;
&lt;li&gt;Validate post-installation time synchronization with &lt;code&gt;chronyc tracking&lt;/code&gt;, as precise timestamps are essential for forensic log consistency&lt;/li&gt;
&lt;/ul&gt;
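
&lt;p&gt;A minimal set of watch rules, as an illustration. The file paths and key names are examples; real baselines such as the CIS rule sets ship far more:&lt;/p&gt;

```
# /etc/audit/rules.d/hardening.rules -- load with: augenrules --load
-w /etc/passwd -p wa -k identity     # watch writes/attribute changes to passwd
-w /etc/sudoers -p wa -k privilege   # watch privilege configuration
-w /var/log/lastlog -p wa -k logins  # watch login accounting
```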

&lt;h3&gt;
  
  
  3.6 Least Privilege with sudo
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Configure &lt;code&gt;/etc/sudoers&lt;/code&gt; with least privilege policies, limiting which commands each user or group can execute&lt;/li&gt;
&lt;li&gt;Enable logging of all commands executed via sudo for full traceability&lt;/li&gt;
&lt;li&gt;Always use &lt;code&gt;visudo&lt;/code&gt; to edit sudoers, avoiding syntax errors that could lock out access&lt;/li&gt;
&lt;/ul&gt;
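
&lt;p&gt;A least-privilege &lt;code&gt;sudoers&lt;/code&gt; sketch. The group and command list are invented for illustration; always edit via &lt;code&gt;visudo&lt;/code&gt;:&lt;/p&gt;

```
Defaults logfile="/var/log/sudo.log"
%webadmins ALL=(root) /usr/bin/systemctl restart httpd, /usr/bin/journalctl -u httpd
```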

&lt;h3&gt;
  
  
  3.7 Kernel Live Patching
&lt;/h3&gt;

&lt;p&gt;In environments with high-availability SLAs, the traditional patch → reboot → maintenance window cycle represents a significant operational cost. Implement &lt;strong&gt;KernelCare&lt;/strong&gt; or &lt;strong&gt;TuxCare&lt;/strong&gt; to apply critical security patches to the kernel and system libraries (Glibc, OpenSSL) without needing to reboot. This maximizes availability and eliminates maintenance windows for urgent security patches.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Resilience and Operational Continuity
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 Backup Strategy
&lt;/h3&gt;

&lt;p&gt;A hardened system without backups is as serious an operational risk as a system without hardening. Backups must be defined before going to production.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configure LVM snapshots for quick captures of the system state before critical changes&lt;/li&gt;
&lt;li&gt;Implement regular external backups with tools like &lt;strong&gt;Restic&lt;/strong&gt;, &lt;strong&gt;Bacula&lt;/strong&gt;, or automated &lt;code&gt;rsync&lt;/code&gt; policies&lt;/li&gt;
&lt;li&gt;Periodically verify the integrity and restorability of backups&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4.2 Continuity with LVM
&lt;/h3&gt;

&lt;p&gt;LVM allows extending disks and growing XFS filesystems online with minimal or no downtime. Plan volumes with anticipated growth space to avoid capacity emergencies.&lt;/p&gt;
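
&lt;p&gt;Assuming free extents in the volume group, growing a volume online is two commands (the VG/LV names are examples):&lt;/p&gt;

```shell
lvextend -L +10G /dev/vg_sys/var   # extend the logical volume by 10 GiB
xfs_growfs /var                    # grow the mounted XFS filesystem to match
# or both at once: lvextend -r -L +10G /dev/vg_sys/var
```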

&lt;h3&gt;
  
  
  4.3 Long-Term Update Strategy
&lt;/h3&gt;

&lt;p&gt;For migrations between major versions (e.g., AlmaLinux 8 to 9), use the &lt;strong&gt;ELevate&lt;/strong&gt; project, which allows in-place upgrades without reinstalling the system. Plan update windows and document the pre-migration state with LVM snapshots.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Reproducibility and Configuration Management
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5.1 Automation with Ansible
&lt;/h3&gt;

&lt;p&gt;The manual guide becomes real value when translated into reusable, idempotent code. Ansible allows reproducing exactly the same state across any number of servers.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automate configuration of SELinux labels, booleans, and firewall rules&lt;/li&gt;
&lt;li&gt;Ensure every deployed server is identical to the previous one, eliminating configuration drift&lt;/li&gt;
&lt;li&gt;Version the playbooks in a source control system for change traceability&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5.2 Integration with Existing Infrastructure
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Domain joining with &lt;code&gt;realm join&lt;/code&gt; for integration with Active Directory (via Winbind or SSSD) or FreeIPA, centralizing identity management&lt;/li&gt;
&lt;li&gt;Configure internal package repositories for air-gapped environments or those with software approval policies&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  6. Active Vulnerability Management
&lt;/h2&gt;

&lt;p&gt;Everything above covers initial configuration and maintenance. A complete enterprise cycle requires continuous evaluation — a server that meets CIS Level 2 today may not meet it six months later after updates and operational changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  6.1 Scanning with OpenSCAP
&lt;/h3&gt;

&lt;p&gt;Run &lt;code&gt;oscap xccdf eval&lt;/code&gt; periodically against the same CIS/STIG profile applied during installation. Reports identify configuration drift from the initial baseline. Integrate scans into CI/CD pipelines or schedule them with &lt;code&gt;cron&lt;/code&gt; for automatic evaluation.&lt;/p&gt;
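
&lt;p&gt;A typical invocation looks like the following. The datastream path and profile id match what &lt;code&gt;scap-security-guide&lt;/code&gt; usually installs on AlmaLinux 9, but verify them locally with &lt;code&gt;oscap info&lt;/code&gt; before relying on this exact command:&lt;/p&gt;

```shell
oscap xccdf eval \
  --profile xccdf_org.ssgproject.content_profile_cis \
  --report /var/log/oscap-report-$(date +%F).html \
  /usr/share/xml/scap/ssg/content/ssg-almalinux9-ds.xml
```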

&lt;h3&gt;
  
  
  6.2 Auditing with Lynis
&lt;/h3&gt;

&lt;p&gt;Lynis complements OpenSCAP with a broader system security approach. Run &lt;code&gt;lynis audit system&lt;/code&gt; and review the additional hardening recommendations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This order of operations ensures the system is not only functional from the first minute, but meets enterprise security standards from its initial deployment. The sequence covers the four dimensions of a mature production system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔒 &lt;strong&gt;Secure by design&lt;/strong&gt; — hardening from Anaconda&lt;/li&gt;
&lt;li&gt;🔍 &lt;strong&gt;Auditable&lt;/strong&gt; — auditd + remote logs&lt;/li&gt;
&lt;li&gt;🔄 &lt;strong&gt;Resilient&lt;/strong&gt; — LVM + backups + live patching&lt;/li&gt;
&lt;li&gt;♻️ &lt;strong&gt;Reproducible&lt;/strong&gt; — Ansible + configuration management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Active vulnerability management with OpenSCAP and Lynis closes the cycle by ensuring compliance is maintained over time, not just at the moment of deployment.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://lbcristaldo.hashnode.dev/" rel="noopener noreferrer"&gt;https://lbcristaldo.hashnode.dev/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>linux</category>
      <category>ansible</category>
      <category>sysadmin</category>
    </item>
    <item>
      <title>From NGINX Ingress to Gateway API</title>
      <dc:creator>lssh</dc:creator>
      <pubDate>Sun, 22 Feb 2026 03:13:45 +0000</pubDate>
      <link>https://dev.to/lbcristaldo/de-nginx-ingress-a-gateway-api-4m1i</link>
      <guid>https://dev.to/lbcristaldo/de-nginx-ingress-a-gateway-api-4m1i</guid>
      <description>&lt;p&gt;A migration guide: why, how, and what to expect&lt;/p&gt;

&lt;p&gt;&lt;u&gt;1. Why migrate: the end of NGINX Ingress&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;The community Ingress NGINX controller is entering a "best effort" maintenance phase until its official retirement in March 2026. After that date there will be no more bug fixes or security patches. Any CVE discovered will go unanswered.&lt;br&gt;
That makes the migration a &lt;strong&gt;compliance and security priority, not just an optional technical upgrade.&lt;/strong&gt; Postponing it means accumulating unsustainable technical debt: an exposed, unsupported routing system carrying production traffic.&lt;br&gt;
Migrating to Gateway API is not a version bump. It is a re-architecture of traffic handling in Kubernetes.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;2. The paradigm shift: from monolith to a decoupled model&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;The Ingress model packed everything related to routing into a single object: infrastructure provisioning, network configuration, and application rules. It worked, but with significant friction.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The problem with Ingress&lt;/em&gt;&lt;br&gt;
    • It depended on proprietary, vendor-specific annotations for advanced features (canary, rewrites, headers). &lt;strong&gt;That created vendor lock-in: switch controllers and you rewrite everything.&lt;/strong&gt;&lt;br&gt;
    • Because everything lived in a single object, an error in one route could destabilize the entire controller. The blast radius was global.&lt;br&gt;
    • It was limited to HTTP/HTTPS. There was no native support for TCP, UDP, gRPC, or TLS terminated at layer 4.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What Gateway API brings&lt;/em&gt;&lt;br&gt;
Gateway API separates responsibilities into three well-defined layers:&lt;br&gt;
    •  &lt;strong&gt;GatewayClass:&lt;/strong&gt; defines the infrastructure provider (who operates the controller).&lt;br&gt;
    •  &lt;strong&gt;Gateway:&lt;/strong&gt; managed by the platform or cluster team. Global TLS, WAF, and security policies live here.&lt;br&gt;
    •  &lt;strong&gt;Routes (HTTPRoute, TCPRoute, etc.):&lt;/strong&gt; managed by the application teams. Each team controls its own routes without touching the global configuration.&lt;/p&gt;

&lt;p&gt;This decoupling has concrete consequences:&lt;br&gt;
    • A developer can no longer accidentally override the cluster's TLS configuration.&lt;br&gt;
    • A syntax error in one team's namespace does not affect other teams' routes.&lt;br&gt;
    • Advanced features (traffic splitting, header modification, weighted routing) are first-class fields in the specification, not ad hoc annotations.&lt;br&gt;
    • Native support for L4 and L7 protocols: HTTP, HTTPS, TCP, UDP, TLS, gRPC.&lt;br&gt;
Portability is another direct win: because it is based on a SIG-Network standard, switching controllers or cloud providers no longer means rewriting every manifest.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7t46gu9asvrq7wnhe8c7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7t46gu9asvrq7wnhe8c7.png" alt=" " width="800" height="1381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;3. How it's done: the migration process&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;The recommendation is clear: avoid the big-bang approach. A progressive, parallel migration reduces the risk to manageable levels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step by step&lt;/strong&gt;&lt;br&gt;
    1. &lt;strong&gt;Audit:&lt;/strong&gt; Inventory every dependency of the current Ingress. Identify custom or "exotic" annotations, DNS configuration, TLS certificates, and special use cases. Not every annotation has a direct equivalent in Gateway API; better to find out sooner than later.&lt;br&gt;
    2. &lt;strong&gt;Assisted conversion:&lt;/strong&gt; Use the ingress2gateway tool, which takes Ingress NGINX manifests and generates equivalent Gateway API resources. The output requires review and manual adjustment; it is a starting point, not a final result.&lt;br&gt;
    3. &lt;strong&gt;Parallel deployment ("double run"):&lt;/strong&gt; Install the new Gateway API controller alongside the existing NGINX controller. Create the Gateway and HTTPRoute objects without directing real traffic yet. Validate the configuration in a staging environment.&lt;br&gt;
    4. &lt;strong&gt;Progressive cutover via DNS:&lt;/strong&gt; Redirect small percentages of traffic to the new stack (start with 1-5%) and monitor latency metrics (p99) and 5xx error rates before continuing. Scale gradually to 100%.&lt;br&gt;
    5. &lt;strong&gt;Validation and cleanup:&lt;/strong&gt; Once 100% of the traffic runs stably on Gateway API, delete the Ingress resources and uninstall the old controller. This reduces the attack surface and eliminates the technical debt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Critical configuration points&lt;/strong&gt;&lt;br&gt;
Two Gateway API mechanisms deserve special attention during the migration:&lt;br&gt;
    •  &lt;strong&gt;ReferenceGrant:&lt;/strong&gt; explicitly allows a Gateway in an infrastructure namespace to send traffic to a Service in an application namespace. Without this resource, cross-namespace routing is blocked by default. That is security by design.&lt;br&gt;
    •  &lt;strong&gt;Status Conditions:&lt;/strong&gt; Gateway API objects report their state through conditions (Accepted, Programmed, ResolvedRefs). This makes it possible to diagnose problems directly on the resource, without digging through controller logs.&lt;/p&gt;
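
&lt;p&gt;A minimal ReferenceGrant sketch. The namespace and resource names below are invented for illustration: the grant lives in the application namespace and explicitly allows routes from the infrastructure namespace to target its Services.&lt;/p&gt;

```
# referencegrant.yaml -- example names; apply with: kubectl apply -f referencegrant.yaml
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-infra-routes
  namespace: team-a        # the namespace that owns the target Services
spec:
  from:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    namespace: infra       # where the platform team's routes live
  to:
  - group: ""              # core API group (Service)
    kind: Service
```

&lt;p&gt;Until a grant like this exists, a route in &lt;code&gt;infra&lt;/code&gt; that references a backend in &lt;code&gt;team-a&lt;/code&gt; stays blocked, and the route reports it through its status conditions.&lt;/p&gt;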

&lt;p&gt;&lt;u&gt;4. Expected behavior after the migration&lt;/u&gt;&lt;br&gt;
Once the migration is complete, the infrastructure operates in a qualitatively different way:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Governance and granular security&lt;/strong&gt;&lt;br&gt;
Infrastructure teams define and protect the global policies on the Gateway object. Development teams manage their HTTPRoutes autonomously. Kubernetes RBAC enforces this separation natively, with no workarounds or informal conventions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reduced blast radius&lt;/strong&gt;&lt;br&gt;
A misconfiguration in one team's namespace affects only that team's routes. The controller does not enter an unstable state; it simply reports the error in the status field of the affected object.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Convergence with Service Mesh (the GAMMA initiative)&lt;/strong&gt;&lt;br&gt;
The Gateway API project's GAMMA initiative allows the same syntax to manage both north-south ingress traffic and internal east-west traffic between services. This simplifies the technology stack and removes the need for separate tooling for each traffic type.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WAF integration and centralized security&lt;/strong&gt;&lt;br&gt;
Implementations such as NGINX App Protect or Envoy Gateway's security policies make it possible to centralize threat protection (OWASP Top 10) directly at the entry point, managed by the Gateway object and with no per-service configuration.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;5. Considerations and points of attention&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;Gateway API solves real Ingress problems, but its greater granularity introduces complexities of its own that are worth keeping in mind.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cognitive load&lt;/strong&gt;&lt;br&gt;
Where there used to be a single Ingress object, there are now three layers (GatewayClass, Gateway, Routes) that must be correctly wired together. The learning curve is real. Teams without prior experience with the model need adoption time and clear internal documentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Control plane scalability&lt;/strong&gt;&lt;br&gt;
In large clusters with thousands of routes, translating Kubernetes objects into data plane configuration (for example, Envoy's xDS protocol) can be computationally expensive. This can cause CPU spikes and update propagation latency during periods of heavy change churn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Annotations with no direct equivalent&lt;/strong&gt;&lt;br&gt;
Not every advanced NGINX Ingress annotation has an equivalent field in the current Gateway API specification. Some use cases require redesign, not conversion. The ingress2gateway tool helps, but it does not cover 100% of the scenarios.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fragility under dynamic load&lt;/strong&gt;&lt;br&gt;
Some implementations have shown unstable behavior under a very high rate of route changes or connection spikes. In shared-proxy models, a namespace with excessive traffic can starve the resources available to others. Continuous monitoring during and after the migration is not optional.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How the community manages these risks&lt;/strong&gt;&lt;br&gt;
The Gateway API project, led by SIG-Network, addresses these risks through concrete mechanisms:&lt;br&gt;
    • Strict conformance tests that guarantee consistent behavior across different controllers, reducing fragmentation.&lt;br&gt;
    • The Policy Attachment model, which applies security and rate-limiting policies declaratively and hierarchically, reducing the impact of misconfigurations.&lt;br&gt;
    • Clear scope rules: a controller only reports errors on objects within its ownership chain, avoiding conflicts between multiple implementations in the same cluster.&lt;/p&gt;




&lt;p&gt;The migration to Gateway API is not urgent because of technical fashion; it is urgent because March 2026 is a concrete date, and CVEs do not wait for maintenance windows.&lt;/p&gt;

&lt;p&gt;Kubernetes SIG-Network · Gateway API · 2025–2026&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>nginx</category>
      <category>gatewayapi</category>
    </item>
    <item>
      <title>Cloud workstation on AWS for $36/month: Windows EC2, static IP and Denver egress explained</title>
      <dc:creator>lssh</dc:creator>
      <pubDate>Sun, 15 Feb 2026 22:21:18 +0000</pubDate>
      <link>https://dev.to/lbcristaldo/cloud-workstation-on-aws-for-36month-windows-ec2-static-ip-and-denver-egress-explained-36f3</link>
      <guid>https://dev.to/lbcristaldo/cloud-workstation-on-aws-for-36month-windows-ec2-static-ip-and-denver-egress-explained-36f3</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I built this for a client based in Denver. The project fell through (as they do). But the setup was too pretty to waste, so here it is: a Windows cloud workstation with a static Denver IP, built from Argentina, for a use case that no longer exists. Turns out it's useful anyway.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Instance:&lt;/strong&gt; &lt;code&gt;t3.large&lt;/code&gt; — 2 vCPU, 8 GB RAM, Windows Server 2022&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Static inbound IP:&lt;/strong&gt; Elastic IP (free while the instance is running)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Denver egress:&lt;/strong&gt; a $3/month static residential proxy — &lt;em&gt;not&lt;/em&gt; a VPN, not a second EC2&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total monthly cost:&lt;/strong&gt; ~$36–39 running 8h/day on weekdays, or ~$118 if you leave it on 24/7&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No overengineering. No enterprise fluff. Just a clean, hardened Windows box that does what it needs to do.&lt;/p&gt;
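
&lt;p&gt;A back-of-envelope check of that monthly figure. The hourly rate is derived from the ~$110/month 24/7 number, and the EBS line assumes roughly 100 GB of gp3 at ~$0.08/GB-month, so treat the result as an estimate rather than a quote:&lt;/p&gt;

```shell
hours=$((8 * 22))                              # 8 h/day across ~22 weekdays
total=$(awk -v rate=0.151 -v h="$hours" \
  'BEGIN { printf "%.2f", rate * h + 8 + 3 }') # compute + EBS (~$8) + proxy ($3)
echo "estimated monthly total: \$${total}"     # lands inside the ~$36-39 range
```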




&lt;h2&gt;
  
  
  This is what we're building
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcfgf8n0alt7tu6jw0xjo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcfgf8n0alt7tu6jw0xjo.png" alt=" " width="800" height="852"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three independent flows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;You → EC2&lt;/strong&gt; via RDP (encrypted, locked to your IP only)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chrome → Proxy → Web&lt;/strong&gt; (all browser traffic exits from a real Denver residential IP)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EC2 → EBS&lt;/strong&gt; (persistent disk that survives stops and restarts)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key insight the diagram makes obvious: AWS region and Denver geolocation are completely separate concerns. Your instance lives in &lt;code&gt;us-east-1&lt;/code&gt; (Virginia) for billing reasons. Denver happens at the proxy layer, outside AWS entirely. These two decisions don't interfere with each other.&lt;/p&gt;




&lt;h2&gt;
  
  
  Who is this actually for?
&lt;/h2&gt;

&lt;p&gt;Before going further: this setup makes sense if you need any of the following.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;persistent Windows environment&lt;/strong&gt; with a stable identity, especially if you travel frequently and your IP changes constantly. Your cloud machine stays in "Denver" while you're physically in Buenos Aires, Bangkok, or an airport lounge somewhere in between&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Browser-based workflows that benefit from a consistent IP identity.&lt;/strong&gt; Think platforms that tie account trust to IP history, or any service that behaves differently depending on your apparent location&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;static outbound IP&lt;/strong&gt; so you can whitelist yourself on external services&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;geo-specific IP&lt;/strong&gt; (Denver or otherwise) without paying for a full dedicated server&lt;/li&gt;
&lt;li&gt;A starting point you can scale. This setup is deliberately one node. The same pattern extends to a fleet: one workstation per city, or multiple instances for a distributed remote team. Today it's one VM; later, it's infrastructure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's &lt;em&gt;not&lt;/em&gt; the right call if you need GPU compute, if you're running heavy desktop software, or if multiple people need simultaneous access (look at WorkSpaces for that).&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture decisions (and what I ruled out)
&lt;/h2&gt;

&lt;p&gt;This is the section most tutorials skip. They tell you &lt;em&gt;what&lt;/em&gt; to do but not &lt;em&gt;why this and not that&lt;/em&gt;. Here's every real decision I made.&lt;/p&gt;

&lt;h3&gt;
  
  
  Instance type: why &lt;code&gt;t3.large&lt;/code&gt; and not smaller or bigger
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;t3&lt;/code&gt; family is burstable — you get a baseline of CPU performance with the ability to spike when needed. For browser-based work, that's ideal: mostly idle, occasionally intense when loading heavy pages or running multiple tabs.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Instance&lt;/th&gt;
&lt;th&gt;vCPU&lt;/th&gt;
&lt;th&gt;RAM&lt;/th&gt;
&lt;th&gt;Windows On-Demand (us-east-1)&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;t3.medium&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;4 GB&lt;/td&gt;
&lt;td&gt;~$55/mo&lt;/td&gt;
&lt;td&gt;Too little RAM for Chrome + RDP overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;t3.large&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8 GB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$110/mo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Sweet spot&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;t3.xlarge&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;16 GB&lt;/td&gt;
&lt;td&gt;~$220/mo&lt;/td&gt;
&lt;td&gt;Overkill — 2x cost, same use case&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Windows Server itself consumes ~2 GB at idle. An RDP session adds ~500 MB. Chrome with a few tabs adds another 1–2 GB. On a &lt;code&gt;t3.medium&lt;/code&gt; you're already at the ceiling before doing any work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Region: &lt;code&gt;us-east-1&lt;/code&gt;, not whatever's closest to Denver
&lt;/h3&gt;

&lt;p&gt;Counter-intuitive but important: &lt;strong&gt;don't choose your AWS region based on geographic proximity to your target city.&lt;/strong&gt; Region determines compute pricing. Denver egress is handled at the proxy layer. These are independent.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;us-east-1&lt;/code&gt; (N. Virginia) is the cheapest AWS region for Windows On-Demand. There's no AWS region in Denver, and even if there were, AWS IP geolocation at the city level is unreliable — you cannot guarantee a city-level geo from a raw AWS IP regardless of region.&lt;/p&gt;

&lt;h3&gt;
  
  
  Denver egress: proxy, not VPN, not a second EC2
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Denver because that's where the client was. The client is gone. The architecture isn't. Swap the city for wherever you need. The setup is identical.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This was the most important decision. Three options I evaluated:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option A: VPN client on EC2 (Mullvad, ProtonVPN)&lt;/strong&gt;&lt;br&gt;
Installs a VPN client on the Windows instance, routes all traffic through a Denver server. Works, but: adds latency overhead, costs $5–10/mo, routes &lt;em&gt;all&lt;/em&gt; traffic including RDP (which you don't need geo'd), and VPN IPs are well-known to commercial services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option B: Second EC2 instance in a hypothetical Denver region&lt;/strong&gt;&lt;br&gt;
Doesn't exist. AWS has no Denver region. Even the closest region (us-west-2, Oregon) wouldn't give you a Denver IP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option C: Static residential ISP proxy (chosen)&lt;/strong&gt;&lt;br&gt;
A single HTTP/SOCKS5 endpoint from a provider like IPRoyal or Webshare. Configured once in Chrome. Costs $2–5/mo for a single static Denver IP with unlimited bandwidth. The exit IP is a real residential ISP address — not a datacenter IP — which matters for services like LinkedIn that flag datacenter ranges.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why residential matters:&lt;/strong&gt; Many platforms cross-reference your IP against known datacenter CIDR blocks. A residential proxy from an actual Denver ISP looks like a person in Denver, not a server in Denver.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Storage: 50 GB gp3, not the default
&lt;/h3&gt;

&lt;p&gt;AWS will suggest &lt;code&gt;gp2&lt;/code&gt; by default. Use &lt;code&gt;gp3&lt;/code&gt; — same performance baseline, 20% cheaper, and you can provision throughput independently if needed later. 50 GB gives Windows room to breathe (baseline install + updates + Chrome profile + downloads).&lt;/p&gt;
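If you already launched on the default gp2, you don't need to rebuild: the conversion is a single in-place API call with no downtime or detach. A sketch, assuming the AWS CLI is configured (the volume ID is a placeholder):

```shell
# Volume ID is a placeholder -- find yours with: aws ec2 describe-volumes
VOLUME_ID="vol-xxxxxxxxx"

# Convert gp2 -> gp3 in place. gp3's baseline (3000 IOPS, 125 MB/s)
# already exceeds what a 50 GB gp2 volume gets, so nothing extra to provision.
aws ec2 modify-volume \
  --volume-id "$VOLUME_ID" \
  --volume-type gp3
```

The volume stays attached and online while the modification runs in the background.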
&lt;h3&gt;
  
  
  Elastic IP: attach it, don't skip it
&lt;/h3&gt;

&lt;p&gt;Without an Elastic IP, your instance gets a new public IP every time it starts. That means updating your RDP bookmark, your Security Group rules, and any external whitelists every single time. One EIP solves all of that permanently. It's free while attached to a running instance.&lt;/p&gt;


&lt;h2&gt;
  
  
  Execution: step by step, both CLI and console
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt; AWS account, AWS CLI configured (&lt;code&gt;aws configure&lt;/code&gt;), an RDP client (built into Windows; &lt;a href="https://apps.apple.com/app/microsoft-remote-desktop/id1295203466" rel="noopener noreferrer"&gt;Microsoft Remote Desktop&lt;/a&gt; on Mac).&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h3&gt;
  
  
  Step 1: Create a Security Group
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Console:&lt;/strong&gt; EC2 → Security Groups → Create Security Group&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CLI:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create the security group&lt;/span&gt;
aws ec2 create-security-group &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--group-name&lt;/span&gt; windows-workstation-sg &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--description&lt;/span&gt; &lt;span class="s2"&gt;"Windows cloud workstation"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vpc-id&lt;/span&gt; vpc-xxxxxxxxx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Add RDP inbound rule — YOUR IP ONLY&lt;/span&gt;
&lt;span class="nv"&gt;MY_IP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; https://checkip.amazonaws.com&lt;span class="si"&gt;)&lt;/span&gt;
aws ec2 authorize-security-group-ingress &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--group-name&lt;/span&gt; windows-workstation-sg &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--protocol&lt;/span&gt; tcp &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 3389 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cidr&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;MY_IP&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/32"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Critical:&lt;/strong&gt; Never open port 3389 to &lt;code&gt;0.0.0.0/0&lt;/code&gt;. Bots will attempt brute-force login within minutes. Your IP only, always.&lt;/p&gt;
&lt;/blockquote&gt;
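That rule is worth auditing periodically, not just at creation time. A hedged sketch that lists any security group in the account exposing 3389 to the world — the output should be empty:

```shell
PORT=3389

# List security groups with an inbound RDP rule open to 0.0.0.0/0.
# No output means you're clean.
aws ec2 describe-security-groups \
  --filters "Name=ip-permission.from-port,Values=${PORT}" \
            "Name=ip-permission.cidr,Values=0.0.0.0/0" \
  --query "SecurityGroups[].[GroupId,GroupName]" \
  --output text
```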




&lt;h3&gt;
  
  
  Step 2: Launch the EC2 instance
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Console:&lt;/strong&gt; EC2 → Launch Instance → Windows Server 2022 Base → t3.large → 50 GB gp3 → select your security group → launch&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CLI:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Get the latest Windows Server 2022 AMI ID for us-east-1&lt;/span&gt;
aws ec2 describe-images &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--owners&lt;/span&gt; amazon &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filters&lt;/span&gt; &lt;span class="s2"&gt;"Name=name,Values=Windows_Server-2022-English-Full-Base-*"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"sort_by(Images, &amp;amp;CreationDate)[-1].ImageId"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Launch instance (replace ami-xxxxxxxxx with the ID from above)&lt;/span&gt;
aws ec2 run-instances &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image-id&lt;/span&gt; ami-xxxxxxxxx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--instance-type&lt;/span&gt; t3.large &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--key-name&lt;/span&gt; your-key-pair-name &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--security-group-ids&lt;/span&gt; sg-xxxxxxxxx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--block-device-mappings&lt;/span&gt; &lt;span class="s1"&gt;'[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":50,"VolumeType":"gp3","DeleteOnTermination":true}}]'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-associate-public-ip-address&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tag-specifications&lt;/span&gt; &lt;span class="s1"&gt;'ResourceType=instance,Tags=[{Key=Name,Value=windows-workstation}]'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
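`run-instances` returns immediately, but the instance takes several minutes to pass its status checks. Rather than refreshing the console, you can block on the CLI's built-in waiter (the instance ID is a placeholder — take it from the `run-instances` output):

```shell
INSTANCE_ID="i-xxxxxxxxx"  # placeholder -- from the run-instances output

# Blocks until both system and instance status checks pass
aws ec2 wait instance-status-ok --instance-ids "$INSTANCE_ID"
echo "Instance $INSTANCE_ID passed status checks"
```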






&lt;h3&gt;
  
  
  Step 3: Allocate and attach an Elastic IP
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Console:&lt;/strong&gt; EC2 → Elastic IPs → Allocate → Associate → select your instance&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CLI:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Allocate a new Elastic IP&lt;/span&gt;
&lt;span class="nv"&gt;ALLOC_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws ec2 allocate-address &lt;span class="nt"&gt;--domain&lt;/span&gt; vpc &lt;span class="nt"&gt;--query&lt;/span&gt; AllocationId &lt;span class="nt"&gt;--output&lt;/span&gt; text&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Allocation ID: &lt;/span&gt;&lt;span class="nv"&gt;$ALLOC_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# Get your instance ID&lt;/span&gt;
&lt;span class="nv"&gt;INSTANCE_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws ec2 describe-instances &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filters&lt;/span&gt; &lt;span class="s2"&gt;"Name=tag:Name,Values=windows-workstation"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"Reservations[0].Instances[0].InstanceId"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; text&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Associate the EIP&lt;/span&gt;
aws ec2 associate-address &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--instance-id&lt;/span&gt; &lt;span class="nv"&gt;$INSTANCE_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--allocation-id&lt;/span&gt; &lt;span class="nv"&gt;$ALLOC_ID&lt;/span&gt;

&lt;span class="c"&gt;# Get your permanent IP&lt;/span&gt;
aws ec2 describe-addresses &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--allocation-ids&lt;/span&gt; &lt;span class="nv"&gt;$ALLOC_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"Addresses[0].PublicIp"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save that IP. It's yours permanently until you explicitly release it.&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 4: Get the Windows password
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Console:&lt;/strong&gt; EC2 → Instances → select instance → Actions → Security → Get Windows password → upload .pem → decrypt&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CLI:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ec2 get-password-data &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--instance-id&lt;/span&gt; &lt;span class="nv"&gt;$INSTANCE_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--priv-launch-key&lt;/span&gt; /path/to/your-key.pem &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; PasswordData &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Wait 4–5 minutes after launch before this works — the instance needs to finish initializing and encrypt the password.&lt;/p&gt;
&lt;/blockquote&gt;
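You don't have to guess the timing, either — the CLI ships a waiter for exactly this, polling until the encrypted password data exists. A sketch (the instance ID is a placeholder):

```shell
INSTANCE_ID="i-xxxxxxxxx"  # placeholder -- your instance ID from Step 3

# Polls until the encrypted Windows password is available, then returns
aws ec2 wait password-data-available --instance-id "$INSTANCE_ID"
```

Chain it directly in front of the `get-password-data` call above and the whole step becomes hands-off.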




&lt;h3&gt;
  
  
  Step 5: Connect via RDP and baseline setup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Windows&lt;/span&gt;
mstsc /v:&amp;lt;your-elastic-ip&amp;gt;

&lt;span class="c"&gt;# Mac — open Microsoft Remote Desktop, add PC with your Elastic IP&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once connected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Open PowerShell as Administrator, then:&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c"&gt;# 1. Disable IE Enhanced Security (lets you download Chrome)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$AdminKey&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"HKLM:\SOFTWARE\Microsoft\Active Setup\Installed Components\{A509B1A7-37EF-4b3f-8CFC-4F3A74704073}"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Set-ItemProperty&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Path&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$AdminKey&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"IsInstalled"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Value&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Stop-Process&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Explorer&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c"&gt;# 2. Install Chrome silently&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$installer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;TEMP&lt;/span&gt;&lt;span class="s2"&gt;\ChromeSetup.exe"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Invoke-WebRequest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://dl.google.com/chrome/install/ChromeStandaloneSetup64.exe"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-OutFile&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$installer&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Start-Process&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$installer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-ArgumentList&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/silent /install"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Wait&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c"&gt;# 3. Disable unnecessary services&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Stop-Service&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Spooler"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Force&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Set-Service&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Spooler"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-StartupType&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Disabled&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Stop-Service&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Fax"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Force&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="n"&gt;Set-Service&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Fax"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-StartupType&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Disabled&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c"&gt;# 4. Set account lockout policy (5 attempts, 30 min lockout)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;accounts&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;/lockoutthreshold:5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;/lockoutduration:30&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;/lockoutwindow:30&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Step 6: Configure the Denver proxy
&lt;/h3&gt;

&lt;p&gt;Purchase a static residential proxy with Denver, CO targeting from IPRoyal (~$2/mo) or Webshare (~$3/mo). You'll get: a host, a port, a username, and a password.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option A: Chrome only (recommended, most surgical)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create a desktop shortcut pointing to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"C:\Program Files\Google\Chrome\Application\chrome.exe" --proxy-server="socks5://USERNAME:PASSWORD@HOST:PORT"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use this shortcut for all geo-sensitive work. Regular Chrome remains proxy-free for anything else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option B: System-wide proxy (all traffic)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Set system proxy via PowerShell&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$proxyAddress&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"HOST:PORT"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Set-ItemProperty&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Path&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"HKCU:\Software\Microsoft\Windows\CurrentVersion\Internet Settings"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;-Name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ProxyServer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Value&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$proxyAddress&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Set-ItemProperty&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Path&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"HKCU:\Software\Microsoft\Windows\CurrentVersion\Internet Settings"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;-Name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ProxyEnable&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Value&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Verify Denver egress:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Open Chrome (via proxy shortcut) → navigate to https://ipinfo.io
Expected output: city: Denver, region: Colorado
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
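The same check can be scripted, which is handy as a quick pre-flight before starting work. A sketch using ipinfo.io's plain-text field endpoints; the credentials and endpoint are placeholders from your proxy provider:

```shell
# Placeholders -- substitute the credentials your proxy provider issued
PROXY="socks5://USERNAME:PASSWORD@HOST:PORT"

# ipinfo.io serves individual fields as plain text
curl -s --proxy "$PROXY" https://ipinfo.io/city     # expect: Denver
curl -s --proxy "$PROXY" https://ipinfo.io/region   # expect: Colorado
```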






&lt;h3&gt;
  
  
  Step 7: OS hardening
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Disable Remote Assistance (separate from RDP — you don't need it)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Set-ItemProperty&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Path&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"HKLM:\SYSTEM\CurrentControlSet\Control\Remote Assistance"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;-Name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fAllowToGetHelp"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Value&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;0&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c"&gt;# Enable Windows Defender real-time protection&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Set-MpPreference&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-DisableRealtimeMonitoring&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="bp"&gt;$false&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c"&gt;# Disable Xbox services (not needed on a server)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$xboxServices&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;@(&lt;/span&gt;&lt;span class="s2"&gt;"XblAuthManager"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"XblGameSave"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"XboxGipSvc"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"XboxNetApiSvc"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="kr"&gt;foreach&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;$svc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$xboxServices&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;Stop-Service&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$svc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Force&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-ErrorAction&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;SilentlyContinue&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;Set-Service&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$svc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-StartupType&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Disabled&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-ErrorAction&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;SilentlyContinue&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c"&gt;# Rename Administrator account (minor but effective hardening)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Rename-LocalUser&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Administrator"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-NewName&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cloudadmin"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Monthly cost scenarios
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;EC2 Compute&lt;/th&gt;
&lt;th&gt;EBS&lt;/th&gt;
&lt;th&gt;Elastic IP&lt;/th&gt;
&lt;th&gt;Proxy&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Always on (730 hrs)&lt;/td&gt;
&lt;td&gt;$109.79&lt;/td&gt;
&lt;td&gt;$4.00&lt;/td&gt;
&lt;td&gt;$0.00&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$117/mo&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Business hours (175 hrs)&lt;/td&gt;
&lt;td&gt;$26.32&lt;/td&gt;
&lt;td&gt;$4.00&lt;/td&gt;
&lt;td&gt;$3.60*&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$37/mo&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Minimal use (80 hrs)&lt;/td&gt;
&lt;td&gt;$12.03&lt;/td&gt;
&lt;td&gt;$4.00&lt;/td&gt;
&lt;td&gt;$3.60*&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$23/mo&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Instance stopped entirely&lt;/td&gt;
&lt;td&gt;$0.00&lt;/td&gt;
&lt;td&gt;$4.00&lt;/td&gt;
&lt;td&gt;$3.60&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$11/mo&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;*The EIP bills $0.005/hr while the instance is stopped, which works out to at most ~$3.60/mo.&lt;/em&gt;&lt;/p&gt;
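A quick sanity check on the table's arithmetic: the compute column is just the implied hourly rate times hours used, derived here from the always-on row:

```shell
# Derive the implied Windows On-Demand rate from the always-on row,
# then re-check the partial-use rows at the same rate
awk 'BEGIN {
  rate = 109.79 / 730                            # ~0.1504 $/hr
  printf "business hours: $%.2f\n", 175 * rate   # $26.32
  printf "minimal use:    $%.2f\n",  80 * rate   # $12.03
}'
```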

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The single most effective cost optimization:&lt;/strong&gt; stop the instance when you're done working. Not terminate — &lt;em&gt;stop&lt;/em&gt;. The EBS volume persists. Your Chrome profile, your files, your proxy config — all intact. You pick up exactly where you left off.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Want to automate stop/start?&lt;/strong&gt; A Lambda function + EventBridge rule can auto-stop at 8 PM and auto-start at 8 AM on weekdays. Adds maybe 30 minutes of setup. Saves ~$80/month if you'd otherwise leave it running.&lt;/p&gt;
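If you'd rather skip the Lambda entirely, EventBridge Scheduler can call the EC2 stop API directly as a "universal target". A hedged sketch — the role ARN (which needs `ec2:StopInstances` permission and a trust policy for `scheduler.amazonaws.com`) and the instance ID are placeholders:

```shell
SCHEDULE_NAME="stop-workstation-weeknights"

# Stop the workstation at 8 PM local time on weekdays -- no Lambda required
aws scheduler create-schedule \
  --name "$SCHEDULE_NAME" \
  --schedule-expression "cron(0 20 ? * MON-FRI *)" \
  --schedule-expression-timezone "America/Denver" \
  --flexible-time-window Mode=OFF \
  --target '{
    "Arn": "arn:aws:scheduler:::aws-sdk:ec2:stopInstances",
    "RoleArn": "arn:aws:iam::123456789012:role/scheduler-ec2-stop",
    "Input": "{\"InstanceIds\": [\"i-xxxxxxxxx\"]}"
  }'
```

A matching 8 AM schedule pointed at `ec2:startInstances` completes the pair.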




&lt;h2&gt;
  
  
  Troubleshooting and maintenance
&lt;/h2&gt;

&lt;h3&gt;
  
  
  "I can't RDP in"
&lt;/h3&gt;

&lt;p&gt;Most common cause: your local IP changed (happens with residential ISPs).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Get your current IP&lt;/span&gt;
curl https://checkip.amazonaws.com

&lt;span class="c"&gt;# Update the Security Group rule&lt;/span&gt;
aws ec2 revoke-security-group-ingress &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--group-name&lt;/span&gt; windows-workstation-sg &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--protocol&lt;/span&gt; tcp &lt;span class="nt"&gt;--port&lt;/span&gt; 3389 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cidr&lt;/span&gt; OLD_IP/32

aws ec2 authorize-security-group-ingress &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--group-name&lt;/span&gt; windows-workstation-sg &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--protocol&lt;/span&gt; tcp &lt;span class="nt"&gt;--port&lt;/span&gt; 3389 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cidr&lt;/span&gt; NEW_IP/32
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  "The proxy isn't working / ipinfo.io shows wrong city"
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Confirm the Chrome shortcut includes the full &lt;code&gt;--proxy-server&lt;/code&gt; flag&lt;/li&gt;
&lt;li&gt;Check proxy credentials haven't expired (some providers rotate them)&lt;/li&gt;
&lt;li&gt;Try &lt;code&gt;curl --proxy socks5://USER:PASS@HOST:PORT https://ipinfo.io&lt;/code&gt; from PowerShell to isolate whether it's a Chrome config issue or a proxy issue&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  "Instance is slow / Chrome is laggy"
&lt;/h3&gt;

&lt;p&gt;Check CPU credit balance — t3 instances use CPU credits and can throttle if you've sustained high load:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws cloudwatch get-metric-statistics &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; AWS/EC2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--metric-name&lt;/span&gt; CPUCreditBalance &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dimensions&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;InstanceId,Value&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$INSTANCE_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--start-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'1 hour ago'&lt;/span&gt; +%Y-%m-%dT%H:%M:%SZ&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--end-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; +%Y-%m-%dT%H:%M:%SZ&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--period&lt;/span&gt; 300 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--statistics&lt;/span&gt; Average
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If credits are near zero, either wait for them to replenish (they refill at 36 credits/hr on a &lt;code&gt;t3.large&lt;/code&gt;) or switch the instance to unlimited credit mode, which lets it burst past its balance for a small per-vCPU-hour surcharge.&lt;/p&gt;
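Unlimited mode can be flipped on a running instance with one call, no stop/start required. A sketch (the instance ID is a placeholder; remember surplus credits are billed, so watch for sustained load):

```shell
INSTANCE_ID="i-xxxxxxxxx"  # placeholder

# Switch from 'standard' to 'unlimited' credit mode on the fly
aws ec2 modify-instance-credit-specification \
  --instance-credit-specifications "InstanceId=${INSTANCE_ID},CpuCredits=unlimited"

# Confirm the change took effect
aws ec2 describe-instance-credit-specifications \
  --instance-ids "$INSTANCE_ID"
```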

&lt;h3&gt;
  
  
  Routine maintenance (monthly, 10 minutes)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Windows Update — run in PowerShell&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Install-Module&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;PSWindowsUpdate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Force&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Get-WUInstall&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-AcceptAll&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-AutoReboot&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c"&gt;# Check disk space&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Get-PSDrive&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;C&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Select-Object&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Used&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nx"&gt;Free&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c"&gt;# Rotate your Administrator password if shared&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;cloudadmin&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;NewStrongPassword123&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  "I accidentally clicked Terminate instead of Stop"
&lt;/h3&gt;

&lt;p&gt;I'm sorry. The instance is gone. The EBS volume may still exist if you unchecked "Delete on termination" during setup — check EC2 → Volumes for an available volume and attach it to a new instance. This is why snapshots exist:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Take a snapshot before anything risky&lt;/span&gt;
aws ec2 create-snapshot &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--volume-id&lt;/span&gt; vol-xxxxxxxxx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--description&lt;/span&gt; &lt;span class="s2"&gt;"workstation-backup-&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y%m%d&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run this once a week. It costs ~$0.05/GB/month and has saved me more than once.&lt;/p&gt;
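&lt;p&gt;To make the weekly habit stick, the same snapshot command can be wrapped in a tiny script and scheduled from cron. A sketch, reusing the placeholder volume ID from above and guarded so it no-ops on machines without the AWS CLI:&lt;/p&gt;

```shell
# Weekly EBS snapshot sketch; wire it up with a cron entry such as:
#   0 3 * * 1 /usr/local/bin/snapshot-workstation.sh
# VOLUME_ID is a placeholder, exactly as in the command above.
VOLUME_ID="vol-xxxxxxxxx"
DESC="workstation-backup-$(date +%Y%m%d)"

# Guard: only call AWS when the CLI is actually present.
if command -v aws >/dev/null 2>/dev/null; then
  aws ec2 create-snapshot \
    --volume-id "$VOLUME_ID" \
    --description "$DESC" || echo "snapshot request failed"
fi

echo "$DESC"
```

&lt;p&gt;The date-stamped description makes it easy to find and prune old snapshots later.&lt;/p&gt;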




&lt;p&gt;The project that started this is gone. The setup isn't. If it saves you three hours of clicking around the AWS console and a surprise bill, it did its job.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;All pricing based on AWS us-east-1 On-Demand rates as of early 2026. Proxy pricing based on IPRoyal single static residential IP. Your numbers may vary slightly — always verify at &lt;a href="https://aws.amazon.com/ec2/pricing" rel="noopener noreferrer"&gt;aws.amazon.com/ec2/pricing&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Infrastructure Archaeology: Diagnosing Multi-Layer CI/CD Failures</title>
      <dc:creator>lssh</dc:creator>
      <pubDate>Thu, 12 Feb 2026 00:50:50 +0000</pubDate>
      <link>https://dev.to/lbcristaldo/infrastructure-archaeology-diagnosing-multi-layer-cicd-failures-3ahi</link>
      <guid>https://dev.to/lbcristaldo/infrastructure-archaeology-diagnosing-multi-layer-cicd-failures-3ahi</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;u&gt;The Pattern&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Modern cloud infrastructure often evolves through incremental additions. &lt;br&gt;
A team starts with basic CI/CD, adds Terraform for IaC, integrates &lt;br&gt;
security scanning, sets up monitoring—each piece works in isolation, &lt;br&gt;
but the system as a whole becomes fragile.&lt;/p&gt;

&lt;p&gt;Here's a failure pattern I've observed across multiple production &lt;br&gt;
GCP environments: what appears to be "a few broken configs" is actually &lt;br&gt;
a multi-layer architectural problem spanning Docker, Terraform, GitHub &lt;br&gt;
Actions, and cloud-native security tooling.&lt;/p&gt;

&lt;p&gt;Let's dissect it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DISCLAIMER:&lt;/strong&gt; All code examples, project names, domains, and configurations in this article are sanitized examples for educational purposes. No real client data or proprietary information is exposed. This analysis is based on publicly available documentation and common infrastructure patterns.&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;&lt;u&gt;The Symptom List&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this pattern, teams typically surface a cluster of related failures:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build &amp;amp; Container Issues:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Docker multi-stage build misconfigurations&lt;/strong&gt; — CI/CD pipelines reference non-existent stage names in Dockerfiles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Duplicate or conflicting CMD instructions&lt;/strong&gt; — containers exhibit unpredictable startup behavior&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image scanning pipeline breaks&lt;/strong&gt; — security tools block pushes but jobs still succeed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure-as-Code Failures:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Terraform module reference errors&lt;/strong&gt; — output files reference modules that don't exist in the configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Variable interface mismatches&lt;/strong&gt; — calling code passes variables that modules don't accept&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wrong execution context&lt;/strong&gt; — CI runs IaC commands in incorrect directories&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provider version drift&lt;/strong&gt; — different environments use incompatible provider versions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;CI/CD Architecture Gaps:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Missing deployment automation&lt;/strong&gt; — builds succeed but nothing triggers actual deployments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No quality gates&lt;/strong&gt; — tests and builds run in parallel; failures don't block progression&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardcoded deployment paths&lt;/strong&gt; — only specific branches trigger deploys; others require manual intervention&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration drift&lt;/strong&gt; — production URLs and domains missing from automation config&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Security Tooling Integration Conflicts:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Overlapping vulnerability detection&lt;/strong&gt; — Trivy, GCP Container Analysis, and Security Command Center all scan the same images&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime security false positives&lt;/strong&gt; — Falco rules trigger on legitimate Cloud Run startup syscalls&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fragmented security reporting&lt;/strong&gt; — findings appear in multiple systems with no single source of truth&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Policy enforcement gaps&lt;/strong&gt; — security scans run but don't actually block deployments&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Tech stack representative of this pattern:&lt;/strong&gt; GitHub Actions, GCP Cloud Run, Artifact Registry, Terraform, Firebase Hosting, containerized microservices with pnpm/npm monorepo structure.&lt;/p&gt;

&lt;p&gt;Seems like a lot of small fixes, right? The reality is more complex.&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;&lt;u&gt;What I Actually Found: The 3-Layer Problem&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;These aren't isolated bugs. They're symptoms of failures at three distinct levels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: The Obvious (Syntax &amp;amp; Configuration Errors)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;These are the errors you see immediately when you run the tools:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Docker Target Mismatch:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Dockerfile declares:&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;node:20-alpine&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;runner&lt;/span&gt;

&lt;span class="c"&gt;# GitHub Action requests:&lt;/span&gt;
with:
  target: app &lt;span class="c"&gt;# ❌ Stage "app" doesn't exist&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Terraform Module Reference:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# outputs.tf tries to reference:&lt;/span&gt;
&lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="s2"&gt;"api_url"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cloud_run_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;service_url&lt;/span&gt; &lt;span class="c1"&gt;# ❌ Module doesn't exist&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# main.tf actually has:&lt;/span&gt;
&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"api_service"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="c1"&gt;# Different name!&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"../../modules/cloud-run"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Variable Name Mismatch:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# envs/prod/main.tf sends:&lt;/span&gt;
&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"api"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;service_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"api-prod"&lt;/span&gt; &lt;span class="c1"&gt;# ❌ Module doesn't accept this&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# modules/cloud-run/variables.tf expects:&lt;/span&gt;
&lt;span class="nx"&gt;variable&lt;/span&gt; &lt;span class="s2"&gt;"name"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="c1"&gt;# Different variable!&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are &lt;strong&gt;language and consistency errors&lt;/strong&gt;. Terraform requires that any resource or module referenced in output files be explicitly declared in the active configuration. When you refactor and change module names in &lt;code&gt;main.tf&lt;/code&gt; but forget to update &lt;code&gt;outputs.tf&lt;/code&gt;, you get this.&lt;/p&gt;

&lt;p&gt;The fix? Run &lt;code&gt;terraform validate&lt;/code&gt; — it catches these immediately without even connecting to the cloud.&lt;/p&gt;
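&lt;p&gt;The same mismatch can be caught even earlier with plain text tools. A sketch, assuming the conventional &lt;code&gt;main.tf&lt;/code&gt;/&lt;code&gt;outputs.tf&lt;/code&gt; split shown above:&lt;/p&gt;

```shell
# Sketch: list module references in outputs.tf that main.tf never
# declares. Pure grep/sed/awk, no cloud access or terraform binary.
check_module_refs() {
  dir="$1"
  # Declared module names: lines like   module "api_service" {
  decls=$(grep -oE '^module "[a-zA-Z0-9_]+"' "$dir/main.tf" | tr -d '"' | awk '{print $2}')
  # Referenced module names: expressions like module.cloud_run_api.service_url
  grep -oE 'module\.[a-zA-Z0-9_]+' "$dir/outputs.tf" | sed 's/^module\.//' | sort -u |
  while read -r ref; do
    echo "$decls" | grep -qx "$ref" || echo "undeclared: $ref"
  done
}
```

&lt;p&gt;Run against the example above, it prints &lt;code&gt;undeclared: cloud_run_api&lt;/code&gt;, pointing straight at the stale output.&lt;/p&gt;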

&lt;p&gt;&lt;strong&gt;Layer 2: Platform Changes (Hidden Causes)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where it gets interesting. Some failures aren't in the code — they're in how GCP's platform has evolved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GCP Service Account Permission Changes:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GCP recently changed how Cloud Build uses service accounts. What used to work automatically now fails because the build service account no longer has default permissions to write logs or read from Artifact Registry.&lt;/p&gt;

&lt;p&gt;The missing piece: the &lt;code&gt;iam.serviceAccounts.actAs&lt;/code&gt; permission, required before one identity can deploy as (act as) a runtime service account.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Organization Policy Restrictions:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That "Firebase region conflict" isn't a typo in your Terraform. It's a collision with &lt;code&gt;constraints/gcp.resourceLocations&lt;/code&gt; — an organization policy that blocks deployments to certain regions, even if your Terraform syntax is perfect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VPC Service Controls:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the project sits inside a VPC Service Controls perimeter, Cloud Run deployments can fail silently with confusing 403/404 errors. The perimeter blocks communication between Google services — like the Cloud Run agent trying to read images from Artifact Registry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security Tooling Conflicts:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When security tools are added incrementally — each solving a specific &lt;br&gt;
problem in isolation — they create overlapping responsibilities and &lt;br&gt;
contradictory enforcement policies.&lt;/p&gt;

&lt;p&gt;A typical pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trivy&lt;/strong&gt; is added to CI to scan container images before push&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Falco&lt;/strong&gt; is added to monitor runtime behavior in Cloud Run&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GCP Container Analysis API&lt;/strong&gt; scans images automatically on push 
to Artifact Registry&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Command Center&lt;/strong&gt; aggregates findings across the project&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each tool works. The integration doesn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The failure cascade:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Trivy finds a CVE and is configured to block the push&lt;/li&gt;
&lt;li&gt;The GitHub Action reports success anyway (exit code not wired correctly)&lt;/li&gt;
&lt;li&gt;Image gets pushed to Artifact Registry&lt;/li&gt;
&lt;li&gt;Container Analysis API scans the same image 10 minutes later&lt;/li&gt;
&lt;li&gt;Falco triggers alerts on normal Cloud Run startup syscalls 
(false positive)&lt;/li&gt;
&lt;li&gt;Security Command Center reports the same CVE 3 hours later&lt;/li&gt;
&lt;li&gt;Three different alerting systems fire&lt;/li&gt;
&lt;li&gt;No one knows which finding to trust or act on first&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; No centralized security policy. Each tool was added &lt;br&gt;
without defining ownership, enforcement boundaries, or a single &lt;br&gt;
source of truth for findings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The hidden cost:&lt;/strong&gt; Security tools that don't actually gate deployments give a false sense of protection. The pipeline feels secure. It isn't.&lt;/p&gt;
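&lt;p&gt;The minimal repair for that false sense of security is making the scanner's exit code the step's exit code. A sketch (the registry path is a placeholder; &lt;code&gt;--exit-code&lt;/code&gt; and &lt;code&gt;--severity&lt;/code&gt; are standard Trivy flags):&lt;/p&gt;

```shell
# Sketch: a scan gate whose verdict actually fails the CI step.
scan_gate() {
  image="$1"
  if command -v trivy >/dev/null 2>/dev/null; then
    # --exit-code 1 makes trivy return nonzero on HIGH/CRITICAL
    # findings, so the function's (and the CI step's) status reflects it.
    trivy image --exit-code 1 --severity HIGH,CRITICAL "$image"
  else
    echo "trivy not installed; gate skipped"
  fi
}
```

&lt;p&gt;In a GitHub Actions step, calling &lt;code&gt;scan_gate "$IMAGE"&lt;/code&gt; as the last command is enough: a nonzero return fails the job instead of letting the push proceed.&lt;/p&gt;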

&lt;p&gt;&lt;strong&gt;GCP Resource Name Limits:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GCP enforces a 63-character limit on most resource names. If your Terraform generates longer names (for example, by stacking environment and team prefixes into a &lt;code&gt;baseInstanceName&lt;/code&gt;), the API truncates or rejects them, which surfaces as duplicate name conflicts and failed deployments.&lt;/p&gt;
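&lt;p&gt;A cheap guard is to length-check generated names before they ever reach a plan. A sketch with illustrative names:&lt;/p&gt;

```shell
# Sketch: flag generated resource names over GCP's 63-character limit.
check_name_length() {
  name="$1"
  if [ "${#name}" -gt 63 ]; then
    echo "too long (${#name} chars): $name"
    return 1
  fi
  echo "ok (${#name} chars): $name"
}
```

&lt;p&gt;Dropping this into the CI job before &lt;code&gt;terraform plan&lt;/code&gt; turns a confusing truncation conflict into an explicit, early failure.&lt;/p&gt;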

&lt;p&gt;These aren't bugs in your code. They're &lt;strong&gt;platform governance and technical constraints&lt;/strong&gt; that interact badly with naive configurations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3: Architectural Debt (The Root Problem)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The deepest layer isn't about syntax or permissions — it's about &lt;strong&gt;missing architecture&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No CI/CD Gates:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The build and CI workflows are decoupled. Tests can fail, but images still get built and pushed. There's no &lt;code&gt;needs:&lt;/code&gt; dependency chain enforcing that tests pass before builds run.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# What's happening:&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
  &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt; &lt;span class="c1"&gt;# ❌ Runs in parallel, doesn't wait for tests&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Wrong Directory Context:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GitHub Actions runs &lt;code&gt;terraform plan&lt;/code&gt; in the repository root instead of &lt;code&gt;envs/staging/&lt;/code&gt;. Terraform is directory-dependent — without the right context, it validates an empty or incomplete configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardcoded Feature Branch:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Only one deployment path works: a specific feature branch → staging. There's no &lt;code&gt;development&lt;/code&gt; → staging automation, no &lt;code&gt;main&lt;/code&gt; → production workflow. Everything else is manual.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Missing Environment Variables:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Production URLs and domains aren't defined anywhere in the automation. Cloud Run services deploy without knowing their actual domain mappings, leaving SSL certificates stuck in provisioning or external access failing with 404/502.&lt;/p&gt;

&lt;p&gt;This is &lt;strong&gt;lifecycle orchestration failure&lt;/strong&gt;. Someone built pieces that "worked" in isolation but never architected how they fit together.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8rriy1fijjldwxyc0ffx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8rriy1fijjldwxyc0ffx.png" alt=" " width="800" height="672"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Why Fixing Order Matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can't just "fix what's broken." Here's why sequence matters:&lt;/p&gt;

&lt;p&gt;❌ &lt;strong&gt;Fix production Terraform first&lt;/strong&gt; → Staging still broken, can't test changes&lt;br&gt;&lt;br&gt;
❌ &lt;strong&gt;Wire up CI gates first&lt;/strong&gt; → Builds still fail, nothing to gate&lt;br&gt;&lt;br&gt;
❌ &lt;strong&gt;Add domain configs first&lt;/strong&gt; → Deployments fail before they even reach the domain mapping step&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Fix build errors&lt;/strong&gt; → then CI validation → then deployment automation → then configuration gaps&lt;/p&gt;

&lt;p&gt;Think of it like renovating a house: you can't install the roof if the foundation is cracked. You can't paint the walls if the plumbing leaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Remediation Strategy:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Day 1-2:&lt;/strong&gt; Fix blocking issues (foundation)&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Day 3-4:&lt;/strong&gt; Wire up automation (plumbing)&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Day 5:&lt;/strong&gt; Clean up medium issues (finishing touches)&lt;/p&gt;

&lt;p&gt;This bottom-up approach ensures each layer is stable before building on top of it.&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;How to Actually Fix This&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Issue #1: Docker Target Mismatch&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick diagnosis:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"AS "&lt;/span&gt; apps/api/Dockerfile &lt;span class="c"&gt;# See what stage names actually exist&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"target:"&lt;/span&gt; .github/workflows/&lt;span class="k"&gt;*&lt;/span&gt;.yml &lt;span class="c"&gt;# See what CI requests&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Option A: Fix the composite action (recommended)&lt;/span&gt;
&lt;span class="c1"&gt;# .github/actions/build-push/action.yml&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build and push&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker/build-push-action@v5&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;runner&lt;/span&gt; &lt;span class="c1"&gt;# ✅ Match Dockerfile stage name&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Option B: Fix the Dockerfile&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;node:20-alpine&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;app # ✅ Match action target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it works:&lt;/strong&gt; Docker multi-stage builds use &lt;code&gt;FROM ... AS &amp;lt;name&amp;gt;&lt;/code&gt; to label stages. The &lt;code&gt;--target&lt;/code&gt; flag tells Docker which stage to stop at. Mismatched names = build failure.&lt;/p&gt;
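&lt;p&gt;That diagnosis step can be scripted: extract every declared stage name and compare against what CI requests. A sketch (the Dockerfile path is illustrative; matching is case-insensitive because lowercase &lt;code&gt;as&lt;/code&gt; is also valid):&lt;/p&gt;

```shell
# Sketch: list the stage names a Dockerfile actually declares.
list_stages() {
  grep -iE '^FROM .+ AS ' "$1" | awk '{print $NF}'
}
```

&lt;p&gt;If the workflow's &lt;code&gt;target:&lt;/code&gt; value isn't in this list, the build will fail before it starts.&lt;/p&gt;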




&lt;p&gt;&lt;strong&gt;Issue #2: Staging Terraform Undefined Module&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick diagnosis:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;envs/staging
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"module&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; outputs.tf &lt;span class="c"&gt;# Find all module references&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s1"&gt;'module "'&lt;/span&gt; main.tf &lt;span class="c"&gt;# Find all module declarations&lt;/span&gt;
&lt;span class="c"&gt;# Names must match exactly&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# outputs.tf (BEFORE)&lt;/span&gt;
&lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="s2"&gt;"api_url"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cloud_run_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;service_url&lt;/span&gt; &lt;span class="c1"&gt;# ❌&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# outputs.tf (AFTER)&lt;/span&gt;
&lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="s2"&gt;"api_url"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;api_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;service_url&lt;/span&gt; &lt;span class="c1"&gt;# ✅ Match actual module name&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Validation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;terraform init
terraform validate &lt;span class="c"&gt;# Must pass&lt;/span&gt;
terraform plan &lt;span class="c"&gt;# Should show changes, not errors&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it works:&lt;/strong&gt; Terraform's output system requires module references to exist in the configuration. This is caught during the validation phase, which checks internal consistency without cloud access.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Issue #3: Production Variable Mismatch&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick diagnosis:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check what the module expects&lt;/span&gt;
&lt;span class="nb"&gt;cat &lt;/span&gt;modules/cloud-run/variables.tf

&lt;span class="c"&gt;# Check what production sends&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A&lt;/span&gt; 10 &lt;span class="s1"&gt;'module "api"'&lt;/span&gt; envs/prod/main.tf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# envs/prod/main.tf (BEFORE)&lt;/span&gt;
&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"api"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"../../modules/cloud-run"&lt;/span&gt;
  &lt;span class="nx"&gt;service_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"api-prod"&lt;/span&gt; &lt;span class="c1"&gt;# ❌ Module doesn't have this variable&lt;/span&gt;
  &lt;span class="nx"&gt;container_port&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8080&lt;/span&gt; &lt;span class="c1"&gt;# ❌&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# envs/prod/main.tf (AFTER)&lt;/span&gt;
&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"api"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"../../modules/cloud-run"&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"api-prod"&lt;/span&gt; &lt;span class="c1"&gt;# ✅ Match module's variable.tf&lt;/span&gt;
  &lt;span class="nx"&gt;port&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8080&lt;/span&gt; &lt;span class="c1"&gt;# ✅&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it works:&lt;/strong&gt; Terraform modules define a contract through &lt;code&gt;variables.tf&lt;/code&gt;. The calling code must provide values that match these declared variables. Interface mismatches halt plan generation.&lt;/p&gt;
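&lt;p&gt;The module's side of that contract can be listed mechanically as well. A sketch (the file path is illustrative):&lt;/p&gt;

```shell
# Sketch: list the variable names a Terraform module declares in its
# variables.tf, to compare against what callers actually pass.
list_module_vars() {
  grep -oE '^variable "[a-zA-Z0-9_]+"' "$1" | tr -d '"' | awk '{print $2}'
}
```

&lt;p&gt;Arguments in the calling &lt;code&gt;module&lt;/code&gt; block that aren't in this list (aside from meta-arguments like &lt;code&gt;source&lt;/code&gt;) will halt &lt;code&gt;terraform plan&lt;/code&gt;.&lt;/p&gt;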




&lt;p&gt;&lt;strong&gt;Issue #4: Wrong Directory in CI&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick diagnosis:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check if workflow sets working directory&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A&lt;/span&gt; 5 &lt;span class="s2"&gt;"defaults:"&lt;/span&gt; .github/workflows/terraform-ci.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/terraform-ci.yml (BEFORE)&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;validate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;terraform init&lt;/span&gt; &lt;span class="c1"&gt;# ❌ Runs in repo root&lt;/span&gt;

&lt;span class="c1"&gt;# .github/workflows/terraform-ci.yml (AFTER)&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;validate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;defaults&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;working-directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./envs/staging&lt;/span&gt; &lt;span class="c1"&gt;# ✅ Set context&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;terraform init&lt;/span&gt; &lt;span class="c1"&gt;# Now runs in correct directory&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it works:&lt;/strong&gt; Terraform is context-dependent. Without explicit directory specification, commands run in &lt;code&gt;$GITHUB_WORKSPACE&lt;/code&gt; (repo root), where no &lt;code&gt;.tf&lt;/code&gt; files exist for the specific environment.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Issue #5-6: Missing Deployment Automation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create:&lt;/strong&gt; &lt;code&gt;.github/workflows/deploy-staging.yml&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy to Staging&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;development&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;apps/**'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;packages/**'&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Setup Node&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-node@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;node-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;20'&lt;/span&gt;
          &lt;span class="na"&gt;cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pnpm'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pnpm install&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pnpm test&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pnpm lint&lt;/span&gt;

  &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test&lt;/span&gt; &lt;span class="c1"&gt;# ✅ Only runs if tests pass&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Auth to GCP&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/auth@v2&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;workload_identity_provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.WIF_PROVIDER }}&lt;/span&gt;
          &lt;span class="na"&gt;service_account&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.GCP_SA_EMAIL }}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build API&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./.github/actions/build-push&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;dockerfile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/api/Dockerfile&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-central1-docker.pkg.dev/${{ secrets.GCP_PROJECT }}/images/api&lt;/span&gt;
          &lt;span class="na"&gt;tag&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;staging-${{ github.sha }}&lt;/span&gt;
          &lt;span class="na"&gt;build-target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;runner&lt;/span&gt; &lt;span class="c1"&gt;# ✅ Fix for issue #1&lt;/span&gt;

  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;build&lt;/span&gt; &lt;span class="c1"&gt;# ✅ Only runs if build succeeds&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;defaults&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;working-directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./envs/staging&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Auth to GCP&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/auth@v2&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;workload_identity_provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.WIF_PROVIDER }}&lt;/span&gt;
          &lt;span class="na"&gt;service_account&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.GCP_SA_EMAIL }}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Setup Terraform&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hashicorp/setup-terraform@v3&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;terraform init&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;terraform plan -var="image_tag=staging-${{ github.sha }}" -out=tfplan&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;terraform apply -auto-approve tfplan&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it works:&lt;/strong&gt; The &lt;code&gt;needs:&lt;/code&gt; keyword creates job dependencies: GitHub Actions won't run &lt;code&gt;build&lt;/code&gt; until &lt;code&gt;test&lt;/code&gt; succeeds, and won't run &lt;code&gt;deploy&lt;/code&gt; until &lt;code&gt;build&lt;/code&gt; succeeds. This is the "gating" that was missing.&lt;/p&gt;




&lt;p&gt;Issue #7: CI Doesn't Gate Deployments&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Already solved&lt;/strong&gt; in Issues #5-6. The key is the &lt;code&gt;needs:&lt;/code&gt; chain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;test → build → deploy&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each job must complete successfully before the next begins.&lt;/p&gt;
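&lt;p&gt;As a sketch, the same chain in workflow YAML looks like this (job names match the earlier workflow; bodies elided):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;jobs:
  test:
    runs-on: ubuntu-latest
    steps: [...]
  build:
    needs: test    # waits for test
    runs-on: ubuntu-latest
    steps: [...]
  deploy:
    needs: build   # waits for build
    runs-on: ubuntu-latest
    steps: [...]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If any job in the chain fails, every job downstream of it is skipped.&lt;/p&gt;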




&lt;p&gt;Issue #8: URL Configuration Gaps&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create centralized config:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# envs/staging/terraform.tfvars&lt;/span&gt;
&lt;span class="nx"&gt;project_id&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"myproject-staging"&lt;/span&gt;
&lt;span class="nx"&gt;region&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-central1"&lt;/span&gt;

&lt;span class="nx"&gt;domains&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;api&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"api-staging.myapp.com"&lt;/span&gt;
  &lt;span class="nx"&gt;web&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"staging.myapp.com"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Use in module:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# modules/cloud-run/main.tf&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"google_cloud_run_service"&lt;/span&gt; &lt;span class="s2"&gt;"service"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;location&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;region&lt;/span&gt;

  &lt;span class="nx"&gt;template&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;spec&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;containers&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;image&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;image&lt;/span&gt;

        &lt;span class="nx"&gt;env&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"API_URL"&lt;/span&gt;
          &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"https://${var.api_domain}"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nx"&gt;env&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"WEB_URL"&lt;/span&gt;
          &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"https://${var.web_domain}"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"google_cloud_run_domain_mapping"&lt;/span&gt; &lt;span class="s2"&gt;"domain"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;location&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;region&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;custom_domain&lt;/span&gt;

  &lt;span class="nx"&gt;spec&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;route_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;google_cloud_run_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Update GitHub Secrets:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gh secret &lt;span class="nb"&gt;set &lt;/span&gt;STAGING_API_URL &lt;span class="nt"&gt;--body&lt;/span&gt; &lt;span class="s2"&gt;"https://api-staging.myapp.com"&lt;/span&gt;
gh secret &lt;span class="nb"&gt;set &lt;/span&gt;STAGING_WEB_URL &lt;span class="nt"&gt;--body&lt;/span&gt; &lt;span class="s2"&gt;"https://staging.myapp.com"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it works:&lt;/strong&gt; Cloud Run requires domain validation and DNS configuration. Without these URLs in Terraform, the platform can't set up SSL certificates or route external traffic correctly.&lt;/p&gt;
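&lt;p&gt;To see which DNS records a mapping expects, you can describe it. A sketch using the example staging domain from above (the command group may vary by gcloud version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Prints the A/AAAA/CNAME records to create at your DNS provider
gcloud beta run domain-mappings describe \
  --domain=api-staging.myapp.com \
  --region=us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;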

&lt;p&gt;Issues #12-15: Security Tooling Integration Conflicts&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick diagnosis:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check if Trivy actually fails the job on findings&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A&lt;/span&gt; 10 &lt;span class="s2"&gt;"trivy"&lt;/span&gt; .github/workflows/&lt;span class="k"&gt;*&lt;/span&gt;.yml
&lt;span class="c"&gt;# Look for: exit-code: '1' and severity threshold&lt;/span&gt;

&lt;span class="c"&gt;# Check for duplicate scanning&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s2"&gt;"scan&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;trivy&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;falco&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;vulnerability"&lt;/span&gt; .github/workflows/&lt;span class="k"&gt;*&lt;/span&gt;.yml

&lt;span class="c"&gt;# Check Falco rules for Cloud Run compatibility&lt;/span&gt;
&lt;span class="nb"&gt;cat &lt;/span&gt;falco-rules/custom-rules.yaml | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"container&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;syscall"&lt;/span&gt;

&lt;span class="c"&gt;# Check if Container Analysis is enabled&lt;/span&gt;
gcloud services list &lt;span class="nt"&gt;--enabled&lt;/span&gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;containeranalysis
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The fix — Option A: GCP Native (simpler):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Consolidate on GCP's built-in security tooling and remove redundant third-party tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/deploy-staging.yml&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;security-scan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;build&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Scan image with Trivy&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/trivy-action@master&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;image-ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ env.IMAGE_TAG }}&lt;/span&gt;
          &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sarif'&lt;/span&gt;
          &lt;span class="na"&gt;exit-code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;        &lt;span class="c1"&gt;# ✅ Actually fails the job&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CRITICAL,HIGH'&lt;/span&gt;
          &lt;span class="na"&gt;output&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;trivy-results.sarif'&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Upload results to Security Command Center&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github/codeql-action/upload-sarif@v2&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;sarif_file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;trivy-results.sarif'&lt;/span&gt;

  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;security-scan&lt;/span&gt;  &lt;span class="c1"&gt;# ✅ Deploy only if scan passes&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;...&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# modules/cloud-run/main.tf&lt;/span&gt;
&lt;span class="c1"&gt;# Use GCP Binary Authorization instead of Falco for deploy-time enforcement&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"google_binary_authorization_policy"&lt;/span&gt; &lt;span class="s2"&gt;"policy"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;project&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;project_id&lt;/span&gt;

  &lt;span class="nx"&gt;default_admission_rule&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;evaluation_mode&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"REQUIRE_ATTESTATION"&lt;/span&gt;
    &lt;span class="nx"&gt;enforcement_mode&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ENFORCED_BLOCK_AND_AUDIT_LOG"&lt;/span&gt;

    &lt;span class="nx"&gt;require_attestations_by&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="nx"&gt;google_binary_authorization_attestor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;trivy_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The fix — Option B: Trivy + Falco (more control):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Keep both tools but define clear ownership boundaries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Trivy owns: pre-deploy image scanning (CI gate)&lt;/span&gt;
&lt;span class="c1"&gt;# Falco owns: runtime anomaly detection (post-deploy monitoring)&lt;/span&gt;
&lt;span class="c1"&gt;# Security Command Center owns: compliance reporting (audit trail)&lt;/span&gt;
&lt;span class="c1"&gt;# Container Analysis: disabled (redundant with Trivy)&lt;/span&gt;

&lt;span class="c1"&gt;# .github/workflows/deploy-staging.yml&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;build&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Trivy scan&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/trivy-action@master&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;image-ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ env.IMAGE_TAG }}&lt;/span&gt;
          &lt;span class="na"&gt;exit-code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;          &lt;span class="c1"&gt;# ✅ Hard gate&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CRITICAL'&lt;/span&gt;
          &lt;span class="na"&gt;ignore-unfixed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;    &lt;span class="c1"&gt;# Reduce noise&lt;/span&gt;

  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scan&lt;/span&gt;  &lt;span class="c1"&gt;# ✅ Trivy must pass&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;...&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# falco-rules/cloud-run-rules.yaml&lt;/span&gt;
&lt;span class="c1"&gt;# Tune Falco to ignore Cloud Run startup behavior&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;rule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Unexpected syscall in container&lt;/span&gt;
  &lt;span class="na"&gt;desc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Detect anomalous syscalls at runtime&lt;/span&gt;
  &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;spawned_process and container&lt;/span&gt;
    &lt;span class="s"&gt;and not proc.name in (cloud_run_allowed_processes)&lt;/span&gt;
    &lt;span class="s"&gt;and not container.image.repository contains "gcr.io/cloudrun"&lt;/span&gt;
  &lt;span class="na"&gt;output&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unexpected&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;process&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;%proc.name&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;%container.name"&lt;/span&gt;
  &lt;span class="na"&gt;priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;WARNING&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;macro&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cloud_run_allowed_processes&lt;/span&gt;
  &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;proc.name in (node, python, java, nginx, sh, bash)&lt;/span&gt;
    &lt;span class="s"&gt;and not proc.cmdline contains "curl metadata"  # Block SSRF attempts&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Fix for Security Command Center duplicate findings:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Disable Container Analysis if using Trivy (avoid duplicates)&lt;/span&gt;
gcloud services disable containeranalysis.googleapis.com

&lt;span class="c"&gt;# OR: Configure SCC to deduplicate findings&lt;/span&gt;
gcloud scc settings update &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--organization&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;YOUR_ORG_ID &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-asset-discovery&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it works:&lt;/strong&gt; Each security tool has a defined role with clear enforcement boundaries. Trivy gates at build time. Falco monitors at runtime. Security Command Center handles compliance reporting. No overlaps, no gaps, no false sense of security.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The architectural principle:&lt;/strong&gt; Security tools should be additive in coverage, not redundant in scope.&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;&lt;u&gt;Common Gotchas During Remediation&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;🚩 &lt;strong&gt;"I fixed the Dockerfile but CI still fails"&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
→ Check if the composite action caches the old target name. Clear workflow cache or update the action's default input.&lt;/p&gt;

&lt;p&gt;🚩 &lt;strong&gt;"Terraform validate passes but plan fails"&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
→ You're probably in the wrong directory. Check &lt;code&gt;pwd&lt;/code&gt; in your CI logs and verify &lt;code&gt;working-directory&lt;/code&gt; is set.&lt;/p&gt;

&lt;p&gt;🚩 &lt;strong&gt;"Images build but Cloud Run deployment fails"&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
→ Service account permissions (Layer 2). Run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud projects get-iam-policy YOUR_PROJECT &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--flatten&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"bindings[].members"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"bindings.members:serviceAccount:*@cloudbuild.gserviceaccount.com"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🚩 &lt;strong&gt;"Firebase deployment fails with region conflict"&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
→ Check org policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud resource-manager org-policies describe &lt;span class="se"&gt;\&lt;/span&gt;
  constraints/gcp.resourceLocations &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;YOUR_PROJECT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🚩 &lt;strong&gt;"Variables are undefined in running container"&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
→ Don't put them in the Dockerfile. Inject via Terraform's &lt;code&gt;env&lt;/code&gt; blocks in the Cloud Run service definition.&lt;/p&gt;

&lt;p&gt;🚩 &lt;strong&gt;"Trivy scan passes but vulnerable images still get deployed"&lt;/strong&gt;&lt;br&gt;
→ Check exit-code configuration. Trivy reports findings by default &lt;br&gt;
but doesn't fail the job unless &lt;code&gt;exit-code: '1'&lt;/code&gt; is explicitly set &lt;br&gt;
with a severity threshold.&lt;/p&gt;

&lt;p&gt;🚩 &lt;strong&gt;"Falco generates hundreds of alerts on Cloud Run startup"&lt;/strong&gt;&lt;br&gt;
→ Cloud Run has a specific startup sequence that triggers generic &lt;br&gt;
Falco rules. Add Cloud Run-specific macros to your custom rules &lt;br&gt;
to filter legitimate startup behavior.&lt;/p&gt;

&lt;p&gt;🚩 &lt;strong&gt;"Security Command Center shows the same CVE from 3 different sources"&lt;/strong&gt;&lt;br&gt;
→ You have overlapping scanners. Decide on a single source of truth &lt;br&gt;
(Trivy OR Container Analysis, not both) and disable the redundant one.&lt;/p&gt;

&lt;p&gt;🚩 &lt;strong&gt;"Binary Authorization blocks deployment after security scan passes"&lt;/strong&gt;&lt;br&gt;
→ The attestor isn't linked to your Trivy results. The attestation &lt;br&gt;
step must explicitly create a Binary Authorization attestation after &lt;br&gt;
a successful scan.&lt;/p&gt;
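&lt;p&gt;A sketch of that attestation step, assuming a Cloud KMS signing key and the &lt;code&gt;trivy_passed&lt;/code&gt; attestor from the Terraform above (all resource names are illustrative, and exact flags may differ by gcloud version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Run after the Trivy job succeeds, before terraform apply
gcloud container binauthz attestations sign-and-create \
  --artifact-url="us-central1-docker.pkg.dev/${PROJECT}/images/api@${DIGEST}" \
  --attestor=trivy-passed \
  --attestor-project="${PROJECT}" \
  --keyversion="projects/${PROJECT}/locations/us-central1/keyRings/binauthz/cryptoKeys/attestor-key/cryptoKeyVersions/1"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Without an attestation matching &lt;code&gt;require_attestations_by&lt;/code&gt;, Binary Authorization blocks the deploy even though the scan itself passed.&lt;/p&gt;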




&lt;p&gt;&lt;strong&gt;&lt;u&gt;What This Analysis Doesn't Cover&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If this were real infrastructure, you would also need to check the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Terraform state drift (manual changes in GCP)&lt;/li&gt;
&lt;li&gt;Networking/DNS configuration details&lt;/li&gt;
&lt;li&gt;Secret management implementation&lt;/li&gt;
&lt;li&gt;The full history of how the system reached this state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;But:&lt;/strong&gt; For the issues as declared, these are the documented root causes according to official Terraform, Docker, GitHub Actions, and GCP documentation.&lt;/p&gt;

&lt;p&gt;Think of this as: &lt;strong&gt;symptoms → probable diagnosis&lt;/strong&gt;. The real fix needs hands on the actual system.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;&lt;u&gt;Visual: The 3-Layer Problem&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdewdmxnw7mkgzlrb001c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdewdmxnw7mkgzlrb001c.png" alt=" " width="355" height="1860"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fix bottom-up, not top-down.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;&lt;u&gt;Conclusion&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Infrastructure failures rarely have a single cause. What looks like "broken Terraform" is usually a combination of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configuration errors (Layer 1)&lt;/li&gt;
&lt;li&gt;Platform evolution you didn't track (Layer 2)&lt;/li&gt;
&lt;li&gt;Missing architectural decisions (Layer 3)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fix isn't just correcting syntax — it's understanding how these layers interact and building a system that's resilient to change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key takeaways:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Diagnose in layers.&lt;/strong&gt; Don't stop at the obvious errors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fix in order.&lt;/strong&gt; Foundation before plumbing before paint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build in gates.&lt;/strong&gt; Make it impossible for broken code to reach production.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document decisions.&lt;/strong&gt; Future you (or the next developer) needs context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scope honestly.&lt;/strong&gt; Complex infrastructure work takes time. Price accordingly.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The goal isn't just to fix what's broken today — it's to build a system that won't break the same way tomorrow.&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>gcp</category>
      <category>githubactions</category>
      <category>docker</category>
    </item>
    <item>
      <title>Static IP Addresses for GKE Outbound Traffic: A Practical Guide to Cloud NAT</title>
      <dc:creator>lssh</dc:creator>
      <pubDate>Tue, 10 Feb 2026 03:01:31 +0000</pubDate>
      <link>https://dev.to/lbcristaldo/static-ip-addresses-for-gke-outbound-traffic-a-practical-guide-to-cloud-nat-1ie8</link>
      <guid>https://dev.to/lbcristaldo/static-ip-addresses-for-gke-outbound-traffic-a-practical-guide-to-cloud-nat-1ie8</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To get a &lt;strong&gt;fixed public IP&lt;/strong&gt; for your GKE cluster's outbound traffic:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reserve a regional static IP
&lt;/li&gt;
&lt;li&gt;Create a Cloud Router in the same region
&lt;/li&gt;
&lt;li&gt;Configure Cloud NAT with &lt;strong&gt;Manual IP assignment&lt;/strong&gt; using that reserved IP
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Done! All outbound traffic from your pods will always exit through the same IP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;The problem:&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your application running on Google Kubernetes Engine (GKE) needs to connect to an external database that requires IP whitelisting. But pods in Kubernetes have ephemeral IPs that change constantly. The solution? &lt;strong&gt;Cloud NAT with manual static IP assignment.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Why is this necessary?&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In modern microservice architectures, it's common for Kubernetes applications to need access to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Managed databases in other GCP projects&lt;/li&gt;
&lt;li&gt;Third-party APIs with strict firewall policies&lt;/li&gt;
&lt;li&gt;Legacy services that only allow access from known IPs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The challenge: GKE nodes (especially in private clusters) don't have fixed public IPs, making it impossible to maintain a stable whitelist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;The solution:&lt;/u&gt;&lt;/strong&gt; Cloud NAT with manual assignment&lt;/p&gt;

&lt;p&gt;Cloud NAT (Network Address Translation) acts as a gateway that translates your cluster's internal private addresses to a fixed, predictable public IP address.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Step-by-step implementation:&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Reserve a static IP address&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, reserve a regional IP that will serve as the public "face" of your cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud compute addresses create nat-static-ip &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important note:&lt;/strong&gt; The IP must be in the same region as your GKE cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Create a Cloud Router&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cloud NAT requires a Cloud Router, which acts as the control plane for NAT configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud compute routers create nat-router &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--network&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-vpc &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: Configure Cloud NAT with manual assignment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the critical step. You must choose &lt;strong&gt;manual assignment&lt;/strong&gt; (not automatic) to ensure the IP remains fixed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud compute routers nats create nat-config &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--router&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nat-router &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;us-central1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--nat-external-ip-pool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nat-static-ip &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--nat-all-subnet-ip-ranges&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--nat-external-ip-pool&lt;/code&gt; flag specifies the static IP we reserved in step 1.&lt;/p&gt;
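&lt;p&gt;To double-check the allocation mode after creation (assuming the resource names used above), you can describe the NAT configuration; &lt;code&gt;natIpAllocateOption&lt;/code&gt; should read &lt;code&gt;MANUAL_ONLY&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Inspect the NAT config; look for natIpAllocateOption: MANUAL_ONLY
gcloud compute routers nats describe nat-config \
  --router=nat-router \
  --region=us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;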

&lt;p&gt;&lt;strong&gt;Step 4: Add the IP to your destination's whitelist&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once Cloud NAT is configured, all outbound traffic from your cluster will use the static IP. You can now confidently add it to your database or external service's firewall.&lt;/p&gt;
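&lt;p&gt;If you need the actual address value to hand over for the whitelist (assuming the resource name from step 1), you can query the reservation directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Print only the reserved IP address
gcloud compute addresses describe nat-static-ip \
  --region=us-central1 \
  --format='value(address)'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;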

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Key benefits&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Persistence:&lt;/strong&gt; The IP won't change even if the cluster restarts or nodes are recreated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security:&lt;/strong&gt; Your GKE nodes can remain in private subnets without public IPs, reducing your attack surface.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; Cloud NAT is a managed service that scales automatically without impacting performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No application changes:&lt;/strong&gt; If you use GitOps with ArgoCD, you don't need to modify your deployments. Configuration lives entirely at the infrastructure level.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Important considerations&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Capacity management:&lt;/strong&gt; In manual assignment mode, you're responsible for calculating how many IPs/ports you need. If your cluster grows significantly, you might hit &lt;code&gt;OUT_OF_RESOURCES&lt;/code&gt; errors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring:&lt;/strong&gt; Set up alerts for NAT port utilization to detect issues before they impact production.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alternatives:&lt;/strong&gt; For very specific use cases (such as custom NAT logic or complex firewall requirements), consider whether a self-managed NAT instance might be more appropriate, though this increases operational overhead.&lt;/li&gt;
&lt;/ul&gt;
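&lt;p&gt;For a quick operational check (resource names as above), the router status shows the NAT's live state, including which external IPs are currently in use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Show live NAT status for the router, including the NAT IPs in use
gcloud compute routers get-status nat-router \
  --region=us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;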

&lt;p&gt;&lt;strong&gt;&lt;u&gt;When to use this solution?&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✅ You need to communicate with services requiring IP whitelisting&lt;br&gt;
✅ You run private GKE clusters&lt;br&gt;
✅ You want a scalable, managed solution&lt;br&gt;
✅ You need compliance and centralized auditing of outbound traffic&lt;/p&gt;

&lt;p&gt;❌ You have extremely custom NAT logic&lt;br&gt;
❌ You need granular control the managed service doesn't offer&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;How to verify it's working&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once configured, you can easily test it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a temporary pod and check your public IP&lt;/span&gt;
kubectl run curl-test &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;radial/busyboxplus:curl &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  curl &lt;span class="nt"&gt;-s&lt;/span&gt; ifconfig.me

&lt;span class="c"&gt;# Or run continuous checks to confirm the IP stays consistent&lt;/span&gt;
kubectl run &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;--tty&lt;/span&gt; curl-test &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;radial/busyboxplus:curl &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"while true; do curl -s ifconfig.me; echo; sleep 2; done"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see your reserved static IP returned consistently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Common issues and how to fix them&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;IP keeps changing&lt;/strong&gt; → Double-check that you selected &lt;strong&gt;"Manual"&lt;/strong&gt; (not "Automatic") in your Cloud NAT configuration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reserved IP in wrong region&lt;/strong&gt; → The static IP and Cloud NAT must be in the &lt;strong&gt;same region&lt;/strong&gt; as your GKE cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pods still using dynamic IPs&lt;/strong&gt; → Ensure the NAT is applied to the subnetwork where your GKE cluster runs (NAT configuration → "Selected subnetworks").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Using GKE Autopilot&lt;/strong&gt; → It works exactly the same. No special configuration needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No traffic showing in NAT&lt;/strong&gt; → Wait 2-3 minutes after applying changes (Cloud NAT takes a moment to propagate).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Conclusion&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cloud NAT with manual IP assignment is GCP's standard solution for this common use case. It's reliable, scalable, and relatively simple to configure. Most importantly: it allows you to keep your resources secure in private networks while maintaining controlled connectivity to the outside world.&lt;/p&gt;

&lt;p&gt;Have you implemented Cloud NAT in your infrastructure? What challenges did you encounter? &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhcn9zuox7foh3j3iw9sq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhcn9zuox7foh3j3iw9sq.png" alt=" " width="679" height="840"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>gcp</category>
      <category>cloudnative</category>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
