ElastiCache is a genuinely good service. Managed failover, automated backups, CloudWatch integration out of the box. For teams that need Redis and don't want to operate it, it makes sense.
The price, however, does not scale down. A cache.t4g.small (2 vCPU, 1.37GB RAM) runs about $25/month in eu-west-1. A cache.r7g.large (2 vCPU, 13.07GB RAM) is $175/month. Multi-AZ doubles those numbers. For a startup or side project with modest revenue, that's a significant line item for what is often just a queue and a session cache.
Valkey is a Redis-compatible open-source project under the Linux Foundation, backed by AWS, Google, and others. It forked from Redis 7.2.4 (BSD-3 license) and maintains full protocol compatibility. Every client library that works with Redis works with Valkey: ioredis, node-redis, BullMQ, Sidekiq, Celery. No code changes.
Here's how to run it on ECS Fargate and what it actually costs.
The Architecture
We're running Valkey as an ECS service on Fargate, backed by an EFS volume for persistence, inside a VPC with a security group that restricts access to the application services only.
VPC
├── Public Subnet
│   └── Application Load Balancer
├── Private Subnet A
│   ├── ECS Service: App (Fargate)
│   └── ECS Service: Valkey (Fargate)
└── Private Subnet B
    └── ECS Service: App (Fargate) - replica
Valkey doesn't need to be publicly accessible. It lives in the private subnet and is reachable only from the application services in the same VPC.
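The snippets below reference a security group named aws_security_group.valkey_task without defining it. A minimal sketch of what it might look like - the app task security group name (aws_security_group.app_task) is an assumption, substitute your own:

```hcl
# Sketch: security group for the Valkey task.
# Allows port 6379 only from the application tasks' security group;
# "app_task" is an assumed name for your app's security group.
resource "aws_security_group" "valkey_task" {
  name   = "${var.app_name}-valkey-task"
  vpc_id = var.vpc_id

  ingress {
    from_port       = 6379
    to_port         = 6379
    protocol        = "tcp"
    security_groups = [aws_security_group.app_task.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

Scoping ingress to a source security group rather than a CIDR range means new app tasks are allowed automatically, regardless of which private subnet IP they get.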
Step 1: EFS Volume for Persistence
Fargate tasks are ephemeral. Without a persistent volume, every Valkey restart loses your data. EFS solves this without managing EC2.
# terraform/efs.tf
resource "aws_efs_file_system" "valkey" {
  creation_token = "${var.app_name}-valkey"
  encrypted      = true

  lifecycle_policy {
    transition_to_ia = "AFTER_7_DAYS"
  }

  tags = {
    Name = "${var.app_name}-valkey"
  }
}

resource "aws_efs_mount_target" "valkey" {
  for_each = toset(var.private_subnet_ids)

  file_system_id  = aws_efs_file_system.valkey.id
  subnet_id       = each.value
  security_groups = [aws_security_group.efs.id]
}

resource "aws_security_group" "efs" {
  name   = "${var.app_name}-efs"
  vpc_id = var.vpc_id

  ingress {
    from_port       = 2049
    to_port         = 2049
    protocol        = "tcp"
    security_groups = [aws_security_group.valkey_task.id]
  }
}
Important: use an EFS Access Point, not rootDirectory directly.
Specifying a rootDirectory that doesn't physically exist on a fresh EFS filesystem causes the Fargate task to fail immediately with ResourceInitializationError: failed to invoke EFS utils... directory does not exist. Fargate won't create the directory automatically.
EFS Access Points handle this correctly - they create the directory with the right UNIX permissions if it doesn't exist yet:
resource "aws_efs_access_point" "valkey" {
  file_system_id = aws_efs_file_system.valkey.id

  posix_user {
    uid = 1000
    gid = 1000
  }

  root_directory {
    path = "/valkey"

    creation_info {
      owner_uid   = 1000
      owner_gid   = 1000
      permissions = "0755"
    }
  }
}
Step 2: ECS Task Definition
{
  "family": "valkey",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::ACCOUNT_ID:role/ecsTaskExecutionRole",
  "volumes": [
    {
      "name": "valkey-data",
      "efsVolumeConfiguration": {
        "fileSystemId": "fs-XXXXXXXXX",
        "transitEncryption": "ENABLED",
        "authorizationConfig": {
          "accessPointId": "fsap-XXXXXXXXX",
          "iam": "ENABLED"
        }
      }
    }
  ],
  "containerDefinitions": [
    {
      "name": "valkey",
      "image": "valkey/valkey:8.0-alpine",
      "portMappings": [
        {
          "containerPort": 6379,
          "protocol": "tcp"
        }
      ],
      "command": [
        "valkey-server",
        "--save", "60", "1000",
        "--appendonly", "yes",
        "--appendfsync", "everysec",
        "--maxmemory", "800mb",
        "--maxmemory-policy", "allkeys-lru",
        "--requirepass", "VALKEY_PASSWORD_FROM_SECRETS_MANAGER"
      ],
      "mountPoints": [
        {
          "sourceVolume": "valkey-data",
          "containerPath": "/data",
          "readOnly": false
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/valkey",
          "awslogs-region": "eu-west-1",
          "awslogs-stream-prefix": "valkey"
        }
      },
      "healthCheck": {
        "command": ["CMD-SHELL", "valkey-cli -a \"$VALKEY_PASSWORD\" ping | grep PONG"],
        "interval": 10,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 15
      }
    }
  ]
}
A few things worth noting in this config:
--save 60 1000 triggers an RDB snapshot every 60 seconds if at least 1000 keys changed. Combined with --appendonly yes, you get both AOF and RDB persistence - the AOF gives you per-second durability, the RDB gives you faster restart times.
--maxmemory-policy allkeys-lru means Valkey will evict the least recently used keys when it hits the memory limit. For a cache workload this is usually what you want. For a queue workload (BullMQ, Sidekiq) you should use noeviction instead and alert on memory pressure.
The password should come from Secrets Manager via the secrets field in the task definition rather than a hardcoded string - the secrets field injects the value as an environment variable in the container. Note that once requirepass is set, the health check has to authenticate as well (valkey-cli -a "$VALKEY_PASSWORD" ping); an unauthenticated PING returns NOAUTH, the grep for PONG fails, and ECS marks the task unhealthy. The example above is simplified for readability.
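A sketch of what the Secrets Manager wiring could look like in the container definition - the secret ARN is a placeholder, and the command is wrapped in a shell because ECS does not expand environment variables in a plain exec-form command array:

```json
"secrets": [
  {
    "name": "VALKEY_PASSWORD",
    "valueFrom": "arn:aws:secretsmanager:eu-west-1:ACCOUNT_ID:secret:valkey-password"
  }
],
"command": [
  "sh", "-c",
  "exec valkey-server --save 60 1000 --appendonly yes --appendfsync everysec --maxmemory 800mb --maxmemory-policy allkeys-lru --requirepass \"$VALKEY_PASSWORD\""
]
```

The exec in front of valkey-server replaces the shell process so Valkey runs as PID 1 and receives stop signals directly.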
Step 3: ECS Service and Service Discovery
For the application to connect to Valkey, it needs a stable hostname. ECS Service Discovery provides this via AWS Cloud Map.
# terraform/service-discovery.tf
resource "aws_service_discovery_private_dns_namespace" "internal" {
  name = "internal.${var.app_name}"
  vpc  = var.vpc_id
}

resource "aws_service_discovery_service" "valkey" {
  name = "valkey"

  dns_config {
    namespace_id = aws_service_discovery_private_dns_namespace.internal.id

    dns_records {
      ttl  = 10
      type = "A"
    }

    routing_policy = "MULTIVALUE"
  }

  health_check_custom_config {
    failure_threshold = 1
  }
}

resource "aws_ecs_service" "valkey" {
  name            = "valkey"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.valkey.arn
  desired_count   = 1
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [aws_security_group.valkey_task.id]
    assign_public_ip = false
  }

  service_registries {
    registry_arn = aws_service_discovery_service.valkey.arn
  }

  # Prevent ECS from cycling the task during deployments
  # when the app has active connections
  deployment_minimum_healthy_percent = 100
  deployment_maximum_percent         = 200
}
Your application connects to valkey.internal.your-app:6379. When the task is replaced (restart, deployment), the DNS record updates automatically within the TTL.
Step 4: Connecting from Node.js
No code changes from a Redis setup. ioredis works as-is:
import Redis from "ioredis";

const valkey = new Redis({
  host: process.env.VALKEY_HOST, // valkey.internal.your-app
  port: 6379,
  password: process.env.VALKEY_PASSWORD,
  // Retry on connection loss - important for task replacement events
  retryStrategy: (times) => Math.min(times * 100, 3000),
  maxRetriesPerRequest: 3,
  enableOfflineQueue: true,
  lazyConnect: true,
});

valkey.on("error", (err) => {
  console.error("Valkey connection error:", err);
});

valkey.on("reconnecting", () => {
  console.log("Valkey reconnecting...");
});
BullMQ also works unchanged, with one caveat: BullMQ requires maxRetriesPerRequest: null on its connection (it manages blocking commands and retries itself, and will throw otherwise), so give it a dedicated connection rather than reusing the client above:
import { Queue, Worker } from "bullmq";

// Dedicated connection options for BullMQ -
// maxRetriesPerRequest must be null for its blocking clients.
const connection = {
  host: process.env.VALKEY_HOST,
  port: 6379,
  password: process.env.VALKEY_PASSWORD,
  maxRetriesPerRequest: null,
};

const emailQueue = new Queue("emails", { connection });

const worker = new Worker(
  "emails",
  async (job) => {
    await sendEmail(job.data);
  },
  { connection }
);
The Cost Comparison
Running Valkey on Fargate with 0.5 vCPU and 1GB RAM in eu-west-1.
Note: The Fargate numbers below use on-demand pricing ($0.04048/vCPU-hour, $0.004445/GB-hour). If you're using a Compute Savings Plan (common for Fargate workloads), expect 20-40% lower compute costs.
| Resource | Monthly cost (on-demand) |
|---|---|
| Fargate compute (0.5 vCPU) | $14.77 |
| Fargate memory (1GB) | $3.24 |
| EFS storage (5GB used) | $1.50 |
| EFS throughput | $0.30 |
| CloudWatch logs | $0.50 |
| Total | ~$20.30 |
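As a sanity check on the table, the two Fargate line items fall straight out of the hourly rates, assuming a 730-hour month (the table truncates compute to $14.77):

```javascript
// eu-west-1 on-demand Fargate rates quoted above
const HOURS = 730; // average hours in a month
const VCPU_RATE = 0.04048; // $/vCPU-hour
const MEM_RATE = 0.004445; // $/GB-hour

const compute = 0.5 * VCPU_RATE * HOURS; // 0.5 vCPU
const memory = 1 * MEM_RATE * HOURS; // 1 GB

console.log(compute.toFixed(2)); // "14.78"
console.log(memory.toFixed(2)); // "3.24"
```

The same arithmetic with 2 vCPU and 13GB gives the ~$108 figure in the r7g.large comparison below, once EFS and logs are added.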
vs. ElastiCache cache.t4g.small (1.37GB RAM):
| Resource | Monthly cost |
|---|---|
| ElastiCache t4g.small | $25.20 |
| Total | $25.20 |
Single-node, the savings are modest (~20%). The real comparison is Multi-AZ ElastiCache, which is the production-grade option: a Multi-AZ cache.t4g.small is $50.40/month. Against that, Fargate at ~$20 is a 60% reduction - and with a Savings Plan applied, closer to 70%.
For a cache.r7g.large workload (13GB RAM), the numbers shift further:
| Option | Monthly cost |
|---|---|
| ElastiCache r7g.large | $175.00 |
| ElastiCache r7g.large Multi-AZ | $350.00 |
| Fargate (2 vCPU / 13GB, on-demand) | ~$108.00 |
| Fargate (2 vCPU / 13GB, Savings Plan) | ~$70.00 |
The savings are real. So is the operational difference.
What You Give Up
ElastiCache manages automatic failover, Multi-AZ replication, and rolling upgrades. With Fargate, you're responsible for all of that.
No automatic failover. If the Valkey Fargate task dies, ECS will restart it automatically (typically 30-90 seconds). During that window, connections fail. For a cache this is usually acceptable. For a job queue, your workers will see connection errors and retry - still acceptable if you've configured retries correctly. For session storage, users get logged out. Decide based on your workload.
Manual upgrades. You control the Docker image tag. Update the task definition, trigger a new deployment. No automatic patch management.
No Multi-AZ replication out of the box. If you need a hot standby, you'll need to set up Valkey's built-in replication between two Fargate tasks and handle failover at the application level or with an intermediate proxy. This adds complexity that may not be worth it below a certain scale.
Persistence responsibility. EFS gives you durable storage, but you're responsible for backup strategy. Set up AWS Backup for the EFS volume or use Valkey's BGSAVE + S3 export for point-in-time backups.
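The AWS Backup route can be sketched in Terraform like this - the schedule, retention, and the backup_role_arn variable are assumptions to adjust for your setup:

```hcl
# Sketch: daily AWS Backup of the EFS volume holding Valkey's data.
resource "aws_backup_vault" "valkey" {
  name = "${var.app_name}-valkey"
}

resource "aws_backup_plan" "valkey" {
  name = "${var.app_name}-valkey-daily"

  rule {
    rule_name         = "daily"
    target_vault_name = aws_backup_vault.valkey.name
    schedule          = "cron(0 3 * * ? *)" # 03:00 UTC every day

    lifecycle {
      delete_after = 14 # keep two weeks of recovery points
    }
  }
}

resource "aws_backup_selection" "valkey" {
  name         = "${var.app_name}-valkey-efs"
  iam_role_arn = var.backup_role_arn # role with the AWSBackup service policy
  plan_id      = aws_backup_plan.valkey.id
  resources    = [aws_efs_file_system.valkey.arn]
}
```

Since the AOF and RDB files on EFS are the whole state, restoring the volume and restarting the task is the entire recovery procedure.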
When This Makes Sense
This setup is the right call when:
- Your workload is a queue, cache, or session store without strict HA requirements
- You're running a startup, side project, or internal tool where a 60-second outage is acceptable
- You're already paying for Fargate and EFS for other services
- The ElastiCache bill is a meaningful percentage of your monthly AWS spend
It's the wrong call when:
- You need sub-second automatic failover
- You're storing session data where eviction causes user-visible disruption
- Your compliance requirements mandate managed services with AWS support coverage
- You're operating at a scale where the operational overhead of self-managed infrastructure costs more than the ElastiCache bill
The break-even point for most teams is somewhere around $100-150/month in Redis costs. Below that, ElastiCache's convenience usually wins. Above it, self-hosted starts to look attractive even accounting for the operational investment.
Running Valkey on AWS in a different configuration? Different numbers for your region or workload? Happy to hear it in the comments.