Joseph Sutton

Posted on May 27, 2022

Scaled Virus Scanner using AWS Fargate, ClamAV, S3, and SQS with Terraform

#terraform #aws #security #javascript

Welcome back for more shenanigans!

Some time ago, a team I was on ran into the Lambda deployment (and runtime) size limits with a single Lambda function plus a ClamAV layer containing pre-built binaries and virus definitions. If you only need to scan smaller files, I'm sure a Lambda layer setup with ClamAV, its binaries, and its definitions would work great for you; it didn't in our case. We needed to scale our solution to handle files of up to 512MB.

TL;DR: GitHub repo.

Since we were already using SQS and EC2 for other things, why not use them along with S3 and Fargate? We ran a spike to evaluate either an EC2 or a Fargate consumer client, and Fargate had better maintainability in the long run.

Note that I won't be implementing a cluster policy (yet) in this article; I might save that for another time. However, things should be set up to translate relatively well when that time comes.

NOTE: This assumes you have already set up your AWS credentials via aws configure. If you plan on using a different profile, make sure you reflect that in the main.tf file below, where profile = "default" is set.

Anyhow, let's get to it. First, let's set up our main configuration in terraform/main.tf:

terraform {
  required_version = ">= 1.0.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.29"
    }
  }

  backend "s3" {
    encrypt        = true
    bucket         = "tf-clamav-state"
    dynamodb_table = "tf-dynamodb-lock"
    region         = "us-east-1"
    key            = "terraform.tfstate"
  }
}

# Change the profile here if your credentials live under a different profile
provider "aws" {
  profile = "default"
  region  = "us-east-1"
}

This uses a remote state (didn't I write about that once?), so we need a script to set up the S3 bucket and DynamoDB table for our state and lock status, respectively. We can do that with a bash script in terraform/tf-setup.sh:

#!/bin/bash

# Create S3 Bucket
MY_ARN=$(aws iam get-user --query User.Arn --output text 2>/dev/null)
aws s3 mb "s3://tf-clamav-state" --region "us-east-1"
sed -e "s/RESOURCE/arn:aws:s3:::tf-clamav-state/g" -e "s/KEY/terraform.tfstate/g" -e "s|ARN|${MY_ARN}|g" "$(dirname "$0")/templates/s3_policy.json" > new-policy.json
aws s3api put-bucket-policy --bucket "tf-clamav-state" --policy file://new-policy.json
aws s3api put-bucket-versioning --bucket "tf-clamav-state" --versioning-configuration Status=Enabled
rm new-policy.json

# Create DynamoDB Table
aws dynamodb create-table \
  --table-name "tf-dynamodb-lock" \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --provisioned-throughput ReadCapacityUnits=1,WriteCapacityUnits=1 \
  --region "us-east-1"

This requires an S3 bucket policy template in terraform/templates/s3_policy.json:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "RESOURCE",
      "Principal": {
        "AWS": "ARN"
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "RESOURCE/KEY",
      "Principal": {
        "AWS": "ARN"
      }
    }
  ]
}

Now we can run the tf-setup.sh script (don't forget to chmod +x it) via cd terraform && ./tf-setup.sh.

Now that we have our remote state established, let's scaffold out our infrastructure in Terraform. First up, we need our buckets (one for quarantined files and one for clean files), our SQS queue, and an event notification configured on the quarantine bucket for when an object is created. We can set this up via terraform/logistics.tf:

provider "aws" {
  region = "us-east-1"
  alias  = "east"
}

data "aws_caller_identity" "current" {}

resource "aws_s3_bucket" "quarantine_bucket" {
  provider = aws.east
  bucket   = "clamav-quarantine-bucket"
  acl      = "private"

  cors_rule {
    allowed_headers = ["Authorization"]
    allowed_methods = ["GET", "POST"]
    allowed_origins = ["*"]
    max_age_seconds = 3000
  }

  lifecycle_rule {
    enabled = true

    # Anything in the bucket remaining is a virus, so
    # we'll just delete it after a week.
    expiration {
      days = 7
    }
  }
}


resource "aws_s3_bucket" "clean_bucket" {
  provider = aws.east
  bucket   = "clamav-clean-bucket"
  acl      = "private"

  cors_rule {
    allowed_headers = ["Authorization"]
    allowed_methods = ["GET", "POST"]
    allowed_origins = ["*"]
    max_age_seconds = 3000
  }
}


data "template_file" "event_queue_policy" {
  template = file("templates/event_queue_policy.tpl.json")

  vars = {
    bucketArn = aws_s3_bucket.quarantine_bucket.arn
  }
}

resource "aws_sqs_queue" "clamav_event_queue" {
  name = "s3_clamav_event_queue"

  policy = data.template_file.event_queue_policy.rendered
}

resource "aws_s3_bucket_notification" "bucket_notification" {
  bucket = aws_s3_bucket.quarantine_bucket.id

  queue {
    queue_arn = aws_sqs_queue.clamav_event_queue.arn
    events    = ["s3:ObjectCreated:*"]
  }

  depends_on = [
    aws_sqs_queue.clamav_event_queue
  ]
}

resource "aws_cloudwatch_log_group" "clamav_fargate_log_group" {
  name = "/aws/ecs/clamav_fargate"
}

If you read the clamav_event_queue block above, you'll have noticed it references an event queue policy. Let's not forget that in terraform/templates/event_queue_policy.tpl.json:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": [
        "sqs:SendMessage",
        "sqs:ReceiveMessage"
      ],
      "Resource": "arn:aws:sqs:*:*:s3_clamav_event_queue",
      "Condition": {
        "ArnEquals": {
          "aws:SourceArn": "${bucketArn}"
        }
      }
    }
  ]
}

Since this is a "security" thing, we need to make sure it's isolated within its own VPC. I'm no networking guru, so most of the information I obtained was from this Stack Overflow answer. We'll do that in terraform/vpc.tf:

# Networking for Fargate
# Note: 10.0.0.0 and 10.0.2.0 are private IPs
# Required via https://stackoverflow.com/a/66802973/1002357
# """
# > Launch tasks in a private subnet that has a VPC routing table configured to route outbound 
# > traffic via a NAT gateway in a public subnet. This way the NAT gateway can open a connection 
# > to ECR on behalf of the task.
# """
# If this networking configuration isn't here, this error happens in the ECS Task's "Stopped reason":
# ResourceInitializationError: unable to pull secrets or registry auth: pull command failed: : signal: killed
resource "aws_vpc" "clamav_vpc" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "private" {
  vpc_id     = aws_vpc.clamav_vpc.id
  cidr_block = "10.0.2.0/24"
}

resource "aws_subnet" "public" {
  vpc_id     = aws_vpc.clamav_vpc.id
  cidr_block = "10.0.1.0/24"
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.clamav_vpc.id
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.clamav_vpc.id
}

resource "aws_route_table_association" "public_subnet" {
  subnet_id      = aws_subnet.public.id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "private_subnet" {
  subnet_id      = aws_subnet.private.id
  route_table_id = aws_route_table.private.id
}

resource "aws_eip" "nat" {
  vpc = true
}

resource "aws_internet_gateway" "igw" {
  vpc_id = aws_vpc.clamav_vpc.id
}

resource "aws_nat_gateway" "ngw" {
  subnet_id     = aws_subnet.public.id
  allocation_id = aws_eip.nat.id

  depends_on = [aws_internet_gateway.igw]
}

resource "aws_route" "public_igw" {
  route_table_id         = aws_route_table.public.id
  destination_cidr_block = "0.0.0.0/0"
  gateway_id             = aws_internet_gateway.igw.id
}

resource "aws_route" "private_ngw" {
  route_table_id         = aws_route_table.private.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.ngw.id
}


resource "aws_security_group" "egress-all" {
  name        = "egress_all"
  description = "Allow all outbound traffic"
  vpc_id      = aws_vpc.clamav_vpc.id

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

Now that we have the networking configuration done, we can go ahead and implement the ECS / Fargate configuration:

resource "aws_iam_role" "ecs_task_execution_role" {
  name = "clamav_fargate_execution_role"

  assume_role_policy = <<EOF
{
 "Version": "2012-10-17",
 "Statement": [
   {
     "Action": "sts:AssumeRole",
     "Principal": {
       "Service": "ecs-tasks.amazonaws.com"
     },
     "Effect": "Allow",
     "Sid": ""
   }
 ]
}
EOF
}

resource "aws_iam_role" "ecs_task_role" {
  name = "clamav_fargate_task_role"

  assume_role_policy = <<EOF
{
 "Version": "2012-10-17",
 "Statement": [
   {
     "Action": "sts:AssumeRole",
     "Principal": {
       "Service": "ecs-tasks.amazonaws.com"
     },
     "Effect": "Allow",
     "Sid": ""
   }
 ]
}
EOF
}

resource "aws_iam_role_policy_attachment" "ecs_task_execution_policy_attachment" {
  role       = aws_iam_role.ecs_task_execution_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

resource "aws_iam_role_policy_attachment" "s3_task" {
  role       = aws_iam_role.ecs_task_role.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonS3FullAccess"
}

resource "aws_iam_role_policy_attachment" "sqs_task" {
  role       = aws_iam_role.ecs_task_role.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSQSFullAccess"
}

resource "aws_ecs_cluster" "cluster" {
  name = "clamav_fargate_cluster"

  capacity_providers = ["FARGATE"]
}

data "template_file" "task_consumer_east" {
  template = file("./templates/clamav_container_definition.json")

  vars = {
    aws_account_id = data.aws_caller_identity.current.account_id
  }
}

resource "aws_ecs_task_definition" "definition" {
  family                   = "clamav_fargate_task_definition"
  task_role_arn            = aws_iam_role.ecs_task_role.arn
  execution_role_arn       = aws_iam_role.ecs_task_execution_role.arn
  network_mode             = "awsvpc"
  cpu                      = "512"
  memory                   = "2048"
  requires_compatibilities = ["FARGATE"]

  container_definitions = data.template_file.task_consumer_east.rendered

  depends_on = [
    aws_iam_role.ecs_task_role,
    aws_iam_role.ecs_task_execution_role
  ]
}

resource "aws_ecs_service" "clamav_service" {
  name            = "clamav_service"
  cluster         = aws_ecs_cluster.cluster.id
  task_definition = aws_ecs_task_definition.definition.arn
  desired_count   = 1
  launch_type     = "FARGATE"

  network_configuration {
    assign_public_ip = false

    subnets = [
      aws_subnet.private.id
    ]

    security_groups = [
      aws_security_group.egress-all.id
    ]
  }
}

The container_definitions value rendered from the template_file holds the log configuration and environment variables for the container. That template lives in terraform/templates/clamav_container_definition.json:

[
  {
    "image": "${aws_account_id}.dkr.ecr.us-east-1.amazonaws.com/fargate-images:latest",
    "name": "clamav",
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-region": "us-east-1",
        "awslogs-group": "/aws/ecs/clamav_fargate",
        "awslogs-stream-prefix": "project"
      }
    },
    "environment": [
      {
        "name": "VIRUS_SCAN_QUEUE_URL",
        "value": "https://sqs.us-east-1.amazonaws.com/${aws_account_id}/s3_clamav_event_queue"
      },
      {
        "name": "QUARANTINE_BUCKET",
        "value": "clamav-quarantine-bucket"
      },
      {
        "name": "CLEAN_BUCKET",
        "value": "clamav-clean-bucket"
      }
    ]
  }
]

Since we're using Fargate, we'll need a Dockerfile configuration and an ECR repository (as depicted in the clamav_container_definition.json file above). Let's get the ECR repository configured in terraform/ecr.tf:

resource "aws_ecr_repository" "image_repository" {
  name = "fargate-images"
}

data "template_file" "repo_policy_file" {
  template = file("./templates/ecr_policy.tpl.json")

  vars = {
    numberOfImages = 5
  }
}

# keep the last 5 images
resource "aws_ecr_lifecycle_policy" "repo_policy" {
  repository = aws_ecr_repository.image_repository.name
  policy     = data.template_file.repo_policy_file.rendered
}
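The ecr_policy.tpl.json template referenced above isn't shown in this article. As a rough sketch only (not the file from the repo), here is what a "keep the last N images" ECR lifecycle policy could look like, built with a hypothetical buildLifecyclePolicy helper in Node so it matches the consumer's language. In the actual template, countNumber would be the ${numberOfImages} placeholder that Terraform fills in.

```javascript
// Hypothetical helper: builds an ECR lifecycle policy document that expires
// every image beyond the newest `numberOfImages`, regardless of tag status.
function buildLifecyclePolicy(numberOfImages) {
  return {
    rules: [
      {
        rulePriority: 1,
        description: `Keep last ${numberOfImages} images`,
        selection: {
          tagStatus: 'any',
          countType: 'imageCountMoreThan',
          countNumber: numberOfImages,
        },
        action: { type: 'expire' },
      },
    ],
  };
}

// Print the policy as JSON, e.g. to compare against a hand-written template.
console.log(JSON.stringify(buildLifecyclePolicy(5), null, 2));
```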

You can play around with the number of images, completely up to you. It's just for versioning your Docker image that contains the consumer. Now, for the actual Dockerfile:

FROM ubuntu

WORKDIR /home/clamav

RUN echo "Prepping ClamAV"

RUN apt update -y
RUN apt install curl sudo procps -y

RUN curl -sL https://deb.nodesource.com/setup_14.x | sudo -E bash -
RUN apt install -y nodejs
RUN npm init -y

RUN npm i aws-sdk tmp sqs-consumer --save
RUN DEBIAN_FRONTEND=noninteractive sh -c 'apt install -y awscli'

RUN apt install -y clamav clamav-daemon

RUN mkdir /var/run/clamav && \
  chown clamav:clamav /var/run/clamav && \
  chmod 750 /var/run/clamav

RUN freshclam

COPY ./src/clamd.conf /etc/clamav/clamd.conf
COPY ./src/consumer.js ./consumer.js
RUN npm install
ADD ./src/run.sh ./run.sh

CMD ["bash", "./run.sh"]

This basically installs Node, initializes a Node project, and installs the bare essentials we need for the consumer: aws-sdk, sqs-consumer, and tmp (for writing the file to disk so it can be scanned). The first file we copy in is src/clamd.conf, the ClamAV configuration for the daemon that will be listening:

LocalSocket /tmp/clamd.socket
LocalSocketMode 660

Now for the SQS consumer in src/consumer.js:

const { SQS, S3 } = require('aws-sdk');
const { Consumer } = require('sqs-consumer');
const tmp = require('tmp');
const fs = require('fs');
const util = require('util');
const { exec } = require('child_process');

const execPromise = util.promisify(exec);

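const s3 = new S3();
// The handler's finally block calls sqs.deleteMessage, so the client has to exist in this scope.
const sqs = new SQS();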

const app = Consumer.create({
  queueUrl: process.env.VIRUS_SCAN_QUEUE_URL,
  handleMessage: async (message) => {
    console.log('message', message);
    const parsedBody = JSON.parse(message.Body);
    const documentKey = parsedBody.Records[0].s3.object.key;

    const { Body: fileData } = await s3.getObject({
      Bucket: process.env.QUARANTINE_BUCKET,
      Key: documentKey
    }).promise();

    const inputFile = tmp.fileSync({
      mode: 0o644,
      tmpdir: process.env.TMP_PATH,
    });
    fs.writeSync(inputFile.fd, Buffer.from(fileData));
    fs.closeSync(inputFile.fd);

    try {
      await execPromise(`clamdscan ${inputFile.name}`);

      await s3.putObject({
        Body: fileData,
        Bucket: process.env.CLEAN_BUCKET,
        Key: documentKey,
        Tagging: 'virus-scan=clean',
      }).promise();

      await s3.deleteObject({
        Bucket: process.env.QUARANTINE_BUCKET,
        Key: documentKey,
      }).promise();

    } catch (e) {
      if (e.code === 1) {
        await s3.putObjectTagging({
          Bucket: process.env.QUARANTINE_BUCKET,
          Key: documentKey,
          Tagging: {
            TagSet: [
              {
                Key: 'virus-scan',
                Value: 'dirty',
              },
            ],
          },
        }).promise();
      }
    } finally {
      await sqs.deleteMessage({
        QueueUrl: process.env.VIRUS_SCAN_QUEUE_URL,
        ReceiptHandle: message.ReceiptHandle
      }).promise();
    }
  },
  sqs: new SQS()
});

app.on('error', (err) => {
  console.error('err', err.message);
});

app.on('processing_error', (err) => {
  console.error('processing error', err.message);
});

app.on('timeout_error', (err) => {
 console.error('timeout error', err.message);
});

app.start();

This does the following within a 10 second interval:

1) Pulls the file in through the quarantine bucket via the metadata in the SQS message (message in-flight)
2) Writes it to /tmp
3) Scans it with clamdscan (via the ClamAV daemon, clamd which already has the virus definitions loaded)
4) If it's clean, it puts the file in the clean bucket with a clean tag with the key virus-scan, removes it from the quarantine bucket, and deletes the message
5) If it's dirty, it tags the file as virus-scan = dirty and keeps the virus in the quarantine bucket, and deletes the SQS message
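
Two details behind the steps above are easy to trip over, so here is a minimal sketch (not the code from the repo). First, S3 URL-encodes object keys in its event notifications (a space arrives as +), and it sends a one-off s3:TestEvent message without a Records array when the notification is first configured. Second, clamdscan reports its verdict through its exit code: 0 for clean, 1 for a detected virus, and 2 for a scanner error. The helper names extractS3Objects and classifyExitCode are hypothetical.

```javascript
// Hypothetical helper: turn an SQS message body (an S3 event notification)
// into a list of { bucket, key } pairs.
function extractS3Objects(messageBody) {
  const parsed = JSON.parse(messageBody);

  // The s3:TestEvent message has no Records array, so there is nothing to scan.
  if (!Array.isArray(parsed.Records)) {
    return [];
  }

  return parsed.Records.map((record) => ({
    bucket: record.s3.bucket.name,
    // Keys are URL-encoded in the event; "+" stands for a space.
    key: decodeURIComponent(record.s3.object.key.replace(/\+/g, ' ')),
  }));
}

// Hypothetical helper: map a clamdscan exit code to a verdict.
// Anything other than 0 or 1 (including a spawn failure such as 'ENOENT')
// is treated as an error rather than as a clean or dirty file.
function classifyExitCode(code) {
  if (code === 0) return 'clean';
  if (code === 1) return 'dirty';
  return 'error';
}
```

With something like this in the handler, an 'error' verdict could be rethrown so the message stays on the queue for a retry, instead of being deleted while the file sits untagged in the quarantine bucket.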

For now, this consumer handles only 1 message at a time. It can easily be configured to handle more, since clamdscan hands each file to the already-running ClamAV daemon instead of reloading the virus definitions for every scan.
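
As a sketch of how the message count could be raised: sqs-consumer accepts a batchSize option, and SQS itself caps a single receive call at 10 messages. The SCAN_BATCH_SIZE environment variable and the buildConsumerOptions helper below are assumptions for illustration; neither exists in the repo.

```javascript
// SQS returns at most 10 messages per ReceiveMessage call.
const SQS_MAX_BATCH = 10;

// Hypothetical helper: build the options object for Consumer.create(),
// reading the batch size from an (assumed) SCAN_BATCH_SIZE variable and
// clamping it to the 1..10 range. Missing or non-numeric values fall back to 1.
function buildConsumerOptions(env, handleMessage) {
  const requested = Number.parseInt(env.SCAN_BATCH_SIZE, 10);
  const batchSize = Number.isNaN(requested)
    ? 1
    : Math.min(Math.max(requested, 1), SQS_MAX_BATCH);

  return {
    queueUrl: env.VIRUS_SCAN_QUEUE_URL,
    batchSize,
    handleMessage,
  };
}
```

It could be wired in as Consumer.create({ ...buildConsumerOptions(process.env, handler), sqs: new SQS() }). Keep in mind that the consumer buffers each file in memory, so several large files in flight at once may not fit in the task's 2048 MB.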

Now, the last file mentioned in the Dockerfile is the bash script that runs the updater, daemon, and consumer in src/run.sh:

echo "Starting Services"
service clamav-freshclam start
service clamav-daemon start

echo "Services started. Running worker."

node consumer.js
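
One thing to watch with this startup order: clamd can take a while to load its definitions, and until it does, the /tmp/clamd.socket file from clamd.conf won't exist and clamdscan calls will fail. A minimal sketch of a guard the consumer could run before app.start() is below; waitForPath is a hypothetical helper, and the timeout values are guesses rather than measured numbers.

```javascript
const fs = require('fs');

// Hypothetical helper: poll until `path` exists or the timeout elapses.
// Resolves to true if the path showed up in time, false otherwise.
function waitForPath(path, { timeoutMs = 120000, intervalMs = 1000 } = {}) {
  const deadline = Date.now() + timeoutMs;

  return new Promise((resolve) => {
    const check = () => {
      if (fs.existsSync(path)) {
        resolve(true);
      } else if (Date.now() >= deadline) {
        resolve(false);
      } else {
        setTimeout(check, intervalMs);
      }
    };
    check();
  });
}
```

In consumer.js this could look like waitForPath('/tmp/clamd.socket').then((ready) => ready ? app.start() : process.exit(1)), so the task exits and gets restarted by ECS if the daemon never comes up.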

Cool, let's start it up by running terraform plan and terraform apply. Once that's finished (you'll have to confirm the apply by typing yes), you should be good to go.

Now, to test it with a script, test-virus.sh:

#!/bin/bash

aws s3 cp fixtures/test-virus.txt s3://clamav-quarantine-bucket
aws s3 cp fixtures/test-file.txt s3://clamav-quarantine-bucket

sleep 30

VIRUS_TEST=$(aws s3api get-object-tagging --key test-virus.txt --bucket clamav-quarantine-bucket --output text)
CLEAN_TEST=$(aws s3api get-object-tagging --key test-file.txt --bucket clamav-clean-bucket --output text)

echo "Dirty tag: ${VIRUS_TEST}"
echo "Clean tag: ${CLEAN_TEST}"

Running that, here's the output we get:

Dirty tag: TAGSET       virus-scan      dirty
Clean tag: TAGSET       virus-scan      clean

There we go. Hopefully y'all learned something! I had a lot of fun with this, and although I felt like I rushed it in a few areas, I look forward to your comments to see what all I missed (or to answer questions).

Y'all take care now.

Top comments (7)

kiruba3441 • Oct 6 '22

How does the virus definitions get updated? Do we run Freshclam via CRON?

Marc • Jan 3 '23 • Edited

As per my understanding the freshclam daemon automatically fetches new virus definitions (every other hour).

Martin Smola • Jan 23 '23 • Edited

As per my understanding, the update runs once per 24 hours. I had to rewrite it because it didn't work and it uses AWS SDK v2

Marc • Jan 23 '23

The default is 12 times a day (every other hour). To see if it works I created a freshclam.conf (same level as the clamd.conf). Make sure to copy it into your docker image.

Content of the file can be something like this

# Number of database checks per day (12 is the default).
Checks 12

# Enable logging to a specified file.
UpdateLogFile /var/log/clamav/freshclam.log

# Execute this command after certain events
OnUpdateExecute echo "Freshclam has updated a database."
OnOutdatedExecute echo "Freshclam has found a new version (%v) of a database."
OnErrorExecute echo "Freshclam has failed to update a database."

Now when starting this container and letting it run for a while you can check the log file and you will see that it really is checking for updates. You can enter bigger numbers (for testing), but they will blacklist you for pulling the freshclam db too often. I think you are not allowed to pull more than once per hour.

Doc on the conf file can be found here:
manpages.ubuntu.com/manpages/bioni...

CogginCreations • Feb 8 '23

This is great and really helpful. I was looking at an example clamd.conf. Can you provide an example that would write out the standard ecs tasks cloudwatch logs.

I want to make sure my files are actually getting scanned. Currently it is working, but almost seems TOO fast to actually be working (scanning) correctly. Scanned a 100MB PDF in 1 second. All the routing and tagging I changed a little, but works great.

Again, great work!

Joseph Sutton • Feb 13 '23

Thanks, @coggincreations ! Glad to hear that it's still working! I haven't spun this up in quite some time, and I changed workplaces and haven't touched TF in quite some time; however, if I get some time, I'll try to write out an example and update the repo.

PS: I understand what you mean by too fast. I was skeptical about that at first, too. 😅

Karthik Mathrubootham • Jun 16 '23

Hello @sutt0n and rest of folks - great article. I deployed the docker image from my mac running on arm64 and hence had to recreate the ECS container task on arm64 (updating the aws_ecs_task_definition resource block and rerunning terraform apply). I see the ECS Fargate service using the latest task definition, but the CloudWatch logs are NOT showing consumer.js processing anything from the SQS queue; two files from the test scenario are sitting on the SQS queue. Any suggestions on how I can trigger the processing and why the consumer.js in the new task definition is not executing? What's the frequency of polling of this task, and where is that set?