Bassem Hussein

Posted on Jun 3

Building a Scalable Audit Log System with AWS Serverless Services (No Lambda)

#aws #athena #eventbridge #serverless

Introduction

Audit logging is a critical requirement for any enterprise system — you need to know who did what, when, and to which entity. But building an audit log system that scales with your microservices architecture without breaking the bank is challenging.

In this post, we walk through how we built a production-grade, cost-optimized audit logging pipeline on AWS that captures domain events from multiple microservices and makes them queryable via SQL — all without a single Lambda function or server to manage.

The Problem

We have dozens of microservices (orders, inventory, users, etc.) that modify entities like Orders, Order Items, and User Accounts. We needed:

Complete audit trail — every create, update, and delete captured
Parent-child relationships — e.g., an OrderItem belongs to an Order
Fast, cheap queries — filter by date, source, entity, actor
Zero operational overhead — no servers, no scaling concerns
SQL injection protection — safe querying from application code

Architecture

┌──────────────────────┐
│   Microservices      │
│  (Orders, Inventory, │
│   Users, etc.)       │
└──────────┬───────────┘
           │ PutEvents
           ▼
┌──────────────────────┐
│  Amazon EventBridge  │
│  (Central Event Bus) │
└──────────┬───────────┘
           │ Rule → Target
           ▼
┌──────────────────────┐
│  Kinesis Firehose    │
│  (Buffer + Compress) │
└──────────┬───────────┘
           │ GZIP → S3
           ▼
┌──────────────────────┐
│  S3 (Raw Layer)      │
│  year=YYYY/month=MM/ │
│  day=DD/hour=HH/     │
└──────────┬───────────┘
           │ Daily Schedule
           ▼
┌──────────────────────┐
│  AWS Glue ETL        │
│  (PySpark → Parquet) │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  S3 (Analytics)      │
│  dt=YYYY-MM-DD/      │
│  source=service/     │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  Amazon Athena       │
│  (Serverless SQL)    │
└──────────────────────┘

Why No Lambda?

Most audit log implementations use Lambda to transform events in real-time. We intentionally avoided this:

Concern	Lambda Approach	Our Approach (Glue)
Cost at scale	Per-invocation billing adds up	Single daily job, fixed cost
Cold starts	Adds latency	N/A — batch processing
Concurrency limits	Can throttle under load	No limits
Error handling	Complex DLQ patterns	Simple retry on daily job
Transformation bugs	Break real-time ingestion	Fix and re-run the batch

Key insight: Audit logs don't need real-time transformation. A daily batch is cheaper, simpler, and equally useful for compliance and investigation.

Step 1: Event Publishing (Microservices)

Each microservice publishes domain events to EventBridge following a consistent contract:

{
  "source": "myapp.orderservice",
  "detail-type": "orderitemupdated.v1",
  "time": "2026-04-02T12:31:01Z",
  "detail": {
    "entities": [
      {
        "entity": {
          "id": "1001",
          "name": "Order Item #1001",
          "status": "Processing"
        },
        "before": {
          "quantity": "2",
          "status": "Pending"
        },
        "after": {
          "quantity": "5",
          "status": "Processing"
        },
        "parent": {
          "id": "500",
          "name": "Order #500",
          "type": "Order"
        },
        "channel_code": "WEB",
        "channel_account_code": "Primary"
      }
    ],
    "entity_type": "OrderItem",
    "action": "Update",
    "correlation_id": "40f12582-b3be-4aeb-a7ca-9d9247030e20",
    "actor": {
      "type": 1,
      "name": "jane@example.com"
    }
  }
}

Design Rules

entity is the only required object — it must have id and name
before and after are optional JSON maps (missing = no change tracking)
parent is optional — supports hierarchical entity relationships
channel_code and channel_account_code are optional strings
actor.type: 0 = Service, 1 = User, 2 = Client

This contract is generic enough that any microservice can publish events without custom schemas.

Step 2: EventBridge to Firehose (Zero Code)

An EventBridge rule routes matching events directly to a Kinesis Firehose delivery stream. No Lambda in between.

# CloudFormation snippet
AuditLogRule:
  Type: AWS::Events::Rule
  Properties:
    EventBusName: myapp-eventbus
    EventPattern:
      source:
        - prefix: "myapp"
    Targets:
      - Id: FirehoseTarget
        Arn: !GetAtt DeliveryStream.Arn
        RoleArn: !GetAtt EventBridgeRole.Arn

Firehose buffers events (60 seconds or 5 MB, whichever comes first) and writes them as GZIP-compressed JSON to S3.

Step 3: Raw Storage (S3)

Raw events land in a time-partitioned structure:

s3://myapp-audit-logs-bucket/raw/eventbridge/
  year=2026/month=05/day=01/hour=15/
    firehose-1-2026-05-01-15-00-00.json.gz

This layer is:

Cheap — GZIP compressed, short retention
Not queried directly — exists only as input for Glue
Lifecycle managed — transitions to Glacier after 30 days, expires after 90

Step 4: Glue ETL (Daily Transformation)

A scheduled AWS Glue job runs daily and transforms yesterday's raw events into query-optimized Parquet:

import pyspark.sql.functions as F
from pyspark.sql.types import *
from awsglue.context import GlueContext

# Read yesterday's raw JSON
raw_df = spark.read.json(f"s3://{bucket}/raw/eventbridge/year={y}/month={m}/day={d}/")

# Extract detail and explode entities array
detail_df = raw_df.select(
    F.col("id").alias("EventId"),
    F.col("time").alias("Timestamp"),
    F.col("source").alias("Source"),
    F.col("detail.actor.name").alias("ActorName"),
    F.col("detail.actor.type").alias("ActorType"),
    F.col("detail.entity_type").alias("EntityType"),
    F.col("detail.action").alias("Action"),
    F.col("detail.correlation_id").alias("CorrelationId"),
    F.explode("detail.entities").alias("entity_change")
)

# Extract entity fields (always present)
result_df = detail_df.select(
    "*",
    F.col("entity_change.entity.id").cast("string").alias("EntityId"),
    F.col("entity_change.entity.name").alias("EntityName"),
    # Parent fields (optional - coalesce handles missing)
    F.coalesce(F.col("entity_change.parent.id").cast("string"), F.lit(None)).alias("EntityParentId"),
    F.coalesce(F.col("entity_change.parent.name"), F.lit(None)).alias("EntityParentName"),
    F.coalesce(F.col("entity_change.parent.type"), F.lit(None)).alias("EntityParentType"),
    # Optional context strings
    F.coalesce(F.col("entity_change.channel_code"), F.lit(None)).alias("ChannelCode"),
    F.coalesce(F.col("entity_change.channel_account_code"), F.lit(None)).alias("ChannelAccountCode"),
    # Before/After as JSON strings
    F.to_json(F.col("entity_change.before")).alias("Before"),
    F.to_json(F.col("entity_change.after")).alias("After"),
)

# Write partitioned Parquet
result_df.write \
    .mode("overwrite") \
    .partitionBy("dt", "source") \
    .parquet(f"s3://{bucket}/analytics/auditlog_parquet/")

Key Points

F.coalesce(..., F.lit(None)) — gracefully handles missing parent, channel_code, etc.
F.explode("detail.entities") — one event can produce multiple audit rows
Parquet format — columnar, compressed, perfect for Athena
Partitioned by dt and source — enables partition pruning

Step 5: Athena Table (Serverless SQL)

The Athena table uses partition projection — no crawlers, no MSCK REPAIR TABLE:

CREATE EXTERNAL TABLE audit_logs_db.auditlog (
  Id string,
  Timestamp timestamp,
  ActorName string,
  ActorType string,
  EntityId string,
  EntityName string,
  EntityType string,
  EntityParentId string,
  EntityParentName string,
  EntityParentType string,
  ChannelCode string,
  ChannelAccountCode string,
  Action string,
  Before string,
  After string,
  CorrelationId string
)
PARTITIONED BY (dt string, source string)
STORED AS PARQUET
LOCATION 's3://myapp-audit-logs-bucket/analytics/auditlog_parquet/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.dt.type' = 'date',
  'projection.dt.range' = '2024-01-01,NOW',
  'projection.dt.format' = 'yyyy-MM-dd',
  'projection.dt.interval' = '1',
  'projection.dt.interval.unit' = 'DAYS',
  'projection.source.type' = 'injected',
  'storage.location.template' = 's3://myapp-audit-logs-bucket/analytics/auditlog_parquet/dt=${dt}/source=${source}/'
);

Example Queries

-- All updates by a specific user in the last week
SELECT *
FROM audit_logs_db.auditlog
WHERE dt >= '2026-05-27'
  AND source = 'myapp.orderservice'
  AND ActorName = 'jane@example.com'
  AND Action = 'Update'
ORDER BY Timestamp DESC
LIMIT 100;

-- All changes to an Order and its child OrderItems
SELECT *
FROM audit_logs_db.auditlog
WHERE dt >= '2026-05-11' AND dt <= '2026-05-18'
  AND source IN ('myapp.orderservice', 'myapp.inventoryservice')
  AND (EntityId = '500' OR EntityParentId = '500')
ORDER BY Timestamp DESC;

Step 6: C# Client Library (NuGet Package)

We built a strongly-typed client that generates safe SQL and handles pagination:

Installation

dotnet add package MyApp.AuditLogs

Setup

services.AddAuditLogAthena(options =>
{
    options.Database = "audit_logs_db";
    options.Table = "auditlog";
    options.OutputLocation = "s3://athena-results-bucket/";
    options.WorkGroup = "primary";
});

Querying

var page = await client.QueryAsync(new AuditLogQuery
{
    From = DateOnly.FromDateTime(DateTime.UtcNow.AddDays(-7)),
    To = DateOnly.FromDateTime(DateTime.UtcNow),
    Sources = new[] { "myapp.orderservice" },
    Limit = 100,
    EntityId = "500",
    IncludeChildren = true,  // Also fetches child entities (e.g. OrderItems)
    EntityTypes = new[] { "Order" },
    Actions = new[] { Action.Update }
});

// Paginate
if (page.NextCursor != null)
{
    var nextPage = await client.QueryAsync(new AuditLogQuery
    {
        // ... same filters ...
        Cursor = page.NextCursor,
        Direction = PageDirection.Next
    });
}

Security

Since Athena doesn't support parameterized queries, the client implements:

Input validation — max string length, date range limits (7 days max)
SQL escaping — single quotes doubled (O'Brien → O''Brien)
Type-safe enums — ActorType and Action prevent arbitrary injection
Limit enforcement — max 1000 rows per query

Cost Breakdown

For a system processing ~100K events/day:

Service	Monthly Cost	Notes
EventBridge	~$3	$1/million events
Firehose	~$0.15	$0.029/GB × ~5GB/month
S3 (Raw)	~$0.12	Short retention, GZIP
S3 (Parquet)	~$0.05	Columnar, compressed
Glue	~$13	1 DPU × 30 min/day
Athena	~$0.50	Partition pruning, Parquet
Total	~$17/month

Compare this to running an always-on database or Elasticsearch cluster ($100-500+/month).

Deployment (GitHub Actions + OIDC)

We deploy using GitHub Actions with OIDC authentication — no long-lived AWS credentials:

name: Deploy Audit Logs

on:
  push:
    branches: [main]

permissions:
  id-token: write
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.DEPLOY_ROLE }}
          aws-region: us-east-1

      - name: Deploy CloudFormation
        run: |
          aws cloudformation deploy \
            --template-file aws/cf.yaml \
            --stack-name audit-logs-pipeline \
            --capabilities CAPABILITY_IAM

Lessons Learned

1. Not Everything Needs Real-Time Processing

Audit logs are written once and read occasionally. Daily batch transformation is perfectly adequate and dramatically cheaper than real-time Lambda processing.

2. Partition Projection Eliminates Operational Toil

Without partition projection, every new day requires running MSCK REPAIR TABLE or a Glue Crawler. Partition projection handles this automatically.

3. Make Optional Fields Truly Optional

The parent, channel_code, and before/after fields are all optional in our schema. The Glue job uses F.coalesce() to gracefully handle missing data — no event is ever rejected.

4. Composite Cursors Handle Duplicate Timestamps

Using Timestamp + Id as a pagination cursor ensures stable ordering even when multiple events share the same timestamp.

5. Validate at the Query Layer

Since Athena doesn't support parameterized queries, input validation and escaping in the client library is critical. We enforce max date ranges (7 days) to prevent expensive full-table scans.

Summary

By combining EventBridge, Firehose, S3, Glue, and Athena — and intentionally avoiding Lambda — we built a production audit log system that:

✅ Scales automatically with event volume

✅ Costs ~$17/month for 100K events/day

✅ Supports hierarchical entity queries (parent-child)

✅ Provides cursor-based pagination for frontend integration

✅ Protects against SQL injection

✅ Handles optional fields gracefully

✅ Deploys with zero long-lived credentials

The key takeaway: choose the right tool for the job. For audit logs, a daily batch with Parquet + Athena beats a real-time Lambda pipeline on cost, simplicity, and reliability.

DEV Community