DEV Community

Bassem Hussein

Building a Scalable Audit Log System with AWS Serverless Services (No Lambda)

Introduction

Audit logging is a critical requirement for any enterprise system — you need to know who did what, when, and to which entity. But building an audit log system that scales with your microservices architecture without breaking the bank is challenging.

In this post, we walk through how we built a production-grade, cost-optimized audit logging pipeline on AWS that captures domain events from multiple microservices and makes them queryable via SQL — all without a single Lambda function or server to manage.


The Problem

We have dozens of microservices (orders, inventory, users, etc.) that modify entities like Orders, Order Items, and User Accounts. We needed:

  1. Complete audit trail — every create, update, and delete captured
  2. Parent-child relationships — e.g., an OrderItem belongs to an Order
  3. Fast, cheap queries — filter by date, source, entity, actor
  4. Zero operational overhead — no servers, no scaling concerns
  5. SQL injection protection — safe querying from application code

Architecture

┌──────────────────────┐
│   Microservices      │
│  (Orders, Inventory, │
│   Users, etc.)       │
└──────────┬───────────┘
           │ PutEvents
           ▼
┌──────────────────────┐
│  Amazon EventBridge  │
│  (Central Event Bus) │
└──────────┬───────────┘
           │ Rule → Target
           ▼
┌──────────────────────┐
│  Kinesis Firehose    │
│  (Buffer + Compress) │
└──────────┬───────────┘
           │ GZIP → S3
           ▼
┌──────────────────────┐
│  S3 (Raw Layer)      │
│  year=YYYY/month=MM/ │
│  day=DD/hour=HH/     │
└──────────┬───────────┘
           │ Daily Schedule
           ▼
┌──────────────────────┐
│  AWS Glue ETL        │
│  (PySpark → Parquet) │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  S3 (Analytics)      │
│  dt=YYYY-MM-DD/      │
│  source=service/     │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  Amazon Athena       │
│  (Serverless SQL)    │
└──────────────────────┘

Why No Lambda?

Most audit log implementations use Lambda to transform events in real-time. We intentionally avoided this:

| Concern | Lambda Approach | Our Approach (Glue) |
| --- | --- | --- |
| Cost at scale | Per-invocation billing adds up | Single daily job, predictable cost |
| Cold starts | Adds latency | N/A — batch processing |
| Concurrency limits | Can throttle under load | Not a factor for one daily job |
| Error handling | Complex DLQ patterns | Simple retry on daily job |
| Transformation bugs | Break real-time ingestion | Fix and re-run the batch |

Key insight: Audit logs don't need real-time transformation. A daily batch is cheaper, simpler, and equally useful for compliance and investigation.


Step 1: Event Publishing (Microservices)

Each microservice publishes domain events to EventBridge following a consistent contract:

{
  "source": "myapp.orderservice",
  "detail-type": "orderitemupdated.v1",
  "time": "2026-04-02T12:31:01Z",
  "detail": {
    "entities": [
      {
        "entity": {
          "id": "1001",
          "name": "Order Item #1001",
          "status": "Processing"
        },
        "before": {
          "quantity": "2",
          "status": "Pending"
        },
        "after": {
          "quantity": "5",
          "status": "Processing"
        },
        "parent": {
          "id": "500",
          "name": "Order #500",
          "type": "Order"
        },
        "channel_code": "WEB",
        "channel_account_code": "Primary"
      }
    ],
    "entity_type": "OrderItem",
    "action": "Update",
    "correlation_id": "40f12582-b3be-4aeb-a7ca-9d9247030e20",
    "actor": {
      "type": 1,
      "name": "jane@example.com"
    }
  }
}

Design Rules

  • entity is the only required object — it must have id and name
  • before and after are optional JSON maps (missing = no change tracking)
  • parent is optional — supports hierarchical entity relationships
  • channel_code and channel_account_code are optional strings
  • actor.type: 0 = Service, 1 = User, 2 = Client

This contract is generic enough that any microservice can publish events without custom schemas.
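
As a rough illustration of the design rules above, a publisher could check the contract before building a PutEvents entry. This is a minimal sketch, not the author's library: `validate_detail` and `build_entry` are hypothetical names, and the rule that `entities` must be a non-empty list is an assumption the post does not state.

```python
import json

ACTOR_TYPES = {0: "Service", 1: "User", 2: "Client"}


def validate_detail(detail: dict) -> list:
    """Return a list of contract violations (an empty list means valid)."""
    errors = []
    entities = detail.get("entities")
    if not isinstance(entities, list) or not entities:
        # Assumption: at least one entity change per event
        errors.append("entities must be a non-empty list")
        entities = []
    for i, change in enumerate(entities):
        entity = change.get("entity")
        if not isinstance(entity, dict):
            errors.append(f"entities[{i}].entity is required")
            continue
        for key in ("id", "name"):
            if not entity.get(key):
                errors.append(f"entities[{i}].entity.{key} is required")
        # before, after and parent are optional, but must be objects if sent
        for key in ("before", "after", "parent"):
            if key in change and not isinstance(change[key], dict):
                errors.append(f"entities[{i}].{key} must be an object when present")
    if "actor" in detail and detail["actor"].get("type") not in ACTOR_TYPES:
        errors.append("actor.type must be 0, 1, or 2")
    return errors


def build_entry(source: str, detail_type: str, detail: dict,
                bus_name: str = "myapp-eventbus") -> dict:
    """Build one entry for the EventBridge PutEvents API."""
    errors = validate_detail(detail)
    if errors:
        raise ValueError("; ".join(errors))
    return {
        "EventBusName": bus_name,
        "Source": source,
        "DetailType": detail_type,
        "Detail": json.dumps(detail),  # PutEvents expects Detail as a JSON string
    }
```

With boto3, the resulting dictionary would be sent as `boto3.client("events").put_events(Entries=[entry])`.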


Step 2: EventBridge to Firehose (Zero Code)

An EventBridge rule routes matching events directly to a Kinesis Firehose delivery stream. No Lambda in between.

# CloudFormation snippet
AuditLogRule:
  Type: AWS::Events::Rule
  Properties:
    EventBusName: myapp-eventbus
    EventPattern:
      source:
        - prefix: "myapp"
    Targets:
      - Id: FirehoseTarget
        Arn: !GetAtt DeliveryStream.Arn
        RoleArn: !GetAtt EventBridgeRole.Arn

Firehose buffers events (60 seconds or 5 MB, whichever comes first) and writes them as GZIP-compressed JSON to S3.
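
To see which of the two buffer limits is likely to trigger first, here is a back-of-the-envelope sketch. It takes the ~5 GB/month ingest figure from the cost breakdown later in this post and assumes traffic is spread evenly, which real workloads will not match exactly.

```python
SECONDS_PER_MONTH = 30 * 86_400


def bytes_per_window(gb_per_month: float, buffer_seconds: int = 60) -> float:
    """Average bytes arriving in one buffer interval, assuming even traffic."""
    return gb_per_month * 1024**3 / SECONDS_PER_MONTH * buffer_seconds


def timer_fires_first(gb_per_month: float, buffer_seconds: int = 60,
                      buffer_mb: int = 5) -> bool:
    """True when the time limit is reached before the size limit."""
    return bytes_per_window(gb_per_month, buffer_seconds) < buffer_mb * 1024**2


window = bytes_per_window(5)
print(f"{window / 1024:.0f} KiB per 60 s window")
print("timer fires first:", timer_fires_first(5))
print("objects per day (upper bound):", 86_400 // 60)
```

Under these assumptions each window holds far less than 5 MB, so the 60-second timer drives delivery and the raw layer receives many small objects per day. That is one reason the daily job compacts them into Parquet.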


Step 3: Raw Storage (S3)

Raw events land in a time-partitioned structure:

s3://myapp-audit-logs-bucket/raw/eventbridge/
  year=2026/month=05/day=01/hour=15/
    firehose-1-2026-05-01-15-00-00.json.gz

This layer is:

  • Cheap — GZIP compressed, short retention
  • Not queried directly — exists only as input for Glue
  • Lifecycle managed — transitions to Glacier after 30 days, expires after 90
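
Because the layout is purely date-based, a downstream job can compute its input prefix from a date alone. A small sketch follows; `raw_prefix` is a hypothetical helper, and the bucket name is the example one used in this post.

```python
from datetime import date, datetime, timedelta, timezone


def raw_prefix(day: date, bucket: str = "myapp-audit-logs-bucket") -> str:
    """S3 prefix holding all raw Firehose objects for one UTC day."""
    return (
        f"s3://{bucket}/raw/eventbridge/"
        f"year={day:%Y}/month={day:%m}/day={day:%d}/"
    )


# The prefix a daily job would read: yesterday in UTC
yesterday = datetime.now(timezone.utc).date() - timedelta(days=1)
print(raw_prefix(yesterday))
```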

Step 4: Glue ETL (Daily Transformation)

A scheduled AWS Glue job runs daily and transforms yesterday's raw events into query-optimized Parquet:

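import sys
from datetime import datetime, timedelta, timezone

import pyspark.sql.functions as F
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

spark = GlueContext(SparkContext.getOrCreate()).spark_session
# Overwrite only the partitions this run writes; a plain "overwrite"
# would delete every earlier dt=/source= partition under the output path
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Assumption: the bucket name is passed as a --bucket job argument
bucket = getResolvedOptions(sys.argv, ["bucket"])["bucket"]
yesterday = datetime.now(timezone.utc) - timedelta(days=1)
y, m, d = yesterday.strftime("%Y %m %d").split()

# Read yesterday's raw JSON
raw_df = spark.read.json(f"s3://{bucket}/raw/eventbridge/year={y}/month={m}/day={d}/")

# Extract detail and explode entities array
detail_df = raw_df.select(
    F.col("id").alias("Id"),  # matches the Athena column name
    F.to_timestamp("time").alias("Timestamp"),
    F.col("source"),  # kept lowercase so the S3 path is source=...
    F.lit(f"{y}-{m}-{d}").alias("dt"),  # partition by the raw day being processed
    F.col("detail.actor.name").alias("ActorName"),
    F.col("detail.actor.type").cast("string").alias("ActorType"),
    F.col("detail.entity_type").alias("EntityType"),
    F.col("detail.action").alias("Action"),
    F.col("detail.correlation_id").alias("CorrelationId"),
    F.explode("detail.entities").alias("entity_change")
)

# Extract entity fields (always present)
result_df = detail_df.select(
    "*",
    F.col("entity_change.entity.id").cast("string").alias("EntityId"),
    F.col("entity_change.entity.name").alias("EntityName"),
    # Parent fields (optional; NULL when a record omits them). Coalescing with
    # NULL is a no-op, and a field absent from the whole batch is missing from
    # the inferred schema, so pass an explicit schema if that can happen
    F.coalesce(F.col("entity_change.parent.id").cast("string"), F.lit(None)).alias("EntityParentId"),
    F.coalesce(F.col("entity_change.parent.name"), F.lit(None)).alias("EntityParentName"),
    F.coalesce(F.col("entity_change.parent.type"), F.lit(None)).alias("EntityParentType"),
    # Optional context strings
    F.coalesce(F.col("entity_change.channel_code"), F.lit(None)).alias("ChannelCode"),
    F.coalesce(F.col("entity_change.channel_account_code"), F.lit(None)).alias("ChannelAccountCode"),
    # Before/After as JSON strings
    F.to_json(F.col("entity_change.before")).alias("Before"),
    F.to_json(F.col("entity_change.after")).alias("After"),
).drop("entity_change")  # the exploded struct is not part of the table

# Write partitioned Parquet
result_df.write \
    .mode("overwrite") \
    .partitionBy("dt", "source") \
    .parquet(f"s3://{bucket}/analytics/auditlog_parquet/")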

Key Points

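  • F.coalesce(..., F.lit(None)) — marks parent, channel_code, etc. as optional; Spark already returns NULL when a record omits a field, but a field absent from every record in a batch is missing from the inferred schema and needs an explicit read schema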
  • F.explode("detail.entities") — one event can produce multiple audit rows
  • Parquet format — columnar, compressed, perfect for Athena
  • Partitioned by dt and source — enables partition pruning
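
To make the explode step and the optional-field handling concrete, here is the same flattening written in plain Python for a single event. It is an illustration of the logic only, not the Glue job itself; `flatten` is a hypothetical name, and the output keys follow the Athena table columns from Step 5.

```python
import json


def flatten(event: dict) -> list:
    """Turn one EventBridge event into one audit row per changed entity."""
    detail = event["detail"]
    actor = detail.get("actor") or {}
    rows = []
    for change in detail.get("entities", []):  # the "explode" step
        parent = change.get("parent") or {}    # optional: absent means NULL columns
        before, after = change.get("before"), change.get("after")
        rows.append({
            "Id": event.get("id"),
            "Timestamp": event.get("time"),
            "source": event.get("source"),
            "ActorName": actor.get("name"),
            "ActorType": None if actor.get("type") is None else str(actor["type"]),
            "EntityType": detail.get("entity_type"),
            "Action": detail.get("action"),
            "CorrelationId": detail.get("correlation_id"),
            "EntityId": str(change["entity"]["id"]),
            "EntityName": change["entity"]["name"],
            "EntityParentId": None if parent.get("id") is None else str(parent["id"]),
            "EntityParentName": parent.get("name"),
            "EntityParentType": parent.get("type"),
            "ChannelCode": change.get("channel_code"),
            "ChannelAccountCode": change.get("channel_account_code"),
            "Before": None if before is None else json.dumps(before),
            "After": None if after is None else json.dumps(after),
        })
    return rows
```

An event carrying two entries in `entities` yields two rows that share the same Id, actor, and correlation fields but differ in their entity columns.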

Step 5: Athena Table (Serverless SQL)

The Athena table uses partition projection — no crawlers, no MSCK REPAIR TABLE:

CREATE EXTERNAL TABLE audit_logs_db.auditlog (
  Id string,
  `Timestamp` timestamp,
  ActorName string,
  ActorType string,
  EntityId string,
  EntityName string,
  EntityType string,
  EntityParentId string,
  EntityParentName string,
  EntityParentType string,
  ChannelCode string,
  ChannelAccountCode string,
  Action string,
  Before string,
  After string,
  CorrelationId string
)
PARTITIONED BY (dt string, source string)
STORED AS PARQUET
LOCATION 's3://myapp-audit-logs-bucket/analytics/auditlog_parquet/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.dt.type' = 'date',
  'projection.dt.range' = '2024-01-01,NOW',
  'projection.dt.format' = 'yyyy-MM-dd',
  'projection.dt.interval' = '1',
  'projection.dt.interval.unit' = 'DAYS',
  'projection.source.type' = 'injected',
  'storage.location.template' = 's3://myapp-audit-logs-bucket/analytics/auditlog_parquet/dt=${dt}/source=${source}/'
);

Example Queries

-- All updates by a specific user in the last week
SELECT *
FROM audit_logs_db.auditlog
WHERE dt >= '2026-05-27'
  AND source = 'myapp.orderservice'
  AND ActorName = 'jane@example.com'
  AND Action = 'Update'
ORDER BY Timestamp DESC
LIMIT 100;

-- All changes to an Order and its child OrderItems
SELECT *
FROM audit_logs_db.auditlog
WHERE dt >= '2026-05-11' AND dt <= '2026-05-18'
  AND source IN ('myapp.orderservice', 'myapp.inventoryservice')
  AND (EntityId = '500' OR EntityParentId = '500')
ORDER BY Timestamp DESC;

Step 6: C# Client Library (NuGet Package)

We built a strongly-typed client that generates safe SQL and handles pagination:

Installation

dotnet add package MyApp.AuditLogs

Setup

services.AddAuditLogAthena(options =>
{
    options.Database = "audit_logs_db";
    options.Table = "auditlog";
    options.OutputLocation = "s3://athena-results-bucket/";
    options.WorkGroup = "primary";
});

Querying

var page = await client.QueryAsync(new AuditLogQuery
{
    From = DateOnly.FromDateTime(DateTime.UtcNow.AddDays(-7)),
    To = DateOnly.FromDateTime(DateTime.UtcNow),
    Sources = new[] { "myapp.orderservice" },
    Limit = 100,
    EntityId = "500",
    IncludeChildren = true,  // Also fetches child entities (e.g. OrderItems)
    EntityTypes = new[] { "Order" },
    Actions = new[] { Action.Update }
});

// Paginate
if (page.NextCursor != null)
{
    var nextPage = await client.QueryAsync(new AuditLogQuery
    {
        // ... same filters ...
        Cursor = page.NextCursor,
        Direction = PageDirection.Next
    });
}

Security

Because the client assembles SQL text itself instead of using Athena's parameterized queries (prepared statements or execution parameters), it implements:

  1. Input validation — max string length, date range limits (7 days max)
  2. SQL escaping — single quotes doubled (O'Brien → O''Brien)
  3. Type-safe enums — ActorType and Action prevent arbitrary injection
  4. Limit enforcement — max 1000 rows per query
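
The four rules above can be sketched in a few lines. This is a Python illustration of the idea rather than the C# client's actual code: `sql_literal` and `build_query` are hypothetical names, the 256-character string limit is an assumed value, and the `Create`/`Delete` spellings are guesses based on the `Update` value shown earlier.

```python
from datetime import date
from typing import List, Optional

MAX_RANGE_DAYS = 7
MAX_LIMIT = 1000
MAX_STRING_LENGTH = 256  # assumption: the post does not state the limit
ALLOWED_ACTIONS = {"Create", "Update", "Delete"}  # assumed enum values


def sql_literal(value: str) -> str:
    """Quote a string for SQL by doubling embedded single quotes."""
    if len(value) > MAX_STRING_LENGTH:
        raise ValueError("string value too long")
    return "'" + value.replace("'", "''") + "'"


def build_query(start: date, end: date, sources: List[str],
                actor: Optional[str] = None, action: Optional[str] = None,
                limit: int = 100) -> str:
    if end < start or (end - start).days > MAX_RANGE_DAYS:
        raise ValueError("date range must be 0 to 7 days")
    if not sources:
        raise ValueError("at least one source is required")
    if action is not None and action not in ALLOWED_ACTIONS:
        raise ValueError("unknown action")
    if not 1 <= limit <= MAX_LIMIT:
        raise ValueError("limit must be between 1 and 1000")
    where = [
        # Dates come from date objects, so isoformat() cannot carry SQL
        f"dt BETWEEN '{start.isoformat()}' AND '{end.isoformat()}'",
        "source IN (" + ", ".join(sql_literal(s) for s in sources) + ")",
    ]
    if actor is not None:
        where.append(f"ActorName = {sql_literal(actor)}")
    if action is not None:
        where.append(f"Action = '{action}'")  # safe: checked against the allow-list
    return (
        "SELECT * FROM audit_logs_db.auditlog WHERE "
        + " AND ".join(where)
        + f" ORDER BY Timestamp DESC LIMIT {limit}"
    )
```

Requiring at least one source also fits the table definition, since an injected partition column has to be constrained in the WHERE clause.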

Cost Breakdown

For a system processing ~100K events/day:

| Service | Monthly Cost | Notes |
| --- | --- | --- |
| EventBridge | ~$3 | $1/million events |
| Firehose | ~$0.15 | $0.029/GB × ~5GB/month |
| S3 (Raw) | ~$0.12 | Short retention, GZIP |
| S3 (Parquet) | ~$0.05 | Columnar, compressed |
| Glue | ~$13 | 2 DPUs × 30 min/day at $0.44/DPU-hour |
| Athena | ~$0.50 | Partition pruning, Parquet |
| Total | ~$17/month | |

Compare this to running an always-on database or Elasticsearch cluster ($100-500+/month).
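
The total can be reproduced with simple arithmetic. In the sketch below, the EventBridge and Firehose lines use the unit prices quoted above. The Glue line assumes 2 DPUs (the minimum for a Spark job) at $0.44 per DPU-hour, which is what the ~$13 figure corresponds to. The S3 and Athena amounts are taken as given rather than derived.

```python
EVENTS_PER_MONTH = 100_000 * 30

costs = {
    "EventBridge": EVENTS_PER_MONTH / 1_000_000 * 1.00,  # $1 per million events
    "Firehose": 5 * 0.029,                               # ~5 GB at $0.029/GB
    "S3 (Raw)": 0.12,                                    # taken from the table
    "S3 (Parquet)": 0.05,                                # taken from the table
    "Glue": 2 * 0.5 * 30 * 0.44,                         # DPUs x hours x days x rate (assumed)
    "Athena": 0.50,                                      # taken from the table
}
total = sum(costs.values())

for service, cost in costs.items():
    print(f"{service:<14}${cost:>6.2f}")
print(f"{'Total':<14}${total:>6.2f}")
```

Glue dominates the bill, so shortening the job's runtime would reduce the total more than any other change.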


Deployment (GitHub Actions + OIDC)

We deploy using GitHub Actions with OIDC authentication — no long-lived AWS credentials:

name: Deploy Audit Logs

on:
  push:
    branches: [main]

permissions:
  id-token: write
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.DEPLOY_ROLE }}
          aws-region: us-east-1

      - name: Deploy CloudFormation
        run: |
          aws cloudformation deploy \
            --template-file aws/cf.yaml \
            --stack-name audit-logs-pipeline \
            --capabilities CAPABILITY_IAM

Lessons Learned

1. Not Everything Needs Real-Time Processing

Audit logs are written once and read occasionally. Daily batch transformation is perfectly adequate and dramatically cheaper than real-time Lambda processing.

2. Partition Projection Eliminates Operational Toil

Without partition projection, every new day requires running MSCK REPAIR TABLE or a Glue Crawler. Partition projection handles this automatically.
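
Conceptually, projection means Athena derives partition locations from the table properties and the query's WHERE clause instead of looking them up in the catalog. The following is a rough model of that calculation using the storage.location.template from Step 5. It is not Athena's implementation, and `projected_locations` is a hypothetical name.

```python
from datetime import date, timedelta

TEMPLATE = (
    "s3://myapp-audit-logs-bucket/analytics/auditlog_parquet/"
    "dt=${dt}/source=${source}/"
)


def projected_locations(start: date, end: date, sources: list) -> list:
    """Expand the template for every day in [start, end] and every source."""
    days = (end - start).days + 1
    return [
        TEMPLATE.replace("${dt}", (start + timedelta(days=i)).isoformat())
                .replace("${source}", src)
        for i in range(days)
        for src in sources
    ]


for location in projected_locations(
    date(2026, 5, 17), date(2026, 5, 18),
    ["myapp.orderservice", "myapp.inventoryservice"],
):
    print(location)
```

A new day or a new source needs no catalog update, because the location is computed at query time.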

3. Make Optional Fields Truly Optional

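The parent, channel_code, and before/after fields are all optional in our schema. The Glue job maps missing values to NULL instead of rejecting the event. Note that F.coalesce() alone does not cover a field that is absent from an entire batch; that case needs an explicit read schema.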

4. Composite Cursors Handle Duplicate Timestamps

Using Timestamp + Id as a pagination cursor ensures stable ordering even when multiple events share the same timestamp.
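
A minimal sketch of such a composite cursor, working on in-memory rows, is shown below. The function names and the base64 JSON encoding are illustrative assumptions, not the client library's format. Timestamps are compared as strings, which works only when they share one ISO 8601 format.

```python
import base64
import json


def encode_cursor(timestamp: str, event_id: str) -> str:
    raw = json.dumps([timestamp, event_id]).encode()
    return base64.urlsafe_b64encode(raw).decode()


def decode_cursor(cursor: str) -> tuple:
    timestamp, event_id = json.loads(base64.urlsafe_b64decode(cursor))
    return timestamp, event_id


def next_page(rows: list, cursor, limit: int):
    """Return (page, next_cursor), newest first, ties broken by Id.

    The equivalent SQL predicate would be
      Timestamp < :ts OR (Timestamp = :ts AND Id < :id)
    combined with ORDER BY Timestamp DESC, Id DESC.
    """
    ordered = sorted(rows, key=lambda r: (r["Timestamp"], r["Id"]), reverse=True)
    if cursor is not None:
        last_seen = decode_cursor(cursor)
        ordered = [r for r in ordered if (r["Timestamp"], r["Id"]) < last_seen]
    page = ordered[:limit]
    has_more = len(ordered) > limit
    next_cursor = (
        encode_cursor(page[-1]["Timestamp"], page[-1]["Id"]) if has_more else None
    )
    return page, next_cursor
```

With a cursor on Timestamp alone, rows that share the boundary timestamp would be skipped or repeated across pages. Adding Id makes the sort key unique, so every row appears exactly once.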

5. Validate at the Query Layer

Because the client assembles SQL text itself instead of using Athena's parameterized queries, input validation and escaping in the client library are critical. We enforce max date ranges (7 days) to prevent expensive full-table scans.


Summary

By combining EventBridge, Firehose, S3, Glue, and Athena — and intentionally avoiding Lambda — we built a production audit log system that:

✅ Scales automatically with event volume

✅ Costs ~$17/month for 100K events/day

✅ Supports hierarchical entity queries (parent-child)

✅ Provides cursor-based pagination for frontend integration

✅ Protects against SQL injection

✅ Handles optional fields gracefully

✅ Deploys with zero long-lived credentials

The key takeaway: choose the right tool for the job. For audit logs, a daily batch with Parquet + Athena beats a real-time Lambda pipeline on cost, simplicity, and reliability.
