Introduction
Audit logging is a critical requirement for any enterprise system — you need to know who did what, when, and to which entity. But building an audit log system that scales with your microservices architecture without breaking the bank is challenging.
In this post, we walk through how we built a production-grade, cost-optimized audit logging pipeline on AWS that captures domain events from multiple microservices and makes them queryable via SQL — all without a single Lambda function or server to manage.
The Problem
We have dozens of microservices (orders, inventory, users, etc.) that modify entities like Orders, Order Items, and User Accounts. We needed:
- Complete audit trail — every create, update, and delete captured
- Parent-child relationships — e.g., an OrderItem belongs to an Order
- Fast, cheap queries — filter by date, source, entity, actor
- Zero operational overhead — no servers, no scaling concerns
- SQL injection protection — safe querying from application code
Architecture
┌──────────────────────┐
│    Microservices     │
│ (Orders, Inventory,  │
│     Users, etc.)     │
└──────────┬───────────┘
           │ PutEvents
           ▼
┌──────────────────────┐
│  Amazon EventBridge  │
│ (Central Event Bus)  │
└──────────┬───────────┘
           │ Rule → Target
           ▼
┌──────────────────────┐
│   Kinesis Firehose   │
│ (Buffer + Compress)  │
└──────────┬───────────┘
           │ GZIP → S3
           ▼
┌──────────────────────┐
│    S3 (Raw Layer)    │
│ year=YYYY/month=MM/  │
│   day=DD/hour=HH/    │
└──────────┬───────────┘
           │ Daily Schedule
           ▼
┌──────────────────────┐
│     AWS Glue ETL     │
│ (PySpark → Parquet)  │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│    S3 (Analytics)    │
│    dt=YYYY-MM-DD/    │
│   source=service/    │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│    Amazon Athena     │
│   (Serverless SQL)   │
└──────────────────────┘
Why No Lambda?
Most audit log implementations use Lambda to transform events in real-time. We intentionally avoided this:
| Concern | Lambda Approach | Our Approach (Glue) |
|---|---|---|
| Cost at scale | Per-invocation billing adds up | Single daily job, fixed cost |
| Cold starts | Adds latency | N/A — batch processing |
| Concurrency limits | Can throttle under load | Not a concern for one scheduled job |
| Error handling | Complex DLQ patterns | Simple retry on daily job |
| Transformation bugs | Break real-time ingestion | Fix and re-run the batch |
Key insight: Audit logs don't need real-time transformation. A daily batch is cheaper and simpler, and it serves compliance and investigation just as well, provided a delay of up to a day before events become queryable is acceptable.
Step 1: Event Publishing (Microservices)
Each microservice publishes domain events to EventBridge following a consistent contract:
{
  "source": "myapp.orderservice",
  "detail-type": "orderitemupdated.v1",
  "time": "2026-04-02T12:31:01Z",
  "detail": {
    "entities": [
      {
        "entity": {
          "id": "1001",
          "name": "Order Item #1001",
          "status": "Processing"
        },
        "before": {
          "quantity": "2",
          "status": "Pending"
        },
        "after": {
          "quantity": "5",
          "status": "Processing"
        },
        "parent": {
          "id": "500",
          "name": "Order #500",
          "type": "Order"
        },
        "channel_code": "WEB",
        "channel_account_code": "Primary"
      }
    ],
    "entity_type": "OrderItem",
    "action": "Update",
    "correlation_id": "40f12582-b3be-4aeb-a7ca-9d9247030e20",
    "actor": {
      "type": 1,
      "name": "jane@example.com"
    }
  }
}
Design Rules
- entity is the only required object: it must have id and name
- before and after are optional JSON maps (missing = no change tracking)
- parent is optional: supports hierarchical entity relationships
- channel_code and channel_account_code are optional strings
- actor.type: 0 = Service, 1 = User, 2 = Client
This contract is generic enough that any microservice can publish events without custom schemas.
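As a rough illustration of the design rules above, a publisher could check an event against the contract and shape it into a PutEvents entry before sending it. This is a minimal Python sketch, not the services' actual code: the function names validate_detail and build_entry are hypothetical, the bus name is a placeholder, and requiring at least one item in entities is an assumption the post does not state.

```python
import json

VALID_ACTOR_TYPES = {0, 1, 2}  # 0 = Service, 1 = User, 2 = Client


def validate_detail(detail: dict) -> list[str]:
    """Return a list of contract violations (an empty list means valid)."""
    errors = []
    entities = detail.get("entities") or []
    if not entities:
        # Assumption: an event without any entity change is not useful.
        errors.append("entities must contain at least one item")
    for i, change in enumerate(entities):
        entity = change.get("entity")
        if not isinstance(entity, dict):
            errors.append(f"entities[{i}].entity is required")
            continue
        for key in ("id", "name"):
            if not entity.get(key):
                errors.append(f"entities[{i}].entity.{key} is required")
        # Optional objects: only checked when the publisher supplies them.
        for key in ("before", "after", "parent"):
            if key in change and not isinstance(change[key], dict):
                errors.append(f"entities[{i}].{key} must be an object")
        # Optional strings.
        for key in ("channel_code", "channel_account_code"):
            if key in change and not isinstance(change[key], str):
                errors.append(f"entities[{i}].{key} must be a string")
    actor = detail.get("actor")
    if actor is not None and actor.get("type") not in VALID_ACTOR_TYPES:
        errors.append("actor.type must be 0, 1, or 2")
    return errors


def build_entry(source: str, detail_type: str, detail: dict,
                bus_name: str = "myapp-eventbus") -> dict:
    """Build one PutEvents entry, refusing details that break the contract."""
    errors = validate_detail(detail)
    if errors:
        raise ValueError("; ".join(errors))
    return {
        "Source": source,
        "DetailType": detail_type,
        "Detail": json.dumps(detail),  # PutEvents expects Detail as a JSON string
        "EventBusName": bus_name,
    }
```

The returned dictionary has the shape of one item in the Entries list of the EventBridge PutEvents API, so it could be passed to an SDK call such as boto3's put_events(Entries=[entry]).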
Step 2: EventBridge to Firehose (Zero Code)
An EventBridge rule routes matching events directly to a Kinesis Firehose delivery stream. No Lambda in between.
# CloudFormation snippet
AuditLogRule:
  Type: AWS::Events::Rule
  Properties:
    EventBusName: myapp-eventbus
    EventPattern:
      source:
        - prefix: "myapp"
    Targets:
      - Id: FirehoseTarget
        Arn: !GetAtt DeliveryStream.Arn
        RoleArn: !GetAtt EventBridgeRole.Arn
Firehose buffers events (60 seconds or 5 MB, whichever comes first) and writes them as GZIP-compressed JSON to S3.
Step 3: Raw Storage (S3)
Raw events land in a time-partitioned structure:
s3://myapp-audit-logs-bucket/raw/eventbridge/
  year=2026/month=05/day=01/hour=15/
    firehose-1-2026-05-01-15-00-00.json.gz
This layer is:
- Cheap — GZIP compressed, short retention
- Not queried directly — exists only as input for Glue
- Lifecycle managed — transitions to Glacier after 30 days, expires after 90
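Because the layout is keyed on delivery time, a batch job can address one full day of raw data with a single prefix. The Python sketch below shows that path arithmetic; the helper names are illustrative, and treating day boundaries as UTC is an assumption.

```python
from datetime import date, datetime, timedelta, timezone


def raw_prefix(bucket: str, day: date) -> str:
    """S3 prefix covering every hour= folder of one day in the raw layer."""
    return (
        f"s3://{bucket}/raw/eventbridge/"
        f"year={day:%Y}/month={day:%m}/day={day:%d}/"
    )


def yesterday_utc(now=None) -> date:
    """The day a daily job would process (assumption: UTC day boundaries)."""
    now = now or datetime.now(timezone.utc)
    return (now - timedelta(days=1)).date()
```

For example, raw_prefix("myapp-audit-logs-bucket", date(2026, 5, 1)) yields the year=2026/month=05/day=01/ prefix shown above.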
Step 4: Glue ETL (Daily Transformation)
A scheduled AWS Glue job runs daily and transforms yesterday's raw events into query-optimized Parquet:
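import sys
from datetime import datetime, timedelta, timezone

import pyspark.sql.functions as F
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Job setup (assumption: the bucket name arrives as the --bucket job argument)
bucket = getResolvedOptions(sys.argv, ["bucket"])["bucket"]
spark = GlueContext(SparkContext.getOrCreate()).spark_session
# Overwrite only the partitions this run writes, not the whole table
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Read yesterday's raw JSON (UTC day assumed)
yesterday = datetime.now(timezone.utc) - timedelta(days=1)
y, m, d = yesterday.strftime("%Y %m %d").split()
raw_df = spark.read.json(f"s3://{bucket}/raw/eventbridge/year={y}/month={m}/day={d}/")

# Extract detail and explode entities array (one row per changed entity)
detail_df = raw_df.select(
    F.col("id").alias("Id"),  # matches the Id column of the Athena table
    F.to_timestamp("time").alias("Timestamp"),
    F.col("source"),  # lowercase, to match the source= partition folder
    # dt is the raw partition date, so a re-run rewrites exactly one day
    F.lit(f"{y}-{m}-{d}").alias("dt"),
    F.col("detail.actor.name").alias("ActorName"),
    F.col("detail.actor.type").cast("string").alias("ActorType"),
    F.col("detail.entity_type").alias("EntityType"),
    F.col("detail.action").alias("Action"),
    F.col("detail.correlation_id").alias("CorrelationId"),
    F.explode("detail.entities").alias("entity_change"),
)

# Extract entity fields (always present)
result_df = detail_df.select(
    "*",
    F.col("entity_change.entity.id").cast("string").alias("EntityId"),
    F.col("entity_change.entity.name").alias("EntityName"),
    # Parent fields (optional: null when a record omits them)
    F.coalesce(F.col("entity_change.parent.id").cast("string"), F.lit(None)).alias("EntityParentId"),
    F.coalesce(F.col("entity_change.parent.name"), F.lit(None)).alias("EntityParentName"),
    F.coalesce(F.col("entity_change.parent.type"), F.lit(None)).alias("EntityParentType"),
    # Optional context strings
    F.coalesce(F.col("entity_change.channel_code"), F.lit(None)).alias("ChannelCode"),
    F.coalesce(F.col("entity_change.channel_account_code"), F.lit(None)).alias("ChannelAccountCode"),
    # Before/After as JSON strings
    F.to_json(F.col("entity_change.before")).alias("Before"),
    F.to_json(F.col("entity_change.after")).alias("After"),
).drop("entity_change")  # the nested struct is not part of the table

# Write partitioned Parquet
result_df.write \
    .mode("overwrite") \
    .partitionBy("dt", "source") \
    .parquet(f"s3://{bucket}/analytics/auditlog_parquet/")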
Key Points
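- F.coalesce(..., F.lit(None)): optional fields such as parent and channel_code stay null when a record omits them (this relies on the field appearing somewhere in that day's inferred schema; if it appears nowhere, the column reference fails)
- F.explode("detail.entities"): one event can produce multiple audit rows
- Parquet format: columnar, compressed, perfect for Athena
- Partitioned by dt and source: enables partition pruning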
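To make the row shape concrete without a Spark cluster, the same flattening can be imitated in plain Python. This is only an illustrative sketch of what the explode step produces, not the Glue job itself; the function name flatten_event is hypothetical, and the column names follow the Athena table defined in Step 5.

```python
import json


def _as_str(value):
    """Cast to string but keep missing values as None (null)."""
    return None if value is None else str(value)


def flatten_event(event: dict) -> list[dict]:
    """Turn one EventBridge event into one audit row per entity change."""
    detail = event["detail"]
    actor = detail.get("actor") or {}
    rows = []
    for change in detail.get("entities") or []:
        parent = change.get("parent") or {}  # optional object
        rows.append({
            "Id": event.get("id"),
            "Timestamp": event.get("time"),
            "source": event.get("source"),
            "ActorName": actor.get("name"),
            "ActorType": _as_str(actor.get("type")),
            "EntityType": detail.get("entity_type"),
            "Action": detail.get("action"),
            "CorrelationId": detail.get("correlation_id"),
            "EntityId": _as_str(change["entity"]["id"]),
            "EntityName": change["entity"]["name"],
            "EntityParentId": _as_str(parent.get("id")),
            "EntityParentName": parent.get("name"),
            "EntityParentType": parent.get("type"),
            "ChannelCode": change.get("channel_code"),
            "ChannelAccountCode": change.get("channel_account_code"),
            # Before/After are stored as JSON strings, or null when absent
            "Before": json.dumps(change["before"]) if "before" in change else None,
            "After": json.dumps(change["after"]) if "after" in change else None,
        })
    return rows
```

An event whose entities array holds two items yields two rows that share the event-level columns (Id, Timestamp, ActorName, and so on) and differ only in the entity-level ones.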
Step 5: Athena Table (Serverless SQL)
The Athena table uses partition projection — no crawlers, no MSCK REPAIR TABLE:
CREATE EXTERNAL TABLE audit_logs_db.auditlog (
  Id string,
  `Timestamp` timestamp,
  ActorName string,
  ActorType string,
  EntityId string,
  EntityName string,
  EntityType string,
  EntityParentId string,
  EntityParentName string,
  EntityParentType string,
  ChannelCode string,
  ChannelAccountCode string,
  Action string,
  Before string,
  After string,
  CorrelationId string
)
PARTITIONED BY (dt string, source string)
STORED AS PARQUET
LOCATION 's3://myapp-audit-logs-bucket/analytics/auditlog_parquet/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.dt.type' = 'date',
  'projection.dt.range' = '2024-01-01,NOW',
  'projection.dt.format' = 'yyyy-MM-dd',
  'projection.dt.interval' = '1',
  'projection.dt.interval.unit' = 'DAYS',
  'projection.source.type' = 'injected',
  'storage.location.template' = 's3://myapp-audit-logs-bucket/analytics/auditlog_parquet/dt=${dt}/source=${source}/'
);
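With partition projection, partition locations are computed from the filter values in the query and the storage.location.template instead of being looked up in the catalog. The Python sketch below imitates that substitution for a date range and a list of sources; it illustrates the idea only and says nothing about how Athena implements it. The function name projected_locations is hypothetical.

```python
from datetime import date, timedelta

TEMPLATE = (
    "s3://myapp-audit-logs-bucket/analytics/auditlog_parquet/"
    "dt=${dt}/source=${source}/"
)


def projected_locations(start: date, end: date, sources: list[str]) -> list[str]:
    """List the S3 prefixes a query over [start, end] and sources would touch."""
    locations = []
    day = start
    while day <= end:
        for source in sources:
            locations.append(
                TEMPLATE.replace("${dt}", day.isoformat())
                        .replace("${source}", source)
            )
        day += timedelta(days=1)  # projection.dt.interval = 1 DAYS
    return locations
```

This also shows why a filter on dt and source matters: each extra day or source in the filter adds exactly one more prefix to read.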
Example Queries
-- All updates by a specific user in the last week
SELECT *
FROM audit_logs_db.auditlog
WHERE dt >= '2026-05-27'
AND source = 'myapp.orderservice'
AND ActorName = 'jane@example.com'
AND Action = 'Update'
ORDER BY Timestamp DESC
LIMIT 100;
-- All changes to an Order and its child OrderItems
SELECT *
FROM audit_logs_db.auditlog
WHERE dt >= '2026-05-11' AND dt <= '2026-05-18'
AND source IN ('myapp.orderservice', 'myapp.inventoryservice')
AND (EntityId = '500' OR EntityParentId = '500')
ORDER BY Timestamp DESC;
Step 6: C# Client Library (NuGet Package)
We built a strongly-typed client that generates safe SQL and handles pagination:
Installation
dotnet add package MyApp.AuditLogs
Setup
services.AddAuditLogAthena(options =>
{
    options.Database = "audit_logs_db";
    options.Table = "auditlog";
    options.OutputLocation = "s3://athena-results-bucket/";
    options.WorkGroup = "primary";
});
Querying
var page = await client.QueryAsync(new AuditLogQuery
{
    From = DateOnly.FromDateTime(DateTime.UtcNow.AddDays(-7)),
    To = DateOnly.FromDateTime(DateTime.UtcNow),
    Sources = new[] { "myapp.orderservice" },
    Limit = 100,
    EntityId = "500",
    IncludeChildren = true, // Also fetches child entities (e.g. OrderItems)
    EntityTypes = new[] { "Order" },
    Actions = new[] { Action.Update }
});
// Paginate
if (page.NextCursor != null)
{
    var nextPage = await client.QueryAsync(new AuditLogQuery
    {
        // ... same filters ...
        Cursor = page.NextCursor,
        Direction = PageDirection.Next
    });
}
Security
The client builds SQL strings rather than using Athena's parameterized queries, so it implements:
- Input validation: max string length, date range limits (7 days max)
- SQL escaping: single quotes doubled (O'Brien → O''Brien)
- Type-safe enums: ActorType and Action prevent arbitrary injection
- Limit enforcement: max 1000 rows per query
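The checks in this list can be sketched in a few lines. The example below is a Python illustration of the same ideas, not the C# library's code: the function names, the 256-character cap, and the Create and Delete action names are assumptions, while the 7-day range, the 1000-row limit, and the quote doubling come from the list above.

```python
from datetime import date

MAX_RANGE_DAYS = 7
MAX_LIMIT = 1000
MAX_STRING_LENGTH = 256  # assumed cap; the post does not give the real value
ALLOWED_ACTIONS = {"Create", "Update", "Delete"}  # assumed enum members


def sql_literal(value: str) -> str:
    """Quote a string for SQL by doubling embedded single quotes."""
    if len(value) > MAX_STRING_LENGTH:
        raise ValueError("string value too long")
    return "'" + value.replace("'", "''") + "'"


def build_query(start: date, end: date, source: str, actor_name: str,
                action: str, limit: int = 100) -> str:
    """Build a bounded, escaped audit log query or raise ValueError."""
    if end < start or (end - start).days > MAX_RANGE_DAYS:
        raise ValueError("date range must be between 0 and 7 days")
    if action not in ALLOWED_ACTIONS:
        raise ValueError("unknown action")
    if not isinstance(limit, int) or not 1 <= limit <= MAX_LIMIT:
        raise ValueError("limit must be between 1 and 1000")
    return (
        "SELECT * FROM audit_logs_db.auditlog "
        # Dates come from date objects, so isoformat() cannot carry SQL
        f"WHERE dt >= '{start.isoformat()}' AND dt <= '{end.isoformat()}' "
        f"AND source = {sql_literal(source)} "
        f"AND ActorName = {sql_literal(actor_name)} "
        f"AND Action = '{action}' "  # safe: checked against the allow-list
        f"ORDER BY Timestamp DESC LIMIT {limit}"
    )
```

Free-text values only ever reach the SQL through sql_literal, while dates, actions, and limits are constrained by type or allow-list, so no caller-supplied string is concatenated unescaped.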
Cost Breakdown
For a system processing ~100K events/day:
| Service | Monthly Cost | Notes |
|---|---|---|
| EventBridge | ~$3 | $1/million events |
| Firehose | ~$0.15 | $0.029/GB × ~5GB/month |
| S3 (Raw) | ~$0.12 | Short retention, GZIP |
| S3 (Parquet) | ~$0.05 | Columnar, compressed |
| Glue | ~$13 | 2 DPU × 30 min/day at $0.44/DPU-hour |
| Athena | ~$0.50 | Partition pruning, Parquet |
| Total | ~$17/month | |
Compare this to running an always-on database or Elasticsearch cluster ($100-500+/month).
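The table's arithmetic can be reproduced with a short calculation. In the Python sketch below, the EventBridge and Firehose rates are the ones quoted in the table; the Glue line assumes 2 DPUs at $0.44 per DPU-hour, which is what reproduces the ~$13 figure, and the S3 and Athena amounts are copied from the table as-is. Actual rates vary by region and over time.

```python
def monthly_costs(events_per_day: int = 100_000, days: int = 30) -> dict:
    """Rough monthly cost per service in USD, under the stated assumptions."""
    costs = {
        # $1 per million events published to the bus
        "eventbridge": events_per_day * days / 1_000_000 * 1.00,
        # ~5 GB ingested per month at $0.029 per GB
        "firehose": 5 * 0.029,
        # Assumption: 2 DPUs x 0.5 h per day at $0.44 per DPU-hour
        "glue": 2 * 0.5 * days * 0.44,
        # Copied from the table, not derived here
        "s3_raw": 0.12,
        "s3_parquet": 0.05,
        "athena": 0.50,
    }
    costs["total"] = sum(costs.values())
    return costs
```

Under these assumptions the total comes to roughly $17, with Glue accounting for most of it, so job duration and DPU count are the figures most worth tuning.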
Deployment (GitHub Actions + OIDC)
We deploy using GitHub Actions with OIDC authentication — no long-lived AWS credentials:
name: Deploy Audit Logs
on:
  push:
    branches: [main]
permissions:
  id-token: write
  contents: read
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.DEPLOY_ROLE }}
          aws-region: us-east-1
      - name: Deploy CloudFormation
        run: |
          aws cloudformation deploy \
            --template-file aws/cf.yaml \
            --stack-name audit-logs-pipeline \
            --capabilities CAPABILITY_IAM
Lessons Learned
1. Not Everything Needs Real-Time Processing
Audit logs are written once and read occasionally. Daily batch transformation is perfectly adequate and dramatically cheaper than real-time Lambda processing.
2. Partition Projection Eliminates Operational Toil
Without partition projection, every new day requires running MSCK REPAIR TABLE or a Glue Crawler. Partition projection handles this automatically.
3. Make Optional Fields Truly Optional
The parent, channel_code, and before/after fields are all optional in our schema. The Glue job leaves them null when a record omits them (the F.coalesce() calls make that intent explicit), so no event is rejected for leaving them out.
4. Composite Cursors Handle Duplicate Timestamps
Using Timestamp + Id as a pagination cursor ensures stable ordering even when multiple events share the same timestamp.
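The post does not show the client's cursor format, so the following Python sketch is only one plausible way to implement a composite (keyset) cursor. The function names and the base64-encoded JSON encoding are assumptions; the pairing of Timestamp with Id as a tie-breaker is the part taken from the text.

```python
import base64
import json
from datetime import datetime


def encode_cursor(timestamp: str, row_id: str) -> str:
    """Pack the last row's sort key into an opaque, URL-safe token."""
    raw = json.dumps({"ts": timestamp, "id": row_id}).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(cursor: str) -> tuple[str, str]:
    """Unpack a token produced by encode_cursor."""
    data = json.loads(base64.urlsafe_b64decode(cursor.encode("ascii")))
    return data["ts"], data["id"]


def next_page_predicate(cursor: str) -> str:
    """WHERE fragment selecting rows strictly after the cursor position.

    Intended for use with ORDER BY Timestamp DESC, Id DESC.
    """
    ts, row_id = decode_cursor(cursor)
    # Re-parse the timestamp so a tampered cursor cannot inject SQL
    parsed = datetime.fromisoformat(ts)
    ts_lit = "TIMESTAMP '" + parsed.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3] + "'"
    id_lit = "'" + row_id.replace("'", "''") + "'"
    return (
        f"(Timestamp < {ts_lit} "
        f"OR (Timestamp = {ts_lit} AND Id < {id_lit}))"
    )
```

The second branch of the predicate is what keeps paging stable when several rows share a timestamp: rows with an equal Timestamp are still split deterministically by Id. This assumes the (Timestamp, Id) pair is unique per row.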
5. Validate at the Query Layer
Because the client builds SQL strings instead of using Athena's parameterized queries, input validation and escaping in the client library are critical. We also enforce a maximum date range (7 days) to bound how much data a single query can scan.
Summary
By combining EventBridge, Firehose, S3, Glue, and Athena — and intentionally avoiding Lambda — we built a production audit log system that:
✅ Scales automatically with event volume
✅ Costs ~$17/month for 100K events/day
✅ Supports hierarchical entity queries (parent-child)
✅ Provides cursor-based pagination for frontend integration
✅ Protects against SQL injection
✅ Handles optional fields gracefully
✅ Deploys with zero long-lived credentials
The key takeaway: choose the right tool for the job. For audit logs, a daily batch with Parquet + Athena beats a real-time Lambda pipeline on cost, simplicity, and reliability.