I have a number of projects in progress that are not finished yet, but this is one I wanted to complete before I finished packing and got on the plane to AWS re:Invent in Las Vegas this year. I'm hoping to pick up new techniques and meet many other people who build event-driven architectures every day, so I can learn from them.
I see so many great examples of using the managed and serverless services that cloud providers like AWS offer. Building a complete solution that costs less than $10 a month to run is a common occurrence with these types of builds. You can weigh the requirements and budget for any given project, choose from the many tools that are available with just an API call, and get charged only for how much you use them.
You can try this project out for yourself by checking out the code in my GitHub repo here → Github Repo
The Challenge
Smurf Memorabilia Inc. is a fictional retail chain with multiple store locations, and it needs a way to:
Collect daily sales data from each store location
Transform and store that data efficiently
Generate AI-powered business insights
Visualize results in dashboards
The key requirements include: low cost, minimal operational overhead, and pay only for what you use.
Stores will upload their sales data each day in an agreed format. The data will be processed and analyzed, analytics data will be updated, and AI-based recommendations will be generated. Key people will receive daily emails or SMS messages summarizing what is happening.
The Solution: 100% Serverless Architecture
My solution involves an event-driven ETL platform using managed AWS services. There are no servers to patch, no capacity to plan, and no minimum fees. You pay only when data flows through the system.
Services Used
| Service | Role | Pricing Model |
|---|---|---|
| AWS Lambda | All compute (17 functions) | Per invocation + duration |
| S3 | Object storage | Per GB stored + requests |
| DynamoDB | Metrics database | Per read/write unit (on-demand) |
| Step Functions | Workflow orchestration | Per state transition |
| EventBridge | Event routing | Free tier covers most use cases |
| Bedrock | AI analysis (Nova Lite) | Per token processed |
| API Gateway | REST API | Per request |
| SNS | Notifications | Per message |
These are a few of the managed/serverless offerings from AWS. You can piece together as many of these as you need to build your architecture. These scale automatically from zero to whatever capacity you need.
Smart Data Storage with Apache Parquet
One of the key architectural decisions was converting the raw uploaded JSON sales data into Apache Parquet format. This columnar storage format delivers significant benefits:
Huge Compression
Our 30-day dataset comparison:
Raw JSON uploads: 53.1 MB
Parquet files: 4.7 MB
My example dataset gets an 11x reduction in size using the default Parquet compression codec, and it can be switched to an even higher-compression codec if needed. This translates into real storage savings and faster query performance.
Why Parquet?
Columnar Storage: Only reads the columns you need, not entire rows
Built-in Compression: Uses efficient encoding (dictionary, run-length, delta)
Schema Enforcement: Explicit types prevent data quality issues
Ecosystem Support: Works with Athena, Spark, Pandas, and most analytics tools
Type-Safe Schema
We define an explicit PyArrow schema to ensure data quality. We want to make sure we keep track of which Smurf loot is popular every day and follow the trends.
import pyarrow as pa

PARQUET_SCHEMA = pa.schema([
("transaction_id", pa.string()),
("transaction_timestamp", pa.timestamp("ms")),
("item_sku", pa.string()),
("item_name", pa.string()),
("quantity", pa.int32()),
("unit_price", pa.decimal128(10, 2)),
("line_total", pa.decimal128(10, 2)),
("discount_amount", pa.decimal128(10, 2)),
("payment_method", pa.string()),
("customer_id", pa.string()),
])
This schema ensures that decimal precision is maintained (critical for financial data) and timestamps are properly typed for time-series analysis.
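To illustrate why explicit decimal types matter for financial data, here is a small stdlib-only sketch (not from the project) comparing binary floats with Python's decimal module, which is the same base-10 exactness that pa.decimal128(10, 2) preserves in the Parquet files:

```python
from decimal import Decimal

# Binary floats accumulate rounding error on currency values
float_total = 0.1 + 0.2
print(float_total)   # 0.30000000000000004

# decimal.Decimal keeps exact base-10 precision for money math
dec_total = Decimal("0.10") + Decimal("0.20")
print(dec_total)     # 0.30
```

Summing thousands of line totals as floats can drift by whole cents, which is exactly what a decimal128 column avoids.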
Hive-Style Partitioning for Efficient Queries
Raw uploads arrive with flat filenames like store_0001_2025-11-27.json. We transform these into a Hive-style partition structure:
s3://bucket/processed/
├── year=2025/
│ └── month=11/
│ ├── day=27/
│ │ ├── store_id=0001/data.parquet
│ │ ├── store_id=0002/data.parquet
│ │ └── ...
│ └── day=28/
│ └── ...
Why This Structure Matters
Partition Pruning: When you query "all sales for November 2025", tools like Amazon Athena only scan files in year=2025/month=11/ - not the entire dataset. This means:
Faster queries
Lower costs (Athena charges per TB scanned)
Better organization
The Transformation Code:
# Parse: store_0001_2025-11-27.json (regex shown for illustration)
import re
m = re.match(r"store_(\d{4})_(\d{4})-(\d{2})-(\d{2})\.json", "store_0001_2025-11-27.json")
store_id, year, month, day = m.groups()
# Output: year=2025/month=11/day=27/store_id=0001/data.parquet
output_key = f"processed/year={year}/month={month}/day={day}/store_id={store_id}/data.parquet"
This simple transformation enables sophisticated analytics without complex ETL pipelines.
Two Analytics Options (a web app and a more standard Business Intelligence approach)
I wanted to show how you could use multiple approaches to analyze the sales data. We need the best tools to keep track of those three-apples-tall blue creatures and all the ways their fans want to remember them. The first is a simpler web version built in ReactJS that runs in your browser. The second is a prototype set of Amazon Quick Suite dashboards. Depending on the audience, one of these approaches will likely work (or you could build something else).
Option 1: React Dashboard (Developer-Friendly)
The project includes a custom ReactJS application that queries the API directly:
The Web-based analytics approach is likely best for:
Custom visualizations
Embedding in existing applications
Full control over the user experience
No additional licensing costs
The React dashboard provides:
Real-time metrics display
File upload interface with drag-and-drop
Historical trend charts
AI-generated insights and recommendations
Option 2: Amazon Quick Suite (Business-Friendly)
This approach offers a managed Business Intelligence (BI) service that imports data from S3:
The Quick Suite approach is likely best for:
Business users who need self-service analytics
Ad-hoc exploration without writing code
Sharing dashboards with stakeholders
Built-in visualizations (no frontend development)
The current project exports five datasets to S3 in newline-delimited JSON format:
Store summaries (daily metrics per store)
Top products (best sellers)
Anomalies (AI-detected unusual patterns)
Trends (week-over-week analysis)
Recommendations (AI-generated action items)
Quick Suite's SPICE engine imports this data for fast, interactive dashboards.
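Newline-delimited JSON is simple to produce: one compact JSON document per line. A minimal sketch of building such an export (the field names here are illustrative, not the project's actual export schema):

```python
import json

# Hypothetical store-summary records; the real export schema lives in the repo
summaries = [
    {"store_id": "0001", "date": "2025-11-27", "revenue": 1234.56},
    {"store_id": "0002", "date": "2025-11-27", "revenue": 987.65},
]

# NDJSON: one JSON object per line, a format SPICE can ingest from S3
ndjson = "\n".join(json.dumps(row) for row in summaries)
print(ndjson)
```

Each line parses independently, so consumers can stream the file without loading it all at once.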
Choosing which analytics approach to use:
| Factor | React Dashboard | Quick Suite |
|---|---|---|
| Cost | Included (API calls only) | $24/month per author, $3/month per reader |
| Setup | Requires development | Point-and-click |
| Customization | Unlimited | Template-based |
| User Type | Developers | Business analysts |
| Embedding | Full control | Quick Suite embedding |
Many organizations could use both: ReactJS for customer-facing features, Quick Suite for internal analytics.
Event-Driven Processing
The platform uses an event-driven architecture where each component reacts to events rather than polling for work. I always try to use this type of architecture unless the use case really doesn't fit it. AWS Step Functions drives the upload processing as well as the analytics and recommendation flows.
Upload Processing Flow
Store uploads JSON file to S3 (via presigned URL)
S3 emits an Object Created event
EventBridge routes the event to Step Functions
Step Functions orchestrates the processing pipeline:
* Validate schema
* Convert to Parquet
* Calculate metrics
* Store in DynamoDB
* Check if all stores reported
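The last pipeline step above ("check if all stores reported") boils down to a set-containment test. A minimal sketch with hypothetical store IDs (the real roster would come from configuration or DynamoDB):

```python
# Hypothetical roster of expected store IDs
EXPECTED_STORES = {"0001", "0002", "0003"}

def all_stores_reported(reported: set) -> bool:
    """True once every expected store has uploaded for the day."""
    return EXPECTED_STORES.issubset(reported)

print(all_stores_reported({"0001", "0002"}))          # False
print(all_stores_reported({"0001", "0002", "0003"}))  # True
```

When this returns True, the workflow can kick off the daily analysis immediately instead of waiting for a schedule.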
Daily Analysis Trigger
When the last store uploads for a day, the system automatically triggers a smurfy comprehensive analysis:
The analysis runs exactly when the data is ready. But what if a store fails to report? A scheduled EventBridge rule runs at 11 PM local time as a fallback, ensuring you always get a daily report - even with partial data. The scheduler checks if analysis already ran for that day and skips if so.
If invalid data is uploaded, the key stakeholders receive notifications via SNS (email or SMS) so they can follow up with users. If the processing flow fails on the first attempt, it has built-in retry and backoff mechanisms.
Daily Email Reports
Once analysis completes, the platform automatically sends a daily summary email via SNS containing:
Total revenue across all stores
Top performing store of the day
AI-detected anomalies and unusual patterns
Business recommendations from Bedrock
Stakeholders receive insights in their inbox without logging into any dashboard.
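A hedged sketch of how such a summary might be formatted before publishing (the metric names are illustrative; the real values come from DynamoDB, and the actual send is a single sns.publish call in boto3):

```python
# Hypothetical daily metrics aggregated by the analysis step
metrics = {"total_revenue": 15234.50, "top_store": "0002", "anomalies": 1}

# Plain-text body for the SNS email/SMS subscriptions
body = (
    f"Daily Sales Summary\n"
    f"Total revenue: ${metrics['total_revenue']:,.2f}\n"
    f"Top store: {metrics['top_store']}\n"
    f"Anomalies detected: {metrics['anomalies']}\n"
)
print(body)
# Publishing would then be one boto3 call, e.g.:
# boto3.client("sns").publish(TopicArn=topic_arn, Subject="Daily Sales Summary", Message=body)
```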
AI-Powered Insights with Amazon Bedrock
The solution uses Amazon Bedrock with the Nova Lite model (configurable to whatever model you want) to generate business intelligence:
Anomaly Detection: Identifies stores with unusual revenue patterns
Trend Analysis: Compares current performance to historical baselines
Recommendations: Generates actionable business advice
Bedrock is pay-per-token with no minimum commitment - so it’s perfect for batch processing workloads.
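As a hedged sketch of what such a call can look like with boto3's Converse API (the prompt wording and model ID are assumptions, not the project's exact code; running it requires AWS credentials and Bedrock model access):

```python
def build_prompt(daily_metrics: dict) -> str:
    """Assemble the analysis prompt from the day's aggregated metrics."""
    return (
        "You are a retail analyst. Given these daily store metrics, "
        "list any anomalies and three actionable recommendations.\n"
        f"Metrics: {daily_metrics}"
    )

def analyze(daily_metrics: dict) -> str:
    # boto3 imported here so the pure prompt builder works without AWS installed
    import boto3
    client = boto3.client("bedrock-runtime")
    resp = client.converse(
        modelId="amazon.nova-lite-v1:0",  # assumption: swap in whatever model you prefer
        messages=[{"role": "user", "content": [{"text": build_prompt(daily_metrics)}]}],
    )
    return resp["output"]["message"]["content"][0]["text"]
```

Because billing is per token, keeping the metrics payload compact directly lowers the cost of each daily analysis.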
The Cost Breakdown
Here's what this platform actually costs for a typical month (e.g., 330 file uploads = 11 stores × 30 days):
| Service | Monthly Cost | Notes |
|---|---|---|
| Lambda | ~$2.00 | 17 functions, ~1000 invocations each |
| Step Functions | ~$0.50 | 360 workflow executions |
| DynamoDB | ~$1.00 | On-demand mode, ~1000 ops |
| S3 | ~$0.01 | ~60 MB stored |
| Bedrock | ~$5.00 | Nova Lite, 30 daily analyses |
| EventBridge | ~$0.00 | Free tier |
| SNS | ~$0.10 | Email notifications |
| CloudWatch Alarms | ~$0.00 | 7 alarms (first 10 free) |
| Total | ~$8.61 | |
Add Quick Suite (if needed) for $24/month per author to build dashboards, or just $3/month per reader for view-only access.
Why is this all so cheap?
ARM64 Architecture: Lambda on Graviton2 is ~20% cheaper than x86
Parquet Compression: ~ 11x less storage than JSON
On-Demand DynamoDB: Pay only for actual read/write operations
Event-Driven: No idle compute costs
Infrastructure as Code (IaC)
I’m a big advocate of using IaC for everything. My favourite tools for this are Terraform, the Serverless Application Model (SAM), and the Cloud Development Kit (CDK). In this case there is VPC provisioning and a lot of resources, so I chose my go-to tool, Terraform. One command deploys everything:
terraform apply
Here are some key snippets from the infrastructure code:
Lambda Functions (ARM64 for Cost Savings)
Lambda is the best place to host your business logic when code execution times are short. All 17 Lambda functions use the ARM64 architecture (Graviton2) for ~20% cost savings:
resource "aws_lambda_function" "process_upload" {
  filename      = data.archive_file.process_upload_zip.output_path
  function_name = "process_upload"
  role          = aws_iam_role.lambda_role.arn
  handler       = "process_upload.lambda_handler"
  runtime       = "python3.13"
  architectures = ["arm64"]
  timeout       = 30
  memory_size   = 1024
  layers        = [local.powertools_layer_arn, local.pandas_layer_arn]

  tracing_config {
    mode = "Active"
  }

  environment {
    variables = merge(local.powertools_env_vars, {
      S3_BUCKET        = aws_s3_bucket.upload_bucket.id
      PROCESSED_PREFIX = var.processed_prefix
    })
  }
}
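For context, here is a hedged sketch of what the entry point of such a handler can look like (the real process_upload logic lives in the repo; this only echoes the event shape EventBridge delivers for S3 Object Created events):

```python
import json

def lambda_handler(event, context):
    # EventBridge wraps the S3 details under event["detail"]
    detail = event.get("detail", {})
    bucket = detail.get("bucket", {}).get("name", "")
    key = detail.get("object", {}).get("key", "")
    # ...validate the schema, convert to Parquet, write to the processed/ prefix...
    return {"statusCode": 200, "body": json.dumps({"bucket": bucket, "key": key})}
```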
DynamoDB (Pay-Per-Request)
DynamoDB is my favourite database to use with AWS. It is truly serverless, and tables are ready to use in seconds. It offers on-demand billing, which means zero compute cost when idle:
resource "aws_dynamodb_table" "sales_data" {
  name         = "SalesData"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "PK"
  range_key    = "SK"

  attribute {
    name = "PK"
    type = "S"
  }

  attribute {
    name = "SK"
    type = "S"
  }

  # GSI key attributes must also be declared
  attribute {
    name = "GSI1PK"
    type = "S"
  }

  attribute {
    name = "GSI1SK"
    type = "S"
  }

  # GSI for querying by date across all stores
  global_secondary_index {
    name            = "GSI1"
    hash_key        = "GSI1PK"
    range_key       = "GSI1SK"
    projection_type = "ALL"
  }
}
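The exact PK/SK and GSI1 key formats aren't shown in this post, but as one hypothetical single-table layout (purely illustrative, not the repo's actual scheme), the keys could be composed so the table groups a store's days together while GSI1 groups a day's stores together:

```python
def item_keys(store_id: str, date: str) -> dict:
    """Hypothetical composite keys for a daily sales-metrics item."""
    return {
        "PK": f"STORE#{store_id}",    # table: all days for one store
        "SK": f"DATE#{date}",
        "GSI1PK": f"DATE#{date}",     # index: all stores for one day
        "GSI1SK": f"STORE#{store_id}",
    }

print(item_keys("0001", "2025-11-27"))
```

Inverting the key pair on the GSI is a common single-table pattern for supporting both query directions with one table.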
EventBridge (S3 to Step Functions)
EventBridge is my favourite AWS service. It offers rules for reacting to events, pipes for bridging data across AWS services, and a nice scheduler. Here I'm using a simple rule that routes S3 uploads to the processing workflow:
resource "aws_cloudwatch_event_rule" "s3_upload" {
  name        = "capture-s3-uploads"
  description = "Capture all S3 object uploads"

  event_pattern = jsonencode({
    source        = ["aws.s3"]
    "detail-type" = ["Object Created"]
    detail = {
      bucket = {
        name = [aws_s3_bucket.upload_bucket.id]
      }
      object = {
        key = [{ prefix = var.upload_prefix }]
      }
    }
  })
}

resource "aws_cloudwatch_event_target" "step_function" {
  rule      = aws_cloudwatch_event_rule.s3_upload.name
  target_id = "UploadProcessorStepFunction"
  arn       = aws_sfn_state_machine.upload_processor.arn
  role_arn  = aws_iam_role.eventbridge_step_function_role.arn
}
Step Functions (Workflow Orchestration)
In many cases you want to tightly control and track the flow of processing in your app. AWS Step Functions state machines are defined as JSON templates with Lambda ARNs injected:
resource "aws_sfn_state_machine" "upload_processor" {
  name     = "upload-processor"
  role_arn = aws_iam_role.step_function_role.arn

  definition = templatefile("${path.module}/../backend/state-machines/upload-processor.json", {
    process_upload_lambda_arn        = aws_lambda_function.process_upload.arn
    calculate_metrics_lambda_arn     = aws_lambda_function.calculate_metrics.arn
    write_metrics_lambda_arn         = aws_lambda_function.write_metrics.arn
    check_all_stores_lambda_arn      = aws_lambda_function.check_all_stores.arn
    sns_alerts_topic_arn             = aws_sns_topic.sales_alerts.arn
    daily_analysis_state_machine_arn = aws_sfn_state_machine.daily_analysis.arn
  })
}
S3 Bucket (Secure by Default)
S3 is at the core of storing data for so many apps today. My setup has Public access blocked, encryption enabled, and EventBridge notifications on:
resource "aws_s3_bucket_public_access_block" "upload_bucket_public_access_block" {
  bucket = aws_s3_bucket.upload_bucket.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_s3_bucket_notification" "bucket_notification" {
  bucket      = aws_s3_bucket.upload_bucket.id
  eventbridge = true
}
The complete infrastructure includes:
17 Lambda functions
2 Step Functions state machines
API Gateway with 5 endpoints
DynamoDB table with GSI
S3 bucket with security policies
EventBridge rules
SNS topics
IAM roles with least-privilege policies
To set all this up there is no clicking through console pages and no manual configuration drift.
Key Takeaways
Serverless doesn't mean simple - it means you focus on business logic instead of infrastructure.
Parquet is worth the conversion - the great compression pays for itself in storage and query costs.
Hive partitioning enables scale - organize data for how it will be queried, not how it arrives.
Event-driven beats polling - let AWS route events instead of writing schedulers.
Pay-as-you-go works - for variable workloads, managed services beat reserved capacity.
Offer analytics options - different users have different needs; support both custom dashboards and BI tools.
Try It Yourself
The complete source code for my solution is available on GitHub, including:
Terraform infrastructure definitions
17 Lambda functions (Python 3.13)
React frontend application
Sample data generator
Quick Suite setup scripts
Deploy your own instance and start processing data in under 30 minutes.
Built with AWS Lambda, Step Functions, S3, DynamoDB, EventBridge, Bedrock, API Gateway, SNS, and optionally Quick Suite.
CLEANUP (IMPORTANT!!)
If you do end up deploying this yourself, please understand that some of the included resources will cost you a small amount of real money. Please don’t forget about them.
Please MAKE SURE TO DELETE the stack if you are no longer using it. Running terraform destroy takes care of this, or you can delete the resources in the AWS console.
Try the setup in your AWS account
You can clone the GitHub repo and try this out in your own AWS account. The README.md file describes any changes you need to make for it to work in your account.
Please let me know if you have any suggestions or problems trying out this example project.
For more articles from me please visit my blog at Darryl's World of Cloud or find me on Bluesky, X, LinkedIn, Medium, Dev.to, or the AWS Community.
For tons of great serverless content and discussions please join the Believe In Serverless community we have put together at this link: Believe In Serverless Community