Introduction
Hello everyone, my name is Jo, and welcome back to my engineering blog!
Today I'm going to talk about something cool. Let's go! 😎
In today's data-driven landscape, organizations collect vast amounts of information from many sources, including structured databases, APIs, IoT sensors, and application logs. Much of this data lands in formats like CSV and JSON files that need to be stored, processed, and analyzed efficiently. While Amazon S3 is an excellent, scalable storage solution for these files, extracting meaningful insights from data sitting in S3 buckets is a significant engineering challenge 🤕, especially when stakeholders need regular access to visualizations and reports without the overhead of complex data pipeline management.
Traditional approaches often involve provisioning expensive database instances, writing ETL jobs to move data around, and maintaining multiple data stores just to enable business intelligence tools like Power BI to access the information they need. In this article, I'll walk you through a cost-effective, serverless architecture that uses AWS Glue for data cataloging, Amazon Athena for SQL querying, and Power BI for visualization. This creates a streamlined solution that maintains S3 as your single source of truth while providing the SQL interface that Power BI expects, all without the complexity and cost of traditional database provisioning. Let's dive into it!
Situation
It's important to understand that the solution presented in this article is designed for a specific set of circumstances and constraints. While it may apply to many similar use cases, every engineering decision should start with clearly defined functional and non-functional requirements that match your particular context. In our scenario, CSV files are deposited into an S3 bucket at regular intervals, creating a steady stream of data that needs to be accessible for analysis and reporting. The files arrive with a consistent, well-defined schema, and the data has already been cleaned and sanitized upstream, so we don't need to handle data quality issues, type conversions, or missing fields within our solution. Most importantly, the data represents a single business entity without complex relationships or foreign key dependencies that would require maintaining referential integrity across multiple tables, so each file can be treated as a self-contained dataset that is processed and queried independently.
CSV Files (Regular Intervals) → S3 Bucket Storage
↓
Well-structured + Pre-sanitized Data
↓
Single Entity (No Relational Dependencies)
↓
Ready for Processing (SQL)
Problem/Requirements
The core challenge lies in making the data dumped into S3 buckets accessible and actionable for stakeholders who need regular visibility into operational insights through dashboards and reports. Currently, the team handles similar data requirements by loading CSV contents into traditional SQL databases, which then serve as the data source for Power BI through standard database connections. This solution exists primarily because Power BI has robust, well-established SQL connectivity and most teams are familiar with this approach. However, this conventional method introduces several significant problems that compound over time and create unnecessary operational overhead.
Key Issues with the Current Approach:
- Infrastructure Costs - Provisioning and maintaining database instances (RDS, SQL Server, etc.) for what is essentially file storage creates ongoing monthly expenses that scale with data volume and performance requirements
- Unnecessary Complexity - Data gets written into SQL tables purely for Power BI access, not because relational database features like ACID transactions, foreign keys, or complex joins are actually needed for this use case
- Access Management Overhead - Every new database introduces another set of credentials, connection strings, and security policies that must be managed, rotated, and monitored
- Increased System Dependencies - Adding database services means more moving parts that can fail, require updates, need backup strategies, and demand monitoring; each new service multiplies your potential points of failure
- Data Refresh Constraints - Power BI's periodic refresh cycles from SQL databases create timing dependencies where dashboard data freshness becomes a configuration trade-off rather than being driven by actual data availability
- Data Fragmentation - Managing multiple data types across different storage solutions (files in S3, structured data in databases) creates inconsistencies in backup strategies, access patterns, and operational procedures
Solution
The Proposed Architecture
The solution replaces the traditional database-centric approach with a serverless, event-driven architecture that treats S3 as both storage and the single source of truth for all data operations. This approach maintains data in its original CSV format while providing the SQL query interface that Power BI requires, eliminating the need for intermediate database layers entirely.
Core Requirements
Our replacement solution addresses four fundamental requirements:
- Single Source of Truth - All data remains in S3 in CSV format, eliminating data duplication and synchronization issues
- Intelligent Cataloging - Files are automatically cataloged with schema information and logical partitioning, similar to database table structures but without the database overhead
- Event-Driven Processing - New file arrivals trigger automatic catalog updates, ensuring data is immediately available for querying
- SQL Query Layer - Athena provides the SQL interface that Power BI expects, querying directly against cataloged S3 data
Architecture Components
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  CSV File   │───▶│  S3 Bucket  │───▶│  S3 Event   │───▶│  SQS Queue  │
└─────────────┘    └─────────────┘    └─────────────┘    └──────┬──────┘
                                                                │
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌──────┴──────┐
│  Power BI   │◀───│   Athena    │◀───│ Glue Catalog│◀───│   Lambda    │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
Tool Breakdown
| Component | Purpose | Role in Solution |
| --- | --- | --- |
| S3 | Primary data storage | Houses original CSV files and partitioned data |
| SQS | Event queue | Buffers S3 file creation events for processing |
| Lambda | Event processor | Reads files, creates partitions, updates Glue catalog |
| Glue | Data catalog | Maintains schema and partition metadata for Athena |
| Athena | Query engine | Provides SQL interface for Power BI connectivity |
Why This Solution Works
Cost Efficiency
Athena operates on a pay-per-query model, charging approximately $5 per TB of data scanned (AWS Athena Pricing), which is dramatically cheaper than maintaining always-on database instances that can cost hundreds of dollars monthly regardless of usage patterns.
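To make that concrete, here's a quick back-of-the-envelope comparison in TypeScript. The scan sizes and query counts below are illustrative assumptions, not measurements from a real workload:

// Rough monthly cost estimate for Athena's pay-per-query model
const pricePerTBScanned = 5;    // USD, per the Athena pricing page
const avgScanPerQueryGB = 0.5;  // assumed average data scanned per query
const queriesPerMonth = 2000;   // assumed dashboard refreshes + ad hoc queries

const monthlyCost = (avgScanPerQueryGB / 1024) * pricePerTBScanned * queriesPerMonth;
console.log(`Estimated Athena cost: ~$${monthlyCost.toFixed(2)}/month`); // ≈ $4.88

Even at thousands of queries per month scanning half a gigabyte each, the bill stays in the single digits, while an always-on database instance costs the same whether anyone queries it or not.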
Operational Simplicity
- Single Data Format - Everything stays in CSV, eliminating format conversion complexity
- Unified Access Control - S3 IAM policies manage all data access, no separate database credentials
- Minimal Dependencies - The entire pipeline uses managed AWS services with no servers to maintain
- Direct Integration - Athena provides native Power BI connectivity through ODBC/JDBC drivers
Performance Features
- Query Caching - Athena automatically caches results for repeated queries
- Partition Pruning - Only scans relevant data partitions, dramatically reducing query costs
- Columnar Optimization - Can easily migrate to Parquet format later for even better performance (see the sketch below)
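As one example of that migration path, Athena supports CTAS (CREATE TABLE AS SELECT) statements that can rewrite the CSV-backed table as Parquet. The following is a minimal sketch using the aws-sdk Athena client; the table, bucket, and workgroup names are assumed from the setup introduced later in this article:

import { Athena } from 'aws-sdk';

const athena = new Athena();

// CTAS statement: rewrite the CSV table as partitioned Parquet
const ctas = `
  CREATE TABLE analytics_db.sales_data_parquet
  WITH (
    format = 'PARQUET',
    external_location = 's3://my-company-analytics/sales_data_parquet/',
    partitioned_by = ARRAY['year', 'month', 'day']
  ) AS
  SELECT * FROM analytics_db.sales_data
`;

async function migrateToParquet(): Promise<string> {
  const result = await athena.startQueryExecution({
    QueryString: ctas,
    WorkGroup: 'analytics_workgroup',
  }).promise();
  return result.QueryExecutionId!; // poll getQueryExecution to track completion
}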
Trade-offs and Considerations
Performance Limitations
-- Traditional DB: Sub-second response for cached data
SELECT * FROM sales WHERE date = '2024-01-15';
-- Athena: 2-10 second response depending on data size
SELECT * FROM sales WHERE year='2024' AND month='01' AND day='15';
Athena queries have higher latency than traditional databases since they scan files rather than using pre-built indexes, but this trade-off is acceptable when query frequency is moderate and cost savings are significant.
Initial Setup Complexity
Power BI + Athena Integration Requirements:
- Driver Installation - ODBC/JDBC drivers on dashboard servers
- Network Configuration - VPC endpoints and security group rules for federated AWS accounts
- Authentication Setup - IAM roles and cross-account access policies
- Gateway Configuration - On-premises data gateway driver installation
# Example driver setup
wget https://s3.amazonaws.com/athena-downloads/drivers/ODBC/SimbaAthenaODBC-1.1.17.1000-Linux64.zip
# Configure connection string in Power BI
# Server: athena.us-east-1.amazonaws.com
# Database: your_glue_database
While the initial connectivity setup between Power BI and Athena can be complex, especially in enterprise environments with federated AWS accounts, these are one-time configuration challenges that become valuable learning experiences for teams expanding their cloud-native analytics capabilities.
Suggested Implementation
This implementation guide uses TypeScript, but the concepts translate to any language. The key is building a maintainable, extensible solution that follows the Open/Closed Principle - open for extension, closed for modification.
1. Core Schema Design (Critical Prerequisite)
Schema Analysis and Configuration
Start by examining your S3 files to understand the data structure and extract partition information from filenames:
// Example filename: sales_data_2024_03_15.csv
// Partition extraction: year=2024, month=03, day=15
const filename = "sales_data_2024_03_15.csv";
const partitionPattern = /sales_data_(\d{4})_(\d{2})_(\d{2})\.csv/;
const match = filename.match(partitionPattern);
if (!match) throw new Error(`Unexpected filename format: ${filename}`);
const [, year, month, day] = match; // "2024", "03", "15"
Shared Configuration Structure
Create a centralized configuration that serves both your code and Infrastructure as Code (IAC):
// shared-config.json
{
  "aws": {
    "region": "us-east-1",
    "account_id": "${AWS_ACCOUNT_ID}"
  },
  "resources": {
    "source_bucket": "my-company-data-source",
    "analytics_bucket": "my-company-analytics",
    "glue_database": "analytics_db",
    "workgroup_name": "analytics_workgroup",
    "sqs_queue": "file_processing_queue"
  },
  "entities": {
    "sales": {
      "table_name": "sales_data",
      "filename_pattern": "sales_data_(\\d{4})_(\\d{2})_(\\d{2})\\.csv",
      "partition_keys": ["year", "month", "day"],
      "schema": [
        {"name": "transaction_id", "type": "string"},
        {"name": "customer_id", "type": "string"},
        {"name": "amount", "type": "decimal(10,2)"},
        {"name": "transaction_date", "type": "date"}
      ]
    }
  }
}
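Since both the Lambda code and the IaC reference this file, it helps to load it once behind a typed interface. Here's one possible loader; the file layout and interface names are my own sketch, not a prescribed structure:

// config.ts
import * as fs from 'fs';

export interface ColumnDef { name: string; type: string; }
export interface EntityConfig {
  table_name: string;
  filename_pattern: string;
  partition_keys: string[];
  schema: ColumnDef[];
}
export interface SharedConfig {
  aws: { region: string; account_id: string };
  resources: {
    source_bucket: string;
    analytics_bucket: string;
    glue_database: string;
    workgroup_name: string;
    sqs_queue: string;
  };
  entities: Record<string, EntityConfig>;
}

// Parse the shared config once at module load
export const config: SharedConfig = JSON.parse(
  fs.readFileSync('shared-config.json', 'utf-8')
);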
2. Infrastructure as Code
Account-Agnostic Infrastructure
Structure your IAC to work across any AWS account with minimal changes:
# terraform/main.tf or cloudformation template
# All names reference shared config
resource "aws_s3_bucket" "analytics_bucket" {
  bucket = var.shared_config.resources.analytics_bucket

  # Inline versioning works on AWS provider < 4.x; newer provider
  # versions use a separate aws_s3_bucket_versioning resource
  versioning {
    enabled = true
  }
}

resource "aws_glue_catalog_database" "analytics_db" {
  name = var.shared_config.resources.glue_database
}

resource "aws_athena_workgroup" "analytics" {
  name = var.shared_config.resources.workgroup_name

  configuration {
    result_configuration {
      output_location = "s3://${aws_s3_bucket.analytics_bucket.bucket}/query-results/"
    }
    enforce_workgroup_configuration    = true
    publish_cloudwatch_metrics_enabled = true
  }
}
Complete Resource Setup
// Infrastructure components from shared config
const infraConfig = {
  s3: {
    sourceBucket: config.resources.source_bucket,
    analyticsBucket: config.resources.analytics_bucket
  },
  glue: {
    database: config.resources.glue_database,
    tables: Object.keys(config.entities)
  },
  sqs: {
    queueName: config.resources.sqs_queue,
    // Note: Standard queue only - S3 events don't support FIFO
    type: "Standard"
  },
  lambda: {
    functionName: "file-processor",
    runtime: "nodejs22.x",
    sqsTrigger: true
  }
};
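If you'd rather keep the infrastructure in TypeScript too, the same S3 → SQS → Lambda wiring can be expressed with AWS CDK v2. This is an alternative sketch under my own assumptions (construct IDs, file paths, a recent aws-cdk-lib), not this article's Terraform setup:

// cdk/pipeline-stack.ts
import { Stack, StackProps, Duration } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as s3n from 'aws-cdk-lib/aws-s3-notifications';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { SqsEventSource } from 'aws-cdk-lib/aws-lambda-event-sources';

export class FilePipelineStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const sourceBucket = new s3.Bucket(this, 'SourceBucket');

    // Standard queue only - S3 event notifications can't target FIFO queues
    const queue = new sqs.Queue(this, 'FileQueue', {
      visibilityTimeout: Duration.minutes(5),
    });

    // Only .csv uploads trigger processing
    sourceBucket.addEventNotification(
      s3.EventType.OBJECT_CREATED,
      new s3n.SqsDestination(queue),
      { suffix: '.csv' }
    );

    const processor = new lambda.Function(this, 'FileProcessor', {
      runtime: lambda.Runtime.NODEJS_22_X,
      handler: 'fileProcessor.handler',
      code: lambda.Code.fromAsset('lambda'),
    });
    processor.addEventSource(new SqsEventSource(queue));
  }
}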
3. Core Lambda Logic
File Processing Flow
// lambda/fileProcessor.ts
import { SQSEvent, SQSRecord } from 'aws-lambda';
import { S3, Glue, Athena } from 'aws-sdk';
import { config, EntityConfig } from './config'; // shared config loader (see config.ts above)

const s3 = new S3();

export const handler = async (event: SQSEvent) => {
  for (const record of event.Records) {
    await processS3File(record);
  }
};

async function processS3File(sqsRecord: SQSRecord) {
  // 1. Extract S3 event from SQS message
  const s3Event = JSON.parse(sqsRecord.body);
  const bucket = s3Event.Records[0].s3.bucket.name;
  const key = s3Event.Records[0].s3.object.key;

  // 2. Get entity configuration
  const entityConfig = getEntityFromFilename(key);

  // 3. Extract partition information
  const partitions = extractPartitions(key, entityConfig.filename_pattern);

  // 4. Read and validate file
  const fileContent = await s3.getObject({ Bucket: bucket, Key: key }).promise();

  // 5. Create partitioned path, keeping the original filename at the leaf
  const originalFilename = key.split('/').pop()!;
  const partitionedPath = buildPartitionPath(entityConfig, partitions, originalFilename);

  // 6. Write to analytics bucket
  await s3.putObject({
    Bucket: config.resources.analytics_bucket,
    Key: partitionedPath,
    Body: fileContent.Body
  }).promise();

  // 7. Update Glue catalog
  await updateGlueCatalog(entityConfig, partitions, partitionedPath);

  // 8. Optional: Validate with Athena query
  await validateCatalogUpdate(entityConfig.table_name, partitions);
}
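The handler above references two helpers, getEntityFromFilename and extractPartitions, that are worth making explicit. Here's a minimal sketch of how they could work against the shared config; the implementations are my assumptions:

function getEntityFromFilename(key: string): EntityConfig {
  const filename = key.split('/').pop() ?? key;
  const entity = Object.values(config.entities).find(e =>
    new RegExp(e.filename_pattern).test(filename)
  );
  if (!entity) throw new Error(`No entity config matches file: ${key}`);
  return entity;
}

function extractPartitions(key: string, pattern: string): Record<string, string> {
  const filename = key.split('/').pop() ?? key;
  const match = filename.match(new RegExp(pattern));
  if (!match) throw new Error(`Filename does not match pattern: ${filename}`);
  // Assumes capture groups appear in the same order as partition_keys
  const { partition_keys } = getEntityFromFilename(key);
  return Object.fromEntries(
    partition_keys.map((k, i) => [k, match[i + 1]])
  );
}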
Partition Management
function buildPartitionPath(
  entityConfig: EntityConfig,
  partitions: Record<string, string>,
  originalFilename: string
): string {
  // Example: sales_data/year=2024/month=03/day=15/sales_data_2024_03_15.csv
  const partitionPath = entityConfig.partition_keys
    .map(key => `${key}=${partitions[key]}`)
    .join('/');
  return `${entityConfig.table_name}/${partitionPath}/${originalFilename}`;
}

async function updateGlueCatalog(
  entityConfig: EntityConfig,
  partitions: Record<string, string>,
  s3Path: string
) {
  const glue = new Glue();
  // A partition's location is its directory prefix, not the file itself
  const partitionDir = s3Path.substring(0, s3Path.lastIndexOf('/'));
  await glue.createPartition({
    DatabaseName: config.resources.glue_database,
    TableName: entityConfig.table_name,
    PartitionInput: {
      Values: entityConfig.partition_keys.map(key => partitions[key]),
      StorageDescriptor: {
        Location: `s3://${config.resources.analytics_bucket}/${partitionDir}/`,
        InputFormat: 'org.apache.hadoop.mapred.TextInputFormat',
        OutputFormat: 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
        SerdeInfo: {
          SerializationLibrary: 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
        }
      }
    }
  }).promise();
}
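And the optional validation step (step 8) can be a cheap Athena query against the new partition. This sketch assumes the aws-sdk Athena client; a production version would also poll getQueryExecution until the query finishes:

async function validateCatalogUpdate(
  tableName: string,
  partitions: Record<string, string>
): Promise<void> {
  const athena = new Athena();
  const where = Object.entries(partitions)
    .map(([key, value]) => `${key}='${value}'`)
    .join(' AND ');
  // Fire a count query scoped to the new partition to confirm it's queryable
  await athena.startQueryExecution({
    QueryString: `SELECT count(*) FROM ${tableName} WHERE ${where}`,
    WorkGroup: config.resources.workgroup_name,
    QueryExecutionContext: { Database: config.resources.glue_database },
  }).promise();
}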
4. Power BI Connection Setup
Prerequisites Checklist
# 1. Local Development Setup
# Download Athena ODBC driver
wget https://s3.amazonaws.com/athena-downloads/drivers/ODBC/SimbaAthenaODBC-1.1.17.1000-Windows.msi
# 2. Gateway Installation (for enterprise distribution)
# Install driver on Power BI Gateway machine
# Configure data source in gateway admin console
Connection Configuration
# ODBC Connection String
Driver={Amazon Athena ODBC Driver};
Server=athena.us-east-1.amazonaws.com;
Port=443;
Database=analytics_db;
Workgroup=analytics_workgroup;
AuthenticationType=IAM Credentials;
UID=AKIA...;
PWD=secret_key;
S3OutputLocation=s3://my-company-analytics/query-results/;
Network Requirements
# VPC Endpoint Configuration (if using private connectivity)
vpc_endpoints:
  - service: com.amazonaws.us-east-1.athena
    route_table_ids: ["rtb-xxx"]
  - service: com.amazonaws.us-east-1.s3
    route_table_ids: ["rtb-xxx"]

# Security Group Rules
security_groups:
  athena_access:
    ingress:
      - protocol: tcp
        ports: [443]
        source: "power-bi-gateway-sg"
IAM Service Account
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "athena:*",
        "glue:GetDatabase",
        "glue:GetTable",
        "glue:GetPartitions",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:athena:*:*:workgroup/analytics_workgroup",
        "arn:aws:glue:*:*:catalog",
        "arn:aws:glue:*:*:database/analytics_db",
        "arn:aws:glue:*:*:table/analytics_db/*",
        "arn:aws:s3:::my-company-analytics",
        "arn:aws:s3:::my-company-analytics/*"
      ]
    }
  ]
}
Security Best Practices
- Workspace Binding - Restrict report access to specific distribution lists
- Row-Level Security - Implement in Power BI if data contains sensitive information
- Credential Rotation - Set up automatic rotation for IAM access keys
- VPC Isolation - Use VPC endpoints for private connectivity when possible
This implementation provides a solid foundation that's extensible for additional entities and maintainable across different AWS environments.
Conclusion
This serverless analytics architecture demonstrates how modern cloud services can dramatically simplify data pipelines while reducing costs and operational overhead. By treating S3 as both storage and source of truth, we've eliminated the traditional database layer that often exists purely to satisfy business intelligence tool requirements, not actual business logic needs.
The solution delivers several key advantages: substantial cost savings through Athena's pay-per-query model versus always-on database instances, operational simplicity with fewer moving parts to manage and monitor, and architectural flexibility that scales with data volume and query frequency. The event-driven design ensures data is immediately available for analysis without complex ETL scheduling, while the shared configuration approach makes the entire solution extensible for additional data sources and entities.
While the initial Power BI connectivity setup requires some networking and driver configuration effort, these are one-time investments that unlock significant long-term value. The slight query latency trade-off compared to traditional databases is typically negligible for most business intelligence use cases, especially when weighed against the dramatic reduction in infrastructure costs and complexity.
Key Takeaway: Before defaulting to database solutions for analytics workloads, consider whether your use case actually requires relational database features like transactions, foreign keys, or complex joins. If you're simply storing structured data for reporting and visualization, this S3-native approach can deliver the same business outcomes at a fraction of the cost and complexity.
The complete implementation, including Infrastructure as Code templates and Lambda functions, provides a foundation that teams can fork, customize, and extend for their specific data analytics needs. As cloud-native architectures continue to mature, patterns like this represent the future of cost-effective, scalable data engineering.
Ready to implement this solution? Start with the shared configuration design, analyze your existing data patterns, and begin building your serverless analytics pipeline today.