Introduction
Hello everyone, my name is Jo, and welcome back to my engineering blog!
Today I'm going to talk about something cool. Let's go! 😎
In today's data-driven landscape, organizations collect vast amounts of information from many sources, including structured databases, APIs, IoT sensors, and application logs. Much of this data lands in formats like CSV and JSON files that need to be stored, processed, and analyzed efficiently. While Amazon S3 is an excellent, scalable storage solution for these files, extracting meaningful insights from data sitting in S3 buckets is a significant engineering challenge 🤕, especially when stakeholders need regular access to visualizations and reports without the overhead of complex data pipeline management.
Traditional approaches often involve provisioning expensive database instances, writing ETL jobs to move data around, and maintaining multiple data stores just to enable business intelligence tools like Power BI to access the information they need. In this article, I'll walk you through a cost-effective, serverless architecture that uses AWS Glue for data cataloging, Amazon Athena for SQL querying, and Power BI for visualization. This creates a streamlined solution that maintains S3 as your single source of truth while providing the SQL interface that Power BI expects, all without the complexity and cost of traditional database provisioning. Let's dive into it!
Situation
It's important to understand that the solution presented in this article is designed for a specific set of circumstances and constraints. While it may apply to many similar use cases, every engineering decision should start with clearly defined functional and non-functional requirements that match your particular context. In our scenario, CSV files are deposited into an S3 bucket at regular intervals, creating a steady stream of data that needs to be accessible for analysis and reporting. The files arrive with a consistent, well-defined schema, and the data has already been cleaned and sanitized upstream, so we don't need to handle data quality issues, type conversions, or missing fields within our solution. Most importantly, the data represents a single business entity without complex relationships or foreign key dependencies that would require maintaining referential integrity across multiple tables, so each file can be treated as a self-contained dataset that is processed and queried independently.
CSV Files (Regular Intervals) → S3 Bucket Storage
↓
Well-structured + Pre-sanitized Data
↓
Single Entity (No Relational Dependencies)
↓
Ready for Processing (SQL)
Problem/Requirements
The core challenge lies in making the data dumped into S3 buckets accessible and actionable for stakeholders who need regular visibility into operational insights through dashboards and reports. Currently, the team handles similar data requirements by loading CSV contents into traditional SQL databases, which then serve as the data source for Power BI through standard database connections. This solution exists primarily because Power BI has robust, well-established SQL connectivity and most teams are familiar with this approach. However, this conventional method introduces several significant problems that compound over time and create unnecessary operational overhead.
Key Issues with the Current Approach:
- Infrastructure Costs - Provisioning and maintaining database instances (RDS, SQL Server, etc.) for what is essentially file storage creates ongoing monthly expenses that scale with data volume and performance requirements
- Unnecessary Complexity - Data gets written into SQL tables purely for Power BI access, not because relational database features like ACID transactions, foreign keys, or complex joins are actually needed for this use case
- Access Management Overhead - Every new database introduces another set of credentials, connection strings, and security policies that must be managed, rotated, and monitored
- Increased System Dependencies - Adding database services means more moving parts that can fail, require updates, need backup strategies, and demand monitoring; each new service multiplies your potential points of failure
- Data Refresh Constraints - Power BI's periodic refresh cycles from SQL databases create timing dependencies where dashboard data freshness becomes a configuration trade-off rather than being driven by actual data availability
- Data Fragmentation - Managing multiple data types across different storage solutions (files in S3, structured data in databases) creates inconsistencies in backup strategies, access patterns, and operational procedures
Solution
The Proposed Architecture
The solution replaces the traditional database-centric approach with a serverless, event-driven architecture that treats S3 as both storage and the single source of truth for all data operations. This approach maintains data in its original CSV format while providing the SQL query interface that Power BI requires, eliminating the need for intermediate database layers entirely.
Core Requirements
Our replacement solution addresses four fundamental requirements:
- Single Source of Truth - All data remains in S3 in CSV format, eliminating data duplication and synchronization issues
- Intelligent Cataloging - Files are automatically cataloged with schema information and logical partitioning, similar to database table structures but without the database overhead
- Event-Driven Processing - New file arrivals trigger automatic catalog updates, ensuring data is immediately available for querying
- SQL Query Layer - Athena provides the SQL interface that Power BI expects, querying directly against cataloged S3 data
Architecture Components
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  CSV File   │───▶│  S3 Bucket  │───▶│  S3 Event   │───▶│  SQS Queue  │
└─────────────┘    └─────────────┘    └─────────────┘    └──────┬──────┘
                                                                │
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌──────┴──────┐
│  Power BI   │◀───│   Athena    │◀───│ Glue Catalog│◀───│   Lambda    │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
Tool Breakdown
| Component | Purpose | Role in Solution |
| --- | --- | --- |
| S3 | Primary data storage | Houses original CSV files and partitioned data |
| SQS | Event queue | Buffers S3 file creation events for processing |
| Lambda | Event processor | Reads files, creates partitions, updates Glue catalog |
| Glue | Data catalog | Maintains schema and partition metadata for Athena |
| Athena | Query engine | Provides SQL interface for Power BI connectivity |
Why This Solution Works
Cost Efficiency
Athena operates on a pay-per-query model, charging approximately $5 per TB of data scanned (AWS Athena Pricing), which is dramatically cheaper than maintaining always-on database instances that can cost hundreds of dollars monthly regardless of usage patterns.
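To make that concrete, here's a quick back-of-the-envelope comparison in TypeScript. The scan sizes and query counts below are illustrative assumptions, not measurements from a real workload:

// Rough monthly cost estimate for Athena's pay-per-query model
const pricePerTBScanned = 5;    // USD, per the Athena pricing page
const avgScanPerQueryGB = 0.5;  // assumed average data scanned per query
const queriesPerMonth = 2000;   // assumed dashboard refreshes + ad hoc queries

const monthlyCost = (avgScanPerQueryGB / 1024) * pricePerTBScanned * queriesPerMonth;
console.log(`Estimated Athena cost: ~$${monthlyCost.toFixed(2)}/month`); // ≈ $4.88

Even at thousands of queries per month scanning half a gigabyte each, the bill stays in the single digits, while an always-on database instance costs the same whether anyone queries it or not.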
Operational Simplicity
- Single Data Format - Everything stays in CSV, eliminating format conversion complexity
- Unified Access Control - S3 IAM policies manage all data access, no separate database credentials
- Minimal Dependencies - The entire pipeline uses managed AWS services with no servers to maintain
- Direct Integration - Athena provides native Power BI connectivity through ODBC/JDBC drivers
Performance Features
- Query Caching - Athena automatically caches results for repeated queries
- Partition Pruning - Only scans relevant data partitions, dramatically reducing query costs
- Columnar Optimization - Can easily migrate to Parquet format later for even better performance (see the sketch below)
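As one example of that migration path, Athena supports CTAS (CREATE TABLE AS SELECT) statements that can rewrite the CSV-backed table as Parquet. The following is a minimal sketch using the aws-sdk Athena client; the table, bucket, and workgroup names are assumed from the setup introduced later in this article:

import { Athena } from 'aws-sdk';

const athena = new Athena();

// CTAS statement: rewrite the CSV table as partitioned Parquet
const ctas = `
  CREATE TABLE analytics_db.sales_data_parquet
  WITH (
    format = 'PARQUET',
    external_location = 's3://my-company-analytics/sales_data_parquet/',
    partitioned_by = ARRAY['year', 'month', 'day']
  ) AS
  SELECT * FROM analytics_db.sales_data
`;

async function migrateToParquet(): Promise<string> {
  const result = await athena.startQueryExecution({
    QueryString: ctas,
    WorkGroup: 'analytics_workgroup',
  }).promise();
  return result.QueryExecutionId!; // poll getQueryExecution to track completion
}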
Trade-offs and Considerations
Performance Limitations
-- Traditional DB: Sub-second response for cached data
SELECT * FROM sales WHERE date = '2024-01-15';
-- Athena: 2-10 second response depending on data size
SELECT * FROM sales WHERE year='2024' AND month='01' AND day='15';
Athena queries have higher latency than traditional databases since they scan files rather than using pre-built indexes, but this trade-off is acceptable when query frequency is moderate and cost savings are significant.
Initial Setup Complexity
Power BI + Athena Integration Requirements:
- Driver Installation - ODBC/JDBC drivers on dashboard servers
- Network Configuration - VPC endpoints and security group rules for federated AWS accounts
- Authentication Setup - IAM roles and cross-account access policies
- Gateway Configuration - On-premises data gateway driver installation
# Example driver setup
wget https://s3.amazonaws.com/athena-downloads/drivers/ODBC/SimbaAthenaODBC-1.1.17.1000-Linux64.zip
# Configure connection string in Power BI
# Server: athena.us-east-1.amazonaws.com
# Database: your_glue_database
While the initial connectivity setup between Power BI and Athena can be complex, especially in enterprise environments with federated AWS accounts, these are one-time configuration challenges that become valuable learning experiences for teams expanding their cloud-native analytics capabilities.
Suggested Implementation
This implementation guide uses TypeScript, but the concepts translate to any language. The key is building a maintainable, extensible solution that follows the Open/Closed Principle - open for extension, closed for modification.
1. Core Schema Design (Critical Prerequisite)
Schema Analysis and Configuration
Start by examining your S3 files to understand the data structure and extract partition information from filenames:
// Example filename: sales_data_2024_03_15.csv
// Partition extraction: year=2024, month=03, day=15
const filename = "sales_data_2024_03_15.csv";
const partitionPattern = /sales_data_(\d{4})_(\d{2})_(\d{2})\.csv/;
const match = filename.match(partitionPattern);
if (!match) throw new Error(`Unexpected filename format: ${filename}`);
const [, year, month, day] = match; // "2024", "03", "15"
Shared Configuration Structure
Create a centralized configuration that serves both your code and Infrastructure as Code (IAC):
// shared-config.json
{
  "aws": {
    "region": "us-east-1",
    "account_id": "${AWS_ACCOUNT_ID}"
  },
  "resources": {
    "source_bucket": "my-company-data-source",
    "analytics_bucket": "my-company-analytics",
    "glue_database": "analytics_db",
    "workgroup_name": "analytics_workgroup",
    "sqs_queue": "file_processing_queue"
  },
  "entities": {
    "sales": {
      "table_name": "sales_data",
      "filename_pattern": "sales_data_(\\d{4})_(\\d{2})_(\\d{2})\\.csv",
      "partition_keys": ["year", "month", "day"],
      "schema": [
        {"name": "transaction_id", "type": "string"},
        {"name": "customer_id", "type": "string"},
        {"name": "amount", "type": "decimal(10,2)"},
        {"name": "transaction_date", "type": "date"}
      ]
    }
  }
}
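Since both the Lambda code and the IaC reference this file, it helps to load it once behind a typed interface. Here's one possible loader; the file layout and interface names are my own sketch, not a prescribed structure:

// config.ts
import * as fs from 'fs';

export interface ColumnDef { name: string; type: string; }
export interface EntityConfig {
  table_name: string;
  filename_pattern: string;
  partition_keys: string[];
  schema: ColumnDef[];
}
export interface SharedConfig {
  aws: { region: string; account_id: string };
  resources: {
    source_bucket: string;
    analytics_bucket: string;
    glue_database: string;
    workgroup_name: string;
    sqs_queue: string;
  };
  entities: Record<string, EntityConfig>;
}

// Parse the shared config once at module load
export const config: SharedConfig = JSON.parse(
  fs.readFileSync('shared-config.json', 'utf-8')
);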
2. Infrastructure as Code
Account-Agnostic Infrastructure
Structure your IAC to work across any AWS account with minimal changes:
# terraform/main.tf or cloudformation template
# All names reference shared config
resource "aws_s3_bucket" "analytics_bucket" {
  bucket = var.shared_config.resources.analytics_bucket

  # Inline versioning works on AWS provider < 4.x; newer provider
  # versions use a separate aws_s3_bucket_versioning resource
  versioning {
    enabled = true
  }
}

resource "aws_glue_catalog_database" "analytics_db" {
  name = var.shared_config.resources.glue_database
}

resource "aws_athena_workgroup" "analytics" {
  name = var.shared_config.resources.workgroup_name

  configuration {
    result_configuration {
      output_location = "s3://${aws_s3_bucket.analytics_bucket.bucket}/query-results/"
    }
    enforce_workgroup_configuration    = true
    publish_cloudwatch_metrics_enabled = true
  }
}
Complete Resource Setup
// Infrastructure components from shared config
const infraConfig = {
  s3: {
    sourceBucket: config.resources.source_bucket,
    analyticsBucket: config.resources.analytics_bucket
  },
  glue: {
    database: config.resources.glue_database,
    tables: Object.keys(config.entities)
  },
  sqs: {
    queueName: config.resources.sqs_queue,
    // Note: Standard queue only - S3 events don't support FIFO
    type: "Standard"
  },
  lambda: {
    functionName: "file-processor",
    runtime: "nodejs22.x",
    sqsTrigger: true
  }
};
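If you'd rather keep the infrastructure in TypeScript too, the same S3 → SQS → Lambda wiring can be expressed with AWS CDK v2. This is an alternative sketch under my own assumptions (construct IDs, file paths, a recent aws-cdk-lib), not this article's Terraform setup:

// cdk/pipeline-stack.ts
import { Stack, StackProps, Duration } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as s3n from 'aws-cdk-lib/aws-s3-notifications';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { SqsEventSource } from 'aws-cdk-lib/aws-lambda-event-sources';

export class FilePipelineStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const sourceBucket = new s3.Bucket(this, 'SourceBucket');

    // Standard queue only - S3 event notifications can't target FIFO queues
    const queue = new sqs.Queue(this, 'FileQueue', {
      visibilityTimeout: Duration.minutes(5),
    });

    // Only .csv uploads trigger processing
    sourceBucket.addEventNotification(
      s3.EventType.OBJECT_CREATED,
      new s3n.SqsDestination(queue),
      { suffix: '.csv' }
    );

    const processor = new lambda.Function(this, 'FileProcessor', {
      runtime: lambda.Runtime.NODEJS_22_X,
      handler: 'fileProcessor.handler',
      code: lambda.Code.fromAsset('lambda'),
    });
    processor.addEventSource(new SqsEventSource(queue));
  }
}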
3. Core Lambda Logic
File Processing Flow
// lambda/fileProcessor.ts
import { SQSEvent, SQSRecord } from 'aws-lambda';
import { S3, Glue, Athena } from 'aws-sdk';
import { config, EntityConfig } from './config'; // shared config loader (see config.ts above)

const s3 = new S3();

export const handler = async (event: SQSEvent) => {
  for (const record of event.Records) {
    await processS3File(record);
  }
};

async function processS3File(sqsRecord: SQSRecord) {
  // 1. Extract S3 event from SQS message
  const s3Event = JSON.parse(sqsRecord.body);
  const bucket = s3Event.Records[0].s3.bucket.name;
  const key = s3Event.Records[0].s3.object.key;

  // 2. Get entity configuration
  const entityConfig = getEntityFromFilename(key);

  // 3. Extract partition information
  const partitions = extractPartitions(key, entityConfig.filename_pattern);

  // 4. Read and validate file
  const fileContent = await s3.getObject({ Bucket: bucket, Key: key }).promise();

  // 5. Create partitioned path, keeping the original filename at the leaf
  const originalFilename = key.split('/').pop()!;
  const partitionedPath = buildPartitionPath(entityConfig, partitions, originalFilename);

  // 6. Write to analytics bucket
  await s3.putObject({
    Bucket: config.resources.analytics_bucket,
    Key: partitionedPath,
    Body: fileContent.Body
  }).promise();

  // 7. Update Glue catalog
  await updateGlueCatalog(entityConfig, partitions, partitionedPath);

  // 8. Optional: Validate with Athena query
  await validateCatalogUpdate(entityConfig.table_name, partitions);
}
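The handler above references two helpers, getEntityFromFilename and extractPartitions, that are worth making explicit. Here's a minimal sketch of how they could work against the shared config; the implementations are my assumptions:

function getEntityFromFilename(key: string): EntityConfig {
  const filename = key.split('/').pop() ?? key;
  const entity = Object.values(config.entities).find(e =>
    new RegExp(e.filename_pattern).test(filename)
  );
  if (!entity) throw new Error(`No entity config matches file: ${key}`);
  return entity;
}

function extractPartitions(key: string, pattern: string): Record<string, string> {
  const filename = key.split('/').pop() ?? key;
  const match = filename.match(new RegExp(pattern));
  if (!match) throw new Error(`Filename does not match pattern: ${filename}`);
  // Assumes capture groups appear in the same order as partition_keys
  const { partition_keys } = getEntityFromFilename(key);
  return Object.fromEntries(
    partition_keys.map((k, i) => [k, match[i + 1]])
  );
}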
Partition Management
function buildPartitionPath(
  entityConfig: EntityConfig,
  partitions: Record<string, string>,
  originalFilename: string
): string {
  // Example: sales_data/year=2024/month=03/day=15/sales_data_2024_03_15.csv
  const partitionPath = entityConfig.partition_keys
    .map(key => `${key}=${partitions[key]}`)
    .join('/');
  return `${entityConfig.table_name}/${partitionPath}/${originalFilename}`;
}

async function updateGlueCatalog(
  entityConfig: EntityConfig,
  partitions: Record<string, string>,
  s3Path: string
) {
  const glue = new Glue();
  // A partition's location is its directory prefix, not the file itself
  const partitionDir = s3Path.substring(0, s3Path.lastIndexOf('/'));
  await glue.createPartition({
    DatabaseName: config.resources.glue_database,
    TableName: entityConfig.table_name,
    PartitionInput: {
      Values: entityConfig.partition_keys.map(key => partitions[key]),
      StorageDescriptor: {
        Location: `s3://${config.resources.analytics_bucket}/${partitionDir}/`,
        InputFormat: 'org.apache.hadoop.mapred.TextInputFormat',
        OutputFormat: 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
        SerdeInfo: {
          SerializationLibrary: 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
        }
      }
    }
  }).promise();
}
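And the optional validation step (step 8) can be a cheap Athena query against the new partition. This sketch assumes the aws-sdk Athena client; a production version would also poll getQueryExecution until the query finishes:

async function validateCatalogUpdate(
  tableName: string,
  partitions: Record<string, string>
): Promise<void> {
  const athena = new Athena();
  const where = Object.entries(partitions)
    .map(([key, value]) => `${key}='${value}'`)
    .join(' AND ');
  // Fire a count query scoped to the new partition to confirm it's queryable
  await athena.startQueryExecution({
    QueryString: `SELECT count(*) FROM ${tableName} WHERE ${where}`,
    WorkGroup: config.resources.workgroup_name,
    QueryExecutionContext: { Database: config.resources.glue_database },
  }).promise();
}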
4. Power BI Connection Setup
Prerequisites Checklist
# 1. Local Development Setup
# Download Athena ODBC driver
wget https://s3.amazonaws.com/athena-downloads/drivers/ODBC/SimbaAthenaODBC-1.1.17.1000-Windows.msi
# 2. Gateway Installation (for enterprise distribution)
# Install driver on Power BI Gateway machine
# Configure data source in gateway admin console
Connection Configuration
# ODBC Connection String
Driver={Amazon Athena ODBC Driver};
Server=athena.us-east-1.amazonaws.com;
Port=443;
Database=analytics_db;
Workgroup=analytics_workgroup;
AuthenticationType=IAM Credentials;
UID=AKIA...;
PWD=secret_key;
S3OutputLocation=s3://my-company-analytics/query-results/;
Network Requirements
# VPC Endpoint Configuration (if using private connectivity)
vpc_endpoints:
  - service: com.amazonaws.us-east-1.athena
    route_table_ids: ["rtb-xxx"]
  - service: com.amazonaws.us-east-1.s3
    route_table_ids: ["rtb-xxx"]

# Security Group Rules
security_groups:
  athena_access:
    ingress:
      - protocol: tcp
        ports: [443]
        source: "power-bi-gateway-sg"
IAM Service Account
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "athena:*",
        "glue:GetDatabase",
        "glue:GetTable",
        "glue:GetPartitions",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:athena:*:*:workgroup/analytics_workgroup",
        "arn:aws:glue:*:*:catalog",
        "arn:aws:glue:*:*:database/analytics_db",
        "arn:aws:glue:*:*:table/analytics_db/*",
        "arn:aws:s3:::my-company-analytics",
        "arn:aws:s3:::my-company-analytics/*"
      ]
    }
  ]
}
Security Best Practices
- Workspace Binding - Restrict report access to specific distribution lists
- Row-Level Security - Implement in Power BI if data contains sensitive information
- Credential Rotation - Set up automatic rotation for IAM access keys
- VPC Isolation - Use VPC endpoints for private connectivity when possible
This implementation provides a solid foundation that's extensible for additional entities and maintainable across different AWS environments.
Conclusion
This serverless analytics architecture demonstrates how modern cloud services can dramatically simplify data pipelines while reducing costs and operational overhead. By treating S3 as both storage and source of truth, we've eliminated the traditional database layer that often exists purely to satisfy business intelligence tool requirements, not actual business logic needs.
The solution delivers several key advantages: substantial cost savings through Athena's pay-per-query model versus always-on database instances, operational simplicity with fewer moving parts to manage and monitor, and architectural flexibility that scales with data volume and query frequency. The event-driven design ensures data is immediately available for analysis without complex ETL scheduling, while the shared configuration approach makes the entire solution extensible for additional data sources and entities.
While the initial Power BI connectivity setup requires some networking and driver configuration effort, these are one-time investments that unlock significant long-term value. The slight query latency trade-off compared to traditional databases is typically negligible for most business intelligence use cases, especially when weighed against the dramatic reduction in infrastructure costs and complexity.
Key Takeaway: Before defaulting to database solutions for analytics workloads, consider whether your use case actually requires relational database features like transactions, foreign keys, or complex joins. If you're simply storing structured data for reporting and visualization, this S3-native approach can deliver the same business outcomes at a fraction of the cost and complexity.
The complete implementation, including Infrastructure as Code templates and Lambda functions, provides a foundation that teams can fork, customize, and extend for their specific data analytics needs. As cloud-native architectures continue to mature, patterns like this represent the future of cost-effective, scalable data engineering.
Ready to implement this solution? Start with the shared configuration design, analyze your existing data patterns, and begin building your serverless analytics pipeline today.