
Aki for AWS Community Builders


Building a Data Platform on AWS: Essential Design Considerations for Power BI

Original Japanese article: AWSで構築するデータ基盤の基本とPower BI利用時の設計ポイント (The basics of a data platform built on AWS and design points when using Power BI)

Introduction

I'm Aki, an AWS Community Builder (@jitepengin).

When building a data platform on AWS, the most common cloud-native architecture is to use S3 as the data lake, Amazon Redshift as the DWH (data mart), and Amazon QuickSight for BI.
In this article, I’ll walk through the standard architecture and the key decisions you’ll need to make.

I’ll also cover a frequently requested requirement in enterprise environments:
“We want to use Power BI instead of QuickSight.”
We’ll explore what the architecture looks like in that case as well.


The Basic Architecture (S3 × Redshift × QuickSight)

A typical AWS data platform looks like this:

(Architecture diagram: S3 data lake → Redshift DWH / data mart → QuickSight)

This article focuses on the workloads after data is ingested into AWS, so I’ll skip the source-side integration.

Let’s walk through each component.


S3 (Data Lake)

Amazon S3 is the standard choice for a data lake.
There are many ways to structure it, but adopting a Medallion Architecture is generally recommended.


Key points:

  • Layered structure such as Raw → Processed → Curated
  • Schema-on-read provides high flexibility
  • If you adopt a table format like Apache Iceberg, you gain ACID transactions and schema governance

More details are covered in the article below:
https://zenn.dev/penginpenguin/articles/741bf19ea03a5f
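
To make the layering concrete, here is a minimal Python sketch (using the AWS SDK for pandas, awswrangler) that reads a file from the Raw layer, cleans it lightly, and writes it to the Processed layer while registering it in the Glue Data Catalog. The bucket, database, table, and column names are all hypothetical.

```python
# Minimal sketch of a medallion-style flow on S3 with awswrangler (AWS SDK for pandas).
# Bucket, database, table, and column names below are hypothetical placeholders.
import awswrangler as wr
import pandas as pd

# Read a file landed in the Raw layer by an ingestion job.
raw_df = wr.s3.read_csv("s3://my-datalake/raw/orders/2024-06-01/orders.csv")

# Light cleansing for the Processed layer: drop broken rows, normalize types.
processed_df = raw_df.dropna(subset=["order_id"])
processed_df["order_date"] = pd.to_datetime(processed_df["order_date"]).dt.date

# Write columnar data to the Processed layer and register it in the Glue Data Catalog,
# so it can later be queried from Athena or Redshift Spectrum.
wr.s3.to_parquet(
    df=processed_df,
    path="s3://my-datalake/processed/orders/",
    dataset=True,
    mode="overwrite_partitions",
    partition_cols=["order_date"],
    database="datalake_processed",  # Glue database (assumed to exist already)
    table="orders",
)
```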


Amazon Redshift (DWH / Data Mart)


Redshift serves as the enterprise data warehouse or data mart. Important design considerations include:

  • How to divide responsibilities between Redshift and the Gold layer in the data lake
  • Whether to use Redshift Serverless or Provisioned
  • Whether to load lake data into Redshift or treat it as external tables

Let’s break these down.


Separating Redshift and the Data Lake Gold Layer

In a Medallion Architecture, the Gold layer typically stores BI-optimized data.
However, this overlaps conceptually with what a DWH or data mart provides.

Two common approaches are:

1. Place the Gold layer *inside the DWH*

2. Place the Gold layer *in S3*

There is no universal “correct answer.”
Keeping the Gold layer in S3 reduces the load on Redshift and simplifies processing, but it can lead to duplicated storage costs if the same data is also loaded into Redshift.
(There are ways to avoid double storage; I’ll touch on that later.)

Ultimately, the choice depends on constraints such as cost, performance, governance, and data models.

Why use Redshift at all?

Typical reasons include:

  • BI dashboards require sub-second to second-level query performance. Redshift offers more consistent performance for aggregated queries.
  • KPI / summary tables are more stable inside a DWH. Not just for speed: availability and operational stability also matter.
  • Business teams often ask for cross-domain joins (“horizontal metrics”). Gold layers are domain-oriented; DWHs are cross-functional.
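
As a rough illustration of the cross-domain case, the sketch below runs a join across two domain schemas in Redshift from Python via the Redshift Data API. The workgroup, database, schema, table, and column names are all hypothetical.

```python
# Hedged sketch: a cross-domain query against Redshift via the Data API (boto3).
# Workgroup, database, schema, table, and column names are hypothetical.
import time
import boto3

client = boto3.client("redshift-data")

SQL = """
SELECT c.segment,
       SUM(o.amount) AS total_sales
FROM   sales.orders  o
JOIN   crm.customers c ON c.customer_id = o.customer_id
GROUP  BY c.segment;
"""

# For Redshift Serverless pass WorkgroupName; for a provisioned cluster use ClusterIdentifier.
resp = client.execute_statement(
    WorkgroupName="analytics-wg",
    Database="analytics",
    Sql=SQL,
)

# Poll until the statement finishes, then fetch the result set.
while True:
    desc = client.describe_statement(Id=resp["Id"])
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

if desc["Status"] == "FINISHED":
    rows = client.get_statement_result(Id=resp["Id"])["Records"]
    print(rows)
```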

Redshift Serverless vs Provisioned


Redshift Serverless

  • Cost-efficient for small to mid-sized workloads
  • Great when workload patterns fluctuate
  • Minimal operational overhead

Redshift Provisioned

  • Suitable for large-scale batch workloads or heavy SQL
  • Works well when workload is predictable

A common approach is to start with Serverless, then consider Provisioned only when required by performance or workload profiles.
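
If you do start with Serverless, the initial setup can also be scripted. Below is a hedged boto3 sketch that creates a namespace and a workgroup; every name, capacity value, and network ID is a placeholder, and in practice this would usually be managed through IaC (CloudFormation, CDK, Terraform).

```python
# Hedged sketch: creating a Redshift Serverless namespace and workgroup with boto3.
# All names, capacity values, and network IDs are hypothetical placeholders.
import boto3

client = boto3.client("redshift-serverless")

# A namespace holds databases, users, and encryption settings.
client.create_namespace(
    namespaceName="analytics-ns",
    dbName="analytics",
    adminUsername="admin",
    adminUserPassword="ReplaceWithSecret1!",  # prefer Secrets Manager in real setups
)

# A workgroup holds the compute settings (base RPU capacity, VPC placement).
client.create_workgroup(
    workgroupName="analytics-wg",
    namespaceName="analytics-ns",
    baseCapacity=8,                    # RPUs; start small and scale as needed
    publiclyAccessible=False,          # keep the endpoint private inside the VPC
    subnetIds=["subnet-0123456789abcdef0"],
    securityGroupIds=["sg-0123456789abcdef0"],
)
```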


Loading Data into Redshift vs External Tables (Spectrum)


With the rise of Lakehouse patterns, choosing between COPY (load) and external tables is increasingly important.

COPY (Load into Redshift)

  • Fast and stable query performance
  • May duplicate data depending on lake design

External Tables (Spectrum)

  • Centralized storage (no duplication)
  • Cost-effective for large datasets
  • Queries operate directly on the underlying lake data, avoiding the need to store it in multiple locations

Again, this is a trade-off between performance and cost.
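
To make the contrast concrete, here is a minimal sketch of the SQL each approach involves, shown as Python string constants. The table, bucket, Glue database, and IAM role names are hypothetical.

```python
# Hedged sketch contrasting the two approaches; table, bucket, Glue database,
# and IAM role names are hypothetical. Either statement could be run via the
# Redshift Data API or any SQL client.

# (1) COPY: load Parquet files from the lake into a native Redshift table.
COPY_SQL = """
COPY analytics.orders
FROM 's3://my-datalake/curated/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
FORMAT AS PARQUET;
"""

# (2) Spectrum: expose the lake data as an external schema and query it in place.
SPECTRUM_SQL = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS lake
FROM DATA CATALOG
DATABASE 'datalake_curated'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role';

-- The Glue Data Catalog table 'orders' is then queryable without loading it:
-- SELECT order_date, SUM(amount) FROM lake.orders GROUP BY order_date;
"""
```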


Choosing ETL Services (Glue / EMR / Lambda)


You’ll typically choose among Glue, EMR, or Lambda:

  • AWS Glue: fully managed ETL with Spark/Python; flexible for typical batch ETL.
  • AWS Lambda: serverless, lightweight ETL; ideal for real-time or small workloads.
  • Amazon EMR: fully customizable clusters for big data workloads; ideal for large-scale Spark/Hadoop processing.

The main criteria are data volume and processing time.
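
As a rough sketch of what a Glue-based batch job in this layer might look like (bucket paths and column names are hypothetical; job bookmarks and error handling are omitted):

```python
# Minimal AWS Glue (PySpark) job sketch: Raw -> Processed.
# Bucket paths and column names are hypothetical placeholders.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read raw JSON files landed by the ingestion pipeline.
raw_df = spark.read.json("s3://my-datalake/raw/orders/")

# Light cleansing: drop records without a key and normalize a column name.
processed_df = (
    raw_df
    .dropna(subset=["order_id"])
    .withColumnRenamed("orderDate", "order_date")
)

# Write columnar output to the Processed layer, partitioned for downstream queries.
processed_df.write.mode("overwrite").partitionBy("order_date").parquet(
    "s3://my-datalake/processed/orders/"
)
```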

More details:
https://zenn.dev/penginpenguin/articles/a2738f7efb5c03


Amazon QuickSight


QuickSight is usually the first choice for BI on AWS due to its tight integration with AWS security, permissions, and serverless nature.

However, enterprise customers often request Power BI for BI standardization.
The next section covers how to architect Power BI connectivity to AWS.


Using Power BI with AWS

In enterprise environments, it’s common to be asked:

  • “We don't want Redshift exposed to the internet.”
  • “We want to connect Power BI Service to AWS.”
  • “We have strict policies on VPN or on-prem connections.”

Power BI naturally integrates well with Azure, so connecting it securely to AWS requires some planning.

Below are the main architectural patterns.


Power BI Connection Patterns

1. Making Redshift Publicly Accessible (Not Recommended)


The simplest—but nearly always unacceptable—option.

Pros

  • Easy for both Power BI Desktop and Power BI Service to access

Cons

  • Significant security risks
  • Rarely allowed in enterprise environments

2. Using Power BI Desktop (Private Connectivity)

When using Power BI Desktop, secure connectivity is typically established via VPN or Direct Connect.

Direct Connect


  • Dedicated line between on-prem and AWS
  • Most stable in terms of security and bandwidth
  • High cost
  • Common in large enterprises

Often, the setup already exists for other systems—you simply “piggyback” on it.

Site-to-Site VPN / Client VPN


  • More common setup than Direct Connect
  • Secure, cost-effective
  • Works well for multi-site organizations

This is often the most realistic option.

Running Power BI Desktop Inside EC2


  • Windows EC2 + Power BI Desktop
  • Simple connectivity, but increases administrative overhead
  • Usually a “last resort” or workaround

3. Power BI Service (Cloud) with On-Premises Data Gateway


Power BI Service cannot directly use VPN or Direct Connect.
To connect securely to AWS resources (e.g., Redshift), you must use the On-Premises Data Gateway, which runs on Windows—often hosted on EC2.

Considerations:

  • Redshift does NOT need to be public
  • Full Power BI Service features available
  • The EC2-hosted gateway creates a single point of failure unless it is made redundant
  • Operational overhead increases
  • DirectQuery performance may be limited depending on workload
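
One concrete piece of this pattern is keeping Redshift private and allowing only the gateway's EC2 instance to reach it. The boto3 sketch below grants that network path between two hypothetical security groups; in a real environment this would typically be defined in IaC.

```python
# Hedged sketch: allow only the On-Premises Data Gateway EC2 instance to reach Redshift
# on port 5439, keeping the Redshift endpoint private. Security group IDs are hypothetical.
import boto3

ec2 = boto3.client("ec2")

REDSHIFT_SG = "sg-0aaa1111bbbb2222c"  # security group attached to Redshift (or its workgroup)
GATEWAY_SG = "sg-0ddd3333eeee4444f"   # security group attached to the gateway EC2 instance

ec2.authorize_security_group_ingress(
    GroupId=REDSHIFT_SG,
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 5439,
            "ToPort": 5439,
            # Reference the gateway's security group instead of opening to 0.0.0.0/0.
            "UserIdGroupPairs": [{"GroupId": GATEWAY_SG}],
        }
    ],
)
```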

Conclusion

In this article, we explored key connection patterns to consider when using Power BI on top of a fundamental AWS data platform architecture built with S3, Redshift, and QuickSight.
In enterprise environments especially, ensuring secure connectivity, as well as managing costs and operational overhead, often becomes a critical point of discussion.

In many cases, organizations use both Power BI Desktop and the Power BI Service, so it is common to combine the connection patterns introduced here.

Ultimately, beyond what we covered in this article, architectural decisions must take into account a wide range of requirements and constraints, and understanding these trade-offs is essential. Best practices are, in many ways, “it depends”: engineers need a deep understanding of the relevant AWS services and must align technical decisions with business needs, effectively bridging the gap between business and technology.

Finally, the most important aspect is that a data platform should deliver business value and be designed in a way that people actually use.
Technology should not lead for its own sake; a business-value-driven mindset is where everything begins.

I hope this article provides helpful guidance for those planning to build a data platform on AWS.
