
Aki for AWS Community Builders


Building a Data Platform on AWS: Essential Design Considerations for Power BI

Original Japanese article: AWSで構築するデータ基盤の基本とPower BI利用時の設計ポイント (The basics of a data platform built on AWS and design points when using Power BI)

Introduction

I'm Aki, an AWS Community Builder (@jitepengin).

When building a data platform on AWS, the most common cloud-native architecture is to use S3 as the data lake, Amazon Redshift as the DWH (data mart), and Amazon QuickSight for BI.
In this article, I’ll walk through the standard architecture and the key decisions you’ll need to make.

I’ll also cover a frequently requested requirement in enterprise environments:
“We want to use Power BI instead of QuickSight.”
We’ll explore what the architecture looks like in that case as well.


The Basic Architecture (S3 × Redshift × QuickSight)

A typical AWS data platform looks like this:

(Architecture diagram: S3 data lake → Redshift DWH / data mart → QuickSight)

This article focuses on the workloads after data is ingested into AWS, so I’ll skip the source-side integration.

Let’s walk through each component.


S3 (Data Lake)

Amazon S3 is the standard choice for a data lake.
There are many ways to structure it, but adopting a Medallion Architecture is generally recommended.


Key points:

  • Layered structure such as Raw → Processed → Curated
  • Schema-on-read provides high flexibility
  • If you adopt a table format like Apache Iceberg, you gain ACID transactions and schema governance

More details are covered in the article below:
https://zenn.dev/penginpenguin/articles/741bf19ea03a5f
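
To make the layering concrete, here is a minimal Python sketch (using the AWS SDK for pandas, awswrangler) that reads a file from the Raw layer, cleans it lightly, and writes it to the Processed layer while registering it in the Glue Data Catalog. The bucket, database, table, and column names are all hypothetical.

```python
# Minimal sketch of a medallion-style flow on S3 with awswrangler (AWS SDK for pandas).
# Bucket, database, table, and column names below are hypothetical placeholders.
import awswrangler as wr
import pandas as pd

# Read a file landed in the Raw layer by an ingestion job.
raw_df = wr.s3.read_csv("s3://my-datalake/raw/orders/2024-06-01/orders.csv")

# Light cleansing for the Processed layer: drop broken rows, normalize types.
processed_df = raw_df.dropna(subset=["order_id"])
processed_df["order_date"] = pd.to_datetime(processed_df["order_date"]).dt.date

# Write columnar data to the Processed layer and register it in the Glue Data Catalog,
# so it can later be queried from Athena or Redshift Spectrum.
wr.s3.to_parquet(
    df=processed_df,
    path="s3://my-datalake/processed/orders/",
    dataset=True,
    mode="overwrite_partitions",
    partition_cols=["order_date"],
    database="datalake_processed",  # Glue database (assumed to exist already)
    table="orders",
)
```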


Amazon Redshift (DWH / Data Mart)


Redshift serves as the enterprise data warehouse or data mart. Important design considerations include:

  • How to divide responsibilities between Redshift and the Gold layer in the data lake
  • Whether to use Redshift Serverless or Provisioned
  • Whether to load lake data into Redshift or treat it as external tables

Let’s break these down.


Separating Redshift and the Data Lake Gold Layer

In a Medallion Architecture, the Gold layer typically stores BI-optimized data.
However, this overlaps conceptually with what a DWH or data mart provides.

Two common approaches are:

1. Place the Gold layer *inside the DWH*

2. Place the Gold layer *in S3*

There is no universal “correct answer.”
Keeping the Gold layer in S3 reduces the load on Redshift and simplifies processing, but it can lead to duplicated storage costs if the same data is also loaded into Redshift.
(There are ways to avoid double storage; I’ll touch on that later.)

Ultimately, the choice depends on constraints such as cost, performance, governance, and data models.

Why use Redshift at all?

Typical reasons include:

  • BI dashboards require sub-second to second-level query performance. Redshift offers more consistent performance for aggregated queries.
  • KPI / summary tables are more stable inside a DWH. Not just for speed: availability and operational stability also matter.
  • Business teams often ask for cross-domain joins (“horizontal metrics”). Gold layers are domain-oriented; DWHs are cross-functional.
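
As a rough illustration of the cross-domain case, the sketch below runs a join across two domain schemas in Redshift from Python via the Redshift Data API. The workgroup, database, schema, table, and column names are all hypothetical.

```python
# Hedged sketch: a cross-domain query against Redshift via the Data API (boto3).
# Workgroup, database, schema, table, and column names are hypothetical.
import time
import boto3

client = boto3.client("redshift-data")

SQL = """
SELECT c.segment,
       SUM(o.amount) AS total_sales
FROM   sales.orders  o
JOIN   crm.customers c ON c.customer_id = o.customer_id
GROUP  BY c.segment;
"""

# For Redshift Serverless pass WorkgroupName; for a provisioned cluster use ClusterIdentifier.
resp = client.execute_statement(
    WorkgroupName="analytics-wg",
    Database="analytics",
    Sql=SQL,
)

# Poll until the statement finishes, then fetch the result set.
while True:
    desc = client.describe_statement(Id=resp["Id"])
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

if desc["Status"] == "FINISHED":
    rows = client.get_statement_result(Id=resp["Id"])["Records"]
    print(rows)
```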

Redshift Serverless vs Provisioned


Redshift Serverless

  • Cost-efficient for small to mid-sized workloads
  • Great when workload patterns fluctuate
  • Minimal operational overhead

Redshift Provisioned

  • Suitable for large-scale batch workloads or heavy SQL
  • Works well when workload is predictable

A common approach is to start with Serverless, then consider Provisioned only when required by performance or workload profiles.
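
If you do start with Serverless, the initial setup can also be scripted. Below is a hedged boto3 sketch that creates a namespace and a workgroup; every name, capacity value, and network ID is a placeholder, and in practice this would usually be managed through IaC (CloudFormation, CDK, Terraform).

```python
# Hedged sketch: creating a Redshift Serverless namespace and workgroup with boto3.
# All names, capacity values, and network IDs are hypothetical placeholders.
import boto3

client = boto3.client("redshift-serverless")

# A namespace holds databases, users, and encryption settings.
client.create_namespace(
    namespaceName="analytics-ns",
    dbName="analytics",
    adminUsername="admin",
    adminUserPassword="ReplaceWithSecret1!",  # prefer Secrets Manager in real setups
)

# A workgroup holds the compute settings (base RPU capacity, VPC placement).
client.create_workgroup(
    workgroupName="analytics-wg",
    namespaceName="analytics-ns",
    baseCapacity=8,                    # RPUs; start small and scale as needed
    publiclyAccessible=False,          # keep the endpoint private inside the VPC
    subnetIds=["subnet-0123456789abcdef0"],
    securityGroupIds=["sg-0123456789abcdef0"],
)
```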


Loading Data into Redshift vs External Tables (Spectrum)


With the rise of Lakehouse patterns, choosing between COPY (load) and external tables is increasingly important.

COPY (Load into Redshift)

  • Fast and stable query performance
  • May duplicate data depending on lake design

External Tables (Spectrum)

  • Centralized storage (no duplication)
  • Cost-effective for large datasets
  • Queries operate directly on the underlying lake data, avoiding the need to store it in multiple locations

Again, this is a trade-off between performance and cost.
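
To make the contrast concrete, here is a minimal sketch of the SQL each approach involves, shown as Python string constants. The table, bucket, Glue database, and IAM role names are hypothetical.

```python
# Hedged sketch contrasting the two approaches; table, bucket, Glue database,
# and IAM role names are hypothetical. Either statement could be run via the
# Redshift Data API or any SQL client.

# (1) COPY: load Parquet files from the lake into a native Redshift table.
COPY_SQL = """
COPY analytics.orders
FROM 's3://my-datalake/curated/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
FORMAT AS PARQUET;
"""

# (2) Spectrum: expose the lake data as an external schema and query it in place.
SPECTRUM_SQL = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS lake
FROM DATA CATALOG
DATABASE 'datalake_curated'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role';

-- The Glue Data Catalog table 'orders' is then queryable without loading it:
-- SELECT order_date, SUM(amount) FROM lake.orders GROUP BY order_date;
"""
```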


Choosing ETL Services (Glue / EMR / Lambda)


You’ll typically choose among Glue, EMR, or Lambda:

  • AWS Glue: fully managed ETL with Spark/Python; flexible for typical batch ETL.
  • AWS Lambda: serverless, lightweight ETL; ideal for real-time or small workloads.
  • Amazon EMR: fully customizable clusters for big data workloads; ideal for large-scale Spark/Hadoop processing.

The main criteria are data volume and processing time.
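
As a rough sketch of what a Glue-based batch job in this layer might look like (bucket paths and column names are hypothetical; job bookmarks and error handling are omitted):

```python
# Minimal AWS Glue (PySpark) job sketch: Raw -> Processed.
# Bucket paths and column names are hypothetical placeholders.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read raw JSON files landed by the ingestion pipeline.
raw_df = spark.read.json("s3://my-datalake/raw/orders/")

# Light cleansing: drop records without a key and normalize a column name.
processed_df = (
    raw_df
    .dropna(subset=["order_id"])
    .withColumnRenamed("orderDate", "order_date")
)

# Write columnar output to the Processed layer, partitioned for downstream queries.
processed_df.write.mode("overwrite").partitionBy("order_date").parquet(
    "s3://my-datalake/processed/orders/"
)
```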

More details:
https://zenn.dev/penginpenguin/articles/a2738f7efb5c03


Amazon QuickSight


QuickSight is usually the first choice for BI on AWS due to its tight integration with AWS security, permissions, and serverless nature.

However, enterprise customers often request Power BI for BI standardization.
The next section covers how to architect Power BI connectivity to AWS.


Using Power BI with AWS

In enterprise environments, it’s common to be asked:

  • “We don't want Redshift exposed to the internet.”
  • “We want to connect Power BI Service to AWS.”
  • “We have strict policies on VPN or on-prem connections.”

Power BI naturally integrates well with Azure, so connecting it securely to AWS requires some planning.

Below are the main architectural patterns.


Power BI Connection Patterns

1. Making Redshift Publicly Accessible (Not Recommended)


The simplest—but nearly always unacceptable—option.

Pros

  • Easy for both Power BI Desktop and Power BI Service to access

Cons

  • Significant security risks
  • Rarely allowed in enterprise environments

2. Using Power BI Desktop (Private Connectivity)

When using Power BI Desktop, secure connectivity is typically established via VPN or Direct Connect.

Direct Connect


  • Dedicated line between on-prem and AWS
  • Most stable in terms of security and bandwidth
  • High cost
  • Common in large enterprises

Often, the setup already exists for other systems—you simply “piggyback” on it.

Site-to-Site VPN / Client VPN


  • More common setup than Direct Connect
  • Secure, cost-effective
  • Works well for multi-site organizations

This is often the most realistic option.

Running Power BI Desktop Inside EC2


  • Windows EC2 + Power BI Desktop
  • Simple connectivity, but increases administrative overhead
  • Usually a “last resort” or workaround

3. Power BI Service (Cloud) with On-Premises Data Gateway


Power BI Service cannot directly use VPN or Direct Connect.
To connect securely to AWS resources (e.g., Redshift), you must use the On-Premises Data Gateway, which runs on Windows—often hosted on EC2.

Considerations:

  • Redshift does NOT need to be public
  • Full Power BI Service features available
  • The EC2-hosted gateway creates a single point of failure unless it is made redundant
  • Operational overhead increases
  • DirectQuery performance may be limited depending on workload
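
One concrete piece of this pattern is keeping Redshift private and allowing only the gateway's EC2 instance to reach it. The boto3 sketch below grants that network path between two hypothetical security groups; in a real environment this would typically be defined in IaC.

```python
# Hedged sketch: allow only the On-Premises Data Gateway EC2 instance to reach Redshift
# on port 5439, keeping the Redshift endpoint private. Security group IDs are hypothetical.
import boto3

ec2 = boto3.client("ec2")

REDSHIFT_SG = "sg-0aaa1111bbbb2222c"  # security group attached to Redshift (or its workgroup)
GATEWAY_SG = "sg-0ddd3333eeee4444f"   # security group attached to the gateway EC2 instance

ec2.authorize_security_group_ingress(
    GroupId=REDSHIFT_SG,
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 5439,
            "ToPort": 5439,
            # Reference the gateway's security group instead of opening to 0.0.0.0/0.
            "UserIdGroupPairs": [{"GroupId": GATEWAY_SG}],
        }
    ],
)
```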

Conclusion

In this article, we explored key connection patterns to consider when using Power BI on top of a fundamental AWS data platform architecture built with S3, Redshift, and QuickSight.
In enterprise environments especially, ensuring secure connectivity, as well as managing costs and operational overhead, often becomes a critical point of discussion.

In many cases, organizations use both Power BI Desktop and the Power BI Service, so it is common to combine the connection patterns introduced here.

Ultimately, beyond what we covered in this article, architectural decisions must take into account a wide range of requirements and constraints, and understanding these trade-offs is essential. Best practices are, in many ways, “it depends”: engineers need a deep understanding of the relevant AWS services and must align technical decisions with business needs, effectively bridging the gap between business and technology.

Finally, the most important aspect is that a data platform should deliver business value and be designed in a way that people actually use.
Technology should not lead for its own sake; a business-value-driven mindset is where everything begins.

I hope this article provides helpful guidance for those planning to build a data platform on AWS.
