DEV Community

Wakeup Flower
Wakeup Flower

Posted on

Lake Formation workflow of RDS & S3

🌊 Goal:

Create a data lake that combines data from:

  • Amazon RDS (relational database)
  • Amazon S3 (object storage for large-scale analytics)

You’ll then use AWS Lake Formation to centrally manage data access and security β€” without manually writing IAM policies or complex permissions.


πŸ”§ Step-by-Step Breakdown

1️⃣ Create a Data Lake with Lake Formation

Lake Formation is built on top of AWS Glue and Amazon S3 β€” it helps you:

  • Catalog all your data sources (S3, RDS, Redshift, etc.)
  • Define fine-grained access controls (at the table, column, or row level)
  • Use a single, centralized permission model for analytics services (like Athena, Redshift Spectrum, EMR, Glue)

How:

  • In the Lake Formation console, register your data lake location (usually an S3 bucket or a prefix, e.g. s3://company-data-lake/).
  • Lake Formation creates a data catalog (via AWS Glue Data Catalog) that tracks where data lives and its schema.

2️⃣ Create an AWS Glue JDBC Connection to Amazon RDS

JDBC ( Java Database Connectivity )

It’s a standard protocol that lets software applications connect to databases β€” regardless of the database vendor (MySQL, PostgreSQL, Oracle, SQL Server, etc.).

This allows AWS Glue (and by extension, Lake Formation) to connect directly to your Amazon RDS database and crawl or extract metadata.

How:

  • In AWS Glue, go to Connections β†’ Add connection
  • Choose JDBC
  • Enter your RDS endpoint, database name, username, and password (stored securely in AWS Secrets Manager)
  • Ensure your Glue job or crawler runs in the same VPC as your RDS instance, or that it has network access.

Result: Glue can now read schema and data from RDS tables.


3️⃣ Register the S3 Bucket in Lake Formation

This step β€œtells” Lake Formation that this S3 bucket is part of your managed data lake.

Why it matters:

  • Lake Formation then enforces fine-grained access controls on that bucket.
  • Any other AWS analytics service (Athena, Redshift Spectrum, etc.) must check Lake Formation permissions before reading that S3 data.

How:

  • In Lake Formation β†’ Data lake locations β†’ Register location
  • Choose your S3 bucket (e.g., s3://company-data-lake/)
  • Assign a β€œdata lake administrator” who can manage permissions.

4️⃣ Use Lake Formation Access Controls

Now you can define who can see or query what data.

Examples:

  • Allow the data science team to read only the sales table.
  • Allow auditors to read only the transactions table, but hide the customer_email column.
  • Allow marketing to query summary tables but not raw data.

Lake Formation integrates with:

  • Amazon Athena
  • Amazon Redshift Spectrum
  • Amazon EMR
  • AWS Glue

Each of these services checks Lake Formation permissions before accessing data.


πŸ”’ Access Control Example

User Resource Permission
Data Scientist s3://data-lake/sales/ SELECT
Analyst transactions table SELECT (no access to PII columns)
Admin All databases/tables FULL_ACCESS

πŸ’‘ Why this Architecture?

  • βœ… Centralized data governance β€” fine-grained permissions, all permissions managed in Lake Formation.
  • βœ… Unified analytics β€” combine RDS (structured data) and S3 (semi/unstructured data).
  • βœ… Scalable & cost-effective β€” query with Athena or Redshift Spectrum directly on S3.
  • βœ… Minimal coding β€” Glue handles schema discovery; Lake Formation handles permissions.
          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
          β”‚ Amazon RDS (MySQL)    β”‚
          β”‚ Customer Data          β”‚
          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     β”‚ JDBC Connection (AWS Glue)
                     β–Ό
             β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
             β”‚ AWS Glue Data Catalog      β”‚
             β”‚ Metadata for RDS + S3      β”‚
             β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                        β”‚
                        β–Ό
             β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
             β”‚ AWS Lake Formation         β”‚
             β”‚ - Unified Data Lake        β”‚
             β”‚ - Fine-grained access ctrl β”‚
             β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                        β”‚
          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
          β”‚       Amazon Athena / BI Tools   β”‚
          β”‚     Query Across RDS + S3 Data   β”‚
          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Enter fullscreen mode Exit fullscreen mode
Step Component Role
1️⃣ S3 + RDS Raw data sources
2️⃣ AWS Glue JDBC Connects RDS to data catalog
3️⃣ Lake Formation Creates unified data lake with security control
4️⃣ Athena / QuickSight Analyze both sources together
5️⃣ Result Secure, centralized, scalable data access

Top comments (0)