David💻

Posted on Oct 18

Implementing a Secure Data Governance Architecture on AWS with S3, Glue, Athena, and Lake Formation

#data #awsdatalake #todayilearned #aws

This article explains how to build a secure and fully auditable data governance architecture using AWS S3, Glue, CloudTrail, Lake Formation, and Amazon Quick Suite.

This design ensures data organization, encryption, version control, access restriction, and advanced traceability, while enabling analytical queries and dashboards through Athena and Quick Suite.

Requirements

AWS account
CSV files with data

The proposed architecture

Walkthrough

The goal is to build a centralized S3-based data lake that manages raw, processed, and sensitive data securely.
We will call the main bucket company-data-governance-raw.

Inside our bucket we follow this folder structure:

company-data-governance-raw/
├── raw/               
│   ├── clientes/
│   ├── transacciones/
│   └── productos/
│
├── processed/          
│   ├── clientes/
│   └── transacciones/
│       └── sensible=no/
│
├── sensitive/  
│
├── athena/ 
│
└── logs/ 
    └── s3-access/
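
As a rough sketch of how this layout can be addressed from code, the helper below builds object keys for a given lifecycle stage. The function names `object_key` and `object_uri` are hypothetical and only illustrate the naming convention; the stage and dataset names come from the tree above.

```python
from pathlib import PurePosixPath

BUCKET = "company-data-governance-raw"  # bucket name used in this article
STAGES = {"raw", "processed", "sensitive", "athena", "logs"}  # top-level folders


def object_key(stage: str, dataset: str, filename: str) -> str:
    """Return the S3 object key for a file in a given lifecycle stage."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    # PurePosixPath joins with forward slashes on every operating system.
    return str(PurePosixPath(stage) / dataset / filename)


def object_uri(stage: str, dataset: str, filename: str) -> str:
    """Return the full s3:// URI for the same object."""
    return f"s3://{BUCKET}/{object_key(stage, dataset, filename)}"


print(object_uri("raw", "clientes", "clientes.csv"))
# s3://company-data-governance-raw/raw/clientes/clientes.csv
```

Rejecting unknown stages early keeps stray top-level folders from appearing in the bucket.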

Inside the raw folder we will place the following CSV files, one per dataset:

clientes.csv
cliente_id,nombre,email,tarjeta_credito,region,unidad_negocio
1,Juan Perez,juan@email.com,4532-1234-5678-9010,LATAM,Ventas
2,Maria Lopez,maria@email.com,5425-2345-6789-0123,LATAM,Marketing
3,Carlos Ruiz,carlos@email.com,3782-3456-7890-1234,NORTE,Ventas
4,Ana Torres,ana@email.com,6011-4567-8901-2345,EUROPA,IT
5,Luis Garcia,luis@email.com,4916-5678-9012-3456,LATAM,Finanzas

transacciones.csv
transaccion_id,cliente_id,monto,fecha,tipo,sensible
1001,1,150.50,2025-10-01,compra,no
1002,2,320.75,2025-10-02,compra,no
1003,3,89.99,2025-10-03,devolucion,no
1004,1,1500.00,2025-10-04,compra,si
1005,4,45.20,2025-10-05,compra,no

productos.csv
producto_id,nombre,precio,categoria,stock
101,Laptop,899.99,Tecnologia,50
102,Mouse,25.99,Tecnologia,200
103,Teclado,45.50,Tecnologia,150
104,Monitor,299.99,Tecnologia,75
105,Webcam,79.99,Tecnologia,100

This organization keeps every dataset in the appropriate lifecycle stage, from ingestion to analysis.
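
To illustrate how the `sensible` column could drive that placement, here is a minimal sketch that routes transaction rows to a destination prefix. The `processed/transacciones/sensible=no/` prefix comes from the folder tree; the `sensitive/transacciones/` prefix and the function name `route_rows` are assumptions made for this example, not something the article prescribes.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the transactions file shown above.
TRANSACCIONES_CSV = """\
transaccion_id,cliente_id,monto,fecha,tipo,sensible
1001,1,150.50,2025-10-01,compra,no
1002,2,320.75,2025-10-02,compra,no
1003,3,89.99,2025-10-03,devolucion,no
1004,1,1500.00,2025-10-04,compra,si
1005,4,45.20,2025-10-05,compra,no
"""


def route_rows(csv_text: str) -> dict:
    """Group rows by destination prefix based on the `sensible` column."""
    routed = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["sensible"].strip().lower() == "si":
            prefix = "sensitive/transacciones/"  # assumed sub-folder
        else:
            prefix = "processed/transacciones/sensible=no/"
        routed[prefix].append(row)
    return dict(routed)


for prefix, rows in route_rows(TRANSACCIONES_CSV).items():
    print(prefix, [row["transaccion_id"] for row in rows])
```

In a real pipeline this split would more likely live in a Glue ETL job; the sketch only shows the routing rule.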

S3 Configuration and Security Controls

When creating the bucket:

✅ Block all public access

✅ Enable versioning to preserve data integrity and restore older versions

✅ Enable MFA Delete to prevent accidental or unauthorized deletions

✅ Enable encryption at rest (SSE-S3, or SSE-KMS with your own customer managed key) and require encryption in transit (HTTPS), which the bucket policy below enforces

Using AWS KMS keeps the symmetric encryption key under your control, which is essential for sensitive workloads.
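
If you prefer to script these settings instead of clicking through the console, the sketch below prepares the corresponding requests. It assumes an S3 client such as the one returned by `boto3.client("s3")`; the helper names `hardening_steps` and `apply_steps` are hypothetical, and the KMS key ARN is a placeholder. MFA Delete is left out on purpose, because it has to be enabled separately with the bucket owner's root credentials and an MFA code.

```python
def hardening_steps(bucket: str, kms_key_id: str) -> list:
    """Return (method name, keyword arguments) pairs for an S3 client."""
    return [
        ("put_public_access_block", {
            "Bucket": bucket,
            "PublicAccessBlockConfiguration": {
                "BlockPublicAcls": True,
                "IgnorePublicAcls": True,
                "BlockPublicPolicy": True,
                "RestrictPublicBuckets": True,
            },
        }),
        ("put_bucket_versioning", {
            "Bucket": bucket,
            "VersioningConfiguration": {"Status": "Enabled"},
        }),
        ("put_bucket_encryption", {
            "Bucket": bucket,
            "ServerSideEncryptionConfiguration": {
                "Rules": [{
                    "ApplyServerSideEncryptionByDefault": {
                        "SSEAlgorithm": "aws:kms",
                        "KMSMasterKeyID": kms_key_id,
                    },
                    "BucketKeyEnabled": True,
                }]
            },
        }),
    ]


def apply_steps(s3_client, steps) -> None:
    """Call each prepared step on the client passed in by the caller."""
    for method_name, kwargs in steps:
        getattr(s3_client, method_name)(**kwargs)


steps = hardening_steps(
    "company-data-governance-raw",
    "arn:aws:kms:<region>:<account-id>:key/<key-id>",  # placeholder, not a real key
)
for method_name, _ in steps:
    print(method_name)
```

Keeping the requests as plain data makes them easy to review before anything is sent to AWS; `apply_steps(boto3.client("s3"), steps)` would then apply them.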

S3 Bucket Policy – Enforcing Security

To harden the S3 layer, we can configure our bucket policy with these three key rules:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyInsecureTransport",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::company-data-governance-raw/*",
        "arn:aws:s3:::company-data-governance-raw"
      ],
      "Condition": {
        "Bool": { "aws:SecureTransport": "false" }
      }
    },
    {
      "Sid": "RestrictSensitiveData",
      "Effect": "Deny",
      "Principal": "*",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::company-data-governance-raw/sensitive/*",
      "Condition": {
        "StringNotEquals": {
          "aws:PrincipalArn": [
            "arn:aws:iam::<account-id>:user/<user-or-role>",
            "arn:aws:iam::<account-id>:role/AWSGlueServiceRole-GobiernoDatos"
          ]
        }
      }
    },
    {
      "Sid": "S3PolicyStmt-DO-NOT-MODIFY-1760725108745",
      "Effect": "Allow",
      "Principal": { "Service": "logging.s3.amazonaws.com" },
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::company-data-governance-raw/*",
      "Condition": {
        "StringEquals": { "aws:SourceAccount": "<your-account-id>" }
      }
    }
  ]
}

This policy does three things:

Denies any request made over plain HTTP, so only HTTPS traffic is allowed.
Restricts reads and writes under sensitive/ to a specific IAM user and the Glue role.
Grants the S3 logging service permission to write access logs, but only on behalf of your own AWS account.

This combination provides encryption in transit, identity-based access control, and logging integrity.
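
To avoid hand-editing placeholders, the same three statements can be generated from parameters. This is a minimal sketch: `build_bucket_policy` is a hypothetical helper, the account ID and principal ARNs are made-up example values, and the third statement uses a readable Sid instead of the console-generated one shown above.

```python
import json


def build_bucket_policy(bucket: str, account_id: str, sensitive_principals: list) -> dict:
    """Build the three-statement bucket policy described in this section."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Reject every request that does not use TLS.
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [f"{bucket_arn}/*", bucket_arn],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {   # Only the listed principals may read or write sensitive/.
                "Sid": "RestrictSensitiveData",
                "Effect": "Deny",
                "Principal": "*",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"{bucket_arn}/sensitive/*",
                "Condition": {
                    "StringNotEquals": {"aws:PrincipalArn": list(sensitive_principals)}
                },
            },
            {   # Let the S3 logging service deliver logs for this account only.
                "Sid": "AllowS3LogDelivery",
                "Effect": "Allow",
                "Principal": {"Service": "logging.s3.amazonaws.com"},
                "Action": "s3:PutObject",
                "Resource": f"{bucket_arn}/*",
                "Condition": {"StringEquals": {"aws:SourceAccount": account_id}},
            },
        ],
    }


policy = build_bucket_policy(
    "company-data-governance-raw",
    "123456789012",  # example account ID, replace with your own
    ["arn:aws:iam::123456789012:role/AWSGlueServiceRole-GobiernoDatos"],
)
print(json.dumps(policy, indent=2))
```

With boto3, the resulting document could be applied with `put_bucket_policy(Bucket=..., Policy=json.dumps(policy))`.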

Enabling Access Logging

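Enabling S3 server access logging is essential for auditing who accessed what. Request logs, whether the request came from a user or from a program, are delivered under logs/s3-access/ inside the same bucket, which gives you a record of data operations for audits. Keep in mind that S3 delivers these logs on a best-effort basis, which is one reason to pair them with the CloudTrail data events covered later in this article.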
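
As a hedged sketch, the logging configuration could be expressed as the request below. The helper name `access_logging_request` is hypothetical; the dictionary shape is the one boto3's `put_bucket_logging` call expects, and the target prefix matches the folder tree shown earlier.

```python
def access_logging_request(bucket: str, prefix: str = "logs/s3-access/") -> dict:
    """Build keyword arguments for enabling server access logging on a bucket."""
    return {
        "Bucket": bucket,
        "BucketLoggingStatus": {
            "LoggingEnabled": {
                "TargetBucket": bucket,  # same bucket, as in this article
                "TargetPrefix": prefix,  # logs land under this prefix
            }
        },
    }


request = access_logging_request("company-data-governance-raw")
print(request["BucketLoggingStatus"]["LoggingEnabled"]["TargetPrefix"])
# logs/s3-access/
```

Assuming `s3 = boto3.client("s3")`, the call would be `s3.put_bucket_logging(**request)`.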

Building the Data Catalog in AWS Glue

After uploading your data into the appropriate folders, the next step is to catalog it. Go to AWS Glue → Data Catalog → Databases.

Create a new database named data_governance_catalog.

Then create a crawler that will scan your S3 bucket and automatically build table schemas.

Steps:

Crawler name: company-crawler-governance-data-raw
Create a new IAM role: DataGovernance
Choose the database data_governance_catalog as target
Run the crawler on demand

Once it finishes, you’ll see three tables reflecting your S3 folder structure (clientes, transacciones, productos).
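
The console steps above map onto three Glue API calls. The sketch below prepares them as data, assuming a client from `boto3.client("glue")`; `glue_requests` is a hypothetical helper, and pointing the crawler at the raw/ prefix (rather than the whole bucket) is an assumption made so that only the three source datasets are cataloged.

```python
def glue_requests(bucket: str, database: str, crawler: str, role: str) -> list:
    """Return (method name, keyword arguments) pairs for a Glue client."""
    return [
        ("create_database", {"DatabaseInput": {"Name": database}}),
        ("create_crawler", {
            "Name": crawler,
            "Role": role,  # IAM role name or ARN that the crawler assumes
            "DatabaseName": database,
            "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/raw/"}]},
        }),
        ("start_crawler", {"Name": crawler}),  # on-demand run
    ]


requests = glue_requests(
    bucket="company-data-governance-raw",
    database="data_governance_catalog",
    crawler="company-crawler-governance-data-raw",
    role="DataGovernance",
)
for method_name, kwargs in requests:
    print(method_name, sorted(kwargs))
```

Each pair could be sent with `getattr(glue, method_name)(**kwargs)` once you are happy with the values.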

Adding Traceability with AWS CloudTrail

Security doesn’t stop at encryption. We also need visibility into every read and write operation.

In CloudTrail → Trails, select your primary trail and enable Data Events for S3. This allows auditing of object-level operations like GetObject and PutObject inside the bucket. CloudTrail logs will now show who accessed which file and when, supporting compliance and audit readiness.
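
For reference, the same setting can be sketched as a CloudTrail event selector. The helper name `s3_data_event_selectors` and the trail name in the example are hypothetical; the dictionary follows the shape of boto3's `put_event_selectors` call, where a bucket ARN ending in a slash covers every object in that bucket.

```python
def s3_data_event_selectors(trail_name: str, bucket: str) -> dict:
    """Build keyword arguments that turn on S3 data events for one bucket."""
    return {
        "TrailName": trail_name,
        "EventSelectors": [{
            "ReadWriteType": "All",  # log both reads and writes
            "IncludeManagementEvents": True,
            "DataResources": [{
                "Type": "AWS::S3::Object",
                # Trailing slash: all objects in the bucket.
                "Values": [f"arn:aws:s3:::{bucket}/"],
            }],
        }],
    }


selectors = s3_data_event_selectors("primary-trail", "company-data-governance-raw")
print(selectors["EventSelectors"][0]["DataResources"][0]["Values"])
# ['arn:aws:s3:::company-data-governance-raw/']
```

Assuming `cloudtrail = boto3.client("cloudtrail")`, this would be applied with `cloudtrail.put_event_selectors(**selectors)`.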

Applying Data Access Controls with Lake Formation

For column-level and row-level permissions, we can use Lake Formation data filters. Go to Lake Formation → Data Filters → Create new filter.

Filter name: data-governance-filters

Select your data catalog and focus on sensitive columns in the clientes table (e.g., credit card holder details)
Column filters: hide entire sensitive columns
Row filters: restrict specific data rows
Then, under Permissions, define Grants to control who can see what.
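For example, grant the Glue role and the current user access to the table only through this filter, so the columns marked as sensitive are excluded. Lake Formation has no explicit deny; access is restricted by granting only the filtered view.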

After applying the filter, queries that those principals run through Athena no longer return the restricted columns.
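
To make the effect of a data filter concrete, here is a small sketch with two parts. `data_filter_request` builds the arguments for boto3's Lake Formation `create_data_cells_filter` call, and `preview_filter` is a purely local illustration of what a restricted user would see. Both helper names are hypothetical, and the choice of `tarjeta_credito` as the excluded column and `region = 'LATAM'` as the row filter are assumptions for the example.

```python
import csv
import io

# First rows of clientes.csv as shown earlier in the article.
CLIENTES_CSV = """\
cliente_id,nombre,email,tarjeta_credito,region,unidad_negocio
1,Juan Perez,juan@email.com,4532-1234-5678-9010,LATAM,Ventas
2,Maria Lopez,maria@email.com,5425-2345-6789-0123,LATAM,Marketing
3,Carlos Ruiz,carlos@email.com,3782-3456-7890-1234,NORTE,Ventas
"""


def data_filter_request(account_id, database, table, name,
                        excluded_columns, row_expression=None) -> dict:
    """Build keyword arguments for creating a Lake Formation data filter."""
    if row_expression:
        row_filter = {"FilterExpression": row_expression}
    else:
        row_filter = {"AllRowsWildcard": {}}  # no row restriction
    return {"TableData": {
        "TableCatalogId": account_id,
        "DatabaseName": database,
        "TableName": table,
        "Name": name,
        "RowFilter": row_filter,
        "ColumnWildcard": {"ExcludedColumnNames": list(excluded_columns)},
    }}


def preview_filter(csv_text, excluded_columns, keep_row=lambda row: True) -> list:
    """Local illustration only: drop excluded columns and unwanted rows."""
    return [
        {k: v for k, v in row.items() if k not in excluded_columns}
        for row in csv.DictReader(io.StringIO(csv_text))
        if keep_row(row)
    ]


request = data_filter_request(
    "123456789012",  # example account ID
    "data_governance_catalog", "clientes", "data-governance-filters",
    excluded_columns=["tarjeta_credito"],
    row_expression="region = 'LATAM'",
)
visible = preview_filter(CLIENTES_CSV, {"tarjeta_credito"},
                         keep_row=lambda row: row["region"] == "LATAM")
print([row["nombre"] for row in visible])
# ['Juan Perez', 'Maria Lopez']
```

The preview never talks to AWS; it only mirrors the column exclusion and row condition so you can sanity-check a filter before granting it.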

Querying Data with Amazon Athena

Now we can query the cataloged data using Athena, which integrates with the Glue Data Catalog automatically. When executing a query, Athena respects Lake Formation permissions: sensitive columns are hidden for restricted users, and queries run against optimized Parquet data under /processed/.

Example:

SELECT * FROM data_governance_catalog.clientes;

Results are stored in the /athena/ folder, ready to visualize in Quick Suite.
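
Queries can also be submitted programmatically. The sketch below builds the arguments for boto3's Athena `start_query_execution` call with the results location set to the athena/ folder; `athena_query_request` is a hypothetical helper, and the "customers by region" query is an illustrative example rather than one taken from the article.

```python
def athena_query_request(sql: str, database: str, bucket: str) -> dict:
    """Build keyword arguments for submitting a query to Athena."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": f"s3://{bucket}/athena/"},
    }


# Illustrative aggregate over the clientes table.
CUSTOMERS_BY_REGION = (
    "SELECT region, COUNT(*) AS total_clientes "
    "FROM clientes GROUP BY region ORDER BY region"
)

request = athena_query_request(
    CUSTOMERS_BY_REGION,
    database="data_governance_catalog",
    bucket="company-data-governance-raw",
)
print(request["ResultConfiguration"]["OutputLocation"])
# s3://company-data-governance-raw/athena/
```

Assuming `athena = boto3.client("athena")`, `athena.start_query_execution(**request)` returns a query execution ID that can be polled for completion.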

Visualizing Insights with Amazon QuickSight

To visualize and share insights, open QuickSight → New Analysis.

Choose Athena as the data source
Select the data_governance_catalog
Build dashboards using your filtered datasets

Example visualizations:

Customers by region
Predicted transactions per segment
Sensitive data audit summaries

Quick Suite connects seamlessly with Athena, ensuring that governance rules continue to apply even at the visualization layer.

Cost Estimation

Here’s an approximate monthly cost breakdown for a moderate workload:

Component	Usage Assumption	Estimated / Month (USD)
S3 Storage	500 GB (Standard)	$11.50
S3 Requests	Moderate GET/PUT traffic	$8.00
CloudTrail Data Events	25 M events	$25.00
Glue Data Catalog	20k objects / 1 M requests	$1.20
Glue Crawler	~0.3 h/day (≈ 9 DPU-h)	$4.00
Glue ETL (light)	≈ 9 DPU-h/month	$4.00
Athena Queries	3 TB scanned/month	$15.00
CloudWatch Alarms	5 alarms	$0.50
QuickSight (Enterprise)	1 author + 1 reader + 5 GB SPICE	$30.90
Total Estimated Cost		≈ $100.10 / month

A compact and cost-efficient solution for complete data governance.
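
As a quick arithmetic check, the line items in the table can be summed in a few lines. The amounts are copied from the table as-is; `monthly_total` is a hypothetical helper, and `Decimal` is used so that cents add up exactly.

```python
from decimal import Decimal

# (component, estimated USD per month) copied from the table above.
LINE_ITEMS = [
    ("S3 Storage", "11.50"),
    ("S3 Requests", "8.00"),
    ("CloudTrail Data Events", "25.00"),
    ("Glue Data Catalog", "1.20"),
    ("Glue Crawler", "4.00"),
    ("Glue ETL (light)", "4.00"),
    ("Athena Queries", "15.00"),
    ("CloudWatch Alarms", "0.50"),
    ("QuickSight (Enterprise)", "30.90"),
]


def monthly_total(items) -> Decimal:
    """Sum the per-component estimates without floating-point rounding."""
    return sum((Decimal(amount) for _, amount in items), Decimal("0"))


total = monthly_total(LINE_ITEMS)
print(f"Total: ${total}")
# Total: $100.10
```

The largest single items are QuickSight and CloudTrail data events, so those are the first places to look if the estimate needs to come down.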

Conclusion

This architecture provides a secure, auditable, and scalable data governance foundation built entirely with managed AWS services.
From S3 encryption to Lake Formation filters and QuickSight dashboards, every layer enforces security, traceability, and performance. We can easily extend this solution by:

Adding Glue ETL jobs for automated transformations
Integrating with Amazon Redshift for advanced analytics
Applying Amazon Macie for sensitive data discovery

If you're building a data lake or starting a governance project, this structure provides a strong and repeatable foundation for compliance-ready analytics in AWS.

DEV Community