James Monek for AWS Community Builders

Posted on • Originally published at jamesmonek.com

Securing Data Lake in AWS

Cloud security has always been a challenge when moving resources to the cloud, particularly given the persistent negligence of leaving S3 buckets and data open to the public. With massive amounts of data available and the need to make data-driven decisions, Amazon offers many services to build and utilize a data lake in the cloud. How can one build a data lake in the cloud while still maintaining security? Fortunately, Amazon has a service for that. Well, actually, multiple services.

For the purposes of this article, we are going to assume that you create and store your data in S3 buckets.

Getting some of the basics out of the way

For starters, let's get some of the basic best practices out of the way.

Key Management Service

Absolutely encrypt everything: encryption at rest and encryption in transit. You'll want to use Key Management Service (KMS) to generate the keys used for your data lake. I suggest creating operational keys for data in CloudTrail, SNS notifications, RDS, and the S3 buckets used by services such as AWS Config and VPC Flow Logs. This ensures that if sensitive data is accidentally placed in logs and communications, it is still encrypted. Use different KMS keys for the data in the lake itself, and yes, rotate them at least annually if not more frequently.
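As a minimal sketch, the key setup can be done from the CLI. The description and alias here are examples, not required values:

```shell
# Create a customer-managed key for data lake objects (hypothetical description)
aws kms create-key --description "Data lake S3 encryption key"

# Give it a friendly alias (substitute the KeyId returned by create-key)
aws kms create-alias \
  --alias-name alias/datalake-s3 \
  --target-key-id <key-id-from-create-key>

# Turn on automatic annual key rotation
aws kms enable-key-rotation --key-id <key-id-from-create-key>
```

Automatic rotation runs yearly; if you want a shorter cycle, you'll need to rotate manually or via automation.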

S3

Misconfigured S3 buckets left open to the world are the root of many breaches. Until recently, the best practice was to force blocking all public access at the account level.

aws s3control put-public-access-block \
--account-id AccountId \
--public-access-block-configuration '{"BlockPublicAcls": true, "IgnorePublicAcls": true, "BlockPublicPolicy": true, "RestrictPublicBuckets": true}'

Amazon recently announced that the default setting for new S3 buckets will be to deny public access: https://aws.amazon.com/blogs/aws/heads-up-amazon-s3-security-changes-are-coming-in-april-of-2023/. I still recommend using the S3 Block Public Access control, and absolutely monitor for changes to those settings, which I will cover later.

While the default server-side encryption works fine, I would ensure that all your buckets are encrypted using KMS keys for extra protection for your lake.
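A sketch of setting default SSE-KMS encryption on a bucket; the bucket name and key alias are placeholders:

```shell
aws s3api put-bucket-encryption \
  --bucket my-datalake-bucket \
  --server-side-encryption-configuration '{
    "Rules": [{
      "ApplyServerSideEncryptionByDefault": {
        "SSEAlgorithm": "aws:kms",
        "KMSMasterKeyID": "alias/datalake-s3"
      },
      "BucketKeyEnabled": true
    }]
  }'
```

Enabling the S3 Bucket Key also cuts down on KMS request costs for high-volume lakes.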

Also, ensure you are enforcing secure connections only (SSL/TLS). More information at https://repost.aws/knowledge-center/s3-bucket-policy-for-config-rule
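The linked guidance boils down to a deny statement on insecure transport. A sketch of applying that policy via the CLI (bucket name is a placeholder):

```shell
aws s3api put-bucket-policy --bucket my-datalake-bucket --policy '{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyInsecureTransport",
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": [
      "arn:aws:s3:::my-datalake-bucket",
      "arn:aws:s3:::my-datalake-bucket/*"
    ],
    "Condition": {"Bool": {"aws:SecureTransport": "false"}}
  }]
}'
```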

EC2

If you are using EC2 as part of your lake, I recommend taking measures to ensure all EBS volumes are encrypted by default.

aws ec2 enable-ebs-encryption-by-default

Check for encryption by default on other services you may use, such as Amazon RDS.
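A couple of read-only checks can verify this. Note that EBS default encryption is per region, and RDS has no account-wide default: encryption is set per instance at creation with `--storage-encrypted`.

```shell
# Confirm EBS encryption-by-default in the current region
aws ec2 get-ebs-encryption-by-default

# Spot-check that existing RDS instances were created encrypted
aws rds describe-db-instances \
  --query 'DBInstances[].[DBInstanceIdentifier,StorageEncrypted]'
```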

CloudTrail

Everything you do in AWS, whether in the console, the command line interface, or CloudFormation templates, goes through an API, and all API calls get logged to CloudTrail. My suggestion here is to create trails that send events to both an S3 bucket and a CloudWatch log group. If you use AWS Organizations, then use an organization trail.
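A sketch of that setup with the CLI; the trail name, bucket, log group ARN, and role ARN are placeholders, and the bucket must already have a policy allowing CloudTrail to write to it:

```shell
# Multi-region trail writing to S3 and a CloudWatch log group, encrypted with KMS
aws cloudtrail create-trail \
  --name datalake-trail \
  --s3-bucket-name my-cloudtrail-bucket \
  --is-multi-region-trail \
  --kms-key-id alias/cloudtrail \
  --cloud-watch-logs-log-group-arn 'arn:aws:logs:us-east-1:111122223333:log-group:CloudTrail/datalake:*' \
  --cloud-watch-logs-role-arn 'arn:aws:iam::111122223333:role/CloudTrail_CloudWatchLogs_Role'

# Trails do not record events until logging is started
aws cloudtrail start-logging --name datalake-trail
```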

I'm suggesting the S3 bucket so that your Security Information and Event Management (SIEM) tool can pull from it; you can then easily search the data and build dashboards and alarms on top of it. Prime tools in this area are the Elasticsearch-Logstash-Kibana (ELK) stack, Splunk, and Amazon OpenSearch Service.

Sending them to CloudWatch allows you to create CloudWatch Alarms based on events. There are a bunch of events you can trigger on, such as unauthorized login attempts, changes to security groups, and so on. One I like to monitor for is use of the root account, which should never be used. More information can be found at https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudwatch-alarms-for-cloudtrail.html.
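For the root account example, the linked AWS documentation pairs a metric filter with an alarm. A sketch, where the log group name, metric namespace, and SNS topic ARN are placeholders:

```shell
# Emit a metric whenever the root account makes a non-service API call
aws logs put-metric-filter \
  --log-group-name CloudTrail/datalake \
  --filter-name RootAccountUsage \
  --filter-pattern '{ $.userIdentity.type = "Root" && $.userIdentity.invokedBy NOT EXISTS && $.eventType != "AwsServiceEvent" }' \
  --metric-transformations metricName=RootAccountUsageCount,metricNamespace=CloudTrailMetrics,metricValue=1

# Alarm on any occurrence and notify an SNS topic
aws cloudwatch put-metric-alarm \
  --alarm-name root-account-usage \
  --namespace CloudTrailMetrics \
  --metric-name RootAccountUsageCount \
  --statistic Sum --period 300 \
  --threshold 1 --comparison-operator GreaterThanOrEqualToThreshold \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:us-east-1:111122223333:security-alerts
```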

GuardDuty

One of the easiest services to enable is GuardDuty, which provides intelligent threat detection. It continuously monitors your buckets, workloads, and accounts for anomalies. Findings can also be sent to Security Hub.
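Enabling it really is close to a one-liner per region; the publishing frequency flag is optional:

```shell
aws guardduty create-detector \
  --enable \
  --finding-publishing-frequency FIFTEEN_MINUTES
```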

AWS Config

AWS Config assesses, audits, and evaluates the configurations of your resources. It can detect configuration drift and alert you. At the core, Security Hub compliance packs use AWS Config to check hundreds of compliance rules. In combination with Security Hub, I've found the tool helpful for getting your account set up securely based on your rules and then auditing when configurations change. Also, remember the settings for blocking public access to S3 buckets? Set up Config rules to monitor changes to those settings.
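For that last point, AWS ships a managed Config rule that watches the account-level Block Public Access settings. A sketch of enabling it (the rule name is your choice; the source identifier is the managed rule):

```shell
aws configservice put-config-rule --config-rule '{
  "ConfigRuleName": "s3-account-level-public-access-blocks",
  "Source": {
    "Owner": "AWS",
    "SourceIdentifier": "S3_ACCOUNT_LEVEL_PUBLIC_ACCESS_BLOCKS"
  }
}'
```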

CloudWatch

Earlier in the article, I mentioned enabling CloudTrail and sending events to a CloudWatch log group. The reason I suggest doing so is that you can create alarms based on the data in these logs: unauthorized access to resources, logins, AWS Config compliance changes, policy changes, and so on. More information on creating these types of alarms is at https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Create_alarm_log_group_metric_filter.html.

Security Hub

The best way I can describe Security Hub is as a central location for the security state of your account. It contains pre-built compliance packs with rules to secure your resources. Other AWS services, such as GuardDuty, AWS Config, and Firewall Manager, send findings to Security Hub. Third-party tools such as Prowler can also send findings to Security Hub.

When you first start with Security Hub, you'll be overwhelmed with findings. You'll want to assess those findings and correct the ones that make sense. If you discover ones you are comfortable with, you can suppress them. You'll want to suppress with a note, and the way I handle this is through the CLI, which can be automated: https://docs.aws.amazon.com/cli/latest/reference/securityhub/batch-update-findings.html
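A sketch of suppressing a finding with a note; the finding and product ARNs are placeholders you would copy from the finding itself, and the note text and author are examples:

```shell
aws securityhub batch-update-findings \
  --finding-identifiers '[{"Id": "<finding-arn>", "ProductArn": "<product-arn>"}]' \
  --workflow '{"Status": "SUPPRESSED"}' \
  --note '{"Text": "Accepted risk: documented exception", "UpdatedBy": "security-team"}'
```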

Security Hub re-evaluates on its own schedule, so there is no way to know how long a re-check will take. A handy tip is to use the CLI to force a manual check, though Security Hub will throttle the number of manual checks you can run in a given timeframe. More information is at https://docs.aws.amazon.com/config/latest/developerguide/evaluating-your-resources.html#evaluating-your-resources-console.
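Since these checks are backed by AWS Config, forcing a re-evaluation of a specific rule looks like this (the rule name is an example):

```shell
aws configservice start-config-rules-evaluation \
  --config-rule-names s3-account-level-public-access-blocks
```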

Once you get Security Hub into a clean state, you'll want to set up notifications using your preferred channel, whether that is email or Slack: https://docs.aws.amazon.com/securityhub/latest/userguide/securityhub-cloudwatch-events.html
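The linked approach routes Security Hub findings through EventBridge. A sketch wiring imported findings to an SNS topic (rule name and topic ARN are placeholders; in practice you'd usually narrow the event pattern to high-severity findings):

```shell
aws events put-rule \
  --name securityhub-findings \
  --event-pattern '{"source": ["aws.securityhub"], "detail-type": ["Security Hub Findings - Imported"]}'

aws events put-targets \
  --rule securityhub-findings \
  --targets 'Id=1,Arn=arn:aws:sns:us-east-1:111122223333:security-alerts'
```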

Prowler

I can't talk about AWS security without mentioning the open source tool Prowler, which you can run against your account to generate security findings. It has hundreds of compliance checks. I have a write-up on how to use Prowler, along with a CloudFormation script to deploy it. The article is at https://dev.to/aws-builders/automating-prowler-for-compliance-checking-in-aws-3oef

Wrapping it all up

By no means is this article a complete list of what you can do to secure your data lake in the cloud. I haven't covered all the services I use or even touched on the lists of compliance checks you can deploy. Use this as a starting point.

One final comment about securing your environment: always use CloudFormation templates, the CDK, or the CLI to develop your infrastructure as code. Avoid doing any of this in the console. This makes all your work portable to other accounts. Trust me, it's worth the investment up front.
