<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sanjeeb Mohapatra</title>
    <description>The latest articles on DEV Community by Sanjeeb Mohapatra (@sanjeeb2017).</description>
    <link>https://dev.to/sanjeeb2017</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1010862%2F022e3a9c-eef0-4773-b598-362c0a9b79e7.png</url>
      <title>DEV Community: Sanjeeb Mohapatra</title>
      <link>https://dev.to/sanjeeb2017</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sanjeeb2017"/>
    <language>en</language>
    <item>
      <title>Blog -3-Data Engineering – AWS S3 space monitoring – Storage Lens</title>
      <dc:creator>Sanjeeb Mohapatra</dc:creator>
      <pubDate>Fri, 10 Mar 2023 18:51:35 +0000</pubDate>
      <link>https://dev.to/sanjeeb2017/blog-3-data-engineering-aws-s3-cost-monitoring-storage-lens-b66</link>
      <guid>https://dev.to/sanjeeb2017/blog-3-data-engineering-aws-s3-cost-monitoring-storage-lens-b66</guid>
      <description>&lt;h2&gt;Data Engineering – AWS S3 cost monitoring – Storage Lens&lt;/h2&gt;

&lt;p&gt;Amazon S3 is an object storage service and one of the most popular services in AWS, offering industry-leading scalability, data availability, security, and performance. Organizations can store and retrieve any amount of data from anywhere.&lt;/p&gt;

&lt;p&gt;If an organization uses AWS as its cloud provider, Amazon S3 is one of the preferred storage solutions. Some common use cases for S3 are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Build an enterprise data lake.&lt;/li&gt;
&lt;li&gt;Create a disaster recovery system to back up and restore data.&lt;/li&gt;
&lt;li&gt;Archive cold data for a long period to meet regulatory requirements.&lt;/li&gt;
&lt;li&gt;Host a static website.&lt;/li&gt;
&lt;li&gt;Integrate with many cloud-native solutions as a storage option.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While you can store an unlimited amount of data in S3, it is very important to monitor the storage used and the number of objects in your S3 buckets. At the end of the day, every object stored incurs a cost. Organizations may not notice the storage cost while they hold only gigabytes or terabytes of data, but when the volume grows to petabytes the S3 bill becomes significant.&lt;/p&gt;

&lt;p&gt;For example, storing 10 PB of data (typical for enterprise-scale applications such as a data lake or lakehouse) costs roughly 220K USD in the UK (London) region for storage, so it is very important to understand how your S3 buckets are used.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OajbEZmu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z6mrt06o2ruaj9g4gs0r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OajbEZmu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z6mrt06o2ruaj9g4gs0r.png" alt="Image description" width="800" height="226"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS S3 has a feature called “Storage Lens” where you can create your own custom dashboard and monitor the usage of S3 objects. In this blog, we will create a dashboard using Storage Lens and see how it works.&lt;br&gt;
To do the same:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click on S3 in the AWS Management Console. You can see the overall utilization of all your S3 buckets.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KGYcMaBp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x6jzs0038zwrpd61a6hv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KGYcMaBp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x6jzs0038zwrpd61a6hv.png" alt="Image description" width="800" height="223"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;To create a dashboard, click on Storage Lens in the left panel, then click on Dashboards.&lt;/li&gt;
&lt;li&gt;Click on Create dashboard.&lt;/li&gt;
&lt;li&gt;Provide the details below:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Dashboard name: in this case we use s3-bucket-usage-monitor.&lt;/li&gt;
&lt;li&gt;Home Region: select the appropriate region; for us it is the London region, eu-west-2.&lt;/li&gt;
&lt;li&gt;Status: select Enable so that the dashboard is active and collecting metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---_MD3ISX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kupjj1lkb5bf8a5iocmh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---_MD3ISX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kupjj1lkb5bf8a5iocmh.png" alt="Image description" width="800" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;For Dashboard scope, if you have objects across multiple regions, you can select those regions; in our case we select ONLY the London region, as all our objects are stored in the London region, and include all buckets in the region.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3NW7JGF0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nez5nrj4pa3w8u0aue8e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3NW7JGF0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nez5nrj4pa3w8u0aue8e.png" alt="Image description" width="670" height="529"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In the metrics section, select Free metrics. A lot of key metrics are available under free metrics, and they are more than enough to monitor the usage of S3 buckets.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LIAZo_VC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lgr14myxdwkbh3l75uad.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LIAZo_VC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lgr14myxdwkbh3l75uad.png" alt="Image description" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You can export these metrics to an S3 path for further analytical use; in our case we disabled this option. Finally, click Create dashboard.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--I-NKTvLd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bc2rr8txt60pv4fa3ay5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--I-NKTvLd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bc2rr8txt60pv4fa3ay5.png" alt="Image description" width="800" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It can take up to 48 hours for the charts to be ready.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iSAIxg5y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/svprnep3ckvpgav4pmsk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iSAIxg5y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/svprnep3ckvpgav4pmsk.png" alt="Image description" width="800" height="235"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By default, AWS creates a default dashboard for you (which covers all regions); if you want to use the default dashboard, that is fine as well. For custom requirements, such as a specific region or a specific rule, we can create a custom dashboard. A sample chart from the default dashboard is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--F5YWct_q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a4ivpwhk486vt81y5i3k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--F5YWct_q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a4ivpwhk486vt81y5i3k.png" alt="Image description" width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick Tips:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Once you understand the usage, you can see which critical data is required for your use case and what its access pattern is. If there are buckets and folders whose access pattern is unknown, it is better to use S3 Intelligent-Tiering for cost savings.&lt;/li&gt;
&lt;li&gt;Many organizations enable bucket versioning so that they can avoid accidental deletion of objects; however, versioning ONLY makes sense for critical data objects (data that is difficult to recreate, such as a scripts folder). For use cases like a data lake, where you receive source data, process it, and move it to archive, you really DO NOT need versioning on the staging bucket.&lt;/li&gt;
&lt;li&gt;For files that need to be stored long term, it is better to define a lifecycle transition to Glacier storage; this can be set up using a lifecycle management policy (a scripted sketch follows this list). &lt;/li&gt;
&lt;/ol&gt;
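
&lt;p&gt;For tip 3, the lifecycle transition to Glacier can also be applied programmatically. Below is a minimal sketch using boto3; the bucket name, prefix, and the 90-day threshold are illustrative assumptions and should be adjusted to your own retention requirements.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

s3 = boto3.client('s3')

# Transition objects under the archive/ prefix to Glacier after 90 days.
# Bucket name, prefix, and the 90-day value are assumptions for illustration.
s3.put_bucket_lifecycle_configuration(
    Bucket='my-example-bucket',
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'transition-cold-data-to-glacier',
                'Filter': {'Prefix': 'archive/'},
                'Status': 'Enabled',
                'Transitions': [
                    {'Days': 90, 'StorageClass': 'GLACIER'},
                ],
            },
        ]
    },
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;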

</description>
      <category>awscommunity</category>
      <category>aws</category>
      <category>awscommunitybuilder</category>
    </item>
    <item>
      <title>Blog -2 :AWS Data Engineering: Audit Logs Enablement in Redshift and write to S3</title>
      <dc:creator>Sanjeeb Mohapatra</dc:creator>
      <pubDate>Sat, 25 Feb 2023 10:22:12 +0000</pubDate>
      <link>https://dev.to/sanjeeb2017/blog-2-aws-data-engineering-audit-logs-enablement-in-redshift-and-write-to-s3-3c0o</link>
      <guid>https://dev.to/sanjeeb2017/blog-2-aws-data-engineering-audit-logs-enablement-in-redshift-and-write-to-s3-3c0o</guid>
      <description>&lt;p&gt;Amazon Redshift is a columnar, fully managed cloud data warehouse service in the AWS ecosystem. It is used to execute complex analytical queries on large data sets via an MPP (Massively Parallel Processing) architecture. The data volume can range from gigabytes to petabytes. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important Key features for Redshift:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It supports virtually unlimited concurrent users for different workloads, such as BI/dashboard reporting, data ingestion, and ad hoc data analysis. &lt;/li&gt;
&lt;li&gt;It also monitors user workloads and uses machine learning (ML) to find ways to improve the physical layout of data to further optimize query speeds.&lt;/li&gt;
&lt;li&gt;Amazon Redshift supports industry-leading security to protect your data in transit and at rest. It is compliant with SOC1, SOC2, SOC3, and PCI DSS Level 1 requirements.&lt;/li&gt;
&lt;li&gt;Both structured and semi-structured data can be processed and analysed using Amazon Redshift. It supports ORC, Parquet, JSON, CSV, and Avro file formats.&lt;/li&gt;
&lt;li&gt;Amazon Redshift is a fully managed service; users do not need to worry about tasks such as installing, patching, or updating software and can focus their resources on generating business value rather than maintaining infrastructure.&lt;/li&gt;
&lt;li&gt;It flexibly manages workload priorities so that short, fast-running queries won't get stuck in queues behind long-running queries.&lt;/li&gt;
&lt;li&gt;It monitors user workloads and uses sophisticated algorithms to find ways to improve the physical layout of data to optimize query speeds.&lt;/li&gt;
&lt;li&gt;Amazon redshift has the flexibility to connect different BI tools like QuickSight, Tableau, Power BI and Analytical tools like Jupyter notebook.&lt;/li&gt;
&lt;li&gt;It is fault tolerant, which enhances the reliability of your data warehouse cluster with features such as continuous monitoring of cluster health, automatic re-replication of data from failed drives, and node replacement as necessary.&lt;/li&gt;
&lt;li&gt;Using Amazon Redshift Spectrum, users can efficiently query and retrieve structured and semi-structured data from files in Amazon Simple Storage Service (Amazon S3) without needing to load the data into Amazon Redshift tables.&lt;/li&gt;
&lt;li&gt;AWS data exchange can be used along with AWS Redshift to load and query third party data sources.&lt;/li&gt;
&lt;li&gt;Amazon Redshift ML can be used to create, train, and apply ML models with standard SQL.&lt;/li&gt;
&lt;li&gt;Redshift integrates well with AWS services to move, transform, and load data quickly and reliably (for example S3, DynamoDB, EMR, EC2, Data Pipeline, etc.)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this blog, we are going to discuss the below problem statement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem Statement:&lt;/strong&gt; One of the leading financial companies is planning to use AWS Redshift for their data warehouse. They must align with the compliance requirements below.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Track all audit logs of the Redshift cluster.
There are 3 types of audit logs available in Redshift.&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Connection log – Logs authentication attempts, connections, and disconnections.&lt;/li&gt;
&lt;li&gt;User log – Logs information about changes to database user definitions.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;User activity log – Logs each query before it's run on the database.&lt;/p&gt;

&lt;p&gt;The connection and user logs are useful for security checks; they provide details on which user is connecting to the Redshift cluster, the user’s IP address, connection time, and so on. The user activity log is useful primarily for troubleshooting purposes. It tracks information about the types of queries that both the users and the system perform in the database, capturing details such as the query fired by the user, the user id, and the record time.&lt;br&gt;
&lt;em&gt;&lt;strong&gt;Note: (From AWS documentation)&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The connection log, user log, and user activity log are enabled together by using the AWS Management Console, the Amazon Redshift API Reference, or the AWS Command Line Interface (AWS CLI). For the user activity log, you must also enable the enable_user_activity_logging database parameter. If you enable only the audit logging feature, but not the associated parameter, the database audit logs log information for only the connection log and user log, but not for the user activity log. The enable_user_activity_logging parameter is not enabled (false) by default. You can set it to true to enable the user activity log. &lt;em&gt;In this demo we enable logging from the AWS Management Console and do not consider the user activity log, as we have not created a custom parameter group.&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Need to store audit logs securely, i.e., encrypted at rest.&lt;/li&gt;
&lt;li&gt;Need to store audit logs for a specific time; the retention policy for the audit logs is 6 months.&lt;/li&gt;
&lt;li&gt;Detailed analytics reports need to be derived from the audit logs on a monthly basis.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let’s jump into the lab.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step -1: Create a Redshift IAM role which will have access to S3.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Log in to the AWS Management Console. Search for IAM in the AWS search bar and click on IAM.&lt;/li&gt;
&lt;li&gt;Click on Roles. &lt;/li&gt;
&lt;li&gt;Click create role.&lt;/li&gt;
&lt;li&gt;Select Trusted entity type as AWS services&lt;/li&gt;
&lt;li&gt;Under Use case tab, search for Redshift and click Next&lt;/li&gt;
&lt;li&gt;In the permission policies, search for S3 and select AmazonS3FullAccess (Note – it is NOT best practice to give full permission in IAM; for our demo we selected full access, but for production workloads it is better to follow least-privilege principles) and click Next.&lt;/li&gt;
&lt;li&gt;Give the role a name and description and click Create role.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vqqanhkU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m83ju2b582lwwcpzidc1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vqqanhkU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m83ju2b582lwwcpzidc1.png" alt="Image description" width="602" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cVgvqEel--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dwrg9uxao9gvy0taeazs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cVgvqEel--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dwrg9uxao9gvy0taeazs.png" alt="Image description" width="602" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step -2: Create and configure a Redshift cluster&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;To create a Redshift Cluster, search Redshift in the aws console search bar.&lt;/li&gt;
&lt;li&gt;Click on create cluster&lt;/li&gt;
&lt;li&gt;Give the name of the cluster under cluster identifier.&lt;/li&gt;
&lt;li&gt;We selected the Production option to have more configuration choices; however, you can choose the free tier eligible option as well for a demo. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--O2AMTSnH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xg2n2kn5usg07cumapua.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--O2AMTSnH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xg2n2kn5usg07cumapua.png" alt="Image description" width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Select the “I’ll choose” option to pick the cluster size yourself. This lets you configure the cluster in a customized way. We selected the node type dc2.large (Note – this is NOT under the free tier and there will be some cost involved). For our demo we selected 1 node, but in real production cases the number of nodes will always be greater than 1 to have a leader node and compute node configuration.&lt;/li&gt;
&lt;li&gt;Select the sample data option so that Redshift loads sample data for you by default; this lets you run some sample queries and see results quickly.&lt;/li&gt;
&lt;li&gt;Under Database configuration, provide the admin user name and password. These credentials are required to log in to the Redshift cluster and set up other users afterwards.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FqHowPE1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3vdt3elmyhh1uzz4zr6r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FqHowPE1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3vdt3elmyhh1uzz4zr6r.png" alt="Image description" width="602" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Under Associated IAM role, select Associate IAM role, select the IAM role which we created in step-1 and click on Associate IAM role.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mkf0I_E4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nooc79jxldexqucxkct3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mkf0I_E4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nooc79jxldexqucxkct3.png" alt="Image description" width="602" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;For Additional configurations, keep the default options. We will configure audit logging once the cluster is created. Click Create cluster to create the cluster.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vRl09men--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v3z3npxdrbxbrz0kulu9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vRl09men--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v3z3npxdrbxbrz0kulu9.png" alt="Image description" width="602" height="237"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;To see the cluster creation status, click on Clusters in the left panel; you can see the cluster in Creating status.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8AUK-jb0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9xrjq9hhdw88dg6hmy5k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8AUK-jb0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9xrjq9hhdw88dg6hmy5k.png" alt="Image description" width="602" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Once the cluster is configured correctly, you can see the cluster status change to Available.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zBpB2urL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k81jh2c58apsbhru5pm0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zBpB2urL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k81jh2c58apsbhru5pm0.png" alt="Image description" width="602" height="144"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;To enable audit logging, select the cluster, go to the Properties tab and, under Database configurations, select Edit audit logging. Select Turn on, select the bucket and prefix where you want to store your audit logs, and click Save changes (a scripted sketch follows the screenshots below).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_P1NjlJe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4z26gs19orw18gpzl0sw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_P1NjlJe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4z26gs19orw18gpzl0sw.png" alt="Image description" width="602" height="215"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fmelwG2j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pzhp02mx2ebu41ayb9y9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fmelwG2j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pzhp02mx2ebu41ayb9y9.png" alt="Image description" width="602" height="215"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PAnavGX3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m4t72392lhirv3wmrxle.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PAnavGX3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m4t72392lhirv3wmrxle.png" alt="Image description" width="599" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Once you have enabled the audit logging, you can see the details under database configuration.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4N6f1JOk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vj7qxgksgjgn7c1wiu15.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4N6f1JOk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vj7qxgksgjgn7c1wiu15.png" alt="Image description" width="602" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;To connect to the Redshift cluster and run queries, you can install a client tool and download the Redshift driver from the AWS Redshift console, or use the query editor available in the Redshift console to connect directly to the Redshift database. Click on Query editor v2; it will open a new tab with a query editor where you can run some sample queries.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XGkcxiZO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bknnq16v6vnz585ae34t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XGkcxiZO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bknnq16v6vnz585ae34t.png" alt="Image description" width="602" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Since we already chose to load sample data during creation of the Redshift cluster, select the Redshift cluster, the dev database, the public schema, and Tables, then double-click any table to place a select statement in the query window. Run the SQL statement and see the result (a scripted alternative is sketched after the screenshot below).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--O02q0Wzg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4xll2hpohmykicbsi97m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--O02q0Wzg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4xll2hpohmykicbsi97m.png" alt="Image description" width="602" height="285"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Once you have performed some actions, such as selecting some records or checking a table count, you can analyze the audit log files that are generated in S3.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xBH_iamW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tex9ghr6dlkl0eeqlk6d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xBH_iamW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tex9ghr6dlkl0eeqlk6d.png" alt="Image description" width="602" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;To see a log file, you can download it locally and open it in any editor; one sample snapshot of the connection log is: &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sZVJvBly--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sspp3or0y3pfwij3ls6u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sZVJvBly--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sspp3or0y3pfwij3ls6u.png" alt="Image description" width="602" height="122"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step -3: Define a lifecycle rule on the S3 bucket (audit logs) and remove files older than 6 months.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;To ensure all our audit logs are encrypted at rest: the S3 bucket is encrypted by default (this is a feature AWS introduced recently, so no action is required from the user side). You can verify it by navigating to the bucket, clicking Properties, and checking the default encryption.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--J50pOLd9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gjffpp4xh8klt84bt8nh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--J50pOLd9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gjffpp4xh8klt84bt8nh.png" alt="Image description" width="602" height="109"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;To ensure old log files are deleted automatically after 6 months, we will define an S3 lifecycle rule that deletes files created more than 6 months ago. To do the same, navigate to the S3 bucket, click on Management, and click on Create lifecycle rule.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--68lmAedb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1om0p4xegl0wa427082d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--68lmAedb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1om0p4xegl0wa427082d.png" alt="Image description" width="602" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Give the lifecycle rule a name; in our case it is remove-6month-old-file.&lt;/li&gt;
&lt;li&gt;Since we created a separate S3 bucket to store the audit logs, select Apply to all objects in the bucket and acknowledge the rule.&lt;/li&gt;
&lt;li&gt;Under Lifecycle rule actions, select Permanently delete noncurrent versions of objects and Delete expired object delete markers or incomplete multipart uploads.&lt;/li&gt;
&lt;li&gt;Under Permanently delete noncurrent versions of objects, give the value as 180 days.&lt;/li&gt;
&lt;li&gt;Under Delete expired object delete markers or incomplete multipart uploads, check both options and put 180 days for incomplete multipart uploads (files that were never uploaded completely). A scripted sketch of this rule follows the screenshots below. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--by17KsHx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/upwa8i9ab4snwctfzuw9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--by17KsHx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/upwa8i9ab4snwctfzuw9.png" alt="Image description" width="602" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gw5FdcOV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ka7jmc21ybpnsd7dmszr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gw5FdcOV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ka7jmc21ybpnsd7dmszr.png" alt="Image description" width="602" height="116"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The final step is to terminate your Redshift cluster after your POC. To delete the cluster, select the cluster and, under Actions, select Delete.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Op6KZUlu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0pgjh5bbqbbhzp4vt4h9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Op6KZUlu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0pgjh5bbqbbhzp4vt4h9.png" alt="Image description" width="602" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;&lt;br&gt;
If you do not want to take a snapshot, uncheck the create final snapshot option, fill in the details on the confirmation prompt, and click Delete cluster. It will delete your cluster. Once the cluster is deleted, you can no longer see it in the cluster list of the Redshift console.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference Material:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/db-auditing.html"&gt;https://docs.aws.amazon.com/redshift/latest/mgmt/db-auditing.html&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>awscommunity</category>
      <category>aws</category>
      <category>dataengineering</category>
      <category>awscommunitybuilder</category>
    </item>
    <item>
      <title>Blog -1: Data Engineering - AWS Data and Analytics – Collection – Kinesis Data Stream</title>
      <dc:creator>Sanjeeb Mohapatra</dc:creator>
      <pubDate>Wed, 18 Jan 2023 23:47:22 +0000</pubDate>
      <link>https://dev.to/sanjeeb2017/data-engineering-aws-data-and-analytics-collection-kinesis-data-stream-2mho</link>
      <guid>https://dev.to/sanjeeb2017/data-engineering-aws-data-and-analytics-collection-kinesis-data-stream-2mho</guid>
      <description>&lt;p&gt;Amazon Kinesis Data Streams is used to collect and process large streams of data in real time. Producers continuously push data records into a Kinesis data stream, and downstream applications (called consumers) use different mechanisms to read the data from the stream. Kinesis Data Streams is a scalable and durable real-time data streaming service.&lt;/p&gt;

&lt;p&gt;High level Architecture ( The below diagram is from AWS site – Refer - &lt;a href="https://docs.aws.amazon.com/streams/latest/dev/key-concepts.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/streams/latest/dev/key-concepts.html&lt;/a&gt;) &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzvhx7rlv08yia0qse4bv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzvhx7rlv08yia0qse4bv.png" alt="Image description" width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Notes:&lt;/strong&gt; (The points below are some important concepts in Kinesis Data Streams, taken from the AWS site above.)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The producer continuously produces (pushes) data to the Kinesis data stream. The data is stored in shards within the Kinesis data stream.&lt;/li&gt;
&lt;li&gt;The Consumers (such as a custom application running on Amazon EC2 or an Amazon Kinesis Data Firehose delivery stream) can store their results using an AWS service such as Amazon DynamoDB, Amazon Redshift, or Amazon S3.&lt;/li&gt;
&lt;li&gt;A Kinesis data stream is a set of shards. Each shard has a sequence of data records. Each data record has a sequence number that is assigned by Kinesis Data Streams. A data record is a unit of data stored in the Shard. The data record contains the sequence number and partition key and data (in blob) and is immutable (cannot be changed).&lt;/li&gt;
&lt;li&gt;The default retention period of data records is 24 hours but it can be extended to 7 days (168 hours)&lt;/li&gt;
&lt;li&gt;A shard is a uniquely identified sequence of data records in a stream. A stream is composed of one or more shards, each of which provides a fixed unit of capacity. Each shard can support up to 5 transactions per second for reads, up to a maximum total data read rate of 2 MB per second and up to 1,000 records per second for writes, up to a maximum total data write rate of 1 MB per second (including partition keys). The data capacity of your stream is a function of the number of shards that you specify for the stream. The total capacity of the stream is the sum of the capacities of its shards.&lt;/li&gt;
&lt;/ul&gt;
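
&lt;p&gt;Because each shard has the fixed read and write limits listed above, the number of shards for a provisioned stream is usually estimated from the expected record size, records per second, and number of consumers. The sketch below follows that sizing idea; the workload numbers in it are assumptions for illustration only.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import math

# Illustrative workload assumptions
avg_record_kb = 5          # average record size in KB
records_per_second = 500   # incoming records per second
num_consumers = 2          # applications reading the stream

# Per-shard limits: about 1 MB/s (1024 KB/s) for writes, 2 MB/s (2048 KB/s) for reads
incoming_write_kb_s = avg_record_kb * records_per_second
outgoing_read_kb_s = incoming_write_kb_s * num_consumers

shards = max(
    math.ceil(incoming_write_kb_s / 1024),
    math.ceil(outgoing_read_kb_s / 2048),
)
print(f"Estimated shards needed: {shards}")  # 3 for these example numbers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;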

&lt;p&gt;The main objective of this blog is to perform the below use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem Statement:&lt;/strong&gt; Build a real time streaming application using Amazon Kinesis Data Stream.&lt;/p&gt;

&lt;p&gt;Details: In this lab, we are going to perform the below steps.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a real-time data streaming system using Amazon Kinesis Data Streams (KDS).&lt;/li&gt;
&lt;li&gt;We are going to use Amazon Kinesis, AWS Lambda, Amazon S3, and IAM. &lt;/li&gt;
&lt;li&gt;A user will upload a file to S3 (for the demo, we will upload it via the AWS Management Console).&lt;/li&gt;
&lt;li&gt;We will set up an event trigger on S3 that invokes a Lambda function. The Lambda function will work as a producer and push the details to the Kinesis data stream.&lt;/li&gt;
&lt;li&gt;We will then use 2 consumers (Lambda functions) that read the data from the Kinesis data stream. Any downstream application can be integrated with these consumers. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The high level architecture diagram is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytpz0ebhlyc9boi027xq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytpz0ebhlyc9boi027xq.png" alt="Image description" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s jump into the lab&lt;br&gt;
Step -1:  Create a Kinesis Data Stream&lt;/p&gt;

&lt;p&gt;1. To create a Kinesis data stream, log in to the AWS Management Console and search for the Kinesis service.&lt;br&gt;
2. Select Kinesis Data Streams and click on Create data stream.&lt;br&gt;
3. Give the data stream a name.&lt;br&gt;
4. Data stream capacity: here we are selecting Provisioned, but you can select On-demand; choose the option your workload requires. We selected the number of shards as 1. However, the required number of shards can be calculated from the input record size, the number of records per second, and the number of consumers (see the shard-estimation sketch earlier).&lt;br&gt;
5. Finally, click on Create data stream.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvdj264fptxhxmqull9kt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvdj264fptxhxmqull9kt.png" alt="Image description" width="800" height="546"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67rfgjrymvj964r5sil9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67rfgjrymvj964r5sil9.png" alt="Image description" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Step -2: Once the KDS (Kinesis Data Stream) is created, click the stream, go to the Configuration tab, go to the Encryption option, click Edit, check Enable server-side encryption, and use the default encryption type (in this case we used the AWS managed CMK). Click Save changes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2awtkrtguon6wo8mpzaj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2awtkrtguon6wo8mpzaj.png" alt="Image description" width="800" height="562"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Step -3: Create S3 bucket &lt;br&gt;
1. Search for the S3 service in the AWS Management Console search box.&lt;br&gt;
2. Click on the Create bucket option.&lt;br&gt;
3. Give the bucket a name and select the right region (the same region where you created your Kinesis stream).&lt;br&gt;
4. Enable bucket versioning (however, this depends on your requirements).&lt;br&gt;
5. Select the default encryption option (Amazon S3 managed SSE-S3) and click Create bucket to create the S3 bucket.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbb79ekndke8mufsgkda5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbb79ekndke8mufsgkda5.png" alt="Image description" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Step -4: Create lambda function (Producer and Consumers)&lt;br&gt;
In this step, we will create 3 lambda functions (One producer, 2 consumers)&lt;br&gt;
Before creating the Lambda functions, create a role in IAM; the role should have access to S3, Kinesis, and the Lambda basic execution role. We are not detailing how to create the role here; the policy JSON is shown in the screenshot below, and a scripted sketch follows it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9r0vub4elg73dlh3er1c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9r0vub4elg73dlh3er1c.png" alt="Image description" width="592" height="697"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Step -4: To create the first lambda function (Producer)&lt;br&gt;
1.Search lambda service from AWS management console&lt;br&gt;
2.Click on Create function&lt;br&gt;
3.Select Author from Scratch&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu95r8gton75720vpvfwh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu95r8gton75720vpvfwh.png" alt="Image description" width="800" height="439"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;1. Give the basic information for the Lambda function, such as the function name and the runtime environment (in this case we selected Python as the runtime).&lt;br&gt;
2. Under Permissions, select Use an existing role and attach the role (which you created in the step above).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhg82rx68lln4fa07tm9k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhg82rx68lln4fa07tm9k.png" alt="Image description" width="800" height="570"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the Lambda code editor, copy and paste the code below (this is simple Python code that reads the object from S3 referenced by the event trigger and then pushes its contents to the Kinesis data stream). &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3z2i1v0jgo53bv0o210.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3z2i1v0jgo53bv0o210.png" alt="Image description" width="800" height="418"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import boto3

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    print("Bucket Name: ", bucket)
    file_key_name = event['Records'][0]['s3']['object']['key']
    s3_full_path = bucket+"/"+file_key_name
    print(s3_full_path)
    client = boto3.client('s3')
    data_obj = client.get_object(Bucket=bucket, Key=file_key_name)
    body = data_obj['Body']
    data_string = body.read().decode('utf-8')
    print(data_string)

    K_client = boto3.client('kinesis')
    response = K_client.put_record(
    StreamName='sanjeeb-test-kinesis-stream',
    Data=data_string,
    PartitionKey='123'
     )
    print(response)
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note – We have not configured the S3 event trigger yet. To do the same:&lt;br&gt;
1. Go to S3 and click on the bucket you created earlier.&lt;br&gt;
2. Click on Properties.&lt;br&gt;
3. Click on Create event notification, give it a name, and set the suffix (in this case we put .txt, so that whenever we upload a text file with a .txt extension it triggers the target event, which is a Lambda function).&lt;br&gt;
4. Select the event type All object create events (a scripted version of this configuration is sketched after the screenshots below).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmq9yumt1f0yhhodmrexo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmq9yumt1f0yhhodmrexo.png" alt="Image description" width="800" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select the destination as lambda function and select your lambda function.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fael7txnnkqcok9epz2yo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fael7txnnkqcok9epz2yo.png" alt="Image description" width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Step -5: Create consumers (2 Lambda functions; only the names differ, the code and other steps are the same).&lt;br&gt;
To create a consumer Lambda function:&lt;br&gt;
1. Search for Lambda in the AWS Management Console.&lt;br&gt;
2. Select Lambda and create a function.&lt;br&gt;
3. Select Author from scratch.&lt;br&gt;
4. Give the basic information.&lt;br&gt;
5. Select the runtime environment as Python.&lt;br&gt;
6. Under Permissions, attach the same Lambda role that was created for the producer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frombh2qbsm8jde3zrw9y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frombh2qbsm8jde3zrw9y.png" alt="Image description" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the code section of the Lambda, paste the Python code below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8i6w0bkabsan8hhre72c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8i6w0bkabsan8hhre72c.png" alt="Image description" width="800" height="412"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import base64

def lambda_handler(event, context):
    record = event['Records']
    for rec in record:
        data_record=rec['kinesis']['data']
        data = base64.b64decode(data_record)
        print(data)
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;1. To consume data from Kinesis, we need to add a trigger; to do the same, click the Configuration tab of the Lambda function.&lt;br&gt;
2. Click on Triggers.&lt;br&gt;
3. Click on Add trigger.&lt;br&gt;
4. Select Kinesis as the source.&lt;br&gt;
5. Select Activate trigger and keep the other options at their defaults (a scripted sketch follows the screenshot below).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxarwbccr6udk6xa5xwsu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxarwbccr6udk6xa5xwsu.png" alt="Image description" width="800" height="236"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To test end to end, upload a file to the S3 bucket and check the CloudWatch logs (log groups). Select the consumer log group to see the consumer logs and the producer log group to see the producer logs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt; – We are NOT creating the second Lambda consumer function here. You can follow the same steps and script to create it (only the function name needs to be changed).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpwxqv3cjy58m3p6fgso5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpwxqv3cjy58m3p6fgso5.png" alt="Image description" width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>showdev</category>
      <category>programming</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
