<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gururajan Padmanaban</title>
    <description>The latest articles on DEV Community by Gururajan Padmanaban (@pgr405).</description>
    <link>https://dev.to/pgr405</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F886307%2Fb0b26340-86eb-49e7-8573-a87d5a8f25bc.jpg</url>
      <title>DEV Community: Gururajan Padmanaban</title>
      <link>https://dev.to/pgr405</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pgr405"/>
    <language>en</language>
    <item>
      <title>#AWS - Quicksight(SSO) user management.</title>
      <dc:creator>Gururajan Padmanaban</dc:creator>
      <pubDate>Sun, 17 Jul 2022 19:21:45 +0000</pubDate>
      <link>https://dev.to/pgr405/aws-quicksightsso-user-management-f6j</link>
      <guid>https://dev.to/pgr405/aws-quicksightsso-user-management-f6j</guid>
<description>&lt;h2&gt;Scenario:&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Our AWS QuickSight account is configured with SSO for visualizing product metrics, and we want to manage the QuickSight users.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Requirement:&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;User activity monitoring, including last login, resource usage, dashboard popularity, etc…&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Issue:&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Currently, QuickSight has no built-in feature that lets you see your users' activity metrics directly.
QuickSight users are charged monthly, even if a user never signs in to the account.
You can expect a full month's charge even if a user is removed before the end of the month.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Why?&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;To keep track of user activity and identify inactive users.&lt;/li&gt;
&lt;li&gt;To notify inactive users and remind them to sign in.&lt;/li&gt;
&lt;li&gt;To delete a user automatically if they have not been active for n days.&lt;/li&gt;
&lt;li&gt;To improve the dashboards with the lowest visits.&lt;/li&gt;
&lt;li&gt;To decommission the dashboards that are less popular.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Solution:&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Everything is an API call when it comes to AWS.&lt;/li&gt;
&lt;li&gt;Each activity is logged in CloudTrail.&lt;/li&gt;
&lt;li&gt;We can use Amazon CloudTrail logs to gather information on user/dashboard activity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Quicksight Account Access:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When a user logs in to AWS via SSO, a QuickSight user account is created automatically if one does not already exist.&lt;/li&gt;
&lt;li&gt;Once the user account is created, based on the role and permissions provided, the user can access QuickSight resources such as dashboards, analyses, datasets, etc.&lt;/li&gt;
&lt;li&gt;CloudTrail records actions taken by a user, role, or AWS service in Amazon QuickSight.&lt;/li&gt;
&lt;li&gt;CloudTrail will log the following events for QuickSight:&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Data source&lt;/strong&gt; create/update/delete&lt;br&gt;
&lt;strong&gt;Data set&lt;/strong&gt; create/update/delete&lt;br&gt;
&lt;strong&gt;Analysis&lt;/strong&gt; create/access/update/delete&lt;br&gt;
&lt;strong&gt;Dashboard&lt;/strong&gt; create/access/update/delete, etc…&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Quicksight user activity monitoring process:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/big-data/building-an-administrative-console-in-amazon-quicksight-to-analyze-usage-metrics/"&gt;Direct user activity monitoring&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Once the SSO tool is integrated with Quicksight, users can use the tool to log in to Quicksight.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Most of the tools use &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_providers_saml.html"&gt;SAML&lt;/a&gt; (Security Assertion Markup Language 2.0) to federate users into AWS.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The only out-of-the-box way to monitor user activity is via the Manage QuickSight console.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;By using the AWS API/CLI/SDK we can collect the required information from Cloudtrail.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SSO will use the AWS &lt;a href="https://docs.aws.amazon.com/STS/latest/APIReference/welcome.html"&gt;STS&lt;/a&gt; (Security Token Service) to request temporary, limited-privilege credentials for federated users.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The federated user will be provisioned to perform the &lt;strong&gt;sts:AssumeRoleWithSAML&lt;/strong&gt; action to access Quicksight resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The SSO user login activity will be logged under {'Event' : 'AssumeRoleWithSAML'}&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Other activities will be logged under {'Event' : &lt;strong&gt;['CreateUser', 'DeleteUser', 'UpdateUser', 'GetDashboard', 'GetAnalysis']&lt;/strong&gt;}&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;By filtering the events based on the event name we can get the list of all activities from CloudTrail.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;From the CloudTrail log, we can get the user identity details such as event time, type, username, identity provider, etc…&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Code sample:&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.aws.amazon.com/cli/latest/reference/cloudtrail/lookup-events.html"&gt;AWS CLI:&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws cloudtrail lookup-events - lookup-attributes AttributeKey=EventName,AttributeValue=GetDashboard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.aws.amazon.com/awscloudtrail/latest/APIReference/API_LookupEvents.html"&gt;AWS API:&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
   "EndTime": number,
   "EventCategory": "string",
   "LookupAttributes": [ 
      { 
         "AttributeKey": "string",
         "AttributeValue": "string"
      }
   ],
   "MaxResults": number,
   "NextToken": "string",
   "StartTime": number
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/cloudtrail.html#CloudTrail.Client.lookup_events"&gt;Boto SDK:&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;response = client.lookup_events(
    LookupAttributes=[
        {
            'AttributeKey': 'EventId'|'EventName'|'ReadOnly'|'Username'|'ResourceType'|'ResourceName'|'EventSource'|'AccessKeyId',
            'AttributeValue': 'string'
        },
    ],
    StartTime=datetime(2015, 1, 1),
    EndTime=datetime(2015, 1, 1),
    EventCategory='insight',
    MaxResults=123,
    NextToken='string'
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
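&lt;p&gt;Putting the pieces together, here is a minimal Python sketch that computes each user's most recent login from the AssumeRoleWithSAML events. It assumes boto3 credentials are configured and that CloudTrail populates the Username field for SAML logins:&lt;/p&gt;

```python
from datetime import datetime, timezone


def last_login_per_user(events):
    """Reduce CloudTrail lookup_events records to {username: latest EventTime}."""
    latest = {}
    for event in events:
        user = event.get("Username", "unknown")
        when = event["EventTime"]
        if user not in latest or when > latest[user]:
            latest[user] = when
    return latest


def fetch_saml_logins():
    """Pull all AssumeRoleWithSAML events from CloudTrail (90-day window max)."""
    import boto3  # deferred import so the pure helper above needs no AWS access
    client = boto3.client("cloudtrail")
    events = []
    for page in client.get_paginator("lookup_events").paginate(
        LookupAttributes=[{"AttributeKey": "EventName",
                           "AttributeValue": "AssumeRoleWithSAML"}]):
        events.extend(page["Events"])
    return events
```

&lt;p&gt;Note that lookup_events only returns the last 90 days of events, so the results need to be archived if longer retention is required.&lt;/p&gt;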



&lt;h2&gt;Sample response:&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "&amp;lt;principal_id&amp;gt;: &amp;lt;user_name&amp;gt;",
        "arn": "arn:aws:sts:: &amp;lt;aws_account_id&amp;gt;:assumed-role/&amp;lt;IAM_role_ name&amp;gt;/&amp;lt;user_name&amp;gt;",
        "accountId": "&amp;lt;aws_account_id&amp;gt;",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "&amp;lt;principal_id&amp;gt;",
                …
            }
        }
    },
    "eventTime": "2022-17-13T16:55:36Z",
    "eventSource": "quicksight.amazonaws.com",
    "eventName": "GetDashboard",
    "awsRegion": "us-east-1",
    "eventID": "65ae334b-4202-4961-9ac7-d5a9d44416e2",
    "readOnly": true,
    "eventType": "AwsServiceEvent",
    "serviceEventDetails": {
        "eventRequestDetails": {
            "dashboardId": "arn:aws:quicksight:us-east-1: &amp;lt;aws_account_id&amp;gt;:dashboard/&amp;lt;dashboard_id&amp;gt;"
        },
        "eventResponseDetails": {
            "dashboardDetails": {
                "dashboardName": "Product X",
                "dashboardId": "arn:aws:quicksight:us-east-1: &amp;lt;aws_account_id&amp;gt;:dashboard/&amp;lt;dashboard_id&amp;gt;",
                "analysisIdList": [
                    "arn:aws:quicksight:us-east-1: &amp;lt;aws_account_id&amp;gt;:analysis/&amp;lt;analysis_id&amp;gt;"
            }
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/quicksight.html#QuickSight.Client.delete_user"&gt;Delete user:&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;response = client.delete_user(
    UserName='string',
    AwsAccountId='string',
    Namespace='string'
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Here are a few things to keep in mind:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CloudTrail keeps the logs for 90 days only.&lt;/li&gt;
&lt;li&gt;QuickSight users are charged monthly, even if the user does not sign in to the account.&lt;/li&gt;
&lt;li&gt;You can expect an entire month's charge, even if a user is removed before the end of the month.&lt;/li&gt;
&lt;li&gt;QuickSight users are charged based on the QuickSight edition (Standard/Enterprise).&lt;/li&gt;
&lt;li&gt;QuickSight Readers are charged on a per-session basis, and sessions are billed in 30-minute increments. Each session costs $0.30 per reader, up to a maximum of $5 per reader per month (month-to-month).&lt;/li&gt;
&lt;li&gt;If a Reader has a dashboard open, the timer continues to run (in 30-minute increments) until the Reader closes or minimizes the dashboard window.&lt;/li&gt;
&lt;li&gt;If a Reader opens a dashboard and closes it before the 30-minute session expires, the timer stops at 30 minutes and does not continue to run.&lt;/li&gt;
&lt;li&gt;10GB of SPICE capacity is provisioned to the QuickSight account for each user added, at no additional cost.&lt;/li&gt;
&lt;li&gt;Additional SPICE can be purchased at $0.38 per GB/month.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Conclusion:&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Once the required data is available from CloudTrail, we can store it in S3 and use it to manage the QuickSight account. A simple Lambda function can automatically remove inactive users, and we can trigger the Lambda function on a schedule using a CloudWatch Events (cron) rule.&lt;/li&gt;
&lt;/ul&gt;
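&lt;p&gt;A sketch of that Lambda function is below. The 45-day threshold, the account ID, and the source of the last-login data are placeholders, not part of the original setup:&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

INACTIVE_DAYS = 45  # hypothetical threshold; tune to your policy


def find_inactive(last_login, days=INACTIVE_DAYS, now=None):
    """Return usernames whose most recent login is older than `days` days."""
    now = now or datetime.now(timezone.utc)
    return [user for user, seen in last_login.items()
            if now - seen > timedelta(days=days)]


def lambda_handler(event, context):
    """Entry point for a scheduled (cron) CloudWatch Events rule."""
    import boto3
    quicksight = boto3.client("quicksight")
    last_login = {}  # load from the CloudTrail data previously stored in S3
    for user in find_inactive(last_login):
        quicksight.delete_user(
            UserName=user,
            AwsAccountId="111122223333",  # placeholder account id
            Namespace="default")
```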

&lt;p&gt;&lt;strong&gt;FAQs:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Question:&lt;/strong&gt; If a user was added and deleted at any point of a month, the charge would be added for the days he was active right?&lt;br&gt;
&lt;strong&gt;Answer:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No. If an Author/Admin user is removed from the account, you are still billed for that user through the end of the month (starting from the day they were listed as active). If the user was already on the account from day 1 of the month, you can expect the full monthly charge. If the user was only added partway through the month (say, on the 10th day), the charge is calculated for the remaining days of the month; you are not charged for the 10 days the user was not listed as active.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; When an ADMIN / AUTHOR is working on a Resource (Analysis / Dashboard etc….). Will that be charged in any way?&lt;br&gt;
&lt;strong&gt;Answer:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users are charged per month, regardless of their sign-in activity, or use of resources on the account. It is a monthly charge.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; If a READER is active in a day the sessions will be charged if the Max $5 USD is not reached already, correct?&lt;br&gt;
&lt;strong&gt;Answer:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The user is charged based on the time a dashboard is open. For example:&lt;/li&gt;
&lt;li&gt;A Reader opens a dashboard for 1 hour, i.e., 2 x 30-minute sessions. Since each 30-minute session is charged at $0.30, this Reader will be charged $0.60 ($0.30 x 2 = $0.60).&lt;/li&gt;
&lt;li&gt;If the Reader leaves the dashboard open and visible for 16 hours, the charge for that Reader reaches $5 for that day. However, since there is a maximum charge of $5 per month, you will not see any further billing for that specific Reader.&lt;/li&gt;
&lt;/ul&gt;
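&lt;p&gt;The session math above can be sketched in a few lines of Python (rates as quoted above; billing an opened dashboard in rounded-up 30-minute blocks is an assumption based on the increments described):&lt;/p&gt;

```python
import math

SESSION_RATE = 0.30   # USD per 30-minute reader session
MONTHLY_CAP = 5.00    # USD maximum per reader per month (month-to-month)


def session_cost(minutes_open):
    """Cost of one stretch of dashboard time, billed in 30-minute blocks."""
    sessions = math.ceil(minutes_open / 30) if minutes_open > 0 else 0
    return sessions * SESSION_RATE


def monthly_charge(daily_minutes):
    """Total monthly charge for one reader, capped at the $5 maximum."""
    return min(sum(session_cost(m) for m in daily_minutes), MONTHLY_CAP)
```

&lt;p&gt;For the examples above: 1 hour open costs $0.60, and 16 hours would exceed the cap, so the charge stops at $5.&lt;/p&gt;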

&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; If a user is added on the 5th and deleted on the 6th will there be any charge?&lt;br&gt;
&lt;strong&gt;Answer:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Since users are charged monthly, you will be charged from the 5th until the last day of that month; thereafter you will not be charged for the removed user. Note that user charges are displayed daily on the Billing console.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; If we were to get a subscription for all our ADMIN / AUTHOR (8 Users) how much cost it will save?&lt;br&gt;
&lt;strong&gt;Answer:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For the Enterprise Edition of QuickSight, paying month-to-month, you will pay $192 per month for the 8 users ($24 per user/month x 8 users = $192).&lt;/li&gt;
&lt;li&gt;The Annual Subscription is $18 per user per month, which means you will pay $144 per month for the 8 users, saving $48 monthly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/aws/single-sign-on-between-okta-universal-directory-and-aws/"&gt;Single Sign-On between Okta Universal Directory and AWS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/big-data/federate-amazon-quicksight-access-with-okta/"&gt;Federate amazon Quicksight access with okta&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/big-data/building-an-administrative-console-in-amazon-quicksight-to-analyze-usage-metrics/"&gt;Building an administrative console in Amazon QuickSight to analyze usage metrics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/quicksight/latest/user/logging-using-cloudtrail.html"&gt;Logging Operations with AWS CloudTrail&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/athena/latest/ug/cloudtrail-logs.html"&gt;Manually Creating the Table for CloudTrail Logs in Athena&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/big-data/aws-cloudtrail-and-amazon-athena-dive-deep-to-analyze-security-compliance-and-operational-activity/"&gt;Analyse Security, Compliance, and Operational Activity Using AWS CloudTrail and Amazon Athena&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/big-data/amazon-quicksight-now-supports-audit-logging-with-cloudtrail/"&gt;Amazon QuickSight Now Supports Audit Logging with AWS CloudTrail&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/quicksight/latest/user/logging-non-api.html"&gt;Tracking Non-API Events by Using CloudTrail Logs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/athena/latest/ug/cloudtrail-logs.html#create-cloudtrail-table-understanding"&gt;Querying AWS CloudTrail Logs - Understanding CloudTrail Logs and Athena Tables&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
    </item>
    <item>
      <title>Answer: How QuickSight SPICE refresh the data</title>
      <dc:creator>Gururajan Padmanaban</dc:creator>
      <pubDate>Wed, 13 Jul 2022 03:50:27 +0000</pubDate>
      <link>https://dev.to/pgr405/answer-how-quicksight-spice-refresh-the-data-3h3</link>
      <guid>https://dev.to/pgr405/answer-how-quicksight-spice-refresh-the-data-3h3</guid>
      <description>&lt;div class="ltag__stackexchange--container"&gt;
  &lt;div class="ltag__stackexchange--title-container"&gt;
    
      &lt;div class="ltag__stackexchange--title"&gt;
        &lt;div class="ltag__stackexchange--header"&gt;
          &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7Gn-iPj_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/stackoverflow-logo-b42691ae545e4810b105ee957979a853a696085e67e43ee14c5699cf3e890fb4.svg" alt=""&gt;
          &lt;a href="https://stackoverflow.com/questions/58709782/how-quicksight-spice-refresh-the-data/72960623#72960623" rel="noopener noreferrer"&gt;
            &lt;span class="title-flare"&gt;answer&lt;/span&gt; re: How QuickSight SPICE refresh the data
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="ltag__stackexchange--post-metadata"&gt;
          &lt;span&gt;Jul 13 '22&lt;/span&gt;
        &lt;/div&gt;
      &lt;/div&gt;
      &lt;a class="ltag__stackexchange--score-container" href="https://stackoverflow.com/questions/58709782/how-quicksight-spice-refresh-the-data/72960623#72960623" rel="noopener noreferrer"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y9mJpuJP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/stackexchange-arrow-up-eff2e2849e67d156181d258e38802c0b57fa011f74164a7f97675ca3b6ab756b.svg" alt=""&gt;
        &lt;div class="ltag__stackexchange--score-number"&gt;
          0
        &lt;/div&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wif5Zq3z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/stackexchange-arrow-down-4349fac0dd932d284fab7e4dd9846f19a3710558efde0d2dfd05897f3eeb9aba.svg" alt=""&gt;
      &lt;/a&gt;
    
  &lt;/div&gt;
  &lt;div class="ltag__stackexchange--body"&gt;
    
&lt;p&gt;My data is in parquet format. I guess Quicksight does not support a direct query on s3 parquet data.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Yes, we need to use Athena to read the parquet.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When you say point QuickSight to S3 directly, do you mean without SPICE?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Don't do it, it will increase the Athena…&lt;/li&gt;
&lt;/ul&gt;
    
  &lt;/div&gt;
  &lt;div class="ltag__stackexchange--btn--container"&gt;
    &lt;a href="https://stackoverflow.com/questions/58709782/how-quicksight-spice-refresh-the-data/72960623#72960623" class="ltag__stackexchange--btn" rel="noopener noreferrer"&gt;Open Full Answer&lt;/a&gt;
  &lt;/div&gt;
&lt;/div&gt;


</description>
    </item>
    <item>
      <title># AWS - Athena query result management FAQs</title>
      <dc:creator>Gururajan Padmanaban</dc:creator>
      <pubDate>Tue, 12 Jul 2022 18:35:13 +0000</pubDate>
      <link>https://dev.to/pgr405/-aws-athena-query-result-management-faqs-58le</link>
      <guid>https://dev.to/pgr405/-aws-athena-query-result-management-faqs-58le</guid>
      <description>&lt;p&gt;&lt;strong&gt;How to set up the query result location?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The query result location that Athena uses is determined by a combination of workgroup settings and client-side settings.&lt;/li&gt;
&lt;li&gt;It can be set in the Athena console (Athena -&amp;gt; Query Editor -&amp;gt; Settings -&amp;gt; Manage Settings).&lt;/li&gt;
&lt;li&gt;The workgroup settings can be specified using the console, CLI, or API.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Can we override the query result location?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Yes, each workgroup setting has an Override client-side settings option that can be enabled. &lt;/li&gt;
&lt;li&gt;When this option is enabled, the workgroup settings take precedence over the applicable client-side setting.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Will changing the result location affect the existing process?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Changing the query result location will not affect the existing process.&lt;/li&gt;
&lt;li&gt;The result is saved in the new location.&lt;/li&gt;
&lt;li&gt;Ensure that the new location is accessible by your existing processes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Is Athena regional service or Availability Zone-based service?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Athena is a regional service.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What is a saved query?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Athena allows the user to save a query in the console; the saved query's results are stored in the specified query result location in the format QueryName/yyyy/mm/dd/.&lt;/li&gt;
&lt;li&gt;Query Result Location: QueryResultsLocationInS3/[QueryName|Unsaved/yyyy/mm/dd/]&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why are the API query results stored directly?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only queries run from the console whose results path has not been overridden by workgroup settings store results using this path structure: QueryName/yyyy/mm/dd/.&lt;/li&gt;
&lt;li&gt;Queries that run from the AWS CLI or using the Athena API are saved directly to the QueryResultsLocationInS3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Does Athena cache query results?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No, Athena doesn't support query caching.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How to trigger saved queries via API / CLI?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Athena does not support triggering saved queries via the CLI/API directly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Is there a Workaround to trigger the saved queries via API?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use the ListNamedQueries API to grab the saved queries' IDs. &lt;/li&gt;
&lt;li&gt;Using this resulting list, you can then loop through the list running GetNamedQuery on each query ID.&lt;/li&gt;
&lt;li&gt;Once you have found the saved query that you are looking for, you can then run StartQueryExecution using the resulting values from the GetNamedQuery operation.&lt;/li&gt;
&lt;/ul&gt;
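&lt;p&gt;A minimal boto3 sketch of that workaround is below. The optional athena parameter is just for illustration and testing; by default a real client is created:&lt;/p&gt;

```python
def run_saved_query(name, output_location, athena=None):
    """ListNamedQueries -&gt; GetNamedQuery -&gt; StartQueryExecution workaround."""
    if athena is None:
        import boto3
        athena = boto3.client("athena")
    # 1. Collect every saved-query ID (the API is paginated).
    query_ids = []
    for page in athena.get_paginator("list_named_queries").paginate():
        query_ids.extend(page["NamedQueryIds"])
    # 2. Resolve each ID until we find the saved query with the wanted name.
    for query_id in query_ids:
        named = athena.get_named_query(NamedQueryId=query_id)["NamedQuery"]
        if named["Name"] == name:
            # 3. Run it, sending results to our chosen S3 location.
            started = athena.start_query_execution(
                QueryString=named["QueryString"],
                QueryExecutionContext={"Database": named["Database"]},
                ResultConfiguration={"OutputLocation": output_location})
            return started["QueryExecutionId"]
    raise ValueError(f"no saved query named {name!r}")
```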

&lt;p&gt;&lt;strong&gt;Can we store the results in a specific folder via API or CLI?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Yes, by passing the OutputLocation via the API or CLI.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Can we use saved queries and store the results based on saved query name folders in s3?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Yes; however, this only works with queries run from the console whose results path has not been overridden by workgroup settings.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Can we use saved queries in Quicksight?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No, we can not directly use saved queries in Quicksight. You would need to copy the query and use the custom SQL option in QuickSight to create a dataset from it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How to monitor Athena's activity?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We can collect the query execution details from CloudTrail, such as:

&lt;ul&gt;
&lt;li&gt;Query string&lt;/li&gt;
&lt;li&gt;Data scanned&lt;/li&gt;
&lt;li&gt;Scan time&lt;/li&gt;
&lt;li&gt;Database&lt;/li&gt;
&lt;li&gt;Table&lt;/li&gt;
&lt;li&gt;Time, etc.&lt;/li&gt;
&lt;/ul&gt;

Keep in mind that CloudTrail will only keep the logs for 90 days.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How to keep the Athena query execution details for longer?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Periodically, we can collect the CloudTrail Athena logs and store them in S3 for future reference.&lt;/li&gt;
&lt;/ul&gt;
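&lt;p&gt;One way to do that periodic collection (bucket and key names are hypothetical) is to flatten the lookup_events records into JSON lines and put them in S3:&lt;/p&gt;

```python
import json


def serialize_events(events):
    """Flatten CloudTrail lookup_events records into a JSON-lines string."""
    lines = []
    for event in events:
        lines.append(json.dumps({
            "EventId": event["EventId"],
            "EventName": event["EventName"],
            "EventTime": event["EventTime"].isoformat(),
            "Username": event.get("Username"),
        }))
    return "\n".join(lines)


def archive_events(events, bucket, key):
    """Write the serialized events to S3 for retention beyond 90 days."""
    import boto3
    boto3.client("s3").put_object(
        Bucket=bucket, Key=key,
        Body=serialize_events(events).encode("utf-8"))
```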

&lt;p&gt;&lt;strong&gt;What is the maximum Query string length?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The fixed maximum allowed query string length is 262144 bytes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Which encoding is used in Athena?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;UTF-8.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Is there a cost involved for canceled queries?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Canceled queries are charged based on the amount of data scanned.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Is there a cost involved for failed queries?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There are no charges for failed queries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What is the minimum amount of scanned data considered for billing?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10MB minimum per query.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why is the sorting in S3 not working?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To enable sorting in the S3 console, use the search to reduce the size of the list to 999 objects or fewer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How to delete huge amounts of data, e.g., 18,000 files?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There is no direct way from Athena's side to delete the files from the result location. Consider using an S3 lifecycle policy to delete objects older than a specific age.&lt;/li&gt;
&lt;/ul&gt;
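&lt;p&gt;A boto3 sketch of such a lifecycle policy is below; the rule ID, prefix, and 90-day expiry are illustrative choices. Note that put_bucket_lifecycle_configuration replaces the bucket's entire existing lifecycle configuration, so merge in any rules you already have:&lt;/p&gt;

```python
def expiry_rule(prefix, days):
    """Build a lifecycle rule expiring objects under `prefix` after `days` days."""
    return {
        "ID": "expire-athena-results",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }


def apply_lifecycle(bucket, prefix, days=90):
    """Attach the expiry rule to the query-results bucket."""
    import boto3
    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": [expiry_rule(prefix, days)]})
```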

&lt;p&gt;&lt;strong&gt;Is it safe to delete the old file (More than 45 days or 90 Days)? If yes, how do we do it?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can safely delete the old files as long as you don't want to go back and review the results of that particular query at the time it was run. &lt;/li&gt;
&lt;li&gt;If you delete the data and then decide you wish to know the results of that query again, you will then have to re-run the query. &lt;/li&gt;
&lt;li&gt;AWS does not recommend deleting the query metadata files, as this causes important information about the query to be lost; however, you may still do so if you no longer require information on that query.&lt;/li&gt;
&lt;li&gt;You can delete the files from your S3 bucket by navigating to the S3 bucket in the S3 console, selecting the files you wish to delete, and clicking delete. &lt;/li&gt;
&lt;li&gt;You can also set a lifecycle configuration on the bucket if you wish to prevent the retention of S3 objects past a certain length of time &lt;a href="https://dev.to/pgr405/aws-set-up-s3-lifecycle-for-data-rotation-580l"&gt;S3 Lifecycle&lt;/a&gt; &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Will deleting the data cause any error?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you select to view a query in the Athena "Recent queries" tab for which you have already deleted the data then you will receive an error message "Could not find results". You can then re-run the query to fetch new results.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Do we need to take a backup?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Whether you decide to delete the data or keep it for future use, backing it up or not, is up to you and your use case. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Can we use the stored query results in SQL Query?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can create a table from a query result file using CREATE EXTERNAL TABLE, the same way you would with any other CSV file. You can also use CREATE TABLE AS SELECT, a view, or the WITH clause if you wish to use the results of an Athena query in another query. Note, however, that these methods will not reuse the old results file; they will re-run the query instead.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How to query the results file?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you wish to create a table on the CSV results file, you must move the file into a new folder that contains only the data to be scanned for the new table. If any other files are present, you may encounter errors, as Athena cannot recognize exclusion patterns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Reference:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/athena/latest/ug/other-notable-limitations.html"&gt;limitations for SQL queries in Athena&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/premiumsupport/knowledge-center/athena-query-results/"&gt;Use the results in another query&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/athena/latest/ug/tables-location-format.html"&gt;Table location for creating a table&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/athena/latest/ug/querying.html#:~:text=Identifying%20query%20output%20files"&gt;Identifying query output files&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/athena/latest/APIReference/API_ListNamedQueries.html"&gt;Trigger saved queries via API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/athena/latest/ug/querying.html"&gt;Working with query results&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/athena/latest/ug/creating-tables.html#creating-tables-how-to"&gt;Creating tables in Athena&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/athena/latest/ug/querying.html#querying-finding-output-files"&gt;Query output files in S3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/awscloudtrail/latest/userguide/view-cloudtrail-events.html"&gt;CloudTrail Event history&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/awscloudtrail/latest/userguide/view-cloudtrail-events.html"&gt;Viewing recent queries&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/athena/latest/ug/querying.html#query-results-specify-location"&gt;Query result location&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/athena/latest/ug/service-limits.html"&gt;Service Quotas&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/athena/latest/ug/saved-queries.html"&gt;Saved queries&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
    </item>
    <item>
      <title>#AWS - Athena query result management.</title>
      <dc:creator>Gururajan Padmanaban</dc:creator>
      <pubDate>Tue, 12 Jul 2022 17:43:40 +0000</pubDate>
      <link>https://dev.to/pgr405/aws-athena-query-result-management-289n</link>
      <guid>https://dev.to/pgr405/aws-athena-query-result-management-289n</guid>
      <description>&lt;p&gt;&lt;strong&gt;What is Athena?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Athena is a serverless interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.&lt;/li&gt;
&lt;li&gt;It is a pay-as-you-go service; you pay only for the queries that you run.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How does it work?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Athena uses Presto under the hood to run queries. Presto is a distributed query engine for big data using the SQL query language. &lt;/li&gt;
&lt;li&gt;Athena also uses Apache Hive, which gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What are Athena's query results?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Athena automatically stores query results and metadata information for each query that runs.&lt;/li&gt;
&lt;li&gt;These results are stored in an S3 bucket that we can specify.&lt;/li&gt;
&lt;li&gt;We can also encrypt the files if required.&lt;/li&gt;
&lt;li&gt;Athena supports DML and DDL queries. &lt;/li&gt;
&lt;li&gt;Files are saved based on the name of the query and the query ID. The query ID is a unique identifier that Athena assigns to each query when it runs, e.g.: 0000e975-9f92-4adb-b853-71fc97dac20e.csv&lt;/li&gt;
&lt;li&gt;For each query, the result will contain two files

&lt;ol&gt;
&lt;li&gt;Actual data&lt;/li&gt;
&lt;li&gt;Metadata&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;DML query results are saved in CSV format.&lt;/li&gt;
&lt;li&gt;DDL query results are saved as plain text files.&lt;/li&gt;
&lt;li&gt;Metadata files are saved in binary format.&lt;/li&gt;
&lt;/ul&gt;
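The file layout above can be sketched in a few lines of Python. The helper below derives the two S3 keys Athena writes for a DML query from its query ID; the `athena-results/` prefix is a hypothetical example of a configured query result location.

```python
def athena_result_keys(query_id, prefix="athena-results/"):
    """Derive the S3 keys Athena writes for a DML query: the actual data is
    stored as <query-id>.csv and the metadata as <query-id>.csv.metadata."""
    data_key = f"{prefix}{query_id}.csv"               # actual data (CSV)
    metadata_key = f"{prefix}{query_id}.csv.metadata"  # binary metadata file
    return data_key, metadata_key

data_key, metadata_key = athena_result_keys("0000e975-9f92-4adb-b853-71fc97dac20e")
```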

&lt;p&gt;&lt;strong&gt;What are DML and DDL?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DDL (Data Definition Language) queries include CREATE TABLE and ALTER TABLE ADD PARTITION.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There are no charges for DDL queries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DML (Data Manipulation Language) queries include SELECT, CREATE TABLE AS (CTAS), and INSERT INTO queries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DML queries are charged for the number of bytes scanned, rounded up to the nearest megabyte.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
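To make the DML pricing rule concrete, here is a small estimator. It assumes the commonly quoted rate of $5.00 per TB scanned and a 10 MB minimum per query; check the Athena pricing page for your region before relying on these numbers.

```python
import math

def athena_query_cost(bytes_scanned, price_per_tb=5.0):
    """Estimate the cost of one DML query: bytes scanned are rounded up to
    the nearest megabyte, with a 10 MB minimum per query. DDL queries are
    free, so they are not modeled here."""
    mb = max(math.ceil(bytes_scanned / 1024 ** 2), 10)
    return mb * price_per_tb / 1024 ** 2  # 1 TB = 1024 * 1024 MB

cost = athena_query_cost(3 * 1024 ** 3)  # a query scanning 3 GiB
```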

&lt;p&gt;&lt;strong&gt;What is a query quota?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only a limited number of queries can be executed at any given time. &lt;/li&gt;
&lt;li&gt;The query quota includes both running and queued queries. &lt;/li&gt;
&lt;li&gt;Exhausting the quota will result in a TooManyRequestsException error.&lt;/li&gt;
&lt;/ul&gt;
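When the quota is exhausted, a common pattern is to retry with exponential backoff. The schedule helper below is a minimal sketch; the commented boto3 usage is illustrative only and assumes an Athena client.

```python
import random
import time

def backoff_delays(max_retries=5, base=1.0, cap=30.0):
    """Exponential backoff schedule (in seconds) for retrying after a
    TooManyRequestsException."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]

# Illustrative usage (not executed here):
# client = boto3.client("athena")
# for delay in backoff_delays():
#     try:
#         client.start_query_execution(QueryString="SELECT 1")
#         break
#     except client.exceptions.TooManyRequestsException:
#         time.sleep(delay + random.random())  # add jitter
```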

&lt;p&gt;&lt;strong&gt;How to access the query results?&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We can directly access/download the files from Athena Console/S3.&lt;/li&gt;
&lt;li&gt;We can use AWS CLI / API to get the query results from Athena based on the unique QueryIds. &lt;/li&gt;
&lt;li&gt;To access and view query output files, IAM principals (users and roles) need permission for the Amazon S3 GetObject action for the query result location, as well as permission for the Athena GetQueryResults action. &lt;/li&gt;
&lt;li&gt;If the result location is encrypted, users must also have the appropriate key permissions to encrypt and decrypt objects in the query result location.&lt;/li&gt;
&lt;/ul&gt;
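The two permissions mentioned above can be expressed as an IAM policy document. This is a minimal sketch; the bucket name is a placeholder, and in practice you would scope the Athena action to specific workgroup ARNs.

```python
# Minimal IAM policy granting access to Athena query output files.
result_access_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # read the result files from the query result location
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::my-athena-results-bucket/*",  # placeholder
        },
        {   # fetch results through the Athena API
            "Effect": "Allow",
            "Action": "athena:GetQueryResults",
            "Resource": "*",  # scope to workgroup ARNs in practice
        },
    ],
}
```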

&lt;p&gt;&lt;strong&gt;Why should we bother?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Athena keeps a query history for 45 days.&lt;/li&gt;
&lt;li&gt;CloudTrail keeps the log for 90 days.&lt;/li&gt;
&lt;li&gt;But the Athena query results are kept forever.&lt;/li&gt;
&lt;li&gt;Over time, the query results accumulate and incur storage costs in S3.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What can we do?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set up a data rotation policy to maintain the Athena query results. &lt;/li&gt;
&lt;li&gt;This can be achieved using S3 lifecycle policies.&lt;/li&gt;
&lt;/ul&gt;
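A minimal sketch of such a lifecycle policy, assuming a dedicated results bucket and an `athena-results/` prefix (both placeholders). The configuration dict matches the shape boto3's `put_bucket_lifecycle_configuration` expects.

```python
# Expire Athena query results 45 days after creation.
lifecycle_config = {
    "Rules": [
        {
            "ID": "expire-athena-query-results",
            "Status": "Enabled",
            "Filter": {"Prefix": "athena-results/"},  # placeholder prefix
            "Expiration": {"Days": 45},
        }
    ]
}

# Illustrative call (not executed here):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-athena-results-bucket", LifecycleConfiguration=lifecycle_config)
```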

&lt;p&gt;&lt;strong&gt;Is there a risk involved in deleting the query results?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can safely delete the old files as long as you don't want to go back and review the results of that particular query.&lt;/li&gt;
&lt;li&gt;But keep in mind once the data is deleted it can’t be restored.&lt;/li&gt;
&lt;li&gt;Also, trying to access a deleted result will throw a "Could not find results" error.&lt;/li&gt;
&lt;li&gt;We can always re-run the query to regenerate the results (paying the query cost again).&lt;/li&gt;
&lt;li&gt;However, if we later need the original results, for example to investigate unexpected Athena usage, we will not be able to get them back. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What are the best practices to be followed?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use Athena Saved queries wherever possible. &lt;/li&gt;
&lt;li&gt;Create and maintain a separate bucket for Athena query results. &lt;/li&gt;
&lt;li&gt;Set up an S3 lifecycle to maintain the query results.&lt;/li&gt;
&lt;li&gt;Set up a monitoring system that periodically collects the Athena logs from CloudTrail and stores them in an S3 bucket.&lt;/li&gt;
&lt;/ul&gt;
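For the last best practice, the CloudTrail LookupEvents API can filter on the Athena event source. The parameters below are a sketch of such a request; the 7-day window is an arbitrary example.

```python
from datetime import datetime, timedelta, timezone

# Parameters for cloudtrail.lookup_events(), filtered to Athena API calls.
lookup_params = {
    "LookupAttributes": [
        {"AttributeKey": "EventSource", "AttributeValue": "athena.amazonaws.com"}
    ],
    "StartTime": datetime.now(timezone.utc) - timedelta(days=7),  # example window
    "EndTime": datetime.now(timezone.utc),
}

# Illustrative call (not executed here):
# import boto3
# events = boto3.client("cloudtrail").lookup_events(**lookup_params)["Events"]
```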

&lt;p&gt;&lt;strong&gt;Reference:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/athena/?whats-new-cards.sort-by=item.additionalFields.postDateTime&amp;amp;whats-new-cards.sort-order=desc"&gt;Amazon Athena&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cwiki.apache.org/confluence/display/Hive/"&gt;Apache Hive&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://prestodb.io/docs/current/"&gt;Prestodb&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/athena/pricing/#:~:text=into%20columnar%20formats.-,Pricing%20details,-You%20are%20charged"&gt;Pricing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/cli/latest/reference/athena/start-query-execution.html"&gt;Start-query-execution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/athena/latest/ug/querying.html"&gt;Working with query results&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/pgr405/aws-set-up-s3-lifecycle-for-data-rotation-580l"&gt;S3 Lifecycle&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/pgr405/aws-s3-lifecycle-faq-3hm2"&gt;S3 Lifecycle FAQs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
    </item>
    <item>
      <title>#AWS - Set up S3 Lifecycle for data rotation</title>
      <dc:creator>Gururajan Padmanaban</dc:creator>
      <pubDate>Mon, 11 Jul 2022 15:55:27 +0000</pubDate>
      <link>https://dev.to/pgr405/aws-set-up-s3-lifecycle-for-data-rotation-580l</link>
      <guid>https://dev.to/pgr405/aws-set-up-s3-lifecycle-for-data-rotation-580l</guid>
      <description>&lt;p&gt;&lt;strong&gt;Requirement:&lt;/strong&gt; Delete all or specified objects after n days.&lt;/p&gt;

&lt;h3&gt;
  
  
  S3 Lifecycle rules:
&lt;/h3&gt;

&lt;p&gt;An S3 Lifecycle configuration is a set of rules that define actions that Amazon S3 applies to a group of objects. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There are two types of actions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Transition actions:&lt;/em&gt; These actions define when objects transition to another storage class. &lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Expiration actions:&lt;/em&gt; These actions define when objects expire. Amazon S3 deletes expired objects on your behalf.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Rules:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Move current versions of objects between storage classes&lt;/li&gt;
&lt;li&gt;Move noncurrent versions of objects between storage classes&lt;/li&gt;
&lt;li&gt;Expire the current version of the object &lt;/li&gt;
&lt;li&gt;Permanently delete the noncurrent versions of objects&lt;/li&gt;
&lt;li&gt;Delete expired object delete markers or incomplete multipart uploads (this will not work if the rule is scoped with tags)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For our process, we are going to focus on the data rotation rules (3 to 5). Other rules are related to the data retention policy, where we move the objects from one S3 class to another S3 class according to our requirements such as how frequently we need to access them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scope:
&lt;/h3&gt;

&lt;p&gt;We can apply these rules to all objects in the bucket or limit the rule's scope using one or more filters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Filters:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prefix:&lt;/strong&gt; Provide a prefix based on which the files will be marked as expired. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File size:&lt;/strong&gt; We can configure a filter to delete the object only if the file size is more than a certain limit.&lt;/li&gt;
&lt;/ul&gt;
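Both filters can be combined in one rule. In the S3 API this is expressed with an And element; the prefix and size threshold below are placeholders.

```python
# Lifecycle rule filter: objects under logs/ that are larger than 5 MB.
size_filter = {
    "And": {
        "Prefix": "logs/",                         # placeholder prefix
        "ObjectSizeGreaterThan": 5 * 1024 * 1024,  # threshold in bytes
    }
}
```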

&lt;h3&gt;
  
  
  Tags (Paid service):
&lt;/h3&gt;

&lt;p&gt;AWS supports tagging objects according to our requirements. Tags are key:value pairs, e.g.: Expire:True. Using these tags we can set up a filter to delete only the tagged objects.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Tags are charged monthly; it is a recurring cost.&lt;/p&gt;

&lt;p&gt;Setting up a tag on an object also requires paying for the API calls.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Versioning:
&lt;/h3&gt;

&lt;p&gt;To protect the object from being overwritten or deleted S3 provides version support so that we can restore the previous versions if required.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unversioned:&lt;/strong&gt; This is the default state for any bucket. If the bucket is not version-enabled, objects are expired and deleted in one step. It is &lt;em&gt;not necessary to set up any other rule to delete the expired objects.&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Version enabled:&lt;/strong&gt;  If a bucket is version-enabled, then we need to explicitly set up a rule to delete the old (noncurrent) objects. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rule:&lt;/strong&gt; ”Permanently delete the noncurrent versions of objects” &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Multipart uploads:
&lt;/h3&gt;

&lt;p&gt;Multipart upload means uploading large files in chunks. S3 supports multipart uploads out of the box (recommended when an object is more than 100 MB). &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A block-style upload process is supported&lt;/strong&gt;&lt;br&gt;
i.e.:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We can upload the parts/chunks in any order.
&lt;/li&gt;
&lt;li&gt;If any part fails during upload, we can re-upload just that part without affecting the other parts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If an upload fails, the incomplete parts accumulate over time. To avoid this, we can set up a rule to delete those incomplete parts from S3 by using the rule “Delete expired object delete markers or incomplete multipart uploads”. &lt;/p&gt;

&lt;h3&gt;
  
  
  Expiration of an object:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Rules are invoked every day at midnight (00:00 UTC). &lt;/li&gt;
&lt;li&gt;All rules are invoked simultaneously.&lt;/li&gt;
&lt;li&gt;An object is acted on only if it satisfies the rule at the time the rule is invoked.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;E.g:&lt;/strong&gt; If you set up a rule to expire objects after 1 day, and at the midnight UTC run the object is not yet equal to or older than 1 day, then the object will not be marked as expired. &lt;/li&gt;
&lt;li&gt;Always keep the timezone in mind when setting up a rule. AWS S3 is in the UTC timezone.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example expiration process workflow:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Object uploaded: 6 July 09:18 UTC&lt;/li&gt;
&lt;li&gt;Lifecycle rule for expiring objects after 1 day ran on: 7 July 00:00 UTC&lt;/li&gt;
&lt;li&gt;By this time the object had not yet completed 24 hours in the S3 bucket&lt;/li&gt;
&lt;li&gt;The object completed 24 hours in the S3 bucket on 7 July at 09:18 UTC&lt;/li&gt;
&lt;li&gt;The next lifecycle rule run was scheduled for: 8 July at 00:00 UTC&lt;/li&gt;
&lt;li&gt;&lt;p&gt;At this point, the lifecycle rule marked the object for expiration with the expiration date of 8 July 00:00 UTC,&lt;br&gt;
i.e. 8 July 5:30 AM in Indian Standard Time. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The lifecycle rules run at midnight UTC and mark the eligible objects for expiration as per the rule specified. S3 rounds the expiration time to midnight UTC the next day, which explains why the object was marked for expiration on 8 July instead of 7 July. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
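The rounding behavior in this workflow can be reproduced with a few lines of Python: the object becomes eligible once it reaches the configured age, and S3 marks it at the next midnight-UTC run.

```python
from datetime import datetime, timedelta

def expiration_marking_time(uploaded_at_utc, days):
    """Return the midnight-UTC lifecycle run at which an object uploaded at
    `uploaded_at_utc` is marked for expiration under an N-day rule."""
    eligible = uploaded_at_utc + timedelta(days=days)
    # round up to the next midnight UTC after the object becomes eligible
    return datetime(eligible.year, eligible.month, eligible.day) + timedelta(days=1)

# The workflow above: uploaded 6 July 09:18 UTC, 1-day expiration rule
marked = expiration_marking_time(datetime(2022, 7, 6, 9, 18), days=1)
```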

&lt;h3&gt;
  
  
  Deleting the object:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The expired objects will not be deleted immediately, S3 will asynchronously remove these from the Bucket on the backend. &lt;/li&gt;
&lt;li&gt;This can take some time to complete as S3 performs this operation while ensuring that the service remains available. &lt;/li&gt;
&lt;li&gt;However, since the object is marked for expiration we will not be charged for the storage of the same, even though the object might still be visible in the S3 Bucket.&lt;/li&gt;
&lt;li&gt;We can access the objects even if they are expired. &lt;/li&gt;
&lt;li&gt;The entire folder will be deleted if there are no files left. Because S3 is an object-based storage class anything and everything is an object so even though we access it like a normal file system (Folders and Files) it is actually an object. So every object which satisfies the rule will expire.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Overlapping rules:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;When setting up rules, if one rule overlaps another, AWS will always go with the rule that is &lt;strong&gt;least expensive&lt;/strong&gt; for you, i.e. the one that saves more cost. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rule one is set up to migrate the objects from the S3 Standard class to S3 Infrequent Access after 90 days.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Then the second rule is set up to expire the objects after 90 days.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AWS will always go with the second rule because there is no point in migrating the objects if we are going to delete them anyway.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Limitations:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Suffixes are not supported, i.e. if we want to delete only a specific file type, e.g. .csv, we can't use a filter like *.csv.&lt;/li&gt;
&lt;li&gt;The entire folder will be deleted if there are no files left. There’s no way around it.&lt;/li&gt;
&lt;li&gt;To delete a specific file we need to use tags.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Steps to create a lifecycle rule to expire objects:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Sign in to the AWS Management Console and open the Amazon S3 console&lt;/li&gt;
&lt;li&gt;In the Buckets list, choose the name of the bucket for which you want to create a lifecycle rule.&lt;/li&gt;
&lt;li&gt;Choose the Management tab, and choose Create lifecycle rule.&lt;/li&gt;
&lt;li&gt;In the Lifecycle rule name, enter a name for your rule.&lt;/li&gt;
&lt;li&gt;Choose the scope of the lifecycle rule:

&lt;ul&gt;
&lt;li&gt;To limit the scope by prefix, in Prefix, enter the prefix&lt;/li&gt;
&lt;li&gt;Enter the object tag key&lt;/li&gt;
&lt;li&gt;Object size - specify the minimum object or maximum object size&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Under Lifecycle rule actions, choose the actions that you want your lifecycle rule to perform:

&lt;ul&gt;
&lt;li&gt;Expire current versions of objects&lt;/li&gt;
&lt;li&gt;Permanently delete previous versions of objects (if that fits your use case)&lt;/li&gt;
&lt;li&gt;Delete expired delete markers or incomplete multipart uploads (if that fits your use case)
Depending on the actions that you choose, different options appear.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;To expire current versions of objects, under Expire current versions of objects, in Number of days after object creation, enter the number of days (e.g. 45 days).&lt;/li&gt;
&lt;li&gt;To permanently delete previous versions of objects, under Permanently delete previous versions of objects, in Number of days after objects become previous versions, enter the number of days.&lt;/li&gt;
&lt;li&gt;Under Delete expired delete markers or incomplete multipart uploads, choose to Delete expired object delete markers and Delete incomplete multipart uploads. Then, enter the number of days after the multipart upload initiation that you want to end and clean up incomplete multipart uploads.&lt;/li&gt;
&lt;li&gt;Choose Create rule.
If the rule does not contain any errors, Amazon S3 enables it, and you can see it on the Management tab under Lifecycle rules.&lt;/li&gt;
&lt;/ol&gt;
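The rule the console steps create can also be sketched as a boto3 call. Bucket name, prefix, and day counts below are placeholders chosen to match the example values above.

```python
# Lifecycle rule combining expiration, noncurrent-version cleanup, and
# multipart-upload cleanup (step numbers refer to the console walkthrough).
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "rotate-objects",
            "Status": "Enabled",
            "Filter": {"Prefix": "results/"},                              # step 5
            "Expiration": {"Days": 45},                                    # step 7
            "NoncurrentVersionExpiration": {"NoncurrentDays": 45},         # step 8
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},  # step 9
        }
    ]
}

# Illustrative call (not executed here):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-bucket", LifecycleConfiguration=lifecycle_configuration)
```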

&lt;h3&gt;
  
  
  Ref:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-set-lifecycle-configuration-intro.html"&gt;Lifecycle configuration intro&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html"&gt;Managing your storage lifecycle&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/intro-lifecycle-rules.html#intro-lifecycle-rules-number-of-days"&gt;Lifecycle rules: Based on an object's age&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/storage/simplify-your-data-lifecycle-by-using-object-tags-with-amazon-s3-lifecycle/"&gt;Simplify your data lifecycle by using object tags with Amazon S3 Lifecycle&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/intro-lifecycle-rules.html#intro-lifecycle-rules-filter"&gt;Lifecycle rules filter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/11/amazon-s3-lifecycle-storage-cost-savings/"&gt;Amazon s3 lifecycle storage cost savings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2011/12/27/amazon-s3-announces-object-expiration/"&gt;Amazon S3 Object Expiration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/s3/pricing/"&gt;Amazon S3 pricing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/premiumsupport/knowledge-center/s3-empty-bucket-lifecycle-rule/"&gt;Empty the S3 bucket using a lifecycle&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-class-intro.html#sc-glacier"&gt;Storage classes for archiving objects&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-class-intro.html#sc-compare"&gt;Comparing the Amazon S3 storage classes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/s3/storage-classes/"&gt;Amazon S3 Storage Classes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/s3/pricing/https://aws.amazon.com/premiumsupport/knowledge-center/glacier-early-delete-fees"&gt;Early delete fees&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/premiumsupport/knowledge-center/s3-lifecycle-rule-delay/"&gt;s3 lifecycle rule delay&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/intro-lifecycle-rules.html#:~:text=When%20specifying%20the,object%20or%20objects"&gt;Lifecycle rules: Based on an object's age&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/intro-lifecycle-rules.html"&gt;Lifecycle configuration elements&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-configuration-examples.html#lifecycle-config-conceptual-ex5"&gt;Examples of S3 Lifecycle configuration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/intro-lifecycle-rules.html#:~:text=Expiration%20deletes%20the%20object%2C%20and%20the%20deleted%20object%20cannot%20be%20recovered"&gt;Data recovery&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
    </item>
    <item>
      <title>#AWS - Increase your Quicksight SPICE data refresh frequency.</title>
      <dc:creator>Gururajan Padmanaban</dc:creator>
      <pubDate>Mon, 11 Jul 2022 15:51:57 +0000</pubDate>
      <link>https://dev.to/pgr405/aws-increase-your-quicksight-spice-data-refresh-frequency-442b</link>
      <guid>https://dev.to/pgr405/aws-increase-your-quicksight-spice-data-refresh-frequency-442b</guid>
      <description>&lt;h1&gt;
  
  
  Scenario:
&lt;/h1&gt;

&lt;p&gt;Let us say we want to fetch data from the source (Jira), push it to SPICE, and render it in Quicksight dashboards. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Requirement:&lt;/strong&gt; &lt;br&gt;
  Push the data once every 30 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quicksight supports the following:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full refresh
&lt;/li&gt;
&lt;li&gt;Incremental refresh&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Full refresh:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Process - Old data is replaced with new data. &lt;/li&gt;
&lt;li&gt;  Frequency - Once every hour &lt;/li&gt;
&lt;li&gt;  Refresh count - &lt;strong&gt;24 / Day&lt;/strong&gt; &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Incremental refresh:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Process - New data gets appended to the dataset. &lt;/li&gt;
&lt;li&gt;  Frequency - Once every 15 minutes &lt;/li&gt;
&lt;li&gt;  Refresh count - &lt;strong&gt;96 / Day&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;
  
  
  Issue:
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;We need to push the data once every 30 minutes.&lt;/li&gt;
&lt;li&gt;It is going to be a &lt;strong&gt;FULL_REFRESH&lt;/strong&gt; &lt;/li&gt;
&lt;li&gt;When it comes to &lt;em&gt;Full Refresh&lt;/em&gt; Quicksight only supports &lt;em&gt;Hourly&lt;/em&gt; refresh.&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;
  
  
  Solution:
&lt;/h1&gt;

&lt;p&gt;We can leverage API support from AWS. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Package - boto3 (Python)&lt;/li&gt;
&lt;li&gt;Class - Quicksight.client &lt;/li&gt;
&lt;li&gt;Method - create_ingestion&lt;/li&gt;
&lt;li&gt;Process - You can manually refresh datasets by starting a new SPICE ingestion. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refresh cycle:&lt;/strong&gt; Each 24-hour period is measured starting 24 hours before the current date and time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Limitations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise edition&lt;/strong&gt; accounts can trigger ingestions 32 times in a 24-hour period. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standard edition&lt;/strong&gt; accounts can trigger ingestions 8 times in a 24-hour period. &lt;/li&gt;
&lt;/ul&gt;
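Quick arithmetic for an Enterprise edition account, combining the hourly schedule with the API limit:

```python
# Refresh budget for an Enterprise edition account in one 24-hour period.
scheduled_per_day = 24   # hourly scheduled full refreshes
api_per_day = 32         # create_ingestion calls allowed per 24 hours
total_refreshes = scheduled_per_day + api_per_day  # 56 full refreshes per day
average_gap_minutes = 24 * 60 / total_refreshes    # average minutes between refreshes
```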

&lt;p&gt;&lt;strong&gt;Sample code:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python - Boto for AWS:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3
client = boto3.client('quicksight')

response = client.create_ingestion(
    DataSetId='string',
    IngestionId='string',
    AwsAccountId='string',
    IngestionType='INCREMENTAL_REFRESH'|'FULL_REFRESH'
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;awswrangler:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import awswrangler as wr
wr.quicksight.cancel_ingestion(ingestion_id="jira_data_sample_refresh", dataset_name="jira_db")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;CLI:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws quicksight create-ingestion --data-set-id dataSetId --ingestion-id jira_data_sample_ingestion --aws-account-id AwsAccountId --region us-east-1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;API:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PUT /accounts/AwsAccountId/data-sets/DataSetId/ingestions/IngestionId HTTP/1.1
Content-type: application/json

{
   "IngestionType": "string"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Conclusion:
&lt;/h1&gt;

&lt;p&gt;Using this approach we can achieve 56 full refreshes per day for our dataset. We can also go one step further, identify the peak hours of our source tool (Jira), and configure the data refreshes accordingly. This way we can even achieve a refresh frequency of once every 10 minutes during those hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ref:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/quicksight/pricing/"&gt;Quicksight&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/quicksight/gallery/"&gt;Quicksight Gallery&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2020/07/amazon-quicksight-spice-data-engine-supports-data-sets-250m-rows/"&gt;SPICE&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html"&gt;Boto - Python&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/quicksight.html#QuickSight.Client.create_ingestion"&gt;Boto - Create Ingestion&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws-data-wrangler.readthedocs.io/en/stable/stubs/awswrangler.quicksight.cancel_ingestion.html#awswrangler.quicksight.cancel_ingestion"&gt;AWS Wrangler&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/cli/latest/reference/quicksight/create-ingestion.html"&gt;CLI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/quicksight/latest/APIReference/API_CreateIngestion.html"&gt;API&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
    </item>
    <item>
      <title>#AWS - S3 Lifecycle FAQ</title>
      <dc:creator>Gururajan Padmanaban</dc:creator>
      <pubDate>Mon, 11 Jul 2022 15:48:37 +0000</pubDate>
      <link>https://dev.to/pgr405/aws-s3-lifecycle-faq-3hm2</link>
      <guid>https://dev.to/pgr405/aws-s3-lifecycle-faq-3hm2</guid>
      <description>&lt;h3&gt;
  
  
  FAQ:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; If we set up a life cycle will it be applied to the existing files?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ans:&lt;/strong&gt; Yes, the Lifecycle rule applies to all the data present inside the S3 bucket, whether it is uploaded before or after the addition of the lifecycle rule.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For example, you have two objects in your S3 bucket: one is 100 days old and the other was uploaded just 2 hours ago. Now you add a lifecycle rule for expiring objects after 30 days. The 100-day-old object will not wait another 30 days; it will expire whenever the lifecycle rule runs first. The recently uploaded object will wait until it becomes 30 days old, after which the lifecycle will act on it.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; Can we set up a rule to delete the file under a specific folder? &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ans:&lt;/strong&gt; Yes, if your files are being stored under a particular prefix then you can configure Lifecycle to only have the purview of that particular prefix. This way any other data uploaded to your S3 bucket would not be deleted.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; Is it possible to read the expired objects in S3?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ans:&lt;/strong&gt; It is possible, but not guaranteed; the object may stop being readable at any time after expiration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; Is it possible to trigger the lifecycle rule as per timezone?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ans:&lt;/strong&gt; Lifecycle configs are executed and fully managed by AWS. AWS executes them only at midnight UTC; their execution cannot be scheduled for a specific time zone. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; Can we control the order in which the rules are invoked/executed?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ans:&lt;/strong&gt; No, it is not possible as of now, the Lifecycle will come into play concurrently.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; How does expiration work in the case of buckets without versioning enabled?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ans:&lt;/strong&gt; The expiration rule will permanently delete the object in case of a non-versioning bucket. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; Do we need a second rule for deleting the expired objects in case of buckets without versioning?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ans:&lt;/strong&gt; The second rule to permanently remove the delete markers is not needed in the case of a non-versioning enabled bucket.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; Is it possible to recover the deleted objects?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ans:&lt;/strong&gt; No, ​​deleted objects cannot be recovered.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; If we set objects to expire after 1 day and to be deleted 1 day later, will the delete rule wait for one more day after the object has expired?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ans:&lt;/strong&gt; In the case of a non-versioning bucket the above implementation will not make any difference and the object will be expired after 1 day. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When you are using a Versioning enabled Bucket, the first rule will create a delete marker of the object after 1 day. The second rule will wait for the delete marker to become 1 day old, and then it will permanently delete the marker. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;So your object will be deleted after 2 days. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;I.e:&lt;/strong&gt; If the period is set to 45 Days.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the case of a non-versioned bucket - the object will be removed after 45 days.&lt;/li&gt;
&lt;li&gt;In the case of a versioning-enabled bucket -&amp;gt; the object will be converted to a delete marker after 45 days -&amp;gt; the noncurrent version will expire after another 45 days -&amp;gt; total time to permanent deletion = 90 days.&lt;/li&gt;
&lt;/ul&gt;
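The 45/90-day timeline can be checked with simple date arithmetic (this sketch ignores the midnight-UTC rounding discussed earlier):

```python
from datetime import date, timedelta

def permanent_delete_date(created, expire_days, noncurrent_days, versioned):
    """Approximate date an object is permanently deleted."""
    if not versioned:
        # unversioned bucket: expiration deletes the object directly
        return created + timedelta(days=expire_days)
    # versioned bucket: delete marker after expire_days, then the noncurrent
    # version is purged noncurrent_days later
    return created + timedelta(days=expire_days + noncurrent_days)

unversioned_gone = permanent_delete_date(date(2022, 7, 1), 45, 45, versioned=False)
versioned_gone = permanent_delete_date(date(2022, 7, 1), 45, 45, versioned=True)
```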

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; When the expired data is deleted the entire folder is deleted instead of the files in that folder. How to avoid this and only delete the expired objects and keep the folder intact?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ans:&lt;/strong&gt; Amazon S3 has a flat structure with no hierarchy as we would see in a typical file system. However, for the sake of organizational simplicity, the Amazon S3 console supports the folder concept as a means of grouping objects. Amazon S3 does this by using key name prefixes for objects. As for now, there is no way to keep the folder intact.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; What is the frequency of s3 lifecycle rules?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ans:&lt;/strong&gt; S3 lifecycle only runs once a day at 00:00 UTC and tags the objects that fall under its purview for the actions that you have directed it to.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; Why are the expired objects not deleted immediately? &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ans:&lt;/strong&gt; S3 will asynchronously remove these from the Bucket on the backend. This can take some time to complete as S3 performs this operation while ensuring that the service remains available. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; Will there be a delay in deleting the files?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ans:&lt;/strong&gt; Yes; if you have a large number of objects, their deletion might be delayed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; Is the expired object charged? &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ans:&lt;/strong&gt; Once the lifecycle has tagged the data for deletion, then you do not incur any charges for the storage. For example, if an object is scheduled to expire and Amazon S3 does not immediately expire the object, you won't be charged for storage after the expiration time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; Is there any cost for the S3 life cycle?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ans:&lt;/strong&gt; There is no cost for applying the S3 lifecycle. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; What is a transition cost?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ans:&lt;/strong&gt; Transitioning data from S3 Standard to S3 Standard-Infrequent Access will be charged $0.01 per 1,000 requests.&lt;/li&gt;
&lt;/ul&gt;
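&lt;p&gt;As a rough arithmetic sketch of that rate (the function name and object count below are made up for illustration):&lt;/p&gt;

```python
# Hypothetical estimate of the one-time lifecycle transition cost at
# $0.01 per 1,000 transition requests (one request per object).
def transition_cost(num_objects, price_per_1000=0.01):
    return num_objects / 1000 * price_per_1000

# 500,000 objects -> 500 * $0.01 = $5.00
print(f"${transition_cost(500_000):.2f}")
```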

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; Is there a cost for expiration action?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ans:&lt;/strong&gt; No, there is no cost for deleting objects via lifecycle.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; Are there any charges for early deletion?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ans:&lt;/strong&gt; Yes. S3 offers a few storage classes, such as S3 Glacier Deep Archive and S3 One Zone-IA, which have a minimum storage duration constraint. If an object residing in such a storage class is deleted before the minimum storage duration has elapsed, you incur an early-deletion charge. For more information, please refer to the documents listed below.&lt;/li&gt;
&lt;/ul&gt;
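&lt;p&gt;A minimal sketch of how the early-deletion charge works, assuming the documented minimum storage durations (90 days for the Glacier retrieval classes, 180 for Deep Archive, 30 for One Zone-IA); the function and dictionary names are illustrative:&lt;/p&gt;

```python
# Assumed minimum storage durations (days) per storage class.
MIN_DAYS = {
    "GLACIER_IR": 90,      # S3 Glacier Instant Retrieval
    "GLACIER": 90,         # S3 Glacier Flexible Retrieval
    "DEEP_ARCHIVE": 180,   # S3 Glacier Deep Archive
    "ONEZONE_IA": 30,      # S3 One Zone-IA
}

def early_delete_days_charged(storage_class, days_stored):
    # Remaining days of the minimum duration are still billed on early delete.
    return max(0, MIN_DAYS[storage_class] - days_stored)

# Deleting a Deep Archive object after 30 days still bills 150 more days.
print(early_delete_days_charged("DEEP_ARCHIVE", 30))
```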

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; Are there any other costs involved in the S3 lifecycle?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ans:&lt;/strong&gt; There are per-request ingest charges when using PUT, COPY, or lifecycle rules to move data into any S3 storage class. Consider the ingest or transition cost before moving objects into any storage class.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; Can compressing the file help?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ans:&lt;/strong&gt; Compressing the files would help reduce the amount of storage your data occupies, but only if compression makes a significant difference in size. S3 has no native capability to compress data; one alternative is to download the data, compress it, re-upload it, and then delete the uncompressed copies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; Is there any other option to keep the files but reduce the cost?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ans:&lt;/strong&gt; You can consider transitioning your data to a storage class that offers a reduced storage rate. If your use case is archiving objects that you rarely access, consider S3 Glacier Instant Retrieval, S3 Glacier Flexible Retrieval, or S3 Glacier Deep Archive. Alternatively, if you can tolerate losing data in the event of the physical loss of an Availability Zone, you can transition your data to S3 One Zone-IA, which is also cheaper than the S3 Standard storage class.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; What are S3 Glacier and S3 Glacier Deep Archive storage types?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ans:&lt;/strong&gt; The S3 Glacier Instant Retrieval, S3 Glacier Flexible Retrieval, and S3 Glacier Deep Archive storage classes are designed for low-cost data archiving. These storage classes offer the same durability and resiliency as the S3 Standard and S3 Standard-IA storage classes but at reduced rates of storage. However, do note there are retrieval charges involved with these storage classes. The S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive objects are not available for real-time access. You must first restore the S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive objects before you can access them. You can reduce S3 Glacier Deep Archive retrieval costs by using bulk retrieval, which returns data within 48 hours.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Also, S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive come with minimum storage durations of 90 and 180 days respectively. Hence, if your use case is archiving data for a long period during which it will rarely be accessed, you may consider transitioning your data to S3 Glacier Instant Retrieval, S3 Glacier Flexible Retrieval, or S3 Glacier Deep Archive, depending on your retrieval needs. A comparison table and more details can be found in the documents below.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; What prefix do we need to provide?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ans:&lt;/strong&gt; When specifying a prefix, provide only the prefix path (for example, "myfolder/"), not the entire path including the S3 bucket name. In the console, also make sure the expiration action checkboxes are selected so that the objects under the specified prefix are deleted as intended.&lt;/li&gt;
&lt;/ul&gt;
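&lt;p&gt;For reference, a lifecycle rule with a prefix filter can be sketched as the following JSON shape; "myfolder/" and the rule ID are placeholders, and note that the prefix does not include the bucket name:&lt;/p&gt;

```python
import json

# Illustrative lifecycle configuration; the prefix excludes the bucket name.
lifecycle = {
    "Rules": [
        {
            "ID": "expire-myfolder",            # placeholder rule name
            "Filter": {"Prefix": "myfolder/"},  # prefix path only
            "Status": "Enabled",
            "Expiration": {"Days": 1},          # expire as soon as possible
        }
    ]
}
print(json.dumps(lifecycle, indent=2))
```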

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; Is there any cost for deleting the objects?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ans:&lt;/strong&gt; DELETE API requests are free and you are not charged for these.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; Is there a cost applied for life cycle rules?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ans:&lt;/strong&gt; There is no cost in setting up the lifecycle rule. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; Can we set up a rule for a specific file type?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ans:&lt;/strong&gt; No, S3 Lifecycle does not directly support filtering by file type (for example, by a suffix such as .csv).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; How to delete specific files?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ans:&lt;/strong&gt; You can create a lifecycle rule and tag the specific objects, or specify the prefix (folder) that you want the rule to apply to.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; How to create a lifecycle rule to expire only .csv files older than 45 days?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ans:&lt;/strong&gt; Use the S3 Batch Operations feature to tag the objects: first generate a manifest with S3 Inventory, then run a Batch Operations job that tags the .csv objects, and finally create a lifecycle rule filtered on that tag to expire the objects after 45 days.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Please be informed that there are charges for s3 inventory and s3 batch operations.&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/storage/adding-and-removing-object-tags-with-s3-batch-operations/"&gt;Adding and removing object tags with Amazon S3 Batch Operations&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html"&gt;Amazon S3 inventory&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
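&lt;p&gt;To illustrate just the selection step, here is a small self-contained sketch that picks .csv keys older than 45 days from an inventory-style listing; the sample data and function name are made up, and the resulting keys would be the candidates for tagging with S3 Batch Operations:&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

def csv_keys_older_than(objects, days, now):
    # Keep only .csv keys whose last-modified time is before the cutoff.
    cutoff = now - timedelta(days=days)
    return [o["Key"] for o in objects
            if o["Key"].endswith(".csv") and cutoff > o["LastModified"]]

now = datetime(2022, 7, 17, tzinfo=timezone.utc)
sample = [
    {"Key": "logs/a.csv", "LastModified": datetime(2022, 5, 1, tzinfo=timezone.utc)},
    {"Key": "logs/b.csv", "LastModified": datetime(2022, 7, 10, tzinfo=timezone.utc)},
    {"Key": "logs/c.txt", "LastModified": datetime(2022, 1, 1, tzinfo=timezone.utc)},
]
print(csv_keys_older_than(sample, 45, now))  # ['logs/a.csv']
```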

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; When we set up the days for the expiry of an object, which value is considered created or modified?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ans:&lt;/strong&gt; Please note that lifecycle actions such as expiration use the object's creation date to calculate its lifespan; the object is deleted once the number of days specified in the "Days after object creation" field has elapsed.&lt;/li&gt;
&lt;/ul&gt;
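&lt;p&gt;A small illustration of that date arithmetic (Amazon S3 additionally rounds the resulting time to midnight UTC of the following day before acting):&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

# Expiration eligibility = object creation time + configured number of days.
created = datetime(2022, 7, 1, 15, 30, tzinfo=timezone.utc)
eligible = created + timedelta(days=30)
print(eligible.date())  # 2022-07-31
```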

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; What are the limitations of S3 filters?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ans:&lt;/strong&gt; At this time S3 only supports filters based on prefix and/or object tags. For example, you can create a filter for the prefix "myfolder/" with tags "Expire": "true" so only the objects under the prefix with the specified tag will be expired.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The caveats of using tags are that you must add tags to each object individually, and that adding and storing the tags incur extra charges.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
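&lt;p&gt;Combining a prefix with a tag uses the "And" element of the lifecycle filter. A sketch with placeholder names, matching the "myfolder/" and "Expire": "true" example above:&lt;/p&gt;

```python
# Illustrative rule: expire only objects under "myfolder/" tagged Expire=true.
rule = {
    "ID": "expire-tagged",  # placeholder rule name
    "Filter": {
        "And": {
            "Prefix": "myfolder/",
            "Tags": [{"Key": "Expire", "Value": "true"}],
        }
    },
    "Status": "Enabled",
    "Expiration": {"Days": 45},
}
print(rule["Filter"]["And"]["Prefix"])
```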

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; Is there a cost for tagging an object?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ans:&lt;/strong&gt; Yes. To PUT the tags, you are charged $0.005 per 1,000 PUT requests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; Is there a cost for maintaining the tags? &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ans:&lt;/strong&gt; Yes. To maintain the tags, you are charged $0.01 per 10,000 tags per month.&lt;/li&gt;
&lt;/ul&gt;
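&lt;p&gt;Putting the two tag rates together as a rough estimate (the function and the one-tag-per-object assumption are illustrative):&lt;/p&gt;

```python
# Rough tag cost: $0.005 per 1,000 tagging PUT requests, plus
# $0.01 per 10,000 tags per month of storage (one tag per object assumed).
def tag_cost(num_objects, months):
    put_cost = num_objects / 1000 * 0.005
    monthly_cost = num_objects / 10000 * 0.01
    return put_cost + monthly_cost * months

# 1,000,000 objects kept tagged for 3 months: $5 + 3 * $1 = $8
print(f"${tag_cost(1_000_000, 3):.2f}")
```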

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; What else do we need to consider? &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ans:&lt;/strong&gt; Please note that, apart from storage, you also pay for requests made against your S3 buckets and objects. Take all of this into account when estimating the cost you will incur.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Ref:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-set-lifecycle-configuration-intro.html"&gt;Lifecycle configuration intro&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html"&gt;Managing your storage lifecycle&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/intro-lifecycle-rules.html#intro-lifecycle-rules-number-of-days"&gt;Lifecycle rules: Based on an object's age&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/storage/simplify-your-data-lifecycle-by-using-object-tags-with-amazon-s3-lifecycle/"&gt;Simplify your data lifecycle by using object tags with Amazon S3 Lifecycle&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/intro-lifecycle-rules.html#intro-lifecycle-rules-filter"&gt;Lifecycle rules filter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/11/amazon-s3-lifecycle-storage-cost-savings/"&gt;Amazon s3 lifecycle storage cost savings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2011/12/27/amazon-s3-announces-object-expiration/"&gt;Amazon S3 Object Expiration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/s3/pricing/"&gt;Amazon S3 pricing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/premiumsupport/knowledge-center/s3-empty-bucket-lifecycle-rule/"&gt;Empty the S3 bucket using a lifecycle&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-class-intro.html#sc-glacier"&gt;Storage classes for archiving objects&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-class-intro.html#sc-compare"&gt;Comparing the Amazon S3 storage classes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/s3/storage-classes/"&gt;Amazon S3 Storage Classes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/s3/pricing/https://aws.amazon.com/premiumsupport/knowledge-center/glacier-early-delete-fees"&gt;Early delete fees&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/premiumsupport/knowledge-center/s3-lifecycle-rule-delay/"&gt;s3 lifecycle rule delay&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/intro-lifecycle-rules.html#:~:text=When%20specifying%20the,object%20or%20objects"&gt;Lifecycle rules: Based on an object's age&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/intro-lifecycle-rules.html"&gt;Lifecycle configuration elements&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-configuration-examples.html#lifecycle-config-conceptual-ex5"&gt;Examples of S3 Lifecycle configuration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/intro-lifecycle-rules.html#:~:text=Expiration%20deletes%20the%20object%2C%20and%20the%20deleted%20object%20cannot%20be%20recovered"&gt;Data recovery&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>s3</category>
    </item>
  </channel>
</rss>
