Interacting with Amazon S3 using AWS Data Wrangler (awswrangler) SDK for Pandas: A Comprehensive Guide

Introduction

Amazon S3 is a widely used cloud storage service for storing and retrieving data. AWS Data Wrangler (awswrangler) is a Python library that simplifies the process of interacting with various AWS services, including Amazon S3, especially in combination with Pandas DataFrames. In this article, I will guide you through the process of effectively using the awswrangler library to interact with Amazon S3, focusing on data manipulation with Pandas DataFrames.

Table of Contents

  1. Introduction to AWS Data Wrangler

    • What is AWS Data Wrangler?
    • Key Features and Benefits
  2. Prerequisites

    • Set Up Your AWS Account
    • Install Required Libraries
  3. Connecting to Amazon S3

    • Creating an S3 Bucket
    • Creating a Connection
  4. Uploading and Downloading Data

    • Uploading Data to S3
    • Reading Data from S3
  5. Conclusion

    • Leveraging awswrangler for S3 Data Operations
    • Resources for Further Learning

1. Introduction to AWS Data Wrangler

What is AWS Data Wrangler?

AWS Data Wrangler is a Python library that simplifies the process of interacting with various AWS services, built on top of some useful data tools and open-source projects such as Pandas, Apache Arrow and Boto3. It offers streamlined functions to connect to, retrieve, transform, and load data from AWS services, with a strong focus on Amazon S3.
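To give a flavor of the API, reading a CSV object from S3 into a Pandas DataFrame is a single call. A minimal sketch, assuming your credentials are configured; the bucket and key here are placeholders:

import awswrangler as wr

# Hypothetical path: replace with a bucket and key you have access to
df = wr.s3.read_csv("s3://my-example-bucket/data/example.csv")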

Key Features and Benefits

  • Seamless Integration: Integrate AWS services with Pandas DataFrames using familiar methods.
  • Efficient Data Manipulation: Perform data transformations efficiently using optimized Pandas methods.
  • Simplified Connection: Easily configure AWS credentials and establish connections to AWS services.
  • Error Handling: Built-in error handling and logging mechanisms for improved reliability.

2. Prerequisites

Set Up Your AWS Account

Before you start, ensure you have an AWS account set up, an IAM (Identity and Access Management) user with S3 access permissions, and the AWS CLI configured locally (for example, by running aws configure).

Install Required Libraries

It is a generally well-known good practice to work in isolated environments, especially when you are trying out new Python libraries, so if you are a conda user, you should first create a conda environment in which you will install awswrangler.

First create your conda environment by running:

conda create -n data-wrangling python=3.11

Then activate the environment by running:

conda activate data-wrangling

[Screenshot: activating the conda environment]

Now it is time to install the required libraries inside the environment using the following command:

pip install awswrangler pandas boto3

3. Connecting to Amazon S3

Creating an S3 Bucket

To create the S3 bucket we will use the AWS CLI. If you followed the previous guidelines on setting up your AWS account, your access keys should be stored in the %USERPROFILE%\.aws directory on Windows (typically C:\Users\<username>\.aws), or in ~/.aws on Linux and macOS.


You can then create the bucket from the command line by running:

aws s3api create-bucket --bucket aws-sdk-pandas72023 --region us-east-2 --create-bucket-configuration LocationConstraint=us-east-2

I have called the bucket aws-sdk-pandas72023, but you can name yours whatever you like as long as it follows the naming rules for S3 buckets (see Bucket naming rules).

You will then receive the following output in the command line:

[Screenshot: bucket creation output]

You will also be able to see the newly created bucket in your AWS console:

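If you prefer checking from code rather than the console, boto3 can list the buckets visible to your credentials. A minimal sketch, assuming your default credentials are already configured:

import boto3

# Print the names of all buckets the configured credentials can see
s3 = boto3.client("s3")
print([b["Name"] for b in s3.list_buckets()["Buckets"]])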

Creating a Connection

The awswrangler library internally handles sessions and AWS credentials using boto3 in order to connect to your bucket, so besides importing awswrangler you should also:

import boto3

All the packages you need to import are the following:

#Importing required libraries
import awswrangler as wr
import yfinance as yf
import boto3
import pandas as pd
import datetime as dt
from datetime import date, timedelta
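awswrangler uses the default boto3 session unless you pass one explicitly; if you need a specific profile or region, you can create a session and hand it to any call through the boto3_session parameter. A minimal sketch (the profile name is a placeholder, and bucket holds the name of the bucket created earlier, which the functions below also rely on):

# Optional: explicit session for a specific profile/region (profile name is a placeholder)
session = boto3.Session(profile_name="default", region_name="us-east-2")

bucket = "aws-sdk-pandas72023"  # the bucket created earlier
print(wr.s3.list_objects(f"s3://{bucket}/", boto3_session=session))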

You can also git clone the repository that has the code used in this tutorial.

If you have cloned the repository, you will have noticed that we use the yfinance library to extract stock data from its API and store it in a pandas DataFrame, so that we can write the extracted data (DataFrames) to the previously created S3 bucket using awswrangler.

[Screenshot: the get_data_from_api (get stock data) function]
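The actual implementation lives in the repository; here is a minimal sketch of what such a function might look like, using yfinance's download method (the exact signature and date range in the repo may differ):

def get_data_from_api(ticker: str, days: int = 365) -> pd.DataFrame:
    """Fetch daily price history for `ticker` over the last `days` days."""
    end = date.today()
    start = end - timedelta(days=days)
    # yf.download returns a pandas DataFrame indexed by trading date
    return yf.download(ticker, start=start, end=end)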

The awswrangler API can read and write data in a wide variety of file formats and works with numerous AWS services; refer to this list for more information.

4. Uploading and Downloading Data from S3 Buckets

Covering all the services the API supports for reading and writing data would make this tutorial quite long, so we will work with one of the most commonly used storage services in data projects: S3.

Uploading Data to S3

In the repository shared above, I have written a function that writes the DataFrame extracted with the get_data_from_api function to the previously created S3 bucket:

def write_data_to_bucket(file_name:str, mode:str):
    """
    Writes the global DataFrame `df` to the S3 bucket as CSV.

    Parameters:
    ----------
    file_name(str): Name of the folder to create under raw-data/
    mode(str): Available write modes are 'append', 'overwrite' and 'overwrite_partitions'
    """
    # `bucket` and `df` are defined earlier in the script/notebook
    path = f"s3://{bucket}/raw-data/{file_name}"
    # Send the DataFrame of the corresponding ticker to the bucket
    wr.s3.to_csv(
        df=df,
        path=path,
        index=True,
        dataset=True,
        mode=mode
    )

So let's put get_data_from_api into action by passing the NVDA stock symbol, and then load the result to the bucket using write_data_to_bucket:
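A sketch of those two calls, assuming the function definitions above (the exact invocation in the repository may differ):

df = get_data_from_api("NVDA")  # fetch NVDA price history into the global df
write_data_to_bucket(file_name="NVDA", mode="overwrite")  # write it under raw-data/NVDA/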

[Screenshot: NVDA (NVIDIA Corporation) data returned as a DataFrame]

If we go to the S3 bucket, we'll notice a new folder inside it named after the stock symbol, with a CSV file inside:

[Screenshot: folder created in S3]

You can also pass multiple DataFrames to the function so that they are all written to the bucket, as sketched after the screenshot below:

[Screenshot: writing multiple DataFrames to the bucket]
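For example, a simple loop over several tickers could look like this (the ticker list is illustrative):

for ticker in ["NVDA", "MSFT", "AMZN"]:
    df = get_data_from_api(ticker)  # refresh the global df used by the writer
    write_data_to_bucket(file_name=ticker, mode="overwrite")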

Reading Data from S3

Reading data from an S3 bucket using awswrangler is a very straightforward task: you only need to pass the S3 path where your files are stored and the path_suffix corresponding to the file type you are reading, in this case .csv for read_csv. Find more parameters available for this method in the awswrangler API reference.

For this tutorial I have written a function that takes the name of the folder where the CSV file was stored when we wrote the data coming from the API to the S3 bucket:

def read_csv_from_bucket(folder_name:str) -> pd.DataFrame:
    # Read every CSV stored under the folder written by write_data_to_bucket
    df = wr.s3.read_csv(
        path=f"s3://{bucket}/raw-data/{folder_name}/",
        path_suffix=".csv"
    )
    return df
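For instance, to load the NVDA data written earlier back into a DataFrame (the folder name matches the file_name used when writing):

nvda_df = read_csv_from_bucket("NVDA")
print(nvda_df.head())  # preview the first rows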

[Screenshot: output of the read_csv_from_bucket function]

[Screenshot: data stored in the S3 bucket]

5. Conclusion

There are many more methods available on the API Reference page for interacting with S3 and multiple other AWS services.

API Reference

I encourage you to keep testing the ones you find useful to integrate with your ETLs and pipelines.

AWS Data Wrangler simplifies the interaction with Amazon S3, providing a seamless experience for performing data operations using Pandas DataFrames. This tutorial covered the fundamental concepts of connecting to S3, uploading data, and reading it back into DataFrames. By leveraging AWS Data Wrangler, you can streamline your data workflows and focus on deriving insights from your data.

Cheers and happy coding!!

Resources for Further Learning

  • awswrangler API Reference
  • Amazon S3 bucket naming rules
  • The companion repository with the code used in this tutorial
