Karthik Subramanian for AWS Community Builders

Writing a CSV to S3 from AWS Lambda

In the last post I explained how to scrape a URL with Selenium and extract the number of search results returned by Google for a query string. Let us now see how to insert these results into S3 as a CSV file.

Setting up the S3 bucket

Update the template.yaml file and add a new resource for the S3 bucket -

template.yaml file

When defining the bucket we have specified a few additional properties -

  • LifecycleConfiguration: This is optional, but I have set this so that my CSVs are deleted after a day since I don’t want them lying around forever

  • CorsConfiguration: In my use case I needed the objects in S3 to be available for download to anyone who has a pre-signed URL. Because of this requirement I needed to specify a CORS config that allows any origin. Modify this as per your needs (see the sketch below)
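
Here is a minimal sketch of what that bucket resource could look like in template.yaml. The logical name UploadBucket, the rule Id, and the allowed methods are assumptions for illustration, not taken from the actual template -

  UploadBucket:
    Type: AWS::S3::Bucket
    Properties:
      LifecycleConfiguration:
        Rules:
          # Expire the exported CSVs a day after they are created
          - Id: ExpireCsvExports
            Status: Enabled
            ExpirationInDays: 1
      CorsConfiguration:
        CorsRules:
          # Allow any origin to download objects via a pre-signed URL
          - AllowedOrigins:
              - "*"
            AllowedMethods:
              - GET
            AllowedHeaders:
              - "*"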

Let's also define a global environment variable for the bucket name so that the lambdas have it available to them -

global env variables
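
A sketch of how this could look in the SAM Globals section, assuming the bucket's logical name is UploadBucket. The variable name UPLOAD_BUCKET matches what process.py reads from os.environ -

Globals:
  Function:
    Environment:
      Variables:
        # Resolves to the bucket name at deploy time and is injected into every function
        UPLOAD_BUCKET: !Ref UploadBucket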

We also need to ensure that the Process lambda has access to write to the S3 bucket. Add a new policy to the lambda properties -

S3 policy
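
One way to grant that access is a SAM policy template on the function, for example S3CrudPolicy. The logical name ProcessFunction and the choice of policy template are assumptions; a narrower write-only policy such as S3WritePolicy would also work -

  ProcessFunction:
    Type: AWS::Serverless::Function
    Properties:
      # ...existing function properties...
      Policies:
        # Grants the function read/write access to the upload bucket
        - S3CrudPolicy:
            BucketName: !Ref UploadBucket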

Finally, update the Outputs -

Outputs section
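
For instance, the Outputs section could expose the generated bucket name so it is easy to find after deployment (the output key below is an assumption) -

Outputs:
  UploadBucketName:
    Description: Name of the S3 bucket that holds the generated CSV files
    Value: !Ref UploadBucket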

Update the process.py file with the following code -

import json
from db import db_helper
from selenium.webdriver.common.by import By
from selenium import webdriver
from io import StringIO
import boto3
import csv
import os
from datetime import datetime


def lambda_handler(event=None, context=None):
    # Extract the request payload (unwraps the SQS message body when present)
    request = get_request(event=event)
    if request is None:
        return {
            "statusCode": 400,
            "body": {
                "message": "Cannot parse url"
            }
        }
    dbHelper = db_helper.DBHelper()
    try:
        # Mark the order as in progress before starting the scrape
        dbHelper.update_order_status(request=request, status='In Progress')
        url = request['url']
        upload_bucket_name = str(os.environ['UPLOAD_BUCKET'])
        # Scrape the search results page with headless Chrome
        driver = get_driver()
        driver.get(url)
        search_results = driver.find_elements(By.XPATH, "//div[@data-header-feature]")
        result_list = []
        for result in search_results:
            result_list.append({"result": result.text})
        # Write the results to S3 as a CSV and record the file location
        dt_string = datetime.now().strftime("%Y-%m-%d_%H%M")
        csv_file_name = f'export_{dt_string}.csv'
        upload_csv_s3(result_list, upload_bucket_name, csv_file_name)
        dbHelper.update_order_status(request=request, status='Complete', location=csv_file_name)
    except Exception as e:
        print(e)
        dbHelper.update_order_status(request=request, status='Failed')
        return {
            "statusCode": 500,
            "body": {
                "message": f"Error processing request: {e}"
            }
        }
    return {
        "statusCode": 200,
        "body": json.dumps(
            {
                "records found": len(search_results),
            }
        ),
    }


def get_request(event) -> str:
    # SQS-triggered invocations wrap the payload in Records[0].body
    if event and "Records" in event:
        body = event['Records'][0]['body']
        event = json.loads(body)
    return event


def get_driver():
    # Headless Chrome options suited to the Lambda environment
    chrome_options = webdriver.ChromeOptions()
    chrome_options.binary_location = "/opt/chrome/chrome"
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--disable-dev-tools")
    chrome_options.add_argument("--no-zygote")
    chrome_options.add_argument("--single-process")
    chrome_options.add_argument("window-size=2560x1440")
    chrome_options.add_argument("--remote-debugging-port=9222")
    input_driver = webdriver.Chrome("/opt/chromedriver", options=chrome_options)
    return input_driver


def upload_csv_s3(data_dictionary, s3_bucket_name, csv_file_name):
    print('Starting csv upload to S3')
    try:
        data_dict_keys = data_dictionary[0].keys()
        # creating a file buffer
        file_buff = StringIO()
        # writing csv data to file buffer
        writer = csv.DictWriter(file_buff, fieldnames=data_dict_keys)
        writer.writeheader()
        writer.writerows(data_dictionary)
        # creating s3 client connection
        client = boto3.client('s3')
        # placing file to S3, file_buff.getvalue() is the CSV body for the file
        client.put_object(Body=file_buff.getvalue(), Bucket=s3_bucket_name, Key=csv_file_name)
        print('Completed uploading to S3')
    except Exception as e:
        print(e)
        raise e

Note: The update_order_status call for the 'Complete' status was also modified to include the CSV file name.

Deploying & Testing

Unlike before, we are going to first deploy our changes to AWS so that the S3 bucket gets created, and then test our code.

sam build
sam deploy

You should see an output like this -

Console output

Validate the changes

To validate the changes, let's make another POST call from Postman to the prod API Gateway -

postman

Now log in to the AWS console and check the S3 bucket; you should see the CSV file created -

s3 bucket

Looking at the DynamoDB table, we can see that the file_location was also updated with the CSV file name.

dynamodb table

Source Code

Here is the source code for the project.

Next: Part 6: Downloading a file from S3 using API Gateway & AWS Lambda
