Karthik Subramanian

Posted on • Originally published at Medium

AWS Lambda & ECR nuances

There are a couple of nuances with AWS services that I encountered along the way and wanted to highlight here.

AWS Lambda Ephemeral Storage

One of the first issues I encountered after getting everything set up was that the Process lambda would only work once. After the first execution, each subsequent invocation would fail because the Chrome driver would crash at different steps. Since it wouldn't crash at the same step each time, and it would succeed completely the very first time, I suspected something was up with whatever the invocations were sharing. That led me to the ephemeral storage.

The Lambda execution environment provides a file system for your code to use at /tmp. This space has a default size of 512 MB. The same Lambda execution environment may be reused by multiple Lambda invocations to optimize performance. Consequently, this is intended as an ephemeral storage area. While functions may cache data here between invocations, it should be used only for data needed by code in a single invocation.

Aha! The Chrome driver was using up the /tmp storage space on the first invocation, which is why it was crashing on the next one.

Increasing the storage size from 512 MB to 3 GB resolved the issue for me. All I needed to do was update the template.yaml global function properties -

template.yaml
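
A minimal sketch of the relevant Globals section (the timeout and memory values below are placeholders; the EphemeralStorage Size is specified in MB):

Globals:
  Function:
    Timeout: 300          # placeholder; keep whatever your template already declares
    MemorySize: 2048      # placeholder
    EphemeralStorage:
      Size: 3072          # /tmp size in MB (default is 512, maximum is 10240)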

But this alone isn't enough. With a large enough number of executions, I'm pretty sure we would exhaust that 3 GB storage limit too.

What I needed was a way to make the lambda clean up after itself on each invocation. I ended up creating a wrapper class that generates a random folder within /tmp and passes it to the Chrome options to use for storing user data. It also deletes that folder once the driver exits -

from selenium.webdriver.support.ui import WebDriverWait
from selenium import webdriver
import uuid
import os
import shutil

MAX_WAIT = 10


class WebDriverWrapper:
    def __init__(self):
        self.driver = self.__get_driver()
        self.wait = WebDriverWait(self.driver, MAX_WAIT)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, tb):
        if exc_type is not None:
            print(exc_type, exc_value, tb)
        self.driver.close()
        self.driver.quit()
        # Remove this invocation's /tmp folder so the shared execution
        # environment doesn't run out of ephemeral storage
        shutil.rmtree(self._tmp_folder)
        # Returning True suppresses any exception raised inside the with block
        return True

    def __get_driver(self):
        # Create a unique folder under /tmp for this invocation's Chrome data
        self._tmp_folder = '/tmp/{}'.format(uuid.uuid4())
        if not os.path.exists(self._tmp_folder):
            os.makedirs(self._tmp_folder)
        if not os.path.exists(self._tmp_folder + '/chrome-user-data'):
            os.makedirs(self._tmp_folder + '/chrome-user-data')
        if not os.path.exists(self._tmp_folder + '/data-path'):
            os.makedirs(self._tmp_folder + '/data-path')
        if not os.path.exists(self._tmp_folder + '/cache-dir'):
            os.makedirs(self._tmp_folder + '/cache-dir')
        chrome_options = webdriver.ChromeOptions()
        chrome_options.binary_location = "/opt/chrome/chrome"
        chrome_options.add_argument("--headless")
        chrome_options.add_argument("--no-sandbox")
        chrome_options.add_argument("--disable-dev-shm-usage")
        chrome_options.add_argument("--disable-gpu")
        chrome_options.add_argument("--disable-dev-tools")
        chrome_options.add_argument("--no-zygote")
        chrome_options.add_argument("--single-process")
        chrome_options.add_argument("window-size=2560x1440")
        # Point Chrome's user data, data path and disk cache at the unique folder
        chrome_options.add_argument(f"--user-data-dir={self._tmp_folder}/chrome-user-data")
        chrome_options.add_argument(f"--data-path={self._tmp_folder}/data-path")
        chrome_options.add_argument(f"--disk-cache-dir={self._tmp_folder}/cache-dir")
        chrome_options.add_argument("--remote-debugging-port=9222")
        input_driver = webdriver.Chrome("/opt/chromedriver", options=chrome_options)
        return input_driver

Update process.py to leverage the new wrapper -
import json
from db import db_helper
from selenium.webdriver.common.by import By
from web_driver_wrapper import WebDriverWrapper
from io import StringIO
import boto3
import csv
import os
from datetime import datetime


def lambda_handler(event=None, context=None):
    request = get_request(event=event)
    if request is None:
        return {
            "statusCode": 400,
            "body": {
                "message": "Cannot parse url"
            }
        }
    dbHelper = db_helper.DBHelper()
    try:
        dbHelper.update_order_status(request=request, status='In Progress')
        url = request['url']
        upload_bucket_name = str(os.environ['UPLOAD_BUCKET'])
        result_list = []
        # The wrapper creates a unique /tmp folder for Chrome and removes it on exit
        with WebDriverWrapper() as driver_wrapper:
            driver = driver_wrapper.driver
            driver.get(url)
            search_results = driver.find_elements(By.XPATH, "//div[@data-header-feature]")
            for result in search_results:
                result_list.append({"result": result.text})
        if len(result_list) > 0:
            dt_string = datetime.now().strftime("%Y-%m-%d_%H%M")
            csv_file_name = f'export_{dt_string}.csv'
            upload_csv_s3(result_list, upload_bucket_name, csv_file_name)
            dbHelper.update_order_status(request=request, status='Complete', location=csv_file_name)
        else:
            dbHelper.update_order_status(request=request, status='No Results')
    except Exception as e:
        print(e)
        dbHelper.update_order_status(request=request, status='Failed')
        return {
            "statusCode": 500,
            "body": {
                "message": f"Error processing request: {e}"
            }
        }
    return {
        "statusCode": 200,
        "body": json.dumps(
            {
                "records found": len(result_list),
            }
        ),
    }


def get_request(event):
    if "Records" in event:
        body = event['Records'][0]['body']
        event = json.loads(body)
    return event


def upload_csv_s3(data_dictionary, s3_bucket_name, csv_file_name):
    print('Starting csv upload to S3')
    try:
        data_dict_keys = data_dictionary[0].keys()
        # creating a file buffer
        file_buff = StringIO()
        # writing csv data to file buffer
        writer = csv.DictWriter(file_buff, fieldnames=data_dict_keys)
        writer.writeheader()
        writer.writerows(data_dictionary)
        # creating s3 client connection
        client = boto3.client('s3')
        # placing file to S3, file_buff.getvalue() is the CSV body for the file
        client.put_object(Body=file_buff.getvalue(), Bucket=s3_bucket_name, Key=csv_file_name)
        print('Completed uploading to S3')
    except Exception as e:
        print(e)
        raise e

And don’t forget to add it to the Dockerfile -

Dockerfile
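
As a rough sketch (the base image tag and file layout are assumptions, and the chrome/chromedriver layers from earlier in the series are omitted), the new wrapper module just needs to be copied into the image alongside the handler:

# Sketch only -- adjust paths and base image to match your actual Dockerfile
FROM public.ecr.aws/lambda/python:3.9

COPY requirements.txt ./
RUN pip install -r requirements.txt

# Copy the wrapper next to the handler so `from web_driver_wrapper import WebDriverWrapper` resolves
COPY web_driver_wrapper.py ${LAMBDA_TASK_ROOT}/
COPY process.py ${LAMBDA_TASK_ROOT}/
COPY db/ ${LAMBDA_TASK_ROOT}/db/

CMD [ "process.lambda_handler" ]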

AWS Free Tier

One of the primary goals I had when I began this architecture was ensuring it was free by leveraging the services offered as part of the AWS Free Tier.

With everything built and tested, I decided to open the AWS Cost Explorer to review my costs.

Cost explorer

$0.02!!! What's up with that, AWS!

Heading over to Budgets > Free Tier showed me who the culprit was -

Budgets

Amazon ECR has a free tier limit of 500 MB of storage, and I was already at 1 GB.

One of the things I had done when building my architecture was run the command "sam deploy --guided" whenever I added a new lambda to the template.yaml file. One of the questions asked was "Create managed ECR repositories for all functions? [Y/n]", and I had chosen Y each time. That resulted in AWS creating a new ECR repo for each of the 3 lambda functions used in this architecture. With each repo being approximately 400 MB, you can see how I easily blew past the 500 MB limit.

This is why when creating this series I chose the approach of manually modifying the samconfig.toml file and updating the image_repositories list whenever we created a new lambda.
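
For reference, a sketch of what that entry looks like in samconfig.toml (the function logical IDs, account ID, region, and repository name below are placeholders, not the project's real values):

[default.deploy.parameters]
stack_name = "my-scraper-stack"
image_repositories = [
  "ProcessFunction=123456789012.dkr.ecr.us-east-1.amazonaws.com/my-shared-repo",
  "QueueFunction=123456789012.dkr.ecr.us-east-1.amazonaws.com/my-shared-repo",
  "StatusFunction=123456789012.dkr.ecr.us-east-1.amazonaws.com/my-shared-repo"
]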

Another cost factor is the number of images stored by the repository. Head over to Amazon ECR > Repositories and click on our repo -

ECR

Those images also occupy space and can be deleted. I personally choose to keep only the latest 3 images and delete the rest. You can also set a lifecycle policy that automatically deletes the older images for you.
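
If you'd rather script it, a lifecycle policy that keeps only the three most recent images can be attached with boto3; the repository name below is a placeholder:

import json
import boto3

REPO_NAME = "my-shared-repo"  # placeholder; use the repository created for this project

# Expire everything beyond the 3 most recent images
lifecycle_policy = {
    "rules": [
        {
            "rulePriority": 1,
            "description": "Keep only the last 3 images",
            "selection": {
                "tagStatus": "any",
                "countType": "imageCountMoreThan",
                "countNumber": 3
            },
            "action": {"type": "expire"}
        }
    ]
}

ecr = boto3.client("ecr")
ecr.put_lifecycle_policy(
    repositoryName=REPO_NAME,
    lifecyclePolicyText=json.dumps(lifecycle_policy)
)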

Finally, keep an eye on the limits for all the services used. I highly recommend creating a budget in AWS with a threshold that notifies you when you approach it.
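
As a sketch, a monthly cost budget with an email alert can also be created with boto3 (the account ID, amount, and email address below are placeholders):

import boto3

budgets = boto3.client("budgets")
budgets.create_budget(
    AccountId="123456789012",  # placeholder account ID
    Budget={
        "BudgetName": "free-tier-guardrail",
        "BudgetLimit": {"Amount": "1.0", "Unit": "USD"},  # placeholder limit
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST"
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,  # alert at 80% of the budgeted amount
                "ThresholdType": "PERCENTAGE"
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "you@example.com"}
            ]
        }
    ]
)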

Source Code

Here is the source code for the project created here.
