This article is part of our Ruby on AWS Lambda blog series. A recent project had us migrating an existing PDF document processing system from Rails Sidekiq to AWS Lambda. The processing includes OCR, creating preview images, splicing the PDF, and more. Moving to Lambda reduced processing time by a factor of 300 in some cases.

This series of articles will serve less as a step-by-step process for getting OCR serverless infrastructure up and running and more as a highlight reel of our "Aha!" moments. In part one, we talked about creating an AWS Lambda Layer with Docker. In part two, we chatted about architecting a serverless app. In part three, we went through some strategies surrounding deployment. Here in part four, we'll investigate what's needed to integrate a Rails app that uses Active Storage with AWS Lambda. Check out the other posts in the series.
Note: this article makes some assumptions regarding architecture, so reading Part Two of this series, Planning & Architecting, may be useful before diving in here.
Performance was our main reason for moving from Sidekiq to AWS Lambda. We used Active Storage on a few models and wanted to continue using it after the migration. Specifically, we wanted to take advantage of Active Storage variants via a GraphQL API and React frontend. The app had multiple views that requested all of the variants at once, and because the variants are created at the time of the request, the user experience under Sidekiq was very poor. What if we could create all of those variants on Lambda as we processed the document? It sounded plausible to us, so we spiked on it.
Note that we'll be touching on specific pieces of code that directly relate to the integration; this won't be a tutorial on how to use Active Storage or set up AWS Lambda. We are using S3 as our storage service and will be referencing it as a standalone entity. Let's dive in!
Uploading
Instead of passing form params to a model, we'll take a more manual approach. In the controller:

```ruby
@document.file.upload(file)
```

We have a `Document` model with `has_one_attached :file`. This allows the file to be uploaded to S3 without creating the backing Active Storage record. Delaying that record creation allows us to process the document on Lambda, create the needed variants, and respond back with data that will be used to create those records.
Lambda has no knowledge of the Rails app or Active Storage, but if we can point it to the file we just uploaded, it can reference and process it. After a `Document` has been created, we send `document.file.key` to the vent function (see Part Two, Planning & Architecting) that starts the processing work. The document's key is the name of the file on S3. On Lambda, the real fun begins.
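Putting those pieces together, the controller action might look roughly like the sketch below. The controller, the `name` attribute, and the `vent-function` name are illustrative assumptions, not the app's actual code:

```ruby
require 'aws-sdk-lambda'
require 'json'

class DocumentsController < ApplicationController
  def create
    # `name` is a stand-in attribute for whatever the model actually needs
    @document = Document.create!(name: params[:name])

    # Upload straight to S3; no Active Storage records are created yet
    @document.file.upload(params[:file])

    # Hand the S3 key to the vent function to kick off processing
    Aws::Lambda::Client.new.invoke(
      function_name: 'vent-function',  # assumed function name
      invocation_type: 'Event',        # asynchronous invocation
      payload: JSON.generate(key: @document.file.key)
    )

    head :accepted
  end
end
```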
Variant Creation
Essentially, we needed to trick Active Storage. When requesting a variant, Active Storage will look for `variant.jpg` in a specific location on S3:

```ruby
"variants/#{key}/#{variant_key}"
```

`key` is the original file's key (`document.file.key`), and `variant_key` is a hash derived from the variant's transformation options (e.g. `{auto_orient: true, resize: '1000x1000'}`) and the Rails app's `ENV['SECRET_KEY_BASE']`. Let's look at the Lambda code:
```ruby
require 'digest'
require 'tempfile'
require 'mini_magick'
require 'aws-sdk-s3'

def create_variant(image, key, transformation)
  # create the variant key the same way Active Storage does
  verifier_key = verifier.generate(transformation, purpose: :variation)
  variant_key = Digest::SHA256.hexdigest(verifier_key)

  tmp = Tempfile.new('variant.jpg')

  MiniMagick::Tool::Convert.new do |convert|
    convert << image.path
    # apply transformations here, like:
    # convert.auto_orient
    # convert.resize transformation[:resize]
    convert << tmp.path
  end

  s3 = Aws::S3::Client.new
  s3.put_object(
    bucket: "some-cool-bucket",
    key: "variants/#{key}/#{variant_key}",
    body: tmp,
    content_type: 'image/jpeg'
  )
end
```
`#create_variant` takes a `File`, `document.file.key`, and the transformation options as arguments. The `variant_key` is created (we'll look at what `verifier` is soon), and the actual image conversion is handled with ImageMagick. Note that the actual ImageMagick conversions depend on the transformation options being passed in; as an example, I used the options that were previously mentioned, `{auto_orient: true, resize: '1000x1000'}`. After the image is processed, we then place it on S3, using the specific path Active Storage expects.
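If you'd rather not hard-code each option, the transformation hash can also be walked dynamically, similar in spirit to how Active Storage's MiniMagick-backed transformer applies option hashes. A sketch, reusing `image`, `transformation`, and `tmp` from `#create_variant`:

```ruby
MiniMagick::Tool::Convert.new do |convert|
  convert << image.path
  transformation.each do |option, argument|
    if argument == true
      convert.public_send(option)            # e.g. auto_orient => -auto-orient
    else
      convert.public_send(option, argument)  # e.g. resize '1000x1000'
    end
  end
  convert << tmp.path
end
```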
Figuring out how to create the `verifier_key` was quite difficult. It took a late night fueled by coffee and spelunking into the bowels of Active Storage, but it proved fruitful. Let's take a look at `#verifier`:
```ruby
def verifier
  key_generator = ActiveSupport::CachingKeyGenerator.new(
    ActiveSupport::KeyGenerator.new(ENV['SECRET_KEY_BASE'], iterations: 1000)
  )
  key = key_generator.generate_key('ActiveStorage')

  ActiveSupport::MessageVerifier.new(key)
end
```
Active Storage was created by people much smarter than I, and I'll be the first to admit that I did not look at all the implementation details surrounding `ActiveSupport::CachingKeyGenerator`, `ActiveSupport::KeyGenerator`, and `ActiveSupport::MessageVerifier`. However, their names are descriptive and give us a good bird's-eye view. A key is generated from your Rails app's `SECRET_KEY_BASE`, and that key is then turned into a digest that is used to create the path to the variant. A few notes:
- The `iterations` option when instantiating `ActiveSupport::KeyGenerator` does not have to be 1,000. I, of course, didn't test every number, but the natural numbers I tested all had the same result. Active Storage uses 1,000, so I thought I would too.
- The argument passed to `#generate_key` is the salt, and for our intended purposes on Lambda it can be any string.
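To build confidence that the Lambda-side math lines up, a quick parity check can be run in a Rails console. This is a hedged sketch assuming a Rails 5.2-era API and a `document` whose attached file is a variable image:

```ruby
transformation = { auto_orient: true, resize: '1000x1000' }

# Lambda side: what #create_variant computes
verifier_key = verifier.generate(transformation, purpose: :variation)
lambda_variant_key = Digest::SHA256.hexdigest(verifier_key)

# Rails side: the full variant key Active Storage will look for on S3
document.file.variant(transformation).key
# => "variants/#{document.file.key}/#{lambda_variant_key}" when everything lines up
```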
Data for the `active_storage_blobs` record
Early on in the spike, we hit a lot of `ActiveStorage::IntegrityError`s. While you can find yourself raising this error by accident in many scenarios, the reason is always the same: the hashed contents of the file don't match the checksum stored in the database. Since we uploaded the file straight to S3, we purposefully didn't create a record in `active_storage_blobs`, and we no longer have a reference to the file without downloading it again (which would be a waste). We had to create the digest on Lambda while we still had a reference to the file.
Here's the data we collect for each page in a document (note that `file` is a `File` object):

```ruby
{
  content_md5: Digest::MD5.file(file).base64digest,
  byte_size: File.size(file).to_s,
  key: key,
  content_type: content_type
}
```
After every page is processed, our sink function collects all the data and sends a JSON response back to the Rails app.
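The exact shape of that response and how it travels back are up to you; as a rough illustration (the endpoint and payload envelope here are assumptions, not the app's real API), the sink could POST the collected page data like so:

```ruby
require 'net/http'
require 'json'

# Hypothetical sink-side callback to the Rails app
def notify_rails(document_id, pages)
  uri = URI('https://example.com/documents/processed')  # assumed endpoint
  payload = { document_id: document_id, pages: pages }

  Net::HTTP.post(uri, JSON.generate(payload), 'Content-Type' => 'application/json')
end
```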
All together now
When the Rails app receives the JSON, Sidekiq workers are started to process the response. Part of the workers' responsibility is to create a record in `active_storage_blobs`:
```ruby
def create_blob(data)
  params = {
    key: data['key'],
    filename: data['key'],
    checksum: data['content_md5'],
    byte_size: data['byte_size'],
    content_type: data['content_type']
  }

  ActiveStorage::Blob.create_before_direct_upload!(params)
end
```
`#create_before_direct_upload!` allows us to create the record without uploading the file to S3, since it already exists there. Tricking Active Storage, and all.
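The blob alone isn't attached to anything yet. To finish the wiring, a worker can attach the existing blob to its record. Here's a sketch assuming a hypothetical `Page` model with `has_one_attached :file`:

```ruby
# Attaching an existing blob creates the active_storage_attachments row
# without re-uploading anything, since the file is already on S3
blob = create_blob(data)
page = Page.find(data['page_id'])  # `page_id` is an assumed field in the payload
page.file.attach(blob)
```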
And with that, Active Storage works just as it normally would. With our implementation, however, we process all variants in parallel, improving load times and user experience for everyone!