DEV Community: Syed Afroz Pasha

Multipart Upload for Large Files using Pre-Signed URLs - AWS

Syed Afroz Pasha — Tue, 15 Dec 2020 21:32:13 +0000

It’s mind-blowing how fast data is growing. It is now possible to collect raw data with a frequency of more than a million requests per second. Storage is quicker and cheaper. It is normal to store data practically forever, even if it is rarely accessed.

Users of Traindex can upload large data files to create a semantic search index. This article will explain how we implemented the multipart upload feature that allows Traindex users to upload large files.

Problems and their Solutions

We wanted to allow users of Traindex to upload large files, typically 1-2 TB, to Amazon S3 in minimum time and with appropriate access controls.

In this article, I will discuss how to set up pre-signed URLs for the secure upload of files. Pre-signed URLs let us grant temporary access to objects in AWS S3 buckets without the client needing AWS credentials or permissions of their own.

So how do you go from a 5GB limit to a 5TB limit in uploading to AWS S3? Using multipart uploads, AWS S3 allows users to upload a file split into up to 10,000 parts. Each part may be from 5MB to 5GB in size; only the last part may be smaller than 5MB.
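To make the arithmetic behind these limits concrete, here is a minimal sketch of how a part size could be picked for a given file size. The function name plan_parts and its preferred_part_size parameter are hypothetical helpers for illustration, not part of the Traindex tool or of boto3.

```python
import math

MIN_PART_SIZE = 5 * 1024 ** 2   # 5 MB, smallest allowed part (except the last one)
MAX_PART_SIZE = 5 * 1024 ** 3   # 5 GB, largest allowed part
MAX_PARTS = 10_000              # maximum number of parts in one multipart upload


def plan_parts(object_size, preferred_part_size=MIN_PART_SIZE):
    """Return (part_size, part_count) that stays within the multipart limits."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    # Grow the part size when the preferred size would need more than 10,000 parts.
    part_size = max(preferred_part_size, MIN_PART_SIZE,
                    math.ceil(object_size / MAX_PARTS))
    if part_size > MAX_PART_SIZE:
        raise ValueError("object does not fit into 10,000 parts of at most 5 GB")
    part_count = math.ceil(object_size / part_size)
    return part_size, part_count
```

For a 1 TB file, for example, 5 MB parts would require far more than 10,000 parts, so the sketch raises the part size until the whole file fits.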

The table below shows the upload service limits for S3.

Apart from the size limitations, it is better to keep S3 buckets private and only grant public access when required. We wanted to give the client access to an object without changing the bucket ACL, creating roles, or creating a user on our account. We ended up using S3 pre-signed URLs.

What will you learn?

For a standard multipart upload to work with pre-signed URLs, we need to:

Initiate a multipart upload
Create pre-signed URLs for each part
Upload the parts of the object
Complete multipart upload

Prerequisites

You have to make sure that your command-line environment is configured so that credentials do not have to be supplied at the time of each operation. Steps 1, 2, and 4 stated above are server-side stages. They need an AWS access key ID and a secret access key. Step 3 is a client-side operation for which the pre-signed URLs are being set up, and hence no credentials are needed.

If you have not configured your environment to perform server-side operations, then you must complete it first by following these steps:

Download AWS-CLI from this link according to your OS and install it. To configure your AWS-CLI, you need to use the command aws configure and provide the details it requires, as shown below.

$ aws configure

AWS Access Key ID [None]: EXAMPLEFODNN7EXAMPLE
AWS Secret Access Key [None]: eXaMPlEtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: xx-xxxx-x
Default output format [None]: json

Implementation

1. Initiate a Multipart Upload

At this stage, we request AWS S3 to initiate a multipart upload. In response, we get the UploadId, which associates each part with the object being created.

import boto3

s3 = boto3.client('s3')

bucket = "[XYZ]"
key = "[ABC.pqr]"

response = s3.create_multipart_upload(
    Bucket=bucket, 
    Key=key
)

upload_id = response['UploadId']

Executing this chunk of code after setting up the bucket name and key, we get the UploadId for the file we want to upload. It will be required later to combine all the parts.

2. Create pre-signed URLs for each part

The parts can now be uploaded via a PUT request. As explained earlier, we are using a pre-signed URL to provide a secure way to upload and grant access to an object without changing the bucket ACL, creating roles, or providing a user on your account. The server, which holds the credentials, generates one URL for each part of the file, and the client uses that URL to reach S3. The following call generates the URL for a single part, where part_no is the number of the part (starting at 1):

signed_url = s3.generate_presigned_url(
    ClientMethod ='upload_part',
    Params = {
       'Bucket': bucket,
       'Key': key, 
       'UploadId': upload_id, 
       'PartNumber': part_no
    }
)

As described above, this particular step is a server-side stage and hence demands a preconfigured AWS environment. The pre-signed URLs for each of the parts can now be handed over to the client, who can simply upload the individual parts without direct access to S3. It means that the service provider no longer has to worry about ACLs and permission changes.
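Since one URL is needed per part, the call is usually wrapped in a loop. The following is a minimal sketch under a few assumptions: presign_part_urls is a hypothetical helper name, s3_client is expected to be a boto3 S3 client such as the s3 object created in step 1, and the 900-second expiry is only an illustrative value.

```python
def presign_part_urls(s3_client, bucket, key, upload_id, part_count, expires_in=900):
    """Return a dict mapping each part number (1..part_count) to its pre-signed URL."""
    urls = {}
    for part_no in range(1, part_count + 1):  # S3 part numbers start at 1
        urls[part_no] = s3_client.generate_presigned_url(
            ClientMethod="upload_part",
            Params={
                "Bucket": bucket,
                "Key": key,
                "UploadId": upload_id,
                "PartNumber": part_no,
            },
            ExpiresIn=expires_in,  # lifetime of the URL in seconds
        )
    return urls
```

Keeping the part number as the dictionary key makes it easy for the client to send each chunk to the URL that belongs to it.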

3. Upload the parts of the object

This step is the only client-side stage of the process. With boto3, a pre-signed URL expires after one hour (3600 seconds) by default, and whoever generates it can change the value through the ExpiresIn argument. Usually, it is kept as short as possible for security reasons.

The client reads a part of the object, i.e., file_data, and uploads that chunk of data to the pre-signed URL generated for its part number. Each chunk must go to the URL that carries its own part number, because S3 assembles the final object in ascending part-number order; a chunk uploaded under the wrong part number ends up in the wrong place, and the upload results in a corrupted file. For that reason, a list, i.e., parts, must be maintained to store the unique identifier, i.e., the ETag, of every part together with its part number.

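import requests

# PUT the raw bytes of this part to its pre-signed URL
response = requests.put(signed_url, data=file_data)
response.raise_for_status()

# S3 returns the ETag of the uploaded part in the response headers
etag = response.headers['ETag']

# parts is a list created once before the first upload (parts = [])
parts.append({'ETag': etag, 'PartNumber': part_no})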

As far as the size of the data is concerned, the chunk size can either be declared in bytes or calculated by dividing the object’s total size by the number of parts. Look at the example code below:

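import math

max_size = 5 * 1024 * 1024    # Approach 1: Assign the size (5 MB)

max_size = math.ceil(object_size / no_of_parts)    # Approach 2: Calculate the size (must be an integer)

with open(fileLocation, 'rb') as f:    # binary mode, so the chunk is raw bytes
    file_data = f.read(max_size)    # reads the first chunk only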
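The snippet above reads only one chunk. To walk through the whole file, the read can be repeated until nothing is left. Below is a small sketch of that idea; iter_parts is a hypothetical helper name, and the part numbering from 1 follows the S3 convention.

```python
def iter_parts(path, part_size):
    """Yield (part_number, chunk_bytes) pairs, reading the file sequentially."""
    with open(path, "rb") as f:  # binary mode: parts are raw bytes
        part_no = 1
        while True:
            chunk = f.read(part_size)
            if not chunk:  # empty bytes means end of file
                break
            yield part_no, chunk
            part_no += 1
```

Each yielded pair can then be sent to the pre-signed URL of the same part number, so only one chunk is held in memory at a time.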

4. Complete Multipart Upload

Before this step, check the data’s chunks and the details uploaded to the bucket. Now, we need to merge all the partial files into one. The list parts (discussed in step 3) is passed as an argument so that S3 can match every chunk to its part number and ETag, which keeps the object from being corrupted.

You can refer to the code below to complete the multipart uploading process.

response = s3.complete_multipart_upload(
    Bucket = bucket,
    Key = key,
    MultipartUpload = {'Parts': parts},
    UploadId= upload_id
)
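S3 expects the parts list in ascending order of part number when the upload is completed. If the parts were collected out of order (for example, by parallel uploads), they can be sorted first. A minimal sketch, where ordered_parts is a hypothetical helper name:

```python
def ordered_parts(parts):
    """Return the parts sorted by PartNumber, rejecting duplicate part numbers."""
    numbers = [p["PartNumber"] for p in parts]
    if len(numbers) != len(set(numbers)):
        raise ValueError("duplicate part numbers in parts list")
    return sorted(parts, key=lambda p: p["PartNumber"])
```

The result can be passed as MultipartUpload={'Parts': ordered_parts(parts)} in the call above.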

5. Additional step

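Parts that have already been uploaded keep incurring storage charges until the multipart upload is either completed or aborted. To avoid extra charges and to clean up your S3 bucket, a multipart upload can be stopped on request. In case anything seems suspicious and one wants to abort the process, they can use the following code: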

response = s3.abort_multipart_upload(
    Bucket = bucket,
    Key = key,
    UploadId = upload_id
)
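One way to make sure the abort actually happens is to wrap the upload in a try/except. This is only a sketch under assumptions: run_with_abort and upload_fn are hypothetical names, and s3_client is expected to be a boto3 S3 client like the s3 object used above.

```python
def run_with_abort(s3_client, bucket, key, upload_id, upload_fn):
    """Run upload_fn(); if it raises, abort the multipart upload and re-raise."""
    try:
        return upload_fn()
    except Exception:
        # Discard the already uploaded parts so they stop incurring charges.
        s3_client.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
```

Here upload_fn stands for whatever code uploads the parts and completes the upload; the original error is re-raised after the cleanup so the caller still sees it.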

In this article, we discussed how to implement multipart uploading in a secure way using pre-signed URLs. The suggested solution is to make a CLI tool for uploading large files, which saves time and resources and provides flexibility to the users. It is a cheap and efficient solution for users who need to do this frequently.

Benchmarking of Textual Models - Jaccard Similarity

Syed Afroz Pasha — Mon, 28 Sep 2020 21:18:03 +0000

I recently concluded my internship with the Data Science team at Traindex. One of the tasks assigned to me was replicating the ML-based semantic search technique. The Data Science team had implemented this on their Traindex search API.

Traindex uses document similarity techniques like LSI and Doc2Vec to train a model that can identify the best matching documents for a given document, paragraph, or phrase.

One of the most challenging tasks is the benchmarking of language models. The process sometimes requires using a series of techniques for proper evaluation and testing.
One of the benchmarks used for Traindex is Jaccard Similarity. It provides a baseline but is not enough on its own for a complete evaluation of any model.

This article intends to give some background on the where and how of the Jaccard Similarity score. It is useful for anyone creating benchmarks to measure the performance of their language models.

Text Similarity

Text similarity can help us determine the similarity between pairs of documents, or a specific document and a set of other documents. The score calculated by performing the similarity check decides model acceptance, improvement, or rejection. The categorization of string-based text similarity shows various approaches that fit according to the scenario.

There are two main alternatives for finding the similarity metric.

The character-based approach deals with the individual characters present in the document, in their proper sequence.
The term-based approach deals with whole words. The words are often simplified or lemmatized before performing the test, following the same data cleaning process that was used for training.

Introduction - Jaccard Index

To compare two observations made up of a finite number of elements, it is common practice to count the items that are common to both sets. This is a natural fit for comparing posts by their representative tags, for example to measure how similar two articles are in terms of tags.

The Jaccard similarity index, also called the Jaccard similarity coefficient, compares the members of two sets to see which members are shared and which are distinct. It is a measure of similarity for the two sets of data, with a range from 0% to 100%. The higher the percentage, the more similar the two populations. Although it is easy to interpret, it is sensitive to small sample sizes and may give erroneous results, especially with very small samples or data sets with missing observations.

Let's have a look at this example.

from gensim.matutils import jaccard

print(jaccard(bow_water, bow_bank))
Out: 0.8571428571428572


print(jaccard(doc_water, doc_bank))
Out: 0.8333333333333334

print(jaccard(['word'], ['word']))
Out: 0.0

The three code examples above feature two different input methods.

In the first case, we pass Jaccard document vectors that are already in the bag of words format. Note that gensim's jaccard returns a distance, not a similarity: it is defined as one minus the size of the intersection over the size of the union of the vectors. Here the distance is high, which means the two documents overlap very little.

The last two examples illustrate the ability of Jaccard to accept plain lists (i.e. documents) as inputs. In the last case, because the two lists are the same, the value returned is 0 - this means the distance is 0 and the two documents are identical.

Mathematical Representation

Mathematically, it is the ratio of the intersection of two sets to their union. In the case of textual documents or phrases, it counts the words common to both and divides that count by the total number of distinct words present in either document. The same technique extends to more than two documents.

From the Venn diagram above, it can be concluded that:

Jaccard Index = (number in both sets) / (number in either set) * 100
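As a minimal illustration of this formula, the ratio can be computed with plain Python sets. The function name jaccard_index is a hypothetical helper, and returning 1.0 for two empty inputs is a convention chosen for this sketch only. It returns a value between 0 and 1; multiply by 100 to get the percentage.

```python
def jaccard_index(a, b):
    """Jaccard similarity of two collections: |A and B| / |A or B|."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(set_a & set_b) / len(union)


# Two of the four distinct words are shared, so the index is 2 / 4.
score = jaccard_index("a b c".split(), "b c d".split())
```

Unlike the gensim function shown earlier, this returns the similarity itself, so identical inputs give 1.0 rather than a distance of 0.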

Benchmarking is an essential process of Data Science that indicates the confidence level and effectiveness of a model. Traindex performs multiple benchmarking techniques, of which Jaccard mean similarity testing is an important one.

A test data set is used to find the mean similarity score for the trained model. It is sampled from the main dataset, with the validation fraction given as a parameter in the config file. Leaving every other feature aside, the UCID, i.e., the unique document ID, and the whole text are the two crucial columns that are necessary to perform the testing.

The following steps test the LSI model performance on the basis of term-based similarity:

The test data is built from the UCIDs and their respective text, sampled according to the size specified in the configuration file.
The text column of the test data is cleaned as required; the removal of duplicate terms is compulsory. For each document, a query is made using the text corresponding to its UCID.
The document and the index are sent to a function that, with the help of the indices, returns the number of words common to the query and the document.
The common words and the total observations are counted, and the Jaccard similarity is then calculated as the ratio of the two counts saved earlier.
Following the above steps produces results adequate for the evaluation, but the results are additionally scaled by dividing each score by the respective query size. This extra step removes the bias in the results caused by differing query sizes.

At last, the mean similarity score is calculated by averaging all of the combined results.
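The core of the steps above can be sketched roughly as follows. This is not the Traindex implementation: tokenize and mean_jaccard are hypothetical names, the whitespace tokenizer stands in for the real cleaning step, each pair is assumed to hold a query text and the text of the document returned for it, and the scaling by query size is left out.

```python
def tokenize(text):
    """Lower-case and split on whitespace; the set removes duplicate terms."""
    return set(text.lower().split())


def mean_jaccard(pairs):
    """Average Jaccard similarity over (query_text, result_text) pairs."""
    scores = []
    for query_text, result_text in pairs:
        query_terms, result_terms = tokenize(query_text), tokenize(result_text)
        union = query_terms | result_terms
        # Common words divided by total distinct words; 0.0 if both are empty.
        scores.append(len(query_terms & result_terms) / len(union) if union else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# One perfect match and one complete miss average out to 0.5.
example_mean = mean_jaccard([("solar panel", "solar panel"), ("solar panel", "wind turbine")])
```

The per-pair scores collected in scores are the values one would plot per UCID before averaging.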

Optionally, a data frame is populated with the similarity score for each result. The values are plotted on a bar chart, which clearly shows the percentage similarity for each particular UCID, as shown in the sample figure.

Impact of Jaccard Similarity as an Evaluation Metric

The Jaccard similarity score may not be the best solution for benchmarking, but it may be considered for the following advantages:

Jaccard similarity takes a unique set of words for each sentence. This means that if you repeat any word in the sentence several times, Jaccard's similarity remains unchanged. Consider these two sentences:

Sentence 1: AI is our friend, and it has been friendly.

Sentence 2: AI and humans have always been friendly.

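It is robust enough to cater for repetitions of a word, like "friend" and
"friendly" in sentence 1, which count as one term once the words are
lemmatized. No matter how frequent any word is, Jaccard's similarity will
be the same - here it is 0.5 after lemmatization (5 shared terms out of 10).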

Instead of rejecting or accepting the document for a given set of queries, it provides a numerical score ranging from 0 to 1, which gives a clear view and can be used as a reference for further iterations. Figure 3 shows the closeness of random search queries to the same document, where the height of the bars represents the Jaccard similarity scores between the queries and the document.
Jaccard similarity is useful for cases where duplication does not matter. For example, it will be better to use Jaccard similarity for two product descriptions as the repetition of a word does not reduce their similarity.
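The score of 0.5 for the two example sentences depends on lemmatization. The sketch below checks this with a toy lemma map; LEMMAS, terms, and jaccard are hypothetical names for illustration, and a real pipeline would use a proper lemmatizer instead of a hand-written dictionary.

```python
import re

# Toy lemma map (assumption): treat these word forms as the same term.
LEMMAS = {"friendly": "friend", "have": "has"}


def terms(sentence, lemmas=None):
    """Return the set of lower-cased words, optionally mapped through lemmas."""
    lemmas = lemmas or {}
    return {lemmas.get(w, w) for w in re.findall(r"[a-z]+", sentence.lower())}


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


s1 = "AI is our friend, and it has been friendly."
s2 = "AI and humans have always been friendly."

raw = jaccard(terms(s1), terms(s2))                         # 4 shared / 12 distinct
lemmatized = jaccard(terms(s1, LEMMAS), terms(s2, LEMMAS))  # 5 shared / 10 distinct
```

Without the lemma map the score is about 0.33; with it, "friend"/"friendly" and "has"/"have" collapse into single terms and the score becomes 0.5.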

In a nutshell, benchmarking a set of models is a necessary step in the data science process. Developing tools and approaches for calculating a metric that can differentiate between products is important. Compared with other domains of Machine Learning, language and contextual models are difficult to evaluate. Since few benchmarking techniques exist for textual models, the Jaccard similarity score is an important performance evaluation metric that can provide a partial summary of the model. The effectiveness of any benchmarking technique depends on how well it fits the problem, and a mismatch between the metric and the use case can make the results misleading.