
Utility Scripts to Expedite Machine Learning (ML) Adoption Leveraging AWS ML Service APIs

This article presents a number of common machine learning features and capabilities that can be integrated into applications, data pipelines, business intelligence tools, and many other projects, depending on your particular use case, by leveraging utility scripts that invoke the Application Programming Interfaces (APIs) of managed ML services provided by AWS.

Prerequisites:

  • AWS credentials: configure AWS access keys via the Command Line Interface (CLI). Environment variables can also be used to provide secure, persistent access to your AWS account.
  • AWS Software Development Kit (SDK) for Python (Boto3): pip install boto3

Utility scripts can be very useful when you want to test and develop a specific feature of a program without necessarily having a full-stack application built yet. Some utility scripts for performing cloud operations can be found in this GitHub repository, which I compiled for both Python and Bash scripts:

Below are some utility scripts to implement specific machine learning features that you can use in your projects.

A. Language Translation Utility

Use Case: Translate text files or strings into various languages.

AWS Service: Amazon Translate

Implementation:
Automate the translation of text or files into target languages.

Utility Script:

import boto3

translate = boto3.client('translate')

text = "Utility Scripts to Expedite Machine Learning (ML) Adoption Leveraging AWS ML Service APIs."
response = translate.translate_text(
    Text=text,
    SourceLanguageCode='en',
    TargetLanguageCode='fr'
)
print(response['TranslatedText'])  
# Output: 'Scripts utilitaires pour accélérer l'adoption de l'apprentissage automatique (ML) en tirant parti des API de service AWS ML.'

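If you need to translate several strings at once, the same call can be wrapped in a small helper. The sketch below is my own illustration (the `translate_batch` name and its parameters are not part of Boto3); it takes the client as an argument so it can be reused across regions or stubbed out in tests.

```python
def translate_batch(client, texts, source='en', target='fr'):
    """Translate a list of strings with Amazon Translate.

    `client` is expected to be a boto3 Translate client,
    e.g. boto3.client('translate').
    """
    results = []
    for text in texts:
        response = client.translate_text(
            Text=text,
            SourceLanguageCode=source,
            TargetLanguageCode=target
        )
        results.append(response['TranslatedText'])
    return results
```

Passing the client in, rather than creating it inside the function, keeps the helper easy to test without touching AWS at all.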

B. Text/Data Extraction Utility

Use Case: Extract key data from invoices (Example: invoice numbers, totals) and classify them into categories for financial processing.

AWS Service: Amazon Textract is a machine learning service offered by AWS that reduces manual effort by automating the extraction of data such as forms, tables, and text from scanned documents, making it easy to derive important information from different sources. I wrote an article that explores this service in detail. Check it out here:

Utility Script:

import boto3
import os
import csv
import re

s3 = boto3.client('s3')
textract = boto3.client('textract')

def upload_files_to_s3(folder, bucket_name):
    for root, _, files in os.walk(folder):
        for file in files:
            if file.endswith(('.pdf', '.jpg', '.png')):
                filepath = os.path.join(root, file)
                s3.upload_file(filepath, bucket_name, file)

def extract_text(bucket_name, document_name):
    response = textract.analyze_document(
        Document={'S3Object': {'Bucket': bucket_name, 'Name': document_name}},
        FeatureTypes=['FORMS']
    )
    return response

def parse_invoice_data(textract_response):
    # Map block IDs to blocks so relationships can be resolved
    block_map = {b['Id']: b for b in textract_response['Blocks']}

    def get_text(block):
        # Collect the WORD children referenced by a block's CHILD relationships
        words = []
        for rel in block.get('Relationships', []):
            if rel['Type'] == 'CHILD':
                for child_id in rel['Ids']:
                    child = block_map[child_id]
                    if child['BlockType'] == 'WORD':
                        words.append(child['Text'])
        return ' '.join(words).strip()

    data = {}
    for block in textract_response['Blocks']:
        if block['BlockType'] == 'KEY_VALUE_SET' and 'KEY' in block.get('EntityTypes', []):
            key = get_text(block)
            if key in ['Invoice Number', 'Total', 'Date']:
                # Follow the VALUE relationship to the paired value block
                for rel in block.get('Relationships', []):
                    if rel['Type'] == 'VALUE':
                        for value_id in rel['Ids']:
                            data[key] = get_text(block_map[value_id])
    return data

def save_to_csv(data, csv_file):
    with open(csv_file, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Invoice Number', 'Date', 'Total'])
        for item in data:
            writer.writerow([item.get('Invoice Number', ''),
                             item.get('Date', ''),
                             item.get('Total', '')])

# Main function
def main():
    folder_path = './invoices'
    bucket_name = 'your-invoice-bucket'
    csv_file = 'invoice_data_output.csv'

    # Upload files to S3
    upload_files_to_s3(folder_path, bucket_name)

    # Process each file in the S3 bucket
    processed_data = []
    for obj in s3.list_objects_v2(Bucket=bucket_name).get('Contents', []):
        file_name = obj['Key']
        print(f"Processing file: {file_name}")

        # Extract text using Textract
        textract_response = extract_text(bucket_name, file_name)

        # Parse relevant data
        parsed_data = parse_invoice_data(textract_response)
        if parsed_data:
            processed_data.append(parsed_data)

    # Save extracted data to CSV
    save_to_csv(processed_data, csv_file)
    print(f"Processed data saved to {csv_file}")

if __name__ == "__main__":
    main()

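The key-value blocks returned by `analyze_document` reference each other by ID, so parsing a FORMS response really means resolving those relationships. The self-contained sketch below builds a tiny mock of the relevant parts of a Textract response (the block IDs and text are invented for illustration) and resolves one key-value pair by following the CHILD and VALUE relationships:

```python
def resolve_text(block, block_map):
    # Gather the WORD children a block points at via CHILD relationships
    words = []
    for rel in block.get('Relationships', []):
        if rel['Type'] == 'CHILD':
            words += [block_map[i]['Text'] for i in rel['Ids']
                      if block_map[i]['BlockType'] == 'WORD']
    return ' '.join(words)

# Invented mock of the relevant parts of an analyze_document response
mock_blocks = [
    {'Id': 'k1', 'BlockType': 'KEY_VALUE_SET', 'EntityTypes': ['KEY'],
     'Relationships': [{'Type': 'CHILD', 'Ids': ['w1', 'w2']},
                       {'Type': 'VALUE', 'Ids': ['v1']}]},
    {'Id': 'v1', 'BlockType': 'KEY_VALUE_SET', 'EntityTypes': ['VALUE'],
     'Relationships': [{'Type': 'CHILD', 'Ids': ['w3']}]},
    {'Id': 'w1', 'BlockType': 'WORD', 'Text': 'Invoice'},
    {'Id': 'w2', 'BlockType': 'WORD', 'Text': 'Number'},
    {'Id': 'w3', 'BlockType': 'WORD', 'Text': 'INV-001'},
]
block_map = {b['Id']: b for b in mock_blocks}

pairs = {}
for block in mock_blocks:
    if block['BlockType'] == 'KEY_VALUE_SET' and 'KEY' in block.get('EntityTypes', []):
        key = resolve_text(block, block_map)
        for rel in block.get('Relationships', []):
            if rel['Type'] == 'VALUE':
                pairs[key] = resolve_text(block_map[rel['Ids'][0]], block_map)

print(pairs)  # {'Invoice Number': 'INV-001'}
```

Working against a mock like this is also a convenient way to test your parsing logic without uploading documents or calling Textract at all.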

C. Text Classification and Analysis Utility

Use Case: Analyze text data for sentiment and language detection.

AWS Service: Amazon Comprehend

Utility Script:

import boto3

comprehend = boto3.client('comprehend')

text = "Education is necessary for development!"
response = comprehend.detect_sentiment(Text=text, LanguageCode='en')
print(response['Sentiment'])  

# Output: 'POSITIVE'

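Besides the top-level `Sentiment` label, `detect_sentiment` also returns a `SentimentScore` dictionary with a confidence value per class (`Positive`, `Negative`, `Neutral`, `Mixed`). A small helper, sketched below with an invented function name and example scores, can recover the dominant label from those scores, which is handy when you want to apply your own confidence threshold:

```python
def dominant_sentiment(scores, threshold=0.5):
    """Return the highest-scoring sentiment class, or 'UNCERTAIN'
    if no class clears the confidence threshold.

    `scores` mirrors the SentimentScore dict returned by
    comprehend.detect_sentiment (keys like 'Positive', 'Negative', ...).
    """
    label, confidence = max(scores.items(), key=lambda kv: kv[1])
    return label.upper() if confidence >= threshold else 'UNCERTAIN'

# Invented example scores for illustration
example = {'Positive': 0.93, 'Negative': 0.01, 'Neutral': 0.05, 'Mixed': 0.01}
print(dominant_sentiment(example))  # POSITIVE
```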

D. Image Classification and Processing Utility

Use Case: Automate image analysis for object detection and facial recognition.

AWS Service: Amazon Rekognition

Utility Script:

import boto3
import os
import csv

s3 = boto3.client('s3')
rekognition = boto3.client('rekognition')

def upload_to_s3(folder, bucket_name):
    for root, _, files in os.walk(folder):
        for file in files:
            if file.endswith(('.png', '.jpg', '.jpeg')):
                filepath = os.path.join(root, file)
                s3.upload_file(filepath, bucket_name, file)

def label_images(bucket_name, output_csv):
    with open(output_csv, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Image', 'Labels'])

        for obj in s3.list_objects_v2(Bucket=bucket_name).get('Contents', []):
            response = rekognition.detect_labels(
                Image={'S3Object': {'Bucket': bucket_name, 'Name': obj['Key']}},
                MaxLabels=10
            )
            labels = [label['Name'] for label in response['Labels']]
            writer.writerow([obj['Key'], ', '.join(labels)])

folder_path = './images'
bucket_name = 'your-bucket-name'
output_csv = 'image_labels.csv'

upload_to_s3(folder_path, bucket_name)
label_images(bucket_name, output_csv)
print(f"Labels stored in {output_csv}")

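`detect_labels` returns a confidence score alongside each label, and in practice you usually want to keep only the confident ones. The helper below is a hypothetical sketch (the function name and example response are my own); it filters the `Labels` list from a Rekognition response by a minimum confidence:

```python
def confident_labels(response, min_confidence=80.0):
    """Keep only label names whose Confidence meets the threshold.

    `response` mirrors the dict returned by rekognition.detect_labels,
    i.e. it has a 'Labels' list of {'Name': ..., 'Confidence': ...}.
    """
    return [label['Name']
            for label in response.get('Labels', [])
            if label.get('Confidence', 0.0) >= min_confidence]

# Invented example response for illustration
example = {'Labels': [{'Name': 'Dog', 'Confidence': 97.2},
                      {'Name': 'Pet', 'Confidence': 95.1},
                      {'Name': 'Sofa', 'Confidence': 41.8}]}
print(confident_labels(example))  # ['Dog', 'Pet']
```

Note that `detect_labels` also accepts a `MinConfidence` parameter, so filtering server-side is an option too; a local filter like this is useful when you want to keep the full response and apply different thresholds afterwards.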

The utility scripts in this article are by no means exhaustive! Consider them a starting point for exploring the different managed machine learning services offered by AWS that you can integrate into your applications and projects. You can go through the official documentation for Boto3:

to explore more code snippets that you can customise and tailor to your unique use cases for implementing machine learning capabilities by leveraging AWS managed ML services.
