
Utility Scripts to Expedite Machine Learning (ML) Adoption Leveraging AWS ML Service APIs

This article presents a number of common machine learning features and capabilities that can be integrated into applications, data pipelines, business intelligence tools, and many other projects, depending on your particular use case, by leveraging utility scripts that invoke the Application Programming Interfaces (APIs) of managed ML services provided by AWS.

Prerequisites:

  • AWS credentials: configure AWS access keys via the Command Line Interface (CLI). Environment variables can also be used to provide secure, persistent access to your AWS account.
  • AWS Software Development Kit (SDK) for Python (Boto3): pip install boto3

Utility scripts can be very useful when you want to test and develop a specific feature of a program without necessarily having a full-stack application built yet. Some utility scripts for performing cloud operations can be found in this GitHub repository, which I compiled for both Python and Bash scripts:

Below are some utility scripts to implement specific machine learning features that you can use in your projects.

A. Language Translation Utility

Use Case: Translate text files or strings into various languages.

AWS Service: Amazon Translate

Implementation:
Automate the translation of text or files into target languages.

Utility Script:

import boto3

translate = boto3.client('translate')

text = "Utility Scripts to Expedite Machine Learning (ML) Adoption Leveraging AWS ML Service APIs."
response = translate.translate_text(
    Text=text,
    SourceLanguageCode='en',
    TargetLanguageCode='fr'
)
print(response['TranslatedText'])  
# Output: 'Scripts utilitaires pour accélérer l'adoption de l'apprentissage automatique (ML) en tirant parti des API de service AWS ML.'

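If you need to translate several strings at once, the same call can be wrapped in a small helper. The sketch below is my own illustration (the `translate_batch` name and its parameters are not part of Boto3); it takes the client as an argument so it can be reused across regions or stubbed out in tests.

```python
def translate_batch(client, texts, source='en', target='fr'):
    """Translate a list of strings with Amazon Translate.

    `client` is expected to be a boto3 Translate client,
    e.g. boto3.client('translate').
    """
    results = []
    for text in texts:
        response = client.translate_text(
            Text=text,
            SourceLanguageCode=source,
            TargetLanguageCode=target
        )
        results.append(response['TranslatedText'])
    return results
```

Passing the client in, rather than creating it inside the function, keeps the helper easy to test without touching AWS at all.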

B. Text/Data Extraction Utility

Use Case: Extract key data from invoices (Example: invoice numbers, totals) and classify them into categories for financial processing.

AWS Service: Amazon Textract is a machine learning service offered by AWS that reduces manual effort by automating the extraction of data such as forms, tables, and text from scanned documents, making it easy to derive important information from different sources. I wrote an article that explores this service in detail. Check it out here:

Utility Script:

import boto3
import os
import csv
import re

s3 = boto3.client('s3')
textract = boto3.client('textract')

def upload_files_to_s3(folder, bucket_name):
    for root, _, files in os.walk(folder):
        for file in files:
            if file.endswith(('.pdf', '.jpg', '.png')):
                filepath = os.path.join(root, file)
                s3.upload_file(filepath, bucket_name, file)

def extract_text(bucket_name, document_name):
    response = textract.analyze_document(
        Document={'S3Object': {'Bucket': bucket_name, 'Name': document_name}},
        FeatureTypes=['FORMS']
    )
    return response

def parse_invoice_data(textract_response):
    # Map block IDs to blocks so relationships can be resolved
    block_map = {b['Id']: b for b in textract_response['Blocks']}

    def get_text(block):
        # Collect the WORD children referenced by a block's CHILD relationships
        words = []
        for rel in block.get('Relationships', []):
            if rel['Type'] == 'CHILD':
                for child_id in rel['Ids']:
                    child = block_map[child_id]
                    if child['BlockType'] == 'WORD':
                        words.append(child['Text'])
        return ' '.join(words).strip()

    data = {}
    for block in textract_response['Blocks']:
        if block['BlockType'] == 'KEY_VALUE_SET' and 'KEY' in block.get('EntityTypes', []):
            key = get_text(block)
            if key in ['Invoice Number', 'Total', 'Date']:
                # Follow the VALUE relationship to the paired value block
                for rel in block.get('Relationships', []):
                    if rel['Type'] == 'VALUE':
                        for value_id in rel['Ids']:
                            data[key] = get_text(block_map[value_id])
    return data

def save_to_csv(data, csv_file):
    with open(csv_file, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Invoice Number', 'Date', 'Total'])
        for item in data:
            writer.writerow([item.get('Invoice Number', ''),
                             item.get('Date', ''),
                             item.get('Total', '')])

# Main function
def main():
    folder_path = './invoices'
    bucket_name = 'your-invoice-bucket'
    csv_file = 'invoice_data_output.csv'

    # Upload files to S3
    upload_files_to_s3(folder_path, bucket_name)

    # Process each file in the S3 bucket
    processed_data = []
    for obj in s3.list_objects_v2(Bucket=bucket_name).get('Contents', []):
        file_name = obj['Key']
        print(f"Processing file: {file_name}")

        # Extract text using Textract
        textract_response = extract_text(bucket_name, file_name)

        # Parse relevant data
        parsed_data = parse_invoice_data(textract_response)
        if parsed_data:
            processed_data.append(parsed_data)

    # Save extracted data to CSV
    save_to_csv(processed_data, csv_file)
    print(f"Processed data saved to {csv_file}")

if __name__ == "__main__":
    main()

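The key-value blocks returned by `analyze_document` reference each other by ID, so parsing a FORMS response really means resolving those relationships. The self-contained sketch below builds a tiny mock of the relevant parts of a Textract response (the block IDs and text are invented for illustration) and resolves one key-value pair by following the CHILD and VALUE relationships:

```python
def resolve_text(block, block_map):
    # Gather the WORD children a block points at via CHILD relationships
    words = []
    for rel in block.get('Relationships', []):
        if rel['Type'] == 'CHILD':
            words += [block_map[i]['Text'] for i in rel['Ids']
                      if block_map[i]['BlockType'] == 'WORD']
    return ' '.join(words)

# Invented mock of the relevant parts of an analyze_document response
mock_blocks = [
    {'Id': 'k1', 'BlockType': 'KEY_VALUE_SET', 'EntityTypes': ['KEY'],
     'Relationships': [{'Type': 'CHILD', 'Ids': ['w1', 'w2']},
                       {'Type': 'VALUE', 'Ids': ['v1']}]},
    {'Id': 'v1', 'BlockType': 'KEY_VALUE_SET', 'EntityTypes': ['VALUE'],
     'Relationships': [{'Type': 'CHILD', 'Ids': ['w3']}]},
    {'Id': 'w1', 'BlockType': 'WORD', 'Text': 'Invoice'},
    {'Id': 'w2', 'BlockType': 'WORD', 'Text': 'Number'},
    {'Id': 'w3', 'BlockType': 'WORD', 'Text': 'INV-001'},
]
block_map = {b['Id']: b for b in mock_blocks}

pairs = {}
for block in mock_blocks:
    if block['BlockType'] == 'KEY_VALUE_SET' and 'KEY' in block.get('EntityTypes', []):
        key = resolve_text(block, block_map)
        for rel in block.get('Relationships', []):
            if rel['Type'] == 'VALUE':
                pairs[key] = resolve_text(block_map[rel['Ids'][0]], block_map)

print(pairs)  # {'Invoice Number': 'INV-001'}
```

Working against a mock like this is also a convenient way to test your parsing logic without uploading documents or calling Textract at all.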

C. Text Classification and Analysis Utility

Use Case: Analyze text data for sentiment and language detection.

AWS Service: Amazon Comprehend

Utility Script:

import boto3

comprehend = boto3.client('comprehend')

text = "Education is necessary for development!"
response = comprehend.detect_sentiment(Text=text, LanguageCode='en')
print(response['Sentiment'])  

# Output: 'POSITIVE'

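Besides the top-level `Sentiment` label, `detect_sentiment` also returns a `SentimentScore` dictionary with a confidence value per class (`Positive`, `Negative`, `Neutral`, `Mixed`). A small helper, sketched below with an invented function name and example scores, can recover the dominant label from those scores, which is handy when you want to apply your own confidence threshold:

```python
def dominant_sentiment(scores, threshold=0.5):
    """Return the highest-scoring sentiment class, or 'UNCERTAIN'
    if no class clears the confidence threshold.

    `scores` mirrors the SentimentScore dict returned by
    comprehend.detect_sentiment (keys like 'Positive', 'Negative', ...).
    """
    label, confidence = max(scores.items(), key=lambda kv: kv[1])
    return label.upper() if confidence >= threshold else 'UNCERTAIN'

# Invented example scores for illustration
example = {'Positive': 0.93, 'Negative': 0.01, 'Neutral': 0.05, 'Mixed': 0.01}
print(dominant_sentiment(example))  # POSITIVE
```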

D. Image Classification and Processing Utility

Use Case: Automate image analysis for object detection and facial recognition.

AWS Service: Amazon Rekognition

Utility Script:

import boto3
import os
import csv

s3 = boto3.client('s3')
rekognition = boto3.client('rekognition')

def upload_to_s3(folder, bucket_name):
    for root, _, files in os.walk(folder):
        for file in files:
            if file.endswith(('.png', '.jpg', '.jpeg')):
                filepath = os.path.join(root, file)
                s3.upload_file(filepath, bucket_name, file)

def label_images(bucket_name, output_csv):
    with open(output_csv, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Image', 'Labels'])

        for obj in s3.list_objects_v2(Bucket=bucket_name).get('Contents', []):
            response = rekognition.detect_labels(
                Image={'S3Object': {'Bucket': bucket_name, 'Name': obj['Key']}},
                MaxLabels=10
            )
            labels = [label['Name'] for label in response['Labels']]
            writer.writerow([obj['Key'], ', '.join(labels)])

folder_path = './images'
bucket_name = 'your-bucket-name'
output_csv = 'image_labels.csv'

upload_to_s3(folder_path, bucket_name)
label_images(bucket_name, output_csv)
print(f"Labels stored in {output_csv}")

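`detect_labels` returns a confidence score alongside each label, and in practice you usually want to keep only the confident ones. The helper below is a hypothetical sketch (the function name and example response are my own); it filters the `Labels` list from a Rekognition response by a minimum confidence:

```python
def confident_labels(response, min_confidence=80.0):
    """Keep only label names whose Confidence meets the threshold.

    `response` mirrors the dict returned by rekognition.detect_labels,
    i.e. it has a 'Labels' list of {'Name': ..., 'Confidence': ...}.
    """
    return [label['Name']
            for label in response.get('Labels', [])
            if label.get('Confidence', 0.0) >= min_confidence]

# Invented example response for illustration
example = {'Labels': [{'Name': 'Dog', 'Confidence': 97.2},
                      {'Name': 'Pet', 'Confidence': 95.1},
                      {'Name': 'Sofa', 'Confidence': 41.8}]}
print(confident_labels(example))  # ['Dog', 'Pet']
```

Note that `detect_labels` also accepts a `MinConfidence` parameter, so filtering server-side is an option too; a local filter like this is useful when you want to keep the full response and apply different thresholds afterwards.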

The utility scripts in this article are by no means exhaustive! Consider them a starting point for exploring the different managed machine learning services offered by AWS that you can integrate into your applications and projects. You can go through the official documentation for Boto3:

to explore more code snippets that you can customise and tailor to your unique use cases for implementing machine learning capabilities by leveraging AWS managed ML services.
