Streamlining Document Processing with AI: My Experience with AWS Textract 📄🤖

Introduction: “Artificial intelligence is the future, and the future is here.” — Alan Turing 🚀🔮

In this article, I share my experience using artificial intelligence (AI) to extract data from scanned documents with AWS Textract. In response to a company’s request to fix a process issue, I delved into the world of AI and implemented a solution that revolutionized their document processing workflow. 💼💻

The Problem: “The digital transformation is no longer just an option; it’s a necessity.” — Satya Nadella 💡💻

The company received a large volume of documents on a monthly basis, which were scanned and sent in PDF format. Their manual process involved employees from the administrative department spending hours each day typing the information from these documents into the system. This created a significant delay in processing, leading to operational bottlenecks and a growing backlog. ⏳📚

The Solution: “Technology is best when it brings people together.” — Matt Mullenweg 🌐🤝

To address this challenge, I analyzed the company’s existing infrastructure, which consisted of several monolithic Ruby on Rails applications managed within a Docker network. After careful consideration, I decided to use AWS Textract, an AI tool that interprets text from PDFs and images and returns the extracted data in JSON format.

Defining AWS Textract: “Artificial intelligence will be the ultimate extension of human intelligence.” — Max Levchin 🤖🧠

AWS Textract (officially Amazon Textract) is a service provided by Amazon Web Services (AWS) that uses machine learning to extract text and data from scanned documents. It automates the interpretation of various document formats, saving significant time and effort on manual data entry. 💻🔍

Implementation: “Technology is nothing. What’s important is that you have faith in people, that they’re basically good and smart, and if you give them tools, they’ll do wonderful things with them.” — Steve Jobs 💪🔧

I developed a service that fetched the scanned, filled-in PDF forms from an S3 bucket and processed them asynchronously with AWS Textract. The extracted data was then automatically populated into the corresponding fields in the system’s forms. Because the documents varied widely in pattern and format, I built an internal intelligence system capable of interpreting the unique characteristics of each form. 🚀💡

Example code:

require 'aws-sdk-s3'
require 'aws-sdk-textract'

class DocumentController < ApplicationController
  def process_documents
    # Retrieve scanned PDF documents from S3 bucket
    s3_client = Aws::S3::Client.new(region: 'your-region')
    documents = []

    # List objects in the S3 bucket
    resp = s3_client.list_objects_v2(bucket: 'your-bucket')
    resp.contents.each do |object|
      # Check if the object is a PDF file
      if object.key.end_with?('.pdf')
        # Textract reads the PDF directly from S3, so no download is needed here

        # Start an asynchronous Textract text-detection job on the S3 object
        textract_client = Aws::Textract::Client.new(region: 'your-region')
        response = textract_client.start_document_text_detection({
          document_location: {
            s3_object: {
              bucket: 'your-bucket',
              name: object.key
            }
          }
        })

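        # Poll until the asynchronous job finishes (Textract has no built-in SDK waiter)
        job_id = response.job_id
        result = textract_client.get_document_text_detection(job_id: job_id)
        while result.job_status == 'IN_PROGRESS'
          sleep 5
          result = textract_client.get_document_text_detection(job_id: job_id)
        end
        next unless result.job_status == 'SUCCEEDED'

        # Parse the first page of results (follow result.next_token for more)
        form_data = parse_form_data(result)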

        # Store the form data in your Rails application
        Form.create(form_data: form_data)
        documents << object.key
      end
    end

    render json: { message: 'Documents processed successfully', documents: documents }
  end

  private

  def parse_form_data(result)
    # Extract relevant form fields and values from the Textract result
    # ...
    # Return a hash representing the form data
    # ...
  end
end
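
The parse_form_data method above is left as a stub. As a rough illustration only, and not the parser used in the project, the sketch below shows one way its body might look if the forms print their fields as "Label: value" lines. The Block struct is a hypothetical stand-in for the block objects in a Textract response (which expose block_type and text), the helper name parse_label_value_lines is invented for this example, and the sample values are made up.

```ruby
# Hypothetical stand-in for the SDK's block objects (block_type, text, page).
Block = Struct.new(:block_type, :text, :page, keyword_init: true)

# Turn LINE blocks shaped like "Label: value" into a hash of form fields.
# Lines without a usable label and colon are collected under :unmatched.
def parse_label_value_lines(blocks)
  fields = {}
  unmatched = []

  blocks.each do |block|
    # Skip PAGE and WORD blocks; only whole lines are considered here
    next unless block.block_type == 'LINE' && block.text

    label, value = block.text.split(':', 2)
    if value.nil? || label.strip.empty?
      unmatched << block.text.strip
    else
      # Normalize "Full Name" into "full_name" so it can map to a form field
      fields[label.strip.downcase.gsub(/\s+/, '_')] = value.strip
    end
  end

  { fields: fields, unmatched: unmatched }
end

# Made-up sample blocks, for illustration only
sample = [
  Block.new(block_type: 'PAGE', text: nil, page: 1),
  Block.new(block_type: 'LINE', text: 'Full Name: Maria Silva', page: 1),
  Block.new(block_type: 'LINE', text: 'Date of Birth: 1990-04-12', page: 1),
  Block.new(block_type: 'LINE', text: 'Signature', page: 1),
  Block.new(block_type: 'WORD', text: 'Maria', page: 1)
]

parsed = parse_label_value_lines(sample)
```

Inside the controller, parse_form_data(result) could delegate to this helper with result.blocks. Real forms rarely follow one pattern, so a production parser would likely need per-form rules on top of something like this.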

🚀 DevOps: Streamlining Development and Deployment

To ensure seamless collaboration and efficient deployment, we implemented a DevOps system centered around GitLab and GitFlow. The project’s source code was stored in GitLab, allowing for version control, code review, and easy collaboration among team members. The GitFlow workflow was followed, enabling a structured approach to feature development and release management.

🔄 Continuous Integration and Deployment Pipeline

The development process was enhanced by a robust CI/CD pipeline. Whenever a merge request was opened targeting the “homologation” branch, an automated pipeline was triggered. This pipeline incorporated various stages, including unit testing, building the containerized project using Docker, and deploying it to the AWS servers.

🐳 Dockerfile Example:

Below is an example Dockerfile for a Ruby on Rails application:

# Use the official Ruby image as the base
FROM ruby:2.7

# Set the working directory
WORKDIR /app

# Copy the Gemfile and Gemfile.lock
COPY Gemfile Gemfile.lock ./

# Install the project dependencies
RUN bundle install

# Copy the application code
COPY . .

# Set the entry point for the container
CMD ["rails", "server", "-b", "0.0.0.0"]

🔧 GitLab Pipeline Configuration (.gitlab-ci.yml)

The GitLab pipeline was configured using a .gitlab-ci.yml file. It defined the stages, jobs, and deployment steps to be executed during the pipeline. Here's an example of a GitLab pipeline configuration file:

stages:
  - test
  - build
  - deploy

unit_tests:
  stage: test
  script:
    - bundle install
    - bundle exec rspec

build_docker_image:
  stage: build
  script:
    - docker build -t your-image-name .

deploy_to_aws:
  stage: deploy
  script:
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
    - docker push your-image-name
    - aws ecs update-service --cluster your-cluster-name --service your-service-name --force-new-deployment

In this example, the pipeline consists of three stages: “test,” “build,” and “deploy.” Each stage includes one or more jobs that execute specific tasks. The unit_tests job runs the unit tests with RSpec, the build_docker_image job builds a Docker image, and the deploy_to_aws job pushes the Docker image to the registry and triggers a new deployment on the AWS ECS cluster.

Alongside the pipeline, Redis stored the unstructured data that gave me the status of each document being processed in Textract.
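
To make the Redis status tracking a little more concrete, here is a minimal sketch of how per-document job status could be recorded. It is an assumption about the shape of such a component, not the project's actual code: the JobStatusTracker class, the key prefix, and the document key are all hypothetical. The store only needs set(key, value) and get(key), which a client from the redis gem provides; an in-memory stand-in is used here so the example runs without a Redis server.

```ruby
# Tracks per-document Textract job status in a key-value store.
# `store` is anything that responds to set(key, value) and get(key).
class JobStatusTracker
  def initialize(store, prefix: 'textract:status')
    @store = store
    @prefix = prefix
  end

  # Record the latest status for a document (overwrites any earlier one)
  def mark(document_key, status)
    @store.set(key_for(document_key), status)
  end

  # Return the stored status, or 'UNKNOWN' when nothing was recorded
  def status_of(document_key)
    @store.get(key_for(document_key)) || 'UNKNOWN'
  end

  private

  def key_for(document_key)
    "#{@prefix}:#{document_key}"
  end
end

# In-memory stand-in so the sketch runs without a Redis server.
class MemoryStore
  def initialize
    @data = {}
  end

  def set(key, value)
    @data[key] = value
    'OK'
  end

  def get(key)
    @data[key]
  end
end

tracker = JobStatusTracker.new(MemoryStore.new)
tracker.mark('forms/invoice-001.pdf', 'IN_PROGRESS')
tracker.mark('forms/invoice-001.pdf', 'SUCCEEDED')
```

Passing the store in, rather than creating a Redis connection inside the class, keeps the tracker easy to test and lets the same code run against a real Redis instance in production.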

These examples demonstrate the integration of DevOps practices, such as version control, automated testing, and containerization, which helped streamline the development and deployment processes, ensuring reliability and scalability throughout the project.

Reducing Processing Time: “The future is already here — it’s just not evenly distributed.” — William Gibson 🌟⌛

The implementation of this intelligent solution brought significant improvements to the company’s document processing workflow. By automating the data extraction process, weeks of manual labor were saved each month. The processing delay, previously measured in weeks, was reduced to just a few hours. This achievement not only enhanced operational efficiency but also introduced automation to repetitive and manual tasks. ⏰💼

Conclusion: “AI is not only for engineers; it’s for everyone.” — Fei-Fei Li 🌍🔮

My experience with AI and AWS Textract in streamlining document processing has demonstrated the tremendous potential of AI in transforming traditional manual processes into efficient, automated workflows. By harnessing the power of AI, we can revolutionize how organizations handle large volumes of data, increasing productivity and reducing operational costs.

[Figure: diagram of this architecture]

Let’s embrace the future of technology together, empowering ourselves with AI to achieve remarkable outcomes! 💪🚀