Micheal Angelo

From a Simple File Upload API to an Event-Driven AWS Document Processing Pipeline

When I first started learning AWS, I assumed cloud computing was mostly about understanding individual services like Amazon S3, DynamoDB, Lambda, or EC2.

After spending time building a document processing pipeline locally, I realized something different.

Cloud computing isn't simply a collection of services—it's about how those services collaborate to solve real engineering problems.

Rather than learning each service in isolation, I gradually evolved a small FastAPI application into an event-driven, containerized backend. Along the way, I explored concepts like object storage, asynchronous processing, infrastructure automation, containerization, and continuous integration.

This article summarizes that journey and, more importantly, the architectural lessons learned along the way.


Starting with a Simple REST API

The project began with a straightforward goal: accept document uploads through a REST API.

The initial architecture looked like this:

Client
   │
   ▼
FastAPI
   │
   ▼
Local Storage

Uploaded files were simply written to a folder on the local machine.

While functional, this design tightly coupled the application to the local filesystem. It worked only because it ran on a single machine with a writable disk: a second instance, or a fresh container, would not see the files uploaded to the first.
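
A minimal sketch of that first version, assuming a hypothetical `save_upload` helper and naming scheme (neither comes from the original project). In the real application a FastAPI route would read the request body and call something like this:

```python
from pathlib import Path
from uuid import uuid4


def save_upload(upload_dir: Path, filename: str, data: bytes) -> Path:
    """Write an uploaded file into a local folder and return its path.

    `upload_dir` and the naming scheme are illustrative assumptions.
    """
    upload_dir.mkdir(parents=True, exist_ok=True)
    # Keep only the final path component so "../../etc/passwd" cannot escape.
    safe_name = Path(filename).name
    # Prefix with a random id so two uploads named "report.pdf" do not collide.
    target = upload_dir / f"{uuid4().hex}_{safe_name}"
    target.write_bytes(data)
    return target
```

Everything here depends on the disk of whichever machine runs the process.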


Moving from Local Storage to Object Storage

The first architectural improvement was replacing local storage with Amazon S3.

Instead of keeping uploaded files inside the application directory, they were stored in an object storage service.

The architecture became:

Client
   │
   ▼
FastAPI
   │
   ▼
Amazon S3

This introduced one of the first important cloud concepts:

Application servers shouldn't permanently own user files.

Object storage provides durability, scalability, and independence from the application itself.
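
As a rough sketch, the storage step might look like the following. The function name `store_in_s3` and the `documents/<id>/<filename>` key layout are assumptions for illustration; `put_object` is the boto3 S3 client call.

```python
from uuid import uuid4


def store_in_s3(s3, bucket: str, filename: str, data: bytes) -> str:
    """Upload bytes to S3 and return the object key.

    `s3` is a boto3 S3 client, e.g. boto3.client("s3").
    The key layout below is an assumption made for this sketch.
    """
    key = f"documents/{uuid4().hex}/{filename}"
    s3.put_object(Bucket=bucket, Key=key, Body=data)
    return key
```

Passing the client in as an argument keeps the function easy to exercise with a stand-in object, without touching a real bucket.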


Learning AWS Without an AWS Account

Rather than paying for resources in a real AWS account, I used Floci, an open-source local AWS emulator in the same vein as LocalStack.

The architecture looked like this:

FastAPI
    │
    ▼
localhost:4566
    │
    ▼
Floci
    │
    ├── S3
    ├── DynamoDB
    └── SQS

The application interacted with Floci using the official AWS SDK for Python (boto3), making the experience very similar to working with real AWS services.

This made it possible to experiment with cloud concepts locally without worrying about cloud costs.
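
A sketch of how a boto3 client can be pointed at a local endpoint. The port comes from the diagram above; the helper names, the region, and the placeholder credentials are assumptions (the sketch assumes the emulator accepts any non-empty credentials).

```python
EMULATOR_ENDPOINT = "http://localhost:4566"  # port taken from the diagram above


def emulator_client_kwargs(region: str = "us-east-1") -> dict:
    """Keyword arguments that point a boto3 client at a local emulator.

    The credentials are placeholders and the region is an arbitrary choice;
    both are assumptions of this sketch.
    """
    return {
        "endpoint_url": EMULATOR_ENDPOINT,
        "region_name": region,
        "aws_access_key_id": "test",
        "aws_secret_access_key": "test",
    }


def make_client(service: str):
    """Create a boto3 client for `service` ("s3", "dynamodb", "sqs", ...)."""
    import boto3  # imported lazily so the settings can be inspected without boto3

    return boto3.client(service, **emulator_client_kwargs())
```

In principle, dropping `endpoint_url` and supplying real credentials would point the same application code at AWS itself.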


Using boto3 Instead of Raw HTTP Requests

Applications rarely communicate with AWS services by constructing HTTP requests manually.

Instead, AWS provides Software Development Kits (SDKs).

In Python, this is boto3.

A simple call like:

boto3.client("s3")

hides a significant amount of complexity.

The SDK handles authentication, request formatting, retries, and communication with AWS-compatible APIs, allowing developers to focus on application logic rather than protocol details.


Making Infrastructure Self-Initializing

Initially, the application assumed that the required S3 bucket already existed.

That meant manually creating resources before starting the application.

Instead, startup logic was introduced:

Application Starts
        │
        ▼
Check Bucket
        │
 ┌──────┴──────┐
 │             │
 ▼             ▼
Exists     Create Bucket

This small improvement made the application much easier to run in a fresh environment and reduced manual setup.
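
The check-then-create flow in the diagram could be sketched like this. `ensure_bucket` is a hypothetical name; `head_bucket` and `create_bucket` are boto3 S3 client calls, and the sketch assumes the default `us-east-1` region.

```python
def ensure_bucket(s3, bucket: str) -> bool:
    """Create `bucket` if it does not exist. Returns True when it was created.

    `s3` is a boto3 S3 client. Assumes the us-east-1 region; other regions
    also need a CreateBucketConfiguration argument on create_bucket.
    """
    try:
        s3.head_bucket(Bucket=bucket)
        return False  # bucket already exists and is accessible
    except s3.exceptions.ClientError as err:
        code = err.response.get("Error", {}).get("Code", "")
        if code not in ("404", "NoSuchBucket", "NotFound"):
            raise  # e.g. 403: the bucket exists but is not ours to use
    s3.create_bucket(Bucket=bucket)
    return True
```

Calling this once at startup makes the function safe to run repeatedly: a second call finds the bucket and does nothing.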


Separating Files from Metadata

Uploading files solved only part of the problem.

Information about each uploaded document—such as filename, upload time, size, and a unique identifier—also needed to be stored.

Instead of embedding this information within the files themselves, metadata was stored separately in Amazon DynamoDB.

S3
 │
 ▼
Document Files

DynamoDB
 │
 ▼
Document Metadata

Separating binary data from structured metadata is a common design pattern in cloud-native applications.
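
A sketch of what the metadata record might look like. The attribute names and helper names are assumptions, not the project's actual schema; `put_item` is the boto3 DynamoDB client call, which expects values in DynamoDB's typed attribute format.

```python
from datetime import datetime, timezone
from uuid import uuid4


def build_metadata_item(filename: str, size_bytes: int, s3_key: str) -> dict:
    """Describe one uploaded document in DynamoDB's low-level attribute format.

    The attribute names are assumptions made for this sketch.
    """
    return {
        "document_id": {"S": uuid4().hex},
        "filename": {"S": filename},
        "s3_key": {"S": s3_key},  # pointer back to the file in S3
        "size_bytes": {"N": str(size_bytes)},  # numbers are sent as strings
        "uploaded_at": {"S": datetime.now(timezone.utc).isoformat()},
    }


def save_metadata(dynamodb, table: str, item: dict) -> None:
    """`dynamodb` is a boto3 DynamoDB client, e.g. boto3.client("dynamodb")."""
    dynamodb.put_item(TableName=table, Item=item)
```

The record stores only a key pointing at the file, so the file itself never passes through the database.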


Introducing Event-Driven Architecture

Initially, the upload request handled every operation synchronously:

Upload File
      │
      ▼
Store Metadata
      │
      ▼
Return Response

This meant users had to wait until every task finished.

To improve the design, asynchronous processing was introduced using Amazon SQS.

The workflow became:

Client
      │
      ▼
FastAPI
      │
      ▼
Upload to S3
      │
      ▼
Send Message to SQS
      │
      ▼
Return Response

Instead of doing everything immediately, the application now creates a message describing the work that still needs to be done.
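
The enqueue step might be sketched as follows. `enqueue_document` and the message fields are assumptions for illustration; `send_message` is the boto3 SQS client call.

```python
import json


def enqueue_document(sqs, queue_url: str, document_id: str, s3_key: str) -> str:
    """Publish a "document uploaded" message and return its JSON body.

    `sqs` is a boto3 SQS client. The message fields are assumptions.
    """
    body = json.dumps({"document_id": document_id, "s3_key": s3_key})
    sqs.send_message(QueueUrl=queue_url, MessageBody=body)
    return body
```

The message carries a reference to the file rather than the file itself, which keeps it small and leaves S3 as the single place where the document lives.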


Why Queues Matter

Queues become especially valuable when traffic increases.

Imagine thousands of users uploading documents simultaneously.

Without a queue:

Requests
    │
    ▼
Application
    │
    ▼
Overloaded

With a queue:

Requests
    │
    ▼
SQS Queue
    │
    ▼
Background Workers

The queue acts as a buffer, smoothing sudden spikes in traffic and allowing work to be processed at a sustainable pace.
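
To make the buffering effect concrete, here is a toy simulation. It does not model SQS itself; the function name and all numbers are made up for illustration.

```python
def simulate_backlog(arrivals: list[int], capacity: int) -> list[int]:
    """Return the queue length after each tick.

    arrivals[i] is the number of uploads arriving in tick i; `capacity` is how
    many messages the workers can finish per tick. Both are made-up numbers.
    """
    backlog = 0
    history = []
    for arriving in arrivals:
        backlog += arriving
        backlog -= min(backlog, capacity)  # workers drain at a fixed pace
        history.append(backlog)
    return history
```

A burst of 50 uploads against a capacity of 20 per tick, `simulate_backlog([50, 0, 0, 0, 0], 20)`, gives `[30, 10, 0, 0, 0]`: the spike is absorbed by the queue and worked off over the next few ticks instead of hitting the workers all at once.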


Background Workers

A dedicated worker process continuously polls the queue for new messages.

Its responsibility is simple:

Receive Message
      │
      ▼
Process
      │
      ▼
Store Metadata
      │
      ▼
Delete Message

Separating background processing from the API keeps responsibilities clear and allows each component to scale independently.
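
One iteration of such a worker could look roughly like this. `poll_once`, the message fields, and the `status` attribute are assumptions; `receive_message`, `delete_message`, and `put_item` are boto3 client calls.

```python
import json


def poll_once(sqs, dynamodb, queue_url: str, table: str) -> int:
    """Process one batch of messages and return how many were handled.

    `sqs` and `dynamodb` are boto3 clients. The message fields and the
    "status" attribute are assumptions made for this sketch.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # SQS returns at most 10 messages per call
        WaitTimeSeconds=20,  # long polling: wait up to 20 s for messages
    )
    handled = 0
    for message in response.get("Messages", []):
        payload = json.loads(message["Body"])
        dynamodb.put_item(
            TableName=table,
            Item={
                "document_id": {"S": payload["document_id"]},
                "s3_key": {"S": payload["s3_key"]},
                "status": {"S": "PROCESSED"},
            },
        )
        # Delete only after the work succeeded; otherwise SQS redelivers
        # the message once its visibility timeout expires.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
        handled += 1
    return handled
```

A long-running worker would simply call `poll_once` in a `while True` loop. Deleting the message last means a crash mid-processing leads to a retry rather than lost work.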


Exploring Serverless Computing

I also experimented with AWS Lambda.

Instead of running workers continuously, Lambda executes code only when an event occurs.

Conceptually:

Event
   │
   ▼
Lambda
   │
   ▼
Execute
   │
   ▼
Terminate

This introduced the idea of serverless computing, where compute resources are allocated only while work is being performed and no server has to be kept running in between.
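
A minimal sketch of a Lambda function triggered by SQS, assuming the same hypothetical message fields as a queue-based worker would use. With an SQS trigger, Lambda passes a batch of messages in `event["Records"]`, each with its text in a `body` field.

```python
import json


def handler(event, context):
    """Entry point Lambda calls for each batch of SQS messages.

    The "document_id" payload field is an assumption made for this sketch;
    a real handler would do the actual document processing here.
    """
    processed = []
    for record in event.get("Records", []):
        payload = json.loads(record["body"])
        processed.append(payload["document_id"])
    return {"processed": processed}
```

There is no polling loop here: the platform does the receiving and invokes the function only when messages exist.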


Containerizing the Application

As the project grew, another challenge appeared.

How could another machine run the application without manually installing Python, dependencies, or configuring the environment?

Docker solved this problem.

A Docker image packages:

  • The application
  • Python
  • Dependencies
  • Configuration

into a single portable artifact.

Source Code
      │
      ▼
Docker Build
      │
      ▼
Docker Image

A container is simply a running instance of that image.


Managing Multiple Services with Docker Compose

Eventually, the project consisted of several independent services:

  • FastAPI
  • Background Worker
  • Floci

Instead of starting each one manually, Docker Compose orchestrated the entire environment.

Docker Compose
      │
 ┌────┼────┐
 ▼    ▼    ▼
API Worker Floci

This made local development significantly more reproducible.


Automating Builds with GitHub Actions

Running the application locally wasn't enough.

Every code change should also be verified automatically.

GitHub Actions introduced a simple CI pipeline:

Push Code
     │
     ▼
GitHub Actions
     │
     ▼
Install Dependencies
     │
     ▼
Build Docker Image
     │
     ▼
Report Status

Automation catches problems earlier and gives confidence that every pushed change still builds.


Sharing Images Through Docker Hub

Docker images initially existed only on one machine.

Publishing them to Docker Hub changed that.

Local Build
     │
     ▼
Docker Hub
     │
     ▼
Any Machine

Once uploaded, the same image can be pulled and run on any machine with Docker installed, without rebuilding.

This is the essence of:

Build once, run anywhere.


Looking Ahead to Deployment

The next natural step is deployment.

Instead of running the application on a personal laptop, the same Docker image can be deployed to an Amazon EC2 instance.

Conceptually:

GitHub
     │
     ▼
GitHub Actions
     │
     ▼
Docker Hub
     │
     ▼
EC2 Instance
     │
     ▼
Docker Compose
     │
 ┌───┼────┐
 ▼   ▼    ▼
API Worker Floci

Cloud deployment becomes much simpler because the application is already packaged as a container.


Key Lessons Learned

This project reinforced several important ideas:

  • Cloud computing is about systems, not isolated services.
  • Object storage and metadata storage solve different problems.
  • Event-driven architectures improve scalability and responsiveness.
  • Queues decouple producers from consumers.
  • Background workers allow long-running tasks to happen asynchronously.
  • Containers provide consistent execution environments.
  • Continuous integration catches broken builds early.
  • Good software engineering principles matter just as much as cloud knowledge.

Final Thoughts

Looking back, the biggest takeaway wasn't learning Amazon S3, DynamoDB, SQS, Docker, or GitHub Actions individually.

It was understanding how each component contributes a single responsibility within a larger system.

Cloud applications become easier to extend, maintain, and scale when responsibilities are clearly separated and services communicate through well-defined interfaces.

Building this project transformed cloud computing from a list of services into a connected ecosystem of architectural patterns—and that has been one of the most valuable lessons in my learning journey.


GitHub Repository

The complete project is available here:

Repository: https://github.com/micheal000010000-hub/aws-document-processing-pipeline/tree/release/v6.0

Feedback and suggestions are always welcome.
