<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Firdaws Aboulaye</title>
    <description>The latest articles on DEV Community by Firdaws Aboulaye (@faboulaye).</description>
    <link>https://dev.to/faboulaye</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1238334%2F0f89fec4-fb89-4d4d-a56d-bc1a11f8c9e2.jpeg</url>
      <title>DEV Community: Firdaws Aboulaye</title>
      <link>https://dev.to/faboulaye</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/faboulaye"/>
    <language>en</language>
    <item>
      <title>Building a Serverless i18n API with Amazon Nova Lite (Bedrock): Why Cheap Tokens Matter (A Lot)</title>
      <dc:creator>Firdaws Aboulaye</dc:creator>
      <pubDate>Tue, 27 Jan 2026 07:12:52 +0000</pubDate>
      <link>https://dev.to/aws-builders/building-a-serverless-i18n-api-with-aws-nova-lite-bedrock-why-cheap-tokens-matter-a-lot-1lp6</link>
      <guid>https://dev.to/aws-builders/building-a-serverless-i18n-api-with-aws-nova-lite-bedrock-why-cheap-tokens-matter-a-lot-1lp6</guid>
      <description>&lt;p&gt;Translating 10,000 UI message entries into 5 languages (50,000 translations) costs &lt;strong&gt;$0.26 with Amazon Nova Lite&lt;/strong&gt; (production-measured) vs &lt;strong&gt;$41 with GPT-4 Turbo&lt;/strong&gt; — a &lt;strong&gt;99.4% cost reduction&lt;/strong&gt; with smart prompt engineering and batching. These numbers are validated from a live deployment.&lt;/p&gt;

&lt;p&gt;In this article, a “message entry” refers to a single i18n key–value pair&lt;br&gt;
(e.g. &lt;code&gt;auth.error.invalid_password=Invalid password&lt;/code&gt;) translated into a target locale.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Economics of LLM-Powered Translation
&lt;/h2&gt;

&lt;p&gt;Internationalization (i18n) at scale is a &lt;em&gt;token usage nightmare&lt;/em&gt;. Hundreds or thousands of short messages → thousands of API calls → thousands of tokens wasted in prompt overhead.&lt;/p&gt;

&lt;p&gt;Let’s ground this in real numbers.&lt;/p&gt;
&lt;h3&gt;
  
  
  Assumptions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Average UI message size&lt;/strong&gt;: ~8 tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt overhead&lt;/strong&gt;: ~50 tokens per call (context + instructions)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No batching&lt;/strong&gt;: 1 message per API call&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Token Usage
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Per call&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input: 50 (prompt) + 8 = &lt;strong&gt;58 tokens&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Output (translated text): ~8 tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Total localized messages (50,000)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input tokens&lt;/strong&gt;: 50,000 × 58 = &lt;strong&gt;2.9M&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output tokens&lt;/strong&gt;: 50,000 × 8 = &lt;strong&gt;0.4M&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
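&lt;p&gt;These totals are easy to sanity-check. A quick Python sketch of the arithmetic (the token counts are the assumptions above, not measurements):&lt;/p&gt;

```python
# Back-of-the-envelope token totals for 50,000 unbatched translation calls.
MESSAGES = 50_000          # 10,000 entries x 5 languages
PROMPT_OVERHEAD = 50       # instruction/context tokens per call
MESSAGE_TOKENS = 8         # average UI message size

input_tokens = MESSAGES * (PROMPT_OVERHEAD + MESSAGE_TOKENS)  # 2,900,000
output_tokens = MESSAGES * MESSAGE_TOKENS                     # 400,000

print(f"input: {input_tokens / 1e6:.1f}M, output: {output_tokens / 1e6:.1f}M")
```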


&lt;h2&gt;
  
  
  Cost Comparison (Updated Pricing)
&lt;/h2&gt;
&lt;h3&gt;
  
  
  📌 OpenAI LLM Pricing (approximate API rates)
&lt;/h3&gt;

&lt;p&gt;Popular production-grade models like &lt;strong&gt;GPT-4o / GPT-4 Turbo&lt;/strong&gt; cost &lt;em&gt;significantly more&lt;/em&gt; than lightweight models. Recent public pricing shows:&lt;br&gt;
&lt;em&gt;Pricing based on publicly documented OpenAI API rates (April 2025). Exact pricing may vary by model version and deployment.&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPT-4 Turbo&lt;/strong&gt;: ~$10 / 1M input tokens and ~$30 / 1M output tokens (classic API)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-4o (more capable successor)&lt;/strong&gt;: ~$2.50 / 1M input and ~$10 / 1M output (2025 standard model pricing)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Estimated cost for 50,000 translations (no batching):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input $&lt;/th&gt;
&lt;th&gt;Output $&lt;/th&gt;
&lt;th&gt;Total&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4 Turbo&lt;/td&gt;
&lt;td&gt;2.9M × $10/1M = &lt;strong&gt;$29&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;0.4M × $30/1M = &lt;strong&gt;$12&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$41&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;2.9M × $2.50/1M = &lt;strong&gt;$7.25&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;0.4M × $10/1M = &lt;strong&gt;$4.00&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$11.25&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h3&gt;
  
  
  💸 AWS Bedrock Nova Lite Pricing
&lt;/h3&gt;

&lt;p&gt;AWS Bedrock’s &lt;strong&gt;Nova Lite&lt;/strong&gt; model is extremely cost-efficient:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~$0.06 per 1M &lt;strong&gt;input&lt;/strong&gt; tokens&lt;/li&gt;
&lt;li&gt;~$0.24 per 1M &lt;strong&gt;output&lt;/strong&gt; tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Estimated cost (no batching):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input: 2.9M × $0.06/M = &lt;strong&gt;$0.17&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Output: 0.4M × $0.24/M = &lt;strong&gt;$0.10&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: ~$0.27&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  With Batching (50 messages per call)
&lt;/h3&gt;

&lt;p&gt;Batching drastically reduces prompt overhead, since the 50+ tokens of instruction context are amortized across many messages.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nova Lite total ≈ &lt;strong&gt;$0.12&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;GPT-4o total (same batch assumptions): ~&lt;strong&gt;$3–5+&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
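&lt;p&gt;The batching arithmetic behind these totals, sketched in Python (using the Nova Lite list prices quoted above; illustrative, not an official quote):&lt;/p&gt;

```python
# Nova Lite list prices used in this article, in USD per token.
IN_RATE, OUT_RATE = 0.06 / 1e6, 0.24 / 1e6

def cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * IN_RATE + output_tokens * OUT_RATE

# Unbatched: 50,000 calls, each paying 50 overhead + 8 message tokens in, 8 out.
unbatched = cost(50_000 * 58, 50_000 * 8)

# Batched: 1,000 calls of 50 messages; the 50-token overhead is paid once per call.
batched = cost(1_000 * (50 + 50 * 8), 50_000 * 8)

print(f"unbatched ~ ${unbatched:.2f}, batched ~ ${batched:.2f}")
```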

&lt;p&gt;&lt;em&gt;This represents roughly a &lt;strong&gt;99% reduction&lt;/strong&gt; in token cost vs mainstream API models.&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Real Production Metrics (Validated)
&lt;/h2&gt;

&lt;p&gt;We deployed this API to production and ran comprehensive tests to validate these claims. Here are the &lt;strong&gt;actual measured results&lt;/strong&gt; from a live AWS deployment.&lt;/p&gt;
&lt;h3&gt;
  
  
  Test Configuration
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Endpoint&lt;/strong&gt;: AWS Lambda (Python 3.13) + API Gateway (Regional)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model&lt;/strong&gt;: &lt;code&gt;global.amazon.nova-2-lite-v1:0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda&lt;/strong&gt;: 128 MB memory, 30s timeout&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Region&lt;/strong&gt;: eu-central-1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test date&lt;/strong&gt;: January 2026&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Batch Performance (50 messages per request)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Input tokens&lt;/td&gt;
&lt;td&gt;821&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output tokens&lt;/td&gt;
&lt;td&gt;860&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost per request&lt;/td&gt;
&lt;td&gt;$0.000256&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost per message&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.00000511&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Duration&lt;/td&gt;
&lt;td&gt;3.8 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
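&lt;p&gt;Deriving the per-message figure from this table is straightforward. A sketch using the Nova Lite list prices quoted earlier (not the deployment's actual billing code):&lt;/p&gt;

```python
# Measured tokens for one 50-message batch (from the table above).
input_tokens, output_tokens, batch_size = 821, 860, 50

# Nova Lite list prices, USD per token.
cost_per_request = input_tokens * 0.06 / 1e6 + output_tokens * 0.24 / 1e6
cost_per_message = cost_per_request / batch_size

# Projecting to 50,000 translations (1,000 batches of 50):
projected_total = 50_000 * cost_per_message

print(f"per request: ${cost_per_request:.6f}, per message: ${cost_per_message:.8f}")
print(f"50k messages: ${projected_total:.2f}")
```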
&lt;h3&gt;
  
  
  Multi-Language Validation
&lt;/h3&gt;

&lt;p&gt;Tested with identical 50-message batches across multiple languages:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;Input Tokens&lt;/th&gt;
&lt;th&gt;Output Tokens&lt;/th&gt;
&lt;th&gt;Cost (USD)&lt;/th&gt;
&lt;th&gt;Duration&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;French&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;821&lt;/td&gt;
&lt;td&gt;860&lt;/td&gt;
&lt;td&gt;$0.000256&lt;/td&gt;
&lt;td&gt;3.8s&lt;/td&gt;
&lt;td&gt;Baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Spanish&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;821&lt;/td&gt;
&lt;td&gt;854&lt;/td&gt;
&lt;td&gt;$0.000254&lt;/td&gt;
&lt;td&gt;3.6s&lt;/td&gt;
&lt;td&gt;Slightly faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Japanese&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;821&lt;/td&gt;
&lt;td&gt;843&lt;/td&gt;
&lt;td&gt;$0.000252&lt;/td&gt;
&lt;td&gt;5.1s&lt;/td&gt;
&lt;td&gt;~34% slower (CJK)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key observations&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input tokens remain constant (same English source)&lt;/li&gt;
&lt;li&gt;Output tokens vary by ~2% based on target language verbosity&lt;/li&gt;
&lt;li&gt;Japanese takes roughly a third longer (5.1s vs 3.8s), likely due to CJK tokenization and generation overhead&lt;/li&gt;
&lt;li&gt;Cost variance between languages: negligible (&amp;lt;2%)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Translation Quality Examples
&lt;/h3&gt;

&lt;p&gt;Real translations from production:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authentication flow&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EN: "Sign In"
FR: "Se connecter"
ES: "Iniciar sesión"
JA: "サインイン"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Error message&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EN: "An error occurred. Please try again."
FR: "Une erreur s'est produite. Veuillez réessayer."
ES: "Se produjo un error. Inténtalo de nuevo."
JA: "エラーが発生しました。もう一度お試しください。"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Variable preservation&lt;/strong&gt; (100% success rate):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EN: "Welcome {{username}}!"
FR: "Bienvenue {{username}} !"
ES: "¡Bienvenido {{username}}!"
JA: "ようこそ {{username}}！"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Cost Projections from Real Data
&lt;/h3&gt;

&lt;p&gt;Based on measured cost per message ($0.00000511):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload&lt;/th&gt;
&lt;th&gt;Batches&lt;/th&gt;
&lt;th&gt;Total Cost&lt;/th&gt;
&lt;th&gt;Time Estimate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;50 messages&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;$0.00026&lt;/td&gt;
&lt;td&gt;3.8 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1,000 messages&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;$0.0051&lt;/td&gt;
&lt;td&gt;~1 minute&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10,000 messages&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;td&gt;$0.0511&lt;/td&gt;
&lt;td&gt;~13 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;10,000 × 5 languages&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,000&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.26&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~65 minutes&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Actual vs Theoretical Comparison
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: 10,000 messages translated into 5 languages (50,000 total translations)&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Theoretical&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Measured&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;Variance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4 Turbo&lt;/td&gt;
&lt;td&gt;$41.00&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;$11.25&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nova Lite&lt;/td&gt;
&lt;td&gt;$0.27 (no batch) / $0.12 (batched)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.26&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ &lt;strong&gt;Within range&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Savings vs GPT-4 Turbo&lt;/strong&gt;: $41.00 - $0.26 = &lt;strong&gt;$40.74 (99.4% reduction)&lt;/strong&gt; ✅&lt;/p&gt;

&lt;h3&gt;
  
  
  Total Experiment Cost
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Total test requests&lt;/strong&gt;: 14 (across all batch sizes and languages)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total measured cost&lt;/strong&gt;: $0.00223&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Highest single request cost&lt;/strong&gt;: $0.00026 (50 messages)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These real-world results validate the theoretical calculations and demonstrate that Nova Lite delivers &lt;strong&gt;production-ready, cost-effective translations at scale&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why i18n Translation Explodes Token Costs
&lt;/h2&gt;

&lt;p&gt;Internationalization is uniquely expensive because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Volume&lt;/strong&gt;: Thousands of UI messages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repetition&lt;/strong&gt;: Same short prompt structure over and over&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple locales&lt;/strong&gt;: EN → FR, ES, DE, IT, PT, JA…&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated pipelines&lt;/strong&gt;: CI/CD often regenerates translations on every release&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Models help with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Idiomatic translation&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Variable preservation (&lt;code&gt;{{username}}&lt;/code&gt;)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tone consistency&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But traditional LLM pricing makes these workflows costly unless you optimize.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Solution
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Amazon Nova Lite + Smart Prompt Engineering
&lt;/h3&gt;

&lt;p&gt;Enter &lt;strong&gt;Amazon Nova Lite&lt;/strong&gt;, Amazon's ultra-efficient foundation model — basically the &lt;strong&gt;Toyota Corolla of LLMs&lt;/strong&gt;:&lt;br&gt;
cheap, reliable, and it gets the job done.&lt;/p&gt;

&lt;p&gt;It’s not trying to write poetry or marketing slogans. It’s built for high-volume, low-variance workloads like translation, classification, and structured transformation — exactly what i18n needs.&lt;/p&gt;

&lt;p&gt;Amazon Nova Lite isn’t glamorous, but it’s &lt;strong&gt;cheap and capable&lt;/strong&gt; — ideal for straightforward translation tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nova Lite Token Costs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input: ~$0.06/1M&lt;/li&gt;
&lt;li&gt;Output: ~$0.24/1M&lt;/li&gt;
&lt;li&gt;Affordable enough to handle i18n at scale without breaking the budget.&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  3 Token Optimization Patterns
&lt;/h2&gt;
&lt;h3&gt;
  
  
  🔹 Pattern 1 — Prompt Minimalism
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Bad (verbose):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a professional translator...
Text to translate: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Welcome to {{app_name}}!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
Please provide the translation now.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Good (tight + structured):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Translate &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; to&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. JSON only.
Rules: preserve {{{{variables}}}}; match tone.
Input:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Output format: [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;translated&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
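&lt;p&gt;The compact template can be wrapped in a small helper. A sketch (the &lt;code&gt;source&lt;/code&gt;/&lt;code&gt;target&lt;/code&gt; parameters and message shape are assumptions based on the snippet above):&lt;/p&gt;

```python
import json

def build_prompt(source: str, target: str, messages: list) -> str:
    """Compact batch-translation prompt: one instruction block for N messages."""
    return (
        f"Translate {source} to {target}. JSON only.\n"
        "Rules: preserve {{variables}}; match tone.\n"  # plain string, braces literal
        f"Input:{json.dumps(messages)}\n"
        'Output format: [{"key","translated"}]'
    )

prompt = build_prompt("en", "fr", [{"key": "hi", "content": "Hello"}])
```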



&lt;p&gt;➡️ &lt;em&gt;Smaller prompts = fewer tokens = lower costs.&lt;/em&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  🔹 Pattern 2 — Batch Processing
&lt;/h3&gt;

&lt;p&gt;Sending one message per request kills throughput and adds 50+ overhead tokens every time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bad: 50 calls&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;translate_single&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Good: 1 batch call&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;translate_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Payoff:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fewer tokens in repetitive instructions&lt;/li&gt;
&lt;li&gt;Far less latency&lt;/li&gt;
&lt;li&gt;Lower total cost&lt;/li&gt;
&lt;/ul&gt;
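&lt;p&gt;A minimal batching helper in this spirit (hypothetical; the real &lt;code&gt;translate_batch&lt;/code&gt; in the repo may differ):&lt;/p&gt;

```python
from typing import Iterator

def chunked(messages: list, size: int = 50) -> Iterator[list]:
    """Yield fixed-size batches so prompt overhead is paid once per batch."""
    for start in range(0, len(messages), size):
        yield messages[start:start + size]

# 120 messages -> 3 API calls instead of 120.
batches = list(chunked(list(range(120)), 50))
```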




&lt;h3&gt;
  
  
  🔹 Pattern 3 — Structured Outputs
&lt;/h3&gt;

&lt;p&gt;Free-form text wastes tokens and adds parsing overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structured JSON:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"greeting"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"translated"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Bonjour"&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This removes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Excess text&lt;/li&gt;
&lt;li&gt;Human-readable labels&lt;/li&gt;
&lt;li&gt;Parsing overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result is predictable token usage.&lt;/p&gt;
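&lt;p&gt;On the consumer side, handling this structure reduces to one &lt;code&gt;json.loads&lt;/code&gt; plus a placeholder check. A sketch (key names follow the example above):&lt;/p&gt;

```python
import json
import re

PLACEHOLDER = re.compile(r"\{\{\w+\}\}")

def parse_translations(raw: str, originals: dict) -> dict:
    """Load the model's JSON array and verify {{variables}} survived translation."""
    out = {}
    for item in json.loads(raw):
        key, translated = item["key"], item["translated"]
        expected = set(PLACEHOLDER.findall(originals[key]))
        if set(PLACEHOLDER.findall(translated)) != expected:
            raise ValueError(f"placeholder mismatch for {key!r}")
        out[key] = translated
    return out

result = parse_translations(
    '[{"key":"welcome","translated":"Bienvenue {{username}} !"}]',
    {"welcome": "Welcome {{username}}!"},
)
```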




&lt;h2&gt;
  
  
  Production-Ready Implementation Patterns
&lt;/h2&gt;

&lt;p&gt;Your i18n API should incorporate these practical engineering patterns:&lt;/p&gt;

&lt;h3&gt;
  
  
  🧱 1. Separation of Concerns
&lt;/h3&gt;

&lt;p&gt;Split:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HTTP handler&lt;/li&gt;
&lt;li&gt;Translation service (Bedrock client + prompt templates)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes your service easier to test and evolve.&lt;/p&gt;




&lt;h3&gt;
  
  
  📄 2. External Prompt Templates
&lt;/h3&gt;

&lt;p&gt;Loading prompts from files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enables version control&lt;/li&gt;
&lt;li&gt;No redeploy for prompt tweaks&lt;/li&gt;
&lt;li&gt;Keeps code clean&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  🔁 3. Resource Reuse
&lt;/h3&gt;

&lt;p&gt;Initialize Bedrock clients once per Lambda instance (module scope) — reuse on warm starts.&lt;/p&gt;
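&lt;p&gt;The pattern, sketched with a stand-in factory (in the real handler this would construct &lt;code&gt;boto3.client("bedrock-runtime")&lt;/code&gt;; shown generically so it is clear the client outlives each invocation):&lt;/p&gt;

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def bedrock_client():
    """Build the client once per Lambda instance; warm invocations reuse it."""
    return object()  # stand-in for the expensive client construction

def handler(event, context):
    client = bedrock_client()  # cached after the cold start
    ...
```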




&lt;h3&gt;
  
  
  🔍 4. Observability &amp;amp; Cost Metrics
&lt;/h3&gt;

&lt;p&gt;Emit CloudWatch metrics for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input tokens&lt;/li&gt;
&lt;li&gt;Output tokens&lt;/li&gt;
&lt;li&gt;Real USD cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Track spend and set alarms!&lt;/p&gt;
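&lt;p&gt;A minimal per-request cost record to emit alongside each translation (the rates are the list prices quoted earlier; the field names are illustrative, not an official schema):&lt;/p&gt;

```python
def usage_metrics(input_tokens: int, output_tokens: int) -> dict:
    """Compute the metric payload to log/emit for one translation request."""
    cost = input_tokens * 0.06 / 1e6 + output_tokens * 0.24 / 1e6
    return {
        "InputTokens": input_tokens,
        "OutputTokens": output_tokens,
        "CostUSD": round(cost, 8),
    }

metrics = usage_metrics(821, 860)  # the measured 50-message batch from earlier
```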




&lt;h3&gt;
  
  
  ⚠️ 5. Input Validation
&lt;/h3&gt;

&lt;p&gt;Validate early:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Require &lt;code&gt;messages&lt;/code&gt; and &lt;code&gt;locale&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Enforce a maximum batch size&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Failing fast means fewer expensive API calls.&lt;/p&gt;
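&lt;p&gt;A validator in that spirit might look like this (a sketch; the field names and the 50-message cap follow this article's examples):&lt;/p&gt;

```python
MAX_BATCH = 50  # matches the batch size used in the experiments

def validate(body: dict) -> list:
    """Return a list of validation errors; an empty list means the request is OK."""
    errors = []
    messages = body.get("messages")
    if not isinstance(messages, list) or not messages:
        errors.append("messages: non-empty list required")
    elif len(messages) > MAX_BATCH:
        errors.append(f"messages: at most {MAX_BATCH} per request")
    if not body.get("locale"):
        errors.append("locale: required")
    return errors
```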




&lt;h2&gt;
  
  
  Example Request / Response
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Request:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://…/translate &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;-H&lt;/span&gt;&lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "messages":[{"key":"hi","content":"Hello"}],
  "locale":"fr"
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Response:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"translations"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"hi"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"original"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Hello"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"translated"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Bonjour"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"metadata"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"global.amazon.nova-2-lite-v1:0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tokens_used"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;145&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;67&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"cost_usd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;0.0000249&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"duration_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;234&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Model Selection Guide
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Best Model&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;High-volume i18n&lt;/td&gt;
&lt;td&gt;Nova Lite&lt;/td&gt;
&lt;td&gt;Ultra-low cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Creative marketing&lt;/td&gt;
&lt;td&gt;GPT-4o / GPT-4.1&lt;/td&gt;
&lt;td&gt;Nuance &amp;amp; tone&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Legal / medical&lt;/td&gt;
&lt;td&gt;High-accuracy large models&lt;/td&gt;
&lt;td&gt;Precision matters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Real-time micro latency&lt;/td&gt;
&lt;td&gt;Ultra-small models&lt;/td&gt;
&lt;td&gt;Fastest inference&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;i18n is a token hog.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Amazon Nova Lite offers orders-of-magnitude cost savings&lt;/strong&gt; at ~$0.06/1M input and ~$0.24/1M output — ideal for translation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prompt engineering + batching&lt;/strong&gt; shrinks costs dramatically.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Structured responses reduce parsing complexity and waste.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost tracking lets you enforce budgets and monitor usage.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Wrap-Up
&lt;/h2&gt;

&lt;p&gt;Total cost for 50,000 translations (10,000 messages × 5 languages):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Without batching&lt;/th&gt;
&lt;th&gt;With batching (measured)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4 Turbo&lt;/td&gt;
&lt;td&gt;~$41.00&lt;/td&gt;
&lt;td&gt;~$16–20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;~$11.25&lt;/td&gt;
&lt;td&gt;~$3–5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nova Lite&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;$0.27&lt;/strong&gt; (theoretical)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;$0.26&lt;/strong&gt; ✅ (measured in production)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Validated savings: 99.4% cheaper&lt;/strong&gt; than GPT-4 Turbo ($40.74 saved) when running your i18n pipeline with an optimized Nova Lite + Bedrock approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production-tested&lt;/strong&gt;: These numbers are from real API calls to a live deployment, not estimates.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;All code and templates live in the &lt;a href="https://github.com/faboulaye/i18n-ai" rel="noopener noreferrer"&gt;i18n-ai repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Deploy, watch your token usage, and let your translation costs shrink.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reproduce the Experiments
&lt;/h3&gt;

&lt;p&gt;The production metrics in this article are fully reproducible. The repo includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Test datasets&lt;/strong&gt;: &lt;code&gt;/test-data/&lt;/code&gt; directory with various batch sizes and languages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Experiment script&lt;/strong&gt;: &lt;code&gt;/scripts/run-experiments.py&lt;/code&gt; - automated test runner&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Results&lt;/strong&gt;: &lt;code&gt;/experiment-results.md&lt;/code&gt; - raw data from our production tests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Run the experiments yourself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;task run-experiment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This transparency allows you to validate our claims with your own deployment.&lt;br&gt;
If you replicate these experiments with your own message properties or language sets,&lt;br&gt;
feel free to share your results — I’d love to compare!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>bedrock</category>
      <category>nova</category>
    </item>
    <item>
      <title>What I Learned Using Specification-Driven Development with Kiro</title>
      <dc:creator>Firdaws Aboulaye</dc:creator>
      <pubDate>Mon, 12 Jan 2026 08:24:03 +0000</pubDate>
      <link>https://dev.to/aws-builders/what-i-learned-using-specification-driven-development-with-kiro-pdj</link>
      <guid>https://dev.to/aws-builders/what-i-learned-using-specification-driven-development-with-kiro-pdj</guid>
      <description>&lt;p&gt;For a long time, I thought specifications were just a polite way to slow engineers down.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Big documents.
&lt;/li&gt;
&lt;li&gt;Outdated diagrams.
&lt;/li&gt;
&lt;li&gt;Things nobody reads after sprint one.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then I actually used specification-driven development with Kiro on a real project, and my opinion changed completely.&lt;/p&gt;

&lt;p&gt;This article is not about a product, a framework, or a startup idea.&lt;br&gt;&lt;br&gt;
It is about what changed in how I think and work as an engineer once I started treating specifications as first-class citizens.&lt;/p&gt;




&lt;h2&gt;
  
  
  I Used to Start with Code (and Regret It Later)
&lt;/h2&gt;

&lt;p&gt;My old workflow looked like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;“I roughly know what needs to be built”&lt;/li&gt;
&lt;li&gt;Sketch an architecture in my head&lt;/li&gt;
&lt;li&gt;Start coding&lt;/li&gt;
&lt;li&gt;Discover edge cases halfway through&lt;/li&gt;
&lt;li&gt;Refactor&lt;/li&gt;
&lt;li&gt;Repeat&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It felt fast.&lt;br&gt;&lt;br&gt;
It was not.&lt;/p&gt;

&lt;p&gt;Most of the time, I was not fixing bugs. I was fixing unclear decisions I never made explicitly.&lt;/p&gt;

&lt;p&gt;Using Kiro forced me to stop before step two.&lt;/p&gt;




&lt;h2&gt;
  
  
  Writing the Requirement First Was Uncomfortable
&lt;/h2&gt;

&lt;p&gt;The first thing I had to write was a requirement, not code.&lt;/p&gt;

&lt;p&gt;Something like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As an entity owner, I want to register my organization and receive credentials,&lt;br&gt;&lt;br&gt;
so that I can submit data for processing.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Followed by acceptance criteria such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the system accepts only basic information&lt;/li&gt;
&lt;li&gt;credentials are generated once&lt;/li&gt;
&lt;li&gt;a default configuration is created&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No cloud provider.&lt;br&gt;&lt;br&gt;
No database.&lt;br&gt;&lt;br&gt;
No crypto algorithms.&lt;/p&gt;

&lt;p&gt;At first, this felt artificial.&lt;br&gt;&lt;br&gt;
I already knew how I wanted to build it, so why slow down?&lt;/p&gt;

&lt;p&gt;Then something interesting happened.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Spec Exposed Assumptions I Didn’t Know I Had
&lt;/h2&gt;

&lt;p&gt;As I wrote the acceptance criteria, questions started appearing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What exactly does “registered” mean?&lt;/li&gt;
&lt;li&gt;What exists after registration?&lt;/li&gt;
&lt;li&gt;What must be returned, and what must never be returned again?&lt;/li&gt;
&lt;li&gt;What is intentionally out of scope?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these questions were about code.&lt;/p&gt;

&lt;p&gt;They were about behavior.&lt;/p&gt;

&lt;p&gt;I realized that I usually answered these questions implicitly while coding.&lt;/p&gt;

&lt;p&gt;That is risky.&lt;/p&gt;




&lt;h2&gt;
  
  
  Acceptance Criteria Became My Decision Filter
&lt;/h2&gt;

&lt;p&gt;Once the acceptance criteria were written, they became my favorite tool.&lt;/p&gt;

&lt;p&gt;During implementation, I kept asking myself:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Does this help satisfy an acceptance criterion?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the answer was no, I stopped.&lt;/p&gt;

&lt;p&gt;This single habit prevented over-engineering, killed nice-to-have features early, and reduced scope creep without meetings.&lt;/p&gt;

&lt;p&gt;Reviews also changed.&lt;br&gt;&lt;br&gt;
Discussions became:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Which acceptance criterion does this serve?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;instead of:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I feel like this might be useful later.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What Kiro Changed That I Didn’t Expect
&lt;/h2&gt;

&lt;p&gt;At this point, a fair question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Couldn’t you do all of this without Kiro?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Technically, yes.&lt;/p&gt;

&lt;p&gt;Practically, I never did.&lt;/p&gt;




&lt;h3&gt;
  
  
  Before Kiro: Acceptance Criteria Were Optional
&lt;/h3&gt;

&lt;p&gt;Before, my workflow looked like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write acceptance criteria&lt;/li&gt;
&lt;li&gt;Design the solution&lt;/li&gt;
&lt;li&gt;Start coding&lt;/li&gt;
&lt;li&gt;Adjust scope along the way&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nothing stopped me from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;adding extra fields while I was there&lt;/li&gt;
&lt;li&gt;handling edge cases not in the requirement&lt;/li&gt;
&lt;li&gt;making implementation-driven decisions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The acceptance criteria existed, but they had no real weight.&lt;/p&gt;




&lt;h3&gt;
  
  
  With Kiro: Acceptance Criteria Became Structural
&lt;/h3&gt;

&lt;p&gt;With Kiro, I could not move forward without:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a defined requirement&lt;/li&gt;
&lt;li&gt;explicit acceptance criteria&lt;/li&gt;
&lt;li&gt;a design derived from them&lt;/li&gt;
&lt;li&gt;tasks mapped back to them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Skipping clarity was not forbidden. It was just uncomfortable.&lt;/p&gt;

&lt;p&gt;If I tried to jump ahead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;designs felt premature&lt;/li&gt;
&lt;li&gt;tasks felt vague&lt;/li&gt;
&lt;li&gt;gaps became obvious immediately&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kiro did not magically make me write better specs.&lt;br&gt;&lt;br&gt;
It made weak specs impossible to ignore.&lt;/p&gt;




&lt;h3&gt;
  
  
  Less Mental Overhead While Coding
&lt;/h3&gt;

&lt;p&gt;Before Kiro, I constantly asked myself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is this enough?&lt;/li&gt;
&lt;li&gt;Am I missing something?&lt;/li&gt;
&lt;li&gt;Should I add this now?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With acceptance criteria embedded into the workflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;many decisions were already made&lt;/li&gt;
&lt;li&gt;implementation became mechanical&lt;/li&gt;
&lt;li&gt;confidence increased&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This was not about documentation.&lt;/p&gt;

&lt;p&gt;It was about removing uncertainty at the exact moment it hurts most: while coding.&lt;/p&gt;




&lt;h3&gt;
  
  
  Intent Became a System Constraint
&lt;/h3&gt;

&lt;p&gt;The most important shift was subtle.&lt;/p&gt;

&lt;p&gt;Acceptance criteria stopped being guidance and became constraints that the system enforced.&lt;/p&gt;

&lt;p&gt;Once that happened:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;over-engineering dropped&lt;/li&gt;
&lt;li&gt;scope creep became visible&lt;/li&gt;
&lt;li&gt;refactoring felt safer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not because I tried harder, but because deviation became obvious.&lt;/p&gt;




&lt;h2&gt;
  
  
  Design Became Simpler (Not Bigger)
&lt;/h2&gt;

&lt;p&gt;I expected specifications to lead to heavier designs.&lt;/p&gt;

&lt;p&gt;The opposite happened.&lt;/p&gt;

&lt;p&gt;Because the behavioral contract was clear, the design only needed to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;satisfy the requirement&lt;/li&gt;
&lt;li&gt;handle failures&lt;/li&gt;
&lt;li&gt;remain replaceable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No speculative abstractions.&lt;br&gt;&lt;br&gt;
No future-proof layers.&lt;/p&gt;

&lt;p&gt;Design stopped being a guessing game and became a direct response to the spec.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tasks Stopped Being Vague
&lt;/h2&gt;

&lt;p&gt;Breaking the work into tasks became almost mechanical.&lt;/p&gt;

&lt;p&gt;Each task:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;mapped to an acceptance criterion&lt;/li&gt;
&lt;li&gt;had a clear definition of done&lt;/li&gt;
&lt;li&gt;avoided vague verbs like “handle” or “manage”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Estimation improved.&lt;br&gt;&lt;br&gt;
Progress became visible.&lt;br&gt;&lt;br&gt;
Half-done work became obvious.&lt;/p&gt;

&lt;p&gt;Tasks stopped being things to do and became contracts to fulfill.&lt;/p&gt;
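
One way to make that contract tangible is to express an acceptance criterion as an executable check. Here is a hypothetical sketch against an in-memory `registerOrganization`; every name and field is illustrative, not the real system:

```javascript
// Sketch of the three acceptance criteria from earlier, as code:
// only basic information is accepted, credentials are returned exactly
// once, and a default configuration is created.
function registerOrganization(store, info) {
  const allowed = ["name", "contactEmail"];
  if (Object.keys(info).some((k) => !allowed.includes(k))) {
    throw new Error("only basic information is accepted");
  }
  // Credentials are generated and returned once, never persisted here.
  const credentials = { apiKey: "key-" + Math.random().toString(36).slice(2) };
  // A default configuration is created alongside the organization.
  store.set(info.name, { contactEmail: info.contactEmail, config: { locale: "en" } });
  return credentials;
}

function getOrganization(store, name) {
  return store.get(name); // never returns credentials again
}
```

A test that asserts each of these behaviors is a direct, mechanical translation of the spec, which is exactly what made tasks feel like contracts.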




&lt;h2&gt;
  
  
  Refactoring Suddenly Felt Safe
&lt;/h2&gt;

&lt;p&gt;This part surprised me.&lt;/p&gt;

&lt;p&gt;Because the specification defined what must happen, not how, I could:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rewrite internals&lt;/li&gt;
&lt;li&gt;simplify logic&lt;/li&gt;
&lt;li&gt;replace components&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As long as the acceptance criteria still held, the system was correct.&lt;/p&gt;

&lt;p&gt;It changed how I view code:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Code is temporary.&lt;br&gt;&lt;br&gt;
Behavior is the real asset.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Mistakes I Made Along the Way
&lt;/h2&gt;

&lt;p&gt;It was not perfect. I made mistakes, and they were useful.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. I Over-Specified Security Details
&lt;/h3&gt;

&lt;p&gt;My first version included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;key formats&lt;/li&gt;
&lt;li&gt;cryptographic assumptions&lt;/li&gt;
&lt;li&gt;token strategies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That made the spec rigid and harder to evolve.&lt;/p&gt;

&lt;p&gt;Lesson learned: specify security outcomes, not mechanisms.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. I Treated the First Spec as Final
&lt;/h3&gt;

&lt;p&gt;I assumed the spec needed to be perfect from day one.&lt;/p&gt;

&lt;p&gt;It did not.&lt;/p&gt;

&lt;p&gt;The best specs evolved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;wording improved&lt;/li&gt;
&lt;li&gt;acceptance criteria were reordered&lt;/li&gt;
&lt;li&gt;unnecessary constraints were removed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A spec should be living, not frozen.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. I Underestimated Acceptance Criteria
&lt;/h3&gt;

&lt;p&gt;At first, I treated acceptance criteria like a formality.&lt;/p&gt;

&lt;p&gt;They turned out to be the most valuable part of the entire process.&lt;/p&gt;

&lt;p&gt;If I had to keep only one thing from the spec, it would be those.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. I Almost Skipped Specs for Small Features
&lt;/h3&gt;

&lt;p&gt;I thought some features were too small to justify a spec.&lt;/p&gt;

&lt;p&gt;Those were exactly the ones that later caused confusion:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;unclear defaults&lt;/li&gt;
&lt;li&gt;inconsistent behavior&lt;/li&gt;
&lt;li&gt;surprising edge cases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even a short spec was cheaper than fixing misunderstandings later.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Actually Changed in My Workflow
&lt;/h2&gt;

&lt;p&gt;Using specification-driven development consistently led to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fewer late surprises&lt;/li&gt;
&lt;li&gt;fewer abandoned implementations&lt;/li&gt;
&lt;li&gt;fewer “why did we build this?” moments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I now spend more time at the beginning and far less time undoing work later.&lt;/p&gt;

&lt;p&gt;It does not slow me down.&lt;br&gt;&lt;br&gt;
It makes me deliberate.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;Kiro did not invent good engineering practices.&lt;/p&gt;

&lt;p&gt;It made them unavoidable.&lt;/p&gt;

&lt;p&gt;It did not teach me how to write better code.&lt;br&gt;&lt;br&gt;
It taught me how to think before writing code.&lt;/p&gt;

&lt;p&gt;And that turns out to be the hardest and most valuable part of engineering.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Writing code is easy.&lt;br&gt;&lt;br&gt;
Deciding what code should exist is the real work.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>kiro</category>
    </item>
    <item>
      <title>Cut Your AWS Lambda Logging Costs: Filter Logs with AWS SAM</title>
      <dc:creator>Firdaws Aboulaye</dc:creator>
      <pubDate>Sat, 28 Dec 2024 07:00:00 +0000</pubDate>
      <link>https://dev.to/faboulaye/cut-your-aws-lambda-logging-costs-filter-logs-with-aws-sam-1ap3</link>
      <guid>https://dev.to/faboulaye/cut-your-aws-lambda-logging-costs-filter-logs-with-aws-sam-1ap3</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;AWS Lambda is a serverless compute service that scales automatically based on demand. However, the pay-per-use model extends to your logs as well—excessive logging can significantly increase your &lt;strong&gt;Amazon CloudWatch Logs&lt;/strong&gt; costs. In this article, we’ll look at how to configure and filter Lambda logs using &lt;strong&gt;AWS SAM (Serverless Application Model)&lt;/strong&gt; so you capture only the logs you truly need, especially during periods of high load.&lt;/p&gt;




&lt;h3&gt;
  
  
  Default Logging in AWS Lambda
&lt;/h3&gt;

&lt;p&gt;By default, Lambda writes &lt;strong&gt;plain text&lt;/strong&gt; logs to CloudWatch. While this is convenient, it can make filtering specific log levels (e.g., &lt;code&gt;DEBUG&lt;/code&gt;, &lt;code&gt;INFO&lt;/code&gt;, &lt;code&gt;WARN&lt;/code&gt;, &lt;code&gt;ERROR&lt;/code&gt;) difficult, potentially creating large volumes of unnecessary logs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sample Node.js Lambda
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="nx"&gt;exports&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;debug&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;DEBUG: This is a debug message.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;INFO: This is an informational message.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;WARN: This is a warning message.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ERROR: This is an error message.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Hello from Lambda!&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  AWS SAM function configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Globals&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Function&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Runtime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;NodeRuntime&lt;/span&gt;
    &lt;span class="na"&gt;Handler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;index.handler&lt;/span&gt;
    &lt;span class="na"&gt;Tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;App&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;App&lt;/span&gt;
      &lt;span class="na"&gt;Project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;Project&lt;/span&gt;

&lt;span class="na"&gt;Resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

  &lt;span class="c1"&gt;# Functions&lt;/span&gt;
  &lt;span class="na"&gt;FilteringLogsFunction&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::Serverless::Function&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;CodeUri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;filtering-logs&lt;/span&gt;
      &lt;span class="na"&gt;FunctionName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Sub&lt;/span&gt; &lt;span class="s"&gt;${App}-filtering-logs&lt;/span&gt;
      &lt;span class="na"&gt;Description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Filter logs by log level&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In &lt;strong&gt;CloudWatch&lt;/strong&gt;, you might see something like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvyhzrjvxwtlm4j0m11am.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvyhzrjvxwtlm4j0m11am.png" alt="AWS Cloudwatch logs with plain text format" width="800" height="130"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All entries appear as plain text, making it difficult to filter precisely—and potentially driving up your log storage costs.&lt;/p&gt;




&lt;h3&gt;
  
  
  Configuring Logs with AWS SAM
&lt;/h3&gt;

&lt;p&gt;To better control which logs get stored, you can specify &lt;strong&gt;&lt;code&gt;LoggingConfig&lt;/code&gt;&lt;/strong&gt; in your SAM template, as described &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-lambda-function-loggingconfig.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;. Note that this configuration is not supported in the &lt;code&gt;Globals&lt;/code&gt; section of the SAM template.&lt;br&gt;
This configuration lets you define log levels, the log format, and the target log group in a structured way, so you don’t end up paying for excessive logs, especially during high load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SAM Template&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  &lt;span class="na"&gt;FilteringLogsFunction&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::Serverless::Function&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;CodeUri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;filtering-logs&lt;/span&gt;
      &lt;span class="na"&gt;FunctionName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Sub&lt;/span&gt; &lt;span class="s"&gt;${App}-filtering-logs&lt;/span&gt;
      &lt;span class="na"&gt;Description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Filter logs by log level&lt;/span&gt;
      &lt;span class="na"&gt;LoggingConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;ApplicationLogLevel&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;INFO&lt;/span&gt;
        &lt;span class="na"&gt;LogFormat&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;JSON&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What These Settings Do
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ApplicationLogLevel&lt;/strong&gt;: Filters application-level logs (e.g., &lt;code&gt;DEBUG&lt;/code&gt;, &lt;code&gt;INFO&lt;/code&gt;, &lt;code&gt;WARN&lt;/code&gt;, &lt;code&gt;ERROR&lt;/code&gt;). Anything below this level won’t be logged.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LogFormat&lt;/strong&gt;: Possible values are &lt;code&gt;Text&lt;/code&gt; (default) or &lt;code&gt;JSON&lt;/code&gt;. When using &lt;code&gt;Text&lt;/code&gt;, filtering by log level is not supported, which can lead to higher logging costs. Switching to &lt;code&gt;JSON&lt;/code&gt; enables log-level filtering and makes logs more structured.&lt;/li&gt;
&lt;/ul&gt;
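
For reference, a record emitted in JSON format looks roughly like the following (the values are illustrative and the exact field set can vary by runtime version):

```json
{
  "timestamp": "2024-12-28T07:00:00.000Z",
  "level": "INFO",
  "requestId": "79b4f56e-95b1-4643-9700-2807f4e68189",
  "message": "INFO: This is an informational message."
}
```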

&lt;p&gt;In &lt;strong&gt;CloudWatch&lt;/strong&gt;, you might see something like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frxob0r27nnjw6h51th4o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frxob0r27nnjw6h51th4o.png" alt="AWS Cloudwatch logs with JSON format" width="800" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By switching from the default &lt;strong&gt;&lt;code&gt;Text&lt;/code&gt;&lt;/strong&gt; format to &lt;strong&gt;&lt;code&gt;JSON&lt;/code&gt;&lt;/strong&gt; and setting appropriate log levels, you can drastically reduce log volume and costs during high load. Structured logs are also much easier to parse and filter in &lt;strong&gt;CloudWatch Logs Insights&lt;/strong&gt; or other monitoring tools.&lt;/p&gt;
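
To make the filtering semantics concrete, here is a minimal Node.js sketch of the kind of level comparison that `ApplicationLogLevel` applies. This is illustrative only: with `LoggingConfig` in place, Lambda performs this filtering for you, and no wrapper code is needed.

```javascript
// Illustrative level ordering (abbreviated; Lambda also supports
// levels such as TRACE and FATAL).
const LEVELS = { DEBUG: 0, INFO: 1, WARN: 2, ERROR: 3 };

function makeLogger(minLevel = "INFO") {
  const threshold = LEVELS[minLevel] ?? LEVELS.INFO;
  const emit = (level, message) =>
    LEVELS[level] >= threshold
      ? JSON.stringify({ timestamp: new Date().toISOString(), level, message })
      : null; // below the threshold: dropped, never stored
  return {
    debug: (m) => emit("DEBUG", m),
    info: (m) => emit("INFO", m),
    warn: (m) => emit("WARN", m),
    error: (m) => emit("ERROR", m),
  };
}

const log = makeLogger("INFO");
log.debug("noisy detail"); // filtered out (returns null)
log.error("boom");         // kept as a structured JSON line
```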




&lt;h3&gt;
  
  
  Additional Tools for Advanced Logging
&lt;/h3&gt;

&lt;p&gt;While SAM’s built-in &lt;code&gt;LoggingConfig&lt;/code&gt; is powerful, you may want even more control over your logging. Libraries such as &lt;strong&gt;AWS Lambda Powertools&lt;/strong&gt; (available for multiple runtimes e.g. &lt;a href="https://docs.powertools.aws.dev/lambda/python/latest/" rel="noopener noreferrer"&gt;python&lt;/a&gt;) provide additional features like structured logging, tracing, and correlation IDs. These can help you further refine and filter your logs, improve observability, and reduce costs by focusing on the most important information.&lt;/p&gt;




&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Logging is essential for monitoring and troubleshooting AWS Lambda functions, but excessive logs can quickly inflate costs. By leveraging &lt;strong&gt;AWS SAM&lt;/strong&gt; and its &lt;strong&gt;&lt;code&gt;LoggingConfig&lt;/code&gt;&lt;/strong&gt; parameters—especially switching to &lt;code&gt;JSON&lt;/code&gt; and setting appropriate log levels—you can store only the logs that matter most. In addition, libraries like &lt;strong&gt;AWS Lambda Powertools&lt;/strong&gt; can offer even more granular control. This approach not only cuts costs but also streamlines your troubleshooting and monitoring process.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>sam</category>
      <category>logs</category>
    </item>
    <item>
      <title>Highly scalable image storage solution with AWS Serverless - Building File API for Uploads and Downloads</title>
      <dc:creator>Firdaws Aboulaye</dc:creator>
      <pubDate>Sat, 21 Dec 2024 07:00:00 +0000</pubDate>
      <link>https://dev.to/faboulaye/highly-scalable-image-storage-solution-with-aws-serverless-building-file-api-for-uploads-and-2kgb</link>
      <guid>https://dev.to/faboulaye/highly-scalable-image-storage-solution-with-aws-serverless-building-file-api-for-uploads-and-2kgb</guid>
      <description>&lt;p&gt;The &lt;strong&gt;File API&lt;/strong&gt; is essential for managing file uploads and downloads in our architecture. We designed a workflow that includes registration, secure uploads, and file state updates to handle file storage efficiently.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Workflow Overview&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Registration&lt;/strong&gt;: Store file information (metadata, state, user or session information) in the system before the upload.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Presigned URL Generation&lt;/strong&gt;: Provide a short-lived presigned URL for secure, direct uploads to Amazon S3.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File State Update&lt;/strong&gt;: After the upload, update the file's state and verify its integrity.&lt;/li&gt;
&lt;/ol&gt;
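
The file states in this workflow can be sketched as a small state machine. The state names below are assumptions for illustration, not the real system's vocabulary:

```javascript
// A file is REGISTERED before upload, UPLOADED once S3 confirms the
// object, and VERIFIED (or CORRUPTED) after the integrity check.
const TRANSITIONS = {
  REGISTERED: ["UPLOADED"],
  UPLOADED: ["VERIFIED", "CORRUPTED"],
};

function transitionFileState(file, next) {
  const allowed = TRANSITIONS[file.state] || [];
  if (!allowed.includes(next)) {
    throw new Error(`illegal transition ${file.state} -> ${next}`);
  }
  return { ...file, state: next };
}
```

Enforcing transitions this way rejects out-of-order events, for example an integrity-check result arriving for a file that was never marked as uploaded.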

&lt;h3&gt;
  
  
  &lt;strong&gt;Challenges and Solutions&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Initial Approach: Using S3 Events&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Method&lt;/strong&gt;: Each file upload to S3 triggered an S3 Event, invoking a Lambda function per file to update the database and check for corruption.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Issues&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High Concurrency&lt;/strong&gt;: Massive uploads led to exceeding AWS Lambda's concurrency limits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Increased Errors&lt;/strong&gt;: Overloaded Lambdas resulted in failed executions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lack of Retries&lt;/strong&gt;: S3 Events didn't support easy reprocessing of failed events.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4hzftaucwdppub0x9l9s.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4hzftaucwdppub0x9l9s.jpg" alt="Trigger post upload with S3 event design" width="676" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Improved Approach: Leveraging SQS with Batching&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Method&lt;/strong&gt;: Replaced S3 Events with messages sent to an SQS queue upon file upload. Configured Lambda functions to process batches of events from the queue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benefits&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reduced Executions&lt;/strong&gt;: Batch processing minimized the number of Lambda invocations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enhanced Error Handling&lt;/strong&gt;: SQS allowed &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/services-sqs-errorhandling.html" rel="noopener noreferrer"&gt;retries for failed messages&lt;/a&gt; with partial batch response&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: SQS standard queue has nearly unlimited throughput.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffokm66199ijis40034m1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffokm66199ijis40034m1.jpg" alt="Trigger post upload with SQS event design" width="676" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This architecture ensures scalability, reliability, and efficient handling of large volumes of uploads.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Events&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;SQSEvent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SQS&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;Queue&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!GetAtt&lt;/span&gt; &lt;span class="s"&gt;PostUploadHandlingSQSQueue.Arn&lt;/span&gt;
      &lt;span class="na"&gt;FunctionResponseTypes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ReportBatchItemFailures&lt;/span&gt;
      &lt;span class="na"&gt;BatchSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This configuration ensures optimized SQS message processing and robust scalability, even during high traffic periods.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;BatchSize&lt;/code&gt; property in the Lambda event definition, as shown above, lets you customize the maximum number of messages retrieved per batch (e.g., &lt;code&gt;BatchSize: 9&lt;/code&gt;).&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The evolution of our File API showcases the importance of adapting architecture to meet real-world demands. By moving from direct S3 Event triggers to an SQS-based batch processing system, we overcame concurrency limits, reduced errors, and improved scalability.&lt;/p&gt;

&lt;p&gt;In the next article, we’ll explore another domain API, diving into its unique challenges and the solutions we implemented to address them. Stay tuned for more insights into our journey of building scalable and reliable APIs!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>filestorage</category>
      <category>cloudcomputing</category>
    </item>
    <item>
      <title>Highly scalable image storage solution with AWS Serverless - architectural decisions</title>
      <dc:creator>Firdaws Aboulaye</dc:creator>
      <pubDate>Tue, 10 Dec 2024 07:00:00 +0000</pubDate>
      <link>https://dev.to/faboulaye/highly-scalable-image-storagesolution-with-aws-serverless-architectural-decisions-2pba</link>
      <guid>https://dev.to/faboulaye/highly-scalable-image-storagesolution-with-aws-serverless-architectural-decisions-2pba</guid>
      <description>&lt;h3&gt;
  
  
  &lt;strong&gt;From Monolith to Domains: Architecting a Scalable Solution&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In our &lt;a href="https://dev.to/faboulaye/highly-scalable-image-storage-solution-with-aws-serverless-at-iplabs-part-1-3ep8"&gt;previous article&lt;/a&gt;, we introduced the motivation behind redesigning our image storage solution and the transition to a serverless-first approach. This follow-up focuses on the architectural challenges we faced while implementing the solution and how we overcame them.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Challenge 1: Splitting the Service by Domain&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Initially, our legacy system was a monolithic structure, tightly coupled and challenging to scale or extend. To modernize, we began by breaking the solution into domains and APIs, each handling specific responsibilities:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;File API&lt;/strong&gt;: Responsible for handling image uploads, downloads, and storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project API&lt;/strong&gt;: Facilitating project save/load workflows for users working on designs like photo books.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduler Events&lt;/strong&gt;: Automating processes like project expiration, cleanup, and background tasks.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each domain directly tied into the workflows outlined in the previous article, ensuring that the solution remained cohesive yet modular. This separation allowed us to independently scale, test, and enhance each domain.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Challenge 2: Lambda scalability and relational database connection bottlenecks&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To implement these APIs, we started with a straightforward and efficient architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API Gateway&lt;/strong&gt;: Serving as the entry point for client requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Lambda&lt;/strong&gt;: Running serverless functions to process requests and business logic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aurora Serverless v1&lt;/strong&gt;: Providing relational database capabilities for data storage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Amazon Aurora Serverless v1 with the Data API was chosen for its PostgreSQL compatibility and because it avoided deploying Lambdas into a VPC or using RDS Proxy. However, the &lt;strong&gt;1,000 requests per second limit&lt;/strong&gt; of the Data API (see &lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/CHAP_Limits.html" rel="noopener noreferrer"&gt;AWS limits&lt;/a&gt;) became a bottleneck as usage grew. Aurora Serverless v2 was available at the time but lacked Data API support (which was only introduced in December 2023). To overcome these limitations, we adopted DynamoDB for its virtually unlimited scalability and seamless serverless integration.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Solution: Switching to DynamoDB&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To resolve the database bottleneck, we migrated to &lt;strong&gt;Amazon DynamoDB&lt;/strong&gt;, which provided:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Virtually unlimited scalability&lt;/strong&gt;: Handling spikes in concurrent requests without manual capacity provisioning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High availability&lt;/strong&gt;: Built-in multi-AZ replication, keeping latency low and consistent for our users.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This shift significantly improved our system's performance and ensured that the APIs could handle rapid traffic spikes. However, it also demanded a &lt;strong&gt;significant learning investment&lt;/strong&gt; in NoSQL concepts. In particular, we had to master modeling &lt;strong&gt;1-to-n&lt;/strong&gt; and &lt;strong&gt;n-to-m relationships&lt;/strong&gt; and applying &lt;strong&gt;single-table design patterns&lt;/strong&gt;, which are fundamental to building efficient, scalable NoSQL solutions.&lt;/p&gt;
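
&lt;p&gt;To make the single-table idea more concrete, here is a minimal sketch of modeling a 1-to-n relationship (one project owning many images) with composite keys. The entity names and key format are hypothetical illustrations, not our actual production schema.&lt;/p&gt;

```python
# Hypothetical single-table key design: the project metadata row and all
# of its image rows share the same partition key (PK), so one Query on
# the PK returns the project together with its images.
# (Illustrative only -- not the real ip.labs schema.)

def project_item(project_id, name):
    # Project metadata row: SK "META" sorts before the IMAGE# rows.
    return {"PK": f"PROJECT#{project_id}", "SK": "META", "name": name}

def image_item(project_id, image_id, s3_key):
    # Image rows reuse the project's PK and get a unique sort key.
    return {"PK": f"PROJECT#{project_id}", "SK": f"IMAGE#{image_id}", "s3_key": s3_key}

items = [
    project_item("42", "Summer photo book"),
    image_item("42", "001", "uploads/42/beach.jpg"),
    image_item("42", "002", "uploads/42/sunset.jpg"),
]

# All three items live in one partition, retrievable with a single Query.
same_partition = {i["PK"] for i in items}
print(same_partition)  # {'PROJECT#42'}
```

&lt;p&gt;An n-to-m relationship (e.g. shared projects) would follow the same pattern with an additional inverted index, but the core idea stays: design keys around access patterns, not entities.&lt;/p&gt;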

&lt;h3&gt;
  
  
  &lt;strong&gt;Challenge 3: Choosing the Right API Gateway&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Our initial implementation utilized &lt;strong&gt;HTTP APIs&lt;/strong&gt; on API Gateway, which offered impressive performance and lower costs. However, as we integrated more advanced features, particularly around user authorization, we encountered limitations.&lt;/p&gt;

&lt;p&gt;To address these, we switched to &lt;strong&gt;REST APIs&lt;/strong&gt;, which provided:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Enhanced integration with custom authorizers&lt;/strong&gt;: Streamlining secure access to resources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Greater flexibility in routing&lt;/strong&gt;: Supporting complex workflows between domains.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The switch allowed us to maintain the necessary performance while adding advanced features for authentication and access control. If you're deciding between &lt;strong&gt;REST APIs&lt;/strong&gt; and &lt;strong&gt;HTTP APIs&lt;/strong&gt; for your project, refer to &lt;a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/http-api-vs-rest.html" rel="noopener noreferrer"&gt;this AWS documentation&lt;/a&gt; for a detailed comparison of their capabilities and use cases.&lt;/p&gt;
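
&lt;p&gt;As a rough illustration of the custom-authorizer pattern mentioned above: a Lambda (TOKEN) authorizer returns an IAM policy document that API Gateway caches and enforces. The token check below is a deliberately naive placeholder, not our real authorization logic.&lt;/p&gt;

```python
# Minimal sketch of an API Gateway Lambda (TOKEN) authorizer.
# The token validation is a placeholder; a real authorizer would verify
# a signed token (e.g. a JWT) against an identity provider.

VALID_TOKENS = {"allow-me"}  # placeholder token store

def build_policy(principal_id, effect, method_arn):
    # IAM policy document in the shape API Gateway expects back.
    return {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "execute-api:Invoke",
                "Effect": effect,
                "Resource": method_arn,
            }],
        },
    }

def handler(event, context=None):
    token = event.get("authorizationToken", "")
    effect = "Allow" if token in VALID_TOKENS else "Deny"
    return build_policy("user", effect, event["methodArn"])

event = {"authorizationToken": "allow-me",
         "methodArn": "arn:aws:execute-api:eu-central-1:123:api/prod/GET/files"}
print(handler(event)["policyDocument"]["Statement"][0]["Effect"])  # Allow
```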

&lt;h3&gt;
  
  
  &lt;strong&gt;The Outcome&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;By addressing these challenges step by step, our architecture evolved into a robust, scalable solution that could support our workflows efficiently:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Domains&lt;/strong&gt;: Independently scalable APIs for File, Project, and Scheduler domains.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database&lt;/strong&gt;: A highly scalable and reliable backend powered by DynamoDB.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Gateway&lt;/strong&gt;: REST APIs for secure and feature-rich routing.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This architecture empowered us to meet our business goals while preparing for future feature integrations and user demands.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The journey from a monolithic system to a domain-driven architecture was not without its hurdles. However, each challenge—from database scalability to API design—provided valuable insights that strengthened our solution. In the next article, we’ll explore how we leveraged this architecture to optimize workflows and introduce new features seamlessly.&lt;/p&gt;

&lt;p&gt;Stay tuned!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>filestorage</category>
    </item>
    <item>
      <title>Highly scalable image storage solution with AWS Serverless at ip.labs - Part 1</title>
      <dc:creator>Firdaws Aboulaye</dc:creator>
      <pubDate>Tue, 12 Nov 2024 07:00:00 +0000</pubDate>
      <link>https://dev.to/faboulaye/highly-scalable-image-storage-solution-with-aws-serverless-at-iplabs-part-1-3ep8</link>
      <guid>https://dev.to/faboulaye/highly-scalable-image-storage-solution-with-aws-serverless-at-iplabs-part-1-3ep8</guid>
      <description>&lt;p&gt;&lt;strong&gt;Co-Author:&lt;/strong&gt; &lt;a href="https://dev.to/vkazulkin"&gt;Vadym Kazulkin&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.iplabs.com/" rel="noopener noreferrer"&gt;ip.labs&lt;/a&gt;, a 100% subsidiary of the FUJIFILM Group headquartered in Bonn, is a global leader in white-label e-commerce software for imaging. Our software enables end consumers to design and purchase a variety of photo products, including prints, posters, photo books, calendars, greeting cards, and personalized gifts. In addition, we provide integrations for payment systems, single sign-on (SSO) solutions, and a unique API that allows providers to create a shared cart, facilitating seamless addition of our solution alongside their existing business.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Journey to the Cloud&lt;/strong&gt;
&lt;/h3&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Initial Setup and Challenges&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Until 2017, ip.labs managed several on-premises data centers across the globe, operating a large monolithic architecture that relied on a centralized database and NFS storage. While effective initially, this setup presented significant challenges, including scalability limitations, central dependency on infrastructure, and integration hurdles.&lt;/p&gt;

&lt;p&gt;Recognizing the need for greater agility, we decided to fully transition to the cloud. AWS was selected as our cloud partner, and by 2021, all customers had been migrated to AWS, allowing us to shut down our data centers completely. The migration began as a “lift and shift” approach but soon evolved to incorporate several key improvements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automation&lt;/strong&gt; at both the infrastructure and deployment levels&lt;/li&gt;
&lt;li&gt;Adoption of &lt;strong&gt;immutable infrastructure&lt;/strong&gt; to eliminate hot patches&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blue-Green deployment&lt;/strong&gt; strategies to minimize downtime&lt;/li&gt;
&lt;li&gt;Introduction of independent core microservices, fostering a “you build it, you run it” mindset&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-scaling policies&lt;/strong&gt; to efficiently manage workload variations&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Legacy Image Storage Issues&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The cloud migration exposed several issues with our legacy image storage solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reliance on &lt;strong&gt;central AWS EFS&lt;/strong&gt; for image uploads/downloads from all EC2 servers, which created scalability and availability risks&lt;/li&gt;
&lt;li&gt;Lack of &lt;strong&gt;defined APIs&lt;/strong&gt; and limited &lt;strong&gt;testability&lt;/strong&gt; for image storage&lt;/li&gt;
&lt;li&gt;No clear ownership of the image storage solution&lt;/li&gt;
&lt;li&gt;Image storage remained part of the monolith, tightly coupled to many aspects of the self-written e-commerce system, which made integrations with third-party providers very hard&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Move to a New Image Storage Solution&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To address these issues and support future image-related features, we made the strategic decision to develop a standalone, API-driven image storage service. Since handling images is our core business, we opted to build our own solution rather than rely on third-party file storage providers.&lt;/p&gt;

&lt;p&gt;This new image storage system supports various workflows, including:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Standard Uploads:&lt;/strong&gt; Local uploads directly from users’ devices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mobile Uploads:&lt;/strong&gt; QR code-based uploads with optional long-term storage that extends beyond the typical session duration (usually six hours).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project Save and Load:&lt;/strong&gt; A feature that allows users to save intermediate designs, useful for more time-intensive products like photo books.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project Sharing:&lt;/strong&gt; Users can share saved projects in various modes, allowing recipients to create copies. Future plans include collaborative editing and commenting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cart Item Persistence:&lt;/strong&gt; Cart items, both internal and external, are saved across multiple days to support integration with shared/external carts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order Storage:&lt;/strong&gt; Orders are stored for a defined period, enabling re-submission to professional labs for production and delivery in case of issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated and Manual Cleanup:&lt;/strong&gt; Image deletion is initiated either manually by the user or automatically upon project expiration or user logout.&lt;/li&gt;
&lt;/ol&gt;
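
&lt;p&gt;To give a flavor of the mobile upload workflow above, here is a hypothetical sketch of the QR hand-off: the desktop session requests a short-lived upload URL, which is then rendered as a QR code for the phone. All names, the base URL, and the session handling are illustrative assumptions, not the actual ip.labs API.&lt;/p&gt;

```python
# Hypothetical sketch of a QR-based mobile upload hand-off.
# A short-lived upload session is created server-side; the resulting URL
# is what would be encoded into the QR code shown on the desktop.
import uuid
from datetime import datetime, timedelta, timezone

SESSION_TTL = timedelta(hours=6)  # typical session duration from the article

def create_upload_session(user_id, base_url="https://upload.example.com"):
    session_id = uuid.uuid4().hex
    expires_at = datetime.now(timezone.utc) + SESSION_TTL
    return {
        "user_id": user_id,
        "session_id": session_id,
        "expires_at": expires_at,
        # The phone opens this URL and uploads photos into the same session.
        "upload_url": f"{base_url}/mobile/{session_id}",
    }

session = create_upload_session("user-123")
print(session["upload_url"])
```

&lt;p&gt;The optional long-term storage mentioned above would simply extend the session's expiry instead of letting it lapse with the six-hour default.&lt;/p&gt;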

&lt;h3&gt;
  
  
  &lt;strong&gt;Serverless Transformation: Why AWS Serverless?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Our new file storage solution, built with AWS Serverless technologies, has now been rolled out to nearly all customers. This shift has not only enhanced our storage capabilities but has also paved the way for rapid feature development.&lt;/p&gt;

&lt;p&gt;We chose AWS Serverless for several reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Since 2018, we have experimented extensively with AWS Serverless across small and medium applications, experiencing firsthand the benefits of AWS-managed services (both authors of this series have given many conference and meetup talks about this journey since then).&lt;/li&gt;
&lt;li&gt;AWS Serverless enables rapid delivery, allowing our team to focus more on core capabilities.&lt;/li&gt;
&lt;li&gt;Adopting a &lt;strong&gt;Serverless-first approach&lt;/strong&gt; has fostered a culture of learning and knowledge-sharing within our development teams, further solidifying AWS Serverless as our preferred development model.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Advantages of the New Implementation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Our new image storage solution, built on AWS Serverless, offers several significant advantages over our previous monolithic architecture. Key benefits include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; Leveraging AWS Serverless allows us to automatically scale resources based on demand. This elasticity ensures that our system remains performant even during peak usage, providing a seamless experience for end users while reducing the need for manual scaling efforts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexibility for Future Expansion:&lt;/strong&gt; The modular, API-driven architecture makes it easy to extend our file storage to other solutions. This flexibility supports faster adaptation to new business needs and allows for smooth integration with both our own services and external partner platforms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enhanced Feature Integration:&lt;/strong&gt; With the serverless approach, we’ve been able to introduce new features more efficiently, such as mobile uploads and project sharing. These functionalities allow users to conveniently upload photos from any device and share projects with others, all seamlessly integrated into our new storage system.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This new implementation positions us well to continue adding value to our offerings while maintaining a highly scalable, flexible, and feature-rich platform.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This article has introduced the foundation of our AWS journey at ip.labs and our shift to a Serverless-first approach to redesign our image storage solution. In the upcoming series, we will explore the architecture of this new storage solution, diving into the details of the supporting APIs and the design choices behind them. &lt;/p&gt;

&lt;p&gt;See below a short overview of this architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7qcpnew9enjq1ll3pgeo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7qcpnew9enjq1ll3pgeo.png" alt="Architecture of the solution" width="800" height="1338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Stay tuned for the next article, where we’ll break down the architecture and functionality of the image storage solution’s APIs.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>filestorage</category>
      <category>cloudcomputing</category>
    </item>
  </channel>
</rss>
