<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vasil Shaikh</title>
    <description>The latest articles on DEV Community by Vasil Shaikh (@vasil_shaikh_f2c13100cab1).</description>
    <link>https://dev.to/vasil_shaikh_f2c13100cab1</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3675418%2F88ba5022-63eb-477a-a3ca-b5d3362e18d6.jpg</url>
      <title>DEV Community: Vasil Shaikh</title>
      <link>https://dev.to/vasil_shaikh_f2c13100cab1</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vasil_shaikh_f2c13100cab1"/>
    <language>en</language>
    <item>
      <title>Can AI Translate Technical Content into Indian Languages? Exploring Amazon Translate (English Marathi &amp; Hindi)</title>
      <dc:creator>Vasil Shaikh</dc:creator>
      <pubDate>Sun, 11 Jan 2026 08:27:56 +0000</pubDate>
      <link>https://dev.to/vasil_shaikh_f2c13100cab1/can-ai-translate-technical-content-into-indian-languages-exploring-amazon-translate-english--1aa5</link>
      <guid>https://dev.to/vasil_shaikh_f2c13100cab1/can-ai-translate-technical-content-into-indian-languages-exploring-amazon-translate-english--1aa5</guid>
      <description>&lt;p&gt;Hello World 👋&lt;/p&gt;

&lt;p&gt;I’m Vasil, a DevOps Engineer with a passion for building reliable, scalable, and well-architected cloud platforms. With hands-on experience across cloud infrastructure, CI/CD, observability, and platform engineering, I enjoy turning complex operational challenges into clean, automated solutions.&lt;/p&gt;

&lt;p&gt;I’ve been working with AWS Cloud for over 5 years, and I believe it’s high time I start exploring AI on AWS more deeply. Through these posts, I plan to share practical learnings, real-world experiences, and honest perspectives from my journey in DevOps, Cloud, and now AI.&lt;/p&gt;

&lt;p&gt;Without further delay — let’s dive in 🚀&lt;/p&gt;

&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;As someone who writes about AWS in English, I wanted to explore whether Amazon Translate could help make technical AWS content accessible to regional-language audiences.&lt;/p&gt;

&lt;p&gt;Instead of assuming it would “just work,” I approached this as an experiment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can Amazon Translate handle technical paragraphs?&lt;/li&gt;
&lt;li&gt;How does it perform for regional Indian languages like Marathi?&lt;/li&gt;
&lt;li&gt;How does that compare with a more widely supported language like Hindi?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This post documents what actually happens when you try this in practice — including the limitations.&lt;/p&gt;

&lt;h1&gt;
  
  
  A real-world architecture would look something like this
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ennw52xgvtzf70z6roq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ennw52xgvtzf70z6roq.png" alt="High-level architecture for translating English technical content into regional language using Amazon Translate" width="800" height="306"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This diagram shows a simple, real-world flow for publishing AWS content in regional languages.&lt;/p&gt;

&lt;p&gt;An author writes the original article in English and stores it as a text or markdown file in Amazon S3. That content is then passed to Amazon Translate, which converts it into Marathi. The translated output is stored back in S3 and can be published to platforms like Medium, dev.to, or internal documentation portals.&lt;/p&gt;

&lt;p&gt;You may notice AWS Lambda in the diagram as an optional component. In real production setups, Lambda is often used to automate this workflow (for example, triggering translation when a new file is uploaded).&lt;/p&gt;

&lt;p&gt;However, in this article we intentionally keep things simple and interact with Amazon Translate directly using the AWS CLI, without introducing Lambda or additional automation. This keeps the focus on understanding the service itself before adding more moving parts.&lt;/p&gt;
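&lt;p&gt;For readers curious what that optional Lambda automation might look like, here is a minimal, hypothetical sketch in Python (the key layout, language code, and event handling are my own assumptions for illustration; they are not part of the walkthrough below):&lt;/p&gt;

```python
def translated_key(key, lang):
    """Derive an output key, e.g. 'articles/post.md' to 'mr/articles/post.md'."""
    return f"{lang}/{key}"

def handler(event, context):
    """Hypothetical S3-triggered Lambda: translate a new English upload to Marathi."""
    import boto3  # available by default in the AWS Lambda Python runtime

    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]  # note: real S3 events URL-encode this key

    s3 = boto3.client("s3")
    translate = boto3.client("translate")

    text = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    result = translate.translate_text(
        Text=text,
        SourceLanguageCode="en",
        TargetLanguageCode="mr",
    )
    s3.put_object(
        Bucket=bucket,
        Key=translated_key(key, "mr"),
        Body=result["TranslatedText"].encode("utf-8"),
    )
```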

&lt;h1&gt;
  
  
  Prerequisites
&lt;/h1&gt;

&lt;p&gt;You’ll need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An AWS account&lt;/li&gt;
&lt;li&gt;AWS CLI configured locally (I’ll be using CloudShell)&lt;/li&gt;
&lt;li&gt;Basic familiarity with AWS services&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Let’s begin!
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijafhpc38v4l6emr70l2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijafhpc38v4l6emr70l2.png" alt=" " width="599" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Test Input (English Paragraph)
&lt;/h2&gt;

&lt;p&gt;To keep things realistic, I used a full paragraph from AWS documentation-style content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;What is Amazon Translate? Amazon Translate lets you localize content for diverse global users and translate and analyze large volumes of text to activate cross-lingual communication between users. Amazon Translate is a neural machine translation service that delivers fast, high-quality, affordable, and customizable language translation.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Marathi Translation (Observed Behavior)
&lt;/h2&gt;

&lt;p&gt;Command used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws translate translate-text \
  --region us-east-1 \
  --source-language-code en \
  --target-language-code mr \
  --text "What is Amazon Translate? Amazon Translate lets you localize content for diverse global users and translate and analyze large volumes of text to activate cross-lingual communication between users. Amazon Translate is a neural machine translation service that delivers fast, high-quality, affordable, and customizable language translation."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Actual output:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Amazon भाषांतर म्हणजे काय? Amazon Translate आपल्याला विविध जागतिक वापरकर्त्यांसाठी सामग्री स्थानिकीकरण करण्यास आणि वापरकर्त्यांमधील क्रॉस-भाषिक संप्रेषण सक्रिय करण्यासाठी मोठ्या Amazon Translate ही एक न्यूरल मशीन भाषांतर सेवा आहे जी जलद, उच्च-गुणवत्तेची, परवडणारी
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkw63m9f963syeixu0gle.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkw63m9f963syeixu0gle.png" alt=" " width="800" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s going on here?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The paragraph is clearly truncated&lt;/li&gt;
&lt;li&gt;Sentences merge abruptly&lt;/li&gt;
&lt;li&gt;The translation cuts off before completing the final thought&lt;/li&gt;
&lt;li&gt;Technical flow and readability suffer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The generated output cannot be used directly, and it can’t even be handed to an LLM for further refinement, because the text itself is incomplete and lacks flow. In short, it requires heavy human intervention.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hindi Translation (Same Paragraph)
&lt;/h2&gt;

&lt;p&gt;Using the exact same input:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws translate translate-text \
  --region us-east-1 \
  --source-language-code en \
  --target-language-code hi \
  --text "What is Amazon Translate? Amazon Translate lets you localize content for diverse global users and translate and analyze large volumes of text to activate cross-lingual communication between users. Amazon Translate is a neural machine translation service that delivers fast, high-quality, affordable, and customizable language translation."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Actual output:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Amazon Translate क्या है? Amazon Translate से आप विभिन्न वैश्विक उपयोगकर्ताओं के लिए सामग्री का स्थानीयकरण कर सकते हैं और उपयोगकर्ताओं के बीच अंतर-भाषी संचार को सक्रिय करने के लिए बड़ी मात्रा में टेक्स्ट का अनुवाद और विश्लेषण कर सकते हैं। Amazon Translate एक न्यूरल मशीन अनुवाद सेवा है जो तेज़, उच्च-गुणवत्ता, किफायती और अनुकूलन योग्य भाषा अनुवाद प्रदान करती है।
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ey7hefl8aq22g12nx9f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ey7hefl8aq22g12nx9f.png" alt=" " width="800" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Observations
&lt;/h1&gt;

&lt;p&gt;For the same input, the Hindi output was a clear improvement:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complete paragraph&lt;/li&gt;
&lt;li&gt;Proper sentence boundaries&lt;/li&gt;
&lt;li&gt;Natural flow&lt;/li&gt;
&lt;li&gt;Technically accurate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is publishable with minimal human editing.&lt;/p&gt;

&lt;h1&gt;
  
  
  Translation for Indian Regional Languages: A Broader Challenge
&lt;/h1&gt;

&lt;p&gt;It’s important to call out that what we’re seeing here is not unique to Amazon Translate.&lt;/p&gt;

&lt;p&gt;High-quality translation for Indian regional languages has always been a hard problem, even outside AWS and cloud services. This challenge shows up across traditional NLP systems and modern generative AI models alike.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Some of the reasons include:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Linguistic complexity&lt;/strong&gt;
Languages like Marathi have rich morphology, flexible sentence structures, and context-heavy grammar. Direct sentence-to-sentence mapping from English often loses meaning or flow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited high-quality training data&lt;/strong&gt;
Compared to English or Hindi, regional languages have significantly fewer large, clean, technical corpora available for training translation models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Technical vocabulary mismatch&lt;/strong&gt;
Cloud and software terminology often has no commonly accepted regional equivalent. Models must decide whether to transliterate, translate, or drop context entirely — which can lead to broken sentences.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mixed-language expectations&lt;/strong&gt;
In real-world usage, Indian technical writing often mixes English service names with regional language explanations. Handling this hybrid style consistently is still difficult for automated systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  What This Means in the Real World
&lt;/h1&gt;

&lt;p&gt;PLEASE NOTE!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhvdnsnvai7hswg1ouj3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhvdnsnvai7hswg1ouj3.png" alt=" " width="428" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Like I said, what we’re seeing here &lt;strong&gt;isn’t a failure&lt;/strong&gt; of Amazon Translate — it’s a reflection of the broader state of machine translation for Indian regional languages today.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Translate does support Marathi along with several other Indian regional languages, but for long technical paragraphs, the output can be unreliable.&lt;/li&gt;
&lt;li&gt;Hindi performs significantly better for the same technical content.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Breaking content into multiple smaller calls is possible, but:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It’s inefficient&lt;/li&gt;
&lt;li&gt;It’s not scalable&lt;/li&gt;
&lt;li&gt;It still doesn’t guarantee quality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is important to know before committing to a regional-language publishing workflow.&lt;/p&gt;
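&lt;p&gt;To make the “smaller calls” idea concrete, here is a rough Python sketch of sentence-level chunking. The splitter is deliberately naive, and the 2,000-character budget is an arbitrary illustration, not a documented service limit:&lt;/p&gt;

```python
import re

def split_sentences(text):
    """Naive sentence splitter: keeps terminal punctuation with each sentence."""
    parts = re.findall(r"[^.!?]*[.!?]", text)
    return [p.strip() for p in parts if p.strip()]

def chunk_sentences(sentences, max_chars=2000):
    """Group sentences into chunks that stay under max_chars each,
    so every translate-text call carries a manageable amount of text."""
    chunks, current = [], ""
    for sentence in sentences:
        candidate = (current + " " + sentence).strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Each chunk would then go into its own `aws translate translate-text` call.
```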

&lt;p&gt;&lt;strong&gt;Practical Takeaway&lt;/strong&gt;&lt;br&gt;
If you’re planning to localize content or technical documentation using Amazon Translate, then:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hindi:&lt;/strong&gt; Viable today for paragraph-level technical content&lt;br&gt;
&lt;strong&gt;Marathi (and similar regional languages):&lt;/strong&gt; Needs improvement before it can be used confidently without heavy human intervention&lt;br&gt;
&lt;strong&gt;A realistic approach today would be:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use Amazon Translate for exploration and drafts&lt;/li&gt;
&lt;li&gt;Rely on human review and editing for regional languages&lt;/li&gt;
&lt;li&gt;Avoid assuming parity across all supported languages&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Final Thoughts
&lt;/h1&gt;

&lt;p&gt;This experiment wasn’t about proving that Amazon Translate is perfect — it was about understanding where it works well and where it still struggles.&lt;/p&gt;

&lt;p&gt;For me, the takeaway is clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Translate is strong for widely used languages&lt;/li&gt;
&lt;li&gt;Regional technical localization is still a work in progress&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And that’s okay — knowing the limits is just as valuable as knowing the features.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ai</category>
      <category>translation</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Building an AI Document Processing Pipeline on AWS (Textract + Bedrock)</title>
      <dc:creator>Vasil Shaikh</dc:creator>
      <pubDate>Sat, 27 Dec 2025 12:37:46 +0000</pubDate>
      <link>https://dev.to/vasil_shaikh_f2c13100cab1/building-an-ai-document-processing-pipeline-on-aws-textract-bedrock-309f</link>
      <guid>https://dev.to/vasil_shaikh_f2c13100cab1/building-an-ai-document-processing-pipeline-on-aws-textract-bedrock-309f</guid>
      <description>&lt;p&gt;Hello World 👋&lt;/p&gt;

&lt;p&gt;I’m Vasil, a DevOps Engineer with a passion for building reliable, scalable, and well-architected cloud platforms. With hands-on experience across cloud infrastructure, CI/CD, observability, and platform engineering, I enjoy turning complex operational challenges into clean, automated solutions.&lt;/p&gt;

&lt;p&gt;I’ve been working with AWS Cloud for over 5 years, and I believe it’s high time I start exploring AI on AWS more deeply. Through these posts, I plan to share practical learnings, real-world experiences, and honest perspectives from my journey in DevOps, Cloud, and now AI.&lt;/p&gt;

&lt;p&gt;Without further delay — let’s dive in 🚀&lt;/p&gt;

&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;Processing large volumes of documents like PDFs, scanned images, invoices, or contracts is a common challenge in many businesses. AWS provides managed services that can significantly reduce the manual effort required to extract meaningful text and context from uncategorized documents.&lt;/p&gt;

&lt;p&gt;In this article, we’ll walk through a practical pattern for document processing using AI services on AWS:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon Textract&lt;/strong&gt; — for optical character recognition (OCR) and structured text extraction&lt;br&gt;
&lt;strong&gt;Amazon Bedrock&lt;/strong&gt; — for applying foundation models for understanding and summarization&lt;br&gt;
&lt;strong&gt;Amazon S3&lt;/strong&gt; — for document storage&lt;/p&gt;

&lt;p&gt;This pattern is useful for building Intelligent Document Processing (IDP) pipelines that go beyond simple OCR, adding structure, insights, and context using AI.&lt;/p&gt;
&lt;h1&gt;
  
  
  Prerequisites
&lt;/h1&gt;

&lt;p&gt;To follow this article, you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS account access&lt;/li&gt;
&lt;li&gt;AWS CLI configured (I’ll be using CloudShell)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Permissions for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon S3&lt;/li&gt;
&lt;li&gt;Amazon Textract&lt;/li&gt;
&lt;li&gt;Amazon Bedrock&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You’ll also need a document (PDF or image) to store in an S3 bucket.&lt;/p&gt;
&lt;h1&gt;
  
  
  Architecture Overview
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkgcnid9ipg2xwkkts047.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkgcnid9ipg2xwkkts047.png" alt="AI Document Processing Pipeline on AWS (Textract + Bedrock)" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s the high-level flow we’ll follow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Upload document to Amazon S3&lt;/li&gt;
&lt;li&gt;Use Amazon Textract to extract raw text and layout from the document&lt;/li&gt;
&lt;li&gt;Feed the extracted content into Amazon Bedrock foundation models for summarization or structured extraction (Note: You can use a Lambda function here but we will not be using it in this article to keep it simple)&lt;/li&gt;
&lt;li&gt;Store or present the processed results&lt;/li&gt;
&lt;/ul&gt;
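&lt;p&gt;For orientation, the same flow can be outlined in Python with boto3. This is a hypothetical sketch with illustrative names; the article itself sticks to the CLI:&lt;/p&gt;

```python
def build_textract_request(bucket, key):
    """Shape the Document argument Textract expects for S3-hosted files."""
    return {"S3Object": {"Bucket": bucket, "Name": key}}

def process_document(bucket, key):
    """Hypothetical outline: OCR a document and return its text for an LLM step."""
    import boto3  # imported lazily so the pure helper above has no dependencies

    textract = boto3.client("textract")
    # Note: the synchronous DetectDocumentText API accepts image formats
    # such as JPEG and PNG; multi-page PDFs generally require the async
    # StartDocumentTextDetection API instead.
    response = textract.detect_document_text(
        Document=build_textract_request(bucket, key)
    )
    lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
    return "\n".join(lines)
```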
&lt;h1&gt;
  
  
  Step-by-Step Implementation
&lt;/h1&gt;
&lt;h2&gt;
  
  
  Step 1: Create an S3 bucket for documents
&lt;/h2&gt;

&lt;p&gt;We’ll start by creating an S3 bucket to store input documents and AI outputs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; aws s3 mb s3://aws-textract-demo-vasil
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(You can replace the suffix with your name to make the bucket unique)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5hlhtjfzn41eeuy5wn1m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5hlhtjfzn41eeuy5wn1m.png" alt=" " width="800" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft0ifvirwmwgdu06yqh4r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft0ifvirwmwgdu06yqh4r.png" alt=" " width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Upload Document to S3
&lt;/h2&gt;

&lt;p&gt;Upload your document (PDF, PNG, JPG, etc.) to an S3 bucket.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws s3 cp path/to/local/document.pdf s3://your-bucket/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;&lt;strong&gt;Note:&lt;/strong&gt; If you’re using CloudShell like me, you’ll first need to upload the file into your CloudShell environment: from the top-right corner, click Actions &amp;gt; Upload file.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Froe1onp3v852m3t7lmsm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Froe1onp3v852m3t7lmsm.png" alt=" " width="800" height="226"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5insnzbky5e3oqit4fdi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5insnzbky5e3oqit4fdi.png" alt=" " width="800" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws s3 cp invoice.pdf s3://aws-textract-demo-vasil/invoice.pdf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzcfjcxde93nmqstjp7m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzcfjcxde93nmqstjp7m.png" alt=" " width="800" height="347"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr6j53952syrc8x6taowk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr6j53952syrc8x6taowk.png" alt="invoice uploaded successfully to bucket" width="800" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can upload any invoice you have or download a sample invoice pdf/image from Google. For this example, I used my own invoice that I got after a purchase.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Extract Text with Amazon Textract
&lt;/h2&gt;

&lt;p&gt;Textract converts documents into machine-readable text.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws textract detect-document-text \
  --region us-east-1 \
  --document '{
    "S3Object": {
      "Bucket": "&amp;lt;your-bucket-name&amp;gt;",
      "Name": "invoice.pdf"
    }
  }' &amp;gt; textract-output.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws textract detect-document-text \
  --region us-east-1 \
  --document '{
    "S3Object": {
      "Bucket": "aws-textract-demo",
      "Name": "invoice.pdf"
    }
  }' &amp;gt; textract-output.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And… drumroll 🥁…we hit our first error!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkd8hfivoy9fwggfmisuk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkd8hfivoy9fwggfmisuk.png" alt=" " width="800" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;An error occurred (AccessDeniedException) when calling the DetectDocumentText operation: User: arn:aws:iam::393078901895:user/kk_labs_user_465822 is not authorized to perform: textract:DetectDocumentText with an explicit deny in a service control policy&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What’s Happening Here?
&lt;/h2&gt;

&lt;p&gt;In a standard AWS account, this error can be resolved by granting the required Textract permissions via IAM.&lt;/p&gt;

&lt;p&gt;However, in managed sandbox environments (such as training labs), Textract APIs are often explicitly restricted using service or identity-based policies to control cost and data exposure.&lt;/p&gt;

&lt;p&gt;Because of this, the following steps assume a standard AWS account with appropriate permissions.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: The outputs shown below are representative of real Amazon Textract responses.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Sample Textract output&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "Blocks": [
    {
      "BlockType": "LINE",
      "Text": "Invoice Number: INV-1023",
      "Confidence": 99.2
    },
    {
      "BlockType": "LINE",
      "Text": "Total Amount: $1,250.00",
      "Confidence": 98.7
    },
    {
      "BlockType": "LINE",
      "Text": "Due Date: 15 Aug 2025",
      "Confidence": 97.9
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This produces structured JSON containing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;detected text&lt;/li&gt;
&lt;li&gt;layout information&lt;/li&gt;
&lt;li&gt;confidence scores&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Textract doesn’t just do OCR — it understands document structure. That makes it a much better input for LLMs than raw text extraction tools.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Step 4: Prepare text for Bedrock
&lt;/h2&gt;

&lt;p&gt;At this stage, you have raw text from Textract which may include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;words&lt;/li&gt;
&lt;li&gt;line breaks&lt;/li&gt;
&lt;li&gt;positional information&lt;/li&gt;
&lt;li&gt;nested JSON frames&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your goal is to extract the &lt;strong&gt;plain&lt;/strong&gt;, &lt;strong&gt;relevant&lt;/strong&gt; text and concatenate multiple pages into a single text blob. LLMs work best with clean, structured input, so we focus only on the meaningful lines.&lt;/p&gt;

&lt;p&gt;You can do this using &lt;strong&gt;Python&lt;/strong&gt; scripts, &lt;strong&gt;CLI&lt;/strong&gt; tools like &lt;code&gt;jq&lt;/code&gt;, or if you prefer automation use a &lt;strong&gt;Lambda function&lt;/strong&gt; that processes new documents automatically. Using Lambda is optional and not required for this guide.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;jq -r '.Blocks[] | select(.BlockType=="LINE") | .Text' textract-output.json &amp;gt; cleaned_text.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat cleaned_text.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Invoice Number: 12345
Date: 2025-12-27
Total Amount: $1,234.56
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbf7i6vb3p8qxpfacdu2f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbf7i6vb3p8qxpfacdu2f.png" alt=" " width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;
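&lt;p&gt;The same extraction can be done with a Python script instead of &lt;code&gt;jq&lt;/code&gt;, with the bonus of filtering out low-confidence lines (the 90% threshold here is an arbitrary choice for illustration):&lt;/p&gt;

```python
def extract_lines(textract_response, min_confidence=90.0):
    """Collect the text of LINE blocks, skipping low-confidence OCR results."""
    lines = []
    for block in textract_response.get("Blocks", []):
        if block.get("BlockType") == "LINE" and block.get("Confidence", 0.0) >= min_confidence:
            lines.append(block["Text"])
    return lines

# With the sample Textract response shown earlier:
sample = {
    "Blocks": [
        {"BlockType": "LINE", "Text": "Invoice Number: INV-1023", "Confidence": 99.2},
        {"BlockType": "LINE", "Text": "Total Amount: $1,250.00", "Confidence": 98.7},
    ]
}
print("\n".join(extract_lines(sample)))
```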

&lt;h2&gt;
  
  
  Step 5: Feed Textract Output into Amazon Bedrock
&lt;/h2&gt;

&lt;p&gt;We’ll use &lt;strong&gt;Meta Llama 3.2&lt;/strong&gt; (3B Instruct) via Amazon Bedrock to interpret the document.&lt;/p&gt;

&lt;p&gt;What We’ll Ask the Model&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Summarize this invoice and extract key details.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Bedrock CLI Invocation&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Safely read the cleaned text and escape it for JSON
PROMPT=$(jq -Rs . &amp;lt; cleaned_text.txt)

# Build the full prompt with instruction
FULL_PROMPT=$(jq -Rn --arg txt "$PROMPT" '$txt + "\n\nPlease summarize this invoice and provide only the key details in a structured format (Invoice Number, Date, Total Amount)."')

# Invoke Bedrock
aws bedrock-runtime invoke-model \
  --region us-east-1 \
  --model-id us.meta.llama3-2-3b-instruct-v1:0 \
  --content-type application/json \
  --accept application/json \
  --cli-binary-format raw-in-base64-out \
  --body "{
    \"prompt\": $FULL_PROMPT,
    \"max_gen_len\": 300,
    \"temperature\": 0.3
  }" \
  response.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;jq -Rs . &amp;lt; cleaned_text.txt&lt;/code&gt; escapes all newlines/quotes in your text.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;jq -Rn --arg txt "$PROMPT"&lt;/code&gt; appends your instruction safely to the JSON string.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--body&lt;/code&gt; contains valid JSON with “prompt” including your instruction.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This method avoids all the bash parsing issues and ensures the model gets the full instruction properly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Response
&lt;/h2&gt;

&lt;p&gt;Finally, do &lt;code&gt;cat response.json&lt;/code&gt; to see the response. You might see some generic output as well. The generic output is coming from the model itself, and it happens because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The content in cleaned_text.txt is very small or too generic (in our example, just a few lines like Invoice Number: 12345)&lt;/li&gt;
&lt;li&gt;The Llama 3.2 Instruct model interprets short prompts in a “template/illustrative” style and fills in plausible generic text.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7yngq0iu1fof71sy01p5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7yngq0iu1fof71sy01p5.png" alt=" " width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;
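&lt;p&gt;Rather than reading the raw JSON by eye, you can pull out just the generated text. Llama models on Bedrock typically return it under a &lt;code&gt;generation&lt;/code&gt; key; check your own &lt;code&gt;response.json&lt;/code&gt; if the schema differs:&lt;/p&gt;

```python
import json

def model_answer(response_body):
    """Return the generated text from a Llama-style Bedrock response body."""
    payload = json.loads(response_body)
    return payload.get("generation", "").strip()

# Hypothetical usage with the file written by invoke-model above:
# with open("response.json") as f:
#     print(model_answer(f.read()))
```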

&lt;h2&gt;
  
  
  What We Just Built
&lt;/h2&gt;

&lt;p&gt;✅ Stored documents in S3&lt;br&gt;
✅ Extracted structured data with Textract&lt;br&gt;
✅ Used an LLM to reason over documents&lt;br&gt;
✅ Built a logical AI pipeline — step by step&lt;/p&gt;

&lt;p&gt;All without deploying a single Lambda function.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where This Goes in the Real World
&lt;/h2&gt;

&lt;p&gt;With automation, this same pipeline can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Process invoices automatically&lt;/li&gt;
&lt;li&gt;Extract contract clauses&lt;/li&gt;
&lt;li&gt;Power document search&lt;/li&gt;
&lt;li&gt;Feed ERP or finance systems&lt;/li&gt;
&lt;li&gt;Trigger workflows based on document content&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Add Lambda or Step Functions later — the core AI flow stays the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Well done. Seriously — give yourself a pat 👏&lt;/p&gt;

&lt;p&gt;At this point, you’ve successfully used generative AI on AWS:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No apps&lt;/li&gt;
&lt;li&gt;No infrastructure&lt;/li&gt;
&lt;li&gt;Just clean, composable AI services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And yes — &lt;strong&gt;DO NOT&lt;/strong&gt; forget to clean up resources that you created during this walkthrough to avoid unexpected cost.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ai</category>
      <category>serverless</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Getting Started with AI on AWS: A Practical Guide</title>
      <dc:creator>Vasil Shaikh</dc:creator>
      <pubDate>Thu, 25 Dec 2025 16:06:29 +0000</pubDate>
      <link>https://dev.to/vasil_shaikh_f2c13100cab1/getting-started-with-ai-on-aws-a-practical-guide-3koa</link>
      <guid>https://dev.to/vasil_shaikh_f2c13100cab1/getting-started-with-ai-on-aws-a-practical-guide-3koa</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F288j8z069vjadzvzy44l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F288j8z069vjadzvzy44l.png" alt="Some AI services available on AWS" width="800" height="526"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hello World&lt;/em&gt; 👋&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I'm Vasil, a DevOps Engineer with a passion for building reliable, scalable, and well-architected cloud platforms. With hands-on experience across cloud infrastructure, CI/CD, observability, and platform engineering, I enjoy turning complex operational challenges into clean, automated solutions.&lt;br&gt;
I've been working with AWS Cloud for over 5 years, and I believe it's high time I start exploring AI on AWS more deeply. Through these posts, I plan to share practical learnings, real-world experiences, and honest perspectives from my journey in DevOps, Cloud, and now AI.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Without further delay - let's dive in 🚀&lt;/em&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;AWS offers a wide range of AI services today, from ready-to-use APIs like &lt;strong&gt;Rekognition&lt;/strong&gt; and &lt;strong&gt;Textract&lt;/strong&gt; to generative AI platforms such as &lt;strong&gt;Amazon Bedrock&lt;/strong&gt; and &lt;strong&gt;SageMaker&lt;/strong&gt;.&lt;br&gt;
The problem is not lack of choice - it's knowing where to start without overcomplicating things.&lt;br&gt;
This article is written for anyone who wants to learn AI on AWS the right way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;without training models on day one&lt;/li&gt;
&lt;li&gt;without managing GPUs&lt;/li&gt;
&lt;li&gt;without turning a simple idea into a complex architecture&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We'll look at how AWS AI services are structured, how they are typically used in real systems, and how to take your first practical step.&lt;/p&gt;
&lt;h1&gt;
  
  
  Prerequisites
&lt;/h1&gt;

&lt;p&gt;You don't need a data science background to follow this.&lt;br&gt;
You should have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An AWS account&lt;/li&gt;
&lt;li&gt;Basic familiarity with IAM and the AWS Console&lt;/li&gt;
&lt;li&gt;AWS CLI configured locally (I'll be using Cloudshell)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you can deploy a Lambda function or create an S3 bucket, you're ready.&lt;/p&gt;


&lt;h1&gt;
  
  
  Architecture Overview: How AI Fits Into AWS Applications
&lt;/h1&gt;

&lt;p&gt;Most AI workloads on AWS follow a simple pattern.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Application → API → AI Service → Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AI service is not the application itself. It is just another managed AWS service, similar to DynamoDB or S3.&lt;br&gt;
In practice, this usually means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API Gateway receives a request&lt;/li&gt;
&lt;li&gt;Lambda handles validation and logic&lt;/li&gt;
&lt;li&gt;An AI service (Bedrock, Textract, Comprehend, etc.) is called&lt;/li&gt;
&lt;li&gt;The result is returned or stored&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This keeps the system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;loosely coupled&lt;/li&gt;
&lt;li&gt;easy to scale&lt;/li&gt;
&lt;li&gt;aligned with the AWS Well-Architected Framework&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;
  
  
  Understanding AWS AI Services (In Simple Terms)
&lt;/h1&gt;
&lt;h2&gt;
  
  
  Amazon Bedrock
&lt;/h2&gt;

&lt;p&gt;Amazon Bedrock provides access to foundation models through simple API calls.&lt;br&gt;
You send a prompt → You get a response.&lt;br&gt;
No model training. No infrastructure to manage.&lt;br&gt;
This makes Bedrock a good starting point for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;text generation&lt;/li&gt;
&lt;li&gt;summarization&lt;/li&gt;
&lt;li&gt;chat-style applications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From an architecture perspective, Bedrock behaves like a serverless inference API.&lt;/p&gt;


&lt;h2&gt;
  
  
  Amazon SageMaker
&lt;/h2&gt;

&lt;p&gt;SageMaker is for cases where you need more control:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;training your own models&lt;/li&gt;
&lt;li&gt;fine-tuning existing ones&lt;/li&gt;
&lt;li&gt;running long-lived inference endpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If Bedrock is "plug and play", SageMaker is "build and operate".&lt;br&gt;
For most beginners, &lt;strong&gt;SageMaker is not where you start.&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Pre-Trained AI Services
&lt;/h2&gt;

&lt;p&gt;AWS also provides purpose-built AI APIs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rekognition → images and video&lt;/li&gt;
&lt;li&gt;Textract → documents&lt;/li&gt;
&lt;li&gt;Transcribe → speech&lt;/li&gt;
&lt;li&gt;Translate → language translation&lt;/li&gt;
&lt;li&gt;Comprehend → NLP analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These services solve very specific problems and integrate easily with S3 and Lambda.&lt;/p&gt;
&lt;h1&gt;
  
  
  First Practical Step: Calling Amazon Bedrock
&lt;/h1&gt;

&lt;p&gt;Rather than jumping into a full application, the safest way to start is to call Bedrock directly and understand how it behaves.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 1: Check Model Availability
&lt;/h2&gt;

&lt;p&gt;Bedrock is region-specific.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws bedrock list-foundation-models --region us-east-1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbu3etinfmmth0md35vof.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbu3etinfmmth0md35vof.png" alt=" " width="800" height="212"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foovc9ybeqd28dqu9ppp2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foovc9ybeqd28dqu9ppp2.png" alt=" " width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If this returns models, your account is ready.&lt;/p&gt;
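&lt;p&gt;The raw output is fairly long; a &lt;code&gt;jq&lt;/code&gt; filter can narrow it down to just the model IDs. The sketch below runs against a stand-in file with the same response shape, so you can try the filter without an AWS call:&lt;/p&gt;

```shell
# Stand-in for the list-foundation-models response shape.
cat > models.json <<'EOF'
{"modelSummaries":[{"modelId":"meta.llama3-2-3b-instruct-v1:0"},{"modelId":"amazon.titan-text-lite-v1"}]}
EOF

# Extract only the model IDs, then filter. Against the live API you would
# pipe the CLI output instead:
#   aws bedrock list-foundation-models --region us-east-1 | jq -r '.modelSummaries[].modelId'
jq -r '.modelSummaries[].modelId' models.json | grep llama
```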

&lt;h2&gt;
  
  
  Step 2: Create an IAM Role (Security First)
&lt;/h2&gt;

&lt;p&gt;Following the Security pillar, never use root or overly permissive roles.&lt;/p&gt;

&lt;p&gt;Minimal IAM policy example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel"
      ],
      "Resource": "*"
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Attach this policy to:&lt;/p&gt;

&lt;p&gt;A Lambda execution role or An IAM user used for experimentation&lt;br&gt;
&lt;em&gt;&lt;strong&gt;Note:&lt;/strong&gt; In production environments, this policy should be scoped down to specific foundation models and regions.&lt;/em&gt;&lt;/p&gt;
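&lt;p&gt;For reference, a scoped-down version might look like the sketch below (the region and foundation-model ARN are illustrative placeholders; substitute your own):&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel"
      ],
      "Resource": "arn:aws:bedrock:us-east-1::foundation-model/meta.llama3-2-3b-instruct-v1:0"
    }
  ]
}
```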
&lt;h2&gt;
  
  
  Step 3: Invoke a Model
&lt;/h2&gt;

&lt;p&gt;Example using Meta Llama:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws bedrock-runtime invoke-model \
  --region us-east-1 \
  --model-id meta.llama3-2-3b-instruct-v1:0 \
  --content-type application/json \
  --accept application/json \
  --cli-binary-format raw-in-base64-out \
  --body '{
    "messages": [
      {
        "role": "user",
        "content": [
          { "text": "Explain AWS AI services in simple terms" }
        ]
      }
    ],
    "max_gen_len": 300,
    "temperature": 0.5
  }' \
  response.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But wait — what are &lt;em&gt;max_gen_len&lt;/em&gt; and &lt;em&gt;temperature&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;max_gen_len&lt;/strong&gt; defines the maximum number of tokens the model is allowed to generate in the response. Lower values reduce cost and response time, while higher values allow more detailed outputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;temperature&lt;/strong&gt; controls how deterministic the response is. Lower values make outputs more predictable and focused, while higher values introduce more creativity and variation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd3ma3isvlcfysor8jp25.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd3ma3isvlcfysor8jp25.png" alt=" " width="800" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After running this command, we have successfully hit our first error!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd9tsp4pwrrlnssdaoccf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd9tsp4pwrrlnssdaoccf.png" alt=" " width="800" height="312"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s try to solve this…&lt;/p&gt;

&lt;p&gt;The error message is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;An error occurred (ValidationException) when calling the InvokeModel operation: Invocation of model ID meta.llama3-2-3b-instruct-v1:0 with on-demand throughput isn’t supported. Retry your request with the ID or ARN of an inference profile that contains this model.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This error means the &lt;code&gt;meta.llama3-2-3b-instruct-v1:0&lt;/code&gt; model on Amazon Bedrock cannot be invoked directly with "&lt;strong&gt;on-demand&lt;/strong&gt;" throughput. Instead, you must call it through an inference profile (a deployment configuration for the model) and pass the profile's ID or ARN as the model ID. This is typically required for models that need dedicated resources or a cross-region setup rather than simple pay-per-request invocation.&lt;/p&gt;

&lt;p&gt;To resolve this issue, you need to use an inference profile that contains the &lt;code&gt;meta.llama3-2-3b-instruct-v1:0&lt;/code&gt; model.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Refer below &lt;strong&gt;AWS:rePost&lt;/strong&gt; article for more info — &lt;a href="https://repost.aws/questions/QUEU82wbYVQk2oU4eNwyiong/bedrock-api-invocation-error-on-demand-throughput-isn-s-supported" rel="noopener noreferrer"&gt;https://repost.aws/questions/QUEU82wbYVQk2oU4eNwyiong/bedrock-api-invocation-error-on-demand-throughput-isn-s-supported&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let’s now find the inference profile for our model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws bedrock list-inference-profiles --region us-east-1 | grep meta.llama3-2-3b-instruct-v1:0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsyfdzgt23c89yqzitk2v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsyfdzgt23c89yqzitk2v.png" alt=" " width="800" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Copy the &lt;code&gt;inferenceProfileId&lt;/code&gt; (this will be our &lt;code&gt;--model-id&lt;/code&gt; now)&lt;/p&gt;

&lt;p&gt;So the new command is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws bedrock-runtime invoke-model \
  --region us-east-1 \
  --model-id us.meta.llama3-2-3b-instruct-v1:0 \
  --content-type application/json \
  --accept application/json \
  --cli-binary-format raw-in-base64-out \
  --body '{
    "messages": [
      {
        "role": "user",
        "content": [
          { "text": "Explain AWS AI services in simple terms" }
        ]
      }
    ],
    "max_gen_len": 300,
    "temperature": 0.5
  }' \
  response.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;BUT, we have a new error!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbctomx31aio01m9e6nsi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbctomx31aio01m9e6nsi.png" alt=" " width="795" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foaiokm3i5vqsn351x9cd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foaiokm3i5vqsn351x9cd.png" alt=" " width="800" height="283"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But this time the error is about the message format in the body, which means we are on the right track. Let’s fix this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws bedrock-runtime invoke-model \
  --region us-east-1 \
  --model-id us.meta.llama3-2-3b-instruct-v1:0 \
  --content-type application/json \
  --accept application/json \
  --cli-binary-format raw-in-base64-out \
  --body '{
    "prompt": "Explain AWS AI services in simple terms",
    "max_gen_len": 300,
    "temperature": 0.5
  }' \
  response.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then use &lt;code&gt;cat&lt;/code&gt; command to view the response stored in the &lt;code&gt;response.json&lt;/code&gt; file&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat response.json | jq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwlmvm6eq9yimibn4xxzz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwlmvm6eq9yimibn4xxzz.png" alt=" " width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Well done!&lt;/strong&gt; You’ve just successfully invoked generative AI on AWS — without writing an application or managing any infrastructure.&lt;/p&gt;
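&lt;p&gt;The Llama response body carries the generated text in a &lt;code&gt;generation&lt;/code&gt; field, alongside token counts. A quick way to pull out just the text with &lt;code&gt;jq&lt;/code&gt;, demonstrated here against a stand-in &lt;code&gt;response.json&lt;/code&gt; with the same shape:&lt;/p&gt;

```shell
# Stand-in for the file written by invoke-model (same field layout).
cat > response.json <<'EOF'
{"generation":"AWS AI services are managed APIs you call over HTTPS.","prompt_token_count":9,"generation_token_count":12,"stop_reason":"stop"}
EOF

# Print only the generated text.
jq -r '.generation' response.json

# Total tokens for the call (input + output) -- useful for tracking cost.
jq '.prompt_token_count + .generation_token_count' response.json
```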

&lt;h1&gt;
  
  
  How This Fits Real Applications
&lt;/h1&gt;

&lt;p&gt;In real systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The prompt usually comes from an API request&lt;/li&gt;
&lt;li&gt;Lambda formats the request&lt;/li&gt;
&lt;li&gt;Bedrock generates a response&lt;/li&gt;
&lt;li&gt;The result may be stored in S3 or DynamoDB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key idea is simple:&lt;br&gt;
&lt;em&gt;AI is a service call, not a separate system.&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Cost and Quota Considerations
&lt;/h1&gt;

&lt;p&gt;AI services are not free, and costs scale quickly if you are not careful.&lt;/p&gt;

&lt;h3&gt;
  
  
  Things to watch:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Token count (both input and output) directly affects Bedrock cost&lt;/li&gt;
&lt;li&gt;Repeated calls inside loops can get expensive&lt;/li&gt;
&lt;li&gt;SageMaker endpoints are billed while running&lt;/li&gt;
&lt;/ul&gt;
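&lt;p&gt;Since Bedrock bills per token, a quick back-of-envelope check before running a batch can save surprises. The per-1K-token prices in the sketch below are placeholders, not real rates (always take current numbers from the Bedrock pricing page for your model and region):&lt;/p&gt;

```shell
# Rough per-call cost estimate. The prices are PLACEHOLDERS, not real
# Bedrock rates; look up current pricing for your model and region.
INPUT_TOKENS=900
OUTPUT_TOKENS=300
IN_PRICE_PER_1K=0.0001   # $ per 1K input tokens (placeholder)
OUT_PRICE_PER_1K=0.0002  # $ per 1K output tokens (placeholder)

COST=$(awk -v i="$INPUT_TOKENS" -v o="$OUTPUT_TOKENS" \
          -v ip="$IN_PRICE_PER_1K" -v op="$OUT_PRICE_PER_1K" \
  'BEGIN { printf "%.6f", (i/1000)*ip + (o/1000)*op }')
echo "estimated cost: \$$COST USD"
```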

&lt;h3&gt;
  
  
  Always start with:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;small inputs&lt;/li&gt;
&lt;li&gt;clear limits&lt;/li&gt;
&lt;li&gt;usage monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cost optimization is not optional — it’s part of good architecture.&lt;/p&gt;

&lt;h1&gt;
  
  
  Final Thoughts
&lt;/h1&gt;

&lt;p&gt;Learning AI on AWS does not require a complex setup.&lt;/p&gt;

&lt;p&gt;If you understand how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;call managed services&lt;/li&gt;
&lt;li&gt;secure them properly&lt;/li&gt;
&lt;li&gt;control costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;you are already on the right path.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>generativeai</category>
      <category>machinelearning</category>
      <category>cloud</category>
    </item>
  </channel>
</rss>
