<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Oteng Isaac</title>
    <description>The latest articles on DEV Community by Oteng Isaac (@devoteng1).</description>
    <link>https://dev.to/devoteng1</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F302964%2F41d78add-1e6b-46fa-9462-856ad47c5e05.jpeg</url>
      <title>DEV Community: Oteng Isaac</title>
      <link>https://dev.to/devoteng1</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/devoteng1"/>
    <language>en</language>
    <item>
      <title>Building a Production-Ready Text-to-Text API with AWS Bedrock, Lambda &amp; API Gateway</title>
      <dc:creator>Oteng Isaac</dc:creator>
      <pubDate>Wed, 31 Dec 2025 00:20:07 +0000</pubDate>
      <link>https://dev.to/aws-builders/building-a-production-ready-text-to-text-api-with-aws-bedrock-lambda-api-gateway-305a</link>
      <guid>https://dev.to/aws-builders/building-a-production-ready-text-to-text-api-with-aws-bedrock-lambda-api-gateway-305a</guid>
      <description>&lt;h2&gt;
  
  
  Building a Production-Ready Text-to-Text API with AWS Bedrock, Lambda &amp;amp; API Gateway
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Project Overview
&lt;/h3&gt;

&lt;p&gt;This project demonstrates how to design and deploy a production-ready text-to-text AI API using AWS Bedrock and Amazon Titan Text, exposed securely via Amazon API Gateway and powered by AWS Lambda.&lt;/p&gt;

&lt;p&gt;The goal is to show how organizations can integrate Generative AI capabilities into real business systems while maintaining security, scalability, cost control, and observability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Business Use Case
&lt;/h3&gt;

&lt;p&gt;Many organizations want to leverage Generative AI for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Internal copilots&lt;/li&gt;
&lt;li&gt;Automated content generation&lt;/li&gt;
&lt;li&gt;Text summarization&lt;/li&gt;
&lt;li&gt;Data explanations&lt;/li&gt;
&lt;li&gt;Customer support automation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;However, directly exposing foundation models to applications can introduce security, cost, and governance risks.&lt;/p&gt;

&lt;p&gt;This project solves that by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Abstracting the foundation model behind a controlled API&lt;/li&gt;
&lt;li&gt;Enforcing consistent prompts and parameters&lt;/li&gt;
&lt;li&gt;Centralizing access, logging, and cost management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result is a secure AI service layer that can be reused across multiple teams and applications.&lt;/p&gt;
&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs1h4v5dthrxricbnc90q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs1h4v5dthrxricbnc90q.png" alt=" " width="800" height="293"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Flow:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Client sends text input to an API endpoint&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;API Gateway validates and routes the request&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lambda processes the request and invokes AWS Bedrock&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Amazon Titan generates a text response&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The response is returned to the client&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  🛠️ Tools &amp;amp; Services Used
&lt;/h3&gt;
&lt;h4&gt;
  
  
  🔹 AWS Bedrock
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Fully managed service for accessing foundation models&lt;/li&gt;
&lt;li&gt;No infrastructure to manage&lt;/li&gt;
&lt;li&gt;Enterprise-grade security&lt;/li&gt;
&lt;li&gt;Pay-per-use pricing&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  🔹 Amazon Titan Text (amazon.titan-text-express-v1)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Fast, cost-efficient text generation model&lt;/li&gt;
&lt;li&gt;Ideal for text-to-text use cases&lt;/li&gt;
&lt;li&gt;Deterministic behavior with low temperature&lt;/li&gt;
&lt;li&gt;Designed for enterprise workloads&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  🔹 AWS Lambda
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Serverless compute for business logic&lt;/li&gt;
&lt;li&gt;Handles request validation and AI invocation&lt;/li&gt;
&lt;li&gt;Scales automatically&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  🔹 Amazon API Gateway
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Securely exposes the AI service as a REST API&lt;/li&gt;
&lt;li&gt;Enables authentication, throttling, and monitoring&lt;/li&gt;
&lt;li&gt;Acts as the public interface for applications&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  🔹 Python (Boto3)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;AWS SDK for invoking Bedrock&lt;/li&gt;
&lt;li&gt;Lightweight and production-friendly&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  🧠 Why This Design Matters
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Stateless AI calls: Foundation models do not retain memory&lt;/li&gt;
&lt;li&gt;Explicit control: Prompts and parameters are centrally managed&lt;/li&gt;
&lt;li&gt;Security-first: IAM-controlled access to Bedrock&lt;/li&gt;
&lt;li&gt;Cost management: Token limits and model choice enforced&lt;/li&gt;
&lt;li&gt;Reusability: Multiple applications can consume the same API&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This mirrors how AI platforms are built in regulated and enterprise environments.&lt;/p&gt;
&lt;h4&gt;
  
  
  🧩 AWS Lambda: Text-to-Text Processing Logic
&lt;/h4&gt;

&lt;p&gt;Below is an example AWS Lambda function written in Python that receives text from API Gateway, invokes AWS Bedrock (Amazon Titan Text), and returns the generated response.&lt;/p&gt;

&lt;p&gt;This Lambda acts as the controlled AI service layer between your applications and the foundation model.&lt;/p&gt;
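
&lt;p&gt;Here is a condensed sketch of what such a handler can look like. The &lt;code&gt;{"text": "..."}&lt;/code&gt; request and &lt;code&gt;{"response": "..."}&lt;/code&gt; shapes match the API example later in this post; the parameter values are assumptions on my part, and the complete version lives in the GitHub repo linked below.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch only -- parameter values are assumptions; see the GitHub repo for the full handler.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def lambda_handler(event, context):
    # With Lambda proxy integration, API Gateway passes the request body as a string
    body = json.loads(event.get("body") or "{}")
    text = body.get("text", "")
    if not text:
        return {"statusCode": 400, "body": json.dumps({"error": "text is required"})}

    # Invoke Amazon Titan Text Express with centrally managed parameters
    result = bedrock.invoke_model(
        modelId="amazon.titan-text-express-v1",
        body=json.dumps({
            "inputText": text,
            "textGenerationConfig": {"maxTokenCount": 512, "temperature": 0.2}
        })
    )
    output = json.loads(result["body"].read())["results"][0]["outputText"]

    return {"statusCode": 200, "body": json.dumps({"response": output})}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;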

&lt;p&gt;Let's create the function.&lt;/p&gt;

&lt;p&gt;Go to the AWS Management Console and search for AWS Lambda. &lt;br&gt;
Click on &lt;strong&gt;Create function&lt;/strong&gt; to open the function creation page. Enter a name for the function and choose &lt;strong&gt;Python&lt;/strong&gt; as the runtime. Accept all defaults and click on &lt;strong&gt;Create function&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi9x1k2qwuz23ahf11tjt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi9x1k2qwuz23ahf11tjt.png" alt=" " width="800" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Replace the code in the code editor with the code shared in the &lt;a href="https://github.com/isaacotengdev/AWS-bedrock-text-to-text-API-" rel="noopener noreferrer"&gt;Github Repo&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F16zkx7jtrgrlaze88hx2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F16zkx7jtrgrlaze88hx2.png" alt=" " width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Increase the timeout to 30 seconds as shown below. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56ewhuec2y57782bz9wp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56ewhuec2y57782bz9wp.png" alt=" " width="800" height="392"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  🔐 Required IAM Permissions
&lt;/h4&gt;

&lt;p&gt;The Lambda execution role must allow invoking Bedrock models.&lt;br&gt;
By default, Lambda has access to &lt;strong&gt;CloudWatch&lt;/strong&gt; for log writing; we also need to grant Lambda access to Bedrock and the foundation model.&lt;/p&gt;

&lt;p&gt;Go to &lt;strong&gt;Configuration&lt;/strong&gt;, then &lt;strong&gt;Permissions&lt;/strong&gt;. Click on the &lt;strong&gt;Role name&lt;/strong&gt; and update it with the policy below, which grants Lambda access to Bedrock and the foundation model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftjp7u4k3x841o0jbck6a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftjp7u4k3x841o0jbck6a.png" alt=" " width="800" height="288"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowInvokeTitanText",
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel"
      ],
      "Resource": [
        "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-text-express-v1"
      ]
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
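
&lt;p&gt;If you prefer to script this step instead of using the console, the same inline policy can be attached to the execution role with boto3. The role and policy names below are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import boto3

iam = boto3.client("iam")

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowInvokeTitanText",
        "Effect": "Allow",
        "Action": ["bedrock:InvokeModel"],
        "Resource": ["arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-text-express-v1"]
    }]
}

# Attach the policy inline to the Lambda execution role (placeholder names)
iam.put_role_policy(
    RoleName="my-bedrock-lambda-role",
    PolicyName="AllowInvokeTitanText",
    PolicyDocument=json.dumps(policy)
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;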



&lt;h4&gt;
  
  
  🌐 API Gateway Request Example
&lt;/h4&gt;

&lt;p&gt;Let's create the API using AWS API Gateway. In the AWS API Gateway service page click on &lt;strong&gt;Create API&lt;/strong&gt;. Choose &lt;strong&gt;REST API&lt;/strong&gt; as the type and click on &lt;strong&gt;Build&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkc7p32io62tjd6h5opng.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkc7p32io62tjd6h5opng.png" alt=" " width="800" height="111"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the resources page, click on &lt;strong&gt;Create resource&lt;/strong&gt;. Give the resource a name and click on &lt;strong&gt;Create resource&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6iu6wce7jqe0thmmsmrg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6iu6wce7jqe0thmmsmrg.png" alt=" " width="800" height="156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click on &lt;strong&gt;Create Method&lt;/strong&gt;. Choose method type as &lt;strong&gt;POST&lt;/strong&gt; and Integration type as &lt;strong&gt;Lambda function&lt;/strong&gt;. Check &lt;strong&gt;Lambda proxy integration&lt;/strong&gt; and select the created &lt;strong&gt;lambda function&lt;/strong&gt;. Click on &lt;strong&gt;Create method&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1xiyydx66oq0i3mtd6hy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1xiyydx66oq0i3mtd6hy.png" alt=" " width="800" height="489"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click on &lt;strong&gt;Deploy API&lt;/strong&gt;. Select &lt;strong&gt;&lt;em&gt;New stage&lt;/em&gt;&lt;/strong&gt; and enter a &lt;strong&gt;Stage name&lt;/strong&gt;. Click on &lt;strong&gt;Deploy&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3g2p2jgp37m2n0ozwv9e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3g2p2jgp37m2n0ozwv9e.png" alt=" " width="601" height="519"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the &lt;strong&gt;Stage details page&lt;/strong&gt;, copy the &lt;strong&gt;Invoke URL&lt;/strong&gt;. You can use any API client like &lt;strong&gt;Postman&lt;/strong&gt; to test the API as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6qrmribt7fhjwyfw77zz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6qrmribt7fhjwyfw77zz.png" alt=" " width="800" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Test using any API client. In this demonstration, I used Postman, as shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
POST https://05q0if5orb.execute-api.us-east-1.amazonaws.com/prod/text
{
    "text": "what is Amazon Bedrock"
}


✅ API Response Example


{
    "response": "\nAmazon Bedrock is the name of AWS’s managed service for managing the underlying infrastructure that powers your intelligent bot. It is a collection of services that you can use to build, deploy, and scale intelligent bots at scale. Amazon Bedrock is a managed service that makes foundation models from leading AI startup and Amazon’s own Titan models available through APIs. For up-to-date information on Amazon Bedrock and how 3P models are approved, endorsed or selected please see the provided documentation and relevant FAQs."
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forkmz7ett4s7td201h3y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forkmz7ett4s7td201h3y.png" alt=" " width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;
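
&lt;p&gt;Outside Postman, any application can call the same endpoint. Below is a minimal Python example using only the standard library; it reuses the example invoke URL shown above, which you would replace with your own stage URL.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Calling the deployed endpoint from Python (replace the URL with your own Invoke URL)
import json
import urllib.request

url = "https://05q0if5orb.execute-api.us-east-1.amazonaws.com/prod/text"
payload = json.dumps({"text": "what is Amazon Bedrock"}).encode("utf-8")

req = urllib.request.Request(url, data=payload, headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;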

&lt;h4&gt;
  
  
  🧠 Why This Lambda Design Matters
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Keeps foundation models behind a secure API&lt;/li&gt;
&lt;li&gt;Enforces consistent parameters (temperature, token limits)&lt;/li&gt;
&lt;li&gt;Prevents direct client access to Bedrock&lt;/li&gt;
&lt;li&gt;Enables logging, monitoring, and governance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This pattern is commonly used to build enterprise AI platforms.&lt;/p&gt;

&lt;h4&gt;
  
  
  📦 Example Use Cases
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Text summarization API&lt;/li&gt;
&lt;li&gt;AI-powered content generation service&lt;/li&gt;
&lt;li&gt;Analytics explanation engine&lt;/li&gt;
&lt;li&gt;Internal AI assistant backend&lt;/li&gt;
&lt;li&gt;Secure GenAI microservice&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>serverless</category>
      <category>aws</category>
      <category>ai</category>
      <category>api</category>
    </item>
    <item>
      <title>🧠 Building a Conversational Chatbot with AWS Bedrock (Amazon Titan)</title>
      <dc:creator>Oteng Isaac</dc:creator>
      <pubDate>Fri, 26 Dec 2025 23:12:24 +0000</pubDate>
      <link>https://dev.to/aws-builders/building-a-conversational-chatbot-with-aws-bedrock-amazon-titan-4kll</link>
      <guid>https://dev.to/aws-builders/building-a-conversational-chatbot-with-aws-bedrock-amazon-titan-4kll</guid>
      <description>&lt;h1&gt;
  
  
  🧠 Building a Conversational Chatbot with AWS Bedrock (Amazon Titan)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;Large Language Models don’t magically “remember” conversations.&lt;br&gt;&lt;br&gt;
In real-world systems, &lt;strong&gt;conversation state must be explicitly managed&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this project, we build a &lt;strong&gt;deterministic, production-style conversational chatbot&lt;/strong&gt; using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;AWS Bedrock&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Amazon Titan Text (&lt;code&gt;amazon.titan-text-express-v1&lt;/code&gt;)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python (boto3)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This project demonstrates how teams can safely integrate &lt;strong&gt;foundation models into enterprise workflows&lt;/strong&gt; without giving up control, observability, or reproducibility.&lt;/p&gt;
&lt;h2&gt;
  
  
  🔷 What Is AWS Bedrock?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Amazon Bedrock&lt;/strong&gt; is a &lt;strong&gt;fully managed service&lt;/strong&gt; that provides access to multiple&lt;br&gt;&lt;br&gt;
&lt;strong&gt;foundation models (FMs)&lt;/strong&gt; via a single API — without requiring you to manage infrastructure.&lt;/p&gt;

&lt;p&gt;With Bedrock, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Invoke models securely using IAM&lt;/li&gt;
&lt;li&gt;Choose models from different providers&lt;/li&gt;
&lt;li&gt;Keep data within AWS (no model training on your prompts by default)&lt;/li&gt;
&lt;li&gt;Integrate generative AI directly into existing AWS architectures&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  🔑 Key Bedrock Characteristics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Serverless (no infrastructure management)&lt;/li&gt;
&lt;li&gt;Model-agnostic API&lt;/li&gt;
&lt;li&gt;Enterprise-grade security&lt;/li&gt;
&lt;li&gt;Pay-per-use pricing&lt;/li&gt;
&lt;li&gt;Native AWS integration&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  💼 Business Use Cases of AWS Bedrock
&lt;/h2&gt;

&lt;p&gt;AWS Bedrock is designed for &lt;strong&gt;real business workloads&lt;/strong&gt;, not just demos.&lt;/p&gt;
&lt;h3&gt;
  
  
  Common Enterprise Use Cases
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt; Internal chatbots &amp;amp; AI copilots
&lt;/li&gt;
&lt;li&gt; Document summarization &amp;amp; analysis
&lt;/li&gt;
&lt;li&gt; Automated reporting &amp;amp; insight generation
&lt;/li&gt;
&lt;li&gt; Semantic search over internal data
&lt;/li&gt;
&lt;li&gt; AI-assisted debugging &amp;amp; data quality analysis
&lt;/li&gt;
&lt;li&gt; Analytics narrative generation
&lt;/li&gt;
&lt;li&gt; Customer support automation
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Why Companies Choose Bedrock
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Data &lt;strong&gt;never leaves AWS&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;IAM-controlled access&lt;/li&gt;
&lt;li&gt;Works seamlessly with &lt;strong&gt;S3, Lambda, Glue, Databricks, Redshift&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;No lock-in to a single model provider&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  🔶 Amazon Titan Text (&lt;code&gt;amazon.titan-text-express-v1&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Amazon Titan Text Express&lt;/strong&gt; is a fast, cost-efficient text generation model built by AWS.&lt;/p&gt;
&lt;h3&gt;
  
  
  Key Characteristics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Optimized for &lt;strong&gt;low-latency text generation&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Ideal for &lt;strong&gt;chatbots, summarization, and explanations&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Deterministic behavior when temperature is low&lt;/li&gt;
&lt;li&gt;Fully managed and secured by AWS&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  When to Use Titan Text Express
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Conversational assistants&lt;/li&gt;
&lt;li&gt;Structured responses&lt;/li&gt;
&lt;li&gt;Enterprise-safe workloads&lt;/li&gt;
&lt;li&gt;Cost-sensitive applications&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ Titan does &lt;strong&gt;not&lt;/strong&gt; manage conversation state — which is why explicit memory handling (as shown in this project) is essential.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  🏗 Architecture Overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsr7fg78bcedf8nz1ehg1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsr7fg78bcedf8nz1ehg1.png" alt=" " width="800" height="236"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The entire conversation history is sent &lt;strong&gt;on every request&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  🧠 Core Design Decisions
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1️⃣ Explicit Conversation Memory
&lt;/h3&gt;

&lt;p&gt;Amazon Titan does not track sessions.&lt;/p&gt;

&lt;p&gt;We:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Store user and assistant messages&lt;/li&gt;
&lt;li&gt;Append them to a history list&lt;/li&gt;
&lt;li&gt;Inject the full history into each prompt&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes the system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predictable&lt;/li&gt;
&lt;li&gt;Auditable&lt;/li&gt;
&lt;li&gt;Easy to debug&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  2️⃣ Role-Based Prompt Formatting
&lt;/h3&gt;

&lt;p&gt;Conversation is formatted as:&lt;/p&gt;

&lt;p&gt;User: ...&lt;br&gt;
Assistant: ...&lt;/p&gt;

&lt;p&gt;This significantly improves response quality and consistency.&lt;/p&gt;
&lt;h3&gt;
  
  
  3️⃣ Stop Sequences
&lt;/h3&gt;

&lt;p&gt;We configure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"stopSequences"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"User:"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prevents the model from hallucinating the next user message.&lt;/p&gt;

&lt;h3&gt;
  
  
  4️⃣ Deterministic Generation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Low temperature&lt;/li&gt;
&lt;li&gt;Explicit assistant cue&lt;/li&gt;
&lt;li&gt;Token limits&lt;/li&gt;
&lt;/ul&gt;
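
&lt;p&gt;The sketch below ties design decisions 1 to 4 together in a single helper. The prompt layout and parameter values are assumptions on my part; the full script lives in the repo linked below.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch only -- prompt layout and parameter values are assumptions; see the repo for the full script.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
history = []  # explicit conversation memory: list of (role, message) pairs

def ask(user_message):
    history.append(("User", user_message))
    # Role-based prompt: replay the full history on every request, then cue the assistant
    prompt = "\n".join(f"{role}: {msg}" for role, msg in history) + "\nAssistant:"
    result = bedrock.invoke_model(
        modelId="amazon.titan-text-express-v1",
        body=json.dumps({
            "inputText": prompt,
            "textGenerationConfig": {
                "maxTokenCount": 512,
                "temperature": 0.2,          # low temperature keeps replies near-deterministic
                "stopSequences": ["User:"]   # stop before the model invents the next user turn
            }
        })
    )
    reply = json.loads(result["body"].read())["results"][0]["outputText"].strip()
    history.append(("Assistant", reply))
    return reply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;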

&lt;h2&gt;
  
  
  How to Run the Project
&lt;/h2&gt;

&lt;p&gt;Clone the GitHub repo: &lt;a href="https://github.com/isaacotengdev/Amazon_Bedrock_Chatbot" rel="noopener noreferrer"&gt;AWS Bedrock Chatbot (Titan)&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Python 3.9+&lt;/li&gt;
&lt;li&gt;AWS credentials configured
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Install Dependencies
pip install boto3

Run the Chatbot
python chatbot.py


Type exit to quit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also experiment with other models available on the AWS Bedrock service page.&lt;/p&gt;

&lt;p&gt;On the AWS Bedrock service page, click on &lt;strong&gt;Model Catalog&lt;/strong&gt;. Here you have access to other model providers such as Meta, Anthropic, and Mistral AI, and you can search for models from the same or different providers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F327zm73j0ow90yft3z8t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F327zm73j0ow90yft3z8t.png" alt=" " width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Click on a model, read its documentation to understand how to use it in your project, and copy the model's ID.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffmfrlox8l9bolahqwujq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffmfrlox8l9bolahqwujq.png" alt=" " width="800" height="501"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aws</category>
    </item>
    <item>
      <title>AWS Glue ETL Jobs: Transform Your Data at Scale</title>
      <dc:creator>Oteng Isaac</dc:creator>
      <pubDate>Sun, 07 Dec 2025 16:18:23 +0000</pubDate>
      <link>https://dev.to/aws-builders/aws-glue-etl-jobs-transform-your-data-at-scale-2l4n</link>
      <guid>https://dev.to/aws-builders/aws-glue-etl-jobs-transform-your-data-at-scale-2l4n</guid>
      <description>&lt;h1&gt;
  
  
  AWS Glue ETL Jobs: Transform Your Data at Scale
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://dev.to/devoteng1/data-cataloguing-in-aws-34ob"&gt;First part: AWS Data Cataloguing&lt;/a&gt;&lt;br&gt;
Even though the AWS Glue Crawler creates your Data Catalog automatically, some projects require a transformation step. This is where AWS Glue ETL Jobs come in. Glue ETL allows you to clean, transform, standardize, and enrich your raw datasets using PySpark at scale.&lt;/p&gt;

&lt;p&gt;In this section, we will build a simple but production-ready Glue ETL script that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reads data from the raw S3 bucket using the Data Catalog&lt;/li&gt;
&lt;li&gt;Performs basic cleaning (renaming, casting types, dropping fields)&lt;/li&gt;
&lt;li&gt;Converts it into a structured format (Parquet recommended)&lt;/li&gt;
&lt;li&gt;Writes the output into the Clean Zone in S3&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🏗 Step 1: Create a Glue Job
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Open AWS Glue Console → Click on &lt;strong&gt;ETL Jobs&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fko9gfyd6tk9t1280aw6t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fko9gfyd6tk9t1280aw6t.png" alt=" " width="800" height="180"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can start the job creation process from a blank canvas, a notebook, or a script editor, as described below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4foc3b2matgnb7ar76ct.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4foc3b2matgnb7ar76ct.png" alt="AWS Glue" width="800" height="109"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Visual ETL&lt;/strong&gt;&lt;br&gt;
Choose Visual ETL to start with an empty canvas. Use this option when you want to create a job that has multiple data sources or if you want to explore the available data sources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Author using an interactive code notebook&lt;/strong&gt;&lt;br&gt;
Choose Notebook to start with a blank Notebook to create jobs in Python using the Spark or Ray kernel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Author code with a script editor&lt;/strong&gt;&lt;br&gt;
Choose Script editor to start with only Python boilerplate text added to your job script, or to upload your own script. If you choose to upload your own script, you can select only Python files or files with the extension .scala from your local file system. Use this option if you have a job script you want to import into AWS Glue Studio, or if you prefer writing your own ETL job.&lt;br&gt;
In this demonstration, I will choose the &lt;strong&gt;Script editor&lt;/strong&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click &lt;strong&gt;Script editor&lt;/strong&gt;, select &lt;strong&gt;Spark&lt;/strong&gt; as the engine and &lt;strong&gt;Start fresh&lt;/strong&gt; as the option, then click &lt;strong&gt;Create script&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fezcaisfyfxqq8qfr98tu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fezcaisfyfxqq8qfr98tu.png" alt=" " width="800" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In the &lt;strong&gt;Script editor&lt;/strong&gt; replace the default code with the code below&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm1349qvnfbuucuigwk4g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm1349qvnfbuucuigwk4g.png" alt=" " width="800" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  💻 Glue ETL Job Script
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#Imports Python's system module to access command-line arguments.
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;

&lt;span class="c1"&gt;#Imports all AWS Glue transformation functions (though none are explicitly used in this script).
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.transforms&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;

&lt;span class="c1"&gt;#Imports utility to parse job parameters passed to the Glue job.
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;getResolvedOptions&lt;/span&gt;

&lt;span class="c1"&gt;#Imports GlueContext, which wraps SparkContext and provides Glue-specific functionality.
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.context&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GlueContext&lt;/span&gt;

&lt;span class="c1"&gt;#Imports Job class for managing Glue job lifecycle and bookmarking.
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.job&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Job&lt;/span&gt;

&lt;span class="c1"&gt;#Imports DynamicFrame, Glue's data structure that handles schema variations better than Spark DataFrames.
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.dynamicframe&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DynamicFrame&lt;/span&gt;

&lt;span class="c1"&gt;#Imports SparkContext, the entry point for Spark functionality.
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.context&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkContext&lt;/span&gt;

&lt;span class="c1"&gt;#Imports PySpark SQL functions for data transformations.
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;when&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to_timestamp&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ---------------------------------------------------------------------------------
# Initialize Glue Job
# ---------------------------------------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getResolvedOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JOB_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;sc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SparkContext&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;glueContext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GlueContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;glueContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt;
&lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;glueContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JOB_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ---------------------------------------------------------------------------------
# Read raw CSV data from S3 using the Glue Data Catalog table
# ---------------------------------------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;raw_dyf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;glueContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_dynamic_frame&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_catalog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orders_db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;medallion_orders_2025_12_17&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# update with your catalog table name
&lt;/span&gt;    &lt;span class="n"&gt;transformation_ctx&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_dyf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ---------------------------------------------------------------------------------
# Column Standardization
# ---------------------------------------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw_dyf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toDF&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Standardize column names (Spark-friendly)
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toDF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# ---------------------------------------------------------------------------------
# Clean &amp;amp; Transform Data
# ---------------------------------------------------------------------------------
&lt;/span&gt;
&lt;span class="c1"&gt;# Trim whitespace
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;column&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

&lt;span class="c1"&gt;# Convert datatypes
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;to_timestamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yyyy-MM-dd HH:mm:ss&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;int&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Remove invalid rows
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isNotNull&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isNotNull&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# Fix negative values (replace with null or filter)
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;otherwise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

&lt;span class="c1"&gt;# Create derived columns
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_price_in_USD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Remove duplicates
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropDuplicates&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# ---------------------------------------------------------------------------------
# Convert back to DynamicFrame
# ---------------------------------------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;final_dyf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DynamicFrame&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromDF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;glueContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;final_dyf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ---------------------------------------------------------------------------------
# Write to Clean S3 Zone (Partitioned)
# ---------------------------------------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://medallion-orders-2025-12-17/clean/orders/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# update with your path
&lt;/span&gt;
&lt;span class="n"&gt;glueContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write_dynamic_frame&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_options&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;final_dyf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;connection_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;connection_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;partitionKeys&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;transformation_ctx&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;datasink&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure you replace the S3 destination path and the AWS Glue database and table names with your own.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click on &lt;strong&gt;Job details&lt;/strong&gt;. In the &lt;strong&gt;Name&lt;/strong&gt; field, enter a name for the job. Choose an &lt;strong&gt;IAM role&lt;/strong&gt; that has access to the data sources (S3 in this case). Leave all other defaults and click on &lt;strong&gt;Save&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flwkuitouiqs3vv5eeuox.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flwkuitouiqs3vv5eeuox.png" alt=" " width="800" height="565"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Save the script and click on &lt;strong&gt;Run&lt;/strong&gt; to start the job&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5br7wfaoxsg58687exb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5br7wfaoxsg58687exb.png" alt=" " width="800" height="186"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The screenshot below shows the job running&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fol78syvgvp444u2mgrj2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fol78syvgvp444u2mgrj2.png" alt=" " width="800" height="140"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The screenshot below shows the output folder &lt;strong&gt;clean&lt;/strong&gt; created by the job, alongside the input folder &lt;strong&gt;raw&lt;/strong&gt; that contained the source data
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5lb6uu85hupv2f3ghlf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5lb6uu85hupv2f3ghlf.png" alt=" " width="800" height="110"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🧼 What This Script Actually Does
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1️⃣ Ingestion
&lt;/h3&gt;

&lt;p&gt;Reads the raw CSV using the Data Catalog entry created by the crawler.&lt;/p&gt;

&lt;h3&gt;
  
  
  2️⃣ Cleaning
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Renames inconsistent column names&lt;/li&gt;
&lt;li&gt;Drops irrelevant fields&lt;/li&gt;
&lt;li&gt;Converts data types&lt;/li&gt;
&lt;li&gt;Normalizes the schema&lt;/li&gt;
&lt;li&gt;Removes duplicates&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3️⃣ Writing to Clean Zone
&lt;/h3&gt;

&lt;p&gt;Outputs the cleaned, structured dataset to an S3 Clean Bucket in Parquet format, ideal for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Athena&lt;/li&gt;
&lt;li&gt;Redshift Spectrum&lt;/li&gt;
&lt;li&gt;Quicksight&lt;/li&gt;
&lt;li&gt;Machine learning workflows&lt;/li&gt;
&lt;/ul&gt;
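
&lt;p&gt;As one illustration of consuming the clean zone, here is a small boto3 sketch that runs an Athena query against it. It assumes the clean Parquet data has already been crawled into a catalog table; the &lt;code&gt;orders_clean&lt;/code&gt; table name and the results location below are hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Start a query against the (hypothetical) table crawled from the clean zone
run = athena.start_query_execution(
    QueryString="SELECT order_status, COUNT(*) AS orders FROM orders_clean GROUP BY order_status",
    QueryExecutionContext={"Database": "orders_db"},
    ResultConfiguration={"OutputLocation": "s3://medallion-orders-2025-12-17/athena-results/"}
)

print("QueryExecutionId:", run["QueryExecutionId"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;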

&lt;p&gt;&lt;a href="https://github.com/isaacotengdev/AWS_GLUE_ETL_PIPELINE" rel="noopener noreferrer"&gt;Github Repo&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>dataengineering</category>
      <category>etl</category>
      <category>awsbigdata</category>
    </item>
    <item>
      <title>Data Cataloguing in AWS</title>
      <dc:creator>Oteng Isaac</dc:creator>
      <pubDate>Wed, 03 Dec 2025 13:44:41 +0000</pubDate>
      <link>https://dev.to/aws-builders/data-cataloguing-in-aws-34ob</link>
      <guid>https://dev.to/aws-builders/data-cataloguing-in-aws-34ob</guid>
      <description>&lt;h1&gt;
  
  
  AWS Data Cataloguing
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Cataloguing Data in AWS Using Glue Crawlers: A Practical Guide for Data Engineers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;In modern data engineering, one of the most overlooked but powerful capabilities is data cataloguing. Without a clear understanding of what data exists, where it lives, its schema, and how it changes over time, no ETL architecture can scale. In this guide, I walk through how to catalogue data using AWS Glue Crawlers, and how to structure your metadata layer when working with raw and cleaned datasets stored in Amazon S3.&lt;/p&gt;

&lt;p&gt;This tutorial uses a simple CSV file in an S3 raw bucket and walks through how AWS Glue automatically discovers its structure and builds a searchable, query-ready data catalog. You can replicate every step in your own AWS Console; screenshots are included throughout to keep the walkthrough visual and practical.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Data Cataloguing?
&lt;/h2&gt;

&lt;p&gt;Data cataloguing is the process of creating a structured inventory of all your data assets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A good data catalog contains:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dataset name&lt;/li&gt;
&lt;li&gt;Schema (columns, data types, partitions)&lt;/li&gt;
&lt;li&gt;Location (e.g., S3 path)&lt;/li&gt;
&lt;li&gt;Metadata (size, owner, last updated)&lt;/li&gt;
&lt;li&gt;Tags, classifications, lineage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it as the "index" of your data ecosystem - similar to how a library catalog helps readers find books quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Makes data discoverable across teams&lt;/li&gt;
&lt;li&gt;Reduces manual documentation&lt;/li&gt;
&lt;li&gt;Ensures schema consistency across pipelines&lt;/li&gt;
&lt;li&gt;Enables data validation and quality checks&lt;/li&gt;
&lt;li&gt;Fuels self-service analytics&lt;/li&gt;
&lt;li&gt;Supports governance and compliance&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Data Cataloguing in ETL Pipelines
&lt;/h2&gt;

&lt;p&gt;ETL pipelines depend heavily on metadata. Before transforming any dataset, the pipeline must understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What columns exist&lt;/li&gt;
&lt;li&gt;Which data types to enforce&lt;/li&gt;
&lt;li&gt;What partitions to use&lt;/li&gt;
&lt;li&gt;What schema evolution has happened&lt;/li&gt;
&lt;li&gt;How to map raw → cleaned → curated layers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;A strong data catalog ensures that:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ETL jobs run reliably&lt;/li&gt;
&lt;li&gt;Glue/Spark scripts do not break due to schema drift&lt;/li&gt;
&lt;li&gt;Downstream BI tools (Athena, QuickSight, Superset, Power BI) can read data instantly&lt;/li&gt;
&lt;li&gt;Data lineage and documentation stay updated&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AWS Glue Data Catalog acts as the central metadata store for all your structured and semi-structured data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;Below is the architecture we'll demonstrate:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flw1spq7hh10jj2jmb8at.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flw1spq7hh10jj2jmb8at.png" alt=" " width="800" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The project walkthrough will show how Glue Crawlers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scan an S3 bucket&lt;/li&gt;
&lt;li&gt;Detect the schema (headers, types, formatting)&lt;/li&gt;
&lt;li&gt;Generate metadata&lt;/li&gt;
&lt;li&gt;Store the metadata as a table in the Data Catalog&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This metadata is then queryable through Amazon Athena, interoperable with Glue ETL Jobs, and usable by analytics tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Amazon S3, AWS Glue Crawler, and the Glue Data Catalog
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Amazon S3 (Simple Storage Service)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Amazon S3 is a fully managed object storage service that allows you to store any type of data at scale—CSV files, logs, JSON, Parquet, images, and more.&lt;br&gt;&lt;br&gt;
It is highly durable, cost-effective, and integrates seamlessly with AWS analytics services. In most modern data engineering architectures (including the Medallion architecture), S3 serves as the &lt;strong&gt;landing&lt;/strong&gt;, &lt;strong&gt;raw&lt;/strong&gt;, and &lt;strong&gt;processed&lt;/strong&gt; layers where data is ingested and stored before further transformation.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;AWS Glue Crawler&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;An AWS Glue Crawler is an automated metadata discovery tool that scans data stored in Amazon S3 and other sources.&lt;br&gt;&lt;br&gt;
When the crawler runs, it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reads the file structure and content
&lt;/li&gt;
&lt;li&gt;Detects the data format (CSV, JSON, Parquet, etc.)
&lt;/li&gt;
&lt;li&gt;Infers column names and data types
&lt;/li&gt;
&lt;li&gt;Identifies partitions
&lt;/li&gt;
&lt;li&gt;Classifies datasets using built-in or custom classifiers
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The crawler then automatically creates or updates table metadata without you having to define schemas manually.&lt;/p&gt;
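
&lt;p&gt;For readers who prefer scripting over the console, a crawler can also be defined from the AWS CLI. The snippet below is only a minimal sketch, not the exact setup used later in this walkthrough: the crawler name, IAM role, database, and S3 path are placeholders you would replace with your own values.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Define a crawler that scans an S3 prefix and writes metadata to a Glue database
# (all names below are illustrative placeholders)
aws glue create-crawler \
  --name my-sample-crawler \
  --role AWSGlueServiceRole-demo \
  --database-name my_sample_db \
  --targets '{"S3Targets": [{"Path": "s3://my-sample-bucket/raw/"}]}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;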

&lt;h3&gt;
  
  
  &lt;strong&gt;AWS Glue Data Catalog&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The Glue Data Catalog is a centralized metadata repository for all your datasets within AWS.&lt;br&gt;&lt;br&gt;
It stores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Table definitions
&lt;/li&gt;
&lt;li&gt;Schema information
&lt;/li&gt;
&lt;li&gt;Partition details
&lt;/li&gt;
&lt;li&gt;Metadata used by analytics services
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the Glue Crawler finishes scanning an S3 bucket, it writes the discovered schema and table information into the Glue Data Catalog.&lt;br&gt;&lt;br&gt;
This metadata can then be queried by services such as &lt;strong&gt;Athena&lt;/strong&gt;, &lt;strong&gt;EMR&lt;/strong&gt;, &lt;strong&gt;Redshift Spectrum&lt;/strong&gt;, and &lt;strong&gt;AWS Glue ETL&lt;/strong&gt; jobs.&lt;/p&gt;

&lt;p&gt;In short, the workflow is:&lt;br&gt;
&lt;strong&gt;S3 → Glue Crawler scans files → Schema is inferred → Metadata is stored in Glue Data Catalog → Data becomes queryable.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step-by-Step Workflow
&lt;/h2&gt;

&lt;p&gt;Below is the workflow we'll follow, documented step by step with screenshots.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Upload Your CSV File to Amazon S3
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Create an S3 bucket named: &lt;code&gt;medallion-orders-2025-12-17&lt;/code&gt;  &lt;strong&gt;(Replace with your bucket name)&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create an S3 bucket (basic settings)&lt;/span&gt;
aws s3api create-bucket &lt;span class="nt"&gt;--bucket&lt;/span&gt; medallion-orders-2025-12-17 &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Upload your sample CSV file (e.g., &lt;code&gt;orders.csv&lt;/code&gt;)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Upload the CSV file to the bucket&lt;/span&gt;
aws s3 &lt;span class="nb"&gt;cp &lt;/span&gt;orders.csv s3://medallion-orders-2025-12-17/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Upload to a folder (prefix)&lt;/span&gt;
aws s3 &lt;span class="nb"&gt;cp &lt;/span&gt;orders.csv s3://medallion-orders-2025-12-17/raw/orders.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyx9aub8z7kirkgq7wdho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyx9aub8z7kirkgq7wdho.png" alt=" " width="800" height="117"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Create a Glue Database
&lt;/h3&gt;

&lt;p&gt;In the Glue Console:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Go to &lt;strong&gt;Data Catalog → Databases&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click &lt;strong&gt;Add database&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7j4o3d5q4oa8tjf6am2y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7j4o3d5q4oa8tjf6am2y.png" alt=" " width="800" height="163"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Name it &lt;code&gt;orders_db&lt;/code&gt; and click on &lt;strong&gt;Create database&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2372n1fjh6bi8wevi2yz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2372n1fjh6bi8wevi2yz.png" alt=" " width="800" height="274"&gt;&lt;/a&gt;&lt;/p&gt;
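
&lt;p&gt;If you prefer to script this step, the same database can be created from the AWS CLI. A minimal sketch using the database name from this walkthrough:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Create the Glue database that will hold the table metadata discovered by the crawler
aws glue create-database --database-input '{"Name": "orders_db"}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;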

&lt;h3&gt;
  
  
  3. Create an AWS Glue Crawler
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Glue → Crawlers&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Click on &lt;strong&gt;Create crawler&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Provide a name (e.g., &lt;code&gt;orders_crawler&lt;/code&gt;) and click &lt;strong&gt;Next&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwv9ga3mf5zsfcipni7cs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwv9ga3mf5zsfcipni7cs.png" alt=" " width="800" height="232"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Click on  &lt;strong&gt;Add a data source&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ok62gctstg18gh54yfu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ok62gctstg18gh54yfu.png" alt=" " width="800" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Choose &lt;strong&gt;S3&lt;/strong&gt; as the data store and click &lt;strong&gt;Browse S3&lt;/strong&gt; to select the S3 bucket&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcslnrj5fahnk538gaj2p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcslnrj5fahnk538gaj2p.png" alt=" " width="800" height="569"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On the next screen, choose a role (a Glue-created role or a custom IAM role), then click &lt;strong&gt;Next&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffo6m4xj6gqju75ge9aey.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffo6m4xj6gqju75ge9aey.png" alt=" " width="800" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select your database. For the crawler schedule, choose &lt;strong&gt;On demand&lt;/strong&gt;, click &lt;strong&gt;Next&lt;/strong&gt;, then &lt;strong&gt;Create crawler&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobqjprz89rdy2up1okx5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobqjprz89rdy2up1okx5.png" alt=" " width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Run the crawler&lt;/strong&gt; and wait until status shows &lt;strong&gt;complete&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
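
&lt;p&gt;The crawler can also be started and monitored from the CLI. A small sketch, assuming the crawler name used above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Start the crawler created above
aws glue start-crawler --name orders_crawler

# Poll the crawler state (READY means the run has finished)
aws glue get-crawler --name orders_crawler --query 'Crawler.State'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;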

&lt;h3&gt;
  
  
  4. Run the Crawler &amp;amp; Generate Metadata
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmbvem33g2jj3q5prcb2r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmbvem33g2jj3q5prcb2r.png" alt=" " width="800" height="243"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the crawler completes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It will create a table inside your Glue Data Catalog database&lt;/li&gt;
&lt;li&gt;Open the table to view:

&lt;ul&gt;
&lt;li&gt;Columns&lt;/li&gt;
&lt;li&gt;Data types&lt;/li&gt;
&lt;li&gt;S3 location&lt;/li&gt;
&lt;li&gt;Classification (csv)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
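
&lt;p&gt;The same metadata can be inspected from the CLI. This is a sketch; the table name is whatever the crawler generated (typically derived from your bucket or folder name), so replace it with yours:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# List the tables the crawler created in the database
aws glue get-tables --database-name orders_db --query 'TableList[].Name'

# Inspect one table's columns, classification, and S3 location
aws glue get-table --database-name orders_db --name medallion_orders_2025_12_17
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;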

&lt;h3&gt;
  
  
  5. Query the Table Using Amazon Athena
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Open Athena&lt;/li&gt;
&lt;li&gt;Select your Glue database&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fel6runz52koz3snrnqrq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fel6runz52koz3snrnqrq.png" alt=" " width="800" height="298"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run a simple &lt;code&gt;SELECT * FROM "AwsDataCatalog"."orders_db"."medallion_orders_2025_12_17" LIMIT 10;&lt;/code&gt; (replace the table name with your own table)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltx61p14er5h5sq9exj8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltx61p14er5h5sq9exj8.png" alt=" " width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;
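
&lt;p&gt;Athena queries can also be submitted from the CLI. The sketch below assumes a query-results bucket that you would replace with your own; the table name should match the one in your catalog:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Submit the query (the output location bucket is a placeholder)
aws athena start-query-execution \
  --query-string 'SELECT * FROM medallion_orders_2025_12_17 LIMIT 10' \
  --query-execution-context Database=orders_db \
  --result-configuration OutputLocation=s3://my-athena-results-bucket/

# Fetch the results using the QueryExecutionId returned by the previous command
aws athena get-query-results --query-execution-id &amp;lt;query-execution-id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;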

&lt;h2&gt;
  
  
  Final Outcome
&lt;/h2&gt;

&lt;p&gt;After completing the steps, you will have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A fully indexed representation of your raw data&lt;/li&gt;
&lt;li&gt;A searchable table in Glue Data Catalog&lt;/li&gt;
&lt;li&gt;A metadata-driven foundation for ETL jobs&lt;/li&gt;
&lt;li&gt;A structure ready for transformation into a cleaned bucket and eventually a curated analytics layer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This sets the stage for my next article:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Building ETL pipelines using Glue ETL Jobs and writing cleaned data back into S3."&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Data cataloguing is a foundational step in any scalable data engineering architecture. AWS Glue Crawlers make it easy to automate metadata extraction from raw data sources, reduce manual schema definition, and keep your ETL pipelines schema-aware and resilient.&lt;/p&gt;

&lt;p&gt;By the end of this project, you'll have a practical, AWS-native setup that you can build on for data cleaning, transformations, and analytical workloads.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>dataengineering</category>
      <category>tutorial</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Medallion Architecture On AWS</title>
      <dc:creator>Oteng Isaac</dc:creator>
      <pubDate>Mon, 01 Dec 2025 16:02:04 +0000</pubDate>
      <link>https://dev.to/aws-builders/medallion-architecture-on-aws-2ngm</link>
      <guid>https://dev.to/aws-builders/medallion-architecture-on-aws-2ngm</guid>
      <description>&lt;h1&gt;
  
  
  Medallion Architecture On AWS
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Building Modern Data Lakes on AWS S3 with the Medallion Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fotravqyztnez39indpst.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fotravqyztnez39indpst.jpg" alt=" " width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data constitutes the foundation of contemporary enterprises. However, as the volume, velocity, and variety of data grow, organizations face a critical challenge: &lt;strong&gt;how to store, manage, and analyze data efficiently at scale&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;data lake&lt;/strong&gt; is a centralized repository used to store structured, semi-structured, and unstructured data in its raw form. When combined with &lt;strong&gt;AWS S3&lt;/strong&gt; and the &lt;strong&gt;Medallion architecture&lt;/strong&gt;, it provides a &lt;strong&gt;scalable, reliable, and layered approach&lt;/strong&gt; for transforming raw data into insights ready for analysis.&lt;/p&gt;

&lt;p&gt;In this post, we'll explore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Why S3 is the go-to storage for modern data lakes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Medallion architecture and its layers (Bronze, Silver, Gold)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Practical design patterns, best practices, and real-world use cases&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How AWS services integrate seamlessly with S3-based data lakes&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;1. Why AWS S3 for Data Lakes?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Amazon S3 (Simple Storage Service) is &lt;strong&gt;object storage&lt;/strong&gt; that offers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unlimited storage&lt;/strong&gt; -- scale from gigabytes to petabytes of data without disruption&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;High durability&lt;/strong&gt; -- 11 nines of durability (99.999999999%)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flexible storage classes and formats&lt;/strong&gt; -- Standard, Infrequent Access, Glacier, and other classes; data can be stored in CSV, JSON, Parquet, ORC, and other formats&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Secure and compliant&lt;/strong&gt; -- encryption at rest and in transit (SSE-S3/SSE-KMS), IAM policies, and fine-grained access control&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Integration with analytics and AI/ML services&lt;/strong&gt; -- integrates seamlessly with AWS services such as Glue, Athena, Redshift Spectrum, EMR, SageMaker, Kinesis, and MSK&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why S3 is perfect for a data lake:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Any format&lt;/strong&gt; can be stored: CSV, JSON, Parquet, Avro, ORC, images, audio, logs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Decouples compute and storage&lt;/strong&gt; -- separates computation from storage so that different analytics engines can &lt;br&gt;
access the same raw data without having to move it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Supports schema-on-read&lt;/strong&gt; -- The schema is defined during querying, not during writing&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Overview of the Medallion Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Medallion Architecture&lt;/strong&gt; is a &lt;strong&gt;layered approach&lt;/strong&gt; to organising your data lake. It arranges data into progressively &lt;strong&gt;refined layers&lt;/strong&gt; to enhance &lt;strong&gt;data quality, governance, and performance&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The layers are:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2.1 Bronze Layer -- Raw Data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Purpose:&lt;/strong&gt; Ingest all raw data exactly as it was obtained from sources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Characteristics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Semi-structured or unstructured&lt;/li&gt;
&lt;li&gt;  May contain duplicates, errors, or missing values&lt;/li&gt;
&lt;li&gt;  Timestamped to monitor ingestion&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Audit trail&lt;/li&gt;
&lt;li&gt;  Unprocessed logs and events&lt;/li&gt;
&lt;li&gt;  Origin of downstream transformations&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Example in S3:&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;s3://your-datalake/bronze/customers/&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;s3://your-datalake/bronze/orders/&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2.2 Silver Layer -- Data that has been conformed and cleaned&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Purpose:&lt;/strong&gt; Create clean, standardised, and enriched datasets from raw data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Characteristics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Deduplicated&lt;/li&gt;
&lt;li&gt;  Corrected data types&lt;/li&gt;
&lt;li&gt;  Enhanced using lookups or joins&lt;/li&gt;
&lt;li&gt;  Consistent timestamps and formats&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Intermediate analytics&lt;/li&gt;
&lt;li&gt;  Providing data to BI dashboards&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Example in S3:&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;s3://your-datalake/silver/customers/&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;s3://your-datalake/silver/orders/&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Typical transformations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Filter out bad or null records&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Standardise currencies and timestamps.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Join with reference tables (like product categories)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Verify using business rules&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2.3 Gold Layer -- Business-Level / Analytics Data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Purpose:&lt;/strong&gt; Create &lt;strong&gt;analytics-ready, aggregated, or curated data&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Characteristics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Fully cleansed, trustworthy, and aggregated&lt;/li&gt;
&lt;li&gt;  Optimized for reporting or machine learning&lt;/li&gt;
&lt;li&gt;   For quicker queries, data is frequently stored in &lt;strong&gt;columnar formats&lt;/strong&gt; (Parquet, ORC).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  BI dashboards and reports&lt;/li&gt;
&lt;li&gt;  ML training datasets&lt;/li&gt;
&lt;li&gt;  KPI calculation and trend analysis&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Example in S3:&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;s3://your-datalake/gold/sales_summary/&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;s3://your-datalake/gold/customer_lifetime_value/&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Typical transformations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Aggregation at daily, weekly, or monthly granularity&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Join many silver tables to build fact tables&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Compute metrics like revenue, churn, or retention&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. AWS Services that Complement S3 Data Lakes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Building a Medallion architecture on S3 works best when combined with&lt;br&gt;
AWS analytics services:&lt;/p&gt;




&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Role in Data Lake&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS Glue&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ETL/ELT jobs to transform Bronze → Silver → Gold&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Amazon Athena&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Query S3 data directly using SQL without moving it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Amazon Redshift Spectrum&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Query S3 data as external tables in Redshift&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Amazon EMR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Distributed Spark/Hadoop processing for large-scale transformations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS Lake Formation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Centralized access control, data catalog, and governance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Amazon QuickSight&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;BI dashboards on curated Gold data&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;




&lt;p&gt;&lt;strong&gt;4. Practical S3 Design Patterns&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4.1 Partitioning&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Organize large datasets for query performance&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Common partitions: year=2025/month=11/day=30/&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Example:&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;s3://my-datalake/silver/orders/year=2025/month=11/day=30/orders.parquet&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
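
&lt;p&gt;As a rough sketch of what this layout looks like in practice (the bucket and crawler names are illustrative), writing files into Hive-style prefixes and then refreshing the catalog keeps partition pruning working in Athena and Glue:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Write a file into a Hive-style partitioned prefix (names are placeholders)
aws s3 cp orders.parquet s3://my-datalake/silver/orders/year=2025/month=11/day=30/orders.parquet

# Re-run the crawler so the new partition is registered in the Data Catalog
aws glue start-crawler --name silver-orders-crawler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;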

&lt;p&gt;&lt;strong&gt;4.2 File Formats&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bronze:&lt;/strong&gt; raw JSON, CSV, log files&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Silver:&lt;/strong&gt; Parquet or ORC (columnar)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Gold:&lt;/strong&gt; Parquet with compression (Snappy)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4.3 Naming Conventions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;s3://&amp;lt;bucket&amp;gt;/&amp;lt;layer&amp;gt;/&amp;lt;entity&amp;gt;/year=YYYY/month=MM/day=DD/&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Helps Athena, Glue Crawlers, and partition pruning&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Example Data Flow (End-to-End)&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bronze Layer:&lt;/strong&gt; S3 ingests raw data from sources (e.g., Kafka,&lt;br&gt;
APIs, IoT)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Glue ETL:&lt;/strong&gt; Cleanses, deduplicates, and standardizes → Silver&lt;br&gt;
Layer&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Silver Layer:&lt;/strong&gt; Curated, conformed tables available for analytics&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Gold Layer:&lt;/strong&gt; Aggregations, business KPIs, and ML datasets&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Analytics:&lt;/strong&gt; Athena queries, Redshift reports, QuickSight&lt;br&gt;
dashboards&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Diagram Concept:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3djzxfd4kul5s52yyug2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3djzxfd4kul5s52yyug2.png" alt=" " width="800" height="199"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Best Practices for Medallion Data Lakes on S3&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use separate buckets or prefixes per layer&lt;/strong&gt; (Bronze/Silver/Gold)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Partition and compress data&lt;/strong&gt; for performance and cost savings&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enforce data validation rules&lt;/strong&gt; in Silver ETL&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Track metadata&lt;/strong&gt; with Glue Catalog or Lake Formation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Secure access&lt;/strong&gt; using IAM policies, S3 bucket policies, and KMS&lt;br&gt;
encryption&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use consistent naming conventions&lt;/strong&gt; across layers and datasets&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Version your data&lt;/strong&gt; if necessary (append date/time to files for&lt;br&gt;
auditability)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
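
&lt;p&gt;As an example of practice 5, default encryption and a public access block can be applied to a data lake bucket from the CLI. This is a minimal sketch; the bucket name and KMS key alias are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Enforce SSE-KMS encryption by default on the bucket (names are illustrative)
aws s3api put-bucket-encryption \
  --bucket my-datalake \
  --server-side-encryption-configuration '{"Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms", "KMSMasterKeyID": "alias/my-datalake-key"}}]}'

# Block all public access to the bucket
aws s3api put-public-access-block \
  --bucket my-datalake \
  --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;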

&lt;p&gt;&lt;strong&gt;7. Real-World Example&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;E-Commerce Data Lake:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Bronze: raw JSON order events from web or mobile apps&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Silver: deduplicated, validated orders joined with product catalog&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Gold: aggregated revenue by category, daily sales metrics, customer&lt;br&gt;
LTV&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Analytics: QuickSight dashboards for executives, Athena queries for marketing&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Combining &lt;strong&gt;AWS S3 and the Medallion architecture&lt;/strong&gt; provides a &lt;strong&gt;scalable, structured, and reliable foundation&lt;/strong&gt; for modern data analytics.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;S3 gives &lt;strong&gt;unlimited storage and flexibility&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Medallion layers ensure &lt;strong&gt;data quality, governance, and analytics&lt;br&gt;
readiness&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Integration with &lt;strong&gt;Glue, Athena, Redshift, and QuickSight&lt;/strong&gt; enables&lt;br&gt;
&lt;strong&gt;end-to-end insights&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By implementing this architecture, organizations can build&lt;br&gt;
&lt;strong&gt;enterprise-grade data lakes&lt;/strong&gt;, reduce time-to-insight, and empower&lt;br&gt;
data-driven decision-making.&lt;/p&gt;

</description>
      <category>data</category>
      <category>aws</category>
      <category>dataengineering</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Data ingestion using AWS Services, Part 2</title>
      <dc:creator>Oteng Isaac</dc:creator>
      <pubDate>Wed, 25 Dec 2024 02:19:56 +0000</pubDate>
      <link>https://dev.to/aws-builders/data-ingestion-using-aws-services-part-2-2pg7</link>
      <guid>https://dev.to/aws-builders/data-ingestion-using-aws-services-part-2-2pg7</guid>
      <description>&lt;p&gt;&lt;strong&gt;Data ingestion using AWS Services, Part 2&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Querying AWS S3 data from AWS Athena using SQL.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. In this second part of the tutorial, we are going to crawl the migrated data in AWS S3, create table definitions in the Glue Data Catalog using AWS Glue, and query the data using AWS Athena. AWS Glue is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development.&lt;/p&gt;

&lt;p&gt;Before you proceed with this hands-on tutorial, make sure you have completed the &lt;a href="https://medium.com/@otengcode/data-ingestion-using-aws-services-part-1-266f061a2f60" rel="noopener noreferrer"&gt;first part&lt;/a&gt; of the tutorial, &lt;strong&gt;Data Ingestion using AWS Services, Part 1&lt;/strong&gt;. Below is an architectural diagram of the full project.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9k9qvb0rjq706do7cg9c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9k9qvb0rjq706do7cg9c.png" width="800" height="249"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Search for and select &lt;strong&gt;AWS Glue&lt;/strong&gt; in the top search bar of the AWS console.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click on &lt;strong&gt;Crawler&lt;/strong&gt; and then &lt;strong&gt;Create crawler&lt;/strong&gt;. A crawler accesses your data store (e.g., AWS S3), extracts metadata, and creates table definitions in the AWS Glue Data Catalog.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnydmtval46wa6v8h97od.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnydmtval46wa6v8h97od.png" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Enter a descriptive name for the crawler job and click &lt;strong&gt;Next.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvx0cu8s0dp0vkdvxq670.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvx0cu8s0dp0vkdvxq670.png" width="800" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click on &lt;strong&gt;Add data source&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fljf21pp5fj13w7ghvk9r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fljf21pp5fj13w7ghvk9r.png" width="800" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Under &lt;strong&gt;Data source&lt;/strong&gt;, select &lt;strong&gt;S3&lt;/strong&gt;. Click on &lt;strong&gt;Browse S3&lt;/strong&gt; to choose the AWS S3 bucket containing the data we want to query. Leave all defaults and click on &lt;strong&gt;Add an S3 data source&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd2fi2rpbxqq7apd3hruq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd2fi2rpbxqq7apd3hruq.png" width="800" height="778"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Verify and click on &lt;strong&gt;Next.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcohnc6y96i04hxis6bk9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcohnc6y96i04hxis6bk9.png" width="800" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create or select an IAM role under &lt;strong&gt;Existing IAM role&lt;/strong&gt; and click &lt;strong&gt;Next&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjwchmm24rup6jzny5jle.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjwchmm24rup6jzny5jle.png" width="800" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click on &lt;strong&gt;Add database&lt;/strong&gt; under &lt;strong&gt;Target database&lt;/strong&gt; or select a database in the dropdown. Let's create a database called &lt;strong&gt;testdb&lt;/strong&gt;. Click &lt;strong&gt;Create database.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6rf5i3fomp5v4oj8uuv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6rf5i3fomp5v4oj8uuv.png" width="800" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;For frequency, select &lt;strong&gt;On demand&lt;/strong&gt;. This is used to define a time-based schedule for crawlers and jobs in AWS Glue. Click &lt;strong&gt;Next.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsouiftjkp6fcuuc18xf8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsouiftjkp6fcuuc18xf8.png" width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check all settings and click &lt;strong&gt;Create crawler&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygazp3tsyedr5yifuatx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygazp3tsyedr5yifuatx.png" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;After the crawler has been created successfully, click on &lt;strong&gt;Run crawler&lt;/strong&gt; to start the crawler job.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw26xy25hy4ji4s3w7ttr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw26xy25hy4ji4s3w7ttr.png" width="800" height="271"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;To check the status of a crawler, click on &lt;strong&gt;Crawlers&lt;/strong&gt;, the name of the crawler, and then &lt;strong&gt;Crawler runs.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvklybwltynd235l2vtp4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvklybwltynd235l2vtp4.png" width="800" height="140"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Verify the table and the database by clicking on &lt;strong&gt;Tables&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmf1pylx02x1zjso2phen.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmf1pylx02x1zjso2phen.png" width="800" height="192"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Search for and select &lt;strong&gt;AWS Athena&lt;/strong&gt; in the top search bar of the AWS console.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click &lt;strong&gt;Query editor&lt;/strong&gt;. In the query editor, click &lt;strong&gt;Settings&lt;/strong&gt;, then &lt;strong&gt;Manage&lt;/strong&gt;. Under &lt;strong&gt;Manage settings&lt;/strong&gt;, select &lt;strong&gt;Browse S3&lt;/strong&gt; to choose an AWS S3 bucket that will serve as the location of the query results. Click &lt;strong&gt;Save&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1nrx98z8ap8mpmkko99s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1nrx98z8ap8mpmkko99s.png" width="800" height="541"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In the query editor, enter the following SQL statement: &lt;code&gt;SELECT * FROM testbucketformysqldata123_raw LIMIT 10;&lt;/code&gt;. The query selects all the data migrated into the bucket. Note: Substitute the table name with the name of your table.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj2kjd8lsa2aq2n9s1ux5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj2kjd8lsa2aq2n9s1ux5.png" width="800" height="276"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;strong&gt;Query results&lt;/strong&gt; tab shows the results of the query.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F94wrae3aacw7jd460shr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F94wrae3aacw7jd460shr.png" width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This ends the hands-on project on data ingestion using AWS DMS. Next in the series is SaaS data ingestion using Amazon AppFlow.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Data ingestion using AWS Services, Part 1</title>
      <dc:creator>Oteng Isaac</dc:creator>
      <pubDate>Wed, 25 Dec 2024 02:17:45 +0000</pubDate>
      <link>https://dev.to/aws-builders/data-ingestion-using-aws-services-part-1-ige</link>
      <guid>https://dev.to/aws-builders/data-ingestion-using-aws-services-part-1-ige</guid>
      <description>&lt;p&gt;&lt;strong&gt;Data ingestion using AWS Services, Part 1&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data ingestion is the process of collecting, importing, and transferring raw data from various sources to a storage or processing system where it can be further analyzed, transformed, and used for various purposes. The goal is to bring in data from various sources and make it available for analysis and decision-making. Data ingestion is usually a crucial first step in a data pipeline.&lt;/p&gt;

&lt;p&gt;Data ingestion can be either in batches where data is brought over in bulk at regular intervals, called &lt;strong&gt;BATCH DATA INGESTION&lt;/strong&gt;, or in near real-time, called &lt;strong&gt;STREAM DATA INGESTION&lt;/strong&gt;, where data is brought over as soon as it is generated.&lt;/p&gt;

&lt;p&gt;The first part of my data ingestion tutorial covers batch data ingestion using AWS services, which I will cover in separate articles. Specifically, I will cover hands-on tutorials on the following topics:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Data ingestion using AWS Data Migration Service (DMS)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SaaS data ingestion using Amazon AppFlow&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data ingestion using AWS Glue&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Transferring data into AWS S3 using AWS DataSync&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Before we delve into the hands-on, let’s create a data lake in AWS Simple Storage Service (S3). All the ingested data in this hands-on tutorial will first be brought over into a bucket in AWS S3, which is a preferred choice for building data lakes in AWS. Amazon S3 is a scalable object storage service offered by AWS. It is designed to store and retrieve any amount of data from anywhere on the web. Data in S3 is organized into containers called &lt;strong&gt;&lt;em&gt;buckets&lt;/em&gt;&lt;/strong&gt;. A bucket is similar to a directory or folder and must have a globally unique name across all of AWS. Let’s create a bucket in AWS S3, which will serve as a destination for all ingested data.&lt;/p&gt;

&lt;p&gt;To create a bucket, we can use the AWS CLI available through the AWS management console called the &lt;strong&gt;&lt;em&gt;CloudShell.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Log into the AWS console using an administrative user.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;2.&lt;/strong&gt; Search for and select &lt;strong&gt;S3&lt;/strong&gt; in the top search bar of the AWS console.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3.&lt;/strong&gt; Click on CloudShell on the top bar of the AWS console. AWS CloudShell is a browser-based shell that can quickly run scripts with the AWS Command Line Interface (CLI) and experiment with service APIs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creating AWS S3 bucket&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gkmiiv0ae8hfcwk5c61.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gkmiiv0ae8hfcwk5c61.png" width="800" height="105"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This will create a command-line environment with an AWS CLI preinstalled.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Enter the following command to create a bucket: &lt;strong&gt;&lt;em&gt;aws s3 mb s3://&amp;lt;name of bucket&amp;gt;&lt;/em&gt;&lt;/strong&gt; and hit enter. The bucket name must be globally unique across all AWS regions. Note: Replace the name of the bucket with a descriptive and unique name.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkkxer9lskrwr4p0xzz9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkkxer9lskrwr4p0xzz9.png" width="800" height="58"&gt;&lt;/a&gt;&lt;/p&gt;
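
&lt;p&gt;To confirm the bucket was created, you can list the buckets in your account from the same CloudShell session:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# List all buckets in the account; the new bucket should appear in the output
aws s3 ls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;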

&lt;p&gt;Alternatively, you can create a bucket using the AWS console bucket creation wizard.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click on &lt;strong&gt;Create bucket&lt;/strong&gt;, keep all defaults, enter a globally unique name for the bucket and click &lt;strong&gt;Create bucket.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fif8a4ua55wz4pced4thk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fif8a4ua55wz4pced4thk.png" width="800" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;After the bucket has been created successfully, click on the name of the bucket and go to the &lt;strong&gt;Properties&lt;/strong&gt; tab. Copy the ARN of the created bucket and save it somewhere; it will be needed in a later step.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61on6uifz5svv74asa8y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61on6uifz5svv74asa8y.png" width="798" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This ends the creation of the bucket, which will be used as the destination of all ingested data in this tutorial. Next, we want to set permissions on the created bucket to allow AWS DMS access to the bucket to perform operations on the bucket using AWS Identity and Access Management (IAM). AWS IAM is a web service by AWS that helps you securely control access to AWS resources. It enables you to manage users, groups, and permissions within your AWS environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creating AWS IAM Policy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1.&lt;/strong&gt; Search for and select &lt;strong&gt;IAM&lt;/strong&gt; in the top search bar of the AWS console.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2.&lt;/strong&gt; Click on &lt;strong&gt;Policies&lt;/strong&gt; and then click on &lt;strong&gt;Create policy&lt;/strong&gt;. A policy is an object in AWS that defines permissions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5yaz5m78uazhs48zfs7y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5yaz5m78uazhs48zfs7y.png" width="769" height="237"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;By default, the &lt;strong&gt;Visual editor&lt;/strong&gt; tab is selected, so click on &lt;strong&gt;JSON&lt;/strong&gt; to change to the &lt;strong&gt;JSON&lt;/strong&gt; tab. In the &lt;strong&gt;actions section&lt;/strong&gt;, grant all actions on &lt;strong&gt;S3&lt;/strong&gt; and in the &lt;strong&gt;resource section,&lt;/strong&gt; provide the &lt;strong&gt;ARN&lt;/strong&gt; of the AWS S3 bucket created in the previous steps. Note that in a production environment, the scope of the permission must be limited.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvf1sn03ocv52sqn8emye.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvf1sn03ocv52sqn8emye.png" width="760" height="283"&gt;&lt;/a&gt;&lt;/p&gt;
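
&lt;p&gt;The same policy can be created from the CLI. The sketch below mirrors the broad permissions described above (all S3 actions on the bucket); the policy name and bucket ARN are placeholders, and the object-level ARN is included so object operations are covered. As noted, scope this down in a production environment.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Create the policy from the CLI (policy name and bucket ARN are placeholders)
aws iam create-policy \
  --policy-name dms-s3-access-policy \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": ["arn:aws:s3:::my-ingestion-bucket", "arn:aws:s3:::my-ingestion-bucket/*"]
    }]
  }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;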

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Click on &lt;strong&gt;Next&lt;/strong&gt;, provide a descriptive policy name for the policy and click &lt;strong&gt;Create policy.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the left-hand menu, click on &lt;strong&gt;Roles&lt;/strong&gt; and then &lt;strong&gt;Create role&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For the &lt;strong&gt;Trusted entity type&lt;/strong&gt;, choose &lt;strong&gt;AWS service,&lt;/strong&gt; and for &lt;strong&gt;Use case&lt;/strong&gt; select &lt;strong&gt;DMS&lt;/strong&gt; then Next.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2317t7zhqd1dt3ml1xf1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2317t7zhqd1dt3ml1xf1.png" width="767" height="226"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Select the policy created in the earlier step and click Next.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Provide a descriptive role name and click &lt;strong&gt;Create role&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At this point, we have created an AWS S3 bucket and an IAM role that allows AWS DMS to perform operations on the bucket.&lt;/p&gt;
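
&lt;p&gt;For completeness, here is a rough CLI sketch of the same role setup: a role that DMS can assume, with the policy from the previous step attached. The role name, policy ARN, and account ID are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Create a role that the DMS service can assume (role name is a placeholder)
aws iam create-role \
  --role-name dms-s3-target-role \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"Service": "dms.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }]
  }'

# Attach the S3 access policy created earlier (replace with your policy ARN)
aws iam attach-role-policy \
  --role-name dms-s3-target-role \
  --policy-arn arn:aws:iam::123456789012:policy/dms-s3-access-policy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;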

&lt;p&gt;&lt;strong&gt;Data ingestion using AWS Data Migration Service&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So far, we have created an AWS S3 bucket to host our ingested data. At this point, we will create an AWS Relational Database Service (RDS) MySQL instance, database, and table and populate it with sample data, which will be migrated to AWS S3 as the storage layer using the AWS DMS. We will then crawl the S3 bucket using the Glue Crawler and use the Glue Data Catalog as the metadata repository. We will then query the data using AWS Athena. Below is the architecture of the project.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fer9ofxm8tvczr6icoc60.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fer9ofxm8tvczr6icoc60.png" width="800" height="204"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Summary of AWS services used.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;AWS Relational Database Service is a collection of managed services that makes it simple to set up, operate, and scale databases in the cloud.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AWS Database Migration Service is a managed migration and replication service that helps move your database and analytics workloads to AWS quickly, securely, and with minimal downtime and zero data loss.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AWS S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AWS Glue (Glue Crawler, Glue Data Catalog) is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AWS Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Creating AWS RDS MySQL DATABASE&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Search for and select &lt;strong&gt;AWS RDS&lt;/strong&gt; in the top search bar of the AWS console.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8uz1laf8g5joblsm1m8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8uz1laf8g5joblsm1m8.png" width="800" height="206"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click on &lt;strong&gt;Databases&lt;/strong&gt; and select &lt;strong&gt;Create database&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02b7wljnrky7q1213mpd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02b7wljnrky7q1213mpd.png" width="800" height="216"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Select &lt;strong&gt;Easy create&lt;/strong&gt; for the database creation method. Choosing easy create enables best-practice configuration options that can be changed later.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs9t1cxl8o60z26fr10sh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs9t1cxl8o60z26fr10sh.png" width="800" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Choose &lt;strong&gt;MySQL&lt;/strong&gt; under Configuration.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1jxqca5kclofnaeb4fjq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1jxqca5kclofnaeb4fjq.png" width="800" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Choose &lt;strong&gt;Dev/Test&lt;/strong&gt; under &lt;strong&gt;DB instance size&lt;/strong&gt;. Note: You can also select &lt;strong&gt;Free tier&lt;/strong&gt; for this tutorial.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9bg85jsoj3b1ga3l4hik.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9bg85jsoj3b1ga3l4hik.png" width="800" height="215"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Accept all defaults, enter a &lt;strong&gt;Master password&lt;/strong&gt;, and click &lt;strong&gt;Create database&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ol&gt;
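
&lt;p&gt;If you prefer to script this step instead of clicking through the console, the following is a minimal boto3 sketch of how a similar RDS MySQL instance could be created. The region, instance class, identifier, and credentials here are placeholder assumptions, so replace them with your own values.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

# Placeholder region and credentials; adjust to your own environment.
rds = boto3.client("rds", region_name="us-east-1")

response = rds.create_db_instance(
    DBInstanceIdentifier="database-2",         # placeholder instance identifier
    Engine="mysql",
    DBInstanceClass="db.t3.micro",             # a small Dev/Test instance class
    AllocatedStorage=20,                       # storage in GiB
    MasterUsername="admin",
    MasterUserPassword="YourStrongPassword1",  # placeholder master password
    PubliclyAccessible=True,                   # so an external SQL client can reach it
)
print(response["DBInstance"]["DBInstanceStatus"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;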

&lt;p&gt;&lt;strong&gt;Connecting to AWS RDS MySQL instance using HeidiSQL Client&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s populate the database with sample data by connecting to the AWS RDS MySQL instance using an SQL client. In this tutorial, we are using the free HeidiSQL client, but you can use any SQL client of your choice.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;If you are using the HeidiSQL client, enter the following details, as shown in the screenshot below: Select &lt;strong&gt;Network type&lt;/strong&gt; as &lt;strong&gt;MySQL on RDS&lt;/strong&gt;. For &lt;strong&gt;Hostname/IP&lt;/strong&gt;, copy and paste the RDS instance &lt;strong&gt;endpoint&lt;/strong&gt;. For &lt;strong&gt;User&lt;/strong&gt; and &lt;strong&gt;Password&lt;/strong&gt;, enter the username and password of the RDS database. Enter &lt;strong&gt;3306&lt;/strong&gt; as the &lt;strong&gt;Port&lt;/strong&gt;, which is the default for MySQL databases. Note: Make sure the instance’s security group allows inbound traffic on port 3306. Click on &lt;strong&gt;Open&lt;/strong&gt; to establish a connection.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzo0u3696yf49ec49glf4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzo0u3696yf49ec49glf4.png" width="800" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;After a successful connection, enter the highlighted SQL, as shown in the screenshot below. This creates a database called &lt;strong&gt;testdb&lt;/strong&gt;, sets it as the default database and shows all the created databases with the new database highlighted.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhen0uutwnii0jo0q6abj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhen0uutwnii0jo0q6abj.png" width="800" height="288"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Enter the following SQL statement below to create a &lt;strong&gt;table&lt;/strong&gt; in the database.&lt;/p&gt;

&lt;p&gt;CREATE TABLE IF NOT EXISTS actor (&lt;br&gt;
actor_id smallint unsigned NOT NULL AUTO_INCREMENT,&lt;br&gt;
first_name varchar(45) NOT NULL,&lt;br&gt;
last_name varchar(45) NOT NULL,&lt;br&gt;
last_update timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,&lt;br&gt;
PRIMARY KEY (actor_id));&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can also use the &lt;strong&gt;SHOW TABLES&lt;/strong&gt; statement to confirm the creation of the table, as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyidzf6idi822w2w1eu42.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyidzf6idi822w2w1eu42.png" width="800" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Let’s use the &lt;strong&gt;INSERT INTO&lt;/strong&gt; statement to load sample data into the database.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzbohwzpbdm8o7yv62rr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzbohwzpbdm8o7yv62rr.png" width="800" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Let’s use &lt;strong&gt;SELECT * FROM actor&lt;/strong&gt; to query and verify the inserts.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1uab6kpo24x9nfu4wh0i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1uab6kpo24x9nfu4wh0i.png" width="800" height="269"&gt;&lt;/a&gt;&lt;/p&gt;
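
&lt;p&gt;If you would rather script the HeidiSQL steps, here is a minimal Python sketch using the PyMySQL library (an assumption on my part; any MySQL client library works). The endpoint, credentials, and sample rows are placeholders, and the table definition matches the SQL shown above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pymysql

# Placeholder connection details; use your RDS endpoint, username and password.
conn = pymysql.connect(
    host="database-2.xxxxxxxx.us-east-1.rds.amazonaws.com",
    user="admin",
    password="YourStrongPassword1",
    database="testdb",
    port=3306,
)

with conn.cursor() as cur:
    # Same CREATE TABLE statement as above, including the primary key.
    cur.execute(
        """CREATE TABLE IF NOT EXISTS actor (
               actor_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
               first_name VARCHAR(45) NOT NULL,
               last_name VARCHAR(45) NOT NULL,
               last_update TIMESTAMP NOT NULL
                   DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
               PRIMARY KEY (actor_id))"""
    )
    # Load a few placeholder rows, then read them back to verify the inserts.
    cur.executemany(
        "INSERT INTO actor (first_name, last_name) VALUES (%s, %s)",
        [("PENELOPE", "GUINESS"), ("NICK", "WAHLBERG"), ("ED", "CHASE")],
    )
    conn.commit()
    cur.execute("SELECT * FROM actor")
    for row in cur.fetchall():
        print(row)

conn.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;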

&lt;p&gt;&lt;strong&gt;MIGRATE DATA TO S3 USING AWS DATABASE MIGRATION SERVICE&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Search for and select &lt;strong&gt;AWS Database Migration Service&lt;/strong&gt; in the top search bar of the AWS console and click on &lt;strong&gt;Replication instances&lt;/strong&gt;. AWS DMS uses a replication instance to connect to your source data store, read the source data, and format the data for consumption by the target data store.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfr78iah4gu329pr93om.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfr78iah4gu329pr93om.png" width="800" height="229"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click on &lt;strong&gt;Create replication instance&lt;/strong&gt; to create a replication instance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsshohzoq98dwglusexth.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsshohzoq98dwglusexth.png" width="800" height="192"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Provide a descriptive &lt;strong&gt;name&lt;/strong&gt; and an optional &lt;strong&gt;descriptive Amazon Resource Name (ARN)&lt;/strong&gt; and select an &lt;strong&gt;instance class&lt;/strong&gt; as shown below. Leave all options at their default values, click on &lt;strong&gt;Create replication instance&lt;/strong&gt; and wait for a successful creation of the instance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0fspqczbfktalo5ngek.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0fspqczbfktalo5ngek.png" width="800" height="736"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Next, create an &lt;strong&gt;Endpoint&lt;/strong&gt;. An endpoint provides connection, data store type, and location information about your data store. AWS DMS uses this information to connect to the data store and migrate data from the source endpoint to the target endpoint. We will create two endpoints.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Source endpoint&lt;/strong&gt;: This allows AWS DMS to read data from a database (on-premises or in the cloud) or other data sources, such as AWS S3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Target endpoint&lt;/strong&gt;: This allows AWS DMS to write data to a database, or other stores such as AWS S3 or Amazon DynamoDB.&lt;/p&gt;

&lt;p&gt;Choose &lt;strong&gt;Source endpoint&lt;/strong&gt;, check &lt;strong&gt;Select RDS DB instance&lt;/strong&gt; since our source is an Amazon RDS database, and choose the created AWS RDS database from the RDS drop-down menu. Provide a unique identifier for the endpoint and select &lt;strong&gt;MySQL&lt;/strong&gt; as the &lt;strong&gt;source engine&lt;/strong&gt;. Under &lt;strong&gt;Access to endpoint database&lt;/strong&gt;, select &lt;strong&gt;Provide access information manually&lt;/strong&gt; and enter the username and password of the AWS RDS MySQL instance. Keep all defaults and click on &lt;strong&gt;Create endpoint&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbh35mle7tl5y36ehfvbf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbh35mle7tl5y36ehfvbf.png" width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Click on &lt;strong&gt;Create endpoint&lt;/strong&gt; again to create a target endpoint. Choose &lt;strong&gt;Target endpoint&lt;/strong&gt;. Provide a &lt;strong&gt;label&lt;/strong&gt; for the endpoint and select &lt;strong&gt;Amazon S3&lt;/strong&gt; as the &lt;strong&gt;target engine&lt;/strong&gt;. Under &lt;strong&gt;Amazon Resource Name (ARN) for the service access role,&lt;/strong&gt; provide the ARN of the role created earlier (DMSconnectRole). For the &lt;strong&gt;Bucket name&lt;/strong&gt;, enter the name of the bucket created earlier and enter a descriptive name to be used as a folder to store the data under &lt;strong&gt;Bucket folder&lt;/strong&gt;. Click on &lt;strong&gt;Create endpoint&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Let’s confirm both endpoints.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn15wng5mgoobjrgpdm7r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn15wng5mgoobjrgpdm7r.png" width="800" height="121"&gt;&lt;/a&gt;&lt;/p&gt;
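
&lt;p&gt;The same two endpoints can also be defined programmatically. Below is a rough boto3 sketch; the server name, account ID, role ARN, and bucket and folder names are placeholder assumptions standing in for the values used above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

dms = boto3.client("dms", region_name="us-east-1")

# Source endpoint: the AWS RDS MySQL instance (placeholder values).
source = dms.create_endpoint(
    EndpointIdentifier="database-2",
    EndpointType="source",
    EngineName="mysql",
    ServerName="database-2.xxxxxxxx.us-east-1.rds.amazonaws.com",
    Port=3306,
    Username="admin",
    Password="YourStrongPassword1",
)

# Target endpoint: the Amazon S3 bucket, written through the DMS service role.
target = dms.create_endpoint(
    EndpointIdentifier="testbucketformysqldata123-raw-data",
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/DMSconnectRole",
        "BucketName": "testbucketformysqldata123",
        "BucketFolder": "raw-data",
    },
)
print(source["Endpoint"]["EndpointArn"])
print(target["Endpoint"]["EndpointArn"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;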

&lt;ol&gt;
&lt;li&gt;Click on &lt;strong&gt;Database migration tasks&lt;/strong&gt;, then click &lt;strong&gt;Create task&lt;/strong&gt; to create a database migration task. This is where all the work happens. You specify what tables (or views) and schemas to use for your migration and any special processing, such as logging requirements, control table data, and error handling. Under &lt;strong&gt;Task identifier,&lt;/strong&gt; enter a unique identifier for the task (mysql-to-s3-task). Under &lt;strong&gt;Replication instance&lt;/strong&gt;, select the created replication instance (mysql-to-s3-replication-vpc-0b6a338c0396b19) in the dropdown. For &lt;strong&gt;Source database endpoint,&lt;/strong&gt; select the created source endpoint (database-2). For &lt;strong&gt;Target database endpoint&lt;/strong&gt;, select the created target endpoint (testbucketformysqldata123-raw-data) from the drop-down list. For &lt;strong&gt;Migration type&lt;/strong&gt;, select &lt;strong&gt;Migrate existing data&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffwt4eoudpm6dz6r53und.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffwt4eoudpm6dz6r53und.png" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Under &lt;strong&gt;Table mappings&lt;/strong&gt;, click on &lt;strong&gt;Add new selection rule&lt;/strong&gt;. Under &lt;strong&gt;Schema&lt;/strong&gt;, select &lt;strong&gt;Enter a schema&lt;/strong&gt;. Under &lt;strong&gt;Source name&lt;/strong&gt;, enter the name of the created AWS RDS MySQL database between the percent signs. Keep all defaults and click on &lt;strong&gt;Create task&lt;/strong&gt;. This starts the migration task automatically.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After a successful migration, the &lt;strong&gt;Status&lt;/strong&gt; shows &lt;strong&gt;Load complete&lt;/strong&gt;, and the &lt;strong&gt;progress bar&lt;/strong&gt; reaches 100%.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtwin6e0be82o3r5mkhj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtwin6e0be82o3r5mkhj.png" width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;
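
&lt;p&gt;For reference, the table-mapping selection rule and the migration task described above could look roughly like this in boto3. The task identifier and the three ARNs are placeholders, and unlike the console flow, the API requires starting the task explicitly once it is ready.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import boto3

dms = boto3.client("dms", region_name="us-east-1")

# Selection rule equivalent to the console step: include every table
# in the testdb schema of the source MySQL database.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-testdb",
            "object-locator": {"schema-name": "testdb", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

task = dms.create_replication_task(
    ReplicationTaskIdentifier="mysql-to-s3-task",          # placeholder identifier
    SourceEndpointArn="arn:aws:dms:placeholder-source",    # placeholder ARNs
    TargetEndpointArn="arn:aws:dms:placeholder-target",
    ReplicationInstanceArn="arn:aws:dms:placeholder-instance",
    MigrationType="full-load",                             # "Migrate existing data"
    TableMappings=json.dumps(table_mappings),
)

# Wait until the task is ready, then start it.
task_arn = task["ReplicationTask"]["ReplicationTaskArn"]
dms.get_waiter("replication_task_ready").wait(
    Filters=[{"Name": "replication-task-arn", "Values": [task_arn]}]
)
dms.start_replication_task(
    ReplicationTaskArn=task_arn,
    StartReplicationTaskType="start-replication",
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;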

&lt;ol&gt;
&lt;li&gt;Let’s confirm the loaded data in the target &lt;strong&gt;Amazon S3 bucket&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxivc4ymfeoicpeimkzeq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxivc4ymfeoicpeimkzeq.png" width="800" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the end of &lt;strong&gt;Data ingestion using AWS Services Part 1&lt;/strong&gt;. Look out for &lt;a href="https://medium.com/@otengcode/data-ingestion-using-aws-services-part-2-56e51bae36f6" rel="noopener noreferrer"&gt;Part 2&lt;/a&gt; of my data ingestion series.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>aws</category>
      <category>awsbigdata</category>
      <category>awsdatalake</category>
    </item>
    <item>
      <title>kinesis Data streams Projects</title>
      <dc:creator>Oteng Isaac</dc:creator>
      <pubDate>Thu, 19 Jan 2023 04:17:44 +0000</pubDate>
      <link>https://dev.to/aws-builders/kinesis-data-streams-projects-5927</link>
      <guid>https://dev.to/aws-builders/kinesis-data-streams-projects-5927</guid>
      <description>&lt;p&gt;&lt;strong&gt;Kinesis Data Streams Project&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Overview&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this project, we will create an &lt;strong&gt;Amazon Kinesis application&lt;/strong&gt; and use the &lt;strong&gt;AWS CLI&lt;/strong&gt; to put records into the stream and then read them back to verify the records in the stream.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon Kinesis Data Streams&lt;/strong&gt; is a serverless streaming data service that makes it easy to capture, process, and store data streams at any scale.&lt;/p&gt;

&lt;p&gt;Some features&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data inserted into a Kinesis data stream can’t be &lt;strong&gt;deleted&lt;/strong&gt; (the &lt;strong&gt;default data retention period is 24 hours&lt;/strong&gt;, but it can be increased)&lt;/li&gt;
&lt;li&gt;Records with the same partition key go into the same shard&lt;/li&gt;
&lt;li&gt;Producers: &lt;strong&gt;AWS SDK, Kinesis Producer Library (KPL), Kinesis Agent&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Consumers: &lt;strong&gt;AWS SDK, Kinesis client Library (KCL), Kinesis Data Firehose, Kinesis Data Analytics&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Amazon Kinesis Data Streams Capacity Modes&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Provisioned mode&lt;/strong&gt;: You choose the number of shards and scale manually or via the API; each shard can take in 1 MB/s of data and serve 2 MB/s of reads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On-demand mode&lt;/strong&gt;: No need to provision or manage capacity; the stream scales automatically, with a default maximum of 200 MiB/second write capacity and 400 MiB/second read capacity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1zgiuvij4933me1w7b1.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1zgiuvij4933me1w7b1.PNG" alt="flow" width="800" height="306"&gt;&lt;/a&gt;&lt;/p&gt;
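
&lt;p&gt;To make the two capacity modes concrete, here is a small boto3 sketch of how a stream could be created in either mode. The stream names and region are example assumptions only.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Provisioned mode: you pick the shard count and scale it yourself.
kinesis.create_stream(
    StreamName="DemoStream",
    ShardCount=1,
    StreamModeDetails={"StreamMode": "PROVISIONED"},
)

# On-demand mode: no shard count to manage; capacity scales automatically.
kinesis.create_stream(
    StreamName="DemoStreamOnDemand",
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)

summary = kinesis.describe_stream_summary(StreamName="DemoStream")
print(summary["StreamDescriptionSummary"]["StreamStatus"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;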

&lt;p&gt;&lt;strong&gt;Instructions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create Amazon Kinesis Application&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Head to the Amazon Kinesis service dashboard to create a Kinesis Data Stream.&lt;br&gt;
We have three options: &lt;strong&gt;Kinesis Data Stream&lt;/strong&gt;, &lt;strong&gt;Kinesis Data Firehose&lt;/strong&gt; and &lt;strong&gt;Kinesis Data Analytics&lt;/strong&gt;.&lt;br&gt;
Select &lt;strong&gt;Kinesis Data Stream&lt;/strong&gt;&lt;br&gt;
and then click on &lt;strong&gt;Create data stream&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwoz0o2k95370zmw8ncoj.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwoz0o2k95370zmw8ncoj.PNG" alt="k2" width="800" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Enter a name for your &lt;strong&gt;Kinesis Stream&lt;/strong&gt;. In this tutorial, we will call our stream &lt;strong&gt;"DemoStream"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmrioklt31oa2p79zw3ep.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmrioklt31oa2p79zw3ep.PNG" alt="k3" width="800" height="274"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the &lt;strong&gt;capacity mode&lt;/strong&gt;, select &lt;strong&gt;Provisioned mode&lt;/strong&gt;&lt;br&gt;
and set the number of shards to 1, which can be increased or decreased later.&lt;/p&gt;

&lt;p&gt;Click on &lt;strong&gt;Create data stream&lt;/strong&gt; to create the stream. It should take only a few seconds for the stream to be created.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fywzu3hiyajrj2ve5dt2r.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fywzu3hiyajrj2ve5dt2r.PNG" alt="k4" width="800" height="616"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8st0pmbnmljhrjxi7wyt.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8st0pmbnmljhrjxi7wyt.PNG" alt="k5" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Afterwards our stream should be in the &lt;strong&gt;"Active" state&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create a producer to put records into the Kinesis Data Stream&lt;/strong&gt;&lt;br&gt;
In this tutorial we are going to use the AWS CLI to communicate with the Amazon Kinesis Service&lt;/p&gt;

&lt;p&gt;This tutorial assumes you have already &lt;strong&gt;installed&lt;/strong&gt; and &lt;strong&gt;configured&lt;/strong&gt; the AWS CLI.&lt;br&gt;
You can follow this documentation [&lt;a href="https://docs.aws.amazon.com/cli/v1/userguide/install-windows.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/cli/v1/userguide/install-windows.html&lt;/a&gt;] to install and configure the AWS CLI for your OS. We are using Windows 10 for this tutorial. &lt;/p&gt;

&lt;p&gt;To confirm that the AWS CLI has been installed and configured on your computer, you can check the installed version by typing "&lt;strong&gt;aws --version&lt;/strong&gt;", which should print the version of the AWS CLI that is installed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08ouqrn29fi2l96cred6.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08ouqrn29fi2l96cred6.PNG" alt="k7" width="800" height="173"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can view a list of &lt;strong&gt;available commands&lt;/strong&gt; by typing &lt;strong&gt;aws kinesis help&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj88g1ar2o1rbkdxtzglh.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj88g1ar2o1rbkdxtzglh.PNG" alt="k8" width="800" height="607"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's check the list of created Kinesis Data Streams in our account by typing  &lt;strong&gt;aws kinesis list-streams&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F03snbfrwcgp8vumng48m.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F03snbfrwcgp8vumng48m.PNG" alt="k9" width="800" height="215"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let’s put some records into the stream.&lt;br&gt;
We can use &lt;strong&gt;PutRecord&lt;/strong&gt; to put one record into the stream or &lt;strong&gt;PutRecords&lt;/strong&gt; to put many records into the stream at once.&lt;/p&gt;

&lt;p&gt;We are going to simulate data coming from a bank ATM whose data has to be streamed into Kinesis Data Streams for further processing.&lt;/p&gt;

&lt;p&gt;We can use the command below to send a record to the stream.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;C:\Users\FIXya TECH&amp;gt;aws kinesis put-record --stream-name DemoStream --partition-key 1 --cli-binary-format raw-in-base64-out  --data "{'trans_id': 1, 'trans_type': 'ATM', 'amt': 200}"&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;aws kinesis &lt;strong&gt;put-record&lt;/strong&gt;: put-record is used to put a single record into the stream&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;--stream-name &lt;strong&gt;DemoStream&lt;/strong&gt;: the name of our created stream, in this case &lt;strong&gt;DemoStream&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;--partition-key &lt;strong&gt;1&lt;/strong&gt;: the partition key for this put operation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;--cli-binary-format &lt;strong&gt;raw-in-base64-out&lt;/strong&gt;: a flag that specifies how binary input parameters should be interpreted, in this case &lt;strong&gt;raw-in-base64-out&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;--data &lt;strong&gt;"{'trans_id': 1, 'trans_type': 'ATM', 'amt': 200}"&lt;/strong&gt;: the data blob to be put into the stream&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this example we are using the &lt;strong&gt;PutRecord API&lt;/strong&gt; to put the following 4 records into the stream one by one. It’s also possible to use the &lt;strong&gt;PutRecords API&lt;/strong&gt; to write many records at once into the stream, as sketched after the commands below.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;aws kinesis put-record --stream-name DemoStream --partition-key 1 --cli-binary-format raw-in-base64-out --data "{'trans_id': 1, 'trans_type': 'ATM', 'amt': 200}"&lt;/li&gt;
&lt;li&gt;aws kinesis put-record --stream-name DemoStream --partition-key 1 --cli-binary-format raw-in-base64-out --data "{'trans_id': 2, 'trans_type': 'ATM', 'amt': 400}"&lt;/li&gt;
&lt;li&gt;aws kinesis put-record --stream-name DemoStream --partition-key 1 --cli-binary-format raw-in-base64-out --data "{'trans_id': 3, 'trans_type': 'ATM', 'amt': 600}"&lt;/li&gt;
&lt;li&gt;aws kinesis put-record --stream-name DemoStream --partition-key 1 --cli-binary-format raw-in-base64-out --data "{'trans_id': 4, 'trans_type': 'ATM', 'amt': 900}"&lt;/li&gt;
&lt;/ol&gt;
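
&lt;p&gt;As mentioned above, the &lt;strong&gt;PutRecords API&lt;/strong&gt; can write all four records in a single call. Here is a minimal boto3 sketch of that batch write; the region is an assumption.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

transactions = [
    {"trans_id": 1, "trans_type": "ATM", "amt": 200},
    {"trans_id": 2, "trans_type": "ATM", "amt": 400},
    {"trans_id": 3, "trans_type": "ATM", "amt": 600},
    {"trans_id": 4, "trans_type": "ATM", "amt": 900},
]

# PutRecords sends all four records to DemoStream in one API call.
response = kinesis.put_records(
    StreamName="DemoStream",
    Records=[
        {"Data": json.dumps(t).encode("utf-8"), "PartitionKey": "1"}
        for t in transactions
    ],
)
print("Failed records:", response["FailedRecordCount"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;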

&lt;p&gt;After a successful &lt;strong&gt;PutRecord API&lt;/strong&gt; operation, we should get a response object containing the &lt;strong&gt;ShardID&lt;/strong&gt; and the &lt;strong&gt;Sequence Number&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3deybch63gzvfezi4t5g.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3deybch63gzvfezi4t5g.PNG" alt="k12" width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;Shard&lt;/strong&gt; is a uniquely identified group of data records in a Kinesis data stream.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Sequence Number&lt;/strong&gt; is the identifier associated with every record ingested in the stream, and is assigned when a record is put into the stream. Each stream has one or more shards.&lt;/p&gt;

&lt;p&gt;Next, we want to read records from the stream.&lt;/p&gt;

&lt;p&gt;First, let’s get the &lt;strong&gt;shard iterator&lt;/strong&gt;, which specifies the shard position from which to start reading data records sequentially.&lt;/p&gt;

&lt;p&gt;aws kinesis &lt;strong&gt;get-shard-iterator&lt;/strong&gt;: gets an Amazon Kinesis shard iterator. The iterator expires 5 minutes after it is returned to the requester.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;--stream-name &lt;strong&gt;DemoStream&lt;/strong&gt;: the name of the created stream, in this case &lt;strong&gt;DemoStream&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;--shard-id &lt;strong&gt;shardId-000000000000&lt;/strong&gt;: the &lt;strong&gt;shardId&lt;/strong&gt; of the shard in our stream&lt;/li&gt;
&lt;li&gt;--shard-iterator-type &lt;strong&gt;TRIM_HORIZON&lt;/strong&gt;: the shard iterator type, which can be &lt;strong&gt;AT_TIMESTAMP&lt;/strong&gt;, &lt;strong&gt;TRIM_HORIZON&lt;/strong&gt; or &lt;strong&gt;LATEST&lt;/strong&gt;. In this project we use &lt;strong&gt;TRIM_HORIZON&lt;/strong&gt;, which makes the shard iterator point to the last untrimmed record in the shard (the oldest data record in the shard)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can use the command below to get the shard iterator&lt;br&gt;
&lt;strong&gt;aws kinesis get-shard-iterator --stream-name DemoStream --shard-id shardId-000000000000 --shard-iterator-type TRIM_HORIZON&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcnc70ha62o7bfd962lfu.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcnc70ha62o7bfd962lfu.PNG" alt="k14" width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can now use the &lt;strong&gt;GetRecords API&lt;/strong&gt; to get data records from a Kinesis data stream's shard by specifying the &lt;strong&gt;shardIterator&lt;/strong&gt;. &lt;br&gt;&lt;br&gt;
aws kinesis get-records --shard-iterator AAAAAAAAAAE/fk1qZTWLE74jIwqLR/N1OqYDsi9d7KHPhTtk7XIF42kEAJdwg0x0oXlZK/5SC7LciGiW5M3IEHdl/WH4cVYvNO1vvTTNra21WQgOUbgODyGfSeDMhd74BGi7z4l/X0Mi9O98Nexx2uSJx5ZHweKaZzEyRm4wkAYHJ4cmzwV1o2W+h/XBXrjdFB1bAKrj4/fYTGDRwvAVuA79qMoWB9vvq6ZhvYUAOLrQXGEK/sjH9g==&lt;/p&gt;

&lt;p&gt;After a successful &lt;strong&gt;GetRecords&lt;/strong&gt; API call, it returns an object containing the &lt;strong&gt;four data records&lt;/strong&gt; put into the Kinesis data stream and the &lt;strong&gt;next shard iterator&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1chsfvhbgsfkg54n7h40.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1chsfvhbgsfkg54n7h40.PNG" alt="k17" width="800" height="337"&gt;&lt;/a&gt;&lt;/p&gt;
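
&lt;p&gt;The same read path can be scripted as well. Below is a hedged boto3 sketch that fetches a &lt;strong&gt;TRIM_HORIZON&lt;/strong&gt; shard iterator for DemoStream and reads the records back; the region and the Limit value are assumptions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# TRIM_HORIZON starts reading from the oldest untrimmed record in the shard.
iterator = kinesis.get_shard_iterator(
    StreamName="DemoStream",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

result = kinesis.get_records(ShardIterator=iterator, Limit=10)
for record in result["Records"]:
    print(record["SequenceNumber"], record["Data"])

# The response also includes NextShardIterator for the next read.
print(result["NextShardIterator"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;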

&lt;p&gt;Another alternative to send test data to your &lt;strong&gt;Amazon Kinesis stream&lt;/strong&gt; or &lt;strong&gt;Amazon Kinesis Firehose delivery stream&lt;/strong&gt; is the &lt;strong&gt;Amazon Kinesis Data Generator&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5s9q4zsak72mwp4stmrv.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5s9q4zsak72mwp4stmrv.PNG" alt="generatot" width="800" height="188"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I hope to do another tutorial on this in the future.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>datascience</category>
      <category>cloudskills</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Hosting a static website using Amazon S3</title>
      <dc:creator>Oteng Isaac</dc:creator>
      <pubDate>Thu, 19 Jan 2023 03:38:12 +0000</pubDate>
      <link>https://dev.to/aws-builders/hosting-a-static-website-using-amazon-s3-5fkn</link>
      <guid>https://dev.to/aws-builders/hosting-a-static-website-using-amazon-s3-5fkn</guid>
      <description>&lt;p&gt;&lt;strong&gt;Amazon Simple Storage Service (Amazon S3)&lt;/strong&gt; is an object storage service that offers industry-leading scalability, data availability, security, and performance.Using Amazon S3 you can store and retrieve any amount of data from anywhere and can store any type of data.To store data in Amazon S3, you work with resources known as buckets and objects. A bucket is a container for objects. An object is any type of file and any metadata that describes that file.&lt;br&gt;
In this tutorial, we would use Amazon S3 bucket to host a static website. In Amazon S3, the static website can sustain any conceivable level of traffic, at a very modest cost, without the need to set up, monitor, scale, or manage any servers. All the files of the website are upload to Amazon S3. We can configure any of our s3 buckets as a static website.&lt;br&gt;
When an S3 bucket is configured for website hosting, the bucket is assigned a URL. When request is made to this URL, Amazon S3 returns the HTML file, known as the root object, that has been set for the bucket.&lt;br&gt;
To enable public access to the bucket or objects, permissions must be configured that allows access. To configure these permissions, we would use BUCKET POLICY which is is a resource-based AWS Identity and Access Management (IAM) policy which grants other AWS accounts or IAM users access permissions for the bucket and the objects in it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We will follow the steps below to complete the tutorial.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s go to the &lt;strong&gt;AWS management console&lt;/strong&gt; and search for &lt;strong&gt;Amazon S3&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqeasjpbt9nq89uskko1i.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqeasjpbt9nq89uskko1i.PNG" alt="AWS console" width="800" height="288"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;strong&gt;bucket section&lt;/strong&gt; click on &lt;strong&gt;Create bucket&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb262iiu55xamzox1p71y.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb262iiu55xamzox1p71y.PNG" alt="create bucket" width="800" height="224"&gt;&lt;/a&gt;&lt;br&gt;
In the &lt;strong&gt;General configuration&lt;/strong&gt; section, enter a &lt;strong&gt;unique name&lt;/strong&gt; for the bucket and choose a region, preferably one close to your location. In this tutorial, I have typed &lt;strong&gt;'tutorotengcode123'&lt;/strong&gt; as my bucket name. &lt;strong&gt;Note&lt;/strong&gt;: This name must be unique among all created buckets on the AWS platform.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnkvruq7xtbvtaikacq6v.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnkvruq7xtbvtaikacq6v.PNG" alt="bucket name" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;strong&gt;Object Ownership&lt;/strong&gt; section, choose &lt;strong&gt;ACLs enabled&lt;/strong&gt; and &lt;strong&gt;Object writer&lt;/strong&gt;. By selecting &lt;strong&gt;ACLs enabled&lt;/strong&gt;, we allow objects in this bucket to be owned by other AWS accounts, with access specified using ACLs. By selecting &lt;strong&gt;Object writer&lt;/strong&gt;, the account that writes an object remains the object owner.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3l8gofffxlsddcjuib0.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3l8gofffxlsddcjuib0.PNG" alt="Object ownership" width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;strong&gt;Block Public Access settings&lt;/strong&gt; for this bucket, deselect the &lt;strong&gt;Block all public access&lt;/strong&gt; option and select the &lt;strong&gt;check box&lt;/strong&gt; to acknowledge that &lt;strong&gt;block all public access&lt;/strong&gt; is being turned off.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkouv4psybtleiehz4kg9.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkouv4psybtleiehz4kg9.PNG" alt="block access" width="800" height="678"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note: &lt;strong&gt;Buckets&lt;/strong&gt;, &lt;strong&gt;access points&lt;/strong&gt;, and &lt;strong&gt;objects&lt;/strong&gt; do not allow public access by default. By deselecting Block all public access and acknowledging the current settings, we are allowing public access.&lt;br&gt;
Scroll down, keep the &lt;strong&gt;default settings&lt;/strong&gt; for &lt;strong&gt;encryption&lt;/strong&gt;, and click on &lt;strong&gt;Create bucket&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ucgw380t434pp14toiz.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ucgw380t434pp14toiz.PNG" alt="defaults keep" width="800" height="532"&gt;&lt;/a&gt;&lt;br&gt;
We can view details of the created bucket by clicking on &lt;strong&gt;View details&lt;/strong&gt; in the &lt;strong&gt;Success alert&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff93azgx63uokpgyoldi9.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff93azgx63uokpgyoldi9.PNG" alt="Success Alert" width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click on the name of the created bucket and in the &lt;strong&gt;Objects tab&lt;/strong&gt;, click on the &lt;strong&gt;Upload&lt;/strong&gt; button.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29do3euegbhadhk3cdb5.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29do3euegbhadhk3cdb5.PNG" alt="Upload" width="800" height="302"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;strong&gt;Files and folders&lt;/strong&gt; section, click on &lt;strong&gt;Add files&lt;/strong&gt;. &lt;br&gt;
At this stage we can add all the files and folders that make up our static website. Select all the &lt;strong&gt;files and folders&lt;/strong&gt; and click on &lt;strong&gt;Upload&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F97x00tycuy03v82c24yv.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F97x00tycuy03v82c24yv.PNG" alt="addfiles" width="800" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After a successful upload of files, the Success alert should show &lt;strong&gt;Upload succeeded&lt;/strong&gt;.&lt;br&gt;
Let’s review the uploaded files. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm8g7mi95kmybg1ca97sk.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm8g7mi95kmybg1ca97sk.PNG" alt="review" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;
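
&lt;p&gt;If your website files live in a local folder, the upload can also be scripted. Here is a rough boto3 sketch, assuming a hypothetical local folder named &lt;strong&gt;site&lt;/strong&gt; that contains index.html and the rest of the website files.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import mimetypes
import os
import boto3

s3 = boto3.client("s3")
bucket = "tutorotengcode123"   # the bucket created above
site_dir = "site"              # hypothetical local folder with the website files

# Upload every file in the folder, guessing a Content-Type so browsers render it.
for root, _dirs, files in os.walk(site_dir):
    for name in files:
        path = os.path.join(root, name)
        key = os.path.relpath(path, site_dir).replace(os.sep, "/")
        content_type = mimetypes.guess_type(path)[0] or "binary/octet-stream"
        s3.upload_file(path, bucket, key, ExtraArgs={"ContentType": content_type})
        print("uploaded", key)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;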

&lt;p&gt;Click on the &lt;strong&gt;Properties Tab&lt;/strong&gt; and scroll down to the &lt;strong&gt;Static website hosting&lt;/strong&gt; section. &lt;br&gt;
&lt;strong&gt;Static website hosting&lt;/strong&gt; is disabled by &lt;strong&gt;default&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8edgpl2qw6hmd9xoqtfk.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8edgpl2qw6hmd9xoqtfk.PNG" alt="disabled" width="800" height="162"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click on &lt;strong&gt;Edit&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk0cuk6vpjnruubqlgyii.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk0cuk6vpjnruubqlgyii.png" alt="Edit" width="800" height="162"&gt;&lt;/a&gt;&lt;br&gt;
For &lt;strong&gt;Static website hosting&lt;/strong&gt;, select &lt;strong&gt;Enable&lt;/strong&gt;; for &lt;strong&gt;Hosting type&lt;/strong&gt;, select &lt;strong&gt;Host a static website&lt;/strong&gt;; and for &lt;strong&gt;Index document&lt;/strong&gt;, enter &lt;strong&gt;index.html&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fug7njqfye8u6pyc903s0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fug7njqfye8u6pyc903s0.png" alt="index" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click on the &lt;strong&gt;Permissions tab&lt;/strong&gt;, make sure &lt;strong&gt;Block all public access&lt;/strong&gt; is set to &lt;strong&gt;Off&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6cce831uls855npuar4f.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6cce831uls855npuar4f.PNG" alt="off" width="800" height="328"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click on &lt;strong&gt;Edit&lt;/strong&gt; in the &lt;strong&gt;Bucket policy section&lt;/strong&gt;. Under &lt;strong&gt;Bucket ARN&lt;/strong&gt;, click to copy the &lt;strong&gt;bucket’s ARN&lt;/strong&gt;. In the &lt;strong&gt;Policy editor&lt;/strong&gt;, copy and paste the bucket policy below, replacing &lt;strong&gt;“Your_Bucket_ARN”&lt;/strong&gt; with your bucket’s ARN.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "Id": "MyPolicy",
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3GetObjectAllow",
      "Action": [
        "s3:GetObject"
      ],
      "Effect": "Allow",
      "Resource": "Your_Bucket_ARN/*",
      "Principal": "*"
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9412zhb9r0ytiyxgix22.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9412zhb9r0ytiyxgix22.png" alt="grant" width="663" height="928"&gt;&lt;/a&gt;&lt;br&gt;
The policy grants the &lt;code&gt;s3:GetObject&lt;/code&gt; permission to any anonymous public user. Now click on &lt;strong&gt;Save changes&lt;/strong&gt;. &lt;/p&gt;
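
&lt;p&gt;The bucket policy above and the static website hosting setting we enabled earlier can also be applied with boto3. A minimal sketch, assuming the bucket name used in this tutorial:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import boto3

s3 = boto3.client("s3")
bucket = "tutorotengcode123"

# Same policy as above: allow anonymous s3:GetObject on every object in the bucket.
policy = {
    "Id": "MyPolicy",
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "S3GetObjectAllow",
            "Action": ["s3:GetObject"],
            "Effect": "Allow",
            "Resource": f"arn:aws:s3:::{bucket}/*",
            "Principal": "*",
        }
    ],
}
s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))

# Enable static website hosting with index.html as the index document.
s3.put_bucket_website(
    Bucket=bucket,
    WebsiteConfiguration={"IndexDocument": {"Suffix": "index.html"}},
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;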

&lt;p&gt;Let’s go back to the &lt;strong&gt;Properties Tab&lt;/strong&gt; and scroll down to &lt;strong&gt;Static website hosting&lt;/strong&gt; section. In the &lt;strong&gt;Static website hosting section&lt;/strong&gt;, under &lt;strong&gt;Hosting type&lt;/strong&gt;, check to ensure that &lt;strong&gt;bucket hosting&lt;/strong&gt; is set. Under the &lt;strong&gt;Bucket website endpoint&lt;/strong&gt;, click the copy icon to copy the &lt;strong&gt;Bucket website endpoint&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fup3w1b66sptczp1c02p6.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fup3w1b66sptczp1c02p6.PNG" alt="congratulations" width="800" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Paste the copied bucket website endpoint in your browser’s &lt;strong&gt;address bar&lt;/strong&gt; and press &lt;strong&gt;enter&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy02agafsdq4jlihgu5zx.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy02agafsdq4jlihgu5zx.PNG" alt="final" width="800" height="545"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Congratulations! We have successfully hosted a static website on Amazon S3.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>career</category>
      <category>sideprojects</category>
      <category>discuss</category>
    </item>
  </channel>
</rss>
