DEV Community: Rocio Baigorria

Stop Overpaying for AWS Data Transfer: A Guide to VPC Endpoints

Rocio Baigorria — Tue, 05 May 2026 13:27:30 +0000

If you are building data pipelines, you’ve probably seen a NAT Gateway charge that made you double-check your architecture.

While prepping for the AWS Solutions Architect Associate (SAA) exam, I’ve been diving into the "invisible" side of networking. We often assume that for a Lambda or an EC2 instance to talk to S3, it must go through the public internet.

This is a costly mistake.

The Problem: The "Public" Default

By default, services like S3, DynamoDB, or Kinesis live outside your VPC. To reach them from a private subnet, traffic usually flows through a NAT Gateway. This introduces:

Security Risks: Your data technically leaves your network perimeter.

Cost Inefficiency: You pay for every GB that passes through that NAT Gateway.

The Solution: VPC Endpoints (AWS PrivateLink)
VPC Endpoints allow you to create a private connection between your VPC and supported AWS services. The traffic never leaves the Amazon network.

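Gateway Endpoints (The "OGs")

These are specific to S3 and DynamoDB.

How they work: They don't place a network interface or private IP in your subnets. Instead, they add a route to your route table whose destination is the service's prefix list and whose target is the endpoint.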

Cost: They are free. There is no reason not to use them.
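As a rough sketch of what creating an S3 Gateway Endpoint could look like with boto3, the helper below builds the arguments for EC2's `create_vpc_endpoint` call. The function names and the VPC and route table IDs are made up for illustration; only the API parameters themselves come from the SDK.

```python
from typing import Any


def gateway_endpoint_params(vpc_id: str, region: str, route_table_ids: list[str]) -> dict[str, Any]:
    """Build the arguments for EC2's CreateVpcEndpoint call (S3 gateway endpoint)."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        # Regional service name for S3; the DynamoDB one ends in ".dynamodb".
        "ServiceName": f"com.amazonaws.{region}.s3",
        # The endpoint route is added to each of these route tables.
        "RouteTableIds": route_table_ids,
    }


def create_gateway_endpoint(params: dict[str, Any]) -> str:
    """Create the endpoint and return its ID. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the helper above works without the SDK

    ec2 = boto3.client("ec2")
    response = ec2.create_vpc_endpoint(**params)
    return response["VpcEndpoint"]["VpcEndpointId"]


if __name__ == "__main__":
    # Hypothetical IDs, for illustration only.
    print(gateway_endpoint_params("vpc-0123456789abcdef0", "us-east-1", ["rtb-0123456789abcdef0"]))
```

Keeping the parameter builder separate from the API call makes it easy to review (or unit test) what will be created before anything touches your account.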

Interface Endpoints (Powered by PrivateLink)

These are for almost everything else (SNS, SQS, Kinesis, Athena).

How they work: They provision an Elastic Network Interface (ENI) with a private IP in your subnet.

Cost: There is an hourly charge and a data processing charge. However, for high-volume data pipelines (like Kinesis streams), they are often significantly cheaper than NAT Gateway transfer fees.

Real-World Use Case: The Data Lake Ingestion

Imagine a fleet of producers in a private subnet sending TBs of data to S3 and Kinesis.

Without Endpoints: You pay for NAT Gateway processing for every single byte.

With Endpoints:

Your S3 traffic is free via Gateway Endpoints.
Your Kinesis traffic is private and potentially cheaper via Interface Endpoints.
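To make the comparison concrete, here is a back-of-the-envelope calculator. The rates are placeholder values I picked for illustration, not quoted prices, so plug in the current numbers for your region before drawing conclusions. The function and class names are mine.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # common approximation for one month


@dataclass(frozen=True)
class Rates:
    """Illustrative rates in USD (assumptions); replace with current pricing."""
    nat_hourly: float = 0.045
    nat_per_gb: float = 0.045
    interface_hourly_per_az: float = 0.01
    interface_per_gb: float = 0.01


def nat_monthly_cost(gb: float, rates: Rates = Rates()) -> float:
    """Hourly NAT Gateway charge plus per-GB data processing."""
    return rates.nat_hourly * HOURS_PER_MONTH + rates.nat_per_gb * gb


def interface_endpoint_monthly_cost(gb: float, azs: int, rates: Rates = Rates()) -> float:
    """Hourly charge per endpoint ENI (one per AZ) plus per-GB data processing."""
    return rates.interface_hourly_per_az * HOURS_PER_MONTH * azs + rates.interface_per_gb * gb


if __name__ == "__main__":
    monthly_gb = 10_000  # roughly 10 TB per month
    print(f"NAT Gateway:        ${nat_monthly_cost(monthly_gb):,.2f}")
    print(f"Interface endpoint: ${interface_endpoint_monthly_cost(monthly_gb, azs=2):,.2f}")
```

Keep in mind that you may still need the NAT Gateway for other outbound traffic, in which case its hourly charge stays and only the per-GB part moves to the endpoint.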

Moving to "Secure by Design"
A junior architect builds a connection. A senior architect builds a secure pipe.

VPC Endpoints allow you to attach Endpoint Policies. This is where you move from "it works" to "it's bulletproof." You can define exactly which IAM principal can access which specific resource through that endpoint.

🛡️ The Policy: Granular Control
As promised, here is an example of a VPC Endpoint Policy for S3.

This policy ensures that the endpoint can only be used to access a specific production bucket, preventing data exfiltration to unauthorized accounts even if an attacker gains access to your compute resources.

JSON
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RestrictToSpecificBucket",
      "Effect": "Allow",
      "Principal": "*",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::my-production-data-bucket",
        "arn:aws:s3:::my-production-data-bucket/*"
      ]
    }
  ]
}

Why this matters:
Even if a developer accidentally leaks an IAM access key with s3:* permissions, that key cannot be used through this endpoint to upload data to a personal bucket. The "pipe" itself is now intelligent.
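If the wildcard matching feels abstract, this toy check mimics the idea: a request passes only when the action is on the allow list and the resource ARN matches one of the patterns. It is a deliberately simplified model that I wrote for illustration, not how IAM actually evaluates policies (no conditions, no explicit denies, no identity policies).

```python
from fnmatch import fnmatchcase

# Same actions and bucket as the example policy; "/*" covers every object in the bucket.
ALLOWED_ACTIONS = {"s3:GetObject", "s3:PutObject"}
ALLOWED_RESOURCES = [
    "arn:aws:s3:::my-production-data-bucket",
    "arn:aws:s3:::my-production-data-bucket/*",
]


def endpoint_allows(action: str, resource_arn: str) -> bool:
    """Simplified check: allowed action AND resource matching a pattern."""
    if action not in ALLOWED_ACTIONS:
        return False
    return any(fnmatchcase(resource_arn, pattern) for pattern in ALLOWED_RESOURCES)


if __name__ == "__main__":
    print(endpoint_allows("s3:PutObject", "arn:aws:s3:::my-production-data-bucket/2026/04/sales.parquet"))
    print(endpoint_allows("s3:PutObject", "arn:aws:s3:::someones-personal-bucket/stolen.parquet"))
```

The second call is the exfiltration attempt: the credentials might allow it, but the endpoint policy does not list that bucket.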

What about you?
Have you ever been surprised by a NAT Gateway bill? Or are you currently wrestling with Endpoint Policies for multi-region setups? Let's discuss below! ☕👇

AWS Lambda Invocations: 3 Hard Lessons on Tradeoffs

Rocio Baigorria — Wed, 29 Apr 2026 10:49:57 +0000

AWS Lambda: 3 Lessons Learned on Invocations and Tradeoffs

Moving from a monolith to an event-driven architecture feels like a superpower until you hit your first production bottleneck. As a Data Engineer currently prepping for the AWS Solutions Architect Associate (SAA) exam, I’ve had to re-evaluate how I trigger my functions.

It’s not just about "making it work"; it’s about choosing the right tradeoff between cost, speed, and reliability. Here are my top lessons learned from the trenches.

1. The Cost of Waiting: Synchronous vs. Asynchronous

The Lesson: Never make a user (or a calling service) wait for a data-intensive process.

In my early projects, I used synchronous calls for almost everything because they were easier to debug. But I learned that synchronous invocations:

Increase Costs: You pay for the idle time of the calling service.
Create Fragility: If the Lambda fails, the upstream service fails too.

The Tradeoff: We moved to Asynchronous Invocations (--invocation-type Event).

Benefit: Instant 202 Status to the caller and built-in retries (AWS retries twice by default).
Cost: You lose immediate confirmation. You must implement Dead Letter Queues (DLQ) to track failures.
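From Python, the same asynchronous call can be sketched with boto3's `invoke`. The wrapper name and the function name in the usage are hypothetical; the client is passed in so the wrapper can be exercised with a stub.

```python
import json
from typing import Any


def invoke_async(lambda_client: Any, function_name: str, payload: dict) -> int:
    """Queue an event for asynchronous processing and return the HTTP status code."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="Event",  # asynchronous; "RequestResponse" is the synchronous default
        Payload=json.dumps(payload).encode("utf-8"),
    )
    return response["StatusCode"]  # 202 means Lambda accepted the event


if __name__ == "__main__":
    import boto3  # needs AWS credentials and an existing function

    status = invoke_async(boto3.client("lambda"), "data-processor", {"order_id": 42})
    print(status)
```

A 202 only tells you the event was accepted, not that it was processed, which is exactly why the failure path (DLQ or an on-failure destination) matters.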

2. Polling vs. Pushing (The Streaming Dilemma)

The Lesson: Not all triggers are created equal.

When working with Amazon SQS or Kinesis, Lambda uses Poll-based invocation.

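The Tradeoff: You don't "push" events; Lambda's event source mapping polls the queue or stream for you and invokes your function with batches of records.
Lesson Learned: Tuning the BatchSize is critical. Too small, and you pay for many more invocations per message processed; too large, and a single poison message fails the whole batch, which is then retried in full unless you report partial batch failures.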
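One way to keep a single poison message from failing an entire SQS batch is Lambda's partial batch response. The sketch below assumes `ReportBatchItemFailures` is enabled on the event source mapping and that message bodies are JSON; `handle_batch` and the `process` callback are names I made up.

```python
import json
from typing import Any, Callable


def handle_batch(event: dict[str, Any], process: Callable[[dict], None]) -> dict[str, list]:
    """Process each SQS record and report only the failed ones back to Lambda."""
    failures = []
    for record in event.get("Records", []):
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Only this message becomes visible again; successful ones are deleted.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


def lambda_handler(event: dict[str, Any], context: Any) -> dict[str, list]:
    # Replace the lambda below with real business logic.
    return handle_batch(event, lambda message: None)
```

Without the partial response, any exception sends the whole batch back to the queue, including the messages that were already processed fine.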

3. Infrastructure as the Source of Truth

The Lesson: If it’s not in Terraform, it doesn’t exist.

Manual tweaks in the AWS Console are "technical debt in real-time." I now treat invocation permissions as part of the core logic. Using STS (Security Token Service) to assume roles with temporary credentials instead of long-lived keys was a game-changer for our security audits.

Lesson in Least Privilege: Only S3 can trigger this specific processor

resource "aws_lambda_permission" "allow_s3" {
  statement_id  = "AllowExecutionFromS3Bucket"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.data_processor.function_name
  principal     = "s3.amazonaws.com"
  source_arn    = aws_s3_bucket.raw_data.arn
}

⚡ Final Thoughts

Architecture is a series of trade-offs. Being a Technical Rebel means questioning the default settings. Don't just use RequestResponse because it's the default. Think about your data's journey, your budget, and your sleep during on-call rotations.

What’s a lesson you learned the hard way while working with AWS Lambda? Let’s discuss below!

Your Serverless Data Lake is Lying to You (Add Observability or Lose Data)

Rocio Baigorria — Mon, 20 Apr 2026 13:15:30 +0000

TL;DR:

Serverless data lakes scale well, but can fail silently.
Without observability, you risk incomplete or incorrect data.
Add a DLQ to capture failed events.
Use Amazon CloudWatch + Amazon SNS for real visibility.
Trade-off: More components, but far more reliable pipelines.

The Moment I Stopped Trusting "Successful Pipelines"
It was 2 AM.
The pipeline had "completed successfully."
Amazon Athena was returning results.
But the numbers didn’t match.

Digging into Amazon CloudWatch logs, I found the issue:
Messages were stuck in a queue no one was monitoring.
No alerts. No visible errors. Just missing data.

Serverless systems don’t fail loudly. They fail silently.

The Typical Setup (and the Hidden Risk)

Most people build serverless data lakes like this:

Amazon S3 → storage

AWS Glue → transformations

Amazon Athena → querying

It works.
But it assumes that if the pipeline runs… the data is correct.
That assumption is dangerous.

What Was Missing: Observability

The problem wasn’t compute or storage. It was visibility.

I couldn’t answer basic questions:

Did all events get processed?
Did anything fail permanently?
Is data delayed or missing?

If you can’t answer those, you don’t have a production system.

The Fix: Design for Failure

I reworked the architecture for an e-commerce analytics demo with one rule: Every failure must be visible.

1. Add a Buffer (S3 → SQS)
Instead of triggering jobs directly:

Amazon S3 emits events

Amazon SQS captures them

Why it matters:

Decoupling
Retry control
No lost events on spikes

2. Add a DLQ (Non-Negotiable)
Every queue has a Dead Letter Queue.
After retries fail: → Message goes to DLQ.

Now:

Nothing disappears

You can inspect failures

You can replay data

Without a DLQ, you’re guessing.
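Wiring a DLQ to an SQS queue comes down to a `RedrivePolicy` attribute on the source queue. Here is a small sketch of building it; the helper name, the ARN, and the default of 3 receives are assumptions for the example.

```python
import json


def queue_attributes_with_dlq(dlq_arn: str, max_receive_count: int = 3) -> dict[str, str]:
    """Attributes for the source queue.

    After max_receive_count unsuccessful receives, SQS moves the message to the DLQ.
    """
    return {
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
        )
    }


if __name__ == "__main__":
    # Hypothetical ARN, for illustration only.
    attrs = queue_attributes_with_dlq("arn:aws:sqs:us-east-1:123456789012:ingest-dlq")
    print(attrs)
    # With boto3 this could be passed as:
    # boto3.client("sqs").create_queue(QueueName="ingest", Attributes=attrs)
```

A low `max_receive_count` surfaces bad messages quickly; a higher one gives transient errors more chances to recover before the message is parked.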

3. Keep Orchestration Simple
AWS Lambda polls SQS

Triggers AWS Glue jobs
No heavy orchestrators needed for this use case.
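A minimal sketch of that glue code (pun intended): the Lambda reads S3 notifications out of the SQS records and starts one Glue job run per object. The job name and the `--source_path` argument are hypothetical, and the Glue client is injected so the logic can be tested without AWS.

```python
import json
from typing import Any
from urllib.parse import unquote_plus


def s3_objects_from_sqs(event: dict[str, Any]) -> list[tuple[str, str]]:
    """Extract (bucket, key) pairs from S3 notifications delivered via SQS."""
    objects = []
    for record in event.get("Records", []):
        notification = json.loads(record["body"])
        for s3_record in notification.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            key = unquote_plus(s3_record["s3"]["object"]["key"])  # keys arrive URL-encoded
            objects.append((bucket, key))
    return objects


def start_jobs(glue_client: Any, event: dict[str, Any], job_name: str) -> list[str]:
    """Start one Glue job run per new object and return the run IDs."""
    run_ids = []
    for bucket, key in s3_objects_from_sqs(event):
        response = glue_client.start_job_run(
            JobName=job_name,
            Arguments={"--source_path": f"s3://{bucket}/{key}"},  # hypothetical job argument
        )
        run_ids.append(response["JobRunId"])
    return run_ids
```

If `start_job_run` raises, the exception propagates, the message goes back to the queue, and after enough failed receives it lands in the DLQ instead of vanishing.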

4. Optimize for Analytics
Raw data in S3 (CSV/JSON)

Transform to Parquet

Partition by date

This keeps costs down and queries fast in Amazon Athena.

Observability (The Part Most People Skip)

This is the difference between "it works" and "it’s reliable".

Metrics (Amazon CloudWatch)

Queue depth
DLQ size
Glue job failures
Lambda errors

Alerts (Amazon SNS)

DLQ > 0 → alert
Glue job fails → alert
Pipeline inactivity → alert

If something breaks, you should know immediately.
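The "DLQ > 0" alert can be expressed as a CloudWatch alarm on the queue's `ApproximateNumberOfMessagesVisible` metric. This sketch only builds the arguments for `put_metric_alarm`; the alarm name pattern, the 5-minute period, and the ARN in the usage are my assumptions.

```python
from typing import Any


def dlq_alarm_params(dlq_name: str, sns_topic_arn: str) -> dict[str, Any]:
    """Arguments for CloudWatch PutMetricAlarm: fire when the DLQ holds any message."""
    return {
        "AlarmName": f"{dlq_name}-not-empty",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": dlq_name}],
        "Statistic": "Maximum",
        "Period": 300,  # seconds
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # an idle queue should not page anyone
        "AlarmActions": [sns_topic_arn],  # SNS topic that notifies the team
    }


if __name__ == "__main__":
    params = dlq_alarm_params("ingest-dlq", "arn:aws:sns:us-east-1:123456789012:data-alerts")
    print(params)
    # With boto3: boto3.client("cloudwatch").put_metric_alarm(**params)
```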

Trade-Offs

What you gain:

Reliable data pipelines

Full visibility

Faster debugging

Confidence in your data

What you pay:

More moving parts (SQS, DLQ, Lambda)

Slight increase in cost

Extra setup for monitoring

The Real Decision
You’re not choosing between simple and complex.
You’re choosing between:

A simple system that hides failures

A system that tells you when it breaks

For production systems, that’s not optional.

Final Thought

Serverless removes infrastructure.
It does NOT remove responsibility.

If you don’t design for observability:
Your system will fail quietly—and you won’t know when.

How are you handling failures in your pipelines?
Do you have a DLQ… or are you trusting logs? 👇

Stop Babysitting Servers: Build a Scalable Serverless Data Lake on AWS

Rocio Baigorria — Mon, 06 Apr 2026 21:07:48 +0000

Building data pipelines shouldn't feel like babysitting servers. If you’ve ever managed a dedicated cluster just to run a few SQL queries, you know the pain: capacity planning, idle costs, and the "fun" of scaling infrastructure at 3 AM.

As a Data Engineering professional, I always follow a simple mantra: Design, then exist. (Or in this case: Design serverless, then relax.)

Today, we’re breaking down how to centralize your fragmented data into a Serverless Data Lake using the "Big Three" of AWS: S3, Glue, and Athena.

Why Serverless?

The beauty of a serverless approach is the decoupling of storage from compute. You only pay for what you store and what you process.

Amazon S3 (The Backbone)

S3 is your central repository. A professional setup doesn't just "dump" data; it organizes it into layers:

Raw Layer: The "Source of Truth." Data exactly as it arrived (CSV, JSON, Logs).

Curated Layer: Cleaned, partitioned, and optimized data (usually in Parquet format).

AWS Glue (The Librarian)

You don't want to manually define schemas. Glue Crawlers scan your S3 buckets, infer the data types, and populate the Glue Data Catalog, which acts as a central metadata repository.

Amazon Athena (The Engine)

Athena is an interactive query service that lets you run standard SQL directly against your files in S3. There are no clusters to spin up and no infrastructure to manage.

Quick Implementation: From S3 to SQL

Ingest: Upload your dataset into your raw S3 bucket.

Catalog: Point a Glue Crawler at that bucket. Once it finishes, you'll see a new table in your Data Catalog.

Query: Open the Athena Console and run your analysis:

SQL
-- Aggregating sales data directly from S3 files
SELECT
    region,
    SUM(amount) AS total_sales
FROM "data_lake_db"."sales_curated"
GROUP BY region
ORDER BY total_sales DESC;

Data Engineer Pro-Tips

If you're moving from a POC to production, keep these two things in mind:

Friends don't let friends use CSV for Analytics: Convert your data to Apache Parquet. Athena bills by the amount of data scanned, and because Parquet is a columnar format, Athena only reads the columns you actually query. This can reduce your query costs by up to 90%.
Partitioning is King: Organize your S3 paths by date (e.g., s3://my-bucket/year=2026/month=04/). This limits the amount of data Athena has to scan, making your queries lightning-fast.
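For the partitioning tip, a tiny helper like this keeps writers consistent about the path layout. The function name and the optional `day` level are my own additions; the `year=/month=` layout follows the example path.

```python
from datetime import date


def partition_prefix(bucket: str, day: date, with_day: bool = False) -> str:
    """Build a Hive-style partition prefix such as s3://bucket/year=2026/month=04/."""
    prefix = f"s3://{bucket}/year={day.year:04d}/month={day.month:02d}/"
    if with_day:
        prefix += f"day={day.day:02d}/"
    return prefix


if __name__ == "__main__":
    print(partition_prefix("my-bucket", date(2026, 4, 6)))
    print(partition_prefix("my-bucket", date(2026, 4, 6), with_day=True))
```

Zero-padding the month and day keeps prefixes sorting correctly as strings.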

Final Thoughts

Serverless Data Lakes allow us to experiment fast. You can build a proof-of-concept in an afternoon and scale it to petabytes without ever touching a Linux terminal.

Are you using a Data Lake at your company, or are you still sticking with traditional Data Warehouses? Let's talk about the pros and cons in the comments!

Flink + AI: Building Real-Time Decision Systems (Not Just Data Pipelines)

Rocio Baigorria — Tue, 31 Mar 2026 11:57:20 +0000

The problem is no longer moving data

For years, “real-time” meant pushing data from transactional systems into dashboards as fast as possible.

That’s no longer enough.

Today, while events are still happening, something — or someone — needs to decide.

The bottleneck isn’t speed anymore.
It’s context.

An AI model without fresh context makes poor decisions.
A pipeline without governance creates noise.
A stateless system cannot understand what’s actually happening.

In a world measured in milliseconds, moving data isn’t the goal.
We need systems that understand context and act while the data is still valuable.

This forces a shift in mindset:

from data pipelines → to decision architectures

The power stack: Flink + AI agents

This is where Apache Flink enters the picture.

Flink is not just another streaming engine.
It’s designed to process events where state and time are first-class citizens.

Two capabilities make it critical:

Stateful processing → it keeps memory across events. You don’t just see the current data point; you see its recent history.
Windowing → it groups events over time (seconds, minutes, hours) to detect patterns instead of isolated signals.
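To show what state plus windowing means in practice, here is a plain-Python sketch of a keyed tumbling-window count. It is a conceptual illustration only and does not use Flink's API; the event shape (key, event time in seconds) is an assumption.

```python
from collections import defaultdict
from typing import Iterable


def tumbling_window_counts(
    events: Iterable[tuple[str, float]], window_seconds: int
) -> dict[tuple[str, int], int]:
    """Count events per (key, window start). Each event is (key, event_time_seconds)."""
    counts: defaultdict[tuple[str, int], int] = defaultdict(int)  # state kept across events
    for key, event_time in events:
        # Every event falls into exactly one fixed-size, non-overlapping window.
        window_start = int(event_time // window_seconds) * window_seconds
        counts[(key, window_start)] += 1
    return dict(counts)


if __name__ == "__main__":
    stream = [("card-1", 5), ("card-1", 30), ("card-1", 65), ("card-2", 10)]
    print(tumbling_window_counts(stream, window_seconds=60))
```

The dictionary is the "memory": three swipes of `card-1` are not three isolated signals but two in the first minute and one in the second. A real engine also handles checkpointing, late data, and state expiry, which this sketch ignores.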

Now combine that with:

an event backbone like Kafka
AI agents (for example, powered by Bedrock or similar platforms)

The flow changes completely:

Events enter through Kafka
Flink processes, cleans, aggregates, and maintains state
The output feeds an AI agent with fresh, structured context
The agent doesn’t just answer — it acts

This is the critical shift:

You’re no longer asking
“What happened?”

You’re asking
“What should I do now?”

Use case: the data “purifier”

Think about it this way.

You wouldn’t drink water directly from a raw source.
You need a purifier to remove impurities and make it safe.

Data works the same way.

An AI agent fed with raw event streams will:

mix old and new signals
lose temporal context
produce inconsistent or “hallucinated” decisions

Flink plays the role of that purifier:

deduplicates events
corrects out-of-order data
enriches streams with state
filters noise

The result is a clean, reliable stream of truth.
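Two of those purifier steps, deduplication and reordering, can be sketched in a few lines of plain Python. Again, this is a toy model and not Flink code; the event shape (id, event time) and the `max_lateness` knob are assumptions that loosely mirror the watermark idea.

```python
import heapq
from typing import Iterable, Iterator


def purify(events: Iterable[tuple[str, int]], max_lateness: int) -> Iterator[tuple[str, int]]:
    """Yield (event_id, event_time) pairs, deduplicated and in event-time order.

    Ordering is only guaranteed for events that arrive at most max_lateness
    time units behind the newest event seen so far.
    """
    seen: set[str] = set()  # unbounded here; a real system would expire old IDs
    buffer: list[tuple[int, str]] = []  # min-heap ordered by event time
    watermark = None
    for event_id, event_time in events:
        if event_id in seen:
            continue  # duplicate delivery
        seen.add(event_id)
        heapq.heappush(buffer, (event_time, event_id))
        watermark = event_time if watermark is None else max(watermark, event_time)
        # Release everything at or below the watermark minus the allowed lateness.
        while buffer and buffer[0][0] <= watermark - max_lateness:
            released_time, released_id = heapq.heappop(buffer)
            yield released_id, released_time
    while buffer:  # end of stream: flush what is left
        released_time, released_id = heapq.heappop(buffer)
        yield released_id, released_time


if __name__ == "__main__":
    raw = [("a", 1), ("c", 3), ("b", 2), ("a", 1), ("d", 6)]
    print(list(purify(raw, max_lateness=2)))
```

The trade-off is visible in `max_lateness`: a larger value tolerates more disorder but delays every downstream decision by that much.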

When that stream reaches the AI agent, everything changes.

The agent is no longer reacting to fragmented inputs.
It operates on a coherent, real-time representation of reality.

And in real-time systems, that’s the difference between:

automating decisions
or scaling mistakes

From pipelines to systems that decide

We’re entering a phase where the value is no longer in visualizing data, but in acting on it at the right moment.

Flink is not just a processing tool.
It’s a foundational layer for building systems that understand context.

AI agents don’t replace this layer.
They depend on it.

Right now, I’m going deep into this stack — preparing for the Data Streaming World Tour and working toward Flink certification — with a clear focus:

designing systems where data doesn’t just flow, but drives real-time decisions

The real question

How are you managing state in your AI agents in production?

Kafka and Data Streaming: From Batch Thinking to Real-Time Systems

Rocio Baigorria — Tue, 24 Mar 2026 14:38:54 +0000

Most systems don’t fail because of scale. They fail because they were designed for a world that no longer exists.

A world where data arrives late, gets processed in batches, and decisions can wait.

That world is gone.

Today, data moves continuously. Payments, user behavior, logistics, fraud signals — everything is happening in motion. If your system waits, you lose.

This is where data streaming — and Apache Kafka — changes the game.

What is Data Streaming?

Data streaming is the practice of processing data as it is generated, instead of storing it first and analyzing it later.

Think of it like this:

Batch processing: collect → store → process → act
Streaming: produce → process → act (in real time)

The shift is not technical. It’s architectural.

Streaming forces you to think in events, not tables.

Enter Apache Kafka

Apache Kafka is a distributed event streaming platform designed to handle high-throughput, real-time data feeds.

At its core, Kafka is built around a simple idea:

Everything is an event.

An event can be:

A payment
A user click
A sensor reading
A log entry

These events are written to topics, which act like append-only logs.

From there:

Producers send events into Kafka
Consumers read events from Kafka
Consumer groups allow systems to scale horizontally

Kafka doesn’t just move data. It becomes the backbone of your system.

Why Kafka Matters for Data Engineers

Kafka is not just another tool. It represents a shift in how systems are designed.

1. Decoupling Systems

Instead of services calling each other directly, they communicate through events.

Result:

Fewer dependencies
More resilience
Easier scaling

2. Real-Time Processing

You don’t wait for data pipelines to run every hour.

You react instantly.

Use cases:

Fraud detection
Recommendations
Monitoring and alerting

3. Replayability

Kafka stores events for a configurable period.

That means you can:

Reprocess data
Fix bugs retroactively
Build new consumers without touching producers

This is a massive advantage over traditional pipelines.
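Here is a toy model of why replay works: the log is append-only, and each consumer group just tracks its own offset into it. This is an in-memory illustration I wrote for the idea, not Kafka's client API, and it ignores partitions and retention.

```python
class Topic:
    """A toy append-only log with an independent read position per consumer group."""

    def __init__(self) -> None:
        self._log: list[str] = []
        self._offsets: dict[str, int] = {}

    def produce(self, event: str) -> int:
        """Append an event and return its offset."""
        self._log.append(event)
        return len(self._log) - 1

    def poll(self, group: str) -> list[str]:
        """Return events the group has not read yet and advance its offset."""
        start = self._offsets.get(group, 0)
        self._offsets[group] = len(self._log)
        return self._log[start:]

    def seek(self, group: str, offset: int) -> None:
        """Move a group's position, for example to reprocess after a bug fix."""
        self._offsets[group] = offset


if __name__ == "__main__":
    topic = Topic()
    for event in ["payment-1", "payment-2", "payment-3"]:
        topic.produce(event)
    print(topic.poll("billing"))    # reads everything once
    print(topic.poll("analytics"))  # a new consumer starts from the beginning
    topic.seek("billing", 1)        # rewind after fixing a bug
    print(topic.poll("billing"))
```

Reading never removes anything, which is the key difference from a classic message queue: producers are untouched no matter how many groups read or re-read the log.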

The Mental Shift: Thinking in Events

Most people struggle with Kafka not because it’s complex, but because it requires a different way of thinking.

Instead of asking:

“What data do I have?”

You ask:

“What just happened?”

That single shift changes everything.

You stop designing databases first
You start designing flows

A Simple Example

Imagine an e-commerce platform.

Instead of updating multiple services directly after a purchase, you emit an event:

OrderPlaced

From there:

Inventory service consumes the event
Payment service processes it
Notification service sends confirmation

Each service reacts independently.

No tight coupling. No fragile chains.

Common Mistakes When Starting with Kafka

Treating Kafka like a message queue
Ignoring partitioning strategy
Not planning for schema evolution
Overcomplicating the architecture too early

Final Thought

Streaming is not a trend. It’s the default.

If you’re still designing batch-first systems, you’re building latency into your architecture from day one.

Kafka is not the only tool in this space — but understanding it forces you to level up as a data engineer.

And that’s the real value.

If you're getting into data engineering, don’t just learn tools.

Learn how data moves.

That’s where the leverage is.

From Kafka to the Cloud: Designing a Real-Time Event-Driven Data Pipeline on AWS

Rocio Baigorria — Mon, 16 Mar 2026 12:05:31 +0000

Modern data platforms are increasingly built around event-driven architectures. Instead of systems constantly polling databases or relying on synchronous APIs, services react to events as they happen.

In this article I’ll walk through the design of a real-time streaming pipeline capable of processing 15,000+ events per second with sub-50ms latency.

The project started as a distributed system built with open-source technologies and later evolved into a cloud-native architecture on AWS.

The key idea is simple:

Understand the fundamentals first, then move the architecture to managed cloud services.

The Original Architecture (Local Distributed System)

The first version of the project was implemented using the following stack:

Apache Kafka for event streaming
Kafka Streams for real-time processing
Spring Boot for the processing services
PostgreSQL for durable storage
Redis for low-latency read projections
Prometheus and Grafana for monitoring

Event Flow

The pipeline follows a typical streaming architecture.

Producer → Schema Registry → Kafka → Stream Processing → Storage → Analytics

A producer publishes transaction events to Kafka
Each event is serialized using Avro and validated against Schema Registry
Kafka partitions allow parallel consumption
A streaming service processes events using Kafka Streams
Results are stored in PostgreSQL and Redis

This architecture enables real-time anomaly detection by applying sliding-window aggregations to the event stream.
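The project itself uses Kafka Streams for this, but the underlying idea fits in a short Python sketch: keep the last N values and flag a new value that sits too many standard deviations from their mean. The class name, window size, and threshold are assumptions, and this count-based window stands in for the time-based one a streaming engine would use.

```python
from collections import deque
from statistics import mean, pstdev


class SlidingWindowDetector:
    """Flag values that deviate strongly from the most recent `size` observations."""

    def __init__(self, size: int, threshold: float = 3.0) -> None:
        self.window: deque[float] = deque(maxlen=size)
        self.threshold = threshold

    def is_anomaly(self, value: float) -> bool:
        """Compare value against the current window, then slide the window forward."""
        anomalous = False
        if len(self.window) == self.window.maxlen:  # wait until the window is full
            mu, sigma = mean(self.window), pstdev(self.window)
            anomalous = sigma > 0 and abs(value - mu) > self.threshold * sigma
        self.window.append(value)  # the oldest value drops out automatically
        return anomalous


if __name__ == "__main__":
    detector = SlidingWindowDetector(size=5)
    for amount in [10, 11, 9, 10, 10, 10.5, 100]:
        print(amount, detector.is_anomaly(amount))
```

Note that the anomalous value still enters the window, so one spike inflates the baseline for the next few events; whether to exclude flagged values is a design choice.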

Performance Benchmarks

The system was designed with performance and reliability in mind.

| Metric       | Result                       |
|--------------|------------------------------|
| Throughput   | 15K+ events/sec              |
| P99 Latency  | <50ms                        |
| Availability | 99.95%                       |
| Data Loss    | 0% (exactly-once processing) |

Several optimizations helped achieve these results:

Producer batching (32KB batch size)
Snappy compression
Parallel consumers
Connection pooling
Transactional event processing
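As an illustration, the first two optimizations map onto standard Kafka producer settings. The dictionary below is a sketch, not the project's actual configuration: the 32 KB batch size and Snappy come from the list above, while `linger.ms`, `enable.idempotence`, `acks`, and the broker address are my assumptions.

```python
def producer_config(bootstrap_servers: str) -> dict[str, object]:
    """Producer settings keyed by standard Kafka configuration names."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "batch.size": 32 * 1024,       # 32 KB batches
        "compression.type": "snappy",  # compress whole batches
        "linger.ms": 5,                # assumption: short wait so batches can fill
        "enable.idempotence": True,    # assumption: no duplicates on producer retries
        "acks": "all",                 # assumption: wait for all in-sync replicas
    }


if __name__ == "__main__":
    config = producer_config("localhost:9092")
    print(config)
    # With the confluent-kafka client this could be passed as:
    # confluent_kafka.Producer(config)
```

Batching and compression trade a little latency for throughput, so `linger.ms` is the knob to watch if P99 latency matters more than raw events per second.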

Distributed Systems Patterns Implemented

This project demonstrates several architectural patterns commonly used in modern data platforms.

Event Sourcing

Kafka acts as the immutable event log. Every state change is stored as an event.

CQRS

Write operations store events while Redis maintains optimized read models.

Outbox Pattern

Ensures reliable event publishing from the database.

Saga Pattern

Coordinates distributed workflows without synchronous transactions.

Circuit Breaker

Improves resilience by isolating failing components.

Moving the Architecture to AWS

After implementing the pipeline locally, the next step was mapping the same design to managed cloud services on AWS.

The goal was not to redesign the system, but to replace infrastructure with managed services.

Cloud Architecture
Producer
↓
EventBridge / MSK
↓
Lambda processing
↓
Step Functions orchestration
↓
DynamoDB / RDS
↓
CloudWatch monitoring

Event Ingestion

Events can be published to:

Amazon EventBridge for event routing

Amazon MSK for managed Kafka streaming

Processing Layer

Events are processed by AWS Lambda, which allows the pipeline to scale automatically based on event volume.

Workflow Orchestration

Complex workflows are coordinated using AWS Step Functions, which define the pipeline as a series of steps such as:

event validation
enrichment
anomaly detection
persistence

Storage

Data can be stored depending on the access pattern:

DynamoDB for high-scale key-value access

Amazon RDS for relational workloads

Observability

Monitoring and logs are handled by Amazon CloudWatch, allowing engineers to track:

throughput
errors
latency
workflow executions

The Key Insight

The most important lesson from this project is that the architecture itself does not change when moving to the cloud.

The same principles remain:

events are immutable
services react asynchronously
systems scale through partitioned streams
state is derived from event logs

Cloud services simply remove the burden of managing infrastructure.

Final Thoughts

Understanding how streaming systems work internally makes it much easier to design reliable cloud-native data platforms.

Instead of thinking only in terms of tools, focus on the system flow:

Event → Stream → Process → Persist → Observe

Once those fundamentals are clear, migrating the system to cloud platforms like AWS becomes a natural evolution.

Design, therefore I exist.