<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: elliott cordo</title>
    <description>The latest articles on DEV Community by elliott cordo (@elliott_cordo).</description>
    <link>https://dev.to/elliott_cordo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F985543%2Fffe44725-22a1-4a62-a793-75e185a55178.jpg</url>
      <title>DEV Community: elliott cordo</title>
      <link>https://dev.to/elliott_cordo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/elliott_cordo"/>
    <language>en</language>
    <item>
      <title>Another Data Nerd Guide to re:Invent 2025</title>
      <dc:creator>elliott cordo</dc:creator>
      <pubDate>Tue, 14 Oct 2025 00:34:35 +0000</pubDate>
      <link>https://dev.to/aws-heroes/another-data-nerds-guide-to-reinvent-2025-4oo6</link>
      <guid>https://dev.to/aws-heroes/another-data-nerds-guide-to-reinvent-2025-4oo6</guid>
<description>&lt;p&gt;Well, it's that time of year again. In less than two months we'll be in amazing and weird Las Vegas, hustling between venues, trying to maximize learning and networking, and perhaps squeezing in a fun side quest.&lt;/p&gt;

&lt;p&gt;As exciting as this time can be, it's also incredibly daunting (there are nearly 3000 sessions 😱).  And to make matters worse, you need to have your favorites ready, as in just a few hours we'll need to book sessions.  The experience is stressful, much like buying in-demand tickets for a concert or sporting event.  &lt;/p&gt;

&lt;p&gt;So where to start?&lt;/p&gt;

&lt;h3&gt;
  
  
  My theme for this year - AI Convergence
&lt;/h3&gt;

&lt;p&gt;We’re witnessing the long-anticipated convergence of data and software engineering — data systems are finally being built and operated like software. At the same time, Generative AI is accelerating this shift, demanding production-grade reliability, automation, and integration into core products. Together, these forces are reshaping what it means to be a data professional.&lt;br&gt;
You can read more about my point of view &lt;a href="https://medium.com/datafutures/data-software-and-ai-convergence-part-1-a26c615bc2a8" rel="noopener noreferrer"&gt;here&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;So, with this theme in mind, here's what's on my list.  It's not perfect, but it's a good start for my data friends who are trying to up-skill and jump into the convergence head-on.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data is being built like software.
&lt;/h3&gt;

&lt;p&gt;Data Mesh and open data lake formats are driving microservice-like architectures, and having clean, reliable data is more important than ever.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Building Production-Ready Data Systems for AI Applications (&lt;a href="https://registration.awsevents.com/flow/awsevents/reinvent2025/event-catalog/page/eventCatalog?search=ANT403" rel="noopener noreferrer"&gt;ANT403&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Building an operational data lake using S3 Tables and SageMaker (&lt;a href="https://registration.awsevents.com/flow/awsevents/reinvent2025/event-catalog/page/eventCatalog?search=STG413" rel="noopener noreferrer"&gt;STG413&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Build trust in your data with Amazon SageMaker Catalog (&lt;a href="https://registration.awsevents.com/flow/awsevents/reinvent2025/event-catalog/page/eventCatalog?search=ANT406" rel="noopener noreferrer"&gt;ANT406&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data professionals need to be full stack, and serverless is the best way to start. 🤓&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS Serverless developer experience workshop (&lt;a href="https://registration.awsevents.com/flow/awsevents/reinvent2025/event-catalog/page/eventCatalog?search=CNS402" rel="noopener noreferrer"&gt;CNS402&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Building Serverless distributed data processing workloads (&lt;a href="https://registration.awsevents.com/flow/awsevents/reinvent2025/event-catalog/page/eventCatalog?search=CNS404" rel="noopener noreferrer"&gt;CNS404&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Software is becoming intelligent. And AI is forcing it all to converge — faster than ever.
&lt;/h3&gt;

&lt;p&gt;Learn the frameworks and tools to build agents that leverage MCP (Model Context Protocol) and RAG (Retrieval Augmented Generation). These are in order of recommendation, from essential to extra credit.  It's a long list, but the solution space is large and evolving. 🤖&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent in an hour: Build an agentic app with Strands Agents and MCP (&lt;a href="https://registration.awsevents.com/flow/awsevents/reinvent2025/event-catalog/page/eventCatalog?search=IND334" rel="noopener noreferrer"&gt;IND334&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Vector search with Amazon OpenSearch Service (&lt;a href="https://registration.awsevents.com/flow/awsevents/reinvent2025/event-catalog/page/eventCatalog?search=ANT419" rel="noopener noreferrer"&gt;ANT419&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Unlock interoperability: Build your first MCP server (&lt;a href="https://registration.awsevents.com/flow/awsevents/reinvent2025/event-catalog/page/eventCatalog?search=DVT404" rel="noopener noreferrer"&gt;DVT404&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Deploying Intelligent Agent Systems with MCP (&lt;a href="https://registration.awsevents.com/flow/awsevents/reinvent2025/event-catalog/page/eventCatalog?search=PEX401" rel="noopener noreferrer"&gt;PEX401&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Smart agents meet documents: Building next-gen IDP architectures (&lt;a href="https://registration.awsevents.com/flow/awsevents/reinvent2025/event-catalog/page/eventCatalog?search=AIM415" rel="noopener noreferrer"&gt;AIM415&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Build agentic workflows with Small Language Models and SageMaker AI (&lt;a href="https://registration.awsevents.com/flow/awsevents/reinvent2025/event-catalog/page/eventCatalog?search=AIM406" rel="noopener noreferrer"&gt;AIM406&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Efficient AI model customization for agentic workflows (&lt;a href="https://registration.awsevents.com/flow/awsevents/reinvent2025/event-catalog/page/eventCatalog?search=AIM410" rel="noopener noreferrer"&gt;AIM410&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Implement hybrid search with Aurora PostgreSQL for MCP retrieval (&lt;a href="https://registration.awsevents.com/flow/awsevents/reinvent2025/event-catalog/page/eventCatalog?search=DAT409" rel="noopener noreferrer"&gt;DAT409&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Security and governance 🔒&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mastering agent authentication with Amazon Bedrock AgentCore Identity (&lt;a href="https://registration.awsevents.com/flow/awsevents/reinvent2025/event-catalog/page/eventCatalog?search=AIM321" rel="noopener noreferrer"&gt;AIM321&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Red Team vs Blue Team: Securing AI Agents (&lt;a href="https://registration.awsevents.com/flow/awsevents/reinvent2025/event-catalog/page/eventCatalog?search=DEV317" rel="noopener noreferrer"&gt;DEV317&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Build trustworthy AI applications with Amazon Bedrock Guardrails  (&lt;a href="https://registration.awsevents.com/flow/awsevents/reinvent2025/event-catalog/page/eventCatalog?search=AIM303" rel="noopener noreferrer"&gt;AIM303&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;And a little fun 🎊&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create your own AI sidekick: a hands-on agent building workshop (&lt;a href="https://registration.awsevents.com/flow/awsevents/reinvent2025/event-catalog/page/eventCatalog?search=DVT403" rel="noopener noreferrer"&gt;DVT403&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Game on: build a retro adventure game in 120 minutes (&lt;a href="https://registration.awsevents.com/flow/awsevents/reinvent2025/event-catalog/page/eventCatalog?search=DVT402" rel="noopener noreferrer"&gt;DVT402&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I hope this guide is helpful. Have fun and get weird 😄.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>dataengineering</category>
      <category>genai</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Federated Airflow with SQS</title>
      <dc:creator>elliott cordo</dc:creator>
      <pubDate>Sat, 06 Jul 2024 12:34:09 +0000</pubDate>
      <link>https://dev.to/aws-heroes/federated-airflow-with-sqs-36bg</link>
      <guid>https://dev.to/aws-heroes/federated-airflow-with-sqs-36bg</guid>
<description>&lt;p&gt;Simple Queue Service (SQS) is one of the simplest and most effective services AWS has to offer.  The thing I like most about it is its versatility, performing equally well in high-volume pub-sub event processing and more general low-volume orchestration.  In this post we'll review how we can use SQS to create a non-monolithic Airflow architecture, a double-click into my &lt;a href="https://dev.to/aws-heroes/the-wrath-of-unicron-when-airflow-gets-scary-27kg"&gt;previous post&lt;/a&gt; on the subject.&lt;/p&gt;

&lt;h2&gt;
  
  
  More Airflows = More Happy
&lt;/h2&gt;

&lt;p&gt;Airflow is a great tool and at this point fairly ubiquitous in the data engineering community. However, the more complex the environment, the more difficult it will be to develop and deploy, and ultimately the less stable it will be.&lt;/p&gt;

&lt;p&gt;I generally recommend the following principles when architecting with Airflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Small maintainable data product repos&lt;/li&gt;
&lt;li&gt;Multiple purposeful Airflow environments &lt;/li&gt;
&lt;li&gt;Airflow environments communicating through events&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So this all sounds good, but what about dependencies 😬??? Organizations are going to have some dependencies that cross domain/product boundaries. We'd stand to lose that Airflow dependency goodness if the products are running in separate Airflow environments.   A good example of this would be a "customer master", which might be used in several independently developed data products.  That's where "communicating through events" comes in 😃&lt;/p&gt;

&lt;h2&gt;
  
  
  SQS to the rescue
&lt;/h2&gt;

&lt;p&gt;Luckily this problem is VERY easily solved using SQS, and we've put together &lt;a href="https://github.com/datafuturesco/template-sns-demo" rel="noopener noreferrer"&gt;this little demo repository&lt;/a&gt; to help you get started.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Setup
&lt;/h3&gt;

&lt;p&gt;Assuming you already have two MWAA environments or self-hosted Airflow infrastructure, you will need to &lt;a href="https://github.com/datafuturesco/template-sns-demo/blob/main/dags/create_sns_topic.py" rel="noopener noreferrer"&gt;create an SNS topic&lt;/a&gt; and &lt;a href="https://github.com/datafuturesco/template-sns-demo/blob/main/dags/create_sqs_queue.py" rel="noopener noreferrer"&gt;create and subscribe an SQS queue&lt;/a&gt;.   We decided to be cute and package these as DAGs, but you can lift the code or create them from the console. &lt;/p&gt;
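
&lt;p&gt;If you'd prefer to script this setup directly rather than run the demo DAGs, here's a minimal boto3 sketch of the same idea (the topic and queue names are placeholders, and in practice you'd fold this into your IaC):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import boto3

sns = boto3.client("sns")
sqs = boto3.client("sqs")

# Topic the upstream (publisher) environment will write to
topic_arn = sns.create_topic(Name="customer-master-complete")["TopicArn"]

# Queue the downstream (subscriber) environment will poll
queue_url = sqs.create_queue(QueueName="customer-master-complete-queue")["QueueUrl"]
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Allow the topic to deliver messages to the queue
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "sns.amazonaws.com"},
        "Action": "sqs:SendMessage",
        "Resource": queue_arn,
        "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
    }],
}
sqs.set_queue_attributes(QueueUrl=queue_url, Attributes={"Policy": json.dumps(policy)})

# Subscribe the queue to the topic
sns.subscribe(TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;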

&lt;p&gt;You will then need to attach a policy to the role used by your Airflow environment.  Note that this policy is fairly permissive, as it needs to allow the setup steps above to be run through Airflow.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AllowSNSActions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"sns:CreateTopic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"sns:DeleteTopic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"sns:ListTopics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"sns:ListSubscriptionsByTopic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"sns:GetSubscriptionAttributes"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"sns:Subscribe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"sns:SetSubscriptionAttributes"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"sns:ConfirmSubscription"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"sns:Publish"&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
            &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:sns:&lt;em&gt;:{AccountID}:&lt;/em&gt;"&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AllowSQSActions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"sqs:CreateQueue"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"sqs:DeleteQueue"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"sqs:GetQueueUrl"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"sqs:GetQueueAttributes"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"sqs:SetQueueAttributes"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"sqs:ListQueues"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"sqs:ReceiveMessage"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"sqs:DeleteMessage"&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
            &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:sqs:&lt;em&gt;:{AccountID}:&lt;/em&gt;"&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;&lt;br&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Step 2: Create SNS Publish DAG
&lt;/h3&gt;

&lt;p&gt;In the following DAG we create a simulated upstream dependency (consider this your &lt;em&gt;customer master&lt;/em&gt; build step).  We use the &lt;code&gt;SnsPublishOperator&lt;/code&gt; to notify downstream dependencies after our dummy step is complete.  &lt;/p&gt;

&lt;p&gt;⚠️  Note that if you did not build your SNS/SQS resources using the DAGs, you will need to manually set your Airflow variables with the appropriate ARNs.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;p&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;&lt;br&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.bash&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BashOperator&lt;/span&gt;&lt;br&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.providers.amazon.aws.operators.sns&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SnsPublishOperator&lt;/span&gt;&lt;br&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.python&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt;&lt;br&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Variable&lt;/span&gt;&lt;br&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.utils.dates&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;days_ago&lt;/span&gt;&lt;br&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;br&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;br&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;airflow&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;days_ago&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;br&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class="c1"&gt;# Define DAG to display the cross-dag dependency using SNS topic publish&lt;br&gt;
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;br&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sns_publish_dummy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
    &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;A simple DAG to publish a message to an SNS topic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
    &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
    &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;br&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;br&gt;
    &lt;span class="c1"&gt;# Dummy task to show upward dag dependency success&lt;br&gt;
&lt;/span&gt;    &lt;span class="n"&gt;dummy_sleep_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;br&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sleep_task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
        &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sleep 10&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;br&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;span class="c1"&amp;gt;# SNS Publish operator to publish message to SNS topic after the upward tasks are successful
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;    &lt;span class="n"&gt;publish_to_sns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SnsPublishOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;br&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;publish_to_sns&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
        &lt;span class="n"&gt;target_arn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Variable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sns_test_arn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# SNS topic arn to which you want to publish the message&lt;br&gt;
&lt;/span&gt;        &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;This is a test message from Airflow&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
        &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Test SNS Message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;br&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;span class="n"&amp;gt;dummy_sleep_task&amp;lt;/span&amp;gt; &amp;lt;span class="o"&amp;gt;&amp;amp;gt;&amp;amp;gt;&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;publish_to_sns&amp;lt;/span&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;&lt;strong&gt;name&lt;/strong&gt;&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&lt;strong&gt;main&lt;/strong&gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;br&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cli&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;/p&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Step 3: Create SQS Subscribe DAG
&lt;/h3&gt;

&lt;p&gt;This DAG will simulate the downstream dependency, perhaps a &lt;em&gt;customer profile&lt;/em&gt; job.  Leveraging the &lt;code&gt;SqsSensor&lt;/code&gt;, it simply waits for the upstream job to complete, and then runs its own dummy step.  Note that &lt;code&gt;mode='reschedule'&lt;/code&gt; enables this polling/waiting behavior without tying up a worker slot between polls.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;p&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;&lt;br&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.providers.amazon.aws.sensors.sqs&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SqsSensor&lt;/span&gt;&lt;br&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.python&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt;&lt;br&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Variable&lt;/span&gt;&lt;br&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.utils.dates&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;days_ago&lt;/span&gt;&lt;br&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;&lt;br&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;br&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;airflow&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;days_ago&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;br&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;print_sqs_message&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;&lt;br&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello, SQS read and delete successful!! &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class="c1"&gt;# Define DAG to show cross-dag dependency using SQS sensor operator&lt;br&gt;&lt;br&gt;
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;br&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sqs_sensor_example&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
    &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;A simple DAG to sense and print messages from SQS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
    &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;br&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;span class="c1"&amp;gt;# SQS sensor operator waiting to receive message in the provided SQS queue from SNS topic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;    &lt;span class="n"&gt;sense_sqs_queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SqsSensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;br&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sense_sqs_queue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
        &lt;span class="n"&gt;sqs_queue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Variable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sqs_queue_test_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c1"&gt;# Airflow variable name for the SQS queue url &lt;br&gt;
&lt;/span&gt;        &lt;span class="n"&gt;aws_conn_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;aws_default&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
        &lt;span class="n"&gt;max_messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
        &lt;span class="n"&gt;wait_time_seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
        &lt;span class="n"&gt;visibility_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
        &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;reschedule&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;  &lt;span class="c1"&gt;# the task waits for any message to be received in the specified queue&lt;br&gt;
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;span class="n"&amp;gt;print_message&amp;lt;/span&amp;gt; &amp;lt;span class="o"&amp;gt;=&amp;lt;/span&amp;gt; &amp;lt;span class="nc"&amp;gt;PythonOperator&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;(&amp;lt;/span&amp;gt;
    &amp;lt;span class="n"&amp;gt;task_id&amp;lt;/span&amp;gt;&amp;lt;span class="o"&amp;gt;=&amp;lt;/span&amp;gt;&amp;lt;span class="sh"&amp;gt;'&amp;lt;/span&amp;gt;&amp;lt;span class="s"&amp;gt;print_message&amp;lt;/span&amp;gt;&amp;lt;span class="sh"&amp;gt;'&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;,&amp;lt;/span&amp;gt;
    &amp;lt;span class="n"&amp;gt;python_callable&amp;lt;/span&amp;gt;&amp;lt;span class="o"&amp;gt;=&amp;lt;/span&amp;gt;&amp;lt;span class="n"&amp;gt;print_sqs_message&amp;lt;/span&amp;gt;
&amp;lt;span class="p"&amp;gt;)&amp;lt;/span&amp;gt;

&amp;lt;span class="n"&amp;gt;sense_sqs_queue&amp;lt;/span&amp;gt; &amp;lt;span class="o"&amp;gt;&amp;amp;gt;&amp;amp;gt;&amp;lt;/span&amp;gt; &amp;lt;span class="n"&amp;gt;print_message&amp;lt;/span&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;&lt;strong&gt;name&lt;/strong&gt;&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&lt;strong&gt;main&lt;/strong&gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;br&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cli&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;/p&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Testing, Testing
&lt;/h3&gt;

&lt;p&gt;Once your environment is set up, simply start your SQS subscriber DAG.  It will patiently wait, polling SQS for the completion message.   &lt;/p&gt;

&lt;p&gt;When you are ready, start your SNS publisher DAG.   Once it completes, your subscriber will start its dummy step and complete.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bringing it all together..
&lt;/h2&gt;

&lt;p&gt;Big picture, by leveraging SQS you can enable a pragmatic, data-mesh-inspired infrastructure like the one illustrated below: no Airflow-based single point of failure, respected domain/product boundaries, and team autonomy.   &lt;/p&gt;

&lt;p&gt;As a bonus, you have also enabled evolutionary architecture.   If a team wants to transition from Airflow to Step Functions or Prefect, they are empowered to do so, so long as they continue interacting through SNS/SQS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fty96punl7zw27to5lmzh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fty96punl7zw27to5lmzh.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'd like to give special thanks to &lt;a class="mentioned-user" href="https://dev.to/deekshagunde"&gt;@deekshagunde&lt;/a&gt; for contributing to this article and preparing the demo repo.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>MWAA Plugins and Dependency Survival Guide</title>
      <dc:creator>elliott cordo</dc:creator>
      <pubDate>Fri, 05 Apr 2024 12:50:21 +0000</pubDate>
      <link>https://dev.to/aws-heroes/mwaa-plugins-and-dependencies-survival-guide-27ao</link>
      <guid>https://dev.to/aws-heroes/mwaa-plugins-and-dependencies-survival-guide-27ao</guid>
<description>&lt;p&gt;Very often it makes sense to use a managed service instead of the undifferentiated heavy lifting of &lt;em&gt;properly&lt;/em&gt; building and maintaining infrastructure.   For me, managing Apache Airflow definitely falls into this category, and I often use AWS MWAA (Managed Workflows for Apache Airflow).&lt;/p&gt;

&lt;p&gt;As many of you who have worked with Airflow already know, customizations, especially modifications to the Python environment, can be tricky and in some cases dangerous.  This is mainly because Airflow itself is a complex Python application with its own environmental considerations and dependencies.  &lt;/p&gt;

&lt;p&gt;This is why I continue to campaign that folks keep their Airflow environment small and purposeful, and reduce customizations by using tools like the &lt;a href="https://airflow.apache.org/docs/apache-airflow-providers-cncf-kubernetes/stable/operators.html" rel="noopener noreferrer"&gt;pod operator&lt;/a&gt;.  I detail much of this in my article &lt;a href="https://dev.to/aws-heroes/the-wrath-of-unicron-when-airflow-gets-scary-27kg"&gt;The Wrath of Unicron&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;However it's very difficult to stay completely vanilla 🍦, so here are a few tips when customizing the MWAA environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tip 0: Use MWAA Local Runner
&lt;/h3&gt;

&lt;p&gt;I won't go into great detail here, because the docs are quite good.  But you should absolutely be developing and testing all changes with the &lt;a href="https://github.com/aws/aws-mwaa-local-runner" rel="noopener noreferrer"&gt;MWAA Local Runner&lt;/a&gt;.   It's very close to the real thing, and you will avoid waiting for changes to propagate in the actual MWAA environment (my one complaint: 20-40 minutes for an environment update is kinda crazy).&lt;/p&gt;

&lt;h3&gt;
  
  
  Tip 1: LOGGING!!
&lt;/h3&gt;

&lt;p&gt;Before you start any customization, turn your logging up to 11.   &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxmd3zb8o9rmz1mnaq90.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxmd3zb8o9rmz1mnaq90.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You will need all the detailed log entries, especially for the Scheduler.    If your MWAA environment is not recognizing your changes, or getting stuck in the updating state (crash loop), check the &lt;em&gt;requirements&lt;/em&gt; log entries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficjo0rl45gza9e1fu0eg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficjo0rl45gza9e1fu0eg.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Tip 2: Constraints File
&lt;/h3&gt;

&lt;p&gt;For the past several versions of Airflow, a public constraints file has been published and maintained.   This constraints file pins Airflow's dependencies and makes sure that customizations do not break things.  &lt;/p&gt;

&lt;p&gt;⚠️ With MWAA, messing up dependencies can cause the aforementioned crash loop, which can often last for hours 😭.&lt;/p&gt;

&lt;p&gt;A constraint statement pointing to this file must be referenced at the top of your requirements.txt and will look something like this.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;--constraint "https://raw.githubusercontent.com/apache/airflow/constraints-{Airflow-version}/constraints-{Python-version}.txt"&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;
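
&lt;p&gt;For example, a minimal requirements.txt for MWAA might look like the following. The constraint line goes first, and the provider packages listed here are purely illustrative; substitute your own, and make sure the Airflow and Python versions in the URL match your environment.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--constraint "https://raw.githubusercontent.com/apache/airflow/constraints-{Airflow-version}/constraints-{Python-version}.txt"

apache-airflow-providers-amazon
apache-airflow-providers-snowflake
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;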

&lt;h1&gt;
  
  
  &lt;em&gt;DO NOT OMIT THE CONSTRAINT STATEMENT!!!&lt;/em&gt;
&lt;/h1&gt;

&lt;p&gt;..but if you can't make the default file work, see Tip 3 below 😃&lt;/p&gt;

&lt;h3&gt;
  
  
  Tip 3: Unresolvable Conflicts
&lt;/h3&gt;

&lt;p&gt;Unfortunately, not all Python packages are well maintained or tightly pin their upstream dependency versions.   Over time, you can run into unresolvable conflicts between your packages and plugins and the constraints file.    &lt;/p&gt;

&lt;p&gt;The first recommendation: upgrade to the latest version of Airflow.  It's very likely the problems you are experiencing are resolved in the latest version.  If this is not an option, I suggest certifying and hosting your own version of the constraints file.  This will involve tweaking the package dependencies and making sure they are compatible with Airflow.  &lt;/p&gt;

&lt;p&gt;This may not be a trivial process, but try your best not to comment out lines, and absolutely do not remove the constraints statement altogether.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tip 4: Troubleshooting Plugins
&lt;/h3&gt;

&lt;p&gt;Maybe you think you did everything right (per the &lt;a href="https://docs.aws.amazon.com/mwaa/latest/userguide/configuring-dag-import-plugins.html" rel="noopener noreferrer"&gt;docs&lt;/a&gt;), your MWAA environment is booting, but your plugins are not installing.&lt;/p&gt;

&lt;p&gt;First, check your &lt;em&gt;requirements&lt;/em&gt; log entries and see if anything has failed during the install.   Note that in most cases the requirements install will do a complete rollback of package installs, not just the offenders.&lt;/p&gt;

&lt;p&gt;If you don't see your package being installed, make sure you referenced it correctly in your requirements file.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/usr/local/airflow/plugins/data_common_utils-0.2.8-py3-none-any.whl&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If all this looks correct, try creating a simple "plugin finder" DAG, and make sure your plugin has been copied to the hosted environment.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;p&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.utils.dates&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;days_ago&lt;/span&gt;&lt;br&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;&lt;br&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.bash&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BashOperator&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dag_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;plugin-finder&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;days_ago&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class="n"&gt;ls_airflow_plugins&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;br&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ls_airflow_plugins&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
    &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ls -laR /usr/local/airflow/plugins&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
    &lt;span class="n"&gt;priority_weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
&lt;span class="p"&gt;)&lt;/span&gt;&lt;/p&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Good Luck!
&lt;/h3&gt;

&lt;p&gt;I hope you all find this helpful.  Please comment with other helpful tips!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.datafutures.io/" rel="noopener noreferrer"&gt;Data Futures&lt;/a&gt;&lt;/p&gt;

</description>
      <category>airflow</category>
      <category>bigdata</category>
      <category>dataengineering</category>
      <category>aws</category>
    </item>
    <item>
      <title>Test Driving Redshift AI-Driven Scaling</title>
      <dc:creator>elliott cordo</dc:creator>
      <pubDate>Thu, 21 Dec 2023 15:13:32 +0000</pubDate>
      <link>https://dev.to/aws-heroes/test-driving-redshift-ai-driven-scaling-4k1m</link>
      <guid>https://dev.to/aws-heroes/test-driving-redshift-ai-driven-scaling-4k1m</guid>
<description>&lt;p&gt;Of all the amazing announcements from 2023 re:Invent I was probably most excited about the new &lt;a href="https://aws.amazon.com/blogs/big-data/amazon-redshift-announcements-at-aws-reinvent-2023-to-enable-analytics-on-all-your-data/#:~:text=Amazon%20Redshift%20Serverless%20is%20more%20intelligent%20than%20ever%20with%20new%20AI%2Ddriven%20scaling%20and%20optimizations" rel="noopener noreferrer"&gt;scaling capabilities&lt;/a&gt; announced for AWS Redshift.  As a long-time Redshift user, perhaps even (debatably) customer #1, I'm always eager to learn of new and exciting ways we can use this service.&lt;/p&gt;

&lt;p&gt;Serverless has long been a direction for AWS service offerings, and this was brought to Redshift last year with its first serverless release.  Redshift Serverless V1 was definitely an amazing step forward for the platform.  I've implemented it at many organizations to supplement existing RA3 infrastructure, or alone as the primary warehouse engine.   For workloads that are "spiky", or overall have a low duty cycle, we've been able to drive amazing price performance.&lt;/p&gt;

&lt;p&gt;In terms of price-performance, one of the most important parameters for V1 Redshift Serverless is the &lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/serverless-capacity.html#:~:text=to%20handle%20workloads.-,Base%20capacity,-This%20setting%20specifies" rel="noopener noreferrer"&gt;base RPU&lt;/a&gt; (Redshift Processing Unit) allocation.  This parameter controls the initial "warm" Redshift capacity that is ready for work.   As you can guess, you want this number to be as low as possible to maintain a low baseline cost profile, which is in tension with reducing latency as it autoscales.  &lt;/p&gt;

&lt;p&gt;💸 &lt;strong&gt;IMPORTANT TIP&lt;/strong&gt; 💸 The default base RPU is 128.  When experimenting be sure to turn base RPU way down or you might have an unexpected surprise in your AWS bill.&lt;/p&gt;

&lt;p&gt;The main thing to consider here is that the &lt;em&gt;autoscaling&lt;/em&gt; is primarily for query concurrency.   When it comes to large workloads like ELT jobs, throughput is limited by the base RPU setting.  So if you throw a large workload at an undersized serverless cluster, you will not get the performance you expect.  If you maintain too high a base RPU, you will be wastefully reserving capacity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdna0p8dzv4unjq2g4p0s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdna0p8dzv4unjq2g4p0s.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The great thing is that since Redshift infrastructure can be manipulated via API, we can do all sorts of creative things, like temporarily increasing the RPU setting before our jobs or even spinning up an ephemeral cluster.  But luckily the new scaling options make things much easier now.&lt;/p&gt;
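
&lt;p&gt;For illustration, here is a rough boto3 sketch of that "temporarily increase the RPUs" trick on a Serverless V1 workgroup (the workgroup name and RPU values are placeholders, and the update takes a little time to apply):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import boto3

rs = boto3.client("redshift-serverless")

# Scale the workgroup's base (warm) capacity up before a heavy ELT job...
rs.update_workgroup(workgroupName="analytics-wg", baseCapacity=128)

# ...run the ELT job...

# ...and drop it back down afterwards to keep the baseline cost low
rs.update_workgroup(workgroupName="analytics-wg", baseCapacity=32)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;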

&lt;h1&gt;
  
  
  AI Scaling
&lt;/h1&gt;

&lt;p&gt;With the latest release, Redshift Serverless now uses ML to estimate the cluster size required to process a submitted workload.   Instead of a base RPU setting, you control a price-performance ratio.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv40xusijgvym58xtjnfq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv40xusijgvym58xtjnfq.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I decided to test out this service using a larger clickstream dataset and a troublesome ELT step that I recently refactored.  The incremental data is about 25 million rows, and the ELT step is essentially sessionizing and creating user/date-level aggregates.    We'd found that a 128 RPU V1 cluster was providing adequate price performance, with an average run time of just over 4 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;So let's test drive a preview cluster!&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I created a new preview cluster, accepting the default 50% price-performance ratio, and ran an incremental workload.  Starting from zero, the cluster hit a max of 128 RPUs, completed the workload in 3.5 minutes, and then settled back to zero.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flvpdjlrzabpj5g26evi1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flvpdjlrzabpj5g26evi1.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;This initial test is really promising, and in the coming months I plan to do additional testing with larger workloads, additionally layering on some smaller concurrent queries.&lt;/p&gt;

&lt;p&gt;Overall this is a great step forward for Redshift. It will enable some really interesting topologies that mix serverless and provisioned RA3 clusters to optimize for just about any workload, especially when leveraging the new &lt;a href="https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-redshift-multi-data-warehouse-writes-data-sharing-preview/" rel="noopener noreferrer"&gt;multi-cluster writes (preview)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.datafutures.co/" rel="noopener noreferrer"&gt;Data Futures&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cover photo by &lt;a href="https://unsplash.com/@kovax?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Ivan Kovac&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>bigdata</category>
      <category>dataengineering</category>
      <category>analytics</category>
    </item>
    <item>
      <title>Avoiding the DBT Monolith</title>
      <dc:creator>elliott cordo</dc:creator>
      <pubDate>Sun, 15 Oct 2023 15:27:19 +0000</pubDate>
      <link>https://dev.to/elliott_cordo/avoiding-the-dbt-monolith-7ep</link>
      <guid>https://dev.to/elliott_cordo/avoiding-the-dbt-monolith-7ep</guid>
<description>&lt;p&gt;As a data engineer, unless you've been living under a rock, you've probably been working with DBT, or aspire to do so.   DBT is a great step in the right direction for data engineering, removing boilerplate tasks and establishing observable contracts between models.   &lt;/p&gt;

&lt;p&gt;However, DBT projects can easily get out of hand.  I often see entire data platforms defined in a single DBT project (monolithic repositories).   In some organizations this doesn't cause much harm, but in many it becomes a nightmare to maintain.  This is especially true when the data ecosystem is large, and when data development is more federated, with data products being maintained by separate teams. Also see my &lt;a href="https://dev.to/aws-heroes/the-wrath-of-unicron-when-airflow-gets-scary-27kg"&gt;article&lt;/a&gt; on similar pitfalls with DAGs. &lt;/p&gt;

&lt;p&gt;DBT has been thinking about this, and there are some major &lt;a href="https://www.getdbt.com/blog/dbt-labs-announces-major-enhancements-to-dbt-cloud-to-enable-collaboration-at-scale"&gt;features in preview&lt;/a&gt; to help support multi-repo, federated project development.  However, these features will only be available to Enterprise customers. &lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-repo Strategy
&lt;/h2&gt;

&lt;p&gt;The good news is that if you are running DBT Core, or want to leverage existing features, there are options. Since DBT is built on Python, it is easily extensible with packages.  Packages are useful for utility functions, data quality, and general code reusability, but they can also be used to import and reference the models of other DBT projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to
&lt;/h2&gt;

&lt;p&gt;For demonstration purposes I've created two repos:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/elliottcordo/dbt_poc_a"&gt;Project A&lt;/a&gt; - parent repo, vanilla DBT "hello world".&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/elliottcordo/dbt_poc_b"&gt;Project B&lt;/a&gt; - child repo that inherits from Project A.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For both projects you will need to set up a virtual environment and install the appropriate dbt adapter package (&lt;code&gt;dbt-snowflake&lt;/code&gt;, &lt;code&gt;dbt-redshift&lt;/code&gt;, etc.).&lt;/p&gt;

&lt;p&gt;In order for Project B to inherit, you simply need to add the parent project to &lt;code&gt;packages.yml&lt;/code&gt;, which will be imported when you run &lt;code&gt;dbt deps&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;packages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;git&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://github.com/elliottcordo/dbt_poc_a.git"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now add the imported project's models to your &lt;code&gt;dbt_project.yml&lt;/code&gt;.  You can also override config parameters.  This is especially helpful if your parent models exist in a different schema.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;dbt_poc_a&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# +schema: schemaname&lt;/span&gt;
    &lt;span class="c1"&gt;# Config indicated by + and applies to all files under models/example/&lt;/span&gt;
  &lt;span class="na"&gt;dbt_poc_b&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# +schema: schemaname&lt;/span&gt;
    &lt;span class="c1"&gt;# Config indicated by + and applies to all files under models/example/&lt;/span&gt;
    &lt;span class="na"&gt;dbt_poc_b&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;+materialized&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;view&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the model specification you can now reference the models from Project A.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'table'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;source_data&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;

    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;distinct&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
    &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'dbt_poc_a'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'my_first_dbt_model'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;source_data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When running models in the child project you will most likely want to suppress building the upstream models (on the assumption they are maintained and built by a different team).  You can use select and filter expressions in your &lt;code&gt;dbt run&lt;/code&gt; command to accomplish this: &lt;code&gt;dbt run --select dbt_poc_b&lt;/code&gt;&lt;br&gt;
⚠️ This is something you really need to be careful with!&lt;/p&gt;

&lt;p&gt;You can now run &lt;code&gt;dbt docs generate&lt;/code&gt; and &lt;code&gt;dbt docs serve&lt;/code&gt; to view dependencies and cross-model metadata.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftm8lnbofp1rson4bla9u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftm8lnbofp1rson4bla9u.png" alt="Image description" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Additional thoughts
&lt;/h2&gt;

&lt;p&gt;Note that this approach alone will not enable &lt;em&gt;safe&lt;/em&gt; federated DBT development.  Process and culture will also come into play to avoid breaking changes to downstream models.  You should also anticipate building a good amount of internal tooling, especially in your CI/CD pipelines.&lt;/p&gt;

&lt;p&gt;And just a reminder: if your data platform is small, your data team is small, and/or all development is centralized, this approach may be premature optimization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.datafutures.co/"&gt;Data Futures&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dbt</category>
      <category>datamesh</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Running Jobs on Athena Spark</title>
      <dc:creator>elliott cordo</dc:creator>
      <pubDate>Fri, 13 Oct 2023 19:00:17 +0000</pubDate>
      <link>https://dev.to/aws-heroes/running-jobs-on-athena-spark-2g92</link>
      <guid>https://dev.to/aws-heroes/running-jobs-on-athena-spark-2g92</guid>
      <description>&lt;p&gt;Athena Spark is hands down my favorite Spark implementations on AWS.   First off, it's a managed service and serverless, meaning you don't need to worry about clusters and you only pay for what you use.    Secondly it autoscales for a given workload and very successfully hides the complexity of Spark. Last but not least &lt;u&gt;it's instant&lt;/u&gt; - you get a useable Spark session literally within the time it takes for the Notebook editor to render.   It's magical!  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkzu7f9px88lfprir3tft.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkzu7f9px88lfprir3tft.gif" alt="Image description" width="200" height="183"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But what if we want to leverage this magic for production workloads? Specifically scheduled jobs, where an interactive Spark environment just doesn't fit?   Although the current version of the service is definitely optimized for the interactive experience, running scheduled jobs through Athena is both possible and magical.&lt;/p&gt;

&lt;p&gt;Here is a quick walkthrough of how you can run Athena Spark jobs through good ol' boto3.&lt;/p&gt;

&lt;p&gt;First we establish a client connection and start a session.   Note this session could be reused for several jobs, although the instantiation time is so fast there is probably not much reason to persist it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;athena&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;calculation_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;job_session&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;WorkGroup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;spark_jobs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;EngineConfiguration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CoordinatorDpuSize&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MaxConcurrentDpus&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;session_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;calculation_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SessionId&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;session_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;calculation_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;State&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the session is established we can start submitting work.   Instead of referencing an existing notebook, you will instead submit your code &lt;em&gt;as a string&lt;/em&gt;.  Yes, this seems a bit weird, but after all Python is an interpreted language, stored in plain text, i.e. a string, so get over it 😃!  It would be great if you could reference an S3 URI and hopefully they will provide additional options in the future. &lt;/p&gt;

&lt;p&gt;I'd recommend maintaining this code as a separate .py file that could be mocked/tested outside this "driver" code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;complicated_spark_job.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;notebook_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;execution_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_calculation_execution&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;SessionId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;daily job&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;CodeBlock&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;notebook_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;calc_exec_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;execution_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CalculationExecutionId&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can then iterate and monitor our Spark job's progress.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;exec_status_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_calculation_execution_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;CalculationExecutionId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;calc_exec_id&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;exec_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;exec_status_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Status&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;State&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exec_state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;exec_state&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CANCELED&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;COMPLETED&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;FAILED&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;
    &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the job completes we can terminate our session if we have no other work to submit. Don't worry - if you don't forcibly terminate it, the session will time out after a few minutes of idle time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;terminate_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;SessionId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition to Athena Spark being fast and easy, it's also very cost effective.  DPUs are priced at $0.35 per DPU-hour, rounded to the second.   So a 1 hour, 20 DPU workload (which is a lot of processing power) would cost you about 7 bucks!&lt;/p&gt;
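
&lt;p&gt;If you want to sanity-check spend before kicking off a job, the arithmetic is simple enough to script.  Here's a minimal sketch, using the $0.35 per DPU-hour rate quoted above; the DPU count and runtime are whatever your workload actually uses:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# rough Athena Spark cost estimate, assuming $0.35 per DPU-hour billed down to the second
DPU_HOUR_RATE = 0.35

def estimate_cost(dpus, runtime_seconds):
    dpu_hours = dpus * (runtime_seconds / 3600.0)
    return round(dpu_hours * DPU_HOUR_RATE, 2)

# a 1 hour, 20 DPU workload comes out to about 7 bucks
print(estimate_cost(dpus=20, runtime_seconds=3600))  # 7.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;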

&lt;p&gt;&lt;a href="https://www.datafutures.co/"&gt;Data Futures&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>awsbigdata</category>
      <category>spark</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>The Wrath of Unicron - When Airflow Gets Scary</title>
      <dc:creator>elliott cordo</dc:creator>
      <pubDate>Wed, 30 Aug 2023 12:42:41 +0000</pubDate>
      <link>https://dev.to/aws-heroes/the-wrath-of-unicron-when-airflow-gets-scary-27kg</link>
      <guid>https://dev.to/aws-heroes/the-wrath-of-unicron-when-airflow-gets-scary-27kg</guid>
      <description>&lt;p&gt;In case you weren't already a fan of the 1986 Transformer movie, &lt;a href="https://tfwiki.net/wiki/Unicron/Generation_1" rel="noopener noreferrer"&gt;Unicron&lt;/a&gt; was a giant, planet sized robot, also known as the God of Chaos.  &lt;/p&gt;

&lt;p&gt;For me this analogy is too obvious. DAG schedulers like Airflow (crons) often become bloated, fragile monoliths (uni-crons).   And just like this planet-eating monster, they bring all sorts of chaos to the engineers who maintain and operate them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foghk1ngo3yy62f0el2m9.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foghk1ngo3yy62f0el2m9.jpeg" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There have been quite a few great articles written on the subject of breaking up the Airflow monorepo, and to provide context I'll cover these quickly.   However, this approach alone does not defeat Unicron.  In this world of increasingly &lt;a href="https://martinfowler.com/articles/data-mesh-principles.html" rel="noopener noreferrer"&gt;decentralized data development&lt;/a&gt; we need to seriously question whether we should be using just a single scheduler.&lt;/p&gt;

&lt;h1&gt;
  
  
  Breaking up the Mono-Repo
&lt;/h1&gt;

&lt;p&gt;Airflow too often reaches its limits around project dependencies, multi-team collaboration, scalability, and just overall complexity.   It's not Airflow's fault, it's the way we use it.   Luckily there are a couple of great approaches to solving these issues:&lt;/p&gt;

&lt;p&gt;1) Use multiple project repos - Airflow will deploy any DAG you put in front of it.   So, with a little bit of effort, you can build a deployment pipeline that deploys DAGs from separate project-specific repos into a single Airflow.  There are a few techniques here, ranging from DAGFactory (good article &lt;a href="https://towardsdatascience.com/airflow-design-pattern-to-manage-multiple-airflow-projects-e695e184201b" rel="noopener noreferrer"&gt;here&lt;/a&gt;) to leveraging Git sub-modules, to just programmatically moving files around in your pipeline.&lt;/p&gt;

&lt;p&gt;2) Containerize your code - reduce the complexity of your Airflow project by packaging your code in separate containerized repositories.   Then use the &lt;a href="https://airflow.apache.org/docs/apache-airflow-providers-cncf-kubernetes/stable/operators.html" rel="noopener noreferrer"&gt;pod operator&lt;/a&gt; to execute these processes (sketched below).  Ideally Airflow becomes a pure orchestrator, with very simple project dependencies.   &lt;/p&gt;
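
&lt;p&gt;To make that concrete, here is a minimal sketch of a "pure orchestrator" DAG using the pod operator.  The DAG name, image, and namespace are made up, and the exact import path depends on your cncf.kubernetes provider version, so treat this as the shape of the idea rather than a drop-in DAG:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime

from airflow import DAG
# import path varies by provider version; newer releases expose it under operators.pod
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

with DAG(
    dag_id="orders_pipeline",            # hypothetical DAG
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # all business logic lives in the container image; Airflow just orchestrates
    build_orders = KubernetesPodOperator(
        task_id="build_orders",
        name="build-orders",
        namespace="data-jobs",                  # hypothetical namespace
        image="my-registry/orders-etl:latest",  # hypothetical image
        arguments=["--run-date", "{{ ds }}"],
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;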

&lt;p&gt;Using both of these techniques, especially in combination will help make your Unicron less formidable, perhaps only moon sized.  In fact, in many organizations, this approach coupled with a managed Airflow environment such as &lt;a href="https://aws.amazon.com/managed-workflows-for-apache-airflow/" rel="noopener noreferrer"&gt;AWS Managed Workflows&lt;/a&gt; is a really great sweet spot.&lt;/p&gt;

&lt;h1&gt;
  
  
  Data Mesh vs Unicron
&lt;/h1&gt;

&lt;p&gt;As organizations grow and data responsibilities become more federated, we need to ask ourselves an important question - &lt;em&gt;do we really need a single scheduler?&lt;/em&gt;   I would wholeheartedly say no; in fact, it becomes a liability.   &lt;/p&gt;

&lt;p&gt;The most obvious problem - a single point of failure.   Having a single scheduler, even with resiliency measures, is dangerous.   An infrastructure failure, or even a bad deployment, could cause an outage for all teams.  In modern architecture we avoid single points of failure if at all possible, so why create one if we don't need to?&lt;/p&gt;

&lt;p&gt;Another issue is excessive multi-team collaboration on a single project.  It's possible, especially if we mitigate with the techniques above, but not ideal.   You might still run into dependency issues, and of course Git conflicts.&lt;/p&gt;

&lt;p&gt;And then the most obvious question - what is the benefit?  In my experience the majority of DAGs in an organization are self-contained. In other words, they are not using cross-DAG dependencies via &lt;a href="https://airflow.apache.org/docs/apache-airflow/1.10.3/_api/airflow/sensors/external_task_sensor/index.html" rel="noopener noreferrer"&gt;External Task Sensors&lt;/a&gt;.  And if they are, there is a good chance the upstream data product is owned and maintained by another team.  So other than observing whether it is done or not, there is little utility in being in the same environment.&lt;/p&gt;

&lt;h1&gt;
  
  
  So how do we defeat Unicron?
&lt;/h1&gt;

&lt;p&gt;My recommendation is to have multiple Airflow environments, either at the team or application level.&lt;/p&gt;

&lt;p&gt;My secret sauce (well one way to accomplish this) - implement a lightweight messaging layer to communicate dependencies between the multiple Airflow environments.   The implementation details can vary - but here is a quick and simple approach: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At the end of each DAG publish to an &lt;a href="https://docs.aws.amazon.com/sns/latest/dg/welcome.html" rel="noopener noreferrer"&gt;SNS&lt;/a&gt; topic.&lt;/li&gt;
&lt;li&gt;Dependent DAGs subscribe via SQS.
&lt;/li&gt;
&lt;li&gt;The first step in the dependent DAG would then be a simple poller function (similar to an External Task Sensor) that iterates and sleeps until a message is received.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Obviously the implementation details are malleable, and SQS could be substituted with Dynamo, Redis, or any other resilient way to notify and exchange information.  &lt;/p&gt;
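
&lt;p&gt;Here is a minimal sketch of both halves in boto3 - the publish call that would sit at the end of the upstream DAG, and the poller that would sit at the start of the dependent one.  The topic ARN, queue URL, and message shape are placeholders you'd replace with your own:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import time

import boto3

sns = boto3.client('sns')
sqs = boto3.client('sqs')

TOPIC_ARN = 'arn:aws:sns:us-east-1:123456789012:dag-events'  # hypothetical topic
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/orders-dag-deps'  # hypothetical queue

def publish_dag_complete(dag_id, run_date):
    """Last task of the upstream DAG: announce that the data product is ready."""
    sns.publish(TopicArn=TOPIC_ARN,
                Message=json.dumps({'dag_id': dag_id, 'run_date': run_date}))

def wait_for_upstream(expected_dag_id, poll_seconds=30):
    """First task of the dependent DAG: iterate and sleep until the message arrives."""
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)
        for msg in resp.get('Messages', []):
            body = json.loads(msg['Body'])
            # SNS wraps the original payload, so unwrap the inner Message
            payload = json.loads(body.get('Message', '{}'))
            if payload.get('dag_id') == expected_dag_id:
                sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg['ReceiptHandle'])
                return payload
        time.sleep(poll_seconds)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;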

&lt;p&gt;You could even have your poller run against the API of other Airflow instances, although that will possibly couple you to another project's implementation details (i.e. specific Airflow infrastructure and DAGs vs data products).  Perhaps that other team might change the DAG that builds a specific product, replace Airflow with Prefect, or maybe move to Step Functions.  In general we want to design in a way that components can evolve independently, i.e. loose coupling.&lt;/p&gt;

&lt;p&gt;One of my very first implementations of this concept was a simple client library named &lt;a href="https://github.com/equinoxfitness/datacoco-batch" rel="noopener noreferrer"&gt;Batchy&lt;/a&gt;, backed by Redis and later Dynamo.  I created this long before Data Mesh was a thing, but was guided by the same pain points highlighted above.  This simple system has been in place for years integrating multiple scheduler instances (primarily Rundeck) with little complaint and great benefit.&lt;/p&gt;

&lt;p&gt;SO - in conclusion: use common sense and don't create a scary, monolithic Unicron.  And if you have one, be like Grimlock and kick its butt. &lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhfus39zeyu5b5m0eap4x.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhfus39zeyu5b5m0eap4x.jpg" alt="Me Grimlock Kick Butt"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.datafutures.co/" rel="noopener noreferrer"&gt;Data Futures&lt;/a&gt;&lt;/p&gt;

</description>
      <category>airflow</category>
      <category>aws</category>
      <category>dataengineering</category>
      <category>datamesh</category>
    </item>
    <item>
      <title>Evolutionary Recommender Design with Amazon Personalize</title>
      <dc:creator>elliott cordo</dc:creator>
      <pubDate>Tue, 29 Aug 2023 22:39:00 +0000</pubDate>
      <link>https://dev.to/aws-heroes/evolutionary-recommender-design-with-amazon-personalize-3730</link>
      <guid>https://dev.to/aws-heroes/evolutionary-recommender-design-with-amazon-personalize-3730</guid>
      <description>&lt;p&gt;Over the past few months I've been spending a fair amount of time working on personalization, leveraging one of my new favorite AWS services - &lt;a href="https://aws.amazon.com/personalize/"&gt;Amazon Personalize&lt;/a&gt;.    Needless to say there is much more that goes into building and launching a personalization system than just turning on a few services and feeding in some data.   In this article I'll focus on what it takes to launch a new personalization strategy, and architect it to evolve over time.&lt;/p&gt;

&lt;h1&gt;
  
  
  Chicken-or-the-egg
&lt;/h1&gt;

&lt;p&gt;In many cases we have a classic chicken-or-the-egg scenario - we feel uncomfortable launching personalization features with unproven performance and perhaps limited data, but without launching we won't have the feedback loop and data to measure and improve performance.   &lt;/p&gt;

&lt;p&gt;In some cases this is driven by the maturity and level of adoption of the application.   If we don’t include ML recommendations, we are unprepared for growth in users, but without data the ML recommendations alone may not produce relevant enough results to engage users.   In other cases we may have a mature product and user base, but still be dealing with considerable unknowns.   In both cases we need to be able to experiment, measure, and adapt quickly.&lt;/p&gt;

&lt;h1&gt;
  
  
  Launching on a New Application
&lt;/h1&gt;

&lt;p&gt;Let's consider the special case where we are dealing with a relatively new application, in the early cycles of adoption.   In these scenarios our ML algorithms might not be producing high relevancy scores, and simple logic and/or a healthy dose of manual curation may perform better.  Note that "performing better" may be more subjective than statistical or metric-driven in the very early stages.   &lt;/p&gt;

&lt;p&gt;However, there are several good reasons to start introducing these algorithms early:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;testing the infrastructure&lt;/strong&gt; - work out any functional or non-functional issues early on&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;supplementing simple logic&lt;/strong&gt; - use ML recommendations to add variety, and reduce the chance of recommendation depletion from simple hardcoded algorithms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;being ready for scale&lt;/strong&gt; - the tipping point when ML recommendation will need to take over is somewhat unpredictable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;accelerating training&lt;/strong&gt; - gathering implicit feedback from recommendation algorithms will help the models train faster&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Evolutionary Design and Candidate Sourcing
&lt;/h1&gt;

&lt;p&gt;My recommended method for integrating ML recommendations is through “candidate sourcing”.   This method requires a service layer to be built on top of the recommendation infrastructure to combine the component algorithms.    These components could be a Personalize &lt;a href="https://docs.aws.amazon.com/personalize/latest/dg/native-recipe-new-item-USER_PERSONALIZATION.html"&gt;User-Personalization&lt;/a&gt; recipe endpoint, perhaps a vector search against some text embeddings, and even a simplistic recommender based on manually curated entries and if-then-else logic.&lt;/p&gt;

&lt;p&gt;When a user is to be served recommendations, the underlying service will draw “candidates” from the appropriate components, and then combine them through configured percentages/weights.&lt;/p&gt;

&lt;p&gt;For a simple example let’s consider the "for you" page  or "feed" scenario (something we are all familiar with).   In this case the API will make requests of the 3 example component services previously mentioned.   The results from each service will then be combined by configurable weights.  Let's assume 50% user-personalization, 25% vector search, and the remaining 25% from the simplistic recommender to render their feed.     &lt;/p&gt;

&lt;p&gt;Ideally we should be able to easily add a 4th algorithm for candidate sourcing, with a configurable percentage (perhaps we include 10% popular items).      &lt;/p&gt;
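
&lt;p&gt;Here's a minimal sketch of what that blending layer might look like.  The component names, results, and weights are placeholders; the point is that adding a fourth source is just another entry in the weight config.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def blend_candidates(component_results, weights, page_size=20):
    """Draw candidates from each component in proportion to its weight.

    component_results maps component name to an ordered list of item ids,
    weights maps component name to a fraction (they should sum to roughly 1.0).
    """
    feed, seen = [], set()
    for name, weight in weights.items():
        take = int(round(page_size * weight))
        for item in component_results.get(name, []):
            if len(feed) &gt;= page_size:
                break
            if take &gt; 0 and item not in seen:
                feed.append(item)
                seen.add(item)
                take -= 1
    return feed

# hypothetical component outputs for a single user
results = {
    'user_personalization': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
    'vector_search':        ['c', 'k', 'l', 'm'],
    'curated':              ['x', 'y', 'z'],
}
print(blend_candidates(results, {'user_personalization': 0.5,
                                 'vector_search': 0.25,
                                 'curated': 0.25}, page_size=12))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In a real service you'd also want a fallback for when a component comes back short, but the shape of the idea is the same.&lt;/p&gt;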

&lt;h1&gt;
  
  
  Tuning, Experimentation, and Measurement
&lt;/h1&gt;

&lt;p&gt;We should be able to tune these percentages, preferably without a deployment, and also run experiments (group A is 50/25/25, group B is 75/15/10).    There are many things we could use to measure the performance of these algorithms, but most simply we could measure click-through rates by group, as well as by component service, to guide tuning and further experimentation.    Obviously, with a bit of work, we could fully automate the deployment scenarios on top of these basic principles (blue/green, etc.). &lt;/p&gt;
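
&lt;p&gt;One lightweight way to handle the experiment piece, sketched below: keep the per-group weights in config you can change without a deploy (a parameter store, a table, wherever), and bucket users into groups deterministically.  The group names and percentages are just the examples above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib

# per-group candidate sourcing weights; in practice load these from config, not code
EXPERIMENT_GROUPS = {
    'A': {'user_personalization': 0.50, 'vector_search': 0.25, 'curated': 0.25},
    'B': {'user_personalization': 0.75, 'vector_search': 0.15, 'curated': 0.10},
}

def assign_group(user_id):
    """Deterministically bucket a user so they always land in the same variant."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 2
    return 'A' if bucket == 0 else 'B'

group = assign_group('user-123')
weights = EXPERIMENT_GROUPS[group]
# log the group alongside click events so click-through rates can be compared per variant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;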

&lt;p&gt;&lt;a href="https://www.datafutures.co/"&gt;--&amp;gt; Data Futures&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Adventures in Amazon Personalize Infrastructure Deployment</title>
      <dc:creator>elliott cordo</dc:creator>
      <pubDate>Fri, 07 Jul 2023 12:22:40 +0000</pubDate>
      <link>https://dev.to/aws-heroes/adventures-in-amazon-personalize-infrastructure-deployment-11f</link>
      <guid>https://dev.to/aws-heroes/adventures-in-amazon-personalize-infrastructure-deployment-11f</guid>
      <description>&lt;p&gt;Although I spend the majority of my time on building broad, cloud native strategies and systems, I must admit that some of my favorite work in data is quite niche – recommendation systems.   Over the past decade I've had the opportunity to build quite a few recommendation systems.   Several were the expected ecomm and media use cases, although I also had the opportunity to build in social, and even internal research systems.   &lt;/p&gt;

&lt;p&gt;These systems are rewarding in two ways. First, with sufficient data and good algorithms they almost always yield good results - on one side increasing usage/revenue, and most importantly helping users find what they want.     Second, there is a bit of both art and science.  Yes, the algorithms are there, and they require selection and the technical work of training and tuning, but there is also a great bit of creativity in mapping these algorithms to the user experience, combining them in interesting ways, and even planning for a bit of “fun” and surprise.&lt;/p&gt;

&lt;p&gt;In the past, building these systems was pretty heavy on the ML engineering side.  Primarily you would be leveraging OSS algorithms, and forced to build your own frameworks for training, serving, and feedback pipelines.   Not that this was necessarily a horrible slog, at least for me, as building these sorts of things is fun and rewarding in its own way.  However, I personally always wanted to get to the fun and creative parts.   &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;And Then Amazon Personalize&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Amazon Personalize was introduced at re:Invent 2018, and went GA in the summer of 2019.   It started out targeting pretty basic recommendation use cases, but as it stands now in 2023 I can safely say you can build a comprehensive recommender system completely within the product.    This ranges from prepackaged algorithms through to serving and event collection infrastructure.    This gives you the opportunity to skip ahead to the really fun stuff!&lt;/p&gt;

&lt;h1&gt;
  
  
  Deployment methodology
&lt;/h1&gt;

&lt;p&gt;Other than just trivial explorations, where I might use the console, I am always Infrastructure-as-code first.   Not only does this provide a repeatable way of building and tearing down infrastructure, it’s also a great way to learn about the system from the ground up.    However, with Amazon Personalize, you will find that &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/AWS_Personalize.html"&gt;IAC coverage&lt;/a&gt; is only for very foundational components, namely the &lt;a href="https://docs.aws.amazon.com/personalize/latest/dg/data-prep-ds-group.html"&gt;DataSet Group&lt;/a&gt; (topmost project container), &lt;a href="https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html"&gt;Datasets&lt;/a&gt;, and &lt;a href="https://docs.aws.amazon.com/personalize/latest/dg/training-deploying-solutions.html"&gt;Solutions&lt;/a&gt; (an untrained model configuration).   &lt;/p&gt;

&lt;p&gt;Rest assured those smart folks at AWS didn’t forget these components, or backlog them to get an MVP out the door.   Most of the downstream components, particularly solution versions and campaigns are meant to be dynamic, and programmatically managed.   Potentially one miss being &lt;a href="https://docs.aws.amazon.com/personalize/latest/dg/recording-events.html"&gt;Event Trackers&lt;/a&gt;, which are a foundational one-time setup, and hopefully make it into CloudFormation someday soon.&lt;/p&gt;

&lt;p&gt;In an ideal fully productionalized system the flow would look something like the diagram below. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmq2lm8gfbt7to1da8pr8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmq2lm8gfbt7to1da8pr8.png" alt="Image description" width="397" height="312"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For you Step Functions fans out there, this is an absolutely perfect use case.   A step function and several Lambdas to start and poll the various Personalize API interactions would do the job nicely.  And getting back to the IAC conversation - you could absolutely IAC both the Lambdas and step functions!&lt;/p&gt;
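
&lt;p&gt;As a sketch of what a couple of those Lambda steps might look like, here is the start-and-poll pattern against the Personalize API for a new solution version; the state machine would loop on the check step until the status settles (the solution ARN is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import boto3

personalize = boto3.client('personalize')

SOLUTION_ARN = 'arn:aws:personalize:us-east-1:123456789012:solution/my-solution'  # placeholder

def start_training(event, context):
    """Lambda: kick off training of a new solution version."""
    resp = personalize.create_solution_version(solutionArn=SOLUTION_ARN)
    return {'solution_version_arn': resp['solutionVersionArn']}

def check_training(event, context):
    """Lambda: report training status; the step function retries until it is terminal."""
    resp = personalize.describe_solution_version(
        solutionVersionArn=event['solution_version_arn'])
    status = resp['solutionVersion']['status']  # e.g. CREATE PENDING, CREATE IN_PROGRESS, ACTIVE, CREATE FAILED
    return {'solution_version_arn': event['solution_version_arn'], 'status': status}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;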

&lt;p&gt;But what if you just want to keep it simple, and get things going quickly, or your organization is not ready to build complex step function infrastructure?&lt;/p&gt;

&lt;h1&gt;
  
  
  Management Notebook Approach
&lt;/h1&gt;

&lt;p&gt;As an engineer, I have a love-hate relationship with notebooks.  They are indeed great for prototyping and exploring data.  However notebooks are permissive of bad programming habits, and in most cases they end up being run-on-sentence type scripts.   But, they can be very helpful when used as “management scripts", and feel much less yucky.&lt;/p&gt;

&lt;p&gt;In one of my latest Personalize projects I used CloudFormation for dataset groups, datasets, and schemas.  I then created a management notebook for first-time creation of all remaining infrastructure components, and then a re-training notebook for, you guessed it, re-training.   I’ve shared the important bits in &lt;a href="https://github.com/elliottcordo/personalize_poc"&gt;this&lt;/a&gt; repo.   Although this repo is meant to be illustrative of the approach, you could certainly customize it and use it for your own custom solution.&lt;/p&gt;

&lt;p&gt;Although the notebooks are runnable locally, I host both notebooks in Glue, and have the retraining notebook cron’d to run every hour.    And if you wanted to achieve IAC nirvana with this pragmatic solution, you could absolutely IAC the Glue notebooks.&lt;/p&gt;

&lt;h1&gt;
  
  
  So in summary..
&lt;/h1&gt;

&lt;p&gt;Personalization is awesome, and unless you have a really custom/unique recommendation use case, there is little reason to build a custom recommender.   Personalize is going to require a bit of work to create sustainable infrastructure deployment, so definitely consider a pragmatic mix of IAC and management notebooks.   &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.datafutures.co/"&gt;https://www.datafutures.co/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>amazonpersonalize</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Deduping Customers Quick and Dirty - with SQL and Graphs</title>
      <dc:creator>elliott cordo</dc:creator>
      <pubDate>Mon, 20 Feb 2023 20:02:06 +0000</pubDate>
      <link>https://dev.to/aws-heroes/deduping-customers-quick-and-dirty-with-sql-and-graphs-4o09</link>
      <guid>https://dev.to/aws-heroes/deduping-customers-quick-and-dirty-with-sql-and-graphs-4o09</guid>
      <description>&lt;p&gt;Just about every organization I've worked with has had some sort of data quality problem, most notably the presence of duplicate customer records.   I think we can all agree that duplicate customer records can cause all sorts of problems ranging from inconsistent customer experience to inaccurate reporting and analytics.&lt;/p&gt;

&lt;p&gt;Unfortunately many think they aren't equipped to deal with this issue.  They feel that in order to get their customer data clean they are going to need to implement some complicated and expensive piece of software like an MDM system (yuck) or a CDP (meh).   Not only do these software packages take a long time to select and implement, the customer matching capabilities are often not that impressive.  &lt;/p&gt;

&lt;p&gt;Let’s end this paralysis, roll up our sleeves and clean up our customer data with the tools we have.  My grandfather's motto was “it’s better to do something than nothing”, and to me this definitely resonates with this problem.  All you need is a good old database, preferably an MPP like Redshift, and a tiny bit of Python.   I will share an approach I've used in multiple organizations, in most cases exceeding the results from the aforementioned systems.   Although I’ve potentially thrown some shade on this approach by calling it quick and dirty, this is a perfectly reasonable, and productionizable system for large organizations as well as small and scrappy ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  Standardization and Matching
&lt;/h2&gt;

&lt;p&gt;The first step in our data cleanup is to standardize our data and then essentially perform a self join to match our records together.   The theme of this article is quick and dirty, so we are going to do only very light cleanup.  After we walk through this simple but effective solution I’ll provide some tips on how to make it quite sophisticated.&lt;/p&gt;

&lt;p&gt;In the first CTE below I do some minor cleanup.  I parse the first segment of the zip and strip non-numeric characters from the phone numbers.  Based on my understanding of the data I may choose to do more operations, such as trimming strings or padding numbers. I also assemble a concatenated field phone_list, which will allow me to compare phone numbers across a number of fields.   I could assemble a similar field if I had multiple emails or addresses.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;cust&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
 &lt;span class="k"&gt;select&lt;/span&gt;
 &lt;span class="n"&gt;customernumber&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;firstname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;lastname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;regexp_replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mobile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'[^0-9]'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;mobile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;regexp_replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;phone1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'[^0-9]'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;phone1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;regexp_replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;phone2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'[^0-9]'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;phone2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;regexp_replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;phone3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'[^0-9]'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;phone3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;regexp_replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;phone4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'[^0-9]'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;phone4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;nvl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;regexp_replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mobile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'[^0-9]'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;'|'&lt;/span&gt;
    &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;nvl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;regexp_replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;phone1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'[^0-9]'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;'|'&lt;/span&gt;
    &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;nvl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;regexp_replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;phone2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'[^0-9]'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;'|'&lt;/span&gt;
    &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;nvl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;regexp_replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;phone3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'[^0-9]'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;'|'&lt;/span&gt;
    &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;nvl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;regexp_replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;phone4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'[^0-9]'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;phone_list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;address1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="k"&gt;substring&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;charindex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'-'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;zip&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I then perform the match.   Keeping it simple, I require an exact match on the name.   I then OR together a number of conditions such as email, phone, or address.   I use &lt;code&gt;charindex&lt;/code&gt; to test for the existence of a phone number in my concatenated &lt;code&gt;phone_list&lt;/code&gt; field.    I also match on &lt;code&gt;left(a.address1,5)&lt;/code&gt; AND zip, which might seem strange but in practice I've found it very effective.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt;
 &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customernumber&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;matched_customernumber&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;cust&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
&lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;cust&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;firstname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;firstname&lt;/span&gt;
 &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lastname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lastname&lt;/span&gt;
 &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customernumber&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customernumber&lt;/span&gt; &lt;span class="c1"&gt;-- we don't want to join the same record to itself&lt;/span&gt;
 &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
     &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;
     &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="n"&gt;charindex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mobile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;phone_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
     &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="n"&gt;charindex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;phone1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;phone_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
     &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="n"&gt;charindex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;phone2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;phone_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
     &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="n"&gt;charindex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;phone3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;phone_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
     &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="n"&gt;charindex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;phone4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;phone_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
     &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;left&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;address1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;left&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;address1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zip&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Enhancing The Process
&lt;/h3&gt;

&lt;p&gt;You can put as much effort into standardization and matching as you have time to invest.    This effort will afford you a higher match rate (reducing undermatching/eliminating more dupes).  However at a certain point, especially if you make things too fuzzy, you can end up overmatching (false positives/potentially merging unique people).   Given time, you have the opportunity to easily rival commercial tooling, just use your time wisely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parsing addresses&lt;/strong&gt; - standardize and break addresses into parts.  You will get a more deterministic join based on street and house number vs a string match&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resolve names to root forms&lt;/strong&gt; - there are many available lists and csv files out there that will resolve common abbreviations and nicknames such as the infamous Robert = Bob&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fuzzy matching&lt;/strong&gt; - transforming field values to fuzzy representations or using fuzzy matching techniques such as Levenshtein Distance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cleaning bad values&lt;/strong&gt; - null out addresses and emails in your match set which appear in high frequency or match known bad values (such as a store location)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Much of the above can be done in SQL, however the beauty of platforms like Amazon Redshift is they can be &lt;a href="https://docs.aws.amazon.com/redshift/latest/dg/udf-python-language-support.html" rel="noopener noreferrer"&gt;extended with Python&lt;/a&gt;.  &lt;a href="https://towardsdatascience.com/bringing-fuzzy-matching-to-redshift-d487ce98d170" rel="noopener noreferrer"&gt;Here&lt;/a&gt; is a great article on packaging and leveraging the Python module fuzzywuzzy within Redshift. &lt;/p&gt;
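
&lt;p&gt;To give a feel for the fuzzy matching piece, here is a tiny illustration using the standard library's difflib (the fuzzywuzzy approach in the linked article works the same way, just with a Levenshtein-based ratio); the threshold is arbitrary and should be tuned against your own tolerance for false positives:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from difflib import SequenceMatcher

def similarity(a, b):
    """Rough 0-100 similarity score between two strings."""
    return int(SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio() * 100)

# likely the same street despite the typo and abbreviation
print(similarity('123 Washington Street', '123 Washingtn St'))
# clearly different addresses, scores much lower
print(similarity('123 Washington Street', '987 Lincoln Avenue'))

MATCH_THRESHOLD = 85  # arbitrary cutoff; raise it to reduce overmatching
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;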

&lt;h2&gt;
  
  
  Clustering and the “Graph Trick”
&lt;/h2&gt;

&lt;p&gt;I bet a few of you have tried something similar to what I’ve outlined above and then faced a problem - how the heck do you cluster these matches and establish a survivor?  If the term “survivor” doesn’t immediately register, it essentially means collapsing a bunch of duplicate records into a single “golden record” which will remain, with all other matched records removed.   Let’s look at a quick example to show why this is both important and difficult.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F042gjbawzswb4qer0ctc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F042gjbawzswb4qer0ctc.png" alt="duplicate records" width="800" height="103"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For simplicity, I’m only showing half of the rows in the result, as you will have the same match in the reverse relationship.   As you can see, there is no natural hierarchy, and there are transitive and multiple matches across the dataset.  The first record matches the second, the second matches the third, and so on.   There is no way to easily move forward here without being able to cluster the results.  In SQL terms we want all matches group-able by a &lt;code&gt;partition by&lt;/code&gt; clause.&lt;/p&gt;

&lt;p&gt;You might be tempted to try a recursive CTE; however, since this operation is designed for hierarchies and does not tolerate loops, your query will likely run indefinitely and time out.  Applying limits is not a good option either, as you have no control over where the query will terminate and you may end up with incomplete results.&lt;/p&gt;

&lt;p&gt;This is where remodeling the data as a graph can really simplify the problem.   As you can see, the picture becomes a lot clearer when modeled as nodes and edges instead of rows.   And it’s not just simpler for us, it’s also simpler from a programming model perspective.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fedvl1axath4sntqtzlmc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fedvl1axath4sntqtzlmc.png" alt="graph representation" width="792" height="670"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can now load this data into a graph model and iterate over the subgraphs - the small disconnected graphs within the larger graph, which in this case just so happen to be our clusters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;networkx&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nx&lt;/span&gt;

&lt;span class="n"&gt;Graphtype&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Graph&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;result.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
   &lt;span class="c1"&gt;# skip the header
&lt;/span&gt;   &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="c1"&gt;# parse into the graph
&lt;/span&gt;   &lt;span class="n"&gt;G&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_edgelist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delimiter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;create_using&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Graphtype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# append the individual subgraph nodes to a list
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;G&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subgraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;nx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connected_components&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;G&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
   &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;parent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;output.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
   &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;group&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
       &lt;span class="n"&gt;parent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;   
       &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
           &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example I assign the first customer in the flattened subgraph as the survivor (&lt;code&gt;list(group)[0]&lt;/code&gt;), but you could apply whatever business logic is appropriate: perhaps picking the lowest customer number, the oldest record, or even importing a dataset of customer spend and choosing the record with the highest value or the oldest/latest transaction date.  But if you’re trying to keep it simple, and aren’t that familiar with Python or graphs, you can use the above script as is and resolve the rest in SQL.&lt;/p&gt;
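
&lt;p&gt;For example, here’s a minimal sketch of how that swap might look. The helper function and the spend lookup are hypothetical; the only assumption is that the node ids parsed from &lt;code&gt;result.csv&lt;/code&gt; are customer numbers stored as numeric strings.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def pick_survivor(group, spend_lookup=None):
    """Pick the surviving customer id for one cluster of matched ids.

    group        -- an iterable of node ids (strings) from one subgraph
    spend_lookup -- optional dict of id -&gt; lifetime spend (hypothetical)
    """
    if spend_lookup:
        # highest-value customer wins
        return max(group, key=lambda node: spend_lookup.get(node, 0))
    # otherwise the lowest customer number wins (assumes numeric id strings)
    return min(group, key=int)

# usage: replace `parent = list(group)[0]` in the loop above with
# parent = pick_survivor(group)
&lt;/code&gt;&lt;/pre&gt;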

&lt;p&gt;Once you have created this dataset, you can reload it into your database and use it to build your golden record with relative ease.  You can leverage window functions, &lt;code&gt;coalesce&lt;/code&gt;, and other familiar SQL operations, since every match can now be grouped with a &lt;code&gt;partition by&lt;/code&gt; on the survivor's customer number, roughly as sketched below.&lt;/p&gt;
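
&lt;p&gt;Here’s a rough sketch of that last step, kept as a query string so you can run it with whatever warehouse client you already use. The table and column names (&lt;code&gt;customer_cluster&lt;/code&gt;, &lt;code&gt;survivor_id&lt;/code&gt;, &lt;code&gt;updated_at&lt;/code&gt;, and so on) are hypothetical placeholders for your actual schema.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# a sketch of the golden-record query -- all table/column names are hypothetical
GOLDEN_RECORD_SQL = """
select survivor_id, customer_id, email, phone
from (
    select
        cc.survivor_id,
        c.*,
        row_number() over (
            partition by cc.survivor_id   -- one group per cluster
            order by c.updated_at desc    -- keep the freshest record's attributes
        ) as rn
    from customer_cluster cc
    join customers c on c.customer_id = cc.customer_id
) ranked
where rn = 1
"""
print(GOLDEN_RECORD_SQL)  # run against your warehouse with your usual client
&lt;/code&gt;&lt;/pre&gt;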

&lt;h2&gt;
  
  
  Quick and Dirty?
&lt;/h2&gt;

&lt;p&gt;Quick and dirty is a good attention-getter; however, I actually consider this a pretty darn robust approach for both one-time and recurring customer deduplication.   I’d go so far as to say it represents using the right tool for the right job.  Matching and final record assembly are easily expressed in SQL, and accessible for most data wranglers to tune and improve.   Likewise, clustering and survivorship fit the graph model almost perfectly.  &lt;em&gt;As a side note, I’d encourage you to keep graphs in mind when solving other types of problems, as Graphs Are Everywhere.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;One other thing I wanted to mention - this solution performs!  On a customer dataset of about 1 million records, Redshift crushed the match in under 5 seconds, and the graph clustering job finished in about 2 seconds.  I've personally waited hours or even days for commercial tooling to process a full match on customer datasets of similar scale.&lt;/p&gt;

&lt;p&gt;I hope you enjoyed this solution and are encouraged to go squash some duplicates!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.datafutures.co/" rel="noopener noreferrer"&gt;https://www.datafutures.co/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>vibecoding</category>
      <category>webdev</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Redshift Cost Optimization with Cost Explorer</title>
      <dc:creator>elliott cordo</dc:creator>
      <pubDate>Thu, 26 Jan 2023 21:50:04 +0000</pubDate>
      <link>https://dev.to/aws-heroes/redshift-cost-optimization-with-cost-explorer-382f</link>
      <guid>https://dev.to/aws-heroes/redshift-cost-optimization-with-cost-explorer-382f</guid>
      <description>&lt;p&gt;These days, cloud efficiency and cost savings are top of mind for many organizations.  Current economic conditions aside, it’s always a good time to invest in using the cloud efficiently.   Besides saving money, the same levers that drive efficiency very often directly support scalability, reliability and sustainability.   &lt;/p&gt;

&lt;p&gt;Data Warehouses were historically statically provisioned, fixed-cost systems.  The good news is that advancements in cloud-native data warehouse platforms allow us to optimize cost and efficiency much like our other engineered systems.  &lt;/p&gt;

&lt;p&gt;There are obviously a lot of different ways to approach efficiency.   Optimizing a system like Redshift should include both the infrastructure configuration and what “runs on” the platform.  The latter may include improvements such as optimizing queries, table structures and materialization patterns, or even moving some workloads off the data warehouse platform entirely (e.g. moving big crunches to Elastic MapReduce).   For this article I’ll focus on the infrastructure layer, using only insights from Cost Explorer, and assume that what runs on Redshift is fixed.&lt;/p&gt;

&lt;p&gt;As many of you know, Redshift RA3 instance types decouple compute and storage.  Furthermore, compute itself is elastic, allowing you to handle spikes and increases in workload through features such as Concurrency Scaling and Serverless endpoints.   These features, along with the ability to blend provisioned, serverless, and elastic resources, are why Redshift delivers such excellent price-performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost Explorer to the rescue!
&lt;/h2&gt;

&lt;p&gt;As the old saying goes, “you can't improve what you don't measure.”   So the first step in figuring out whether your Redshift infrastructure is optimal is reviewing Cost Explorer.    This is not necessarily a trivial task, as the billing has many components, each with different timing and, unfortunately, some cryptic abbreviations.  &lt;/p&gt;

&lt;p&gt;Within Cost Explorer, change the report parameters to Granularity: Monthly (a good place to start), Dimension: Usage type, and set a filter on Service: Redshift.   You will then end up with a report that looks something like the one below.     This example infrastructure configuration is really useful, as it demonstrates nearly all the components you may see with a standard RA3 plus serverless deployment.    I’ll now walk through the line items and identify where there may be some efficiency opportunities.   &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqb6ptk73v8eznmmo1wia.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqb6ptk73v8eznmmo1wia.png" alt="cost explorer example" width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;
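
&lt;p&gt;If you’d rather pull the same breakdown programmatically, here is a minimal boto3 sketch. The date range is a placeholder, and it’s worth double checking that the service name and metric match what Cost Explorer shows for your account.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import boto3

ce = boto3.client('ce')  # Cost Explorer

response = ce.get_cost_and_usage(
    TimePeriod={'Start': '2023-01-01', 'End': '2023-02-01'},  # placeholder dates
    Granularity='MONTHLY',
    Metrics=['UnblendedCost'],
    # group by usage type and filter down to Redshift, mirroring the console report
    GroupBy=[{'Type': 'DIMENSION', 'Key': 'USAGE_TYPE'}],
    Filter={'Dimensions': {'Key': 'SERVICE', 'Values': ['Amazon Redshift']}},
)

for group in response['ResultsByTime'][0]['Groups']:
    usage_type = group['Keys'][0]
    cost = float(group['Metrics']['UnblendedCost']['Amount'])
    print(f"{usage_type}: ${cost:,.2f}")
&lt;/code&gt;&lt;/pre&gt;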

&lt;p&gt;Let’s get familiar with the abbreviations and their meanings.   All the pricing details can be found &lt;a href="https://aws.amazon.com/redshift/pricing/" rel="noopener noreferrer"&gt;here&lt;/a&gt;, but I’ll try to summarize the important bits.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;USE1-ServerlessUsage&lt;/strong&gt; - The cost of consumed Redshift Processing Units (RPUs); this includes whatever base capacity you have configured plus any usage billed above that capacity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HeavyUsage&lt;/strong&gt; - Redshift reserved instance cost (be sure to select the 1st of the month in your date selection to pick up this line item)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node&lt;/strong&gt; -  Redshift on-demand usage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RMS&lt;/strong&gt; - Redshift Managed Storage; storage cost billed in GB-hours&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CS&lt;/strong&gt; - Concurrency Scaling; you accrue up to one hour of free concurrency scaling credit per day, and usage beyond that is billed per second at on-demand rates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PaidSnapshots&lt;/strong&gt; - Backups, necessary of course but definitely not free&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;USE1-DataScanned&lt;/strong&gt; - Redshift Spectrum usage; querying data that lives in S3 or other external sources&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Findings
&lt;/h2&gt;

&lt;p&gt;For context, we are looking at three pieces of Redshift infrastructure: a reserved ra3.4xlarge cluster, an on-demand ra3.xlplus cluster, and a Redshift Serverless endpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Excessive Concurrency Scaling&lt;/strong&gt;&lt;br&gt;
Our Concurrency Scaling usage (CS:ra3.4xlarge) is approaching the cost of our reserved cluster.   This cluster is clearly blowing through its free daily concurrency scaling credit and relying heavily on on-demand capacity to complete its work.    First off, let’s acknowledge how cool this is - we are running a cluster at or &lt;em&gt;above&lt;/em&gt; the redline and still completing the necessary work and serving our end users.    A cost-effective solution here is to offset the on-demand pricing with reserved instances.   A good experiment would be to add an on-demand node to the cluster and observe the reduction in Concurrency Scaling and overall cluster CPU.    If the calculated price profile is favorable, consider reserving the additional node.&lt;/p&gt;
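
&lt;p&gt;If you want to script that experiment, an elastic resize is a one-liner with boto3. The cluster name and node count below are placeholders; confirm the target count is valid for your node type before running it.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import boto3

redshift = boto3.client('redshift')

# grow the cluster by one node via elastic resize (identifier and count are placeholders)
redshift.resize_cluster(
    ClusterIdentifier='my-ra3-cluster',
    NumberOfNodes=3,
    Classic=False,  # elastic resize is typically much faster than a classic resize
)
&lt;/code&gt;&lt;/pre&gt;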

&lt;p&gt;&lt;strong&gt;Redshift Serverless Usage&lt;/strong&gt;&lt;br&gt;
The Redshift Serverless endpoint (USE1-ServerlessUsage) costs have eclipsed the costs of our reserved cluster.    There may be good reasons for this, and it’s possible that this is the most cost-effective way of handling this workload.    The first thing to do is check the base RPU setting for this endpoint, making sure we haven’t staked a baseline commitment that is too high and doesn’t reflect the actual usage and minimum requirements of the workload.    Depending on the RPUs actually consumed and the usage patterns, a reserved provisioned cluster might be more cost-effective.   A good common-sense rule, at this point in time, is that a highly utilized cluster will be more cost-effective staying provisioned.  Note that the math here is a little tricky and there is no official guidance just yet, but let’s use 60% average cluster CPU as a good threshold to keep a workload provisioned.  I suggest reading &lt;a href="https://aws.amazon.com/redshift/pricing/" rel="noopener noreferrer"&gt;this&lt;/a&gt; section of the documentation on monitoring cost and usage.&lt;/p&gt;
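
&lt;p&gt;Checking that baseline doesn’t require clicking through the console either. Here’s a small boto3 sketch; the workgroup name is a placeholder, and the update call is commented out so nothing changes until you mean it to.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import boto3

serverless = boto3.client('redshift-serverless')

# look up the configured base RPU capacity for a workgroup (name is a placeholder)
workgroup = serverless.get_workgroup(workgroupName='my-workgroup')['workgroup']
print('Base RPU capacity:', workgroup['baseCapacity'])

# if the baseline looks too high for the workload, dial it down
# serverless.update_workgroup(workgroupName='my-workgroup', baseCapacity=32)
&lt;/code&gt;&lt;/pre&gt;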

&lt;p&gt;&lt;strong&gt;On Demand Nodes&lt;/strong&gt;&lt;br&gt;
The ra3.xlplus cluster is contributing a fair amount to the bill given its performance capacity.    It’s easy to fall into the trap of running on-demand for too long, especially when there are pending changes to the environment.    Another common-sense rule: if you plan on running this infrastructure for more than 6 months you probably want a 1-year reservation, and for more than 18 months, a 3-year.    We humans are pretty poor planners in general, especially when making guesses under uncertainty.   In my experience the rules above have always been net favorable.   Even if your guess only just holds, you’ll still be pretty close to the break-even point of the reservation.&lt;/p&gt;
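
&lt;p&gt;The arithmetic behind that rule of thumb is easy to sanity check. The rates below are made-up placeholders; plug in current on-demand and effective reserved pricing for your node type and region.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# made-up placeholder rates -- substitute real pricing for your node type and region
on_demand_hourly = 1.00            # $/hr on demand (placeholder)
reserved_effective_hourly = 0.60   # $/hr effective over a 1-year commitment (placeholder)

hours_per_month = 730
one_year_commitment = reserved_effective_hourly * hours_per_month * 12

# break-even: how many months of on-demand would cost as much as the full 1-year reserve
break_even_months = one_year_commitment / (on_demand_hourly * hours_per_month)
print(f"Break even at roughly {break_even_months:.1f} months")

# with a ~40% effective discount this lands around 7 months, which is why
# "more than 6 months of planned usage" is a reasonable trigger to reserve
&lt;/code&gt;&lt;/pre&gt;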

&lt;p&gt;&lt;strong&gt;Snapshots&lt;/strong&gt;&lt;br&gt;
The snapshot (Redshift:PaidSnapshots) costs are a little high but reasonable, and still worth investigating.   Start by reviewing your retention policy and make sure it complies with your organization's service levels and policies.   Also be sure to page back through the early history and make sure you're not permanently keeping a large number of final snapshots.    I’ve seen large buildups of these from misconfigured CI/CD pipelines or programmatic restores to non-prod.&lt;/p&gt;
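
&lt;p&gt;A quick way to do that audit is to list the manual snapshots (which include final snapshots from deleted clusters) and eyeball their age and size. A minimal boto3 sketch:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import boto3

redshift = boto3.client('redshift')

# manual snapshots are kept indefinitely unless a retention period is set,
# so they are the usual suspects for quiet cost buildup
paginator = redshift.get_paginator('describe_cluster_snapshots')
for page in paginator.paginate(SnapshotType='manual'):
    for snap in page['Snapshots']:
        print(snap['SnapshotIdentifier'],
              snap['SnapshotCreateTime'].date(),
              f"{snap['TotalBackupSizeInMegaBytes']:.0f} MB")
&lt;/code&gt;&lt;/pre&gt;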

&lt;h2&gt;
  
  
  In Closing
&lt;/h2&gt;

&lt;p&gt;Optimizing infrastructure can be a fun and rewarding exercise - like playing Sherlock Holmes with Cost Explorer as your Watson.  Based solely on these Cost Explorer findings there is a relatively easy 10 to 20% cost savings available through infrastructure configuration alone.   Although this post outlines a high-level infrastructure review, there is definitely a lot more digging to do, especially into what “runs on” Redshift.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Go forth and be frugal.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.datafutures.co/" rel="noopener noreferrer"&gt;https://www.datafutures.co/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devmeme</category>
      <category>watercooler</category>
    </item>
  </channel>
</rss>
