<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Anna Pastushko</title>
    <description>The latest articles on DEV Community by Anna Pastushko (@annpastushko).</description>
    <link>https://dev.to/annpastushko</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F888500%2F9484df8c-94b1-4765-871d-1bfddae28a27.png</url>
      <title>DEV Community: Anna Pastushko</title>
      <link>https://dev.to/annpastushko</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/annpastushko"/>
    <language>en</language>
    <item>
      <title>Easy to follow Architecture Framework</title>
      <dc:creator>Anna Pastushko</dc:creator>
      <pubDate>Wed, 15 Apr 2026 14:16:06 +0000</pubDate>
      <link>https://dev.to/aws-builders/easy-to-follow-architecture-framework-5ckb</link>
      <guid>https://dev.to/aws-builders/easy-to-follow-architecture-framework-5ckb</guid>
      <description>&lt;h2&gt;
  
  
  Context
&lt;/h2&gt;

&lt;p&gt;When I first opened TOGAF, I got stuck on a basic problem — there's no clear starting point. Every section references another. Every process assumes a governance structure most teams don't have. The same problem exists with Zachman. By the time you understand these frameworks well enough to follow them, you've spent a lot of time. And even then, they're too heavy for most real projects.&lt;/p&gt;

&lt;p&gt;Over my years of work as an architect, I accumulated notes: what to do on a new project, what to check before moving to the next phase, where things typically go wrong. Templates I kept reusing. Checklists I open on day one. At some point I had enough material to structure it into something I could share. I called it D3.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is D3?
&lt;/h2&gt;

&lt;p&gt;D3 is a lightweight architecture delivery framework built around three phases:&lt;br&gt;
Discover → Design → Deploy&lt;/p&gt;

&lt;p&gt;Each phase has a clear goal, concrete steps, a list of common pitfalls, and a definition-of-done checklist. The checklist is the key part — you don't move to the next phase until every item is checked. No guessing whether you're ready. No open unknowns quietly becoming risks later.&lt;/p&gt;
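&lt;p&gt;As a toy illustration (not part of D3 itself), the gate is simply "every item checked"; a minimal sketch in Python, with invented checklist items:&lt;/p&gt;

```python
# Illustrative phase gate: refuse to advance while any
# definition-of-done item is unchecked. The checklist items
# below are made-up examples, not the official D3 lists.
def can_advance(checklist: dict) -> bool:
    """Return True only when every definition-of-done item is checked."""
    return all(checklist.values())

discover_done = {
    "business problem restated and agreed": True,
    "scope agreed with stakeholders": True,
    "assumptions and open questions logged": False,  # still an open unknown
}

print(can_advance(discover_done))  # prints False: one unchecked item blocks the phase
```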

&lt;p&gt;It works for solo architects and teams, for startups and enterprises. It doesn't require certifications, governance boards, or months of setup — just a clear problem and a structured way to work through it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three phases
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Discover&lt;/strong&gt; — understand the actual business problem, not just the stated request. Agree on scope with the right people, capture requirements, and document assumptions and open questions. The goal is enough clarity to design with confidence, not perfection.&lt;br&gt;
&lt;strong&gt;Design&lt;/strong&gt; — translate what you learned into a validated architecture. Always present at least two options with trade-offs, never a single answer. Share early — a rough diagram reviewed now is cheaper than a finished document reviewed late.&lt;br&gt;
&lt;strong&gt;Deploy&lt;/strong&gt; — deliver and hand over. This phase ends only when the stakeholder can operate the system independently. Not when the code is deployed. Not when documents are sent. When ownership has actually transferred.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project types — the framework scales with you
&lt;/h2&gt;

&lt;p&gt;D3 has three project types based on the size and risk of your engagement:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Duration&lt;/th&gt;
&lt;th&gt;Goal&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Advisory&lt;/td&gt;
&lt;td&gt;3–4 weeks&lt;/td&gt;
&lt;td&gt;Strategic direction and high-level design&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pilot&lt;/td&gt;
&lt;td&gt;2–3 months&lt;/td&gt;
&lt;td&gt;Validated prototype or workload integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Delivery&lt;/td&gt;
&lt;td&gt;6+ months&lt;/td&gt;
&lt;td&gt;Full production system with operational handover&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each type is cumulative — Pilot includes all Advisory outputs, and Delivery includes all Pilot outputs. You pick the type that fits the engagement, and the framework adjusts depth accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's inside each phase
&lt;/h2&gt;

&lt;p&gt;Beyond the steps themselves, each phase includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Common pitfalls — specific mistakes I've made or seen, written plainly. Things like treating the stakeholder's request as the problem, presenting one option instead of two, or deprioritizing CI/CD until the end.&lt;/li&gt;
&lt;li&gt;Definition of done checklist — a concrete list of items to check before moving forward. The rule: don't proceed with open unknowns.&lt;/li&gt;
&lt;li&gt;Internal tools — working documents for the architect (assumptions log, open questions log, stakeholder map) kept separate from client-facing deliverables.&lt;/li&gt;
&lt;li&gt;Session guides — how to run discovery, design review, and handover sessions. Before, during, and after.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Who it's for
&lt;/h2&gt;

&lt;p&gt;Junior or senior — a new project can still leave you unsure what to do first. D3 gives you a clear action plan from day one.&lt;br&gt;
It's aimed at architects, tech leads, and engineers who end up designing systems regardless of title. It's not a methodology or a certification path. It's a structured way to go from a stakeholder's pain point to a delivered, handed-over system — and to always know what phase you're in.&lt;/p&gt;

&lt;p&gt;The framework is free and open: d3-architecture-framework.vercel.app&lt;br&gt;
I'd love to know if it's useful — or where it doesn't fit your experience.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>systemdesign</category>
      <category>career</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I Open-Sourced My Solutions Architect Portfolio (Real-World Case Studies)</title>
      <dc:creator>Anna Pastushko</dc:creator>
      <pubDate>Tue, 20 Jan 2026 19:13:58 +0000</pubDate>
      <link>https://dev.to/aws-builders/i-open-sourced-my-solutions-architect-portfolio-real-world-case-studies-1i0c</link>
      <guid>https://dev.to/aws-builders/i-open-sourced-my-solutions-architect-portfolio-real-world-case-studies-1i0c</guid>
      <description>&lt;p&gt;You know that awkward moment when an employer asks for your SA portfolio or examples of your architectures? Ever tried to find a real SA portfolio example online? I tried. You’ve probably tried too. There’s nothing. &lt;/p&gt;

&lt;p&gt;Reference architectures? Sure — there are plenty of diagrams for AWS, Azure, and Google Cloud. But how do YOU format YOUR work into a portfolio that shows architectural thinking, not just “I drew boxes”? &lt;/p&gt;

&lt;p&gt;To address this, I decided to &lt;strong&gt;open-source my own Solutions Architect portfolio&lt;/strong&gt;, featuring sanitized case studies from real-world projects. The goal is simple: provide a reference so others don’t start from scratch and can see how architecture decisions, trade-offs, and diagrams are documented in practice.&lt;/p&gt;

&lt;p&gt;🌐 &lt;strong&gt;Portfolio:&lt;/strong&gt; &lt;a href="https://childishgirl-portfolio.vercel.app" rel="noopener noreferrer"&gt;https://childishgirl-portfolio.vercel.app&lt;/a&gt;&lt;br&gt;
📂 &lt;strong&gt;GitHub repository:&lt;/strong&gt; &lt;a href="https://github.com/ChildishGirl/sa-portfolio" rel="noopener noreferrer"&gt;https://github.com/ChildishGirl/sa-portfolio&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Solutions Architect Portfolios Are Hard
&lt;/h2&gt;

&lt;p&gt;Compared to developer portfolios, Solutions Architect portfolios are inherently more difficult to create:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Systems are complex:&lt;/strong&gt; You’re not just showing code, but architecture decisions, constraints, and trade-offs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Most work is confidential:&lt;/strong&gt; Real projects often can’t be shared as-is.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact is broader:&lt;/strong&gt; Architecture work connects technical design with business goals.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because of this, many aspiring architects struggle to create a portfolio without examples to follow.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Portfolio Includes
&lt;/h2&gt;

&lt;p&gt;The portfolio is structured around &lt;strong&gt;real-world case studies&lt;/strong&gt;, carefully sanitized to avoid exposing sensitive information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Industry context:&lt;/strong&gt; Because architecture doesn’t exist in a vacuum. The same pattern solves different problems in fintech vs healthcare.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trade-offs explicitly called out:&lt;/strong&gt; This is THE signal of architectural thinking. Anyone can pick a solution. Architects explain what they sacrificed to get it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diagram style inspired by AWS reference architectures:&lt;/strong&gt; Familiarity reduces cognitive load. When readers recognize the visual language, they focus on YOUR decisions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Responsibilities:&lt;/strong&gt; Shows your actual scope of impact and ownership within the project.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Who This Is For
&lt;/h2&gt;

&lt;p&gt;This portfolio can be useful for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Junior or aspiring Solutions Architects&lt;/li&gt;
&lt;li&gt;Engineers transitioning into architecture roles&lt;/li&gt;
&lt;li&gt;Mentors, lecturers, or educators looking for a reference example&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key Principles for Building Your Own SA Portfolio
&lt;/h2&gt;

&lt;p&gt;While working on this portfolio, I followed a few principles that may help others:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Emphasize decisions, not just outcomes&lt;/strong&gt;:
Explain &lt;em&gt;why&lt;/em&gt; a solution looks the way it does.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use diagrams generously&lt;/strong&gt;:
A clear diagram often communicates more than a page of text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sanitize aggressively&lt;/strong&gt;:
Replace real names and sensitive data with neutral placeholders.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Make it skimmable&lt;/strong&gt;:
Use headings and concise sections.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Be honest about trade-offs&lt;/strong&gt;:
Mention alternatives you considered.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;I hope this open-source portfolio helps someone avoid starting from a blank page when preparing for architecture roles.&lt;/p&gt;

&lt;p&gt;If you’ve built (or reviewed) Solutions Architect portfolios before, I’d be curious to hear:&lt;br&gt;
&lt;strong&gt;What do you think a strong SA portfolio should include?&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Tags
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;aws&lt;/code&gt; &lt;code&gt;solutions&lt;/code&gt; &lt;code&gt;architecture&lt;/code&gt; &lt;code&gt;cloud&lt;/code&gt; &lt;code&gt;career&lt;/code&gt; &lt;code&gt;portfolio&lt;/code&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>career</category>
      <category>opensource</category>
      <category>portfolio</category>
    </item>
    <item>
      <title>Migrate local Data Science workspaces to SageMaker Studio</title>
      <dc:creator>Anna Pastushko</dc:creator>
      <pubDate>Mon, 08 Aug 2022 07:00:44 +0000</pubDate>
      <link>https://dev.to/annpastushko/migrate-local-data-science-workspaces-to-sagemaker-studio-139b</link>
      <guid>https://dev.to/annpastushko/migrate-local-data-science-workspaces-to-sagemaker-studio-139b</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F80m30rjl0qymq6ricvli.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F80m30rjl0qymq6ricvli.png" alt=" " width="800" height="439"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Workspace requirements
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Problem context
&lt;/h2&gt;

&lt;p&gt;The data science team used Jupyter notebooks and the PyCharm IDE on their local machines to train models but had trouble tracking model performance and configuration. They used Google Sheets to record each model's name, parameter values, and performance, which quickly became messy as the number of models grew. Team members also wanted to collaborate with each other when building models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Functional requirements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FR-1&lt;/strong&gt; Solution provides collaborative development.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FR-2&lt;/strong&gt; Solution can be integrated/expanded with AWS EMR.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Non-functional requirements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NFR-1&lt;/strong&gt; Data and code should not be transferred via public internet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NFR-2&lt;/strong&gt; Data scientists should not have access to the public internet.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Constraints
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;C-1&lt;/strong&gt; Data is stored in AWS S3.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;C-2&lt;/strong&gt; Users’ credentials should be stored in AWS IAM Identity Center (SSO).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;C-3&lt;/strong&gt; AWS CodeCommit is selected as a version control system.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Proposed Solution
&lt;/h1&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Migrate local development environment to AWS SageMaker Studio and enable SageMaker Experiments and SageMaker Model Registry to track models’ performance.&lt;/p&gt;

&lt;p&gt;SageMaker Studio will be launched in &lt;em&gt;VPC only mode&lt;/em&gt; to make connection to Amazon EMR available and to satisfy the requirement of disabling access to the public internet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The solution consists of the following components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VPC with an S3 Gateway Endpoint and one private subnet with Interface Endpoints to allow data/code transfer without the public internet.&lt;/li&gt;
&lt;li&gt;IAM Identity Center to reuse existing credentials and enable SSO to SageMaker.&lt;/li&gt;
&lt;li&gt;AWS SageMaker Studio.&lt;/li&gt;
&lt;li&gt;AWS S3 buckets to read data and save model artefacts.&lt;/li&gt;
&lt;li&gt;AWS CodeCommit repository to store code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can see the combination of these components on the Deployment view in the diagram at the top of the page.&lt;/p&gt;

&lt;h2&gt;
  
  
  Functional View
&lt;/h2&gt;

&lt;p&gt;Users log in through the SSO login page and access SageMaker Studio. They can then create and modify Jupyter notebooks, train ML models, and track them using the Model Registry.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zptrytjsi1q8fj11x44.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zptrytjsi1q8fj11x44.png" alt="Functional view" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Migration plan
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1. Inventory of resources and estimation of needed capacity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first step is to create a list of all users, models they used and minimal system requirements. You can see an example of such inventory below.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;User&lt;/th&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Memory&lt;/th&gt;
&lt;th&gt;GPU usage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Harry Potter&lt;/td&gt;
&lt;td&gt;Sorting Hat algorithm&lt;/td&gt;
&lt;td&gt;XGBoost&lt;/td&gt;
&lt;td&gt;15 MB&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ronald Weasley&lt;/td&gt;
&lt;td&gt;Generate summaries for inquiries for Hogwarts&lt;/td&gt;
&lt;td&gt;Seq2seq&lt;/td&gt;
&lt;td&gt;25 MB&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hermione Granger&lt;/td&gt;
&lt;td&gt;Forecast number of owls needed to deliver school letters&lt;/td&gt;
&lt;td&gt;ARIMA&lt;/td&gt;
&lt;td&gt;2 MB&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hermione Granger&lt;/td&gt;
&lt;td&gt;Access to Restricted Section in Hogwarts Library&lt;/td&gt;
&lt;td&gt;CNN&lt;/td&gt;
&lt;td&gt;50 MB&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;While choosing the best instance types for models provided in the inventory, I considered the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Desired speed of training&lt;/li&gt;
&lt;li&gt;Data volume&lt;/li&gt;
&lt;li&gt;GPU usage capability&lt;/li&gt;
&lt;li&gt;Instance pricing and fast launch capability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I decided to use ml.t3.medium for the XGBoost and ARIMA models, and ml.g4dn.xlarge for the CNN and Seq2seq models. With the Free Tier, you get 250 hours of ml.t3.medium on Studio notebooks per month for the first 2 months.&lt;/p&gt;
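&lt;p&gt;The mapping can be captured in a small helper; the two instance types are the choices made above, while the function itself is only an illustration:&lt;/p&gt;

```python
# Illustrative instance picker: GPU-capable models go to ml.g4dn.xlarge,
# the rest to ml.t3.medium (the choices made in this article).
def pick_instance(needs_gpu: bool) -> str:
    return "ml.g4dn.xlarge" if needs_gpu else "ml.t3.medium"

# Inventory rows from the table above: (model, GPU usage)
inventory = [("XGBoost", False), ("Seq2seq", True), ("ARIMA", False), ("CNN", True)]

for model, gpu in inventory:
    print(f"{model}: {pick_instance(gpu)}")
```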

&lt;p&gt;&lt;strong&gt;Step 2. Set up VPC with Gateway endpoint and update routing table&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS offers an option to create a VPC with public/private subnets and an S3 Gateway Endpoint automatically. After that, you need to create the following Interface Endpoints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SageMaker API: &lt;code&gt;com.amazonaws.&amp;lt;region&amp;gt;.sagemaker.api&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;SageMaker Runtime: &lt;code&gt;com.amazonaws.&amp;lt;region&amp;gt;.sagemaker.runtime&lt;/code&gt; — required to run Studio notebooks and to train and host models.&lt;/li&gt;
&lt;li&gt;Service Catalog: &lt;code&gt;com.amazonaws.&amp;lt;region&amp;gt;.servicecatalog&lt;/code&gt; — required to use SageMaker Projects.&lt;/li&gt;
&lt;li&gt;CodeCommit: &lt;code&gt;com.amazonaws.&amp;lt;region&amp;gt;.git-codecommit&lt;/code&gt; — required to connect to CodeCommit.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ Replace &lt;code&gt;&amp;lt;region&amp;gt;&lt;/code&gt; with your region. All resources including SSO account, VPC and SageMaker domain should be in the same region. This is important because AWS Identity Center (SSO) is not available in all regions.&lt;/p&gt;
&lt;/blockquote&gt;
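&lt;p&gt;Since the endpoint names differ only by region, a small helper (illustrative) can generate the full list for your region:&lt;/p&gt;

```python
# Build the interface endpoint service names listed above for a given region.
REQUIRED_SERVICES = ("sagemaker.api", "sagemaker.runtime", "servicecatalog", "git-codecommit")

def endpoint_service_names(region: str) -> list:
    return [f"com.amazonaws.{region}.{service}" for service in REQUIRED_SERVICES]

print(endpoint_service_names("eu-central-1"))
```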

&lt;p&gt;Alternatively, you can use the &lt;a href="https://github.com/ChildishGirl/sagemaker-migration" rel="noopener noreferrer"&gt;provided CDK stack&lt;/a&gt; to deploy the network resources automatically. Once completed, you will have the following network:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb37ieb8r6rn5ejnrq8nl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb37ieb8r6rn5ejnrq8nl.png" alt="Network diagram" width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3. Set up SSO for users in IAM Identity Center&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The next step is to create new users, or add existing ones, to a group in IAM Identity Center.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd7vu7z4rxh950khuw6e3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd7vu7z4rxh950khuw6e3.png" alt="Data science users group" width="800" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Later, you will be able to add this group to SageMaker users. You can also add each user individually without creating a group, but I prefer groups: they let you easily manage users from one place.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4. Set up SageMaker Studio without public internet access&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Start the SageMaker configuration with the Standard setup of a SageMaker Domain. Each AWS account is limited to one domain per region.&lt;/p&gt;

&lt;p&gt;The setup proposes two options for authentication, and I chose AWS Single Sign-On because our identity store is already there. Note that SageMaker with SSO authentication must be launched in the same region as SSO — in my case, both are in eu-central-1.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xbdpgolt9tlrw4xuov4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xbdpgolt9tlrw4xuov4.png" alt="SageMaker SSO configuration" width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After completing the SageMaker Domain configuration, you will see SageMaker in the Applications tab, where you can assign users to it.&lt;/p&gt;

&lt;p&gt;The next step is to assign a default execution role. You can use the default role AWS creates for you, with the &lt;em&gt;AmazonSageMakerFullAccess&lt;/em&gt; policy attached.&lt;/p&gt;

&lt;p&gt;After that, you should configure whether SageMaker may access the public internet. In my case, I disabled internet access with VPC only mode and used the previously created VPC and private subnet to host SageMaker.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7eol2gsa2si1hdoyf88s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7eol2gsa2si1hdoyf88s.png" alt="SageMaker Network configuration" width="800" height="566"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also configure data encryption, the JupyterLab version, and notebook sharing options. Then add the data science user group by navigating to the Control panel and opening the Manage your user groups section.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5. Link AWS CodeCommit&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create a repository in AWS CodeCommit named &lt;code&gt;sagemaker-repository&lt;/code&gt; and copy its HTTPS clone URL. Then, in the left sidebar of SageMaker Studio, choose the Git icon (a diamond with two branches) and select Clone a Repository. After that, you will be able to push changes to existing or new notebooks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6. Set up automatic shutdown of idle resources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;People sometimes forget to shut down instances after finishing their work, so resources run all night and an unexpectedly large bill arrives in the morning. A JupyterLab extension can help you avoid this: it automatically shuts down KernelGateway apps, kernels, and image terminals in SageMaker Studio when they have been idle for a configurable period of time. Setup instructions and scripts can be found in the &lt;a href="https://github.com/aws-samples/sagemaker-studio-auto-shutdown-extension" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;. After installation, you will see a new tab on the left of the SageMaker Studio interface.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72i0q3oof5pb8lw8bhwj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72i0q3oof5pb8lw8bhwj.png" alt="Idle resources shut down extension" width="800" height="208"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 7. Add experiments and model registry&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;SageMaker Model Registry is a great tool for keeping track of the models developed by the team. To enable it, go to the SageMaker resources tab (the bottom one in the left panel) and choose Model Registry from the drop-down list. Create a model package group for each model you want versioned.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpr07e1b9f4hg4s1w2qkl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpr07e1b9f4hg4s1w2qkl.png" alt="Model group creation" width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After that, data scientists can register models using one of the methods described in the &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry-version.html#model-registry-version-api" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;
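&lt;p&gt;Creating the group itself can also be done via the &lt;code&gt;create_model_package_group&lt;/code&gt; API. A sketch with boto3, where the group name and description are made up for illustration; the actual call needs AWS credentials, so it is kept inside a function:&lt;/p&gt;

```python
def model_package_group_request(name: str, description: str) -> dict:
    # Pure helper: build the request for create_model_package_group.
    return {"ModelPackageGroupName": name, "ModelPackageGroupDescription": description}

def create_group(name: str, description: str):
    import boto3  # requires AWS credentials and network access to run
    sagemaker_client = boto3.client("sagemaker")
    return sagemaker_client.create_model_package_group(
        **model_package_group_request(name, description)
    )

print(model_package_group_request("sorting-hat", "XGBoost model for the Sorting Hat algorithm"))
```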

&lt;p&gt;SageMaker Experiments is a feature that lets you organize and compare your machine learning experiments. While Model Registry tracks the model versions chosen as best, Experiments lets data scientists track every training run and compare model performance to pick the best one. You can add experiments using the instructions in the &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/experiments-create.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 8. Migrate notebooks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Amazon SageMaker provides XGBoost as a built-in algorithm, and the data science team decided to use it and re-train the model. Data scientists just need to call the built-in version and provide the path to the data on S3; a more detailed description can be found in the &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers-prebuilt.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;. An example notebook can be found &lt;a href="https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/xgboost_mnist/xgboost_mnist.ipynb" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
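&lt;p&gt;A sketch of what such a notebook cell can look like with the SageMaker Python SDK. The bucket paths, role, instance type, and hyperparameter values are illustrative placeholders, and the training call needs AWS credentials, so it is wrapped in a function:&lt;/p&gt;

```python
# Hyperparameters for the built-in XGBoost algorithm (illustrative values).
HYPERPARAMETERS = {"objective": "multi:softmax", "num_class": "4", "num_round": "50"}

def train_builtin_xgboost(role_arn: str, region: str, s3_train: str, s3_output: str):
    # Requires the sagemaker SDK and AWS credentials; shown as a sketch only.
    import sagemaker
    from sagemaker.estimator import Estimator

    # Resolve the region-specific image URI for the built-in algorithm.
    image_uri = sagemaker.image_uris.retrieve("xgboost", region, version="1.5-1")
    estimator = Estimator(
        image_uri=image_uri,
        role=role_arn,
        instance_count=1,
        instance_type="ml.m5.large",  # training instance (illustrative choice)
        output_path=s3_output,
        hyperparameters=HYPERPARAMETERS,
    )
    estimator.fit({"train": s3_train})  # the "train" channel points at the S3 prefix
    return estimator
```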

&lt;p&gt;Seq2seq notebook migration is similar to the previous case — SageMaker has a built-in Seq2seq algorithm, so you just need to reference the right image URI and create a SageMaker Estimator with it. An example SageMaker notebook using the Seq2seq algorithm can be found &lt;a href="https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/seq2seq_translation_en-de/SageMaker-Seq2Seq-Translation-English-German.ipynb" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The ARIMA case is slightly more difficult because the data science team doesn't want to use the DeepAR algorithm proposed by SageMaker, so we need a custom algorithm container: build a Docker image, publish it to ECR, and then use it from SageMaker. You can find &lt;a href="https://sagemaker-examples.readthedocs.io/en/latest/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.html" rel="noopener noreferrer"&gt;detailed instructions&lt;/a&gt; and an &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html" rel="noopener noreferrer"&gt;example notebook&lt;/a&gt; in the documentation.&lt;/p&gt;

&lt;p&gt;For the CNN built with TensorFlow, we can use SageMaker script mode to bring our own model; an example notebook can be found &lt;a href="https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-python-sdk/tensorflow_script_mode_training_and_serving/tensorflow_script_mode_training_and_serving.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;. Alternatively, we can train it on a local machine — an example notebook for that approach is &lt;a href="https://github.com/aws/amazon-sagemaker-examples/blob/main/advanced_functionality/tensorflow_iris_byom/tensorflow_BYOM_iris.ipynb" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;More example notebooks for different models can be found in the Amazon SageMaker examples &lt;a href="https://github.com/aws/amazon-sagemaker-examples" rel="noopener noreferrer"&gt;repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 9. Ensure that data science team can use new environment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The final step is to document the process so that new users can onboard to the data science working environment. The documentation should include the following sections:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to add a new model (feature) to registry&lt;/li&gt;
&lt;li&gt;How to add a new experiment to SageMaker&lt;/li&gt;
&lt;li&gt;How to choose an instance for the model&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;SageMaker is a great tool for data scientists: it lets you label data, track experiments, and train, register, and deploy models. It automates and standardizes MLOps practices and provides many helpful features.&lt;/p&gt;

&lt;p&gt;CDK resources for network creation and a cost breakdown can be found in the &lt;a href="https://github.com/ChildishGirl/sagemaker-migration" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;Thank you for reading to the end. I hope it was helpful; please let me know in the comments if you spot any mistakes.&lt;/p&gt;

</description>
      <category>python</category>
      <category>aws</category>
      <category>cloud</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Create Serverless Data Pipeline Using AWS CDK (Python)</title>
      <dc:creator>Anna Pastushko</dc:creator>
      <pubDate>Fri, 08 Jul 2022 09:45:06 +0000</pubDate>
      <link>https://dev.to/annpastushko/create-serverless-data-pipeline-using-aws-cdk-python-5cg2</link>
      <guid>https://dev.to/annpastushko/create-serverless-data-pipeline-using-aws-cdk-python-5cg2</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjon7c0wtpuz3y0qy4gno.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjon7c0wtpuz3y0qy4gno.png" alt="Architecture diagram" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Context
&lt;/h2&gt;

&lt;p&gt;The data science team is about to start a research study, and they requested a solution on the AWS cloud. Here is what I have to offer them:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Main process (Data Processing):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Device&lt;/strong&gt; uploads a &lt;em&gt;.mat&lt;/em&gt; file with ECG data to an S3 bucket (&lt;strong&gt;Raw&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;The upload triggers an event, which is sent to the &lt;strong&gt;SQS&lt;/strong&gt; queue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda&lt;/strong&gt; polls the SQS queue (event source mapping invocation) and starts processing the event. Lambda’s runtime is a Python Docker container because the libraries’ total size exceeds the 250 MB layer size limit. If any error occurs during processing, you’ll get a notification in &lt;strong&gt;Slack&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Once processing completes, the data is saved in Parquet format to an S3 bucket (&lt;strong&gt;Processed&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;To enable data scientists to query the data, a &lt;strong&gt;Glue Crawler&lt;/strong&gt; job creates a schema in the &lt;strong&gt;Data Catalog&lt;/strong&gt;. Then, &lt;strong&gt;Athena&lt;/strong&gt; is used to query the Processed bucket.&lt;/li&gt;
&lt;/ul&gt;
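&lt;p&gt;To illustrate the last step, here is a minimal sketch of the kind of preview query a data scientist might run in Athena once the crawler has populated the Data Catalog. The database and table names are my assumptions, not taken from the stack:&lt;/p&gt;

```python
# Build a simple preview query for the processed ECG data.
# Database/table names are hypothetical; use whatever the Glue Crawler created.
def build_ecg_preview_query(database: str, table: str, limit: int = 10) -> str:
    return f'SELECT * FROM "{database}"."{table}" LIMIT {limit}'

query = build_ecg_preview_query("ecg_db", "processed_ecg_data")
```

The resulting string can then be run in the Athena console or submitted programmatically.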

&lt;p&gt;&lt;strong&gt;Secondary process (CI/CD):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When developers want to change the processing job logic, they just prepare the changes and commit them to the &lt;strong&gt;CodeCommit&lt;/strong&gt; repository. Everything else is handled automatically by the CI/CD process.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;CodeBuild&lt;/strong&gt; service converts the CDK code into a CloudFormation template and deploys it to your account; in other words, it creates all infrastructure components automatically. Once deployment completes, the group of deployed resources (stack) is visible in the &lt;strong&gt;CloudFormation&lt;/strong&gt; web UI. To simplify these two steps, and to let the CI/CD process update itself as well, the &lt;strong&gt;CodePipeline&lt;/strong&gt; abstraction is used. You’ll also get Slack notifications about the progress.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preparation
&lt;/h2&gt;

&lt;p&gt;To prepare your local environment for this project, you should follow the steps described below:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html" rel="noopener noreferrer"&gt;Install AWS CLI&lt;/a&gt; and set up credentials.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://nodejs.org/en/download/" rel="noopener noreferrer"&gt;Install NodeJS&lt;/a&gt; to be able to use CDK.&lt;/li&gt;
&lt;li&gt;Install the CDK using the command &lt;code&gt;sudo npm install -g aws-cdk&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Create a new directory for your project and change your current working directory to it.&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;cdk init --language python&lt;/code&gt; to initialize the CDK project.&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;cdk bootstrap&lt;/code&gt; to bootstrap AWS account with CDK resources.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.docker.com/engine/install/" rel="noopener noreferrer"&gt;Install Docker&lt;/a&gt; to run Docker container inside Lambda.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;

&lt;p&gt;This is the final look of the project. I will provide a step-by-step guide so that you’ll eventually understand each component in it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DataPipeline
├── assets
│├── lambda
││├── dockerfile
││└── processing.py
├──cdk.out
│└── ...
├── stacks
│├── __init__.py
│├── data_pipeline_stack.py
│├── cicd_stack.py
│└──data_pipeline_stage.py
├── app.py
├── cdk.json
└── requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our starting point is the &lt;code&gt;stacks&lt;/code&gt; directory. It contains the mandatory empty &lt;code&gt;__init__.py&lt;/code&gt; file that defines a Python package, plus three other files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;data_pipeline_stack.py&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;cicd_stack.py&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;data_pipeline_stage.py&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;First, we open &lt;code&gt;data_pipeline_stack.py&lt;/code&gt; and import all the libraries and constructs needed for further development. We also define a class that inherits from &lt;code&gt;cdk.Stack&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_cdk&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;cdk&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_cdk.aws_s3&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;_s3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_cdk.aws_sqs&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;_sqs&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_cdk.aws_iam&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;_iam&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_cdk.aws_ecs&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;_ecs&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_cdk.aws_ecr&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;_ecr&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_cdk.aws_lambda&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;_lambda&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_cdk.aws_s3_notifications&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;_s3n&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aws_cdk.aws_ecr_assets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DockerImageAsset&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aws_cdk.aws_lambda_event_sources&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SqsEventSource&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DataPipelineStack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Stack&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;construct_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;construct_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, we use the SQS &lt;a href="https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk.aws_sqs/Queue.html" rel="noopener noreferrer"&gt;Queue&lt;/a&gt; construct to connect the S3 bucket and Lambda. The arguments are simple: the stack element id (&lt;code&gt;‘raw_data_queue’&lt;/code&gt;), the queue name (&lt;code&gt;‘data_pipeline_queue’&lt;/code&gt;), and the time a message stays invisible after Lambda takes it for processing (&lt;code&gt;cdk.Duration.seconds(200)&lt;/code&gt;). Note that the visibility timeout depends on your processing time: if processing takes 30 seconds, it is better to set it to 60 seconds. In this case, I set it to 200 seconds because processing takes ~100 seconds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data_queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_sqs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;raw_data_queue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;queue_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data_pipeline_queue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;visibility_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seconds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
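&lt;p&gt;The sizing rule above (give the queue roughly double the processing time) can be written down as a tiny helper. The function is my own illustration, not part of the stack:&lt;/p&gt;

```python
# Rule of thumb: set the SQS visibility timeout to about twice the expected
# processing time, so a slow run does not make the message visible again
# while Lambda is still working on it.
def recommended_visibility_timeout(processing_seconds: int) -> int:
    return processing_seconds * 2

recommended_visibility_timeout(30)   # 60, as in the example above
recommended_visibility_timeout(100)  # 200, the value used in this stack
```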



&lt;p&gt;Next, we create S3 buckets for the raw and processed data using the &lt;a href="https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk.aws_s3/Bucket.html?highlight=add_event_notification#bucket" rel="noopener noreferrer"&gt;Bucket&lt;/a&gt; construct. Since raw data is usually accessed only within the first few days after upload, we can add &lt;code&gt;lifecycle_rules&lt;/code&gt; to transition it from S3 Standard to S3 Glacier after 7 days and reduce storage costs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;raw_bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;raw_bucket&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;raw-ecg-data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;auto_delete_objects&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;removal_policy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RemovalPolicy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DESTROY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;lifecycle_rules&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LifecycleRule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                          &lt;span class="n"&gt;transitions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Transition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                          &lt;span class="n"&gt;storage_class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StorageClass&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GLACIER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                          &lt;span class="n"&gt;transition_after&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;days&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;))])])&lt;/span&gt;
&lt;span class="n"&gt;raw_bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_event_notification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EventType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OBJECT_CREATED&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                  &lt;span class="n"&gt;_s3n&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SqsDestination&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_queue&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;_s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;processed_bucket&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;processed-ecg-data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;auto_delete_objects&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;removal_policy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RemovalPolicy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DESTROY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We also need to connect the raw bucket to the SQS queue, defining a destination for the events the bucket generates. For that, we use the &lt;a href="https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk.aws_s3/Bucket.html?highlight=add_event_notification#aws_cdk.aws_s3.Bucket.add_event_notification" rel="noopener noreferrer"&gt;add_event_notification&lt;/a&gt; method with two arguments: the event type to be notified about (&lt;code&gt;_s3.EventType.OBJECT_CREATED&lt;/code&gt;) and the destination queue to notify (&lt;code&gt;_s3n.SqsDestination(data_queue)&lt;/code&gt;).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ When the stack is destroyed, the bucket and all the data inside it will be deleted. To change this behaviour, remove the &lt;code&gt;removal_policy&lt;/code&gt; and &lt;code&gt;auto_delete_objects&lt;/code&gt; arguments so they fall back to their defaults.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The next step is to create the Lambda function using the &lt;a href="https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk.aws_lambda/DockerImageFunction.html" rel="noopener noreferrer"&gt;DockerImageFunction&lt;/a&gt; construct. The arguments I define are shown in the code below; they are largely self-explanatory and follow the pattern of the previous examples. If anything is unclear, please refer to the documentation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ The only thing I should highlight is the value of the &lt;code&gt;timeout&lt;/code&gt; parameter in Lambda: it should always be less than the &lt;code&gt;visibility_timeout&lt;/code&gt; parameter in the Queue (&lt;code&gt;180&lt;/code&gt; vs &lt;code&gt;200&lt;/code&gt;).&lt;/p&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;processing_lambda&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_lambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DockerImageFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
                           &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;processing_lambda&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                           &lt;span class="n"&gt;function_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data_processing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                           &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Starts Docker Container in Lambda to process data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                                              
                           &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_lambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DockerImageCode&lt;/span&gt;
                           &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_image_asset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;assets/lambda/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                                                                                                  
                                              &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dockerfile&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;                                                
                           &lt;span class="n"&gt;architecture&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_lambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Architecture&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X86_64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                           &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;SqsEventSource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_queue&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
                           &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seconds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;180&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                           &lt;span class="n"&gt;retry_attempts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                           &lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;QueueUrl&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
                                        &lt;span class="n"&gt;data_queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;queue_url&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;processing_lambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;attach_inline_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                  &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;access_fer_lambda&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="n"&gt;statements&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PolicyStatement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; 
                              &lt;span class="n"&gt;effect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Effect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ALLOW&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                              &lt;span class="n"&gt;actions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3:*&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                              &lt;span class="n"&gt;resources&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, we attach a policy to the automatically created Lambda role using the &lt;code&gt;attach_inline_policy&lt;/code&gt; method, so the function can process files from S3. You can tune the actions/resources parameters to grant Lambda more granular access to S3.&lt;/p&gt;
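&lt;p&gt;As one hypothetical way to narrow the broad &lt;code&gt;s3:*&lt;/code&gt; permission, the access could be expressed as a plain IAM policy document that only allows reads from the raw bucket and writes to the processed bucket. The helper below is my own sketch, not part of the stack:&lt;/p&gt;

```python
# Sketch of a narrower IAM policy document for the Lambda role.
# Bucket names match the ones created earlier in this stack.
def build_lambda_s3_policy(raw_bucket: str, processed_bucket: str) -> dict:
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # read raw .mat files
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{raw_bucket}/*"],
            },
            {   # write processed .parquet files
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{processed_bucket}/*"],
            },
        ],
    }

policy = build_lambda_s3_policy("raw-ecg-data", "processed-ecg-data")
```

In CDK terms, each statement would map to an `_iam.PolicyStatement` with the same actions and resources.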

&lt;p&gt;Now we move to &lt;code&gt;assets&lt;/code&gt; directory.&lt;/p&gt;

&lt;p&gt;There we need to create &lt;code&gt;dockerfile&lt;/code&gt; and &lt;code&gt;processing.py&lt;/code&gt; with the data transformation logic, which is fairly simple. First, we parse the event from SQS to get information about the file and the bucket, then parse the &lt;em&gt;.mat&lt;/em&gt; file with ECG data, clean it, and save it in &lt;em&gt;.parquet&lt;/em&gt; format to the Processed bucket. The script also includes logging and Slack error messages. At the end, we delete the message from the queue so the file is not processed again.&lt;/p&gt;
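&lt;p&gt;The Slack error notification mentioned above can be sketched as a small helper. The message format and function names are my assumptions, and the webhook URL is whatever you configure for your workspace; this version uses the standard library rather than the script’s &lt;code&gt;urllib3&lt;/code&gt;, so it stays self-contained:&lt;/p&gt;

```python
import json
from urllib import request

def build_slack_payload(file_key: str, error: str) -> str:
    # Minimal incoming-webhook body; adjust the text to your channel conventions.
    return json.dumps({"text": f"Processing failed for {file_key}: {error}"})

def send_slack_error(webhook_url: str, file_key: str, error: str) -> None:
    # POST the notification to the Slack incoming webhook.
    req = request.Request(
        webhook_url,
        data=build_slack_payload(file_key, error).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    request.urlopen(req)
```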

&lt;p&gt;For your pipeline, you can change the processing logic and replace &lt;code&gt;_url&lt;/code&gt; with your own Slack webhook URL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urllib3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scipy.io&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;awswrangler&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;wr&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;urllib.parse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;unquote_plus&lt;/span&gt;

&lt;span class="n"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sqs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sqs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Parse event and get file name
&lt;/span&gt;    &lt;span class="n"&gt;message_rh&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;receiptHandle&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;unquote_plus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;raw_bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bucket&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;processed_bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;processed-ecg-data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Read mat file
&lt;/span&gt;        &lt;span class="n"&gt;_obj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_object&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;raw_bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;raw_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scipy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loadmat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;BytesIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_obj&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="c1"&gt;# Clean data
&lt;/span&gt;        &lt;span class="n"&gt;_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;val&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ECG_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pad&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ECG_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ECG_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ECG_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ECG_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ECG_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

        &lt;span class="c1"&gt;# Save parquet file to verified bucket
&lt;/span&gt;        &lt;span class="n"&gt;file_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;stem&lt;/span&gt;
        &lt;span class="n"&gt;wr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;processed_bucket&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.parquet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;File &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; was successfully processed.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Send notification about failure to Slack channel
&lt;/span&gt;        &lt;span class="n"&gt;_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://hooks.slack.com/YOUR_HOOK&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="n"&gt;_msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Processing of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; was unsuccessful.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;http&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;urllib3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PoolManager&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_msg&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="c1"&gt;# Write message to logs
&lt;/span&gt;        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Processing of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; was unsuccessful.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc_info&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Delete processed message from SQS
&lt;/span&gt;    &lt;span class="n"&gt;sqs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;QueueUrl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;QueueUrl&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;ReceiptHandle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;message_rh&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
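As a side note, the cleaning step above forward-fills gaps and then applies plain min-max scaling. Stripped of pandas, the scaling can be sketched like this (a toy illustration, not part of the Lambda code):

```python
def min_max_scale(values):
    """Rescale values linearly into the [0, 1] range (min-max normalisation)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```

This is exactly what the `(_df - min) / (max - min)` expression does to the `ECG_data` column.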



&lt;p&gt;Let’s quickly go through the logic of the Dockerfile: first, we pull the base image for Lambda from the AWS ECR public repository, then install all Python libraries, copy our &lt;code&gt;processing.py&lt;/code&gt; script to the container, and run the command that launches the handler function from the script.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ecr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aws&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;3.8&lt;/span&gt;

&lt;span class="n"&gt;RUN&lt;/span&gt; &lt;span class="n"&gt;pip3&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; \
   &lt;span class="n"&gt;pip3&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; \
   &lt;span class="n"&gt;pip3&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; \
   &lt;span class="n"&gt;pip3&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;scipy&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; \
   &lt;span class="n"&gt;pip3&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;awswrangler&lt;/span&gt;

&lt;span class="n"&gt;WORKDIR&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;

&lt;span class="n"&gt;COPY&lt;/span&gt; &lt;span class="n"&gt;processing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;LAMBDA_TASK_ROOT&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;CMD&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;processing.handler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;⚠️ Do not forget to add the libraries you use in &lt;code&gt;processing.py&lt;/code&gt; to the &lt;code&gt;Dockerfile&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At this stage, we have finished creating the Data pipeline stack and can move on to developing the CI/CD stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  CI/CD stack
&lt;/h2&gt;

&lt;p&gt;For the CI/CD process we will use the CodePipeline service, which makes deployment easier. Each time we change the Data pipeline or CI/CD stack via a CodeCommit push, CodePipeline automatically re-deploys both stacks. In short, the application stack is added to a CodePipeline stage, and that stage is added to the CodePipeline; the app is then synthesised for the CI/CD stack, not the application stack. You can find a more detailed description of the connection between the files and the logic behind the CodePipeline construct below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkm7odrvecloe8aqs9dvm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkm7odrvecloe8aqs9dvm.png" alt="Steps for CodePipeline stack creation" width="800" height="528"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, we open &lt;code&gt;cicd_stack.py&lt;/code&gt; and import all the libraries and constructs we will use. Later, we will create the CodeCommit repository manually, but for now we only need a reference to it, so we can add it as the source for CodePipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_cdk&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;cdk&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_cdk.aws_iam&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;_iam&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_cdk.aws_chatbot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;_chatbot&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_cdk.aws_codecommit&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;_ccommit&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;stacks.data_pipeline_stage&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_cdk.aws_codestarnotifications&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;_notifications&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aws_cdk.pipelines&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CodePipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CodePipelineSource&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ShellStep&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CICDStack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Stack&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;construct_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;construct_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;pipeline_repo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_ccommit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Repository&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_repository_arn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data_pipeline_repository&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;repository_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data_pipeline_repository&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We use the &lt;a href="https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk.pipelines/CodePipeline.html" rel="noopener noreferrer"&gt;CodePipeline&lt;/a&gt; construct to create the CI/CD process. The &lt;code&gt;self_mutation&lt;/code&gt; parameter is set to &lt;code&gt;True&lt;/code&gt; to allow the pipeline to update itself (this is its default value). The &lt;code&gt;docker_enabled_for_synth&lt;/code&gt; parameter should be set to &lt;code&gt;True&lt;/code&gt; if we use Docker in our application stack. After that, we add a stage with the application deployment and call &lt;code&gt;build_pipeline()&lt;/code&gt; to construct the pipeline. The latter is a necessary step for setting up Slack notifications later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CodePipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cicd_pipeline&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;pipeline_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cicd_pipeline&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;docker_enabled_for_synth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;self_mutation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;synth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ShellStep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Synth&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CodePipelineSource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;code_commit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                              &lt;span class="n"&gt;pipeline_repo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;master&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                              &lt;span class="n"&gt;commands&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;npm install -g aws-cdk&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;python -m pip install -r 
                                         requirements.txt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cdk synth&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_stage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;DataPipelineStage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DataPipelineDeploy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;cicd_pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build_pipeline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next step is to configure Slack notifications for CodePipeline, so developers can monitor deployments. For that we use the &lt;a href="https://docs.aws.amazon.com/cdk/api/v1/python/aws_cdk.aws_chatbot/SlackChannelConfiguration.html" rel="noopener noreferrer"&gt;SlackChannelConfiguration&lt;/a&gt; construct. We can get the value for &lt;code&gt;slack_channel_id&lt;/code&gt; by right-clicking the channel name and copying the last 9 characters of the URL. To get the &lt;code&gt;slack_workspace_id&lt;/code&gt; parameter value, use the &lt;a href="https://docs.aws.amazon.com/chatbot/latest/adminguide/getting-started.html" rel="noopener noreferrer"&gt;AWS Chatbot Guide&lt;/a&gt;. To define the types of notifications we want to receive, we use the &lt;a href="https://docs.aws.amazon.com/cdk/api/v1/python/aws_cdk.aws_codestarnotifications/NotificationRule.html" rel="noopener noreferrer"&gt;NotificationRule&lt;/a&gt; construct. If you want to define the events for notifications more granularly, see &lt;a href="https://docs.aws.amazon.com/dtconsole/latest/userguide/concepts.html#events-ref-pipeline" rel="noopener noreferrer"&gt;Events for notification rules&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;
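For instance, given a hypothetical channel URL, the channel ID is simply the last path segment (the URL below is made up for illustration):

```python
# Hypothetical Slack channel URL; the ID is the last path segment.
url = 'https://myworkspace.slack.com/archives/C0ABCD123'
channel_id = url.rsplit('/', 1)[-1]
print(channel_id)  # C0ABCD123
```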

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;slack&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_chatbot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SlackChannelConfiguration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MySlackChannel&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                                           
                 &lt;span class="n"&gt;slack_channel_configuration_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;#cicd_events&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                                           
                 &lt;span class="n"&gt;slack_workspace_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;YOUR_ID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                                          
                 &lt;span class="n"&gt;slack_channel_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;YOUR_ID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;slack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;attach_inline_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;slack_policy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;statements&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PolicyStatement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;effect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Effect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ALLOW&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                                       &lt;span class="n"&gt;actions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;chatbot:*&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                       &lt;span class="n"&gt;resources&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])]))&lt;/span&gt;
&lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_notifications&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;NotificationRule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;NotificationRule&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cicd_pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;detail_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_notifications&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DetailType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BASIC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;codepipeline-pipeline-pipeline-execution-started&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;codepipeline-pipeline-pipeline-execution-succeeded&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;codepipeline-pipeline-pipeline-execution-failed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
       &lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;slack&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;ℹ️ The &lt;a href="https://docs.aws.amazon.com/cdk/api/v1/python/aws_cdk.pipelines/CodePipeline.html#aws_cdk.pipelines.CodePipeline.pipeline" rel="noopener noreferrer"&gt;.pipeline&lt;/a&gt; property refers to the CodePipeline pipeline that deploys the CDK app. It is available only after the pipeline has been constructed with the &lt;a href="https://docs.aws.amazon.com/cdk/api/v1/python/aws_cdk.pipelines/CodePipeline.html#aws_cdk.pipelines.CodePipeline.build_pipeline" rel="noopener noreferrer"&gt;build_pipeline()&lt;/a&gt; method. For the &lt;code&gt;source&lt;/code&gt; argument we should pass the pipeline object, not the construct.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After defining the pipeline, we add a stage for the Data pipeline deployment. To keep the project cleaner, we define this stage in a separate file, using the &lt;a href="https://docs.aws.amazon.com/cdk/api/v1/python/aws_cdk.core/Stage.html" rel="noopener noreferrer"&gt;cdk.Stage&lt;/a&gt; parent class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_cdk&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;cdk&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;constructs&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Construct&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;stacks.data_pipeline_stack&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DataPipelineStage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Stage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;construct_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;construct_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="nc"&gt;DataPipelineStack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DataPipelineStack&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For those of you who use &lt;strong&gt;CDKv1&lt;/strong&gt;, an additional step is to modify the &lt;code&gt;cdk.json&lt;/code&gt; configuration file: add the following expression to the context.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"context": {"@aws-cdk/core:newStyleStackSynthesis": true}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
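If you prefer to patch an existing cdk.json programmatically, the standard json module is enough. A minimal sketch (the enable_new_synthesis helper is hypothetical, not part of the article's project):

```python
import json

def enable_new_synthesis(cfg: dict) -> dict:
    """Add the CDKv1 feature flag for new-style stack synthesis to the context."""
    cfg.setdefault('context', {})['@aws-cdk/core:newStyleStackSynthesis'] = True
    return cfg

# Example: patch a minimal cdk.json-style dict and print the result.
cfg = enable_new_synthesis({'app': 'python3 app.py'})
print(json.dumps(cfg))
```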



&lt;p&gt;At this point, we have created all constructs and files for the Data pipeline stack. The only thing left is to create &lt;code&gt;app.py&lt;/code&gt; with the final steps. We import all the constructs we created in &lt;code&gt;cicd_stack.py&lt;/code&gt; and create tags for all stack resources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;stacks.cicd_stack&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;App&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nc"&gt;CICDStack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CodePipelineStack,
          env=cdk.Environment(account=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="n"&gt;REPLACE&lt;/span&gt; &lt;span class="n"&gt;WITH_YOUR_ACCOUNT&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,
                              region=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="n"&gt;YOUR_REGION&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;))
cdk.Tags.of(app).add(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="n"&gt;prod&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)
cdk.Tags.of(app).add(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="n"&gt;creator&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="n"&gt;anna&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;pastushko&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)
cdk.Tags.of(app).add(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="n"&gt;ml&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;team&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)
app.synth()
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Congratulations, we have finished creating our stacks. Now we can create a CodeCommit repository called &lt;code&gt;data_pipeline_repository&lt;/code&gt; and push the files to it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4wgpzr160mml1vdh68j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4wgpzr160mml1vdh68j.png" alt="CodeCommit repository creation" width="800" height="646"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can manually add the same tags as we created in the Stack, so that all resources created for this task appear bound together in cost reports.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ Check limitations for CodeBuild in Service Quotas before deployment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Congratulations, now we can finally deploy the stack to AWS with the &lt;code&gt;cdk deploy&lt;/code&gt; command and watch all resources being set up automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Athena queries
&lt;/h2&gt;

&lt;p&gt;Let’s start with the Glue Crawler. To create it, go to the Data Catalog section of the Glue console and click Crawlers, then click the Add crawler button and go through all the steps. I added the same tags as for the other Data pipeline resources, so I can track them together.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjs48hewet469j6b584a6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjs48hewet469j6b584a6.png" alt="Glue Crawler creation" width="800" height="547"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Don’t change the crawler source type, add an S3 data store and specify the path to your bucket in Include path. After that, create a new role or use an existing one, and specify how often you want the crawler to run. Then create a database; in my case I created the &lt;code&gt;ecg_data&lt;/code&gt; database. After all steps are completed and the crawler is created, run it.&lt;/p&gt;

&lt;p&gt;That is all we need to query the &lt;code&gt;processed_ecg_data&lt;/code&gt; table with Athena. An example of a simple query is shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4kmcwiqr4cskqu4wlzxf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4kmcwiqr4cskqu4wlzxf.png" alt="Athena query" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Account cleanup
&lt;/h2&gt;

&lt;p&gt;In case you want to delete all resources created in your account during development, you should perform the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run the following command to delete all stack resources: &lt;code&gt;cdk destroy CodePipelineStack/DataPipelineDeploy/DataPipelineStack CodePipelineStack&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Delete the CodeCommit repository&lt;/li&gt;
&lt;li&gt;Clean the ECR repository and the S3 buckets created for Athena and CDK, because they can incur costs.&lt;/li&gt;
&lt;li&gt;Delete the Glue Crawler and the database with its tables.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ The &lt;code&gt;cdk destroy&lt;/code&gt; command will only destroy the CodePipeline (CI/CD) stack and the stacks that depend on it. Since the application stacks don't depend on the CodePipeline stack, they won't be destroyed. We need to destroy the Data pipeline stack separately; there is a &lt;a href="https://github.com/aws/aws-cdk/issues/10190" rel="noopener noreferrer"&gt;discussion on how to delete them both&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Deleting some resources manually is not very convenient, and there are several &lt;a href="https://github.com/aws/aws-cdk-rfcs/issues/64" rel="noopener noreferrer"&gt;discussions&lt;/a&gt; with AWS developers about fixing it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;CDK provides you with a toolkit for developing applications based on AWS services. It can be challenging at first, but your efforts will pay off in the end: you will be able to manage and transfer your whole application with one command.&lt;/p&gt;

&lt;p&gt;CDK resources and full code can be found in &lt;a href="https://github.com/ChildishGirl/serverless-data-pipeline" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;Thank you for reading till the end. I do hope it was helpful, please let me know if you spot any mistakes in the comments.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cdk</category>
      <category>serverless</category>
      <category>python</category>
    </item>
    <item>
      <title>Forecasting of periodic events with ML</title>
      <dc:creator>Anna Pastushko</dc:creator>
      <pubDate>Thu, 07 Jul 2022 11:51:33 +0000</pubDate>
      <link>https://dev.to/annpastushko/forecasting-of-periodic-events-with-ml-1ka5</link>
      <guid>https://dev.to/annpastushko/forecasting-of-periodic-events-with-ml-1ka5</guid>
      <description>&lt;p&gt;Periodic events forecasting is quite useful if you are, for example, the data aggregator. Data aggregators or data providers are organizations that collect statistical, financial or any other data from different sources, transform it and then offer it for further analysis and exploration (data as a service).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Data as a Service (DaaS)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is really important for such organizations to monitor release dates, so they can gather data as soon as it is released and plan capacity for the incoming volumes of data.&lt;br&gt;
Some authorities that publish data provide a schedule of future releases; others do not, or announce the schedule only one or two months ahead. In those cases you may want to build the publication schedule yourself and predict release dates.&lt;br&gt;
For the majority of statistical releases, you can find a pattern such as the day of the week or of the month. For example, statistics can be released&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;every last working day of the month,&lt;/li&gt;
&lt;li&gt;every third Tuesday of the month,&lt;/li&gt;
&lt;li&gt;every last second working day of the month, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this in mind, and given the history of past release dates, we want to predict the date, or range of dates, when the next data release might happen.&lt;/p&gt;
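&lt;p&gt;Patterns like these can be computed directly. A minimal sketch with only the Python standard library (the helper names are mine, not part of any release-calendar API):&lt;/p&gt;

```python
import calendar
from datetime import date, timedelta

def last_working_day(year, month):
    """Last Mon-Fri day of the month (public holidays are ignored)."""
    d = date(year, month, calendar.monthrange(year, month)[1])
    while d.weekday() > 4:  # 5 = Saturday, 6 = Sunday
        d -= timedelta(days=1)
    return d

def nth_weekday(year, month, weekday, n):
    """n-th occurrence of the given weekday (Mon=0) in the month."""
    first = date(year, month, 1)
    offset = (weekday - first.weekday()) % 7
    return first + timedelta(days=offset + 7 * (n - 1))

print(last_working_day(2020, 5))   # 2020-05-29, a Friday
print(nth_weekday(2020, 9, 1, 3))  # 2020-09-15, the third Tuesday
```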
&lt;h2&gt;
  
  
  Case Study
&lt;/h2&gt;

&lt;p&gt;As a case study let’s take the &lt;a href="https://www.investing.com/economic-calendar/cb-consumer-confidence-48" rel="noopener noreferrer"&gt;U.S. Conference Board (CB) Consumer Confidence Indicator&lt;/a&gt;. It is a leading indicator which measures the level of consumer confidence in economic activity. By using it, we can predict consumer spending, which plays a major role in overall economic activity.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://conference-board.org/eu/" rel="noopener noreferrer"&gt;official data provider&lt;/a&gt; does not provide the schedule for this series, but many data aggregators like &lt;a href="https://www.investing.com/economic-calendar/cb-consumer-confidence-48" rel="noopener noreferrer"&gt;Investing.com&lt;/a&gt; have been collecting the data for a while and series’ release history is available there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: we need to predict the date of the next release(s).&lt;/p&gt;
&lt;h2&gt;
  
  
  Data preparation
&lt;/h2&gt;

&lt;p&gt;We start with importing all packages for data manipulation, building machine learning models, and other data transformations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Data manipulation
import pandas as pd# Manipulation with dates
from datetime import date
from dateutil.relativedelta import relativedelta# Machine learning
import xgboost as xgb
from sklearn import metrics
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next step is to get the history of release dates. You may have a database with all the data and release history that you can use. To keep things simple and focus on predicting release dates, I will add the history to a DataFrame manually.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = pd.DataFrame({'Date': ['2021-01-26','2020-12-22',
                     '2020-11-24','2020-10-27','2020-09-29',
                     '2020-08-25','2020-07-28','2020-06-30',
                     '2020-05-26','2020-04-28','2020-03-31',
                     '2020-02-25','2020-01-28','2019-12-31',
                     '2019-11-26','2019-10-29','2019-09-24',
                     '2019-08-27','2019-07-30','2019-06-25',
                     '2019-05-28']})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We should also add a column with 0 and 1 values that specifies whether a release happened on a given date. For now, we only have the dates of releases, so we create a column filled with 1s.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data['Date'] = pd.to_datetime(data['Date'])
data['Release'] = 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, we need to add rows for the dates between releases to the DataFrame and fill the release column with zeros for them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;r = pd.date_range(start=data['Date'].min(), end=data['Date'].max())
data = data.set_index('Date').reindex(r).fillna(0.0)
       .rename_axis('Date').reset_index()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the dataset is ready for further manipulations.&lt;/p&gt;
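&lt;p&gt;To make the gap-filling step explicit, here is the same logic written with only the standard library; the &lt;code&gt;reindex&lt;/code&gt; call above does exactly this in one line:&lt;/p&gt;

```python
from datetime import date, timedelta

release_dates = {date(2019, 5, 28), date(2019, 6, 25)}  # two sample releases
start, end = min(release_dates), max(release_dates)

# One row per calendar day: 1 on release days, 0 otherwise
rows = [(start + timedelta(days=i),
         1 if (start + timedelta(days=i)) in release_dates else 0)
        for i in range((end - start).days + 1)]

print(len(rows))                      # 29 calendar days
print(sum(flag for _, flag in rows))  # 2 release days
```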

&lt;h2&gt;
  
  
  Feature engineering
&lt;/h2&gt;

&lt;p&gt;Prediction of the next release dates relies heavily on feature engineering, because we do not have any features besides the release date itself. Therefore, we will create the following features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;month&lt;/li&gt;
&lt;li&gt;calendar day of the month&lt;/li&gt;
&lt;li&gt;working day number&lt;/li&gt;
&lt;li&gt;day of the week&lt;/li&gt;
&lt;li&gt;week of month number&lt;/li&gt;
&lt;li&gt;monthly weekday occurrence (e.g. the second Wednesday of the month)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data['Month'] = data['Date'].dt.month
data['Day'] = data['Date'].dt.day
data['Workday_N'] = np.busday_count(
                    data['Date'].values.astype('datetime64[M]'),
                    data['Date'].values.astype('datetime64[D]'))
data['Week_day'] = data['Date'].dt.weekday
data['Week_of_month'] = (data['Date'].dt.day
                         - data['Date'].dt.weekday - 2) // 7 + 2
data['Weekday_order'] = (data['Date'].dt.day + 6) // 7
data = data.set_index('Date')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
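&lt;p&gt;A quick sanity check of the &lt;code&gt;Weekday_order&lt;/code&gt; formula above: for day &lt;code&gt;d&lt;/code&gt; of the month, &lt;code&gt;(d + 6) // 7&lt;/code&gt; gives which occurrence of its own weekday the date is.&lt;/p&gt;

```python
def weekday_order(day):
    # Days 1-7 are the first occurrence of their weekday,
    # days 8-14 the second, and so on.
    return (day + 6) // 7

print(weekday_order(1))   # 1
print(weekday_order(7))   # 1
print(weekday_order(8))   # 2
print(weekday_order(29))  # 5, e.g. 2020-09-29 was the fifth Tuesday
```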



&lt;h2&gt;
  
  
  Training Machine learning model
&lt;/h2&gt;

&lt;p&gt;First, we need to split our dataset into two parts: train and test. Don’t forget to set the shuffle argument to False, because our goal is to create a forecast based on past events.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x_train, x_test, y_train, y_test = train_test_split(data.drop(['Release'], axis=1), data['Release'],
                 test_size=0.3, random_state=1, shuffle=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In general, shuffling helps to reduce overfitting by varying which observations are used for training. But that is not our case: we always need the full history of publication events in the training set.&lt;/p&gt;
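&lt;p&gt;With &lt;code&gt;shuffle=False&lt;/code&gt; the split is just a chronological cut. The same split can be written by hand:&lt;/p&gt;

```python
data = list(range(100))             # stand-in for 100 daily observations
test_size = 0.3
cut = round(len(data) * (1 - test_size))

# Training set is strictly earlier than the test set
train, test = data[:cut], data[cut:]

print(len(train), len(test))        # 70 30
```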

&lt;p&gt;In order to choose the best prediction model, we will test the following models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;XGBoost&lt;/li&gt;
&lt;li&gt;K-nearest Neighbors (KNN)&lt;/li&gt;
&lt;li&gt;RandomForest&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  XGBoost
&lt;/h3&gt;

&lt;p&gt;We will use XGBoost with tree base learners and the grid search method to choose the best parameters. Grid search tries every combination of the given parameter values and chooses the best one based on cross-validation.&lt;/p&gt;

&lt;p&gt;A drawback of this approach is the long computation time.&lt;/p&gt;

&lt;p&gt;Alternatively, random search can be used. It samples parameter values randomly for a given number of iterations and then chooses the best model found.&lt;/p&gt;

&lt;p&gt;However, when you have a large number of parameters, random search covers a relatively small share of the combinations, which makes finding a truly optimal one almost impossible.&lt;/p&gt;
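&lt;p&gt;For the grid below, the number of candidate combinations is the product of the list lengths (and each candidate is fit once per CV fold); a random search would sample only a fixed number of them:&lt;/p&gt;

```python
from itertools import product

grid_param = {"learning_rate": [0.01, 0.1],
              "n_estimators": [100, 150, 200],
              "alpha": [0.1, 0.5, 1],
              "max_depth": [2, 3, 4]}

# Every combination grid search will evaluate
combinations = list(product(*grid_param.values()))
print(len(combinations))      # 54 candidates (2 * 3 * 3 * 3)
print(len(combinations) * 4)  # 216 model fits with cv=4
```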

&lt;p&gt;To use grid search you need to specify the list of possible values for each parameter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DM_train = xgb.DMatrix(data=x_train, label=y_train)
grid_param = {"learning_rate": [0.01, 0.1],
              "n_estimators": [100, 150, 200],
              "alpha": [0.1, 0.5, 1],
              "max_depth": [2, 3, 4]}
model = xgb.XGBRegressor()
grid_mse = GridSearchCV(estimator=model, param_grid=grid_param,
                       scoring="neg_mean_squared_error",
                       cv=4, verbose=1)
grid_mse.fit(x_train, y_train)
print("Best parameters found: ", grid_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(grid_mse.best_score_)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, the best parameters for our XGBoost model are: &lt;code&gt;alpha = 0.5, n_estimators = 200, max_depth = 4, learning_rate = 0.1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let’s train the model with obtained parameters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;xgb_model = xgb.XGBClassifier(objective ='reg:squarederror',
                            colsample_bytree = 1,
                            learning_rate = 0.1,
                            max_depth = 4,
                            alpha = 0.5,
                            n_estimators = 200)
xgb_model.fit(x_train, y_train)
xgb_prediction = xgb_model.predict(x_test)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  K-nearest Neighbors (KNN)
&lt;/h3&gt;

&lt;p&gt;The K-nearest neighbors model is meant to be used when you are trying to find similarities between observations. This is exactly our case, because we are trying to find patterns in past release dates.&lt;/p&gt;

&lt;p&gt;The KNN algorithm has fewer parameters to tune, so it is simpler for those who have not used it before.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;knn = KNeighborsClassifier(n_neighbors = 3, algorithm = 'auto',
                           weights = 'distance')
knn.fit(x_train, y_train)
knn_prediction = knn.predict(x_test)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Random Forest
&lt;/h3&gt;

&lt;p&gt;Tuning the basic parameters of a random forest usually doesn’t take much time: you simply iterate over the possible number of estimators and the maximum depth of the trees, and choose optimal values using the elbow method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;random_forest = RandomForestClassifier(n_estimators=50,
                                       max_depth=10, random_state=1)
random_forest.fit(x_train, y_train)
rf_prediction = random_forest.predict(x_test)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Comparing the results
&lt;/h2&gt;

&lt;p&gt;We will use a confusion matrix to evaluate the performance of the trained models. It lets us compare models side by side and understand whether the parameters should be tuned any further.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;xgb_matrix = metrics.confusion_matrix(xgb_prediction, y_test)
print(f"""
Confusion matrix for XGBoost model:
TN:{xgb_matrix[0][0]}    FN:{xgb_matrix[0][1]}
FP:{xgb_matrix[1][0]}    TP:{xgb_matrix[1][1]}""")knn_matrix = metrics.confusion_matrix(knn_prediction, y_test)
print(f"""
Confusion matrix for KNN model:
TN:{knn_matrix[0][0]}    FN:{knn_matrix[0][1]}
FP:{knn_matrix[1][0]}    TP:{knn_matrix[1][1]}""")rf_matrix = metrics.confusion_matrix(rf_prediction, y_test)
print(f"""
Confusion matrix for Random Forest model:
TN:{rf_matrix[0][0]}    FN:{rf_matrix[0][1]}
FP:{rf_matrix[1][0]}    TP:{rf_matrix[1][1]}""")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, both XGBoost and RandomForest show good performance. Both were able to catch the pattern and predict dates correctly in most cases. However, both models made a mistake with the December 2020 release, because it breaks the release pattern.&lt;/p&gt;

&lt;p&gt;KNN is less accurate than the previous two: it failed to predict three dates correctly and missed five releases, so we do not proceed with it. In general, KNN works better if the data is normalized, so you can try to tune it if you want.&lt;/p&gt;
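&lt;p&gt;If you do want to retry KNN, min-max scaling is a simple way to normalize the features first. A sketch with NumPy on toy data (&lt;code&gt;sklearn.preprocessing.MinMaxScaler&lt;/code&gt; does the same):&lt;/p&gt;

```python
import numpy as np

X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Scale each column to [0, 1] so no feature dominates the distance metric
col_min, col_max = X.min(axis=0), X.max(axis=0)
X_scaled = (X - col_min) / (col_max - col_min)

print(X_scaled[1])  # [0.5 0.5]
```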

&lt;p&gt;Of the remaining two, XGBoost is overcomplicated for our initial goal in terms of hyperparameter tuning, so RandomForest is our choice.&lt;/p&gt;

&lt;p&gt;Now we need to create a DataFrame with future dates and use the trained RandomForest model to predict releases one year ahead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x_predict = pd.DataFrame(pd.date_range(date.today(), (date.today() +
            relativedelta(years=1)),freq='d'), columns=['Date'])
x_predict['Month'] = x_predict['Date'].dt.month
x_predict['Day'] = x_predict['Date'].dt.day
x_predict['Workday_N'] = np.busday_count(
                x_predict['Date'].values.astype('datetime64[M]'),
                x_predict['Date'].values.astype('datetime64[D]'))
x_predict['Week_day'] = x_predict['Date'].dt.weekday
x_predict['Week_of_month'] = (x_predict['Date'].dt.day -
                              x_predict['Date'].dt.weekday - 2)//7+2
x_predict['Weekday_order'] = (x_predict['Date'].dt.day + 6) // 7
x_predict = x_predict.set_index('Date')prediction = xgb_model.predict(x_predict)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s it: we created a forecast of release dates for the U.S. CB Consumer Confidence series one year ahead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;If you want to predict future dates of periodic events, think about meaningful features to create: they should capture all the patterns you can find in the history. As you can see, we did not spend much time on model tuning; even simple models can give good results if you use the right features.&lt;/p&gt;

&lt;p&gt;Thank you for reading till the end. I do hope it was helpful, please let me know if you spot any mistakes in the comments.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
