<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Anna Pastushko</title>
    <description>The latest articles on DEV Community by Anna Pastushko (@annpastushko).</description>
    <link>https://dev.to/annpastushko</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F888500%2F9484df8c-94b1-4765-871d-1bfddae28a27.png</url>
      <title>DEV Community: Anna Pastushko</title>
      <link>https://dev.to/annpastushko</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/annpastushko"/>
    <language>en</language>
    <item>
      <title>Easy to follow Architecture Framework</title>
      <dc:creator>Anna Pastushko</dc:creator>
      <pubDate>Wed, 15 Apr 2026 14:16:06 +0000</pubDate>
      <link>https://dev.to/aws-builders/easy-to-follow-architecture-framework-5ckb</link>
      <guid>https://dev.to/aws-builders/easy-to-follow-architecture-framework-5ckb</guid>
      <description>&lt;h2&gt;
  
  
  Context
&lt;/h2&gt;

&lt;p&gt;When I first opened TOGAF, I got stuck on a basic problem — there's no clear starting point. Every section references another. Every process assumes a governance structure most teams don't have. The same problem exists with Zachman. By the time you understand these frameworks well enough to follow them, you've spent a lot of time. And even then, they're too heavy for most real projects.&lt;/p&gt;

&lt;p&gt;Over my years of work as an architect, I accumulated notes: what to do on a new project, what to check before moving to the next phase, where things typically go wrong. Templates I kept reusing. Checklists I open on day one. At some point I had enough material to structure it into something I could share. I called it D3.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is D3?
&lt;/h2&gt;

&lt;p&gt;D3 is a lightweight architecture delivery framework built around three phases:&lt;br&gt;
Discover → Design → Deploy&lt;/p&gt;

&lt;p&gt;Each phase has a clear goal, concrete steps, a list of common pitfalls, and a definition-of-done checklist. The checklist is the key part — you don't move to the next phase until every item is checked. No guessing whether you're ready. No open unknowns quietly becoming risks later.&lt;/p&gt;
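&lt;p&gt;As a toy illustration (not part of D3 itself), the gate is simply "every item checked"; a minimal sketch in Python, with invented checklist items:&lt;/p&gt;

```python
# Illustrative phase gate: refuse to advance while any
# definition-of-done item is unchecked. The checklist items
# below are made-up examples, not the official D3 lists.
def can_advance(checklist: dict) -> bool:
    """Return True only when every definition-of-done item is checked."""
    return all(checklist.values())

discover_done = {
    "business problem restated and agreed": True,
    "scope agreed with stakeholders": True,
    "assumptions and open questions logged": False,  # still an open unknown
}

print(can_advance(discover_done))  # prints False: one unchecked item blocks the phase
```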

&lt;p&gt;It works for solo architects and teams, for startups and enterprises. It doesn't require certifications, governance boards, or months of setup — just a clear problem and a structured way to work through it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three phases
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Discover&lt;/strong&gt; — understand the actual business problem, not just the stated request. Agree on scope with the right people, capture requirements, and document assumptions and open questions. The goal is enough clarity to design with confidence, not perfection.&lt;br&gt;
&lt;strong&gt;Design&lt;/strong&gt; — translate what you learned into a validated architecture. Always present at least two options with trade-offs, never a single answer. Share early — a rough diagram reviewed now is cheaper than a finished document reviewed late.&lt;br&gt;
&lt;strong&gt;Deploy&lt;/strong&gt; — deliver and hand over. This phase ends only when the stakeholder can operate the system independently. Not when the code is deployed. Not when documents are sent. When ownership has actually transferred.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project types — the framework scales with you
&lt;/h2&gt;

&lt;p&gt;D3 has three project types based on the size and risk of your engagement:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Duration&lt;/th&gt;
&lt;th&gt;Goal&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Advisory&lt;/td&gt;
&lt;td&gt;3–4 weeks&lt;/td&gt;
&lt;td&gt;Strategic direction and high-level design&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pilot&lt;/td&gt;
&lt;td&gt;2–3 months&lt;/td&gt;
&lt;td&gt;Validated prototype or workload integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Delivery&lt;/td&gt;
&lt;td&gt;6+ months&lt;/td&gt;
&lt;td&gt;Full production system with operational handover&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each type is cumulative — Pilot includes all Advisory outputs, and Delivery includes all Pilot outputs. You pick the type that fits the engagement, and the framework adjusts depth accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's inside each phase
&lt;/h2&gt;

&lt;p&gt;Beyond the steps themselves, each phase includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Common pitfalls — specific mistakes I've made or seen, written plainly. Things like treating the stakeholder's request as the problem, presenting one option instead of two, or deprioritizing CI/CD until the end.&lt;/li&gt;
&lt;li&gt;Definition of done checklist — a concrete list of items to check before moving forward. The rule: don't proceed with open unknowns.&lt;/li&gt;
&lt;li&gt;Internal tools — working documents for the architect (assumptions log, open questions log, stakeholder map) kept separate from client-facing deliverables.&lt;/li&gt;
&lt;li&gt;Session guides — how to run discovery, design review, and handover sessions. Before, during, and after.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Who it's for
&lt;/h2&gt;

&lt;p&gt;Junior or senior — a new project can still leave you unsure what to do first. D3 gives you a clear action plan from day one.&lt;br&gt;
It's aimed at architects, tech leads, and engineers who end up designing systems regardless of title. It's not a methodology or a certification path. It's a structured way to go from a stakeholder's pain point to a delivered, handed-over system — and to always know what phase you're in.&lt;/p&gt;

&lt;p&gt;The framework is free and open: d3-architecture-framework.vercel.app&lt;br&gt;
I'd love to know if it's useful — or where it doesn't fit your experience.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>systemdesign</category>
      <category>career</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I Open-Sourced My Solutions Architect Portfolio (Real-World Case Studies)</title>
      <dc:creator>Anna Pastushko</dc:creator>
      <pubDate>Tue, 20 Jan 2026 19:13:58 +0000</pubDate>
      <link>https://dev.to/aws-builders/i-open-sourced-my-solutions-architect-portfolio-real-world-case-studies-1i0c</link>
      <guid>https://dev.to/aws-builders/i-open-sourced-my-solutions-architect-portfolio-real-world-case-studies-1i0c</guid>
      <description>&lt;p&gt;You know that awkward moment when an employer asks for your SA portfolio or examples of your architectures? Ever tried to find a real SA portfolio example online? I tried. You’ve probably tried too. There’s nothing. &lt;/p&gt;

&lt;p&gt;Reference architectures? Sure — there are plenty of diagrams for AWS, Azure, and Google Cloud. But how do YOU format YOUR work into a portfolio that shows architectural thinking, not just “I drew boxes”? &lt;/p&gt;

&lt;p&gt;To address this, I decided to &lt;strong&gt;open-source my own Solutions Architect portfolio&lt;/strong&gt;, featuring sanitized case studies from real-world projects. The goal is simple: provide a reference so others don’t start from scratch and can see how architecture decisions, trade-offs, and diagrams are documented in practice.&lt;/p&gt;

&lt;p&gt;🌐 &lt;strong&gt;Portfolio:&lt;/strong&gt; &lt;a href="https://childishgirl-portfolio.vercel.app" rel="noopener noreferrer"&gt;https://childishgirl-portfolio.vercel.app&lt;/a&gt;&lt;br&gt;
📂 &lt;strong&gt;GitHub repository:&lt;/strong&gt; &lt;a href="https://github.com/ChildishGirl/sa-portfolio" rel="noopener noreferrer"&gt;https://github.com/ChildishGirl/sa-portfolio&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Solutions Architect Portfolios Are Hard
&lt;/h2&gt;

&lt;p&gt;Compared to developer portfolios, Solutions Architect portfolios are inherently more difficult to create:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Systems are complex:&lt;/strong&gt; You’re not just showing code, but architecture decisions, constraints, and trade-offs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Most work is confidential:&lt;/strong&gt; Real projects often can’t be shared as-is.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact is broader:&lt;/strong&gt; Architecture work connects technical design with business goals.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because of this, many aspiring architects struggle to create a portfolio without examples to follow.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Portfolio Includes
&lt;/h2&gt;

&lt;p&gt;The portfolio is structured around &lt;strong&gt;real-world case studies&lt;/strong&gt;, carefully sanitized to avoid exposing sensitive information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Industry context:&lt;/strong&gt; Because architecture doesn’t exist in a vacuum. The same pattern solves different problems in fintech vs healthcare.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trade-offs explicitly called out:&lt;/strong&gt; This is THE signal of architectural thinking. Anyone can pick a solution. Architects explain what they sacrificed to get it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diagram style inspired by AWS reference architectures:&lt;/strong&gt; Familiarity reduces cognitive load. When readers recognize the visual language, they focus on YOUR decisions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Responsibilities:&lt;/strong&gt; Shows your actual scope of impact and ownership within the project.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Who This Is For
&lt;/h2&gt;

&lt;p&gt;This portfolio can be useful for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Junior or aspiring Solutions Architects&lt;/li&gt;
&lt;li&gt;Engineers transitioning into architecture roles&lt;/li&gt;
&lt;li&gt;Mentors, lecturers, or educators looking for a reference example&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key Principles for Building Your Own SA Portfolio
&lt;/h2&gt;

&lt;p&gt;While working on this portfolio, I followed a few principles that may help others:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Emphasize decisions, not just outcomes&lt;/strong&gt;:
Explain &lt;em&gt;why&lt;/em&gt; a solution looks the way it does.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use diagrams generously&lt;/strong&gt;:
A clear diagram often communicates more than a page of text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sanitize aggressively&lt;/strong&gt;:
Replace real names and sensitive data with neutral placeholders.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Make it skimmable&lt;/strong&gt;:
Use headings and concise sections.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Be honest about trade-offs&lt;/strong&gt;:
Mention alternatives you considered.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;I hope this open-source portfolio helps someone avoid starting from a blank page when preparing for architecture roles.&lt;/p&gt;

&lt;p&gt;If you’ve built (or reviewed) Solutions Architect portfolios before, I’d be curious to hear:&lt;br&gt;
&lt;strong&gt;What do you think a strong SA portfolio should include?&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Tags
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;aws&lt;/code&gt; &lt;code&gt;solutions&lt;/code&gt; &lt;code&gt;architecture&lt;/code&gt; &lt;code&gt;cloud&lt;/code&gt; &lt;code&gt;career&lt;/code&gt; &lt;code&gt;portfolio&lt;/code&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>career</category>
      <category>opensource</category>
      <category>portfolio</category>
    </item>
    <item>
      <title>Migrate local Data Science workspaces to SageMaker Studio</title>
      <dc:creator>Anna Pastushko</dc:creator>
      <pubDate>Mon, 08 Aug 2022 07:00:44 +0000</pubDate>
      <link>https://dev.to/annpastushko/migrate-local-data-science-workspaces-to-sagemaker-studio-139b</link>
      <guid>https://dev.to/annpastushko/migrate-local-data-science-workspaces-to-sagemaker-studio-139b</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F80m30rjl0qymq6ricvli.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F80m30rjl0qymq6ricvli.png" alt=" " width="800" height="439"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Workspace requirements
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Problem context
&lt;/h2&gt;

&lt;p&gt;The data science team used Jupyter notebooks and the PyCharm IDE on their local machines to train models but had trouble tracking model performance and configuration. They used Google Sheets to record each model's name, parameter values, and performance, which quickly became messy as the number of models grew. Team members also wanted to collaborate with each other when building models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Functional requirements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FR-1&lt;/strong&gt; Solution provides collaborative development.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FR-2&lt;/strong&gt; Solution can be integrated/expanded with AWS EMR.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Non-functional requirements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NFR-1&lt;/strong&gt; Data and code should not be transferred via public internet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NFR-2&lt;/strong&gt; Data scientists should not have access to the public internet.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Constraints
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;C-1&lt;/strong&gt; Data is stored in AWS S3.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;C-2&lt;/strong&gt; Users’ credentials should be stored in AWS IAM Identity Center (SSO).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;C-3&lt;/strong&gt; AWS CodeCommit is selected as a version control system.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Proposed Solution
&lt;/h1&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Migrate local development environment to AWS SageMaker Studio and enable SageMaker Experiments and SageMaker Model Registry to track models’ performance.&lt;/p&gt;

&lt;p&gt;SageMaker Studio will be launched in &lt;em&gt;VPC only mode&lt;/em&gt; to make connection to Amazon EMR available and to satisfy the requirement of disabling access to the public internet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The solution consists of the following components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VPC with an S3 Gateway Endpoint and one private subnet with Interface Endpoints to allow data/code transfer without the public internet.&lt;/li&gt;
&lt;li&gt;IAM Identity Center to reuse existing credentials and enable SSO to SageMaker.&lt;/li&gt;
&lt;li&gt;AWS SageMaker Studio.&lt;/li&gt;
&lt;li&gt;AWS S3 buckets to read data and save model artefacts.&lt;/li&gt;
&lt;li&gt;AWS CodeCommit repository to store code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can see the combination of these components on the Deployment view in the diagram at the top of the page.&lt;/p&gt;

&lt;h2&gt;
  
  
  Functional View
&lt;/h2&gt;

&lt;p&gt;Users log in through the SSO login page and access SageMaker Studio. They can then create and modify Jupyter notebooks, train ML models, and track them using the Model Registry.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zptrytjsi1q8fj11x44.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zptrytjsi1q8fj11x44.png" alt="Functional view" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Migration plan
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1. Inventory of resources and estimation of needed capacity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first step is to create a list of all users, models they used and minimal system requirements. You can see an example of such inventory below.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;User&lt;/th&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Memory&lt;/th&gt;
&lt;th&gt;GPU usage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Harry Potter&lt;/td&gt;
&lt;td&gt;Sorting Hat algorithm&lt;/td&gt;
&lt;td&gt;XGBoost&lt;/td&gt;
&lt;td&gt;15 MB&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ronald Weasley&lt;/td&gt;
&lt;td&gt;Generate summaries for inquiries for Hogwarts&lt;/td&gt;
&lt;td&gt;Seq2seq&lt;/td&gt;
&lt;td&gt;25 MB&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hermione Granger&lt;/td&gt;
&lt;td&gt;Forecast number of owls needed to deliver school letters&lt;/td&gt;
&lt;td&gt;ARIMA&lt;/td&gt;
&lt;td&gt;2 MB&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hermione Granger&lt;/td&gt;
&lt;td&gt;Access to Restricted Section in Hogwarts Library&lt;/td&gt;
&lt;td&gt;CNN&lt;/td&gt;
&lt;td&gt;50 MB&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;While choosing the best instance types for models provided in the inventory, I considered the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Desired speed of training&lt;/li&gt;
&lt;li&gt;Data volume&lt;/li&gt;
&lt;li&gt;GPU usage capability&lt;/li&gt;
&lt;li&gt;Instance pricing and fast launch capability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I decided to use ml.t3.medium for the XGBoost and ARIMA models, and ml.g4dn.xlarge for the CNN and Seq2seq models. With the Free Tier, you get 250 hours of ml.t3.medium on Studio notebooks per month for the first 2 months.&lt;/p&gt;
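&lt;p&gt;The mapping can be captured in a small helper; the two instance types are the choices made above, while the function itself is only an illustration:&lt;/p&gt;

```python
# Illustrative instance picker: GPU-capable models go to ml.g4dn.xlarge,
# the rest to ml.t3.medium (the choices made in this article).
def pick_instance(needs_gpu: bool) -> str:
    return "ml.g4dn.xlarge" if needs_gpu else "ml.t3.medium"

# Inventory rows from the table above: (model, GPU usage)
inventory = [("XGBoost", False), ("Seq2seq", True), ("ARIMA", False), ("CNN", True)]

for model, gpu in inventory:
    print(f"{model}: {pick_instance(gpu)}")
```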

&lt;p&gt;&lt;strong&gt;Step 2. Set up VPC with Gateway endpoint and update routing table&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS offers an option to create a VPC with public/private subnets and an S3 Gateway Endpoint automatically. After that, you need to create the following Interface Endpoints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SageMaker API: &lt;code&gt;com.amazonaws.&amp;lt;region&amp;gt;.sagemaker.api&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;SageMaker Runtime: &lt;code&gt;com.amazonaws.&amp;lt;region&amp;gt;.sagemaker.runtime&lt;/code&gt; — required to run Studio notebooks and to train and host models.&lt;/li&gt;
&lt;li&gt;Service Catalog: &lt;code&gt;com.amazonaws.&amp;lt;region&amp;gt;.servicecatalog&lt;/code&gt; — required to use SageMaker Projects.&lt;/li&gt;
&lt;li&gt;CodeCommit: &lt;code&gt;com.amazonaws.&amp;lt;region&amp;gt;.git-codecommit&lt;/code&gt; — required to connect to CodeCommit.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ Replace &lt;code&gt;&amp;lt;region&amp;gt;&lt;/code&gt; with your region. All resources including SSO account, VPC and SageMaker domain should be in the same region. This is important because AWS Identity Center (SSO) is not available in all regions.&lt;/p&gt;
&lt;/blockquote&gt;
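&lt;p&gt;Since the endpoint names differ only by region, a small helper (illustrative) can generate the full list for your region:&lt;/p&gt;

```python
# Build the interface endpoint service names listed above for a given region.
REQUIRED_SERVICES = ("sagemaker.api", "sagemaker.runtime", "servicecatalog", "git-codecommit")

def endpoint_service_names(region: str) -> list:
    return [f"com.amazonaws.{region}.{service}" for service in REQUIRED_SERVICES]

print(endpoint_service_names("eu-central-1"))
```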

&lt;p&gt;Alternatively, you can use the &lt;a href="https://github.com/ChildishGirl/sagemaker-migration" rel="noopener noreferrer"&gt;provided CDK stack&lt;/a&gt; to deploy the network resources automatically. Once completed, you will have the following network:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb37ieb8r6rn5ejnrq8nl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb37ieb8r6rn5ejnrq8nl.png" alt="Network diagram" width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3. Set up SSO for users in IAM Identity Center&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The next step is to create new users, or add existing ones, to a group in IAM Identity Center.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd7vu7z4rxh950khuw6e3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd7vu7z4rxh950khuw6e3.png" alt="Data science users group" width="800" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Later, you will be able to add this group to SageMaker users. You can also add each user individually without creating a group, but I prefer groups: they let you easily manage users from one place.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4. Set up SageMaker Studio without public internet access&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Start the SageMaker configuration with the Standard setup of a SageMaker Domain. Each AWS account is limited to one domain per region.&lt;/p&gt;

&lt;p&gt;The setup proposes two options for authentication, and I chose AWS Single Sign-On because our identity store is already there. Note that SageMaker with SSO authentication must be launched in the same region as SSO — in my case, both are in eu-central-1.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xbdpgolt9tlrw4xuov4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xbdpgolt9tlrw4xuov4.png" alt="SageMaker SSO configuration" width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After completing the SageMaker Domain configuration, you will see SageMaker in the Applications tab, where you can assign users to it.&lt;/p&gt;

&lt;p&gt;The next step is to assign a default execution role. You can use the default role AWS creates for you, with the &lt;em&gt;AmazonSageMakerFullAccess&lt;/em&gt; policy attached.&lt;/p&gt;

&lt;p&gt;After that, you should configure whether SageMaker may access the public internet. In my case, I disabled internet access with VPC only mode and used the previously created VPC and private subnet to host SageMaker.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7eol2gsa2si1hdoyf88s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7eol2gsa2si1hdoyf88s.png" alt="SageMaker Network configuration" width="800" height="566"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also configure data encryption, the JupyterLab version, and notebook sharing options. Then add the data science user group by navigating to the Control panel and opening the Manage your user groups section.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5. Link AWS CodeCommit&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create a repository in AWS CodeCommit named &lt;code&gt;sagemaker-repository&lt;/code&gt; and copy its HTTPS clone URL. Then, in the left sidebar of SageMaker Studio, choose the Git icon (a diamond with two branches) and select Clone a Repository. After that, you will be able to push changes to existing or new notebooks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6. Set up automatic shutdown of idle resources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;People sometimes forget to shut down instances after finishing their work, so resources run all night and an unexpectedly large bill arrives in the morning. A JupyterLab extension can help you avoid this: it automatically shuts down KernelGateway apps, kernels, and image terminals in SageMaker Studio when they have been idle for a configurable period of time. Setup instructions and scripts can be found in the &lt;a href="https://github.com/aws-samples/sagemaker-studio-auto-shutdown-extension" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;. After installation, you will see a new tab on the left of the SageMaker Studio interface.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72i0q3oof5pb8lw8bhwj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72i0q3oof5pb8lw8bhwj.png" alt="Idle resources shut down extension" width="800" height="208"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 7. Add experiments and model registry&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;SageMaker Model Registry is a great tool for keeping track of the models developed by the team. To enable it, go to the SageMaker resources tab (the bottom one in the left panel) and choose Model Registry from the drop-down list. Create a model package group for each model you want versioned.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpr07e1b9f4hg4s1w2qkl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpr07e1b9f4hg4s1w2qkl.png" alt="Model group creation" width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After that, data scientists can register models using one of the methods described in the &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry-version.html#model-registry-version-api" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;
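&lt;p&gt;Creating the group itself can also be done via the &lt;code&gt;create_model_package_group&lt;/code&gt; API. A sketch with boto3, where the group name and description are made up for illustration; the actual call needs AWS credentials, so it is kept inside a function:&lt;/p&gt;

```python
def model_package_group_request(name: str, description: str) -> dict:
    # Pure helper: build the request for create_model_package_group.
    return {"ModelPackageGroupName": name, "ModelPackageGroupDescription": description}

def create_group(name: str, description: str):
    import boto3  # requires AWS credentials and network access to run
    sagemaker_client = boto3.client("sagemaker")
    return sagemaker_client.create_model_package_group(
        **model_package_group_request(name, description)
    )

print(model_package_group_request("sorting-hat", "XGBoost model for the Sorting Hat algorithm"))
```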

&lt;p&gt;SageMaker Experiments is a feature that lets you organize and compare your machine learning experiments. While Model Registry tracks the model versions chosen as best, Experiments lets data scientists track every training run and compare model performance to pick the best one. You can add experiments using the instructions in the &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/experiments-create.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 8. Migrate notebooks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Amazon SageMaker provides XGBoost as a built-in algorithm, and the data science team decided to use it and re-train the model. Data scientists just need to call the built-in version and provide the path to the data on S3; a more detailed description can be found in the &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers-prebuilt.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;. An example notebook can be found &lt;a href="https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/xgboost_mnist/xgboost_mnist.ipynb" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
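&lt;p&gt;A sketch of what such a notebook cell can look like with the SageMaker Python SDK. The bucket paths, role, instance type, and hyperparameter values are illustrative placeholders, and the training call needs AWS credentials, so it is wrapped in a function:&lt;/p&gt;

```python
# Hyperparameters for the built-in XGBoost algorithm (illustrative values).
HYPERPARAMETERS = {"objective": "multi:softmax", "num_class": "4", "num_round": "50"}

def train_builtin_xgboost(role_arn: str, region: str, s3_train: str, s3_output: str):
    # Requires the sagemaker SDK and AWS credentials; shown as a sketch only.
    import sagemaker
    from sagemaker.estimator import Estimator

    # Resolve the region-specific image URI for the built-in algorithm.
    image_uri = sagemaker.image_uris.retrieve("xgboost", region, version="1.5-1")
    estimator = Estimator(
        image_uri=image_uri,
        role=role_arn,
        instance_count=1,
        instance_type="ml.m5.large",  # training instance (illustrative choice)
        output_path=s3_output,
        hyperparameters=HYPERPARAMETERS,
    )
    estimator.fit({"train": s3_train})  # the "train" channel points at the S3 prefix
    return estimator
```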

&lt;p&gt;Seq2seq notebook migration is similar to the previous case — SageMaker has a built-in Seq2seq algorithm, so you just need to reference the right image URI and create a SageMaker Estimator with it. An example SageMaker notebook using the Seq2seq algorithm can be found &lt;a href="https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/seq2seq_translation_en-de/SageMaker-Seq2Seq-Translation-English-German.ipynb" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The ARIMA case is slightly more difficult because the data science team doesn't want to use the DeepAR algorithm proposed by SageMaker, so we need a custom algorithm container: build a Docker image, publish it to ECR, and then use it from SageMaker. You can find &lt;a href="https://sagemaker-examples.readthedocs.io/en/latest/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.html" rel="noopener noreferrer"&gt;detailed instructions&lt;/a&gt; and an &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html" rel="noopener noreferrer"&gt;example notebook&lt;/a&gt; in the documentation.&lt;/p&gt;

&lt;p&gt;For the CNN built with TensorFlow, we can use SageMaker script mode to bring our own model; an example notebook can be found &lt;a href="https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-python-sdk/tensorflow_script_mode_training_and_serving/tensorflow_script_mode_training_and_serving.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;. Alternatively, we can train it on a local machine — an example notebook for that approach is &lt;a href="https://github.com/aws/amazon-sagemaker-examples/blob/main/advanced_functionality/tensorflow_iris_byom/tensorflow_BYOM_iris.ipynb" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;More example notebooks for different models can be found in the Amazon SageMaker examples &lt;a href="https://github.com/aws/amazon-sagemaker-examples" rel="noopener noreferrer"&gt;repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 9. Ensure that data science team can use new environment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The final step is to document the process so that new users can onboard to the data science working environment. The documentation should include the following sections:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to add a new model (feature) to registry&lt;/li&gt;
&lt;li&gt;How to add a new experiment to SageMaker&lt;/li&gt;
&lt;li&gt;How to choose an instance for the model&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;SageMaker is a great tool for data scientists: it lets you label data, track experiments, and train, register, and deploy models. It automates and standardizes MLOps practices and provides many helpful features.&lt;/p&gt;

&lt;p&gt;CDK resources for network creation and a cost breakdown can be found in the &lt;a href="https://github.com/ChildishGirl/sagemaker-migration" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;Thank you for reading to the end. I hope it was helpful; please let me know in the comments if you spot any mistakes.&lt;/p&gt;

</description>
      <category>python</category>
      <category>aws</category>
      <category>cloud</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Create Serverless Data Pipeline Using AWS CDK (Python)</title>
      <dc:creator>Anna Pastushko</dc:creator>
      <pubDate>Fri, 08 Jul 2022 09:45:06 +0000</pubDate>
      <link>https://dev.to/annpastushko/create-serverless-data-pipeline-using-aws-cdk-python-5cg2</link>
      <guid>https://dev.to/annpastushko/create-serverless-data-pipeline-using-aws-cdk-python-5cg2</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjon7c0wtpuz3y0qy4gno.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjon7c0wtpuz3y0qy4gno.png" alt="Architecture diagram" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Context
&lt;/h2&gt;

&lt;p&gt;The data science team is about to start a research study, and they requested a solution on the AWS cloud. Here is what I have to offer them:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Main process (Data Processing):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Device&lt;/strong&gt; uploads a &lt;em&gt;.mat&lt;/em&gt; file with ECG data to an S3 bucket (&lt;strong&gt;Raw&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;The upload triggers an event, which is sent to the &lt;strong&gt;SQS&lt;/strong&gt; queue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda&lt;/strong&gt; polls the SQS queue (event source mapping invocation) and starts processing the event. Lambda’s runtime is a Python Docker container because the libraries’ total size exceeds the 250 MB layer size limit. If any error occurs during processing, you’ll get a notification in &lt;strong&gt;Slack&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Once processing completes, the data is saved in Parquet format to an S3 bucket (&lt;strong&gt;Processed&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;To enable data scientists to query the data, a &lt;strong&gt;Glue Crawler&lt;/strong&gt; job creates a schema in the &lt;strong&gt;Data Catalog&lt;/strong&gt;. Then, &lt;strong&gt;Athena&lt;/strong&gt; is used to query the Processed bucket.&lt;/li&gt;
&lt;/ul&gt;
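&lt;p&gt;To illustrate the last step, here is a minimal sketch of the kind of preview query a data scientist might run in Athena once the crawler has populated the Data Catalog. The database and table names are my assumptions, not taken from the stack:&lt;/p&gt;

```python
# Build a simple preview query for the processed ECG data.
# Database/table names are hypothetical; use whatever the Glue Crawler created.
def build_ecg_preview_query(database: str, table: str, limit: int = 10) -> str:
    return f'SELECT * FROM "{database}"."{table}" LIMIT {limit}'

query = build_ecg_preview_query("ecg_db", "processed_ecg_data")
```

The resulting string can then be run in the Athena console or submitted programmatically.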

&lt;p&gt;&lt;strong&gt;Secondary process (CI/CD):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When developers want to change the processing job logic, they just prepare the changes and commit them to the &lt;strong&gt;CodeCommit&lt;/strong&gt; repository. Everything else is handled automatically by the CI/CD process.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;CodeBuild&lt;/strong&gt; service converts the CDK code into a CloudFormation template and deploys it to your account; in other words, it creates all infrastructure components automatically. Once deployment completes, the group of deployed resources (stack) is visible in the &lt;strong&gt;CloudFormation&lt;/strong&gt; web UI. To simplify these two steps, and to let the CI/CD process update itself as well, the &lt;strong&gt;CodePipeline&lt;/strong&gt; abstraction is used. You’ll also get Slack notifications about the progress.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preparation
&lt;/h2&gt;

&lt;p&gt;To prepare your local environment for this project, you should follow the steps described below:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html" rel="noopener noreferrer"&gt;Install AWS CLI&lt;/a&gt; and set up credentials.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://nodejs.org/en/download/" rel="noopener noreferrer"&gt;Install NodeJS&lt;/a&gt; to be able to use CDK.&lt;/li&gt;
&lt;li&gt;Install the CDK using the command &lt;code&gt;sudo npm install -g aws-cdk&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Create a new directory for your project and change your current working directory to it.&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;cdk init --language python&lt;/code&gt; to initialize the CDK project.&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;cdk bootstrap&lt;/code&gt; to bootstrap AWS account with CDK resources.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.docker.com/engine/install/" rel="noopener noreferrer"&gt;Install Docker&lt;/a&gt; to run Docker container inside Lambda.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;

&lt;p&gt;This is the final look of the project. I will provide a step-by-step guide so that you’ll eventually understand each component in it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DataPipeline
├── assets
│├── lambda
││├── dockerfile
││└── processing.py
├──cdk.out
│└── ...
├── stacks
│├── __init__.py
│├── data_pipeline_stack.py
│├── cicd_stack.py
│└──data_pipeline_stage.py
├── app.py
├── cdk.json
└── requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our starting point is the &lt;code&gt;stacks&lt;/code&gt; directory. It contains the mandatory empty &lt;code&gt;__init__.py&lt;/code&gt; file that defines a Python package, plus three other files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;data_pipeline_stack.py&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;cicd_stack.py&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;data_pipeline_stage.py&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;First, we open &lt;code&gt;data_pipeline_stack.py&lt;/code&gt; and import all the libraries and constructs needed for further development. We also define a class that inherits from &lt;code&gt;cdk.Stack&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_cdk&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;cdk&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_cdk.aws_s3&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;_s3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_cdk.aws_sqs&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;_sqs&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_cdk.aws_iam&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;_iam&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_cdk.aws_ecs&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;_ecs&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_cdk.aws_ecr&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;_ecr&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_cdk.aws_lambda&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;_lambda&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_cdk.aws_s3_notifications&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;_s3n&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aws_cdk.aws_ecr_assets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DockerImageAsset&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aws_cdk.aws_lambda_event_sources&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SqsEventSource&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DataPipelineStack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Stack&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;construct_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;construct_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, we use the SQS &lt;a href="https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk.aws_sqs/Queue.html" rel="noopener noreferrer"&gt;Queue&lt;/a&gt; construct to connect the S3 bucket and Lambda. The arguments are simple: the stack element id (&lt;code&gt;‘raw_data_queue’&lt;/code&gt;), the queue name (&lt;code&gt;‘data_pipeline_queue’&lt;/code&gt;), and the time a message stays invisible after Lambda takes it for processing (&lt;code&gt;cdk.Duration.seconds(200)&lt;/code&gt;). Note that the visibility timeout depends on your processing time: if processing takes 30 seconds, it is better to set it to 60 seconds. In this case, I set it to 200 seconds because processing takes ~100 seconds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data_queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_sqs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;raw_data_queue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;queue_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data_pipeline_queue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;visibility_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seconds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
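&lt;p&gt;The sizing rule above (give the queue roughly double the processing time) can be written down as a tiny helper. The function is my own illustration, not part of the stack:&lt;/p&gt;

```python
# Rule of thumb: set the SQS visibility timeout to about twice the expected
# processing time, so a slow run does not make the message visible again
# while Lambda is still working on it.
def recommended_visibility_timeout(processing_seconds: int) -> int:
    return processing_seconds * 2

recommended_visibility_timeout(30)   # 60, as in the example above
recommended_visibility_timeout(100)  # 200, the value used in this stack
```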



&lt;p&gt;Next, we create S3 buckets for the raw and processed data using the &lt;a href="https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk.aws_s3/Bucket.html?highlight=add_event_notification#bucket" rel="noopener noreferrer"&gt;Bucket&lt;/a&gt; construct. Since raw data is usually accessed only within the first few days after upload, we can add &lt;code&gt;lifecycle_rules&lt;/code&gt; to transition it from S3 Standard to S3 Glacier after 7 days and reduce storage costs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;raw_bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;raw_bucket&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;raw-ecg-data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;auto_delete_objects&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;removal_policy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RemovalPolicy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DESTROY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;lifecycle_rules&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LifecycleRule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                          &lt;span class="n"&gt;transitions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Transition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                          &lt;span class="n"&gt;storage_class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StorageClass&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GLACIER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                          &lt;span class="n"&gt;transition_after&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;days&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;))])])&lt;/span&gt;
&lt;span class="n"&gt;raw_bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_event_notification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EventType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OBJECT_CREATED&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                  &lt;span class="n"&gt;_s3n&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SqsDestination&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_queue&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;_s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;processed_bucket&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;processed-ecg-data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;auto_delete_objects&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;removal_policy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RemovalPolicy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DESTROY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We also need to connect the raw bucket to the SQS queue, defining a destination for the events the bucket generates. For that, we use the &lt;a href="https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk.aws_s3/Bucket.html?highlight=add_event_notification#aws_cdk.aws_s3.Bucket.add_event_notification" rel="noopener noreferrer"&gt;add_event_notification&lt;/a&gt; method with two arguments: the event type to be notified about (&lt;code&gt;_s3.EventType.OBJECT_CREATED&lt;/code&gt;) and the destination queue to notify (&lt;code&gt;_s3n.SqsDestination(data_queue)&lt;/code&gt;).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ When the stack is destroyed, the bucket and all the data inside it will be deleted. To change this behaviour, remove the &lt;code&gt;removal_policy&lt;/code&gt; and &lt;code&gt;auto_delete_objects&lt;/code&gt; arguments so they fall back to their defaults.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The next step is to create the Lambda function using the &lt;a href="https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk.aws_lambda/DockerImageFunction.html" rel="noopener noreferrer"&gt;DockerImageFunction&lt;/a&gt; construct. The arguments I define are shown in the code below; they are largely self-explanatory and follow the pattern of the previous examples. If anything is unclear, please refer to the documentation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ The only thing I should highlight is the value of the &lt;code&gt;timeout&lt;/code&gt; parameter in Lambda: it should always be less than the &lt;code&gt;visibility_timeout&lt;/code&gt; parameter in the Queue (&lt;code&gt;180&lt;/code&gt; vs &lt;code&gt;200&lt;/code&gt;).&lt;/p&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;processing_lambda&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_lambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DockerImageFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
                           &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;processing_lambda&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                           &lt;span class="n"&gt;function_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data_processing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                           &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Starts Docker Container in Lambda to process data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                                              
                           &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_lambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DockerImageCode&lt;/span&gt;
                           &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_image_asset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;assets/lambda/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                                                                                                  
                                              &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dockerfile&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;                                                
                           &lt;span class="n"&gt;architecture&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_lambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Architecture&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X86_64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                           &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;SqsEventSource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_queue&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
                           &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seconds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;180&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                           &lt;span class="n"&gt;retry_attempts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                           &lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;QueueUrl&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
                                        &lt;span class="n"&gt;data_queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;queue_url&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;processing_lambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;attach_inline_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                  &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;access_fer_lambda&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="n"&gt;statements&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PolicyStatement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; 
                              &lt;span class="n"&gt;effect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Effect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ALLOW&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                              &lt;span class="n"&gt;actions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3:*&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                              &lt;span class="n"&gt;resources&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, we attach a policy to the automatically created Lambda role using the &lt;code&gt;attach_inline_policy&lt;/code&gt; method, so the function can process files from S3. You can tune the actions/resources parameters to grant Lambda more granular access to S3.&lt;/p&gt;
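&lt;p&gt;As one hypothetical way to narrow the broad &lt;code&gt;s3:*&lt;/code&gt; permission, the access could be expressed as a plain IAM policy document that only allows reads from the raw bucket and writes to the processed bucket. The helper below is my own sketch, not part of the stack:&lt;/p&gt;

```python
# Sketch of a narrower IAM policy document for the Lambda role.
# Bucket names match the ones created earlier in this stack.
def build_lambda_s3_policy(raw_bucket: str, processed_bucket: str) -> dict:
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # read raw .mat files
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{raw_bucket}/*"],
            },
            {   # write processed .parquet files
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{processed_bucket}/*"],
            },
        ],
    }

policy = build_lambda_s3_policy("raw-ecg-data", "processed-ecg-data")
```

In CDK terms, each statement would map to an `_iam.PolicyStatement` with the same actions and resources.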

&lt;p&gt;Now we move to &lt;code&gt;assets&lt;/code&gt; directory.&lt;/p&gt;

&lt;p&gt;There we need to create &lt;code&gt;dockerfile&lt;/code&gt; and &lt;code&gt;processing.py&lt;/code&gt; with the data transformation logic, which is fairly simple. First, we parse the event from SQS to get information about the file and the bucket, then parse the &lt;em&gt;.mat&lt;/em&gt; file with ECG data, clean it, and save it in &lt;em&gt;.parquet&lt;/em&gt; format to the Processed bucket. The script also includes logging and Slack error messages. At the end, we delete the message from the queue so the file is not processed again.&lt;/p&gt;
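&lt;p&gt;The Slack error notification mentioned above can be sketched as a small helper. The message format and function names are my assumptions, and the webhook URL is whatever you configure for your workspace; this version uses the standard library rather than the script’s &lt;code&gt;urllib3&lt;/code&gt;, so it stays self-contained:&lt;/p&gt;

```python
import json
from urllib import request

def build_slack_payload(file_key: str, error: str) -> str:
    # Minimal incoming-webhook body; adjust the text to your channel conventions.
    return json.dumps({"text": f"Processing failed for {file_key}: {error}"})

def send_slack_error(webhook_url: str, file_key: str, error: str) -> None:
    # POST the notification to the Slack incoming webhook.
    req = request.Request(
        webhook_url,
        data=build_slack_payload(file_key, error).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    request.urlopen(req)
```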

&lt;p&gt;For your pipeline, you can change the processing logic and replace &lt;code&gt;_url&lt;/code&gt; with your own Slack webhook URL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urllib3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scipy.io&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;awswrangler&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;wr&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;urllib.parse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;unquote_plus&lt;/span&gt;

&lt;span class="n"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sqs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sqs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Parse event and get file name
&lt;/span&gt;    &lt;span class="n"&gt;message_rh&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;receiptHandle&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;unquote_plus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;raw_bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bucket&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;processed_bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;processed-ecg-data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Read mat file
&lt;/span&gt;        &lt;span class="n"&gt;_obj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_object&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;raw_bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;raw_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scipy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loadmat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;BytesIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_obj&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="c1"&gt;# Clean data
&lt;/span&gt;        &lt;span class="n"&gt;_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;val&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ECG_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pad&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ECG_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ECG_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ECG_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ECG_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ECG_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

        &lt;span class="c1"&gt;# Save parquet file to verified bucket
&lt;/span&gt;        &lt;span class="n"&gt;file_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;stem&lt;/span&gt;
        &lt;span class="n"&gt;wr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;processed_bucket&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.parquet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;File &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; was successfully processed.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Send notification about failure to Slack channel
&lt;/span&gt;        &lt;span class="n"&gt;_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://hooks.slack.com/YOUR_HOOK&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="n"&gt;_msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Processing of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; was unsuccessful.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;http&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;urllib3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PoolManager&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_msg&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="c1"&gt;# Write message to logs
&lt;/span&gt;        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Processing of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; was unsuccessful.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc_info&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Delete processed message from SQS
&lt;/span&gt;    &lt;span class="n"&gt;sqs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;QueueUrl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;QueueUrl&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;ReceiptHandle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;message_rh&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
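As a side note, the cleaning step above forward-fills gaps and then applies plain min-max scaling. Stripped of pandas, the scaling can be sketched like this (a toy illustration, not part of the Lambda code):

```python
def min_max_scale(values):
    """Rescale values linearly into the [0, 1] range (min-max normalisation)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```

This is exactly what the `(_df - min) / (max - min)` expression does to the `ECG_data` column.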



&lt;p&gt;Let’s quickly go through the logic of the Dockerfile: first, we pull the base image for Lambda from the AWS ECR public repository, then install all Python libraries, copy our &lt;code&gt;processing.py&lt;/code&gt; script to the container, and run the command that launches the handler function from the script.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ecr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aws&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;3.8&lt;/span&gt;

&lt;span class="n"&gt;RUN&lt;/span&gt; &lt;span class="n"&gt;pip3&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; \
   &lt;span class="n"&gt;pip3&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; \
   &lt;span class="n"&gt;pip3&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; \
   &lt;span class="n"&gt;pip3&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;scipy&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; \
   &lt;span class="n"&gt;pip3&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;awswrangler&lt;/span&gt;

&lt;span class="n"&gt;WORKDIR&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;

&lt;span class="n"&gt;COPY&lt;/span&gt; &lt;span class="n"&gt;processing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;LAMBDA_TASK_ROOT&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;CMD&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;processing.handler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;⚠️ Do not forget to add the libraries you use in &lt;code&gt;processing.py&lt;/code&gt; to the &lt;code&gt;Dockerfile&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At this stage, we have finished creating the Data pipeline stack and can move on to developing the CI/CD stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  CI/CD stack
&lt;/h2&gt;

&lt;p&gt;For the CI/CD process we will use the CodePipeline service, which makes deployment easier. Each time we change the Data pipeline or CI/CD stack via a CodeCommit push, CodePipeline automatically re-deploys both stacks. In short, the application stack is added to a CodePipeline stage, and that stage is added to the CodePipeline; the app is then synthesised for the CI/CD stack, not the application stack. You can find a more detailed description of the connection between the files and the logic behind the CodePipeline construct below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkm7odrvecloe8aqs9dvm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkm7odrvecloe8aqs9dvm.png" alt="Steps for CodePipeline stack creation" width="800" height="528"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, we open &lt;code&gt;cicd_stack.py&lt;/code&gt; and import all the libraries and constructs we will use. Later, we will create the CodeCommit repository manually, but for now we only need a reference to it, so we can add it as the source for CodePipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_cdk&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;cdk&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_cdk.aws_iam&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;_iam&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_cdk.aws_chatbot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;_chatbot&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_cdk.aws_codecommit&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;_ccommit&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;stacks.data_pipeline_stage&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_cdk.aws_codestarnotifications&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;_notifications&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aws_cdk.pipelines&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CodePipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CodePipelineSource&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ShellStep&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CICDStack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Stack&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;construct_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;construct_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;pipeline_repo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_ccommit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Repository&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_repository_arn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data_pipeline_repository&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;repository_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data_pipeline_repository&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We use the &lt;a href="https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk.pipelines/CodePipeline.html" rel="noopener noreferrer"&gt;CodePipeline&lt;/a&gt; construct to create the CI/CD process. The &lt;code&gt;self_mutation&lt;/code&gt; parameter is set to &lt;code&gt;True&lt;/code&gt; to allow the pipeline to update itself (this is its default value). The &lt;code&gt;docker_enabled_for_synth&lt;/code&gt; parameter should be set to &lt;code&gt;True&lt;/code&gt; if we use Docker in our application stack. After that, we add a stage with the application deployment and call &lt;code&gt;build_pipeline()&lt;/code&gt; to construct the pipeline. The latter is a necessary step for setting up Slack notifications later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CodePipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cicd_pipeline&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;pipeline_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cicd_pipeline&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;docker_enabled_for_synth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;self_mutation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;synth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ShellStep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Synth&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CodePipelineSource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;code_commit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                              &lt;span class="n"&gt;pipeline_repo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;master&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                              &lt;span class="n"&gt;commands&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;npm install -g aws-cdk&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;python -m pip install -r 
                                         requirements.txt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cdk synth&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_stage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;DataPipelineStage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DataPipelineDeploy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;cicd_pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build_pipeline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next step is to configure Slack notifications for CodePipeline, so developers can monitor deployments. For that we use the &lt;a href="https://docs.aws.amazon.com/cdk/api/v1/python/aws_cdk.aws_chatbot/SlackChannelConfiguration.html" rel="noopener noreferrer"&gt;SlackChannelConfiguration&lt;/a&gt; construct. We can get the value for &lt;code&gt;slack_channel_id&lt;/code&gt; by right-clicking the channel name and copying the last 9 characters of the URL. To get the &lt;code&gt;slack_workspace_id&lt;/code&gt; parameter value, use the &lt;a href="https://docs.aws.amazon.com/chatbot/latest/adminguide/getting-started.html" rel="noopener noreferrer"&gt;AWS Chatbot Guide&lt;/a&gt;. To define the types of notifications we want to receive, we use the &lt;a href="https://docs.aws.amazon.com/cdk/api/v1/python/aws_cdk.aws_codestarnotifications/NotificationRule.html" rel="noopener noreferrer"&gt;NotificationRule&lt;/a&gt; construct. If you want to define the events for notifications more granularly, see &lt;a href="https://docs.aws.amazon.com/dtconsole/latest/userguide/concepts.html#events-ref-pipeline" rel="noopener noreferrer"&gt;Events for notification rules&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;
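For instance, given a hypothetical channel URL, the channel ID is simply the last path segment (the URL below is made up for illustration):

```python
# Hypothetical Slack channel URL; the ID is the last path segment.
url = 'https://myworkspace.slack.com/archives/C0ABCD123'
channel_id = url.rsplit('/', 1)[-1]
print(channel_id)  # C0ABCD123
```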

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;slack&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_chatbot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SlackChannelConfiguration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MySlackChannel&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                                           
                 &lt;span class="n"&gt;slack_channel_configuration_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;#cicd_events&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                                           
                 &lt;span class="n"&gt;slack_workspace_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;YOUR_ID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                                          
                 &lt;span class="n"&gt;slack_channel_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;YOUR_ID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;slack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;attach_inline_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;slack_policy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;statements&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PolicyStatement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;effect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Effect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ALLOW&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                                       &lt;span class="n"&gt;actions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;chatbot:*&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                       &lt;span class="n"&gt;resources&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])]))&lt;/span&gt;
&lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_notifications&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;NotificationRule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;NotificationRule&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cicd_pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;detail_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_notifications&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DetailType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BASIC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;codepipeline-pipeline-pipeline-execution-started&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;codepipeline-pipeline-pipeline-execution-succeeded&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;codepipeline-pipeline-pipeline-execution-failed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
       &lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;slack&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;ℹ️ The &lt;a href="https://docs.aws.amazon.com/cdk/api/v1/python/aws_cdk.pipelines/CodePipeline.html#aws_cdk.pipelines.CodePipeline.pipeline" rel="noopener noreferrer"&gt;.pipeline&lt;/a&gt; property refers to the CodePipeline pipeline that deploys the CDK app. It is available only after the pipeline has been constructed with the &lt;a href="https://docs.aws.amazon.com/cdk/api/v1/python/aws_cdk.pipelines/CodePipeline.html#aws_cdk.pipelines.CodePipeline.build_pipeline" rel="noopener noreferrer"&gt;build_pipeline()&lt;/a&gt; method. For the &lt;code&gt;source&lt;/code&gt; argument we should pass the pipeline object, not the construct.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After defining the pipeline, we add a stage for the Data pipeline deployment. To keep the project cleaner, we define this stage in a separate file, using the &lt;a href="https://docs.aws.amazon.com/cdk/api/v1/python/aws_cdk.core/Stage.html" rel="noopener noreferrer"&gt;cdk.Stage&lt;/a&gt; parent class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_cdk&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;cdk&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;constructs&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Construct&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;stacks.data_pipeline_stack&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DataPipelineStage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Stage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;construct_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;construct_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="nc"&gt;DataPipelineStack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DataPipelineStack&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For those of you who use &lt;strong&gt;CDKv1&lt;/strong&gt;, an additional step is to modify the &lt;code&gt;cdk.json&lt;/code&gt; configuration file: add the following expression to the context.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"context": {"@aws-cdk/core:newStyleStackSynthesis": true}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
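If you prefer to patch an existing cdk.json programmatically, the standard json module is enough. A minimal sketch (the enable_new_synthesis helper is hypothetical, not part of the article's project):

```python
import json

def enable_new_synthesis(cfg: dict) -> dict:
    """Add the CDKv1 feature flag for new-style stack synthesis to the context."""
    cfg.setdefault('context', {})['@aws-cdk/core:newStyleStackSynthesis'] = True
    return cfg

# Example: patch a minimal cdk.json-style dict and print the result.
cfg = enable_new_synthesis({'app': 'python3 app.py'})
print(json.dumps(cfg))
```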



&lt;p&gt;At this point, we have created all constructs and files for the Data pipeline stack. The only thing left is to create &lt;code&gt;app.py&lt;/code&gt; with the final steps. We import all the constructs we created in &lt;code&gt;cicd_stack.py&lt;/code&gt; and create tags for all stack resources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;stacks.cicd_stack&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;App&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nc"&gt;CICDStack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CodePipelineStack,
          env=cdk.Environment(account=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="n"&gt;REPLACE&lt;/span&gt; &lt;span class="n"&gt;WITH_YOUR_ACCOUNT&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,
                              region=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="n"&gt;YOUR_REGION&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;))
cdk.Tags.of(app).add(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="n"&gt;prod&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)
cdk.Tags.of(app).add(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="n"&gt;creator&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="n"&gt;anna&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;pastushko&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)
cdk.Tags.of(app).add(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="n"&gt;ml&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;team&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)
app.synth()
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Congratulations, we have finished creating our stacks. Now we can create a CodeCommit repository called &lt;code&gt;data_pipeline_repository&lt;/code&gt; and push the files to it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4wgpzr160mml1vdh68j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4wgpzr160mml1vdh68j.png" alt="CodeCommit repository creation" width="800" height="646"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can manually add the same tags as we created in the Stack, so that all resources created for this task appear bound together in cost reports.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ Check limitations for CodeBuild in Service Quotas before deployment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Congratulations, now we can finally deploy the stack to AWS with the &lt;code&gt;cdk deploy&lt;/code&gt; command and watch all resources being set up automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Athena queries
&lt;/h2&gt;

&lt;p&gt;Let’s start with the Glue Crawler. To create it, go to the Data Catalog section of the Glue console and click Crawlers, then click the Add crawler button and go through all the steps. I added the same tags as for the other Data pipeline resources, so I can track them together.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjs48hewet469j6b584a6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjs48hewet469j6b584a6.png" alt="Glue Crawler creation" width="800" height="547"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Don’t change the crawler source type, add an S3 data store and specify the path to your bucket in Include path. After that, create a new role or use an existing one, and specify how often you want the crawler to run. Then create a database; in my case I created the &lt;code&gt;ecg_data&lt;/code&gt; database. After all steps are completed and the crawler is created, run it.&lt;/p&gt;

&lt;p&gt;That is all we need to query the &lt;code&gt;processed_ecg_data&lt;/code&gt; table with Athena. An example of a simple query is shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4kmcwiqr4cskqu4wlzxf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4kmcwiqr4cskqu4wlzxf.png" alt="Athena query" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Account cleanup
&lt;/h2&gt;

&lt;p&gt;In case you want to delete all resources created in your account during development, you should perform the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run the following command to delete all stack resources: &lt;code&gt;cdk destroy CodePipelineStack/DataPipelineDeploy/DataPipelineStack CodePipelineStack&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Delete the CodeCommit repository&lt;/li&gt;
&lt;li&gt;Clean the ECR repository and the S3 buckets created for Athena and CDK, because they can incur costs.&lt;/li&gt;
&lt;li&gt;Delete the Glue Crawler and the database with its tables.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ The &lt;code&gt;cdk destroy&lt;/code&gt; command will only destroy the CodePipeline (CI/CD) stack and the stacks that depend on it. Since the application stacks don't depend on the CodePipeline stack, they won't be destroyed. We need to destroy the Data pipeline stack separately; there is a &lt;a href="https://github.com/aws/aws-cdk/issues/10190" rel="noopener noreferrer"&gt;discussion on how to delete them both&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Deleting some resources manually is not very convenient, and there are several &lt;a href="https://github.com/aws/aws-cdk-rfcs/issues/64" rel="noopener noreferrer"&gt;discussions&lt;/a&gt; with AWS developers about fixing it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;CDK provides you with a toolkit for developing applications based on AWS services. It can be challenging at first, but your efforts will pay off in the end: you will be able to manage and transfer your whole application with one command.&lt;/p&gt;

&lt;p&gt;CDK resources and full code can be found in &lt;a href="https://github.com/ChildishGirl/serverless-data-pipeline" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;Thank you for reading till the end. I do hope it was helpful, please let me know if you spot any mistakes in the comments.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cdk</category>
      <category>serverless</category>
      <category>python</category>
    </item>
    <item>
      <title>Forecasting of periodic events with ML</title>
      <dc:creator>Anna Pastushko</dc:creator>
      <pubDate>Thu, 07 Jul 2022 11:51:33 +0000</pubDate>
      <link>https://dev.to/annpastushko/forecasting-of-periodic-events-with-ml-1ka5</link>
      <guid>https://dev.to/annpastushko/forecasting-of-periodic-events-with-ml-1ka5</guid>
      <description>&lt;p&gt;Periodic events forecasting is quite useful if you are, for example, the data aggregator. Data aggregators or data providers are organizations that collect statistical, financial or any other data from different sources, transform it and then offer it for further analysis and exploration (data as a service).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Data as a Service (DaaS)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is really important for such organizations to monitor release dates, so they can gather data as soon as it is released and plan capacity for the incoming volumes of data.&lt;br&gt;
Some authorities that publish data provide a schedule of future releases; others do not, or announce the schedule only one or two months ahead. In those cases you may want to build the publication schedule yourself and predict release dates.&lt;br&gt;
For the majority of statistical releases, you can find a pattern such as the day of the week or of the month. For example, statistics can be released&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;every last working day of the month,&lt;/li&gt;
&lt;li&gt;every third Tuesday of the month,&lt;/li&gt;
&lt;li&gt;every last second working day of the month, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this in mind, and given the history of past release dates, we want to predict the date, or range of dates, when the next data release might happen.&lt;/p&gt;
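&lt;p&gt;Patterns like these can be computed directly. A minimal sketch with only the Python standard library (the helper names are mine, not part of any release-calendar API):&lt;/p&gt;

```python
import calendar
from datetime import date, timedelta

def last_working_day(year, month):
    """Last Mon-Fri day of the month (public holidays are ignored)."""
    d = date(year, month, calendar.monthrange(year, month)[1])
    while d.weekday() > 4:  # 5 = Saturday, 6 = Sunday
        d -= timedelta(days=1)
    return d

def nth_weekday(year, month, weekday, n):
    """n-th occurrence of the given weekday (Mon=0) in the month."""
    first = date(year, month, 1)
    offset = (weekday - first.weekday()) % 7
    return first + timedelta(days=offset + 7 * (n - 1))

print(last_working_day(2020, 5))   # 2020-05-29, a Friday
print(nth_weekday(2020, 9, 1, 3))  # 2020-09-15, the third Tuesday
```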
&lt;h2&gt;
  
  
  Case Study
&lt;/h2&gt;

&lt;p&gt;As a case study let’s take the &lt;a href="https://www.investing.com/economic-calendar/cb-consumer-confidence-48" rel="noopener noreferrer"&gt;U.S. Conference Board (CB) Consumer Confidence Indicator&lt;/a&gt;. It is a leading indicator which measures the level of consumer confidence in economic activity. By using it, we can predict consumer spending, which plays a major role in overall economic activity.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://conference-board.org/eu/" rel="noopener noreferrer"&gt;official data provider&lt;/a&gt; does not provide the schedule for this series, but many data aggregators like &lt;a href="https://www.investing.com/economic-calendar/cb-consumer-confidence-48" rel="noopener noreferrer"&gt;Investing.com&lt;/a&gt; have been collecting the data for a while and series’ release history is available there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: we need to predict the date of the next release(s).&lt;/p&gt;
&lt;h2&gt;
  
  
  Data preparation
&lt;/h2&gt;

&lt;p&gt;We start with importing all packages for data manipulation, building machine learning models, and other data transformations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Data manipulation
import pandas as pd# Manipulation with dates
from datetime import date
from dateutil.relativedelta import relativedelta# Machine learning
import xgboost as xgb
from sklearn import metrics
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next step is to get the history of release dates. You may have a database with all the data and release history that you can use. To keep things simple and focus on predicting release dates, I will add the history to a DataFrame manually.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = pd.DataFrame({'Date': ['2021-01-26','2020-12-22',
                     '2020-11-24','2020-10-27','2020-09-29',
                     '2020-08-25','2020-07-28','2020-06-30',
                     '2020-05-26','2020-04-28','2020-03-31',
                     '2020-02-25','2020-01-28','2019-12-31',
                     '2019-11-26','2019-10-29','2019-09-24',
                     '2019-08-27','2019-07-30','2019-06-25',
                     '2019-05-28']})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We should also add a column with 0 and 1 values that specifies whether a release happened on a given date. For now, we only have the dates of releases, so we create a column filled with 1s.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data['Date'] = pd.to_datetime(data['Date'])
data['Release'] = 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, we need to add rows for the dates between releases to the DataFrame and fill the release column with zeros for them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;r = pd.date_range(start=data['Date'].min(), end=data['Date'].max())
data = data.set_index('Date').reindex(r).fillna(0.0)
       .rename_axis('Date').reset_index()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the dataset is ready for further manipulations.&lt;/p&gt;
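&lt;p&gt;To make the gap-filling step explicit, here is the same logic written with only the standard library; the &lt;code&gt;reindex&lt;/code&gt; call above does exactly this in one line:&lt;/p&gt;

```python
from datetime import date, timedelta

release_dates = {date(2019, 5, 28), date(2019, 6, 25)}  # two sample releases
start, end = min(release_dates), max(release_dates)

# One row per calendar day: 1 on release days, 0 otherwise
rows = [(start + timedelta(days=i),
         1 if (start + timedelta(days=i)) in release_dates else 0)
        for i in range((end - start).days + 1)]

print(len(rows))                      # 29 calendar days
print(sum(flag for _, flag in rows))  # 2 release days
```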

&lt;h2&gt;
  
  
  Feature engineering
&lt;/h2&gt;

&lt;p&gt;Prediction of the next release dates relies heavily on feature engineering, because we do not have any features besides the release date itself. Therefore, we will create the following features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;month&lt;/li&gt;
&lt;li&gt;calendar day of the month&lt;/li&gt;
&lt;li&gt;working day number&lt;/li&gt;
&lt;li&gt;day of the week&lt;/li&gt;
&lt;li&gt;week of month number&lt;/li&gt;
&lt;li&gt;monthly weekday occurrence (e.g. the second Wednesday of the month)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data['Month'] = data['Date'].dt.month
data['Day'] = data['Date'].dt.day
data['Workday_N'] = np.busday_count(
                    data['Date'].values.astype('datetime64[M]'),
                    data['Date'].values.astype('datetime64[D]'))
data['Week_day'] = data['Date'].dt.weekday
data['Week_of_month'] = (data['Date'].dt.day
                         - data['Date'].dt.weekday - 2) // 7 + 2
data['Weekday_order'] = (data['Date'].dt.day + 6) // 7
data = data.set_index('Date')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
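&lt;p&gt;A quick sanity check of the &lt;code&gt;Weekday_order&lt;/code&gt; formula above: for day &lt;code&gt;d&lt;/code&gt; of the month, &lt;code&gt;(d + 6) // 7&lt;/code&gt; gives which occurrence of its own weekday the date is.&lt;/p&gt;

```python
def weekday_order(day):
    # Days 1-7 are the first occurrence of their weekday,
    # days 8-14 the second, and so on.
    return (day + 6) // 7

print(weekday_order(1))   # 1
print(weekday_order(7))   # 1
print(weekday_order(8))   # 2
print(weekday_order(29))  # 5, e.g. 2020-09-29 was the fifth Tuesday
```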



&lt;h2&gt;
  
  
  Training Machine learning model
&lt;/h2&gt;

&lt;p&gt;First, we need to split our dataset into two parts: train and test. Don’t forget to set the shuffle argument to False, because our goal is to create a forecast based on past events.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x_train, x_test, y_train, y_test = train_test_split(data.drop(['Release'], axis=1), data['Release'],
                 test_size=0.3, random_state=1, shuffle=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In general, shuffling helps to reduce overfitting by varying which observations are used for training. But that is not our case: we always need the full history of publication events in the training set.&lt;/p&gt;
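&lt;p&gt;With &lt;code&gt;shuffle=False&lt;/code&gt; the split is just a chronological cut. The same split can be written by hand:&lt;/p&gt;

```python
data = list(range(100))             # stand-in for 100 daily observations
test_size = 0.3
cut = round(len(data) * (1 - test_size))

# Training set is strictly earlier than the test set
train, test = data[:cut], data[cut:]

print(len(train), len(test))        # 70 30
```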

&lt;p&gt;In order to choose the best prediction model, we will test the following models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;XGBoost&lt;/li&gt;
&lt;li&gt;K-nearest Neighbors (KNN)&lt;/li&gt;
&lt;li&gt;RandomForest&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  XGBoost
&lt;/h3&gt;

&lt;p&gt;We will use XGBoost with tree base learners and the grid search method to choose the best parameters. Grid search tries every combination of the given parameter values and chooses the best one based on cross-validation.&lt;/p&gt;

&lt;p&gt;A drawback of this approach is the long computation time.&lt;/p&gt;

&lt;p&gt;Alternatively, random search can be used. It samples parameter values randomly for a given number of iterations and then chooses the best model found.&lt;/p&gt;

&lt;p&gt;However, when you have a large number of parameters, random search covers a relatively small share of the combinations, which makes finding a truly optimal one almost impossible.&lt;/p&gt;
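&lt;p&gt;For the grid below, the number of candidate combinations is the product of the list lengths (and each candidate is fit once per CV fold); a random search would sample only a fixed number of them:&lt;/p&gt;

```python
from itertools import product

grid_param = {"learning_rate": [0.01, 0.1],
              "n_estimators": [100, 150, 200],
              "alpha": [0.1, 0.5, 1],
              "max_depth": [2, 3, 4]}

# Every combination grid search will evaluate
combinations = list(product(*grid_param.values()))
print(len(combinations))      # 54 candidates (2 * 3 * 3 * 3)
print(len(combinations) * 4)  # 216 model fits with cv=4
```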

&lt;p&gt;To use grid search you need to specify the list of possible values for each parameter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DM_train = xgb.DMatrix(data=x_train, label=y_train)
grid_param = {"learning_rate": [0.01, 0.1],
              "n_estimators": [100, 150, 200],
              "alpha": [0.1, 0.5, 1],
              "max_depth": [2, 3, 4]}
model = xgb.XGBRegressor()
grid_mse = GridSearchCV(estimator=model, param_grid=grid_param,
                       scoring="neg_mean_squared_error",
                       cv=4, verbose=1)
grid_mse.fit(x_train, y_train)
print("Best parameters found: ", grid_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(grid_mse.best_score_)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, the best parameters for our XGBoost model are: &lt;code&gt;alpha = 0.5, n_estimators = 200, max_depth = 4, learning_rate = 0.1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let’s train the model with obtained parameters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;xgb_model = xgb.XGBClassifier(objective ='reg:squarederror',
                            colsample_bytree = 1,
                            learning_rate = 0.1,
                            max_depth = 4,
                            alpha = 0.5,
                            n_estimators = 200)
xgb_model.fit(x_train, y_train)
xgb_prediction = xgb_model.predict(x_test)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  K-nearest Neighbors (KNN)
&lt;/h3&gt;

&lt;p&gt;The K-nearest neighbors model is meant to be used when you are trying to find similarities between observations. This is exactly our case, because we are trying to find patterns in past release dates.&lt;/p&gt;

&lt;p&gt;The KNN algorithm has fewer parameters to tune, so it is simpler for those who have not used it before.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;knn = KNeighborsClassifier(n_neighbors = 3, algorithm = 'auto',
                           weights = 'distance')
knn.fit(x_train, y_train)
knn_prediction = knn.predict(x_test)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Random Forest
&lt;/h3&gt;

&lt;p&gt;Tuning the basic parameters of a random forest usually doesn’t take much time: you simply iterate over the possible number of estimators and the maximum depth of the trees, and choose optimal values using the elbow method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;random_forest = RandomForestClassifier(n_estimators=50,
                                       max_depth=10, random_state=1)
random_forest.fit(x_train, y_train)
rf_prediction = random_forest.predict(x_test)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Comparing the results
&lt;/h2&gt;

&lt;p&gt;We will use a confusion matrix to evaluate the performance of the trained models. It lets us compare models side by side and understand whether the parameters should be tuned any further.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;xgb_matrix = metrics.confusion_matrix(xgb_prediction, y_test)
print(f"""
Confusion matrix for XGBoost model:
TN:{xgb_matrix[0][0]}    FN:{xgb_matrix[0][1]}
FP:{xgb_matrix[1][0]}    TP:{xgb_matrix[1][1]}""")knn_matrix = metrics.confusion_matrix(knn_prediction, y_test)
print(f"""
Confusion matrix for KNN model:
TN:{knn_matrix[0][0]}    FN:{knn_matrix[0][1]}
FP:{knn_matrix[1][0]}    TP:{knn_matrix[1][1]}""")rf_matrix = metrics.confusion_matrix(rf_prediction, y_test)
print(f"""
Confusion matrix for Random Forest model:
TN:{rf_matrix[0][0]}    FN:{rf_matrix[0][1]}
FP:{rf_matrix[1][0]}    TP:{rf_matrix[1][1]}""")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, both XGBoost and RandomForest show good performance. Both were able to catch the pattern and predict dates correctly in most cases. However, both models made a mistake with the December 2020 release, because it breaks the release pattern.&lt;/p&gt;

&lt;p&gt;KNN is less accurate than the previous two: it failed to predict three dates correctly and missed five releases, so we do not proceed with it. In general, KNN works better if the data is normalized, so you can try to tune it if you want.&lt;/p&gt;
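&lt;p&gt;If you do want to retry KNN, min-max scaling is a simple way to normalize the features first. A sketch with NumPy on toy data (&lt;code&gt;sklearn.preprocessing.MinMaxScaler&lt;/code&gt; does the same):&lt;/p&gt;

```python
import numpy as np

X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Scale each column to [0, 1] so no feature dominates the distance metric
col_min, col_max = X.min(axis=0), X.max(axis=0)
X_scaled = (X - col_min) / (col_max - col_min)

print(X_scaled[1])  # [0.5 0.5]
```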

&lt;p&gt;Of the remaining two, XGBoost is overcomplicated for our initial goal in terms of hyperparameter tuning, so RandomForest is our choice.&lt;/p&gt;

&lt;p&gt;Now we need to create a DataFrame with future dates and use the trained RandomForest model to predict releases one year ahead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x_predict = pd.DataFrame(pd.date_range(date.today(), (date.today() +
            relativedelta(years=1)),freq='d'), columns=['Date'])
x_predict['Month'] = x_predict['Date'].dt.month
x_predict['Day'] = x_predict['Date'].dt.day
x_predict['Workday_N'] = np.busday_count(
                x_predict['Date'].values.astype('datetime64[M]'),
                x_predict['Date'].values.astype('datetime64[D]'))
x_predict['Week_day'] = x_predict['Date'].dt.weekday
x_predict['Week_of_month'] = (x_predict['Date'].dt.day -
                              x_predict['Date'].dt.weekday - 2)//7+2
x_predict['Weekday_order'] = (x_predict['Date'].dt.day + 6) // 7
x_predict = x_predict.set_index('Date')prediction = xgb_model.predict(x_predict)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s it: we created a forecast of release dates for the U.S. CB Consumer Confidence series one year ahead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;If you want to predict future dates of periodic events, think about meaningful features to create: they should capture all the patterns you can find in the history. As you can see, we did not spend much time on model tuning; even simple models can give good results if you use the right features.&lt;/p&gt;

&lt;p&gt;Thank you for reading till the end. I do hope it was helpful, please let me know if you spot any mistakes in the comments.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
