<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sergii Lischuk</title>
    <description>The latest articles on DEV Community by Sergii Lischuk (@leefrost).</description>
    <link>https://dev.to/leefrost</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F744749%2Fcdb59a7b-efb2-4272-b358-58a61eddfb04.jpeg</url>
      <title>DEV Community: Sergii Lischuk</title>
      <link>https://dev.to/leefrost</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/leefrost"/>
    <language>en</language>
    <item>
      <title>Apache Airflow. How to make the complex workflow as an easy job</title>
      <dc:creator>Sergii Lischuk</dc:creator>
      <pubDate>Sun, 20 Feb 2022 18:22:51 +0000</pubDate>
      <link>https://dev.to/leefrost/apache-airflow-how-to-make-the-complex-workflow-as-an-easy-job-4a0p</link>
      <guid>https://dev.to/leefrost/apache-airflow-how-to-make-the-complex-workflow-as-an-easy-job-4a0p</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;A couple of weeks ago I started working with this platform as part of a feature request. The feature involved GCP, observability, and tons of data to be processed. I was looking for something really powerful that would make the flow easy and clean to create and run. And we should not forget consistency, fault tolerance, and correct error handling as well.&lt;/p&gt;

&lt;p&gt;My research brought me to the topic of orchestration, and in particular to Apache Airflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why do you need orchestration?
&lt;/h2&gt;

&lt;p&gt;If you just have a simple task like exporting some data to Excel, you probably don't need orchestration at all. But if you work with data that brings you real profit after processing, or you deal with large amounts of messy data daily, it seems you are in the right place to start thinking about it.&lt;/p&gt;

&lt;p&gt;For example, if your company processes a big amount of data and profits by giving your customers good advice based on it, the workflow will look the same in almost all cases. Each night your providers drop files with raw data into an S3 bucket or Azure Blob storage. Next, you collect that data and aggregate it into a structured store (e.g., push it to a BigQuery table). Then you process it with complicated SQL scripts, aggregate it again, and patch any invalid data. After that, you need to validate the data against some external API (possibly even your own services), but the data is too big to process without parallelism or queue processing. And finally, after validation, you connect the result to a data-preview tool (like a Tableau dashboard) to show it to your customers.&lt;/p&gt;
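&lt;p&gt;A minimal pure-Python sketch of that nightly flow makes the step-by-step shape clear (the function names and numbers are illustrative only, not any real API):&lt;/p&gt;

```python
# The nightly flow above, sketched as plain Python functions run in order.
# All names and numbers here are illustrative, not a real API.
def collect_raw_files():
    # providers dropped raw files into bucket storage overnight
    return ['raw_1.csv', 'raw_2.csv']

def aggregate(files):
    # push raw data into a structured table (e.g. BigQuery)
    return {'rows': len(files) * 100}

def clean_and_patch(table):
    # complicated SQL scripts: re-aggregate and patch invalid rows
    table['rows'] -= 3
    return table

def validate(table):
    # check the data against an external API
    return table['rows'] > 0

def publish(table):
    # connect the result to a data-preview tool
    return 'dashboard updated ({} rows)'.format(table['rows'])

table = clean_and_patch(aggregate(collect_raw_files()))
result = publish(table) if validate(table) else 'validation failed'
```

Each step depends on the previous one finishing correctly, which is exactly the dependency chain an orchestrator manages for you.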

&lt;p&gt;As we can see, this process is not so easy at first glance. And there are many more cases (pipelines) to handle in real life.&lt;/p&gt;

&lt;p&gt;For this reason, you need a workflow manager. And for the last couple of years the de facto standard has been Apache Airflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  History
&lt;/h2&gt;

&lt;p&gt;Airflow was born as an internal project at Airbnb in 2014. It was open source from the start, so it was easy to contribute functionality via PRs as fast as possible. In 2016 the project moved to the Apache Incubator, and in 2019 Airflow became a top-level project of the Apache Software Foundation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Components
&lt;/h2&gt;

&lt;p&gt;Airflow is a Python project, so almost all of its functionality is Python code.&lt;/p&gt;

&lt;p&gt;To start working with Airflow, you need to provide a configuration. And the configuration strongly depends on the number of parallel tasks that will do the job.&lt;/p&gt;

&lt;p&gt;There are also required components, which will always be on your checklist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Metadata database - the database where Airflow saves all meta-information about current and past tasks, statuses, and results. I recommend Postgres here (a more stable and effective workflow), but there are configurations and connectors for MySQL, MSSQL, and SQLite as well.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scheduler - the system component that parses the files with pipeline definitions and pushes them to the &lt;code&gt;Executor&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Web Server - a Flask-based app running under gunicorn. Its main goal is to visualize the pipeline process and provide control over it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Executor - the component that actually runs the code (job) (see Execution)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are also components that are task-dependent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Triggerer - put simply, an event loop for async operators. Currently there are not many of them, so think twice before adding the &lt;code&gt;Triggerer&lt;/code&gt; component to your workflow&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Worker - a modified worker from the Celery lib; the small node where Celery can run your task.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Execution
&lt;/h2&gt;

&lt;p&gt;The Python code that describes the job must be executed somewhere. This part belongs to the &lt;code&gt;Executor&lt;/code&gt;. Airflow supports the following kinds of executors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;SequentialExecutor - runs code locally, in the main process of Airflow, one task at a time;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;LocalExecutor - runs code locally, but in separate OS processes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CeleryExecutor - does the job in a Celery worker (Celery lib)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DaskExecutor - in a Dask cluster&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;KubernetesExecutor - in Kubernetes pods&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In my experience, production setups are based on the Celery or Kubernetes executors. You need to keep this fact in mind, because you must be careful with dependencies between the tasks in a pipeline. Every task runs in its own isolated environment, quite possibly on different physical machines. So a sequence of tasks like "download file to disk" followed by "upload file to cloud storage" will not work correctly. You can find more detailed information &lt;a href="https://www.astronomer.io/guides/airflow-executors-explained" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;
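&lt;p&gt;A quick pure-Python sketch of the hand-off problem (the temp directory is a stand-in for S3/GCS/NFS; the task names are made up for illustration):&lt;/p&gt;

```python
# Why "download to local disk, then upload" breaks across workers: each task
# may run on a different machine, so hand files off via shared/object storage.
# Pure-Python sketch; the temp directory stands in for S3/GCS/NFS.
import pathlib
import tempfile

SHARED = pathlib.Path(tempfile.mkdtemp())  # pretend this is cloud storage

def download_task():
    # write to the shared location, not the worker's local disk
    target = SHARED / 'data.csv'
    target.write_text('a,b\n1,2\n')
    return str(target)  # pass the *shared* path to the next task

def upload_task(path):
    # a different worker can still read the shared path
    return pathlib.Path(path).read_text()

content = upload_task(download_task())
```

With a real multi-worker executor, the local filesystem of one worker is simply invisible to the next one, which is why the shared location is the only reliable hand-off point.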

&lt;p&gt;As you may see, Airflow is very customizable. The configuration can be tailored to match your requirements as closely as possible.&lt;/p&gt;

&lt;p&gt;In general, there are two widespread architectures: &lt;a href="http://site.clairvoyantsoft.com/setting-apache-airflow-cluster/" rel="noopener noreferrer"&gt;single-node and multi-node&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0vjinhpbln377kmc3ep2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0vjinhpbln377kmc3ep2.png" alt="Single node"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdmiiolloqrkq7rpr4o5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdmiiolloqrkq7rpr4o5.png" alt="Multi node"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;p&gt;There are several ways to install Apache Airflow. Let's check them.&lt;/p&gt;

&lt;h3&gt;
  
  
  - PIP package manager
&lt;/h3&gt;

&lt;p&gt;Not the easy way. First of all, you need to install all dependencies; after that, install and configure the DB (with SQLite you are restricted to the &lt;code&gt;SequentialExecutor&lt;/code&gt; only). A good practice is to initialize a Python virtual env and then start working with Airflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -m pip install apache-airflow
airflow webserver
airflow scheduler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  - Separated Docker images
&lt;/h3&gt;

&lt;p&gt;I found this useful when you are trying to run Airflow on bare-metal servers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run … postgres
docker run … apache/airflow scheduler
docker run … apache/airflow webserver
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  - Docker compose
&lt;/h3&gt;

&lt;p&gt;In my opinion, the cleanest and easiest way. You just need to create a docker-compose file with the whole configuration inside, so you will be able to reuse different variables and connections:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker compose up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  - Astronomer CLI
&lt;/h3&gt;

&lt;p&gt;I have not worked much with this tool, but it has a good community around it. They also have an internal registry for hooks/operators, which simplifies working with Airflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Base concept
&lt;/h2&gt;

&lt;p&gt;The main entity in this story is the DAG (directed acyclic graph) - the housekeeper of your tasks. The term is widespread; you will meet it in many different languages and tools.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6qx12684pvu4o3iv42dn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6qx12684pvu4o3iv42dn.png" alt="Tasks"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The nodes of this graph are &lt;code&gt;Task&lt;/code&gt;s, each an instance of an &lt;code&gt;Operator&lt;/code&gt;; the edges define the dependencies between them.&lt;/p&gt;
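&lt;p&gt;Under the hood, a DAG is just tasks plus dependency edges, and the scheduler executes tasks in a valid topological order. A sketch with the Python standard library (the task names are illustrative):&lt;/p&gt;

```python
# A DAG is tasks plus dependency edges; the scheduler runs tasks in a valid
# topological order. Sketch with the stdlib (task names are illustrative).
from graphlib import TopologicalSorter

# each task maps to the set of tasks it depends on
dag = {
    'download_data': set(),
    'pivot_data': {'download_data'},
    'upload_report': {'pivot_data'},
}
order = list(TopologicalSorter(dag).static_order())
# 'download_data' always precedes 'pivot_data', which precedes 'upload_report'
```

The "acyclic" part matters: if two tasks depended on each other, no valid execution order would exist, and `TopologicalSorter` (like Airflow) would refuse the graph.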

&lt;p&gt;All operators, in general, can be divided into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Action operators - perform some action (ReloadJobOperator, etc.)&lt;/li&gt;
&lt;li&gt;Transfer operators - migrate data from one place to another (S3ToGCPOperator)&lt;/li&gt;
&lt;li&gt;Sensor operators - wait for some event (BQTablePartitionSensor)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each run of a task in a pipeline is a &lt;code&gt;Task Instance&lt;/code&gt; - an instance of an operator bound to a specific point in time (when the operator started). You can also configure &lt;code&gt;Variables&lt;/code&gt; and &lt;code&gt;Connections&lt;/code&gt; - environment values that hold different connection strings, logins, etc. With the Web part, you can configure them in the UI.&lt;/p&gt;

&lt;p&gt;Last but not least - the &lt;code&gt;Hook&lt;/code&gt; - an interface to external services. Hooks are wrappers around popular libraries, APIs, and DBs. E.g., if you need to handle a connection to some SQL server, you can start thinking about an SqlServiceHook (and it already exists).&lt;/p&gt;

&lt;h3&gt;
  
  
  Create DAGs. Main moments
&lt;/h3&gt;

&lt;p&gt;First of all, you need some declarations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.python&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, let's create two functions: one to download the data and one to pivot it (do not forget to check the executor, to be sure these two tasks will run in the same place)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;download_data_fn&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
   &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
   &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;titanic.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;write_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;pivot_data_fn&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
   &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;titanic.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pivot_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Sex&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Pclass&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aggfunc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;titanic_pivoted.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And, the final step is to create DAG with execution order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;titanic_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;*/9 * * * *&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
   &lt;span class="n"&gt;download_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;download_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;download_data_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="p"&gt;)&lt;/span&gt;

   &lt;span class="n"&gt;pivot_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pivot_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pivot_data_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="p"&gt;)&lt;/span&gt;

   &lt;span class="n"&gt;download_data&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pivot_data&lt;/span&gt;

&lt;span class="c1"&gt;# variants:
# pivot_data &amp;lt;&amp;lt; download_data 
# download_data.set_downstream(pivot_data)
# pivot_data.set_upstream(download_data)
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The created file must be placed in the folder where all DAGs live; by default it is $AIRFLOW_HOME/dags. Once it is there, the scheduler will pick it up for execution, and an Executor will run it every 9 minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  XComs
&lt;/h3&gt;

&lt;p&gt;Sometimes we have dependencies between Task A and Task B. We want not only to run tasks one after another, but also to pass some results along, like pipes in a shell. For this purpose, we can use XComs.&lt;/p&gt;

&lt;p&gt;With XComs (cross-task communication) one task can write special metadata to the metadata DB and another can read it. We can take the previous example and modify it a little bit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;download_data_fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
   &lt;span class="n"&gt;filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;titanic.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
   &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
   &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;write_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
   &lt;span class="c1"&gt;#context['ti'].xcom_push(key='filename', value=filename) # option 1
&lt;/span&gt;   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt; &lt;span class="c1"&gt;# option 2
&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;pivot_data_fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ti&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
   &lt;span class="c1"&gt;# filename = ti.xcom_pull(task_ids=['download_data'], key='filename') # option 1
&lt;/span&gt;   &lt;span class="n"&gt;filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ti&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xcom_pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;download_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;return_value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# option 2
&lt;/span&gt;   &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pivot_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Sex&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Pclass&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aggfunc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;titanic_pivoted.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;titanic_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;*/9 * * * *&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
   &lt;span class="n"&gt;download_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;download_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;download_data_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;provide_context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="p"&gt;)&lt;/span&gt;

   &lt;span class="n"&gt;pivot_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pivot_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pivot_data_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;provide_context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="p"&gt;)&lt;/span&gt;

   &lt;span class="n"&gt;download_data&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pivot_data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we can see, there are different ways to use XCom objects, but you should keep in mind that the data must be small. If the data is big, you will waste time saving it to the DB and may hit the limits of the metadata DB. Secondly, Airflow is just an orchestrator and must not be used for data processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Downsides
&lt;/h2&gt;

&lt;p&gt;You need to know and keep in mind a lot of things to get good results. It's a complicated tool, but it does complex work. Also, in most cases you will have a local instance of Airflow for debugging and tracing, in addition to staging and prod environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Alternatives
&lt;/h2&gt;

&lt;p&gt;It's good to know that Airflow is not the only player on the market. There are &lt;a href="https://www.dagster.io/" rel="noopener noreferrer"&gt;Dagster&lt;/a&gt;, &lt;a href="https://github.com/spotify/luigi" rel="noopener noreferrer"&gt;Spotify Luigi&lt;/a&gt;, and others. They all have different pros and cons, so make sure you investigate the market well to choose the tool best suited to your tasks.&lt;/p&gt;

&lt;p&gt;That's all for today ;) I hope this article gives some clues and basics to those who are starting to work with Airflow and orchestration. Stay tuned!&lt;/p&gt;

</description>
      <category>airflow</category>
      <category>python</category>
      <category>etl</category>
      <category>beginners</category>
    </item>
    <item>
      <title>WebAssembly. How to make the web faster than light</title>
      <dc:creator>Sergii Lischuk</dc:creator>
      <pubDate>Wed, 10 Nov 2021 10:45:49 +0000</pubDate>
      <link>https://dev.to/leefrost/webassembly-how-to-make-the-web-faster-than-light-3bl2</link>
      <guid>https://dev.to/leefrost/webassembly-how-to-make-the-web-faster-than-light-3bl2</guid>
      <description>&lt;p&gt;Today is very important to work with the information in fast and understandable manner. If in case of desktop application situation is fine with it, in case of Web we get some troubles - all data are under control of JS, which is fast but not in the top of the performance charts. Here, on the scene, we meet WebAssembly.&lt;/p&gt;

&lt;h3&gt;
  
  
  The future is coming.
&lt;/h3&gt;

&lt;p&gt;Evolution is everywhere. Even in the web stack there are changes made to push the state of development to a new edge. We are involved in this process not only as spectators but as essential users - we got async/await, promises, iterators, etc. And since March 2017 (for Chrome) we can use WebAssembly directly in our web apps. But let's start from the beginning - "Why?", "What?", and "How?" are our best friends on our way as WebAssembly ambassadors.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is WebAssembly?
&lt;/h3&gt;

&lt;p&gt;WebAssembly (WASM) is a new binary format that allows us to run compiled code directly in our browsers.&lt;/p&gt;

&lt;h4&gt;
  
  
  Problem
&lt;/h4&gt;

&lt;p&gt;Why was it invented, and what problems does WASM solve? In general: our code should run faster in the browser. But that is not the whole problem - it consists of the following sub-problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Our code should be faster than JS (almost like native code on the CPU);&lt;/li&gt;
&lt;li&gt;Zero configuration - solution should be “out of the box” - no special installations, the only browser required;&lt;/li&gt;
&lt;li&gt;Security - the new technology should be safe and run inside a sandbox;&lt;/li&gt;
&lt;li&gt;Cross-platform - desktop, mobile, tablet;&lt;/li&gt;
&lt;li&gt;Easy to use and develop;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  What is wrong with JS?
&lt;/h4&gt;

&lt;p&gt;Nothing. But due to its design, it is hard to make it faster. A long history of development and the combination of an interpreter and a compiler at runtime make JS 'hardly predictable' in execution.&lt;/p&gt;

&lt;p&gt;For example, you have a function &lt;code&gt;foo(a, b)&lt;/code&gt;, and you call this function many times with numbers only. After some time, the interpreter pushes this code to the compiler, and the compiler produces machine code, which is super fast for calculation. But! If you then pass a string as a parameter to &lt;code&gt;foo(a, b)&lt;/code&gt;, the engine performs a 'de-optimization': the function is shifted back to the interpreter and the ready machine code is thrown away.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fly4z7o3ehodljxejztun.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fly4z7o3ehodljxejztun.png" alt="JS runtime"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How will WebAssembly help us?
&lt;/h3&gt;

&lt;p&gt;If web app performance is our main goal, we start with code optimizations. If that is not enough and we are limited by the JS engine, we can move the code responsible for the most demanding operations into a WASM module. We rewrite that part in C or Rust, and after compilation we get a &lt;em&gt;.wasm&lt;/em&gt; file. We put this file on the server and make it accessible from the browser. “OK, but how will it work in the browser?” is the right question to ask now. Inside our JS code, we request this module from the server. Once it is loaded and available, the JS engine can call methods from the &lt;em&gt;.wasm&lt;/em&gt; module just like functions from any other module. The code in the &lt;em&gt;.wasm&lt;/em&gt; module is executed in its own sandbox, and the result is returned back to JS. &lt;/p&gt;
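&lt;p&gt;A minimal sketch of that flow. In a real app you would fetch the compiled module from the server; here, to keep the example self-contained, the bytes of a tiny module exporting an &lt;code&gt;add&lt;/code&gt; function are inlined (the module contents and function name are illustrative):&lt;/p&gt;

```javascript
// In the browser you would typically load the module from the server:
//   const { instance } = await WebAssembly.instantiateStreaming(fetch("module.wasm"));
// Below, a tiny precompiled module exporting add(a, b) is inlined instead.
const wasmBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,       // magic "\0asm" + version 1
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00,                               // function section
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export "add"
  0x0a, 0x09, 0x01, 0x07, 0x00,
  0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b,                   // body: local a + local b
]);

WebAssembly.instantiate(wasmBytes).then(({ instance }) => {
  // The exported WASM function is called like any other JS function;
  // it runs in its own sandbox and returns the result back to JS.
  console.log(instance.exports.add(2, 3)); // 5
});
```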

&lt;p&gt;We can think of WASM modules like native modules in JS, except that the code inside a WASM module is not executed by the JS engine.&lt;/p&gt;

&lt;p&gt;WASM has a restriction: it is only accessible via JS. So here is the bottleneck: heavyweight operations execute faster, but we pay some cost for passing data in and getting results back.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusions
&lt;/h3&gt;

&lt;p&gt;WASM aims to fix the problems described above:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Speed: WASM executes at close to native machine-code speed on the CPU;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Efficiency: a binary format with fast parsing and compilation. All heavyweight operations can be hidden inside the WASM module;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Security: a sandboxed execution model;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;An open standard: WASM has its own format and specification, publicly available on the Internet;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Debuggability: the code inside a module can be debugged natively from the browser’s developer tools.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In my opinion, WASM is a great feature. Used wisely, it makes working with complicated calculations painless for us and for the browser as well. Apps that work with graphics or computer vision can become a native part of the web, and that is really good news.&lt;/p&gt;

&lt;p&gt;Stay tuned!&lt;/p&gt;

</description>
      <category>webassembly</category>
      <category>javascript</category>
      <category>beginners</category>
    </item>
    <item>
      <title>7 best practices for building containers</title>
      <dc:creator>Sergii Lischuk</dc:creator>
      <pubDate>Thu, 04 Nov 2021 15:24:38 +0000</pubDate>
      <link>https://dev.to/leefrost/7-best-practices-for-building-containers-2lf7</link>
      <guid>https://dev.to/leefrost/7-best-practices-for-building-containers-2lf7</guid>
<description>&lt;p&gt;Development has always been about evolution. The evolution of modern software development brings a lot of techniques and requirements: it’s hard to imagine today’s programming without high-level frameworks, containers, cloud computing, or specialized data storage (even when they are not strictly necessary). Having worked with some of them, I would like to share a few short notes about containerization, specifically about Docker containers.&lt;/p&gt;

&lt;h2&gt;
  
  
  7 best practices for building containers
&lt;/h2&gt;

&lt;p&gt;Kubernetes Engine is a great place to run your workloads at scale. But before being able to use Kubernetes, you need to containerize your applications. You can run most applications in a Docker container without too much hassle. However, effectively running those containers in production and streamlining the build process is another story. There are a number of things to watch out for that will make your security and operations teams happier. This post provides tips and best practices to help you effectively build containers.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Package a single application per container
&lt;/h3&gt;

&lt;p&gt;A container works best when a single application runs inside it. This application should have a single parent process. For example, do not run PHP and MySQL in the same container: it’s harder to debug, Linux signals will not be handled properly, you can’t horizontally scale the PHP containers, and so on. Running one application per container lets you tie the lifecycle of the application to that of the container.&lt;/p&gt;
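&lt;p&gt;As a sketch, the PHP and MySQL example above could be split into two containers with Docker Compose (service names and image tags are illustrative):&lt;/p&gt;

```yaml
# docker-compose.yml (illustrative): one application per container.
# Each service has its own lifecycle and can be scaled independently.
services:
  php:
    image: php:8-apache        # only the PHP application runs here
    ports:
      - "8080:80"
    depends_on:
      - db
  db:
    image: mysql:8             # MySQL runs in its own container
    environment:
      MYSQL_ROOT_PASSWORD: example
```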

&lt;h3&gt;
  
  
  2. Properly handle PID 1, signal handling, and zombie processes
&lt;/h3&gt;

&lt;p&gt;Kubernetes and Docker send Linux signals to your application inside the container to stop it. They send those signals to the process with the process identifier (PID) 1. If you want your application to stop gracefully when needed, you need to properly handle those signals.&lt;/p&gt;
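&lt;p&gt;One common gotcha, sketched below: the exec form of &lt;code&gt;ENTRYPOINT&lt;/code&gt;/&lt;code&gt;CMD&lt;/code&gt; (a JSON array) runs your application directly as PID 1, so it receives the signals, while the shell form wraps it in a shell that does not forward them (file names here are illustrative):&lt;/p&gt;

```dockerfile
# Illustrative Dockerfile fragment.
FROM python:3.5
COPY my_code src

# Exec form: the application itself is PID 1 and receives SIGTERM.
ENTRYPOINT ["python", "src/server.py"]

# Shell form would instead run `/bin/sh -c "python src/server.py"`,
# leaving a shell as PID 1 that swallows the signals.
# If your application cannot handle PID 1 duties (signal forwarding,
# reaping zombie processes), a minimal init such as tini can do it:
#   ENTRYPOINT ["tini", "--", "python", "src/server.py"]
```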

&lt;h3&gt;
  
  
  3. Optimize for the Docker build cache
&lt;/h3&gt;

&lt;p&gt;Docker can cache layers of your images to accelerate later builds. This is a very useful feature, but it introduces some behaviors that you need to take into account when writing your Dockerfiles. For example, you should add the source code of your application as late as possible in your Dockerfile so that the base image and your application’s dependencies get cached and aren’t rebuilt on every build.&lt;/p&gt;

&lt;p&gt;Take this Dockerfile as an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; python:3.5&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; my_code src&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;my_requirements
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should swap the last two lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; python:3.5&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;my_requirements
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; my_code src&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the new version, the layer produced by the pip command will be cached, so pip will not be rerun each time the source code changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Remove unnecessary tools
&lt;/h3&gt;

&lt;p&gt;Reducing the attack surface of your host system is always a good idea, and it’s much easier to do with containers than with traditional systems. Remove everything that the application doesn’t need from your container. Or better yet, include just your application in a "distroless" or scratch image. You should also, if possible, make the filesystem of the container read-only. This should get you some excellent feedback from your security team during your performance review.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Build the smallest image possible
&lt;/h3&gt;

&lt;p&gt;Who likes to download hundreds of megabytes of useless data? Aim to have the smallest images possible. This decreases download times, cold start times, and disk usage. You can use several strategies to achieve that: start with a minimal base image, leverage common layers between images and make use of Docker’s multi-stage build feature.&lt;/p&gt;
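&lt;p&gt;A multi-stage build, sketched below with illustrative image and path names: the first stage carries the full toolchain, and the final image contains only the compiled artifact (a minimal base such as a "distroless" or scratch image, as mentioned in the previous section):&lt;/p&gt;

```dockerfile
# Illustrative multi-stage build.
# Stage 1: full toolchain, used only to compile the application.
FROM golang:1.21 AS build
WORKDIR /src
COPY . .
RUN go build -o /app .

# Stage 2: minimal runtime image; only the binary is copied over,
# so the final image stays small and has a reduced attack surface.
FROM gcr.io/distroless/static
COPY --from=build /app /app
ENTRYPOINT ["/app"]
```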

&lt;h3&gt;
  
  
  6. Properly tag your images
&lt;/h3&gt;

&lt;p&gt;Tags are how users choose which version of your image they want to use. There are two main ways to tag your images: Semantic Versioning, or the Git commit hash of your application. Whichever you choose, document it and clearly set the expectations for the users of the image. Be careful: while users expect some tags (like “latest”) to move from one image to another, they expect other tags to be immutable, even if they are not technically so. For example, once you have tagged a specific version of your image with something like “1.2.3”, you should never move this tag.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Carefully consider whether to use a public image
&lt;/h3&gt;

&lt;p&gt;Using public images can be a great way to start working with a particular piece of software. However, using them in production can come with a set of challenges, especially in a high-constraint environment. You might need to control what’s inside them, or you might not want to depend on an external repository, for example. On the other hand, building your own images for every piece of software you use is not trivial, particularly because you need to keep up with the security updates of the upstream software. Carefully weigh the pros and cons of each for your particular use-case, and make a conscious decision.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
