<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Geazi Anc</title>
    <description>The latest articles on DEV Community by Geazi Anc (@geazi_anc).</description>
    <link>https://dev.to/geazi_anc</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F925522%2F0d3ba86c-67ae-45a2-97b5-5b49c18abebf.png</url>
      <title>DEV Community: Geazi Anc</title>
      <link>https://dev.to/geazi_anc</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/geazi_anc"/>
    <language>en</language>
    <item>
      <title>In this article, we will develop a very simple real-time data pipeline using Apache Flink together with version 3 of the Scala programming language, using Pub/Sub as the message broker 🚀</title>
      <dc:creator>Geazi Anc</dc:creator>
      <pubDate>Tue, 07 Jan 2025 15:11:42 +0000</pubDate>
      <link>https://dev.to/geazi_anc/nesse-artigo-iremos-desenvolver-uma-pipeline-de-dados-bem-simples-em-tempo-real-utilizando-o-457e</link>
      <guid>https://dev.to/geazi_anc/nesse-artigo-iremos-desenvolver-uma-pipeline-de-dados-bem-simples-em-tempo-real-utilizando-o-457e</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/geazi_anc" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F925522%2F0d3ba86c-67ae-45a2-97b5-5b49c18abebf.png" alt="geazi_anc"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/geazi_anc/engenharia-de-dados-com-scala-masterizando-o-processamento-de-dados-em-tempo-real-com-apache-flink-e-google-pubsub-m48" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Engenharia de Dados com Scala: masterizando o processamento de dados em tempo real com Apache Flink e Google Pub/Sub&lt;/h2&gt;
      &lt;h3&gt;Geazi Anc ・ Aug 9 '24&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#dataengineering&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#scala&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#braziliandevs&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#flink&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
    </item>
    <item>
      <title>New article alert! Data Engineering with Scala: mastering data processing with Apache Flink and Pub/Sub ❤️‍🔥</title>
      <dc:creator>Geazi Anc</dc:creator>
      <pubDate>Sat, 04 Jan 2025 00:14:28 +0000</pubDate>
      <link>https://dev.to/geazi_anc/new-article-alert-data-engineering-with-scala-mastering-data-processing-with-apache-flink-and-ach</link>
      <guid>https://dev.to/geazi_anc/new-article-alert-data-engineering-with-scala-mastering-data-processing-with-apache-flink-and-ach</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/geazi_anc" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F925522%2F0d3ba86c-67ae-45a2-97b5-5b49c18abebf.png" alt="geazi_anc"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/geazi_anc/data-engineering-with-scala-mastering-real-time-data-processing-with-apache-flink-and-google-pubsub-3b39" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Data Engineering with Scala: Mastering Real-Time Data Processing with Apache Flink and Google Pub/Sub&lt;/h2&gt;
      &lt;h3&gt;Geazi Anc ・ Oct 18 '24&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#dataengineering&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#scala&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#datascience&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#flink&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>scala</category>
      <category>dataengineering</category>
      <category>apacheflink</category>
      <category>pubsub</category>
    </item>
    <item>
      <title>Real-time air traffic data analysis with Spark Structured Streaming and Apache Kafka</title>
      <dc:creator>Geazi Anc</dc:creator>
      <pubDate>Mon, 28 Oct 2024 15:50:57 +0000</pubDate>
      <link>https://dev.to/geazi_anc/analise-de-dados-de-trafego-aereo-em-tempo-real-com-spark-structured-streaming-e-apache-kafka-2db5</link>
      <guid>https://dev.to/geazi_anc/analise-de-dados-de-trafego-aereo-em-tempo-real-com-spark-structured-streaming-e-apache-kafka-2db5</guid>
      <description>&lt;p&gt;Atualmente, vivemos em um mundo onde peta bytes de dados são gerados a cada segundo. Como tal, a análise e o processamento desses dados em tempo real torna-se mais do que essencial para uma empresa que busca gerar insights de negócios com mais precisão conforme dados e mais dados são produzidos.&lt;/p&gt;

&lt;p&gt;Today, we are going to build a real-time data analysis on top of fictitious air traffic data using Spark Structured Streaming and Apache Kafka. If you are not familiar with these technologies, I suggest reading the article I wrote introducing them in more detail, along with other concepts that will come up throughout this one. So don't forget to check it out 💚.&lt;/p&gt;


&lt;div class="ltag__link"&gt;
  &lt;a href="/geazi_anc" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F925522%2F0d3ba86c-67ae-45a2-97b5-5b49c18abebf.png" alt="geazi_anc"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="/geazi_anc/uma-breve-introducao-ao-processamento-de-dados-em-tempo-real-com-spark-structured-streaming-e-apache-kafka-5gh7" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Uma breve Introdução ao processamento de dados em tempo real com Spark Structured Streaming e Apache Kafka&lt;/h2&gt;
      &lt;h3&gt;Geazi Anc ・ Sep 29 '22&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#python&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#pyspark&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#dataengineering&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#apachekafka&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


&lt;p&gt;You can check out the complete project on my &lt;a href="https://github.com/geazi-anc/skyx" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Well then: imagine that you, a data engineer, work at an airline called SkyX, where data about air traffic is generated every second.&lt;/p&gt;

&lt;p&gt;You have been asked to build a dashboard that displays this flight data in real time: a ranking of the most visited destination cities, the cities most people depart from, and the aircraft that carry the most people around the world.&lt;/p&gt;

&lt;p&gt;These are the data generated for each flight (an example record is shown right after the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;aircraft_name: name of the aircraft. SkyX has only five aircraft available.&lt;/li&gt;
&lt;li&gt;from: city the aircraft is departing from. SkyX only operates flights between five cities around the world.&lt;/li&gt;
&lt;li&gt;to: destination city of the aircraft. As mentioned, SkyX only operates flights between five cities around the world.&lt;/li&gt;
&lt;li&gt;passengers: number of passengers the aircraft is carrying. Every SkyX aircraft carries between 50 and 100 people per flight.&lt;/li&gt;
&lt;/ul&gt;
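
&lt;p&gt;For illustration only, a single generated flight record could look like this (the values are hypothetical, drawn from the city and aircraft lists used by the producer later on):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative example of a single flight record (hypothetical values)
flight = {
    "aircraft_name": "Lockheed C-5 Galaxy",
    "from": "Tokyo, Japan",
    "to": "Rome, Italy",
    "passengers": 87,
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
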

&lt;p&gt;Below is the basic architecture of our project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Producer: responsible for generating the aircraft air traffic data and sending it to an Apache Kafka topic.&lt;/li&gt;
&lt;li&gt;Consumer: simply watches the data arriving in real time at the Apache Kafka topic.&lt;/li&gt;
&lt;li&gt;Data analysis: three dashboards that process and analyze, in real time, the data arriving at the Apache Kafka topic: an analysis of the cities that receive the most tourists; an analysis of the cities most people depart from to visit other cities; and an analysis of the SkyX aircraft that carry the most people between cities around the world.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setting up the development environment
&lt;/h2&gt;

&lt;p&gt;This tutorial assumes that you already have PySpark installed on your machine. If you don't, check the steps in the official &lt;a href="https://spark.apache.org/docs/latest/api/python/getting_started/install.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;
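
&lt;p&gt;If you want a quick sanity check (this step is not part of the original setup), you can print the installed PySpark version from Python; it should match the connector version we pass to spark-submit later on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Optional check: print the installed PySpark version
import pyspark

print(pyspark.__version__)  # e.g. 3.3.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
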

&lt;p&gt;As for Apache Kafka, we will run it in containers via Docker 🎉🐳.&lt;/p&gt;

&lt;p&gt;And, finally, we will use Python through a virtual environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Kafka in containers via Docker
&lt;/h3&gt;

&lt;p&gt;Without further ado, create a folder called &lt;strong&gt;skyx&lt;/strong&gt; and add a &lt;strong&gt;docker-compose.yml&lt;/strong&gt; file inside it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ mkdir skyx
$ cd skyx
$ touch docker-compose.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, add the following content to the docker-compose file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: '3.9'

services:
  zookeeper:
    image: confluentinc/cp-zookeeper:latest
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
    ports:
      - 2181:2181

  kafka:
    image: confluentinc/cp-kafka:latest
    depends_on:
      - zookeeper
    ports:
      - 29092:29092
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:29092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Done! We can now bring up our Kafka server. To do so, run the following commands in the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker compose up -d
$ docker compose ps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME                                COMMAND                  SERVICE             STATUS              PORTS
skyx-kafka-1       "/etc/confluent/dock…"   kafka               running             9092/tcp, 0.0.0.0:29092-&amp;gt;29092/tcp
skyx-zookeeper-1   "/etc/confluent/dock…"   zookeeper           running             2888/tcp, 0.0.0.0:2181-&amp;gt;2181/tcp, 3888/tcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Note: this tutorial uses version 2 of Docker Compose. That is why there is no hyphen between &lt;strong&gt;docker&lt;/strong&gt; and &lt;strong&gt;compose&lt;/strong&gt; ☺.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now we need to create a topic in Kafka that will store the data sent in real time by the producer. To do that, let's open a shell inside the Kafka container:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ docker compose exec kafka bash&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;And finally create the topic, called &lt;strong&gt;airtraffic&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ kafka-topics --create --topic airtraffic --bootstrap-server localhost:29092&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Created topic airtraffic.&lt;/code&gt;&lt;/p&gt;
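
&lt;p&gt;As an optional check (not part of the original article), you can also confirm from the host that the topic exists, using the same kafka-python library we install in the next section:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Optional: confirm that the "airtraffic" topic is visible from the host
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers="localhost:29092")
print("airtraffic" in consumer.topics())  # expected: True
consumer.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
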

&lt;h3&gt;
  
  
  Creating the virtual environment
&lt;/h3&gt;

&lt;p&gt;To develop our producer, that is, the application responsible for sending the air traffic data in real time to the Kafka topic, we need the &lt;a href="https://kafka-python.readthedocs.io/en/master/" rel="noopener noreferrer"&gt;kafka-python&lt;/a&gt; library. kafka-python is a community-developed library that lets us build producers and consumers that integrate with Apache Kafka.&lt;/p&gt;

&lt;p&gt;First, let's create a file called &lt;strong&gt;requirements.txt&lt;/strong&gt; and add the following dependency to it:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kafka-python&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Second, let's create a virtual environment and install the dependencies listed in requirements.txt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python -m venv venv
$ venv\Scripts\activate  # on Linux/macOS: source venv/bin/activate
$ pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Done! Our environment is now ready for development 🚀.&lt;/p&gt;

&lt;h2&gt;
  
  
  Developing the producer
&lt;/h2&gt;

&lt;p&gt;Now let's create our producer. As mentioned, the producer will be responsible for sending the air traffic data to the newly created Kafka topic.&lt;/p&gt;

&lt;p&gt;As also stated in the architecture, SkyX only operates flights between five cities around the world and has only five aircraft available 😹. It is worth noting that each aircraft carries between 50 and 100 people.&lt;/p&gt;

&lt;p&gt;Note that the data is generated randomly and sent to the topic in JSON format at an interval of 1 to 6 seconds 😉.&lt;/p&gt;

&lt;p&gt;Let's go! Create a subdirectory called &lt;strong&gt;src&lt;/strong&gt; and another subdirectory called &lt;strong&gt;kafka&lt;/strong&gt;. Inside the kafka directory, create a file called &lt;strong&gt;airtraffic_producer.py&lt;/strong&gt; and add the following code to it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import random
from json import dumps
from time import sleep
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:29092",
    value_serializer=lambda x: dumps(x).encode("utf-8")
)

while True:
    cities = [
        "São Paulo, Brazil",
        "Tokyo, Japan",
        "Berlin, Germany",
        "Rome, Italy",
        "Seoul, South Korea"
    ]

    aircraft_names = [
        "Convair B-36 Peacemaker",
        "Lockheed C-5 Galaxy",
        "Northrop B-2 Spirit",
        "Boeing B-52 Stratofortress",
        "McDonnell XF-85 Goblin"
    ]

    aircraft = {
        "aircraft_name": random.choice(aircraft_names),
        "from": random.choice(cities),
        "to": random.choice(cities),
        "passengers": random.randint(50, 101)
    }

    future = producer.send("airtraffic", value=aircraft)
    print(future.get(timeout=60))

    sleep(random.randint(1, 6))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Done! We have built our producer. Run it and leave it running for a while.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ python airtraffic_producer.py&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Developing the consumer
&lt;/h2&gt;

&lt;p&gt;Now let's build our consumer. This will be a very simple application: it will just print to the terminal, in real time, the data arriving at the Kafka topic.&lt;/p&gt;

&lt;p&gt;Still inside the kafka directory, create a file called &lt;strong&gt;airtraffic_consumer.py&lt;/strong&gt; and add the following code to it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from json import loads
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "airtraffic",
    bootstrap_servers="localhost:29092",
    value_deserializer=lambda x: loads(x.decode("utf-8"))
)

for msg in consumer:
    print(msg.value)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See, I told you it was quite simple. Run it and watch the data being displayed in real time as the producer sends it to the topic.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ python airtraffic_consumer.py&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data analysis: cities that receive the most tourists
&lt;/h2&gt;

&lt;p&gt;Now we start our data analysis. At this point, we will build a dashboard, an application, that displays in real time a ranking of the cities that receive the most tourists. In other words, we will group the data by the &lt;strong&gt;to&lt;/strong&gt; column and sum the &lt;strong&gt;passengers&lt;/strong&gt; column. Quite simple!&lt;/p&gt;
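
&lt;p&gt;To make the aggregation concrete before the streaming version, here is a minimal batch sketch (illustrative only, not part of the project) of the same groupBy/sum on a static DataFrame:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative batch version of the aggregation used by the dashboard:
# group flights by destination and sum the passengers per city.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TouristsBatchSketch").getOrCreate()

flights = spark.createDataFrame(
    [("Rome, Italy", 80), ("Tokyo, Japan", 65), ("Rome, Italy", 90)],
    ["to", "passengers"],
)

tourists = (flights.groupBy("to")
            .agg({"passengers": "sum"})
            .withColumnRenamed("sum(passengers)", "tourists")
            .withColumnRenamed("to", "city")
            .orderBy("tourists", ascending=False)
            )

tourists.show()  # Rome, Italy -&amp;gt; 170; Tokyo, Japan -&amp;gt; 65
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
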

&lt;p&gt;To do that, inside the src directory create a subdirectory called &lt;strong&gt;dashboards&lt;/strong&gt; and create a file called &lt;strong&gt;tourists_analysis.py&lt;/strong&gt;. Then add the following code to it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("Tourists Analysis")
         .getOrCreate()
         )

df1 = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:29092")
       .option("subscribe", "airtraffic")
       .option("startingOffsets", "earliest")
       .load()
       )

df2 = df1.selectExpr("CAST(value AS STRING)")

aircraft = {
    "aircraft_name": "",
    "from": "",
    "to": "",
    "passengers": 0
}

schema = F.schema_of_json(F.lit(json.dumps(aircraft)))

airtraffic = (df2.select(F.from_json(df2.value, schema).alias("jsondata"))
              .select("jsondata.*")
              )

tourists = (airtraffic.groupBy("to")
            .agg({"passengers": "sum"})
            .withColumnRenamed("sum(passengers)", "tourists")
            .withColumnRenamed("to", "city")
            .orderBy("tourists", ascending=False)
            )

(tourists.writeStream
 .format("console")
 .outputMode("complete")
 .start()
 .awaitTermination()
 )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can now run our file through spark-submit. But hold on! When integrating PySpark with Kafka, we have to run spark-submit a bit differently: we must pass the Kafka connector package, matching our installed Spark version and Scala build, through the --packages parameter.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If this is the first time you are integrating Apache Spark with Apache Kafka, spark-submit may take a while to run, because it first needs to download the required packages.&lt;/p&gt;
&lt;/blockquote&gt;
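
&lt;p&gt;Alternatively (this is an assumption on my part, not something the article requires), the same package can be declared inside the SparkSession builder via the spark.jars.packages configuration, so the script can also be launched with a plain python command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Alternative to passing --packages on the command line:
# declare the Kafka connector in the SparkSession configuration.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("Tourists Analysis")
         .config("spark.jars.packages",
                 "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0")
         .getOrCreate()
         )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
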

&lt;p&gt;Make sure the producer is still running so that we can see the analysis happening in real time. Inside the dashboards directory, run the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 tourists_analysis.py&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+------------------+--------+
|              city|tourists|
+------------------+--------+
|       Rome, Italy|    2628|
|      Tokyo, Japan|    2467|
|   Berlin, Germany|    2204|
|Seoul, South Korea|    1823|
| São Paulo, Brazil|    1719|
+------------------+--------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Data analysis: cities most people depart from
&lt;/h2&gt;

&lt;p&gt;This analysis is very similar to the previous one. However, instead of analyzing in real time the cities that receive the most tourists, we will analyze the cities most people depart from. To do that, create a file called &lt;strong&gt;leavers_analysis.py&lt;/strong&gt; and add the following code to it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("Leavers Analysis")
         .getOrCreate()
         )

df1 = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:29092")
       .option("subscribe", "airtraffic")
       .option("startingOffsets", "earliest")
       .load()
       )

df2 = df1.selectExpr("CAST(value AS STRING)")

aircraft = {
    "aircraft_name": "",
    "from": "",
    "to": "",
    "passengers": 0
}

schema = F.schema_of_json(F.lit(json.dumps(aircraft)))

airtraffic = (df2.select(F.from_json(df2.value, schema).alias("jsondata"))
              .select("jsondata.*")
              )

leavers = (airtraffic.groupBy("from")
           .agg({"passengers": "sum"})
           .withColumnRenamed("sum(passengers)", "leavers")
           .withColumnRenamed("from", "city")
           .orderBy("leavers", ascending=False)
           )

(leavers.writeStream
 .format("console")
 .outputMode("complete")
 .start()
 .awaitTermination()
 )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure the producer is still running so that we can see the analysis happening in real time. Inside the dashboards directory, run the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 leavers_analysis.py&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+------------------+-------+
|              city|leavers|
+------------------+-------+
|      Tokyo, Japan|   2673|
|   Berlin, Germany|   2305|
| São Paulo, Brazil|   2096|
|Seoul, South Korea|   1895|
|       Rome, Italy|   1872|
+------------------+-------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Data analysis: aircraft that carry the most passengers
&lt;/h2&gt;

&lt;p&gt;This analysis is even simpler than the previous ones. We will analyze, in real time, the aircraft that carry the most passengers between the cities around the world. Create a file called &lt;strong&gt;aircrafts_analysis.py&lt;/strong&gt; and add the following code to it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("Aircrafts Analysis")
         .getOrCreate()
         )

df1 = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:29092")
       .option("subscribe", "airtraffic")
       .option("startingOffsets", "earliest")
       .load()
       )

df2 = df1.selectExpr("CAST(value AS STRING)")

aircraft = {
    "aircraft_name": "",
    "from": "",
    "to": "",
    "passengers": 0
}

schema = F.schema_of_json(F.lit(json.dumps(aircraft)))

airtraffic = (df2.select(F.from_json(df2.value, schema).alias("jsondata"))
              .select("jsondata.*")
              )

aircrafts = (airtraffic.groupBy("aircraft_name")
             .agg({"passengers": "sum"})
             .withColumnRenamed("sum(passengers)", "total_passengers")
             .orderBy("total_passengers", ascending=False)
             )

(aircrafts.writeStream
 .format("console")
 .outputMode("complete")
 .start()
 .awaitTermination()
 )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure the producer is still running so that we can see the analysis happening in real time. Inside the dashboards directory, run the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 aircrafts_analysis.py&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------------------+----------------+
|       aircraft_name|total_passengers|
+--------------------+----------------+
|McDonnell XF-85 G...|            2533|
|Boeing B-52 Strat...|            2345|
|Convair B-36 Peac...|            2012|
| Lockheed C-5 Galaxy|            2002|
| Northrop B-2 Spirit|            1949|
+--------------------+----------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;And that's a wrap, folks! In this article we built a real-time data analysis on top of fictitious air traffic data using Spark Structured Streaming and Apache Kafka.&lt;/p&gt;

&lt;p&gt;To do that, we built a producer that sends this data in real time to the Kafka topic, and then we built three dashboards to analyze the data in real time.&lt;/p&gt;

&lt;p&gt;I hope you enjoyed it. See you next time 💚.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>python</category>
      <category>braziliandevs</category>
      <category>spark</category>
    </item>
    <item>
      <title>Data Engineering with Scala: Mastering Real-Time Data Processing with Apache Flink and Google Pub/Sub</title>
      <dc:creator>Geazi Anc</dc:creator>
      <pubDate>Fri, 18 Oct 2024 00:28:07 +0000</pubDate>
      <link>https://dev.to/geazi_anc/data-engineering-with-scala-mastering-real-time-data-processing-with-apache-flink-and-google-pubsub-3b39</link>
      <guid>https://dev.to/geazi_anc/data-engineering-with-scala-mastering-real-time-data-processing-with-apache-flink-and-google-pubsub-3b39</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Note: this article is also available in &lt;a href="https://dev.to/geazi_anc/engenharia-de-dados-com-scala-masterizando-o-processamento-de-dados-em-tempo-real-com-apache-flink-e-google-pubsub-m48"&gt;Brazilian Portuguese&lt;/a&gt; 🌎&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://flink.apache.org/" rel="noopener noreferrer"&gt;Apache Flink&lt;/a&gt; is a distributed data processing framework for both batch and streaming processing. It can be used to develop event-driven applications; perform batch and streaming data analysis; and can be used to develop ETL data pipelines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/pubsub/docs/overview?hl=en" rel="noopener noreferrer"&gt;Pub/Sub&lt;/a&gt; is a scalable, asynchronous messaging service from Google that separates the services that produce messages from the services that process them. It is used for streaming analytics and data integration pipelines to load and distribute data, and is equally effective as a messaging middleware for service integration or as a queue to load tasks in parallel.&lt;/p&gt;

&lt;p&gt;In this article, we will develop a very simple real-time data pipeline using Apache Flink together with version 3 of the Scala programming language, using Pub/Sub as the message broker. Before we begin, let's set some expectations.&lt;/p&gt;

&lt;p&gt;First, this article is not intended to be an introductory article to Apache Flink. If you have never heard of it before, I suggest you read the &lt;a href="https://nightlies.apache.org/flink/flink-docs-stable/docs/try-flink/local_installation/" rel="noopener noreferrer"&gt;first steps&lt;/a&gt; from the official documentation. Read it without fear! The Apache Flink documentation is excellent!&lt;/p&gt;

&lt;p&gt;Second, although Apache Flink has an official API for the &lt;a href="https://nightlies.apache.org/flink/flink-docs-release-1.20/api/scala/org/apache/flink/api/scala/index.html" rel="noopener noreferrer"&gt;Scala&lt;/a&gt; language, it has been deprecated and will be removed in future versions. You can learn more about this &lt;a href="https://cwiki.apache.org/confluence/display/FLINK/FLIP-265+Deprecate+and+remove+Scala+API+support" rel="noopener noreferrer"&gt;here&lt;/a&gt;. However, since Scala is a JVM-based language and Apache Flink is developed in Java, it is still perfectly possible to use Scala for development with Apache Flink, just through the Java APIs. Yes, I turned my nose up at that too. Nobody deserves that! But, to make our lives easier, we will use the &lt;a href="https://github.com/flink-extended/flink-scala-api" rel="noopener noreferrer"&gt;Flink Scala API&lt;/a&gt; library, a fork of the official Flink Scala API that is maintained entirely by the community. I highly recommend it!&lt;/p&gt;

&lt;p&gt;Third and finally, we will develop a very simple real-time data pipeline. The goal is not to provide a complex example, but rather a guide to working with Apache Flink and the Scala language, plus Pub/Sub as a message broker. I had a hard time finding a decent article that used these three technologies together.&lt;/p&gt;

&lt;p&gt;What will we see in this article?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
Data Engineering with Scala: Mastering Real-Time Data Processing with Apache Flink and Google Pub/Sub

&lt;ul&gt;
&lt;li&gt;1. Problem definition&lt;/li&gt;
&lt;li&gt;2. Setup&lt;/li&gt;
&lt;li&gt;2.1. Creating topics and subscriptions in Pub/Sub&lt;/li&gt;
&lt;li&gt;2.2. Installing dependencies&lt;/li&gt;
&lt;li&gt;3. Data pipeline development&lt;/li&gt;
&lt;li&gt;3.1. Business models and requirements&lt;/li&gt;
&lt;li&gt;3.2. Defining serializers and deserializers&lt;/li&gt;
&lt;li&gt;3.3. Pipeline arguments&lt;/li&gt;
&lt;li&gt;3.4. Pub/Sub source&lt;/li&gt;
&lt;li&gt;3.5. Pub/Sub Sink&lt;/li&gt;
&lt;li&gt;3.6. Data pipeline and application of business requirements&lt;/li&gt;
&lt;li&gt;4. Running the data pipeline&lt;/li&gt;
&lt;li&gt;5. Conclusion&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Now, enough talk. Let's get started!&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Problem definition
&lt;/h2&gt;

&lt;p&gt;A web application is responsible for receiving the initial registration of new customers from a large Brazilian retail company called My Awesome Company, hereinafter MAC, &lt;em&gt;mac.br&lt;/em&gt;. The application sends the initial registration of new customers in real time to a Pub/Sub topic, and you must develop a data pipeline that processes this data in real time, enriches the initial customer registration with some relevant business information and, finally, sends it to a final topic in Pub/Sub. Pretty simple, right?&lt;/p&gt;

&lt;p&gt;The web application sends the following payload to Pub/Sub:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fullName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"birthDate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fullName is the customer's full name (duh!);&lt;/li&gt;
&lt;li&gt;birthDate is the customer's date of birth, in the format &lt;em&gt;year-month-day&lt;/em&gt;;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The data pipeline must enrich this basic customer registration with some relevant business information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It is necessary to split the client's full name into &lt;em&gt;first name&lt;/em&gt; and &lt;em&gt;last name&lt;/em&gt;;&lt;/li&gt;
&lt;li&gt;The client's current age must be calculated based on their date of birth;&lt;/li&gt;
&lt;li&gt;If the customer is over 30 years old, registration should not be carried out and the customer should be listed as &lt;em&gt;inactive&lt;/em&gt;;&lt;/li&gt;
&lt;li&gt;Add a &lt;em&gt;createdAt&lt;/em&gt; field, related to the customer creation date.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this understanding, let's start &lt;em&gt;coding&lt;/em&gt;!&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Setup
&lt;/h2&gt;

&lt;p&gt;Hold on! Let's not start coding yet 🙍🏼. We'll need to configure a few things first:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creation of topics and subscriptions in Pub/Sub;&lt;/li&gt;
&lt;li&gt;Installation of the dependencies required for the data pipeline to work;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2.1. Creating topics and subscriptions in Pub/Sub
&lt;/h3&gt;

&lt;p&gt;To create topics and subscriptions in Pub/Sub, we will be using the official Google Cloud CLI, &lt;em&gt;gcloud&lt;/em&gt;. Follow &lt;a href="https://cloud.google.com/sdk/docs/install?hl=en" rel="noopener noreferrer"&gt;these instructions&lt;/a&gt; if you do not already have the CLI properly configured on your machine.&lt;/p&gt;

&lt;p&gt;Now, what topics need to be created?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;created-customer: the topic where the MAC web application will send the payloads relating to the initial customer registrations;&lt;/li&gt;
&lt;li&gt;registered-customer: the final topic where our data pipeline will send customers with their respective registrations duly enriched;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's start with the &lt;em&gt;created-customer&lt;/em&gt; topic. For this topic, we also need to create a standard subscription of type &lt;em&gt;pull&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# creating the topic created-customer&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;gcloud pubsub topics create created-customer
Created topic &lt;span class="o"&gt;[&lt;/span&gt;projects/my-project-id/topics/created-customer].

&lt;span class="c"&gt;# now, creating a pull subscription to the topic created-customer&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;gcloud pubsub subscriptions create created-customer-sub &lt;span class="nt"&gt;--topic&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;created-customer
Created subscription &lt;span class="o"&gt;[&lt;/span&gt;projects/my-project-id/subscriptions/created-customer-sub].
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's create the &lt;em&gt;registered-customer&lt;/em&gt; topic. For this topic, we also need to create a default subscription of type &lt;em&gt;pull&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# creating the registered-customer topic&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;gcloud pubsub topics create registered-customer
Created topic &lt;span class="o"&gt;[&lt;/span&gt;projects/my-project-id/topics/registered-customer].

&lt;span class="c"&gt;# now, creating a pull subscription to the registered-customer topic&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;gcloud pubsub subscriptions create registered-customer-sub &lt;span class="nt"&gt;--topic&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;registered-customer
Created subscription &lt;span class="o"&gt;[&lt;/span&gt;projects/my-project-id/subscriptions/registered-customer-sub].
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2.2. Installing dependencies
&lt;/h3&gt;

&lt;p&gt;Now yes! Time to code! 🎉&lt;/p&gt;

&lt;p&gt;First of all, the development of our data pipeline will not be based on an SBT project. We will use the &lt;a href="https://scala-cli.virtuslab.org/" rel="noopener noreferrer"&gt;Scala CLI&lt;/a&gt;, a command-line tool that lets us compile, run, test, and package Scala code. With the Scala CLI, we can write &lt;a href="https://scala-cli.virtuslab.org/docs/guides/scripting/scripts" rel="noopener noreferrer"&gt;Scala scripts&lt;/a&gt; in a very practical and fast way!&lt;/p&gt;

&lt;p&gt;To install dependencies, we will use a Scala CLI feature called &lt;a href="https://scala-cli.virtuslab.org/docs/guides/introduction/using-directives" rel="noopener noreferrer"&gt;directives&lt;/a&gt;. Directives are a way of defining configuration inside the source code itself, without needing a build tool such as SBT. One of the directives we will use declares the dependencies our pipeline needs, namely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apache Flink Client (Apache Flink's own dependency);&lt;/li&gt;
&lt;li&gt;Flink Scala API (a community-maintained library that allows us to develop code in Apache Flink with Scala APIs);&lt;/li&gt;
&lt;li&gt;Flink Connector GCP PubSub: the official Apache Flink connector that allows us to send and receive Pub/Sub messages;&lt;/li&gt;
&lt;li&gt;Toolkit: a set of useful libraries for everyday tasks, including the uPickle library, used to serialize and deserialize JSON;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To begin, create a directory called &lt;em&gt;br-mac&lt;/em&gt;, and a file called &lt;em&gt;Customers.sc&lt;/em&gt; inside it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;br-mac
...
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;br-mac
...
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;touch &lt;/span&gt;Customers.sc
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, inside the &lt;em&gt;Customers.sc&lt;/em&gt; file, add the following lines that are related to the directives for installing the necessary dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;//&amp;gt; using toolkit default
//&amp;gt; using dep "org.flinkextended::flink-scala-api:1.18.1_1.1.6"
//&amp;gt; using dep "org.apache.flink:flink-clients:1.18.1"
//&amp;gt; using dep org.apache.flink:flink-connector-gcp-pubsub:3.1.0-1.18
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And add the imports that will be used later:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import br.mac.customers.models.*
import br.mac.customers.serializations.*
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.connectors.gcp.pubsub.{PubSubSink, PubSubSource}
import org.apache.flinkx.api.*
import org.apache.flinkx.api.serializers.*
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Done! Dependencies and imports have been defined. Let's move on.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Data pipeline development
&lt;/h2&gt;

&lt;p&gt;Now it's time to build the data pipeline itself with Apache Flink! This build will consist of six parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business models and requirements;&lt;/li&gt;
&lt;li&gt;Development of serializers and deserializers;&lt;/li&gt;
&lt;li&gt;Using &lt;em&gt;ParameterTool&lt;/em&gt; so that we can get some relevant information for our pipeline through command line arguments;&lt;/li&gt;
&lt;li&gt;Development of PubSubSource so that Apache Flink can read data from the Pub/Sub created-customer topic;&lt;/li&gt;
&lt;li&gt;Development of PubSubSink so that Apache Flink can send the processed data to the registered-customer topic in Pub/Sub;&lt;/li&gt;
&lt;li&gt;Development of the data pipeline core applying business requirements;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's go?&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1. Business models and requirements
&lt;/h3&gt;

&lt;p&gt;Business models are the information that we will receive and send to Pub/Sub. As mentioned before, we will receive a payload in JSON format from Pub/Sub, and send a payload to Pub/Sub also in JSON format. We need to model this payload in Scala classes.&lt;/p&gt;

&lt;p&gt;Since these classes are representations of JSON payloads, we will use the uPickle library to serialize and deserialize them to and from JSON. If you are not familiar with the uPickle library, I highly recommend reading its &lt;a href="https://com-lihaoyi.github.io/upickle/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;. It is also an excellent library!&lt;/p&gt;

&lt;p&gt;An example of a payload that we will receive, related to the initial customer registration, is the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fullName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"John Doe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"birthDate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1995-01-01"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An example of a payload that we will send to Pub/Sub, related to the final customer registration, is the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"firstName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"John"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"lastName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Doe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"age"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;29&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"isActive"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"createdAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-08-08T18:07:44.167635713Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create another file called &lt;em&gt;Models.scala&lt;/em&gt;. Note that this time the file extension is &lt;em&gt;.scala&lt;/em&gt;, not &lt;em&gt;.sc&lt;/em&gt;. This is because this file is a Scala module, not a Scala script.&lt;/p&gt;

&lt;p&gt;In the file, add the following lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="nn"&gt;br.mac.customers.models&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;upickle.default.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;

&lt;span class="k"&gt;final&lt;/span&gt; &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CreatedCustomer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fullName&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;birthDate&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;derives&lt;/span&gt; &lt;span class="nc"&gt;ReadWriter&lt;/span&gt;
&lt;span class="k"&gt;final&lt;/span&gt; &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RegisteredCustomer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;firstName&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lastName&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;isActive&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Boolean&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;createdAt&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;derives&lt;/span&gt; &lt;span class="nc"&gt;ReadWriter&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Done! However, we are not finished with our models yet. We need to define some methods so that we can satisfy the business requirements that were defined, which are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It is necessary to split the client's full name into &lt;em&gt;first name&lt;/em&gt; and &lt;em&gt;last name&lt;/em&gt;;&lt;/li&gt;
&lt;li&gt;The customer's current age must be calculated based on their date of birth;&lt;/li&gt;
&lt;li&gt;If the customer is over 30 years old, registration should not be carried out and the customer should be listed as &lt;em&gt;inactive&lt;/em&gt;;&lt;/li&gt;
&lt;li&gt;Add a &lt;em&gt;createdAt&lt;/em&gt; field, related to the customer creation date.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first and second business requirements can be implemented as methods on the &lt;em&gt;CreatedCustomer&lt;/em&gt; class. The fourth can be handled by a constructor for the &lt;em&gt;RegisteredCustomer&lt;/em&gt; class that creates an instance with the &lt;em&gt;createdAt&lt;/em&gt; attribute set to the current time and the &lt;em&gt;isActive&lt;/em&gt; attribute defaulting to &lt;em&gt;true&lt;/em&gt;. The third requirement, the over-30 rule, will be addressed in the data pipeline itself.&lt;/p&gt;

&lt;p&gt;For the first and second requirements, we need a couple of imports in the &lt;em&gt;Models.scala&lt;/em&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.time.temporal.ChronoUnit&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.time.&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nc"&gt;Instant&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;LocalDate&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And we can now define the methods in the &lt;em&gt;CreatedCustomer&lt;/em&gt; class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;final&lt;/span&gt; &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CreatedCustomer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fullName&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;birthDate&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;derives&lt;/span&gt; &lt;span class="nc"&gt;ReadWriter&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt;
&lt;span class="kt"&gt;def&lt;/span&gt; &lt;span class="kt"&gt;firstName:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;fullName&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;split&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;" "&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;head&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lastName&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;fullName&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;split&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;" "&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;last&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;age&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;ChronoUnit&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;YEARS&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;between&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;LocalDate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;parse&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;birthDate&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="nv"&gt;LocalDate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;now&lt;/span&gt;&lt;span class="o"&gt;()).&lt;/span&gt;&lt;span class="py"&gt;toInt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, let's declare the constructor for the &lt;em&gt;RegisteredCustomer&lt;/em&gt; class. We'll do this by defining the &lt;em&gt;apply&lt;/em&gt; method on the &lt;em&gt;companion object&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;object&lt;/span&gt; &lt;span class="nc"&gt;RegisteredCustomer&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt;
&lt;span class="kt"&gt;def&lt;/span&gt; &lt;span class="kt"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;firstName:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;lastName:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;RegisteredCustomer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
&lt;span class="nc"&gt;RegisteredCustomer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;firstName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lastName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;Instant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;now&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="py"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So the final code for the &lt;em&gt;Models.scala&lt;/em&gt; file looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="nn"&gt;br.mac.customers.models&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;upickle.default.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.time.temporal.ChronoUnit&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.time.&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nc"&gt;Instant&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;LocalDate&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;final&lt;/span&gt; &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CreatedCustomer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fullName&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;birthDate&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;derives&lt;/span&gt; &lt;span class="nc"&gt;ReadWriter&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt;
&lt;span class="kt"&gt;def&lt;/span&gt; &lt;span class="kt"&gt;firstName:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;fullName&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;split&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;" "&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;head&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lastName&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;fullName&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;split&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;" "&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;last&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;age&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;ChronoUnit&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;YEARS&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;between&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;LocalDate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;parse&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;birthDate&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="nv"&gt;LocalDate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;now&lt;/span&gt;&lt;span class="o"&gt;()).&lt;/span&gt;&lt;span class="py"&gt;toInt&lt;/span&gt;

&lt;span class="k"&gt;final&lt;/span&gt; &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RegisteredCustomer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;firstName&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lastName&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;isActive&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Boolean&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;createdAt&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;derives&lt;/span&gt; &lt;span class="nc"&gt;ReadWriter&lt;/span&gt;

&lt;span class="k"&gt;object&lt;/span&gt; &lt;span class="nc"&gt;RegisteredCustomer&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt;
&lt;span class="kt"&gt;def&lt;/span&gt; &lt;span class="kt"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;firstName:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;lastName:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;RegisteredCustomer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
&lt;span class="nc"&gt;RegisteredCustomer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;firstName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lastName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;Instant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;now&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="py"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.2. Defining serializers and deserializers
&lt;/h3&gt;

&lt;p&gt;When we talk about Apache Flink connectors, such as the Apache Flink connector for Pub/Sub, we need to keep in mind two fundamental concepts: &lt;em&gt;serializers&lt;/em&gt; and &lt;em&gt;deserializers&lt;/em&gt;; collectively, &lt;em&gt;serializations&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Serializers are responsible for transforming Java and Scala objects into a binary format so they can be sent to the destination. Deserializers do the opposite: they transform the binary data received from the source back into object instances of the programming language in use.&lt;/p&gt;

&lt;p&gt;In our case, we need to create a serializer that receives an instance of one of our newly created classes, converts it into a JSON string, and then into binary, so that it can be sent to Pub/Sub. The process is exactly the opposite for deserializers: we need to take a message, a JSON string that Pub/Sub delivers in binary format, and transform it into an instance of one of the newly created classes.&lt;/p&gt;

&lt;p&gt;It's a relatively simple process. To deserialize the JSON string into an instance of the case class, we'll use &lt;em&gt;uPickle&lt;/em&gt;. If you're already familiar with Flink, you might be wondering why we don't do this with the &lt;a href="https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/connectors/table/formats/json/" rel="noopener noreferrer"&gt;flink-json&lt;/a&gt; library. Simple: I had a lot of problems using it to deserialize the JSON strings into the case classes, so I found it more practical to create a custom deserializer that uses the uPickle library for this process.&lt;/p&gt;
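
&lt;p&gt;If you've never used uPickle, here is a minimal, standalone sketch (not part of our pipeline files) of the round trip between a case class that &lt;em&gt;derives ReadWriter&lt;/em&gt; and its JSON string, using the same &lt;em&gt;read&lt;/em&gt; and &lt;em&gt;write&lt;/em&gt; functions we'll rely on below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;import upickle.default.*

final case class CreatedCustomer(fullName: String, birthDate: String) derives ReadWriter

// serialize the case class to a JSON string
val json = write(CreatedCustomer("John Doe", "1995-01-01"))
println(json) // {"fullName":"John Doe","birthDate":"1995-01-01"}

// deserialize the JSON string back into an instance of the case class
val customer = read[CreatedCustomer](json)
println(customer) // CreatedCustomer(John Doe,1995-01-01)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;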

&lt;p&gt;Enough talk! Let's code!&lt;/p&gt;

&lt;p&gt;Create another file in the directory called &lt;em&gt;Serializations.scala&lt;/em&gt; and add the following lines inside it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="nn"&gt;br.mac.customers.serializations&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;br.mac.customers.models.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.flink.api.common.serialization.&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nc"&gt;AbstractDeserializationSchema&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;SerializationSchema&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;upickle.default.&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's create the deserializer for the &lt;em&gt;CreatedCustomer&lt;/em&gt; class. To do this, simply define a class that extends the &lt;em&gt;AbstractDeserializationSchema&lt;/em&gt; abstract class and override the &lt;em&gt;deserialize&lt;/em&gt; method. For more information, see &lt;a href="https://nightlies.apache.org/flink/flink-docs-stable/api/java/org/apache/flink/api/common/serialization/AbstractDeserializationSchema.html" rel="noopener noreferrer"&gt;this documentation&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CreatedCustomerDeserializer&lt;/span&gt; &lt;span class="k"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;AbstractDeserializationSchema&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;CreatedCustomer&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt;
  &lt;span class="kt"&gt;override&lt;/span&gt; &lt;span class="kt"&gt;def&lt;/span&gt; &lt;span class="kt"&gt;deserialize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;message:&lt;/span&gt; &lt;span class="kt"&gt;Array&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Byte&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;CreatedCustomer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;CreatedCustomer&lt;/span&gt;&lt;span class="o"&gt;](&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"UTF-8"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See? I told you it was simple!&lt;/p&gt;
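
&lt;p&gt;If you want to see it working outside Flink, here's a tiny, hypothetical check: hand the deserializer the raw UTF-8 bytes of a JSON message and watch the case class come out the other side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;import br.mac.customers.serializations.CreatedCustomerDeserializer

// the same kind of payload Pub/Sub will deliver, as raw bytes
val bytes = """{"fullName": "John Doe", "birthDate": "1995-01-01"}""".getBytes("UTF-8")

val customer = new CreatedCustomerDeserializer().deserialize(bytes)
println(customer) // CreatedCustomer(John Doe,1995-01-01)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;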

&lt;p&gt;Now let's define the serializer for the &lt;em&gt;RegisteredCustomer&lt;/em&gt; class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RegisteredCustomerSerializer&lt;/span&gt; &lt;span class="k"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;SerializationSchema&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;RegisteredCustomer&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt;
&lt;span class="kt"&gt;override&lt;/span&gt; &lt;span class="kt"&gt;def&lt;/span&gt; &lt;span class="kt"&gt;serialize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;element:&lt;/span&gt; &lt;span class="kt"&gt;RegisteredCustomer&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="kt"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Array&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Byte&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt;
&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;RegisteredCustomer&lt;/span&gt;&lt;span class="o"&gt;](&lt;/span&gt;&lt;span class="n"&gt;element&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;getBytes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"UTF-8"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The interesting thing about this approach is that we can use any library we want to serialize and deserialize JSON strings. If we were using the &lt;em&gt;flink-json&lt;/em&gt; library, we would be stuck with Java's &lt;em&gt;Jackson&lt;/em&gt; library. Yes, I also got goosebumps just thinking about it!&lt;/p&gt;

&lt;p&gt;The final code for the &lt;em&gt;Serializations.scala&lt;/em&gt; file looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="nn"&gt;br.mac.customers.serializations&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;br.mac.customers.models.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.flink.api.common.serialization.&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nc"&gt;AbstractDeserializationSchema&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;SerializationSchema&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;upickle.default.&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CreatedCustomerDeserializer&lt;/span&gt; &lt;span class="k"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;AbstractDeserializationSchema&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;CreatedCustomer&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt;
&lt;span class="kt"&gt;override&lt;/span&gt; &lt;span class="kt"&gt;def&lt;/span&gt; &lt;span class="kt"&gt;deserialize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;message:&lt;/span&gt; &lt;span class="kt"&gt;Array&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Byte&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;CreatedCustomer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;CreatedCustomer&lt;/span&gt;&lt;span class="o"&gt;](&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"UTF-8"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RegisteredCustomerSerializer&lt;/span&gt; &lt;span class="k"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;SerializationSchema&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;RegisteredCustomer&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt;
&lt;span class="kt"&gt;override&lt;/span&gt; &lt;span class="kt"&gt;def&lt;/span&gt; &lt;span class="kt"&gt;serialize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;element:&lt;/span&gt; &lt;span class="kt"&gt;RegisteredCustomer&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="kt"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Array&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Byte&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt;
&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;RegisteredCustomer&lt;/span&gt;&lt;span class="o"&gt;](&lt;/span&gt;&lt;span class="n"&gt;element&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;getBytes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"UTF-8"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We end here with serializers and deserializers. Let's continue!&lt;/p&gt;

&lt;h3&gt;
  
  
  3.3. Pipeline arguments
&lt;/h3&gt;

&lt;p&gt;In order to make our pipeline as flexible as possible, we need a way to receive a few parameters that are relevant to the functioning of our application without hard-coding them. These parameters are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google Cloud Platform project ID;&lt;/li&gt;
&lt;li&gt;Name of the Pub/Sub subscription from which Apache Flink will consume data;&lt;/li&gt;
&lt;li&gt;Name of the Pub/Sub topic where Apache Flink will send the processed data;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To do this, we will receive this information through command-line arguments, using a built-in Apache Flink utility called &lt;a href="https://nightlies.apache.org/flink/flink-docs-stable/api/java/org/apache/flink/api/java/utils/ParameterTool.html" rel="noopener noreferrer"&gt;ParameterTool&lt;/a&gt;. You can learn more about this utility &lt;a href="https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/application_parameters/" rel="noopener noreferrer"&gt;in this documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's get to work! Add the following lines to the &lt;em&gt;Customers.sc&lt;/em&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;parameters&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;ParameterTool&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;fromArgs&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;projectName&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"project"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;subscriptionName&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"subscription-name"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;topicName&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"topic-name"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Done! With this, we can pass the project ID, subscription name and topic name to our pipeline through the &lt;em&gt;--project&lt;/em&gt;, &lt;em&gt;--subscription-name&lt;/em&gt; and &lt;em&gt;--topic-name&lt;/em&gt; parameters, respectively.&lt;/p&gt;
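
&lt;p&gt;One caveat worth knowing: &lt;em&gt;parameters.get&lt;/em&gt; simply returns &lt;em&gt;null&lt;/em&gt; when a flag is missing. If you prefer the pipeline to fail fast, or to fall back to a default, ParameterTool also offers &lt;em&gt;getRequired&lt;/em&gt; and a two-argument &lt;em&gt;get&lt;/em&gt;. A sketch, with placeholder defaults of my own choosing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;// throws an exception with a clear message if --project was not passed
val projectName = parameters.getRequired("project")

// fall back to the given defaults when the flags are not passed
val subscriptionName = parameters.get("subscription-name", "created-customer-sub")
val topicName = parameters.get("topic-name", "registered-customer")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;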

&lt;h3&gt;
  
  
  3.4. Pub/Sub source
&lt;/h3&gt;

&lt;p&gt;The Pub/Sub source, as mentioned, is the way Apache Flink will read data from Pub/Sub. We will build this source using the official Apache Flink connector for Pub/Sub. If you are interested in learning more about this connector, check out &lt;a href="https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/pubsub/" rel="noopener noreferrer"&gt;this documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The Pub/Sub source constructor requires the following information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deserializer: the way Apache Flink will transform the message received from Pub/Sub into Scala objects. Remember the deserializer for the &lt;em&gt;CreatedCustomer&lt;/em&gt; class that we developed above? That's the one we'll be using;&lt;/li&gt;
&lt;li&gt;ProjectName: The name of the GCP project where you created the Pub/Sub topics and subscriptions;&lt;/li&gt;
&lt;li&gt;SubscriptionName: the name of the subscription from which Apache Flink will consume data related to the initial registration of customers;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Add the following lines to the file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;pubsubSource&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;PubSubSource&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;withDeserializationSchema&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CreatedCustomerDeserializer&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;withProjectName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;projectName&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;withSubscriptionName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subscriptionName&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;build&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that's it! Pretty simple too, right?&lt;/p&gt;

&lt;h3&gt;
  
  
  3.5. Pub/Sub Sink
&lt;/h3&gt;

&lt;p&gt;Phew, we're almost done. Let's build the Pub/Sub Sink for our pipeline.&lt;/p&gt;

&lt;p&gt;As stated, Pub/Sub Sink is a way for Apache Flink to send processed data to Pub/Sub. The Pub/Sub Sink constructor requires the following information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Serializer: the way Apache Flink will transform the &lt;em&gt;RegisteredCustomer&lt;/em&gt; class instance into a JSON string and then into binary and send it to Pub/Sub. Remember the serializer we created earlier? That's the one we're going to use!&lt;/li&gt;
&lt;li&gt;ProjectName: The name of the GCP project where you created the Pub/Sub topics and subscriptions;&lt;/li&gt;
&lt;li&gt;TopicName: the name of the topic that Apache Flink will send the processed data to;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Add the following lines to the file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;pubsubSink&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;PubSubSink&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;withSerializationSchema&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;RegisteredCustomerSerializer&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;withProjectName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;projectName&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;withTopicName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topicName&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;build&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.6. Data pipeline and application of business requirements
&lt;/h3&gt;

&lt;p&gt;We have finally reached the last stage of development! Let's build the core of our data pipeline! As a reminder, our data pipeline will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read the initial customer registrations from the Pub/Sub &lt;em&gt;created-customer&lt;/em&gt; topic;&lt;/li&gt;
&lt;li&gt;Apply transformations and rules according to the business requirements:
&lt;ul&gt;
&lt;li&gt;Split the customer's name into first and last name;&lt;/li&gt;
&lt;li&gt;Calculate the customer's age based on their date of birth;&lt;/li&gt;
&lt;li&gt;Set the customer creation date;&lt;/li&gt;
&lt;li&gt;If the customer's age is greater than or equal to 30, set the &lt;em&gt;isActive&lt;/em&gt; status to &lt;em&gt;false&lt;/em&gt; so the customer is not registered as active;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Send the processed data to the &lt;em&gt;registered-customer&lt;/em&gt; topic in Pub/Sub;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's go! Let's get to work!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;env&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;StreamExecutionEnvironment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getExecutionEnvironment&lt;/span&gt;
&lt;span class="nv"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;enableCheckpointing&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000L&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="nv"&gt;env&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;addSource&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pubsubSource&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// reading data from the created-customer topic&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cc&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;RegisteredCustomer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;cc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;firstName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;cc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;lastName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;cc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;age&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;// splitting the customer's name into first and last name, calculating the age and setting the creation date&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nv"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;age&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="n"&gt;then&lt;/span&gt; &lt;span class="nv"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;copy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;isActive&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// checking if the client's age is greater than or equal to 30&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;addSink&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pubsubSink&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// sending the processed data to the registered-customer topic&lt;/span&gt;

&lt;span class="nv"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;execute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"customerRegistering"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Is it finished? Yes, it is finished! Here is what the complete code looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="c1"&gt;//&amp;gt; using toolkit default&lt;/span&gt;
&lt;span class="c1"&gt;//&amp;gt; using dep "org.flinkextended::flink-scala-api:1.18.1_1.1.6"&lt;/span&gt;
&lt;span class="c1"&gt;//&amp;gt; using dep "org.apache.flink:flink-clients:1.18.1"&lt;/span&gt;
&lt;span class="c1"&gt;//&amp;gt; using dep org.apache.flink:flink-connector-gcp-pubsub:3.1.0-1.18&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;br.mac.customers.models.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;br.mac.customers.serializations.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.flink.api.java.utils.ParameterTool&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.flink.streaming.connectors.gcp.pubsub.&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nc"&gt;PubSubSink&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;PubSubSource&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.flinkx.api.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.flinkx.api.serializers.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;parameters&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;ParameterTool&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;fromArgs&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;projectName&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"project"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;subscriptionName&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"subscription-name"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;topicName&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"topic-name"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;pubsubSource&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;PubSubSource&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;withDeserializationSchema&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CreatedCustomerDeserializer&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;withProjectName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;projectName&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;withSubscriptionName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subscriptionName&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;build&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;pubsubSink&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;PubSubSink&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;withSerializationSchema&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;RegisteredCustomerSerializer&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;withProjectName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;projectName&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;withTopicName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topicName&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;build&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;env&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;StreamExecutionEnvironment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getExecutionEnvironment&lt;/span&gt;
&lt;span class="nv"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;enableCheckpointing&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000L&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="nv"&gt;env&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;addSource&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pubsubSource&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cc&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;RegisteredCustomer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;cc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;firstName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;cc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;lastName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;cc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;age&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nv"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;age&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="n"&gt;then&lt;/span&gt; &lt;span class="nv"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;copy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;isActive&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;addSink&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pubsubSink&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="nv"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;execute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"customerRegistering"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4. Running the data pipeline
&lt;/h2&gt;

&lt;p&gt;Before running the pipeline, open Pub/Sub in the Google Cloud console, go to the &lt;em&gt;created-customer&lt;/em&gt; topic, and manually publish a few messages that follow the &lt;em&gt;CreatedCustomer&lt;/em&gt; payload schema. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fullName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"John Doe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"birthDate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1995-01-01"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
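
&lt;p&gt;If you'd rather publish the test message from code than click through the console, here is a rough sketch using Google's Pub/Sub client library (an extra dependency, &lt;em&gt;com.google.cloud:google-cloud-pubsub&lt;/em&gt;, which our pipeline itself doesn't need); replace the project ID with your own:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;import com.google.cloud.pubsub.v1.Publisher
import com.google.protobuf.ByteString
import com.google.pubsub.v1.{PubsubMessage, TopicName}

val publisher = Publisher.newBuilder(TopicName.of("your-project-id-here", "created-customer")).build()

val message = PubsubMessage.newBuilder()
  .setData(ByteString.copyFromUtf8("""{"fullName": "John Doe", "birthDate": "1995-01-01"}"""))
  .build()

publisher.publish(message).get() // blocks until Pub/Sub acknowledges the message
publisher.shutdown()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;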



&lt;p&gt;Let's see all this in action. To do so, run the data pipeline through the Scala CLI; there is no need to package it and submit it to a Flink cluster, since we are running in local mode.&lt;/p&gt;

&lt;p&gt;Run the data pipeline with the following command, noting the application parameters we defined previously:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;scala-cli &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--project&lt;/span&gt; your-project-id-here &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--subscription-name&lt;/span&gt; created-customer-sub &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--topic-name&lt;/span&gt; registered-customer
&lt;span class="c"&gt;# ...&lt;/span&gt;
Compiling project &lt;span class="o"&gt;(&lt;/span&gt;Scala 3.4.2, JVM &lt;span class="o"&gt;(&lt;/span&gt;11&lt;span class="o"&gt;))&lt;/span&gt;
Compiled project &lt;span class="o"&gt;(&lt;/span&gt;Scala 3.4.2, JVM &lt;span class="o"&gt;(&lt;/span&gt;11&lt;span class="o"&gt;))&lt;/span&gt;
SLF4J&lt;span class="o"&gt;(&lt;/span&gt;W&lt;span class="o"&gt;)&lt;/span&gt;: No SLF4J providers were found.
SLF4J&lt;span class="o"&gt;(&lt;/span&gt;W&lt;span class="o"&gt;)&lt;/span&gt;: Defaulting to no-operation &lt;span class="o"&gt;(&lt;/span&gt;NOP&lt;span class="o"&gt;)&lt;/span&gt; logger implementation
SLF4J&lt;span class="o"&gt;(&lt;/span&gt;W&lt;span class="o"&gt;)&lt;/span&gt;: See https://www.slf4j.org/codes.html#noProviders &lt;span class="k"&gt;for &lt;/span&gt;further details.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's running! Open Pub/Sub in your browser, go to the &lt;em&gt;registered-customer-sub&lt;/em&gt; subscription, and click &lt;em&gt;Pull&lt;/em&gt;. This will show you the data that was processed by Apache Flink 🎉.&lt;/p&gt;

&lt;p&gt;Press &lt;em&gt;CTRL + C&lt;/em&gt; to stop the pipeline execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Conclusion
&lt;/h2&gt;

&lt;p&gt;And we've reached the end of the article! Here is what we did today:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We defined the problem faced by My Awesome Company (MAC);&lt;/li&gt;
&lt;li&gt;We defined the JSON payloads that would be received from and sent to the Pub/Sub topics;&lt;/li&gt;
&lt;li&gt;We defined the business requirements to be applied to the received data;&lt;/li&gt;
&lt;li&gt;We created two topics in Pub/Sub: one to receive the messages for the customers' initial registration and another to receive the data after it has been processed by Apache Flink;&lt;/li&gt;
&lt;li&gt;We developed the data pipeline in Apache Flink, defining the business models for each payload received and sent; the serializers and deserializers for the JSON strings; and finally the data pipeline itself, applying the previously defined business rules;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's all for today, guys! If you liked it, give me a little push and hit &lt;em&gt;like&lt;/em&gt; and share it with your friends, okay?&lt;/p&gt;

&lt;p&gt;See you next time 💚&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>scala</category>
      <category>datascience</category>
      <category>flink</category>
    </item>
    <item>
      <title>Engenharia de Dados com Scala: masterizando o processamento de dados em tempo real com Apache Flink e Google Pub/Sub</title>
      <dc:creator>Geazi Anc</dc:creator>
      <pubDate>Fri, 09 Aug 2024 00:23:27 +0000</pubDate>
      <link>https://dev.to/geazi_anc/engenharia-de-dados-com-scala-masterizando-o-processamento-de-dados-em-tempo-real-com-apache-flink-e-google-pubsub-m48</link>
      <guid>https://dev.to/geazi_anc/engenharia-de-dados-com-scala-masterizando-o-processamento-de-dados-em-tempo-real-com-apache-flink-e-google-pubsub-m48</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Note: this article is also available in &lt;a href="https://dev.to/geazi_anc/data-engineering-with-scala-mastering-real-time-data-processing-with-apache-flink-and-google-pubsub-3b39"&gt;english&lt;/a&gt; 🌎&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;O &lt;a href="https://flink.apache.org/" rel="noopener noreferrer"&gt;Apache Flink&lt;/a&gt; é um framework de processamento de dados distribuído, tanto para processamento em batch quanto em streaming. Com ele, é possível desenvolver aplicações orientadas a eventos; realizar análise de dados em batch e em streaming; além de poder ser utilizado para o desenvolvimento de pipelines de dados ETL.&lt;/p&gt;

&lt;p&gt;Já o &lt;a href="https://cloud.google.com/pubsub/docs/overview?hl=pt-br" rel="noopener noreferrer"&gt;Pub/Sub&lt;/a&gt; é um serviço de mensagens assíncrono e escalonável da Google que separa os serviços que produzem mensagens dos serviços que as processam. Ele é usado para análises de streaming e pipelines de integração de dados para carregar e distribuir dados, sendo igualmente eficaz como um middleware voltado para mensagens para integração de serviços ou como uma fila para carregar tarefas em paralelo.&lt;/p&gt;

&lt;p&gt;Nesse artigo, iremos desenvolver uma pipeline de dados bem simples em tempo real utilizando o Apache Flink em conjunto com a versão 3 da linguagem de programação Scala, fazendo uso do Pub/Sub como message broker. Antes de começarmos, vamos alinhar as expectativas?&lt;/p&gt;

&lt;p&gt;Primeiro, esse artigo não tem a pretensão de ser um artigo introdutório ao Apache Flink. Caso você nunca tenha ouvido falar nele antes, sugiro a leitura do &lt;a href="https://nightlies.apache.org/flink/flink-docs-stable/docs/try-flink/local_installation/" rel="noopener noreferrer"&gt;first steps&lt;/a&gt; da documentação oficial. Leia sem medo! A documentação do Apache Flink é excelente!&lt;/p&gt;

&lt;p&gt;Segundo, apesar do Apache Flink ter uma API oficial para a linguagem &lt;a href="https://nightlies.apache.org/flink/flink-docs-release-1.20/api/scala/org/apache/flink/api/scala/index.html" rel="noopener noreferrer"&gt;Scala&lt;/a&gt;, ela foi descontinuada e será removida nas próximas versões. Você pode saber mais sobre isso &lt;a href="https://cwiki.apache.org/confluence/display/FLINK/FLIP-265+Deprecate+and+remove+Scala+API+support" rel="noopener noreferrer"&gt;aqui&lt;/a&gt;. Todavia, como o Scala é uma linguagem baseada na JVM e o Apache Flink é desenvolvido em Java, é perfeitamente possível ainda utilizarmos a linguagem Scala para o desenvolvimento com o Apache Flink, porém utilizando as APIs do Java. Sim, eu também torci o nariz para isso. Ninguém merece! Mas, para deixar nossa vida mais fácil, vamos utilizar a biblioteca &lt;a href="https://github.com/flink-extended/flink-scala-api" rel="noopener noreferrer"&gt;Flink Scala API&lt;/a&gt;, que é nada menos que um fork da Scala API oficial do Flink, porém completamente mantido pela comunidade. Recomendo muito essa biblioteca!&lt;/p&gt;

&lt;p&gt;Terceiro, por fim, iremos desenvolver aqui uma pipeline de dados em tempo real bem simples. O objetivo não é fornecer um exemplo complexo, mas sim fornecer um guia para trabalhar com o Apache Flink com a linguagem Scala mais o Pub/Sub como message broker. Tive muita dificuldade de encontrar um artigo decente que utilizasse essas três tecnologias em conjunto.&lt;/p&gt;

&lt;p&gt;O que vamos ver nesse artigo?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1. Definição do problema&lt;/li&gt;
&lt;li&gt;
2. Setup

&lt;ul&gt;
&lt;li&gt;2.1. Criação dos tópicos e assinaturas no Pub/Sub&lt;/li&gt;
&lt;li&gt;2.2. Instalação das dependências&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

3. Desenvolvimento da pipeline de dados

&lt;ul&gt;
&lt;li&gt;3.1. Modelos e requisitos de negócio&lt;/li&gt;
&lt;li&gt;3.2. Definindo os serializers e deserializers&lt;/li&gt;
&lt;li&gt;3.3. Argumentos da pipeline&lt;/li&gt;
&lt;li&gt;3.4. Pub/Sub source&lt;/li&gt;
&lt;li&gt;3.5. Pub/Sub Sink&lt;/li&gt;
&lt;li&gt;3.6. Pipeline de dados e aplicação dos requisitos de negócio&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;4. Executando a pipeline de dados&lt;/li&gt;

&lt;li&gt;5. Considerações finais&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Agora, chega de papo. Vamos começar!&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Definição do problema
&lt;/h2&gt;

&lt;p&gt;Uma determinada aplicação web é responsável por receber o cadastro inicial de novos clientes da grande empresa de varejo brasileira chamada de My Awesome Company, doravante MAC, &lt;em&gt;mac.br&lt;/em&gt;. A aplicação envia em tempo real o cadastro inicial dos novos clientes para um tópico do Pub/Sub, e você deve desenvolver uma pipeline de dados que processe esse dado em tempo real, enriqueça o cadastro inicial do cliente com algumas informações relevantes de negócio e, por fim, o envie para um tópico final no Pub/Sub. Bem simples, não?&lt;/p&gt;

&lt;p&gt;A aplicação web envia o seguinte payload para o Pub/Sub:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fullName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"birthDate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Onde:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fullName é o nome completo do cliente (dann!);&lt;/li&gt;
&lt;li&gt;birthDate é a data de nascimento do cliente, no formato &lt;em&gt;ano-mês-dia&lt;/em&gt;;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A pipeline de dados deve enriquecer esse cadastro básico do cliente com algumas informações relevantes de negócio:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;É preciso separar o nome completo do cliente em &lt;em&gt;primeiro nome&lt;/em&gt; e &lt;em&gt;último nome&lt;/em&gt;;&lt;/li&gt;
&lt;li&gt;Deve-se calcular a idade atual do cliente com base em sua data de nascimento;&lt;/li&gt;
&lt;li&gt;Se o cliente tiver 30 anos ou mais, o cadastro não deve ser realizado e o cliente deve constar como &lt;em&gt;inativo&lt;/em&gt;;&lt;/li&gt;
&lt;li&gt;Adicionar um campo &lt;em&gt;createdAt&lt;/em&gt;, relacionado à data de criação do cliente.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tendo esse entendimento, vamos começar a &lt;em&gt;codar&lt;/em&gt;!&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Setup
&lt;/h2&gt;

&lt;p&gt;Calma lá! Não vamos começar a codar ainda 🙍🏼. Vamos precisar configurar algumas coisas antes. As configurações iniciais que vamos ter que fazer são as seguintes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Criação dos tópicos e das assinaturas no Pub/Sub;&lt;/li&gt;
&lt;li&gt;Instalação das dependências necessárias para o funcionamento da pipeline de dados;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2.1. Criação dos tópicos e assinaturas no Pub/Sub
&lt;/h3&gt;

&lt;p&gt;Para a criação dos tópicos e assinaturas no Pub/Sub, vamos utilizar a CLI oficial da Google Cloud, o &lt;em&gt;gcloud&lt;/em&gt;. Siga &lt;a href="https://cloud.google.com/sdk/docs/install?hl=pt-br" rel="noopener noreferrer"&gt;essas instruções&lt;/a&gt; caso ainda não tenha a CLI devidamente configurada em sua máquina.&lt;/p&gt;

&lt;p&gt;Agora, quais tópicos precisam ser criados?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;created-customer: o tópico onde a aplicação web da MAC irá enviar os payloads referentes aos cadastros iniciais dos clientes;&lt;/li&gt;
&lt;li&gt;registered-customer: o tópico final onde nossa pipeline de dados irá enviar os clientes com os respectivos cadastros devidamente enriquecidos;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Vamos começar pelo tópico &lt;em&gt;created-customer&lt;/em&gt;. Para esse tópico, também precisamos criar uma assinatura padrão do tipo &lt;em&gt;pull&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# criando o tópico created-customer&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;gcloud pubsub topics create created-customer
Created topic &lt;span class="o"&gt;[&lt;/span&gt;projects/my-project-id/topics/created-customer].

&lt;span class="c"&gt;# agora, criando uma assinatura do tipo pull para o tópico created-customer&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;gcloud pubsub subscriptions create created-customer-sub &lt;span class="nt"&gt;--topic&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;created-customer
Created subscription &lt;span class="o"&gt;[&lt;/span&gt;projects/my-project-id/subscriptions/created-customer-sub].
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Agora, vamos criar o tópico &lt;em&gt;registered-customer&lt;/em&gt;. Para esse tópico, também precisamos criar uma assinatura padrão do tipo &lt;em&gt;pull&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# criando o tópico registered-customer&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;gcloud pubsub topics create registered-customer
Created topic &lt;span class="o"&gt;[&lt;/span&gt;projects/my-project-id/topics/registered-customer].

&lt;span class="c"&gt;# agora, criando uma assinatura do tipo pull para o tópico registered-customer&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;gcloud pubsub subscriptions create registered-customer-sub &lt;span class="nt"&gt;--topic&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;registered-customer
Created subscription &lt;span class="o"&gt;[&lt;/span&gt;projects/my-project-id/subscriptions/registered-customer-sub].
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2.2. Instalação das dependências
&lt;/h3&gt;

&lt;p&gt;Agora sim! Hora de codar! 🎉&lt;/p&gt;

&lt;p&gt;Antes de tudo, o desenvolvimento de nossa pipeline de dados não será feito com base em projetos SBT. Vamos utilizar a &lt;a href="https://scala-cli.virtuslab.org/" rel="noopener noreferrer"&gt;Scala CLI&lt;/a&gt;, uma ferramenta de linha de comando que permite compilar, executar, testar e empacotar códigos Scala. Com base no Scala CLI, podemos desenvolver &lt;a href="https://scala-cli.virtuslab.org/docs/guides/scripting/scripts" rel="noopener noreferrer"&gt;scripts Scala&lt;/a&gt; de forma muito prática e rápida!&lt;/p&gt;

&lt;p&gt;Para a instalação das dependências, vamos utilizar um recurso do Scala CLI chamado de &lt;a href="https://scala-cli.virtuslab.org/docs/guides/introduction/using-directives" rel="noopener noreferrer"&gt;diretivas&lt;/a&gt;. Diretivas são modos de definirmos configurações dentro do próprio código fonte, sem precisar de uma ferramenta de build como o SBT para tal. Uma das diretivas que vamos utilizar é para definirmos as dependências que nossa pipeline irá utilizar, a saber:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apache Flink Client (a própria dependência do Apache Flink);&lt;/li&gt;
&lt;li&gt;Flink Scala API (uma biblioteca mantida pela comunidade que nos permite desenvolver códigos no Apache Flink com as APIs do Scala);&lt;/li&gt;
&lt;li&gt;Flink Connector GCP PubSub: o connector oficial do Apache Flink que nos permite enviar e receber mensagens do Pub/Sub;&lt;/li&gt;
&lt;li&gt;Toolkit: um conjunto de bibliotecas úteis para tarefas cotidianas, incluindo a biblioteca uPickle, utilizada para serializar e deserializar JSON;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Para começarmos, crie um diretório chamado &lt;em&gt;br-mac&lt;/em&gt;, e um arquivo chamado &lt;em&gt;Customers.sc&lt;/em&gt; dentro dele:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;br-mac
...
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;br-mac
...
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;touch &lt;/span&gt;Customers.sc
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Agora, dentro do arquivo &lt;em&gt;Customers.sc&lt;/em&gt;, adicione as seguintes linhas que são relacionadas às diretivas para a instalação das dependências necessárias:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="c1"&gt;//&amp;gt; using toolkit default&lt;/span&gt;
&lt;span class="c1"&gt;//&amp;gt; using dep "org.flinkextended::flink-scala-api:1.18.1_1.1.6"&lt;/span&gt;
&lt;span class="c1"&gt;//&amp;gt; using dep "org.apache.flink:flink-clients:1.18.1"&lt;/span&gt;
&lt;span class="c1"&gt;//&amp;gt; using dep org.apache.flink:flink-connector-gcp-pubsub:3.1.0-1.18&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;E já adicione os imports que serão utilizados posteriormente:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;br.mac.customers.models.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;br.mac.customers.serializations.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.flink.api.java.utils.ParameterTool&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.flink.streaming.connectors.gcp.pubsub.&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nc"&gt;PubSubSink&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;PubSubSource&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.flinkx.api.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.flinkx.api.serializers.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Feito! As dependências e os imports foram definidos. Vamos em frente.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Desenvolvimento da pipeline de dados
&lt;/h2&gt;

&lt;p&gt;Chegou o momento de desenvolvermos a pipeline de dados em si com o Apache Flink! Esse desenvolvimento irá consistir em seis partes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Modelos e requisitos de negócio;&lt;/li&gt;
&lt;li&gt;Desenvolvimento dos serializers e deserializers;&lt;/li&gt;
&lt;li&gt;Utilização do &lt;em&gt;ParameterTool&lt;/em&gt; para que possamos pegar algumas informações relevantes para nossa pipeline através de argumentos da linha de comando;&lt;/li&gt;
&lt;li&gt;Desenvolvimento do PubSubSource para que o Apache Flink possa ler os dados do tópico created-customer do Pub/Sub;&lt;/li&gt;
&lt;li&gt;Desenvolvimento do PubSubSink para que o Apache Flink possa enviar os dados processados para o tópico registered-customer no Pub/Sub;&lt;/li&gt;
&lt;li&gt;Desenvolvimento do core da pipeline de dados aplicando os requisitos de negócio;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Vam'bora?&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1. Modelos e requisitos de negócio
&lt;/h3&gt;

&lt;p&gt;Os modelos de negócio são as informações que iremos receber e enviar para o Pub/Sub. Como dito anteriormente, iremos receber do Pub/Sub um payload no formato JSON, e enviar um payload para o Pub/Sub também no formato JSON. Precisamos modelar esses payloads em &lt;em&gt;classes&lt;/em&gt; Scala.&lt;/p&gt;

&lt;p&gt;Como essas classes são representações de payloads JSON, vamos utilizar a biblioteca uPickle para que seja possível serializá-las&lt;br&gt;
e deserializá-las no formato JSON. Caso ainda não conheça a biblioteca uPickle, recomendo fortemente que dê uma lida na &lt;a href="https://com-lihaoyi.github.io/upickle/" rel="noopener noreferrer"&gt;documentação&lt;/a&gt;. Também é uma excelente biblioteca!&lt;/p&gt;

&lt;p&gt;Um exemplo de payload que iremos receber, relacionado ao cadastro inicial dos clientes, é o seguinte:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fullName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"John Doe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"birthDate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1995-01-01"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Já um exemplo de payload que iremos enviar para o Pub/Sub, relacionado ao cadastro final do cliente, é o seguinte:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"firstName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"John"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"lastName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Doe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"age"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;29&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"isActive"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"createdAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-08-08T18:07:44.167635713Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Crie um outro arquivo chamado &lt;em&gt;Models.scala&lt;/em&gt;. Observe que dessa vez a extensão do arquivo é &lt;em&gt;.scala&lt;/em&gt;, e não &lt;em&gt;.sc&lt;/em&gt;. Isso porque esse arquivo é um módulo Scala, e não um script Scala.&lt;/p&gt;

&lt;p&gt;No arquivo, adicione as seguintes linhas:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="nn"&gt;br.mac.customers.models&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;upickle.default.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;

&lt;span class="k"&gt;final&lt;/span&gt; &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CreatedCustomer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fullName&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;birthDate&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;derives&lt;/span&gt; &lt;span class="nc"&gt;ReadWriter&lt;/span&gt;
&lt;span class="k"&gt;final&lt;/span&gt; &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RegisteredCustomer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;firstName&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lastName&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;isActive&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Boolean&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;createdAt&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;derives&lt;/span&gt; &lt;span class="nc"&gt;ReadWriter&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Done! But we are not finished with our models yet. We still need to define a few methods to satisfy the business requirements laid out earlier, namely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The customer's full name must be split into &lt;em&gt;first name&lt;/em&gt; and &lt;em&gt;last name&lt;/em&gt;;&lt;/li&gt;
&lt;li&gt;The customer's current age must be calculated from their birth date;&lt;/li&gt;
&lt;li&gt;If the customer is 30 or older, the registration must not go through and the customer must be marked as &lt;em&gt;inactive&lt;/em&gt;;&lt;/li&gt;
&lt;li&gt;A &lt;em&gt;createdAt&lt;/em&gt; field must be added, holding the customer's creation timestamp.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first and second requirements can be implemented as methods on the &lt;em&gt;CreatedCustomer&lt;/em&gt; class. For the fourth, we can define a constructor for the &lt;em&gt;RegisteredCustomer&lt;/em&gt; class that creates an instance with the &lt;em&gt;isActive&lt;/em&gt; attribute set to &lt;em&gt;true&lt;/em&gt; and the &lt;em&gt;createdAt&lt;/em&gt; attribute set to the current time. The third requirement, the age check, will be handled in the data pipeline itself.&lt;/p&gt;

&lt;p&gt;For the first and second requirements, we need a couple of imports in the &lt;em&gt;Models.scala&lt;/em&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.time.temporal.ChronoUnit&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.time.&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nc"&gt;Instant&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;LocalDate&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can define the methods on the &lt;em&gt;CreatedCustomer&lt;/em&gt; class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;final&lt;/span&gt; &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CreatedCustomer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fullName&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;birthDate&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;derives&lt;/span&gt; &lt;span class="nc"&gt;ReadWriter&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt;
  &lt;span class="kt"&gt;def&lt;/span&gt; &lt;span class="kt"&gt;firstName:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;fullName&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;split&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;" "&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;head&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lastName&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;fullName&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;split&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;" "&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;last&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;age&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;          &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;ChronoUnit&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;YEARS&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;between&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;LocalDate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;parse&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;birthDate&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="nv"&gt;LocalDate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;now&lt;/span&gt;&lt;span class="o"&gt;()).&lt;/span&gt;&lt;span class="py"&gt;toInt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
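
&lt;p&gt;As a quick sanity check, here is a minimal sketch of how these derived helpers behave (the &lt;em&gt;age&lt;/em&gt; value naturally depends on the date the code runs):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;// hypothetical REPL-style check of the CreatedCustomer helpers
val cc = CreatedCustomer("John Doe", "1995-01-01")

cc.firstName // "John"
cc.lastName  // "Doe"
cc.age       // e.g. 29 or 30, depending on what LocalDate.now() returns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
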



&lt;p&gt;Finally, let's declare the constructor for the &lt;em&gt;RegisteredCustomer&lt;/em&gt; class. We do this by defining an &lt;em&gt;apply&lt;/em&gt; method on its &lt;em&gt;companion object&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;object&lt;/span&gt; &lt;span class="nc"&gt;RegisteredCustomer&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt;
  &lt;span class="kt"&gt;def&lt;/span&gt; &lt;span class="kt"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;firstName:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;lastName:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;RegisteredCustomer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="nc"&gt;RegisteredCustomer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;firstName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lastName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;Instant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;now&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="py"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
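
&lt;p&gt;With this overload in place, building a &lt;em&gt;RegisteredCustomer&lt;/em&gt; from just the three computed fields fills in the defaults for us. A minimal sketch (the timestamp, of course, is whatever &lt;em&gt;Instant.now()&lt;/em&gt; returns):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;// hypothetical usage of the companion-object apply
val rc = RegisteredCustomer("John", "Doe", 29)

rc.isActive  // true
rc.createdAt // e.g. "2024-08-08T18:07:44.167635713Z"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
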



&lt;p&gt;The final code for the &lt;em&gt;Models.scala&lt;/em&gt; file therefore looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="nn"&gt;br.mac.customers.models&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;upickle.default.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.time.temporal.ChronoUnit&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.time.&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nc"&gt;Instant&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;LocalDate&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;final&lt;/span&gt; &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CreatedCustomer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fullName&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;birthDate&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;derives&lt;/span&gt; &lt;span class="nc"&gt;ReadWriter&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt;
  &lt;span class="kt"&gt;def&lt;/span&gt; &lt;span class="kt"&gt;firstName:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;fullName&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;split&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;" "&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;head&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lastName&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;fullName&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;split&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;" "&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;last&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;age&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;          &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;ChronoUnit&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;YEARS&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;between&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;LocalDate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;parse&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;birthDate&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="nv"&gt;LocalDate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;now&lt;/span&gt;&lt;span class="o"&gt;()).&lt;/span&gt;&lt;span class="py"&gt;toInt&lt;/span&gt;

&lt;span class="k"&gt;final&lt;/span&gt; &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RegisteredCustomer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;firstName&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lastName&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;isActive&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Boolean&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;createdAt&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;derives&lt;/span&gt; &lt;span class="nc"&gt;ReadWriter&lt;/span&gt;

&lt;span class="k"&gt;object&lt;/span&gt; &lt;span class="nc"&gt;RegisteredCustomer&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt;
  &lt;span class="kt"&gt;def&lt;/span&gt; &lt;span class="kt"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;firstName:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;lastName:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;RegisteredCustomer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="nc"&gt;RegisteredCustomer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;firstName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lastName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;Instant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;now&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="py"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3.2. Defining the serializers and deserializers
&lt;/h2&gt;

&lt;p&gt;When we talk about Apache Flink connectors, such as the Apache Flink connector for Pub/Sub, we need to keep two fundamental concepts in mind: &lt;em&gt;serializers&lt;/em&gt; and &lt;em&gt;deserializers&lt;/em&gt;. In other words, &lt;em&gt;serializations&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Serializers are responsible for converting data types, from both Java and Scala, into the binary format that gets sent to the destination. Deserializers do the reverse: they take the data received from the source and turn it into object instances of the programming language in use.&lt;/p&gt;

&lt;p&gt;In our case, we need a serializer that takes an instance of one of the classes we just created, turns it into a JSON string, and then converts it to binary so it can be sent to Pub/Sub. For the deserializer the process is exactly the opposite: we take a message, a JSON string that Pub/Sub delivers in binary form, and turn it into an instance of one of those classes.&lt;/p&gt;

&lt;p&gt;It is a fairly simple process. To deserialize the JSON string into an instance of the case class, we will use &lt;em&gt;uPickle&lt;/em&gt;. If you are already familiar with Flink, you may be wondering why we don't do this with the &lt;a href="https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/connectors/table/formats/json/" rel="noopener noreferrer"&gt;flink-json&lt;/a&gt; library. Simple: I ran into too many problems using it to deserialize JSON strings into case classes, so I found it more practical to write a custom deserializer that relies on uPickle instead.&lt;/p&gt;
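
&lt;p&gt;Under the hood, all our serializers will do is wrap uPickle's &lt;em&gt;read&lt;/em&gt; and &lt;em&gt;write&lt;/em&gt; functions, which already work for our case classes thanks to the &lt;em&gt;derives ReadWriter&lt;/em&gt; clause. A minimal sketch of that round trip:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;import upickle.default.{read, write}

// case class to JSON string
val json = write(CreatedCustomer("John Doe", "1995-01-01"))
// json == {"fullName":"John Doe","birthDate":"1995-01-01"}

// JSON string back to a case class
val cc = read[CreatedCustomer](json)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
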

&lt;p&gt;Enough talk! Let's code!&lt;/p&gt;

&lt;p&gt;Create another file in the directory called &lt;em&gt;Serializations.scala&lt;/em&gt; and add the following lines to it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="nn"&gt;br.mac.customers.serializations&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;br.mac.customers.models.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.flink.api.common.serialization.&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nc"&gt;AbstractDeserializationSchema&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;SerializationSchema&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;upickle.default.&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's create the deserializer for the &lt;em&gt;CreatedCustomer&lt;/em&gt; class. All we need to do is define a class that extends the abstract class AbstractDeserializationSchema and implement the &lt;em&gt;deserialize&lt;/em&gt; method. For more details, see &lt;a href="https://nightlies.apache.org/flink/flink-docs-stable/api/java/org/apache/flink/api/common/serialization/AbstractDeserializationSchema.html" rel="noopener noreferrer"&gt;this documentation&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CreatedCustomerDeserializer&lt;/span&gt; &lt;span class="k"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;AbstractDeserializationSchema&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;CreatedCustomer&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt;
  &lt;span class="kt"&gt;override&lt;/span&gt; &lt;span class="kt"&gt;def&lt;/span&gt; &lt;span class="kt"&gt;deserialize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;message:&lt;/span&gt; &lt;span class="kt"&gt;Array&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Byte&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;CreatedCustomer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;CreatedCustomer&lt;/span&gt;&lt;span class="o"&gt;](&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"UTF-8"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See? I told you it was simple!&lt;/p&gt;

&lt;p&gt;Now let's define the serializer for the &lt;em&gt;RegisteredCustomer&lt;/em&gt; class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RegisteredCustomerSerializer&lt;/span&gt; &lt;span class="k"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;SerializationSchema&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;RegisteredCustomer&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt;
  &lt;span class="kt"&gt;override&lt;/span&gt; &lt;span class="kt"&gt;def&lt;/span&gt; &lt;span class="kt"&gt;serialize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;element:&lt;/span&gt; &lt;span class="kt"&gt;RegisteredCustomer&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="kt"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Array&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Byte&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt;
    &lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;RegisteredCustomer&lt;/span&gt;&lt;span class="o"&gt;](&lt;/span&gt;&lt;span class="n"&gt;element&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;getBytes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"UTF-8"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The nice thing about this approach is that we can use whichever library we want to serialize and deserialize JSON strings. If we were using the &lt;em&gt;flink-json&lt;/em&gt; library, we would be stuck with Java's &lt;em&gt;jackson&lt;/em&gt; library. Yes, I also got chills just thinking about it!&lt;/p&gt;
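
&lt;p&gt;Purely as an illustration of that flexibility: assuming we added a &lt;em&gt;circe&lt;/em&gt; dependency to the project (which this article does not do), the same serializer could be sketched with circe instead of uPickle, without touching anything else in the pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;// hypothetical alternative, NOT used in this article: circe instead of uPickle
import io.circe.generic.auto.*
import io.circe.syntax.*
import org.apache.flink.api.common.serialization.SerializationSchema

class RegisteredCustomerCirceSerializer extends SerializationSchema[RegisteredCustomer]:
  override def serialize(element: RegisteredCustomer): Array[Byte] =
    element.asJson.noSpaces.getBytes("UTF-8")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
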

&lt;p&gt;The final code for the &lt;em&gt;Serializations.scala&lt;/em&gt; file looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="nn"&gt;br.mac.customers.serializations&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;br.mac.customers.models.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.flink.api.common.serialization.&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nc"&gt;AbstractDeserializationSchema&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;SerializationSchema&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;upickle.default.&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CreatedCustomerDeserializer&lt;/span&gt; &lt;span class="k"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;AbstractDeserializationSchema&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;CreatedCustomer&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt;
  &lt;span class="kt"&gt;override&lt;/span&gt; &lt;span class="kt"&gt;def&lt;/span&gt; &lt;span class="kt"&gt;deserialize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;message:&lt;/span&gt; &lt;span class="kt"&gt;Array&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Byte&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;CreatedCustomer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;CreatedCustomer&lt;/span&gt;&lt;span class="o"&gt;](&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"UTF-8"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RegisteredCustomerSerializer&lt;/span&gt; &lt;span class="k"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;SerializationSchema&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;RegisteredCustomer&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt;
  &lt;span class="kt"&gt;override&lt;/span&gt; &lt;span class="kt"&gt;def&lt;/span&gt; &lt;span class="kt"&gt;serialize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;element:&lt;/span&gt; &lt;span class="kt"&gt;RegisteredCustomer&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="kt"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Array&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Byte&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt;
    &lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;RegisteredCustomer&lt;/span&gt;&lt;span class="o"&gt;](&lt;/span&gt;&lt;span class="n"&gt;element&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;getBytes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"UTF-8"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it for the serializers and deserializers. Let's keep going!&lt;/p&gt;

&lt;h3&gt;
  
  
  3.3. Pipeline arguments
&lt;/h3&gt;

&lt;p&gt;To keep our pipeline as flexible as possible, we need a way to receive a few parameters that are relevant to how the application runs, without hard-coding that information. These parameters are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Google Cloud Platform project ID;&lt;/li&gt;
&lt;li&gt;The name of the Pub/Sub subscription from which Apache Flink will consume the data;&lt;/li&gt;
&lt;li&gt;The name of the Pub/Sub topic to which Apache Flink will send the processed data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will receive this information through command-line arguments, using a built-in Apache Flink utility called &lt;a href="https://nightlies.apache.org/flink/flink-docs-stable/api/java/org/apache/flink/api/java/utils/ParameterTool.html" rel="noopener noreferrer"&gt;ParameterTool&lt;/a&gt;. You can learn more about this utility in &lt;a href="https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/application_parameters/" rel="noopener noreferrer"&gt;this documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's get to work! Add the following lines to the &lt;em&gt;Customers.sc&lt;/em&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;parameters&lt;/span&gt;       &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;ParameterTool&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;fromArgs&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;projectName&lt;/span&gt;      &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"project"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;subscriptionName&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"subscription-name"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;topicName&lt;/span&gt;        &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"topic-name"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Done! With this in place, we can pass the project ID, the subscription name, and the topic name to our pipeline through the &lt;em&gt;--project&lt;/em&gt;, &lt;em&gt;--subscription-name&lt;/em&gt; and &lt;em&gt;--topic-name&lt;/em&gt; parameters, respectively.&lt;/p&gt;
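
&lt;p&gt;As a side note, if you want the pipeline to fail fast on a missing argument or to fall back to a sensible default, ParameterTool also exposes helpers for that. A small sketch (the default value below is just an illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;// throws an exception if --project was not provided
val projectName      = parameters.getRequired("project")
// falls back to a default subscription name when the argument is absent
val subscriptionName = parameters.get("subscription-name", "created-customer-sub")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
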

&lt;h3&gt;
  
  
  3.4. Pub/Sub source
&lt;/h3&gt;

&lt;p&gt;The Pub/Sub source, as mentioned, is how Apache Flink will read data from Pub/Sub. We will build this source using the official Apache Flink connector for Pub/Sub. If you want to learn more about this connector, check &lt;a href="https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/pubsub/" rel="noopener noreferrer"&gt;this documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The Pub/Sub source builder requires the following information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deserializer: how Apache Flink will turn the message received from Pub/Sub into Scala objects. Remember the deserializer for the &lt;em&gt;CreatedCustomer&lt;/em&gt; class we built above? That's the one we will use;&lt;/li&gt;
&lt;li&gt;ProjectName: the name of the GCP project where you created the Pub/Sub topics and subscriptions;&lt;/li&gt;
&lt;li&gt;SubscriptionName: the name of the subscription from which Apache Flink will consume the data about the customers' initial sign-up.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Add the following lines to the file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;pubsubSource&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PubSubSource&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;withDeserializationSchema&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CreatedCustomerDeserializer&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;withProjectName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;projectName&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;withSubscriptionName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subscriptionName&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;build&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that's all! Pretty simple too, right?&lt;/p&gt;

&lt;h3&gt;
  
  
  3.5. Pub/Sub Sink
&lt;/h3&gt;

&lt;p&gt;Phew, we are almost done. Let's build the Pub/Sub sink of our pipeline.&lt;/p&gt;

&lt;p&gt;As mentioned, the Pub/Sub sink is how Apache Flink sends the processed data to Pub/Sub. The Pub/Sub sink builder requires the following information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Serializer: how Apache Flink will turn a &lt;em&gt;RegisteredCustomer&lt;/em&gt; instance into a JSON string, then into binary, and send it to Pub/Sub. Remember the serializer we created earlier? That's the one we will use!&lt;/li&gt;
&lt;li&gt;ProjectName: the name of the GCP project where you created the Pub/Sub topics and subscriptions;&lt;/li&gt;
&lt;li&gt;TopicName: the name of the topic to which Apache Flink will send the processed data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Add the following lines to the file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;pubsubSink&lt;/span&gt;   &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PubSubSink&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;withSerializationSchema&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;RegisteredCustomerSerializer&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;withProjectName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;projectName&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;withTopicName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topicName&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;build&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.6. The data pipeline and applying the business requirements
&lt;/h3&gt;

&lt;p&gt;We have finally reached the last development step! Let's build the core of our data pipeline. As a reminder, our pipeline will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read the customers' initial sign-ups from the &lt;em&gt;created-customer&lt;/em&gt; Pub/Sub topic;&lt;/li&gt;
&lt;li&gt;Apply the transformations and rules defined by the business requirements, namely:

&lt;ul&gt;
&lt;li&gt;Split the customer's name into first and last name;&lt;/li&gt;
&lt;li&gt;Calculate the customer's age from their birth date;&lt;/li&gt;
&lt;li&gt;Set the customer's creation date;&lt;/li&gt;
&lt;li&gt;If the customer's age is 30 or greater, do not register the customer and set the &lt;em&gt;isActive&lt;/em&gt; flag to &lt;em&gt;false&lt;/em&gt;;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Send the processed data to the &lt;em&gt;registered-customer&lt;/em&gt; topic in Pub/Sub.&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Here we go! Let's get our hands dirty!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;env&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;StreamExecutionEnvironment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getExecutionEnvironment&lt;/span&gt;
&lt;span class="nv"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;enableCheckpointing&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000L&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;env&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;addSource&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pubsubSource&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// lendo os dados do tópico created-customer&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cc&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;RegisteredCustomer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;cc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;firstName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;cc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;lastName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;cc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;age&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;// separando o nome do cliente em primeiro e último nome, calculando a idade e definindo a data de criação&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nv"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;age&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="n"&gt;then&lt;/span&gt; &lt;span class="nv"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;copy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;isActive&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// verificando se a idade do cliente é maior ou igual a 30&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;addSink&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pubsubSink&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// enviando os dados processados para o tópico registered-customer&lt;/span&gt;

  &lt;span class="nv"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;execute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"customerRegistering"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Done? Yes, done! Here is the complete code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="c1"&gt;//&amp;gt; using toolkit default&lt;/span&gt;
&lt;span class="c1"&gt;//&amp;gt; using dep "org.flinkextended::flink-scala-api:1.18.1_1.1.6"&lt;/span&gt;
&lt;span class="c1"&gt;//&amp;gt; using dep "org.apache.flink:flink-clients:1.18.1"&lt;/span&gt;
&lt;span class="c1"&gt;//&amp;gt; using dep org.apache.flink:flink-connector-gcp-pubsub:3.1.0-1.18&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;br.mac.customers.models.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;br.mac.customers.serializations.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.flink.api.java.utils.ParameterTool&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.flink.streaming.connectors.gcp.pubsub.&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nc"&gt;PubSubSink&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;PubSubSource&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.flinkx.api.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.flinkx.api.serializers.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;parameters&lt;/span&gt;       &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;ParameterTool&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;fromArgs&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;projectName&lt;/span&gt;      &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"project"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;subscriptionName&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"subscription-name"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;topicName&lt;/span&gt;        &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"topic-name"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;pubsubSource&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PubSubSource&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;withDeserializationSchema&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CreatedCustomerDeserializer&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;withProjectName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;projectName&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;withSubscriptionName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subscriptionName&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;build&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;pubsubSink&lt;/span&gt;   &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PubSubSink&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;withSerializationSchema&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;RegisteredCustomerSerializer&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;withProjectName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;projectName&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;withTopicName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topicName&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;build&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;env&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;StreamExecutionEnvironment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getExecutionEnvironment&lt;/span&gt;
&lt;span class="nv"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;enableCheckpointing&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000L&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;env&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;addSource&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pubsubSource&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cc&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;RegisteredCustomer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;cc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;firstName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;cc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;lastName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;cc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;age&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nv"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;age&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="n"&gt;then&lt;/span&gt; &lt;span class="nv"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;copy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;isActive&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;addSink&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pubsubSink&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="nv"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;execute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"customerRegistering"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4. Running the data pipeline
&lt;/h2&gt;

&lt;p&gt;Before running the pipeline, open Pub/Sub in the console in your browser, go to the &lt;em&gt;created-customer&lt;/em&gt; topic, and manually publish a few messages that follow the &lt;em&gt;CreatedCustomer&lt;/em&gt; payload schema. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fullName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"John Doe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"birthDate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1995-01-01"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Shall we see all of this in action? Run the data pipeline through the Scala CLI. There is no need to package the pipeline and deploy it to a Flink cluster; we are working in local mode here.&lt;/p&gt;

&lt;p&gt;Run the data pipeline with the following command. Note the application parameters, as defined earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;scala-cli &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt; seu-project-id-aqui &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--subscription-name&lt;/span&gt; created-customer-sub &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--topic-name&lt;/span&gt; registered-customer
&lt;span class="c"&gt;# ...&lt;/span&gt;
Compiling project &lt;span class="o"&gt;(&lt;/span&gt;Scala 3.4.2, JVM &lt;span class="o"&gt;(&lt;/span&gt;11&lt;span class="o"&gt;))&lt;/span&gt;
Compiled project &lt;span class="o"&gt;(&lt;/span&gt;Scala 3.4.2, JVM &lt;span class="o"&gt;(&lt;/span&gt;11&lt;span class="o"&gt;))&lt;/span&gt;
SLF4J&lt;span class="o"&gt;(&lt;/span&gt;W&lt;span class="o"&gt;)&lt;/span&gt;: No SLF4J providers were found.
SLF4J&lt;span class="o"&gt;(&lt;/span&gt;W&lt;span class="o"&gt;)&lt;/span&gt;: Defaulting to no-operation &lt;span class="o"&gt;(&lt;/span&gt;NOP&lt;span class="o"&gt;)&lt;/span&gt; logger implementation
SLF4J&lt;span class="o"&gt;(&lt;/span&gt;W&lt;span class="o"&gt;)&lt;/span&gt;: See https://www.slf4j.org/codes.html#noProviders &lt;span class="k"&gt;for &lt;/span&gt;further details.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline is running! Open Pub/Sub in your browser, go to the &lt;em&gt;registered-customer&lt;/em&gt; topic, and click &lt;em&gt;Pull&lt;/em&gt;. You will then see the data processed by Apache Flink 🎉.&lt;/p&gt;

&lt;p&gt;Press &lt;em&gt;CTRL + C&lt;/em&gt; to stop the pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Final thoughts
&lt;/h2&gt;

&lt;p&gt;And we have reached the end of the article! Today we:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Defined the problem faced by the company My Awesome Company (MAC);&lt;/li&gt;
&lt;li&gt;Defined the JSON payloads that would be received from and sent to the Pub/Sub topics;&lt;/li&gt;
&lt;li&gt;Defined the business requirements to be applied to the incoming data;&lt;/li&gt;
&lt;li&gt;Created two Pub/Sub topics: one to receive the message about the customers' initial sign-up and another to publish the data after it is processed by Apache Flink;&lt;/li&gt;
&lt;li&gt;Developed the data pipeline in Apache Flink, defining the business models for each payload received and sent, the serializers and deserializers for the JSON strings, and finally the data pipeline itself, applying the business rules defined earlier.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's all for today, "th-th-that's all, folks!" 🐷! If you enjoyed it, give me a hand, smash that &lt;em&gt;like&lt;/em&gt; button, and share it with your friends, deal?&lt;/p&gt;

&lt;p&gt;See you next time 💚&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>scala</category>
      <category>braziliandevs</category>
      <category>flink</category>
    </item>
    <item>
      <title>Integrating a Web API with the Datastore Emulator</title>
      <dc:creator>Geazi Anc</dc:creator>
      <pubDate>Tue, 21 Feb 2023 15:12:12 +0000</pubDate>
      <link>https://dev.to/geazi_anc/integrando-uma-web-api-com-datastore-emulator-3a18</link>
      <guid>https://dev.to/geazi_anc/integrando-uma-web-api-com-datastore-emulator-3a18</guid>
      <description>&lt;p&gt;O custo elevado do faturamento associado aos projetos do Google Cloud Platform (GCP) é algo que sempre devemos ter em mente durante todo o ciclo de desenvolvimento de um produto.&lt;/p&gt;

&lt;p&gt;One approach to mitigate this problem is to use emulators that simulate some services locally, at zero cost to the project.&lt;/p&gt;

&lt;p&gt;Today we will see how to run the official Datastore emulator locally with Docker, and how to integrate it with a Web API!&lt;/p&gt;

&lt;h2&gt;
  
  
  Developing the solution
&lt;/h2&gt;

&lt;p&gt;We will walk through a brief solution to demonstrate how the Datastore emulator works. To do so, we will build a very simple Web API that integrates with the local Datastore.&lt;/p&gt;

&lt;p&gt;The entire solution will be built with containers via Docker.&lt;/p&gt;

&lt;h3&gt;
  
  
  Web API
&lt;/h3&gt;

&lt;p&gt;We will develop an API responsible for registering users. The data will be persisted locally in the Datastore container.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;POST /users&lt;/code&gt;: saves a user in the Datastore;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GET /users&lt;/code&gt;: retrieves all users persisted in the Datastore.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will also write a Dockerfile that installs the required libraries and starts the API's development server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;src
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;touch &lt;/span&gt;src/app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datastore&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;


&lt;span class="n"&gt;PROJECT_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DATASTORE_PROJECT_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datastore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;User&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;


&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;201&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;user_entity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;entity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;user_entity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_entity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;


&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_users&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;touch &lt;/span&gt;requirements.txt
&lt;span class="nv"&gt;fastapi&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;0.91.0
&lt;span class="nv"&gt;uvicorn&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;0.20.0
google-cloud-datastore&lt;span class="o"&gt;==&lt;/span&gt;2.13.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$touch&lt;/span&gt; web-api.Dockerfile
FROM python:3.11-alpine


WORKDIR /app

COPY requirements.txt requirements.txt
RUN pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

COPY &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;

CMD &lt;span class="nb"&gt;exec &lt;/span&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; uvicorn app:app &lt;span class="nt"&gt;--reload&lt;/span&gt; &lt;span class="nt"&gt;--app-dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./src &lt;span class="nt"&gt;--host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Datastore
&lt;/h3&gt;

&lt;p&gt;O desenvolvimento de um contêiner do Datastore consiste nos seguintes passos:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Desenvolvimento de uma imagem customizada tendo como base a imagem oficial dos emuladores do GCP;&lt;/li&gt;
&lt;li&gt;Iniciar o servidor web do Datastore;&lt;/li&gt;
&lt;/ul&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;touch &lt;/span&gt;datastore.Dockerfile
FROM gcr.io/google.com/cloudsdktool/google-cloud-cli:emulators


WORKDIR /datastore

CMD &lt;span class="nb"&gt;exec &lt;/span&gt;gcloud beta emulators datastore start &lt;span class="nt"&gt;--project&lt;/span&gt; my-local-project &lt;span class="nt"&gt;--host-port&lt;/span&gt; 0.0.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Desenvolvimento dos contêineres
&lt;/h3&gt;

&lt;p&gt;Por fim, vamos desenvolver um arquivo docker-compose que irá orquestrar os contêineres construídos com base em nossas imagens.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Nota: observe que é necessário configurar as variáveis de ambiente do Datastore no contêiner da API. Isso se faz necessário para que o SDK do Datastore envie as solicitações diretamente para o contêiner local, não para os servidores do GCP.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;touch &lt;/span&gt;docker-compose.yml
version: &lt;span class="s1"&gt;'3.9'&lt;/span&gt;

services:
  web-api:
    build:
      dockerfile: ./web-api.Dockerfile
    environment:
      DATASTORE_DATASET: my-local-dataset
      DATASTORE_EMULATOR_HOST: datastore:8081
      DATASTORE_EMULATOR_HOST_PATH: datastore:8081/datastore
      DATASTORE_HOST: http://datastore:8081
      DATASTORE_PROJECT_ID: my-local-project
    volumes:
      - ./:/app
    ports:
      - 8000:8000

  datastore:
    build:
      dockerfile: ./datastore.Dockerfile
    ports:
      - 8081:8081
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
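

&lt;p&gt;Vale lembrar que essas variáveis são lidas automaticamente pelo SDK do Datastore. A título de ilustração, segue um esboço mínimo (hipotético, não faz parte da API que desenvolvemos) que pode ser executado dentro do contêiner web-api para conferir que o cliente está apontando para o emulador local:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# esboço ilustrativo: o cliente, criado sem argumentos, lê as variáveis de
# ambiente (DATASTORE_EMULATOR_HOST, DATASTORE_PROJECT_ID etc.) definidas
# no docker-compose e passa a enviar as solicitações para o emulador local
from google.cloud import datastore

client = datastore.Client()

print(client.project)  # projeto resolvido a partir do ambiente, não do GCP
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;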





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;docker compose ps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Vamos testar!
&lt;/h3&gt;

&lt;p&gt;Iremos fazer três solicitações para nossa API. As duas primeiras serão solicitações POST, que irão salvar dois usuários no Datastore, e a última será uma solicitação GET, que irá recuperar os dois usuários persistidos no banco de dados.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; &lt;span class="s1"&gt;'POST'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s1"&gt;'http://localhost:8000/users'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'accept: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "name": "John",
  "age": 20
}'&lt;/span&gt;

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    57  100    24  100    33     54     75 &lt;span class="nt"&gt;--&lt;/span&gt;:--:-- &lt;span class="nt"&gt;--&lt;/span&gt;:--:-- &lt;span class="nt"&gt;--&lt;/span&gt;:--:--   130
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"name"&lt;/span&gt;:&lt;span class="s2"&gt;"John"&lt;/span&gt;,&lt;span class="s2"&gt;"age"&lt;/span&gt;:20&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; &lt;span class="s1"&gt;'POST'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s1"&gt;'http://localhost:8000/users'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'accept: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "name": "Mary",
  "age": 18
}'&lt;/span&gt;

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    57  100    24  100    33   1411   1941 &lt;span class="nt"&gt;--&lt;/span&gt;:--:-- &lt;span class="nt"&gt;--&lt;/span&gt;:--:-- &lt;span class="nt"&gt;--&lt;/span&gt;:--:--  3352
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"name"&lt;/span&gt;:&lt;span class="s2"&gt;"Mary"&lt;/span&gt;,&lt;span class="s2"&gt;"age"&lt;/span&gt;:18&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; &lt;span class="s1"&gt;'GET'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s1"&gt;'http://localhost:8000/users'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'accept: application/json'&lt;/span&gt;

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    51  100    51    0     0    850      0 &lt;span class="nt"&gt;--&lt;/span&gt;:--:-- &lt;span class="nt"&gt;--&lt;/span&gt;:--:-- &lt;span class="nt"&gt;--&lt;/span&gt;:--:--   850
&lt;span class="o"&gt;[{&lt;/span&gt;&lt;span class="s2"&gt;"name"&lt;/span&gt;:&lt;span class="s2"&gt;"John"&lt;/span&gt;,&lt;span class="s2"&gt;"age"&lt;/span&gt;:20&lt;span class="o"&gt;}&lt;/span&gt;,&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"name"&lt;/span&gt;:&lt;span class="s2"&gt;"Mary"&lt;/span&gt;,&lt;span class="s2"&gt;"age"&lt;/span&gt;:18&lt;span class="o"&gt;}]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Considerações finais
&lt;/h2&gt;

&lt;p&gt;O uso dos emuladores do Google Cloud Platform pode ser uma abordagem interessante durante o ciclo de desenvolvimento de um produto. Com eles, podemos testar nossas soluções quantas vezes forem necessárias sem acarretar um custo elevado no faturamento do projeto, visto que todas as solicitações das bibliotecas do Google Cloud serão feitas localmente, ao invés de serem feitas para os servidores da nuvem.&lt;/p&gt;

&lt;p&gt;Obrigado por ter me acompanhado até aqui 💚. Até a próxima!&lt;/p&gt;

&lt;h2&gt;
  
  
  Referências
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/datastore/docs/tools/datastore-emulator?hl=pt-br" rel="noopener noreferrer"&gt;Como executar o Emulador do Datastore&lt;/a&gt;;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/sdk/gcloud/reference/beta/emulators" rel="noopener noreferrer"&gt;Gcloud beta emulators&lt;/a&gt;;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>discuss</category>
    </item>
    <item>
      <title>PySpark: A brief analysis to the most common words in Dracula, by Bram Stoker</title>
      <dc:creator>Geazi Anc</dc:creator>
      <pubDate>Wed, 11 Jan 2023 17:06:56 +0000</pubDate>
      <link>https://dev.to/geazi_anc/pyspark-a-brief-analysis-to-the-most-common-words-in-dracula-by-bram-stoker-1ij4</link>
      <guid>https://dev.to/geazi_anc/pyspark-a-brief-analysis-to-the-most-common-words-in-dracula-by-bram-stoker-1ij4</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Note: this article is also available in &lt;a href="https://dev.to/geazi_anc/pyspark-uma-breve-analise-das-palavras-mais-comuns-em-dracula-por-bram-stoker-4an3"&gt;portuguese&lt;/a&gt; 🌎.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A landmark in Gothic literature, the iconic novel Dracula, written by Bram Stoker in 1897, stirs the emotions of people across the world. Today, to introduce Spark's new concepts and features, we will develop a brief notebook to analyze the most common words in this classic book 🧛🏼‍♂️.&lt;/p&gt;

&lt;p&gt;To do this, we will write a notebook in &lt;a href="https://colab.research.google.com/"&gt;Google Colab&lt;/a&gt;, a cloud service built by Google to encourage machine learning and artificial intelligence research.&lt;/p&gt;

&lt;p&gt;This notebook is also available in my &lt;a href="https://github.com/geazi-anc/dracula"&gt;GitHub&lt;/a&gt; 😉.&lt;/p&gt;

&lt;p&gt;This novel was obtained through &lt;a href="https://www.gutenberg.org/"&gt;Project Gutenberg&lt;/a&gt;, a digital library that centralizes public domain books from around the world.&lt;/p&gt;

&lt;h2&gt;
  
  
  Before getting started
&lt;/h2&gt;

&lt;p&gt;Before starting, we need to install the &lt;a href="https://spark.apache.org/docs/latest/api/python/index.html"&gt;PySpark&lt;/a&gt; library.&lt;/p&gt;

&lt;p&gt;PySpark is the official Python API for Apache Spark. We will develop our data analysis using it 🎲.&lt;/p&gt;

&lt;p&gt;So, create a new code cell in Colab and add the following line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!pip install pyspark
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step one: running Apache Spark
&lt;/h2&gt;

&lt;p&gt;After the installation is complete, we need to run Apache Spark. To do this, create a new code cell and add the following code block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;         from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("The top most common words in Dracula, by Bram Stoker")
         .getOrCreate()
         )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step two: downloading and reading
&lt;/h2&gt;

&lt;p&gt;In this step, we will download the novel from the Gutenberg Project and, after that, load it using PySpark.&lt;/p&gt;

&lt;p&gt;We will use the &lt;strong&gt;wget&lt;/strong&gt; tool to do this, passing the book's URL to it, saving the file in the local directory, and renaming it to &lt;strong&gt;Dracula - Bram Stoker.txt&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Again, create a new code cell in Colab and add the following code line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!wget https: // www.gutenberg.org/cache/epub/345/pg345.txt -O "Dracula - Bram Stoker.txt"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step three: stopwords downloading
&lt;/h2&gt;

&lt;p&gt;In this section, we will download the list of stopwords used in the English language. These stopwords normally include prepositions, particles, interjections, conjunctions, adverbs, pronouns, introductory words, numbers from 0 to 9 (unambiguous), other frequently used parts of speech, symbols, and punctuation. Relatively recently, this list was supplemented with sequences of symbols commonly used on the Internet, such as www, com, http, etc.&lt;/p&gt;

&lt;p&gt;This list was obtained through &lt;a href="https://countwordsfree.com/stopwords"&gt;CountWordsFree&lt;/a&gt;, a website that centralizes the stopwords used in many languages across the world.&lt;/p&gt;

&lt;p&gt;Let's get to work! Create a new code cell in Colab and add the following code line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!wget https://countwordsfree.com/stopwords/english/txt -O "stop_words_english.txt"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, let’s load the book using Spark. Create a new code cell and add the following code block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;book = spark.read.text("Dracula - Bram Stoker.txt")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And let’s load the stopwords as well. The stopwords will be stored in a list, in the &lt;strong&gt;stopwords&lt;/strong&gt; variable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with open("stop_words_english.txt", "r") as f:
    text = f.read()
    stopwords = text.splitlines()

len(stopwords), stopwords[:15]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(851,
 ['able',
  'about',
  'above',
  'abroad',
  'according',
  'accordingly',
  'across',
  'actually',
  'adj',
  'after',
  'afterwards',
  'again',
  'against',
  'ago',
  'ahead'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step four: extracting words
&lt;/h2&gt;

&lt;p&gt;After the load is complete, we need to extract the words into a dataframe column.&lt;/p&gt;

&lt;p&gt;To do this, we apply the &lt;strong&gt;split&lt;/strong&gt; function to each line, splitting it on the blank spaces between the words. The result is a list of words.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql.functions import split

lines = book.select(split(book.value, " ").alias("line"))
lines.show(5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------------------+
|                line|
+--------------------+
|[The, Project, Gu...|
|                  []|
|[This, eBook, is,...|
|[most, other, par...|
|[whatsoever., You...|
+--------------------+
only showing top 5 rows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step five: exploding list words
&lt;/h2&gt;

&lt;p&gt;Now, let’s convert this list of words into a dataframe column, using the &lt;strong&gt;explode&lt;/strong&gt; function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql.functions import explode, col

words = lines.select(explode(col("line")).alias("word"))
words.show(15)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+---------+
|     word|
+---------+
|      The|
|  Project|
|Gutenberg|
|    eBook|
|       of|
| Dracula,|
|       by|
|     Bram|
|   Stoker|
|         |
|     This|
|    eBook|
|       is|
|      for|
|      the|
+---------+
only showing top 15 rows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step six: words to lowercase
&lt;/h2&gt;

&lt;p&gt;This is a simple step. We don't want the same word to be counted differently because of capital letters, so we convert all the words to lowercase, using the &lt;strong&gt;lower&lt;/strong&gt; function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql.functions import lower

words_lower = words.select(lower(col("word")).alias("word_lower"))
words_lower.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----------+
|word_lower|
+----------+
|       the|
|   project|
| gutenberg|
|     ebook|
|        of|
|  dracula,|
|        by|
|      bram|
|    stoker|
|          |
|      this|
|     ebook|
|        is|
|       for|
|       the|
|       use|
|        of|
|    anyone|
|  anywhere|
|        in|
+----------+
only showing top 20 rows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step seven: removing punctuations
&lt;/h2&gt;

&lt;p&gt;So that the same word is not counted differently because of the punctuation at the end of it, we need to remove that punctuation.&lt;/p&gt;

&lt;p&gt;We'll do this using the &lt;strong&gt;regexp_extract&lt;/strong&gt; function, which extracts words from a string using a regex.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql.functions import regexp_extract

words_clean = words_lower.select(
    regexp_extract(col("word_lower"), "[a-z]+", 0).alias("word")
)

words_clean.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+---------+
|     word|
+---------+
|      the|
|  project|
|gutenberg|
|    ebook|
|       of|
|  dracula|
|       by|
|     bram|
|   stoker|
|         |
|     this|
|    ebook|
|       is|
|      for|
|      the|
|      use|
|       of|
|   anyone|
| anywhere|
|       in|
+---------+
only showing top 20 rows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step eight: removing null values
&lt;/h2&gt;

&lt;p&gt;However, as you can see, there are still null values, in other words, blank spaces.&lt;/p&gt;

&lt;p&gt;It is necessary to remove them so that these blank values are not analyzed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;words_nonull = words_clean.filter(col("word") != "")
words_nonull.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+---------+
|     word|
+---------+
|      the|
|  project|
|gutenberg|
|    ebook|
|       of|
|  dracula|
|       by|
|     bram|
|   stoker|
|     this|
|    ebook|
|       is|
|      for|
|      the|
|      use|
|       of|
|   anyone|
| anywhere|
|       in|
|      the|
+---------+
only showing top 20 rows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step nine: removing stopwords
&lt;/h2&gt;

&lt;p&gt;We are almost there! The last step is to remove the stopwords so that, again, these words are not analyzed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;words_without_stopwords = words_nonull.filter(
    ~words_nonull.word.isin(stopwords))

words_count_before_removing = words_nonull.count()
words_count_after_removing = words_without_stopwords.count()

words_count_before_removing, words_count_after_removing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(163399, 50222)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step ten: analyzing the most common words in Dracula, finally!
&lt;/h2&gt;

&lt;p&gt;And, finally, our data is completely cleaned. So, now we can analyze the most common words in the book.&lt;/p&gt;

&lt;p&gt;First, we’ll group the words and then use an aggregate function to count them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;words_count = (words_without_stopwords.groupby("word")
               .count()
               .orderBy("count", ascending=False)
               )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we show the top 20 most common words. This value may be changed through the &lt;strong&gt;rank&lt;/strong&gt; variable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rank = 20
words_count.show(rank)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------+-----+
|    word|count|
+--------+-----+
|    time|  381|
| helsing|  323|
|     van|  322|
|    lucy|  297|
|    good|  256|
|     man|  255|
|    mina|  240|
|    dear|  224|
|   night|  224|
|    hand|  209|
|    room|  207|
|    face|  206|
|jonathan|  206|
|   count|  197|
|    door|  197|
|   sleep|  192|
|    poor|  191|
|    eyes|  188|
|    work|  188|
|      dr|  187|
+--------+-----+
only showing top 20 rows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;That’s all for now, folks! In this article, we analyzed the most common words in Dracula, written by Bram Stoker. To do this, we cleaned the words: removing punctuation, converting uppercase letters to lowercase, and removing stopwords.&lt;/p&gt;
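
&lt;p&gt;For reference, here is a minimal sketch (not part of the original notebook) of the whole cleaning pipeline chained into a single expression, reusing the &lt;strong&gt;book&lt;/strong&gt; dataframe and the &lt;strong&gt;stopwords&lt;/strong&gt; list defined earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql.functions import col, explode, lower, regexp_extract, split

# same steps as above: split lines, explode into one word per row, lowercase,
# strip punctuation, drop empty strings and stopwords, then count and sort
words_count = (
    book.select(explode(split(book.value, " ")).alias("word"))
        .select(lower(col("word")).alias("word"))
        .select(regexp_extract(col("word"), "[a-z]+", 0).alias("word"))
        .filter(col("word") != "")
        .filter(~col("word").isin(stopwords))
        .groupby("word")
        .count()
        .orderBy("count", ascending=False)
)

words_count.show(20)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;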

&lt;p&gt;I hope you enjoyed it. Keep those stakes sharp, watch out for the shadows that walk at night, and see you next time 🧛🏼‍♂️🍷.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bibliography
&lt;/h2&gt;

&lt;p&gt;RIOUX, Jonathan. &lt;a href="https://www.amazon.com.br/Analysis-Python-PySpark-Jonathan-Rioux/dp/1617297208"&gt;Data Analysis with Python and PySpark&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;STOKER, Bram. &lt;a href="https://www.gutenberg.org/cache/epub/345/pg345.txt"&gt;Dracula&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>dataengineering</category>
      <category>spark</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Uma breve Introdução ao processamento de dados em tempo real com Spark Structured Streaming e Apache Kafka</title>
      <dc:creator>Geazi Anc</dc:creator>
      <pubDate>Thu, 29 Sep 2022 18:40:13 +0000</pubDate>
      <link>https://dev.to/geazi_anc/uma-breve-introducao-ao-processamento-de-dados-em-tempo-real-com-spark-structured-streaming-e-apache-kafka-5gh7</link>
      <guid>https://dev.to/geazi_anc/uma-breve-introducao-ao-processamento-de-dados-em-tempo-real-com-spark-structured-streaming-e-apache-kafka-5gh7</guid>
      <description>&lt;p&gt;O processamento de dados em tempo real, como o próprio nome diz, é a prática de lidar com o fluxo de dados capturados em tempo real e processados com latência mínima para gerar relatórios instantâneos ou, até mesmo, para produzir respostas automatizadas à um determinado evento.&lt;/p&gt;

&lt;p&gt;Hoje, vamos desenvolver uma aplicação bem simples para a ingestão e o processamento de dados em tempo real com o Spark Structured Streaming e o Apache Kafka 🎲. Como este tutorial tem como objetivo ser uma breve introdução à essas tecnologias, vamos desenvolver um simples contador de palavras. Nada muito elaborado ou complexo 😥.&lt;/p&gt;

&lt;p&gt;Caso queira ver um exemplo mais completo, confira um outro artigo que escrevi: uma análise de dados em tempo real com base em dados de tráfego aéreo.&lt;/p&gt;


&lt;div class="ltag__link"&gt;
  &lt;a href="/geazi_anc" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F925522%2F0d3ba86c-67ae-45a2-97b5-5b49c18abebf.png" alt="geazi_anc"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="/geazi_anc/analise-de-dados-de-trafego-aereo-em-tempo-real-com-spark-structured-streaming-e-apache-kafka-2db5" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Análise de dados de tráfego aéreo em tempo real com Spark Structured Streaming e Apache Kafka&lt;/h2&gt;
      &lt;h3&gt;Geazi Anc ・ Oct 28&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#dataengineering&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#python&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#braziliandevs&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#spark&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


&lt;p&gt;E aí, se interessou? Então continue lendo!&lt;/p&gt;

&lt;p&gt;Você também pode conferir este projeto em meu &lt;a href="https://github.com/geazi-anc/data-streaming-sample" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; 😉.&lt;/p&gt;

&lt;h2&gt;
  
  
  O que é Spark Structured Streaming e Apache Kafka?
&lt;/h2&gt;

&lt;p&gt;O &lt;a href="https://spark.apache.org/streaming/" rel="noopener noreferrer"&gt;Spark Structured Streaming&lt;/a&gt; é um módulo do PySpark que facilita a criação de aplicativos e pipelines de streaming com as mesmas e familiares APIs do Spark. O Spark Structured Streaming abstrai conceitos complexos de streaming, como processamento incremental, pontos de verificação e marcas d'água, para que você possa criar aplicativos e pipelines de streaming sem aprender novos conceitos ou ferramentas ❇.&lt;/p&gt;

&lt;p&gt;Já o &lt;a href="https://kafka.apache.org/" rel="noopener noreferrer"&gt;Apache Kafka&lt;/a&gt; é uma plataforma de streaming de eventos distribuídos de código aberto usada por milhares de empresas para pipelines de dados de alto desempenho, análise de streaming, integração de dados e aplicativos de missão crítica.&lt;/p&gt;

&lt;p&gt;Os eventos, que podem ser dados capturados em tempo real, são enviados à um tópico do Kafka. Fazendo uma analogia: um tópico é como se fosse uma pasta de arquivos em seu computador, e os eventos são os arquivos desta pasta.&lt;/p&gt;

&lt;p&gt;Um produtor, ou producer, é responsável por enviar eventos de streaming à um ou mais tópicos do Kafka. Já um consumidor, ou consumer, é responsável por se inscrever em um ou mais tópicos do Kafka e ler ou processar tais eventos enviados pelo produtor.&lt;/p&gt;

&lt;p&gt;Sugiro a leitura da &lt;a href="https://kafka.apache.org/intro" rel="noopener noreferrer"&gt;documentação&lt;/a&gt; que introduz esses conceitos com mais detalhes 😉.&lt;/p&gt;

&lt;p&gt;Agora que já entendemos um pouco sobre os conceitos abordados nesse tutorial, podemos começar a desenvolver nossa aplicação 👏🏼.&lt;/p&gt;

&lt;h2&gt;
  
  
  Arquitetura
&lt;/h2&gt;

&lt;p&gt;Calma lá! Antes de começarmos, vamos conhecer um pouco sobre a arquitetura de nosso projeto. Ela é composta pelos seguintes componentes lógicos, assim como as tecnologias que serão utilizadas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Armazenamento de dados analíticos: é onde nossos dados ficarão armazenados. Para isso, iremos criar um tópico no Apache Kafka para que, posteriormente, possamos consumir esses dados que estarão sendo enviados em tempo real pelo produtor.&lt;/li&gt;
&lt;li&gt;Ingestão de dados: um produtor desenvolvido em Python que irá enviar palavras aleatórias em tempo real ao tópico criado no Kafka.&lt;/li&gt;
&lt;li&gt;Consumo de dados: um consumidor desenvolvido em Python que tem como objetivo apenas monitorar as palavras que estão chegando em tempo real ao tópico do Kafka.&lt;/li&gt;
&lt;li&gt;Processamento de fluxo e análise: uma aplicação desenvolvida com o PySpark que irá consumir em tempo real as palavras enviadas pelo produtor ao tópico do Kafka. É esta aplicação que irá agregar as palavras e gerar um relatório da contagem de tais palavras.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Criando o ambiente de desenvolvimento
&lt;/h2&gt;

&lt;p&gt;Este tutorial assume que você já tenha o PySpark instalado em sua máquina. Caso ainda não tenha, confira as etapas na própria &lt;a href="https://spark.apache.org/docs/latest/api/python/getting_started/install.html" rel="noopener noreferrer"&gt;documentação&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Já para o Apache Kafka, vamos utilizar ele por meio de conteinerização via Docker 🎉🐳.&lt;/p&gt;

&lt;p&gt;E, por fim, vamos utilizar o Python através de um ambiente virtual.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Kafka por conteinerização via Docker
&lt;/h3&gt;

&lt;p&gt;Sem mais delongas, crie uma pasta chamada &lt;strong&gt;data-streaming-sample&lt;/strong&gt; e adicione o arquivo &lt;strong&gt;docker-compose.yml&lt;/strong&gt; dentro dela.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ mkdir data-streaming-sample
$ cd data-streaming-sample
$ touch docker-compose.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Agora, adicione o seguinte conteúdo dentro do arquivo docker-compose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: '3.9'

services:
  zookeeper:
    image: confluentinc/cp-zookeeper:latest
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
    ports:
      - 2181:2181

  kafka:
    image: confluentinc/cp-kafka:latest
    depends_on:
      - zookeeper
    ports:
      - 29092:29092
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:29092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Feito! Já podemos subir nosso servidor do Kafka. Para isso, digite o seguinte comando no terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker compose up -d
$ docker compose ps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME                                COMMAND                  SERVICE             STATUS              PORTS
data-streaming-sample-kafka-1       "/etc/confluent/dock…"   kafka               running             9092/tcp, 0.0.0.0:29092-&amp;gt;29092/tcp
data-streaming-sample-zookeeper-1   "/etc/confluent/dock…"   zookeeper           running             2888/tcp, 0.0.0.0:2181-&amp;gt;2181/tcp, 3888/tcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Observação: este tutorial está utilizando a versão 2.0 do Docker Compose. É por este motivo que não há o "-" entre &lt;strong&gt;docker&lt;/strong&gt; e &lt;strong&gt;compose&lt;/strong&gt; ☺.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Agora, precisamos criar um tópico dentro do Kafka que irá armazenar as palavras enviadas em tempo real pelo produtor. Para isso, vamos acessar o Kafka dentro do contêiner:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ docker compose exec kafka bash&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;E enfim criar o tópico, chamado de &lt;strong&gt;words&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ kafka-topics --create --topic words --bootstrap-server localhost:29092&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Created topic words.&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Criação do ambiente virtual
&lt;/h3&gt;

&lt;p&gt;Para desenvolvermos nosso produtor, ou seja, a aplicação que será responsável por enviar as palavras em tempo real para o tópico do Kafka, precisamos fazer o uso da biblioteca &lt;a href="https://kafka-python.readthedocs.io/en/master/" rel="noopener noreferrer"&gt;kafka-python&lt;/a&gt;. O kafka-python é uma biblioteca desenvolvida pela comunidade que nos permite desenvolver produtores e consumidores que se integram com o Apache Kafka.&lt;/p&gt;

&lt;p&gt;Primeiro, vamos criar um arquivo chamado &lt;strong&gt;requirements.txt&lt;/strong&gt; e adicionar a seguinte dependência dentro dele:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kafka-python&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Segundo, vamos criar um ambiente virtual e instalar as dependências no arquivo requirements.txt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python -m venv venv
$ venv\scripts\activate
$ pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Feito! Agora sim nosso ambiente já está pronto para o desenvolvimento 🚀.&lt;/p&gt;

&lt;h2&gt;
  
  
  Desenvolvimento do produtor
&lt;/h2&gt;

&lt;p&gt;Vamos criar nosso produtor. Como foi dito, um produtor é responsável por enviar os dados em tempo real para um tópico no Kafka. Este produtor irá enviar aleatoriamente uma dentre quatro palavras ao tópico words que criamos anteriormente, em um intervalo de tempo aleatório entre um e cinco segundos.&lt;/p&gt;

&lt;p&gt;Para isso, criamos uma instância da classe KafkaProducer. Esta classe recebe dois parâmetros:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;bootstrap_servers: o servidor onde está rodando o Kafka. Neste caso, ele está rodando no localhost, na porta 29092, conforme configuramos no arquivo docker-compose.&lt;/li&gt;
&lt;li&gt;value_serializer: uma função que serializa os dados em bits para serem enviados para o Kafka. Neste caso, a função recebe uma string e retorna um objeto do tipo Bytes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Depois, utilizamos o método &lt;strong&gt;send&lt;/strong&gt; para enviar os dados ao tópico. Ele recebe dois parâmetros: o tópico para o qual os dados serão enviados e os dados propriamente ditos. O send retorna um objeto do tipo future; ao chamarmos o método get dele, com um timeout, aguardamos a confirmação do envio e recebemos os metadados do registro, que são exibidos no console.&lt;/p&gt;

&lt;p&gt;Vamos criar um diretório chamado &lt;strong&gt;src&lt;/strong&gt; e um subdiretório chamado &lt;strong&gt;kafka&lt;/strong&gt;. Dentro do diretório kafka, vamos criar um arquivo chamado &lt;strong&gt;producer.py&lt;/strong&gt; e adicionar o seguinte código dentro dele:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import random
from time import sleep
from kafka import KafkaProducer


producer = KafkaProducer(
    bootstrap_servers="localhost:29092",
    value_serializer=lambda x: x.encode("utf-8")
)


while True:
    words = [
        "spark",
        "kafka",
        "streaming",
        "python"
    ]

    word = random.choice(words)
    future = producer.send("words", value=word)

    print(future.get(timeout=60))

    sleep(random.randint(1, 5))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;E já podemos executar nosso produtor. Você pode interromper a execução a qualquer momento pressionando CTRL + C.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ python producer.py&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RecordMetadata(topic='words', partition=0, topic_partition=TopicPartition(topic='words', partition=0), offset=0, timestamp=1664469827519, log_start_offset=0, checksum=None, serialized_key_size=-1, serialized_value_size=6, serialized_header_size=-1)
RecordMetadata(topic='words', partition=0, topic_partition=TopicPartition(topic='words', partition=0), offset=1, timestamp=1664469833559, log_start_offset=0, checksum=None, serialized_key_size=-1, serialized_value_size=6, serialized_header_size=-1)
RecordMetadata(topic='words', partition=0, topic_partition=TopicPartition(topic='words', partition=0), offset=2, timestamp=1664469838567, log_start_offset=0, checksum=None, serialized_key_size=-1, serialized_value_size=9, serialized_header_size=-1)
RecordMetadata(topic='words', partition=0, topic_partition=TopicPartition(topic='words', partition=0), offset=3, timestamp=1664469842582, log_start_offset=0, checksum=None, serialized_key_size=-1, serialized_value_size=6, serialized_header_size=-1)
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Desenvolvimento do consumidor
&lt;/h2&gt;

&lt;p&gt;Vamos criar nosso consumidor. Como foi dito antes, e conforme explicado na arquitetura, um consumidor é responsável por se inscrever em um tópico e ler os eventos que são enviados até ele em tempo real. Nosso consumidor simplesmente irá monitorar as palavras que chegam até o tópico words.&lt;/p&gt;

&lt;p&gt;Para isso, vamos criar uma instância da classe &lt;strong&gt;KafkaConsumer&lt;/strong&gt;. Esta classe recebe três parâmetros:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;O tópico que queremos que o consumidor se inscreva. Neste caso, o tópico &lt;strong&gt;words&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;bootstrap_servers: o servidor onde está rodando o Kafka. Neste caso, ele está rodando no localhost, na porta 29092, conforme configuramos no arquivo docker-compose.&lt;/li&gt;
&lt;li&gt;value_deserializer: uma função que deserializa os dados de bits para string.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ainda no diretório kafka, vamos criar um arquivo chamado &lt;strong&gt;consumer.py&lt;/strong&gt; e adicionar o seguinte código dentro dele:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from kafka import KafkaConsumer


consumer = KafkaConsumer(
    "words",
    bootstrap_servers="localhost:29092",
    value_deserializer=lambda x: x.decode("utf-8")
)


for msg in consumer:
    print(msg.value)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Vamos executar nosso consumidor. Certifique-se que o produtor ainda esteja rodando, hein!&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ python consumer.py&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kafka
streaming
kafka
streaming
spark
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;E finalizamos o desenvolvimento de nosso produtor e consumidor. Bem simples, não? ☺&lt;/p&gt;

&lt;h2&gt;
  
  
  Desenvolvimento do processamento de dados em tempo real com Spark Structured Streaming
&lt;/h2&gt;

&lt;p&gt;Esta é uma etapa bem simples, também. Vamos criar uma aplicação com o PySpark que irá se inscrever no tópico words e processar em tempo real os dados que chegam até ele.&lt;/p&gt;

&lt;p&gt;Primeiro, com base na instância da classe SparkSession, vamos utilizar o método &lt;strong&gt;.readStream&lt;/strong&gt;. Depois disso, chamamos uma série de métodos encadeados, entre eles &lt;strong&gt;format&lt;/strong&gt; e &lt;strong&gt;options&lt;/strong&gt;. Com o format, dizemos a ele que vamos ler dados em tempo real do Kafka. Já com os métodos options, informamos o servidor onde o Kafka está rodando, o tópico que ele irá consumir e o modo como deve consumir, ou seja, do mais antigo para o mais recente.&lt;/p&gt;

&lt;p&gt;Mãos à obra! Vamos criar um novo arquivo no diretório src chamado &lt;strong&gt;word_counts.py&lt;/strong&gt; e adicionar o seguinte código dentro dele:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql import SparkSession
from pyspark.sql import functions as F


spark = (SparkSession.builder
         .appName("Words Count Analysis")
         .getOrCreate()
         )

df1 = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:29092")
       .option("subscribe", "words")
       .option("startingOffsets", "earliest")
       .load()
       )

df2 = df1.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

lines = df2.select(F.explode(F.split(df2.value, " ")).alias("word"))

word_counts = (lines.groupBy("word")
               .count()
               .orderBy("count", ascending=False)
               )

(word_counts.writeStream
 .format("console")
 .outputMode("complete")
 .start()
 .awaitTermination()
 )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;E já podemos executar nosso arquivo através do spark-submit. Mas calma lá! Quando estamos integrando o PySpark com o Kafka, devemos executar o spark-submit de modo diferente. É necessário que informemos o pacote do Apache Kafka e a versão atual do Apache Spark através do parâmetro --packages.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Caso seja a primeira vez que esteja integrando o Apache Spark com o Apache Kafka, talvez a execução do spark-submit demore um pouco. Isso ocorre porque ele precisa fazer o download dos pacotes necessários.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Vamos lá! Sei que seus dedos estão coçando para executar o projeto e ver tudo isso em ação. Portanto, digite no terminal, dentro do diretório src:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 word_counts.py&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;E veja o resultado!&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+---------+-----+
|     word|count|
+---------+-----+
|    kafka|   11|
|   python|   10|
|    spark|    7|
|streaming|    6|
+---------+-----+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Observe que o dataframe será constantemente atualizado conforme o producer envia novos dados ao tópico. Deixe o produtor rodando por um tempo e veja por você mesmo 😉.&lt;/p&gt;

&lt;h2&gt;
  
  
  Considerações finais
&lt;/h2&gt;

&lt;p&gt;E acabamos por aqui, pessoal. Neste tutorial ensinei como processar dados em tempo real com Spark Structured Streaming e Apache Kafka.&lt;/p&gt;

&lt;p&gt;Para isso, desenvolvemos um simples contador de palavras, que agrega as palavras consumidas de um tópico do Kafka e exibe a contagem dessas palavras de forma decrescente. Também desenvolvemos um produtor, que envia dados em tempo real constantemente à um tópico do Kafka e um consumidor, que apenas monitora o tópico na medida que novos dados são enviados pelo produtor.&lt;/p&gt;

&lt;p&gt;Espero que tenham gostado. Até a próxima 💚!&lt;/p&gt;

</description>
      <category>python</category>
      <category>dataengineering</category>
      <category>braziliandevs</category>
      <category>spark</category>
    </item>
    <item>
      <title>PySpark: uma breve análise das palavras mais comuns em Drácula, por Bram Stoker</title>
      <dc:creator>Geazi Anc</dc:creator>
      <pubDate>Sat, 24 Sep 2022 16:24:46 +0000</pubDate>
      <link>https://dev.to/geazi_anc/pyspark-uma-breve-analise-das-palavras-mais-comuns-em-dracula-por-bram-stoker-4an3</link>
      <guid>https://dev.to/geazi_anc/pyspark-uma-breve-analise-das-palavras-mais-comuns-em-dracula-por-bram-stoker-4an3</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Note: this article is also available in &lt;a href="https://dev.to/geazi_anc/pyspark-a-brief-analysis-to-the-most-common-words-in-dracula-by-bram-stoker-1ij4"&gt;english&lt;/a&gt; 🌎.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Considerado como um marco da literatura gótica, o icônico livro Drácula, escrito em 1897 por Bram Stoker, desperta até hoje o fascínio das pessoas por todo o mundo. Hoje, a fim de introduzir novos conceitos e funcionalidades do Apache Spark, vamos desenvolver uma breve análise das palavras mais comuns encontradas neste clássico livro 🧛🏼‍♂️.&lt;/p&gt;

&lt;p&gt;Para isso, vamos desenvolver um notebook no &lt;a href="https://colab.research.google.com/" rel="noopener noreferrer"&gt;Google Colab&lt;/a&gt;, um serviço de nuvem gratuito criado pelo Google para incentivar pesquisas na área de machine learning e inteligência artificial.&lt;/p&gt;

&lt;p&gt;Caso não saiba como usar o Google Colab, confira &lt;a href="https://www.alura.com.br/artigos/google-colab-o-que-e-e-como-usar" rel="noopener noreferrer"&gt;este excelente artigo&lt;/a&gt; da Alura escrito pelo Thiago Santos que ensina, de forma muito didática, como usar o Colab e criar seus primeiros códigos!&lt;/p&gt;

&lt;p&gt;O notebook deste artigo também está disponível em meu &lt;a href="https://github.com/geazi-anc/dracula" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; 😉.&lt;/p&gt;

&lt;p&gt;A obra em questão foi obtida por meio do &lt;a href="https://www.gutenberg.org/" rel="noopener noreferrer"&gt;Projeto Gutenberg&lt;/a&gt;, um acervo digital que reúne livros de todo o mundo que já se encontram em domínio público. A versão plaintext de Drácula pode ser baixada gratuitamente &lt;a href="https://www.gutenberg.org/cache/epub/345/pg345.txt" rel="noopener noreferrer"&gt;aqui&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Antes de começar
&lt;/h2&gt;

&lt;p&gt;Antes de iniciarmos o desenvolvimento de nosso notebook, é necessário fazer a instalação da biblioteca &lt;a href="https://spark.apache.org/docs/latest/api/python/index.html" rel="noopener noreferrer"&gt;PySpark&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A biblioteca PySpark é a API oficial do Python para o Apache Spark. É com ela que vamos realizar nossa análise de dados 🎲.&lt;/p&gt;

&lt;p&gt;Crie uma nova célula de código no Colab e execute a seguinte linha:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!pip install pyspark
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Passo um: inicialização do Apache Spark
&lt;/h2&gt;

&lt;p&gt;Logo após a instalação, precisamos inicializar o Apache Spark. Para isso, crie uma nova célula de código no Colab e adicione o seguinte bloco:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;         from pyspark.sql import SparkSession


spark = (SparkSession.builder
         .appName("The top most common words in Dracula, by Bram Stoker")
         .getOrCreate()
         )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Passo dois: download e leitura de Drácula, por Bram Stoker
&lt;/h2&gt;

&lt;p&gt;Agora sim podemos começar! Nesta etapa iremos fazer o download do livro Drácula do projeto Gutenberg e, logo em seguida, fazer a leitura do arquivo através do PySpark.&lt;/p&gt;

&lt;p&gt;O download do livro consiste, basicamente, no uso do utilitário &lt;strong&gt;wget&lt;/strong&gt;, informando a URL que direciona para o livro Drácula no projeto Gutenberg. Depois, salva-se o conteúdo da solicitação, isto é, o próprio livro, no diretório atual, com o nome de &lt;strong&gt;Dracula – Bram Stoker.txt&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Crie uma nova célula no colab e adicione o seguinte bloco de código:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!wget https: // www.gutenberg.org/cache/epub/345/pg345.txt -O "Dracula - Bram Stoker.txt"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Passo três: download das stopwords em inglês
&lt;/h2&gt;

&lt;p&gt;A seguir, iremos fazer o download de uma lista das stopwords que são frequentemente usadas no idioma inglês. Essas stopwords normalmente incluem preposições, partículas, interjeições, conjunções, advérbios, pronomes, palavras introdutórias, números de 0 a 9 (inequívocos), outras partes da fala usadas com frequência, símbolos e pontuação. Recentemente, essa lista foi complementada por sequências de símbolos comumente usadas na Internet, como www, com, http, etc.&lt;/p&gt;

&lt;p&gt;Essa lista foi adquirida através do site &lt;a href="https://countwordsfree.com/stopwords" rel="noopener noreferrer"&gt;CountWordsFree&lt;/a&gt;, um site que, dentre outros utilitários, reúne as stopwords encontradas em diversos idiomas, incluindo o nosso querido português.&lt;/p&gt;

&lt;p&gt;Mãos à obra! Crie uma nova célula de código e adicione o seguinte bloco:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!wget https://countwordsfree.com/stopwords/english/txt -O "stop_words_english.txt"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Feito esses downloads, podemos fazer a leitura do livro através do PySpark. Crie uma nova célula no Colab e adicione o seguinte bloco de código:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;book = spark.read.text("Dracula - Bram Stoker.txt")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;E também vamos fazer a leitura das stopwords que acabamos de baixar. As stopwords serão armazenadas em uma lista, na variável &lt;strong&gt;stopwords&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;raw_stopwords = spark.read.text("stop_words_english.txt")
stopwords = raw_stopwords.selectExpr("value as stopwords")

stopwords.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;|  stopwords|
+-----------+
|       able|
|      about|
|      above|
|     abroad|
|  according|
|accordingly|
|     across|
|   actually|
|        adj|
|      after|
| afterwards|
|      again|
|    against|
|        ago|
|      ahead|
|      ain't|
|        all|
|      allow|
|     allows|
|     almost|
+-----------+
only showing top 20 rows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Passo quatro: Extração individual das palavras
&lt;/h2&gt;

&lt;p&gt;Após a leitura do livro, é necessário que transformemos cada uma das palavras em uma coluna no DataFrame.&lt;/p&gt;

&lt;p&gt;Para isso, utiliza-se o método &lt;strong&gt;split&lt;/strong&gt;, o qual, para cada uma das linhas, irá separar cada uma das palavras através do espaço em branco entre elas. O resultado será uma lista de palavras.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql.functions import split


lines = book.select(split(book.value, " ").alias("line"))
lines.show(5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------------------+
|                line|
+--------------------+
|[The, Project, Gu...|
|                  []|
|[This, eBook, is,...|
|[most, other, par...|
|[whatsoever., You...|
+--------------------+
only showing top 5 rows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Passo cinco: explodindo a lista de palavras em colunas no DataFrame
&lt;/h2&gt;

&lt;p&gt;Depois das palavras terem sido separadas, é necessário que se faça a conversão desta lista de palavras em colunas no DataFrame.&lt;/p&gt;

&lt;p&gt;Para tal, usa-se o método &lt;strong&gt;explode&lt;/strong&gt; presente no Apache Spark.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql.functions import explode, col


words = lines.select(explode(col("line")).alias("word"))
words.show(15)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+---------+
|     word|
+---------+
|      The|
|  Project|
|Gutenberg|
|    eBook|
|       of|
| Dracula,|
|       by|
|     Bram|
|   Stoker|
|         |
|     This|
|    eBook|
|       is|
|      for|
|      the|
+---------+
only showing top 15 rows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Passo seis: transformando todas as palavras em minúsculas
&lt;/h2&gt;

&lt;p&gt;Esta é uma etapa bem simples. Para que não haja distinção da mesma palavra por conta de letras maiúsculas, vamos transformar todas as palavras no DataFrame para letras minúsculas, fazendo o uso da função &lt;strong&gt;lower&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql.functions import lower


words_lower = words.select(lower(col("word")).alias("word_lower"))
words_lower.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----------+
|word_lower|
+----------+
|       the|
|   project|
| gutenberg|
|     ebook|
|        of|
|  dracula,|
|        by|
|      bram|
|    stoker|
|          |
|      this|
|     ebook|
|        is|
|       for|
|       the|
|       use|
|        of|
|    anyone|
|  anywhere|
|        in|
+----------+
only showing top 20 rows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Passo sete: eliminação de pontuação
&lt;/h2&gt;

&lt;p&gt;Para que também não haja distinção da mesma palavra por conta da pontuação presente no final delas, é preciso removê-las.&lt;/p&gt;

&lt;p&gt;Isso é feito através do método &lt;strong&gt;regexp_extract&lt;/strong&gt;, o qual extrai palavras de uma string por meio de uma expressão regular.&lt;/p&gt;

&lt;p&gt;Calma, não precisa se assustar! A expressão é bem simples. Ela consiste em um conjunto contendo todos os símbolos de A a Z, uma ou mais vezes. Viu, eu te disse que era bem simples 👏🏼.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql.functions import regexp_extract


words_clean = words_lower.select(
    regexp_extract(col("word_lower"), "[a-z]+", 0).alias("word")
)

words_clean.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+---------+
|     word|
+---------+
|      the|
|  project|
|gutenberg|
|    ebook|
|       of|
|  dracula|
|       by|
|     bram|
|   stoker|
|         |
|     this|
|    ebook|
|       is|
|      for|
|      the|
|      use|
|       of|
|   anyone|
| anywhere|
|       in|
+---------+
only showing top 20 rows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Passo oito: remoção de valores nulos
&lt;/h2&gt;

&lt;p&gt;Como visto, mesmo após a remoção das pontuações ainda há colunas com valores nulos, ou seja, espaços em branco.&lt;/p&gt;

&lt;p&gt;Para que esses espaços em branco não sejam considerados na análise da frequência de cada palavra presente no livro, é necessário removê-los.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;words_nonull = words_clean.filter(col("word") != "")
words_nonull.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+---------+
|     word|
+---------+
|      the|
|  project|
|gutenberg|
|    ebook|
|       of|
|  dracula|
|       by|
|     bram|
|   stoker|
|     this|
|    ebook|
|       is|
|      for|
|      the|
|      use|
|       of|
|   anyone|
| anywhere|
|       in|
|      the|
+---------+
only showing top 20 rows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Passo nove: remoção das stopwords
&lt;/h2&gt;

&lt;p&gt;Estamos quase lá! Antes de partirmos para a análise das palavras mais comuns propriamente dita, precisamos remover as stopwords de nosso dataframe, para que elas não sejam levadas em consideração durante a análise.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;words_without_stopwords = (
    words_nonull.join(stopwords, words_nonull["word"] == stopwords["stopwords"], how="left")
    .filter("stopwords is null")
    .select("word")
)


words_count_before_removing = words_nonull.count()
words_count_after_removing = words_without_stopwords.count()

words_count_before_removing, words_count_after_removing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(163399, 50222)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Passo dez: análise das palavras mais comuns
&lt;/h2&gt;

&lt;p&gt;E, finalmente, chegamos ao fim da limpeza de nossos dados. Agora sim podemos começar a análise das palavras mais comuns presentes no livro.&lt;/p&gt;

&lt;p&gt;Primeiro, é realizado a contagem das palavras mais frequentes no dataframe. Para isso, vamos agrupar cada uma das palavras e depois vamos usar uma função de agregação, &lt;strong&gt;count&lt;/strong&gt;, para determinar quantas vezes elas aparecem.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;words_count = (words_without_stopwords.groupby("word")
               .count()
               .orderBy("count", ascending=False)
               )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we display the 20 most common words. The size of the ranking can be adjusted through the &lt;strong&gt;rank&lt;/strong&gt; variable. Feel free to tweak it as you like.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rank = 20
words_count.show(rank)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------+-----+
|    word|count|
+--------+-----+
|    time|  381|
| helsing|  323|
|     van|  322|
|    lucy|  297|
|    good|  256|
|     man|  255|
|    mina|  240|
|    dear|  224|
|   night|  224|
|    hand|  209|
|    room|  207|
|    face|  206|
|jonathan|  206|
|   count|  197|
|    door|  197|
|   sleep|  192|
|    poor|  191|
|    eyes|  188|
|    work|  188|
|      dr|  187|
+--------+-----+
only showing top 20 rows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Final considerations
&lt;/h2&gt;

&lt;p&gt;That's it for today, folks. We've reached the end of our brief analysis.&lt;/p&gt;

&lt;p&gt;In this article, we analyzed the most common words in the book Dracula, by Bram Stoker. To do that, we had to clean the data: splitting the text into words by the spaces between them; exploding the list of words into rows of the dataframe; lowercasing every letter; removing all punctuation with a regular expression; and, finally, removing the stopwords. The whole chain is summarized below.&lt;/p&gt;
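
&lt;p&gt;For reference, here is a condensed sketch of that cleaning chain as a single sequence of operations. It assumes the &lt;strong&gt;words_lower&lt;/strong&gt; and &lt;strong&gt;stopwords&lt;/strong&gt; dataframes built in the earlier steps; it is not new logic, just the steps above chained together.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql.functions import col, regexp_extract

# Condensed sketch of the cleaning steps covered in this article
# (assumes words_lower and stopwords exist exactly as created earlier).
words_clean = (
    words_lower
    .select(regexp_extract(col("word_lower"), "[a-z]+", 0).alias("word"))
    .filter(col("word") != "")
)

words_count = (
    words_clean
    .join(stopwords, words_clean["word"] == stopwords["stopwords"], how="left")
    .filter("stopwords is null")
    .groupby("word")
    .count()
    .orderBy("count", ascending=False)
)

words_count.show(20)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;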

&lt;p&gt;I hope you enjoyed it. Keep your stakes sharp, beware of the shadows that walk the night, and see you next time 🧛🏼‍♂️🍷.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;RIOUX, Jonathan. &lt;a href="https://www.amazon.com.br/Analysis-Python-PySpark-Jonathan-Rioux/dp/1617297208" rel="noopener noreferrer"&gt;Data Analysis with Python and PySpark&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;STOKER, Bram. &lt;a href="https://www.gutenberg.org/cache/epub/345/pg345.txt" rel="noopener noreferrer"&gt;Dracula&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>dataengineering</category>
      <category>spark</category>
      <category>braziliandevs</category>
    </item>
    <item>
      <title>Introduction to data analysis with PySpark using League of Legends champion data</title>
      <dc:creator>Geazi Anc</dc:creator>
      <pubDate>Thu, 15 Sep 2022 17:57:35 +0000</pubDate>
      <link>https://dev.to/geazi_anc/introducao-a-analise-de-dados-com-pyspark-utilizando-os-dados-dos-campeoes-de-league-of-legends-5cj9</link>
      <guid>https://dev.to/geazi_anc/introducao-a-analise-de-dados-com-pyspark-utilizando-os-dados-dos-campeoes-de-league-of-legends-5cj9</guid>
      <description>&lt;p&gt;O &lt;a href="https://www.leagueoflegends.com/pt-br/"&gt;League of Legends&lt;/a&gt;, também conhecido como lolzinho, para os íntimos, é um jogo ambientado no mundo fantasioso de Runeterra, com batalhas sangrentas e muita magia. Em League of Legends, os jogadores controlam personagens conhecidos como campeões, cada um com suas habilidades e diferentes estilos de jogo.&lt;/p&gt;

&lt;p&gt;Neste artigo, iremos analisar algumas estatísticas desses campeões fazendo o uso do &lt;a href="https://spark.apache.org/docs/latest/api/python/index.html"&gt;PySpark&lt;/a&gt;, uma API do framework Apache Spark desenvolvida para a linguagem de programação Python 🐍. Os dados serão extraídos da web API &lt;a href="https://developer.riotgames.com/docs/lol#data-dragon"&gt;Data Dragon&lt;/a&gt;, uma API pública da Riot Games.&lt;/p&gt;

&lt;p&gt;Para isso, vamos desenvolver um notebook no &lt;a href="https://colab.research.google.com/"&gt;Google Colab&lt;/a&gt;, um serviço de nuvem gratuito criado pelo Google para incentivar pesquisas na área de machine learning e inteligência artificial.&lt;/p&gt;

&lt;p&gt;Caso não saiba como usar o Google Colab, confira &lt;a href="https://www.alura.com.br/artigos/google-colab-o-que-e-e-como-usar"&gt;este excelente artigo&lt;/a&gt; da Alura escrito pelo Thiago Santos que ensina, de forma muito didática, como usar o Colab e criar seus primeiros códigos!&lt;/p&gt;

&lt;p&gt;O notebook deste artigo também está disponível em meu &lt;a href="https://github.com/geazi-anc/lol-champions-analysis"&gt;GitHub&lt;/a&gt; 😉.&lt;/p&gt;

&lt;p&gt;Peguem suas espadas, preparem suas magias, e vamos começar ⚔🧙🏼‍♀️!&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;p&gt;Before we start, we need to install two libraries: &lt;a href="https://spark.apache.org/docs/latest/api/python/index.html"&gt;PySpark&lt;/a&gt; and &lt;a href="https://requests.readthedocs.io/en/latest/"&gt;Requests&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;PySpark, as mentioned, is the official Python API for Apache Spark. It's what we'll use to carry out our data analysis 🎲.&lt;/p&gt;

&lt;p&gt;Requests, in turn, is a library that lets us make HTTP requests to a given website. It's through it that we will extract the champion data from the public Riot Games API 🚀.&lt;/p&gt;

&lt;p&gt;Create a new code cell in Colab and run the following lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!pip install pyspark
!pip install requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Initialization
&lt;/h2&gt;

&lt;p&gt;Right after installing the libraries, we need to initialize Apache Spark. To do that, we import the &lt;strong&gt;SparkSession&lt;/strong&gt; class from the &lt;strong&gt;sql&lt;/strong&gt; module of the &lt;strong&gt;pyspark&lt;/strong&gt; library.&lt;/p&gt;

&lt;p&gt;After the import, we instantiate the SparkSession class through a series of chained methods, such as &lt;strong&gt;appName&lt;/strong&gt; and &lt;strong&gt;getOrCreate&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("Introdução à análise de dados com PySpark utilizando os dados dos campeões de League of Legends")
         .getOrCreate()
         )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Extracting the champion data
&lt;/h2&gt;

&lt;p&gt;The League of Legends champion data is extracted through an HTTP request to an endpoint of the &lt;a href="https://developer.riotgames.com/docs/lol#data-dragon"&gt;Data Dragon&lt;/a&gt; API, a public Riot Games API that centralizes the game's data, such as champions, items, spells, etc.&lt;/p&gt;

&lt;p&gt;The response is a JSON object similar to this one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "type": "champion",
    "format": "standAloneComplex",
    "version": "12.17.1",
    "data": {
        "Aatrox": {},
        "Ahri": {...},
        "Akali": {...},
        "Akshan": {...},
        "Alistar": {...},
        ...,
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that the data we want is inside the &lt;strong&gt;data&lt;/strong&gt; key. Let's grab it, discard the rest, and display only the names of all the champions.&lt;/p&gt;

&lt;p&gt;Create a new code cell and run the following block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests

response=requests.get(
"https://ddragon.leagueoflegends.com/cdn/12.17.1/data/pt_BR/champion.json")

champions=response.json().get("data")
champions.keys()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;dict_keys(['Aatrox', 'Ahri', 'Akali', 'Akshan', 'Alistar', ...])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We can also look at the data for a specific champion. In this case, let's look at Akali's stats.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;champions.get("akali")&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'version': '12.17.1',
 'id': 'Akali',
 'key': '84',
 'name': 'Akali',
 'title': 'a Assassina Renegada',
 'blurb': 'Abandonando a Ordem Kinkou e seu título de Punho das Sombras, Akali agora ataca sozinha, pronta para ser a arma mortal que seu povo precisa. Embora ela mantenha tudo o que aprendeu com seu mestre Shen, ela se comprometeu a defender Ionia de seus...',
 'info': {'attack': 5, 'defense': 3, 'magic': 8, 'difficulty': 7},
 'image': {'full': 'Akali.png',
  'sprite': 'champion0.png',
  'group': 'champion',
  'x': 96,
  'y': 0,
  'w': 48,
  'h': 48},
 'tags': ['Assassin'],
 'partype': 'Energia',
 'stats': {'hp': 570,
  'hpperlevel': 119,
  'mp': 200,
  'mpperlevel': 0,
  'movespeed': 345,
  'armor': 23,
  'armorperlevel': 4.7,
  'spellblock': 37,
  'spellblockperlevel': 2.05,
  'attackrange': 125,
  'hpregen': 9,
  'hpregenperlevel': 0.9,
  'mpregen': 50,
  'mpregenperlevel': 0,
  'crit': 0,
  'critperlevel': 0,
  'attackdamage': 62,
  'attackdamageperlevel': 3.3,
  'attackspeedperlevel': 3.2,
  'attackspeed': 0.625}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Cleaning the data
&lt;/h2&gt;

&lt;p&gt;Before we actually start the analysis, we need to do some preliminary cleaning on the data. We'll keep only the fields that interest us and flatten the nested dictionaries, leaving a single dictionary per champion with the data we need.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;champions=[{'name': value['name'], 'title': value['title'], **value['info'], **value['stats']} for key, value in champions.items()]
champions[2]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'name': 'Akali',
 'title': 'a Assassina Renegada',
 'attack': 5,
 'defense': 3,
 'magic': 8,
 'difficulty': 7,
 'hp': 570,
 'hpperlevel': 119,
 'mp': 200,
 'mpperlevel': 0,
 'movespeed': 345,
 'armor': 23,
 'armorperlevel': 4.7,
 'spellblock': 37,
 'spellblockperlevel': 2.05,
 'attackrange': 125,
 'hpregen': 9,
 'hpregenperlevel': 0.9,
 'mpregen': 50,
 'mpregenperlevel': 0,
 'crit': 0,
 'critperlevel': 0,
 'attackdamage': 62,
 'attackdamageperlevel': 3.3,
 'attackspeedperlevel': 3.2,
 'attackspeed': 0.625}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Creating the DataFrame
&lt;/h2&gt;

&lt;p&gt;Now we're talking! The champion data is clean, so we can create our Spark DataFrame.&lt;/p&gt;

&lt;p&gt;Unfortunately, Spark is somewhat... picky about the type of object we pass it to create a DataFrame. So our current champions object, which is a list of dictionaries, is not accepted by Spark.&lt;/p&gt;

&lt;p&gt;But there is a solution 👏🏼. The Pandas library is much more flexible when it comes to creating a new DataFrame. So we can build a Pandas DataFrame from our current champions object and then create the Spark DataFrame from the one created by Pandas.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

df = spark.createDataFrame(pd.DataFrame(champions))

df.select("name", "title").show(5, False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-------+-----------------------+
|name   |title                  |
+-------+-----------------------+
|Aatrox |a Espada Darkin        |
|Ahri   |a Raposa de Nove Caudas|
|Akali  |a Assassina Renegada   |
|Akshan |o Sentinela Rebelde    |
|Alistar|o Minotauro            |
+-------+-----------------------+
only showing top 5 rows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Concatenating columns
&lt;/h2&gt;

&lt;p&gt;I don't know about you, but I find it a bit annoying to keep selecting the champions' names and titles every time we want to look at their data. So let's concatenate the &lt;strong&gt;name&lt;/strong&gt; and &lt;strong&gt;title&lt;/strong&gt; columns into a new column called &lt;strong&gt;full_name&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To do that, we first use the &lt;strong&gt;withColumn&lt;/strong&gt; method. In short, this method lets us create a new column in our DataFrame.&lt;/p&gt;

&lt;p&gt;The method's first parameter is the name of our new column. The second parameter is the data we want to populate the new column with: in this case, the concatenation of the &lt;strong&gt;name&lt;/strong&gt; column with the &lt;strong&gt;title&lt;/strong&gt; column.&lt;/p&gt;

&lt;p&gt;To concatenate the string columns, we use the &lt;strong&gt;concat&lt;/strong&gt; function.&lt;br&gt;
This function takes as parameters the columns we want to concatenate. However, we can't pass only the column names, otherwise the names and titles would be glued together. So we also use the &lt;strong&gt;lit&lt;/strong&gt; function, which creates a literal column with the value we pass to it, that is: ", ".&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql import functions as F

df = df.withColumn("full_name", F.concat(df.name, F.lit(", "), df.title))
df.select("full_name").show(5, False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------------+
|full_name                    |
+-----------------------------+
|Aatrox, a Espada Darkin      |
|Ahri, a Raposa de Nove Caudas|
|Akali, a Assassina Renegada  |
|Akshan, o Sentinela Rebelde  |
|Alistar, o Minotauro         |
+-----------------------------+
only showing top 5 rows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Who are the most powerful champions in League of Legends?
&lt;/h2&gt;

&lt;p&gt;Curious to know who the most powerful League of Legends champions are? Yeah, so am I. Let's find out 👀!&lt;/p&gt;

&lt;p&gt;For this analysis, consider that what determines a champion's power level are their attack damage, armor, health, and mana values.&lt;/p&gt;

&lt;p&gt;So, to see who the most powerful champions are, we just need to sort our DataFrame by those columns in descending order.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A small note: right now all champions are at level one.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;base_columns = ["attackdamage", "armor", "hp", "mp"]

(df.orderBy(*base_columns, ascending=False)
 .select("full_name", *base_columns)
 .show(5, False)
 )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+---------------------------------+------------+-----+-----+-----+
|full_name                        |attackdamage|armor|hp   |mp   |
+---------------------------------+------------+-----+-----+-----+
|Tryndamere, o Rei Bárbaro        |72.0        |33   |696.0|100.0|
|Cho'Gath, o Terror do Vazio      |69.0        |38   |644.0|270.0|
|Renekton, o Carniceiro das Areias|69.0        |35   |660.0|100.0|
|Ornn, O Fogo sob a Montanha      |69.0        |33   |660.0|340.6|
|Kayn, o Ceifador das Sombras     |68.0        |38   |655.0|410.0|
+---------------------------------+------------+-----+-----+-----+
only showing top 5 rows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Level up!
&lt;/h2&gt;

&lt;p&gt;As mentioned, our champions are currently at level 1. Let's bump them up to level 10.&lt;/p&gt;

&lt;p&gt;Note that the champions' stats should grow as their levels increase. In this analysis, we will only update the attack damage, armor, health, and mana values.&lt;/p&gt;

&lt;p&gt;To update these values, we use the &lt;strong&gt;withColumns&lt;/strong&gt; method.&lt;br&gt;
This method takes a dictionary where the keys are the column names and the values are the columns with the updated data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;level = 10

df2 = df.withColumns({
    "attackdamage": df.attackdamage+df.attackdamageperlevel*level,
    "armor": df.armor+df.armorperlevel*level,
    "hp": df.hp+df.hpperlevel*level,
    "mp": df.mp+df.mpperlevel*level
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Who are the most powerful champions in League of Legends (again)?
&lt;/h2&gt;

&lt;p&gt;With all champions now at level 10, let's see whether the power ranking from the previous analysis held up or whether it changed.&lt;br&gt;
Remember that we are still measuring power level based only on the attack damage, armor, health, and mana columns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(df2.orderBy(*base_columns, ascending=False)
 .select("full_name", *base_columns)
 .show(5, False)
 )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------------+------------+-----+------+-----+
|full_name                    |attackdamage|armor|hp    |mp   |
+-----------------------------+------------+-----+------+-----+
|Illaoi, a Sacerdotisa Cráquem|118.0       |85.0 |1746.0|800.0|
|Olaf, o Berserker            |115.0       |77.0 |1835.0|816.0|
|Darius, a Mão de Noxus       |114.0       |91.0 |1792.0|838.0|
|Yorick, o Pastor de Almas    |112.0       |91.0 |1790.0|900.0|
|Cho'Gath, o Terror do Vazio  |111.0       |85.0 |1584.0|870.0|
+-----------------------------+------------+-----+------+-----+
only showing top 5 rows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Power level statistics
&lt;/h2&gt;

&lt;p&gt;To wrap up, let's look at some simple statistics for all of our champions at level 10.&lt;/p&gt;

&lt;p&gt;We'll compute the average attack damage, the maximum HP and mana, and the minimum armor.&lt;/p&gt;

&lt;p&gt;We'll use the &lt;strong&gt;agg&lt;/strong&gt; method. It takes a dictionary where the keys are the names of the columns we want to analyze and the values are the functions we want to apply to them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(df2.agg({
    "attackdamage": "mean",
    "hp": "max",
    "mp": "max",
    "armor": "min"
})
    .show()
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-------+----------+-----------------+-------+
|max(mp)|min(armor)|avg(attackdamage)|max(hp)|
+-------+----------+-----------------+-------+
|10000.0|      28.0|91.40481987577641| 1892.0|
+-------+----------+-----------------+-------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Final considerations
&lt;/h2&gt;

&lt;p&gt;That's it, my friends. We wrap up our analysis here 🎆.&lt;/p&gt;

&lt;p&gt;In this article I showed how to run a very simple analysis on League of Legends champion data. We extracted the data through the public Riot Games API; did some preliminary data cleaning; created a new column from the concatenation of the champions' names and titles; ranked the most powerful champions by their power level; and, finally, analyzed the champions' statistics at both level 1 and level 10.&lt;/p&gt;

&lt;p&gt;I hope you enjoyed it. See you next time 💚!&lt;/p&gt;

</description>
      <category>pyspark</category>
      <category>python</category>
      <category>dataanalysis</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Pokemons Flow: building a data pipeline with Apache Airflow to extract pokemons via API</title>
      <dc:creator>Geazi Anc</dc:creator>
      <pubDate>Tue, 13 Sep 2022 13:51:47 +0000</pubDate>
      <link>https://dev.to/geazi_anc/pokemons-flow-desenvolvendo-uma-pipeline-de-dados-com-apache-airflow-para-extracao-de-pokemon-via-api-525m</link>
      <guid>https://dev.to/geazi_anc/pokemons-flow-desenvolvendo-uma-pipeline-de-dados-com-apache-airflow-para-extracao-de-pokemon-via-api-525m</guid>
      <description>&lt;p&gt;O &lt;a href="https://airflow.apache.org/"&gt;Apache Airflow&lt;/a&gt; é uma plataforma desenvolvida pela comunidade para criar, agendar e monitorar fluxos de trabalho, tudo feito programaticamente. Com ela, os pipelines do Airflow são definidos em Python, permitindo a geração dinâmica de pipeline, sem sair da sintaxe que já conhecemos 🐍.&lt;/p&gt;

&lt;p&gt;Saber desenvolver pipeline de dados com o Apache Airflow é um requisito mais do que essencial caso você almege uma carreira em engenharia de dados. Portanto, caso queira saber mais sobre essa poderosa ferramenta, continue lendo 😉.&lt;/p&gt;

&lt;p&gt;Neste artigo irei ensinar como desenvolver uma pipeline de dados para extrair os famosos monstrinhos de bolso, os pokemons, da API &lt;a href="https://pokeapi.co/"&gt;PokeAPI&lt;/a&gt;. Depois da extração será aplicado algumas transformações bem simples nos dados para que, por fim, possam ser salvos localmente, simulando o carregamento dos dados em um data warehouse.&lt;/p&gt;

&lt;p&gt;O código completo desenvolvido neste artigo pode ser conferido em meu &lt;a href="https://github.com/geazi-anc/pokemonsflow"&gt;GitHub&lt;/a&gt; 🤖.&lt;/p&gt;

&lt;h2&gt;
  
  
  Container-based development with Docker Compose
&lt;/h2&gt;

&lt;p&gt;There are many ways to install Airflow on your machine. All of them are covered in the &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/installation/index.html"&gt;official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For this article, however, we will develop our pipeline using containers via Docker Compose.&lt;br&gt;
If you don't have Docker installed, check the installation guide in the &lt;a href="https://docs.docker.com/engine/install/"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The YML file that will bring up our cluster can be downloaded &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html"&gt;here&lt;/a&gt;, straight from the official Airflow page. No changes to the file are needed for this project 💚.&lt;/p&gt;

&lt;p&gt;Downloading this file is required for the next steps.&lt;/p&gt;
&lt;h2&gt;
  
  
  Starting the cluster
&lt;/h2&gt;

&lt;p&gt;First, let's create a directory called &lt;strong&gt;pokemonsflow&lt;/strong&gt; and add the &lt;strong&gt;docker-compose.yml&lt;/strong&gt; file to it.&lt;/p&gt;

&lt;p&gt;After that, open a terminal in the directory and type the following commands to start the Airflow cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;docker-compose up airflow-init
&lt;span class="nv"&gt;$ &lt;/span&gt;docker-compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;docker-compose ps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With that, your Airflow cluster is up and the environment is ready for development 🚀.&lt;/p&gt;

&lt;p&gt;Note that your &lt;strong&gt;pokemonsflow&lt;/strong&gt; directory now has three new subdirectories:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dags/
logs/
plugins/
docker-compose.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, create a subdirectory called &lt;strong&gt;data&lt;/strong&gt; inside the &lt;strong&gt;dags&lt;/strong&gt; directory. This is where the data extracted by the pipeline will be saved:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ mkdir dags/data&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Building the data pipeline
&lt;/h2&gt;

&lt;p&gt;Now that our Docker-based development environment is up, we can take the first steps in structuring our DAG 🎆.&lt;/p&gt;

&lt;p&gt;First of all, create a file called &lt;strong&gt;pokemonsflow_dag.py&lt;/strong&gt; in the &lt;strong&gt;dags&lt;/strong&gt; subdirectory. Note that the &lt;strong&gt;_dag&lt;/strong&gt; suffix in the file name is needed for Airflow to automatically recognize our DAG 😉:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;dags
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;touch &lt;/span&gt;pokemonsflow_dag.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, add the following code to the file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import pendulum
import requests
from airflow.decorators import dag, task


@dag(
    schedule_interval=None,
    start_date=pendulum.datetime(2022, 1, 1, tz='UTC'),
    catchup=False
)
def pokemonsflow_dag():

    @task
    def extract() -&amp;gt; list:
        pass

    @task
    def transform(pokemons: list) -&amp;gt; list:
        pass

    @task
    def load(pokemons: list):
        pass


dag = pokemonsflow_dag()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With that we have the initial structure of our DAG. The requests and pandas libraries are needed for the data extraction and transformation tasks.&lt;/p&gt;

&lt;p&gt;The DAG has three main tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extract: extracts twenty pokemons from the PokeAPI;&lt;/li&gt;
&lt;li&gt;Transform: selects only five fields from the extracted pokemons and sorts them in descending order by the &lt;strong&gt;base_experience&lt;/strong&gt; field;&lt;/li&gt;
&lt;li&gt;Load: finally, this task saves the transformed data to the &lt;strong&gt;/dags/data/&lt;/strong&gt; subdirectory in CSV format;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As you may have noticed, both the &lt;strong&gt;transform&lt;/strong&gt; task and the &lt;strong&gt;load&lt;/strong&gt; task depend on the data extracted or transformed by the previous task. To move data between tasks, Airflow uses an internal mechanism called &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html"&gt;XComs&lt;/a&gt;, short for cross-communication.&lt;/p&gt;

&lt;p&gt;Before Airflow 2.0, sharing data between tasks via XComs was somewhat... verbose. With Airflow 2.0, however, we can share data between tasks simply by passing it around as if it were function parameters. Simple, right? ☺&lt;/p&gt;
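
&lt;p&gt;For comparison, here is a rough sketch of what the pre-2.0 style looked like, using the classic &lt;strong&gt;PythonOperator&lt;/strong&gt; with explicit XCom pushes and pulls. This is illustrative only: the DAG id, task ids, key, and sample values are hypothetical, and it is not the approach used in this article.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**kwargs):
    # Push the extracted data to XCom explicitly through the task instance.
    kwargs['ti'].xcom_push(key='pokemons', value=['bulbasaur', 'ivysaur'])


def transform(**kwargs):
    # Pull the data that the extract task pushed.
    pokemons = kwargs['ti'].xcom_pull(task_ids='extract', key='pokemons')
    print(pokemons)


with DAG(
    dag_id='xcom_old_style_example',
    schedule_interval=None,
    start_date=pendulum.datetime(2022, 1, 1, tz='UTC'),
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id='extract', python_callable=extract)
    transform_task = PythonOperator(task_id='transform', python_callable=transform)

    extract_task &gt;&gt; transform_task
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;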

&lt;p&gt;Now let's develop each task of our DAG individually.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data extraction
&lt;/h3&gt;

&lt;p&gt;The pokemons are extracted from the API through the following steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Make a GET call to the &lt;strong&gt;/api/v2/pokemon&lt;/strong&gt; endpoint with the &lt;strong&gt;limit=20&lt;/strong&gt; parameter to restrict the results. The result is a JSON with a &lt;strong&gt;results&lt;/strong&gt; field similar to this:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
[
  {
    'name': 'bulbasaur',
    'url': 'https://pokeapi.co/api/v2/pokemon/1/'
  },
  {
    'name': 'ivysaur',
    'url': 'https://pokeapi.co/api/v2/pokemon/2/'
  },
  {
    'name': 'venusaur',
    'url': 'https://pokeapi.co/api/v2/pokemon/3/'
  },
  {
    'name': 'charmander',
    'url': 'https://pokeapi.co/api/v2/pokemon/4/'
  },
  ...
]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Then we access the &lt;strong&gt;results&lt;/strong&gt; field and make a GET call to each of the URLs. The result is a list with the twenty pokemons extracted from the API;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's get to work! Add the following code inside the &lt;strong&gt;extract&lt;/strong&gt; function of our DAG, remembering, of course, to remove the &lt;strong&gt;pass&lt;/strong&gt; keyword:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    @task
    def extract() -&amp;gt; list:
        url = 'http://pokeapi.co/api/v2/pokemon'

        params = {
            'limit': 20
        }

        response = requests.get(url=url, params=params)

        json_response = response.json()
        results = json_response['results']

        pokemons = [requests.get(url=result['url']).json()
                    for result in results]

        return pokemons
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Congratulations! The task that extracts the twenty pokemons from the API is done 🎉.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data transformation
&lt;/h3&gt;

&lt;p&gt;If you request the endpoint via Postman, for example, you'll notice that each pokemon has countless fields, such as &lt;strong&gt;forms&lt;/strong&gt;, &lt;strong&gt;stats&lt;/strong&gt;, etc. However, we will keep only five fields from that JSON.&lt;/p&gt;

&lt;p&gt;The steps of the &lt;strong&gt;transform&lt;/strong&gt; task are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Take the data extracted by the previous task, that is, the &lt;strong&gt;extract&lt;/strong&gt; task;&lt;/li&gt;
&lt;li&gt;Create a Pandas DataFrame from the data and select five columns, discarding the rest;&lt;/li&gt;
&lt;li&gt;Sort the DataFrame by the &lt;strong&gt;base_experience&lt;/strong&gt; column in descending order;&lt;/li&gt;
&lt;li&gt;Convert the DataFrame to a list of Python dictionaries so the data can be passed on to the next task;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, back to work! Add the following code inside the &lt;strong&gt;transform&lt;/strong&gt; function of our DAG, again remembering, of course, to remove the &lt;strong&gt;pass&lt;/strong&gt; keyword:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    @task
    def transform(pokemons: list) -&amp;gt; list:

        columns = [
            'name',
            'order',
            'base_experience',
            'height',
            'weight'
        ]

        df = pd.DataFrame(data=pokemons, columns=columns)
        df = df.sort_values(['base_experience'], ascending=False)

        pokemons = df.to_dict('records')

        return pokemons
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Done! The data transformation task is complete 🎉.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data loading
&lt;/h3&gt;

&lt;p&gt;Finally, all that's left is to build the &lt;strong&gt;load&lt;/strong&gt; task. Its steps are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Take the data transformed by the previous task;&lt;/li&gt;
&lt;li&gt;Create a Pandas DataFrame from the transformed data;&lt;/li&gt;
&lt;li&gt;Save the DataFrame to the &lt;strong&gt;/dags/data/&lt;/strong&gt; directory in CSV format;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Add the following code inside the &lt;strong&gt;load&lt;/strong&gt; function of our DAG, once more remembering to remove the &lt;strong&gt;pass&lt;/strong&gt; keyword:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    @task
    def load(pokemons: list):

        df = pd.DataFrame(data=pokemons)
        df.to_csv('./dags/data/pokemons_dataset.csv', index=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Congratulations! All the tasks of our DAG have been built 🎉.&lt;/p&gt;

&lt;h3&gt;
  
  
  Task orchestration and data transfer
&lt;/h3&gt;

&lt;p&gt;Hold on, we're not done yet! Before wrapping up, we need to tell Airflow the order in which to run these tasks and to transfer the data from one task to the next. Below the functions, add the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    # ETL pipeline

    # extract
    extracted_pokemons = extract()

    # transform
    transformed_pokemons = transform(pokemons=extracted_pokemons)

    # load
    load(pokemons=transformed_pokemons)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we're really done 👏🏼. Note that data is transferred between the tasks by passing it as function parameters. Very, very simple!&lt;/p&gt;

&lt;p&gt;The final code of our DAG looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import pendulum
import requests
from airflow.decorators import dag, task


@dag(
    schedule_interval=None,
    start_date=pendulum.datetime(2022, 1, 1, tz='UTC'),
    catchup=False
)
def pokemonsflow_dag():

    @task
    def extract() -&amp;gt; list:
        url = 'http://pokeapi.co/api/v2/pokemon'

        params = {
            'limit': 20
        }

        response = requests.get(url=url, params=params)

        json_response = response.json()
        results = json_response['results']

        pokemons = [requests.get(url=result['url']).json()
                    for result in results]

        return pokemons

    @task
    def transform(pokemons: list) -&amp;gt; list:

        columns = [
            'name',
            'order',
            'base_experience',
            'height',
            'weight'
        ]

        df = pd.DataFrame(data=pokemons, columns=columns)
        df = df.sort_values(['base_experience'], ascending=False)

        pokemons = df.to_dict('records')

        return pokemons

    @task
    def load(pokemons: list):

        df = pd.DataFrame(data=pokemons)
        df.to_csv('./dags/data/pokemons_dataset.csv', index=False)

    # ETL pipeline

    # extract
    extracted_pokemons = extract()

    # transform
    transformed_pokemons = transform(pokemons=extracted_pokemons)

    # load
    load(pokemons=transformed_pokemons)


dag = pokemonsflow_dag()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Testing the DAG
&lt;/h2&gt;

&lt;p&gt;We can now test whether our DAG works as expected. To do that, go to the terminal open at the root of the directory and type the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;docker-compose &lt;span class="nb"&gt;exec &lt;/span&gt;airflow-worker bash
&lt;span class="nv"&gt;$ &lt;/span&gt;airflow dags &lt;span class="nb"&gt;test &lt;/span&gt;pokemonsflow_dag 2022-01-01

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the DAG run finishes, go to the &lt;strong&gt;/dags/data/&lt;/strong&gt; directory and check the &lt;strong&gt;pokemons_dataset.csv&lt;/strong&gt; file. All twenty pokemons will be sorted by the &lt;strong&gt;base_experience&lt;/strong&gt; column, as the quick check below illustrates.&lt;/p&gt;
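
&lt;p&gt;If you prefer to verify it programmatically, here is a minimal sketch with pandas, run from the project root and assuming the file name used by the &lt;strong&gt;load&lt;/strong&gt; task above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# Read the CSV written by the load task and confirm the ordering.
df = pd.read_csv('./dags/data/pokemons_dataset.csv')

print(len(df))  # expected: 20 pokemons
print(df['base_experience'].is_monotonic_decreasing)  # expected: True
print(df.head())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;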

&lt;h2&gt;
  
  
  Final considerations
&lt;/h2&gt;

&lt;p&gt;As we've seen, Apache Airflow is a powerful tool for orchestrating the tasks of a data pipeline. Don't forget to check out Airflow's countless other features 😉.&lt;/p&gt;

&lt;p&gt;In this article I showed how to build a DAG that extracts twenty pokemons from the PokeAPI. A few very simple transformations were applied to the data before saving it locally in CSV format.&lt;/p&gt;

&lt;p&gt;If you enjoyed this article, don't forget to like it and share it on social media 💚.&lt;/p&gt;

&lt;p&gt;See you next time!&lt;/p&gt;

</description>
      <category>python</category>
      <category>dataengineering</category>
      <category>braziliandevs</category>
      <category>airflow</category>
    </item>
  </channel>
</rss>
