DEV Community: romerito

Importing Python Functions from Repos into the Databricks Notebook

romerito — Fri, 10 Feb 2023 02:51:42 +0000

Importing functions into a Databricks notebook has always been a bit complicated. In traditional Python development it is natural to create modules that abstract as much code as possible and then import them into the main file. Seen that way it is simple, but from a data transformation point of view it gets more complicated: currently, to import a library into a notebook we have to configure the package on the cluster (via PyPI, a wheel, etc.), and if you need to add a new package you have to stop the cluster, configure it and restart it for the change to take effect (tiring).
But it is not all sadness. Today I came across something really interesting, although I have not yet finished studying it in depth. There is another possibility: keep all the code in Azure Repos, link that repository to Databricks Repos, and finally import the package from there.
To illustrate this experiment, I am going to create a few data transformation functions.
Our project structure will look like this.

Basic stuff. Let's go to the implementation of the functions:
loads.py

from json import loads
from pyspark.sql.types import StructType as struct

def schema(line: str):
    schema = f"schemas/{line}"
    with open(schema, "r") as file:
        get = file.read()
        estrutura = struct.fromJson(loads(get))
    return estrutura


def read(arquivo: str, schema: str, formato: str, separador: str, engine: object):
    dataframe = (
        engine.read.format(formato)
        .schema(schema)
        .option("header", "true")
        .option("delimiter", f"{separador}")
        .option("inferSchema", "false")
        .option("path", f"{arquivo}")
        .load()
    )
    return dataframe

One function loads the schema, and the other reads the file for ingestion.
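For reference, StructType.fromJson expects the same JSON layout that Spark produces when it serialises a schema: a top-level object of type struct holding a list of fields. The sketch below builds such a document with the standard library only. The two column names are taken from the SQL query shown later in this post; their types, and everything else about the real licitacoes.json, are assumptions made for illustration.

```python
import json

# Hypothetical columns: the names come from the SQL file shown later in the
# post, the types are assumed for illustration only.
columns = [("Tipo", "string"), ("ValorContratado", "double")]

schema_dict = {
    "type": "struct",
    "fields": [
        {"name": name, "type": dtype, "nullable": True, "metadata": {}}
        for name, dtype in columns
    ],
}

# Text that could be saved as schemas/licitacoes.json and later passed
# through json.loads() and StructType.fromJson() by the schema() function.
schema_json = json.dumps(schema_dict, indent=2)
print(schema_json)
```

Keeping the schema in a file like this lets the notebook stay free of column definitions, which is the point of the schema() helper above.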

Now let's create the implementation of the module:
processing.py

def view(dataframe: object, descricao: str):
    return dataframe.createOrReplaceTempView(descricao)


def query(dataframe: object, describe: str, file: str, engine: object):
    view(dataframe, describe)
    with open(f"{file}") as load:
        query = load.read()
        queryObject = engine.sql(query)
    return queryObject

One function creates a Spark SQL view from a DataFrame, and the other runs a SQL file that performs a specific transformation.

And finally the implementation of the main code:

# Databricks notebook source
# DBTITLE 1,Add Python Module Paths
import sys
import os
sys.path.append('/Workspace/Repos/fulano@outlook.com.br/delta-live-tables/functions/ingestion')
sys.path.append('/Workspace/Repos/fulano@outlook.com.br/delta-live-tables/functions/transformer')

# COMMAND ----------

# DBTITLE 1,Import Modules
from loads import schema
from loads import read
from processing import view
from processing import query

# COMMAND ----------

Data = "/FileStore/tables/licitacoes_2022.csv"
FileSQL = "/Workspace/Repos/fulano@outlook.com.br/delta-live-tables/sql/agrupado-por-tipo.sql"

# COMMAND ----------

# DBTITLE 1,Load schema from file
SchLicitacao = schema("licitacoes.json")

# COMMAND ----------

# DBTITLE 1,Load data
Licitacao = read(Data, SchLicitacao, "csv", ";", spark)

# COMMAND ----------

display(Licitacao)

# COMMAND ----------

# DBTITLE 1,Query that groups by Tipo
ValorPorTipo = query(Licitacao,"TabelaLicitacao",FileSQL, spark)
display(ValorPorTipo)

Note that this code contains Databricks notebook metadata. Now let's push our changes to Azure DevOps via git, and this is what we get.

And now, from inside Databricks, let's pull these changes.

Now I will explain the Databricks notebook cell by cell.

The sys.path.append() call adds the given directories to Python's module search path at run time, so the modules can be imported as if they sat next to our main file. Shall we see if it works?
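The mechanism can be tried outside Databricks as well. The following is a minimal sketch using only the standard library: it writes a tiny module into a temporary directory, registers that directory on the search path, and imports from it. The module name demo_loads and its one-line schema function are made up for this illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create a throwaway directory holding a tiny module (hypothetical name).
module_dir = Path(tempfile.mkdtemp())
(module_dir / "demo_loads.py").write_text(
    "def schema(line):\n    return f'schemas/{line}'\n"
)

# Until the directory is on sys.path, "import demo_loads" would raise
# ModuleNotFoundError. Appending it makes the module importable.
sys.path.append(str(module_dir))
importlib.invalidate_caches()

from demo_loads import schema

print(type(schema))  # <class 'function'>
print(schema("licitacoes.json"))
```

Because the path is appended rather than inserted at the front, modules that already exist earlier on the search path keep priority over the ones in the repo folder.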

Done! The built-in type() function shows that the schema object is a function.

And finally, let's ingest a file and apply some transformations.

Now let's run a transformation from a .sql file. This is what our file looks like:
/Workspace/Repos/fulano.com.br/delta-live-tables/sql/agrupado-por-tipo.sql

SELECT
Tipo,
SUM(ValorContratado) as ValorContratado
FROM TabelaLicitacao
GROUP BY Tipo
ORDER BY ValorContratado DESC

Done! Really simple. With the possibility of keeping functions and transformation files in Azure Repos, the sky is the limit. I hope this helps; the code is here.
https://github.com/romermor/databricks-using-functions-repos

Azure-cli: Linking the azure account and installing the azuredevops extension

romerito — Thu, 15 Dec 2022 05:06:24 +0000

To follow along, see the previous post on installing azure-cli on Linux.

Now that we have azure-cli installed on Linux, let's link it to our Microsoft Azure account. The process is quite simple: in the terminal, type the following command:

romerito@dev:~$ az login

Pressing [Enter] will open a web page similar to this one.

In my case the account is already logged in, so I just need to select it. One important detail: the account you add must be the same account that accesses the Azure Portal. After you select the account a success message appears; close it, go back to the terminal, and you will see something like this.

There you go. From this point on you can manage Azure resources from the command line.
And finally, let's install the Azure DevOps extension with this command:

romerito@dev:~$ az extension add --name azure-devops

Wait a few seconds and bam, installed!
To check that it really was installed, type in the terminal:

romerito@dev:~$ az devops -h

Group
    az devops : Manage Azure DevOps organization level operations.
        Related Groups
        az pipelines: Manage Azure Pipelines
        az boards: Manage Azure Boards
        az repos: Manage Azure Repos
        az artifacts: Manage Azure Artifacts.

Subgroups:
    admin            : Manage administration operations.
    extension        : Manage extensions.
    project          : Manage team projects.
    security         : Manage security related operations.
    service-endpoint : Manage service endpoints/connections.
    team             : Manage teams.
    user             : Manage users.
    wiki             : Manage wikis.

Commands:
    configure        : Configure the Azure DevOps CLI or view your configuration.
    invoke           : This command will invoke request for any DevOps area and resource. Please use
                       only json output as the response of this command is not fixed. Helpful docs -
                       https://docs.microsoft.com/en-us/rest/api/azure/devops/.
    login            : Set the credential (PAT) to use for a particular organization.
    logout           : Clear the credential for all or a particular organization.

That's it for today. In the next post we will create a project in Azure DevOps.

Azure-cli: Installing the tool on Linux

romerito — Thu, 15 Dec 2022 03:38:58 +0000

This post is the first in a series about the Azure CLI, a command-line interface for managing Microsoft Azure resources.
This is my current Linux distribution:

romerito@dev:~$ cat /etc/os-release | head -3

NAME="Pop!_OS"
VERSION="22.04 LTS"
ID=pop

As you can see, my OS is Pop!_OS, which is based on Ubuntu/Debian, so we will use the APT package manager (Advanced Package Tool).
There are several ways to install the program; here we chose to set up the package repository on our own machine.
First we refresh the package index and install the prerequisite packages with this command:

romerito@dev:~$ sudo apt-get update && \
    sudo apt-get install ca-certificates curl \
    apt-transport-https lsb-release gnupg

Now we need to download Microsoft's signing key so that apt trusts the external repository.
The key is published in ASCII-armored form (.asc), while apt expects files ending in .gpg under /etc/apt/trusted.gpg.d to be binary keyrings, so we convert it by passing the --dearmor flag to gpg. After that we can add the repository to the package source list.
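To make the conversion less mysterious, here is a rough Python sketch of what de-armoring means: the armored file is essentially base64 text wrapped in BEGIN/END marker lines, and the binary form is the decoded payload. This is a simplification written for illustration, not a replacement for gpg; in particular it ignores the checksum line instead of verifying it, and the sample block below does not contain a real key.

```python
import base64


def dearmor(armored: str) -> bytes:
    """Decode an ASCII-armored OpenPGP block into its binary payload."""
    body = []
    in_body = False
    for line in armored.strip().splitlines():
        line = line.strip()
        if line.startswith("-----END"):
            break
        if not in_body:
            # The BEGIN line and optional headers end at the first blank line.
            in_body = line == ""
            continue
        if line.startswith("="):
            break  # checksum line, not part of the payload (not verified here)
        body.append(line)
    return base64.b64decode("".join(body))


# Fake armored block built from placeholder bytes, for demonstration only.
sample = "\n".join([
    "-----BEGIN PGP PUBLIC KEY BLOCK-----",
    "",
    base64.b64encode(b"not a real key").decode(),
    "=AAAA",
    "-----END PGP PUBLIC KEY BLOCK-----",
])
print(dearmor(sample))
```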

romerito@dev:~$ curl -sL https://packages.microsoft.com/keys/microsoft.asc |
    gpg --dearmor |
    sudo tee /etc/apt/trusted.gpg.d/microsoft.gpg > /dev/null

And finally we add the repository

romerito@dev:~$ AZ_REPO=$(lsb_release -cs) &&
echo "deb [arch=amd64] https://packages.microsoft.com/repos/azure-cli/ $AZ_REPO main" |
    sudo tee /etc/apt/sources.list.d/azure-cli.list

After that, we update the system again and install the package

romerito@dev:~$ sudo apt-get update &&
sudo apt-get install azure-cli

Done! Now if you type the az command in the terminal, you will see this.

In the next post we will link the CLI to our Microsoft Azure account. See you later.

Abstracting spark modules using simple python functions

romerito — Wed, 07 Dec 2022 06:32:19 +0000

I really enjoy working with Apache Spark, but if there is one thing that tires me out it is having to repeat the same method names and commands over and over. To me that is totally unproductive, and that complaint is what led me to write this post. But let's take it easy: this is just an experiment.

Requirements:

Python >= 3.9
Apache Spark >= 3.2 (download the installer here)
VS Code

The first thing we do when working with spark is create a session.

That's it, and then we load any file.

Nice, but have you ever thought about what happens when we have to load more files and repeat the same command every time? I know: copy, paste and modify the code, blah, blah, blah. But we don't want that because, let's face it, it is pure grunt work. So how can we improve this? In my project I created a folder called functions, and inside it a file called core.py.

I also created a file called main.py, which will be our main program and will import the functions from core.py. The code looks like this:

And in main.py we implement it like this

Notice how everything became easier: with the read() function I can load any number of files just by passing parameters.
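The original listings are not reproduced here, so the following is only a sketch of what such a read() helper might look like. The parameter names, the default format and the idea of forwarding keyword arguments as reader options are all assumptions; engine is expected to be a SparkSession.

```python
def read(engine, path, fmt="csv", **options):
    """Load a file into a DataFrame using the given SparkSession-like engine."""
    reader = engine.read.format(fmt)
    # Forward every keyword argument as a reader option (header, delimiter, ...).
    for key, value in options.items():
        reader = reader.option(key, value)
    return reader.load(path)


# Hypothetical usage in main.py, assuming `spark` is an active SparkSession:
# customers = read(spark, "data/customers.csv", header="true", delimiter=";")
# orders = read(spark, "data/orders.parquet", fmt="parquet")
```

Passing the session in as a parameter keeps the helper free of global state, which also makes it easy to exercise with a stub in tests.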

What if I need to create a view? Simple, in core.py write this

And in main.py

Simple. And what if our schemas are stored in .json files, how do we load them?

main.py

And finally, we need to write the files somewhere, which is relatively simple too:

In main.py

Complete code for core.py
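As the full listing is not included in this text, here is a hedged sketch of how the remaining helpers described in this post (creating a view, loading a schema from a .json file, writing the result) could be written. Function names, parameters and defaults are assumptions, not the author's actual core.py.

```python
import json


def view(dataframe, name):
    """Register the DataFrame as a temporary view so it can be queried with SQL."""
    dataframe.createOrReplaceTempView(name)
    return name


def load_schema(path):
    """Read a schema stored as JSON and return it as a plain dict.

    With PySpark available, the dict can be turned into a schema object
    through StructType.fromJson().
    """
    with open(path, "r", encoding="utf-8") as file:
        return json.load(file)


def write(dataframe, path, fmt="parquet", mode="overwrite"):
    """Persist the DataFrame to the given path in the chosen format."""
    dataframe.write.format(fmt).mode(mode).save(path)
```

Each helper takes the DataFrame (or path) as an argument instead of reaching for a global session, so main.py stays a short sequence of calls.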

I hope you enjoyed this little experiment and that the example helps you in some way. Bye!

Example of applying CDC to JSON files with PySpark

romerito — Wed, 30 Nov 2022 05:20:29 +0000

In the previous post we showed an example of how to apply CDC using Spark SQL. Today we will try to show what a real-life use case could look like. For that we will use fake data generated on the 4devs platform.

We selected people's registration data. The idea is to load this data into a RAW layer, capture the changes, and apply them to the TRUSTED layer. For this example everything runs locally: I use Pop!_OS and already have Apache Spark installed. In case you need to install it, here is the link to the installer I created. My project structure looks like this.

First I created a Python class that creates a Spark session. The file is here: ./lakehouse/src/cdc/internals.py

from delta import *
from pyspark.sql import SparkSession

path = "/home/romeritomorais/Dropbox/tecnology/develop/bigdata/lakehouse/datawarehouse"
DataWareHouse = f"{path}"

class session:

    @staticmethod  # no instance needed: called as session.spark()
    def spark():
        build = (
            SparkSession.builder.appName("application")
            .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
            .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
            .config("spark.sql.warehouse.dir", DataWareHouse)
            .enableHiveSupport())

        return configure_spark_with_delta_pip(build).getOrCreate()

Our main code loads the Spark session and reads the JSON files found in the folder ./lakehouse/storage/stage

We only have 2 sample files. This is an example of their contents:

[
    {
        "nome": "Fernanda Isabella FIORENTINO",
        "idade": 55,
        "cpf": "677.958.886-58",
        "rg": "14.130.167-3",
        "data_nasc": "05/03/1985",
        "sexo": "Feminino",
        "signo": "Peixes",
        "mae": "Bárbara Isabel",
        "pai": "Renato Ian Gomes",
        "email": "fernanda_isabella_pompeu@consultorialk.com.br",
        "senha": "FW2CYDLB0l",
        "cep": "76901-636",
        "endereco": "Governador valadares",
        "numero": 144,
        "bairro": "Pipoca",
        "cidade": "Ji-Paraná",
        "estado": "RO",
        "telefone_fixo": "(69) 3962-4304",
        "celular": "(69) 99407-8102",
        "altura": "1,54",
        "peso": 74,
        "tipo_sanguineo": "A-",
        "cor": "vermelho"
    },
    {
        "nome": "Pietro Mateus Carlos da Silva",
        "idade": 66,
        "cpf": "287.883.345-70",
        "rg": "25.820.920-3",
        "data_nasc": "23/07/1956",
        "sexo": "Masculino",
        "signo": "Leão",
        "mae": "Vera Elaine",
        "pai": "Rodrigo Marcelo da Silva",
        "email": "pietro_dasilva@konekoshouten.com.br",
        "senha": "IrHifX9RyC",
        "cep": "57018-685",
        "endereco": "Conjunto Nossa Senhora do Amparo",
        "numero": 577,
        "bairro": "Chã de Bebedouro",
        "cidade": "Maceió",
        "estado": "AL",
        "telefone_fixo": "(82) 3967-9915",
        "celular": "(82) 99107-9714",
        "altura": "1,85",
        "peso": 66,
        "tipo_sanguineo": "B+",
        "cor": "roxo"
    }
]

This data will be loaded into a DataFrame; after that we convert every column to string and write it to our data warehouse in Delta format. We will also enable CDF (Change Data Feed) for all tables that are created. Note that in this example we are not paying much attention to the settings; in a real scenario each implementation needs to be analyzed to meet the needs of the project.

#!/usr/bin/env python
# coding: utf-8

from pathlib import Path
from cdc.internals import session

spark = session.spark()
path = Path.cwd()
storage = f"{path.parent}/storage/stage"

if __name__ == '__main__':
    spark.sparkContext.setLogLevel("info")

    # load the data and write it in Delta format to the bronze (raw) layer, with CDF enabled by default
    spark.sql("set spark.databricks.delta.properties.defaults.enableChangeDataFeed = true")

    df = spark.read.option("multiline", "true").json(f"{storage}/*")

    appl = df.selectExpr(
        "REPLACE(CAST(altura AS STRING),',','.') AS altura",
        "CAST(bairro AS STRING) AS bairro",
        "CAST(celular AS STRING) AS celular",
        "CAST(cep AS STRING) AS cep",
        "CAST(cidade AS STRING) AS cidade",
        "CAST(cor AS STRING) AS cor",
        "CAST(cpf AS STRING) AS cpf",
        "CAST(data_nasc AS STRING) AS data_nasc",
        "CAST(email AS STRING) AS email",
        "CAST(endereco AS STRING) AS endereco",
        "CAST(estado AS STRING) AS estado",
        "CAST(idade AS STRING) AS idade",
        "CAST(mae AS STRING) AS mae",
        "CAST(nome AS STRING) AS nome",
        "CAST(numero AS STRING) AS numero",
        "CAST(pai AS STRING) AS pai",
        "CAST(peso AS STRING) AS peso",
        "CAST(rg AS STRING) AS rg",
        "CAST(senha AS STRING) AS senha",
        "CAST(sexo AS STRING) AS sexo",
        "CAST(signo AS STRING) AS signo",
        "CAST(telefone_fixo AS STRING) AS telefone_fixo",
        "CAST(tipo_sanguineo AS STRING) AS tipo_sanguineo"
    )

    # write the data in Delta format to the bronze (raw) layer
    appl.write.format("delta").mode("append").saveAsTable("default.rw_cadastro")

    spark.stop()

After writing the data, let's see how it was saved.

Done, we have our data saved in Delta format in our database.

As we insert data into RAW, CDF records row-level changes for every commit: for each row it tracks whether it was inserted, deleted or updated, together with the table version in which that happened. If I insert a row now, it is recorded as an insert in, say, version 1; if I insert the same row again, it is recorded in version 2, and so on. When we take the data to the TRUSTED layer, however, we only want the version of each record that reflects the source. Imagine the source is an ERP table: how do we bring its current state to TRUSTED? Simple, with a MERGE operation. For this we need to load the table that will receive the changes, in our case the default.tr_cadastro table.

#!/usr/bin/env python
# coding: utf-8

from delta import *
from pyspark.sql.functions import col, dense_rank
from pyspark.sql.window import Window
from pathlib import Path
from cdc.internals import session

spark = session.spark()
tableRaw = "rw_cadastro"
tableTrusted = "tr_cadastro"

cadastroTrustedDF = DeltaTable.forName(spark, f"{tableTrusted}")

Now let's load the changes that the default.rw_cadastro table underwent

change_data = spark.read.format("delta") \
        .option("readChangeFeed", "true") \
        .option("startingVersion", 0) \
        .table(f"{tableRaw}") \
        .filter("_change_type != 'update_preimage'")

In this case I'm loading everything from version 0 onwards, but you can adopt a strategy of loading only a specific range of versions.
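One way to load only a range is to combine the startingVersion and endingVersion reader options of the change data feed, which are both inclusive for batch reads. The helper below is a sketch: the function name and parameters are made up for illustration, and spark is assumed to be a session configured for Delta like the one created in internals.py.

```python
def read_change_range(spark, table, start_version, end_version):
    """Read only the changes committed between two table versions (inclusive)."""
    return (
        spark.read.format("delta")
        .option("readChangeFeed", "true")
        .option("startingVersion", start_version)
        .option("endingVersion", end_version)
        .table(table)
    )


# Hypothetical usage: only the changes made in versions 5 to 8 of the raw table.
# change_data = read_change_range(spark, "rw_cadastro", 5, 8)
```

A job could store the last version it processed and pass the next one as start_version on the following run, so each execution only sees new changes.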

With the changes loaded, let's define a window partitioned by the cpf column. As my JSON files did not come with a field that represents a PRIMARY KEY, I will use the CPF field, because I understand that in real life this information does not change. The window groups the rows by that field and orders them by _commit_version in descending order, so that we can select the latest record for each CPF.

windowPartition = Window.partitionBy(
        "cpf").orderBy(col("_commit_version").desc())

After that we create a new DataFrame using the dense_rank window function and keep only the rows ranked first. An important point: it is worth also restricting the read to the _commit_version range of the most recently modified records. Without that strategy you will probably load old modifications as well, and depending on the number of changes in RAW this can produce a gigantic volume of data to partition.

apply_change_data = change_data\
        .withColumn("dense_rank", dense_rank().over(windowPartition))\
        .where("dense_rank=1")\
        .distinct()
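To see what the window step selects, here is the same idea in plain Python on a handful of made-up change rows: for every cpf, keep the rows that carry the highest _commit_version. Like dense_rank, this keeps every row tied on the newest version, so two different rows for the same cpf in the same commit would both survive (distinct() only removes rows that are identical in every column).

```python
def latest_changes(rows, key="cpf", version="_commit_version"):
    """Keep, for every key, the rows whose version is the highest one seen."""
    newest = {}
    for row in rows:
        k = row[key]
        newest[k] = max(newest.get(k, row[version]), row[version])
    # Equivalent of dense_rank() == 1: ties on the newest version are all kept.
    return [row for row in rows if row[version] == newest[row[key]]]


# Made-up change feed rows, reduced to the columns that matter here.
changes = [
    {"cpf": 1, "_change_type": "insert", "_commit_version": 0},
    {"cpf": 1, "_change_type": "update_postimage", "_commit_version": 3},
    {"cpf": 2, "_change_type": "insert", "_commit_version": 0},
    {"cpf": 2, "_change_type": "delete", "_commit_version": 2},
]

for row in latest_changes(changes):
    print(row)
```

If ties are possible in your data, a row_number-style rule (exactly one row per key) avoids sending duplicates into the MERGE.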

Now let's MERGE the changes captured by the CDF.

cadastroTrustedDF.alias("cadastroTrustedDF").merge(apply_change_data.alias("cadastroRawDF"), "cadastroRawDF.cpf = cadastroTrustedDF.cpf") \
        .whenMatchedDelete(
            condition="cadastroRawDF._change_type = 'delete'") \
        .whenMatchedUpdate(
        set={
            "altura": "cadastroRawDF.altura",
            "bairro": "cadastroRawDF.bairro",
            "celular": "cadastroRawDF.celular",
            "cep": "cadastroRawDF.cep",
            "cidade": "cadastroRawDF.cidade",
            "cor": "cadastroRawDF.cor",
            "cpf": "cadastroRawDF.cpf",
            "data_nasc": "cadastroRawDF.data_nasc",
            "email": "cadastroRawDF.email",
            "endereco": "cadastroRawDF.endereco",
            "estado": "cadastroRawDF.estado",
            "idade": "cadastroRawDF.idade",
            "mae": "cadastroRawDF.mae",
            "nome": "cadastroRawDF.nome",
            "numero": "cadastroRawDF.numero",
            "pai": "cadastroRawDF.pai",
            "peso": "cadastroRawDF.peso",
            "rg": "cadastroRawDF.rg",
            "senha": "cadastroRawDF.senha",
            "sexo": "cadastroRawDF.sexo",
            "signo": "cadastroRawDF.signo",
            "telefone_fixo": "cadastroRawDF.telefone_fixo",
            "tipo_sanguineo": "cadastroRawDF.tipo_sanguineo"
        }
    )\
        .whenNotMatchedInsert(
            condition="cadastroRawDF._change_type != 'delete'",
        values={
            "altura": "cadastroRawDF.altura",
            "bairro": "cadastroRawDF.bairro",
            "celular": "cadastroRawDF.celular",
            "cep": "cadastroRawDF.cep",
            "cidade": "cadastroRawDF.cidade",
            "cor": "cadastroRawDF.cor",
            "cpf": "cadastroRawDF.cpf",
            "data_nasc": "cadastroRawDF.data_nasc",
            "email": "cadastroRawDF.email",
            "endereco": "cadastroRawDF.endereco",
            "estado": "cadastroRawDF.estado",
            "idade": "cadastroRawDF.idade",
            "mae": "cadastroRawDF.mae",
            "nome": "cadastroRawDF.nome",
            "numero": "cadastroRawDF.numero",
            "pai": "cadastroRawDF.pai",
            "peso": "cadastroRawDF.peso",
            "rg": "cadastroRawDF.rg",
            "senha": "cadastroRawDF.senha",
            "sexo": "cadastroRawDF.sexo",
            "signo": "cadastroRawDF.signo",
            "telefone_fixo": "cadastroRawDF.telefone_fixo",
            "tipo_sanguineo": "cadastroRawDF.tipo_sanguineo"
        }
    ).execute()

The operation is simple: we have 3 clauses to apply the changes. The first uses the .whenMatchedDelete() method, which matches the row from the source table (rw_cadastro) on the CPF field and checks whether it was deleted at the origin; if so, the row is removed from the destination (tr_cadastro).

The second clause uses the .whenMatchedUpdate() method,

where again the source row is matched on cadastroRawDF.cpf = cadastroTrustedDF.cpf; if it changed at the origin, it is updated in the destination table.

And finally, .whenNotMatchedInsert(): if the record exists only at the source, it is inserted into the destination table. One very important detail: if at the time of this operation there are duplicate rows, say 2 or more source records with the same CPF, the MERGE fails with an error, because it cannot apply several changes to the same target record.
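The three clauses can be mimicked on a plain dictionary keyed by cpf, which may help when reasoning about what the MERGE will do. This is a toy model with made-up rows, not Delta's implementation; in particular it rejects any repeated key in the source, whereas Delta raises its error when several source rows match the same target row.

```python
def merge(target, changes):
    """Apply at most one change per key to the target dict, clause by clause."""
    seen = set()
    for change in changes:
        key = change["cpf"]
        if key in seen:
            # Simplification: any repeated source key is treated as an error.
            raise ValueError(f"duplicate source rows for key {key}")
        seen.add(key)
        # Drop the CDF metadata columns before storing the row.
        data = {k: v for k, v in change.items() if not k.startswith("_")}
        if key in target:
            if change["_change_type"] == "delete":
                del target[key]        # whenMatchedDelete
            else:
                target[key] = data     # whenMatchedUpdate
        elif change["_change_type"] != "delete":
            target[key] = data         # whenNotMatchedInsert
    return target


# Made-up trusted table and change rows.
trusted = {1: {"cpf": 1, "nome": "Ana"}, 2: {"cpf": 2, "nome": "Bia"}}
raw_changes = [
    {"cpf": 1, "nome": "Ana Maria", "_change_type": "update_postimage"},
    {"cpf": 2, "nome": "Bia", "_change_type": "delete"},
    {"cpf": 3, "nome": "Caio", "_change_type": "insert"},
]
print(merge(trusted, raw_changes))
```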

But before applying this operation we have to cast the data to the schema of the default.tr_cadastro table, so we transform the data before performing the MERGE:

# create a temporary view for transforming the data (type casting etc.)
apply_change_data.createOrReplaceTempView("transformChangeData")

# transformation
apply_change_data = spark.sql("""
SELECT 
  CAST(nome AS STRING) AS nome, 
  CAST(idade AS INT) AS idade, 
  CAST(altura AS FLOAT) AS altura, 
  CAST(sexo AS STRING) AS sexo, 
  CAST(peso AS FLOAT) AS peso, 
  CAST(cor AS STRING) AS cor, 
  CAST(
    replace(data_nasc, '/', '-') AS STRING
  ) AS data_nasc, 
  CAST(signo AS STRING) AS signo, 
  CAST(tipo_sanguineo AS STRING) AS tipo_sanguineo, 
  CAST(
    replace(
      replace(cpf, '.', ''), 
      '-', 
      ''
    ) AS long
  ) AS cpf, 
  CAST(
    replace(
      replace(rg, '.', ''), 
      '-', 
      ''
    ) AS long
  ) AS rg, 
  CAST(mae AS STRING) AS mae, 
  CAST(pai AS STRING) AS pai, 
  CAST(endereco AS STRING) AS endereco, 
  CAST(numero AS STRING) AS numero, 
  CAST(bairro AS STRING) AS bairro, 
  CAST(
    replace(cep, '-', '') AS STRING
  ) AS cep, 
  CAST(cidade AS STRING) AS cidade, 
  CAST(estado AS STRING) AS estado, 
  CAST(
    replace(
      replace(
        replace(
          replace(celular, ')', ''), 
          '(', 
          ''
        ), 
        ' ', 
        ''
      ), 
      '-', 
      ''
    ) AS STRING
  ) AS celular, 
  CAST(
    replace(
      replace(
        replace(
          replace(telefone_fixo, ')', ''), 
          '(', 
          ''
        ), 
        ' ', 
        ''
      ), 
      '-', 
      ''
    ) AS STRING
  ) AS telefone_fixo, 
  CAST(email AS STRING) AS email,
  _change_type,
  _commit_version,
  _commit_timestamp
FROM 
  transformChangeData
""")

And here is a sample of our table data.

This is one of the ways to work with CDC in Spark using Delta's features. I hope this helps; the project is here on GitHub.