<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gustavo Martin Morcuende</title>
    <description>The latest articles on DEV Community by Gustavo Martin Morcuende (@gumartinm).</description>
    <link>https://dev.to/gumartinm</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F487672%2F730e9ff4-2337-47e9-b890-f68295a04a6f.png</url>
      <title>DEV Community: Gustavo Martin Morcuende</title>
      <link>https://dev.to/gumartinm</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gumartinm"/>
    <language>en</language>
    <item>
      <title>Cómo construir tu propia data platform. From zero to hero.</title>
      <dc:creator>Gustavo Martin Morcuende</dc:creator>
      <pubDate>Fri, 09 Jun 2023 20:08:41 +0000</pubDate>
      <link>https://dev.to/adevintaspain/como-construir-tu-propia-data-platform-from-zero-to-hero-96c</link>
      <guid>https://dev.to/adevintaspain/como-construir-tu-propia-data-platform-from-zero-to-hero-96c</guid>
<description>&lt;h2&gt;Introduction&lt;/h2&gt;

&lt;p&gt;This article is the result of the talk given on 28 April 2023 at &lt;a href="https://salmorejo.tech/2023/"&gt;Salmorejo Tech&lt;/a&gt;. The slides from the presentation are available at the following link: &lt;a href="https://www.slideshare.net/GustavoMartin46/cmo-construir-tu-propia-data-platform-from-zero-to-hero"&gt;slideshare&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The goal of the talk was to explain, to an audience with basic technology knowledge, the different tools that can be used to build a data platform.&lt;/p&gt;

&lt;p&gt;The talk starts with a simple setup that practically anyone working in IT can understand. It ends with a complex setup that, without going into much detail, gives the audience an idea of which tools are required to implement the solution.&lt;/p&gt;

&lt;h2&gt;Differences between the operational world and the analytical world.&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TjUUEDl4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/afl0sn6atmfhcb1sz72i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TjUUEDl4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/afl0sn6atmfhcb1sz72i.png" alt="Image description" width="800" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The operational world is where we find the typical frontend and backend applications. In this world it is not strictly necessary to keep historical data. Here we are more interested in storing whatever the user needs to carry out their operations, and those operations may expire over time. Moreover, in the operational world we store large amounts of personal information, such as email addresses, phone numbers, postal addresses, and so on. We do so because we need this data to contact the user, for example to ship an order.&lt;/p&gt;

&lt;p&gt;In the analytical world we want to store as much information as possible, often with its full history. In this world it is not strictly necessary to store personal data such as email addresses. Here we do not need to contact the user, so we do not need to know their real email, but we may well be interested in knowing how many distinct emails have been used in the system over the years.&lt;/p&gt;

&lt;p&gt;It is in this analytical world that we will implement our data platform.&lt;/p&gt;

&lt;h2&gt;What is a data platform?&lt;/h2&gt;

&lt;p&gt;A data platform is a set of applications, tools and databases that enable the acquisition, storage, preparation and governance of data. It is a complete solution for the ingestion, processing, analysis and presentation of the data generated by a company.&lt;/p&gt;

&lt;p&gt;See the following links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.mongodb.com/what-is-a-data-platform"&gt;https://www.mongodb.com/what-is-a-data-platform&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.splunk.com/en_us/data-insider/what-is-a-data-platform.html"&gt;https://www.splunk.com/en_us/data-insider/what-is-a-data-platform.html&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Who are our users?&lt;/h2&gt;

&lt;p&gt;Before going ahead and implementing a technological solution, we have to identify the users of that solution, as well as their business needs.&lt;br&gt;
Below we list the most typical kinds of users we can find for a data platform.&lt;/p&gt;
&lt;h3&gt;Data engineer.&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--X1v__0Hd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ty8cy4wmfnjvefv0c8g4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--X1v__0Hd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ty8cy4wmfnjvefv0c8g4.png" alt="Image description" width="732" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Focuses on the design, construction, maintenance and management of data infrastructure.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Implementing and managing data storage systems (databases, cloud storage, and so on).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Making sure data is clean, organized and properly structured so that it can be used effectively.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Data analysts, data scientists and machine learning engineers.&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--t0KmDpr0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p9qp474njc1xhwq5lque.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--t0KmDpr0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p9qp474njc1xhwq5lque.png" alt="Image description" width="800" height="248"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Data scientist: uses statistical and data-analysis techniques to extract useful information, with the goal of improving a company's decision making and effectiveness.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data analyst: collects, processes and analyzes data to help companies make informed decisions. Their job is to provide relevant, actionable information to drive business growth and success.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Machine learning engineer: develops and optimizes machine learning models to solve complex business problems. Their job is to build systems that can learn and improve as they are exposed to more data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Simple solution.&lt;/h2&gt;

&lt;p&gt;Now that we know who our customers are, we can start designing solutions. As anticipated in the introduction, we will go from the simplest model to the most complex one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xYPTHt4Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xt51c5l1hnxa0cqy64t1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xYPTHt4Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xt51c5l1hnxa0cqy64t1.png" alt="Image description" width="800" height="307"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this solution, the operational world and the analytical world share the same database.&lt;/p&gt;

&lt;p&gt;Notice that all our users share the same system. The data platform will use as its storage system the same database used by the rest of the operational system.&lt;/p&gt;

&lt;p&gt;For small and medium-sized companies this can be a reasonable compromise, when you do not want the extra complexity of adding storage systems dedicated to the analytical world.&lt;/p&gt;

&lt;p&gt;The data platform will not need to provide any special storage system.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Advantages: a simpler system to maintain.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Drawbacks: actions performed in the analytical world (for example, rendering data in a dashboard) can affect operations such as buying a product from the frontend, because the storage system is shared.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Tools we will need to provide&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wm2XszOe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mnuo5c81yk3ph59a40q4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wm2XszOe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mnuo5c81yk3ph59a40q4.png" alt="Image description" width="686" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Database&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL, MySQL, Oracle, etc.&lt;/li&gt;
&lt;li&gt;schemas&lt;/li&gt;
&lt;li&gt;tables&lt;/li&gt;
&lt;li&gt;permission management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Applications&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They read tables from the database, apply a transformation and write the results into other tables.&lt;/li&gt;
&lt;li&gt;ETL: extract, transform, load&lt;/li&gt;
&lt;/ul&gt;
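&lt;p&gt;A minimal sketch of what such an ETL application does. The table and column names below are invented for illustration, and SQLite stands in for a database such as PostgreSQL:&lt;/p&gt;

```python
import sqlite3

# Minimal ETL sketch: extract rows from an operational table, transform
# them (aggregate per country), and load the result into another table.
# Table and column names are invented for illustration; SQLite stands in
# for a database such as PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, amount_cents INTEGER, country TEXT);
    INSERT INTO orders VALUES (1, 1250, 'ES'), (2, 3000, 'ES'), (3, 999, 'FR');
    CREATE TABLE orders_by_country (country TEXT, total_amount_cents INTEGER);
""")

# Extract + transform: aggregate order amounts per country.
rows = conn.execute(
    "SELECT country, SUM(amount_cents) FROM orders GROUP BY country"
).fetchall()

# Load: write the aggregated result into the analytical table.
conn.executemany("INSERT INTO orders_by_country VALUES (?, ?)", rows)
conn.commit()

print(dict(conn.execute("SELECT * FROM orders_by_country")))
```

In a real platform, a scheduler such as Apache Airflow would run this kind of extract-transform-load step periodically.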

&lt;p&gt;Dashboards&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Charts that display data of interest&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Machine learning&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MLFlow&lt;/li&gt;
&lt;li&gt;Kubeflow&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Examples of applications we can use&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A database, for example &lt;a href="https://www.postgresql.org/"&gt;PostgreSQL&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Applications such as &lt;a href="https://airflow.apache.org/"&gt;Apache Airflow&lt;/a&gt; for developing &lt;a href="https://en.wikipedia.org/wiki/Extract,_transform,_load"&gt;ETLs&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Dashboards such as &lt;a href="https://www.qlik.com/"&gt;Qlik&lt;/a&gt; and &lt;a href="https://www.tableau.com/"&gt;Tableau&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;For machine learning we can provide, for example, &lt;a href="https://www.kubeflow.org/"&gt;Kubeflow&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Governance&lt;/h3&gt;

&lt;p&gt;Defining and enforcing specific rules to standardize the names of tables, databases, processes and so on will be very important.&lt;/p&gt;

&lt;p&gt;It will also be important to create rules for using the tools offered by the data platform. Remember that, in the end, there are people behind the technology.&lt;/p&gt;

&lt;p&gt;We must prevent that technology from being misused, and governance will be essential for that.&lt;/p&gt;
&lt;h2&gt;Intermediate solution.&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8zLvs8hE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qu7d48smwwgtnmha7abf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8zLvs8hE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qu7d48smwwgtnmha7abf.png" alt="Image description" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this solution the operational world is now much more complex.&lt;/p&gt;

&lt;p&gt;This solution is needed when we want to prevent processes in the analytical world from affecting the operational world. In addition, the operational world is made up of different systems. We want to have all our analytical data in a single place, so that we can analyze and transform it easily.&lt;/p&gt;

&lt;p&gt;In this case, the data platform will need to provide its own database, plus tools for extracting the information stored in the different systems of the operational world.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Advantages: actions performed in the analytical world do not affect the operational world, because the storage system is not shared. All analytical data is collected in a single place.&lt;/li&gt;
&lt;li&gt;Drawbacks: higher complexity and cost.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Tools we will need to provide&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ift2lczN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/huxwu37vrw9fg0hr3lod.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ift2lczN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/huxwu37vrw9fg0hr3lod.png" alt="Image description" width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this solution the tools to provide are the same as in the simple solution, but now we have a new kind of database, the Data Warehouse, plus applications that let us consume information from the operational systems. The rest of the tools are the same as those described in the previous solution.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Data Warehouse, for example &lt;a href="https://aws.amazon.com/redshift/"&gt;AWS Redshift&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Applications such as &lt;a href="https://airflow.apache.org/"&gt;Apache Airflow&lt;/a&gt; for developing &lt;a href="https://en.wikipedia.org/wiki/Extract,_transform,_load"&gt;ETLs&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Dashboards such as &lt;a href="https://www.qlik.com/"&gt;Qlik&lt;/a&gt; and &lt;a href="https://www.tableau.com/"&gt;Tableau&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;For machine learning we can provide, for example, &lt;a href="https://www.kubeflow.org/"&gt;Kubeflow&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;What is a Data Warehouse?&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;It is a centralized database that integrates many data sources.&lt;/li&gt;
&lt;li&gt;It isolates the operational systems from the analytical ones.&lt;/li&gt;
&lt;li&gt;Queries launched from the analytical system do not affect the operational one.&lt;/li&gt;
&lt;li&gt;It makes it possible to reorganize information so that it is easier to analyze.&lt;/li&gt;
&lt;li&gt;It provides a single data model.&lt;/li&gt;
&lt;li&gt;It keeps a history of information that the operational world, not needing it, may delete.&lt;/li&gt;
&lt;li&gt;It integrates multiple data sources in a single place.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--E0n2Ta9b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mk4w6kn4anixvq0hgurh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--E0n2Ta9b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mk4w6kn4anixvq0hgurh.png" alt="Image description" width="702" height="604"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Specific modeling: the star schema.&lt;/li&gt;
&lt;li&gt;Made up of fact tables and dimension tables.&lt;/li&gt;
&lt;li&gt;Fact table: a sequence of events; a high number of records.&lt;/li&gt;
&lt;li&gt;Dimension table: describes the facts; few records and many attributes.&lt;/li&gt;
&lt;li&gt;Enables read-optimized queries.&lt;/li&gt;
&lt;li&gt;Enables simpler queries, without the multiple JOINs a normalized entity-relationship model would require.&lt;/li&gt;
&lt;li&gt;Permissions via GRANTs on tables.&lt;/li&gt;
&lt;/ul&gt;
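&lt;p&gt;The fact/dimension idea can be sketched in a few lines. The table names below are invented for illustration, and SQLite stands in for a real Data Warehouse:&lt;/p&gt;

```python
import sqlite3

# Star-schema sketch: one fact table (many rows: foreign keys plus
# measures) joined to a small dimension table (few rows, descriptive
# attributes). Names are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE fact_sales  (product_key INTEGER, sold_at TEXT, units INTEGER);

    INSERT INTO dim_product VALUES (1, 'laptop', 'electronics'), (2, 'mug', 'kitchen');
    INSERT INTO fact_sales VALUES
        (1, '2023-04-01', 2), (1, '2023-04-02', 1), (2, '2023-04-02', 10);
""")

# A typical analytical query: a single JOIN from the fact table to the
# dimension table, instead of the chain of JOINs a normalized
# entity-relationship model would need.
rows = conn.execute("""
    SELECT d.category, SUM(f.units)
    FROM fact_sales f JOIN dim_product d USING (product_key)
    GROUP BY d.category ORDER BY d.category
""").fetchall()

for category, units in rows:
    print(category, units)
```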
&lt;h3&gt;What is AWS Redshift?&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0_A3PDNN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/29ao768s1rudp7mveutw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0_A3PDNN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/29ao768s1rudp7mveutw.png" alt="Image description" width="800" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is a Data Warehouse solution implemented by Amazon Web Services. With hardly any effort we can create our own server in the cloud.&lt;/p&gt;

&lt;p&gt;The screenshot above shows the graphical interface used to create and configure AWS Redshift.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Careful!&lt;/strong&gt; Never use the graphical interface to create and maintain your cloud infrastructure. Always use &lt;a href="https://en.wikipedia.org/wiki/Infrastructure_as_code"&gt;infrastructure as code&lt;/a&gt;. This makes your infrastructure reproducible, automatable and easily maintainable by anyone on your team or in your organization. There are different solutions for this, such as &lt;a href="https://en.wikipedia.org/wiki/AWS_CloudFormation"&gt;CloudFormation&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/AWS_Cloud_Development_Kit"&gt;CDK&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Terraform_(software)"&gt;Terraform&lt;/a&gt; and many others.&lt;/p&gt;

&lt;p&gt;Below is an example of Terraform code that easily creates an &lt;a href="https://aws.amazon.com/redshift/redshift-serverless/"&gt;AWS Redshift serverless&lt;/a&gt; cluster.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;  &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_redshiftserverless_workgroup"&lt;/span&gt; &lt;span class="s2"&gt;"serverless"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="mi"&gt;2&lt;/span&gt;   &lt;span class="nx"&gt;workgroup_name&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="mi"&gt;3&lt;/span&gt;   &lt;span class="nx"&gt;namespace_name&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_redshiftserverless_namespace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;serverless&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="mi"&gt;4&lt;/span&gt;   &lt;span class="nx"&gt;base_capacity&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;base_capacity&lt;/span&gt;
  &lt;span class="mi"&gt;5&lt;/span&gt;   &lt;span class="nx"&gt;security_group_ids&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;security_group_ids&lt;/span&gt;
  &lt;span class="mi"&gt;6&lt;/span&gt;   &lt;span class="nx"&gt;subnet_ids&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;subnet_ids&lt;/span&gt;
  &lt;span class="mi"&gt;7&lt;/span&gt;   &lt;span class="nx"&gt;enhanced_vpc_routing&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="mi"&gt;8&lt;/span&gt;   &lt;span class="nx"&gt;publicly_accessible&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;publicly_accessible&lt;/span&gt;
  &lt;span class="mi"&gt;9&lt;/span&gt;   &lt;span class="nx"&gt;tags&lt;/span&gt;                 &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tags&lt;/span&gt;
 &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="mi"&gt;11&lt;/span&gt; 
 &lt;span class="mi"&gt;12&lt;/span&gt; &lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_redshiftserverless_namespace"&lt;/span&gt; &lt;span class="s2"&gt;"serverless"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="mi"&gt;13&lt;/span&gt;   &lt;span class="nx"&gt;namespace_name&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
 &lt;span class="mi"&gt;14&lt;/span&gt;   &lt;span class="nx"&gt;admin_username&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;admin_username&lt;/span&gt;
 &lt;span class="mi"&gt;15&lt;/span&gt;   &lt;span class="nx"&gt;admin_user_password&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;admin_user_password&lt;/span&gt;
 &lt;span class="mi"&gt;16&lt;/span&gt;   &lt;span class="nx"&gt;db_name&lt;/span&gt;              &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;db_name&lt;/span&gt;
 &lt;span class="mi"&gt;17&lt;/span&gt;   &lt;span class="nx"&gt;iam_roles&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;iam_roles&lt;/span&gt;
 &lt;span class="mi"&gt;18&lt;/span&gt;   &lt;span class="nx"&gt;default_iam_role_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;default_iam_role_arn&lt;/span&gt;
 &lt;span class="mi"&gt;19&lt;/span&gt;   &lt;span class="nx"&gt;tags&lt;/span&gt;                 &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tags&lt;/span&gt;
 &lt;span class="mi"&gt;20&lt;/span&gt; 
 &lt;span class="mi"&gt;21&lt;/span&gt;   &lt;span class="c1"&gt;# https://github.com/hashicorp/terraform-provider-aws/issues/26624&lt;/span&gt;
 &lt;span class="mi"&gt;22&lt;/span&gt;   &lt;span class="nx"&gt;lifecycle&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="mi"&gt;23&lt;/span&gt;     &lt;span class="nx"&gt;ignore_changes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
 &lt;span class="mi"&gt;24&lt;/span&gt;       &lt;span class="nx"&gt;iam_roles&lt;/span&gt;
 &lt;span class="mi"&gt;25&lt;/span&gt;     &lt;span class="p"&gt;]&lt;/span&gt;
 &lt;span class="mi"&gt;26&lt;/span&gt;   &lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="mi"&gt;27&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="mi"&gt;28&lt;/span&gt; 
 &lt;span class="mi"&gt;29&lt;/span&gt; &lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_route53_record"&lt;/span&gt; &lt;span class="s2"&gt;"serverless"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="mi"&gt;30&lt;/span&gt;   &lt;span class="nx"&gt;for_each&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;toset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;route53_record_zone_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="mi"&gt;31&lt;/span&gt;   &lt;span class="nx"&gt;zone_id&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;
 &lt;span class="mi"&gt;32&lt;/span&gt;   &lt;span class="nx"&gt;name&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"redshift-&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;route53_record_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
 &lt;span class="mi"&gt;33&lt;/span&gt;   &lt;span class="nx"&gt;type&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"CNAME"&lt;/span&gt;
 &lt;span class="mi"&gt;34&lt;/span&gt;   &lt;span class="nx"&gt;ttl&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;
 &lt;span class="mi"&gt;35&lt;/span&gt;   &lt;span class="nx"&gt;records&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_redshiftserverless_workgroup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;serverless&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;.*.&lt;/span&gt;&lt;span class="nx"&gt;address&lt;/span&gt;
 &lt;span class="mi"&gt;36&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Advanced solution.&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zOJjkQdq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vglhj500ady7rn5wq5wk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zOJjkQdq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vglhj500ady7rn5wq5wk.png" alt="Image description" width="800" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two new elements appear in this solution: the &lt;a href="https://en.wikipedia.org/wiki/Data_lake"&gt;Data Lake&lt;/a&gt; or &lt;a href="https://www.databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html"&gt;Lakehouse&lt;/a&gt;, and data sources such as JSON, AVRO, XML or any kind of API.&lt;/p&gt;

&lt;p&gt;We will implement this solution when we have to store large amounts of unstructured data, such as events generated by the &lt;a href="https://en.wikipedia.org/wiki/Internet_of_things"&gt;Internet of Things&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Advantages: we can store unstructured data in large quantities.&lt;/li&gt;
&lt;li&gt;Drawbacks: higher complexity and cost.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Tools we will need to provide&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wCOsel6Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bjoz6nzjna4pu10h7ou0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wCOsel6Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bjoz6nzjna4pu10h7ou0.png" alt="Image description" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this solution the tools to provide are the same as in the intermediate solution, but now we also need to implement a Data Lake or a Lakehouse.&lt;/p&gt;

&lt;p&gt;In our case, because we are using the tools provided by AWS in the cloud, the Lakehouse will be implemented on top of &lt;a href="https://en.wikipedia.org/wiki/Amazon_S3"&gt;AWS S3&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;What is a Data Lake or Lakehouse?&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;It is a massive, cheap data storage system.&lt;/li&gt;
&lt;li&gt;It is used to store large amounts of information in its native format, without requiring the data to be structured in any particular way (JSON, XML, logs, etc.).&lt;/li&gt;
&lt;li&gt;The data can come from different sources: databases, sensors, machine logs, APIs, etc.&lt;/li&gt;
&lt;li&gt;It isolates the operational systems from the analytical ones.&lt;/li&gt;
&lt;li&gt;Distributed systems are used, such as Amazon's AWS S3 or HDFS (the Hadoop file system).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;What is a Data Lake or Lakehouse implemented on AWS S3?&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hL6tbVlf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/at2pyhuhwg22edqbovl6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hL6tbVlf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/at2pyhuhwg22edqbovl6.png" alt="Image description" width="800" height="249"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At Adevinta, implemented on AWS S3 (in the Amazon cloud)&lt;/li&gt;
&lt;li&gt;It can be seen as a file system with folders&lt;/li&gt;
&lt;li&gt;But it is not a file system!&lt;/li&gt;
&lt;li&gt;Files are called objects.&lt;/li&gt;
&lt;li&gt;We can use it through the Hadoop File System, Apache Spark, and so on.&lt;/li&gt;
&lt;li&gt;Permissions via IAM Roles.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;How can we use the Data Lake or Lakehouse?&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Xg6o8GNj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o4yjvscm8mumt2j5atpc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Xg6o8GNj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o4yjvscm8mumt2j5atpc.png" alt="Image description" width="800" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Applications such as Apache Spark exist to make use of it. The screenshot above shows a notebook running Apache Spark code that reads a gzip-compressed file and displays the information it contains.&lt;/p&gt;
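&lt;p&gt;A conceptual sketch of that read step, using only the Python standard library so it is self-contained. The sample records are invented; at scale, Apache Spark would perform the equivalent read against the Data Lake:&lt;/p&gt;

```python
import gzip
import io
import json

# Read a gzip-compressed file of JSON lines and show its contents. The
# records are invented; an in-memory buffer stands in for an object
# stored in the Data Lake (e.g. on AWS S3).
raw = b'{"user": "a", "clicks": 3}\n{"user": "b", "clicks": 7}\n'
compressed = gzip.compress(raw)

# gzip.open accepts a file object, so a BytesIO buffer works like a file.
with gzip.open(io.BytesIO(compressed), "rt", encoding="utf-8") as fh:
    events = [json.loads(line) for line in fh]

for event in events:
    print(event["user"], event["clicks"])
```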

&lt;h2&gt;Conclusion.&lt;/h2&gt;

&lt;p&gt;In this talk we presented different solutions for building a data platform, from the simplest to the most complex. Other solutions are possible, but all of them will have pieces very similar to the ones discussed here.&lt;/p&gt;

&lt;p&gt;Now all that remains is for you to build your own data platform at your company and earn that promotion or raise you deserve.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to build your own data platform. Episode 2: authorization layer. Data Warehouse implementation.</title>
      <dc:creator>Gustavo Martin Morcuende</dc:creator>
      <pubDate>Sun, 04 Jun 2023 23:05:20 +0000</pubDate>
      <link>https://dev.to/adevintaspain/how-to-build-your-own-data-platform-episode-2-authorization-layer-data-warehouse-implementation-e0c</link>
      <guid>https://dev.to/adevintaspain/how-to-build-your-own-data-platform-episode-2-authorization-layer-data-warehouse-implementation-e0c</guid>
      <description>&lt;h2&gt;
  
  
  Introduction.
&lt;/h2&gt;

&lt;p&gt;This article is the second part of the episode about building an authorization layer for your data platform. You can find the whole list of articles following this link: &lt;a href="https://medium.com/@gu.martinm/list/how-to-build-your-own-data-platform-9e6f85e4ce39"&gt;https://medium.com/@gu.martinm/list/how-to-build-your-own-data-platform-9e6f85e4ce39&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the previous article we talked about how to implement the authorization layer in the Data Lake; in this second part we will do the same for the Data Warehouse.&lt;/p&gt;




&lt;h2&gt;
  
  
  Authorization layer.
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JDH8DWLr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eeokawvxsm6zmz1dkg39.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JDH8DWLr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eeokawvxsm6zmz1dkg39.png" alt="Image description" width="800" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see in this diagram the Lakehouse with its metastore and the Data Warehouse. We already talked about the authorization layer for the Lakehouse in the previous article; now it is the Data Warehouse’s turn.&lt;/p&gt;

&lt;p&gt;Because we will be using &lt;a href="https://en.wikipedia.org/wiki/Amazon_Web_Services"&gt;Amazon Web Services&lt;/a&gt; with &lt;a href="https://aws.amazon.com/redshift/"&gt;AWS Redshift&lt;/a&gt;, we will be implementing this layer using &lt;a href="https://aws.amazon.com/lake-formation/"&gt;Lake Formation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Processing layer.
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xyYf0tRZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ccdijsa6snug3hpkh95o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xyYf0tRZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ccdijsa6snug3hpkh95o.png" alt="Image description" width="800" height="528"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Human users and processes will be the ones accessing the stored data through the authorization layer: machines and processes like Zeppelin notebooks, AWS Athena for SQL, clusters of AWS EMR, Databricks, and so on.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with the authorization.
&lt;/h2&gt;

&lt;p&gt;Data engineers, data analysts and data scientists work in different and sometimes isolated teams. They do not want their data to be deleted or changed by tools or people outside their teams.&lt;/p&gt;

&lt;p&gt;Data owners are typically in charge of granting access to their data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Owner — consumer, relationship.
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fleeuAEl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/75l47konu6302lr6756y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fleeuAEl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/75l47konu6302lr6756y.png" alt="Image description" width="766" height="822"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;A data consumer requests access to some data owned by a different team in a different domain. For example, a table in a database.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The data owner grants access by approving the access request.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Upon the approval of an access request, a new permission is added to the specific table.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Our authorization layer must provide the above capability if we want to implement a data mesh with success.&lt;/p&gt;
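&lt;p&gt;The three-step workflow above can be sketched as follows (a toy model; the class and function names are illustrative, not a real API):&lt;/p&gt;

```python
# Toy model of the owner/consumer workflow described above.
class AccessRequest:
    def __init__(self, consumer, table):
        self.consumer = consumer
        self.table = table
        self.approved = False

# table name mapped to the set of consumers granted access to it
grants = {}

def approve(request):
    """Steps 2 and 3: the owner approves the request, and a new
    permission is added to the specific table."""
    request.approved = True
    grants.setdefault(request.table, set()).add(request.consumer)

# Step 1: a consumer requests access to a table owned by another team.
req = AccessRequest("consumer-team", "schema.producer")
approve(req)
print(grants)
```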

&lt;h2&gt;
  
  
  Data Warehouse, AWS Redshift.
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--B2yaFplg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w9wi6bmx0jjjxlcmvhm2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--B2yaFplg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w9wi6bmx0jjjxlcmvhm2.png" alt="Image description" width="800" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Data Warehouse is implemented on top of AWS Redshift. Not many years ago Amazon released a new node type called AWS Redshift RA3. What makes RA3 different from classic Redshift is that computation and storage are separated. Before RA3, users who needed more storage also had to pay for more computation, even when computation was not a problem; and conversely, users who needed more computation had to pay for more storage. As a result, Redshift costs were typically high.&lt;/p&gt;
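&lt;p&gt;A toy cost model (with made-up numbers) shows why that coupling was expensive: with fixed-ratio nodes you must buy enough nodes to cover both compute and storage, whichever dimension is larger.&lt;/p&gt;

```python
import math

# Toy cost model with made-up numbers: in classic Redshift, compute
# and storage come bundled in fixed-ratio nodes, so a storage-heavy
# workload also pays for compute it does not need.
NODE_COMPUTE = 4    # vCPUs per node (illustrative)
NODE_STORAGE = 2.0  # TB per node (illustrative)
NODE_COST = 100     # cost units per node (illustrative)

def classic_cost(needed_compute, needed_storage_tb):
    """You must buy enough nodes to cover BOTH dimensions."""
    nodes = max(math.ceil(needed_compute / NODE_COMPUTE),
                math.ceil(needed_storage_tb / NODE_STORAGE))
    return nodes * NODE_COST

# Storage-heavy workload: 4 vCPUs would be enough, but 20 TB of data
# forces 10 nodes, i.e. 10x the compute cost actually needed.
print(classic_cost(needed_compute=4, needed_storage_tb=20))
```

&lt;p&gt;With RA3, storage is managed separately, so a workload like this would pay for roughly one node of compute plus only the storage it actually uses.&lt;/p&gt;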

&lt;p&gt;We will be using AWS Redshift RA3. Here are some useful links that explain in more detail what AWS Redshift and AWS Redshift RA3 are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/welcome.html"&gt;https://docs.aws.amazon.com/redshift/latest/mgmt/welcome.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/big-data/use-amazon-redshift-ra3-with-managed-storage-in-your-modern-data-architecture/"&gt;https://aws.amazon.com/blogs/big-data/use-amazon-redshift-ra3-with-managed-storage-in-your-modern-data-architecture/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Data Warehouse, AWS Redshift RA3.
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kLYViEDG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3u3w0ahowrt0kummulkf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kLYViEDG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3u3w0ahowrt0kummulkf.png" alt="Image description" width="800" height="307"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Amazon Redshift data sharing allows you to securely and easily share data for read purposes across different Amazon Redshift clusters without the complexity and delays associated with data copies and data movement. Data can be shared at many levels, including schemas, tables, views, and user-defined functions, providing fine-grained access controls that can be tailored for different users and businesses that all need access to the data.&lt;/p&gt;

&lt;p&gt;Lake Formation can be integrated with data sharing.&lt;/p&gt;

&lt;p&gt;For further information visit the following links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/big-data/announcing-amazon-redshift-data-sharing-preview/"&gt;https://aws.amazon.com/blogs/big-data/announcing-amazon-redshift-data-sharing-preview/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/big-data/centrally-manage-access-and-permissions-for-amazon-redshift-data-sharing-with-aws-lake-formation/"&gt;https://aws.amazon.com/blogs/big-data/centrally-manage-access-and-permissions-for-amazon-redshift-data-sharing-with-aws-lake-formation/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Authorization, Federated Lake Formation.
&lt;/h2&gt;

&lt;p&gt;Using Lake Formation with AWS Redshift RA3 we can manage permissions across different accounts from a single central account in a federated way. We delegate permissions to other accounts, but we keep control of them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HT7y-pkT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/chs87i0uufo1gcb9a72w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HT7y-pkT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/chs87i0uufo1gcb9a72w.png" alt="Image description" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Authorization, implementation.
&lt;/h2&gt;

&lt;p&gt;In order to implement federated authorization with AWS Redshift RA3, follow these steps:&lt;/p&gt;

&lt;p&gt;AWS Redshift RA3, producer account:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE DATASHARE producer_sharing;
GRANT USAGE ON DATASHARE producer_sharing TO ACCOUNT 'FEDERATED_GOVERNANCE';
ALTER DATASHARE producer_sharing ADD SCHEMA producer_schema;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;AWS Redshift RA3, consumer account:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE DATASHARE consumer_sharing;
GRANT USAGE ON DATASHARE consumer_sharing TO ACCOUNT 'FEDERATED_GOVERNANCE';
ALTER DATASHARE consumer_sharing ADD SCHEMA consumer_schema;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;AWS Redshift RA3, main federated account:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Through the Lake Formation console, allow access from the consumer account to producer_sharing. You can see a screenshot of this configuration down below.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aLzrxyZu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ix5ykv1gb09i9zjc8rln.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aLzrxyZu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ix5ykv1gb09i9zjc8rln.png" alt="Image description" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the above configuration, the query from the consumer account will only see the column &lt;code&gt;brand_id&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PLoYeKa9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6k87khycwuml8l7fo4qo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PLoYeKa9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6k87khycwuml8l7fo4qo.png" alt="Image description" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion.
&lt;/h2&gt;

&lt;p&gt;In this article we have explained how you can implement an authorization layer using AWS Redshift RA3 and AWS Lake Formation.&lt;/p&gt;

&lt;p&gt;With this authorization layer we will be able to resolve the following problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Producers and consumers from different domains must have the capability of working in an isolated way (if they wish so) if we want to implement a data mesh with success.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Producers must be able to decide how consumers can access their data. They are the data owners, and they decide how others use their data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fine-grained permissions can be established, at column level and even, if we want, at row level. This will be of great interest if we want to be GDPR compliant. More information about how to implement the GDPR in your own data platform will be given in future articles.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stay tuned for the next article about how to implement your own Data Platform with success.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I hope this article was useful. If you enjoy messing around with Big Data, Microservices, reverse engineering or any other computer stuff and want to share your experiences with me, just follow &lt;a href="https://twitter.com/gumartinm"&gt;me&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>datamesh</category>
      <category>dataplatform</category>
      <category>redshift</category>
      <category>lakeformation</category>
    </item>
    <item>
      <title>How to build your own data platform. Episode 2: authorization layer. Data Lake implementation.</title>
      <dc:creator>Gustavo Martin Morcuende</dc:creator>
      <pubDate>Fri, 02 Jun 2023 21:43:39 +0000</pubDate>
      <link>https://dev.to/adevintaspain/how-to-build-your-own-data-platform-episode-2-authorization-layer-data-lake-implementation-22l9</link>
      <guid>https://dev.to/adevintaspain/how-to-build-your-own-data-platform-episode-2-authorization-layer-data-lake-implementation-22l9</guid>
      <description>&lt;h2&gt;
  
  
  Introduction.
&lt;/h2&gt;

&lt;p&gt;This is the second episode in the series about how to build your own data platform. You can find the whole list of articles in the following link &lt;a href="https://medium.com/@gu.martinm/list/how-to-build-your-own-data-platform-9e6f85e4ce39"&gt;https://medium.com/@gu.martinm/list/how-to-build-your-own-data-platform-9e6f85e4ce39&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Remember, a data platform will be used by many teams and users. Also, the data to be stored may come from many different sources. Data owners will want to set permissions and boundaries on who can access the data that they store in the data platform.&lt;/p&gt;

&lt;p&gt;In this episode I will explain how you can add these capabilities to your data platform. I will also introduce the concept of &lt;a href="https://en.wikipedia.org/wiki/Data_mesh"&gt;data mesh&lt;/a&gt;, and show how you can use the authorization layer to implement the workflow between data consumers and data owners that you will need to create a successful data mesh.&lt;/p&gt;




&lt;h2&gt;
  
  
  Authorization layer.
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--N45XVE7B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zvqr33kdqbct0r48nn7w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--N45XVE7B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zvqr33kdqbct0r48nn7w.png" alt="Image description" width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our authorization layer will sit on top of the storage layer. In this way, users and applications that want to use the stored data will have to go through this layer in a safe way; no data will escape from the storage layer without authorization.&lt;br&gt;
To implement this layer you can use different solutions such as &lt;a href="https://www.databricks.com/product/unity-catalog"&gt;Unity Catalog&lt;/a&gt; from Databricks, &lt;a href="https://aws.amazon.com/lake-formation/"&gt;Lake Formation&lt;/a&gt; from AWS, plain &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html"&gt;IAM&lt;/a&gt; roles also from AWS, &lt;a href="https://ranger.apache.org/"&gt;Apache Ranger&lt;/a&gt;, &lt;a href="https://privacera.com/"&gt;Privacera&lt;/a&gt; and many others.&lt;/p&gt;

&lt;p&gt;For this article, and because we are working with Amazon Web Services, we will be implementing this layer using IAM roles and Lake Formation.&lt;/p&gt;
&lt;h2&gt;
  
  
  Processing layer.
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MYKzhibH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e8792221phjqc31znbm1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MYKzhibH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e8792221phjqc31znbm1.png" alt="Image description" width="800" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Human users and processes will be the ones accessing the stored data through the authorization layer: machines and processes like &lt;a href="https://docs.aws.amazon.com/es_es/emr/latest/ReleaseGuide/emr-zeppelin.html"&gt;Zeppelin notebooks&lt;/a&gt;, &lt;a href="https://aws.amazon.com/athena/"&gt;AWS Athena&lt;/a&gt; for SQL, clusters of &lt;a href="https://aws.amazon.com/emr/"&gt;AWS EMR&lt;/a&gt;, &lt;a href="https://www.databricks.com/"&gt;Databricks&lt;/a&gt;, and so on.&lt;/p&gt;
&lt;h2&gt;
  
  
  The problem with the authorization.
&lt;/h2&gt;

&lt;p&gt;Data engineers, data analysts and data scientists work in different and sometimes isolated teams. They do not want their data to be deleted or changed by tools or people outside their teams. &lt;/p&gt;

&lt;p&gt;Also, to be &lt;a href="https://en.wikipedia.org/wiki/General_Data_Protection_Regulation"&gt;GDPR&lt;/a&gt; compliant, access to &lt;a href="https://en.wikipedia.org/wiki/Personal_data"&gt;PII&lt;/a&gt; data will require strict restrictions, even at column or row level.&lt;/p&gt;

&lt;p&gt;All stored data needs to have an owner, and in a data mesh, data owners are typically in charge of granting access to their data.&lt;/p&gt;
&lt;h2&gt;
  
  
  What is a Data mesh?
&lt;/h2&gt;

&lt;p&gt;Taken from &lt;a href="https://www.datamesh-architecture.com/#what-is-data-mesh"&gt;https://www.datamesh-architecture.com/#what-is-data-mesh&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zLMEBH_J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8bgtdcweu403xixt0gy5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zLMEBH_J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8bgtdcweu403xixt0gy5.png" alt="Image description" width="800" height="511"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The term data mesh, coined in 2019 by &lt;a href="https://martinfowler.com/articles/data-mesh-principles.html"&gt;Zhamak Dehghani&lt;/a&gt;, is based on four key principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Domain ownership: Domain teams are responsible for their data, aligning with the boundaries of their team's domain. &lt;strong&gt;An authorization layer will be required to implement those boundaries for each team&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data as a product: Analytical data should be treated as a product, with consumers beyond the domain. &lt;strong&gt;An owner-consumer relationship will exist, where consumers require access to products owned by a different team.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Self-serve data infrastructure platform: A data platform team provides domain-agnostic tools and systems to build, execute, and maintain data products.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Federated governance: Interoperability of data products is achieved through standardization promoted by the governance group.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Owner - consumer, relationship.
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gXDwwtL---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lpj6odko10ldaotgctge.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gXDwwtL---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lpj6odko10ldaotgctge.png" alt="Image description" width="756" height="822"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A data consumer requests access to some data owned by a different team in a different domain. For example, a table in a database.&lt;/li&gt;
&lt;li&gt;The data owner grants access by approving the access request.&lt;/li&gt;
&lt;li&gt;Upon the approval of an access request, a new permission is added to the specific table.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Our authorization layer must be able to provide the above capability if we want to implement a data mesh with success.&lt;/p&gt;
&lt;h2&gt;
  
  
  Data Lake.
&lt;/h2&gt;

&lt;p&gt;In this section we give a brief recap of what we explained in the previous article: &lt;a href="https://medium.com/@gu.martinm/how-to-build-your-own-data-platform-f273014701ff"&gt;https://medium.com/@gu.martinm/how-to-build-your-own-data-platform-f273014701ff&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  AWS S3.
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ihIahoPU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dfcoir117w9w1q6ukqj5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ihIahoPU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dfcoir117w9w1q6ukqj5.png" alt="Image description" width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Notebooks, Spark jobs, clusters, etc, etc, run in Amazon virtual servers called EC2.&lt;br&gt;
These virtual servers require permissions for accessing AWS S3. These permissions are given by IAM Roles.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We will be working with Amazon Web Services. As we said before, because the amount of data to be stored is huge, we cannot use plain HDD or SSD storage; we need something cheaper. In this case we will be talking about AWS S3.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Also, in order to ease the use of the Data Lake, we can implement metastores on top of it, for example the Hive Metastore or the Glue Catalog. We are not going to explain in depth how a metastore works; that will be left for a future article.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When using a notebook (for example a Databricks notebook) with a metastore, the first thing the notebook does is ask the metastore where the data is physically located. Once the metastore responds, the notebook goes to the path in AWS S3 where the data is stored, using the permissions given by the IAM Role.&lt;/p&gt;
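&lt;p&gt;That two-step resolution can be sketched as follows (a toy model; the database and table names match the example used later in this article, and the S3 path is made up for illustration):&lt;/p&gt;

```python
# Toy sketch of the resolution flow: the notebook first asks the
# metastore for the physical location of a table, then reads from
# that S3 path using the permissions of its IAM Role.
# The S3 path below is made up for illustration.
metastore = {
    ("schema", "producer"): "s3://producer/schema/producer/",
}

def resolve(database, table):
    """Step 1: ask the metastore where the data physically lives."""
    return metastore[(database, table)]

location = resolve("schema", "producer")
print(location)  # the notebook would now read this S3 path
```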
&lt;h3&gt;
  
  
  Direct access or with a metastore.
&lt;/h3&gt;

&lt;p&gt;We have two options for working with the data: with or without a metastore.&lt;br&gt;
With a metastore, users can access the data in the Data Lake more easily, because they can use SQL statements just as they do in any other database.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KAFTb0Mq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/inicqwio30qpl0x4bq2r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KAFTb0Mq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/inicqwio30qpl0x4bq2r.png" alt="Image description" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Authorization, direct access.
&lt;/h2&gt;

&lt;p&gt;Consumers run their notebooks or any other applications from their AWS accounts and consume data located in the producer’s account. &lt;/p&gt;

&lt;p&gt;These notebooks and applications run in Amazon virtual servers called Amazon EC2 instances, and to access the data located in AWS S3 in the producer’s account they use IAM Roles (the permissions for accessing the data).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QozqZ2ce--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/09juuarv46cd5on6gt28.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QozqZ2ce--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/09juuarv46cd5on6gt28.png" alt="Image description" width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  S3 bucket policy
&lt;/h3&gt;

&lt;p&gt;For example, to allow access to the S3 bucket called &lt;code&gt;s3://producer&lt;/code&gt; from the IAM Role with ARN &lt;code&gt;arn:aws:iam::ACCOUNT_CONSUMER:role/IAM_ROLE_CONSUMER&lt;/code&gt;, we can use the following AWS S3 bucket policy on the &lt;code&gt;s3://producer&lt;/code&gt; bucket:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pWh0hDJH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ladtrivevp9l0lrh0oaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pWh0hDJH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ladtrivevp9l0lrh0oaf.png" alt="Image description" width="800" height="647"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Direct access
&lt;/h3&gt;

&lt;p&gt;Here we show an example where, from a Databricks notebook running in the consumer account with the above IAM Role, we are able to access data located in the producer’s account.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--byVEaHzD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a4czn5agezvot3zbrd9h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--byVEaHzD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a4czn5agezvot3zbrd9h.png" alt="Image description" width="800" height="712"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Can we do it better?
&lt;/h3&gt;

&lt;p&gt;With &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/catalog-and-crawler.html"&gt;Glue Catalog as metastore&lt;/a&gt;, data in S3 can be accessed as if it were stored in a table with rows and columns.&lt;/p&gt;

&lt;p&gt;If we use tables instead of the direct access, we can grant permissions even at column level.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/es/lake-formation/"&gt;Lake Formation&lt;/a&gt; provides its own permissions model that augments the IAM permissions model. This centrally defined permissions model enables fine-grained access to data stored in data lakes through a simple grant or revoke mechanism, much like a database. Lake Formation permissions are enforced using granular controls at the column, row, and cell-levels.&lt;/p&gt;
&lt;h2&gt;
  
  
  Authorization, Lake Formation.
&lt;/h2&gt;

&lt;p&gt;To use Lake Formation we will need the following elements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;An application running in some machine in an AWS account. For example, an AWS EC2 instance where a Spark notebook will be executed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A shared resource between the producer and consumer’s account. In this case we are sharing the S3 bucket called producer.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;An IAM Role with permissions for using the producer’s bucket.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Two AWS Glue Catalogs as metastores. The one in the consumer's account will be in charge of forwarding the table resolution to the metastore in the producer’s account. Both metastores are also shared between the two accounts.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tsQzll7---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4kf9b0vhgayf1hfxtc75.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tsQzll7---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4kf9b0vhgayf1hfxtc75.png" alt="Image description" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The catalog in the producer’s account contains all the information required to translate the virtual table to its physical S3 location.&lt;/p&gt;

&lt;p&gt;In the below screenshots you can see the Lake Formation configuration for the Glue metastore located in the producer’s account. &lt;/p&gt;

&lt;p&gt;First you can see the table and database where the producer’s table is located. You can also see that we are sharing the specific table with the consumer’s account.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Database: schema
Table: producer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7TugteEC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pq3grmajm8yrt5frh2uu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7TugteEC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pq3grmajm8yrt5frh2uu.png" alt="Image description" width="800" height="573"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above table we can configure access permissions. For example, we can decide that we will be allowing only the use of &lt;code&gt;SELECT&lt;/code&gt; statements from the consumer’s account and also the only column that will be shown is the one called &lt;code&gt;brand_id&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--buVHHBGt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fn3pafvce60y0pzyrssu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--buVHHBGt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fn3pafvce60y0pzyrssu.png" alt="Image description" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, from the Spark notebook running in the consumer’s account we can run SQL statements against the table located in the producer’s account.&lt;/p&gt;

&lt;p&gt;Because we only allowed access to the column called &lt;code&gt;brand_id&lt;/code&gt;, the consumer will only see values for that column. Any other column will be hidden.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--90aIq0ek--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/at1wvg6seyizk0u6esap.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--90aIq0ek--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/at1wvg6seyizk0u6esap.png" alt="Image description" width="800" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion.
&lt;/h2&gt;

&lt;p&gt;In this article we have explained how you can implement an authorization layer using AWS IAM Roles and AWS Lake Formation.&lt;/p&gt;

&lt;p&gt;With this authorization layer we will be able to resolve the following problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Producers and consumers from different domains must be able to work in an isolated way (if they so wish) if we want to implement a data mesh successfully.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Producers must be able to decide how consumers can access their data. They are the data owners, and they decide how others use their data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fine-grained permissions can be established at column level and, if we want, even at row level. This is of great interest if we want to be &lt;a href="https://en.wikipedia.org/wiki/General_Data_Protection_Regulation"&gt;GDPR&lt;/a&gt; compliant. How to implement GDPR compliance in your own data platform will be explained in future articles.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stay tuned for the next article about how to implement your own Data Platform with success.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I hope this article was useful. If you enjoy messing around with Big Data, Microservices, reverse engineering or any other computer stuff and want to share your experiences with me, just follow &lt;a href="https://twitter.com/gumartinm"&gt;me&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>datamesh</category>
      <category>dataplatform</category>
      <category>datalake</category>
      <category>lakeformation</category>
    </item>
    <item>
      <title>How to build your own data platform. Episode 1: sharing data between environments. Data Warehouse implementation.</title>
      <dc:creator>Gustavo Martin Morcuende</dc:creator>
      <pubDate>Tue, 06 Dec 2022 00:27:50 +0000</pubDate>
      <link>https://dev.to/adevintaspain/how-to-build-your-own-data-platform-episode-1-sharing-data-between-environments-data-warehouse-implementation-4nko</link>
      <guid>https://dev.to/adevintaspain/how-to-build-your-own-data-platform-episode-1-sharing-data-between-environments-data-warehouse-implementation-4nko</guid>
      <description>&lt;h2&gt;
  
  
  Introduction.
&lt;/h2&gt;

&lt;p&gt;This article is the second part of the first episode about how to build your own data platform. To catch up, follow this link: &lt;a href="https://dev.to/adevintaspain/how-to-build-your-own-data-platform-4l6c"&gt;https://dev.to/adevintaspain/how-to-build-your-own-data-platform-4l6c&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a short recap, remember that many parts are involved in creating a data platform. For this first episode we are only focusing on the component that we called the storage layer. In the storage layer we can find the Lakehouse or Data Lake and the Data Warehouse. In the previous article we talked about how to share data in the Data Lake; in this second part we will do the same for the Data Warehouse.&lt;/p&gt;

&lt;h2&gt;
  
  
  Storage layer.
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FQ5xqutf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/19ifkap4g03ppz060no5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FQ5xqutf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/19ifkap4g03ppz060no5.png" alt="storage layer" width="880" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see in this diagram three different elements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Lakehouse:&lt;/strong&gt; we already talked about it in the previous article.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Metastore:&lt;/strong&gt; we also explained it in the previous article. We will talk about it more deeply in coming articles.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Warehouse:&lt;/strong&gt; many times you will need to implement &lt;a href="https://en.wikipedia.org/wiki/Star_schema"&gt;star schemas&lt;/a&gt; for creating &lt;a href="https://en.wikipedia.org/wiki/Data_mart"&gt;data marts&lt;/a&gt;. Here, users can find meaningful data for creating dashboards, machine learning products or any other thing that users require. In this case, the Data Warehouse will be implemented on &lt;a href="https://aws.amazon.com/redshift/"&gt;AWS Redshift&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Current situation (environment isolation)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gvQG4qaF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fbep4wl6hfks7ffzetjg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gvQG4qaF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fbep4wl6hfks7ffzetjg.png" alt="current situation" width="880" height="594"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Remember that if you want users to create data products as fast as possible, you will need to create at least one environment where these users can mess around with the stored data. In this isolated environment they will be able to break and change as many things as they want. Our production environment must be isolated from this and other environments because we do not want to break production processes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with data.
&lt;/h2&gt;

&lt;p&gt;We want users to be able to work with huge amounts of data in an easy and fast way, but we want them to do that in environments isolated from the production one, because we do not want them to break anything.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Warehouse, AWS Redshift.
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VaSkUBmG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rzoj2ui1lamfqvfjy9mm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VaSkUBmG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rzoj2ui1lamfqvfjy9mm.png" alt="data warehouse aws redshift" width="880" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All the environments have the same components, but isolated from each other.&lt;/p&gt;

&lt;p&gt;The Data Warehouse is implemented on top of AWS Redshift. A few years ago Amazon released a new offering called &lt;a href="https://aws.amazon.com/redshift/features/ra3/"&gt;AWS Redshift RA3&lt;/a&gt;. What makes RA3 different from the old Redshift is that, in the new implementation, computation and storage are separated. Before RA3, if users needed more storage capacity, they also had to pay for more computation even if computation was not a problem; and the other way around, when users needed more computation capacity, they had to pay for more storage. So, Redshift costs were typically high.&lt;/p&gt;

&lt;p&gt;Since the release of AWS Redshift RA3, because storage and computation are separated, users can decide if they want to increase either their storage or computational capabilities and only pay for what they need.&lt;/p&gt;

&lt;p&gt;We will be using AWS Redshift RA3. Here you can find some useful links that explain further what AWS Redshift and AWS Redshift RA3 are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/welcome.html"&gt;https://docs.aws.amazon.com/redshift/latest/mgmt/welcome.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/big-data/use-amazon-redshift-ra3-with-managed-storage-in-your-modern-data-architecture/"&gt;https://aws.amazon.com/blogs/big-data/use-amazon-redshift-ra3-with-managed-storage-in-your-modern-data-architecture/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Data Warehouse, Redshift RA3.
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AIqEGZr7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m88lbkztghbtsmf9ngm9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AIqEGZr7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m88lbkztghbtsmf9ngm9.png" alt="Image description" width="880" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With Redshift RA3, storage lives in a component called Redshift Managed Storage, which is backed by AWS S3. As you can see in the above diagram, compute nodes are separated from the storage.&lt;/p&gt;

&lt;p&gt;You can find more information about RA3 in the following link: &lt;a href="https://aws.amazon.com/blogs/big-data/use-amazon-redshift-ra3-with-managed-storage-in-your-modern-data-architecture/"&gt;https://aws.amazon.com/blogs/big-data/use-amazon-redshift-ra3-with-managed-storage-in-your-modern-data-architecture/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Warehouse, integration and production environments.
&lt;/h2&gt;

&lt;p&gt;In the integration environment we work with data as you can see in the pictures below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--k4g_ne66--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9w3bnsf50fmd4nftntf9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--k4g_ne66--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9w3bnsf50fmd4nftntf9.png" alt="integration environment" width="880" height="624"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the production environment we have the exact same system but isolated from the integration environment. In production we find the exact same statements.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rkHqq7fW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rdgq5gtsnm9imr9mlapz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rkHqq7fW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rdgq5gtsnm9imr9mlapz.png" alt="production environment" width="880" height="625"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Warehouse, sharing data.
&lt;/h2&gt;

&lt;p&gt;AWS Redshift RA3 includes a feature called data sharing. With data sharing we can get read-only access to Redshift data located in other Redshift clusters, even in different accounts or environments.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Data sharing provides instant, granular, and high-performance access without copying data and data movement. You can query live data constantly across all consumers on different RA3 clusters in the same AWS account, in a different AWS account, or in a different AWS Region. Queries accessing shared data use the compute resources of the consumer Amazon Redshift cluster and don’t impact the performance of the producer cluster.&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/blogs/big-data/use-amazon-redshift-ra3-with-managed-storage-in-your-modern-data-architecture/"&gt;https://aws.amazon.com/blogs/big-data/use-amazon-redshift-ra3-with-managed-storage-in-your-modern-data-architecture/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/blogs/big-data/announcing-amazon-redshift-data-sharing-preview/"&gt;https://aws.amazon.com/blogs/big-data/announcing-amazon-redshift-data-sharing-preview/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Sharing.
&lt;/h3&gt;

&lt;p&gt;With Data Sharing, we can configure the AWS Redshift cluster in the integration environment to access the storage of the AWS Redshift cluster located in the production environment.&lt;/p&gt;

&lt;p&gt;You can find more information about it in the following link: &lt;a href="https://aws.amazon.com/blogs/big-data/sharing-amazon-redshift-data-securely-across-amazon-redshift-clusters-for-workload-isolation/"&gt;https://aws.amazon.com/blogs/big-data/sharing-amazon-redshift-data-securely-across-amazon-redshift-clusters-for-workload-isolation/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RyBas7s6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xaxeqp18optazuuqc954.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RyBas7s6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xaxeqp18optazuuqc954.png" alt="data sharing" width="880" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Sharing, implementation.
&lt;/h3&gt;

&lt;p&gt;To set up data sharing between the integration and production AWS Redshift clusters, you can follow these steps.&lt;/p&gt;

&lt;p&gt;AWS Redshift RA3, production environment, statements to run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;CREATE DATASHARE meetup_sharing;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;GRANT USAGE ON DATASHARE meetup_sharing TO ACCOUNT 'INTEGRATION';&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ALTER DATASHARE meetup_sharing ADD SCHEMA schema;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ALTER DATASHARE meetup_sharing SET INCLUDENEW = TRUE FOR SCHEMA schema;&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AWS Redshift RA3, integration environment, statements to run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;CREATE DATABASE meetup_pro FROM DATASHARE meetup_sharing OF ACCOUNT 'PRODUCTION';&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CREATE EXTERNAL SCHEMA IF NOT EXISTS pro_schema FROM REDSHIFT DATABASE 'meetup_pro' SCHEMA 'schema';&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;GRANT USAGE ON SCHEMA pro_schema TO schema;&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With the above configuration, when using the &lt;strong&gt;pro_&lt;/strong&gt; prefix in the integration environment, we will be accessing data located in the production one. This access is read-only, so we cannot modify that data in any way.&lt;/p&gt;
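&lt;p&gt;If you provision several data shares, the statements above can be generated from a template. The following Python sketch simply builds the SQL strings for both sides; the share, schema, database and account names are the ones from this example, and the helper function itself is hypothetical:&lt;/p&gt;

```python
# Illustrative sketch: generate the producer-side and consumer-side SQL for a
# Redshift RA3 data share. All names and account identifiers are placeholders.
def datashare_statements(share, schema, consumer_account, producer_account,
                         local_db, local_schema):
    # Statements to run on the producer cluster (production environment).
    producer = [
        f"CREATE DATASHARE {share};",
        f"GRANT USAGE ON DATASHARE {share} TO ACCOUNT '{consumer_account}';",
        f"ALTER DATASHARE {share} ADD SCHEMA {schema};",
        f"ALTER DATASHARE {share} SET INCLUDENEW = TRUE FOR SCHEMA {schema};",
    ]
    # Statements to run on the consumer cluster (integration environment).
    consumer = [
        f"CREATE DATABASE {local_db} FROM DATASHARE {share} "
        f"OF ACCOUNT '{producer_account}';",
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {local_schema} "
        f"FROM REDSHIFT DATABASE '{local_db}' SCHEMA '{schema}';",
    ]
    return producer, consumer

producer_sql, consumer_sql = datashare_statements(
    "meetup_sharing", "schema", "INTEGRATION", "PRODUCTION",
    "meetup_pro", "pro_schema")
```

The generated strings would then be executed against each cluster with whatever SQL client you use (psql, the Redshift query editor, etc.).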

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FoRyIkv1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xv36ycjybfxv95fp83ge.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FoRyIkv1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xv36ycjybfxv95fp83ge.png" alt="data sharing implementation" width="880" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion.
&lt;/h2&gt;

&lt;p&gt;Throughout this article we have covered how to resolve the following problems in a Data Warehouse implemented on AWS Redshift:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users (data engineers, data analysts and data scientists) need to work in pre-production environments with the same amount of data as in production.&lt;/li&gt;
&lt;li&gt;We want to have different and isolated environments: integration, production, etc.&lt;/li&gt;
&lt;li&gt;Users need to work with the data in the easiest possible way.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stay tuned for the next article about how to implement your own Data Platform with success.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I hope this article was useful. If you enjoy messing around with Big Data, Microservices, reverse engineering or any other computer stuff and want to share your experiences with me, just follow &lt;a href="https://twitter.com/gumartinm"&gt;me&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataplatform</category>
      <category>datalake</category>
      <category>lakehouse</category>
      <category>datamesh</category>
    </item>
    <item>
      <title>How to build your own data platform. Episode 1: sharing data between environments. Data Lake implementation.</title>
      <dc:creator>Gustavo Martin Morcuende</dc:creator>
      <pubDate>Tue, 29 Nov 2022 01:06:07 +0000</pubDate>
      <link>https://dev.to/adevintaspain/how-to-build-your-own-data-platform-4l6c</link>
      <guid>https://dev.to/adevintaspain/how-to-build-your-own-data-platform-4l6c</guid>
      <description>&lt;h2&gt;
  
  
  Introduction.
&lt;/h2&gt;

&lt;p&gt;Data is the new oil. Companies want to make the most of the data they produce. To achieve this goal, they need systems capable of consuming, processing, analysing, and presenting massive volumes of data. These systems need to be easy to use, but they also need to be reliable, able to detect problems and store data correctly. These and other issues are what Data Platforms intend to resolve.&lt;/p&gt;

&lt;p&gt;It is not an easy task to build a Data Platform. Multiple skill sets are needed, from infrastructure and programming to data management.&lt;/p&gt;

&lt;p&gt;This article is the first, of what I hope will be a longer series of articles where we'll try to unravel the secrets of how to build a Data Platform that allows you to generate value-added products for your users. &lt;/p&gt;

&lt;h2&gt;
  
  
  What is a data platform?
&lt;/h2&gt;

&lt;p&gt;We can find definitions of what a data platform is just by using our preferred web search engine. For example, I found the following definitions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A data platform enables the acquisition, storage, preparation, delivery, and governance of your data, and adds a security layer for users and applications.&lt;br&gt;
&lt;a href="https://www.mongodb.com/what-is-a-data-platform" rel="noopener noreferrer"&gt;https://www.mongodb.com/what-is-a-data-platform&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A data platform is a complete solution for ingesting, processing, analyzing and presenting the data generated by the systems, processes and infrastructures of the modern digital organization.&lt;br&gt;
&lt;a href="https://www.splunk.com/en_us/data-insider/what-is-a-data-platform.html" rel="noopener noreferrer"&gt;https://www.splunk.com/en_us/data-insider/what-is-a-data-platform.html&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, a data platform is a place where we can store data from multiple sources. A data platform also provides users with the tools required for searching, working with and transforming that data, with the goal of creating products. These products could be dashboards with useful insights, machine learning products, and so on.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a data platform? Very simplified diagram.
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6bpq3juhhmddhzcma5wj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6bpq3juhhmddhzcma5wj.png" alt="Data Platform simplified diagram" width="800" height="268"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this diagram we can find all the basic components that make up a data platform (we are not trying to describe a &lt;a href="https://en.wikipedia.org/wiki/Data_mesh" rel="noopener noreferrer"&gt;Data Mesh&lt;/a&gt; or a &lt;a href="https://en.wikipedia.org/wiki/Data_management_platform" rel="noopener noreferrer"&gt;Data Management Platform&lt;/a&gt;; those topics will be left for future articles). You can find the same components, with other names but the same functionality, in diagrams describing other data platforms. In this diagram we can find these components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Sources:&lt;/strong&gt; databases, REST APIs, event buses, analytics tools, etc, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Consumption:&lt;/strong&gt; tools for consuming the data sources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Storage:&lt;/strong&gt; the place where the consumed data will be located.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Security layer:&lt;/strong&gt; component in charge of providing authentication, authorization and auditing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Processing:&lt;/strong&gt; programs or tools that will enable us to work with the stored data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Catalog:&lt;/strong&gt; because the amount of stored data will be huge, we need a tool that makes it easy for users to find the data that they need.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tableau, Qlik, Kubeflow, MLflow, etc, etc:&lt;/strong&gt; data will be used for some goal. Typically this goal could be to create a dashboard with meaningful diagrams, create models for machine learning and many other things.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This first article will be focusing on the storage layer, so from now on, we will talk only about that component.&lt;/p&gt;

&lt;h2&gt;
  
  
  Storage layer.
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyv8h616qadlfl8xnzglu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyv8h616qadlfl8xnzglu.png" alt="storage layer" width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Of course, the storage layer is the place where the data is stored. Because the amount of data to be stored is huge, we cannot use HDD or SSD storage; we need something cheaper. In this case we will be talking about &lt;a href="https://aws.amazon.com/s3/" rel="noopener noreferrer"&gt;AWS S3&lt;/a&gt; because we are working with &lt;a href="https://en.wikipedia.org/wiki/Amazon_Web_Services" rel="noopener noreferrer"&gt;Amazon Web Services&lt;/a&gt;. For &lt;a href="https://en.wikipedia.org/wiki/Microsoft_Azure" rel="noopener noreferrer"&gt;Azure&lt;/a&gt;, you could use &lt;a href="https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction" rel="noopener noreferrer"&gt;Azure Data Lake Storage Gen2&lt;/a&gt;. If you are working with &lt;a href="https://en.wikipedia.org/wiki/Google_Cloud_Platform" rel="noopener noreferrer"&gt;Google Cloud&lt;/a&gt;, you could use &lt;a href="https://cloud.google.com/storage" rel="noopener noreferrer"&gt;Google Cloud Storage&lt;/a&gt;. It does not matter what storage you use as long as it is cheap and can store a huge amount of data.&lt;/p&gt;

&lt;p&gt;You can see in this diagram three different elements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Lakehouse:&lt;/strong&gt; it is the evolution of the traditional Data Lake. Data Lakehouse implements all the capabilities of a Data Lake plus ACID transactions. You can find more information about Lakehouses in this &lt;a href="https://www.databricks.com/glossary/data-lakehouse" rel="noopener noreferrer"&gt;link&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Usually in a Data Lakehouse or a Data Lake you can find different zones for storing data. The number of zones that you can find depends on how you want to classify your data. How to create and classify the data in your Data Lake or Lakehouse is a complicated matter that will be treated in a future article. The Data Lake is the first place where the consumed data is stored. Sometimes it is just meaningless raw data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Warehouse:&lt;/strong&gt; many times you will need to implement &lt;a href="https://en.wikipedia.org/wiki/Star_schema" rel="noopener noreferrer"&gt;star schemas&lt;/a&gt; for creating &lt;a href="https://en.wikipedia.org/wiki/Data_mart" rel="noopener noreferrer"&gt;data marts&lt;/a&gt;, in order to make the stored data easy for users to consume. Here, users can find meaningful data for creating dashboards, machine learning products or anything else that they require.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Metastore:&lt;/strong&gt; data is stored in blob storage. If we want to use this data as if it were stored in a traditional database, we need a component that translates schemas and table names into folders and files in the blob storage. This translation is done by the metastore.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article does not try to deeply explain how the above three elements work. Those explanations will be left out for other future articles.&lt;/p&gt;

&lt;h2&gt;
  
  
  Current situation (environment isolation)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fccaa5e1n1ak9gubjql5c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fccaa5e1n1ak9gubjql5c.png" alt="Current situation" width="800" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want users to create data products as fast as possible, you will need to create at least one environment where these users can mess around with the stored data. In this isolated environment they will be able to break and change as many things as they want. Our production environment must be isolated from this and other environments because we do not want to break production processes. Different and isolated environments will exist. These environments contain the same processing and storage layers, but these layers are isolated in their own environments, so notebooks in the sandbox environment cannot break data stored in the storage layer of the production environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with data.
&lt;/h2&gt;

&lt;p&gt;Data engineers, data analysts, data scientists and data people in general who work with big data require huge amounts of data in order to implement, develop and test their applications. These applications could be ETLs, ELTs, notebooks, dashboards, etc, etc.&lt;/p&gt;

&lt;p&gt;In a healthy system, these users should be able to work in a safe environment where they can be sure &lt;strong&gt;that they do not break anything that already works in production when trying to implement new solutions.&lt;/strong&gt; We need to create isolated environments: sandbox, integration, production, and so on.&lt;br&gt;
The problem with having different and isolated environments is that, in non-production environments, the amount of data will probably be much lower than the amount generated in production.&lt;/p&gt;

&lt;p&gt;So now, we face the following problem: &lt;strong&gt;we want users to be able to work with huge amounts of data in an easy and fast way, but we want them to do that in environments isolated from the production one, because we do not want them to break anything.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The solution 1.
&lt;/h2&gt;

&lt;p&gt;Remember that we have data sources, and those data sources must be connected to our different and isolated environments. We could ask those data sources to send us the same amount of data as they send to the production environment.&lt;/p&gt;

&lt;p&gt;The problem with this solution is that, in many cases, those data sources have their own non-production environments and it is impossible for them to generate the same amount of data there as in production. Also, they will not be willing to connect our non-production environments to their production ones, because we could break their environments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqlt7t9fgb85zg0nh9e0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqlt7t9fgb85zg0nh9e0.png" alt="the solution 1" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This solution in many cases will not work.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The solution 2.
&lt;/h2&gt;

&lt;p&gt;Another solution could be as simple as implementing a job that copies data from the storage located in the production environment to the non-production one. For example, a Jenkins job.&lt;/p&gt;

&lt;p&gt;The problem with this solution is that copying huge amounts of data is not fast, and the job can break easily for multiple reasons (not having the right permissions, not having enough memory to move all the required data, and so on).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4jwqdefqo3wh0hi5yma.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4jwqdefqo3wh0hi5yma.png" alt="the solution 2" width="800" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This solution does not ease the development of new applications because the copying process is slow, sometimes does not work, and data is not immediately available.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The solution 3.
&lt;/h2&gt;

&lt;p&gt;What our users need is access, from the tools running in the non-production environments, to data generated in the production environment. We need to provide a solution where applications like notebooks running, for example, in the integration environment can access the storage located in the production one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiuz5j2qhr6vpue2uduq6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiuz5j2qhr6vpue2uduq6.png" alt="the solution 3" width="800" height="508"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This solution will work in all cases. This is the solution that we are going to explain in this article, focusing on the Data Lake component. In the next article we will explain the same solution implemented in a Data Warehouse.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Lake, AWS S3.
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frry6ez0gwlkpx6vrdtkg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frry6ez0gwlkpx6vrdtkg.png" alt="data lake aws s3" width="800" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Notebooks, Spark jobs, clusters, etc, etc, run in Amazon virtual servers called &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html" rel="noopener noreferrer"&gt;EC2&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;These virtual servers require permissions for accessing AWS S3. These permissions are given by &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html" rel="noopener noreferrer"&gt;IAM Roles&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We will be working with Amazon Web Services. As we said before, because the amount of data to be stored is huge, we cannot use HDD or SSD storage; we need something cheaper. In this case we will be talking about &lt;a href="https://aws.amazon.com/s3/" rel="noopener noreferrer"&gt;AWS S3&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Also, in order to make the Data Lake easier to use, we can implement &lt;a href="https://www.quora.com/What-is-hive-meta-store/answer/Shruti-Pawar-9" rel="noopener noreferrer"&gt;metastores&lt;/a&gt; on top of it, for example Hive Metastore or Glue Catalog. We are not going to explain in depth how a metastore works; that will be left for a future article.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When using a notebook (for example a Databricks notebook) and having a metastore, the first thing that the notebook will do is to ask the metastore where the data is physically located. Once the metastore responds, the notebook will go to the path in AWS S3 where the data is stored using the permissions given by the IAM Role.&lt;/p&gt;
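
&lt;p&gt;As a minimal sketch (the database, table and bucket names are hypothetical), the two ways of reading the same data from a notebook could look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Through the metastore: Spark first asks the metastore (Hive Metastore
// or Glue Catalog) where the table is physically located in AWS S3.
val viaMetastore = spark.sql("SELECT * FROM awesome_db.events")

// Directly, without a metastore: we need to know the physical path ourselves.
val viaPath = spark.read.parquet("s3://awesome-data-lake/events/")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;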

&lt;h2&gt;
  
  
  Data Lake, integration and production environments.
&lt;/h2&gt;

&lt;p&gt;In the integration environment we have two options for working with the data. With or without using a metastore.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fceeo319vjd2zbv25ue14.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fceeo319vjd2zbv25ue14.png" alt="data lake integration metastore" width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the production environment we have the exact same system but isolated from the integration environment. In production we find the exact same two options.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmh3a9ju0ppi0et4gqh64.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmh3a9ju0ppi0et4gqh64.png" alt="Image description" width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the metastore allows us to use the data located in the Data Lake as if it were a normal database. Also, we can see that the metastore does not store the data itself but the metadata that allows us to find the real data stored in AWS S3. With the metastore, users can access the data in the Data Lake in an easier way because they can use SQL statements as they do in any other database.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Lake, sharing data.
&lt;/h2&gt;

&lt;p&gt;When users run their notebooks or any other application from the integration environment they need to have access to the production data located in the storage zone of the production environment.&lt;/p&gt;

&lt;p&gt;Remember that those notebooks and applications run in Amazon virtual servers called Amazon EC2 instances, and for accessing the data located in AWS S3 they use IAM Roles (the permissions for accessing the data). We can modify the IAM Role in (for example) the integration environment in order to allow EC2 instances to access data located in the production storage zone.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmttrjn9x4lwasx833oxn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmttrjn9x4lwasx833oxn.png" alt="data lake sharing data" width="800" height="841"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  IAM Role configuration.
&lt;/h3&gt;

&lt;p&gt;For example, to be able to access the S3 integration and production folders, we can configure the IAM Role in the following way:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwwdqel5iyw3ejvha2t30.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwwdqel5iyw3ejvha2t30.png" alt="IAM Role" width="800" height="544"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Any application running on a machine with this IAM Role can read data from production and integration but can only modify the data located in the integration environment, so the production data is never modified.&lt;/p&gt;
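
&lt;p&gt;As a sketch, such an IAM policy could look like the following (the bucket names are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadIntegrationAndProduction",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::int-data-lake",
        "arn:aws:s3:::int-data-lake/*",
        "arn:aws:s3:::pro-data-lake",
        "arn:aws:s3:::pro-data-lake/*"
      ]
    },
    {
      "Sid": "WriteOnlyIntegration",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:DeleteObject"],
      "Resource": ["arn:aws:s3:::int-data-lake/*"]
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The first statement grants read access to both environments, while the second one grants write access only to the integration bucket.&lt;/p&gt;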

&lt;h3&gt;
  
  
  Applying the solution.
&lt;/h3&gt;

&lt;p&gt;Once we have applied the above configuration to the IAM Role, users have direct access to the data located in the production environment, for example from the integration environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmckirq5sqi0xecxxqcvj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmckirq5sqi0xecxxqcvj.png" alt="applying solution" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Can we do it better?
&lt;/h3&gt;

&lt;p&gt;With this configuration, users can access the production data from, for example, their notebooks and work with it without being able to modify it. But, as we know, by means of a metastore users can access the data in an even easier way. So the question is: can we use metastores with this solution?&lt;/p&gt;

&lt;p&gt;We will see how to do it in the next section of this article.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Lake, sharing data. Waggle Dance.
&lt;/h2&gt;

&lt;p&gt;Waggle Dance is a request routing Hive metastore proxy that allows tables to be concurrently accessed across multiple Hive deployments.&lt;/p&gt;

&lt;p&gt;In short, Waggle Dance provides a unified endpoint with which you can describe, query, and join tables that may exist in multiple distinct Hive deployments. Such deployments may exist in disparate regions, accounts, or clouds (security and network permitting).&lt;/p&gt;

&lt;p&gt;For further information follow this link: &lt;a href="https://github.com/ExpediaGroup/waggle-dance" rel="noopener noreferrer"&gt;https://github.com/ExpediaGroup/waggle-dance&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4fmbntehg6p0ok77n5dg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4fmbntehg6p0ok77n5dg.png" alt="data lake waggle dance" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, when asking for some table from the integration environment, and based on some configuration, the Waggle Dance instance running in the integration environment decides whether the metastore to be asked resides in the production or in the integration environment.&lt;/p&gt;

&lt;p&gt;For example, this configuration could be based on some prefix; in the example below, the &lt;strong&gt;pro_&lt;/strong&gt; prefix. When using this prefix, the data to be retrieved will be located in the production environment instead of the integration one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc41r9jt6hx0pluvfi781.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc41r9jt6hx0pluvfi781.png" alt="using data lake waggle dance" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion.
&lt;/h2&gt;

&lt;p&gt;In this article we have covered how to solve the following problems in a Data Lake implemented on AWS S3:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Users (data engineers, data analysts and data scientists) need to work in pre-production environments with the same amount of data as in production.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We want to have different and isolated environments: integration, production, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Users need to work with the data in the easiest possible way.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stay tuned for the next article about how to share data with AWS Redshift and many others that will follow about how to implement your own Data Platform with success.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I hope this article was useful. If you enjoy messing around with Big Data, Microservices, reverse engineering or any other computer stuff and want to share your experiences with me, just follow &lt;a href="https://twitter.com/gumartinm" rel="noopener noreferrer"&gt;me&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>cybersecurity</category>
      <category>softwaredevelopment</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Spark: unit, integration and end-to-end tests.</title>
      <dc:creator>Gustavo Martin Morcuende</dc:creator>
      <pubDate>Thu, 15 Oct 2020 07:15:25 +0000</pubDate>
      <link>https://dev.to/adevintaspain/spark-unit-integration-and-end-to-end-tests-f52</link>
      <guid>https://dev.to/adevintaspain/spark-unit-integration-and-end-to-end-tests-f52</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F4j58hvbozn9gebv0lqrj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F4j58hvbozn9gebv0lqrj.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;At Adevinta Spain we are building a Data Platform where we use multiple applications and frameworks some of them based on Spark.&lt;/p&gt;

&lt;p&gt;In order to increase the quality of our Spark applications we wanted to run tests in the same way as we did with other frameworks. That means we wanted to be able to run unit, integration and end-to-end tests.&lt;/p&gt;

&lt;p&gt;This article explains the way Spark tests are run at Adevinta Spain. Hopefully, it will be useful for other big data developers searching ways to improve the quality of their code and at the same time their CI pipelines.&lt;/p&gt;

&lt;h1&gt;
  
  
  Unit, integration and end-to-end tests.
&lt;/h1&gt;

&lt;p&gt;When working with Spark, developers will usually face the need to implement these kinds of tests. Other tests like smoke tests, acceptance tests, etc., are outside the scope of this article, so I will not be mentioning them.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unit Tests:&lt;/strong&gt; at this level we will be dealing with code that does not require a Spark Session in order to work. Also, this kind of code does not talk with the outside world.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration Tests:&lt;/strong&gt; at some point we will need to use a Spark Session. At this level we will be testing Spark transformations and in many cases we will have to deal with external systems such as databases, Kafka clusters, etc, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;End-to-end Tests:&lt;/strong&gt; our application probably will be composed of several Spark transformations working together in order to implement some feature required by some user. Here, we will be testing the whole application.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Spark project layout
&lt;/h1&gt;

&lt;p&gt;This is a typical Scala project layout. I think this layout should work for any use case, but if it does not work for you, I hope it will at least bring some inspiration or ideas to your testing implementation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

src/
├── main
│   └── scala
│       └── example
│           ├── app
│           │   └── AwesomeApp.scala
│           ├── job
│           │   └── AwesomeJob.scala
│           └── service
│               └── AwesomeService.scala
└── test
    ├── resources
    │   ├── awesomejob
    │   │   └── sourcepath
    │   │       └── awesome.json
    │   └── log4j.properties
    └── scala
        └── example
            ├── app
            │   └── AwesomeAppEndToEndTest.scala
            ├── job
            │   └── AwesomeJobIntegrationTest.scala
            ├── service
            │   └── AwesomeServiceTest.scala
            └── SharedSparkSessionHelper.scala


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h1&gt;
  
  
  Application layout
&lt;/h1&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;app package&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Under this package we will find the classes in charge of running our Spark applications. Typically we will have only one Spark application.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;job package&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A Spark application should implement some kind of transformations. Modules under this package run Spark jobs that require a Spark Session.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;service package&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Sometimes business logic does not require a Spark Session in order to work. In such cases, we can implement the logic in a different module.&lt;/p&gt;
&lt;h1&gt;
  
  
  Shared Spark Session
&lt;/h1&gt;

&lt;p&gt;One of the biggest problems to be solved when running Spark tests is their isolation: running a test should not affect the results of another. In order to achieve this goal we are going to need a Spark Session for each set of tests; in this way, the results of one set of tests will not affect others that also require a Spark Session.&lt;/p&gt;

&lt;p&gt;So, we need to implement a system that will enable us to run, clear and stop a Spark Session whenever we need it (before and after a set of related Spark tests).&lt;/p&gt;

&lt;p&gt;The details of the implementation are explained down below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
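&lt;p&gt;A simplified sketch of the &lt;em&gt;SharedSparkSessionHelper&lt;/em&gt; trait (assuming Spark and scalatest as test dependencies; the exact details may differ from the real implementation) could look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import java.nio.file.Files

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, BeforeAndAfterEach, Suite}

trait SharedSparkSessionHelper extends BeforeAndAfterAll with BeforeAndAfterEach { this: Suite =&gt;

  protected var path: String = _
  @transient private var _spark: SparkSession = _
  protected def spark: SparkSession = _spark

  // Override in a test class to load a Spark Session with a different configuration.
  protected def sparkConf: SparkConf =
    new SparkConf()
      .setMaster("local[*]")
      .setAppName("test")
      // Random folders so that several Hive-enabled sessions can coexist in one JVM
      // (they cannot share the same spark-warehouse and metastore_db folders).
      .set("spark.sql.warehouse.dir", Files.createTempDirectory("spark-warehouse").toString)
      .set("javax.jdo.option.ConnectionURL",
        s"jdbc:derby:;databaseName=${Files.createTempDirectory("metastore").toString};create=true")

  override def beforeAll(): Unit = {
    super.beforeAll()
    _spark = SparkSession.builder.config(sparkConf).enableHiveSupport().getOrCreate()
  }

  override def beforeEach(): Unit = {
    super.beforeEach()
    // Temporary path for tests that write results somewhere.
    path = Files.createTempDirectory("test-output").toString
  }

  override def afterEach(): Unit = {
    // Clear and reset the session state so the next test starts clean.
    spark.sharedState.cacheManager.clearCache()
    spark.sessionState.catalog.reset()
    super.afterEach()
  }

  override def afterAll(): Unit = {
    // Stop the session and clear it so a new one can be created later.
    _spark.stop()
    SparkSession.clearActiveSession()
    SparkSession.clearDefaultSession()
    super.afterAll()
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
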



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;beforeAll:&lt;/strong&gt; &lt;em&gt;&lt;a href="https://www.scalatest.org/scaladoc/1.0/org/scalatest/BeforeAndAfterAll.html" rel="noopener noreferrer"&gt;beforeAll&lt;/a&gt;&lt;/em&gt; is a scala test function that runs before any other test in our class under test. We will be using this function for starting our Spark Session.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;sparkConf:&lt;/strong&gt; &lt;em&gt;sparkConf&lt;/em&gt; function enables us to load different Spark Sessions with different Spark configurations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;embedded hive:&lt;/strong&gt; &lt;em&gt;spark-warehouse&lt;/em&gt; and &lt;em&gt;metastore_db&lt;/em&gt; are folders used by Spark when enabling the Hive support. Different Spark Sessions in the same process can not use the same folders. Because of that, we need to create random folders in every Spark Session. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="http://doc.scalatest.org/1.0/org/scalatest/BeforeAndAfterEach.html" rel="noopener noreferrer"&gt;beforeEach&lt;/a&gt;:&lt;/strong&gt; scala test function that creates a temporary path which is useful when our Spark tests end up writing results in some location.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="http://doc.scalatest.org/1.0/org/scalatest/BeforeAndAfterEach.html" rel="noopener noreferrer"&gt;afterEach&lt;/a&gt;:&lt;/strong&gt; clears and resets the Spark Session at the end of every test. Also, it removes the temporary path.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.scalatest.org/scaladoc/1.0/org/scalatest/BeforeAndAfterAll.html" rel="noopener noreferrer"&gt;afterAll&lt;/a&gt;:&lt;/strong&gt; stops the current Spark Session after the set of tests are run. In this way we will be able to run a new Spark Session if it is needed (if there is another set of tests requiring the use of Spark)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  How it works
&lt;/h1&gt;

&lt;p&gt;The basic idea behind &lt;em&gt;SharedSparkSessionHelper&lt;/em&gt; lies in the fact that there is one Spark Session per Java process and it is stored in an &lt;em&gt;&lt;a href="https://docs.oracle.com/javase/8/docs/api/java/lang/InheritableThreadLocal.html" rel="noopener noreferrer"&gt;InheritableThreadLocal&lt;/a&gt;&lt;/em&gt;. When calling &lt;em&gt;&lt;a href="https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/SparkSession.Builder.html" rel="noopener noreferrer"&gt;getOrCreate&lt;/a&gt;&lt;/em&gt; method from &lt;em&gt;SparkSession.Builder&lt;/em&gt; we end up either creating a new Spark Session (and storing it in the InheritableThreadLocal) or using an existing one.&lt;/p&gt;

&lt;p&gt;So, for example, when running an end-to-end test, because &lt;em&gt;SharedSparkSessionHelper&lt;/em&gt; is loaded before anything else (by means of the &lt;em&gt;beforeAll&lt;/em&gt; method), the application under test will be using the Spark Session launched by &lt;em&gt;SharedSparkSessionHelper&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Once the test class is finished, the &lt;em&gt;afterAll&lt;/em&gt; method stops the Spark Session and removes it from the &lt;em&gt;InheritableThreadLocal&lt;/em&gt; leaving our test environment ready for a new Spark Session. In this way, tests using Spark can run in an isolated way.&lt;/p&gt;
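
&lt;p&gt;A test class then just mixes in the trait; as a sketch (the class and the transformation are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import org.scalatest.funsuite.AnyFunSuite

class AwesomeJobIntegrationTest extends AnyFunSuite with SharedSparkSessionHelper {
  test("transformation doubles the value column") {
    import spark.implicits._

    val input = Seq(("a", 1), ("b", 2)).toDF("id", "value")
    val result = input.withColumn("doubled", $"value" * 2)

    assert(result.count() === 2)
    assert(result.columns.contains("doubled"))
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;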

&lt;h1&gt;
  
  
  Awesome project
&lt;/h1&gt;

&lt;p&gt;This article would be nothing without a real example. Just following this &lt;a href="https://github.com/gumartinm/spark-shared-spark-session-helper" rel="noopener noreferrer"&gt;link&lt;/a&gt; you will find a project with &lt;a href="https://www.scala-sbt.org/" rel="noopener noreferrer"&gt;sbt&lt;/a&gt;, &lt;a href="https://www.scalatest.org/" rel="noopener noreferrer"&gt;scalatest&lt;/a&gt;, &lt;a href="http://www.scalastyle.org/" rel="noopener noreferrer"&gt;scalastyle&lt;/a&gt;, &lt;a href="https://github.com/scoverage/sbt-scoverage" rel="noopener noreferrer"&gt;sbt-coverage&lt;/a&gt; and &lt;a href="https://scalameta.org/scalafmt/" rel="noopener noreferrer"&gt;scalafmt&lt;/a&gt; where I use the SharedSparkSessionHelper trait.&lt;/p&gt;

&lt;p&gt;This application can be run on any of the available cluster managers, such as &lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt;, &lt;a href="https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html" rel="noopener noreferrer"&gt;Apache Hadoop Yarn&lt;/a&gt;, Spark running in &lt;a href="https://spark.apache.org/docs/2.4.5/cluster-overview.html" rel="noopener noreferrer"&gt;cluster mode&lt;/a&gt; or any other of your choice.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Testing Spark applications can seem more complicated than with other frameworks, not only because of the need to prepare data sets but also because of the lack of tools that allow us to automate such tests. By means of the SharedSparkSessionHelper trait we can automate our tests in an easy way.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I hope this article was useful. If you enjoy messing around with Big Data, Microservices, reverse engineering or any other computer stuff and want to share your experiences with me, just follow &lt;a href="https://twitter.com/gumartinm" rel="noopener noreferrer"&gt;me&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>scala</category>
      <category>spark</category>
      <category>testing</category>
    </item>
  </channel>
</rss>
