<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: /\\: Fabien PORTES</title>
    <description>The latest articles on DEV Community by /\\: Fabien PORTES (@fabienportes).</description>
    <link>https://dev.to/fabienportes</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F821889%2F85b09061-b3de-4ef9-89b7-c5e24643d904.jpg</url>
      <title>DEV Community: /\\: Fabien PORTES</title>
      <link>https://dev.to/fabienportes</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/fabienportes"/>
    <language>en</language>
    <item>
      <title>What is DataOps and how to make it real with Dataform?</title>
      <dc:creator>/\\: Fabien PORTES</dc:creator>
      <pubDate>Thu, 07 Dec 2023 16:48:06 +0000</pubDate>
      <link>https://dev.to/stack-labs/what-is-dataops-and-how-to-make-it-real-with-dataform--em4</link>
      <guid>https://dev.to/stack-labs/what-is-dataops-and-how-to-make-it-real-with-dataform--em4</guid>
      <description>&lt;p&gt;The software development world has advanced CI/CD tools, enabling efficient software delivery. The data world is catching up adopting CI/CD practices for successful data platforms. With automation and cloud technologies, data teams can automate data processes and ensure reliability and scalability.&lt;/p&gt;

&lt;p&gt;This is where DataOps starts. DataOps is a set of practices to improve quality, speed, and collaboration and promote a culture of continuous improvement between people working on data analytics.&lt;/p&gt;

&lt;p&gt;By converging software development and data practices, organizations maximize the potential of their data assets for informed decision-making. The first part of this article relates each step of a DevOps CI/CD to a step of a DataOps CI/CD. The second part shows how Dataform can help with each of these steps.&lt;/p&gt;

&lt;h1&gt;
  
  
  What are the expectations of a DataOps CI/CD?
&lt;/h1&gt;

&lt;p&gt;As for software development projects, a data project needs a CI/CD chain that ensures modifications to the code base are automatically tested and deployed. The code base here is the logic, written in SQL, that transforms raw data into curated, consumable data.&lt;br&gt;
The software development world can help us identify the key features of a successful data CI/CD.&lt;/p&gt;
&lt;h2&gt;
  
  
  1. Compilation
&lt;/h2&gt;

&lt;p&gt;In the software development world, compilation is used to build an executable package and ensure there are no syntax errors in the language used. There can still be runtime errors, but at least the code follows the language rules. For an interpreted language, we can make an analogy with the language parser used by the linter.&lt;/p&gt;

&lt;p&gt;Similarly, in the context of a data project, this step focuses on validating the written SQL transformations. In the case of a sequence of transformations, it becomes crucial to verify that the resulting directed acyclic graph (DAG) generated by these transformations is valid. This verification can be conducted during the compilation step, as well as with the aid of a linter integrated into your preferred code editor.&lt;br&gt;
By employing a compilation step and leveraging a linter, the following benefits can be obtained:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Language compliance: The compilation step ensures that the SQL transformations conform to the syntax and rules specified by the database management system or query language.&lt;/li&gt;
&lt;li&gt;Syntax validation: The linter performs static analysis on the SQL code, highlighting potential syntax errors, typos, or incorrect usage of language constructs. It helps identify issues early in the development process, promoting clean and error-free code.&lt;/li&gt;
&lt;li&gt;Structural integrity: In the case of a chain of transformations, verifying the resulting DAG's validity is crucial. By confirming that the graph is acyclic, developers can ensure that data flows correctly through the transformations without encountering circular dependencies or infinite loops.&lt;/li&gt;
&lt;li&gt;IDE integration: IDEs equipped with linters offer real-time feedback and suggestions while writing SQL code, enabling developers to spot and rectify errors immediately. This streamlines the development workflow and enhances code quality.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By combining compilation checks and leveraging linters, developers can improve the reliability and correctness of their SQL transformations in data projects. This proactive approach helps catch errors early, promotes adherence to language rules, and ensures the integrity of the transformation process within the overall project.&lt;/p&gt;
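To make the structural-integrity check concrete, here is a minimal sketch (not Dataform's actual implementation) of validating that a chain of transformations forms a valid DAG, using a depth-first cycle check:

```python
# Minimal DAG validation sketch: each transformation maps to the
# transformations it depends on. A valid plan must contain no cycle.
def is_valid_dag(dependencies):
    """Return True if the dependency graph is acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / in progress / fully explored
    color = {node: WHITE for node in dependencies}

    def visit(node):
        if color[node] == GRAY:   # back edge: cycle detected
            return False
        if color[node] == BLACK:  # already fully explored
            return True
        color[node] = GRAY
        for dep in dependencies.get(node, []):
            if not visit(dep):
                return False
        color[node] = BLACK
        return True

    return all(visit(node) for node in dependencies)

# A linear chain raw -> staging -> curated is a valid DAG...
print(is_valid_dag({"curated": ["staging"], "staging": ["raw"], "raw": []}))  # True
# ...while a circular dependency is not.
print(is_valid_dag({"a": ["b"], "b": ["a"]}))  # False
```

This is exactly the kind of check a compilation step can run before any query touches the warehouse.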
&lt;h2&gt;
  
  
  2. Unit Testing
&lt;/h2&gt;

&lt;p&gt;Unit testing in computer science is a software testing technique used to verify the correctness and functionality of individual units or components of a software application. A unit refers to the smallest testable part of a program, such as a function or a method. The purpose of unit testing is to isolate and test each unit in isolation, ensuring that it behaves as expected and meets the specified requirements.&lt;br&gt;
In a data project, unit testing involves verifying the accuracy and functionality of a SQL transformation that generates outputs based on input data. It is beneficial to utilize mock data during unit testing to ensure consistent and expected results are obtained with each test run.&lt;br&gt;
During unit testing of a SQL transformation, mock input data can be employed to simulate various scenarios and validate the behavior of the transformation. By providing predetermined input data, developers can assess whether the transformation produces the anticipated output(s) as defined by the test cases.&lt;br&gt;
This will ensure:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Early bug detection: unit tests catch regressions in the SQL query if the logic is inadvertently changed.&lt;/li&gt;
&lt;li&gt;Code maintainability: unit tests are a way to document code and provide insights on the logic implemented.&lt;/li&gt;
&lt;li&gt;Modularity: unit testing encourages modular design emphasizing the individual units. You might write simpler and more modular SQL queries if you write unit tests for them.&lt;/li&gt;
&lt;li&gt;Faster feedback loop: unit tests are easily executable and fast, providing immediate feedback on the correctness of a unit.&lt;/li&gt;
&lt;/ol&gt;
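As a language-agnostic illustration of the mocked-input idea (the table and column names here are invented for the example), a SQL transformation can be unit-tested by running it against an in-memory database seeded with fake data:

```python
import sqlite3

# The unit under test: a SQL transformation (hypothetical names).
TRANSFORMATION = "SELECT ride_id, distance_km * 1000 AS distance_m FROM rides"

def run_transformation_on_mock(mock_rows):
    """Run the SQL transformation against mocked input data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE rides (ride_id INTEGER, distance_km REAL)")
    conn.executemany("INSERT INTO rides VALUES (?, ?)", mock_rows)
    return conn.execute(TRANSFORMATION).fetchall()

# Predetermined input data and the anticipated output define the test case.
assert run_transformation_on_mock([(1, 2.5), (2, 0.0)]) == [(1, 2500.0), (2, 0.0)]
print("unit test passed")
```

The same pattern, with mock inputs and an expected output, is what Dataform's test feature provides natively, as shown later in this article.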
&lt;h2&gt;
  
  
  3. Deployment
&lt;/h2&gt;

&lt;p&gt;In software development, once the tests pass and the code analysis is satisfactory, the deployment step is triggered. The artifacts are deployed to the target environment, which can include development, staging, or production environments, depending on the deployment strategy.&lt;br&gt;
In a data project, deployment can be seen as the step where the SQL queries are run on the target environment. The compiled directed acyclic graph is actually run on the environment to produce the data objects defined.&lt;br&gt;
The deployment step in a data project plays a pivotal role in transforming raw data into meaningful and usable insights. By executing SQL queries on the target environment and running the compiled DAG, data objects are generated, paving the way for further analysis, reporting, or utilization within the project.&lt;/p&gt;
&lt;h2&gt;
  
  
  4. Integration test
&lt;/h2&gt;

&lt;p&gt;Integration testing, in the context of software development, is a testing technique that focuses on verifying the interactions and cooperation between different components or modules of a system. Integration tests aim to identify issues that may arise when multiple components are integrated and working together as a whole.&lt;br&gt;
Integration tests play a crucial role in ensuring the accuracy and reliability of the data project by validating the integrity and consistency of the transformed data. By examining the output tables resulting from the executed DAG, these tests assess whether the data conforms to the expected format, structure, and content.&lt;br&gt;
The key aspects to consider when conducting integration tests in a data project are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Real data evaluation: Integration tests involve analyzing the transformed data with real-world characteristics.&lt;/li&gt;
&lt;li&gt;DAG transformation verification: The integration tests focus on verifying the proper execution of the DAG of transformations. By evaluating the resulting tables, developers can identify any unexpected or incorrect data manipulation that may occur during the transformation process.&lt;/li&gt;
&lt;li&gt;Anomaly detection: Integration tests aim to uncover any anomalies or inconsistencies in the transformed data. This includes detecting missing data, data corruption, data loss, or any deviations from the expected outcomes.&lt;/li&gt;
&lt;li&gt;Validation against requirements: Integration tests assess whether the transformed data aligns with the specified requirements and business rules. This ensures that the data project delivers the expected results and meets the defined criteria for success.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To effectively perform integration tests in a data project, it is crucial to develop comprehensive test cases that cover various scenarios, including edge cases and critical data paths. These tests should be automated to enable frequent execution and ensure consistency in the evaluation process.&lt;/p&gt;
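As a sketch of what such automated checks might look like (the rows and rules below are hypothetical examples, not a specific tool's API), an integration test can load an output table produced by the DAG and assert on its format, structure, and content:

```python
# Integration-test sketch: validate an output table produced by the DAG.
def check_output_table(rows, required_columns, non_null_columns):
    """Return a list of anomalies found in the transformed data."""
    anomalies = []
    if not rows:
        anomalies.append("output table is empty (possible data loss)")
    for i, row in enumerate(rows):
        missing = required_columns - row.keys()
        if missing:  # structural check: expected format
            anomalies.append(f"row {i} missing columns {sorted(missing)}")
        for col in non_null_columns & row.keys():
            if row[col] is None:  # content check: unexpected nulls
                anomalies.append(f"row {i} has null in {col!r}")
    return anomalies

rows = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": None}]
print(check_output_table(rows, {"id", "amount"}, {"amount"}))
# ["row 1 has null in 'amount'"]
```

Run on real post-deployment data, checks like these surface anomalies before downstream consumers do.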

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Software&lt;/th&gt;
&lt;th&gt;Data&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Compilation&lt;/td&gt;
&lt;td&gt;Syntax, typing validation &lt;br&gt; Artefact generation&lt;/td&gt;
&lt;td&gt;DAG generation and validation&lt;br&gt; SQL syntax validation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unit Tests&lt;/td&gt;
&lt;td&gt;Code functionality testing&lt;/td&gt;
&lt;td&gt;Query logic testing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;Artefacts deployment&lt;/td&gt;
&lt;td&gt;DAG execution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Integration Test&lt;/td&gt;
&lt;td&gt;System testing&lt;/td&gt;
&lt;td&gt;Data Testing&lt;br&gt; Functional testing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h1&gt;
  
  
  How to implement DataOps with Dataform?
&lt;/h1&gt;

&lt;p&gt;To meet the industrial requirements of DataOps, a continuous integration and continuous delivery chain must be established based on the features of Dataform. Dataform is a service for data analysts to develop, test, version control, and schedule complex SQL workflows for data transformation in BigQuery.&lt;/p&gt;

&lt;p&gt;Let's explore how DataOps can be achieved with Dataform, based on the DataOps expectations of the previous section.&lt;/p&gt;
&lt;h2&gt;
  
  
  1. Compile the changes
&lt;/h2&gt;

&lt;p&gt;You can compile your project using the Dataform CLI or APIs through the web user interface. The compilation process checks that the project is valid by verifying the following points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do the referenced dependencies exist?&lt;/li&gt;
&lt;li&gt;Are the config blocks correct?&lt;/li&gt;
&lt;li&gt;Is the templating used valid?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The output of the compilation is a DAG of the actions to be run, corresponding to the tables to be created and the operations to be performed on your data warehouse. This output can be displayed as JSON through the CLI.&lt;br&gt;
Below is a truncated output example defining a source table and a downstream one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "tables": [
        {
            "type": "table",
            "target": {
                "schema": "demo",
                "name": "source_table",
                "database": "demo_db"
            },
            "query": "SELECT 1 AS demo",
            "disabled": false,
            "fileName": "definitions/demo/demo_source.sqlx",
            "dependencyTargets": [],
            "enumType": "TABLE"
        },
        {
            "type": "table",
            "target": {
                "schema": "demo",
                "name": "downstream",
                "database": "demo_db"
            },
            "query": "SELECT * FROM source_table",
            "disabled": false,
            "fileName": "definitions/demo/downstream.sqlx",
            "dependencyTargets": [
                {
                    "schema": "demo",
                    "name": "source_table",
                    "database": "demo_db"
                }
            ],
            "enumType": "TABLE"
        }
    ],
    "projectConfig": {...},
    "graphErrors": {...},
    "declarations": {...},
    "targets": {...}
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
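For reference, SQLX definitions along these lines would compile to the two actions above (a sketch; the exact config blocks depend on your project's default database and schema settings):

```sql
-- definitions/demo/demo_source.sqlx (hypothetical file contents)
config { type: "table", schema: "demo" }
SELECT 1 AS demo

-- definitions/demo/downstream.sqlx
config { type: "table", schema: "demo" }
SELECT * FROM ${ref("source_table")}
```

The `ref()` call is what lets Dataform populate the `dependencyTargets` entry in the compiled output.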



&lt;h2&gt;
  
  
  2. Run unit tests
&lt;/h2&gt;

&lt;p&gt;Using the test feature of Dataform, unit tests can be run against queries to validate the logic of your SQL code. For a given select statement, you can mock the from clause with fake data and also define the expected result of the query on that fake data. If the query logic changes in a way that the query run on the mocked input data no longer matches the mocked output data, an error is raised.&lt;br&gt;
Unit tests give you confidence that your code produces the output data you expect. When a test fails, an error is thrown and you can prevent the changes from being merged into your code base.&lt;br&gt;
Below is an example testing the &lt;em&gt;downstream&lt;/em&gt; dataset defined in the previous example.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;config {
  type: "test",
  dataset: "downstream"
}

input "source_table" {
  SELECT 1 AS demo
}

SELECT 1 AS demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3. Run the changes
&lt;/h2&gt;

&lt;p&gt;You can run your project using the Dataform CLI or APIs through the web user interface. Running a project actually means running the DAG of actions generated at the compilation step. The actions can be creations of tables and views, as well as feeding tables with the implemented logic.&lt;br&gt;
You can also specify a single action to run, or a list of actions matching specific tags. This is very useful for scheduling different parts of the project at different frequencies.&lt;/p&gt;
&lt;h2&gt;
  
  
  4. Run the integration tests
&lt;/h2&gt;

&lt;p&gt;Dataform assertions can be used to run integration tests once your project has been run. With assertions you can validate the content of the data and throw an error when data quality does not match your expectations.&lt;br&gt;
An assertion is a SQL query that should not return any row. You can see it as a query that looks for errors in the data. For instance, the following query creates an assertion that checks that the id column does not contain null values.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT id FROM ref("table_1") WHERE id IS NULL&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;It also has table_1 as a dependency, so once table_1 is built the assertion runs and checks for errors in the data. If errors are found, the assertion fails and raises an error. This way you can ensure the data quality of your data platform.&lt;br&gt;
Assertions can also be manually configured as dependencies in the DAG of queries to be run, so that you can interrupt the project run in case assertions are not fulfilled.&lt;br&gt;
Below is an example adding simple assertions to the previously defined downstream table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;config {
  type: "table",
  assertions: {
    uniqueKey: ["demo"],
    nonNull: ["demo"],
    rowConditions: [
      'demo is null or demo &amp;gt; 0'
    ]
  }
SELECT 1 AS demo
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;All of the features proposed by Dataform integrate well with a CI/CD tool chain as typical steps can be performed to validate and deploy changes to the SQL code base. This brings data engineering to a level of industrialisation on par with the software development world.&lt;/p&gt;

&lt;p&gt;Look out for my next article, which will cover a concrete industrialisation example of Dataform!&lt;/p&gt;

&lt;p&gt;Thanks for reading! I'm Fabien, data engineer at Stack Labs.&lt;br&gt;
If you want to discover the &lt;a href="https://cloud.stack-labs.com/cloud-data-platform" rel="noopener noreferrer"&gt;Stack Labs Data Platform&lt;/a&gt; or join an enthusiastic &lt;a href="https://www.welcometothejungle.com/fr/companies/stack-labs" rel="noopener noreferrer"&gt;Data Engineering team&lt;/a&gt;, please contact us.&lt;/p&gt;

</description>
      <category>data</category>
      <category>dataops</category>
      <category>devops</category>
      <category>cloud</category>
    </item>
    <item>
      <title>10 principles for your event driven architecture</title>
      <dc:creator>/\\: Fabien PORTES</dc:creator>
      <pubDate>Wed, 07 Jun 2023 21:04:13 +0000</pubDate>
      <link>https://dev.to/stack-labs/serverless-day-10-principles-for-your-event-driven-architecture-2lb7</link>
      <guid>https://dev.to/stack-labs/serverless-day-10-principles-for-your-event-driven-architecture-2lb7</guid>
      <description>&lt;p&gt;Today I had the opportunity to assist at the Serverless day event in Paris. This event is dedicated to fostering a community around serverless technologies. There have been many conferences especially about AWS lambdas optimization and architecture and event driven architectures. These topics are indeed key in the serverless world.&lt;/p&gt;

&lt;p&gt;One talk in particular caught my attention. Luc VAN DONKERSGOED, AWS Hero, gave us 10 principles to avoid chaos on a serverless event-driven architecture journey. I will summarize them in this article to help you find peace of mind if you are working on a microservices event-driven architecture.&lt;/p&gt;

&lt;p&gt;The principles are divided into 3 categories.&lt;/p&gt;

&lt;h2&gt;
  
  
  Event design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Use JSON, not text, YAML, Avro, or Protobuf
&lt;/h3&gt;

&lt;p&gt;Using JSON as a unique message format is the way to go in the event-driven architecture paradigm because JSON is easy to evolve. Your event payloads are likely to evolve, and JSON supports this evolution with the introduction of new keys without breaking changes.&lt;/p&gt;

&lt;p&gt;JSON is also extensively and natively supported by most of the services and libraries you will work with.&lt;/p&gt;

&lt;p&gt;The compression benefits of an Avro or Protobuf format are not worth it, as they require extra decoding steps. YAML is also not recommended because it is made for humans, not for machines, and usually needs a specific parser to be decoded.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use an event envelope for metadata
&lt;/h3&gt;

&lt;p&gt;Your event will benefit from transporting metadata related to it, like a unique identifier, the time at which it was generated, etc. This metadata should be part of the JSON payload in a dedicated entry. The payload itself should appear in the data entry, so that the payload looks like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "metadata": {
        "event_id": "uuid",
        "event_time": "Timestamp"
    },
    "data": {
        "height_kg": 12,
        "width_cm": 14
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Metadata will help in the observability of your microservices architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use a unique event id
&lt;/h3&gt;

&lt;p&gt;You should embed in the metadata of your event payload a unique id for the event. You can easily generate a UUID to do so. A unique identifier is crucial to trace events and make observability of your solution real.&lt;/p&gt;
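A minimal sketch of generating such an envelope with a unique id (the field names follow the envelope example above):

```python
import json
import uuid
from datetime import datetime, timezone

def make_event(data):
    """Wrap a payload in an envelope with a unique event id and timestamp."""
    return {
        "metadata": {
            "event_id": str(uuid.uuid4()),  # unique id, crucial for tracing
            "event_time": datetime.now(timezone.utc).isoformat(),
        },
        "data": data,
    }

event = make_event({"height_kg": 12, "width_cm": 14})
print(json.dumps(event, indent=2))
# Two events with the same data still get distinct ids.
assert make_event({})["metadata"]["event_id"] != make_event({})["metadata"]["event_id"]
```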

&lt;p&gt;By following these first 3 principles you can design event payloads in a way that lets them evolve easily. Now we need to evolve and communicate within our microservices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Communicate and evolve
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Use schemas and contracts
&lt;/h3&gt;

&lt;p&gt;The first step towards serene communication between your microservices is to use contracts between them. A consumer service then does not need to know anything about the producer service, only the contract that binds the producer to its output. This ensures the decoupling of producer and consumer services.&lt;/p&gt;

&lt;h3&gt;
  
  
  Maintain backward compatibility
&lt;/h3&gt;

&lt;p&gt;Once you define a contract between your microservices, you should include in your metadata entry the version of the event schema. In case of a breaking change, the event version will evolve and the consumers of these events will know what version of the schema to use to adapt to it. This will ensure you can maintain backward compatibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  Maintain schema registry
&lt;/h3&gt;

&lt;p&gt;For a consumer to retrieve the schema of an event, you need a central place to store the schemas of your events. This is the responsibility of the schema registry, a single source of truth for all the schemas of the events generated by your microservices.&lt;/p&gt;

&lt;p&gt;You can use JSON Schema to declare the schema of each event version.&lt;/p&gt;
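For instance, version 1 of the envelope shown earlier could be declared with JSON Schema like this (a sketch; the title and required fields are assumptions):

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "event-envelope-v1",
  "type": "object",
  "required": ["metadata", "data"],
  "properties": {
    "metadata": {
      "type": "object",
      "required": ["event_id", "event_time"],
      "properties": {
        "event_id": {"type": "string", "format": "uuid"},
        "event_time": {"type": "string", "format": "date-time"}
      }
    },
    "data": {"type": "object"}
  }
}
```

Storing one such document per event version in the registry lets any consumer validate incoming events against the contract.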

&lt;h3&gt;
  
  
  Use an event broker as a transportation layer
&lt;/h3&gt;

&lt;p&gt;This is the key piece to decouple your microservices. Once you have a schema registry to store the contracts between your microservices, you will want to use an event broker to completely decouple them. The event broker receives the events of the producers and forwards them to the consumers. In case of a failure of a consumer, the event broker acts as a buffer for the events, and the consumer processes the event queue once it is back online.&lt;/p&gt;

&lt;p&gt;There is now no more microservice-to-microservice interaction, just producers that push events to the event broker and consumers that read events from it. We now need some principles to integrate and support our event-driven architecture.&lt;/p&gt;
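The buffering behaviour described above can be sketched with a toy in-memory broker (real systems would use a managed broker such as SQS, Pub/Sub, or EventBridge; this is only an illustration of the decoupling):

```python
from collections import defaultdict, deque

class ToyEventBroker:
    """Toy broker: producers publish, consumers pull; events are buffered
    per consumer until that consumer comes back and drains its queue."""

    def __init__(self):
        self.queues = defaultdict(deque)  # one buffer per subscriber

    def subscribe(self, consumer_name):
        self.queues[consumer_name]  # create the consumer's buffer

    def publish(self, event):
        # The producer knows nothing about consumers: fan out to all buffers.
        for queue in self.queues.values():
            queue.append(event)

    def pull(self, consumer_name):
        # A consumer that was down simply drains its backlog once back online.
        queue = self.queues[consumer_name]
        events = list(queue)
        queue.clear()
        return events

broker = ToyEventBroker()
broker.subscribe("billing")
broker.publish({"event_id": "1"})
broker.publish({"event_id": "2"})  # "billing" is down: events are buffered
print(broker.pull("billing"))      # [{'event_id': '1'}, {'event_id': '2'}]
```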

&lt;h2&gt;
  
  
  Integrate and support
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Use event supporting APIs
&lt;/h3&gt;

&lt;p&gt;As your architecture evolves, you will need to add entries to the data payload of your events. For example, one consumer may need the color of a car while another needs the brand and a third the price. It can be tempting to add all of this information to your event payload, but at some point the payload will get too big.&lt;/p&gt;

&lt;p&gt;Instead, you should make the producer expose a resource API to get more information about the resource behind an event. The event holds the identifier of the resource, and you query the producer to learn more about the resource attributes. This way you can keep your payload size relatively small.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use the storage first pattern
&lt;/h3&gt;

&lt;p&gt;In order to avoid unpredictable latency between microservices, you should use the storage-first pattern. A consumer should first store the message, acknowledge its reception, and then process it from the storage. This way your microservices are even more loosely coupled.&lt;/p&gt;

&lt;p&gt;In addition, as your events hold a unique identifier and you store them, you can ensure an event is processed only once. When an event arrives, you can check whether it is already stored and, if so, apply the desired logic. This principle can also help make your architecture idempotent.&lt;/p&gt;
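A sketch of the storage-first, idempotent consumer described above (the in-memory dict stands in for a real durable store such as DynamoDB or Firestore):

```python
class StorageFirstConsumer:
    """Store the event first, acknowledge, then process from storage.
    The unique event_id in the envelope makes processing idempotent."""

    def __init__(self, process):
        self.storage = {}    # stand-in for a durable store
        self.process = process

    def receive(self, event):
        event_id = event["metadata"]["event_id"]
        if event_id in self.storage:
            return "duplicate-acked"  # already stored: skip reprocessing
        self.storage[event_id] = event        # 1. store first
        ack = "acked"                         # 2. acknowledge reception
        self.process(self.storage[event_id])  # 3. process from storage
        return ack

processed = []
consumer = StorageFirstConsumer(lambda e: processed.append(e["data"]))
event = {"metadata": {"event_id": "abc"}, "data": {"height_kg": 12}}
consumer.receive(event)
consumer.receive(event)  # redelivered: stored copy prevents double processing
print(processed)  # [{'height_kg': 12}]
```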

&lt;h3&gt;
  
  
  Trace your events
&lt;/h3&gt;

&lt;p&gt;The last principle is about tracing your events for the observability of your microservices. You can include in the metadata section a unique trace id and a span id that will help you build a monitoring system able to follow the individual steps of a request. The trace id identifies the request while the span id identifies the steps. This information can cascade through the metadata sections of successive events.&lt;/p&gt;

&lt;p&gt;I thank Luc VAN DONKERSGOED, and I hope these principles will help you find peace of mind if you are in the process of building an event-driven architecture with microservices. If not, you can ask me for further guidance!&lt;/p&gt;

&lt;p&gt;Thanks for reading! I'm Fabien, data engineer at Stack Labs.&lt;br&gt;
If you want to discover the &lt;a href="https://cloud.stack-labs.com/cloud-data-platform" rel="noopener noreferrer"&gt;Stack Labs Data Platform&lt;/a&gt; or join an enthusiastic &lt;a href="https://www.welcometothejungle.com/fr/companies/stack-labs" rel="noopener noreferrer"&gt;Data Engineering team&lt;/a&gt;, please contact us.&lt;/p&gt;

</description>
      <category>serverless</category>
      <category>eventdriven</category>
      <category>microservices</category>
      <category>architecture</category>
    </item>
    <item>
      <title>DLDB.io: a Google Analytics alternative, GDPR compliant by design</title>
      <dc:creator>/\\: Fabien PORTES</dc:creator>
      <pubDate>Tue, 13 Dec 2022 13:56:57 +0000</pubDate>
      <link>https://dev.to/stack-labs/dldbio-google-analytics-alternative-gdpr-compliant-by-design-1a06</link>
      <guid>https://dev.to/stack-labs/dldbio-google-analytics-alternative-gdpr-compliant-by-design-1a06</guid>
      <description>&lt;p&gt;Analytics solutions are used by enterprises to better understand their clients and refine their business.&lt;br&gt;
Let’s say I have a business proposing an application developed to control the engine of e-bikes and to provide insights on the users’ rides. The usual analytic solution collects data about the users and its position and then performs analytics on them. For example, I could know what are the best places to install a new charging station. I would just have to query my data looking for places where the e-bike users end up with less than 5% of battery.&lt;/p&gt;

&lt;h2&gt;
  
  
  Current way of doing analytics
&lt;/h2&gt;

&lt;p&gt;The use case mentioned above seems naive, but it requires three steps.&lt;br&gt;
First, I need to collect from the application some unique ID, like my users’ device ID, and their geolocation. This would help me perform further analytics on the data, like looking for the most popular places where e-bike batteries are under 5%. These data are personal, as they can be used to identify individuals using the app: for example, places of residence can be inferred from the data. In Europe, due to GDPR, this means I need to obtain the end user’s consent to collect and process such data. I must also process them in Europe.&lt;br&gt;
Secondly, we need a central place to store this data, one that scales well with the number of users and the granularity of the collected data.&lt;br&gt;
Finally, the computational power must scale with the volume of data collected.&lt;/p&gt;

&lt;p&gt;The figure below describes how we usually do analytics.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qxjsv6nipolo1kx0hht.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qxjsv6nipolo1kx0hht.jpeg" alt="Traditional analytics solution architecture" width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Currently, in Europe, the GDPR requirements are handled by the legal department of the business. The collection and processing of data is provided by cloud platforms. Solutions to do analytics exist, and the question becomes how much they will cost.&lt;/p&gt;

&lt;p&gt;Now imagine that you can perform analytics in a much simpler way, without the previous requirements: you don’t need user consent because you do not collect personal data, and you do not face scalability issues for storage and processing.&lt;/p&gt;

&lt;p&gt;This is the main topic of this article. We discuss an alternative solution: DLDB.io!&lt;/p&gt;

&lt;h2&gt;
  
  
  DLDB.io: a new way of doing analytics
&lt;/h2&gt;

&lt;p&gt;DLDB.io provides an SDK (iOS, Android, Flutter, React Native, Unity) that stores geolocation and users’ data on the end user's terminal. To do analytics, this data is processed by the terminal to answer queries sent by the analytics engine. The results sent back by the terminal are already aggregated and cannot be used to identify users.&lt;br&gt;
More precisely, queries sent by the analytics engine have a lifespan. The terminal pulls a query once it is online and the application is used. If the terminal receives the query during its lifespan, the result is collected and used by the analytics engine.&lt;/p&gt;

&lt;p&gt;The image below describes how DLDB.io does analytics.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozjt8ga4tcrt7wsi8vqx.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozjt8ga4tcrt7wsi8vqx.jpeg" alt="DLDB.IO analytics solution architecture" width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Benefits of DLDB.io analytics solution
&lt;/h3&gt;

&lt;p&gt;The benefits are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GDPR compliance by design:

&lt;ul&gt;
&lt;li&gt;Users’ data is not collected in the cloud. It stays on the end user's terminal, which means there is no risk of bad usage of the data due to data dissemination. Indeed, bad usage is more likely to happen when data is collected and copied multiple times.&lt;/li&gt;
&lt;li&gt;No user consent is needed to collect data, as data is not collected (thanks captain obvious!).&lt;/li&gt;
&lt;li&gt;Right to be forgotten: users’ data is easy to delete because it is stored only on the user’s terminal.&lt;/li&gt;
&lt;li&gt;A registry of queries performed on users’ data is available on the terminal.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Scalability

&lt;ul&gt;
&lt;li&gt;The storage is mainly handled by the end user device. A massive storage solution is not needed.&lt;/li&gt;
&lt;li&gt;The processing is partly offloaded to the end user’s device. Data sent to the analytics engine is pre-aggregated and cleaned.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;All of these benefits make DLDB.io a great solution for analytics. The distributed computing performed by the end user's terminal is an application of edge computing, which is very innovative in the analytics field.&lt;br&gt;
Now, if we look deeper at how analytics works with this solution, we can nevertheless find some limitations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Limitations of DLDB.io analytics solution
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The most obvious limitation: if the device is not available during the lifespan of the query, the analytics engine will not get a result for this device. Yet we can assume that, for a fleet of terminals, we can get results from enough terminals to make the query result statistically valid even if we do not get results from all of them.&lt;/li&gt;
&lt;li&gt;The lifespan of a query depends on the use rate of the application. If the application is used a lot, the query lifespan can be short, whereas if the application is used once a day, the query lifespan will be around a day. This limits the freshness of the data on which we want to perform the query.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finally, I bet that this analytics solution is going with the flow of history: regulations are becoming more constraining, which represents a risk for businesses doing analytics, and the DLDB.io solution mitigates that risk. Also, the volume of data collected by current analytics solutions is increasing, and the edge computing approach might reduce the cost of storing and processing this data. Indeed, we can imagine that distributed processing on end users' terminals emits less CO2 than a single big query. This is good for the planet!&lt;/p&gt;

&lt;p&gt;If you are interested you can request a beta invite and get support from the &lt;a href="//dldb.io"&gt;DLDB.io&lt;/a&gt; team.&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>webdev</category>
      <category>edgecomputing</category>
    </item>
  </channel>
</rss>
