<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: dataroots</title>
    <description>The latest articles on DEV Community by dataroots (@dataroots).</description>
    <link>https://dev.to/dataroots</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F5195%2Fef8433b4-b422-418f-b907-54a67124fb03.png</url>
      <title>DEV Community: dataroots</title>
      <link>https://dev.to/dataroots</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dataroots"/>
    <language>en</language>
    <item>
      <title>Our highlights from Big Data London</title>
      <dc:creator>Bart</dc:creator>
      <pubDate>Thu, 05 Oct 2023 12:48:52 +0000</pubDate>
      <link>https://dev.to/dataroots/our-highlights-from-big-data-london-4ndf</link>
      <guid>https://dev.to/dataroots/our-highlights-from-big-data-london-4ndf</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6KV3zdTS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/10/IMG_3301-2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6KV3zdTS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/10/IMG_3301-2.jpg" alt="" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Last week, the four of us attended both days of Big Data London, Europe's biggest yearly data conference. In this post, we're sharing our highlights from the event. There were a few recurring topics that we've distinguished as the most common themes throughout the conference. And since it's still 2023, you couldn't enter any session or exhibitor booth without hearing about Generative AI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CoEKPARs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/10/IMG_3280.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CoEKPARs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/10/IMG_3280.jpg" alt="" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Generative AI everywhere&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Fabric, the new hype term
&lt;/h2&gt;

&lt;p&gt;The data platform world is filled with hype terms like &lt;em&gt;modern data stack&lt;/em&gt;, &lt;em&gt;data fabric&lt;/em&gt;, &lt;em&gt;open&lt;/em&gt;, and &lt;em&gt;self-serve analytics&lt;/em&gt;. The &lt;em&gt;data fabric&lt;/em&gt; seems to be a recent one that sticks. A data platform can be as simple as a storage solution with an ETL/ELT tool to move or transform data. A data fabric is then a more complete solution that includes other essential features like DataOps &amp;amp; DevOps, data governance solutions, and visualization functionalities.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--j9PX2U6z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/10/IMG_3271.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--j9PX2U6z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/10/IMG_3271.jpg" alt="" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;the room was packed for an introduction to Microsoft Fabric&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a Lakehouse again?
&lt;/h2&gt;

&lt;p&gt;Another trend that definitely seems to last is that of the lakehouse. Databricks, Microsoft, Starburst, Dremio, and ClickHouse all offer their own version of the lakehouse, all competing with each other and their bigger brother, Snowflake. We noticed that it has become a thing to position a data platform offering as an &lt;em&gt;open&lt;/em&gt; platform, to compete with the &lt;em&gt;closed&lt;/em&gt; offering of Snowflake. There are two things we can conclude from this: Snowflake has become the most popular platform, and openness and cross-vendor compatibility have become a major selling point to compete with it.&lt;/p&gt;

&lt;p&gt;So let's say you are in the market for a new data platform and you want to investigate the available options to build a lakehouse. Let's first look at what all vendors have in common; what defines a lakehouse?&lt;/p&gt;

&lt;p&gt;A lakehouse is the evolution of the data lake combined with the data warehouse. Every lakehouse builds upon a data lake where the data is stored. Then, a SQL engine exposes that data for data consumption through dashboarding, machine learning, reverse ETL, and more. The main difference with the previous decade is that data doesn't have to be stored in the SQL engine itself anymore. It can live in the data lake, in any open format.&lt;/p&gt;

&lt;h2&gt;
  
  
  Delta Lake: winner of the data format battle
&lt;/h2&gt;

&lt;p&gt;Next, you'd have to pick a format to store your data in. Here, Apache Parquet used to be the most obvious choice. Then, the transactional layers on top of Parquet came along: Delta Lake, Hudi, and Iceberg. Delta Lake seems to be the clear winner in terms of compatibility here. All data tooling (not just lakehouses) we saw at the conference supported Delta Lake, while only very few tools offered support for Iceberg or Hudi.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--g5AE_cOL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/10/IMG_3290.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--g5AE_cOL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/10/IMG_3290.jpg" alt="" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Of course Delta Lake UniForm was discussed during the Databricks meetup&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Open Lakehouse
&lt;/h2&gt;

&lt;p&gt;What all lakehouse vendors have in common is that they pride themselves on being &lt;em&gt;open&lt;/em&gt;. But what does that actually mean? All of them can work with your data stored in Delta Lake on any of the common cloud providers (AWS, Azure, GCP). While a data vendor used to try to lock you into their ecosystem by staying incompatible with the rest of the market, the opposite has now become true. This is an evolution we really like to see. The cloud has also become a commodity in this context: you can work with whichever cloud provider you prefer without missing out on any features. Even Microsoft's new platform, Fabric, can work with data stored on AWS S3, and soon with Google Cloud Storage as well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Transformations
&lt;/h2&gt;

&lt;p&gt;Once your data is available in your data platform (or data fabric?), you'd usually have to start transforming that data to build a data product. We can distinguish two clear categories of tooling here.&lt;/p&gt;

&lt;h3&gt;
  
  
  The low-code tooling
&lt;/h3&gt;

&lt;p&gt;Big Data London featured talks and exhibitor booths with low-code data transformation tooling like Matillion, Alteryx, Coalesce.io, Rivery, and more. Here, it seems that Alteryx is the older, more commonly known solution, while Coalesce seems to be one of the more popular and complete options for newer platforms. Other vendors like Microsoft stay neutral on this front by providing excellent support for both low-code and code-first solutions.&lt;/p&gt;

&lt;p&gt;As dataroots, we believe that complex operations like data transformations are usually best written as code, rather than as a combination of boxes and arrows locked away in a vendor-specific format.&lt;/p&gt;

&lt;h3&gt;
  
  
  The code-first tooling
&lt;/h3&gt;

&lt;p&gt;While Apache Spark was historically the go-to tool for any kind of data transformation, we saw once again at Big Data London that dbt has taken over this role. Almost every common data tool has some kind of integration with dbt. We didn't attend any talks focussing on dbt itself, but it was featured dozens of times.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data governance
&lt;/h2&gt;

&lt;p&gt;As the volume and complexity of data continue to grow, organisations are increasingly recognising the importance of effective data governance to ensure data accuracy, security, and compliance. At Big Data London, we had the opportunity to explore a multitude of stands covering various aspects of data governance, including data quality management, data privacy, regulatory compliance, and data security. From &lt;a href="https://bigid.com/?ref=dataroots.ghost.io"&gt;BigID&lt;/a&gt; to &lt;a href="https://www.collibra.com/us/en?ref=dataroots.ghost.io"&gt;Collibra&lt;/a&gt;, the big data governance players all had a stand to showcase their products. However, smaller players offering solutions with fewer features were nonetheless very interesting, the most interesting of which were related to data quality and data catalogs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Quality and Data Contracts
&lt;/h3&gt;

&lt;p&gt;Data contracts took part of the spotlight as the newest addition to the existing data mesh ecosystem and a way to extend current data quality implementations with additional features like SLAs and stakeholders. Agilelab's &lt;a href="https://www.agilelab.it/witboost?ref=dataroots.ghost.io"&gt;Witboost&lt;/a&gt; showcased a compelling solution during its presentation, demonstrating the seamless integration of a standardized data contract interface into existing workflows. This innovative approach not only streamlines data management but also enhances accountability and collaboration within an organisation. Moreover, the data quality panel brought first-hand experiences from companies like HelloFresh, which implemented data contracts in their applications.&lt;br&gt;&lt;br&gt;
Focusing more on data quality, two tools really stood out from the crowd: &lt;a href="https://www.soda.io/?ref=dataroots.ghost.io"&gt;Soda&lt;/a&gt; and &lt;a href="https://www.giskard.ai/?ref=dataroots.ghost.io"&gt;Giskard&lt;/a&gt;. Soda focuses more on tabular datasets, while Giskard has impressive testing features for AI and ML models, with support for LLM testing on the horizon. Giskard offers an interesting scan functionality that can detect potential issues in your ML model. The tests resulting from this scan can be combined into a testing suite, which can be extended with your own tests and used to thoroughly test your ML models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EJDYufGO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/10/IMG_3285.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EJDYufGO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/10/IMG_3285.jpg" alt="" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;the Soda booth was one of the busiest booths&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Catalogs
&lt;/h3&gt;

&lt;p&gt;At Big Data London, three compelling data catalog tools took the spotlight: &lt;a href="https://atlan.com/?ref=dataroots.ghost.io"&gt;Atlan&lt;/a&gt;, &lt;a href="https://www.castordoc.com/?ref=dataroots.ghost.io"&gt;CastorDoc&lt;/a&gt;, and &lt;a href="https://www.siffletdata.com/?ref=dataroots.ghost.io"&gt;Sifflet&lt;/a&gt;. Sifflet stands out: it not only provides a comprehensive data catalog but goes beyond it by integrating anomaly detection and team collaboration features like root cause analysis and incident management. Atlan and CastorDoc, on the other hand, offer valuable integrations with other tools, notably Soda, to complement their capabilities and fill any gaps in their toolset.&lt;br&gt;&lt;br&gt;
Dataroots will certainly conduct a more in-depth investigation into Sifflet and CastorDoc.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Strategy
&lt;/h2&gt;

&lt;p&gt;85% of data projects fail. All of the above does not guarantee a successful data project at the end of the day (or of years, even a decade!). Technology is just one part, and initially a smaller one, but it is still important. The right people with the right mindset, along with a rigorous strategy, hold the key to a successful data project in equal measure.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to build a Data Culture?
&lt;/h3&gt;

&lt;p&gt;A successful "Data Culture" in a company looks like this: everyone understands the value of data and can use it to improve business outcomes. One of the major themes at Big Data London was that "Data Culture is for everyone, not just your data teams". Data fluency and training, ensuring holistic collaboration on data strategy across departments, is what we must aim for in our organisations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why do most data projects fail?
&lt;/h3&gt;

&lt;p&gt;Having a data strategy isn't enough; there also needs to be a plan and its execution. A good starting point for that plan is answering these questions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who?&lt;/strong&gt; (How many analysts, data scientists, data engineers, or analytics engineers do you need, and what will be their specific responsibilities?)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What?&lt;/strong&gt; (What is the business value? Just saying you want AI isn't enough; define a clear and attainable path to value creation.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When?&lt;/strong&gt; (When will you generate value? Stay away from unattainable timelines)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where?&lt;/strong&gt; (Where will the data be processed? Data teams need a clear architecture.)&lt;/p&gt;

&lt;p&gt;and finally &lt;strong&gt;How?&lt;/strong&gt; (Data teams need a clear plan that they are executing. This plan needs a singular focus on the work that will go down in different directions)&lt;/p&gt;

&lt;h2&gt;
  
  
  Data is everywhere
&lt;/h2&gt;

&lt;p&gt;To no one's surprise: data is everywhere. Big Data London hosted many very interesting talks about data-related tools and the most recent advances in the world of data, several of which are highlighted in this blog post. Next to that, the event also offered insights and success stories on how organisations use all sorts and volumes of data to achieve data-driven performance. Whether it is putting guest speaker Tim Peake into space, enabling HelloFresh to deliver millions of meal boxes across the globe, or helping McLaren put Lando Norris on pole position for a Formula 1 race weekend, the trend of data really driving operations to achieve value is here to stay, and we at dataroots are very excited to be in the midst of it!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PjmfJNdg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/10/IMG_3303.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PjmfJNdg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/10/IMG_3303.jpg" alt="" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Tim Peake's talk&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  See you next year?
&lt;/h2&gt;

&lt;p&gt;With more than 15,000 attendees, there is no other conference in Europe like Big Data London. It's &lt;em&gt;the&lt;/em&gt; place where data practitioners from all over Europe gather. We're excited to see next year's schedule, and if you're planning on going, look out for the dataroots logo! 😉&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Fabric end-to-end use case: Analytics Engineering part 2 - Reports</title>
      <dc:creator>Bart</dc:creator>
      <pubDate>Thu, 28 Sep 2023 07:00:30 +0000</pubDate>
      <link>https://dev.to/dataroots/fabric-end-to-end-use-case-analytics-engineering-part-2-reports-3810</link>
      <guid>https://dev.to/dataroots/fabric-end-to-end-use-case-analytics-engineering-part-2-reports-3810</guid>
      <description>&lt;p&gt;Welcome to the fifth part of a 5-part series on an end-to-end use case for Microsoft Fabric. This post will focus on the analytics engineering part of the use case.&lt;/p&gt;

&lt;p&gt;In this series, we will explore how to use Microsoft Fabric to ingest, transform, and analyze data using a real-world use case. The series focuses on data engineering and analytics engineering. We will be using OneLake, Notebooks, Lakehouse, SQL Endpoints, Data Pipelines, dbt, and Power BI.&lt;/p&gt;

&lt;h2&gt;
  
  
  All posts in this series
&lt;/h2&gt;

&lt;p&gt;This post is part of a 5-part series on an end-to-end use case for Microsoft Fabric:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dataroots.io/research/contributions/fabric-end-to-end-use-case-overview-architecture?ref=dataroots.ghost.io"&gt;Fabric end-to-end use case: overview &amp;amp; architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dataroots.io/research/contributions/fabric-end-to-end-use-case-data-engineering-part-1-spark-and-pandas-in-notebooks/?ref=dataroots.ghost.io"&gt;Fabric end-to-end use case: Data Engineering part 1 - Spark and Pandas in Notebooks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dataroots.io/research/contributions/fabric-end-to-end-use-case-data-engineering-part-2-pipelines/?ref=dataroots.ghost.io"&gt;Fabric end-to-end use case: Data Engineering part 2 - Pipelines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dataroots.io/research/contributions/fabric-end-to-end-use-case-analytics-engineering-part-1-dbt-with-the-lakehouse/?ref=dataroots.ghost.io"&gt;Fabric end-to-end use case: Analytics Engineering part 1 - dbt with the Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dataroots.io/research/contributions/fabric-end-to-end-use-case-analytics-engineering-part-2-reports/?ref=dataroots.ghost.io"&gt;Fabric end-to-end use case: Analytics Engineering part 2 - Reports&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Use case introduction: the European energy market
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;If you’re following this series, feel free to skip this section as it’s the same introduction every time.&lt;/em&gt; 🙃&lt;/p&gt;

&lt;p&gt;Since Russia invaded Ukraine, energy prices across Europe have increased significantly. This has led to a surge of alternative and green energy sources, such as solar and wind. However, these sources are not always available, and the energy grid needs to be able to handle the load.&lt;/p&gt;

&lt;p&gt;Therefore, most European energy markets are converging towards a model with &lt;strong&gt;dynamic energy prices&lt;/strong&gt;. In a lot of European countries, you can already opt for a dynamic tariff where the price of electricity changes every hour. This brings challenges, but also lots of opportunities. By analyzing the prices, you can optimize your energy consumption and save money. The flexibility and options improve a lot with the installation of a home battery. With some contracts, you could even earn money by selling your energy back to the grid at peak times or when the price is negative.&lt;/p&gt;
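&lt;p&gt;A toy illustration of that saving logic (the hours and prices below are made up): with hourly day-ahead prices known in advance, you could schedule a home battery to charge during the cheapest hours of the day.&lt;/p&gt;

```python
# Made-up day-ahead prices (EUR/MWh) keyed by hour of day.
hourly_price = {7: 31.2, 8: 35.9, 12: 8.4, 13: 6.1, 14: 9.0, 19: 42.5}

charge_hours = 2  # assume the battery needs two hours to charge
# Pick the cheapest hours to charge; a real scheduler would also
# consider battery capacity, consumption forecasts, and grid limits.
cheapest = sorted(hourly_price, key=hourly_price.get)[:charge_hours]
print(sorted(cheapest))  # [12, 13]
```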

&lt;p&gt;In this use case, we will be ingesting Epex Spot (European Energy Exchange) day-ahead energy pricing data. Energy companies buy and sell energy on this exchange. The price of energy is announced one day in advance. The price can even become negative when there will be too much energy on the grid (e.g. it’s sunnier and windier than expected and some energy plants cannot easily scale down capacity).&lt;/p&gt;

&lt;p&gt;Since it’s quite a lot of content with more than 1 hour of total reading time, I’ve split it up into 5 easily digestible parts.&lt;/p&gt;

&lt;p&gt;We want to ingest this data and store it in OneLake. At one point, we could combine this data with weather forecasts to train a machine learning model to predict the energy price.&lt;/p&gt;

&lt;p&gt;After ingestion, we will transform the data and model it for dashboarding. In the dashboard, we will have simple advice on how to optimize your energy consumption and save money by smartly using a home battery.&lt;/p&gt;

&lt;p&gt;All data is publicly available, so you can follow along in your own Fabric Workspace.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reporting
&lt;/h2&gt;

&lt;p&gt;I am far from the best at creating dashboards and reports. I know how to get something going and visualize what I want to see, but there are lots of people more experienced at this than I am. Nevertheless, to showcase how easy it is to create a report with the Power BI experience in Fabric, I will create a multi-page report to visualize our electricity pricing data. This is an excellent way to conclude our end-to-end use case through Fabric.&lt;/p&gt;

&lt;p&gt;I've put this under the umbrella of analytics engineering as typically an analytics engineer will create the visualizations that go with the data marts. Of course, this is not set in stone and you might find different kinds of roles doing this task across different organizations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating a report
&lt;/h2&gt;

&lt;p&gt;Let's start by creating a new report. We can do this by clicking the &lt;em&gt;+ New report&lt;/em&gt; button in the top left corner of the Power BI experience in Fabric. Fabric then asks us to select a dataset to use for the report. Every Lakehouse on Fabric is automatically exposed as a dataset as well, so just look for the name of your Lakehouse from the previous posts and select it. Fabric can even automatically suggest a report based on your data. For now, we will just create a blank report. You can find the button to create a blank report under the chevron 🔽 button in the bottom right corner.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ht2VIA1k--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/report-pick-dataset.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ht2VIA1k--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/report-pick-dataset.png" alt="" width="800" height="510"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Selecting a dataset for the report&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Map visualization
&lt;/h2&gt;

&lt;p&gt;Our report is going to have multiple pages and on the first page, we're going to visualize the pricing data for Europe for the current month.&lt;/p&gt;

&lt;p&gt;Our first visualization is going to be a map with circles indicating the average monthly electricity price for each country. This should give us a helicopter view of the pricing data for Europe. Click the &lt;em&gt;Map&lt;/em&gt; 🌍 icon in the Visualizations pane to create a map. Then, on the Data pane, expand the &lt;code&gt;mrt_avg_price_per_month&lt;/code&gt; data mart and drag the &lt;code&gt;country&lt;/code&gt; column to the Location field that popped up in the Visualizations pane. Drag the &lt;code&gt;month_name_short&lt;/code&gt; to Legend. You can use the chevron 🔽 on the right to select &lt;em&gt;Rename for this visual&lt;/em&gt; and type &lt;em&gt;month&lt;/em&gt; to make sure users just see &lt;em&gt;month&lt;/em&gt; there instead of the actual column name. Drag the &lt;code&gt;avg_month_price&lt;/code&gt; to Bubble size. Click on the chevron 🔽 next to the &lt;code&gt;avg_month_price&lt;/code&gt; you just dragged and make sure the &lt;em&gt;Average&lt;/em&gt; aggregation is selected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IM37B3S---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/report-map-settings.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IM37B3S---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/report-map-settings.png" alt="" width="706" height="1358"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Configuring the map visualization&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Next, switch the tab on top of the Visualizations pane from the &lt;em&gt;Build visual&lt;/em&gt; to the &lt;em&gt;Format visual&lt;/em&gt; tab. New tabs appear under a search bar. In the first tab, &lt;em&gt;Visual&lt;/em&gt;, you can change the style of the map under &lt;em&gt;Map settings&lt;/em&gt;. I picked &lt;em&gt;Light&lt;/em&gt;. Under &lt;em&gt;General&lt;/em&gt; and &lt;em&gt;Title&lt;/em&gt;, you can change the title of the visualization to something like &lt;em&gt;Average price per country per month&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RuAZpLI1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/report-map-format.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RuAZpLI1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/report-map-format.png" alt="" width="346" height="736"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Formatting the map visualization&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This should give you a map that looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3Y3lZXpF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/report-map.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3Y3lZXpF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/report-map.png" alt="" width="800" height="927"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The map visualization&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I'm happy with that result for now, so let's move on to the next visualization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Average daily price per market in a line chart
&lt;/h2&gt;

&lt;p&gt;The map visualization is great at giving us an impression of how electricity pricing in different countries compares to each other, but it doesn't tell us anything about the price fluctuations, how prices evolve over time, or how prices between different markets are related to each other.&lt;/p&gt;

&lt;p&gt;In the next visualization, we're going to use a Line chart to visualize the pricing evolution on a daily basis per market. Click the Line chart icon 📈 to put a line chart on the canvas. Then, open the &lt;code&gt;mrt_avg_price_per_day&lt;/code&gt; data mart under &lt;em&gt;Data&lt;/em&gt; and drag the &lt;code&gt;date&lt;/code&gt; column to the &lt;em&gt;X-axis&lt;/em&gt;, &lt;code&gt;avg_day_price&lt;/code&gt; to &lt;em&gt;Y-axis&lt;/em&gt;, and &lt;code&gt;market&lt;/code&gt; to &lt;em&gt;Legend&lt;/em&gt;. Make sure the selected aggregation for the &lt;code&gt;avg_day_price&lt;/code&gt; is &lt;em&gt;Sum&lt;/em&gt;.&lt;/p&gt;
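&lt;p&gt;Conceptually, a mart like &lt;code&gt;mrt_avg_price_per_day&lt;/code&gt; rolls the hourly day-ahead prices up to one average per market per date. A small sketch in plain Python (the values are made up; the actual mart is built with dbt in the previous post):&lt;/p&gt;

```python
# Hypothetical sketch of what mrt_avg_price_per_day computes:
# hourly prices rolled up to one average per (market, date).
from collections import defaultdict

hourly_prices = [
    # (market, date, price in EUR/MWh) — illustrative values only
    ("BE", "2023-09-01", 90.0),
    ("BE", "2023-09-01", 110.0),
    ("NL", "2023-09-01", 80.0),
    ("NL", "2023-09-01", 100.0),
]

grouped = defaultdict(list)
for market, date, price in hourly_prices:
    grouped[(market, date)].append(price)

avg_day_price = {key: sum(v) / len(v) for key, v in grouped.items()}
print(avg_day_price[("BE", "2023-09-01")])  # 100.0
```

&lt;p&gt;Because the mart already holds one row per market per day, summing it in the visual simply reproduces that daily value.&lt;/p&gt;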

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--k9fI6n4H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/report-line-chart-settings.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--k9fI6n4H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/report-line-chart-settings.png" alt="" width="726" height="1380"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Configuring the line chart visualization&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Then switch the Visualization pane from the &lt;em&gt;Build visual&lt;/em&gt; to the &lt;em&gt;Format visual&lt;/em&gt; tab. Click on the &lt;em&gt;General&lt;/em&gt; tab that just appeared and fill in an appropriate title like &lt;em&gt;Average daily price by market&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CJcHBl7V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/report-line-chart-format.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CJcHBl7V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/report-line-chart-format.png" alt="" width="348" height="742"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Formatting the line chart visualization&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This should give you a line chart that looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Z3wcRJvl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/report-line-chart.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Z3wcRJvl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/report-line-chart.png" alt="" width="800" height="869"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The line chart visualization&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Resize and drag the visualizations to fill the page and you end up with a first page on the report that looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qyMUkCqd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/report-page-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qyMUkCqd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/report-page-1.png" alt="" width="800" height="390"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The first page of the report&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Right away, we notice that most prices follow the same trends, but that the Scandinavian countries have significantly lower prices than the rest of Europe.&lt;/p&gt;

&lt;h2&gt;
  
  
  Details for a specific market with a multi-row card
&lt;/h2&gt;

&lt;p&gt;Let's add a second page to the report that shows us the details for a specific market. We can do this by clicking the ➕ sign at the bottom of the canvas. By right-clicking on the pages, you can also rename them. I renamed the first page to &lt;em&gt;Europe - month&lt;/em&gt; and the second page to &lt;em&gt;Belgium - day&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I would like to get a quick overview of the prices and for that, I chose a Multi-row card. So click on the icon for a Multi-row card to add it to the canvas. Then open the &lt;code&gt;mrt_simple_advice&lt;/code&gt; view under &lt;em&gt;Data&lt;/em&gt; and drag the same &lt;code&gt;price_cent_kwh&lt;/code&gt; column 4 times to the &lt;em&gt;Fields&lt;/em&gt; section on the &lt;em&gt;Visualizations&lt;/em&gt; pane. Then, use the chevrons 🔽 to set the aggregations to &lt;em&gt;Minimum&lt;/em&gt;, &lt;em&gt;Average&lt;/em&gt;, &lt;em&gt;Median&lt;/em&gt;, and &lt;em&gt;Maximum&lt;/em&gt; respectively and rename the fields to &lt;em&gt;Lowest price (cent/kWh)&lt;/em&gt;, &lt;em&gt;Average price (cent/kWh)&lt;/em&gt;, &lt;em&gt;Median price (cent/kWh)&lt;/em&gt;, and &lt;em&gt;Highest price (cent/kWh)&lt;/em&gt;.&lt;/p&gt;
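&lt;p&gt;The four card fields are plain summary statistics over the day's &lt;code&gt;price_cent_kwh&lt;/code&gt; values; sketched here with illustrative numbers:&lt;/p&gt;

```python
# The multi-row card fields map to simple summary statistics over the
# filtered price_cent_kwh values (sample numbers are made up).
import statistics

price_cent_kwh = [8.4, 12.1, 15.0, 22.7, 31.3, 12.1]

card = {
    "Lowest price (cent/kWh)": min(price_cent_kwh),
    "Average price (cent/kWh)": statistics.mean(price_cent_kwh),
    "Median price (cent/kWh)": statistics.median(price_cent_kwh),
    "Highest price (cent/kWh)": max(price_cent_kwh),
}
print(card["Lowest price (cent/kWh)"])   # 8.4
print(card["Highest price (cent/kWh)"])  # 31.3
```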

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lg8JwQUJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/report-multi-row-card-settings.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lg8JwQUJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/report-multi-row-card-settings.png" alt="" width="686" height="1144"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Configuring the multi-row card visualization&lt;/em&gt;&lt;/p&gt;
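&lt;p&gt;To make the four aggregations concrete, here is a small Python sketch of what the multi-row card computes over the filtered rows. The sample prices below are made up for illustration.&lt;/p&gt;

```python
# Hypothetical hourly prices (cent/kWh) for one market on one day.
from statistics import mean, median

prices_cent_kwh = [9.1, 7.4, 6.8, 8.2, 11.5, 14.0, 12.3, 10.7]

# The multi-row card shows exactly these four aggregates.
card = {
    "Lowest price (cent/kWh)": min(prices_cent_kwh),
    "Average price (cent/kWh)": round(mean(prices_cent_kwh), 2),
    "Median price (cent/kWh)": round(median(prices_cent_kwh), 2),
    "Highest price (cent/kWh)": max(prices_cent_kwh),
}
print(card)
```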

&lt;h2&gt;
  
  
  Filters
&lt;/h2&gt;

&lt;p&gt;The problem we have right now is that this visualization shows us the prices for all markets and all days. We only want to see the prices for Belgium (or for your market if applicable). Therefore, expand the &lt;em&gt;Filters&lt;/em&gt; pane to the left of the &lt;em&gt;Visualizations&lt;/em&gt; pane and drag the &lt;code&gt;date&lt;/code&gt; and the &lt;code&gt;market&lt;/code&gt; columns to the &lt;em&gt;Filters on this page&lt;/em&gt; section. For the &lt;code&gt;date&lt;/code&gt;, you can set the &lt;em&gt;Filter type&lt;/em&gt; to &lt;em&gt;Relative date&lt;/em&gt; and set the 2 dropdowns to &lt;em&gt;is in this&lt;/em&gt; and &lt;em&gt;day&lt;/em&gt;. For the &lt;code&gt;market&lt;/code&gt;, you can check the &lt;em&gt;Require single selection&lt;/em&gt; checkbox and select your market.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UYIMvCeH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/report-filters.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UYIMvCeH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/report-filters.png" alt="" width="396" height="1032"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Configuring the filters for the second page&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Displaying advice using a table
&lt;/h2&gt;

&lt;p&gt;For our last visualization, we're using a &lt;em&gt;Table&lt;/em&gt; to display the advice itself. Click the Table icon to add a table to the canvas and drag the columns &lt;code&gt;start_time&lt;/code&gt;, &lt;code&gt;price_cent_kwh&lt;/code&gt;, and &lt;code&gt;simple_advice&lt;/code&gt; to the &lt;em&gt;Columns&lt;/em&gt; section. Use the chevrons 🔽 to rename the columns to &lt;em&gt;start time&lt;/em&gt;, &lt;em&gt;price (cent/kWh)&lt;/em&gt;, and &lt;em&gt;simple consumption advice&lt;/em&gt;. Also, make sure to set the aggregation for the &lt;code&gt;price_cent_kwh&lt;/code&gt; column to &lt;em&gt;Don't summarize&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mPWt5K0Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/report-table-settings.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mPWt5K0Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/report-table-settings.png" alt="" width="724" height="876"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Configuring the table visualization&lt;/em&gt;&lt;/p&gt;
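&lt;p&gt;The &lt;code&gt;simple_advice&lt;/code&gt; column itself is produced by the upstream dbt model, which is not shown in this post. Purely to illustrate the idea, a comparable rule could be sketched in Python as follows; the thresholds and labels here are hypothetical.&lt;/p&gt;

```python
# Hypothetical sketch: compare each hour's price to the day's median.
# The real logic lives in the upstream dbt model and may differ.
from statistics import median

def simple_advice(prices_cent_kwh):
    mid = median(prices_cent_kwh)
    advice = []
    for price in prices_cent_kwh:
        if price >= mid:
            advice.append("avoid consumption if possible")
        elif price > 0:
            advice.append("good moment to consume")
        else:
            # Negative prices: you may even get paid to consume.
            advice.append("consume as much as possible")
    return advice

print(simple_advice([12.0, 8.5, -0.4, 9.9, 15.2]))
```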

&lt;p&gt;We then end up with our second page looking like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7CQCReFB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/report-page-2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7CQCReFB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/report-page-2.png" alt="" width="800" height="470"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The second page of the report&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Finish your work by clicking &lt;em&gt;Save report&lt;/em&gt; from the &lt;em&gt;File&lt;/em&gt; menu and open the report in &lt;em&gt;Reading view&lt;/em&gt; to see the final result.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This concludes our end-to-end use case for Microsoft Fabric. We've seen how to ingest and transform data, model it into data marts using dbt, and visualize it using Power BI. I have a special bonus prepared to showcase an interesting use case for the &lt;em&gt;Data Activator&lt;/em&gt; in Fabric, but since that is still in private preview, I will wait until it is publicly available before publishing that post.&lt;/p&gt;

&lt;p&gt;I know you could do a lot more with this data, the modeling, and the visualizations, but I hope this gives you a good impression of how quickly you can get started with Fabric and how you can use it to build a data platform that works well for both data engineers and data analysts.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>An Azure infrastructure to run Great Expectations on a Notion database</title>
      <dc:creator>Bart</dc:creator>
      <pubDate>Sun, 24 Sep 2023 23:16:04 +0000</pubDate>
      <link>https://dev.to/dataroots/an-azure-infrastructure-to-run-great-expectations-on-a-notion-database-ipf</link>
      <guid>https://dev.to/dataroots/an-azure-infrastructure-to-run-great-expectations-on-a-notion-database-ipf</guid>
      <description>&lt;p&gt;At dataroots, we use &lt;a href="https://www.notion.so/about?ref=dataroots.ghost.io"&gt;Notion&lt;/a&gt; databases to keep track of our project portfolio, ongoing initiatives, and more. Ricardo implemented a &lt;a href="https://dataroots.io/research/contributions/data-quality-for-notion-databases/?ref=dataroots.ghost.io"&gt;NotionAPI&lt;/a&gt; to validate data in a Notion database. This blog elaborates on a way to automate this process in Azure.&lt;/p&gt;

&lt;p&gt;There are many ways to design an automation pipeline that runs data quality checks, but it must include these components: version control (e.g. GitHub, Azure DevOps), a CI/CD platform (e.g. GitHub Actions, Azure Pipelines), a scheduler (e.g. Cron, Airflow, Prefect), a machine or agent to run the program, and a place (either local or in the cloud) to store the validation results.&lt;/p&gt;

&lt;p&gt;In our design, we implemented a streamlined architecture (see below) using GitHub Actions and a few Azure services. Here is how it works.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--h84Hl8wn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/07/architecture.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--h84Hl8wn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/07/architecture.png" alt="" width="800" height="620"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Architecture solution&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Azure Functions
&lt;/h2&gt;

&lt;p&gt;Azure Functions is a great candidate if you only need to run simple functions and do not want to worry about infrastructure management. In our case, we scheduled a Python (3.9) function that does the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Firstly, it reads the data from the Notion database (using the Notion API key and database id) into a pandas dataframe.&lt;/li&gt;
&lt;li&gt;Secondly, it creates a runtime batch request for the pandas dataframe.&lt;/li&gt;
&lt;li&gt;Finally, it takes the corresponding expectations and runs the checkpoint accordingly.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Azure Key Vault
&lt;/h2&gt;

&lt;p&gt;We use Azure Key Vault to store sensitive information: the Notion API key (to fetch Notion data), the Azure connection string for a storage account, and the Slack webhook (to send validation results to Slack). These secrets are then configured as environment variables in the Azure Function.&lt;/p&gt;
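&lt;p&gt;Putting it together, here is a hedged Python sketch of the function body: it reads the secrets from environment variables and then performs the three steps listed above. All helper names are hypothetical, and the Notion API and Great Expectations calls are stubbed out for illustration.&lt;/p&gt;

```python
import os

# Hedged sketch of the scheduled function; helper names are hypothetical.
# The real implementation calls the Notion API and Great Expectations,
# which are stubbed out here.

def read_notion_database(api_key, database_id):
    # Real code: query the Notion API and return the rows as a dataframe.
    return [{"project": "demo", "status": "active"}]

def build_batch_request(rows):
    # Real code: wrap the dataframe in a runtime batch request.
    return {"batch_data": rows, "data_asset_name": "notion_rows"}

def run_checkpoint(batch_request):
    # Real code: run the checkpoint; results go to blob storage and Slack.
    return {"success": True, "evaluated_rows": len(batch_request["batch_data"])}

def main():
    # The secrets live in Key Vault and are exposed as environment variables.
    api_key = os.environ.get("NOTION_API_KEY", "dummy-key")
    database_id = os.environ.get("NOTION_DATABASE_ID", "dummy-id")
    rows = read_notion_database(api_key, database_id)
    return run_checkpoint(build_batch_request(rows))

print(main())
```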

&lt;h2&gt;
  
  
  Azure Storage Account
&lt;/h2&gt;

&lt;p&gt;We created three Azure storage services: one Azure file share and two Azure blob containers. The file share stores the &lt;a href="https://greatexpectations.io/?ref=dataroots.ghost.io"&gt;Great Expectations&lt;/a&gt; project; by mounting it on the Azure Function, the function can recognise the &lt;a href="https://docs.greatexpectations.io/docs/terms/data_context/?ref=dataroots.ghost.io"&gt;data context&lt;/a&gt; and perform the relevant actions. With this configuration, the expectations live in the file share, while the data docs and validation results are stored in the blob containers. On one of them, static website hosting is enabled so that people get a web interface for the validation results (see the screenshot below). The action that sends Slack notifications is also configured here, specifically in the checkpoint folder.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--n8acgFco--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/Screenshot-2023-08-14-at-16.31.12.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--n8acgFco--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/Screenshot-2023-08-14-at-16.31.12.png" alt="" width="800" height="325"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A screenshot of the validation results&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  GitHub Actions
&lt;/h2&gt;

&lt;p&gt;The automation of the build and manage procedure is done through a CI/CD pipeline in GitHub Actions. To keep it simple, only three steps are performed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Log in to Azure&lt;/li&gt;
&lt;li&gt;Check out the source code&lt;/li&gt;
&lt;li&gt;Deploy the Great Expectations folder to the file share&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this way, the data context is separated from the validation function. That suits us well, as we frequently add or update expectations but seldom modify the validation function. In the future, it would be nice to have another workflow that deploys the entire infrastructure to Azure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further improvement
&lt;/h2&gt;

&lt;p&gt;In this article, we described an Azure infrastructure to run Great Expectations on a Notion database. Several things could be improved in the future:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create another GitHub workflow to deploy the whole infrastructure&lt;/li&gt;
&lt;li&gt;Investigate how to better visualise the data docs without keeping all validation results in Azure storage&lt;/li&gt;
&lt;li&gt;Possibly build a customised web interface to check the validation results&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techcommunity.microsoft.com/t5/azure-architecture-blog/validate-data-using-an-azure-function-and-great-expectations/ba-p/3723467?ref=dataroots.ghost.io"&gt;Validate data using an Azure Function and Great Expectations&lt;/a&gt; by Eduard van Valkenburg (techcommunity.microsoft.com)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/datarootsio/notion-dbs-data-quality?ref=dataroots.ghost.io"&gt;GitHub - datarootsio/notion-dbs-data-quality&lt;/a&gt;: using Great Expectations and Notion's API, this repo aims to provide data quality for our databases in Notion.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Fabric end-to-end use case: Analytics Engineering part 1 - dbt with the Lakehouse</title>
      <dc:creator>Bart</dc:creator>
      <pubDate>Thu, 21 Sep 2023 07:00:30 +0000</pubDate>
      <link>https://dev.to/dataroots/fabric-end-to-end-use-case-analytics-engineering-part-1-dbt-with-the-lakehouse-4m2o</link>
      <guid>https://dev.to/dataroots/fabric-end-to-end-use-case-analytics-engineering-part-1-dbt-with-the-lakehouse-4m2o</guid>
      <description>&lt;p&gt;Welcome to the fourth part of a 5-part series on an end-to-end use case for Microsoft Fabric. This post will focus on the analytics engineering part of the use case.&lt;/p&gt;

&lt;p&gt;In this series, we will explore how to use Microsoft Fabric to ingest, transform, and analyze data using a real-world use case. The series focuses on data engineering and analytics engineering. We will be using OneLake, Notebooks, Lakehouse, SQL Endpoints, Data Pipelines, dbt, and Power BI.&lt;/p&gt;

&lt;h2&gt;
  
  
  All posts in this series
&lt;/h2&gt;

&lt;p&gt;This post is part of a 5-part series on an end-to-end use case for Microsoft Fabric:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dataroots.io/research/contributions/fabric-end-to-end-use-case-overview-architecture?ref=dataroots.ghost.io"&gt;Fabric end-to-end use case: overview &amp;amp; architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dataroots.io/research/contributions/fabric-end-to-end-use-case-data-engineering-part-1-spark-and-pandas-in-notebooks/?ref=dataroots.ghost.io"&gt;Fabric end-to-end use case: Data Engineering part 1 - Spark and Pandas in Notebooks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dataroots.io/research/contributions/fabric-end-to-end-use-case-data-engineering-part-2-pipelines/?ref=dataroots.ghost.io"&gt;Fabric end-to-end use case: Data Engineering part 2 - Pipelines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dataroots.io/research/contributions/fabric-end-to-end-use-case-analytics-engineering-part-1-dbt-with-the-lakehouse/?ref=dataroots.ghost.io"&gt;Fabric end-to-end use case: Analytics Engineering part 1 - dbt with the Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dataroots.io/research/contributions/fabric-end-to-end-use-case-analytics-engineering-part-2-reports/?ref=dataroots.ghost.io"&gt;Fabric end-to-end use case: Analytics Engineering part 2 - Reports&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Use case introduction: the European energy market
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;If you’re following this series, feel free to skip this section as it’s the same introduction every time.&lt;/em&gt; 🙃&lt;/p&gt;

&lt;p&gt;Since Russia invaded Ukraine, energy prices across Europe have increased significantly. This has led to a surge of alternative and green energy sources, such as solar and wind. However, these sources are not always available, and the energy grid needs to be able to handle the load.&lt;/p&gt;

&lt;p&gt;Therefore, most European energy markets are converging towards a model with &lt;strong&gt;dynamic energy prices&lt;/strong&gt;. In a lot of European countries, you can already opt for a dynamic tariff where the price of electricity changes every hour. This brings challenges, but also lots of opportunities. By analyzing the prices, you can optimize your energy consumption and save money. The flexibility and options improve a lot with the installation of a home battery. With some contracts, you could even earn money by selling your energy back to the grid at peak times or when the price is negative.&lt;/p&gt;

&lt;p&gt;In this use case, we will be ingesting Epex Spot (European Energy Exchange) day-ahead energy pricing data. Energy companies buy and sell energy on this exchange. The price of energy is announced one day in advance. The price can even become negative when too much energy is expected on the grid (e.g. it’s sunnier and windier than forecast and some energy plants cannot easily scale down capacity).&lt;/p&gt;

&lt;p&gt;Since it’s quite a lot of content with more than 1 hour of total reading time, I’ve split it up into 5 easily digestible parts.&lt;/p&gt;

&lt;p&gt;We want to ingest this data and store it in OneLake. At one point, we could combine this data with weather forecasts to train a machine learning model to predict the energy price.&lt;/p&gt;

&lt;p&gt;After ingestion, we will transform the data and model it for dashboarding. In the dashboard, we will have simple advice on how to optimize your energy consumption and save money by smartly using a home battery.&lt;/p&gt;

&lt;p&gt;All data is publicly available, so you can follow along in your own Fabric Workspace.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why dbt? And how?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.getdbt.com/?ref=dataroots.ghost.io"&gt;dbt&lt;/a&gt; is a popular open-source analytics engineering tool that allows you to transform data in your warehouse. I helped build dbt support for Microsoft Fabric, and I'm excited to show you how it works.&lt;/p&gt;

&lt;p&gt;dbt is a great fit for this use case as we want to transform the raw relational electricity pricing data into data marts ready for visualization through Power BI. It has the advantage that everything we build can be stored in git and includes great data lineage and documentation features.&lt;/p&gt;

&lt;p&gt;Note that in this blog post and use case, we will be using dbt with a Lakehouse on Fabric. The Lakehouse works great with dbt, but only exposes a small subset of dbt's capabilities. If you want to use dbt to the fullest, you can use it with a Fabric Data Warehouse.&lt;/p&gt;

&lt;p&gt;In this post, we'll take our first steps with dbt and look at best practices for using it. dbt itself is a very simple tool; its power comes from how you use it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing dbt-fabric
&lt;/h2&gt;

&lt;p&gt;dbt is a Python package, so make sure you have Python installed and create a new project folder on your machine. I'd suggest creating a virtual environment for your project, but that's not required. Open a terminal and run the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir fabric-epex
cd fabric-epex
python -m venv .venv

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The syntax for activating the virtual environment differs between operating systems and shells.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;OS&lt;/th&gt;
&lt;th&gt;Shell&lt;/th&gt;
&lt;th&gt;Command to activate virtual environment&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Windows&lt;/td&gt;
&lt;td&gt;cmd.exe&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.venv\Scripts\activate.bat&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Windows&lt;/td&gt;
&lt;td&gt;PowerShell&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.venv\Scripts\Activate.ps1&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;macOS/Linux&lt;/td&gt;
&lt;td&gt;bash/zsh&lt;/td&gt;
&lt;td&gt;&lt;code&gt;source .venv/bin/activate&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;macOS/Linux&lt;/td&gt;
&lt;td&gt;PowerShell&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.venv/bin/Activate.ps1&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Once you have activated your virtual environment, you can install dbt with Fabric support.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install dbt-fabric

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is one more requirement we need to fulfill: the ODBC driver. Connections to Fabric go over either TDS or ODBC. For dbt, we need the ODBC driver. You can find the installation instructions &lt;a href="https://learn.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server?view=sql-server-ver16&amp;amp;ref=dataroots.ghost.io"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tooling: VS Code
&lt;/h2&gt;

&lt;p&gt;Since Fabric support is not available on dbt Cloud yet, I'd recommend using Visual Studio Code to work with dbt projects for now. In VS Code, you can configure the IDE to use the Python interpreter from the virtual environment you created above. New terminal windows you open in VS Code will then automatically have the virtual environment activated.&lt;/p&gt;

&lt;p&gt;If you open the Command Palette in VS Code, you can look for &lt;em&gt;Python: Select&lt;/em&gt; to configure the Python interpreter and select the one from your virtual environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qR5f5uWT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/vscode_python_interpreter.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qR5f5uWT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/vscode_python_interpreter.png" alt="" width="450" height="166"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Selecting the Python interpreter in VS Code&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Authentication
&lt;/h2&gt;

&lt;p&gt;There are lots of ways to authenticate to Fabric from dbt, but the easiest one is probably to use the Azure CLI. If you don't have it installed yet, you can find instructions &lt;a href="https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?ref=dataroots.ghost.io"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Once you have the Azure CLI installed, you can log in with the command &lt;code&gt;az login&lt;/code&gt;. This will open a browser window where you can log in with your Azure credentials.&lt;/p&gt;
&lt;h2&gt;
  
  
  Creating a dbt project
&lt;/h2&gt;

&lt;p&gt;Now that you have dbt installed and are authenticated to Fabric, you can create a new dbt project. Run the following command in your terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dbt init fabric_epex

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Project names in dbt cannot have dashes, so we're using an underscore in the name above. dbt will ask you which adapter you want to use, but at this point, the one for Fabric is the only one you have installed, so you can just press enter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuring the dbt profile
&lt;/h2&gt;

&lt;p&gt;Profiles in dbt are used to store connection details and credentials to your data warehouse. The default location for dbt profiles is in your home directory. Since we're using Azure CLI for authentication, we have the benefit that our profile will not contain any credentials by itself. That means we can store it right in our dbt project folder and commit it to git.&lt;/p&gt;

&lt;p&gt;Create a new file called &lt;code&gt;profiles.yml&lt;/code&gt; in the &lt;code&gt;fabric_epex&lt;/code&gt; folder and add the following content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fabric_epex:
  target: dev
  outputs:
    dev:
      type: fabric
      driver: ODBC Driver 18 for SQL Server
      server: connection_string_from_fabric # change this
      port: 1433
      authentication: cli
      database: name_of_your_lakehouse # change this
      schema: dbo

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see from the snippet above, there are two things you need to change: the connection string and the database name. The name of the lakehouse is an easy one, but for the connection string, you need to know where to look. Navigate to your Lakehouse, click on the ⚙️ settings icon and then on &lt;em&gt;SQL endpoint&lt;/em&gt;. Here, you can copy your connection string.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BeGQq6Wz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/sql_endpoint.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BeGQq6Wz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/sql_endpoint.png" alt="" width="800" height="189"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The SQL endpoint in the Fabric UI&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You can validate that your configuration is working by running the command &lt;code&gt;dbt debug&lt;/code&gt; in the terminal where you have your virtual environment activated.&lt;/p&gt;
&lt;h2&gt;
  
  
  Deleting sample data and configuration
&lt;/h2&gt;

&lt;p&gt;New dbt projects come with sample data and configuration. We don't need any of that, so delete the contents of the &lt;code&gt;models&lt;/code&gt; folder and replace the &lt;code&gt;dbt_project.yml&lt;/code&gt; file with the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: 'fabric_epex'
version: '1.0.0'
config-version: 2

profile: 'fabric_epex'

models:
  fabric_epex:
    +materialized: view

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since the Lakehouse SQL endpoint can only create views and not tables (tables can only be created through Spark), we configure the project to materialize all models as views.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating the first model and source
&lt;/h2&gt;

&lt;p&gt;A model in dbt is a SQL query that transforms data. It takes the form of a SQL file containing a SELECT statement. dbt then materializes the output of the query as views, tables, or CTEs. In our case, we want to create a view that transforms the raw data from the Lakehouse into data marts.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a CTE?
&lt;/h3&gt;

&lt;p&gt;A CTE is a Common Table Expression. It's a way to split up a SQL query into multiple logical parts. You can think of it as a temporary table that only exists for the duration of the query. It's a great way to make your SQL code more readable and maintainable. You could probably write more performant queries without CTEs, but that's not the goal of dbt. It's meant to make your code more readable, understandable, and maintainable. If an analytics query takes 5 minutes instead of 4, that's not a big deal, since such queries usually run only once or a couple of times a day.&lt;/p&gt;

&lt;h3&gt;
  
  
  Staging source
&lt;/h3&gt;

&lt;p&gt;Create a folder named &lt;code&gt;staging&lt;/code&gt; in the &lt;code&gt;models&lt;/code&gt; folder. This is where we will put the models that load the raw data from the Lakehouse. We only have one raw table, so we only need one staging model. For our model to be able to reference this table, we have to define the table as a source.&lt;/p&gt;

&lt;p&gt;Create a file called &lt;code&gt;__sources.yml&lt;/code&gt; in the &lt;code&gt;staging&lt;/code&gt; folder you just created. You can name the file however you'd like, but I prefer the double underscore to make sure the file sorts to the top of the folder. Add the following content to the file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: 2

sources:
  - name: epex_spot_prices
    schema: dbo
    tables:
      - name: epex_spot_prices
        description: The EPEX Spot prices for the day-ahead market
        columns:
          - name: market
            description: The market for which this price is valid
          - name: start_time
            description: The timestamp this price becomes valid
          - name: end_time
            description: The timestamp this price is no longer valid
          - name: buy_volume
            description: The volume of buy orders at this price (in MWh)
          - name: sell_volume
            description: The volume of sell orders at this price (in MWh)
          - name: volume
            description: The total trading volume of orders at this price (in MWh)
          - name: price
            description: The energy price (in EUR/MWh)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So as you can see, we tell dbt the name of the schema and the table where it can find our source data. We also define all columns and give descriptions for each column. This is how you document your data in dbt. You'll see in a bit how this documentation can be used and visualized.&lt;/p&gt;

&lt;p&gt;This source by itself doesn't do anything. You can verify this by running &lt;code&gt;dbt run&lt;/code&gt; in your terminal; it will output &lt;code&gt;Found 0 models, 0 tests, 0 snapshots, 0 analyses, 327 macros, 0 operations, 0 seed files, 1 source, 0 exposures, 0 metrics&lt;/code&gt;. That means it found our source, so now we can create a model that references it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Staging model
&lt;/h3&gt;

&lt;p&gt;Create a new file called &lt;code&gt;stg_epex_spot_prices.sql&lt;/code&gt; in the same &lt;code&gt;staging&lt;/code&gt; folder and add the following content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with src as (
    select *
    from {{ source('epex_spot_prices', 'epex_spot_prices') }}
),

conversions as (
    select
        market,
        convert(date, start_time) as date,
        convert(time, start_time) as start_time,
        convert(time, end_time) as end_time,
        price / 10 as price_cent_kwh
    from src
),

with_country as (
    select
        *,
        case
            when market like 'NO%' then 'Norway'
            when market like 'SE%' then 'Sweden'
            when market like 'DK%' then 'Denmark'
            when market like 'DE-LU' then 'Germany'
            when market = 'FI' then 'Finland'
            when market = 'BE' then 'Belgium'
            when market = 'PL' then 'Poland'
            when market = 'AT' then 'Austria'
            when market = 'FR' then 'France'
            when market = 'NL' then 'the Netherlands'
            when market = 'CH' then 'Switzerland'
            when market = 'GB' then 'United Kingdom'
            else 'Unknown'
        end as country
    from conversions
),

final as (
    select
        *,
        case
            when country in ('Belgium', 'the Netherlands', 'Germany', 'France', 'Switzerland', 'Austria') then 'West Europe'
            when country in ('United Kingdom') then 'North Europe'
            when country in ('Poland') then 'Central Europe'
            when country in ('Norway', 'Sweden', 'Finland', 'Denmark') then 'Scandinavia'
            else 'Unknown'
        end as region
    from with_country
    where price_cent_kwh &amp;gt; 0
)

select
    market,
    date,
    start_time,
    end_time,
    price_cent_kwh,
    country,
    region
from final

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are a few &lt;a href="https://docs.getdbt.com/guides/best-practices/how-we-structure/2-staging?ref=dataroots.ghost.io#staging-models"&gt;dbt best practices&lt;/a&gt; you can see being applied here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Split all the transformations into CTEs. This makes it easier to read and understand the code.&lt;/li&gt;
&lt;li&gt;The last CTE should be named &lt;code&gt;final&lt;/code&gt; and the last SELECT statement should select from &lt;code&gt;final&lt;/code&gt;. This makes it easier to find the output of the model and to add more CTEs later on.&lt;/li&gt;
&lt;li&gt;Use the &lt;code&gt;source&lt;/code&gt; macro to reference the source table. This makes it easier to change the source table later on. This also tells dbt how dependencies work in your project and will become visible in the documentation.&lt;/li&gt;
&lt;li&gt;Don't do any major transformations in the staging models themselves. They are meant to cleanse, but not to end up with a completely different table structure.&lt;/li&gt;
&lt;li&gt;Always expose the raw source data in the staging models. This makes it easier to debug and to understand the data lineage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The SQL itself is pretty straightforward, but if you're new to dbt, then this will be the first time you're seeing Jinja in &lt;code&gt;{{ source('epex_spot_prices', 'epex_spot_prices') }}&lt;/code&gt;. This is the source macro I mentioned above. It takes the name of the source and the name of the table and returns the fully qualified name of the table. In this case, it will return &lt;code&gt;name_of_your_lakehouse.dbo.epex_spot_prices&lt;/code&gt;. This way you can decouple the source name from the actual table name.&lt;/p&gt;
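&lt;p&gt;If you'd like to see this substitution in action outside of dbt, here's a minimal sketch in plain Python with the &lt;code&gt;jinja2&lt;/code&gt; package. The &lt;code&gt;source&lt;/code&gt; function below is only a stand-in for dbt's macro (dbt builds the real mapping from your YAML files):&lt;/p&gt;

```python
from jinja2 import Template

# Stand-in for dbt's source() macro: a hand-written mapping from
# (source name, table name) to the fully qualified table name.
TABLES = {
    ("epex_spot_prices", "epex_spot_prices"):
        "name_of_your_lakehouse.dbo.epex_spot_prices",
}

def source(source_name, table_name):
    return TABLES[(source_name, table_name)]

sql = Template(
    "select * from {{ source('epex_spot_prices', 'epex_spot_prices') }}"
).render(source=source)
print(sql)
```

&lt;p&gt;Swapping the mapping is all it takes to point the model at a different table, which is exactly the decoupling dbt gives you.&lt;/p&gt;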

&lt;h2&gt;
  
  
  The first run
&lt;/h2&gt;

&lt;p&gt;Nothing more exciting than the first succeeding dbt run 😊. Run &lt;code&gt;dbt run&lt;/code&gt; in your terminal and you should see the command succeeding with the message &lt;code&gt;Completed successfully&lt;/code&gt; followed by &lt;code&gt;Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can look at your fancy new dbt model in the Lakehouse UI on Fabric as well.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KX-kyrxb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/first-run-UI.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KX-kyrxb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/first-run-UI.png" alt="" width="800" height="389"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The materialized dbt model in the Lakehouse UI&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This also &lt;em&gt;compiled&lt;/em&gt; our code. You can find the output of the compilation in the &lt;code&gt;target&lt;/code&gt; folder, under the subfolder &lt;code&gt;compiled&lt;/code&gt;. There, dbt follows the same structure as your project and puts the compiled SQL files. You can find the compiled SQL for our model in &lt;code&gt;target/compiled/fabric_epex/staging/stg_epex_spot_prices.sql&lt;/code&gt;. In this compiled version, the Jinja was replaced by the actual value. This is what dbt used to build the view in the Lakehouse.&lt;/p&gt;
&lt;h2&gt;
  
  
  Generating documentation
&lt;/h2&gt;

&lt;p&gt;We already added some documentation to our source, but we can do better by also properly documenting our dbt model. As with the source definition, you can name the yaml file however you'd like, but I prefer &lt;code&gt;__schema.yml&lt;/code&gt; to keep it at the top of the folder.&lt;/p&gt;

&lt;p&gt;This is what I've put inside the file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: 2

models:
  - name: stg_epex_spot_prices
    description: The EPEX Spot prices for the day-ahead market
    columns:
      - name: market
        description: The market for which this price is valid
      - name: date
        description: The date this price becomes valid
      - name: start_time
        description: The timestamp this price becomes valid
      - name: end_time
        description: The timestamp this price is no longer valid
      - name: price_cent_kwh
        description: The energy price (in euro cent/kWh)
      - name: country
        description: The country in which this market is located
      - name: region
        description: Where in Europe the market is located

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, run &lt;code&gt;dbt docs generate&lt;/code&gt; in your terminal. This will generate the documentation as a static HTML website in a folder called &lt;code&gt;target&lt;/code&gt;. To open the documentation in your browser automatically, you can run &lt;code&gt;dbt docs serve&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;From there, you can navigate through the project and open the source and the model to see their documentation. It includes descriptions of the table, the view, the columns, the code used, references, and much more. If we click the &lt;em&gt;Lineage graph&lt;/em&gt; button in the bottom right corner, we can see the data lineage of our model. This is a pretty simple one, since we only have 1 source and 1 model, but it will grow as we add more models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5wk2f8Af--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/first-docs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5wk2f8Af--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/first-docs.png" alt="" width="800" height="416"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The dbt documentation and the lineage information&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Side note: modelling
&lt;/h2&gt;

&lt;p&gt;Experienced data analysts will notice that I'm not using the Kimball model in this use case. Instead, I opted for OBT: One Big Table. Every data mart will be a table with all the information in it. This is not a requirement for dbt or for the Lakehouse and not a recommendation either. You can model your data however you'd like and I felt most comfortable with this approach for this use case. But you could easily use the Kimball model as well.&lt;/p&gt;
&lt;h2&gt;
  
  
  Building data marts
&lt;/h2&gt;

&lt;p&gt;Now that we have our source data available in a model, we can start building data marts on top of it. Create a new folder named &lt;code&gt;marts&lt;/code&gt; in the &lt;code&gt;models&lt;/code&gt; folder. We'll create the marts below one by one. During this process, make sure to run &lt;code&gt;dbt run&lt;/code&gt; after each change to validate that your code compiles and runs successfully.&lt;/p&gt;
&lt;h3&gt;
  
  
  Data mart: average price per month per market
&lt;/h3&gt;

&lt;p&gt;Let's start simple and create a mart that calculates the average price per month per market. Create a new file called &lt;code&gt;mrt_avg_price_per_month.sql&lt;/code&gt; in the &lt;code&gt;marts&lt;/code&gt; folder and add the following content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with stg as (
    select *
    from {{ ref('stg_epex_spot_prices') }}
),

with_month as (
    select
        *,
        datetrunc(month, date) as month,
        year(date) as year,
        format(date, 'MMM') as month_name_short
    from stg
),

final as (
    select
        market,
        country,
        region,
        month,
        month_name_short,
        year,
        avg(price_cent_kwh) as avg_month_price
    from with_month
    group by market, country, region, month, month_name_short, year
)

select *
from final

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are a few things we can observe in this SQL query:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We use the &lt;code&gt;ref&lt;/code&gt; macro to reference the staging model. This is the same as the &lt;code&gt;source&lt;/code&gt; macro, but it's used for models instead of sources. Since we can have multiple sources, but only one dbt project (this is changing in dbt 1.6), we only have to specify the name of the model that we're referencing. The model can be located in any folder and can be materialized as anything. We could even configure the referenced model to have a different schema or view name in the Lakehouse and our reference would still work.&lt;/li&gt;
&lt;li&gt;The referenced model is the first CTE in the query. It's a best practice to put all the models you're referencing as 1:1 CTEs at the top of the model. This makes it easier for the reader of your code to understand where the data is coming from.&lt;/li&gt;
&lt;li&gt;Besides the reference, we have 2 CTEs. We have the final one, as in our previous model, and we have one where we add information about the month to the data. In the final CTE, we group all columns by the month and the market and calculate the average price per month per market.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Nm1wvrFg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/lh_mrt_avg_price_per_month.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Nm1wvrFg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/lh_mrt_avg_price_per_month.png" alt="" width="800" height="422"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The average price per month data mart&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Data mart: average price per day per market
&lt;/h3&gt;

&lt;p&gt;Let's create another data mart that calculates the average price per day per market. Create a new file called &lt;code&gt;mrt_avg_price_per_day.sql&lt;/code&gt; in the &lt;code&gt;marts&lt;/code&gt; folder and add the following content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with stg as (
    select *
    from {{ ref('stg_epex_spot_prices') }}
),

final as (
    select
        market,
        country,
        region,
        date,
        avg(price_cent_kwh) as avg_day_price
    from stg
    group by market, country, region, date
)

select *
from final

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This one is much simpler than the previous one. We don't need to add any information about the date, since we're grouping by the date itself. We can just calculate the average price per day per market.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--viOuH00s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/lh_mrt_avg_price_per_day.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--viOuH00s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/lh_mrt_avg_price_per_day.png" alt="" width="800" height="423"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The average price per day data mart&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Data mart: transposed version of the average price per day per market
&lt;/h3&gt;

&lt;p&gt;Now, let's try something more challenging where we can also start to see the value of dbt a bit more. We want to create a data mart that has the average price per day per market, but transposed. So instead of having 1 row per day per market, we want to have 1 row per day with a column for each market.&lt;/p&gt;

&lt;p&gt;Create a new file called &lt;code&gt;mrt_avg_price_per_day_transposed.sql&lt;/code&gt; in the &lt;code&gt;marts&lt;/code&gt; folder and add the following content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{% set markets_query %}
    select
        distinct market
    from {{ ref('stg_epex_spot_prices') }}
    order by market
{% endset %}

{% set markets_query_results = run_query(markets_query) %}

{% if execute %}
    {% set markets = markets_query_results.columns[0].values() %}
{% else %}
    {% set markets = [] %}
{% endif %}

with avgs as (
    select *
    from {{ ref('mrt_avg_price_per_day') }}
),

final as (
    select
        date,
        {% for market in markets %}
        sum(
            case
                when market = '{{ market }}' then avg_day_price
                else 0
            end
        ) as market_{{ market | replace("-","_") | lower }}
        {% if not loop.last %},{% endif %}
        {% endfor %}
    from avgs
    group by date
)

select *
from final

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can build upon the previous data mart. We could have kept our data lineage a bit simpler and followed another dbt best practice by adding an intermediate model in between: we'd move the contents of the model with the average price per day into a separate model in a folder named &lt;code&gt;intermediate&lt;/code&gt; and then reference that model both in the data mart and in this model. Given that this is a small project, I opted to keep it simple and reference the data mart directly.&lt;/p&gt;

&lt;p&gt;We can distinguish the 2 CTEs at the bottom, but the more interesting part is at the top. We want to create one column per market, so in our final CTE we'd have to iterate over all markets.&lt;/p&gt;

&lt;h4&gt;
  
  
  Variables in Jinja
&lt;/h4&gt;

&lt;p&gt;Most of the Jinja statements we've seen so far have double curly braces &lt;code&gt;{{ function_name() }}&lt;/code&gt;, which means that you're outputting something. In Jinja, you'll also often notice a curly brace with a percentage sign &lt;code&gt;{% operation %}&lt;/code&gt;. This means that you're executing something. Here, you can implement logical and conditional statements.&lt;/p&gt;

&lt;p&gt;A common operation is to set a variable, just like you would in Python. To set a variable, begin your statement with the &lt;code&gt;set&lt;/code&gt; keyword. In the first lines of the query, we create a variable named &lt;code&gt;markets_query&lt;/code&gt; and set its content to the SQL query on lines 2 to 5, ending with the &lt;code&gt;endset&lt;/code&gt; keyword. This is called a multi-line set statement. Right below, on line 8, we see a single-line set statement. Here, we set the value of the variable &lt;code&gt;markets_query_results&lt;/code&gt; to the result of the query we just defined. This means that dbt will have to execute the query on lines 2 to 5 and store the result in the variable.&lt;/p&gt;
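&lt;p&gt;Both flavours of &lt;code&gt;set&lt;/code&gt; can be reproduced with plain &lt;code&gt;jinja2&lt;/code&gt;. In this sketch, &lt;code&gt;fake_run_query&lt;/code&gt; stands in for dbt's &lt;code&gt;run_query&lt;/code&gt; and returns a fixed list instead of querying the Lakehouse:&lt;/p&gt;

```python
from jinja2 import Template

template = """\
{%- set markets_query -%}
select distinct market from stg_epex_spot_prices order by market
{%- endset -%}
{%- set markets = run_query(markets_query) -%}
{{ markets | join(', ') }}"""

# Stand-in for dbt's run_query(); dbt would send the SQL to the warehouse.
def fake_run_query(sql):
    return ["AT", "BE", "NL"]

rendered = Template(template).render(run_query=fake_run_query)
print(rendered)
```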

&lt;h4&gt;
  
  
  Compilation and execution
&lt;/h4&gt;

&lt;p&gt;There is an important remark to take into account here. dbt has 2 stages: compilation and execution. In the compilation stage, it takes all the dbt models and compiles the Jinja-SQL into regular SQL. In the execution stage, it runs the compiled SQL against the configured data warehouse; in this case the Lakehouse. You can compile your code with the command &lt;code&gt;dbt compile&lt;/code&gt;. This creates the artifacts in the &lt;code&gt;target&lt;/code&gt; folder mentioned above. dbt only runs queries against the Lakehouse during the execution phase. That is why we have a conditional statement in the code above: we only want to execute the query in the execution phase. In the compilation phase, we don't run the query and simply set the list of markets to an empty list.&lt;/p&gt;
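&lt;p&gt;A tiny sketch with plain &lt;code&gt;jinja2&lt;/code&gt; shows the effect of the &lt;code&gt;execute&lt;/code&gt; flag: during compilation the list stays empty, during execution it gets filled (the market names are made up):&lt;/p&gt;

```python
from jinja2 import Template

template = (
    "{% if execute %}{% set markets = ['BE', 'DE-LU'] %}"
    "{% else %}{% set markets = [] %}{% endif %}"
    "{{ markets | length }}"
)

compile_stage = Template(template).render(execute=False)  # compilation: no queries run
execute_stage = Template(template).render(execute=True)   # execution: results available
print(compile_stage, execute_stage)
```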

&lt;h4&gt;
  
  
  Loops in Jinja
&lt;/h4&gt;

&lt;p&gt;This all comes together in lines 24 to 32. Here we use a &lt;code&gt;for&lt;/code&gt; loop to iterate over all the markets present in our data. We then use a &lt;code&gt;CASE&lt;/code&gt; statement in SQL to create a column for each market. Since the market names can contain dashes, we replace them with underscores and convert the whole string to lowercase to get consistent column names. Let's also have a closer look at line 28. Columns in a &lt;code&gt;SELECT&lt;/code&gt; statement are separated by commas, but we can't have a comma after the last column. So we use Jinja's special &lt;code&gt;loop&lt;/code&gt; variable to check if we're at the last iteration of the loop. If we are, we don't add a comma; otherwise we do.&lt;/p&gt;
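&lt;p&gt;You can test the comma logic in isolation with plain &lt;code&gt;jinja2&lt;/code&gt; and a few made-up market names:&lt;/p&gt;

```python
from jinja2 import Template

template = """
select
    date,
    {%- for market in markets %}
    sum(case when market = '{{ market }}' then avg_day_price else 0 end)
        as market_{{ market | replace('-', '_') | lower }}{% if not loop.last %},{% endif %}
    {%- endfor %}
from avgs
group by date
"""

out = Template(template).render(markets=["BE", "DE-LU", "NL"])
print(out)
```

&lt;p&gt;The rendered SQL has a comma after &lt;code&gt;market_be&lt;/code&gt; and &lt;code&gt;market_de_lu&lt;/code&gt;, but not after the final &lt;code&gt;market_nl&lt;/code&gt; column.&lt;/p&gt;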

&lt;h4&gt;
  
  
  Putting it all together
&lt;/h4&gt;

&lt;p&gt;We then group by the &lt;code&gt;date&lt;/code&gt; column to have a single row per date and summarize the average price per market in the columns we created. This is the result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--deCW99VK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/lh_mrt_avg_price_per_day_transposed.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--deCW99VK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/lh_mrt_avg_price_per_day_transposed.png" alt="" width="800" height="428"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The transposed data mart&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Without dbt's powerful Jinja syntax, we'd have to write a lot more SQL, with a few lines of code per market, to achieve the same result.&lt;/p&gt;
&lt;h3&gt;
  
  
  Data marts: lowest and highest price per day per market
&lt;/h3&gt;

&lt;p&gt;Next, I'd like to have 2 more data marts that show me the moment of the lowest price per day and the moment of highest price per day for each market. Let's start with the lowest price. Create a new file called &lt;code&gt;mrt_lowest_price_per_day.sql&lt;/code&gt; in the &lt;code&gt;marts&lt;/code&gt; folder and add the following content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with stg as (
    select
        market,
        date,
        start_time,
        end_time,
        price_cent_kwh
    from {{ ref('stg_epex_spot_prices') }}
),

with_rank as (
    select
        *,
        row_number() over (partition by date, market order by price_cent_kwh asc) as rn
    from stg
),

final as (
    select
        date,
        market,
        start_time,
        end_time,
        price_cent_kwh
    from with_rank
    where rn = 1
)

select *
from final

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is again a pretty straightforward SQL query, following the best practices outlined in the previous dbt models. We're using a window function to rank the prices per day per market from lowest to highest and then take the rows with the lowest rank, i.e. the lowest price.&lt;/p&gt;
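&lt;p&gt;Conceptually, the window function does something like this Python snippet (with a few made-up rows): partition the rows by date and market, then keep the cheapest row of each partition.&lt;/p&gt;

```python
from collections import defaultdict

# Illustrative rows; in the real model these come from stg_epex_spot_prices.
rows = [
    {"date": "2023-08-01", "market": "BE", "start_time": "13:00", "price_cent_kwh": 1.4},
    {"date": "2023-08-01", "market": "BE", "start_time": "11:00", "price_cent_kwh": 2.1},
    {"date": "2023-08-01", "market": "NL", "start_time": "02:00", "price_cent_kwh": 3.0},
]

# Partition by (date, market), like "partition by date, market" in SQL.
partitions = defaultdict(list)
for row in rows:
    partitions[(row["date"], row["market"])].append(row)

# Keep the cheapest row per partition: the same effect as keeping rn = 1.
lowest_per_day = [min(part, key=lambda r: r["price_cent_kwh"]) for part in partitions.values()]
print(lowest_per_day)
```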

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VLGDZ1dW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/lh_mrt_lowest_price_per_day.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VLGDZ1dW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/lh_mrt_lowest_price_per_day.png" alt="" width="800" height="423"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The lowest price per day data mart&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now, for the highest price, the only difference is that we order by the price descending instead of ascending. It seems a bit silly to just copy the entire file and change 3 characters. Luckily, dbt has a solution that allows us to make our code more flexible.&lt;/p&gt;
&lt;h4&gt;
  
  
  Creating macros
&lt;/h4&gt;

&lt;p&gt;Macros are reusable bits of SQL code that can be parametrized. You can think of them as functions in Python. You can use any SQL or Jinja in a macro. Let's see how this works with our example.&lt;/p&gt;
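&lt;p&gt;Here's a stripped-down sketch of the same idea in plain &lt;code&gt;jinja2&lt;/code&gt;: a macro that turns an argument into an &lt;code&gt;order by&lt;/code&gt; direction (the macro name here is made up for the example):&lt;/p&gt;

```python
from jinja2 import Template

template = """\
{%- macro order_direction(which_moment) -%}
{%- if which_moment == "highest" -%}desc{%- else -%}asc{%- endif -%}
{%- endmacro -%}
order by price_cent_kwh {{ order_direction(moment) }}"""

highest = Template(template).render(moment="highest")
lowest = Template(template).render(moment="lowest")
print(highest)
print(lowest)
```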

&lt;p&gt;Start by creating a file named &lt;code&gt;find_date_moment.sql&lt;/code&gt; in the &lt;code&gt;macros&lt;/code&gt; folder in your project. Add the following content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{% macro find_date_moment(which_moment) %}
    {% set order = "asc" %}
    {% if which_moment == "highest" %}
        {% set order = "desc" %}
    {% endif %}

    with stg as (
        select
            market,
            date,
            start_time,
            end_time,
            price_cent_kwh
        from {{ ref('stg_epex_spot_prices') }}
    ),

    with_rank as (
        select
            *,
            row_number() over (partition by date, market order by price_cent_kwh {{ order }}) as rn
        from stg
    ),

    calc_result as (
        select
            date,
            market,
            start_time,
            end_time,
            price_cent_kwh
        from with_rank
        where rn = 1
    )
{% endmacro %}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A macro is created with the &lt;code&gt;macro&lt;/code&gt; and &lt;code&gt;endmacro&lt;/code&gt; keywords within &lt;code&gt;{%&lt;/code&gt; and &lt;code&gt;%}&lt;/code&gt;. Our macro takes 1 argument named &lt;code&gt;which_moment&lt;/code&gt; to indicate if we want to find the moments with the &lt;code&gt;lowest&lt;/code&gt; or the &lt;code&gt;highest&lt;/code&gt; price. Then we change the order accordingly on lines 2 to 5 by setting a variable named &lt;code&gt;order&lt;/code&gt; to the corresponding value. We have to parametrize the ordering on line 20, so there we can use our &lt;code&gt;order&lt;/code&gt; variable.&lt;/p&gt;

&lt;h4&gt;
  
  
  Using macros
&lt;/h4&gt;

&lt;p&gt;Using macros works in exactly the same way as using the built-in &lt;code&gt;ref&lt;/code&gt; and &lt;code&gt;source&lt;/code&gt; macros. We can just call our macro with double curly braces like so: &lt;code&gt;{{ find_date_moment("highest") }}&lt;/code&gt;. Let's change the content of our &lt;code&gt;mrt_lowest_price_per_day.sql&lt;/code&gt; file to the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{{ find_date_moment("lowest") }}

select * from calc_result

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then we can create our second data mart named &lt;code&gt;mrt_highest_price_per_day.sql&lt;/code&gt; with the following content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{{ find_date_moment("highest") }}

select * from calc_result

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll notice the first data mart still produces exactly the same output and our second data mart works flawlessly as well.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--88z68p_Z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/lh_mrt_highest_price_per_day.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--88z68p_Z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/lh_mrt_highest_price_per_day.png" alt="" width="800" height="427"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The highest price per day data mart&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Data mart: simple advice
&lt;/h3&gt;

&lt;p&gt;Our final data mart is a simple one. The goal of this data mart is to provide some very basic advice to electricity consumers with a home battery. If you intelligently use your home battery by charging it when the price is low and discharging it when the price is high, you can save money. In some countries you could even make money by selling your electricity back to the grid when the price is high if you charge your battery when the price is low.&lt;/p&gt;

&lt;p&gt;This is under no circumstances financial advice and also not super reliable. This is just meant as an example to showcase what you could do with this data.&lt;/p&gt;

&lt;p&gt;Create a new file called &lt;code&gt;mrt_simple_advice.sql&lt;/code&gt; in the &lt;code&gt;marts&lt;/code&gt; folder and add the following content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{{ find_date_moment("lowest") }}

, final as (
    select
        market,
        date,
        substring(convert(nvarchar, start_time, 14), 1, 5) as start_time,
        substring(convert(nvarchar, end_time, 14), 1, 5) as end_time,
        price_cent_kwh,
        case
            when price_cent_kwh &amp;lt; 0 then 'discharge'
            when rn &amp;lt; 10 then 'charge+grid'
            when rn &amp;lt; 18 then 'grid'
            when rn &amp;lt; 24 then 'battery'
            else 'battery+discharge'
        end as simple_advice
    from with_rank
)

select *
from final

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, we can reuse our macro that ranks the prices per day per market. The macro outputs a CTE with all the ranks, so we can use that CTE in our final CTE. Depending on the rank, we then output simple advice.&lt;/p&gt;

&lt;p&gt;If the price is negative, we advise discharging the battery to make some profit. If the price is among the 9 lowest prices for that day, we advise charging the battery and consuming electricity from the grid. For the next 8 pricing moments (ascending), we advise consuming electricity from the grid, without drawing from the battery. For all other pricing moments except the highest price of the day, we recommend consuming electricity from the battery only. Finally, at the most expensive moment of the day, we even recommend discharging the battery to make some profit.&lt;/p&gt;
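&lt;p&gt;In Python terms, the &lt;code&gt;CASE&lt;/code&gt; expression boils down to something like this (as illustrative as the advice itself; &lt;code&gt;rank&lt;/code&gt; is the hourly rank from 1, cheapest, to 24, most expensive):&lt;/p&gt;

```python
def simple_advice(rank, price_negative):
    # Mirror of the CASE expression in mrt_simple_advice.
    if price_negative:
        return "discharge"          # negative price
    if rank in range(1, 10):        # the 9 cheapest hours
        return "charge+grid"
    if rank in range(10, 18):       # the next 8 hours
        return "grid"
    if rank in range(18, 24):       # expensive hours, but not the peak
        return "battery"
    return "battery+discharge"      # the most expensive hour of the day

print(simple_advice(1, False))
```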

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BMMU5RXh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/lh_mrt_simple_advice.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BMMU5RXh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/lh_mrt_simple_advice.png" alt="" width="800" height="420"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The simple electricity consumption advice data mart&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Lineage
&lt;/h2&gt;

&lt;p&gt;Now that our project is finished, we can also take a look at the end result of the lineage. If you run &lt;code&gt;dbt docs generate&lt;/code&gt; again, followed by &lt;code&gt;dbt docs serve&lt;/code&gt;, you can open the entire Lineage graph by clicking the Lineage button on the bottom right corner.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mBmr8YPr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/end-docs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mBmr8YPr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/end-docs.png" alt="" width="800" height="414"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The finished lineage graph in the dbt documentation&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Finished source code
&lt;/h2&gt;

&lt;p&gt;This concludes our dbt project to build data marts on top of the raw data from the Lakehouse. You can find the finished dbt project &lt;a href="https://github.com/sdebruyn/personal-site/tree/master/content/blog/fabric-end-to-end-use-case-ae-1/fabric-epex?ref=dataroots.ghost.io"&gt;here, on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://dataroots.io/research/contributions/fabric-end-to-end-use-case-analytics-engineering-part-2-reports/?ref=dataroots.ghost.io"&gt;👉 Next part&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;In the next part, we'll use our data marts in a Report to visualize their output.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The link will start working as soon as the post is published, about a week after this one.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Fabric end-to-end use case: Data Engineering part 2 - Pipelines</title>
      <dc:creator>Bart</dc:creator>
      <pubDate>Thu, 14 Sep 2023 07:00:29 +0000</pubDate>
      <link>https://dev.to/dataroots/fabric-end-to-end-use-case-data-engineering-part-2-pipelines-3lcf</link>
      <guid>https://dev.to/dataroots/fabric-end-to-end-use-case-data-engineering-part-2-pipelines-3lcf</guid>
      <description>&lt;p&gt;Welcome to the third part of a 5-part series on an end-to-end use case for Microsoft Fabric. This post will focus on the data engineering part of the use case.&lt;/p&gt;

&lt;p&gt;In this series, we will explore how to use Microsoft Fabric to ingest, transform, and analyze data using a real-world use case. The series focuses on data engineering and analytics engineering. We will be using OneLake, Notebooks, Lakehouse, SQL Endpoints, Data Pipelines, dbt, and Power BI.&lt;/p&gt;

&lt;h2&gt;
  
  
  All posts in this series
&lt;/h2&gt;

&lt;p&gt;This post is part of a 5-part series on an end-to-end use case for Microsoft Fabric:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dataroots.io/research/contributions/fabric-end-to-end-use-case-overview-architecture?ref=dataroots.ghost.io" rel="noopener noreferrer"&gt;Fabric end-to-end use case: overview &amp;amp; architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dataroots.io/research/contributions/fabric-end-to-end-use-case-data-engineering-part-1-spark-and-pandas-in-notebooks/?ref=dataroots.ghost.io" rel="noopener noreferrer"&gt;Fabric end-to-end use case: Data Engineering part 1 - Spark and Pandas in Notebooks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dataroots.io/research/contributions/fabric-end-to-end-use-case-data-engineering-part-2-pipelines/?ref=dataroots.ghost.io" rel="noopener noreferrer"&gt;Fabric end-to-end use case: Data Engineering part 2 - Pipelines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dataroots.io/research/contributions/fabric-end-to-end-use-case-analytics-engineering-part-1-dbt-with-the-lakehouse/?ref=dataroots.ghost.io" rel="noopener noreferrer"&gt;Fabric end-to-end use case: Analytics Engineering part 1 - dbt with the Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dataroots.io/research/contributions/fabric-end-to-end-use-case-analytics-engineering-part-2-reports/?ref=dataroots.ghost.io" rel="noopener noreferrer"&gt;Fabric end-to-end use case: Analytics Engineering part 2 - Reports&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Use case introduction: the European energy market
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;If you’re following this series, feel free to skip this section as it’s the same introduction every time.&lt;/em&gt; 🙃&lt;/p&gt;

&lt;p&gt;Since Russia invaded Ukraine, energy prices across Europe have increased significantly. This has led to a surge of alternative and green energy sources, such as solar and wind. However, these sources are not always available, and the energy grid needs to be able to handle the load.&lt;/p&gt;

&lt;p&gt;Therefore, most European energy markets are converging towards a model with &lt;strong&gt;dynamic energy prices&lt;/strong&gt;. In a lot of European countries, you can already opt for a dynamic tariff where the price of electricity changes every hour. This brings challenges, but also lots of opportunities. By analyzing the prices, you can optimize your energy consumption and save money. The flexibility and options improve a lot with the installation of a home battery. With some contracts, you could even earn money by selling your energy back to the grid at peak times or when the price is negative.&lt;/p&gt;

&lt;p&gt;In this use case, we will be ingesting Epex Spot (European Energy Exchange) day-ahead energy pricing data. Energy companies buy and sell energy on this exchange. The price of energy is announced one day in advance. The price can even become negative when there will be too much energy on the grid (e.g. it’s sunnier and windier than expected and some energy plants cannot easily scale down capacity).&lt;/p&gt;

&lt;p&gt;Since it’s quite a lot of content with more than 1 hour of total reading time, I’ve split it up into 5 easily digestible parts.&lt;/p&gt;

&lt;p&gt;We want to ingest this data and store it in OneLake. At one point, we could combine this data with weather forecasts to train a machine learning model to predict the energy price.&lt;/p&gt;

&lt;p&gt;After ingestion, we will transform the data and model it for dashboarding. In the dashboard, we will have simple advice on how to optimize your energy consumption and save money by smartly using a home battery.&lt;/p&gt;

&lt;p&gt;All data is publicly available, so you can follow along in your own Fabric Workspace.&lt;/p&gt;

&lt;h2&gt;
  
  
  The goal
&lt;/h2&gt;

&lt;p&gt;In our previous post, we created a Notebook to load the Epex Spot data for the current day for a single market. In this post, we're going to automate this and load the data for all markets. We will be using a Data Pipeline in Fabric to do this.&lt;/p&gt;

&lt;p&gt;Data Pipelines are a feature inherited from Azure Data Factory and Azure Synapse Pipelines. They orchestrate data movement and transformation activities. You build them in a visual designer, but they can also be exported as JSON and edited in a text editor. Later on, it will be possible to store these JSON definitions in git and implement version control.&lt;/p&gt;
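&lt;p&gt;To make the JSON side concrete, an exported Pipeline definition has roughly the shape below. This is a trimmed, illustrative fragment following the Azure Data Factory pipeline schema; the exact activity type names in Fabric's export may differ.&lt;/p&gt;

```json
{
  "name": "IngestEpexSpot",
  "properties": {
    "activities": [
      {
        "name": "ForEachMarket",
        "type": "ForEach",
        "typeProperties": {
          "items": { "value": "@createArray('AT','BE')", "type": "Expression" },
          "batchCount": 2,
          "activities": [
            { "name": "RunIngestionNotebook", "type": "TridentNotebook" }
          ]
        }
      }
    ]
  }
}
```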

&lt;h2&gt;
  
  
  Pipeline
&lt;/h2&gt;

&lt;p&gt;To create a Pipeline, go to the Data Engineering experience in Fabric and create a new Pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fnew_pipeline.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fnew_pipeline.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Create a new Pipeline&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A Pipeline consists of a series of Activities and forms what is commonly called a DAG in the data engineering world. A DAG (Directed Acyclic Graph) is a set of tasks that are connected to each other and can be executed in parallel or sequentially. It cannot contain cycles, so you cannot link an Activity back to an Activity that has already been executed. One Activity's result can be linked to zero, one, or more other Activities. You can also make an Activity execute only when a previous Activity has failed, or only when it has succeeded. Some Activities can be nested inside other Activities if the outer Activity is a control flow Activity like a loop or an if condition.&lt;/p&gt;
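&lt;p&gt;The "no cycles" rule can be made concrete with a few lines of code. The sketch below is purely conceptual (it is not how Fabric validates Pipelines) and uses illustrative Activity names; it walks the dependency links and reports whether they form a valid DAG.&lt;/p&gt;

```python
def is_acyclic(deps):
    """deps maps each Activity name to the Activities linked after it."""
    done, on_path = set(), set()

    def visit(node):
        if node in on_path:   # reached again on the current path: a cycle
            return False
        if node in done:      # already fully explored, no cycle via this node
            return True
        on_path.add(node)
        ok = all(visit(nxt) for nxt in deps.get(node, []))
        on_path.discard(node)
        done.add(node)
        return ok

    return all(visit(node) for node in deps)

# Valid: the ForEach runs first, a notification Activity runs after it.
print(is_acyclic({"ForEachMarket": ["Notify"], "Notify": []}))                  # True
# Invalid: linking an Activity back to one that has already executed.
print(is_acyclic({"ForEachMarket": ["Notify"], "Notify": ["ForEachMarket"]}))   # False
```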
&lt;h2&gt;
  
  
  ForEach Activity &amp;amp; expressions
&lt;/h2&gt;

&lt;p&gt;Since we want to repeat the same ingestion process for every market, we're starting with a ForEach Activity. Click on the &lt;em&gt;Activities&lt;/em&gt; tab and then on the icon for the ForEach Activity as indicated below. It will appear on the canvas, where you can click on it and give it a name.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fforeach.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fforeach.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Add a ForEach Activity&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now we have to tell Fabric what we want to loop over. Click on the ForEach Activity and then on the Settings tab. Here you can specify the collection you want to loop over; in our case, the markets. We can do this by using a dynamic expression. Expressions let you make your Activities parameterized and dynamic. Since this feature comes from Azure Data Factory, you can consult the &lt;a href="https://learn.microsoft.com/en-us/azure/data-factory/how-to-expression-language-functions?ref=dataroots.ghost.io" rel="noopener noreferrer"&gt;Azure Data Factory documentation on expressions&lt;/a&gt; for more information on the syntax. To loop over the markets, we can use the following expression to build an array of the market codes as strings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@createArray('AT','BE','CH','DE-LU','DK1','DK2','FI','FR','GB','NL','NO1','NO2','NO3','NO4','NO5','PL','SE1','SE2','SE3','SE4')

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To enter the expression, open the &lt;em&gt;Pipeline expression builder&lt;/em&gt; by clicking on the &lt;em&gt;Add dynamic content&lt;/em&gt; link that appears when you select the empty text box next to &lt;em&gt;Items&lt;/em&gt;. The expression builder will open and you can enter the expression.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fexpression.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fexpression.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Click on the Items text box to show the link to open the expression builder&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Click &lt;em&gt;Ok&lt;/em&gt; after entering the expression to save it. I would also recommend setting the &lt;em&gt;Batch count&lt;/em&gt; to 1 or 2 to avoid overloading your capacity, as this task is not time-critical anyway.&lt;/p&gt;
&lt;h2&gt;
  
  
  Notebook Activity
&lt;/h2&gt;

&lt;p&gt;Now that we have a loop that goes over the markets, we have to tell the Pipeline what it should do on every iteration of the loop. Inside the ForEach Activity block, you can spot a pencil ✏️ icon. Clicking it switches the canvas to the Activities inside the ForEach Activity. Once that view is open, you can always click on the &lt;em&gt;Main canvas&lt;/em&gt; link at the top left corner to return to the outer Activities.&lt;/p&gt;

&lt;p&gt;Inside this canvas, we can add an Activity to run a Notebook by clicking on the Notebook Activity at the top. At the time of writing, this is the third icon. Then click on the Activity to change its name.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fnotebook-2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fnotebook-2.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Add a Notebook Activity&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Next, we have to configure the Notebook Activity to run our Notebook with the market as its parameter. This is where it all comes together.&lt;/p&gt;

&lt;p&gt;Click on the &lt;em&gt;Settings&lt;/em&gt; tab when you have the Activity open and select the Notebook from the dropdown.&lt;/p&gt;

&lt;p&gt;Then expand the &lt;em&gt;Base parameters&lt;/em&gt; section and use the &lt;em&gt;+ New&lt;/em&gt; button to add a new parameter to the list. The name of the parameter should match the name of the variable in the Notebook. In our case, this is &lt;code&gt;market_area&lt;/code&gt;. Make sure the &lt;em&gt;Type&lt;/em&gt; is set to &lt;em&gt;String&lt;/em&gt;. Now we have to link its value to the loop variable from the ForEach Activity since we are looping over every market one by one. To do this, click on the &lt;em&gt;Value&lt;/em&gt; text box and then on the &lt;em&gt;Add dynamic content&lt;/em&gt; link that appears. This will open the expression builder again. This time, we can use the following expression:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@item()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
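&lt;p&gt;On the Notebook side, nothing special is needed to receive this value: the parameter cell declares a default for interactive runs, and at run time Fabric adds a cell that overrides it with the value passed by the Pipeline. A minimal sketch (the default value and print statement are illustrative):&lt;/p&gt;

```python
# Parameter cell of the ingestion Notebook: this default is used when
# running interactively. When triggered from the Pipeline, Fabric appends
# a cell that overrides market_area with the ForEach item (e.g. "DE-LU").
market_area = "BE"

# Downstream cells simply read the variable:
print(f"Ingesting Epex Spot day-ahead prices for market {market_area}")
```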



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fexpression_notebook.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fexpression_notebook.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Link the iterator variable to the parameter in the Notebook&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Finish the configuration by clicking &lt;em&gt;Ok&lt;/em&gt; and return to the main canvas by clicking on the &lt;em&gt;Main canvas&lt;/em&gt; link at the top left corner.&lt;/p&gt;

&lt;h2&gt;
  
  
  Validation &amp;amp; first run
&lt;/h2&gt;

&lt;p&gt;On the &lt;em&gt;Home&lt;/em&gt; tab of the top menu, you can find a 💾 icon to save your work. There is also a ✅ &lt;em&gt;Validate&lt;/em&gt; button to verify if you didn't make any mistakes. If you did, you will get a message with the error. Otherwise, it will say &lt;em&gt;No errors were found&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;We've reached the point where it's time to take our Pipeline for a spin. Click on the &lt;em&gt;Run&lt;/em&gt; ▶️ icon to trigger the first run. A new tab named &lt;em&gt;Output&lt;/em&gt; appears at the bottom of the screen.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Frunning.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Frunning.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;First run&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You can now follow step by step what the Pipeline is doing. Every Activity also has a few buttons to dive deeper into the monitoring &amp;amp; logging. The icon with the arrow pointing into the square shows the input of the Activity, while the one with the arrow pointing out of the square shows its output.&lt;/p&gt;

&lt;p&gt;Specifically for Notebook and a few other Activities, you also see a 👓 icon to open more details about the execution of the Notebook. In the new window that opens, you can find lots of detailed information: all the Spark jobs that were executed as part of this Notebook, each with a link to open the Spark UI and dive into the details of that job. You can also find the logs of the Spark cluster that executed the Notebook, and on the &lt;em&gt;Related items&lt;/em&gt; page, you can see the Notebook itself with the cell that has been added to set the &lt;code&gt;market_area&lt;/code&gt; parameter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fmonitoring.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fmonitoring.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Execution details of a single Notebook run&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You can close the monitoring window and you should see that a line has appeared for every market and that the status of every line is &lt;em&gt;Succeeded&lt;/em&gt;. The status of the ForEach Activity and the Pipeline itself should also be &lt;em&gt;Succeeded&lt;/em&gt; once all Notebooks have finished executing. If not, you can open the 💬 error message and the monitoring pages to see what went wrong.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fsucceeded.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fsucceeded.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Successful run&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Scheduling
&lt;/h2&gt;

&lt;p&gt;Now that we know that our Pipeline is working and executing our Notebook for every Market, all that's left to ingest the data on a daily basis is scheduling the Pipeline. To do this, click on the 🗓️ &lt;em&gt;Schedule&lt;/em&gt; icon.&lt;/p&gt;

&lt;p&gt;A pop-up opens where we can schedule the Pipeline at whatever frequency we'd like. I've set mine to run every 3 hours, as it is not known when pricing data becomes available, and our Notebook will skip unnecessary work if there is no data or the data has already been ingested.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fschedule.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fschedule.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Schedule the Pipeline&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://dataroots.io/research/contributions/fabric-end-to-end-use-case-analytics-engineering-part-1-dbt-with-the-lakehouse/?ref=dataroots.ghost.io" rel="noopener noreferrer"&gt;👉 Next part&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;In the next part, we will use the ingested data to generate some insights using dbt.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The link will start working as soon as the post is published, about a week after this one.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Building Data Mesh Architectures with Dataflow on Google Cloud</title>
      <dc:creator>Bart</dc:creator>
      <pubDate>Sun, 10 Sep 2023 22:44:51 +0000</pubDate>
      <link>https://dev.to/dataroots/building-data-mesh-architectures-with-dataflow-on-google-cloud-2e2o</link>
      <guid>https://dev.to/dataroots/building-data-mesh-architectures-with-dataflow-on-google-cloud-2e2o</guid>
      <description>&lt;p&gt;Hi again, data enthusiasts! Welcome to an exciting blog where we will explore the world of data mesh architectures and how they help solve data-related challenges for businesses. In this blog, we will delve into the concept of data mesh, the problems it addresses, and the role of Dataflow on Google Cloud Platform (GCP) in implementing data mesh architectures. We will also demonstrate how Dataflow seamlessly integrates with data mesh to empower organisations with efficient data management and analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge: Exploding Data Growth and Extracting Value
&lt;/h2&gt;

&lt;p&gt;Data is growing exponentially, with production increasing approximately 10x over the past eight years. However, many companies are struggling to extract meaningful and measurable value from their data investments. Around 70% of businesses worldwide are facing difficulties in realizing the true potential of their data. Closing this gap by building the so-called "data value bridge" has become a top priority for organisations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--A7l79s3u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/Screenshot-2023-08-01-at-10.25.22.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--A7l79s3u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/Screenshot-2023-08-01-at-10.25.22.png" alt="" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Data Mesh Solution: Decentralised Data Ownership with Centralised Provisioning
&lt;/h2&gt;

&lt;p&gt;To bridge this data value gap, businesses have several options, including centralised data lakes or data warehouses, adopting a data mesh architecture, or combining both approaches. In this blog, we will focus on data mesh architectures, which uniquely combine decentralised data ownership with centralised provisioning of infrastructure, governance policies, and data management best practices. This combination leads to faster time-to-value and increased agility in responding to external changes.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Data Mesh is a relatively new architectural approach to managing data in a decentralised and domain-oriented manner. It was introduced by Zhamak Dehghani, a ThoughtWorks principal consultant, in 2019. The concept of Data Mesh is based on the idea of treating data as a product and decentralising its ownership and governance across various domains within an organisation.&lt;/p&gt;

&lt;p&gt;Traditionally, data management in organisations has been centralised, with a single data team responsible for all data-related tasks, including data storage, processing, and governance. However, as organisations grow and generate more data, this centralised approach can lead to challenges, such as bottlenecks in data access, delayed data delivery, and increased complexities in data ownership.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Data Mesh Building Blocks: Treating Data as a Product
&lt;/h2&gt;

&lt;p&gt;At the core of the data mesh architecture is the concept of treating data as a product. The architecture begins by creating logical groups or domains, structured around the unique context of a business. These domains are empowered to produce and share their data products in a decentralised manner. Simultaneously, a centralised approach provides the necessary underlying technology infrastructure, governance policies, and data management best practices. The result is interoperability across data products and a standardised way to monitor and manage the data estate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dataflow: Empowering Data Mesh Architecture on GCP
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oOBUge4K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/image-6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oOBUge4K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/image-6.png" alt="" width="341" height="215"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Google Cloud's unified analytics solution, Dataflow, plays a pivotal role in implementing data mesh architectures. Let's explore how Dataflow supports the core building blocks of the data mesh:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data Products:&lt;/strong&gt; Dataflow enables the creation of batch and streaming pipelines for ingesting and enriching domain data, transforming it into valuable data products.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decentralised Data Ownership:&lt;/strong&gt; Dataflow supports multiple programming languages (Java, Go, and Python) and provides a variety of connectors, allowing different domains to create pipelines that best suit their needs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-Service Capabilities:&lt;/strong&gt; Dataflow offers ready-to-use templates for quick start-up and simplified management through features like horizontal and vertical auto-scaling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance and Security:&lt;/strong&gt; Dataflow seamlessly integrates with Google Cloud's Identity and Access Management (IAM) and advanced networking solutions, enabling federated governance for data mesh architectures.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Streaming Data Products: Empowering Real-time Insights
&lt;/h2&gt;

&lt;p&gt;Dataflow plays a crucial role in processing streaming data for data mesh architectures. For instance, consider a multinational organisation's website managed by teams responsible for various regional domains. These teams may need real-time insights into web traffic changes caused by ad placements or external events. Dataflow is used to parse website click payloads, enrich the data with geolocation information, save it in BigQuery, and publish aggregated web traffic to Pub/Sub topics.&lt;/p&gt;

&lt;p&gt;Consumers can discover these data products in the data catalog, request access, and create subscriptions to receive the relevant data. Dataflow ensures smooth real-time processing and enables consumers to further process streaming data in their own pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Technical Guide to Getting Started with Google Dataflow on GCP
&lt;/h2&gt;

&lt;p&gt;Google Dataflow is a fully managed service that provides a unified programming model for both batch and streaming data processing. In this guide, we'll walk you through the process of setting up Google Dataflow and creating basic data processing pipelines. Let's dive in!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt; Create a Google Cloud Project and Enable Dataflow API&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sign in to your Google account and navigate to the Google Cloud Console (&lt;a href="https://console.cloud.google.com/?ref=dataroots.ghost.io"&gt;https://console.cloud.google.com/&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Create a new project or select an existing one.&lt;/li&gt;
&lt;li&gt;In the Google Cloud Console, click on the "Navigation menu" &amp;gt; "Dataflow."&lt;/li&gt;
&lt;li&gt;If the Dataflow API is not enabled for your project, click on the "Enable" button to enable it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt; Set up Google Cloud SDK and Local Development Environment&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install the Google Cloud SDK on your local machine (&lt;a href="https://cloud.google.com/sdk/docs/install?ref=dataroots.ghost.io"&gt;https://cloud.google.com/sdk/docs/install&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Open a terminal or command prompt and run &lt;code&gt;gcloud init&lt;/code&gt; to authenticate and configure your project settings.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 3:&lt;/strong&gt; Write a Simple Batch Data Processing Pipeline&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a new Python file (e.g., &lt;code&gt;batch_pipeline.py&lt;/code&gt;) using your code editor.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 4:&lt;/strong&gt; Run the Batch Data Processing Pipeline Locally&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In the terminal, navigate to the directory containing the &lt;code&gt;batch_pipeline.py&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;Run the pipeline with the following command &lt;code&gt;python batch_pipeline.py&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;OR&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3:&lt;/strong&gt; Write a Simple Streaming Data Processing Pipeline&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a new Python file (e.g., &lt;code&gt;streaming_pipeline.py&lt;/code&gt;) using a  code editor.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 4:&lt;/strong&gt; Run the Streaming Data Processing Pipeline Locally&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In the terminal, navigate to the directory containing the &lt;code&gt;streaming_pipeline.py&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;Run the pipeline with the following command &lt;code&gt;python streaming_pipeline.py&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Data mesh architectures offer a powerful solution to bridge the gap between data production and value extraction. Google Cloud's Dataflow perfectly complements data mesh by empowering organisations with the tools and capabilities needed to create data products, support decentralised data ownership, and enable real-time insights through streaming data products. By combining the strengths of data mesh and Dataflow, businesses can unlock the true potential of their data investments and drive data democratisation across the organisation.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Fabric end-to-end use case: Data Engineering part 1 - Spark and Pandas in Notebooks</title>
      <dc:creator>Bart</dc:creator>
      <pubDate>Thu, 07 Sep 2023 07:00:28 +0000</pubDate>
      <link>https://dev.to/dataroots/fabric-end-to-end-use-case-data-engineering-part-1-spark-and-pandas-in-notebooks-1n1k</link>
      <guid>https://dev.to/dataroots/fabric-end-to-end-use-case-data-engineering-part-1-spark-and-pandas-in-notebooks-1n1k</guid>
      <description>&lt;p&gt;Welcome to the second part of a 5-part series on an end-to-end use case for Microsoft Fabric. This post will focus on the data engineering part of the use case.&lt;/p&gt;

&lt;p&gt;In this series, we will explore how to use Microsoft Fabric to ingest, transform, and analyze data using a real-world use case. The series focuses on data engineering and analytics engineering. We will be using OneLake, Notebooks, Lakehouse, SQL Endpoints, Data Pipelines, dbt, and Power BI.&lt;/p&gt;

&lt;h2&gt;
  
  
  All posts in this series
&lt;/h2&gt;

&lt;p&gt;This post is part of a 5-part series on an end-to-end use case for Microsoft Fabric:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dataroots.io/research/contributions/fabric-end-to-end-use-case-overview-architecture?ref=dataroots.ghost.io" rel="noopener noreferrer"&gt;Fabric end-to-end use case: overview &amp;amp; architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dataroots.io/research/contributions/fabric-end-to-end-use-case-data-engineering-part-1-spark-and-pandas-in-notebooks/?ref=dataroots.ghost.io" rel="noopener noreferrer"&gt;Fabric end-to-end use case: Data Engineering part 1 - Spark and Pandas in Notebooks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dataroots.io/research/contributions/fabric-end-to-end-use-case-data-engineering-part-2-pipelines/?ref=dataroots.ghost.io" rel="noopener noreferrer"&gt;Fabric end-to-end use case: Data Engineering part 2 - Pipelines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dataroots.io/research/contributions/fabric-end-to-end-use-case-analytics-engineering-part-1-dbt-with-the-lakehouse/?ref=dataroots.ghost.io" rel="noopener noreferrer"&gt;Fabric end-to-end use case: Analytics Engineering part 1 - dbt with the Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dataroots.io/research/contributions/fabric-end-to-end-use-case-analytics-engineering-part-2-reports/?ref=dataroots.ghost.io" rel="noopener noreferrer"&gt;Fabric end-to-end use case: Analytics Engineering part 2 - Reports&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Use case introduction: the European energy market
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;If you’re following this series, feel free to skip this section as it’s the same introduction every time.&lt;/em&gt; 🙃&lt;/p&gt;

&lt;p&gt;Since Russia invaded Ukraine, energy prices across Europe have increased significantly. This has led to a surge of alternative and green energy sources, such as solar and wind. However, these sources are not always available, and the energy grid needs to be able to handle the load.&lt;/p&gt;

&lt;p&gt;Therefore, most European energy markets are converging towards a model with &lt;strong&gt;dynamic energy prices&lt;/strong&gt;. In a lot of European countries, you can already opt for a dynamic tariff where the price of electricity changes every hour. This brings challenges, but also lots of opportunities. By analyzing the prices, you can optimize your energy consumption and save money. The flexibility and options improve a lot with the installation of a home battery. With some contracts, you could even earn money by selling your energy back to the grid at peak times or when the price is negative.&lt;/p&gt;

&lt;p&gt;In this use case, we will be ingesting Epex Spot (European Energy Exchange) day-ahead energy pricing data. Energy companies buy and sell energy on this exchange. The price of energy is announced one day in advance. The price can even become negative when there will be too much energy on the grid (e.g. it’s sunnier and windier than expected and some energy plants cannot easily scale down capacity).&lt;/p&gt;

&lt;p&gt;Since it’s quite a lot of content with more than 1 hour of total reading time, I’ve split it up into 5 easily digestible parts.&lt;/p&gt;

&lt;p&gt;We want to ingest this data and store it in OneLake. At one point, we could combine this data with weather forecasts to train a machine learning model to predict the energy price.&lt;/p&gt;

&lt;p&gt;After ingestion, we will transform the data and model it for dashboarding. In the dashboard, we will have simple advice on how to optimize your energy consumption and save money by smartly using a home battery.&lt;/p&gt;

&lt;p&gt;All data is publicly available, so you can follow along in your own Fabric Workspace.&lt;/p&gt;

&lt;h2&gt;
  
  
  Source data
&lt;/h2&gt;

&lt;p&gt;The data we are using comes from the table marked below from the Epex Spot website. The data is - very unfortunately - not easily available through an API or a regular data format.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fsource.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fsource.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Table used as data source&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Instead, we have to scrape the data from the website. Luckily, I'm not the first person to do this. I found &lt;a href="https://github.com/mampfes/ha_epex_spot/tree/main?ref=dataroots.ghost.io" rel="noopener noreferrer"&gt;this GitHub repository&lt;/a&gt; where Steffen Zimmermann already went through the hassle of analyzing the website's HTML code and extracting the data we need from it. By looking at that source code and taking the bits and pieces we need from it, we can achieve the same result in a Notebook.&lt;/p&gt;
&lt;h2&gt;
  
  
  Installing beautifulsoup4
&lt;/h2&gt;

&lt;p&gt;Before we can get started, we first need to install the &lt;code&gt;beautifulsoup4&lt;/code&gt; Python package in the Spark runtime environment of our Fabric Workspace. This can be done by going to your Workspace settings and expanding the &lt;em&gt;Data Engineering/Science&lt;/em&gt; section. There, you will find the &lt;em&gt;&amp;lt;/&amp;gt; Library management&lt;/em&gt; tab. Click on it and then on the &lt;em&gt;+ Add from PyPI&lt;/em&gt; button. In the text box that appears, enter &lt;code&gt;beautifulsoup4&lt;/code&gt;, pick the latest version, and click the &lt;em&gt;Apply&lt;/em&gt; button to install the package in your Workspace. I suggest you grab a coffee now, since the installation took a few minutes for me.&lt;/p&gt;
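&lt;p&gt;If you only need the package for the current session, Fabric notebooks also support in-line installation with the &lt;code&gt;%pip&lt;/code&gt; magic, which avoids the Workspace-level wait. Note that this is session-scoped, so it does not persist for other notebooks or scheduled runs:&lt;/p&gt;

```
%pip install beautifulsoup4
```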

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Finstallation.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Finstallation.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Installing beautifulsoup4&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The Notebook
&lt;/h2&gt;

&lt;p&gt;Now it's time to have some fun and start playing with the data. Create a new Notebook and follow Fabric's prompts to also create a new Lakehouse. Both operations take a few seconds and don't require any infrastructure to be provisioned.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fnotebook.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fnotebook.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The Notebook&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Unfortunately the Fabric feature to export Notebooks doesn't fully work yet, so I cannot share the Notebook itself at the time of writing. I will update this post with a link to a downloadable version you can import once this feature is available. In the meantime, the code below is the full Notebook. Let's go through it step by step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;market_area = "BE"

from datetime import date, datetime, timedelta, timezone
from typing import Any

import pandas as pd
import requests
from bs4 import BeautifulSoup
from delta.tables import *
from pyspark.sql.functions import *

def _to_float(v: str) -&amp;gt; float:
    return float(v.replace(",", ""))

def _as_date_str(v: date) -&amp;gt; str:
    return v.strftime("%Y-%m-%d")

def extract_invokes(data: list[dict[str, Any]]) -&amp;gt; dict[str, Any]:
    invokes = {}
    for entry in data:
        if entry["command"] == "invoke":
            invokes[entry["selector"]] = entry
    return invokes

def fetch_data(delivery_date: date, market_area: str) -&amp;gt; dict[str, Any]:
    trading_date = delivery_date - timedelta(days=1)
    params = {
        "market_area": market_area,
        "trading_date": _as_date_str(trading_date),
        "delivery_date": _as_date_str(delivery_date),
        "modality": "Auction",
        "sub_modality": "DayAhead",
        "product": "60",
        "data_mode": "table",
        "ajax_form": 1,
    }
    data = {
        "form_id": "market_data_filters_form",
        "_triggering_element_name": "submit_js",
    }
    r = requests.post("https://www.epexspot.com/en/market-data", params=params, data=data)
    r.raise_for_status()
    return r.json()

def extract_table_data(delivery_date: datetime, data: dict[str, Any], market_area: str) -&amp;gt; pd.DataFrame:
    soup = BeautifulSoup(data["args"][0], features="html.parser")

    try:
        table = soup.find("table", class_="table-01 table-length-1")
        body = table.tbody
        rows = body.find_all_next("tr")
    except AttributeError:
        return pd.DataFrame() # no data available

    start_time = delivery_date.replace(hour=0, minute=0, second=0, microsecond=0)

    # convert timezone to UTC (and adjust timestamp)
    start_time = start_time.astimezone(timezone.utc)

    records = []
    for row in rows:
        end_time = start_time + timedelta(hours=1)
        buy_volume_col = row.td
        sell_volume_col = buy_volume_col.find_next_sibling("td")
        volume_col = sell_volume_col.find_next_sibling("td")
        price_col = volume_col.find_next_sibling("td")
        records.append(
            (
                market_area,
                start_time,
                end_time,
                _to_float(buy_volume_col.string),
                _to_float(sell_volume_col.string),
                _to_float(volume_col.string),
                _to_float(price_col.string),
            )
        )
        start_time = end_time

    return pd.DataFrame.from_records(records, columns=["market", "start_time", "end_time", "buy_volume", "sell_volume", "volume", "price"])

def fetch_day(delivery_date: datetime, market_area: str) -&amp;gt; pd.DataFrame:
    data = fetch_data(delivery_date.date(), market_area)
    invokes = extract_invokes(data)

    # check if there is an invoke command with selector ".js-md-widget"
    # because this contains the table with the results
    table_data = invokes.get(".js-md-widget")
    if table_data is None:
        # no data for this day
        return pd.DataFrame()
    return extract_table_data(delivery_date, table_data, market_area)

current = DeltaTable.forName(spark, "epex_spot_prices")
current_df = current.toDF()
current_df = current_df.withColumn("date", to_date(col("start_time")))
current_market_count = current_df.filter((current_df.market == market_area) &amp;amp; (current_df.date == _as_date_str(datetime.now() + timedelta(days=1)))).count()
if current_market_count == 24:
    mssparkutils.notebook.exit("Already ingested")

prices = fetch_day(datetime.now() + timedelta(days=1), market_area)
if len(prices) == 0:
    mssparkutils.notebook.exit("No prices available yet")

spark_df = spark.createDataFrame(prices)

current.alias("current").merge(spark_df.alias("new"), "current.market = new.market AND current.start_time = new.start_time AND current.end_time = new.end_time").whenNotMatchedInsertAll().execute()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Walkthrough
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Parameter cell
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;market_area = "BE"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first cell is a parameter cell where we set the market we want to ingest. I would like to run this notebook for every market where Epex Spot is active, so by parametrizing the market area we can pass it to the notebook at run time.&lt;/p&gt;

&lt;p&gt;To make a cell a parameter cell, click on the &lt;code&gt;...&lt;/code&gt; next to the code cell and select &lt;em&gt;Toggle parameter cell&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fparameter_cell.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fparameter_cell.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Making a cell a parameter cell&lt;/em&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Imports
&lt;/h4&gt;

&lt;p&gt;Next, we import all required libraries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datetime import date, datetime, timedelta, timezone
from typing import Any

import pandas as pd
import requests
from bs4 import BeautifulSoup
from delta.tables import *
from pyspark.sql.functions import *

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Pandas is used to create a DataFrame from the data we extract from the website.&lt;/li&gt;
&lt;li&gt;Requests is used to make the HTTP request to the website.&lt;/li&gt;
&lt;li&gt;BeautifulSoup is used to parse the HTML response.&lt;/li&gt;
&lt;li&gt;Delta is used to write the data to the Delta table with the &lt;em&gt;upsert&lt;/em&gt; functionality.&lt;/li&gt;
&lt;li&gt;And finally, we import some functions from the PySpark library to check whether we already ingested the data, so we can avoid extra HTTP requests and CU consumption (read more about billing in &lt;a href="https://dataroots.io/research/contributions/a-closer-look-at-microsoft-fabric-pricing-billing-and-autoscaling/?ref=dataroots.ghost.io" rel="noopener noreferrer"&gt;my previous blog post&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Scraping functions
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def _to_float(v: str) -&amp;gt; float:
    return float(v.replace(",", ""))

def _as_date_str(v: date) -&amp;gt; str:
    return v.strftime("%Y-%m-%d")

def extract_invokes(data: list[dict[str, Any]]) -&amp;gt; dict[str, Any]:
    invokes = {}
    for entry in data:
        if entry["command"] == "invoke":
            invokes[entry["selector"]] = entry
    return invokes

def fetch_data(delivery_date: date, market_area: str) -&amp;gt; dict[str, Any]:
    trading_date = delivery_date - timedelta(days=1)
    params = {
        "market_area": market_area,
        "trading_date": _as_date_str(trading_date),
        "delivery_date": _as_date_str(delivery_date),
        "modality": "Auction",
        "sub_modality": "DayAhead",
        "product": "60",
        "data_mode": "table",
        "ajax_form": 1,
    }
    data = {
        "form_id": "market_data_filters_form",
        "_triggering_element_name": "submit_js",
    }
    r = requests.post("https://www.epexspot.com/en/market-data", params=params, data=data)
    r.raise_for_status()
    return r.json()

def extract_table_data(delivery_date: datetime, data: dict[str, Any], market_area: str) -&amp;gt; pd.DataFrame:
    soup = BeautifulSoup(data["args"][0], features="html.parser")

    try:
        table = soup.find("table", class_="table-01 table-length-1")
        body = table.tbody
        rows = body.find_all_next("tr")
    except AttributeError:
        return pd.DataFrame() # no data available

    start_time = delivery_date.replace(hour=0, minute=0, second=0, microsecond=0)

    # convert timezone to UTC (and adjust timestamp)
    start_time = start_time.astimezone(timezone.utc)

    records = []
    for row in rows:
        end_time = start_time + timedelta(hours=1)
        buy_volume_col = row.td
        sell_volume_col = buy_volume_col.find_next_sibling("td")
        volume_col = sell_volume_col.find_next_sibling("td")
        price_col = volume_col.find_next_sibling("td")
        records.append(
            (
                market_area,
                start_time,
                end_time,
                _to_float(buy_volume_col.string),
                _to_float(sell_volume_col.string),
                _to_float(volume_col.string),
                _to_float(price_col.string),
            )
        )
        start_time = end_time

    return pd.DataFrame.from_records(records, columns=["market", "start_time", "end_time", "buy_volume", "sell_volume", "volume", "price"])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The five functions above scrape the data from the website. You can skip to the next paragraph if you are not that interested in the scraping itself. In short, the code works as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We submit an HTTP POST request (a submitted form) to the web page and the server responds with a JSON object. The request has to contain the parameters that the users would normally select on the website.&lt;/li&gt;
&lt;li&gt;Deeply nested inside this JSON object we get back, we find HTML code containing the table that contains the data we are interested in.&lt;/li&gt;
&lt;li&gt;We pass the HTML code to BeautifulSoup, which parses it for us, and ask it to look for the table.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The interesting part is located at the end of the code snippet above. Here, we loop over all rows in the table and append each of them to a list of tuples. We also convert the date and time to UTC, so that we don't have to worry about timezones or daylight-saving time later on. Finally, we convert the list of tuples to a Pandas DataFrame.&lt;br&gt;
&lt;/p&gt;
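&lt;p&gt;To illustrate the UTC conversion: a day in the timezone Epex Spot operates in does not start at 00:00 UTC, which is exactly why the timestamps are normalized. A minimal standard-library sketch (the notebook relies on the session's local timezone rather than an explicit one):&lt;/p&gt;

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library since Python 3.9

# Midnight local time in Brussels during summer time (UTC+2) ...
local_midnight = datetime(2023, 8, 15, 0, 0, tzinfo=ZoneInfo("Europe/Brussels"))

# ... corresponds to 22:00 UTC on the previous day.
utc_start = local_midnight.astimezone(timezone.utc)
print(utc_start.isoformat())  # 2023-08-14T22:00:00+00:00
```

Storing everything in UTC this way means the hourly records stay unambiguous across daylight-saving switches.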

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def fetch_day(delivery_date: datetime, market_area) -&amp;gt; pd.DataFrame:
    data = fetch_data(delivery_date.date(), market_area)
    invokes = extract_invokes(data)

    # check if there is an invoke command with selector ".js-md-widget"
    # because this contains the table with the results
    table_data = invokes.get(".js-md-widget")
    if table_data is None:
        # no data for this day
        return pd.DataFrame()
    return extract_table_data(delivery_date, table_data, market_area)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last scraping-related function adds more error handling. It is not known when the pricing data becomes available, so the notebook will run multiple times a day until we have data for the next day.&lt;/p&gt;

&lt;h4&gt;
  
  
  First run &amp;amp; partitioning strategy
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prices = fetch_day(datetime.now() + timedelta(days=1), market_area)
if len(prices) == 0:
    mssparkutils.notebook.exit("No prices available yet")

spark_df = spark.createDataFrame(prices)

spark_df.write.mode("overwrite").format("delta").option("overwriteSchema", "true").partitionBy("market").saveAsTable("epex_spot_prices")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code above is the version of the notebook that should be used on the very first run to initialize the Delta table. We call our scraping functions to fetch data for the market set above for the next day and then convert the Pandas DataFrame to a Spark DataFrame. We then write the Spark DataFrame to the Delta table.&lt;/p&gt;

&lt;p&gt;The line &lt;code&gt;mssparkutils.notebook.exit("No prices available yet")&lt;/code&gt; tells the notebook to exit gracefully, without errors, if the prices are not available yet.&lt;/p&gt;

&lt;p&gt;The last line is the most interesting one. We write the data to the Delta table and partition it by market. This means that the data will be stored in multiple folders, one for each market. The folder structure will look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;epex_spot_prices
├── _delta_log
├── market=AT
│   ├── part-00000-....parquet
│   ├── part-00001-....parquet
│   ├── ...
├── market=BE
│   ├── part-00000-....parquet
│   ├── part-00001-....parquet
│   ├── ...
├── ...

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Delta tables are locked while they are written to, but concurrent writers do not conflict as long as they write to different partitions. Partitioning by market therefore allows us to run the notebook concurrently for different markets.&lt;/p&gt;

&lt;h4&gt;
  
  
  Subsequent runs
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;current = DeltaTable.forName(spark, "epex_spot_prices")
current_df = current.toDF()
current_df = current_df.withColumn("date", to_date(col("start_time")))
current_market_count = current_df.filter((current_df.market == market_area) &amp;amp; (current_df.date == _as_date_str(datetime.now() + timedelta(days=1)))).count()
if current_market_count == 24:
    mssparkutils.notebook.exit("Already ingested")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On subsequent runs, we should avoid doing unnecessary HTTP calls by first checking if we already have the data for this market for the next day. Since there is one row per hour, we should have 24 rows per market. We can check this by counting the number of rows for the market and date we are interested in. If we have 24 rows, we can exit the notebook.&lt;/p&gt;

&lt;p&gt;In the code above, I am loading the data first as a DeltaTable object so that we can use this later on to merge the new data with the existing data. I then convert the DeltaTable to a Spark DataFrame and add a column with the date. Then I filter the DataFrame to only contain the rows for the market and date we are interested in and count the number of rows.&lt;br&gt;
&lt;/p&gt;
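&lt;p&gt;The guard boils down to a simple count per market and day. In plain Python terms (illustrative only; the notebook does this with a Spark DataFrame filter):&lt;/p&gt;

```python
# One record per hour: a fully ingested day for a market has exactly 24 rows.
rows = [{"market": "BE", "date": "2023-08-15", "hour": h} for h in range(24)]

def already_ingested(rows, market, day):
    count = sum(1 for r in rows if r["market"] == market and r["date"] == day)
    return count == 24

print(already_ingested(rows, "BE", "2023-08-15"))  # True
print(already_ingested(rows, "BE", "2023-08-16"))  # False
```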

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prices = fetch_day(datetime.now() + timedelta(days=1), market_area)
if len(prices) == 0:
    mssparkutils.notebook.exit("No prices available yet")

spark_df = spark.createDataFrame(prices)

current.alias("current").merge(spark_df.alias("new"), "current.market = new.market AND current.start_time = new.start_time AND current.end_time = new.end_time").whenNotMatchedInsertAll().execute()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, we fetch the data for the next day and convert it to a Spark DataFrame. We then merge it into the existing data with &lt;code&gt;merge&lt;/code&gt;, a Delta-specific function, matching rows on the columns &lt;code&gt;market&lt;/code&gt;, &lt;code&gt;start_time&lt;/code&gt; and &lt;code&gt;end_time&lt;/code&gt;. The &lt;code&gt;whenNotMatchedInsertAll&lt;/code&gt; clause inserts every row from the new DataFrame that does not match an existing row, so only genuinely new rows are added, and &lt;code&gt;execute&lt;/code&gt; runs the merge operation.&lt;/p&gt;

&lt;p&gt;This is where the magic of &lt;a href="https://delta.io/?ref=dataroots.ghost.io" rel="noopener noreferrer"&gt;Delta Lake&lt;/a&gt; comes into play: its transaction log is what makes this merge, or &lt;em&gt;upsert&lt;/em&gt;, operation possible at all. Doing the same thing with plain Parquet or CSV files would be much more difficult.&lt;/p&gt;

&lt;p&gt;Since we have the check above, we could just ignore the merging and always append all new data, but years of data engineering have taught me that it is better to be safe than sorry.&lt;/p&gt;
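&lt;p&gt;The semantics of &lt;code&gt;whenNotMatchedInsertAll&lt;/code&gt; can be sketched in plain Python. This is a hypothetical stand-in to show the idea only; the real merge runs distributed on Spark against the Delta transaction log:&lt;/p&gt;

```python
def merge_when_not_matched(existing, new_rows):
    """Insert only the rows whose (market, start_time, end_time) key is new."""
    seen = {(r["market"], r["start_time"], r["end_time"]) for r in existing}
    merged = list(existing)
    for row in new_rows:
        key = (row["market"], row["start_time"], row["end_time"])
        if key not in seen:  # matched rows are left untouched
            merged.append(row)
            seen.add(key)
    return merged

existing = [{"market": "BE", "start_time": "2023-08-15T00:00",
             "end_time": "2023-08-15T01:00", "price": 85.4}]
new = [
    # duplicate key: skipped by the merge
    {"market": "BE", "start_time": "2023-08-15T00:00",
     "end_time": "2023-08-15T01:00", "price": 85.4},
    # new key: inserted
    {"market": "BE", "start_time": "2023-08-15T01:00",
     "end_time": "2023-08-15T02:00", "price": 79.1},
]
print(len(merge_when_not_matched(existing, new)))  # 2
```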

&lt;h2&gt;
  
  
  The result
&lt;/h2&gt;

&lt;p&gt;Now it's time to run the notebook and see it in action. I'm always amazed by how fast Fabric can spin up a Spark session. The same operation took minutes with the platforms that were available a few years ago.&lt;/p&gt;

&lt;p&gt;After running the notebook, you can find the Delta table in the Lakehouse.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Flakehouse.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Flakehouse.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Resulting rows in the Lakehouse&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We can also check out the files by clicking on the &lt;code&gt;...&lt;/code&gt; and choosing &lt;em&gt;View files&lt;/em&gt;. Here we see the partitioning strategy in action.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Ffiles.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Ffiles.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Files in the Lakehouse&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://dataroots.io/research/contributions/fabric-end-to-end-use-case-data-engineering-part-2-pipelines/?ref=dataroots.ghost.io" rel="noopener noreferrer"&gt;👉 Next part&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;In the next part, we'll use our Notebook in a Pipeline and schedule it to run a couple of times a day.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The link will start working as soon as the post is published, about a week after this one.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Data Warehousing Made Easy: Leveraging Google BigQuery for Seamless Analytics</title>
      <dc:creator>Bart</dc:creator>
      <pubDate>Sun, 03 Sep 2023 22:41:59 +0000</pubDate>
      <link>https://dev.to/dataroots/data-warehousing-made-easy-leveraging-google-bigquery-for-seamless-analytics-4i2g</link>
      <guid>https://dev.to/dataroots/data-warehousing-made-easy-leveraging-google-bigquery-for-seamless-analytics-4i2g</guid>
      <description>&lt;p&gt;In our previous blogs about Data Engineering on Google Cloud Platforms (GCP), we touched upon how to choose the right database for your data projects. In this blog, we will look at Google BigQuery in detail. We will cover the features of BigQuery, best practices and performance optimisation techniques.&lt;/p&gt;

&lt;p&gt;Google BigQuery is a fully managed, petabyte-scale analytics data warehouse that enables businesses to analyse all their data very quickly. It is a cloud-based service that offers a number of benefits over traditional on-premises data warehouses, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; BigQuery can scale to handle any size dataset, making it ideal for businesses that are growing rapidly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; BigQuery is very fast, even for complex queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-effectiveness:&lt;/strong&gt; BigQuery is a pay-as-you-go service, so you only pay for the data you store and the queries you run.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ease of use:&lt;/strong&gt; BigQuery is easy to use, even for non-technical users.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Features of Google BigQuery
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3trXmaJz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/image-5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3trXmaJz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/image-5.png" alt="" width="440" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Google BigQuery offers a number of features that make it a powerful data warehouse solution. These features include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Massive scalability:&lt;/strong&gt; BigQuery can store and process petabytes of data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast performance:&lt;/strong&gt; BigQuery can run complex queries very quickly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible pricing:&lt;/strong&gt; BigQuery is a pay-as-you-go service, so you only pay for the data you store and the queries you run.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easy integration:&lt;/strong&gt; BigQuery can be integrated with a variety of other Google Cloud Platform services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strong security:&lt;/strong&gt; BigQuery uses a variety of security features to protect your data.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature / Detail&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Purpose&lt;/td&gt;
&lt;td&gt;Fully managed, serverless data warehouse for analytics.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Storage&lt;/td&gt;
&lt;td&gt;Columnar storage with Capacitor format.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query Language&lt;/td&gt;
&lt;td&gt;SQL-like language for querying data.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scalability&lt;/td&gt;
&lt;td&gt;Highly scalable, can process petabytes of data.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance&lt;/td&gt;
&lt;td&gt;Super-fast queries with distributed architecture.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Automatic Indexing&lt;/td&gt;
&lt;td&gt;No need to manage indexes; automatic optimization.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Import&lt;/td&gt;
&lt;td&gt;Batch and streaming data ingestion supported.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Export&lt;/td&gt;
&lt;td&gt;Can export data to other storage solutions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Multi-layered security with IAM and encryption.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Encryption&lt;/td&gt;
&lt;td&gt;Data at rest and in transit encryption by default.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;Pay-as-you-go pricing based on data processed.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Formats&lt;/td&gt;
&lt;td&gt;Supports various data formats like CSV, JSON, AVRO.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Partitioning&lt;/td&gt;
&lt;td&gt;Tables can be partitioned for better performance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Clustering&lt;/td&gt;
&lt;td&gt;Clustering helps optimize storage and query efficiency.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Retention Policy&lt;/td&gt;
&lt;td&gt;Customizable data retention periods for tables.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Transfer Service&lt;/td&gt;
&lt;td&gt;Move data from other sources into BigQuery easily.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Machine Learning&lt;/td&gt;
&lt;td&gt;Integration with BigQuery ML for ML models on data.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Analysis&lt;/td&gt;
&lt;td&gt;Supports Data Studio, Sheets, and other BI tools.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Governance&lt;/td&gt;
&lt;td&gt;Data access controls and audit logs for compliance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Streaming Inserts&lt;/td&gt;
&lt;td&gt;Real-time data insertion for immediate analysis.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query Results Export&lt;/td&gt;
&lt;td&gt;Can export query results to various file formats.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Geographic Location&lt;/td&gt;
&lt;td&gt;Data can be stored and processed in specific regions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Resilience&lt;/td&gt;
&lt;td&gt;Replication and backups to ensure data durability.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Access&lt;/td&gt;
&lt;td&gt;API for programmatic access and integration.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Cost Controls&lt;/td&gt;
&lt;td&gt;Provides cost controls to manage expenses.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Best Practices for Using Google BigQuery
&lt;/h3&gt;

&lt;p&gt;There are a number of best practices that you can follow to get the most out of Google BigQuery. These best practices include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use the right data types:&lt;/strong&gt; When loading data into BigQuery, use the correct data types for your data. This will help to improve performance and reduce costs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use partitioning and clustering:&lt;/strong&gt; Partitioning and clustering can help to improve the performance of queries on large datasets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use materialised views:&lt;/strong&gt; Materialized views can be used to pre-compute queries, which can improve performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use BigQuery ML:&lt;/strong&gt; BigQuery ML can be used to build machine learning models on top of your data.&lt;/li&gt;
&lt;/ul&gt;
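&lt;p&gt;For example, partitioning and clustering are declared when creating a table; the dataset, table, and column names below are illustrative:&lt;/p&gt;

```sql
-- Daily partitions on the event date, rows clustered by customer within each partition
CREATE TABLE my_dataset.events
PARTITION BY DATE(event_timestamp)
CLUSTER BY customer_id AS
SELECT event_timestamp, customer_id, amount
FROM my_dataset.raw_events;
```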

&lt;h3&gt;
  
  
  Performance Optimisation Techniques for Google BigQuery
&lt;/h3&gt;

&lt;p&gt;There are a number of performance optimisation techniques that you can use to improve the performance of your queries in BigQuery. These techniques include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scan only what you need:&lt;/strong&gt; BigQuery has no user-managed indexes to tune; instead, filter on partitioned and clustered columns so queries scan only the relevant data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use GoogleSQL:&lt;/strong&gt; Prefer GoogleSQL (standard SQL) over legacy SQL; it supports more features and generally optimises better.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimise your queries:&lt;/strong&gt; Select only the columns you need, filter as early as possible, and avoid &lt;code&gt;SELECT *&lt;/code&gt; to reduce the amount of data processed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use BigQuery's built-in tools:&lt;/strong&gt; BigQuery offers built-in tools, such as the query execution details, that can help you find and fix slow stages in your queries.&lt;/li&gt;
&lt;/ul&gt;
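&lt;p&gt;A query that follows these guidelines selects explicit columns and filters on the partitioning column, so only the matching partition is scanned (names are illustrative):&lt;/p&gt;

```sql
-- Scans a single daily partition instead of the full table
SELECT customer_id, SUM(amount) AS total
FROM my_dataset.events
WHERE DATE(event_timestamp) = "2023-08-15"
GROUP BY customer_id;
```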

&lt;p&gt;Below is a comprehensive table outlining various performance optimisation techniques for Google BigQuery:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Performance Optimization Technique&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1. Data Partitioning&lt;/td&gt;
&lt;td&gt;Partitioning tables based on one or more columns can significantly improve query performance. By dividing data into smaller, manageable partitions, BigQuery can scan only the relevant partitions during query execution, reducing the amount of data processed. Date-based, timestamp-based, or integer range partitioning are commonly used strategies.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2. Data Clustering&lt;/td&gt;
&lt;td&gt;Clustering can further enhance query performance by physically organizing data within partitions based on specific columns. This reduces the amount of data read during query execution, as similar data is stored together. Choosing appropriate clustering columns can improve performance for both partitioned and non-partitioned tables.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3. Query Optimization&lt;/td&gt;
&lt;td&gt;Writing efficient queries is crucial for performance. Avoid using unnecessary wildcards, optimize joins, and use explicit column lists instead of SELECT * to reduce data transfer and processing overhead. Understanding query execution plans using EXPLAIN syntax can help identify bottlenecks and optimize the query execution process.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4. Denormalization&lt;/td&gt;
&lt;td&gt;For read-heavy workloads, denormalizing data and storing pre-aggregated results can speed up queries. While this increases storage requirements, it reduces the need for complex joins and aggregation during query execution. Careful consideration should be given to the trade-off between storage costs and query performance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5. Partition and Cluster Size&lt;/td&gt;
&lt;td&gt;When creating tables, consider the appropriate partition and cluster sizes based on data distribution and query patterns. Smaller partitions allow more parallelism but increase metadata overhead. Larger cluster sizes may improve performance for certain types of queries, especially those that benefit from full column scans. Regularly monitor and adjust these parameters.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6. Table Design&lt;/td&gt;
&lt;td&gt;Optimize table schemas based on query patterns. Avoid using nested and repeated fields in tables unless necessary. Flattening data and using nested or repeated fields only when needed can improve performance. Additionally, use the appropriate data types to minimize storage usage and reduce processing overhead during query execution.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7. Cache and Materialized Views&lt;/td&gt;
&lt;td&gt;Leverage caching mechanisms and materialized views for frequently accessed queries. BigQuery provides automatic query result caching, which can speed up repeated queries. Materialized views store precomputed results for specific queries, reducing the need to recompute data on-the-fly, further improving performance for specific use cases.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8. Streaming and Batch Loading&lt;/td&gt;
&lt;td&gt;Choose the right data loading method based on your requirements. Streaming data is ideal for real-time data ingestion, while batch loading is more efficient for large-scale data uploads. For batch loading, consider using BigQuery Data Transfer Service or data migration tools for optimal performance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9. Query Scheduling&lt;/td&gt;
&lt;td&gt;Schedule resource-intensive or recurring queries during off-peak hours to avoid contention with other workloads. BigQuery allows you to set up reservation slots to ensure dedicated resources for critical workloads, improving performance consistency.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10. Monitoring and Logging&lt;/td&gt;
&lt;td&gt;Regularly monitor query performance using BigQuery's built-in monitoring and logging tools. Keep track of long-running queries, query delays, and other performance metrics. Use Cloud Monitoring (formerly Stackdriver) or other monitoring tools to set up alerts and notifications so you can proactively identify and address performance issues.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11. Use BigQuery ML for Predictions&lt;/td&gt;
&lt;td&gt;For machine learning tasks, consider using BigQuery ML, which allows you to build and execute machine learning models directly within BigQuery. By leveraging the power of BigQuery's distributed architecture, you can perform predictions and analysis on large datasets with ease, reducing data movement and improving overall performance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12. Use BI Tools and Data Studio&lt;/td&gt;
&lt;td&gt;Utilize BI tools like Data Studio for visualization and data exploration. These tools can optimize data retrieval and display, reducing the need for complex queries and allowing users to interact with data in a more intuitive way, thereby enhancing overall performance.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
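&lt;p&gt;To make point 3 above concrete, here is a small, self-contained Python sketch of why explicit column lists beat &lt;code&gt;SELECT *&lt;/code&gt; in a columnar store. The table and sizes are invented for illustration; real BigQuery billing is based on the bytes processed in each referenced column.&lt;/p&gt;

```python
# Toy model of columnar scanning, with made-up data: BigQuery bills by the
# bytes read in each referenced column, so an explicit column list scans
# less data than SELECT *.

def bytes_scanned(table, columns=None):
    """Sum the encoded size of the columns a query touches."""
    selected = table.keys() if columns is None else columns
    return sum(len(repr(value).encode()) for col in selected for value in table[col])

orders = {
    "order_id": list(range(1000)),
    "customer_note": ["a long free-text field " * 10] * 1000,
    "amount": [19.99] * 1000,
}

full_scan = bytes_scanned(orders)                            # SELECT *
narrow_scan = bytes_scanned(orders, ["order_id", "amount"])  # explicit columns

assert full_scan > narrow_scan
print(f"SELECT * scans {full_scan} bytes; explicit columns scan {narrow_scan}")
```

&lt;p&gt;In a real table with wide text columns, the difference can be orders of magnitude.&lt;/p&gt;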

&lt;p&gt;Performance optimisation is an iterative process. Each optimisation technique's effectiveness will depend on your specific data, workload, and business requirements. It is crucial to always consider the trade-offs between performance improvements and associated costs or complexities.&lt;/p&gt;

&lt;h3&gt;
  
  
  Technical Details
&lt;/h3&gt;

&lt;p&gt;In addition to the best practices outlined above, there are a number of technical details that data engineers and ML engineers should be aware of when using Google BigQuery. These details include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The BigQuery query language:&lt;/strong&gt; BigQuery queries are written in GoogleSQL, a SQL dialect that supports performance-oriented features such as partitioning and clustering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The BigQuery storage format:&lt;/strong&gt; BigQuery stores data in a proprietary columnar format (Capacitor) that is optimised for analytics: it minimises disk usage through compression and lets queries read only the columns they reference.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The BigQuery execution engine:&lt;/strong&gt; BigQuery runs queries on a distributed execution engine (based on Dremel) that parallelises work across many workers, scaling from small tables to petabyte-scale datasets.&lt;/li&gt;
&lt;/ul&gt;
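&lt;p&gt;The effect of partitioning mentioned above can be sketched with a toy Python model (the rows below are made up): a filter on the partition column lets the engine read only the matching bucket instead of the whole table.&lt;/p&gt;

```python
from collections import defaultdict
from datetime import date

# Toy sketch of partition pruning with invented rows: bucket rows by day,
# then answer a date-filtered query from a single bucket.

def partition_by_day(rows):
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["event_date"]].append(row)
    return partitions

rows = [
    {"event_date": date(2023, 8, 1), "clicks": 10},
    {"event_date": date(2023, 8, 1), "clicks": 7},
    {"event_date": date(2023, 8, 2), "clicks": 3},
]

partitions = partition_by_day(rows)

# Equivalent of "WHERE event_date = '2023-08-01'": one partition is scanned.
scanned = partitions[date(2023, 8, 1)]
assert len(scanned) == 2 and len(rows) == 3
print(f"scanned {len(scanned)} of {len(rows)} rows")
```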

&lt;h3&gt;
  
  
  Guide to Getting Started
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ac_qgaqf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/image-4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ac_qgaqf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/image-4.png" alt="" width="800" height="489"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this technical guide, we'll walk you through the process of setting up BigQuery and performing basic data warehousing tasks. Let's dive in!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt; Creating a Google Cloud Project and Enabling BigQuery API&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sign in to your Google account and navigate to the Google Cloud Console (&lt;a href="https://console.cloud.google.com/?ref=dataroots.ghost.io"&gt;https://console.cloud.google.com/&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Create a new project or select an existing one.&lt;/li&gt;
&lt;li&gt;In the Google Cloud Console, click on the "Navigation menu" &amp;gt; "BigQuery."&lt;/li&gt;
&lt;li&gt;If the BigQuery API is not enabled for your project, click on the "Enable" button to enable it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt; Setting up the BigQuery Web UI&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;After enabling the BigQuery API, click on the "Go to BigQuery" button to open the BigQuery Web UI.&lt;/li&gt;
&lt;li&gt;You'll see the Query Editor, where you can run SQL-like queries on your datasets.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 3:&lt;/strong&gt; Creating a Dataset&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In the BigQuery Web UI, click on your project name (top-left corner) to open the project dropdown.&lt;/li&gt;
&lt;li&gt;Click on "Create Dataset."&lt;/li&gt;
&lt;li&gt;Provide a Dataset ID (e.g., "my_dataset") and choose the data location.&lt;/li&gt;
&lt;li&gt;Click "Create Dataset" to create a new dataset.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 4:&lt;/strong&gt; Uploading Data to BigQuery&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In the BigQuery Web UI, select your dataset from the left-hand panel.&lt;/li&gt;
&lt;li&gt;Click on "Create Table."&lt;/li&gt;
&lt;li&gt;Choose "Upload" as the data source.&lt;/li&gt;
&lt;li&gt;Select your data file (CSV, JSON, etc.) and configure the schema.&lt;/li&gt;
&lt;li&gt;Click "Create Table" to upload the data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 5:&lt;/strong&gt; Running Queries in BigQuery&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In the Query Editor, enter your SQL-like query.&lt;/li&gt;
&lt;li&gt;For example, to retrieve all rows from a table named "my_table," use &lt;code&gt;SELECT * FROM project_id.dataset_id.my_table;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Click the "Run" button to execute the query.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 6:&lt;/strong&gt; Advanced Queries with BigQuery&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;BigQuery supports complex queries for data transformations and aggregations.&lt;/li&gt;
&lt;li&gt;For example, to calculate the total sales amount per product, use &lt;code&gt;SELECT product_name, SUM(sales_amount) AS total_sales FROM project_id.dataset_id.sales_table GROUP BY product_name;&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
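&lt;p&gt;To show what that GROUP BY computes, here is the same aggregation mirrored in plain Python over a few invented rows: one total per &lt;code&gt;product_name&lt;/code&gt;.&lt;/p&gt;

```python
from collections import defaultdict

# The GROUP BY query above, mirrored in plain Python: one running total
# per product_name. The sample rows are invented for illustration.

def total_sales(rows):
    totals = defaultdict(float)
    for row in rows:
        totals[row["product_name"]] += row["sales_amount"]
    return dict(totals)

sales = [
    {"product_name": "widget", "sales_amount": 10.0},
    {"product_name": "widget", "sales_amount": 5.0},
    {"product_name": "gadget", "sales_amount": 7.5},
]

assert total_sales(sales) == {"widget": 15.0, "gadget": 7.5}
```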

&lt;p&gt;&lt;strong&gt;Step 7:&lt;/strong&gt; Managing BigQuery Resources&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In the BigQuery Web UI, click on "Query History" to view your previous queries.&lt;/li&gt;
&lt;li&gt;Click on "Job History" to monitor the status of running or completed jobs.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;You have successfully set up Google BigQuery and performed basic data warehousing tasks! With BigQuery's powerful querying capabilities and scalability, you can now explore and analyze large datasets effortlessly. Start leveraging the potential of Google BigQuery to unlock valuable insights from your data!&lt;/p&gt;

&lt;p&gt;By following the best practices and performance optimisation techniques outlined in this blog post, you can get the most out of BigQuery and improve the performance of your queries.&lt;/p&gt;

&lt;p&gt;We hope you found this blog post informative. Keep an eye out for our blogs on data engineering on GCP!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Fabric end-to-end use case: overview &amp; architecture</title>
      <dc:creator>Bart</dc:creator>
      <pubDate>Thu, 31 Aug 2023 07:00:27 +0000</pubDate>
      <link>https://dev.to/dataroots/fabric-end-to-end-use-case-overview-architecture-4lo3</link>
      <guid>https://dev.to/dataroots/fabric-end-to-end-use-case-overview-architecture-4lo3</guid>
      <description>&lt;p&gt;Welcome to the first part of a 5-part series on an end-to-end use case for Microsoft Fabric. This post will focus on the architecture overview of the use case.&lt;/p&gt;

&lt;p&gt;In this series, we will explore how to use Microsoft Fabric to ingest, transform, and analyze data using a real-world use case. The series focuses on data engineering and analytics engineering. We will be using OneLake, Notebooks, Lakehouse, SQL Endpoints, Data Pipelines, dbt, and Power BI.&lt;/p&gt;

&lt;h2&gt;
  
  
  All posts in this series
&lt;/h2&gt;

&lt;p&gt;This post is part of a 5-part series on an end-to-end use case for Microsoft Fabric:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dataroots.io/research/contributions/fabric-end-to-end-use-case-overview-architecture?ref=dataroots.ghost.io" rel="noopener noreferrer"&gt;Fabric end-to-end use case: overview &amp;amp; architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dataroots.io/research/contributions/fabric-end-to-end-use-case-data-engineering-part-1-spark-and-pandas-in-notebooks/?ref=dataroots.ghost.io" rel="noopener noreferrer"&gt;Fabric end-to-end use case: Data Engineering part 1 - Spark and Pandas in Notebooks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dataroots.io/research/contributions/fabric-end-to-end-use-case-data-engineering-part-2-pipelines/?ref=dataroots.ghost.io" rel="noopener noreferrer"&gt;Fabric end-to-end use case: Data Engineering part 2 - Pipelines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dataroots.io/research/contributions/fabric-end-to-end-use-case-analytics-engineering-part-1-dbt-with-the-lakehouse/?ref=dataroots.ghost.io" rel="noopener noreferrer"&gt;Fabric end-to-end use case: Analytics Engineering part 1 - dbt with the Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dataroots.io/research/contributions/fabric-end-to-end-use-case-analytics-engineering-part-2-reports/?ref=dataroots.ghost.io" rel="noopener noreferrer"&gt;Fabric end-to-end use case: Analytics Engineering part 2 - Reports&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Use case introduction: the European energy market
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;If you’re following this series, feel free to skip this section as it’s the same introduction every time.&lt;/em&gt; 🙃&lt;/p&gt;

&lt;p&gt;Since Russia invaded Ukraine, energy prices across Europe have increased significantly. This has led to a surge of alternative and green energy sources, such as solar and wind. However, these sources are not always available, and the energy grid needs to be able to handle the load.&lt;/p&gt;

&lt;p&gt;Therefore, most European energy markets are converging towards a model with &lt;strong&gt;dynamic energy prices&lt;/strong&gt;. In many European countries, you can already opt for a dynamic tariff where the price of electricity changes every hour. This brings challenges, but also plenty of opportunities: by analyzing the prices, you can optimize your energy consumption and save money. Installing a home battery increases that flexibility considerably, and with some contracts you can even earn money by selling energy back to the grid at peak times or when the price is negative.&lt;/p&gt;

&lt;p&gt;In this use case, we will be ingesting Epex Spot (European Energy Exchange) day-ahead energy pricing data. Energy companies buy and sell energy on this exchange. The price of energy is announced one day in advance. The price can even become negative when there will be too much energy on the grid (e.g. it’s sunnier and windier than expected and some energy plants cannot easily scale down capacity).&lt;/p&gt;

&lt;p&gt;Since it’s quite a lot of content with more than 1 hour of total reading time, I’ve split it up into 5 easily digestible parts.&lt;/p&gt;

&lt;p&gt;We want to ingest this data and store it in OneLake. At one point, we could combine this data with weather forecasts to train a machine learning model to predict the energy price.&lt;/p&gt;

&lt;p&gt;After ingestion, we will transform the data and model it for dashboarding. In the dashboard, we will have simple advice on how to optimize your energy consumption and save money by smartly using a home battery.&lt;/p&gt;

&lt;p&gt;All data is publicly available, so you can follow along in your own Fabric Workspace.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Farchitecture-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Farchitecture-1.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Architecture of the electricity pricing use case&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the diagram above you can see the 6 components of the data flow. Most of them are Fabric components, with 2 exceptions: Beautiful Soup and dbt. We'll go over the components one by one, talk a bit about the people typically involved in such a project, and conclude with an overview of the data flows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Web scraping with Beautiful Soup 4
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/?ref=dataroots.ghost.io" rel="noopener noreferrer"&gt;BeautifulSoup 4&lt;/a&gt; is a Python library that is used to scrape data from the web. Unfortunately, the electricity prices are not available in a machine-readable format, so we need to scrape them from &lt;a href="https://www.epexspot.com/en/market-data?ref=dataroots.ghost.io" rel="noopener noreferrer"&gt;this page&lt;/a&gt; where they're published as soon as they are available. Beautiful Soup can load the web page and extract the data we need from it. This process is called &lt;em&gt;scraping&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;After retrieving the data, we store it in a Delta table on OneLake. That way, we can use a Lakehouse and the included SQL Endpoint to work with the data in SQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building views with dbt
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.getdbt.com/?ref=dataroots.ghost.io" rel="noopener noreferrer"&gt;dbt&lt;/a&gt; is a tool for analytics engineers where you can use SQL with &lt;a href="https://jinja.palletsprojects.com/en/3.1.x/?ref=dataroots.ghost.io" rel="noopener noreferrer"&gt;Jinja&lt;/a&gt; templating to make that SQL easier to use and add functionalities. By using dbt, we can leverage a lot of best practices from software engineering in our analytics projects. All will become more clear in the post dedicated to this part of the use case. 😉&lt;/p&gt;

&lt;h2&gt;
  
  
  Fabric components
&lt;/h2&gt;

&lt;p&gt;As the goal of the use case is to demonstrate the capabilities of Fabric, we'll be using a lot of Fabric components. You could build this in many different ways using Data Flows, Data Warehouses, etc., but those didn't make it into this use case. The main components of this use case are the Lakehouse and its features.&lt;/p&gt;

&lt;h3&gt;
  
  
  OneLake
&lt;/h3&gt;

&lt;p&gt;OneLake is the centralized data lake in Fabric. You can read more about OneLake in my &lt;a href="https://dataroots.io/research/contributions/exploring-onelake-with-microsoft-azure-storage-explorer/?ref=dataroots.ghost.io" rel="noopener noreferrer"&gt;previous post&lt;/a&gt;. By using OneLake we don't have to worry about things like credentials, security, networking, and more. Fabric's OneLake handles all of that for us.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lakehouse
&lt;/h3&gt;

&lt;p&gt;The Lakehouse is one of Fabric's engines running on top of OneLake. It features an extensive Python runtime for running PySpark or other Python applications.&lt;/p&gt;

&lt;p&gt;The main feature here is the Lakehouse's ability to automatically discover datasets stored in the open-source &lt;a href="https://delta.io/?ref=dataroots.ghost.io" rel="noopener noreferrer"&gt;Delta Lake&lt;/a&gt; format. Those datasets become automatically available for querying in the SQL Endpoint.&lt;/p&gt;

&lt;h2&gt;
  
  
  The personas: data engineering and analytics engineering
&lt;/h2&gt;

&lt;p&gt;As you'll see, this work can be done by a single person. But I chose to split it up into data engineering and analytics engineering, as those are the typical workloads you see in a data team.&lt;/p&gt;

&lt;p&gt;The data engineer is responsible for the data flow from the source to the data store. This also includes building and configuring the data platform to handle this data at scale. Usually, the data engineer will be using a programming language like Python to build the data pipelines. Alternatively, a lot can be done with low-code interfaces in Fabric like Data Flows and Data Pipelines. We'll use both to demonstrate the possibilities.&lt;/p&gt;

&lt;p&gt;The analytics engineer is responsible for transforming that data and building insights from it. The analytics engineer transforms the data and then goes on to visualize data marts to make the information available to the end user. Typically, analytics engineers work with SQL and SQL-based tools like dbt to transform the data. Power BI is one of the best options available to visualize the data, but there are many others as well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data flow
&lt;/h2&gt;

&lt;p&gt;The data flow is pretty simple. The data is scraped from the web and stored in a Delta table on OneLake. Then, dbt is used to build views on top of that data. Those views are then visualized using Power BI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fflow1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fflow1.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A notebook with Beautiful Soup 4 is used to load the data into memory&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fflow2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fflow2.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The notebook appends the data to an existing Delta table&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fflow3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fflow3.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A Pipeline is used to orchestrate the notebook and run it on a schedule with different parameters&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fflow4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fflow4.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;dbt is used with the Lakehouse's SQL Endpoint to build views on the data&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fflow5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdataroots.ghost.io%2Fcontent%2Fimages%2F2023%2F08%2Fflow5.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The data marts created with dbt are visualized using Power BI&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://dataroots.io/research/contributions/fabric-end-to-end-use-case-data-engineering-part-1-spark-and-pandas-in-notebooks/?ref=dataroots.ghost.io" rel="noopener noreferrer"&gt;👉 Next part&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;In the next part, we'll build up the Notebook and look at the logic that goes into building a redundant, reliable, but simple data pipeline.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The link will start working as soon as the post is published, about a week after this one.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Best Practices for Selecting a Database on Google Cloud Platform</title>
      <dc:creator>Bart</dc:creator>
      <pubDate>Sun, 27 Aug 2023 22:35:56 +0000</pubDate>
      <link>https://dev.to/dataroots/best-practices-for-selecting-a-database-on-google-cloud-platform-4jif</link>
      <guid>https://dev.to/dataroots/best-practices-for-selecting-a-database-on-google-cloud-platform-4jif</guid>
      <description>&lt;p&gt;In our previous blog, we gently introduced GCP for data engineering and shared a lite edition of best practices, when it comes to data storage, security and machine learning. In this one, we will dive deep into choosing a database on GCP.&lt;/p&gt;

&lt;p&gt;The importance of a database in data engineering cannot be overstated. Data engineering is the backbone of any data-driven organisation, responsible for designing, building, and maintaining the infrastructure and systems that enable efficient and effective data processing and analysis. A well-chosen and well-designed database plays a central role in the success of data engineering efforts.&lt;/p&gt;

&lt;p&gt;Migrating to Google Cloud Platform (GCP) can bring numerous benefits to an organisation, including improved scalability, agility, and cost-effectiveness. However, the success of a migration largely depends on choosing the correct database for your specific needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to choose your database on GCP?
&lt;/h2&gt;

&lt;p&gt;Google Cloud Platform (GCP) offers a wide array of database services, each tailored to specific use cases. In this blog, we'll explore the best practices for selecting the ideal database on GCP to meet the needs of your technical projects. Let's jump in!&lt;/p&gt;

&lt;h3&gt;
  
  
  Understand Your Data and Use Case:
&lt;/h3&gt;

&lt;p&gt;Before making any decisions, take the time to understand your data and the requirements of your application. Consider the volume of data, the read and write patterns, the required data structure, and any specific performance or compliance needs. Different databases excel in distinct scenarios, so a thorough assessment of your use case is the first step to a successful choice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Choose Between Relational and NoSQL Databases:
&lt;/h3&gt;

&lt;p&gt;GCP provides both relational and NoSQL database options. Relational databases, like Cloud SQL (MySQL, PostgreSQL) and Cloud Spanner, are suitable for structured data with complex relationships, ensuring ACID (Atomicity, Consistency, Isolation, Durability) transactions and strong consistency. NoSQL databases, such as Cloud Firestore, Cloud Datastore, and Bigtable, are best suited for large-scale, unstructured or semi-structured data with high scalability and eventual consistency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Consider Scalability:
&lt;/h3&gt;

&lt;p&gt;Scalability is a critical factor when choosing a database on GCP. Assess whether your application requires horizontal or vertical scalability. NoSQL databases generally offer better horizontal scalability, allowing you to add more nodes to handle increased traffic. On the other hand, some relational databases can be vertically scaled by upgrading the instance's hardware, but this approach has its limits.&lt;/p&gt;

&lt;h3&gt;
  
  
  Evaluate Performance Requirements:
&lt;/h3&gt;

&lt;p&gt;Performance is another crucial aspect. Depending on your application, you might need high-throughput reads and writes or low-latency queries. For demanding workloads, consider databases with in-memory caching capabilities like Cloud Memorystore or high-speed read and write access like Cloud Bigtable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Factor in Geographic Distribution:
&lt;/h3&gt;

&lt;p&gt;If your application needs to serve users across the globe, consider a globally distributed database. Google Cloud offers features like multi-region and regional configurations, which can help you achieve low-latency access and better disaster recovery strategies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security and Compliance:
&lt;/h3&gt;

&lt;p&gt;Ensure that the chosen database meets your security and compliance needs. GCP provides robust security features, such as encryption at rest and in transit, IAM (Identity and Access Management) controls, and audit logging. If your application handles sensitive data, compliance with industry standards like GDPR or HIPAA might be mandatory.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost Optimisation:
&lt;/h3&gt;

&lt;p&gt;Cost is a significant factor when operating databases in the cloud. Review the pricing models for different database options and identify the most cost-effective solution that meets your requirements. Additionally, consider automated scaling options that can help you optimise costs based on usage patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  Evaluate Database Management and Operations:
&lt;/h3&gt;

&lt;p&gt;Consider the level of control and management you require. Fully managed databases like Cloud SQL and Cloud Firestore handle most administrative tasks, allowing your team to focus on application development. However, if you need more control and customisation, self-managed options like Compute Engine instances running databases might be more suitable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QZomb6h9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/07/Which-Database.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QZomb6h9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/07/Which-Database.jpg" alt="" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Commonly Used Databases on GCP
&lt;/h2&gt;

&lt;p&gt;Here's an overview of some of the commonly used databases on GCP:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cloud SQL:&lt;/strong&gt; Cloud SQL is a fully managed relational database service that supports popular database engines like MySQL, PostgreSQL, and SQL Server. It is an excellent choice for traditional database applications and workloads that require ACID compliance and strong consistency. Cloud SQL offers automated backups, scaling options, and seamless integration with other GCP services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BigQuery:&lt;/strong&gt; BigQuery is a serverless, fully managed data warehouse and analytics platform. It is designed to handle large-scale, columnar datasets and allows users to perform ad-hoc SQL queries for analytics purposes. With its blazing-fast performance and automatic scaling, BigQuery is a go-to solution for data analysts and data engineers dealing with big data and complex analytical workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Spanner:&lt;/strong&gt; Cloud Spanner is a globally-distributed, horizontally-scalable database service that provides both relational and NoSQL capabilities. It combines the benefits of traditional relational databases with the scalability of NoSQL databases. Cloud Spanner offers strong consistency and global ACID transactions, making it an excellent choice for mission-critical, globally-distributed applications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Firestore:&lt;/strong&gt; Cloud Firestore is a fully managed NoSQL database that allows developers to build real-time applications with ease. It offers seamless synchronization across devices and platforms, making it ideal for mobile and web applications. Cloud Firestore's real-time data syncing and automatic scaling make it a popular choice for applications requiring low-latency data access.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Datastore:&lt;/strong&gt; Cloud Datastore is a NoSQL document database that provides horizontal scalability and automatic replication. It is suitable for semi-structured data and offers ACID transactions at the entity level. Cloud Datastore is a versatile database that can handle various use cases, from web applications to gaming backends.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Bigtable:&lt;/strong&gt; Cloud Bigtable is a high-performance, NoSQL wide-column store database. It is designed to handle massive amounts of data with low-latency access, making it ideal for time-series data, IoT applications, and analytical workloads. Cloud Bigtable's scalability and performance make it a preferred choice for applications that require large-scale data storage and processing.&lt;/li&gt;
&lt;/ol&gt;
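&lt;p&gt;The selection guidance above can be condensed into a rough heuristic. The Python sketch below is deliberately simplistic (real selection also weighs cost, compliance, and latency), but it captures the first-order decision between the services listed above.&lt;/p&gt;

```python
# A rough, illustrative heuristic distilled from the guidance above. Real
# database selection weighs many more factors (cost, compliance, latency).

def suggest_gcp_database(workload, needs_global_acid=False, realtime_sync=False):
    """Map a coarse workload description to a commonly used GCP database."""
    if workload == "analytics":
        return "BigQuery"
    if workload == "relational":
        return "Cloud Spanner" if needs_global_acid else "Cloud SQL"
    if workload == "nosql":
        if realtime_sync:
            return "Cloud Firestore"
        return "Cloud Bigtable"
    return "unclear; revisit requirements"

assert suggest_gcp_database("analytics") == "BigQuery"
assert suggest_gcp_database("relational", needs_global_acid=True) == "Cloud Spanner"
assert suggest_gcp_database("nosql", realtime_sync=True) == "Cloud Firestore"
```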

&lt;h2&gt;
  
  
  Comparison
&lt;/h2&gt;

&lt;p&gt;The following table provides a high-level comparison of the major features of each database service:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Cloud SQL&lt;/th&gt;
&lt;th&gt;BigQuery&lt;/th&gt;
&lt;th&gt;Cloud Spanner&lt;/th&gt;
&lt;th&gt;Bigtable&lt;/th&gt;
&lt;th&gt;Cloud Datastore&lt;/th&gt;
&lt;th&gt;Firestore&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Database Type&lt;/td&gt;
&lt;td&gt;Relational&lt;/td&gt;
&lt;td&gt;Data Warehouse&lt;/td&gt;
&lt;td&gt;Relational &amp;amp; NoSQL&lt;/td&gt;
&lt;td&gt;NoSQL&lt;/td&gt;
&lt;td&gt;NoSQL&lt;/td&gt;
&lt;td&gt;NoSQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use Case&lt;/td&gt;
&lt;td&gt;Traditional DB Apps&lt;/td&gt;
&lt;td&gt;Analytics, OLAP&lt;/td&gt;
&lt;td&gt;Globally-distributed, mission-critical apps&lt;/td&gt;
&lt;td&gt;Time-series Data&lt;/td&gt;
&lt;td&gt;Web &amp;amp; Mobile Apps&lt;/td&gt;
&lt;td&gt;Real-time Apps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scaling&lt;/td&gt;
&lt;td&gt;Vertical, Manual&lt;/td&gt;
&lt;td&gt;Horizontal, Auto&lt;/td&gt;
&lt;td&gt;Horizontal, Auto&lt;/td&gt;
&lt;td&gt;Horizontal, Auto&lt;/td&gt;
&lt;td&gt;Horizontal, Auto&lt;/td&gt;
&lt;td&gt;Horizontal, Auto&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Size Limit&lt;/td&gt;
&lt;td&gt;Up to 30TB&lt;/td&gt;
&lt;td&gt;Petabytes&lt;/td&gt;
&lt;td&gt;Petabytes&lt;/td&gt;
&lt;td&gt;Petabytes&lt;/td&gt;
&lt;td&gt;Not Specified&lt;/td&gt;
&lt;td&gt;Not Specified&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance&lt;/td&gt;
&lt;td&gt;Low-latency&lt;/td&gt;
&lt;td&gt;High-performance&lt;/td&gt;
&lt;td&gt;High-performance&lt;/td&gt;
&lt;td&gt;Low-latency&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Model&lt;/td&gt;
&lt;td&gt;Structured&lt;/td&gt;
&lt;td&gt;Columnar, Semi-structured&lt;/td&gt;
&lt;td&gt;Structured &amp;amp; Semi-structured&lt;/td&gt;
&lt;td&gt;Wide-column&lt;/td&gt;
&lt;td&gt;Semi-structured&lt;/td&gt;
&lt;td&gt;Semi-structured&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Consistency Model&lt;/td&gt;
&lt;td&gt;Strong (ACID)&lt;/td&gt;
&lt;td&gt;Eventual&lt;/td&gt;
&lt;td&gt;Strong&lt;/td&gt;
&lt;td&gt;Eventual&lt;/td&gt;
&lt;td&gt;Eventual&lt;/td&gt;
&lt;td&gt;Strong&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query Language&lt;/td&gt;
&lt;td&gt;SQL&lt;/td&gt;
&lt;td&gt;SQL&lt;/td&gt;
&lt;td&gt;SQL&lt;/td&gt;
&lt;td&gt;NoSQL API&lt;/td&gt;
&lt;td&gt;GQL (Datastore Query Language)&lt;/td&gt;
&lt;td&gt;GQL (Firestore Query Language)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Indexing Support&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Automatic &amp;amp; Manual&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost Model&lt;/td&gt;
&lt;td&gt;Provisioned&lt;/td&gt;
&lt;td&gt;On-Demand&lt;/td&gt;
&lt;td&gt;On-Demand&lt;/td&gt;
&lt;td&gt;On-Demand&lt;/td&gt;
&lt;td&gt;On-Demand&lt;/td&gt;
&lt;td&gt;On-Demand&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Real-time Analytics&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Geospatial Support&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Native Integrations&lt;/td&gt;
&lt;td&gt;Some GCP Services&lt;/td&gt;
&lt;td&gt;Many GCP Services&lt;/td&gt;
&lt;td&gt;Many GCP Services&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Many GCP Services&lt;/td&gt;
&lt;td&gt;Many GCP Services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use of Partitioning&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Secondary Indexes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Choosing the right database on Google Cloud Platform is a critical decision that can significantly impact the performance, scalability, and cost-effectiveness of your application. By understanding your data, evaluating performance needs, and considering factors like scalability, security, and compliance, you can make an informed decision that aligns with your project's objectives. Keep in mind that each database option has its strengths and weaknesses, so take the time to select the one that best suits your use case.&lt;/p&gt;

&lt;p&gt;Happy data managing!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Let Fabric teach you how to code with Data Wrangler</title>
      <dc:creator>Bart</dc:creator>
      <pubDate>Thu, 24 Aug 2023 07:34:53 +0000</pubDate>
      <link>https://dev.to/dataroots/let-fabric-teach-you-how-to-code-with-data-wrangler-3042</link>
      <guid>https://dev.to/dataroots/let-fabric-teach-you-how-to-code-with-data-wrangler-3042</guid>
      <description>&lt;p&gt;I'm going to be honest with you. I'm bad at writing Pandas data transformation code. Throughout the years I mostly focussed on the Spark APIs in Scala and PySpark, SQL, dbt, and some others, but I find the Pandas APIs usually just confusing and hard to read. I don't like the black box magic and lock-in of low-code solutions either. Did you know that Microsoft Fabric has the perfect middle ground for this? It's called Data Wrangler. Let's dive in!&lt;/p&gt;

&lt;p&gt;Actually, Data Wrangler is not a new thing. It was built as an extension for Visual Studio Code. It helps data engineers wrangle their data with an intuitive interface that writes the Pandas code for you. Microsoft nicely integrated it into the Fabric experience to make it even easier to use.&lt;/p&gt;

&lt;p&gt;To get started with Data Wrangler, create a new notebook in your Fabric Workspace. A notebook needs a Fabric Lakehouse, and in our tutorial we're going to need some data to wrangle. I uploaded &lt;a href="https://debruyn.dev/2023/let-fabric-teach-you-how-to-code-with-data-wrangler/?ref=dataroots.ghost.io#:~:text=wrangle.%20I%20uploaded-,this%20CSV,-file%20with%20the"&gt;this CSV file&lt;/a&gt; with the most popular movies &amp;amp; series on IMDB to my Lakehouse and linked the Lakehouse to my notebook. To start wrangling, we need to load our CSV into a Pandas DataFrame. The syntax to read a CSV file with Pandas is pretty straightforward, but you can save yourself the hassle by right-clicking on the CSV file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--721xLI5F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/generate-pandas-df.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--721xLI5F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/generate-pandas-df.png" alt="" width="800" height="578"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Generate the code to load the CSV file&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This puts the code to load the CSV file in the first cell, so all you have to do is click the play button to execute it. You can see the result right below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0BcmjS9n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/generated-pandas-df.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0BcmjS9n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/generated-pandas-df.png" alt="" width="800" height="272"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Load the DataFrame&lt;/em&gt;&lt;/p&gt;
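&lt;p&gt;For reference, the generated cell boils down to a one-line &lt;code&gt;pd.read_csv&lt;/code&gt; call. The sketch below uses an in-memory sample so it runs anywhere; the Lakehouse path in the comment is illustrative, not the exact path Fabric generates for you:&lt;/p&gt;

```python
import io
import pandas as pd

# In a Fabric notebook the generated code reads straight from the Lakehouse,
# roughly: pd.read_csv("/lakehouse/default/Files/movies.csv") -- path assumed.
# Here we parse an in-memory sample with the same shape as the IMDB file.
sample = io.StringIO(
    'MOVIES,YEAR,GENRE,RATING,VOTES\n'
    'Some Movie,(2019),"Drama, Crime",8.1,1234\n'
)
df = pd.read_csv(sample)
print(df.shape)  # (1, 5)
```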

&lt;p&gt;This has now loaded the DataFrame into the current session and we can start wrangling! To get to the Data Wrangler, click the &lt;em&gt;Data&lt;/em&gt; button and select the DataFrame (&lt;code&gt;df&lt;/code&gt; in my example) you want to wrangle.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--I43QaUc3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/launch-data-wrangler.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--I43QaUc3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/launch-data-wrangler.png" alt="" width="544" height="328"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Select your DataFrame to open the Data Wrangler&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Instantly, you get thrown into a clarifying overview of your data. You get statistical facts, nice histograms, and lots more without having to write a single line of code. Amazing!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rtxnFdus--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/data-wrangler-overview.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rtxnFdus--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/data-wrangler-overview.png" alt="" width="800" height="370"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Data Wrangler overview&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But my data is not clean. So how do we start? At the left bottom corner, you see a list of operations currently applied to your DataFrame. That list is empty now, but we're going to add some basic data cleansing operations to it. Let's start with a basic one: dropping all rows with missing values and dropping all duplicate rows. All we have to do is browse through the list of available operations in the menu on the left and pick the operation we want to apply. For every operation, there are some parameters you can set. In our case, we had to identify the columns which should never be empty or the columns used to detect duplicates. Easy peasy!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XO99rHFD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/operation-drop-duplicate-rows.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XO99rHFD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/operation-drop-duplicate-rows.png" alt="" width="744" height="958"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Drop duplicate rows&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The Fabric Data Wrangler instantly calculates and shows us the results. Where is the code now, you ask? Well, it's right there at the bottom of your screen while you're configuring an operation. If you now want to put that code into your notebook, the button &lt;em&gt;Add code to notebook&lt;/em&gt; allows you to do so.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LntJMA0r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/add-code-to-notebook.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LntJMA0r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/add-code-to-notebook.png" alt="" width="800" height="186"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Add the code to your notebook&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This takes you back to the notebook, and well ... you guessed it, the code is there!&lt;/p&gt;

&lt;p&gt;For me it generated this cell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def clean_data(df):
    # Drop rows with missing data in columns: 'MOVIES', 'YEAR'
    df = df.dropna(subset=['MOVIES', 'YEAR'])
    # Drop duplicate rows in columns: 'MOVIES', 'YEAR'
    df = df.drop_duplicates(subset=['MOVIES', 'YEAR'])
    # Change column type to string for column: 'YEAR'
    df = df.astype({'YEAR': 'string'})
    return df

df_clean = clean_data(df.copy())
df_clean.head()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;What stands out to me is that it even generated comments to explain what the code does. This is a great way to learn Pandas code!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I couldn't get enough of this productive and fun way to wrangle my data and learn Pandas code. Next, I wanted to fix the &lt;code&gt;YEAR&lt;/code&gt; column. For this, I decided to write the code myself, so I just added a new cell and a few lines to remove the characters I didn't want to have in that column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df_clean["YEAR"] = df_clean["YEAR"].str.replace("(", "")
df_clean["YEAR"] = df_clean["YEAR"].str.replace(")", "")
df_clean["YEAR"] = df_clean["YEAR"].str.replace(" ", "")
display(df_clean)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
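&lt;p&gt;A side note on this snippet: Pandas treated the pattern as a regular expression by default until version 2.0, so a literal "(" can raise an error on older versions unless you pass &lt;code&gt;regex=False&lt;/code&gt;. The three calls can also be collapsed into a single regex; the sample values below are made up to mimic the IMDB &lt;code&gt;YEAR&lt;/code&gt; column:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical sample values mimicking the IMDB YEAR column.
df_clean = pd.DataFrame({"YEAR": ["(2019) ", "(2011-2019)"]})
# One call removing parentheses and whitespace, equivalent to the three
# separate str.replace calls above.
df_clean["YEAR"] = df_clean["YEAR"].str.replace(r"[()\s]", "", regex=True)
print(df_clean["YEAR"].tolist())  # ['2019', '2011-2019']
```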



&lt;p&gt;Wrangling more data? Just execute all cells and you can see the new &lt;code&gt;df_clean&lt;/code&gt; pop-up in the same menu as before to open the Data Wrangler.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eeoYNjFD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/launch-data-wrangler-2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eeoYNjFD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/launch-data-wrangler-2.png" alt="" width="560" height="368"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Going back to the Data Wrangler&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;My &lt;code&gt;YEAR&lt;/code&gt; column is far from finished. Sometimes it contains a start and an end, and sometimes only a single year. Let's split that. The Data Wrangler in Fabric has an operation to do exactly that. While we're at it, let's do the same with the &lt;code&gt;GENRE&lt;/code&gt; column as well.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7imXq_5r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/split-column-preview-year.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7imXq_5r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/split-column-preview-year.png" alt="" width="800" height="691"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Split the year column&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What's really nice about this is that you instantly get to see a preview of your operation so that you can fine-tune the parameters to get the result you want. I had no idea how many genres a movie or series could have, but what do you know, it's 3!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--emaIfi_N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/split-column-preview-genre.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--emaIfi_N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/split-column-preview-genre.png" alt="" width="800" height="455"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Split the genre column&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I finished my wrangling by renaming some columns and sorting by the rating. This time I didn't have to go through the operations menu, I could just right-click on the column and click &lt;em&gt;Rename&lt;/em&gt; or &lt;em&gt;Sort&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Now, let's see how all of this translates to code! I clicked the &lt;em&gt;Add code to notebook&lt;/em&gt; button again and it brought me back to my notebook.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5-ETPnD9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/notebook-with-code.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5-ETPnD9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/08/notebook-with-code.png" alt="" width="800" height="353"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The completed notebook&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Once again, it nicely documented my code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Split text using string '–' in column: 'YEAR'
df_split = df_clean['YEAR'].str.split(pat='–', expand=True).add_prefix('YEAR_')
df_clean = pd.concat([df_clean.iloc[:, :2], df_split, df_clean.iloc[:, 2:]], axis=1)
df_clean = df_clean.drop(columns=['YEAR'])
# Rename column 'YEAR_0' to 'year'
df_clean = df_clean.rename(columns={'YEAR_0': 'year'})
# Rename column 'MOVIES' to 'title'
df_clean = df_clean.rename(columns={'MOVIES': 'title'})
# Rename column 'YEAR_1' to 'year_end'
df_clean = df_clean.rename(columns={'YEAR_1': 'year_end'})
# Split text using string ',' in column: 'GENRE'
df_split = df_clean['GENRE'].str.split(pat=',', expand=True).add_prefix('GENRE_')
df_clean = pd.concat([df_clean.iloc[:, :4], df_split, df_clean.iloc[:, 4:]], axis=1)
df_clean = df_clean.drop(columns=['GENRE'])
# Rename column 'GENRE_0' to 'genre_0'
df_clean = df_clean.rename(columns={'GENRE_0': 'genre_0'})
# Rename column 'GENRE_1' to 'genre_1'
df_clean = df_clean.rename(columns={'GENRE_1': 'genre_1'})
# Rename column 'GENRE_2' to 'genre_2'
df_clean = df_clean.rename(columns={'GENRE_2': 'genre_2'})
# Remove leading and trailing whitespace in columns: 'genre_0', 'genre_1', 'genre_2'
df_clean['genre_0'] = df_clean['genre_0'].str.strip()
df_clean['genre_1'] = df_clean['genre_1'].str.strip()
df_clean['genre_2'] = df_clean['genre_2'].str.strip()
# Rename column 'RATING' to 'rating'
df_clean = df_clean.rename(columns={'RATING': 'rating'})
# Rename column 'VOTES' to 'vote_count'
df_clean = df_clean.rename(columns={'VOTES': 'vote_count'})
# Sort by column: 'rating' (descending)
df_clean = df_clean.sort_values(['rating'], ascending=[False])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
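&lt;p&gt;A small observation on the generated code: it renames one column per call. Pandas also accepts a single mapping, so written by hand the renames collapse into one statement (column names taken from the generated cell above; the empty frame is just to demonstrate):&lt;/p&gt;

```python
import pandas as pd

# Empty frame with the post-split column names, just to show the rename.
df = pd.DataFrame(columns=["MOVIES", "YEAR_0", "YEAR_1", "RATING", "VOTES"])
# One mapping instead of five chained single-column renames.
df = df.rename(columns={
    "MOVIES": "title",
    "YEAR_0": "year",
    "YEAR_1": "year_end",
    "RATING": "rating",
    "VOTES": "vote_count",
})
print(list(df.columns))  # ['title', 'year', 'year_end', 'rating', 'vote_count']
```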



&lt;p&gt;I'm really impressed with the Data Wrangler in Fabric. It's a great way to learn Pandas code and it's a great way to quickly understand and clean your data. I'm definitely going to use this more often!&lt;/p&gt;

&lt;h1&gt;
  
  
  You might also like
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://dataroots.io/research/contributions/connect-to-fabric-lakehouses-warehouses-from-python-code/?ref=dataroots.ghost.io"&gt;Connect to Fabric Lakehouses &amp;amp; Warehouses from Python code - Sam Debruyn&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this post, I will show you how to connect to your Microsoft Fabric Lakehouses and Warehouses from Python. Packages &amp;amp; dependencies To connect to Fabric, we’ll use the Microsoft ODBC Driver. This driver is available for Windows, Linux, and macOS. Click on your operating system to download and in…&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_WOvx0gg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.io/static/658f5c0e02b7d8b175552e302f7d28ce/f3583/python-fabric-dwh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_WOvx0gg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.io/static/658f5c0e02b7d8b175552e302f7d28ce/f3583/python-fabric-dwh.png" alt="" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Best Practices for Data Engineering on Google Cloud Platforms</title>
      <dc:creator>Bart</dc:creator>
      <pubDate>Sun, 20 Aug 2023 22:34:56 +0000</pubDate>
      <link>https://dev.to/dataroots/best-practices-for-data-engineering-on-google-cloud-platforms-5gnc</link>
      <guid>https://dev.to/dataroots/best-practices-for-data-engineering-on-google-cloud-platforms-5gnc</guid>
      <description>&lt;p&gt;Data engineering is a crucial process involving data collection, storage, processing, and analysis. It plays a vital role in the data science lifecycle, enabling businesses to make informed decisions. In this blog post, we will discuss practical best practices for data engineering on Google Cloud Platforms. These practices will help you build scalable, reliable, and secure data engineering solutions. Additionally, we will cover individual services like BigQuery, DataFlow, Dataplex in detail in our upcoming blog series on GCP.&lt;/p&gt;

&lt;p&gt;Google Cloud Platform (GCP) offers a wide range of services that can be used for data engineering. These services include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BigQuery:&lt;/strong&gt; A fully managed, petabyte-scale analytics data warehouse&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Storage:&lt;/strong&gt; A scalable, durable, and highly available object storage service&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Dataflow:&lt;/strong&gt; A unified programming model for batch and streaming data processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dataproc:&lt;/strong&gt; A managed Hadoop and Spark service&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Data Fusion:&lt;/strong&gt; A fully managed, cloud native, enterprise data integration service&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and a plethora of others.&lt;/p&gt;

&lt;p&gt;These services can be used to build a variety of data engineering solutions, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data warehouses for storing and analysing large datasets&lt;/li&gt;
&lt;li&gt;Data pipelines for processing data in real time&lt;/li&gt;
&lt;li&gt;Machine learning models for making predictions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Components of Data Engineering on Google Cloud
&lt;/h3&gt;

&lt;p&gt;The key components of data engineering on Google Cloud are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data storage:&lt;/strong&gt; Data storage is the process of storing data in a way that it can be accessed and processed later. GCP offers a variety of data storage services, including BigQuery, Cloud Storage, and Cloud SQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data processing:&lt;/strong&gt; Data processing is the process of transforming data into a format that can be used for analysis or machine learning. GCP offers a variety of data processing services, including Cloud Dataflow, Dataproc, and Cloud Data Fusion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data analysis:&lt;/strong&gt; Data analysis is the process of extracting insights from data. GCP offers a variety of data analysis services, including BigQuery ML, Cloud Dataproc, and Cloud Data Fusion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Machine learning:&lt;/strong&gt; Machine learning is the process of building models that can learn from data and make predictions. GCP offers a variety of machine learning services, including Cloud ML Engine, Cloud AutoML, and Cloud TPUs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Services and Tools for Data Engineering on Google Cloud
&lt;/h3&gt;

&lt;p&gt;In addition to the key components of data engineering, GCP also offers a variety of services and tools that can be used for data engineering purposes. These services and tools include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Dataproc:&lt;/strong&gt; A managed Hadoop and Spark service that can be used to process large datasets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Dataflow:&lt;/strong&gt; A unified programming model for batch and streaming data processing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Data Fusion:&lt;/strong&gt; A fully managed, cloud native, enterprise data integration service.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BigQuery ML:&lt;/strong&gt; A service that provides machine learning capabilities on top of BigQuery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud AutoML:&lt;/strong&gt; A service that helps you build machine learning models without having to write any code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud TPUs:&lt;/strong&gt; Specialised hardware accelerators for machine learning.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Best Practices for Data Storage
&lt;/h3&gt;

&lt;p&gt;The first step in any data engineering project is to choose the right data storage solution. The following factors should be considered when choosing a data storage solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The type of data:&lt;/strong&gt; Some data, such as structured data, is well-suited for relational databases. Other data, such as unstructured data, is better suited for object storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The volume of data:&lt;/strong&gt; The amount of data that needs to be stored will affect the choice of data storage solution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The performance requirements:&lt;/strong&gt; The performance requirements of the data processing application will also affect the choice of data storage solution.&lt;/li&gt;
&lt;/ul&gt;
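&lt;p&gt;As a toy illustration of these factors, the helper below maps data type and volume to a storage service. The mappings and the 30&nbsp;TB threshold are assumptions drawn from the descriptions in this series (structured data suits a relational store like Cloud SQL up to roughly 30 TB, larger analytical datasets suit BigQuery, unstructured data suits Cloud Storage); this is an illustration, not an official decision tree:&lt;/p&gt;

```python
# Toy illustration of the storage factors above; thresholds and mappings are
# assumptions drawn from this post, not an official Google decision tree.
def suggest_storage(data_type: str, volume_tb: float) -> str:
    if data_type == "unstructured":
        return "Cloud Storage"  # object storage for files, images, logs
    if data_type == "structured":
        # Cloud SQL instances top out around 30 TB; beyond that, BigQuery.
        return "BigQuery" if volume_tb > 30 else "Cloud SQL"
    return "Cloud Storage"  # safe default for anything else

print(suggest_storage("structured", 1))    # Cloud SQL
print(suggest_storage("structured", 100))  # BigQuery
print(suggest_storage("unstructured", 5))  # Cloud Storage
```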

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Z0slpA3C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/07/image-13.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Z0slpA3C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dataroots.ghost.io/content/images/2023/07/image-13.png" alt="" width="800" height="618"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Best Practices for Data Processing
&lt;/h3&gt;

&lt;p&gt;Once the data has been stored, it needs to be processed. The following factors should be considered when choosing a data processing solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The type of data processing:&lt;/strong&gt; Some data processing tasks, such as batch processing, can be performed on a batch processing engine. Other data processing tasks, such as streaming processing, can be performed on a streaming processing engine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The volume of data:&lt;/strong&gt; The amount of data that needs to be processed will affect the choice of data processing solution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The performance requirements:&lt;/strong&gt; The performance requirements of the data processing application will also affect the choice of data processing solution.&lt;/li&gt;
&lt;/ul&gt;
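&lt;p&gt;The batch-versus-streaming distinction above can be sketched in a few lines of plain Python. This is a conceptual illustration only, not the Apache Beam / Cloud Dataflow API: a batch job sees the whole dataset at once, while a streaming job emits results as events arrive:&lt;/p&gt;

```python
# Conceptual sketch only -- not the Apache Beam / Cloud Dataflow API.
def batch_total(records):
    # Batch: the full dataset is available up front; one pass, one result.
    return sum(records)

def streaming_totals(stream):
    # Streaming: records arrive one at a time; emit a running total per event.
    total = 0
    for record in stream:
        total += record
        yield total

print(batch_total([1, 2, 3]))             # 6
print(list(streaming_totals([1, 2, 3])))  # [1, 3, 6]
```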

&lt;h3&gt;
  
  
  Best Practices for Data Security
&lt;/h3&gt;

&lt;p&gt;Data security is a critical part of any data engineering project. The following factors should be considered when implementing data security measures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Encryption:&lt;/strong&gt; Data should be encrypted at rest and in transit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access control:&lt;/strong&gt; Access to data should be restricted to authorised users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auditing:&lt;/strong&gt; Data access should be audited to track who has accessed the data and when.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By implementing these data security measures, you can help to protect your data from unauthorised access.&lt;/p&gt;

&lt;h3&gt;
  
  
  Best Practices for Machine Learning
&lt;/h3&gt;

&lt;p&gt;Machine learning is a powerful tool that can be used to make predictions from data. The following factors should be considered when choosing a machine learning solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The type of machine learning task:&lt;/strong&gt; Some machine learning tasks, such as classification, can be performed using a supervised learning algorithm. Other machine learning tasks, such as clustering, can be performed using an unsupervised learning algorithm.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The volume of data:&lt;/strong&gt; The amount of data that is available will affect the choice of machine learning solution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The performance requirements:&lt;/strong&gt; The performance requirements of the machine learning application will also affect the choice of machine learning solution.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Data engineering on Google Cloud Platform offers powerful solutions for collecting, processing, and analysing data. These are just a few of the best practices for data engineering on Google Cloud; by following them, you can build efficient and secure data engineering solutions. Stay tuned for more blog posts in our GCP series. We hope you enjoyed this practical introduction to data engineering on GCP!&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
