<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gabriel Luz</title>
    <description>The latest articles on DEV Community by Gabriel Luz (@gabrielosluz).</description>
    <link>https://dev.to/gabrielosluz</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F747821%2F7e3a77ed-6c34-4130-a14f-fdb2ae826668.jpeg</url>
      <title>DEV Community: Gabriel Luz</title>
      <link>https://dev.to/gabrielosluz</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gabrielosluz"/>
    <language>en</language>
    <item>
      <title>A Complete Google Cloud Certification Guide (2023)</title>
      <dc:creator>Gabriel Luz</dc:creator>
      <pubDate>Sun, 26 Mar 2023 01:17:33 +0000</pubDate>
      <link>https://dev.to/gabrielosluz/a-complete-google-cloud-certification-guide-2023-gee</link>
      <guid>https://dev.to/gabrielosluz/a-complete-google-cloud-certification-guide-2023-gee</guid>
      <description>&lt;p&gt;Cloud computing is the strategy of delivering computing services, including storage, servers, networking, intelligence, analytics, software, and databases over the Internet or “cloud”. This streamlines the innovation process, optimizes costs when scaling, and makes resources flexible. According to &lt;a href="https://www.gartner.com/en/articles/gartner-top-10-strategic-technology-trends-for-2023"&gt;Gartner&lt;/a&gt;, cloud computing remains a strong technology trend for 2023 and beyond. &lt;/p&gt;

&lt;p&gt;For Information Technology (IT) professionals, cloud computing certifications are an attractive way to take advantage of the growing investment in this area and to meet companies' increasing demand for projects built on cloud technologies.&lt;/p&gt;

&lt;p&gt;In this scenario, Google Cloud Platform (GCP) is one of the main players in the cloud computing market, and its certifications have increasingly become a goal of study and preparation for IT professionals.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are the different GCP certification levels?
&lt;/h2&gt;

&lt;p&gt;Google Cloud Platform has segmented its catalog of certifications into the following levels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fundamentals.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the entry level for cloud certifications. The exam expects the candidate to have a basic understanding of the platform's main services and to be able to solve business problems with that knowledge.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Associate. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Currently only Cloud Engineer is at this level. This is an ideal starting point for anyone seeking professional certifications and covers skills such as deploying, monitoring, and maintaining projects on Google Cloud.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Professional.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Professional level, in turn, has a wide variety of exams, each focused on a specific role, such as data engineering, DevOps, and machine learning. The exams at this level go into great detail on the services specific to each role.&lt;/p&gt;

&lt;h2&gt;
  
  
  Certifications
&lt;/h2&gt;

&lt;p&gt;In this section of the article, we will go into detail about each of the exams so you can get a better idea of which one best fits your professional goals.&lt;/p&gt;

&lt;h3&gt;
  
  
  Digital Leader
&lt;/h3&gt;

&lt;p&gt;This is the initial certification within the Google Cloud catalog. Its target audience is people who want to demonstrate mastery of the basics of cloud computing and the GCP platform. It is also important to have basic notions of computing and the internet, as well as the ability to identify business demands and match them with the cloud services suited to solving them. &lt;/p&gt;

&lt;p&gt;According to the &lt;a href="https://cloud.google.com/certification/cloud-digital-leader"&gt;guide&lt;/a&gt; provided by Google, the exam covers the following contents:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Digital transformation with Google Cloud (~10% of the exam).&lt;/li&gt;
&lt;li&gt;Infrastructure and application modernization (~30% of the exam).&lt;/li&gt;
&lt;li&gt;Innovating with data and Google Cloud (~30% of the exam).&lt;/li&gt;
&lt;li&gt;Google Cloud security and operations (~30% of the exam).&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Associate Cloud Engineer
&lt;/h3&gt;

&lt;p&gt;Google defines a Cloud Engineer as a person who performs tasks such as deploying applications, monitoring multiple projects, and maintaining solutions that use Google-managed or self-managed services on Google Cloud. The professional who plays this role must be able to interact with the cloud environment through the Console and via the command line.&lt;/p&gt;

&lt;p&gt;Although it is Associate level, this test should not be considered simple as it aims to demonstrate that the professional has a solid foundation of cloud knowledge to carry out day-to-day tasks in a cloud computing context. Before attempting the Professional level exams, it is highly recommended that you take the Associate Cloud Engineer exam as it serves as the foundation for all activity within Google Cloud.&lt;/p&gt;

&lt;p&gt;According to the exam &lt;a href="https://cloud.google.com/certification/cloud-engineer"&gt;guide&lt;/a&gt;, these are the topics covered:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Set up a cloud solution environment.&lt;/li&gt;
&lt;li&gt;Deploy and implement a cloud solution.&lt;/li&gt;
&lt;li&gt;Configure access and security.&lt;/li&gt;
&lt;li&gt;Plan and configure a cloud solution.&lt;/li&gt;
&lt;li&gt;Ensure successful operation of a cloud solution.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Professional Cloud Architect
&lt;/h3&gt;

&lt;p&gt;The first Professional level exam we will cover is the Professional Cloud Architect. Of the exams at this level, it is the most general one. Although it is challenging because of the large amount of content to study, it is certainly one of the most interesting because it gives an overview of the entire Google Cloud environment.&lt;/p&gt;

&lt;p&gt;It is important to note that this is not just a test of Google's cloud platform tools, but a test of solution architecture in the cloud environment. That makes it more difficult on the one hand, but on the other hand it makes the certification more valuable. Passing this exam means that you really know what you are doing, and it is even a good indicator that you would be a good cloud architect on Amazon's AWS or Microsoft's Azure, as the cloud platforms have many similarities.&lt;/p&gt;

&lt;p&gt;In addition to knowing the GCP tools themselves, the candidate, like the solution architect role itself, must know how to use the cloud in a professional environment: balancing tradeoffs, managing costs, paying attention to security and best practices, and keeping the environment sustainable.&lt;/p&gt;

&lt;p&gt;According to the official &lt;a href="https://cloud.google.com/certification/cloud-architect"&gt;guide&lt;/a&gt;, the Professional Cloud Architect certification exam assesses your ability in the following topics:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Design and plan a cloud solution architecture.&lt;/li&gt;
&lt;li&gt;Design for security and compliance.&lt;/li&gt;
&lt;li&gt;Manage implementations of cloud architecture.&lt;/li&gt;
&lt;li&gt;Manage and provision the cloud solution infrastructure.&lt;/li&gt;
&lt;li&gt;Analyze and optimize technical and business processes.&lt;/li&gt;
&lt;li&gt;Ensure solution and operations reliability.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Professional Cloud Developer
&lt;/h3&gt;

&lt;p&gt;In the paragraph below we can see how Google sees the role of Professional Cloud Developer.&lt;/p&gt;

&lt;p&gt;“A Professional Cloud Developer builds scalable and highly available applications using Google-recommended practices and tools. This individual has experience with cloud-native applications, developer tools, managed services, and next-generation databases. A Professional Cloud Developer also has proficiency with at least one general-purpose programming language and is skilled at producing meaningful metrics and logs to debug and trace code.”&lt;/p&gt;

&lt;p&gt;This is also an exam that I consider quite broad, as it is aimed at software engineers who want to build applications on Google Cloud using its various services. In addition to general knowledge of Information Technology concepts and GCP itself, it helps if the candidate also has experience with software engineering practices, such as version control and good code development habits, among others. The Cloud Developer role overlaps with the Professional Cloud Architect role: developers not only need to understand cloud-native architectures, they also need to be able to build those systems.&lt;/p&gt;

&lt;p&gt;For this exam, it helps to master certain managed services that make life easier for developers, such as Cloud Run, Pub/Sub and, to a certain extent, Google Kubernetes Engine.&lt;/p&gt;

&lt;p&gt;According to the exam &lt;a href="https://cloud.google.com/certification/cloud-developer"&gt;guide&lt;/a&gt;, these are the topics expected of the Cloud Developer candidate:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Design highly scalable, available, reliable cloud-native applications.&lt;/li&gt;
&lt;li&gt;Deploy applications.&lt;/li&gt;
&lt;li&gt;Manage deployed applications.&lt;/li&gt;
&lt;li&gt;Build and test applications.&lt;/li&gt;
&lt;li&gt;Integrate Google Cloud services.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Professional DevOps Engineer
&lt;/h3&gt;

&lt;p&gt;“A Professional Cloud DevOps Engineer is responsible for efficient development operations that can balance service reliability and delivery speed. They are skilled at using Google Cloud Platform to build software delivery pipelines, deploy and monitor services, and manage and learn from incidents.”&lt;/p&gt;

&lt;p&gt;In the paragraph above we can see how Google views the DevOps role. This certification is aimed at people who are responsible for efficient development operations in an organization and who can balance service reliability and delivery speed. Although it is a deep discussion, for Google the DevOps role is very close to, or even the same as, the Site Reliability Engineer (SRE) role.&lt;br&gt;
At its core, the SRE function exists to enable the whole team to build better software, faster. If you spend all your time just putting out fires and chasing your tail, that is not going to happen, so the SRE uses the power of software development to amplify the impact of their time.&lt;/p&gt;

&lt;p&gt;In the context of Google Cloud, this role encompasses a series of products that are very important for software development, such as the Operations Suite tools Cloud Logging, Cloud Monitoring, Cloud Trace, and Cloud Debugger. There are also the CI/CD tools, such as Cloud Build, Cloud Source Repositories, and Container/Artifact Registry. Finally, the exam also covers infrastructure as code with the Cloud Deployment Manager product.&lt;/p&gt;

&lt;p&gt;According to the exam &lt;a href="https://cloud.google.com/certification/cloud-devops-engineer"&gt;guide&lt;/a&gt;, these are the topics covered:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Apply site reliability engineering principles to a service.&lt;/li&gt;
&lt;li&gt;Implement service monitoring strategies.&lt;/li&gt;
&lt;li&gt;Manage service incidents.&lt;/li&gt;
&lt;li&gt;Optimize service performance.&lt;/li&gt;
&lt;li&gt;Build and implement CI/CD pipelines for a service.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Professional Cloud Security Engineer
&lt;/h3&gt;

&lt;p&gt;This certification focuses on the security function within the cloud environment, including identity and access management (IAM), data protection, network security defenses, and more. In the paragraph below we can see Google's understanding of this role:&lt;/p&gt;

&lt;p&gt;“Through an understanding of security best practices and industry security requirements, this individual designs, develops, and manages a secure infrastructure leveraging Google security technologies. The Cloud Security Professional should be proficient in all aspects of Cloud Security including managing identity and access management, defining organizational structure and policies, using Google technologies to provide data protection, configuring network security defenses, collecting and analyzing Google Cloud Platform logs, managing incident responses, and an understanding of regulatory concerns.”&lt;/p&gt;

&lt;p&gt;That is, Google expects that the Cloud Security professional, through an understanding of security best practices and the security requirements practiced by the market, will be able to help organizations design and maintain a secure cloud environment.&lt;/p&gt;

&lt;p&gt;According to the exam &lt;a href="https://cloud.google.com/certification/cloud-security-engineer"&gt;guide&lt;/a&gt;, these are the topics covered:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Configure access within a cloud solution environment.&lt;/li&gt;
&lt;li&gt;Configure network security.&lt;/li&gt;
&lt;li&gt;Ensure data protection.&lt;/li&gt;
&lt;li&gt;Manage operations within a cloud solution environment.&lt;/li&gt;
&lt;li&gt;Ensure compliance.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Professional Cloud Network Engineer
&lt;/h3&gt;

&lt;p&gt;This certification is intended for people who implement and manage network architectures on GCP. It is important to point out that, unlike the network specialist roles of the past, this one never interacts with any hardware, as the network in a cloud environment is defined by software. Google defines this role as follows:&lt;/p&gt;

&lt;p&gt;“Implements and manages network architectures in Google Cloud Platform… [and] may work on networking or cloud teams with architects who design the infrastructure. By leveraging experience implementing VPCs, hybrid connectivity, network services, and security for established network architectures, this individual ensures successful cloud implementations using the command line interface or the Google Cloud Platform Console.”&lt;/p&gt;

&lt;p&gt;As with other tests, for this one it is also advisable that the professional has, in addition to knowledge about Google Cloud, specific knowledge about networking. In general, this role expects the professional to implement and manage network architectures on Google Cloud.&lt;/p&gt;

&lt;p&gt;The official &lt;a href="https://cloud.google.com/certification/cloud-network-engineer"&gt;guide&lt;/a&gt; for this exam requires the candidate to master the following topics:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Design, plan, and prototype a Google Cloud network.&lt;/li&gt;
&lt;li&gt;Configure network services.&lt;/li&gt;
&lt;li&gt;Manage, monitor, and optimize network operations.&lt;/li&gt;
&lt;li&gt;Implement Virtual Private Cloud (VPC) instances.&lt;/li&gt;
&lt;li&gt;Implement hybrid interconnectivity.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Professional Data Engineer
&lt;/h3&gt;

&lt;p&gt;According to Google: “A Professional Data Engineer enables data-driven decision making by collecting, transforming, and publishing data.” Over the last decade, the market learned that it is possible to extract a lot of value from large amounts of data; accordingly, the role of the Data Engineer is to build systems that process, treat, and make data available to an entire organization. Also according to Google:&lt;/p&gt;

&lt;p&gt;"A Data Engineer should be able to design, build, operationalize, secure, and monitor data processing systems with a particular emphasis on security and compliance; scalability and efficiency; reliability and fidelity; and flexibility and portability. A Data Engineer should also be able to leverage, deploy, and continuously train pre-existing machine learning models.".&lt;/p&gt;

&lt;p&gt;Product-wise, this exam covers all data-related products on Google Cloud, such as Dataproc, BigQuery, Dataflow, Dataprep, Data Fusion, Cloud Storage, and Pub/Sub. But it is not just about configuring these components. This role also involves monitoring, maintaining, debugging and, over time, improving these pipelines, so this professional must also be able to use GCP's Operations Suite products. And as with other exams, it is important that the professional is proficient in subjects outside Google Cloud, such as Apache Spark, MapReduce, and SQL.&lt;/p&gt;

&lt;p&gt;Although there is an exam dedicated to the subject, the Data Engineer certification also expects the candidate to be familiar with machine learning and with GCP's Vertex AI platform.&lt;/p&gt;

&lt;p&gt;According to the exam guide, these are the topics covered:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Design data processing systems.&lt;/li&gt;
&lt;li&gt;Ensure solution quality.&lt;/li&gt;
&lt;li&gt;Operationalize machine learning models.&lt;/li&gt;
&lt;li&gt;Build and operationalize data processing systems. &lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Professional Machine Learning Engineer
&lt;/h3&gt;

&lt;p&gt;This certification is for people who design, build, and deploy machine learning models to solve business challenges using GCP technologies. Google defines the role of Machine Learning Engineer as follows:&lt;/p&gt;

&lt;p&gt;“A Professional Machine Learning Engineer designs, builds, and productionizes ML models to solve business challenges using Google Cloud technologies and knowledge of proven ML models and techniques. The ML Engineer is proficient in all aspects of model architecture, data pipeline interaction, and metrics interpretation and needs familiarity with application development, infrastructure management, data engineering, and security.”.&lt;/p&gt;

&lt;p&gt;It is worth highlighting the importance of the word "productionizes". This certification focuses not only on training models but also, and even mainly, on putting them into production and, therefore, making them really useful for the organization. And like the other professional-level exams, the ML Engineer must be proficient in all aspects of the field, not just those that relate to Google Cloud technologies.&lt;/p&gt;

&lt;p&gt;The exam ranges from business questions, in which you must identify how machine learning techniques can help, to more practical and direct questions about how to use Google Cloud products to develop ML models. But make no mistake: it also covers matters related more to the cloud itself and less to the artificial intelligence area, such as security and privacy in your pipeline using features such as IAM and key management.&lt;/p&gt;

&lt;p&gt;According to the exam &lt;a href="https://cloud.google.com/certification/machine-learning-engineer"&gt;guide&lt;/a&gt;, these are the topics covered:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Frame ML problems.&lt;/li&gt;
&lt;li&gt;Architect ML solutions.&lt;/li&gt;
&lt;li&gt;Design data preparation and processing systems.&lt;/li&gt;
&lt;li&gt;Develop ML models.&lt;/li&gt;
&lt;li&gt;Automate and orchestrate ML pipelines.&lt;/li&gt;
&lt;li&gt;Monitor, optimize, and maintain ML solutions. &lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Professional Cloud Database Engineer
&lt;/h3&gt;

&lt;p&gt;This is Google Cloud's newest professional-level certification, released in 2022. Below we can see Google's definition of this role:&lt;/p&gt;

&lt;p&gt;"A Professional Cloud Database Engineer is a database professional with two years of Google Cloud experience and five years of overall database and IT experience. The Professional Cloud Database Engineer designs, creates, manages, and troubleshoots Google Cloud databases used by applications to store and retrieve data. The Professional Cloud Database Engineer should be comfortable translating business and technical requirements into scalable and cost-effective database solutions.".&lt;/p&gt;

&lt;p&gt;This exam overlaps somewhat with the Data Engineer certification. However, the Cloud Database Engineer role is more focused on the databases themselves, whether relational or not. So you can expect more specific questions about products like Cloud SQL, Firestore (Native and Datastore mode), Bigtable, and BigQuery, as well as about migrating databases to the cloud environment.&lt;/p&gt;

&lt;p&gt;According to the exam &lt;a href="https://cloud.google.com/certification/cloud-database-engineer"&gt;guide&lt;/a&gt;, this is the knowledge required of the candidate: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Design scalable and highly available cloud database solutions.&lt;/li&gt;
&lt;li&gt;Migrate data solutions.&lt;/li&gt;
&lt;li&gt;Manage a solution that can span multiple database solutions.&lt;/li&gt;
&lt;li&gt;Deploy scalable and highly available databases in Google Cloud.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Which Google Cloud certification is best for me?
&lt;/h3&gt;

&lt;p&gt;Which certification, or even which certifications, to choose depends a lot on your use of the Google Cloud platform. If you are completely new to the world of cloud computing, it would be a good idea to prepare for the Digital Leader exam, as it will give you the knowledge base to move on to more complex and specific exams.&lt;/p&gt;

&lt;p&gt;If you already have practical experience with Google Cloud, or even an entry-level certification from another provider, such as Azure's AZ-900 or AWS Cloud Practitioner, it would be advisable to go straight to Associate Cloud Engineer: you already understand how the cloud world works, and by studying for this exam you will specialize in the Google platform.&lt;/p&gt;

&lt;p&gt;When it comes to professional-level certifications, the choice is based on which one, or which ones, are closest to your current role in the company. But regardless of your position, I personally advise you to take the Professional Cloud Architect exam, as it will give you a privileged view of the entire cloud, both from a technical and a business point of view. By studying for this exam, you will build a solid foundation to pursue any other specialization within the cloud environment.&lt;/p&gt;

&lt;p&gt;If you are an infrastructure professional, the DevOps, Security and Network exams might make sense for you. If you work with data on a daily basis, the Data Engineer, Database and Machine Learning tests can be good study paths. Finally, if you are a software engineer or developer, the Cloud Developer and DevOps tests can be very useful for your day to day.&lt;/p&gt;

&lt;p&gt;Which exams to take ultimately ends up being a personal decision about what you want for your career and what knowledge you need to achieve your goals. I personally like to have a macro view of everything that makes sense for my career and that the cloud offers. In my specific scenario, as I work as a data engineer at my company, I chose to take the exam focused on that role, but also the Machine Learning and Database ones, because I believe that, although I do not use this knowledge on a daily basis, having a good grasp of these technologies can help me stand out in the job market and communicate better with professionals in these careers. I also think it is worth mentioning that I first went through the process of studying for the Digital Leader, Associate Cloud Engineer, and Professional Cloud Architect exams to build a solid knowledge base about Google Cloud Platform.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to prepare
&lt;/h3&gt;

&lt;p&gt;As I mentioned several times throughout this article, for many of the certification tests it is desirable that the candidate has specific knowledge about Information Technology and its subareas of specialization, such as networks, security, machine learning and data.&lt;/p&gt;

&lt;p&gt;I also take the opportunity to highlight something extremely relevant to keep in mind throughout your preparation. The certificate alone does not have much value in the job market. What really adds value to the certified professional is all the practical and theoretical knowledge acquired while studying for the exam. In other words, what actually determines the success of the process is your ability to turn your study into real value for the company through your daily work on Google Cloud.&lt;/p&gt;

&lt;p&gt;One of the top study sources I like to recommend is the &lt;a href="https://acloudguru.com/"&gt;A Cloud Guru&lt;/a&gt; course platform, which has courses on the various certification exams. Another source of study that I find very useful is Dan Sullivan's courses on the &lt;a href="https://www.udemy.com/user/dan-sullivan-3/"&gt;Udemy&lt;/a&gt; platform.&lt;/p&gt;

&lt;p&gt;When studying for certification exams, it is important that, in addition to taking courses, you also prepare by solving several practice exams. For that purpose I recommend the &lt;a href="https://www.whizlabs.com/"&gt;Whizlabs&lt;/a&gt; and &lt;a href="https://www.examtopics.com/"&gt;Exam Topics&lt;/a&gt; platforms.&lt;/p&gt;

&lt;p&gt;The last point you should keep in mind throughout your study process is to become able to navigate and extract information from the Google Cloud &lt;a href="https://cloud.google.com/docs"&gt;documentation&lt;/a&gt;, as in your day-to-day work it will be your main source of reference about the platform.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Through this article I tried to explain the context of each of the certification exams offered by Google Cloud. I hope the tips I've given provide you with insights to apply on your path to becoming a Google Cloud Certified Professional!&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>certification</category>
      <category>gcp</category>
    </item>
    <item>
      <title>Data Mesh: Scaling Delivery of Data as Product</title>
      <dc:creator>Gabriel Luz</dc:creator>
      <pubDate>Thu, 30 Jun 2022 16:35:06 +0000</pubDate>
      <link>https://dev.to/gabrielosluz/data-mesh-scaling-delivery-of-data-as-product-2476</link>
      <guid>https://dev.to/gabrielosluz/data-mesh-scaling-delivery-of-data-as-product-2476</guid>
      <description>&lt;p&gt;Managing and generating value through data has been a challenge for most companies for decades. In recent years, the phenomenon of Big Data has brought a lot of optimism because of the promise of a revolution in business. However, what was seen was frustration due to limitations that data architectures have, and that prevent them from providing the expected value for companies. Given this scenario, a new architectural paradigm emerged, Data Mesh, which aims to remove bottlenecks and allow a more optimized delivery of value through data.&lt;/p&gt;

&lt;h2&gt;
  
  
  The challenge of creating value through data
&lt;/h2&gt;

&lt;p&gt;Before getting into the details of this new architectural paradigm, let's understand the current context and identify the causes of failure in the data journey of many companies.&lt;br&gt;
For that, let's review some of the key components used as big data repositories. The first of these is the Data Warehouse, which emerged in large corporations decades ago. In that context, business moved at a slower and more predictable pace, and the architecture was essentially composed of large, complex systems with little or no integration between them. The challenge was to get a unified view across the systems.&lt;br&gt;
More recently, starting in the 2010s, the model that gained popularity was the Data Lake, which emerged in a much more dynamic and less predictable business moment. This model was used not only by large corporations but also by new companies with disruptive value propositions. The architecture evolved to serve a greater number of applications, which are simpler and more integrated, often driven by new technologies such as cloud and microservices.&lt;/p&gt;

&lt;p&gt;To better understand the concepts behind these two models, I suggest reading the following articles right here on the AvenueCode blog: Data Lake and Data Warehouse.&lt;/p&gt;

&lt;p&gt;It's important to highlight that in these two architectural paradigms, the teams responsible for the data have characteristics of high specialization and centralization.&lt;/p&gt;

&lt;p&gt;In light of this scenario, even though investments in the data area continue to grow, confidence that they will return real value to the business is decreasing. Based on a study by &lt;a href="https://c6abb8db-514c-4f5b-b5a1-fc710f1e464e.filesusr.com/ugd/e5361a_76709448ddc6490981f0cbea42d51508.pdf"&gt;NewVantage Partners&lt;/a&gt;, it is possible to observe that only 24% of companies have actually been able to adopt a data culture. However, it is important to point out that the problem does not lie in the technology itself, as the great advances of the last decade have dealt very well with the problems arising from the large volume and processing of data. The limitations in delivering value to the business come from processes and data models, due to intrinsic characteristics of such practices, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Monolithic and centralized architecture:&lt;/strong&gt; based on the premise that data must be centralized to obtain real value, architectures have always been complex and concentrated in a single place. Even so, it was relatively simple to start a Data Warehouse or Data Lake project; the difficulty lies in scaling, since these models struggle to keep up with the rapid changes coming from the business areas.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---xV6L8qx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zoh6lcoxilkeooflk5v5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---xV6L8qx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zoh6lcoxilkeooflk5v5.png" alt="Image description" width="644" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Source: &lt;a href="https://martinfowler.com/articles/data-monolith-to-mesh.html#TheNextEnterpriseDataPlatformArchitecture"&gt;How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Highly centralized and specialized responsibilities:&lt;/strong&gt; the responsibility for complex architectures is in the hands of a highly specialized engineering team that often works in isolation from the rest of the company, that is, far from where the data is generated and used. As a result, this team can become a bottleneck when changes are needed or when a new process has to be added to the data pipeline. In addition, the members of this team rarely have a business view of all areas and therefore cannot respond to changes in business rules at the ideal speed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PjONcviv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/67z501wyk6tlzbektuen.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PjONcviv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/67z501wyk6tlzbektuen.png" alt="Image description" width="647" height="255"&gt;&lt;/a&gt;&lt;br&gt;
Source: &lt;a href="https://martinfowler.com/articles/data-monolith-to-mesh.html#TheNextEnterpriseDataPlatformArchitecture"&gt;How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The adoption of a centralized structure, both in terms of staff and of the data platform, brought major challenges for the real democratization of data, such as data quality problems caused by the engineering team's lack of business expertise. There are also scalability problems, due both to engineering limitations and to the complexity and interdependence of the steps in the data pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data mesh
&lt;/h2&gt;

&lt;p&gt;Faced with the aforementioned problems, &lt;a href="https://www.linkedin.com/in/zhamak-dehghani/"&gt;Zhamak Dehghani&lt;/a&gt; presented a new approach to data architectures in two articles on Martin Fowler's blog: &lt;a href="https://martinfowler.com/articles/data-monolith-to-mesh.html"&gt;How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh&lt;/a&gt; and &lt;a href="https://martinfowler.com/articles/data-mesh-principles.html"&gt;Data Mesh Principles and Logical Architecture&lt;/a&gt;. With the democratization of data as its main objective, Data Mesh challenges the previously adopted models and the assumption that large volumes of data must always be centralized in order to be useful, or that a single team must manage these resources. To reach its full potential, the Data Mesh architecture follows four basic principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Domain-oriented data architecture.&lt;/li&gt;
&lt;li&gt;Data as a product.&lt;/li&gt;
&lt;li&gt;Infrastructure that makes data accessible as self-service.&lt;/li&gt;
&lt;li&gt;Federated governance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HeI7Fq0m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a2ftbrz05nl85q07o006.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HeI7Fq0m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a2ftbrz05nl85q07o006.png" alt="Image description" width="649" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Source: &lt;a href="https://martinfowler.com/articles/data-monolith-to-mesh.html#TheNextEnterpriseDataPlatformArchitecture"&gt;How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Domain-oriented data architecture
&lt;/h3&gt;

&lt;p&gt;The data architecture must be built and modeled around the different business domains, instead of being centralized in a single team. This practice brings benefits such as using and managing data close to its respective sources, rather than spending effort moving it. This matters because moving data has a cost: for example, it may require adding more processing jobs to a generic workflow, and each job is a possible point of failure. Another benefit of this Data Mesh principle is that responsibility for data is balanced across the domains involved, which makes onboarding new data sources more agile and helps keep up with the rapid evolution of the business. This makes it easier to scale at the same pace as business demands.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data as a product
&lt;/h3&gt;

&lt;p&gt;Making data architectures distributed is interesting from the point of view of allowing more optimized scalability, but it brings problems that did not exist in the centralized model, such as the lack of standardization in access and data quality. To solve these problems, Data Mesh proposes to think of data as a product, and for that, it is necessary to create new roles, such as the Data Product Owner and the Data Developer. These new roles are responsible for defining and developing products. Instead of looking at data as a service, the Data Product Owner must apply product thinking to create a better experience for customers or users, while the Data Developer works with a focus on developing the product itself. Within the responsibilities, the Data Product Owner of each domain must ensure that the data is accessible and well documented, as well as determine the form of storage and ensure the quality of the data. The purpose of this principle is to provide a good experience for users to perform analysis and bring real value to the business.&lt;/p&gt;

&lt;h3&gt;
  
  
  Infrastructure that makes data accessible as self-service
&lt;/h3&gt;

&lt;p&gt;Another concern that arises in the decentralization scenario is the spreading of knowledge in technologies that were previously concentrated. There would be a risk of overloading domain teams and duplicating work on the data platform and its infrastructure, which needs to be built and constantly managed. Since the skills needed for this task are highly specialized and hard to find, it would be impractical to require each domain to create its own infrastructure environment. Thus, one of the Data Mesh principles is to propose a self-service data platform that allows domain teams to be autonomous. This infrastructure is intended to be a high-level abstraction that removes complexity and the challenge of provisioning and managing the lifecycle of data products. It is important to note that this platform must be domain agnostic. The self-service infrastructure must include features that reduce the cost and expertise required to build data products, including scalable data storage, data product schemas, data pipeline construction and orchestration, data lineage, etc. The objective of this principle is to ensure that domain teams can create and consume data products autonomously, using the platform's abstractions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--36Hh4Ea7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2jntmslrpex5ha2sl13c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--36Hh4Ea7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2jntmslrpex5ha2sl13c.png" alt="Image description" width="668" height="302"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Source: &lt;a href="https://martinfowler.com/articles/data-monolith-to-mesh.html#TheNextEnterpriseDataPlatformArchitecture"&gt;How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Federated governance
&lt;/h3&gt;

&lt;p&gt;One of the fundamental principles of Data Mesh is federated governance, which aims to balance centralized and decentralized governance models to capture the positive points of both. Federated governance has features such as domain decentralization, interoperability through global standardization, a dynamic topology, and, most importantly, the automated execution of decisions by the platform. Traditionally, governance teams use a centralized model of rules and processes and accumulate full responsibility for ensuring global standards for data. In Data Mesh, the governance team changes its approach to sharing responsibility through federation, being responsible for defining, for example, the global (not local) rules for data quality and security, instead of being responsible for the quality and security of all company data. That is, each Data Product Owner has domain-local autonomy and decision-making power while creating and adhering to a set of global rules, to ensure a healthy and interoperable ecosystem. Taking the LGPD (Brazil's data protection law) as an example, the global governance team remains legally responsible and can inspect domains to ensure compliance with global rules.&lt;/p&gt;

&lt;p&gt;It is important to highlight that a domain's data only becomes a product after it has gone through the quality assurance process locally according to the expected data product quality metrics and global standardization rules. Data Product Owners in each domain are in the best position to decide how to measure data quality locally, knowing the details of the business operations that produce the data. Although such decision-making is localized and autonomous, it is necessary to ensure that the modeling is meeting the company's global quality standards, defined by the federated governance team.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Mesh Architecture and Adoption
&lt;/h2&gt;

&lt;p&gt;In the image below, it is possible to see the summarized architecture and observe the four principles at a high level, starting with the data platform, passing through the domains responsible not only for applications and systems but also for data, all under the macro responsibility of federated governance, ensuring product interoperability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CDA7dOBp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2jtpn4rc1xemz1m4f4cb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CDA7dOBp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2jtpn4rc1xemz1m4f4cb.png" alt="Image description" width="628" height="701"&gt;&lt;/a&gt;&lt;br&gt;
Source: &lt;a href="https://martinfowler.com/articles/data-mesh-principles.html"&gt;Data Mesh Principles and Logical Architecture&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the current scenario and thanks to the advancement of data storage and processing technology in recent years, the technological factor is not a problem for the adoption of Data Mesh, since the tools used in Data Lake/Warehouse can be used in the new model. This &lt;a href="https://cloud.google.com/blog/products/data-analytics/build-a-data-mesh-on-google-cloud-with-dataplex-now-generally-available"&gt;article&lt;/a&gt; presents the possibility of creating a Data Mesh architecture based on GCP (Google Cloud Platform). In addition, there is a wide variety of cloud data storage options that allow domain data products to choose the right storage for the need.&lt;/p&gt;

&lt;p&gt;It is important to point out that Data Mesh requires a change of culture within your company, from the business areas to engineering, which can be a barrier to the implementation of this model. To know whether your company would really benefit from Data Mesh, you need to consider some factors, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The number of data sources.&lt;/li&gt;
&lt;li&gt;The number of people on the data team.&lt;/li&gt;
&lt;li&gt;The possible number of business domains.&lt;/li&gt;
&lt;li&gt;Whether the data engineering team is often a bottleneck today.&lt;/li&gt;
&lt;li&gt;How much importance the company currently gives to data governance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In general, the greater the number of data sources, consumers, and business rules, and the more complex the business domains, the more a Data Lake/Warehouse can end up becoming a bottleneck in the delivery of quality solutions. This is a scenario that would likely benefit from the adoption of a Data Mesh based architecture. It is also valid to run specific projects in situations that could make good use of Data Mesh and change the culture and architecture little by little. For example, if you are discarding data sources that would be valuable to business users because it is too complex to integrate them into the current Data Lake/Warehouse structure, then this could be a good opportunity to migrate to the new architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;With this, we conclude that Data Mesh offers an alternative to current data architecture models, allowing greater synergy between technical teams and the business areas, which are the main users of data.&lt;/p&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://martinfowler.com/articles/data-monolith-to-mesh.html#TheNextEnterpriseDataPlatformArchitecture"&gt;How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://martinfowler.com/articles/data-mesh-principles.html"&gt;Data Mesh Principles and Logical Architecture&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://towardsdatascience.com/what-is-a-data-mesh-and-how-not-to-mesh-it-up-210710bb41e0"&gt;What is a Data Mesh — and How Not to Mesh it Up&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>datamesh</category>
      <category>bigdata</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Get data from Cloud SQL with Python</title>
      <dc:creator>Gabriel Luz</dc:creator>
      <pubDate>Fri, 24 Jun 2022 02:30:15 +0000</pubDate>
      <link>https://dev.to/gabrielosluz/get-data-from-cloud-sql-with-python-51jm</link>
      <guid>https://dev.to/gabrielosluz/get-data-from-cloud-sql-with-python-51jm</guid>
      <description>&lt;p&gt;Hello, there!&lt;/p&gt;

&lt;p&gt;A common scenario in the day-to-day work of a data engineer is the need to move data from a SQL database to another location. In this short tutorial I show a simple way to perform this operation using Python.&lt;/p&gt;

&lt;p&gt;For the database, I used &lt;a href="https://cloud.google.com/sql" rel="noopener noreferrer"&gt;Cloud SQL&lt;/a&gt;, a managed database service from Google Cloud Platform (GCP). This GCP product provides cloud-based MySQL, PostgreSQL, and SQL Server databases. The great advantage of Cloud SQL is that it is a managed service, that is, you do not have to worry about tasks related to the infrastructure where the database runs, such as backups, maintenance and updates, monitoring, logging, etc. In this example I used &lt;a href="https://cloud.google.com/sql/docs/postgres" rel="noopener noreferrer"&gt;PostgreSQL&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this tutorial, I'm going to assume that you already have an operational Cloud SQL instance; if you don't, you can follow this &lt;a href="https://towardsdatascience.com/sql-on-the-cloud-with-python-c08a30807661" rel="noopener noreferrer"&gt;tutorial&lt;/a&gt; to create a database from scratch in a few minutes. To interact with GCP we will use the SDK. If you are a macOS user, you can install it through the &lt;a href="https://formulae.brew.sh/" rel="noopener noreferrer"&gt;Homebrew&lt;/a&gt; package manager; just run the command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;brew install --cask google-cloud-sdk&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If you use another operating system, you can check this &lt;a href="https://cloud.google.com/sdk/docs/install" rel="noopener noreferrer"&gt;link&lt;/a&gt; for the correct way to install the SDK on your OS. Then, you need to use the SDK to authenticate your GCP account. For that, just use the commands below to log into your account and then select the project that will be used:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;gcloud auth application-default login&lt;br&gt;
gcloud config set project &amp;lt;PROJECT_ID&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;As in any Python project, it is good practice to create a dedicated virtual environment. For that, we will use virtualenv; to install it, just run the command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install virtualenv&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Next, we create the virtual environment for the project and launch it:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;virtualenv &amp;lt;my_env_name&amp;gt;&lt;br&gt;
source &amp;lt;my_env_name&amp;gt;/bin/activate&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;With the virtual environment created, we can install the libraries that will be used in the project (a single pip command for all of them follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/pandas/" rel="noopener noreferrer"&gt;Pandas&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/fastavro/" rel="noopener noreferrer"&gt;FastAvro&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/SQLAlchemy/" rel="noopener noreferrer"&gt;SQLAlchemy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/cloud-sql-python-connector/" rel="noopener noreferrer"&gt;cloud-sql-python-connector&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
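
&lt;p&gt;All four can be installed at once with a single pip command (just a suggested command; the pg8000 extra pulls in the Postgres driver used later in the code):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install pandas fastavro SQLAlchemy "cloud-sql-python-connector[pg8000]"&lt;/code&gt;&lt;/p&gt;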

&lt;p&gt;The first step of our Python code is to import the libraries we installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud.sql.connector&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Connector&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sqlalchemy&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;configparser&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastavro&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parse_schema&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we need to set the connection parameters. It is always good to point out that you should never put passwords or access keys directly in the code. With that in mind, let's create a .env file to store the database access data. It is extremely important to add this file to .gitignore so it does not get pushed to the git repository you are using. Two other alternatives would be to use environment variables or the GCP service designed to store access information, &lt;a href="https://cloud.google.com/secret-manager" rel="noopener noreferrer"&gt;Secret Manager&lt;/a&gt;. However, for this project, we will continue this way. Fill in the .env file with the following information:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[DEFAULT]
db_user = database user
db_pass = password
db_name = database name
project_id = GCP project id
region = region where the database was created
instance_name = name of the database instance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Going back to the Python code, let's use the configparser library to read these parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#set config
&lt;/span&gt;&lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;configparser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ConfigParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.env&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;project_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEFAULT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;project_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEFAULT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;instance_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEFAULT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;instance_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;db_user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEFAULT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db_user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;db_pass&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEFAULT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db_pass&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;db_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEFAULT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we set up the connection name to follow the pattern: PROJECT_ID:REGION:INSTANCE_NAME&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# initialize parameters
&lt;/span&gt;&lt;span class="n"&gt;INSTANCE_CONNECTION_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;instance_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# i.e demo-project:us-central1:demo-instance
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Your instance connection name is: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;INSTANCE_CONNECTION_NAME&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To make the connection itself, we are going to use the &lt;a href="https://pypi.org/project/cloud-sql-python-connector/" rel="noopener noreferrer"&gt;cloud-sql-python-connector&lt;/a&gt; library. Let's initialize the Connector object and make a call to the connect method, passing the connection parameters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# initialize Connector object
&lt;/span&gt;&lt;span class="n"&gt;connector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Connector&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# function to return the database connection object
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;getconn&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
   &lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;connector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="n"&gt;INSTANCE_CONNECTION_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pg8000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;db_user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;db_pass&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;db_name&lt;/span&gt;
   &lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Connector itself creates connection objects by calling its connect method but does not manage database connection pooling. For this reason, it is recommended to use the connector alongside a library that can create connection pools. For our case, we will use the SQLAlchemy library. This will allow for connections to remain open and be reused, reducing connection overhead and the number of connections needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# create connection pool with 'creator' argument to our connection object function
&lt;/span&gt;&lt;span class="n"&gt;pool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqlalchemy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgresql+pg8000://&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;creator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;getconn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The returned connection pool engine can then be used to query and modify the database. For the database, I chose to use this &lt;a href="https://www.kaggle.com/datasets/azminetoushikwasi/ucl-202122-uefa-champions-league" rel="noopener noreferrer"&gt;dataset&lt;/a&gt;, containing information about the 2021-22 UEFA Champions League season. To upload the data to Postgres, I used the DBeaver SQL client, which has a feature to easily import data. The database has 8 tables, as shown in the image below, but I will only use one, as it is enough for the purpose of this short tutorial.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpm7yy8scz27kk5fv1q6c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpm7yy8scz27kk5fv1q6c.png" alt="List of eight tables present in the created database."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Through the pool object, created in the last code block, we can perform a query using the pandas library, as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tb_key_stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_sql_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;select * from public.tb_key_stats limit 1000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tb_key_stats&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With that, we have the extracted data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj5r8y1o5odshfj4l8mdz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj5r8y1o5odshfj4l8mdz.png" alt="Print containing the first five lines of the dataframe resulting from the data extraction."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a bonus to this tutorial, we are going to export the extracted data to an Avro file. In case you're not familiar with it, Avro is an open-source project that provides data serialization and exchange services for Apache Hadoop. Unlike Parquet, another very popular data engineering format, Avro is row-oriented, just like CSV. If you don't know the difference between the two orientations, consider this small dataset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq4kuhxj3akc3pqz8s6tt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq4kuhxj3akc3pqz8s6tt.png" alt="Image of a small table containing three rows and three columns. Columns player, club and parents (club)."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s how it would be organized in both row and column storage:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffj6ymynbh6gm4jj98rr5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffj6ymynbh6gm4jj98rr5.png" alt="Image of the previous small dataset arranged by both row and column."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Column-oriented files tend to be more lightweight, since each column holds values of a single data type and can be compressed accordingly. Row storage cannot be compressed as effectively, since a single row mixes multiple data types.&lt;/p&gt;
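&lt;p&gt;To make the difference concrete, here is a tiny sketch, using hypothetical values rather than the dataset above, of how the same three records would be laid out in each orientation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# row-oriented layout: records stored one after another (CSV, Avro)
rows = [
    {"player": "Player A", "club": "Club X", "goals": 10},
    {"player": "Player B", "club": "Club Y", "goals": 8},
    {"player": "Player C", "club": "Club Z", "goals": 7},
]

# column-oriented layout: each column stored contiguously (Parquet)
columns = {
    "player": ["Player A", "Player B", "Player C"],
    "club": ["Club X", "Club Y", "Club Z"],
    "goals": [10, 8, 7],
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;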

&lt;p&gt;A very useful feature of Avro is that it stores the schema as a JSON-like object alongside the data, so the data types are known in advance.&lt;/p&gt;

&lt;p&gt;To convert the previously generated pandas dataframe into an Avro file, we are going to use the fastavro library. This process essentially involves three steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define the schema: we need to define a schema in JSON format to specify which fields are expected, along with their respective data types. Let's do this with a Python dictionary and then pass it to the fastavro.parse_schema() function.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;schema_tb_key_stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;doc&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;UCL_2021_22&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;UCL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;namespace&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;stats&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;record&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fields&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
       &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;player_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
       &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;club&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
       &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;position&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
       &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;minutes_played&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;int&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
       &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;match_played&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;int&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
       &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;goals&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;int&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
       &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;assists&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;int&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
       &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;distance_covered&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;float&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
   &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;parsed_schema_tb_key_stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;schema_tb_key_stats&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Convert the DataFrame to a list of records: each row of the dataframe becomes a Python dictionary:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tb_key_stats&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Generate the Avro file:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tb_key_stats_deflate.avro&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
   &lt;span class="nf"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parsed_schema_tb_key_stats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;codec&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;deflate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With that, the file is generated successfully. As you can see, for this example I used deflate compression, but it is also possible to choose snappy or no compression at all; by default, the codec parameter is set to "null". It is also clear that, for a dataset with many columns, writing the whole schema out as a JSON object can become tedious work; this is the price paid for efficiency. While there are creative ways to automate this process, that's a topic for another article.&lt;/p&gt;
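&lt;p&gt;As a quick sanity check (not part of the original steps), the file can be read back with fastavro's reader; the schema travels inside the file, so no extra metadata is needed. A minimal sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from fastavro import reader

# open the generated file and inspect the embedded schema and a few records
with open('tb_key_stats_deflate.avro', 'rb') as f:
    avro_reader = reader(f)
    print(avro_reader.writer_schema)  # schema stored in the file header
    for i, record in enumerate(avro_reader):
        print(record)
        if i == 4:  # peek at the first five records only
            break
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;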

&lt;h3&gt;
  
  
  That’s It!
&lt;/h3&gt;

&lt;p&gt;The purpose of this article was to present a very simple way to extract data from a Postgres database in Cloud SQL. In a future article I intend to demonstrate an efficient way to extract all the tables from the created database, since today we only extracted one.&lt;/p&gt;

&lt;p&gt;Let me know if you have any questions, feedback, or ideas over on &lt;a href="https://www.linkedin.com/in/gabriel-luz/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or in the comments below. Thanks for reading!&lt;/p&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/csv-files-for-storage-absolutely-not-use-apache-avro-instead-7b7296149326" rel="noopener noreferrer"&gt;CSV Files for Storage? Absolutely Not. Use Apache Avro Instead&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/sql-on-the-cloud-with-python-c08a30807661" rel="noopener noreferrer"&gt;SQL on The Cloud With Python &lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/GoogleCloudPlatform/cloud-sql-python-connector" rel="noopener noreferrer"&gt;Cloud SQL Connector for Python Drivers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>googlecloud</category>
      <category>cloudsql</category>
      <category>sql</category>
    </item>
    <item>
      <title>How to prepare for the GCP Professional Data Engineer certification</title>
      <dc:creator>Gabriel Luz</dc:creator>
      <pubDate>Mon, 02 May 2022 00:23:54 +0000</pubDate>
      <link>https://dev.to/gabrielosluz/how-to-prepare-for-the-gcp-professional-data-engineer-certification-gek</link>
      <guid>https://dev.to/gabrielosluz/how-to-prepare-for-the-gcp-professional-data-engineer-certification-gek</guid>
      <description>&lt;p&gt;Hi!&lt;/p&gt;

&lt;p&gt;I recently passed the GCP Professional Data Engineer certification exam and some people asked me for tips for the exam and study materials, so I decided to write this post to explain the path I took.&lt;/p&gt;

&lt;p&gt;This is certainly not the only possible way to prepare for this test, but it was the one that worked for me. So be open-minded as each person has different ways of studying and learning.&lt;/p&gt;

&lt;h1&gt;
  
  
  Expectations alignment
&lt;/h1&gt;

&lt;p&gt;Unlike some other certifications, the Professional Data Engineer (PDE) is definitely not a simple exam where you can just study for 8 hours and be ready. Google writes its questions in a way that only someone with hands-on experience and an understanding of its services can get through them.&lt;/p&gt;

&lt;p&gt;It is important to note that a certification is a validation of the knowledge you gain. The goal is not for you to memorize questions, but to actually understand the services that the cloud offers so you can apply them in your day-to-day work. Keep this in mind as you prepare.&lt;/p&gt;

&lt;p&gt;For this exam, Google recommends the following background: 3+ years of industry experience, including 1+ years designing and managing solutions using Google Cloud. At the time I took the test I had 3 years of experience in data engineering, but no background with data-oriented cloud services. It is important to say that I already had a knowledge base in the cloud, as I had used AWS for 6 months in my previous job, but in a DevOps context rather than data engineering. Either way, I think it's important to have basic knowledge of cloud architecture before attempting a specialized exam like the GCP PDE. In the end, I needed 4 months of intense study to feel minimally confident about taking the exam.&lt;/p&gt;

&lt;h1&gt;
  
  
  The exam itself
&lt;/h1&gt;

&lt;p&gt;Before starting your preparation, it is worth exploring the &lt;a href="https://cloud.google.com/certification/data-engineer" rel="noopener noreferrer"&gt;official website&lt;/a&gt;. It contains all up-to-date information about the test and its content. It is worth mentioning that on the official website the cost of the exam is 200 dollars, but because I live in Brazil, I only paid 120 dollars. Google provides a section of the site, called Exam Guide, to explain what is on the test. So it's worth reading before starting your study and also taking a look when you feel you've studied a considerable amount of material, to see where you stand on the requirements.&lt;/p&gt;

&lt;p&gt;If your company is a Google Cloud partner, it is very likely that it has the benefit of vouchers for its employees. These vouchers are usually available to anyone who completes a knowledge path made available by Google on the Qwiklabs platform.&lt;/p&gt;

&lt;p&gt;The exam has a total of 50 questions and is 2 hours long. I consider this time enough to answer all the questions and even review the ones that were left as doubts.&lt;/p&gt;

&lt;p&gt;One last comment about the test: due to the COVID-19 pandemic, it is possible to take the exam online, although this option is a bit tedious because Google requires you to have a well-prepared environment for the test. Personally, I prefer to take it in person at an accredited center, as it eliminates the risk of computer and internet problems that may occur at home and allows a greater level of concentration.&lt;/p&gt;

&lt;h1&gt;
  
  
  Preparation
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Courses and training
&lt;/h2&gt;

&lt;p&gt;There are many online and in-person courses that prepare you for the GCP Professional Data Engineer exam. I will list here the ones I had contact with and whose quality I can attest to.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://acloudguru.com/course/google-certified-professional-data-engineer" rel="noopener noreferrer"&gt;A Cloud Guru&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This was the main course I used to prepare for the exam. This is probably the main study platform for Cloud certifications. This course pretty much covers everything you need to pass the exam and is very detailed. Although more expensive, this platform also includes labs to practice what is taught in class.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.udemy.com/course/google-cloud-professional-data-engineer-get-certified/" rel="noopener noreferrer"&gt;Udemy Course&lt;/a&gt; by  &lt;a href="https://www.linkedin.com/in/ACoAAAALs6oBS9IeJfF9q3iEKyzdOAeT4LF8XOY" rel="noopener noreferrer"&gt;Dan Sullivan&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;This was another course I took and I really enjoyed it. The instructor connects the different services in a very didactic way. It is also worth mentioning that Udemy frequently runs promotions; this course in particular I managed to buy for 25 reais. The course structure is constantly evolving along with the exam content.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.qwiklabs.com/" rel="noopener noreferrer"&gt;Qwiklabs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the platform where Google makes its own training available. Companies that are Google Cloud partners typically have access to it. I took most of the courses on the data engineering track, but I consider them to be of average quality. The best thing about this platform is the numerous hands-on labs. A Cloud Guru also has labs, but Qwiklabs has a lot more.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practice experience
&lt;/h2&gt;

&lt;p&gt;As I mentioned at the outset, practical experience is very important when studying for the certification. While Google recommends a year of hands-on cloud work before taking the exam, you can build that experience in other ways. The first and most recommended is to create an account in the so-called Free Tier. This account will allow you to use Google Cloud Platform resources to practice what you learn in the courses. All the courses mentioned teach how to create this account, but if you want to check it out, you can take a look at this &lt;a href="https://www.youtube.com/watch?v=xbzByiksHiA" rel="noopener noreferrer"&gt;video tutorial&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The other way to practice is through platforms that offer labs. As I mentioned in the last section, the two platforms I've used that offer this feature are Cloud Guru and Qwiklabs, with Qwiklabs having a larger amount of labs to practice with.&lt;/p&gt;

&lt;p&gt;To be clear, I prefer the approach of creating a Free Tier account, as it allows for a much more complete learning experience than using a ready-made environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Extra study resources
&lt;/h2&gt;

&lt;p&gt;In addition to the courses, other resources were essential for passing the exam.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://learning.oreilly.com/library/view/official-google-cloud/9781119618430/" rel="noopener noreferrer"&gt;Professional Data Engineer Study Guide Book by Dun Sullivan&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This excellent book written by Dan Sullivan (yes, the same author as the second course listed) contains almost all the information needed to pass the exam. It has well-organized chapters on GCP services and provides many key points to remember for the exam. At the end of each section, there are tests on the presented content. To tell you the truth, after reading this book I improved my knowledge by about 40-50%, and my results in the mock tests improved significantly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/c/priyankavergadia" rel="noopener noreferrer"&gt;The Cloud Girl&lt;/a&gt; and &lt;a href="https://www.amazon.com/Visualizing-Google-Cloud-Illustrated-References/dp/1119816327" rel="noopener noreferrer"&gt;Visualizing Google Cloud&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The YouTube channel The Cloud Girl and the book Visualizing Google Cloud are produced by &lt;a href="https://www.linkedin.com/in/pvergadia/" rel="noopener noreferrer"&gt;Priyanka Vergadia&lt;/a&gt;, a Google Cloud Developer Advocate. She manages to explain GCP's products and services through beautifully crafted illustrations. Both the book and the channel are great teaching resources and helped me a lot in understanding important concepts for the exam.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/docs" rel="noopener noreferrer"&gt;Google documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I consider reading the documentation of each GCP service and product very important, not only for the exam but also for performing the role of a GCP Data Engineer in day-to-day work. I always recommend consulting the documentation whenever a doubt about a Google Cloud service arises; in other words, it is a resource you should use from the very start of your exam preparation until the end.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Engineer Practice Exam
&lt;/h2&gt;

&lt;p&gt;An extremely important step in your preparation is taking practice exams. This is the classic advice: the more, the better. Identify the areas where you are getting the most questions wrong and reread the corresponding documentation. Prepare yourself for many business-scenario questions asking which technology to use, and for questions with multiple correct alternatives; in my exam, I found several questions that asked me to select two alternatives. Below I have listed the platforms I used throughout my preparation. An important detail is that on some of the platforms it is possible to pause the simulations, so you don't have to do all 50 questions at once, although I recommend doing at least three practice exams without interruptions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.whizlabs.com/learn/course/google-cloud-certified-professional-data-engineer/250" rel="noopener noreferrer"&gt;Whizlabs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.examtopics.com/exams/google/professional-data-engineer/view/" rel="noopener noreferrer"&gt;Exam Topics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Practical exam at the end of the &lt;a href="https://acloudguru.com/course/google-certified-professional-data-engineer" rel="noopener noreferrer"&gt;Cloud Guru course&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Practical exam at the end of the &lt;a href="https://www.udemy.com/course/google-cloud-professional-data-engineer-get-certified/" rel="noopener noreferrer"&gt;Dan Sullivan  course&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Practice exam at the beginning of the &lt;a href="https://learning.oreilly.com/library/view/official-google-cloud/9781119618430/" rel="noopener noreferrer"&gt;Dan Sullivan book&lt;/a&gt; and at the end of each chapter&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.google.com/forms/d/e/1FAIpQLSfkWEzBCP0wQ09ZuFm7G2_4qtkYbfmk_0getojdnPdCYmq37Q/viewform" rel="noopener noreferrer"&gt;Google's sample questions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I also watched the videos on the &lt;a href="https://www.youtube.com/c/AwesomeGCP" rel="noopener noreferrer"&gt;AwesomeGCP&lt;/a&gt; YouTube channel by &lt;a href="https://www.linkedin.com/in/sathishvj/" rel="noopener noreferrer"&gt;Sathish VJ&lt;/a&gt;. It is an excellent study resource because, in addition to tips and content, it also has several videos with commented questions.&lt;/p&gt;

&lt;h1&gt;
  
  
  Final tips
&lt;/h1&gt;

&lt;p&gt;These are the services that appeared most often on my exam, that is, that had two or more questions about them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ML concepts + AI on GCP (Vertex AI, Vision API, Natural Language API, etc.).&lt;/li&gt;
&lt;li&gt;BigQuery (some questions covering details of the tool, such as backups, and SQL constructs such as window functions).&lt;/li&gt;
&lt;li&gt;Bigtable (the main detail is designing tables and row keys to avoid hotspots).&lt;/li&gt;
&lt;li&gt;Cloud SQL.&lt;/li&gt;
&lt;li&gt;Dataflow.&lt;/li&gt;
&lt;li&gt;Dataproc (some questions addressing Dataproc + HDFS).&lt;/li&gt;
&lt;li&gt;Pub/Sub (especially for scenarios where systems need to be decoupled).&lt;/li&gt;
&lt;li&gt;Cloud Storage (main topic was storage classes).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Other services appeared in more specific questions, with at most two questions each:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cloud Spanner&lt;/li&gt;
&lt;li&gt;Composer&lt;/li&gt;
&lt;li&gt;Compute Engine&lt;/li&gt;
&lt;li&gt;Data Loss Prevention&lt;/li&gt;
&lt;li&gt;Dataprep&lt;/li&gt;
&lt;li&gt;Firestore&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is also very important to be familiar with cloud concepts such as IAM, which governs permissions on resources in the cloud.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;In this post I presented the steps I followed to prepare for the GCP Professional Data Engineer exam. As explained at the beginning, there is no “silver bullet” for succeeding in the test. Some things work better for one person than for another, and I'm sure there are other (and better) ways to prepare.&lt;/p&gt;

&lt;p&gt;Honestly, even after studying very intensively for 4 months, by the third question on the test my feeling was that I only knew a little bit about everything and that it wouldn't be enough to pass, since several questions asked about small details of the tools. For several moments during the test I was sure I wouldn't pass, and in those moments anxiety and despair took over me. I wasted at least 5 minutes of exam time staring at nothing and regretting that I wasn't doing well with the answers. But I tried to stay calm and concentrate. At the end of the exam I received a result of "Pass". I was extremely happy and relieved to know that all the effort and sacrifices had paid off. After 3 days, Google finally sent me the result confirmation and the GCP Professional Data Engineer certificate.&lt;/p&gt;

&lt;p&gt;If you have any questions about the preparation, feel free to message me. I hope I was able to convey useful information to you, and I wish you good luck in the exam!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkcc942ahmjwwehqv4weg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkcc942ahmjwwehqv4weg.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>dataengineering</category>
      <category>gcp</category>
      <category>bigdata</category>
    </item>
  </channel>
</rss>
