<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jeff Hale</title>
    <description>The latest articles on DEV Community by Jeff Hale (@discdiver).</description>
    <link>https://dev.to/discdiver</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F143744%2F0ada7f64-e32c-484b-8038-5d8f41bed0aa.jpg</url>
      <title>DEV Community: Jeff Hale</title>
      <link>https://dev.to/discdiver</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/discdiver"/>
    <language>en</language>
    <item>
      <title>The Weird World of Missing Values in Pandas</title>
      <dc:creator>Jeff Hale</dc:creator>
      <pubDate>Fri, 22 Nov 2019 20:07:30 +0000</pubDate>
      <link>https://dev.to/discdiver/the-weird-world-of-missing-values-in-pandas-3kph</link>
      <guid>https://dev.to/discdiver/the-weird-world-of-missing-values-in-pandas-3kph</guid>
      <description>&lt;p&gt;If you use the Python pandas library for data science and data analysis things, you'll eventually see &lt;code&gt;NaN&lt;/code&gt;, &lt;code&gt;NaT&lt;/code&gt;, and &lt;code&gt;None&lt;/code&gt; in your DataFrame. These values all represent missing data. However, there are subtle and not-so-subtle differences in how they behave and when they appear..&lt;/p&gt;

&lt;p&gt;Let's take a look at the three types of missing values and learn how to find them. &lt;/p&gt;

&lt;h1&gt;&lt;code&gt;NaN&lt;/code&gt;, &lt;code&gt;NaT&lt;/code&gt;, and &lt;code&gt;None&lt;/code&gt;&lt;/h1&gt;

&lt;h2&gt;&lt;code&gt;NaN&lt;/code&gt;&lt;/h2&gt;

&lt;p&gt;If a column is numeric and you have a missing value, that value will be a &lt;code&gt;NaN&lt;/code&gt;. &lt;code&gt;NaN&lt;/code&gt; stands for &lt;em&gt;Not a Number&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;NaN&lt;/code&gt;s are always floats. So if an integer column has a &lt;code&gt;NaN&lt;/code&gt; added to it, the column is upcast to become a &lt;code&gt;float&lt;/code&gt; column. This behavior may seem strange, but it reflects NumPy's capabilities as of this writing: NumPy's integer dtypes have no way to represent a missing value, while floats can hold a &lt;code&gt;NaN&lt;/code&gt;, so pandas upcasts. The pandas dev team is hoping NumPy will provide a native NA solution soon.&lt;/p&gt;
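
&lt;p&gt;Here's a minimal sketch of that upcasting with toy data (I'm using the lowercase &lt;code&gt;np.nan&lt;/code&gt; spelling):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
print(s.dtype)  # int64

# Append a missing value -- the whole column is upcast to float
s_with_nan = pd.concat([s, pd.Series([np.nan])], ignore_index=True)
print(s_with_nan.dtype)  # float64
```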

&lt;h2&gt;&lt;code&gt;NaT&lt;/code&gt;&lt;/h2&gt;

&lt;p&gt;If a column is a DateTime and you have a missing value, then that value will be a &lt;code&gt;NaT&lt;/code&gt;. &lt;code&gt;NaT&lt;/code&gt; stands for &lt;em&gt;Not a Time&lt;/em&gt;. &lt;/p&gt;

&lt;h2&gt;&lt;code&gt;None&lt;/code&gt;&lt;/h2&gt;

&lt;p&gt;A pandas &lt;code&gt;object&lt;/code&gt; dtype column - the dtype for strings as of this writing - can hold &lt;code&gt;None&lt;/code&gt;, &lt;code&gt;NaN&lt;/code&gt;, &lt;code&gt;NaT&lt;/code&gt; or all three at the same time! &lt;/p&gt;
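
&lt;p&gt;A quick sketch with a toy Series (the values are illustrative):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# An object-dtype Series can hold all three missing value types at once
s = pd.Series(['a', np.nan, pd.NaT, None], dtype='object')
print(s.dtype)  # object
print(s.tolist())
```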

&lt;h2&gt;What are these NaN values anyway?&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;NaN&lt;/code&gt; is a NumPy value. &lt;code&gt;np.NaN&lt;/code&gt;&lt;br&gt;
&lt;code&gt;NaT&lt;/code&gt; is a Pandas value. &lt;code&gt;pd.NaT&lt;/code&gt;&lt;br&gt;
&lt;code&gt;None&lt;/code&gt; is a vanilla Python value. &lt;code&gt;None&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;However, they display in a DataFrame as &lt;code&gt;NaN&lt;/code&gt;, &lt;code&gt;NaT&lt;/code&gt;, and &lt;code&gt;None&lt;/code&gt;. &lt;/p&gt;

&lt;h1&gt;Strange Things Are Afoot with Missing Values&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fi6hux9a8w3ch7o3nvrqx.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fi6hux9a8w3ch7o3nvrqx.gif" alt="Strange Things are Afoot gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Behavior with missing values can get weird. Let's make a Series with each type of missing value.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NaN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NaT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0   NaT
1   NaT
2   NaT
dtype: datetime64[ns]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Pandas created the Series with a datetime64[ns] dtype. Ok. &lt;/p&gt;

&lt;p&gt;You can cast it to an object dtype if you like.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NaN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NaT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0    NaT
1    NaT
2    NaT
dtype: object
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;But you can't cast it to a numeric dtype.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NaN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NaT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;float&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

    ---------------------------------------------------------------------------

    TypeError                                 Traceback (most recent call last)

    &amp;lt;ipython-input-255-66ec4de18835&amp;gt; in &amp;lt;module&amp;gt;
    ----&amp;gt; 1 pd.Series([np.NaN, pd.NaT, None]).astype('float')

 ...


    TypeError: cannot astype a datetimelike from [datetime64[ns]] to [float64]


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Also note that you can change an object column containing &lt;code&gt;None&lt;/code&gt;s into a numeric column with &lt;code&gt;pd.to_numeric&lt;/code&gt;. No problem.&lt;/p&gt;
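
&lt;p&gt;For example, a minimal sketch:&lt;/p&gt;

```python
import pandas as pd

s = pd.Series([1, None, 3], dtype='object')
print(s.dtype)  # object

numeric = pd.to_numeric(s)
print(numeric.dtype)  # float64 -- the None became NaN
```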

&lt;h3&gt;Equality Check&lt;/h3&gt;

&lt;p&gt;Another bizarre thing about missing values in Pandas is that some varieties are equal to themselves and others aren't.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;NaN&lt;/code&gt; doesn't equal &lt;code&gt;NaN&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NaN&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NaN&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

    False


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And &lt;code&gt;NaT&lt;/code&gt; doesn't equal &lt;code&gt;NaT&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NaT&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NaT&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

    False


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;But &lt;code&gt;None&lt;/code&gt; does equal &lt;code&gt;None&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

    True


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Fun! 😁&lt;/p&gt;

&lt;p&gt;Now let's turn our attention to finding missing values.&lt;/p&gt;

&lt;h3&gt;Finding Missing Values with &lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isna.html" rel="noopener noreferrer"&gt;df.isna()&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;df.isna()&lt;/code&gt; to find &lt;code&gt;NaN&lt;/code&gt;, &lt;code&gt;NaT&lt;/code&gt;, and &lt;code&gt;None&lt;/code&gt; values. They all evaluate to &lt;code&gt;True&lt;/code&gt; with this method. &lt;/p&gt;

&lt;p&gt;Calling &lt;code&gt;df.isna()&lt;/code&gt; on a DataFrame returns a boolean DataFrame; calling it on a Series returns a boolean Series.&lt;/p&gt;

&lt;p&gt;Let's see &lt;code&gt;df.isna()&lt;/code&gt; in action! Here's a DataFrame with all three types of missing values:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F5e2i0mew255roo3jcc0p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F5e2i0mew255roo3jcc0p.png" alt="DataFrame with all three types of missing values"&gt;&lt;/a&gt;&lt;/p&gt;
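
&lt;p&gt;The exact data in the screenshot isn't reproduced here, but a DataFrame like it can be sketched as follows (the column names are my own):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'floats': [1.0, np.nan],                        # NaN in a numeric column
    'dates': [pd.Timestamp('2019-11-22'), pd.NaT],  # NaT in a datetime column
    'objects': ['hello', None],                     # None in an object column
})
print(df)
```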

&lt;p&gt;Here's the code to return a boolean DataFrame with &lt;code&gt;True&lt;/code&gt; for missing values.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fpago9zyumlco37qcdgw5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fpago9zyumlco37qcdgw5.png" alt="boolean DataFrame image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A one-liner to return a DataFrame of all your missing values is pretty cool. Deciding what to do with those missing values is a whole nother question that I'll be exploring in my upcoming Memorable Pandas book.&lt;/p&gt;

&lt;p&gt;Note that it's totally fine to have all three Pandas missing value types in your DataFrame at the same time, assuming you are okay with missing values. &lt;/p&gt;

&lt;h1&gt;Wrap&lt;/h1&gt;

&lt;p&gt;I hope you found this intro to missing values in the Python pandas library to be useful. 😀 &lt;/p&gt;

&lt;p&gt;If you did, please do all the nice things on Dev and share it on your favorite social media so other people can find it, too. 👏&lt;/p&gt;

&lt;p&gt;I write about Python, Docker, and data science things. Check out &lt;a href="https://jeffhale.net" rel="noopener noreferrer"&gt;my other guides&lt;/a&gt; if you're into that stuff. 👍 &lt;/p&gt;

&lt;p&gt;You don't want to MISS them! (&lt;em&gt;Missing values&lt;/em&gt;. Get it?) 🙄&lt;/p&gt;

&lt;p&gt;Thanks to Kevin Markham of &lt;a href="https://www.dataschool.io/" rel="noopener noreferrer"&gt;Data School&lt;/a&gt; for suggestions on an earlier version of this article!&lt;/p&gt;

</description>
      <category>python</category>
      <category>pandas</category>
      <category>codenewbie</category>
      <category>beginners</category>
    </item>
    <item>
      <title>The True Guide to True and False in PostgreSQL</title>
      <dc:creator>Jeff Hale</dc:creator>
      <pubDate>Wed, 23 Oct 2019 19:21:25 +0000</pubDate>
      <link>https://dev.to/discdiver/the-true-guide-to-true-and-false-in-postgresql-1p69</link>
      <guid>https://dev.to/discdiver/the-true-guide-to-true-and-false-in-postgresql-1p69</guid>
      <description>&lt;p&gt;&lt;code&gt;TRUE&lt;/code&gt;, &lt;code&gt;FALSE&lt;/code&gt;, and &lt;code&gt;NULL&lt;/code&gt; are the possible boolean values in PostgreSQL. &lt;/p&gt;

&lt;p&gt;Surprisingly, there are a bunch of different values you can use for &lt;code&gt;TRUE&lt;/code&gt; and &lt;code&gt;FALSE&lt;/code&gt; - and one alternative for &lt;code&gt;NULL&lt;/code&gt;. Also surprisingly, some values you'd expect might work, don't work. &lt;/p&gt;

&lt;p&gt;Let's check out &lt;code&gt;TRUE&lt;/code&gt; first.&lt;/p&gt;

&lt;h1&gt;TRUE&lt;/h1&gt;

&lt;p&gt;The following literal values evaluate to &lt;code&gt;TRUE&lt;/code&gt;. Note that case doesn't matter.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;true&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'t'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'tr'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'tru'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'true'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'y'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'ye'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'yes'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'on'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'1'&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Other similar options, such as an unquoted &lt;code&gt;1&lt;/code&gt; or an unquoted &lt;code&gt;tru&lt;/code&gt;, will cause an error.&lt;/p&gt;

&lt;p&gt;Now let's look at &lt;code&gt;FALSE&lt;/code&gt;.&lt;/p&gt;

&lt;h1&gt;FALSE&lt;/h1&gt;

&lt;p&gt;Here are literal values that will evaluate to &lt;code&gt;FALSE&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;false&lt;/code&gt; &lt;br&gt;
&lt;code&gt;'f'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'fa'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'fal'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'fals'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'false'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'n'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'no'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'of'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'off'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'0'&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Other similar options that throw errors include an unquoted &lt;code&gt;0&lt;/code&gt;, an unquoted &lt;code&gt;fa&lt;/code&gt;, and &lt;code&gt;'0.0'&lt;/code&gt;. &lt;/p&gt;

&lt;h1&gt;NULL&lt;/h1&gt;

&lt;p&gt;&lt;code&gt;NULL&lt;/code&gt; is the value PostgreSQL uses for a &lt;em&gt;missing&lt;/em&gt; or &lt;em&gt;unknown&lt;/em&gt; value. Note that &lt;code&gt;NULL&lt;/code&gt; is not equal to any value. &lt;code&gt;NULL&lt;/code&gt; isn't even equal to itself!&lt;/p&gt;

&lt;p&gt;&lt;code&gt;UNKNOWN&lt;/code&gt; evaluates to &lt;code&gt;NULL&lt;/code&gt;. Again, capitalization doesn't matter.&lt;/p&gt;

&lt;p&gt;There are no string literal values that evaluate to &lt;code&gt;NULL&lt;/code&gt;. Similar terms throw errors, including unquoted &lt;code&gt;nan&lt;/code&gt;, &lt;code&gt;none&lt;/code&gt;, and &lt;code&gt;n&lt;/code&gt;.&lt;/p&gt;

&lt;h1&gt;Advice&lt;/h1&gt;

&lt;p&gt;Stick with &lt;code&gt;TRUE&lt;/code&gt;, &lt;code&gt;FALSE&lt;/code&gt;, and &lt;code&gt;NULL&lt;/code&gt;. As the &lt;a href="https://www.postgresql.org/docs/12/datatype-boolean.html" rel="noopener noreferrer"&gt;docs&lt;/a&gt; state, "The key words TRUE and FALSE are the preferred (SQL-compliant) method for writing Boolean constants in SQL queries." &lt;/p&gt;

&lt;p&gt;Use &lt;code&gt;WHERE my_column IS NULL&lt;/code&gt; and not &lt;code&gt;WHERE my_column = NULL&lt;/code&gt; to return the rows with &lt;code&gt;NULL&lt;/code&gt; values. Remember, &lt;code&gt;NULL&lt;/code&gt; is not equal to &lt;code&gt;NULL&lt;/code&gt; in PostgreSQL. 😁&lt;/p&gt;

&lt;h1&gt;Code&lt;/h1&gt;

&lt;p&gt;Here's the code you can use to test different values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;


&lt;span class="cm"&gt;/* make the table and values*/&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;test1&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="nb"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;test1&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'I am true'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;test1&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;FALSE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'I am false'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;test1&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'I am null'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="cm"&gt;/* see the data */&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;test1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="cm"&gt;/* test it out */&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;test1&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'true'&lt;/span&gt;  


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can use &lt;code&gt;WHERE a =&lt;/code&gt; to compare against the &lt;code&gt;TRUE&lt;/code&gt; or &lt;code&gt;FALSE&lt;/code&gt; keywords or any of the quoted string values listed above (including quoted numbers such as &lt;code&gt;'1'&lt;/code&gt;). &lt;/p&gt;

&lt;p&gt;Comparing a string with &lt;code&gt;IS&lt;/code&gt; won't work. For example, &lt;code&gt;WHERE a IS 'true'&lt;/code&gt;, will cause an error.&lt;/p&gt;

&lt;p&gt;You must use &lt;code&gt;=&lt;/code&gt; or &lt;code&gt;LIKE&lt;/code&gt; to compare string values that you want to evaluate to a boolean. For example, &lt;code&gt;WHERE a = 'true'&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;However, you need to use &lt;code&gt;WHERE a IS&lt;/code&gt; to test against &lt;code&gt;NULL&lt;/code&gt; options. &lt;/p&gt;

&lt;p&gt;Fun! 😉&lt;/p&gt;

&lt;h1&gt;Wrap&lt;/h1&gt;

&lt;p&gt;I hope you found this little guide to be interesting and informative. If you did, please share it on your favorite social media so other folks can find it too. 👏&lt;/p&gt;

&lt;p&gt;I write about Python, Data Science, and other fun tech topics. Follow me and join my &lt;a href="https://dataawesome.us20.list-manage.com/subscribe?u=b694acf1df58e5bb039ce60a6&amp;amp;id=5da23b7424" rel="noopener noreferrer"&gt;Data Awesome mailing list&lt;/a&gt; if you're into that stuff.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fgf0pg6on21sraojajvaw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fgf0pg6on21sraojajvaw.jpg" alt="Truing a Wheel"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Happy PostgreSQLing! 👍 &lt;/p&gt;

</description>
      <category>sql</category>
      <category>beginners</category>
      <category>database</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Don’t Sweat the Solver Stuff</title>
      <dc:creator>Jeff Hale</dc:creator>
      <pubDate>Fri, 27 Sep 2019 01:22:43 +0000</pubDate>
      <link>https://dev.to/discdiver/don-t-sweat-the-solver-stuff-20np</link>
      <guid>https://dev.to/discdiver/don-t-sweat-the-solver-stuff-20np</guid>
      <description>&lt;p&gt;Logistic regression is the bread-and-butter algorithm for machine learning classification. If you’re a practicing or aspiring data scientist, you’ll want to know the ins and outs of how to use it. Also, Scikit-learn’s LogisticRegression is spitting out warnings about changing the default solver, so this is a great time to learn when to use which solver. 😀&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. 
Specify a solver to silence this warning.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;In this article, you’ll learn about Scikit-learn LogisticRegression solver choices and see two evaluations of them. Also, you’ll see key API options and get answers to frequently asked questions. By the end of the article, you’ll know more about logistic regression in Scikit-learn and not sweat the solver stuff. 😓&lt;/p&gt;

&lt;p&gt;I’m using Scikit-learn version 0.21.3 in this analysis.&lt;/p&gt;

&lt;h2&gt;When to use Logistic Regression&lt;/h2&gt;

&lt;p&gt;A classification problem is one in which you try to predict discrete outcomes, such as whether someone has a disease. In contrast, a regression problem is one in which you are trying to predict a value of a continuous variable, such as the sale price of a home. Although logistic regression has regression in its name, it’s an algorithm for classification problems.&lt;/p&gt;

&lt;p&gt;Logistic regression is probably the most important supervised learning classification method. It’s a fast, versatile extension of a generalized linear model.&lt;/p&gt;

&lt;p&gt;Logistic regression makes an excellent baseline algorithm. It works well when the relationship between the features and the target isn’t too complex.&lt;/p&gt;

&lt;p&gt;Logistic regression produces feature weights that are generally interpretable, which makes it especially useful when you need to be able to explain the reasons for a decision. This interpretability often comes in handy — for example, with lenders who need to justify their loan decisions.&lt;/p&gt;

&lt;p&gt;There is no closed-form solution for logistic regression problems. This is fine — we don’t use the closed-form solution for linear regression problems either, because it’s slow.&lt;/p&gt;

&lt;p&gt;Solving logistic regression is an optimization problem. Thankfully, nice folks have created several solver algorithms we can use. 😁&lt;/p&gt;


&lt;h3&gt;Solver Options&lt;/h3&gt;

&lt;p&gt;Scikit-learn ships with five different solvers. Each solver tries to find the parameter weights that minimize a cost function. Here are the five options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;newton-cg&lt;/code&gt; — A Newton method. Newton methods use an exact Hessian matrix. It's slow for large datasets because it computes the second derivatives. &lt;/li&gt;
&lt;li&gt;
&lt;code&gt;lbfgs&lt;/code&gt; — Stands for Limited-memory Broyden–Fletcher–Goldfarb–Shanno. It approximates the second derivative matrix updates with gradient evaluations. It stores only the last few updates, so it saves memory. It isn't super fast with large data sets. It will be the default solver as of Scikit-learn version 0.22.0.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://en.wikipedia.org/wiki/Coordinate_descent"&gt;&lt;code&gt;liblinear&lt;/code&gt;&lt;/a&gt; — Library for Large Linear Classification. Uses a coordinate descent algorithm. Coordinate descent is based on minimizing a multivariate function by solving univariate optimization problems in a loop. In other words, it moves toward the minimum in one direction at a time. It is the default solver prior to v0.22.0. It performs pretty well with high dimensionality. It does have a number of drawbacks. It can get stuck, is unable to run in parallel, and can only solve multi-class logistic regression with one-vs.-rest.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://hal.inria.fr/hal-00860051/document"&gt;&lt;code&gt;sag&lt;/code&gt;&lt;/a&gt; — Stochastic Average Gradient descent. A variation of gradient descent and incremental aggregated gradient approaches that uses a random sample of previous gradient values. Fast for big datasets.
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;saga&lt;/code&gt; — Extension of &lt;em&gt;sag&lt;/em&gt; that also allows for L1 regularization. Should generally train faster than &lt;em&gt;sag&lt;/em&gt;. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An excellent discussion of the different options can be found in &lt;a href="https://stackoverflow.com/a/52388406/4590385"&gt;this Stack Overflow answer&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The chart below from the &lt;a href="https://scikit-learn.org/stable/modules/linear_model.html"&gt;Scikit-learn documentation&lt;/a&gt; lists characteristics of the solvers, including the regularization penalties available.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MN11kH-u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/4fp77cl65seflvn3e2xm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MN11kH-u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/4fp77cl65seflvn3e2xm.png" alt="Scikit-learn Chart"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Why is the Default Solver Being Changed?&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;liblinear&lt;/code&gt; is fast with small datasets, but has problems with saddle points and can't be parallelized over multiple processor cores. It can only use one-vs.-rest to solve multi-class problems. It also penalizes the intercept, which isn't good for interpretation. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;lbfgs&lt;/code&gt; avoids these drawbacks, is relatively fast, and doesn't require similarly-scaled data. It's the best choice for most cases without a really large dataset. Some discussion of why the default was changed is in &lt;a href="https://github.com/scikit-learn/scikit-learn/issues/9997"&gt;this GitHub issue&lt;/a&gt;. &lt;/p&gt;
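
&lt;p&gt;If you're on a pre-0.22 release, naming the solver explicitly silences the FutureWarning. A minimal sketch on the built-in breast cancer data (the high &lt;code&gt;max_iter&lt;/code&gt; is just to let lbfgs converge on unscaled features):&lt;/p&gt;

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Pass the solver explicitly -- no warning about the changing default
clf = LogisticRegression(solver='lbfgs', max_iter=10000)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy (not a proper test-set evaluation)
```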

&lt;p&gt;Let's evaluate the Logistic Regression solvers with two prediction classification projects — one binary and one multi-class.&lt;/p&gt;

&lt;h2&gt;Solver Tests&lt;/h2&gt;

&lt;h3&gt;Binary classification solver example&lt;/h3&gt;

&lt;p&gt;First, let's look at a binary classification problem. I used the built-in &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html"&gt;scikit-learn breast_cancer dataset&lt;/a&gt;. The goal is to predict whether a breast mass is cancerous. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9cxoL-KS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/iax44u61dfr4po17tyme.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9cxoL-KS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/iax44u61dfr4po17tyme.jpg" alt="Cancer?"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The features consist of numeric data about cell nuclei. They were computed from digitized images of biopsies. The dataset contains 569 observations and 30 numeric features. I split the dataset into training and test sets and conducted a grid search on the training set with each different solver. You can access my Jupyter notebook used in all analyses at &lt;a href="https://www.kaggle.com/discdiver/logistic-regression-don-t-sweat-the-solver-stuff"&gt;Kaggle&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The most relevant code snippet is below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;solver_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'liblinear'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'newton-cg'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'lbfgs'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'sag'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'saga'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;solver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;solver_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;log_reg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;34&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;clf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_reg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cv_results_&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'mean_test_score'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;solver&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;solver_list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"  {solver} {score:.3f}"&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;liblinear 0.939&lt;br&gt;
      newton-cg 0.939&lt;br&gt;
      lbfgs 0.934&lt;br&gt;
      sag 0.911&lt;br&gt;
      saga 0.904&lt;/p&gt;

&lt;p&gt;The values for &lt;em&gt;sag&lt;/em&gt; and &lt;em&gt;saga&lt;/em&gt; lag behind their peers.&lt;/p&gt;

&lt;p&gt;After scaling the features, the solvers all perform better and &lt;em&gt;sag&lt;/em&gt; and &lt;em&gt;saga&lt;/em&gt; are just as accurate as the other solvers.&lt;/p&gt;

&lt;p&gt;liblinear 0.960&lt;br&gt;
        newton-cg 0.962&lt;br&gt;
        lbfgs 0.962&lt;br&gt;
        sag 0.962&lt;br&gt;
        saga 0.962&lt;/p&gt;
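&lt;p&gt;Here's a minimal sketch of the scaled run. The &lt;code&gt;MinMaxScaler&lt;/code&gt; is an assumption on my part; any scaler that puts the features on a similar scale will do. Note the scaler is fit on the training data only:&lt;/p&gt;

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=34)

# Fit the scaler on the training data only to avoid leaking test information
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

solver_list = ['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga']
log_reg = LogisticRegression(C=1, random_state=34)
clf = GridSearchCV(log_reg, dict(solver=solver_list), cv=5)
clf.fit(X_train_scaled, y_train)

for score, solver in zip(clf.cv_results_['mean_test_score'], solver_list):
    print(f"{solver} {score:.3f}")
```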

&lt;p&gt;Now let's look at an example with three classes.&lt;/p&gt;
&lt;h3&gt;
  
  
  Multi-class solver example
&lt;/h3&gt;

&lt;p&gt;I evaluated the logistic regression solvers in a multi-class classification problem with Scikit-learn's &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html#sklearn.datasets.load_wine"&gt;wine dataset&lt;/a&gt;. The dataset contains 178 samples and 13 numeric features. The goal is to predict the type of grapes used to make the wine from the chemical features of the wine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;solver_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'liblinear'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'newton-cg'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'lbfgs'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'sag'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'saga'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;parameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;solver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;solver_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;34&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;multi_class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"auto"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;clf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cv_results_&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'mean_test_score'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;solver&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;solver_list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"{solver}: {score:.3f}"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;liblinear: 0.962&lt;br&gt;
        newton-cg: 0.947&lt;br&gt;
        lbfgs: 0.955&lt;br&gt;
        sag: 0.699&lt;br&gt;
        saga: 0.662&lt;/p&gt;

&lt;p&gt;Scikit-learn gives a warning that the &lt;em&gt;sag&lt;/em&gt; and &lt;em&gt;saga&lt;/em&gt; models did not converge. In other words, they never arrived at a minimum point. Unsurprisingly, the results aren't so great for those solvers.&lt;/p&gt;

&lt;p&gt;Let's make a little bar chart using the Seaborn library to show the differences for this problem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1ted4_Mh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/4p28crtz8rcjs6c8fffj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1ted4_Mh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/4p28crtz8rcjs6c8fffj.png" alt="Unscaled Results"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After scaling the features between 0 and 1, &lt;em&gt;sag&lt;/em&gt; and &lt;em&gt;saga&lt;/em&gt; reach the same mean accuracy scores as the other models. &lt;/p&gt;

&lt;p&gt;liblinear: 0.955&lt;br&gt;
        newton-cg: 0.970&lt;br&gt;
        lbfgs: 0.970&lt;br&gt;
        sag: 0.970&lt;br&gt;
        saga: 0.970&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fU5DhMfh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/nnf9doru3nppvzzj5hxj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fU5DhMfh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/nnf9doru3nppvzzj5hxj.png" alt="Scaled Results"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note the caveat that both of these examples use small datasets. Also, we're not looking at memory and speed requirements in these examples.&lt;/p&gt;

&lt;p&gt;Bottom line: the forthcoming default &lt;em&gt;lbfgs&lt;/em&gt; solver is a good first choice for most cases. If you're dealing with a large dataset or want to apply L1 regularization, I suggest you start with &lt;em&gt;saga&lt;/em&gt;. Remember that &lt;em&gt;saga&lt;/em&gt; needs the features to be on a similar scale. &lt;/p&gt;

&lt;p&gt;Do you have a use case for &lt;em&gt;newton-cg&lt;/em&gt; or &lt;em&gt;sag&lt;/em&gt;? If so, please share in the comments. 💬&lt;/p&gt;

&lt;p&gt;Next, I'll demystify key parameter options for LogisticRegression in Scikit-learn.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Q7tcaKrC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/clg5my153mllg1au94y0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Q7tcaKrC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/clg5my153mllg1au94y0.jpg" alt="Logistics"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Parameters
&lt;/h2&gt;

&lt;p&gt;The Scikit-learn LogisticRegression class can take the following arguments.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;penalty&lt;/code&gt;, &lt;code&gt;dual&lt;/code&gt;, &lt;code&gt;tol&lt;/code&gt;, &lt;code&gt;C&lt;/code&gt;, &lt;code&gt;fit_intercept&lt;/code&gt;, &lt;code&gt;intercept_scaling&lt;/code&gt;, &lt;code&gt;class_weight&lt;/code&gt;, &lt;code&gt;random_state&lt;/code&gt;, &lt;code&gt;solver&lt;/code&gt;, &lt;code&gt;max_iter&lt;/code&gt;, &lt;code&gt;verbose&lt;/code&gt;, &lt;code&gt;warm_start&lt;/code&gt;, &lt;code&gt;n_jobs&lt;/code&gt;, &lt;code&gt;l1_ratio&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;I won't include all of the parameters below, just excerpts from those parameters most likely to be valuable to most folks. See the &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html"&gt;docs&lt;/a&gt; for those that are omitted. I've added additional information in &lt;em&gt;italics&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;C&lt;/code&gt; — float, optional, default = 1&lt;br&gt;
Inverse of regularization strength; smaller values mean stronger regularization. &lt;em&gt;Must be a positive value. Usually searched logarithmically: [.001, .01, .1, 1, 10, 100, 1000]&lt;/em&gt;&lt;/p&gt;
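&lt;p&gt;A quick sketch of that logarithmic search over &lt;code&gt;C&lt;/code&gt; (the breast cancer dataset and the &lt;em&gt;lbfgs&lt;/em&gt; solver here are illustrative choices):&lt;/p&gt;

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Each candidate C is 10x the previous one -- a logarithmic grid
params = {'C': [.001, .01, .1, 1, 10, 100, 1000]}
search = GridSearchCV(
    LogisticRegression(solver='lbfgs', max_iter=1000, random_state=34),
    params, cv=5)
search.fit(X, y)
print(search.best_params_)
```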

&lt;p&gt;&lt;code&gt;random_state&lt;/code&gt; : int, RandomState instance or None, optional (default=None) &lt;em&gt;Note that you must set the random state here for reproducibility.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;solver&lt;/code&gt; {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, optional (default=’liblinear’). &lt;em&gt;See the chart above for more info.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Changed in version 0.20: Default will change from ‘liblinear’ to ‘lbfgs’ in 0.22.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;multi_class&lt;/code&gt; : str, {‘ovr’, ‘multinomial’, ‘auto’}, optional (default=’ovr’)&lt;br&gt;
If the option chosen is ‘ovr’, then a binary problem is fit for each label. For ‘multinomial’ the loss minimised is the multinomial loss fit across the entire probability distribution, even when the data is binary. ‘multinomial’ is unavailable when solver=’liblinear’. ‘auto’ selects ‘ovr’ if the data is binary, or if solver=’liblinear’, and otherwise selects ‘multinomial’.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Changed in version 0.20: Default will change from ‘ovr’ to ‘auto’ in 0.22.&lt;/strong&gt; &lt;em&gt;ovr stands for one vs. rest. See further discussion below.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;l1_ratio&lt;/code&gt; : float or None, optional (default=None)&lt;br&gt;
The Elastic-Net mixing parameter, with 0 &amp;lt;= l1_ratio &amp;lt;= 1. Only used if penalty='elasticnet'. Setting l1_ratio=0 is equivalent to using penalty='l2', while setting l1_ratio=1 is equivalent to using penalty='l1'. For 0 &amp;lt; l1_ratio &amp;lt; 1, the penalty is a combination of L1 and L2. &lt;em&gt;Only for saga.&lt;/em&gt;&lt;/p&gt;
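&lt;p&gt;For instance, a sketch of an elastic-net mix with &lt;em&gt;saga&lt;/em&gt; (the dataset and the 50/50 mix are illustrative; &lt;code&gt;penalty='elasticnet'&lt;/code&gt; requires Scikit-learn 0.21+):&lt;/p&gt;

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # saga needs features on a similar scale

# l1_ratio=0.5 blends the L1 and L2 penalties equally
enet = LogisticRegression(penalty='elasticnet', solver='saga',
                          l1_ratio=0.5, max_iter=5000, random_state=34)
enet.fit(X, y)
print(f"training accuracy: {enet.score(X, y):.3f}")
```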

&lt;p&gt;&lt;em&gt;Commentary:&lt;/em&gt;&lt;br&gt;
If you have a multiclass problem, then setting &lt;code&gt;multi_class&lt;/code&gt; to &lt;code&gt;auto&lt;/code&gt; will use the multinomial option every time it's available. That's the most theoretically sound choice. &lt;code&gt;auto&lt;/code&gt; will soon be the default. &lt;/p&gt;

&lt;p&gt;Use &lt;em&gt;l1_ratio&lt;/em&gt; if you want to use some L1 regularization with the &lt;em&gt;saga&lt;/em&gt; solver. Note that, like the ElasticNet linear regression option, you can use a mix of L1 and L2 penalization.&lt;/p&gt;

&lt;p&gt;Also note that an L2 regularization of &lt;code&gt;C=1&lt;/code&gt; is applied by default. &lt;/p&gt;

&lt;p&gt;After fitting the model, the attributes are: &lt;code&gt;classes_&lt;/code&gt;, &lt;code&gt;coef_&lt;/code&gt;, &lt;code&gt;intercept_&lt;/code&gt;, and &lt;code&gt;n_iter_&lt;/code&gt;. &lt;code&gt;coef_&lt;/code&gt; contains an array of the feature weights and &lt;code&gt;intercept_&lt;/code&gt; contains the intercept term. &lt;/p&gt;

&lt;h2&gt;
  
  
  Logistic Regression FAQ:
&lt;/h2&gt;

&lt;p&gt;Now let's address those nagging questions you might have about Logistic Regression in Scikit-learn.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use LogisticRegression for a multilabel problem — meaning one output can be a member of multiple classes at once?
&lt;/h3&gt;

&lt;p&gt;Nope. Sorry, if you need that, find another classification algorithm &lt;a href="https://scikit-learn.org/stable/modules/multiclass.html"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which kind of regularization should I use?
&lt;/h3&gt;

&lt;p&gt;Regularization shifts your model toward the bias side of the bias/variance tradeoff. Regularization makes for a more generalizable logistic regression model, especially in cases with few data points. You're going to want to do a hyperparameter search over the regularization parameter &lt;em&gt;C&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;If you want to do some dimensionality reduction through regularization, use L1 regularization. L1 regularization is Manhattan or Taxicab regularization. L2 regularization is Euclidean regularization and generally performs better in generalized linear regression problems. &lt;/p&gt;

&lt;p&gt;You must use the &lt;em&gt;saga&lt;/em&gt; solver if you want to apply a mix of L1 and L2 regularization. The &lt;em&gt;liblinear&lt;/em&gt; solver requires you to use regularization. However, you could make &lt;em&gt;C&lt;/em&gt; such a large value that the regularization penalty becomes very, very small. Again, &lt;em&gt;C&lt;/em&gt; is currently set to 1 by default.&lt;/p&gt;

&lt;h3&gt;
  
  
  Should I scale the features?
&lt;/h3&gt;

&lt;p&gt;If using &lt;em&gt;sag&lt;/em&gt; and &lt;em&gt;saga&lt;/em&gt; solvers, make sure the features are on a similar scale. We saw the importance of this above. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--z46jeODs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/dabqhoc7glq0z5r8za81.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--z46jeODs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/dabqhoc7glq0z5r8za81.jpg" alt="Scale"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Should I remove outliers?
&lt;/h3&gt;

&lt;p&gt;Probably. Removing outliers will generally improve model performance. Standardizing the inputs would also reduce outliers' effects.&lt;/p&gt;

&lt;p&gt;RobustScaler can scale features and you can avoid dropping outliers. See my article discussing scaling and standardizing &lt;a href="https://towardsdatascience.com/scale-standardize-or-normalize-with-scikit-learn-6ccc7d176a02?source=friends_link&amp;amp;sk=a82c5faefadd171fe07506db4d4f29db"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which other assumptions really matter?
&lt;/h3&gt;

&lt;p&gt;Observations should be independent of each other. &lt;/p&gt;

&lt;h3&gt;
  
  
  Should I transform my features using polynomials and interactions?
&lt;/h3&gt;

&lt;p&gt;Just as in linear regression, you can use higher order polynomials and interactions. This transformation allows your model to learn a more complex decision boundary. Then, you aren't limited to a linear decision boundary. However, overfitting becomes a risk and interpreting feature importances gets trickier. It might also be more difficult for the solver to find the global minimum. &lt;/p&gt;
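&lt;p&gt;A sketch of what that transformation looks like in a pipeline (degree 2 and the breast cancer data are illustrative choices):&lt;/p&gt;

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Degree-2 expansion adds squares and pairwise interactions of the features
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LogisticRegression(solver='lbfgs', max_iter=5000, random_state=34),
)
model.fit(X, y)
print(f"training accuracy: {model.score(X, y):.3f}")
```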

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--C9M8v92a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/7kvcrcbb2mtzcb7dzlzo.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--C9M8v92a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/7kvcrcbb2mtzcb7dzlzo.jpg" alt="Cocoons"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Should I do dimensionality reduction if there are lots of features?
&lt;/h3&gt;

&lt;p&gt;Probably. Principal Components Analysis is a nice choice if interpretability isn't vital. Recursive Feature Elimination can help you remove the least important features. Alternatively, L1 regularization can drive less important feature weights to zero if you are using the &lt;em&gt;saga&lt;/em&gt; solver. &lt;/p&gt;
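&lt;p&gt;Here's a small sketch of the L1 route with &lt;em&gt;saga&lt;/em&gt;: a fairly strong penalty (small &lt;code&gt;C&lt;/code&gt;) drives some weights to exactly zero. The &lt;code&gt;C=0.05&lt;/code&gt; value is just for illustration.&lt;/p&gt;

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# A fairly strong L1 penalty (small C) zeroes out the least useful weights
sparse_lr = LogisticRegression(penalty='l1', solver='saga', C=0.05,
                               max_iter=5000, random_state=34)
sparse_lr.fit(X, y)
n_zero = np.sum(sparse_lr.coef_ == 0)
print(f"{n_zero} of {sparse_lr.coef_.size} weights are exactly zero")
```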

&lt;h3&gt;
  
  
  Is multicollinearity in my features a problem?
&lt;/h3&gt;

&lt;p&gt;It is for interpretation of the feature importances. You can't rely on the model weights to be meaningful when there is high correlation between the variables. Credit for affecting the outcome variable might go to just one of the correlated features. &lt;/p&gt;

&lt;p&gt;There are many ways to test for multicollinearity. See &lt;a href="http://www.frontiersin.org/files/EBooks/194/assets/pdf/Sweating%20the%20Small%20Stuff%20-%20Does%20data%20cleaning%20and%20testing%20of%20assumptions%20really%20matter%20in%20the%2021st%20century.pdf"&gt;Kraha et al. (2012) here&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;One popular option is to check the Variance Inflation Factor (VIF). A VIF around 5 to 10 is usually treated as problematic, but there's a &lt;a href="https://www.researchgate.net/post/Multicollinearity_issues_is_a_value_less_than_10_acceptable_for_VIF"&gt;lively debate&lt;/a&gt; as to what an appropriate VIF cutoff should be. &lt;/p&gt;

&lt;p&gt;You can compute the VIF by taking the correlation matrix, inverting it, and taking the values on the diagonal for each feature.&lt;/p&gt;
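&lt;p&gt;That computation is just a few lines of NumPy. A sketch with made-up data, where the first two features are highly correlated:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(34)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)  # nearly a copy of x1
x3 = rng.normal(size=200)                  # independent of the others

X = np.column_stack([x1, x2, x3])

# The VIFs are the diagonal of the inverted correlation matrix
corr = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(corr))
print(np.round(vif, 1))  # x1 and x2 get large VIFs, x3 stays near 1
```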

&lt;p&gt;The correlation coefficients alone are not sufficient to determine problematic multicollinearity with multiple features.&lt;/p&gt;

&lt;p&gt;If the sample size is small, &lt;a href="https://www.researchgate.net/publication/226005307_A_Caution_Regarding_Rules_of_Thumb_for_Variance_Inflation_Factors"&gt;getting more data&lt;/a&gt; might be most helpful for removing multicollinearity.&lt;/p&gt;

&lt;h3&gt;
  
  
  When should I use LogisticRegressionCV?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://scikit-learn.org/stable/modules/linear_model.html"&gt;&lt;em&gt;LogisticRegressionCV&lt;/em&gt;&lt;/a&gt; is the Scikit-learn class you want if you have a lot of data and want to speed up your calculations while doing cross-validation to tune your hyperparameters. &lt;/p&gt;
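&lt;p&gt;A minimal sketch (the dataset and &lt;code&gt;Cs=10&lt;/code&gt; grid are illustrative; &lt;code&gt;Cs&lt;/code&gt; sets how many values of &lt;em&gt;C&lt;/em&gt; are tried on a log scale):&lt;/p&gt;

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Tries 10 values of C between 1e-4 and 1e4 with 5-fold cross-validation
clf = LogisticRegressionCV(Cs=10, cv=5, solver='lbfgs',
                           max_iter=5000, random_state=34)
clf.fit(X, y)
print(f"best C: {clf.C_[0]:.4f}")
```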

&lt;h2&gt;
  
  
  Wrap
&lt;/h2&gt;

&lt;p&gt;Now you know what to do when you see the &lt;code&gt;LogisticRegression&lt;/code&gt; solver warning — and better yet, how to avoid it in the first place. No more sweat! 😅&lt;/p&gt;

&lt;p&gt;I suggest you use the upcoming default &lt;em&gt;lbfgs&lt;/em&gt; solver for most cases. If you have a lot of data or need L1 regularization, try &lt;em&gt;saga&lt;/em&gt;. Make sure you scale your features if you're using &lt;em&gt;saga&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I hope you found this discussion of logistic regression helpful. If you did, please share it on your favorite social media so other people can find it, too. 👏&lt;/p&gt;

&lt;p&gt;I write about Python, Docker, data science, and more. If any of that’s of interest to you, read more &lt;a href="https://medium.com/@jeffhale"&gt;here&lt;/a&gt; and sign up for &lt;a href="http://eepurl.com/gjfLAz"&gt;my email list&lt;/a&gt;.😄&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dataawesome.us20.list-manage.com/subscribe?u=b694acf1df58e5bb039ce60a6&amp;amp;id=5da23b7424"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Nf_yrj9B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/kccvfj14zcqvcq5c4jri.png" alt="email list"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WvrLYqWx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/zdbg1tpp0bzvs6gbsaha.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WvrLYqWx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/zdbg1tpp0bzvs6gbsaha.jpg" alt="lighthouse2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Happy logisticing! &lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
    </item>
    <item>
      <title>How to Remember Pandas Index Methods</title>
      <dc:creator>Jeff Hale</dc:creator>
      <pubDate>Fri, 19 Jul 2019 18:09:08 +0000</pubDate>
      <link>https://dev.to/discdiver/how-to-remember-pandas-index-methods-3l0d</link>
      <guid>https://dev.to/discdiver/how-to-remember-pandas-index-methods-3l0d</guid>
      <description>&lt;p&gt;When method names are similar, it's difficult to keep them separate in your mind. &lt;br&gt;
This makes remembering them harder. &lt;/p&gt;

&lt;p&gt;Pandas has a slew of methods for creating and adjusting a DataFrame index.&lt;br&gt;
This is a brief guide to help you create a little mental space between methods for easier memorization.&lt;/p&gt;

&lt;p&gt;The Jupyter Notebook is on Kaggle &lt;a href="https://www.kaggle.com/discdiver/how-to-remember-pandas-index-methods/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Make a DataFrame without specifying an index (you get a default index).
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;a&lt;/th&gt;
      &lt;th&gt;b&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;6&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;4&lt;/td&gt;
      &lt;td&gt;4&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Make a DataFrame with an index by using the &lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html" rel="noopener noreferrer"&gt;index&lt;/a&gt; keyword argument.
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;df2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;a&lt;/th&gt;
      &lt;th&gt;b&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;5&lt;/th&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;6&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;6&lt;/th&gt;
      &lt;td&gt;4&lt;/td&gt;
      &lt;td&gt;4&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Move a column to be the index with &lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html#pandas.DataFrame.set_index" rel="noopener noreferrer"&gt;.set_index()&lt;/a&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;b&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;a&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;6&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;4&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Rename the index values from scratch with &lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.html" rel="noopener noreferrer"&gt;.index&lt;/a&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;b&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;6&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;5&lt;/th&gt;
      &lt;td&gt;4&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Note that &lt;code&gt;index&lt;/code&gt; is a property of the DataFrame, not a method, so the syntax is different.&lt;/p&gt;

&lt;h3&gt;
  
  
  Nuke the index values and start over from 0 with &lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html#pandas.DataFrame.reset_index" rel="noopener noreferrer"&gt;.reset_index()&lt;/a&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df4&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df4&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;index&lt;/th&gt;
      &lt;th&gt;b&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;4&lt;/td&gt;
      &lt;td&gt;6&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;5&lt;/td&gt;
      &lt;td&gt;4&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you don't want the index to become a column, pass &lt;code&gt;drop=True&lt;/code&gt; to &lt;code&gt;reset_index()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df5&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;b&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;6&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;4&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Reorder the rows with &lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html#pandas.DataFrame.reindex" rel="noopener noreferrer"&gt;.reindex()&lt;/a&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df6&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df5&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reindex&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;df6&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;b&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;6&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;4&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;2&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Passing a value that isn't in the index results in a &lt;code&gt;NaN&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df7&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df5&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reindex&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;df7&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;b&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;6.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;4.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;5.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;2.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;6&lt;/th&gt;
      &lt;td&gt;NaN&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Advice
&lt;/h2&gt;

&lt;p&gt;Ideally, add an index when you create your DataFrame by passing the &lt;code&gt;index=&lt;/code&gt; argument. &lt;/p&gt;

&lt;p&gt;If you're reading from a .csv file, you can set an index column by passing its column number to &lt;code&gt;index_col&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;df = pd.read_csv(my_csv, index_col=3)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Or pass &lt;code&gt;index_col=False&lt;/code&gt; to keep pandas from using any column as the index.&lt;/p&gt;
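&lt;p&gt;As a quick sketch of the &lt;code&gt;index_col&lt;/code&gt; idea (the CSV contents here are made up and read from memory rather than a file):&lt;/p&gt;

```python
import io

import pandas as pd

# In-memory CSV standing in for a real .csv file (made-up data)
csv_data = io.StringIO(
    "city,year,population\n"
    "Atlanta,2019,506811\n"
    "Boston,2019,692600\n"
)

# index_col=0 uses the first column ("city") as the index
df = pd.read_csv(csv_data, index_col=0)
print(df.index.tolist())  # ['Atlanta', 'Boston']
```

&lt;p&gt;Swap in a file path for the &lt;code&gt;StringIO&lt;/code&gt; object when reading a real file.&lt;/p&gt;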

&lt;h2&gt;
  
  
  How to set or change the index:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;df.set_index()&lt;/code&gt; - move a column to the index &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;df.index&lt;/code&gt; - add an index manually&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;df.reset_index()&lt;/code&gt; - reset the index to &lt;em&gt;0, 1, 2 ...&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;df.reindex()&lt;/code&gt; - reorder the rows&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
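&lt;p&gt;The four approaches above can be sketched side by side (the column names and values are made up for illustration):&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30], "b": [2, 5, 6]})

# set_index(): move column "a" into the index
moved = df.set_index("a")

# .index: assign index labels manually (a property, so no parentheses)
manual = df.copy()
manual.index = [2, 3, 4]

# reset_index(): nuke the index and start over from 0, 1, 2 ...
reset = manual.reset_index(drop=True)

# reindex(): reorder the rows by label
reordered = reset.reindex([2, 1, 0])
print(list(reordered["b"]))  # [6, 5, 2]
```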

&lt;h2&gt;
  
  
  Word associations to remember:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;set_index()&lt;/code&gt; - move column&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;index&lt;/code&gt; - manual&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;reset_index()&lt;/code&gt; - reset&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;reindex&lt;/code&gt; - reorder&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Wrap
&lt;/h2&gt;

&lt;p&gt;I hope this article helped you create a little mental space to keep Pandas index methods straight. If it did, please give it some love so other people can find it, too.&lt;/p&gt;

&lt;p&gt;I write about Data Science, Dev Ops, Python and other stuff. Check out my other &lt;a href="https://medium.com/@jeffhale" rel="noopener noreferrer"&gt;articles&lt;/a&gt; if any of that sounds interesting.&lt;/p&gt;

&lt;p&gt;Follow me and connect:&lt;br&gt;
&lt;a href="https://medium.com/@jeffhale" rel="noopener noreferrer"&gt;Medium&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/discdiver"&gt;Dev.to&lt;/a&gt;&lt;br&gt;
&lt;a href="https://twitter.com/discdiver" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.linkedin.com/in/-jeffhale" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.kaggle.com/discdiver" rel="noopener noreferrer"&gt;Kaggle&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/discdiver" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fgd6kiza5mgsh470nruhi.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fgd6kiza5mgsh470nruhi.jpg" alt="Reset Button"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Happy indexing!&lt;/p&gt;

</description>
      <category>python</category>
      <category>tutorial</category>
      <category>machinelearning</category>
      <category>pandas</category>
    </item>
    <item>
      <title>10 Days to Become a Google Cloud Certified Professional Data Engineer</title>
      <dc:creator>Jeff Hale</dc:creator>
      <pubDate>Wed, 19 Jun 2019 21:59:43 +0000</pubDate>
      <link>https://dev.to/discdiver/10-days-to-become-a-google-cloud-certified-professional-data-engineer-4cn4</link>
      <guid>https://dev.to/discdiver/10-days-to-become-a-google-cloud-certified-professional-data-engineer-4cn4</guid>
      <description>&lt;p&gt;I recently took the updated &lt;a href="https://cloud.google.com/certification/data-engineer" rel="noopener noreferrer"&gt;Google Cloud Certified Professional Data Engineer exam&lt;/a&gt;. Studying for the test is a great way to learn the data engineering process with Google Cloud.&lt;/p&gt;

&lt;p&gt;I recommend studying for the exam if you want to use Google Cloud products and:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;are a data engineer&lt;/li&gt;
&lt;li&gt;want to become a data engineer&lt;/li&gt;
&lt;li&gt;want to build a tech company&lt;/li&gt;
&lt;li&gt;are a data scientist and want to understand the whole data pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article I’ll share the what, why, and how to help you take your best shot at the exam. 🎯&lt;/p&gt;

&lt;h3&gt;
  
  
  Why
&lt;/h3&gt;

&lt;p&gt;Let’s tackle the &lt;em&gt;why&lt;/em&gt; first. I decided to take the Google Cloud Certified Professional Data Engineer exam for two reasons. First, I wanted to learn more about Google Cloud products for data engineering and machine learning. Second, I wanted to pass the exam and demonstrate that I’d learned the information. 😃&lt;/p&gt;

&lt;p&gt;I chose a Google exam over offerings from AWS and Microsoft Azure for a few reasons. First, Google is the leading cloud provider in terms of machine learning and AI. They are also the platform I would use if I were starting a company in the space.&lt;/p&gt;

&lt;p&gt;Compared to the other major cloud services, Google has the clearest help docs and the best UX. They also have the lowest prices for &lt;a href="https://towardsdatascience.com/maximize-your-gpu-dollars-a9133f4e546a" rel="noopener noreferrer"&gt;GPUs&lt;/a&gt; and the &lt;a href="https://cloud.google.com/tpu/" rel="noopener noreferrer"&gt;most powerful machines&lt;/a&gt; for training deep learning models.&lt;/p&gt;

&lt;p&gt;Additionally, the Google exam has good study materials available — which we’ll dig into below. It’s also a professional level exam, which means that it’s difficult, but passage signifies the highest level of mastery. Finally, the Professional Data Engineer test was updated in March 2019, so I figured it should be more relevant than an older, un-updated exam.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fup1nse4tjr5it09lavgt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fup1nse4tjr5it09lavgt.png" width="400" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you’re a data person and prefer AWS, check out the &lt;a href="https://aws.amazon.com/certification/certified-machine-learning-specialty/" rel="noopener noreferrer"&gt;Machine Learning&lt;/a&gt; and &lt;a href="https://aws.amazon.com/certification/certified-machine-learning-specialty/" rel="noopener noreferrer"&gt;Big Data&lt;/a&gt; &lt;a href="https://aws.amazon.com/certification/?nav=tc&amp;amp;loc=3" rel="noopener noreferrer"&gt;specialty certificate&lt;/a&gt; exams. They are $300 each, plus $40 per practice exam.&lt;/p&gt;

&lt;p&gt;If you’re into Microsoft Azure, they have two exams that must be passed to attain the &lt;a href="https://www.microsoft.com/en-us/learning/azure-data-engineer.aspx" rel="noopener noreferrer"&gt;Certified: Azure Data Engineer Associate&lt;/a&gt; designation. The Azure exams have a revamp date of June 21, 2019.&lt;/p&gt;

&lt;h3&gt;
  
  
  Study Plan
&lt;/h3&gt;

&lt;p&gt;As context, I’d used a number of Google Cloud products, but didn’t know the difference between BigQuery and Bigtable before I started studying for the exam. I also hadn’t done much data engineering work.&lt;/p&gt;

&lt;p&gt;This isn’t the kind of test you can cram for in a day or two. Hardly anyone could be prepared for this exam without a good bit of studying; the number of Google products and their options changes so fast.&lt;/p&gt;

&lt;p&gt;Here are the resources I used to study for &lt;a href="https://cloud.google.com/certification/data-engineer" rel="noopener noreferrer"&gt;the exam&lt;/a&gt;. The format below is inspired by &lt;a href="https://medium.com/u/dbc019e228f5" rel="noopener noreferrer"&gt;Daniel Bourke&lt;/a&gt;’s helpful post that I used as a guide for my study plan.&lt;/p&gt;

&lt;h3&gt;
  
  
  Linux Academy
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgzgua9tyw1ou4i9zllp6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgzgua9tyw1ou4i9zllp6.png" width="460" height="110"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Helpfulness&lt;/strong&gt; : 7.5/10&lt;/p&gt;

&lt;p&gt;Linux Academy’s Google Cloud Certified Professional Data Engineer &lt;a href="https://linuxacademy.com/google-cloud-platform/training/course/name/google-cloud-data-engineer" rel="noopener noreferrer"&gt;course&lt;/a&gt; had good content. The course has videos, quizzes, a &lt;a href="https://www.lucidchart.com/documents/view/0ca44a63-4ea4-4d78-8367-2465512d21be/1" rel="noopener noreferrer"&gt;Lucid Chart e-book&lt;/a&gt;, and a final exam. Linux Academy provides free GCP practice time. It also has a helpful community Slack channel.&lt;/p&gt;

&lt;p&gt;I took a legal pad worth of notes as I studied — and most of them came from the Linux Academy videos.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcl7aamw4a7axcc0it5y9.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcl7aamw4a7axcc0it5y9.jpeg" width="640" height="480"&gt;&lt;/a&gt;A legal pad before studying.&lt;/p&gt;

&lt;p&gt;The course wasn’t updated for the new test as of early June 2019, so it wasn’t as helpful as it could have been. The instructor said the materials will probably be totally updated in late June 2019.&lt;/p&gt;

&lt;p&gt;The Linux Academy final exam took a number of questions from the official Google practice exam. Don’t put much faith in the final exam results if you are taking the test in mid-June 2019; the course isn’t totally updated, and the actual exam questions felt more difficult.&lt;/p&gt;

&lt;p&gt;Overall, the UX isn’t bad, but there are some minor annoying issues (for example, the video is either full screen or tiny).&lt;/p&gt;

&lt;p&gt;Bottom line: Linux Academy makes a great base, but you might want to wait until their training materials are updated to start studying for the exam.&lt;/p&gt;

&lt;p&gt;Linux Academy is $49 a month, paid monthly, with a 7-day free trial.&lt;/p&gt;

&lt;h3&gt;
  
  
  Qwicklabs
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjw2dofe11rhclji7xg6m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjw2dofe11rhclji7xg6m.png" width="278" height="70"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Helpfulness: 5.5/10&lt;/p&gt;

&lt;p&gt;The Qwicklabs exercises aren’t focused on the exam. I found this nice for overall learning, but not all that helpful if you’re trying to figure out what you need to learn for the test.&lt;/p&gt;

&lt;p&gt;Like Linux Academy, Qwicklabs provides a Google Cloud sandbox for practice. Qwicklabs checks your progress in the sandbox, which is nice. It doesn’t have videos.&lt;/p&gt;

&lt;p&gt;The UX is alright. The countdown timer for each lesson is a bit distracting and pressure-inducing — however, there is a countdown timer on the actual Google exam, too. The Qwicklabs timer is quite large — I suggest moving that part of the window offscreen if it’s distracting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F99nc51qoiuxx16mr6906.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F99nc51qoiuxx16mr6906.png" width="800" height="167"&gt;&lt;/a&gt;Qwicklabs countdown timer example&lt;/p&gt;

&lt;p&gt;When doing interactive exercises, I recommend setting up your windows side-by-side — one for instruction and one for your work in GCP.&lt;/p&gt;

&lt;p&gt;Qwicklabs courses cost credits that you can purchase. You can purchase a monthly unlimited &lt;a href="https://www.qwiklabs.com/payments/pricing" rel="noopener noreferrer"&gt;Qwicklabs subscription&lt;/a&gt; for $55 a month. Discount codes may be available at &lt;a href="https://medium.com/u/ba857441758a" rel="noopener noreferrer"&gt;sathish vj&lt;/a&gt;’s post&lt;a href="https://medium.com/@sathishvj/qwiklabs-free-codes-gcp-and-aws-e40f3855ffdb" rel="noopener noreferrer"&gt; here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I recommend doing Linux Academy first and then using Qwicklabs for more practice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Udemy
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz44p5d8tmmb0sw548ks4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz44p5d8tmmb0sw548ks4.png" width="250" height="100"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Helpfulness: 5.5/10&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.udemy.com/google-cloud-certified-professional-data-engineer-practice-exams/" rel="noopener noreferrer"&gt;This resource&lt;/a&gt; consists of just three 50 question practice exams with a timer. The practice exams had a few updated questions, but still had old case study questions. They used the same Google official practice exam questions as Linux academy. Several questions had grammatical issues. Also, several questions were now incorrect. For example, now there is a BigQuery ML K-means algorithm.&lt;/p&gt;

&lt;p&gt;I did learn things by taking the exam and reviewing the answers. The answers were detailed and linked to source documents. Just don’t put much faith in the score. The real exam feels far harder. 😄&lt;/p&gt;

&lt;p&gt;Overall, these exams aren’t great, but I found them worth the time and money because there were few good options.&lt;/p&gt;

&lt;p&gt;$9.99 for a one-time purchase (price may change — I saw it for $10.99 first).&lt;/p&gt;

&lt;h3&gt;
  
  
  Coursera
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0mfroac3pwq5powmr52v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0mfroac3pwq5powmr52v.png" width="294" height="64"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Google recommends taking the &lt;a href="https://www.coursera.org/learn/preparing-cloud-professional-data-engineer-exam?utm_source=googlecloud&amp;amp;utm_medium=institutions&amp;amp;utm_campaign=GoogleCloud_Cert_Prep_PDE" rel="noopener noreferrer"&gt;Coursera Data Engineering, Big Data, and Machine Learning on GCP Specialization&lt;/a&gt;. This specialization consists of five Coursera courses. I decided not to take it because it looked like it hadn’t been updated for the revised exam — it referenced the old exam case studies. In hindsight, I would have taken these courses because they look quite thorough.&lt;/p&gt;

&lt;h3&gt;
  
  
  Official Practice Exam
&lt;/h3&gt;

&lt;p&gt;Helpfulness: 5.5/10&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://cloud.google.com/certification/practice-exam/data-engineer" rel="noopener noreferrer"&gt;official Google practice exam&lt;/a&gt; is available online as a mini-version of the real exam. The questions are the most relevant; I just wish there were more of them. As noted above, the questions are also used by several other folks in their practice exams.&lt;/p&gt;

&lt;p&gt;You have to fill out a form to take the practice exam, but it’s free.&lt;/p&gt;

&lt;h3&gt;
  
  
  Other Good Resources
&lt;/h3&gt;

&lt;p&gt;Here are the cheat sheets, blog posts, and other resources I used to study for the exam.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maverick Lin’s cheatsheet &lt;a href="https://github.com/ml874/Data-Engineering-on-GCP-Cheatsheet/blob/master/data_engineering_on_GCP.pdf" rel="noopener noreferrer"&gt;here&lt;/a&gt; is very good, but pre the March exam refresh.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/u/faf29a77ec33" rel="noopener noreferrer"&gt;Guang X&lt;/a&gt;’s &lt;a href="https://medium.com/weareservian/google-cloud-data-engineer-exam-study-guide-9afc80be2ee3" rel="noopener noreferrer"&gt;here&lt;/a&gt; is pre-updated exam.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/u/857e12d5597a" rel="noopener noreferrer"&gt;Dmitri Lerko&lt;/a&gt;’s post &lt;a href="https://deploy.live/blog/google-cloud-certified-professional-data-engineer/" rel="noopener noreferrer"&gt;here&lt;/a&gt; reflects the updated exam.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/u/138b47b69562" rel="noopener noreferrer"&gt;Chetan Sharma&lt;/a&gt;’s post &lt;a href="https://medium.com/@chetansharma90" rel="noopener noreferrer"&gt;here&lt;/a&gt; also reflects the updated exam.&lt;/li&gt;
&lt;li&gt;The official Google Cloud docs are expansive. You’ll certainly want to spend some time taking notes from them. Not all the latest material is on the exam, but it’s all good to learn. 😃 Here are the &lt;a href="https://cloud.google.com/bigquery/docs/" rel="noopener noreferrer"&gt;BigQuery&lt;/a&gt; docs, for example.&lt;/li&gt;
&lt;li&gt;The official Google Cloud blog is &lt;a href="https://cloud.google.com/blog/products/gcp" rel="noopener noreferrer"&gt;here&lt;/a&gt;. It’s worth spending some time with it to help you understand topics you might find challenging.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F46x6h6cptmvhbb2m4iyv.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F46x6h6cptmvhbb2m4iyv.jpeg" width="800" height="410"&gt;&lt;/a&gt;So many things learn!&lt;/p&gt;

&lt;p&gt;Do you have other resources that you found helpful? Please share them in the comments or send them to me on Twitter &lt;a class="mentioned-user" href="https://dev.to/discdiver"&gt;@discdiver&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;One thing I found unnecessarily difficult was determining how updated study materials were. To make this easier, I suggested to Google that they should version their certification exams — just as most software follows &lt;a href="https://semver.org/" rel="noopener noreferrer"&gt;semantic versioning&lt;/a&gt;. A version label like 1.1 could make it easy for training material providers to indicate which test version their materials match. This could save test-takers time and avoid frustration. If you think this is a good idea, please let Google know. You can tweet to them &lt;a href="https://twitter.com/GCPcloud" rel="noopener noreferrer"&gt;&lt;strong&gt;@GCPcloud&lt;/strong&gt;&lt;/a&gt;. 😃&lt;/p&gt;

&lt;p&gt;For what it’s worth, I generally take tests well and am confident in my ability to learn with self-directed study. If self-directed study isn’t your thing, and your budget allows, you might want to take &lt;a href="https://cloud.google.com/certification/data-engineer" rel="noopener noreferrer"&gt;in-person courses&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now let’s turn to the test.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Exam
&lt;/h3&gt;

&lt;p&gt;The exam consists of 50 multiple-choice questions. You have two hours to complete it. You’re able to mark questions for later review and revisit all questions before submitting the test.&lt;/p&gt;

&lt;p&gt;Rumor has it that you need about 70% correct to pass the exam. However, there is not an official published passing score. &lt;a href="https://cloud.google.com/certification/faqs/#0" rel="noopener noreferrer"&gt;Google says&lt;/a&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Not all questions may be scored.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;At any given time, a small number of questions on our exams may be unscored. These are newly developed questions that are being evaluated for their effectiveness. This is a standard practice in the testing industry.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ol start="2"&gt;
&lt;li&gt;The score needed to pass is confidential.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;The passing score for each exam is confidential. It is determined by a panel of internal and external subject matter experts, following an industry-accepted standard setting process. The passing score is applied equally to all examinees. It is re-evaluated when changes are made to the exam content.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You never learn your score, just whether you passed or failed. If you pass the test, your certification is good for two years.&lt;/p&gt;

&lt;p&gt;The exam will cost you $200. If you don’t pass, you can take it again for another $200 in 14 days. If you don’t pass on your second try, you need to wait 60 days and pay again.&lt;/p&gt;

&lt;p&gt;Here’s &lt;a href="https://cloud.google.com/certification/guides/data-engineer/" rel="noopener noreferrer"&gt;the official test overview&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3l7krrhfhf8ve07z0abt.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3l7krrhfhf8ve07z0abt.jpeg" width="800" height="532"&gt;&lt;/a&gt;What do you see in the crystal?&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Know When You’re Ready?
&lt;/h3&gt;

&lt;p&gt;If you decide to study for the Google Cloud Certified Professional Data Engineer exam, it’s hard to know when you’re ready to take the test. It’s tricky because there are few good test simulations and you don’t even know what you need to pass!&lt;/p&gt;

&lt;p&gt;As with most things in life, practice improves your chances of performing well. Take as many practice exams as you can and review the results. You want to feel confident that you know the concepts, pitfalls, and best practices.&lt;/p&gt;

&lt;p&gt;I originally planned to study for a month or so, but I decided to push it hard. On the sixth day I tried to register to take the exam the next day, but the testing center was booked. I decided to take a few more days to study and spend time with family in town over the weekend.&lt;/p&gt;

&lt;p&gt;I ended up with 10 days of pretty intense study and a few days break in the middle. I felt decently prepared on test day. I hadn’t memorized every IAM role for every resource, but I had a good understanding of best practices with key products.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Test Experience
&lt;/h3&gt;

&lt;p&gt;You take the exam on a computer at a testing center. You’ll have to leave your phone and other personal belongings with the proctor. You’ll be video recorded during the test. Other people will probably be in the same room taking other exams.&lt;/p&gt;

&lt;p&gt;Earplugs, scratch paper, and pencils are provided. It sounds silly, but if you’re not an earplug wearer, you may want to practice with them ahead of time. I suggest you don’t press start until they are firmly in your ears.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vki4xhxge07tm0gc4db.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vki4xhxge07tm0gc4db.jpeg" width="800" height="532"&gt;&lt;/a&gt;Ears.&lt;/p&gt;

&lt;p&gt;I had read that the test would be difficult. It was still way harder than I thought it would be. It felt like the hardest test I’ve ever taken, and I’ve taken the SAT, ACT, GMAT, GRE, LSAT and several certification exams. For what it’s worth, this was my first exam from a cloud provider.&lt;/p&gt;

&lt;p&gt;The test is difficult for several reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The breadth of material is vast. There are lots of Google products and lots of potential questions about each product and how they work together. There are over 200 Google Cloud APIs. This exam doesn’t cover all of them, but it covers a bunch.&lt;/li&gt;
&lt;li&gt;The exam also tests your knowledge of several Apache open source products related to Google’s offerings.&lt;/li&gt;
&lt;li&gt;It’s not even clear exactly how many Google products could be on the exam because new products are always being added and products are being changed.&lt;/li&gt;
&lt;li&gt;The questions are often multi-line, requiring consideration of multiple variables and intense concentration.&lt;/li&gt;
&lt;li&gt;Some questions have multiple answers required (if more than one answer is required, the number of answers is specified).&lt;/li&gt;
&lt;li&gt;Many answers are somewhat correct. You need to choose the best answer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The exam will test you in more ways than one. When I took it, I just tried to stay focused and not let the voice of self-doubt enter my head.&lt;/p&gt;

&lt;p&gt;I had about 30 minutes left after my first pass through the questions. I marked seven answers for review. After reviewing, I had 10 minutes to spare. I clicked &lt;em&gt;submit&lt;/em&gt; knowing I had tried my best and the chips would fall where they may.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31v0f6p6b3pr5rsjfp32.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31v0f6p6b3pr5rsjfp32.jpeg" width="640" height="425"&gt;&lt;/a&gt;Poker chips.&lt;/p&gt;

&lt;p&gt;On the next screen I saw I had provisionally &lt;em&gt;passed&lt;/em&gt;. 😃 I collected my belongings from the proctor and headed out.&lt;/p&gt;

&lt;p&gt;I received an email from Google the next day that I had officially passed. It included a code for some free swag. I would have preferred a less expensive test, but now I’ve got some humiliswag.&lt;/p&gt;

&lt;p&gt;I plan to write about Google tools for data ingestion, processing, storage, and machine learning in a future article. Follow &lt;a href="http://medium.com/@jeffhale" rel="noopener noreferrer"&gt;me&lt;/a&gt; to make sure you don’t miss it. Now I’ll mention what I didn’t see on the exam.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I Didn’t See
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;As many IAM questions as I expected. There were a bunch on the various practice tests.&lt;/li&gt;
&lt;li&gt;Questions on exact product costs. Just know which options make sense when you’re more cost sensitive or less cost sensitive.&lt;/li&gt;
&lt;li&gt;Firestore questions.&lt;/li&gt;
&lt;li&gt;AI Hub questions.&lt;/li&gt;
&lt;li&gt;Many ML concept questions. I went into the test knowing ML concepts better than Google database products, so perhaps this explains why this part of the test didn’t loom large to me.&lt;/li&gt;
&lt;li&gt;Many questions with code samples.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Wrap
&lt;/h3&gt;

&lt;p&gt;It makes sense to study for this exam if you want to learn more about Google’s data science and engineering products and you have the time to devote to it. This exam doesn’t have you writing actual queries or cleaning data, so you’ll want to look elsewhere to develop those skills.&lt;/p&gt;

&lt;p&gt;If you aren’t already a GCP pro, I guarantee you’ll learn things if you put the time in to study for the exam.&lt;/p&gt;

&lt;p&gt;The way I look at it, if you pass the test, great. If you don’t, that’s okay. Either way, you’ll learn a bunch, and that’s most important. 😃&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8unky52qvf5g3cje0fb.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8unky52qvf5g3cje0fb.jpeg" width="800" height="532"&gt;&lt;/a&gt;It’s the climb.&lt;/p&gt;

&lt;p&gt;Speaking of learning, I hope you found this article helpful for your learning. If you did, please share it on your favorite social media channel. 👍&lt;/p&gt;

&lt;p&gt;I help folks learn about cloud computing, data science, and other tech topics. Check out &lt;a href="https://medium.com/@jeffhale" rel="noopener noreferrer"&gt;my other articles&lt;/a&gt; if you’re into that stuff.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://eepurl.com/gjfLAz" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fettjveljwybv51rexa9n.png" width="800" height="255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Happy studying! 📙&lt;/p&gt;




</description>
      <category>cloud</category>
      <category>database</category>
      <category>dataengineering</category>
      <category>google</category>
    </item>
  </channel>
</rss>
