<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jeff Hale</title>
    <description>The latest articles on DEV Community by Jeff Hale (@discdiver).</description>
    <link>https://dev.to/discdiver</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F143744%2F0ada7f64-e32c-484b-8038-5d8f41bed0aa.jpg</url>
      <title>DEV Community: Jeff Hale</title>
      <link>https://dev.to/discdiver</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/discdiver"/>
    <language>en</language>
    <item>
      <title>The Weird World of Missing Values in Pandas</title>
      <dc:creator>Jeff Hale</dc:creator>
      <pubDate>Fri, 22 Nov 2019 20:07:30 +0000</pubDate>
      <link>https://dev.to/discdiver/the-weird-world-of-missing-values-in-pandas-3kph</link>
      <guid>https://dev.to/discdiver/the-weird-world-of-missing-values-in-pandas-3kph</guid>
      <description>&lt;p&gt;If you use the Python pandas library for data science and data analysis things, you'll eventually see &lt;code&gt;NaN&lt;/code&gt;, &lt;code&gt;NaT&lt;/code&gt;, and &lt;code&gt;None&lt;/code&gt; in your DataFrame. These values all represent missing data. However, there are subtle and not-so-subtle differences in how they behave and when they appear..&lt;/p&gt;

&lt;p&gt;Let's take a look at the three types of missing values and learn how to find them. &lt;/p&gt;

&lt;h1&gt;&lt;code&gt;NaN&lt;/code&gt;, &lt;code&gt;NaT&lt;/code&gt;, and &lt;code&gt;None&lt;/code&gt;&lt;/h1&gt;

&lt;h2&gt;&lt;code&gt;NaN&lt;/code&gt;&lt;/h2&gt;

&lt;p&gt;If a column is numeric and you have a missing value, that value will be a &lt;code&gt;NaN&lt;/code&gt;. &lt;code&gt;NaN&lt;/code&gt; stands for &lt;em&gt;Not a Number&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;NaN&lt;/code&gt;s are always floats. So if an integer column has a &lt;code&gt;NaN&lt;/code&gt; added to it, the column is upcast to become a &lt;code&gt;float&lt;/code&gt; column. This behavior may seem strange, but it reflects NumPy's capabilities as of this writing: NumPy's integer dtypes have no way to represent a missing value, while floats can hold a &lt;code&gt;NaN&lt;/code&gt;, so pandas upcasts. The pandas dev team is hoping NumPy will provide a native NA solution soon.&lt;/p&gt;
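
&lt;p&gt;Here's a minimal sketch of that upcasting with toy data (I'm using the lowercase &lt;code&gt;np.nan&lt;/code&gt; spelling):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
print(s.dtype)  # int64

# Append a missing value -- the whole column is upcast to float
s_with_nan = pd.concat([s, pd.Series([np.nan])], ignore_index=True)
print(s_with_nan.dtype)  # float64
```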

&lt;h2&gt;&lt;code&gt;NaT&lt;/code&gt;&lt;/h2&gt;

&lt;p&gt;If a column is a DateTime and you have a missing value, then that value will be a &lt;code&gt;NaT&lt;/code&gt;. &lt;code&gt;NaT&lt;/code&gt; stands for &lt;em&gt;Not a Time&lt;/em&gt;. &lt;/p&gt;

&lt;h2&gt;&lt;code&gt;None&lt;/code&gt;&lt;/h2&gt;

&lt;p&gt;A pandas &lt;code&gt;object&lt;/code&gt; dtype column - the dtype for strings as of this writing - can hold &lt;code&gt;None&lt;/code&gt;, &lt;code&gt;NaN&lt;/code&gt;, &lt;code&gt;NaT&lt;/code&gt; or all three at the same time! &lt;/p&gt;
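
&lt;p&gt;A quick sketch with a toy Series (the values are illustrative):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# An object-dtype Series can hold all three missing value types at once
s = pd.Series(['a', np.nan, pd.NaT, None], dtype='object')
print(s.dtype)  # object
print(s.tolist())
```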

&lt;h2&gt;What are these NaN values anyway?&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;NaN&lt;/code&gt; is a NumPy value. &lt;code&gt;np.NaN&lt;/code&gt;&lt;br&gt;
&lt;code&gt;NaT&lt;/code&gt; is a Pandas value. &lt;code&gt;pd.NaT&lt;/code&gt;&lt;br&gt;
&lt;code&gt;None&lt;/code&gt; is a vanilla Python value. &lt;code&gt;None&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;However, they display in a DataFrame as &lt;code&gt;NaN&lt;/code&gt;, &lt;code&gt;NaT&lt;/code&gt;, and &lt;code&gt;None&lt;/code&gt;. &lt;/p&gt;

&lt;h1&gt;Strange Things Are Afoot with Missing Values&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fi6hux9a8w3ch7o3nvrqx.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fi6hux9a8w3ch7o3nvrqx.gif" alt="Strange Things are Afoot gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Behavior with missing values can get weird. Let's make a Series with each type of missing value.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NaN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NaT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0   NaT
1   NaT
2   NaT
dtype: datetime64[ns]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Pandas created the Series with a datetime64[ns] dtype. Ok. &lt;/p&gt;

&lt;p&gt;You can cast it to an object dtype if you like.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NaN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NaT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0    NaT
1    NaT
2    NaT
dtype: object
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;But you can't cast it to a numeric dtype.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NaN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NaT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;float&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

    ---------------------------------------------------------------------------

    TypeError                                 Traceback (most recent call last)

    &amp;lt;ipython-input-255-66ec4de18835&amp;gt; in &amp;lt;module&amp;gt;
    ----&amp;gt; 1 pd.Series([np.NaN, pd.NaT, None]).astype('float')

 ...


    TypeError: cannot astype a datetimelike from [datetime64[ns]] to [float64]


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Also note that you can change an object column containing &lt;code&gt;None&lt;/code&gt;s into a numeric column with &lt;code&gt;pd.to_numeric&lt;/code&gt;. No problem.&lt;/p&gt;
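
&lt;p&gt;For example, a minimal sketch:&lt;/p&gt;

```python
import pandas as pd

s = pd.Series([1, None, 3], dtype='object')
print(s.dtype)  # object

numeric = pd.to_numeric(s)
print(numeric.dtype)  # float64 -- the None became NaN
```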

&lt;h3&gt;Equality Check&lt;/h3&gt;

&lt;p&gt;Another bizarre thing about missing values in Pandas is that some varieties are equal to themselves and others aren't.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;NaN&lt;/code&gt; doesn't equal &lt;code&gt;NaN&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NaN&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NaN&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

    False


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And &lt;code&gt;NaT&lt;/code&gt; doesn't equal &lt;code&gt;NaT&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NaT&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NaT&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

    False


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;But &lt;code&gt;None&lt;/code&gt; does equal &lt;code&gt;None&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

    True


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Fun! 😁&lt;/p&gt;

&lt;p&gt;Now let's turn our attention to finding missing values.&lt;/p&gt;

&lt;h3&gt;Finding Missing Values with &lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isna.html" rel="noopener noreferrer"&gt;df.isna()&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;df.isna()&lt;/code&gt; to find &lt;code&gt;NaN&lt;/code&gt;, &lt;code&gt;NaT&lt;/code&gt;, and &lt;code&gt;None&lt;/code&gt; values. They all evaluate to &lt;code&gt;True&lt;/code&gt; with this method. &lt;/p&gt;

&lt;p&gt;Calling &lt;code&gt;df.isna()&lt;/code&gt; on a DataFrame returns a boolean DataFrame; calling it on a Series returns a boolean Series.&lt;/p&gt;

&lt;p&gt;Let's see &lt;code&gt;df.isna()&lt;/code&gt; in action! Here's a DataFrame with all three types of missing values:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F5e2i0mew255roo3jcc0p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F5e2i0mew255roo3jcc0p.png" alt="DataFrame with all three types of missing values"&gt;&lt;/a&gt;&lt;/p&gt;
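
&lt;p&gt;The exact data in the screenshot isn't reproduced here, but a DataFrame like it can be sketched as follows (the column names are my own):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'floats': [1.0, np.nan],                        # NaN in a numeric column
    'dates': [pd.Timestamp('2019-11-22'), pd.NaT],  # NaT in a datetime column
    'objects': ['hello', None],                     # None in an object column
})
print(df)
```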

&lt;p&gt;Here's the code to return a boolean DataFrame with &lt;code&gt;True&lt;/code&gt; for missing values.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fpago9zyumlco37qcdgw5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fpago9zyumlco37qcdgw5.png" alt="boolean DataFrame image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A one-liner to return a DataFrame of all your missing values is pretty cool. Deciding what to do with those missing values is a whole nother question that I'll be exploring in my upcoming Memorable Pandas book.&lt;/p&gt;

&lt;p&gt;Note that it's totally fine to have all three Pandas missing value types in your DataFrame at the same time, assuming you are okay with missing values. &lt;/p&gt;

&lt;h1&gt;Wrap&lt;/h1&gt;

&lt;p&gt;I hope you found this intro to missing values in the Python pandas library to be useful. 😀 &lt;/p&gt;

&lt;p&gt;If you did, please do all the nice things on Dev and share it on your favorite social media so other people can find it, too. 👏&lt;/p&gt;

&lt;p&gt;I write about Python, Docker, and data science things. Check out &lt;a href="https://jeffhale.net" rel="noopener noreferrer"&gt;my other guides&lt;/a&gt; if you're into that stuff. 👍 &lt;/p&gt;

&lt;p&gt;You don't want to MISS them! (&lt;em&gt;Missing values&lt;/em&gt;. Get it?) 🙄&lt;/p&gt;

&lt;p&gt;Thanks to Kevin Markham of &lt;a href="https://www.dataschool.io/" rel="noopener noreferrer"&gt;Data School&lt;/a&gt; for suggestions on an earlier version of this article!&lt;/p&gt;

</description>
      <category>python</category>
      <category>pandas</category>
      <category>codenewbie</category>
      <category>beginners</category>
    </item>
    <item>
      <title>The True Guide to True and False in PostgreSQL</title>
      <dc:creator>Jeff Hale</dc:creator>
      <pubDate>Wed, 23 Oct 2019 19:21:25 +0000</pubDate>
      <link>https://dev.to/discdiver/the-true-guide-to-true-and-false-in-postgresql-1p69</link>
      <guid>https://dev.to/discdiver/the-true-guide-to-true-and-false-in-postgresql-1p69</guid>
      <description>&lt;p&gt;&lt;code&gt;TRUE&lt;/code&gt;, &lt;code&gt;FALSE&lt;/code&gt;, and &lt;code&gt;NULL&lt;/code&gt; are the possible boolean values in PostgreSQL. &lt;/p&gt;

&lt;p&gt;Surprisingly, there are a bunch of different values you can use for &lt;code&gt;TRUE&lt;/code&gt; and &lt;code&gt;FALSE&lt;/code&gt; - and one alternative for &lt;code&gt;NULL&lt;/code&gt;. Also surprisingly, some values you'd expect might work, don't work. &lt;/p&gt;

&lt;p&gt;Let's check out &lt;code&gt;TRUE&lt;/code&gt; first.&lt;/p&gt;

&lt;h1&gt;TRUE&lt;/h1&gt;

&lt;p&gt;The following literal values evaluate to &lt;code&gt;TRUE&lt;/code&gt;. Note that case doesn't matter.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;true&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'t'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'tr'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'tru'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'true'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'y'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'ye'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'yes'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'on'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'1'&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Other similar options, such as an unquoted &lt;code&gt;1&lt;/code&gt; or an unquoted &lt;code&gt;tru&lt;/code&gt;, will cause an error.&lt;/p&gt;

&lt;p&gt;Now let's look at &lt;code&gt;FALSE&lt;/code&gt;.&lt;/p&gt;

&lt;h1&gt;FALSE&lt;/h1&gt;

&lt;p&gt;Here are literal values that will evaluate to &lt;code&gt;FALSE&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;false&lt;/code&gt; &lt;br&gt;
&lt;code&gt;'f'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'fa'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'fal'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'fals'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'false'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'n'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'no'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'of'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'off'&lt;/code&gt;&lt;br&gt;
&lt;code&gt;'0'&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Other similar options that throw errors include an unquoted &lt;code&gt;0&lt;/code&gt;, an unquoted &lt;code&gt;fa&lt;/code&gt;, and &lt;code&gt;'0.0'&lt;/code&gt;. &lt;/p&gt;

&lt;h1&gt;NULL&lt;/h1&gt;

&lt;p&gt;&lt;code&gt;NULL&lt;/code&gt; is the value PostgreSQL uses for a &lt;em&gt;missing&lt;/em&gt; or &lt;em&gt;unknown&lt;/em&gt; value. Note that &lt;code&gt;NULL&lt;/code&gt; is not equal to any value. &lt;code&gt;NULL&lt;/code&gt; isn't even equal to itself!&lt;/p&gt;

&lt;p&gt;&lt;code&gt;UNKNOWN&lt;/code&gt; evaluates to &lt;code&gt;NULL&lt;/code&gt;. Again, capitalization doesn't matter.&lt;/p&gt;

&lt;p&gt;There are no string literal values that evaluate to &lt;code&gt;NULL&lt;/code&gt;. Similar terms throw errors, including unquoted &lt;code&gt;nan&lt;/code&gt;, &lt;code&gt;none&lt;/code&gt;, and &lt;code&gt;n&lt;/code&gt;.&lt;/p&gt;

&lt;h1&gt;Advice&lt;/h1&gt;

&lt;p&gt;Stick with &lt;code&gt;TRUE&lt;/code&gt;, &lt;code&gt;FALSE&lt;/code&gt;, and &lt;code&gt;NULL&lt;/code&gt;. As the &lt;a href="https://www.postgresql.org/docs/12/datatype-boolean.html" rel="noopener noreferrer"&gt;docs&lt;/a&gt; state, "The key words TRUE and FALSE are the preferred (SQL-compliant) method for writing Boolean constants in SQL queries." &lt;/p&gt;

&lt;p&gt;Use &lt;code&gt;WHERE my_column IS NULL&lt;/code&gt; and not &lt;code&gt;WHERE my_column = NULL&lt;/code&gt; to return the rows with &lt;code&gt;NULL&lt;/code&gt; values. Remember, &lt;code&gt;NULL&lt;/code&gt; is not equal to &lt;code&gt;NULL&lt;/code&gt; in PostgreSQL. 😁&lt;/p&gt;

&lt;h1&gt;Code&lt;/h1&gt;

&lt;p&gt;Here's the code you can use to test different values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;


&lt;span class="cm"&gt;/* make the table and values*/&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;test1&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="nb"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;test1&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'I am true'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;test1&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;FALSE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'I am false'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;test1&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'I am null'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="cm"&gt;/* see the data */&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;test1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="cm"&gt;/* test it out */&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;test1&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'true'&lt;/span&gt;  


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can use &lt;code&gt;WHERE a =&lt;/code&gt; to compare against the &lt;code&gt;TRUE&lt;/code&gt; or &lt;code&gt;FALSE&lt;/code&gt; keywords or any of the quoted string values listed above (including quoted numbers such as &lt;code&gt;'1'&lt;/code&gt;). &lt;/p&gt;

&lt;p&gt;Comparing a string with &lt;code&gt;IS&lt;/code&gt; won't work. For example, &lt;code&gt;WHERE a IS 'true'&lt;/code&gt;, will cause an error.&lt;/p&gt;

&lt;p&gt;You must use &lt;code&gt;=&lt;/code&gt; or &lt;code&gt;LIKE&lt;/code&gt; to compare string values that you want to evaluate to a boolean. For example, &lt;code&gt;WHERE a = 'true'&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;However, you need to use &lt;code&gt;WHERE a IS&lt;/code&gt; to test against &lt;code&gt;NULL&lt;/code&gt; options. &lt;/p&gt;

&lt;p&gt;Fun! 😉&lt;/p&gt;

&lt;h1&gt;Wrap&lt;/h1&gt;

&lt;p&gt;I hope you found this little guide to be interesting and informative. If you did, please share it on your favorite social media so other folks can find it too. 👏&lt;/p&gt;

&lt;p&gt;I write about Python, Data Science, and other fun tech topics. Follow me and join my &lt;a href="https://dataawesome.us20.list-manage.com/subscribe?u=b694acf1df58e5bb039ce60a6&amp;amp;id=5da23b7424" rel="noopener noreferrer"&gt;Data Awesome mailing list&lt;/a&gt; if you're into that stuff.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fgf0pg6on21sraojajvaw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fgf0pg6on21sraojajvaw.jpg" alt="Truing a Wheel"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Happy PostgreSQLing! 👍 &lt;/p&gt;

</description>
      <category>sql</category>
      <category>beginners</category>
      <category>database</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Don’t Sweat the Solver Stuff</title>
      <dc:creator>Jeff Hale</dc:creator>
      <pubDate>Fri, 27 Sep 2019 01:22:43 +0000</pubDate>
      <link>https://dev.to/discdiver/don-t-sweat-the-solver-stuff-20np</link>
      <guid>https://dev.to/discdiver/don-t-sweat-the-solver-stuff-20np</guid>
      <description>&lt;p&gt;Logistic regression is the bread-and-butter algorithm for machine learning classification. If you’re a practicing or aspiring data scientist, you’ll want to know the ins and outs of how to use it. Also, Scikit-learn’s LogisticRegression is spitting out warnings about changing the default solver, so this is a great time to learn when to use which solver. 😀&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. 
Specify a solver to silence this warning.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;In this article, you’ll learn about Scikit-learn LogisticRegression solver choices and see two evaluations of them. Also, you’ll see key API options and get answers to frequently asked questions. By the end of the article, you’ll know more about logistic regression in Scikit-learn and not sweat the solver stuff. 😓&lt;/p&gt;

&lt;p&gt;I’m using Scikit-learn version 0.21.3 in this analysis.&lt;/p&gt;

&lt;h2&gt;When to use Logistic Regression&lt;/h2&gt;

&lt;p&gt;A classification problem is one in which you try to predict discrete outcomes, such as whether someone has a disease. In contrast, a regression problem is one in which you are trying to predict a value of a continuous variable, such as the sale price of a home. Although logistic regression has regression in its name, it’s an algorithm for classification problems.&lt;/p&gt;

&lt;p&gt;Logistic regression is probably the most important supervised learning classification method. It’s a fast, versatile extension of a generalized linear model.&lt;/p&gt;

&lt;p&gt;Logistic regression makes an excellent baseline algorithm. It works well when the relationship between the features and the target isn’t too complex.&lt;/p&gt;

&lt;p&gt;Logistic regression produces feature weights that are generally interpretable, which makes it especially useful when you need to be able to explain the reasons for a decision. This interpretability often comes in handy — for example, with lenders who need to justify their loan decisions.&lt;/p&gt;

&lt;p&gt;There is no closed-form solution for logistic regression problems. This is fine — we don’t use the closed-form solution for linear regression problems either, because it’s slow.&lt;/p&gt;

&lt;p&gt;Solving logistic regression is an optimization problem. Thankfully, nice folks have created several solver algorithms we can use. 😁&lt;/p&gt;


&lt;h3&gt;Solver Options&lt;/h3&gt;

&lt;p&gt;Scikit-learn ships with five different solvers. Each solver tries to find the parameter weights that minimize a cost function. Here are the five options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;newton-cg&lt;/code&gt; — A Newton method. Newton methods use an exact Hessian matrix. It's slow for large datasets because it computes the second derivatives. &lt;/li&gt;
&lt;li&gt;
&lt;code&gt;lbfgs&lt;/code&gt; — Stands for Limited-memory Broyden–Fletcher–Goldfarb–Shanno. It approximates the second derivative matrix updates with gradient evaluations. It stores only the last few updates, so it saves memory. It isn't super fast with large data sets. It will be the default solver as of Scikit-learn version 0.22.0.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://en.wikipedia.org/wiki/Coordinate_descent"&gt;&lt;code&gt;liblinear&lt;/code&gt;&lt;/a&gt; — Library for Large Linear Classification. Uses a coordinate descent algorithm. Coordinate descent is based on minimizing a multivariate function by solving univariate optimization problems in a loop. In other words, it moves toward the minimum in one direction at a time. It is the default solver prior to v0.22.0. It performs pretty well with high dimensionality. It does have a number of drawbacks. It can get stuck, is unable to run in parallel, and can only solve multi-class logistic regression with one-vs.-rest.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://hal.inria.fr/hal-00860051/document"&gt;&lt;code&gt;sag&lt;/code&gt;&lt;/a&gt; — Stochastic Average Gradient descent. A variation of gradient descent and incremental aggregated gradient approaches that uses a random sample of previous gradient values. Fast for big datasets.
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;saga&lt;/code&gt; — Extension of &lt;em&gt;sag&lt;/em&gt; that also allows for L1 regularization. Should generally train faster than &lt;em&gt;sag&lt;/em&gt;. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An excellent discussion of the different options can be found in &lt;a href="https://stackoverflow.com/a/52388406/4590385"&gt;this Stack Overflow answer&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The chart below from the &lt;a href="https://scikit-learn.org/stable/modules/linear_model.html"&gt;Scikit-learn documentation&lt;/a&gt; lists characteristics of the solvers, including the regularization penalties available.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MN11kH-u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/4fp77cl65seflvn3e2xm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MN11kH-u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/4fp77cl65seflvn3e2xm.png" alt="Scikit-learn Chart"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Why is the Default Solver Being Changed?&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;liblinear&lt;/code&gt; is fast with small datasets, but has problems with saddle points and can't be parallelized over multiple processor cores. It can only use one-vs.-rest to solve multi-class problems. It also penalizes the intercept, which isn't good for interpretation. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;lbfgs&lt;/code&gt; avoids these drawbacks, is relatively fast, and doesn't require similarly-scaled data. It's the best choice for most cases without a really large dataset. Some discussion of why the default was changed is in &lt;a href="https://github.com/scikit-learn/scikit-learn/issues/9997"&gt;this GitHub issue&lt;/a&gt;. &lt;/p&gt;
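
&lt;p&gt;If you're on a pre-0.22 release, naming the solver explicitly silences the FutureWarning. A minimal sketch on the built-in breast cancer data (the high &lt;code&gt;max_iter&lt;/code&gt; is just to let lbfgs converge on unscaled features):&lt;/p&gt;

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Pass the solver explicitly -- no warning about the changing default
clf = LogisticRegression(solver='lbfgs', max_iter=10000)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy (not a proper test-set evaluation)
```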

&lt;p&gt;Let's evaluate the Logistic Regression solvers with two prediction classification projects — one binary and one multi-class.&lt;/p&gt;

&lt;h2&gt;Solver Tests&lt;/h2&gt;

&lt;h3&gt;Binary classification solver example&lt;/h3&gt;

&lt;p&gt;First, let's look at a binary classification problem. I used the built-in &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html"&gt;scikit-learn breast_cancer dataset&lt;/a&gt;. The goal is to predict whether a breast mass is cancerous. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9cxoL-KS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/iax44u61dfr4po17tyme.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9cxoL-KS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/iax44u61dfr4po17tyme.jpg" alt="Cancer?"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The features consist of numeric data about cell nuclei. They were computed from digitized images of biopsies. The dataset contains 569 observations and 30 numeric features. I split the dataset into training and test sets and conducted a grid search on the training set with each different solver. You can access my Jupyter notebook used in all analyses at &lt;a href="https://www.kaggle.com/discdiver/logistic-regression-don-t-sweat-the-solver-stuff"&gt;Kaggle&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The most relevant code snippet is below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;solver_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'liblinear'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'newton-cg'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'lbfgs'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'sag'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'saga'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;solver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;solver_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;log_reg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;34&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;clf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_reg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cv_results_&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'mean_test_score'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;solver&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;solver_list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"  {solver} {score:.3f}"&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;liblinear 0.939&lt;br&gt;
      newton-cg 0.939&lt;br&gt;
      lbfgs 0.934&lt;br&gt;
      sag 0.911&lt;br&gt;
      saga 0.904&lt;/p&gt;

&lt;p&gt;The values for &lt;em&gt;sag&lt;/em&gt; and &lt;em&gt;saga&lt;/em&gt; lag behind their peers.&lt;/p&gt;

&lt;p&gt;After scaling the features, the solvers all perform better and &lt;em&gt;sag&lt;/em&gt; and &lt;em&gt;saga&lt;/em&gt; are just as accurate as the other solvers.&lt;/p&gt;

&lt;p&gt;liblinear 0.960&lt;br&gt;
        newton-cg 0.962&lt;br&gt;
        lbfgs 0.962&lt;br&gt;
        sag 0.962&lt;br&gt;
        saga 0.962&lt;/p&gt;
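&lt;p&gt;Here's a minimal sketch of the scaled run. The &lt;code&gt;MinMaxScaler&lt;/code&gt; is an assumption on my part; any scaler that puts the features on a similar scale will do. Note the scaler is fit on the training data only:&lt;/p&gt;

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=34)

# Fit the scaler on the training data only to avoid leaking test information
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

solver_list = ['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga']
log_reg = LogisticRegression(C=1, random_state=34)
clf = GridSearchCV(log_reg, dict(solver=solver_list), cv=5)
clf.fit(X_train_scaled, y_train)

for score, solver in zip(clf.cv_results_['mean_test_score'], solver_list):
    print(f"{solver} {score:.3f}")
```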

&lt;p&gt;Now let's look at an example with three classes.&lt;/p&gt;
&lt;h3&gt;
  
  
  Multi-class solver example
&lt;/h3&gt;

&lt;p&gt;I evaluated the logistic regression solvers in a multi-class classification problem with Scikit-learn's &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html#sklearn.datasets.load_wine"&gt;wine dataset&lt;/a&gt;. The dataset contains 178 samples and 13 numeric features. The goal is to predict the type of grapes used to make the wine from the chemical features of the wine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;solver_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'liblinear'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'newton-cg'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'lbfgs'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'sag'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'saga'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;parameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;solver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;solver_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;34&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;multi_class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"auto"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;clf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cv_results_&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'mean_test_score'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;solver&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;solver_list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"{solver}: {score:.3f}"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;liblinear: 0.962&lt;br&gt;
        newton-cg: 0.947&lt;br&gt;
        lbfgs: 0.955&lt;br&gt;
        sag: 0.699&lt;br&gt;
        saga: 0.662&lt;/p&gt;

&lt;p&gt;Scikit-learn gives a warning that the &lt;em&gt;sag&lt;/em&gt; and &lt;em&gt;saga&lt;/em&gt; models did not converge. In other words, they never arrived at a minimum point. Unsurprisingly, the results aren't so great for those solvers.&lt;/p&gt;

&lt;p&gt;Let's make a little bar chart using the Seaborn library to show the differences for this problem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1ted4_Mh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/4p28crtz8rcjs6c8fffj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1ted4_Mh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/4p28crtz8rcjs6c8fffj.png" alt="Unscaled Results"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After scaling the features between 0 and 1, &lt;em&gt;sag&lt;/em&gt; and &lt;em&gt;saga&lt;/em&gt; reach the same mean accuracy scores as the other models. &lt;/p&gt;

&lt;p&gt;liblinear: 0.955&lt;br&gt;
        newton-cg: 0.970&lt;br&gt;
        lbfgs: 0.970&lt;br&gt;
        sag: 0.970&lt;br&gt;
        saga: 0.970&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fU5DhMfh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/nnf9doru3nppvzzj5hxj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fU5DhMfh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/nnf9doru3nppvzzj5hxj.png" alt="Scaled Results"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note the caveat that both of these examples use small datasets. Also, we're not looking at memory and speed requirements in these examples.&lt;/p&gt;

&lt;p&gt;Bottom line: the forthcoming default &lt;em&gt;lbfgs&lt;/em&gt; solver is a good first choice for most cases. If you're dealing with a large dataset or want to apply L1 regularization, I suggest you start with &lt;em&gt;saga&lt;/em&gt;. Remember that &lt;em&gt;saga&lt;/em&gt; needs the features to be on a similar scale. &lt;/p&gt;

&lt;p&gt;Do you have a use case for &lt;em&gt;newton-cg&lt;/em&gt; or &lt;em&gt;sag&lt;/em&gt;? If so, please share in the comments. 💬&lt;/p&gt;

&lt;p&gt;Next, I'll demystify key parameter options for LogisticRegression in Scikit-learn.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Q7tcaKrC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/clg5my153mllg1au94y0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Q7tcaKrC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/clg5my153mllg1au94y0.jpg" alt="Logistics"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Parameters
&lt;/h2&gt;

&lt;p&gt;The Scikit-learn LogisticRegression class can take the following arguments.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;penalty&lt;/code&gt;, &lt;code&gt;dual&lt;/code&gt;, &lt;code&gt;tol&lt;/code&gt;, &lt;code&gt;C&lt;/code&gt;, &lt;code&gt;fit_intercept&lt;/code&gt;, &lt;code&gt;intercept_scaling&lt;/code&gt;, &lt;code&gt;class_weight&lt;/code&gt;, &lt;code&gt;random_state&lt;/code&gt;, &lt;code&gt;solver&lt;/code&gt;, &lt;code&gt;max_iter&lt;/code&gt;, &lt;code&gt;verbose&lt;/code&gt;, &lt;code&gt;warm_start&lt;/code&gt;, &lt;code&gt;n_jobs&lt;/code&gt;, &lt;code&gt;l1_ratio&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;I won't include all of the parameters below, just excerpts from those parameters most likely to be valuable to most folks. See the &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html"&gt;docs&lt;/a&gt; for those that are omitted. I've added additional information in &lt;em&gt;italics&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;C&lt;/code&gt; — float, optional, default = 1&lt;br&gt;
Inverse of regularization strength; smaller values mean stronger regularization. &lt;em&gt;Must be a positive value. Usually searched logarithmically: [.001, .01, .1, 1, 10, 100, 1000]&lt;/em&gt;&lt;/p&gt;
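&lt;p&gt;A quick sketch of that logarithmic search over &lt;code&gt;C&lt;/code&gt; (the breast cancer dataset and the &lt;em&gt;lbfgs&lt;/em&gt; solver here are illustrative choices):&lt;/p&gt;

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Each candidate C is 10x the previous one -- a logarithmic grid
params = {'C': [.001, .01, .1, 1, 10, 100, 1000]}
search = GridSearchCV(
    LogisticRegression(solver='lbfgs', max_iter=1000, random_state=34),
    params, cv=5)
search.fit(X, y)
print(search.best_params_)
```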

&lt;p&gt;&lt;code&gt;random_state&lt;/code&gt; : int, RandomState instance or None, optional (default=None) &lt;em&gt;Note that you must set the random state here for reproducibility.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;solver&lt;/code&gt; {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, optional (default=’liblinear’). &lt;em&gt;See the chart above for more info.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Changed in version 0.20: Default will change from ‘liblinear’ to ‘lbfgs’ in 0.22.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;multi_class&lt;/code&gt; : str, {‘ovr’, ‘multinomial’, ‘auto’}, optional (default=’ovr’)&lt;br&gt;
If the option chosen is ‘ovr’, then a binary problem is fit for each label. For ‘multinomial’ the loss minimised is the multinomial loss fit across the entire probability distribution, even when the data is binary. ‘multinomial’ is unavailable when solver=’liblinear’. ‘auto’ selects ‘ovr’ if the data is binary, or if solver=’liblinear’, and otherwise selects ‘multinomial’.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Changed in version 0.20: Default will change from ‘ovr’ to ‘auto’ in 0.22.&lt;/strong&gt; &lt;em&gt;ovr stands for one vs. rest. See further discussion below.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;l1_ratio&lt;/code&gt; : float or None, optional (default=None)&lt;br&gt;
The Elastic-Net mixing parameter, with 0 &amp;lt;= l1_ratio &amp;lt;= 1. Only used if penalty='elasticnet'. Setting l1_ratio=0 is equivalent to using penalty='l2', while setting l1_ratio=1 is equivalent to using penalty='l1'. For 0 &amp;lt; l1_ratio &amp;lt; 1, the penalty is a combination of L1 and L2. &lt;em&gt;Only for saga.&lt;/em&gt;&lt;/p&gt;
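&lt;p&gt;For instance, a sketch of an elastic-net mix with &lt;em&gt;saga&lt;/em&gt; (the dataset and the 50/50 mix are illustrative; &lt;code&gt;penalty='elasticnet'&lt;/code&gt; requires Scikit-learn 0.21+):&lt;/p&gt;

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # saga needs features on a similar scale

# l1_ratio=0.5 blends the L1 and L2 penalties equally
enet = LogisticRegression(penalty='elasticnet', solver='saga',
                          l1_ratio=0.5, max_iter=5000, random_state=34)
enet.fit(X, y)
print(f"training accuracy: {enet.score(X, y):.3f}")
```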

&lt;p&gt;&lt;em&gt;Commentary:&lt;/em&gt;&lt;br&gt;
If you have a multiclass problem, then setting &lt;code&gt;multi_class&lt;/code&gt; to &lt;code&gt;auto&lt;/code&gt; will use the multinomial option every time it's available. That's the most theoretically sound choice. &lt;code&gt;auto&lt;/code&gt; will soon be the default. &lt;/p&gt;

&lt;p&gt;Use &lt;em&gt;l1_ratio&lt;/em&gt; if you want to use some L1 regularization with the &lt;em&gt;saga&lt;/em&gt; solver. Note that, like the ElasticNet linear regression option, you can use a mix of L1 and L2 penalization.&lt;/p&gt;

&lt;p&gt;Also note that an L2 regularization of &lt;code&gt;C=1&lt;/code&gt; is applied by default. &lt;/p&gt;

&lt;p&gt;After fitting the model, the attributes are: &lt;code&gt;classes_&lt;/code&gt;, &lt;code&gt;coef_&lt;/code&gt;, &lt;code&gt;intercept_&lt;/code&gt;, and &lt;code&gt;n_iter_&lt;/code&gt;. &lt;code&gt;coef_&lt;/code&gt; contains an array of the feature weights and &lt;code&gt;intercept_&lt;/code&gt; contains the intercept term. &lt;/p&gt;

&lt;h2&gt;
  
  
  Logistic Regression FAQ:
&lt;/h2&gt;

&lt;p&gt;Now let's address those nagging questions you might have about Logistic Regression in Scikit-learn.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use LogisticRegression for a multilabel problem — meaning one output can be a member of multiple classes at once?
&lt;/h3&gt;

&lt;p&gt;Nope. Sorry, if you need that, find another classification algorithm &lt;a href="https://scikit-learn.org/stable/modules/multiclass.html"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which kind of regularization should I use?
&lt;/h3&gt;

&lt;p&gt;Regularization shifts your model toward the bias side of the bias/variance tradeoff. Regularization makes for a more generalizable logistic regression model, especially in cases with few data points. You're going to want to do a hyperparameter search over the regularization parameter &lt;em&gt;C&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;If you want to do some dimensionality reduction through regularization, use L1 regularization. L1 regularization is Manhattan or Taxicab regularization. L2 regularization is Euclidean regularization and generally performs better in generalized linear regression problems. &lt;/p&gt;

&lt;p&gt;You must use the &lt;em&gt;saga&lt;/em&gt; solver if you want to apply a mix of L1 and L2 regularization. The &lt;em&gt;liblinear&lt;/em&gt; solver requires you to use regularization. However, you could make &lt;em&gt;C&lt;/em&gt; such a large value that the regularization penalty becomes very, very small. Again, &lt;em&gt;C&lt;/em&gt; is currently set to 1 by default.&lt;/p&gt;

&lt;h3&gt;
  
  
  Should I scale the features?
&lt;/h3&gt;

&lt;p&gt;If using &lt;em&gt;sag&lt;/em&gt; and &lt;em&gt;saga&lt;/em&gt; solvers, make sure the features are on a similar scale. We saw the importance of this above. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--z46jeODs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/dabqhoc7glq0z5r8za81.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--z46jeODs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/dabqhoc7glq0z5r8za81.jpg" alt="Scale"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Should I remove outliers?
&lt;/h3&gt;

&lt;p&gt;Probably. Removing outliers will generally improve model performance. Standardizing the inputs would also reduce outliers' effects.&lt;/p&gt;

&lt;p&gt;RobustScaler can scale features and you can avoid dropping outliers. See my article discussing scaling and standardizing &lt;a href="https://towardsdatascience.com/scale-standardize-or-normalize-with-scikit-learn-6ccc7d176a02?source=friends_link&amp;amp;sk=a82c5faefadd171fe07506db4d4f29db"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which other assumptions really matter?
&lt;/h3&gt;

&lt;p&gt;Observations should be independent of each other. &lt;/p&gt;

&lt;h3&gt;
  
  
  Should I transform my features using polynomials and interactions?
&lt;/h3&gt;

&lt;p&gt;Just as in linear regression, you can use higher order polynomials and interactions. This transformation allows your model to learn a more complex decision boundary. Then, you aren't limited to a linear decision boundary. However, overfitting becomes a risk and interpreting feature importances gets trickier. It might also be more difficult for the solver to find the global minimum. &lt;/p&gt;
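&lt;p&gt;A sketch of what that transformation looks like in a pipeline (degree 2 and the breast cancer data are illustrative choices):&lt;/p&gt;

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Degree-2 expansion adds squares and pairwise interactions of the features
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LogisticRegression(solver='lbfgs', max_iter=5000, random_state=34),
)
model.fit(X, y)
print(f"training accuracy: {model.score(X, y):.3f}")
```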

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--C9M8v92a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/7kvcrcbb2mtzcb7dzlzo.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--C9M8v92a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/7kvcrcbb2mtzcb7dzlzo.jpg" alt="Cocoons"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Should I do dimensionality reduction if there are lots of features?
&lt;/h3&gt;

&lt;p&gt;Probably. Principal Components Analysis is a nice choice if interpretability isn't vital. Recursive Feature Elimination can help you remove the least important features. Alternatively, L1 regularization can drive less important feature weights to zero if you are using the &lt;em&gt;saga&lt;/em&gt; solver. &lt;/p&gt;
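&lt;p&gt;Here's a small sketch of the L1 route with &lt;em&gt;saga&lt;/em&gt;: a fairly strong penalty (small &lt;code&gt;C&lt;/code&gt;) drives some weights to exactly zero. The &lt;code&gt;C=0.05&lt;/code&gt; value is just for illustration.&lt;/p&gt;

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# A fairly strong L1 penalty (small C) zeroes out the least useful weights
sparse_lr = LogisticRegression(penalty='l1', solver='saga', C=0.05,
                               max_iter=5000, random_state=34)
sparse_lr.fit(X, y)
n_zero = np.sum(sparse_lr.coef_ == 0)
print(f"{n_zero} of {sparse_lr.coef_.size} weights are exactly zero")
```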

&lt;h3&gt;
  
  
  Is multicollinearity in my features a problem?
&lt;/h3&gt;

&lt;p&gt;It is for interpretation of the feature importances. You can't rely on the model weights to be meaningful when there is high correlation between the variables. Credit for affecting the outcome variable might go to just one of the correlated features. &lt;/p&gt;

&lt;p&gt;There are many ways to test for multicollinearity. See &lt;a href="http://www.frontiersin.org/files/EBooks/194/assets/pdf/Sweating%20the%20Small%20Stuff%20-%20Does%20data%20cleaning%20and%20testing%20of%20assumptions%20really%20matter%20in%20the%2021st%20century.pdf"&gt;Kraha et al. (2012) here&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;One popular option is to check the Variance Inflation Factor (VIF). A VIF around 5 to 10 is usually treated as problematic, but there's a &lt;a href="https://www.researchgate.net/post/Multicollinearity_issues_is_a_value_less_than_10_acceptable_for_VIF"&gt;lively debate&lt;/a&gt; as to what an appropriate VIF cutoff should be. &lt;/p&gt;

&lt;p&gt;You can compute the VIF by taking the correlation matrix, inverting it, and taking the values on the diagonal for each feature.&lt;/p&gt;
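&lt;p&gt;That computation is just a few lines of NumPy. A sketch with made-up data, where the first two features are highly correlated:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(34)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)  # nearly a copy of x1
x3 = rng.normal(size=200)                  # independent of the others

X = np.column_stack([x1, x2, x3])

# The VIFs are the diagonal of the inverted correlation matrix
corr = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(corr))
print(np.round(vif, 1))  # x1 and x2 get large VIFs, x3 stays near 1
```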

&lt;p&gt;The correlation coefficients alone are not sufficient to determine problematic multicollinearity with multiple features.&lt;/p&gt;

&lt;p&gt;If the sample size is small, &lt;a href="https://www.researchgate.net/publication/226005307_A_Caution_Regarding_Rules_of_Thumb_for_Variance_Inflation_Factors"&gt;getting more data&lt;/a&gt; might be most helpful for removing multicollinearity.&lt;/p&gt;

&lt;h3&gt;
  
  
  When should I use LogisticRegressionCV?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://scikit-learn.org/stable/modules/linear_model.html"&gt;&lt;em&gt;LogisticRegressionCV&lt;/em&gt;&lt;/a&gt; is the Scikit-learn class you want if you have a lot of data and want to speed up your calculations while doing cross-validation to tune your hyperparameters. &lt;/p&gt;
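&lt;p&gt;A minimal sketch (the dataset and &lt;code&gt;Cs=10&lt;/code&gt; grid are illustrative; &lt;code&gt;Cs&lt;/code&gt; sets how many values of &lt;em&gt;C&lt;/em&gt; are tried on a log scale):&lt;/p&gt;

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Tries 10 values of C between 1e-4 and 1e4 with 5-fold cross-validation
clf = LogisticRegressionCV(Cs=10, cv=5, solver='lbfgs',
                           max_iter=5000, random_state=34)
clf.fit(X, y)
print(f"best C: {clf.C_[0]:.4f}")
```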

&lt;h2&gt;
  
  
  Wrap
&lt;/h2&gt;

&lt;p&gt;Now you know what to do when you see the &lt;code&gt;LogisticRegression&lt;/code&gt; solver warning — and better yet, how to avoid it in the first place. No more sweat! 😅&lt;/p&gt;

&lt;p&gt;I suggest you use the upcoming default &lt;em&gt;lbfgs&lt;/em&gt; solver for most cases. If you have a lot of data or need L1 regularization, try &lt;em&gt;saga&lt;/em&gt;. Make sure you scale your features if you're using &lt;em&gt;saga&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I hope you found this discussion of logistic regression helpful. If you did, please share it on your favorite social media so other people can find it, too. 👏&lt;/p&gt;

&lt;p&gt;I write about Python, Docker, data science, and more. If any of that’s of interest to you, read more &lt;a href="https://medium.com/@jeffhale"&gt;here&lt;/a&gt; and sign up for &lt;a href="http://eepurl.com/gjfLAz"&gt;my email list&lt;/a&gt;.😄&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dataawesome.us20.list-manage.com/subscribe?u=b694acf1df58e5bb039ce60a6&amp;amp;id=5da23b7424"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Nf_yrj9B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/kccvfj14zcqvcq5c4jri.png" alt="email list"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WvrLYqWx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/zdbg1tpp0bzvs6gbsaha.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WvrLYqWx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/zdbg1tpp0bzvs6gbsaha.jpg" alt="lighthouse2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Happy logisticing! &lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
    </item>
    <item>
      <title>How to Remember Pandas Index Methods</title>
      <dc:creator>Jeff Hale</dc:creator>
      <pubDate>Fri, 19 Jul 2019 18:09:08 +0000</pubDate>
      <link>https://dev.to/discdiver/how-to-remember-pandas-index-methods-3l0d</link>
      <guid>https://dev.to/discdiver/how-to-remember-pandas-index-methods-3l0d</guid>
      <description>&lt;p&gt;When method names are similar, it's difficult to keep them separate in your mind. &lt;br&gt;
This makes remembering them harder. &lt;/p&gt;

&lt;p&gt;Pandas has a slew of methods for creating and adjusting a DataFrame index.&lt;br&gt;
This is a brief guide to help you create a little mental space between methods for easier memorization.&lt;/p&gt;

&lt;p&gt;The Jupyter Notebook is on Kaggle &lt;a href="https://www.kaggle.com/discdiver/how-to-remember-pandas-index-methods/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Make a DataFrame without specifying an index (you get a default index).
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;a&lt;/th&gt;
      &lt;th&gt;b&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;6&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;4&lt;/td&gt;
      &lt;td&gt;4&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Make a DataFrame with an index by using the &lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html" rel="noopener noreferrer"&gt;index&lt;/a&gt; keyword argument.
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;df2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;a&lt;/th&gt;
      &lt;th&gt;b&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;5&lt;/th&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;6&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;6&lt;/th&gt;
      &lt;td&gt;4&lt;/td&gt;
      &lt;td&gt;4&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Move a column to be the index with &lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html#pandas.DataFrame.set_index" rel="noopener noreferrer"&gt;.set_index()&lt;/a&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;b&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;a&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;6&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;4&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Rename the index values from scratch with &lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.html" rel="noopener noreferrer"&gt;.index&lt;/a&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;b&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;6&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;5&lt;/th&gt;
      &lt;td&gt;4&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Note that &lt;code&gt;index&lt;/code&gt; is a property of the DataFrame, not a method, so the syntax is different.&lt;/p&gt;

&lt;h3&gt;
  
  
  Nuke the index values and start over from 0 with &lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html#pandas.DataFrame.reset_index" rel="noopener noreferrer"&gt;.reset_index()&lt;/a&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df4&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df4&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;index&lt;/th&gt;
      &lt;th&gt;b&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;4&lt;/td&gt;
      &lt;td&gt;6&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;5&lt;/td&gt;
      &lt;td&gt;4&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you don't want the index to become a column, pass &lt;code&gt;drop=True&lt;/code&gt; to &lt;code&gt;reset_index()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df5&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;b&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;6&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;4&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Reorder the rows with &lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html#pandas.DataFrame.reindex" rel="noopener noreferrer"&gt;.reindex()&lt;/a&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df6&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df5&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reindex&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;df6&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;b&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;6&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;4&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;2&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Passing a value that isn't in the index results in a &lt;code&gt;NaN&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df7&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df5&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reindex&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;df7&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;b&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;6.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;4.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;5.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;2.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;6&lt;/th&gt;
      &lt;td&gt;NaN&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Advice
&lt;/h2&gt;

&lt;p&gt;Ideally, add an index when you create your DataFrame by passing the &lt;code&gt;index=&lt;/code&gt; argument. &lt;/p&gt;

&lt;p&gt;If you're reading from a .csv file, you can set an index column by passing its column number to &lt;code&gt;index_col&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;df = pd.read_csv(my_csv, index_col=3)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Or pass &lt;code&gt;index_col=False&lt;/code&gt; to keep pandas from using any column as the index.&lt;/p&gt;
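&lt;p&gt;As a quick sketch of the &lt;code&gt;index_col&lt;/code&gt; idea (the CSV contents here are made up and read from memory rather than a file):&lt;/p&gt;

```python
import io

import pandas as pd

# In-memory CSV standing in for a real .csv file (made-up data)
csv_data = io.StringIO(
    "city,year,population\n"
    "Atlanta,2019,506811\n"
    "Boston,2019,692600\n"
)

# index_col=0 uses the first column ("city") as the index
df = pd.read_csv(csv_data, index_col=0)
print(df.index.tolist())  # ['Atlanta', 'Boston']
```

&lt;p&gt;Swap in a file path for the &lt;code&gt;StringIO&lt;/code&gt; object when reading a real file.&lt;/p&gt;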

&lt;h2&gt;
  
  
  How to set or change the index:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;df.set_index()&lt;/code&gt; - move a column to the index &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;df.index&lt;/code&gt; - add an index manually&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;df.reset_index()&lt;/code&gt; - reset the index to &lt;em&gt;0, 1, 2 ...&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;df.reindex()&lt;/code&gt; - reorder the rows&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
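&lt;p&gt;The four approaches above can be sketched side by side (the column names and values are made up for illustration):&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30], "b": [2, 5, 6]})

# set_index(): move column "a" into the index
moved = df.set_index("a")

# .index: assign index labels manually (a property, so no parentheses)
manual = df.copy()
manual.index = [2, 3, 4]

# reset_index(): nuke the index and start over from 0, 1, 2 ...
reset = manual.reset_index(drop=True)

# reindex(): reorder the rows by label
reordered = reset.reindex([2, 1, 0])
print(list(reordered["b"]))  # [6, 5, 2]
```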

&lt;h2&gt;
  
  
  Word associations to remember:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;set_index()&lt;/code&gt; - move column&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;index&lt;/code&gt; - manual&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;reset_index()&lt;/code&gt; - reset&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;reindex&lt;/code&gt; - reorder&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Wrap
&lt;/h2&gt;

&lt;p&gt;I hope this article helped you create a little mental space to keep Pandas index methods straight. If it did, please give it some love so other people can find it, too.&lt;/p&gt;

&lt;p&gt;I write about Data Science, Dev Ops, Python and other stuff. Check out my other &lt;a href="https://medium.com/@jeffhale" rel="noopener noreferrer"&gt;articles&lt;/a&gt; if any of that sounds interesting.&lt;/p&gt;

&lt;p&gt;Follow me and connect:&lt;br&gt;
&lt;a href="https://medium.com/@jeffhale" rel="noopener noreferrer"&gt;Medium&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/discdiver"&gt;Dev.to&lt;/a&gt;&lt;br&gt;
&lt;a href="https://twitter.com/discdiver" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.linkedin.com/in/-jeffhale" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.kaggle.com/discdiver" rel="noopener noreferrer"&gt;Kaggle&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/discdiver" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fgd6kiza5mgsh470nruhi.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fgd6kiza5mgsh470nruhi.jpg" alt="Reset Button"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Happy indexing!&lt;/p&gt;

</description>
      <category>python</category>
      <category>tutorial</category>
      <category>machinelearning</category>
      <category>pandas</category>
    </item>
    <item>
      <title>10 Days to Become a Google Cloud Certified Professional Data Engineer</title>
      <dc:creator>Jeff Hale</dc:creator>
      <pubDate>Wed, 19 Jun 2019 21:59:43 +0000</pubDate>
      <link>https://dev.to/discdiver/10-days-to-become-a-google-cloud-certified-professional-data-engineer-4cn4</link>
      <guid>https://dev.to/discdiver/10-days-to-become-a-google-cloud-certified-professional-data-engineer-4cn4</guid>
      <description>&lt;p&gt;I recently took the updated &lt;a href="https://cloud.google.com/certification/data-engineer" rel="noopener noreferrer"&gt;Google Cloud Certified Professional Data Engineer exam&lt;/a&gt;. Studying for the test is a great way to learn the data engineering process with Google Cloud.&lt;/p&gt;

&lt;p&gt;I recommend studying for the exam if you want to use Google Cloud products and:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;are a data engineer&lt;/li&gt;
&lt;li&gt;want to become a data engineer&lt;/li&gt;
&lt;li&gt;want to build a tech company&lt;/li&gt;
&lt;li&gt;are a data scientist and want to understand the whole data pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article I’ll share the what, why, and how to help you take your best shot at the exam. 🎯&lt;/p&gt;

&lt;h3&gt;
  
  
  Why
&lt;/h3&gt;

&lt;p&gt;Let’s tackle the &lt;em&gt;why&lt;/em&gt; first. I decided to take the Google Cloud Certified Professional Data Engineer exam for two reasons. First, I wanted to learn more about Google Cloud products for data engineering and machine learning. Second, I wanted to pass the exam and demonstrate that I’d learned the information. 😃&lt;/p&gt;

&lt;p&gt;I chose a Google exam over offerings from AWS and Microsoft Azure for a few reasons. First, Google is the leading cloud provider in terms of machine learning and AI. They are also the platform I would use if I were starting a company in the space.&lt;/p&gt;

&lt;p&gt;Compared to the other major cloud services, Google has the clearest help docs and the best UX. They also have the lowest prices for &lt;a href="https://towardsdatascience.com/maximize-your-gpu-dollars-a9133f4e546a" rel="noopener noreferrer"&gt;GPUs&lt;/a&gt; and the &lt;a href="https://cloud.google.com/tpu/" rel="noopener noreferrer"&gt;most powerful machines&lt;/a&gt; for training deep learning models.&lt;/p&gt;

&lt;p&gt;Additionally, the Google exam has good study materials available — which we’ll dig into below. It’s also a professional level exam, which means that it’s difficult, but passage signifies the highest level of mastery. Finally, the Professional Data Engineer test was updated in March 2019, so I figured it should be more relevant than an older, un-updated exam.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fup1nse4tjr5it09lavgt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fup1nse4tjr5it09lavgt.png" width="400" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you’re a data person and prefer AWS, check out the &lt;a href="https://aws.amazon.com/certification/certified-machine-learning-specialty/" rel="noopener noreferrer"&gt;Machine Learning&lt;/a&gt; and &lt;a href="https://aws.amazon.com/certification/certified-machine-learning-specialty/" rel="noopener noreferrer"&gt;Big Data&lt;/a&gt; &lt;a href="https://aws.amazon.com/certification/?nav=tc&amp;amp;loc=3" rel="noopener noreferrer"&gt;specialty certificate&lt;/a&gt; exams. They are $300 each, plus $40 per practice exam.&lt;/p&gt;

&lt;p&gt;If you’re into Microsoft Azure, they have two exams that must be passed to attain the &lt;a href="https://www.microsoft.com/en-us/learning/azure-data-engineer.aspx" rel="noopener noreferrer"&gt;Certified: Azure Data Engineer Associate&lt;/a&gt; designation. The Azure exams have a revamp date of June 21, 2019.&lt;/p&gt;

&lt;h3&gt;
  
  
  Study Plan
&lt;/h3&gt;

&lt;p&gt;As context, I’d used a number of Google Cloud products, but didn’t know the difference between BigQuery and Bigtable before I started studying for the exam. I also hadn’t done much data engineering work.&lt;/p&gt;

&lt;p&gt;This isn’t the kind of test you can cram for in a day or two. Hardly anyone could be prepared for this exam without a good bit of studying; the number of Google products and their options changes so fast.&lt;/p&gt;

&lt;p&gt;Here are the resources I used to study for &lt;a href="https://cloud.google.com/certification/data-engineer" rel="noopener noreferrer"&gt;the exam&lt;/a&gt;. The format below is inspired by &lt;a href="https://medium.com/u/dbc019e228f5" rel="noopener noreferrer"&gt;Daniel Bourke&lt;/a&gt;’s helpful post that I used as a guide for my study plan.&lt;/p&gt;

&lt;h3&gt;
  
  
  Linux Academy
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgzgua9tyw1ou4i9zllp6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgzgua9tyw1ou4i9zllp6.png" width="460" height="110"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Helpfulness&lt;/strong&gt; : 7.5/10&lt;/p&gt;

&lt;p&gt;Linux Academy’s Google Cloud Certified Professional Data Engineer &lt;a href="https://linuxacademy.com/google-cloud-platform/training/course/name/google-cloud-data-engineer" rel="noopener noreferrer"&gt;course&lt;/a&gt; had good content. The course has videos, quizzes, a &lt;a href="https://www.lucidchart.com/documents/view/0ca44a63-4ea4-4d78-8367-2465512d21be/1" rel="noopener noreferrer"&gt;Lucid Chart e-book&lt;/a&gt;, and a final exam. Linux Academy provides free GCP practice time. It also has a helpful community Slack channel.&lt;/p&gt;

&lt;p&gt;I took a legal pad worth of notes as I studied — and most of them came from the Linux Academy videos.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcl7aamw4a7axcc0it5y9.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcl7aamw4a7axcc0it5y9.jpeg" width="640" height="480"&gt;&lt;/a&gt;A legal pad before studying.&lt;/p&gt;

&lt;p&gt;The course wasn’t updated for the new test as of early June 2019, so it wasn’t as helpful as it could have been. The instructor said the materials will probably be totally updated in late June 2019.&lt;/p&gt;

&lt;p&gt;The Linux Academy final exam took a number of questions from the official Google practice exam. Don’t put much faith in the final exam results if you are taking the test in mid-June 2019; the course isn’t totally updated, and the actual exam questions felt more difficult.&lt;/p&gt;

&lt;p&gt;Overall, the UX isn’t bad, but there are some minor annoying issues (for example, the video is either full screen or tiny).&lt;/p&gt;

&lt;p&gt;Bottom line: Linux Academy makes a great base, but you might want to wait until their training materials are updated to start studying for the exam.&lt;/p&gt;

&lt;p&gt;Linux Academy is $49 a month, paid monthly, with a 7-day free trial.&lt;/p&gt;

&lt;h3&gt;
  
  
  Qwicklabs
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjw2dofe11rhclji7xg6m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjw2dofe11rhclji7xg6m.png" width="278" height="70"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Helpfulness: 5.5/10&lt;/p&gt;

&lt;p&gt;The Qwicklabs exercises aren’t focused on the exam. I found this nice for overall learning, but not all that helpful if you’re trying to figure out what you need to learn for the test.&lt;/p&gt;

&lt;p&gt;Like Linux Academy, Qwicklabs provides a Google Cloud sandbox for practice. Qwicklabs checks your progress in the sandbox, which is nice. It doesn’t have videos.&lt;/p&gt;

&lt;p&gt;The UX is alright. The countdown timer for each lesson is a bit distracting and pressure-inducing — however, there is a countdown timer on the actual Google exam, too. The Qwicklabs timer is quite large — I suggest moving that part of the window offscreen if it’s distracting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F99nc51qoiuxx16mr6906.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F99nc51qoiuxx16mr6906.png" width="800" height="167"&gt;&lt;/a&gt;Qwicklabs countdown timer example&lt;/p&gt;

&lt;p&gt;When doing interactive exercises, I recommend setting up your windows side-by-side — one for instruction and one for your work in GCP.&lt;/p&gt;

&lt;p&gt;Qwicklabs courses cost credits that you can purchase. You can purchase a monthly unlimited &lt;a href="https://www.qwiklabs.com/payments/pricing" rel="noopener noreferrer"&gt;Qwicklabs subscription&lt;/a&gt; for $55 a month. Discount codes may be available at &lt;a href="https://medium.com/u/ba857441758a" rel="noopener noreferrer"&gt;sathish vj&lt;/a&gt;’s post&lt;a href="https://medium.com/@sathishvj/qwiklabs-free-codes-gcp-and-aws-e40f3855ffdb" rel="noopener noreferrer"&gt; here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I recommend doing Linux Academy first and then using Qwicklabs for more practice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Udemy
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz44p5d8tmmb0sw548ks4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz44p5d8tmmb0sw548ks4.png" width="250" height="100"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Helpfulness: 5.5/10&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.udemy.com/google-cloud-certified-professional-data-engineer-practice-exams/" rel="noopener noreferrer"&gt;This resource&lt;/a&gt; consists of just three 50 question practice exams with a timer. The practice exams had a few updated questions, but still had old case study questions. They used the same Google official practice exam questions as Linux academy. Several questions had grammatical issues. Also, several questions were now incorrect. For example, now there is a BigQuery ML K-means algorithm.&lt;/p&gt;

&lt;p&gt;I did learn things by taking the exam and reviewing the answers. The answers were detailed and linked to source documents. Just don’t put much faith in the score. The real exam feels far harder. 😄&lt;/p&gt;

&lt;p&gt;Overall, these exams aren’t great, but I found them worth the time and money because there were few good options.&lt;/p&gt;

&lt;p&gt;$9.99 for a one-time purchase (price may change — I saw it for $10.99 first).&lt;/p&gt;

&lt;h3&gt;
  
  
  Coursera
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0mfroac3pwq5powmr52v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0mfroac3pwq5powmr52v.png" width="294" height="64"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Google recommends taking the &lt;a href="https://www.coursera.org/learn/preparing-cloud-professional-data-engineer-exam?utm_source=googlecloud&amp;amp;utm_medium=institutions&amp;amp;utm_campaign=GoogleCloud_Cert_Prep_PDE" rel="noopener noreferrer"&gt;Coursera Data Engineering, Big Data, and Machine Learning on GCP Specialization&lt;/a&gt;. This specialization consists of five Coursera courses. I decided not to take it because it looked like it hadn’t been updated for the revised exam — it referenced the old exam case studies. In hindsight, I would have taken these courses because they look quite thorough.&lt;/p&gt;

&lt;h3&gt;
  
  
  Official Practice Exam
&lt;/h3&gt;

&lt;p&gt;Helpfulness: 5.5/10&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://cloud.google.com/certification/practice-exam/data-engineer" rel="noopener noreferrer"&gt;official Google practice exam&lt;/a&gt; is available online as a mini-version of the real exam. The questions are the most relevant; I just wish there were more of them. As noted above, the questions are also used by several other folks in their practice exams.&lt;/p&gt;

&lt;p&gt;You have to fill out a form to take the practice exam, but it’s free.&lt;/p&gt;

&lt;h3&gt;
  
  
  Other Good Resources
&lt;/h3&gt;

&lt;p&gt;Here are the cheat sheets, blog posts, and other resources I used to study for the exam.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maverick Lin’s cheatsheet &lt;a href="https://github.com/ml874/Data-Engineering-on-GCP-Cheatsheet/blob/master/data_engineering_on_GCP.pdf" rel="noopener noreferrer"&gt;here&lt;/a&gt; is very good, but pre the March exam refresh.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/u/faf29a77ec33" rel="noopener noreferrer"&gt;Guang X&lt;/a&gt;’s &lt;a href="https://medium.com/weareservian/google-cloud-data-engineer-exam-study-guide-9afc80be2ee3" rel="noopener noreferrer"&gt;here&lt;/a&gt; is pre-updated exam.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/u/857e12d5597a" rel="noopener noreferrer"&gt;Dmitri Lerko&lt;/a&gt;’s post &lt;a href="https://deploy.live/blog/google-cloud-certified-professional-data-engineer/" rel="noopener noreferrer"&gt;here&lt;/a&gt; reflects the updated exam.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/u/138b47b69562" rel="noopener noreferrer"&gt;Chetan Sharma&lt;/a&gt;’s post &lt;a href="https://medium.com/@chetansharma90" rel="noopener noreferrer"&gt;here&lt;/a&gt; also reflects the updated exam.&lt;/li&gt;
&lt;li&gt;The official Google Cloud docs are expansive. You’ll certainly want to spend some time taking notes from them. Not all the latest material is on the exam, but it’s all good to learn. 😃 Here are the &lt;a href="https://cloud.google.com/bigquery/docs/" rel="noopener noreferrer"&gt;BigQuery&lt;/a&gt; docs, for example.&lt;/li&gt;
&lt;li&gt;The official Google Cloud blog is &lt;a href="https://cloud.google.com/blog/products/gcp" rel="noopener noreferrer"&gt;here&lt;/a&gt;. It’s worth spending some time with it to help you understand topics you might find challenging.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F46x6h6cptmvhbb2m4iyv.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F46x6h6cptmvhbb2m4iyv.jpeg" width="800" height="410"&gt;&lt;/a&gt;So many things learn!&lt;/p&gt;

&lt;p&gt;Do you have other resources that you found helpful? Please share them in the comments or send them to me on Twitter &lt;a class="mentioned-user" href="https://dev.to/discdiver"&gt;@discdiver&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;One thing I found unnecessarily difficult was determining how updated study materials were. To make this easier, I suggested to Google that they should version their certification exams — just as most software follows &lt;a href="https://semver.org/" rel="noopener noreferrer"&gt;semantic versioning&lt;/a&gt;. A version label like 1.1 could make it easy for training material providers to indicate which test version their materials match. This could save test-takers time and avoid frustration. If you think this is a good idea, please let Google know. You can tweet to them &lt;a href="https://twitter.com/GCPcloud" rel="noopener noreferrer"&gt;&lt;strong&gt;@GCPcloud&lt;/strong&gt;&lt;/a&gt;. 😃&lt;/p&gt;

&lt;p&gt;For what it’s worth, I generally take tests well and am confident in my ability to learn with self-directed study. If self-directed study isn’t your thing, and your budget allows, you might want to take &lt;a href="https://cloud.google.com/certification/data-engineer" rel="noopener noreferrer"&gt;in-person courses&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now let’s turn to the test.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Exam
&lt;/h3&gt;

&lt;p&gt;The exam consists of 50 multiple-choice questions. You have two hours to complete it. You’re able to mark questions for later review and revisit all questions before submitting the test.&lt;/p&gt;

&lt;p&gt;Rumor has it that you need about 70% correct to pass the exam. However, there is not an official published passing score. &lt;a href="https://cloud.google.com/certification/faqs/#0" rel="noopener noreferrer"&gt;Google says&lt;/a&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Not all questions may be scored.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;At any given time, a small number of questions on our exams may be unscored. These are newly developed questions that are being evaluated for their effectiveness. This is a standard practice in the testing industry.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ol start="2"&gt;
&lt;li&gt;The score needed to pass is confidential.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;The passing score for each exam is confidential. It is determined by a panel of internal and external subject matter experts, following an industry-accepted standard setting process. The passing score is applied equally to all examinees. It is re-evaluated when changes are made to the exam content.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You never learn your score, just whether you passed or failed. If you pass the test, your certification is good for two years.&lt;/p&gt;

&lt;p&gt;The exam will cost you $200. If you don’t pass, you can take it again for another $200 in 14 days. If you don’t pass on your second try, you need to wait 60 days and pay again.&lt;/p&gt;

&lt;p&gt;Here’s &lt;a href="https://cloud.google.com/certification/guides/data-engineer/" rel="noopener noreferrer"&gt;the official test overview&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3l7krrhfhf8ve07z0abt.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3l7krrhfhf8ve07z0abt.jpeg" width="800" height="532"&gt;&lt;/a&gt;What do you see in the crystal?&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Know When You’re Ready?
&lt;/h3&gt;

&lt;p&gt;If you decide to study for the Google Cloud Certified Professional Data Engineer exam, it’s hard to know when you’re ready to take the test. It’s tricky because there are few good test simulations and you don’t even know what you need to pass!&lt;/p&gt;

&lt;p&gt;As with most things in life, practice improves your chances of performing well. Take as many practice exams as you can and review the results. You want to feel confident that you know the concepts, pitfalls, and best practices.&lt;/p&gt;

&lt;p&gt;I originally planned to study for a month or so, but I decided to push it hard. On the sixth day I tried to register to take the exam the next day, but the testing center was booked. I decided to take a few more days to study and spend time with family in town over the weekend.&lt;/p&gt;

&lt;p&gt;I ended up with 10 days of pretty intense study and a few days break in the middle. I felt decently prepared on test day. I hadn’t memorized every IAM role for every resource, but I had a good understanding of best practices with key products.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Test Experience
&lt;/h3&gt;

&lt;p&gt;You take the exam on a computer at a testing center. You’ll have to leave your phone and other personal belongings with the proctor. You’ll be video recorded during the test. Other people will probably be in the same room taking other exams.&lt;/p&gt;

&lt;p&gt;Earplugs, scratch paper, and pencils are provided. It sounds silly, but if you’re not an earplug wearer, you may want to practice with them ahead of time. I suggest you don’t press start until they are firmly in your ears.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vki4xhxge07tm0gc4db.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vki4xhxge07tm0gc4db.jpeg" width="800" height="532"&gt;&lt;/a&gt;Ears.&lt;/p&gt;

&lt;p&gt;I had read that the test would be difficult. It was still way harder than I thought it would be. It felt like the hardest test I’ve ever taken, and I’ve taken the SAT, ACT, GMAT, GRE, LSAT and several certification exams. For what it’s worth, this was my first exam from a cloud provider.&lt;/p&gt;

&lt;p&gt;The test is difficult for several reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The breadth of material is vast. There are lots of Google products and lots of potential questions about each product and how they work together. There are over 200 Google Cloud APIs. This exam doesn’t cover all of them, but it covers a bunch.&lt;/li&gt;
&lt;li&gt;The exam also tests your knowledge of several Apache open source products related to Google’s offerings.&lt;/li&gt;
&lt;li&gt;It’s not even clear exactly how many Google products could be on the exam because new products are always being added and products are being changed.&lt;/li&gt;
&lt;li&gt;The questions are often multi-line, requiring consideration of multiple variables and intense concentration.&lt;/li&gt;
&lt;li&gt;Some questions have multiple answers required (if more than one answer is required, the number of answers is specified).&lt;/li&gt;
&lt;li&gt;Many answers are somewhat correct. You need to choose the best answer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The exam will test you in more ways than one. When I took it, I just tried to stay focused and not let the voice of self-doubt enter my head.&lt;/p&gt;

&lt;p&gt;I had about 30 minutes left after my first pass through the questions. I marked seven answers for review. After reviewing, I had 10 minutes to spare. I clicked &lt;em&gt;submit&lt;/em&gt; knowing I had tried my best and the chips would fall where they may.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31v0f6p6b3pr5rsjfp32.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31v0f6p6b3pr5rsjfp32.jpeg" width="640" height="425"&gt;&lt;/a&gt;Poker chips.&lt;/p&gt;

&lt;p&gt;On the next screen I saw I had provisionally &lt;em&gt;passed&lt;/em&gt;. 😃 I collected my belongings from the proctor and headed out.&lt;/p&gt;

&lt;p&gt;I received an email from Google the next day that I had officially passed. It included a code for some free swag. I would have preferred a less expensive test, but now I’ve got some humiliswag.&lt;/p&gt;

&lt;p&gt;I plan to write about Google tools for data ingestion, processing, storage, and machine learning in a future article. Follow &lt;a href="http://medium.com/@jeffhale" rel="noopener noreferrer"&gt;me&lt;/a&gt; to make sure you don’t miss it. Now I’ll mention what I didn’t see on the exam.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I Didn’t See
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;As many IAM questions as I expected. There were a bunch on the various practice tests.&lt;/li&gt;
&lt;li&gt;Questions on exact product costs. Just know which options make sense when you’re more cost sensitive or less cost sensitive.&lt;/li&gt;
&lt;li&gt;Firestore questions.&lt;/li&gt;
&lt;li&gt;AI Hub questions.&lt;/li&gt;
&lt;li&gt;Many ML concept questions. I went into the test knowing ML concepts better than Google database products, so perhaps this explains why this part of the test didn’t loom large to me.&lt;/li&gt;
&lt;li&gt;Many questions with code samples.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Wrap
&lt;/h3&gt;

&lt;p&gt;It makes sense to study for this exam if you want to learn more about Google’s data science and engineering products and you have the time to devote to it. This exam doesn’t have you writing actual queries or cleaning data, so you’ll want to look elsewhere to develop those skills.&lt;/p&gt;

&lt;p&gt;If you aren’t already a GCP pro, I guarantee you’ll learn things if you put the time in to study for the exam.&lt;/p&gt;

&lt;p&gt;The way I look at it, if you pass the test, great. If you don’t, that’s okay. Either way, you’ll learn a bunch, and that’s most important. 😃&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8unky52qvf5g3cje0fb.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8unky52qvf5g3cje0fb.jpeg" width="800" height="532"&gt;&lt;/a&gt;It’s the climb.&lt;/p&gt;

&lt;p&gt;Speaking of learning, I hope you found this article helpful for your learning. If you did, please share it on your favorite social media channel. 👍&lt;/p&gt;

&lt;p&gt;I help folks learn about cloud computing, data science, and other tech topics. Check out &lt;a href="https://medium.com/@jeffhale" rel="noopener noreferrer"&gt;my other articles&lt;/a&gt; if you’re into that stuff.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://eepurl.com/gjfLAz" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fettjveljwybv51rexa9n.png" width="800" height="255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Happy studying! 📙&lt;/p&gt;




</description>
      <category>cloud</category>
      <category>database</category>
      <category>dataengineering</category>
      <category>google</category>
    </item>
  </channel>
</rss>
