<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: StrataScratch</title>
    <description>The latest articles on DEV Community by StrataScratch (@nate_at_stratascratch).</description>
    <link>https://dev.to/nate_at_stratascratch</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F619799%2F365c0aa7-5637-42e7-ae76-89e00006aec1.png</url>
      <title>DEV Community: StrataScratch</title>
      <link>https://dev.to/nate_at_stratascratch</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nate_at_stratascratch"/>
    <language>en</language>
    <item>
      <title>Practicing String Manipulation in SQL</title>
      <dc:creator>StrataScratch</dc:creator>
      <pubDate>Mon, 06 Mar 2023 09:14:49 +0000</pubDate>
      <link>https://dev.to/nate_at_stratascratch/practicing-string-manipulation-in-sql-29a2</link>
      <guid>https://dev.to/nate_at_stratascratch/practicing-string-manipulation-in-sql-29a2</guid>
      <description>&lt;p&gt;&lt;em&gt;A detailed walkthrough of the solution for a Google interview question to practice SQL String Manipulation.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With the wealth of data being captured by companies, not all of them will be structured and numerical. So today, our focus is to hone your skill in manipulating strings in SQL by introducing several advanced functions.&lt;/p&gt;

&lt;h2&gt;Interview Question Example to Practice SQL String Manipulation&lt;/h2&gt;

&lt;p&gt;Let’s dive into an example question from an interview at Google to practice SQL string manipulation. The question is entitled ‘File Contents Shuffle’. It asks:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--W5DFwuGG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/68yzznd384374ba0act2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--W5DFwuGG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/68yzznd384374ba0act2.png" alt="Practicing String Manipulation in SQL" width="785" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Link to the question: &lt;a href="https://platform.stratascratch.com/coding/9818-file-contents-shuffle?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to+string+manipulation"&gt;https://platform.stratascratch.com/coding/9818-file-contents-shuffle&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Video Solution&lt;/h3&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/BgN5hpl3WKc"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;To understand the question a bit better, let’s have a look at the dataset we’re working with.&lt;/p&gt;

&lt;h4&gt;1. Exploring the Dataset&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wRknJLdE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/am0ytb4wremcucdscxdh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wRknJLdE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/am0ytb4wremcucdscxdh.png" alt="Practicing String Manipulation in SQL" width="880" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The table google_file_store provides a list of text files, with the filename in one column and its contents in the other. Both columns contain string data.&lt;/p&gt;

&lt;p&gt;The question asks us specifically to look at the record where the filename is ‘final.txt’. Notice that there are punctuation marks and duplication of some words like ‘the’, ‘and’, and ‘a’.&lt;/p&gt;

&lt;p&gt;When dealing with strings, always remember that data may not be ‘clean’. Watch out for punctuation marks, numbers, a mix of upper and lower cases, double spaces, and duplication of words. State how you’d like to deal with these scenarios or clarify this with your interviewer. For today, we will set these issues aside.&lt;/p&gt;

&lt;p&gt;The contents of ‘final.txt’ need to be sorted alphabetically and returned in lowercase, with a new filename, ‘wacky.txt’.&lt;/p&gt;

&lt;h4&gt;2. Writing Out the Approach&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IXji5G8Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6und2r61gcijmkh45i8j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IXji5G8Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6und2r61gcijmkh45i8j.png" alt="Practicing String Manipulation in SQL" width="880" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you’ve fully understood the requirements of the question, formulate a plan of how you’ll build the solution. Oftentimes, you already have an idea of what this is but I strongly suggest writing this out step-by-step. This forces you to identify any gaps in your thinking or errors that you may have missed otherwise.&lt;/p&gt;

&lt;p&gt;From the instructions alone, you could easily write out these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Filter the table where the filename is ‘final.txt’&lt;/li&gt;
&lt;li&gt;Sort its contents alphabetically&lt;/li&gt;
&lt;li&gt;Convert the words into lowercase&lt;/li&gt;
&lt;li&gt;Return the contents with ‘wacky.txt’ as the filename column&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While this sounds simple at the start, several important steps are missing. To avoid this, I would also encourage you to think about the input and output at each step.&lt;/p&gt;

&lt;p&gt;For example, the output of Step 1 is:&lt;br&gt;
&lt;code&gt;SELECT * FROM google_file_store&lt;br&gt;
WHERE filename = 'final.txt'&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Rf0WsdrK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/27540ruqyt4j35emfrd5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Rf0WsdrK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/27540ruqyt4j35emfrd5.png" alt="Practicing String Manipulation in SQL" width="880" height="144"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The contents are a single string stored in one row, so we cannot immediately sort the words alphabetically. If, instead, each word had its own row, we could do the usual sort with the ORDER BY clause.&lt;/p&gt;

&lt;p&gt;So we need to prepare the data first so that we can manipulate it more easily later on. Let’s call this the data preparation step; its aim is to convert the string into a column of words. This is how we will do it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data preparation:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Convert the string into an array by splitting the text using a space to identify the individual words&lt;/li&gt;
&lt;li&gt;Explode the array so that each element in the array becomes its own row&lt;/li&gt;
&lt;/ol&gt;
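&lt;p&gt;To make these two steps concrete, here is a rough Python sketch of the same transformation. The sample text is made up, not the actual contents of ‘final.txt’.&lt;/p&gt;

```python
# Hypothetical sample contents; the real text lives in the table.
contents = "We the people of the United States"

# Step 1: split the string into an array of words,
# the equivalent of PostgreSQL's STRING_TO_ARRAY(contents, ' ')
words = contents.split(" ")

# Step 2: "explode" the array so that each word becomes its own row
rows = [(word,) for word in words]

print(words)
print(rows[0])
```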

&lt;p&gt;This will allow us to proceed to Step 2 where we can sort the new column alphabetically and turn it into lower case.&lt;/p&gt;

&lt;p&gt;Then, we would like to return the result as a string like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hhHJ_2Qk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h5hj6kb6rgkx7bw7jfxf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hhHJ_2Qk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h5hj6kb6rgkx7bw7jfxf.png" alt="Practicing String Manipulation in SQL" width="880" height="133"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We cannot do that directly with the current format, so another data transformation is required. This time, it is the reverse of the data preparation step: the aim is to collect the contents of a column into an array and stitch the elements together into a single string.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data reformatting:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Aggregate column into an array&lt;/li&gt;
&lt;li&gt;Combine elements of the array using a space, returning this as a string&lt;/li&gt;
&lt;/ol&gt;
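&lt;p&gt;In Python terms, the reformatting steps are the mirror image of the preparation: gather the rows back into a list and join them with spaces. The sample words below are hypothetical.&lt;/p&gt;

```python
# Hypothetical column of words, one per row, already sorted and lowercased.
rows = ["a", "b", "of", "states", "the", "the"]

# Step 1: aggregate the column into an array (ARRAY_AGG in PostgreSQL)
array = list(rows)

# Step 2: combine the elements with a space into one string,
# the equivalent of ARRAY_TO_STRING(array, ' ')
contents = " ".join(array)
print(contents)
```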

&lt;p&gt;Therefore, our full approach follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Filter the table where the filename is ‘final.txt’&lt;/li&gt;
&lt;li&gt;Data preparation:
a) Convert the string into an array using a space as the delimiter
b) Explode the array so each element becomes its own row&lt;/li&gt;
&lt;li&gt;Sort its contents alphabetically&lt;/li&gt;
&lt;li&gt;Convert the words into lowercase&lt;/li&gt;
&lt;li&gt;Data reformatting:
a) Aggregate column into an array
b) Combine elements of the array using a space, returning this as a string&lt;/li&gt;
&lt;li&gt;Return the contents with ‘wacky.txt’ as the filename column&lt;/li&gt;
&lt;/ol&gt;
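&lt;p&gt;As a sanity check before writing any SQL, the whole approach can be mirrored in a few lines of Python. The input string is made up, and this sketch sorts case-insensitively, whereas SQL’s ORDER BY may order mixed-case words differently depending on the database collation.&lt;/p&gt;

```python
def wacky(contents):
    # Split into words, sort alphabetically (case-insensitively here),
    # lowercase each word, and join back with spaces.
    words = sorted(contents.split(" "), key=str.lower)
    return " ".join(word.lower() for word in words)

print(wacky("The cat and the Hat"))
```

&lt;p&gt;Running the function on a small made-up string is a quick way to confirm that the split–sort–lower–join order behaves as expected.&lt;/p&gt;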

&lt;p&gt;Don’t you feel more confident about tackling the question now that you have the steps written out? This will also provide you with a good reference point if you ever feel stuck in the interview.&lt;/p&gt;

&lt;h4&gt;3. Coding the Solution&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cZzdKNAR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.sanity.io/images/oaglaatp/production/093675c2dcb5693a30989db797466cc1e4d1a6ae-5001x2501.jpg%3Fw%3D1920%26h%3D960%26auto%3Dformat" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cZzdKNAR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.sanity.io/images/oaglaatp/production/093675c2dcb5693a30989db797466cc1e4d1a6ae-5001x2501.jpg%3Fw%3D1920%26h%3D960%26auto%3Dformat" alt="Practicing String Manipulation in SQL" width="880" height="440"&gt;&lt;/a&gt;&lt;br&gt;
Let’s code up the query.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1). Filter the table&lt;/strong&gt;&lt;br&gt;
First, let’s only look at the file ‘final.txt’. We can do this by using an equality condition in the WHERE clause, since we know the exact filename we are looking for.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT * FROM google_file_store&lt;br&gt;
WHERE filename = 'final.txt'&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;However, if we only knew that the filename started with ‘final’, we could use the LIKE or ILIKE operator. Both match strings against a given pattern; the only difference is that LIKE is case-sensitive and ILIKE (a PostgreSQL extension) is not.&lt;/p&gt;

&lt;p&gt;Here, we can use ILIKE with the wildcard operator, %, which represents zero or more characters. This allows us to retrieve the records where the filename starts with ‘final’.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT * FROM google_file_store&lt;br&gt;
WHERE filename ILIKE 'final%'&lt;/code&gt;&lt;/p&gt;
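&lt;p&gt;If it helps to see the pattern-matching semantics outside of SQL, ILIKE 'final%' behaves like a case-insensitive prefix check. The filenames below are made up for illustration.&lt;/p&gt;

```python
filenames = ["final.txt", "FINAL_v2.txt", "draft.txt", "semifinal.txt"]

# ILIKE 'final%': '%' matches zero or more characters, and the comparison
# is case-insensitive, so this keeps any filename starting with 'final'.
matches = [name for name in filenames if name.lower().startswith("final")]
print(matches)
```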

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Hdh66GE7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p7uy9y3xyhpkhiqpvmvi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Hdh66GE7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p7uy9y3xyhpkhiqpvmvi.png" alt="Practicing String Manipulation in SQL" width="880" height="144"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2). Data preparation&lt;/strong&gt;&lt;br&gt;
Next, let’s prepare the data for manipulation. We will use the STRING_TO_ARRAY() function, which takes in a string and converts it into an array (a list). The elements of this array are determined by the delimiter we specify. So if we use a space as the delimiter, it starts a new element whenever it sees a space. Essentially, it will break up our text into words like this:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT STRING_TO_ARRAY(contents, ' ') AS word&lt;br&gt;
FROM google_file_store&lt;br&gt;
WHERE filename ILIKE 'final%'&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;As you can see, arrays pack a lot of information into a single value, but we cannot access or analyze their contents easily, so a common manipulation done on arrays is ‘exploding’ them. We can do this with the UNNEST() function, which takes an array as input and outputs a column where each array element becomes accessible as a separate row. Imagine this as an array-to-rows transformation.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT UNNEST (STRING_TO_ARRAY(contents, ' ')) AS word&lt;br&gt;
FROM google_file_store&lt;br&gt;
WHERE filename ILIKE 'final%'&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--25wMvUK_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uuk5wj4rvr9qiak2kcry.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--25wMvUK_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uuk5wj4rvr9qiak2kcry.png" alt="String Manipulation in SQL" width="880" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3). Sort the contents alphabetically&lt;/strong&gt;&lt;br&gt;
Because we transformed our data earlier, the sorting is now straightforward using the ORDER BY clause.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT UNNEST (STRING_TO_ARRAY(contents, ' ')) AS word&lt;br&gt;
FROM google_file_store&lt;br&gt;
WHERE filename ILIKE 'final%'&lt;br&gt;
ORDER BY word&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YeYbkjsq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5bav0cpwfgrcrwom7z3j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YeYbkjsq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5bav0cpwfgrcrwom7z3j.png" alt="String Manipulation in SQL" width="880" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4). Convert the words into lowercase using LOWER()&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT LOWER(word) AS contents&lt;br&gt;
FROM&lt;br&gt;
  (SELECT UNNEST (STRING_TO_ARRAY(contents, ' ')) AS word&lt;br&gt;
      FROM google_file_store&lt;br&gt;
      WHERE filename ILIKE 'final%' &lt;br&gt;
   ORDER BY word) base&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Utv4acN6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xdf1bll97f738c6xzgdm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Utv4acN6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xdf1bll97f738c6xzgdm.png" alt="String Manipulation" width="880" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5). Data reformatting&lt;/strong&gt;&lt;br&gt;
Finally, to return the contents in a string format, we’ll do the reverse of the steps earlier.&lt;/p&gt;

&lt;p&gt;First, we will aggregate the rows of the contents column into an array using the ARRAY_AGG() function. ARRAY_AGG() is an &lt;a href="https://www.stratascratch.com/blog/the-ultimate-guide-to-sql-aggregate-functions/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to+string+manipulation"&gt;aggregate function&lt;/a&gt;, so like SUM() and AVG(), it takes a column and outputs a single row summarizing the set of values. But here, instead of performing a calculation, it returns an array listing all the values of the column.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT ARRAY_AGG(LOWER(word)) AS contents&lt;br&gt;
FROM&lt;br&gt;
  (SELECT UNNEST (STRING_TO_ARRAY(contents, ' ')) AS word&lt;br&gt;
      FROM google_file_store&lt;br&gt;
      WHERE filename ILIKE 'final%' &lt;br&gt;
   ORDER BY word) base&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Then, we can return this as text by combining the individual words. The ARRAY_TO_STRING() function takes in an array, combines its elements using a specified delimiter such as a space, and returns the output as a string.&lt;/p&gt;

&lt;p&gt;In the same query, we’ll hardcode the filename as ‘wacky.txt’ so our final solution looks like:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT 'wacky.txt' AS filename,&lt;br&gt;
       ARRAY_TO_STRING(ARRAY_AGG(LOWER(word)), ' ') AS contents&lt;br&gt;
FROM&lt;br&gt;
  (SELECT UNNEST (STRING_TO_ARRAY(contents, ' ')) AS word&lt;br&gt;
      FROM google_file_store&lt;br&gt;
      WHERE filename ILIKE 'final%' &lt;br&gt;
   ORDER BY word) base&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;Bonus&lt;/h3&gt;

&lt;p&gt;For more advanced users of SQL, you may be familiar with the REGEXP_SPLIT_TO_TABLE() function, which gives the same output as the UNNEST(STRING_TO_ARRAY()) combination we used earlier.&lt;/p&gt;

&lt;p&gt;REGEXP_SPLIT_TO_TABLE() takes in a string, splits it on a delimiter given as a regular expression, and returns a table with each element in a separate row.&lt;/p&gt;

&lt;p&gt;This is helpful for more complex manipulations where the use of regex is required. In this example, however, the delimiter is simply a space so the code is:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT regexp_split_to_table(contents, ' ') AS word &lt;br&gt;
FROM google_file_store &lt;br&gt;
WHERE filename ILIKE 'final%'&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hk-sSIbX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ul0hhdvel45tjm5rsb4s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hk-sSIbX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ul0hhdvel45tjm5rsb4s.png" alt="SQL String Manipulation" width="880" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And this gives us the same result as we had in Step 2!&lt;/p&gt;
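&lt;p&gt;Where the regex version pays off is with messier delimiters. For instance, splitting on the pattern \s+ treats any run of whitespace as a single separator, which a plain split on a single space does not. Here is the difference in Python, with a made-up string:&lt;/p&gt;

```python
import re

text = "the  quick   brown fox"  # note the double and triple spaces

plain = text.split(" ")          # empty strings appear between repeated spaces
regex = re.split(r"\s+", text)   # any run of whitespace is one delimiter

print(plain)
print(regex)
```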

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;This was an interesting example to level up your SQL string manipulation skills and I hope you learned something new today.&lt;/p&gt;

&lt;p&gt;If you ever find yourself stuck doing SQL string manipulation, remember that you can transform the data into another format first if that makes the next steps easier. Converting strings to arrays is now one of the tricks up your sleeve to impress your interviewer.&lt;/p&gt;

&lt;p&gt;Practice more &lt;a href="https://www.stratascratch.com/blog/sql-interview-questions-you-must-prepare-the-ultimate-guide/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to+string+manipulation"&gt;SQL interview questions&lt;/a&gt; and test your new skills on our coding platform where you can look specifically for string-related questions.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>datascience</category>
      <category>tutorial</category>
      <category>career</category>
    </item>
    <item>
      <title>Spotify Advanced SQL Interview Question on PARTITION BY Clause</title>
      <dc:creator>StrataScratch</dc:creator>
      <pubDate>Thu, 26 Jan 2023 04:31:50 +0000</pubDate>
      <link>https://dev.to/nate_at_stratascratch/spotify-advanced-sql-interview-question-on-partition-by-clause-2mab</link>
      <guid>https://dev.to/nate_at_stratascratch/spotify-advanced-sql-interview-question-on-partition-by-clause-2mab</guid>
      <description>&lt;p&gt;&lt;em&gt;A detailed solution walkthrough to a hard Spotify SQL interview question involving Joins, Aggregations, Case Statements, and Partition By Clause.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this article, we’ll walk you through one of the &lt;a href="https://www.stratascratch.com/blog/advanced-sql-interview-questions-you-must-know-how-to-answer/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to+spotify+advanced+sql+question"&gt;Advanced SQL interview questions&lt;/a&gt;. This question is a Hard level problem and will test your advanced SQL skills such as &lt;a href="https://www.stratascratch.com/blog/different-types-of-sql-joins-that-you-must-know/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to+spotify+advanced+sql+question"&gt;SQL Joins&lt;/a&gt;, &lt;a href="https://www.stratascratch.com/blog/the-ultimate-guide-to-sql-aggregate-functions/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to+spotify+advanced+sql+question"&gt;Aggregations&lt;/a&gt;, Partition By clauses, and &lt;a href="https://www.stratascratch.com/blog/a-comprehensive-guide-to-case-when-statements-in-sql/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to+spotify+advanced+sql+question"&gt;Case statements&lt;/a&gt;. Follow along by clicking on the link to the question provided below. Let us solve this problem using our 3-step framework that you can use to solve any coding question anytime.&lt;/p&gt;

&lt;h1&gt;Spotify Advanced SQL Interview Question&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Days At Number One&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;"Find the number of days a US track has stayed in the 1st position for both the US and worldwide rankings. Output the track name and the number of days in the 1st position. Order your output alphabetically by track name.&lt;br&gt;
If the region 'US' appears in dataset, it should be included in the worldwide ranking."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Link to the question: &lt;a href="https://platform.stratascratch.com/coding/10173-days-at-number-one" rel="noopener noreferrer"&gt;https://platform.stratascratch.com/coding/10173-days-at-number-one&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Video Solution&lt;/h3&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/93quPoReV1M"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;The question is entitled “Days at Number One” and asks us to find the number of days a US track has stayed in the first position in the Spotify daily ranking tables for both the US and worldwide rankings. The question further clarifies that if the region ‘US’ appears in the worldwide ranking table, it counts as a worldwide track.&lt;/p&gt;

&lt;p&gt;In the output, we are expected to display two columns: the track name and the number of days at number one. Now, let us work backward to the approach. But first, let us explore the dataset provided.&lt;/p&gt;

&lt;h2&gt;1. Exploring the Dataset&lt;/h2&gt;

&lt;p&gt;Spotify has provided us with two tables, namely, spotify_daily_rankings_2017_us and spotify_worldwide_daily_song_ranking.&lt;/p&gt;

&lt;p&gt;The first dataset, spotify_daily_rankings_2017_us, contains the following columns:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Table: spotify_daily_rankings_2017_us&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpihfmeipjjtcstwd9nea.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpihfmeipjjtcstwd9nea.png" alt="Spotify Advanced SQL Interview Questions" width="800" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see from this table that it only contains tracks that have held the first position in the US daily rankings on various dates.&lt;/p&gt;

&lt;p&gt;The second table is named spotify_worldwide_daily_song_ranking, and it has the following schema:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Table: spotify_worldwide_daily_song_ranking&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fme03alr2p1xrwu5l674e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fme03alr2p1xrwu5l674e.png" alt="Spotify Advanced SQL Interview Questions" width="800" height="296"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One look at this table and we can see that it contains tracks at various positions, not just the number one tracks we observed in the US rankings table. Another difference is that this table has a column for the region the track belongs to. As the question clarified, any track from the US region is also part of the worldwide rankings table.&lt;/p&gt;

&lt;h2&gt;2. Writing Out the Approach&lt;/h2&gt;

&lt;p&gt;Once you are familiar with the datasets provided, it is time to write out the approach you are about to take to solve the problem.&lt;/p&gt;

&lt;p&gt;Going back to the question, the key to finding the number of days a US track has stayed in that position in both tables is the word “both”. Instinctively, you’ll go for a join, which is absolutely correct.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Merge the two tables in an Inner Join on the track name and date columns.&lt;/strong&gt;&lt;br&gt;
In our case, we will specifically use an inner join so that we keep only the US tracks that are present in both tables on the same dates. So, the inner join must be made on the two common columns: track name and date.&lt;/p&gt;
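&lt;p&gt;Here is a quick Python sketch of the inner-join idea, with made-up rows: only the (track name, date) pairs present in both tables survive.&lt;/p&gt;

```python
# Hypothetical US number-one rows: (trackname, date)
us = [("Shape of You", "2017-02-01"), ("HUMBLE.", "2017-04-20")]

# Hypothetical worldwide rows: (trackname, date, position)
world = [("Shape of You", "2017-02-01", 1),
         ("Despacito", "2017-04-20", 2)]

# Index the worldwide table by the join key (trackname, date)
world_by_key = {(track, date): pos for track, date, pos in world}

# Inner join: keep only US rows whose key also exists in the worldwide table
joined = [(track, date, world_by_key[(track, date)])
          for track, date in us if (track, date) in world_by_key]
print(joined)
```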

&lt;p&gt;&lt;strong&gt;Step 2: Filter for the US tracks that are in position #1.&lt;/strong&gt;&lt;br&gt;
Once we have identified the tracks, we will keep only those that were in the number one position in the US rankings table. A simple WHERE clause is apt for this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Define a new column using SUM() OVER (PARTITION BY) clauses&lt;/strong&gt;&lt;br&gt;
Next, we will create a subset of the latest result table containing only the US track names. The tricky little thing we will use to achieve this is an OVER (PARTITION BY) clause.&lt;/p&gt;

&lt;p&gt;An OVER (PARTITION BY) clause lets us specify the columns over which we will perform window functions. In our case, we are going to use a SUM function to aggregate the data. We will partition by the track name column so that we can find, for each track, the number of times a US number one track has been number one worldwide as well.&lt;/p&gt;

&lt;p&gt;Before moving on to the next step, let us break this step apart into smaller steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3.1 Check if the US #1 track is also #1 in the worldwide rankings.&lt;/strong&gt;&lt;br&gt;
A CASE statement will do the trick. CASE statements are basically if-then statements: when the WHEN condition is met, the THEN value is returned; otherwise, the ELSE value is returned.&lt;/p&gt;

&lt;p&gt;We will need to put a condition on the ‘position’ column of the worldwide rankings table so that it returns the value 1 when the worldwide ranking is indeed number one; otherwise, it returns the value 0. We use numerical values in this CASE statement because we will end up adding them to get the total number of days the track has stayed in that position.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3.2 Get the sum of the number of times it has occurred.&lt;/strong&gt;&lt;br&gt;
We will save the value of the SUM() function as a new column, say ‘n_days_on_n1_position’. As a result, we will have, for each track, the number of days a number one US track has also been number one worldwide on the same dates. We will use this query as a temporary table before proceeding to the final output.&lt;/p&gt;
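&lt;p&gt;Before coding it in SQL, here is the gist of SUM(CASE ...) OVER (PARTITION BY trackname) in Python, using made-up joined rows of (track name, worldwide position):&lt;/p&gt;

```python
# Hypothetical joined rows: (trackname, worldwide_position)
rows = [("Shape of You", 1), ("Shape of You", 1),
        ("Shape of You", 3), ("HUMBLE.", 1)]

# CASE WHEN position = 1 THEN 1 ELSE 0 END, summed per track:
totals = {}
for track, position in rows:
    flag = 1 if position == 1 else 0
    totals[track] = totals.get(track, 0) + flag

# The window function attaches each track's total to every one of its rows,
# rather than collapsing the rows the way GROUP BY would.
windowed = [(track, totals[track]) for track, _ in rows]
print(windowed)
```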

&lt;p&gt;&lt;strong&gt;Step 4: Select the track name and perform the MAX() function on the temp table.&lt;/strong&gt;&lt;br&gt;
In this step, we will select the maximum value of the ‘n_days_on_n1_position’ column, along with the corresponding track name, to be displayed in the final result table.&lt;/p&gt;

&lt;p&gt;Also, since we are using an aggregate function, we’ll couple it with a GROUP BY clause at the end; in our case, we are grouping by track name. This way, we have only one row per track, with the maximum value displayed beside it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Order the track name column alphabetically.&lt;/strong&gt;&lt;br&gt;
As the question has suggested, we will order the output table by track name alphabetically.&lt;/p&gt;

&lt;h2&gt;3. Coding the Solution&lt;/h2&gt;

&lt;p&gt;Let’s get right into coding without further ado.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Merge the two tables in an Inner Join on the track name and date columns.&lt;/strong&gt;&lt;br&gt;
Let us begin by selecting the track names from the US rankings table and viewing them in the console.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT trackname&lt;br&gt;
FROM spotify_daily_rankings_2017_us&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31rxwbys979iimacg6vl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31rxwbys979iimacg6vl.png" alt="Spotify Advanced SQL Interview Questions" width="711" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let us now use an inner join on the common columns - trackname and date - to merge the two tables.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT us.trackname&lt;br&gt;
FROM spotify_daily_rankings_2017_us us&lt;br&gt;
INNER JOIN spotify_worldwide_daily_song_ranking world&lt;br&gt;
ON world.trackname=us.trackname&lt;br&gt;
AND world.date = us.date&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The merged table looks like this:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvoexzosvey0gl4ve4rv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvoexzosvey0gl4ve4rv.png" alt="Spotify Advanced SQL Interview Questions" width="720" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Filter for the US tracks that are in position #1.&lt;/strong&gt;&lt;br&gt;
We can add a WHERE clause at the end of the query to keep only the tracks that were in the first position.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT us.trackname&lt;br&gt;
FROM spotify_daily_rankings_2017_us us&lt;br&gt;
INNER JOIN spotify_worldwide_daily_song_ranking world&lt;br&gt;
ON world.trackname=us.trackname&lt;br&gt;
AND world.date = us.date&lt;br&gt;
WHERE us.position = 1&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;And the output appears to be the same, meaning that the four tracks common to both tables are all number one tracks.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7n872glpxz5jvrh12h5e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7n872glpxz5jvrh12h5e.png" alt="Spotify Advanced SQL Interview Questions" width="707" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Define a new column using the SUM() OVER (PARTITION BY) clause&lt;/strong&gt;&lt;br&gt;
First, let us add a new column to the query named ‘n_days_on_n1_position’ (we will fill in its logic in the next steps).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT us.trackname,&lt;br&gt;
       n_days_on_n1_position&lt;br&gt;
FROM spotify_daily_rankings_2017_us us&lt;br&gt;
INNER JOIN spotify_worldwide_daily_song_ranking world&lt;br&gt;
ON world.trackname=us.trackname&lt;br&gt;
AND world.date = us.date&lt;br&gt;
WHERE us.position = 1&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3.1 Check if the US #1 track is also #1 in the worldwide rankings.&lt;/strong&gt;&lt;br&gt;
Write out the skeleton of the OVER (PARTITION BY) clause first; we will then fill in the conditions and parameters step by step.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT us.trackname,&lt;br&gt;
        (CASE &lt;br&gt;
    WHEN        THEN &lt;br&gt;
END)        &lt;br&gt;
 OVER(PARTITION BY )  AS n_days_on_n1_position&lt;br&gt;
FROM spotify_daily_rankings_2017_us us&lt;br&gt;
INNER JOIN spotify_worldwide_daily_song_ranking world&lt;br&gt;
ON world.trackname=us.trackname&lt;br&gt;
AND world.date = us.date&lt;br&gt;
WHERE us.position = 1&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3.2 Get the sum of the number of times it has occurred.&lt;/strong&gt;&lt;br&gt;
Wrap the window function SUM() around the CASE statement.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT us.trackname,&lt;br&gt;
       SUM(CASE &lt;br&gt;
        WHEN    THEN &lt;br&gt;
END) OVER(PARTITION BY )  AS n_days_on_n1_position&lt;br&gt;
FROM spotify_daily_rankings_2017_us us&lt;br&gt;
INNER JOIN spotify_worldwide_daily_song_ranking world&lt;br&gt;
ON world.trackname=us.trackname&lt;br&gt;
AND world.date = us.date&lt;br&gt;
WHERE us.position = 1&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We’ve got the skeleton ready for our logic: when the position in the worldwide rankings table is number 1, we return 1; otherwise, we return 0. Let us now insert these conditions into our CASE statement.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT us.trackname,&lt;br&gt;
       SUM(CASE&lt;br&gt;
       WHEN world.position = 1 THEN 1&lt;br&gt;
       ELSE 0&lt;br&gt;
       END) OVER(PARTITION BY )  AS n_days_on_n1_position&lt;br&gt;
FROM spotify_daily_rankings_2017_us us&lt;br&gt;
INNER JOIN spotify_worldwide_daily_song_ranking world&lt;br&gt;
ON world.trackname=us.trackname&lt;br&gt;
AND world.date = us.date&lt;br&gt;
WHERE us.position = 1&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Also, since we are computing this sum per US track name selected in the first line of the query, we insert that column into the PARTITION BY clause.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT us.trackname,&lt;br&gt;
       SUM(CASE&lt;br&gt;
       WHEN world.position = 1 THEN 1&lt;br&gt;
       ELSE 0&lt;br&gt;
       END) OVER(PARTITION BY us.trackname)  AS n_days_on_n1_position&lt;br&gt;
FROM spotify_daily_rankings_2017_us us&lt;br&gt;
INNER JOIN spotify_worldwide_daily_song_ranking world&lt;br&gt;
ON world.trackname=us.trackname&lt;br&gt;
AND world.date = us.date&lt;br&gt;
WHERE us.position = 1&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now, we can run this query, and the output is as below:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyx8cq6m6fq87lwipn7a9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyx8cq6m6fq87lwipn7a9.png" alt="SQL interview questions" width="707" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see multiple entries for the track ‘HUMBLE.’ because the OVER (PARTITION BY column_name) clause is a window function: unlike GROUP BY, it does not collapse the groups but returns one row for every input row, duplicates included.&lt;/p&gt;
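&lt;p&gt;To see why the duplicates appear, here is a minimal pandas analogue of the same partition-versus-aggregate distinction (toy data assumed, not the question’s real tables): groupby().transform() behaves like SUM() OVER (PARTITION BY), returning one value per input row.&lt;/p&gt;

```python
import pandas as pd

# Toy rows standing in for the joined US/worldwide rankings (values assumed)
df = pd.DataFrame({
    "trackname": ["HUMBLE.", "HUMBLE.", "HUMBLE.", "Bad and Boujee"],
    "world_position": [1, 1, 3, 1],
})

# Like SUM(CASE ...) OVER (PARTITION BY trackname): one result per input row
df["n_days_on_n1_position"] = (
    (df["world_position"] == 1).astype(int)
    .groupby(df["trackname"])
    .transform("sum")
)
print(df["n_days_on_n1_position"].tolist())  # [2, 2, 2, 1]
```

&lt;p&gt;Every ‘HUMBLE.’ row carries the same partition total, which is exactly the duplication seen in the screenshot above.&lt;/p&gt;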

&lt;p&gt;&lt;strong&gt;Step 4: Select the track name and apply the MAX() function to the subquery.&lt;/strong&gt;&lt;br&gt;
We need the MAX() function on the ‘n_days_on_n1_position’ column to pick only the largest value for each track name, which denotes the number of days in the number one position. But first, we wrap the query we’ve drafted so far into a subquery (derived table) aliased ‘tmp’.&lt;/p&gt;

&lt;p&gt;In addition, since MAX() is an aggregate function, we group the table by track name.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT tmp.trackname,&lt;br&gt;
        MAX(n_days_on_n1_position) AS n_days_on_n1_position&lt;br&gt;
FROM &lt;br&gt;
(SELECT us.trackname,&lt;br&gt;
       SUM(CASE&lt;br&gt;
       WHEN world.position = 1 THEN 1&lt;br&gt;
       ELSE 0&lt;br&gt;
       END) OVER(PARTITION BY us.trackname)  AS n_days_on_n1_position&lt;br&gt;
FROM spotify_daily_rankings_2017_us us&lt;br&gt;
INNER JOIN spotify_worldwide_daily_song_ranking world&lt;br&gt;
ON world.trackname=us.trackname&lt;br&gt;
AND world.date = us.date&lt;br&gt;
WHERE us.position = 1) tmp&lt;br&gt;
GROUP BY trackname&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Let us run the query now and take a look at the output.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvqravb8rnnk5y2sv5vaa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvqravb8rnnk5y2sv5vaa.png" alt="SQL interview questions" width="717" height="183"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Order the track name column alphabetically.&lt;/strong&gt;&lt;br&gt;
Finally, we will order the table by the track name alphabetically, as the question suggests we do.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT tmp.trackname,&lt;br&gt;
        MAX(n_days_on_n1_position) AS n_days_on_n1_position&lt;br&gt;
FROM &lt;br&gt;
(SELECT us.trackname,&lt;br&gt;
       SUM(CASE&lt;br&gt;
       WHEN world.position = 1 THEN 1&lt;br&gt;
       ELSE 0&lt;br&gt;
       END) OVER(PARTITION BY us.trackname)  AS n_days_on_n1_position&lt;br&gt;
FROM spotify_daily_rankings_2017_us us&lt;br&gt;
INNER JOIN spotify_worldwide_daily_song_ranking world&lt;br&gt;
ON world.trackname=us.trackname&lt;br&gt;
AND world.date = us.date&lt;br&gt;
WHERE us.position = 1) tmp&lt;br&gt;
GROUP BY trackname&lt;br&gt;
ORDER BY trackname&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The final result is as shown below:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9kpth7vfkgnlazrdpsp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9kpth7vfkgnlazrdpsp.png" alt="Spotify SQL interview questions" width="712" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the final result, we can infer that there are two tracks that were in the number one position in both the US and worldwide rankings tables. The first track, ‘Bad and Boujee (feat. Lil Uzi Vert)’, was number one for a day in both lists, and the second track, ‘HUMBLE.’, had three days of glory as number one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;It was a very interesting problem where we used multiple advanced SQL constructs: the OVER (PARTITION BY) window function, CASE statements, a join, and aggregations using the SUM() and MAX() functions. I hope you enjoyed working on the problem as well. Explore our platform for more Data Science-related &lt;a href="https://www.stratascratch.com/blog/sql-interview-questions-you-must-prepare-the-ultimate-guide/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to+spotify+advanced+sql+question"&gt;SQL interview questions&lt;/a&gt; and walkthroughs. Good luck!&lt;/p&gt;

</description>
      <category>discuss</category>
    </item>
    <item>
      <title>Facebook Python Interview Questions</title>
      <dc:creator>StrataScratch</dc:creator>
      <pubDate>Mon, 02 Jan 2023 10:10:26 +0000</pubDate>
      <link>https://dev.to/nate_at_stratascratch/facebook-python-interview-questions-15j7</link>
      <guid>https://dev.to/nate_at_stratascratch/facebook-python-interview-questions-15j7</guid>
      <description>&lt;p&gt;This Facebook python interview question will test your ability to use joins, perform transformations and calculations, and address edge-case scenarios.&lt;/p&gt;

&lt;p&gt;We’re back with another Python interview question from Facebook / Meta. We will solve it using our 3-step framework, which can be applied to any &lt;a href="https://www.stratascratch.com/blog/top-30-python-interview-questions-and-answers/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;Python interview question&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Facebook Python Interview Question
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---HmBwxg0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2hxk81427os20hsswxdf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---HmBwxg0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2hxk81427os20hsswxdf.png" alt="Facebook Python Interview Question" width="750" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Link to the question: &lt;a href="https://platform.stratascratch.com/coding/2123-product-families?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2123-product-families&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Video Solution:
&lt;/h2&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/RmMuS5iiviI"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;This Facebook Python interview question revolves around promotional campaigns. In solving it, we analyze how each of the product families is selling, with and without promotions applied.&lt;/p&gt;

&lt;p&gt;We are seeking a result table that shows all the product families, their corresponding total units sold as well as the percentage of units sold under a valid promotion. A valid promotion is defined in the problem statement as ‘not empty’ and ‘contained within the promotions table’.&lt;/p&gt;

&lt;p&gt;Looks pretty straightforward. The trick is to transform the data provided into the desired format for us to display. For that to happen, we will need to look into the tables provided.&lt;/p&gt;

&lt;p&gt;Always begin your solution by exploring the dataset.&lt;/p&gt;

&lt;h2&gt;
  
  
  Framework to solve this Facebook Python interview question
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KLTq8h5d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lr2kpucdq273vod9rsxl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KLTq8h5d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lr2kpucdq273vod9rsxl.png" alt="Framework to solve this Facebook python interview question" width="880" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Exploring the Dataset&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Meta has provided three tables, viz., &lt;strong&gt;facebook_products, facebook_sales_promotions, and facebook_sales.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There’s a lot of information available in those tables, so we need to distinguish and prioritize what columns or relationships to investigate during the interview. To preview the tables given, we can use the head() function. If you are practicing on our StrataScratch platform, however, the table can be previewed using the ‘Preview’ button.&lt;/p&gt;

&lt;p&gt;The first table, &lt;strong&gt;facebook_products&lt;/strong&gt;, contains information related to the products. It includes supplementary information such as class, brand, category, family, and more attributes like whether the product is low-fat and recyclable.&lt;/p&gt;

&lt;p&gt;It has the following schema:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XBPww6FD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b7c90asufs89snc2f9vp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XBPww6FD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b7c90asufs89snc2f9vp.png" alt="Facebook python interview question" width="705" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Run the code below to view a preview of the table:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;facebook_products.head()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Upon previewing the table using the head() function, we can see the following table:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_dKSFwmS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lyqy2v2h0cps3j7akbek.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_dKSFwmS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lyqy2v2h0cps3j7akbek.png" alt="The first table facebook_products" width="616" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The second table, &lt;strong&gt;facebook_sales_promotions&lt;/strong&gt;, is dedicated to the available promotions. It contains the start and end dates of each promotion, the media channel used to promote it, as well as the cost of these campaigns.&lt;/p&gt;

&lt;p&gt;The schema of the table is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BZyGegHD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rkp4c7rcusi0jb93tdfl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BZyGegHD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rkp4c7rcusi0jb93tdfl.png" alt="Facebook python interview question" width="691" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Run the code below:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;facebook_sales_promotions.head()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;A preview of the table looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9KBgQiTD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h6k5um9wb8qj24fjzhon.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9KBgQiTD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h6k5um9wb8qj24fjzhon.png" alt="The second table facebook_sales_promotions" width="611" height="221"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Third, we have the &lt;strong&gt;facebook_sales&lt;/strong&gt; table, which has the following schema:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--r5-5_HNK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xjs4w7cro0jsq97ph6oj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--r5-5_HNK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xjs4w7cro0jsq97ph6oj.png" alt="Facebook python interview question" width="674" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Preview the table by running the following code:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;facebook_sales.head()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hRcEwf-u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ar0mijwqn2mfmo6x6n0w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hRcEwf-u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ar0mijwqn2mfmo6x6n0w.png" alt="Third facebook_sales" width="613" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you explore the schema and datasets provided, notice the common columns between these tables. Specifically, they are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;product_id - facebook_sales and facebook_products tables&lt;/li&gt;
&lt;li&gt;promotion_id - facebook_sales and facebook_sales_promotions tables&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Make it a practice to spot the common columns, as they are what relate the tables later when we figure out an approach to this Facebook Python interview question, which brings us to the next step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Writing Out the Approach&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now that we have formed a better idea of our datasets, let’s formulate our approach by identifying the high-level steps of the solution. It’s always good to have a plan before execution, so let’s not code anything just yet!&lt;/p&gt;

&lt;p&gt;From the question, we figured out that the output table must contain three columns: Product Family, Total Units Sold, and Percentage of Units Sold Under Valid Promotion.&lt;/p&gt;

&lt;p&gt;Let’s create a narrative that will help us navigate this Facebook Python interview question.&lt;/p&gt;

&lt;p&gt;For each product family, we need the units sold in total and the percentage of these units sold under a valid promotion. To ensure the promotions are valid, we will need to cross-check the ‘promotion_id’ in the facebook_sales table with the ‘promotion_id’ in the facebook_sales_promotions table.&lt;/p&gt;

&lt;p&gt;Since the data we need lives in multiple tables, we will merge them to identify the product family and the validity of each sales promotion.&lt;/p&gt;

&lt;p&gt;Let’s list out the steps we are going to take.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The first step in our approach is to identify which ‘product_family’ the ‘product_id’ from the &lt;strong&gt;facebook_sales&lt;/strong&gt; table belongs to. We can achieve this by merging the &lt;strong&gt;facebook_sales&lt;/strong&gt; and &lt;strong&gt;facebook_products&lt;/strong&gt; tables.&lt;/li&gt;
&lt;li&gt;Next, we will create a new column to identify the sales made under valid promotions.&lt;/li&gt;
&lt;li&gt;Once the valid promotions are identified, we will split the merged table into valid and invalid promotion sales.&lt;/li&gt;
&lt;li&gt;Now, for each subset, we will compute the total units sold for each product family.&lt;/li&gt;
&lt;li&gt;We will then merge these subsets into one main table.&lt;/li&gt;
&lt;li&gt;We will then fill the null values with zeroes to avoid errors later on.&lt;/li&gt;
&lt;li&gt;We can now calculate the total sales per product family.&lt;/li&gt;
&lt;li&gt;Now that all the necessary information is available for the percentage calculation, we will compute the percentage of units sold under promotion.&lt;/li&gt;
&lt;li&gt;Finally, we will select the three columns mentioned earlier and replace any null values with zeroes.&lt;/li&gt;
&lt;/ol&gt;
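&lt;p&gt;Before coding each step against the real tables, the whole plan can be sketched end to end on toy data. Everything below is illustrative: the values, and therefore the resulting numbers, are assumed; only the table and column names follow the question.&lt;/p&gt;

```python
import pandas as pd

# Toy stand-ins for the three tables (values assumed)
facebook_sales = pd.DataFrame({
    "product_id": [1, 1, 2],
    "promotion_id": [10, None, 99],   # 99 is not a known promotion
    "units_sold": [4, 6, 5],
})
facebook_products = pd.DataFrame({
    "product_id": [1, 2, 3],          # product 3 ("C") has no sales
    "product_family": ["A", "B", "C"],
})
facebook_sales_promotions = pd.DataFrame({"promotion_id": [10, 11]})

# Steps 1-2: outer join, then flag sales made under a valid promotion
merged = facebook_sales.merge(facebook_products, how="outer", on="product_id")
known = set(facebook_sales_promotions.promotion_id)
merged["valid_promotion"] = merged.promotion_id.map(
    lambda x: not pd.isna(x) and x in known)

# Steps 3-5: split, total units per family in each subset, re-merge
valid = (merged[merged.valid_promotion]
         .groupby("product_family")["units_sold"].sum()
         .to_frame("valid_solds").reset_index())
invalid = (merged[~merged.valid_promotion]
           .groupby("product_family")["units_sold"].sum()
           .to_frame("invalid_solds").reset_index())
result = valid.merge(invalid, how="outer", on="product_family")

# Steps 6-9: fill NULLs, total sales, percentage sold on promotion
result = result.fillna(0)
result["n_sold"] = result.valid_solds + result.invalid_solds
result["perc_promotion"] = (100 * result.valid_solds / result.n_sold).fillna(0)
print(result[["product_family", "n_sold", "perc_promotion"]])
```

&lt;p&gt;The walkthrough below builds this pipeline one step at a time against the real data.&lt;/p&gt;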

&lt;p&gt;&lt;strong&gt;3. Coding the Solution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now that we have the approach written down, it is time to translate it into code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Identify the product family by joining the products and sales tables&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, let’s merge the facebook_products and facebook_sales tables on their common column, product_id. By default, we would reach for an inner join, but in this case, that’s not such a good idea.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Edge Case:&lt;/strong&gt;&lt;br&gt;
Remember that this Facebook Python interview question asks us to make calculations for &lt;strong&gt;all&lt;/strong&gt; the available product families. In an ideal setting, every product family would be well represented in the sales table, but realistically, that may not be the case.&lt;/p&gt;

&lt;p&gt;We need to account for phased-out items and newly introduced items that would have periods with no sales; these appear in the merged table as NULL values. We will handle this edge case later in our solution by filling the NULL values with 0s to avoid errors.&lt;/p&gt;
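&lt;p&gt;As a quick illustration of that fix (toy values assumed), pandas fills NULLs with zeros in one call:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical units-sold column where one product family had no sales
units = pd.Series([12.0, None, 7.0])
print(units.fillna(0).tolist())  # [12.0, 0.0, 7.0]
```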

&lt;p&gt;Regardless of whether a sale was made, all product families need to appear in our solution, so we will use an ‘&lt;strong&gt;outer&lt;/strong&gt;’ join instead of an inner join.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;import pandas as pd&lt;br&gt;
merged = facebook_sales.merge(facebook_products, how="outer", on="product_id")&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The output of the table is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IsoIljJP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z8zaaoe2p90881n608ij.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IsoIljJP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z8zaaoe2p90881n608ij.png" alt="Output for the Facebook python interview question" width="880" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An alternative is to make the &lt;strong&gt;facebook_products&lt;/strong&gt; table the base table and left-join &lt;strong&gt;facebook_sales&lt;/strong&gt; onto it. This is another viable way to ensure that all the product families are captured.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Create a new column identifying the sales made under a valid promotion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Next, we want to establish the promotion validity of each sale and create a new column for this information.&lt;/p&gt;

&lt;p&gt;To mark a promotion as ‘valid’, we will need to ensure that the promotion_id in the sales table is not empty and that the promotion_id is also contained in the promotions table.&lt;/p&gt;

&lt;p&gt;Of course, there are different ways to get this done. A neat trick is to use a &lt;strong&gt;map-lambda&lt;/strong&gt; combination. For those unfamiliar, map() takes an iterable, like a list, and transforms each of its items by applying the function specified.&lt;/p&gt;

&lt;p&gt;We only need a temporary function here, so we can use a lambda function instead of defining a whole new function separately. As an added bonus, lambda allows us to specify the function in a single line of code!&lt;/p&gt;
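&lt;p&gt;As a minimal, self-contained sketch (the IDs are assumed for illustration), the map-lambda pattern looks like this:&lt;/p&gt;

```python
# Toy promotion IDs: None is missing, 99 is unknown (values assumed)
promo_ids = [10, None, 99]
known_ids = {10, 11}

# map() applies the lambda to every element in a single pass
flags = list(map(lambda x: x is not None and x in known_ids, promo_ids))
print(flags)  # [True, False, False]
```

&lt;p&gt;This is the same shape as the validity flag we are about to compute on the merged table.&lt;/p&gt;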

&lt;p&gt;Now the merged table should contain the main requirements of our solution - the product family, the number of units sold, and an indication of the promotion validity.&lt;/p&gt;

&lt;p&gt;Digging deeper into the target table once more, the total units sold and percentage of sales on promotion can be calculated as:&lt;/p&gt;

&lt;p&gt;We can note the target computation as a comment (under “OUTPUT: product_family | n_sold | perc_promotion”):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;# product_family | SUM(units_sold) | SUM(units_sold_valid) / SUM(units_sold)*100&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now, we will create a new column named &lt;strong&gt;valid_promotion&lt;/strong&gt;, marked True if the promotion is valid and False if it isn’t. A valid promotion is defined by two conditions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The promotion_id cannot be missing or null&lt;/li&gt;
&lt;li&gt;The promotion_id should also exist in the facebook_sales_promotions table&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first condition can be checked with the pandas function isna(), which is why we imported pandas earlier. The second can be checked efficiently by getting the full list of unique promotion_ids from the promotions table and testing membership in that list.&lt;/p&gt;

&lt;p&gt;Let’s now write these two conditions in code as shown below:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;merged['valid_promotion'] = merged.promotion_id.map(&lt;br&gt;
    lambda x: not pd.isna(x)&lt;br&gt;
    and x in facebook_sales_promotions.promotion_id.unique())&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The output of this step is a single boolean column of True and False values indicating whether each promotion_id was valid.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--R4PB1H88--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v03n38mjveuluom35ugk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--R4PB1H88--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v03n38mjveuluom35ugk.png" alt="Output 2 for the Facebook python interview question" width="597" height="522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Split the merged table into valid and invalid promotion sales&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s now use this column to split the merged table we created earlier into valid and invalid promotion sales, filtering on ‘valid_promotion’ as shown below:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;valid_promotion = merged[merged.valid_promotion]&lt;br&gt;
invalid_promotion = merged[~merged.valid_promotion]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The valid promotion dataset is illustrated below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8yJ5GwPi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pdm8bwhuj9igxeka0pej.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8yJ5GwPi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pdm8bwhuj9igxeka0pej.png" alt="Valid promotions dataset output" width="808" height="224"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And the invalid promotions dataset is as below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JkNOONxC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d1x3ej8qg7kr0apzijp3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JkNOONxC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d1x3ej8qg7kr0apzijp3.png" alt="Invalid promotions dataset output" width="808" height="163"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. For each subset, compute the total units sold per product family&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now that we have segregated the merged table into valid and invalid promotion datasets, we can perform aggregation on these datasets to calculate the sum of units sold for the sales made in each of these datasets. We will use the &lt;strong&gt;groupby()&lt;/strong&gt; function to get the total units sold at the product family level.&lt;/p&gt;

&lt;p&gt;Let’s start with the valid promotions.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;valid_promotion.groupby('product_family')['units_sold'].sum()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rov2VWdk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/axxxc93vgmvfx4q51x5d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rov2VWdk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/axxxc93vgmvfx4q51x5d.png" alt="Output 3 for the Facebook python interview question" width="760" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The output is a column of total units sold under a valid promotion. Let’s convert it into a dataframe using the &lt;strong&gt;to_frame()&lt;/strong&gt; function, which lets us name the column in the same line. Simply append to_frame() to the previous code.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;valid_promotion.groupby('product_family')['units_sold'].sum().to_frame('valid_solds')&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The output looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UsBu-BQz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k920e4a58nznfv1wsp3m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UsBu-BQz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k920e4a58nznfv1wsp3m.png" alt="Output 4 for the Facebook python interview question" width="760" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It does not expose the product family the sales were made on; it is stored as the index. So let us reset the index to make the product_family column available again. Append the reset_index() function to the code above:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;valid_promotion.groupby('product_family')['units_sold'].sum().to_frame('valid_solds').reset_index()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now the output displays the respective product family as well:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VWjsgIhe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wmbhpss0hqrn981917cm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VWjsgIhe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wmbhpss0hqrn981917cm.png" alt="Output 5 for the Facebook python interview question" width="760" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To differentiate between the valid and invalid promotions, let’s save the above line as &lt;strong&gt;results_valid&lt;/strong&gt; and repeat the same steps for the invalid_promotion table.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;results_valid = valid_promotion.groupby('product_family')['units_sold'].sum().to_frame('valid_solds').reset_index()&lt;br&gt;
invalid_promotion = merged[~merged.valid_promotion]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Note that the second line, creating invalid_promotion, is the split we already wrote in Step 3; it is repeated here for completeness.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;result_invalid = invalid_promotion.groupby('product_family')['units_sold'].sum().to_frame('invalid_solds').reset_index()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The result_invalid table looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--F3n2AUyT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mzxiguqelpy76a6qnb9d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--F3n2AUyT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mzxiguqelpy76a6qnb9d.png" alt="Output 6 for the Facebook python interview question" width="760" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Merge the results for the valid and invalid promotions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now it is time to merge the two tables, results_valid and result_invalid, so that our calculations down the line become easier. The merged table will contain the product family, the total units sold under valid promotions, and the total units sold under invalid promotions.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;result = results_valid.merge(result_invalid, how='outer', on='product_family')&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GqzQU5eD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ufcpvjdtodnsdig4u88x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GqzQU5eD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ufcpvjdtodnsdig4u88x.png" alt="Output 7 for the Facebook python interview question" width="760" height="210"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s take a step back and consider the edge case scenario that we had identified earlier in Step 1, wherein a product family does not have any sales on a valid promotion.&lt;/p&gt;

&lt;p&gt;While we’re at it, there could also be a scenario wherein all the sales of a product family are made under promotion, which creates a missing value when we merge the two tables. Once made apparent, these scenarios are easily solved with the &lt;strong&gt;fillna()&lt;/strong&gt; function: all we need to do is fill the null values with zeroes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Fill null values with zeroes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s append the previous line of code with the &lt;strong&gt;fillna()&lt;/strong&gt; function as shown below:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;result = results_valid.merge(result_invalid, how='outer', on='product_family').fillna(0)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The output table is cleaner and will make our further calculations a cakewalk.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--o1ntfWKF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3gtxzjc2qz3oouxk3qx6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--o1ntfWKF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3gtxzjc2qz3oouxk3qx6.png" alt="Output 8 for the Facebook python interview question" width="712" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Calculate the total sales&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let us create new columns to store the value of total sales and the percentage of sales under valid promotions.&lt;br&gt;
Firstly, let’s calculate the total sales, which is simply the sum of valid_solds and invalid_solds.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;result['total'] = result['valid_solds'] + result['invalid_solds']&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The output looks as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T0d6HVYO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c442x1o6wcemiho2ptun.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T0d6HVYO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c442x1o6wcemiho2ptun.png" alt="Output 9 for the Facebook python interview question" width="712" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. Calculate the percentage of units sold under promotion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The last piece of computation is to capture the percentage of units sold under a valid promotion. We can calculate it using the formula in the code below:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;result['valid_solds_percentage'] = result['valid_solds'] / result['total'] * 100&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The output of this line is as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9nohCdGm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3our97lfri9urwmy329z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9nohCdGm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3our97lfri9urwmy329z.png" alt="Output 10 for the Facebook python interview question" width="880" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;9. Display the relevant columns, replacing any na’s with 0s&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Lastly, let us select the required columns for our output: &lt;strong&gt;product_family, total, valid_solds_percentage&lt;/strong&gt;, and add a final touch by replacing any na’s with zeroes.&lt;/p&gt;

&lt;p&gt;Before displaying the results, let’s be mindful of the edge case where a product family has no sales at all, i.e., the total units sold is zero. Subsequently, the total units sold under a promotion would also be zero, and dividing zero by zero in pandas produces a NaN rather than a number. We cover this edge case by replacing any NaNs with zeroes.&lt;/p&gt;

&lt;p&gt;Select the necessary columns as shown below:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;result[['product_family', 'total','valid_solds_percentage']].fillna(0)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;To get the complete picture, here is the full solution:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;import pandas as pd&lt;br&gt;
merged = facebook_sales.merge(facebook_products, how="outer", on="product_id")&lt;br&gt;
merged['valid_promotion'] = merged.promotion_id.map(lambda x: \&lt;br&gt;
        not pd.isna(x) and x \&lt;br&gt;
        in facebook_sales_promotions.promotion_id.unique())&lt;br&gt;
valid_promotion = merged[merged.valid_promotion]&lt;br&gt;
invalid_promotion = merged[~merged.valid_promotion]&lt;br&gt;
results_valid = valid_promotion.groupby('product_family')['units_sold'].sum().to_frame('valid_solds').reset_index()&lt;br&gt;
result_invalid = invalid_promotion.groupby('product_family')['units_sold'].sum().to_frame('invalid_solds').reset_index()&lt;br&gt;
result = results_valid.merge(result_invalid, how='outer', on='product_family').fillna(0)&lt;br&gt;
result['total'] = result['valid_solds'] + result['invalid_solds']&lt;br&gt;
result['valid_solds_percentage'] = result['valid_solds'] / result['total'] * 100&lt;br&gt;
result[['product_family', 'total','valid_solds_percentage']].fillna(0)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now let us run the complete code to get the expected result as shown below.&lt;/p&gt;

&lt;p&gt;All required columns and the first 5 rows of the solution are shown&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Onbu_TmK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mp08k58x3njm32hb5e2q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Onbu_TmK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mp08k58x3njm32hb5e2q.png" alt="Output 10 for the Facebook python interview question" width="839" height="273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you dive a little deeper into the result table, you can see that there are no sales for the product family ‘Accessory’. Had we not handled this edge case, our percentage calculation would have produced a missing value for it. You can see now how important it is to anticipate such cases at various points in the solution.&lt;/p&gt;
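&lt;p&gt;As an aside, the promotion-validity flag built above with map() and a lambda can also be written with isin(), which treats NaN as a non-member automatically. A sketch on toy data (table and column names follow the walkthrough; the values are made up):&lt;/p&gt;

```python
import pandas as pd

# Toy stand-ins for the merged table and the promotions table
merged = pd.DataFrame({'promotion_id': [1, 2, None, 5]})
facebook_sales_promotions = pd.DataFrame({'promotion_id': [1, 5]})

# isin() returns False for NaN, so no separate pd.isna() guard is needed
merged['valid_promotion'] = merged.promotion_id.isin(
    facebook_sales_promotions.promotion_id)
print(merged['valid_promotion'].tolist())
```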

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Despite the difficulty level of this Facebook Python interview question, it was not as complicated as you might have expected. We hope you learned something about JOINs, data transformation, and aggregations.&lt;/p&gt;

&lt;p&gt;Practice is the only way to mastery. Keep practicing from our &lt;a href="https://www.stratascratch.com/blog/python-coding-interview-questions/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;Python Coding Interview Questions&lt;/a&gt; article.&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>career</category>
    </item>
    <item>
      <title>How to Get Hired as a Data Scientist at Google</title>
      <dc:creator>StrataScratch</dc:creator>
      <pubDate>Tue, 01 Nov 2022 03:32:51 +0000</pubDate>
      <link>https://dev.to/nate_at_stratascratch/how-to-get-hired-as-a-data-scientist-at-google-103g</link>
      <guid>https://dev.to/nate_at_stratascratch/how-to-get-hired-as-a-data-scientist-at-google-103g</guid>
      <description>&lt;p&gt;&lt;em&gt;Everybody wants to work at Google, but how do you become one that works there?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8zcHIEMW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7nga2fzsmqguzz7kxl1v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8zcHIEMW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7nga2fzsmqguzz7kxl1v.png" alt="How to Get Hired as a Data Scientist at Google" width="880" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Why would you even want to work at Google? Every person has different career goals, motivations, and reasons for choosing a certain career and an employer. However, I think it would be safe to reduce the multitude of reasons to two:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Competitive salary&lt;/li&gt;
&lt;li&gt;Using and furthering your skills as a data scientist&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These two reasons work together. Of course, you want to be well paid, especially when data science is a specific field requiring multidisciplinary knowledge and education. You want to get compensated fairly for the time, effort, and money you invested in your education.&lt;/p&gt;

&lt;p&gt;While you need to make a living (unless you inherited a significant amount of wealth), why not make it a comfortable living while also doing something that you find interesting? You don’t just accidentally start working in data science, so it’s a safe bet that you’re here because you find data and data science interesting, regardless of money. And if you do, then you’d want to work at a top company that is a leader in innovation and the latest technologies. Working at such a company challenges your skills. By participating in the most varied and technically advanced data science projects, it gives you a platform for developing those skills further than you would be able to at most other companies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Google Data Scientist Salary
&lt;/h2&gt;

&lt;p&gt;One of the ways Google attracts top data scientists is by offering them a competitive salary. As a data scientist at Google, you can earn well above the US median salary (almost 250% of it), even in the most junior positions. Compensation usually includes not only the base salary but also cash and stock bonuses.&lt;/p&gt;

&lt;p&gt;Apart from this, other material benefits cover&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Insurance, Health, and Wellness&lt;/li&gt;
&lt;li&gt;Financial and Retirement&lt;/li&gt;
&lt;li&gt;Home&lt;/li&gt;
&lt;li&gt;Transportation&lt;/li&gt;
&lt;li&gt;Perks and Discounts,&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and other benefits unique to Google.&lt;/p&gt;

&lt;p&gt;Some of these benefits are health &amp;amp; life insurance, 401k, Student Loan Repayment Plan, remote work, adoption and surrogacy assistance, transportation allowance, free lunch and drinks,  tuition reimbursement, etc. You can learn more about salaries, benefits, and levels in the &lt;a href="https://www.stratascratch.com/blog/google-data-scientist-salary/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;Google Data Scientist Salary article.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Now, the Hard Part: Getting Hired!
&lt;/h2&gt;

&lt;p&gt;Knowing what salary you can get won’t, by itself, get you a job at Google, but it can serve as good motivation for getting the hard part done. How do you do that? The approach is defined by three aspects you need to pay attention to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Knowing Google’s hiring process&lt;/li&gt;
&lt;li&gt;Having skills they need&lt;/li&gt;
&lt;li&gt;Acing the job interview&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  1. What is Google’s Hiring Process?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uuQEd5o0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gic4ojrmvznai02aoiff.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uuQEd5o0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gic4ojrmvznai02aoiff.png" alt="Data Scientist at Google" width="880" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Knowing how Google hires is the first step in getting hired. From a high-level perspective, Google’s process is the same as in any other company and consists of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Applying for a job&lt;/li&gt;
&lt;li&gt;Interviews&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, details matter, and you should go into detail on what Google wants to see in your job application and what their interviews look like.&lt;/p&gt;

&lt;h2&gt;
  
  
  Job Application
&lt;/h2&gt;

&lt;p&gt;Google strongly advises that you do a little self-reflection before you jump into applying for a job. This means thinking about your skills, interests, goals, and motivations; that way, you can check whether you’re the right fit even before applying. When reflecting on your professional life and yourself as a person, consider whether you prefer working alone or as part of a team, what kind of work you find most rewarding, what your passions are, whether you get excited about solving a problem or discussing it, and so on.&lt;/p&gt;

&lt;p&gt;Once you decide to apply for a job at Google, you should know the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cover letters are not required&lt;/li&gt;
&lt;li&gt;Tailor CV to the specific position (even if you apply for multiple positions) – generic CVs are a big no-no!&lt;/li&gt;
&lt;li&gt;Keep CV concise and focused&lt;/li&gt;
&lt;li&gt;Highlight the skills that are required for the job you apply for&lt;/li&gt;
&lt;li&gt;Quantify your success at your previous job – ‘successful’, ‘quicker’, ‘more efficient’, ‘disruption’, and ‘data-driven’ are not metrics&lt;/li&gt;
&lt;li&gt;Mention your references – if you have somebody that can verify your work experience, projects you did, your character, and skills in general, that will ensure your resume will be looked at; it doesn’t guarantee you’ll get an interview, but it can increase your chances.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Interviews at Google
&lt;/h2&gt;

&lt;p&gt;Before you come to the interview stage, ensure you know Google as a company well. Be informed about their history, organization, values, and products. This will show you’re really interested in working at Google and that you didn’t apply for a job accidentally. Imagine that you send your resume, you come to the interview, and the interviewer doesn’t know your name or anything about your education or work history. You wouldn’t be happy, would you? The same goes for Google: they like to see that what you know about them makes you want to work for them.&lt;/p&gt;

&lt;p&gt;The first step before the interviews with Google is a phone call with the recruiter. They will ask you a few general questions about your work experience and interest in working at Google. They might also ask a simple technical question or two (e.g., a probability question or something easily solved in a minute) to get a general idea of whether you’re suitable for the position. However, the main point of this call is to understand your work experience and how it aligns with what Google is looking for.&lt;/p&gt;

&lt;p&gt;Then comes the interview process at Google, and there’s no big mystery here: Google itself lists the types of interviews you could expect. You just need to take some time to inform yourself about it and be prepared.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Online assessment&lt;/strong&gt; is the first elimination stage when it comes to interviews. This is usually a short test of your coding skills conducted online.&lt;/p&gt;

&lt;p&gt;If you get past this, then comes a &lt;strong&gt;short virtual chat or two.&lt;/strong&gt; These are not on-premises but over the phone or video chat. They involve a recruiter, hiring manager, and/or a colleague from the team asking you about the skills required for the job you applied for. The point is for them to get a picture of your technical profile and whether it generally suits the position. They can also discover that, even though you may be missing some of the required skills, you have other skills that would serve the particular job well.&lt;/p&gt;

&lt;p&gt;Depending on the job, Google might ask you to do &lt;strong&gt;project work.&lt;/strong&gt; This means doing a little project or providing some of your previous work/code.&lt;/p&gt;

&lt;p&gt;All these steps are where the candidates get eliminated before they get to the in-depth interviews. There are usually 3-4 interviews in one day, intended to assess your technical skills, problem-solving and thinking process, and personality traits.&lt;/p&gt;

&lt;p&gt;When it comes to testing your expertise, this is usually done through the following types of questions. They don’t come up every time; the types of questions you get heavily depend on the position you applied for.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Coding Questions&lt;/li&gt;
&lt;li&gt;Algorithm Questions&lt;/li&gt;
&lt;li&gt;Statistics Questions&lt;/li&gt;
&lt;li&gt;Modeling Questions&lt;/li&gt;
&lt;li&gt;Business Case Questions&lt;/li&gt;
&lt;li&gt;Product Questions&lt;/li&gt;
&lt;li&gt;Technical Questions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can find more about the whole process on the &lt;a href="https://careers.google.com/how-we-hire/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=medium#step-interviews"&gt;Google website.&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  2. What Skills Does Google Want to See in Data Scientists?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--E09Aq9II--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o2xyosl6uo1d327z1pxn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--E09Aq9II--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o2xyosl6uo1d327z1pxn.png" alt="What Skills Google Wants to See in Data Scientists?" width="880" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There’s no such thing as an ideal candidate. Google knows that because Google knows everything. Every candidate has unique skills and characteristics that could make them a desirable candidate. Google tries to select the candidates with the best combination of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hard skills and qualifications,  and&lt;/li&gt;
&lt;li&gt;Soft skills&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Hard Skills and Qualifications
&lt;/h2&gt;

&lt;p&gt;The general requirements for data scientists at Google are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Master's Degree in Statistics, Computer Science, or other relevant quantitative disciplines&lt;/li&gt;
&lt;li&gt;Relevant experience – you’ll need more of it to compensate if you’re lacking the required formal education level&lt;/li&gt;
&lt;li&gt;Programming languages: SQL and R/Python&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Depending on the position you’re applying for, some other specific requirements can be focused on the following areas: statistics, machine learning, AI, data analysis, data visualization, engineering, software development, products, etc.&lt;/p&gt;

&lt;h2&gt;
  
  
  Soft Skills
&lt;/h2&gt;

&lt;p&gt;Getting hired at Google requires high scores in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Interdisciplinarity&lt;/li&gt;
&lt;li&gt;Big-picture Perspective&lt;/li&gt;
&lt;li&gt;Being Customer-Oriented&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data science is an interdisciplinary field per se. It merges statistics, mathematics, and business knowledge. This interdisciplinarity is compounded by the requirement for data scientists to work with other sectors in Google, such as Product, Marketing, Engineering, etc.&lt;/p&gt;

&lt;p&gt;The essence of data science is problem-solving with the business outcome in mind. To solve problems, every company (including Google) introduces projects. The only way to complete the projects successfully is to be focused on the project outcome and know how to achieve this goal. With such a desirable skill, it’s no wonder Google wants to see big-picture energy from their data scientists.&lt;/p&gt;

&lt;p&gt;Every data science project at Google has business in mind, and when we say business, we mean customers. Everything you do will, directly or indirectly, be used by Google’s customers. Their satisfaction is key to Google keeping its market-leader position, just as it is to you getting the job. Show that you have this in you, and you’re one step closer to becoming a data scientist at Google.&lt;/p&gt;

&lt;p&gt;The final step for achieving this is performing well in the interviews.&lt;/p&gt;

&lt;p&gt;To get more details about Google’s hiring process and the skills they’re looking for, take a look at &lt;a href="https://www.stratascratch.com/blog/the-ultimate-guide-to-become-a-data-scientist-at-google/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;The Ultimate Guide to Become a Data Scientist at Google.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Acing Job Interview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CUjm2Kwl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o75k44om82ki4fyvgp1t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CUjm2Kwl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o75k44om82ki4fyvgp1t.png" alt="Acing Job Interview" width="880" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The central part of all the interviews you’ll have at Google are, one way or another, your technical skills.&lt;/p&gt;

&lt;p&gt;While you for sure don’t know which questions you’ll get, there are still ways for you to better your chances of getting a job.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Brush up your skills&lt;/li&gt;
&lt;li&gt;Have a clear approach to coding questions&lt;/li&gt;
&lt;li&gt;Be self-aware&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Brushing Up Your Skills
&lt;/h2&gt;

&lt;p&gt;You’re preparing for a job interview, right? Solving the actual job interview questions before the real job interview at Google seems quite logical.&lt;/p&gt;

&lt;p&gt;There are platforms where you can do exactly that: StrataScratch, LeetCode, SQLPad, or HackerRank, for example. There you can practice the SQL, Python, algorithm, and other technical skills tested by Google.&lt;/p&gt;

&lt;p&gt;There are also other ways to refresh your knowledge or learn something new. You have course websites (e.g., Coursera, Udemy, edX), YouTube channels (e.g., freeCodeCamp.org, Alex the Analyst, Amigoscode), blogs (LearnSQL.com, GeeksforGeeks, W3Schools), and data science communities (Stack Overflow, Reddit, GitHub, Codementor) at your disposal. While they don’t necessarily prepare you specifically for a Google job, these resources (and many others) can help you with the data science concepts you can readily apply at the Google job interview.&lt;/p&gt;

&lt;h2&gt;
  
  
  Framework for the Coding Questions
&lt;/h2&gt;

&lt;p&gt;It is crucial to write a correct solution at the coding interview; I don’t deny that. But the pressure and limited time of an interview can make even the most experienced look a level or two below their natural coding-master selves.&lt;/p&gt;

&lt;p&gt;To get around this, I advise that you always have a clearly defined framework of how to approach solving the coding questions. I found that these four general guidelines work best:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Explore the dataset&lt;/li&gt;
&lt;li&gt;Identify relevant columns&lt;/li&gt;
&lt;li&gt;Write out the code logic&lt;/li&gt;
&lt;li&gt;Code&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Exploring the dataset
&lt;/h2&gt;

&lt;p&gt;Exploring the dataset involves getting to know each table’s data structure. It also means detecting the shared columns between the tables, thus knowing how the tables can communicate. Along the way, get a sense of data types in each column and whether there might be duplicate or NULL values.&lt;/p&gt;

&lt;h2&gt;
  
  
  Identify relevant columns
&lt;/h2&gt;

&lt;p&gt;When you identify the relevant columns, you eliminate the unnecessary ones that can clutter your thinking and divert you when writing code. The interview questions often give you more data than you need, reflecting a data scientist's real life. Consider this a small test where you can show that you can differentiate between relevant and irrelevant data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Write out the code logic
&lt;/h2&gt;

&lt;p&gt;Before you start coding, it’s important to write out all the steps of your solution. Break down the code into logical blocks and/or individual steps, and decide on the functions you will use, why you will use them, and how. The code logic can be written in English (or any other language the interview is conducted in) or a pseudo-SQL/R/Python/any other programming language code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;Coding should, at this point, feel almost like a technicality. All the previous steps will make it possible for you to focus on the code syntax, its efficiency, and debugging. It also allows you to check the code logic and catch all the missing or unnecessary steps.&lt;/p&gt;

&lt;p&gt;This is how this framework can be applied to the &lt;a href="https://www.stratascratch.com/blog/google-data-scientist-interview-questions/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;Google Data Scientist Interview Questions.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Self-awareness
&lt;/h2&gt;

&lt;p&gt;Job interviews are stressful, draining, and require a lot of concentration. They are, to be honest, a pain in the ass. They can sometimes show the worst side of our characters. Don’t let this happen to you by considering three simple things.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slow down&lt;/li&gt;
&lt;li&gt;Be friendly&lt;/li&gt;
&lt;li&gt;Listen to the interviewers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When under stress, people tend to jump to answers, stop taking time for thinking, and talk too fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Slow down.&lt;/strong&gt; Allow yourself time to think about what you’re going to say, ask for clarification if you didn’t understand what was being asked, and try to be as articulate as you can when you talk.&lt;/p&gt;

&lt;p&gt;People often wrongly think that silence between the interviewer’s question and your answer shows you’re a slow thinker, low on self-confidence, or whatnot. No, it shows that you’re thinking, not simulating it. It shows you’re confident enough to take your time to come up with the best possible answer, which ultimately shows you can handle stressful situations – highly desirable skills for a data scientist! Also, if you talk at a medium pace, chances are better that you won’t blurt out something stupid, and everything smart that you say will be easily followed and acknowledged as smart by the interviewer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Be friendly.&lt;/strong&gt; You won’t be working alone in a cave atop some hill. Like it or not, you’ll work within a team and cooperate with other teams. Having different personalities in a team or across teams is desirable, but this has limits. Nobody wants to work with a person who drains the energy from everyone else, starts petty fights, takes credit for someone else’s work, or sabotages everybody else. Google wants people whom others enjoy working with, so remaining friendly and good-spirited under pressure is something they’ll look for. The interview is a perfect opportunity to showcase this side of yourself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Listening&lt;/strong&gt; is equally important as talking. Pay attention to what the interviewer asks so you can answer their questions. Don’t interrupt them, but ask questions if you want something to be clarified.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Getting hired at Google starts with knowing what you want and finding an ad for a job that you’d like. Then comes the part where you apply for a job by satisfying specific format requirements and make yourself familiar with Google itself: its hiring process and other aspects of the way they operate.&lt;/p&gt;

&lt;p&gt;When you come to the interview stage, you must know what to expect there: what types of interviews they conduct and which topics they cover. Once you know that, prepare yourself as best you can. Use various sources, such as job interview question examples, YouTube channels, blog articles, and courses, or get involved with the data science community to ask about technical concepts and others’ experiences of getting hired by Google.&lt;/p&gt;

&lt;p&gt;These steps prepare you to shine in a job interview, where you can confidently showcase your hard and soft skills. In other words, the best version of yourself.&lt;/p&gt;

</description>
      <category>career</category>
      <category>datascience</category>
      <category>community</category>
      <category>motivation</category>
    </item>
    <item>
      <title>Statistics Cheat Sheet: Data Collection and Exploration</title>
      <dc:creator>StrataScratch</dc:creator>
      <pubDate>Thu, 27 Oct 2022 03:03:37 +0000</pubDate>
      <link>https://dev.to/nate_at_stratascratch/statistics-cheat-sheet-data-collection-and-exploration-1amd</link>
      <guid>https://dev.to/nate_at_stratascratch/statistics-cheat-sheet-data-collection-and-exploration-1amd</guid>
      <description>&lt;p&gt;&lt;em&gt;This Statistics Cheat Sheet includes the concepts that you must know for your next data science or analytics interview presented in an easy-to-remember manner.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eZOFNk9T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m9jxbqfwjlsxdlbyc2vd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eZOFNk9T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m9jxbqfwjlsxdlbyc2vd.png" alt="Statistics Cheat Sheet" width="880" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Statistics is the foundation of Data Analysis and Data Science. While many aspirants concentrate on learning fancy algorithms with arcane names, they neglect the fundamentals and end up messing up their interviews. Without an in-depth understanding of statistics, it is difficult to make a serious &lt;a href="https://www.stratascratch.com/blog/a-complete-guide-to-data-scientist-career-path/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;career in data science&lt;/a&gt;. One need not have a Ph.D., but one must understand the basic math and intuition behind the statistical methods in order to be successful. In this series, we will go through the fundamentals of statistics that you must know in order to clear your next Data Science Interview.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Statistics
&lt;/h2&gt;

&lt;p&gt;Statistics is the science pertaining to the collection, analysis, interpretation or explanation, and presentation of data. There are three main pillars of statistics.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data Collection and Exploration&lt;/li&gt;
&lt;li&gt;Probability&lt;/li&gt;
&lt;li&gt;Statistical Inference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this three-part series, we will look at the major areas of statistics relevant to a budding Data Scientist. In this part, we will look at Data Collection and Exploration. We have a fantastic set of &lt;a href="https://www.stratascratch.com/blog/categories/statistics/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;statistics blog articles&lt;/a&gt; that you can find here. Some recommended articles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.stratascratch.com/blog/ab-testing-data-science-interview-questions-guide/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;A/B Testing for Data Science Interviews&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.stratascratch.com/blog/basic-types-of-statistical-tests-in-data-science/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;Basic Types of Statistical Tests&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.stratascratch.com/blog/30-probability-and-statistics-interview-questions-for-data-scientists/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;Probability Interview Questions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also, check out our comprehensive “&lt;a href="https://www.stratascratch.com/blog/a-comprehensive-statistics-cheat-sheet-for-data-science-interviews/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;statistics cheat sheet&lt;/a&gt;” that goes beyond the very fundamentals of statistics (like mean/median/mode).&lt;/p&gt;

&lt;h2&gt;
  
  
  Sample vs Population
&lt;/h2&gt;

&lt;p&gt;In order to analyze data, it is important to collect it. In statistics, we usually set out to examine a population. A population can be considered to be a collection of objects, people, or natural phenomena under study. For example, the income of a graduate fresh out of college, the weight of a donut, or the time spent on smartphones. Since it is not always possible or it is prohibitively expensive (or both) to collect data about the entire population, we rely on a subset of the population.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2CK6gIfr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z8ds11xr70o2mj2a78ax.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2CK6gIfr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z8ds11xr70o2mj2a78ax.png" alt="Sample vs Population in Statistics Cheat Sheet" width="447" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This subset (or sample), if chosen properly, can help us understand the entire population with relative surety and make our decisions. From this sample data (for example, the incomes of people), we calculate a statistic (for example, the typical income). This statistic represents a property of a sample. The statistic is an estimate of a population parameter (the typical income of all Americans).&lt;/p&gt;
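&lt;p&gt;As a quick sketch of the statistic-vs-parameter idea (in Python, with a made-up income population — the numbers here are purely hypothetical):&lt;/p&gt;

```python
import random

random.seed(42)
# Hypothetical population: yearly incomes (in $k) of 10,000 people.
population = [random.gauss(60, 15) for _ in range(10_000)]

# Population parameter: the true mean income of everyone.
population_mean = sum(population) / len(population)

# Sample statistic: the mean of a 200-person simple random sample,
# which serves as an estimate of the population parameter.
sample = random.sample(population, 200)
sample_mean = sum(sample) / len(sample)
```

&lt;p&gt;The sample mean will typically land close to, but not exactly on, the population mean — which is precisely why sampling works.&lt;/p&gt;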

&lt;h2&gt;
  
  
  Sampling Methods
&lt;/h2&gt;

&lt;p&gt;For a sample to be representative of the population, it should have the same characteristics as the rest of the population. For example, if one were to survey the attitudes of Americans towards conservative values, then the students of the liberal arts department from a Blue state college might not be the best representation. Statisticians use multiple ways to ensure that the sample is random and truly representative of the entire population. Here we look at some of the most common methods used for sampling. Each method has its pros and cons; we will look into those as well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Simple Random Sample
&lt;/h2&gt;

&lt;p&gt;The simplest of the sampling methods is the simple random sample. One randomly picks a subset of the entire population, so that each person has the same probability of being chosen as every other person.&lt;br&gt;
​&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--U8KQLt01--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9q4r5vsgt33s7uv25tbb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U8KQLt01--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9q4r5vsgt33s7uv25tbb.png" alt="Statistics Cheat Sheet" width="705" height="87"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;where n is the size of the population.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0zj7v-LV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zd0ls4od8xb8bdn3azmp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0zj7v-LV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zd0ls4od8xb8bdn3azmp.png" alt="Simple random sample" width="512" height="485"&gt;&lt;/a&gt;&lt;/p&gt;
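&lt;p&gt;Drawing a simple random sample takes one line with Python’s standard library (the population here is hypothetical):&lt;/p&gt;

```python
import random

random.seed(0)
population = list(range(1, 101))  # a hypothetical population of 100 units

# Each unit has the same chance of being chosen; no unit is repeated.
sample = random.sample(population, 10)
```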

&lt;p&gt;A simple random sample has two properties that make it the standard against which we compare all the other sampling methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bias:&lt;/strong&gt; A simple random sample is unbiased. In other words, each unit has the same chance of being chosen as every other unit. There is no preference given to a particular unit or units.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Independence:&lt;/strong&gt; The selection of one does not influence the chances of selection of another unit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, in the real world, a completely unbiased and independent sample is very difficult (if not impossible) to find. One of the most common instances is the underrepresentation of &lt;a href="https://www.nytimes.com/2017/05/31/upshot/a-2016-review-why-key-state-polls-were-wrong-about-trump.html#:~:text=It%E2%80%99s%20no%20small,census%20voting%20data."&gt;less educated voters&lt;/a&gt; in the samples used for the 2016 Election Polling. While it is possible to generate a list of completely random respondents, the final results might be skewed because people may not respond, thus &lt;a href="https://www.pewresearch.org/fact-tank/2016/11/09/why-2016-election-polls-missed-their-mark/"&gt;skewing the sample and diverging from population characteristics&lt;/a&gt;. There are more effective and efficient ways to sample a population if we know something about the population.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stratified Sample
&lt;/h2&gt;

&lt;p&gt;In a stratified sample, we divide the population into homogeneous groups (or strata) and then take a proportionate number from each stratum. For example, we can divide a college into various departments and then take a random sample from each department in proportion to its size.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mGhhqkGC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ey8d6xol5hwamnygrz3n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mGhhqkGC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ey8d6xol5hwamnygrz3n.png" alt="Stratified Sample in Statistics Cheat Sheet" width="512" height="506"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An example of what stratified sampling could look like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QTK6uysp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/209n0sqebslixfuvqqsv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QTK6uysp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/209n0sqebslixfuvqqsv.png" alt="Example of how stratified sampling could look" width="705" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here the sample represents 1% of each region. A more complex example can be when we introduce multiple characteristics. For example, let us bifurcate each region by gender as well. The sampling process would look something like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---gMQb202--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y5je3l0kfkz7su0nkkw0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---gMQb202--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y5je3l0kfkz7su0nkkw0.png" alt="Stratified sampling in statistics cheat sheet" width="713" height="498"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The major advantage of stratified sampling is that it captures the key characteristics of the population in the sample. As with a weighted average, stratified sampling produces characteristics that are proportional to the overall population. However, if the strata cannot be formed, then the method can lead to erroneous results. It is also time-consuming and relatively more expensive as the analysts have to identify each member of the sample and classify them into exactly one of the strata. Further, there might be cases where the members might fall into multiple strata. In such a scenario, the sample might misrepresent the population.&lt;/p&gt;
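&lt;p&gt;The college-departments example can be sketched in Python like this (the department sizes are invented for illustration — the point is that each stratum contributes the same fraction):&lt;/p&gt;

```python
import random

random.seed(1)
# Hypothetical strata: department name -> list of student ids.
departments = {
    "Engineering": list(range(0, 500)),
    "Arts": list(range(500, 800)),
    "Science": list(range(800, 1000)),
}

# Take the same fraction (10%) from every stratum, preserving proportions.
rate = 0.10
stratified_sample = {
    dept: random.sample(members, int(len(members) * rate))
    for dept, members in departments.items()
}
```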

&lt;h2&gt;
  
  
  Cluster Sample
&lt;/h2&gt;

&lt;p&gt;Sometimes it is cost-effective to select survey respondents in clusters. For example, instead of going through each building in a town and randomly sampling the respondents, one could randomly select some of the buildings (clusters) and survey all the residents living in them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eY4AbUmZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xzi1pwfli8jrvrd0oi52.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eY4AbUmZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xzi1pwfli8jrvrd0oi52.png" alt="Cluster Sampling in statistics cheat sheet" width="512" height="179"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This can result in an increase in speed and greater cost savings owing to reduced logistical requirements. The cluster sample method’s effectiveness depends on how representative the members of the chosen clusters are when compared to the population. To alleviate this, the sample size for cluster sampling is usually larger than that for simple random sampling, as the characteristics of the members in a cluster usually tend to be similar and may not capture all the population characteristics. However, the cost savings on account of reduced travel and time might still mean that even with the additional sample size, cluster sampling turns out to be the cheaper option.&lt;/p&gt;

&lt;p&gt;Cluster sampling can be further optimized by using multi-stage clustering. As the name suggests, in multi-stage clustering, once the clusters are chosen, they are further divided into smaller clusters, reducing costs further. For example, suppose we wanted to measure the learning abilities of students across the country. We start by clustering on the basis of states; within the chosen states, we cluster on the basis of schools, choose a random sample of those schools, and survey their students. This approach is commonly used for national surveys of employment, health, and household statistics.&lt;/p&gt;
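&lt;p&gt;The buildings example can be written as a one-stage cluster sample in a few lines (the town and its residents are, of course, made up):&lt;/p&gt;

```python
import random

random.seed(7)
# Hypothetical town: building id -> list of its residents.
buildings = {b: [f"b{b}_r{i}" for i in range(10)] for b in range(50)}

# One-stage cluster sample: pick 5 buildings (clusters) at random,
# then survey every resident of the chosen buildings.
chosen_buildings = random.sample(sorted(buildings), 5)
respondents = [r for b in chosen_buildings for r in buildings[b]]
```

&lt;p&gt;A multi-stage version would repeat the random selection inside each chosen cluster instead of taking everyone.&lt;/p&gt;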

&lt;h2&gt;
  
  
  Systematic Sample
&lt;/h2&gt;

&lt;p&gt;Another widely used sampling method for large population sizes is Systematic Sampling. In this process, using a random starting position, every kth element is chosen to be included in the sample.&lt;/p&gt;

&lt;p&gt;For example, we might choose to sample every third person entering a building.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FDYOUSl9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bufesn75huqvqas58bj4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FDYOUSl9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bufesn75huqvqas58bj4.png" alt="Systematic Sampling in statistics cheat sheet" width="512" height="184"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is a quick and convenient method and gives a result similar to a simple random sample if the interval is chosen carefully. This method is used very widely as it is simple to implement and explain. One of the use cases of systematic sampling is for conducting exit polls during elections. A systematic sample makes it easier to separate groups of voters who might all be voting for the same person or party.&lt;/p&gt;
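&lt;p&gt;The every-kth-person rule translates directly into a slice with a random start (again with a hypothetical population):&lt;/p&gt;

```python
import random

random.seed(3)
population = list(range(1, 91))  # 90 hypothetical people entering a building
k = 3                            # sampling interval: every 3rd person

start = random.randrange(k)      # random starting position within the first interval
sample = population[start::k]    # then take every kth element
```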

&lt;h2&gt;
  
  
  Convenience Sample
&lt;/h2&gt;

&lt;p&gt;Another method not usually recommended but used because of circumstances is convenience sampling. Also known as grab sampling or opportunity sampling, the method involves grabbing whatever sample is available.&lt;/p&gt;

&lt;p&gt;For example, on account of a lack of time or resources, the analyst might choose to sample only from their neighboring homes and offices instead of trying to find respondents from across the entire city.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lpyHiK1o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nvooj7c3cjo8seo2q3w9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lpyHiK1o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nvooj7c3cjo8seo2q3w9.png" alt="Convenience Sampling in statistics cheat sheet" width="512" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you might anticipate, this method can be unreliable and hence not recommended. However, it might be the only way to collect data sometimes. For example, instead of trying to contact all the users of marijuana, one might choose to just go to the nearest college dorm and survey the attendees of a party.&lt;/p&gt;

&lt;p&gt;The advantages of this method include convenience, speed, and cost-effectiveness. However, the collected samples may not be truly random or representative of the population. Still, it does provide some information, as opposed to just the analyst’s hunches. This method is widely used in pilot testing or &lt;a href="https://en.wikipedia.org/wiki/Minimum_viable_product"&gt;MVPs&lt;/a&gt; for testing and launching new products.&lt;/p&gt;

&lt;h2&gt;
  
  
  Descriptive Statistics
&lt;/h2&gt;

&lt;p&gt;Now that we have found out how to collect the data, let us move to the next step - analyzing the collected data. Displaying and describing the collected data in the forms of graphs and numbers is called Descriptive Statistics. Let us use some data points to analyze this. We use a hypothetical dataset of 200 students from a program with salary offers from three companies A, B, and C. You can find the &lt;a href="https://github.com/viveknest/statascratch-solutions/blob/main/salaries_dataset.csv"&gt;dataset here.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6VG_Uf_h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yarr4e3z808bgmf848u0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6VG_Uf_h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yarr4e3z808bgmf848u0.png" alt="Descriptive Statistics" width="743" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we had a limited number of data points, we could have simply plotted a bar graph like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--s_oNkiZK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1begsa68ccx0bkx26wge.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--s_oNkiZK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1begsa68ccx0bkx26wge.png" alt="Descriptive Statistics Graph" width="512" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, if we try to do this for our full dataset, we will end up with something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--verkXMwG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g6n0vkiivwa0416wty6l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--verkXMwG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g6n0vkiivwa0416wty6l.png" alt="Descriptive Statistics for full dataset" width="512" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you see, if we were to look at each student one by one, the process can become quite tedious and overwhelming. An easier way is to summarize the data. That is where descriptive statistics come into play. Let us look at a couple of ways.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stem and Leaf Plot
&lt;/h2&gt;

&lt;p&gt;One of the oldest plots is the stem and leaf plot. The idea is to divide each value into a stem and a leaf: the last significant digit is the leaf, and the remaining digits form the stem. For example, in the case of the number 239, 9 is the leaf and 23 is the stem. For the number 53, 3 is the leaf and 5 is the stem. To draw the plot, we write the stems in order vertically and then write each of the leaves in ascending order. For our salary dataset, the stem and leaf plot for Salary A will look like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5Zuyp7a---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kk1l5bwzwbfrdsyuaoym.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5Zuyp7a---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kk1l5bwzwbfrdsyuaoym.png" alt="Stem and Leaf Plot in Statistics Cheat Sheet" width="497" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above graph shows that there is one observation in the range 10 - 19, which is 12. In the range 40 - 49, we have four observations 42, 45, 45, and 46. The cumulative frequency is also shown in the leftmost column.&lt;/p&gt;
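&lt;p&gt;The stem/leaf split described above is just integer division by 10, which makes it easy to build such a plot programmatically (the observations below are a small hypothetical subset, not the full dataset):&lt;/p&gt;

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group each value's last digit (leaf) under its stem (the remaining digits)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 239 -> stem 23, leaf 9
        plot[stem].append(leaf)
    return dict(plot)

# A few hypothetical salary observations (in $k).
observations = [12, 42, 45, 45, 46, 53, 57]
# stem_and_leaf(observations) -> {1: [2], 4: [2, 5, 5, 6], 5: [3, 7]}
```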

&lt;p&gt;We can similarly plot the stem graphs for Company B and Company C's salaries as well.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eQqGg9kg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2xexzlxf9canilibx0rn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eQqGg9kg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2xexzlxf9canilibx0rn.png" alt="Plot the stem graphs in Statistics Cheat Sheet" width="398" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CjHJy4Dp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sn5cwav672x9r6d12kxx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CjHJy4Dp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sn5cwav672x9r6d12kxx.png" alt="Stem and leaf plot salaries in Statistics Cheat Sheet" width="398" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With larger datasets and a variety of values, it becomes unwieldy, as you can see in the case of Salaries for Company B. To alleviate these problems, we have a histogram.&lt;/p&gt;

&lt;h2&gt;
  
  
  Histograms
&lt;/h2&gt;

&lt;p&gt;Histograms extend the concept of stem and leaf plots with the difference being that instead of dividing the numbers into tens, we can decide how we want to group these numbers. As with the stem and leaf plot, we first decide the bins (or buckets) that we would like the numbers to be in. We then count the frequency in each bin and then plot the values in a bar graph. So if we decide to bin the numbers in 10s, we will get the same graph as the stem and leaf plot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OhFUSdFt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8kg4m4qul9jvqrx3hdio.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OhFUSdFt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8kg4m4qul9jvqrx3hdio.png" alt=" Histograms in statistics" width="512" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also create bins in the 20s. This is what the graph will look like.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EhFKFhft--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h7nmsp8mx5n4oujdlkrs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EhFKFhft--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h7nmsp8mx5n4oujdlkrs.png" alt="Histograms in statistics bins of 20s" width="512" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can observe, the counts in each bin are higher. This is natural, as the wider bins now contain more observations. As with a stem and leaf plot, histograms are a good way to examine the spread of the data. Let us look at how the histograms of the other company salaries look.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---AJAgpJX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j5h42ab9dv81qmn1d7ze.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---AJAgpJX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j5h42ab9dv81qmn1d7ze.png" alt="Histograms salaries in Statistics Cheat Sheet" width="512" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pb6Zb3rK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eohkh5gvz702yf3zmd0n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pb6Zb3rK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eohkh5gvz702yf3zmd0n.png" alt="Histograms salaries in Statistics Cheat Sheet" width="512" height="355"&gt;&lt;/a&gt;&lt;/p&gt;
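&lt;p&gt;The binning step behind every histogram — choose a bin width, then count how many values fall into each bin — can be sketched as follows (using the same hypothetical observations as before):&lt;/p&gt;

```python
def histogram(values, bin_width):
    """Count how many values fall into each [lo, lo + bin_width) bin."""
    counts = {}
    for v in values:
        lo = (v // bin_width) * bin_width  # lower edge of the bin v falls into
        counts[lo] = counts.get(lo, 0) + 1
    return dict(sorted(counts.items()))

observations = [12, 42, 45, 45, 46, 53, 57]  # hypothetical salaries in $k
```

&lt;p&gt;With a bin width of 10 this reproduces the stem and leaf counts; with a width of 20 the bins widen and their counts grow, exactly as described above.&lt;/p&gt;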

&lt;p&gt;While examining data visually is helpful, we need some measurements to provide us with information about the characteristics of the data. The most vital set of measurements are the central tendency (a typical value that describes the data) and the spread of the values in the dataset. Let us look at the commonly used numerical statistics used to describe a dataset.&lt;/p&gt;

&lt;h2&gt;
  
  
  Measures of Central Tendency
&lt;/h2&gt;

&lt;p&gt;The measures of central tendency describe what a typical value in the data would look like. In the case of our salaries, think of it as what a typical salary from the three companies looks like. There are three commonly used measures of central tendency - the mean, median, and mode. Let us look at these in detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mean
&lt;/h2&gt;

&lt;p&gt;Mean (or average) is the most widely used measure of central tendency. The mean for a data set is calculated by dividing the sum of all observations in the dataset by the number of observations.&lt;/p&gt;

&lt;p&gt;For a dataset with n values&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lcYGt_W4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n86zhzu3d2r9rgckqvtb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lcYGt_W4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n86zhzu3d2r9rgckqvtb.png" alt="dataset with n values in Statistics Cheat Sheet" width="702" height="75"&gt;&lt;/a&gt;&lt;br&gt;
​&lt;/p&gt;

&lt;p&gt;the mean usually denoted by&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MK0wriLH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sbx09lxsxpd0do2tr3hx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MK0wriLH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sbx09lxsxpd0do2tr3hx.png" alt="Statistics Cheat Sheet" width="702" height="75"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;is given by&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LBWdBX1a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p2fo46460dwqa4ndfv22.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LBWdBX1a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p2fo46460dwqa4ndfv22.png" alt="Statistics Cheat Sheet" width="702" height="75"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In simple words,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6JHrjzAx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w25csj98s00shh9gza7b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6JHrjzAx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w25csj98s00shh9gza7b.png" alt="Statistics Cheat Sheet" width="702" height="75"&gt;&lt;/a&gt;​&lt;/p&gt;

&lt;p&gt;Let us calculate the means of each of the three companies’ salaries. If you are using spreadsheet software, you can use the AVERAGE function to calculate means.&lt;/p&gt;

&lt;p&gt;Salary A(k)    101.525&lt;br&gt;
Salary B(k)     94.760&lt;br&gt;
Salary C(k)     87.590&lt;/p&gt;
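&lt;p&gt;The same calculation in Python is a one-liner; the values below are hypothetical stand-ins, not the actual dataset:&lt;/p&gt;

```python
values = [101, 98, 105, 102]  # hypothetical salary offers in $k

# Mean = sum of all observations divided by the number of observations.
x_bar = sum(values) / len(values)
```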

&lt;p&gt;Let us see where the mean lies on the histograms.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--S212G9XT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rp4ceuwxrq9imyv206uz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--S212G9XT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rp4ceuwxrq9imyv206uz.png" alt="Histogram salaries in Statistics Cheat Sheet" width="512" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bw0c_CLv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gs3xdjr83em2secae1zd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bw0c_CLv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gs3xdjr83em2secae1zd.png" alt="Histogram salaries in Statistics Cheat Sheet" width="512" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Mwfyolbm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8p0odhjlgf3ks2nftd9a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Mwfyolbm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8p0odhjlgf3ks2nftd9a.png" alt="Histogram salaries in Statistics Cheat Sheet" width="512" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While the means for Companies A and C appear alright at first glance, the mean for Company B appears a bit misleading.&lt;/p&gt;

&lt;p&gt;If one does a quick visual calculation (or uses the stem and leaf plots), more than half of the students (over 100) were offered a salary of 70k or lower, while the calculated mean was around 95k.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mrmktdPo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/byctlgy35hku6p26qows.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mrmktdPo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/byctlgy35hku6p26qows.png" alt="Stem and leaf plots salaries" width="451" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is one of the problems with the mean. For a balanced dataset, the mean represents the middle value, but for an asymmetrical distribution it can be misleading, since a few extreme values can pull it away from the bulk of the data. If you observe, seven students were offered salaries in excess of 250k. Because of this, we cannot always rely on the mean alone. This leads us nicely to the next measure - the median.&lt;/p&gt;

&lt;h2&gt;
  
  
  Median
&lt;/h2&gt;

&lt;p&gt;The median is simply the middle value of the dataset when the observations are ordered. To calculate it, we arrange the values in ascending or descending order and pick the middle value. For example, for the observations 18, 35, 7, 20, and 27, we start by ordering them.&lt;/p&gt;

&lt;p&gt;7, 18, 20, 27, 35&lt;/p&gt;

&lt;p&gt;Now we pick the middle value, which in this case is 20. If we have an even number of values, then we pick the average of the two middle values. For example, if we add another observation 42 to the above, we will get the following ordered values.&lt;/p&gt;

&lt;p&gt;7, 18, 20, 27, 35, 42&lt;/p&gt;

&lt;p&gt;In this case, the median will be the average of the two middle values, 20 and 27.&lt;/p&gt;

&lt;p&gt;Therefore,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kIJ_5w40--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/39t7okc6i6hyc49zpub3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kIJ_5w40--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/39t7okc6i6hyc49zpub3.png" alt="Statistics Cheat Sheet" width="704" height="102"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note: The median divides the dataset into two halves each containing the same number of observations.&lt;/p&gt;
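&lt;p&gt;The median calculations above can be reproduced with Python’s standard library:&lt;/p&gt;

```python
import statistics

# Odd number of observations: the middle value of the sorted data
print(statistics.median([18, 35, 7, 20, 27]))      # 20

# Even number of observations: the average of the two middle values
print(statistics.median([18, 35, 7, 20, 27, 42]))  # 23.5
```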

&lt;p&gt;Let us find the medians for the three datasets and plot them on the histograms.&lt;/p&gt;

&lt;p&gt;Salary A(k)    101.0&lt;br&gt;
Salary B(k)     78.0&lt;br&gt;
Salary C(k)     90.5&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TyA_nU-s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5bxj8gxzia9acv4o6jbx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TyA_nU-s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5bxj8gxzia9acv4o6jbx.png" alt="Example of median in statistics" width="512" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2fJpDn1g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dadgc71zvsafhm1oy52f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2fJpDn1g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dadgc71zvsafhm1oy52f.png" alt="Example of median in statistics" width="512" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--z5rR3tTX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o0b7p72moez0lplalra5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--z5rR3tTX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o0b7p72moez0lplalra5.png" alt="Example of median in statistics" width="512" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we had expected, the medians for Companies A and C are pretty close to their means, but the median for B is separated from its mean by almost a full bin. One of the advantages of the median is that it is not easily affected by extreme values. It is therefore preferred for datasets that are not balanced, i.e., that do not have a roughly equal spread of observations on either side of the mean.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mode
&lt;/h2&gt;

&lt;p&gt;Another widely used measure is the mode. The mode is the most frequent observation in the dataset. Let us calculate the mode with a simple example.&lt;/p&gt;

&lt;p&gt;Suppose the ages of a group of five students are 23, 21, 18, 21, and 20. The mode for this data is 21, since it appears the most times. A dataset can have multiple modes as well. For example, if the ages were 18, 23, 21, 23, and 18, then the dataset has two modes, 18 and 23, since both values appear twice. Such data is called multimodal data.&lt;/p&gt;
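&lt;p&gt;The mode examples above can also be checked with Python’s standard library:&lt;/p&gt;

```python
import statistics

ages = [23, 21, 18, 21, 20]
print(statistics.mode(ages))        # 21

# multimode returns every value tied for the highest count,
# in first-encountered order, so it handles multimodal data
ages2 = [18, 23, 21, 23, 18]
print(statistics.multimode(ages2))  # [18, 23]
```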

&lt;p&gt;Let us calculate the mode and plot them on the histograms.&lt;/p&gt;

&lt;p&gt;Salary A(k)    92&lt;br&gt;
Salary B(k)     39, 105&lt;br&gt;
Salary C(k)     95&lt;/p&gt;

&lt;p&gt;Note for Salaries offered by Company B, there are two values that appear the most number of times (39 and 105).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MCkN7TRC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8bckkchzw06969axybaz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MCkN7TRC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8bckkchzw06969axybaz.png" alt="Example of mode in statistics" width="512" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gBz7FCng--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vp4zq2q5916c8fklykhi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gBz7FCng--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vp4zq2q5916c8fklykhi.png" alt="Example of mode in statistics" width="512" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--recpllY1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8rhbv0wry357r8g8otzq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--recpllY1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8rhbv0wry357r8g8otzq.png" alt="Example of mode in statistics" width="512" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Measures of Spread
&lt;/h2&gt;

&lt;p&gt;In statistics, spread (also called dispersion, variability, or scatter) is the extent to which the data is stretched or squeezed. Think of it as a measure of how far from the center the data tends to lie. For instance, if everyone were offered the same salary, the spread would be 0. We can evaluate the spread using a histogram: for thinly spread data the histogram will be skinny, as with the salaries offered by Companies A and C, whereas for a dataset with a greater range of values the histogram will be wider, as in the case of Company B. Let us look at the mathematical measures used to evaluate the spread.&lt;/p&gt;

&lt;h2&gt;
  
  
  Range
&lt;/h2&gt;

&lt;p&gt;The range of the dataset is the difference between the highest and the lowest values. The range of three datasets is as follows:&lt;/p&gt;

&lt;p&gt;Salary A(k)    218&lt;br&gt;
Salary B(k)    338&lt;br&gt;
Salary C(k)     99&lt;/p&gt;

&lt;p&gt;This is in line with what we saw visually.&lt;/p&gt;

&lt;h2&gt;
  
  
  Interquartile Range (IQR)
&lt;/h2&gt;

&lt;p&gt;While the range gives a good idea about the spread of the dataset, as with the mean, it is prone to being influenced by extreme values on either side of the spectrum. We therefore use a more nuanced version of the range called the Interquartile Range (or IQR for short). Quartiles are an extension of the concept of the median. Just as the median divides the dataset into two halves, each containing an equal number of observations, quartiles divide the dataset into four parts, each containing an equal number of observations. The quarter boundaries are represented by Q1, Q2, Q3, and Q4, the first, second, third, and fourth quartile, respectively.&lt;/p&gt;

&lt;p&gt;The first quartile is the maximum value of the bottom 25% of the values (by magnitude), the second quartile bounds the next 25%, and so on. Let us plot the four quartiles on the histogram of the salaries offered by Company A.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JbJC4apD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n9ihb7u9nn6vhmdjelvh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JbJC4apD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n9ihb7u9nn6vhmdjelvh.png" alt="Interquartile Range in Statistics Cheat Sheet" width="512" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you might have guessed, the median is the second quartile (Q2). The numerical values are:&lt;/p&gt;

&lt;p&gt;Q1: 81.75&lt;br&gt;
Q2: 101&lt;br&gt;
Q3: 119.25&lt;br&gt;
Q4: 230&lt;/p&gt;

&lt;p&gt;IQR measures the spread between the first and the third quartiles or the range of the middle 50% of the values excluding the top 25% and bottom 25% of the observations.&lt;/p&gt;

&lt;p&gt;IQR = Q3 - Q1&lt;/p&gt;

&lt;p&gt;Let us calculate the IQR for the three salaries.&lt;/p&gt;

&lt;p&gt;Salary A(k): 37.5&lt;br&gt;
Salary B(k): 100.5&lt;br&gt;
Salary C(k): 27.5&lt;/p&gt;

&lt;p&gt;The trend is similar to what we saw earlier, but this also shows that the middle 50% values for Company A are relatively closely packed.&lt;/p&gt;
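&lt;p&gt;The range and IQR calculations can be sketched with Python’s standard library. Since the salary data isn’t reproduced here, this uses the small toy dataset from the median example; note that quartile conventions vary slightly between tools:&lt;/p&gt;

```python
import statistics

# Toy dataset (not the salary data from the article)
data = [7, 18, 20, 27, 35, 42]

# Range: difference between the highest and lowest values
data_range = max(data) - min(data)  # 42 - 7 = 35

# Quartile cut points [Q1, Q2, Q3] via linear interpolation;
# method="inclusive" matches common spreadsheet/pandas behavior
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1
print(data_range, q1, q2, q3, iqr)  # 35 18.5 23.5 33.0 14.5
```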

&lt;p&gt;The IQR value is used for constructing the box plot (also called the box and whiskers plot).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2MPBROCe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vktahsumuybrok5kz3i6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2MPBROCe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vktahsumuybrok5kz3i6.png" alt="IQR value for constructing the box plot" width="512" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let us deconstruct the box-plot.&lt;/p&gt;

&lt;p&gt;The box represents the middle 50% of the data: its ends are Q1 and Q3, and the line inside the box is the median. The whiskers extend up to 1.5 IQR below Q1 and 1.5 IQR above Q3. The range of values from Q1 - 1.5 IQR to Q3 + 1.5 IQR is popularly known as the fence; for balanced distributions, this is the range of acceptable values. Values outside the fence are outliers (extreme values).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Zrt_6Vz1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vnb0mq72mt65z7d1yda2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Zrt_6Vz1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vnb0mq72mt65z7d1yda2.png" alt="IQR value for constructing the box plot" width="704" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Variance and Standard Deviation
&lt;/h2&gt;

&lt;p&gt;So far we have used only the extreme and quartile values to measure the spread. The most widely used measure of spread is the standard deviation (along with the variance). The standard deviation measures the difference of each value from the mean of the dataset and condenses the spread of the data into a single number.&lt;/p&gt;

&lt;p&gt;Let's take a simple dataset to show the calculations involved. Suppose we observe the following temperatures (in Fahrenheit) over the course of five days: 82, 93, 87, 91, and 92. The mean of these values is&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--is0YGdGQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vud9j0bmn1vl71qlzlzu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--is0YGdGQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vud9j0bmn1vl71qlzlzu.png" alt="Statistics Cheat Sheet" width="704" height="102"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We need to find how much each value differs from the mean. We can find this by subtracting the mean from each value. We get the following.&lt;/p&gt;

&lt;p&gt;(82 - 89), (93 - 89), (87 - 89), (91 - 89) and (92 - 89)&lt;/p&gt;

&lt;p&gt;or -7, 4, -2, 2, 3&lt;/p&gt;

&lt;p&gt;These values are called residuals or deviations from the mean, or simply deviations. Since we would like one single value, let us try to calculate the mean of these deviations. If you do, you will find that they add up to zero: &lt;a href="https://en.wikipedia.org/wiki/Arithmetic_mean#:~:text=The%20mean%20is%20the%20only%20single%20number%20for%20which%20the%20residuals"&gt;this is a basic property of the mean&lt;/a&gt;. To overcome this, we need to remove the sign from the deviations. The most common way to do this is to square them: since the square of a real number is never negative, the squared deviations cannot cancel each other out.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wodH209F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7z0fnol9jf83b74tt7vf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wodH209F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7z0fnol9jf83b74tt7vf.png" alt="Statistics Cheat Sheet" width="704" height="82"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We now take the mean of this and get&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tfpXYMRb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3fvf3hc1gsqcy2et8hdx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tfpXYMRb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3fvf3hc1gsqcy2et8hdx.png" alt="Statistics Cheat Sheet" width="704" height="82"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This value is called the variance of the data.&lt;/p&gt;

&lt;p&gt;However, if you look carefully, the units are also squared now, so 16.4 is not in Fahrenheit but in Fahrenheit squared! To bring the measure back to the original units, we take the square root.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SMZyTO-V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/532a2ex236pnvucjvphh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SMZyTO-V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/532a2ex236pnvucjvphh.png" alt="Statistics Cheat Sheet" width="704" height="82"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The resulting number, 4.05, is the standard deviation of the dataset.&lt;/p&gt;

&lt;p&gt;There is, however, one twist. This number would be the standard deviation if the dataset were the entire population. Since this is not the case, we need to adjust the formula to get the sample variance and standard deviation: we divide by n - 1 instead of n. This is called Bessel’s correction, and it compensates for the fact that the uncorrected sample variance, on average, underestimates the population variance. &lt;a href="https://www.youtube.com/watch?v=sHRBg6BhKjI"&gt;You can see a wonderful explanation here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Therefore the sample variance, usually denoted by s&lt;sup&gt;2&lt;/sup&gt;, is&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---RyXL3hx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8t1myuzhkocn3ri7noir.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---RyXL3hx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8t1myuzhkocn3ri7noir.png" alt="Statistics Cheat Sheet" width="704" height="82"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And sample standard deviation s =&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dxzncF05--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0pf8808cr9o5wqb177ij.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dxzncF05--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0pf8808cr9o5wqb177ij.png" alt="Statistics Cheat Sheet" width="704" height="95"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As an exercise, try to calculate the standard deviations and variances for the salaries offered by the three companies. You can use a simple spreadsheet program to do this. Also try to calculate the values without using the built-in formulas.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Now that we are armed with the basic tools for finding the center and the spread of the samples, we will extend this in the next part, where we look at another key aspect of statistics - Probability and Random Events.&lt;/p&gt;

&lt;p&gt;In this article, we looked at the various ways of collecting data samples for analysis. We used a hypothetical salaries dataset and learned how to plot graphs like histograms and box plots. We also learned about the measures of central tendency and the measures of spread. This sets us up nicely for the next two parts. To prepare for statistics and data science interviews, you can use the StrataScratch platform, where we have a community of more than 20,000 aspirants aiming to get into the most sought-after Data Science and Data Analyst roles at companies like Google, Amazon, Microsoft, Netflix, etc. Join StrataScratch today and turn your dream into a reality.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>productivity</category>
      <category>beginners</category>
      <category>sql</category>
    </item>
    <item>
      <title>How FAANG companies are leveraging data science and AI</title>
      <dc:creator>StrataScratch</dc:creator>
      <pubDate>Thu, 29 Sep 2022 07:04:23 +0000</pubDate>
      <link>https://dev.to/nate_at_stratascratch/how-faang-companies-are-leveraging-data-science-and-ai-kfc</link>
      <guid>https://dev.to/nate_at_stratascratch/how-faang-companies-are-leveraging-data-science-and-ai-kfc</guid>
      <description>&lt;p&gt;&lt;em&gt;This article will cover information about how FAANG companies leverage Data Science and AI to drive product innovation and thereby improve their customer satisfaction and drive revenue growth.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As data increases at an exponential rate, most companies are leveraging it to drive growth and enhance the customer experience. Big Data has been spreading around the world since the 1960s. Data that contains greater variety, arriving in growing volumes and with increasing velocity, is what &lt;a href="https://www.oracle.com/big-data/what-is-big-data/" rel="noopener noreferrer"&gt;Oracle&lt;/a&gt; defines as big data. Every industry has seen a rise in the number of data science firms that analyze this data for commercial insights. Many &lt;a href="https://www.stratascratch.com/blog/11-best-companies-to-work-for-as-a-data-scientist/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;data science companies&lt;/a&gt; have improved over time thanks to data-driven decision making. In this article, we will talk about the FAANG companies: Facebook/Meta, Amazon, Apple, Netflix, and Google.&lt;/p&gt;

&lt;p&gt;Big Data is useless without the knowledge of experts who can transform cutting-edge technology into useful insights. The value of a data scientist who knows how to wring relevant insights out of gigabytes of data is rising as more and more firms today unlock the power of big data.&lt;/p&gt;

&lt;p&gt;The value of data processing and analysis is becoming increasingly obvious as time goes on. Yet even though executives know data science is a hot field and regard data scientists as modern-day superheroes, the importance a data scientist holds within a company is still largely unknown to them.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why do companies need Data Science Capabilities?
&lt;/h1&gt;

&lt;p&gt;To succeed in this ever-changing world, companies need to rely heavily on data-driven decisions and make their products more innovative. Examples of innovative products backed by data science include Amazon’s Alexa, a virtual assistant that performs basic tasks via voice commands; Google apps such as Translate and Maps; and Netflix’s recommendation system, which shows users what they might like. Having such innovative, data-backed products helps companies build a great customer experience, which improves customer loyalty and thereby drives growth.&lt;/p&gt;

&lt;p&gt;Data science and AI can add value to businesses by empowering management to make better, data-driven decisions. They can also help companies direct actions based on trends, identify opportunities, and make and test decisions backed by quantifiable, data-driven evidence. Let’s look at some examples of how and why companies are using data science.&lt;/p&gt;

&lt;h3&gt;
  
  
  ML usage to Increase Competitiveness
&lt;/h3&gt;

&lt;p&gt;It is hard to be unaware of how machine learning affects enterprises. Machine learning has gained popularity in recent years, and the solutions it offers can be highly advantageous to any business now and in the future. At the moment, enterprises all over the world employ machine learning mostly for:&lt;/p&gt;

&lt;p&gt;• Using predictive analysis to improve customer interactions&lt;br&gt;
• Revenue projections and product marketing&lt;br&gt;
• Simplified data management&lt;br&gt;
• Improved selling models&lt;br&gt;
• Awareness of fraud and cybersecurity&lt;/p&gt;

&lt;p&gt;However, some companies have taken machine learning further and are now using it in highly inventive ways. For instance, Pinterest used machine learning and data science to build its whole content discovery system. This technology helps the business anticipate customer preferences and improves the accuracy of search results.&lt;/p&gt;

&lt;h3&gt;
  
  
  Business process optimization using data science
&lt;/h3&gt;

&lt;p&gt;Data science provides businesses with a wide range of options for streamlining key business procedures. For instance, big data analysis has recently become more prevalent in the manufacturing industry. More and more manufacturing facilities are interested in investing in data analytics and the IIoT (Industrial Internet of Things). Real-time tracking systems and sensor technology gather and analyze data that manufacturers can utilize to:&lt;/p&gt;

&lt;p&gt;• Eliminate snags in the production process&lt;br&gt;
• Boost the assets' effectiveness&lt;br&gt;
• Keep track of product flaws and quality&lt;br&gt;
• Conduct product testing&lt;/p&gt;

&lt;p&gt;Data science assists firms in minimizing production problems that may have an impact on the product's quality, factory logistics, and shipping procedures. Recruitment is a fantastic illustration of how data science can be used in innovative ways to improve business processes beyond manufacturing.&lt;/p&gt;

&lt;p&gt;Now, let’s see how FAANG companies are using Data Science and AI to innovate their products and improve customer experience. We have already covered how the work culture of these FAANG companies is in &lt;a href="https://www.stratascratch.com/blog/ultimate-guide-to-the-top-5-data-science-companies/" rel="noopener noreferrer"&gt;this article&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How is Facebook/Meta using Data Science and AI?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xak9k4jk6u8bvu1cie8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xak9k4jk6u8bvu1cie8.png" alt="How is Facebook Meta using Data Science and AI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Facebook, now Meta is one of the biggest social media companies in the world. With close to &lt;a href="https://www.statista.com/statistics/264810/number-of-monthly-active-facebook-users-worldwide/" rel="noopener noreferrer"&gt;3 billion monthly active users worldwide&lt;/a&gt;, Facebook has captured enormous amounts of data. For Facebook, most of its revenue comes from advertising. So the most important application of data science at Facebook is to decide which advertisements to show to which users.&lt;/p&gt;

&lt;p&gt;The company changed its name to Meta since its strategy involves evolving into virtual platforms of social technologies where users will be able to &lt;a href="https://seekingalpha.com/article/4471770-how-does-facebook-make-money" rel="noopener noreferrer"&gt;immerse themselves in the “metaverse”&lt;/a&gt;. Facebook has been leveraging data science and AI at every step to make the customer experience better and drive exponential revenue growth. Every 60 seconds on Facebook, 510K comments are posted, 293K statuses are updated, and 136K photos are uploaded, which is massive. So what does Facebook do with all this data? How does it leverage data science and AI to make the most of it? Let’s see some examples where Facebook uses data science extensively to enhance the customer experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  Text Analytics
&lt;/h3&gt;

&lt;p&gt;A majority of the data available on Facebook is in the form of text, for example posts and comments, unlike Instagram, which is dominated by photos and videos. Facebook has developed an in-house tool called Deep Text, which analyzes the text we share in posts and comments and extracts meaning from it. This technology is used to identify abusive posts on Facebook.&lt;/p&gt;

&lt;p&gt;Deep Text is a deep-learning-based text understanding engine that can understand text by extracting meaning from it with near-human accuracy. It is built on state-of-the-art neural network architectures that can learn at the word and character level.&lt;/p&gt;

&lt;p&gt;Some applications of Deep Text include identifying the sentiment of the post (positive, negative, neutral) or identifying the emotions in the post (sad, happy, angry, threat, etc.). This framework can also be used in identifying the topic; for example whether the post is about Cricket or Football by recognizing the player names from the post.&lt;/p&gt;

&lt;h3&gt;
  
  
  Topic Data
&lt;/h3&gt;

&lt;p&gt;Facebook has developed &lt;a href="https://www.facebook.com/business/news/topic-data" rel="noopener noreferrer"&gt;Topic Data&lt;/a&gt;, which leverages data science to help marketers understand what people are saying about topics related to their business, so that marketers can make their products and marketing relevant to their customers. Before this technology, marketers had to rely on what people were posting online, which provided a very limited view. Facebook therefore built a data-science-powered framework that helps marketers create more effective, personalized marketing content.&lt;/p&gt;

&lt;p&gt;Some examples of how marketers are using Topic Data:&lt;/p&gt;

&lt;p&gt;• An inventory manager at a fashion retailer can use this data to understand the clothing trends of its target audience and decide which products to stock.&lt;br&gt;
• A company can use it to understand its brand positioning and the sentiment around its brand.&lt;br&gt;
• A company selling a hair de-frizzing product can see demographic data and target relevant customers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Advertising
&lt;/h3&gt;

&lt;p&gt;Facebook’s main revenue stream is advertising. The company uses its data very effectively to decide which ads should be shown to which users. The sponsored posts we see on Facebook are examples of this targeted advertising. When you search for a product on the web while logged in to Facebook, chances are you will later see that product as an ad in your Facebook app.&lt;/p&gt;

&lt;p&gt;Other than the topics discussed above, there are many ways in which Facebook is using data science and AI. Some recent projects that Facebook has undertaken are:&lt;/p&gt;

&lt;p&gt;• Detecting Deepfakes - The company recently launched a &lt;a href="https://ai.facebook.com/blog/deepfake-detection-challenge/" rel="noopener noreferrer"&gt;Deepfake Detection Challenge&lt;/a&gt; to build detection models for deepfake content and to speed up their efforts.&lt;br&gt;
• Language Translation - If you see a Facebook post in another language, Facebook can automatically translate it for you in real time.&lt;br&gt;
• Suicide Prevention - Facebook can identify the sentiment of a post, look for signs of trouble in users’ posts and comments, and thereby generate alerts and help people in crisis.&lt;br&gt;
• Image Recognition - Users can easily search through photos without having to rely on tags or surrounding text.&lt;/p&gt;

&lt;h2&gt;
  
  
  How is Amazon using Data Science and AI?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02pvrhj2w0iyvuauerad.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02pvrhj2w0iyvuauerad.png" alt="How is Amazon using Data Science and AI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Amazon is one of the largest ecommerce companies in the world, collecting around 1 exabyte of purchase history data from its customers. Apart from online shopping, Amazon provides services like Amazon Web Services, Amazon Pay, Amazon Pantry and many more. With all these services, imagine how much information Amazon has about its customers. It leverages this information to improve the customer experience and drive revenue growth. Amazon collects data about which pages customers visit on its website and which products and categories they are interested in, and based on all this information it recommends new products that customers are likely to buy.&lt;/p&gt;

&lt;p&gt;Amazon is the leader in collecting and processing information about how its users spend their money. It uses sophisticated data science and machine learning models for targeted marketing, which helps increase customer engagement. Let’s look at some examples of how Amazon uses Data Science and AI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recommendation Engine
&lt;/h3&gt;

&lt;p&gt;With developments in AI, Amazon began building a state-of-the-art recommendation engine that analyzes a customer’s behavior on the website and thereby accurately predicts what the customer might be interested in in the future. The recommendation engine collects a lot of data about products and users and then forms relations and dependencies between them.&lt;/p&gt;

&lt;p&gt;A User-Product relationship occurs when users with a specific set of characteristics share a preference for certain products and buy them often. An example is Game of Thrones fans buying GoT merchandise and related items. A Product-Product relationship occurs when products on the website are similar in appearance and specifications. For example, if you search for a water bottle and are interested only in bottles for the gym, then all the similar items will be placed together. A User-User relationship occurs when a certain set of customers have very similar tastes or preferences for certain products. For example, teenagers massively buying merchandise from their favorite YouTuber.&lt;/p&gt;
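&lt;p&gt;As an illustrative sketch of the Product-Product relationship (with made-up basket data, not Amazon’s actual algorithm), simply counting how often pairs of items appear in the same purchase already yields usable “related products” candidates:&lt;/p&gt;

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical purchase histories: one set of product ids per customer
baskets = [
    {"got_mug", "got_poster", "water_bottle"},
    {"got_mug", "got_poster"},
    {"water_bottle", "gym_towel"},
]

# Count how often each pair of products is bought together; pairs with
# high co-purchase counts are candidates for "related products"
co_counts = defaultdict(int)
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co_counts[(a, b)] += 1

print(max(co_counts, key=co_counts.get))  # ('got_mug', 'got_poster')
```

&lt;p&gt;Real recommendation engines add normalization and filtering on top of raw co-occurrence counts, but the relation-forming idea is the same.&lt;/p&gt;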

&lt;h3&gt;
  
  
  Alexa
&lt;/h3&gt;

&lt;p&gt;Amazon offers Alexa, a virtual assistant available on devices such as the Echo and Echo Show. It is widely used by customers for basic tasks such as setting reminders, checking the weather, checking the latest news, etc.&lt;/p&gt;

&lt;p&gt;When users speak with Alexa, the recordings are uploaded to Amazon's servers as voice files, and these files are used to train the machine learning algorithms that make the Alexa experience better. Thus, Amazon is continuously collecting data from its users and uses advanced AI tools to understand what the user is saying. Some customers might not be comfortable sharing this voice data, so Amazon provides a way to delete it from their servers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pricing Optimization
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.businessinsider.com/amazon-price-changes-2018-8?r=US&amp;amp;IR=T" rel="noopener noreferrer"&gt;Amazon changes prices&lt;/a&gt; on its products every 10 mins which is close to 2.5m changes in a day across all its products. Amazon has algorithms in place which assess a person’s willingness to buy a specific product. The aim is to set the price of a product in such a way that the customer is likely to buy that product, which is known as &lt;a href="https://en.wikipedia.org/wiki/Dynamic_pricing" rel="noopener noreferrer"&gt;dynamic pricing&lt;/a&gt;. The changes in the prices occur depending on the user’s activity on the website, competitor’s pricing, product availability, profit margin and many more.&lt;/p&gt;

&lt;p&gt;Other than the topics discussed above, there are many ways in which Amazon uses Data Science and AI. One of the latest is Alexa-enabled voice shopping. This feature takes voice commands as input and performs the purchasing flow based on those commands. It allows users to find and purchase products and walk through the checkout flow with voice prompts instead of clicking or tapping on their phone or Echo screen. The goal of this feature is to provide a seamless customer experience for ordering a product, and Data Science and AI are at the center of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How is Apple using Data Science and AI?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F623f5s1chtf9i1tst8ck.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F623f5s1chtf9i1tst8ck.png" alt="How is Apple using Data Science and AI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apple, formerly known as Apple Computers Inc., is a global technology corporation specializing in designing, developing and selling consumer electronics such as the Mac, iPhone, iPad, AirPods, etc. Apple uses AI and Data Science widely to innovate products and build better customer experiences. Let’s discuss some of the AI applications Apple uses:&lt;/p&gt;

&lt;h3&gt;
  
  
  Siri
&lt;/h3&gt;

&lt;p&gt;Similar to Alexa, Siri is Apple’s AI-enabled assistant available across Apple devices such as the Mac, iPad, iPhone, AirPods, etc. It aims to help users quickly navigate through required tasks and perform them without touching the screen. Examples include setting alarms and reminders, calling someone, getting weather updates, reading the news, etc.&lt;/p&gt;

&lt;p&gt;As a virtual assistant, Siri is built on large-scale ML algorithms that combine speech recognition with text mining and natural language processing. First, Siri uses speech recognition to convert human speech into text; natural language processing is then used to identify the meaning of the sentence and prepare the best response for the user’s task.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apple Watch Sleep Tracking
&lt;/h3&gt;

&lt;p&gt;Apple has many ways to collect data about its customers. One of the most recent is the Apple Watch, with which Apple can track a user’s activity throughout the day. Apple has partnered with IBM to apply digital information to health management. Using this technology, customers can monitor their health and lifestyle throughout the day and make improvements to it.&lt;/p&gt;

&lt;p&gt;The Apple Watch also tracks sleeping patterns and provides customers with data about their deep sleep and light sleep. Based on this data, the Apple Watch reminds users about their sleeping times and how to improve them. These notifications and reminders are the result of sophisticated machine learning models in the backend.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apple HomePod
&lt;/h3&gt;

&lt;p&gt;Another AI-enabled technology from Apple is the HomePod, a speaker powered by Siri that can handle many tasks. “With multiple HomePod mini speakers placed around the house, you can have a connected sound system for your whole home. Ask Siri to play one song everywhere or, just as easily, a different song in each room,” (from Keynote speech). Additionally, the speaker can act as a HomeHub, connecting to Apple’s HomeKit, which can be accessed through the iPhone.&lt;/p&gt;

&lt;p&gt;Apart from the projects discussed above, Apple uses Data Science and AI in everything it does, especially in innovating its products and technologies. Apple stays ahead in the technology game by using big data extensively, building innovative products such as Siri, the HomePod and Apple’s Digital Car Key (in partnership with BMW for the 5 Series). To conclude, Apple is utilizing technological advances to improve its user experience through Artificial Intelligence and Data Science methodologies.&lt;/p&gt;

&lt;h2&gt;
  
  
  How is Netflix using Data Science and AI?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8i89ftimquedble9wg4x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8i89ftimquedble9wg4x.png" alt="How is Netflix using Data Science and AI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Netflix, originally a DVD-rental service, boomed into one of the biggest video streaming platforms, with many options for movies and web series. Netflix generated $24.9 billion in revenue in 2021, a 23% increase compared to 2020. In 2022, Netflix has &lt;a href="https://www.businessofapps.com/data/netflix-statistics/" rel="noopener noreferrer"&gt;222 million subscribers worldwide&lt;/a&gt;. Netflix has reigned as the number-one over-the-top (OTT) streaming platform since its launch in 2007. So what’s the secret of such huge success? It is all about using data and analytics to enhance its products and improve the customer experience. By using data extensively, Netflix can provide users with personalized movie and TV show recommendations, optimize production planning, predict the popularity of original content, personalize marketing content such as trailers and thumbnail images, and help stakeholders in decision making. Data is at the center of everything Netflix does. Let’s see some of the Data Science applications Netflix uses:&lt;/p&gt;

&lt;h3&gt;
  
  
  Recommendation Engine
&lt;/h3&gt;

&lt;p&gt;Netflix has built a near-real-time recommendation system for its customers using the huge amount of data it has. Netflix captures information about each user based on what type of content they watch, what they search for, what they add to their watch list, etc. All this information is stored in databases and then used by machine learning algorithms to build a pattern indicating the viewer’s taste. This pattern may match another user’s or may not match anyone’s, since each user can have a unique taste. Based on these patterns, the recommendation system suggests TV shows or movies that the user is likely to watch and enjoy. Thus, Netflix uses Data Science extensively to recommend new shows to its end users, improving the customer experience significantly.&lt;/p&gt;

&lt;p&gt;Other than user behavior in the web application, Netflix captures data like viewing day, time, location, type of device used, etc. It also captures the search keywords a user enters to find a movie or show. Using this data, Netflix is not only able to suggest the next shows or movies to watch but also to arrange the selections into rows based on an individual’s viewing preferences. For example, Netflix will position the program you are most likely to watch in the top left corner.&lt;/p&gt;

&lt;h3&gt;
  
  
  Production Planning &amp;amp; Content Development Analytics
&lt;/h3&gt;

&lt;p&gt;At Netflix, Data Science and AI are used not only to understand user behavior for recommendation systems but also in production planning and content development activities. When creators come up with an idea for a movie or show, data plays an integral part in making decisions.&lt;/p&gt;

&lt;p&gt;Based on the content developed historically and its performance over time, a lot of data is crunched to find insights about what went well and what can be improved. Data on how viewers perceived previous content is really helpful in predicting the likeability of new content. For example, Netflix’s executives knew that The Umbrella Academy was going to be a hit because it checked certain parameters: it’s a series that follows the protagonists’ growth from childhood to adulthood, it features actor Elliot Page, and it’s a comic action adventure, all parameters that have been successful in the past.&lt;/p&gt;

&lt;p&gt;Data is also widely used to choose shoot locations, timings and days. Simple prediction models can save a significant amount of time and effort in planning and reduce expenses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Personalized Artwork and Imagery Selection
&lt;/h3&gt;

&lt;p&gt;Netflix knows that imagery plays a very important role in how viewers choose movies or TV shows to watch. The main objective of the content platform team is to surface the aspects of the story that might intrigue users and increase the chances that a user watches the TV show or movie. This imagery is purely backed by data science and is personalized for each user. Netflix uses Artwork Visual Analysis (AVA), a collection of tools and machine learning algorithms that extract relevant imagery from videos to surface as thumbnails for customers. Netflix has many TV shows, and a typical show has around 10 episodes per season, which is close to 9 million frames. Manually selecting a frame that will catch the audience’s attention from such a large volume of video is tedious, so Netflix developed AVA to do this work automatically. More information on the different algorithms used in AVA can be found on their &lt;a href="https://netflixtechblog.com/ava-the-art-and-science-of-image-discovery-at-netflix-a442f163af6" rel="noopener noreferrer"&gt;tech blog&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How is Google using Data Science and AI?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8g4kk4yyk6fe8vktu4tf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8g4kk4yyk6fe8vktu4tf.png" alt="How is Google using Data Science and AI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Google is a technology giant that uses Data Science and AI extensively. It is a multinational internet company that provides digital products and services such as online search and advertising, cloud computing and software. Google has a wide range of products and services, and all of them are backed heavily by data science. Google also owns YouTube, which adds to the list of services the company offers. Its products include Google Search, Google Photos, Google Drive, G-Suite, Gmail, Voice Search, Reverse Image Search, Maps, speech recognition, Translate and many more. All these products use Data Science to improve the customer experience and drive revenue growth for the company.&lt;/p&gt;

&lt;h3&gt;
  
  
  Google Translate
&lt;/h3&gt;

&lt;p&gt;Google Translate is a simple online tool that can translate text from one language to another. When it launched in 2006, it used statistical machine translation, but Google has since made great progress in using AI for instant, real-time translation. The latest machine learning algorithms Google Translate uses provide translation in 109 different languages and have boosted the quality and reliability of these translations. Google has made significant improvements in the field of Natural Language Processing, which enabled the high accuracy rate of Google Translate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Google Ads
&lt;/h3&gt;

&lt;p&gt;Google Ads, formerly known as AdWords, is part of Google’s marketing suite of tools. Google Ads gives businesses full control to advertise their products online and profiles users based on their searches. Google uses this data to target the right advert to the right users, which is the main idea behind Google Ads. Google uses state-of-the-art machine learning algorithms that rank thousands of keywords based on several metrics, which are then used to pick the right ad to show to each user.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gmail
&lt;/h3&gt;

&lt;p&gt;Gmail is used by a lot of customers as their primary email service, and Google has added many smart features to it. One of the latest is called Smart Reply. This feature reads an email, extracts its meaning and provides the user with possible responses so that they don’t have to type much. Google also uses machine learning algorithms to identify and categorize emails as spam or not spam. Another Data-Science-backed feature in Gmail is the automatic categorization of emails into Promotions, Social, Updates, Priority, etc.&lt;/p&gt;

&lt;p&gt;Apart from the products discussed above, Google uses Machine Learning and AI to build smart products. Google is leading in the technology space due to heavy investments in research in advanced computer science and artificial intelligence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Any organization that uses its data effectively can benefit from data science. Data science is beneficial to businesses in any industry, from generating statistics and insights across workflows and screening new applicants to helping senior employees make more informed decisions.&lt;/p&gt;

&lt;p&gt;If your aim is to work for any of the above data science companies, it’s very important to develop the &lt;a href="https://www.stratascratch.com/blog/what-skills-do-you-need-as-a-data-scientist/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;skills needed for a Data Scientist&lt;/a&gt; or Data Engineer, depending on your preference. With StrataScratch, you can tackle coding as well as non-coding questions and become part of a community of like-minded people. You can communicate and collaborate with other aspiring Data Engineers and work towards achieving your dream job. We have more than 400 real-life SQL questions on the platform, ranging from beginner to advanced level. We highly recommend joining the community of over 20K learners and getting interview ready. All the best!&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>database</category>
      <category>analytics</category>
      <category>ai</category>
    </item>
    <item>
      <title>Find the Retention Rates – Salesforce SQL Interview Question</title>
      <dc:creator>StrataScratch</dc:creator>
      <pubDate>Wed, 17 Aug 2022 07:26:10 +0000</pubDate>
      <link>https://dev.to/nate_at_stratascratch/find-the-retention-rates-salesforce-sql-interview-question-3nff</link>
      <guid>https://dev.to/nate_at_stratascratch/find-the-retention-rates-salesforce-sql-interview-question-3nff</guid>
      <description>&lt;p&gt;&lt;em&gt;Retention rates are one of the key business metrics. We’ll show you how to calculate them by explaining in detail how to solve the Salesforce data science interview question.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YH5ltgjP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kzldhxm9bbslvcr0jki8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YH5ltgjP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kzldhxm9bbslvcr0jki8.png" alt="Find the Retention Rates – Salesforce SQL Interview Question" width="880" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The retention rate is one of the important business metrics, especially in marketing, investing, and product management.&lt;/p&gt;

&lt;p&gt;It refers to the percentage of customers continuing to do business with a company. This usually means extending a subscription or in any other way continuing to use the company’s products and services, such as software, applications, maintenance, etc.&lt;/p&gt;

&lt;p&gt;The retention rate is calculated by dividing the number of retained customers by the number of customers at the beginning of the period. The number of retained customers shouldn’t include customers acquired during the monitored period. In other words, the formula is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NZ54kw-I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jlwz322gfchtqq9efdzy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NZ54kw-I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jlwz322gfchtqq9efdzy.png" alt="retention rate formula" width="753" height="79"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PEC – Period End Number of Customers&lt;/li&gt;
&lt;li&gt;NC – New Customers in the Period&lt;/li&gt;
&lt;li&gt;PSC – Period Start Number of Customers&lt;/li&gt;
&lt;/ul&gt;
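&lt;p&gt;As a quick sanity check on the formula, here is a minimal Python sketch with made-up numbers:&lt;/p&gt;

```python
def retention_rate(pec, nc, psc):
    """Retention rate = (PEC - NC) / PSC, expressed as a percentage."""
    return (pec - nc) / psc * 100

# Hypothetical period: 200 customers at the start, 180 at the end,
# 30 of whom were acquired during the period
rate = retention_rate(pec=180, nc=30, psc=200)
print(rate)  # 75.0
```

&lt;p&gt;Subtracting the new customers first is the key step: it ensures growth during the period doesn’t inflate the rate.&lt;/p&gt;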

&lt;p&gt;Now, we’ll have a look at the interview question and try to find the retention rates using SQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Retention Rate - A Data Science Interview Question by Salesforce
&lt;/h2&gt;

&lt;p&gt;Here’s what this question asks you:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wUv4J_LW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rpps74ps5qaatbk7xyz5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wUv4J_LW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rpps74ps5qaatbk7xyz5.png" alt="Data Science Interview Question by Salesforce to find Retention Rate" width="818" height="252"&gt;&lt;/a&gt;&lt;br&gt;
Link to the question: &lt;a href="https://platform.stratascratch.com/coding/2053-retention-rate?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2053-retention-rate&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/8zeLdtkY2CQ"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Dataset to Work With
&lt;/h2&gt;

&lt;p&gt;To solve this problem, Salesforce gives you only one table: sf_events.&lt;/p&gt;

&lt;p&gt;It has three columns:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pUJDlJbr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qwn7g6z7u4sqia2bxskn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pUJDlJbr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qwn7g6z7u4sqia2bxskn.png" alt="Salesforce Dataset Table" width="242" height="115"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To get an idea about the data it contains, here are the first few rows from the table:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--m89nHIQC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jmdqwhh4rrath2ahx48n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--m89nHIQC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jmdqwhh4rrath2ahx48n.png" alt="Salesforce Dataset Table" width="817" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution Approach
&lt;/h2&gt;

&lt;p&gt;Since this table is not a list of all users but a list of users’ activity in each month, we don’t need to calculate the number of new users each month. In other words, we want to see how many users active in December 2020 were also active in January 2021 or any other future month. We also need to look at all the users active in January 2021 and see whether they were active in February 2021 or any other future month. This is also the assumption stated in the question.&lt;/p&gt;

&lt;p&gt;With this assumption in mind, the retention rate is calculated by finding the users active in future months and dividing this number by the number of users in December 2020 or January 2021, depending on which retention rate you’re calculating.&lt;/p&gt;

&lt;p&gt;For example, if a user was active in December 2020, they would appear in the table with a December 2020 timestamp. If there’s any future activity (in January 2021 or later), this user is considered retained for December 2020. If the user was active in December 2020 but didn’t appear in any of the following months, they are considered not retained.&lt;/p&gt;
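&lt;p&gt;The retained/not-retained logic can be sketched in a few lines of Python (the event tuples below are hypothetical, not the real sf_events data):&lt;/p&gt;

```python
from datetime import date

# Hypothetical activity log: (user_id, activity_date) tuples
events = [
    (1, date(2020, 12, 5)), (1, date(2021, 1, 10)),  # active in Dec, returned later
    (2, date(2020, 12, 20)),                         # active in Dec only
    (3, date(2021, 1, 3)), (3, date(2021, 2, 14)),   # active in Jan, returned later
]

def retention(events, month_start, month_end):
    """Share of users active in the month who have any activity after it."""
    active = {u for u, d in events if month_start <= d <= month_end}
    retained = {u for u, d in events if u in active and d > month_end}
    return len(retained) / len(active)

print(retention(events, date(2020, 12, 1), date(2020, 12, 31)))  # 0.5
```

&lt;p&gt;User 1 is retained for December (they reappear in January) while user 2 is not, which is exactly the comparison the SQL solution makes with each user’s maximum activity date.&lt;/p&gt;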

&lt;h2&gt;
  
  
  Assumptions
&lt;/h2&gt;

&lt;p&gt;Our solution will be based on the following assumptions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If a user is listed in the table, this represents the user’s activity for the date in the record.&lt;/li&gt;
&lt;li&gt;We consider only retention rates for Dec 2020 and January 2021.&lt;/li&gt;
&lt;li&gt;The table does not represent the list of all users but only the active users.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Solution Breakdown
&lt;/h2&gt;

&lt;p&gt;The steps you have to build into your code are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Find all active users in December 2020 by using the date field. Do the same for January 2021. That way, you’re getting denominators for the Dec and Jan retention rates.&lt;/li&gt;
&lt;li&gt;Find the maximum date of the user’s activity to see if the user has the activity in the future months. To do that, create a table with the user_id and max date.&lt;/li&gt;
&lt;li&gt;Join all the active users in the month with the list of users with future activity. That way, you’ll get the list of December 2020 users and their latest activity date. Then count the number of users with activity after December and divide it by the number of users in December to get the December retention rate. Apply the same principle to calculate the January retention rate. Note that for the January retention rate, future activity begins with February 2021.&lt;/li&gt;
&lt;li&gt;Consolidate by account_id. Use either Jan or Dec accounts list because it’s assumed that both months contain the complete list of account_ids.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;The first thing is to find users active in December 2020.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH dec_2020 AS
  (SELECT DISTINCT account_id,
                   user_id
   FROM sf_events
   WHERE EXTRACT(MONTH
                 FROM date) = 12
     AND EXTRACT(YEAR
                 FROM date) = 2020 ),
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To do that, we’re using a CTE. We’re interested in the distinct accounts and users, and to get the users active in December 2020, we’re using the EXTRACT() function in the WHERE clause.&lt;/p&gt;

&lt;p&gt;The second CTE does the same thing for the users active in January 2021.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;jan_2021 AS
  (SELECT DISTINCT account_id,
                   user_id
   FROM sf_events
   WHERE EXTRACT(MONTH
                 FROM date) = 1
     AND EXTRACT(YEAR
                 FROM date) = 2021 ),
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next we want to find the latest active date for each user.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;max_date AS
  (SELECT user_id,
          MAX(Date) AS max_date
   FROM sf_events
   GROUP BY user_id),
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can probably tell from the solution breakdown, here we use the MAX() function to find the latest active date.&lt;/p&gt;

&lt;p&gt;Now comes the step where we calculate the retention rate. First the December 2020 retention rate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;retention_dec_2020 AS
  (SELECT account_id,
          SUM(CASE
                  WHEN max_date &amp;gt; '2020-12-31' THEN 1.0
                  ELSE 0
              END) / COUNT(*) * 100.0 AS retention_dec
   FROM dec_2020
   JOIN max_date ON dec_2020.user_id = max_date.user_id
   GROUP BY account_id),
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we joined the two CTEs together to match active December users with their latest activity date. A user’s latest activity could still fall within December, so we only count users whose latest activity is after December.&lt;/p&gt;

&lt;p&gt;We used the CASE WHEN statements to allocate values of 1 to all users that had activity after December 2020. Sum these values, divide them by the total number of users in December 2020, and you get the Dec retention rate.&lt;/p&gt;

&lt;p&gt;Then we do the same for January 2021 retention rate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;retention_jan_2021 AS
  (SELECT account_id,
          SUM(CASE
                  WHEN max_date &amp;gt; '2021-01-31' THEN 1.0
                  ELSE 0
              END) / COUNT(*) * 100.0 AS retention_jan
   FROM jan_2021
   JOIN max_date ON jan_2021.user_id = max_date.user_id
   GROUP BY account_id)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we have the retention rates for the Dec and Jan active users, we only need to join the two results on account_id and divide the retentions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT retention_jan_2021.account_id,
       retention_jan / retention_dec AS retention
FROM retention_jan_2021
INNER JOIN retention_dec_2020 ON retention_jan_2021.account_id = retention_dec_2020.account_id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The complete answer to this question is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH dec_2020 AS
  (SELECT DISTINCT account_id,
                   user_id
   FROM sf_events
   WHERE EXTRACT(MONTH
                 FROM date) = 12
     AND EXTRACT(YEAR
                 FROM date) = 2020 ),

 jan_2021 AS
  (SELECT DISTINCT account_id,
                   user_id
   FROM sf_events
   WHERE EXTRACT(MONTH
                 FROM date) = 1
     AND EXTRACT(YEAR
                 FROM date) = 2021 ),

max_date AS
  (SELECT user_id,
          MAX(Date) AS max_date
   FROM sf_events
   GROUP BY user_id),

 retention_dec_2020 AS
  (SELECT account_id,
          SUM(CASE
                  WHEN max_date &amp;gt; '2020-12-31' THEN 1.0
                  ELSE 0
              END) / COUNT(*) * 100.0 AS retention_dec
   FROM dec_2020
   JOIN max_date ON dec_2020.user_id = max_date.user_id
   GROUP BY account_id),

retention_jan_2021 AS
  (SELECT account_id,
          SUM(CASE
                  WHEN max_date &amp;gt; '2021-01-31' THEN 1.0
                  ELSE 0
              END) / COUNT(*) * 100.0 AS retention_jan
   FROM jan_2021
   JOIN max_date ON jan_2021.user_id = max_date.user_id
   GROUP BY account_id)

SELECT retention_jan_2021.account_id,
       retention_jan / retention_dec AS retention
FROM retention_jan_2021
INNER JOIN retention_dec_2020 ON retention_jan_2021.account_id = retention_dec_2020.account_id

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Edge Case Consideration
&lt;/h2&gt;

&lt;p&gt;As an edge case, we’ll consider the possibility that not all accounts were present each month.&lt;/p&gt;

&lt;p&gt;To compensate for that and to include all accounts, you can use two workarounds.&lt;/p&gt;

&lt;h2&gt;
  
  
  FULL OUTER JOIN
&lt;/h2&gt;

&lt;p&gt;The first workaround is to use the FULL OUTER JOIN instead of INNER JOIN in the SELECT statement referencing the CTEs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT 
    COALESCE(retention_jan_2021.account_id, retention_dec_2020.account_id) AS account_id,
    COALESCE(retention_jan, NULL) / COALESCE(retention_dec, NULL) AS retention
FROM retention_jan_2021
FULL OUTER JOIN retention_dec_2020 ON retention_jan_2021.account_id = retention_dec_2020.account_id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use the COALESCE function to get the January accounts plus the December accounts that don’t appear in January. Then divide the two retention rates, which yields NULL whenever an account has no retention rate for one of the months. The CTEs calculating the retention rates are joined using the FULL OUTER JOIN. If you don’t feel at home with all these different JOINs and what they do, don’t worry! The article “&lt;a href="https://www.stratascratch.com/blog/how-to-join-3-or-more-tables-in-sql/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;How to Join 3 or More Tables in SQL&lt;/a&gt;” explains everything you need to know about JOINs.&lt;/p&gt;

&lt;p&gt;The issue with this edge case solution is that it’s computationally intensive.&lt;/p&gt;

&lt;h2&gt;
  
  
  UNION
&lt;/h2&gt;

&lt;p&gt;There’s another way. You can get a complete list of all accounts by using UNION, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;all_accounts AS
  (SELECT account_id
   FROM retention_jan_2021
   UNION 
   SELECT account_id
   FROM retention_dec_2020)

SELECT a.account_id,
       COALESCE(retention_jan, NULL) / COALESCE(retention_dec, NULL) AS retention
FROM all_accounts a
LEFT JOIN retention_jan_2021 j ON a.account_id = j.account_id
LEFT JOIN retention_dec_2020 d ON a.account_id = d.account_id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both these workarounds share a downside: they only capture the accounts present in December and January, so they don’t consider all months in the dataset.&lt;/p&gt;

&lt;p&gt;If you want all months, you can simply create a table with all the distinct account IDs found in the source table. This would, however, mean listing all accounts for all time, so you may get many accounts with a retention of zero because they don’t have any users.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This Salesforce data science interview question is not easy. But if you hung in there until the end, you’ve gained some genuinely valuable knowledge: how to calculate retention rates.&lt;/p&gt;

&lt;p&gt;Knowing that will not only improve your chances of success at the job interview. It will also make you a valuable asset to a company, because you’ve shown that you possess a high level of business as well as technical knowledge. If you want to practice more questions from Salesforce, check out our previous post “&lt;a href="https://www.stratascratch.com/blog/salesforce-data-scientist-coding-interview-questions/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;Salesforce Data Scientist Coding Interview Questions&lt;/a&gt;”, or find questions from other top companies in “&lt;a href="https://www.stratascratch.com/blog/sql-interview-questions-you-must-prepare-the-ultimate-guide/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;SQL Interview Questions You Must Prepare: The Ultimate Guide&lt;/a&gt;”.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>programming</category>
      <category>datascience</category>
    </item>
    <item>
      <title>A Resource Guide to Jump-Start Your Own Data Science Projects</title>
      <dc:creator>StrataScratch</dc:creator>
      <pubDate>Fri, 10 Jun 2022 02:32:24 +0000</pubDate>
      <link>https://dev.to/nate_at_stratascratch/a-resource-guide-to-jump-start-your-own-data-science-projects-322j</link>
      <guid>https://dev.to/nate_at_stratascratch/a-resource-guide-to-jump-start-your-own-data-science-projects-322j</guid>
      <description>&lt;p&gt;&lt;em&gt;A very into-detail guide on the data science project components and the resources for jump starting your very own project.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vgtXmlTi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d90s9qrobw57wglw7pad.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vgtXmlTi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d90s9qrobw57wglw7pad.png" alt="Image description" width="880" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a resource guide for starting your data science projects. It goes over the components of a successful data science project, along with websites and online datasets to help jump-start your own!&lt;/p&gt;

&lt;p&gt;First off, you need to understand which components are required in a full-stack data science project (such as time-series analysis and APIs), and why. We have created a detailed breakdown of what interviewers look for in this article → “&lt;a href="https://www.stratascratch.com/blog/data-analytics-project-ideas-that-will-get-you-the-job/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;Data Analytics Project Ideas That Will Get You The Job&lt;/a&gt;” and the video below:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/c4Af2FcgamA"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Components of a Data Science Project
&lt;/h2&gt;

&lt;p&gt;1) Promising dataset&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real data&lt;/li&gt;
&lt;li&gt;Timestamps&lt;/li&gt;
&lt;li&gt;Qualitative data&lt;/li&gt;
&lt;li&gt;Quantitative data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2) Modern Technologies&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;APIs&lt;/li&gt;
&lt;li&gt;Cloud Databases (Relational + Non-Relational data)&lt;/li&gt;
&lt;li&gt;AWS → S3 buckets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;3) Building model&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metrics&lt;/li&gt;
&lt;li&gt;Modifying dataset&lt;/li&gt;
&lt;li&gt;Diagnostics tests&lt;/li&gt;
&lt;li&gt;Transformation&lt;/li&gt;
&lt;li&gt;Test/Control&lt;/li&gt;
&lt;li&gt;Model Selection&lt;/li&gt;
&lt;li&gt;Optimizing&lt;/li&gt;
&lt;li&gt;Math&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;4) Making impact/validation&lt;/p&gt;

&lt;h2&gt;
  
  
  Promising Dataset
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Real Data
&lt;/h2&gt;

&lt;p&gt;Any great data science project uses constantly updated data.&lt;/p&gt;

&lt;p&gt;There are 2 important reasons for using constantly updated, real data.&lt;/p&gt;

&lt;p&gt;1) The dataset is never truly complete&lt;/p&gt;

&lt;p&gt;A real-world dataset needs to be filtered, or to have its values manipulated to derive new metrics. Data wrangling is one of the most important parts of data science, since a model is only as good as the dataset it analyzes.&lt;/p&gt;

&lt;p&gt;2) The dataset is updated in real time&lt;/p&gt;

&lt;p&gt;Most companies use datasets that are updated frequently. These datasets are especially important for businesses that need to take a specific action when a certain metric falls below a certain threshold. For example, if the supply of an ice cream company falls below the predicted demand, the company needs a plan for matching supply and demand. Using real-time datasets is a great way to show recruiters you have experience with variable datasets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Timestamp
&lt;/h2&gt;

&lt;p&gt;Datetime values are, as the name states, values that include a date or time. They are commonly found in datasets that are constantly updated, since most records include a timestamp. Even if the records don’t have timestamps, it’s useful for analysis to have a datetime column. Companies commonly want to see the distribution of a metric across a year (or possibly decades), so finding datasets with datetime values and computing that distribution is important.&lt;/p&gt;
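&lt;p&gt;As a quick sketch of why timestamps matter, the snippet below buckets a handful of made-up event dates by month using only Python’s standard library:&lt;/p&gt;

```python
from collections import Counter
from datetime import datetime

# Hypothetical event timestamps (ISO date strings), as found in many datasets
events = ["2021-01-05", "2021-01-20", "2021-02-11", "2021-03-02", "2021-03-28"]

# Bucket events by month to see their distribution over the year
per_month = Counter(datetime.fromisoformat(d).month for d in events)
print(per_month[1])  # 2 events in January
```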

&lt;h2&gt;
  
  
  Qualitative / Quantitative
&lt;/h2&gt;

&lt;p&gt;Qualitative and quantitative data represent non-numerical and numerical values, respectively.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Qualitative → Gender, types of jobs, color of house&lt;br&gt;
Quantitative → Conversion rate, sales, employees laid off&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Both types of data provide their own importance.&lt;/p&gt;

&lt;p&gt;Quantitative data is one of the fundamentals of regression models: you use numbers and variables to predict a numeric value.&lt;/p&gt;

&lt;p&gt;Qualitative data can help with classification models, such as decision trees. Qualitative data can also be converted into quantitative data, for example by converting safety levels [none, low, medium, high] to [0, 1, 2, 3], which is called ordinal encoding.&lt;/p&gt;
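&lt;p&gt;The ordinal encoding just described can be sketched in a few lines of Python (the safety-level column here is made up):&lt;/p&gt;

```python
# Map ordered categories to integers that preserve their ranking
levels = ["none", "low", "medium", "high"]
encoding = {level: rank for rank, level in enumerate(levels)}

column = ["low", "high", "none", "medium"]
encoded = [encoding[value] for value in column]
print(encoded)  # [1, 3, 0, 2]
```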

&lt;p&gt;Geo-locations, such as countries or longitude/latitude, are also nice to have in datasets. Similar to datetime values, geo-locations let you find the distribution of metrics across various states or countries. Multinational corporations in particular have datasets from various countries that need to be analyzed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;p&gt;Now that you have a better understanding of what to look for in a potential dataset, here are some websites where you can search for one.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://datasetsearch.research.google.com/"&gt;Google Dataset Search&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.re3data.org/"&gt;Registry Of Research Data Repositories&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://data.gov/"&gt;U.S. Government Open Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://data.gov.in/"&gt;Open Government Data Platform India&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These links contain datasets in CSV format, but also offer access to APIs. APIs are important to have in your repertoire, since data, especially inside companies, is usually obtained through APIs.&lt;/p&gt;

&lt;p&gt;Beyond these websites, another place to retrieve datasets is consumer-facing tech companies such as Twitter, Facebook, and YouTube. These companies provide APIs for developers directly through their websites. This is an easy area to find intriguing ideas for your projects!&lt;/p&gt;

&lt;h2&gt;
  
  
  Modern Technologies
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZOSzaLKq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2yn3cjz4caje4wa79gey.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZOSzaLKq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2yn3cjz4caje4wa79gey.png" alt="Image description" width="880" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Modern technologies are a key factor in what differentiates a good data science project from a great one. “Modern technologies” refers to the software and services commonly used by companies. APIs and cloud databases are two examples.&lt;/p&gt;

&lt;h2&gt;
  
  
  API
&lt;/h2&gt;

&lt;p&gt;APIs are one of the most important modern technologies to use when creating a data science project. An API (Application Programming Interface) is what makes your application work. Imagine you are booking an Uber ride. Through your phone, you first input your pickup and dropoff locations, and Uber gives you an approximate cost. How does the Uber application calculate this cost? It uses APIs. An API is an interface between two pieces of software.&lt;/p&gt;

&lt;p&gt;Example: The Uber app requires an input of pickup and dropoff locations. A separate piece of software, which can be hosted on the web, calculates the cost based on the distance between locations, the approximate time taken, surge pricing (when there is high demand for rides), and much more. The calling and communication between these two pieces of software is the API.&lt;/p&gt;

&lt;p&gt;1) Understanding APIs and how to set them up in code&lt;/p&gt;

&lt;p&gt;Knowledge of APIs is essential for a great data scientist. While you can watch short videos about what an API is, you definitely need a deeper understanding of where APIs are used and the different types. Building your own API is a great way to gain that understanding and learn how to test APIs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=OVvTv9Hy91Q"&gt;What Are APIs? - Simply Explained&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=GZvSYJDk-us&amp;amp;t=6430s"&gt;APIs for Beginners - How to use an API (Full Course / Tutorial)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2) Libraries for APIs (Request/Flask in Python)&lt;/p&gt;

&lt;p&gt;To request data from APIs, or even to create your own API, there are specific libraries to use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To learn more about Requests in APIs → Creating project with API &lt;a href="https://www.youtube.com/watch?v=fklHBWow8vE"&gt;Working with APIs in Python [For Your Data Science Project]&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;To learn more about building a REST API with Flask  &lt;a href="https://www.youtube.com/watch?v=GMppyAPbLYk"&gt;Python REST API Tutorial - Building a Flask REST API&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
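
&lt;p&gt;To give a feel for what a client library like Requests sends under the hood, here is a minimal sketch that only builds the request URL; the endpoint and parameter names are hypothetical, and no network call is made:&lt;/p&gt;

```python
from urllib.parse import urlencode

# Hypothetical endpoint -- not a real service
BASE_URL = "https://api.example.com/v1/rides"

def build_request_url(pickup, dropoff):
    # Encode the parameters into the query string a GET request would carry
    query = urlencode({"pickup": pickup, "dropoff": dropoff})
    return BASE_URL + "?" + query

print(build_request_url("SoMa", "Mission"))
```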

&lt;p&gt;3) Understand json objects&lt;/p&gt;

&lt;p&gt;Plenty of APIs use JSON objects as input and output, so it is crucial to understand what JSON objects are.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using json library in Python - &lt;a href="https://www.youtube.com/watch?v=9N6a-VLBa2I"&gt;Python Tutorial: Working with JSON Data using the json Module&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
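
&lt;p&gt;Here is a small taste of the json module in action, using a made-up response payload:&lt;/p&gt;

```python
import json

# A hypothetical API response body, as raw JSON text
payload = '{"ride_id": 17, "fare": 12.5, "surge": false}'

data = json.loads(payload)   # JSON text to a Python dict
print(data["fare"])          # 12.5

data["fare"] = 14.0
print(json.dumps(data))      # the dict serialized back to JSON text
```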

&lt;h2&gt;
  
  
  Cloud Database
&lt;/h2&gt;

&lt;p&gt;Recruiters want data scientists with cloud database experience, since databases are increasingly hosted in the cloud these days.&lt;/p&gt;

&lt;p&gt;Before going into the 3 major cloud platforms, you want to plan the structure of your input and output data. There are 2 types of databases: relational and non-relational.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Relational databases store data with primary keys that identify specific records. This type of data is generally organized in a table structure.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Non-relational databases store data without primary keys, in structures such as graphs or documents.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
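
&lt;p&gt;To make the primary-key idea concrete, here is a tiny sketch using Python’s built-in sqlite3 module (the table and rows are made up):&lt;/p&gt;

```python
import sqlite3

# An in-memory relational table where user_id uniquely identifies each row
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada')")
conn.execute("INSERT INTO users VALUES (2, 'Grace')")

# Look a row up by its primary key
row = conn.execute("SELECT name FROM users WHERE user_id = 2").fetchone()
print(row[0])  # Grace
```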

&lt;p&gt;Common cloud platforms are AWS, Google Cloud, and Microsoft Azure. Each has its own advantages and disadvantages, and BMC gives a &lt;a href="https://www.bmc.com/blogs/aws-vs-azure-vs-google-cloud-platforms/"&gt;detailed analysis of these services.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After personally using these services, I would recommend AWS, especially for a first project. &lt;a href="https://aws.amazon.com/s3/"&gt;S3 buckets&lt;/a&gt; are extremely common when working with companies. AWS has RDS (a relational database service) and DynamoDB (a non-relational database) along with S3 storage. Using these services is a great way to show you have experience with both cloud databases and S3. AWS has a great variety of free-tier services to build your project, along with 5 GB of free S3 storage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=ruz-vK8IesE"&gt;SQL vs NoSQL Explained&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/playlist?list=PL9nWRykSBSFilnmg4hy2Sfs1o6_wF1JJP"&gt;How to learn AWS for beginners&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Building Models
&lt;/h2&gt;

&lt;p&gt;Let’s discuss what to keep in mind when building a model.&lt;br&gt;
An important thing to remember when building a model is why you are using, or not using, a specific technique. While getting an accurate model is important, during interviews you want to explain the reasoning behind choosing specific models over others.&lt;/p&gt;

&lt;p&gt;Automate your model as much as possible. Assuming your input data has a fixed format, your algorithm should clean it, derive new metrics, apply appropriate transformations, and build the model. Even when you feed in a dataframe with new data, the algorithm should still work and provide the right outputs.&lt;br&gt;
Here are some things to consider when building a model:&lt;/p&gt;

&lt;h2&gt;
  
  
  Metrics
&lt;/h2&gt;

&lt;p&gt;How do you determine how accurate your model is? What numerical data is used when creating your model? Not all models require a metric, but most use them. Determining which metrics affect your model, and to what extent, is imperative to a well-thought-out model. If your model requires deriving a new metric, note what that metric is and why it was created.&lt;/p&gt;

&lt;h2&gt;
  
  
  Modifying dataset
&lt;/h2&gt;

&lt;p&gt;How are you manipulating the values or columns of the input dataset? Remember to clean the dataset. Beyond cleaning, are there any derived columns that directly affect the output? Also remember to record in your project notes why you made these changes.&lt;/p&gt;

&lt;p&gt;Certain columns may contain null values. In those cases you should decide how to deal with the rows that have missing values. You could impute the column average to replace the null values, or run a regression to predict them.&lt;/p&gt;
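&lt;p&gt;The average-imputation option can be sketched without any libraries (the column values are made up):&lt;/p&gt;

```python
# Replace missing entries (None) with the mean of the observed values
column = [4.0, None, 6.0, None, 5.0]

observed = [v for v in column if v is not None]
mean = sum(observed) / len(observed)  # 5.0

imputed = [mean if v is None else v for v in column]
print(imputed)  # [4.0, 5.0, 6.0, 5.0, 5.0]
```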

&lt;p&gt;There are various ways to clean a dataset, ranging from simply removing rows with null values to PCA (an unsupervised ML technique that reduces dimensionality by combining correlated features).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here are some common techniques to use to optimize your dataset:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Common Data Cleaning Techniques → &lt;a href="https://monkeylearn.com/blog/data-cleaning-techniques/"&gt;8 Effective Data Cleaning Techniques for Better Data&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Dealing with missing values → &lt;a href="https://towardsdatascience.com/7-ways-to-handle-missing-values-in-machine-learning-1a6326adf79e"&gt;7 Ways to Handle Missing Values in Machine Learning&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Encoding values → &lt;a href="https://towardsdatascience.com/categorical-encoding-using-label-encoding-and-one-hot-encoder-911ef77fb5bd"&gt;Categorical encoding using Label-Encoding and One-Hot-Encoder&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;PCA → &lt;a href="https://www.askpython.com/python/examples/principal-component-analysis"&gt;Principal Component Analysis from Scratch in Python
&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---C8kswI2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z9niwutliicanrv4rboy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---C8kswI2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z9niwutliicanrv4rboy.png" alt="Image description" width="512" height="332"&gt;&lt;/a&gt;&lt;br&gt;
PCA - &lt;a href="https://www.stratascratch.com/blog/overview-of-machine-learning-algorithms-unsupervised-learning?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;Overview of Machine Learning Algorithms: Unsupervised Learning&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Diagnostic tests
&lt;/h2&gt;

&lt;p&gt;Raw datasets often need to be adjusted for the specific analysis you plan to run. Suppose you want to run a linear regression, which assumes equal variance (homoscedasticity) in your data. To check that assumption, you can run a diagnostic test such as Bartlett’s test. Depending on the type of model you want to create, the dataset needs specific properties.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Examples of diagnostic tests to run for common problems:&lt;/p&gt;

&lt;p&gt;1) Outliers&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Box-plot - A simple graph to show the 5 number summary (minimum, first quartile, median, third quartile, maximum) - &lt;a href="https://www.youtube.com/watch?v=mhaGAaL6Abw"&gt;How To Make Box and Whisker Plots&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Grubbs Test - Test to detect exactly one outlier in the dataset - &lt;a href="https://www.youtube.com/watch?v=HmbERCjc8_8"&gt;Grubbs Test (example)&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2) Homoscedasticity / Heteroskedasticity&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Homoscedasticity - Test to check if variance of the dependent variable is the same throughout the dataset - &lt;a href="https://www.itl.nist.gov/div898/handbook/eda/section3/eda357.htm"&gt;Bartlett's test&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Heteroskedasticity - Test to check if variance of the dependent variable is NOT the same throughout the dataset - &lt;a href="https://www.youtube.com/watch?v=wzLADO24CDk"&gt;Breusch Pagan test&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
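
&lt;p&gt;For a rough intuition (this is a crude split-sample check on made-up residuals, not a formal Bartlett’s or Breusch Pagan test), you can compare the residual variance across two halves of the data:&lt;/p&gt;

```python
from statistics import pvariance

# Residuals whose spread grows over the sample -- a heteroskedasticity smell
residuals = [0.1, -0.2, 0.15, -0.1, 1.5, -2.0, 1.8, -1.6]

half = len(residuals) // 2
ratio = pvariance(residuals[half:]) / pvariance(residuals[:half])
print(ratio)  # a ratio far from 1 suggests the variance is not constant
```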

&lt;h2&gt;
  
  
  Transformation
&lt;/h2&gt;

&lt;p&gt;If any diagnostic test shows that a transformation is required, run the relevant transformation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Common transformations:&lt;/p&gt;

&lt;p&gt;1) Box-Cox transformation - Transforming a non-normal distribution closer to a normal distribution&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/box-cox-transformation-explained-51d745e34203"&gt;Box-Cox Transformation: Explained&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=vGOpEpjz2Ks"&gt;Box-Cox Transformation + R Demo
&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2) Log transformation - Transforming a skewed distribution closer to a normal distribution&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://medium.com/@kyawsawhtoon/log-transformation-purpose-and-interpretation-9444b4b049c9"&gt;Log Transformation: Purpose and Interpretation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
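
&lt;p&gt;A quick illustration of how the log transform compresses a right-skewed variable (the values are made up):&lt;/p&gt;

```python
from math import log10

# Values spanning several orders of magnitude -- a long right tail
skewed = [1, 10, 100, 1000, 10000]

# After a base-10 log transform the spacing becomes even
transformed = [log10(v) for v in skewed]
print(transformed)
```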

&lt;h2&gt;
  
  
  Test/Control
&lt;/h2&gt;

&lt;p&gt;Some models require test/control versions. A common test/control method is A/B testing. Did you implement a test/control split? What specific difference between the test and control versions did you use, and why? What were your results?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://vwo.com/ab-testing/"&gt;Learn A/B testing&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Selection
&lt;/h2&gt;

&lt;p&gt;What are your assumptions about this model? What properties do the dataset and model have? Why was this model the best fit for the question you are trying to answer?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Common Regression Models:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;1) Ridge - This uses L2 regularization, which means an ineffective variable’s coefficient can be reduced CLOSE to 0 - &lt;a href="https://www.youtube.com/watch?v=Q81RR3yKn30"&gt;Regularization Part 1: Ridge (L2) Regression&lt;/a&gt;&lt;br&gt;
2) Lasso - This uses L1 regularization, which means an ineffective variable’s coefficient can be reduced all the way to 0 - &lt;a href="https://www.youtube.com/watch?v=NGf0voTMlcs"&gt;Regularization Part 2: Lasso (L1) Regression&lt;/a&gt;&lt;br&gt;
3) Logistic Regression - This model computes the probability of a given input and returns a binary output - &lt;a href="https://www.youtube.com/watch?v=yIYKR4sgzI8"&gt;StatQuest: Logistic Regression&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--35y8k7Ti--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0u5we2ximrs77zyacw7h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--35y8k7Ti--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0u5we2ximrs77zyacw7h.png" alt="Image description" width="512" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Logistic regression part - &lt;a href="https://www.stratascratch.com/blog/overview-of-machine-learning-algorithms-classification/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;Overview of Machine Learning Algorithms: Classification&lt;/a&gt;&lt;/p&gt;
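
&lt;p&gt;To make the logistic regression output concrete, here is a minimal sketch of the logistic (sigmoid) function and how its probability becomes a binary prediction:&lt;/p&gt;

```python
from math import exp

def sigmoid(x):
    # Squash any real input into a probability strictly between 0 and 1
    return 1.0 / (1.0 + exp(-x))

def predict(x):
    # Round the probability to the nearest class label, 0 or 1
    return round(sigmoid(x))

print(predict(2.0), predict(-2.0))  # 1 0
```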

&lt;ul&gt;
&lt;li&gt;Classification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;1) Decision trees - Decision trees are made of decision nodes which further lead to either another decision node or a leaf node - &lt;a href="https://www.youtube.com/watch?v=7VeUPuFGJHk"&gt;StatQuest: Decision Trees&lt;/a&gt;&lt;br&gt;
2)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--O7wLTr33--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9rca027l54d6z5xgf9t1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--O7wLTr33--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9rca027l54d6z5xgf9t1.png" alt="Image description" width="512" height="222"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Decision trees - &lt;a href="https://www.stratascratch.com/blog/overview-of-machine-learning-algorithms-classification/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;Overview of Machine Learning Algorithms: Classification&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Naive Bayes - A type of classification algorithm that uses Bayes’ theorem - &lt;a href="https://www.youtube.com/watch?v=O2L2Uv9pdDA"&gt;Naive Bayes, Clearly Explained!!!&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Neural Networks&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;CNN model - A neural network model that is used for image recognition - &lt;a href="https://www.youtube.com/watch?v=YRhxdVk_sIs"&gt;Convolutional Neural Networks (CNNs) explained&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;RNN model - A neural network model that is used for sequential data - &lt;a href="https://www.youtube.com/watch?v=LHXXI4-IEns"&gt;Illustrated Guide to Recurrent Neural Networks: Understanding the Intuition&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Optimizing
&lt;/h2&gt;

&lt;p&gt;Your first iteration of a model should not be your final one. Recheck your code and find ways to optimize it! (Hopefully you wrote proper comments and well-named variables so you don’t forget what each function does.)&lt;/p&gt;

&lt;p&gt;When optimizing, you first want to define what counts as a more optimized model. Most commonly in a data science project, a more optimized model is a more accurate model.&lt;/p&gt;

&lt;p&gt;Error metrics calculate the difference between the original data and the predicted data. This can be done in a couple of ways, such as Mean Squared Error and R². Mean Squared Error is the average of the squares of the errors, where an error is the difference between an original value and its prediction.&lt;/p&gt;

&lt;p&gt;Mean Square Error Formula:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xg7Y4dju--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2hx4t1estfulajbbbyhy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xg7Y4dju--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2hx4t1estfulajbbbyhy.png" alt="Image description" width="559" height="109"&gt;&lt;/a&gt;&lt;/p&gt;
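
&lt;p&gt;The same formula in code, as a quick sanity check:&lt;/p&gt;

```python
# Mean Squared Error: the average squared difference between actual and predicted
def mse(actual, predicted):
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)

print(mse([3, 5, 7], [2, 5, 9]))  # (1 + 0 + 4) / 3
```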

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Common error metrics&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=KzHJXdFJSIQ"&gt;Root Mean Square Error (RMSE) Tutorial + MAE + MSE + MAPE+ MPE&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=_I7sKr77Ci8"&gt;Adjusted R squared vs. R Squared For Beginners&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Math
&lt;/h2&gt;

&lt;p&gt;When creating a statistical model, you definitely need to understand the math. What are the mathematical assumptions of the model? If you have a final equation, specific epochs, or other notable values, include them in your notes.&lt;/p&gt;

&lt;p&gt;If you want to create a scientific document for your model, you can use LaTeX, which is made specifically for scientific documents and mathematical formulas. You can use an &lt;a href="https://www.overleaf.com/"&gt;online LaTeX editor&lt;/a&gt; to create the documents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making an impact / validation
&lt;/h2&gt;

&lt;p&gt;Now you have finally created your model and project! The final step is to get peer review on your project!&lt;/p&gt;

&lt;p&gt;There are multiple ways to get validation for your project, such as creating a report of your findings or sharing the visualizations.&lt;/p&gt;

&lt;p&gt;Creating an article/report&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There are various opinions on how to write a report. No matter how you format your paper, remember to include evidence and the logical reasoning behind your analysis. If there are other research papers or models related to your question, explain the differences and similarities.&lt;/li&gt;
&lt;li&gt;Examples of great research papers&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="http://www.robotics.stanford.edu/~ang/papers/icdar01-TextRecognitionUnsupervisedFeatureLearning.pdf"&gt;Text Detection and Character Recognition in Scene Images with Unsupervised Feature Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://vision.stanford.edu/pdf/CVPR16_N_LSTM.pdf"&gt;Social LSTM: Human Trajectory Prediction in Crowded Spaces&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now it is time to share your analysis with the world. The first important place to upload your analysis is GitHub.&lt;/p&gt;

&lt;p&gt;GitHub should be used to upload:&lt;/p&gt;

&lt;p&gt;1) Code – Make sure it is effectively commented and uses precise variable names&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://realpython.com/python-comments-guide/"&gt;Writing Comments in Python (Guide)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.blog/2021/12/23/best-practices-for-writing-code-comments"&gt;Best practices for writing code comments&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;2) Your report&lt;br&gt;
3) ReadMe file – For other users to understand how to replicate your analysis&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.drupal.org/docs/develop/managing-a-drupalorg-theme-module-or-distribution-project/documenting-your-project/readme-template"&gt;Documenting your project - README template&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.makeareadme.com/"&gt;Make a README&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Another social media platform to take advantage of is Reddit. Reddit has plenty of subreddits where you can share your projects.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.reddit.com/r/learnmachinelearning/"&gt;r/learnmachinelearning&lt;/a&gt; → For simpler project that might tend to be your first few data science projects&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.reddit.com/r/MachineLearning/"&gt;r/machinelearning&lt;/a&gt; → For your detailed research papers&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.reddit.com/r/dataisbeautiful/"&gt;r/dataisbeauitful &lt;/a&gt;→ This is the go to place to share visualizations with a large community to share visualizations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://towardsdatascience.com/"&gt;Towards Data Science &lt;/a&gt;(derived from Medium)  is the go to place to upload your data science articles. These articles need to be an analysis of your project, why you used specific models over others, your findings, and more.&lt;/p&gt;

&lt;p&gt;Some examples of great project analyses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://medium.com/@Abhishek3/future-of-san-francisco-job-market-41c1ee9be07a"&gt;Future of San Francisco City job market&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/coding-an-intelligent-battleship-agent-bf0064a4b319"&gt;Coding an Intelligent Battleship Agent&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/analyzing-your-friends-imessage-wordle-stats-using-python-5649def20fd"&gt;Analyzing your Friends’ iMessage Wordle Stats Using Python&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/ive-tracked-my-mood-for-over-1000-days-a-data-analysis-5b0bda76cbf7"&gt;I’ve Tracked My Mood for Over 1000 Days: A Data Analysis&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LinkedIn is a great tool for sharing your projects with people outside of data science. Data science teams in companies constantly have to communicate with coworkers in other departments. Sharing your projects with people beyond your peers gives great insight into how effectively you can communicate a technical project to a non-technical audience.&lt;/p&gt;

&lt;p&gt;Twitter is an important platform for learning about various topics, especially academic research. If you want to be active in the data science community, keep up with new technologies, or publish your own projects, you should join Twitter. It is a great way to share your projects with the academic community and follow reputable people in the field.&lt;br&gt;
Great Twitter pages in ML/AI/DS to follow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://twitter.com/drfeifei"&gt;Fei-Fei Li - @drfeifei&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/NandoDF"&gt;Nando de Freitas - @NandoDF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/kdnuggets"&gt;KDnuggets - @kdnuggets&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Now you have the path for a great data science project!&lt;/p&gt;

&lt;p&gt;Try to implement as many of these components as you can. However, including components that make no sense for your project is a black mark, especially if an interviewer notices.&lt;/p&gt;

&lt;p&gt;Always look for ways to improve your model or follow up on your project! For example, if you created a prediction model, check how accurate it still is six months after you published it!&lt;/p&gt;

&lt;p&gt;TIP: A question you should constantly ask yourself when building your project is: why am I using this specific method? For example, if you are building a regression model and choose Lasso over Ridge, one reason could be that you want to remove certain variables. Then ask yourself: why do I want to remove those variables? Perhaps they increase the MSPE. By constantly asking questions like these throughout your project, you end up with a more accurate model, because you have thought through the different approaches.&lt;/p&gt;

&lt;p&gt;If you’re a beginner and still want more ideas and tutorials to start with, check out our post “&lt;a href="https://www.stratascratch.com/blog/19-data-science-project-ideas-for-beginners/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;19 Data Science Project Ideas for Beginners&lt;/a&gt;”.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>webdev</category>
    </item>
    <item>
      <title>String and Array Functions in SQL for Data Science</title>
      <dc:creator>StrataScratch</dc:creator>
      <pubDate>Mon, 09 May 2022 08:56:02 +0000</pubDate>
      <link>https://dev.to/nate_at_stratascratch/string-and-array-functions-in-sql-for-data-science-43fj</link>
      <guid>https://dev.to/nate_at_stratascratch/string-and-array-functions-in-sql-for-data-science-43fj</guid>
      <description>&lt;p&gt;&lt;em&gt;Commonly used string and array functions in SQL Data Science Interviews.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In a previous article "&lt;a href="https://www.stratascratch.com/blog/sql-scenario-based-interview-questions-and-answers/"&gt;SQL Scenario Based Interview Questions&lt;/a&gt;", we touched upon the various date and time functions in SQL. In this article, we will look at another favorite topic in data science interviews – string and array manipulation. With increasingly diverse and unstructured data sources becoming commonplace, string and array manipulation has become an integral part of data analysis and data science work. The key ideas discussed in this article include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cleaning Strings&lt;/li&gt;
&lt;li&gt;String Matching&lt;/li&gt;
&lt;li&gt;String Splitting&lt;/li&gt;
&lt;li&gt;Creating Arrays&lt;/li&gt;
&lt;li&gt;Splitting Arrays into Rows&lt;/li&gt;
&lt;li&gt;Aggregating Text Fields&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You might also want to look at our &lt;a href="https://www.stratascratch.com/blog/python-pandas-interview-questions-for-data-science-part-2/"&gt;Pandas article on string manipulation&lt;/a&gt; in DataFrames, as we use quite a few similar concepts here as well.&lt;/p&gt;

&lt;h2&gt;
  
  
  String Matching
&lt;/h2&gt;

&lt;p&gt;Let us start with a simple string-matching problem. This is from a past City of San Francisco data science interview.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Find the number of violations that each school had&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Determine the number of violations for each school. Any inspection where the risk category is not null is considered a violation. Print the school’s name along with the number of violations. Order the output in descending order of the number of violations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jp6khTmg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4n0eewn69gij5u4vkojs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jp6khTmg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4n0eewn69gij5u4vkojs.png" alt="String Matching Question For Practice" width="512" height="159"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can solve this problem here: &lt;a href="https://platform.stratascratch.com/coding/9727-find-the-number-of-violations-that-each-school-had"&gt;https://platform.stratascratch.com/coding/9727-find-the-number-of-violations-that-each-school-had&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The problem uses the sf_restaurant_health_violations dataset with the following fields.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2XvpZUqI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nky7m4554neato5h9v38.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2XvpZUqI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nky7m4554neato5h9v38.png" alt="Image description" width="415" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The relevant data in the table looks like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WHh3Eb-Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i7bhvoz9exgftq958vr2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WHh3Eb-Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i7bhvoz9exgftq958vr2.png" alt="Image description" width="512" height="130"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The relevant columns are business_name and risk_category.&lt;/p&gt;

&lt;p&gt;These columns are populated thus.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fST-qLrg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/grolfbdydk4u3pxgy4r0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fST-qLrg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/grolfbdydk4u3pxgy4r0.png" alt="Image description" width="512" height="266"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approach and Solution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is a relatively straightforward problem. We need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identify “Schools” from the business name&lt;/li&gt;
&lt;li&gt;Count the violations, excluding the rows where the risk_category is NULL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The simplest string-matching tool in SQL is the LIKE operator, which searches for a substring inside a larger string. However, one needs to use wildcards to ensure the correct match is found. Since we cannot be sure that every school’s name ends with the word “School”, we place the % wildcard both before and after the search term so that “SCHOOL” is matched anywhere in the name. Further, we use the ILIKE operator to make the search case-insensitive. The solution is now very simple.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT business_name,&lt;br&gt;
       COUNT(*) AS num_violations&lt;br&gt;
FROM sf_restaurant_health_violations&lt;br&gt;
WHERE business_name ILIKE '%SCHOOL%'&lt;br&gt;
  AND risk_category IS NOT NULL&lt;br&gt;
GROUP BY 1&lt;br&gt;
ORDER BY 2 DESC ;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If your SQL flavor does not have the ILIKE operator, you can convert the string to upper or lower case and then use LIKE.&lt;/p&gt;
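&lt;p&gt;To make that fallback concrete, here is a small Python sketch of the same case-insensitive containment check (the business names are made up for illustration):&lt;/p&gt;

```python
# Case-insensitive substring match, mirroring
# LOWER(business_name) LIKE '%school%'
def is_school(business_name: str) -> bool:
    return 'school' in business_name.lower()

print(is_school('STRATFORD SCHOOL'))  # True
print(is_school('Starbucks Coffee'))  # False
```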

&lt;p&gt;&lt;strong&gt;Splitting a Delimited String&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now that we have warmed up with string search, let us try another common string manipulation technique: splitting. There are numerous use cases for splitting a string, and doing so requires a delimiter (a separator). To illustrate this, let us look at another City of San Francisco data science interview problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Business Density Per Street&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fP89bSHu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sds0cghf4o2n7fkc1psp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fP89bSHu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sds0cghf4o2n7fkc1psp.png" alt="Image description" width="512" height="151"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can solve the problem on the StrataScratch Platform here: &lt;a href="https://platform.stratascratch.com/coding/9735-business-density-per-street?python="&gt;https://platform.stratascratch.com/coding/9735-business-density-per-street?python=&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This problem uses the same sf_restaurant_health_violations dataset as the previous problem. The fields of interest are business_id and business_address, which are populated thus.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--e9Q24Dzk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7l4nwo0rc3j0z1cj5kg1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--e9Q24Dzk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7l4nwo0rc3j0z1cj5kg1.png" alt="Image description" width="512" height="164"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approach and Solution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We need to extract the second word from the address, which represents the street name. To do this, we split the string using a space as the delimiter (separator) and extract the second word. We can do this with the SPLIT_PART function, which is similar to the split() method in Python. Since Postgres string comparisons are case-sensitive, we convert the output to upper case.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT UPPER(split_part(business_address, ' ', 2)) AS streetname,&lt;br&gt;
       business_address&lt;br&gt;
FROM sf_restaurant_health_violations ;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We get the following output.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2A0U2QR3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bvizywtvcpkr4g0fk971.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2A0U2QR3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bvizywtvcpkr4g0fk971.png" alt="Image description" width="512" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now the problem becomes relatively easy to solve. We find the number of distinct businesses on each street. Since we need only those streets with five or more distinct businesses, we use the HAVING clause to filter the output.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT UPPER(split_part(business_address, ' ', 2)) AS streetname,&lt;br&gt;
       COUNT (DISTINCT business_id) AS density&lt;br&gt;
FROM sf_restaurant_health_violations&lt;br&gt;
GROUP BY 1&lt;br&gt;
HAVING COUNT (DISTINCT business_id) &amp;gt;= 5 ;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We get the following output. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3hkpdUSg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fp7ufbf0xjwqsh1xxpzy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3hkpdUSg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fp7ufbf0xjwqsh1xxpzy.png" alt="Image description" width="512" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we can aggregate this table using a subquery, a CTE, or a temp table. We have used a CTE in this case and get the final output.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;WITH rel_businesses AS&lt;br&gt;
  (SELECT UPPER(split_part(business_address, ' ', 2)) AS streetname,&lt;br&gt;
          COUNT (DISTINCT business_id) AS density&lt;br&gt;
   FROM sf_restaurant_health_violations&lt;br&gt;
   GROUP BY 1&lt;br&gt;
   HAVING COUNT (DISTINCT business_id) &amp;gt;= 5)&lt;br&gt;
SELECT AVG(density),&lt;br&gt;
       MAX(density)&lt;br&gt;
FROM rel_businesses ;&lt;/code&gt;&lt;/p&gt;
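&lt;p&gt;The CTE above boils down to a group-by with a distinct count, a filter, and a final aggregate. A minimal Python sketch of that same pipeline, using made-up rows:&lt;/p&gt;

```python
from collections import defaultdict

# (business_id, business_address) pairs -- hypothetical sample data
rows = [
    (1, "10 Mission St"), (2, "22 Mission St"), (3, "30 Mission St"),
    (4, "41 Mission St"), (5, "55 Mission St"),  # 5 distinct ids on MISSION
    (6, "12 Grand Ave"), (6, "12 Grand Ave"),    # duplicate rows collapse
]

# streetname -> set of distinct business ids (COUNT(DISTINCT business_id))
by_street = defaultdict(set)
for business_id, address in rows:
    by_street[address.split(' ')[1].upper()].add(business_id)

# HAVING COUNT(DISTINCT business_id) >= 5
densities = [len(ids) for ids in by_street.values() if len(ids) >= 5]

# AVG(density), MAX(density)
print(sum(densities) / len(densities), max(densities))  # 5.0 5
```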

&lt;h2&gt;
  
  
  Arrays
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2DvYA7PG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8ad5771y5sncgbypu6xs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2DvYA7PG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8ad5771y5sncgbypu6xs.png" alt="Image description" width="880" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most modern SQL flavors allow the creation and manipulation of arrays. Let us look at working with string arrays; one can manipulate integer and floating-point arrays in a similar manner. To illustrate this, let us take an SQL data science interview problem from an Airbnb interview.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;City With Most Amenities&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Find the city with the most amenities in the given dataset. Each row in the dataset represents a unique host. Output the name of the city with the most amenities.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Qfy8caMN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ahknggjl8q8kfosstdtw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Qfy8caMN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ahknggjl8q8kfosstdtw.png" alt="Image description" width="512" height="129"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can solve the problem here: &lt;a href="https://platform.stratascratch.com/coding/9633-city-with-most-amenities"&gt;https://platform.stratascratch.com/coding/9633-city-with-most-amenities&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The problem uses the airbnb_search_details dataset with the following fields.&lt;/p&gt;

&lt;p&gt;airbnb_search_details&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZVmDz6PN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yyp3in4hkw4b23sx0hgs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZVmDz6PN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yyp3in4hkw4b23sx0hgs.png" alt="Image description" width="406" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The main fields of interest here are city and amenities that are populated thus. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PiR-G7cx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/btoiycrqpfons9ka8jbo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PiR-G7cx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/btoiycrqpfons9ka8jbo.png" alt="Image description" width="512" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approach and solution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To solve this, let us break the problem into parts.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Find the number of amenities for each property&lt;/li&gt;
&lt;li&gt;Aggregate the amenity counts at the city level&lt;/li&gt;
&lt;li&gt;Find the city with the highest number of amenities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The amenities are represented as a single comma-separated string. However, SQL currently recognizes this field as one string, so we need to split it into individual amenities using the comma delimiter. To do this, we use the STRING_TO_ARRAY() function and specify a comma as the delimiter.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT city,&lt;br&gt;
       STRING_TO_ARRAY(amenities, ',') AS num_amenities&lt;br&gt;
FROM airbnb_search_details ;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We get the following output.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kH7wnxrg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b09npq9wug27mxugjwvt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kH7wnxrg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b09npq9wug27mxugjwvt.png" alt="Image description" width="512" height="171"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that for this problem, the opening and closing braces are considered part of the first and last words in the string. To clean the string, we can use the BTRIM function, which removes all the specified leading and trailing characters. We can modify our query in the following manner.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT city,&lt;br&gt;
       STRING_TO_ARRAY(BTRIM(amenities, '{}'), ',') AS num_amenities&lt;br&gt;
FROM airbnb_search_details ;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This gives us the following output. As one can see, we have successfully removed the leading and trailing braces.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iKEvXqgr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/645hpcypeckis57ekx3e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iKEvXqgr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/645hpcypeckis57ekx3e.png" alt="Image description" width="512" height="123"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To find the number of amenities, we need to count the number of elements in the amenities array. We can do this using the ARRAY_LENGTH() function. The function requires us to specify the array dimension whose length is to be measured, which is useful for multi-dimensional arrays. Since our array is 1-dimensional, we simply specify 1.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT city,&lt;br&gt;
       ARRAY_LENGTH(STRING_TO_ARRAY(BTRIM(amenities, '{}') , ',') , 1) AS num_amenities&lt;br&gt;
FROM airbnb_search_details ;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Our output looks like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6o3cGZjP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/77ovxpexck64hw44jhqw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6o3cGZjP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/77ovxpexck64hw44jhqw.png" alt="Image description" width="512" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We now proceed to aggregate the number of amenities at city level.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT city,&lt;br&gt;
       SUM(ARRAY_LENGTH(STRING_TO_ARRAY(BTRIM(amenities, '{}'), ','), 1)) AS num_amenities&lt;br&gt;
FROM airbnb_search_details&lt;br&gt;
GROUP BY 1 ;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Our output now looks like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_i7SIEvN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uobphl8b8slr15tgiv20.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_i7SIEvN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uobphl8b8slr15tgiv20.png" alt="Image description" width="512" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can now find the city with the highest number of amenities by sorting in descending order and using LIMIT 1 or, more reliably, by ranking them.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT city&lt;br&gt;
FROM&lt;br&gt;
  (SELECT city,&lt;br&gt;
          DENSE_RANK() OVER (&lt;br&gt;
                             ORDER BY num_amenities DESC) AS rank&lt;br&gt;
   FROM&lt;br&gt;
     (SELECT city ,&lt;br&gt;
             SUM(ARRAY_LENGTH(STRING_TO_ARRAY(BTRIM(amenities, '{}'), ','), 1)) AS num_amenities&lt;br&gt;
      FROM airbnb_search_details&lt;br&gt;
      GROUP BY 1) Q1) Q2&lt;br&gt;
WHERE rank = 1 ;&lt;/code&gt;&lt;/p&gt;
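&lt;p&gt;Ranking is the more reliable route because LIMIT 1 silently drops ties: if two cities share the top amenity count, DENSE_RANK keeps both. A Python sketch of the same tie-keeping logic, with made-up totals:&lt;/p&gt;

```python
# city -> total amenities (hypothetical numbers)
totals = {"NYC": 490, "LA": 463, "SF": 490}

top = max(totals.values())
# every city with rank = 1, like the DENSE_RANK filter
winners = [city for city, n in totals.items() if n == top]
print(winners)  # ['NYC', 'SF'] -- LIMIT 1 would return only one of them
```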

&lt;p&gt;&lt;strong&gt;Splitting an Array&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The above problem could also have been solved by exploding the array into individual rows and then aggregating the number of amenities for each city. Let us use this method in another SQL data science question, from a Meta (Facebook) interview.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Views Per Keyword&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Find the number of views for each keyword. Report the keyword and the total views in decreasing order of views.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uLAnJ9nI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1k8s57wh3b99dnow8d3c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uLAnJ9nI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1k8s57wh3b99dnow8d3c.png" alt="Image description" width="512" height="105"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can solve the problem on the StrataScratch platform here: &lt;a href="https://platform.stratascratch.com/coding/9791-views-per-keyword?python="&gt;https://platform.stratascratch.com/coding/9791-views-per-keyword?python=&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The problem uses the facebook_posts and facebook_post_views datasets. The fields present in the facebook_posts dataset are&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3tjnJZXv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gqagpg2iqm8d112xj4ii.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3tjnJZXv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gqagpg2iqm8d112xj4ii.png" alt="Image description" width="512" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The data is presented in the following manner&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y1AePa-A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sr1oddjd7ggt97sbfnnq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y1AePa-A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sr1oddjd7ggt97sbfnnq.png" alt="Image description" width="512" height="139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The facebook_post_views has the following fields&lt;/p&gt;

&lt;p&gt;facebook_post_views&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jWdJNA8X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/u8w3ebx7bgmf6dreza3r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jWdJNA8X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/u8w3ebx7bgmf6dreza3r.png" alt="Image description" width="512" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And this is how the data looks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3p_PkwXJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hmpiilzo80xme6fk082d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3p_PkwXJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hmpiilzo80xme6fk082d.png" alt="Image description" width="512" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approach and Solution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let us break this problem into individual parts.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We start off by merging the two datasets on the post_id field and aggregating the number of views for each post.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;SELECT fp.post_id,&lt;br&gt;
       fp.post_keywords,&lt;br&gt;
       COALESCE(COUNT(DISTINCT fpv.viewer_id), 0) AS num_views&lt;br&gt;
FROM facebook_posts fp&lt;br&gt;
LEFT JOIN facebook_post_views fpv ON fp.post_id = fpv.post_id&lt;br&gt;
GROUP BY 1,&lt;br&gt;
         2 ;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We get the following output.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ewovji-2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lzjlh7atnzu8ta66wggt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ewovji-2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lzjlh7atnzu8ta66wggt.png" alt="Image description" width="512" height="140"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We need to assign the views to each keyword. For example, for post_id = 3, the keywords spaghetti and food should each get 3 views. For post_id = 4, the spam keyword should get 3 views, and so on. To accomplish this, we first clean the string by stripping the brackets and the # symbol.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;SELECT fp.post_id,&lt;br&gt;
       STRING_TO_ARRAY(BTRIM(fp.post_keywords, '[]#'), ',') AS keyword,&lt;br&gt;
       COALESCE(COUNT(DISTINCT fpv.viewer_id), 0) AS num_views&lt;br&gt;
FROM facebook_posts fp&lt;br&gt;
LEFT JOIN facebook_post_views fpv ON fp.post_id = fpv.post_id&lt;br&gt;
GROUP BY 1,&lt;br&gt;
         2 ;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We get the following output&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qhRdlDPz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0gb6vlxo0bvv2clgapcy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qhRdlDPz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0gb6vlxo0bvv2clgapcy.png" alt="Image description" width="512" height="140"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Now we separate (explode) the array into individual records using the UNNEST function.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;SELECT fp.post_id,&lt;br&gt;
       UNNEST(STRING_TO_ARRAY(BTRIM(fp.post_keywords, '[]#'), ',')) AS keyword,&lt;br&gt;
       COALESCE(COUNT(DISTINCT fpv.viewer_id), 0) AS num_views&lt;br&gt;
FROM facebook_posts fp&lt;br&gt;
LEFT JOIN facebook_post_views fpv ON fp.post_id = fpv.post_id&lt;br&gt;
GROUP BY 1,&lt;br&gt;
         2 ;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gEocGXl3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/92s0s0gjb401vvn1umcz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gEocGXl3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/92s0s0gjb401vvn1umcz.png" alt="Image description" width="512" height="237"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We can now easily aggregate the number of views per keyword and sort them in descending order.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;WITH exp_keywords AS&lt;br&gt;
  (SELECT fp.post_id ,&lt;br&gt;
          UNNEST(STRING_TO_ARRAY(BTRIM(fp.post_keywords, '[]#'), ',')) AS keyword ,&lt;br&gt;
          COALESCE(COUNT(DISTINCT fpv.viewer_id), 0) AS num_views&lt;br&gt;
   FROM facebook_posts fp&lt;br&gt;
   LEFT JOIN facebook_post_views fpv ON fp.post_id = fpv.post_id&lt;br&gt;
   GROUP BY 1,&lt;br&gt;
            2)&lt;br&gt;
SELECT keyword,&lt;br&gt;
       sum(num_views) AS total_views&lt;br&gt;
FROM exp_keywords&lt;br&gt;
GROUP BY 1&lt;br&gt;
ORDER BY 2 DESC ;&lt;/code&gt;&lt;/p&gt;
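&lt;p&gt;For comparison, the same explode-and-aggregate pipeline can be sketched in pandas. The frames below are invented stand-ins for the facebook_posts and facebook_post_views tables, not the actual dataset.&lt;/p&gt;

```python
import pandas as pd

# Invented stand-ins for the two tables in the question.
posts = pd.DataFrame({
    "post_id": [0, 1],
    "post_keywords": ["[#food,#celebrity]", "[#food,#happy]"],
})
views = pd.DataFrame({"post_id": [0, 0, 1], "viewer_id": [10, 11, 12]})

# Distinct viewers per post (a post with no views would get 0).
view_counts = (views.groupby("post_id")["viewer_id"].nunique()
                    .reindex(posts["post_id"], fill_value=0))
posts["num_views"] = view_counts.values

# Strip the surrounding brackets, split on commas, and explode into one
# row per keyword, mirroring BTRIM + STRING_TO_ARRAY + UNNEST.
posts["keyword"] = posts["post_keywords"].str.strip("[]").str.split(",")
exploded = posts.explode("keyword")
exploded["keyword"] = exploded["keyword"].str.lstrip("#")

# Aggregate views per keyword and sort descending, like the final query.
result = (exploded.groupby("keyword")["num_views"].sum()
                  .sort_values(ascending=False))
```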

&lt;p&gt;&lt;strong&gt;Aggregating Text Fields&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let us finish things off by doing the converse: aggregating rows back into a string. We illustrate this with a SQL Data Science interview question from Google.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;File Contents Shuffle&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Rearrange the words of the file final.txt to make a new file named wacky.txt. Sort all the words in alphabetical order, output the words in one column and the filename wacky.txt in another.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TNT1iC_G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jklxi2yglqg3n2cc37vz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TNT1iC_G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jklxi2yglqg3n2cc37vz.png" alt="Image description" width="512" height="132"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can solve the problem here: &lt;a href="https://platform.stratascratch.com/coding/9818-file-contents-shuffle"&gt;https://platform.stratascratch.com/coding/9818-file-contents-shuffle&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The problem uses the google_file_store dataset with the following columns.&lt;/p&gt;

&lt;p&gt;google_file_store&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IRkZE23R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/96gb1sc4gcdn5szcejq6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IRkZE23R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/96gb1sc4gcdn5szcejq6.png" alt="Image description" width="512" height="189"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The contents of the dataset look like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--L8WcRHHK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hkjkmewqgrv5rfvj06n8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--L8WcRHHK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hkjkmewqgrv5rfvj06n8.png" alt="Image description" width="512" height="90"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approach and Solution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let us solve this problem in a step-wise manner.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We first keep only the contents of the file final.txt, split the contents using space as a delimiter, explode the resulting array into individual rows, and sort in alphabetical order.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;SELECT UNNEST(STRING_TO_ARRAY(CONTENTS, ' ')) AS words&lt;br&gt;
FROM google_file_store&lt;br&gt;
WHERE filename ILIKE '%FINAL%'&lt;br&gt;
ORDER BY 1 ;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We get the following output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vL_y6x-A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7t98636j4ff75f8g0yuy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vL_y6x-A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7t98636j4ff75f8g0yuy.png" alt="Image description" width="437" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We now need to combine the individual words back into a string. To do this, we use the STRING_AGG() function and specify space as the delimiter. This function is similar to the join() method in Python. We also add the filename for the new string and output the result.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;WITH exploded_arr AS&lt;br&gt;
  (SELECT UNNEST(STRING_TO_ARRAY(CONTENTS, ' ')) AS words&lt;br&gt;
   FROM google_file_store&lt;br&gt;
   WHERE filename ILIKE '%FINAL%'&lt;br&gt;
   ORDER BY 1)&lt;br&gt;
SELECT 'wacky.txt' AS filename,&lt;br&gt;
       STRING_AGG(words, ' ') AS CONTENTS&lt;br&gt;
FROM exploded_arr ;&lt;/code&gt;&lt;/p&gt;
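&lt;p&gt;Since STRING_AGG() is likened to Python's join() method, here is a minimal Python sketch of the same split, sort, and re-join pipeline. The file contents string below is an invented example, not the actual dataset.&lt;/p&gt;

```python
# Invented contents of final.txt.
contents = "lion tiger cheetah monkey gorilla"

# Split on spaces (STRING_TO_ARRAY), sort alphabetically (ORDER BY),
# then glue the words back together with spaces (STRING_AGG).
words = sorted(contents.split(" "))
wacky_contents = " ".join(words)
result = ("wacky.txt", wacky_contents)
```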

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we looked at the text and array manipulation capabilities of SQL. These are especially useful both upstream, in ETL processes, and downstream, in analysis. As with other Data Science areas, only patience, persistence, and practice can make you proficient. On StrataScratch, we have over 700 coding and non-coding problems that are relevant to Data Science interviews. These problems appeared in actual Data Science interviews at top companies such as Google, Amazon, Microsoft, and Netflix. For example, check out our posts "&lt;a href="https://www.stratascratch.com/blog/40-data-science-interview-questions-from-top-companies/"&gt;40+ Data Science Interview Questions From Top Companies&lt;/a&gt;" and "&lt;a href="https://www.stratascratch.com/blog/sql-interview-questions-you-must-prepare-the-ultimate-guide/"&gt;The Ultimate Guide to SQL Interview Questions&lt;/a&gt;" to practice such interview questions and prepare for the most in-demand jobs at big tech firms and start-ups across the world.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>sql</category>
      <category>beginners</category>
    </item>
    <item>
      <title>The Ultimate Guide to Python Window Functions</title>
      <dc:creator>StrataScratch</dc:creator>
      <pubDate>Wed, 23 Feb 2022 15:11:09 +0000</pubDate>
      <link>https://dev.to/nate_at_stratascratch/the-ultimate-guide-to-python-window-functions-1h1d</link>
      <guid>https://dev.to/nate_at_stratascratch/the-ultimate-guide-to-python-window-functions-1h1d</guid>
      <description>&lt;p&gt;&lt;em&gt;This article focuses on different types of Python window functions, where and how to implement them, practice questions, reference articles and documentation.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A window function is a popular technique for analyzing a subset of related values. Window functions are most commonly associated with SQL; however, they are extremely useful in Python as well.&lt;/p&gt;

&lt;p&gt;If you would like to check out our content on SQL Window Functions, we have also created an article "&lt;a href="https://www.stratascratch.com/blog/the-ultimate-guide-to-sql-window-functions/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;The Ultimate Guide to SQL Window Functions&lt;/a&gt;" and a &lt;a href="https://www.youtube.com/watch?v=XBE09l-UYTE" rel="noopener noreferrer"&gt;YouTube video&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;This article discusses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Different types of window functions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Where / How to implement these functions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Practice Questions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reference articles / Documentation&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A general format is written for each of these functions for you to understand and implement on your own. The format includes &lt;strong&gt;&lt;em&gt;bold italicized text&lt;/em&gt;&lt;/strong&gt;, which indicates the sections of the function you need to replace during implementation.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;strong&gt;&lt;em&gt;dataframe&lt;/em&gt;&lt;/strong&gt;.groupby(level='&lt;strong&gt;&lt;em&gt;groupby_column&lt;/em&gt;&lt;/strong&gt;').agg({‘&lt;strong&gt;&lt;em&gt;aggregate_column&lt;/em&gt;&lt;/strong&gt;’: ‘&lt;strong&gt;&lt;em&gt;aggregate_function&lt;/em&gt;&lt;/strong&gt;’})&lt;/p&gt;

&lt;p&gt;Texts such as '&lt;strong&gt;&lt;em&gt;dataframe&lt;/em&gt;&lt;/strong&gt;' and '&lt;strong&gt;&lt;em&gt;groupby_column&lt;/em&gt;&lt;/strong&gt;' are bold and italicized, meaning you should replace them with the actual variables.&lt;br&gt;
Texts such as ‘.groupby’ and ‘level’, which are not bold and italicized, must remain the same for the function to execute.&lt;/p&gt;

&lt;p&gt;Let’s suppose Amazon asks you to find the total cost each user spent on their Amazon orders.&lt;br&gt;
An implementation of this function on that dataset would look similar to this:&lt;br&gt;
&lt;strong&gt;&lt;em&gt;amazon_orders&lt;/em&gt;&lt;/strong&gt;.groupby(level='&lt;strong&gt;&lt;em&gt;user_id&lt;/em&gt;&lt;/strong&gt;').agg({'&lt;strong&gt;&lt;em&gt;cost&lt;/em&gt;&lt;/strong&gt;': '&lt;strong&gt;&lt;em&gt;sum&lt;/em&gt;&lt;/strong&gt;'})&lt;/p&gt;
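&lt;p&gt;As a runnable sketch of that template (the amazon_orders frame and its values below are invented for illustration; note that level= refers to an index level, so user_id is set as the index first):&lt;/p&gt;

```python
import pandas as pd

# Invented orders data; user_id becomes the index level we group on.
amazon_orders = pd.DataFrame(
    {"user_id": [1, 1, 2], "cost": [10.0, 15.0, 20.0]}
).set_index("user_id")

# Total cost each user spent across their orders.
totals = amazon_orders.groupby(level="user_id").agg({"cost": "sum"})
```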

&lt;p&gt;&lt;u&gt;Table of Contents&lt;/u&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Python Window Functions overview diagram&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Aggregate&lt;/strong&gt;&lt;br&gt;
• &lt;u&gt;Group by&lt;/u&gt;&lt;br&gt;
• &lt;u&gt;Rolling&lt;/u&gt;&lt;br&gt;
• &lt;u&gt;Expanding&lt;/u&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ranking&lt;/strong&gt;&lt;br&gt;
• &lt;u&gt;Row number&lt;/u&gt;&lt;/p&gt;
&lt;h6&gt;
  
  
  reset_index()
&lt;/h6&gt;
&lt;h6&gt;
  
  
  cumcount()
&lt;/h6&gt;

&lt;p&gt;• &lt;u&gt;Rank&lt;/u&gt;&lt;/p&gt;
&lt;h6&gt;
  
  
  default_rank
&lt;/h6&gt;
&lt;h6&gt;
  
  
  min_rank
&lt;/h6&gt;
&lt;h6&gt;
  
  
  NA_bottom
&lt;/h6&gt;
&lt;h6&gt;
  
  
  descending
&lt;/h6&gt;

&lt;p&gt;• &lt;u&gt;Dense rank&lt;/u&gt;&lt;br&gt;
• &lt;u&gt;Percent rank&lt;/u&gt;&lt;br&gt;
• &lt;u&gt;N-Tile / qcut()&lt;/u&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Value&lt;/strong&gt;&lt;br&gt;
• &lt;u&gt;Lag / Lead&lt;/u&gt;&lt;br&gt;
• &lt;u&gt;First / Last / nth value&lt;/u&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Python Window Functions
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sz6kym1vzniv0fbcwxy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sz6kym1vzniv0fbcwxy.png" alt="Types of Python Window Functions"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While there is no official classification of Python window functions, these are the functions most commonly implemented.&lt;/p&gt;

&lt;h2&gt;
  
  
  Aggregate
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjz0wozmqll8sz6shexdk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjz0wozmqll8sz6shexdk.png" alt="Aggregate python window functions"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These are some common types of aggregate functions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average&lt;/li&gt;
&lt;li&gt;Max&lt;/li&gt;
&lt;li&gt;Min&lt;/li&gt;
&lt;li&gt;Sum&lt;/li&gt;
&lt;li&gt;Count&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these aggregate functions (except count, which will be explained later) can be used in three types of situations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Group by&lt;/li&gt;
&lt;li&gt;Rolling&lt;/li&gt;
&lt;li&gt;Expanding&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Group by: Facebook is trying to find the average revenue of Instagram for each year.&lt;/li&gt;
&lt;li&gt;Rolling: Facebook is trying to find the rolling 3-year average revenue of Instagram.&lt;/li&gt;
&lt;li&gt;Expanding: Facebook is trying to find the cumulative average revenue of Instagram with an initial size of 2 years.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Group by
&lt;/h3&gt;

&lt;p&gt;A group-by aggregate computes a statistical function over a column within each group.&lt;/p&gt;

&lt;p&gt;Let’s use a &lt;a href="https://platform.stratascratch.com/coding/9899-percentage-of-total-spend?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;question&lt;/a&gt; from Amazon to explain this topic. This question is asking us to calculate the percentage of the total expenditure a customer spent on each order. Output the customer’s first name, order details (product name), and percentage of the order cost to their total spend across all orders.&lt;/p&gt;

&lt;p&gt;Remember to follow these 3 steps when approaching questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ask clarifying questions&lt;/li&gt;
&lt;li&gt;State assumptions&lt;/li&gt;
&lt;li&gt;Attempt the question&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When approaching these questions, understand which columns need to be grouped and which columns need to be aggregated.&lt;br&gt;
For the Amazon example,&lt;br&gt;
Group by: customer first_name, order_id, order_details&lt;br&gt;
Aggregate: total_order_cost&lt;/p&gt;

&lt;p&gt;In this question, there are 2 tables which need to be joined to get the customer’s first name, item, and spending. After merging both tables and filtering to the required columns, we get the following dataset:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fli9e7nho1vd0ta3vul0o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fli9e7nho1vd0ta3vul0o.png" alt="Aggregate Group by Python window functions"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once necessary data is set in a single table, it is easier to manipulate.&lt;/p&gt;

&lt;p&gt;Here we can find the total spending per person by grouping on first_name and summing total_order_cost.&lt;/p&gt;

&lt;p&gt;This is the general format for grouping by and aggregating the required columns.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;dataframe&lt;/em&gt;&lt;/strong&gt;.groupby(level='&lt;strong&gt;&lt;em&gt;groupby_column&lt;/em&gt;&lt;/strong&gt;').agg({'&lt;strong&gt;&lt;em&gt;aggregate_column&lt;/em&gt;&lt;/strong&gt;': '&lt;strong&gt;&lt;em&gt;aggregate_function&lt;/em&gt;&lt;/strong&gt;'})&lt;/p&gt;

&lt;p&gt;In reference to the Amazon example, this is the executing code.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;total_spending = customer_orders.groupby("first_name").agg({'total_order_cost' : 'sum'})&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This code will output the following dataframe&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flzz5v109clnxnfjfaejk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flzz5v109clnxnfjfaejk.png" alt="Output for Aggregate Python window function Question"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After this, we want to add a column to the merged data frame to represent total spending by each person.&lt;/p&gt;

&lt;p&gt;Let’s join both dataframes on the person’s first_name&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pd.merge(merged_dataframe, total_spending, how="left", on="first_name")&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now we get the following dataset&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxepygqm1h6fxiw6hl2qx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxepygqm1h6fxiw6hl2qx.png" alt="Output 2 for Aggregate Python window function Question"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As seen, the total_order_cost_y column represents the total spending per person and the total_order_cost_x column represents the cost per order. From here, it is a simple division of the 2 columns to create the percentage-of-spending column, followed by filtering the output to get the required columns.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;result = df3[["first_name", "order_details", "percentage_total_cost"]]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpq2e8jn9q1zuv5g970z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpq2e8jn9q1zuv5g970z.png" alt="Output 3 for Aggregate Python window function Question"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, in certain situations it is necessary to sort the values within each group. This is where the sort_values() function comes in.&lt;/p&gt;

&lt;p&gt;Referencing the amazon question example:&lt;/p&gt;

&lt;p&gt;Suppose the interviewer asks you to order the percentage_total_cost in descending order for each person.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;result = result.sort_values(by=['first_name', 'percentage_total_cost'], ascending = (True, False))&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;import pandas as pd&lt;br&gt;
df = pd.merge(orders, customers, left_on="cust_id", right_on="id")&lt;br&gt;
df1 = df[["first_name", "id_x", 'order_details', 'total_order_cost']]&lt;br&gt;
df2 = df.groupby("first_name").agg({'total_order_cost' : 'sum'})&lt;br&gt;
df3 = pd.merge(df1, df2, how="left", on="first_name")&lt;br&gt;
df3["percentage_total_cost"] = df3["total_order_cost_x"] / df3["total_order_cost_y"]&lt;br&gt;
result = df3[["first_name", "order_details", "percentage_total_cost"]]&lt;br&gt;
result = result.sort_values(by=['first_name', 'percentage_total_cost'], ascending = (True, False))&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Practice&lt;/u&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/9711-facilities-with-lots-of-inspections?python=&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/9711-facilities-with-lots-of-inspections?python=&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/9899-percentage-of-total-spend?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/9899-percentage-of-total-spend?python=1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/2044-most-senior-junior-employee?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2044-most-senior-junior-employee?python=1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;u&gt;Reference&lt;/u&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;sort_values() function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;groupby() function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pbpython.com/groupby-agg.html?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;Aggregation in Group By Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.core.groupby.DataFrameGroupBy.agg.html?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;groupby().agg() function&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Rolling vs Expanding Function
&lt;/h3&gt;

&lt;p&gt;Before diving into how to execute a rolling or expanding function, let’s understand how each of these functions works. While the rolling and expanding functions work similarly, there is a significant difference in window size: the rolling function has a fixed window size, while the expanding function has a variable one.&lt;/p&gt;

&lt;p&gt;These images explain the difference between rolling and expanding functions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rolling Function&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpz3m9mn3d2j0tiw5ofan.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpz3m9mn3d2j0tiw5ofan.png" alt="Rolling Function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expanding Function&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1ueqhw13ne3atup8tvr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1ueqhw13ne3atup8tvr.png" alt="Expanding Function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Rolling and expanding functions both start with the same window size, but the expanding function incorporates all subsequent values beyond the initial window size.&lt;/p&gt;

&lt;p&gt;Example: AccuWeather, a weather forecasting company, is trying to find the rolling and expanding 10-day average temperature of San Francisco in January.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Rolling&lt;/em&gt;&lt;/strong&gt;: Starting with a window size of 10, we take the average temperature from January 1st to January 10th. Next we take January 2nd to January 11th and so on. This shows the window size in rolling functions remains the same.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Expanding&lt;/em&gt;&lt;/strong&gt;: Starting with a window size of 10, we take the average temperature from January 1st to January 10th. However, next we’ll take the average temperature from January 1st to January 11th. Then, January 1st to January 12th and so on. Therefore, the window size has “expanded”.&lt;/p&gt;
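&lt;p&gt;The difference is easy to see on a small series. The temperatures below are made up purely to illustrate the two window behaviors.&lt;/p&gt;

```python
import pandas as pd

temps = pd.Series([50, 52, 54, 56, 58, 60])

# Rolling: fixed window of 3; each value averages the last 3 readings.
rolling_avg = temps.rolling(3).mean()

# Expanding: the window grows; each value averages everything seen so
# far, once at least 3 observations are available.
expanding_avg = temps.expanding(3).mean()
```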

&lt;p&gt;While there are many aggregate functions that can be used in rolling/expanding functions, this article will discuss the frequently used functions (sum, average, max, min).&lt;/p&gt;

&lt;p&gt;This brings us to the reason why the count function is not used in rolling and expanding functions. Count is used when a certain variable is grouped and there is a need to count the occurrence of a value. In the rolling and expanding function, there is no grouping of rows, but a calculation on a specific column.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Rolling Aggregate&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Implementation of rolling functions is straightforward.&lt;/p&gt;

&lt;p&gt;A general format:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;DataSeries&lt;/em&gt;&lt;/strong&gt;.rolling(&lt;strong&gt;&lt;em&gt;window_size&lt;/em&gt;&lt;/strong&gt;).&lt;strong&gt;&lt;em&gt;aggregate_function&lt;/em&gt;&lt;/strong&gt;()&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
Temperature of San Francisco for the first 22 days of 2021. Let’s find the average, sum, maximum, and minimum over a 5-day rolling time period.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;weather['Average'] = weather['Temperature'].rolling(5).mean()&lt;br&gt;
weather['Sum'] = weather['Temperature'].rolling(5).sum()&lt;br&gt;
weather['Max'] = weather['Temperature'].rolling(5).max()&lt;br&gt;
weather['Min'] = weather['Temperature'].rolling(5).min()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dfm0pjovqpk4trlmg7n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dfm0pjovqpk4trlmg7n.png" alt="Output for Rolling Aggregate Python window function Question"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From row 4 onward, the rolling function with a fixed window size of 5 calculates the average, sum, max, and min over the Temperature values. This means that the Average column in row 16 holds the average of rows 12, 13, 14, 15, and 16.&lt;/p&gt;

&lt;p&gt;As expected, the first 4 values of the rolling function columns are null due to not having enough values to calculate. Sometimes you still want to calculate the aggregate of the first n rows even if it doesn’t fit the number of required rows.&lt;/p&gt;

&lt;p&gt;In that case, we have to set a minimum number of observations to start calculating. Within the rolling function, you can specify the min_periods.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;DataSeries&lt;/em&gt;&lt;/strong&gt;.rolling(&lt;strong&gt;&lt;em&gt;window_size&lt;/em&gt;&lt;/strong&gt;, min_periods=&lt;strong&gt;&lt;em&gt;minimum_observations&lt;/em&gt;&lt;/strong&gt;).&lt;strong&gt;&lt;em&gt;aggregate_function&lt;/em&gt;&lt;/strong&gt;()&lt;/p&gt;

&lt;p&gt;&lt;code&gt;weather['Average'] = weather['Temperature'].rolling(5, min_periods=1).mean()&lt;br&gt;
weather['Sum'] = weather['Temperature'].rolling(5, min_periods=2).sum()&lt;br&gt;
weather['Max'] = weather['Temperature'].rolling(5, min_periods=3).max()&lt;br&gt;
weather['Min'] = weather['Temperature'].rolling(5, min_periods=3).min()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyqdxhv19uf6hdg24aoy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyqdxhv19uf6hdg24aoy.png" alt="Output 2 for Rolling Aggregate Python window function Question"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Practice&lt;/u&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/10314-revenue-over-time?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/10314-revenue-over-time?python=1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;u&gt;Reference&lt;/u&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/dont-miss-out-on-rolling-window-functions-in-pandas-850b817131db?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://towardsdatascience.com/dont-miss-out-on-rolling-window-functions-in-pandas-850b817131db&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Expanding Aggregate&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Expanding function has a similar implementation to rolling functions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;DataSeries&lt;/em&gt;&lt;/strong&gt;.expanding(&lt;strong&gt;&lt;em&gt;minimum_observations&lt;/em&gt;&lt;/strong&gt;).&lt;strong&gt;&lt;em&gt;aggregate_function&lt;/em&gt;&lt;/strong&gt;()&lt;/p&gt;

&lt;p&gt;It is important to remember that, unlike the rolling function, the expanding function does not set a window size, due to its variability. Instead, minimum_observations is specified, and rows before that many observations have accumulated are set to null.&lt;/p&gt;

&lt;p&gt;Let’s use the same San Francisco temperature example to explain the expanding function.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;weather['Average'] = weather['Temperature'].expanding(5).mean()&lt;br&gt;
weather['Sum'] = weather['Temperature'].expanding(5).sum()&lt;br&gt;
weather['Max'] = weather['Temperature'].expanding(5).max()&lt;br&gt;
weather['Min'] = weather['Temperature'].expanding(5).min()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq1ae2vysq5ienz6kll4t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq1ae2vysq5ienz6kll4t.png" alt="Output for Expanding Aggregate Python window function Question"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As can be seen in the minimum temperature column, the expanding function takes the minimum over the entire dataset seen so far, since the window keeps expanding beyond the initial minimum observations.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;import pandas as pd&lt;br&gt;
weather = pd.DataFrame({'Temperature': [57, 54, 60, 54, 57, 58, 52, 52, 59, 54, 53, 57, 56, 60, 55, 58, 59, 64, 65, 66, 67, 74]})&lt;br&gt;
weather['Rolling_Average'] = weather['Temperature'].rolling(5).mean()&lt;br&gt;
weather['Rolling_Sum'] = weather['Temperature'].rolling(5).sum()&lt;br&gt;
weather['Rolling_Max'] = weather['Temperature'].rolling(5).max()&lt;br&gt;
weather['Rolling_Min'] = weather['Temperature'].rolling(5).min()&lt;br&gt;
weather['Rolling_Average_minperiod'] = weather['Temperature'].rolling(5, min_periods=1).mean()&lt;br&gt;
weather['Rolling_Sum_minperiod'] = weather['Temperature'].rolling(5, min_periods=2).sum()&lt;br&gt;
weather['Rolling_Max_minperiod'] = weather['Temperature'].rolling(5, min_periods=3).max()&lt;br&gt;
weather['Rolling_Min_minperiod'] = weather['Temperature'].rolling(5, min_periods=3).min()&lt;br&gt;
weather['Expanding_Average'] = weather['Temperature'].expanding(5).mean()&lt;br&gt;
weather['Expanding_Sum'] = weather['Temperature'].expanding(5).sum()&lt;br&gt;
weather['Expanding_Max'] = weather['Temperature'].expanding(5).max()&lt;br&gt;
weather['Expanding_Min'] = weather['Temperature'].expanding(5).min()&lt;br&gt;
weather&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Reference&lt;/u&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.expanding.html?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.expanding.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/window-functions-in-pandas-eaece0421f7?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://towardsdatascience.com/window-functions-in-pandas-eaece0421f7&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://campus.datacamp.com/courses/manipulating-time-series-data-in-python/window-functions-rolling-expanding-metrics?ex=5&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://campus.datacamp.com/courses/manipulating-time-series-data-in-python/window-functions-rolling-expanding-metrics?ex=5&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Ranking
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcghllf55rktbyjrglb9w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcghllf55rktbyjrglb9w.png" alt="Ranking python window functions"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Row Number
&lt;/h3&gt;

&lt;p&gt;Row numbers can be generated in 2 different situations, each with a different function&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Across the entire dataframe - reset_index()&lt;/li&gt;
&lt;li&gt;Within groups - cumcount()&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These are the Pandas equivalents of row_number() in SQL&lt;/p&gt;

&lt;p&gt;Let’s use the following sample dataset to explain both concepts&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fycwywjosyxukznddxg9q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fycwywjosyxukznddxg9q.png" alt="Ranking Row Number"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Reset_index
&lt;/h4&gt;

&lt;p&gt;Within a dataframe, reset_index() moves the index into a column, which outputs the row number of each row.&lt;/p&gt;

&lt;p&gt;General format to follow:&lt;br&gt;
&lt;strong&gt;&lt;em&gt;dataframe&lt;/em&gt;&lt;/strong&gt;.reset_index()&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhgqlhrav7xyj0nuvpe48.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhgqlhrav7xyj0nuvpe48.png" alt="Reset_index"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To extract the nth row, use the .iloc[] indexer&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Dataframe&lt;/em&gt;&lt;/strong&gt;.iloc[&lt;strong&gt;&lt;em&gt;nth_row&lt;/em&gt;&lt;/strong&gt;]&lt;/p&gt;
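&lt;p&gt;A minimal sketch of both steps, using a small made-up dataframe rather than the article’s dataset:&lt;/p&gt;

```python
import pandas as pd

# Small hypothetical dataframe (not the article's dataset)
df = pd.DataFrame({'v1': [3, 5, 7, 1]})

# reset_index() turns the positional index into a column named 'index'
numbered = df.reset_index()
print(numbered['index'].tolist())  # row numbers 0..3

# .iloc[] extracts the nth row by position (0-based), e.g. the 3rd row
third_row = df.iloc[2]
print(third_row['v1'])
```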

&lt;h4&gt;
  
  
  cumcount()
&lt;/h4&gt;

&lt;p&gt;To calculate the row number within groups of a dataframe, you have to implement the cumcount() function in the following format&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;dataframe&lt;/em&gt;&lt;/strong&gt;.groupby(['&lt;strong&gt;&lt;em&gt;column_names&lt;/em&gt;&lt;/strong&gt;']).cumcount()&lt;/p&gt;

&lt;p&gt;Note that cumcount() starts counting from 0 by default; to start the row count from 1, add +1 to the result&lt;/p&gt;

&lt;p&gt;For the sample dataset, the implementation would be&lt;/p&gt;

&lt;p&gt;&lt;code&gt;df['Row_count'] = df.groupby(['c1', 'c2']).cumcount()+1&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This would be the output&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1yuv9xtt8gbg43bp0tqd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1yuv9xtt8gbg43bp0tqd.png" alt="cumcount"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that you have the row_count within each group, sometimes you have to extract a specific index row of each group.&lt;/p&gt;

&lt;p&gt;For example, suppose you are asked to extract the 2nd row within each group. We can extract this by returning each row with a Row_count value of 2.&lt;/p&gt;

&lt;p&gt;Using loc, we can extract the subset with the following general format&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Dataframe&lt;/em&gt;&lt;/strong&gt;.loc[&lt;strong&gt;&lt;em&gt;dataframe&lt;/em&gt;&lt;/strong&gt;[&lt;strong&gt;&lt;em&gt;column_name&lt;/em&gt;&lt;/strong&gt;] == &lt;strong&gt;&lt;em&gt;index&lt;/em&gt;&lt;/strong&gt;]&lt;/p&gt;

&lt;p&gt;For the column dataset above, we would use&lt;/p&gt;

&lt;p&gt;&lt;code&gt;df.loc[df['Row_count'] == 2]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;to get the subset&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxz27i3bm8dm15jss1bmn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxz27i3bm8dm15jss1bmn.png" alt="Python Window Functions subset"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;import pandas as pd&lt;br&gt;
df = pd.DataFrame({'c1': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'], 'c2':['X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X'], 'v1':[3, 5, 7, 1, 3, 1, 3, 1, 7, 4, 1, 6]})&lt;br&gt;
df['Row_count'] = df.groupby(['c1', 'c2']).cumcount()+1&lt;br&gt;
df.loc[df['Row_count'] == 2]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Questions&lt;/u&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/2004-number-of-comments-per-user-in-past-30-days?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2004-number-of-comments-per-user-in-past-30-days?python=1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/9716-top-3-facilities?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/9716-top-3-facilities?python=1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/10351-activity-rank?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/10351-activity-rank?python=1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;u&gt;Reference&lt;/u&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.cumcount.html?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.cumcount.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Rank
&lt;/h4&gt;

&lt;p&gt;Ranking functions, as the name suggests, rank values based on a certain variable. The Pandas rank() function works slightly differently than its SQL equivalent.&lt;/p&gt;

&lt;p&gt;The rank() function can be executed with the following general format&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;dataframe&lt;/em&gt;&lt;/strong&gt;[&lt;strong&gt;&lt;em&gt;column_name&lt;/em&gt;&lt;/strong&gt;].rank()&lt;/p&gt;

&lt;p&gt;Let’s assume the following dataset from the Pandas ranking documentation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2enuuce9uckghwu56my.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2enuuce9uckghwu56my.png" alt="Rank python window functions"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And create 4 new columns, each using the rank() function, to better explain the function and its most popular parameters&lt;/p&gt;

&lt;p&gt;&lt;code&gt;animal_legs['default_rank'] = animal_legs['Number_legs'].rank()&lt;br&gt;
animal_legs['min_rank'] = animal_legs['Number_legs'].rank(method='min')&lt;br&gt;
animal_legs['NA_bottom'] = animal_legs['Number_legs'].rank(method='min', na_option='bottom')&lt;br&gt;
animal_legs['descending'] = animal_legs['Number_legs'].rank(method='min', ascending = False)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filf2yfzjq57stnexef69.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filf2yfzjq57stnexef69.png" alt="Python Window Functions"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this example we’re ranking Number_legs for each animal.&lt;/p&gt;

&lt;p&gt;Let’s understand what each of the columns represents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;‘default_rank’&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In a default rank() function, there are 3 important things to note.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ascending order is assumed true&lt;/li&gt;
&lt;li&gt;Null values are not ranked and left as null&lt;/li&gt;
&lt;li&gt;If n values are equal, the rank split is averaged between the values.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rank splitting for tied values is a bit confusing, so let’s dive deeper to explain it better.&lt;/p&gt;

&lt;p&gt;In SQL, for the dataset above, since cat and dog both have 4 legs, both would be assigned rank = 2, and spider, with the next highest number of legs, would have a rank of 4.&lt;/p&gt;

&lt;p&gt;Instead, Pandas averages out the ‘would have been’ ranks between cat and dog.&lt;br&gt;
They would have occupied ranks 2 and 3, but since cat and dog have the same value, each gets the average of 2 and 3, which is 2.5&lt;/p&gt;

&lt;p&gt;Let’s alter the animals example to include ‘donkey’, which has 4 legs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmf8th6przk7xthjsrxt5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmf8th6przk7xthjsrxt5.png" alt="default_rank python window function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Penguin has the least number of legs with 2, so it has a rank = 1.&lt;/p&gt;

&lt;p&gt;Since cat, dog, and donkey all have the next highest count of 4 legs, each will take the average of 2, 3, and 4, which is 3, since 3 animals share the same value.&lt;/p&gt;

&lt;p&gt;If we had 4 animals all with 4 legs, each would take the average of 2, 3, 4, and 5, which is 3.5.&lt;/p&gt;
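&lt;p&gt;A quick way to check this averaging rule is a minimal sketch with made-up leg counts:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical leg counts: one animal with 2 legs, four tied at 4 legs.
# The tied group would occupy ranks 2, 3, 4 and 5, so each gets (2+3+4+5)/4 = 3.5
legs = pd.Series([2, 4, 4, 4, 4])
ranks = legs.rank()
print(ranks.tolist())  # [1.0, 3.5, 3.5, 3.5, 3.5]
```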

&lt;p&gt;&lt;strong&gt;&lt;u&gt;‘min_rank’&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When setting the parameter method='min', instead of taking the average rank, the function takes the minimum rank among equal values.&lt;/p&gt;

&lt;p&gt;The minimum rank is the same as how the rank function in SQL works.&lt;/p&gt;

&lt;p&gt;Using the animals example, the rank for dog and cat will now be 2 instead of 2.5.&lt;br&gt;
And in the example with donkey, they will still have a rank of 2, while spider will be set to a rank of 5.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgpbnmq4n5rkh2jxkpl95.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgpbnmq4n5rkh2jxkpl95.png" alt="min_rank python window function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;‘NA_bottom’&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Certain rows contain null values and under default conditions, the rank will also be set as null. In certain cases you would want the null values to rank the lowest or highest.&lt;/p&gt;

&lt;p&gt;Setting na_option to ‘bottom’ assigns null values the highest rank number, while setting it to ‘top’ assigns them the lowest.&lt;/p&gt;

&lt;p&gt;In the animals example, we set null values as bottom and rank method as minimum&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgs9zli98r0vkrxlf0f8v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgs9zli98r0vkrxlf0f8v.png" alt="NA_bottom python window function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;‘descending’&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you want the rank in descending order, set the parameter ascending to False.&lt;/p&gt;

&lt;p&gt;Referring to the animals example, we set ascending to False and method to ‘min’.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9nhi2uommock3n66nx5q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9nhi2uommock3n66nx5q.png" alt="descending python window function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Questions&lt;/u&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/10169-highest-total-miles?python=&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/10169-highest-total-miles?python=&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/10324-distances-traveled?python=&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/10324-distances-traveled?python=&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/2070-top-three-classes?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2070-top-three-classes?python=1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;u&gt;Reference&lt;/u&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html" rel="noopener noreferrer"&gt;https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Dense Rank
&lt;/h3&gt;

&lt;p&gt;Dense rank is similar to a normal rank with a slight difference.&lt;/p&gt;

&lt;p&gt;With a normal rank, rank numbers may be skipped after ties, while a dense rank never skips.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzam9j8jxvp4or6ionl5b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzam9j8jxvp4or6ionl5b.png" alt="Dense Rank Python Window Function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, in the animals dataframe, after [dog, cat, donkey], spider was the next value. In minimum rank, spider is set to rank = 5, since 2, 3, and 4 are taken by cat, dog, and donkey.&lt;/p&gt;

&lt;p&gt;In a dense rank, it will set the immediate consecutive ranks as seen above. Instead of 5th rank, spider was set to 3rd rank in dense_rank.&lt;/p&gt;

&lt;p&gt;Fortunately, you just have to edit the method parameter in a rank function to get the dense rank&lt;/p&gt;

&lt;p&gt;&lt;code&gt;animal_legs['dense_rank'] = animal_legs['Number_legs'].rank(method='dense')&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;All the other parameters, such as na_option and ascending, can also be set alongside the dense method as mentioned before.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Questions&lt;/u&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/9701-3rd-most-reported-health-issues?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/9701-3rd-most-reported-health-issues?python=1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/2026-bottom-2-companies-by-mobile-usage?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2026-bottom-2-companies-by-mobile-usage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/2019-top-2-users-with-most-calls?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2019-top-2-users-with-most-calls?python=1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;u&gt;Reference&lt;/u&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html" rel="noopener noreferrer"&gt;https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dfrieds.com/data-analysis/rank-method-python-pandas.html" rel="noopener noreferrer"&gt;https://dfrieds.com/data-analysis/rank-method-python-pandas.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Percent rank (Percentile)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F18n81srab96dgfuzo5db.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F18n81srab96dgfuzo5db.png" alt="Percent rank python window function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Percent rank is a representation of each rank relative to the highest rank.&lt;/p&gt;

&lt;p&gt;As seen in the animals dataframe above, spider has a rank of 5 for both default_rank and min_rank. Since 5 is the highest rank, the other values would be compared to this.&lt;br&gt;
For cat in default_rank, it has a value of 3, and 3 / 5 = 0.6 for default_pct_rank&lt;br&gt;
For cat in min_rank, it has a value of 2, and 2 / 5 = 0.4 for min_pct_rank&lt;/p&gt;

&lt;p&gt;Percent rank is enabled with the boolean pct parameter&lt;/p&gt;

&lt;p&gt;&lt;code&gt;animal_legs['min_pct_rank'] = animal_legs['Number_legs'].rank(method='min', pct=True)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Questions&lt;/u&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/10303-top-percentile-fraud?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/10303-top-percentile-fraud?python=1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/9611-find-the-80th-percentile-of-hours-studied?python=1" rel="noopener noreferrer"&gt;https://platform.stratascratch.com/coding/9611-find-the-80th-percentile-of-hours-studied?python=1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;u&gt;Reference&lt;/u&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html" rel="noopener noreferrer"&gt;https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dfrieds.com/data-analysis/rank-method-python-pandas.html" rel="noopener noreferrer"&gt;https://dfrieds.com/data-analysis/rank-method-python-pandas.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;import pandas as pd&lt;br&gt;
import numpy as np&lt;br&gt;
animal_legs = pd.DataFrame(data={'Animal': ['cat', 'penguin', 'dog', 'spider', 'snake', 'donkey'], 'Number_legs': [4, 2, 4, 8, np.nan, 4]})&lt;br&gt;
animal_legs['default_rank'] = animal_legs['Number_legs'].rank()&lt;br&gt;
animal_legs['min_rank'] = animal_legs['Number_legs'].rank(method='min')&lt;br&gt;
animal_legs['NA_bottom'] = animal_legs['Number_legs'].rank(method='min', na_option='bottom')&lt;br&gt;
animal_legs['descending'] = animal_legs['Number_legs'].rank(method='min', ascending = False)&lt;br&gt;
animal_legs['dense_rank'] = animal_legs['Number_legs'].rank(method='dense')&lt;br&gt;
animal_legs['default_pct_rank'] = animal_legs['Number_legs'].rank(pct=True)&lt;br&gt;
animal_legs['min_pct_rank'] = animal_legs['Number_legs'].rank(method='min', pct=True)&lt;br&gt;
animal_legs&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  N-Tile / qcut()
&lt;/h3&gt;

&lt;p&gt;qcut() is not as popular a function, since ranking based on quantiles other than percentiles is less common. Even so, it is still an extremely powerful function!&lt;br&gt;
If you don’t know the relationship between quantiles and percentiles, check out this article by Statology!&lt;/p&gt;

&lt;p&gt;Let’s take a &lt;a href="https://platform.stratascratch.com/coding/2036-lowest-revenue-generated-restaurants?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;question&lt;/a&gt; from DoorDash, to explain how qcut is used.&lt;/p&gt;

&lt;p&gt;The question asks us to find the bottom 2% of the dataset, which is the first quantile from a 50-quantile split.&lt;/p&gt;

&lt;p&gt;A general format to follow when using qcut():&lt;br&gt;
pd.qcut(&lt;strong&gt;&lt;em&gt;dataseries&lt;/em&gt;&lt;/strong&gt;, q=&lt;strong&gt;&lt;em&gt;number_quantiles&lt;/em&gt;&lt;/strong&gt;, labels = range(&lt;strong&gt;&lt;em&gt;lower_bound&lt;/em&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;em&gt;upper_bound&lt;/em&gt;&lt;/strong&gt;))&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fac22hfip35p1o6rcapyc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fac22hfip35p1o6rcapyc.png" alt="N-Tile python window function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a subset of the dataset which we will use to analyze the usage of the qcut() function.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;dataseries → The column to analyze, which is total_order in this example&lt;/li&gt;
&lt;li&gt;number_quantiles → Number of quantiles to split by, which is 50 due to 50-quantile split&lt;/li&gt;
&lt;li&gt;labels → Range of ntiles, which is 1-50 in this case. However, Python’s range() excludes its upper bound, so range(1, 50) would make the highest ntile 49 instead of 50. Due to this, we set the upper bound to n+1, which in this example gives range(1, 51)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this example, this would be the following code.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;result['ntile'] = pd.qcut(result['total_order'], q=50, labels=range(1, 51))&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5l5is66bkn4knfsiebk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5l5is66bkn4knfsiebk.png" alt="N-Tile python window function example"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As seen in the example, ‘ntile’ has been split and represents the quantile.&lt;/p&gt;

&lt;p&gt;It must also be noted that if the labels range is not specified, the quantile interval for each row is returned instead.&lt;/p&gt;

&lt;p&gt;For example executing the same code above without the labels range:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;result['ntile_range'] = pd.qcut(result['total_order'], q=50)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbe8v9yq3gnv4wamoxeid.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbe8v9yq3gnv4wamoxeid.png" alt="N-Tile python window function example 2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;import pandas as pd&lt;br&gt;
result = doordash_delivery[doordash_delivery['customer_placed_order_datetime'].between('2020-05-01', '2020-05-31')].groupby("restaurant_id")["order_total"].sum().to_frame('total_order').reset_index()  &lt;br&gt;
result['ntile'] = pd.qcut(result['total_order'],q=50, labels=range(1, 50), duplicates = 'drop').values.tolist()&lt;br&gt;
result['ntile_range'] = pd.qcut(result['total_order'],q=50, duplicates = 'drop').values.tolist()&lt;br&gt;
result&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Questions&lt;/u&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;qcut() → &lt;a href="https://platform.stratascratch.com/coding/2036-lowest-revenue-generated-restaurants?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2036-lowest-revenue-generated-restaurants?python=1&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;u&gt;Reference&lt;/u&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.qcut.html" rel="noopener noreferrer"&gt;https://pandas.pydata.org/docs/reference/api/pandas.qcut.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/all-pandas-qcut-you-should-know-for-binning-numerical-data-based-on-sample-quantiles-c8b13a8ed844" rel="noopener noreferrer"&gt;https://towardsdatascience.com/all-pandas-qcut-you-should-know-for-binning-numerical-data-based-on-sample-quantiles-c8b13a8ed844&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Value
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Lag / Lead
&lt;/h3&gt;

&lt;p&gt;Lag and lead functions reproduce another column’s values, shifted down or up by one or more rows.&lt;/p&gt;

&lt;p&gt;Let’s use a &lt;a href="https://platform.stratascratch.com/coding/9782-customer-revenue-in-march?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;dataset&lt;/a&gt; given by Facebook (Meta) which represents the total cost of orders by each month.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpnpjqw93z4g3oae316x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpnpjqw93z4g3oae316x.png" alt="Value Lag Lead python window function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the ‘Lag’ column, we can see that values were shifted down by one: 305, the total_order_cost for January, appears in the ‘Lag’ column on the same row as February.&lt;/p&gt;

&lt;p&gt;In the ‘Lead’ column, the opposite occurs. Rows are shifted up by one, so 285, the total_order_cost for February, appears in the ‘Lead’ column on January’s row.&lt;/p&gt;

&lt;p&gt;This makes it easier to compare values side by side, for example to calculate the growth of sales by month.&lt;/p&gt;

&lt;p&gt;A general format to follow:&lt;br&gt;
&lt;strong&gt;&lt;em&gt;dataframe&lt;/em&gt;&lt;/strong&gt;['&lt;strong&gt;&lt;em&gt;shifting_column&lt;/em&gt;&lt;/strong&gt;'].shift(&lt;strong&gt;&lt;em&gt;number_shift&lt;/em&gt;&lt;/strong&gt;)&lt;/p&gt;

&lt;p&gt;Code used for the data:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;orders['Lag'] = orders['total_order_cost'].shift(1)&lt;br&gt;
orders['Lead'] = orders['total_order_cost'].shift(-1)&lt;/code&gt;&lt;/p&gt;
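&lt;p&gt;As a sketch of the month-over-month growth calculation mentioned above, using hypothetical monthly totals rather than the Facebook dataset:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical monthly totals (not the Facebook orders dataset)
sales = pd.DataFrame({'month': ['January', 'February', 'March'],
                      'total': [100.0, 120.0, 90.0]})

# Lag the totals by one row, then compute month-over-month growth in percent
sales['prev_total'] = sales['total'].shift(1)
sales['growth_pct'] = (sales['total'] - sales['prev_total']) / sales['prev_total'] * 100
print(sales)  # growth_pct: NaN, 20.0, -25.0
```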

&lt;p&gt;Another key point to remember is the null values introduced by the shift. There is 1 null value (NaN) in each of the Lag and Lead columns, since the values were shifted by 1. In general, shifting by n rows produces n null values: the first n rows of the ‘Lag’ column and the last n rows of the ‘Lead’ column will be null.&lt;br&gt;
If you want to replace these null values, use the fill_value parameter.&lt;/p&gt;

&lt;p&gt;We execute the code with updated parameters&lt;/p&gt;

&lt;p&gt;&lt;code&gt;orders['Lag'] = orders['total_order_cost'].shift(1, fill_value = 0)&lt;br&gt;
orders['Lead'] = orders['total_order_cost'].shift(-1, fill_value = 0)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;To get this as the output&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffvyi3zlwvxfinbmyq2au.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffvyi3zlwvxfinbmyq2au.png" alt="Lag Lead window function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;import pandas as pd&lt;br&gt;
import numpy as np&lt;br&gt;
orders['order_date'] = orders['order_date'].apply(pd.to_datetime)&lt;br&gt;
orders['order_month'] = orders['order_date'].dt.month&lt;br&gt;
orders.loc[(orders.order_month == 1),'order_month'] = 'January'&lt;br&gt;
orders.loc[(orders.order_month == 2),'order_month'] = 'February'&lt;br&gt;
orders.loc[(orders.order_month == 3),'order_month'] = 'March'&lt;br&gt;
orders.loc[(orders.order_month == 4),'order_month'] = 'April'&lt;br&gt;
orders['order_month'] = pd.Categorical(orders['order_month'], ["January", "February", "March", "April"])&lt;br&gt;
orders = orders[['order_month', 'total_order_cost']]&lt;br&gt;
orders = orders.sort_values(by=['order_month'])&lt;br&gt;
orders = orders.groupby("order_month").agg({'total_order_cost' : 'sum'})&lt;br&gt;
orders['Lag'] = orders['total_order_cost'].shift(1, fill_value = 0)&lt;br&gt;
orders['Lead'] = orders['total_order_cost'].shift(-1, fill_value = 0)&lt;br&gt;
orders&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Questions&lt;/u&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/9637-growth-of-airbnb?python=&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/9637-growth-of-airbnb?python=&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/9714-dates-of-inspection?python=&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/9714-dates-of-inspection?python=&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/2045-days-without-hiringtermination?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2045-days-without-hiringtermination?python=1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;u&gt;Reference&lt;/u&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html" rel="noopener noreferrer"&gt;https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  First/Last/nth value
&lt;/h3&gt;

&lt;p&gt;Finding the nth value (including first and last) within groups of a dataset is fairly simple with Python as well.&lt;/p&gt;

&lt;p&gt;Let’s use the same &lt;a href="https://platform.stratascratch.com/coding/9782-customer-revenue-in-march?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;orders dataset by Facebook&lt;/a&gt; used in the Lag/Lead section.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7pcyzfztrfa7w9xd0cd3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7pcyzfztrfa7w9xd0cd3.png" alt="First Last and nth value python window functions"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As seen here, the order_date has been ordered from earliest to latest.&lt;/p&gt;

&lt;p&gt;Let’s find the first order of each month using the nth() function.&lt;/p&gt;

&lt;p&gt;General format:&lt;br&gt;
&lt;strong&gt;&lt;em&gt;dataframe&lt;/em&gt;&lt;/strong&gt;.groupby(‘&lt;strong&gt;&lt;em&gt;groupby_column&lt;/em&gt;&lt;/strong&gt;’).nth(&lt;strong&gt;&lt;em&gt;nth_value&lt;/em&gt;&lt;/strong&gt;)&lt;/p&gt;

&lt;p&gt;nth_value is the positional index of the row to extract from each group.&lt;br&gt;
It works the same way as indexing into a Python list:&lt;br&gt;
&lt;strong&gt;0 represents the first value&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;-1 represents the last value&lt;/strong&gt;&lt;/p&gt;
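This indexing convention can be checked on a tiny hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b', 'b'],
                   'v': [10, 20, 30, 40]})

# nth(0) takes the first row of each group, nth(-1) the last,
# exactly like list indexing
first_rows = df.groupby('g')['v'].nth(0)
last_rows = df.groupby('g')['v'].nth(-1)

print(first_rows.tolist())  # [10, 30]
print(last_rows.tolist())   # [20, 40]
```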

&lt;p&gt;Using the following code:&lt;br&gt;
&lt;code&gt;orders.groupby('order_month').nth(0)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynyhagspwp8pwooxrnpn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynyhagspwp8pwooxrnpn.png" alt="First Last and nth value python window functions output"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To return only a specific column, such as total_order_cost, you can specify this as well.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;orders.groupby('order_month').nth(0)['total_order_cost'].reset_index()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo5o2l5smzpk76v78kz00.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo5o2l5smzpk76v78kz00.png" alt="First Last and nth value python window functions output 2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to join the nth value of each group back onto the original dataframe, you can use the merge function, just as it was applied in the aggregate functions section above. Remember to keep the column you are merging on: in this example that is the ‘order_month’ group key, which sits in the index after the groupby, so use reset_index() to turn it back into a regular column.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;import pandas as pd&lt;br&gt;
orders['order_date'] = orders['order_date'].apply(pd.to_datetime)&lt;br&gt;
orders['order_month'] = orders['order_date'].dt.month&lt;br&gt;
orders = orders.sort_values(by=['order_date'])&lt;br&gt;
ordered_group = orders.groupby('order_month').nth(0)['total_order_cost'].reset_index()&lt;/code&gt;&lt;/p&gt;
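A sketch of the merge step described above, on toy data rather than the Facebook orders table (first() is used here as a stand-in for nth(0), and the column names are illustrative):

```python
import pandas as pd

orders = pd.DataFrame({
    'order_month': [1, 1, 2, 2],
    'total_order_cost': [100, 50, 75, 25],
})

# first total_order_cost per month; reset_index() turns the
# 'order_month' group key back into a regular column to merge on
first_costs = (orders.groupby('order_month')['total_order_cost']
                     .first()
                     .reset_index()
                     .rename(columns={'total_order_cost': 'first_order_cost'}))

# join the per-group value back onto every row of the original frame
merged = orders.merge(first_costs, on='order_month')
print(merged['first_order_cost'].tolist())  # [100, 100, 75, 75]
```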

&lt;p&gt;&lt;u&gt;Reference&lt;/u&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.nth.html" rel="noopener noreferrer"&gt;https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.nth.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Practice Questions Compiled
&lt;/h3&gt;

&lt;h5&gt;
  
  
  Aggregate
&lt;/h5&gt;

&lt;h6&gt;
  
  
  Group by
&lt;/h6&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/9711-facilities-with-lots-of-inspections?python=&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/9711-facilities-with-lots-of-inspections?python=&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/9899-percentage-of-total-spend?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/9899-percentage-of-total-spend?python=1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/2044-most-senior-junior-employee?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2044-most-senior-junior-employee?python=1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h6&gt;
  
  
  Rolling
&lt;/h6&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/10314-revenue-over-time?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/10314-revenue-over-time?python=1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h6&gt;
  
  
  Expanding
&lt;/h6&gt;

&lt;ul&gt;
&lt;li&gt;No questions for this section&lt;/li&gt;
&lt;/ul&gt;

&lt;h5&gt;
  
  
  Ranking
&lt;/h5&gt;

&lt;h6&gt;
  
  
  Row_number()
&lt;/h6&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/2004-number-of-comments-per-user-in-past-30-days?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2004-number-of-comments-per-user-in-past-30-days?python=1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/9716-top-3-facilities?python=&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/9716-top-3-facilities?python=&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/10351-activity-rank?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/10351-activity-rank?python=1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h6&gt;
  
  
  rank()
&lt;/h6&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/10169-highest-total-miles?python=&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/10169-highest-total-miles?python=&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/10324-distances-traveled?python=&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/10324-distances-traveled?python=&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/2070-top-three-classes?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2070-top-three-classes?python=1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h6&gt;
  
  
  dense_rank()
&lt;/h6&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/9701-3rd-most-reported-health-issues?python=&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/9701-3rd-most-reported-health-issues?python=&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/2026-bottom-2-companies-by-mobile-usage?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2026-bottom-2-companies-by-mobile-usage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/2019-top-2-users-with-most-calls?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2019-top-2-users-with-most-calls?python=1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h6&gt;
  
  
  percent_rank()
&lt;/h6&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/10303-top-percentile-fraud?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/10303-top-percentile-fraud?python=1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/9611-find-the-80th-percentile-of-hours-studied?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/9611-find-the-80th-percentile-of-hours-studied?python=1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h6&gt;
  
  
  ntile() / qcut()
&lt;/h6&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/2036-lowest-revenue-generated-restaurants?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2036-lowest-revenue-generated-restaurants?python=1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h5&gt;
  
  
  Value
&lt;/h5&gt;

&lt;h6&gt;
  
  
  Lag/Lead
&lt;/h6&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/9637-growth-of-airbnb?python=&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/9637-growth-of-airbnb?python=&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/9714-dates-of-inspection?python=&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/9714-dates-of-inspection?python=&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/2045-days-without-hiringtermination?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2045-days-without-hiringtermination?python=1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h6&gt;
  
  
  First / Last / nth_value()
&lt;/h6&gt;

&lt;ul&gt;
&lt;li&gt;No questions for this section&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>Solving LeetCode Single Number Problem for Data Science Interviews</title>
      <dc:creator>StrataScratch</dc:creator>
      <pubDate>Tue, 01 Feb 2022 07:49:46 +0000</pubDate>
      <link>https://dev.to/nate_at_stratascratch/solving-leetcode-single-number-problem-for-data-science-interviews-kgd</link>
      <guid>https://dev.to/nate_at_stratascratch/solving-leetcode-single-number-problem-for-data-science-interviews-kgd</guid>
      <description>&lt;p&gt;&lt;em&gt;How does a data scientist solve the LeetCode Single Number problem in Python to prepare for their data science interviews?&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Using LeetCode To Solve Python Questions
&lt;/h2&gt;

&lt;p&gt;LeetCode has a massive database of real interview questions asked by companies like Amazon, Google, Microsoft, Facebook, and other giants. Built specifically for software developers, it is widely regarded as an excellent resource, with more than 1,599 algorithm-based questions across a variety of languages. Since data scientists most commonly use Python and SQL, they use LeetCode to improve their skills and &lt;a href="https://www.stratascratch.com/blog/5-tips-to-prepare-for-a-data-science-interview/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;prepare for data science interviews&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We are going to look at a specific LeetCode Single Number problem today which we’ll solve in three different ways via Python. It is one of many Python problems useful for preparing for data science interviews.&lt;br&gt;
On StrataScratch.com, we also have several other articles discussing "&lt;a href="https://www.stratascratch.com/blog/how-to-use-leetcode-for-data-science-sql-interviews/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;How To Use LeetCode For Data Science SQL Interviews&lt;/a&gt;" and "&lt;a href="https://www.stratascratch.com/blog/leetcode-python-solutions-for-data-science/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;LeetCode Python Solutions for Data Science&lt;/a&gt;".&lt;/p&gt;
&lt;h2&gt;
  
  
  Solving the LeetCode Single Number Problem
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;136. Single Number&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Interviewers have asked the single number question in a variety of ways in data science interviews. The key challenge is finding the single element which appears only once in an array of integers. While one of the demands is implementing a solution with linear runtime complexity and constant extra space, we’ll see there are several ways to meet these criteria.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SNr2Bdt2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8gq6de0jzcte19f264nx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SNr2Bdt2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8gq6de0jzcte19f264nx.png" alt="LeetCode Single Number Problem" width="712" height="143"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Link to the question: &lt;a href="https://LeetCode.com/problems/single-number"&gt;https://leetcode.com/problems/single-number/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/LiX8xIsmNYc"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;In this LeetCode Single Number problem, we’re being asked to find the only integer which appears once in an array of data where every other integer appears twice. We know immediately any solutions for this LeetCode Single Number Problem will require thinking in advance about the complexity and space of the computation.&lt;/p&gt;

&lt;p&gt;This LeetCode single number problem may seem daunting given the constraints, but, as we will see, there are several solutions with varying levels of complexity and space requirements. Today, we’re going to look at three solutions which approach the problem differently: through using a counter, through mathematics, and through bitwise manipulation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Framework to Solve this LeetCode Single Number Problem
&lt;/h3&gt;

&lt;p&gt;The easiest way to solve data science interview questions whether in Python, SQL, or some other language is to use a generally applicable framework. Here’s a framework we use for all data science problems on StrataScratch which we’ll adapt to this problem. You’ll see it provides logical steps for how to arrive at the correct answer:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Understand Your Data&lt;br&gt;
• This LeetCode single number problem gives sample data to look at. See if you notice any patterns in the arrays or anything which might require you to adjust your algorithms. This will help you identify the bounds to which you should limit your solution as well as uncover edge cases.&lt;br&gt;
• Typically LeetCode provides more than one snippet of sample data. If the first example isn’t sufficient, spend some time with the other data they provide. In an interview, you can attempt to explain your current understanding of the data to the interviewer and ask them for feedback.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Formulate your approach:&lt;br&gt;
• Now write down all the steps you’ll need for coding your solution. Consider how Python computes and what functions you’ll need to leverage. Also keep in mind how complex they’ll make your solution.&lt;br&gt;
• Don’t forget the interviewer will be observing you. Don’t hesitate to ask them for help. They’ll often specify any additional limitations which apply to your Python code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Code Execution&lt;br&gt;
• Try to keep your Python code simple in this LeetCode single number problem since complexity and space are concerns.&lt;br&gt;
• Follow the steps you laid out in the beginning. Even if it’s not the most efficient way to solve the problem, you can explain potential optimizations afterwards as long as you hit the problem’s complexity and space requirements.&lt;br&gt;
• Don’t convolute your code. This might make it difficult for both you and the interviewer to understand your solution and could introduce unexpected results or complexity.&lt;br&gt;
• Speak through your solution with the interviewer as you write down your Python. They want to understand how you think as you advance towards an answer.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Understand your data
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uBOn7Q3R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uvavetud1u1c655b4edi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uBOn7Q3R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uvavetud1u1c655b4edi.png" alt="LeetCode Single Number Problem for Data Science Interviews" width="880" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s start by looking at some of the data. Fortunately, LeetCode will typically provide you with several examples to look at.&lt;/p&gt;

&lt;p&gt;In this case we receive three example arrays of integers. Each array contains an element which only appears once and other elements which appear twice.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---03lAhAB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b8u1snj1vrb578hpt91v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---03lAhAB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b8u1snj1vrb578hpt91v.png" alt="Understanding LeetCode Single Number Data" width="722" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the three examples, we can already notice a few patterns:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Your solution should work for arrays with only one or two discrete integers.&lt;/li&gt;
&lt;li&gt;Your solution should work for arrays with several pairs of discrete integers where the duplicate elements aren’t consecutive.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We are also given some constraints to work with:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8uIiGsk_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9nbn9k73mn6sq31uank3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8uIiGsk_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9nbn9k73mn6sq31uank3.png" alt="LeetCode Single Number Question Constraints" width="737" height="106"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These constraints aren’t particularly important for the solutions we’ll cover in this article, but you should always be aware of the limits of the problem.&lt;/p&gt;

&lt;p&gt;What’s important to realize is we need a solution to account for a variety of array sizes and content which still only maintains linear complexity and constant extra space.&lt;/p&gt;

&lt;h4&gt;
  
  
  Solution 1:
&lt;/h4&gt;

&lt;p&gt;For our first solution, we use Python’s Counter class from the collections module to count all elements in the array and then filter for the element with a count of one.&lt;/p&gt;
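For reference, here is what Counter does on a small hypothetical array (this is the first sample input from the problem):

```python
from collections import Counter

# Counter maps each element to the number of times it appears
counts = Counter([4, 1, 2, 1, 2])

print(counts[4])  # 1 -- the element that appears only once
print(counts[1])  # 2
print(counts[2])  # 2
```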

&lt;p&gt;&lt;strong&gt;Formulate Approach&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The next step, according to our previous framework, is to outline some steps we’ll need to take. Writing down the steps you’ll take in advance will make coding significantly easier.&lt;/p&gt;

&lt;p&gt;For our first solution, here are the general steps we’ll follow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We use a COUNTER function in Python to go through all the elements in the array and return a dictionary of how many times any given element appears.&lt;/li&gt;
&lt;li&gt;We filter the result of the counter dictionary to find the element where the count is equal to 1. Since we expect only one element to appear once, we can return the first match.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Code execution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To write the Python code for this question, let’s follow the steps we just wrote down and translate them to code. The most important part of this solution is correctly applying the count function on our array and filtering the result.&lt;/p&gt;

&lt;p&gt;Looking at the first step, we can start by writing the code for counting the elements in the array.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;from collections import Counter&lt;br&gt;
from typing import List&lt;br&gt;
&lt;br&gt;
class Solution:&lt;br&gt;
    def singleNumber(self, nums: List[int]) -&amp;gt; int:&lt;br&gt;
        c = Counter(nums)&lt;br&gt;
        return c&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SMZ7qOWO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/edn1u0un2qk5wv6innml.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SMZ7qOWO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/edn1u0un2qk5wv6innml.png" alt="Output for LeetCode Single Number Question" width="575" height="84"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Given one of our example inputs, our solution will now return a dictionary giving the total count of each integer in our input array. We immediately see one of our integers only has a count of one. Next, all we have to do is filter for this element. To do this, we’ll use a FOR loop to find the first dictionary key with a value equal to one.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;class Solution:&lt;br&gt;
    def singleNumber(self, nums: List[int]) -&amp;gt; int:&lt;br&gt;
        c = Counter(nums)&lt;br&gt;
        for n in c:&lt;br&gt;
            if c[n] == 1:&lt;br&gt;
                return n&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yd5t-Q3R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/if2hrckdzl5fkga658pr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yd5t-Q3R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/if2hrckdzl5fkga658pr.png" alt="Output 2 for LeetCode Single Number Question" width="441" height="158"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we arrive at the correct answer with our solution presenting us with the only integer which doesn’t appear twice in the input array.&lt;/p&gt;

&lt;p&gt;When it comes to complexity, this answer meets the linear runtime requirement: building the counter and looping through it once is O(n). Be aware, though, that the Counter itself stores a count for every distinct element, so this approach actually uses O(n) extra space rather than the constant extra space the problem asks for.&lt;/p&gt;

&lt;h4&gt;
  
  
  Solution 2:
&lt;/h4&gt;

&lt;p&gt;Our second solution relies on comparing sums in Python. Since sets reduce all elements to a single occurrence, we can use the set function and algebra to calculate the integer which appears only once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Formulate Approach&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Following our framework again, we need to start our second solution by writing down some specific steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We first apply the SET function to reduce our input array to an array only containing one occurrence of each original integer element.&lt;/li&gt;
&lt;li&gt;We sum our set and double it using arithmetic operations to prepare our comparison.&lt;/li&gt;
&lt;li&gt;We then subtract the sum of the elements in our original array to isolate the single element which appears once.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Code execution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s again follow the steps we just wrote down and translate them into code. The critical component of this solution is correctly applying the algebra to your set and input array, otherwise you’ll yield the incorrect output.&lt;/p&gt;

&lt;p&gt;Looking at the first step, let’s apply the set function to our array to understand how it changes our input array.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;class Solution:&lt;br&gt;
    def singleNumber(self, nums: List[int]) -&amp;gt; int:&lt;br&gt;
        return set(nums)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Hyeu3m6_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ccjm4nii894ojzy4zq41.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Hyeu3m6_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ccjm4nii894ojzy4zq41.png" alt="Output 3 for LeetCode Single Number Question" width="521" height="86"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We see the original array has been reduced to a set consisting of all the elements without repetition.&lt;/p&gt;

&lt;p&gt;Next we need to sum our set and double it. The exact arithmetic depends on the question being asked: because every element except one appears exactly twice, doubling the sum of the set works. If duplicates appeared three or four times instead, the math would change and simply doubling the set sum would no longer be sufficient.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;class Solution:&lt;br&gt;
    def singleNumber(self, nums: List[int]) -&amp;gt; int:&lt;br&gt;
        return 2*sum(set(nums))&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pDkRTKne--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4s9xugimoihgnqy4gu52.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pDkRTKne--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4s9xugimoihgnqy4gu52.png" alt="Output 4 for LeetCode Single Number Question" width="598" height="80"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The output is now an integer, but it is not yet the element that appears only once. To finish, we subtract the sum of the original array. Because we deliberately do not apply the set function to this second sum, every duplicated element is counted twice in it and cancels against the doubled set sum, leaving exactly the single element.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;class Solution:&lt;br&gt;
    def singleNumber(self, nums: List[int]) -&amp;gt; int:&lt;br&gt;
        return 2*sum(set(nums))-sum(nums)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aHdzUHLi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fi0n8rg7ytqd9ycd5hax.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aHdzUHLi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fi0n8rg7ytqd9ycd5hax.png" alt="class Solution:&amp;lt;br&amp;gt;
    def singleNumber(self, nums: List[int]) -&amp;gt; int:&amp;lt;br&amp;gt;
        return 2*sum(set(nums))-sum(nums)" width="561" height="112"&gt;&lt;/a&gt;&lt;/p&gt;
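The arithmetic can be verified on a small hypothetical array where 4 is the single element:

```python
nums = [4, 1, 2, 1, 2]

doubled_set_sum = 2 * sum(set(nums))  # 2 * (1 + 2 + 4) = 14
plain_sum = sum(nums)                 # 4 + 1 + 2 + 1 + 2 = 10

# every duplicate contributes twice to both terms and cancels,
# leaving only the single element
print(doubled_set_sum - plain_sum)    # 4
```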

&lt;p&gt;When it comes to complexity and space, this answer doesn’t optimize any further than our first code snippet. Sum and set both go through the entire array, and it still requires space to store the set. As a result, we get O(n) for both runtime and space complexity.&lt;/p&gt;

&lt;h4&gt;
  
  
  Solution 3:
&lt;/h4&gt;

&lt;p&gt;Our third solution relies on bitwise manipulation using the XOR operator. For context, the XOR of 0 and any value returns the value unchanged, while the XOR of a value with itself returns 0. As a result, every element which appears twice cancels out to 0, and XOR-ing the remaining single element against that 0 yields the element itself.&lt;/p&gt;
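These XOR properties can be verified directly in Python (again using the problem's first sample array, where 4 is the single element):

```python
x = 13
assert 0 ^ x == x          # XOR with zero leaves a value unchanged
assert x ^ x == 0          # XOR of a value with itself cancels to zero

# XOR is commutative and associative, so the duplicates cancel
# pairwise regardless of their positions in the array:
print(4 ^ 1 ^ 2 ^ 1 ^ 2)   # 4
```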

&lt;p&gt;&lt;strong&gt;Formulate Approach&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Looking back at our framework - we need to again begin our third solution by writing down our coding steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We know we’ll need to perform a bitwise comparison of all elements, so we’ll need to start by looping through our array and establishing a variable for our first XOR operation.&lt;/li&gt;
&lt;li&gt;Using the XOR operator, perform the XOR bitwise operation on each element of your array starting with the first element operated against 0.&lt;/li&gt;
&lt;li&gt;Simplify the solution by instead storing the result of each XOR operation in the first index of the input array. Then return the element at index 0 after the first loop through the array.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Code execution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Time to again translate our steps into functional code. The key to this solution is performing the XOR operation against the result of previous XOR operations instead of performing an XOR operation on each element against itself.&lt;/p&gt;

&lt;p&gt;Looking at the first step, let’s establish a 0 variable to XOR against and loop through our array starting with the first element.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;class Solution:&lt;br&gt;
    def singleNumber(self, nums: List[int]) -&amp;gt; int:&lt;br&gt;
        a = 0&lt;br&gt;
        for i in nums:&lt;br&gt;
            ...  # loop body added in the next step&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now we need to XOR every element of our array against our a variable. Since a starts as 0, we know the first XOR operation will yield the first element. We can reassign the result of this operation to the a variable to XOR the rest of the elements. Finally, we return the end result of all these XOR operations.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;class Solution:&lt;br&gt;
    def singleNumber(self, nums: List[int]) -&amp;gt; int:&lt;br&gt;
        a = 0&lt;br&gt;
        for i in nums:&lt;br&gt;
            a ^= i&lt;br&gt;
        return a&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Bx26SofX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/46ryb46tnsiiac0ny9m9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Bx26SofX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/46ryb46tnsiiac0ny9m9.png" alt="Output 6 for LeetCode Single Number Question" width="612" height="124"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This already yields a correct answer. The loop runs in O(n) time and uses only a single extra variable, so it is already O(1) extra space, but we can shave off even that variable by storing the result of each XOR operation in the first index of the input array instead.&lt;/p&gt;

&lt;p&gt;We’ll need to change our loop in this case: it now starts at index 1 instead of index 0, since nums[0] already holds the first element and serves as the accumulator for the first XOR operation.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;class Solution:&lt;br&gt;
    def singleNumber(self, nums: List[int]) -&amp;gt; int:&lt;br&gt;
        for i in range(1, len(nums)):&lt;br&gt;
            nums[0] ^= nums[i]&lt;br&gt;
        return nums[0]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jREP4i-N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zpd49xfjsgg2nsjio82b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jREP4i-N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zpd49xfjsgg2nsjio82b.png" alt="Final Output for LeetCode Single Number Question" width="554" height="147"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We see our solution is again correct, but this time we didn’t have to make an extra storage variable. We end up using less space, and, as such, present the interviewer with a more efficient solution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Comparison of Approaches to Solve this LeetCode Single Number Question
&lt;/h3&gt;

&lt;p&gt;We have gone through 3 different ways you can solve this LeetCode single number question. While all of them are correct and result in the expected output, the solutions differ in their computations and complexity.&lt;/p&gt;

&lt;p&gt;The first difference is to take note of how each approach solves the problem. Our first approach uses the counter function, our second approach arrives upon the solution mathematically, and our third approach leverages bitwise manipulation. As you work your way through interview questions similar to this one, consider how there may be different computational methods you can apply. This may open up solutions you weren’t considering at first.&lt;/p&gt;
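&lt;p&gt;To make the comparison concrete, here is a minimal side-by-side sketch of all three approaches. The Counter-based and mathematical versions are written from their standard formulations rather than copied from earlier in the article, so treat their exact implementations and function names as illustrative:&lt;/p&gt;

```python
from collections import Counter

# Approach 1 (assumed form): count occurrences, return the value seen once.
def single_counter(nums):
    for value, count in Counter(nums).items():
        if count == 1:
            return value

# Approach 2 (assumed form): 2 * sum(set(nums)) counts each distinct value
# twice; subtracting the real sum leaves the value that appears only once.
def single_math(nums):
    return 2 * sum(set(nums)) - sum(nums)

# Approach 3: the XOR solution from this article; O(n) time, O(1) space.
def single_xor(nums):
    a = 0
    for i in nums:
        a ^= i
    return a

sample = [4, 1, 2, 1, 2]
print(single_counter(sample), single_math(sample), single_xor(sample))  # 4 4 4
```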

&lt;p&gt;The second difference has to do with complexity and storage. While we’re already given constraints of linear time complexity and constant extra space, we should ideally present our interviewer with the most optimal code we can write.&lt;/p&gt;

&lt;p&gt;Looking back, all three solutions must pass through the entire array, so they share O(n) time complexity. The counter and mathematical solutions, however, also require O(n) extra space to store the counts and the set of distinct elements. Our bitwise manipulation solution keeps the O(n) time but needs only O(1) extra space, since it combines just two values at a time and stores the result of each XOR operation in the first element of the array.&lt;/p&gt;

&lt;p&gt;For these types of problems, you should prepare yourself to explain why different solutions yield different complexity and space requirements. In addition, it’s helpful to understand the computational differences between separate solutions to the same problem.&lt;/p&gt;

&lt;h4&gt;
  
  
  Conclusion
&lt;/h4&gt;

&lt;p&gt;In this article, we covered several different ways of solving the LeetCode Single Number problem in Python. Many of the questions interviewers give you during data science interviews will be similar to this one. Keep in mind that this article’s three methods are not the only ways to solve the problem; many other solutions exist with greater or lesser time and space requirements.&lt;/p&gt;

&lt;p&gt;On StrataScratch, you will find articles discussing many other &lt;a href="https://www.stratascratch.com/blog/top-30-python-interview-questions-and-answers/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;python interview questions&lt;/a&gt; you will encounter in data science interviews. Beyond interview questions and answers, you will also find general articles on how to succeed in your data science interview, an interactive area to practice answering data science interview questions by constructing your own solutions or reviewing others’, and comparison pieces like "&lt;a href="https://www.stratascratch.com/blog/leetcode-vs-hackerrank-vs-stratascratch-for-data-science/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;LeetCode vs HackerRank vs StrataScratch for Data Science&lt;/a&gt;", which compares these three interview preparation platforms used by people working in data science.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>leetcode</category>
      <category>python</category>
      <category>dataanalyst</category>
    </item>
    <item>
      <title>Zillow Data Scientist Interview Question Walkthrough</title>
      <dc:creator>StrataScratch</dc:creator>
      <pubDate>Fri, 28 Jan 2022 05:31:57 +0000</pubDate>
      <link>https://dev.to/nate_at_stratascratch/zillow-data-scientist-interview-question-walkthrough-4mke</link>
      <guid>https://dev.to/nate_at_stratascratch/zillow-data-scientist-interview-question-walkthrough-4mke</guid>
      <description>&lt;p&gt;&lt;em&gt;We’ll closely examine one of the interesting Zillow data scientist interview questions and find a simple and flexible approach for solving this question.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This Zillow data scientist interview question can be solved in several ways, but we’ll cover one of the simplest and most flexible solutions. Keep reading to discover an approach which can handle a variety of different datasets without accidentally leaving out important records!&lt;/p&gt;

&lt;p&gt;As the most-visited real estate website in the United States, Zillow and its affiliates offer customers an on-demand experience for buying, selling, renting and financing with transparency and nearly seamless end-to-end service. Zillow Offers buys and sells homes directly in dozens of markets across the country, allowing sellers control over their timeline. Zillow Home Loans, Zillow’s affiliate lender, provides customers with an easy option to get pre-approved and secure financing for their next home purchase. Zillow recently launched Zillow Homes, Inc., a licensed brokerage entity, to streamline Zillow Offers transactions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Scientist Position at Zillow
&lt;/h2&gt;

&lt;p&gt;Data scientists at Zillow typically work on the Data Science &amp;amp; Analytics team. As a member of the Analytics team at Zillow, you will partner closely with stakeholders to model, analyze, and visualize business-relevant metrics that inform both short- and long-term decision-making. This role is responsible for advancing Zillow’s reporting practice, developing source-of-truth datasets, and maintaining the team’s Looker instance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TVyOTf6G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/coqqtwtu60tjekezo0w8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TVyOTf6G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/coqqtwtu60tjekezo0w8.png" alt="Data Scientist Position at Zillow" width="880" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This team collaborates with Data Engineering to turn data into information – and information into insight. It works with datasets as small as an Excel spreadsheet and as large as raw clickstream data. It’s responsible for production reporting, analysis, causal inference, and forecasting. This team works closely with Product Managers, Marketing, and Engineering to deliver critical information and insights that drive decision making.&lt;/p&gt;

&lt;p&gt;For additional information on the Data Science team at Zillow, &lt;a href="https://www.zillow.com/tech/data-science-overview-2017/"&gt;here’s an official article&lt;/a&gt; from a few years back highlighting their tools, technology, and data. Beyond this, StrataScratch offers several other articles like &lt;a href="https://www.stratascratch.com/blog/what-does-a-data-scientist-do/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;this one&lt;/a&gt; and &lt;a href="https://www.stratascratch.com/blog/most-in-demand-data-science-technical-skills/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;this one&lt;/a&gt; providing more context about data science roles.&lt;/p&gt;

&lt;h2&gt;
  
  
  Concepts Tested in Zillow Data Scientist Interview Questions
&lt;/h2&gt;

&lt;p&gt;The main SQL concepts tested in the Zillow data scientist interview questions include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Use of the avg() function to aggregate records&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When to use the WHERE clause versus the HAVING clause for filtering data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Using subqueries to compare computational results&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Zillow Data Scientist Interview Question
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Cities With The Most Expensive Homes&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The question we are going to examine in detail in this article has been asked during an interview at Zillow. It’s titled “Cities With The Most Expensive Homes”, and the key challenge is finding the national average and city averages for home prices then comparing the two to filter for the most expensive cities.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AnRCUaOP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zo16djhh12fws0wcvcrq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AnRCUaOP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zo16djhh12fws0wcvcrq.png" alt="Image description" width="846" height="139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Link to Problem: &lt;a href="https://platform.stratascratch.com/coding/10315-cities-with-the-most-expensive-homes?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/10315-cities-with-the-most-expensive-homes&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/nRImay97hp8"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Ultimately, we’re being asked to find the cities with higher average home prices than the national average all while using one table of data.&lt;/p&gt;

&lt;p&gt;This Zillow data scientist interview question may seem short and simple, but, as we will see, the answer requires thinking carefully about which construct to use for the data comparison. While there exist multiple ways to solve this question, we’re going to look at one of the simplest, most flexible solutions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Framework to Solve the Problem
&lt;/h3&gt;

&lt;p&gt;To make the process of solving this interview question easier, we will follow a framework we could use for any data science problem. It consists of three steps and creates a logical pipeline for approaching problems concerning writing code for manipulating data. Here are the three steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Understand your data:&lt;br&gt;
a). Take a look at the columns and make an assumption about them. Take note which columns will be relevant for your calculations and which you can discard.&lt;br&gt;
b). If you don’t have a complete understanding of the schema, look at the first couple of rows of data and work out how the values relate to each column. Ask for example values if none are present. Understanding what the values for each column might look like will help you figure out whether you can limit your solution to specific columns or must broaden it for edge cases.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Formulate your approach:&lt;br&gt;
a). Now, begin writing down the logical steps that you have to program/code. Don’t worry if it seems out of order at first. Code can be changed, so you might perform a calculation in advance and set it aside for later or place it elsewhere to write a separate part of the solution.&lt;br&gt;
b). You also have to identify the main functions that you have to implement to perform the logic. Envision the operation a function might have in advance to avoid miscalculations.&lt;br&gt;
c). Don't forget that interviewers will be watching you. They can intervene whenever needed, so make sure that you ask them to clarify any ambiguity. Your interviewers will also specify if you can use some ready-made functions or if you should write the code from scratch.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Code Execution:&lt;br&gt;
a). Build up your code so that it is neither oversimplified nor overcomplicated. Remember you can always set part of the solution aside for later use if you need to work through a separate step of the problem.&lt;br&gt;
b). Build it in steps based on the outline shared with the interviewer. It doesn’t have to be the most efficient solution, but it will help to present a generic solution which covers a variety of data.&lt;br&gt;
c). Here's the most important point. Think carefully about how your functions operate. This will let you achieve a simpler solution with fewer logical statements and rules cluttering the code.&lt;br&gt;
d). Don't be quiet while laying down your code. Talk about your code as the interviewer will evaluate your problem-solving skills.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Understand Your Data
&lt;/h4&gt;

&lt;p&gt;Let’s start by examining the data. At most company interviews, you won’t have access to the data and won’t have the ability to execute code. Instead, you’ll be responsible for understanding the data and making assumptions solely based on the table schema and by communicating with the interviewer.&lt;/p&gt;

&lt;p&gt;In the case of this Zillow data scientist interview question, there is only one table with five columns of data representing an id, state, city, street address, and market price. Each row corresponds to a single home.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HuRfmTnn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/onnra9j7nikko1t395d4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HuRfmTnn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/onnra9j7nikko1t395d4.png" alt="Data for Zillow Data Scientist Interview Question" width="847" height="168"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yBLBz0D_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/apubg83axz8xx21nqezr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yBLBz0D_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/apubg83axz8xx21nqezr.png" alt="Data for Zillow Data Scientist Interview Questions" width="880" height="757"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What’s important to realize is that, to solve this Zillow data scientist interview question, we don’t need all the columns of data. Reviewing the data shows we can discard the id, street_address, and state columns from our calculations. As a result, our solution will rely only on market prices and cities. We also know we’ll have to use these two columns to calculate a national average and compare each city average to this value, all within the same block of code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution:
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Formulate Approach
&lt;/h4&gt;

&lt;p&gt;The next step, according to the general framework for solving data science questions, is to outline the few general steps we’ll need to perform to answer this question. These are very high-level, but writing them down at the beginning will make the coding process much easier for us.&lt;/p&gt;

&lt;p&gt;Here are the general steps to follow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Start with a query to get the national average market price using the avg() function.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Put your initial query for the national average market price to the side while querying for the average market price for each city. This will require us to again use the avg() function and GROUP BY city since there are multiple records for each city.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Move your average market price by city calculation and your original national average query (in the form of a subquery) into a HAVING clause to filter for only the cities where the average market price is higher than the national average.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Find the National Average Market Price&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To write the SQL code for this question, let’s follow the general steps that we’ve just defined and translate them into code. The key part of this approach is that we leverage a subquery for the national average and the HAVING clause to perform a proper price comparison. You can think of it as first obtaining the national average, then obtaining each city average, then comparing the two to list only the cities with a higher average price.&lt;/p&gt;

&lt;p&gt;Looking at the first step, we can start by writing the code for obtaining the national average. This is a relatively simple query which takes advantage of the avg() function, so we can start like this:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT avg(mkt_price) &lt;br&gt;
FROM zillow_transactions&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0HbN2fz5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tfghdpts7c23a0pcl724.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0HbN2fz5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tfghdpts7c23a0pcl724.png" alt="Output for Zillow Data Scientist Interview Questions for Expensive Homes" width="880" height="116"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This code produces a single table with a single record corresponding to the national average. One thing to note is we can’t continue to manipulate this table to reach our solution. We’ll need this data for later, so the next step involves putting this query to the side (either cutting and pasting it or commenting it out).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Find the Average Market Price for Each City&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since we need to know the average market price by city, the next step involves using the avg() function again on the mkt_price column. Since each city has multiple records, we’ll GROUP BY city to get an average per city:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT city, avg(mkt_price) &lt;br&gt;
FROM zillow_transactions&lt;br&gt;
GROUP BY city&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;--SELECT avg(mkt_price) &lt;br&gt;
--FROM zillow_transactions&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DQWGL-KG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mvw0ktjm64dnpb3ijb9j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DQWGL-KG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mvw0ktjm64dnpb3ijb9j.png" alt="Output 2 for Zillow Data Scientist Interview Questions for Expensive Homes" width="880" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, we now have an average by city. What we need now is to compare these averages to our original national average and only present cities which have a higher price.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Filter for Cities Where the Market Price is Greater Than the National Average&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here’s where this Zillow data scientist interview question becomes tricky: your first instinct might be to filter with WHERE. The issue is that WHERE filters individual rows before the city averages are calculated, so it would remove relevant data and produce the wrong results. Instead, we’ll use a HAVING clause for the average comparison, so we aren’t discarding relevant pricing data.&lt;/p&gt;
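&lt;p&gt;The difference between the two filters is easy to see on a toy table. The sketch below uses Python’s built-in sqlite3 module with made-up data (the table name, cities, and prices are illustrative only):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (city TEXT, price REAL)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("A", 100), ("A", 300), ("B", 150)])

# WHERE filters individual rows BEFORE grouping: the (A, 100) row is
# dropped, so A's average is computed from the surviving rows only.
where_avgs = dict(conn.execute(
    "SELECT city, avg(price) FROM t WHERE price > 120 GROUP BY city"))

# HAVING filters whole groups AFTER aggregation: every row still counts
# toward its city's average; only then are low-average groups removed.
having_avgs = dict(conn.execute(
    "SELECT city, avg(price) FROM t GROUP BY city HAVING avg(price) > 120"))

print(where_avgs)   # {'A': 300.0, 'B': 150.0}
print(having_avgs)  # {'A': 200.0, 'B': 150.0}
```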

&lt;p&gt;For the third step, we’ll compare the city average price calculation against a subquery containing our original national average calculation inside the HAVING clause, filtering for the correct cities:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT city&lt;br&gt;
FROM zillow_transactions&lt;br&gt;
GROUP BY city&lt;br&gt;
HAVING avg(mkt_price) &amp;gt; (SELECT avg(mkt_price) FROM zillow_transactions)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0tU4jq0J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kg2vn6oj6xidnmu78x0x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0tU4jq0J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kg2vn6oj6xidnmu78x0x.png" alt="Final Output for Zillow Data Scientist Interview Questions for Expensive Homes" width="879" height="216"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Originally we were asked to present only the cities, and here we get a single column listing the cities whose average home prices are higher than the national average. While you could add an ORDER BY to rank the prices, it wouldn’t contribute anything towards reaching the correct answer in this solution.&lt;/p&gt;

&lt;p&gt;Now, we have the entire solution, and, although it’s simple, it’s also flexible enough to accommodate any additional price data appended to the dataset.&lt;/p&gt;
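&lt;p&gt;As a final sanity check, the query can be run locally against a tiny made-up dataset using Python’s built-in sqlite3 module. The table and column names follow the article; the rows themselves are invented:&lt;/p&gt;

```python
import sqlite3

# In-memory table mirroring the article's schema, with toy data.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE zillow_transactions
                (id INTEGER, state TEXT, city TEXT,
                 street_address TEXT, mkt_price REAL)""")
conn.executemany(
    "INSERT INTO zillow_transactions VALUES (?, ?, ?, ?, ?)",
    [(1, "CA", "San Francisco", "123 A St", 1200000),
     (2, "CA", "San Francisco", "456 B St", 1000000),
     (3, "TX", "Austin",        "789 C St",  400000),
     (4, "TX", "Austin",        "101 D St",  500000)])

# National average = 775,000, so only San Francisco (city avg 1,100,000)
# should clear the bar; Austin (city avg 450,000) is filtered out.
rows = conn.execute("""
    SELECT city
    FROM zillow_transactions
    GROUP BY city
    HAVING avg(mkt_price) > (SELECT avg(mkt_price)
                             FROM zillow_transactions)
""").fetchall()
print([city for (city,) in rows])  # ['San Francisco']
```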

&lt;h4&gt;
  
  
  Conclusion
&lt;/h4&gt;

&lt;p&gt;In this article, we have walked through a simple and flexible way of solving one of the Zillow data scientist interview questions. Remember that the method shown here is not the only possibility; there exist countless other ways, be they more or less efficient, of answering this interview question!&lt;/p&gt;

&lt;p&gt;On StrataScratch, you can practice answering more &lt;a href="https://www.stratascratch.com/blog/sql-interview-questions-you-must-prepare-the-ultimate-guide/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;SQL interview questions&lt;/a&gt; by constructing solutions to them, but always try to think of other ways to solve them; you may come up with a more efficient or more elegant approach. Make sure to post all your ideas to benefit from the feedback of other users, and browse their solutions for inspiration!&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>sql</category>
    </item>
  </channel>
</rss>
