<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: StrataScratch</title>
    <description>The latest articles on DEV Community by StrataScratch (@nate_at_stratascratch).</description>
    <link>https://dev.to/nate_at_stratascratch</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F619799%2F365c0aa7-5637-42e7-ae76-89e00006aec1.png</url>
      <title>DEV Community: StrataScratch</title>
      <link>https://dev.to/nate_at_stratascratch</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nate_at_stratascratch"/>
    <language>en</language>
    <item>
      <title>Practicing String Manipulation in SQL</title>
      <dc:creator>StrataScratch</dc:creator>
      <pubDate>Mon, 06 Mar 2023 09:14:49 +0000</pubDate>
      <link>https://dev.to/nate_at_stratascratch/practicing-string-manipulation-in-sql-29a2</link>
      <guid>https://dev.to/nate_at_stratascratch/practicing-string-manipulation-in-sql-29a2</guid>
      <description>&lt;p&gt;&lt;em&gt;A detailed walkthrough of the solution for a Google interview question to practice SQL String Manipulation.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With the wealth of data being captured by companies, not all of them will be structured and numerical. So today, our focus is to hone your skill in manipulating strings in SQL by introducing several advanced functions.&lt;/p&gt;

&lt;h2&gt;Interview Question Example to Practice SQL String Manipulation&lt;/h2&gt;

&lt;p&gt;Let’s dive into an example question from an interview at Google to practice SQL string manipulation. The question is entitled ‘File Contents Shuffle’. It asks:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--W5DFwuGG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/68yzznd384374ba0act2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--W5DFwuGG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/68yzznd384374ba0act2.png" alt="Practicing String Manipulation in SQL" width="785" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Link to the question: &lt;a href="https://platform.stratascratch.com/coding/9818-file-contents-shuffle?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to+string+manipulation"&gt;https://platform.stratascratch.com/coding/9818-file-contents-shuffle&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Video Solution&lt;/h3&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/BgN5hpl3WKc"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;To understand the question a bit better, let’s have a look at the dataset we’re working with.&lt;/p&gt;

&lt;h4&gt;1. Exploring the Dataset&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wRknJLdE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/am0ytb4wremcucdscxdh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wRknJLdE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/am0ytb4wremcucdscxdh.png" alt="Practicing String Manipulation in SQL" width="880" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The table google_file_store provides a list of text files, with the filename in one column and its contents in the other. Both columns contain string data.&lt;/p&gt;

&lt;p&gt;The question asks us specifically to look at the record where the filename is ‘final.txt’. Notice that there are punctuation marks and duplication of some words like ‘the’, ‘and’, and ‘a’.&lt;/p&gt;

&lt;p&gt;When dealing with strings, always remember that data may not be ‘clean’. Watch out for punctuation marks, numbers, a mix of upper and lower cases, double spaces, and duplication of words. State how you’d like to deal with these scenarios or clarify this with your interviewer. For today, we will set these issues aside.&lt;/p&gt;

&lt;p&gt;The contents of ‘final.txt’ need to be sorted alphabetically and returned in lowercase, with a new filename, ‘wacky.txt’.&lt;/p&gt;

&lt;h4&gt;2. Writing Out the Approach&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IXji5G8Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6und2r61gcijmkh45i8j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IXji5G8Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6und2r61gcijmkh45i8j.png" alt="Practicing String Manipulation in SQL" width="880" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you’ve fully understood the requirements of the question, formulate a plan of how you’ll build the solution. Oftentimes, you already have an idea of what this is but I strongly suggest writing this out step-by-step. This forces you to identify any gaps in your thinking or errors that you may have missed otherwise.&lt;/p&gt;

&lt;p&gt;From the instructions alone, you could easily write out these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Filter the table where the filename is ‘final.txt’&lt;/li&gt;
&lt;li&gt;Sort its contents alphabetically&lt;/li&gt;
&lt;li&gt;Convert the words into lowercase&lt;/li&gt;
&lt;li&gt;Return the contents with ‘wacky.txt’ as the filename column&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While this sounds simple at the start, several important steps are missing. To avoid this, I would also encourage you to think about the input and output at each step.&lt;/p&gt;

&lt;p&gt;For example, the output of Step 1 is:&lt;br&gt;
&lt;code&gt;SELECT * FROM google_file_store&lt;br&gt;
WHERE filename = 'final.txt'&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Rf0WsdrK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/27540ruqyt4j35emfrd5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Rf0WsdrK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/27540ruqyt4j35emfrd5.png" alt="Practicing String Manipulation in SQL" width="880" height="144"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The contents are a single string stored in one row, so we cannot immediately sort the words alphabetically. If, instead, each word had its own row, we could do the usual sort with the ORDER BY clause.&lt;/p&gt;

&lt;p&gt;So we need to prepare the data first so that we can manipulate it more easily later on. Let’s call this the data preparation step; its aim is to convert the string into a column of words. This is how we will do it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data preparation:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Convert the string into an array by splitting the text using a space to identify the individual words&lt;/li&gt;
&lt;li&gt;Explode the array so that each element in the array becomes its own row&lt;/li&gt;
&lt;/ol&gt;
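&lt;p&gt;To make these two steps concrete, here is a rough Python sketch of the same transformation. The sample text is made up, not the actual contents of ‘final.txt’.&lt;/p&gt;

```python
# Hypothetical sample contents; the real text lives in the table.
contents = "We the people of the United States"

# Step 1: split the string into an array of words,
# the equivalent of PostgreSQL's STRING_TO_ARRAY(contents, ' ')
words = contents.split(" ")

# Step 2: "explode" the array so that each word becomes its own row
rows = [(word,) for word in words]

print(words)
print(rows[0])
```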

&lt;p&gt;This will allow us to proceed to Step 2 where we can sort the new column alphabetically and turn it into lower case.&lt;/p&gt;

&lt;p&gt;Then, we would like to return the result as a string like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hhHJ_2Qk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h5hj6kb6rgkx7bw7jfxf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hhHJ_2Qk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h5hj6kb6rgkx7bw7jfxf.png" alt="Practicing String Manipulation in SQL" width="880" height="133"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We cannot do that directly with the current format, so another data transformation is required. This time, it is the reverse of the data preparation step: the aim is to collect the contents of a column into an array and stitch the elements together into a single string.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data reformatting:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Aggregate column into an array&lt;/li&gt;
&lt;li&gt;Combine elements of the array using a space, returning this as a string&lt;/li&gt;
&lt;/ol&gt;
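&lt;p&gt;In Python terms, the reformatting steps are the mirror image of the preparation: gather the rows back into a list and join them with spaces. The sample words below are hypothetical.&lt;/p&gt;

```python
# Hypothetical column of words, one per row, already sorted and lowercased.
rows = ["a", "b", "of", "states", "the", "the"]

# Step 1: aggregate the column into an array (ARRAY_AGG in PostgreSQL)
array = list(rows)

# Step 2: combine the elements with a space into one string,
# the equivalent of ARRAY_TO_STRING(array, ' ')
contents = " ".join(array)
print(contents)
```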

&lt;p&gt;Therefore, our full approach follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Filter the table where the filename is ‘final.txt’&lt;/li&gt;
&lt;li&gt;Data preparation:
a) Convert the string into an array using a space as the delimiter
b) Explode the array so each element becomes its own row&lt;/li&gt;
&lt;li&gt;Sort its contents alphabetically&lt;/li&gt;
&lt;li&gt;Convert the words into lowercase&lt;/li&gt;
&lt;li&gt;Data reformatting:
a) Aggregate column into an array
b) Combine elements of the array using a space, returning this as a string&lt;/li&gt;
&lt;li&gt;Return the contents with ‘wacky.txt’ as the filename column&lt;/li&gt;
&lt;/ol&gt;
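&lt;p&gt;As a sanity check before writing any SQL, the whole approach can be mirrored in a few lines of Python. The input string is made up, and this sketch sorts case-insensitively, whereas SQL’s ORDER BY may order mixed-case words differently depending on the database collation.&lt;/p&gt;

```python
def wacky(contents):
    # Split into words, sort alphabetically (case-insensitively here),
    # lowercase each word, and join back with spaces.
    words = sorted(contents.split(" "), key=str.lower)
    return " ".join(word.lower() for word in words)

print(wacky("The cat and the Hat"))
```

&lt;p&gt;Running the function on a small made-up string is a quick way to confirm that the split–sort–lower–join order behaves as expected.&lt;/p&gt;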

&lt;p&gt;Don’t you feel more confident about tackling the question now that you have the steps written out? This will also provide you with a good reference point if you ever feel stuck in the interview.&lt;/p&gt;

&lt;h4&gt;3. Coding the Solution&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cZzdKNAR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.sanity.io/images/oaglaatp/production/093675c2dcb5693a30989db797466cc1e4d1a6ae-5001x2501.jpg%3Fw%3D1920%26h%3D960%26auto%3Dformat" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cZzdKNAR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.sanity.io/images/oaglaatp/production/093675c2dcb5693a30989db797466cc1e4d1a6ae-5001x2501.jpg%3Fw%3D1920%26h%3D960%26auto%3Dformat" alt="Practicing String Manipulation in SQL" width="880" height="440"&gt;&lt;/a&gt;&lt;br&gt;
Let’s code up the query.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1). Filter the table&lt;/strong&gt;&lt;br&gt;
First, let’s only look at the file ‘final.txt’. We can do this by using an equality condition in the WHERE clause, since we know the exact filename we are looking for.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT * FROM google_file_store&lt;br&gt;
WHERE filename = 'final.txt'&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;However, if we only knew that the filename started with ‘final’, we could use the LIKE or ILIKE operator. Both match strings against a given pattern; the only difference is that LIKE is case-sensitive and ILIKE (a PostgreSQL extension) is not.&lt;/p&gt;

&lt;p&gt;Here, we can use ILIKE with the wildcard operator, %, which represents zero or more characters. This allows us to retrieve the records where the filename starts with ‘final’.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT * FROM google_file_store&lt;br&gt;
WHERE filename ILIKE 'final%'&lt;/code&gt;&lt;/p&gt;
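&lt;p&gt;If it helps to see the pattern-matching semantics outside of SQL, ILIKE 'final%' behaves like a case-insensitive prefix check. The filenames below are made up for illustration.&lt;/p&gt;

```python
filenames = ["final.txt", "FINAL_v2.txt", "draft.txt", "semifinal.txt"]

# ILIKE 'final%': '%' matches zero or more characters, and the comparison
# is case-insensitive, so this keeps any filename starting with 'final'.
matches = [name for name in filenames if name.lower().startswith("final")]
print(matches)
```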

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Hdh66GE7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p7uy9y3xyhpkhiqpvmvi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Hdh66GE7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p7uy9y3xyhpkhiqpvmvi.png" alt="Practicing String Manipulation in SQL" width="880" height="144"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2). Data preparation&lt;/strong&gt;&lt;br&gt;
Next, let’s prepare the data for manipulation. We will use the STRING_TO_ARRAY() function, which takes in a string and converts it into an array (a list). The elements of this array are determined by the delimiter we specify. So if we use a space as the delimiter, it starts a new element whenever it sees a space. Essentially, it will break up our text into words like this:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT STRING_TO_ARRAY(contents, ' ') AS word&lt;br&gt;
FROM google_file_store&lt;br&gt;
WHERE filename ILIKE 'final%'&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;As you can see, arrays pack a lot of information into a single value, but we cannot access or analyze their contents easily, so a common manipulation done on arrays is ‘exploding’ them. We can do this with the UNNEST() function, which takes an array as input and outputs a column where each array element becomes accessible as a separate row. Imagine this as an array-to-rows transformation.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT UNNEST (STRING_TO_ARRAY(contents, ' ')) AS word&lt;br&gt;
FROM google_file_store&lt;br&gt;
WHERE filename ILIKE 'final%'&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--25wMvUK_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uuk5wj4rvr9qiak2kcry.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--25wMvUK_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uuk5wj4rvr9qiak2kcry.png" alt="String Manipulation in SQL" width="880" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3). Sort the contents alphabetically&lt;/strong&gt;&lt;br&gt;
Because we transformed our data earlier, the sorting is now straightforward using the ORDER BY clause.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT UNNEST (STRING_TO_ARRAY(contents, ' ')) AS word&lt;br&gt;
FROM google_file_store&lt;br&gt;
WHERE filename ILIKE 'final%'&lt;br&gt;
ORDER BY word&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YeYbkjsq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5bav0cpwfgrcrwom7z3j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YeYbkjsq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5bav0cpwfgrcrwom7z3j.png" alt="String Manipulation in SQL" width="880" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4). Convert the words into lowercase using LOWER()&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT LOWER(word) AS contents&lt;br&gt;
FROM&lt;br&gt;
  (SELECT UNNEST (STRING_TO_ARRAY(contents, ' ')) AS word&lt;br&gt;
      FROM google_file_store&lt;br&gt;
      WHERE filename ILIKE 'final%' &lt;br&gt;
   ORDER BY word) base&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Utv4acN6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xdf1bll97f738c6xzgdm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Utv4acN6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xdf1bll97f738c6xzgdm.png" alt="String Manipulation" width="880" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5). Data reformatting&lt;/strong&gt;&lt;br&gt;
Finally, to return the contents in a string format, we’ll do the reverse of the steps earlier.&lt;/p&gt;

&lt;p&gt;First, we will aggregate the rows of the contents column into an array using the ARRAY_AGG() function. ARRAY_AGG() is an &lt;a href="https://www.stratascratch.com/blog/the-ultimate-guide-to-sql-aggregate-functions/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to+string+manipulation"&gt;aggregate function&lt;/a&gt;, so like SUM() and AVG(), it takes a column and outputs a single row summarizing the set of values. But here, instead of performing a calculation, it returns an array listing all the values of the column.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT ARRAY_AGG(LOWER(word)) AS contents&lt;br&gt;
FROM&lt;br&gt;
  (SELECT UNNEST (STRING_TO_ARRAY(contents, ' ')) AS word&lt;br&gt;
      FROM google_file_store&lt;br&gt;
      WHERE filename ILIKE 'final%' &lt;br&gt;
   ORDER BY word) base&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Then, we can return this as text by combining the individual words. The ARRAY_TO_STRING() function takes in an array, combines its elements using a specified delimiter such as a space, and returns the output as a string.&lt;/p&gt;

&lt;p&gt;In the same query, we’ll hardcode the filename as ‘wacky.txt’ so our final solution looks like:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT 'wacky.txt' AS filename,&lt;br&gt;
       ARRAY_TO_STRING(ARRAY_AGG(LOWER(word)), ' ') AS contents&lt;br&gt;
FROM&lt;br&gt;
  (SELECT UNNEST (STRING_TO_ARRAY(contents, ' ')) AS word&lt;br&gt;
      FROM google_file_store&lt;br&gt;
      WHERE filename ILIKE 'final%' &lt;br&gt;
   ORDER BY word) base&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;Bonus&lt;/h3&gt;

&lt;p&gt;For more advanced users of SQL, you may be familiar with the REGEXP_SPLIT_TO_TABLE() function, which gives the same output as the UNNEST(STRING_TO_ARRAY()) combination we used earlier.&lt;/p&gt;

&lt;p&gt;REGEXP_SPLIT_TO_TABLE() takes in a string, splits it on a delimiter given as a regular expression, and returns a table with each element in a separate row.&lt;/p&gt;

&lt;p&gt;This is helpful for more complex manipulations where the use of regex is required. In this example, however, the delimiter is simply a space so the code is:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT regexp_split_to_table(contents, ' ') AS word &lt;br&gt;
FROM google_file_store &lt;br&gt;
WHERE filename ILIKE 'final%'&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hk-sSIbX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ul0hhdvel45tjm5rsb4s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hk-sSIbX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ul0hhdvel45tjm5rsb4s.png" alt="SQL String Manipulation" width="880" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And this gives us the same result as we had in Step 2!&lt;/p&gt;
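&lt;p&gt;Where the regex version pays off is with messier delimiters. For instance, splitting on the pattern \s+ treats any run of whitespace as a single separator, which a plain split on a single space does not. Here is the difference in Python, with a made-up string:&lt;/p&gt;

```python
import re

text = "the  quick   brown fox"  # note the double and triple spaces

plain = text.split(" ")          # empty strings appear between repeated spaces
regex = re.split(r"\s+", text)   # any run of whitespace is one delimiter

print(plain)
print(regex)
```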

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;This was an interesting example to level up your SQL string manipulation skills and I hope you learned something new today.&lt;/p&gt;

&lt;p&gt;If you ever find yourself stuck doing SQL string manipulation, remember that you can transform the data into another format first if that makes the next steps easier. Converting strings to arrays is now one of the tricks up your sleeve to impress your interviewer.&lt;/p&gt;

&lt;p&gt;Practice more &lt;a href="https://www.stratascratch.com/blog/sql-interview-questions-you-must-prepare-the-ultimate-guide/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to+string+manipulation"&gt;SQL interview questions&lt;/a&gt; and test your new skills on our coding platform where you can look specifically for string-related questions.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>datascience</category>
      <category>tutorial</category>
      <category>career</category>
    </item>
    <item>
      <title>Spotify Advanced SQL Interview Question on PARTITION BY Clause</title>
      <dc:creator>StrataScratch</dc:creator>
      <pubDate>Thu, 26 Jan 2023 04:31:50 +0000</pubDate>
      <link>https://dev.to/nate_at_stratascratch/spotify-advanced-sql-interview-question-on-partition-by-clause-2mab</link>
      <guid>https://dev.to/nate_at_stratascratch/spotify-advanced-sql-interview-question-on-partition-by-clause-2mab</guid>
      <description>&lt;p&gt;&lt;em&gt;A detailed solution walkthrough to a hard Spotify SQL interview question involving Joins, Aggregations, Case Statements, and Partition By Clause.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this article, we’ll walk you through one of the &lt;a href="https://www.stratascratch.com/blog/advanced-sql-interview-questions-you-must-know-how-to-answer/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to+spotify+advanced+sql+question"&gt;Advanced SQL interview questions&lt;/a&gt;. This question is a Hard level problem and will test your advanced SQL skills such as &lt;a href="https://www.stratascratch.com/blog/different-types-of-sql-joins-that-you-must-know/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to+spotify+advanced+sql+question"&gt;SQL Joins&lt;/a&gt;, &lt;a href="https://www.stratascratch.com/blog/the-ultimate-guide-to-sql-aggregate-functions/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to+spotify+advanced+sql+question"&gt;Aggregations&lt;/a&gt;, Partition By clauses, and &lt;a href="https://www.stratascratch.com/blog/a-comprehensive-guide-to-case-when-statements-in-sql/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to+spotify+advanced+sql+question"&gt;Case statements&lt;/a&gt;. Follow along by clicking on the link to the question provided below. Let us solve this problem using our 3-step framework that you can use to solve any coding question anytime.&lt;/p&gt;

&lt;h1&gt;Spotify Advanced SQL Interview Question&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Days At Number One&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;"Find the number of days a US track has stayed in the 1st position for both the US and worldwide rankings. Output the track name and the number of days in the 1st position. Order your output alphabetically by track name.&lt;br&gt;
If the region 'US' appears in dataset, it should be included in the worldwide ranking."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Link to the question: &lt;a href="https://platform.stratascratch.com/coding/10173-days-at-number-one" rel="noopener noreferrer"&gt;https://platform.stratascratch.com/coding/10173-days-at-number-one&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Video Solution&lt;/h3&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/93quPoReV1M"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;The question is entitled “Days at Number One” and asks us to find the number of days a US track has stayed in the first position in the Spotify daily ranking tables for both the US and worldwide rankings. The question further clarifies that if the region ‘US’ appears in the worldwide ranking table, it counts as a worldwide track.&lt;/p&gt;

&lt;p&gt;In the output, we are expected to display two columns: the track name and the number of days at number one. Now, let us work backward to the approach. But first, let us explore the dataset provided.&lt;/p&gt;

&lt;h2&gt;1. Exploring the Dataset&lt;/h2&gt;

&lt;p&gt;Spotify has provided us with two tables, namely, spotify_daily_rankings_2017_us and spotify_worldwide_daily_song_ranking.&lt;/p&gt;

&lt;p&gt;The first dataset, spotify_daily_rankings_2017_us, contains the following columns:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Table: spotify_daily_rankings_2017_us&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpihfmeipjjtcstwd9nea.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpihfmeipjjtcstwd9nea.png" alt="Spotify Advanced SQL Interview Questions" width="800" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see from this table that it only contains tracks that have held the first position in the US daily rankings on various dates.&lt;/p&gt;

&lt;p&gt;The second table is named spotify_worldwide_daily_song_ranking, and it has the following schema:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Table: spotify_worldwide_daily_song_ranking&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fme03alr2p1xrwu5l674e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fme03alr2p1xrwu5l674e.png" alt="Spotify Advanced SQL Interview Questions" width="800" height="296"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One look at this table and we can see that it contains tracks at various positions, not just the number one tracks we observed in the US rankings table. Another difference is that this table has a column for the region the track belongs to. As the question clarified, any track from the US region is also part of the worldwide rankings table.&lt;/p&gt;

&lt;h2&gt;2. Writing Out the Approach&lt;/h2&gt;

&lt;p&gt;Once you are familiar with the datasets provided, it is time to write out the approach you are about to take to solve the problem.&lt;/p&gt;

&lt;p&gt;Going back to the question, the key to finding the number of days a US track has stayed in that position in both tables is the word “both”. Instinctively, you’ll go for a join, which is absolutely correct.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Merge the two tables in an Inner Join on the track name and date columns.&lt;/strong&gt;&lt;br&gt;
In our case, we will specifically use an inner join so that we keep only the US tracks that are present in both tables on the same dates. So, the inner join must be made on the two common columns: track name and date.&lt;/p&gt;
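&lt;p&gt;Here is a quick Python sketch of the inner-join idea, with made-up rows: only the (track name, date) pairs present in both tables survive.&lt;/p&gt;

```python
# Hypothetical US number-one rows: (trackname, date)
us = [("Shape of You", "2017-02-01"), ("HUMBLE.", "2017-04-20")]

# Hypothetical worldwide rows: (trackname, date, position)
world = [("Shape of You", "2017-02-01", 1),
         ("Despacito", "2017-04-20", 2)]

# Index the worldwide table by the join key (trackname, date)
world_by_key = {(track, date): pos for track, date, pos in world}

# Inner join: keep only US rows whose key also exists in the worldwide table
joined = [(track, date, world_by_key[(track, date)])
          for track, date in us if (track, date) in world_by_key]
print(joined)
```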

&lt;p&gt;&lt;strong&gt;Step 2: Filter for the US tracks that are in position #1.&lt;/strong&gt;&lt;br&gt;
Once we have identified the tracks, we will keep only those that were in the number one position in the US rankings table. A simple WHERE clause is apt for this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Define a new column using SUM() OVER (PARTITION BY) clauses&lt;/strong&gt;&lt;br&gt;
Next, we will create a subset of the latest result table containing only the US track names. The tricky little thing we will use to achieve this is an OVER (PARTITION BY) clause.&lt;/p&gt;

&lt;p&gt;An OVER (PARTITION BY) clause lets us specify the columns over which we will perform window functions. In our case, we are going to use a SUM function to aggregate the data. We will partition by the track name column so that we can find, for each track, the number of times a US number one track has been number one worldwide as well.&lt;/p&gt;

&lt;p&gt;Before moving on to the next step, let us break this step apart into smaller steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3.1 Check if the US #1 track is also #1 in the worldwide rankings.&lt;/strong&gt;&lt;br&gt;
A CASE statement will do the trick. CASE statements are basically if-then statements: when the WHEN condition is met, the THEN value is returned; otherwise, the ELSE value is returned.&lt;/p&gt;

&lt;p&gt;We will need to put a condition on the ‘position’ column of the worldwide rankings table so that it returns the value 1 when the worldwide ranking is indeed number one; otherwise, it returns the value 0. We use numerical values in this CASE statement because we will end up adding them to get the total number of days the track has stayed in that position.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3.2 Get the sum of the number of times it has occurred.&lt;/strong&gt;&lt;br&gt;
We will save the value of the SUM() function as a new column, say ‘n_days_on_n1_position’. As a result, we will have, for each track, the number of days a number one US track has also been number one worldwide on the same dates. We will use this query as a temporary table before proceeding to the final output.&lt;/p&gt;
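&lt;p&gt;Before coding it in SQL, here is the gist of SUM(CASE ...) OVER (PARTITION BY trackname) in Python, using made-up joined rows of (track name, worldwide position):&lt;/p&gt;

```python
# Hypothetical joined rows: (trackname, worldwide_position)
rows = [("Shape of You", 1), ("Shape of You", 1),
        ("Shape of You", 3), ("HUMBLE.", 1)]

# CASE WHEN position = 1 THEN 1 ELSE 0 END, summed per track:
totals = {}
for track, position in rows:
    flag = 1 if position == 1 else 0
    totals[track] = totals.get(track, 0) + flag

# The window function attaches each track's total to every one of its rows,
# rather than collapsing the rows the way GROUP BY would.
windowed = [(track, totals[track]) for track, _ in rows]
print(windowed)
```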

&lt;p&gt;&lt;strong&gt;Step 4: Select the track name and perform the MAX() function on the temp table.&lt;/strong&gt;&lt;br&gt;
In this step, we will select the maximum value of the ‘n_days_on_n1_position’ column, along with the corresponding track name, to be displayed in the final result table.&lt;/p&gt;

&lt;p&gt;Also, since we are using an aggregate function, we’ll couple it with a GROUP BY clause at the end; in our case, we are grouping by track name. This way, we have only one row per track, with the maximum value displayed beside it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Order the track name column alphabetically.&lt;/strong&gt;&lt;br&gt;
As the question has suggested, we will order the output table by track name alphabetically.&lt;/p&gt;

&lt;h2&gt;3. Coding the Solution&lt;/h2&gt;

&lt;p&gt;Let’s get right into coding without further ado.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Merge the two tables in an Inner Join on the track name and date columns.&lt;/strong&gt;&lt;br&gt;
Let us begin by selecting the track names from the US rankings table and viewing them in the console.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT trackname&lt;br&gt;
FROM spotify_daily_rankings_2017_us&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31rxwbys979iimacg6vl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31rxwbys979iimacg6vl.png" alt="Spotify Advanced SQL Interview Questions" width="711" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let us now use an inner join on the common columns - trackname and date - to merge the two tables.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT us.trackname&lt;br&gt;
FROM spotify_daily_rankings_2017_us us&lt;br&gt;
INNER JOIN spotify_worldwide_daily_song_ranking world&lt;br&gt;
ON world.trackname=us.trackname&lt;br&gt;
AND world.date = us.date&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The merged table looks like this:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvoexzosvey0gl4ve4rv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvoexzosvey0gl4ve4rv.png" alt="Spotify Advanced SQL Interview Questions" width="720" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Filter for the US tracks that are in position #1.&lt;/strong&gt;&lt;br&gt;
We can add a WHERE clause at the end of the query to keep only the tracks that were in the first position.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT us.trackname&lt;br&gt;
FROM spotify_daily_rankings_2017_us us&lt;br&gt;
INNER JOIN spotify_worldwide_daily_song_ranking world&lt;br&gt;
ON world.trackname=us.trackname&lt;br&gt;
AND world.date = us.date&lt;br&gt;
WHERE us.position = 1&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;And the output appears to be the same, meaning that the four tracks common to both tables are all number one tracks.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7n872glpxz5jvrh12h5e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7n872glpxz5jvrh12h5e.png" alt="Spotify Advanced SQL Interview Questions" width="707" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Define a new column using the SUM() OVER (PARTITION BY) clause&lt;/strong&gt;&lt;br&gt;
First, let us add a new column to the query named ‘n_days_on_n1_position’ (we will fill in its logic in the next steps).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT us.trackname,&lt;br&gt;
       n_days_on_n1_position&lt;br&gt;
FROM spotify_daily_rankings_2017_us us&lt;br&gt;
INNER JOIN spotify_worldwide_daily_song_ranking world&lt;br&gt;
ON world.trackname=us.trackname&lt;br&gt;
AND world.date = us.date&lt;br&gt;
WHERE us.position = 1&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3.1 Check if the US #1 track is also #1 in the worldwide rankings.&lt;/strong&gt;&lt;br&gt;
Write out the skeleton of the OVER (PARTITION BY) clause first; we will then fill in the conditions and parameters step by step.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT us.trackname,&lt;br&gt;
        (CASE &lt;br&gt;
    WHEN        THEN &lt;br&gt;
END)        &lt;br&gt;
 OVER(PARTITION BY )  AS n_days_on_n1_position&lt;br&gt;
FROM spotify_daily_rankings_2017_us us&lt;br&gt;
INNER JOIN spotify_worldwide_daily_song_ranking world&lt;br&gt;
ON world.trackname=us.trackname&lt;br&gt;
AND world.date = us.date&lt;br&gt;
WHERE us.position = 1&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3.2 Get the sum of the number of times it has occurred.&lt;/strong&gt;&lt;br&gt;
Wrap the window function SUM() around the CASE statement.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT us.trackname,&lt;br&gt;
       SUM(CASE &lt;br&gt;
        WHEN    THEN &lt;br&gt;
END) OVER(PARTITION BY )  AS n_days_on_n1_position&lt;br&gt;
FROM spotify_daily_rankings_2017_us us&lt;br&gt;
INNER JOIN spotify_worldwide_daily_song_ranking world&lt;br&gt;
ON world.trackname=us.trackname&lt;br&gt;
AND world.date = us.date&lt;br&gt;
WHERE us.position = 1&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We’ve got the skeleton ready for our logic: when the position in the worldwide rankings table is number 1, we return 1; otherwise, we return 0. Let us now insert these conditions into our CASE statement.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT us.trackname,&lt;br&gt;
       SUM(CASE&lt;br&gt;
       WHEN world.position = 1 THEN 1&lt;br&gt;
       ELSE 0&lt;br&gt;
       END) OVER(PARTITION BY )  AS n_days_on_n1_position&lt;br&gt;
FROM spotify_daily_rankings_2017_us us&lt;br&gt;
INNER JOIN spotify_worldwide_daily_song_ranking world&lt;br&gt;
ON world.trackname=us.trackname&lt;br&gt;
AND world.date = us.date&lt;br&gt;
WHERE us.position = 1&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Also, since we are computing this sum per US track name selected in the first line of the query, we insert that column into the PARTITION BY clause.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT us.trackname,&lt;br&gt;
       SUM(CASE&lt;br&gt;
       WHEN world.position = 1 THEN 1&lt;br&gt;
       ELSE 0&lt;br&gt;
       END) OVER(PARTITION BY us.trackname)  AS n_days_on_n1_position&lt;br&gt;
FROM spotify_daily_rankings_2017_us us&lt;br&gt;
INNER JOIN spotify_worldwide_daily_song_ranking world&lt;br&gt;
ON world.trackname=us.trackname&lt;br&gt;
AND world.date = us.date&lt;br&gt;
WHERE us.position = 1&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now, we can run this query, and the output is as below:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyx8cq6m6fq87lwipn7a9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyx8cq6m6fq87lwipn7a9.png" alt="SQL interview questions" width="707" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see multiple entries for the track ‘HUMBLE.’ because the OVER (PARTITION BY column_name) clause is a window function: unlike GROUP BY, it does not collapse the groups but returns one row for every input row, duplicates included.&lt;/p&gt;
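&lt;p&gt;To see why the duplicates appear, here is a minimal pandas analogue of the same partition-versus-aggregate distinction (toy data assumed, not the question’s real tables): groupby().transform() behaves like SUM() OVER (PARTITION BY), returning one value per input row.&lt;/p&gt;

```python
import pandas as pd

# Toy rows standing in for the joined US/worldwide rankings (values assumed)
df = pd.DataFrame({
    "trackname": ["HUMBLE.", "HUMBLE.", "HUMBLE.", "Bad and Boujee"],
    "world_position": [1, 1, 3, 1],
})

# Like SUM(CASE ...) OVER (PARTITION BY trackname): one result per input row
df["n_days_on_n1_position"] = (
    (df["world_position"] == 1).astype(int)
    .groupby(df["trackname"])
    .transform("sum")
)
print(df["n_days_on_n1_position"].tolist())  # [2, 2, 2, 1]
```

&lt;p&gt;Every ‘HUMBLE.’ row carries the same partition total, which is exactly the duplication seen in the screenshot above.&lt;/p&gt;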

&lt;p&gt;&lt;strong&gt;Step 4: Select the track name and apply the MAX() function to the subquery.&lt;/strong&gt;&lt;br&gt;
We need the MAX() function on the ‘n_days_on_n1_position’ column to pick only the largest value for each track name, which denotes the number of days in the number one position. But first, we wrap the query we’ve drafted so far into a subquery (derived table) aliased ‘tmp’.&lt;/p&gt;

&lt;p&gt;In addition, since MAX() is an aggregate function, we group the table by track name.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT tmp.trackname,&lt;br&gt;
        MAX(n_days_on_n1_position) AS n_days_on_n1_position&lt;br&gt;
FROM &lt;br&gt;
(SELECT us.trackname,&lt;br&gt;
       SUM(CASE&lt;br&gt;
       WHEN world.position = 1 THEN 1&lt;br&gt;
       ELSE 0&lt;br&gt;
       END) OVER(PARTITION BY us.trackname)  AS n_days_on_n1_position&lt;br&gt;
FROM spotify_daily_rankings_2017_us us&lt;br&gt;
INNER JOIN spotify_worldwide_daily_song_ranking world&lt;br&gt;
ON world.trackname=us.trackname&lt;br&gt;
AND world.date = us.date&lt;br&gt;
WHERE us.position = 1) tmp&lt;br&gt;
GROUP BY trackname&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Let us run the query now and take a look at the output.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvqravb8rnnk5y2sv5vaa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvqravb8rnnk5y2sv5vaa.png" alt="SQL interview questions" width="717" height="183"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Order the track name column alphabetically.&lt;/strong&gt;&lt;br&gt;
Finally, we will order the table by the track name alphabetically, as the question suggests we do.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT tmp.trackname,&lt;br&gt;
        MAX(n_days_on_n1_position) AS n_days_on_n1_position&lt;br&gt;
FROM &lt;br&gt;
(SELECT us.trackname,&lt;br&gt;
       SUM(CASE&lt;br&gt;
       WHEN world.position = 1 THEN 1&lt;br&gt;
       ELSE 0&lt;br&gt;
       END) OVER(PARTITION BY us.trackname)  AS n_days_on_n1_position&lt;br&gt;
FROM spotify_daily_rankings_2017_us us&lt;br&gt;
INNER JOIN spotify_worldwide_daily_song_ranking world&lt;br&gt;
ON world.trackname=us.trackname&lt;br&gt;
AND world.date = us.date&lt;br&gt;
WHERE us.position = 1) tmp&lt;br&gt;
GROUP BY trackname&lt;br&gt;
ORDER BY trackname&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The final result is as shown below:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9kpth7vfkgnlazrdpsp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9kpth7vfkgnlazrdpsp.png" alt="Spotify SQL interview questions" width="712" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the final result, we can infer that there are two tracks that were in the number one position in both the US and worldwide rankings tables. The first track, ‘Bad and Boujee (feat. Lil Uzi Vert)’, was number one for a day in both lists, and the second track, ‘HUMBLE.’, had three days of glory as number one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;It was a very interesting problem where we used multiple advanced SQL constructs: the OVER (PARTITION BY) window function, CASE statements, a join, and aggregations using the SUM() and MAX() functions. I hope you enjoyed working on the problem as well. Explore our platform for more Data Science-related &lt;a href="https://www.stratascratch.com/blog/sql-interview-questions-you-must-prepare-the-ultimate-guide/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to+spotify+advanced+sql+question"&gt;SQL interview questions&lt;/a&gt; and walkthroughs. Good luck!&lt;/p&gt;

</description>
      <category>discuss</category>
    </item>
    <item>
      <title>Facebook Python Interview Questions</title>
      <dc:creator>StrataScratch</dc:creator>
      <pubDate>Mon, 02 Jan 2023 10:10:26 +0000</pubDate>
      <link>https://dev.to/nate_at_stratascratch/facebook-python-interview-questions-15j7</link>
      <guid>https://dev.to/nate_at_stratascratch/facebook-python-interview-questions-15j7</guid>
      <description>&lt;p&gt;This Facebook python interview question will test your ability to use joins, perform transformations and calculations, and address edge-case scenarios.&lt;/p&gt;

&lt;p&gt;We’re back with another Python interview question from Facebook / Meta. We will solve it using our 3-step framework, which can be applied to any &lt;a href="https://www.stratascratch.com/blog/top-30-python-interview-questions-and-answers/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;Python interview question&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Facebook Python Interview Question
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---HmBwxg0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2hxk81427os20hsswxdf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---HmBwxg0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2hxk81427os20hsswxdf.png" alt="Facebook Python Interview Question" width="750" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Link to the question: &lt;a href="https://platform.stratascratch.com/coding/2123-product-families?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2123-product-families&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Video Solution:
&lt;/h2&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/RmMuS5iiviI"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;This Facebook Python interview question revolves around promotional campaigns. In solving it, we analyze how each of the product families is selling, with and without promotions applied.&lt;/p&gt;

&lt;p&gt;We are seeking a result table that shows all the product families, their corresponding total units sold as well as the percentage of units sold under a valid promotion. A valid promotion is defined in the problem statement as ‘not empty’ and ‘contained within the promotions table’.&lt;/p&gt;

&lt;p&gt;Looks pretty straightforward. The trick is to transform the data provided into the desired format for us to display. For that to happen, we will need to look into the tables provided.&lt;/p&gt;

&lt;p&gt;Always begin your solution by exploring the dataset.&lt;/p&gt;

&lt;h2&gt;
  
  
  Framework to solve this Facebook Python interview question
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KLTq8h5d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lr2kpucdq273vod9rsxl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KLTq8h5d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lr2kpucdq273vod9rsxl.png" alt="Framework to solve this Facebook python interview question" width="880" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Exploring the Dataset&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Meta has provided three tables, viz., &lt;strong&gt;facebook_products, facebook_sales_promotions, and facebook_sales.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There’s a lot of information available in those tables, so we need to distinguish and prioritize what columns or relationships to investigate during the interview. To preview the tables given, we can use the head() function. If you are practicing on our StrataScratch platform, however, the table can be previewed using the ‘Preview’ button.&lt;/p&gt;

&lt;p&gt;The first table, &lt;strong&gt;facebook_products&lt;/strong&gt;, contains information related to the products. It includes supplementary information such as class, brand, category, family, and more attributes like whether the product is low-fat and recyclable.&lt;/p&gt;

&lt;p&gt;It has the following schema:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XBPww6FD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b7c90asufs89snc2f9vp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XBPww6FD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b7c90asufs89snc2f9vp.png" alt="Facebook python interview question" width="705" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Run the code below to view a preview of the table:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;facebook_products.head()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Upon previewing the table using the head() function, we can see the following table:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_dKSFwmS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lyqy2v2h0cps3j7akbek.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_dKSFwmS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lyqy2v2h0cps3j7akbek.png" alt="The first table facebook_products" width="616" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The second table, &lt;strong&gt;facebook_sales_promotions&lt;/strong&gt;, is dedicated to the available promotions. It contains the start and end dates of each promotion, the media channel used to promote it, as well as the cost of these campaigns.&lt;/p&gt;

&lt;p&gt;The schema of the table is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BZyGegHD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rkp4c7rcusi0jb93tdfl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BZyGegHD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rkp4c7rcusi0jb93tdfl.png" alt="Facebook python interview question" width="691" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Run the code below:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;facebook_sales_promotions.head()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;A preview of the table looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9KBgQiTD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h6k5um9wb8qj24fjzhon.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9KBgQiTD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h6k5um9wb8qj24fjzhon.png" alt="The second table facebook_sales_promotions" width="611" height="221"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Third, we have the &lt;strong&gt;facebook_sales&lt;/strong&gt; table, which has the following schema:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--r5-5_HNK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xjs4w7cro0jsq97ph6oj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--r5-5_HNK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xjs4w7cro0jsq97ph6oj.png" alt="Facebook python interview question" width="674" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Preview the table by running the following code:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;facebook_sales.head()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hRcEwf-u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ar0mijwqn2mfmo6x6n0w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hRcEwf-u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ar0mijwqn2mfmo6x6n0w.png" alt="Third facebook_sales" width="613" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you explore the schema and datasets provided, notice the common columns between these tables. Specifically, they are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;product_id - facebook_sales and facebook_products tables&lt;/li&gt;
&lt;li&gt;promotion_id - facebook_sales and facebook_sales_promotions tables&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Make it a practice to spot the common columns, as they are what relate the tables later when we figure out an approach to this Facebook Python interview question, which brings us to the next step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Writing Out the Approach&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now that we have formed a better idea of our datasets, let’s formulate our approach by identifying the high-level steps of the solution. It’s always good to have a plan before execution, so let’s not code anything just yet!&lt;/p&gt;

&lt;p&gt;From the question, we figured out that the output table must contain three columns: Product Family, Total Units Sold, and Percentage of Units Sold Under Valid Promotion.&lt;/p&gt;

&lt;p&gt;Let’s create a narrative that will help us navigate this Facebook Python interview question.&lt;/p&gt;

&lt;p&gt;For each product family, we need the units sold in total and the percentage of these units sold under a valid promotion. To ensure the promotions are valid, we will need to cross-check the ‘promotion_id’ in the facebook_sales table with the ‘promotion_id’ in the facebook_sales_promotions table.&lt;/p&gt;

&lt;p&gt;Since the data we need lives in multiple tables, we will merge them to identify the product family and the validity of each sales promotion.&lt;/p&gt;

&lt;p&gt;Let’s list out the steps we are going to take.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The first step in our approach is to identify which ‘product_family’ the ‘product_id’ from the &lt;strong&gt;facebook_sales&lt;/strong&gt; table belongs to. We can achieve this by merging the &lt;strong&gt;facebook_sales&lt;/strong&gt; and &lt;strong&gt;facebook_products&lt;/strong&gt; tables.&lt;/li&gt;
&lt;li&gt;Next, we will create a new column to identify the sales made under valid promotions.&lt;/li&gt;
&lt;li&gt;Once the valid promotions are identified, we will split the merged table into valid and invalid promotion sales.&lt;/li&gt;
&lt;li&gt;Now, for each subset, we will compute the total units sold for each product family.&lt;/li&gt;
&lt;li&gt;We will then merge these subsets into one main table.&lt;/li&gt;
&lt;li&gt;We will then fill the null values with zeroes to avoid errors later on.&lt;/li&gt;
&lt;li&gt;We can now calculate the total sales per product family.&lt;/li&gt;
&lt;li&gt;Now that all the necessary information is available for the percentage calculation, we will compute the percentage of units sold under promotion.&lt;/li&gt;
&lt;li&gt;Finally, we will select the three columns mentioned earlier and replace any null values with zeroes.&lt;/li&gt;
&lt;/ol&gt;
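&lt;p&gt;Before coding each step against the real tables, the whole plan can be sketched end to end on toy data. Everything below is illustrative: the values, and therefore the resulting numbers, are assumed; only the table and column names follow the question.&lt;/p&gt;

```python
import pandas as pd

# Toy stand-ins for the three tables (values assumed)
facebook_sales = pd.DataFrame({
    "product_id": [1, 1, 2],
    "promotion_id": [10, None, 99],   # 99 is not a known promotion
    "units_sold": [4, 6, 5],
})
facebook_products = pd.DataFrame({
    "product_id": [1, 2, 3],          # product 3 ("C") has no sales
    "product_family": ["A", "B", "C"],
})
facebook_sales_promotions = pd.DataFrame({"promotion_id": [10, 11]})

# Steps 1-2: outer join, then flag sales made under a valid promotion
merged = facebook_sales.merge(facebook_products, how="outer", on="product_id")
known = set(facebook_sales_promotions.promotion_id)
merged["valid_promotion"] = merged.promotion_id.map(
    lambda x: not pd.isna(x) and x in known)

# Steps 3-5: split, total units per family in each subset, re-merge
valid = (merged[merged.valid_promotion]
         .groupby("product_family")["units_sold"].sum()
         .to_frame("valid_solds").reset_index())
invalid = (merged[~merged.valid_promotion]
           .groupby("product_family")["units_sold"].sum()
           .to_frame("invalid_solds").reset_index())
result = valid.merge(invalid, how="outer", on="product_family")

# Steps 6-9: fill NULLs, total sales, percentage sold on promotion
result = result.fillna(0)
result["n_sold"] = result.valid_solds + result.invalid_solds
result["perc_promotion"] = (100 * result.valid_solds / result.n_sold).fillna(0)
print(result[["product_family", "n_sold", "perc_promotion"]])
```

&lt;p&gt;The walkthrough below builds this pipeline one step at a time against the real data.&lt;/p&gt;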

&lt;p&gt;&lt;strong&gt;3. Coding the Solution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now that we have the approach written down, it is time to translate it into code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Identify the product family by joining the products and sales tables&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, let’s merge the facebook_products and facebook_sales tables on their common column, product_id. By default, we would reach for an inner join, but in this case, that’s not such a good idea.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Edge Case:&lt;/strong&gt;&lt;br&gt;
Remember that this Facebook Python interview question asks us to make calculations for &lt;strong&gt;all&lt;/strong&gt; the available product families. In an ideal setting, every product family would be well represented in the sales table, but realistically, that may not be the case.&lt;/p&gt;

&lt;p&gt;We need to account for phased-out items and newly introduced items that would have periods with no sales; these appear in the merged table as NULL values. We will handle this edge case later in our solution by filling the NULL values with 0s to avoid errors.&lt;/p&gt;
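&lt;p&gt;As a quick illustration of that fix (toy values assumed), pandas fills NULLs with zeros in one call:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical units-sold column where one product family had no sales
units = pd.Series([12.0, None, 7.0])
print(units.fillna(0).tolist())  # [12.0, 0.0, 7.0]
```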

&lt;p&gt;Regardless of whether a sale was made, all product families need to appear in our solution, so we will use an ‘&lt;strong&gt;outer&lt;/strong&gt;’ join instead of an inner join.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;import pandas as pd&lt;br&gt;
merged = facebook_sales.merge(facebook_products, how="outer", on="product_id")&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The output of the table is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IsoIljJP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z8zaaoe2p90881n608ij.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IsoIljJP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z8zaaoe2p90881n608ij.png" alt="Output for the Facebook python interview question" width="880" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An alternative is to make the &lt;strong&gt;facebook_products&lt;/strong&gt; table the base table and left-join &lt;strong&gt;facebook_sales&lt;/strong&gt; onto it. This is another viable way to ensure that all the product families are captured.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Create a new column identifying the sales made under a valid promotion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Next, we want to establish the promotion validity of each sale and create a new column for this information.&lt;/p&gt;

&lt;p&gt;To mark a promotion as ‘valid’, we will need to ensure that the promotion_id in the sales table is not empty and that the promotion_id is also contained in the promotions table.&lt;/p&gt;

&lt;p&gt;Of course, there are different ways to get this done. A neat trick is to use a &lt;strong&gt;map-lambda&lt;/strong&gt; combination. For those unfamiliar, map() takes an iterable, like a list, and transforms each of its items by applying the function specified.&lt;/p&gt;

&lt;p&gt;We only need a temporary function here, so we can use a lambda function instead of defining a whole new function separately. As an added bonus, lambda allows us to specify the function in a single line of code!&lt;/p&gt;
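&lt;p&gt;As a minimal, self-contained sketch (the IDs are assumed for illustration), the map-lambda pattern looks like this:&lt;/p&gt;

```python
# Toy promotion IDs: None is missing, 99 is unknown (values assumed)
promo_ids = [10, None, 99]
known_ids = {10, 11}

# map() applies the lambda to every element in a single pass
flags = list(map(lambda x: x is not None and x in known_ids, promo_ids))
print(flags)  # [True, False, False]
```

&lt;p&gt;This is the same shape as the validity flag we are about to compute on the merged table.&lt;/p&gt;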

&lt;p&gt;Now the merged table should contain the main requirements of our solution - the product family, the number of units sold, and an indication of the promotion validity.&lt;/p&gt;

&lt;p&gt;Digging deeper into the target table once more, the total units sold and percentage of sales on promotion can be calculated as:&lt;/p&gt;

&lt;p&gt;We can note the target computation as a comment (under “OUTPUT: product_family | n_sold | perc_promotion”):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;# product_family | SUM(units_sold) | SUM(units_sold_valid) / SUM(units_sold)*100&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now, we will create a new column named &lt;strong&gt;valid_promotion&lt;/strong&gt;, marked True if the promotion is valid and False if it isn’t. A valid promotion is defined by two conditions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The promotion_id cannot be missing or null&lt;/li&gt;
&lt;li&gt;The promotion_id should also exist in the facebook_sales_promotions table&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first condition can be checked with the pandas function isna(), which is why we imported pandas earlier. The second can be checked efficiently by getting the full list of unique promotion_ids from the promotions table and testing membership in that list.&lt;/p&gt;

&lt;p&gt;Let’s now write these two conditions in code as shown below:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;merged['valid_promotion'] = merged.promotion_id.map(&lt;br&gt;
    lambda x: not pd.isna(x)&lt;br&gt;
    and x in facebook_sales_promotions.promotion_id.unique())&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The output of this step is a single boolean column of True and False values indicating whether each promotion_id was valid.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--R4PB1H88--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v03n38mjveuluom35ugk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--R4PB1H88--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v03n38mjveuluom35ugk.png" alt="Output 2 for the Facebook python interview question" width="597" height="522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Split the merged table into valid and invalid promotion sales&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s now use this column to split the merged table we created earlier into valid and invalid promotion sales, filtering on ‘valid_promotion’ as shown below:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;valid_promotion = merged[merged.valid_promotion]&lt;br&gt;
invalid_promotion = merged[~merged.valid_promotion]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The valid promotion dataset is illustrated below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8yJ5GwPi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pdm8bwhuj9igxeka0pej.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8yJ5GwPi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pdm8bwhuj9igxeka0pej.png" alt="Valid promotions dataset output" width="808" height="224"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And the invalid promotions dataset is as below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JkNOONxC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d1x3ej8qg7kr0apzijp3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JkNOONxC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d1x3ej8qg7kr0apzijp3.png" alt="Invalid promotions dataset output" width="808" height="163"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. For each subset, compute the total units sold per product family&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now that we have segregated the merged table into valid and invalid promotion datasets, we can perform aggregation on these datasets to calculate the sum of units sold for the sales made in each of these datasets. We will use the &lt;strong&gt;groupby()&lt;/strong&gt; function to get the total units sold at the product family level.&lt;/p&gt;

&lt;p&gt;Let’s start with the valid promotions.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;valid_promotion.groupby('product_family')['units_sold'].sum()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rov2VWdk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/axxxc93vgmvfx4q51x5d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rov2VWdk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/axxxc93vgmvfx4q51x5d.png" alt="Output 3 for the Facebook python interview question" width="760" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The output is a column of total units sold under a valid promotion. Let’s convert it into a dataframe using the &lt;strong&gt;to_frame()&lt;/strong&gt; function, which lets us name the column in the same line. Simply append to_frame() to the previous code.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;valid_promotion.groupby('product_family')['units_sold'].sum().to_frame('valid_solds')&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The output looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UsBu-BQz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k920e4a58nznfv1wsp3m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UsBu-BQz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k920e4a58nznfv1wsp3m.png" alt="Output 4 for the Facebook python interview question" width="760" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It does not expose the product family the sales were made on; it is stored as the index. So let us reset the index to make the product_family column available again. Append the reset_index() function to the code above:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;valid_promotion.groupby('product_family')['units_sold'].sum().to_frame('valid_solds').reset_index()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now the output displays the respective product family as well:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VWjsgIhe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wmbhpss0hqrn981917cm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VWjsgIhe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wmbhpss0hqrn981917cm.png" alt="Output 5 for the Facebook python interview question" width="760" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To differentiate between the valid and invalid promotions, let’s save the above line as &lt;strong&gt;results_valid&lt;/strong&gt; and repeat the same steps for the invalid_promotion table.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;results_valid = valid_promotion.groupby('product_family')['units_sold'].sum().to_frame('valid_solds').reset_index()&lt;br&gt;
invalid_promotion = merged[~merged.valid_promotion]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Note that the second line, creating invalid_promotion, is the split we already wrote in Step 3; it is repeated here for completeness.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;result_invalid = invalid_promotion.groupby('product_family')['units_sold'].sum().to_frame('invalid_solds').reset_index()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The result_invalid table looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--F3n2AUyT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mzxiguqelpy76a6qnb9d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--F3n2AUyT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mzxiguqelpy76a6qnb9d.png" alt="Output 6 for the Facebook python interview question" width="760" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Merge the results for the valid and invalid promotions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now it is time to merge the two tables, results_valid and result_invalid, so that our calculations down the line become easier. The merged table will contain the product family, the total units sold under valid promotions, and the total units sold under invalid promotions.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;result = results_valid.merge(result_invalid, how='outer', on='product_family')&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GqzQU5eD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ufcpvjdtodnsdig4u88x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GqzQU5eD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ufcpvjdtodnsdig4u88x.png" alt="Output 7 for the Facebook python interview question" width="760" height="210"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s take a step back and consider the edge case scenario that we had identified earlier in Step 1, wherein a product family does not have any sales on a valid promotion.&lt;/p&gt;

&lt;p&gt;While we’re at it, there could also be a scenario wherein all the sales of a product family are made under promotion, which creates a missing value when we merge the two tables. Once made apparent, these scenarios are easily solved with the &lt;strong&gt;fillna()&lt;/strong&gt; function: all we need to do is fill the null values with zeroes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Fill null values with zeroes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s append the previous line of code with the &lt;strong&gt;fillna()&lt;/strong&gt; function as shown below:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;result = results_valid.merge(result_invalid, how='outer', on='product_family').fillna(0)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The output table is cleaner and will make our further calculations a cakewalk.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--o1ntfWKF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3gtxzjc2qz3oouxk3qx6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--o1ntfWKF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3gtxzjc2qz3oouxk3qx6.png" alt="Output 8 for the Facebook python interview question" width="712" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Calculate the total sales&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let us create new columns to store the value of total sales and the percentage of sales under valid promotions.&lt;br&gt;
Firstly, let’s calculate the total sales, which is simply the sum of valid_solds and invalid_solds.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;result['total'] = result['valid_solds'] + result['invalid_solds']&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The output looks as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T0d6HVYO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c442x1o6wcemiho2ptun.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T0d6HVYO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c442x1o6wcemiho2ptun.png" alt="Output 9 for the Facebook python interview question" width="712" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. Calculate the percentage of units sold under promotion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The last piece of computation is to capture the percentage of units sold under a valid promotion. We can calculate it using the formula in the code below:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;result['valid_solds_percentage'] = result['valid_solds'] / result['total'] * 100&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The output of this line is as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9nohCdGm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3our97lfri9urwmy329z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9nohCdGm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3our97lfri9urwmy329z.png" alt="Output 10 for the Facebook python interview question" width="880" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;9. Display the relevant columns, replacing any na’s with 0s&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Lastly, let us select the required columns for our output: &lt;strong&gt;product_family, total, valid_solds_percentage&lt;/strong&gt;, and add a final touch by replacing any na’s with zeroes.&lt;/p&gt;

&lt;p&gt;Before displaying the results, let’s be mindful of the edge case where a product family has no sales at all, i.e., the total units sold is zero. Subsequently, the total units sold under a promotion would also be zero, and dividing zero by zero in pandas produces a NaN rather than a number. We cover this edge case by replacing any NaNs with zeroes.&lt;/p&gt;

&lt;p&gt;Select the necessary columns as shown below:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;result[['product_family', 'total','valid_solds_percentage']].fillna(0)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;To get the complete picture, here is the full solution:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;import pandas as pd&lt;br&gt;
merged = facebook_sales.merge(facebook_products, how="outer", on="product_id")&lt;br&gt;
merged['valid_promotion'] = merged.promotion_id.map(lambda x: \&lt;br&gt;
        not pd.isna(x) and x \&lt;br&gt;
        in facebook_sales_promotions.promotion_id.unique())&lt;br&gt;
valid_promotion = merged[merged.valid_promotion]&lt;br&gt;
invalid_promotion = merged[~merged.valid_promotion]&lt;br&gt;
results_valid = valid_promotion.groupby('product_family')['units_sold'].sum().to_frame('valid_solds').reset_index()&lt;br&gt;
result_invalid = invalid_promotion.groupby('product_family')['units_sold'].sum().to_frame('invalid_solds').reset_index()&lt;br&gt;
result = results_valid.merge(result_invalid, how='outer', on='product_family').fillna(0)&lt;br&gt;
result['total'] = result['valid_solds'] + result['invalid_solds']&lt;br&gt;
result['valid_solds_percentage'] = result['valid_solds'] / result['total'] * 100&lt;br&gt;
result[['product_family', 'total','valid_solds_percentage']].fillna(0)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now let us run the complete code to get the expected result as shown below.&lt;/p&gt;

&lt;p&gt;All required columns and the first 5 rows of the solution are shown&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Onbu_TmK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mp08k58x3njm32hb5e2q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Onbu_TmK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mp08k58x3njm32hb5e2q.png" alt="Output 10 for the Facebook python interview question" width="839" height="273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you dive a little deeper into the result table, you can see that there are no sales for the product family ‘Accessory’. Had we not handled this edge case, our percentage calculation would have produced a missing value for it. You can see now how important it is to anticipate such cases at various points in the solution.&lt;/p&gt;
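&lt;p&gt;As an aside, the promotion-validity flag built above with map() and a lambda can also be written with isin(), which treats NaN as a non-member automatically. A sketch on toy data (table and column names follow the walkthrough; the values are made up):&lt;/p&gt;

```python
import pandas as pd

# Toy stand-ins for the merged table and the promotions table
merged = pd.DataFrame({'promotion_id': [1, 2, None, 5]})
facebook_sales_promotions = pd.DataFrame({'promotion_id': [1, 5]})

# isin() returns False for NaN, so no separate pd.isna() guard is needed
merged['valid_promotion'] = merged.promotion_id.isin(
    facebook_sales_promotions.promotion_id)
print(merged['valid_promotion'].tolist())
```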

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Despite the difficulty level of this Facebook Python interview question, it was not as complicated as you might have expected. We hope you learned something about JOINs, data transformation, and aggregations.&lt;/p&gt;

&lt;p&gt;Practice is the only way to mastery. Keep practicing from our &lt;a href="https://www.stratascratch.com/blog/python-coding-interview-questions/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;Python Coding Interview Questions&lt;/a&gt; article.&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>career</category>
    </item>
    <item>
      <title>How to Get Hired as a Data Scientist at Google</title>
      <dc:creator>StrataScratch</dc:creator>
      <pubDate>Tue, 01 Nov 2022 03:32:51 +0000</pubDate>
      <link>https://dev.to/nate_at_stratascratch/how-to-get-hired-as-a-data-scientist-at-google-103g</link>
      <guid>https://dev.to/nate_at_stratascratch/how-to-get-hired-as-a-data-scientist-at-google-103g</guid>
      <description>&lt;p&gt;&lt;em&gt;Everybody wants to work at Google, but how do you become one that works there?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8zcHIEMW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7nga2fzsmqguzz7kxl1v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8zcHIEMW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7nga2fzsmqguzz7kxl1v.png" alt="How to Get Hired as a Data Scientist at Google" width="880" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Why would you even want to work at Google? Every person has different career goals, motivations, and reasons for choosing a certain career and an employer. However, I think it would be safe to reduce the multitude of reasons to two:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Competitive salary&lt;/li&gt;
&lt;li&gt;Using and furthering your skills as a data scientist&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These two reasons work together. Of course, you want to be well paid, especially when data science is a specific field requiring multidisciplinary knowledge and education. You want to get compensated fairly for the time, effort, and money you invested in your education.&lt;/p&gt;

&lt;p&gt;While you need to make a living (unless you inherited a significant amount of wealth), why not make it a comfortable living while also doing something that you find interesting? You don’t just accidentally start working in data science, so it’s a safe bet that you’re here because you find data and data science interesting, regardless of money. And if you do, then you’d want to work at a top company that is a leader in innovation and the latest technologies. Working at such a company challenges your skills. By participating in the most varied and technically advanced data science projects, it gives you a platform for developing those skills further than you would be able to at most other companies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Google Data Scientist Salary
&lt;/h2&gt;

&lt;p&gt;One of the ways Google attracts top data scientists is by offering them a competitive salary. As a data scientist at Google, you can earn well above the US median salary (almost 250% of it), even in the most junior positions. Compensation usually includes not only the base salary but also cash and stock bonuses.&lt;/p&gt;

&lt;p&gt;Apart from this, other material benefits cover&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Insurance, Health, and Wellness&lt;/li&gt;
&lt;li&gt;Financial and Retirement&lt;/li&gt;
&lt;li&gt;Home&lt;/li&gt;
&lt;li&gt;Transportation&lt;/li&gt;
&lt;li&gt;Perks and Discounts,&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and other benefits unique to Google.&lt;/p&gt;

&lt;p&gt;Some of these benefits are health &amp;amp; life insurance, 401k, Student Loan Repayment Plan, remote work, adoption and surrogacy assistance, transportation allowance, free lunch and drinks,  tuition reimbursement, etc. You can learn more about salaries, benefits, and levels in the &lt;a href="https://www.stratascratch.com/blog/google-data-scientist-salary/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;Google Data Scientist Salary article.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Now, the Hard Part: Getting Hired!
&lt;/h2&gt;

&lt;p&gt;Knowing what salary you can get won’t, by itself, get you a job at Google, but it can serve as good motivation for getting the hard part done. How do you do that? The approach is defined by three aspects you need to pay attention to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Knowing Google’s hiring process&lt;/li&gt;
&lt;li&gt;Having skills they need&lt;/li&gt;
&lt;li&gt;Acing the job interview&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  1. What is Google’s Hiring Process?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uuQEd5o0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gic4ojrmvznai02aoiff.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uuQEd5o0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gic4ojrmvznai02aoiff.png" alt="Data Scientist at Google" width="880" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Knowing how Google hires is the first step in getting hired. From a high-level perspective, Google’s process is the same as in any other company and consists of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Applying for a job&lt;/li&gt;
&lt;li&gt;Interviews&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, details matter, and you should go into detail on what Google wants to see in your job application and what their interviews look like.&lt;/p&gt;

&lt;h2&gt;
  
  
  Job Application
&lt;/h2&gt;

&lt;p&gt;Google strongly advises that you do a little self-reflection before you jump into applying for a job. This means thinking about your skills, interests, goals, and motivations; that way, you can check whether you’re the right fit even before applying. When reflecting on your professional life and yourself as a person, consider whether you prefer working alone or as part of a team, what kind of work you find most rewarding, what your passions are, whether you get excited about solving a problem or discussing it, and so on.&lt;/p&gt;

&lt;p&gt;Once you decide to apply for a job at Google, you should know the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cover letters are not required&lt;/li&gt;
&lt;li&gt;Tailor CV to the specific position (even if you apply for multiple positions) – generic CVs are a big no-no!&lt;/li&gt;
&lt;li&gt;Keep CV concise and focused&lt;/li&gt;
&lt;li&gt;Highlight the skills that are required for the job you apply for&lt;/li&gt;
&lt;li&gt;Quantify your success at your previous job – ‘successful’, ‘quicker’, ‘more efficient’, ‘disruption’, and ‘data-driven’ are not metrics&lt;/li&gt;
&lt;li&gt;Mention your references – if you have somebody that can verify your work experience, projects you did, your character, and skills in general, that will ensure your resume will be looked at; it doesn’t guarantee you’ll get an interview, but it can increase your chances.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Interviews at Google
&lt;/h2&gt;

&lt;p&gt;Before you come to the interview stage, ensure you know Google as a company well. Be informed about their history, organization, values, and products. This will show you’re really interested in working at Google and that you didn’t apply for a job accidentally. Imagine that you send your resume, you come to the interview, and the interviewer doesn’t know your name or anything about your education or work history. You wouldn’t be happy, would you? The same goes for Google: they like to see that what you know about them makes you want to work for them.&lt;/p&gt;

&lt;p&gt;The first step before the interviews with Google is a phone call with the recruiter. They will ask you a few general questions about your work experience and interest in working at Google. They might also ask a simple technical question or two (e.g., a probability question or something easily solved in a minute) to get a general idea of whether you’re suitable for the position. However, the main point of this call is to understand your work experience and how it aligns with what Google is looking for.&lt;/p&gt;

&lt;p&gt;Then comes the interview process at Google, and there’s no big mystery here: Google itself lists the types of interviews you could expect. You just need to take some time to inform yourself about it and be prepared.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Online assessment&lt;/strong&gt; is the first elimination stage when it comes to interviews. This is usually a short test of your coding skills conducted online.&lt;/p&gt;

&lt;p&gt;If you get past this, then comes a &lt;strong&gt;short virtual chat or two.&lt;/strong&gt; These are not on-premises but over the phone or video chat. They involve a recruiter, hiring manager, and/or a colleague from the team asking you about the skills required for the job you applied for. The point is for them to get a picture of your technical profile and whether it generally suits the position. They can also discover that, even though you may be missing some of the required skills, you have other skills that would serve the particular job well.&lt;/p&gt;

&lt;p&gt;Depending on the job, Google might ask you to do &lt;strong&gt;project work.&lt;/strong&gt; This means doing a little project or providing some of your previous work/code.&lt;/p&gt;

&lt;p&gt;All these steps are where the candidates get eliminated before they get to the in-depth interviews. There are usually 3-4 interviews in one day, intended to assess your technical skills, problem-solving and thinking process, and personality traits.&lt;/p&gt;

&lt;p&gt;When it comes to testing your expertise, this is usually done through the following types of questions. They don’t come up every time; the types of questions you get heavily depend on the position you applied for.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Coding Questions&lt;/li&gt;
&lt;li&gt;Algorithm Questions&lt;/li&gt;
&lt;li&gt;Statistics Questions&lt;/li&gt;
&lt;li&gt;Modeling Questions&lt;/li&gt;
&lt;li&gt;Business Case Questions&lt;/li&gt;
&lt;li&gt;Product Questions&lt;/li&gt;
&lt;li&gt;Technical Questions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can find more about the whole process on the &lt;a href="https://careers.google.com/how-we-hire/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=medium#step-interviews"&gt;Google website.&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  2. What Skills Does Google Want to See in Data Scientists?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--E09Aq9II--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o2xyosl6uo1d327z1pxn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--E09Aq9II--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o2xyosl6uo1d327z1pxn.png" alt="What Skills Google Wants to See in Data Scientists?" width="880" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There’s no such thing as an ideal candidate. Google knows that because Google knows everything. Every candidate has unique skills and characteristics that could make them a desirable candidate. Google tries to select the candidates with the best combination of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hard skills and qualifications,  and&lt;/li&gt;
&lt;li&gt;Soft skills&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Hard Skills and Qualifications
&lt;/h2&gt;

&lt;p&gt;The general requirements for data scientists at Google are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Master's Degree in Statistics, Computer Science, or other relevant quantitative disciplines&lt;/li&gt;
&lt;li&gt;Relevant experience – you’ll need more of it to compensate if you’re lacking the required formal education level&lt;/li&gt;
&lt;li&gt;Programming languages: SQL and R/Python&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Depending on the position you’re applying for, some other specific requirements can be focused on the following areas: statistics, machine learning, AI, data analysis, data visualization, engineering, software development, products, etc.&lt;/p&gt;

&lt;h2&gt;
  
  
  Soft Skills
&lt;/h2&gt;

&lt;p&gt;Getting hired at Google requires high scores in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Interdisciplinarity&lt;/li&gt;
&lt;li&gt;Big-picture Perspective&lt;/li&gt;
&lt;li&gt;Being Customer-Oriented&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data science is an interdisciplinary field per se. It merges statistics, mathematics, and business knowledge. This interdisciplinarity is compounded by the requirement for data scientists to work with other sectors in Google, such as Product, Marketing, Engineering, etc.&lt;/p&gt;

&lt;p&gt;The essence of data science is problem-solving with the business outcome in mind. To solve problems, every company (including Google) introduces projects. The only way to complete the projects successfully is to be focused on the project outcome and know how to achieve this goal. With such a desirable skill, it’s no wonder Google wants to see big-picture energy from their data scientists.&lt;/p&gt;

&lt;p&gt;Every data science project at Google has business in mind, and when we say business, we mean customers. Everything you do will, directly or indirectly, be used by Google’s customers. Their satisfaction is key to Google keeping its market-leader position, just as it is to you getting the job. Show that you have this in you, and you’re one step closer to becoming a data scientist at Google.&lt;/p&gt;

&lt;p&gt;The final step for achieving this is performing well in the interviews.&lt;/p&gt;

&lt;p&gt;To get more details about Google’s hiring process and the skills they’re looking for, take a look at &lt;a href="https://www.stratascratch.com/blog/the-ultimate-guide-to-become-a-data-scientist-at-google/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;The Ultimate Guide to Become a Data Scientist at Google.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Acing Job Interview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CUjm2Kwl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o75k44om82ki4fyvgp1t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CUjm2Kwl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o75k44om82ki4fyvgp1t.png" alt="Acing Job Interview" width="880" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The central part of all the interviews you’ll have at Google are, one way or another, your technical skills.&lt;/p&gt;

&lt;p&gt;While you for sure don’t know which questions you’ll get, there are still ways for you to better your chances of getting a job.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Brush up your skills&lt;/li&gt;
&lt;li&gt;Have a clear approach to coding questions&lt;/li&gt;
&lt;li&gt;Be self-aware&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Brushing Up Your Skills
&lt;/h2&gt;

&lt;p&gt;You’re preparing for a job interview, right? Solving the actual job interview questions before the real job interview at Google seems quite logical.&lt;/p&gt;

&lt;p&gt;There are platforms where you can do exactly that: StrataScratch, LeetCode, SQLPad, or HackerRank, for example. There you can practice the SQL, Python, algorithm, and other technical skills tested by Google.&lt;/p&gt;

&lt;p&gt;There are also other ways to refresh your knowledge or learn something new. You have course websites (e.g., Coursera, Udemy, edX), YouTube channels (e.g., freeCodeCamp.org, Alex the Analyst, Amigoscode), blogs (LearnSQL.com, GeeksforGeeks, W3Schools), and data science communities (Stack Overflow, Reddit, GitHub, Codementor) at your disposal. While they don’t necessarily prepare you specifically for a Google job, these resources (and many others) can help you with the data science concepts you can readily apply at the Google job interview.&lt;/p&gt;

&lt;h2&gt;
  
  
  Framework for the Coding Questions
&lt;/h2&gt;

&lt;p&gt;It is crucial to write a correct solution at the coding interview; I don’t deny that. But the pressure and limited time of an interview can make even the most experienced look a level or two below their natural coding-master selves.&lt;/p&gt;

&lt;p&gt;To get around this, I advise that you always have a clearly defined framework of how to approach solving the coding questions. I found that these four general guidelines work best:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Explore the dataset&lt;/li&gt;
&lt;li&gt;Identify relevant columns&lt;/li&gt;
&lt;li&gt;Write out the code logic&lt;/li&gt;
&lt;li&gt;Code&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Exploring the dataset
&lt;/h2&gt;

&lt;p&gt;Exploring the dataset involves getting to know each table’s data structure. It also means detecting the shared columns between the tables, thus knowing how the tables can communicate. Along the way, get a sense of data types in each column and whether there might be duplicate or NULL values.&lt;/p&gt;

&lt;h2&gt;
  
  
  Identify relevant columns
&lt;/h2&gt;

&lt;p&gt;When you identify the relevant columns, you eliminate the unnecessary ones that can clutter your thinking and divert you when writing code. The interview questions often give you more data than you need, reflecting a data scientist's real life. Consider this a small test where you can show that you can differentiate between relevant and irrelevant data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Write out the code logic
&lt;/h2&gt;

&lt;p&gt;Before you start coding, it’s important to write out all the steps of your solution. Break down the code into logical blocks and/or individual steps, and decide on the functions you will use, why you will use them, and how. The code logic can be written in English (or any other language the interview is conducted in) or a pseudo-SQL/R/Python/any other programming language code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;Coding should, at this point, feel almost like a technicality. All the previous steps will make it possible for you to focus on the code syntax, its efficiency, and debugging. It also allows you to check the code logic and catch all the missing or unnecessary steps.&lt;/p&gt;

&lt;p&gt;This is how this framework can be applied to the &lt;a href="https://www.stratascratch.com/blog/google-data-scientist-interview-questions/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;Google Data Scientist Interview Questions.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Self-awareness
&lt;/h2&gt;

&lt;p&gt;Job interviews are stressful, draining, and require a lot of concentration. They are, to be honest, a pain in the ass. They can sometimes show the worst side of our characters. Don’t let this happen to you by considering three simple things.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slow down&lt;/li&gt;
&lt;li&gt;Be friendly&lt;/li&gt;
&lt;li&gt;Listen to the interviewers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When under stress, people tend to jump to answers, stop taking time for thinking, and talk too fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Slow down.&lt;/strong&gt; Allow yourself time to think about what you’re going to say, ask for clarification if you didn’t understand what was being asked, and try to be as articulate as you can when you talk.&lt;/p&gt;

&lt;p&gt;People often wrongly think that silence between the interviewer’s question and your answer shows you’re a slow thinker, low on self-confidence, or whatnot. No, it shows that you’re thinking, not simulating it. It shows you’re confident enough to take your time to come up with the best possible answer, which ultimately shows you can handle stressful situations – highly desirable skills for a data scientist! Also, if you talk at a medium pace, chances are better that you won’t blurt out something stupid, and everything smart that you say will be easily followed and acknowledged as smart by the interviewer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Be friendly.&lt;/strong&gt; You won’t be working alone in a cave atop some hill. Like it or not, you’ll work within a team and cooperate with other teams. Having different personalities in a team or across teams is desirable, but this has limits. Nobody wants to work with a person who drains the energy from everyone else, starts petty fights, takes credit for someone else’s work, or sabotages everybody else. Google wants people whom others enjoy working with, so remaining friendly and good-spirited under pressure is something they’ll look for. The interview is a perfect opportunity to showcase this side of yourself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Listening&lt;/strong&gt; is equally important as talking. Pay attention to what the interviewer asks so you can answer their questions. Don’t interrupt them, but ask questions if you want something to be clarified.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Getting hired at Google starts with knowing what you want and finding an ad for a job that you’d like. Then comes the part where you apply for a job by satisfying specific format requirements and make yourself familiar with Google itself: its hiring process and other aspects of the way they operate.&lt;/p&gt;

&lt;p&gt;When you come to the interview stage, you must know what to expect there: what types of interviews they conduct and which topics they cover. Once you know that, prepare yourself as best you can. Use various sources, such as job interview question examples, YouTube channels, blog articles, and courses, or get involved with the data science community to ask about technical concepts and others’ experiences of getting hired by Google.&lt;/p&gt;

&lt;p&gt;These steps prepare you to shine in a job interview, where you can confidently showcase your hard and soft skills. In other words, the best version of yourself.&lt;/p&gt;

</description>
      <category>career</category>
      <category>datascience</category>
      <category>community</category>
      <category>motivation</category>
    </item>
    <item>
      <title>Statistics Cheat Sheet: Data Collection and Exploration</title>
      <dc:creator>StrataScratch</dc:creator>
      <pubDate>Thu, 27 Oct 2022 03:03:37 +0000</pubDate>
      <link>https://dev.to/nate_at_stratascratch/statistics-cheat-sheet-data-collection-and-exploration-1amd</link>
      <guid>https://dev.to/nate_at_stratascratch/statistics-cheat-sheet-data-collection-and-exploration-1amd</guid>
      <description>&lt;p&gt;&lt;em&gt;This Statistics Cheat Sheet includes the concepts that you must know for your next data science or analytics interview presented in an easy-to-remember manner.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eZOFNk9T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m9jxbqfwjlsxdlbyc2vd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eZOFNk9T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m9jxbqfwjlsxdlbyc2vd.png" alt="Statistics Cheat Sheet" width="880" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Statistics is the foundation of Data Analysis and Data Science. While many aspirants concentrate on learning fancy algorithms with arcane names, they neglect the fundamentals and end up messing up their interviews. Without an in-depth understanding of statistics, it is difficult to make a serious &lt;a href="https://www.stratascratch.com/blog/a-complete-guide-to-data-scientist-career-path/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;career in data science&lt;/a&gt;. One need not have a Ph.D., but one must understand the basic math and intuition behind the statistical methods in order to be successful. In this series, we will go through the fundamentals of statistics that you must know in order to clear your next Data Science Interview.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Statistics
&lt;/h2&gt;

&lt;p&gt;Statistics is the science pertaining to the collection, analysis, interpretation or explanation, and presentation of data. There are three main pillars of statistics.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data Collection and Exploration&lt;/li&gt;
&lt;li&gt;Probability&lt;/li&gt;
&lt;li&gt;Statistical Inference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this three-part series, we will look at the major areas of statistics relevant to a budding Data Scientist. In this part, we will look at Data Collection and Exploration. We have a fantastic set of &lt;a href="https://www.stratascratch.com/blog/categories/statistics/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;statistics blog articles&lt;/a&gt; that you can find here. Some recommended articles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.stratascratch.com/blog/ab-testing-data-science-interview-questions-guide/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;A/B Testing for Data Science Interviews&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.stratascratch.com/blog/basic-types-of-statistical-tests-in-data-science/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;Basic Types of Statistical Tests&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.stratascratch.com/blog/30-probability-and-statistics-interview-questions-for-data-scientists/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;Probability Interview Questions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also, check out our comprehensive “&lt;a href="https://www.stratascratch.com/blog/a-comprehensive-statistics-cheat-sheet-for-data-science-interviews/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;statistics cheat sheet&lt;/a&gt;” that goes beyond the very fundamentals of statistics (like mean/median/mode).&lt;/p&gt;

&lt;h2&gt;
  
  
  Sample vs Population
&lt;/h2&gt;

&lt;p&gt;In order to analyze data, it is important to collect it. In statistics, we usually set out to examine a population. A population can be considered to be a collection of objects, people, or natural phenomena under study. For example, the income of a graduate fresh out of college, the weight of a donut, or the time spent on smartphones. Since it is not always possible or it is prohibitively expensive (or both) to collect data about the entire population, we rely on a subset of the population.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2CK6gIfr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z8ds11xr70o2mj2a78ax.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2CK6gIfr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z8ds11xr70o2mj2a78ax.png" alt="Sample vs Population in Statistics Cheat Sheet" width="447" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This subset (or sample), if chosen properly, can help us understand the entire population with relative surety and make our decisions. From this sample data (for example, the incomes of people), we calculate a statistic (for example, the typical income). This statistic represents a property of a sample. The statistic is an estimate of a population parameter (the typical income of all Americans).&lt;/p&gt;
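&lt;p&gt;As a quick sketch of the statistic-vs-parameter idea (in Python, with a made-up income population — the numbers here are purely hypothetical):&lt;/p&gt;

```python
import random

random.seed(42)
# Hypothetical population: yearly incomes (in $k) of 10,000 people.
population = [random.gauss(60, 15) for _ in range(10_000)]

# Population parameter: the true mean income of everyone.
population_mean = sum(population) / len(population)

# Sample statistic: the mean of a 200-person simple random sample,
# which serves as an estimate of the population parameter.
sample = random.sample(population, 200)
sample_mean = sum(sample) / len(sample)
```

&lt;p&gt;The sample mean will typically land close to, but not exactly on, the population mean — which is precisely why sampling works.&lt;/p&gt;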

&lt;h2&gt;
  
  
  Sampling Methods
&lt;/h2&gt;

&lt;p&gt;For a sample to be representative of the population, it should have the same characteristics as the rest of the population. For example, if one were to survey the attitudes of Americans towards conservative values, then the students of the liberal arts department from a Blue state college might not be the best representation. Statisticians use multiple ways to ensure that the sample is random and truly representative of the entire population. Here we look at some of the most common methods used for sampling. Each method has its pros and cons; we will look into those as well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Simple Random Sample
&lt;/h2&gt;

&lt;p&gt;The simplest of the sampling methods is the simple random sample. One randomly picks a subset of the entire population, so that each person has the same probability of being chosen as every other person.&lt;br&gt;
​&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--U8KQLt01--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9q4r5vsgt33s7uv25tbb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U8KQLt01--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9q4r5vsgt33s7uv25tbb.png" alt="Statistics Cheat Sheet" width="705" height="87"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;where n is the size of the population.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0zj7v-LV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zd0ls4od8xb8bdn3azmp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0zj7v-LV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zd0ls4od8xb8bdn3azmp.png" alt="Simple random sample" width="512" height="485"&gt;&lt;/a&gt;&lt;/p&gt;
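&lt;p&gt;Drawing a simple random sample takes one line with Python’s standard library (the population here is hypothetical):&lt;/p&gt;

```python
import random

random.seed(0)
population = list(range(1, 101))  # a hypothetical population of 100 units

# Each unit has the same chance of being chosen; no unit is repeated.
sample = random.sample(population, 10)
```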

&lt;p&gt;A simple random sample has two properties that make it the standard against which we compare all the other sampling methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bias:&lt;/strong&gt; A simple random sample is unbiased. In other words, each unit has the same chance of being chosen as every other unit. There is no preference given to a particular unit or units.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Independence:&lt;/strong&gt; The selection of one does not influence the chances of selection of another unit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, in the real world, a completely unbiased and independent sample is very difficult (if not impossible) to find. One of the most common instances is the underrepresentation of &lt;a href="https://www.nytimes.com/2017/05/31/upshot/a-2016-review-why-key-state-polls-were-wrong-about-trump.html#:~:text=It%E2%80%99s%20no%20small,census%20voting%20data."&gt;less educated voters&lt;/a&gt; in the samples used for the 2016 Election Polling. While it is possible to generate a list of completely random respondents, the final results might be skewed because people may not respond, thus &lt;a href="https://www.pewresearch.org/fact-tank/2016/11/09/why-2016-election-polls-missed-their-mark/"&gt;skewing the sample and diverging from population characteristics&lt;/a&gt;. There are more effective and efficient ways to sample a population if we know something about the population.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stratified Sample
&lt;/h2&gt;

&lt;p&gt;In a stratified sample, we divide the population into homogeneous groups (or strata) and then take a proportionate number from each stratum. For example, we can divide a college into various departments and then take a random sample from each department in proportion to its size.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mGhhqkGC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ey8d6xol5hwamnygrz3n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mGhhqkGC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ey8d6xol5hwamnygrz3n.png" alt="Stratified Sample in Statistics Cheat Sheet" width="512" height="506"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An example of what stratified sampling could look like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QTK6uysp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/209n0sqebslixfuvqqsv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QTK6uysp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/209n0sqebslixfuvqqsv.png" alt="Example of how stratified sampling could look" width="705" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here the sample represents 1% of each region. A more complex example can be when we introduce multiple characteristics. For example, let us bifurcate each region by gender as well. The sampling process would look something like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---gMQb202--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y5je3l0kfkz7su0nkkw0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---gMQb202--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y5je3l0kfkz7su0nkkw0.png" alt="Stratified sampling in statistics cheat sheet" width="713" height="498"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The major advantage of stratified sampling is that it captures the key characteristics of the population in the sample. As with a weighted average, stratified sampling produces characteristics that are proportional to the overall population. However, if the strata cannot be formed, then the method can lead to erroneous results. It is also time-consuming and relatively more expensive as the analysts have to identify each member of the sample and classify them into exactly one of the strata. Further, there might be cases where the members might fall into multiple strata. In such a scenario, the sample might misrepresent the population.&lt;/p&gt;
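&lt;p&gt;The college-departments example can be sketched in Python like this (the department sizes are invented for illustration — the point is that each stratum contributes the same fraction):&lt;/p&gt;

```python
import random

random.seed(1)
# Hypothetical strata: department name -> list of student ids.
departments = {
    "Engineering": list(range(0, 500)),
    "Arts": list(range(500, 800)),
    "Science": list(range(800, 1000)),
}

# Take the same fraction (10%) from every stratum, preserving proportions.
rate = 0.10
stratified_sample = {
    dept: random.sample(members, int(len(members) * rate))
    for dept, members in departments.items()
}
```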

&lt;h2&gt;
  
  
  Cluster Sample
&lt;/h2&gt;

&lt;p&gt;Sometimes it is cost-effective to select survey respondents in clusters. For example, instead of going through each building in a town and randomly sampling the respondents, one could randomly select some of the buildings (clusters) and survey all the residents living in them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eY4AbUmZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xzi1pwfli8jrvrd0oi52.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eY4AbUmZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xzi1pwfli8jrvrd0oi52.png" alt="Cluster Sampling in statistics cheat sheet" width="512" height="179"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This can result in an increase in speed and greater cost savings owing to reduced logistical requirements. The cluster sample method’s effectiveness depends on how representative the members of the chosen clusters are when compared to the population. To alleviate this, the sample size for cluster sampling is usually larger than that for simple random sampling, as the characteristics of the members in a cluster usually tend to be similar and may not capture all the population characteristics. However, the cost savings on account of reduced travel and time might still mean that even with the additional sample size, cluster sampling turns out to be the cheaper option.&lt;/p&gt;

&lt;p&gt;Cluster sampling can be further optimized by using multi-stage clustering. As the name suggests, in multi-stage clustering, once the clusters are chosen, they are further divided into smaller clusters, reducing costs further. For example, suppose we wanted to measure the learning abilities of students across the country. We start by clustering on the basis of states; within the chosen states, we cluster on the basis of schools, choose a random sample of those schools, and survey their students. This approach is commonly used for national surveys of employment, health, and household statistics.&lt;/p&gt;
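&lt;p&gt;The buildings example can be written as a one-stage cluster sample in a few lines (the town and its residents are, of course, made up):&lt;/p&gt;

```python
import random

random.seed(7)
# Hypothetical town: building id -> list of its residents.
buildings = {b: [f"b{b}_r{i}" for i in range(10)] for b in range(50)}

# One-stage cluster sample: pick 5 buildings (clusters) at random,
# then survey every resident of the chosen buildings.
chosen_buildings = random.sample(sorted(buildings), 5)
respondents = [r for b in chosen_buildings for r in buildings[b]]
```

&lt;p&gt;A multi-stage version would repeat the random selection inside each chosen cluster instead of taking everyone.&lt;/p&gt;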

&lt;h2&gt;
  
  
  Systematic Sample
&lt;/h2&gt;

&lt;p&gt;Another widely used sampling method for large population sizes is Systematic Sampling. In this process, using a random starting position, every kth element is chosen to be included in the sample.&lt;/p&gt;

&lt;p&gt;For example, we might choose to sample every third person entering a building.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FDYOUSl9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bufesn75huqvqas58bj4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FDYOUSl9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bufesn75huqvqas58bj4.png" alt="Systematic Sampling in statistics cheat sheet" width="512" height="184"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is a quick and convenient method and gives a result similar to a simple random sample if the interval is chosen carefully. This method is used very widely as it is simple to implement and explain. One of the use cases of systematic sampling is for conducting exit polls during elections. A systematic sample makes it easier to separate groups of voters who might all be voting for the same person or party.&lt;/p&gt;
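&lt;p&gt;The every-kth-person rule translates directly into a slice with a random start (again with a hypothetical population):&lt;/p&gt;

```python
import random

random.seed(3)
population = list(range(1, 91))  # 90 hypothetical people entering a building
k = 3                            # sampling interval: every 3rd person

start = random.randrange(k)      # random starting position within the first interval
sample = population[start::k]    # then take every kth element
```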

&lt;h2&gt;
  
  
  Convenience Sample
&lt;/h2&gt;

&lt;p&gt;Another method not usually recommended but used because of circumstances is convenience sampling. Also known as grab sampling or opportunity sampling, the method involves grabbing whatever sample is available.&lt;/p&gt;

&lt;p&gt;For example, on account of a lack of time or resources, the analyst might choose to sample only from their neighboring homes and offices instead of trying to find respondents from across the entire city.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lpyHiK1o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nvooj7c3cjo8seo2q3w9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lpyHiK1o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nvooj7c3cjo8seo2q3w9.png" alt="Convenience Sampling in statistics cheat sheet" width="512" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you might anticipate, this method can be unreliable and hence not recommended. However, it might be the only way to collect data sometimes. For example, instead of trying to contact all the users of marijuana, one might choose to just go to the nearest college dorm and survey the attendees of a party.&lt;/p&gt;

&lt;p&gt;The advantages of this method include convenience, speed, and cost-effectiveness. However, the collected samples may not be truly random or representative of the population. Still, it does provide some information, as opposed to just the analyst’s hunches. This method is widely used in pilot testing or &lt;a href="https://en.wikipedia.org/wiki/Minimum_viable_product"&gt;MVPs&lt;/a&gt; for testing and launching new products.&lt;/p&gt;

&lt;h2&gt;
  
  
  Descriptive Statistics
&lt;/h2&gt;

&lt;p&gt;Now that we have found out how to collect the data, let us move to the next step - analyzing the collected data. Displaying and describing the collected data in the forms of graphs and numbers is called Descriptive Statistics. Let us use some data points to analyze this. We use a hypothetical dataset of 200 students from a program with salary offers from three companies A, B, and C. You can find the &lt;a href="https://github.com/viveknest/statascratch-solutions/blob/main/salaries_dataset.csv"&gt;dataset here.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6VG_Uf_h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yarr4e3z808bgmf848u0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6VG_Uf_h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yarr4e3z808bgmf848u0.png" alt="Descriptive Statistics" width="743" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we had a limited number of data points, we could have simply plotted a bar graph like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--s_oNkiZK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1begsa68ccx0bkx26wge.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--s_oNkiZK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1begsa68ccx0bkx26wge.png" alt="Descriptive Statistics Graph" width="512" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, if we try to do this for our full dataset, we will end up with something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--verkXMwG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g6n0vkiivwa0416wty6l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--verkXMwG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g6n0vkiivwa0416wty6l.png" alt="Descriptive Statistics for full dataset" width="512" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you see, if we were to look at each student one by one, the process can become quite tedious and overwhelming. An easier way is to summarize the data. That is where descriptive statistics come into play. Let us look at a couple of ways.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stem and Leaf Plot
&lt;/h2&gt;

&lt;p&gt;One of the oldest plots is the stem and leaf plot. The idea is to divide each value into a stem and a leaf: the last significant digit is the leaf, and the remaining digits form the stem. For example, in the case of the number 239, 9 is the leaf and 23 is the stem. For the number 53, 3 is the leaf and 5 is the stem. To draw the plot, we write the stems in order vertically and then write each of the leaves in ascending order. For our salary dataset, the stem and leaf plot for Salary A will look like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5Zuyp7a---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kk1l5bwzwbfrdsyuaoym.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5Zuyp7a---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kk1l5bwzwbfrdsyuaoym.png" alt="Stem and Leaf Plot in Statistics Cheat Sheet" width="497" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above graph shows that there is one observation in the range 10 - 19, which is 12. In the range 40 - 49, we have four observations 42, 45, 45, and 46. The cumulative frequency is also shown in the leftmost column.&lt;/p&gt;
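&lt;p&gt;The stem/leaf split described above is just integer division by 10, which makes it easy to build such a plot programmatically (the observations below are a small hypothetical subset, not the full dataset):&lt;/p&gt;

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group each value's last digit (leaf) under its stem (the remaining digits)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 239 -> stem 23, leaf 9
        plot[stem].append(leaf)
    return dict(plot)

# A few hypothetical salary observations (in $k).
observations = [12, 42, 45, 45, 46, 53, 57]
# stem_and_leaf(observations) -> {1: [2], 4: [2, 5, 5, 6], 5: [3, 7]}
```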

&lt;p&gt;We can similarly plot the stem graphs for Company B and Company C's salaries as well.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eQqGg9kg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2xexzlxf9canilibx0rn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eQqGg9kg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2xexzlxf9canilibx0rn.png" alt="Plot the stem graphs in Statistics Cheat Sheet" width="398" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CjHJy4Dp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sn5cwav672x9r6d12kxx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CjHJy4Dp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sn5cwav672x9r6d12kxx.png" alt="Stem and leaf plot salaries in Statistics Cheat Sheet" width="398" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With larger datasets and a variety of values, it becomes unwieldy, as you can see in the case of Salaries for Company B. To alleviate these problems, we have a histogram.&lt;/p&gt;

&lt;h2&gt;
  
  
  Histograms
&lt;/h2&gt;

&lt;p&gt;Histograms extend the concept of stem and leaf plots with the difference being that instead of dividing the numbers into tens, we can decide how we want to group these numbers. As with the stem and leaf plot, we first decide the bins (or buckets) that we would like the numbers to be in. We then count the frequency in each bin and then plot the values in a bar graph. So if we decide to bin the numbers in 10s, we will get the same graph as the stem and leaf plot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OhFUSdFt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8kg4m4qul9jvqrx3hdio.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OhFUSdFt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8kg4m4qul9jvqrx3hdio.png" alt=" Histograms in statistics" width="512" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also create bins in the 20s. This is what the graph will look like.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EhFKFhft--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h7nmsp8mx5n4oujdlkrs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EhFKFhft--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h7nmsp8mx5n4oujdlkrs.png" alt="Histograms in statistics bins of 20s" width="512" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can observe, the counts in each bin are higher. This is natural, as the wider bins now contain more observations. As with a stem and leaf plot, histograms are a good way to examine the spread of the data. Let us look at how the histograms of the other company salaries look.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---AJAgpJX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j5h42ab9dv81qmn1d7ze.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---AJAgpJX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j5h42ab9dv81qmn1d7ze.png" alt="Histograms salaries in Statistics Cheat Sheet" width="512" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pb6Zb3rK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eohkh5gvz702yf3zmd0n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pb6Zb3rK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eohkh5gvz702yf3zmd0n.png" alt="Histograms salaries in Statistics Cheat Sheet" width="512" height="355"&gt;&lt;/a&gt;&lt;/p&gt;
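&lt;p&gt;The binning step behind every histogram — choose a bin width, then count how many values fall into each bin — can be sketched as follows (using the same hypothetical observations as before):&lt;/p&gt;

```python
def histogram(values, bin_width):
    """Count how many values fall into each [lo, lo + bin_width) bin."""
    counts = {}
    for v in values:
        lo = (v // bin_width) * bin_width  # lower edge of the bin v falls into
        counts[lo] = counts.get(lo, 0) + 1
    return dict(sorted(counts.items()))

observations = [12, 42, 45, 45, 46, 53, 57]  # hypothetical salaries in $k
```

&lt;p&gt;With a bin width of 10 this reproduces the stem and leaf counts; with a width of 20 the bins widen and their counts grow, exactly as described above.&lt;/p&gt;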

&lt;p&gt;While examining data visually is helpful, we need some measurements to provide us with information about the characteristics of the data. The most vital set of measurements are the central tendency (a typical value that describes the data) and the spread of the values in the dataset. Let us look at the commonly used numerical statistics used to describe a dataset.&lt;/p&gt;

&lt;h2&gt;
  
  
  Measures of Central Tendency
&lt;/h2&gt;

&lt;p&gt;The measures of central tendency describe what a typical value in the data would look like. In the case of our salaries, think of it as what a typical salary from the three companies looks like. There are three commonly used measures of central tendency - the mean, median, and mode. Let us look at these in detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mean
&lt;/h2&gt;

&lt;p&gt;Mean (or average) is the most widely used measure of central tendency. The mean for a data set is calculated by dividing the sum of all observations in the dataset by the number of observations.&lt;/p&gt;

&lt;p&gt;For a dataset with n values&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lcYGt_W4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n86zhzu3d2r9rgckqvtb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lcYGt_W4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n86zhzu3d2r9rgckqvtb.png" alt="dataset with n values in Statistics Cheat Sheet" width="702" height="75"&gt;&lt;/a&gt;&lt;br&gt;
​&lt;/p&gt;

&lt;p&gt;the mean usually denoted by&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MK0wriLH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sbx09lxsxpd0do2tr3hx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MK0wriLH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sbx09lxsxpd0do2tr3hx.png" alt="Statistics Cheat Sheet" width="702" height="75"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;is given by&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LBWdBX1a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p2fo46460dwqa4ndfv22.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LBWdBX1a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p2fo46460dwqa4ndfv22.png" alt="Statistics Cheat Sheet" width="702" height="75"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In simple words,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6JHrjzAx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w25csj98s00shh9gza7b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6JHrjzAx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w25csj98s00shh9gza7b.png" alt="Statistics Cheat Sheet" width="702" height="75"&gt;&lt;/a&gt;​&lt;/p&gt;

&lt;p&gt;Let us calculate the means of each of the three companies’ salaries. If you are using spreadsheet software, you can use the AVERAGE function to calculate means.&lt;/p&gt;

&lt;p&gt;Salary A(k)    101.525&lt;br&gt;
Salary B(k)     94.760&lt;br&gt;
Salary C(k)     87.590&lt;/p&gt;
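&lt;p&gt;The same calculation in Python is a one-liner; the values below are hypothetical stand-ins, not the actual dataset:&lt;/p&gt;

```python
values = [101, 98, 105, 102]  # hypothetical salary offers in $k

# Mean = sum of all observations divided by the number of observations.
x_bar = sum(values) / len(values)
```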

&lt;p&gt;Let us see where the mean lies on the histograms.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--S212G9XT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rp4ceuwxrq9imyv206uz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--S212G9XT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rp4ceuwxrq9imyv206uz.png" alt="Histogram salaries in Statistics Cheat Sheet" width="512" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bw0c_CLv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gs3xdjr83em2secae1zd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bw0c_CLv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gs3xdjr83em2secae1zd.png" alt="Histogram salaries in Statistics Cheat Sheet" width="512" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Mwfyolbm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8p0odhjlgf3ks2nftd9a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Mwfyolbm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8p0odhjlgf3ks2nftd9a.png" alt="Histogram salaries in Statistics Cheat Sheet" width="512" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While the means for Companies A and C appear alright at first glance, the mean for Company B appears a bit misleading.&lt;/p&gt;

&lt;p&gt;If one does a quick visual calculation (or uses the stem and leaf plots), more than half of the students (over 100) were offered a salary of 70k or lower, while the calculated mean was around 95k.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mrmktdPo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/byctlgy35hku6p26qows.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mrmktdPo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/byctlgy35hku6p26qows.png" alt="Stem and leaf plots salaries" width="451" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is one of the problems with the mean. For a balanced dataset, the mean represents the middle value, but for an asymmetrical distribution it can be misleading, since a few extreme values can pull it away from the bulk of the data. If you observe, seven students were offered salaries in excess of 250k. Because of this, we cannot always rely on the mean alone. This leads us nicely to the next measure - the median.&lt;/p&gt;

&lt;h2&gt;
  
  
  Median
&lt;/h2&gt;

&lt;p&gt;The median is simply the middle value of the dataset when the observations are ordered. To calculate it, we arrange the values in ascending or descending order and pick the middle value. For example, for the observations 18, 35, 7, 20, and 27, we start by ordering them.&lt;/p&gt;

&lt;p&gt;7, 18, 20, 27, 35&lt;/p&gt;

&lt;p&gt;Now we pick the middle value, which in this case is 20. If we have an even number of values, then we pick the average of the two middle values. For example, if we add another observation 42 to the above, we will get the following ordered values.&lt;/p&gt;

&lt;p&gt;7, 18, 20, 27, 35, 42&lt;/p&gt;

&lt;p&gt;In this case, the median will be the average of the two middle values, 20 and 27.&lt;/p&gt;

&lt;p&gt;Therefore,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kIJ_5w40--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/39t7okc6i6hyc49zpub3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kIJ_5w40--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/39t7okc6i6hyc49zpub3.png" alt="Statistics Cheat Sheet" width="704" height="102"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note: The median divides the dataset into two halves each containing the same number of observations.&lt;/p&gt;
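&lt;p&gt;The median calculations above can be reproduced with Python’s standard library:&lt;/p&gt;

```python
import statistics

# Odd number of observations: the middle value of the sorted data
print(statistics.median([18, 35, 7, 20, 27]))      # 20

# Even number of observations: the average of the two middle values
print(statistics.median([18, 35, 7, 20, 27, 42]))  # 23.5
```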

&lt;p&gt;Let us find the medians for the three datasets and plot them on the histograms.&lt;/p&gt;

&lt;p&gt;Salary A(k)    101.0&lt;br&gt;
Salary B(k)     78.0&lt;br&gt;
Salary C(k)     90.5&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TyA_nU-s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5bxj8gxzia9acv4o6jbx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TyA_nU-s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5bxj8gxzia9acv4o6jbx.png" alt="Example of median in statistics" width="512" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2fJpDn1g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dadgc71zvsafhm1oy52f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2fJpDn1g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dadgc71zvsafhm1oy52f.png" alt="Example of median in statistics" width="512" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--z5rR3tTX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o0b7p72moez0lplalra5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--z5rR3tTX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o0b7p72moez0lplalra5.png" alt="Example of median in statistics" width="512" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we had expected, the medians for Companies A and C are pretty close to their means, but the median for B is separated from its mean by almost a full bin. One of the advantages of the median is that it is not easily affected by extreme values. It is therefore preferred for datasets that are not balanced, i.e., that do not have a roughly equal spread of observations on either side of the mean.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mode
&lt;/h2&gt;

&lt;p&gt;Another widely used measure is the mode. The mode is the most frequent observation in the dataset. Let us calculate the mode with a simple example.&lt;/p&gt;

&lt;p&gt;Suppose the ages of a group of five students are 23, 21, 18, 21, and 20. The mode for this data is 21, since it appears the most times. A dataset can have multiple modes as well. For example, if the ages were 18, 23, 21, 23, and 18, then the dataset has two modes, 18 and 23, since both values appear twice. Such data is called multimodal data.&lt;/p&gt;
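&lt;p&gt;The mode examples above can also be checked with Python’s standard library:&lt;/p&gt;

```python
import statistics

ages = [23, 21, 18, 21, 20]
print(statistics.mode(ages))        # 21

# multimode returns every value tied for the highest count,
# in first-encountered order, so it handles multimodal data
ages2 = [18, 23, 21, 23, 18]
print(statistics.multimode(ages2))  # [18, 23]
```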

&lt;p&gt;Let us calculate the mode and plot them on the histograms.&lt;/p&gt;

&lt;p&gt;Salary A(k)    92&lt;br&gt;
Salary B(k)     39, 105&lt;br&gt;
Salary C(k)     95&lt;/p&gt;

&lt;p&gt;Note for Salaries offered by Company B, there are two values that appear the most number of times (39 and 105).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MCkN7TRC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8bckkchzw06969axybaz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MCkN7TRC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8bckkchzw06969axybaz.png" alt="Example of mode in statistics" width="512" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gBz7FCng--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vp4zq2q5916c8fklykhi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gBz7FCng--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vp4zq2q5916c8fklykhi.png" alt="Example of mode in statistics" width="512" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--recpllY1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8rhbv0wry357r8g8otzq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--recpllY1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8rhbv0wry357r8g8otzq.png" alt="Example of mode in statistics" width="512" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Measures of Spread
&lt;/h2&gt;

&lt;p&gt;In statistics, spread (also called dispersion, variability, or scatter) is the extent to which the data is stretched or squeezed. Think of it as a measure of how far from the center the data tends to lie. For instance, if everyone were offered the same salary, the spread would be 0. We can evaluate the spread using a histogram: for thinly spread data the histogram will be skinny, as with the salaries offered by Companies A and C, whereas for a dataset with a greater range of values the histogram will be wider, as in the case of Company B. Let us look at the mathematical measures used to evaluate the spread.&lt;/p&gt;

&lt;h2&gt;
  
  
  Range
&lt;/h2&gt;

&lt;p&gt;The range of the dataset is the difference between the highest and the lowest values. The range of three datasets is as follows:&lt;/p&gt;

&lt;p&gt;Salary A(k)    218&lt;br&gt;
Salary B(k)    338&lt;br&gt;
Salary C(k)     99&lt;/p&gt;

&lt;p&gt;This is in line with what we saw visually.&lt;/p&gt;

&lt;h2&gt;
  
  
  Interquartile Range (IQR)
&lt;/h2&gt;

&lt;p&gt;While the range gives a good idea about the spread of the dataset, as with the mean, it is prone to being influenced by extreme values on either side of the spectrum. We therefore use a more nuanced version of the range called the Interquartile Range (or IQR for short). Quartiles are an extension of the concept of the median. Just as the median divides the dataset into two halves, each containing an equal number of observations, quartiles divide the dataset into four parts, each containing an equal number of observations. The quarter boundaries are represented by Q1, Q2, Q3, and Q4, the first, second, third, and fourth quartile, respectively.&lt;/p&gt;

&lt;p&gt;The first quartile is the maximum value of the bottom 25% of the values (by magnitude), the second quartile bounds the next 25%, and so on. Let us plot the four quartiles on the histogram of the salaries offered by Company A.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JbJC4apD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n9ihb7u9nn6vhmdjelvh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JbJC4apD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n9ihb7u9nn6vhmdjelvh.png" alt="Interquartile Range in Statistics Cheat Sheet" width="512" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you might have guessed, the median is the second quartile (Q2). The numerical values are:&lt;/p&gt;

&lt;p&gt;Q1: 81.75&lt;br&gt;
Q2: 101&lt;br&gt;
Q3: 119.25&lt;br&gt;
Q4: 230&lt;/p&gt;

&lt;p&gt;IQR measures the spread between the first and the third quartiles or the range of the middle 50% of the values excluding the top 25% and bottom 25% of the observations.&lt;/p&gt;

&lt;p&gt;IQR = Q3 - Q1&lt;/p&gt;

&lt;p&gt;Let us calculate the IQR for the three salaries.&lt;/p&gt;

&lt;p&gt;Salary A(k): 37.5&lt;br&gt;
Salary B(k): 100.5&lt;br&gt;
Salary C(k): 27.5&lt;/p&gt;

&lt;p&gt;The trend is similar to what we saw earlier, but this also shows that the middle 50% values for Company A are relatively closely packed.&lt;/p&gt;
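&lt;p&gt;The range and IQR calculations can be sketched with Python’s standard library. Since the salary data isn’t reproduced here, this uses the small toy dataset from the median example; note that quartile conventions vary slightly between tools:&lt;/p&gt;

```python
import statistics

# Toy dataset (not the salary data from the article)
data = [7, 18, 20, 27, 35, 42]

# Range: difference between the highest and lowest values
data_range = max(data) - min(data)  # 42 - 7 = 35

# Quartile cut points [Q1, Q2, Q3] via linear interpolation;
# method="inclusive" matches common spreadsheet/pandas behavior
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1
print(data_range, q1, q2, q3, iqr)  # 35 18.5 23.5 33.0 14.5
```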

&lt;p&gt;The IQR value is used for constructing the box plot (also called the box and whiskers plot).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2MPBROCe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vktahsumuybrok5kz3i6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2MPBROCe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vktahsumuybrok5kz3i6.png" alt="IQR value for constructing the box plot" width="512" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let us deconstruct the box-plot.&lt;/p&gt;

&lt;p&gt;The box represents the middle 50% of the data: its ends are Q1 and Q3, and the line inside the box is the median. The whiskers extend up to 1.5 IQR below Q1 and 1.5 IQR above Q3. The range of values from Q1 - 1.5 IQR to Q3 + 1.5 IQR is popularly known as the fence; for balanced distributions, this is the range of acceptable values. Values outside the fence are outliers (extreme values).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Zrt_6Vz1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vnb0mq72mt65z7d1yda2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Zrt_6Vz1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vnb0mq72mt65z7d1yda2.png" alt="IQR value for constructing the box plot" width="704" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Variance and Standard Deviation
&lt;/h2&gt;

&lt;p&gt;So far we have used only the extreme and quartile values to measure the spread. The most widely used measure of spread is the standard deviation (along with the variance). The standard deviation measures the difference of each value from the mean of the dataset and condenses the spread of the data into a single number.&lt;/p&gt;

&lt;p&gt;Let's take a simple dataset to show the calculations involved. Suppose we observe the following temperatures (in Fahrenheit) over the course of five days: 82, 93, 87, 91, and 92. The mean of these values is&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--is0YGdGQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vud9j0bmn1vl71qlzlzu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--is0YGdGQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vud9j0bmn1vl71qlzlzu.png" alt="Statistics Cheat Sheet" width="704" height="102"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We need to find how much each value differs from the mean. We can find this by subtracting the mean from each value. We get the following.&lt;/p&gt;

&lt;p&gt;(82 - 89), (93 - 89), (87 - 89), (91 - 89) and (92 - 89)&lt;/p&gt;

&lt;p&gt;or -7, 4, -2, 2, 3&lt;/p&gt;

&lt;p&gt;These values are called residuals or deviations from the mean, or simply deviations. Since we would like one single value, let us try to calculate the mean of these deviations. If you do, you will find that they add up to zero: &lt;a href="https://en.wikipedia.org/wiki/Arithmetic_mean#:~:text=The%20mean%20is%20the%20only%20single%20number%20for%20which%20the%20residuals"&gt;this is a basic property of the mean&lt;/a&gt;. To overcome this, we need to remove the sign from the deviations. The most common way to do this is to square them: since the square of a real number is never negative, the squared deviations cannot cancel each other out.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wodH209F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7z0fnol9jf83b74tt7vf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wodH209F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7z0fnol9jf83b74tt7vf.png" alt="Statistics Cheat Sheet" width="704" height="82"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We now take the mean of this and get&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tfpXYMRb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3fvf3hc1gsqcy2et8hdx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tfpXYMRb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3fvf3hc1gsqcy2et8hdx.png" alt="Statistics Cheat Sheet" width="704" height="82"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This value is called the variance of the data.&lt;/p&gt;

&lt;p&gt;However, if you look carefully, the units are also squared now, so 16.4 is not in Fahrenheit but in Fahrenheit squared! To bring the measure back to the original units, we take the square root.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SMZyTO-V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/532a2ex236pnvucjvphh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SMZyTO-V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/532a2ex236pnvucjvphh.png" alt="Statistics Cheat Sheet" width="704" height="82"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The resulting number, 4.05, is the standard deviation of the dataset.&lt;/p&gt;

&lt;p&gt;There is, however, one twist. This number would be the standard deviation if the dataset were the entire population. Since this is not the case, we need to adjust the formula to get the sample variance and standard deviation: we divide by n - 1 instead of n. This is called Bessel’s correction, and it compensates for the fact that the uncorrected sample variance, on average, underestimates the population variance. &lt;a href="https://www.youtube.com/watch?v=sHRBg6BhKjI"&gt;You can see a wonderful explanation here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Therefore the sample variance, usually denoted by s&lt;sup&gt;2&lt;/sup&gt;, is&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---RyXL3hx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8t1myuzhkocn3ri7noir.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---RyXL3hx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8t1myuzhkocn3ri7noir.png" alt="Statistics Cheat Sheet" width="704" height="82"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And sample standard deviation s =&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dxzncF05--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0pf8808cr9o5wqb177ij.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dxzncF05--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0pf8808cr9o5wqb177ij.png" alt="Statistics Cheat Sheet" width="704" height="95"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As an exercise, try to calculate the standard deviations and variances for the salaries offered by the three companies. You can use a simple spreadsheet program to do this. Also try to calculate the values without using the built-in formulas.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Now that we are armed with the basic tools for finding the center and the spread of the samples, we will extend this in the next part, where we look at another key aspect of statistics - Probability and Random Events.&lt;/p&gt;

&lt;p&gt;In this article, we looked at the various ways of collecting data samples for analysis. We used a hypothetical salaries dataset and learned how to plot graphs like histograms and box plots. We also learned about the measures of central tendency and the measures of spread. This sets us up nicely for the next two parts. To prepare for statistics and data science interviews, you can use the StrataScratch platform, where we have a community of more than 20,000 aspirants aiming to get into the most sought-after Data Science and Data Analyst roles at companies like Google, Amazon, Microsoft, Netflix, etc. Join StrataScratch today and turn your dream into a reality.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>productivity</category>
      <category>beginners</category>
      <category>sql</category>
    </item>
    <item>
      <title>How FAANG companies are leveraging data science and AI</title>
      <dc:creator>StrataScratch</dc:creator>
      <pubDate>Thu, 29 Sep 2022 07:04:23 +0000</pubDate>
      <link>https://dev.to/nate_at_stratascratch/how-faang-companies-are-leveraging-data-science-and-ai-kfc</link>
      <guid>https://dev.to/nate_at_stratascratch/how-faang-companies-are-leveraging-data-science-and-ai-kfc</guid>
      <description>&lt;p&gt;&lt;em&gt;This article will cover information about how FAANG companies leverage Data Science and AI to drive product innovation and thereby improve their customer satisfaction and drive revenue growth.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As data increases at an exponential rate, most companies are leveraging it to drive growth and enhance the customer experience. Big Data has been spreading around the world since the 1960s. Data that contains greater variety, arriving in growing volumes and with increasing velocity, is what &lt;a href="https://www.oracle.com/big-data/what-is-big-data/" rel="noopener noreferrer"&gt;Oracle&lt;/a&gt; defines as big data. Every industry has seen a rise in the number of data science firms that analyze this data for commercial insights. Many &lt;a href="https://www.stratascratch.com/blog/11-best-companies-to-work-for-as-a-data-scientist/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;data science companies&lt;/a&gt; have improved over time thanks to data-driven decision making. In this article, we will talk about the FAANG companies: Facebook/Meta, Amazon, Apple, Netflix, and Google.&lt;/p&gt;

&lt;p&gt;Big Data is useless without the knowledge of experts who can transform cutting-edge technology into useful insights. The value of a data scientist who knows how to wring relevant insights out of gigabytes of data is rising as more and more firms today unlock the power of big data.&lt;/p&gt;

&lt;p&gt;The value of data processing and analysis is becoming increasingly obvious as time goes on. Yet even though executives know data science is a hot field and regard data scientists as modern-day superheroes, the importance a data scientist holds within a company is still largely unknown to them.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why do companies need Data Science Capabilities?
&lt;/h1&gt;

&lt;p&gt;To succeed in this ever-changing world, companies need to rely heavily on data-driven decisions and make their products more innovative. Examples of innovative products backed by data science include Amazon’s Alexa, a virtual assistant that performs basic tasks via voice commands; Google apps such as Translate and Maps; and Netflix’s recommendation system, which shows users what they might like. Having such innovative, data-backed products helps companies build a great customer experience, which improves customer loyalty and thereby drives growth.&lt;/p&gt;

&lt;p&gt;Data science and AI can add value to businesses by empowering management to make better, data-driven decisions. They can also help companies direct actions based on trends, identify opportunities, and make and test decisions backed by quantifiable, data-driven evidence. Let’s look at some examples of how and why companies are using data science.&lt;/p&gt;

&lt;h3&gt;
  
  
  ML usage to Increase Competitiveness
&lt;/h3&gt;

&lt;p&gt;It is hard to be unaware of how machine learning affects enterprises. Machine learning has gained popularity in recent years, and the solutions it offers can be highly advantageous to any business now and in the future. At the moment, enterprises all over the world employ machine learning mostly for:&lt;/p&gt;

&lt;p&gt;• Using predictive analysis to improve customer interactions&lt;br&gt;
• Revenue projections and product marketing&lt;br&gt;
• Simplified data management&lt;br&gt;
• Improved selling models&lt;br&gt;
• Awareness of fraud and cybersecurity&lt;/p&gt;

&lt;p&gt;However, some companies have taken machine learning further and are now using it in highly inventive ways. For instance, Pinterest used machine learning and data science to build its whole content discovery system. This technology helps the business anticipate customer preferences and improves the accuracy of search results.&lt;/p&gt;

&lt;h3&gt;
  
  
  Business process optimization using data science
&lt;/h3&gt;

&lt;p&gt;Data science provides businesses with a wide range of options for streamlining key business procedures. For instance, big data analysis has recently become more prevalent in the manufacturing industry. More and more manufacturing facilities are interested in investing in data analytics and the IIoT (Industrial Internet of Things). Real-time tracking systems and sensor technology gather and analyze data that manufacturers can utilize to:&lt;/p&gt;

&lt;p&gt;• Eliminate snags in the production process&lt;br&gt;
• Boost the assets' effectiveness&lt;br&gt;
• Keep track of product flaws and quality&lt;br&gt;
• Conduct product testing&lt;/p&gt;

&lt;p&gt;Data science assists firms in minimizing production problems that may have an impact on the product's quality, factory logistics, and shipping procedures. Recruitment is a fantastic illustration of how data science can be used in innovative ways to improve business processes beyond manufacturing.&lt;/p&gt;

&lt;p&gt;Now, let’s see how FAANG companies are using Data Science and AI to innovate their products and improve customer experience. We have already covered how the work culture of these FAANG companies is in &lt;a href="https://www.stratascratch.com/blog/ultimate-guide-to-the-top-5-data-science-companies/" rel="noopener noreferrer"&gt;this article&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How is Facebook/Meta using Data Science and AI?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xak9k4jk6u8bvu1cie8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xak9k4jk6u8bvu1cie8.png" alt="How is Facebook Meta using Data Science and AI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Facebook, now Meta is one of the biggest social media companies in the world. With close to &lt;a href="https://www.statista.com/statistics/264810/number-of-monthly-active-facebook-users-worldwide/" rel="noopener noreferrer"&gt;3 billion monthly active users worldwide&lt;/a&gt;, Facebook has captured enormous amounts of data. For Facebook, most of its revenue comes from advertising. So the most important application of data science at Facebook is to decide which advertisements to show to which users.&lt;/p&gt;

&lt;p&gt;The company changed its name to Meta since its strategy involves evolving into virtual platforms of social technologies where users will be able to &lt;a href="https://seekingalpha.com/article/4471770-how-does-facebook-make-money" rel="noopener noreferrer"&gt;immerse themselves in the “metaverse”&lt;/a&gt;. Facebook has been leveraging data science and AI at every step to make the customer experience better and drive exponential revenue growth. Every 60 seconds on Facebook, 510K comments are posted, 293K statuses are updated, and 136K photos are uploaded, which is massive. So what does Facebook do with all this data? How does it leverage data science and AI to make the most of it? Let’s see some examples where Facebook uses data science extensively to enhance the customer experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  Text Analytics
&lt;/h3&gt;

&lt;p&gt;A majority of the data available on Facebook is in the form of text, for example posts and comments, unlike Instagram, which is dominated by photos and videos. Facebook has developed an in-house tool called Deep Text, which analyzes the text we share in posts and comments and extracts meaning from it. This technology is used to identify abusive posts on Facebook.&lt;/p&gt;

&lt;p&gt;Deep Text is a deep-learning-based text understanding engine that can understand text by extracting meaning from it with near-human accuracy. It is built on state-of-the-art neural network architectures that can learn at the word and character level.&lt;/p&gt;

&lt;p&gt;Some applications of Deep Text include identifying the sentiment of the post (positive, negative, neutral) or identifying the emotions in the post (sad, happy, angry, threat, etc.). This framework can also be used in identifying the topic; for example whether the post is about Cricket or Football by recognizing the player names from the post.&lt;/p&gt;

&lt;h3&gt;
  
  
  Topic Data
&lt;/h3&gt;

&lt;p&gt;Facebook has developed &lt;a href="https://www.facebook.com/business/news/topic-data" rel="noopener noreferrer"&gt;Topic Data&lt;/a&gt;, which leverages data science to help marketers understand what people are saying about topics related to their business, so that marketers can make their products and marketing relevant to their customers. Before this technology, marketers had to rely on what people were posting online, which provided a very limited view. Facebook therefore built a data-science-powered framework that helps marketers create more effective, personalized marketing content.&lt;/p&gt;

&lt;p&gt;Some examples of how marketers are using Topic Data:&lt;/p&gt;

&lt;p&gt;• An inventory manager at a fashion retailer can use this data to understand the clothing trends of its target audience and decide which products to stock.&lt;br&gt;
• A company can use it to understand its brand positioning and the sentiment around its brand.&lt;br&gt;
• A company selling a hair de-frizzing product can see demographic data and target relevant customers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Advertising
&lt;/h3&gt;

&lt;p&gt;Facebook’s main revenue stream is advertising. The company uses its data very effectively to decide which ads should be shown to which users. The sponsored posts we see on Facebook are examples of this targeted advertising. When you search for a product on the web while logged in to Facebook, chances are you will later see that product as an ad in your Facebook app.&lt;/p&gt;

&lt;p&gt;Other than the topics discussed above, there are many ways in which Facebook is using data science and AI. Some recent projects that Facebook has undertaken are:&lt;/p&gt;

&lt;p&gt;• Detecting Deepfakes - The company recently launched a &lt;a href="https://ai.facebook.com/blog/deepfake-detection-challenge/" rel="noopener noreferrer"&gt;Deepfake Detection Challenge&lt;/a&gt; to build detection models for deepfake content and to speed up their efforts.&lt;br&gt;
• Language Translation - If you see a Facebook post in another language, Facebook can automatically translate it for you in real time.&lt;br&gt;
• Suicide Prevention - Facebook can identify the sentiment of a post, look for signs of trouble in users’ posts and comments, and thereby generate alerts and help people in crisis.&lt;br&gt;
• Image Recognition - Users can easily search through photos without having to rely on tags or surrounding text.&lt;/p&gt;

&lt;h2&gt;
  
  
  How is Amazon using Data Science and AI?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02pvrhj2w0iyvuauerad.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02pvrhj2w0iyvuauerad.png" alt="How is Amazon using Data Science and AI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Amazon is one of the largest ecommerce companies in the world, collecting around 1 exabyte of purchase history data from its customers. Apart from online shopping, Amazon provides services like Amazon Web Services, Amazon Pay, Amazon Pantry and many more. With all these services, imagine how much information Amazon has about its customers. It leverages this information to improve the customer experience and drive revenue growth. Amazon collects data about which pages customers visit on its website and which products and categories they are interested in, and based on all this information it recommends new products that customers are likely to buy.&lt;/p&gt;

&lt;p&gt;Amazon is the leader in collecting and processing information about how its users spend their money. It uses sophisticated data science and machine learning models for targeted marketing, which helps increase customer engagement. Let’s look at some examples of how Amazon uses Data Science and AI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recommendation Engine
&lt;/h3&gt;

&lt;p&gt;With developments in AI, Amazon began building a state-of-the-art recommendation engine that analyzes a customer’s behavior on the website and thereby accurately predicts what the customer might be interested in in the future. The recommendation engine collects a lot of data about products and users and then forms relations and dependencies between them.&lt;/p&gt;

&lt;p&gt;A User-Product relationship occurs when users with a specific set of characteristics share a preference for certain products and buy them often. An example is Game of Thrones fans buying GoT merchandise and related items. A Product-Product relationship occurs when products on the website are similar in appearance and specifications. For example, if you search for a water bottle and are interested only in bottles for the gym, then all the similar items will be placed together. A User-User relationship occurs when a certain set of customers have very similar tastes or preferences for certain products. For example, teenagers massively buying merchandise from their favorite YouTuber.&lt;/p&gt;
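&lt;p&gt;As an illustrative sketch of the Product-Product relationship (with made-up basket data, not Amazon’s actual algorithm), simply counting how often pairs of items appear in the same purchase already yields usable “related products” candidates:&lt;/p&gt;

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical purchase histories: one set of product ids per customer
baskets = [
    {"got_mug", "got_poster", "water_bottle"},
    {"got_mug", "got_poster"},
    {"water_bottle", "gym_towel"},
]

# Count how often each pair of products is bought together; pairs with
# high co-purchase counts are candidates for "related products"
co_counts = defaultdict(int)
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co_counts[(a, b)] += 1

print(max(co_counts, key=co_counts.get))  # ('got_mug', 'got_poster')
```

&lt;p&gt;Real recommendation engines add normalization and filtering on top of raw co-occurrence counts, but the relation-forming idea is the same.&lt;/p&gt;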

&lt;h3&gt;
  
  
  Alexa
&lt;/h3&gt;

&lt;p&gt;Amazon offers Alexa, a virtual assistant available on devices such as the Echo and Echo Show. It is widely used by customers for basic tasks such as setting reminders, checking the weather, checking the latest news, etc.&lt;/p&gt;

&lt;p&gt;When users speak with Alexa, the recordings are uploaded to Amazon's servers as voice files, and these files are used to train the machine learning algorithms that make the Alexa experience better. Thus, Amazon is continuously collecting data from its users and uses advanced AI tools to understand what the user is saying. Some customers might not be comfortable sharing this voice data, so Amazon provides a way to delete it from their servers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pricing Optimization
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.businessinsider.com/amazon-price-changes-2018-8?r=US&amp;amp;IR=T" rel="noopener noreferrer"&gt;Amazon changes prices&lt;/a&gt; on its products every 10 mins which is close to 2.5m changes in a day across all its products. Amazon has algorithms in place which assess a person’s willingness to buy a specific product. The aim is to set the price of a product in such a way that the customer is likely to buy that product, which is known as &lt;a href="https://en.wikipedia.org/wiki/Dynamic_pricing" rel="noopener noreferrer"&gt;dynamic pricing&lt;/a&gt;. The changes in the prices occur depending on the user’s activity on the website, competitor’s pricing, product availability, profit margin and many more.&lt;/p&gt;

&lt;p&gt;Other than the topics discussed above, there are many ways in which Amazon uses Data Science and AI. One of the latest is Alexa-enabled voice shopping. This feature takes voice commands as input and performs the purchasing flow based on those commands. It allows users to find and purchase products and walk through the checkout flow with voice prompts instead of clicking or tapping on their phone or Echo screen. The goal of this feature is to provide a seamless customer experience for ordering a product, and Data Science and AI are at the center of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How is Apple using Data Science and AI?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F623f5s1chtf9i1tst8ck.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F623f5s1chtf9i1tst8ck.png" alt="How is Apple using Data Science and AI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apple, formerly known as Apple Computers Inc., is a global technology corporation specializing in designing, developing and selling consumer electronics such as the Mac, iPhone, iPad, AirPods, etc. Apple uses AI and Data Science widely to innovate products and build better customer experiences. Let’s discuss some of the AI applications Apple uses:&lt;/p&gt;

&lt;h3&gt;
  
  
  Siri
&lt;/h3&gt;

&lt;p&gt;Similar to Alexa, Siri is Apple’s AI-enabled assistant available across Apple devices such as the Mac, iPad, iPhone, AirPods, etc. It aims to help users quickly navigate through required tasks and perform them without touching the screen. Examples include setting alarms and reminders, calling someone, getting weather updates, reading the news, etc.&lt;/p&gt;

&lt;p&gt;As a virtual assistant, Siri is built on large-scale ML algorithms that combine speech recognition with text mining and natural language processing. First, Siri uses speech recognition to convert human speech into text; natural language processing is then used to identify the meaning of the sentence and prepare the best response for the user’s task.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apple Watch Sleep Tracking
&lt;/h3&gt;

&lt;p&gt;Apple has many ways to collect data about its customers. One of the most recent is the Apple Watch, with which Apple can track a user’s activity throughout the day. Apple has partnered with IBM to apply digital information to health management. Using this technology, customers can monitor their health and lifestyle throughout the day and make improvements to it.&lt;/p&gt;

&lt;p&gt;The Apple Watch also tracks sleeping patterns and provides customers with data about their deep sleep and light sleep. Based on this data, the Apple Watch reminds users about their sleeping times and how to improve them. These notifications and reminders are the result of sophisticated machine learning models in the backend.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apple HomePod
&lt;/h3&gt;

&lt;p&gt;Another AI-enabled technology from Apple is the HomePod, a speaker powered by Siri that can handle many tasks. “With multiple HomePod mini speakers placed around the house, you can have a connected sound system for your whole home. Ask Siri to play one song everywhere or, just as easily, a different song in each room,” (from Keynote speech). Additionally, the speaker can act as a HomeHub, connecting to Apple’s HomeKit, which can be accessed through the iPhone.&lt;/p&gt;

&lt;p&gt;Apart from the projects discussed above, Apple uses Data Science and AI in everything it does, especially in innovating its products and technologies. Apple stays ahead in the technology game by using big data extensively, building innovative products such as Siri, the HomePod and Apple’s Digital Car Key (in partnership with BMW for the 5 Series). To conclude, Apple is utilizing technological advances to improve its user experience through Artificial Intelligence and Data Science methodologies.&lt;/p&gt;

&lt;h2&gt;
  
  
  How is Netflix using Data Science and AI?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8i89ftimquedble9wg4x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8i89ftimquedble9wg4x.png" alt="How is Netflix using Data Science and AI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Netflix, originally a DVD-rental service, boomed into one of the biggest video streaming platforms, with many options for movies and web series. Netflix generated $24.9 billion in revenue in 2021, a 23% increase compared to 2020. In 2022, Netflix has &lt;a href="https://www.businessofapps.com/data/netflix-statistics/" rel="noopener noreferrer"&gt;222 million subscribers worldwide&lt;/a&gt;. Netflix has reigned as the number-one over-the-top (OTT) streaming platform since its launch in 2007. So what’s the secret of such huge success? It is all about using data and analytics to enhance its products and improve the customer experience. By using data extensively, Netflix can provide users with personalized movie and TV show recommendations, optimize production planning, predict the popularity of original content, personalize marketing content such as trailers and thumbnail images, and help stakeholders in decision making. Data is at the center of everything Netflix does. Let’s see some of the Data Science applications Netflix uses:&lt;/p&gt;

&lt;h3&gt;
  
  
  Recommendation Engine
&lt;/h3&gt;

&lt;p&gt;Netflix has built a near-real-time recommendation system for its customers using the huge amount of data it has. Netflix captures information about each user based on what type of content they watch, what they search for, what they add to their watch list, etc. All this information is stored in databases and then used by machine learning algorithms to build a pattern indicating the viewer’s taste. This pattern may match another user’s or may not match anyone’s, since each user can have a unique taste. Based on these patterns, the recommendation system suggests TV shows or movies that the user is likely to watch and enjoy. Thus, Netflix uses Data Science extensively to recommend new shows to its end users, improving the customer experience significantly.&lt;/p&gt;

&lt;p&gt;Other than user behavior in the web application, Netflix captures data like viewing day, time, location, type of device used, etc. It also captures the search keywords a user enters to find a movie or show. Using this data, Netflix is not only able to suggest the next shows or movies to watch but also to arrange the selections into rows based on an individual’s viewing preferences. For example, Netflix will position the program you are most likely to watch in the top left corner.&lt;/p&gt;

&lt;h3&gt;
  
  
  Production Planning &amp;amp; Content Development Analytics
&lt;/h3&gt;

&lt;p&gt;At Netflix, Data Science and AI are used not only to understand user behavior for recommendation systems but also in production planning and content development activities. When creators come up with an idea for a movie or show, data plays an integral part in making decisions.&lt;/p&gt;

&lt;p&gt;Based on the content developed historically and its performance over time, a lot of data is crunched to find insights about what went well and what can be improved. Data on how viewers perceived previous content is really helpful in predicting the likeability of new content. For example, Netflix’s executives knew that The Umbrella Academy was going to be a hit because it checked certain parameters: it’s a series that follows the protagonists’ growth from childhood to adulthood, it features actor Elliot Page, and it’s a comic action adventure, all parameters that have been successful in the past.&lt;/p&gt;

&lt;p&gt;Data is also widely used to choose shoot locations, timings and days. Simple prediction models can save a significant amount of time and effort in planning and reduce expenses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Personalized Artwork and Imagery Selection
&lt;/h3&gt;

&lt;p&gt;Netflix knows that imagery plays a very important role in how viewers choose movies or TV shows to watch. The main objective of the content platform team is to surface the aspects of the story that might intrigue users and increase the chances that a user watches the TV show or movie. This imagery is purely backed by data science and is personalized for each user. Netflix uses Artwork Visual Analysis (AVA), a collection of tools and machine learning algorithms that extract relevant imagery from videos to surface as thumbnails for customers. Netflix has many TV shows, and a typical show has around 10 episodes per season, which is close to 9 million frames. Manually selecting a frame that will catch the audience’s attention from such a large volume of video is tedious, so Netflix developed AVA to do this work automatically. More information on the different algorithms used in AVA can be found on their &lt;a href="https://netflixtechblog.com/ava-the-art-and-science-of-image-discovery-at-netflix-a442f163af6" rel="noopener noreferrer"&gt;tech blog&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How is Google using Data Science and AI?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8g4kk4yyk6fe8vktu4tf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8g4kk4yyk6fe8vktu4tf.png" alt="How is Google using Data Science and AI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Google is a technology giant that uses Data Science and AI extensively. It is a multinational internet company that provides digital products and services such as online search and advertising, cloud computing and software. Google has a wide range of products and services, and all of them are backed heavily by data science. Google also owns YouTube, which adds to the list of services the company offers. Its products include Google Search, Google Photos, Google Drive, G-Suite, Gmail, Voice Search, Reverse Image Search, Maps, speech recognition, Translate and many more. All these products use Data Science to improve the customer experience and drive revenue growth for the company.&lt;/p&gt;

&lt;h3&gt;
  
  
  Google Translate
&lt;/h3&gt;

&lt;p&gt;Google Translate is a simple online tool that can translate text from one language to another. When it launched in 2006, it used statistical machine translation, but Google has since made great progress in using AI for instant, real-time translation. The latest machine learning algorithms Google Translate uses provide translation in 109 different languages and have boosted the quality and reliability of these translations. Google has made significant improvements in the field of Natural Language Processing, which enabled the high accuracy rate of Google Translate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Google Ads
&lt;/h3&gt;

&lt;p&gt;Google Ads, formerly known as AdWords, is part of Google’s marketing suite of tools. Google Ads gives businesses full control to advertise their products online and profiles users based on their searches. Google uses this data to target the right advert to the right users, which is the main idea behind Google Ads. Google uses state-of-the-art machine learning algorithms that rank thousands of keywords based on several metrics, which are then used to pick the right ad to show to each user.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gmail
&lt;/h3&gt;

&lt;p&gt;Gmail is used by a lot of customers as their primary email service, and Google has added many smart features to it. One of the latest is called Smart Reply. This feature reads an email, extracts its meaning and provides the user with possible responses so that they don’t have to type much. Google also uses machine learning algorithms to identify and categorize emails as spam or not spam. Another Data-Science-backed feature in Gmail is the automatic categorization of emails into Promotions, Social, Updates, Priority, etc.&lt;/p&gt;

&lt;p&gt;Apart from the products discussed above, Google uses Machine Learning and AI to build smart products. Google is leading in the technology space due to heavy investments in research in advanced computer science and artificial intelligence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Any organization that uses its data effectively can benefit from data science. Data science is beneficial to businesses in any industry, from generating statistics and insights across workflows and screening new applicants to helping senior employees make more informed decisions.&lt;/p&gt;

&lt;p&gt;If your aim is to work for any of the above data science companies, it’s very important to develop the &lt;a href="https://www.stratascratch.com/blog/what-skills-do-you-need-as-a-data-scientist/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;skills needed for a Data Scientist&lt;/a&gt; or Data Engineer, depending on your preference. With StrataScratch, you can tackle coding as well as non-coding questions and become part of a community of like-minded people. You can communicate and collaborate with other aspiring Data Engineers and work towards achieving your dream job. We have more than 400 real-life SQL questions on the platform, ranging from beginner to advanced level. We highly recommend joining the community of over 20K learners and getting interview ready. All the best!&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>database</category>
      <category>analytics</category>
      <category>ai</category>
    </item>
    <item>
      <title>Find the Retention Rates – Salesforce SQL Interview Question</title>
      <dc:creator>StrataScratch</dc:creator>
      <pubDate>Wed, 17 Aug 2022 07:26:10 +0000</pubDate>
      <link>https://dev.to/nate_at_stratascratch/find-the-retention-rates-salesforce-sql-interview-question-3nff</link>
      <guid>https://dev.to/nate_at_stratascratch/find-the-retention-rates-salesforce-sql-interview-question-3nff</guid>
      <description>&lt;p&gt;&lt;em&gt;Retention rates are one of the key business metrics. We’ll show you how to calculate them by explaining in detail how to solve the Salesforce data science interview question.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YH5ltgjP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kzldhxm9bbslvcr0jki8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YH5ltgjP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kzldhxm9bbslvcr0jki8.png" alt="Find the Retention Rates – Salesforce SQL Interview Question" width="880" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The retention rate is one of the important business metrics, especially in marketing, investing, and product management.&lt;/p&gt;

&lt;p&gt;It refers to the percentage of customers continuing to do business with a company. This usually means extending a subscription or in any other way continuing to use the company’s products and services, such as software, applications, maintenance, etc.&lt;/p&gt;

&lt;p&gt;The retention rate is calculated by dividing the number of retained customers by the number of customers at the beginning of the period. The number of retained customers shouldn’t include customers acquired during the monitored period. In other words, the formula is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NZ54kw-I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jlwz322gfchtqq9efdzy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NZ54kw-I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jlwz322gfchtqq9efdzy.png" alt="retention rate formula" width="753" height="79"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PEC – Period End Number of Customers&lt;/li&gt;
&lt;li&gt;NC – New Customers in the Period&lt;/li&gt;
&lt;li&gt;PSC – Period Start Number of Customers&lt;/li&gt;
&lt;/ul&gt;
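&lt;p&gt;As a quick sanity check on the formula, here is a minimal Python sketch with made-up numbers:&lt;/p&gt;

```python
def retention_rate(pec, nc, psc):
    """Retention rate = (PEC - NC) / PSC, expressed as a percentage."""
    return (pec - nc) / psc * 100

# Hypothetical period: 200 customers at the start, 180 at the end,
# 30 of whom were acquired during the period
rate = retention_rate(pec=180, nc=30, psc=200)
print(rate)  # 75.0
```

&lt;p&gt;Subtracting the new customers first is the key step: it ensures growth during the period doesn’t inflate the rate.&lt;/p&gt;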

&lt;p&gt;Now, we’ll have a look at the interview question and try to find the retention rates using SQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Retention Rate - A Data Science Interview Question by Salesforce
&lt;/h2&gt;

&lt;p&gt;Here’s what this question asks you:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wUv4J_LW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rpps74ps5qaatbk7xyz5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wUv4J_LW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rpps74ps5qaatbk7xyz5.png" alt="Data Science Interview Question by Salesforce to find Retention Rate" width="818" height="252"&gt;&lt;/a&gt;&lt;br&gt;
Link to the question: &lt;a href="https://platform.stratascratch.com/coding/2053-retention-rate?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2053-retention-rate&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/8zeLdtkY2CQ"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Dataset to Work With
&lt;/h2&gt;

&lt;p&gt;To solve this problem, Salesforce gives you only one table: sf_events.&lt;/p&gt;

&lt;p&gt;It has three columns:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pUJDlJbr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qwn7g6z7u4sqia2bxskn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pUJDlJbr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qwn7g6z7u4sqia2bxskn.png" alt="Salesforce Dataset Table" width="242" height="115"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To get an idea about the data it contains, here are the first few rows from the table:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--m89nHIQC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jmdqwhh4rrath2ahx48n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--m89nHIQC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jmdqwhh4rrath2ahx48n.png" alt="Salesforce Dataset Table" width="817" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution Approach
&lt;/h2&gt;

&lt;p&gt;Since this table is not a list of all users but a list of users’ activity in each month, we don’t need to calculate the number of new users each month. In other words, we want to see how many users active in December 2020 were also active in January 2021 or any other future month. We also need to look at all the users active in January 2021 and see whether they were active in February 2021 or any other future month. This is also the assumption stated in the question.&lt;/p&gt;

&lt;p&gt;With this assumption in mind, the retention rate is calculated by finding the users active in future months and dividing this number by the number of users in December 2020 or January 2021, depending on which retention rate you’re calculating.&lt;/p&gt;

&lt;p&gt;For example, if a user was active in December 2020, they would appear in the table with a December 2020 timestamp. If there’s any future activity (in January 2021 or later), this user is considered retained for December 2020. If the user was active in December 2020 but didn’t appear in any of the following months, they are considered not retained.&lt;/p&gt;
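&lt;p&gt;The retained/not-retained logic can be sketched in a few lines of Python (the event tuples below are hypothetical, not the real sf_events data):&lt;/p&gt;

```python
from datetime import date

# Hypothetical activity log: (user_id, activity_date) tuples
events = [
    (1, date(2020, 12, 5)), (1, date(2021, 1, 10)),  # active in Dec, returned later
    (2, date(2020, 12, 20)),                         # active in Dec only
    (3, date(2021, 1, 3)), (3, date(2021, 2, 14)),   # active in Jan, returned later
]

def retention(events, month_start, month_end):
    """Share of users active in the month who have any activity after it."""
    active = {u for u, d in events if month_start <= d <= month_end}
    retained = {u for u, d in events if u in active and d > month_end}
    return len(retained) / len(active)

print(retention(events, date(2020, 12, 1), date(2020, 12, 31)))  # 0.5
```

&lt;p&gt;User 1 is retained for December (they reappear in January) while user 2 is not, which is exactly the comparison the SQL solution makes with each user’s maximum activity date.&lt;/p&gt;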

&lt;h2&gt;
  
  
  Assumptions
&lt;/h2&gt;

&lt;p&gt;Our solution will be based on the following assumptions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If a user is listed in the table, this represents the user’s activity for the date in the record.&lt;/li&gt;
&lt;li&gt;We consider only retention rates for Dec 2020 and January 2021.&lt;/li&gt;
&lt;li&gt;The table does not represent the list of all users but only the active users.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Solution Breakdown
&lt;/h2&gt;

&lt;p&gt;The steps you have to build into your code are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Find all active users in December 2020 by using the date field. Do the same for January 2021. That way, you’re getting denominators for the Dec and Jan retention rates.&lt;/li&gt;
&lt;li&gt;Find the maximum date of the user’s activity to see if the user has the activity in the future months. To do that, create a table with the user_id and max date.&lt;/li&gt;
&lt;li&gt;Join all the active users in the month with the list of users with future activity. That way, you’ll get the list of December 2020 users and their latest activity date. Then count the number of users with activity after December and divide it by the number of users in December to get the December retention rate. Apply the same principle to calculate the January retention rate. Note that for the January retention rate, future activity begins with February 2021.&lt;/li&gt;
&lt;li&gt;Consolidate by account_id. Use either Jan or Dec accounts list because it’s assumed that both months contain the complete list of account_ids.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;The first thing is to find users active in December 2020.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH dec_2020 AS
  (SELECT DISTINCT account_id,
                   user_id
   FROM sf_events
   WHERE EXTRACT(MONTH
                 FROM date) = 12
     AND EXTRACT(YEAR
                 FROM date) = 2020 ),
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To do that, we’re using a CTE. We’re interested in the distinct accounts and users, and to get the users active in December 2020, we’re using the EXTRACT() function in the WHERE clause.&lt;/p&gt;

&lt;p&gt;The second CTE does the same thing for the users active in January 2021.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;jan_2021 AS
  (SELECT DISTINCT account_id,
                   user_id
   FROM sf_events
   WHERE EXTRACT(MONTH
                 FROM date) = 1
     AND EXTRACT(YEAR
                 FROM date) = 2021 ),
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next we want to find the latest active date for each user.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;max_date AS
  (SELECT user_id,
          MAX(Date) AS max_date
   FROM sf_events
   GROUP BY user_id),
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can probably tell from the solution breakdown, here we use the MAX() function to find the latest active date.&lt;/p&gt;

&lt;p&gt;Now comes the step where we calculate the retention rate. First the December 2020 retention rate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;retention_dec_2020 AS
  (SELECT account_id,
          SUM(CASE
                  WHEN max_date &amp;gt; '2020-12-31' THEN 1.0
                  ELSE 0
              END) / COUNT(*) * 100.0 AS retention_dec
   FROM dec_2020
   JOIN max_date ON dec_2020.user_id = max_date.user_id
   GROUP BY account_id),
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we joined the two CTEs together to match active December users with their latest activity date. A user’s latest activity could still fall within December, so we only count users whose latest activity is after December.&lt;/p&gt;

&lt;p&gt;We used the CASE WHEN statements to allocate values of 1 to all users that had activity after December 2020. Sum these values, divide them by the total number of users in December 2020, and you get the Dec retention rate.&lt;/p&gt;

&lt;p&gt;Then we do the same for January 2021 retention rate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;retention_jan_2021 AS
  (SELECT account_id,
          SUM(CASE
                  WHEN max_date &amp;gt; '2021-01-31' THEN 1.0
                  ELSE 0
              END) / COUNT(*) * 100.0 AS retention_jan
   FROM jan_2021
   JOIN max_date ON jan_2021.user_id = max_date.user_id
   GROUP BY account_id)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we have the retention rates for the Dec and Jan active users, we only need to join the two results on account_id and divide the retentions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT retention_jan_2021.account_id,
       retention_jan / retention_dec AS retention
FROM retention_jan_2021
INNER JOIN retention_dec_2020 ON retention_jan_2021.account_id = retention_dec_2020.account_id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The complete answer to this question is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH dec_2020 AS
  (SELECT DISTINCT account_id,
                   user_id
   FROM sf_events
   WHERE EXTRACT(MONTH
                 FROM date) = 12
     AND EXTRACT(YEAR
                 FROM date) = 2020 ),

 jan_2021 AS
  (SELECT DISTINCT account_id,
                   user_id
   FROM sf_events
   WHERE EXTRACT(MONTH
                 FROM date) = 1
     AND EXTRACT(YEAR
                 FROM date) = 2021 ),

max_date AS
  (SELECT user_id,
          MAX(Date) AS max_date
   FROM sf_events
   GROUP BY user_id),

 retention_dec_2020 AS
  (SELECT account_id,
          SUM(CASE
                  WHEN max_date &amp;gt; '2020-12-31' THEN 1.0
                  ELSE 0
              END) / COUNT(*) * 100.0 AS retention_dec
   FROM dec_2020
   JOIN max_date ON dec_2020.user_id = max_date.user_id
   GROUP BY account_id),

retention_jan_2021 AS
  (SELECT account_id,
          SUM(CASE
                  WHEN max_date &amp;gt; '2021-01-31' THEN 1.0
                  ELSE 0
              END) / COUNT(*) * 100.0 AS retention_jan
   FROM jan_2021
   JOIN max_date ON jan_2021.user_id = max_date.user_id
   GROUP BY account_id)

SELECT retention_jan_2021.account_id,
       retention_jan / retention_dec AS retention
FROM retention_jan_2021
INNER JOIN retention_dec_2020 ON retention_jan_2021.account_id = retention_dec_2020.account_id

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Edge Case Consideration
&lt;/h2&gt;

&lt;p&gt;As an edge case, we’ll consider the possibility that not all accounts were present each month.&lt;/p&gt;

&lt;p&gt;To compensate for that and to include all accounts, you can use two workarounds.&lt;/p&gt;

&lt;h2&gt;
  
  
  FULL OUTER JOIN
&lt;/h2&gt;

&lt;p&gt;The first workaround is to use the FULL OUTER JOIN instead of INNER JOIN in the SELECT statement referencing the CTEs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT 
    COALESCE(retention_jan_2021.account_id, retention_dec_2020.account_id) AS account_id,
    COALESCE(retention_jan, NULL) / COALESCE(retention_dec, NULL) AS retention
FROM retention_jan_2021
FULL OUTER JOIN retention_dec_2020 ON retention_jan_2021.account_id = retention_dec_2020.account_id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use the COALESCE function to get the January accounts plus the December accounts that don’t appear in January. Then divide the two retention rates, which yields NULL whenever an account has no retention rate for one of the months. The CTEs calculating the retention rates are joined using the FULL OUTER JOIN. If you don’t feel at home with all these different JOINs and what they do, don’t worry! The article “&lt;a href="https://www.stratascratch.com/blog/how-to-join-3-or-more-tables-in-sql/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;How to Join 3 or More Tables in SQL&lt;/a&gt;” explains everything you need to know about JOINs.&lt;/p&gt;

&lt;p&gt;The issue with this edge case solution is that it’s computationally intensive.&lt;/p&gt;

&lt;h2&gt;
  
  
  UNION
&lt;/h2&gt;

&lt;p&gt;There’s another way. You can get a complete list of all accounts by using UNION, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;all_accounts AS
  (SELECT account_id
   FROM retention_jan_2021
   UNION 
   SELECT account_id
   FROM retention_dec_2020)

SELECT a.account_id,
       COALESCE(retention_jan, NULL) / COALESCE(retention_dec, NULL) AS retention
FROM all_accounts a
LEFT JOIN retention_jan_2021 j ON a.account_id = j.account_id
LEFT JOIN retention_dec_2020 d ON a.account_id = d.account_id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both these workarounds share a downside: they only capture the accounts present in December and January, so they don’t consider all months in the dataset.&lt;/p&gt;

&lt;p&gt;If you want all months, you can simply create a table with all the distinct account IDs found in the source table. This would, however, mean listing all accounts for all time, so you may get many accounts with a retention of zero because they don’t have any users.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This Salesforce data science interview question is not easy. But if you hung in there until the end, you’ve gained some genuinely valuable knowledge: how to calculate retention rates.&lt;/p&gt;

&lt;p&gt;Knowing that will not only improve your chances of success at the job interview. It will also make you a valuable asset to a company, because you’ve shown that you possess a high level of business as well as technical knowledge. If you want to practice more questions from Salesforce, check out our previous post “&lt;a href="https://www.stratascratch.com/blog/salesforce-data-scientist-coding-interview-questions/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;Salesforce Data Scientist Coding Interview Questions&lt;/a&gt;”, or find questions from other top companies in “&lt;a href="https://www.stratascratch.com/blog/sql-interview-questions-you-must-prepare-the-ultimate-guide/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;SQL Interview Questions You Must Prepare: The Ultimate Guide&lt;/a&gt;”.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>programming</category>
      <category>datascience</category>
    </item>
    <item>
      <title>A Resource Guide to Jump-Start Your Own Data Science Projects</title>
      <dc:creator>StrataScratch</dc:creator>
      <pubDate>Fri, 10 Jun 2022 02:32:24 +0000</pubDate>
      <link>https://dev.to/nate_at_stratascratch/a-resource-guide-to-jump-start-your-own-data-science-projects-322j</link>
      <guid>https://dev.to/nate_at_stratascratch/a-resource-guide-to-jump-start-your-own-data-science-projects-322j</guid>
      <description>&lt;p&gt;&lt;em&gt;A very into-detail guide on the data science project components and the resources for jump starting your very own project.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vgtXmlTi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d90s9qrobw57wglw7pad.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vgtXmlTi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d90s9qrobw57wglw7pad.png" alt="Image description" width="880" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a resource guide for starting your data science projects. It goes over the components of a successful data science project, along with websites and online datasets to help jump-start your own!&lt;/p&gt;

&lt;p&gt;First off, you need to understand which components are required in a full-stack data science project (such as time-series analysis and APIs), and why. We have created a detailed breakdown of what interviewers look for in this article → “&lt;a href="https://www.stratascratch.com/blog/data-analytics-project-ideas-that-will-get-you-the-job/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;Data Analytics Project Ideas That Will Get You The Job&lt;/a&gt;” and the video below:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/c4Af2FcgamA"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Components of a Data Science Project
&lt;/h2&gt;

&lt;p&gt;1) Promising dataset&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real data&lt;/li&gt;
&lt;li&gt;Timestamps&lt;/li&gt;
&lt;li&gt;Qualitative data&lt;/li&gt;
&lt;li&gt;Quantitative data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2) Modern Technologies&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;APIs&lt;/li&gt;
&lt;li&gt;Cloud Databases (Relational + Non-Relational data)&lt;/li&gt;
&lt;li&gt;AWS → S3 buckets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;3) Building model&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metrics&lt;/li&gt;
&lt;li&gt;Modifying dataset&lt;/li&gt;
&lt;li&gt;Diagnostics tests&lt;/li&gt;
&lt;li&gt;Transformation&lt;/li&gt;
&lt;li&gt;Test/Control&lt;/li&gt;
&lt;li&gt;Model Selection&lt;/li&gt;
&lt;li&gt;Optimizing&lt;/li&gt;
&lt;li&gt;Math&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;4) Making impact/validation&lt;/p&gt;

&lt;h2&gt;
  
  
  Promising Dataset
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Real Data
&lt;/h2&gt;

&lt;p&gt;Any great data science project uses constantly updated data.&lt;/p&gt;

&lt;p&gt;There are 2 important reasons for using constantly updated, real data.&lt;/p&gt;

&lt;p&gt;1) The dataset is never truly complete&lt;/p&gt;

&lt;p&gt;A real-world dataset needs to be filtered, or to have its values manipulated to derive new metrics. Data wrangling is one of the most important parts of data science, since a model is only as good as the dataset it analyzes.&lt;/p&gt;

&lt;p&gt;2) The dataset is updated in real time&lt;/p&gt;

&lt;p&gt;Most companies use datasets that are updated frequently. These datasets are especially important for businesses that need to take a specific action when a certain metric falls below a certain threshold. For example, if the supply of an ice cream company falls below the predicted demand, the company needs a plan for matching supply and demand. Using real-time datasets is a great way to show recruiters you have experience with variable datasets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Timestamp
&lt;/h2&gt;

&lt;p&gt;Datetime values are, as the name states, values that include a date or time. They are commonly found in datasets that are constantly updated, since most records include a timestamp. Even if the records don’t have timestamps, it’s useful for analysis to have a datetime column. Companies commonly want to see the distribution of a metric across a year (or possibly decades), so finding datasets with datetime values and computing that distribution is important.&lt;/p&gt;
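&lt;p&gt;As a quick sketch of why timestamps matter, the snippet below buckets a handful of made-up event dates by month using only Python’s standard library:&lt;/p&gt;

```python
from collections import Counter
from datetime import datetime

# Hypothetical event timestamps (ISO date strings), as found in many datasets
events = ["2021-01-05", "2021-01-20", "2021-02-11", "2021-03-02", "2021-03-28"]

# Bucket events by month to see their distribution over the year
per_month = Counter(datetime.fromisoformat(d).month for d in events)
print(per_month[1])  # 2 events in January
```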

&lt;h2&gt;
  
  
  Qualitative / Quantitative
&lt;/h2&gt;

&lt;p&gt;Qualitative and quantitative data represent non-numerical and numerical values, respectively.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Qualitative → Gender, types of jobs, color of house&lt;br&gt;
Quantitative → Conversion rate, sales, employees laid off&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Both types of data provide their own importance.&lt;/p&gt;

&lt;p&gt;Quantitative data is one of the fundamentals of regression models: you use numbers and variables to predict a numeric value.&lt;/p&gt;

&lt;p&gt;Qualitative data can help with classification models, such as decision trees. Qualitative data can also be converted into quantitative data, for example by converting safety levels [none, low, medium, high] to [0, 1, 2, 3], which is called ordinal encoding.&lt;/p&gt;
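&lt;p&gt;The ordinal encoding just described can be sketched in a few lines of Python (the safety-level column here is made up):&lt;/p&gt;

```python
# Map ordered categories to integers that preserve their ranking
levels = ["none", "low", "medium", "high"]
encoding = {level: rank for rank, level in enumerate(levels)}

column = ["low", "high", "none", "medium"]
encoded = [encoding[value] for value in column]
print(encoded)  # [1, 3, 0, 2]
```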

&lt;p&gt;Geo-locations, such as countries or longitude/latitude, are also nice to have in datasets. Similar to datetime values, geo-locations let you find the distribution of metrics across various states or countries. Multinational corporations in particular have datasets from various countries that need to be analyzed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;p&gt;Now that you have a better understanding of what to look for in a potential dataset, here are some websites where you can search for one.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://datasetsearch.research.google.com/"&gt;Google Dataset Search&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.re3data.org/"&gt;Registry Of Research Data Repositories&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://data.gov/"&gt;U.S. Government Open Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://data.gov.in/"&gt;Open Government Data Platform India&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These links contain datasets in CSV format, but also offer access to APIs. APIs are important to have in your repertoire, since data, especially inside companies, is usually obtained through APIs.&lt;/p&gt;

&lt;p&gt;Beyond these websites, another place to retrieve datasets is consumer-facing tech companies such as Twitter, Facebook, and YouTube. These companies provide APIs for developers directly through their websites. This is an easy area to find intriguing ideas for your projects!&lt;/p&gt;

&lt;h2&gt;
  
  
  Modern Technologies
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZOSzaLKq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2yn3cjz4caje4wa79gey.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZOSzaLKq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2yn3cjz4caje4wa79gey.png" alt="Image description" width="880" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Modern technologies are a key factor in what differentiates a good data science project from a great one. “Modern technologies” refers to the software and services commonly used by companies. APIs and cloud databases are two examples.&lt;/p&gt;

&lt;h2&gt;
  
  
  API
&lt;/h2&gt;

&lt;p&gt;APIs are one of the most important modern technologies to use when creating a data science project. An API (Application Programming Interface) is what makes your application work. Imagine you are booking an Uber ride. Through your phone, you first input your pickup and dropoff locations, and Uber gives you an approximate cost. How does the Uber application calculate this cost? It uses APIs. An API is an interface between two pieces of software.&lt;/p&gt;

&lt;p&gt;Example: The Uber app requires an input of pickup and dropoff locations. A separate piece of software, which can be hosted on the web, calculates the cost based on the distance between locations, the approximate time taken, surge pricing (when there is high demand for rides), and much more. The calling and communication between these two pieces of software is the API.&lt;/p&gt;

&lt;p&gt;1) Understanding APIs and how to set them up in code&lt;/p&gt;

&lt;p&gt;Knowledge of APIs is essential for a great data scientist. While you can watch short videos about what an API is, you definitely need a deeper understanding of where APIs are used and the different types. Building your own API is a great way to gain that understanding and learn how to test APIs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=OVvTv9Hy91Q"&gt;What Are APIs? - Simply Explained&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=GZvSYJDk-us&amp;amp;t=6430s"&gt;APIs for Beginners - How to use an API (Full Course / Tutorial)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2) Libraries for APIs (Request/Flask in Python)&lt;/p&gt;

&lt;p&gt;To request data from APIs, or even to create your own API, there are specific libraries to use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To learn more about Requests in APIs → Creating project with API &lt;a href="https://www.youtube.com/watch?v=fklHBWow8vE"&gt;Working with APIs in Python [For Your Data Science Project]&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;To learn more about building a REST API with Flask  &lt;a href="https://www.youtube.com/watch?v=GMppyAPbLYk"&gt;Python REST API Tutorial - Building a Flask REST API&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
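
&lt;p&gt;To give a feel for what a client library like Requests sends under the hood, here is a minimal sketch that only builds the request URL; the endpoint and parameter names are hypothetical, and no network call is made:&lt;/p&gt;

```python
from urllib.parse import urlencode

# Hypothetical endpoint -- not a real service
BASE_URL = "https://api.example.com/v1/rides"

def build_request_url(pickup, dropoff):
    # Encode the parameters into the query string a GET request would carry
    query = urlencode({"pickup": pickup, "dropoff": dropoff})
    return BASE_URL + "?" + query

print(build_request_url("SoMa", "Mission"))
```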

&lt;p&gt;3) Understand json objects&lt;/p&gt;

&lt;p&gt;Plenty of APIs use JSON objects as input and output, so it is crucial to understand what JSON objects are.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using json library in Python - &lt;a href="https://www.youtube.com/watch?v=9N6a-VLBa2I"&gt;Python Tutorial: Working with JSON Data using the json Module&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
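
&lt;p&gt;Here is a small taste of the json module in action, using a made-up response payload:&lt;/p&gt;

```python
import json

# A hypothetical API response body, as raw JSON text
payload = '{"ride_id": 17, "fare": 12.5, "surge": false}'

data = json.loads(payload)   # JSON text to a Python dict
print(data["fare"])          # 12.5

data["fare"] = 14.0
print(json.dumps(data))      # the dict serialized back to JSON text
```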

&lt;h2&gt;
  
  
  Cloud Database
&lt;/h2&gt;

&lt;p&gt;Recruiters want data scientists with cloud database experience, since databases are increasingly hosted in the cloud these days.&lt;/p&gt;

&lt;p&gt;Before going into the 3 major cloud platforms, you want to plan the structure of your input and output data. There are 2 types of databases: relational and non-relational.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Relational databases store data with primary keys that identify specific records. This type of data is generally organized in a table structure.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Non-relational databases store data without primary keys, in structures such as graphs or documents.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
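
&lt;p&gt;To make the primary-key idea concrete, here is a tiny sketch using Python’s built-in sqlite3 module (the table and rows are made up):&lt;/p&gt;

```python
import sqlite3

# An in-memory relational table where user_id uniquely identifies each row
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada')")
conn.execute("INSERT INTO users VALUES (2, 'Grace')")

# Look a row up by its primary key
row = conn.execute("SELECT name FROM users WHERE user_id = 2").fetchone()
print(row[0])  # Grace
```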

&lt;p&gt;Common cloud platforms are AWS, Google Cloud, and Microsoft Azure. Each has its own advantages and disadvantages, and BMC gives a &lt;a href="https://www.bmc.com/blogs/aws-vs-azure-vs-google-cloud-platforms/"&gt;detailed analysis of these services.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After personally using these services, I would recommend AWS, especially for a first project. &lt;a href="https://aws.amazon.com/s3/"&gt;S3 buckets&lt;/a&gt; are extremely common when working with companies. AWS has RDS (a relational database service) and DynamoDB (a non-relational database) along with S3 storage. Using these services is a great way to show you have experience with both cloud databases and S3. AWS has a great variety of free-tier services to build your project, along with 5 GB of free S3 storage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=ruz-vK8IesE"&gt;SQL vs NoSQL Explained&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/playlist?list=PL9nWRykSBSFilnmg4hy2Sfs1o6_wF1JJP"&gt;How to learn AWS for beginners&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Building Models
&lt;/h2&gt;

&lt;p&gt;Let’s discuss what to keep in mind when building a model.&lt;br&gt;
An important thing to remember when building a model is why you are using, or not using, a specific technique. While getting an accurate model is important, during interviews you want to explain the reasoning behind choosing specific models over others.&lt;/p&gt;

&lt;p&gt;Automate your model as much as possible. Assuming your input data has a fixed format, your algorithm should clean it, derive new metrics, apply appropriate transformations, and build the model. Even when you feed in a dataframe with new data, the algorithm should still work and provide the right outputs.&lt;br&gt;
Here are some things to consider when building a model:&lt;/p&gt;

&lt;h2&gt;
  
  
  Metrics
&lt;/h2&gt;

&lt;p&gt;How do you determine how accurate your model is? What numerical data is used when creating your model? Not all models require a metric, but most use them. Determining which metrics affect your model, and to what extent, is imperative to a well-thought-out model. If your model requires deriving a new metric, note what that metric is and why it was created.&lt;/p&gt;

&lt;h2&gt;
  
  
  Modifying dataset
&lt;/h2&gt;

&lt;p&gt;How are you manipulating the values or columns of the input dataset? Remember to clean the dataset. Beyond cleaning, are there any derived columns that directly affect the output? Also remember to record in your project notes why you made these changes.&lt;/p&gt;

&lt;p&gt;Certain columns may contain null values. In those cases you should decide how to deal with the rows that have missing values. You could impute the column average to replace the null values, or run a regression to predict them.&lt;/p&gt;
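&lt;p&gt;The average-imputation option can be sketched without any libraries (the column values are made up):&lt;/p&gt;

```python
# Replace missing entries (None) with the mean of the observed values
column = [4.0, None, 6.0, None, 5.0]

observed = [v for v in column if v is not None]
mean = sum(observed) / len(observed)  # 5.0

imputed = [mean if v is None else v for v in column]
print(imputed)  # [4.0, 5.0, 6.0, 5.0, 5.0]
```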

&lt;p&gt;There are various ways to clean a dataset, ranging from simply removing rows with null values to PCA (an unsupervised ML technique that reduces dimensionality by combining correlated features).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here are some common techniques to use to optimize your dataset:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Common Data Cleaning Techniques → &lt;a href="https://monkeylearn.com/blog/data-cleaning-techniques/"&gt;8 Effective Data Cleaning Techniques for Better Data&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Dealing with missing values → &lt;a href="https://towardsdatascience.com/7-ways-to-handle-missing-values-in-machine-learning-1a6326adf79e"&gt;7 Ways to Handle Missing Values in Machine Learning&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Encoding values → &lt;a href="https://towardsdatascience.com/categorical-encoding-using-label-encoding-and-one-hot-encoder-911ef77fb5bd"&gt;Categorical encoding using Label-Encoding and One-Hot-Encoder&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;PCA → &lt;a href="https://www.askpython.com/python/examples/principal-component-analysis"&gt;Principal Component Analysis from Scratch in Python
&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---C8kswI2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z9niwutliicanrv4rboy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---C8kswI2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z9niwutliicanrv4rboy.png" alt="Image description" width="512" height="332"&gt;&lt;/a&gt;&lt;br&gt;
PCA - &lt;a href="https://www.stratascratch.com/blog/overview-of-machine-learning-algorithms-unsupervised-learning?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;Overview of Machine Learning Algorithms: Unsupervised Learning&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Diagnostic tests
&lt;/h2&gt;

&lt;p&gt;Raw datasets often need to be adjusted for the specific analysis you plan to run. Suppose you want to run a linear regression, which assumes equal variance (homoscedasticity) in your data. To check that assumption, you can run a diagnostic test such as Bartlett’s test. Depending on the type of model you want to create, the dataset needs specific properties.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Examples of diagnostic tests to run for common problems:&lt;/p&gt;

&lt;p&gt;1) Outliers&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Box-plot - A simple graph to show the 5 number summary (minimum, first quartile, median, third quartile, maximum) - &lt;a href="https://www.youtube.com/watch?v=mhaGAaL6Abw"&gt;How To Make Box and Whisker Plots&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Grubbs Test - Test to detect exactly one outlier in the dataset - &lt;a href="https://www.youtube.com/watch?v=HmbERCjc8_8"&gt;Grubbs Test (example)&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2) Homoscedasticity / Heteroskedasticity&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Homoscedasticity - Test to check if variance of the dependent variable is the same throughout the dataset - &lt;a href="https://www.itl.nist.gov/div898/handbook/eda/section3/eda357.htm"&gt;Bartlett's test&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Heteroskedasticity - Test to check if variance of the dependent variable is NOT the same throughout the dataset - &lt;a href="https://www.youtube.com/watch?v=wzLADO24CDk"&gt;Breusch Pagan test&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
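
&lt;p&gt;For a rough intuition (this is a crude split-sample check on made-up residuals, not a formal Bartlett’s or Breusch Pagan test), you can compare the residual variance across two halves of the data:&lt;/p&gt;

```python
from statistics import pvariance

# Residuals whose spread grows over the sample -- a heteroskedasticity smell
residuals = [0.1, -0.2, 0.15, -0.1, 1.5, -2.0, 1.8, -1.6]

half = len(residuals) // 2
ratio = pvariance(residuals[half:]) / pvariance(residuals[:half])
print(ratio)  # a ratio far from 1 suggests the variance is not constant
```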

&lt;h2&gt;
  
  
  Transformation
&lt;/h2&gt;

&lt;p&gt;If any diagnostic test shows that a transformation is required, run the relevant transformation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Common transformations:&lt;/p&gt;

&lt;p&gt;1) Box-Cox transformation - Transforming a non-normal distribution closer to a normal distribution&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/box-cox-transformation-explained-51d745e34203"&gt;Box-Cox Transformation: Explained&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=vGOpEpjz2Ks"&gt;Box-Cox Transformation + R Demo
&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2) Log transformation - Transforming a skewed distribution closer to a normal distribution&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://medium.com/@kyawsawhtoon/log-transformation-purpose-and-interpretation-9444b4b049c9"&gt;Log Transformation: Purpose and Interpretation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
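
&lt;p&gt;A quick illustration of how the log transform compresses a right-skewed variable (the values are made up):&lt;/p&gt;

```python
from math import log10

# Values spanning several orders of magnitude -- a long right tail
skewed = [1, 10, 100, 1000, 10000]

# After a base-10 log transform the spacing becomes even
transformed = [log10(v) for v in skewed]
print(transformed)
```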

&lt;h2&gt;
  
  
  Test/Control
&lt;/h2&gt;

&lt;p&gt;Some models require test/control versions. A common test/control method is A/B testing. Did you implement a test/control split? What specific difference between the test and control versions did you use, and why? What were your results?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://vwo.com/ab-testing/"&gt;Learn A/B testing&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Selection
&lt;/h2&gt;

&lt;p&gt;What are your assumptions about this model? What properties do the dataset and model have? Why was this model the best fit for the question you are trying to answer?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Common Regression Models:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;1) Ridge - This uses L2 regularization, which means an ineffective variable’s coefficient can be reduced CLOSE to 0 - &lt;a href="https://www.youtube.com/watch?v=Q81RR3yKn30"&gt;Regularization Part 1: Ridge (L2) Regression&lt;/a&gt;&lt;br&gt;
2) Lasso - This uses L1 regularization, which means an ineffective variable’s coefficient can be reduced all the way to 0 - &lt;a href="https://www.youtube.com/watch?v=NGf0voTMlcs"&gt;Regularization Part 2: Lasso (L1) Regression&lt;/a&gt;&lt;br&gt;
3) Logistic Regression - This model computes the probability of a given input and returns a binary output - &lt;a href="https://www.youtube.com/watch?v=yIYKR4sgzI8"&gt;StatQuest: Logistic Regression&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--35y8k7Ti--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0u5we2ximrs77zyacw7h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--35y8k7Ti--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0u5we2ximrs77zyacw7h.png" alt="Image description" width="512" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Logistic regression part - &lt;a href="https://www.stratascratch.com/blog/overview-of-machine-learning-algorithms-classification/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;Overview of Machine Learning Algorithms: Classification&lt;/a&gt;&lt;/p&gt;
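
&lt;p&gt;To make the logistic regression output concrete, here is a minimal sketch of the logistic (sigmoid) function and how its probability becomes a binary prediction:&lt;/p&gt;

```python
from math import exp

def sigmoid(x):
    # Squash any real input into a probability strictly between 0 and 1
    return 1.0 / (1.0 + exp(-x))

def predict(x):
    # Round the probability to the nearest class label, 0 or 1
    return round(sigmoid(x))

print(predict(2.0), predict(-2.0))  # 1 0
```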

&lt;ul&gt;
&lt;li&gt;Classification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;1) Decision trees - Decision trees are made of decision nodes which further lead to either another decision node or a leaf node - &lt;a href="https://www.youtube.com/watch?v=7VeUPuFGJHk"&gt;StatQuest: Decision Trees&lt;/a&gt;&lt;br&gt;
2)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--O7wLTr33--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9rca027l54d6z5xgf9t1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--O7wLTr33--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9rca027l54d6z5xgf9t1.png" alt="Image description" width="512" height="222"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Decision trees - &lt;a href="https://www.stratascratch.com/blog/overview-of-machine-learning-algorithms-classification/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;Overview of Machine Learning Algorithms: Classification&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Naive Bayes - A type of classification algorithm that uses Bayes’ theorem - &lt;a href="https://www.youtube.com/watch?v=O2L2Uv9pdDA"&gt;Naive Bayes, Clearly Explained!!!&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Neural Networks&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;CNN model - A neural network model that is used for image recognition - &lt;a href="https://www.youtube.com/watch?v=YRhxdVk_sIs"&gt;Convolutional Neural Networks (CNNs) explained&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;RNN model - A neural network model that is used for sequential data - &lt;a href="https://www.youtube.com/watch?v=LHXXI4-IEns"&gt;Illustrated Guide to Recurrent Neural Networks: Understanding the Intuition&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Optimizing
&lt;/h2&gt;

&lt;p&gt;Your first iteration of a model should not be your final one. Recheck your code and find ways to optimize it! (Hopefully you wrote proper comments and well-named variables so you don’t forget what each function does.)&lt;/p&gt;

&lt;p&gt;When optimizing, you first want to define what counts as a more optimized model. Most commonly in a data science project, a more optimized model is a more accurate model.&lt;/p&gt;

&lt;p&gt;Error metrics calculate the difference between the original data and the predicted data. This can be done in a couple of ways, such as Mean Squared Error and R². Mean Squared Error is the average of the squares of the errors, where an error is the difference between an original value and its prediction.&lt;/p&gt;

&lt;p&gt;Mean Square Error Formula:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xg7Y4dju--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2hx4t1estfulajbbbyhy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xg7Y4dju--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2hx4t1estfulajbbbyhy.png" alt="Image description" width="559" height="109"&gt;&lt;/a&gt;&lt;/p&gt;
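
&lt;p&gt;The same formula in code, as a quick sanity check:&lt;/p&gt;

```python
# Mean Squared Error: the average squared difference between actual and predicted
def mse(actual, predicted):
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)

print(mse([3, 5, 7], [2, 5, 9]))  # (1 + 0 + 4) / 3
```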

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Common error metrics&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=KzHJXdFJSIQ"&gt;Root Mean Square Error (RMSE) Tutorial + MAE + MSE + MAPE+ MPE&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=_I7sKr77Ci8"&gt;Adjusted R squared vs. R Squared For Beginners&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Math
&lt;/h2&gt;

&lt;p&gt;When creating a statistical model, you definitely need to understand the math. What are the mathematical assumptions of the model? If you have a final equation, specific epochs, or other notable values, include them in your notes.&lt;/p&gt;

&lt;p&gt;If you want to create a scientific document for your model, you can use LaTeX, which is made specifically for scientific documents and mathematical formulas. You can use an &lt;a href="https://www.overleaf.com/"&gt;online LaTeX editor&lt;/a&gt; to create the documents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making an impact / validation
&lt;/h2&gt;

&lt;p&gt;Now you have finally created your model and project! The final step is to get peer review on your project!&lt;/p&gt;

&lt;p&gt;There are multiple ways to get validation for your project, such as creating a report of your findings or sharing the visualizations.&lt;/p&gt;

&lt;p&gt;Creating an article/report&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There are various opinions on how to write a report. No matter how you format your paper, remember to include evidence and the logical reasoning behind your analysis. If there are other research papers or models related to your question, explain the differences and similarities.&lt;/li&gt;
&lt;li&gt;Examples of great research papers&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="http://www.robotics.stanford.edu/~ang/papers/icdar01-TextRecognitionUnsupervisedFeatureLearning.pdf"&gt;Text Detection and Character Recognition in Scene Images with Unsupervised Feature Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://vision.stanford.edu/pdf/CVPR16_N_LSTM.pdf"&gt;Social LSTM: Human Trajectory Prediction in Crowded Spaces&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now it is time to share your analysis with the world. The first important place to upload your analysis is GitHub.&lt;/p&gt;

&lt;p&gt;GitHub should be used to upload:&lt;/p&gt;

&lt;p&gt;1) Code – Make sure it is effectively commented and uses precise variable names&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://realpython.com/python-comments-guide/"&gt;Writing Comments in Python (Guide)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.blog/2021/12/23/best-practices-for-writing-code-comments"&gt;Best practices for writing code comments&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;2) Your report&lt;br&gt;
3) ReadMe file – For other users to understand how to replicate your analysis&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.drupal.org/docs/develop/managing-a-drupalorg-theme-module-or-distribution-project/documenting-your-project/readme-template"&gt;Documenting your project - README template&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.makeareadme.com/"&gt;Make a README&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Another social media platform to take advantage of is Reddit. Reddit has plenty of subreddits where you can share your projects.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.reddit.com/r/learnmachinelearning/"&gt;r/learnmachinelearning&lt;/a&gt; → For simpler project that might tend to be your first few data science projects&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.reddit.com/r/MachineLearning/"&gt;r/machinelearning&lt;/a&gt; → For your detailed research papers&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.reddit.com/r/dataisbeautiful/"&gt;r/dataisbeauitful &lt;/a&gt;→ This is the go to place to share visualizations with a large community to share visualizations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://towardsdatascience.com/"&gt;Towards Data Science &lt;/a&gt;(derived from Medium)  is the go to place to upload your data science articles. These articles need to be an analysis of your project, why you used specific models over others, your findings, and more.&lt;/p&gt;

&lt;p&gt;Some examples of great project analyses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://medium.com/@Abhishek3/future-of-san-francisco-job-market-41c1ee9be07a"&gt;Future of San Francisco City job market&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/coding-an-intelligent-battleship-agent-bf0064a4b319"&gt;Coding an Intelligent Battleship Agent&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/analyzing-your-friends-imessage-wordle-stats-using-python-5649def20fd"&gt;Analyzing your Friends’ iMessage Wordle Stats Using Python&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/ive-tracked-my-mood-for-over-1000-days-a-data-analysis-5b0bda76cbf7"&gt;I’ve Tracked My Mood for Over 1000 Days: A Data Analysis&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LinkedIn is a great tool for sharing your projects with people outside of data science. Data science teams in companies constantly have to communicate with coworkers in other departments. Sharing your projects with people beyond your peers gives great insight into how effectively you can communicate a technical project to a non-technical audience.&lt;/p&gt;

&lt;p&gt;Twitter is an important platform for learning about various topics, especially academic research. If you want to be active in the data science community, keep up with new technologies, or publish your own projects, you should join Twitter. It is a great way to share your projects with the academic community and follow reputable people in the field.&lt;br&gt;
Great Twitter pages in ML/AI/DS to follow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://twitter.com/drfeifei"&gt;Fei-Fei Li - @drfeifei&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/NandoDF"&gt;Nando de Freitas - @NandoDF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/kdnuggets"&gt;KDnuggets - @kdnuggets&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Now you have the path for a great data science project!&lt;/p&gt;

&lt;p&gt;Try to implement as many of these components as you can. However, including components that make no sense for your project is a black mark, especially if an interviewer notices.&lt;/p&gt;

&lt;p&gt;Always look for ways to improve your model or follow up on your project! For example, if you created a prediction model, check how accurate it still is six months after you published it!&lt;/p&gt;

&lt;p&gt;TIP: A question you should constantly ask yourself when building your project is: why am I using this specific method? For example, if you are building a regression model and choose Lasso over Ridge, one reason could be that you want to remove certain variables. Then ask yourself: why do I want to remove those variables? Perhaps they increase the MSPE. By constantly asking questions like these throughout your project, you end up with a more accurate model, because you have thought through the different approaches.&lt;/p&gt;

&lt;p&gt;If you’re a beginner and still want more ideas and tutorials to start with, check out our post “&lt;a href="https://www.stratascratch.com/blog/19-data-science-project-ideas-for-beginners/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;19 Data Science Project Ideas for Beginners&lt;/a&gt;”.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>webdev</category>
    </item>
    <item>
      <title>String and Array Functions in SQL for Data Science</title>
      <dc:creator>StrataScratch</dc:creator>
      <pubDate>Mon, 09 May 2022 08:56:02 +0000</pubDate>
      <link>https://dev.to/nate_at_stratascratch/string-and-array-functions-in-sql-for-data-science-43fj</link>
      <guid>https://dev.to/nate_at_stratascratch/string-and-array-functions-in-sql-for-data-science-43fj</guid>
      <description>&lt;p&gt;&lt;em&gt;Commonly used string and array functions in SQL Data Science Interviews.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In a previous article "&lt;a href="https://www.stratascratch.com/blog/sql-scenario-based-interview-questions-and-answers/"&gt;SQL Scenario Based Interview Questions&lt;/a&gt;", we touched upon the various date and time functions in SQL. In this article, we will look at another favorite topic in data science interviews – string and array manipulation. With increasingly diverse and unstructured data sources becoming commonplace, string and array manipulation has become an integral part of data analysis and data science work. The key ideas discussed in this article include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cleaning Strings&lt;/li&gt;
&lt;li&gt;String Matching&lt;/li&gt;
&lt;li&gt;String Splitting&lt;/li&gt;
&lt;li&gt;Creating Arrays&lt;/li&gt;
&lt;li&gt;Splitting Arrays into Rows&lt;/li&gt;
&lt;li&gt;Aggregating Text Fields&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You might also want to look at our &lt;a href="https://www.stratascratch.com/blog/python-pandas-interview-questions-for-data-science-part-2/"&gt;Pandas article on string manipulation&lt;/a&gt; in DataFrames, as we use quite a few similar concepts here as well.&lt;/p&gt;

&lt;h2&gt;
  
  
  String Matching
&lt;/h2&gt;

&lt;p&gt;Let us start with a simple string-matching problem. This is from a past City of San Francisco data science interview.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Find the number of violations that each school had&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Determine the number of violations for each school. Any inspection where the risk category is not null is considered a violation. Print the school’s name along with the number of violations. Order the output in descending order of the number of violations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jp6khTmg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4n0eewn69gij5u4vkojs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jp6khTmg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4n0eewn69gij5u4vkojs.png" alt="String Matching Question For Practice" width="512" height="159"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can solve this problem here: &lt;a href="https://platform.stratascratch.com/coding/9727-find-the-number-of-violations-that-each-school-had"&gt;https://platform.stratascratch.com/coding/9727-find-the-number-of-violations-that-each-school-had&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The problem uses the sf_restaurant_health_violations dataset with the following fields.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2XvpZUqI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nky7m4554neato5h9v38.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2XvpZUqI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nky7m4554neato5h9v38.png" alt="Image description" width="415" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The relevant data in the table looks like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WHh3Eb-Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i7bhvoz9exgftq958vr2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WHh3Eb-Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i7bhvoz9exgftq958vr2.png" alt="Image description" width="512" height="130"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The relevant columns are business_name and risk_category.&lt;/p&gt;

&lt;p&gt;These columns are populated thus.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fST-qLrg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/grolfbdydk4u3pxgy4r0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fST-qLrg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/grolfbdydk4u3pxgy4r0.png" alt="Image description" width="512" height="266"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approach and Solution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is a relatively straightforward problem. We need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identify “Schools” from the business name&lt;/li&gt;
&lt;li&gt;Count the violations, excluding the rows where the risk_category is NULL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The simplest string-matching tool in SQL is the LIKE operator, which searches for a substring inside a larger string. However, one needs to use wildcards to ensure the correct match is found. Since we cannot be sure that every school’s name ends with the word “School”, we place the % wildcard both before and after the search term so that “SCHOOL” is matched anywhere in the name. Further, we use the ILIKE operator to make the search case-insensitive. The solution is now very simple.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT business_name,&lt;br&gt;
       COUNT(*) AS num_violations&lt;br&gt;
FROM sf_restaurant_health_violations&lt;br&gt;
WHERE business_name ILIKE '%SCHOOL%'&lt;br&gt;
  AND risk_category IS NOT NULL&lt;br&gt;
GROUP BY 1&lt;br&gt;
ORDER BY 2 DESC ;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If your SQL flavor does not have the ILIKE operator, you can convert the string to upper or lower case and then use LIKE.&lt;/p&gt;
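&lt;p&gt;To make that fallback concrete, here is a small Python sketch of the same case-insensitive containment check (the business names are made up for illustration):&lt;/p&gt;

```python
# Case-insensitive substring match, mirroring
# LOWER(business_name) LIKE '%school%'
def is_school(business_name: str) -> bool:
    return 'school' in business_name.lower()

print(is_school('STRATFORD SCHOOL'))  # True
print(is_school('Starbucks Coffee'))  # False
```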

&lt;p&gt;&lt;strong&gt;Splitting a Delimited String&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now that we have warmed up with string search, let us try another common string manipulation technique: splitting. There are numerous use cases for splitting a string, and doing so requires a delimiter (a separator). To illustrate this, let us look at another City of San Francisco data science interview problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Business Density Per Street&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fP89bSHu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sds0cghf4o2n7fkc1psp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fP89bSHu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sds0cghf4o2n7fkc1psp.png" alt="Image description" width="512" height="151"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can solve the problem on the StrataScratch Platform here: &lt;a href="https://platform.stratascratch.com/coding/9735-business-density-per-street?python="&gt;https://platform.stratascratch.com/coding/9735-business-density-per-street?python=&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This problem uses the same sf_restaurant_health_violations dataset as the previous problem. The fields of interest are business_id and business_address, which are populated thus.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--e9Q24Dzk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7l4nwo0rc3j0z1cj5kg1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--e9Q24Dzk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7l4nwo0rc3j0z1cj5kg1.png" alt="Image description" width="512" height="164"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approach and Solution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We need to extract the second word from the address, which represents the street name. To do this, we split the string using a space as the delimiter (separator) and extract the second word. We can do this with the SPLIT_PART function, which is similar to the split() method in Python. Since Postgres string comparisons are case-sensitive, we convert the output to upper case.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT UPPER(split_part(business_address, ' ', 2)) AS streetname,&lt;br&gt;
       business_address&lt;br&gt;
FROM sf_restaurant_health_violations ;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We get the following output.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2A0U2QR3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bvizywtvcpkr4g0fk971.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2A0U2QR3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bvizywtvcpkr4g0fk971.png" alt="Image description" width="512" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now the problem becomes relatively easy to solve. We find the number of distinct businesses on each street. Since we need only those streets with five or more distinct businesses, we use the HAVING clause to filter the output.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT UPPER(split_part(business_address, ' ', 2)) AS streetname,&lt;br&gt;
       COUNT (DISTINCT business_id) AS density&lt;br&gt;
FROM sf_restaurant_health_violations&lt;br&gt;
GROUP BY 1&lt;br&gt;
HAVING COUNT (DISTINCT business_id) &amp;gt;= 5 ;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We get the following output. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3hkpdUSg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fp7ufbf0xjwqsh1xxpzy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3hkpdUSg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fp7ufbf0xjwqsh1xxpzy.png" alt="Image description" width="512" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we can aggregate this table using a subquery, a CTE, or a temp table. We have used a CTE in this case and get the final output.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;WITH rel_businesses AS&lt;br&gt;
  (SELECT UPPER(split_part(business_address, ' ', 2)) AS streetname,&lt;br&gt;
          COUNT (DISTINCT business_id) AS density&lt;br&gt;
   FROM sf_restaurant_health_violations&lt;br&gt;
   GROUP BY 1&lt;br&gt;
   HAVING COUNT (DISTINCT business_id) &amp;gt;= 5)&lt;br&gt;
SELECT AVG(density),&lt;br&gt;
       MAX(density)&lt;br&gt;
FROM rel_businesses ;&lt;/code&gt;&lt;/p&gt;
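&lt;p&gt;The CTE above boils down to a group-by with a distinct count, a filter, and a final aggregate. A minimal Python sketch of that same pipeline, using made-up rows:&lt;/p&gt;

```python
from collections import defaultdict

# (business_id, business_address) pairs -- hypothetical sample data
rows = [
    (1, "10 Mission St"), (2, "22 Mission St"), (3, "30 Mission St"),
    (4, "41 Mission St"), (5, "55 Mission St"),  # 5 distinct ids on MISSION
    (6, "12 Grand Ave"), (6, "12 Grand Ave"),    # duplicate rows collapse
]

# streetname -> set of distinct business ids (COUNT(DISTINCT business_id))
by_street = defaultdict(set)
for business_id, address in rows:
    by_street[address.split(' ')[1].upper()].add(business_id)

# HAVING COUNT(DISTINCT business_id) >= 5
densities = [len(ids) for ids in by_street.values() if len(ids) >= 5]

# AVG(density), MAX(density)
print(sum(densities) / len(densities), max(densities))  # 5.0 5
```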

&lt;h2&gt;
  
  
  Arrays
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2DvYA7PG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8ad5771y5sncgbypu6xs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2DvYA7PG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8ad5771y5sncgbypu6xs.png" alt="Image description" width="880" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most modern SQL flavors allow the creation and manipulation of arrays. Let us look at working with string arrays; one can manipulate integer and floating-point arrays in a similar manner. To illustrate this, let us take an SQL data science interview problem from an Airbnb interview.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;City With Most Amenities&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Find the city with the most amenities in the given dataset. Each row in the dataset represents a unique host. Output the name of the city with the most amenities.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Qfy8caMN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ahknggjl8q8kfosstdtw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Qfy8caMN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ahknggjl8q8kfosstdtw.png" alt="Image description" width="512" height="129"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can solve the problem here: &lt;a href="https://platform.stratascratch.com/coding/9633-city-with-most-amenities"&gt;https://platform.stratascratch.com/coding/9633-city-with-most-amenities&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The problem uses the airbnb_search_details dataset with the following fields.&lt;/p&gt;

&lt;p&gt;airbnb_search_details&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZVmDz6PN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yyp3in4hkw4b23sx0hgs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZVmDz6PN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yyp3in4hkw4b23sx0hgs.png" alt="Image description" width="406" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The main fields of interest here are city and amenities that are populated thus. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PiR-G7cx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/btoiycrqpfons9ka8jbo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PiR-G7cx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/btoiycrqpfons9ka8jbo.png" alt="Image description" width="512" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approach and solution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To solve this, let us break the problem into parts.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Find the number of amenities for each property&lt;/li&gt;
&lt;li&gt;Aggregate the amenity counts at the city level&lt;/li&gt;
&lt;li&gt;Find the city with the highest number of amenities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The amenities are represented as a single comma-separated string. However, SQL currently recognizes this field as one string, so we need to split it into individual amenities using the comma delimiter. To do this, we use the STRING_TO_ARRAY() function and specify a comma as the delimiter.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT city,&lt;br&gt;
       STRING_TO_ARRAY(amenities, ',') AS num_amenities&lt;br&gt;
FROM airbnb_search_details ;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We get the following output.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kH7wnxrg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b09npq9wug27mxugjwvt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kH7wnxrg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b09npq9wug27mxugjwvt.png" alt="Image description" width="512" height="171"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that for this problem, the opening and closing braces are considered part of the first and last words in the string. To clean the string, we can use the BTRIM function, which removes all the specified leading and trailing characters. We can modify our query in the following manner.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT city,&lt;br&gt;
       STRING_TO_ARRAY(BTRIM(amenities, '{}'), ',') AS num_amenities&lt;br&gt;
FROM airbnb_search_details ;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This gives us the following output. As one can see, we have successfully removed the leading and trailing braces.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iKEvXqgr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/645hpcypeckis57ekx3e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iKEvXqgr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/645hpcypeckis57ekx3e.png" alt="Image description" width="512" height="123"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To find the number of amenities, we need to count the number of elements in the amenities array. We can do this using the ARRAY_LENGTH() function. The function requires us to specify the array dimension whose length is to be measured, which is useful for multi-dimensional arrays. Since our array is 1-dimensional, we simply specify 1.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT city,&lt;br&gt;
       ARRAY_LENGTH(STRING_TO_ARRAY(BTRIM(amenities, '{}') , ',') , 1) AS num_amenities&lt;br&gt;
FROM airbnb_search_details ;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Our output looks like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6o3cGZjP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/77ovxpexck64hw44jhqw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6o3cGZjP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/77ovxpexck64hw44jhqw.png" alt="Image description" width="512" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We now proceed to aggregate the number of amenities at city level.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT city,&lt;br&gt;
       SUM(ARRAY_LENGTH(STRING_TO_ARRAY(BTRIM(amenities, '{}'), ','), 1)) AS num_amenities&lt;br&gt;
FROM airbnb_search_details&lt;br&gt;
GROUP BY 1 ;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Our output now looks like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_i7SIEvN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uobphl8b8slr15tgiv20.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_i7SIEvN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uobphl8b8slr15tgiv20.png" alt="Image description" width="512" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can now find the city with the highest number of amenities by sorting in descending order and using LIMIT 1 or, more reliably, by ranking them.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT city&lt;br&gt;
FROM&lt;br&gt;
  (SELECT city,&lt;br&gt;
          DENSE_RANK() OVER (&lt;br&gt;
                             ORDER BY num_amenities DESC) AS rank&lt;br&gt;
   FROM&lt;br&gt;
     (SELECT city ,&lt;br&gt;
             SUM(ARRAY_LENGTH(STRING_TO_ARRAY(BTRIM(amenities, '{}'), ','), 1)) AS num_amenities&lt;br&gt;
      FROM airbnb_search_details&lt;br&gt;
      GROUP BY 1) Q1) Q2&lt;br&gt;
WHERE rank = 1 ;&lt;/code&gt;&lt;/p&gt;
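&lt;p&gt;Ranking is the more reliable route because LIMIT 1 silently drops ties: if two cities share the top amenity count, DENSE_RANK keeps both. A Python sketch of the same tie-keeping logic, with made-up totals:&lt;/p&gt;

```python
# city -> total amenities (hypothetical numbers)
totals = {"NYC": 490, "LA": 463, "SF": 490}

top = max(totals.values())
# every city with rank = 1, like the DENSE_RANK filter
winners = [city for city, n in totals.items() if n == top]
print(winners)  # ['NYC', 'SF'] -- LIMIT 1 would return only one of them
```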

&lt;p&gt;&lt;strong&gt;Splitting an Array&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The above problem could also have been solved by exploding the array into individual rows and then aggregating the number of amenities for each city. Let us use this method in another SQL data science question, from a Meta (Facebook) interview.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Views Per Keyword&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Find the number of views for each keyword. Report the keyword and the total views in decreasing order of views.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uLAnJ9nI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1k8s57wh3b99dnow8d3c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uLAnJ9nI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1k8s57wh3b99dnow8d3c.png" alt="Image description" width="512" height="105"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can solve the problem on the StrataScratch platform here: &lt;a href="https://platform.stratascratch.com/coding/9791-views-per-keyword?python="&gt;https://platform.stratascratch.com/coding/9791-views-per-keyword?python=&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The problem uses the facebook_posts and facebook_post_views datasets. The fields present in the facebook_posts dataset are&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3tjnJZXv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gqagpg2iqm8d112xj4ii.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3tjnJZXv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gqagpg2iqm8d112xj4ii.png" alt="Image description" width="512" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The data is presented in the following manner&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y1AePa-A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sr1oddjd7ggt97sbfnnq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y1AePa-A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sr1oddjd7ggt97sbfnnq.png" alt="Image description" width="512" height="139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The facebook_post_views has the following fields&lt;/p&gt;

&lt;p&gt;facebook_post_views&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jWdJNA8X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/u8w3ebx7bgmf6dreza3r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jWdJNA8X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/u8w3ebx7bgmf6dreza3r.png" alt="Image description" width="512" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And this is how the data looks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3p_PkwXJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hmpiilzo80xme6fk082d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3p_PkwXJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hmpiilzo80xme6fk082d.png" alt="Image description" width="512" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approach and Solution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let us break this problem into individual parts.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We start off by merging the two datasets on the post_id field and aggregating the number of views for each post.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;SELECT fp.post_id,&lt;br&gt;
       fp.post_keywords,&lt;br&gt;
       COALESCE(COUNT(DISTINCT fpv.viewer_id), 0) AS num_views&lt;br&gt;
FROM facebook_posts fp&lt;br&gt;
LEFT JOIN facebook_post_views fpv ON fp.post_id = fpv.post_id&lt;br&gt;
GROUP BY 1,&lt;br&gt;
         2 ;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We get the following output.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ewovji-2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lzjlh7atnzu8ta66wggt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ewovji-2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lzjlh7atnzu8ta66wggt.png" alt="Image description" width="512" height="140"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We need to assign the views to each keyword. For example, for post_id = 3, the keywords spaghetti and food should each get 3 views. For post_id = 4, the spam keyword should get 3 views, and so on. To accomplish this, we first clean the string by stripping the brackets and the # symbol.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;SELECT fp.post_id,&lt;br&gt;
       STRING_TO_ARRAY(BTRIM(fp.post_keywords, '[]#'), ',') AS keyword,&lt;br&gt;
       COALESCE(COUNT(DISTINCT fpv.viewer_id), 0) AS num_views&lt;br&gt;
FROM facebook_posts fp&lt;br&gt;
LEFT JOIN facebook_post_views fpv ON fp.post_id = fpv.post_id&lt;br&gt;
GROUP BY 1,&lt;br&gt;
         2 ;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We get the following output&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qhRdlDPz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0gb6vlxo0bvv2clgapcy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qhRdlDPz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0gb6vlxo0bvv2clgapcy.png" alt="Image description" width="512" height="140"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Now we separate (explode) the array into individual records using the UNNEST function.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;SELECT fp.post_id,&lt;br&gt;
       UNNEST(STRING_TO_ARRAY(BTRIM(fp.post_keywords, '[]#'), ',')) AS keyword,&lt;br&gt;
       COALESCE(COUNT(DISTINCT fpv.viewer_id), 0) AS num_views&lt;br&gt;
FROM facebook_posts fp&lt;br&gt;
LEFT JOIN facebook_post_views fpv ON fp.post_id = fpv.post_id&lt;br&gt;
GROUP BY 1,&lt;br&gt;
         2 ;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gEocGXl3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/92s0s0gjb401vvn1umcz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gEocGXl3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/92s0s0gjb401vvn1umcz.png" alt="Image description" width="512" height="237"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We can now easily aggregate the number of views per keyword and sort them in descending order.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;WITH exp_keywords AS&lt;br&gt;
  (SELECT fp.post_id ,&lt;br&gt;
          UNNEST(STRING_TO_ARRAY(BTRIM(fp.post_keywords, '[]#'), ',')) AS keyword ,&lt;br&gt;
          COALESCE(COUNT(DISTINCT fpv.viewer_id), 0) AS num_views&lt;br&gt;
   FROM facebook_posts fp&lt;br&gt;
   LEFT JOIN facebook_post_views fpv ON fp.post_id = fpv.post_id&lt;br&gt;
   GROUP BY 1,&lt;br&gt;
            2)&lt;br&gt;
SELECT keyword,&lt;br&gt;
       sum(num_views) AS total_views&lt;br&gt;
FROM exp_keywords&lt;br&gt;
GROUP BY 1&lt;br&gt;
ORDER BY 2 DESC ;&lt;/code&gt;&lt;/p&gt;
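&lt;p&gt;For comparison, the same explode-and-aggregate pipeline can be sketched in pandas. The frames below are invented stand-ins for the facebook_posts and facebook_post_views tables, not the actual dataset.&lt;/p&gt;

```python
import pandas as pd

# Invented stand-ins for the two tables in the question.
posts = pd.DataFrame({
    "post_id": [0, 1],
    "post_keywords": ["[#food,#celebrity]", "[#food,#happy]"],
})
views = pd.DataFrame({"post_id": [0, 0, 1], "viewer_id": [10, 11, 12]})

# Distinct viewers per post (a post with no views would get 0).
view_counts = (views.groupby("post_id")["viewer_id"].nunique()
                    .reindex(posts["post_id"], fill_value=0))
posts["num_views"] = view_counts.values

# Strip the surrounding brackets, split on commas, and explode into one
# row per keyword, mirroring BTRIM + STRING_TO_ARRAY + UNNEST.
posts["keyword"] = posts["post_keywords"].str.strip("[]").str.split(",")
exploded = posts.explode("keyword")
exploded["keyword"] = exploded["keyword"].str.lstrip("#")

# Aggregate views per keyword and sort descending, like the final query.
result = (exploded.groupby("keyword")["num_views"].sum()
                  .sort_values(ascending=False))
```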

&lt;p&gt;&lt;strong&gt;Aggregating Text Fields&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let us finish things off by doing the converse: aggregating rows back into a string. We illustrate this with a SQL Data Science interview question from Google.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;File Contents Shuffle&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Rearrange the words of the file final.txt to make a new file named wacky.txt. Sort all the words in alphabetical order, output the words in one column and the filename wacky.txt in another.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TNT1iC_G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jklxi2yglqg3n2cc37vz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TNT1iC_G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jklxi2yglqg3n2cc37vz.png" alt="Image description" width="512" height="132"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can solve the problem here: &lt;a href="https://platform.stratascratch.com/coding/9818-file-contents-shuffle"&gt;https://platform.stratascratch.com/coding/9818-file-contents-shuffle&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The problem uses the google_file_store dataset with the following columns.&lt;/p&gt;

&lt;p&gt;google_file_store&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IRkZE23R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/96gb1sc4gcdn5szcejq6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IRkZE23R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/96gb1sc4gcdn5szcejq6.png" alt="Image description" width="512" height="189"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The contents of the dataset look like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--L8WcRHHK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hkjkmewqgrv5rfvj06n8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--L8WcRHHK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hkjkmewqgrv5rfvj06n8.png" alt="Image description" width="512" height="90"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approach and Solution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let us solve this problem in a step-wise manner.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We first keep only the contents of the file final.txt, split the contents using space as a delimiter, explode the resulting array into individual rows, and sort in alphabetical order.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;SELECT UNNEST(STRING_TO_ARRAY(CONTENTS, ' ')) AS words&lt;br&gt;
FROM google_file_store&lt;br&gt;
WHERE filename ILIKE '%FINAL%'&lt;br&gt;
ORDER BY 1 ;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We get the following output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vL_y6x-A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7t98636j4ff75f8g0yuy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vL_y6x-A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7t98636j4ff75f8g0yuy.png" alt="Image description" width="437" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We now need to combine the individual words back into a string. To do this, we use the STRING_AGG() function and specify space as the delimiter. This function is similar to the join() method in Python. We also add the filename for the new string and output the result.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;WITH exploded_arr AS&lt;br&gt;
  (SELECT UNNEST(STRING_TO_ARRAY(CONTENTS, ' ')) AS words&lt;br&gt;
   FROM google_file_store&lt;br&gt;
   WHERE filename ILIKE '%FINAL%'&lt;br&gt;
   ORDER BY 1)&lt;br&gt;
SELECT 'wacky.txt' AS filename,&lt;br&gt;
       STRING_AGG(words, ' ') AS CONTENTS&lt;br&gt;
FROM exploded_arr ;&lt;/code&gt;&lt;/p&gt;
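&lt;p&gt;Since STRING_AGG() is likened to Python's join() method, here is a minimal Python sketch of the same split, sort, and re-join pipeline. The file contents string below is an invented example, not the actual dataset.&lt;/p&gt;

```python
# Invented contents of final.txt.
contents = "lion tiger cheetah monkey gorilla"

# Split on spaces (STRING_TO_ARRAY), sort alphabetically (ORDER BY),
# then glue the words back together with spaces (STRING_AGG).
words = sorted(contents.split(" "))
wacky_contents = " ".join(words)
result = ("wacky.txt", wacky_contents)
```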

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we looked at the text and array manipulation capabilities of SQL. These are especially useful both upstream, in ETL processes, and downstream, in analysis. As with other Data Science areas, only patience, persistence, and practice can make you proficient. On StrataScratch, we have over 700 coding and non-coding problems that are relevant to Data Science interviews. These problems appeared in actual Data Science interviews at top companies such as Google, Amazon, Microsoft, and Netflix. For example, check out our posts "&lt;a href="https://www.stratascratch.com/blog/40-data-science-interview-questions-from-top-companies/"&gt;40+ Data Science Interview Questions From Top Companies&lt;/a&gt;" and "&lt;a href="https://www.stratascratch.com/blog/sql-interview-questions-you-must-prepare-the-ultimate-guide/"&gt;The Ultimate Guide to SQL Interview Questions&lt;/a&gt;" to practice such interview questions and prepare for the most in-demand jobs at big tech firms and start-ups across the world.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>sql</category>
      <category>beginners</category>
    </item>
    <item>
      <title>The Ultimate Guide to Python Window Functions</title>
      <dc:creator>StrataScratch</dc:creator>
      <pubDate>Wed, 23 Feb 2022 15:11:09 +0000</pubDate>
      <link>https://dev.to/nate_at_stratascratch/the-ultimate-guide-to-python-window-functions-1h1d</link>
      <guid>https://dev.to/nate_at_stratascratch/the-ultimate-guide-to-python-window-functions-1h1d</guid>
      <description>&lt;p&gt;&lt;em&gt;This article focuses on different types of Python window functions, where and how to implement them, practice questions, reference articles and documentation.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A window function is a popular technique for analyzing a subset of related values. Window functions are most commonly associated with SQL; however, they are extremely useful in Python as well.&lt;/p&gt;

&lt;p&gt;If you would like to check out our content on SQL Window Functions, we have also created an article "&lt;a href="https://www.stratascratch.com/blog/the-ultimate-guide-to-sql-window-functions/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;The Ultimate Guide to SQL Window Functions&lt;/a&gt;" and a &lt;a href="https://www.youtube.com/watch?v=XBE09l-UYTE" rel="noopener noreferrer"&gt;YouTube video&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;This article discusses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Different types of window functions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Where / How to implement these functions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Practice Questions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reference articles / Documentation&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A general format is written for each of these functions for you to understand and implement on your own. The format includes &lt;strong&gt;&lt;em&gt;bold italicized text&lt;/em&gt;&lt;/strong&gt;, which indicates the sections of the function you need to replace during implementation.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;strong&gt;&lt;em&gt;dataframe&lt;/em&gt;&lt;/strong&gt;.groupby(level='&lt;strong&gt;&lt;em&gt;groupby_column&lt;/em&gt;&lt;/strong&gt;').agg({‘&lt;strong&gt;&lt;em&gt;aggregate_column&lt;/em&gt;&lt;/strong&gt;’: ‘&lt;strong&gt;&lt;em&gt;aggregate_function&lt;/em&gt;&lt;/strong&gt;’})&lt;/p&gt;

&lt;p&gt;Texts such as '&lt;strong&gt;&lt;em&gt;dataframe&lt;/em&gt;&lt;/strong&gt;' and '&lt;strong&gt;&lt;em&gt;groupby_column&lt;/em&gt;&lt;/strong&gt;' are bold and italicized, meaning you should replace them with the actual variables.&lt;br&gt;
Texts such as ‘.groupby’ and ‘level’, which are not bold and italicized, must remain the same for the function to execute.&lt;/p&gt;

&lt;p&gt;Let’s suppose Amazon asks you to find the total cost each user spent on their Amazon orders.&lt;br&gt;
An implementation of this function on that dataset would look similar to this:&lt;br&gt;
&lt;strong&gt;&lt;em&gt;amazon_orders&lt;/em&gt;&lt;/strong&gt;.groupby(level='&lt;strong&gt;&lt;em&gt;user_id&lt;/em&gt;&lt;/strong&gt;').agg({'&lt;strong&gt;&lt;em&gt;cost&lt;/em&gt;&lt;/strong&gt;': '&lt;strong&gt;&lt;em&gt;sum&lt;/em&gt;&lt;/strong&gt;'})&lt;/p&gt;
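&lt;p&gt;As a runnable sketch of that template (the amazon_orders frame and its values below are invented for illustration; note that level= refers to an index level, so user_id is set as the index first):&lt;/p&gt;

```python
import pandas as pd

# Invented orders data; user_id becomes the index level we group on.
amazon_orders = pd.DataFrame(
    {"user_id": [1, 1, 2], "cost": [10.0, 15.0, 20.0]}
).set_index("user_id")

# Total cost each user spent across their orders.
totals = amazon_orders.groupby(level="user_id").agg({"cost": "sum"})
```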

&lt;p&gt;&lt;u&gt;Table of Contents&lt;/u&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Python Window Functions overview diagram&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Aggregate&lt;/strong&gt;&lt;br&gt;
• &lt;u&gt;Group by&lt;/u&gt;&lt;br&gt;
• &lt;u&gt;Rolling&lt;/u&gt;&lt;br&gt;
• &lt;u&gt;Expanding&lt;/u&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ranking&lt;/strong&gt;&lt;br&gt;
• &lt;u&gt;Row number&lt;/u&gt;&lt;/p&gt;
&lt;h6&gt;
  
  
  reset_index()
&lt;/h6&gt;
&lt;h6&gt;
  
  
  cumcount()
&lt;/h6&gt;

&lt;p&gt;• &lt;u&gt;Rank&lt;/u&gt;&lt;/p&gt;
&lt;h6&gt;
  
  
  default_rank
&lt;/h6&gt;
&lt;h6&gt;
  
  
  min_rank
&lt;/h6&gt;
&lt;h6&gt;
  
  
  NA_bottom
&lt;/h6&gt;
&lt;h6&gt;
  
  
  descending
&lt;/h6&gt;

&lt;p&gt;• &lt;u&gt;Dense rank&lt;/u&gt;&lt;br&gt;
• &lt;u&gt;Percent rank&lt;/u&gt;&lt;br&gt;
• &lt;u&gt;N-Tile / qcut()&lt;/u&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Value&lt;/strong&gt;&lt;br&gt;
• &lt;u&gt;Lag / Lead&lt;/u&gt;&lt;br&gt;
• &lt;u&gt;First / Last / nth value&lt;/u&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Python Window Functions
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sz6kym1vzniv0fbcwxy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sz6kym1vzniv0fbcwxy.png" alt="Types of Python Window Functions"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While there is no official classification of Python window functions, these are the functions most commonly implemented.&lt;/p&gt;

&lt;h2&gt;
  
  
  Aggregate
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjz0wozmqll8sz6shexdk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjz0wozmqll8sz6shexdk.png" alt="Aggregate python window functions"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These are some common types of aggregate functions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average&lt;/li&gt;
&lt;li&gt;Max&lt;/li&gt;
&lt;li&gt;Min&lt;/li&gt;
&lt;li&gt;Sum&lt;/li&gt;
&lt;li&gt;Count&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these aggregate functions (except count, which will be explained later) can be used in three types of situations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Group by&lt;/li&gt;
&lt;li&gt;Rolling&lt;/li&gt;
&lt;li&gt;Expanding&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Group by: Facebook is trying to find the average revenue of Instagram for each year.&lt;/li&gt;
&lt;li&gt;Rolling: Facebook is trying to find the rolling 3-year average revenue of Instagram.&lt;/li&gt;
&lt;li&gt;Expanding: Facebook is trying to find the cumulative average revenue of Instagram with an initial size of 2 years.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Group by
&lt;/h3&gt;

&lt;p&gt;A group-by aggregate computes a statistical function over a column within each group.&lt;/p&gt;

&lt;p&gt;Let’s use a &lt;a href="https://platform.stratascratch.com/coding/9899-percentage-of-total-spend?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;question&lt;/a&gt; from Amazon to explain this topic. This question is asking us to calculate the percentage of the total expenditure a customer spent on each order. Output the customer’s first name, order details (product name), and percentage of the order cost to their total spend across all orders.&lt;/p&gt;

&lt;p&gt;Remember to follow these 3 steps when approaching questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ask clarifying questions&lt;/li&gt;
&lt;li&gt;State assumptions&lt;/li&gt;
&lt;li&gt;Attempt the question&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When approaching these questions, understand which columns need to be grouped and which columns need to be aggregated.&lt;br&gt;
For the Amazon example,&lt;br&gt;
Group by: customer first_name, order_id, order_details&lt;br&gt;
Aggregate: total_order_cost&lt;/p&gt;

&lt;p&gt;In this question, there are 2 tables which need to be joined to get the customer’s first name, item, and spending. After merging both tables and filtering to the required columns, we get the following dataset:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fli9e7nho1vd0ta3vul0o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fli9e7nho1vd0ta3vul0o.png" alt="Aggregate Group by Python window functions"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once necessary data is set in a single table, it is easier to manipulate.&lt;/p&gt;

&lt;p&gt;Here we can find the total spending per person by grouping on first_name and summing total_order_cost.&lt;/p&gt;

&lt;p&gt;This is the general format for grouping by and aggregating the required columns.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;dataframe&lt;/em&gt;&lt;/strong&gt;.groupby(level='&lt;strong&gt;&lt;em&gt;groupby_column&lt;/em&gt;&lt;/strong&gt;').agg({'&lt;strong&gt;&lt;em&gt;aggregate_column&lt;/em&gt;&lt;/strong&gt;': '&lt;strong&gt;&lt;em&gt;aggregate_function&lt;/em&gt;&lt;/strong&gt;'})&lt;/p&gt;

&lt;p&gt;In reference to the Amazon example, this is the executing code.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;total_spending = customer_orders.groupby("first_name").agg({'total_order_cost' : 'sum'})&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This code will output the following dataframe&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flzz5v109clnxnfjfaejk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flzz5v109clnxnfjfaejk.png" alt="Output for Aggregate Python window function Question"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After this, we want to add a column to the merged data frame to represent total spending by each person.&lt;/p&gt;

&lt;p&gt;Let’s join both dataframes on the person’s first_name&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pd.merge(merged_dataframe, total_spending, how="left", on="first_name")&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now we get the following dataset&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxepygqm1h6fxiw6hl2qx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxepygqm1h6fxiw6hl2qx.png" alt="Output 2 for Aggregate Python window function Question"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As seen, the total_order_cost_y column represents the total spending per person and the total_order_cost_x column represents the cost per order. From here, it is a simple division of the 2 columns to create the percentage-of-spending column, followed by filtering the output to get the required columns.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;result = df3[["first_name", "order_details", "percentage_total_cost"]]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpq2e8jn9q1zuv5g970z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpq2e8jn9q1zuv5g970z.png" alt="Output 3 for Aggregate Python window function Question"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, in certain situations it is necessary to sort the values within each group. This is where the sort_values() function comes in.&lt;/p&gt;

&lt;p&gt;Referencing the amazon question example:&lt;/p&gt;

&lt;p&gt;Suppose the interviewer asks you to order the percentage_total_cost in descending order for each person.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;result = result.sort_values(by=['first_name', 'percentage_total_cost'], ascending = (True, False))&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;import pandas as pd&lt;br&gt;
df = pd.merge(orders, customers, left_on="cust_id", right_on="id")&lt;br&gt;
df1 = df[["first_name", "id_x", 'order_details', 'total_order_cost']]&lt;br&gt;
df2 = df.groupby("first_name").agg({'total_order_cost' : 'sum'})&lt;br&gt;
df3 = pd.merge(df1, df2, how="left", on="first_name")&lt;br&gt;
df3["percentage_total_cost"] = df3["total_order_cost_x"] / df3["total_order_cost_y"]&lt;br&gt;
result = df3[["first_name", "order_details", "percentage_total_cost"]]&lt;br&gt;
result = result.sort_values(by=['first_name', 'percentage_total_cost'], ascending = (True, False))&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Practice&lt;/u&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/9711-facilities-with-lots-of-inspections?python=&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/9711-facilities-with-lots-of-inspections?python=&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/9899-percentage-of-total-spend?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/9899-percentage-of-total-spend?python=1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/2044-most-senior-junior-employee?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2044-most-senior-junior-employee?python=1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;u&gt;Reference&lt;/u&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;sort_values() function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;groupby() function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pbpython.com/groupby-agg.html?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;Aggregation in Group By Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.core.groupby.DataFrameGroupBy.agg.html?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;groupby().agg() function&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Rolling vs Expanding Function
&lt;/h3&gt;

&lt;p&gt;Before diving into how to execute a rolling or expanding function, let’s understand how each of these functions works. While the rolling and expanding functions work similarly, there is a significant difference in window size: the rolling function has a fixed window size, while the expanding function has a variable one.&lt;/p&gt;

&lt;p&gt;These images explain the difference between rolling and expanding functions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rolling Function&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpz3m9mn3d2j0tiw5ofan.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpz3m9mn3d2j0tiw5ofan.png" alt="Rolling Function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expanding Function&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1ueqhw13ne3atup8tvr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1ueqhw13ne3atup8tvr.png" alt="Expanding Function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Rolling and expanding functions both start with the same window size, but the expanding function incorporates all subsequent values beyond the initial window size.&lt;/p&gt;

&lt;p&gt;Example: AccuWeather, a weather forecasting company, is trying to find the rolling and expanding 10-day average temperature of San Francisco in January.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Rolling&lt;/em&gt;&lt;/strong&gt;: Starting with a window size of 10, we take the average temperature from January 1st to January 10th. Next we take January 2nd to January 11th and so on. This shows the window size in rolling functions remains the same.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Expanding&lt;/em&gt;&lt;/strong&gt;: Starting with a window size of 10, we take the average temperature from January 1st to January 10th. However, next we’ll take the average temperature from January 1st to January 11th. Then, January 1st to January 12th and so on. Therefore, the window size has “expanded”.&lt;/p&gt;
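&lt;p&gt;The difference is easy to see on a small series. The temperatures below are made up purely to illustrate the two window behaviors.&lt;/p&gt;

```python
import pandas as pd

temps = pd.Series([50, 52, 54, 56, 58, 60])

# Rolling: fixed window of 3; each value averages the last 3 readings.
rolling_avg = temps.rolling(3).mean()

# Expanding: the window grows; each value averages everything seen so
# far, once at least 3 observations are available.
expanding_avg = temps.expanding(3).mean()
```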

&lt;p&gt;While there are many aggregate functions that can be used in rolling/expanding functions, this article will discuss the frequently used functions (sum, average, max, min).&lt;/p&gt;

&lt;p&gt;This brings us to the reason why the count function is not used in rolling and expanding functions. Count is used when a certain variable is grouped and there is a need to count the occurrence of a value. In the rolling and expanding function, there is no grouping of rows, but a calculation on a specific column.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Rolling Aggregate&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Implementation of rolling functions is straightforward.&lt;/p&gt;

&lt;p&gt;A general format:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;DataSeries&lt;/em&gt;&lt;/strong&gt;.rolling(&lt;strong&gt;&lt;em&gt;window_size&lt;/em&gt;&lt;/strong&gt;).&lt;strong&gt;&lt;em&gt;aggregate_function&lt;/em&gt;&lt;/strong&gt;()&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
Temperature of San Francisco for the first 22 days of 2021. Let’s find the average, sum, maximum, and minimum over a 5-day rolling time period.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;weather['Average'] = weather['Temperature'].rolling(5).mean()&lt;br&gt;
weather['Sum'] = weather['Temperature'].rolling(5).sum()&lt;br&gt;
weather['Max'] = weather['Temperature'].rolling(5).max()&lt;br&gt;
weather['Min'] = weather['Temperature'].rolling(5).min()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dfm0pjovqpk4trlmg7n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dfm0pjovqpk4trlmg7n.png" alt="Output for Rolling Aggregate Python window function Question"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From row 4 onward, the rolling function with a fixed window size of 5 calculates the average, sum, max, and min over the Temperature values. This means that the Average column in row 16 holds the average of rows 12, 13, 14, 15, and 16.&lt;/p&gt;

&lt;p&gt;As expected, the first 4 values of the rolling function columns are null due to not having enough values to calculate. Sometimes you still want to calculate the aggregate of the first n rows even if it doesn’t fit the number of required rows.&lt;/p&gt;

&lt;p&gt;In that case, we have to set a minimum number of observations to start calculating. Within the rolling function, you can specify the min_periods.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;DataSeries&lt;/em&gt;&lt;/strong&gt;.rolling(&lt;strong&gt;&lt;em&gt;window_size&lt;/em&gt;&lt;/strong&gt;, min_periods=&lt;strong&gt;&lt;em&gt;minimum_observations&lt;/em&gt;&lt;/strong&gt;).&lt;strong&gt;&lt;em&gt;aggregate_function&lt;/em&gt;&lt;/strong&gt;()&lt;/p&gt;

&lt;p&gt;&lt;code&gt;weather['Average'] = weather['Temperature'].rolling(5, min_periods=1).mean()&lt;br&gt;
weather['Sum'] = weather['Temperature'].rolling(5, min_periods=2).sum()&lt;br&gt;
weather['Max'] = weather['Temperature'].rolling(5, min_periods=3).max()&lt;br&gt;
weather['Min'] = weather['Temperature'].rolling(5, min_periods=3).min()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyqdxhv19uf6hdg24aoy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyqdxhv19uf6hdg24aoy.png" alt="Output 2 for Rolling Aggregate Python window function Question"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Practice&lt;/u&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/10314-revenue-over-time?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/10314-revenue-over-time?python=1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;u&gt;Reference&lt;/u&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/dont-miss-out-on-rolling-window-functions-in-pandas-850b817131db?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://towardsdatascience.com/dont-miss-out-on-rolling-window-functions-in-pandas-850b817131db&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Expanding Aggregate&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Expanding function has a similar implementation to rolling functions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;DataSeries&lt;/em&gt;&lt;/strong&gt;.expanding(&lt;strong&gt;&lt;em&gt;minimum_observations&lt;/em&gt;&lt;/strong&gt;).&lt;strong&gt;&lt;em&gt;aggregate_function&lt;/em&gt;&lt;/strong&gt;()&lt;/p&gt;

&lt;p&gt;It is important to remember that, unlike the rolling function, the expanding function does not set a window size, due to its variability. Instead, minimum_observations is specified, and rows before that many observations have accumulated are set to null.&lt;/p&gt;

&lt;p&gt;Let’s use the same San Francisco temperature example to explain the expanding function.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;weather['Average'] = weather['Temperature'].expanding(5).mean()&lt;br&gt;
weather['Sum'] = weather['Temperature'].expanding(5).sum()&lt;br&gt;
weather['Max'] = weather['Temperature'].expanding(5).max()&lt;br&gt;
weather['Min'] = weather['Temperature'].expanding(5).min()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq1ae2vysq5ienz6kll4t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq1ae2vysq5ienz6kll4t.png" alt="Output for Expanding Aggregate Python window function Question"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As can be seen in the minimum temperature column, the expanding function takes the minimum over the entire dataset seen so far, since the window keeps expanding beyond the initial minimum observations.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;import pandas as pd&lt;br&gt;
weather = pd.DataFrame({'Temperature': [57, 54, 60, 54, 57, 58, 52, 52, 59, 54, 53, 57, 56, 60, 55, 58, 59, 64, 65, 66, 67, 74]})&lt;br&gt;
weather['Rolling_Average'] = weather['Temperature'].rolling(5).mean()&lt;br&gt;
weather['Rolling_Sum'] = weather['Temperature'].rolling(5).sum()&lt;br&gt;
weather['Rolling_Max'] = weather['Temperature'].rolling(5).max()&lt;br&gt;
weather['Rolling_Min'] = weather['Temperature'].rolling(5).min()&lt;br&gt;
weather['Rolling_Average_minperiod'] = weather['Temperature'].rolling(5, min_periods=1).mean()&lt;br&gt;
weather['Rolling_Sum_minperiod'] = weather['Temperature'].rolling(5, min_periods=2).sum()&lt;br&gt;
weather['Rolling_Max_minperiod'] = weather['Temperature'].rolling(5, min_periods=3).max()&lt;br&gt;
weather['Rolling_Min_minperiod'] = weather['Temperature'].rolling(5, min_periods=3).min()&lt;br&gt;
weather['Expanding_Average'] = weather['Temperature'].expanding(5).mean()&lt;br&gt;
weather['Expanding_Sum'] = weather['Temperature'].expanding(5).sum()&lt;br&gt;
weather['Expanding_Max'] = weather['Temperature'].expanding(5).max()&lt;br&gt;
weather['Expanding_Min'] = weather['Temperature'].expanding(5).min()&lt;br&gt;
weather&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Reference&lt;/u&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.expanding.html?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.expanding.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/window-functions-in-pandas-eaece0421f7?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://towardsdatascience.com/window-functions-in-pandas-eaece0421f7&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://campus.datacamp.com/courses/manipulating-time-series-data-in-python/window-functions-rolling-expanding-metrics?ex=5&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://campus.datacamp.com/courses/manipulating-time-series-data-in-python/window-functions-rolling-expanding-metrics?ex=5&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Ranking
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcghllf55rktbyjrglb9w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcghllf55rktbyjrglb9w.png" alt="Ranking python window functions"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Row Number
&lt;/h3&gt;

&lt;p&gt;Row numbers can be generated in 2 different situations, each with a different function&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Across the entire dataframe - reset_index()&lt;/li&gt;
&lt;li&gt;Within groups - cumcount()&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These are the Pandas equivalents of row_number() in SQL&lt;/p&gt;

&lt;p&gt;Let’s use the following sample dataset to explain both concepts&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fycwywjosyxukznddxg9q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fycwywjosyxukznddxg9q.png" alt="Ranking Row Number"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Reset_index
&lt;/h4&gt;

&lt;p&gt;Within a dataframe, reset_index() moves the index into a column, which outputs the row number of each row.&lt;/p&gt;

&lt;p&gt;General format to follow:&lt;br&gt;
&lt;strong&gt;&lt;em&gt;dataframe&lt;/em&gt;&lt;/strong&gt;.reset_index()&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhgqlhrav7xyj0nuvpe48.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhgqlhrav7xyj0nuvpe48.png" alt="Reset_index"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To extract the nth row, use the .iloc[] indexer&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Dataframe&lt;/em&gt;&lt;/strong&gt;.iloc[&lt;strong&gt;&lt;em&gt;nth_row&lt;/em&gt;&lt;/strong&gt;]&lt;/p&gt;
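&lt;p&gt;A minimal sketch of both steps, using a small made-up dataframe rather than the article’s dataset:&lt;/p&gt;

```python
import pandas as pd

# Small hypothetical dataframe (not the article's dataset)
df = pd.DataFrame({'v1': [3, 5, 7, 1]})

# reset_index() turns the positional index into a column named 'index'
numbered = df.reset_index()
print(numbered['index'].tolist())  # row numbers 0..3

# .iloc[] extracts the nth row by position (0-based), e.g. the 3rd row
third_row = df.iloc[2]
print(third_row['v1'])
```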

&lt;h4&gt;
  
  
  cumcount()
&lt;/h4&gt;

&lt;p&gt;To calculate the row number within groups of a dataframe, you have to implement the cumcount() function in the following format&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;dataframe&lt;/em&gt;&lt;/strong&gt;.groupby(['&lt;strong&gt;&lt;em&gt;column_names&lt;/em&gt;&lt;/strong&gt;']).cumcount()&lt;/p&gt;

&lt;p&gt;Note that cumcount() starts counting from 0 by default; to start the row count from 1, add +1 to the result&lt;/p&gt;

&lt;p&gt;For the sample dataset, the implementation would be&lt;/p&gt;

&lt;p&gt;&lt;code&gt;df['Row_count'] = df.groupby(['c1', 'c2']).cumcount()+1&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This would be the output&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1yuv9xtt8gbg43bp0tqd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1yuv9xtt8gbg43bp0tqd.png" alt="cumcount"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that you have the row_count within each group, sometimes you have to extract a specific index row of each group.&lt;/p&gt;

&lt;p&gt;For example, suppose you are asked to extract the 2nd row within each group. We can extract this by returning each row with a Row_count value of 2.&lt;/p&gt;

&lt;p&gt;Using loc, we can extract the subset with the following general format&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Dataframe&lt;/em&gt;&lt;/strong&gt;.loc[&lt;strong&gt;&lt;em&gt;dataframe&lt;/em&gt;&lt;/strong&gt;[&lt;strong&gt;&lt;em&gt;column_name&lt;/em&gt;&lt;/strong&gt;] == &lt;strong&gt;&lt;em&gt;index&lt;/em&gt;&lt;/strong&gt;]&lt;/p&gt;

&lt;p&gt;For the column dataset above, we would use&lt;/p&gt;

&lt;p&gt;&lt;code&gt;df.loc[df['Row_count'] == 2]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;to get the subset&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxz27i3bm8dm15jss1bmn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxz27i3bm8dm15jss1bmn.png" alt="Python Window Functions subset"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;import pandas as pd&lt;br&gt;
df = pd.DataFrame({'c1': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'], 'c2':['X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X'], 'v1':[3, 5, 7, 1, 3, 1, 3, 1, 7, 4, 1, 6]})&lt;br&gt;
df['Row_count'] = df.groupby(['c1', 'c2']).cumcount()+1&lt;br&gt;
df.loc[df['Row_count'] == 2]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Questions&lt;/u&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/2004-number-of-comments-per-user-in-past-30-days?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2004-number-of-comments-per-user-in-past-30-days?python=1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/9716-top-3-facilities?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/9716-top-3-facilities?python=1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/10351-activity-rank?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/10351-activity-rank?python=1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;u&gt;Reference&lt;/u&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.cumcount.html?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.cumcount.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Rank
&lt;/h4&gt;

&lt;p&gt;Ranking functions, as the name suggests, rank values based on a certain variable. The Pandas rank() function works slightly differently than its SQL equivalent.&lt;/p&gt;

&lt;p&gt;The rank() function can be executed with the following general format&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;dataframe&lt;/em&gt;&lt;/strong&gt;[&lt;strong&gt;&lt;em&gt;column_name&lt;/em&gt;&lt;/strong&gt;].rank()&lt;/p&gt;

&lt;p&gt;Let’s assume the following dataset from the Pandas ranking documentation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2enuuce9uckghwu56my.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2enuuce9uckghwu56my.png" alt="Rank python window functions"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And create 4 new columns, each using the rank() function, to better explain the function and its most popular parameters&lt;/p&gt;

&lt;p&gt;&lt;code&gt;animal_legs['default_rank'] = animal_legs['Number_legs'].rank()&lt;br&gt;
animal_legs['min_rank'] = animal_legs['Number_legs'].rank(method='min')&lt;br&gt;
animal_legs['NA_bottom'] = animal_legs['Number_legs'].rank(method='min', na_option='bottom')&lt;br&gt;
animal_legs['descending'] = animal_legs['Number_legs'].rank(method='min', ascending = False)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filf2yfzjq57stnexef69.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filf2yfzjq57stnexef69.png" alt="Python Window Functions"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this example we’re ranking Number_legs for each animal.&lt;/p&gt;

&lt;p&gt;Let’s understand what each of the columns represents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;‘default_rank’&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In a default rank() function, there are 3 important things to note.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ascending order is assumed true&lt;/li&gt;
&lt;li&gt;Null values are not ranked and left as null&lt;/li&gt;
&lt;li&gt;If n values are equal, the rank split is averaged between the values.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rank splitting for tied values is a bit confusing, so let’s dive deeper to explain it better.&lt;/p&gt;

&lt;p&gt;In SQL, for the dataset above, since cat and dog both have 4 legs, both would be assigned rank = 2, and spider, with the next highest number of legs, would have a rank of 4.&lt;/p&gt;

&lt;p&gt;Instead, Pandas averages out the ‘would have been’ ranks between cat and dog.&lt;br&gt;
They would have occupied ranks 2 and 3, but since cat and dog have the same value, each gets the average of 2 and 3, which is 2.5&lt;/p&gt;

&lt;p&gt;Let’s alter the animals example to include ‘donkey’, which has 4 legs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmf8th6przk7xthjsrxt5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmf8th6przk7xthjsrxt5.png" alt="default_rank python window function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Penguin has the least number of legs with 2, so it has a rank = 1.&lt;/p&gt;

&lt;p&gt;Since cat, dog, and donkey all have the next highest count of 4 legs, each will take the average of 2, 3, and 4, which is 3, since 3 animals share the same value.&lt;/p&gt;

&lt;p&gt;If we had 4 animals all with 4 legs, each would take the average of 2, 3, 4, and 5, which is 3.5.&lt;/p&gt;
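&lt;p&gt;A quick way to check this averaging rule is a minimal sketch with made-up leg counts:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical leg counts: one animal with 2 legs, four tied at 4 legs.
# The tied group would occupy ranks 2, 3, 4 and 5, so each gets (2+3+4+5)/4 = 3.5
legs = pd.Series([2, 4, 4, 4, 4])
ranks = legs.rank()
print(ranks.tolist())  # [1.0, 3.5, 3.5, 3.5, 3.5]
```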

&lt;p&gt;&lt;strong&gt;&lt;u&gt;‘min_rank’&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When setting the parameter method='min', instead of taking the average rank, the function takes the minimum rank among equal values.&lt;/p&gt;

&lt;p&gt;The minimum rank is the same as how the rank function in SQL works.&lt;/p&gt;

&lt;p&gt;Using the animals example, the rank for dog and cat will now be 2 instead of 2.5.&lt;br&gt;
And in the example with donkey, they will still have a rank of 2, while spider will be set to a rank of 5.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgpbnmq4n5rkh2jxkpl95.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgpbnmq4n5rkh2jxkpl95.png" alt="min_rank python window function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;‘NA_bottom’&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Certain rows contain null values and under default conditions, the rank will also be set as null. In certain cases you would want the null values to rank the lowest or highest.&lt;/p&gt;

&lt;p&gt;Setting na_option to ‘bottom’ assigns null values the highest rank number, while setting it to ‘top’ assigns them the lowest.&lt;/p&gt;

&lt;p&gt;In the animals example, we set null values as bottom and rank method as minimum&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgs9zli98r0vkrxlf0f8v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgs9zli98r0vkrxlf0f8v.png" alt="NA_bottom python window function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;‘descending’&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you want the rank in descending order, set the parameter ascending to False.&lt;/p&gt;

&lt;p&gt;Referring to the animals example, we set ascending to False and method to ‘min’.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9nhi2uommock3n66nx5q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9nhi2uommock3n66nx5q.png" alt="descending python window function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Questions&lt;/u&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/10169-highest-total-miles?python=&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/10169-highest-total-miles?python=&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/10324-distances-traveled?python=&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/10324-distances-traveled?python=&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/2070-top-three-classes?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2070-top-three-classes?python=1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;u&gt;Reference&lt;/u&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html" rel="noopener noreferrer"&gt;https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Dense Rank
&lt;/h3&gt;

&lt;p&gt;Dense rank is similar to a normal rank with a slight difference.&lt;/p&gt;

&lt;p&gt;With a normal rank, rank numbers may be skipped after ties, while a dense rank never skips.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzam9j8jxvp4or6ionl5b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzam9j8jxvp4or6ionl5b.png" alt="Dense Rank Python Window Function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, in the animals dataframe, after [dog, cat, donkey], spider was the next value. In minimum rank, spider is set to rank = 5, since 2, 3, and 4 are taken by cat, dog, and donkey.&lt;/p&gt;

&lt;p&gt;In a dense rank, it will set the immediate consecutive ranks as seen above. Instead of 5th rank, spider was set to 3rd rank in dense_rank.&lt;/p&gt;

&lt;p&gt;Fortunately, you just have to edit the method parameter in a rank function to get the dense rank&lt;/p&gt;

&lt;p&gt;&lt;code&gt;animal_legs['dense_rank'] = animal_legs['Number_legs'].rank(method='dense')&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;All the other parameters, such as na_option and ascending, can also be set alongside the dense method as mentioned before.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Questions&lt;/u&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/9701-3rd-most-reported-health-issues?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/9701-3rd-most-reported-health-issues?python=1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/2026-bottom-2-companies-by-mobile-usage?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2026-bottom-2-companies-by-mobile-usage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/2019-top-2-users-with-most-calls?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2019-top-2-users-with-most-calls?python=1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;u&gt;Reference&lt;/u&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html" rel="noopener noreferrer"&gt;https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dfrieds.com/data-analysis/rank-method-python-pandas.html" rel="noopener noreferrer"&gt;https://dfrieds.com/data-analysis/rank-method-python-pandas.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Percent rank (Percentile)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F18n81srab96dgfuzo5db.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F18n81srab96dgfuzo5db.png" alt="Percent rank python window function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Percent rank is a representation of each rank relative to the highest rank.&lt;/p&gt;

&lt;p&gt;As seen in the animals dataframe above, spider has a rank of 5 for both default_rank and min_rank. Since 5 is the highest rank, the other values would be compared to this.&lt;br&gt;
For cat in default_rank, it has a value of 3, and 3 / 5 = 0.6 for default_pct_rank&lt;br&gt;
For cat in min_rank, it has a value of 2, and 2 / 5 = 0.4 for min_pct_rank&lt;/p&gt;

&lt;p&gt;Percent rank is enabled with the boolean pct parameter&lt;/p&gt;

&lt;p&gt;&lt;code&gt;animal_legs['min_pct_rank'] = animal_legs['Number_legs'].rank(method='min', pct=True)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Questions&lt;/u&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/10303-top-percentile-fraud?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/10303-top-percentile-fraud?python=1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/9611-find-the-80th-percentile-of-hours-studied?python=1" rel="noopener noreferrer"&gt;https://platform.stratascratch.com/coding/9611-find-the-80th-percentile-of-hours-studied?python=1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;u&gt;Reference&lt;/u&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html" rel="noopener noreferrer"&gt;https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dfrieds.com/data-analysis/rank-method-python-pandas.html" rel="noopener noreferrer"&gt;https://dfrieds.com/data-analysis/rank-method-python-pandas.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;import pandas as pd&lt;br&gt;
import numpy as np&lt;br&gt;
animal_legs = pd.DataFrame(data={'Animal': ['cat', 'penguin', 'dog', 'spider', 'snake', 'donkey'], 'Number_legs': [4, 2, 4, 8, np.nan, 4]})&lt;br&gt;
animal_legs['default_rank'] = animal_legs['Number_legs'].rank()&lt;br&gt;
animal_legs['min_rank'] = animal_legs['Number_legs'].rank(method='min')&lt;br&gt;
animal_legs['NA_bottom'] = animal_legs['Number_legs'].rank(method='min', na_option='bottom')&lt;br&gt;
animal_legs['descending'] = animal_legs['Number_legs'].rank(method='min', ascending = False)&lt;br&gt;
animal_legs['dense_rank'] = animal_legs['Number_legs'].rank(method='dense')&lt;br&gt;
animal_legs['default_pct_rank'] = animal_legs['Number_legs'].rank(pct=True)&lt;br&gt;
animal_legs['min_pct_rank'] = animal_legs['Number_legs'].rank(method='min', pct=True)&lt;br&gt;
animal_legs&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  N-Tile / qcut()
&lt;/h3&gt;

&lt;p&gt;qcut() is not as popular a function, since ranking based on quantiles other than percentiles is less common. Even so, it is still an extremely powerful function!&lt;br&gt;
If you don’t know the relationship between quantiles and percentiles, check out this article by Statology!&lt;/p&gt;

&lt;p&gt;Let’s take a &lt;a href="https://platform.stratascratch.com/coding/2036-lowest-revenue-generated-restaurants?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;question&lt;/a&gt; from DoorDash, to explain how qcut is used.&lt;/p&gt;

&lt;p&gt;The question asks us to find the bottom 2% of the dataset, which is the first quantile from a 50-quantile split.&lt;/p&gt;

&lt;p&gt;A general format to follow when using qcut():&lt;br&gt;
pd.qcut(&lt;strong&gt;&lt;em&gt;dataseries&lt;/em&gt;&lt;/strong&gt;, q=&lt;strong&gt;&lt;em&gt;number_quantiles&lt;/em&gt;&lt;/strong&gt;, labels = range(&lt;strong&gt;&lt;em&gt;lower_bound&lt;/em&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;em&gt;upper_bound&lt;/em&gt;&lt;/strong&gt;))&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fac22hfip35p1o6rcapyc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fac22hfip35p1o6rcapyc.png" alt="N-Tile python window function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a subset of the dataset which we will use to analyze the usage of the qcut() function.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;dataseries → The column to analyze, which is total_order in this example&lt;/li&gt;
&lt;li&gt;number_quantiles → Number of quantiles to split by, which is 50 due to 50-quantile split&lt;/li&gt;
&lt;li&gt;labels → Range of ntiles, which is 1-50 in this case. However, Python’s range() excludes its upper bound, so range(1, 50) would make the highest ntile 49 instead of 50. Due to this, we set the upper bound to n+1, which in this example gives range(1, 51)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this example, this would be the following code.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;result['ntile'] = pd.qcut(result['total_order'], q=50, labels=range(1, 51))&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5l5is66bkn4knfsiebk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5l5is66bkn4knfsiebk.png" alt="N-Tile python window function example"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As seen in the example, ‘ntile’ has been split and represents the quantile.&lt;/p&gt;

&lt;p&gt;It must also be noted that if the labels range is not specified, the quantile interval for each row is returned instead.&lt;/p&gt;

&lt;p&gt;For example executing the same code above without the labels range:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;result['ntile_range'] = pd.qcut(result['total_order'], q=50)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbe8v9yq3gnv4wamoxeid.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbe8v9yq3gnv4wamoxeid.png" alt="N-Tile python window function example 2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;import pandas as pd&lt;br&gt;
result = doordash_delivery[doordash_delivery['customer_placed_order_datetime'].between('2020-05-01', '2020-05-31')].groupby("restaurant_id")["order_total"].sum().to_frame('total_order').reset_index()  &lt;br&gt;
result['ntile'] = pd.qcut(result['total_order'],q=50, labels=range(1, 50), duplicates = 'drop').values.tolist()&lt;br&gt;
result['ntile_range'] = pd.qcut(result['total_order'],q=50, duplicates = 'drop').values.tolist()&lt;br&gt;
result&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Questions&lt;/u&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;qcut() → &lt;a href="https://platform.stratascratch.com/coding/2036-lowest-revenue-generated-restaurants?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2036-lowest-revenue-generated-restaurants?python=1&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;u&gt;Reference&lt;/u&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.qcut.html" rel="noopener noreferrer"&gt;https://pandas.pydata.org/docs/reference/api/pandas.qcut.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/all-pandas-qcut-you-should-know-for-binning-numerical-data-based-on-sample-quantiles-c8b13a8ed844" rel="noopener noreferrer"&gt;https://towardsdatascience.com/all-pandas-qcut-you-should-know-for-binning-numerical-data-based-on-sample-quantiles-c8b13a8ed844&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Value
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Lag / Lead
&lt;/h3&gt;

&lt;p&gt;Lag and lead functions reproduce another column’s values, shifted down or up by one or more rows.&lt;/p&gt;

&lt;p&gt;Let’s use a &lt;a href="https://platform.stratascratch.com/coding/9782-customer-revenue-in-march?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;dataset&lt;/a&gt; given by Facebook (Meta) which represents the total cost of orders by each month.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpnpjqw93z4g3oae316x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpnpjqw93z4g3oae316x.png" alt="Value Lag Lead python window function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the ‘Lag’ column, we can see that values were shifted down by one: 305, the total_order_cost for January, appears in the ‘Lag’ column on the same row as February.&lt;/p&gt;

&lt;p&gt;In the ‘Lead’ column, the opposite occurs. Rows are shifted up by one, so 285, the total_order_cost for February, appears in the ‘Lead’ column on January’s row.&lt;/p&gt;

&lt;p&gt;This makes it easier to compare values side by side, for example to calculate the growth of sales by month.&lt;/p&gt;

&lt;p&gt;A general format to follow:&lt;br&gt;
&lt;strong&gt;&lt;em&gt;dataframe&lt;/em&gt;&lt;/strong&gt;['&lt;strong&gt;&lt;em&gt;shifting_column&lt;/em&gt;&lt;/strong&gt;'].shift(&lt;strong&gt;&lt;em&gt;number_shift&lt;/em&gt;&lt;/strong&gt;)&lt;/p&gt;

&lt;p&gt;Code used for the data:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;orders['Lag'] = orders['total_order_cost'].shift(1)&lt;br&gt;
orders['Lead'] = orders['total_order_cost'].shift(-1)&lt;/code&gt;&lt;/p&gt;
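&lt;p&gt;As a sketch of the month-over-month growth calculation mentioned above, using hypothetical monthly totals rather than the Facebook dataset:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical monthly totals (not the Facebook orders dataset)
sales = pd.DataFrame({'month': ['January', 'February', 'March'],
                      'total': [100.0, 120.0, 90.0]})

# Lag the totals by one row, then compute month-over-month growth in percent
sales['prev_total'] = sales['total'].shift(1)
sales['growth_pct'] = (sales['total'] - sales['prev_total']) / sales['prev_total'] * 100
print(sales)  # growth_pct: NaN, 20.0, -25.0
```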

&lt;p&gt;Another key point to remember is the null values introduced by the shift. There is 1 null value (NaN) in each of the Lag and Lead columns, since the values were shifted by 1. In general, shifting by n rows produces n null values: the first n rows of the ‘Lag’ column and the last n rows of the ‘Lead’ column will be null.&lt;br&gt;
If you want to replace these null values, use the fill_value parameter.&lt;/p&gt;

&lt;p&gt;We execute the code with updated parameters&lt;/p&gt;

&lt;p&gt;&lt;code&gt;orders['Lag'] = orders['total_order_cost'].shift(1, fill_value = 0)&lt;br&gt;
orders['Lead'] = orders['total_order_cost'].shift(-1, fill_value = 0)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;To get this as the output&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffvyi3zlwvxfinbmyq2au.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffvyi3zlwvxfinbmyq2au.png" alt="Lag Lead window function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;import pandas as pd&lt;br&gt;
import numpy as np&lt;br&gt;
orders['order_date'] = orders['order_date'].apply(pd.to_datetime)&lt;br&gt;
orders['order_month'] = orders['order_date'].dt.month&lt;br&gt;
orders.loc[(orders.order_month == 1),'order_month'] = 'January'&lt;br&gt;
orders.loc[(orders.order_month == 2),'order_month'] = 'February'&lt;br&gt;
orders.loc[(orders.order_month == 3),'order_month'] = 'March'&lt;br&gt;
orders.loc[(orders.order_month == 4),'order_month'] = 'April'&lt;br&gt;
orders['order_month'] = pd.Categorical(orders['order_month'], ["January", "February", "March", "April"])&lt;br&gt;
orders = orders[['order_month', 'total_order_cost']]&lt;br&gt;
orders = orders.sort_values(by=['order_month'])&lt;br&gt;
orders = orders.groupby("order_month").agg({'total_order_cost' : 'sum'})&lt;br&gt;
orders['Lag'] = orders['total_order_cost'].shift(1, fill_value = 0)&lt;br&gt;
orders['Lead'] = orders['total_order_cost'].shift(-1, fill_value = 0)&lt;br&gt;
orders&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Questions&lt;/u&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/9637-growth-of-airbnb?python=&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/9637-growth-of-airbnb?python=&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/9714-dates-of-inspection?python=&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/9714-dates-of-inspection?python=&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/2045-days-without-hiringtermination?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2045-days-without-hiringtermination?python=1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;u&gt;Reference&lt;/u&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html" rel="noopener noreferrer"&gt;https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  First/Last/nth value
&lt;/h3&gt;

&lt;p&gt;Finding the nth value (including first and last) within groups of a dataset is fairly simple with Python as well.&lt;/p&gt;

&lt;p&gt;Let’s use the same &lt;a href="https://platform.stratascratch.com/coding/9782-customer-revenue-in-march?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;orders dataset by Facebook&lt;/a&gt; used in the Lag/Lead section.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7pcyzfztrfa7w9xd0cd3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7pcyzfztrfa7w9xd0cd3.png" alt="First Last and nth value python window functions"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As seen here, the order_date has been ordered from earliest to latest.&lt;/p&gt;

&lt;p&gt;Let’s find the first order of each month using the nth() function.&lt;/p&gt;

&lt;p&gt;General format:&lt;br&gt;
&lt;strong&gt;&lt;em&gt;dataframe&lt;/em&gt;&lt;/strong&gt;.groupby(‘&lt;strong&gt;&lt;em&gt;groupby_column&lt;/em&gt;&lt;/strong&gt;’).nth(&lt;strong&gt;&lt;em&gt;nth_value&lt;/em&gt;&lt;/strong&gt;)&lt;/p&gt;

&lt;p&gt;nth_value is the positional index of the row to extract from each group.&lt;br&gt;
It works the same way as indexing into a Python list:&lt;br&gt;
&lt;strong&gt;0 represents the first value&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;-1 represents the last value&lt;/strong&gt;&lt;/p&gt;
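This indexing convention can be checked on a tiny hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b', 'b'],
                   'v': [10, 20, 30, 40]})

# nth(0) takes the first row of each group, nth(-1) the last,
# exactly like list indexing
first_rows = df.groupby('g')['v'].nth(0)
last_rows = df.groupby('g')['v'].nth(-1)

print(first_rows.tolist())  # [10, 30]
print(last_rows.tolist())   # [20, 40]
```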

&lt;p&gt;Using the following code:&lt;br&gt;
&lt;code&gt;orders.groupby('order_month').nth(0)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynyhagspwp8pwooxrnpn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynyhagspwp8pwooxrnpn.png" alt="First Last and nth value python window functions output"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To return only a specific column, such as total_order_cost, you can specify this as well.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;orders.groupby('order_month').nth(0)['total_order_cost'].reset_index()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo5o2l5smzpk76v78kz00.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo5o2l5smzpk76v78kz00.png" alt="First Last and nth value python window functions output 2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to join the nth value of each group back onto the original dataframe, you can use the merge function, just as it was applied in the aggregate functions section above. Remember to keep the column you are merging on: in this example that is the ‘order_month’ group key, which sits in the index after the groupby, so use reset_index() to turn it back into a regular column.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;import pandas as pd&lt;br&gt;
orders['order_date'] = orders['order_date'].apply(pd.to_datetime)&lt;br&gt;
orders['order_month'] = orders['order_date'].dt.month&lt;br&gt;
orders = orders.sort_values(by=['order_date'])&lt;br&gt;
ordered_group = orders.groupby('order_month').nth(0)['total_order_cost'].reset_index()&lt;/code&gt;&lt;/p&gt;
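A sketch of the merge step described above, on toy data rather than the Facebook orders table (first() is used here as a stand-in for nth(0), and the column names are illustrative):

```python
import pandas as pd

orders = pd.DataFrame({
    'order_month': [1, 1, 2, 2],
    'total_order_cost': [100, 50, 75, 25],
})

# first total_order_cost per month; reset_index() turns the
# 'order_month' group key back into a regular column to merge on
first_costs = (orders.groupby('order_month')['total_order_cost']
                     .first()
                     .reset_index()
                     .rename(columns={'total_order_cost': 'first_order_cost'}))

# join the per-group value back onto every row of the original frame
merged = orders.merge(first_costs, on='order_month')
print(merged['first_order_cost'].tolist())  # [100, 100, 75, 75]
```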

&lt;p&gt;&lt;u&gt;Reference&lt;/u&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.nth.html" rel="noopener noreferrer"&gt;https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.nth.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Practice Questions Compiled
&lt;/h3&gt;

&lt;h5&gt;
  
  
  Aggregate
&lt;/h5&gt;

&lt;h6&gt;
  
  
  Group by
&lt;/h6&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/9711-facilities-with-lots-of-inspections?python=&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/9711-facilities-with-lots-of-inspections?python=&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/9899-percentage-of-total-spend?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/9899-percentage-of-total-spend?python=1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/2044-most-senior-junior-employee?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2044-most-senior-junior-employee?python=1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h6&gt;
  
  
  Rolling
&lt;/h6&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/10314-revenue-over-time?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/10314-revenue-over-time?python=1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h6&gt;
  
  
  Expanding
&lt;/h6&gt;

&lt;ul&gt;
&lt;li&gt;No questions for this section&lt;/li&gt;
&lt;/ul&gt;

&lt;h5&gt;
  
  
  Ranking
&lt;/h5&gt;

&lt;h6&gt;
  
  
  Row_number()
&lt;/h6&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/2004-number-of-comments-per-user-in-past-30-days?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2004-number-of-comments-per-user-in-past-30-days?python=1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/9716-top-3-facilities?python=&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/9716-top-3-facilities?python=&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/10351-activity-rank?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/10351-activity-rank?python=1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h6&gt;
  
  
  rank()
&lt;/h6&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/10169-highest-total-miles?python=&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/10169-highest-total-miles?python=&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/10324-distances-traveled?python=&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/10324-distances-traveled?python=&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/2070-top-three-classes?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2070-top-three-classes?python=1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h6&gt;
  
  
  dense_rank()
&lt;/h6&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/9701-3rd-most-reported-health-issues?python=&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/9701-3rd-most-reported-health-issues?python=&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/2026-bottom-2-companies-by-mobile-usage?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2026-bottom-2-companies-by-mobile-usage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/2019-top-2-users-with-most-calls?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2019-top-2-users-with-most-calls?python=1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h6&gt;
  
  
  percent_rank()
&lt;/h6&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/10303-top-percentile-fraud?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/10303-top-percentile-fraud?python=1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/9611-find-the-80th-percentile-of-hours-studied?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/9611-find-the-80th-percentile-of-hours-studied?python=1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h6&gt;
  
  
  ntile() / qcut()
&lt;/h6&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/2036-lowest-revenue-generated-restaurants?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2036-lowest-revenue-generated-restaurants?python=1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h5&gt;
  
  
  Value
&lt;/h5&gt;

&lt;h6&gt;
  
  
  Lag/Lead
&lt;/h6&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/9637-growth-of-airbnb?python=&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/9637-growth-of-airbnb?python=&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/9714-dates-of-inspection?python=&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/9714-dates-of-inspection?python=&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.stratascratch.com/coding/2045-days-without-hiringtermination?python=1&amp;amp;utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/2045-days-without-hiringtermination?python=1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h6&gt;
  
  
  First / Last / nth_value()
&lt;/h6&gt;

&lt;ul&gt;
&lt;li&gt;No questions for this section&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>Solving LeetCode Single Number Problem for Data Science Interviews</title>
      <dc:creator>StrataScratch</dc:creator>
      <pubDate>Tue, 01 Feb 2022 07:49:46 +0000</pubDate>
      <link>https://dev.to/nate_at_stratascratch/solving-leetcode-single-number-problem-for-data-science-interviews-kgd</link>
      <guid>https://dev.to/nate_at_stratascratch/solving-leetcode-single-number-problem-for-data-science-interviews-kgd</guid>
      <description>&lt;p&gt;&lt;em&gt;How does a data scientist solve the LeetCode Single Number problem in Python to prepare for their data science interviews?&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Using LeetCode To Solve Python Questions
&lt;/h2&gt;

&lt;p&gt;LeetCode has a massive database of real interview questions asked by companies like Amazon, Google, Microsoft, Facebook, and other giants. Built specifically for software developers, it is widely regarded as an excellent resource, with more than 1,599 algorithm-based questions across a variety of languages. Since data scientists most commonly use Python and SQL, they use LeetCode to improve their skills and &lt;a href="https://www.stratascratch.com/blog/5-tips-to-prepare-for-a-data-science-interview/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;prepare for data science interviews&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We are going to look at a specific LeetCode Single Number problem today which we’ll solve in three different ways via Python. It is one of many Python problems useful for preparing for data science interviews.&lt;br&gt;
On StrataScratch.com, we also have several other articles discussing "&lt;a href="https://www.stratascratch.com/blog/how-to-use-leetcode-for-data-science-sql-interviews/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;How To Use LeetCode For Data Science SQL Interviews&lt;/a&gt;" and "&lt;a href="https://www.stratascratch.com/blog/leetcode-python-solutions-for-data-science/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;LeetCode Python Solutions for Data Science&lt;/a&gt;".&lt;/p&gt;
&lt;h2&gt;
  
  
  Solving the LeetCode Single Number Problem
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;136. Single Number&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Interviewers have asked the single number question in a variety of ways in data science interviews. The key challenge is finding the single element which appears only once in an array of integers. While one of the demands is implementing a solution with linear runtime complexity and constant extra space, we’ll see there are several ways to meet these criteria.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SNr2Bdt2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8gq6de0jzcte19f264nx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SNr2Bdt2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8gq6de0jzcte19f264nx.png" alt="LeetCode Single Number Problem" width="712" height="143"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Link to the question: &lt;a href="https://LeetCode.com/problems/single-number"&gt;https://leetcode.com/problems/single-number/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/LiX8xIsmNYc"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;In this LeetCode Single Number problem, we’re being asked to find the only integer which appears once in an array of data where every other integer appears twice. We know immediately any solutions for this LeetCode Single Number Problem will require thinking in advance about the complexity and space of the computation.&lt;/p&gt;

&lt;p&gt;This LeetCode single number problem may seem daunting given the constraints, but, as we will see, there are several solutions with varying levels of complexity and space requirements. Today, we’re going to look at three solutions which approach the problem differently: through using a counter, through mathematics, and through bitwise manipulation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Framework to Solve this LeetCode Single Number Problem
&lt;/h3&gt;

&lt;p&gt;The easiest way to solve data science interview questions whether in Python, SQL, or some other language is to use a generally applicable framework. Here’s a framework we use for all data science problems on StrataScratch which we’ll adapt to this problem. You’ll see it provides logical steps for how to arrive at the correct answer:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Understand Your Data&lt;br&gt;
• This LeetCode single number problem gives sample data to look at. See if you notice any patterns in the arrays or anything which might require you to adjust your algorithms. This will help you identify the bounds to which you should limit your solution as well as uncover edge cases.&lt;br&gt;
• Typically LeetCode provides more than one snippet of sample data. If the first example isn’t sufficient, spend some time with the other data they provide. In an interview, you can attempt to explain your current understanding of the data to the interviewer and ask them for feedback.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Formulate your approach:&lt;br&gt;
• Now write down all the steps you’ll need for coding your solution. Consider how Python computes and what functions you’ll need to leverage. Also keep in mind how complex they’ll make your solution.&lt;br&gt;
• Don’t forget the interviewer will be observing you. Don’t hesitate to ask them for help. They’ll often specify any additional limitations which apply to your Python code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Code Execution&lt;br&gt;
• Try to keep your Python code simple in this LeetCode single number problem since complexity and space are concerns.&lt;br&gt;
• Follow the steps you laid out in the beginning. Even if it’s not the most efficient way to solve the problem, you can explain potential optimizations afterwards as long as you hit the problem’s complexity and space requirements.&lt;br&gt;
• Don’t convolute your code. This might make it difficult for both you and the interviewer to understand your solution and could introduce unexpected results or complexity.&lt;br&gt;
• Speak through your solution with the interviewer as you write down your Python. They want to understand how you think as you advance towards an answer.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Understand your data
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uBOn7Q3R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uvavetud1u1c655b4edi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uBOn7Q3R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uvavetud1u1c655b4edi.png" alt="LeetCode Single Number Problem for Data Science Interviews" width="880" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s start by looking at some of the data. Fortunately, LeetCode will typically provide you with several examples to look at.&lt;/p&gt;

&lt;p&gt;In this case we receive three example arrays of integers. Each array contains an element which only appears once and other elements which appear twice.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---03lAhAB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b8u1snj1vrb578hpt91v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---03lAhAB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b8u1snj1vrb578hpt91v.png" alt="Understanding LeetCode Single Number Data" width="722" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the three examples, we can already notice a few patterns:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Your solution should work for arrays with only one or two discrete integers.&lt;/li&gt;
&lt;li&gt;Your solution should work for arrays with several pairs of discrete integers where the duplicate elements aren’t consecutive.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We are also given some constraints to work with:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8uIiGsk_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9nbn9k73mn6sq31uank3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8uIiGsk_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9nbn9k73mn6sq31uank3.png" alt="LeetCode Single Number Question Constraints" width="737" height="106"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These constraints aren’t particularly important for the solutions we’ll cover in this article, but you should always be aware of the limits of the problem.&lt;/p&gt;

&lt;p&gt;What’s important to realize is we need a solution to account for a variety of array sizes and content which still only maintains linear complexity and constant extra space.&lt;/p&gt;

&lt;h4&gt;
  
  
  Solution 1:
&lt;/h4&gt;

&lt;p&gt;For our first solution, we use Python’s Counter class from the collections module to count all elements in the array and then filter for the element with a count of one.&lt;/p&gt;
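For reference, here is what Counter does on a small hypothetical array (this is the first sample input from the problem):

```python
from collections import Counter

# Counter maps each element to the number of times it appears
counts = Counter([4, 1, 2, 1, 2])

print(counts[4])  # 1 -- the element that appears only once
print(counts[1])  # 2
print(counts[2])  # 2
```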

&lt;p&gt;&lt;strong&gt;Formulate Approach&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The next step, according to our previous framework, is to outline some steps we’ll need to take. Writing down the steps you’ll take in advance will make coding significantly easier.&lt;/p&gt;

&lt;p&gt;For our first solution, here are the general steps we’ll follow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We use a COUNTER function in Python to go through all the elements in the array and return a dictionary of how many times any given element appears.&lt;/li&gt;
&lt;li&gt;We filter the result of the counter dictionary to find the element where the count is equal to 1. Since we expect only one element to appear once, we can return the first match.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Code execution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To write the Python code for this question, let’s follow the steps we just wrote down and translate them to code. The most important part of this solution is correctly applying the count function on our array and filtering the result.&lt;/p&gt;

&lt;p&gt;Looking at the first step, we can start by writing the code for counting the elements in the array.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;from collections import Counter&lt;br&gt;
from typing import List&lt;br&gt;
&lt;br&gt;
class Solution:&lt;br&gt;
    def singleNumber(self, nums: List[int]) -&amp;gt; int:&lt;br&gt;
        c = Counter(nums)&lt;br&gt;
        return c&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SMZ7qOWO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/edn1u0un2qk5wv6innml.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SMZ7qOWO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/edn1u0un2qk5wv6innml.png" alt="Output for LeetCode Single Number Question" width="575" height="84"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Given one of our example inputs, our solution will now return a dictionary giving the total count of each integer in our input array. We immediately see one of our integers only has a count of one. Next, all we have to do is filter for this element. To do this, we’ll use a FOR loop to find the first dictionary key with a value equal to one.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;class Solution:&lt;br&gt;
    def singleNumber(self, nums: List[int]) -&amp;gt; int:&lt;br&gt;
        c = Counter(nums)&lt;br&gt;
        for n in c:&lt;br&gt;
            if c[n] == 1:&lt;br&gt;
                return n&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yd5t-Q3R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/if2hrckdzl5fkga658pr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yd5t-Q3R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/if2hrckdzl5fkga658pr.png" alt="Output 2 for LeetCode Single Number Question" width="441" height="158"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we arrive at the correct answer with our solution presenting us with the only integer which doesn’t appear twice in the input array.&lt;/p&gt;

&lt;p&gt;When it comes to complexity, this answer meets the linear runtime requirement: building the counter and looping through it once is O(n). Be aware, though, that the Counter itself stores a count for every distinct element, so this approach actually uses O(n) extra space rather than the constant extra space the problem asks for.&lt;/p&gt;

&lt;h4&gt;
  
  
  Solution 2:
&lt;/h4&gt;

&lt;p&gt;Our second solution relies on comparing sums in Python. Since sets reduce all elements to a single occurrence, we can use the set function and algebra to calculate the integer which appears only once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Formulate Approach&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Following our framework again, we need to start our second solution by writing down some specific steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We first apply the SET function to reduce our input array to an array only containing one occurrence of each original integer element.&lt;/li&gt;
&lt;li&gt;We sum our set and double it using arithmetic operations to prepare our comparison.&lt;/li&gt;
&lt;li&gt;We then subtract the sum of the elements in our original array to isolate the single element which appears once.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Code execution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s again follow the steps we just wrote down and translate them into code. The critical component of this solution is correctly applying the algebra to your set and input array, otherwise you’ll yield the incorrect output.&lt;/p&gt;

&lt;p&gt;Looking at the first step, let’s apply the set function to our array to understand how it changes our input array.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;class Solution:&lt;br&gt;
    def singleNumber(self, nums: List[int]) -&amp;gt; int:&lt;br&gt;
        return set(nums)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Hyeu3m6_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ccjm4nii894ojzy4zq41.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Hyeu3m6_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ccjm4nii894ojzy4zq41.png" alt="Output 3 for LeetCode Single Number Question" width="521" height="86"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We see the original array has been reduced to a set consisting of all the elements without repetition.&lt;/p&gt;

&lt;p&gt;Next we need to sum our set and double it. The exact arithmetic depends on the question being asked: because every element except one appears exactly twice, doubling the sum of the set works. If duplicates appeared three or four times instead, the math would change and simply doubling the set sum would no longer be sufficient.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;class Solution:&lt;br&gt;
    def singleNumber(self, nums: List[int]) -&amp;gt; int:&lt;br&gt;
        return 2*sum(set(nums))&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pDkRTKne--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4s9xugimoihgnqy4gu52.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pDkRTKne--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4s9xugimoihgnqy4gu52.png" alt="Output 4 for LeetCode Single Number Question" width="598" height="80"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The output is now an integer, but it is not yet the element that appears only once. To finish, we subtract the sum of the original array. Because we deliberately do not apply the set function to this second sum, every duplicated element is counted twice in it and cancels against the doubled set sum, leaving exactly the single element.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;class Solution:&lt;br&gt;
    def singleNumber(self, nums: List[int]) -&amp;gt; int:&lt;br&gt;
        return 2*sum(set(nums))-sum(nums)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aHdzUHLi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fi0n8rg7ytqd9ycd5hax.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aHdzUHLi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fi0n8rg7ytqd9ycd5hax.png" alt="class Solution:&amp;lt;br&amp;gt;
    def singleNumber(self, nums: List[int]) -&amp;gt; int:&amp;lt;br&amp;gt;
        return 2*sum(set(nums))-sum(nums)" width="561" height="112"&gt;&lt;/a&gt;&lt;/p&gt;
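The arithmetic can be verified on a small hypothetical array where 4 is the single element:

```python
nums = [4, 1, 2, 1, 2]

doubled_set_sum = 2 * sum(set(nums))  # 2 * (1 + 2 + 4) = 14
plain_sum = sum(nums)                 # 4 + 1 + 2 + 1 + 2 = 10

# every duplicate contributes twice to both terms and cancels,
# leaving only the single element
print(doubled_set_sum - plain_sum)    # 4
```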

&lt;p&gt;When it comes to complexity and space, this answer doesn’t optimize any further than our first code snippet. Sum and set both go through the entire array, and it still requires space to store the set. As a result, we get O(n) for both runtime and space complexity.&lt;/p&gt;

&lt;h4&gt;
  
  
  Solution 3:
&lt;/h4&gt;

&lt;p&gt;Our third solution relies on bitwise manipulation using the XOR operator. For context, the XOR of 0 and any value returns the value unchanged, while the XOR of a value with itself returns 0. As a result, every element which appears twice cancels out to 0, and XOR-ing the remaining single element against that 0 yields the element itself.&lt;/p&gt;
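These XOR properties can be verified directly in Python (again using the problem's first sample array, where 4 is the single element):

```python
x = 13
assert 0 ^ x == x          # XOR with zero leaves a value unchanged
assert x ^ x == 0          # XOR of a value with itself cancels to zero

# XOR is commutative and associative, so the duplicates cancel
# pairwise regardless of their positions in the array:
print(4 ^ 1 ^ 2 ^ 1 ^ 2)   # 4
```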

&lt;p&gt;&lt;strong&gt;Formulate Approach&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Looking back at our framework - we need to again begin our third solution by writing down our coding steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We know we’ll need to perform a bitwise comparison of all elements, so we’ll need to start by looping through our array and establishing a variable for our first XOR operation.&lt;/li&gt;
&lt;li&gt;Using the XOR operator, perform the XOR bitwise operation on each element of your array starting with the first element operated against 0.&lt;/li&gt;
&lt;li&gt;Simplify the solution by instead storing the result of each XOR operation in the first index of the input array. Then return the element at index 0 after the first loop through the array.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Code execution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Time to again translate our steps into functional code. The key to this solution is performing the XOR operation against the result of previous XOR operations instead of performing an XOR operation on each element against itself.&lt;/p&gt;

&lt;p&gt;Looking at the first step, let’s establish a 0 variable to XOR against and loop through our array starting with the first element.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;class Solution:&lt;br&gt;
    def singleNumber(self, nums: List[int]) -&amp;gt; int:&lt;br&gt;
        a = 0&lt;br&gt;
        for i in nums:&lt;br&gt;
            ...  # loop body added in the next step&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now we need to XOR every element of our array against our a variable. Since a starts as 0, we know the first XOR operation will yield the first element. We can reassign the result of this operation to the a variable to XOR the rest of the elements. Finally, we return the end result of all these XOR operations.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;class Solution:&lt;br&gt;
    def singleNumber(self, nums: List[int]) -&amp;gt; int:&lt;br&gt;
        a = 0&lt;br&gt;
        for i in nums:&lt;br&gt;
            a ^= i&lt;br&gt;
        return a&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Bx26SofX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/46ryb46tnsiiac0ny9m9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Bx26SofX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/46ryb46tnsiiac0ny9m9.png" alt="Output 6 for LeetCode Single Number Question" width="612" height="124"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This already yields a correct answer. The loop runs in O(n) time and uses only a single extra variable, so it is already O(1) extra space, but we can shave off even that variable by storing the result of each XOR operation in the first index of the input array instead.&lt;/p&gt;

&lt;p&gt;We’ll need to change our loop in this case: it now starts at index 1 instead of index 0, since nums[0] already holds the first element and serves as the accumulator for the first XOR operation.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;class Solution:&lt;br&gt;
    def singleNumber(self, nums: List[int]) -&amp;gt; int:&lt;br&gt;
        for i in range(1, len(nums)):&lt;br&gt;
            nums[0] ^= nums[i]&lt;br&gt;
        return nums[0]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jREP4i-N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zpd49xfjsgg2nsjio82b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jREP4i-N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zpd49xfjsgg2nsjio82b.png" alt="Final Output for LeetCode Single Number Question" width="554" height="147"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We see our solution is again correct, but this time we didn’t have to make an extra storage variable. We end up using less space, and, as such, present the interviewer with a more efficient solution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Comparison of Approaches to Solve this LeetCode Single Number Question
&lt;/h3&gt;

&lt;p&gt;We have gone through 3 different ways you can solve this LeetCode single number question. While all of them are correct and result in the expected output, the solutions differ in their computations and complexity.&lt;/p&gt;

&lt;p&gt;The first difference is to take note of how each approach solves the problem. Our first approach uses the counter function, our second approach arrives upon the solution mathematically, and our third approach leverages bitwise manipulation. As you work your way through interview questions similar to this one, consider how there may be different computational methods you can apply. This may open up solutions you weren’t considering at first.&lt;/p&gt;
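&lt;p&gt;To make the comparison concrete, here is a minimal side-by-side sketch of all three approaches. The Counter-based and mathematical versions are written from their standard formulations rather than copied from earlier in the article, so treat their exact implementations and function names as illustrative:&lt;/p&gt;

```python
from collections import Counter

# Approach 1 (assumed form): count occurrences, return the value seen once.
def single_counter(nums):
    for value, count in Counter(nums).items():
        if count == 1:
            return value

# Approach 2 (assumed form): 2 * sum(set(nums)) counts each distinct value
# twice; subtracting the real sum leaves the value that appears only once.
def single_math(nums):
    return 2 * sum(set(nums)) - sum(nums)

# Approach 3: the XOR solution from this article; O(n) time, O(1) space.
def single_xor(nums):
    a = 0
    for i in nums:
        a ^= i
    return a

sample = [4, 1, 2, 1, 2]
print(single_counter(sample), single_math(sample), single_xor(sample))  # 4 4 4
```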

&lt;p&gt;The second difference has to do with complexity and storage. While we’re already given constraints of linear time complexity and constant extra space, we should ideally present our interviewer with the most optimal code we can write.&lt;/p&gt;

&lt;p&gt;Looking back, all three solutions must pass through the entire array, so they share O(n) time complexity. The counter and mathematical solutions, however, also require O(n) extra space to store the counts and the set of distinct elements. Our bitwise manipulation solution keeps the O(n) time but needs only O(1) extra space, since it combines just two values at a time and stores the result of each XOR operation in the first element of the array.&lt;/p&gt;

&lt;p&gt;For these types of problems, you should prepare yourself to explain why different solutions yield different complexity and space requirements. In addition, it’s helpful to understand the computational differences between separate solutions to the same problem.&lt;/p&gt;

&lt;h4&gt;
  
  
  Conclusion
&lt;/h4&gt;

&lt;p&gt;In this article, we covered several different ways of solving the LeetCode Single Number problem in Python. Many of the questions interviewers give you during data science interviews will be similar to this one. Keep in mind that this article’s three methods are not the only ways to solve the problem; many other solutions exist with greater or lesser time and space requirements.&lt;/p&gt;

&lt;p&gt;On StrataScratch, you will find articles discussing many other &lt;a href="https://www.stratascratch.com/blog/top-30-python-interview-questions-and-answers/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;python interview questions&lt;/a&gt; you will encounter in data science interviews. Beyond interview questions and answers, you will also find general articles on how to succeed in your data science interview, an interactive area to practice answering data science interview questions by constructing your own solutions or reviewing others’, and comparison pieces like "&lt;a href="https://www.stratascratch.com/blog/leetcode-vs-hackerrank-vs-stratascratch-for-data-science/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;LeetCode vs HackerRank vs StrataScratch for Data Science&lt;/a&gt;", which compares these three interview preparation platforms used by people working in data science.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>leetcode</category>
      <category>python</category>
      <category>dataanalyst</category>
    </item>
    <item>
      <title>Zillow Data Scientist Interview Question Walkthrough</title>
      <dc:creator>StrataScratch</dc:creator>
      <pubDate>Fri, 28 Jan 2022 05:31:57 +0000</pubDate>
      <link>https://dev.to/nate_at_stratascratch/zillow-data-scientist-interview-question-walkthrough-4mke</link>
      <guid>https://dev.to/nate_at_stratascratch/zillow-data-scientist-interview-question-walkthrough-4mke</guid>
      <description>&lt;p&gt;&lt;em&gt;We’ll closely examine one of the interesting Zillow data scientist interview questions and find a simple and flexible approach for solving this question.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This Zillow data scientist interview question can be solved in several ways, but we’ll cover one of the simplest and most flexible solutions. Keep reading to discover an approach which can handle a variety of different datasets without accidentally leaving out important records!&lt;/p&gt;

&lt;p&gt;As the most-visited real estate website in the United States, Zillow and its affiliates offer customers an on-demand experience for buying, selling, renting and financing with transparency and nearly seamless end-to-end service. Zillow Offers buys and sells homes directly in dozens of markets across the country, allowing sellers control over their timeline. Zillow Home Loans, Zillow’s affiliate lender, provides customers with an easy option to get pre-approved and secure financing for their next home purchase. Zillow recently launched Zillow Homes, Inc., a licensed brokerage entity, to streamline Zillow Offers transactions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Scientist Position at Zillow
&lt;/h2&gt;

&lt;p&gt;Data scientists at Zillow typically work on the Data Science &amp;amp; Analytics team. As a member of the Analytics team at Zillow, you will partner closely with stakeholders to model, analyze, and visualize business-relevant metrics that inform both short- and long-term decision-making. This role is responsible for advancing Zillow’s reporting practice, developing source-of-truth datasets, and maintaining the team’s Looker instance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TVyOTf6G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/coqqtwtu60tjekezo0w8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TVyOTf6G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/coqqtwtu60tjekezo0w8.png" alt="Data Scientist Position at Zillow" width="880" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This team collaborates with Data Engineering to turn data into information – and information into insight. It works with datasets as small as an Excel spreadsheet and as large as raw clickstream data. It’s responsible for production reporting, analysis, causal inference, and forecasting. This team works closely with Product Managers, Marketing, and Engineering to deliver critical information and insights that drive decision making.&lt;/p&gt;

&lt;p&gt;For additional information on the Data Science team at Zillow, &lt;a href="https://www.zillow.com/tech/data-science-overview-2017/"&gt;here’s an official article&lt;/a&gt; from a few years back highlighting their tools, technology, and data. Beyond this, StrataScratch offers several other articles like &lt;a href="https://www.stratascratch.com/blog/what-does-a-data-scientist-do/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;this one&lt;/a&gt; and &lt;a href="https://www.stratascratch.com/blog/most-in-demand-data-science-technical-skills/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;this one&lt;/a&gt; providing more context about data science roles.&lt;/p&gt;

&lt;h2&gt;
  
  
  Concepts Tested in Zillow Data Scientist Interview Questions
&lt;/h2&gt;

&lt;p&gt;The main SQL concepts tested in the Zillow data scientist interview questions include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Use of the avg() function to aggregate records&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When to use the WHERE clause versus the HAVING clause for filtering data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Using subqueries to compare computational results&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Zillow Data Scientist Interview Question
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Cities With The Most Expensive Homes&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The question we are going to examine in detail in this article has been asked during an interview at Zillow. It’s titled “Cities With The Most Expensive Homes”, and the key challenge is finding the national average and city averages for home prices then comparing the two to filter for the most expensive cities.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AnRCUaOP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zo16djhh12fws0wcvcrq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AnRCUaOP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zo16djhh12fws0wcvcrq.png" alt="Image description" width="846" height="139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Link to Problem: &lt;a href="https://platform.stratascratch.com/coding/10315-cities-with-the-most-expensive-homes?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;https://platform.stratascratch.com/coding/10315-cities-with-the-most-expensive-homes&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/nRImay97hp8"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Ultimately, we’re being asked to find the cities with higher average home prices than the national average all while using one table of data.&lt;/p&gt;

&lt;p&gt;This Zillow data scientist interview question may seem short and simple, but, as we will see, the answer requires thinking carefully about which construct to use for the data comparison. While there exist multiple ways to solve this question, we’re going to look at one of the simplest, most flexible solutions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Framework to Solve the Problem
&lt;/h3&gt;

&lt;p&gt;To make the process of solving this interview question easier, we will follow a framework we could use for any data science problem. It consists of three steps and creates a logical pipeline for approaching problems concerning writing code for manipulating data. Here are the three steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Understand your data:&lt;br&gt;
a). Take a look at the columns and make an assumption about them. Take note which columns will be relevant for your calculations and which you can discard.&lt;br&gt;
b). If you don’t have a complete understanding of the schema, look at the first couple of rows of data and work out how the values relate to each column. Ask for example values if none are present. Understanding what the values for each column might look like will help you figure out whether you can limit your solution to specific columns or must broaden it for edge cases.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Formulate your approach:&lt;br&gt;
a). Now, begin writing down the logical steps that you have to program/code. Don’t worry if it seems out of order at first. Code can be changed, so you might perform a calculation in advance and set it aside for later or place it elsewhere to write a separate part of the solution.&lt;br&gt;
b). You also have to identify the main functions that you have to implement to perform the logic. Envision the operation a function might have in advance to avoid miscalculations.&lt;br&gt;
c). Don't forget that interviewers will be watching you. They can intervene whenever needed, so make sure that you ask them to clarify any ambiguity. Your interviewers will also specify if you can use some ready-made functions or if you should write the code from scratch.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Code Execution:&lt;br&gt;
a). Build up your code so that it is neither oversimplified nor overcomplicated. Remember you can always set part of the solution aside for later use if you need to work through a separate step of the problem.&lt;br&gt;
b). Build it in steps based on the outline shared with the interviewer. It doesn’t have to be the most efficient solution, but it will help to present a generic solution which covers a variety of data.&lt;br&gt;
c). Here's the most important point. Think carefully about how your functions operate. This will let you achieve a simpler solution with fewer logical statements and rules cluttering the code.&lt;br&gt;
d). Don't be quiet while laying down your code. Talk about your code as the interviewer will evaluate your problem-solving skills.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Understand Your Data
&lt;/h4&gt;

&lt;p&gt;Let’s start by examining the data. At most company interviews, you won’t have access to the data and won’t have the ability to execute code. Instead, you’ll be responsible for understanding the data and making assumptions solely based on the table schema and by communicating with the interviewer.&lt;/p&gt;

&lt;p&gt;In the case of this Zillow data scientist interview question, there is only one table with five columns of data representing an id, state, city, street address, and market price. Each row corresponds to a single home.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HuRfmTnn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/onnra9j7nikko1t395d4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HuRfmTnn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/onnra9j7nikko1t395d4.png" alt="Data for Zillow Data Scientist Interview Question" width="847" height="168"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yBLBz0D_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/apubg83axz8xx21nqezr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yBLBz0D_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/apubg83axz8xx21nqezr.png" alt="Data for Zillow Data Scientist Interview Questions" width="880" height="757"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What’s important to realize is that, to solve this Zillow data scientist interview question, we don’t need all the columns of data. Reviewing the data shows we can discard the id, street_address, and state columns from our calculations. As a result, our solution will rely only on market prices and cities. We also know we’ll have to use these two columns to calculate a national average and compare each city average to this value, all within the same block of code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution:
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Formulate Approach
&lt;/h4&gt;

&lt;p&gt;The next step, according to the general framework for solving data science questions, is to outline the few general steps we’ll need to perform to answer this question. These are very high-level, but writing them down at the beginning will make the coding process much easier for us.&lt;/p&gt;

&lt;p&gt;Here are the general steps to follow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Start with a query to get the national average market price using the avg() function.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Put your initial query for the national average market price to the side while querying for the average market price for each city. This will require us to again use the avg() function and GROUP BY city since there are multiple records for each city.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Move your average market price by city calculation and your original national average query (in the form of a subquery) into a HAVING clause to filter for only the cities where the average market price is higher than the national average.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Find the National Average Market Price&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To write the SQL code for this question, let’s follow the general steps that we’ve just defined and translate them into code. The key part of this approach is that we leverage a subquery for the national average and the HAVING clause to perform a proper price comparison. You can think of it as first obtaining the national average, then obtaining each city average, then comparing the two to list only the cities with a higher average price.&lt;/p&gt;

&lt;p&gt;Looking at the first step, we can start by writing the code for obtaining the national average. This is a relatively simple query which takes advantage of the avg() function, so we can start like this:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT avg(mkt_price) &lt;br&gt;
FROM zillow_transactions&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0HbN2fz5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tfghdpts7c23a0pcl724.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0HbN2fz5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tfghdpts7c23a0pcl724.png" alt="Output for Zillow Data Scientist Interview Questions for Expensive Homes" width="880" height="116"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This code produces a single table with a single record corresponding to the national average. One thing to note is we can’t continue to manipulate this table to reach our solution. We’ll need this data for later, so the next step involves putting this query to the side (either cutting and pasting it or commenting it out).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Find the Average Market Price for Each City&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since we need to know the average market price by city, the next step involves using the avg() function again on the mkt_price column. Since each city has multiple records, we’ll GROUP BY city to get an average per city:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT city, avg(mkt_price) &lt;br&gt;
FROM zillow_transactions&lt;br&gt;
GROUP BY city&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;--SELECT avg(mkt_price) &lt;br&gt;
--FROM zillow_transactions&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DQWGL-KG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mvw0ktjm64dnpb3ijb9j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DQWGL-KG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mvw0ktjm64dnpb3ijb9j.png" alt="Output 2 for Zillow Data Scientist Interview Questions for Expensive Homes" width="880" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, we now have an average by city. What we need now is to compare these averages to our original national average and only present cities which have a higher price.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Filter for Cities Where the Market Price is Greater Than the National Average&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here’s where this Zillow data scientist interview question becomes tricky: your first instinct might be to filter with WHERE. The issue is that WHERE filters individual rows before the city averages are calculated, so it would remove relevant data and produce the wrong results. Instead, we’ll use a HAVING clause for the average comparison, so we aren’t discarding relevant pricing data.&lt;/p&gt;
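&lt;p&gt;The difference between the two filters is easy to see on a toy table. The sketch below uses Python’s built-in sqlite3 module with made-up data (the table name, cities, and prices are illustrative only):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (city TEXT, price REAL)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("A", 100), ("A", 300), ("B", 150)])

# WHERE filters individual rows BEFORE grouping: the (A, 100) row is
# dropped, so A's average is computed from the surviving rows only.
where_avgs = dict(conn.execute(
    "SELECT city, avg(price) FROM t WHERE price > 120 GROUP BY city"))

# HAVING filters whole groups AFTER aggregation: every row still counts
# toward its city's average; only then are low-average groups removed.
having_avgs = dict(conn.execute(
    "SELECT city, avg(price) FROM t GROUP BY city HAVING avg(price) > 120"))

print(where_avgs)   # {'A': 300.0, 'B': 150.0}
print(having_avgs)  # {'A': 200.0, 'B': 150.0}
```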

&lt;p&gt;For the third step, we’ll compare the city average price calculation against a subquery containing our original national average calculation inside the HAVING clause, filtering for the correct cities:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT city&lt;br&gt;
FROM zillow_transactions&lt;br&gt;
GROUP BY city&lt;br&gt;
HAVING avg(mkt_price) &amp;gt; (SELECT avg(mkt_price) FROM zillow_transactions)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0tU4jq0J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kg2vn6oj6xidnmu78x0x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0tU4jq0J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kg2vn6oj6xidnmu78x0x.png" alt="Final Output for Zillow Data Scientist Interview Questions for Expensive Homes" width="879" height="216"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Originally we were asked to present only the cities, and here we get a single column listing the cities whose average home prices are higher than the national average. While you could add an ORDER BY to rank the prices, it wouldn’t contribute anything towards reaching the correct answer in this solution.&lt;/p&gt;

&lt;p&gt;Now, we have the entire solution, and, although it’s simple, it’s also flexible enough to accommodate any additional price data appended to the dataset.&lt;/p&gt;
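&lt;p&gt;As a final sanity check, the query can be run locally against a tiny made-up dataset using Python’s built-in sqlite3 module. The table and column names follow the article; the rows themselves are invented:&lt;/p&gt;

```python
import sqlite3

# In-memory table mirroring the article's schema, with toy data.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE zillow_transactions
                (id INTEGER, state TEXT, city TEXT,
                 street_address TEXT, mkt_price REAL)""")
conn.executemany(
    "INSERT INTO zillow_transactions VALUES (?, ?, ?, ?, ?)",
    [(1, "CA", "San Francisco", "123 A St", 1200000),
     (2, "CA", "San Francisco", "456 B St", 1000000),
     (3, "TX", "Austin",        "789 C St",  400000),
     (4, "TX", "Austin",        "101 D St",  500000)])

# National average = 775,000, so only San Francisco (city avg 1,100,000)
# should clear the bar; Austin (city avg 450,000) is filtered out.
rows = conn.execute("""
    SELECT city
    FROM zillow_transactions
    GROUP BY city
    HAVING avg(mkt_price) > (SELECT avg(mkt_price)
                             FROM zillow_transactions)
""").fetchall()
print([city for (city,) in rows])  # ['San Francisco']
```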

&lt;h4&gt;
  
  
  Conclusion
&lt;/h4&gt;

&lt;p&gt;In this article, we have walked through a simple and flexible way of solving one of the Zillow data scientist interview questions. Remember that the method shown here is not the only possibility; there exist countless other ways, be they more or less efficient, of answering this interview question!&lt;/p&gt;

&lt;p&gt;On StrataScratch, you can practice answering more &lt;a href="https://www.stratascratch.com/blog/sql-interview-questions-you-must-prepare-the-ultimate-guide/?utm_source=blog&amp;amp;utm_medium=click&amp;amp;utm_campaign=dev.to"&gt;SQL interview questions&lt;/a&gt; by constructing solutions to them, but always try to think of other ways to solve them; you may come up with a more efficient or more elegant approach. Make sure to post all your ideas to benefit from the feedback of other users, and browse their solutions for inspiration!&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>sql</category>
    </item>
  </channel>
</rss>
