Yankho Chimpesa

Essential SQL Commands For Data Science

Data is at the heart of a data scientist's or data analyst's job, and it can come from a variety of sources.
Because much of that data is stored in SQL databases, knowing how to write SQL queries is often required to do the job well.
This article introduces some of the more basic commands, along with a few more advanced operations, that are useful to a data analyst or data scientist.

The commands cover operations such as simple data retrieval, aggregations, joins and complex conditions.

The following are some of the essential SQL commands you need to know as a data scientist:

SELECT

The SELECT command is used to retrieve data from a database. It specifies which columns to return from a table; which rows are returned is controlled by clauses such as WHERE, covered later in this article. Here is an example:

SELECT * 
FROM 
neighbourhoods

neighbourhood_id neighbourhood
0 Ashfield
1 Bankstown
2 Blacktown
3 Burwood
4 Botany Bay

In this example, we are selecting all columns from a table called neighbourhoods.

The * wildcard selects every column in the table. To retrieve only some columns, list their names instead, separated by commas.
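
To make the difference between selecting every column and selecting named columns concrete, here is a minimal sketch that runs both queries from Python's built-in sqlite3 module. The in-memory table and its three rows are assumptions that mirror the neighbourhoods sample shown earlier; they are not the author's actual dataset.

```python
import sqlite3

# Hypothetical in-memory copy of the neighbourhoods table (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE neighbourhoods (neighbourhood_id INTEGER, neighbourhood TEXT)"
)
conn.executemany(
    "INSERT INTO neighbourhoods VALUES (?, ?)",
    [(0, "Ashfield"), (1, "Bankstown"), (2, "Blacktown")],
)

# SELECT * returns every column of each row.
all_columns = conn.execute("SELECT * FROM neighbourhoods").fetchall()

# Naming a column returns only that column.
names_only = conn.execute("SELECT neighbourhood FROM neighbourhoods").fetchall()

print(all_columns)
print(names_only)
```

Each row comes back as a tuple, so the second query yields one-element tuples that hold only the neighbourhood name.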

FROM

The FROM command is used to specify the table or tables from which to retrieve data. Here is an example:
In this example, we are retrieving data from a table called names.

SELECT * 
FROM 
names

reg_id name
0 Astrid
1 Barin
2 Blaje
3 Brian
4 Cody

If you need to retrieve data from multiple tables, you can use a JOIN statement. We will cover JOIN in more detail later in this article.

WHERE

The WHERE command is used to filter the data based on a specified condition. It is used to narrow down the results to only those rows that meet the specified condition.

Here is an example. It answers the following question: how would you display the host_id and the host of every listing in one particular neighbourhood, say neighbourhood_id 35, sorted by host_id in descending order?

-- 1/ Fetch only host_id and host from the listings table
-- 2/ Filter the data to neighbourhood_id = 35
-- 3/ Sort the output by host_id in descending order

SELECT host_id, host FROM listings
WHERE neighbourhood_id=35
ORDER BY host_id DESC

host_id host
285488167 Rick
185783910 Tiina
109067745 Annie
41506490 Andrew

GROUP BY

The GROUP BY command is used to group the data based on one or more columns. It is used to aggregate data based on the grouping columns.

GROUP BY is typically used together with aggregate functions, which compute a single value for each group:
COUNT: total number of rows
SUM: sum of all the values
MAX: maximum value
MIN: minimum value
AVG: average value

Here is an example:

We're now interested in tracking all neighbourhoods in which we are "over-represented". Let's first count all the occurrences of each neighbourhood in our listings table.

-- Instructions:
-- 1/ Fetch neighbourhood_id from the listings table
-- 2/ For the second column, count the number of listings in each neighbourhood

SELECT neighbourhood_id,
COUNT(neighbourhood_id)
FROM listings
GROUP BY neighbourhood_id

neighbourhood_id COUNT(neighbourhood_id)
2 3
0 1
1 1
4 1
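
The other aggregate functions listed above work the same way as COUNT. The sketch below, run through Python's built-in sqlite3 module, computes all five per neighbourhood. The price column and the four sample rows are assumptions made up for illustration; the original listings table may look different.

```python
import sqlite3

# Hypothetical listings table with an assumed price column (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings (listing_id INTEGER, neighbourhood_id INTEGER, price REAL)"
)
conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?)",
    [(1, 2, 100.0), (2, 2, 150.0), (3, 2, 200.0), (4, 0, 80.0)],
)

# One output row per neighbourhood_id, with one value per aggregate function.
rows = conn.execute(
    """
    SELECT neighbourhood_id,
           COUNT(*)   AS n_listings,
           SUM(price) AS total_price,
           MIN(price) AS min_price,
           MAX(price) AS max_price,
           AVG(price) AS avg_price
    FROM listings
    GROUP BY neighbourhood_id
    ORDER BY neighbourhood_id
    """
).fetchall()

for row in rows:
    print(row)
```

With this sample data, neighbourhood 2 has three listings, so its aggregates summarise three prices, while neighbourhood 0 has a single listing and all of its price aggregates are equal.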

HAVING

The HAVING command is used to filter the data after it has been grouped. It is used to filter out groups that do not meet a specified condition. Here is an example:

SELECT listing_id, COUNT(host_id) as count
FROM reviews
GROUP BY listing_id
HAVING COUNT(host_id) > 10;


In this example, we are selecting listing_id and counting the number of values in host_id for each group of values in listing_id. We then use the HAVING clause to filter the results so that only groups with a count greater than 10 are included in the results.
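
The filtering step can be checked on a small dataset. The following is a minimal sketch using Python's built-in sqlite3 module; the reviews table here is hypothetical, with 12 rows for listing 1 and 3 rows for listing 2, so only the first group should pass the HAVING condition.

```python
import sqlite3

# Hypothetical reviews table: listing 1 has 12 rows, listing 2 has 3 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (listing_id INTEGER, host_id INTEGER)")
sample_rows = [(1, h) for h in range(12)] + [(2, h) for h in range(3)]
conn.executemany("INSERT INTO reviews VALUES (?, ?)", sample_rows)

# Rows are grouped first; HAVING then discards the groups
# whose count is not greater than 10.
result = conn.execute(
    """
    SELECT listing_id, COUNT(host_id) AS host_count
    FROM reviews
    GROUP BY listing_id
    HAVING COUNT(host_id) > 10
    """
).fetchall()

print(result)
```

The distinction from WHERE is one of timing: WHERE filters individual rows before they are grouped, whereas HAVING filters whole groups after the aggregate has been computed.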

ORDER BY

The ORDER BY command sorts the result by one or more columns, in ascending (ASC, the default) or descending (DESC) order. Here is an example:

Find all the listings in neighbourhood 27 whose room type is "Private room", sorted by host_id in descending order.

-- Instructions:
-- 1/ Fetch host_id and host from the listings table
-- 2/ Filter the data to neighbourhood_id = 27 and room_type = 'Private room'
-- 3/ Sort the output by host_id in descending order

SELECT host_id, host 
FROM listings
WHERE 
neighbourhood_id=27 AND room_type='Private room'
ORDER BY host_id DESC

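Sorting by more than one column can be sketched as follows with Python's built-in sqlite3 module. The three rows are made-up sample data: the query sorts by neighbourhood_id in ascending order and, within each neighbourhood, by host_id in descending order.

```python
import sqlite3

# Hypothetical listings rows (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings (host_id INTEGER, host TEXT, neighbourhood_id INTEGER)"
)
conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?)",
    [(3, "Cody", 27), (1, "Astrid", 27), (2, "Brian", 35)],
)

# The first sort key is neighbourhood_id (ascending);
# ties are broken by host_id (descending).
rows = conn.execute(
    """
    SELECT neighbourhood_id, host_id, host
    FROM listings
    ORDER BY neighbourhood_id ASC, host_id DESC
    """
).fetchall()

for row in rows:
    print(row)
```

Each column in the ORDER BY list carries its own direction, so ASC and DESC can be mixed freely in one query.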

DISTINCT

In SQL, the DISTINCT keyword is used to select only unique values from a column or set of columns. Here is an example of how to use it:

SELECT DISTINCT first_name
FROM Customers;

first_name
Edwin
William
Samuel
Linda

In this example, we are selecting only the distinct values of the first_name column from the Customers table. The query returns each first name once, even if it appears in several rows.
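
When DISTINCT is applied to a set of columns, uniqueness is judged on the combination of values rather than on each column separately. The sketch below illustrates this with Python's built-in sqlite3 module; the country column and the sample rows are assumptions added for the example.

```python
import sqlite3

# Hypothetical Customers rows with an assumed country column (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (first_name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?)",
    [("Edwin", "UK"), ("Edwin", "UK"), ("Edwin", "USA"), ("Linda", "USA")],
)

# DISTINCT on one column: each first name appears once.
names = conn.execute(
    "SELECT DISTINCT first_name FROM Customers ORDER BY first_name"
).fetchall()

# DISTINCT on two columns: each (first_name, country) pair appears once.
pairs = conn.execute(
    "SELECT DISTINCT first_name, country FROM Customers "
    "ORDER BY first_name, country"
).fetchall()

print(names)
print(pairs)
```

Here "Edwin" appears once in the first result but twice in the second, because the pairs ("Edwin", "UK") and ("Edwin", "USA") differ.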

AS

The AS command is used to create an alias, that is, to rename a column in the query result.
In the example below, we rename customer_id to ID and first_name to Name.

SELECT customer_id AS ID,
       first_name AS Name
FROM Customers;

ID Name
1 Edwin
2 William
3 Samuel
4 Linda

LIKE

The LIKE command is used for string filtering. You provide a pattern, and the query returns the rows whose value matches it. In a pattern, % stands for any sequence of characters (including none) and _ stands for exactly one character.
Consider the following example:

-- Instructions:
-- 1/ Fetch all columns from the listings table
-- 2/ Filter the data to hosts whose name starts with Jos

SELECT 
*
FROM listings 
WHERE host LIKE 'Jos%'

listing_id listing host_id
22296011 Large private room on Camperdown park & Newtown 10873080
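
The two LIKE wildcards can be compared side by side. This is a minimal sketch using Python's built-in sqlite3 module with made-up host names: % matches any run of characters, while _ matches exactly one.

```python
import sqlite3

# Hypothetical host names (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (host TEXT)")
conn.executemany(
    "INSERT INTO listings VALUES (?)",
    [("Jose",), ("Josephine",), ("Joan",), ("Mojos",)],
)

# 'Jos%': names that start with "Jos", followed by anything (or nothing).
starts_with_jos = conn.execute(
    "SELECT host FROM listings WHERE host LIKE 'Jos%' ORDER BY host"
).fetchall()

# 'Jo__': names that start with "Jo" and are exactly four characters long.
four_letters = conn.execute(
    "SELECT host FROM listings WHERE host LIKE 'Jo__' ORDER BY host"
).fetchall()

print(starts_with_jos)
print(four_letters)
```

Note that whether LIKE is case-sensitive depends on the database system and its settings, so it is worth checking this for the system you use.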

JOIN

In SQL, a JOIN statement is used to combine data from two or more tables based on a common column. Joining tables is a powerful way to retrieve data that is spread across multiple tables. There are several types of JOIN statements, including:

  • INNER JOIN: An inner join returns only the rows that have matching values in both tables being joined. Here is an example:
SELECT orders.order_id, customers.customer_name
FROM orders
INNER JOIN customers
ON orders.customer_id = customers.customer_id;


In this example, we are selecting the order_id from the orders table and the customer_name from the customers table where the customer_id in both tables matches.

  • LEFT JOIN: A left join returns all the rows from the left table (the table specified before the LEFT JOIN keyword) and the matching rows from the right table (the table specified after the LEFT JOIN keyword). If there are no matching rows in the right table, the result will contain NULL values for the right table columns. Here is an example:
SELECT customers.customer_name, orders.order_id
FROM customers
LEFT JOIN orders
ON customers.customer_id = orders.customer_id;


In this example, we are selecting the customer_name from the customers table and the order_id from the orders table where the customer_id in both tables matches. If there are no matching orders for a customer, the result will contain NULL values for the order_id column.

  • RIGHT JOIN: A right join is similar to a left join, but it returns all the rows from the right table and the matching rows from the left table. If there are no matching rows in the left table, the result will contain NULL values for the left table columns. Here is an example:
SELECT customers.customer_name, orders.order_id
FROM customers
RIGHT JOIN orders
ON customers.customer_id = orders.customer_id;


In this example, we are selecting the customer_name from the customers table and the order_id from the orders table where the customer_id in both tables matches. If there are no matching customers for an order, the result will contain NULL values for the customer_name column.

  • FULL OUTER JOIN: A full outer join returns all the rows from both tables and combines the matching rows from both tables. If there are no matching rows in one of the tables, the result will contain NULL values for the columns of the table that has no matching rows. Here is an example:
SELECT customers.customer_name, orders.order_id
FROM customers
FULL OUTER JOIN orders
ON customers.customer_id = orders.customer_id;


In this example, we are selecting the customer_name from the customers table and the order_id from the orders table where the customer_id in both tables matches. If there are no matching customers for an order or no matching orders for a customer, the result will contain NULL values for the respective columns. Note that not all database management systems support the FULL OUTER JOIN syntax.

These are the main types of JOIN statements in SQL. Understanding the different types of JOINs and when to use them is an important skill for data scientists who work with relational databases.
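
To see how an inner join and a left join differ on the same data, here is a minimal sketch using Python's built-in sqlite3 module. The customers and orders rows are hypothetical; one customer (Samuel) deliberately has no orders. RIGHT JOIN and FULL OUTER JOIN are left out because older SQLite versions do not support them.

```python
import sqlite3

# Hypothetical customers and orders (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, customer_name TEXT)")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Edwin"), (2, "William"), (3, "Samuel")],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(10, 1), (11, 1), (12, 2)],
)

# INNER JOIN: only customers that have at least one matching order.
inner_rows = conn.execute(
    """
    SELECT customers.customer_name, orders.order_id
    FROM customers
    INNER JOIN orders ON customers.customer_id = orders.customer_id
    ORDER BY orders.order_id
    """
).fetchall()

# LEFT JOIN: every customer; order_id is NULL (None in Python) without a match.
left_rows = conn.execute(
    """
    SELECT customers.customer_name, orders.order_id
    FROM customers
    LEFT JOIN orders ON customers.customer_id = orders.customer_id
    ORDER BY customers.customer_id, orders.order_id
    """
).fetchall()

print(inner_rows)
print(left_rows)
```

Samuel is absent from the inner join result and appears in the left join result with a missing order_id, which is the behaviour described for the two join types above.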

UNION

In SQL, the UNION operator is used to combine the results of two or more SELECT statements into a single result set. Here are some examples of how to use the UNION operator in SQL:

Simple UNION example:

SELECT host_id, host_name
FROM listings
UNION
SELECT reg_number, reg_name
FROM reviews;


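In this example, we are selecting columns from two different tables and combining the results using the UNION operator. The result contains every distinct row returned by either SELECT: pairs of host_id and host_name from listings, and pairs of reg_number and reg_name from reviews. Both SELECT statements must return the same number of columns with compatible types, and the result columns take their names from the first SELECT.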

UNION with ORDER BY:

SELECT host_id, host_name
FROM listings
UNION
SELECT reg_number, reg_name
FROM reviews
ORDER BY host_id ASC;

In this example, we are using the UNION operator to combine the results of two SELECT statements, and the ORDER BY clause to sort the combined result in ascending order. Because the columns of a UNION result are named after the first SELECT, we sort by host_id rather than reg_number. The query returns every distinct row from both tables, sorted by that first column.

UNION with WHERE clause:

SELECT host_id, host_name
FROM listings
WHERE purchase > 10
UNION
SELECT reg_number, reg_name
FROM reviews
WHERE order_price < 5;


In this example, we use the UNION operator to combine the results of two SELECT statements, but we also use WHERE clauses to filter the results of each SELECT statement prior to combining them.
The result contains the distinct rows from listings that satisfy the first condition, together with the distinct rows from reviews that satisfy the second.

The UNION operator is an extremely useful tool for combining the results of multiple SELECT statements into a single result set.
You can use the UNION operator to perform complex queries on your data and extract meaningful insights from it.
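
The deduplicating behaviour of UNION can be verified on a small example. The sketch below uses Python's built-in sqlite3 module and hypothetical rows in which the pair (2, "Tiina") occurs in both tables.

```python
import sqlite3

# Hypothetical tables; the row (2, "Tiina") appears in both (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (host_id INTEGER, host_name TEXT)")
conn.execute("CREATE TABLE reviews (reg_number INTEGER, reg_name TEXT)")
conn.executemany(
    "INSERT INTO listings VALUES (?, ?)", [(1, "Rick"), (2, "Tiina")]
)
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?)", [(2, "Tiina"), (3, "Annie")]
)

cursor = conn.execute(
    """
    SELECT host_id, host_name FROM listings
    UNION
    SELECT reg_number, reg_name FROM reviews
    ORDER BY host_id
    """
)

# Column names of the combined result come from the first SELECT.
column_names = [col[0] for col in cursor.description]
rows = cursor.fetchall()

print(column_names)
print(rows)
```

The shared row appears only once in the output, and the sort key refers to host_id because that is the name the combined result gives to its first column.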

CASE

The CASE expression is a powerful tool that lets you perform conditional logic within a SQL query. It checks its conditions in order and returns the value attached to the first one that is true, or the ELSE value if none is. Here is an example:

SELECT item,
       amount,
       CASE
           WHEN amount < 1000 THEN 'Low'
           ELSE 'High'
       END AS Priority
FROM Orders;

item amount Priority
Keyboard 600 Low
Mouse 200 Low
Monitor 18000 High
Keyboard 900 Low
Mousepad 850 Low
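
CASE is not limited to two outcomes. As a sketch of a multi-branch version, the example below adds a middle tier using Python's built-in sqlite3 module. The 500 threshold and the 'Medium' label are assumptions introduced for illustration, and the rows are a subset of the sample data above.

```python
import sqlite3

# Hypothetical Orders rows taken from the sample output above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (item TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?)",
    [("Keyboard", 600), ("Mouse", 200), ("Monitor", 18000)],
)

# Branches are checked top to bottom; the first true condition wins.
rows = conn.execute(
    """
    SELECT item,
           amount,
           CASE
               WHEN amount < 500  THEN 'Low'
               WHEN amount < 1000 THEN 'Medium'
               ELSE 'High'
           END AS priority
    FROM Orders
    ORDER BY amount
    """
).fetchall()

for row in rows:
    print(row)
```

Because the branches are evaluated in order, an amount of 200 is labelled 'Low' even though it also satisfies the second condition.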

Conclusion

In conclusion, SQL is a critical tool for any data scientist as it provides a powerful way to query, filter, and analyze data stored in relational databases. The ability to extract valuable insights from large datasets is a key component of data science, and SQL provides an efficient and effective way to accomplish this task.

In this article, we covered some essential SQL commands that every data scientist should know.

However, there are many other SQL commands and techniques that data scientists can use to enhance their data analysis skills. For instance, subqueries, together with more advanced uses of joins and aggregations, can all help data scientists to analyze data more effectively. Additionally, using SQL with other tools such as Python, R, and visualization software can provide even more advanced capabilities in data analysis.

Finally, it's worth noting that while SQL is a powerful tool, it's not the only tool that data scientists should rely on. Other tools and techniques, such as machine learning, deep learning, and natural language processing, can also provide valuable insights into data. The key to successful data analysis is to use the right tools and techniques for the task at hand and to constantly learn and adapt as new technologies and methods emerge.

Mastering SQL commands is an essential skill for data scientists looking to extract valuable insights from large datasets.
By understanding how to retrieve, filter, group, and sort data, data scientists can effectively manipulate data and extract insights that will help them make informed business decisions. However, it's important to remember that SQL is just one tool in the data scientist's toolkit, and that the most successful data analysis requires a diverse set of skills and techniques.
