Karen Ngala

Posted on Mar 15, 2023

Essential SQL Commands for Data Science

#datascience #sql #sqlfordatasccience

Pre-requisites: This article assumes basic SQL knowledge and CRUD commands such as: CREATE, INSERT, UPDATE, ALTER, DELETE, and DROP

SQL, Structured Query Language, is a programming language used for manipulating and managing data in a relational database. Data Scientists use it to extract insights from data. A large amount of data used by data scientists lives in a relational database. This data can be extracted using SQL commands. SQL servers such as MySQL and PostgreSQL use SQL.

This article covers the essential SQL commands that data scientists rely on to effectively clean and filter data:

Data Retrieval
- Conditions for Data Retrieval
Data Aggregation
- Changing Data Types
Joining Data From Different Tables
Complex Conditions

The Basics: Data Retrieval

SELECT FROM
This is the simplest method of data retrieval in a relational database.
It can be combined with conditional statements such as WHERE, ORDER BY, and GROUP BY to filter, sort, and group data.

-- To select specific columns in a table:
SELECT column1, column2, column3
FROM table_name;

-- To select everything in a table:
SELECT * 
FROM table_name;

DISTINCT
DISTINCT is used with SELECT to view unique values in a column.
For example, to know all the departments appearing in the column department, we use DISTINCT. It returns a table of the departments appearing in that table.

SELECT DISTINCT department
FROM employees;

Conditions for Data Retrieval

WHERE
This is a conditional statement used to filter data according to a specific condition.

SELECT column1, column2, column3
FROM table_name
WHERE condition;

-- for example:
SELECT *
FROM employees
WHERE age >= 45;

-- We can also filter data with more than one condition:
SELECT employee_name, department, salary
FROM employees
WHERE department = 'Sales' AND salary >= 50000;

SELECT *
FROM employees
WHERE department IN ('Finance', 'IT', 'HR');

GROUP BY
This statement is used to group data based on one or more columns.

SELECT department, salary
FROM employees
GROUP BY department;

ORDER BY
This is used to sort the results of a query either alphabetically or numerically.
The default sorting order in sql is ASC. Therefore, you do not have to specify ASC in your query.

SELECT employee, salary
FROM employees 
ORDER BY salary;

However, to sort the results in a descending order, use the keyword DESC

SELECT employee, salary
FROM employees 
ORDER BY salary DESC;

LIMIT
When the records in a table are many, we may want to limit the number of records we get. For example, to view only the top 10 earners in the Finance department:

SELECT employee_name, department, salary
FROM employees
WHERE salary > 50000
ORDER BY salary
LIMIT 10;

Data Aggregation

Aggregations are summaries of data used to gain insights on a dataset. They are often used with the GROUP BY clause.
COUNT()
Count returns the total number of rows. In the example below, we are displaying the number of employees in each department.

SELECT COUNT(employee_id)
FROM employees
GROUP BY department;

SUM() & AVG()
Sum returns the sum of all the values. In the example below, we use the GROUP BY statement to group the employees by department and calculate the total salary for each department:

SELECT department, SUM(salary) as total_salary
FROM employees
GROUP BY department;

Avg returns the average value. In the example below, we use the GROUP BY statement to group the employees by department and calculate the average salary for each department:

SELECT department, AVG(salary) as avg_salary
FROM employees
GROUP BY department;

HAVING
Having is used to add additional conditions after calculating a grouped aggregation.
For example, the above query can be conditioned further to only show departments with an average salary above 50000.

SELECT department, AVG(salary) as avg_salary
FROM employees
GROUP BY department
HAVING AVG(salary) > 50000;

MIN() & MAX()
To know the lowest or highest values in a column, we can use MIN and MAX.

SELECT MIN(salary) AS lowest_salary
FROM employees;

SELECT MAX(salary) AS highest_salary
FROM employees;

Changing Data Types

CAST( )
SQL sees numeric values as numbers even when dealing with money. We can change salary values to dollar amounts using the CAST function:

SELECT department, CAST(SUM(salary) as money)
FROM employees
GROUP BY department
ORDER BY SUM(salary) DESC;

We can also change numbers into floats, text, or date and time.
ROUND()
When aggregations cause many decimal points, we can round off the decimal points:

SELECT department, ROUND(AVG(salary), 2) as avg_salary
FROM employees
GROUP BY department;

JOINS

Working with a single table limits the number of manipulations we can do with data. This is where JOINs come in. We are able to join data from multiple tables.
Before we go any further, we need to distinguish a promary key from a foreign key. A primary key is a column used to uniquely identify records in a table. For example, the primary key in the employees table is employee_id. On the other hand, a foreign key is used to relate two tables.
A foreign key is usually a primary key in the other table. A separate table having information about when employees take vacation days (employee_vacation table) will have a column employee_id to relate to the employee table. Therefore, employee_id is a primary key in the employee table but a foreign key in the employee_vacation table.
There are different types of SQL joins which are best illustrated using venn diagrams.

The following examples will feature a customer database with customers table and orders table.

INNER JOIN
An inner join is used to view data where records in two tables match on two columns. The example below shows the order_id and customer_name if the customer_id on the orders table and the customer_id on the customers table are the same.

SELECT orders.order_id, customers.customer_name
FROM orders
INNER JOIN customers
ON orders.customer_id = customers.customer_id;

An INNER JOIN is also known as a JOIN and therefore, the code above can be written as:

SELECT orders.order_id, customers.customer_name
FROM orders
JOIN customers
ON orders.customer_id = customers.customer_id;

-- We can filter the data to not show a specific customer:
SELECT orders.order_id, customers.customer_name
FROM orders
JOIN customers
ON orders.customer_id = customers.customer_id
WHERE customers.customer_name != 'Lucy Lucy';

You can also work with more than two tables:

SELECT orders.order_id, customers.customer_name, shippers.shipper_name
FROM orders
JOIN customers 
    ON orders.customer_id = customers.customer_id
JOIN shippers 
    ON orders.shipper_id = shippers.shipper_id;

LEFT JOIN

The table before the statement LEFT JOIN is the left table while the one after is the right table.

A LEFT JOIN will return all the records in the left table and the matching records in the right table. If there are no matching records, the result will contain NULL values.

-- customers = left table
SELECT customers.customer_name, orders.order_id
FROM customers
LEFT JOIN orders
ON customers.customer_id = orders.customer_id;

RIGHT JOIN

The table before the statement JOIN is the right table while the one after is the left table.

A RIGHT JOIN will return all the records in the right table and the matching records in the left table. If there are no matching records, the result will contain NULL values.

-- customers = right table
SELECT customers.customer_name, orders.order_id
FROM customers
RIGHT JOIN orders
ON customers.customer_id = orders.customer_id;

Complex Queries

Subqueries
This is a query within another query, also known as a Nested query. It is usually embedded within the WHERE clause.

-- showing the highest paid employees
SELECT * 
FROM employees 
WHERE salary = (SELECT MAX(salary) 
                FROM employees);

CASE statement
This can be used when you need to add a category where the values are determined by an if...else statement(CASE statement)

SELECT order_id, order_total,
CASE 
    WHEN order_total < 20 THEN 'Order total is less than $20'
    ELSE 'Order total is greater than $20' 
END AS sales_threshold 
FROM orders;

Common Table Expressions (CTEs)
CTEs are used to create temporary tables that are then used to extract the information we need.

-- weekly_orders is the temporary table
WITH weekly_orders AS(
    SELECT
        customer_id,
        DATE_PART('week', order_date) AS week,
        COUNT(order_id) AS order_numbers
    FROM orders
    GROUP BY customer_id, week)

SELECT customer_id, AVG(order_numbers)
FROM weekly_rentals
GROUP BY customer_id

DEV Community