Pre-requisites: This article assumes basic SQL knowledge and CRUD commands such as: CREATE
, INSERT
, UPDATE
, ALTER
, DELETE
, and DROP
SQL, Structured Query Language, is a programming language used for manipulating and managing data in a relational database. Data Scientists use it to extract insights from data. A large amount of data used by data scientists lives in a relational database. This data can be extracted using SQL commands. SQL servers such as MySQL and PostgreSQL use SQL.
This article covers the essential SQL commands that data scientists rely on to effectively clean and filter data:
- Data Retrieval
- Conditions for Data Retrieval
- Data Aggregation
- Changing Data Types
- Joining Data From Different Tables
- Complex Conditions
The Basics: Data Retrieval
SELECT FROM
This is the simplest method of data retrieval in a relational database.
It can be combined with conditional statements such as WHERE, ORDER BY, and GROUP BY to filter, sort, and group data.
-- To select specific columns in a table:
SELECT column1, column2, column3
FROM table_name;
-- To select everything in a table:
SELECT *
FROM table_name;
DISTINCT
DISTINCT is used with SELECT to view unique values in a column.
For example, to know all the departments appearing in the column department
, we use DISTINCT. It returns a table of the departments appearing in that table.
SELECT DISTINCT department
FROM employees;
Conditions for Data Retrieval
WHERE
This is a conditional statement used to filter data according to a specific condition.
SELECT column1, column2, column3
FROM table_name
WHERE condition;
-- for example:
SELECT *
FROM employees
WHERE age >= 45;
-- We can also filter data with more than one condition:
SELECT employee_name, department, salary
FROM employees
WHERE department = 'Sales' AND salary >= 50000;
SELECT *
FROM employees
WHERE department IN ('Finance', 'IT', 'HR');
GROUP BY
This statement is used to group data based on one or more columns.
SELECT department, salary
FROM employees
GROUP BY department;
ORDER BY
This is used to sort the results of a query either alphabetically or numerically.
The default sorting order in sql is ASC
. Therefore, you do not have to specify ASC
in your query.
SELECT employee, salary
FROM employees
ORDER BY salary;
However, to sort the results in a descending order, use the keyword DESC
SELECT employee, salary
FROM employees
ORDER BY salary DESC;
LIMIT
When the records in a table are many, we may want to limit the number of records we get. For example, to view only the top 10 earners in the Finance department:
SELECT employee_name, department, salary
FROM employees
WHERE salary > 50000
ORDER BY salary
LIMIT 10;
Data Aggregation
Aggregations are summaries of data used to gain insights on a dataset. They are often used with the GROUP BY clause.
COUNT()
Count returns the total number of rows. In the example below, we are displaying the number of employees in each department.
SELECT COUNT(employee_id)
FROM employees
GROUP BY department;
SUM() & AVG()
Sum returns the sum of all the values. In the example below, we use the GROUP BY statement to group the employees by department and calculate the total salary for each department:
SELECT department, SUM(salary) as total_salary
FROM employees
GROUP BY department;
Avg returns the average value. In the example below, we use the GROUP BY statement to group the employees by department and calculate the average salary for each department:
SELECT department, AVG(salary) as avg_salary
FROM employees
GROUP BY department;
HAVING
Having is used to add additional conditions after calculating a grouped aggregation.
For example, the above query can be conditioned further to only show departments with an average salary above 50000.
SELECT department, AVG(salary) as avg_salary
FROM employees
GROUP BY department
HAVING AVG(salary) > 50000;
MIN() & MAX()
To know the lowest or highest values in a column, we can use MIN
and MAX
.
SELECT MIN(salary) AS lowest_salary
FROM employees;
SELECT MAX(salary) AS highest_salary
FROM employees;
Changing Data Types
CAST( )
SQL sees numeric values as numbers even when dealing with money. We can change salary
values to dollar amounts using the CAST function:
SELECT department, CAST(SUM(salary) as money)
FROM employees
GROUP BY department
ORDER BY SUM(salary) DESC;
We can also change numbers into floats, text, or date and time.
ROUND()
When aggregations cause many decimal points, we can round off the decimal points:
SELECT department, ROUND(AVG(salary), 2) as avg_salary
FROM employees
GROUP BY department;
JOINS
Working with a single table limits the number of manipulations we can do with data. This is where JOINs come in. We are able to join data from multiple tables.
Before we go any further, we need to distinguish a promary key from a foreign key. A primary key is a column used to uniquely identify records in a table. For example, the primary key in the employees
table is employee_id. On the other hand, a foreign key is used to relate two tables.
A foreign key is usually a primary key in the other table. A separate table having information about when employees take vacation days (employee_vacation
table) will have a column employee_id
to relate to the employee table. Therefore, employee_id is a primary key in the employee table but a foreign key in the employee_vacation table.
There are different types of SQL joins which are best illustrated using venn diagrams.
The following examples will feature a customer database with
customers
table andorders
table.
INNER JOIN
An inner join is used to view data where records in two tables match on two columns. The example below shows the order_id and customer_name if
the customer_id on the orders table and the customer_id on the customers table are the same.
SELECT orders.order_id, customers.customer_name
FROM orders
INNER JOIN customers
ON orders.customer_id = customers.customer_id;
An INNER JOIN is also known as a JOIN and therefore, the code above can be written as:
SELECT orders.order_id, customers.customer_name
FROM orders
JOIN customers
ON orders.customer_id = customers.customer_id;
-- We can filter the data to not show a specific customer:
SELECT orders.order_id, customers.customer_name
FROM orders
JOIN customers
ON orders.customer_id = customers.customer_id
WHERE customers.customer_name != 'Lucy Lucy';
You can also work with more than two tables:
SELECT orders.order_id, customers.customer_name, shippers.shipper_name
FROM orders
JOIN customers
ON orders.customer_id = customers.customer_id
JOIN shippers
ON orders.shipper_id = shippers.shipper_id;
LEFT JOIN
The table before the statement
LEFT JOIN
is the left table while the one after is the right table.
A LEFT JOIN will return all
the records in the left table and the matching records in the right table. If there are no matching records, the result will contain NULL
values.
-- customers = left table
SELECT customers.customer_name, orders.order_id
FROM customers
LEFT JOIN orders
ON customers.customer_id = orders.customer_id;
RIGHT JOIN
The table before the statement
JOIN
is the right table while the one after is the left table.
A RIGHT JOIN will return all
the records in the right table and the matching records in the left table. If there are no matching records, the result will contain NULL
values.
-- customers = right table
SELECT customers.customer_name, orders.order_id
FROM customers
RIGHT JOIN orders
ON customers.customer_id = orders.customer_id;
Complex Queries
Subqueries
This is a query within another query, also known as a Nested query. It is usually embedded within the WHERE clause.
-- showing the highest paid employees
SELECT *
FROM employees
WHERE salary = (SELECT MAX(salary)
FROM employees);
CASE statement
This can be used when you need to add a category where the values are determined by an if...else
statement(CASE statement)
SELECT order_id, order_total,
CASE
WHEN order_total < 20 THEN 'Order total is less than $20'
ELSE 'Order total is greater than $20'
END AS sales_threshold
FROM orders;
Common Table Expressions (CTEs)
CTEs are used to create temporary tables that are then used to extract the information we need.
-- weekly_orders is the temporary table
WITH weekly_orders AS(
SELECT
customer_id,
DATE_PART('week', order_date) AS week,
COUNT(order_id) AS order_numbers
FROM orders
GROUP BY customer_id, week)
SELECT customer_id, AVG(order_numbers)
FROM weekly_rentals
GROUP BY customer_id
Conclusion
Further reading:
I hope you found this article helpful in your SQL journey. Practice questions will definitely help you retain all the information you have learned. Use platforms like Hackerranck to level up your SQL skills.
If you found this article helpful, make sure to like it or leave a comment.
Top comments (0)