Catherine Kawara

Posted on Mar 15, 2023

Essential SQL Commands for Data Science

#datascience #sql #machinelearning #data

In the previous article, Introduction to SQL for Data Analysis, I provided an introduction to SQL for data analysis, covering the basics of SQL syntax, data types, operators, and table creation/modification. In this article, we will delve into more advanced SQL features that can be used for complex data analysis tasks. In this article, we will cover some of the essential SQL commands that every data scientist should know.

1. SELECT

SELECT is the most basic and important command in SQL. It is used to retrieve data from a table. The basic syntax of the SELECT command is:

SELECT column1, column2, ... FROM table_name;

2. WHERE

WHERE is used to filter data based on certain conditions. The basic syntax of the WHERE command is:

SELECT column1, column2, ... FROM table_name WHERE condition;

3. GROUP BY

GROUP BY is used to group rows based on the values in one or more columns. The basic syntax of the GROUP BY command is:

SELECT column1, column2, ..., aggregate_function(column_name)
FROM table_name
GROUP BY column1, column2, ...;

Here, aggregate_function is a function used to perform an operation on a set of values, such as COUNT, SUM, AVG, MAX, and MIN. For example, if you want to retrieve the number of customers from each country in the "customers" table, you can use the following command:

SELECT country, COUNT(*)
FROM customers
GROUP BY country;

This command will retrieve the number of customers from each country in the "customers" table.

4. ORDER BY

ORDER BY is used to sort the data in ascending or descending order based on one or more columns. The basic syntax of the ORDER BY command is:

SELECT column1, column2, ... FROM table_name ORDER BY column_name [ASC|DESC];

Here, column_name is the name of the column that you want to sort by.

5. JOIN

JOIN is used to combine rows from two or more tables based on a related column between them. There are different types of joins, including INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN. The basic syntax of the JOIN command is:

SELECT column1, column2, ... FROM table1 JOIN table2 ON table1.column_name = table2.column_name;

Table1 and table2 are the names of the tables that you want to join, and column_name is the name of the related column between them.

INNER JOIN

INNER JOIN returns only the rows in which the joined tables have matching values. For example, if we have two tables - employees and departments - with a common column "department_id", we can join them as follows:

SELECT employees.employee_id, employees.first_name, employees.last_name, departments.department_name
FROM employees
INNER JOIN departments
ON employees.department_id = departments.department_id;

This statement will return only the rows where the "department_id" column in the employees table matches the "department_id" column in the departments table.

LEFT JOIN

LEFT JOIN is used to combine all rows from the left table and the matching rows from the right table. If there are no matching rows from the right table, the result will contain NULL values.

SELECT column1, column2, ... FROM table1 LEFT JOIN table2 ON table1.column_name = table2.column_name;

Table1 and table2 are the names of the tables that you want to join, and column_name is the name of the related column between them.

Say you want to retrieve all the customers and their orders from the "customers" and "orders" tables, even if some customers have not made any orders yet, you can use the following command:

SELECT customers.customer_name, orders.order_id
FROM customers
LEFT JOIN orders ON customers.customer_id = orders.customer_id;

RIGHT JOIN

RIGHT JOIN is used to combine all rows from the right table and the matching rows from the left table. If there are no matching rows from the left table, the result will contain NULL values. Here's the syntax:

SELECT column1, column2, ... FROM table1 RIGHT JOIN table2 ON table1.column_name = table2.column_name;

FULL OUTER JOIN

FULL OUTER JOIN is used to combine all rows from both tables, including the rows that have no matching values in either table. The basic syntax of FULL OUTER JOIN command is:

SELECT column1, column2, ...
FROM table1
FULL OUTER JOIN table2
ON table1.column_name = table2.column_name;

table1 and table2 are the names of the tables that you want to join, and column_name is the related column between them.

SELECT c.name, o.order_id
FROM customers c
FULL OUTER JOIN orders o
ON c.customer_id = o.customer_id;

This query will return all customers and their corresponding order IDs, including customers who have not placed any orders yet. For customers without orders, the order ID column will contain NULL values.

SELF JOIN

Self join is a type of join where a table is joined with itself. This is useful when you want to compare data in the same table or when you have hierarchical data in a table.

To perform a self join on a table, we need to give the table an alias name so that we can refer to it separately within the same query. Here's an example of a self join that retrieves the names of all employees and their corresponding supervisors:

SELECT e.employee_name, s.employee_name as supervisor_name
FROM employees e
INNER JOIN employees s ON e.supervisor_id = s.employee_id;

In this example, we gave the "employees" table the alias name "e" for the first instance of the table, and "s" for the second instance of the table. We then joined the two instances of the table on the "supervisor_id" column in the first instance of the table, and the "employee_id" column in the second instance of the table.

The result of the query will be a list of all employee names with their corresponding supervisor names.

6. Subqueries

Subqueries are used to nest one query inside another query. They can be used to retrieve data that meets specific criteria or to create temporary tables for use in other queries.

For example, if we want to retrieve the names of employees who work in departments with more than 100 employees, we can use a subquery as follows:

SELECT first_name, last_name
FROM employees
WHERE department_id IN (
SELECT department_id
FROM employees
GROUP BY department_id
HAVING COUNT(*) > 100
);

This statement uses a subquery to retrieve the department IDs of departments with more than 100 employees, and then retrieves the names of employees who work in those departments.

7. Aggregation Functions

Aggregation functions are used to perform calculations on a set of data, such as finding the average, maximum, or minimum value in a column. Common aggregation functions include SUM, AVG, MAX, MIN, and COUNT.

For example, if we want to find the average salary of employees in each department, we can use the AVG function as follows:

SELECT department_id, AVG(salary) AS average_salary
FROM employees
GROUP BY department_id;

This statement groups employees by department and calculates the average salary for each department.

8. Window functions

Window functions are used to perform calculations on a subset of data within a larger data set. They can be used to calculate running totals, rankings, and other complex calculations.
For example, to rank the sales figures for each month, we can use the following query:

SELECT month, sales, RANK() OVER (ORDER BY sales DESC) AS sales_rank
FROM sales_table;

Here, month is the column that contains the month name or number, sales is the column that contains the sales figures for each month, and sales_table is the name of the table that contains the data.

The RANK function is used to rank the rows based on a specific column, and the OVER clause specifies the order in which the rows should be processed. Sales_rank is the column that will contain the rank of each month's sales figures, and DESC is used to order the sales figures in descending order.

Overall, window functions provide a powerful tool for performing complex calculations on subsets of data within a larger data set.

Conclusion

In this article, we covered some of the more advanced SQL features that are commonly used in data science. Joins allow us to combine data from multiple tables based on a common column, subqueries enable us to nest one query inside another query, aggregation functions let us perform calculations on a set of data, and window functions allow us to perform calculations on a subset of data within a larger dataset.

As you continue to work with SQL for data analysis, it's important to keep practicing and exploring the different features and commands available. With enough practice, you'll become more comfortable with SQL and be able to use it to extract valuable insights from even the largest datasets.

Till next time, happy coding!✌️

🚀 pgai Vectorizer: SQLAlchemy and LiteLLM Make Vector Search Simple

We built pgai Vectorizer to simplify embedding management for AI applications—without needing a separate database or complex infrastructure. Since launch, developers have created over 3,000 vectorizers on Timescale Cloud, with many more self-hosted.

Read full post →

Forem