This post is a continuation of the Introduction to SQL. However, if you are familiar with basic SQL concepts you can take a read through this article for more advance concepts.
CASE WHEN AND THEN
Evaluates a set of conditions just like an if else statement and return a corresponding result when the condition is true.
The corresponding syntax is as shown:
SELECT column1,
column2,
CASE WHEN condition1 THEN result1
WHEN condition2 THEN result2
...(other when conditions)
ELSE default_result
END AS new_column_name
FROM table_name;
Lets look at a more concrete example. Assuming we have a matches table with different teams and how they played and we want to determine if the won when they played at home, we shall have the following query.
SELECT id,
name,
CASE WHEN home_goal > away_goal THEN 'Its a home win'
WHEN home_goal < away_goal THEN 'Its a home loss'
ELSE 'Its a tie'
END AS home_scores
INTO home_analysis
FROM matches;
The output of the above code is three columns id, name and home_scores which are stored inside the home_analysis table.
We used the CASE statement to filter the results of the home_scores column.
You can also use a CASE statement to aggregate data based on the results of a WHERE clause, since aggregate functions cannot be used directly in a WHERE clause.
Let's say you have a table called "employees" with columns "name", "salary", and "department". You want to create a report that shows the total salary for each department, and categorize each department as either "High Paying" or "Low Paying" based on the total salary. The results should include only employees hired on or after January 1, 2023
SELECT department,
SUM(salary) as total_salary
CASE WHEN SUM(salary) >= 1000000 THEN 'High paying dpt'
ELSE 'Low paying dpt'
END AS dpt_salary
FROM employees
WHERE hire_date >= '2023-01-01'
GROUP BY department;
The sum of the salaries will only be placed in any category if they satisfy the where clause first.
SUBQUERIES
They are queries found inside other queries hence also known as nested queries.The subquery is executed first, and the results are then used in the main query. This allows you to perform more complex queries that involve multiple tables and conditions.
The syntax is as shown:
SELECT column
FROM (SELECT column
FROM table) AS subquery
A subquery can be placed at any part of your query, such as the SELECT, FROM, WHERE or GROUP BY
The allow you to compare summarized values to detailed data and also reshape your data as desired.
A subquery is also used to combine data that cannot be joined.
A subquery in the where clause
It is used to filter data based on the results of another table.For example, if you have a table of customers and a table of orders, you can use a subquery to find all customers who have made at least one order.
SELECT *
FROM customers
WHERE id IN (SELECT c_id FROM orders);
A subquery in the FROM clause
They are used to restructure and transform the data to be selected. Ensure you alias the subquery for easy reference.
Lets use the customer and order table for reference and see the orders for each customer in the 2022/2023 season.
SELECT name,
order
FROM (SELECT c.name AS name
o.id AS order
FROM customers AS c
INNER JOIN orders AS o
ON c.order = o.id)
AS customer_order
WHERE season = '2022/2023';
We can also create more than one subquery inside a from statement.
SELECT customers.customer_id,
customers.first_name, '
customers.last_name,
orders.total
FROM (
SELECT customer_id,
SUM(price * quantity) AS total
FROM order_items
GROUP BY customer_id
) AS orders
INNER JOIN customers
ON orders.customer_id = customers.customer_id
WHERE orders.total > (
SELECT AVG(total)
FROM (
SELECT customer_id,
SUM(price * quantity) AS total
FROM order_items
GROUP BY customer_id
) AS order_totals
)
In this example, we're using two subqueries in the FROM clause. The first subquery calculates the total amount spent by each customer on all of their orders. The second subquery calculates the average total amount spent by all customers.
The main query then joins the orders subquery with the customers table to retrieve the customer's ID, first name, and last name, as well as their total amount spent. Finally, the WHERE clause filters the results to only include customers whose total amount spent is greater than the average total amount spent by all customers.
Subquries in the select statement
They can be used to perform calculations based on data from another table. For example, if you have a table of orders and a table of products, you can use a subquery to calculate the total price of each order.
SELECT id,
(SELECT SUM(price * quantity)
FROM products
WHERE order.id = products.order_id) AS total_price
FROM orders;
It can also be used to retrieve information from a table as we saw above we got the total_price of our commodities from products table.
The big question however is, when is it advisable to use subqueries considering that each subquery requires additional computing power. There are certain factors to consider when selecting a subquery:
1.Data complexity
Subqueries are useful when working with complex data queries that require multiple tables and conditions. If the query involves multiple subqueries or multiple nested subqueries, it can become very difficult to read and maintain. In this case, it may be better to break the query into smaller parts or consider using a different approach.
2.Data volume
Subqueries can be slow to execute when dealing with large datasets. If the query involves a large amount of data, it may be better to use a join or a different type of query.
3.Performance
The performance of a subquery can depend on the database engine being used. Some database engines can optimize subqueries for better performance. It's important to consider the performance implications when deciding whether to use a subquery.
4.Maintainability
Subqueries can make queries more difficult to read and maintain. It's important to consider whether the use of a subquery will make the query more or less maintainable in the long run.
5.Query reuse
If the same subquery is going to be used multiple times in different parts of a query or across multiple queries, it may be more efficient to create a view or a temporary table instead of using a subquery.
Corelated Subqueries
It is a type of subquery which uses the values from the outer query to generate results, by referencing one or more columns in the main query. It is always re-run for every new row generated.
SELECT *
FROM employees AS e
WHERE e.salary > (
SELECT AVG(salary)
FROM employees
WHERE dpt_id = e.dpt_id
)
It is also very possible to have a nested subquery. Lets select an employee's name and the average salary he received in the first quarter of the year.
SELECT
e.name AS employee,
(SELECT AVG(jan_pay + feb_pay + mar_pay),
FROM salary AS s
WHERE s.empl_id = e.id
AND id IN (
SELECT id
FROM salary
WHERE year == 2023
) AS avg_salary
FROM employees AS e
GROUP BY employee;
The subquery in the SELECT clause is dependent on the equality of the salary and the employees id, which is also dependent on the year column in the salary table.
Common Table Expressions(CTEs)
Subquries are a good way of bringing multiple unrelated tables together and also simplifying calculations. However, having multiple nested subqueries can be hard to read and understand as discussed above. That is why a solution to this can be using CTEs.
CTEs are declared ahead of the main query using a WITH statement and referenced later in the FROM statement.
It can be referenced within a SELECT, INSERT, UPDATE, or DELETE statement in SQL.
WITH s AS (
SELECT emp_id, id,
AVG(jan_pay + feb_pay + mar_pay) AS q1
FROM salary
WHERE year == 2023
GROUP BY id
)
SELECT e.name AS employee,
s.q1
FROM employees AS e
INNER JOIN s
ON e.id = s.emp_id
CTEs are better than subqueries because they are run only once and then stored in memory, hence reducing the amount of time required to run the query.
You can also have multiple CTEs, separated by a comma but make sure you join them in their respective order.
WITH max_salary AS (
SELECT department_id, MAX(salary) AS max_salary
FROM salary
GROUP BY department_id
),
avg_salary AS (
SELECT department_id, AVG(salary) AS avg_salary
FROM salary
GROUP BY department_id
)
SELECT employees.*, max_salary.max_salary, avg_salary.avg_salary
FROM employees
JOIN max_salary ON employees.department_id = max_salary.department_id
JOIN avg_salary ON employees.department_id = avg_salary.department_id
WHERE employees.salary = max_salary.max_salary
Window Functions
Perform a set of operations that are somehow related to the current row, similar to group by but rows are not groupd into a single row.
They are called "window" functions because they operate on a "window" of rows, defined by an OVER clause, rather than on individual rows.
Some use cases of window functions include:
- Fetching values before and after the current row.
- Ranking row values.
- Calculating averages See the example below:
SELECT transaction_id,
sales_amount,
SUM(sales_amount) OVER() AS total_sales
FROM sales_transactions
This sums the sales_amount per amount and displays it as total_sales.The first row will contain its own value in the total sales column, the second row will contain a sum of the first and second row and so on.
At any given point we might want to give each of our rows a value in order to easily access it. This can be done using the
ROW_NUMBER() function.
SELECT transaction_id,
sales_amount,
ROW_NUMBER() OVER() AS row_n
FROM sales_transactions
The output of the above query will contain an extra row called row_n with row numbers from 1 and so on.
It is possible that we want the values in the row_n columns to be ordered in a certain way. We can thus use the ORDER BY clause in OVER() function as shown:
SELECT transaction_id,
sales_amount,
ROW_NUMBER() OVER(ORDER BY sales_amount DESC) AS row_n
FROM sales_transactions
The row_n values will now be sorted starting from the one with the highest sales order to the lowest.
We can generate ranks instead of ordering as well using the RANK() window function.
SELECT transaction_id,
sales_amount,
SUM(sales_amount) RANK() OVER() AS total_sales
FROM sales_transactions
Note that window functions are produced after other queries have been processed. It uses the generated table to find its results.
As stated earlier we can also use window functions to compare the current row and the one before it. This is made possible using the LAG() function. Example:
Assuming we have a games table with year, champion,medals,match and other columns in it. We can find out which champion won recurrently.
SELECT
year,match, champion
LAG(champion) OVER(
ORDER BY year ASC) AS last_champion
FROM games
ORDER BY year ASC;
We can now compare the champion and last_champion columns and find recurrent champions.
The above query will produce the values for last_champion regardless of weather the matches are similar or not. For example if we had matches for hockey and rollball within the same year it will not provide the last_champion for each separate match but just generalize them.
To solve this , we use a PARTITION BY clause as shown below.
SELECT
year,match, champion
LAG(champion) OVER(
PARTITION BY match
ORDER BY year ASC) AS last_champion
FROM games
ORDER BY year ASC;
Window partitions also group the result set of a row into a smaller group so that window functions can be applied to those subsets separately.
Lets look at another example:
SELECT department_id,
employee_id,
salary,
AVG(salary) OVER (PARTITION BY department_id
) AS avg_salary
FROM employees
The results will be first partitioned then average salary calculated.
Sliding Windows
They enable us to perform calculations relative to the current window of the dataset.
The window is defined by a rage of rows that "slide " or move through the partition based on a specified ordering.
The syntax is as shown:
ROWS BETWEEN <start> AND <finish>
The start and finish can be replaced with PRECEDING(comes before), FOLLOWING(comes after), UNBOUNDED PRECEDING(all rows in the partition up to and including the current row), UNBOUNDED FOLLOWING, CURRENT ROW
SELECT date,
jan_pay,
SUM (jan_pay)
OVER(ORDER BY date ROWS BETWEEN
UNBOUNDED PRECEDING AND CURRENT ROW)
AS jan_total
FROM salary
WHERE year = 2023;
Fetching
They include functions that enable us to get certain values from different parts of the table or partition into one row.
The LAG() function discussed above is also one of them.
The LEAD() function is used return the values after the current row. Lets use the LAG() example shown above.
SELECT
year,match, champion
LEAD(champion) OVER(
ORDER BY year ASC) AS following_champion
FROM games
ORDER BY year ASC;
The above query compares the current champion and the champion after him.
FIRST_VALUE returns the first value in the table or partition.
SELECT
year,match, champion
FIRST_VALUE(champion) OVER(
ORDER BY year ASC) AS first
FROM games
ORDER BY year ASC;
LAST_VALUE returns the last value in the table or partition.
SELECT
year,match, champion
LAST_VALUE(champion) OVER(
ORDER BY year ASC) AS last
FROM games
ORDER BY year ASC;
NOTE
A sliding window can also be defined as a frame. Without frame the LAST_VALUE() function would return the current row. Efficient example of the LAST_VALUE() would be:
SELECT
year,match, champion
LAST_VALUE(champion) OVER(
ORDER BY year ASC
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW)) AS last
FROM games
ORDER BY year ASC;
There are two types of frames, RANGE BETWEEN and ROWS BETWEEN. RANGE BETWEEN however lists duplicate values when used with the OVER's ORDER BY subclause as a single entity. Hence ROWS BETTWEEN is mostly used over RANGE BETWEEN.
Ranking
It involves arranging the data in a partition or table in a particular order.
The ranking functions include:
ROW_NUMBER(): It assigns each row a unique number even if the row values are the same. Check the window functions example for more clarification.
RANK(): It orders the values of a table or partition giving rows with identical values the same number and skipping one number over to the next value.
DENSE_RANK(): It ranks values in a particular table or partition giving the same number for identical values but does not skip a number to the next value, hence mostly preferred.
SELECT transaction_id,
sales_amount,
SUM(sales_amount) DENSE_RANK() OVER() AS rank_sales
FROM sales_transactions
This will rank the total sales giving identical values the same rank then move to the next total_sales value without skipping a number.
transaction_id | sales_amount | rank_sales |
---|---|---|
iutio | 64 | 1 |
ghjkl | 48 | 2 |
frtyu | 48 | 2 |
lkjhg | 36 | 3 |
Paging
This includes splitting data into approximately equal chunks which enables us to easily analyze data piece by piece.
Paging in sql is done using the NTILE(n) which splits data into approximately n equal pages.
SELECT transaction_id,
sales_amount,
NTILE(15) OVER() AS page1
FROM sales_transactions
In the above query the rows are divided into 15 buckets when called.
Pivoting
It makes it easier to understand and analyze data by converting columns into rows. It is very valuable when preparing data for visualization.
CREATE EXTENSION IF NOT EXISTS tablefunc;
SELECT * FROM CROSSTAB ( $$
SELECT country,
year,
COUNT(*) :: INTEGER AS awards
FROM medals
WHERE year IN (2022, 2023)
GROUP BY country, year
ORDER BY country ASC;
$$) AS ct ( country VARCHAR,
"2022" INTEGER,
"2023" INTEGER)
ORDER BY country ASC;
The above query converts the country and year column in the medal table into rows and splits the year into 2022 and 2023.
Each row is also given their respective datatype.
Conclusion
In conclusion, this article has covered several intermediate SQL topics that can help you gain a deeper understanding of SQL and data manipulation. By mastering these concepts, you will be able to write more complex SQL queries and work with large datasets more efficiently. I hope you found this article helpful in your SQL journey and that it has provided you with valuable insights and knowledge. Keep practicing and exploring the world of SQL to become a proficient data analyst or developer. Stay curious and stay golden!
Top comments (0)