maggy njuguna

Posted on Mar 9

ADVANCED SQL FOR DATA ANALYTICS:ADVANCED TECHNIQUES EVERY DATA ANALYST SHOULD KNOW.

#sql #datascience #beginners #learning

Introduction
SQL(Structured Query Language) is an important tool in data analysis and is used to retrieve data ,manipulate and analyse stored data.Basic SQL techniques are important in data analysis however advanced techniques enhance analysis of complex data sets.The advanced techniques in SQL include;window functions,subqueries,ctes,stored procedures and indexes.

REAL WORLD PROBLEMS IN DATA

These are the products encountered by data analysts while trying to retrieve, manage and make decisions using data.Tools like SQL come in handy to solve these problems. Main data problems include;

Data quality and cleaning -data analysis involves using large data sets which are often messy,inconsistent and missing values and require cleaning and standardisation.Poor quality data can give inconsistent and unreliable results.
Data integration -Most data is stored in different tables and requires merging which if not done correctly may lead to inconsistencies.
Large volumes of data -organizations deal with large data sets which sometimes slow down query execution.
Data security - organizations deal with sensitive information which if not handled properly may lead to data violation and legal issues.

WINDOW FUNCTIONS IN SQL

Window functions perform calculations on related rows and keeps each row in the result unlike GROUP BY which groups the rows together.
A window function allows viewing of a group of rows.
The OVER() Clause is used in window functions.
Examples of window functions

ROW_NUMBER()
RANK()
DENSE_RANK
LEAD AND LAG
NTILE
RUNNING TOTAL
ROW_NUMBER.
It assigns a unique number to each row based on a condition.
In the case of 2 rows having similar values, this window function issues different numbers to the given rows.

RANK. This window function issues a ranking number to each row based on the given conditions. If two rows share the same value, they are assigned with the same value and the next number is skipped.

From this query rows with ties are issued with the same numbers and the next value is skipped.

DENSE_RANK Dense_rank is almost similar to rank and is used to issue ranks to rows but in relation to ties dense-rank does not leave gaps.The next number is issued after assigning a number to ties.

Example

Dense_rank does not skip the next number after handling ties.

LEAD AND LAG

LEAD -is used to compare the current rows with the next row.Like comparing the total quantity of products purchased by a customer with the customer's next order.If there is no next value it returns NULL.

Example

select order_id,order_date ,lead(order_date)OVER(order by order_date )as next_date, order_date-lead(order_date)OVER(order by order_date ) from orders;

LAG - is used to compare the current rows with the previous rows.Can be used to compare a customer's current order with the previous order.If there is no previous order it returns NULL.

select title,price,lag(price)over(order by price asc)as previous_price, price - lag(price)over(order by price asc)as price_diff from books;

NTILE It is used to divide the results set into a specified number of roughly equal parts.Example in dividing orders into 4 parts based on quantity.

Example

RUNNING TOTAL. This is a cumulative sum that adds the current value to previous values based on a specified order.

POINT TO REMEMBER

PARTITION BY. It is a part of the window function that divides data into groups before calculations. It divides the results in groups just like GROUP BY but does not remove rows.

HOW WINDOW FUNCTIONS PLAY A ROLE IN REAL WORLD DATA PROBLEMS

Improved query performance- data analysis involves using large data sets and window functions come in handy to ensure fast execution instead of using joins or subqueries.
Preserve data integrity -Window functions allow data analysts to view individual results and aggregated results.

Real word scenarios on the use of window functions.

Ranking products or customers When performing sales analysis for businesses a data analyst can use the windows function(Rank,Dense_rank) to give ranks to products or customers based on their performance.This helps the business identify the performance of products and decide on which products or customers need more marketing.
Calculating revenue growth
Using the running total a business is able to calculate cumulative revenue over time.
Identifying trends and performance.
Using the window functions businesses can compare the performance of their products, customers over time and guide them in making strategic decisions.
Compare current values with previous or next values.
Using lead and lag a company can perform trends in sales by doing comparisons on either the previous or next sales.

Subqueries
A subquery is a query written inside another query (Nested).Subqueries can appearin the following instances;

In the select clause.
In the Where clause.
In the From clause.

In the select clause
Subqueries in the select clause add an extra computed column to the results.
Example
select name,(select count(*) from projects p where p.employee_id = e.employee_id )as total_projects from employees_2 e;
In the where clause
Subqueries in the where clause allow the main query to filter results based on another's query output.
Example
select d.department_name ,e.salary from employees_2 e join departments d on e.department_id = d.department_id where salary =(select max(salary)from employees_2 where e.department_id =d.department_id);
In the From clause
A subquery in the from clause creates a temporary table from which data can be selected or filtered.
Example
select department_name,total_employees from(select department_id ,count(*)as total_employees from employees_2 group by department_id) as department_summary join departments d on d.department_id = department_summary.department_id;

Reasons for using subqueries

They allow filtering data in the main query based on results of an inner query.
Can be used in from clause for aggregation before proceeding with further analysis.

How subqueries solve real world data problems

Aggregate data comparison Using subqueries a data analyst can calculate the average performance of products and customers while the main query can identify products that are below or above that average.
Filtering data from complex conditions Subqueries can be used to perform calculations that can be filtered in other SQL queries.

Common Table Expressions(CTES)
CTES involves creating a temporary results set that only exists during the execution of the query.
CTES breakdown complex queries into simpler more manageable and readable parts.
CTES act like a virtual table which only exists during the execution of the main query.
Example
Listing employees earning more than 45000 using a cte

with high_earners as (select name,salary
from employees_2
where salary
>45000)
select * from high_earners;

CTES can be used to simplify joins.

Example
with total_orders as (select customer_id,sum(quantity) as total_quantity from orders group by customer_id) select first_name,last_name ,t.total_quantity from total_orders t join clients c on t.customer_id = c.customer_id;
Multiple CTES -More than one CTE can be used and they are separated by commas.

Example
with department_max as (select department_id,max(salary)as max_salary from employees_2 group by department_id), top_earners as (select name,employee_id,e.department_id,salary from employees_2 e join department_max m on e.department_id =m.department_id and e.salary = m.max_salary) select * from top_earners;

How CTES solve real world data problems

Temporary nature-CTES results are not stored permanently in database, they disappear after a query is executed hence they don't need maintenance.
Improving readability -CTES allow breaking down of queries into simple named steps instead of using nested unreadable queries.
Reusability -CTES can be referenced several times hence no need of writing the same subquery multiple times.
Performance optimization -complex nested queries can be turned into simple steps using CTES hence fast execution.

Key differences between subqueries and CTES.

CTES come before the main query while subqueries are inside the main query.
CTES are easy to read however subqueries can be long and messy.
CTES can be referenced several times while subqueries run once and cannot be reused.
CTES are used when the query is complex and repititive while subqueries are used when you need quick nested results.

Stored procedures
Stored procedure is a block of SQL logic stored inside the database.
Actions that can be executed using CTES include

Inserting data into tables.
Updating data in tables.
Delete data from tables.

Example
This is a procedure that inserts data into the clients table.

create or replace procedure add_clients( p_first_name varchar, p_last_name varchar, p_email varchar, p_contact varchar, p_city varchar )

language plpgsql
as $$
begin
insert into
clients(first_name,last_name,email,contact,city)
values (p_first_name, p_last_name,p_email,p_contact,p_city );
end;
$$;

How stored procedures solve real world data problems

Performance optimization -Using a call statement is faster than executing many lines of code.Because stored procedures are precompiled the database skips the compling steps hence faster execution.
Stored procedures enhance data security-Data base administrators can allow users to execute the stored procedures instead of allowing them to directly use the SELECT or UPDATE procedures hence protecting the data from any malicious acts.
Code reusability-Complex data procedures are written once and can be used across multiple executions.
Real world use case scenario.
- Banking -Transactions such as updating account balances,checking loan limits .
- Auditing - using stored procedures it is possible to track changes in your data. -Human resource department can use stored procedures to update staff records.

Indexes
Data analysts encounter complex and large data sets in their day to day jobs and querying this data sets can be slow and cumbersome. Indexes act as a road map for databases and allow the data base management system to quickly locate specific data rows without scanning the entire database .This ensures fast execution of queries.
create index idx_employees_2_name on employees_2(name);
Partial indexes indexes rows that meet a particular condition.
Indexes can also be used on joins.
create index idx_emp_dept on employees_2(department_id);
select e.name,d.department_name from employees_2 e inner join departments d on e.department_id = d.department_id where d.department_name ='Engineering';
How indexes solve data problems

Increases the speed of query execution- A search taking minutes can be reduced to seconds with indexes since the database only scans the specified rows instead of scanning the entire database.
Efficient sorting and grouping. Indexes store data in an sorted order which makes the ORDER BY,GROUP BY execution faster.
Speed up join operations -Indexes speed up the process of merging rows.

CONCLUSION
Data analysts face many data problems that hinder they ability to effectively analyse data .Some common data problems include poor quality data,data security and bias ,dealing with large data sets.Complex SQL techniques enable data analysts to overcome this data problems and clearly analyse data.Indexing increases query optimization and reduces query execution time.Advanced SQL techniques enable data analysts to clean and transform data for effective analysis.It is important for every data analyst to master the advanced SQL techniques.

DEV Community

ADVANCED SQL FOR DATA ANALYTICS:ADVANCED TECHNIQUES EVERY DATA ANALYST SHOULD KNOW.

REAL WORLD PROBLEMS IN DATA

WINDOW FUNCTIONS IN SQL

Top comments (0)