Helen Anderson

Posted on Feb 6, 2019 • Updated on Sep 3, 2021 • Originally published at helenanderson.co.nz

SQL concepts from A to Z

#data #database #sql #beginners

It's time for some jargon-busting.

A big part of my role is to onboard and support junior data analysts. A lot of them have just started using SQL, have come from the world of Excel analysis and have self-taught the SQL basics.

Here are some of those terms and concepts that pop up during training if you need a refresher or have a junior analyst in your life that needs something to refer back to.

Alias
Begin Transaction
CTEs v Subqueries
Design
ETL
Function
Group By
Heaped Storage
Integrity
Join
Key
Lock
Massively Parallel Processing
Normalisation
OLTP v OLAP
Privileges
Query Plan
Recovery
System Tables
Truncate v Drop
Union
View
Window Function
XML
Year
Zero

Alias

When joining tables, we need to state which column from which table we want to match up, and which columns we want to return in the results. If there are columns with the same name we need to be specific about which column we want to return.

select
  orders.item, 
  inventory.item,
  inventory.unitprice
from 
  orders 
inner join 
  inventory 
on orders.order_item = inventory.inventory_item

To make it quicker to type we can alias the two tables with something shorter.

select
  o.item, 
  i.item,
  i.unitprice
from 
  orders o
inner join 
  inventory i 
on o.order_item = i.inventory_item

Instead of having to type out the whole table name each time we want a new column added, we can alias them with the letter 'o' for orders and 'i' for inventory.

Read more about JOINs and aliasing in this beginner-friendly post:

SQL Joins without the Venn diagrams

Helen Anderson ・ Apr 29 '20

#sql #beginners #data #database

Begin Transaction

SQL Transactions are used to trap errors when making changes to tables. Without an explicit transaction, each UPDATE or DELETE statement is auto-committed as soon as it runs.

By wrapping the statement in a transaction we have the opportunity to 'roll back' or 'commit' when we are sure that it should be executed, or if a condition has been met.

The following transaction will run in a block and commit if successful.

begin transaction

update orders
set status = 'sent'
where order_id = '12345'

update orders
set status = 'sent'
where order_id = '54321'

commit transaction
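To see the 'roll back' side as well, here is a minimal sketch using Python's built-in sqlite3 module. The mark_sent helper and the sample rows are made up for illustration, and SQLite's syntax may differ slightly from SQL Server's. The condition here is that every update must touch exactly one row, otherwise nothing is kept.

```python
import sqlite3

def mark_sent(conn, order_ids):
    """Mark every order as sent, or none of them if any id is unknown."""
    conn.execute("begin transaction")
    for order_id in order_ids:
        cur = conn.execute(
            "update orders set status = 'sent' where order_id = ?", (order_id,)
        )
        if cur.rowcount != 1:  # condition not met: undo everything so far
            conn.execute("rollback")
            return False
    conn.execute("commit")
    return True

# isolation_level=None lets us issue begin / commit / rollback ourselves
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("create table orders (order_id text primary key, status text)")
conn.execute("insert into orders values ('12345', 'new'), ('54321', 'new')")

check = "select order_id, status from orders order by order_id"

ok_first = mark_sent(conn, ["12345", "99999"])  # '99999' does not exist
statuses_after_rollback = conn.execute(check).fetchall()

ok_second = mark_sent(conn, ["12345", "54321"])  # both ids exist
statuses_after_commit = conn.execute(check).fetchall()

print(ok_first, statuses_after_rollback)
print(ok_second, statuses_after_commit)
```

The first call rolls back, so order 12345 stays 'new' even though its update ran; the second call commits both changes together.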

Database Transactions Like You're Five

John Dougherty ・ Sep 9 '18

#database #explainlikeimfive

CTEs v Subqueries

A CTE (Common Table Expression) is a temporary, named result set we can come back to within the scope of a single SELECT, INSERT, UPDATE, DELETE or MERGE statement.

I use them when dealing with large tables. For example: get all the columns I need from the 'emailsent' table, then get all the columns I need from the 'emailunsubscribe' table. Then, in a final step, join them together.

; -- start the CTE with a semicolon to terminate anything above

with sent as -- here is where you name the dataset

(select 
  emailaddress,
  emailid, 
  senddate
from
  marketing.emailsent
where
  senddate between '2018-01-01' and '2018-01-31'
),  -- add a comma if you need to add a subsequent CTE

unsubs as

(select 
  emailaddress,
  emailid, 
  senddate
from
  marketing.emailunsubscribe
where
  senddate between '2018-01-01' and '2018-01-31'
) -- no comma for the last CTE

select
  'January' as [monthdelivered],
  c.country, 
  count(distinct sent.emailaddress) as [countofdelivered], 
  count(distinct unsubs.emailaddress) as [countofunsubd]
from sent
left join 
  marketing.customers c on sent.emailaddress = c.emailaddress -- assumes customers has an emailaddress column
left join  
  unsubs on sent.emailaddress = unsubs.emailaddress 
  and sent.emailid = unsubs.emailid
group by
  c.country -- needed because country sits alongside the counts
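For comparison with a subquery, here is a small sketch that runs the same filter both ways. It uses Python's built-in sqlite3 module and a made-up emailsent table with three rows, so the names and data are assumptions rather than the real marketing schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table emailsent (emailaddress text, emailid integer, senddate text);
insert into emailsent values
  ('a@example.com', 1, '2018-01-05'),
  ('b@example.com', 1, '2018-01-06'),
  ('a@example.com', 2, '2018-02-10');
""")

# CTE: the intermediate result is named up front, then used below
cte_query = """
with sent as (
  select emailaddress, emailid
  from emailsent
  where senddate between '2018-01-01' and '2018-01-31'
)
select count(distinct emailaddress) from sent
"""

# Subquery: the same intermediate result is nested inside the FROM clause
subquery_query = """
select count(distinct emailaddress)
from (
  select emailaddress, emailid
  from emailsent
  where senddate between '2018-01-01' and '2018-01-31'
) as sent
"""

cte_result = conn.execute(cte_query).fetchone()[0]
subquery_result = conn.execute(subquery_query).fetchone()[0]
print(cte_result, subquery_result)  # 2 2
```

Both forms return the same answer. The CTE mostly helps readability once there are several steps to name and reuse.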

Chidiebere has written an excellent series on CTEs and how they compare with the subquery.

Whats WITH CTE

Chidiebere Ogujeiofor ・ Dec 13 '19

#postgres #database #sql

Design

Data mart tables are organised in one of two forms, a ‘Star’ schema or a ‘Snowflake’ schema, each made up of two types of tables.

Facts - that count how many times something has happened.
Dimensions - (or Dims) that describe an attribute.

In the Star model, we can have a sales table as our Fact in the centre. Dimension tables for the store, product and location surround the Fact like a star.

*Attribution: SqlPac at English Wikipedia*

The Snowflake is similar but takes the Dimensions one step further. Instead of just a location table, we may have a city, country and even a postcode table. All the Dimensions become the points on the snowflake.

*Attribution: SqlPac at English Wikipedia*
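As a rough sketch of the Star model, the example below builds one Fact table and two Dimension tables in SQLite through Python's sqlite3 module. The table names, columns and rows (fact_sales, dim_store, dim_product) are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions describe attributes
create table dim_store   (store_id integer primary key, store_name text);
create table dim_product (product_id integer primary key, product_name text);

-- The Fact sits in the centre and points at each Dimension
create table fact_sales (
  sale_id    integer primary key,
  store_id   integer references dim_store(store_id),
  product_id integer references dim_product(product_id),
  quantity   integer,
  amount     real
);

insert into dim_store values (1, 'Wellington'), (2, 'Auckland');
insert into dim_product values (10, 'Widget'), (20, 'Gadget');
insert into fact_sales values
  (1, 1, 10, 2, 20.0),
  (2, 1, 20, 1, 15.0),
  (3, 2, 10, 5, 50.0);
""")

# Reporting joins the Fact out to whichever Dimension we want to slice by
sales_by_store = conn.execute("""
select s.store_name, sum(f.amount) as total_amount
from fact_sales f
join dim_store s on f.store_id = s.store_id
group by s.store_name
order by s.store_name
""").fetchall()

print(sales_by_store)  # [('Auckland', 50.0), ('Wellington', 35.0)]
```

A Snowflake version would split a Dimension further, for example moving the city out of dim_store into its own table.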

Read more about the advantages and disadvantages of both:

Star Schema vs Snowflake Schema and Why You Should Care

pedrojmfidalgopt ・ Dec 19 '17

#database #data

ETL

ETL and ELT are the steps involved in moving data from a source system to a destination system.

Extract - in the Extract step the raw data is moved from source to a temporary or staging area.
Transform - the Transform step converts the data so it matches the destination table.
Load - the Load step moves the data into its final destination so it can be used in analysis or reporting.

ETL is the order that these steps are traditionally performed in. ETL is great for putting data into the right format, stripping out unnecessary columns, and masking fields relating to GDPR compliance.

However, ELT has become a more popular approach when used in conjunction with Data Lake architecture. The data arrives quickly as it does not have to be altered in any way. The Data Scientist can then use just the data they need, quickly get results, and not have to deal with delays if a transformation step fails.

Considerations need to be made around how reliable the data is in its raw form. Each Data Scientist or end-user will need to apply the same logic and business rules when conducting analysis to keep results consistent.
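The three ETL steps can be sketched in a few lines of Python. This is a toy pipeline, not a production pattern: the CSV content, the column names and the mask_email rule are all assumptions made up for the example.

```python
import csv
import io
import sqlite3

# Pretend this text came from a source system export
raw_csv = """customer_email,order_item,amount_due,internal_notes
Ann@Example.com,widget,10.50,call back
bob@example.com,gadget,5.00,
"""

def extract(text):
    """Extract: read the raw rows into a staging list."""
    return list(csv.DictReader(io.StringIO(text)))

def mask_email(email):
    """Keep the first letter and the domain, hide the rest."""
    local, _, domain = email.lower().partition("@")
    return local[0] + "***@" + domain

def transform(rows):
    """Transform: mask personal data, fix types, drop the unneeded column."""
    return [
        (mask_email(row["customer_email"]), row["order_item"], float(row["amount_due"]))
        for row in rows
    ]

def load(conn, rows):
    """Load: write the cleaned rows into the destination table."""
    conn.executemany("insert into orders values (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
conn.execute("create table orders (customer_email text, order_item text, amount_due real)")
load(conn, transform(extract(raw_csv)))

loaded = conn.execute("select * from orders order by order_item desc").fetchall()
print(loaded)
```

In an ELT setup the load step would run straight after extract, and the masking and type fixes would happen later, inside the destination.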

For more on ETL, ELT and data pipelines check out this post from SeattleDataGuy who writes excellent posts on all things data.

Data Engineering 101: From Batch Processing To Streaming

SeattleDataGuy ・ Mar 4 '20

#database #beginners #aws

Function

In PostgreSQL, we can save blocks of code, called Functions, in the database and execute them on demand or on a schedule. They can be written like the statements we run ad hoc on the database, or can be passed variables to make them dynamic.

Read more about how to write and execute functions:

A PRIMER ON POSTGRESQL STORED FUNCTIONS (PL/pgSQL)

Samuyi ・ Jan 7 '19

#postgres #linux #sql #database

Group By

Aggregate functions allow us to perform calculations on fields. The most common ones are SUM, COUNT, MIN, MAX and AVG.

For example, to see the total amount due for each item in the orders table we can use the SUM of the amount_due column and GROUP BY the order_item column.

select 
  order_item, 
  sum(amount_due) 
from orders
group by order_item;

SQL: ROLLUP Like A Boss

Nathan Griffiths ・ Sep 15 '19

#sql #database #cube #dataanalysis

Heaped Storage

Heaped Storage is a term for tables that live on the database with no clustered index. The data is stored in no particular order and new data simply gets added as it comes in.

Indexes are a way of telling the database to order the data or where to look to find the data you query often.

Clustered Indexes are like the contents page of a book. Applying this kind of index is telling the data how it should be ordered, like the pages in a book.

Non-clustered Indexes are like the index of a book, the pages haven't been arranged that way physically, but you now have a lookup to get to what you need faster.
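Here is a small sketch of the lookup effect, using SQLite through Python's sqlite3 module. SQLite stores tables differently from SQL Server, so this only illustrates the non-clustered case: the same query goes from scanning the whole table to searching an index. The table, index name and data are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table orders (order_id integer, order_item text, amount_due real)")
conn.executemany(
    "insert into orders values (?, ?, ?)",
    [(i, f"item{i % 50}", i * 1.0) for i in range(1000)],
)

def plan(sql):
    # Each plan row is (id, parent, notused, detail); keep the readable detail text
    return [row[3] for row in conn.execute("explain query plan " + sql)]

query = "select * from orders where order_item = 'item7'"

plan_before = plan(query)  # no index yet, so every row has to be read
conn.execute("create index idx_orders_item on orders (order_item)")
plan_after = plan(query)   # the lookup can now go through the index

print(plan_before)
print(plan_after)
```

The exact wording of the plan text varies between SQLite versions, but the first plan should mention a SCAN and the second a SEARCH using idx_orders_item.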

Read more in this beginner-friendly post:

We’re not all DBAs: Indexes For Developers

Matthew Gale ・ Oct 24 '19

#sql #index #backend #performance

Integrity

This refers to data quality and rules ensuring data is traceable, searchable and recoverable.

Entity Integrity - each table must have a unique primary key
Referential Integrity - each foreign key on a table refers to a primary key on another table or is NULL
Domain Integrity - each column has a specific data type and length.
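The three rules above can be sketched with table constraints. The example uses SQLite via Python's sqlite3 module, with invented customers and orders tables. Two SQLite quirks are assumptions of this sketch: foreign keys are only enforced after the pragma is switched on, and declared types and lengths are not strictly enforced, so a CHECK constraint stands in for domain integrity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")  # SQLite only enforces foreign keys when asked
conn.executescript("""
create table customers (
  customer_id integer primary key,                          -- entity integrity
  email       varchar(255) not null
);
create table orders (
  order_id    integer primary key,
  customer_id integer references customers(customer_id),   -- referential integrity
  quantity    integer not null check (quantity > 0)        -- domain integrity (via a check)
);
insert into customers values (1, 'ann@example.com');
""")

def try_insert(sql):
    """Run an insert and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"

results = [
    try_insert("insert into orders values (1, 1, 3)"),     # valid row
    try_insert("insert into orders values (1, 1, 3)"),     # duplicate primary key
    try_insert("insert into orders values (2, 99, 3)"),    # customer 99 does not exist
    try_insert("insert into orders values (3, null, 3)"),  # NULL foreign key is allowed
    try_insert("insert into orders values (4, 1, 0)"),     # quantity fails the check
]

for line in results:
    print(line)
```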

Read more about Integrity and Database design:

Effective Database Design: Part 1

Adam McNeilly ・ Dec 3 '18

#database

Join

Because our database contains tables which are normalised you may not find all the data you need on one single table. To put the data back together in a way that makes the most sense for us we use JOINs. This adds columns from multiple tables into one dataset.

Use an INNER JOIN, shortened to 'JOIN', when you want to find the match between two tables. You need to have a column on both tables that you join ON, and that's where the match happens. Any results where there is not a match are discarded.

Use a LEFT JOIN when you want to find a match between two tables, but also show a NULL where there is no match from the right table. A RIGHT JOIN does the same but in reverse.

Like the INNER JOIN, you need a column to join ON. Unlike the INNER JOIN, a NULL is used to show there is no match between the two tables.
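A quick sketch of the difference, run against SQLite with Python's sqlite3 module. The orders and inventory rows are invented, and 'gizmo' deliberately has no match in inventory.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table orders (order_id integer, order_item text);
create table inventory (inventory_item text, unitprice real);
insert into orders values (1, 'widget'), (2, 'gadget'), (3, 'gizmo');
insert into inventory values ('widget', 2.5), ('gadget', 4.0);
""")

# INNER JOIN: order 3 has no match in inventory, so it is discarded
inner_rows = conn.execute("""
select o.order_id, i.unitprice
from orders o
inner join inventory i on o.order_item = i.inventory_item
order by o.order_id
""").fetchall()

# LEFT JOIN: order 3 is kept, with NULL (None in Python) for the missing price
left_rows = conn.execute("""
select o.order_id, i.unitprice
from orders o
left join inventory i on o.order_item = i.inventory_item
order by o.order_id
""").fetchall()

print(inner_rows)  # [(1, 2.5), (2, 4.0)]
print(left_rows)   # [(1, 2.5), (2, 4.0), (3, None)]
```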

Katie has written a great post on 'Every JOIN you will ever need' with a focus on Oracle syntax.

Every SQL Join You’ll Ever Need

Katie ・ Nov 8 '18

#database #sql #tutorial #beginners

Key

A primary key is a column that identifies each record as unique, like an ID.

It ensures that there are no duplicates
It cannot be unknown (NULL)
There can only be one primary key per table

A foreign key is a column that matches a primary key in another table and enforces integrity between the two.

To create a primary key in SQL Server add the reserved words 'primary key' after the data type of your chosen column.

create table students (
  id int not null primary key,
  firstname varchar(255) not null,
  lastname varchar(255) not null
);

Lenique puts this into practice with this post on relational model design.

DBMS For Application Development: Relational Model & Relational Database

Lenique Noralez ・ Apr 8 '19

#database #sql #datamodel

Lock

When two users are trying to query or update the same table at the same time it may result in a lock. In the same way, if two people with ATM cards for the same bank account try to withdraw the same $100 at the same time, one will be locked out until the first transaction is completed.

Rhymes does a great job of explaining how it works on the database:

"...database locks serve the purpose of protecting access to shared resources (tables, rows, data).

In a system where tens if not hundreds of connections operate on the same dataset, there has to be a system to avoid that two connections invalidate each other's operation (or in other cases causing a deadlock) ...

Locks are a way to do that. An operation comes to the database, declares they need a resource, finishes its own modification, then releases such resource so the next operation can do the same. If they didn't lock their resource two operations might overwrite each other's data causing disasters."

SQL Server Locking

Matt Eland ・ Sep 10 '19

#sql #sqlserver #performance #database

Massively Parallel Processing

In Massively Parallel Processing databases, like Redshift, data is partitioned across multiple compute nodes with each node having memory to process data locally.

Redshift distributes the rows of a table to the nodes so that the data can be processed in parallel. By selecting an appropriate distribution key for each table, the workload can be balanced.

Normalisation

Database normalisation increases data integrity and allows new data to be added without changing the underlying structure.

Eliminate or minimise duplication - repeating a value across multiple tables means that tables take up more space than needed, which increases storage costs. Storing customers' address details on one table with keys linking to their orders will take up less space than repeating address details on each row of the order table.
Simplify updates to data - by keeping a value in one table with a key to another we minimise the risk of errors when there are updates to be made. If there are two places where a customer's email is stored and only one gets updated there will be confusion over which one is correct.
Simplify queries - searching and sorting becomes easier if there is no duplication on tables.
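As a small illustration of the update point, the sketch below keeps each customer's email in one place and links orders to it by key. It runs on SQLite through Python's sqlite3 module, and the tables and rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table customers (customer_id integer primary key, email text);
create table orders (
  order_id    integer primary key,
  customer_id integer references customers(customer_id),
  order_item  text
);
insert into customers values (1, 'old@example.com');
insert into orders values (100, 1, 'widget'), (101, 1, 'gadget');
""")

# The email lives in exactly one row, so one update fixes it everywhere
conn.execute("update customers set email = 'new@example.com' where customer_id = 1")

emails_per_order = conn.execute("""
select o.order_id, c.email
from orders o
join customers c on o.customer_id = c.customer_id
order by o.order_id
""").fetchall()

print(emails_per_order)  # [(100, 'new@example.com'), (101, 'new@example.com')]
```

Had the email been repeated on every order row, the same change would have needed one update per copy, with a chance of missing some.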

ELI5: What is a database normalization?

Jamshid Tursunboyev ・ Dec 6 '17

#explainlikeimfive #discuss

OLTP v OLAP

OLTP and OLAP refer to different types of databases and tools that perform different functions.

OLTP - Online Transaction Processing - used for fast data processing and responds immediately to queries.
OLAP - Online Analytical Processing - used for storing historical data and data mining.

Revolutionize the Performance in SQL Server With in-memory OLTP System

Vikas Arora ・ May 7 '18

#oltpsystem #sql

Privileges

If you intend on sharing a table with your colleagues who have access to your schema, you need to explicitly grant access to them. This keeps data locked down to just those who need to see it.

GRANT ALL ON <schemaname.tablename> TO <username>  
-- grants every privilege on the table, including SELECT, INSERT, UPDATE and DELETE

GRANT SELECT ON <schemaname.tablename> TO <username> 
-- if you would like them to be able to only SELECT

Read more about permissions and everything else you need to know about databases:

Everything you need to know about (Relational) Databases

Lucas Olivera ・ Jan 17 '19

#beginners #database #sql

Query Plan

When we run a query there are many things that the SQL engine considers - the joins, the indexes, whether it will scan through the whole table or be faced with table locking.

In SQL Server we can use the Execution Plan to visualise runtime information and any warnings.

*From the SQL Server documentation*

In PostgreSQL we can check the query plan using the EXPLAIN command:

EXPLAIN -- show the execution plan of a statement
EXPLAIN ANALYZE -- executes the statement and reports actual run times alongside the plan

Read more about what each query looks like and how to interpret the results:

Reading a Postgres EXPLAIN ANALYZE Query Plan

Caleb Hearth ・ Feb 22 '18

#database #postgres #performance

Recovery

Disaster Recovery in the database world relates to the backups, logs and replication instances that are maintained while everything is working fine. These can then be switched on, switched over and analysed when something does go wrong, like a hardware failure, natural disaster or even human error.

Failover - multiple clusters are set up so if one fails the other can take over.
Mirroring - maintaining two copies of the same database at different locations. One is kept in offline mode so we know what state it is in when we need to use it.
Replication - the secondary database is online and can be queried. This is not only good for Disaster Recovery but can be useful if you utilise one instance for reporting and one for live queries. If you are using AWS, setting this up takes just a few clicks.

The need for Database Replication

Thamaraiselvam ・ Feb 1 '20

#database #distributedsystems #tutorial #devops

System Tables

In SQL Server these are often referred to as system tables and views. They can be found in the master database, which holds data about the server as a whole, and in the system views within each database, which hold information specific to that database.

In PostgreSQL, a similar collection of tables can be found in the information_schema and PostgreSQL catalog.

Examples of system views

sys.objects - shows each object, its type and created date
sys.indexes - shows each index and type
information_schema.columns - shows each column, its position and datatype

Examples of catalog objects

information_schema.tables - shows each table and view, and its type
pg_index - shows each index and type
information_schema.columns - shows each column, its position and datatype

SQL Tips & Tricks: Counting Rows

Jimmy Guerrero for YugabyteDB ・ Aug 27 '20

#postgres #yugabyte #sql #distributedsystems

Truncate v Drop

TRUNCATE and DELETE both remove data from a table, but in different ways. DROP goes a step further and removes the table itself along with its data.

TRUNCATE is a DDL command which removes the contents of the table while leaving the structure in place

truncate table marketing.emailcampaign

DELETE is a DML command which removes rows given a WHERE clause

delete from 
  marketing.emailcampaign
where
  month = 'January'

For a breakdown of which commands fall into DDL and which are DML commands check out this post.

SQL- Overview and Types of commands

l0l0l0l ・ May 23 '19

#sql #tutorial #computerscience #todayilearned

Union

While a JOIN combines columns horizontally, a UNION combines the results vertically. Using a UNION combines the results of two queries into one result set and removes duplicates. Both queries need to return the same number of columns, in the same order and with compatible data types, to complete the UNION.

select *
from
  orders
union
select *
from
  inventory

The UNION ALL combines the results of two queries the same as a UNION but keeps the duplicates in the result.

select *
from
  orders
union all
select *
from
  inventory
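To make the duplicate handling visible, here is a minimal sketch on SQLite through Python's sqlite3 module. Both tables are cut down to a single invented item column, and 'widget' appears in each.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table orders (item text);
create table inventory (item text);
insert into orders values ('widget'), ('gadget');
insert into inventory values ('widget'), ('gizmo');
""")

# UNION stacks the two results and removes the duplicate 'widget'
union_rows = conn.execute("""
select item from orders
union
select item from inventory
order by item
""").fetchall()

# UNION ALL stacks the two results and keeps every row
union_all_rows = conn.execute("""
select item from orders
union all
select item from inventory
order by item
""").fetchall()

print(union_rows)      # [('gadget',), ('gizmo',), ('widget',)]
print(union_all_rows)  # [('gadget',), ('gizmo',), ('widget',), ('widget',)]
```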

SQL: Union Operator

Wendy Calderon ・ Dec 2 '19

#sql #postgres #programming

View

Views are not tables. They are queries that are executed on the fly and are used as a way to create a level of abstraction from the base table.

Joe sums it up perfectly in his post

A view is a stored query. When you create a database view, the database stores the SQL you gave it. Then, when you come along and query that view, the database takes the stored view query, adds in the extras from the query against the view, and executes it. That's it!

Database Views Don't Really Exist

Joseph Moore ・ Aug 9 '18

#database

Window Function

A window function performs a calculation across a set of rows (the 'window') that relate to the current row. Unlike an aggregate function, it keeps each row intact and adds a value such as a row number, rank or running total.

Here is an example using the orders table that returns a rank using the amount_due.

select
  order_id,
  order_name,
  order_date,
  rank() over(order by amount_due desc) as rank
from 
  dbo.orders
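The sketch below runs a rank and a running total side by side so the 'each row stays intact' point is visible. It uses SQLite through Python's sqlite3 module and assumes SQLite 3.25 or later, which is when window functions were added. The four orders are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table orders (order_id integer, order_date text, amount_due real);
insert into orders values
  (1, '2019-01-01', 50.0),
  (2, '2019-01-02', 20.0),
  (3, '2019-01-03', 50.0),
  (4, '2019-01-04', 10.0);
""")

rows = conn.execute("""
select
  order_id,
  amount_due,
  rank() over (order by amount_due desc) as amount_rank,     -- ties share a rank
  sum(amount_due) over (order by order_date) as running_total
from orders
order by order_id
""").fetchall()

for row in rows:
    print(row)
# (1, 50.0, 1, 50.0)
# (2, 20.0, 3, 70.0)
# (3, 50.0, 1, 120.0)
# (4, 10.0, 4, 130.0)
```

All four rows come back, each with its own rank and running total. Orders 1 and 3 tie on amount_due, so they share rank 1 and the next rank is 3.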

Introduction into Window Functions on SQL Server

Kay Sauter ・ Mar 7 '20

#sql #tutorial

XML

We can import files into tables using the import/export wizard. But they don't have to just be csv or txt files. By using a few lines of code we can import XML as well.

SQL Developer's new format hints

Mark Sta Ana ・ Aug 26 '18

#oracle #sqldeveloper

Year

Depending on your flavour of SQL you will be able to calculate the difference between two dates and compare a date with the current date. There is added complexity when moving between databases, so keep this in mind on your next migration.
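As one example of how flavours differ, here is the SQLite way of doing both, run through Python's sqlite3 module. SQLite has no DATEDIFF function, whereas SQL Server uses DATEDIFF(day, start, end) and PostgreSQL lets you subtract one date from another directly. The dates are arbitrary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Subtracting Julian day numbers gives the gap between two dates in days
days_between = conn.execute(
    "select cast(julianday('2019-02-06') - julianday('2019-01-01') as integer)"
).fetchone()[0]

# ISO-8601 date strings compare correctly as text, so this asks "is it in the past?"
is_in_past = conn.execute("select '2019-02-06' < date('now')").fetchone()[0]

print(days_between)  # 36
print(is_in_past)    # 1 means true in SQLite
```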

Formatting and dealing with dates in SQL

justin gage for Retool ・ Mar 4 '20

#sql #dates #formatting

Zero

NULL means that the value is unknown, not zero and not blank. This makes it difficult to compare values if you are comparing NULLs with NULLs.

Because NULL is not a value it isn't possible to use comparison operators such as = or <>. Instead, we need to use the IS NULL and IS NOT NULL operators:

select * 
from 
  inventory
where 
  unitprice is not null

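Depending on the database, NULLs sort as either the largest or the smallest value (PostgreSQL treats them as the largest by default, SQL Server as the smallest), making sorting at best annoying and at worst misleading. To get around this we can use COALESCE to treat NULLs as a 0.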

select 
  itemname, 
  coalesce(unitprice, 0)
from 
  inventory
order by 2
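The behaviour described above can be checked with a short sketch on SQLite through Python's sqlite3 module. The three inventory rows are invented, and 'gizmo' has no price.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table inventory (itemname text, unitprice real);
insert into inventory values ('widget', 2.5), ('gizmo', null), ('gadget', 4.0);
""")

# '= null' never matches, because comparing with NULL gives NULL rather than true
matched_with_equals = conn.execute(
    "select count(*) from inventory where unitprice = null"
).fetchone()[0]

# 'is null' is the operator that actually finds the missing price
matched_with_is_null = conn.execute(
    "select count(*) from inventory where unitprice is null"
).fetchone()[0]

# Even NULL compared with NULL is unknown (Python shows SQL NULL as None)
null_equals_null = conn.execute("select null = null").fetchone()[0]

# COALESCE swaps the NULL for 0 so it sorts predictably
sorted_items = conn.execute(
    "select itemname, coalesce(unitprice, 0) from inventory order by 2"
).fetchall()

print(matched_with_equals, matched_with_is_null, null_equals_null)  # 0 1 None
print(sorted_items)  # [('gizmo', 0), ('widget', 2.5), ('gadget', 4.0)]
```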

SQL:How to compare with NULL without "IS NULL" operator

yuyabu ・ Nov 11 '18

#sql #null #rdbms #database

There we have it, a quick introduction to the key terms, concepts and jargon for those new to the world of SQL and databases.

This post first appeared on helenanderson.co.nz

Top comments (20)

rhymes • Feb 6 '19

Great "SQL vocabulary" you have here. Also, thanks for quoting me!

Stay tuned for a post dedicated to a deep dive on this topic

Can't wait, window functions are my favorite "modern SQL" feature and I think many devs don't know how much they are missing :D I wanted to write about them eventually but it would definitely be better if you do it!

Helen Anderson • Feb 6 '19

Thanks rhymes

You had such a perfect explanation I had to include it :)

LuckyArthas • May 9 '21

Wonderful topic.

There was a spark in my head.
I have an lack of knowledge with relational and non-relational database. In some cases, please post about what can I choose and why it is convenient.
Thanks a lot. ( 💝 I will follow you all the time 💝 )

Evaldas Buinauskas • Feb 12 '19

A foreign key is a column that matches a primary key in another table so we can join the data in each together.

This is partially true. Foreign keys can also reference unique (alternate) keys. 👌

Also foreign key is not required to join data. It just enforces integrity. But you know that 😊

Todd M Owens • Jun 30 '20

Kudos. This is sorely needed and fairly well done. My only real critique is there is not enough context or priority for each entry. But I understand there is a trade-off.

Now for my not serious critique, if you non-American anglophones could only learn:

to use a "z" in words like "normalize"; show "z" some love.
to not overuse "u" and make words like "favor" and "color" too long.
stop doubling consonants in unaccented terminal syllables; it's "modeling" not "modelling".