Every company you will ever work for stores its data in a database.
Not CSV files. Not Excel sheets. A database. Tables of users. Tables of transactions. Tables of products, logs, events, sessions. Millions of rows. Billions sometimes.
To get that data out, you write SQL.
Not Python. Not Pandas. SQL. A completely different language designed specifically for asking questions of relational databases. And unlike Python which took months to feel comfortable, SQL has a core that you can learn in one sitting and use productively within hours.
This is that sitting.
What SQL Actually Is
SQL stands for Structured Query Language. It is not a programming language in the traditional sense. You do not write loops or functions. You describe what data you want and the database figures out how to retrieve it.
That difference matters. In Python you tell the computer how to do something step by step. In SQL you tell it what you want and it decides the how.
SELECT name, salary
FROM employees
WHERE department = 'Engineering'
ORDER BY salary DESC;
Read that out loud. "Select name and salary from the employees table where the department is Engineering, ordered by salary descending." That is exactly what it does. SQL reads like English on purpose.
Setting Up SQLite in Python
You do not need to install a database server. Python comes with SQLite, a lightweight database that lives in a single file. Perfect for learning.
import sqlite3
import pandas as pd
conn = sqlite3.connect("company.db")
cursor = conn.cursor()
cursor.executescript("""
DROP TABLE IF EXISTS employees;
DROP TABLE IF EXISTS departments;
DROP TABLE IF EXISTS projects;
CREATE TABLE departments (
dept_id INTEGER PRIMARY KEY,
dept_name TEXT NOT NULL,
budget INTEGER
);
CREATE TABLE employees (
emp_id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
age INTEGER,
salary INTEGER,
dept_id INTEGER,
hire_date TEXT,
FOREIGN KEY (dept_id) REFERENCES departments(dept_id)
);
CREATE TABLE projects (
project_id INTEGER PRIMARY KEY,
emp_id INTEGER,
project TEXT,
hours INTEGER,
FOREIGN KEY (emp_id) REFERENCES employees(emp_id)
);
INSERT INTO departments VALUES
(1, 'Engineering', 500000),
(2, 'Marketing', 300000),
(3, 'Sales', 400000),
(4, 'HR', 200000);
INSERT INTO employees VALUES
(1, 'Alex', 25, 55000, 1, '2022-01-15'),
(2, 'Priya', 30, 82000, 2, '2021-03-22'),
(3, 'Sam', 22, 43000, 1, '2023-06-01'),
(4, 'Jordan', 35, 95000, 3, '2019-04-20'),
(5, 'Lisa', 28, 67000, 2, '2022-11-30'),
(6, 'Ravi', 31, 71000, 1, '2021-07-14'),
(7, 'Tom', 42, 88000, 3, '2018-09-05'),
(8, 'Nina', 26, 59000, 4, '2023-02-28');
INSERT INTO projects VALUES
(1, 1, 'AI Platform', 120),
(2, 3, 'Data Pipeline', 80),
(3, 2, 'Campaign Tool', 95),
(4, 6, 'Internal API', 110),
(5, 4, 'CRM Integration', 60),
(6, 1, 'Model Monitoring', 45);
""")
conn.commit()
print("Database created with three tables.")
A helper function to run queries and see results cleanly:
def query(sql, conn=conn):
return pd.read_sql_query(sql, conn)
pd.read_sql_query runs the SQL and returns a DataFrame. Best of both worlds: SQL for querying, Pandas for display and further manipulation.
SELECT: Getting Data Out
The most basic query. Get everything from a table.
result = query("SELECT * FROM employees;")
print(result)
Output:
emp_id name age salary dept_id hire_date
0 1 Alex 25 55000 1 2022-01-15
1 2 Priya 30 82000 2 2021-03-22
2 3 Sam 22 43000 1 2023-06-01
...
* means all columns. Never use SELECT * in production code. Always name the columns you need. It is faster and the code is clearer.
result = query("SELECT name, salary, dept_id FROM employees;")
print(result)
Give columns aliases with AS:
result = query("""
SELECT
name AS employee_name,
salary AS annual_salary,
salary / 12 AS monthly_salary
FROM employees;
""")
print(result)
Output:
employee_name annual_salary monthly_salary
0 Alex 55000 4583.333
1 Priya 82000 6833.333
2 Sam 43000 3583.333
...
SQL can do math directly in the SELECT. Monthly salary calculated inline. No Python needed.
WHERE: Filtering Rows
result = query("""
SELECT name, salary
FROM employees
WHERE salary > 70000;
""")
print(result)
Output:
name salary
0 Priya 82000
1 Jordan 95000
2 Ravi 71000
3 Tom 88000
Comparison operators: =, !=, >, <, >=, <=
Note: SQL uses single = for equality comparison, not ==.
Multiple conditions:
result = query("""
SELECT name, age, salary
FROM employees
WHERE salary > 60000
AND age < 35;
""")
print(result)
Output:
name age salary
0 Priya 30 82000
1 Lisa 28 67000
2 Ravi 31 71000
AND, OR, NOT work exactly as you expect.
result = query("""
SELECT name, dept_id
FROM employees
WHERE dept_id IN (1, 3);
""")
print(result)
IN is cleaner than multiple OR conditions for checking against a list.
result = query("""
SELECT name, salary
FROM employees
WHERE salary BETWEEN 60000 AND 85000;
""")
print(result)
result = query("""
SELECT name
FROM employees
WHERE name LIKE 'J%';
""")
print(result)
LIKE with % as wildcard. 'J%' matches anything starting with J. '%son' matches anything ending with son. '%or%' matches anything containing "or".
ORDER BY and LIMIT
result = query("""
SELECT name, salary
FROM employees
ORDER BY salary DESC
LIMIT 3;
""")
print(result)
Output:
name salary
0 Jordan 95000
1 Tom 88000
2 Priya 82000
Top 3 earners. DESC for descending, ASC for ascending (default).
result = query("""
SELECT name, age, salary
FROM employees
ORDER BY dept_id ASC, salary DESC;
""")
print(result)
Sort by multiple columns. Department first, then salary within each department.
Aggregate Functions
result = query("""
SELECT
COUNT(*) AS total_employees,
AVG(salary) AS avg_salary,
MIN(salary) AS min_salary,
MAX(salary) AS max_salary,
SUM(salary) AS total_payroll
FROM employees;
""")
print(result.round(0))
Output:
total_employees avg_salary min_salary max_salary total_payroll
0 8 70000.0 43000 95000 560000
Five aggregate functions. One query. No loops.
COUNT(*) counts all rows. COUNT(column_name) counts non-null values in that column.
GROUP BY: Aggregate by Category
This is the SQL equivalent of Pandas groupby.
result = query("""
SELECT
dept_id,
COUNT(*) AS headcount,
AVG(salary) AS avg_salary,
MAX(salary) AS max_salary
FROM employees
GROUP BY dept_id
ORDER BY avg_salary DESC;
""")
print(result.round(0))
Output:
dept_id headcount avg_salary max_salary
0 3 2 91500.0 95000
1 2 2 74500.0 82000
2 4 1 59000.0 59000
3 1 3 56333.0 71000
Every aggregate function runs within each group. The ORDER BY here sorts the result by average salary.
HAVING: Filter After Grouping
WHERE filters rows before grouping. HAVING filters groups after aggregation.
result = query("""
SELECT
dept_id,
COUNT(*) AS headcount,
AVG(salary) AS avg_salary
FROM employees
GROUP BY dept_id
HAVING COUNT(*) >= 2
ORDER BY avg_salary DESC;
""")
print(result.round(0))
Output:
dept_id headcount avg_salary
0 3 2 91500.0
1 2 2 74500.0
2 1 3 56333.0
Department 4 (HR) only has one person so it was filtered out by HAVING COUNT(*) >= 2.
A very common mistake: using WHERE to filter aggregated values.
-- Wrong: WHERE cannot reference aggregated values
SELECT dept_id, AVG(salary)
FROM employees
WHERE AVG(salary) > 70000
GROUP BY dept_id;
-- Correct: use HAVING
SELECT dept_id, AVG(salary)
FROM employees
GROUP BY dept_id
HAVING AVG(salary) > 70000;
Combining SQL with Pandas
SQL retrieves data. Pandas transforms and analyzes it. Use both together.
df = query("""
SELECT name, age, salary, dept_id, hire_date
FROM employees
WHERE salary > 50000
ORDER BY salary DESC;
""")
df["hire_date"] = pd.to_datetime(df["hire_date"])
df["years_employed"] = (pd.Timestamp.now() - df["hire_date"]).dt.days / 365
df["salary_per_year"] = df["salary"] / df["years_employed"]
print(df[["name", "salary", "years_employed", "salary_per_year"]].round(1))
Use SQL for what it is good at: filtering large datasets at the database level before loading into Python. Use Pandas for what it is good at: flexible transformations, visualization prep, feature engineering.
SQL Execution Order
This confuses everyone at first.
SQL does not execute in the order you write it.
The actual execution order is:
1. FROM -- which table
2. WHERE -- filter rows
3. GROUP BY -- group the filtered rows
4. HAVING -- filter the groups
5. SELECT -- choose columns and compute aggregates
6. ORDER BY -- sort the result
7. LIMIT -- restrict output count
This is why you cannot use a SELECT alias in a WHERE clause:
-- Wrong: 'monthly_salary' does not exist yet at WHERE stage
SELECT salary / 12 AS monthly_salary
FROM employees
WHERE monthly_salary > 5000;
-- Correct: use the original expression
SELECT salary / 12 AS monthly_salary
FROM employees
WHERE salary / 12 > 5000;
A Resource Worth Knowing
Mode Analytics SQL Tutorial at mode.com/sql-tutorial is the most well-structured free SQL learning resource online. It uses real datasets, interactive queries you can run in the browser, and covers everything from basics to window functions. More practical than any textbook. Widely recommended by data professionals. Search "Mode Analytics SQL Tutorial" and it comes up immediately.
Towards Data Science published a piece by Terence Shin called "10 SQL Concepts Every Data Scientist Should Know" that maps directly to what practicing data scientists actually use versus what SQL courses spend too long on. Cuts through the noise. Search the title on Towards Data Science.
Try This
Create sql_practice.py. Use the database we built in this post.
Write SQL queries to answer each of these. Do not use Python to filter or transform the results. Do it all in SQL.
Which employees earn more than the average salary of the entire company?
How many employees were hired each year? (Extract the year from hire_date using strftime('%Y', hire_date))
What is the total hours worked on projects per employee? Include employees with no projects (hint: you need a join, which is coming next post, but try it anyway).
Find the department where the average age of employees is the lowest.
List employees who earn more than the average salary in their own department. (This needs a subquery. Try it. The syntax is WHERE salary > (SELECT AVG(salary) FROM employees WHERE dept_id = e.dept_id).)
Close the connection when done: conn.close().
What's Next
You can query one table now. Real databases have dozens or hundreds of tables. The power comes from combining them. Next post is SQL joins: how to connect tables through shared keys and build queries that span your entire database.
Top comments (0)