Akhilesh

Posted on Apr 29

SQL Basics: Talking to a Database for the First Time

#beginners #database #sql #ai

Every company you will ever work for stores its data in a database.

Not CSV files. Not Excel sheets. A database. Tables of users. Tables of transactions. Tables of products, logs, events, sessions. Millions of rows. Billions sometimes.

To get that data out, you write SQL.

Not Python. Not Pandas. SQL. A completely different language designed specifically for asking questions of relational databases. And unlike Python which took months to feel comfortable, SQL has a core that you can learn in one sitting and use productively within hours.

This is that sitting.

What SQL Actually Is

SQL stands for Structured Query Language. It is not a programming language in the traditional sense. You do not write loops or functions. You describe what data you want and the database figures out how to retrieve it.

That difference matters. In Python you tell the computer how to do something step by step. In SQL you tell it what you want and it decides the how.

SELECT name, salary
FROM employees
WHERE department = 'Engineering'
ORDER BY salary DESC;

Read that out loud. "Select name and salary from the employees table where the department is Engineering, ordered by salary descending." That is exactly what it does. SQL reads like English on purpose.

Setting Up SQLite in Python

You do not need to install a database server. Python comes with SQLite, a lightweight database that lives in a single file. Perfect for learning.

import sqlite3
import pandas as pd

conn = sqlite3.connect("company.db")
cursor = conn.cursor()

cursor.executescript("""
    DROP TABLE IF EXISTS employees;
    DROP TABLE IF EXISTS departments;
    DROP TABLE IF EXISTS projects;

    CREATE TABLE departments (
        dept_id   INTEGER PRIMARY KEY,
        dept_name TEXT NOT NULL,
        budget    INTEGER
    );

    CREATE TABLE employees (
        emp_id     INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        age        INTEGER,
        salary     INTEGER,
        dept_id    INTEGER,
        hire_date  TEXT,
        FOREIGN KEY (dept_id) REFERENCES departments(dept_id)
    );

    CREATE TABLE projects (
        project_id  INTEGER PRIMARY KEY,
        emp_id      INTEGER,
        project     TEXT,
        hours       INTEGER,
        FOREIGN KEY (emp_id) REFERENCES employees(emp_id)
    );

    INSERT INTO departments VALUES
        (1, 'Engineering', 500000),
        (2, 'Marketing', 300000),
        (3, 'Sales', 400000),
        (4, 'HR', 200000);

    INSERT INTO employees VALUES
        (1, 'Alex',   25, 55000, 1, '2022-01-15'),
        (2, 'Priya',  30, 82000, 2, '2021-03-22'),
        (3, 'Sam',    22, 43000, 1, '2023-06-01'),
        (4, 'Jordan', 35, 95000, 3, '2019-04-20'),
        (5, 'Lisa',   28, 67000, 2, '2022-11-30'),
        (6, 'Ravi',   31, 71000, 1, '2021-07-14'),
        (7, 'Tom',    42, 88000, 3, '2018-09-05'),
        (8, 'Nina',   26, 59000, 4, '2023-02-28');

    INSERT INTO projects VALUES
        (1, 1, 'AI Platform',     120),
        (2, 3, 'Data Pipeline',    80),
        (3, 2, 'Campaign Tool',    95),
        (4, 6, 'Internal API',    110),
        (5, 4, 'CRM Integration',  60),
        (6, 1, 'Model Monitoring', 45);
""")

conn.commit()
print("Database created with three tables.")

A helper function to run queries and see results cleanly:

def query(sql, conn=conn):
    return pd.read_sql_query(sql, conn)

pd.read_sql_query runs the SQL and returns a DataFrame. Best of both worlds: SQL for querying, Pandas for display and further manipulation.

SELECT: Getting Data Out

The most basic query. Get everything from a table.

result = query("SELECT * FROM employees;")
print(result)

Output:

   emp_id   name  age  salary  dept_id   hire_date
0       1   Alex   25   55000        1  2022-01-15
1       2  Priya   30   82000        2  2021-03-22
2       3    Sam   22   43000        1  2023-06-01
...

* means all columns. Never use SELECT * in production code. Always name the columns you need. It is faster and the code is clearer.

result = query("SELECT name, salary, dept_id FROM employees;")
print(result)

Give columns aliases with AS:

result = query("""
    SELECT
        name        AS employee_name,
        salary      AS annual_salary,
        salary / 12 AS monthly_salary
    FROM employees;
""")
print(result)

Output:

  employee_name  annual_salary  monthly_salary
0          Alex          55000        4583.333
1         Priya          82000        6833.333
2           Sam          43000        3583.333
...

SQL can do math directly in the SELECT. Monthly salary calculated inline. No Python needed.

WHERE: Filtering Rows

result = query("""
    SELECT name, salary
    FROM employees
    WHERE salary > 70000;
""")
print(result)

Output:

    name  salary
0  Priya   82000
1 Jordan   95000
2   Ravi   71000
3    Tom   88000

Comparison operators: =, !=, >, <, >=, <=

Note: SQL uses single = for equality comparison, not ==.

Multiple conditions:

result = query("""
    SELECT name, age, salary
    FROM employees
    WHERE salary > 60000
      AND age < 35;
""")
print(result)

Output:

   name  age  salary
0 Priya   30   82000
1  Lisa   28   67000
2  Ravi   31   71000

AND, OR, NOT work exactly as you expect.

result = query("""
    SELECT name, dept_id
    FROM employees
    WHERE dept_id IN (1, 3);
""")
print(result)

IN is cleaner than multiple OR conditions for checking against a list.

result = query("""
    SELECT name, salary
    FROM employees
    WHERE salary BETWEEN 60000 AND 85000;
""")
print(result)

result = query("""
    SELECT name
    FROM employees
    WHERE name LIKE 'J%';
""")
print(result)

LIKE with % as wildcard. 'J%' matches anything starting with J. '%son' matches anything ending with son. '%or%' matches anything containing "or".

ORDER BY and LIMIT

result = query("""
    SELECT name, salary
    FROM employees
    ORDER BY salary DESC
    LIMIT 3;
""")
print(result)

Output:

     name  salary
0  Jordan   95000
1     Tom   88000
2   Priya   82000

Top 3 earners. DESC for descending, ASC for ascending (default).

result = query("""
    SELECT name, age, salary
    FROM employees
    ORDER BY dept_id ASC, salary DESC;
""")
print(result)

Sort by multiple columns. Department first, then salary within each department.

Aggregate Functions

result = query("""
    SELECT
        COUNT(*)          AS total_employees,
        AVG(salary)       AS avg_salary,
        MIN(salary)       AS min_salary,
        MAX(salary)       AS max_salary,
        SUM(salary)       AS total_payroll
    FROM employees;
""")
print(result.round(0))

Output:

   total_employees  avg_salary  min_salary  max_salary  total_payroll
0                8     70000.0       43000       95000         560000

Five aggregate functions. One query. No loops.

COUNT(*) counts all rows. COUNT(column_name) counts non-null values in that column.

GROUP BY: Aggregate by Category

This is the SQL equivalent of Pandas groupby.

result = query("""
    SELECT
        dept_id,
        COUNT(*)     AS headcount,
        AVG(salary)  AS avg_salary,
        MAX(salary)  AS max_salary
    FROM employees
    GROUP BY dept_id
    ORDER BY avg_salary DESC;
""")
print(result.round(0))

Output:

   dept_id  headcount  avg_salary  max_salary
0        3          2     91500.0       95000
1        2          2     74500.0       82000
2        4          1     59000.0       59000
3        1          3     56333.0       71000

Every aggregate function runs within each group. The ORDER BY here sorts the result by average salary.

HAVING: Filter After Grouping

WHERE filters rows before grouping. HAVING filters groups after aggregation.

result = query("""
    SELECT
        dept_id,
        COUNT(*)    AS headcount,
        AVG(salary) AS avg_salary
    FROM employees
    GROUP BY dept_id
    HAVING COUNT(*) >= 2
    ORDER BY avg_salary DESC;
""")
print(result.round(0))

Output:

   dept_id  headcount  avg_salary
0        3          2     91500.0
1        2          2     74500.0
2        1          3     56333.0

Department 4 (HR) only has one person so it was filtered out by HAVING COUNT(*) >= 2.

A very common mistake: using WHERE to filter aggregated values.

-- Wrong: WHERE cannot reference aggregated values
SELECT dept_id, AVG(salary)
FROM employees
WHERE AVG(salary) > 70000
GROUP BY dept_id;

-- Correct: use HAVING
SELECT dept_id, AVG(salary)
FROM employees
GROUP BY dept_id
HAVING AVG(salary) > 70000;

Combining SQL with Pandas

SQL retrieves data. Pandas transforms and analyzes it. Use both together.

df = query("""
    SELECT name, age, salary, dept_id, hire_date
    FROM employees
    WHERE salary > 50000
    ORDER BY salary DESC;
""")

df["hire_date"] = pd.to_datetime(df["hire_date"])
df["years_employed"] = (pd.Timestamp.now() - df["hire_date"]).dt.days / 365
df["salary_per_year"] = df["salary"] / df["years_employed"]

print(df[["name", "salary", "years_employed", "salary_per_year"]].round(1))

Use SQL for what it is good at: filtering large datasets at the database level before loading into Python. Use Pandas for what it is good at: flexible transformations, visualization prep, feature engineering.

SQL Execution Order

This confuses everyone at first.

SQL does not execute in the order you write it.

The actual execution order is:

1. FROM       -- which table
2. WHERE      -- filter rows
3. GROUP BY   -- group the filtered rows
4. HAVING     -- filter the groups
5. SELECT     -- choose columns and compute aggregates
6. ORDER BY   -- sort the result
7. LIMIT      -- restrict output count

This is why you cannot use a SELECT alias in a WHERE clause:

-- Wrong: 'monthly_salary' does not exist yet at WHERE stage
SELECT salary / 12 AS monthly_salary
FROM employees
WHERE monthly_salary > 5000;

-- Correct: use the original expression
SELECT salary / 12 AS monthly_salary
FROM employees
WHERE salary / 12 > 5000;

A Resource Worth Knowing

Mode Analytics SQL Tutorial at mode.com/sql-tutorial is the most well-structured free SQL learning resource online. It uses real datasets, interactive queries you can run in the browser, and covers everything from basics to window functions. More practical than any textbook. Widely recommended by data professionals. Search "Mode Analytics SQL Tutorial" and it comes up immediately.

Towards Data Science published a piece by Terence Shin called "10 SQL Concepts Every Data Scientist Should Know" that maps directly to what practicing data scientists actually use versus what SQL courses spend too long on. Cuts through the noise. Search the title on Towards Data Science.

Try This

Create sql_practice.py. Use the database we built in this post.

Write SQL queries to answer each of these. Do not use Python to filter or transform the results. Do it all in SQL.

Which employees earn more than the average salary of the entire company?

How many employees were hired each year? (Extract the year from hire_date using strftime('%Y', hire_date))

What is the total hours worked on projects per employee? Include employees with no projects (hint: you need a join, which is coming next post, but try it anyway).

Find the department where the average age of employees is the lowest.

List employees who earn more than the average salary in their own department. (This needs a subquery. Try it. The syntax is WHERE salary > (SELECT AVG(salary) FROM employees WHERE dept_id = e.dept_id).)

Close the connection when done: conn.close().

What's Next

You can query one table now. Real databases have dozens or hundreds of tables. The power comes from combining them. Next post is SQL joins: how to connect tables through shared keys and build queries that span your entire database.

DEV Community

SQL Basics: Talking to a Database for the First Time

What SQL Actually Is

Setting Up SQLite in Python

SELECT: Getting Data Out

WHERE: Filtering Rows

ORDER BY and LIMIT

Aggregate Functions

GROUP BY: Aggregate by Category

HAVING: Filter After Grouping

Combining SQL with Pandas

SQL Execution Order

A Resource Worth Knowing

Try This

What's Next

Top comments (0)