DEV Community

Anurag Verma
Anurag Verma

Posted on

Understanding the basics and advanced concepts of RDBMS and SQL for data science and analysis

Day 7 of 100 Days Data Science Bootcamp from noob to expert.

GitHub link: Complete-Data-Science-Bootcamp

Main Post: Complete-Data-Science-Bootcamp

Recap Day 6

Yesterday we have studied in detail about Data Visualization using matplotlib and seaborn in Python.

Let's Start

A DBMS (Database Management System) is a software application that interacts with end users, other applications, and the database itself to capture and analyze the data. It is a tool that allows data scientists to store, manage, and retrieve large amounts of structured and unstructured data.

There are different types of DBMSs, such as relational databases, NoSQL databases, and graph databases.

  • Relational databases, such as MySQL and PostgreSQL, use a structured query language (SQL) to manage data stored in tables with rows and columns. These databases are well-suited for data that can be easily organized into a tabular format and are useful for data warehousing, business intelligence, and other reporting applications.

  • NoSQL databases, such as MongoDB and Cassandra, are designed to handle large amounts of unstructured data and support different data models, such as document, key-value, and graph. These databases are useful for big data and real-time applications, such as real-time analytics, social media, and gaming.

  • Graph databases, such as Neo4j and Amazon Neptune, are used to store and query data in a graph format, where nodes and edges represent entities and relationships. These databases are useful for applications that involve complex relationships, such as fraud detection, recommendation systems, and social network analysis.

In data science, the choice of DBMS depends on the type and volume of data, the complexity of the relationships among data, and the performance and scalability requirements of the application.

In this article we will study in detail Relational databases

Basic Understanding

A relational database is a type of database that organizes data into tables, with each table consisting of rows and columns.

  • Tables: A table is a collection of related data that is organized into rows and columns. Each table has a unique name that identifies it within the database. For example, a table named "customers" might contain information about all the customers of a company, with each row representing a different customer and each column representing a different piece of information about that customer, such as name, address, and phone number.

  • Columns: A column is a vertical set of values in a table and it has a name and a data type. Each column represents a specific attribute of the data in the table, such as "customer_name" or "customer_address". For example, in a table named "customers", the columns could be "customer_id", "customer_name", "customer_address", "customer_phone_number", "customer_email" etc.

  • Rows: A row is a horizontal set of values in a table and it represents a single record or tuple in the table. Each row represents a unique instance of the data in the table, such as a specific customer. For example, in a table named "customers", a row could represent a single customer with values such as "1" for the "customer_id", "John Doe" for the "customer_name", "123 Main St" for the "customer_address", "555-555-5555" for the "customer_phone_number" and "johndoe@gmail.com" for the "customer_email"

  • Primary keys: A primary key is a column or set of columns in a table that uniquely identify each row in the table. Each table can have only one primary key, and it is used to enforce the integrity of the data and create relationships with other tables. For example, in a table named "customers" the "customer_id" column might be the primary key, because it is unique for each customer and can be used to identify a specific customer in the table.

  • Foreign keys: A foreign key is a column or set of columns in a table that refers to the primary key of another table. It is used to create relationships between tables and enforce referential integrity. For example, in a table named "orders" that contains information about customer orders, the "customer_id" column might be a foreign key that refers to the primary key "customer_id" in the "customers" table.

  • Indexing: Indexing is the process of creating a separate data structure that allows for faster searching and sorting of data in a table. Indexes can be created on one or more columns in a table to improve the performance of queries that search for specific data. For example, if you frequently search for customers by their last name, you might create an index on the "customer_name" column in the "customers" table to speed up those searches.

Here is an example of a "customers" table:

customer_id customer_name customer_address customer_phone_number customer_email
1 John Doe 123 Main St 555-555-5555 johndoe@gmail.com
2 Jane Smith 456 Park Ave 555-555-5556 janesmith@gmail.com
3 Bob Johnson 789 Elm St 555-555-5557 bobjohnson@gmail.com

In this example table, the "customer_id" column is the primary key and it is used to uniquely identify each customer. The other columns such as "customer_name", "customer_address", "customer_phone_number" and "customer_email" are non-key attributes and they provide additional information about each customer.

An index can be created on any column to speed up the query performance. In this example, an index can be created on the "customer_name" column if we frequently need to search customer by their name.

It is important to note that the example is a very simple representation of the table and in real world scenario, tables can have multiple columns as primary key, multiple foreign keys and a complex data model with multiple tables, but the basic concept remains the same.

SQL Quries

We will execute all our queries using Python and SQLite

Connecting to a database

To start interacting with the database we first we need to establish a connection.

#!/usr/bin/python

import sqlite3

conn = sqlite3.connect('100daysofdatascience.db')

print("Opened database successfully");
Enter fullscreen mode Exit fullscreen mode

Opened database successfully

Sample data

# Customers table:
conn.execute('''CREATE TABLE customers (
                        id INT PRIMARY KEY,
                        first_name VARCHAR(255),
                        last_name VARCHAR(255),
                        city VARCHAR(255)
                    );''')

conn.execute('''INSERT INTO customers (id, first_name, last_name, city) 
                            VALUES (1, 'John', 'Smith', 'Paris'),
                                   (2, 'Mary', 'Johnson', 'London'),
                                   (3, 'Michael', 'Williams', 'Berlin'),
                                   (4, 'Brad', 'Brown', 'Rome');''')

print("Customer table created successfully!")
Enter fullscreen mode Exit fullscreen mode

Customer table created successfully!

# Products table:
conn.execute('''CREATE TABLE products (
                        id INT PRIMARY KEY,
                        product_name VARCHAR(255),
                        category VARCHAR(255),
                        price DECIMAL(10,2),
                        in_stock BOOLEAN
                    );''')

conn.execute('''INSERT INTO products (id, product_name, category, price, in_stock)
                        VALUES (1, 'MacBook Pro', 'electronics', 1500, true),
                               (2, 'iPhone', 'electronics', 1000, true),
                               (3, 'T-Shirt', 'clothing', 20, true),
                               (4, 'Jeans', 'clothing', 50, false);''')

print("Products table created successfully!")
Enter fullscreen mode Exit fullscreen mode

Products table created successfully!

# Orders table:
conn.execute('''CREATE TABLE orders (
                        id INT PRIMARY KEY,
                        customer_id INT,
                        order_date DATE,
                        total DECIMAL(10,2)
                    );''')

conn.execute('''INSERT INTO orders (id, customer_id, order_date, total)
                        VALUES (1, 1, '2021-01-01', 100),
                               (2, 2, '2021-01-02', 200),
                               (3, 3, '2021-01-03', 150),
                               (4, 2, '2021-01-04', 75);''')

print("Orders table created successfully!")
Enter fullscreen mode Exit fullscreen mode

Orders table created successfully!

# Create the employee table
conn.execute('''CREATE TABLE employees
                (id INT PRIMARY KEY NOT NULL,
                name TEXT NOT NULL,
                salary REAL);''')

# Insert data into the employee table
conn.execute("INSERT INTO employees (id, name, salary) VALUES (1, 'John Smith', 50000)")
conn.execute("INSERT INTO employees (id, name, salary) VALUES (2, 'Mary Johnson', 55000)")
conn.execute("INSERT INTO employees (id, name, salary) VALUES (3, 'Michael Williams', 60000)")
conn.execute("INSERT INTO employees (id, name, salary) VALUES (4, 'Brad Brown', 65000)")

print("Employees table created successfully!")
Enter fullscreen mode Exit fullscreen mode

Employees table created successfully!

conn.commit()
Enter fullscreen mode Exit fullscreen mode

Here are some examples of basic-to-advanced SQL queries that a data scientist might use to retrieve and manipulate data:

SELECT:

The SELECT statement is used to retrieve data from one or more tables. For example, the following query retrieves all columns from a table called "customers":

import pandas as pd
data = pd.read_sql_query("SELECT * FROM customers", conn)
data
Enter fullscreen mode Exit fullscreen mode
id first_name last_name city
0 1 John Smith Paris
1 2 Mary Johnson London
2 3 Michael Williams Berlin
3 4 Brad Brown Rome

WHERE:

The WHERE clause is used to filter the data returned by a SELECT statement. For example, the following query retrieves all columns from the "customers" table for customers whose city is "New York":

data = pd.read_sql_query("SELECT * FROM customers WHERE city = 'London'", conn)
data
Enter fullscreen mode Exit fullscreen mode
id first_name last_name city
0 2 Mary Johnson London

JOIN:

The JOIN clause is used to combine rows from two or more tables based on a related column between them. For example, the following query retrieves all columns from the "orders" table and the "customers" table, where the customer_id in the "orders" table matches the id in the "customers" table:

data = pd.read_sql_query("SELECT * FROM orders JOIN customers ON orders.customer_id = customers.id", conn)
data
Enter fullscreen mode Exit fullscreen mode
id customer_id order_date total id first_name last_name city
0 1 1 2021-01-01 100 1 John Smith Paris
1 2 2 2021-01-02 200 2 Mary Johnson London
2 3 3 2021-01-03 150 3 Michael Williams Berlin
3 4 2 2021-01-04 75 2 Mary Johnson London

UPDATE:

The UPDATE statement is used to modify data in a table. For example, the following query updates the salary of all employees in the "employees" table by 10%:

conn.execute('''UPDATE employees SET salary = salary * 10;''')
conn.commit()
Enter fullscreen mode Exit fullscreen mode
data = pd.read_sql_query("SELECT * FROM employees", conn)
data
Enter fullscreen mode Exit fullscreen mode
id name salary
0 1 John Smith 500000.0
1 2 Mary Johnson 550000.0
2 3 Michael Williams 600000.0
3 4 Brad Brown 650000.0

DELETE:

The DELETE statement is used to remove data from a table. For example, the following query deletes all rows from the "employees" table where the salary is less than $50,000:

conn.execute('''DELETE FROM employees WHERE salary < 550000;''')
conn.commit()
data = pd.read_sql_query("SELECT * FROM employees", conn)
data
Enter fullscreen mode Exit fullscreen mode
id name salary
0 2 Mary Johnson 550000.0
1 3 Michael Williams 600000.0
2 4 Brad Brown 650000.0

GROUP BY:

The GROUP BY clause is used to group rows in a SELECT statement based on one or more columns. For example, the following query retrieves the total in_stock for each product category:

data = pd.read_sql_query("SELECT category, SUM(in_stock) FROM products GROUP BY category;", conn)
data
Enter fullscreen mode Exit fullscreen mode
category SUM(in_stock)
0 clothing 1
1 electronics 2

ORDER BY:

The ORDER BY clause is used to sort the data returned by a SELECT statement. For example, the following query retrieves all columns from the "customers" table, sorted by last name in ascending order:

data = pd.read_sql_query("SELECT * FROM customers ORDER BY last_name;", conn)
data
Enter fullscreen mode Exit fullscreen mode
id first_name last_name city
0 4 Brad Brown Rome
1 2 Mary Johnson London
2 1 John Smith Paris
3 3 Michael Williams Berlin

LIMIT:

The LIMIT clause is used to limit the number of rows returned by a SELECT statement. For example, the following query retrieves the top 2 products by in_stocks:

data = pd.read_sql_query("SELECT * FROM products ORDER BY in_stock DESC LIMIT 2;", conn)
data
Enter fullscreen mode Exit fullscreen mode
id product_name category price in_stock
0 1 MacBook Pro electronics 1500 1
1 2 iPhone electronics 1000 1

LIKE:

The LIKE operator is used to search for a specific pattern in a column. For example, the following query retrieves all customers whose last name starts with "S":

data = pd.read_sql_query("SELECT * FROM customers WHERE last_name LIKE 'S%';", conn)
data
Enter fullscreen mode Exit fullscreen mode
id first_name last_name city
0 1 John Smith Paris

INNER JOIN:

The INNER JOIN keyword is used to combine rows from two or more tables based on a related column between them. This will only return rows when there is at least one match in both tables. For example:

data = pd.read_sql_query('''SELECT orders.id, customers.first_name
                            FROM orders
                            INNER JOIN customers ON orders.customer_id = customers.id;''', conn)
data
Enter fullscreen mode Exit fullscreen mode
id first_name
0 1 John
1 2 Mary
2 3 Michael
3 4 Mary

OUTER JOIN:

The OUTER JOIN keyword is used to combine rows from two or more tables based on a related column between them. This will return all rows from one table and the matching rows from the other table. If there is no match, NULL values will be returned. For example:

data = pd.read_sql_query('''SELECT orders.id, customers.first_name
                            FROM orders
                            LEFT OUTER JOIN customers ON orders.customer_id = customers.id;
                            ''', conn)
data
Enter fullscreen mode Exit fullscreen mode
id first_name
0 1 John
1 2 Mary
2 3 Michael
3 4 Mary

UNION:

The UNION operator is used to combine the result-set of two or more SELECT statements. The UNION operator selects only distinct values by default. For example:

data = pd.read_sql_query('''SELECT * FROM orders UNION SELECT * FROM customers;''', conn)
data
Enter fullscreen mode Exit fullscreen mode
id customer_id order_date total
0 1 1 2021-01-01 100
1 1 John Smith Paris
2 2 2 2021-01-02 200
3 2 Mary Johnson London
4 3 3 2021-01-03 150
5 3 Michael Williams Berlin
6 4 2 2021-01-04 75
7 4 Brad Brown Rome

INDEX:

Indexes are used to improve the performance of a database by allowing the database to find and retrieve specific rows much faster. For example, the following command creates an index on the "last_name" column in the "customers" table:

conn.execute('''CREATE INDEX last_name_indexes ON customers (last_name);''')
conn.commit()
Enter fullscreen mode Exit fullscreen mode

Date and Time Functions:

SQL provides a number of functions to work with date and time data types. Some examples include:

  • CURDATE() - returns the current date

  • NOW() - returns the current date and time

  • YEAR() - returns the year of a given date

  • MONTH() - returns the month of a given date

  • DAY() - returns the day of a given date

As we are using SQLite. So, SQLite provides several built-in functions for working with date and time. Here is a list of some of the most commonly used date and time functions in SQLite:

  • date(timestring, modifier, modifier, ...): This function returns the date part of a date-time string.

  • time(timestring, modifier, modifier, ...): This function returns the time part of a date-time string.

  • datetime(timestring, modifier, modifier, ...): This function returns the date and time parts of a date-time string.

  • julianday(timestring, modifier, modifier, ...): This function returns the Julian day - the number of days since noon in Greenwich on November 24, 4714 B.C.

  • strftime(format, timestring, modifier, ...): This function returns a string representation of the date and time based on the specified format. The format string can contain various placeholders for different parts of the date and time, such as %Y for the year, %m for the month, %d for the day, %H for the hour, %M for the minute, and %S for the second.

  • date(timestring, '+' or '-', number, 'days' or 'months' or 'years'): This function allows you to add or subtract days, months, or years from a date.

  • current_date, current_time, current_timestamp: These functions return the current date, time, and timestamp respectively.

  • year(timestring), month(timestring), day(timestring), hour(timestring), minute(timestring), second(timestring): These functions return the year, month, day, hour, minute, and second of the given timestring respectively.

For example, the following query retrieves the total sales for each month of the current year:

data = pd.read_sql_query('''SELECT strftime('%m', order_date) as month, SUM(total) as totals
                                FROM orders
                                WHERE strftime('%Y', order_date) = strftime('%Y', 'now')
                                GROUP BY month;''', conn)
Enter fullscreen mode Exit fullscreen mode

Handling NULL values:

SQL provides several operators and functions to handle NULL values. Some examples include:

  • IS NULL - used to check for NULL values

  • IS NOT NULL - used to check for non-NULL values

  • COALESCE() - returns the first non-NULL value in a list of expressions

  • NULLIF() - returns NULL if two expressions are equal, otherwise it returns the first expression

For example, the following query retrieves all customers who have not placed any orders:

data = pd.read_sql_query('''SELECT * FROM customers
                                WHERE id NOT IN (SELECT customer_id FROM orders)
                                ''', conn)
data
Enter fullscreen mode Exit fullscreen mode
id first_name last_name city
0 4 Brad Brown Rome

Other data types:

SQL also supports other data types such as BLOB (binary large object) for storing binary data, BOOLEAN for storing true/false values, and ENUM for storing predefined set of strings.

For example, the following query retrieves all products with image and only if the products are in stock:

data = pd.read_sql_query('''SELECT product_name, in_stock FROM products
                                    WHERE product_name IS NOT NULL AND in_stock = TRUE;
                                    ''', conn)
data
Enter fullscreen mode Exit fullscreen mode
product_name in_stock
0 MacBook Pro 1
1 iPhone 1
2 T-Shirt 1

Summary

This article provides a comprehensive guide to relational database management systems (RDBMS) and SQL (Structured Query Language) for data science and analysis. It covers the basics and advanced concepts of RDBMS and SQL, including data types, queries, and advanced concepts such as joins and subqueries. The article also provides examples and code snippets to help illustrate the concepts and provide a hands-on learning experience. The main goal of this article is to help data scientists, analysts, and developers understand the importance of RDBMS and SQL in managing and analyzing large datasets.

Exercise Question you will find in the exercise notebook of Day 6 on GitHub.

If you liked it then...

Buy Me A Coffee

Top comments (0)