👋 Hello! Starting today, I will be sharing my journey of learning Data Engineering at Zoomcamp by DataTalks.Club (2023 cohort). It is going to be an amazing experience, and I am looking forward to learning a lot about various aspects of data engineering, such as data modeling, ETL, databases, data pipelines, and more. I am excited to start this journey and improve my skills in this field. Stay tuned for more updates! 🚀
In this blog post, I will show you how to ingest data into PostgreSQL using SQLAlchemy and Pandas. We will also use Docker to run the PostgreSQL database. Specifically, we will be ingesting the New York City Taxi & Limousine Commission (TLC) yellow taxi data for January 2021, which can be downloaded from this link: https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz 🚕
Prerequisites
Please ensure that you have the following:
- Docker installed on your system (follow the installation guide here)
- Python 3 installed on your system
- The SQLAlchemy, pgcli, and Pandas Python libraries installed
All of this was run on an Ubuntu environment!
📥 To get started with ingesting data to PostgreSQL from a CSV file, we need to install and import two Python libraries: SQLAlchemy and Pandas.
🐍 SQLAlchemy is a powerful SQL toolkit and Object-Relational Mapping (ORM) library for Python, while Pandas is a fast and efficient data manipulation library for Python.
💻 To install these libraries, you can use the following commands in your terminal:
pip install sqlalchemy pandas pgcli
📝 Once installed, you can import them in your Python script as follows:
import pandas as pd
from sqlalchemy import create_engine
👍 With these libraries imported, we can now read in the CSV file and write its contents to a PostgreSQL database using SQLAlchemy.
Starting a PostgreSQL Container
First, let's start by creating a Docker container for the PostgreSQL database. We will use the official PostgreSQL Docker image available on Docker Hub.
Open your terminal or command prompt and run the following command:
docker run -it \
-e POSTGRES_USER="root" \
-e POSTGRES_PASSWORD="root" \
-e POSTGRES_DB="ny_taxi" \
-v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \
-p 5432:5432 \
postgres:13
This command will download the PostgreSQL 13 image and start a new container. The `-e POSTGRES_USER="root"` and `-e POSTGRES_PASSWORD="root"` options set the database username and password to root, and `-e POSTGRES_DB="ny_taxi"` creates a database named ny_taxi. The `-v` option mounts a local folder so the data persists across container restarts, and `-p 5432:5432` maps the container's port 5432 to the host's port 5432, which is the default port for PostgreSQL.
We can check that the container is running by running the following command:
docker ps
Ingesting Data to PostgreSQL
From here, our playground is a Jupyter notebook. Use the following command to download the data:
!wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz
Now, let's proceed to ingest the data from the CSV file to the PostgreSQL database.
First, let's read the data from the CSV file into a Pandas DataFrame. We will use the read_csv() function from Pandas.
df = pd.read_csv("yellow_tripdata_2021-01.csv.gz")
Great, now that we have read our data, let's peek at the first five rows and inspect the data's structure with the following code:
df.head()
df.info()
The `head()` function displays the first five rows of our DataFrame, while `info()` provides more details about our data, such as the number of rows and columns, column names, data types, and the number of non-null values in each column.
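As a quick illustration (with toy data, not the real taxi CSV), here is what those two calls do on a small DataFrame:

```python
import pandas as pd

# Toy DataFrame standing in for the taxi data (made-up values)
toy = pd.DataFrame({"passenger_count": [1, 2, 1, 3, 1, 2, 4]})

print(toy.head())  # only the first five of the seven rows are shown
toy.info()         # row/column counts, dtypes, non-null counts, memory usage
```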
Next, let's convert the date columns to the datetime format.
df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)
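To see what this conversion buys us, here is a minimal sketch with made-up timestamps: the raw CSV stores the pickup and dropoff times as plain strings, and `pd.to_datetime` turns them into proper datetime values we can filter and sort on.

```python
import pandas as pd

# Made-up pickup times, stored as strings just like in the raw CSV
sample = pd.DataFrame(
    {"tpep_pickup_datetime": ["2021-01-01 00:30:10", "2021-01-01 00:51:20"]}
)
print(sample.tpep_pickup_datetime.dtype)  # object (plain strings)

sample.tpep_pickup_datetime = pd.to_datetime(sample.tpep_pickup_datetime)
print(sample.tpep_pickup_datetime.dtype)  # datetime64[ns]
```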
Now, let's create a connection to the PostgreSQL database using SQLAlchemy.
engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')
Now that we have our data in a Pandas DataFrame, we can create a connection to our PostgreSQL database using SQLAlchemy's `create_engine()` function. We use the connection string `'postgresql://root:root@localhost:5432/ny_taxi'`, since the database name is ny_taxi, the username is root, and the password is also root.
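If you prefer not to hard-code the string, here is a tiny helper (`make_pg_url` is my own name, not a SQLAlchemy API) that assembles the same URL from its parts, so the credentials can later come from environment variables instead:

```python
def make_pg_url(user: str, password: str, host: str, port: int, db: str) -> str:
    """Build a SQLAlchemy-style PostgreSQL connection URL from its parts."""
    return f"postgresql://{user}:{password}@{host}:{port}/{db}"

url = make_pg_url("root", "root", "localhost", 5432, "ny_taxi")
print(url)  # postgresql://root:root@localhost:5432/ny_taxi
```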
Before we write the data to the database, let's print the schema of the table that will be created, using Pandas' `pd.io.sql.get_schema()` function.
# Print the schema of the table to be created
print(pd.io.sql.get_schema(df, name='yellow_trip_data', con=engine))
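Here is a hedged sketch of what `get_schema` returns, using a toy DataFrame and an in-memory SQLite engine (so the column types will differ slightly from the PostgreSQL output):

```python
import pandas as pd
from sqlalchemy import create_engine

# Toy DataFrame with made-up columns; the real df has the full taxi schema
toy = pd.DataFrame({"passenger_count": [1, 2], "fare_amount": [7.5, 12.0]})

# In-memory SQLite engine purely for illustration
ddl = pd.io.sql.get_schema(toy, name="yellow_trip_data", con=create_engine("sqlite://"))
print(ddl)  # a CREATE TABLE statement for yellow_trip_data
```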
Finally, we can write our data to the PostgreSQL database using the `df.to_sql()` function. We will name our table yellow_trip_data and set the `if_exists` parameter to `'replace'` to replace any existing table with the same name.
# Write the DataFrame to the database
df.to_sql(name='yellow_trip_data', con=engine, if_exists='replace')
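For the real January 2021 file, which has over a million rows, a single `to_sql` call can be slow and memory-hungry. A common variation is to stream the CSV in chunks and append each one. The sketch below uses a tiny in-memory CSV and an in-memory SQLite engine purely for illustration; swap in the real file path and the PostgreSQL engine from above to target the actual database.

```python
import io
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite://")  # stand-in for the PostgreSQL engine

# Tiny in-memory CSV standing in for yellow_tripdata_2021-01.csv.gz (made-up rows)
csv_data = io.StringIO(
    "tpep_pickup_datetime,passenger_count\n"
    "2021-01-01 00:30:10,1\n"
    "2021-01-01 00:51:20,2\n"
    "2021-01-01 01:15:00,1\n"
)

first_chunk = True
for chunk in pd.read_csv(csv_data, iterator=True, chunksize=2):
    # Apply the same datetime conversion to each chunk before writing
    chunk.tpep_pickup_datetime = pd.to_datetime(chunk.tpep_pickup_datetime)
    chunk.to_sql(
        name="yellow_trip_data",
        con=engine,
        if_exists="replace" if first_chunk else "append",
        index=False,
    )
    first_chunk = False

total = pd.read_sql("SELECT COUNT(*) AS n FROM yellow_trip_data", con=engine)
print(total.n[0])  # 3
```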
Interact with PostgreSQL in Bash with pgcli
👨💻 pgcli is a command-line interface for PostgreSQL that provides many useful features to interact with a PostgreSQL database. We can use it to access our yellow_trip_data.
Let's start pgcli by running the following command:
pgcli -h localhost -p 5432 -u root -d ny_taxi
where `-h` is the hostname of the PostgreSQL server, `-p` is the port number, `-u` is the username for the PostgreSQL database, and `-d` is the name of the database we want to connect to.
Conclusion
And that's it! We have successfully ingested our data into a PostgreSQL database running in a Docker container.💻🐍📊