DEV Community: totalSophie

Ingesting Data to Postgres

totalSophie — Thu, 11 Jan 2024 10:55:20 +0000

DE Zoomcamp study notes

To run PostgreSQL in Docker, connect to it with pgcli, and execute SQL statements, follow these steps:

Step 1: Install Docker

Make sure you have Docker installed on your machine. You can download it from the official Docker website.

Step 2: Pull PostgreSQL Docker Image

Open a terminal and pull the official PostgreSQL Docker image:

docker pull postgres

Step 3: Run PostgreSQL Container

Run a PostgreSQL container, setting the user, password, and default database through environment variables:

docker run --name mypostgres -e POSTGRES_USER="root" -e POSTGRES_PASSWORD="password" -e POSTGRES_DB="ny_taxi" -v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data -p 5432:5432 -d postgres

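This command starts a PostgreSQL container named 'mypostgres' with the user 'root', the password 'password', and a database named 'ny_taxi', and publishes port 5432 on the host.
-e sets an environment variable inside the container.
-v mounts a host directory as a volume, so the data survives removing the container.
-p maps a host port to a container port.
-d runs the container in the background (detached).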
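Before connecting, it can help to check that the published port is reachable. Below is a minimal sketch using only the standard library; `is_port_open` is a hypothetical helper name, not part of Docker or PostgreSQL. A successful TCP connection only shows that the port is published, not that the database has finished starting up.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable
        return False


if __name__ == "__main__":
    # 5432 is the host port published by the docker run command above
    print("Postgres port reachable:", is_port_open("localhost", 5432))
```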

Step 4: Install pgcli

Install pgcli, a command-line interface for PostgreSQL, on your local machine:

pip install pgcli

Step 5: Connect to PostgreSQL using pgcli

Connect to the PostgreSQL database using pgcli:

pgcli -h localhost -p 5432 -U root -d ny_taxi -W

-h is the host, here localhost.
-p is the port.
-U is the username.
-d is the database name.
-W forces a password prompt.

After running the command, enter the password when prompted (use 'password' if you followed the previous steps).

Step 6: Execute SQL Statements

Once connected, you can execute SQL statements directly in the pgcli interface. For example:

-- Create a new database
CREATE DATABASE mydatabase;

-- Switch to the new database
\c mydatabase

-- Create a table
CREATE TABLE mytable (
    id serial PRIMARY KEY,
    name VARCHAR (100),
    age INT
);

-- Insert some data
INSERT INTO mytable (name, age) VALUES ('John', 25), ('Jane', 30);

-- Query the data
SELECT * FROM mytable;

Feel free to modify the SQL statements according to your requirements.

Step 7: To Exit pgcli and Stop the PostgreSQL Container

To exit pgcli, type \q. After that, stop and remove the PostgreSQL container:

docker stop mypostgres
docker rm mypostgres

Data Ingestion from CSV to PostgreSQL using Pandas and SQLAlchemy

I used a Jupyter notebook to insert the data in chunks, working with the 2021 NY yellow taxi trip data.

Step 1: Setting Up the Environment:

Use Pandas to read the CSV file in chunks for more efficient processing.
Define a PostgreSQL connection string using SQLAlchemy.

Step 2: Creating the Table Schema:

Read the first chunk of data to create the initial table schema in the database.
Utilize Pandas' to_sql method to replace or create the table in the PostgreSQL database.

Step 3: Iterative Data Insertion:

Iterate through the remaining chunks of the CSV file.
Convert the timestamp columns with Pandas' to_datetime so they are stored as timestamps rather than text.
Append each chunk to the existing PostgreSQL table.
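The chunking idea in the steps above can be sketched without Pandas or a database. This is only an illustration using the standard library: `iter_chunks` is a hypothetical helper, and the in-memory CSV stands in for the real taxi file.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows until the input is exhausted."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


# A tiny in-memory CSV standing in for the real file
sample = io.StringIO("id,fare\n1,3.5\n2,9.5\n3,14.5\n4,8.0\n5,22.0\n")
reader = csv.DictReader(sample)

chunks = list(iter_chunks(reader, 2))
chunk_sizes = [len(chunk) for chunk in chunks]
print(chunk_sizes)  # [2, 2, 1]
```

Only one chunk is held in memory at a time, which is why this pattern suits files too large to load at once.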

Final Code:

from sqlalchemy import create_engine
from time import time
import pandas as pd

# specify the database you want to use based on the docker run command we had
# postgresql://username:password@localhost:port/dbname
db_url = 'postgresql://root:password@localhost:5432/ny_taxi'
engine = create_engine(db_url)

# Chunksize for reading CSV and inserting into the database
chunk_size = 100000

# Create an iterator for reading CSV in chunks
csv_iter = pd.read_csv('2021_Yellow_Taxi_Trip_Data.csv', iterator=True, chunksize=chunk_size)

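# Get the first chunk, fix its timestamp columns, and use it to create the table
# (without the conversion the table would store the timestamps as text)
first_chunk = next(csv_iter)
first_chunk['tpep_pickup_datetime'] = pd.to_datetime(first_chunk['tpep_pickup_datetime'])
first_chunk['tpep_dropoff_datetime'] = pd.to_datetime(first_chunk['tpep_dropoff_datetime'])
first_chunk.to_sql(name='yellow_taxi_data', con=engine, if_exists='replace', index=False)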

# Loop through the remaining chunks and append to the table
for chunk in csv_iter:
    t_start = time()
    # Fix timestamp type issue
    chunk['tpep_pickup_datetime'] = pd.to_datetime(chunk['tpep_pickup_datetime'])
    chunk['tpep_dropoff_datetime'] = pd.to_datetime(chunk['tpep_dropoff_datetime'])

    # Append data to the existing table
    chunk.to_sql(name='yellow_taxi_data', con=engine, if_exists='append', index=False)

    # Print a message and benchmark the time
    t_end = time()
    print(f'Inserted another chunk... took {t_end - t_start:.3f} second(s)')
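One thing to watch with the connection string: a password containing characters such as `@` or `:` has to be percent-encoded, otherwise the URL is parsed incorrectly. A small sketch of that, where `build_db_url` is a hypothetical helper and not part of SQLAlchemy:

```python
from urllib.parse import quote_plus


def build_db_url(user, password, host, port, db):
    """Build a postgresql:// URL, percent-encoding the user and password."""
    return f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{db}"


db_url = build_db_url("root", "password", "localhost", 5432, "ny_taxi")
print(db_url)  # postgresql://root:password@localhost:5432/ny_taxi
```

The resulting string can be passed to `create_engine` in the same way as the hard-coded `db_url` above.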

Extra, Extra!!!

Using `argparse` to Parse Command Line Arguments

Utilizing the argparse standard library to efficiently parse command line arguments, this script downloads a CSV file from a specified URL and ingests its data into a PostgreSQL database.

from time import time
from sqlalchemy import create_engine
import pandas as pd
import argparse
import os

def main(params):
    user = params.user
    password = params.password
    host = params.host
    port = params.port
    db = params.db
    table_name = params.table_name
    url = params.url

    csv_name = 'output.csv'

    # Download the CSV by running the wget shell command from Python (wget must be installed)
    os.system(f"wget {url} -O {csv_name}")

    engine = create_engine(f'postgresql://{user}:{password}@{host}:{port}/{db}')

    df_iter = pd.read_csv(csv_name, iterator=True, chunksize=100000)
    df = next(df_iter)

    df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
    df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)

    # Adding the column names
    df.head(n=0).to_sql(name=table_name, con=engine, if_exists="replace")

    # Adding the first batch of rows
    df.to_sql(name=table_name, con=engine, if_exists="append")

    # Insert the remaining chunks; the for loop ends cleanly when the iterator is exhausted
    for df in df_iter:
        t_start = time()

        df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
        df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)

        df.to_sql(name=table_name, con=engine, if_exists="append")

        t_end = time()

        print('Inserted another chunk... took %.3f second(s)' % (t_end - t_start))

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Ingest CSV data to Postgres")

    parser.add_argument('--user', help="user name for postgres")
    parser.add_argument('--password', help="password for postgres")
    parser.add_argument('--host', help="host for postgres")
    parser.add_argument('--port', help="port for postgres")
    parser.add_argument('--db', help="database name for postgres")
    parser.add_argument('--table_name', help="name of the table where we will write the results to")
    parser.add_argument('--url', help="url of the CSV")

    args = parser.parse_args()

    main(args)
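A possible refinement of the parser, shown as a sketch rather than the course's version: `required=True`, `type=int`, and `default` are standard argparse options, while `build_parser` is a hypothetical helper. Passing an explicit list to `parse_args` lets you try the parser without a real command line.

```python
import argparse


def build_parser():
    """Create a parser with a few of the ingestion script's arguments."""
    parser = argparse.ArgumentParser(description="Ingest CSV data to Postgres")
    parser.add_argument("--user", required=True, help="user name for postgres")
    parser.add_argument("--port", type=int, default=5432, help="port for postgres")
    parser.add_argument("--table_name", required=True, help="name of the target table")
    return parser


# Parse an explicit argument list instead of sys.argv
args = build_parser().parse_args(["--user", "root", "--table_name", "yellow_taxi_trips"])
print(args.user, args.port, args.table_name)  # root 5432 yellow_taxi_trips
```

With `required=True`, a missing argument stops the script with a usage message instead of failing later with a confusing connection error.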

Dockerizing Ingestion Script

The Dockerfile for the ingestion script:

FROM python:3.9.1

RUN apt-get update && apt-get install -y wget
RUN pip install pandas sqlalchemy psycopg2

WORKDIR /app
COPY ingest_data.py ingest_data.py

ENTRYPOINT ["python", "ingest_data.py"]

The psycopg2 package is included so that Python (through SQLAlchemy) can connect to the PostgreSQL database; it is the database adapter, or driver.

To build the Docker image, execute the following command:


docker build -t taxi_ingest:v001 .

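Now run the image instead of the script, passing the Docker network and using the Postgres container name as the database host. The network (pgadmin-network) and the container (pg-database) are created in the pgAdmin section below.

If the CSV file is on your machine, you can serve it over HTTP by running the following in the file's directory, then access it through your machine's IP address (port 8000 by default):
python3 -m http.server


# If your file is local
URL="http://192.x.x.x:8000/2021_Yellow_Taxi_Trip_Data.csv"
docker run -it \
  --network=pgadmin-network \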
  taxi_ingest:v001 \
  --user=root \
  --password=password \
  --host=pg-database \
  --port=5432 \
  --db=ny_taxi \
  --table_name=yellow_taxi_trips \
  --url="${URL}"

Not yet...

Connecting pgAdmin and Postgres

pgcli is handy for quickly inspecting data, but a more convenient way to work with a Postgres database is pgAdmin, a web-based GUI tool.

To run pgAdmin in a Docker container, you can follow these steps:

Pull the pgAdmin Docker Image: Use the following command to pull the official pgAdmin Docker image from Docker Hub.


   docker pull dpage/pgadmin4

Create a Docker Network: It's a good practice to create a Docker network to facilitate communication between the PostgreSQL container and the pgAdmin container.


   docker network create pgadmin-network

Run the PostgreSQL Container: Modify the earlier Postgres run command so the container joins the network and is named pg-database:


docker run --name pg-database \
--network pgadmin-network \
-e POSTGRES_USER="root" \
-e POSTGRES_PASSWORD="password" \
-e POSTGRES_DB="ny_taxi" \
-v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \
-p 5432:5432 \
-d postgres

Replace password with your desired PostgreSQL password.

Run the pgAdmin Container: Now, you can run the pgAdmin container and link it to the PostgreSQL container.


   docker run --name pgadmin-container \
              --network pgadmin-network \
              -e PGADMIN_DEFAULT_EMAIL=myemail@example.com \
              -e PGADMIN_DEFAULT_PASSWORD=mypassword \
              -p 5055:80 \
              -d dpage/pgadmin4

Replace myemail@example.com and mypassword with your desired pgAdmin login credentials.

Access pgAdmin: Open your web browser and navigate to http://localhost:5055. Log in with the credentials you provided in the previous step.

Add PostgreSQL Server: In pgAdmin, click on "Add New Server" and fill in the details needed to connect to the PostgreSQL server running in the Docker container.

Host name/address: pg-database (the name of your PostgreSQL container)
Port: 5432
Username: root
Password: password (the value of POSTGRES_PASSWORD set when running the PostgreSQL container)

Now, you should be able to manage your PostgreSQL server using pgAdmin in a Docker container. Adjust the commands and parameters according to your specific requirements and environment.

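Alternatively, we can use Docker Compose

Create a docker-compose.yml file. You don't need to specify the network: Compose creates a default network for the services, and they can reach each other by service name (pgdatabase here).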



services:
  pgdatabase:
    image: postgres:latest
    environment:
      - POSTGRES_USER=root
      - POSTGRES_PASSWORD=password
      - POSTGRES_DB=ny_taxi
    volumes:
      - "./ny_taxi_postgres_data:/var/lib/postgresql/data:rw"
    ports:
      - "5432:5432"
    container_name: mypostgres

  pgadmin:
    image: dpage/pgadmin4
    environment:
      - PGADMIN_DEFAULT_EMAIL=myemail@example.com
      - PGADMIN_DEFAULT_PASSWORD=mypassword
    ports:
      - "5055:80"
    container_name: pgadmin

To start Docker Compose: docker-compose up
To run Docker Compose in the background: docker-compose up -d
To view Docker Compose containers: docker-compose ps
To stop Docker Compose and remove the containers: docker-compose down
To also remove the volumes declared in the Compose file: docker-compose down -v

Beginner's guide to Apache Flink

totalSophie — Mon, 08 Jan 2024 14:28:42 +0000

Apache Flink is a powerful and versatile open-source stream processing framework that goes beyond traditional batch processing to handle real-time data streaming and analytics. In this article, we will explore the fundamental concepts of Apache Flink, compare batch and stream processing, highlight the differences between Flink and Apache Spark, delve into system requirements, installation procedures, Maven usage, and discuss Flink's APIs and transformations.

Understanding Apache Flink

What is Apache Flink?

Apache Flink is a distributed stream processing framework designed for big data processing and analytics. It excels in handling large volumes of data with low-latency processing capabilities, making it suitable for real-time applications. Flink supports event time processing, fault-tolerance, and stateful processing, enabling developers to build robust and scalable data processing applications.

Batch Processing vs. Stream Processing

Batch Processing

Batch processing involves processing data in chunks or batches at a time. It is suitable for scenarios where data can be collected and processed in a non-continuous manner. Examples of batch processing include nightly ETL (Extract, Transform, Load) jobs, data warehousing, and large-scale data analysis.

Stream Processing

Stream processing deals with the continuous and real-time processing of data as it is generated. It is ideal for scenarios where low-latency and near-real-time insights are crucial.
Examples of stream processing include fraud detection, monitoring systems, and real-time analytics on social media feeds.
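The difference can be illustrated with a small plain-Python sketch (not Flink code; the function names are made up for this example). The batch version needs the whole dataset before it produces one answer, while the streaming version emits an updated result for every element as it arrives.

```python
from typing import Iterable, Iterator, List


def batch_total(amounts: List[float]) -> float:
    """Batch style: all data is collected first, then processed once."""
    return sum(amounts)


def running_total(amounts: Iterable[float]) -> Iterator[float]:
    """Stream style: emit an updated result after each incoming element."""
    total = 0.0
    for amount in amounts:
        total += amount
        yield total


fares = [3.5, 9.5, 14.5]
print(batch_total(fares))          # 27.5
print(list(running_total(fares)))  # [3.5, 13.0, 27.5]
```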

Apache Flink vs. Apache Spark

Core Differences

While both Apache Flink and Apache Spark are powerful big data processing frameworks, they have some key differences:

Processing Model:
- Flink focuses on event time processing and supports true event-driven stream processing.
- Spark primarily follows a micro-batch processing model, which introduces slight latency in stream processing.
State Management:
- Flink emphasizes stateful processing, offering built-in support for managing state.
- Spark typically relies on external storage solutions for state management.
Fault Tolerance:
- Flink achieves fault tolerance through periodic distributed snapshots (checkpoints) of application state.
- Spark employs lineage information and recomputation for fault tolerance.

Layers of Apache Flink Ecosystem

Apache Flink System Requirements

Before diving into Apache Flink, ensure that your system meets the following requirements:

Java: Flink is a Java-based framework, so ensure that Java is installed on your machine.
Memory: Sufficient RAM to accommodate Flink's processes.
Disk Space: Adequate disk space for Flink's data storage requirements.

Installation and Maven Usage

Installing Apache Flink

Download:
- Visit the Apache Flink download page and select the desired version.
- Follow the installation instructions provided for your operating system.
Configuration:
- Customize Flink's configuration files based on your specific requirements.

Using Maven with Apache Flink

Maven simplifies the management of project dependencies and builds. To use Flink with Maven:

Add Dependency:

Include the Flink dependency in your pom.xml file:

   <dependencies>
       <dependency>
           <groupId>org.apache.flink</groupId>
           <artifactId>flink-java</artifactId>
           <version>${flink.version}</version>
       </dependency>
   </dependencies>

Or
You can create a project based on an Archetype with the Maven command below:

$ mvn archetype:generate                \
  -DarchetypeGroupId=org.apache.flink   \
  -DarchetypeArtifactId=flink-quickstart-java \
  -DarchetypeVersion=1.18.0

Build and Run

To compile and execute your Apache Flink project, follow these steps:

Build Project: Execute the following Maven command to build your Flink project:

   mvn clean package

Run the Flink application using the generated JAR file.

Apache Flink APIs and Transformations

Flink APIs

Flink provides high-level APIs:

DataStream API: Used for stream processing applications. Supported in Java, Scala, and Python.
DataSet API: Designed for batch processing applications. Supported in Java and Scala.

Data Sources for DataStream API

The DataStream API in Apache Flink supports various data sources for stream processing applications. Common data sources include:

Kafka: Flink can consume and process data from Kafka topics in real-time.
Socket Streams: Flink allows the ingestion of data from socket streams, making it versatile for various streaming scenarios.
File Systems: Data can be read from various file systems, such as HDFS or local file systems, providing flexibility in handling different data storage formats.

Transformations in Flink

Flink transformations are the building blocks of data processing pipelines. Key transformations include:

Map: Applies a function to each element in the dataset.
Filter: Retains elements satisfying a specified condition.
KeyBy: Groups elements based on a key.
Window: Defines time or count-based windows for stream processing.
Reduce: Aggregates elements in a window.
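To get a feel for how these transformations chain together, here is a plain-Python sketch of the same ideas. It does not use the Flink API; `process` and the sample events are invented for illustration, and the window is a simple tumbling count window per key.

```python
from collections import defaultdict


def process(events, window_size=2):
    """Filter, map, key, window, and reduce a stream of (key, value) events."""
    buffers = defaultdict(list)  # per-key window contents (the "state")
    for key, value in events:
        if value <= 0:  # Filter: drop non-positive values
            continue
        value = value * 10  # Map: transform each element
        buffers[key].append(value)  # KeyBy: route the element to its key's buffer
        if len(buffers[key]) == window_size:  # Window: tumbling count window
            yield key, sum(buffers[key])  # Reduce: aggregate the window
            buffers[key].clear()


events = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5), ("a", -1)]
results = list(process(events))
print(results)  # [('a', 40), ('b', 60)]
```

The last positive event for key "a" stays in its buffer because its window never fills; a real stream processor would keep that state until more elements arrive or a trigger fires.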

Conclusion

Let's create a project!

Demystifying MLOps: Week 1

totalSophie — Sun, 18 Jun 2023 06:51:37 +0000

Notes from MLOps ZoomCamp

1.1 What is MLOps

MLOps (Machine Learning Operations) refers to the practices, processes, and tools used to manage the entire lifecycle of machine learning models. It bridges the gap between data scientists, software engineers, and operations teams to ensure successful deployment and maintenance of ML models.

Key Components

Data Management and Versioning
Model Training and Evaluation
Deployment and Infrastructure
Continuous Integration and Delivery
Monitoring and Governance

1.2 Environment Preparation

You can use an EC2 instance or your local environment

Step 1

Download and install the Anaconda distribution of Python:

wget https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh
bash Anaconda3-2022.05-Linux-x86_64.sh

Step 2

Update existing packages:

sudo apt update

Step 3

Install Docker:

sudo apt install docker.io

Step 4

Create a separate directory for the installation and download a Docker Compose release (v2.18.0 here; check the releases page for the latest version):

mkdir soft
cd soft
wget https://github.com/docker/compose/releases/download/v2.18.0/docker-compose-linux-x86_64 -O docker-compose
chmod +x docker-compose
nano ~/.bashrc

Add the following line to the .bashrc file:

export PATH="${HOME}/soft:${PATH}"

Save and exit the .bashrc file, then apply the changes:

source ~/.bashrc

Step 5

Run Docker to check if it's working:

docker run hello-world

1.3 Training a ride duration prediction model

Dataset

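The dataset used is the 2022 NYC green taxi trip records.

More information on the data can be found at https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf (this is the data dictionary for the yellow taxi records; the green taxi records use lpep_ rather than tpep_ for the timestamp columns).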

Download the dataset

!wget https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-01.parquet

Imports

Import required packages

import pandas as pd 
import pickle 
import seaborn as sns 
import matplotlib.pyplot as plt 

from sklearn.feature_extraction import DictVectorizer 
from sklearn.linear_model import LinearRegression 
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

Reading the file:

jan_data = pd.read_parquet("./data/green_tripdata_2022-01.parquet")
jan_data.head()

	VendorID	lpep_pickup_datetime	lpep_dropoff_datetime	store_and_fwd_flag	RatecodeID	PULocationID	DOLocationID	passenger_count	trip_distance	fare_amount	extra	mta_tax	tip_amount	improvement_surcharge	total_amount	payment_type	trip_type	congestion_surcharge
0	2	2022-01-01 00:14:21	2022-01-01 00:15:33	N	1	42	42	1	0.44	3.5	0.5	0.5	0	0.3	4.8	2	1	0
1	1	2022-01-01 00:20:55	2022-01-01 00:29:38	N	1	116	41	1	2.1	9.5	0.5	0.5	0	0.3	10.8	2	1	0
2	1	2022-01-01 00:57:02	2022-01-01 01:13:14	N	1	41	140	1	3.7	14.5	3.25	0.5	4.6	0.3	23.15	1	1	2.75
3	2	2022-01-01 00:07:42	2022-01-01 00:15:57	N	1	181	181	1	1.69	8	0.5	0.5	0	0.3	9.3	2	1	0
4	2	2022-01-01 00:07:50	2022-01-01 00:28:52	N	1	33	170	1	6.26	22	0.5	0.5	5.21	0.3	31.26	1	1	2.75

jan_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62495 entries, 0 to 62494
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   VendorID               62495 non-null  int64         
 1   lpep_pickup_datetime   62495 non-null  datetime64[ns]
 2   lpep_dropoff_datetime  62495 non-null  datetime64[ns]
 3   store_and_fwd_flag     56200 non-null  object        
 4   RatecodeID             56200 non-null  float64       
 5   PULocationID           62495 non-null  int64         
 6   DOLocationID           62495 non-null  int64         
 7   passenger_count        56200 non-null  float64       
 8   trip_distance          62495 non-null  float64       
 9   fare_amount            62495 non-null  float64       
 10  extra                  62495 non-null  float64       
 11  mta_tax                62495 non-null  float64       
 12  tip_amount             62495 non-null  float64       
 13  tolls_amount           62495 non-null  float64       
 14  ehail_fee              0 non-null      object        
 15  improvement_surcharge  62495 non-null  float64       
 16  total_amount           62495 non-null  float64       
 17  payment_type           56200 non-null  float64       
 18  trip_type              56200 non-null  float64       
 19  congestion_surcharge   56200 non-null  float64       
dtypes: datetime64[ns](2), float64(13), int64(3), object(2)
memory usage: 9.5+ MB

Calculate duration of trip from dropoff and pickup times

jan_dropoff = pd.to_datetime(jan_data["lpep_dropoff_datetime"])
jan_pickup = pd.to_datetime(jan_data["lpep_pickup_datetime"])

jan_data["duration"] = jan_dropoff - jan_pickup

# Convert the values to minutes
jan_data["duration"] = jan_data.duration.apply(lambda td: td.total_seconds()/60)
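As a quick sanity check of the arithmetic, here is the same calculation for a single trip with the standard library, using the pickup and dropoff times from the first row printed above:

```python
from datetime import datetime

# Pickup and dropoff of the first row in the January data
pickup = datetime(2022, 1, 1, 0, 14, 21)
dropoff = datetime(2022, 1, 1, 0, 15, 33)

# Subtracting datetimes gives a timedelta; convert it to minutes
duration_minutes = (dropoff - pickup).total_seconds() / 60
print(duration_minutes)  # 1.2
```

On a Pandas timedelta column the same conversion can also be written without apply, as jan_data.duration.dt.total_seconds() / 60.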

Check the distribution of the duration

jan_data.duration.describe(percentiles=[0.95, 0.98, 0.99])

count    62495.000000
mean        19.019387
std         78.215732
min          0.000000
50%         11.583333
95%         35.438333
98%         49.722667
99%         68.453000
max       1439.466667
Name: duration, dtype: float64

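# distplot is deprecated in recent seaborn releases; histplot with kde=True gives a similar plot
sns.histplot(jan_data.duration, kde=True)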

We can see the data is skewed due to the presence of outliers.
Keep only the records with a duration between 1 and 60 minutes:

jan_data = jan_data[(jan_data.duration >= 1) & (jan_data.duration <= 60)]

One Hot Encoding

Using DictVectorizer for one-hot encoding.
The categorical features considered are the pickup and dropoff locations.

categorical = ["PULocationID", "DOLocationID"]
numerical = ["trip_distance"]

Convert the columns from integers to strings, so that DictVectorizer treats them as categories:

jan_data.loc[:, categorical] = jan_data[categorical].astype(str)

# Change our values to dictionaries

train_jan_data = jan_data[categorical + numerical].to_dict(orient='records')

dv = DictVectorizer()
X_train_jan = dv.fit_transform(train_jan_data)

# Convert the feature matrix to an array
fm_array = X_train_jan.toarray()

# Get the dimensionality of the feature matrix
fm_array.shape

(59837, 471)
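To see where the columns come from, here is a rough pure-Python sketch of the idea behind this kind of encoding. It is not scikit-learn's implementation, and `one_hot` is a hypothetical helper: each distinct string value becomes its own 0/1 column named "feature=value", while numeric features keep a single column.

```python
def one_hot(records):
    """Encode a list of dicts: strings become 0/1 columns, numbers pass through."""
    names = set()
    for rec in records:
        for key, value in rec.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    feature_names = sorted(names)
    index = {name: i for i, name in enumerate(feature_names)}

    matrix = []
    for rec in records:
        row = [0.0] * len(feature_names)
        for key, value in rec.items():
            if isinstance(value, str):
                row[index[f"{key}={value}"]] = 1.0  # one column per category
            else:
                row[index[key]] = float(value)  # numeric value kept as-is
        matrix.append(row)
    return feature_names, matrix


records = [
    {"PULocationID": "42", "DOLocationID": "42", "trip_distance": 0.44},
    {"PULocationID": "116", "DOLocationID": "41", "trip_distance": 2.1},
]
feature_names, matrix = one_hot(records)
print(feature_names)
print(matrix)
```

Under this scheme, the column count is the number of distinct pickup locations plus the number of distinct dropoff locations plus one for trip_distance.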

Python function that performs the steps above

Custom function to read and preprocess the data

def read_dataframe(filename):
    # Read the parquet file
    df = pd.read_parquet(filename)

    # Calculate the duration
    df_dropoff = pd.to_datetime(df["lpep_dropoff_datetime"])
    df_pickup = pd.to_datetime(df["lpep_pickup_datetime"])
    df["duration"] = df_dropoff - df_pickup

    # Convert the duration to minutes and remove outliers
    df["duration"] = df.duration.apply(lambda td: td.total_seconds()/60)
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    # Preparation for OneHotEncoding using DictVectorizer
    categorical = ["PULocationID", "DOLocationID"]
    df[categorical] = df[categorical].astype(str)

    return df

Fitting Linear Regression Model

# Using January data as train and Feb as Validation

df_train = read_dataframe("./data/green_tripdata_2022-01.parquet")
df_val = read_dataframe("./data/green_tripdata_2022-02.parquet")

dv = DictVectorizer()

categorical = ["PULocationID", "DOLocationID"]
numerical = ["trip_distance"]


train_dicts= df_train[categorical + numerical].to_dict(orient='records')
X_train = dv.fit_transform(train_dicts)

val_dicts= df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dicts)

target = 'duration'
y_train = df_train[target].values
y_val = df_val[target].values

lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_val)

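# RMSE on the validation set; squared=False returns the root of the MSE
# (newer scikit-learn releases replace this argument with root_mean_squared_error)
mean_squared_error(y_val, y_pred, squared=False)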

8.364575685718151
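The metric here is the root mean squared error (RMSE), which is in the same unit as the target, so the value above reads as minutes. A small sketch of the formula in plain Python, with made-up numbers rather than the taxi data:

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the mean of squared differences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Two trips, each predicted 3 minutes off
print(rmse([10.0, 20.0], [13.0, 17.0]))  # 3.0
```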

Try other models like Lasso and Ridge

Save the model

with open('models/lin_reg.bin', 'wb') as f_out:
    pickle.dump((dv, lr), f_out)
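Loading the file back works the same way in reverse. The sketch below uses plain dictionaries as hypothetical stand-ins for the vectorizer and the model, and a temporary directory instead of the real models folder. Note that open() does not create missing directories, so the models directory has to exist before saving.

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins for the fitted DictVectorizer and LinearRegression
vectorizer_stub = {"features": ["trip_distance"]}
model_stub = {"coef": [0.5, 1.5]}

with tempfile.TemporaryDirectory() as tmp_dir:
    models_dir = os.path.join(tmp_dir, "models")
    os.makedirs(models_dir, exist_ok=True)  # create the directory before writing
    path = os.path.join(models_dir, "lin_reg.bin")

    # Save both objects together as a tuple
    with open(path, "wb") as f_out:
        pickle.dump((vectorizer_stub, model_stub), f_out)

    # Load them back in the same order
    with open(path, "rb") as f_in:
        loaded_dv, loaded_model = pickle.load(f_in)

print(loaded_dv, loaded_model)
```

Saving the vectorizer together with the model matters because new data must be transformed with the same fitted vectorizer before calling predict.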

Cover Photo by Alina Grubnyak on Unsplash