<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: danial.shabbir</title>
    <description>The latest articles on DEV Community by danial.shabbir (@danielox).</description>
    <link>https://dev.to/danielox</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1377750%2Fea2fdb43-9103-4248-9f1a-1ebd8ee66c44.jpeg</url>
      <title>DEV Community: danial.shabbir</title>
      <link>https://dev.to/danielox</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/danielox"/>
    <language>en</language>
    <item>
      <title>Python Data Engineering: Comprehensive Workflow for Data Modeling, Analytics with DuckDB</title>
      <dc:creator>danial.shabbir</dc:creator>
      <pubDate>Sat, 23 Mar 2024 17:39:09 +0000</pubDate>
      <link>https://dev.to/danielox/python-data-engineering-comprehensive-workflow-for-data-modeling-analytics-with-duckdb-4a1e</link>
      <guid>https://dev.to/danielox/python-data-engineering-comprehensive-workflow-for-data-modeling-analytics-with-duckdb-4a1e</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;Our primary goal is to convert the raw dataset into structured Dimension and Fact tables, allowing for efficient analysis and modelling. This process involves data cleaning and creating specific dimension tables covering attributes like date-time, passenger count, trip distance, payment types, and more.&lt;/p&gt;

&lt;p&gt;We will touch the fundamentals of data modelling, granularity and basic data engineering terminologies in simple human friendly terms.&lt;/p&gt;

&lt;p&gt;Additionally, we’ll explore automation using Python libraries, such as pandas, DuckDB and highlight the advantages of the Parquet file format, and perform analytical queries to derive meaningful insights.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Python 3&lt;/li&gt;
&lt;li&gt;Pandas&lt;/li&gt;
&lt;li&gt;DuckDB&lt;/li&gt;
&lt;li&gt;Jupyter Notebook (optional)&lt;/li&gt;
&lt;li&gt;Vs Code&lt;/li&gt;
&lt;li&gt;— Vs code provides notebook support out of the box, if open a file with extension of .ipynb it will open it in notebook like interface&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introduction to the Data
&lt;/h2&gt;

&lt;p&gt;The provided dataset contains detailed records of taxi trips from the year 2023 trips. It captures pivotal information such as pick-up and drop-off dates/times, trip distances, fare details, passenger counts, and more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It’s important to note that this data was collected and supplied to the TLC by technology providers under authorised programs. Therefore, while being valuable, the accuracy and completeness of the dataset might not be guaranteed.&lt;/p&gt;

&lt;p&gt;Dataset Download Link:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page"&gt;https://github.com/DanielOX/Data-Engineering-Workflow-DUCKDB&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Github&lt;/strong&gt; Reository for the code used in article:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/DanielOX/Data-Engineering-Workflow-DUCKDB"&gt;https://github.com/DanielOX/Data-Engineering-Workflow-DUCKDB&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Dictionary / Data Catalog&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A data dictionary or data catalog is like a detailed map or directory that contains descriptions and explanations of all the datasets and their elements (like tables, columns, or fields)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Below image contains all the data points we will getting and their descriptions from the dataset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdlm9494wbf5x5tuecrr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdlm9494wbf5x5tuecrr.png" alt="NYC Taxi Data Dictionary" width="800" height="1080"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Process Flow of the Pipeline
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1 — Loading and Data Preprocessing
&lt;/h3&gt;

&lt;p&gt;The analysis starts with loading the data into a Pandas DataFrame within the Jupyter Notebook environment. This initial step involves crucial data preprocessing tasks such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Checking data types and ensuring appropriate formats for date-time columns.&lt;/li&gt;
&lt;li&gt;Removing duplicate entries to maintain data integrity.&lt;/li&gt;
&lt;li&gt;Generating a unique Trip ID for each trip record for identification purposes. (also known as surrogate key).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2 — Constructing Dimension Tables
&lt;/h3&gt;

&lt;p&gt;Once, the data is fully cleaned. We creates dimension tables for different entities&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension Name&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;datetime_dim&lt;/td&gt;
&lt;td&gt;Captures datetime-related attributes such as pickup and drop-off hours, days, months, years, and weekdays.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;passenger_count_dim&lt;/td&gt;
&lt;td&gt;Contains information about passenger count.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;trip_distance_dim&lt;/td&gt;
&lt;td&gt;Holds details about trip distances.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;rate_code_dim&lt;/td&gt;
&lt;td&gt;Provides insights into different rate codes and their corresponding names.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;location_dim&lt;/td&gt;
&lt;td&gt;A lookup table for locations that maps locationID for pickup/dropoff to location_name.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pickup_location_dim&lt;/td&gt;
&lt;td&gt;Stores data about pickup locations.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dropoff_location_dim&lt;/td&gt;
&lt;td&gt;Holds data regarding drop-off locations.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;payment_type_dim&lt;/td&gt;
&lt;td&gt;Contains information about various payment types.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  3 — Building the Fact Table
&lt;/h3&gt;

&lt;p&gt;The Fact table, the heart of a dimensional model, is constructed by merging various Dimension tables. It encapsulates essential information such as trip details, fare amounts, payment details, dropoff_location_dim, pickup_location_dim, passenger details and more.&lt;/p&gt;

&lt;p&gt;It encapsulates essential information such as trip details, fare amounts, payment details, dropoff_location_dim, pickup_location_dim, passenger details and more.&lt;/p&gt;

&lt;h3&gt;
  
  
  Diagram of Data Model
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu6jwkwhspiuynwzrf30r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu6jwkwhspiuynwzrf30r.png" alt="Diagram of Data Model" width="800" height="526"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4 — Analytics using Our Data Model
&lt;/h3&gt;

&lt;p&gt;We will write some analytical queries using duckdb to query data from our data model. This will present and help us identify different usecases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Installations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In order to proceed, We will need to install some python dependencies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;import necessary packages in python&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#load data from `parquet` file
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://github.com/DanielOX/Data-Engineering-Workflow-DUCKDB/raw/main/nyc_taxi_trip_data/2023/2023.parquet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# display dataframe
&lt;/span&gt;&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# meta info on dataframe - df.shape outputs (rows, columns). 
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Dataset Contains: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows,  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; columns&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;outputs:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqhtgs7v8ix88syqlqvm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqhtgs7v8ix88syqlqvm.png" alt="showing first 15 columns to large data frame" width="720" height="91"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Some details about parquet file format&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CSV files store data in rows with values separated by delimiters, while JSON uses key-value pairs for structured data.&lt;/p&gt;

&lt;p&gt;Parquet, however, arranges data by columns, allowing more efficient storage and retrieval of column-specific information. It employs advanced compression techniques and encoding methods, like Run-Length and Dictionary Encoding, resulting in smaller file sizes and faster query performance.&lt;/p&gt;

&lt;p&gt;This optimized structure makes Parquet well-suited for analytical workloads, big data processing, and data warehousing, as it significantly reduces storage needs and enhances processing speeds compared to CSV and JSON formats.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I will always prefer working with parquet format for my analytical projects, even if the source file csv/json, i will always try to convert it into parquet format.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;**Preprocessing Steps &amp;amp; Surrogate Key&lt;/p&gt;

&lt;p&gt;We will perform a basic preprocessing step&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Drop duplicates&lt;/li&gt;
&lt;li&gt;Add a surrogate key&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each record in dataset represents a trip. We can apply complex hashing (md5, sha256) of concatenating multiple columns to represent a unique record for each trip or we can just assign the monotically increasing id. Both can work, i will be applying monotonically increasing id. Pandas provide this out of the box when we load in the dataframe. it is called &lt;code&gt;index&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Just for the &lt;strong&gt;reference&lt;/strong&gt;, if some of the folks wants to take hashing route. one can apply the logic as below. please note it’s gonna be a little slow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Will not be using in this article, code written here just for reference only
# Continue from the below code block
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;genMd5Hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;arg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt; Function to generate md5 when given a string &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;md5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;trip_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;genMd5Hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Just adding for sequence. the trip_id columns would be located at the end, so i will filtered out all the cols except `trip_id`
&lt;/span&gt;
&lt;span class="n"&gt;cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;trip_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# combining array in this way ensures `trip_id` is always on the first columns, just for readability purposes
&lt;/span&gt;
&lt;span class="n"&gt;cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;trip_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;cols&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cols&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# this will drop duplicates
&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# It will insert `trip_id` to the first location in columns
&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;trip_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;output:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fersm5hkzxkmqffuaqqjb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fersm5hkzxkmqffuaqqjb.png" alt="dataframe with trip_id" width="720" height="197"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see trip_id column was added, which is nothing by a monotonically increasing id or index from data frame.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Modelling
&lt;/h2&gt;

&lt;p&gt;Now preprocessing of the data is completed. The next step would be to create data model.&lt;/p&gt;

&lt;p&gt;Some definitions of &lt;strong&gt;data model&lt;/strong&gt; and &lt;strong&gt;dimnensions&lt;/strong&gt; in a layman terms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Definitions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Data Model&lt;/strong&gt;&lt;br&gt;
A data model is like a blueprint or a plan that defines how data is organized, structured, and related within a database or system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why do we need it?&lt;/strong&gt;&lt;br&gt;
It helps in understanding the different types of data, their relationships, and how they can be accessed or retrieved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analogy&lt;/strong&gt;&lt;br&gt;
Imagine you’re building a house. Before construction begins, you have architectural plans that detail where rooms will be located, how they connect, and what materials will be used. Similarly, a data model outlines how data will be stored, what types of data are needed, and how different pieces of data relate to each other.&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;Dimension&lt;/strong&gt;&lt;br&gt;
A dimension in a model refers to a way of organizing or categorizing data to help understand and analyze information more effectively. It’s like grouping or labeling data based on specific attributes or characteristics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why do we need it?&lt;/strong&gt;&lt;br&gt;
Dimensions provide context and structure to data, making it easier to study and gain insights. They help in organizing information in a meaningful way, allowing users to ask questions and perform analyses that can lead to better decision-making.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analogy&lt;/strong&gt;&lt;br&gt;
In a sales model, time can be a dimension. You might break down time into years, months, and days to analyze sales performance over different time periods. Similarly, in a customer model, age or location can be dimensions to categorize customers based on their age groups or geographic regions.&lt;/p&gt;

&lt;p&gt;Creating &lt;strong&gt;datetime_dim&lt;/strong&gt; dimension table&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4afbf7m0bvy4xoym1o1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4afbf7m0bvy4xoym1o1.png" alt="date time dimension" width="466" height="662"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;datetime_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tpep_pickup_datetime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tpep_dropoff_datetime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# PK
&lt;/span&gt;&lt;span class="n"&gt;datetime_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;datetime_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime_dim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;


&lt;span class="c1"&gt;# pickup  datetime
&lt;/span&gt;&lt;span class="n"&gt;datetime_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pick_hour&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tpep_pickup_datetime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hour&lt;/span&gt;
&lt;span class="n"&gt;datetime_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pick_day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tpep_pickup_datetime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;day_name&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;datetime_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pick_month&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tpep_pickup_datetime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;month&lt;/span&gt;
&lt;span class="n"&gt;datetime_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pick_year&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tpep_pickup_datetime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt;
&lt;span class="n"&gt;datetime_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pick_weekday&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tpep_pickup_datetime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weekday&lt;/span&gt;

&lt;span class="c1"&gt;# drop off datetime
&lt;/span&gt;&lt;span class="n"&gt;datetime_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;drop_hour&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tpep_dropoff_datetime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hour&lt;/span&gt;
&lt;span class="n"&gt;datetime_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;drop_day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tpep_dropoff_datetime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;day_name&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;datetime_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;drop_month&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tpep_dropoff_datetime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;month&lt;/span&gt;
&lt;span class="n"&gt;datetime_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;drop_year&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tpep_dropoff_datetime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt;
&lt;span class="n"&gt;datetime_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;drop_weekday&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tpep_dropoff_datetime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weekday&lt;/span&gt;

&lt;span class="c1"&gt;# giving proper ordering to columns for readability purpose
&lt;/span&gt;
&lt;span class="n"&gt;cols_seq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;datetime_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tpep_pickup_datetime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pick_hour&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pick_day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pick_month&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pick_year&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pick_weekday&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tpep_dropoff_datetime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;drop_hour&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;drop_day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;drop_month&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;drop_year&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;drop_weekday&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;datetime_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cols_seq&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datetime_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;output:&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fatlt750qdmnqwrkoj9gc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fatlt750qdmnqwrkoj9gc.png" alt="datetime_dim dataframe" width="720" height="168"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;Creating &lt;strong&gt;passenger_count_dim&lt;/strong&gt; dimension table&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbe5rj6vbeyjxajbldd2m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbe5rj6vbeyjxajbldd2m.png" alt="passenger count dim" width="598" height="242"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;passenger&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;
&lt;span class="n"&gt;passenger_count_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;passenger_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;passenger_count_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;passenger_count_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;passenger_count_dim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;

&lt;span class="c1"&gt;# giving proper ordering to columns for readability purpose
&lt;/span&gt;&lt;span class="n"&gt;cols_seq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;passenger_count_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;passenger_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;passenger_count_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;passenger_count_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cols_seq&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;passenger_count_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1o260zrcp50k5jas3dj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1o260zrcp50k5jas3dj.png" alt="passenger_count_dim dataframe" width="678" height="612"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Creating &lt;strong&gt;trip_distance_dim&lt;/strong&gt; dimension table&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdawyg7nf8oro4t65zlr2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdawyg7nf8oro4t65zlr2.png" alt="trip distance dimension" width="540" height="242"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;trip&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt; &lt;span class="n"&gt;dimension&lt;/span&gt;
&lt;span class="n"&gt;trip_distance_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;trip_distance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;trip_distance_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;trip_distance_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trip_distance_dim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;


&lt;span class="c1"&gt;# giving proper ordering to columns for readability purpose
&lt;/span&gt;&lt;span class="n"&gt;cols_seq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;trip_distance_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;trip_distance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;trip_distance_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trip_distance_dim&lt;/span&gt;

&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trip_distance_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faw0kf01i6mvdpat7tjl3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faw0kf01i6mvdpat7tjl3.png" alt="trip_distance_dim data frame" width="596" height="612"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Creating &lt;strong&gt;rate_code_dim&lt;/strong&gt; dimension table&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7tt5w5sg00ara1536usg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7tt5w5sg00ara1536usg.png" alt="rate code dimension" width="358" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;if we look in the data dictionary the information about rate code id are provided e.g 1 = ‘Standard rate’ etc. We can add this meta info in our dimension table to make it more helpful for our analysts.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjdpb3a8s0oos5onbgbmc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjdpb3a8s0oos5onbgbmc.png" alt="rate code mapping" width="720" height="167"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Mapping of rate code 
&lt;/span&gt;&lt;span class="n"&gt;rate_code_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Standard rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JFK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Newark&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Nassau or Westchester&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Negotiated fare&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Group ride&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;rate_code_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RatecodeID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rate_code_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rate_code_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rate_code_dim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;

&lt;span class="c1"&gt;# map `id` to the name of rate type
&lt;/span&gt;&lt;span class="n"&gt;rate_code_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rate_code_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rate_code_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RatecodeID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rate_code_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# giving proper ordering to columns for readability purpose
&lt;/span&gt;
&lt;span class="n"&gt;cols_seq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rate_code_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RatecodeID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rate_code_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;rate_code_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rate_code_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cols_seq&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rate_code_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4yhrydx6h8tyg1crmg1c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4yhrydx6h8tyg1crmg1c.png" alt="rate_code_dim data frame" width="720" height="438"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Creating &lt;strong&gt;pickup_location_dim&lt;/strong&gt; dimension table&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We have PULocationID and DOLocationID column which doesn’t mean anything to us unless we have a location name. Therefore, we will be using a lookup table which has mappings for each of location_id and their zone/location. If we refer to the above diagram, the lookup table which i am referring to is as follows.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzb21zomp4pnc0b9nz3u1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzb21zomp4pnc0b9nz3u1.png" alt="pickup location and drop off location dimension" width="720" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Creating &lt;strong&gt;location_dim&lt;/strong&gt; dimension table&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# loading lookup table
&lt;/span&gt;&lt;span class="n"&gt;lookup_location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://raw.githubusercontent.com/DanielOX/Data-Engineering-Workflow-DUCKDB/main/nyc_taxi_trip_data/taxi%2B_zone_lookup.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;LocationID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Zone&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;location_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lookup_location&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;LocationID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;location_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Zone&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;location_dim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpv95bn5dnbpsb6kxfwyu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpv95bn5dnbpsb6kxfwyu.png" alt="location_dim dataframe" width="564" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Creating &lt;strong&gt;pickup_location_dim&lt;/strong&gt; dimension table&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# preparing dimension table
&lt;/span&gt;&lt;span class="n"&gt;pickup_location_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PULocationID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pickup_location_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pickup_location_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pickup_location_dim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;
&lt;span class="n"&gt;cols_select&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pickup_location_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Join `pickup_location_dim` with `location_dim`
&lt;/span&gt;&lt;span class="n"&gt;pickup_location_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pickup_location_dim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;location_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;left_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PULocationID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;right_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;location_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="n"&gt;cols_select&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# giving proper ordering to columns for readability purpose
&lt;/span&gt;&lt;span class="n"&gt;cols_seq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pickup_location_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;pickup_location_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pickup_location_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cols_seq&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; 

&lt;span class="c1"&gt;#rename `location` to `pickup_location` for good readability 
&lt;/span&gt;&lt;span class="n"&gt;pickup_location_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pickup_location_dim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pickup_location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;


&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pickup_location_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F46yp6u9efceq8hk6dd7a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F46yp6u9efceq8hk6dd7a.png" alt="pickup_location_dim data frame" width="642" height="606"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Creating &lt;strong&gt;dropoff_location_dim&lt;/strong&gt; dimension table&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# preparing dimension table
&lt;/span&gt;&lt;span class="n"&gt;dropoff_location_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DOLocationID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dropoff_location_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dropoff_location_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dropoff_location_dim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;
&lt;span class="n"&gt;cols_select&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dropoff_location_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Join `dropoff_location_dim` with `location_dim`
&lt;/span&gt;&lt;span class="n"&gt;dropoff_location_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dropoff_location_dim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;location_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;left_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DOLocationID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;right_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;location_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="n"&gt;cols_select&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# giving proper ordering to columns for readability purpose
&lt;/span&gt;&lt;span class="n"&gt;cols_seq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dropoff_location_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;dropoff_location_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dropoff_location_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cols_seq&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; 

&lt;span class="c1"&gt;#rename `location` to `pickup_location` for good readability 
&lt;/span&gt;&lt;span class="n"&gt;dropoff_location_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dropoff_location_dim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dropff_location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;


&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dropoff_location_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq2bltmvpsn27kz77ol4g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq2bltmvpsn27kz77ol4g.png" alt="dropoff_location_dim dataframe" width="720" height="455"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Creating &lt;strong&gt;payment_type_dim&lt;/strong&gt; dimension table&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0g3axdxtb7ozwkmtwmtt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0g3axdxtb7ozwkmtwmtt.png" alt="payment type dimension" width="442" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;if we look in the data dictionary the information about payment_type is provided e.g 1 = ‘Credit Card’ etc. This would be helpful for analysts when making complex join on data for analytical purposes&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqny7bgc72ylbm1hjypuo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqny7bgc72ylbm1hjypuo.png" alt="payment type mapping" width="720" height="146"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;payment_type_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Credit card&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No charge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Dispute&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Voided trip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;payment_type_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;payment_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;payment_type_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;payment_type_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;payment_type_dim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;
&lt;span class="n"&gt;payment_type_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;payment_type_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;payment_type_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;payment_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payment_type_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# giving proper ordering to columns for readability purpose
&lt;/span&gt;&lt;span class="n"&gt;cols_seq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;payment_type_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;payment_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;payment_type_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;payment_type_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;payment_type_dim&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cols_seq&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payment_type_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fec53kqd0c9qalymmzmya.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fec53kqd0c9qalymmzmya.png" alt="payment_type_dim dataframe" width="720" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Creating &lt;strong&gt;fact_table&lt;/strong&gt; dimension table&lt;/p&gt;

&lt;p&gt;The most important table, where all of the data gets connected. This table doesn’t hold descriptive information about the items themselves or the customers — instead, it links to other tables (like dimensions) that contain this additional information&lt;/p&gt;

&lt;p&gt;So, while the fact table holds the numeric facts or measurements, it references other tables to provide the context or details about those facts. think of a fact table as the central hub of a data warehouse or database. It’s where all the key metrics and measurements are stored, organized, and connected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjosu24nau3mabij3tm0w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjosu24nau3mabij3tm0w.png" alt="fact_table dimension" width="514" height="652"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;fact_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;passenger_count_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;left_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;trip_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;right_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;passenger_count_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;left&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
               &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trip_distance_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;left_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;trip_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;right_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;trip_distance_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;left&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
               &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rate_code_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;left_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RatecodeID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;right_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RatecodeID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;left&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
               &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pickup_location_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;left_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;trip_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;right_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pickup_location_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;left&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
               &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dropoff_location_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;left_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;trip_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;right_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dropoff_location_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;left&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
               &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datetime_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;left_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;trip_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;right_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;datetime_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;left&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;\
               &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payment_type_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;left_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;payment_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;right_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;payment_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;left&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 


&lt;span class="n"&gt;cols_required&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;trip_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;VendorID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;datetime_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;passenger_count_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;trip_distance_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rate_code_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;store_and_fwd_flag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pickup_location_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dropoff_location_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;payment_type_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fare_amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;extra&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mta_tax&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tip_amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tolls_amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;improvement_surcharge&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;fact_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fact_table&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cols_required&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fact_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;as we can see the table contains foreign keys of the dimension table and it represent a consolidated table which can link different tables to get different information of the trips.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fte9w0pbh34m10ya7cynw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fte9w0pbh34m10ya7cynw.png" alt="fact_table dataframe only 15 columns shown out of 19" width="800" height="82"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Applying Analytics
&lt;/h2&gt;

&lt;p&gt;What good a data model is when we can’t apply analytics to it. I have prepared some questions in advance that we can answer. I will be using DuckDB for this exercise but in future articles (i hope :v ) i will be utilising click-house.&lt;/p&gt;

&lt;p&gt;I will be creating a helper function for duckdb. a simple function when string of query is given, if will execute it and return dataframe&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt; A function takes SQL Query and executes it on dataframe&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; In DuckDB we don’t have to explicitly insert data, the DuckDB engine treats the pandas dataframe as tables.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;for example if you have a dataframe name&lt;/p&gt;

&lt;p&gt;&lt;code&gt;some_random_df_name = pd.read_csv(…)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;in duckdb you can directly apply query on &lt;em&gt;&lt;strong&gt;some_random_df_name&lt;/strong&gt;&lt;/em&gt; variable. since, it is a dataframe object, it will treat it as a table. so,&lt;/p&gt;

&lt;p&gt;&lt;code&gt;select * from some_random_df_name;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;is a completely valid statement&lt;/p&gt;

&lt;p&gt;Type of Questions we will be answering using our data model&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Basic Statistics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is the average trip distance?&lt;/li&gt;
&lt;li&gt;What is the most common payment type?&lt;/li&gt;
&lt;li&gt;Average trip distance for trips with a tip_amount greater than $10, excluding trips with a payment type of ‘Cash’.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# What is the average trip distance? 
&lt;/span&gt;
&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    SELECT round(avg(trip_distance),2) as average_distance
    FROM trip_distance_dim
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;
&lt;span class="err"&gt;┌──────────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;average_distance&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;      &lt;span class="n"&gt;double&lt;/span&gt;      &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;├──────────────────┤&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;             &lt;span class="mf"&gt;3.85&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└──────────────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# What is the most common payment type? 
&lt;/span&gt;
&lt;span class="n"&gt;TOP_K&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    WITH PAYMENT_NAME_AGG_COUNTS AS (
         SELECT payment_type, payment_type_name , count(*) as count_
        FROM payment_type_dim 
        JOIN fact_table ON fact_table.payment_type_id = payment_type_dim.payment_type_id
        GROUP BY payment_type_dim.payment_type, payment_type_dim.payment_type_name 
    ), PAYMENT_RANK AS (
        SELECT *, ROW_NUMBER() OVER(ORDER BY count_ DESC) as rnk
        FROM PAYMENT_NAME_AGG_COUNTS
    ) SELECT payment_type_name, count_ FROM PAYMENT_RANK WHERE rnk &amp;lt;= &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;TOP_K&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;
&lt;span class="err"&gt;┌───────────────────┬─────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;payment_type_name&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;count_&lt;/span&gt;  &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;      &lt;span class="n"&gt;varchar&lt;/span&gt;      &lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="n"&gt;int64&lt;/span&gt;  &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;├───────────────────┼─────────┤&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Credit&lt;/span&gt; &lt;span class="n"&gt;card&lt;/span&gt;       &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;2411462&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Cash&lt;/span&gt;              &lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="mi"&gt;532241&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;NULL&lt;/span&gt;              &lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="mi"&gt;71743&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└───────────────────┴─────────┘&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# - Average trip distance for trips with a tip_amount greater than $10, excluding trips with a payment type of 'Cash'.
&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    WITH TRIP_DISTANCE_BASE AS (
        SELECT round(avg(trip_distance_dim.trip_distance)) as &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;average_trip_distance(miles)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
        FROM fact_table
        JOIN trip_distance_dim ON  fact_table.trip_id = trip_distance_dim.trip_distance_id
        JOIN payment_type_dim  ON  fact_table.payment_type_id = payment_type_dim.payment_type_id
        WHERE fact_table.tip_amount &amp;gt; 10 and lower(payment_type_dim.payment_type_name) = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cash&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
    ) select * from TRIP_DISTANCE_BASE
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;
&lt;span class="err"&gt;┌──────────────────────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="nf"&gt;average_trip_distance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;miles&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;            &lt;span class="n"&gt;double&lt;/span&gt;            &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;├──────────────────────────────┤&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                         &lt;span class="mf"&gt;15.0&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└──────────────────────────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Vendor Comparison:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many trips were made by each VendorID?&lt;/li&gt;
&lt;li&gt;Which vendor has the highest average fare_amount?
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# How many trips were made by each VendorID?
&lt;/span&gt;
&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    SELECT VendorID, count(*) as total_trips
    FROM fact_table
    GROUP BY VendorID
    ORDER BY count(*) DESC
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;
&lt;span class="err"&gt;┌──────────┬─────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;VendorID&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;total_trips&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="n"&gt;int64&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;    &lt;span class="n"&gt;int64&lt;/span&gt;    &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;├──────────┼─────────────┤&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;        &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;     &lt;span class="mi"&gt;2239399&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;        &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;      &lt;span class="mi"&gt;827367&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└──────────┴─────────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Which vendor has the highest average fare_amount?
&lt;/span&gt;
&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    WITH VENDOR_FARE_AMOUNT AS (
        SELECT VendorID, round(avg(fare_amount),2) as avg_fare_amount
        FROM fact_table
        GROUP BY VendorID
    ), VENDOR_MAX_AMOUNT_RNK AS (
        SELECT VendorID, avg_fare_amount, row_number() OVER(ORDER BY avg_fare_amount DESC) as RNK
        FROM VENDOR_FARE_AMOUNT
    ) SELECT VendorID, avg_fare_amount FROM VENDOR_MAX_AMOUNT_RNK WHERE RNK = 1
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;
&lt;span class="err"&gt;┌──────────┬─────────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;VendorID&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;avg_fare_amount&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="n"&gt;int64&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;     &lt;span class="n"&gt;double&lt;/span&gt;      &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;├──────────┼─────────────────┤&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;        &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;           &lt;span class="mf"&gt;18.71&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└──────────┴─────────────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Time Analysis:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is the average number of passengers for trips that occurred on Sundays?&lt;/li&gt;
&lt;li&gt;What are the peak hours for taxi trips based on the pick-up date-time?
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#- What is the average number of passengers for trips that occurred on Sundays?
&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    SELECT round(avg(passenger_count_dim.passenger_count),2) as average_passenger_count
    FROM fact_table
    JOIN passenger_count_dim ON passenger_count_dim.passenger_count_id = fact_table.passenger_count_id
    JOIN datetime_dim ON datetime_dim.datetime_id = fact_table.datetime_id
    WHERE lower(datetime_dim.pick_day) = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sunday&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; or datetime_dim.drop_day = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sunday&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the average number of passengers for trips that occurred on Sundays?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt; 
&lt;span class="err"&gt;┌─────────────────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;average_passenger_count&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;         &lt;span class="n"&gt;double&lt;/span&gt;          &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;├─────────────────────────┤&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                    &lt;span class="mf"&gt;1.45&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└─────────────────────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#- What are the peak hours for taxi trips based on the pick-up datetime?
&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What are the peak hours for taxi trips based on the pick-up datetime?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        WITH TRIPS_BY_HOUR_COUNT AS (
            SELECT datetime_dim.pick_hour, count(*) as total_trips
            FROM fact_table
            JOIN datetime_dim ON datetime_dim.datetime_id = fact_table.datetime_id
            GROUP BY datetime_dim.pick_hour
        ), TRIPS_BY_HOUR_RANK AS (
            SELECT  pick_hour, total_trips, DENSE_RANK() OVER(ORDER BY total_trips) as rnk
           FROM TRIPS_BY_HOUR_COUNT
        ) SELECT pick_hour,total_trips  FROM TRIPS_BY_HOUR_RANK WHERE rnk = 1
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;
&lt;span class="err"&gt;┌───────────┬─────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;pick_hour&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;total_trips&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="n"&gt;int32&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;    &lt;span class="n"&gt;int64&lt;/span&gt;    &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;├───────────┼─────────────┤&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;         &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;       &lt;span class="mi"&gt;17835&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└───────────┴─────────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Location-Based Analysis:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identify zones with the highest average total_amount and the lowest average trip_distance.&lt;/li&gt;
&lt;li&gt;Which pickup location has the highest tip_amount amount?
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Identify zones with the highest average total_amount and the lowest average trip_distance
&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    WITH BASE AS (
        SELECT 
            pickup_location_dim.pickup_location as zone, 
            round(avg(total_amount),2) as avg_total_amount, 
            round(avg(trip_distance_dim.trip_distance_id),2) as avg_trip_distance
        FROM fact_table 
        join trip_distance_dim ON fact_table.trip_id = trip_distance_dim.trip_distance_id
        JOIN pickup_location_dim ON pickup_location_dim.pickup_location_id = fact_table.pickup_location_id
        WHERE pickup_location_dim.pickup_location IS NOT NULL
        GROUP BY pickup_location_dim.pickup_location
    )  SELECT zone, avg_total_amount as avg_highest_trip_amount, avg_trip_distance  as avg_lowest_trip_distance 
       FROM BASE ORDER BY avg_total_amount DESC, avg_trip_distance ASC LIMIT 1
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;
&lt;span class="err"&gt;┌────────────────┬─────────────────────────┬──────────────────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;      &lt;span class="n"&gt;zone&lt;/span&gt;      &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;avg_highest_trip_amount&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;avg_lowest_trip_distance&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;    &lt;span class="n"&gt;varchar&lt;/span&gt;     &lt;span class="err"&gt;│&lt;/span&gt;         &lt;span class="n"&gt;double&lt;/span&gt;          &lt;span class="err"&gt;│&lt;/span&gt;          &lt;span class="n"&gt;double&lt;/span&gt;          &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;├────────────────┼─────────────────────────┼──────────────────────────┤&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Newark&lt;/span&gt; &lt;span class="n"&gt;Airport&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;                  &lt;span class="mf"&gt;104.38&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;               &lt;span class="mf"&gt;1212392.68&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└────────────────┴─────────────────────────┴──────────────────────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Which pickup location has the highest tip_amount amount?
&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    WITH BASE AS(

        SELECT 
            pickup_location_dim.pickup_location as zone, 
            fact_table.tip_amount,
            ROW_NUMBER() OVER(ORDER BY fact_table.tip_amount DESC) as rnk
        FROM fact_table 
        JOIN pickup_location_dim ON pickup_location_dim.pickup_location_id = fact_table.pickup_location_id
    ) SELECT zone, tip_amount as highest_tip_amount FROM BASE WHERE rnk = 1
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Which pickup location has the highest tip_amount amount?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;
&lt;span class="err"&gt;┌───────────────────────┬────────────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;         &lt;span class="n"&gt;zone&lt;/span&gt;          &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;highest_tip_amount&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;        &lt;span class="n"&gt;varchar&lt;/span&gt;        &lt;span class="err"&gt;│&lt;/span&gt;       &lt;span class="n"&gt;double&lt;/span&gt;       &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;├───────────────────────┼────────────────────┤&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Upper&lt;/span&gt; &lt;span class="n"&gt;East&lt;/span&gt; &lt;span class="n"&gt;Side&lt;/span&gt; &lt;span class="n"&gt;North&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;              &lt;span class="mf"&gt;380.8&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└───────────────────────┴────────────────────┘&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Rate Code Insights:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many trips were made for each rate code type?&lt;/li&gt;
&lt;li&gt;What is the average total_amount for each rate code type?
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# How many trips were made for each rate code type?
&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    SELECT rate_code_dim.rate_code_name, count(*) as trip_count
    FROM fact_table
    JOIN rate_code_dim ON rate_code_dim.RatecodeID = fact_table.rate_code_id
    GROUP BY rate_code_dim.rate_code_name
    ORDER BY trip_count DESC

&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How many trips were made for each rate code type?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;
&lt;span class="err"&gt;┌───────────────────────┬────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;    &lt;span class="n"&gt;rate_code_name&lt;/span&gt;     &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;trip_count&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;        &lt;span class="n"&gt;varchar&lt;/span&gt;        &lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="n"&gt;int64&lt;/span&gt;    &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;├───────────────────────┼────────────┤&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Standard&lt;/span&gt; &lt;span class="n"&gt;rate&lt;/span&gt;         &lt;span class="err"&gt;│&lt;/span&gt;     &lt;span class="mi"&gt;114239&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Nassau&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;Westchester&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;      &lt;span class="mi"&gt;15043&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Newark&lt;/span&gt;                &lt;span class="err"&gt;│&lt;/span&gt;      &lt;span class="mi"&gt;13106&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Negotiated&lt;/span&gt; &lt;span class="n"&gt;fare&lt;/span&gt;       &lt;span class="err"&gt;│&lt;/span&gt;       &lt;span class="mi"&gt;8958&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;JFK&lt;/span&gt;                   &lt;span class="err"&gt;│&lt;/span&gt;       &lt;span class="mi"&gt;4366&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Group&lt;/span&gt; &lt;span class="n"&gt;ride&lt;/span&gt;            &lt;span class="err"&gt;│&lt;/span&gt;          &lt;span class="mi"&gt;6&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└───────────────────────┴────────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# What is the average total_amount for each rate code type?
&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    SELECT rate_code_dim.rate_code_name, round(avg(fact_table.total_amount),2) as avg_total_amount
    FROM fact_table
    JOIN rate_code_dim ON rate_code_dim.RatecodeID = fact_table.rate_code_id
    GROUP BY rate_code_dim.rate_code_name
    ORDER BY avg_total_amount DESC

&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the average total_amount for each rate code type?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;
&lt;span class="err"&gt;┌───────────────────────┬──────────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;    &lt;span class="n"&gt;rate_code_name&lt;/span&gt;     &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;avg_total_amount&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;        &lt;span class="n"&gt;varchar&lt;/span&gt;        &lt;span class="err"&gt;│&lt;/span&gt;      &lt;span class="n"&gt;double&lt;/span&gt;      &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;├───────────────────────┼──────────────────┤&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Group&lt;/span&gt; &lt;span class="n"&gt;ride&lt;/span&gt;            &lt;span class="err"&gt;│&lt;/span&gt;           &lt;span class="mf"&gt;132.78&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;JFK&lt;/span&gt;                   &lt;span class="err"&gt;│&lt;/span&gt;           &lt;span class="mf"&gt;119.95&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Negotiated&lt;/span&gt; &lt;span class="n"&gt;fare&lt;/span&gt;       &lt;span class="err"&gt;│&lt;/span&gt;            &lt;span class="mf"&gt;107.4&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Standard&lt;/span&gt; &lt;span class="n"&gt;rate&lt;/span&gt;         &lt;span class="err"&gt;│&lt;/span&gt;            &lt;span class="mf"&gt;89.09&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Nassau&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;Westchester&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;            &lt;span class="mf"&gt;81.39&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Newark&lt;/span&gt;                &lt;span class="err"&gt;│&lt;/span&gt;            &lt;span class="mf"&gt;37.76&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└───────────────────────┴──────────────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Conclusion:
&lt;/h3&gt;

&lt;p&gt;In conclusion, This transformation of raw data into organized structures empowers efficient analysis and modeling. By creating distinct dimension tables encompassing attributes like date-time, passenger count, trip distance, payment types, among others, we have established a strong foundation for our data model.&lt;/p&gt;

&lt;p&gt;This model facilitates effective merging and querying of information, laying the groundwork for deeper insights. Moreover, exploring automation using Python libraries, understanding the advantages of the Parquet file format, and conducting analytical queries has equipped us to derive valuable and meaningful insights from this extensive dataset.&lt;/p&gt;

&lt;p&gt;You can reach out to me directly in case of any questions:&lt;/p&gt;

&lt;p&gt;Email: &lt;a href="mailto:danial.shabbir77@gmail.com"&gt;danial.shabbir77@gmail.com&lt;/a&gt;&lt;br&gt;
Github: &lt;a href="http://github.com/danielox"&gt;http://github.com/danielox&lt;/a&gt;&lt;br&gt;
About me: &lt;a href="http://danielox.github.io/"&gt;http://danielox.github.io/&lt;/a&gt;&lt;br&gt;
LinkedIn: &lt;a href="http://linkedin.com/in/danial-shabbir/"&gt;http://linkedin.com/in/danial-shabbir/&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Refrences:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://duckdb.org/docs/"&gt;https://duckdb.org/docs/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://pandas.pydata.org/getting_started.html"&gt;https://pandas.pydata.org/getting_started.html&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.python.org/"&gt;https://www.python.org/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The ideas was inspired by Darshil Parmer Video&lt;/p&gt;

&lt;p&gt;“&lt;a href="https://www.youtube.com/watch?v=WpQECq5Hx9g%E2%80%9D"&gt;https://www.youtube.com/watch?v=WpQECq5Hx9g”&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>analytics</category>
      <category>datascience</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
