<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: mwang-cmn</title>
    <description>The latest articles on DEV Community by mwang-cmn (@caroline_mwangi).</description>
    <link>https://dev.to/caroline_mwangi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F847473%2F6560587a-782b-453d-af88-04e8a77de8c9.jpeg</url>
      <title>DEV Community: mwang-cmn</title>
      <link>https://dev.to/caroline_mwangi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/caroline_mwangi"/>
    <language>en</language>
    <item>
      <title>Exploration of Digital Banking Transactions: A SQL Analysis</title>
      <dc:creator>mwang-cmn</dc:creator>
      <pubDate>Sat, 17 Aug 2024 08:47:07 +0000</pubDate>
      <link>https://dev.to/caroline_mwangi/exploration-of-digital-banking-transactions-a-sql-analysis-1hl0</link>
      <guid>https://dev.to/caroline_mwangi/exploration-of-digital-banking-transactions-a-sql-analysis-1hl0</guid>
      <description>&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;Data Bank, a cutting-edge digital bank, is pioneering the integration of banking and distributed data storage. By linking customers' cloud storage limits to their account balances, Data Bank offers a unique blend of financial services and secure data storage. As the digital banking landscape evolves, understanding customer behavior and forecasting data needs are crucial for strategic growth and efficient resource allocation.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Problem Statement&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This project aims to explore customer nodes and transaction patterns within the Data Bank system to identify key metrics that will help the management team optimize customer allocation, improve transaction tracking, and accurately forecast future data storage requirements. View the data challenge &lt;a href="https://8weeksqlchallenge.com/case-study-4/" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Data Structure and SQL Environment&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The dataset used in this analysis was intergrated into a SQL Server database. The database structure is designed with clarity featuring three key tables:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Regions:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;2. Customer Nodes&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;3. Customer Transactions&lt;/strong&gt;&lt;br&gt;
The Entity Relationship Diagram for this dataset is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fapqh0cu5chkf99udp3zr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fapqh0cu5chkf99udp3zr.png" alt="Image description" width="796" height="342"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Data Cleaning&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The dataset has been thouroughly examined and all three tables are clean and suited for analysis. There are no null values, duplicates or incorrect data types.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Question and Answers&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;How many unique nodes are there on the Data Bank system?&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;unique_nodes&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;dbo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_nodes&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp347b6lke985vyya23mc.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp347b6lke985vyya23mc.PNG" alt="Image description" width="503" height="104"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;What is the number of nodes per region?&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customer_nodes&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
 &lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;regions&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region_id&lt;/span&gt;
 &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhp6tcz2guaappd8ofoeu.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhp6tcz2guaappd8ofoeu.PNG" alt="Image description" width="383" height="141"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How many customers are allocated to each region?&lt;/strong&gt;&lt;br&gt;
Counts the unique number of customers from each Region, in descending order. Australia has the largest number of customers&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;customer_count&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customer_nodes&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
 &lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;regions&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region_id&lt;/span&gt;
 &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region_name&lt;/span&gt;
 &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv2dlyqt33fkjdl0sukud.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv2dlyqt33fkjdl0sukud.PNG" alt="Image description" width="450" height="145"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How many days on average are customers reallocated to a different node?&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DATEDIFF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DAY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end_date&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_days&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customer_nodes&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;end_date&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s1"&gt;'9999-12-31'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8e0tqe8a34p44zz923xm.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8e0tqe8a34p44zz923xm.PNG" alt="Image description" width="287" height="84"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the median, 80th and 95th percentile for this same reallocation days metric for each region?&lt;/strong&gt;&lt;br&gt;
The CTE date_diff calculates the number of days each customer spends on a node before being reallocated, linking this data to specific regions. The query then computes the median (50th percentile), 80th percentile, and 95th percentile of these reallocation days for each region&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;date_diff&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;  &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;DATEDIFF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DAY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;reallocation_days&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customer_nodes&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
    &lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;regions&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region_id&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;end_date&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s1"&gt;'9999-12-31'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;region_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;PERCENTILE_CONT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;WITHIN&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;reallocation_days&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;median_days&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;PERCENTILE_CONT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;WITHIN&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;reallocation_days&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;eighty_perc_days&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;PERCENTILE_CONT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;WITHIN&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;reallocation_days&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;ninety_five_perc_days&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;date_diff&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;region_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuimpczprcvx6425l378d.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuimpczprcvx6425l378d.PNG" alt="Image description" width="800" height="124"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;What is the unique count and total amount for each transaction type?&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;txn_type&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;transaction_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;unique_transactions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;txn_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_amount&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customer_transactions&lt;/span&gt;
 &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;txn_type&lt;/span&gt;
 &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;unique_transactions&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_amount&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjpf0kxt8jh3teq4m4vxy.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjpf0kxt8jh3teq4m4vxy.PNG" alt="Image description" width="800" height="111"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There were 4,777 unique transactions during the period, 56% of which were deposits. Additionally, the total amount transacted during the period was $2,958,708.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the average total historical deposit counts and amounts for all customers?&lt;/strong&gt;&lt;br&gt;
The CTE customer_deposits calculates the total number of deposits and the total amount of money deposited for each customer in the customer_transactions table. Itfilters the transactions to include only those where the transaction type is 'deposit', then grouping the results by each customer_id.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;customer_deposits&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;  &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;txn_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;deposit_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;txn_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;deposit_amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customer_transactions&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;txn_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'deposit'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deposit_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deposit_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_amount&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customer_deposits&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjpr4vwzuph1ra5hw5mpp.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjpr4vwzuph1ra5hw5mpp.PNG" alt="Image description" width="800" height="98"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For each month - how many Data Bank customers make more than 1 deposit and either 1 purchase or 1 withdrawal in a single month?&lt;/strong&gt;&lt;br&gt;
The CTE customer_trends tracks customer transaction behavior on a monthly basis. It calculates, for each month and each customr id:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The number of deposits (deposit_count)&lt;/li&gt;
&lt;li&gt;The number of purchases (purchase_count)&lt;/li&gt;
&lt;li&gt;The number of withdrawals (withdrawal_count)
The main query then determines, for each month, how many unique customers, made more than 1 deposit (deposit_count &amp;gt; 1).Additionally, made either at least 1 purchase or at least 1 withdrawal (purchase_count &amp;gt; 0 OR withdrawal_count &amp;gt; 0) in the same month.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;customer_trends&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;DATEPART&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MONTH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;txn_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;month_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;DATENAME&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MONTH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;txn_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;month_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;txn_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'deposit'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;deposit_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;txn_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'purchase'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;purchase_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;txn_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'withdrawal'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;withdrawal_count&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customer_transactions&lt;/span&gt;
 &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DATEPART&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MONTH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;txn_date&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;DATENAME&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MONTH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;txn_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;  &lt;span class="n"&gt;month_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;month_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_customers&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customer_trends&lt;/span&gt;
 &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;deposit_count&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;purchase_count&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;withdrawal_count&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;month_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;month_name&lt;/span&gt;
 &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;month_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;month_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xohaviftv4htn9rveja.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xohaviftv4htn9rveja.PNG" alt="Image description" width="779" height="158"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This analysis identifies customers who are more actively engaged by tracking those who not only deposit money multiple times within a month but also use the account for other types of transactions like purchases or withdrawals. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the closing balance for each customer at the end of the month?&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;cashflows&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; 
        &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;DATEPART&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MONTH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;txn_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;month_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;-- Use month number for proper ordering&lt;/span&gt;
        &lt;span class="n"&gt;DATENAME&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MONTH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;txn_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;month_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;txn_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'deposit'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;txn_amount&lt;/span&gt; &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;txn_amount&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;inflow&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; 
        &lt;span class="n"&gt;customer_transactions&lt;/span&gt;
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; 
        &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DATEPART&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MONTH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;txn_date&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;DATENAME&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MONTH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;txn_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;month_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inflow&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; 
        &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;month_id&lt;/span&gt; 
        &lt;span class="k"&gt;ROWS&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="n"&gt;UNBOUNDED&lt;/span&gt; &lt;span class="k"&gt;PRECEDING&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;CURRENT&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;closing_balance&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; 
    &lt;span class="n"&gt;cashflows&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; 
    &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;month_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnu41lduv17egtn3mgbxg.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnu41lduv17egtn3mgbxg.PNG" alt="Image description" width="800" height="222"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To calculate the closing balance for each customer as of the maximum transaction date (txn_date) within each month;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;cashflows&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; 
        &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;txn_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;txn_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'deposit'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;txn_amount&lt;/span&gt; &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;txn_amount&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;inflow&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; 
        &lt;span class="n"&gt;customer_transactions&lt;/span&gt;
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; 
        &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;txn_date&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;latest_cashflows&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; 
        &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;txn_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inflow&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;txn_date&lt;/span&gt; &lt;span class="k"&gt;ROWS&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="n"&gt;UNBOUNDED&lt;/span&gt; &lt;span class="k"&gt;PRECEDING&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;CURRENT&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;closing_balance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;txn_date&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; 
        &lt;span class="n"&gt;cashflows&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;txn_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;closing_balance&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; 
    &lt;span class="n"&gt;latest_cashflows&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; 
    &lt;span class="n"&gt;rn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; 
    &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;cashflows CTE:&lt;br&gt;
This CTE calculates the net inflow for each customer_id and each txn_date. The result includes the transaction date (txn_date) along with the month information.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;latest_cashflows CTE:&lt;br&gt;
In this step, the ROW_NUMBER() function is used to assign a rank (rn) to each row within the partition of customer_id and month_id, ordered by txn_date in descending order. This ensures that the latest transaction date within each month is given the rank of 1.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Final Query:&lt;br&gt;
The final SELECT statement retrieves the closing balance (inflow) for the latest transaction date in each month by filtering for rn = 1.&lt;br&gt;
&lt;strong&gt;What is the percentage of customers who increase their closing balance by more than 5%?&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft571b0eddbk303m3l091.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft571b0eddbk303m3l091.PNG" alt="Image description" width="800" height="196"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the percentage of customers who increase their closing balance by more than 5%?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The cashflows CTE calculates the net cash flow for each customer per month by summing up the transaction amounts, treating deposits as positive inflows and other transactions as outflows.&lt;/li&gt;
&lt;li&gt;ClosingBalance CTE:calculates the running (cumulative) closing balance for each customer over time by summing the monthly inflows up to and including the current month.&lt;/li&gt;
&lt;li&gt;PercentChange CTE: This calculates the percentage change in the closing balance for each customer from one month to the next, using the LAG function to compare the closing balance of the current month with the previous month.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;cashflows&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;DATEPART&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;YEAR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;txn_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;DATEPART&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MONTH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;txn_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;month_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;txn_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'deposit'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;txn_amount&lt;/span&gt; &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;txn_amount&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;inflow&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; 
        &lt;span class="n"&gt;customer_transactions&lt;/span&gt;
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; 
        &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DATEPART&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;YEAR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;txn_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;DATEPART&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MONTH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;txn_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;ClosingBalance&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
           &lt;span class="n"&gt;month_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
           &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inflow&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;month_id&lt;/span&gt; &lt;span class="k"&gt;ROWS&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="n"&gt;UNBOUNDED&lt;/span&gt; &lt;span class="k"&gt;PRECEDING&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;CURRENT&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;closing_balance&lt;/span&gt;
      &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;cashflows&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;PercentChange&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;month_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;closing_balance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;closing_balance&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;closing_balance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;month_id&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;closing_balance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;month_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;percent_increase&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ClosingBalance&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customer_transactions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;percent_customers&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;PercentChange&lt;/span&gt;
 &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;percent_increase&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4sehekf82ye0ik4ctwh.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4sehekf82ye0ik4ctwh.PNG" alt="Image description" width="759" height="171"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Full sql script - &lt;a href="https://github.com/mwang-cmn/Data-Bank-8-Days-of-sql-challenge/blob/main/Data%20bank%20script.sql" rel="noopener noreferrer"&gt;Github&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sqlserver</category>
      <category>analytics</category>
    </item>
    <item>
      <title>Customer Segmentation - Time Based Cohort Analysis</title>
      <dc:creator>mwang-cmn</dc:creator>
      <pubDate>Wed, 24 Jul 2024 08:27:44 +0000</pubDate>
      <link>https://dev.to/caroline_mwangi/customer-segmentation-time-based-cohort-analysis-5631</link>
      <guid>https://dev.to/caroline_mwangi/customer-segmentation-time-based-cohort-analysis-5631</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In the competitive landscape of modern business, understanding customer behavior is crucial for sustaining growth and maintaining a competitive edge. One of the most effective methods for gaining these insights is cohort analysis, which groups customers based on shared characteristics and tracks their behavior over time. This project demonstrates a comprehensive approach to conducting a time-based cohort analysis, focusing on customer retention, purchase behavior, and engagement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Importance of Cohort Analysis for Businesses
&lt;/h3&gt;

&lt;p&gt;Cohort analysis offers deep insights into customer behavior by tracking groups of customers (cohorts) who share common characteristics or experiences. This approach helps businesses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identify patterns and trends in customer behavior over time&lt;/li&gt;
&lt;li&gt;Understand the impact of marketing efforts and product changes&lt;/li&gt;
&lt;li&gt;Improve customer relationship management strategies&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Data Preparation and Data Cleaning
&lt;/h2&gt;

&lt;p&gt;The Data used in this analysis contains transactions ,made by customers from 28 countries.&lt;br&gt;
Load and preprocess the dataset to ensure it is ready for analysis.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="c1"&gt;# Load the data
&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;C:\Users\user.DESKTOP-1OSCPEL\Documents\Cohort Analysis\Ecommerce.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parse_dates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;InvoiceDate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fahd6d7mbiwauzp453tbh.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fahd6d7mbiwauzp453tbh.PNG" alt="Image description" width="737" height="221"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Description&lt;/strong&gt;&lt;br&gt;
The dataset used in this cohort analysis is a transactional dataset from an unspecified retail business. It includes detailed records of customer purchases over a specific period. The dataset contains the following columns:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;InvoiceNo: A unique identifier for each transaction. Each transaction can involve multiple items.&lt;/li&gt;
&lt;li&gt;InvoiceDate: The date and time when the transaction was generated. This column is essential for time-based analysis, as it allows us to track the timing of each purchase.&lt;/li&gt;
&lt;li&gt;CustomerID: A unique identifier for each customer. This column enables us to group purchases by customer and track their behavior over time.&lt;/li&gt;
&lt;li&gt;StockCode: A unique identifier for each product. This column helps in identifying different products purchased.&lt;/li&gt;
&lt;li&gt;Description: A brief description of the product. This column provides a human-readable name for each product.&lt;/li&gt;
&lt;li&gt;Quantity: The number of units of the product purchased in each transaction. This column helps in understanding the volume of items sold.&lt;/li&gt;
&lt;li&gt;UnitPrice: The price per unit of the product. This column, combined with the quantity, allows us to calculate the total sales for each product in each transaction.&lt;/li&gt;
&lt;li&gt;Country: The country where the customer resides. This column helps in analyzing geographical patterns in customer behavior.
I removed the null values and there were no duplicates in the dataset.
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftfzlnfrm7oinii4ti9h2.PNG" alt="Image description" width="800" height="242"&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Exploratory Data Analysis
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Quantity Sold by Country&lt;/strong&gt;&lt;br&gt;
These are the top 10 countries by Quantity sold:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftdvbnpeelh5xyflaw39f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftdvbnpeelh5xyflaw39f.png" alt="Image description" width="709" height="424"&gt;&lt;/a&gt;&lt;br&gt;
These are the bottom 10 countries by Quantity sold:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqwhrw1pgutg9dn9ob4l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqwhrw1pgutg9dn9ob4l.png" alt="Image description" width="712" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Users by Country&lt;/strong&gt;&lt;br&gt;
These are the top 10 countries by number of unique customers:&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F09p74los0m7y72bdvfx8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F09p74los0m7y72bdvfx8.png" alt="Image description" width="712" height="424"&gt;&lt;/a&gt;&lt;br&gt;
These are the bottom 10 countries by number of unique customers:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqeoqzptuep58z2p7eas.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqeoqzptuep58z2p7eas.png" alt="Image description" width="712" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trend in Monthly Sales&lt;/strong&gt;&lt;br&gt;
A new column, Sales was calculated. I calculated the Monthly Sales and plotted a line graph using seaborn as follows&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#  Calculate the Sales column
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Sales&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Quantity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;UnitPrice&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="c1"&gt;# Extract Year-Month
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Year-Month&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;InvoiceDate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_period&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;M&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Monthly sales
&lt;/span&gt;&lt;span class="n"&gt;monthly_sales&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Year-Month&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Sales&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;#monthly_sales
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lineplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;monthly_sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%y-%m&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;monthly_sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;monthly_sales&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ci&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Monthly Trend of Sales&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Sales in millions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rotation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;which&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;both&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;linewidth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tight_layout&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result was as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8ib9f40s6lwcqw4wd0h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8ib9f40s6lwcqw4wd0h.png" alt="Image description" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;December 2010 to August 2011: Noticeable fluctuations in sales quantity.&lt;/li&gt;
&lt;li&gt;August 2011 to November 2011: A significant sales increase occurred.&lt;/li&gt;
&lt;li&gt;November 2011 to December 2011: Sales experienced a sudden and substantial decline.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cohort Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Defining Cohorts
&lt;/h3&gt;

&lt;p&gt;Identify cohorts based on the initial purchase dates and calculate the cohort indices.&lt;br&gt;
&lt;strong&gt;1. Define Cohort Date&lt;/strong&gt;&lt;br&gt;
Since this is a time-based cohort analysis, the cohorts will be grouped according to the dates they made their first purchase&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb1q7hvn3zhfvvmr3luxp.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb1q7hvn3zhfvvmr3luxp.PNG" alt="Image description" width="800" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Define Cohort Index&lt;/strong&gt;&lt;br&gt;
The cohort Index is a measure of the time interval in months since a particular cohort made their first purchase. Index 4 for instance, indicates that this cohort made their first purchase 4 months ago. The cohort index is the difference between the Cohort Date and Invoice Date i.e. recent purchase time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3tq71a9xipkhs0v81s4k.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3tq71a9xipkhs0v81s4k.PNG" alt="Image description" width="800" height="356"&gt;&lt;/a&gt;&lt;br&gt;
The resulting dataset with the Cohort Date and Cohort Index was as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F197gmqms0k1oclpfx2id.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F197gmqms0k1oclpfx2id.PNG" alt="Image description" width="800" height="143"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Calculating Cohort Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Aggregate the data to calculate metrics such as unique users, average quantity, and retention rates.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Group data by CohortDate and CohortIndex, and count unique CustomerIDs
&lt;/span&gt;&lt;span class="n"&gt;cohort_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CohortDate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CohortIndex&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CustomerID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;nunique&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Rename the 'CustomerID' column to 'Customers' for better clarity
&lt;/span&gt;&lt;span class="n"&gt;cohort_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CustomerID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Customers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Pivot the data to create a matrix where rows represent CohortDate, columns represent CohortIndex, and values represent the number of unique Customers
&lt;/span&gt;&lt;span class="n"&gt;cohortdata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cohort_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pivot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CohortDate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CohortIndex&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Customers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Display the pivoted cohort data
&lt;/span&gt;&lt;span class="n"&gt;cohortdata&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Visualizing Cohort Analysis
&lt;/h2&gt;

&lt;p&gt;Generate heatmaps to visualize average quantity per cohort, retention rates, and users per cohort.&lt;br&gt;
&lt;strong&gt;1. Average Quantity ordered per Cohort&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;average_quantity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CohortDate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CohortIndex&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Quantity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;average_quantity&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Quantity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;average_quantity&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Quantity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;#Rename the Quantity column
&lt;/span&gt;&lt;span class="n"&gt;average_quantity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Quantity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Avg. Quantity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;#pivot the table 
&lt;/span&gt;&lt;span class="n"&gt;quantity_pivot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;average_quantity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pivot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CohortDate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CohortIndex&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Avg. Quantity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Convert index to the name of the month and year
&lt;/span&gt;&lt;span class="n"&gt;quantity_pivot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;quantity_pivot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%B %Y&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Visualize using a heatmap
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quantity_pivot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;annot&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cmap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Dark2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Heatmap - Average Quantity Bought per Cohort&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fontsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fontweight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bold&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4aq57eijzp4jykz4ickr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4aq57eijzp4jykz4ickr.png" alt="Image description" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Insights&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Average Quantity bought is almost constant after a few months. While there maybe fluctuations in customer retention, the volume of sales does not fluctuate, indicating that few customers purchase a large quantity of products.&lt;br&gt;
Target marketing can be increased in countries where more quantity is sold, rather than where more customers are acquired e.g. Tunisia, South Africa, Rwanda, Somalia etc&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Retention Rate&lt;/strong&gt;&lt;br&gt;
Retention rate is a metric that measures the percentage of customers who continue to engage with a business over a specified period. It is calculated by dividing the number of retained customers by the total number of customers at the start of the period.&lt;br&gt;
&lt;strong&gt;Importance of Retention Rate:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Indicates customer loyalty and satisfaction&lt;/li&gt;
&lt;li&gt;Helps businesses understand long-term engagement&lt;/li&gt;
&lt;li&gt;Provides insights into the effectiveness of customer retention strategies
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Calculate the Retention Rates 
&lt;/span&gt;&lt;span class="n"&gt;cohort_retention_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cohortdata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;divide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cohortdata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cohort_retention_table&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Heatmap for Retention Rates&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Visualize the retention rates for each cohort in a heatmap
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cohort_retention_table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;annot&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Dark2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.0%&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Heatmap - Retention Rates per cohort&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fontsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fontweight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bold&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq8rkz8qxhbmzuz0x41nx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq8rkz8qxhbmzuz0x41nx.png" alt="Image description" width="800" height="454"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;3. Users per Cohort&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Convert Index to a readable format
&lt;/span&gt;&lt;span class="n"&gt;cohortdata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cohortdata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%B-%Y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Plot heatmap
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cohortdata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;annot&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Dark2_r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.2f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Heatmap - Users per Cohort&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fontsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fontweight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bold&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi8ymocwhnk565146aggx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi8ymocwhnk565146aggx.png" alt="Image description" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Analysis and Insights
&lt;/h2&gt;

&lt;p&gt;The following are observations from the analysis:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Retention Rates&lt;/strong&gt;: The retention rate ranges between 20% and 40%, meaning that 20% to 40% of customers continue to make purchases from the e-commerce platform.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;December 2010 Cohort Outperforms Others&lt;/strong&gt;: The December 2010 cohort has a retention rate above 30%, indicating strong customer engagement likely due to effective marketing or customer retention strategies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decline in December 2011&lt;/strong&gt;: A noticeable decline in retention rates for all cohorts in December 2011 suggests potential issues during that period, warranting further investigation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Variability in Retention Rates&lt;/strong&gt;: Retention rates vary significantly, from a minimum of 8% to a maximum of 50%, indicating differing customer behaviors across cohorts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Average Quantity Bought&lt;/strong&gt;: The average quantity purchased remains relatively constant after a few months, suggesting stable sales volume despite fluctuations in customer retention. This indicates that fewer customers purchase larger quantities of products.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Recommendations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Identify Factors Driving High Retention (December 2010)&lt;/strong&gt;: Analyze the factors contributing to the high retention rate for the December 2010 cohort and replicate successful strategies for other cohorts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigate December 2011 Drop&lt;/strong&gt;: Investigate the causes behind the low retention rates in December 2011 by analyzing customer feedback, product quality, customer service, or any changes in business operations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set Realistic Targets&lt;/strong&gt;: Set specific retention targets based on historical data and industry benchmarks, aiming to improve retention rates gradually over time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement Retention Strategies&lt;/strong&gt;: Develop and implement tailored retention strategies such as personalized marketing, loyalty programs, and targeted communication to improve retention rates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Continuously Monitor and Adapt&lt;/strong&gt;: Regularly monitor retention rates and cohort data to adapt strategies as needed, ensuring informed decisions to enhance customer retention.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Target Marketing in High-Quantity Regions&lt;/strong&gt;: Increase target marketing efforts in countries where higher quantities are sold, such as Tunisia, South Africa, Rwanda, and Somalia, rather than focusing solely on customer acquisition.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Understanding customer behavior through cohort analysis provides invaluable insights that can drive strategic decision-making. This analysis highlights the importance of customer segmentation, as it allows businesses to identify and understand different customer groups' behavior over time. By segmenting customers based on their initial purchase dates, we can track retention rates, average purchase quantities, and overall engagement. This granular understanding helps in tailoring marketing strategies, improving customer retention efforts, and ultimately driving business growth. Proper segmentation ensures that resources are allocated effectively, catering to the specific needs and preferences of different customer cohorts.&lt;br&gt;
Find the Notebook - &lt;a href="https://github.com/mwang-cmn/Customer-Segmentation-Cohort-analysis/blob/main/Customer%20Segmentation%20-%20Time%20Based%20Cohort%20Analysis.ipynb" rel="noopener noreferrer"&gt;Here&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataanalysis</category>
      <category>cohortanalysis</category>
      <category>python</category>
      <category>customersegmentation</category>
    </item>
    <item>
      <title>Energy Market Resilience Metrics: Analyzing Vulnerabilities and Preparing for Disruptions</title>
      <dc:creator>mwang-cmn</dc:creator>
      <pubDate>Wed, 17 Jul 2024 08:26:27 +0000</pubDate>
      <link>https://dev.to/caroline_mwangi/energy-market-resilience-metrics-analyzing-vulnerabilities-and-preparing-for-disruptions-3k5o</link>
      <guid>https://dev.to/caroline_mwangi/energy-market-resilience-metrics-analyzing-vulnerabilities-and-preparing-for-disruptions-3k5o</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;br&gt;
In the dynamic and competitive energy market, companies like EnergiX Enterprise face numerous challenges that can significantly impact their operations and profitability. Understanding these challenges and devising effective strategies to address them is crucial for maintaining stability and growth. This blog post delves into a comprehensive analysis of how regulatory changes, infrastructure and technology capabilities affect EnergiX's operational costs, revenue, demand and energy production and consumption using Python for data analysis and visualization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem Statement&lt;/strong&gt;&lt;br&gt;
EnergiX Enterprise is currently grappling with several key issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fluctuations in Energy Demand and Supply: The energy market experiences volatility due to evolving consumer behavior and market dynamics, impacting the company's operations and profitability.&lt;/li&gt;
&lt;li&gt;Rising Competition from Renewable Energy Providers: The growth of renewable energy providers has intensified competition, affecting EnergiX's market share and pricing strategies.&lt;/li&gt;
&lt;li&gt;Regulatory Changes and Environmental Regulations: Evolving regulations necessitate compliance measures that increase operational costs.&lt;/li&gt;
&lt;li&gt;Aging Infrastructure and Technology Limitations: Outdated infrastructure and technology hinder operational efficiency and the company's ability to adapt to market dynamics.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Data Description&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The four datasets in this analysis can be found &lt;a href="https://drive.google.com/drive/folders/172UN3UpD933Xe5woEk7BrVMMdMoQ3CI_?usp=sharing" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Historical Energy Data: Contains information on energy production, consumption, prices, and operational costs.&lt;/li&gt;
&lt;li&gt;Market Data: Provides insights into market prices, competitor strategies, and market trends.&lt;/li&gt;
&lt;li&gt;Infrastructure and Maintenance Records: Details the condition of infrastructure, maintenance activities, and technology limitations.&lt;/li&gt;
&lt;li&gt;Regulatory and Compliance Data: Tracks changes in regulations, compliance status, and associated costs.
## Data Cleaning
Missing Values and Duplicates - There were no missing values or duplicates in the data
Datatypes - Converted the Date/Time column in each dateset to Datetime format.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6s5rzc51mzxbpbextexj.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6s5rzc51mzxbpbextexj.PNG" alt="Image description" width="800" height="349"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  *&lt;em&gt;Exploratory Data Analysis&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Univariate Analysis of Categorical Columns&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I plotted bar plots for all categorical columns in each dataset as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9i6pd3guu630hxo7bej.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9i6pd3guu630hxo7bej.PNG" alt="Image description" width="800" height="392"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Analysis&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;1.Energy Demand, Production and Consumption over Time&lt;br&gt;
Plot the Monthly Aggregate of Energy Demand , Production and Consuption over time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Extract Year and Month from Date/Time column in historical data
historical_data["Year"] = historical_data["Date/Time"].dt.year
historical_data["Month"] = historical_data["Date/Time"].dt.month
#Create new column Year-Month
historical_data["Year-Month"] = historical_data["Date/Time"].dt.to_period('M')

#Aggregate data on Monthly basis
monthly_data = historical_data.groupby("Year-Month").mean()
# Plot demand, production and consumption
plt.figure(figsize = (12,6))
sns.lineplot(data = monthly_data, x = monthly_data.index.astype(str), y = "Energy Demand", label = "Energy Demand", color="blue",ci=None)
sns.lineplot(data = monthly_data, x = monthly_data.index.astype(str), y = "Energy Consumption (kWh)", label = "Energy Consumption", color="brown",ci=None)
sns.lineplot(data = monthly_data, x = monthly_data.index.astype(str), y = "Energy Production (kWh)", label = "Energy Production", color="green",ci=None)
plt.title("Monthly aggregate of Energy Demand, Consumption and Production over time",  fontsize=14, fontweight='bold')
plt.xlabel("Date")
plt.ylabel("kwh")
labels = monthly_data.index.astype(str).tolist()
n=6
plt.xticks(labels[::n],rotation = 360)
plt.legend(loc='upper left', bbox_to_anchor=[1,1])
plt.grid(True, which='both',linewidth=0.5)
plt.subplots_adjust(hspace=0.5)
plt.tight_layout()
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fivxksjueil4y5f3vmzpl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fivxksjueil4y5f3vmzpl.png" alt="Image description" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It appears that across the years, energy demand has exceeded both energy production and consumption. The company is unable to meet the market demand for energy. As energy is being produced, it is being consumed , i.e. the enery production is directly proportional to consumption. Investigate why they are unable to produce sufficient energy to meet the current market demand.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Market Price vs Energy Price&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Investigate current pricing dynamics in relation to market trends&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyaqgc7lkbsmz0ltiyb3n.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyaqgc7lkbsmz0ltiyb3n.PNG" alt="Image description" width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7r11xr4h0s82p3yp9l8f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7r11xr4h0s82p3yp9l8f.png" alt="Image description" width="800" height="396"&gt;&lt;/a&gt;&lt;br&gt;
Energy price is the price at which the company currently charges for its product. Overall,both prices appear to fluctuate over time which maybe due to macroeconomic factors or regulatory factors. Market price is higher than the energy price, thus an indication of a competitive market. While the current energy prices may give the company a competitive edge, it may affect the profitablity of the company.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Investigate correlation between energy demand and energy price.
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjx2jqab94z0whclgheav.PNG" alt="Image description" width="800" height="170"&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The correlation between Price and Demand of Energy is &lt;strong&gt;-0.005&lt;/strong&gt;, a negative relationship, thus no linear relationship between these two variables. There may be other factors that affect the price or demand of their products.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Infrastructure Status and Technology Limitations&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Investigate the distribution of these two categories.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fucjsbt7364ap2lrx9itq.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fucjsbt7364ap2lrx9itq.PNG" alt="Image description" width="800" height="497"&gt;&lt;/a&gt;&lt;br&gt;
The company is unable to produce sufficient energy to meet demand probably due to the poor state of their infrastructure and the high technology limitations they face during production&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5f8tvj3y6zy0b4ydlx4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5f8tvj3y6zy0b4ydlx4.png" alt="Image description" width="524" height="352"&gt;&lt;/a&gt;&lt;br&gt;
There is a high relationship between Poor Infrastructure and High technology limitations. This consolidates the idea that their infrastructure needs an upgrade or overhaul, to upscale their production.&lt;br&gt;
To ascertain this, investigate the correlation between Infrastructure Status, Technology limitations and Energy demand&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs15grv0tgu9tpmrj3fze.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs15grv0tgu9tpmrj3fze.PNG" alt="Image description" width="800" height="142"&gt;&lt;/a&gt;&lt;br&gt;
A correlation score of -0.015, suggests there is a weak negative linear relationship between demand and these infrastructure and technology limitations. Energy Production tends to decrease as Infrastructure and Technology limitations increase. While these constraints have a direct on production, demand is affected by other factors. This might also suggest there is a very competitive business environment in the energy sector, that may influence demand.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Regulatory Changes and Compliance Costs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Regulatory changes and amendments are frequent during this period of operations. Subsequent compliance costs are quite significant as well.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbkd4ox19nqb2ywxlo1rg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbkd4ox19nqb2ywxlo1rg.png" alt="Image description" width="710" height="424"&gt;&lt;/a&gt;&lt;br&gt;
Investigate how these compliance costs impact the operational costs and the revenue generated by the company as well.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd9jk3f29wtoibblkgy8f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd9jk3f29wtoibblkgy8f.png" alt="Image description" width="800" height="396"&gt;&lt;/a&gt;&lt;br&gt;
Operational costs are the day to day costs incurred by the business. All three metrics fluctuate over time, with the costs exceeding the revenue genated by the business. This business is running at a loss and must adjust its revenue strategies to meet its expenses. They need to adjust their pricing strategies and revenue allocation strategy.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Competition Analysis&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Analyse Energy source column, visualize trends in production based on energy source over time.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hhvbj5zp92grr1cdeb5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hhvbj5zp92grr1cdeb5.png" alt="Image description" width="800" height="423"&gt;&lt;/a&gt;&lt;br&gt;
There are fluctuations in production of both energy sources, with the renewable fuels seeing spikes in production during certain periods. &lt;br&gt;
In other periods the production of the renewable sources declined,indicating possible competition, that impacted their overall market share. Focus on renewable sources should be considered, by working on the infrastructure and technology limitations to increase production. This may boost their market share and revenue over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Insights
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Energy Landscape&lt;/strong&gt; - EnergiX Enterprise faces considerable variations in energy production, consumption, and demand patterns. Notably, there are specific periods where demand surpasses production, underlining potential market stability and supply consistency concerns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing Volatility&lt;/strong&gt; - EnergiX's energy pricing exhibits significant volatility within broader market price trends. The energy price remains uncorrelated with energy demand, posing challenges for sales predictability and revenue forecasting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure &amp;amp; Technology Concerns&lt;/strong&gt; - A significant portion of the company’s infrastructure is rated as 'Poor'. Combined with severe technology limitations, this necessitates comprehensive infrastructure rejuvenation. Initial analysis indicates that areas of 'Poor' infrastructure status and high technological constraints could result in reduced energy production.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regulatory &amp;amp; Financial Implications&lt;/strong&gt; -EnergiX is currently navigating a challenging regulatory landscape, with new mandates and modifications to existing ones. These financial ramifications, particularly in terms of compliance costs and operational expenditures, are significant. A juxtaposition of these costs with the firm’s current revenue trajectory indicates a pressing profitability challenge.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Emergence of Renewables&lt;/strong&gt;- The energy market is experiencing a substantial shift towards renewables. Data trends suggest that renewable energy production instances have exceeded those of fossil fuels. For EnergiX, this highlights the dual challenges of evolving competition and potential market share erosion.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Recommendations
&lt;/h2&gt;

&lt;p&gt;To address the identified challenges, we recommend the following strategies:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Enhance Market Resilience&lt;/strong&gt;: Develop strategies to enhance the company's resilience to market fluctuations and regulatory changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Invest in Infrastructure and Technology&lt;/strong&gt;: Prioritize investments in modernizing infrastructure and adopting advanced technologies to improve operational efficiency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize Energy Production and Pricing&lt;/strong&gt;: Implement data-driven strategies to optimize energy production and pricing, ensuring competitiveness in the market.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strengthen Compliance Measures&lt;/strong&gt;: Proactively address regulatory requirements to minimize compliance costs and avoid potential penalties.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This analysis highlights the significant impact of regulatory changes on EnergiX Enterprise's operational costs, revenue, and compliance expenses. By leveraging data analysis and visualization, we have provided actionable insights and recommendations to help the company navigate the complex energy market effectively.&lt;/p&gt;

&lt;p&gt;Find Full Notebook &lt;a href="https://github.com/mwang-cmn/Energy-Market-Resilience-Analysis/blob/main/Energy_Market_Resilience_Metrics.ipynb" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

</description>
      <category>energymarkets</category>
      <category>dataanalytics</category>
      <category>python</category>
      <category>eventdriven</category>
    </item>
    <item>
      <title>Floods Prediction in Lagos, Nigeria.</title>
      <dc:creator>mwang-cmn</dc:creator>
      <pubDate>Mon, 08 Jul 2024 12:36:13 +0000</pubDate>
      <link>https://dev.to/caroline_mwangi/floods-prediction-in-lagos-nigeria-3ici</link>
      <guid>https://dev.to/caroline_mwangi/floods-prediction-in-lagos-nigeria-3ici</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;Lagos is a city on Nigeria's Atlantic Coast with a population of 16.5 million people according to the UN in 2023. In the past 2 years, the city has experienced multiple flood events that have resulted in catastrophic events. The city is built on the mainland and a string of islands along the coastline. While the floods may be attributed to factors such as the rising sea levels, shoreline erosion and sand mining, it is imperative that the city implements effective disaster risk management system to deal with the effects of floods. &lt;br&gt;
In this project I will implement an ARIMA model to predict when the city is likely to experience the floods.&lt;br&gt;&lt;br&gt;
Time series is widely used for forecasting and predicting future observations in a &lt;strong&gt;time series&lt;/strong&gt;. AutoRegressive Intergrated Moving average models (ARIMA) are used for predicting time series data.&lt;/p&gt;
&lt;h1&gt;
  
  
  Data Understanding
&lt;/h1&gt;

&lt;p&gt;The data used in this analysis ranges from 1st January 2002 to 28th February 2025. This data can be found on &lt;a href="https://www.visualcrossing.com/weather/weather-data-services" rel="noopener noreferrer"&gt;Visual Crossing&lt;/a&gt;.Find a description of each variable &lt;a href="https://www.visualcrossing.com/resources/documentation/weather-data/weather-data-documentation/" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;br&gt;
View notebook on &lt;a href="https://github.com/mwang-cmn/Floods-Prediction-Lagos-/blob/main/Floods_Prediction_Lagos.ipynb" rel="noopener noreferrer"&gt;Github&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The data contains 44498 records and 36 columns.&lt;/li&gt;
&lt;li&gt;I renamed the 'precip' column to 'Precipitation'&lt;/li&gt;
&lt;li&gt;I renamed the datetime column to Date&lt;/li&gt;
&lt;li&gt;I dropped the name column and set the Date column as the index.
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Drop name column
data.drop(['name'], axis=1, inplace=True)
#rename datetime to DATE
data.rename(columns={'datetime': 'Date'}, inplace=True)
#Rename precip to Precipitation
data.rename(columns={'precip': 'Precipitation'}, inplace=True)
#set date to index
data.set_index('Date', inplace=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Exploratory Data Analysis
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;A line plot showing Daily Precipitation from 2002 to 2024&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhs25u59fu48xxdsiajo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhs25u59fu48xxdsiajo.png" alt="Image description" width="571" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A line plot showing Monthly Precipitation from 2002 to 2024&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F53nse19x0gdkcbjwiavz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F53nse19x0gdkcbjwiavz.png" alt="Image description" width="562" height="455"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  ARIMA Model
&lt;/h3&gt;

&lt;p&gt;An ARIMA model is defined with the notation ARIMA(p,d,q), where &lt;br&gt;
p - The number of lagged observations&lt;br&gt;
d - Number of differencing operations&lt;br&gt;
q - The size of the moving average window&lt;br&gt;
When adopting an ARIMA model,the above parameters must be specified, the time series must be made stationary via differencing and the residuals should be uncorrelated. I conducted an adfuller test that confirmed the data series to be stationary.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftl6ibeupsd93an4bvhf6.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftl6ibeupsd93an4bvhf6.PNG" alt="Image description" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An ARIMA model was used to forecast future daily precipitation based on historical data. The model provided a 30-day forecast of daily precipitation for the next year. This forecast was plotted along with historical data to visualize the forecast values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Forecast the next 12 months
forecast_steps = 12
forecast = arima_model.forecast(steps=forecast_steps)
# Plot the historical data and forecast
plt.figure(figsize=(10, 6))
plt.plot(monthly_data, label='Historical')
plt.plot(forecast, label='Forecast', color='red')
plt.title('Monthly Precipitation Forecast')
plt.xlabel('Date')
plt.ylabel('Precipitation (inches)')
plt.legend()
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjf5vnt95y85cgwojtwek.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjf5vnt95y85cgwojtwek.png" alt="Image description" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Recall, our values in Precipitation column are in inches. Therefore, given the tropical climate in Nigeria, I set a threshhold of 200 mm or 8 inches to indicate potential of a flood.Local studies in Nigeria have shown that rainfall events exceeding 150 mm often lead to significant flooding in Lagos.&lt;br&gt;
 I then subset the forecasted data to obtain the next 12 periods Lagos is likely to experience floods, that is, rainfall above 8 inches or 200 mm.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Set the flood threshold
flood_threshold = 8.0
# Identify months with predicted precipitation above the threshold in the future forecast
flood_months = forecast_future[forecast_future &amp;gt; flood_threshold]

print("Predicted flood months:")
print(flood_months)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxwk4zd3nry2tmx2dwv6j.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxwk4zd3nry2tmx2dwv6j.PNG" alt="Image description" width="452" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusions
&lt;/h3&gt;

&lt;p&gt;The ARIMA models provides a flexible and structured way to model a time series data that relies on historical observations as well as past prediction errors. Ho&lt;br&gt;
&lt;strong&gt;Summary of Findings&lt;/strong&gt;&lt;br&gt;
In this study, I utilized historical rainfall data and time series techniques to predict flood occurrences in Lagos. By leveraging the ARIMA model, I generated accurate monthly precipitation forecasts. The analysis identified specific months with a high likelihood of flooding, providing valuable insights for urban planning and disaster management in Lagos.&lt;br&gt;
&lt;strong&gt;Key findings include:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Prediction Accuracy: The ARIMA model demonstrated robust predictive capabilities, accurately forecasting rainfall trends.&lt;/li&gt;
&lt;li&gt;Flood Threshold: We established a realistic flood threshold of 200 mm of rainfall within 24 hours, based on historical data and scientific literature.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Identified Risk Periods: Our model identified several months with predicted precipitation exceeding the flood threshold, indicating potential flood risk periods.&lt;br&gt;
Implications for Stakeholders&lt;br&gt;
The results of this analysis can significantly aid local authorities, urban planners, and disaster management agencies in:&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Proactive Flood Management: Implementing early warning systems and preparedness measures during identified high-risk months.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Infrastructure Planning: Enhancing drainage systems and urban infrastructure to mitigate flood impacts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Public Awareness: Informing and educating the public about flood risks and necessary precautions.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Limitations
&lt;/h3&gt;

&lt;p&gt;While the analysis provides valuable insights, there are several limitations to consider:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data Quality and Availability: The accuracy of predictions depends on the quality and granularity of historical rainfall data.&lt;/li&gt;
&lt;li&gt;Model Assumptions: The ARIMA model assumes linearity and may not capture complex, non-linear interactions in climate data.&lt;/li&gt;
&lt;li&gt;External Factors: Factors such as urbanization, land use changes, and climate change were not explicitly modeled but can significantly influence flood risks.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Find the full notebook on &lt;a href="https://github.com/mwang-cmn/Floods-Prediction-Lagos-/blob/main/Floods_Prediction_Lagos.ipynb" rel="noopener noreferrer"&gt;Github&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
    </item>
    <item>
      <title>Outlier Detection in Election Data Using Geospatial Analysis - AKWA IBOM</title>
      <dc:creator>mwang-cmn</dc:creator>
      <pubDate>Fri, 05 Jul 2024 20:33:21 +0000</pubDate>
      <link>https://dev.to/caroline_mwangi/outlier-detection-in-election-data-using-geospatial-analysis-akwa-ibom-3b06</link>
      <guid>https://dev.to/caroline_mwangi/outlier-detection-in-election-data-using-geospatial-analysis-akwa-ibom-3b06</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;The aim of this project is to uncover potential election irregularities to enable the electoral commission to ensure transparency of election results. In this project , I will identify outlier polling units where the voting results deviate significantly from neighbouring  units.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Understanding
&lt;/h2&gt;

&lt;p&gt;The dataset used in this analysis, represents polling units in the state of Akwa Ibom only.The data used can be found &lt;a href="https://drive.google.com/file/d/1dUewV7fM1TJA1XeaZCMWaYQ98Rxlb7CQ/view?usp=sharing" rel="noopener noreferrer"&gt;here&lt;/a&gt;. I conducted this analysis in Python as follows&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from google.colab import drive, files
drive.mount('/content/drive')
#Import Libraries
import pandas as pd
from geopy.geocoders import OpenCage
#path = '/content/drive/MyDrive/Colab Notebooks/Nigeria_Elections/'
data = pd.read_csv(path + "AKWA_IBOM_crosschecked.csv")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is a summary about columns in the data set&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;State&lt;/strong&gt;: The name of the Nigerian state where the election took place (e.g., “AKWA IBOM”).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LGA&lt;/strong&gt; (Local Government Area): The specific local government area within the state (e.g., “ABAK”).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ward&lt;/strong&gt;: The electoral ward within the local government area (e.g., “ABAK URBAN 1”).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PU-Code&lt;/strong&gt; (Polling Unit Code): A unique identifier for the polling unit (e.g., “3/1/2001 0:00”).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PU-Name&lt;/strong&gt; (Polling Unit Name): The name or location of the polling unit (e.g., “VILLAGE SQUARE, IKOT AKWA EBOM” or “PRY SCH, IKOT OKU UBARA”).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accredited Voters&lt;/strong&gt;: The number of voters accredited to participate in the election at that polling unit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Registered Voters&lt;/strong&gt;: The total number of registered voters in that polling unit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Results Found&lt;/strong&gt;: Indicates whether results were found for this polling unit (usually TRUE or FALSE).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transcription Count&lt;/strong&gt;: The count of how many times the results were transcribed (may be -1 if not applicable).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result Sheet Stamped&lt;/strong&gt;: Indicates whether the result sheet was stamped (TRUE or FALSE).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result Sheet Corrected&lt;/strong&gt;: Indicates whether any corrections were made to the result sheet (TRUE or FALSE).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result Sheet Invalid&lt;/strong&gt;: Indicates whether the result sheet was deemed invalid (TRUE or FALSE).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result Sheet Unclear&lt;/strong&gt;: Indicates whether the result sheet was unclear (TRUE or FALSE).&lt;/li&gt;
&lt;li&gt;Result Sheet Unsigned: Indicates whether the result sheet was unsigned (TRUE or FALSE).&lt;/li&gt;
&lt;li&gt;APC: The number of votes received by the All Progressives Congress (APC) party.&lt;/li&gt;
&lt;li&gt;LP: The number of votes received by the Labour Party (LP).&lt;/li&gt;
&lt;li&gt;PDP: The number of votes received by the People’s Democratic Party (PDP).&lt;/li&gt;
&lt;li&gt;NNPP: The number of votes received by the New Nigeria People’s Party (NNPP).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I then created the Address column by concatenating the Polling unit Name, Ward, the Local government Area and State, which will be useful during geocoding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data['Address'] = data['PU-Name'] + ',' + data['Ward'] + ',' + data['LGA'] + ',' + data['State']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To obtain the Latitude and Longitude columns, I utilized geospatial encoding techiniques.&lt;br&gt;
I generated an API key on OpenCage Geocoding API, and defined a function geocode_address to geocode our new Address column to obtain the Latitude and Longitude columns&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def geocode_address(Address):
  try:
    location = geolocator.geocode(Address)
    return location.latitude, location.longitude
  except:
    return None, None

data[['Latitude', 'Longitude']] = data['Address'].apply(lambda x: pd.Series(geocode_address(x)))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A quick at our dataset:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3g0kinf7xbb6at2jz4z3.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3g0kinf7xbb6at2jz4z3.PNG" alt="Image description" width="539" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looks like our function works and I was able to obtain the Latitude and Longitude column.&lt;br&gt;
As there are still null values in these 2 columns, I will Impute them using the Simple Imputer, which will replace the missing values with the mean.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy = 'mean')
data[['Latitude', 'Longitude']] = imputer.fit_transform(data[['Latitude', 'Longitude']])
data.to_csv('AKWA_IBOM_geocode.csv', index = False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Identifying Neighbours
&lt;/h3&gt;

&lt;p&gt;I defined a radius of 1 km to identify which polling units are considered neighbours&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Calculate distance and find neighbours
from geopy.distance import geodesic
neighbours= {}
def neighbouring_pu(data, radius = 1.0):
  for i, row in data.iterrows():
    neighbours[i] = []
    for j, row2 in data.iterrows():
      if i != j:
        distance = geodesic((row['Latitude'],row['Longitude']), (row2['Latitude'],row2['Longitude'])).km
        if distance &amp;lt;= radius:
          neighbours[i].append(j)
  return neighbours

neighbours = neighbouring_pu(data, radius =1.0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Outlier Calculation - Score&lt;/strong&gt;&lt;br&gt;
I will define a function, get_outlier_scores, that calculates the outlier scores for voting data in this dataset. It does so by comparing the votes each row received for various parties (APC, LP, PDP, NNPP) to the average votes received by its neighboring rows, which are specified in a dictionary, neighbours. &lt;br&gt;
For each row, the function computes the absolute difference between the votes in that row and the average votes of its neighbors for each party, and stores these differences as outlier scores. Finally, it returns a new DataFrame that combines the original voting data with the calculated outlier scores. This allows for the identification of rows with voting patterns that significantly differ from their neighbors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_outlier_scores(data, neighbours):
  outlier_scores = []
  parties = ['APC', 'LP', 'PDP', 'NNPP']
  for i, row in data.iterrows():
    scores = {}
    for party in parties:
      votes = row[party]
      neighbour_votes = data.loc[neighbours[i], party].mean() if neighbours[i] else 0
      scores[party + '_outlier_score'] = abs(votes - neighbour_votes)
    outlier_scores.append(scores)
    outlier_scores_data = pd.DataFrame(outlier_scores)
  return pd.concat([data, outlier_scores_data], axis = 1)

outlier_scores_df = get_outlier_scores(data, neighbours)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Sorting and Reporting&lt;/strong&gt;&lt;br&gt;
I sorted the data by the outlier scores for each party and obtained the following detailed report  that includes the top five outliers for each party, with the 'PU-Code', number of votes, and the outlier score.&lt;/p&gt;
&lt;h4&gt;
  
  
  : All Progressives Congress (APC) party
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;PU-Code&lt;/th&gt;
&lt;th&gt;APC&lt;/th&gt;
&lt;th&gt;APC_outlier_score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;03-05-11-009&lt;/td&gt;
&lt;td&gt;324&lt;/td&gt;
&lt;td&gt;228.52&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03-29-05-013&lt;/td&gt;
&lt;td&gt;194&lt;/td&gt;
&lt;td&gt;167.334&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03-30-07-001&lt;/td&gt;
&lt;td&gt;180&lt;/td&gt;
&lt;td&gt;153.325&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03-05-09-014&lt;/td&gt;
&lt;td&gt;194&lt;/td&gt;
&lt;td&gt;152.149&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03-28-05-003&lt;/td&gt;
&lt;td&gt;180&lt;/td&gt;
&lt;td&gt;138.132&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h4&gt;
  
  
  : Labour Party (LP)
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;PU-Code&lt;/th&gt;
&lt;th&gt;LP&lt;/th&gt;
&lt;th&gt;LP_outlier_score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;03-05-11-009&lt;/td&gt;
&lt;td&gt;59&lt;/td&gt;
&lt;td&gt;45.451&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03-29-05-013&lt;/td&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;6.65894&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03-30-07-001&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;td&gt;6.34942&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03-05-09-014&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;26.5831&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03-28-05-003&lt;/td&gt;
&lt;td&gt;91&lt;/td&gt;
&lt;td&gt;61.5261&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h4&gt;
  
  
  : People’s Democratic Party (PDP)
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;PU-Code&lt;/th&gt;
&lt;th&gt;PDP&lt;/th&gt;
&lt;th&gt;PDP_outlier_score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;03-05-11-009&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;27.3627&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03-29-05-013&lt;/td&gt;
&lt;td&gt;181&lt;/td&gt;
&lt;td&gt;145.232&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03-30-07-001&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;td&gt;18.8739&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03-05-09-014&lt;/td&gt;
&lt;td&gt;36&lt;/td&gt;
&lt;td&gt;24.2221&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03-28-05-003&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;48.2519&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h4&gt;
  
  
  : New Nigeria People’s Party - NNPP
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;PU-Code&lt;/th&gt;
&lt;th&gt;NNPP&lt;/th&gt;
&lt;th&gt;NNPP_outlier_score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;03-05-11-009&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0.27451&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03-29-05-013&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;4.14865&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03-30-07-001&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1.85521&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03-05-09-014&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2.36104&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03-28-05-003&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2.36104&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Visualize the neighbours&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Generate scatterplots to visualize the geographical distribution of polling units based on their outlier scores for four political parties (APC, LP, PDP, NNPP).&lt;br&gt;
Each point represents a polling unit plotted by its latitude and longitude. &lt;br&gt;
Each plot provides a clear visual representation of how the outlier scores are geographically distributed, making it easier to identify patterns or anomalies in the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import matplotlib.pyplot as plt
import seaborn as sns

parties = ['APC', 'LP', 'PDP', 'NNPP']
for party in parties:
  plt.figure(figsize=(10, 6))
  sns.scatterplot(data=outlier_scores_df, x='Latitude', y='Longitude', hue=party + '_outlier_score', palette='viridis')
  plt.title(f'Polling Units by {party} Outlier Score')
  plt.xlabel('Latitude')
  plt.ylabel('Longitude')
  plt.legend(title=party + ' Outlier Score')
  plt.savefig(f'polling_units_{party}_outlier_score.png')
  plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo9f1a93wtcrblr4h3sp9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo9f1a93wtcrblr4h3sp9.png" alt="Image description" width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkymfzldnuok135p6u92e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkymfzldnuok135p6u92e.png" alt="Image description" width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgy14cv4dpuphgmui78bm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgy14cv4dpuphgmui78bm.png" alt="Image description" width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkgjm2yljnnuvfgjv6k97.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkgjm2yljnnuvfgjv6k97.png" alt="Image description" width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Deliverables
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Find the full Notebook &lt;a href="https://github.com/mwang-cmn/Outlier-Detection---Geospatial-Analysis-of-Election-Data/blob/main/Nigerian_Elections_Outlier_Detection.ipynb" rel="noopener noreferrer"&gt;here&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Full &lt;a href="https://github.com/mwang-cmn/Outlier-Detection---Geospatial-Analysis-of-Election-Data/blob/main/README.md" rel="noopener noreferrer"&gt;Report&lt;/a&gt; - Top five outliers for each party.&lt;/li&gt;
&lt;li&gt;File with Latitude and Longitude - &lt;a href="https://drive.google.com/file/d/1rxI3TK_MupPdkgvb0BYtC8Hwux3nhGoo/view?usp=sharing" rel="noopener noreferrer"&gt;CSV&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;File with sorted polling units by outlier scores - &lt;a href="https://drive.google.com/file/d/153MFcDuYsST1Q-69HQuenVJw-CqqTKQ9/view?usp=sharing" rel="noopener noreferrer"&gt;CSV&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Retail Sales Analysis - Python and Power BI</title>
      <dc:creator>mwang-cmn</dc:creator>
      <pubDate>Fri, 05 Jul 2024 13:44:48 +0000</pubDate>
      <link>https://dev.to/caroline_mwangi/retail-sales-analysis-python-and-power-bi-gc7</link>
      <guid>https://dev.to/caroline_mwangi/retail-sales-analysis-python-and-power-bi-gc7</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;One of the most crucial steps in improving business performance is to identify opportunities and evaluate sales performance in order to establish an effective strategy. This can be accomplished through descriptive analysis of sales data.&lt;br&gt;
In this project, I will do a simple sales analysis of a retail store based on a historical dataset. &lt;br&gt;
The dataset used in this analysis can be found on &lt;a href="https://www.kaggle.com/datasets/kyanyoga/sample-sales-data?resource=download" rel="noopener noreferrer"&gt;Kaggle&lt;/a&gt;. &lt;br&gt;
The main objective of this analysis is to better understand business performance by tracking historical transactions. The tools used in this analysis in Colab Notebooks for data cleaning and EDA analysis, and Power BI for a dashboard.&lt;br&gt;
The dataset contains records of transactions/ orders of a retail company specializing in transport by selling cars, trucks, planes, ships and trains&lt;/p&gt;
&lt;h1&gt;
  
  
  Data Cleaning and Transformation
&lt;/h1&gt;

&lt;p&gt;Importing data&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
path = "/content/drive/MyDrive/Colab Notebooks/Retail_Sales/"
data = pd.read_csv(path + "sales_data_sample.csv", encoding='latin1', parse_dates= ['ORDERDATE'])
#Make a copy of dataset
retail_data = data.copy()
data.info()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft91rb5wg6fpt2m0h0x6g.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft91rb5wg6fpt2m0h0x6g.PNG" alt="Image description" width="636" height="453"&gt;&lt;/a&gt;&lt;br&gt;
From this initial assessment, there is one datetime column, nine numerical columns and fifteen categorical columns.&lt;br&gt;
Check for duplicates and null values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Check for duplicates
retail_data.duplicated().sum()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There were no duplicated rows in the dataset.&lt;br&gt;
Check for null values&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Check for null values
retail_data.isnull().sum()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqyx26u8wt6cvbs8ovk7n.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqyx26u8wt6cvbs8ovk7n.PNG" alt="Image description" width="503" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The territory column has 1074 null values, which correspond to transactions in the USA and Canada. I imputed these null values with AMER to represent the Americas territory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;null_territory = retail_data['TERRITORY'].isnull().sum()
print(f'Number of null values in territory column: {null_territory}')
# view countries in Each territory
countries_by_territory = retail_data.groupby('TERRITORY')['COUNTRY'].unique()
print(f'Countries by territory:\n{countries_by_territory}')
# Impute null values in territorry column with AMER - Americas consisting the USA and Canada
retail_data['TERRITORY'] = retail_data['TERRITORY'].fillna('AMER')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I then converted the object dtypes to categoricy dtypes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#copy of our dataset
retail_df = retail_data.copy()
#Convert object dtypes to categorical columns
categorical_columns = retail_df.select_dtypes(include=['object']).columns
#retail_df[categorical_columns] = retail_df[categorical_columns].astype('category')
categorical_columns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notably, the dataset had a sales column, quantity ordered and price of each product line, per order. I created a REVENUE column to find out if it was equivalent to the SALES column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Create Revenue Column
retail_df['REVENUE'] = retail_df['QUANTITYORDERED'] * retail_df['PRICEEACH']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I then dropped irrelevant columns.Checking the final dtypes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Check dtypes
retail_df.info()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqktkx2o3gfj318mujp2.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqktkx2o3gfj318mujp2.PNG" alt="Image description" width="606" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Exploratory Data Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Correlation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**Correlation heatmap**
numerical_columns = retail_df.select_dtypes(include=['int64', 'float64']).columns
plt.figure(figsize = (10,6))
sns.heatmap(retail_df[numerical_columns].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fida652ywpr1g9oqpi2bc.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fida652ywpr1g9oqpi2bc.PNG" alt="Image description" width="800" height="448"&gt;&lt;/a&gt;&lt;br&gt;
The correlation co-efficient indicates the linear relationship between two variables.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Quantity Ordered and Priceeach (0.0056) - this indicates a weak positive correlation. The two features are hardly related.&lt;/li&gt;
&lt;li&gt;Quantity Ordered and Revenue (0.75) - Indicates a strong positive correlation. When the quantity ordered increases, revenue also increases.&lt;/li&gt;
&lt;li&gt;Priceeach and Revenue (0.64) - Indicates a moderate correlation. When the price increases, revenue increases.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Outlier Detection
&lt;/h3&gt;

&lt;p&gt;I plotted boxplots of Price, Revenue and Sales columns to check for possible outliers&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**Identifying Outliers**
plt.figure(figsize=(8,6))
outlier = pd.DataFrame(data=retail_data, columns = ['REVENUE', 'QUANTITYORDERED', 'SALES'])
sns.boxplot(data=outlier, color='cyan')
plt.title('Outliers in the Revenue, Quantity and Sales columns')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftgim3j8nta1mwgcn60fa.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftgim3j8nta1mwgcn60fa.PNG" alt="Image description" width="800" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The boxplots above identified possible outliers in the Sales column. However, the revenue column appears to have none. This indicates that there are discrepancies in the sales column. Recall the Revenue column was a direct calculation, of Quantity ordered and Price of each item, therefore, should have been equivalent to the Sales column values. The number of dicrepancies is 1304, about 46% of the dataset Since this is not the case, its important to find out why these discrepancies exist and their source.&lt;/p&gt;

&lt;p&gt;After completion of the EDA analysis, I exported the dataset to Power BI for further analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Analysis in Power BI
&lt;/h2&gt;

&lt;p&gt;At this stage I used the dataset to create meaningful insights. The data collected contains data from 6th January 2003 to 31st May 2005.&lt;br&gt;
Here is a preview of the final dashboard and insights from this dataset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5cy83pm399o7jk3m5uh.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5cy83pm399o7jk3m5uh.PNG" alt="Image description" width="800" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Insights and Recommendations.
&lt;/h2&gt;

&lt;p&gt;There are 307 distinct orders, the total quantity of products sold over the 27 months was ninety nine thousand, revenue of 8 million USD and a shipping rate of 93%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Revenue Trend&lt;/strong&gt;&lt;br&gt;
The revenue trend across all years remained consistent with its highest peak in November.This is consistent with sales trends during holiday seasons where sales are expected to rise.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffe4v6qcibahsphxv14ue.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffe4v6qcibahsphxv14ue.PNG" alt="Image description" width="800" height="428"&gt;&lt;/a&gt;&lt;br&gt;
Revenue also surged in the 4th quarter respectivelly and the retail company sold more units from Tuesday to Friday, compared to other days of the week.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fezuih54l568tz6tnjxaa.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fezuih54l568tz6tnjxaa.PNG" alt="Image description" width="668" height="213"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Product Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Classic Cars is the most popular product line, accounting for $3 million  in revenue, about 37.5% of the total revenue during the entire period. Therefore, the company should extend extensive marketing of this product to their clientelle to improve revenue from this core product. &lt;/p&gt;

&lt;p&gt;On the other hand, trains and ships were the least popular products, accounting for only 12.5% of the total revenue. The average prices of a ship, train and a classic car are $87.34, $83.86 and $75.65 respectively. Trains and ships are mostly purchased for commercial purposes, compared to classic cars that are used by individuals and this could explain the variance in revenue performance by these products. Due to the poor performance of the ships, marketing efforts can be redirected to cruise companies in Europe.&lt;br&gt;
However, the retailer could consider smaller boats or yatchs, for private buyers as a potential investment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regional Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The United States market recorded the highest revenue of $3 million while, Madrid city in Spain recorded the highest revenue during the period at $902,094, i.e. 10.88% of total revenue.&lt;br&gt;
Additionally, the EMEA region, that is countries in Europe, accounted for 49.79% of the total revenue, followed closely by the Americas territory, AMER, at 38.35%. Marketing should also focus on these territories.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0c9cyx8uujz3ky7eee85.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0c9cyx8uujz3ky7eee85.PNG" alt="Image description" width="800" height="485"&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Link to Dashboard and Notebook
&lt;/h3&gt;

&lt;p&gt;The final dashboard and notebook can be found below.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://github.com/mwang-cmn/Retail-Sales-Analysis/blob/main/dashboard%20-%20sales.pbix" rel="noopener noreferrer"&gt;Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/mwang-cmn/Retail-Sales-Analysis/blob/main/Retail_Sales_Analysis.ipynb" rel="noopener noreferrer"&gt;Colab Notebook&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Ultimate Introduction to Python</title>
      <dc:creator>mwang-cmn</dc:creator>
      <pubDate>Mon, 02 May 2022 08:35:40 +0000</pubDate>
      <link>https://dev.to/caroline_mwangi/ultimate-introduction-to-python-3g6p</link>
      <guid>https://dev.to/caroline_mwangi/ultimate-introduction-to-python-3g6p</guid>
      <description>&lt;p&gt;Python is a general-purpose interpreted, interactive, object-oriented, and high-level programming language created by Guido van Rossum during 1985- 1990.&lt;br&gt;
&lt;strong&gt;Why Python?&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Easy-to-learn: Python has few keywords, simple structure, and a clearly defined syntax. This allows the student to pick up the language quickly. &lt;/li&gt;
&lt;li&gt; Easy-to-read: Python code is more clearly defined and visible to the eyes. &lt;/li&gt;
&lt;li&gt; Easy-to-maintain: Python's source code is fairly easy-to-maintain. &lt;/li&gt;
&lt;li&gt; A broad standard library: Python's bulk of the library is very portable and cross platform compatible on UNIX, Windows, and Macintosh. &lt;/li&gt;
&lt;li&gt; Interactive Mode: Python has support for an interactive mode which allows interactive testing and debugging of snippets of code.&lt;/li&gt;
&lt;li&gt; Scalable: Python provides a better structure and support for large programs than shell scripting.&lt;/li&gt;
&lt;li&gt; Databases: Python provides interfaces to all major commercial databases&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Installing Python
&lt;/h2&gt;

&lt;p&gt;Download the latest version for Windows, Linux and MacOS at this &lt;a href="https://www.python.org/" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Variables&lt;/strong&gt;&lt;br&gt;
Variables store information that can be used and/or changed in your program. This information can be an integer, text, collection, etc. Variables are used to hold user inputs, local states of your program, etc. Variables have a name so that they can be referenced in the code. The fundamental concept to understand is that everything is an object in Python.&lt;br&gt;
Python supports numbers, strings, sets, lists, tuples, and dictionaries. These are the standard data types.&lt;br&gt;
To assign variable a value, use the equals sign (=)&lt;br&gt;
&lt;code&gt;Var_1= 1&lt;/code&gt;&lt;br&gt;
&lt;code&gt;var_2 = ‘dog’&lt;/code&gt;&lt;br&gt;
&lt;code&gt;var_3 = "Caroline"&lt;/code&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Data structures and sequences
&lt;/h2&gt;

&lt;p&gt;They include numbers, strings, lists, tuples, sets and dictionaries&lt;br&gt;
&lt;strong&gt;Numbers&lt;/strong&gt;&lt;br&gt;
Include integers, decimals, floats&lt;br&gt;
&lt;strong&gt;Strings&lt;/strong&gt;&lt;br&gt;
Textual information. Strings are sequence of letters. They are enclosed in quotation marks (single, double or triple). They are immutable. This means, once created, they cannot be changed&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tuples&lt;/strong&gt;&lt;br&gt;
A tuple is a fixed length, immutable sequence of python objects separated by commas enclosed in parathesis&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
tup1= (8,9,10) #create a tuple by enclosing a sequence in parathesis
tup2=tuple(‘string’) #convert any sequence to a tuple
tup1[2] =10 #access any item in a tuple using square brackets
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lists&lt;/strong&gt;&lt;br&gt;
Lists are variable length and their contents can be modified in place (mutable). They are defined by square brackets [] or the list type function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
List_1= [2,3,4] #create list by enclosing sequence in square brackets
List_2=list(tup1) #use the list type function
Output= [8,9,10]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Operations in lists&lt;/strong&gt;&lt;br&gt;
The following operations can be carried out in lists.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
del(List_1[1])  #delete the second element using  the del() function
cop_list=List_1[:] #clone a list or create a copy by using a colon in square brackets
List_2.append(‘cats’) #add an element at the end of the list
Output:
 List_2
[8,9,10,’cats’]
List_2.insert(1, ‘fox’) #insert and element at index 1 of list 2
List_2
[8,’fox’,9,10,’cats’]
List_2.pop(3) #remove element at index 3
List_2
Output: [8,’fox’,9,’cats’]
List_2.extend([6, ‘dogs’]) #append multiple elements to a list using the extend function
List_2
Output: [8,’fox’,9,’cats’, 6, ‘dogs’]
Print(len(List_2)) #obtain number of elements in list
Output:6
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Dictionaries&lt;/strong&gt;&lt;br&gt;
A dictionary is a flexibly sized collection of key-value pairs, where key and value are Python objects. One approach for creating one is to use curly braces {} and colons to separate keys and values. Keys can be strings, numbers or tuples. Values can be any data type.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
dict_a={‘a’:[2,3], ‘b’:[4,5], ‘c’:[8,9]} #create a dictionary by assigning values to keys
dict_a[‘a’] =[2,3] #access value of a key using square brackets

}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Sets&lt;/strong&gt;&lt;br&gt;
A set is an unordered collection of unique elements. They are like dictionaries, but keys only, no values. A set can be created in two ways: via the set function or via a set literal with curly braces.&lt;br&gt;
They do not record element position and have unique elements. They also support arithmetic operations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
Set1=(‘apple’, ‘banana’,’ apple’, ‘mango’, ‘kiwi’) #create a set and excludes duplicates
Set1
Output: {‘apple’, ‘banana’, ‘mango’, ‘kiwi’}
List3= [6,7,8]
Set2= set(List3) #covert a list to a set
Set2={6,7,8}
Set1.remove(‘banana’)
Set1: {‘apple’, ‘mango’, ‘kiwi’}
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Arithmetic operations
&lt;/h2&gt;

&lt;p&gt;Python supports the following operations&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
a+b #Addition 
a-b #Subtraction 
a*b #Multiplication
a/b #Division
a%b #Modulus that returns remainder of division
a**b #Exponentiation
a//b #Floor division that rounds the result to nearest whole number
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>python</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
