<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Siaterlis Konstantinos</title>
    <description>The latest articles on DEV Community by Siaterlis Konstantinos (@siakon89).</description>
    <link>https://dev.to/siakon89</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F367897%2F236bbf34-2531-420a-812a-e6e81a3ebd0c.jpeg</url>
      <title>DEV Community: Siaterlis Konstantinos</title>
      <link>https://dev.to/siakon89</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/siakon89"/>
    <language>en</language>
    <item>
      <title>Pyspark Code Snippets</title>
      <dc:creator>Siaterlis Konstantinos</dc:creator>
      <pubDate>Thu, 28 May 2020 08:55:25 +0000</pubDate>
      <link>https://dev.to/siakon89/pyspark-code-snippets-3p55</link>
      <guid>https://dev.to/siakon89/pyspark-code-snippets-3p55</guid>
      <description>&lt;p&gt;I have been using PySpark for some time now and I thought to share with you the process of how I begin learning Spark, my experiences, problems I encountered, and how I solved them! You are more than welcome to suggest and/or request code snippets in the comments section below or at my twitter &lt;a href="https://twitter.com/siaterliskonsta"&gt;@siaterliskonsta&lt;/a&gt;. I will keep this post maintained and update it once I create more gists.&lt;/p&gt;

&lt;p&gt;The original post is located at &lt;a href="https://thelastdev.com/pyspark-code-snippets/"&gt;my blog&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  DataFrames
&lt;/h1&gt;

&lt;p&gt;From Spark's website, a DataFrame is a &lt;em&gt;Dataset&lt;/em&gt; organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of &lt;a href="https://spark.apache.org/docs/latest/sql-data-sources.html"&gt;sources&lt;/a&gt; such as structured data files, tables in Hive, external databases, or existing RDDs.&lt;/p&gt;

&lt;h1&gt;
  
  
  Reading Files
&lt;/h1&gt;

&lt;p&gt;Here are some options on how to read files with PySpark.&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h1&gt;
  
  
  Selecting Columns
&lt;/h1&gt;

&lt;p&gt;Now that you have some data in your DataFrame you may need to select some specific columns instead of the whole thing. This is how you do it&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h1&gt;
  
  
  Filtering
&lt;/h1&gt;

&lt;p&gt;In addition, you may also want to filter the rows based on some conditions. This is how it works.&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h1&gt;
  
  
  GroupBy
&lt;/h1&gt;

&lt;p&gt;Now that you have learned the basic column and row manipulation, let's move on to some aggregations.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h1&gt;
  
  
  RDDs - Resilient Distributed Datasets
&lt;/h1&gt;

&lt;p&gt;RDDs are fault-tolerant collection of elements that can be operated on in parallel. More information can be found in the official spark documentation &lt;a href="https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-distributed-datasets-rdds"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Reading Files
&lt;/h1&gt;

&lt;p&gt;Let’s see how to read files into Spark RDDs (Resilient Distributed Datasets)&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


</description>
      <category>python</category>
      <category>codenewbie</category>
      <category>100daysofcode</category>
    </item>
    <item>
      <title>Create a Serverless API with AWS Chalice</title>
      <dc:creator>Siaterlis Konstantinos</dc:creator>
      <pubDate>Thu, 23 Apr 2020 10:29:14 +0000</pubDate>
      <link>https://dev.to/siakon89/create-a-serverless-api-with-aws-chalice-32fe</link>
      <guid>https://dev.to/siakon89/create-a-serverless-api-with-aws-chalice-32fe</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev.thelastdev.com%2Fwp-content%2Fuploads%2F2019%2F06%2Fchalice.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev.thelastdev.com%2Fwp-content%2Fuploads%2F2019%2F06%2Fchalice.gif" alt="chalice"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;Indiana Jones diving into AWS Chalice&lt;/center&gt;

&lt;p&gt;Recently I got involved with a package called &lt;a href="https://github.com/aws/chalice" rel="noopener noreferrer"&gt;AWS Chalice&lt;/a&gt;. With this package, you can easily create serverless rest APIs using AWS Gateway and AWS Lambdas. It is easily maintainable and very, VERY easy to deploy to AWS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/siakon89/chalice-projects/tree/master/blog-demo" rel="noopener noreferrer"&gt;You can find the code of the post here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now that SAM (Serverless Application Model) exists, Chalice has lost some ground. SAM is a very powerful framework that helps you build amazing things, but it can get very complicated for someone that has never used CloudFormation before. I've been using SAM the past year to build serverless REST APIs that communicate with our Data Lake and I will show-case SAM in a future post. Chalice, on the other hand, is very simple, elegant, and suitable for small apps.&lt;/p&gt;

&lt;p&gt;Personally, I used Chalice to build triggers from events in AWS and small apps that need up to 5 endpoints. So let's see what Chalice is and how easy it is to use it.&lt;/p&gt;

&lt;h1&gt;
  
  
  Creating a Serverless API using AWS Chalice
&lt;/h1&gt;

&lt;p&gt;In this post, we will create an app with 2 endpoints. One POST endpoint that will put items in a DynamoDB and one GET that will fetch one item from DynamoDB.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Requirements
&lt;/h2&gt;

&lt;p&gt;We need to install chalice in our python virtual environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;chalice
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;And then we'll have to configure the AWS profile on our computer. You can follow the guide &lt;a href="https://docs.aws.amazon.com/polly/latest/dg/setup-aws-cli.html" rel="noopener noreferrer"&gt;here&lt;/a&gt; to see how to set up the AWS CLI.&lt;/p&gt;
&lt;h2&gt;
  
  
  2. Setting up the Environment
&lt;/h2&gt;

&lt;p&gt;This is as easy as it gets, the only thing we need to run to initialize the environment is the following&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;chalice new-project blog-demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This will generate the following files:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;blog-demo
├── app.py  &lt;span class="c"&gt;# This is where the lambda is&lt;/span&gt;
├── .chalice  &lt;span class="c"&gt;# Configuration of our app&lt;/span&gt;
│   └── config.json
├── .gitignore
└── requirements.txt  &lt;span class="c"&gt;# If we have dependencies this will handle it&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  3. Build the app
&lt;/h2&gt;

&lt;p&gt;Now let's open the &lt;strong&gt;app.py&lt;/strong&gt; file and build the API that we discussed above&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;

&lt;p&gt;&lt;br&gt;&lt;br&gt;
Ok, I lied. We created three endpoints... One to insert, one to get and one to fetch them all. Fetching all data from DynamoDB is a bad practice but it will help the purpose of this post.&lt;/p&gt;

&lt;p&gt;Furthermore, we have to set up a policy that can read and write from DynamoDB.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Edit: .chalice/config.json&lt;/strong&gt;&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Create: .chalice/policy-dev.json&lt;/strong&gt;&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;blog-demo-1&lt;/strong&gt; is the name of the DynamoDB table we are going to use. We will use CloudFromation to create the table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;dynamodb_cf_template.yaml&lt;/strong&gt;&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Run the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws cloudformation deploy&lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;--template-file&lt;/span&gt; dynamodb_cf_template.yaml&lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;--stack-name&lt;/span&gt; &lt;span class="s2"&gt;"my-stack"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command will create a DyanmoDB table that is defined in the file &lt;strong&gt;dynamodb_cf_template.yaml&lt;/strong&gt; with the name blog-demo-1. At the end of the post, I will show you how to delete everything!&lt;/p&gt;

&lt;p&gt;Now that we have an app that can read and write from DynamoDB let's test it! Oh, WAIT! Do I have to deploy the app first and then test it online?? No, Chalice comes with a local API that can emulate the behavior of the lambda.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Testing the API Locally
&lt;/h2&gt;

&lt;p&gt;To run the API locally you need to run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;chalice &lt;span class="nb"&gt;local&lt;/span&gt;

&lt;span class="c"&gt;# Response&lt;/span&gt;
Found credentials &lt;span class="k"&gt;in &lt;/span&gt;shared credentials file: ~/.aws/credentials
Serving on http://127.0.0.1:8000  &lt;span class="c"&gt;# This is the localhost&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's test it!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Get the main page&lt;/span&gt;
curl &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/json"&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;-X&lt;/span&gt; GET http://127.0.0.1:8000

&lt;span class="c"&gt;# Response&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"data"&lt;/span&gt;:[]&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This makes sense! We haven't added any data yet. Let's do that!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST&lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;--data&lt;/span&gt; &lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;id&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;abc&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;text&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Hello World&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}"&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; http://127.0.0.1:8000/item

&lt;span class="c"&gt;# Response&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"message"&lt;/span&gt;:&lt;span class="s2"&gt;"ok"&lt;/span&gt;,&lt;span class="s2"&gt;"status"&lt;/span&gt;:201&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yes! We did it! Let's get that item now!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/json"&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;-X&lt;/span&gt; GET http://127.0.0.1:8000/item/abc

&lt;span class="c"&gt;# Response&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"data"&lt;/span&gt;:[&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt;:&lt;span class="s2"&gt;"abc"&lt;/span&gt;,&lt;span class="s2"&gt;"text"&lt;/span&gt;:&lt;span class="s2"&gt;"Hello World"&lt;/span&gt;&lt;span class="o"&gt;}]}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's deploy our app now to AWS! Brace yourselves!&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Deploying to AWS
&lt;/h2&gt;

&lt;p&gt;To deploy the app you will have to set up the right permissions in your AWS account. We will not explore that section in this post.&lt;/p&gt;

&lt;p&gt;I am using an admin account but you can do whatever you want. Make sure your account has the permissions needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;chalice deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the command that will deploy our app to AWS. Chalice will output the API endpoint and the ARN of the Lambda. You can play with that endpoint as we did with the local one, just replace the URL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST&lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;--data&lt;/span&gt; &lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;id&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;abc2&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;text&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Hello Worldsadfsad&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
 -H "&lt;/span&gt;Content-Type: application/json&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
 https://vdzma7mi85.execute-api.eu-central-1.amazonaws.com/api/item
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;a href="https://vdzma7mi85.execute-api.eu-central-1.amazonaws.com/api/" rel="noopener noreferrer"&gt;https://vdzma7mi85.execute-api.eu-central-1.amazonaws.com/api/&lt;/a&gt;&lt;/strong&gt;  is the endpoint that Chalice gave me. Use your own endpoint there.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/json"&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;-X&lt;/span&gt; GET https://vdzma7mi85.execute-api.eu-central-1.amazonaws.com/api/

&lt;span class="c"&gt;# Response&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"data"&lt;/span&gt;:[&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt;:&lt;span class="s2"&gt;"abc"&lt;/span&gt;,&lt;span class="s2"&gt;"text"&lt;/span&gt;:&lt;span class="s2"&gt;"Hello World"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;,&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt;:&lt;span class="s2"&gt;"abc2"&lt;/span&gt;,&lt;span class="s2"&gt;"text"&lt;/span&gt;:&lt;span class="s2"&gt;"Hello Worldsadfsad"&lt;/span&gt;&lt;span class="o"&gt;}]}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see I added some more entries there. It works! Well done, you know have a serverless app!&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Deleting Everything and moving on
&lt;/h2&gt;

&lt;p&gt;Now that we have tested, played, and made ourselves happy, let's DELETE EVERYTHING! You will see that cleaning our resources is very easy and elegant, and that is because we used CloudFormation (Chalice also uses CloudFormation)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Delete Chalice &lt;/span&gt;
chalice delete

&lt;span class="c"&gt;# Delete our DynamoDB table&lt;/span&gt;
aws cloudformation delete-stack &lt;span class="nt"&gt;--stack-name&lt;/span&gt; my-stack
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;my-stack is the name I used to generate the DynamoDB table in section 3. Make sure that you use the same name as the one in section 3, otherwise, you can go to the AWS Console at CloudFormation and delete your stack manually.&lt;/p&gt;

&lt;p&gt;Go to your console and check if everything is deleted&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  API gateway&lt;/li&gt;
&lt;li&gt;  AWS Lambda&lt;/li&gt;
&lt;li&gt;  DyanamoDB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They should be deleted.&lt;/p&gt;

&lt;p&gt;Well, that's it folks! You now know how to create a Serverless app! If you have any questions or suggestions please let me know in the comment section below or at Twitter &lt;a href="https://twitter.com/siaterliskonsta" rel="noopener noreferrer"&gt;@siaterliskonsta&lt;/a&gt;. See you at the next one!&lt;/p&gt;

&lt;p&gt;You can find the original post and many more &lt;a href="https://thelastdev.com/" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>python</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Exploratory Data Analysis with Pandas</title>
      <dc:creator>Siaterlis Konstantinos</dc:creator>
      <pubDate>Thu, 16 Apr 2020 10:32:58 +0000</pubDate>
      <link>https://dev.to/siakon89/exploratory-data-analysis-with-pandas-2mh6</link>
      <guid>https://dev.to/siakon89/exploratory-data-analysis-with-pandas-2mh6</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--b5TNgnZX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thelastdev.com/wp-content/uploads/2020/02/photo-1527474305487-b87b222841cc.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--b5TNgnZX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thelastdev.com/wp-content/uploads/2020/02/photo-1527474305487-b87b222841cc.jpeg" alt="data image from unsplash"&gt;&lt;/a&gt;&lt;br&gt;Image by &lt;a href="https://unsplash.com/@franki"&gt;Franki Chamaki&lt;/a&gt; on Unsplash
  &lt;/p&gt;

&lt;p&gt;In this post, we will see the Exploratory Data Analysis or EDA, but we are not going to use any models or feature extraction methods. Our main focus is on the data. In future posts, we’ll explore feature extraction methods and modeling.&lt;/p&gt;

&lt;p&gt;I consider myself lucky when at the beginning of the day, a dataset is given to me. In most cases, we have to build our own datasets in order to create the models to answer, solve, or explore problems, and honestly, that is ok. It is our job to define the data that we need, to gather them, and eventually build a pretty good dataset. Let’s say we do that, then what? What are the next steps? “I am sure than I know my data very well, after all, I gathered them. I should start building the model and train it!” Well… this is a point, but it is something that is not sitting well with me. I love exploring my dataset, I ask questions, and the data reply with tables, numbers, and visuals. I want to know them better, to explore them, and to learn from them. This is a process I follow every time I get my hands on a dataset, and today I invite you to walk this path with me. We will download a dataset, explore its features, gain insights, and finally formulate some hypotheses.&lt;/p&gt;

&lt;p&gt;This process is called Exploratory Data Analysis, in short EDA, and it is a fundamental ‘tool’ for a Data Scientist. It is a method that allows us to take an in-depth look into our data and gain knowledge of their format, their distribution. Without EDA, we are flying blind.&lt;/p&gt;

&lt;p&gt;Tukey defined data analysis in 1961 as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyze data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;First of all, we need a dataset. We will not go down the path to gather and create our own, this is a topic for another post, for now, we’ll visit Kaggle and download one. Yes! We found the perfect dataset! The old-time classic and very popular &lt;a href="https://www.kaggle.com/c/titanic/data"&gt;Titanic&lt;/a&gt; dataset. It is perfect for our case, there are missing values, strange columns, diverse types, etc.&lt;/p&gt;

&lt;p&gt;Considering the environment I recommend using the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Jupyter notebook (pip install jupyter)&lt;/li&gt;
&lt;li&gt;Pandas (pip install pandas)&lt;/li&gt;
&lt;li&gt;Matplotlib (pip install matplotlib)&lt;/li&gt;
&lt;li&gt;Seaborn (pip install seaborn)&lt;/li&gt;
&lt;li&gt;Kaggle API (pip install kaggle)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So without further ado let’s start!&lt;/p&gt;

&lt;h2&gt;
  
  
  Exploring our data
&lt;/h2&gt;

&lt;p&gt;Let’s download the data from Kaggle. You can follow &lt;a href="https://github.com/Kaggle/kaggle-api"&gt;the instructions here&lt;/a&gt; to install Kaggle API, get the API keys from your account and set it up. Then you run the following command to download the dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;kaggle competitions download &lt;span class="nt"&gt;-c&lt;/span&gt; titanic &lt;span class="nt"&gt;-p&lt;/span&gt; data
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;I have created a folder called data and with the flag -p downloaded the dataset directly to the folder, you can do the same if you want. Now that we have all the data let’s explore.&lt;/p&gt;

&lt;p&gt;We are going to open the dataset and take a glance at the data. Then search for missing values and explore each column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'data/train.csv'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Mn85XH16--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thelastdev.com/wp-content/uploads/2020/02/head_titanic.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Mn85XH16--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thelastdev.com/wp-content/uploads/2020/02/head_titanic.png" alt="alt text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Index&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;'PassengerId'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Survived'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Pclass'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Name'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Sex'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
          &lt;span class="s"&gt;'Age'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'SibSp'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Parch'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Ticket'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Fare'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Cabin'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
          &lt;span class="s"&gt;'Embarked'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'object'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;891&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We have data for 891 passengers of the Titanic. Let’s see if there are any missing values&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;any&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Look at that! There are missing values in our dataset, but we want to know which columns and at what degree&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;total_rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;
&lt;span class="n"&gt;na_values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;column&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;na_values&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;f"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total_rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;:.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%"&lt;/span&gt;

&lt;span class="n"&gt;na_values&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'PassengerId'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'0.00%'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="s"&gt;'Survived'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'0.00%'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="s"&gt;'Pclass'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'0.00%'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="s"&gt;'Name'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'0.00%'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="s"&gt;'Sex'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'0.00%'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="s"&gt;'Age'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'19.87%'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="s"&gt;'SibSp'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'0.00%'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="s"&gt;'Parch'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'0.00%'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="s"&gt;'Ticket'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'0.00%'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="s"&gt;'Fare'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'0.00%'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="s"&gt;'Cabin'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'77.10%'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="s"&gt;'Embarked'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'0.22%'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;There are missing values on the following columns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Age&lt;/li&gt;
&lt;li&gt;Cabin&lt;/li&gt;
&lt;li&gt;Embarked&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cabin is missing the 77.1% of values, with Age coming second with almost 20%. We can explore Age but it will be very difficult to use Cabin.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding your data
&lt;/h2&gt;

&lt;p&gt;There is a detailed explanation of the columns in &lt;a href="https://www.kaggle.com/c/titanic/data"&gt;Kaggle&lt;/a&gt;. The column Survived is our label. The rest of the columns are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  PassengerId: This is the identifier of the passenger&lt;/li&gt;
&lt;li&gt;  Pclass: This is the socio-economic status, we can see that the variable is ordinal

&lt;ul&gt;
&lt;li&gt;  1: Upper&lt;/li&gt;
&lt;li&gt;  2: Middle&lt;/li&gt;
&lt;li&gt;  3: Lower&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;  Name&lt;/li&gt;
&lt;li&gt;  Sex: This is a categorical variable, we will have to assign numbers to each category

&lt;ul&gt;
&lt;li&gt;  0: male&lt;/li&gt;
&lt;li&gt;  1: female&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;  Age: This is also an interval-level variable&lt;/li&gt;
&lt;li&gt;  SibSp: family relations, we will have to explore that variable&lt;/li&gt;
&lt;li&gt;  Parch: family relations, we will also have to explore that variable&lt;/li&gt;
&lt;li&gt;  Ticket: The ticket that passenger had&lt;/li&gt;
&lt;li&gt;  Fare: Passenger fare&lt;/li&gt;
&lt;li&gt;  Cabin: we can already see some NaN values here&lt;/li&gt;
&lt;li&gt;  Embarked: Port

&lt;ul&gt;
&lt;li&gt;  C = Cherbourg&lt;/li&gt;
&lt;li&gt;  Q = Queenstown&lt;/li&gt;
&lt;li&gt;  S = Southampton&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's analyze each feature!&lt;/p&gt;

&lt;h3&gt;
  
  
  Survived
&lt;/h3&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 1 
&lt;/span&gt;&lt;span class="n"&gt;survived&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Survived'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;survived&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_counts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Number of Passengers|"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Survivability"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2 
&lt;/span&gt;&lt;span class="n"&gt;survived&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Survived'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;survived&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_counts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;prop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;survived&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prop&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Probability"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Survivability"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--m0CtGsjF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thelastdev.com/wp-content/uploads/2020/02/survived_count-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--m0CtGsjF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thelastdev.com/wp-content/uploads/2020/02/survived_count-1.png" alt="Number of survived Passengers"&gt;&lt;/a&gt;&lt;br&gt;1. Number of survived Passengers
  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ikMmcAJc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thelastdev.com/wp-content/uploads/2020/02/percentage_of_survived-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ikMmcAJc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thelastdev.com/wp-content/uploads/2020/02/percentage_of_survived-1.png" alt="Percentage of survived"&gt;&lt;/a&gt;&lt;br&gt;2. Percentage of survived
  &lt;/p&gt;

&lt;p&gt;The probability of someone dying in the Titanic is higher than the probability of surviving. This is just an observation we did in our data by just plotting the survival rate. We extracted that hypothesis by just looking at one feature.  Let’s see what we can learn from the other features!&lt;/p&gt;

&lt;h3&gt;
  
  
  P-class
&lt;/h3&gt;

&lt;p&gt;P-class is the class of the passenger. We have three classes in the Titanic dataset:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Upper class&lt;/li&gt;
&lt;li&gt; Middle class&lt;/li&gt;
&lt;li&gt; Lower class&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The class is based on the income of the passenger, it is a proxy for socioeconomic status. Let's see what we can learn from that.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 1
&lt;/span&gt;&lt;span class="n"&gt;survived&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Pclass'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;survived&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_counts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Number of Passengers|"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Pclass"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2
&lt;/span&gt;&lt;span class="n"&gt;passenger_classes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Pclass'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;passenger_classes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_counts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;prop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;passenger_classes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prop&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Percentage of passengers"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Economic Class"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3
&lt;/span&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;barplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Pclass'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Survived'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--02M_Q3Sl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.thelastdev.com/wp-content/uploads/2020/02/pclass_hist-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--02M_Q3Sl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.thelastdev.com/wp-content/uploads/2020/02/pclass_hist-1.png" alt="P-class on passengers"&gt;&lt;/a&gt;&lt;br&gt;1. P-class on passengers
  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pSexujag--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.thelastdev.com/wp-content/uploads/2020/02/pclass_distr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pSexujag--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.thelastdev.com/wp-content/uploads/2020/02/pclass_distr.png" alt="Percentage of P-class among passengers"&gt;&lt;/a&gt;&lt;br&gt;2. Percentage of P-class among passengers
  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PA_0KyQy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.thelastdev.com/wp-content/uploads/2020/02/pclass.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PA_0KyQy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.thelastdev.com/wp-content/uploads/2020/02/pclass.png" alt="Percentage of survived passengers"&gt;&lt;/a&gt;&lt;br&gt;3. Percentage of survived passengers
  &lt;/p&gt;

&lt;p&gt;We can see that Titanic had a lot more passengers from the lower class, we can also see that more than 50% of the passengers are in the lowest economic class. Almost 70% of the non-survivors were at the lowest economic class.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gender
&lt;/h3&gt;

&lt;p&gt;We all know from the famous movie "Titanic" that in the case of emergency women and children are prioritized to the lifeboats. Let's see if we can back it up with data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 1
&lt;/span&gt;&lt;span class="n"&gt;sex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Sex'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_counts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Number of passengers"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sex"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2
&lt;/span&gt;&lt;span class="n"&gt;sex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Sex'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_counts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;prop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sex&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prop&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Percentage of passengers"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sex"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3
&lt;/span&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;barplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Sex'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Survived'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--p7LbfzkF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.thelastdev.com/wp-content/uploads/2020/02/gender_hist.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--p7LbfzkF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.thelastdev.com/wp-content/uploads/2020/02/gender_hist.png" alt="Gender in Titanic"&gt;&lt;/a&gt;&lt;br&gt;1. Gender in Titanic
  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HO8hPUSi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.thelastdev.com/wp-content/uploads/2020/02/gender_disrt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HO8hPUSi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.thelastdev.com/wp-content/uploads/2020/02/gender_disrt.png" alt="Percentage of male/females in Titanic"&gt;&lt;/a&gt;&lt;br&gt;2. Percentage of male/females in Titanic
  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--owN6X76---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.thelastdev.com/wp-content/uploads/2020/02/sex.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--owN6X76---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.thelastdev.com/wp-content/uploads/2020/02/sex.png" alt="Percentage of survived"&gt;&lt;/a&gt;&lt;br&gt;3. Percentage of survived
  &lt;/p&gt;

&lt;p&gt;The numbers support the hypothesis we did, based on the movie! Let's look now at the age. There are some missing data in that column, we are going to use only the ones we have. When we build the model, we'll fill that data based on the observations (future post).&lt;/p&gt;

&lt;h3&gt;
  
  
  Age
&lt;/h3&gt;

&lt;p&gt;Age is an interval-level variable. Interval-level data are quantitative, i.e. years, temperature, etc. In this dataset, Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5. We will transform age into integers with .x going to 0.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 1
&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Age'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_counts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Number of passengers"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Age"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2
&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Age'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;distplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3
&lt;/span&gt;&lt;span class="n"&gt;age_survived&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Survived'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;'Age'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;age_not_survived&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Survived'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;'Age'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;distplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age_survived&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'survived'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;distplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age_not_survived&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'not-survived'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Age'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Percentage'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5HARFe7w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.thelastdev.com/wp-content/uploads/2020/02/age_hist.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5HARFe7w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.thelastdev.com/wp-content/uploads/2020/02/age_hist.png" alt="Age on passengers"&gt;&lt;/a&gt;&lt;br&gt;1. Age on passengers
  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kH0vAoVb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.thelastdev.com/wp-content/uploads/2020/02/age_disrt-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kH0vAoVb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.thelastdev.com/wp-content/uploads/2020/02/age_disrt-1.png" alt="Percentage of passengers by age"&gt;&lt;/a&gt;&lt;br&gt;2. Percentage of passengers by age
  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UBwRx0t1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.thelastdev.com/wp-content/uploads/2020/02/age.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UBwRx0t1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.thelastdev.com/wp-content/uploads/2020/02/age.png" alt="Percentage of survived-not survived passengers by age"&gt;&lt;/a&gt;&lt;br&gt;3. Percentage of survived-not survived passengers by age
  &lt;/p&gt;

&lt;p&gt;Again, the hypothesis is valid, children and infants have a high survival rate. We also see that passengers in their 20s-40s also had a high survival rate, care to bet on their gender? Let's see!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 1
&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Survived'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Sex'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;'female'&lt;/span&gt;&lt;span class="p"&gt;)][&lt;/span&gt;&lt;span class="s"&gt;'Age'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;distplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Age'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Percentage - survived'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2
&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Survived'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Sex'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;'male'&lt;/span&gt;&lt;span class="p"&gt;)][&lt;/span&gt;&lt;span class="s"&gt;'Age'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;distplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Age'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Percentage - survived'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TbrHYujH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.thelastdev.com/wp-content/uploads/2020/02/survived_sex.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TbrHYujH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.thelastdev.com/wp-content/uploads/2020/02/survived_sex.png" alt="Male-Female survivors by age"&gt;&lt;/a&gt;&lt;br&gt;1. Male-Female survivors by age
  &lt;/p&gt;

&lt;p&gt;Well, that makes sense for the female passengers, but we also notice an increase in survival rate on male passengers around their 30s. We know that in the Titanic all women had the highest priority to get into a lifeboat, but it seems that some men potentially with higher P-class managed to get in.&lt;/p&gt;

&lt;p&gt;Now we'll follow the same practices for the rest of the features, and you can see the full analysis of each column &lt;a href="https://github.com/siakon89/machine-learning/tree/master/eda/titnanic_dataset"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Correlations
&lt;/h2&gt;

&lt;p&gt;We'll have to look for correlations between features. First, we assign the value 0 to males and 1 to females&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Sex'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Sex'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s"&gt;'female'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'male'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8jLfcisF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.thelastdev.com/wp-content/uploads/2020/02/screenshot-from-2020-02-09-20-57-46.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8jLfcisF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.thelastdev.com/wp-content/uploads/2020/02/screenshot-from-2020-02-09-20-57-46.png" alt="head of gender processded"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we will search for correlations between the columns. As for age, we will try to see if filling the NaN values is making any difference.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 1
&lt;/span&gt;&lt;span class="n"&gt;corr_matrix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s"&gt;'Survived'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Sex'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Age'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Pclass'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Fare'&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="n"&gt;corr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;corr_matrix&lt;/span&gt;

&lt;span class="c1"&gt;# 2
&lt;/span&gt;&lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'new_age'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Age'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Age'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;corr_matrix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_data&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s"&gt;'Survived'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Sex'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'new_age'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Pclass'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Fare'&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="n"&gt;corr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;corr_matrix&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QxCrVleu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.thelastdev.com/wp-content/uploads/2020/02/correlation.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QxCrVleu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.thelastdev.com/wp-content/uploads/2020/02/correlation.png" alt="Correlation matrix between features"&gt;&lt;/a&gt;&lt;br&gt;1. Correlation matrix between features
  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YjwfThaK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.thelastdev.com/wp-content/uploads/2020/02/correlation_age.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YjwfThaK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.thelastdev.com/wp-content/uploads/2020/02/correlation_age.png" alt="Correlation matrix between features with filled none values on age"&gt;&lt;/a&gt;&lt;br&gt;2. Correlation matrix between features with filled none values on age
  &lt;/p&gt;

&lt;p&gt;Let's try a heatmap also.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pLiU322f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.thelastdev.com/wp-content/uploads/2020/02/corr-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pLiU322f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.thelastdev.com/wp-content/uploads/2020/02/corr-1.png" alt="corr heat map"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that Survived has a correlation with Sex, P-class, and Fare and the movie can totally back this up! There is a lot more in that heatmap.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Fare has a negative correlation with P-class, which means the more the fare increases the more the P-class decreases, which makes sense, P-class 1 is the Upper one.&lt;/li&gt;
&lt;li&gt;  Age has a negative correlation with P-class&lt;/li&gt;
&lt;li&gt;  We see the usual suspects, the Survived is correlated with Sex and Fare and negatively correlated with Pclass. This supports some of the hypotheses we tested above.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is so much more that we can do with the data, so I decided that we will explore the Bayes theorem in a future post. I would like to hear your thoughts, what you would have done differently, what more would you have explored? Let me know in the comment section below or on Twitter &lt;a href="https://twitter.com/siaterliskonsta"&gt;@siaterliskonsta&lt;/a&gt;. If you think I am missing something, or you need more explanation please let me know.&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
