<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kushal</title>
    <description>The latest articles on DEV Community by Kushal (@kushal_).</description>
    <link>https://dev.to/kushal_</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F329513%2Fd0ac187a-26cb-40ca-a2ef-2ee9f4430688.png</url>
      <title>DEV Community: Kushal</title>
      <link>https://dev.to/kushal_</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kushal_"/>
    <language>en</language>
    <item>
      <title>Prompt Engineering - Part 1</title>
      <dc:creator>Kushal</dc:creator>
      <pubDate>Mon, 05 Jun 2023 10:23:09 +0000</pubDate>
      <link>https://dev.to/kushal_/prompt-engineering-part-1-29i</link>
      <guid>https://dev.to/kushal_/prompt-engineering-part-1-29i</guid>
      <description>&lt;p&gt;In this article, I will provide a comprehensive tutorial on prompt engineering, highlighting how to achieve the best and optimal results from Large Language Models (LLMs) such as OpenAI's ChatGPT. &lt;/p&gt;

&lt;p&gt;Prompt engineering has gained significant popularity and widespread usage since the advent of LLMs, leading to a revolution in the field of Natural Language Processing (NLP). The beauty of prompt engineering lies in its versatility, allowing professionals from diverse backgrounds to effectively utilize it and maximize the potential of LLMs.&lt;/p&gt;

&lt;h2&gt;
  Basic Working of ChatGPT
&lt;/h2&gt;

&lt;p&gt;ChatGPT works with the concept of "assistant" and "user" roles to facilitate interactive conversations. The model operates in a back-and-forth manner, where the user provides input or messages, and the assistant responds accordingly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhore0g8pozl9y6ct49ja.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhore0g8pozl9y6ct49ja.png" alt="Internal-working-LLM"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The user role represents the individual engaging in the conversation. As a user, you can provide instructions, queries, or any text-based input to the model - which forms the &lt;strong&gt;prompt&lt;/strong&gt; to the model.&lt;/p&gt;

&lt;p&gt;The assistant role refers to the AI language model itself, which is designed to generate responses based on the user's input. &lt;br&gt;
The model processes the &lt;em&gt;conversation history&lt;/em&gt;, including both user and assistant messages, to generate a relevant and coherent response. It takes into account the context and information provided in the conversation history to generate more accurate and appropriate replies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7vfwbpukkr67mjxqjnkq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7vfwbpukkr67mjxqjnkq.png" alt="user-assistant-roles"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The conversation typically starts with a system message that sets the &lt;em&gt;behavior of the assistant&lt;/em&gt;, followed by alternating user and assistant messages. By maintaining a conversational context, the model can generate more consistent and context-aware responses.&lt;/p&gt;

&lt;p&gt;To maintain the &lt;em&gt;context&lt;/em&gt;, it is important to include the relevant conversation history when interacting with the model. This ensures that the model understands the ongoing conversation and can provide appropriate responses based on the given context.&lt;/p&gt;
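&lt;p&gt;As a minimal sketch (illustrative only, not the article's code), the conversation history is just a growing list of role-tagged messages that is resent with every request:&lt;/p&gt;

```python
# Illustrative sketch: conversation history as a list of role-tagged
# messages. The whole list is sent with each request, which is how the
# model "remembers" earlier turns.
history = [
    {"role": "system", "content": "You are a helpful assistant."},
]

def add_turn(history, role, content):
    """Append one message to the running conversation history."""
    history.append({"role": role, "content": content})
    return history

add_turn(history, "user", "What is prompt engineering?")
add_turn(history, "assistant", "It is the craft of writing effective prompts.")
add_turn(history, "user", "Give me one tip.")  # the model sees all prior turns

roles = [m["role"] for m in history]
print(roles)  # ['system', 'user', 'assistant', 'user']
```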

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftccn3p00h9hizug1d5k4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftccn3p00h9hizug1d5k4.png" alt="context-history"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  Template for Prompt Usage
&lt;/h2&gt;

&lt;p&gt;In this section, we will write the boilerplate code that will form the basis for all our tasks.&lt;br&gt;
To begin with, we need to generate a &lt;em&gt;secret key&lt;/em&gt; from our OpenAI account.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; 

&lt;span class="c1"&gt;# Generating Secret Key from your OpenAI Account
&lt;/span&gt;&lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;OPEN-AI-SECRET-KEY&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="c1"&gt;# Template function 
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_completion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-3.5-turbo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are the assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChatCompletion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# this is the degree of randomness of the model's output
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;reply&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;reply&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The function has three inputs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt&lt;/li&gt;
&lt;li&gt;Model (defaults to &lt;code&gt;gpt-3.5-turbo&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Temperature &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We have covered the model and the role of the "user." Now, let's move on to the next two inputs: prompt and temperature.&lt;/p&gt;

&lt;p&gt;Prompt refers to the text input provided to the model, which serves as a guiding instruction or query for generating a response.&lt;/p&gt;

&lt;p&gt;Temperature, on the other hand, is a hyperparameter that plays a crucial role in determining the behavior of the model's output. Not to be confused with its real-world connotation, this metric controls the level of randomness in the generated responses. By adjusting the temperature value, we can influence the model's output.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When the &lt;code&gt;temperature&lt;/code&gt; is set to a higher value, the model produces more diverse and creative responses. Conversely, lower temperature values make the model more focused and deterministic, often resulting in more precise but potentially less varied outputs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Choosing an appropriate temperature depends on the specific task and desired output.&lt;/p&gt;
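&lt;p&gt;To build intuition, temperature can be thought of as rescaling the model's token scores before sampling. The following plain-Python sketch is illustrative only, not the actual OpenAI implementation:&lt;/p&gt;

```python
import math

def apply_temperature(logits, temperature):
    """Convert raw token scores into a sampling distribution; lower
    temperature sharpens it, higher temperature flattens it."""
    if temperature == 0:
        # temperature 0 is treated as greedy decoding: all mass on the argmax
        probs = [0.0] * len(logits)
        probs[max(range(len(logits)), key=lambda i: logits[i])] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(apply_temperature(logits, 0))    # greedy: [1.0, 0.0, 0.0]
sharp = apply_temperature(logits, 0.5)
flat = apply_temperature(logits, 2.0)
# the top token's share shrinks as temperature rises
print(sharp[0] > flat[0])              # True
```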

&lt;h2&gt;
  Basics of Prompting
&lt;/h2&gt;

&lt;p&gt;ChatGPT can perform a plethora of tasks, such as &lt;em&gt;Text Summarisation, Information Extraction, Question Answering, Text Classification, Sentiment Analysis, and Code Generation&lt;/em&gt;, to name a few. &lt;br&gt;
Prompts can be designed to undertake single or multiple tasks depending on the use-case.&lt;/p&gt;

&lt;p&gt;In the example below, we showcase a basic text summarisation task performed by GPT.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prod_desc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
3 MODES &amp;amp; ROTATABLE NOZZLE DESIGN- This portable oral irrigator comes with Normal, Soft and Pulse modes which are best for professional use. The 360° rotatable jet tip design allows easy cleaning helping prevent tooth decay, dental plaque, dental calculus, gingival bleeding and dental hypersensitivity.
DUAL WATERPROOF DESIGN- The IPX7 waterproof design is adopted both internally and externally to provide dual protection. The intelligent ANTI-LEAK design prevents leakage and allows the dental flosser to be used safely under the running water.
UPGRADED 300 ML LARGE CAPACITY WATER TANK- The new water tank is the largest capacity tank available and provides continuous flossing for an entire session. The removable full-opening design allows thorough cleaning thus preventing formation of bacteria and limescale deposits.
CORDLESS &amp;amp; QUALITY ASSURANCE- Cordless and lightweight power irrigator comes with a powerful battery that lasts upto 14 days on a single charge
RECHARGEABLE &amp;amp; QUALITY ASSURANCE- Cordless and lightweight power irrigator comes with a powerful battery that lasts upto 14 days on a single charge
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;


&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Your task is to generate a short summary of a product &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;description from an ecommerce site. 

Summarize the description below, delimited by tags, in at most 50 words. 

Review: &amp;lt;tag&amp;gt;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prod_desc&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/tag&amp;gt;

Output should be in JSON format with &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; as key.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_completion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;{&lt;br&gt;
    "summary": "This portable oral irrigator has 3 modes and a rotatable nozzle design for easy cleaning. It has a dual waterproof design and a large 300ml water tank. It is cordless, rechargeable, and comes with a powerful battery that lasts up to 14 days on a single charge."&lt;br&gt;
} &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Some key points to take away from the above code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When constructing a prompt, it's important to provide &lt;strong&gt;clear and specific&lt;/strong&gt; instructions to guide the model's behavior. This helps ensure that the generated output aligns with your desired outcome.&lt;/li&gt;
&lt;li&gt;Delimit input data: to differentiate the prompt's instructions from the actual input data, use delimiters. Delimiters can take various forms, such as quotation marks (" "), angle brackets (&amp;lt; &amp;gt;), XML-style tags (&amp;lt;tag&amp;gt;&amp;lt;/tag&amp;gt;), colons (:::), or triple backticks. A delimiter creates a clear boundary that helps the model parse the prompt. &lt;/li&gt;
&lt;li&gt;Request structured output: if your task requires a specific format or structure for the model's response, mention it explicitly in the prompt. This makes the output easy to consume programmatically.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is a more detailed breakdown of the above prompt.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Elements of Prompts&lt;/th&gt;
&lt;th&gt;Breakdown&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Instruction&lt;/td&gt;
&lt;td&gt;To generate a short summary of a product description.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task&lt;/td&gt;
&lt;td&gt;Summarize the description&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task Constraints&lt;/td&gt;
&lt;td&gt;At most 50 words.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Input Data Delimiter&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&amp;lt;tag&amp;gt; &amp;lt;/tag&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output Format&lt;/td&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;The key elements of a prompt are: Instruction, Task, Constraints, Output Indicator, and Input Data. &lt;/p&gt;
&lt;/blockquote&gt;
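&lt;p&gt;These elements can be composed into a small helper. The function below is a hypothetical sketch (not from the article) that assembles a prompt from its key elements, using the triple-colon delimiter style that also appears later in this article:&lt;/p&gt;

```python
def build_prompt(instruction, task, constraints, input_data, output_format):
    """Assemble a prompt from its key elements; the input data is delimited
    by triple colons so the model can separate it from the instructions."""
    return (
        f"{instruction}\n\n"
        f"{task}, {constraints}.\n\n"
        f"Return the output as {output_format}.\n\n"
        f"Input: :::{input_data}:::"
    )

p = build_prompt(
    instruction="Your task is to summarize a product description.",
    task="Summarize the description below",
    constraints="in at most 50 words",
    input_data="A cordless water flosser with three modes.",
    output_format="JSON with 'summary' as key",
)
print(p)
```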

&lt;h2&gt;
  Multi-tasking Prompts
&lt;/h2&gt;

&lt;p&gt;Consider a scenario where you are presented with a text and need to perform sentiment analysis, summarize the content, and extract topics from it.&lt;br&gt;
In the pre-LLM era, accomplishing these tasks would typically involve training separate models for each task or relying on pre-trained models. However, with the advent of LLMs like ChatGPT, all of these tasks can now be executed efficiently using a single prompt. This eliminates the need for multiple specialized models and streamlines the workflow.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

review = f""" Writing this review after using it for a couple of months now. It can take some time to get used to since the water jet is quite powerful. It might take you a couple of tries to get comfortable with some modes. Start with the teeth, get comfortable and then move on to the gums. Some folks may experience sensitivity. I experienced it for a day or so and then went away.
It effectively knocks off debris from between the teeth especially the hard to get like the fibrous ones. I haven't seen much difference in the tartar though. Hopefully, with time, it gets rid of it too.
There are 3 modes of usage: normal, soft and pulse. I started with soft then graduated to pulse and now use normal mode. For the ones who are not sure, soft mode is safe as it doesn't hit hard. Once you get used to the technique of holding and using the product, you could start experimenting with the other modes and choose the one that best suits you.
One time usage of the water full of tank should usually be sufficient if your teeth are relatively clean. If, however, you have hard to reach spaces with buildup etc. it might require a refill for a usage.
If you don't refill at all, one time full recharge of the battery in normal mode will last you 4 days with maximum strength of the water jet. If you refill it once, it'll last you 2 days after which the strength of the water jet reduces.
As for folks who are worried about the charging point getting wet, I accidentally used it once without the plug for the charging point and yet it worked fine and had no issues. Ideally keep the charging point covered with the plug provided with the product.
It has 2 jet heads (pink and blue) and hence the product can be used by 2 people as long as it's used hygienically. For charging, it comes with a USB cable without the adapter which shouldn't be an issue as your phone adapter should do the job.
I typically wash the product after every usage as the used water tends to run on the product during usage.
One issue I see is that the clasp for the water tank could break accidentally if not handled properly which will render the tank useless. So ensure to not keep it open unless you are filling the tank.
"""


prompt = f"""
Your task is to provide insights for the product review \
on an e-commerce website, which is delimited by \
triple colons.

Perform the following tasks:
1. Identify the product.
2. Summarize the product review, in up to 50 words.
3. Analyze the sentiment of the review - positive/negative/neutral
4. Extract topics that the user didn't like about the product.
5. Identify the name of the company; if not mentioned, output "not mentioned"

Use the following format:
1. Product - &amp;lt;product&amp;gt;
2. Summary - &amp;lt;summary&amp;gt;
3. Sentiment - &amp;lt;user_sentiment&amp;gt;
4. Topics - &amp;lt;negative_topics&amp;gt;
5. Company - &amp;lt;company&amp;gt;
Use JSON format for the output.

Product review: :::{review}:::
"""


response = get_completion(prompt)
print(response)



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;{&lt;br&gt;
    "Product": "Water Flosser",&lt;br&gt;
    "Summary": "The water flosser is effective in removing debris from between teeth, but may take some time to get used to. It has 3 modes of usage and a full tank can last for one usage. The charging point should be covered with the provided plug. The clasp for the water tank could break if not handled properly.",&lt;br&gt;
    "Sentiment": "Neutral",&lt;br&gt;
    "Topics": "Difficulty in getting used to the product, sensitivity, no significant difference in tartar removal, clasp for water tank could break",&lt;br&gt;
    "Company": "not mentioned"&lt;br&gt;
}&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;From the aforementioned example, we can infer that by explicitly listing the tasks and providing a structured format, we enable ChatGPT to understand and address each task individually. &lt;/p&gt;

&lt;p&gt;Furthermore, we can enhance the prompt by including specific conditions or instructions for each task. This allows for a more tailored and accurate response from ChatGPT, as it can take into account the unique requirements and constraints of each task.&lt;/p&gt;
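&lt;p&gt;One practical benefit of requesting JSON is that the reply can be consumed directly in code. A sketch, assuming the model returned valid JSON shaped like the output above:&lt;/p&gt;

```python
import json

# Hypothetical model reply, shaped like the output shown above.
reply = '{"Product": "Water Flosser", "Sentiment": "Neutral", "Company": "not mentioned"}'

data = json.loads(reply)  # raises json.JSONDecodeError if the model strayed from JSON
if data["Sentiment"] == "Negative":
    print("route to support team:", data["Product"])
else:
    print("log and move on:", data["Product"])
```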

&lt;h2&gt;
  Iterative Prompt Development
&lt;/h2&gt;

&lt;p&gt;As we reach the final section of the article, it's crucial to acknowledge that designing and crafting prompts is similar to optimizing or selecting ML models. It is an iterative process, although typically simpler and less complex. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjevqlq3kdcrq8qmgantx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjevqlq3kdcrq8qmgantx.png" alt="iterative_prompt"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Creating effective prompts requires experimentation, observation, and continuous refinement. It's important to iterate and fine-tune the prompts based on the desired output required by the use-cases. &lt;/p&gt;

&lt;h2&gt;
  End-notes
&lt;/h2&gt;

&lt;p&gt;In Part 1 of this series, we provided a brief introduction to the foundations of prompt engineering that can get you started on building your own applications. &lt;br&gt;
As we move forward, subsequent parts will delve into various techniques and concepts, including LangChain and chatbots.&lt;/p&gt;

</description>
      <category>openai</category>
      <category>chatgpt</category>
      <category>datascience</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>TensorFlow Model Deployment using FastAPI &amp; Docker</title>
      <dc:creator>Kushal</dc:creator>
      <pubDate>Fri, 02 Apr 2021 17:10:27 +0000</pubDate>
      <link>https://dev.to/kushal_/tensorflow-model-deployment-using-fastapi-docker-4183</link>
      <guid>https://dev.to/kushal_/tensorflow-model-deployment-using-fastapi-docker-4183</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimg.shields.io%2Fbadge%2FTensorFlow%2520-%2523FF6F00.svg%3F%26style%3Dfor-the-badge%26logo%3DTensorFlow%26logoColor%3Dwhite" class="article-body-image-wrapper"&gt;&lt;img alt="TensorFlow" src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimg.shields.io%2Fbadge%2FTensorFlow%2520-%2523FF6F00.svg%3F%26style%3Dfor-the-badge%26logo%3DTensorFlow%26logoColor%3Dwhite"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimg.shields.io%2Fbadge%2Fdocker%2520-%25230db7ed.svg%3F%26style%3Dfor-the-badge%26logo%3Ddocker%26logoColor%3Dwhite" class="article-body-image-wrapper"&gt;&lt;img alt="Docker" src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimg.shields.io%2Fbadge%2Fdocker%2520-%25230db7ed.svg%3F%26style%3Dfor-the-badge%26logo%3Ddocker%26logoColor%3Dwhite"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimg.shields.io%2Fbadge%2Fpython%2520-%252314354C.svg%3F%26style%3Dfor-the-badge%26logo%3Dpython%26logoColor%3Dwhite" class="article-body-image-wrapper"&gt;&lt;img alt="Python" src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimg.shields.io%2Fbadge%2Fpython%2520-%252314354C.svg%3F%26style%3Dfor-the-badge%26logo%3Dpython%26logoColor%3Dwhite"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL; DR:&lt;/strong&gt; &lt;br&gt;
In this article, we are going to build a &lt;em&gt;TensorFlow (v2)&lt;/em&gt; model, create REST API calls to predict from it using FastAPI, and finally containerize it using &lt;em&gt;Docker&lt;/em&gt; 😃&lt;/p&gt;

&lt;p&gt;I want to emphasize the usage of &lt;strong&gt;FastAPI&lt;/strong&gt; and how this framework is a game-changer for building simple and much faster API calls for a machine learning pipeline.&lt;br&gt;
Traditionally, we have been using Flask microservices for building REST APIs, but the process involves a fair amount of nitty-gritty to understand and implement the framework.&lt;br&gt;
On the other hand, I found FastAPI to be pretty user-friendly and very easy to pick up and implement.&lt;/p&gt;

&lt;p&gt;And finally, from one game-changer to another: &lt;strong&gt;Docker&lt;/strong&gt;. &lt;br&gt;
As data scientists, our role is vague and keeps changing year in, year out. Some skills get added, others become obsolete, but Docker has made its mark as one of the most important and sought-after skills in the market. &lt;em&gt;Docker&lt;/em&gt; gives us the ability to containerize a solution with all its dependent software and requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We have used a text classification problem : &lt;a href="https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews" rel="noopener noreferrer"&gt;IMDb Dataset&lt;/a&gt; for the purpose of building the model.&lt;/p&gt;

&lt;p&gt;The dataset comprises 50,000 reviews of movies and is a binary classification problem with the target variable being a &lt;em&gt;sentiment&lt;/em&gt;: positive or negative.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Preprocessing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We use TensorFlow's &lt;em&gt;TextVectorization&lt;/em&gt; layer, which handles the text preprocessing and outputs a layer that we will use when building the graph of a Sequential or Functional model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;VOCAB_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;
&lt;span class="n"&gt;encoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;experimental&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;preprocessing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TextVectorization&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;VOCAB_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;standardize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;lower_and_strip_punctuation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_mode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;int&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_sequence_length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;adapt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;We can go for &lt;em&gt;custom standardization&lt;/em&gt; by writing a function for our own use case, but there are some bugs in tf:2.4.1 that cause trouble when creating a REST API call for the model.&lt;/p&gt;
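&lt;p&gt;For reference, here is a pure-Python sketch of the kind of custom standardization you might write. In TensorFlow, the same steps would be expressed with &lt;code&gt;tf.strings&lt;/code&gt; ops and passed via the &lt;code&gt;standardize&lt;/code&gt; argument of &lt;em&gt;TextVectorization&lt;/em&gt;; this plain version only illustrates the cleaning logic:&lt;/p&gt;

```python
import re
import string

# "\x3c" and "\x3e" are the angle-bracket characters, written as escapes;
# the pattern matches the HTML line-break tags found in raw IMDb reviews.
BREAK_TAG = re.compile("\x3c" + r"br\s*/?" + "\x3e")
PUNCT = re.compile("[%s]" % re.escape(string.punctuation))

def custom_standardization(text):
    """Lowercase, drop HTML break tags, and strip punctuation."""
    text = text.lower()
    text = BREAK_TAG.sub(" ", text)
    text = PUNCT.sub("", text)
    return text

print(custom_standardization("Great movie!\x3cbr /\x3eWould watch again."))
# great movie would watch again
```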

&lt;p&gt;&lt;strong&gt;Model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As we can see below, we are using the &lt;em&gt;encoder&lt;/em&gt; layer on top of an &lt;em&gt;Embedding&lt;/em&gt; layer that outputs a 256-dimension vector.&lt;br&gt;
The rest of the graph is self-explanatory, although we are producing a &lt;em&gt;probabilistic output instead of a 2-class softmax layer&lt;/em&gt;: the closer the probability is to 1, the more positive the sentiment of the review, and vice versa.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="c1"&gt;# Creating the model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nc"&gt;Embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_vocabulary&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt; &lt;span class="n"&gt;output_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mask_zero&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;GlobalAveragePooling1D&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="nc"&gt;Dropout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;relu&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;Dropout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sigmoid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;After initialising the graph, we compile and fit the model:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Compiling the model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optimizers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Adam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;learning_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.001&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
              &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;losses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;BinaryCrossentropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;from_logits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;


&lt;span class="c1"&gt;# Training the model
&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;validation_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test_dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                    &lt;span class="n"&gt;validation_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Evaluation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After model training, we evaluate the model on the test dataset and get a reasonably satisfactory test accuracy of 86.2%.&lt;br&gt;
(Our major focus here is the API &amp;amp; Docker, not squeezing the best possible performance out of the model for this scenario.)&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Evaluating the model on test dataset
&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loss: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accuracy: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Output:
&lt;/span&gt;&lt;span class="n"&gt;Loss&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mf"&gt;0.3457462191581726&lt;/span&gt;
&lt;span class="n"&gt;Accuracy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mf"&gt;0.8626000285148621&lt;/span&gt;


&lt;span class="c1"&gt;# Saving the model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tf_keras_imdb/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In TensorFlow, we can save the model in two formats: the SavedModel ('tf') format or the HDF5 ('.h5') format. Our model cannot be saved in '.h5' format because it uses the &lt;em&gt;TextVectorization&lt;/em&gt; layer, which the HDF5 format does not support.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FastAPI&lt;/strong&gt;&lt;br&gt;
Before we start creating APIs, we need a particular directory structure that will be utilized for creating a Docker image.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;tf_keras_imdb/&lt;/em&gt; : SavedModel from TensorFlow&lt;br&gt;
main.py : Python file that creates the REST API using the FastAPI framework&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;|
|--- model
|    |______ tf_keras_imdb/
|
|--- app
|    |_______ main.py
|
|--- Dockerfile
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Whenever we build an API using FastAPI, we use &lt;em&gt;pydantic&lt;/em&gt; to declare the type of input our API expects: for example, a list, dictionary, JSON object, string, integer, or float.&lt;/p&gt;

&lt;p&gt;To create such a schema with pydantic, we subclass &lt;em&gt;BaseModel&lt;/em&gt;, which defines the types of our inputs.&lt;/p&gt;
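&lt;p&gt;As a brief illustration (the class and field names here are hypothetical, not taken from the article's repository), the request body for a sentiment API could be declared like this:&lt;/p&gt;

```python
# Hypothetical pydantic schema for the review text our API expects
from pydantic import BaseModel

class Review(BaseModel):
    text: str  # the raw movie review to classify

# FastAPI parses and validates incoming JSON against this schema
review = Review(text="A surprisingly heartfelt film.")
print(review.text)
```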

&lt;p&gt;One of the reasons FastAPI is faster and more efficient is its use of ASGI (Asynchronous Server Gateway Interface) instead of the traditional WSGI (Web Server Gateway Interface) used by Flask and Django.&lt;/p&gt;

&lt;p&gt;A POST request is assigned to our prediction endpoint, since it requires us to &lt;em&gt;post&lt;/em&gt; the data and fetch back the results.&lt;/p&gt;

&lt;p&gt;Uvicorn is a lightning-fast ASGI server implementation; it runs a server on our host machine and serves the API that hosts the model.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;We can test our API on SwaggerUI:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2774y8yzfl8yo2eoct8n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2774y8yzfl8yo2eoct8n.png" alt="SwaggerUI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Docker&lt;/strong&gt;&lt;br&gt;
Finally, to wrap it all up, we create a &lt;em&gt;Dockerfile&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; tiangolo/uvicorn-gunicorn-fastapi:python3.7&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;&lt;span class="nv"&gt;tensorflow&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;2.4.1

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; ./model /model/&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; ./app /app&lt;/span&gt;

&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; 8000&lt;/span&gt;

&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["python", "main.py"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We base our image on &lt;em&gt;tiangolo/uvicorn-gunicorn-fastapi&lt;/em&gt;, a public image on Docker Hub, which makes quick work of creating a Docker image around our own functionality.&lt;/p&gt;

&lt;p&gt;To build the Docker image and run it, we execute the following commands, and voila!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; api &lt;span class="nb"&gt;.&lt;/span&gt;

docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; 8000:8000 api
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After working through FastAPI and Docker, I feel this skillset is an essential part of a data scientist's toolkit. Building around a model and deploying it has become easier and much more accessible than it was before.&lt;/p&gt;

&lt;p&gt;Github Link: &lt;a href="https://github.com/kushalvala/fastapi-nlp" rel="noopener noreferrer"&gt;https://github.com/kushalvala/fastapi-nlp&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Kushal Vala&lt;br&gt;
Data Scientist&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>fastapi</category>
      <category>docker</category>
    </item>
    <item>
      <title>Data and Sampling Distributions- II</title>
      <dc:creator>Kushal</dc:creator>
      <pubDate>Thu, 16 Jul 2020 13:32:31 +0000</pubDate>
      <link>https://dev.to/kushal_/data-and-sampling-distributions-ii-32ph</link>
      <guid>https://dev.to/kushal_/data-and-sampling-distributions-ii-32ph</guid>
<description>&lt;p&gt;At the end of Part-I, we talked about how to calculate an estimate for the Standard Error of a Statistic. We will continue that discussion here.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Bootstrap
&lt;/h4&gt;

&lt;p&gt;One easy and effective way to estimate the sampling distribution of a statistic, or of model parameters, is to draw additional samples, &lt;em&gt;with replacement&lt;/em&gt;, from the sample itself and recalculate the statistic or model for each resample. This procedure is called the &lt;em&gt;bootstrap&lt;/em&gt;, and it does not necessarily involve any assumptions about the data or the sample statistic being normally distributed.&lt;/p&gt;

&lt;p&gt;Conceptually, you can imagine the bootstrap as replicating the original sample thousands or millions of times so that you have a &lt;em&gt;hypothetical population&lt;/em&gt; that embodies all the knowledge from your original sample.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--e1IGSCe7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/u5zxww2u3pr4na1muoa0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--e1IGSCe7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/u5zxww2u3pr4na1muoa0.png" alt="Bootstrap" width="800" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In practice, it is not necessary to actually replicate the sample a huge number of times. We simply replace each observation after each draw, i.e., we &lt;em&gt;sample with replacement&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The algorithm for bootstrap resampling of the mean for a sample size of &lt;em&gt;n&lt;/em&gt; is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Draw a sample value, record it, and then replace it.&lt;/li&gt;
&lt;li&gt;Repeat &lt;em&gt;n&lt;/em&gt; times.&lt;/li&gt;
&lt;li&gt;Record the mean of the &lt;em&gt;n&lt;/em&gt; resampled values.&lt;/li&gt;
&lt;li&gt;Repeat steps 1-3 &lt;em&gt;R&lt;/em&gt; times.&lt;/li&gt;
&lt;li&gt;Use the &lt;em&gt;R&lt;/em&gt; results to:

&lt;ol&gt;
&lt;li&gt;Calculate their standard deviation (this estimates the standard error of the sample mean).&lt;/li&gt;
&lt;li&gt;Produce a boxplot or histogram.&lt;/li&gt;
&lt;li&gt;Find a &lt;em&gt;Confidence Interval&lt;/em&gt;.
&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The number of bootstrap iterations, &lt;em&gt;R&lt;/em&gt;, is set arbitrarily: the more iterations, the more accurate the estimate of the standard error.&lt;/p&gt;

&lt;p&gt;From the previous dataset of &lt;em&gt;Red Wine Quality Estimation&lt;/em&gt;, we are taking &lt;em&gt;Total Sulfur Dioxide&lt;/em&gt; as a key feature to calculate the &lt;em&gt;bias&lt;/em&gt; and an estimate of &lt;em&gt;standard error&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;resample&lt;/span&gt;
&lt;span class="n"&gt;boot_sample&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;nrepeat&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;sample&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;resample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;replace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boot_sample&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Bootstrap Statistics:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Original Population Size : &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Bootstrap Sample Size : &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;boot_sample&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Original: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;median&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Bias: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Standard Error: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;std&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;#Output:
&lt;/span&gt;&lt;span class="n"&gt;Bootstrap&lt;/span&gt; &lt;span class="n"&gt;Statistics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="n"&gt;Original&lt;/span&gt; &lt;span class="n"&gt;Population&lt;/span&gt; &lt;span class="n"&gt;Size&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mi"&gt;1599&lt;/span&gt;
&lt;span class="n"&gt;Bootstrap&lt;/span&gt; &lt;span class="n"&gt;Sample&lt;/span&gt; &lt;span class="n"&gt;Size&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="n"&gt;Original&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mf"&gt;38.0&lt;/span&gt;
&lt;span class="n"&gt;Bias&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.016345870231326387&lt;/span&gt;
&lt;span class="n"&gt;Standard&lt;/span&gt; &lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mf"&gt;1.071951943585676&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The bootstrap can be used with &lt;em&gt;multivariate data&lt;/em&gt;, where the rows are sampled as units.&lt;br&gt;
A model might then be run on the bootstrapped data, for example, to estimate the &lt;em&gt;stability (variability)&lt;/em&gt; of model parameters, or to improve predictive power.&lt;br&gt;
With tree-based methods such as CART (as in &lt;em&gt;Random Forest&lt;/em&gt;), running multiple trees on bootstrap samples and then averaging their predictions (or, for classification, taking a majority vote) generally performs better than using a single tree.&lt;/p&gt;

&lt;p&gt;As we can observe, the concept of the &lt;em&gt;Bootstrap&lt;/em&gt; is used extensively in &lt;em&gt;Machine Learning&lt;/em&gt;.&lt;/p&gt;
&lt;h4&gt;
  
  
  Confidence Intervals
&lt;/h4&gt;

&lt;p&gt;The concept of a &lt;em&gt;Confidence Interval&lt;/em&gt; is rooted in the idea of &lt;em&gt;uncertainty&lt;/em&gt;. A &lt;em&gt;point estimate&lt;/em&gt; is a single value; presenting a range of values instead helps communicate the uncertainty in that estimate.&lt;/p&gt;

&lt;p&gt;Confidence intervals always come with a coverage level, expressed as a (high) percentage, say 90% or 95%.&lt;br&gt;
One way to think of a 90% confidence interval is as follows: it is the interval that encloses the central 90% of the bootstrap sampling distribution of a sample statistic. &lt;br&gt;
More generally, an x% confidence interval around a sample estimate should, on average, contain similar sample estimates x% of the time (when a similar sampling procedure is followed).&lt;/p&gt;

&lt;p&gt;Bootstrap is a general tool that can be used to generate confidence intervals for most statistics, or model parameters.&lt;/p&gt;

&lt;p&gt;The percentage associated with the confidence interval is termed the level of confidence. The higher the level of confidence, the wider the interval.&lt;br&gt;
Also, the smaller the sample, the wider the interval (i.e., the greater the uncertainty)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For a data scientist, a confidence interval is a tool that can be used to get an idea of how variable a sample result might be.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Creating a dataset from normal distribution
&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;
&lt;span class="c1"&gt;# bootstrap
&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# bootstrap sample
&lt;/span&gt;    &lt;span class="n"&gt;indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sample&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# calculate and store statistic
&lt;/span&gt;    &lt;span class="n"&gt;statistic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;statistic&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;50th percentile (median) = %.3f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;median&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# calculate 95% confidence intervals (100 - alpha)
&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;5.0&lt;/span&gt;
&lt;span class="c1"&gt;# calculate lower percentile (e.g. 2.5)
&lt;/span&gt;&lt;span class="n"&gt;lower_p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;
&lt;span class="c1"&gt;# retrieve observation at lower percentile
&lt;/span&gt;&lt;span class="n"&gt;lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lower_p&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%.1fth percentile = %.3f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lower_p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# calculate upper percentile (e.g. 97.5)
&lt;/span&gt;&lt;span class="n"&gt;upper_p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# retrieve observation at upper percentile
&lt;/span&gt;&lt;span class="n"&gt;upper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;upper_p&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%.1fth percentile = %.3f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;upper_p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this article, we covered two major concepts, &lt;em&gt;Confidence Intervals&lt;/em&gt; and the &lt;em&gt;Bootstrap&lt;/em&gt;; these two concepts are widely used in the field of Data Science for various applications.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Fin&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Data and Sampling Distributions- I</title>
      <dc:creator>Kushal</dc:creator>
      <pubDate>Mon, 13 Jul 2020 07:41:12 +0000</pubDate>
      <link>https://dev.to/kushal_/data-and-sampling-distributions-i-5g0k</link>
      <guid>https://dev.to/kushal_/data-and-sampling-distributions-i-5g0k</guid>
<description>&lt;p&gt;In the previous series, we delved deeply into &lt;em&gt;Exploratory Data Analysis&lt;/em&gt; and the many tools we have at our disposal as &lt;em&gt;Data Scientists&lt;/em&gt; to analyze and synthesize our data.&lt;/p&gt;

&lt;p&gt;A popular misconception in the age of &lt;strong&gt;Big Data&lt;/strong&gt; is that, because of the size and nature of the data, the need for sampling is redundant. On the contrary, because of the varying quality of data, the need for sampling is still prevalent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cHhFOTFZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/0i79s1ou9zmk1rd2d2vq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cHhFOTFZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/0i79s1ou9zmk1rd2d2vq.png" alt="Population-Sample" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The left-hand side is the &lt;em&gt;population&lt;/em&gt;, which is assumed to follow an unknown distribution. The right-hand side is the &lt;em&gt;sample&lt;/em&gt;, with an empirical distribution. &lt;br&gt;
The process of drawing data from the left-hand side to form the right-hand side is called &lt;em&gt;sampling&lt;/em&gt;, and it is a major concern in data science.&lt;/p&gt;
&lt;h4&gt;
  
  
  Random Sampling and Sample Bias
&lt;/h4&gt;

&lt;p&gt;A &lt;em&gt;sample&lt;/em&gt; is a subset of data from a larger data set; statisticians call this larger data set the &lt;em&gt;population&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Random sampling&lt;/em&gt; is a process in which each available member of the population being sampled has an equal chance of being chosen for the sample at each draw.&lt;/p&gt;

&lt;p&gt;Sampling can be done &lt;em&gt;with replacement&lt;/em&gt;, in which observations are put back in the population after each draw for possible future reselection. Or it can be done &lt;em&gt;without replacement&lt;/em&gt;, in which case observations, once selected, are unavailable for future draws.&lt;/p&gt;
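&lt;p&gt;A quick sketch of the difference, using NumPy with a tiny made-up population:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(42)
population = np.arange(10)  # a tiny made-up population: 0..9

# Sampling WITH replacement: the same member can be drawn more than once
with_replacement = rng.choice(population, size=8, replace=True)

# Sampling WITHOUT replacement: each member can be drawn at most once
without_replacement = rng.choice(population, size=8, replace=False)

print(with_replacement)     # duplicates are possible
print(without_replacement)  # all values are distinct
```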

&lt;p&gt;&lt;em&gt;What is Sample Bias?&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;It occurs when a &lt;em&gt;sample&lt;/em&gt; is drawn from the &lt;em&gt;population&lt;/em&gt; in a nonrandom manner, resulting in a distribution that differs from that of the &lt;em&gt;population&lt;/em&gt;.&lt;/p&gt;
&lt;h4&gt;
  
  
  Bias
&lt;/h4&gt;

&lt;p&gt;Statistical bias refers to measurement or sampling errors that are systematic and produced by the measurement or sampling process.&lt;br&gt;
There is a large difference between &lt;em&gt;Error from Bias&lt;/em&gt; and &lt;em&gt;Error due to Random chance&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;How to deal with Bias? - &lt;em&gt;Random Selection&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There are now a variety of methods to achieve representativeness, but at the heart of all of them lies random sampling.&lt;br&gt;
Random sampling is not always easy. A proper definition of an &lt;em&gt;accessible population&lt;/em&gt; is key.&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;stratified sampling&lt;/em&gt;, the population is divided into &lt;em&gt;strata&lt;/em&gt;, and random samples are taken from each of them.&lt;/p&gt;
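&lt;p&gt;As a hypothetical sketch (column names and data are made up), stratified sampling can be done in pandas by grouping on the stratum column and sampling within each group:&lt;/p&gt;

```python
import pandas as pd

# An imbalanced two-stratum "population"
df = pd.DataFrame({
    'group': ['A'] * 100 + ['B'] * 900,
    'value': range(1000),
})

# Draw the same number of rows at random from each stratum
stratified = df.groupby('group').sample(n=2, random_state=0)
print(stratified)
```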
&lt;h4&gt;
  
  
  Selection Bias
&lt;/h4&gt;

&lt;p&gt;Selection bias refers to the practice of selectively choosing data—consciously or unconsciously—in a way that leads to a conclusion that is misleading or ephemeral.&lt;/p&gt;

&lt;p&gt;Selection bias occurs when you are &lt;em&gt;data snooping&lt;/em&gt;, i.e., extensively hunting for patterns in the data that suit your use case.&lt;/p&gt;

&lt;p&gt;Since the repeated review of large data sets is a key value proposition in data science, selection bias is something to worry about. A form of selection bias that a data scientist has to deal with is called &lt;em&gt;Vast search effect&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If you repeatedly run different models and ask different questions of a large data set, you are bound to find something interesting. But is the result truly interesting, or is it just a chance outlier?&lt;/p&gt;

&lt;p&gt;How to deal with this effect? The answer is by using a &lt;em&gt;holdout set&lt;/em&gt; and sometimes more than &lt;em&gt;one holdout set&lt;/em&gt; to validate against.&lt;/p&gt;
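&lt;p&gt;A minimal pandas-only sketch of setting aside a holdout set (the fraction and data here are illustrative): lock away a portion of the rows before any exploration, and touch them only for final validation:&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({'x': range(100)})  # stand-in for the full data set

holdout = df.sample(frac=0.2, random_state=42)  # locked away for validation
working = df.drop(holdout.index)                # used for model search

print(len(working), len(holdout))
```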
&lt;h4&gt;
  
  
  Sampling Distribution of Statistic
&lt;/h4&gt;

&lt;p&gt;The term &lt;em&gt;sampling distribution&lt;/em&gt; of a statistic refers to the distribution of a sample statistic over many samples drawn from the same population. &lt;/p&gt;

&lt;p&gt;Much of classical statistics is concerned with making inferences from &lt;em&gt;small&lt;/em&gt; samples to a &lt;em&gt;very large&lt;/em&gt; population.&lt;/p&gt;

&lt;p&gt;Typically, a sample is drawn with the goal of measuring something (with a sample statistic) or modeling something (with a statistical or machine learning model). &lt;br&gt;
Since our estimate or model is based on a &lt;em&gt;sample&lt;/em&gt;, it might be in error: it might differ had we drawn a &lt;em&gt;different sample&lt;/em&gt;. &lt;br&gt;
We are therefore interested in how different it might be; a key concern is &lt;em&gt;sampling variability&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: It is important to distinguish between the distribution of the individual data points, known as the data distribution, and the distribution of a sample statistic, known as the sampling distribution.&lt;/p&gt;

&lt;p&gt;The distribution of a &lt;em&gt;sample statistic&lt;/em&gt; such as the mean is likely to be more regular and bell-shaped than the distribution of the data itself. The larger the sample the statistic is based on, the more this is true. Also, the larger the sample, the narrower the distribution of the sample statistic.&lt;/p&gt;

&lt;p&gt;From the open-source &lt;em&gt;Wine Quality&lt;/em&gt; dataset, &lt;br&gt;
we take three samples: a sample of 1,000 values, a sample of 1,000 means of 5 values, and a sample of 1,000 means of 20 values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Taking a Sample Data
&lt;/span&gt;&lt;span class="n"&gt;sample_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Taking a mean of statistic for 5 samples
&lt;/span&gt;
&lt;span class="n"&gt;sample_mean_05&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Mean of 5&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Taking mean of statistic for 20 samples
&lt;/span&gt;
&lt;span class="n"&gt;sample_mean_20&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Mean of 20&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;sample_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample_mean_05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample_mean_20&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FacetGrid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col_wrap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aspect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_axis_labels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fixed acidity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_titles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{col_name}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above code produces a FacetGrid of three histograms: the first shows the data distribution, while the second and third show sampling distributions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--q7yHMiZ_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/hqw5wyc6ttb1bkmypcht.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--q7yHMiZ_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/hqw5wyc6ttb1bkmypcht.png" alt="Distribution" width="294" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The phenomenon we’ve just described is termed the &lt;strong&gt;Central Limit Theorem&lt;/strong&gt;. It says that the distribution of means drawn from multiple samples will resemble the familiar bell-shaped normal curve. &lt;/p&gt;

&lt;p&gt;The central limit theorem allows normal-approximation formulas like the t-distribution to be used in calculating sampling distributions for inference—that is, confidence intervals and hypothesis tests.&lt;/p&gt;

&lt;h4&gt;
  
  
  Standard Error
&lt;/h4&gt;

&lt;p&gt;The standard error is a single metric that sums up the variability in the sampling distribution for a statistic.&lt;br&gt;
The standard error can be estimated using a statistic based on the standard deviation &lt;em&gt;s&lt;/em&gt; of the sample values, and the sample size &lt;em&gt;n&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--D4w9zfEq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/c8x6mc2j2mh7cdr6wtb6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--D4w9zfEq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/c8x6mc2j2mh7cdr6wtb6.png" alt="Formula-SE" width="740" height="144"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As the sample size increases, the standard error decreases, corresponding to what was observed in the above figure.&lt;/p&gt;

&lt;p&gt;The approach to measuring standard error:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sample from an accessible population distribution
&lt;/li&gt;
&lt;li&gt;For each sample, calculate the statistic (e.g., the &lt;em&gt;mean&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;Calculate the standard deviation of the statistics from Step 2; use this as the estimate of the standard error.&lt;/li&gt;
&lt;/ol&gt;
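&lt;p&gt;Although collecting real new samples is rarely feasible, the three steps above can be illustrated with a simulated population (the exponential distribution and sample sizes here are purely illustrative) and compared against the formula SE = s / sqrt(n):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=10, size=100_000)  # simulated population

n = 50
# Steps 1-2: draw many samples and compute the statistic (the mean) for each
sample_means = [rng.choice(population, size=n).mean() for _ in range(2000)]

# Step 3: the standard deviation of those statistics estimates the standard error
se_empirical = np.std(sample_means)
se_formula = population.std() / np.sqrt(n)  # SE = s / sqrt(n)

print(round(se_empirical, 3), round(se_formula, 3))
```

The two numbers agree closely, which is exactly what the formula promises.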

&lt;p&gt;In practice, this approach of collecting new samples to estimate the standard error is typically not feasible. Fortunately, it turns out that it is not necessary to draw brand new samples; instead, you can use &lt;em&gt;bootstrap resamples&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In modern statistics, the &lt;em&gt;bootstrap&lt;/em&gt; has become a standard way to estimate the standard error.&lt;/p&gt;
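&lt;p&gt;A minimal bootstrap sketch (on simulated data, assuming NumPy): resample the one sample we actually have, with replacement, and take the standard deviation of the resampled means:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(loc=5, scale=2, size=200)  # the single sample we have

# Resample the sample itself, with replacement, and collect the means
boot_means = [
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(1000)
]

se_boot = np.std(boot_means)                            # bootstrap standard error
se_formula = sample.std(ddof=1) / np.sqrt(len(sample))  # classical formula

print(round(se_boot, 3), round(se_formula, 3))
```

No new draws from the population were needed, yet the bootstrap estimate lands very close to the formula-based one.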

&lt;p&gt;So this concludes Part I, where I have covered the sample/population dichotomy, sample bias and other kinds of bias, ways to mitigate bias in our data, the Central Limit Theorem, and the standard error.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Fin&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>datascience</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Exploratory Data Analysis: Part C</title>
      <dc:creator>Kushal</dc:creator>
      <pubDate>Thu, 09 Jul 2020 13:16:38 +0000</pubDate>
      <link>https://dev.to/kushal_/exploratory-data-analysis-part-c-lcc</link>
      <guid>https://dev.to/kushal_/exploratory-data-analysis-part-c-lcc</guid>
      <description>&lt;p&gt;In this article, we will delve into various aspects of plotting and analyzing numerical and categorical variables in bivariate and a multivariate manner.&lt;/p&gt;

&lt;h3&gt;
  
  
  Correlation
&lt;/h3&gt;

&lt;p&gt;Exploratory data analysis in many modeling projects (whether in data science or in research) involves examining correlation among predictors and between predictors and a target variable.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Variables X and Y (each with measured data) are said to be positively correlated if high values of X go with high values of Y, and low values of X go with low values of Y.&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;We use &lt;em&gt;Pearson's Correlation Coefficient&lt;/em&gt; as a de-facto method for computing correlation among numerical variables.&lt;/p&gt;

&lt;p&gt;Following is the mathematical formula of the same:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fm4wuiox5krlgzg1gfx0c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fm4wuiox5krlgzg1gfx0c.png" alt="Correlation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The correlation coefficient always lies between +1 (perfect positive correlation) and –1 (perfect negative correlation); 0 indicates no correlation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#Reading the data
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;winequality-red.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sep&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# pandas dataframe has .corr() method to compute a correlation table
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;corr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output of the above code is a square matrix with each numerical variable's correlation computed against every other variable in the data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fhhlevlf1ohj0ebdvd919.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fhhlevlf1ohj0ebdvd919.png" alt="CorrTable"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For visualization purposes, we use seaborn's &lt;em&gt;heatmap&lt;/em&gt; for better inferences and data storytelling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;corr&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;vmin&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vmax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;diverging_palette&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;220&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;as_cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Flmmro9723p1zlqasvdyu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Flmmro9723p1zlqasvdyu.png" alt="heatmap"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Like the mean and standard deviation, the correlation coefficient is sensitive to outliers in the data.&lt;/p&gt;

&lt;p&gt;Note: There are various other correlation coefficients devised by statisticians: &lt;em&gt;Spearman’s rho&lt;/em&gt; or &lt;em&gt;Kendall’s tau&lt;/em&gt;. &lt;br&gt;
These are correlation coefficients based on the rank of the data. Since they work with ranks rather than values, these estimates are robust to outliers and can handle certain types of nonlinearities.&lt;/p&gt;
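&lt;p&gt;To illustrate with made-up numbers: Spearman's rho is simply Pearson's correlation applied to the ranks of the data, which is precisely what makes it robust to outliers:&lt;/p&gt;

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 1000])  # one extreme outlier
y = pd.Series([2, 4, 6, 8, 10])    # perfectly monotone with x

pearson = x.corr(y)                 # dragged down by the outlier
spearman = x.rank().corr(y.rank())  # Pearson on ranks = Spearman's rho

print(round(pearson, 3), round(spearman, 3))
```

The rank-based coefficient reports the perfect monotone relationship, while the outlier pulls Pearson's coefficient well below 1.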
&lt;h4&gt;
  
  
  Scatterplots
&lt;/h4&gt;

&lt;p&gt;The standard way to visualize the relationship between two measured data variables is with a scatterplot. The x-axis represents one variable and the y-axis another, and each point on the graph is a record.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;citric acid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fixed acidity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;figsize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;axhline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;grey&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lw&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;axvline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;grey&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lw&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fvzvo5vs3yhmxa9g5b2gm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fvzvo5vs3yhmxa9g5b2gm.png" alt="scatterplot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The plot shows a fairly strong positive relationship between &lt;em&gt;Citric Acid&lt;/em&gt; and &lt;em&gt;Fixed Acidity&lt;/em&gt;: higher citric acid values tend to go with higher fixed acidity levels. &lt;/p&gt;

&lt;h3&gt;
  
  
  Exploring Two or More Variables
&lt;/h3&gt;

&lt;p&gt;Familiar estimators like mean and variance look at variables one at a time (univariate analysis).&lt;br&gt;
In this section, we look at additional estimates and plots, and at more than two variables (multivariate analysis).&lt;/p&gt;
&lt;h4&gt;
  
  
  Hexagonal Binning and Contours
&lt;/h4&gt;

&lt;p&gt;Scatterplots are fine when there is a relatively small number of data values.&lt;br&gt;
For data sets with hundreds of thousands or millions of records, a scatterplot will be too dense, so we need a different way to visualize the relationship.&lt;br&gt;
Rather than plotting points, which would appear as a monolithic dark cloud, we can group the records into hexagonal bins and plot the hexagons with a color indicating the number of records in each bin. &lt;/p&gt;

&lt;p&gt;In Python, hexagonal binning plots are readily available via the pandas DataFrame plotting method &lt;em&gt;hexbin&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hexbin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;citric acid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fixed acidity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gridsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;figsize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Citric Acid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Fixed Acidity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ff70t0acy6ezq0foux5xh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ff70t0acy6ezq0foux5xh.png" alt="Hexplot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another method to analyze dense data is to plot density contours. In Python, &lt;em&gt;seaborn&lt;/em&gt; provides the method &lt;em&gt;kdeplot&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kdeplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;citric acid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fixed acidity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fcwv5k0vot11n4zwex5jq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fcwv5k0vot11n4zwex5jq.png" alt="contourplot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Two Categorical Variables
&lt;/h4&gt;

&lt;p&gt;A useful way to summarize two categorical variables is a contingency table - &lt;em&gt;a table of counts by category.&lt;/em&gt;&lt;br&gt;
Contingency tables can look only at counts, or they can also include column and total percentages. &lt;/p&gt;

&lt;p&gt;In Python, the pivot_table method creates the contingency table; the aggfunc argument lets us compute the counts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;crosstab&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;adult_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pivot_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;education&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sex&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aggfunc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;margins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;crosstab&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;workclass&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fcu6wwtbka6ih3z9wybrf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fcu6wwtbka6ih3z9wybrf.png" alt="ContingencyTable"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can observe, we have computed a contingency table for the two categorical variables &lt;em&gt;education&lt;/em&gt; and &lt;em&gt;sex&lt;/em&gt;. &lt;br&gt;
Because no &lt;em&gt;values&lt;/em&gt; argument was passed, &lt;em&gt;pivot_table&lt;/em&gt; aggregates every remaining column, so we select the 'workclass' block of the resulting hierarchical columns for the output. &lt;/p&gt;
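&lt;p&gt;As an aside, pandas also offers the &lt;em&gt;crosstab&lt;/em&gt; function, which builds the same kind of count table more directly. The sketch below uses a small made-up sample in place of the adult dataset:&lt;/p&gt;

```python
import pandas as pd

# Tiny made-up sample standing in for the adult dataset
sample = pd.DataFrame({
    'education': ['Bachelors', 'HS-grad', 'Bachelors', 'Masters', 'HS-grad'],
    'sex': ['Male', 'Female', 'Female', 'Male', 'Male'],
})

# pd.crosstab counts co-occurrences directly; margins=True adds totals
table = pd.crosstab(sample['education'], sample['sex'], margins=True)
print(table)
```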
&lt;h4&gt;
  
  
  Categorical and Numerical Variables
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Boxplots&lt;/em&gt; are a simple way to visually compare the distributions of a numeric variable grouped according to a categorical variable.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;pandas&lt;/em&gt; boxplot method takes the &lt;em&gt;by&lt;/em&gt; argument, which splits the data set into groups and creates the individual boxplots.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;adult_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;by&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;race&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;column&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hours-per-week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fbxzmwqcafjhb08sbr500.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fbxzmwqcafjhb08sbr500.png" alt="boxpair"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the above visualization, we can observe that we have grouped the data by the categorical variable &lt;em&gt;race&lt;/em&gt; and plotted it against &lt;em&gt;hours-per-week&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;violin plot&lt;/strong&gt; is an enhancement to the boxplot and plots the density estimate with the density on the y-axis. &lt;br&gt;
The density is mirrored and flipped over, and the resulting shape is filled in, creating an image resembling a violin. &lt;br&gt;
&lt;em&gt;The advantage of a violin plot is that it can show nuances in the distribution that aren’t perceptible in a boxplot.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;violinplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;adult_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;race&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;adult_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hours-per-week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;inner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;quartile&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fx3gceax0ms847dr80grj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fx3gceax0ms847dr80grj.png" alt="violenplot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We created a violin plot with similar features as the aforementioned boxplot.&lt;/p&gt;

&lt;h4&gt;
  
  
  Closing Remarks
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Exploratory data analysis (EDA)&lt;/em&gt; sets the foundation for the field of data science. The key idea of EDA is that the first and most important step in any data-driven project is to look at the data. By summarizing and visualizing the data, you can gain valuable intuition and understanding of the project.&lt;/p&gt;

&lt;p&gt;Fin.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Exploratory Data Analysis: Part B</title>
      <dc:creator>Kushal</dc:creator>
      <pubDate>Wed, 08 Jul 2020 15:46:30 +0000</pubDate>
      <link>https://dev.to/kushal_/exploratory-data-analysis-part-b-13dk</link>
      <guid>https://dev.to/kushal_/exploratory-data-analysis-part-b-13dk</guid>
<description>&lt;p&gt;In &lt;a href="https://dev.to/kushalvala/exploratory-data-analysis-part-a-3j3l"&gt;Part-A&lt;/a&gt;, we delved into the following concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Elements of Structured Data&lt;/li&gt;
&lt;li&gt;Estimate of Location ( &lt;em&gt;Central Tendency&lt;/em&gt; )&lt;/li&gt;
&lt;li&gt;Estimate of Variability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article, we will be moving ahead with the methodologies and techniques for Exploratory Data Analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  Exploring Data Distribution
&lt;/h3&gt;

&lt;p&gt;Each of the estimates we have covered sums up the data in a single number to describe the location or variability of the data. &lt;em&gt;It is also useful to explore how the data is distributed overall.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Percentiles and BoxPlots
&lt;/h4&gt;

&lt;p&gt;In Part-A, we saw how percentiles can be used to measure the spread of the data.&lt;br&gt;
Percentiles are also valuable for summarizing the entire distribution, and especially for summarizing the tails (the outer range).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;winequality-red.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Calculating Percentiles - 5th, 25th, 50th, 75th, 95th 
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;quantile&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; 

&lt;span class="c1"&gt;# Output:
&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;     &lt;span class="mf"&gt;11.0&lt;/span&gt;
&lt;span class="mf"&gt;0.25&lt;/span&gt;     &lt;span class="mf"&gt;22.0&lt;/span&gt;
&lt;span class="mf"&gt;0.50&lt;/span&gt;     &lt;span class="mf"&gt;38.0&lt;/span&gt;
&lt;span class="mf"&gt;0.75&lt;/span&gt;     &lt;span class="mf"&gt;62.0&lt;/span&gt;
&lt;span class="mf"&gt;0.95&lt;/span&gt;    &lt;span class="mf"&gt;112.1&lt;/span&gt;
&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="n"&gt;sulfur&lt;/span&gt; &lt;span class="n"&gt;dioxide&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;float64&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we can observe from the above output, the &lt;em&gt;median&lt;/em&gt; (50th percentile) is 38.0, and &lt;em&gt;Total Sulfur Dioxide&lt;/em&gt; shows a wide spread, with the 5th percentile at 11.0 and the 95th percentile at 112.1.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Boxplots&lt;/em&gt; are based on percentiles and give a quick way to visualize the distribution of data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;box&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Concentration of Sulfar Dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this code, we use the &lt;em&gt;pandas&lt;/em&gt; built-in boxplot command, but many data scientists and analysts prefer &lt;em&gt;matplotlib&lt;/em&gt; and &lt;em&gt;seaborn&lt;/em&gt; for their flexibility.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_de4NPyK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/m7gzb04enis11kupzaf9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_de4NPyK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/m7gzb04enis11kupzaf9.png" alt="BoxPlot" width="360" height="576"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The top and bottom of the box are the 75th and 25th percentiles, respectively. The median is shown by the horizontal line in the box. The dashed lines referred to as &lt;em&gt;whiskers&lt;/em&gt;, extend from the top and bottom of the box to indicate the range for the bulk of the data.&lt;/p&gt;

&lt;p&gt;Any data outside of the whiskers are plotted as single points or circles (often considered outliers).&lt;/p&gt;
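&lt;p&gt;The conventional whisker rule (the default in matplotlib and pandas) extends each whisker at most 1.5 times the interquartile range beyond the quartiles. A small sketch, using made-up values, of computing those fences and flagging outliers:&lt;/p&gt;

```python
import pandas as pd

values = pd.Series([11, 22, 30, 38, 45, 62, 70, 112, 280])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1  # interquartile range

# Default whisker rule: 1.5 * IQR beyond each quartile
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are drawn as individual outlier markers
outliers = values[values.lt(lower_fence) | values.gt(upper_fence)]
print(lower_fence, upper_fence, list(outliers))
```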

&lt;h4&gt;
  
  
  Frequency Tables and Histograms
&lt;/h4&gt;

&lt;p&gt;A frequency table of a variable divides up the variable range into equally spaced segments and tells us how many values fall within each segment.&lt;/p&gt;

&lt;p&gt;The function pandas.cut creates a series that maps the values into the segments.&lt;br&gt;
Using the method value_counts, we get the frequency table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Frequency Table for Sulfur Dioxide Concentration
&lt;/span&gt;
&lt;span class="n"&gt;binnedConct&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cut&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;binnedConct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;#Output:
&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;5.717&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;34.3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;     &lt;span class="mi"&gt;730&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;34.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;62.6&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;      &lt;span class="mi"&gt;471&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;62.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;90.9&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;      &lt;span class="mi"&gt;221&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;90.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;119.2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;     &lt;span class="mi"&gt;113&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;119.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;147.5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;     &lt;span class="mi"&gt;52&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;147.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;175.8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;     &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;260.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;289.0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;      &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;232.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;260.7&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;      &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;204.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;232.4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;      &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;175.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;204.1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;      &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="n"&gt;sulfur&lt;/span&gt; &lt;span class="n"&gt;dioxide&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;int64&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is important to include the empty bins. The fact that there are no values in those bins is useful information. It can also be useful to experiment with different bin sizes.&lt;/p&gt;
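&lt;p&gt;A minimal sketch of that experiment on a made-up series: with more, narrower bins a gap in the data shows up as empty bins, while fewer, wider bins hide it.&lt;/p&gt;

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 20])

# Four equal-width bins over the range 1-20; the gap in the data
# leaves the middle bins empty, and value_counts still reports them
bins4 = pd.cut(values, 4)
print(bins4.value_counts().sort_index())

# Two wider bins hide the gap entirely
bins2 = pd.cut(values, 2)
print(bins2.value_counts().sort_index())
```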

&lt;p&gt;A &lt;em&gt;histogram&lt;/em&gt; is a way to visualize a frequency table, with bins on the x-axis and the data count on the y-axis.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;pandas&lt;/em&gt; supports histograms for data frames with the &lt;em&gt;hist&lt;/em&gt; method. Use the keyword argument &lt;em&gt;bins&lt;/em&gt; to define the number of bins.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Plotting Histogram
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bins&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Concentration of Sulfar Dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HVSsyt_W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/h73im7pqph2ionz7swtj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HVSsyt_W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/h73im7pqph2ionz7swtj.png" alt="Histogram" width="504" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above histogram, we see that the distribution is right-skewed. (I will address &lt;em&gt;Skewness&lt;/em&gt; and &lt;em&gt;Kurtosis&lt;/em&gt; in upcoming articles.)&lt;/p&gt;

&lt;h4&gt;
  
  
  Density Plots and Estimates
&lt;/h4&gt;

&lt;p&gt;Related to the histogram is a &lt;em&gt;density plot&lt;/em&gt;, which shows the distribution of data values as a continuous line. A density plot can be thought of as a smoothed histogram, although it is typically computed directly from the data through a &lt;em&gt;kernel density estimate&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bins&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;density&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;density&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Concentration of Sulfar Dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--w3-nYUa_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/dyxk18dyw1d8g58tb4ps.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--w3-nYUa_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/dyxk18dyw1d8g58tb4ps.png" alt="DensityPlot" width="720" height="504"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: A key distinction from the histogram plotted in is the scale of the y-axis: &lt;em&gt;a density plot&lt;/em&gt; corresponds to plotting the histogram as a proportion rather than counts.&lt;/p&gt;
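&lt;p&gt;We can verify this rescaling numerically: with density scaling, bar height times bin width sums to 1. A quick sketch with &lt;em&gt;numpy&lt;/em&gt; on randomly generated data:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=50, scale=10, size=1000)

# density=True rescales bar heights so that height * bin-width sums to 1
heights, edges = np.histogram(samples, bins=10, density=True)
widths = np.diff(edges)
area = float((heights * widths).sum())
print(area)  # 1.0 up to floating-point error
```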

&lt;h3&gt;
  
  
  Exploring Binary and Categorical Data
&lt;/h3&gt;

&lt;p&gt;Getting a summary of a binary variable or a categorical variable with a few categories is a fairly easy matter: we just figure out the proportion of 1s or the proportions of the important categories.&lt;/p&gt;
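&lt;p&gt;In pandas, those proportions come straight from &lt;em&gt;value_counts&lt;/em&gt; with &lt;em&gt;normalize=True&lt;/em&gt;; a minimal sketch on a made-up binary series:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical 0/1 flags, e.g. whether a customer clicked
flags = pd.Series([0, 1, 1, 0, 1, 1, 1, 0])

# normalize=True turns raw counts into proportions
proportions = flags.value_counts(normalize=True)
print(proportions)  # 1 -> 0.625, 0 -> 0.375
```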

&lt;p&gt;&lt;em&gt;Bar charts, seen often in the popular press, are a common visual tool for displaying a single categorical variable.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;adult_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;adult.data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;skip_blank_lines&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Plotting the Categorical Variable: Education
&lt;/span&gt;&lt;span class="n"&gt;adult_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gP_HMkgY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/dablbamifqwhyv6km83a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gP_HMkgY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/dablbamifqwhyv6km83a.png" alt="Alt Text" width="504" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that a bar chart resembles a histogram; in a bar chart the x-axis represents different categories of a factor variable, while in a histogram the x-axis represents values of a single variable on a numeric scale.&lt;/p&gt;

&lt;p&gt;Some more concepts on categorical variables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mode: &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The mode is the value—or values in case of a tie—that appears most often in the data. The mode is a simple summary statistic for categorical data, and it is generally not used for numeric data.&lt;/p&gt;
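&lt;p&gt;In pandas, &lt;em&gt;Series.mode&lt;/em&gt; returns every most-frequent value, so a tie yields more than one row; a quick sketch on made-up data:&lt;/p&gt;

```python
import pandas as pd

colors = pd.Series(['red', 'blue', 'red', 'green', 'blue'])

# 'red' and 'blue' each appear twice, so both are returned
modes = colors.mode()
print(list(modes))
```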

&lt;ul&gt;
&lt;li&gt;Expected Value:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The expected value is calculated as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Multiply each outcome by its probability of occurrence.&lt;/li&gt;
&lt;li&gt;Sum these values.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The expected value is really a form of &lt;em&gt;weighted mean&lt;/em&gt;: it adds the ideas of future expectations and probability weights, often based on subjective judgment.&lt;/p&gt;
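&lt;p&gt;The two steps above amount to a single weighted sum. A sketch with a made-up promotion: suppose 5% of customers buy a 300-dollar service, 15% buy a 50-dollar service, and the rest buy nothing.&lt;/p&gt;

```python
# Hypothetical outcomes (dollars) and their assumed probabilities
outcomes = [300, 50, 0]
probabilities = [0.05, 0.15, 0.80]

# Step 1: multiply each outcome by its probability; Step 2: sum the products
expected_value = sum(o * p for o, p in zip(outcomes, probabilities))
print(expected_value)  # 22.5
```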

&lt;p&gt;That's all Folks for Part-B of this series! &lt;br&gt;
In this article, we covered various plotting paradigms used to analyze numerical and categorical variables, along with their Python code.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Fin&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Exploratory Data Analysis: Part A</title>
      <dc:creator>Kushal</dc:creator>
      <pubDate>Tue, 07 Jul 2020 13:39:40 +0000</pubDate>
      <link>https://dev.to/kushal_/exploratory-data-analysis-part-a-3j3l</link>
      <guid>https://dev.to/kushal_/exploratory-data-analysis-part-a-3j3l</guid>
<description>&lt;p&gt;In this article, we will explore fundamental ways of doing exploratory data analysis on a dataset. &lt;br&gt;
Earlier, statistical studies were limited to &lt;em&gt;inference&lt;/em&gt;, but John Tukey proposed a new scientific discipline called data analysis that included statistical inference as just one component. &lt;br&gt;
With the ready availability of computing power and expressive data analysis software, exploratory data analysis has evolved well beyond its original scope. &lt;/p&gt;
&lt;h3&gt;
  
  
  Elements of Structured Data
&lt;/h3&gt;

&lt;p&gt;Data comes from many sources: &lt;em&gt;sensor measurements, events, text, images, and videos&lt;/em&gt;. &lt;br&gt;
The &lt;em&gt;Internet of Things (IoT)&lt;/em&gt; is spewing out streams of information. Much of this data is unstructured: Images are a collection of pixels, with each pixel containing RGB (red, green, blue) color information. &lt;br&gt;
Texts are sequences of words and nonword characters, often organized by sections, subsections, and so on.&lt;/p&gt;

&lt;p&gt;To apply statistical concepts, unstructured raw data has to be converted into structured data.&lt;/p&gt;

&lt;p&gt;There are mainly two types of structured data:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Numeric Type

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Continuous&lt;/em&gt;: Data that can take on any value in an interval. &lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Discrete&lt;/em&gt;: Data that can take on only integer values, such as counts. &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Categorical Type

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Binary Data&lt;/em&gt; (Special Case): A special case of categorical data with just two categories of values, e.g., 0/1, true/false. &lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Ordinal Data&lt;/em&gt;: Categorical data that has an explicit ordering. (Synonym: ordered factor).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
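&lt;p&gt;These types map directly onto pandas dtypes; a minimal sketch with hypothetical columns, including an ordered categorical for ordinal data:&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({
    'temperature': [36.6, 37.2, 38.9],   # continuous (float)
    'visit_count': [1, 3, 2],            # discrete (int)
    'is_member': [True, False, True],    # binary
})

# Ordinal data: a categorical with an explicit ordering
df['size'] = pd.Categorical(['small', 'large', 'medium'],
                            categories=['small', 'medium', 'large'],
                            ordered=True)

print(df.dtypes)
print(df['size'].cat.ordered)  # True
```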
&lt;h3&gt;
  
  
  Rectangular Data
&lt;/h3&gt;

&lt;p&gt;The typical frame of reference for analysis in data science is a &lt;em&gt;rectangular data object&lt;/em&gt;, like a &lt;em&gt;spreadsheet&lt;/em&gt; or &lt;em&gt;database table.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Rectangular data&lt;/em&gt; is the general term for a two-dimensional matrix with rows indicating records (cases) and columns indicating features (variables).&lt;br&gt;
The data frame is the specific format in R and Python.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Terms for Rectangular Data&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Feature: A column within a table is commonly referred to as a feature. Alias: attribute, predictor, variable&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Records: A row within a data frame. Alias: case, example, instance, observation, etc.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Below is the typical data frame object read by &lt;em&gt;pandas&lt;/em&gt; library in Python. &lt;br&gt;
&lt;em&gt;Dataset: Wine Quality by UCI&lt;/em&gt;&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
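&lt;p&gt;A minimal sketch of reading such a CSV into a data frame with &lt;em&gt;pandas&lt;/em&gt;, using a tiny inline sample (abridged column names) in place of the full UCI file:&lt;/p&gt;

```python
import io
import pandas as pd

# Inline stand-in for the UCI wine-quality CSV
csv_text = """fixed acidity,volatile acidity,alcohol,quality
7.4,0.70,9.4,5
7.8,0.88,9.8,5
11.2,0.28,10.2,6
"""

data = pd.read_csv(io.StringIO(csv_text))
print(data.shape)   # (3, 4): 3 records (rows), 4 features (columns)
print(data.head())
```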


&lt;p&gt;&lt;strong&gt;Non-Rectangular Data Structure&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;There are data structures other than the rectangular data.&lt;br&gt;
&lt;em&gt;Time series data&lt;/em&gt; records successive measurements of the same variable. It is the raw material for statistical forecasting methods, and it is also a key component of the data produced by devices—the Internet of Things.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Spatial data structures, which are used in mapping and location analytics, are more complex and varied than rectangular data structures.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Graph (or network) data structures are used to represent physical, social, and abstract relationships.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Estimates of Location
&lt;/h3&gt;

&lt;p&gt;Variables with measured or count data (&lt;em&gt;Numerical&lt;/em&gt;) might have thousands of distinct values. &lt;br&gt;
A basic step in exploring your data is getting a “typical value” for each feature (variable): an estimate of where most of the data is located (i.e., its central tendency).&lt;/p&gt;

&lt;p&gt;At first glance, summarizing data might seem fairly trivial: just take the &lt;em&gt;mean of the data&lt;/em&gt;. In fact, while the mean is easy to compute and expedient to use, it may not always be the best measure for a central value.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mean&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most basic estimate of location is the mean or average value. The mean is the sum of all values divided by the number of values.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7tD2NTKH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/etbavb47kt1eldsx1sdj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7tD2NTKH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/etbavb47kt1eldsx1sdj.png" alt="Alt Text" width="800" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;N (or n)&lt;/em&gt; refers to the total number of records or observations. In statistics, it is capitalized if it is referring to a population, and lowercase if it refers to a sample from a population.&lt;/p&gt;

&lt;p&gt;A variation of the mean is a &lt;em&gt;trimmed mean&lt;/em&gt;, which you calculate by dropping a fixed number of sorted values at each end and then taking an average of the remaining values. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--B2XUkNaq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/xvexymb5m8gi81o1t99y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--B2XUkNaq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/xvexymb5m8gi81o1t99y.png" alt="Alt Text" width="800" height="205"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An advantage of using a trimmed mean is that it removes the influence of extreme values. It is more robust than the regular mean.&lt;/p&gt;

&lt;p&gt;Another type of mean is a weighted mean, which you calculate by multiplying each data value by a user-specified weight and dividing their sum by the sum of the weights.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QO_b06Ts--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/8qu6d9xjaypmcomsjhrq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QO_b06Ts--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/8qu6d9xjaypmcomsjhrq.png" alt="Alt Text" width="692" height="198"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Median and Robust Measures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;em&gt;median&lt;/em&gt; is the middle number on a sorted list of the data. If there is an even number of data values, the middle value is one that is not actually in the data set, but rather the average of the two values that divide the sorted data into upper and lower halves.&lt;/p&gt;
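&lt;p&gt;A quick illustration with Python's built-in &lt;code&gt;statistics&lt;/code&gt; module (the numbers are arbitrary):&lt;/p&gt;

```python
import statistics

# Odd number of values: the median is the middle value of the sorted list.
print(statistics.median([5, 1, 3]))      # 3

# Even number of values: the median is the average of the two middle values,
# which need not itself appear in the data set.
print(statistics.median([1, 3, 5, 7]))   # 4.0
```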

&lt;p&gt;Compared to the mean, the median depends only on the central values of the sorted data, which makes it more robust. In many use cases, the median is a better measure of central tendency.&lt;/p&gt;

&lt;p&gt;The median is referred to as a robust estimate of location since it is not influenced by outliers (extreme cases) that could skew the results.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;An outlier is any value that is very distant from the other values in a data set.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In fact, a trimmed mean is widely used to avoid the influence of outliers. For example, trimming the bottom and top 10% (a common choice) of the data will provide protection against outliers in all but the smallest data sets.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="c1"&gt;# Mean, Trimmed Mean and Median of the feature: fixed acidity of wine
&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Mean of Fixed Acidity of Wine:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fixed acidity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# Slicing 10% of left and right most elements
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Trimmed Mean of Fixed Acidity of Wine: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;trim_mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fixed acidity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Median of Fixed Acidity of Wine: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fixed acidity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;median&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;#Output
&lt;/span&gt;&lt;span class="n"&gt;Mean&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;Fixed&lt;/span&gt; &lt;span class="n"&gt;Acidity&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;Wine&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;8.319637273295838&lt;/span&gt;
&lt;span class="n"&gt;Trimmed&lt;/span&gt; &lt;span class="n"&gt;Mean&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;Fixed&lt;/span&gt; &lt;span class="n"&gt;Acidity&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;Wine&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mf"&gt;8.152537080405933&lt;/span&gt;
&lt;span class="n"&gt;Median&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;Fixed&lt;/span&gt; &lt;span class="n"&gt;Acidity&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;Wine&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mf"&gt;7.9&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Estimates of Variability
&lt;/h3&gt;

&lt;p&gt;Location is just one dimension in summarizing a feature. &lt;br&gt;
A second dimension, &lt;em&gt;variability&lt;/em&gt;, also referred to as dispersion, measures whether the data values are tightly clustered or spread out.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standard Deviation and Related Estimates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most widely used estimates of variation are based on the differences, or deviations, between the estimate of location and the observed data.&lt;br&gt;
Simply averaging the deviations themselves tells us nothing, since the negative deviations offset the positive ones: the sum of the deviations from the mean is precisely zero. Instead, a simple approach is to take the average of the absolute values of the deviations from the mean.&lt;br&gt;
This is known as the &lt;em&gt;mean absolute deviation&lt;/em&gt; and is computed with the formula:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--W1CBstpt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/1weawn4wjesyjgkr8uq3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--W1CBstpt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/1weawn4wjesyjgkr8uq3.png" alt="MAD" width="800" height="156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The best-known estimates of variability are the variance and the standard deviation, which are based on squared deviations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---w5UULru--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/7roeyi24zvfplvya5d1m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---w5UULru--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/7roeyi24zvfplvya5d1m.png" alt="Variance/SD" width="800" height="167"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The standard deviation is much easier to interpret than the variance since it is on the same scale as the original data.&lt;br&gt;
The variance and standard deviation are especially sensitive to outliers since they are based on the squared deviations.&lt;/p&gt;

&lt;p&gt;A robust estimate of variability is the &lt;em&gt;median absolute deviation&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fUdO6Xte--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/jqjw5br8luo0tjcu41lk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fUdO6Xte--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/jqjw5br8luo0tjcu41lk.png" alt="Median Absolute Deviation" width="800" height="69"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Estimates based on Percentiles&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A different approach to estimating dispersion is based on looking at the spread of the sorted data. Statistics based on sorted (ranked) data are referred to as order statistics.&lt;br&gt;
The most basic measure is the &lt;em&gt;range&lt;/em&gt;, but it is sensitive to outliers and not a great measure of dispersion.&lt;/p&gt;

&lt;p&gt;In a data set, the &lt;em&gt;Pth percentile&lt;/em&gt; is a value such that at least P percent of the values take on this value or less, and at least (100 – P) percent of the values take on this value or more.&lt;br&gt;
For example, to find the 80th percentile, sort the data. Then, starting with the smallest value, proceed 80 percent of the way to the largest value.&lt;/p&gt;
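&lt;p&gt;A short sketch using NumPy's &lt;code&gt;percentile&lt;/code&gt; (the data is arbitrary; note that NumPy, by default, linearly interpolates between the two nearest order statistics):&lt;/p&gt;

```python
import numpy as np

# 80th percentile of ten made-up values: NumPy sorts internally and walks
# 80% of the way from the smallest toward the largest value.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
p80 = np.percentile(x, 80)
print(p80)   # 8.2
```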

&lt;p&gt;A common measurement of variability is the difference between the 25th percentile and the 75th percentile, called the &lt;em&gt;interquartile range (or IQR)&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;For very large data sets, calculating exact percentiles can be computationally very expensive since it requires sorting all the data values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="c1"&gt;# Measures of Variability for Sulfur Dioxide
&lt;/span&gt;
&lt;span class="c1"&gt;# Standard Deviation
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Standard Deviation for Sulfur Dioxide in Wine: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;free sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;std&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# Inter-Quartile Range
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;IQR of Sulfar Dioxide: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;free sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;quantile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;free sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;quantile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Median Absolute Deviation (a robust measure)
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Median Absolute Deviation: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;mad&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;free sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

&lt;span class="c1"&gt;#Output:
&lt;/span&gt;&lt;span class="n"&gt;Standard&lt;/span&gt; &lt;span class="n"&gt;Deviation&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;Sulfur&lt;/span&gt; &lt;span class="n"&gt;Dioxide&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;Wine&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mf"&gt;10.46015696980973&lt;/span&gt;
&lt;span class="n"&gt;IQR&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;Sulfur&lt;/span&gt; &lt;span class="n"&gt;Dioxide&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mf"&gt;14.0&lt;/span&gt;
&lt;span class="n"&gt;Median&lt;/span&gt; &lt;span class="n"&gt;Absolute&lt;/span&gt; &lt;span class="n"&gt;Deviation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mf"&gt;10.378215529539213&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So in this article, we have explored the basics of the EDA (exploratory data analysis) process: estimates of location (central tendency) and measures of variability.&lt;br&gt;
Part-B will focus on data distributions, exploring categorical variables, and correlations.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Fin&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>statistics</category>
      <category>python</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Natural Language Processing #1: Traditional Embeddings</title>
      <dc:creator>Kushal</dc:creator>
      <pubDate>Mon, 22 Jun 2020 13:12:50 +0000</pubDate>
      <link>https://dev.to/kushal_/natural-language-processing-1-traditional-embeddings-2pjc</link>
      <guid>https://dev.to/kushal_/natural-language-processing-1-traditional-embeddings-2pjc</guid>
      <description>&lt;p&gt;Hello there! You are about to embark on an exciting journey of Natural language processing, covering the nuances from a programmatic and mathematical standpoint. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Natural Language Processing&lt;/em&gt; has been at the helm for decades. It is no secret that significant effort went into building chatbots during the 1980s-90s that could communicate with a human and give out a pre-scripted response based on the question asked.&lt;br&gt;
This type of system is usually called a &lt;em&gt;Finite State Machine&lt;/em&gt; (FSM) or &lt;em&gt;Deterministic Finite Automaton&lt;/em&gt; (DFA).&lt;br&gt;
The major drawback of such a system was its rule-based implementation: a hierarchy of if-else conditionals that can become a complex structure to decode and update.&lt;/p&gt;

&lt;p&gt;The field of NLP is built on the foundation of deriving embeddings from text data, and in the process understanding the semantic and syntactic patterns in the data, to carry out various tasks like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spelling Checker&lt;/li&gt;
&lt;li&gt;Sentence Autocomplete&lt;/li&gt;
&lt;li&gt;Document Summarization&lt;/li&gt;
&lt;li&gt;Question Answering &lt;/li&gt;
&lt;li&gt;Named Entity Recognition&lt;/li&gt;
&lt;li&gt;Machine Translation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article, we will look into some of the most-used frequency-based embedding techniques and also delve into their pros and cons.&lt;/p&gt;

&lt;p&gt;There are two families of methodologies to derive a word embedding :&lt;br&gt;
  &lt;em&gt;1. Frequency-based methods&lt;/em&gt;&lt;br&gt;
  &lt;em&gt;2. Prediction based methods&lt;/em&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Frequency-based Methods
&lt;/h4&gt;

&lt;p&gt;In this paradigm, a sentence is first tokenized into words, and then certain techniques are used to compute a weight for each word, in turn giving us a brief idea of its usage.&lt;/p&gt;

&lt;p&gt;Following are the schemes for frequency-based methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Count Vector ( Bag of Words Model)&lt;/li&gt;
&lt;li&gt;TF-IDF Vector (Term Frequency - Inverse Document Frequency)&lt;/li&gt;
&lt;li&gt;Co-Occurrence Vector&lt;/li&gt;
&lt;/ul&gt;
&lt;h5&gt;
  
  
  1. Count Vectors
&lt;/h5&gt;

&lt;p&gt;This method, popularly referred to as the Bag of Words model, is the simplest representation of text as numeric data. &lt;/p&gt;

&lt;p&gt;The process is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A corpus of unique vocabulary words is built.&lt;/li&gt;
&lt;li&gt;Each word in the corpus is assigned a unique index.&lt;/li&gt;
&lt;li&gt;A count (weight) is assigned to each word in a sentence.&lt;/li&gt;
&lt;li&gt;The vector length of a sentence is equal to the vocabulary size of the corpus. Words that do not appear in the sentence are assigned a weight of &lt;strong&gt;0&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A BoW (&lt;em&gt;Bag of Words&lt;/em&gt;) model can be built using scikit-learn's &lt;em&gt;CountVectorizer&lt;/em&gt;.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;This method is not recommended since it fails to learn the &lt;em&gt;semantic&lt;/em&gt; and &lt;em&gt;syntactic&lt;/em&gt; structure of the sentence.&lt;/p&gt;

&lt;p&gt;Additionally, the method also results in a sparse matrix which is difficult to compute and store.&lt;/p&gt;

&lt;h5&gt;
  
  
  2. TF-IDF Vectors
&lt;/h5&gt;

&lt;p&gt;TF-IDF (Term Frequency - Inverse Document Frequency) is a weighting scheme that incorporates two formulas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Term-Frequency&lt;/strong&gt;: Measure of Occurrence of the word &lt;em&gt;'t'&lt;/em&gt; in the &lt;br&gt;
document &lt;em&gt;'d'&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sqLUTlCo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/tavayiu5xmpbbdln0w9t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sqLUTlCo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/tavayiu5xmpbbdln0w9t.png" alt="TermFrequency" width="800" height="159"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inverse Document Frequency&lt;/strong&gt;: IDF is a measure of how important a term is, i.e., how rare or frequent its occurrence is across the documents/sentences.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GpZPNktp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/hr93p8715ufl7pdfa6i4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GpZPNktp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/hr93p8715ufl7pdfa6i4.png" alt="IDF" width="800" height="130"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Below is the code, using &lt;em&gt;scikit-learn's TfidfVectorizer&lt;/em&gt;&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
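&lt;p&gt;A minimal sketch with &lt;em&gt;TfidfVectorizer&lt;/em&gt; (the toy corpus is an assumption for illustration):&lt;/p&gt;

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# "the" occurs in every document, so its IDF (and hence its weight)
# is lower than that of a word appearing in only one document.
corpus = ["the cat sat", "the dog sat", "the cat ran away"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

vocab = vectorizer.vocabulary_   # word -> column index
idf = vectorizer.idf_
print(idf[vocab["the"]] < idf[vocab["ran"]])  # common words score lower
```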


&lt;p&gt;TF-IDF gives larger values to less frequent words; the score is high when both the TF and IDF values are high, i.e., the word is rare across all the documents combined but frequent within a single document.&lt;/p&gt;

&lt;h5&gt;
  
  
  3. Co-Occurrence Vectors
&lt;/h5&gt;

&lt;p&gt;The big idea – Similar words tend to occur together and will have a similar context.&lt;/p&gt;

&lt;p&gt;There are mainly two concepts to understand for building a co-occurrence matrix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Co-occurrence&lt;/li&gt;
&lt;li&gt;Context Window&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Co-occurrence&lt;/em&gt; – For a given corpus, the co-occurrence of a pair of words say w1 and w2 is the number of times they have appeared together in a Context Window.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Context Window&lt;/em&gt; – A context window is specified by a number (its size) and a direction; it determines which neighboring words count as the context of a given word.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
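&lt;p&gt;A minimal hand-rolled sketch of counting co-occurrences within a symmetric context window (the window size and toy corpus are assumptions for illustration):&lt;/p&gt;

```python
from collections import defaultdict

def co_occurrence(sentences, window=1):
    """Count, for each word pair, how often they appear within
    `window` positions of each other (symmetric context window)."""
    counts = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

counts = co_occurrence(["he is happy", "she is happy"])
print(counts[("is", "happy")])   # "is" and "happy" co-occur in both sentences
```

&lt;p&gt;These pairwise counts fill the co-occurrence matrix; factorizing it (e.g., with Truncated SVD) then yields dense vectors.&lt;/p&gt;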


&lt;p&gt;It preserves the semantic relationship between words to some extent. Further down, a co-occurrence matrix can be factorized using a Truncated SVD Transformation for dense vector representations.&lt;/p&gt;

&lt;p&gt;In conclusion, we covered three base methods for frequency-based word embeddings: BoW, TF-IDF, and the co-occurrence matrix.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>statistics</category>
    </item>
  </channel>
</rss>
