<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tanay Js</title>
    <description>The latest articles on DEV Community by Tanay Js (@221910301027).</description>
    <link>https://dev.to/221910301027</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F621182%2F4e1aa8ad-5667-474a-aeba-7ecfa96ce830.jpg</url>
      <title>DEV Community: Tanay Js</title>
      <link>https://dev.to/221910301027</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/221910301027"/>
    <language>en</language>
    <item>
      <title>Introduction to Data Analysis and Visualization using Python</title>
      <dc:creator>Tanay Js</dc:creator>
      <pubDate>Mon, 26 Apr 2021 14:55:45 +0000</pubDate>
      <link>https://dev.to/221910301027/introduction-to-data-analysis-and-visualization-using-python-4c9m</link>
      <guid>https://dev.to/221910301027/introduction-to-data-analysis-and-visualization-using-python-4c9m</guid>
      <description>&lt;p&gt;Data analysis and visualization play a major role in fields such as data analytics, big data, and data science, where practitioners analyze raw input data to uncover patterns, correlations, and trends.&lt;/p&gt;

&lt;p&gt;This article shows several basic ways to represent data visually and explains how to interpret each one.&lt;/p&gt;

&lt;p&gt;Common tools used for data analysis are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;R Programming&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python Programming&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SAS&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Microsoft Excel&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article uses Python because it is a high-level language with many visualization libraries, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Matplotlib&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;pandas visualization&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seaborn&lt;/strong&gt; &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With these libraries you can import data from file formats such as CSV and Excel and turn raw data into graphs, pie charts, scatter plots, and more.&lt;/p&gt;

&lt;h4&gt;
  
  
  Adding Important Libraries in Python
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Importing Datasets
&lt;/h4&gt;

&lt;p&gt;The dataset used in this article covers the 2008 US presidential election results in swing states.&lt;/p&gt;

&lt;p&gt;The dataset file was taken from &lt;a href="https://www.kaggle.com/aman1py/swing-states"&gt;https://www.kaggle.com/aman1py/swing-states&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;h6&gt;
  
  
  &lt;em&gt;Note&lt;/em&gt;: Make sure the CSV file is downloaded to your local system.
&lt;/h6&gt;
&lt;h6&gt;
  
  
  The code below is shown in code blocks and was executed in a Jupyter Notebook.
&lt;/h6&gt;
&lt;h6&gt;
  
  
  A screenshot of each output is included for reference.
&lt;/h6&gt;
&lt;/blockquote&gt;

&lt;p&gt;The data can be imported in Python using the pandas &lt;code&gt;read_csv&lt;/code&gt; function.&lt;/p&gt;

&lt;p&gt;The first 5 rows of the data can be displayed with the &lt;code&gt;head()&lt;/code&gt; method.&lt;/p&gt;

&lt;p&gt;To follow along, copy the dataset below into a text editor and save it as &lt;code&gt;2008_Election.csv&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;state,county,total_votes,dem_votes,rep_votes,dem_share
PA,Erie County,127691,75775,50351,60.08
PA,Bradford County,25787,10306,15057,40.64
PA,Tioga County,17984,6390,11326,36.07
PA,McKean County,15947,6465,9224,41.21
PA,Potter County,7507,2300,5109,31.04
PA,Wayne County,22835,9892,12702,43.78
PA,Susquehanna County,19286,8381,10633,44.08
PA,Warren County,18517,8537,9685,46.85
OH,Ashtabula County,44874,25027,18949,56.94
OH,Lake County,121335,60155,59142,50.46
PA,Crawford County,38134,16780,20750,44.71
OH,Lucas County,219830,142852,73706,65.99
OH,Fulton County,21973,9900,11689,45.88
OH,Geauga County,51102,21250,29096,42.23
OH,Williams County,18397,8174,9880,45.26
PA,Wyoming County,13138,5985,6983,46.15
PA,Lackawanna County,107876,67520,39488,63.1
PA,Elk County,14271,7290,6676,52.2
PA,Forest County,2444,1038,1366,43.18
PA,Venango County,23307,9238,13718,40.24
OH,Erie County,41229,23148,17432,57.01
OH,Wood County,65022,34285,29648,53.61
PA,Cameron County,2245,879,1323,39.92
PA,Pike County,24284,11493,12518,47.87
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code to import the dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
df=pd.read_csv('2008_Election.csv')
df.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HlsDHeU---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8fnw8t12ytasw00himcr.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HlsDHeU---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8fnw8t12ytasw00himcr.PNG" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;describe()&lt;/code&gt; method displays summary statistics such as the mean, standard deviation, minimum, and maximum of each numeric column. &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dWP5i3Nd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nyiae0b2r6yh5r81w71n.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dWP5i3Nd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nyiae0b2r6yh5r81w71n.PNG" alt="Describe Method"&gt;&lt;/a&gt;&lt;/p&gt;
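&lt;p&gt;As a minimal sketch of what &lt;code&gt;describe()&lt;/code&gt; returns, the example below inlines four rows of the sample dataset so it runs without the CSV file. The names &lt;code&gt;csv_text&lt;/code&gt;, &lt;code&gt;df_sample&lt;/code&gt;, and &lt;code&gt;summary&lt;/code&gt; are illustrative and not part of the original notebook.&lt;/p&gt;

```python
import io

import pandas as pd

# Four rows copied from the sample dataset, inlined so the sketch is self-contained.
csv_text = """state,county,total_votes,dem_votes,rep_votes,dem_share
PA,Erie County,127691,75775,50351,60.08
PA,Bradford County,25787,10306,15057,40.64
PA,Tioga County,17984,6390,11326,36.07
OH,Ashtabula County,44874,25027,18949,56.94
"""
df_sample = pd.read_csv(io.StringIO(csv_text))

# describe() returns a DataFrame indexed by statistic name
# (count, mean, std, min, 25%, 50%, 75%, max) for each numeric column.
summary = df_sample.describe()

# Pick out a few statistics for a single column.
print(summary.loc[["mean", "std", "min", "max"], "dem_share"])
```

&lt;p&gt;Because the result is itself a DataFrame, individual statistics can be selected with &lt;code&gt;.loc&lt;/code&gt; instead of being read off a screenshot.&lt;/p&gt;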
&lt;h4&gt;
  
  
  Plotting Histograms
&lt;/h4&gt;

&lt;p&gt;A histogram is a univariate plot: it shows how the values of a single variable are distributed by counting how many fall into each bin.&lt;/p&gt;

&lt;p&gt;Histograms can be drawn with Matplotlib's &lt;code&gt;plt.hist()&lt;/code&gt; function.&lt;/p&gt;

&lt;p&gt;Labeling the histogram: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;plt.xlabel()&lt;/code&gt; - for the x-axis&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;plt.ylabel()&lt;/code&gt; - for the y-axis&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;h6&gt;
  
  
  Note: Always label the axes of your graph.
&lt;/h6&gt;
&lt;h6&gt;
  
  
  Import the matplotlib.pyplot module before running the code.
&lt;/h6&gt;


&lt;/blockquote&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import matplotlib.pyplot as plt
h=plt.hist(df['dem_share'])
_=plt.xlabel('percentage of vote for Obama')
_=plt.ylabel('number of counties')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4g6XDH8D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v4wlu29a0hceb62g67wf.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4g6XDH8D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v4wlu29a0hceb62g67wf.PNG" alt=""&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Setting Seaborn Styling
&lt;/h4&gt;

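&lt;p&gt;Seaborn is a visualization library built on top of Matplotlib. Many professionals prefer it because it provides a high-level interface for drawing attractive and informative statistical graphics. Calling &lt;code&gt;sns.set()&lt;/code&gt; applies Seaborn's default styling to subsequent Matplotlib plots.&lt;br&gt;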
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import seaborn as sns
sns.set()
h=plt.hist(df['dem_share'])
_=plt.xlabel('percentage of vote for Obama')
_=plt.ylabel('number of counties')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--an1woy9R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/srt3vv9ba9dlngt31tzg.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--an1woy9R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/srt3vv9ba9dlngt31tzg.PNG" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Plotting Box Plot
&lt;/h4&gt;

&lt;p&gt;A box plot shows the median of the data, the middle data point, as a line inside the box. The edges of the box mark the upper and lower quartiles, which are the 75th and 25th percentiles respectively.&lt;/p&gt;

&lt;p&gt;Box plots can be drawn with &lt;code&gt;sns.boxplot()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
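&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import matplotlib.pyplot as plt
import seaborn as sns
# df_all_states is assumed to hold the all-states data with an 'east_west' column;
# the sample CSV above does not include that column.
_=sns.boxplot(x='east_west',y='dem_share',data=df_all_states)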
_=plt.xlabel('region')
_=plt.ylabel('percentage of votes for Obama')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SiUWOrG9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y10j9het6l7b9a5c1v8s.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SiUWOrG9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y10j9het6l7b9a5c1v8s.PNG" alt=""&gt;&lt;/a&gt;&lt;/p&gt;
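&lt;p&gt;The quartiles that a box plot draws can also be computed directly. The sketch below uses NumPy's &lt;code&gt;np.percentile&lt;/code&gt; on the &lt;code&gt;dem_share&lt;/code&gt; values of the sample dataset; it is an illustration only, and the variable names are not from the original notebook.&lt;/p&gt;

```python
import numpy as np

# dem_share values copied from the sample dataset (24 counties).
dem_share = np.array([
    60.08, 40.64, 36.07, 41.21, 31.04, 43.78, 44.08, 46.85,
    56.94, 50.46, 44.71, 65.99, 45.88, 42.23, 45.26, 46.15,
    63.1, 52.2, 43.18, 40.24, 57.01, 53.61, 39.92, 47.87,
])

# Lower quartile, median, and upper quartile: the three lines of the box.
q25, median, q75 = np.percentile(dem_share, [25, 50, 75])

# The interquartile range is the height of the box.
iqr = q75 - q25

print(f"25th percentile: {q25:.2f}")
print(f"median:          {median:.2f}")
print(f"75th percentile: {q75:.2f}")
print(f"IQR:             {iqr:.2f}")
```

&lt;p&gt;Half of the counties fall between the 25th and 75th percentiles, which is why the box gives a quick impression of where most of the data sits.&lt;/p&gt;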

&lt;h4&gt;
  
  
  Generating a Bee swarm plot
&lt;/h4&gt;

&lt;p&gt;A bee swarm plot is best suited to relatively small datasets. It draws every data point, grouped by category, and spreads the points sideways so that they do not overlap.&lt;/p&gt;

&lt;p&gt;A bee swarm plot can be drawn with &lt;code&gt;sns.swarmplot()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;_=sns.swarmplot(x='state',y='dem_share',data=df)
_=plt.xlabel('state')
_=plt.ylabel('percentage of vote for Obama')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--phgHqQL4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/30kqq641j6nfe3xl0ui9.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--phgHqQL4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/30kqq641j6nfe3xl0ui9.PNG" alt=""&gt;&lt;/a&gt;&lt;/p&gt;
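&lt;p&gt;The per-state grouping that the swarm plot shows visually can also be checked numerically. The following sketch assumes a small hand-built DataFrame (&lt;code&gt;df_sample&lt;/code&gt;, a hypothetical name) with six rows taken from the sample dataset, and summarizes &lt;code&gt;dem_share&lt;/code&gt; for each state with pandas &lt;code&gt;groupby&lt;/code&gt;.&lt;/p&gt;

```python
import pandas as pd

# Six rows taken from the sample dataset: three counties per state.
df_sample = pd.DataFrame({
    "state": ["PA", "PA", "PA", "OH", "OH", "OH"],
    "dem_share": [60.08, 40.64, 36.07, 56.94, 50.46, 65.99],
})

# Group by the same categorical column used on the swarm plot's x-axis
# and compute the number of counties and the median share per state.
per_state = df_sample.groupby("state")["dem_share"].agg(["count", "median"])

print(per_state)
```

&lt;p&gt;A summary table like this complements the plot: the swarm shows every point, while the table gives one number per group.&lt;/p&gt;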

&lt;h4&gt;
  
  
  Making an ECDF
&lt;/h4&gt;

&lt;p&gt;ECDF stands for empirical cumulative distribution function.&lt;/p&gt;

&lt;p&gt;An ECDF plots the values of a feature, sorted from lowest to highest, against the fraction of data points at or below each value. It is often used as an alternative to a histogram.&lt;/p&gt;

&lt;p&gt;An ECDF can be plotted with &lt;code&gt;plt.plot()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
x=np.sort(df['dem_share']) #sorts the values from lowest to highest
y=np.arange(1, len(x)+1)/len(x) #fraction of points at or below each x
_=plt.plot(x,y,marker='.', linestyle='none')
_=plt.xlabel('percentage of vote for Obama')
_=plt.ylabel('ECDF')
plt.margins(0.02) #Keeps data off plot edges
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--trI6zRYc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tp7d0ll2tts6s2nqa5t2.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--trI6zRYc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tp7d0ll2tts6s2nqa5t2.PNG" alt=""&gt;&lt;/a&gt;&lt;/p&gt;
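&lt;p&gt;The ECDF computation can be wrapped in a small reusable function. The sketch below is one possible way to do it; the helper name &lt;code&gt;ecdf&lt;/code&gt; and the variable &lt;code&gt;frac_at_or_below_50&lt;/code&gt; are assumptions made for illustration. It also reads one value off the ECDF: the fraction of sample counties where Obama's share was 50% or less.&lt;/p&gt;

```python
import numpy as np


def ecdf(data):
    """Return sorted values and the fraction of points at or below each one."""
    x = np.sort(np.asarray(data, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


# dem_share values copied from the sample dataset (24 counties).
dem_share = [
    60.08, 40.64, 36.07, 41.21, 31.04, 43.78, 44.08, 46.85,
    56.94, 50.46, 44.71, 65.99, 45.88, 42.23, 45.26, 46.15,
    63.1, 52.2, 43.18, 40.24, 57.01, 53.61, 39.92, 47.87,
]

x, y = ecdf(dem_share)

# searchsorted with side="right" counts how many sorted values are at or below 50.
frac_at_or_below_50 = np.searchsorted(x, 50.0, side="right") / len(x)

print(f"Fraction of counties at or below 50%: {frac_at_or_below_50:.3f}")
```

&lt;p&gt;Reading values like this from the ECDF is how statements of the form "in N% of counties, X% or less voted for Obama" can be checked against the data.&lt;/p&gt;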

&lt;h4&gt;
  
  
  Conclusion
&lt;/h4&gt;

&lt;p&gt;Using data analysis and visualization, we turned raw numbers into understandable facts such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The eastern US voted more for Obama than the western US.&lt;/li&gt;
&lt;li&gt;In 75% of counties, close to 50% or less of the vote went to Obama.&lt;/li&gt;
&lt;li&gt;In 20% of counties, only 36% or less of the vote went to Obama.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These facts are hard to see by looking at the raw CSV file. With just a few lines of code, we gain a good understanding of the data and can explain it to others with visual evidence such as histograms and ECDFs.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>bigdata</category>
      <category>python</category>
      <category>database</category>
    </item>
  </channel>
</rss>
