<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: bhadresh savani</title>
    <description>The latest articles on DEV Community by bhadresh savani (@bhadreshpsavani).</description>
    <link>https://dev.to/bhadreshpsavani</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F348652%2Fcf80a467-2a90-4203-baa2-7891323b7a62.jpeg</url>
      <title>DEV Community: bhadresh savani</title>
      <link>https://dev.to/bhadreshpsavani</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bhadreshpsavani"/>
    <language>en</language>
    <item>
      <title>Tutorial1: Getting Started with Pyspark</title>
      <dc:creator>bhadresh savani</dc:creator>
      <pubDate>Mon, 31 Oct 2022 17:49:33 +0000</pubDate>
      <link>https://dev.to/bhadreshpsavani/getting-started-with-pyspark-7p8</link>
      <guid>https://dev.to/bhadreshpsavani/getting-started-with-pyspark-7p8</guid>
      <description>&lt;p&gt;As a Data Scientist one might have worked with large amount of data. I never got chance to work on large data earlier. Recently i came across a 1.3gb of sensor data, it was little hard to work on using pandas dataframe. I have to wait for couple of miniutes to read or write data or to perform data manipulation.&lt;/p&gt;

&lt;p&gt;I also realized that a pandas dataframe is a poor fit for big data. It performs worse at reading and writing files (I/O operations), and even data manipulation takes longer. &lt;strong&gt;Reading a 1 GB CSV file took around 44 seconds using pandas, while PySpark took just 6 seconds.&lt;/strong&gt; (The time taken depends on the hardware.) That made me realize I needed to explore PySpark.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pLyCEdmb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f2fvdhm88kfpqqsyo9cf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pLyCEdmb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f2fvdhm88kfpqqsyo9cf.png" alt="pyspark Advantages" width="599" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this tutorial, we will walk through the PySpark installation steps and perform some basic operations on a dataframe object.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step1. PySpark Installation
&lt;/h2&gt;

&lt;p&gt;You will need Java installed in the environment, with a proper &lt;code&gt;JAVA_HOME&lt;/code&gt; variable defined. Make sure you install a JDK or JRE.&lt;/p&gt;

&lt;p&gt;To install PySpark, we just need a pip install inside &lt;code&gt;conda&lt;/code&gt; or any Python &lt;code&gt;virtual environment&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install pyspark&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step2. Session Initialization
&lt;/h2&gt;

&lt;p&gt;Before doing any operation in &lt;code&gt;pyspark&lt;/code&gt; we need to initialize a Spark session. It can be done like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Practice'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inside &lt;code&gt;appName&lt;/code&gt;, we can provide any name based on the objective. The session builder takes a little time to set up, but it is a one-time process.&lt;/p&gt;

&lt;p&gt;Once it's complete, &lt;code&gt;pyspark&lt;/code&gt; is ready to use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step3. Reading a File
&lt;/h2&gt;

&lt;p&gt;PySpark syntax is very similar to pandas. With the pandas library, we read a CSV file like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="n"&gt;df_pandas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'sample.csv'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_pandas&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similarly, in Spark we have the syntax below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df_pyspark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"sample.csv"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_pyspark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: in PySpark the dataframe is not displayed directly; we need to call &lt;code&gt;show()&lt;/code&gt; on the dataframe object.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step4. Some Similar Functions
&lt;/h2&gt;

&lt;p&gt;Some functions are similar between &lt;code&gt;pandas&lt;/code&gt; and &lt;code&gt;pyspark&lt;/code&gt; dataframes, such as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# head
&lt;/span&gt;&lt;span class="n"&gt;df_pandas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df_pyspark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# describe
&lt;/span&gt;&lt;span class="n"&gt;df_pandas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df_pyspark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and many more that give almost identical syntax and results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step5. Dissimilar Functions
&lt;/h2&gt;

&lt;p&gt;There are also a few functions that work differently from pandas, such as column selection and slicing. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# column selection function
&lt;/span&gt;&lt;span class="n"&gt;df_pandas&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'column1'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df_pyspark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'column1'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this note, the pyspark learning journey begins...&lt;/p&gt;

&lt;h3&gt;
  
  
  Reference
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.javatpoint.com/pyspark"&gt;https://www.javatpoint.com/pyspark&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://insaid.medium.com/eda-with-pyspark-1f29b7d1618"&gt;https://insaid.medium.com/eda-with-pyspark-1f29b7d1618&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>pyspark</category>
      <category>pandas</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
