<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shovon Basak</title>
    <description>The latest articles on DEV Community by Shovon Basak (@shvnbasak).</description>
    <link>https://dev.to/shvnbasak</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F334999%2F318ce7fd-18ea-41bd-9c0c-c81594bc30eb.jpg</url>
      <title>DEV Community: Shovon Basak</title>
      <link>https://dev.to/shvnbasak</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shvnbasak"/>
    <language>en</language>
    <item>
      <title>Google Colab, Pyspark, Cassandra remote cluster combine these all together</title>
      <dc:creator>Shovon Basak</dc:creator>
      <pubDate>Mon, 13 Sep 2021 07:06:35 +0000</pubDate>
      <link>https://dev.to/shvnbasak/google-colab-pyspark-cassandra-remote-cluster-combine-these-all-together-41bd</link>
      <guid>https://dev.to/shvnbasak/google-colab-pyspark-cassandra-remote-cluster-combine-these-all-together-41bd</guid>
      <description>&lt;p&gt;If you don't know what are these things, please navigate to the introductory links mentioned bellow to get and rough idea about the tech stack mentioned here and what you can achieve using them. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://colab.research.google.com/notebooks/intro.ipynb"&gt;Google Colaboratory aka Google Colab&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cassandra.apache.org/"&gt;Cassandra&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://databricks.com/glossary/pyspark"&gt;PySpark&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://spark.apache.org/"&gt;Spark&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's really easy to connect Apache Spark with Cassandra or any other data source, but I couldn't find any source that clearly explains how to connect PySpark and Cassandra. To spare those who want to do the same the struggle, just follow this tutorial and you'll be all set to start working with Spark's RDDs using a Cassandra database as the data source. &lt;/p&gt;

&lt;p&gt;First, create a new notebook in Google Colab. Then you need to install the JDK, Apache Spark (built for Hadoop), and the findspark library in the notebook's Python environment. Copy the commands below, paste them into the Google Colab notebook, and run the code block to install all of them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!apt-get install openjdk-8-jdk-headless -qq &amp;gt; /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz
!tar xf spark-2.4.8-bin-hadoop2.7.tgz
!pip install -q findspark
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here I am using spark-2.4.8, but you can install a later version as per your requirements.&lt;/p&gt;

&lt;p&gt;Then you need to set some environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.8-bin-hadoop2.7"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Initialize findspark and check where it is located:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import findspark
findspark.init()
findspark.find()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now this is the step where you are going to connect PySpark with Cassandra:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql import SparkSession
from pyspark import SQLContext

spark = (SparkSession.builder.master("local[*]")
         # Address of the remote Cassandra node
         .config('spark.cassandra.connection.host', "[host_ip_address]")
         # Pull in the DataStax Spark-Cassandra connector built for Spark 2.4 / Scala 2.11
         .config('spark.jars.packages', "datastax:spark-cassandra-connector:2.4.0-s_2.11")
         # Credentials for the Cassandra cluster
         .config('spark.cassandra.auth.username', "[db_username]")
         .config('spark.cassandra.auth.password', "[db_password]")
         .getOrCreate())

SQL_LOCAL_CONTEXT = SQLContext(spark)

# Read a Cassandra table from the given keyspace into a Spark DataFrame
def read_table(context, table):
    return context.read.format("org.apache.spark.sql.cassandra").options(table=table, keyspace="[key_space]").load()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test your connection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;groups = read_table(SQL_LOCAL_CONTEXT, "[db_table]")
groups.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace the placeholders inside the brackets (host_ip_address, db_username, db_password, key_space &amp;amp; db_table) with the values for your own database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;groups&lt;/strong&gt; will store the data as a PySpark DataFrame.&lt;/p&gt;
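&lt;p&gt;Once the table is loaded, you can use the regular PySpark DataFrame API on &lt;strong&gt;groups&lt;/strong&gt;. A minimal sketch, assuming the connection above succeeded (the column name &lt;code&gt;group_name&lt;/code&gt; is a hypothetical placeholder; use a column from your own table):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Inspect the table's columns and types
groups.printSchema()

# Project a single (hypothetical) column and preview a few rows
groups.select("group_name").show(5)

# Count the rows where that column is not null
groups.filter(groups["group_name"].isNotNull()).count()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;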

</description>
    </item>
  </channel>
</rss>
