<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Suda Srinivasan</title>
    <description>The latest articles on DEV Community by Suda Srinivasan (@ssrinivasan).</description>
    <link>https://dev.to/ssrinivasan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F725936%2F17e1f197-d581-41e7-a681-394dc41b7718.jpeg</url>
      <title>DEV Community: Suda Srinivasan</title>
      <link>https://dev.to/ssrinivasan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ssrinivasan"/>
    <language>en</language>
    <item>
      <title>Tutorial: Building Applications with YugabyteDB and Spark</title>
      <dc:creator>Suda Srinivasan</dc:creator>
      <pubDate>Fri, 22 Oct 2021 18:57:27 +0000</pubDate>
      <link>https://dev.to/yugabyte/tutorial-building-applications-with-yugabytedb-and-spark-39nk</link>
      <guid>https://dev.to/yugabyte/tutorial-building-applications-with-yugabytedb-and-spark-39nk</guid>
      <description>&lt;p&gt;Authors: Wei Wang and Amey Banarse&lt;/p&gt;

&lt;p&gt;Originally Published on October 19, 2021 at &lt;a href="https://blog.yugabyte.com/" rel="noopener noreferrer"&gt;https://blog.yugabyte.com/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At Distributed SQL Summit 2021 (&lt;a href="https://distributedsql.org/" rel="noopener noreferrer"&gt;https://distributedsql.org/&lt;/a&gt;), we presented a workshop on how to build an application using the YugabyteDB Spark Connector (&lt;a href="https://github.com/yugabyte/spark-cassandra-connector/tree/2.4-yb" rel="noopener noreferrer"&gt;https://github.com/yugabyte/spark-cassandra-connector/tree/2.4-yb&lt;/a&gt;) and Yugabyte Cloud to deliver business outcomes for our customers.&lt;/p&gt;

&lt;p&gt;The YugabyteDB Spark Connector brings together two best-of-breed technologies: Apache Spark, an industry-leading distributed computing engine, and YugabyteDB, a modern, cloud-native distributed SQL database. The connector allows customers to seamlessly and natively read from YugabyteDB, perform complex ETL in Spark, and write the results back to YugabyteDB.&lt;/p&gt;

&lt;p&gt;This native integration removes the complexity and guesswork of deciding what processing should happen where. With the optimized connector, complex workloads processed by Spark can be translated to SQL and executed by YugabyteDB directly, making the application more scalable and performant.&lt;/p&gt;

&lt;p&gt;In this blog post, we recap some highlights from the workshop, and show you how to get started with your first application.&lt;/p&gt;

&lt;p&gt;Workshop recording and slides&lt;br&gt;
You can check out the complete tutorial by accessing the workshop recording (&lt;a href="https://www.yugabyte.com/dss-2021-on-demand/?utm_campaign=DSS21&amp;amp;utm_medium=email&amp;amp;_hsmi=164556599&amp;amp;_hsenc=p2ANqtz-9aPU72_LtnHCEhXrHHR8zWPundujwpXJUq9VLF0tRmlcRdlvqHNLHs1qfJopA_T-TASED8Fk0tHNvKvao3XFi3cTvzyg&amp;amp;utm_content=164556599&amp;amp;utm_source=hs_email" rel="noopener noreferrer"&gt;https://www.yugabyte.com/dss-2021-on-demand/?utm_campaign=DSS21&amp;amp;utm_medium=email&amp;amp;_hsmi=164556599&amp;amp;_hsenc=p2ANqtz-9aPU72_LtnHCEhXrHHR8zWPundujwpXJUq9VLF0tRmlcRdlvqHNLHs1qfJopA_T-TASED8Fk0tHNvKvao3XFi3cTvzyg&amp;amp;utm_content=164556599&amp;amp;utm_source=hs_email&lt;/a&gt;) and slides (&lt;a href="https://blog.yugabyte.com/wp-content/uploads/2021/10/DSS-2021-%E2%80%94-Yugabyte-Cloud-and-Spark-Workshop-Slides-Final.pdf" rel="noopener noreferrer"&gt;https://blog.yugabyte.com/wp-content/uploads/2021/10/DSS-2021-%E2%80%94-Yugabyte-Cloud-and-Spark-Workshop-Slides-Final.pdf&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Brief introduction to YugabyteDB and Yugabyte Cloud&lt;br&gt;
Cloud native enterprise applications in the multi-cloud world demand a highly scalable, resilient, geographically distributed, and cloud-agnostic modern database. YugabyteDB (&lt;a href="https://blog.yugabyte.com/reimagining-the-rdbms-for-the-cloud/" rel="noopener noreferrer"&gt;https://blog.yugabyte.com/reimagining-the-rdbms-for-the-cloud/&lt;/a&gt;) meets these challenges and is the database of choice for organizations building microservices and born-in-the-cloud apps. In addition, YugabyteDB provides multiple APIs by converging SQL and NoSQL, which simplifies the polyglot data architecture needs of the enterprise.&lt;/p&gt;

&lt;p&gt;Yugabyte Cloud (&lt;a href="http://www.yugabyte.com/cloud" rel="noopener noreferrer"&gt;http://www.yugabyte.com/cloud&lt;/a&gt;) is a fully-managed offering of YugabyteDB and unlocks the power of “any.” Organizations can run any app at any scale, anywhere, accessible at any time and running in any cloud, a perfect addition to a multi-cloud world.&lt;/p&gt;

&lt;p&gt;Getting started with the YugabyteDB Spark Connector&lt;br&gt;
The latest version of the YugabyteDB Spark Connector is Spark 3.0 compatible and allows you to expose YugabyteDB tables as Spark RDDs, write Spark RDDs to Yugabyte tables, and execute arbitrary CQL queries in your Spark applications. Key features of the YugabyteDB Spark Connector include:&lt;/p&gt;

&lt;p&gt;-Read from YugabyteDB: Exposes YugabyteDB tables as Spark RDDs&lt;br&gt;
-Write to YugabyteDB: Saves RDDs back to YugabyteDB via implicit saveToCassandra calls&lt;br&gt;
-Native JSON data support using the JSONB data type, which Cassandra lacks&lt;br&gt;
-Supports PySpark DataFrames&lt;br&gt;
-Performance optimizations with predicate pushdowns&lt;br&gt;
-Cluster, topology, and partition awareness&lt;/p&gt;
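
&lt;p&gt;To illustrate the first two features, here is a minimal sketch of the connector's RDD API, assuming a spark-shell session (where sc is the SparkContext) connected to a cluster that already has the test.employees_json and test.employees_json_copy tables; the table names are illustrative:&lt;/p&gt;

&lt;p&gt;import com.datastax.spark.connector._&lt;br&gt;
//Read a YugabyteDB table as a Spark RDD&lt;br&gt;
val rdd = sc.cassandraTable("test", "employees_json")&lt;br&gt;
//Write the RDD back to another YugabyteDB table via the implicit saveToCassandra call&lt;br&gt;
rdd.saveToCassandra("test", "employees_json_copy")&lt;/p&gt;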

&lt;p&gt;Understanding the application architecture&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrndgfqqbimgyqdc21t8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrndgfqqbimgyqdc21t8.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As seen in the diagram, we are building a Scala application using YugabyteDB Spark Connector to demonstrate:&lt;/p&gt;

&lt;p&gt;-How Spark integrates with Yugabyte Cloud to develop applications&lt;br&gt;
-How YugabyteDB models JSON data efficiently using a JSONB data type&lt;br&gt;
-Key features of the connector&lt;/p&gt;

&lt;p&gt;Prerequisites&lt;br&gt;
Before you get started, you’ll need to have the following software installed on your machine:&lt;/p&gt;

&lt;p&gt;-Java JDK 1.8 &lt;br&gt;
-Spark 3.0 and Scala 2.12&lt;/p&gt;

&lt;p&gt;wget &lt;a href="https://dlcdn.apache.org/spark/spark-3.0.3/spark-3.0.3-bin-hadoop2.7.tgz" rel="noopener noreferrer"&gt;https://dlcdn.apache.org/spark/spark-3.0.3/spark-3.0.3-bin-hadoop2.7.tgz&lt;/a&gt;&lt;br&gt;
tar xvf spark-3.0.3-bin-hadoop2.7.tgz&lt;br&gt;
cd spark-3.0.3-bin-hadoop2.7&lt;br&gt;
./bin/spark-shell --conf spark.cassandra.connection.host=127.0.0.1 --conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions --packages com.yugabyte.spark:spark-cassandra-connector_2.12:3.0-yb-8&lt;/p&gt;

&lt;p&gt;-Yugabyte Cloud access: Create a YugabyteDB cluster from Yugabyte Cloud (&lt;a href="https://www.yugabyte.com/cloud/" rel="noopener noreferrer"&gt;https://www.yugabyte.com/cloud/&lt;/a&gt;) and follow the instructions to configure and connect to the cluster, as well as create database objects using the script namespace.sql (&lt;a href="https://github.com/YB-WangTx/yugabyteSparkDemo/blob/main/namespace.sql" rel="noopener noreferrer"&gt;https://github.com/YB-WangTx/yugabyteSparkDemo/blob/main/namespace.sql&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;./bin/ycqlsh -h your_cluster_ip -f namespace.sql&lt;/p&gt;

&lt;p&gt;//A DDL example with the jsonb data type&lt;br&gt;
CREATE KEYSPACE test;&lt;br&gt;
CREATE TABLE test.employees_json&lt;br&gt;
            (department_id INT, employee_id INT, dept_name TEXT, salary DOUBLE,&lt;br&gt;
             phone JSONB, PRIMARY KEY(department_id, employee_id));&lt;/p&gt;
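
&lt;p&gt;For reference, rows with JSONB phone values can then be loaded through ycqlsh by passing a JSON string literal into the jsonb column; the sample values below are illustrative only:&lt;/p&gt;

&lt;p&gt;INSERT INTO test.employees_json (department_id, employee_id, dept_name, salary, phone)&lt;br&gt;
            VALUES (1, 1, 'Sales', 60000, '{"code":"+1","key":["1400"]}');&lt;br&gt;
INSERT INTO test.employees_json (department_id, employee_id, dept_name, salary, phone)&lt;br&gt;
            VALUES (1, 2, 'Sales', 70000, '{"code":"+44","key":["1500"]}');&lt;/p&gt;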

&lt;p&gt;Steps for building your first application&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Import the libraries required to build the application:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;import org.apache.spark.{SparkConf, SparkContext}&lt;br&gt;
import org.apache.spark.sql.SparkSession&lt;br&gt;
import com.datastax.spark.connector._&lt;br&gt;
import com.datastax.spark.connector.cql.CassandraConnectorConf&lt;br&gt;
import org.apache.spark.sql.functions._&lt;br&gt;
import org.apache.spark.sql.expressions.Window&lt;br&gt;
import com.datastax.spark.connector.cql.CassandraConnector&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Configure Yugabyte Cloud connectivity:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;val host = "748fdee2-aabe-4d75-a698-a6514e0b19ff.aws.ybdb.io"&lt;br&gt;
val user = "admin"&lt;br&gt;
val password = "your password for admin"&lt;br&gt;
val keyStore ="/Users/xxx/Documents/spark3yb/yb-keystore.jks"&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Now create a Spark session and connect to Yugabyte Cloud:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;val conf = new SparkConf()&lt;br&gt;
          .setAppName("yb.spark-jsonb")&lt;br&gt;
          .setMaster("local[1]")&lt;br&gt;
          .set("spark.cassandra.connection.localDC", "us-east-2")&lt;br&gt;
          .set("spark.cassandra.connection.host", "127.0.0.1")&lt;br&gt;
          .set("spark.sql.catalog.ybcatalog",&lt;br&gt;
                "com.datastax.spark.connector.datasource.CassandraCatalog")&lt;br&gt;
          .set("spark.sql.extensions",&lt;br&gt;
                "com.datastax.spark.connector.CassandraSparkExtensions")&lt;/p&gt;

&lt;p&gt;val spark = SparkSession.builder()&lt;br&gt;
           .config(conf)&lt;br&gt;
           .config("spark.cassandra.connection.host", host)&lt;br&gt;
           .config("spark.cassandra.connection.port", "9042")&lt;br&gt;
           .config("spark.cassandra.connection.ssl.clientAuth.enabled", true)&lt;br&gt;
           .config("spark.cassandra.auth.username", user)&lt;br&gt;
           .config("spark.cassandra.auth.password", password)&lt;br&gt;
           .config("spark.cassandra.connection.ssl.enabled", true)&lt;br&gt;
           .config("spark.cassandra.connection.ssl.trustStore.type", "jks")&lt;br&gt;
           .config("spark.cassandra.connection.ssl.trustStore.path", keyStore)&lt;br&gt;
           .config("spark.cassandra.connection.ssl.trustStore.password", "ybcloud")&lt;br&gt;
           .withExtensions(new CassandraSparkExtensions)&lt;br&gt;
           .getOrCreate()&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Process the data by reading from YugabyteDB, performing an ETL step (specifically a window function) in Spark, and saving the result back to YugabyteDB:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;//Read data into a data frame from a YB table&lt;br&gt;
val df_yb = spark.read.table("ybcatalog.test.employees_json")&lt;br&gt;
//Perform window function&lt;br&gt;
val windowSpec  = Window.partitionBy("department_id").orderBy("salary")&lt;br&gt;
df_yb.withColumn("row_number",row_number.over(windowSpec)).show(false)&lt;br&gt;
df_yb.withColumn("rank",rank().over(windowSpec)).show(false)&lt;br&gt;
//Write back to a YB table&lt;br&gt;
df_yb.write.cassandraFormat("employees_json_copy", "test").mode("append").save()&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Now query the data in YugabyteDB:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F571z4zf9cz80xkwp0d3t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F571z4zf9cz80xkwp0d3t.png" alt="Image description" width="800" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;YugabyteDB models JSON data in a JSONB data type efficiently. The connector optimizes the query performance with column pruning and predicate pushdown:&lt;/p&gt;

&lt;p&gt;//Column Pruning&lt;/p&gt;

&lt;p&gt;val query1 = "SELECT department_id, employee_id, get_json_object(phone,'$.code') as code FROM ybcatalog.test.employees_json WHERE get_json_string(phone, '$.key(1)') = '1400' order by department_id limit 2";&lt;/p&gt;

&lt;p&gt;val df_sel1=spark.sql(query1)&lt;/p&gt;

&lt;p&gt;df_sel1.explain&lt;/p&gt;

&lt;p&gt;//Predicate pushdown&lt;/p&gt;

&lt;p&gt;val query2 = "SELECT department_id, employee_id, get_json_object(phone, '$.key[1].m[2].b') as key FROM ybcatalog.test.employees_json WHERE get_json_string(phone, '$.key[1].m[2].b') = '400' order by department_id limit 2";&lt;/p&gt;

&lt;p&gt;val df_sel2 = spark.sql(query2)&lt;/p&gt;

&lt;p&gt;df_sel2.show(false)&lt;/p&gt;

&lt;p&gt;//verify with the explain plan from YB&lt;/p&gt;

&lt;p&gt;df_sel2.explain&lt;/p&gt;

&lt;p&gt;Conclusion&lt;br&gt;
YugabyteDB is an excellent choice of distributed SQL database for storing critical business information such as systems of record and product catalogs. Spark provides all the capabilities you need to perform complex computations on this data by leveraging the YugabyteDB Spark Connector.&lt;/p&gt;

&lt;p&gt;Next steps&lt;br&gt;
Interested in learning more about the YugabyteDB Spark Connector? Download it here (&lt;a href="https://github.com/yugabyte/spark-cassandra-connector/tree/2.4-yb" rel="noopener noreferrer"&gt;https://github.com/yugabyte/spark-cassandra-connector/tree/2.4-yb&lt;/a&gt;) to get started!&lt;/p&gt;

&lt;p&gt;The code for the application we just walked through can be found on GitHub (&lt;a href="https://github.com/YB-WangTx/yugabyteSparkDemo" rel="noopener noreferrer"&gt;https://github.com/YB-WangTx/yugabyteSparkDemo&lt;/a&gt;). You can also sign up for Yugabyte Cloud (&lt;a href="https://cloud.yugabyte.com/signup" rel="noopener noreferrer"&gt;https://cloud.yugabyte.com/signup&lt;/a&gt;), a fully managed YugabyteDB-as-a-service.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
