James Wachuka

Python: Using Spark-SQL

Apache Spark is an open-source distributed processing engine for big data applications. It uses in-memory caching and optimized query execution to run fast analytic queries against data of any size. It offers development APIs in Java, Scala, Python, and R, and supports code reuse across many workloads: batch processing, interactive queries, real-time analytics, machine learning, and graph processing.

PySpark is the Python API for Apache Spark, an open-source distributed computing platform that ships with a set of libraries for processing huge amounts of data quickly. If you are already familiar with Python and tools like pandas, PySpark is a useful tool to learn for building more scalable analyses and pipelines.

PySpark SQL is the Spark module that combines relational processing with Spark's functional programming API. Data can be extracted with plain SQL queries, which behave the same way they would against a relational database.
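For example, a DataFrame can be registered as a temporary view and then queried with ordinary SQL. A minimal sketch (the users view and its rows are invented for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Build a small DataFrame in memory (illustrative data)
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

# Register it as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("users")

# Run a plain SQL query against the view
spark.sql("SELECT name FROM users WHERE age > 30").show()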

Some classes of the PySpark SQL module

  1. pyspark.sql.SparkSession Represents the main entry point for DataFrame and SQL functionality.
  2. pyspark.sql.DataFrame Represents a distributed collection of data grouped into named columns.
  3. pyspark.sql.GroupedData Aggregation methods, returned by DataFrame.groupBy().
  4. pyspark.sql.DataFrameNaFunctions Represents methods for handling missing data.
  5. pyspark.sql.functions A collection of built-in functions available for DataFrame.
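A quick sketch that touches several of these classes (the sales data below is invented for illustration):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("classes-demo").getOrCreate()

# Illustrative data with one missing amount
sales = spark.createDataFrame(
    [("east", 100.0), ("west", None), ("east", 250.0)],
    ["region", "amount"],
)

# pyspark.sql.DataFrameNaFunctions: fill the missing amount with 0
cleaned = sales.na.fill({"amount": 0.0})

# pyspark.sql.GroupedData (returned by groupBy) and pyspark.sql.functions
cleaned.groupBy("region").agg(F.sum("amount").alias("total")).show()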

Example code that queries a database and converts the results into a PySpark DataFrame:

Import libraries

import mysql.connector
import pandas as pd
from pyspark.sql import SparkSession
import os
import sys

# Make sure the Spark driver and workers use this same Python interpreter
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

Create a Spark session

spark = SparkSession.builder \
    .appName("sql-pyspark example") \
    .getOrCreate()

Establish a connection to your database

conn = mysql.connector.connect(
    user='user',
    password='password',
    host='localhost',
    port=3306,
    database='example_db',
)

(No cursor is needed here: pandas.read_sql issues the query through the connection directly.)

Create a query

query = "SELECT * FROM table1"  # any valid SELECT statement works here

Create a pandas DataFrame from the query

query_output = pd.read_sql(query, con=conn)

Close the connection

conn.close()

Convert the pandas DataFrame to a Spark DataFrame

result = spark.createDataFrame(query_output)
result.show()
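From here, the Spark DataFrame can itself be queried with Spark SQL. Continuing the snippet above (the view name query_results is arbitrary):

# Register the Spark DataFrame as a temporary view and query it with SQL
result.createOrReplaceTempView("query_results")
spark.sql("SELECT COUNT(*) AS row_count FROM query_results").show()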

Spark SQL makes structured data simple to query. Spark programs and SQL queries can be mixed freely: structured data is exposed as DataFrames built on Spark's distributed datasets (RDDs), so the output of a SQL query can feed straight into further analytical code.
The ability to load and query data from various sources is another significant benefit of Spark SQL, giving unified data access. Standard connectivity is also provided, since Spark SQL can be reached over ODBC or JDBC.
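As an illustration of that source flexibility, the same MySQL table could be read straight into Spark over JDBC, skipping the pandas round trip entirely. This is a sketch, not the approach used above: it assumes the MySQL Connector/J driver jar is on Spark's classpath, and it reuses the placeholder host, database, table, and credentials from the earlier example.

# Read the MySQL table directly into a Spark DataFrame over JDBC
jdbc_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/example_db")
    .option("dbtable", "table1")
    .option("user", "user")
    .option("password", "password")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .load()
)
jdbc_df.show()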
Spark SQL can also process Hive tables efficiently. Because it is compatible with existing Hive data and queries, it can run unmodified Hive queries against warehouses that already exist.
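Enabling this from PySpark takes only one extra builder call; a minimal sketch (the warehouse database and table names are hypothetical):

from pyspark.sql import SparkSession

# enableHiveSupport() connects the session to the Hive metastore
spark = SparkSession.builder \
    .appName("hive-example") \
    .enableHiveSupport() \
    .getOrCreate()

# Run an unmodified HiveQL query against an existing warehouse table
spark.sql("SELECT * FROM warehouse_db.some_hive_table LIMIT 10").show()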
