Learning Spark 2.0 Knowledge Dump

This post will serve as a continuous knowledge dump for the "Learning Spark 2.0" book, where I'll collect quotes that I find relevant (and hopefully you will too :]!)

  • In Spark’s supported languages, columns are objects with public methods (represented by the Column type).

Code example that uses expr(), withColumn(), and col():

blogsDF.withColumn("Big Hitters", (expr("Hits > 10000"))).show()

The above adds a new column, Big Hitters, based on the conditional expression; note that the expr(...) part can be replaced with col("Hits") > 10000.
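
For example, a minimal Scala sketch of the col() variant (assuming the same blogsDF as above):

import org.apache.spark.sql.functions.col

// equivalent to expr("Hits > 10000"): col("Hits") returns a Column,
// and > is one of its public comparison methods
blogsDF.withColumn("Big Hitters", col("Hits") > 10000).show()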

  • In Scala, prefixing a string with the dollar sign $ invokes a function that converts that string into type Column. Details can be found here
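
A minimal Scala sketch (assuming an active SparkSession named spark, as in the spark-shell, and the blogsDF from above with an Id column):

import spark.implicits._   // enables the $"..." string-to-Column syntax

// $"Id" is converted into a Column, so Column methods such as .desc are available
blogsDF.sort($"Id".desc).show()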

 

  • In Spark, a Row is an object that consists of an ordered collection of fields. Therefore, you can access its fields by an index starting from 0.
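
A minimal Scala sketch (the field values are just illustrative):

import org.apache.spark.sql.Row

// a Row holds an ordered collection of fields of mixed types
val blogRow = Row(6, "Reynold", "Xin", "https://tinyurl.6", 255568, "3/2/2015", Array("twitter", "LinkedIn"))

// fields are accessed by index, starting from 0
println(blogRow(1))   // prints: Reynold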

 

  • Parquet does not support some symbols and whitespace characters in column names (source).
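
So a column name such as "Big Hitters" (which contains a space) would typically be renamed before writing to Parquet; a minimal Scala sketch (assuming the Big Hitters column added earlier and an illustrative output path):

// rename the offending column, then write out as Parquet
blogsDF
  .withColumnRenamed("Big Hitters", "big_hitters")
  .write
  .mode("overwrite")
  .parquet("/tmp/blogs_parquet")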

 

  • Check the PySpark date and time functions here. Check these docs for all of PySpark's SQL-related functions. Finally, check Spark SQL's built-in functions here; some of them are equivalent in definition, for example "=" and "==", or "!=" and "<>".
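
The same functions are also available on the Scala side via org.apache.spark.sql.functions; a minimal sketch (assuming blogsDF has a Published string column with dates like "3/2/2015" — the column name and format are assumptions):

import org.apache.spark.sql.functions.{col, to_date, to_timestamp}

// parse a date string column into proper date and timestamp columns
blogsDF
  .withColumn("published_date", to_date(col("Published"), "M/d/yyyy"))
  .withColumn("published_ts", to_timestamp(col("Published"), "M/d/yyyy"))
  .select("Published", "published_date", "published_ts")
  .show()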

 

  • From the book: The DataFrame API also offers the collect() method, but for extremely large DataFrames this is resource-heavy (expensive) and dangerous, as it can cause out-of-memory (OOM) exceptions. Unlike count(), which returns a single number to the driver, collect() returns a collection of all the Row objects in the entire DataFrame or Dataset. If you want to take a peek at some Row records you’re better off with take(n), which will return only the first n Row objects of the DataFrame.
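
A minimal Scala sketch of the safer alternative (assuming blogsDF from above):

// take(n) ships only the first n Row objects to the driver
val firstTwo = blogsDF.take(2)   // Array[Row] of length 2
firstTwo.foreach(println)

// collect() would ship every Row to the driver -- avoid on very large DataFrames (OOM risk)
// val allRows = blogsDF.collect()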

 

 

  • "Datasets" -> Java/Scala (as they're compile-time/type-safe languages)
  • "DataFrame" -> Python (as types are dynamically inferred/assigned during execution)
  • Tricky note: "DataFrame" -> Scala (because in Scala, a DataFrame is just an alias for the untyped Dataset[Row])
    • A Row is a generic object type in Spark, holding a collection of mixed types that can be accessed using an index.
    • Example of defining a Scala Case Class to use DeviceIoTData instead of Row can be found here

Summary of the above bullet points:

(summary image)

  • When we use Datasets, the underlying Spark SQL engine handles the creation/conversion/serialization/deserialization of JVM objects as well as Java heap memory management with the help of Dataset encoders.
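
A minimal Scala sketch of the case class + encoder ideas above (the case class here is an illustrative subset of the book's DeviceIoTData fields, and the JSON path is a placeholder):

import org.apache.spark.sql.{Dataset, SparkSession}

// illustrative subset of fields; the book's DeviceIoTData case class defines more
case class DeviceIoTData(device_id: Long, device_name: String, temp: Long)

val spark = SparkSession.builder.appName("datasets-sketch").getOrCreate()
import spark.implicits._   // brings the Encoder[DeviceIoTData] into scope

// .as[...] turns the untyped DataFrame (Dataset[Row]) into a typed Dataset;
// the encoder handles creation and (de)serialization of the JVM objects
val ds: Dataset[DeviceIoTData] = spark.read.json("/path/to/iot_devices.json").as[DeviceIoTData]

// typed, compile-time-checked transformation on DeviceIoTData objects instead of Rows
ds.filter(d => d.temp > 30).show(5)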

DataFrames Versus Datasets:

  • Use Datasets if you want:
    • strict compile-time type safety (and don’t mind creating multiple case classes for a specific Dataset[T]).
    • Tungsten’s efficient serialization with Encoders.
  • Use DataFrames if you want:
    • SQL-like queries to perform relational transformations on your data.
    • unification, code optimization, and simplification of APIs across Spark components.
    • an easy transition from the R language.
    • space and speed efficiency.
  • Use RDDs if you want to precisely instruct Spark how to do a query.

 
