Upkar Lidder
A Gentle Intro to Apache Spark for Developers

I recently hosted an online meetup on Apache Spark with IBM Developer. Spark has been around for a few years, but to my surprise interest is still growing. Apache Spark was developed at the University of California, Berkeley's AMPLab. The Spark codebase was open sourced in 2010 and donated to the Apache Software Foundation in 2013.

Apache Spark

The background of the attendees was quite diverse:

  • Developer (25%)
  • Architect (12.5%)
  • Data Scientist (41.7%)
  • Other (12.8%)

We looked at the WHAT and WHY of Spark and then dove into the three data structures that you might encounter when working with Spark …
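One thing all of Spark's data structures share is lazy evaluation: transformations only build up a plan, and no work happens until an action forces it. As a rough analogy (plain Python generators, not Spark itself), it looks something like this:

```python
# Analogy only: Python generators are lazy, like Spark transformations.
data = range(10)

# "Transformations": these build a pipeline, but nothing is computed yet.
doubled = (x * 2 for x in data)
evens = (x for x in doubled if x % 4 == 0)

# "Action": forces the whole pipeline to actually run.
result = list(evens)
print(result)  # [0, 4, 8, 12, 16]
```

In Spark, calls like filter and withColumn play the role of the generator expressions, while calls like count, collect, or show trigger the actual computation.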

We also looked at some Transformations, Actions, built-in functions and UDFs (user defined functions). For example, the following function creates a new column called GENDER based on the contents of the column GenderCode.

# ------------------------------
# Derive gender from salutation
# ------------------------------
from pyspark.sql import functions as func
from pyspark.sql import types

def deriveGender(col):
    """ input: salutation string from the GenderCode column
        output: "male", "female" or "unknown"
    """
    if col in ['Mr.', 'Master.']:
        return 'male'
    elif col in ['Mrs.', 'Miss.']:
        return 'female'
    else:
        return 'unknown'

deriveGenderUDF = func.udf(lambda c: deriveGender(c), types.StringType())
customer_df = customer_df.withColumn("GENDER", deriveGenderUDF(customer_df["GenderCode"]))
customer_df.cache()

withColumn creates a new column in the customer_df dataframe, populated with the values returned by deriveGenderUDF (our user defined function). The deriveGenderUDF is essentially the deriveGender function wrapped so that Spark can apply it to each row. If this does not make sense, watch the webinar, where we go into a lot more detail.
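Because a UDF wraps plain Python, it can help to sanity-check the logic on a few sample values before registering it with Spark. A quick sketch (restating deriveGender from the snippet above so it runs standalone):

```python
# Plain Python version of the salutation-to-gender logic, checked
# on sample values before wrapping it in a Spark UDF.
def deriveGender(col):
    if col in ['Mr.', 'Master.']:
        return 'male'
    elif col in ['Mrs.', 'Miss.']:
        return 'female'
    else:
        return 'unknown'

assert deriveGender('Mr.') == 'male'
assert deriveGender('Miss.') == 'female'
assert deriveGender('Dr.') == 'unknown'  # unmapped salutations fall through
```

Once the plain function behaves as expected, wrapping it with func.udf is a one-liner, as shown earlier.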

Finally, we created a Spark cluster environment on IBM Cloud and used a Jupyter notebook to explore customer data with the following columns …

"CUST_ID", 
"CUSTNAME", 
"ADDRESS1", 
"ADDRESS2", 
"CITY", 
"POSTAL_CODE", 
"POSTAL_CODE_PLUS4", 
"STATE", 
"COUNTRY_CODE", 
"EMAIL_ADDRESS", 
"PHONE_NUMBER",
"AGE",
"GenderCode",
"GENERATION",
"NATIONALITY", 
"NATIONAL_ID", 
"DRIVER_LICENSE"

After cleaning the data using built-in and user defined methods, we used PixieDust to visualize it. The cool thing about PixieDust is that you don't need to set it up or configure it: you just pass it a Spark DataFrame or a Pandas DataFrame and you are good to go! You can find the complete notebook here.

Thank you IBM Developer and Max Katz for the opportunity to present and special thanks to Lisa Jung for being a patient co-presenter!
