Testing PySpark & Pandas in style

#spark #pandas #testing #opensource

Today we'd like to share a small utility package for testing DataFrames in PySpark and Pandas.

If you are a fan of test-driven development and have had a chance to work on PySpark (or Pandas) projects, you've probably written tests similar to this one:

from datetime import datetime
from pyspark_test import assert_pyspark_df_equal
from your_module import calculate_result

def test_event_aggregation(spark):
    schema = ["user_id", "even_type", "item_id", "event_time", "country", "dt"]
    expected_df = spark.createDataFrame(
        [
            (123456, "page_view", None, datetime(2017, 12, 31, 23, 50, 50), "uk", "2017-12-31"),
            (123456, "item_view", 68471513, datetime(2017, 12, 31, 23, 50, 55), "uk", "2017-12-31"),
        ],
        schema
    )

    result_df = calculate_result()

    assert_pyspark_df_equal(expected_df, result_df)
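
The `spark` argument in the test above is not defined in the snippet; it is presumably a pytest fixture that provides a SparkSession. A minimal sketch of such a fixture, assuming pytest and a local PySpark installation (the `conftest.py` placement, the app name, and the `local[1]` master are illustrative choices, not taken from the original project):

```python
# conftest.py (hypothetical file; not part of the original project)
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    """Provide one local SparkSession shared by all tests in the run."""
    session = (
        SparkSession.builder
        .master("local[1]")     # single local worker thread; enough for small test data
        .appName("unit-tests")  # illustrative name
        .getOrCreate()
    )
    yield session
    session.stop()  # release Spark resources after the last test
```

A session-scoped fixture avoids paying the Spark startup cost once per test, which matters as the test suite grows.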

Building the expected DataFrame by hand with createDataFrame works fine for small applications, but as the project grows, the data gets more complex, and the number of tests increases, you may want a less tedious way to define test data.

Exacaster alumnus Vaidas Armonas came up with the idea of representing Spark DataFrames as markdown tables. The idea materialized into a testing package, markdown-frames. With this package, the test shown earlier can be rewritten like this:

from pyspark_test import assert_pyspark_df_equal
from markdown_frames.spark_dataframe import spark_df
from your_module import calculate_result

def test_event_aggregation(spark):
    # First row: column names, second row: column types, then the data rows.
    expected_data = """
        |  user_id   |  even_type  | item_id  |    event_time       | country  |     dt      |
        |   bigint   |   string    |  bigint  |    timestamp        |  string  |   string    |
        | ---------- | ----------- | -------- | ------------------- | -------- | ----------- |
        |   123456   |  page_view  |   None   | 2017-12-31 23:50:50 |   uk     | 2017-12-31  |
        |   123456   |  item_view  | 68471513 | 2017-12-31 23:50:55 |   uk     | 2017-12-31  |
    """
    expected_df = spark_df(expected_data, spark)

    result_df = calculate_result()

    assert_pyspark_df_equal(expected_df, result_df)
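
To make the idea behind the table format more concrete, here is a simplified, pure-Python sketch of how such a table could be turned into typed rows. It is only an illustration of the concept and not the actual markdown-frames implementation: the function name `parse_markdown_table`, the `CONVERTERS` mapping, and the small set of supported types are all assumptions made for this example.

```python
from datetime import datetime

# Hypothetical mapping from the type names in the second table row
# to Python converters; only a handful of types are covered here.
CONVERTERS = {
    "int": int,
    "bigint": int,
    "float": float,
    "string": str,
    "timestamp": lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S"),
}


def _is_separator(row):
    """Return True for a '| --- | --- |' style row."""
    return all(cell and set(cell) <= {"-", ":"} for cell in row)


def _convert(value, type_name):
    """Convert one cell to its declared type; the literal None means null."""
    if value == "None":
        return None
    return CONVERTERS[type_name](value)


def parse_markdown_table(table):
    """Split a markdown table into column names, type names, and typed rows."""
    # Drop blank lines, then split every remaining line on the pipe characters.
    lines = [line.strip() for line in table.splitlines() if line.strip()]
    cells = [[cell.strip() for cell in line.strip("|").split("|")] for line in lines]

    names, types = cells[0], cells[1]
    data_rows = [row for row in cells[2:] if not _is_separator(row)]
    rows = [
        tuple(_convert(value, type_name) for value, type_name in zip(row, types))
        for row in data_rows
    ]
    return names, types, rows
```

The resulting `rows` and `names` have the same shape as the arguments passed to `spark.createDataFrame` in the first example, which is why a table like this can stand in for the hand-written list of tuples.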