
Paulius for Exacaster


Testing PySpark & Pandas in style

Today we'd like to share a small utility package for testing DataFrames in PySpark and Pandas.

If you are a fan of test-driven development and have had a chance to work on PySpark (or Pandas) projects, you've probably written tests similar to this one:

from datetime import datetime
from pyspark_test import assert_pyspark_df_equal
from your_module import calculate_result

def test_event_aggregation(spark):
    schema = ["user_id", "event_type", "item_id", "event_time", "country", "dt"]
    expected_df = spark.createDataFrame(
        [
            (123456, "page_view", None, datetime(2017, 12, 31, 23, 50, 50), "uk", "2017-12-31"),
            (123456, "item_view", 68471513, datetime(2017, 12, 31, 23, 50, 55), "uk", "2017-12-31"),
        ],
        schema
    )

    result_df = calculate_result()

    assert_pyspark_df_equal(expected_df, result_df)


This works fine for small applications, but as your project grows, the data gets more complicated, and the number of tests increases, you might want a less tedious way to define test data.

Exacaster alumnus Vaidas Armonas came up with the idea of representing Spark DataFrames as markdown tables. The idea materialized as a testing package, markdown-frames. With it, the test shown above can be rewritten like this:

from pyspark_test import assert_pyspark_df_equal
from markdown_frames.spark_dataframe import spark_df
from your_module import calculate_result

def test_event_aggregation(spark):
    # The markdown table describes the DataFrame we expect calculate_result() to return.
    expected_data = """
        |  user_id   |  event_type | item_id  |    event_time       | country  |     dt      |
        |   bigint   |   string    |  bigint  |    timestamp        |  string  |   string    |
        | ---------- | ----------- | -------- | ------------------- | -------- | ----------- |
        |   123456   |  page_view  |   None   | 2017-12-31 23:50:50 |   uk     | 2017-12-31  |
        |   123456   |  item_view  | 68471513 | 2017-12-31 23:50:55 |   uk     | 2017-12-31  |
    """
    expected_df = spark_df(expected_data, spark)

    result_df = calculate_result()

    assert_pyspark_df_equal(expected_df, result_df)

The expected data now reads like a table, with column names and types visible at a glance, which makes tests more readable and self-explanatory.
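To give a feel for what happens behind such a table, here is a minimal standard-library sketch of how a typed markdown table could be turned into column names, type names, and rows. This is not how markdown-frames is implemented. The names `CONVERTERS`, `split_row`, and `parse_markdown_table` are made up for illustration, and only the three type names that appear in the example above are handled.

```python
from datetime import datetime

# Hypothetical converters for the type names used in the example table.
CONVERTERS = {
    "bigint": int,
    "string": str,
    "timestamp": lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S"),
}


def split_row(line):
    """Split one '| a | b |' line into a list of stripped cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_markdown_table(table):
    """Return (column_names, type_names, rows) parsed from a typed markdown table."""
    lines = [line for line in table.splitlines() if line.strip()]
    names = split_row(lines[0])   # first line: column names
    types = split_row(lines[1])   # second line: column types
    rows = []
    for line in lines[2:]:
        cells = split_row(line)
        # Skip the '| --- | --- |' separator line.
        if all(cell and set(cell) <= {"-"} for cell in cells):
            continue
        # Treat the literal text 'None' as a missing value, convert everything else.
        rows.append(tuple(
            None if cell == "None" else CONVERTERS[type_name](cell)
            for cell, type_name in zip(cells, types)
        ))
    return names, types, rows


table = """
    |  user_id   | item_id  |    event_time       |
    |   bigint   |  bigint  |    timestamp        |
    | ---------- | -------- | ------------------- |
    |   123456   |   None   | 2017-12-31 23:50:50 |
    |   123456   | 68471513 | 2017-12-31 23:50:55 |
"""
names, types, rows = parse_markdown_table(table)
```

The resulting `names` and `rows` have the same shape as the schema list and row tuples passed to `spark.createDataFrame` in the first test, which is why the markdown form can stand in for them.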

Building a DataFrame for Pandas looks almost the same; you just need to use a different function:

from markdown_frames.pandas_dataframe import pandas_df

Share in the comments if you know any other convenient tips and tricks for writing PySpark (and Pandas) unit tests.
