Aki

Posted on Jun 1

Exploring Snowpark While Comparing It with Apache Spark

#snowflake #dataengineering

Original Japanese article: Snowparkを動かしながらSparkとの違いを整理してみる

Introduction

Recently, I've had more opportunities to work with Snowflake when building data platforms.

When working with modern data platforms, Apache Spark is often used for distributed data processing. Snowflake also provides its own data processing framework called Snowpark.

If you're already familiar with Spark or AWS Glue, you may find yourself wondering:

"Wait... how is Snowpark actually different from Spark?"

In this article, I'd like to organize my own understanding while exploring Snowpark's behavior and comparing it with Spark.

For this experiment, everything was done entirely within Snowflake Notebooks in Snowsight.

One of the biggest advantages is that no local environment setup or connection configuration is required—you can start experimenting immediately.

What is Snowpark?

Snowpark is a data processing framework provided by Snowflake.

Its biggest feature is the ability to write code in Python, Java, or Scala and execute it directly inside Snowflake.

Traditionally, Snowflake workloads were primarily implemented using SQL. With Snowpark, however, you can use a DataFrame API similar to Spark or Pandas while keeping all processing within Snowflake.

In other words, you no longer need to pull data into a local environment or AWS Lambda for processing.

Some key characteristics include:

Managed execution environment – Processing runs on Snowflake warehouses with no infrastructure management required.
DataFrame API – Similar developer experience to Spark and Pandas.
Pushdown execution – Code is executed within Snowflake, eliminating data transfer overhead.
UDF and UDTF support – Custom functions can be defined and executed inside Snowflake.
Snowflake Notebook integration – Interactive development is supported directly in Snowsight.

Personally, I still like Scala, but these days I write most data processing code in Python.

While Scala often offers better performance, Python's simplicity and extensive ecosystem make it the more practical choice in many situations.

Differences Between Spark and Snowpark

Many people immediately think of Spark when they hear the name Snowpark.

The names are similar, and the DataFrame APIs feel very familiar.

However, there are several important differences.

Aspect	Spark	Snowpark
Execution Environment	Distributed cluster	Snowflake warehouse
Data Source	HDFS, S3, and others	Primarily Snowflake tables
Scaling	Cluster size managed by user	Warehouse size adjustment
Languages	Scala, Java, Python, R, etc.	Python, Java, Scala
External Data Support	Broad ecosystem support	Primarily Snowflake-centric
Infrastructure Management	Cluster management required	Fully managed

Spark requires awareness of distributed clusters and execution mechanics.

Snowpark, on the other hand, is fundamentally a processing framework that operates on top of the Snowflake platform.

DataFrame operations are internally converted into SQL execution plans and executed by Snowflake's SQL engine.

Unlike Spark, user code is not distributed across worker nodes.

Scaling is handled by Snowflake warehouses.

UDFs are an exception. UDF code is pushed into Snowflake and executed in parallel by Snowflake's infrastructure.

A useful mental model is:

DataFrame operations → SQL generation

UDFs → Server-side parallel execution

In either case, users do not need to manage clusters or DAG execution as they would in Spark.

If your data is already centralized in Snowflake, Snowpark provides a convenient way to write Spark-like code without worrying about infrastructure management.

Of course, AWS Glue also provides a largely serverless experience, making it another convenient option in the AWS ecosystem.

Getting Started

All examples in this article are executed within Snowflake Notebooks in Snowsight.

No local Python environment or connection configuration is required—the entire workflow runs directly in the browser.

Prerequisites

From the Snowsight menu:

Create → Notebooks

Create a new notebook.

Snowpark for Python is already installed, so there is no need to run pip install.

You can start coding immediately.

Obtaining a Session

In local environments, Snowpark sessions are typically created using:

Session.builder.configs(...).create()

In Snowflake Notebooks, an active session already exists.

You can simply retrieve it:

from snowflake.snowpark.context import get_active_session

session = get_active_session()
print("Session acquired successfully!")

One of the major advantages of Snowflake Notebooks is that connection details never need to be written manually.

Basic DataFrame Operations

Let's create and manipulate a DataFrame from a table.

df = session.table("MY_DB.MY_SCHEMA.SALES_DATA")

result_df = df.select("ORDER_ID", "AMOUNT", "REGION") \
              .filter(df["REGION"] == "Asia") \
              .sort(df["AMOUNT"].desc())

result_df.show()

One important point is that no SQL is actually executed until show() is called.

We'll discuss this in more detail when covering lazy evaluation.

Aggregations

GroupBy operations feel almost identical to Spark.

from snowflake.snowpark import functions as F

summary_df = df.group_by("REGION") \
               .agg(
                   F.sum("AMOUNT").alias("TOTAL_AMOUNT"),
                   F.count("ORDER_ID").alias("ORDER_COUNT"),
                   F.avg("AMOUNT").alias("AVG_AMOUNT")
               )

summary_df.show()

When executed, these DataFrame operations are translated into SQL and run within Snowflake.

You can inspect the generated SQL using:

print(summary_df.queries)

Writing Results Back to a Table

To save results into a Snowflake table:

summary_df.write.mode("overwrite").save_as_table(
    "MY_DB.MY_SCHEMA.SALES_SUMMARY"
)

Since save_as_table() does not return a result, it's often useful to reload the table to verify the output.

session.table("MY_DB.MY_SCHEMA.SALES_SUMMARY").show()

The overwrite mode replaces the existing table.

Use append if you want to add rows instead.

How Does Lazy Evaluation Work?

Anyone familiar with Spark has likely encountered lazy evaluation.

Sometimes it can even lead to unexpected behavior during debugging.

Snowpark adopts the same fundamental concept.

Understanding Lazy Evaluation

DataFrame transformations such as:

select
filter
group_by

are not executed immediately.

These operations merely build an execution plan.

Actual execution occurs only when an action is triggered.

Common action operations include:

show()
collect()
count()
to_pandas()
write.save_as_table()

Verifying Lazy Evaluation

A convenient way to inspect behavior is through df.queries.

from snowflake.snowpark import functions as F

df_filtered = session.table("MY_DB.MY_SCHEMA.LARGE_TABLE") \
                     .filter(F.col("STATUS") == "ACTIVE") \
                     .select("ID", "NAME", "STATUS", "CREATED_AT")

print(df_filtered.queries)

result = df_filtered.collect()
print(f"{len(result)} rows retrieved")

The generated SQL can be inspected before execution, but no query has actually been sent to Snowflake yet.

To verify execution timing precisely, we can use Query History in Snowsight.

Open:

Monitoring → Query History

Then perform the following steps:

Define the DataFrame.
Check Query History.
Confirm that no SELECT statement has been executed.
Execute collect().
Refresh Query History.
Observe that the SELECT statement now appears.

Define the DataFrame

No corresponding query appears yet.

Execute collect()

After execution, the query becomes visible.

This confirms that DataFrame definitions alone do not trigger execution.

The SQL is executed only when collect() is called.

Differences from Spark's Lazy Evaluation

In Spark, lazy evaluation constructs a DAG that is optimized and executed across a cluster.

In Snowpark, lazy evaluation ultimately produces SQL, which is then optimized and executed by Snowflake's query optimizer.

The concept is similar, but the execution engine is fundamentally different.

One particularly useful feature is that generated SQL can be inspected via df.queries, making it easier to validate execution plans.

Can We Use Caching?

If you're coming from Spark, your first instinct may be to use cache().

Snowpark provides a similar capability through cache_result().

Differences from Spark cache()

Aspect	Spark `cache()`	Snowpark `cache_result()`
Storage	Memory (and disk)	Temporary table
Lifetime	Until application ends	Until session ends
Cost	No additional write	INSERT into temporary table

Internally, cache_result() materializes results into a temporary table.

Subsequent operations reuse that table rather than re-running expensive transformations.

Example

df_heavy = session.table("MY_DB.MY_SCHEMA.LARGE_TABLE") \
                  .filter(F.col("STATUS") == "ACTIVE") \
                  .join(
                      session.table("MY_DB.MY_SCHEMA.MASTER"),
                      "ID"
                  )

cached_df = df_heavy.cache_result()

result1 = cached_df.filter(
    F.col("REGION") == "Asia"
).collect()

result2 = cached_df.group_by("REGION") \
                   .agg(F.count("*")) \
                   .collect()

cached_df.drop_table()

Using a with block is often more convenient because the temporary table is automatically dropped when the block exits.

with df_heavy.cache_result() as cached_df:
    result1 = cached_df.filter(
        F.col("REGION") == "Asia"
    ).collect()

    result2 = cached_df.group_by("REGION") \
                       .agg(F.count("*")) \
                       .collect()

Since cache_result() performs an INSERT into a temporary table, it can actually make things slower when the DataFrame is only used once.

It's most effective when the same expensive transformation is reused multiple times.

You can also observe this behavior in Snowsight.

Temporary table creation:

Subsequent SELECT from the temporary table:

Another query reusing the same temporary table:

Use Cases

Let's consider some practical scenarios where Snowpark can be useful.

ETL Pipelines

Traditionally, pipelines often look like:

S3 → Glue → Redshift

With Snowpark, many transformations can be performed entirely within Snowflake.

This reduces data movement and simplifies overall architecture.

Example:

raw_df = session.table("MY_DB.MY_SCHEMA.RAW_EVENTS")

cleaned_df = raw_df \
    .filter(raw_df["EVENT_TYPE"] != "test") \
    .with_column(
        "EVENT_DATE",
        F.to_date(raw_df["EVENT_TIMESTAMP"])
    ) \
    .drop_duplicates([
        "USER_ID",
        "EVENT_DATE",
        "EVENT_TYPE"
    ])

aggregated_df = cleaned_df \
    .group_by("EVENT_DATE", "EVENT_TYPE") \
    .agg(
        F.count("USER_ID")
         .alias("USER_COUNT")
    )

aggregated_df.write.mode("overwrite") \
             .save_as_table(
                 "MY_DB.MY_SCHEMA.DAILY_EVENT_SUMMARY"
             )

session.table(
    "MY_DB.MY_SCHEMA.DAILY_EVENT_SUMMARY"
).show()

Custom Transformations Using UDFs

Snowpark UDFs allow complex logic that would be cumbersome in SQL to be implemented in Python.

You can register UDFs using either the @udf decorator or session.udf.register().

from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import StringType

@udf(return_type=StringType(),
     input_types=[StringType()])
def normalize_region(region: str) -> str:
    region_map = {
        "US": "North America",
        "JP": "Asia",
        "DE": "Europe",
    }
    return region_map.get(region, "Other")

df = session.table("MY_DB.MY_SCHEMA.RAW_EVENTS")

df_with_region = df.with_column(
    "NORMALIZED_REGION",
    normalize_region(df["REGION_CODE"])
)

If type hints are available, explicit type definitions can be omitted:

from snowflake.snowpark.functions import udf

@udf
def normalize_region(region: str) -> str:
    region_map = {
        "US": "North America",
        "JP": "Asia",
        "DE": "Europe",
    }
    return region_map.get(region, "Other")

That said, simple transformations are often faster when implemented using built-in SQL functions.

As always, benchmark before deciding.

Data Quality Validation

Snowpark can also be used for data quality checks before processing continues.

total_count = df.count()
null_count = df.filter(
    F.col("AMOUNT").is_null()
).count()

null_rate = null_count / total_count

if null_rate > 0.10:
    raise ValueError(
        f"NULL rate exceeds the threshold: {null_rate:.1%}"
    )

print(
    f"Data quality check passed "
    f"(NULL rate: {null_rate:.1%})"
)

Conclusion

In this article, we explored Snowpark's fundamentals, compared it with Spark, and examined its lazy evaluation behavior.

For engineers already familiar with Spark, Snowpark should feel quite approachable.

However, it's important to remember that execution occurs on Snowflake warehouses rather than a Spark cluster.

Reviewing generated SQL and understanding how Snowflake executes queries can help avoid unexpected full-table scans and other performance issues.

If your data is already centralized in Snowflake, keeping processing inside Snowflake rather than moving data to Lambda or Glue Python Shell can be a significant advantage.

Reducing infrastructure management overhead and consolidating ETL processing within Snowflake can also improve maintainability.

One final note: throughout this experiment, I frequently relied on Cortex Code whenever I encountered errors.

The workflow of iteratively fixing notebook errors through Cortex Code was surprisingly convenient.

That said, just like any AI-assisted coding workflow, it's still important to carefully validate the generated code rather than accepting it blindly.

I hope this article helps anyone considering Snowpark for data processing within Snowflake.

DEV Community