Original Japanese article: Snowparkを動かしながらSparkとの違いを整理してみる
Introduction
Recently, I've had more opportunities to work with Snowflake when building data platforms.
When working with modern data platforms, Apache Spark is often used for distributed data processing. Snowflake also provides its own data processing framework called Snowpark.
If you're already familiar with Spark or AWS Glue, you may find yourself wondering:
"Wait... how is Snowpark actually different from Spark?"
In this article, I'd like to organize my own understanding while exploring Snowpark's behavior and comparing it with Spark.
For this experiment, everything was done entirely within Snowflake Notebooks in Snowsight.
One of the biggest advantages is that no local environment setup or connection configuration is required—you can start experimenting immediately.
What is Snowpark?
Snowpark is a data processing framework provided by Snowflake.
Its biggest feature is the ability to write code in Python, Java, or Scala and execute it directly inside Snowflake.
Traditionally, Snowflake workloads were primarily implemented using SQL. With Snowpark, however, you can use a DataFrame API similar to Spark or Pandas while keeping all processing within Snowflake.
In other words, you no longer need to pull data into a local environment or AWS Lambda for processing.
Some key characteristics include:
- Managed execution environment – Processing runs on Snowflake warehouses with no infrastructure management required.
- DataFrame API – Similar developer experience to Spark and Pandas.
- Pushdown execution – Code is executed within Snowflake, eliminating data transfer overhead.
- UDF and UDTF support – Custom functions can be defined and executed inside Snowflake.
- Snowflake Notebook integration – Interactive development is supported directly in Snowsight.
Personally, I still like Scala, but these days I write most data processing code in Python.
While Scala often offers better performance, Python's simplicity and extensive ecosystem make it the more practical choice in many situations.
Differences Between Spark and Snowpark
Many people immediately think of Spark when they hear the name Snowpark.
The names are similar, and the DataFrame APIs feel very familiar.
However, there are several important differences.
| Aspect | Spark | Snowpark |
|---|---|---|
| Execution Environment | Distributed cluster | Snowflake warehouse |
| Data Source | HDFS, S3, and others | Primarily Snowflake tables |
| Scaling | Cluster size managed by user | Warehouse size adjustment |
| Languages | Scala, Java, Python, R, etc. | Python, Java, Scala |
| External Data Support | Broad ecosystem support | Primarily Snowflake-centric |
| Infrastructure Management | Cluster management required | Fully managed |
Spark requires awareness of distributed clusters and execution mechanics.
Snowpark, on the other hand, is fundamentally a processing framework that operates on top of the Snowflake platform.
DataFrame operations are internally converted into SQL execution plans and executed by Snowflake's SQL engine.
Unlike Spark, user code is not distributed across worker nodes.
Scaling is handled by Snowflake warehouses.
UDFs are an exception. UDF code is pushed into Snowflake and executed in parallel by Snowflake's infrastructure.
A useful mental model is:
- DataFrame operations → SQL generation
- UDFs → Server-side parallel execution
In either case, users do not need to manage clusters or DAG execution as they would in Spark.
If your data is already centralized in Snowflake, Snowpark provides a convenient way to write Spark-like code without worrying about infrastructure management.
Of course, AWS Glue also provides a largely serverless experience, making it another convenient option in the AWS ecosystem.
Getting Started
All examples in this article are executed within Snowflake Notebooks in Snowsight.
No local Python environment or connection configuration is required—the entire workflow runs directly in the browser.
Prerequisites
From the Snowsight menu:
Create → Notebooks
Create a new notebook.
Snowpark for Python is already installed, so there is no need to run pip install.
You can start coding immediately.
Obtaining a Session
In local environments, Snowpark sessions are typically created using:
Session.builder.configs(...).create()
In Snowflake Notebooks, an active session already exists.
You can simply retrieve it:
from snowflake.snowpark.context import get_active_session
session = get_active_session()
print("Session acquired successfully!")
One of the major advantages of Snowflake Notebooks is that connection details never need to be written manually.
Basic DataFrame Operations
Let's create and manipulate a DataFrame from a table.
df = session.table("MY_DB.MY_SCHEMA.SALES_DATA")
result_df = df.select("ORDER_ID", "AMOUNT", "REGION") \
.filter(df["REGION"] == "Asia") \
.sort(df["AMOUNT"].desc())
result_df.show()
One important point is that no SQL is actually executed until show() is called.
We'll discuss this in more detail when covering lazy evaluation.
Aggregations
GroupBy operations feel almost identical to Spark.
from snowflake.snowpark import functions as F
summary_df = df.group_by("REGION") \
.agg(
F.sum("AMOUNT").alias("TOTAL_AMOUNT"),
F.count("ORDER_ID").alias("ORDER_COUNT"),
F.avg("AMOUNT").alias("AVG_AMOUNT")
)
summary_df.show()
When executed, these DataFrame operations are translated into SQL and run within Snowflake.
You can inspect the generated SQL using:
print(summary_df.queries)
Writing Results Back to a Table
To save results into a Snowflake table:
summary_df.write.mode("overwrite").save_as_table(
"MY_DB.MY_SCHEMA.SALES_SUMMARY"
)
Since save_as_table() does not return a result, it's often useful to reload the table to verify the output.
session.table("MY_DB.MY_SCHEMA.SALES_SUMMARY").show()
The overwrite mode replaces the existing table.
Use append if you want to add rows instead.
How Does Lazy Evaluation Work?
Anyone familiar with Spark has likely encountered lazy evaluation.
Sometimes it can even lead to unexpected behavior during debugging.
Snowpark adopts the same fundamental concept.
Understanding Lazy Evaluation
DataFrame transformations such as:
selectfiltergroup_by
are not executed immediately.
These operations merely build an execution plan.
Actual execution occurs only when an action is triggered.
Common action operations include:
show()collect()count()to_pandas()write.save_as_table()
Verifying Lazy Evaluation
A convenient way to inspect behavior is through df.queries.
from snowflake.snowpark import functions as F
df_filtered = session.table("MY_DB.MY_SCHEMA.LARGE_TABLE") \
.filter(F.col("STATUS") == "ACTIVE") \
.select("ID", "NAME", "STATUS", "CREATED_AT")
print(df_filtered.queries)
result = df_filtered.collect()
print(f"{len(result)} rows retrieved")
The generated SQL can be inspected before execution, but no query has actually been sent to Snowflake yet.
To verify execution timing precisely, we can use Query History in Snowsight.
Open:
Monitoring → Query History
Then perform the following steps:
- Define the DataFrame.
- Check Query History.
- Confirm that no SELECT statement has been executed.
- Execute
collect(). - Refresh Query History.
- Observe that the SELECT statement now appears.
Define the DataFrame
No corresponding query appears yet.
Execute collect()
After execution, the query becomes visible.
This confirms that DataFrame definitions alone do not trigger execution.
The SQL is executed only when collect() is called.
Differences from Spark's Lazy Evaluation
In Spark, lazy evaluation constructs a DAG that is optimized and executed across a cluster.
In Snowpark, lazy evaluation ultimately produces SQL, which is then optimized and executed by Snowflake's query optimizer.
The concept is similar, but the execution engine is fundamentally different.
One particularly useful feature is that generated SQL can be inspected via df.queries, making it easier to validate execution plans.
Can We Use Caching?
If you're coming from Spark, your first instinct may be to use cache().
Snowpark provides a similar capability through cache_result().
Differences from Spark cache()
| Aspect | Spark cache()
|
Snowpark cache_result()
|
|---|---|---|
| Storage | Memory (and disk) | Temporary table |
| Lifetime | Until application ends | Until session ends |
| Cost | No additional write | INSERT into temporary table |
Internally, cache_result() materializes results into a temporary table.
Subsequent operations reuse that table rather than re-running expensive transformations.
Example
df_heavy = session.table("MY_DB.MY_SCHEMA.LARGE_TABLE") \
.filter(F.col("STATUS") == "ACTIVE") \
.join(
session.table("MY_DB.MY_SCHEMA.MASTER"),
"ID"
)
cached_df = df_heavy.cache_result()
result1 = cached_df.filter(
F.col("REGION") == "Asia"
).collect()
result2 = cached_df.group_by("REGION") \
.agg(F.count("*")) \
.collect()
cached_df.drop_table()
Using a with block is often more convenient because the temporary table is automatically dropped when the block exits.
with df_heavy.cache_result() as cached_df:
result1 = cached_df.filter(
F.col("REGION") == "Asia"
).collect()
result2 = cached_df.group_by("REGION") \
.agg(F.count("*")) \
.collect()
Since
cache_result()performs an INSERT into a temporary table, it can actually make things slower when the DataFrame is only used once.It's most effective when the same expensive transformation is reused multiple times.
You can also observe this behavior in Snowsight.
Temporary table creation:
Subsequent SELECT from the temporary table:
Another query reusing the same temporary table:
Use Cases
Let's consider some practical scenarios where Snowpark can be useful.
ETL Pipelines
Traditionally, pipelines often look like:
S3 → Glue → Redshift
With Snowpark, many transformations can be performed entirely within Snowflake.
This reduces data movement and simplifies overall architecture.
Example:
raw_df = session.table("MY_DB.MY_SCHEMA.RAW_EVENTS")
cleaned_df = raw_df \
.filter(raw_df["EVENT_TYPE"] != "test") \
.with_column(
"EVENT_DATE",
F.to_date(raw_df["EVENT_TIMESTAMP"])
) \
.drop_duplicates([
"USER_ID",
"EVENT_DATE",
"EVENT_TYPE"
])
aggregated_df = cleaned_df \
.group_by("EVENT_DATE", "EVENT_TYPE") \
.agg(
F.count("USER_ID")
.alias("USER_COUNT")
)
aggregated_df.write.mode("overwrite") \
.save_as_table(
"MY_DB.MY_SCHEMA.DAILY_EVENT_SUMMARY"
)
session.table(
"MY_DB.MY_SCHEMA.DAILY_EVENT_SUMMARY"
).show()
Custom Transformations Using UDFs
Snowpark UDFs allow complex logic that would be cumbersome in SQL to be implemented in Python.
You can register UDFs using either the @udf decorator or session.udf.register().
from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import StringType
@udf(return_type=StringType(),
input_types=[StringType()])
def normalize_region(region: str) -> str:
region_map = {
"US": "North America",
"JP": "Asia",
"DE": "Europe",
}
return region_map.get(region, "Other")
df = session.table("MY_DB.MY_SCHEMA.RAW_EVENTS")
df_with_region = df.with_column(
"NORMALIZED_REGION",
normalize_region(df["REGION_CODE"])
)
If type hints are available, explicit type definitions can be omitted:
from snowflake.snowpark.functions import udf
@udf
def normalize_region(region: str) -> str:
region_map = {
"US": "North America",
"JP": "Asia",
"DE": "Europe",
}
return region_map.get(region, "Other")
That said, simple transformations are often faster when implemented using built-in SQL functions.
As always, benchmark before deciding.
Data Quality Validation
Snowpark can also be used for data quality checks before processing continues.
total_count = df.count()
null_count = df.filter(
F.col("AMOUNT").is_null()
).count()
null_rate = null_count / total_count
if null_rate > 0.10:
raise ValueError(
f"NULL rate exceeds the threshold: {null_rate:.1%}"
)
print(
f"Data quality check passed "
f"(NULL rate: {null_rate:.1%})"
)
Conclusion
In this article, we explored Snowpark's fundamentals, compared it with Spark, and examined its lazy evaluation behavior.
For engineers already familiar with Spark, Snowpark should feel quite approachable.
However, it's important to remember that execution occurs on Snowflake warehouses rather than a Spark cluster.
Reviewing generated SQL and understanding how Snowflake executes queries can help avoid unexpected full-table scans and other performance issues.
If your data is already centralized in Snowflake, keeping processing inside Snowflake rather than moving data to Lambda or Glue Python Shell can be a significant advantage.
Reducing infrastructure management overhead and consolidating ETL processing within Snowflake can also improve maintainability.
One final note: throughout this experiment, I frequently relied on Cortex Code whenever I encountered errors.
The workflow of iteratively fixing notebook errors through Cortex Code was surprisingly convenient.
That said, just like any AI-assisted coding workflow, it's still important to carefully validate the generated code rather than accepting it blindly.
I hope this article helps anyone considering Snowpark for data processing within Snowflake.














Top comments (0)