Nyson Markus
PySpark to Pandas/scikit-learn: A Practical Migration Guide for Data Engineers Learning ML

If you've spent years writing PySpark pipelines, the first time you open a Jupyter notebook full of pd.DataFrame and sklearn.fit() calls, it can feel like you've switched languages entirely.

You haven't. The concepts are the same: transformations, aggregations, pipelines, model evaluation. But the execution model, API design, and idioms are different enough to cause real friction when you're trying to learn ML fast.

This guide is not a beginner tutorial. It's a translation layer, a direct mapping from what you already know in PySpark to its equivalent in Pandas and scikit-learn, with side-by-side code, gotchas, and practical advice for anyone making the shift from data engineer to machine learning engineer.


The Single Biggest Mental Model Shift

Before any code, understand this: PySpark uses lazy evaluation. Pandas and scikit-learn do not.

In PySpark, transformations like .filter(), .select(), and .groupBy() build a logical execution plan. Nothing runs until you call an action like .collect() or .show(). This is what enables distributed optimization across a cluster.

In Pandas, every operation executes immediately on the data in memory. When you write df['col'].mean(), it computes right then.

# PySpark: lazy, nothing computed yet
df_filtered = spark_df.filter(spark_df['age'] > 30)

# Pandas: eager, executes immediately
df_filtered = pandas_df[pandas_df['age'] > 30]

This difference has downstream consequences. Debugging is easier in Pandas because errors surface instantly. But Pandas cannot handle datasets that exceed your machine's RAM. PySpark adds cluster overhead that makes it slower than Pandas for datasets under roughly 5 to 10 GB.


Part 1: Core DataFrame Operations Side-by-Side

Most of your day-to-day DE work maps cleanly. Here's the translation table for the operations you use most.

Filtering Rows

# PySpark
df.filter(df['salary'] > 100000)
df.filter("salary > 100000")           # SQL-style string also works

# Pandas
df[df['salary'] > 100000]
df.query("salary > 100000")            # equivalent SQL-style

Selecting Columns

# PySpark
df.select('name', 'salary', 'department')

# Pandas
df[['name', 'salary', 'department']]

GroupBy and Aggregation

# PySpark
df.groupBy('department').agg(
    F.mean('salary').alias('avg_salary'),
    F.count('*').alias('headcount')
)

# Pandas
df.groupby('department').agg(
    avg_salary=('salary', 'mean'),
    headcount=('salary', 'count')
).reset_index()

Note the small but important differences. PySpark uses groupBy (camelCase) while Pandas uses groupby (lowercase). PySpark's .agg() requires explicit column references via F.mean(), while Pandas uses tuple notation (column, function).

Joins

# PySpark
df1.join(df2, on='user_id', how='left')

# Pandas
pd.merge(df1, df2, on='user_id', how='left')

Adding and Transforming Columns

# PySpark
df.withColumn('salary_k', df['salary'] / 1000)

# Pandas
df['salary_k'] = df['salary'] / 1000

# or the non-mutating version
df = df.assign(salary_k=df['salary'] / 1000)

The .withColumn() pattern in PySpark has no single Pandas equivalent. You can use direct assignment or .assign(). Prefer .assign() in chained operations because it returns a new DataFrame without modifying the original.
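To make that concrete, here's what a chained, non-mutating pandas pipeline looks like on toy data (the names and salaries are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cem'],
    'salary': [120000, 95000, 143000],
})

# Chained pipeline: filter, derive a column, sort -- no step mutates df
result = (
    df.query('salary > 100000')
      .assign(salary_k=lambda d: d['salary'] / 1000)
      .sort_values('salary_k', ascending=False)
)
print(result[['name', 'salary_k']])
```

The lambda inside .assign() receives the intermediate DataFrame, so the chain reads top to bottom much like a sequence of PySpark transformations.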

Handling Nulls

# PySpark
df.dropna(subset=['age', 'salary'])
df.fillna({'age': 0, 'salary': df.agg(F.mean('salary')).collect()[0][0]})

# Pandas
df.dropna(subset=['age', 'salary'])
df.fillna({'age': 0, 'salary': df['salary'].mean()})

This is one of the cleanest translations. The APIs are nearly identical, except Pandas lets you call .mean() directly on a Series without a .collect() call.


Part 2: What .toPandas() Actually Costs

As a data engineer, you've probably used .toPandas() to pull Spark results into a notebook. It's tempting to treat this as a bridge that makes the two worlds equivalent. It is not, or at least not for free.

# This works, but has important implications
pandas_df = spark_df.toPandas()

When you call .toPandas():

  • All data is collected from every executor back to the driver node
  • It must fit entirely in driver memory
  • Arrow-based transfer (enabled by spark.sql.execution.arrow.pyspark.enabled = true) makes this significantly faster but does not change the memory constraint

For ML work on training datasets, your feature table often fits comfortably in memory. Feature stores, for example, typically serve pre-aggregated feature vectors that are far smaller than raw event data. But for very large training sets, this is where tools like Ray, Dask, or Spark's native MLlib become relevant.
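One practical habit before calling .toPandas() is to estimate the driver-side footprint first. A rough pandas-side sketch, where the full row count is a hypothetical stand-in for spark_df.count():

```python
import numpy as np
import pandas as pd

# Measure bytes per row on a small representative sample.
# deep=True counts string payloads, not just object pointers.
sample = pd.DataFrame({
    'user_id': np.arange(1_000),
    'city': ['NYC'] * 1_000,
})
bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)

# Extrapolate to the full table before pulling it to the driver
full_row_count = 50_000_000  # hypothetical: spark_df.count()
estimated_gb = bytes_per_row * full_row_count / 1e9
print(f"estimated driver footprint: {estimated_gb:.1f} GB")
```

If the estimate approaches your driver's memory, sample, aggregate further, or stay in Spark.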


Part 3: Feature Engineering: MLlib vs. scikit-learn Transformers

This is where the API divergence is most significant, and where most data engineers hit the steepest learning curve.

The Fundamental Difference: Vectors vs. Native Columns

PySpark MLlib requires all numerical features to be assembled into a single dense or sparse vector column before training. scikit-learn works on native NumPy arrays and Pandas DataFrames directly.

# PySpark MLlib: you MUST assemble features into a vector first
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=['age', 'salary', 'tenure'],
    outputCol='features'
)
df_assembled = assembler.transform(df)
# df_assembled now has a 'features' column of DenseVector([age, salary, tenure])

# scikit-learn: no assembly needed, just pass the DataFrame directly
from sklearn.linear_model import LogisticRegression

X = df[['age', 'salary', 'tenure']]
y = df['label']
model = LogisticRegression()
model.fit(X, y)

The VectorAssembler requirement is a common source of confusion when coming from PySpark. In scikit-learn, you skip it entirely and work with 2D arrays directly.

Encoding Categorical Variables

# PySpark MLlib: two-step process
from pyspark.ml.feature import StringIndexer, OneHotEncoder

indexer = StringIndexer(inputCol='city', outputCol='city_index')
encoder = OneHotEncoder(inputCols=['city_index'], outputCols=['city_ohe'])

# scikit-learn: single step
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
city_encoded = encoder.fit_transform(df[['city']])

PySpark requires a two-step process (StringIndexer then OneHotEncoder) because it operates in a distributed setting where integer indices must be assigned consistently across partitions. scikit-learn handles this in a single step.

Scaling Features

# PySpark MLlib
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol='features', outputCol='scaled_features')
scaler_model = scaler.fit(df_assembled)
df_scaled = scaler_model.transform(df_assembled)

# scikit-learn
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Part 4: Building Pipelines: The Most Transferable Concept

This is the good news for data engineers: both PySpark and scikit-learn use the Pipeline abstraction, and the mental model translates almost directly.

A pipeline chains transformers and a final estimator into a single reusable object that can be fit, serialized, and applied to new data.

# PySpark MLlib Pipeline
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier

pipeline = Pipeline(stages=[
    StringIndexer(inputCol='city', outputCol='city_idx'),
    OneHotEncoder(inputCols=['city_idx'], outputCols=['city_ohe']),
    VectorAssembler(inputCols=['age', 'salary', 'city_ohe'], outputCol='features'),
    StandardScaler(inputCol='features', outputCol='scaled_features'),
    RandomForestClassifier(featuresCol='scaled_features', labelCol='label')
])

model = pipeline.fit(train_df)
predictions = model.transform(test_df)
# scikit-learn Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'salary']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city'])
])

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

model = pipeline.fit(X_train, y_train)
predictions = model.predict(X_test)

The key structural difference: scikit-learn's ColumnTransformer lets you apply different transformations to different columns in parallel, while PySpark's Pipeline applies stages sequentially. You must explicitly order every step in PySpark. The scikit-learn approach is more concise for heterogeneous feature sets.


Part 5: Model Evaluation: A Familiar Concept, Cleaner API

PySpark's evaluation API requires instantiating a separate evaluator object. scikit-learn bundles everything into sklearn.metrics.

# PySpark
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

evaluator = BinaryClassificationEvaluator(labelCol='label', metricName='areaUnderROC')
auc = evaluator.evaluate(predictions)

acc_evaluator = MulticlassClassificationEvaluator(metricName='accuracy')
accuracy = acc_evaluator.evaluate(predictions)
# scikit-learn
from sklearn.metrics import roc_auc_score, accuracy_score, classification_report

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(classification_report(y_test, predictions))  # precision, recall, F1 in one call

scikit-learn's classification_report() is significantly more informative than PySpark's single-metric evaluators. It gives precision, recall, F1, and support for every class in a single call.


Part 6: When PySpark Still Wins

Making the shift to Pandas/scikit-learn for ML does not mean PySpark becomes irrelevant. As an ML engineer, you'll continue using PySpark for:

  • Feature computation at scale: computing rolling aggregations, user behavior features, or entity embeddings over billions of rows before they land in your feature store
  • Training data generation pipelines: joining raw event tables, filtering time windows, and assembling labeled datasets
  • Batch inference at scale: applying trained sklearn or PyTorch models inside a Spark UDF to score millions of records
  • Large-scale data validation: running data quality checks (Great Expectations, Deequ) on incoming training data

The common production pattern is: PySpark for data preparation, then Pandas/scikit-learn/PyTorch for model training, then PySpark again for batch scoring. Your DE background gives you a direct advantage in both the first and last mile of this pipeline.
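The middle of that pattern can be sketched without a cluster. A pandas-in, pandas-out scoring function like the hypothetical score_batch below is the shape you would later wrap in a Spark pandas UDF for batch inference (the model and feature names here are invented for illustration):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Stand-in for a model you'd normally load with joblib.load()
train = pd.DataFrame({'age': [25, 40, 31, 52], 'salary': [90, 150, 120, 200]})
labels = [0, 1, 0, 1]
model = LogisticRegression().fit(train, labels)

def score_batch(pdf: pd.DataFrame) -> pd.Series:
    """pandas in, pandas out: the contract a Spark pandas UDF wraps."""
    return pd.Series(model.predict_proba(pdf[['age', 'salary']])[:, 1])

# Locally this is a plain function call; on Spark it runs per partition
scores = score_batch(pd.DataFrame({'age': [33, 48], 'salary': [110, 180]}))
print(scores)
```

Keeping the scoring logic in a plain function also makes it unit-testable outside of Spark.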


Part 7: Practical Learning Path

If you're working through this transition, here's a sequenced approach based on the operations data engineers use most.

Week 1: Core Pandas fluency

  • Re-implement the operations from Part 1 on a real dataset: filtering, selection, groupby aggregations, joins, and null handling
  • Get comfortable with .query(), .assign(), and method chaining

Week 2: NumPy foundations

  • Understand array broadcasting, which is the engine under Pandas
  • Practice vectorized operations instead of loops
  • This directly maps to understanding tensor operations in PyTorch later

Week 3: scikit-learn core API

  • The estimator interface: .fit(), .transform(), .predict()
  • Build your first Pipeline with at least one ColumnTransformer
  • Implement cross-validation with cross_val_score()
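A minimal cross_val_score example, using a synthetic dataset so it runs anywhere:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# 5-fold CV: fits 5 models, each evaluated on its held-out fold
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print(scores.mean(), scores.std())
```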

Week 4: End-to-end project

  • Pick a Kaggle tabular dataset, preferably structured like a transaction dataset that resembles your DE work
  • Build a full pipeline: EDA, feature engineering, model training, evaluation, and serialization with joblib
  • This becomes your first ML portfolio project, exactly the kind of hands-on work that signals readiness for an ML engineering role
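Serialization with joblib is a short round trip. A minimal sketch using a built-in dataset (the model.joblib filename is arbitrary):

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Serialize the fitted model, then reload it as a fresh object
joblib.dump(model, 'model.joblib')
restored = joblib.load('model.joblib')

# The reloaded model reproduces the original's predictions
assert (restored.predict(X) == model.predict(X)).all()
```

Serializing the whole Pipeline (preprocessing included) is the usual production move, so raw inputs can be scored directly.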

Quick Reference: PySpark to Pandas/scikit-learn Cheat Sheet

| Operation | PySpark | Pandas / scikit-learn |
| --- | --- | --- |
| Filter rows | df.filter(condition) | df[condition] or df.query() |
| Select columns | df.select('a', 'b') | df[['a', 'b']] |
| Add column | df.withColumn('c', expr) | df.assign(c=expr) |
| GroupBy agg | df.groupBy('x').agg(F.mean('y')) | df.groupby('x').agg(avg_y=('y', 'mean')) |
| Join | df1.join(df2, on='id', how='left') | pd.merge(df1, df2, on='id', how='left') |
| Drop nulls | df.dropna(subset=['col']) | df.dropna(subset=['col']) |
| Fill nulls | df.fillna({'col': val}) | df.fillna({'col': val}) |
| Encode categoricals | StringIndexer + OneHotEncoder | OneHotEncoder (single step) |
| Scale features | StandardScaler(inputCol='features') | StandardScaler() on array directly |
| Build pipeline | Pipeline(stages=[...]) | Pipeline([('step', transformer)]) |
| Evaluate model | BinaryClassificationEvaluator | roc_auc_score(), classification_report() |
| Lazy evaluation | Yes, actions trigger execution | No, eager by default |

Key Takeaways

The shift from PySpark to Pandas/scikit-learn is a surface-level API change over a familiar conceptual foundation. Every concept you already know, including transformations, aggregations, joins, pipelines, and train/test splits, exists in both worlds. What changes is:

  1. Execution model: eager vs. lazy; in-memory vs. distributed
  2. Feature representation: native columns vs. assembled vectors
  3. Pipeline structure: sequential stages vs. parallel ColumnTransformer
  4. Evaluation API: single-metric evaluators vs. a comprehensive metrics module

Your data engineering instincts around schema validation, null handling, data quality, and pipeline reproducibility are directly applicable to ML work. The transition is about building on top of what you have, not starting over.
