Nyson Markus
PySpark to Pandas/scikit-learn: A Practical Migration Guide for Data Engineers Learning ML

If you've spent years writing PySpark pipelines, the first time you open a Jupyter notebook full of pd.DataFrame and sklearn.fit() calls, it can feel like you've switched languages entirely.

You haven't. The concepts are the same: transformations, aggregations, pipelines, model evaluation. But the execution model, API design, and idioms are different enough to cause real friction when you're trying to learn ML fast.

This guide is not a beginner tutorial. It's a translation layer, a direct mapping from what you already know in PySpark to its equivalent in Pandas and scikit-learn, with side-by-side code, gotchas, and practical advice for anyone making the shift from data engineer to machine learning engineer.


The Single Biggest Mental Model Shift

Before any code, understand this: PySpark uses lazy evaluation. Pandas and scikit-learn do not.

In PySpark, transformations like .filter(), .select(), and .groupBy() build a logical execution plan. Nothing runs until you call an action like .collect() or .show(). This is what enables distributed optimization across a cluster.

In Pandas, every operation executes immediately on the data in memory. When you write df['col'].mean(), it computes right then.

# PySpark: lazy, nothing computed yet
df_filtered = spark_df.filter(spark_df['age'] > 30)

# Pandas: eager, executes immediately
df_filtered = pandas_df[pandas_df['age'] > 30]

This difference has downstream consequences. Debugging is easier in Pandas because errors surface instantly. But Pandas cannot handle datasets that exceed your machine's RAM. PySpark adds cluster overhead that makes it slower than Pandas for datasets under roughly 5 to 10 GB.


Part 1: Core DataFrame Operations Side-by-Side

Most of your day-to-day DE work maps cleanly. Here's the translation table for the operations you use most.

Filtering Rows

# PySpark
df.filter(df['salary'] > 100000)
df.filter("salary > 100000")           # SQL-style string also works

# Pandas
df[df['salary'] > 100000]
df.query("salary > 100000")            # equivalent SQL-style

Selecting Columns

# PySpark
df.select('name', 'salary', 'department')

# Pandas
df[['name', 'salary', 'department']]

GroupBy and Aggregation

# PySpark
df.groupBy('department').agg(
    F.mean('salary').alias('avg_salary'),
    F.count('*').alias('headcount')
)

# Pandas
df.groupby('department').agg(
    avg_salary=('salary', 'mean'),
    headcount=('salary', 'count')
).reset_index()

Note the small but important differences. PySpark uses groupBy (camelCase) while Pandas uses groupby (lowercase). PySpark's .agg() requires explicit column references via F.mean(), while Pandas uses tuple notation (column, function).

Joins

# PySpark
df1.join(df2, on='user_id', how='left')

# Pandas
pd.merge(df1, df2, on='user_id', how='left')

Adding and Transforming Columns

# PySpark
df.withColumn('salary_k', df['salary'] / 1000)

# Pandas
df['salary_k'] = df['salary'] / 1000

# or the non-mutating version
df = df.assign(salary_k=df['salary'] / 1000)

The .withColumn() pattern in PySpark has no single Pandas equivalent. You can use direct assignment or .assign(). Prefer .assign() in chained operations because it returns a new DataFrame without modifying the original.
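To make that concrete, here's what a chained, non-mutating pandas pipeline looks like on toy data (the names and salaries are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cem'],
    'salary': [120000, 95000, 143000],
})

# Chained pipeline: filter, derive a column, sort -- no step mutates df
result = (
    df.query('salary > 100000')
      .assign(salary_k=lambda d: d['salary'] / 1000)
      .sort_values('salary_k', ascending=False)
)
print(result[['name', 'salary_k']])
```

The lambda inside .assign() receives the intermediate DataFrame, so the chain reads top to bottom much like a sequence of PySpark transformations.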

Handling Nulls

# PySpark
df.dropna(subset=['age', 'salary'])
df.fillna({'age': 0, 'salary': df.agg(F.mean('salary')).collect()[0][0]})

# Pandas
df.dropna(subset=['age', 'salary'])
df.fillna({'age': 0, 'salary': df['salary'].mean()})

This is one of the cleanest translations. The APIs are nearly identical, except Pandas lets you call .mean() directly on a Series without a .collect() call.


Part 2: What .toPandas() Actually Costs

As a data engineer, you've probably used .toPandas() to pull Spark results into a notebook. It's tempting to treat this as a bridge that makes the two worlds equivalent. It is not, or at least not for free.

# This works, but has important implications
pandas_df = spark_df.toPandas()

When you call .toPandas():

  • All data is collected from every executor back to the driver node
  • It must fit entirely in driver memory
  • Arrow-based transfer (enabled by spark.sql.execution.arrow.pyspark.enabled = true) makes this significantly faster but does not change the memory constraint

For ML work on training datasets, your feature table often fits comfortably in memory. Feature stores, for example, typically serve pre-aggregated feature vectors that are far smaller than raw event data. But for very large training sets, this is where tools like Ray, Dask, or Spark's native MLlib become relevant.
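One practical habit before calling .toPandas() is to estimate the driver-side footprint first. A rough pandas-side sketch, where the full row count is a hypothetical stand-in for spark_df.count():

```python
import numpy as np
import pandas as pd

# Measure bytes per row on a small representative sample.
# deep=True counts string payloads, not just object pointers.
sample = pd.DataFrame({
    'user_id': np.arange(1_000),
    'city': ['NYC'] * 1_000,
})
bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)

# Extrapolate to the full table before pulling it to the driver
full_row_count = 50_000_000  # hypothetical: spark_df.count()
estimated_gb = bytes_per_row * full_row_count / 1e9
print(f"estimated driver footprint: {estimated_gb:.1f} GB")
```

If the estimate approaches your driver's memory, sample, aggregate further, or stay in Spark.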


Part 3: Feature Engineering: MLlib vs. scikit-learn Transformers

This is where the API divergence is most significant, and where most data engineers hit the steepest learning curve.

The Fundamental Difference: Vectors vs. Native Columns

PySpark MLlib requires all numerical features to be assembled into a single dense or sparse vector column before training. scikit-learn works on native NumPy arrays and Pandas DataFrames directly.

# PySpark MLlib: you MUST assemble features into a vector first
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=['age', 'salary', 'tenure'],
    outputCol='features'
)
df_assembled = assembler.transform(df)
# df_assembled now has a 'features' column of DenseVector([age, salary, tenure])

# scikit-learn: no assembly needed, just pass the DataFrame directly
from sklearn.linear_model import LogisticRegression

X = df[['age', 'salary', 'tenure']]
y = df['label']
model = LogisticRegression()
model.fit(X, y)

The VectorAssembler requirement is a common source of confusion when coming from PySpark. In scikit-learn, you skip it entirely and work with 2D arrays directly.

Encoding Categorical Variables

# PySpark MLlib: two-step process
from pyspark.ml.feature import StringIndexer, OneHotEncoder

indexer = StringIndexer(inputCol='city', outputCol='city_index')
encoder = OneHotEncoder(inputCols=['city_index'], outputCols=['city_ohe'])

# scikit-learn: single step
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
city_encoded = encoder.fit_transform(df[['city']])

PySpark requires a two-step process (StringIndexer then OneHotEncoder) because it operates in a distributed setting where integer indices must be assigned consistently across partitions. scikit-learn handles this in a single step.

Scaling Features

# PySpark MLlib
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol='features', outputCol='scaled_features')
scaler_model = scaler.fit(df_assembled)
df_scaled = scaler_model.transform(df_assembled)

# scikit-learn
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Part 4: Building Pipelines: The Most Transferable Concept

This is the good news for data engineers: both PySpark and scikit-learn use the Pipeline abstraction, and the mental model translates almost directly.

A pipeline chains transformers and a final estimator into a single reusable object that can be fit, serialized, and applied to new data.

# PySpark MLlib Pipeline
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier

pipeline = Pipeline(stages=[
    StringIndexer(inputCol='city', outputCol='city_idx'),
    OneHotEncoder(inputCols=['city_idx'], outputCols=['city_ohe']),
    VectorAssembler(inputCols=['age', 'salary', 'city_ohe'], outputCol='features'),
    StandardScaler(inputCol='features', outputCol='scaled_features'),
    RandomForestClassifier(featuresCol='scaled_features', labelCol='label')
])

model = pipeline.fit(train_df)
predictions = model.transform(test_df)
# scikit-learn Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'salary']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city'])
])

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

model = pipeline.fit(X_train, y_train)
predictions = model.predict(X_test)

The key structural difference: scikit-learn's ColumnTransformer lets you apply different transformations to different columns in parallel, while PySpark's Pipeline applies stages sequentially. You must explicitly order every step in PySpark. The scikit-learn approach is more concise for heterogeneous feature sets.


Part 5: Model Evaluation: A Familiar Concept, Cleaner API

PySpark's evaluation API requires instantiating a separate evaluator object. scikit-learn bundles everything into sklearn.metrics.

# PySpark
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

evaluator = BinaryClassificationEvaluator(labelCol='label', metricName='areaUnderROC')
auc = evaluator.evaluate(predictions)

acc_evaluator = MulticlassClassificationEvaluator(metricName='accuracy')
accuracy = acc_evaluator.evaluate(predictions)
# scikit-learn
from sklearn.metrics import roc_auc_score, accuracy_score, classification_report

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(classification_report(y_test, predictions))  # precision, recall, F1 in one call

scikit-learn's classification_report() is significantly more informative than PySpark's single-metric evaluators. It gives precision, recall, F1, and support for every class in a single call.


Part 6: When PySpark Still Wins

Making the shift to Pandas/scikit-learn for ML does not mean PySpark becomes irrelevant. As an ML engineer, you'll continue using PySpark for:

  • Feature computation at scale: computing rolling aggregations, user behavior features, or entity embeddings over billions of rows before they land in your feature store
  • Training data generation pipelines: joining raw event tables, filtering time windows, and assembling labeled datasets
  • Batch inference at scale: applying trained sklearn or PyTorch models inside a Spark UDF to score millions of records
  • Large-scale data validation: running data quality checks (Great Expectations, Deequ) on incoming training data

The common production pattern is: PySpark for data preparation, then Pandas/scikit-learn/PyTorch for model training, then PySpark again for batch scoring. Your DE background gives you a direct advantage in both the first and last mile of this pipeline.
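The middle of that pattern can be sketched without a cluster. A pandas-in, pandas-out scoring function like the hypothetical score_batch below is the shape you would later wrap in a Spark pandas UDF for batch inference (the model and feature names here are invented for illustration):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Stand-in for a model you'd normally load with joblib.load()
train = pd.DataFrame({'age': [25, 40, 31, 52], 'salary': [90, 150, 120, 200]})
labels = [0, 1, 0, 1]
model = LogisticRegression().fit(train, labels)

def score_batch(pdf: pd.DataFrame) -> pd.Series:
    """pandas in, pandas out: the contract a Spark pandas UDF wraps."""
    return pd.Series(model.predict_proba(pdf[['age', 'salary']])[:, 1])

# Locally this is a plain function call; on Spark it runs per partition
scores = score_batch(pd.DataFrame({'age': [33, 48], 'salary': [110, 180]}))
print(scores)
```

Keeping the scoring logic in a plain function also makes it unit-testable outside of Spark.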


Part 7: Practical Learning Path

If you're working through this transition, here's a sequenced approach based on the operations data engineers use most.

Week 1: Core Pandas fluency

  • Re-implement the operations from Part 1 on a real dataset: filtering, selection, groupby aggregations, joins, and null handling
  • Get comfortable with .query(), .assign(), and method chaining

Week 2: NumPy foundations

  • Understand array broadcasting, which is the engine under Pandas
  • Practice vectorized operations instead of loops
  • This directly maps to understanding tensor operations in PyTorch later

Week 3: scikit-learn core API

  • The estimator interface: .fit(), .transform(), .predict()
  • Build your first Pipeline with at least one ColumnTransformer
  • Implement cross-validation with cross_val_score()
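A minimal cross_val_score example, using a synthetic dataset so it runs anywhere:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# 5-fold CV: fits 5 models, each evaluated on its held-out fold
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print(scores.mean(), scores.std())
```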

Week 4: End-to-end project

  • Pick a Kaggle tabular dataset, preferably structured like a transaction dataset that resembles your DE work
  • Build a full pipeline: EDA, feature engineering, model training, evaluation, and serialization with joblib
  • This becomes your first ML portfolio project, exactly the kind of hands-on work that signals readiness for an ML engineering role
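Serialization with joblib is a short round trip. A minimal sketch using a built-in dataset (the model.joblib filename is arbitrary):

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Serialize the fitted model, then reload it as a fresh object
joblib.dump(model, 'model.joblib')
restored = joblib.load('model.joblib')

# The reloaded model reproduces the original's predictions
assert (restored.predict(X) == model.predict(X)).all()
```

Serializing the whole Pipeline (preprocessing included) is the usual production move, so raw inputs can be scored directly.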

Quick Reference: PySpark to Pandas/scikit-learn Cheat Sheet

| Operation | PySpark | Pandas / scikit-learn |
| --- | --- | --- |
| Filter rows | df.filter(condition) | df[condition] or df.query() |
| Select columns | df.select('a', 'b') | df[['a', 'b']] |
| Add column | df.withColumn('c', expr) | df.assign(c=expr) |
| GroupBy agg | df.groupBy('x').agg(F.mean('y')) | df.groupby('x').agg(avg_y=('y', 'mean')) |
| Join | df1.join(df2, on='id', how='left') | pd.merge(df1, df2, on='id', how='left') |
| Drop nulls | df.dropna(subset=['col']) | df.dropna(subset=['col']) |
| Fill nulls | df.fillna({'col': val}) | df.fillna({'col': val}) |
| Encode categoricals | StringIndexer + OneHotEncoder | OneHotEncoder (single step) |
| Scale features | StandardScaler(inputCol='features') | StandardScaler() on array directly |
| Build pipeline | Pipeline(stages=[...]) | Pipeline([('step', transformer)]) |
| Evaluate model | BinaryClassificationEvaluator | roc_auc_score(), classification_report() |
| Lazy evaluation | Yes, actions trigger execution | No, eager by default |

Key Takeaways

The shift from PySpark to Pandas/scikit-learn is a surface-level API change over a familiar conceptual foundation. Every concept you already know, including transformations, aggregations, joins, pipelines, and train/test splits, exists in both worlds. What changes is:

  1. Execution model: eager vs. lazy; in-memory vs. distributed
  2. Feature representation: native columns vs. assembled vectors
  3. Pipeline structure: sequential stages vs. parallel ColumnTransformer
  4. Evaluation API: single-metric evaluators vs. a comprehensive metrics module

Your data engineering instincts around schema validation, null handling, data quality, and pipeline reproducibility are directly applicable to ML work. The transition is about building on top of what you have, not starting over.
