If you've spent years writing PySpark pipelines, the first time you open a Jupyter notebook full of pd.DataFrame and sklearn.fit() calls, it can feel like you've switched languages entirely.
You haven't. The concepts are the same: transformations, aggregations, pipelines, model evaluation. But the execution model, API design, and idioms are different enough to cause real friction when you're trying to learn ML fast.
This guide is not a beginner tutorial. It's a translation layer, a direct mapping from what you already know in PySpark to its equivalent in Pandas and scikit-learn, with side-by-side code, gotchas, and practical advice for anyone making the shift from data engineer to machine learning engineer.
The Single Biggest Mental Model Shift
Before any code, understand this: PySpark uses lazy evaluation. Pandas and scikit-learn do not.
In PySpark, transformations like .filter(), .select(), and .groupBy() build a logical execution plan. Nothing runs until you call an action like .collect() or .show(). This is what enables distributed optimization across a cluster.
In Pandas, every operation executes immediately on the data in memory. When you write df['col'].mean(), it computes right then.
# PySpark: lazy, nothing computed yet
df_filtered = spark_df.filter(spark_df['age'] > 30)
# Pandas: eager, executes immediately
df_filtered = pandas_df[pandas_df['age'] > 30]
This difference has downstream consequences. Debugging is easier in Pandas because errors surface instantly. But Pandas cannot handle datasets that exceed your machine's RAM. PySpark adds cluster overhead that makes it slower than Pandas for datasets under roughly 5 to 10 GB.
Part 1: Core DataFrame Operations Side-by-Side
Most of your day-to-day DE work maps cleanly. Here's the translation table for the operations you use most.
Filtering Rows
# PySpark
df.filter(df['salary'] > 100000)
df.filter("salary > 100000") # SQL-style string also works
# Pandas
df[df['salary'] > 100000]
df.query("salary > 100000") # equivalent SQL-style
Selecting Columns
# PySpark
df.select('name', 'salary', 'department')
# Pandas
df[['name', 'salary', 'department']]
GroupBy and Aggregation
# PySpark
df.groupBy('department').agg(
    F.mean('salary').alias('avg_salary'),
    F.count('*').alias('headcount')
)
# Pandas
df.groupby('department').agg(
    avg_salary=('salary', 'mean'),
    headcount=('salary', 'count')
).reset_index()
Note the small but important differences. PySpark uses groupBy (camelCase) while Pandas uses groupby (lowercase). PySpark's .agg() requires explicit column references via F.mean(), while Pandas uses tuple notation (column, function).
Joins
# PySpark
df1.join(df2, on='user_id', how='left')
# Pandas
pd.merge(df1, df2, on='user_id', how='left')
Adding and Transforming Columns
# PySpark
df.withColumn('salary_k', df['salary'] / 1000)
# Pandas
df['salary_k'] = df['salary'] / 1000
# or the non-mutating version
df = df.assign(salary_k=df['salary'] / 1000)
The .withColumn() pattern in PySpark has no single Pandas equivalent. You can use direct assignment or .assign(). Prefer .assign() in chained operations because it returns a new DataFrame without modifying the original.
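To make the chaining point concrete, here is a minimal sketch with a toy DataFrame (the column names are illustrative, not from a real dataset):

```python
import pandas as pd

# Toy DataFrame standing in for a real feature table.
df = pd.DataFrame({'name': ['Ana', 'Bo'], 'salary': [120000, 95000]})

# Chained .assign() calls mirror chained .withColumn() calls in PySpark:
# each step returns a new DataFrame and leaves `df` untouched.
result = (
    df
    .assign(salary_k=lambda d: d['salary'] / 1000)
    .assign(is_senior=lambda d: d['salary_k'] > 100)
)
```

The lambda form lets a later `.assign()` reference a column created earlier in the same chain, which is the closest Pandas gets to stacking `.withColumn()` calls.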
Handling Nulls
# PySpark
df.dropna(subset=['age', 'salary'])
df.fillna({'age': 0, 'salary': df.agg(F.mean('salary')).collect()[0][0]})
# Pandas
df.dropna(subset=['age', 'salary'])
df.fillna({'age': 0, 'salary': df['salary'].mean()})
This is one of the cleanest translations. The APIs are nearly identical, except Pandas lets you call .mean() directly on a Series without a .collect() call.
Part 2: What .toPandas() Actually Costs
As a data engineer, you've probably used .toPandas() to pull Spark results into a notebook. It's tempting to treat this as a bridge that makes the two worlds equivalent. It is not, or at least not for free.
# This works, but has important implications
pandas_df = spark_df.toPandas()
When you call .toPandas():
- All data is collected from every executor back to the driver node
- It must fit entirely in driver memory
- Arrow-based transfer (enabled by spark.sql.execution.arrow.pyspark.enabled = true) makes this significantly faster but does not change the memory constraint
For ML work on training datasets, your feature table often fits comfortably in memory. Feature stores, for example, typically serve pre-aggregated feature vectors that are far smaller than raw event data. But for very large training sets, this is where tools like Ray, Dask, or Spark's native MLlib become relevant.
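One practical habit before calling .toPandas() is a back-of-the-envelope memory check: measure per-row memory on a small sample and extrapolate. The sketch below uses a synthetic sample; in practice you might pull one with something like spark_df.limit(1000).toPandas() (hypothetical usage), and the 50M row count is an assumed table size:

```python
import numpy as np
import pandas as pd

# Small synthetic sample standing in for spark_df.limit(1000).toPandas().
sample = pd.DataFrame({
    'user_id': np.arange(1000),
    'city': ['NYC'] * 1000,
    'salary': np.random.default_rng(0).normal(90_000, 15_000, 1000),
})

# Extrapolate per-row memory to the full table to decide whether the
# collected result can fit on the driver at all.
bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)
full_rows = 50_000_000  # hypothetical row count of the Spark table
estimated_gb = bytes_per_row * full_rows / 1e9
print(f"Estimated driver memory for .toPandas(): ~{estimated_gb:.1f} GB")
```

This is a rough lower bound; string-heavy columns and conversion overhead can push the real figure higher.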
Part 3: Feature Engineering: MLlib vs. scikit-learn Transformers
This is where the API divergence is most significant, and where most data engineers hit the steepest learning curve.
The Fundamental Difference: Vectors vs. Native Columns
PySpark MLlib requires all numerical features to be assembled into a single dense or sparse vector column before training. scikit-learn works on native NumPy arrays and Pandas DataFrames directly.
# PySpark MLlib: you MUST assemble features into a vector first
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(
    inputCols=['age', 'salary', 'tenure'],
    outputCol='features'
)
df_assembled = assembler.transform(df)
# df_assembled now has a 'features' column of DenseVector([age, salary, tenure])
# scikit-learn: no assembly needed, just pass the DataFrame directly
from sklearn.linear_model import LogisticRegression
X = df[['age', 'salary', 'tenure']]
y = df['label']
model = LogisticRegression()
model.fit(X, y)
The VectorAssembler requirement is a common source of confusion when coming from PySpark. In scikit-learn, you skip it entirely and work with 2D arrays directly.
Encoding Categorical Variables
# PySpark MLlib: two-step process
from pyspark.ml.feature import StringIndexer, OneHotEncoder
indexer = StringIndexer(inputCol='city', outputCol='city_index')
encoder = OneHotEncoder(inputCols=['city_index'], outputCols=['city_ohe'])
# scikit-learn: single step
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
city_encoded = encoder.fit_transform(df[['city']])
PySpark requires a two-step process (StringIndexer then OneHotEncoder) because it operates in a distributed setting where integer indices must be assigned consistently across partitions. scikit-learn handles this in a single step.
Scaling Features
# PySpark MLlib
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol='features', outputCol='scaled_features')
scaler_model = scaler.fit(df_assembled)
df_scaled = scaler_model.transform(df_assembled)
# scikit-learn
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Part 4: Building Pipelines: The Most Transferable Concept
This is the good news for data engineers: both PySpark and scikit-learn use the Pipeline abstraction, and the mental model translates almost directly.
A pipeline chains transformers and a final estimator into a single reusable object that can be fit, serialized, and applied to new data.
# PySpark MLlib Pipeline
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
pipeline = Pipeline(stages=[
    StringIndexer(inputCol='city', outputCol='city_idx'),
    OneHotEncoder(inputCols=['city_idx'], outputCols=['city_ohe']),
    VectorAssembler(inputCols=['age', 'salary', 'city_ohe'], outputCol='features'),
    StandardScaler(inputCol='features', outputCol='scaled_features'),
    RandomForestClassifier(featuresCol='scaled_features', labelCol='label')
])
model = pipeline.fit(train_df)
predictions = model.transform(test_df)
# scikit-learn Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'salary']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city'])
])
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])
model = pipeline.fit(X_train, y_train)
predictions = model.predict(X_test)
The key structural difference: scikit-learn's ColumnTransformer lets you apply different transformations to different columns in parallel, while PySpark's Pipeline applies stages sequentially. You must explicitly order every step in PySpark. The scikit-learn approach is more concise for heterogeneous feature sets.
Part 5: Model Evaluation: A Familiar Concept, Cleaner API
PySpark's evaluation API requires instantiating a separate evaluator object. scikit-learn bundles everything into sklearn.metrics.
# PySpark
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
evaluator = BinaryClassificationEvaluator(labelCol='label', metricName='areaUnderROC')
auc = evaluator.evaluate(predictions)
acc_evaluator = MulticlassClassificationEvaluator(metricName='accuracy')
accuracy = acc_evaluator.evaluate(predictions)
# scikit-learn
from sklearn.metrics import roc_auc_score, accuracy_score, classification_report
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
accuracy = accuracy_score(y_test, predictions)
print(classification_report(y_test, predictions)) # precision, recall, F1 in one call
scikit-learn's classification_report() is significantly more informative than PySpark's single-metric evaluators. It gives precision, recall, F1, and support for every class in a single call.
Part 6: When PySpark Still Wins
Making the shift to Pandas/scikit-learn for ML does not mean PySpark becomes irrelevant. As an ML engineer, you'll continue using PySpark for:
- Feature computation at scale: computing rolling aggregations, user behavior features, or entity embeddings over billions of rows before they land in your feature store
- Training data generation pipelines: joining raw event tables, filtering time windows, and assembling labeled datasets
- Batch inference at scale: applying trained sklearn or PyTorch models inside a Spark UDF to score millions of records
- Large-scale data validation: running data quality checks (Great Expectations, Deequ) on incoming training data
The common production pattern is: PySpark for data preparation, then Pandas/scikit-learn/PyTorch for model training, then PySpark again for batch scoring. Your DE background gives you a direct advantage in both the first and last mile of this pipeline.
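The batch-scoring leg of this pattern hinges on one idea: the scoring logic is a plain pandas-in, pandas-out function, which Spark can then parallelize. Below is a hedged sketch of just the pandas side; the training data is synthetic, and in a real job the function would be wrapped with pyspark.sql.functions.pandas_udf (and the model broadcast to executors), which is omitted here:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny local training set standing in for a model loaded from storage.
train = pd.DataFrame({'age': [25, 32, 47, 51], 'salary': [50, 70, 120, 130]})
labels = pd.Series([0, 0, 1, 1])
model = LogisticRegression().fit(train, labels)

# Pandas-in, pandas-out scoring function. In Spark, this exact function
# body is what a pandas_udf would run on each partition.
def score_batch(batch: pd.DataFrame) -> pd.Series:
    return pd.Series(model.predict_proba(batch[['age', 'salary']])[:, 1])

scores = score_batch(pd.DataFrame({'age': [30, 60], 'salary': [65, 140]}))
print(scores)
```

Because the function only sees a Pandas DataFrame, you can unit-test it locally before it ever touches a cluster.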
Part 7: Practical Learning Path
If you're working through this transition, here's a sequenced approach based on the operations data engineers use most.
Week 1: Core Pandas fluency
- Rebuild 3 of your existing PySpark ETL jobs in Pandas
- Master: filtering, groupby/agg, merge, apply, pivot_table
- Read: Pandas User Guide: Comparison with Spark
Week 2: NumPy foundations
- Understand array broadcasting, which is the engine under Pandas
- Practice vectorized operations instead of loops
- This directly maps to understanding tensor operations in PyTorch later
Week 3: scikit-learn core API
- The estimator interface: .fit(), .transform(), .predict()
- Build your first Pipeline with at least one ColumnTransformer
- Implement cross-validation with cross_val_score()
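As a minimal sketch of that week's target (synthetic data stands in for a real feature table):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# cross_val_score fits and scores the estimator across k folds in one
# call; there is no separate evaluator object to configure.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())
```

One line replaces what MLlib spreads across a CrossValidator, a ParamGridBuilder, and an evaluator instance.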
Week 4: End-to-end project
- Pick a Kaggle tabular dataset, preferably structured like a transaction dataset that resembles your DE work
- Build a full pipeline: EDA, feature engineering, model training, evaluation, and serialization with joblib
- This becomes your first ML portfolio project, exactly the kind of hands-on work that signals readiness for an ML engineering role
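The serialization step is short enough to sketch in full. This assumes synthetic data and an illustrative file name; any fitted sklearn Pipeline serializes the same way:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, n_features=4, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipeline.fit(X, y)

# Persist the whole fitted pipeline (preprocessing + model) as one file,
# the sklearn analogue of saving a fitted PipelineModel in MLlib.
path = os.path.join(tempfile.mkdtemp(), 'churn_pipeline.joblib')
joblib.dump(pipeline, path)
restored = joblib.load(path)
```

Saving the pipeline rather than the bare model guarantees the exact same preprocessing runs at inference time.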
Quick Reference: PySpark to Pandas/scikit-learn Cheat Sheet
| Operation | PySpark | Pandas / scikit-learn |
|---|---|---|
| Filter rows | df.filter(condition) | df[condition] or df.query() |
| Select columns | df.select('a', 'b') | df[['a', 'b']] |
| Add column | df.withColumn('c', expr) | df.assign(c=expr) |
| GroupBy agg | df.groupBy('x').agg(F.mean('y')) | df.groupby('x').agg(avg_y=('y','mean')) |
| Join | df1.join(df2, on='id', how='left') | pd.merge(df1, df2, on='id', how='left') |
| Drop nulls | df.dropna(subset=['col']) | df.dropna(subset=['col']) |
| Fill nulls | df.fillna({'col': val}) | df.fillna({'col': val}) |
| Encode categoricals | StringIndexer + OneHotEncoder | OneHotEncoder (single step) |
| Scale features | StandardScaler(inputCol='features') | StandardScaler() on array directly |
| Build pipeline | Pipeline(stages=[...]) | Pipeline([('step', transformer)]) |
| Evaluate model | BinaryClassificationEvaluator | roc_auc_score(), classification_report() |
| Lazy evaluation | Yes, actions trigger execution | No, eager by default |
Key Takeaways
The shift from PySpark to Pandas/scikit-learn is a surface-level API change over a familiar conceptual foundation. Every concept you already know, including transformations, aggregations, joins, pipelines, and train/test splits, exists in both worlds. What changes is:
- Execution model: eager vs. lazy; in-memory vs. distributed
- Feature representation: native columns vs. assembled vectors
- Pipeline structure: sequential stages vs. parallel ColumnTransformer
- Evaluation API: single-metric evaluators vs. a comprehensive metrics module
Your data engineering instincts around schema validation, null handling, data quality, and pipeline reproducibility are directly applicable to ML work. The transition is about building on top of what you have, not starting over.