Question: Alternative to MLeap for Real-Time Inference Without a Spark Context (SparkXGBClassifier)

himadri bhattacharjee — Thu, 21 Aug 2025 05:55:00 +0000

We are exploring alternatives to MLeap for running inference without Spark, since MLeap has limitations with Spark/PySpark version compatibility and library updates.

Our Setup & Goal

Environment: PySpark 3.5.5
Algorithm: Distributed ML training with XGBoost on Spark (SparkXGBClassifier).
Goal: Run real-time inference without requiring a Spark session/context, to reduce overhead and response latency.

What We Did

Took a dataset (Titanic), converted it to Parquet, and split it into 80% (train) and 20% (test).
Trained in Spark on the 80% split (preprocessing plus XGBoost).
Evaluated in Spark on the 20% split and logged the trained model.
Tried multiple logging/serialization approaches:
- MLflow pyfunc
- ONNX
- XGBoost native model (JSON/binary)
For inference: loaded the same 20% data, applied preprocessing outside Spark, reloaded the trained model, and ran predictions.
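
A minimal sketch of what the XGBoost native route could look like, for reference. It assumes the fitted model is a `SparkXGBClassifierModel` from `xgboost.spark` (which exposes `get_booster()`), that `xgboost` and `numpy` are installed on the serving side, and that features arrive in the same column order and with the same missing-value marker as in training. The function names, the file path, and the `missing` default are illustrative only.

```python
def export_booster(spark_model, path="model.json"):
    """Save the underlying Booster of a fitted xgboost.spark model.

    Assumes `spark_model` is a SparkXGBClassifierModel; the ".json"
    suffix makes XGBoost write its JSON model format.
    """
    booster = spark_model.get_booster()
    booster.save_model(path)
    return path


def load_and_predict(path, rows, missing=float("nan")):
    """Load the saved Booster and score rows without a Spark session.

    `rows` is a 2-D array-like of features in training column order.
    `missing` should match what the Spark estimator treated as missing.
    """
    # Imported lazily so the export side does not need these packages.
    import numpy as np
    import xgboost as xgb

    booster = xgb.Booster()
    booster.load_model(path)
    features = np.asarray(rows, dtype=np.float32)
    dmatrix = xgb.DMatrix(features, missing=missing)
    return booster.predict(dmatrix)
```

If the model was fitted inside a Spark ML `Pipeline`, the XGBoost stage would first have to be pulled out of the fitted pipeline's `stages` before calling `export_booster`.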

The Problem

In all approaches tested (MLflow pyfunc, ONNX, XGBoost native save/load), accuracy differs between:
- Spark-based evaluation (during training)
- Non-Spark inference (real-time service)
It seems precision is lost when the model is saved and reloaded outside Spark.
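
If "precision" here means numerical precision, one candidate worth ruling out is float width: XGBoost stores feature values and split thresholds as 32-bit floats, so a runtime that compares a 64-bit feature value against a 32-bit threshold can route a row down a different branch. The sketch below illustrates the effect using only the standard library. The threshold and feature value are made up for the illustration and do not come from the Titanic model.

```python
import struct


def to_float32(value):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]


def goes_left(feature_value, split_threshold):
    """Tree-style routing rule: take the left branch when value < threshold."""
    return feature_value < split_threshold


# Hypothetical split threshold, held as a 32-bit float.
threshold = to_float32(0.1)

# The same raw feature value, kept at 64 bits vs. rounded to 32 bits.
raw_value = 0.1
as_float64 = raw_value
as_float32 = to_float32(raw_value)

print(goes_left(as_float64, threshold))  # True: 64-bit 0.1 is below the 32-bit 0.1
print(goes_left(as_float32, threshold))  # False: both sides are the same 32-bit value
```

Casting features to 32-bit floats before scoring, on both the Spark side and the serving side, would remove this particular source of disagreement; it does not rule out others such as differing preprocessing or missing-value handling.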

Main Requirements

- The accuracy from Spark-based evaluation and non-Spark inference must match.
- We need a way to serialize/deserialize models that works across Spark training and non-Spark inference.
- We prefer portable formats (JSON or similar).
- Inference must avoid Spark context overhead for real-time serving.
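
The first requirement above can be checked row by row rather than only through an aggregate accuracy figure. A minimal sketch, assuming both sides can export `{row_id: probability}` mappings for the same 20% test split; the function name, the tolerance, and the sample values are hypothetical.

```python
import math


def find_mismatches(spark_probs, local_probs, abs_tol=1e-6):
    """Return row ids whose predictions differ between the two runtimes.

    Both arguments map a row id to the predicted positive-class probability.
    Rows present on only one side are reported as mismatches too.
    """
    mismatched = []
    for row_id in sorted(set(spark_probs) | set(local_probs)):
        a = spark_probs.get(row_id)
        b = local_probs.get(row_id)
        if a is None or b is None:
            mismatched.append(row_id)
        elif not math.isclose(a, b, rel_tol=0.0, abs_tol=abs_tol):
            mismatched.append(row_id)
    return mismatched


# Made-up values for illustration only.
spark_side = {1: 0.912345, 2: 0.104321, 3: 0.500010}
local_side = {1: 0.912345, 2: 0.104321, 3: 0.437700}
print(find_mismatches(spark_side, local_side))  # [3]
```

Running the same comparison on the preprocessed feature vectors of the mismatching rows may help separate a preprocessing difference from a model serialization difference.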

Question

👉 Is there any solution or alternative to MLeap for serving models trained with Spark (e.g., XGBoost with PySpark), but performing inference outside of Spark (lightweight, real-time)?

Requirements for the solution:
- Should support PySpark 3.5.5
- Must work with XGBoost distributed training
- Should prevent accuracy mismatch between Spark and non-Spark inference
- JSON or another portable serialization format is preferred