We are exploring alternatives to MLeap for running inference without Spark, since MLeap has compatibility issues with newer Spark/PySpark versions and receives infrequent library updates.
Our Setup & Goal
- Environment: PySpark 3.5.5
- Algorithm: Distributed ML training using XGBoost with Spark.
- Goal: Run real-time inference without requiring a Spark session/context, to reduce overhead and response latency.
What We Did
- Took a dataset (Titanic), converted it to Parquet, and split it into 80% (train) and 20% (test).
- Trained with Spark (80% data) including preprocessing + XGBoost.
- Evaluated on Spark (20% data) and logged the trained model (a training/evaluation sketch follows this list).
- Tried multiple logging/serialization approaches (see the export sketch after this list):
  - MLflow pyfunc
  - ONNX
  - XGBoost native model (JSON/binary)
- For inference: loaded the same 20% data, applied preprocessing outside Spark, reloaded the trained model, and ran predictions (see the inference sketch after this list).
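For context, a minimal sketch of the training/evaluation step, assuming the official `xgboost.spark` estimator is used for distributed training. The file path and the feature/label column names are illustrative assumptions, not our actual pipeline:

```python
# Sketch: Spark-side training and evaluation (assumed columns/paths).
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from xgboost.spark import SparkXGBClassifier

spark = SparkSession.builder.appName("titanic-xgb").getOrCreate()

df = spark.read.parquet("titanic.parquet")  # hypothetical path
train, test = df.randomSplit([0.8, 0.2], seed=42)

# Preprocessing: assemble numeric features into a single vector column.
assembler = VectorAssembler(
    inputCols=["Pclass", "Age", "Fare"],  # assumed feature columns
    outputCol="features",
    handleInvalid="keep",  # invalid/missing values become NaN in the vector
)
train_vec = assembler.transform(train)
test_vec = assembler.transform(test)

# Distributed XGBoost training via the official Spark estimator.
clf = SparkXGBClassifier(features_col="features", label_col="Survived", num_workers=2)
model = clf.fit(train_vec)

# Spark-side evaluation on the 20% hold-out.
preds = model.transform(test_vec)
acc = MulticlassClassificationEvaluator(
    labelCol="Survived", predictionCol="prediction", metricName="accuracy"
).evaluate(preds)
print(f"Spark eval accuracy: {acc:.4f}")
```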
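The XGBoost-native export route looks roughly like this: pull the booster out of the Spark model and save it in XGBoost's portable JSON format (output path is hypothetical):

```python
# Sketch: export the native booster from the Spark model.
# "model" is the SparkXGBClassifierModel from the training sketch above.
booster = model.get_booster()
booster.save_model("xgb_model.json")  # hypothetical output path

# Alternatively, log it with MLflow (assuming an experiment is configured):
# import mlflow.xgboost
# mlflow.xgboost.log_model(booster, "model")
```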
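And the Spark-free inference side, again a sketch under the same assumed columns and paths. The manual preprocessing here has to reproduce the Spark-side `VectorAssembler` exactly (same columns, same order, same missing-value handling):

```python
# Sketch: reload the JSON booster and predict with plain XGBoost + pandas.
import pandas as pd
import xgboost as xgb

booster = xgb.Booster()
booster.load_model("xgb_model.json")

test_pd = pd.read_parquet("titanic_test.parquet")  # same 20% split, hypothetical path
features = test_pd[["Pclass", "Age", "Fare"]]  # same assumed columns, same order

# NaN as the missing marker, matching VectorAssembler's handleInvalid="keep".
dmatrix = xgb.DMatrix(features, missing=float("nan"))
probs = booster.predict(dmatrix)       # probabilities for binary:logistic
labels = (probs > 0.5).astype(int)     # hard labels at the default threshold
```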
The Problem
- In all approaches tested (MLflow pyfunc, ONNX, XGBoost native save/load), accuracy differs between:
  - Spark-based evaluation (during training)
  - Non-Spark inference (real-time service)
- It looks as though precision is lost when the model is saved and reloaded outside Spark (a sketch for comparing the two prediction paths follows this list).
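One way to localize the mismatch is to compare per-row probabilities instead of aggregate accuracy. This sketch continues from the ones above (`preds` from the Spark evaluation, `probs` from the local booster) and assumes both paths score the test rows in the same order:

```python
# Sketch: row-level comparison of Spark vs. non-Spark predictions.
# If the probabilities diverge, the cause is usually preprocessing drift
# (column order, missing-value encoding) rather than serialization precision.
import numpy as np

# Positive-class probability collected from the Spark predictions DataFrame.
# NOTE: assumes collect() returns rows in the same order as the pandas split.
spark_probs = np.array(
    [float(r["probability"][1]) for r in preds.select("probability").collect()]
)

if not np.allclose(spark_probs, probs, atol=1e-6):
    diff = np.abs(spark_probs - probs)
    print("max diff:", diff.max(), "at row", diff.argmax())
```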
Main Requirement
- The accuracy from Spark-based evaluation and non-Spark inference must match.
- Need a solution to serialize/deserialize models that works across Spark training and non-Spark inference.
- Prefer portable formats (JSON or similar).
- Must avoid Spark context overhead at inference for real-time serving.
Question
👉 Is there any solution or alternative to MLeap for serving models trained with Spark (e.g., XGBoost with PySpark) while performing inference outside of Spark (lightweight, real-time)?
- Should support PySpark 3.5.5
- Must work with XGBoost distributed training
- Should prevent accuracy mismatch between Spark and non-Spark inference
- JSON or portable serialization preferred
Any recommendations for frameworks, libraries, or best practices beyond MLeap would be greatly appreciated.