himadri bhattacharjee

Question: Alternative to MLeap for Real-Time Inference with SparkXGBClassifier Without a Spark Context

We are exploring alternatives to MLeap for running inference without Spark, since MLeap has limitations around Spark/PySpark version compatibility and lags behind library updates.


Our Setup & Goal

  • Environment: PySpark 3.5.5
  • Algorithm: Distributed training with XGBoost on Spark (SparkXGBClassifier).
  • Goal: Run real-time inference without requiring a Spark session/context, to reduce overhead and response latency.

What We Did

  1. Took a dataset (Titanic), converted it to Parquet, and split it into 80% (train) and 20% (test).
  2. Trained on Spark with the 80% split: preprocessing followed by XGBoost.
  3. Evaluated on Spark (20% data) and logged the trained model.
  4. Tried multiple logging/serialization approaches:
    • MLflow pyfunc
    • ONNX
    • XGBoost native model (JSON/binary)
  5. For inference: loaded the same 20% test data, applied the preprocessing outside Spark, reloaded the trained model, and ran predictions (a rough sketch of this export/inference flow is below).
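
For concreteness, here is a rough sketch of the flow, not our exact code: the column names, file paths, and parameters such as num_workers are placeholders, and we assume the features are assembled with a VectorAssembler and the label is a binary Survived column.

```python
# Sketch: train SparkXGBClassifier on Spark, then export the booster to native XGBoost JSON.
from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBClassifier

# `spark` is an existing SparkSession.
df = spark.read.parquet("titanic.parquet")
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

# Assemble the (already numeric/encoded) feature columns into a single vector.
feature_cols = ["Pclass", "Age", "SibSp", "Parch", "Fare", "SexIndexed", "EmbarkedIndexed"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features", handleInvalid="keep")
train_vec = assembler.transform(train_df)

clf = SparkXGBClassifier(features_col="features", label_col="Survived", num_workers=2)
spark_model = clf.fit(train_vec)

# Export the underlying booster in XGBoost's portable JSON format.
spark_model.get_booster().save_model("xgb_model.json")
```

Outside Spark, the exported booster can then be loaded with plain xgboost, provided the feature matrix is built in exactly the same column order as the VectorAssembler inputCols used at training time:

```python
# Sketch: lightweight inference with the exported booster, no Spark session needed.
import numpy as np
import pandas as pd
import xgboost as xgb

test_pdf = pd.read_parquet("titanic_test.parquet")  # 20% split, preprocessed the same way as training

# Column order must match the training-time VectorAssembler inputCols.
feature_cols = ["Pclass", "Age", "SibSp", "Parch", "Fare", "SexIndexed", "EmbarkedIndexed"]
X = test_pdf[feature_cols].astype("float32").to_numpy()

booster = xgb.Booster()
booster.load_model("xgb_model.json")

proba = booster.predict(xgb.DMatrix(X, missing=np.nan))  # P(label = 1) for binary:logistic
pred = (proba >= 0.5).astype(int)
print("standalone accuracy:", (pred == test_pdf["Survived"].to_numpy()).mean())
```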

The Problem

  • In all approaches tested (MLflow pyfunc, ONNX, XGBoost native save/load), accuracy differs between:
    • Spark-based evaluation (during training)
    • Non-Spark inference (real-time service)
  • It seems precision is lost when the model is saved and reloaded outside Spark (see the comparison sketch below).
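
To narrow down where the mismatch comes from, a diagnostic sketch (placeholder column names, with PassengerId as a hypothetical row id): export Spark's per-row probability for the positive class, then compare it element-wise with the standalone booster's output. Near-zero differences would point at thresholding or metric computation rather than serialization; large or structured differences would point at the non-Spark preprocessing or feature ordering.

```python
# Diagnostic sketch: compare Spark-side and standalone per-row probabilities.
import numpy as np
import pandas as pd
import xgboost as xgb

# Standalone probabilities (same placeholder paths/columns as the inference sketch above).
test_pdf = pd.read_parquet("titanic_test.parquet")
feature_cols = ["Pclass", "Age", "SibSp", "Parch", "Fare", "SexIndexed", "EmbarkedIndexed"]
booster = xgb.Booster()
booster.load_model("xgb_model.json")
proba = booster.predict(xgb.DMatrix(test_pdf[feature_cols].astype("float32").to_numpy(), missing=np.nan))

# Spark-side probabilities exported as plain doubles, e.g.:
#   from pyspark.ml.functions import vector_to_array
#   (spark_model.transform(test_vec)
#        .withColumn("p1", vector_to_array("probability")[1])
#        .select("PassengerId", "p1")
#        .write.parquet("spark_probabilities.parquet"))
spark_preds = pd.read_parquet("spark_probabilities.parquet")

standalone = pd.DataFrame({"PassengerId": test_pdf["PassengerId"].to_numpy(), "proba": proba})
merged = spark_preds.merge(standalone, on="PassengerId")

diff = np.abs(merged["p1"].to_numpy() - merged["proba"].to_numpy())
print("max abs diff:", diff.max())
# Tiny diffs (~1e-6) -> serialization is fine; look at thresholds / metric computation instead.
# Large or structured diffs -> the non-Spark preprocessing or feature order differs from training.
```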

Main Requirement

  • The accuracy from Spark-based evaluation and non-Spark inference must match.
  • Need a solution to serialize/deserialize models that works across Spark training and non-Spark inference.
  • Prefer portable formats (JSON or similar).
  • Must avoid Spark context overhead at inference for real-time serving.

Question

👉 Is there any solution or alternative to MLeap for serving models trained with Spark (e.g., XGBoost with PySpark), but performing inference outside of Spark (lightweight, real-time)?

  • Should support PySpark 3.5.5
  • Must work with XGBoost distributed training
  • Should prevent accuracy mismatch between Spark and non-Spark inference
  • JSON or portable serialization preferred

Any recommendations for frameworks, libraries, or best practices beyond MLeap would be greatly appreciated.
