Batch Scoring with R: Scalable Predictions on Azure

Rishi

Batch scoring (bulk inference) is the process of generating predictions for a large dataset (often offline) using a pre-trained model. In R, batch scoring typically involves writing a script that loads the data, runs the model on all records, and writes out the results. For massive workloads, it is impractical to score millions of rows on one machine; instead, parallel or distributed architectures are used. One proven approach is Azure Batch with containerized R jobs, which distributes scoring across many VMs.
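Before scaling out, it helps to see the single-machine baseline. Here is a minimal sketch, assuming a saved model object and a CSV of inputs (the file names are illustrative, not from the Azure sample):

# Single-machine batch scoring: load the model, score every record, write results.
model <- readRDS("model.rds")     # hypothetical pre-trained model
input <- read.csv("input.csv")    # hypothetical input data

input$score <- predict(model, newdata = input)
write.csv(input, "scored.csv", row.names = FALSE)

Everything that follows is about splitting this same load/score/write loop across many machines.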

For example, Microsoft’s Azure/RBatchScoring project demonstrates scoring 1,000 products across 83 stores (5.4 million forecasts) by parallelizing R computations on an Azure Batch cluster (see the RBatchScoring GitHub repo).

The high-level workflow is:

  • Prepare Model & Data. Train your model in R and save it (e.g. as an .RData file) in Azure Blob Storage. Place your input data (CSV or other formats) in Blob Storage as well.
  • Scheduler Script. A small R script (using packages like doAzureParallel or rAzureBatch) runs in an Azure Container Instance (ACI). This script triggers Azure Batch jobs on a schedule (e.g. daily).

  • Azure Batch Pool. A pool of VMs (Azure Batch) is created. Each VM pulls a portion of the input data from Blob Storage, loads the R model, and computes predictions on that mini-batch in parallel. Results are written back to Azure Blob Storage (see the doAzureParallel sketch after this list).

  • Gather Results. Once all batch jobs complete, the container can merge outputs or a downstream process can consume the scored data. An Azure Logic App or similar service can orchestrate the schedule and notify when done.
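To make the fan-out step concrete, here is a minimal sketch using doAzureParallel. The two JSON config files, the chunking scheme, and model.rds are assumptions for illustration, not code from the Azure sample:

library(doAzureParallel)

# Authenticate and provision the Batch pool; the two JSON config files
# are assumed to exist (see the doAzureParallel docs for their format).
setCredentials("credentials.json")
cluster <- makeCluster("cluster.json")
registerDoAzureParallel(cluster)

# Split the input into mini-batches; each foreach iteration runs on a pool node.
input  <- read.csv("input.csv")   # hypothetical input with an id column
chunks <- split(input, cut(seq_len(nrow(input)), breaks = 10, labels = FALSE))

scores <- foreach(chunk = chunks, .combine = rbind) %dopar% {
  # model.rds is assumed to be available on every node
  # (e.g. distributed via the pool's resource files).
  model <- readRDS("model.rds")
  data.frame(id = chunk$id, score = predict(model, newdata = chunk))
}

stopCluster(cluster)
write.csv(scores, "scored.csv", row.names = FALSE)

The appeal of this pattern is that the scoring loop itself stays ordinary foreach code; only the registered backend changes.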

Key points of this architecture (from the Azure sample) include:

  • Parallelization: The Batch pool runs jobs in parallel, making it feasible to score millions of records. Packages like doAzureParallel simplify distributing R work over the cluster.

  • Storage Integration: Input and output use Azure Blob Storage, providing high throughput and easy access from multiple nodes. The scheduler script divides the workload by listing blobs or splitting input files (see the blob-listing sketch after this list).
  • Automation: In the reference example, an Azure Logic App triggers the container on a recurring schedule (e.g. nightly). The R scripts include steps like 02_deploy_azure_resources.R and 03_forecast_on_batch.R to set up compute and run scoring.
  • Scalability: By adjusting the Batch pool size, scoring speed scales with data volume. The demo handled 5.4M predictions by using enough VMs to finish in a reasonable time.
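As an illustration of the blob-listing approach, here is a hedged sketch using the AzureStor package; the account URL, key, and container name are placeholders, and the one-work-unit-per-task split is just one reasonable choice:

library(AzureStor)

# Connect to the input container (URL and key are placeholders).
endp <- storage_endpoint("https://<account>.blob.core.windows.net", key = "<key>")
cont <- storage_container(endp, "input-data")

# List the input blobs and split them into one work unit per Batch task.
blobs <- list_blobs(cont)$name
n_tasks <- 8
work_units <- split(blobs, rep_len(seq_len(n_tasks), length(blobs)))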

In practice, other cloud or open-source tools can achieve batch scoring as well. For instance, one might use Spark with SparkR or sparklyr (a sketch follows below), or deploy an R-based API to Kubernetes and invoke it in batch mode. However, the Azure-based pattern above is a concrete example for intermediate practitioners to follow. The critical takeaway is that batch scoring in R requires distributing the load. Whether using Azure Batch, Databricks, or another service, the workflow is the same: split data → score chunks → combine results. The RBatchScoring GitHub repository provides detailed code and guidance for this pattern.
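For comparison, here is a hedged sketch of the same split → score → combine pattern on Spark via sparklyr; the paths and model.rds are assumptions, and spark_apply() runs the R closure once per partition:

library(sparklyr)

# Connect to Spark (a local master here just for illustration).
sc <- spark_connect(master = "local")

# Load the input as a Spark DataFrame.
sdf <- spark_read_csv(sc, name = "input", path = "input.csv")

# Each executor scores its own partition of the data.
scored <- spark_apply(sdf, function(df) {
  model <- readRDS("model.rds")   # must be reachable from every worker
  df$score <- predict(model, newdata = df)
  df
})

spark_write_csv(scored, path = "scored")
spark_disconnect(sc)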

Practical example: Azure ML Batch Scoring with Dynamic Output Paths from the Model URI (R)

The sequence has two parts: a scoring script that parses the model URI and derives a versioned, timestamped output path, and a job spec that only passes the output root.

score.R (parses short or full AzureML model URI)

library(stringr)
library(glue)
library(jsonlite)

args <- commandArgs(trailingOnly = TRUE)

# The job passes named flags (--model_uri, --out_root); fall back to them
# when the environment variables are not set.
get_flag <- function(flag) {
  i <- match(flag, args)
  if (!is.na(i) && i < length(args)) args[i + 1] else ""
}

model_uri <- Sys.getenv("AZUREML_MODEL_URI", unset = get_flag("--model_uri"))
out_root  <- Sys.getenv("OUTPUTS_DATASTORE", unset = get_flag("--out_root"))

if (model_uri == "" || out_root == "") stop("Model URI or outputs datastore not provided")

# Accepts formats like:
# azureml://registries/<reg>/models/<name>/versions/<ver>
# azureml://subscriptions/.../workspaces/.../models/<name>/versions/<ver>
# models:/<name>/<ver>  (short)
extract <- function(uri){
  # normalize
  u <- gsub("\\s","", uri)
  # try short format
  m <- str_match(u, "^models:/([^/]+)/([0-9]+)$")
  if (!is.na(m[1,1])) return(list(name=m[1,2], ver=m[1,3]))
  # try full
  m <- str_match(u, "models/([^/]+)/versions/([0-9]+)")
  if (!is.na(m[1,1])) return(list(name=m[1,2], ver=m[1,3]))
  stop(glue("Unrecognized model URI: {uri}"))
}

info <- extract(model_uri)

ts <- format(Sys.time(), "%Y/%m/%d/%H")
# No trailing slash: file paths below append "/<file>" themselves.
dynamic_path <- glue("{out_root}/{info$name}/v{info$ver}/{ts}")

dir.create(dynamic_path, recursive = TRUE, showWarnings = FALSE)

# ... your scoring code here; demo write:
pred <- data.frame(id=1:5, score=c(0.1,0.78,0.43,0.92,0.66))
write.csv(pred, file=glue("{dynamic_path}/predictions.csv"), row.names=FALSE)

# Emit a manifest for downstream steps
manifest <- list(
  model_name = info$name,
  model_version = info$ver,
  output_path = dynamic_path
)
write(toJSON(manifest, auto_unbox = TRUE, pretty = TRUE), file=glue("{dynamic_path}/_manifest.json"))
cat(glue("Wrote outputs to: {dynamic_path}"), "\n", sep = "")  # glue() trims a trailing \n, so emit it separately
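To smoke-test the script locally before wiring it into a job (the model URI and output root here are illustrative):

Rscript score.R --model_uri models:/customer-churn/12 --out_root /tmp/scoring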

Minimal command job (only the output root is passed; the script builds the rest)

# batch-score.yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: >-
  Rscript score.R
  --model_uri ${{inputs.model_uri}}
  --out_root ${{outputs.predictions}}
environment: azureml://registries/azureml/environments/r-4.2-ubuntu2004/versions/1
inputs:
  model_uri:
    type: string
    value: models:/customer-churn/12
outputs:
  predictions:
    type: uri_folder
    mode: rw_mount
    path: azureml://datastores/outputs_datastore/paths/scoring/
compute: azureml:cpu
experiment_name: tac-batch-score
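With the Azure ML CLI v2 installed, the job can be submitted like this (the resource group and workspace names are placeholders):

az ml job create --file batch-score.yaml --resource-group <rg> --workspace-name <ws>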

Result: The script builds the full path dynamically, so your pipelines stay generic while outputs remain neatly organized by model/version and time.
