Batch Scoring with R: Scalable Predictions on Azure
Batch scoring (bulk inference) is the process of generating predictions for a large dataset, often offline, using a pre-trained model. In R, batch scoring typically involves writing a script that loads data, runs the model on all records, and outputs results. For massive workloads, it is impractical to score millions of rows on a single machine; instead, parallel or distributed architectures are used. One proven approach is Azure Batch with containerized R jobs, which distributes scoring across many VMs.
For example, Microsoft's Azure/RBatchScoring project demonstrates scoring 1,000 products across 83 stores (5.4 million forecasts) by parallelizing R computations on an Azure Batch cluster (see the RBatchScoring GitHub repository).
The high-level workflow is:
- Prepare Model & Data. Train your model in R and save it (e.g. as an `.RData` file) in Azure Blob Storage. Place your input data (CSV or other formats) in Blob Storage as well.
- Scheduler Script. A small R script (using packages like `doAzureParallel` or `rAzureBatch`) runs in an Azure Container Instance (ACI). This script triggers Azure Batch jobs on a schedule (e.g. daily); see the sketch after this list.
- Azure Batch Pool. A pool of VMs (Azure Batch) is created. Each VM pulls a portion of the input data from Blob Storage, loads the R model, and computes predictions on that mini-batch of data in parallel. Results are written back to Azure Blob Storage.
- Gather Results. Once all batch jobs complete, the container can merge outputs or a downstream process can consume the scored data. An Azure Logic App or similar service can orchestrate the schedule and notify when done.
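A minimal sketch of such a scheduler script, using the `doAzureParallel` API with placeholder config files and a hypothetical `score_chunk()` helper you would supply:

```r
# Minimal scheduler sketch: distribute scoring over an Azure Batch pool.
# Assumes credentials.json / cluster.json prepared per the doAzureParallel
# docs; score_chunk() is a hypothetical helper that loads one data chunk
# from Blob Storage, runs the model, and returns a data frame.
library(doAzureParallel)
library(foreach)

setCredentials("credentials.json")      # Azure Batch + Storage credentials
cluster <- makeCluster("cluster.json")  # provision the Batch pool
registerDoAzureParallel(cluster)        # register as the foreach backend

chunk_ids <- 1:100                      # e.g. one id per input file in Blob
results <- foreach(id = chunk_ids, .combine = rbind) %dopar% {
  score_chunk(id)                       # runs on a pool VM
}

stopCluster(cluster)                    # tear down the pool when done
```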
Key points of this architecture (from the Azure sample) include:
- Parallelization: The Batch pool runs jobs in parallel, making it feasible to score millions of records. Packages like `doAzureParallel` simplify distributing R work over the cluster.
- Storage Integration: Input and output use Azure Blob Storage, providing high throughput and easy access for multiple nodes. The scheduler script divides the workload by listing blobs or splitting input files, as sketched below.
- Automation: In the reference example, an Azure Logic App triggers the container on a recurring schedule (e.g. nightly). The R scripts include steps like `02_deploy_azure_resources.R` and `03_forecast_on_batch.R` to set up compute and run scoring.
- Scalability: By adjusting the Batch pool size, scoring speed scales with data volume. The demo handled 5.4M predictions by using enough VMs to finish in a reasonable time.
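To illustrate the storage-integration point, the scheduler could divide the workload by listing blobs. A sketch assuming the `AzureStor` package, with placeholder account and container names:

```r
# List input blobs and split them into one batch of work per node.
# The account URL, key variable, and container name are placeholders.
library(AzureStor)

endp <- storage_endpoint("https://<account>.blob.core.windows.net",
                         key = Sys.getenv("STORAGE_KEY"))
cont <- storage_container(endp, "input-data")

blobs <- list_blobs(cont)$name   # one blob per store/product file
n_nodes <- 10                    # match the Batch pool size
batches <- split(blobs, cut(seq_along(blobs), n_nodes, labels = FALSE))
```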
In practice, other cloud or open-source tools can achieve batch scoring as well. For instance, one might use Spark with SparkR or sparklyr, or deploy an R-based API to Kubernetes for batch mode. However, the above Azure-based pattern is a concrete example for intermediate practitioners to follow. The critical takeaway is that batch scoring in R requires distributing the load. Whether using Azure Batch, Databricks, or another service, the workflow is the same: split data → score chunks → combine results, as in the local sketch below. The Azure RBatchScoring repository provides detailed code and guidance for this pattern.
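To make the pattern concrete independent of any cloud service, here is the same split → score → combine flow on a single machine, assuming a fitted `model` object and an `input` data frame (note that `mclapply` uses forking, so `mc.cores > 1` requires macOS or Linux):

```r
# split data -> score chunks in parallel -> combine results
library(parallel)

chunks <- split(input, cut(seq_len(nrow(input)), 8, labels = FALSE))
scored <- mclapply(chunks, function(d) {
  d$score <- predict(model, newdata = d)   # score one chunk
  d
}, mc.cores = 8)
predictions <- do.call(rbind, scored)      # combine
```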
Practical Example: Azure ML Batch Scoring with Dynamic Output Paths from the Model URI (R)
Sequence: first the scoring script, then the job definition.
score.R (parses a short or full Azure ML model URI)
library(stringr)
library(glue)
library(jsonlite)

# Read inputs from the named flags (--model_uri, --out_root) passed by the
# job definition below; environment variables take precedence when set.
args <- commandArgs(trailingOnly = TRUE)
get_arg <- function(flag) {
  i <- which(args == flag)
  if (length(i) == 1 && i < length(args)) args[i + 1] else ""
}
model_uri <- Sys.getenv("AZUREML_MODEL_URI", unset = get_arg("--model_uri"))
out_root  <- Sys.getenv("OUTPUTS_DATASTORE", unset = get_arg("--out_root"))
if (model_uri == "" || out_root == "") stop("Model URI or outputs datastore not provided")

# Accepts formats like:
#   azureml://registries/<reg>/models/<name>/versions/<ver>
#   azureml://subscriptions/.../workspaces/.../models/<name>/versions/<ver>
#   models:/<name>/<ver>   (short)
extract <- function(uri) {
  u <- gsub("\\s", "", uri)  # normalize: strip whitespace
  # try short format
  m <- str_match(u, "^models:/([^/]+)/([0-9]+)$")
  if (!is.na(m[1, 1])) return(list(name = m[1, 2], ver = m[1, 3]))
  # try full format
  m <- str_match(u, "models/([^/]+)/versions/([0-9]+)")
  if (!is.na(m[1, 1])) return(list(name = m[1, 2], ver = m[1, 3]))
  stop(glue("Unrecognized model URI: {uri}"))
}

info <- extract(model_uri)
ts <- format(Sys.time(), "%Y/%m/%d/%H")
# Build the output path dynamically: <root>/<model>/v<version>/<YYYY/MM/DD/HH>
dynamic_path <- as.character(glue("{out_root}/{info$name}/v{info$ver}/{ts}"))
dir.create(dynamic_path, recursive = TRUE, showWarnings = FALSE)

# ... your scoring code here; demo write:
pred <- data.frame(id = 1:5, score = c(0.1, 0.78, 0.43, 0.92, 0.66))
write.csv(pred, file = glue("{dynamic_path}/predictions.csv"), row.names = FALSE)

# Emit a manifest for downstream steps
manifest <- list(
  model_name    = info$name,
  model_version = info$ver,
  output_path   = dynamic_path
)
write(toJSON(manifest, auto_unbox = TRUE, pretty = TRUE),
      file = glue("{dynamic_path}/_manifest.json"))
message(glue("Wrote outputs to: {dynamic_path}"))
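To smoke-test the script locally before submitting a job, you can run it with a writable local folder: `Rscript score.R --model_uri models:/customer-churn/12 --out_root ./outputs`.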
Minimal command job (only passes the output root)
# batch-score.yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: >-
Rscript score.R
--model_uri ${{inputs.model_uri}}
--out_root ${{outputs.predictions}}
environment: azureml://registries/azureml/environments/r-4.2-ubuntu2004/versions/1
inputs:
model_uri:
type: string
value: models:/customer-churn/12
outputs:
predictions:
type: uri_folder
mode: rw_mount
path: azureml://datastores/outputs_datastore/paths/scoring/
compute: azureml:cpu
experiment_name: tac-batch-score
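Submit it with the Azure ML CLI v2: `az ml job create --file batch-score.yaml`.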
Result: The script builds the full path dynamically, so your pipelines stay generic while outputs remain neatly organized by model/version and time.