<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rishi</title>
    <description>The latest articles on DEV Community by Rishi (@rishangsharma).</description>
    <link>https://dev.to/rishangsharma</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3478161%2F6abdfbaf-9669-4488-89ce-239ec9f76f01.jpeg</url>
      <title>DEV Community: Rishi</title>
      <link>https://dev.to/rishangsharma</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rishangsharma"/>
    <language>en</language>
    <item>
      <title>Batch Scoring with R: Scalable Predictions on Azure</title>
      <dc:creator>Rishi</dc:creator>
      <pubDate>Thu, 04 Sep 2025 02:26:18 +0000</pubDate>
      <link>https://dev.to/rishangsharma/batch-scoring-with-r-scalable-predictions-on-azure-533n</link>
      <guid>https://dev.to/rishangsharma/batch-scoring-with-r-scalable-predictions-on-azure-533n</guid>
      <description>&lt;h2&gt;
  
  
  Batch Scoring with R: Scalable Predictions on Azure
&lt;/h2&gt;

&lt;p&gt;Batch scoring (bulk inference) is the process of generating predictions for a large dataset (often offline) using a pre-trained model. In R, batch scoring typically involves writing a script that loads the data, runs the model on all records, and writes out the results. For massive workloads, it is impractical to score millions of rows on a single machine; instead, parallel or distributed architectures are used. One proven approach is using &lt;strong&gt;Azure Batch + containerized R jobs&lt;/strong&gt; to distribute scoring across many VMs.&lt;/p&gt;
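The core script pattern can be sketched in plain R before any cloud infrastructure is involved: load a fitted model once, then score the data in fixed-size chunks rather than all at once. This is a minimal local sketch; the `lm` model and the `cars` dataset are illustrative stand-ins, not from the Azure sample.

```r
# Minimal sketch of batch scoring in R: load a fitted model once,
# then score the data in fixed-size chunks instead of in one pass.
# Model and data here are illustrative stand-ins.

# Stand-in for a model loaded from an .RData file in Blob Storage
model = lm(dist ~ speed, data = cars)

score_in_chunks = function(model, data, chunk_size = 1000) {
  n = nrow(data)
  starts = seq(1, n, by = chunk_size)
  parts = lapply(starts, function(s) {
    e = min(s + chunk_size - 1, n)
    chunk = data[s:e, , drop = FALSE]
    # score this mini-batch and keep the inputs alongside the prediction
    cbind(chunk, score = predict(model, newdata = chunk))
  })
  do.call(rbind, parts)  # combine chunk results
}

scored = score_in_chunks(model, cars, chunk_size = 10)
```

In a distributed setting, each `lapply` iteration becomes a separate job on its own VM; the chunking logic is otherwise the same.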

&lt;p&gt;For example, Microsoft’s &lt;em&gt;Azure/RBatchScoring&lt;/em&gt; project demonstrates scoring 1,000 products across 83 stores (5.4 million forecasts) by parallelizing R computations on an Azure Batch cluster; see the &lt;a href="https://github.com/Azure/RBatchScoring#:~:text=This%20example%20uses%20the%20scenario,details%20of%20the%20forecasting%20scenario" rel="noopener noreferrer"&gt;RBatchScoring GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The high-level workflow is:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prepare Model &amp;amp; Data.&lt;/strong&gt; Train your model in R and save it (e.g. as an &lt;code&gt;.RData&lt;/code&gt; file) in Azure Blob Storage. Place your input data (CSV or other formats) in Blob Storage as well.
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scheduler Script.&lt;/strong&gt; A small R script (using packages like &lt;code&gt;doAzureParallel&lt;/code&gt; or &lt;code&gt;rAzureBatch&lt;/code&gt;) runs in an Azure Container Instance (ACI). This script &lt;strong&gt;triggers Azure Batch jobs on a schedule&lt;/strong&gt; (e.g. daily).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Azure Batch Pool.&lt;/strong&gt; A pool of VMs (Azure Batch) is created. Each VM pulls a portion of the input data from Blob Storage, loads the R model, and computes predictions on that mini-batch of data in parallel. Results are written back to Azure Blob Storage.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Gather Results.&lt;/strong&gt; Once all batch jobs complete, the container can merge outputs or a downstream process can consume the scored data. An Azure Logic App or similar service can orchestrate the schedule and notify when done.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
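The split/score/combine flow above can be rehearsed locally with base R's `parallel` package before moving to Azure Batch: each worker process plays the role of one Batch VM. This is a local stand-in sketch, not the Azure sample's code; the model and data are illustrative.

```r
# Local rehearsal of the distributed workflow using base R's `parallel`
# package: each worker process stands in for one Azure Batch VM.
library(parallel)

model = lm(dist ~ speed, data = cars)  # stand-in for the saved model

# "Scheduler": split the input into mini-batches (one per worker/VM)
batches = split(cars, cut(seq_len(nrow(cars)), breaks = 4, labels = FALSE))

# "Batch pool": score each mini-batch on a separate process
cl = makeCluster(2)
clusterExport(cl, "model")  # ship the model to the workers
results = parLapply(cl, batches, function(chunk) {
  cbind(chunk, score = predict(model, newdata = chunk))
})
stopCluster(cl)

# "Gather results": merge the per-batch outputs
scored = do.call(rbind, results)
```

On Azure the same roles are played by the scheduler script (splitting), the Batch pool (scoring), and Blob Storage plus a merge step (gathering).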

&lt;p&gt;Key points of this architecture (from the Azure sample) include:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Parallelization:&lt;/strong&gt; The Batch pool runs jobs in parallel, making it feasible to score millions of records. Packages like &lt;code&gt;doAzureParallel&lt;/code&gt; simplify distributing R work over the cluster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Storage Integration:&lt;/strong&gt; Input and output use Azure Blob Storage, providing high throughput and easy access for multiple nodes. The scheduler script divides the workload by listing blobs or splitting input files.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automation:&lt;/strong&gt; In the reference example, an Azure Logic App triggers the container on a recurring schedule (e.g. nightly). The R scripts include steps like &lt;code&gt;02_deploy_azure_resources.R&lt;/code&gt; and &lt;code&gt;03_forecast_on_batch.R&lt;/code&gt; to set up compute and run scoring.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalability:&lt;/strong&gt; By adjusting the Batch pool size, scoring speed scales with data volume. The demo handled 5.4M predictions by using enough VMs to finish in a reasonable time.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, other cloud or open-source tools can achieve batch scoring as well. For instance, one might use &lt;strong&gt;Spark with SparkR or sparklyr&lt;/strong&gt;, or deploy an R-based API to Kubernetes for batch mode. However, the Azure-based pattern above is a concrete example for intermediate practitioners to follow. The critical takeaway is: &lt;strong&gt;batch scoring in R requires distributing the load&lt;/strong&gt;. Whether using Azure Batch, Databricks, or another service, the workflow is: split data → score chunks → combine results. The Azure RBatchScoring repository provides detailed code and guidance for this pattern: &lt;a href="https://github.com/Azure/RBatchScoring#:~:text=1,managed%20by%20a%20Logic%20App" rel="noopener noreferrer"&gt;RBatchScoring_Github_Repo1&lt;/a&gt;, &lt;a href="https://github.com/Azure/RBatchScoring#:~:text=This%20example%20uses%20the%20scenario,details%20of%20the%20forecasting%20scenario" rel="noopener noreferrer"&gt;RBatchScoring_Github_Repo2&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical example: Azure ML Batch Scoring with Dynamic Output Paths from the Model URI (R)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sequence&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsp4f9wyo9mx5a0tz3jzx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsp4f9wyo9mx5a0tz3jzx.png" alt=" " width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;score.R (parses short or full AzureML model URI)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(stringr)
library(glue)
library(jsonlite)

args &amp;lt;- commandArgs(trailingOnly = TRUE)

# Fall back to positional CLI arguments when the env vars are not set.
# (Use `if`/`else` rather than `ifelse`, since these are scalars.)
model_uri &amp;lt;- Sys.getenv("AZUREML_MODEL_URI",
                         unset = if (length(args) &amp;gt;= 1) args[1] else "")
out_root  &amp;lt;- Sys.getenv("OUTPUTS_DATASTORE",
                         unset = if (length(args) &amp;gt;= 2) args[2] else "")

if (model_uri == "" || out_root == "") stop("Model URI or outputs datastore not provided")

# Accepts formats like:
# azureml://registries/&amp;lt;reg&amp;gt;/models/&amp;lt;name&amp;gt;/versions/&amp;lt;ver&amp;gt;
# azureml://subscriptions/.../workspaces/.../models/&amp;lt;name&amp;gt;/versions/&amp;lt;ver&amp;gt;
# models:/&amp;lt;name&amp;gt;/&amp;lt;ver&amp;gt;  (short)
extract &amp;lt;- function(uri){
  # normalize
  u &amp;lt;- gsub("\\s","", uri)
  # try short format
  m &amp;lt;- str_match(u, "^models:/([^/]+)/([0-9]+)$")
  if (!is.na(m[1,1])) return(list(name=m[1,2], ver=m[1,3]))
  # try full
  m &amp;lt;- str_match(u, "models/([^/]+)/versions/([0-9]+)")
  if (!is.na(m[1,1])) return(list(name=m[1,2], ver=m[1,3]))
  stop(glue("Unrecognized model URI: {uri}"))
}

info &amp;lt;- extract(model_uri)

ts &amp;lt;- format(Sys.time(), "%Y/%m/%d/%H")
# no trailing slash, so paths built below don't contain "//"
dynamic_path &amp;lt;- glue("{out_root}/{info$name}/v{info$ver}/{ts}")

dir.create(dynamic_path, recursive = TRUE, showWarnings = FALSE)

# ... your scoring code here; demo write:
pred &amp;lt;- data.frame(id=1:5, score=c(0.1,0.78,0.43,0.92,0.66))
write.csv(pred, file=glue("{dynamic_path}/predictions.csv"), row.names=FALSE)

# Emit a manifest for downstream steps
manifest &amp;lt;- list(
  model_name = info$name,
  model_version = info$ver,
  output_path = dynamic_path
)
write(toJSON(manifest, auto_unbox = TRUE, pretty = TRUE), file=glue("{dynamic_path}/_manifest.json"))
cat(glue("Wrote outputs to: {dynamic_path}\n"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
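The URI parsing in score.R leans on stringr; if you want to avoid the dependency, the same logic can be written with base R's `regexec`/`regmatches`. A sketch covering the same two accepted formats (the function name `extract_base` is my own):

```r
# Base-R variant of score.R's extract(): parse a short or full
# AzureML model URI into its name and version, with no stringr.
extract_base = function(uri) {
  u = gsub("\\s", "", uri)  # normalize: strip whitespace
  for (pat in c("^models:/([^/]+)/([0-9]+)$",          # short form
                "models/([^/]+)/versions/([0-9]+)")) { # full form
    m = regmatches(u, regexec(pat, u))[[1]]
    # on a match, m holds the full match plus the two capture groups
    if (length(m) == 3) return(list(name = m[2], ver = m[3]))
  }
  stop(sprintf("Unrecognized model URI: %s", uri))
}

extract_base("models:/customer-churn/12")
# list(name = "customer-churn", ver = "12")
```

Behavior matches the stringr version: both formats parse, and anything else raises an error.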



&lt;p&gt;&lt;strong&gt;Minimal command job (only passes the output root)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# batch-score.yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
# score.R reads its inputs positionally via commandArgs, so pass
# the model URI first and the output root second (no flags)
command: &amp;gt;-
  Rscript score.R
  ${{inputs.model_uri}}
  ${{outputs.predictions}}
environment: azureml://registries/azureml/environments/r-4.2-ubuntu2004/versions/1
inputs:
  model_uri:
    type: string
    value: models:/customer-churn/12
outputs:
  predictions:
    type: uri_folder
    mode: rw_mount
    path: azureml://datastores/outputs_datastore/paths/scoring/
compute: azureml:cpu
experiment_name: tac-batch-score
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: The script builds the full path dynamically, so your pipelines stay generic while outputs remain neatly organized by model/version and time.&lt;/p&gt;

</description>
      <category>azure</category>
      <category>mlops</category>
      <category>batchscoring</category>
      <category>r</category>
    </item>
  </channel>
</rss>
