Pascoal Eddy Bayonne

Spring Batch — Monitoring and Metrics

Monitoring Short-Lived Jobs

In the world of batch processing, Spring Batch has emerged as the de facto standard for building robust, scalable ETL jobs and data-processing pipelines. However, one of the most challenging aspects of batch processing isn’t the processing logic itself—it’s gaining visibility into what’s happening during execution and understanding the performance characteristics of these jobs.

This blog post explores how to implement comprehensive monitoring and metrics for Spring Batch applications, with a particular focus on the unique challenges posed by short-lived jobs.


What is a short-lived job?

A short-lived job is a process that:

  • Runs for a limited time (minutes to hours, not days/weeks)
  • Starts, executes a specific task, then terminates
  • Doesn’t run continuously or as a long-running service

Ephemeral or short-lived jobs


What Makes Spring Batch Jobs “Short-Lived”?

Spring Batch jobs are ephemeral by nature. They:

  • Start: Triggered by schedules (cron jobs) or events
  • Execute: Process data in chunks or as a single unit
  • Complete: Terminate after finishing their work
  • Disappear: Leave no long-running process behind

This pattern is perfect for ETL operations, import/export tasks, cleanup operations, and data transformations—but it creates a significant monitoring challenge.
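For reference, here is roughly what such a job looks like in code: a minimal, chunk-oriented job sketch in the Spring Batch 5 style. The job name matches the one used later in the logs, but the types, chunk size, and reader/writer beans are illustrative, not taken from the example project.

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.job.builder.JobBuilder;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
public class ImportCustomersJobConfig {

    record Customer(String id, String name) {}

    // Chunk-oriented step: read customers and write them in groups of 100.
    // The reader/writer beans (e.g. a FlatFileItemReader and a JdbcBatchItemWriter) are defined elsewhere.
    @Bean
    Step importStep(JobRepository jobRepository, PlatformTransactionManager txManager,
                    ItemReader<Customer> reader, ItemWriter<Customer> writer) {
        return new StepBuilder("importStep", jobRepository)
                .<Customer, Customer>chunk(100, txManager)
                .reader(reader)
                .writer(writer)
                .build();
    }

    // The job runs its step(s), completes, and the JVM exits: a short-lived process.
    @Bean
    Job importCustomersJob(JobRepository jobRepository, Step importStep) {
        return new JobBuilder("import-customers", jobRepository)
                .start(importStep)
                .build();
    }
}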


The Traditional Monitoring Problem

In a typical microservices architecture, long-running services expose metrics endpoints that monitoring systems (like Prometheus) can scrape at regular intervals (e.g., every 15–30 seconds).
Monitoring a long-lived microservice with Prometheus and Grafana
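In that pull model, Prometheus simply scrapes the always-on target on a schedule; a minimal scrape configuration sketch (job name, target, and interval are illustrative):

scrape_configs:
  - job_name: 'long-lived-service'
    metrics_path: '/actuator/prometheus'   # Spring Boot Actuator's Prometheus endpoint
    scrape_interval: 15s
    static_configs:
      - targets: ['my-service:8080']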

However, Spring Batch jobs and other short-lived applications present a different scenario:

The lifecycle of your short-lived job:
1. Job starts → The batch job, ETL process, or task is triggered.
2. Job runs → While running, it may expose metrics (e.g., through an HTTP endpoint or via Spring Boot Actuator).
3. Job completes → The process exits normally (status = SUCCESS/FAILED).
4. Job disappears → Since the process is terminated, the HTTP endpoint is gone, memory is freed, and there is no metrics endpoint anymore.
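While the job is alive (step 2 above), exposing such an endpoint is a one-liner with Spring Boot Actuator, assuming micrometer-registry-prometheus is on the classpath:

management.endpoints.web.exposure.include=health,prometheus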

The endpoint has vanished, so no metrics are collected.

Prometheus’s design assumes long-running targets that are always up. For ephemeral jobs such as Spring Batch, Prometheus may miss their metrics completely unless you change how metrics are delivered.

How to monitor short-lived jobs such as Spring Batch?


The Usual Fix: Monitoring Short-Lived Jobs

Instead of expecting Prometheus to scrape a dead process, you need a push model, e.g., using the PushGateway.

PushGateway

The PushGateway acts as a metrics buffer for ephemeral jobs. Here’s how it solves the problem:

  1. Metrics Persistence: PushGateway stores metrics pushed by batch jobs until Prometheus scrapes them.
  2. Asynchronous Collection: Jobs can push metrics immediately upon completion.
  3. Reliable Delivery: Ensures no metrics are lost due to timing issues.
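Under the hood, a push is just an HTTP request carrying the current metric values to the PushGateway. Here is a minimal sketch with the plain Prometheus Java simpleclient (io.prometheus:simpleclient_pushgateway), independent of the Spring Boot auto-configuration shown later, which does this for you:

import io.prometheus.client.CollectorRegistry;
import io.prometheus.client.Gauge;
import io.prometheus.client.exporter.PushGateway;

public class ManualPushExample {

    public static void main(String[] args) throws Exception {
        CollectorRegistry registry = new CollectorRegistry();

        // Track how long the "batch job" took.
        Gauge duration = Gauge.build()
                .name("my_batch_job_duration_seconds")
                .help("Duration of my batch job in seconds.")
                .register(registry);

        Gauge.Timer timer = duration.startTimer();
        try {
            // ... the actual batch work goes here ...
        } finally {
            timer.setDuration();
            // Push everything in the registry, grouped under the job name.
            // The PushGateway keeps these values until Prometheus scrapes them,
            // even after this process has exited.
            new PushGateway("localhost:9091").pushAdd(registry, "my_batch_job");
        }
    }
}

Spring Boot's Pushgateway support (configured in the Practice section below) automates exactly this for the Micrometer registry.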

How It Works

Spring Batch + PushGateway + Prometheus

1. Spring Batch Job → PushGateway (Pushes metrics)

  • At the end of each step (or job completion), the job pushes counters, timers, success/failure flags, etc., to the PushGateway.

2. PushGateway (Stores metrics temporarily)

  • Even after the Spring Batch job finishes, the metrics remain available for Prometheus to scrape.
  • Solves the “job disappears → nothing to scrape” problem.

3. Prometheus → PushGateway (Scrapes metrics)

  • Instead of scraping the short-lived job directly (which might be gone), Prometheus scrapes the PushGateway on a regular interval.
  • Ensures metrics persistence long enough for Prometheus to record them in its time-series database.

Grafana dashboard using Prometheus as the data source

4. Grafana → Prometheus (Visualizes metrics)

  • Grafana queries Prometheus with PromQL.
  • Since Prometheus has already scraped and stored the batch-job metrics, Grafana can display:
    • Job duration
    • Success/failure counts
    • Error rates
    • Any custom job metrics
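For example, with the metrics that Spring Batch publishes through Micrometer (they show up later in the logs as spring_batch_job_* meters), the panels can be driven by PromQL along these lines. Metric and tag names can vary slightly between Spring Batch and Micrometer versions, so check what your PushGateway actually exposes:

# Completed job executions, by job name and final status
sum by (spring_batch_job_name, spring_batch_job_status) (spring_batch_job_seconds_count)

# Average job duration in seconds
spring_batch_job_seconds_sum / spring_batch_job_seconds_count

# Failed executions only
spring_batch_job_seconds_count{spring_batch_job_status="FAILED"}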

Example Grafana dashboard for a Spring Batch application


Practice:

Visit my YouTube channel for the full hands-on code example. Let me summarize.

Complete example here

1. Set up Docker for Prometheus + Grafana + Pushgateway


services:
  prometheus:
    image: prom/prometheus
    container_name: 'prometheus'
    ports:
      - '9090:9090'
    volumes:
      - /Users/pascoalbayonne/dev/prometheus.yml:/etc/prometheus/prometheus.yml
    networks:
      - jaeger-example

  pushgateway:
    image: prom/pushgateway
    container_name: 'pushgateway'
    ports:
      - '9091:9091'
    networks:
      - jaeger-example

  grafana:
    image: grafana/grafana
    container_name: 'grafana'
    ports:
      - '3000:3000'
    networks:
      - jaeger-example
    depends_on:
      - prometheus

networks:
  jaeger-example:
    driver: bridge
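The compose file mounts a prometheus.yml into the Prometheus container; a minimal sketch of that file for this setup (honor_labels keeps the job label pushed by the batch application instead of letting Prometheus overwrite it, and the pushgateway hostname resolves because both containers share the jaeger-example network):

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'pushgateway'
    honor_labels: true            # keep the labels pushed by the batch job
    static_configs:
      - targets: ['pushgateway:9091']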

2. Enable Pushgateway in your Spring Batch application


management.prometheus.metrics.export.enabled=true
management.prometheus.metrics.export.pushgateway.enabled=true
management.prometheus.metrics.export.pushgateway.address=localhost:9091
management.prometheus.metrics.export.pushgateway.push-rate=1ms
management.prometheus.metrics.export.pushgateway.job=${spring.application.name}

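Spring Batch already publishes job and step timers through Micrometer, so the properties above are enough for basic dashboards. If you also want business-level metrics, here is a hedged sketch of a JobExecutionListener that adds a completion counter (metric name and tags are illustrative):

import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionListener;
import org.springframework.stereotype.Component;

@Component
public class JobMetricsListener implements JobExecutionListener {

    private final MeterRegistry registry;

    public JobMetricsListener(MeterRegistry registry) {
        this.registry = registry;
    }

    @Override
    public void afterJob(JobExecution jobExecution) {
        // Counter tagged with the job name and its final status; it is pushed to the
        // PushGateway together with the built-in spring.batch.* metrics.
        registry.counter("batch.job.completions",
                        "job", jobExecution.getJobInstance().getJobName(),
                        "status", jobExecution.getStatus().name())
                .increment();
    }
}

Attach it to the job definition with .listener(...) on the JobBuilder so afterJob runs when the job finishes.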

3. Run the Spring Batch application (short-lived)
The Spring Batch application sends its metrics to the Pushgateway. Check the logs:


2025-08-23T17:41:04.457+01:00  INFO 47721 --- [multipart-download] [           main] o.h.e.t.j.p.i.JtaPlatformInitiator       : HHH000489: No JTA platform available (set 'hibernate.transaction.jta.platform' to enable JTA platform integration)
2025-08-23T17:41:04.513+01:00  INFO 47721 --- [multipart-download] [           main] j.LocalContainerEntityManagerFactoryBean : Initialized JPA EntityManagerFactory for persistence unit 'default'
2025-08-23T17:41:04.812+01:00  INFO 47721 --- [multipart-download] [           main] p.b.s.m.MultipartDownloadApplication     : Started MultipartDownloadApplication in 2.932 seconds (process running for 3.378)
2025-08-23T17:41:04.814+01:00  INFO 47721 --- [multipart-download] [           main] o.s.b.a.b.JobLauncherApplicationRunner   : Running default command line with: []
2025-08-23T17:41:04.887+01:00  INFO 47721 --- [multipart-download] [           main] o.s.b.c.l.s.TaskExecutorJobLauncher      : Job: [FlowJob: [name=import-customers]] launched with the following parameters: [{'run.id':'{value=20, type=class java.lang.Long, identifying=true}'}]
2025-08-23T17:41:04.896+01:00  WARN 47721 --- [multipart-download] [           main] i.m.p.PrometheusMeterRegistry            : The meter (MeterId{name='spring.batch.job.active', tags=[tag(job-deprecated=false),tag(spring.batch.job.name=import-customers),tag(spring.batch.job.status=UNKNOWN)]}) registration has failed: Prometheus requires that all meters with the same name have the same set of tag keys. There is already an existing meter named 'spring_batch_job_active_seconds' containing tag keys [job_deprecated, spring_batch_job_active_name]. The meter you are attempting to register has keys [job_deprecated, spring_batch_job_name, spring_batch_job_status]. Note that subsequent logs will be logged at debug level.
2025-08-23T17:41:04.910+01:00  INFO 47721 --- [multipart-download] [           main] o.s.batch.core.job.SimpleStepHandler     : Executing step: [downloadFileStep]
2025-08-23T17:41:10.097+01:00  INFO 47721 --- [multipart-download] [           main] s.a.a.t.s.p.LoggingTransferListener      : Transfer initiated...
2025-08-23T17:41:10.157+01:00  INFO 47721 --- [multipart-download] [ AwsEventLoop 2] s.a.a.t.s.p.LoggingTransferListener      : |====================| 100.0%
2025-08-23T17:41:10.160+01:00  INFO 47721 --- [multipart-download] [           main] p.b.s.m.service.CustomS3Client           : file: 2024-sales-info 3000.csv downloaded successfully from bucket sales-info
2025-08-23T17:41:10.160+01:00  INFO 47721 --- [multipart-download] [nc-response-1-0] s.a.a.t.s.p.LoggingTransferListener      : Transfer complete!
2025-08-23T17:41:10.163+01:00  INFO 47721 --- [multipart-download] [           main] c.v.flexypool.FlexyPoolDataSource        : Connection leased for 5246 millis, while threshold is set to 1000 in dataSource FlexyPoolDataSource
2025-08-23T17:41:10.165+01:00  INFO 47721 --- [multipart-download] [           main] o.s.batch.core.step.AbstractStep         : Step: [downloadFileStep] executed in 5s254ms
2025-08-23T17:41:10.177+01:00  INFO 47721 --- [multipart-download] [           main] o.s.batch.core.job.SimpleStepHandler     : Executing step: [fromFileDownloadedToDb]
2025-08-23T17:41:11.124+01:00  INFO 47721 --- [multipart-download] [           main] o.s.batch.core.step.AbstractStep         : Step: [fromFileDownloadedToDb] executed in 946ms
2025-08-23T17:41:11.131+01:00  INFO 47721 --- [multipart-download] [           main] o.s.b.c.l.s.TaskExecutorJobLauncher      : Job: [FlowJob: [name=import-customers]] completed with the following parameters: [{'run.id':'{value=20, type=class java.lang.Long, identifying=true}'}] and the following status: [COMPLETED] in 6s232ms
2025-08-23T17:41:11.135+01:00  INFO 47721 --- [multipart-download] [ionShutdownHook] j.LocalContainerEntityManagerFactoryBean : Closing JPA EntityManagerFactory for persistence unit 'default'
2025-08-23T17:41:11.138+01:00  INFO 47721 --- [multipart-download] [ionShutdownHook] com.zaxxer.hikari.HikariDataSource       : HikariPool-1 - Shutdown initiated...
2025-08-23T17:41:11.140+01:00  INFO 47721 --- [multipart-download] [ionShutdownHook] com.zaxxer.hikari.HikariDataSource       : HikariPool-1 - Shutdown completed.



4. PushGateway receives the metrics
The Pushgateway receives and stores the pushed metrics until Prometheus scrapes them.

5. Prometheus scrapes the metrics from PushGateway
PromQL

6. Grafana: create a dashboard using Prometheus as the data source

Grafana Dashboard
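The Prometheus data source can be added through the Grafana UI (URL http://prometheus:9090, reachable by container name on the shared network), or declaratively with Grafana's provisioning mechanism; a minimal sketch, assuming the file is mounted under /etc/grafana/provisioning/datasources/ in the Grafana container:

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090   # container name on the jaeger-example network
    isDefault: true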


Conclusion

Monitoring Spring Batch applications requires a different approach than traditional long-running services. The Prometheus PushGateway pattern provides the missing link that enables comprehensive observability for ephemeral batch jobs.

By implementing this monitoring strategy, you gain:

  • Real-time visibility into job performance and health
  • Historical trends for capacity planning and optimization
  • Proactive alerting for job failures and performance degradation
  • Business intelligence through custom metrics and dashboards

The combination of Spring Batch, Prometheus PushGateway, and Grafana creates a powerful observability stack that transforms batch processing from a “black box” into a transparent, measurable, and optimizable system.

“Remember: You can’t optimize what you can’t measure.”
With proper monitoring in place, your Spring Batch applications become not just functional, but truly observable and maintainable.


This blog post is part of our series on modern batch processing with Spring Boot and Java 21. Stay tuned for more insights on leveraging Virtual Threads and other Java 21 features in your batch applications.

Watch the complete hands-on code tutorial on YouTube

