Sadiul Hakim

Spring Batch Tutorial Part #1

1. What is Spring Batch?

Spring Batch is a lightweight, comprehensive framework for batch processing that's part of the Spring ecosystem. It's designed to build robust and scalable batch applications for both enterprise and small-scale use. Batch applications are programs that process large volumes of data without human interaction, often running on a schedule. Think of tasks like processing daily transaction reports, generating monthly statements, or migrating data.


2. How Spring Batch Works

Spring Batch provides reusable functions that are essential for processing large datasets, including logging, transaction management, restartability, and skip functionality. The framework follows a common pattern called ETL (Extract, Transform, Load).

  • Extract: An ItemReader reads data from a source, like a database or a file.
  • Transform: An optional ItemProcessor modifies or filters the data.
  • Load: An ItemWriter writes the processed data to a destination.

The core of a Spring Batch application is a Job. A Job is made up of one or more Steps. Each Step is a self-contained sequence of the read-process-write operation.
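
To make these roles concrete, here is a minimal sketch of hand-written implementations of the three interfaces (Spring Batch 4 style signatures; the class names are illustrative, not part of the framework):

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;

// Extract: reads one item per call; returning null signals end of input.
class StaticStringReader implements ItemReader<String> {
    private final Iterator<String> source = Arrays.asList("alice", "bob").iterator();

    @Override
    public String read() {
        return source.hasNext() ? source.next() : null;
    }
}

// Transform: applies business logic to each item.
class UppercaseProcessor implements ItemProcessor<String, String> {
    @Override
    public String process(String item) {
        return item.toUpperCase();
    }
}

// Load: receives a whole chunk of items at once.
class ConsoleWriter implements ItemWriter<String> {
    @Override
    public void write(List<? extends String> items) {
        items.forEach(System.out::println);
    }
}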


3. When to Use It and When Not to Use It

Use Spring Batch when:

  • You need to process large volumes of data periodically.
  • You need to perform scheduled data migrations.
  • You require restartability and skip-on-error functionality for your batch jobs.
  • You need a centralized way to manage and monitor batch jobs.

Don't use Spring Batch when:

  • You're building real-time, user-facing applications that require immediate responses.
  • Your application is a typical web application serving synchronous requests.
  • The data volume is very small and a simple script or a direct database operation would suffice.

4. Key Features

  • Restartability: Jobs can be restarted from where they left off if they fail.
  • Chunk-based processing: Processes data in chunks, which is highly efficient and minimizes memory usage (see the step sketch after this list).
  • Declarative I/O: Provides pre-built ItemReader and ItemWriter classes for common data sources (e.g., files, databases).
  • Scalability: Supports various scaling models, from multi-threaded steps to parallel processing.
  • Transaction Management: Manages transactions to ensure data integrity during processing.
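
As a quick illustration of how chunking and skip-on-error come together, a fault-tolerant step in the Spring Batch 4 style might be declared like this (a minimal sketch meant for a @Configuration class like the one in section 8; the bean name, chunk size, and the choice of FlatFileParseException are assumptions):

    @Bean
    public Step faultTolerantStep(ItemReader<Employee> reader, ItemWriter<Employee> writer) {
        return stepBuilderFactory.get("faultTolerantStep")
                .<Employee, Employee>chunk(100)         // commit interval
                .reader(reader)
                .writer(writer)
                .faultTolerant()                        // enable skip/retry support
                .skip(FlatFileParseException.class)     // skip lines that fail to parse
                .skipLimit(10)                          // ...but abort after 10 skips
                .build();
    }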

5. What are Job, Step, ItemReader, ItemProcessor, ItemWriter

These are the core components of a Spring Batch application.

  • Job: The highest-level abstraction, encapsulating the entire batch process. It's composed of one or more Steps (a launch example follows this list).
  • Step: A self-contained, independent phase of a Job. Most Steps follow the read-process-write chunk-oriented model.
  • ItemReader: Reads data from a source one item at a time. Examples: reading a line from a file or a row from a database.
  • ItemProcessor: (Optional) Processes or transforms an item read by the ItemReader. This is where business logic is applied.
  • ItemWriter: Writes a chunk of processed items to a destination.
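
Jobs are usually launched automatically by Spring Boot at startup, but they can also be triggered programmatically through a JobLauncher. A minimal sketch (the service class and the startedAt parameter are illustrative, not part of the framework):

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.stereotype.Service;

@Service
public class JobTriggerService {

    private final JobLauncher jobLauncher;
    private final Job importEmployeeJob;

    public JobTriggerService(JobLauncher jobLauncher, Job importEmployeeJob) {
        this.jobLauncher = jobLauncher;
        this.importEmployeeJob = importEmployeeJob;
    }

    public void runJob() throws Exception {
        // A unique parameter makes every run a new JobInstance; re-running
        // with identical parameters resumes the failed instance instead.
        JobParameters params = new JobParametersBuilder()
                .addLong("startedAt", System.currentTimeMillis())
                .toJobParameters();
        jobLauncher.run(importEmployeeJob, params);
    }
}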

6. What are FlatFileItemReader, LineTokenizer, LineMapper

These components are specifically used for reading data from flat files, like CSVs or fixed-width files.

  • FlatFileItemReader: A specific implementation of ItemReader for reading from flat files. It reads one line at a time.
  • LineTokenizer: A helper interface used by the FlatFileItemReader. It takes a single line of text and tokenizes it, splitting it into a FieldSet of named tokens. DelimitedLineTokenizer is a common implementation for comma-separated files.
  • LineMapper: An interface that maps an entire line to an object. The DefaultLineMapper implementation delegates to a LineTokenizer to split the line and to a FieldSetMapper to populate a Java object from the resulting fields (see the example after this list).
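
A standalone illustration of the tokenizer (a sketch, independent of the job built in section 8):

import org.springframework.batch.item.file.transform.DelimitedLineTokenizer;
import org.springframework.batch.item.file.transform.FieldSet;

public class TokenizerDemo {
    public static void main(String[] args) {
        DelimitedLineTokenizer tokenizer = new DelimitedLineTokenizer();
        tokenizer.setNames("id", "firstName", "lastName");

        // tokenize() splits the raw line into a FieldSet of named tokens
        FieldSet fields = tokenizer.tokenize("1,John,Doe");
        System.out.println(fields.readLong("id"));          // 1
        System.out.println(fields.readString("firstName")); // John
    }
}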

7. FlatFileItemWriter, JdbcBatchItemWriter (not JdbcCursorItemWriter)

  • FlatFileItemWriter: An implementation of ItemWriter for writing data to a flat file. It takes a chunk of objects and writes each one as a line to the file.
  • JdbcBatchItemWriter: (There is no JdbcCursorItemWriter; that name mixes up the reader and writer classes.) The component for writing to a database is typically JdbcBatchItemWriter. It uses a PreparedStatement to perform batch inserts or updates, which is highly performant. Its reading counterpart, JdbcCursorItemReader, reads from a database using a cursor. A short FlatFileItemWriter sketch follows this list.
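
For completeness, a minimal FlatFileItemWriter bean might look like this (a sketch for a @Configuration class; the output path and field names are assumptions, and Employee is the model from section 8):

    @Bean
    public FlatFileItemWriter<Employee> csvWriter() {
        // Extracts the named bean properties from each Employee...
        BeanWrapperFieldExtractor<Employee> extractor = new BeanWrapperFieldExtractor<>();
        extractor.setNames(new String[] {"id", "firstName", "lastName"});

        // ...and joins them into one comma-separated line per item.
        DelimitedLineAggregator<Employee> aggregator = new DelimitedLineAggregator<>();
        aggregator.setDelimiter(",");
        aggregator.setFieldExtractor(extractor);

        FlatFileItemWriter<Employee> writer = new FlatFileItemWriter<>();
        writer.setResource(new FileSystemResource("output/employees.csv"));
        writer.setLineAggregator(aggregator);
        return writer;
    }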

8. Example of Reading Employee Data from CSV and Writing to a Database

Prerequisites

  • Dependencies: You'll need spring-boot-starter-batch and spring-boot-starter-data-jpa (or jdbc), along with a database driver (e.g., h2, mysql, postgresql). The target employee table must also exist before the job runs, for example via a schema.sql file or JPA's ddl-auto setting.

Step 1: Create the Employee Model and Entity

Create a simple Java POJO that represents the employee data. Use @Entity for JPA mapping. (On Spring Boot 3, these imports move from javax.persistence to jakarta.persistence.)

import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Temporal;
import javax.persistence.TemporalType;
import java.util.Date;

@Entity
public class Employee {

    @Id
    private Long id;
    private String firstName;
    private String lastName;

    @Temporal(TemporalType.DATE) // map java.util.Date as a SQL DATE
    private Date hireDate;

    // Getters and setters (required: BeanWrapperFieldSetMapper and the
    // JDBC writer both access the fields through these accessors)
}

Step 2: Create the Batch Configuration Class

Use @EnableBatchProcessing to enable Spring Batch. (This example uses the Spring Batch 4 / Spring Boot 2 style; in Spring Batch 5, JobBuilderFactory and StepBuilderFactory were removed in favor of JobBuilder and StepBuilder, and Spring Boot 3's auto-configuration makes the annotation unnecessary.)

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.database.BeanPropertyItemSqlParameterSourceProvider;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.LineMapper;
import org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper;
import org.springframework.batch.item.file.mapping.DefaultLineMapper;
import org.springframework.batch.item.file.transform.DelimitedLineTokenizer;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.propertyeditors.CustomDateEditor;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.ClassPathResource;

import javax.sql.DataSource;
import java.beans.PropertyEditor;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

@Configuration
@EnableBatchProcessing
public class BatchConfig {

    @Autowired
    public JobBuilderFactory jobBuilderFactory;

    @Autowired
    public StepBuilderFactory stepBuilderFactory;

    @Autowired
    public DataSource dataSource;

    // 1. ItemReader
    @Bean
    public FlatFileItemReader<Employee> reader() {
        FlatFileItemReader<Employee> reader = new FlatFileItemReader<>();
        reader.setResource(new ClassPathResource("employee.csv"));
        reader.setLineMapper(lineMapper());
        return reader;
    }

    @Bean
    public LineMapper<Employee> lineMapper() {
        DefaultLineMapper<Employee> lineMapper = new DefaultLineMapper<>();
        DelimitedLineTokenizer lineTokenizer = new DelimitedLineTokenizer();
        lineTokenizer.setDelimiter(",");
        lineTokenizer.setNames("id", "firstName", "lastName", "hireDate");

        BeanWrapperFieldSetMapper<Employee> fieldSetMapper = new BeanWrapperFieldSetMapper<>();
        fieldSetMapper.setTargetType(Employee.class);

        // Register a Date editor so the hireDate column (yyyy-MM-dd) can be
        // bound to java.util.Date; without it, binding the string would fail.
        Map<Class<?>, PropertyEditor> editors = new HashMap<>();
        editors.put(Date.class, new CustomDateEditor(new SimpleDateFormat("yyyy-MM-dd"), false));
        fieldSetMapper.setCustomEditors(editors);

        lineMapper.setLineTokenizer(lineTokenizer);
        lineMapper.setFieldSetMapper(fieldSetMapper);
        return lineMapper;
    }

    // 2. ItemProcessor (Optional)
    @Bean
    public ItemProcessor<Employee, Employee> processor() {
        return new EmployeeProcessor();
    }

    // 3. ItemWriter
    @Bean
    public JdbcBatchItemWriter<Employee> writer() {
        JdbcBatchItemWriter<Employee> writer = new JdbcBatchItemWriter<>();
        writer.setItemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider<>());
        writer.setSql("INSERT INTO employee (id, first_name, last_name, hire_date) VALUES (:id, :firstName, :lastName, :hireDate)");
        writer.setDataSource(dataSource);
        return writer;
    }

    // Job and Step
    @Bean
    public Job importEmployeeJob(JobCompletionNotificationListener listener, Step step1) {
        return jobBuilderFactory.get("importEmployeeJob")
                .listener(listener)
                .flow(step1)
                .end()
                .build();
    }

    @Bean
    public Step step1(ItemReader<Employee> reader, ItemWriter<Employee> writer, ItemProcessor<Employee, Employee> processor) {
        return stepBuilderFactory.get("step1")
                .<Employee, Employee>chunk(10) // Process 10 items at a time
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .build();
    }
}
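
Note that importEmployeeJob above references a JobCompletionNotificationListener that isn't defined elsewhere in this part. A minimal version could look like this (a sketch; log or verify whatever makes sense for your job):

import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.listener.JobExecutionListenerSupport;
import org.springframework.stereotype.Component;

@Component
public class JobCompletionNotificationListener extends JobExecutionListenerSupport {

    @Override
    public void afterJob(JobExecution jobExecution) {
        // Called once the whole job finishes, successfully or not.
        if (jobExecution.getStatus() == BatchStatus.COMPLETED) {
            System.out.println("JOB FINISHED! Verify the results in the employee table.");
        }
    }
}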

Step 3: Create the Processor Class

The processor can be used to add business logic, like capitalizing names.

import org.springframework.batch.item.ItemProcessor;

public class EmployeeProcessor implements ItemProcessor<Employee, Employee> {

    @Override
    public Employee process(final Employee employee) {
        final String firstName = employee.getFirstName().toUpperCase();
        final String lastName = employee.getLastName().toUpperCase();

        final Employee transformedEmployee = new Employee();
        transformedEmployee.setId(employee.getId());
        transformedEmployee.setFirstName(firstName);
        transformedEmployee.setLastName(lastName);
        transformedEmployee.setHireDate(employee.getHireDate());

        return transformedEmployee;
    }
}
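
One useful detail: if an ItemProcessor returns null for an item, Spring Batch treats that item as filtered and silently drops it from the chunk, so a processor can double as a filter.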

Step 4: Create employee.csv file

Create a src/main/resources/employee.csv file with your data.

1,John,Doe,2023-01-15
2,Jane,Smith,2022-05-20
3,Peter,Jones,2024-03-10

The core of Spring Batch's efficiency lies in its chunk-oriented processing, where the ItemReader, ItemProcessor, and ItemWriter interact in a specific, performance-optimized way.


9. How the Read-Process-Write Loop Works

  1. Read an Item: The ItemReader reads a single item from the data source (e.g., a single row from a database, one line from a CSV file).
  2. Process the Item: The item is immediately passed to the ItemProcessor (if one is configured), where any business logic, transformation, or filtering is applied.
  3. Accumulate Items: Instead of writing the item right away, the framework holds the processed item in memory.
  4. Repeat until Chunk Size is Reached: Steps 1-3 are repeated. The ItemReader continues reading, and the ItemProcessor continues processing, accumulating items in an internal list.
  5. Write the Chunk: Once the number of processed items in memory reaches the configured chunk size, the entire list of items (the "chunk") is passed to the ItemWriter. The ItemWriter then writes all items in that chunk to the destination in a single, batched operation.

This behavior is highly efficient because it minimizes the number of I/O operations. For example, instead of executing 100 separate INSERT statements for 100 items, the JdbcBatchItemWriter can perform a single, highly performant batch insert of all 100 items at once. This reduces the overhead associated with establishing connections and managing transactions for each individual item.

The transaction boundary in a chunk-oriented step is around the entire chunk. The transaction is committed only after the ItemReader has finished reading, the ItemProcessor has finished processing, and the ItemWriter has successfully written all the items in the chunk. If any part of this process fails, the entire chunk's work is rolled back, providing restartability and data integrity.
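
Conceptually, the chunk loop behaves roughly like this (a simplified sketch for intuition only, not the framework's actual code; chunkSize, reader, processor, and writer are assumed to be in scope, and checked exceptions are omitted):

List<Employee> chunk = new ArrayList<>();
Employee item;
while (chunk.size() < chunkSize && (item = reader.read()) != null) {
    Employee processed = processor.process(item); // may return null to filter the item out
    if (processed != null) {
        chunk.add(processed);
    }
}
writer.write(chunk); // one batched I/O operation for the whole chunk
// the transaction commits here; on failure the whole chunk rolls back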

Note: a Job consists of multiple Steps, and each Step consists of either a Reader, an optional Processor, and a Writer, or a single Tasklet.
