
Mircea Cadariu


Loading One-to-Many relationships efficiently using Spring Data JPA and Postgres

Intro

The One-to-Many relationship, or parent-child, is a common occurrence in application development. Off the cuff, we can name numerous instances: a sports team and its players, blog posts and their comments, and so on. A natural solution for this use case is a relational database with foreign key constraints to enforce data integrity.

This post focuses on the following task: how to efficiently load the list of parents, and all their corresponding children in one go, with Spring Data JPA and Postgres. We'll start with the slowest approach (hello, N+1!) and show how to make it faster through successive refinements. The code is available in this repo.

Authors and books

Let's use a familiar scenario: authors and their books. This is how we'll create the tables.

create table authors (
    id             bigint primary key generated always as identity,
    name           varchar(255) not null,
    bio            text
);

create table books (
    id             bigint primary key generated always as identity,
    title          varchar(255) not null,
    isbn           varchar(13),
    published_year integer,
    author_id      bigint not null,
    constraint fk_books_author foreign key (author_id) references authors(id)
);

create index idx_books_author_id on books(author_id);
Note that Postgres does not automatically create an index on foreign key columns (only on the referenced primary key), so we add one ourselves; it will pay off when joining the two tables later.

Populating the tables

Let's insert some data to work with. We'll generate 1000 authors, each with 30 books.

insert into authors (name, bio)
select
    'Author ' || seq,
    'Bio for author ' || seq
from generate_series(1, 1000) seq;

insert into books (author_id, title, isbn, published_year)
select
    a.id as author_id,
    'Book ' || gs || ' of author ' || a.id as title,
    random()::bigint::text as isbn,
    (2000 + floor(random() * 25))::int as published_year
from authors a, generate_series(1, 30) as gs;
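As a quick sanity check of what we just generated (1000 authors times 30 books each):

```sql
select count(*) from authors;  -- 1000
select count(*) from books;    -- 30000
```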

Entities

We'll create two entity classes, Author and Book. In order to learn how to map them correctly with JPA/Hibernate, you can read this post.

In the Book class, we'll reference Author like this.

@ManyToOne(fetch = LAZY)
@JoinColumn(name = "author_id")
private Author author;

Accordingly, in the Author class:

@OneToMany(mappedBy = "author", fetch = LAZY)
private List<Book> books = new ArrayList<>();

At this point, we have the tables, the test data, and the entities. So far so good! We're ready to look at the different ways to query the data.

Querying

We always want to keep a close eye on the queries Hibernate generates for us under the hood, in order to avoid surprises. In our tests, we can enable SQL logging with the following setting.

@DynamicPropertySource 
static void registerPgProperties(DynamicPropertyRegistry registry) {
  registry.add("spring.jpa.show_sql", () -> true);
}

Iteration 1

We'll start with a pure Java approach, which actually looks quite elegant.

return authorRepository
        .findAll()
        .stream()
        .map(author -> {
            List<Book> books = author.getBooks();
            return new AuthorWithBooksDto(
                    author.getId(),
                    author.getName(),
                    author.getBio(),
                    books.stream()
                            .map(book -> new BookDto(
                                    book.getTitle(),
                                    book.getIsbn(),
                                    book.getPublishedYear()))
                            .collect(toList()));
        })
        .collect(toList());


But when running it, we immediately notice our console filling up with queries. You've just witnessed the N+1 problem: Hibernate issues one query to load all the authors, plus one additional query per author the moment we touch its lazy books collection. You want to avoid this if you want a fast application.
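In abbreviated form, the pattern in the log looks roughly like this:

```sql
-- 1 query to load all the parents...
select a1_0.id, a1_0.bio, a1_0.name from authors a1_0
-- ...followed by N queries, one per author, as each books collection is initialized
select b1_0.author_id, b1_0.id, b1_0.isbn, b1_0.published_year, b1_0.title
from books b1_0 where b1_0.author_id = ?
```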

Iteration 2

Alright, let's make this better. This is what we'll add in the repository class:

 @Query("SELECT a FROM Author a JOIN FETCH a.books")
 List<Author> findAllWithBooks();

Great stuff! It turns out this cuts the time roughly in half, because Hibernate now generates only a single query. You should always try to load all the data you need with as few queries as possible. Here it is:

select a1_0.id,a1_0.bio,b1_0.author_id,b1_0.id,b1_0.isbn,b1_0.published_year,b1_0.title,a1_0.name from authors a1_0 join books b1_0 on a1_0.id=b1_0.author_id
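As a side note, if you prefer not to hand-write the JPQL, Spring Data can derive a similar single-query fetch from an entity graph. A sketch (the method name is made up; note that the generated SQL uses a left join, so authors without books would be included too):

```java
@EntityGraph(attributePaths = "books")
@Query("SELECT a FROM Author a")
List<Author> findAllWithBooksViaEntityGraph();
```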

Let's have a look at the explain plan to learn the steps the database took to retrieve our data.

 Hash Join  (cost=31.50..631.58 rows=30000 width=65) (actual time=0.476..8.912 rows=30000 loops=1)
   Hash Cond: (b1_0.author_id = a1_0.id)
   Buffers: shared hit=230
   ->  Seq Scan on books b1_0  (cost=0.00..521.00 rows=30000 width=29) (actual time=0.009..2.466 rows=30000 loops=1)
         Buffers: shared hit=221
   ->  Hash  (cost=19.00..19.00 rows=1000 width=36) (actual time=0.431..0.432 rows=1000 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 77kB
         Buffers: shared hit=9
         ->  Seq Scan on authors a1_0  (cost=0.00..19.00 rows=1000 width=36) (actual time=0.010..0.169 rows=1000 loops=1)
               Buffers: shared hit=9

Nothing surprising: it joined the two tables, authors and books, using the hash join algorithm. Note, though, the rows=30000 on the first line of the explain plan. This tells us that our final result set consists of 30000 rows. Let's also have a look at the layout of these rows. Below are the first 10 rows of the result set.

(screenshot: the first 10 rows of the result set, with the author columns repeated for every book row)

You can see where I'm going with this. Because of the join, we are fetching into our application a result set that is larger than necessary: the author columns are duplicated for every book row. In other words, quite wasteful.
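To see what the application then has to undo, here is a minimal, self-contained sketch of grouping such a flat result set back into one entry per author. The FlatRow record and all names are made up for illustration; conceptually, this is what Hibernate does when deduplicating the fetched authors.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupingDemo {

    // One row of the flat join result: the author columns repeat for every book.
    record FlatRow(long authorId, String authorName, String bookTitle) {}

    // Group the flat rows back into one entry per author,
    // conceptually what Hibernate does after a join fetch.
    static Map<Long, List<String>> group(List<FlatRow> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                FlatRow::authorId,
                LinkedHashMap::new,
                Collectors.mapping(FlatRow::bookTitle, Collectors.toList())));
    }

    public static void main(String[] args) {
        List<FlatRow> rows = List.of(
                new FlatRow(1, "Author 1", "Book 1"),
                new FlatRow(1, "Author 1", "Book 2"),
                new FlatRow(2, "Author 2", "Book 1"));
        Map<Long, List<String>> booksByAuthor = group(rows);
        System.out.println(booksByAuthor.size());         // prints 2
        System.out.println(booksByAuthor.get(1L).size()); // prints 2
    }
}
```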

Iteration 3

Let's try something else. We will construct the expected shape of the response fully database-side, using Postgres features. For this, we'll have to write a native query like the one below.

@Query(value = """
            SELECT
                a.id,
                a.name,
                a.bio,
                jsonb_agg(
                    jsonb_build_object(
                        'id', b.id,
                        'title', b.title,
                        'isbn', b.isbn,
                        'publishedYear', b.published_year
                    )
                ) AS books
            FROM authors a
            JOIN books b ON b.author_id = a.id
            GROUP BY a.id;
        """, nativeQuery = true)
    List<Object[]> findAllWithBooksAsJson();
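Each row of this result comes back as an Object[] with four elements: the three author columns plus the aggregated JSON string. A minimal sketch of mapping such rows into a DTO, assuming we keep the books as raw JSON text. The AuthorWithBooksJson record is a hypothetical name, and how the jsonb column surfaces (e.g. as a String or a driver-specific wrapper) depends on the JDBC driver, hence the defensive toString().

```java
import java.util.List;
import java.util.stream.Collectors;

public class RowMappingDemo {

    // Hypothetical DTO: books stays as the raw JSON produced by jsonb_agg.
    record AuthorWithBooksJson(long id, String name, String bio, String booksJson) {}

    static List<AuthorWithBooksJson> map(List<Object[]> rows) {
        return rows.stream()
                .map(r -> new AuthorWithBooksJson(
                        ((Number) r[0]).longValue(), // id arrives as some numeric type
                        (String) r[1],
                        (String) r[2],
                        r[3].toString()))            // toString yields the JSON text
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Object[] row = {1L, "Author 1", "Bio for author 1", "[{\"title\": \"Book 1\"}]"};
        List<AuthorWithBooksJson> dtos = map(List.<Object[]>of(row));
        System.out.println(dtos.get(0).name()); // prints "Author 1"
    }
}
```

From here, the JSON string can be parsed into BookDto objects or passed through to the client as-is.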

This is what the result looks like:

 652 | Author 652  | Bio for author 652  | [{"id": 652, "isbn": "0", "title": "Book 1 of author 652", "publishedYear": 2013}, {"id": 1652, "isbn": "1", "title": "Book 2 of author 652", "publishedYear": 2010}, {"id": 2652, "isbn": "1", "title": "Book 3 of author 652", "publishedYear": 2000}, {"id": 3652, "isbn": "0", "title": "Book 4 of author 652", "publishedYear": 2002}, {"id": 4652, "isbn": "1", "title": "Book 5 of author 652", "publishedYear": 2019}, {"id": 5652, "isbn": "1", "title": "Book 6 of author 652", "publishedYear": 2006}, {"id": 6652, "isbn": "1", "title": "Book 7 of author 652", "publishedYear": 2020}, {"id": 7652, "isbn": "1", "title": "Book 8 of author 652", "publishedYear": 2004}, {"id": 8652, "isbn": "1", "title": "Book 9 of author 652", "publishedYear": 2010}, {"id": 9652, "isbn": "1", "title": "Book 10 of author 652", "publishedYear": 2022}, {"id": 10652, "isbn": "1", "title": "Book 11 of author 652", "publishedYear": 2001}, {"id": 11652, "isbn": "1", "title": "Book 12 of author 652", "publishedYear": 2010}, {"id": 12652, "isbn": "1", "title": "Book 13 of author 652", "publishedYear": 2024}, {"id": 13652, "isbn": "1", "title": "Book 14 of author 652", "publishedYear": 2021}, {"id": 14652, "isbn": "0", "title": "Book 15 of author 652", "publishedYear": 2004}, {"id": 15652, "isbn": "1", "title": "Book 16 of author 652", "publishedYear": 2001}, {"id": 16652, "isbn": "1", "title": "Book 17 of author 652", "publishedYear": 2001}, {"id": 17652, "isbn": "0", "title": "Book 18 of author 652", "publishedYear": 2020}, {"id": 18652, "isbn": "0", "title": "Book 19 of author 652", "publishedYear": 2009}, {"id": 19652, "isbn": "1", "title": "Book 20 of author 652", "publishedYear": 2000}, {"id": 20652, "isbn": "1", "title": "Book 21 of author 652", "publishedYear": 2000}, {"id": 21652, "isbn": "1", "title": "Book 22 of author 652", "publishedYear": 2013}, {"id": 22652, "isbn": "1", "title": "Book 23 of author 652", "publishedYear": 2012}, {"id": 23652, 
"isbn": "0", "title": "Book 24 of author 652", "publishedYear": 2014}, {"id": 24652, "isbn": "0", "title": "Book 25 of author 652", "publishedYear": 2001}, {"id": 25652, "isbn": "1", "title": "Book 26 of author 652", "publishedYear": 2016}, {"id": 26652, "isbn": "1", "title": "Book 27 of author 652", "publishedYear": 2014}, {"id": 27652, "isbn": "0", "title": "Book 28 of author 652", "publishedYear": 2024}, {"id": 28652, "isbn": "0", "title": "Book 29 of author 652", "publishedYear": 2016}, {"id": 29652, "isbn": "0", "title": "Book 30 of author 652", "publishedYear": 2012}]
 273 | Author 273  | Bio for author 273  | [{"id": 273, "isbn": "1", "title": "Book 1 of author 273", "publishedYear": 2007}, {"id": 1273, "isbn": "1", "title": "Book 2 of author 273", "publishedYear": 2010}, {"id": 2273, "isbn": "1", "title": "Book 3 of author 273", "publishedYear": 2023}, {"id": 3273, "isbn": "0", "title": "Book 4 of author 273", "publishedYear": 2010}, {"id": 4273, "isbn": "1", "title": "Book 5 of author 273", "publishedYear": 2020}, {"id": 5273, "isbn": "0", "title": "Book 6 of author 273", "publishedYear": 2013}, {"id": 6273, "isbn": "0", "title": "Book 7 of author 273", "publishedYear": 2008}, {"id": 7273, "isbn": "1", "title": "Book 8 of author 273", "publishedYear": 2012}, {"id": 8273, "isbn": "1", "title": "Book 9 of author 273", "publishedYear": 2001}, {"id": 9273, "isbn": "0", "title": "Book 10 of author 273", "publishedYear": 2011}, {"id": 10273, "isbn": "0", "title": "Book 11 of author 273", "publishedYear": 2005}, {"id": 11273, "isbn": "1", "title": "Book 12 of author 273", "publishedYear": 2012}, {"id": 12273, "isbn": "1", "title": "Book 13 of author 273", "publishedYear": 2010}, {"id": 13273, "isbn": "1", "title": "Book 14 of author 273", "publishedYear": 2013}, {"id": 14273, "isbn": "1", "title": "Book 15 of author 273", "publishedYear": 2019}, {"id": 15273, "isbn": "1", "title": "Book 16 of author 273", "publishedYear": 2004}, {"id": 16273, "isbn": "1", "title": "Book 17 of author 273", "publishedYear": 2022}, {"id": 17273, "isbn": "0", "title": "Book 18 of author 273", "publishedYear": 2021}, {"id": 18273, "isbn": "0", "title": "Book 19 of author 273", "publishedYear": 2004}, {"id": 19273, "isbn": "1", "title": "Book 20 of author 273", "publishedYear": 2022}, {"id": 20273, "isbn": "1", "title": "Book 21 of author 273", "publishedYear": 2021}, {"id": 21273, "isbn": "1", "title": "Book 22 of author 273", "publishedYear": 2016}, {"id": 22273, "isbn": "1", "title": "Book 23 of author 273", "publishedYear": 2002}, {"id": 23273, 
"isbn": "0", "title": "Book 24 of author 273", "publishedYear": 2015}, {"id": 24273, "isbn": "1", "title": "Book 25 of author 273", "publishedYear": 2010}, {"id": 25273, "isbn": "1", "title": "Book 26 of author 273", "publishedYear": 2021}, {"id": 26273, "isbn": "1", "title": "Book 27 of author 273", "publishedYear": 2016}, {"id": 27273, "isbn": "0", "title": "Book 28 of author 273", "publishedYear": 2017}, {"id": 28273, "isbn": "0", "title": "Book 29 of author 273", "publishedYear": 2024}, {"id": 29273, "isbn": "1", "title": "Book 30 of author 273", "publishedYear": 2007}]
...

Let's have a look at the explain plan as well.

 HashAggregate  (cost=856.58..869.08 rows=1000 width=68) (actual time=68.398..74.411 rows=1000 loops=1)
   Group Key: a.id
   Batches: 1  Memory Usage: 23361kB
   Buffers: shared hit=230
   ->  Hash Join  (cost=31.50..631.58 rows=30000 width=57) (actual time=0.385..7.345 rows=30000 loops=1)
         Hash Cond: (b.author_id = a.id)
         Buffers: shared hit=230
         ->  Seq Scan on books b  (cost=0.00..521.00 rows=30000 width=29) (actual time=0.012..1.821 rows=30000 loops=1)
               Buffers: shared hit=221
         ->  Hash  (cost=19.00..19.00 rows=1000 width=36) (actual time=0.356..0.357 rows=1000 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 77kB
               Buffers: shared hit=9
               ->  Seq Scan on authors a  (cost=0.00..19.00 rows=1000 width=36) (actual time=0.005..0.109 rows=1000 loops=1)
                     Buffers: shared hit=9

The same hash join you saw before, plus a HashAggregate on top: the operations needed to build the in-line JSON list of books for every author. Note how we now return only 1000 rows, 30 times fewer than before. On my laptop, it ran just a little faster than Iteration 2 above. However, in a typical three-tier architecture, where the database is separated from the application by a network, this approach can make all the difference, because far less data has to travel over the network for every user request.

Conclusion

We have seen three ways to retrieve the data we needed, and looked closely at what makes one approach more performant than the others. Iteration 3 is the fastest, but also the least portable, because it relies on a native query that leverages Postgres-specific functionality.

I hope you enjoyed reading, and perhaps you can even apply these techniques to make your own applications faster.

Thanks for reading! Until next time!
