Understanding Subqueries in SQL and Building JSON Directly in PostgreSQL

#sql #postgres #webdev #typeorm

There's a level of obsession with the need for optimization (speed and memory) required to be a good backend engineer. One useful tool in database optimization is SQL subqueries.
A subquery is a query inside another query. It allows you to pull related data or computed results without writing multiple queries or heavy joins. It also allows you to perform calculations dynamically and aggregate data.
There are different kinds of subqueries based on the type of data you want to return.

Scalar Subqueries

Scalar subqueries return a single computed value needed within an outer query, which is useful for comparisons or assignments.
For example, say you have two tables:

users
-----
id | name

posts
-----
id | title | user_id | status

A way to get the post count for each user could be:

SELECT u.*, COUNT(p.id) AS post_count
FROM users u
LEFT JOIN posts p ON p.user_id = u.id
GROUP BY u.id;

But with a scalar subquery, you can retrieve it like this:

SELECT 
  u.*,
  (SELECT COUNT(*) FROM posts WHERE posts.user_id = u.id) AS post_count
FROM users u;

This form is often easier to extend and can sometimes perform better, especially with proper indexes.

Column Subqueries

A column subquery returns a single column for multiple rows. This is useful when the outer query needs to compare values with multiple rows, so it is often used after the IN, ANY, or EXISTS operator.

SELECT u.* 
FROM users u 
WHERE u.id IN (
  SELECT DISTINCT p.user_id 
  FROM posts p 
  WHERE p.status = 'active'
);

-- OR 
SELECT u.* 
FROM users u 
WHERE EXISTS (
  SELECT 1 
  FROM posts p 
  WHERE p.user_id = u.id 
  AND p.status = 'active'
);

Row Subqueries

A row subquery returns a single row with multiple columns. This is where PostgreSQL's JSON constructors come in. With json_build_object, you can build objects that contain multiple columns from another table in each row of the outer query.

SELECT
  u.id,
  u.name,
  (
    SELECT json_build_object(
      'id', p.id,
      'title', p.title,
      'status', p.status
    )
    FROM posts p
    WHERE p.user_id = u.id
    ORDER BY p.created_at DESC
    LIMIT 1
  ) AS latest_post
FROM users u;

This allows you to fetch the latest post for each user without multiple queries or joins. The result would look like this in JSON:

[
  {
    "u_id": 1,
    "u_name": "Alice",
    "latest_post": {
      "id": 12,
      "title": "Optimizing SQL Queries",
      "status": "published",
      "created_at": "2025-10-11T09:25:43.125Z"
    }
  },
  {
    "u_id": 2,
    "u_name": "Bob",
    "latest_post": {
      "id": 15,
      "title": "Working with Subqueries",
      "status": "draft",
      "created_at": "2025-10-10T14:02:19.761Z"
    }
  }
]

Very cool stuff.

Table Subqueries

A table subquery returns multiple rows and columns and can be useful for aggregating, filtering, or joining computed data. It behaves like a temporary, in-memory table you can join or select from, and is powerful for modularizing complex queries. Here, we use another powerful PostgreSQL JSON constructor called json_agg.
A good example is if you want to fetch each user with their last 10 posts.

SELECT
  u.id,
  u.name,
  (
    SELECT json_agg(
      json_build_object(
        'id', p.id,
        'title', p.title,
        'status', p.status,
        'created_at', p.created_at
      )
    )
    FROM posts p
    WHERE p.user_id = u.id
    ORDER BY p.created_at DESC
    LIMIT 10
  ) AS posts
FROM users u;

json_build_object() creates a JSON object for each post and json_agg() aggregates all those post objects into a single JSON array. Each user gets their own array of posts, automatically ordered by created_at DESC. This can outperform ORM relations and joins, especially when the tables and data are large and only specific fields are needed.
The result would look like this:

[
  {
    "u_id": 1,
    "u_name": "Alice",
    "posts": [
      {
        "id": 14,
        "title": "Optimizing SQL Queries",
        "status": "published",
        "created_at": "2025-10-11T09:25:43.125Z"
      },
      {
        "id": 9,
        "title": "Getting Started with PostgreSQL JSON Functions",
        "status": "published",
        "created_at": "2025-09-15T17:12:09.872Z"
      },
      //...up to 10 posts
    ]
  },
  {
    "u_id": 2,
    "u_name": "Bob",
    "posts": [
      {
        "id": 15,
        "title": "Working with Subqueries",
        "status": "draft",
        "created_at": "2025-10-10T14:02:19.761Z"
      },
      //...up to 10 posts
    ]
  }
]

This is very useful when fetching data for back-office dashboards, analytics, etc.

Caveats

As powerful as subqueries and JSON constructors are, there are potential mistakes that can lead to performance issues, and considerations to be made based on your application's use case.

Forgetting to Handle Null Results

Use COALESCE to handle null values.

COALESCE (
(
    SELECT json_agg(
      json_build_object(
        'id', p.id,
        'title', p.title,
        'status', p.status,
        'created_at', p.created_at
      )
    )
    FROM posts p
    WHERE p.user_id = u.id
    ORDER BY p.created_at DESC
    LIMIT 10
  ), '[]'
) AS posts

Not Indexing Foreign Keys

This can be disastrous, as PostgreSQL will perform a full table scan per row if your subquery references another table and the foreign key is not indexed.

Building Huge JSON Objects in One Go

If you try to aggregate too many rows into one JSON array (e.g., all posts in your entire database), you can hit memory limits or experience slowdowns. Instead:

Use pagination (LIMIT, OFFSET) in subqueries
Or aggregate per entity (user, order, etc.), not globally.

Subqueries vs. Joins - Structuring vs. Combining Data

The prevalent argument against subqueries is whether it's better to join tables since PostgreSQL optimizes JOINs so well. Most arguments relegate the need for subqueries to code readability and maintainability, but I believe it is always dependent on the data and your use case.

One thing that always helps is to use PostgreSQL's EXPLAIN ANALYZE. It is your best friend for performance debugging.

EXPLAIN ANALYZE
SELECT ...

It will show you whether your subquery is executed once per row (bad) or optimized via index scans (good), and you can ultimately see for yourself which approach is best.

Joins are great for combining data across tables, for example, retrieving a list of users with their corresponding orders. They shine when you need flat, tabular data and care about speed across large datasets.

Subqueries, on the other hand, are better for structuring data. They let you embed small, scoped queries inside larger ones, so each part of your query can focus on a specific task: filtering, computing, or shaping data into a nested structure, which enables you to reduce post-processing in your backend code and focus on business logic- every backend engineer's desire.

Bonus - TypeORM Example

I love TypeORM, and since most software developers use ORMs, here are a couple of samples of what using subqueries in TypeORM would look like in relation to the earlier used queries:

This gets an array of users with a computed post_count in a single query.

const userRepo = dataSource.getRepository(User);

const users = await userRepo
  .createQueryBuilder('u')
  .addSelect((qb) => {
    return qb
      .select('COUNT(p.id)', 'post_count')
      .from(Post, 'p')
      .where('p.user_id = u.id');
  }, 'post_count')
  .getRawMany();

This fetches all users and their posts, and merges them into a single JSON with one query.

const users = await dataSource
  .getRepository(User)
  .createQueryBuilder('u')
  .select(['u.id', 'u.name'])
  .addSelect((qb) => {
    return qb
      .subQuery()
      .select(`json_agg(json_build_object('id', p.id, 'title', p.title))`)
      .from(Post, 'p')
      .where('p.user_id = u.id');
  }, 'posts')
  .getRawMany();

Summarily, by combining subqueries and PostgreSQL's JSON functions, you're letting the database handle both data retrieval and shaping, something it's really good at. Your backend no longer needs to map, merge, or format objects. It simply returns the JSON the client needs, straight from the database with zero to minimal round-trips. Again, every backend engineer's desire.

Cheers.