Sanjeet Singh Jagdev
Building a Declarative JSON Extraction Engine with Python’s Annotated Types, jq and Pydantic

Inspiration

I recently ran into a problem where I needed to process a JSON payload, extract a few fields, transform some values, compute a few derived ones, and produce a new JSON structure.

My first instinct was the usual approach: manually parse the JSON, dig through nested keys, and apply transformations along the way. But it quickly started to feel repetitive and brittle. Every new field meant more extraction logic and more edge cases to handle.

I knew about tools like jq for expression-based JSON querying and Pydantic for schema validation, so I tried combining them. While this helped a bit, I still found myself writing a lot of glue code just to move data from one place to another.

At that point a simple idea occurred to me:

What if I could just declare what I want from the JSON, and let the rest happen automatically?

I looked around for a library that did this out of the box but couldn’t find something that quite fit. So, naturally, I decided to try building one.

Understanding the Problem

Say you have a JSON payload representing an Order:

```json
{
  "id": "order_789",
  "created_at": "2024-02-10T14:21:00Z",
  "customer": {
    "first_name": "Alice",
    "last_name": "Smith",
    "contact": {
      "email": "ALICE@EXAMPLE.COM"
    }
  },
  "items": [
    { "name": "Keyboard", "category": "premium", "price": 120 },
    { "name": "Mouse", "category": "standard", "price": 40 },
    { "name": "Monitor", "category": "premium", "price": 300 }
  ]
}
```

Now I want to transform this into an OrderSummary that looks like this:

```json
{
  "order_id": "order_789",
  "customer_email": "alice@example.com",
  "item_count": 3,
  "premium_total": 420,
  "order_label": "ORDER-order_789"
}
```

Naive Approach

To produce the same transformed JSON using a traditional approach, we would typically write something like this:

```python
import json

with open("orders.json") as f:
    data = json.load(f)

order_id = data["id"]

customer_email = data["customer"]["contact"]["email"].lower()

items = data.get("items", [])
item_count = len(items)

premium_total = sum(
    item["price"]
    for item in items
    if item.get("category") == "premium"
)

order_label = f"ORDER-{data['id']}"

order_summary = {
    "order_id": order_id,
    "customer_email": customer_email,
    "item_count": item_count,
    "premium_total": premium_total,
    "order_label": order_label
}
```

While this works, the extraction logic, transformations, and computed fields are all mixed together, and the schema is implicit. As the JSON structure grows, this approach quickly becomes harder to maintain.

Declarative Solution

The idea is to declare a model that describes the OrderSummary:

```python
from typing import Annotated

# I named the library "jresolve"
from jresolve import (
    JqModel,
    Jq,
    JqMode,
    Transform,
    Computed
)


class OrderSummary(JqModel):
    order_id: Annotated[
        str,
        Jq(".id")
    ]

    customer_email: Annotated[
        str,
        Jq(".customer.contact.email"),
        Transform(str.lower)
    ]

    item_count: Annotated[
        int,
        Jq(".items"),
        Transform(len)
    ]

    premium_total: Annotated[
        float,
        Jq(
            '.items[] | select(.category == "premium") | .price',
            mode=JqMode.MANY
        ),
        Transform(sum)
    ]

    order_label: Annotated[
        str,
        Computed(lambda d: f"ORDER-{d['id']}")
    ]
```

Usage is then a single call:

```python
order_summary = OrderSummary.from_json(data)
```

The key idea here is:

The model declares how fields are extracted and transformed directly in the type annotation.

High Level Architecture

```
Input JSON
    ↓
Resolver (Jq / Computed)
    ↓
Transform Pipeline
    ↓
Collected Field Values
    ↓
Pydantic Model Construction
    ↓
Typed Output
```

Diving Deeper

Now that we have seen how the declarative model looks from the outside, let's briefly look at the core ideas that make this work internally.

Analyze a field

```python
customer_email: Annotated[
    str,
    Jq(".customer.contact.email"),
    Transform(str.lower)
]
```

Just by looking at it I can already tell that:

  1. The type of the field is str
  2. Its value is extracted using the jq expression ".customer.contact.email"
  3. Once I have the value I want to apply a transformation to lowercase

The intent of the field becomes immediately obvious, and since we are using Pydantic, you get schema validation for free.

The backbone: Annotated

The Annotated type acts as the glue that binds all the declarations together.

If we examine the type

```python
order_id: Annotated[
    str,
    Jq(".id"),
    Transform(str.upper)
]
```

It tells us

```
Type:
  → str

Metadata:
  Resolvers:
    → Jq(".id")

  Transforms:
    → str.upper
```

Since Pydantic allows us to access the metadata stored in Annotated, we can interpret those declarations and execute them against the JSON input.
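To make this concrete, here is a stdlib-only sketch of how that metadata can be read back at runtime (in Pydantic v2 the same extras are also exposed through `model_fields[...].metadata`). The `Jq` and `Transform` classes below are simplified stand-ins for illustration, not the real jresolve ones:

```python
from typing import Annotated, get_type_hints

# Simplified stand-ins, not the actual jresolve classes
class Jq:
    def __init__(self, expr: str):
        self.expr = expr

class Transform:
    def __init__(self, fn):
        self.fn = fn

class OrderSummary:
    order_id: Annotated[str, Jq(".id"), Transform(str.upper)]

# include_extras=True preserves the Annotated metadata
hints = get_type_hints(OrderSummary, include_extras=True)
annotation = hints["order_id"]

base_type = annotation.__origin__  # the underlying type: str
resolvers = [m for m in annotation.__metadata__ if isinstance(m, Jq)]
transforms = [m for m in annotation.__metadata__ if isinstance(m, Transform)]
```

Everything the engine needs (the target type, the resolver, the transforms) is sitting right there in the annotation.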

So the field effectively becomes

```
JSON
  ↓
Jq Resolver
  ↓
Transform Pipeline
  ↓
Typed Field
```

A clean pipeline to reason about and extend with more operations.
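The transform stage of that pipeline is essentially a left fold over the declared transforms. A minimal sketch (`run_transforms` is a hypothetical helper, not part of jresolve):

```python
from functools import reduce

def run_transforms(value, transforms):
    # Feed the resolved value through each transform, in declaration order
    return reduce(lambda acc, fn: fn(acc), transforms, value)

email = run_transforms("  ALICE@EXAMPLE.COM  ", [str.strip, str.lower])
# 'alice@example.com'
```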

Resolver

While Annotated does the heavy lifting in the interface, the Resolver does the same for the core implementation.

It is an abstract base class that defines a common interface for the different resolver types:

```python
class Resolver(ABC):
    @abstractmethod
    def resolve(self, data: dict) -> Result[Any, ResolutionError]:
        ...
```

NOTE: A Rust-style Result type is used to return values or errors throughout the implementation. Take a look at result.

Currently there are 3 implementations of Resolver:

  • Jq to extract values using the jq expression syntax
  • Computed to generate a value using a function
  • Pipeline, used internally, which wraps a base Resolver and a list[Transform]
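As a rough sketch of what the Computed resolver might look like, with a drastically simplified Ok/Err pair standing in for the real Result type:

```python
from abc import ABC, abstractmethod
from typing import Any, Callable

# Drastically simplified stand-in for the Rust-style Result type
class Ok:
    def __init__(self, value):
        self.value = value
    def ok(self):
        return self.value

class Err:
    def __init__(self, error):
        self.error = error
    def ok(self):
        return None

class Resolver(ABC):
    @abstractmethod
    def resolve(self, data: dict):
        ...

class Computed(Resolver):
    """Generates a field value by calling a function on the whole input dict."""
    def __init__(self, fn: Callable[[dict], Any]):
        self.fn = fn

    def resolve(self, data: dict):
        try:
            return Ok(self.fn(data))
        except Exception as exc:
            return Err(exc)

label = Computed(lambda d: f"ORDER-{d['id']}").resolve({"id": "order_789"})
# label.ok() == "ORDER-order_789"
```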

How the Model Is Executed

The current execution model looks like

```
resolver output (Jq / Computed)
      ↓
transform 1
      ↓
transform 2
      ↓
final value
```

Which translates to:

  1. Inspect the model annotations
  2. Extract resolvers and transforms from the Annotated metadata
  3. Build a pipeline
  4. Execute the pipeline on the input JSON
  5. Construct the Pydantic model

The construction step roughly looks like this:

```python
for field_name, field in cls.model_fields.items():
    resolver = build_pipeline_from_field(field)

    if resolver:
        result = resolver.resolve(data)
        values[field_name] = result.ok()
```
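To see the whole loop in one place, here is a toy, dependency-free version of it. It swaps real jq for a naive dot-path lookup and skips Pydantic validation entirely, so it illustrates the flow rather than the actual jresolve implementation:

```python
from typing import Annotated, get_type_hints

class Jq:
    """Toy resolver: supports only plain '.a.b.c' paths, unlike real jq."""
    def __init__(self, expr: str):
        self.parts = expr.lstrip(".").split(".")

    def resolve(self, data: dict):
        value = data
        for part in self.parts:
            value = value[part]
        return value

class Transform:
    def __init__(self, fn):
        self.fn = fn

def from_json(cls, data: dict) -> dict:
    """Resolve each annotated field against the input JSON."""
    values = {}
    for field_name, annotation in get_type_hints(cls, include_extras=True).items():
        meta = getattr(annotation, "__metadata__", ())
        resolver = next(m for m in meta if isinstance(m, Jq))
        value = resolver.resolve(data)
        for transform in (m for m in meta if isinstance(m, Transform)):
            value = transform.fn(value)
        values[field_name] = value
    return values

class OrderSummary:
    order_id: Annotated[str, Jq(".id")]
    customer_email: Annotated[str, Jq(".customer.contact.email"), Transform(str.lower)]

data = {"id": "order_789", "customer": {"contact": {"email": "ALICE@EXAMPLE.COM"}}}
summary = from_json(OrderSummary, data)
# {'order_id': 'order_789', 'customer_email': 'alice@example.com'}
```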

Nested Models

Real JSON structures are rarely flat. Fortunately, JqModels can be nested and are resolved recursively.

```python
class OrderSummary(JqModel):
    # previous fields

    customer: Customer
```

Where Customer might look like this:

```python
from typing import Annotated

class Customer(JqModel):
    name: Annotated[
        str,
        Jq(".profile.name.last + ', ' + .profile.name.first")
    ]
    email: Annotated[
        str,
        Jq(".customer.email"),
        Transform(str.lower)
    ]
```

Error Handling

Resolvers return a Rust-style Result type that either contains a value or a structured error.

```
Ok(value)
Err(ResolutionError)
```

The pipeline inspects the result and short-circuits if an error occurs, allowing failures to propagate cleanly without raising exceptions.
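A minimal sketch of that short-circuiting behavior, using a bare (tag, value) tuple in place of the real Result type:

```python
def run_pipeline(resolved, transforms):
    """Apply transforms in order; stop at the first failure instead of raising."""
    value = resolved
    for fn in transforms:
        try:
            value = fn(value)
        except Exception as exc:
            # Short-circuit: later transforms never run
            return ("err", f"{getattr(fn, '__name__', 'transform')} failed: {exc}")
    return ("ok", value)

run_pipeline("ALICE@EXAMPLE.COM", [str.lower])  # ('ok', 'alice@example.com')
run_pipeline(None, [str.lower, str.strip])      # ('err', ...) — str.lower rejects None
```

The caller inspects the tag once at the end, rather than wrapping every field access in try/except.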

Closing Thoughts

This pattern works particularly well when dealing with:

  • Complex API responses
  • ETL pipelines
  • Data normalization layers
  • Event payload transformations

Instead of scattering extraction logic across the codebase, transformations become declarative and centralized in the model definition.
Using Annotated types allows us to build a small DSL directly inside Python’s type system, while still benefiting from Pydantic’s validation and typing support.

The result is a system that is:

  • Declarative
  • Type-safe
  • Composable
  • Easy to extend

If you’re interested in the full implementation, the code is available in my GitHub repo.

I would love to hear your thoughts about this and what I could have done better!
