Sanjeet Singh Jagdev
Building a Declarative JSON Extraction Engine with Python’s Annotated Types, jq and Pydantic

Inspiration

I recently ran into a problem where I needed to process a JSON payload, extract a few fields, transform some values, compute a few derived ones, and produce a new JSON structure.

My first instinct was the usual approach: manually parse the JSON, dig through nested keys, and apply transformations along the way. But it quickly started to feel repetitive and brittle. Every new field meant more extraction logic and more edge cases to handle.

I knew about tools like jq for expression-based JSON querying and Pydantic for schema validation, so I tried combining them. While this helped a bit, I still found myself writing a lot of glue code just to move data from one place to another.

At that point a simple idea occurred to me:

What if I could just declare what I want from the JSON, and let the rest happen automatically?

I looked around for a library that did this out of the box but couldn’t find something that quite fit. So, naturally, I decided to try building one.

Understanding the Problem

Say you have a JSON payload representing an Order:

```json
{
  "id": "order_789",
  "created_at": "2024-02-10T14:21:00Z",
  "customer": {
    "first_name": "Alice",
    "last_name": "Smith",
    "contact": {
      "email": "ALICE@EXAMPLE.COM"
    }
  },
  "items": [
    { "name": "Keyboard", "category": "premium", "price": 120 },
    { "name": "Mouse", "category": "standard", "price": 40 },
    { "name": "Monitor", "category": "premium", "price": 300 }
  ]
}
```

Now I want to transform this into an OrderSummary that looks like this:

```json
{
  "order_id": "order_789",
  "customer_email": "alice@example.com",
  "item_count": 3,
  "premium_total": 420,
  "order_label": "ORDER-order_789"
}
```

Naive Approach

To produce the same transformed JSON using a traditional approach, we would typically write something like this:

```python
import json

with open("orders.json") as f:
    data = json.load(f)

order_id = data["id"]

customer_email = data["customer"]["contact"]["email"].lower()

items = data.get("items", [])
item_count = len(items)

premium_total = sum(
    item["price"]
    for item in items
    if item.get("category") == "premium"
)

order_label = f"ORDER-{data['id']}"

order_summary = {
    "order_id": order_id,
    "customer_email": customer_email,
    "item_count": item_count,
    "premium_total": premium_total,
    "order_label": order_label
}
```

While this works, the extraction logic, transformations, and computed fields are all mixed together, and the schema is implicit. As the JSON structure grows, this approach quickly becomes harder to maintain.

Declarative Solution

The idea is to declare a model that describes the OrderSummary:

```python
from typing import Annotated

# I named the library "jresolve"
from jresolve import (
    JqModel,
    Jq,
    JqMode,
    Transform,
    Computed
)


class OrderSummary(JqModel):
    order_id: Annotated[
        str,
        Jq(".id")
    ]

    customer_email: Annotated[
        str,
        Jq(".customer.contact.email"),
        Transform(str.lower)
    ]

    item_count: Annotated[
        int,
        Jq(".items"),
        Transform(len)
    ]

    premium_total: Annotated[
        float,
        Jq(
            '.items[] | select(.category == "premium") | .price',
            mode=JqMode.MANY
        ),
        Transform(sum)
    ]

    order_label: Annotated[
        str,
        Computed(lambda d: f"ORDER-{d['id']}")
    ]
```

Usage is then a single call:

```python
order_summary = OrderSummary.from_json(data)
```

The key idea here is:

The model declares how fields are extracted and transformed directly in the type annotation.

High Level Architecture

```
Input JSON
    ↓
Resolver (Jq / Computed)
    ↓
Transform Pipeline
    ↓
Collected Field Values
    ↓
Pydantic Model Construction
    ↓
Typed Output
```

Diving Deeper

Now that we have seen how the declarative model looks from the outside, let's briefly look at the core ideas that make this work internally.

Analyze a field

```python
customer_email: Annotated[
    str,
    Jq(".customer.contact.email"),
    Transform(str.lower)
]
```

Just by looking at it I can already tell that:

  1. The type of the field is str
  2. Its value is extracted using the jq expression ".customer.contact.email"
  3. Once I have the value I want to apply a transformation to lowercase

The intent of the field becomes immediately obvious, and since we are using Pydantic, you get schema validation for free.

The backbone: Annotated

The Annotated type acts as the glue that binds all the declarations together.

If we examine the type

```python
order_id: Annotated[
    str,
    Jq(".id"),
    Transform(str.upper)
]
```

It tells us

```
Type:
  → str

Metadata:
  Resolvers:
    → Jq(".id")

  Transforms:
    → str.upper
```

Since Pydantic allows us to access the metadata stored in Annotated, we can interpret those declarations and execute them against the JSON input.
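To make this concrete, here is a stdlib-only sketch of how that metadata can be read back at runtime (in Pydantic v2 the same extras are also exposed through `model_fields[...].metadata`). The `Jq` and `Transform` classes below are simplified stand-ins for illustration, not the real jresolve ones:

```python
from typing import Annotated, get_type_hints

# Simplified stand-ins, not the actual jresolve classes
class Jq:
    def __init__(self, expr: str):
        self.expr = expr

class Transform:
    def __init__(self, fn):
        self.fn = fn

class OrderSummary:
    order_id: Annotated[str, Jq(".id"), Transform(str.upper)]

# include_extras=True preserves the Annotated metadata
hints = get_type_hints(OrderSummary, include_extras=True)
annotation = hints["order_id"]

base_type = annotation.__origin__  # the underlying type: str
resolvers = [m for m in annotation.__metadata__ if isinstance(m, Jq)]
transforms = [m for m in annotation.__metadata__ if isinstance(m, Transform)]
```

Everything the engine needs (the target type, the resolver, the transforms) is sitting right there in the annotation.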

So the field effectively becomes

```
JSON
  ↓
Jq Resolver
  ↓
Transform Pipeline
  ↓
Typed Field
```

A clean pipeline to reason about and extend with more operations.
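The transform stage of that pipeline is essentially a left fold over the declared transforms. A minimal sketch (`run_transforms` is a hypothetical helper, not part of jresolve):

```python
from functools import reduce

def run_transforms(value, transforms):
    # Feed the resolved value through each transform, in declaration order
    return reduce(lambda acc, fn: fn(acc), transforms, value)

email = run_transforms("  ALICE@EXAMPLE.COM  ", [str.strip, str.lower])
# 'alice@example.com'
```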

Resolver

While Annotated does the heavy lifting in the interface, the Resolver does the same for the core implementation.

It is an abstract base class that defines a common interface for the different resolver types:

```python
class Resolver(ABC):
    @abstractmethod
    def resolve(self, data: dict) -> Result[Any, ResolutionError]:
        ...
```

NOTE: A Rust-style Result type is used to return values or errors throughout the implementation. Take a look at result.

Currently there are 3 implementations of Resolver:

  • Jq to extract values using the jq expression syntax
  • Computed to generate a value using a function
  • Pipeline, used internally, which wraps a base Resolver and a list[Transform]
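As a rough sketch of what the Computed resolver might look like, with a drastically simplified Ok/Err pair standing in for the real Result type:

```python
from abc import ABC, abstractmethod
from typing import Any, Callable

# Drastically simplified stand-in for the Rust-style Result type
class Ok:
    def __init__(self, value):
        self.value = value
    def ok(self):
        return self.value

class Err:
    def __init__(self, error):
        self.error = error
    def ok(self):
        return None

class Resolver(ABC):
    @abstractmethod
    def resolve(self, data: dict):
        ...

class Computed(Resolver):
    """Generates a field value by calling a function on the whole input dict."""
    def __init__(self, fn: Callable[[dict], Any]):
        self.fn = fn

    def resolve(self, data: dict):
        try:
            return Ok(self.fn(data))
        except Exception as exc:
            return Err(exc)

label = Computed(lambda d: f"ORDER-{d['id']}").resolve({"id": "order_789"})
# label.ok() == "ORDER-order_789"
```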

How the Model Is Executed

The current execution model looks like

```
resolver output (Jq / Computed)
      ↓
transform 1
      ↓
transform 2
      ↓
final value
```

Which translates to:

  1. Inspect the model annotations
  2. Extract resolvers and transforms from the Annotated metadata
  3. Build a pipeline
  4. Execute the pipeline on the input JSON
  5. Construct the Pydantic model

The construction step roughly looks like this:

```python
for field_name, field in cls.model_fields.items():
    resolver = build_pipeline_from_field(field)

    if resolver:
        result = resolver.resolve(data)
        values[field_name] = result.ok()
```
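To see the whole loop in one place, here is a toy, dependency-free version of it. It swaps real jq for a naive dot-path lookup and skips Pydantic validation entirely, so it illustrates the flow rather than the actual jresolve implementation:

```python
from typing import Annotated, get_type_hints

class Jq:
    """Toy resolver: supports only plain '.a.b.c' paths, unlike real jq."""
    def __init__(self, expr: str):
        self.parts = expr.lstrip(".").split(".")

    def resolve(self, data: dict):
        value = data
        for part in self.parts:
            value = value[part]
        return value

class Transform:
    def __init__(self, fn):
        self.fn = fn

def from_json(cls, data: dict) -> dict:
    """Resolve each annotated field against the input JSON."""
    values = {}
    for field_name, annotation in get_type_hints(cls, include_extras=True).items():
        meta = getattr(annotation, "__metadata__", ())
        resolver = next(m for m in meta if isinstance(m, Jq))
        value = resolver.resolve(data)
        for transform in (m for m in meta if isinstance(m, Transform)):
            value = transform.fn(value)
        values[field_name] = value
    return values

class OrderSummary:
    order_id: Annotated[str, Jq(".id")]
    customer_email: Annotated[str, Jq(".customer.contact.email"), Transform(str.lower)]

data = {"id": "order_789", "customer": {"contact": {"email": "ALICE@EXAMPLE.COM"}}}
summary = from_json(OrderSummary, data)
# {'order_id': 'order_789', 'customer_email': 'alice@example.com'}
```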

Nested Models

Real JSON structures are rarely flat. Fortunately, JqModels can be nested and are resolved recursively.

```python
class OrderSummary(JqModel):
    # previous fields

    customer: Customer
```

Where Customer might look like this:

```python
from typing import Annotated

class Customer(JqModel):
    name: Annotated[
        str,
        Jq(".profile.name.last + ', ' + .profile.name.first")
    ]
    email: Annotated[
        str,
        Jq(".customer.email"),
        Transform(str.lower)
    ]
```

Error Handling

Resolvers return a Rust-style Result type that either contains a value or a structured error.

```
Ok(value)
Err(ResolutionError)
```

The pipeline inspects the result and short-circuits if an error occurs, allowing failures to propagate cleanly without raising exceptions.
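A minimal sketch of that short-circuiting behavior, using a bare (tag, value) tuple in place of the real Result type:

```python
def run_pipeline(resolved, transforms):
    """Apply transforms in order; stop at the first failure instead of raising."""
    value = resolved
    for fn in transforms:
        try:
            value = fn(value)
        except Exception as exc:
            # Short-circuit: later transforms never run
            return ("err", f"{getattr(fn, '__name__', 'transform')} failed: {exc}")
    return ("ok", value)

run_pipeline("ALICE@EXAMPLE.COM", [str.lower])  # ('ok', 'alice@example.com')
run_pipeline(None, [str.lower, str.strip])      # ('err', ...) — str.lower rejects None
```

The caller inspects the tag once at the end, rather than wrapping every field access in try/except.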

Closing Thoughts

This pattern works particularly well when dealing with:

  • Complex API responses
  • ETL pipelines
  • Data normalization layers
  • Event payload transformations

Instead of scattering extraction logic across the codebase, transformations become declarative and centralized in the model definition.
Using Annotated types allows us to build a small DSL directly inside Python’s type system, while still benefiting from Pydantic’s validation and typing support.

The result is a system that is:

  • Declarative
  • Type-safe
  • Composable
  • Easy to extend

If you’re interested in the full implementation, the code is available in my GitHub repo.

I would love to hear your thoughts about this and what I could have done better!
