Marco

Exploring the Instructor Library: Structuring Unstructured Data (and Some Fun along the Way)

I’ve recently come across the instructor library, and I have to say, I’m pretty impressed. The concept of structuring unstructured data is both powerful and, dare I say, a bit magical. The idea that you can take data that’s all over the place and somehow impose order on it—well, that’s just my kind of wizardry.

But… how exactly does it work?

To find out, I spent some time digging into the internals of this library, and I discovered that there are two key players behind the scenes that are responsible for much of its magic.

Meet the Players: Pydantic and a Nice Prompt


import instructor
from pydantic import BaseModel
from openai import OpenAI


Now, if you're familiar with Python's data validation and settings management, you’ve probably heard of Pydantic. And if you haven't... well, buckle up! It’s an amazing library that allows you to define data structures and then validate that incoming data matches those structures—in real time. Think of it as the bouncer at a fancy club, making sure that only the right data gets in.
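
For example, here's a minimal sketch of Pydantic playing bouncer (the UserInfo model is my own toy example, not something from instructor):

from pydantic import BaseModel, ValidationError

class UserInfo(BaseModel):
    name: str
    age: int

# Well-formed data gets in
user = UserInfo(name="Marco", age=30)

# Malformed data is turned away at the door
try:
    UserInfo(name="Marco", age="not a number")
except ValidationError as e:
    print(e)  # reports that age should be a valid integer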

FastAPI, which is another great tool, makes excellent use of Pydantic to ensure that data passing through an API is in the right format. So, what's the next step? Now that we’ve defined our structure, how do we get an LLM (like OpenAI’s GPT) to follow it? Hmm…

Hypothesis 1: Pydantic's Serialization

My first hypothesis was that Pydantic might allow some kind of serialization—converting the data structure into something an LLM can easily understand and work with. And, it turns out, I wasn’t wrong.

Pydantic allows you to serialize your data into a dictionary with the following method:

model.model_dump(...)  # Dumps the model into a dictionary

This method recursively converts your Pydantic models into dictionaries, which can then be fed into an LLM for processing. So far, so good. But then I stumbled across something even more interesting:
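
As a quick illustration (again with a toy model of my own, not instructor's internals), model_dump flattens nested models into plain dictionaries:

from pydantic import BaseModel

class Address(BaseModel):
    city: str
    country: str

class UserInfo(BaseModel):
    name: str
    address: Address

user = UserInfo(name="Marco", address=Address(city="Milan", country="Italy"))
print(user.model_dump())
# {'name': 'Marco', 'address': {'city': 'Milan', 'country': 'Italy'}}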

Hypothesis 2: Generating a JSON Schema

It gets better. Pydantic doesn’t just convert data into dictionaries—it can also generate a JSON schema for your model. This is key, because now you have a blueprint of the structure you want the LLM to follow.

Here’s where things really started to click:

# Generate a JSON schema for a Pydantic model
response_model.model_json_schema()

Bingo! Now you’ve got a clear schema that defines exactly how the data should look. This is the blueprint we can send to the LLM, so it knows exactly how to structure its output.
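
To make that concrete, here's roughly what the generated schema looks like for a tiny toy model of mine (output shortened for readability):

from pydantic import BaseModel

class UserInfo(BaseModel):
    name: str
    age: int

print(UserInfo.model_json_schema())
# {'properties': {'name': {'title': 'Name', 'type': 'string'},
#                 'age': {'title': 'Age', 'type': 'integer'}},
#  'required': ['name', 'age'],
#  'title': 'UserInfo',
#  'type': 'object'}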

Bringing It All Together

import json
from textwrap import dedent

# response_model is the user's Pydantic model class
message = dedent(
    f"""
    Understand the content and provide
    the parsed objects in json that match the following json_schema:\n

    {json.dumps(response_model.model_json_schema(), indent=2, ensure_ascii=False)}

    Make sure to return an instance of the JSON, not the schema itself
    """
)

Here, the library is passing the schema to the LLM, asking it to return data that conforms to that structure. The message is clear: "Hey LLM, respect this schema when you generate your output." It’s like giving your LLM a detailed map and saying, “Follow these directions exactly.”
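
And from the user's side, the whole flow looks roughly like this. This is a sketch based on instructor's documented usage; the UserInfo model and the model name are placeholders of mine, not taken from the library:

import instructor
from openai import OpenAI
from pydantic import BaseModel

class UserInfo(BaseModel):
    name: str
    age: int

# Wrap the OpenAI client so it understands response_model
client = instructor.from_openai(OpenAI())

user = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    response_model=UserInfo,  # instructor builds the schema prompt from this model
    messages=[{"role": "user", "content": "Marco is 30 years old."}],
)

print(user)  # UserInfo(name='Marco', age=30) -- a validated Pydantic instance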

Thanks for Bearing with Me

So, after all this investigation, I’m now convinced: Pydantic’s serialization and JSON schema generation are what make it possible for the Instructor library to get an LLM to follow structured data formats.

Thanks for sticking with me through this fun (and slightly convoluted) investigation. Who knew that unstructured data could be tamed with a little help from Python libraries and a bit of creative prompting?
