DEV Community

Paton Wong

Citation Needed: Structured data extraction workflows

In the previous article we explored how to generate and use structured data in a workflow. Now, let's take it a step further.

We'll build a workflow that checks whether an article provides evidence to support its claims (but not whether the evidence itself is valid). Rather than using this to fact check articles in the wild, this might be useful for critiquing your own writing before submission or checking generated text for hallucinations.

This task is impractical to automate without generative language models. Traditional natural language processing pipelines can extract or categorize entities and phrases from a text, but this task requires a degree of reading comprehension that only larger language models provide.

Furthermore, while many language models are capable of performing individual steps, the overall process requires more rigor and discipline than they are trained for. Frontier models might handle moderately complex tasks, but verifying that they haven't hallucinated the results requires additional work on par with this workflow.

What we can do instead is split the task into distinct steps: extracting claims then checking each of them. In this article we'll look into the first part using our old friend the LLM › Structured node.

Claims Schema

In the Structured Generation tutorial we saw how to generate a single structured entry from scratch. LLMs can handle much more complexity. This time we'll ask the model to determine which phrases in a text are factual claims and place them into a list. We'll also ask it to weigh each claim's importance holistically when deciding whether to include it.

Like before, create a new workflow and swap out the normal Chat for a Structured node.

Create a Parse JSON node and connect it to the schema input of the Structured node. Fill it with this schema conveniently generated by an LLM:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "ClaimsList",
  "type": "object",
  "properties": {
    "claims": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "minItems": 1,
      "maxItems": 5,
      "description": "A list of claim strings. The list must contain at least one and at most five items."
    }
  },
  "required": [
    "claims"
  ],
  "additionalProperties": false
}
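To sanity-check model output against this schema outside the workflow, a minimal stdlib sketch is enough (the `sample` payload is invented for illustration; a full validator such as the jsonschema library would cover the rest of the spec):

```python
import json

# Hypothetical sample of what the Structured node might emit for this schema
sample = json.loads('{"claims": ["Apiaries house multiple beehives.", "Honey bees pollinate crops."]}')

def check_claims_list(data):
    """Minimal structural check mirroring the ClaimsList schema (stdlib only)."""
    assert isinstance(data, dict) and set(data) == {"claims"}, "object with only a 'claims' key"
    claims = data["claims"]
    assert isinstance(claims, list) and 1 <= len(claims) <= 5, "1 to 5 items"
    assert all(isinstance(c, str) for c in claims), "items must be strings"
    return claims

print(check_claims_list(sample))
```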

📢 important
Technically, an array at the top-level would be a valid schema.

However, many models have trouble generating data with that format. To ensure compatibility between providers, wrap the array in an object. Then extract the list later using JSON transformations.
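The wrapper costs nothing downstream: extracting the array again is a one-line transformation. A sketch in Python with a hypothetical payload (in the workflow itself this is a jq filter):

```python
import json

# Hypothetical wrapped output from the Structured node
wrapped = json.loads('{"claims": ["First claim.", "Second claim."]}')

# Equivalent of the jq filter `.claims`: pull the bare array back out
claims = wrapped["claims"]
print(claims)
```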

Instructions

In the previous example we combined instructions with dynamic data into the prompt. This time we'll reserve the system message for instructions and inject the data in a separate step.


Partitioning the instructions from the data makes it much easier to reuse the workflow on new inputs. We can use the system message field of the Agent node for the instructions:

Follow these instructions exactly.
Do not respond directly to the user.
Do not hallucinate the final answer.

## Instructions

Extract the key factual claims in the user's statement and format them into a list (5 items or fewer).
Ensure that each claim can stand alone without additional context to make sense of it.

💡 tip
You should experiment with variations on the instructions, particularly the preamble, to optimize them for your preferred language model. I find this combination effective with the Nemotron family and various other open models.

The system message is sent once at the beginning of each request, and in theory the LLM pays special attention to it. Regardless, it spares us from repeating the instructions in every prompt of a conversation, even though the entire conversation is sent with each request.[1]

Input Document

The input document for a workflow will typically be supplied by the runner. While developing a workflow, however, it's convenient to create a node for a predefined text to take advantage of iterative execution. In the final version of the workflow we can delete this node and connect to the input of the Start node.


Create a Value › Plain Text node to hold the article content.

Connect it to the prompt input of the Structured node.

Paste the contents of an article into the text field. I'm using a Wikipedia article about apiaries (places where beehives are kept).

Claim Checking

We now have a workflow that generates a list of claims from a text. Our eventual goal is to have each claim checked individually against the original text, which will be supplied to the language model in a context document.

However, before learning how to check every item, we should first explore how to check a single item.


First, let's pull a single claim out of the structured generation using JSON › Transform JSON. This node uses a jq filter to manipulate JSON.

The filter .claims[1] tells it to access the "claims" field and return the second element (0-indexed).
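The same indexing in Python, with hypothetical claims. One difference worth knowing: jq returns `null` for an out-of-range index, while plain Python indexing raises `IndexError`:

```python
data = {"claims": ["Claim A.", "Claim B.", "Claim C."]}

# jq `.claims[1]`: access the "claims" field, take the second element (0-indexed)
second = data["claims"][1]
print(second)  # Claim B.

# jq yields null past the end of the array; guard explicitly in Python
index = 7
claim = data["claims"][index] if index < len(data["claims"]) else None
print(claim)  # None
```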

💡 tip
Ask your favorite frontier LLM for help writing jq filters from sample data.

Add a second Agent node with these instructions:

Follow these instructions exactly.
Do not respond directly to the user.
Do not hallucinate the final answer.

## Instructions

Help the user analyze the article in the context file.
The user is examining individual claims that the article makes.

Determine whether the context provides supporting evidence for the claim stated by the user.
List the reference or citation provided by the article.

DO NOT interpret the article as evidence for a claim made by the user.
The user is simply examining a claim made by the article.


How can we provide the article as context for the LLM? There are several ways:

  • Inject it into the system message using templating
  • Provide it as a user message in the conversation
  • Use an LLM › Context node

The third option is the cleanest, since it draws a clear line between instructions, context, and prompt. The Context node sits between the agent and a chat node, augmenting the agent by injecting its contents into the requests the agent makes.
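For comparison, the first option — templating the article into the system message — would look roughly like this outside the workflow editor. The message structure is the common chat-API shape; the names and text here are hypothetical, not part of the tool:

```python
# Hypothetical template; {article} is filled in per request
SYSTEM_TEMPLATE = """Follow these instructions exactly.
Determine whether the context provides supporting evidence for the claim stated by the user.

## Context

{article}
"""

def build_messages(article, claim):
    """Assemble a chat request with the article baked into the system message."""
    return [
        {"role": "system", "content": SYSTEM_TEMPLATE.format(article=article)},
        {"role": "user", "content": claim},
    ]

msgs = build_messages("An apiary is a location where beehives are kept.",
                      "Apiaries house multiple beehives.")
print(msgs[0]["content"])
```

The drawback, and the reason the Context node is preferable, is that the context gets entangled with the instructions instead of staying a separate, swappable input.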

Connect the Plain Text node containing the article to the context input. In the final version of the workflow, this should be connected to the input pin of the Start node.


We can use a simple Chat node to do a quick spot check on how the context affects the language model response. However, to facilitate checking the entire collection, the responses for each item should be structured.

Structured Check

Replace the Chat node with a Structured node, connecting it to the Context and Transform nodes.

Use this schema for the claims checking Structured node:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "description": "A factual claim with evidence from citations or references",
  "type": "object",
  "required": [
    "claim",
    "grounding"
  ],
  "properties": {
    "claim": {
      "type": "string",
      "description": "the original claim made by the article"
    },
    "grounding": {
      "enum": [
        "not a claim",
        "unsupported",
        "fully supported"
      ],
      "description": "The level of support for the claim provided by citations and references. If the provided text is actually a definition or something other than a claim, then \"not a claim\""
    },
    "evidence": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "description": "The citations and references that support the claim. Empty if the claim is not supported."
    }
  }
}
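A response conforming to this schema might look like the following — the claim text and citation are invented for illustration, not taken from a real article:

```python
import json

# Invented example of a claim-check result shaped by the schema above
result = json.loads("""
{
  "claim": "Apiaries have been used since antiquity.",
  "grounding": "fully supported",
  "evidence": ["[1] Example beekeeping handbook (hypothetical citation)"]
}
""")

# The schema restricts grounding to exactly these three values
GROUNDING_LEVELS = {"not a claim", "unsupported", "fully supported"}
assert result["grounding"] in GROUNDING_LEVELS
print(result["claim"], "->", result["grounding"])
```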


Connect the unwrapped claim to the prompt and run.

By changing the claim index we can see how it handles different claims and statements.

Conclusion

In this tutorial we've explored using language models to extract structured data from plain text, then transforming that data for further processing. The workflow is still incomplete, since so far we've only checked a single claim.

Before we can go any further, we'll need to learn about iterating over lists using subgraphs. This will allow us to check every claim individually, then draw a conclusion by combining all results.
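Outside the workflow tool, the shape of that next step is a plain map-then-combine. A sketch, where `check_claim` is a stand-in for the per-claim Structured node rather than a real API:

```python
def check_claim(claim):
    # Placeholder for the per-claim Structured node; returns a schema-shaped dict
    return {"claim": claim, "grounding": "unsupported", "evidence": []}

claims = ["Claim A.", "Claim B."]
results = [check_claim(c) for c in claims]      # check every claim individually
supported = sum(r["grounding"] == "fully supported" for r in results)
print(f"{supported}/{len(results)} claims supported")  # 0/2 claims supported
```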


  [1] Some LLM providers support caching portions of the request. However, since this behavior isn't standardized across providers yet, aerie does not support it.
