Edward Burton

Posted on Jun 24, 2024

A forensic analysis of the Claude Sonnet 3.5 system prompt leak

#ai #programming #machinelearning #webdev


Originally published in the Tying Shoelaces Blog

Introducing Artifacts

A step forward in structured output generation.

This is an analysis of the system prompt for Claude 3.5 Sonnet. A link to the full prompt is available at the bottom of the post, with the source. The main focus of this analysis is the new concept of artifacts, and how they might work as part of an intelligent categorization and retrieval system.

Artifacts are for substantial, self-contained content that users might modify or reuse.

An artifact is a paradigm change because it formalizes a new concept: persistent data. Persistent data is a stepping stone towards a highly curated and structured content library. By providing fixed references, we unblock iteration and the ability to incrementally improve and refine output. This is a step towards controlling the ephemeral nature of verbose LLM output.

One of the inherent problems with generative AI for functional tasks such as code completion is that models often repeat entire files for simple changes. There is a huge demand for a ‘diff’ feature, where the model outputs only the difference between before and after instead of repeating the same content.
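To make the difference concrete, here is a minimal sketch using Python's standard-library `difflib`. For a one-line change, a unified diff is a handful of lines, whereas repeating the file reproduces every line. The file name and contents are purely illustrative.

```python
import difflib

# Illustrative "before" and "after" versions of a small file.
before = [
    "def greet(name):\n",
    "    message = 'Hello, ' + name\n",
    "    return message\n",
]
after = [
    "def greet(name):\n",
    "    message = f'Hello, {name}!'\n",
    "    return message\n",
]

# A unified diff lists only the changed lines; n=0 suppresses context lines,
# so nothing unchanged is repeated.
diff = list(difflib.unified_diff(before, after, fromfile="greet.py", tofile="greet.py", n=0))
print("".join(diff))
```

For a file of hundreds of lines, the diff stays the same size while a full rewrite grows with the file.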

Artifacts thus serve a dual purpose. First, they act as a reference point for how and where we need output, much like setting a scope or defining a reference point. This stops the LLM from losing sight of the original problem, and it keeps the output persistently structured and categorized.

As a bonus, we also get an autocomplete feature. By defining the ‘base’ code and the scope of the changes, we direct the LLM to focus on a specific task or problem in an opinionated and curated way. This prevents erratic shifts in zoom and also keeps the entire work in progress in the prompt. Any engineer who has accidentally wiped their code with "Rest of code here" will be grateful. We can see the scope being set here:

Self-contained, complex content that can be understood on its own, without context from the conversation

We are directing focus away from uncontrolled verbose output and towards a concrete artifact. It is worth noting the explicit requirement that the content be understandable without context from the conversation. This ensures quality by anchoring output to curated data: a quality control mechanism that limits the verbose and potentially random characteristics of the input.

All of this fits together with an architecture for retrieval. With a deep library of curated artifacts, we can direct our system to retrieve from a controlled dataset. We know that all the large AI providers are investing heavily in high-quality curated data. Artifacts are a step towards framing verbose input and output with a structure.

We can see the prompt shifting focus away from the raw input and towards system-defined criteria. Here are some of the exclusion criteria:

Content that is dependent on the current conversational context to be useful.

Content that is unlikely to be modified or iterated upon by the user.

Request from users that appears to be a one-off question.
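Read together with the inclusion rule, these criteria work like a checklist. As a rough sketch, assuming each judgement has already been reduced to a yes/no answer (in the prompt, that judgement is left to the model's own reasoning), the filter could look like this. `ContentAssessment` and `is_artifact_worthy` are hypothetical names, not part of Claude's prompt.

```python
from dataclasses import dataclass


@dataclass
class ContentAssessment:
    """Hypothetical yes/no judgements about a candidate piece of output."""
    substantial: bool            # long or complex enough to stand alone
    self_contained: bool         # understandable without the conversation
    likely_to_be_modified: bool  # the user may modify or reuse it
    one_off_question: bool       # a quick, throwaway request


def is_artifact_worthy(a: ContentAssessment) -> bool:
    # Exclusion criteria paraphrased from the prompt: context-dependent content,
    # content unlikely to be iterated on, and one-off questions are rejected.
    if not a.self_contained:
        return False
    if not a.likely_to_be_modified:
        return False
    if a.one_off_question:
        return False
    # Inclusion criterion: "substantial, self-contained content".
    return a.substantial


script = ContentAssessment(substantial=True, self_contained=True,
                           likely_to_be_modified=True, one_off_question=False)
quick_answer = ContentAssessment(substantial=False, self_contained=True,
                                 likely_to_be_modified=False, one_off_question=True)
print(is_artifact_worthy(script), is_artifact_worthy(quick_answer))  # True False
```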

The prompt actively focuses on the system context and the task in hand, explicitly filtering out input that is not relevant to a very specific output. The artifact therefore acts as a concrete reference point, both in the generated text and as structured data behind the scenes. This gives us fast and accurate retrieval, and focus. Something that is very helpful for...

Thinking

Logical thinking is a key part of the generation process.

Prompt engineers have long been telling us that one of the keys to reliable output is obligating LLMs to form a multi-step structured and logical thought process. We see formal recognition of this in the prompt.

1. Briefly before invoking an artifact, think for one sentence in tags about how it evaluates against the criteria for a good and bad artifact. Consider if the content would work just fine without an artifact. If it's artifact-worthy, in another sentence determine if it's a new artifact or an update to an existing one (most common). For updates, reuse the prior identifier.

Here, we obligate our system to follow a structured, multi-step process to analyse the task and the output. Again, this moves towards a strong definition of otherwise verbose content and hints at a search and retrieval system for artifacts.

Creating a Python script to calculate factorials meets the criteria for a good artifact. It's a self-contained piece of code that can be understood on its own and is likely to be reused or modified. This is a new conversation, so there are no pre-existing artifacts. Therefore, I'm creating a new artifact.

This request is a direct modification of the existing factorial-calculator artifact. It's not a new artifact but an update to make the script more robust. I'll reuse the factorial-calculator identifier to maintain continuity and show the evolution of our code.
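The "new artifact or update" decision implies that artifacts persist, keyed by identifier. Here is a minimal sketch assuming a simple in-memory store; `ArtifactStore` is hypothetical and not Anthropic's implementation. Reusing the `factorial-calculator` identifier appends a new version instead of creating a second artifact, so the evolution of the code is preserved.

```python
from dataclasses import dataclass, field


@dataclass
class ArtifactStore:
    """Hypothetical in-memory store: each identifier maps to its list of versions."""
    versions: dict[str, list[str]] = field(default_factory=dict)

    def save(self, identifier: str, content: str) -> str:
        # Reusing an existing identifier records an update;
        # an unseen identifier creates a new artifact.
        if identifier in self.versions:
            self.versions[identifier].append(content)
            return "updated"
        self.versions[identifier] = [content]
        return "created"

    def latest(self, identifier: str) -> str:
        return self.versions[identifier][-1]


store = ArtifactStore()
print(store.save("factorial-calculator", "def factorial(n): ..."))                     # created
print(store.save("factorial-calculator", "def factorial(n): ...  # with input checks"))  # updated
print(len(store.versions["factorial-calculator"]))                                     # 2
```

Keeping every version, rather than overwriting, is what makes incremental refinement possible: earlier states remain available as reference points.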

Here we can see a logical thought process being applied to generate defined outputs. By ensuring that the model goes through the same logical steps every time, we have the seeds of an intelligent and repeatable generation process. In a word: thought.

We can map this logic to the thought process of a person. First, we have a logical and rational problem-solving approach. We supplement this with hard artifacts. The LLM's training data is the brain, but artifacts are the skills and knowledge that enable us to arrive at a certain output.

If we consider all the competing models, we can infer that they rely on replicating a logical thought process. We are essentially creating a robot brain that mimics the logical thought process of a human, and building the missing parts: the knowledge, structures and retrieval processes that fuel the brain.

This makes system prompts and instructions incredibly valuable assets. Understanding and refining "logical thinking" is a key part of the generation process.

We can see some basic implementations of this structured thinking in the code...

Identifiers and Search

Search and retrieval of artifacts is a key part of the system prompt.

def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n - 1)
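The second example thought earlier describes updating this script "to make the script more robust". The updated code from the prompt is not reproduced here, so the following is only a guess at what such an update might involve: rejecting non-integers and negative numbers (a negative input would otherwise recurse until Python's recursion limit is hit) and looping instead of recursing.

```python
def factorial(n: int) -> int:
    """Return n! for a non-negative integer n (hypothetical 'robust' revision)."""
    # bool is a subclass of int, so exclude it explicitly.
    if not isinstance(n, int) or isinstance(n, bool):
        raise TypeError("n must be an integer")
    if n < 0:
        raise ValueError("n must be non-negative")
    result = 1
    # An iterative loop avoids Python's default recursion limit for large n.
    for i in range(2, n + 1):
        result *= i
    return result


print(factorial(5))  # 120
```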

So what is application/vnd.ant.code? It follows the MIME type convention: application is the top-level type, vnd marks the vendor tree, ant is almost certainly Anthropic (the creators of Claude), and code is the subtype. That last part is an insight into their architecture. I would expect some kind of taxonomy and structured data that lists the tasks people are trying to achieve with LLMs.

Coding tasks
Presentations
Documents
Analysis
Many more...
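As an illustration of how such a taxonomy might be keyed, here is a minimal sketch that splits a vendor-tree type string into its parts and maps task categories to types. Only `application/vnd.ant.code` comes from the prompt; the other type strings and the `TASK_TAXONOMY` mapping are hypothetical.

```python
from typing import NamedTuple


class ArtifactType(NamedTuple):
    top_level: str  # e.g. "application"
    tree: str       # "vnd" marks the vendor tree in MIME type naming
    vendor: str     # "ant", presumably Anthropic
    subtype: str    # e.g. "code"


def parse_artifact_type(mime: str) -> ArtifactType:
    # Split "application/vnd.ant.code" into its MIME-style components.
    top_level, rest = mime.split("/", 1)
    tree, vendor, subtype = rest.split(".", 2)
    return ArtifactType(top_level, tree, vendor, subtype)


# Hypothetical mapping from task category to artifact type.
TASK_TAXONOMY = {
    "coding": "application/vnd.ant.code",
    "presentations": "application/vnd.ant.presentation",
    "documents": "application/vnd.ant.document",
    "analysis": "application/vnd.ant.analysis",
}

for category, mime in TASK_TAXONOMY.items():
    print(category, parse_artifact_type(mime).subtype)
```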

We could, for example, write some pseudocode for a PowerPoint presentation artifact.

<antartifact 
    identifier="powerpoint-presentation" 
    type="application/vnd.ant.presentation" 
    purpose="business" 
    title="Simple powerpoint presentation">
        Slide 1: Title slide
        Slide 2: Introduction
        Slide 3: Problem statement
        Slide 4: Solution
</antartifact>

This is almost certainly nothing like the production code, but it is an interesting mental paradigm. To control and structure verbose output, we need logical and rational processes for categorizing and standardizing the input and output.
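Whatever the production format, wrapping output in a tag means it can be handled as structured data. As an illustration only, treating the pseudocode above as XML (an assumption; nothing confirms that Claude's tags are parsed this way), Python's standard `xml.etree.ElementTree` separates the searchable metadata from the content:

```python
import xml.etree.ElementTree as ET

# The pseudocode artifact from above, treated as a plain XML fragment.
raw = """<antartifact
    identifier="powerpoint-presentation"
    type="application/vnd.ant.presentation"
    purpose="business"
    title="Simple powerpoint presentation">
        Slide 1: Title slide
        Slide 2: Introduction
        Slide 3: Problem statement
        Slide 4: Solution
</antartifact>"""

element = ET.fromstring(raw)

# The attributes become structured, searchable metadata...
metadata = dict(element.attrib)
# ...while the body stays free text, split here into one entry per slide.
slides = [line.strip() for line in element.text.splitlines() if line.strip()]

print(metadata["type"])  # application/vnd.ant.presentation
print(len(slides))       # 4
```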

I suspect this means that incoming inputs are run through separate, battle-hardened algorithms for entity extraction and categorization. This structured data is then passed through an asset search and retrieval process. Where text relies on vector databases, other defined outputs now have this concept of artifacts. For example, a React coding task could go something like this:

"INPUT: Create a react component for a metrics dashboard",
"ENTITY_EXTRACTION: Coding, React, Metrics Dashboard",
"ENTITY_SEARCH: Retrieve code artifacts for Metrics Dashboard where type = React",
"SYSTEM_PROMPT: create_system_prompt(artifact_id='metrics-dashboard-component', type='application/vnd.ant.code', language='react')"
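The same four steps can be sketched as a toy pipeline. Everything here is an assumption for illustration: keyword matching stands in for a real entity-extraction model, a one-item list stands in for the curated library, and `create_system_prompt` only borrows its name from the pseudocode above.

```python
from dataclasses import dataclass


@dataclass
class Artifact:
    identifier: str
    mime_type: str
    language: str
    topic: str


# Hypothetical curated library; the single entry mirrors the example above.
LIBRARY = [
    Artifact("metrics-dashboard-component", "application/vnd.ant.code", "react", "metrics dashboard"),
]

# Hypothetical keyword table standing in for a real entity-extraction model.
KNOWN_ENTITIES = {"react": "React", "metrics dashboard": "Metrics Dashboard", "component": "Coding"}


def extract_entities(text: str) -> set[str]:
    lowered = text.lower()
    return {label for keyword, label in KNOWN_ENTITIES.items() if keyword in lowered}


def search_artifacts(entities: set[str]) -> list[Artifact]:
    # Match on topic and language; a production system would likely use richer search.
    return [a for a in LIBRARY if a.topic.title() in entities and a.language.title() in entities]


def create_system_prompt(artifact: Artifact) -> str:
    return (f"Use artifact '{artifact.identifier}' "
            f"(type={artifact.mime_type}, language={artifact.language}) as the base.")


request = "Create a react component for a metrics dashboard"
entities = extract_entities(request)
matches = search_artifacts(entities)
prompt = create_system_prompt(matches[0]) if matches else "No matching artifact; generate from scratch."
print(sorted(entities))  # ['Coding', 'Metrics Dashboard', 'React']
print(prompt)
```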

There is a lot going on, and we can see the hard yards needed behind the scenes to curate high-quality examples and taxonomies for what is, in theory, an unlimited pool of tasks. Other AI classification algorithms are probably iterating on this behind the scenes to automate it.

But at its core, as far as we can see, it is a fancy search and retrieval system based on a proprietary templating language.

Templating language structure

A rendering template that will shift based on input variables

I started my career many years ago as a Drupal developer. Reading the prompt, the word that sprang to mind was Twig. Twig is a PHP templating engine (used by Drupal 8 onwards) for rendering HTML from templates and variables.

It looks like Claude 3.5 Sonnet uses an equivalent approach that tailors input and context based on structured data (probably extracted outside the LLM), which makes perfect sense. Given the text input to the LLM, we need to systematically generate blocks of text. These are the dynamic tags that are put together to generate the prompt.

This likely leverages a kind of function-calling approach in which each tag has a specific purpose. The tags then serve as an abstraction that directs the model to find the right category and type for each purpose.
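Twig itself is PHP; as a loose analogy in Python, the standard library's `string.Template` can assemble a prompt from purpose-specific blocks chosen per request. The block names, wording and `render_prompt` helper below are hypothetical, not taken from Claude's prompt.

```python
from string import Template

# Hypothetical prompt blocks, each a template with its own purpose.
BLOCKS = {
    "artifact_context": Template(
        'The user is working on artifact "$identifier" ($artifact_type, $language).'
    ),
    "task": Template("Task: $task"),
    "thinking": Template(
        "Before answering, think in one sentence about whether this needs a new artifact or an update."
    ),
}


def render_prompt(block_names: list[str], variables: dict[str, str]) -> str:
    # Only the blocks selected for this request are rendered and joined,
    # so the final prompt shifts with the structured input.
    return "\n".join(BLOCKS[name].substitute(variables) for name in block_names)


print(render_prompt(
    ["artifact_context", "task"],
    {
        "identifier": "metrics-dashboard-component",
        "artifact_type": "application/vnd.ant.code",
        "language": "react",
        "task": "Add a date-range filter",
    },
))
```

`substitute` raises `KeyError` if a selected block needs a variable that was not extracted, which is a useful guard against rendering a half-filled prompt.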

So there we have it: a thought process broken into blocks, with entity extraction mapped to advanced search and retrieval. These are the building blocks of a logical thought process, and the underpinning data is key to the quality of the output.

Conclusion

One small artifact for Claude, a giant leap for AI.

Artifacts are to structured output, such as code generation, what vector search is to RAG (retrieval-augmented generation): the search and retrieval system for structured output.

We see evidence of a structured and rational thought process in Claude 3.5 Sonnet. We have always expected this to be important in generative AI, but this is formal confirmation.

I can imagine armies of developers and marketers building libraries of curated artifacts, accessed via classification and then search and retrieval. But the real step forward is the concept of persistence.

By working with artifacts, we have reference points that exist beyond the ephemeral, ones that can be refined and reused. We already had thought and verbose output. Now we've got memories and expertise...

Claude 3.5 system

The system prompt in full TyingShoelaces
