How does an LLM reliably produce responses that strictly follow JSON syntax when using features like "json_mode" or "function calling"?
These features are, in effect, an answer to a more general question: "How can we get an LLM to generate responses exactly the way we want?"
You're probably familiar with the fact that LLMs generate responses one token at a time.
What's less commonly known, especially outside technical circles, is that each token is chosen probabilistically: at every step, the model produces a probability distribution over its entire vocabulary and samples the next token from it.
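To make this concrete, here is a toy sketch of that sampling step (the vocabulary and probabilities are made up for illustration; a real model has tens of thousands of tokens):

```python
import random

# Toy illustration, not a real model: the LLM assigns a probability to
# every token in its vocabulary, and the next token is *sampled* from
# that distribution rather than picked deterministically.
vocab = ['{', '"name"', 'Hello', ':', '}']
probs = [0.55, 0.20, 0.15, 0.07, 0.03]  # hypothetical model output

def sample_next_token(vocab, probs, rng=random):
    """Draw one token according to the model's probability distribution."""
    return rng.choices(vocab, weights=probs, k=1)[0]

print(sample_next_token(vocab, probs))
```

Run it a few times and you'll see different tokens come out, with `'{'` appearing most often because it carries the highest probability.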
So what if we could influence those probabilities, discouraging tokens that don't match our desired format?
Wouldn't that let us reliably produce code snippets in JSON, CSV, or Python scripts with correct syntax?
Surprisingly, this approach is widely used in practice, under names like json_mode, structured output, and function calling.
In llama.cpp, this is done with grammar files that describe exactly which token sequences are legal. Here's its JSON grammar:
https://github.com/ggml-org/llama.cpp/blob/master/grammars/json.gbnf
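To give a feel for the notation, here is a simplified fragment in the same GBNF style (an illustrative sketch, not the verbatim contents of that file; the `string` and `number` rules are omitted):

```gbnf
# A JSON value is an object, array, string, number, or literal.
root   ::= value
value  ::= object | array | string | number | "true" | "false" | "null"
object ::= "{" ws ( string ":" ws value ( "," ws string ":" ws value )* )? "}" ws
array  ::= "[" ws ( value ( "," ws value )* )? "]" ws
ws     ::= [ \t\n]*
```

At each decoding step, the engine checks which tokens could legally continue a string matching this grammar, and only those tokens remain candidates.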
By artificially setting the probabilities of grammatically incorrect tokens to zero, we can ensure the LLM strictly adheres to the desired syntax.
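The mechanism can be sketched in a few lines of Python. This is a hypothetical toy, not the llama.cpp implementation: the "grammar" here only accepts one fixed JSON object, but the masking step (zero out forbidden tokens, renormalize, sample) is the core idea.

```python
import random

# Toy vocabulary; 'oops' stands in for every token that would break JSON.
VOCAB = ['{', '}', '"key"', ':', '"value"', 'oops']

def allowed(prefix, token):
    """Toy grammar check: accept only tokens that extend a valid
    prefix of the JSON object {"key":"value"}."""
    target = ['{', '"key"', ':', '"value"', '}']
    i = len(prefix)
    return i < len(target) and token == target[i]

def constrained_sample(prefix, probs, rng=random):
    # Set the probability of every grammatically invalid token to zero...
    masked = [p if allowed(prefix, t) else 0.0 for t, p in zip(VOCAB, probs)]
    total = sum(masked)
    if total == 0:
        raise ValueError("grammar admits no token at this position")
    # ...then sample from the renormalized distribution that remains.
    return rng.choices(VOCAB, weights=masked, k=1)[0]

# Even if the model's raw probabilities are terrible (uniform here),
# masking still forces a grammatical result.
uniform = [1 / 6] * 6
out = []
while len(out) < 5:
    out.append(constrained_sample(out, uniform))
print(''.join(out))  # → {"key":"value"}
```

In this toy the grammar leaves exactly one legal token per step, so the output is fully determined; with a real grammar, many tokens remain legal at each step and the model's probabilities still decide among them.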