Simon Risman for LLMWare

Are we all prompting wrong? Balancing Creativity and Consistency in RAG.

For a Boston native like myself, there are few things more heartwarming than Artificial Intelligence understanding the brilliance of Good Will Hunting. A few cursory prompts reveal that it views it as a "must-watch tale of redemption and self discovery".

[Image: Chat Will Hunting]

But a slightly closer look reveals what many users of LLMs have accepted as a given: slight variations in wording on an otherwise consistent answer. This is the result of stochastic generation.

Stochastic generation 🤖

This is a fairly common term: from online bootcamps to college lectures, students of AI are familiar with the concept. For those who need a quick refresher, here is the three-step generation loop that many LLMs follow.

LLMs are trained using a next-token prediction task, where the model predicts the next token in a sequence based on the previous tokens. This process involves:

  1. Tokenized Input: The input text is converted into a sequence of numbers (tokens).
  2. Probability Distribution: The model generates a probability distribution over the possible next tokens.
  3. Sampling Algorithm: This distribution is passed through a sampling algorithm to select the next token.

The probabilistic elements that this process introduces enable LLMs to generate more captivating dialogue, produce novel images, and creatively praise award-winning films.
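
To make those three steps concrete, here is a toy sketch in plain numpy. It is purely illustrative (not llmware internals): the vocabulary and scores are made up, and tokenization is reduced to a lookup.

import numpy as np

# toy next-token step: a tiny vocabulary (step 1) and made-up raw scores (logits)
vocab = ["redemption", "self-discovery", "Boston", "genius", "apples"]
logits = np.array([2.1, 1.9, 0.4, 0.2, -1.0])

def next_token(logits, temperature=1.0, sample=True):
    # step 2: turn raw scores into a probability distribution (temperature-scaled softmax)
    probs = np.exp(logits / temperature)
    probs = probs / probs.sum()
    # step 3: sample from the distribution (stochastic) or take the single most likely token (greedy)
    if sample:
        return int(np.random.choice(len(probs), p=probs))
    return int(np.argmax(probs))

print(vocab[next_token(logits, temperature=0.3)])   # low temperature: almost always "redemption"
print(vocab[next_token(logits, temperature=1.5)])   # high temperature: more variety from run to run
print(vocab[next_token(logits, sample=False)])      # greedy decoding: always "redemption"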

Randomness and RAG 🎰

When building RAG-based applications, we are often not as concerned with creativity as we are with facts. When dealing with facts, we want as little randomness involved as possible. In other words, instead of sampling from a probability distribution, it's beneficial to just take the maximum-likelihood token every time.

LLMWare allows you to explore how random your generated results are, as well as adjust how random you want them to be. Here's a quick demonstration:

Demo 🙌

Load the model

from llmware.models import ModelCatalog

model = ModelCatalog().load_model("bling-stablelm-3b-tool",
                                  sample=True,
                                  temperature=0.3,
                                  get_logits=True,
                                  max_output=123)

In the load_model method, we make a few important selections. bling-stablelm-3b-tool is one of our newest and highest-performing models.

Setting the sample attribute to True or False lets you switch between a stochastic approach and a top-token (greedy) approach.

The temperature can be an important tool to control the randomness of the output, with lower values making responses more focused and higher values increasing diversity in the generated text.

These key settings let you decide what kind of approach you want to take toward the probabilistic nature of your model.
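
For a fact-focused RAG pipeline, you might flip these settings the other way. Here is a minimal sketch that reuses the exact parameters from the snippet above and only changes the sampling behavior:

# same model, configured to always take the top token (greedy decoding)
fact_model = ModelCatalog().load_model("bling-stablelm-3b-tool",
                                       sample=False,      # turn off stochastic sampling
                                       temperature=0.3,   # temperature does not change the token ordering, so it matters little with sampling off
                                       get_logits=True,
                                       max_output=123)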

Run a simple inference on some sample text

# "sample" is the source text passage loaded earlier in the full example (see the GitHub repo linked below)
response = model.inference("What is a list of the key points?", sample)

This step is where your model does the heavy lifting, analyzing and summarizing the text you pass in as context.
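
A quick way to see the difference the sample setting makes is to run the identical prompt twice and compare. This is a hypothetical check, assuming the same sample passage as above and that the response dictionary exposes the generated text under "llm_response", as in other llmware examples:

# run the same prompt twice against the same context
response_1 = model.inference("What is a list of the key points?", sample)
response_2 = model.inference("What is a list of the key points?", sample)

# with sample=True the wording may drift between runs; with sample=False the two should match
print("run 1: ", response_1["llm_response"])
print("run 2: ", response_2["llm_response"])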

Run a sampling analysis

# measure how often the model picked a token other than the most likely one
sampling_analysis = ModelCatalog().analyze_sampling(response)
print("sampling analysis: ", sampling_analysis)

Now you get to see the analytics, giving you a better idea of how heavily your model samples from the lower-probability side of the distribution.

This analysis will include the percentage of tokens selected by the model that were also the highest-probability output, and will note the cases where a token other than the top token was selected.
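
If you want to see every field the analysis returns, a simple loop over the dictionary works. The exact field names, other than not_top_tokens used below, may vary between llmware versions:

# inspect each field returned by analyze_sampling
for key, value in sampling_analysis.items():
    print(key, ": ", value)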

For the cases where the top token was not selected, the code below prints the exact entries of those outputs, including each token's rank.

# each entry is a case where the sampled token was not the top-probability choice
for i, entries in enumerate(sampling_analysis["not_top_tokens"]):
    print("sampled choices: ", i, entries)

All these tools can help you make an informed decision on whether you want your model to think a little outside the box or stick to the most likely answer. To see this process in action, check out our YouTube video on consistent LLM output generation.

The full code for this example can be found in our GitHub repo.

If you have any questions, or would like to learn more about LLMWare, come to our Discord community. Click here to join. See you there! 🚀🚀🚀

Top comments (1)

Jerry Hargrive

What a fascinating read! The part about balancing creativity and consistency in RAG was really enlightening. You mentioned setting the sample attribute to True or Falseβ€”are there specific scenarios where one approach significantly outperforms the other? Would love to hear your thoughts on this.