Building an Open Source LLM Recommender System: Prompt Iteration and Refinement

Over the past month I have been working on Open Recommender, an open source YouTube video recommendation system which takes your Twitter feed as input and recommends YouTube-shorts style clips tailored to your interests. I made a video about it if you want a more in-depth introduction.

The data pipeline looks like this:

(Diagram: the Open Recommender data pipeline.)

Based on an analysis of your Twitter feed, the system generates YouTube search queries and uses the YouTube search API to find relevant videos, then chops those videos up into clips. All of this data processing is driven by LLMs, currently GPT-4, but over the next couple of weeks I'm going to be migrating away from OpenAI's expensive closed source APIs using OpenPipe, a brilliant service for incrementally replacing OpenAI's models with smaller, faster, cheaper fine tuned open source models.
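
To make the flow concrete, here's a minimal sketch of how those stages fit together. The types and signatures are illustrative assumptions, not the actual Open Recommender code:

```typescript
// Illustrative sketch of the pipeline stages; the types and signatures are
// hypothetical, not the actual Open Recommender code.
interface Tweet { text: string }
interface Video { id: string; title: string; transcript: string }
interface Clip { videoId: string; start: number; end: number; title: string }

// LLM prompt: turn the user's tweets into YouTube search queries
declare function createQueries(tweets: Tweet[]): Promise<string[]>;
// YouTube search API: find candidate videos for each query
declare function searchYouTube(queries: string[]): Promise<Video[]>;
// LLM prompt: split each transcript into shorts-style clips
declare function createClips(videos: Video[]): Promise<Clip[]>;

async function recommend(tweets: Tweet[]): Promise<Clip[]> {
  const queries = await createQueries(tweets);
  const videos = await searchYouTube(queries);
  return createClips(videos);
}
```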

Prompt Iteration

The main focus over the past couple of weeks has been tweaking and improving the reliability of the prompts and data processing pipeline to the point where 8/10 recommendations feel interesting. When I started, only about half of the recommendations felt relevant. Even that was quite encouraging, because I knew from previous projects that as long as you have a decent LLM program, with enough tweaking it's possible to turn it into something great. I'm happy to report that after many hours of banging my head against the wall I have finally hit the 8/10 quality goal consistently across runs, at least for my own Twitter data. Here are some of the key things I learned over the past couple of weeks:

Better Tools for Prompt Engineering

We need better tools for prompt engineering. Ideally prompts should be written and auto-optimised by an LLM with optional human-in-the-loop feedback. I frequently ran into issues where my approach wasn't working, but I didn't have the energy to try something else because it would take too much time without any guarantee that it would perform better. Just like in programming, you want the experimentation cycle to be short so you can quickly filter through possible solutions to find something that works. But this just isn't possible with prompt engineering right now. It takes a huge amount of time to set up alternative prompts or in-context examples, or to re-jig your prompt chain to experiment with a different approach. Minimising friction here is essential.

Based on my experience here I started working on a TypeScript library called Prompt Iteration Assistant. I described it as "a set of simple tools to speed up the prompt engineering iteration cycle". It gives you a nice CLI dialog for creating, testing and iterating on prompts. To create a new prompt, you tell it the goal and an ideal output, and it bootstraps the prompt by getting GPT-4 to write it. It infers the input and output schemas and will support code generation to add the prompt to your codebase automatically.
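
To give a rough sense of what a bootstrapped prompt might contain, here's a hypothetical sketch using zod for the schemas. This is just the shape of the idea, not the library's actual API:

```typescript
import { z } from "zod";

// Hypothetical shape of a bootstrapped prompt definition - an illustration of the
// idea, not Prompt Iteration Assistant's actual API.
const createQueriesPrompt = {
  name: "createQueries",
  goal: "Turn a user's recent tweets into YouTube search queries matching their interests",
  // inferred input and output schemas
  inputSchema: z.object({ tweets: z.array(z.string()) }),
  outputSchema: z.object({ queries: z.array(z.string()).min(1).max(5) }),
  // the prompt text itself is drafted by GPT-4 from the goal plus an ideal output example
  template:
    "Here are the user's recent tweets:\n{{tweets}}\n\n" +
    "Write up to 5 YouTube search queries matching their interests.",
};
```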

My goal is to make prompt engineering 10x easier, but it definitely hasn't reached that level yet. I think the DX is nice because the CLI dialogs and code generation make writing prompts a lot faster, but I don't think this represents the next generation of prompt engineering yet.

A couple of days ago I ran into a really impressive project called DSPy which supports auto generating and optimising whole programs composed of multiple prompts. To quote the docs: "DSPy gives you general-purpose modules (e.g., ChainOfThought) and takes care of optimising prompts for your program and your metric."

Please see my article specifically on "Better Tools for Prompt Engineering" where I go into more detail about these topics.

Optimise the Main Levers and Avoid Cascading Failure

I realised that the createQueries and createClips prompts are the two stages in the pipeline with the biggest impact on the quality of the recommendations. createQueries controls which queries get sent to the YouTube search API, and createClips controls whether and how each video gets split up into YouTube-shorts style clips.

With the createClips function, I was able to improve the quality of the output using traditional prompt engineering techniques. I kept tweaking the prompt and evaluating it against 3 datasets - a closely related transcript, a moderately related transcript and a completely unrelated transcript - to validate that I got the expected output from each one. But I still wasn't able to guarantee the quality of the clips. To make the quality more reliable I implemented a re-ranking prompt for video clips (inspired by RankGPT) to make sure only the best of the best gets recommended. I also added some logic, sketched below, to control the number of recommendations from the same source and make sure there is enough variety in the final recommended clips.
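
The source-variety logic is simple enough to sketch. Assuming the clips are already sorted best-first by the re-ranking prompt, something along these lines caps how many clips any single video can contribute (names and defaults are illustrative):

```typescript
// Minimal sketch of capping how many clips come from the same source video so the
// final recommendations stay varied. Assumes clips are already sorted best-first
// by the re-ranking prompt; names and defaults are illustrative.
interface RankedClip { videoId: string; title: string; score: number }

function diversify(clips: RankedClip[], maxPerVideo = 2, total = 10): RankedClip[] {
  const perVideo = new Map<string, number>();
  const picked: RankedClip[] = [];
  for (const clip of clips) {
    const count = perVideo.get(clip.videoId) ?? 0;
    if (count >= maxPerVideo) continue; // enough clips from this video already
    perVideo.set(clip.videoId, count + 1);
    picked.push(clip);
    if (picked.length === total) break;
  }
  return picked;
}
```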

For the createQueries prompt, I made some improvements by obsessing over the in-context example I created from my own Twitter data. But I realised that occasional strange queries would always sneak in and cause a cascading failure of poor recommendations further down the pipeline. One generalisation I've reasoned my way to is that a long LLM program is like a game of Chinese whispers: if you don't build error correction and recovery into the system, your output gets stranger and stranger as errors compound.

I controlled for this by implementing a filterSearchResults prompt which compares the video search results returned from the YouTube API against the user's tweets and filters out the ones which are unrelated. Importantly, I compared the search results against the user's tweets rather than against the generated queries or a summary of the user's tweets. This guards against the LLM compounding errors from earlier in the pipeline - it may have generated strange queries or misinterpreted something in its summary of the user's tweets. It's better to compare against the "ground truth" for the user's interests, which is the tweets themselves.
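
For illustration, a filter like this could be structured roughly as follows using the OpenAI chat completions API. The prompt wording and types here are placeholders rather than the real filterSearchResults implementation:

```typescript
import OpenAI from "openai";

// Rough sketch of a filterSearchResults-style prompt: compare search results against
// the raw tweets (the "ground truth"), not against the generated queries. The prompt
// wording and types are placeholders, not the actual Open Recommender implementation.
const openai = new OpenAI();

async function filterSearchResults(
  tweets: string[],
  results: { id: string; title: string }[]
): Promise<{ id: string; title: string }[]> {
  const res = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      {
        role: "system",
        content:
          "You filter YouTube search results for a recommendation system. Given a user's " +
          "tweets and a list of videos, reply with a JSON array containing only the ids of " +
          "videos that are clearly relevant to the user's interests.",
      },
      {
        role: "user",
        content:
          `Tweets:\n${tweets.join("\n")}\n\n` +
          `Videos:\n${results.map((r) => `${r.id}: ${r.title}`).join("\n")}`,
      },
    ],
  });
  // Assumes the model returns a bare JSON array of ids; a production prompt would
  // enforce this with function calling or stricter output parsing.
  const relevantIds: string[] = JSON.parse(res.choices[0].message.content ?? "[]");
  return results.filter((r) => relevantIds.includes(r.id));
}
```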

In my article on "Avoiding Cascading Failure in LLM Pipelines" I analysed the cascading failure problem in more detail.

Look at Your Data

A week ago I was running the pipeline over my Twitter data and I realised that I was consistently getting strange recommendations that made no sense. Looking at my Twitter likes and tweets I couldn't understand why certain videos had been recommended to me. Why was I getting wrestling video recommendations when I have never tweeted about anything to do with wrestling?

It wasn't until I inspected the raw data being fed into the LLM requests using OpenPipe's request log web UI that I noticed there were tweets included in my Twitter data that I did not recognise. I ran the getTweets function a bunch more times and realised that the unofficial Twitter API I'm using to fetch tweets was returning advertisement tweets interleaved with my own tweets!
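
One cheap guard against this class of bug is to sanity check fetched tweets against the requested account before they ever reach a prompt. A minimal sketch, assuming the scraper returns an author username per tweet:

```typescript
// Cheap sanity check: drop any fetched tweet not authored by the requested user, so
// interleaved ad/promoted tweets from the unofficial API never reach the prompts.
// The field names here are assumptions about the scraper's response shape.
interface RawTweet { text: string; authorUsername: string }

function dropForeignTweets(tweets: RawTweet[], username: string): RawTweet[] {
  return tweets.filter(
    (t) => t.authorUsername.toLowerCase() === username.toLowerCase()
  );
}
```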

I caught another bug in the appraiseTranscripts prompt. I noticed that upon re-running the prompt many times over the same video, it would output the correct response only 50% of the time. Using my testing setup I was able to quickly debug the issue. I found that the prompt performed fine with 250, 500 and 1000 tokens of transcript context, but frequently failed with specifically 350 tokens of context! The transcript was from a video called "The 10 AI Innovations Expected to Revolutionize 2024 - 2025". The correct output would be to classify it as spam.

Here was GPT's reasoning for recommending it in the 350 token context test: "The video uses some buzzwords and makes some broad claims about the future of AI, but it also provides specific examples and details about current developments in the field, such as self-driving cars and drone delivery services."

My explanation is that with fewer tokens of context, GPT can't appraise the quality of the transcript well, because one interesting nugget can skew the assessment a lot. So even if 50% of the 350 tokens are buzzwords, a couple of quality sentences can "persuade" GPT to recommend it. The funny part is that the 250 token context test passes every time because it excludes a mildly interesting example about self-driving cars! So in conclusion, to get a stable, accurate assessment of average quality you need to pass a larger number of tokens (quite obvious in hindsight).
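
The check that surfaced this was essentially a pass-rate loop: run the appraisal prompt repeatedly on the same transcript truncated to different lengths and compare how often it returns the correct verdict. A minimal sketch, where appraise stands in for whatever wraps the appraiseTranscripts prompt:

```typescript
// Rough sketch of the pass-rate check that surfaced this bug: run the appraisal
// prompt repeatedly on the same (truncated) transcript and measure how often it
// gives the correct verdict. `appraise` stands in for whatever function wraps
// the appraiseTranscripts prompt.
async function passRate(
  appraise: (transcript: string) => Promise<{ recommend: boolean }>,
  transcript: string,
  runs = 10
): Promise<number> {
  let correct = 0;
  for (let i = 0; i < runs; i++) {
    const { recommend } = await appraise(transcript);
    if (!recommend) correct++; // for this video the correct verdict is "don't recommend"
  }
  return correct / runs;
}

// e.g. compare passRate on the transcript truncated to 250, 350, 500 and 1000 tokens.
```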

There were tons of other bugs I caught too, like inconsistent in-context example formatting, using the wrong function name in function call examples, and prompt variables that weren't getting replaced.

Next Steps

Now that the prompt engineering is done, it's time to start curating a dataset for fine tuning with OpenPipe. This will bring down the cost of running the pipeline and allow me to scale to more users. If you want to try out the recommendations and give suggestions about how they could be improved, please DM me on Twitter @experilearning.
