Sophia Parafina

RAG with Web Search

Retrieval augmented generation (RAG) is a popular way to use current or proprietary data with Large Language Models (LLMs). There are many articles describing how to perform RAG. Typically, they involve encoding data as vectors and storing the vectors in a database. The database is queried, and the retrieved data is placed in the context, where it is tokenized along with the prompt and sent to the LLM. At its simplest, RAG is placing data in the prompt for the LLM to process.

For all practical purposes, the Internet is the world's database, and we can query it with search engines that have methods for returning relevant results. Sound familiar? We can use search to power a RAG application.

For this example, we'll use DuckDuckGo for search, Langchain to retrieve the web pages and process the data, and your choice of either Ollama with an open-source LLM or an LLM service like OpenAI.

For the impatient, the code is on GitHub.

To get started, install the required packages into your environment.

pip install -r requirements.txt
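The exact contents of the repository's requirements.txt may differ, but based on the imports used below it will look something like this (the package names here are an assumption). Note that AsyncChromiumLoader also needs a Playwright browser, installed with playwright install chromium.

duckduckgo-search
langchain
langchain-community
beautifulsoup4
playwright
streamlit
ollama
openai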

Let's dig into the code! Querying DuckDuckGo, retrieving the web pages, and formatting the text for insertion into the prompt are done by these three functions. The ddg_search function queries DuckDuckGo. The get_page function uses Langchain's document loader to retrieve the pages returned by the search, extracts only the text between p HTML tags with the BeautifulSoupTransformer, and returns a list of Langchain documents.

The ddg_search function then extracts the text from each document and truncates it so that it fits within the context window of the LLM. Recent LLMs have larger context windows, and you can change how much text is kept, and where it is taken from, by changing the values. For example, you may want to capture the end of the text, which often contains conclusions and summaries. The processed text from each document is returned as a list.

import re

from duckduckgo_search import DDGS
from langchain_community.document_loaders import AsyncChromiumLoader
from langchain_community.document_transformers import BeautifulSoupTransformer

def ddg_search(query):
    # search DuckDuckGo and collect the result URLs
    results = DDGS().text(query, max_results=5)
    print(results)
    urls = []
    for result in results:
        url = result['href']
        urls.append(url)

    docs = get_page(urls)

    # collapse runs of blank lines and truncate each page to fit the context window
    content = []
    for doc in docs:
        page_text = re.sub("\n\n+", "\n", doc.page_content)
        text = truncate(page_text)
        content.append(text)

    return content

def get_page(urls):
    # load the pages with a headless Chromium browser
    loader = AsyncChromiumLoader(urls)
    html = loader.load()

    # keep only the text inside <p> tags and drop <a> tags
    bs_transformer = BeautifulSoupTransformer()
    docs_transformed = bs_transformer.transform_documents(html, tags_to_extract=["p"], unwanted_tags=["a"])

    return docs_transformed

def truncate(text):
    # keep only the first 400 words
    words = text.split()
    truncated = " ".join(words[:400])

    return truncated
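As a quick sanity check, you can call ddg_search directly and print what will end up in the prompt. The query below is just an example.

content = ddg_search("how does retrieval augmented generation work")
print(len(content))
print(content[0][:300])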

The next step is creating the prompt. Currently, there is no standard prompt format; each LLM implements its own. The following functions demonstrate how to create a prompt for a llama2 or llama3 LLM and for an OpenAI LLM. Note how the prompt construction differs.

def create_prompt_ollama(llm_query, search_results):
    content_start = (
        "Answer the question using only the context below.\n\n"+
        "Context:\n"
    )

    content_end = (
        f"\n\nQuestion: {llm_query}\nAnswer:"
    )

    content = (
        content_start + "\n\n---\n\n".join(search_results) + 
        content_end
    )

    prompt = [{'role': 'user', 'content': content }]

    return prompt

def create_prompt_openai(llm_request, search_results):
    prompt_start = (
        "Answer the question using only the context below.\n\n"+
        "Context:\n"
    )

    prompt_end = (
        f"\n\nQuestion: {llm_request}\nAnswer:"
    )

    prompt = (
        prompt_start + "\n\n---\n\n".join(search_results) + 
        prompt_end
    )

    return prompt
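To make the difference concrete, here is roughly what each function returns for a toy input (the document strings are hypothetical and abbreviated):

docs = ["First page text...", "Second page text..."]

# Ollama expects a list of chat messages
create_prompt_ollama("What is RAG?", docs)
# [{'role': 'user', 'content': 'Answer the question using only the context below.\n\nContext:\nFirst page text...\n\n---\n\nSecond page text...\n\nQuestion: What is RAG?\nAnswer:'}]

# OpenAI's completions endpoint takes a single string
create_prompt_openai("What is RAG?", docs)
# 'Answer the question using only the context below.\n\nContext:\nFirst page text...\n\n---\n\nSecond page text...\n\nQuestion: What is RAG?\nAnswer:'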

Creating a completion (or response) for each LLM also differs.

import ollama
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# use ollama with llama3 foundation model
def create_completion_ollama(prompt):
    completion = ollama.chat(model='llama3', messages=prompt)

    return completion['message']['content']

# use openai's foundation models
def create_completion_openai(prompt):
    res = client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt=prompt,
        temperature=0,
        max_tokens=1000,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None
    )

    return res.choices[0].text

The application uses Streamlit to search DuckDuckGo, send the results to the LLM, and display the completion. The Ollama calls are commented out below; uncomment them (and comment out the OpenAI lines) to run the local model instead.

import streamlit as st

# ui
with st.form("prompt_form"):
    result = ""
    prompt = ""
    search_query = st.text_area("DuckDuckGo search:", None)
    llm_query = st.text_area("LLM prompt:", None)
    submitted = st.form_submit_button("Send")
    if submitted:
        search_results = ddg_search(search_query)
        # prompt = create_prompt_ollama(llm_query, search_results)
        # result = create_completion_ollama(prompt)
        prompt = create_prompt_openai(llm_query, search_results)
        result = create_completion_openai(prompt)

    e = st.expander("LLM prompt created:")
    e.write(prompt)
    st.write(result)
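Save everything in a single script and launch it with Streamlit. The filename app.py is just an assumption; use whatever you named the script.

streamlit run app.py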

Here is an example query critiquing Taylor Swift's new Tortured Poets Department album using OpenAI's GPT-3.5. Ouch! A bit harsh.

RAG web search UI

Thoughts

Can this be improved upon? Definitely! Most search engines have operators for searching a specific site or excluding sites. For example, a search for Kubernetes' Container Network Interface (CNI) can be limited to kubernetes.io, as shown below, instead of every other site that mentions CNI. The BeautifulSoupTransformer supports extracting or excluding text by tag, and the truncate function can be expanded to extract text from specific parts of a page, such as the end, where conclusions and summaries are located. You can also swap the general chat model for an instruct model and use the application as a coding assistant.
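For instance, DuckDuckGo supports the site: operator, so limiting the search to a single site is just a change to the query string. A minimal sketch:

# restrict results to kubernetes.io with DuckDuckGo's site: operator
results = DDGS().text("container network interface site:kubernetes.io", max_results=5)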

Using web search with an LLM can help produce better search results and summaries. Be sure to check out the code on GitHub.
