Swapping in elasticsearch to the proto-OLIVER

#llm #rag

In the previous video of the llm-zoomcamp, which I didn't post about, I refactored the code to make it modular, so I could swap in a different search engine or a different LLM. In this video, the last one of module 1, I learned how to replace the in-memory search engine with Elasticsearch. I had already installed elasticsearch at the beginning, when I installed everything else.

The first thing I did was start a Docker container running Elasticsearch. This didn't work at first: I got the error "Elasticsearch exited unexpectedly". I went to the course FAQ and found the solution. I needed to add one more option at the end of the command: -e "ES_JAVA_OPTS=-Xms512m -Xmx512m". This sets the Java heap for Elasticsearch to 512 MB, which limits its memory use so it can run in GitHub Codespaces.

docker run -it \
    --rm \
    --name elasticsearch \
    -p 9200:9200 \
    -p 9300:9300 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    -e "ES_JAVA_OPTS=-Xms512m -Xmx512m" \
    docker.elastic.co/elasticsearch/elasticsearch:8.4.3

Then I imported elasticsearch and created a client.

from elasticsearch import Elasticsearch

es_client = Elasticsearch("http://localhost:9200")
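
The container can take a little while before it accepts connections, so a request sent right after starting it may fail. A minimal sketch of a readiness check, assuming the client's ping() method (which returns True or False rather than raising); the function name wait_for_elasticsearch and its retries and delay parameters are my own and not from the course:

```python
import time


def wait_for_elasticsearch(client, retries=30, delay=1.0):
    """Poll client.ping() until it returns True or the retries run out."""
    for _ in range(retries):
        # ping() returns False when the node cannot be reached yet
        if client.ping():
            return True
        time.sleep(delay)
    return False
```

Calling wait_for_elasticsearch(es_client) before indexing returns True once the node answers, or False if it never does within the allowed attempts.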

Indexing the documents is slightly more complicated than in our in-memory search engine. The instructor had the settings object set up for us. It declares the same text fields and keyword field as before, but also includes some index settings (one shard and no replicas).

index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} 
        }
    }
}

index_name = "course-questions"

es_client.indices.create(index=index_name, body=index_settings)
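
Running the create call a second time while the container is still up fails, because the index already exists. A small sketch of one way around that, assuming it is acceptable to throw away the existing index and everything in it; recreate_index is a hypothetical helper name, while indices.exists, indices.delete and indices.create are methods of the client:

```python
def recreate_index(client, name, settings):
    """Drop the index if it already exists, then create it fresh."""
    if client.indices.exists(index=name):
        # Deleting the index also deletes every document stored in it
        client.indices.delete(index=name)
    client.indices.create(index=name, body=settings)
```

With this, recreate_index(es_client, index_name, index_settings) can be rerun safely, at the cost of having to index the documents again afterwards.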

Next I fed the documents into the search engine. I set up a progress bar for this operation using tqdm, which I had also installed earlier. The import printed a warning that IProgress was not found (it suggests updating jupyter and ipywidgets), but it didn't matter: I still had a crude text progress bar and could tell how long the indexing was going to take.

from tqdm.auto import tqdm

/usr/local/python/3.10.13/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

100%|███████| 948/948 [00:20<00:00, 45.47it/s]
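
The loop above sends one request per document. The Python client also ships a bulk helper, elasticsearch.helpers.bulk, that sends many documents per request. The sketch below is not what the course did, and I have not timed it on this data; bulk_actions and bulk_index are hypothetical names of my own:

```python
def bulk_actions(documents, index_name):
    """Yield one bulk action per document, in the shape helpers.bulk accepts."""
    for doc in documents:
        yield {"_index": index_name, "_source": doc}


def bulk_index(client, documents, index_name):
    """Index all documents through the bulk helper instead of one call each."""
    # Imported here so bulk_actions can be used without the package installed
    from elasticsearch.helpers import bulk

    # bulk() returns a tuple: (number of successes, list of errors)
    return bulk(client, bulk_actions(documents, index_name))
```

It would be called as bulk_index(es_client, documents, index_name) in place of the for loop.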

documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

This shows that the documents are the same ones I used before.

Next I called the search engine with the usual query.

query = "I just discovered the course. can I still join?"

def elastic_search(query):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "data-engineering-zoomcamp"
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)

    result_docs = []

    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])

    return result_docs

Here, the search_query is also more complicated than it was in our in-memory search engine. Again, the instructor had set it up for us. I also had to do some work to get the output into the same format as I had before: the loop at the end of the function pulls the '_source' of each hit out of the response. Once I did that, I could call the LLM with the context from the search engine.
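
To make the pieces of that query easier to see, here is a sketch that splits the function into two small helpers. In the query, "question^3" makes a match in the question field count three times as much as a match in text or section, and the filter restricts results to one course without affecting the score. The helper names build_search_query and extract_sources, and the course and size parameters, are my own additions and not part of the course code:

```python
def build_search_query(query, course, size=5):
    """Build the same query body as above, with the course as a parameter."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Scored part: match the query text, weighting question 3x
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields",
                    }
                },
                # Unscored part: keep only documents from the given course
                "filter": {"term": {"course": course}},
            }
        },
    }


def extract_sources(response):
    """Return the original documents stored under hits.hits[*]._source."""
    return [hit["_source"] for hit in response["hits"]["hits"]]
```

Inside elastic_search, these would be used as extract_sources(es_client.search(index=index_name, body=build_search_query(query, "data-engineering-zoomcamp"))).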

Here are the results of the function I defined, elastic_search, followed by the call to the entire rag function, which gives the same result as before. You can see that the only change I made was to define the search engine differently. The rest is the same as before.

elastic_search(query)

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Cloud SDK\nPython 3 (installed with Anaconda)\nTerraform\nGit\nLook over the prerequisites and syllabus to see if you are comfortable with these subjects.',
  'section': 'General course-related questions',
  'question': 'Course - What can I do before the course starts?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
  'section': 'General course-related questions',
  'question': 'Course - Can I follow the course after it finishes?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, the slack channel remains open and you can ask questions there. But always sDocker containers exit code w search the channel first and second, check the FAQ (this document), most likely all your questions are already answered here.\nYou can also tag the bot @ZoomcampQABot to help you conduct the search, but don’t rely on its answers 100%, it is pretty good though.',
  'section': 'General course-related questions',
  'question': 'Course - Can I get support if I take the course in the self-paced mode?',
  'course': 'data-engineering-zoomcamp'},
 {'text': "You don't need it. You're accepted. You can also just start learning and submitting homework without registering. It is not checked against any registered list. Registration is just to gauge interest before the start date.",
  'section': 'General course-related questions',
  'question': 'Course - I have registered for the Data Engineering Bootcamp. When can I expect to receive the confirmation email?',
  'course': 'data-engineering-zoomcamp'}]

def rag(query):
    search_results = elastic_search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer

rag(query)

"Yes, even if you don't register, you're still eligible to submit the homeworks. Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute."

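Since the only thing that changed was the search function, one way to make the swap explicit is to pass the three pieces in as arguments. This is a sketch of that idea rather than the course's code; make_rag is a hypothetical name, and search, build_prompt and llm stand for whatever functions are already defined in the notebook:

```python
def make_rag(search, build_prompt, llm):
    """Return a rag(query) function wired to the given components."""

    def rag(query):
        search_results = search(query)
        prompt = build_prompt(query, search_results)
        return llm(prompt)

    return rag
```

With this, make_rag(elastic_search, build_prompt, llm) gives the Elasticsearch version, and passing the in-memory search function instead gives the earlier one.
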
Previous post: Generating a result with a context
