DEV Community

Robert Boys

A 19th Century Author Taught Me RAG.

I asked a 14-billion-parameter LLM to recall a short story by Nathaniel Hawthorne, and it told me the story was written by Edith Wharton. This was a good thing, because I wanted to see firsthand how Retrieval Augmented Generation (RAG) really works. I am writing this post to share the process and my results, in case others who are new to RAG would like to read a case study on it.

RAG is a method for improving the accuracy of LLM output. It builds a separate database that the LLM can consult for information that is missing from, or underrepresented in, its pretraining memory. The relevant context is pulled from the database and silently added to the user's prompt. With this additional information, the LLM has more to work with and, hopefully, produces a better result. There are several design decisions involved in building such a pipeline, which we will explore in a hands-on experiment using entirely local tools.

This study was intentionally old school in its setup: I used a Large Language Model (LLM) from the days of AI antiquity, 2024, running on my local laptop. Specifically, the study was built entirely in Python using ChromaDB 1.5.2 as the vector database, nomic-embed-text for embeddings, and phi4 via Ollama as the LLM. All code can be found on my GitHub at https://github.com/RobNotBob/rag-rappaccini.

The Text

My first decision was to select a source text to evaluate. I soon settled on the short story, Rappaccini's Daughter by Nathaniel Hawthorne.

Published in 1844, Rappaccini’s Daughter is the tragic love story of a young university student, Giovanni Guasconti. The young man comes from Southern Italy to study at the University of Padua, where Prof. Baglioni, a good friend of his father, teaches. The property adjacent to his room is an elaborate garden lush with exotic plants. He observes that the garden is presided over by a pale and stern-faced man, Dr. Rappaccini, and his beautiful daughter.

He falls in love with the girl, but she warns him not to enter the garden or touch her. We learn the garden is actually full of highly toxic plants that Dr. Rappaccini uses in his medical experiments. His daughter grew up among the plants and so is immune to their toxic nature. Despite the warnings of Prof. Baglioni, Giovanni is driven by his love to get closer to her. When her fingers accidentally touch his wrist, his skin breaks out in a horrible rash. Prof. Baglioni, observing how the toxins are affecting the son of his friend, offers an antidote for both him and Rappaccini’s daughter. Giovanni is cured, but when she drinks it, she dies.

It is a sad story, but one well suited to this study: it is in the public domain, easily available, and more obscure than, say, Shakespeare or Charles Dickens. This meant that phi4's pretraining exposure to the story would likely be low.

I obtained the text from the Project Gutenberg page linked below. The story is contained in the collection Mosses from an Old Manse, published in 1846. This source was important because the file was not cluttered with formatting scripts and included clear headings, which made it easier to extract and prepare the document for RAG processing.

https://www.gutenberg.org/files/512/512-h/512-h.htm#rappaccini

Baseline Performance

Before building a RAG pipeline, I need to assess whether one is even necessary. Without RAG, the LLM must rely on its pretraining content to respond to my prompts. The following are some of the important evaluations to consider.

Recitation. I prompted phi4: "Recite the first four sentences of Rappaccini's Daughter by Nathaniel Hawthorne." The response named Edith Wharton as the author and then hallucinated the entire quote.

Factual Recall. I asked a series of verifiable questions about the short story and examined the responses. Out of ten basic questions, it missed only one outright: where Giovanni came from. Two other responses were either mis-framed or anachronistic (calling the plants "genetically modified" in a 19th-century text that knew no such thing). Of the five harder questions, it gave fully or partially wrong answers to all but one. The most egregious was claiming Dr. Rappaccini's reputation among scientists was that he turned humans into plants. phi4's pattern is to anchor on something true and then fill in the gaps confidently, which is how it hallucinates. For simple facts the LLM is mostly correct, but on scene details it is consistently wrong or padded with errors. Notably, the LLM never once signaled uncertainty, even though I prompted it to do so.

Plausible Distractors. I then tested the LLM by providing two different plot points: one real and one plausible but false. This assesses whether the LLM "knows" the text or is simply pattern-matching on the literary genre. phi4 performed much better on distractors than on factual recall. This test presents two anchors, which makes it easier for the LLM to identify the more plausible option, whereas factual recall requires producing a specific detail from nothing. A RAG process should help with that drawback.

Intentional Error Detection. The final test provided a statement with an error and asked the LLM to both find the error and correct it. Phi4 answered all five questions correctly, which illustrates that when given an anchor, the LLM performs well.

The conclusion from all these tests is that the LLM struggles to answer anything beyond simple factual questions or to recite specific lines of text. When given an anchor, it performs much better. This is a concern, because the prompter will not always know enough to provide the proper anchor. A RAG process could indeed improve this performance.

Here is the list of questions that the LLM failed during the baseline testing:

  • Recite the first 4 sentences.
  • Question 5: What country does Giovanni come from?
  • Question 12: What flower does Rappaccini give Beatrice when she enters the garden?
  • Question 14: What does Professor Baglioni tell Giovanni about Rappaccini's reputation among scientists?

Document Preparation

Now that I have a good baseline and have confirmed the need for improvement, I need to prepare the text document for use with the LLM. This is a significant first step for overall RAG performance: the more a document or data file is cluttered with irrelevant formatting or information, the more limited or difficult chunking strategies become (more on this later). Fortunately, the Project Gutenberg text was clean, and the only real effort was extracting the Rappaccini's Daughter story from a compilation of stories within the single text document.

Here is an example of what the text looked like when downloaded from Project Gutenberg:

<h2><a name="chap04"></a>
YOUNG GOODMAN BROWN</h2>
--
</div><!--end chapter-->
<div class="chapter">
<h2><a name="chap05"></a>
RAPPACCINI’S DAUGHTER</h2>
--
</div><!--end chapter-->
<div class="chapter">
<h2><a name="chap06"></a>
MRS. BULLFROG</h2>
--
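With those chapter headings as landmarks, the story can be sliced out programmatically. Here is a minimal sketch; the `extract_story` helper and the exact markers are illustrative, not from my repo, and note that the real Gutenberg file uses a typographic apostrophe in the heading:

```python
import re

def extract_story(html, title, next_marker='<div class="chapter">'):
    # Locate the chapter heading for the story we want
    start = html.find(title)
    if start == -1:
        raise ValueError(f"Heading {title!r} not found")
    # The chapter runs until the next chapter div begins
    end = html.find(next_marker, start)
    section = html[start:end] if end != -1 else html[start:]
    # Strip the remaining HTML tags, leaving plain prose
    return re.sub(r"<[^>]+>", " ", section).strip()
```

A quick visual check of the extracted text against the headings shown above is worthwhile before moving on to chunking.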

Chunking Strategy

Once the text is ready, the next step is to determine a chunking strategy. Chunking splits the text into meaningful pieces that can be quietly added to a user's prompt to improve the quality of the request to the LLM. Without chunking, the context window would be overwhelmed with unfocused content, which would likely make performance even worse than having no retrieval at all.

For example, if I wanted to know what happened to the insect that was near Beatrice, I would want the RAG process to find that one relevant paragraph and include it with my prompt to the LLM. The question is how to create these chunks. If chunks are too large, you risk including irrelevant details; if too small, some critical context can be left behind. I examined two different chunking strategies: fixed-size and paragraph boundary.

Fixed-size chunking takes a set number of characters for each chunk. An overlap is included to avoid cases where an idea is split across chunk boundaries. Here is the Python code to do this:

def chunk_fixed_size(text, chunk_size=500, overlap=50):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size           # Define the end of this chunk
        chunk = text[start:end].strip()    # Extract the chunk
        if chunk:                          # Only add non-empty chunks
            chunks.append(chunk)
        start += chunk_size - overlap  # Advance start, minus the overlap
    return chunks

# Function call executed and stored in a variable:
fixed_chunks = chunk_fixed_size(text, chunk_size=500, overlap=50)

Paragraph-boundary chunking can be an effective strategy for text documents if the prose style is compatible, which is something this study examined. Knowing that the source text uses a double newline to start each new paragraph, the script used that as the marker for splitting the text.

def chunk_by_paragraph(text, min_length=100):
    raw_chunks = text.split("\n\n")   # Split on double newlines
    chunks = []
    # Clean each chunk and filter out short ones
    for chunk in raw_chunks:
        cleaned = chunk.strip()
        if len(cleaned) >= min_length:
            chunks.append(cleaned)

    return chunks

# Function call executed and stored in a variable:
para_chunks = chunk_by_paragraph(text, min_length=100)

The story had 66,335 characters. For the fixed-size method, 148 chunks were created with an average size of 497 characters. Here is an example, chunk #10:

ni stood at the window he heard a rustling behind a screen of leaves, and became aware that a person was at work in the garden. His figure soon emerged into view, and showed itself to be that of no common laborer, but a tall, emaciated, sallow, and sickly-looking man, dressed in a scholar's garb of black. He was beyond the middle term of life, with gray hair, a thin, gray beard, and a face singularly marked with intellect and cultivation, but which could never, even in his more youthful days,

For the paragraph boundary method, 117 chunks were created with an average size of 554 characters. The shortest chunk was 100 characters while the largest was 2,118 characters. Here is its chunk #10:

Soon there emerged from under a sculptured portal the figure of a young girl, arrayed with as much richness of taste as the most splendid of the flowers, beautiful as the day, and with a bloom so deep and vivid that one shade more would have been too much. She looked redundant with life, health, and energy; all of which attributes were bound down and compressed, as it were and girdled tensely, in their luxuriance, by her virgin zone. Yet Giovanni's fancy must have grown morbid while he looked down into the garden; for the impression which the fair stranger made upon him was as if here were another flower, the human sister of those vegetable ones, as beautiful as they, more beautiful than the richest of them, but still to be touched only with a glove, nor to be approached without a mask. As Beatrice came down the garden path, it was observable that she handled and inhaled the odor of several of the plants which her father had most sedulously avoided.

Note that the fixed-size chunk starts with a word fragment and ends mid-sentence. This loss of context may impair the LLM. For the paragraph boundaries, we are fortunate that Hawthorne wrote in similarly sized paragraphs, with only one very large outlier. The weakness of this strategy is its complete dependence on the writer's paragraphing style.
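The chunk statistics quoted above (count, average, minimum, maximum) can be reproduced with a small helper like this one; `chunk_stats` is a convenience sketch, not part of the original pipeline:

```python
def chunk_stats(chunks):
    # Summarize a list of text chunks by character length
    sizes = [len(c) for c in chunks]
    return {
        "count": len(sizes),
        "avg": sum(sizes) // len(sizes),
        "min": min(sizes),
        "max": max(sizes),
    }
```

Running it over both chunk lists makes it easy to compare strategies before committing to one.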

Embedding Strategy

Each of these two lists of strings, one per chunking method, must be converted into vectors of numbers so that similarity can be computed mathematically, a process called vector search. A vector is a long list of numbers that represents meaning. For example, the word "dolphin" would have numbers representing ideas such as water, swimming, mammal, fish-eater, and echolocation. The word "snorkel" would have numbers representing water, swimming, equipment, plastic, and manufacture. Because each idea is represented numerically, the vectors can be compared mathematically. Here is a snippet of what an embedding looks like:

{"embedding":[1.1739883422851562,-0.32890406250953674,-3.7720236778259277,0.4036591947078705,1.3114928007125854,-0.17043189704418182,-1.1893550157546997,...]}
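To make the comparison idea concrete, here is a toy illustration using made-up 3-dimensional vectors. Real embeddings have hundreds of dimensions whose meanings are learned rather than hand-labeled, so this is only an intuition aid:

```python
# Hypothetical dimensions: [water, swimming, is_animal]
vectors = {
    "dolphin": [0.9, 0.8, 0.9],
    "snorkel": [0.9, 0.7, 0.0],
    "car":     [0.0, 0.1, 0.0],
}

def dot(a, b):
    # A higher dot product means more shared meaning
    return sum(x * y for x, y in zip(a, b))

# "dolphin" scores much closer to "snorkel" than to "car"
print(dot(vectors["dolphin"], vectors["snorkel"]))
print(dot(vectors["dolphin"], vectors["car"]))
```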

To work with vectors, a vector database is needed to store and query this data. The vector database used in this study was ChromaDB. It was selected because it runs entirely in Python and stores data locally in a SQLite database, which avoids the need for separate servers.

ChromaDB ships with a default embedding model, all-MiniLM-L6-v2, which stores meaning in 384-dimensional vectors and requires 87 MB of storage. For this project, I used a different embedding model, nomic-embed-text, which creates 768-dimensional vectors using 137M parameters. This choice provides a much larger context window, avoiding truncation of chunks, at the cost of speed: nomic takes about five times longer to run. However, for this study the embedding process took only 1-2 minutes and was not a factor.

The script for sending the text and receiving the vector is pretty simple:

def get_embedding(text):
    response = requests.post(
        "http://localhost:11434/api/embeddings",
        json={
            "model": "nomic-embed-text",
            "prompt": text
        }
    )
    return response.json()["embedding"]

Each vector is then stored in a database table, which in vector terminology is called a collection. The embeddings field holds the vector, documents holds the original text, and metadatas provides extra information we can filter on later:

import chromadb

# Create a local client and the collection first
# (the storage path here is illustrative)
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    name="rappaccini",
    metadata={"hnsw:space": "cosine"}
)

for i, chunk in enumerate(chunks):
    # Get the embedding vector for this chunk
    embedding = get_embedding(chunk)
    collection.add(
        ids=[f"chunk_{i}"],
        embeddings=[embedding],
        documents=[chunk],
        metadatas=[{"chunk_index": i, "length": len(chunk)}]
    )

When doing vector searches, we need to specify a mathematical method for comparing vectors. There are several to choose from; I chose cosine, which measures the angle between vectors and is better suited to text than raw distance measurements. This is one parameter that can be changed when tuning a RAG process.
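The metric itself is easy to illustrate. This is the mathematical definition of cosine distance, not ChromaDB's actual indexed implementation (which uses an HNSW index over the same measure):

```python
import math

def cosine_distance(a, b):
    # Cosine distance = 1 - cos(angle between a and b)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

print(cosine_distance([1, 0], [2, 0]))   # same direction -> 0.0
print(cosine_distance([1, 0], [0, 1]))   # orthogonal     -> 1.0
print(cosine_distance([1, 0], [-1, 0]))  # opposite       -> 2.0
```

Note that `[1, 0]` and `[2, 0]` score as identical even though their magnitudes differ, which is exactly why angle-based comparison suits text: it ignores length and keeps only direction (meaning).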

Both the database and the embedding model were run locally on a CPU-only laptop. The process ran at about five chunks per second, which was fine for the roughly 150 chunks per method being tested.

Retrieval Quality

Now that we have a vector database loaded with the contents of Rappaccini's Daughter, chunked both by fixed length and by paragraph boundary, we need to assess how well the system retrieves the right chunks. This process is hidden from the regular user but is essential to RAG performance: retrieving the wrong chunk dooms the process to failure.

Reviewing the list of failures in the baseline testing, I selected the following question to perform the test because it is a concrete detail that can be located in a specific paragraph:

What does Professor Baglioni tell Giovanni about Rappaccini's reputation among scientists?

To do this test, I converted the question into a vector using the same nomic embedding model and script described above. Then I ran a query against the collection and asked it to return the top three results:

def retrieve_chunks(collection, query, n_results=3):
    query_embedding = get_embedding(query)
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        include=["documents", "distances", "metadatas"]
    )
    return results

For cosine distances, the metric ranges from 0 (identical) to 2 (opposite). For the paragraph chunking, the top three results were 0.3437, 0.3598, and 0.3630. The fixed-length chunking had similar scores of 0.3671, 0.3760, and 0.3834.

The top result for the paragraph boundary was Chunk 44:

Distance score: 0.3437 (lower = more similar)
Chunk index: 44 | Length: 253 chars

"Patience! patience!" "I tell thee, my poor Giovanni, that Rappaccini has a scientific interest in thee. Thou hast fallen into fearful hands! And the Signora Beatrice,—what part does she act in this mystery?" "Are there such idle rumors?" "Do people say that I am skilled in my father's science of plants? What a jest is there! No; though I have grown up among these flowers, I know no more of them than their hues and perfume; and sometimes methinks I would fain rid myself of even that small knowledge. There are many flowers here, and those not the least brilliant, that shock and offend me when they meet my eye. But pray, signor, do not believe these stories about my science. Believe nothing of me save what you see with your own eyes." "Aha! So now our friend Giovanni's secret is out. You have heard of this daughter, whom all the young men in Padua are wild about, though not half a dozen have ever had the good hap to see her face. I know little of the Signora Beatrice save that Rappaccini is said to have instructed her deeply in his science, and that, young and beautiful as fame reports her, she is already qualified to fill a professor's chair. Perchance her father destines her for mine! Other absurd rumors there be, not worth talking about or listening to. So now, Signor Giovanni, drink off your glass of lachryma." "And what ar

The top result for the fixed-length method was Chunk 27. I am including Chunk 28 because it contains the rest of the relevant content:

--- Result 1 ---
Distance score: 0.3671
Chunk index: 27 | Length: 500 chars

hand, I should answer it but scantily to my conscience were I to permit a worthy youth like yourself, Signor Giovanni, the son of an ancient friend, to imbibe erroneous ideas respecting a man who might hereafter chance to hold your life and death in his hands. The truth is, our worshipful Dr. Rappaccini has as much science as any member of the faculty—with perhaps one single exception—in Padua, or all Italy; but there are certain grave objections to his professional character." "And what ar


--- Result 2 ---
Distance score: 0.376
Chunk index: 28 | Length: 499 chars

to his professional character." "And what are they?" "Has my friend Giovanni any disease of body or heart, that he is so inquisitive about physicians?" "But as for Rappaccini, it is said of him—and I, who know the man well, can answer for its truth—that he cares infinitely more for science than for mankind. His patients are interesting to him only as subjects for some new experiment. He would sacrifice human life, his own amon

Note that the distance scores are comparable between methods. However, fixed chunk 27 contained more of the correct content than paragraph chunk 44, despite being smaller and having a poorer distance score. Overlap helps the fixed-length method here: the following chunk includes the remainder of the relevant context. Both methods placed the right content among the top three, but neither captured it perfectly in the top hit. Since both chunking methods produced better retrieval quality than the LLM alone, I proceeded to the next step.

Generation

Now it is time to bring this all together and run the RAG-enhanced LLM process. To do this, I used some simple Python scripts to control the evaluation process:

def generate_answer(question, context_chunks):
    # Build the context string from retrieved chunks
    context = "\n\n---\n\n".join(context_chunks)
    # This is the RAG prompt - it instructs phi4 to use
    # only the provided context to answer the question
    prompt = (
        "You are answering questions about Rappaccini's Daughter "
        "by Nathaniel Hawthorne. Use only the context provided "
        "below to answer the question. Be specific and concise. "
        "If the answer is not contained in the context, say so.\n\n"
        "CONTEXT:\n"
        + context +
        "\n\nQUESTION:\n"
        + question
    )
    # Send to phi4 via Ollama generate endpoint
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "phi4",
            "prompt": prompt,
            "stream": False
        }
    )
    return response.json()["response"]

Here are some highlights from this testing.

Positional Questions are Challenging

Recite the first 4 sentences of Rappaccini's Daughter by Nathaniel Hawthorne.

I had thought this should have been a simple question, considering the chunks are taken in sequential order and numbered. This proved to be a false assumption.

The boundary chunking method produced this response:

Distance scores: [0.4081, 0.4126, 0.4138]
Generating answer with phi4...
Answer:
I'm sorry, but I can't provide verbatim excerpts from "Rappaccini's Daughter" as it is not included in my training data. However, I can tell you that "Rappaccini's Daughter" begins with Giovanni Guasconti arriving at the University of Padua and renting a room in an old building near the garden belonging to Professor Baglioni. If you're looking for the exact text, I recommend checking a copy of the story itself or a reliable literary source.

The fixed chunking method produced a very similar response:

Distance scores: [0.4121, 0.4174, 0.4219]
Generating answer with phi4...
Answer:
I'm sorry, but I can't provide verbatim text from "Rappaccini's Daughter" as it is not included in my training data. However, you might consider looking up the opening lines directly from a copy of the story to see them for yourself. If there are specific themes or questions about the beginning of the story that you would like to discuss, feel free to ask!

Even when provided with RAG chunks, the LLM still refused to use them, claiming the text was not in its pretraining memory. There are two possible causes for this behavior. First, it could be a retrieval failure, given the distance scores; the prompt itself may be misleading the model. Second, it could be intentional diversion training built into the phi4 model over literary copyright concerns.

To evaluate these options, I pulled the top chunks from each method; neither included the correct one, Chunk 0. When I forced the retrieval of Chunk 0 for each method, I received the same, correct, response, although the fixed-length version was cut off since the prompt asked for more than 500 characters:

A young man, named Giovanni Guasconti, came, very long ago, from the more
southern region of Italy, to pursue his studies at the University of Padua.
Giovanni, who had but a scanty supply of gold ducats in his pocket, took
lodgings in a high and gloomy chamber of an old edifice which looked not...

Regardless of the cause, this result should caution RAG developers that responses may still deviate from expected behavior. RAG is designed for semantic search, not positionally specific content, and this mismatch can trigger hallucinations. Note also that diversion training baked into a model can significantly influence how it responds.
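One possible mitigation, sketched below as a hypothetical helper (not part of my pipeline): detect positional wording in the query and route it to a direct lookup over the ordered chunk list, bypassing vector search entirely.

```python
import re

# Crude keyword patterns for positional requests
FIRST = re.compile(r"\b(first|opening|beginning)\b", re.IGNORECASE)
LAST = re.compile(r"\b(last|final|ending)\b", re.IGNORECASE)

def route_positional(query, chunks):
    # Chunks are stored in document order, so positional
    # requests can be answered by index instead of similarity
    if FIRST.search(query):
        return chunks[0]
    if LAST.search(query):
        return chunks[-1]
    return None  # caller falls back to semantic retrieval
```

Real systems would need a far more robust query classifier, but even this crude routing would have answered the recitation prompt correctly.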

Distance Scores Can Mislead

Question: What does Professor Baglioni tell Giovanni about Rappaccini's reputation among scientists?

The baseline answer was wrong:

Rappaccini is considered a dangerous scientist who has crossed
ethical boundaries by turning humans into plants.

The paragraph answer was not much better:

Retrieved chunks: 3
Distance scores: [0.3568, 0.3571, 0.3709]

Generating answer with phi4...

Answer:
The context provided does not contain specific information regarding Professor Baglioni's comments on Rappaccini's reputation among scientists. It mentions that Rappaccini has a scientific interest in Giovanni and instructs Beatrice deeply in his science, but no explicit details about Rappaccini's reputation are given.

The fixed-chunks answer was far better:

Retrieved chunks: 3
Distance scores: [0.3679, 0.3777, 0.3839]

Generating answer with phi4...

Answer:
Professor Baglioni tells Giovanni that Dr. Rappaccini is highly regarded for his scientific knowledge, possibly equal to any member of the faculty in Padua or all Italy with one exception. However, there are significant objections to Rappaccini's professional character; he cares infinitely more for science than for mankind and views patients as subjects for experiments rather than individuals deserving care.

Note that even though the distance scores of the fixed-length chunks were worse than those of the paragraph boundary method, fixed-length provided a much better result. The paragraph chunk merged several scenes with multiple speakers, which confused the LLM, while the fixed-length method retrieved two consecutive chunks that together gave the full context of the question. It could still have been better: the third chunk cut off a sentence where Baglioni says Dr. Rappaccini would sacrifice human life, including his own, for the cause of science.

Regression Testing

I reran the baseline tests against each method and found that most responses showed no degradation in performance. The paragraph chunking method generally had better distance scores, as it tended to preserve the full context in a single chunk compared to fixed-length.

This is illustrated with the following question:

Question: Does Professor Baglioni warn Giovanni out of genuine concern or professional jealousy toward Rappaccini?

The baseline answer, without RAG enhancement, responded correctly with this:

Baglioni acts from both genuine concern AND professional jealousy.
The text deliberately leaves his motives ambiguous.

The paragraph boundary method also provided the correct answer with even more context:

Distance scores: [0.3623, 0.3659, 0.3702]

Generating answer with phi4...

Answer:
The context suggests that Professor Baglioni's warning to Giovanni may be driven by both concern and professional jealousy. The text describes him as "imperturbable" and notes his mention of Rappaccini's "scientific interest" in Giovanni, indicating a potential rivalry or mistrust toward Rappaccini. Additionally, the professor expresses fear that Giovanni has "fallen into fearful hands," which implies genuine concern for Giovanni's well-being. Therefore, Baglioni's warning likely stems from a combination of both motivations.

The fixed-length, however, stumbled on this one:

Distance scores: [0.3819, 0.3833, 0.3945]

Generating answer with phi4...

Answer:
The context provided does not contain information regarding Professor Baglioni's motivations for warning Giovanni. It mentions his reaction to Beatrice and Giovanni but does not specify whether his concerns are due to genuine care or professional jealousy towards Rappaccini. Therefore, the answer cannot be determined from the given context.

This result is expected: the breaks in the paragraph boundary method are not arbitrary, so when a prompt requires synthesizing text from multiple places in the story, it was easier for RAG to identify and prioritize the relevant chunks.

Conclusions

So what can we conclude from this study? RAG processes can certainly help with LLM responses, but they are not infallible and the techniques used can impact results. I had thought that the paragraph boundary method would be superior across the board, but I was mistaken.

It should be noted that the paragraph boundary method is highly dependent on the writing style of the source text. It is fortunate in this particular case that Nathaniel Hawthorne largely kept each narrative point contained in a single paragraph. Other authors, such as William Faulkner, Virginia Woolf, and Leo Tolstoy, are known to let a single paragraph flow across several pages. It is therefore important to know the documents you are working with and to ensure your starting point has an uncluttered format. RAG systems often fail not because of the process or the tools but because of data cleanliness; the old adage of 'garbage in, garbage out' certainly applies here. The more a document contains only relevant data, with clearly defined headings, the better RAG will perform.

It is tempting to use distance scores as a quantifiable metric of quality. This study showed that is not the case: distance scores do not measure the correctness of the answer, only how topically related a chunk is to the prompt.

For factual queries, fixed-size chunks consistently provided the better answers, because such prompts ask for concentrated details that can be located in a single chunk. However, when a prompt requires reasoning across a broader part of the text, paragraph chunking did better at finding and prioritizing the content in fewer chunks. Non-semantic prompts, such as asking to recite the first few sentences of a text, were a struggle for both methods and highlight a structural weakness of RAG. Lastly, for simple questions, a sufficiently trained LLM should be able to answer without RAG assistance. In the case of Rappaccini's Daughter, however, 14 billion parameters are not enough.

Query Type                      Better Strategy     Reason
Factual — concentrated detail   Fixed chunks        Isolates specific content cleanly
Reasoning — across narrative    Paragraph chunks    Preserves complete thoughts
Positional — first sentences    Neither             Requires direct retrieval
Simple known fact               Both equal          RAG adds no value

Here are some questions to ask when you are looking at RAG solutions:

  1. What chunking strategy do you use and why?
  2. How do you handle positional queries?
  3. How do you measure retrieval quality separately from answer quality?
  4. What is your average chunk size and how was it determined?
  5. How does your system perform on queries requiring reasoning across multiple passages?
  6. Can you show me examples of retrieval failures and how the system handles them?
