Lately I've been thinking about a problem all LLMs share: the knowledge cutoff. Every LLM has a training cutoff date, and it doesn't know about anything that happened after that date. Here are some examples:
- ChatGPT's knowledge cutoff is October 2023
- Claude 3.5 Sonnet's knowledge cutoff is April 2024
- Gemini 2.0 Flash's knowledge cutoff is August 2024
Keep in mind, this doesn't mean the LLMs know about everything up to these dates either. For some topics, the knowledge cutoff might be even earlier. (It's also not clear if October 2023 means October 1 or October 31. 🤔)
In this article, I want to outline some potential solutions to this problem for LLMs and other AI tools.
tldr: For now, you can get around this issue by using an llms-full.txt file or using an AI tool's web search feature.
llms.txt Proposal
llms.txt is a relatively new proposal designed to make it easier for LLMs to understand information on the web.
llms.txt File
The llms.txt proposal consists of three main ideas. The first is to have a file called llms.txt that lists out the URLs for a website in Markdown with some short descriptions. You can think of it as a sitemap specifically for LLMs. If I'm being honest though, I don't understand how someone would use this format today, since I'm not sure how to get an LLM to accurately follow links in a Markdown file.
To see what it looks like, you can view Svelte's llms.txt file.
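The shape of the file is roughly an H1 with the project name, an optional blockquote summary, then H2 sections containing Markdown link lists. A hypothetical example (the project name and URLs here are made up purely for illustration):

```markdown
# ExampleLib

> ExampleLib is a hypothetical utility library, used here only to illustrate the format.

## Docs

- [Getting started](https://example.com/docs/getting-started.md): How to install and configure ExampleLib
- [API reference](https://example.com/docs/api.md): Full list of functions and options

## Examples

- [Common recipes](https://example.com/docs/recipes.md): Short copy-paste examples
```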
llms-full.txt File
Some developers are creating a file called llms-full.txt, which contains all of the documentation for their tool/library/framework in a single file, converted into Markdown. This idea makes more sense to me, since you can give the LLM all of the information at once and you don't have to worry about it visiting the right URLs. The only downside of the llms-full.txt file is that I haven't seen any guidance on how project creators should manage multiple versions of the file if their project has multiple major versions.
To see what it looks like, you can view Svelte's llms-full.txt file.
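For project maintainers, one way to produce an llms-full.txt is simply to concatenate the Markdown sources of every docs page. Here's a minimal sketch, assuming your docs already live as .md files under a single directory (the separator and source-path comments are my own choices, not part of the proposal):

```python
from pathlib import Path

def build_llms_full(docs_dir: str, out_file: str) -> None:
    """Concatenate every Markdown file under docs_dir into one file.

    Pages are separated by a horizontal rule and labeled with their
    relative path, so the LLM can tell where each page begins.
    """
    pages = sorted(Path(docs_dir).rglob("*.md"))
    parts = []
    for page in pages:
        rel = page.relative_to(docs_dir)
        parts.append(f"<!-- Source: {rel} -->\n\n{page.read_text(encoding='utf-8')}")
    Path(out_file).write_text("\n\n---\n\n".join(parts), encoding="utf-8")
```

Since the output file ends in .txt, rerunning the script won't accidentally pick up its own output.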
Markdown Files
llms.txt does have a third recommendation as well: documentation sites should have Markdown versions of each page. So if you have a page called installation.html, you should also be able to visit installation.html.md to see a clean Markdown version of that same information. Once again, I'm not sure how developers are supposed to use this format at the moment.
To see what it looks like, you can look at this page on the Tab feature from Cursor's documentation.
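Since the convention is just to append .md to the page's path, a client could derive the Markdown URL mechanically. A small sketch (the helper name is mine, and it assumes the site actually serves the .md twin):

```python
def markdown_url(page_url: str) -> str:
    """Return the Markdown twin of a docs page URL by appending '.md'.

    Assumes the site follows the llms.txt convention where a page like
    installation.html is mirrored at installation.html.md.
    """
    # Strip any fragment or query string before appending the suffix.
    base = page_url.split("#")[0].split("?")[0]
    return base + ".md"
```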
In the next sections, I'll go over how you can use the llms-full.txt file with LLMs, and also how other tools are working around the knowledge cutoff problem.
ChatGPT
ChatGPT has 4 ways of getting around the knowledge cutoff problem.
File Upload
First, you can take the llms-full.txt file, upload it in a chat, and then tell it to use the file to answer your questions.
Projects
Second, you can create a Project folder, upload the llms-full.txt file to the Project, and then add Instructions to the Project telling ChatGPT to answer questions using the file. The Projects feature lets you group related chats together and prevents you from having to upload the same file over and over again every time you start a new chat.
Custom GPT
Third, you can create a custom GPT, upload the llms-full.txt file under the Knowledge section, and then add Instructions telling the GPT to use the file to answer questions.
One interesting thing about these first 3 methods is that they seem to use some form of RAG under the hood.
Search
Finally, in a normal chat, you can tell ChatGPT to search the web or click the Search button to force it to search the web. Just remember, not every model has this feature. For example, o1 can't search the web.
Tips
If you use the llms-full.txt file, I recommend renaming it to something more descriptive to prevent hallucinations. For example, I've been using Svelte's llms-full.txt file and renamed it svelte-5-docs.md. Also, you may have noticed I kept saying you have to tell ChatGPT to use the file. If you don't explicitly tell it that, it won't know to use the file.
Claude
Claude also has the ability for users to upload files, so you could try uploading an llms-full.txt to teach it about a new library/framework.
It also has a Projects feature, so you could create a Project and upload the file there to stay more organized. This is a paid feature though, so I haven't tried it yet.
GitHub Copilot
GitHub Copilot doesn't have any special features that I'm aware of for working with documentation websites or llms-full.txt files, but one thing you could try is adding the llms-full.txt file to your project folder and then using it as context in your Chat or Edit sessions.
Cursor
In Cursor, you have the ability to use @ symbols in a chat. One of those symbols is @Docs. Cursor has already crawled and indexed the docs for many popular tools/libraries/frameworks. If you use @Docs and ask it about one of the things it's crawled and indexed, it can use that knowledge to answer your question. You can view the list of already crawled docs here.
If you want Cursor to know about something that it hasn't crawled or indexed yet, you can add custom docs. One interesting thing about this feature is that it seems to basically be creating llms-full.txt files behind the scenes based on this forum answer.
Windsurf
Windsurf also has @ symbols. More specifically, it has @web and @docs symbols. From what I can gather, @docs is used for documentation Windsurf already knows about, and @web is used for documentation it doesn't know about. I'm not sure if Windsurf is converting the web pages to Markdown behind the scenes like Cursor is.
Devin
I was curious how more advanced AI tools were solving this issue, so I looked up the docs for Devin, which was touted last year as the "first AI software engineer." According to Devin's documentation, you can provide links to documentation or Devin can search for documentation independently, so it seems like it's just using web search tools behind the scenes. Again, I'm not sure if Devin converts the documentation to Markdown behind the scenes.
Future Solutions
OpenAI recently did an AMA on Reddit and someone asked about the knowledge cutoff. Someone on the OpenAI team mentioned they are working on the knowledge cutoffs, but emphasized the search feature. In fact, Sam Altman said that now that they have search, he personally doesn't think about the knowledge cutoff anymore.
However, in my experience, I like the answers from llms-full.txt better than the answers from search. From my perspective, search answers feel more like a summary of the web results than a direct answer to my question. I imagine there's a way to prompt the LLM to give better answers using search, but I'm not sure.
Given how hard it will be for every single project out there to adopt the llms.txt standard, and given the direction OpenAI, Windsurf, and Devin are moving in, I think search will be the ultimate solution in the end (although I do really like the idea behind llms.txt). What I'd like to see in the future is the ability to whitelist or prioritize certain URLs, so that I can get the LLM to search the documentation first before consulting the rest of the web.
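No major tool exposes URL prioritization like this today as far as I know, but the idea could look something like re-ranking search results so that whitelisted documentation domains come first. A purely hypothetical sketch (the function name and whitelist are made up; a real tool would do this inside its search pipeline):

```python
from urllib.parse import urlparse

def prioritize_results(urls: list[str], preferred_domains: set[str]) -> list[str]:
    """Stably re-order search results so preferred domains come first.

    Results from whitelisted documentation domains are moved to the
    front; everything else keeps its original relative order.
    """
    preferred = [u for u in urls if urlparse(u).netloc in preferred_domains]
    others = [u for u in urls if urlparse(u).netloc not in preferred_domains]
    return preferred + others
```

With a whitelist like `{"svelte.dev"}`, official docs pages would surface ahead of blog posts and forum threads in the model's search context.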
Conclusion
What are your thoughts on this issue? Have you found other solutions? Let me know in the comments below!