OpenAI has recently introduced the GPT-4 Turbo model with a 128K context window - a 4x boost over the previous maximum of 32K in GPT-4.
Did you know that this context size is enough to hold 1684 tweets or 123 StackOverflow questions? Or that the largest source file from the Linux kernel would require a context window 540 times larger than 128K? Read on for more comparisons.
I've taken a few artefacts, converted them to a textual representation, counted sizes in tokens, and calculated how many artefacts can fit into a 128K context window.
Smaller ones:
And larger ones:
Here's the full table with links/details.
Bottom-line first
Conclusions that I argue for further in the post:
- 128K is still not enough for many practical tasks today
- It can barely hold the raw HTML of a single web page OR the full contents of all the documents that might be required to navigate complex knowledge or draw conclusions
- RAG is a workaround, not a solution: it injects fragments of text that may be out of order or unrelated and are not enough to navigate complex knowledge bases
- You should not expect that 100% of the context window can be used effectively; as you approach 50% of it, LLM performance degrades
- A huge ramp-up of the context window (10-100x) or multi-modality may present opportunities for a leap forward
Artefact-to-Text
Large Language Models can only operate on text, as opposed to a new flavour of foundation models - Large Multimodal Models (LMMs).
There are multiple ways a given document/object/entity can be represented as text. And there's no perfect way to convert different kinds of digital objects to text while preserving all the information.
Converting a document to text can be lossy: you lose styles, structure, media, or even some of the textual information itself (e.g. the URLs behind hyperlinks).
Take for example this StackOverflow question:
If I select the part in the browser and copy/paste it to text editor, it will look like this:
Notice how the vote counter turns into a single number (2), there's no formatting of code blocks, and the URLs for links (dart:typed_data and file) are missing.
The source in Markdown has way more nuance (though the vote counter is not part of the MD):
And it makes sense to feed LLMs not the pure text as seen on the screen, but the source text. LLMs are fine with structured inputs, and there are many examples of prompts that operate on XML, HTML, JSON, etc. and provide good results.
With our StackOverflow example, it is just a 10% increase when moving from TXT copy-paste to MD source (947 vs 1037 tokens). However, it is not always possible to have access to the source (in my case as an author I was able to click 'Edit') and one would need to employ certain techniques to extract the text and enrich it with structure and related info.
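For illustration, here's a minimal sketch of one such extraction technique in Python with BeautifulSoup. It's an assumption-level example, not the tooling used for the numbers in this post: it pulls the visible text but rewrites hyperlinks so their URLs (which a plain copy/paste drops) survive the conversion.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def html_to_enriched_text(html: str) -> str:
    """Extract readable text while keeping some info a naive copy/paste loses,
    e.g. the target URLs of hyperlinks."""
    soup = BeautifulSoup(html, "html.parser")
    # Rewrite every link as "anchor text (url)" so the URL stays in the text.
    for a in soup.find_all("a", href=True):
        a.replace_with(f"{a.get_text(strip=True)} ({a['href']})")
    return soup.get_text(separator="\n", strip=True)
```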
Though even if we talk of pure text vs source markup, not all languages are as basic and light as Markdown. Take for example web pages:
| Artefact | Tokens | Diff, times larger |
|---|---|---|
| Google Results Page (copy/paste, txt) | 975 | |
| Google Results Page (source, HTML) | 246781 | 253x |
| apple.com (copy/paste, txt) | 997 | |
| apple.com (source, HTML) | 92091 | 92x |
| Wikipedia page (Albania, copy/paste, txt) | 42492 | |
| Wikipedia page (Albania, source, wikitext) | 76462 | 1.8x |
If we speak of HTML, we're talking about a 10x or even 100x blow-up. A significant part of this increase is due to JavaScript and CSS; nevertheless, even LLMs with the largest context windows wouldn't be able to operate on raw HTML inputs.
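As a rough illustration of where those tokens go, here's a hedged sketch that strips `<script>` and `<style>` blocks before counting. The numbers will vary per page, and this is not how the table above was produced; it just shows that a large share of the HTML token count is non-content payload.

```python
import tiktoken                # pip install tiktoken
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def count_tokens(text: str) -> int:
    # cl100k_base is the encoding used by the GPT-3.5/GPT-4 family.
    return len(tiktoken.get_encoding("cl100k_base").encode(text))

def strip_scripts_and_styles(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):  # the bulk of the JS/CSS payload
        tag.decompose()
    return str(soup)

# raw_html = open("saved_page.html").read()  # hypothetical local copy of a page
# print(count_tokens(raw_html), count_tokens(strip_scripts_and_styles(raw_html)))
```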
RAG and AI agents
Here's what the Google results page used for counting tokens looked like:
It has 10 search results (as long as you don't scroll down and trigger more results to load), a header with a search box, a footer, a side pane, and a horizontal ribbon with a few tiles. 131 pages like this will fit completely into a 128K context window when represented as copy/pasted text (losing styles, links, layout, and interactivity). The same page won't fit into the context when viewed as source HTML (only 52% of it will fit).
Assume you are building a retrieval-augmented AI agent (one that can query external sources and have pieces of relevant info injected into the prompt). The agent has tools to make HTTP requests, you are using the Chat Completions API from OpenAI, and your model is GPT-4 Turbo with a 128K context.
This is called RAG (retrieval-augmented generation) and it is typically implemented in the following way. The retrieval tool makes an API call, requests a page, or reads files. It refines the retrieved data to make the amount of text/tokens smaller while preserving the necessary information. Then it uses a text splitter to turn every document into a series of chunks (paragraphs, code blocks, etc.), using certain criteria to make a break (e.g. a new headline in Markdown) and making sure each chunk doesn't exceed a given size (e.g. 1,000 tokens). Each of these text snippets then goes through semantic indexing via embeddings (gets assigned a vector of a few hundred numbers) and is stored in a vector DB. Before a reply to the user is generated, the embeddings are used to pick text chunks semantically relevant to the initial user request and insert them into the prompt (a minimal sketch of this pipeline follows the list below). A few of the key assumptions here are:
- You can efficiently refine the results to smaller texts and keep the required information
- Embeddings can provide relevant pieces of information and restore connections/retrieve related pieces of information
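To make these moving parts concrete, here's a minimal sketch of such a pipeline. Everything in it is an assumption for illustration: `embed()` is a placeholder for whatever embedding model/API you use, the splitter breaks on Markdown headlines, and a brute-force cosine search stands in for a real vector DB.

```python
import numpy as np

def split_markdown(doc: str, max_chars: int = 4000) -> list[str]:
    """Naive splitter: break on Markdown headlines and cap chunk size."""
    chunks, current = [], ""
    for line in doc.splitlines(keepends=True):
        if line.startswith("#") and current:   # a new headline starts a new chunk
            chunks.append(current)
            current = ""
        current += line
        if len(current) >= max_chars:          # hard cap on chunk size
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)
    return chunks

def embed(text: str) -> np.ndarray:
    """Placeholder: return a vector of a few hundred numbers for the text."""
    raise NotImplementedError("plug in your embedding model/API here")

def top_k_chunks(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Pick the k chunks most semantically relevant to the user request."""
    def cos(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    q = embed(query)
    scored = sorted(((cos(q, embed(c)), c) for c in chunks), reverse=True)
    return [c for _, c in scored[:k]]

# The selected chunks are then pasted into the prompt ahead of the user question.
```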
Now assume that we want to operate on arbitrary web pages and have no prior knowledge of web page structure. I.e. you don't have transformers for a given page and can't use CSS/HTML selectors to extract certain info from known parts (e.g. all search results residing in `<div>` elements with the `search-result` CSS class). It is the AI agent's job to read the HTML, understand the structure of the page, and decide where to move next and which parts to click. Under such assumptions, we can't have an effective refinement of the HTML output as long as we don't want a human in the loop. If it is the LLM's task to explore the structure of the page and navigate it, the only option likely to succeed is to give the model the full contents of the page (source HTML). And apparently, even a 128K context is not enough for an LLM to reliably process the pure HTML of an average web page on the internet.
And we are not even touching the second part of the problem... What if the knowledge is complex, spread across different documents, with intricate relations between them? What if text splitting and embeddings do not provide meaningful context for a prompt and you need way more raw input? This makes the problem of context size even more acute.
Multiple modalities
How can we build the kind of universal AI agent that can crawl any web page, look into it, make assumptions about the knowledge structure, and extract relevant data?
In this paper exploring the capabilities of GPT-4V(ision) there's a nice use case with GUI navigation:
"A picture is worth a thousand words" is a perfect demonstration of how hundreds of thousands of tokens (raw HTML) that can't fit into LLM turn into a small actionable piece of info when changing the modality.
Can you effectively use the full size of the context?
Why is there even a limit to the context window in the first place?
First of all, there's the quadratic complexity problem. When doing inference, doubling the input prompt (the number of tokens in the request) increases the compute and memory requirements by 4x: a request that is 2 times longer will take roughly 4 times longer to complete.
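A back-of-the-envelope sketch of that scaling (relative numbers only, assuming the cost is dominated by the n x n attention score matrix):

```python
def attention_cost(n_tokens: int) -> int:
    # Self-attention builds an n x n score matrix, so cost grows ~quadratically.
    return n_tokens * n_tokens

base = attention_cost(4_000)
print(attention_cost(8_000) / base)    # 4.0    - doubling the prompt -> ~4x the work
print(attention_cost(128_000) / base)  # 1024.0 - a 32x longer prompt -> ~1024x the work
```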
Besides, for the model to be effective in manipulating larger context windows, it must be trained on larger context windows, which also requires more compute.
And there are empirical studies and tests by individual developers that draw a grim picture of the advertised vs the effective context window. The closer you get to the maximum context limit, the more likely the LLM is to start forgetting or missing certain bits of info in the prompt.
One such test of GPT-4 Turbo says that you can expect no degradation in recall performance (the ability of the LLM to find bits of info given earlier in the prompt) only if you do not exceed a length of 71k tokens, ~55% of the maximum.
That is aligned with my observations working with the GPT-3.5 and GPT-4 APIs: if you cross the boundary of 50% of the context length, you start getting more hallucinations and BS.
How were the results gathered?
I have used VSCode and the cptX extension. It has a handy feature of counting the number of tokens (using OpenAI's tiktoken tokenizer) in the currently open document:
The workflow was the following. I either opened the doc, selected the portion of interest in Safari or Preview (on macOS), and pasted it into an open VSCode file, OR I opened the source of the doc (HTML, MD, wikitext) and pasted that into the editor. I then took the reported number of tokens and put it into the spreadsheet.
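If you want to reproduce the counts without the editor, a couple of lines of Python with tiktoken give the same kind of number (the file name is hypothetical; point it at whatever artefact you're measuring):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.encoding_for_model("gpt-4")  # resolves to the cl100k_base encoding

with open("artefact.txt", encoding="utf-8") as f:  # hypothetical file with the pasted content
    text = f.read()

print(len(enc.encode(text)))  # tokens this artefact would occupy in the prompt
```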
Top comments (4)
what did you make of this?
reddit.com/r/LocalLLaMA/comments/1...
why isn't it happening?
it sounded so promising - and easy to implement even... I was sure we would have this feature everywhere by now.
any idea why this didn't pan out? I can't find any followup on this paper - it was dropped, people raved about it and made the usual "this changes everything" predictions... then, nothing.
I've tried to ask Shapiro about it, but he doesn't reply.
I've tried asking on multiple forums and blogs where the news was posted - no one replies.
maybe it wasn't anything it was propped up to be?
That's an interesting paper, though I don't think there're many evals comparing it to other models on applied/more practical tasks (reasoning, coding etc.).
Yet I found this website, and it seems there are 15 papers that cite the original work: ui.adsabs.harvard.edu/abs/2023arXi... - it doesn't seem to be a dead end. Let's see where context expansion leads us.
Speaking of why it did not revolutionise current LLMs and why we still have context limits... I can only speculate and point to the timeline of context-window growth:
- November 21, 2023: Claude 2.1 - 200K
- November 6, 2023: GPT-4 Turbo - 128K
- June 12, 2023: gpt-3.5-turbo-0613 - 16K
- May 11, 2023: Claude 100K - 100K
- March 14, 2023: GPT-4 - 8K and 32K
- March 14, 2023: Claude - 9K
- November 30, 2022: ChatGPT/GPT-3.5 - 4K
- June 11, 2020: GPT-3 - 2K
P.S.: GPT-4 was introduced in March 2023, in the summer there were rumours that it uses an MoE architecture, and only in December 2023 was this approach repeated in Mixtral. It takes time and resources to iterate over LLMs...
Interesting, thanks for sharing. :-)
The main reason I'm interested is coding tasks in large codebases - currently the context limitation makes that "impossible".
I realize several products have managed to "augment" the LLM in various ways, mainly by searching a semantic index and internally prompting the LLM by showing it a select number of "probably related" files - but from what I've seen, this does not work well. Coding tasks usually require more of a "big picture" overview - if you kept 95% of the codebase hidden from me, I probably couldn't do the tasks I'm asked to do.
I've managed to do certain specific tasks with LLM by figuring out the procedure and creating custom semantic indices based on my knowledge of the codebase. For example, I have a codebase with no static types, but with a bunch of DI container bootstrapping. I've created a semantic index of the bootstrapping files, and I've managed to make the LLM go one file at a time - "here is the relevant DI configuration, figure out the missing types and apply them to this file".
I can imagine similar approaches to fix other aspects of a codebase. But it's a lot of manual work. Figuring out the exact description and how to find the information relevant to solving very specific problems. Nothing like having an LLM that knows your entire codebase, where you could just ask it to "look at the DI configuration and figure out the types, please". 😅
In your opinion, is very large (or "infinite") context even the answer to what I'm looking for?
Apparently, there are a few (paid) products now that will finetune a custom model on your codebase. I haven't really heard anyone talk about those, so I don't know how well that works. But if I had to guess, I would say, that probably works well for a codebase that is in good working condition - but if you were to finetune a model on a codebase in a really broken state, aren't you just teaching the LLM to repeat the same bad patterns? I would guess. 🤔
IMO, when we have contexts of 100 million tokens (~1GB of text data), it will be possible to put all source files of the majority of solutions into the prompt and say that the model has complete information - no need to digest the solution into smaller pieces and drop selected code snippets into the prompt. Though even if we imagine current models with super large contexts, I don't think it will be an instant success. I.e. I can see how LLMs struggle to operate even on way smaller prompts and for some reason skip/ignore data in the prompt - hallucinations and blindness to provided data, anomalies in reasoning and instruction following will be the same old problems.
Speaking of indexing the solution, it seems that many IDEs/tools that promise to 'understand the code base' or let you 'talk to your code' (e.g. Cody, Cursor.sh) use a pretty similar approach: RAG + embeddings. I.e. they create an embeddings index, and every request that is supposed to use the code base gets a meta-prompt part with individual lines of code from different files and a note that those lines might be relevant. Apparently this kind of context is not connected knowledge, just snippets. From my experience it works OK for an isolated single-step task with a singular input ("please tell me how to change the default width of a window") but NOT OK for multi-step tasks or tasks that require multiple inputs and produce more than a single code snippet ("Here's the class I just created, write a unit test; first check how my tests are arranged, identify the best placement for new tests, ensure the tests")
Speaking of fine-tuning, I don't have hard evidence readily to share, though my current feeling is that fine-tuning has seen little adoption. Actually, recently I saw a post (maybe on LinkedIn) where someone argued that in-context learning/RAG has won over fine-tuning. And there are quite a few instances demonstrating failed attempts at fine-tuning (which might be the reason for my bias against it):