GPT-4, 128K context - it is not big enough

Maxim Saplin on November 20, 2023

OpenAI has recently introduced the GPT-4 Turbo model with a 128K context window. A 4x boost to the previous maximum of 32K in GPT-4. Did you know ...
 
Rasmus Schultz

what did you make of this?

reddit.com/r/LocalLLaMA/comments/1...

why isn't it happening?

it sounded so promising - and easy to implement even... I was sure we would have this feature everywhere by now.

any idea why this didn't pan out? I can't find any followup on this paper - it was dropped, people raved about it and made the usual "this changes everything" predictions... then, nothing.

I've tried to ask Shapiro about it, but he doesn't reply.

I've tried asking on multiple forums and blogs where the news was posted - no one replies.

maybe it wasn't anything it was propped up to be?

Maxim Saplin • Edited

That's an interesting paper, though I don't think there are many evals comparing it to other models on applied/more practical tasks (reasoning, coding, etc.).

Still, I found this website, and it seems there are 15 papers that cited the original work: ui.adsabs.harvard.edu/abs/2023arXi... - so it doesn't seem to be a dead end. Let's see where context expansion leads us.

As for why it did not revolutionise current LLMs and why we still have context limits... I can only speculate:

  1. The most capable models from OpenAI, Google, and Anthropic are closed source; god knows what they are doing to their models and which techniques they apply.
  2. There has already been great progress in extending context window size. GPT-3.5 started with 4K in November 2022; in November 2023 GPT-4 Turbo boasted a 128K context - a 32x bump in one year (vs. the 2x increase from 2K to 4K between GPT3 and GPT3.5):

November 21, 2023 : Claude 2.1 - 200K
November 6, 2023 : GPT-4 Turbo - 128K
June 12, 2023 : gpt-3.5-turbo-0613 - 16K
May 11, 2023 : Claude100K - 100K
March 14, 2023 : GPT4 - 8K and 32K
March 14, 2023 : Claude - 9K
November 30, 2022 : ChatGPT/GPT3.5 - 4K
June 11, 2020 : GPT3 - 2K

P.S.: GPT4 was introduced in March 2023; in the summer there were rumours that it uses an MoE architecture, and only in December 2023 was this approach repeated in Mixtral. It takes time and resources to iterate on LLMs...

Rasmus Schultz

Interesting, thanks for sharing. :-)

The main reason I'm interested is coding tasks in large codebases - currently the context limitation makes that "impossible".

I realize several products have managed to "augment" the LLM in various ways, mainly by searching a semantic index and internally prompting the LLM by showing it a select number of "probably related" files - but from what I've seen, this does not work well. Coding tasks usually require more of a "big picture" overview - if you kept 95% of the codebase hidden from me, I probably couldn't do the tasks I'm asked to do.

I've managed to do certain specific tasks with an LLM by figuring out the procedure and creating custom semantic indices based on my knowledge of the codebase. For example, I have a codebase with no static types, but with a bunch of DI container bootstrapping. I've created a semantic index of the bootstrapping files, and I've managed to make the LLM go one file at a time - "here is the relevant DI configuration, figure out the missing types and apply them to this file".
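
Roughly, the loop is something like this sketch (simplified; `search_di_index()` and `chat()` are hypothetical stand-ins for the actual semantic index and LLM client, and the file extension is just an example):

```python
# Minimal sketch only: search_di_index() and chat() are hypothetical stand-ins,
# not real library calls.
from pathlib import Path

def search_di_index(file_text: str, top_k: int = 3) -> list[str]:
    """Return the DI bootstrapping snippets most relevant to this file (assumed helper)."""
    raise NotImplementedError

def chat(system: str, user: str) -> str:
    """Send a prompt to whichever LLM backend is in use and return its reply (assumed helper)."""
    raise NotImplementedError

def add_missing_types(root: str) -> None:
    # Go one file at a time, pairing each file with the DI config relevant to it.
    for path in Path(root).rglob("*.py"):
        source = path.read_text()
        di_config = "\n\n".join(search_di_index(source))
        prompt = (
            "Here is the relevant DI configuration:\n"
            f"{di_config}\n\n"
            "Figure out the missing types and apply them to this file:\n"
            f"{source}"
        )
        path.write_text(chat("You add missing type annotations to code.", prompt))
```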

I can imagine similar approaches to fix other aspects of a codebase. But it's a lot of manual work. Figuring out the exact description and how to find the information relevant to solving very specific problems. Nothing like having an LLM that knows your entire codebase, where you could just ask it to "look at the DI configuration and figure out the types, please". 😅

In your opinion, is very large (or "infinite") context even the answer to what I'm looking for?

Apparently, there are a few (paid) products now that will finetune a custom model on your codebase. I haven't really heard anyone talk about those, so I don't know how well that works. But if I had to guess, I would say that probably works well for a codebase that is in good working condition - but if you were to finetune a model on a codebase in a really broken state, aren't you just teaching the LLM to repeat the same bad patterns? I would guess. 🤔

Maxim Saplin

IMO, when we have a context of 100 million tokens (~1GB of text data) it will be possible to put all the source files of the majority of solutions into the prompt - the model would have complete information, with no need to digest the solution into smaller pieces and drop selected code snippets into the prompt. Though even if we imagine current models with such super large contexts, I don't think it would be an instant success. I can see how LLMs struggle to operate even on much smaller prompts and for some reason skip/ignore data in the prompt - hallucinations, blindness to provided data, and anomalies in reasoning and instruction following will be the same old problems.
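
As a rough back-of-the-envelope check (using the common ~4-characters-per-token heuristic, which is only an approximation), something like this could estimate whether a whole codebase would fit into such a window:

```python
# Rough estimate only: ~4 characters per token is a common heuristic, not exact.
from pathlib import Path

CONTEXT_BUDGET_TOKENS = 100_000_000  # the hypothetical 100M-token window
SOURCE_EXTENSIONS = {".py", ".ts", ".cs", ".java", ".md"}  # adjust for your stack

total_chars = sum(
    p.stat().st_size
    for p in Path(".").rglob("*")
    if p.is_file() and p.suffix in SOURCE_EXTENSIONS
)
estimated_tokens = total_chars // 4

fits = "fits" if estimated_tokens <= CONTEXT_BUDGET_TOKENS else "does not fit"
print(f"~{estimated_tokens:,} tokens ({fits} in a {CONTEXT_BUDGET_TOKENS:,}-token window)")
```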

Speaking of indexing the solution, it seems that many IDEs/tools that promise to 'understand the code base' or 'talk to your code' (e.g. Cody, Cursor.sh) use a pretty similar approach: RAG + embeddings. I.e. they create an embeddings index, and every request that is supposed to use the code base gets a meta-prompt part with individual lines of code from different files and a mention that those lines might be relevant. Apparently this kind of context is not connected knowledge, just snippets. From my experience it works OK for an isolated single-step task with a single input ("please tell me how to change the default width of a window") but NOT OK for multi-step tasks or tasks that require multiple inputs and produce more than a single code snippet ("Here's the class I just created, write a unit test; first check how my tests are arranged, identify the best placement for new tests, ensure the tests")
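
A simplified sketch of that pattern (the `embed()` function here is a toy stand-in - I don't know which embedding model those tools actually call - and the chunking is deliberately naive):

```python
# Illustrative sketch of the "embed chunks, retrieve top-k, stuff into a meta-prompt"
# pattern; embed() is a toy stand-in, not the real embedding model those tools use.
from pathlib import Path
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy embedding: a character-frequency vector, just so the sketch runs."""
    vec = np.zeros(128)
    for ch in text:
        vec[ord(ch) % 128] += 1.0
    return vec

def build_index(root: str, chunk_lines: int = 10) -> list[tuple[str, str, np.ndarray]]:
    """Index time: embed small chunks (a few lines each) of every source file."""
    index = []
    for path in Path(root).rglob("*.py"):
        lines = path.read_text().splitlines()
        for i in range(0, len(lines), chunk_lines):
            chunk = "\n".join(lines[i:i + chunk_lines])
            index.append((str(path), chunk, embed(chunk)))
    return index

def build_meta_prompt(index: list, question: str, top_k: int = 10) -> str:
    """Query time: pick the top-k most similar chunks and prepend them to the request."""
    q = embed(question)
    def score(item):
        v = item[2]
        denom = float(np.linalg.norm(v) * np.linalg.norm(q)) or 1.0
        return float(np.dot(v, q)) / denom
    best = sorted(index, key=score, reverse=True)[:top_k]
    snippets = "\n".join(f"# from {path}\n{chunk}" for path, chunk, _ in best)
    return ("The following lines from the codebase might be relevant:\n"
            f"{snippets}\n\n"
            f"Request: {question}")
```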

Speaking of fine-tuning, I don't have hard evidence readily to share, though my current feeling is that fine-tuning has seen little adoption. Actually, I recently saw a post (maybe on LinkedIn) where someone argued that in-context learning/RAG has won over fine-tuning. And there are quite a few instances demonstrating failed attempts at fine-tuning (which might be the reason for my bias against it):