In today's rapidly advancing AI world, one of the limiting factors of modern Large Language Models (LLMs) is the context size. But it would also be...
This is an excellent and very detailed comparison - thanks for sharing, super interesting! Is GPT Pilot going to add support for Claude?
It already does. You can actually use any LLM with GPT Pilot.
Surprisingly, the results here are the opposite of the results from github.com/gkamradt/LLMTest_Needle.... I.e. in your research Claude was able to demonstrate almost 100% recall for context sizes up to 96k, while GPT-4 Turbo could show the same recall performance only up to a 16k context size, which means that Claude has a 4-8x edge over GPT-4 in your tests.
On the contrary, the "Needle in the Haystack" test gives a ~4x edge to GPT-4 Turbo (128k context window) over Claude 2.1 (200k context window), since GPT-4 Turbo showed perfect recall for windows up to ~70k vs 19k for Claude.
Any ideas why the results are so different?
I would also add that you likely used Claude 2.1 (not 2), because it is the only Anthropic model that has a 200k context window.
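For anyone unfamiliar with the methodology, the needle-in-a-haystack test roughly works like this: a short "needle" fact is buried at varying depths inside filler text padded to a given token length, and the model is asked to retrieve it. Below is a minimal sketch, assuming the `openai` Python client; the model name, prompts, helper names, and the crude substring check (the original test uses GPT-4 grading) are all illustrative, not the exact setup of either benchmark:

```python
# Minimal needle-in-a-haystack recall sketch (assumes the `openai` Python client).
from openai import OpenAI

client = OpenAI()

NEEDLE = "The best thing to do in San Francisco is to eat a sandwich in Dolores Park."
QUESTION = "What is the best thing to do in San Francisco?"

def build_haystack(filler_text: str, context_tokens: int, depth: float) -> str:
    """Pad filler text to roughly `context_tokens` tokens (~4 chars/token)
    and bury the needle at the given relative depth (0.0 = start, 1.0 = end)."""
    target_chars = context_tokens * 4
    haystack = (filler_text * (target_chars // len(filler_text) + 1))[:target_chars]
    insert_at = int(len(haystack) * depth)
    return haystack[:insert_at] + "\n" + NEEDLE + "\n" + haystack[insert_at:]

def run_trial(filler_text: str, context_tokens: int, depth: float) -> bool:
    """Ask the model under test to retrieve the needle and check recall."""
    prompt = build_haystack(filler_text, context_tokens, depth)
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # swap in whichever model is being evaluated
        messages=[
            {"role": "system", "content": "Answer using only the provided document."},
            {"role": "user", "content": prompt + "\n\n" + QUESTION},
        ],
    )
    answer = response.choices[0].message.content or ""
    return "Dolores Park" in answer  # simple recall check for the sketch
```

Recall is then reported per (context size, needle depth) pair, so differences in filler text, needle placement, or grading between the two tests could plausibly explain the diverging results.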