DEV Community

Discussion on: I Tested Chunking on Docs, PDFs, and Code. The Winner Changed Every Time.

Daniel Visovsky

0.5690 Context Precision on code with RecursiveChar is honestly worse than random for some queries. Nearly half the retrieved chunks being irrelevant means almost every other chunk a search pulls is garbage. Thanks for running the numbers. Now I can point to this when someone says "just use a 512-token splitter for everything."

Md Ayan Arshad

Yeah, 0.5690 is bad: roughly 43% of retrieved chunks are irrelevant, which means nearly half of the context tokens you pay for are feeding garbage into the LLM. The 512-token default gets away with it on docs and PDFs, which is why nobody catches it until they actually test on code. Glad the numbers give you something concrete to point to!
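
To put a rough number on that token cost, here is a minimal back-of-the-envelope sketch. It assumes Context Precision can be read as the plain fraction of retrieved chunks that are relevant (the metric as actually computed may be rank-weighted), and it uses a hypothetical top-k of 5 with every chunk filled to 512 tokens. The `wasted_tokens` helper is made up for illustration and does not come from the article.

```python
def wasted_tokens(precision: float, top_k: int, chunk_tokens: int) -> float:
    """Expected tokens per query spent on irrelevant chunks.

    Assumes `precision` is the simple share of retrieved chunks that are
    relevant, which is a simplification of Context Precision.
    """
    if not 0.0 <= precision <= 1.0:
        raise ValueError("precision must be between 0 and 1")
    irrelevant_fraction = 1.0 - precision
    return irrelevant_fraction * top_k * chunk_tokens


precision = 0.5690   # Context Precision on code quoted in this thread
top_k = 5            # hypothetical: chunks retrieved per query
chunk_tokens = 512   # splitter size from the thread, assuming full chunks

print(f"irrelevant share: {1.0 - precision:.1%}")
print(f"wasted tokens per query: {wasted_tokens(precision, top_k, chunk_tokens):.0f}")
```

Under those assumptions, about 43% of the retrieved context, on the order of 1,100 tokens per query, goes to chunks that do not help the answer. Swap in your own top-k and chunk size to see how it scales.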