DEV Community

Thilak Singh Thakur


Improving Clause-Level Retrieval Accuracy in Legal RAG

Hi devs! I'm new to genAI, and I've been asked to build a genAI app for structured commercial lease agreements.
I built a RAG pipeline:
parsing the digital PDF --> section-aware chunking (each section recognized individually) --> summarizing chunks --> embeddings of the summarized chunks and embeddings of the raw chunks --> storing everything in PostgreSQL
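To show what I mean by section-aware chunking, here's a toy sketch (the heading regex and the sample text are just my illustration; a real lease needs a more robust pattern):

```python
import re

# Split a lease on numbered headings like "2. ASSIGNMENT".
# The heading pattern is an assumption for this sketch.
HEADING = re.compile(r"^\d+\.\s+[A-Z][A-Z &]+$", re.MULTILINE)

def section_chunks(text: str) -> list[str]:
    """Return one chunk per recognized section, heading included."""
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts:
        return [text]  # no headings found: fall back to one chunk
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]

lease = """1. RENT
Tenant shall pay base rent monthly.
2. ASSIGNMENT
Tenant shall not assign without consent."""
chunks = section_chunks(lease)
# one chunk per numbered section, heading kept with its body
```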
Two-level retrieval: semantic relevance of the query embedding against the summary embeddings (ranking) --> then the query embedding against the direct chunk embeddings (reranking)
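To make the two-level step concrete, here's a minimal sketch of the rank-then-rerank logic (the random vectors stand in for real embeddings, which in my pipeline live in PostgreSQL; function names are mine, not from any library):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def two_stage(query, summary_embs, chunk_embs, top_k=3):
    """Stage 1: rank chunks by query-vs-summary similarity, keep top_k.
    Stage 2: rerank the survivors by query-vs-full-chunk similarity."""
    stage1 = sorted(range(len(summary_embs)),
                    key=lambda i: cosine(query, summary_embs[i]),
                    reverse=True)[:top_k]
    return sorted(stage1,
                  key=lambda i: cosine(query, chunk_embs[i]),
                  reverse=True)

# Illustrative data: 10 chunks with 8-dim random "embeddings".
rng = np.random.default_rng(0)
q = rng.normal(size=8)
summary_vecs = rng.normal(size=(10, 8))
chunk_vecs = rng.normal(size=(10, 8))
order = two_stage(q, summary_vecs, chunk_vecs)
```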
I have 166 queries; each needs to land on the right clause, and then I'm supposed to retrieve the relevant lines from that paragraph.
My question:
I'm summarizing every chunk so the first retrieval stage can navigate quickly to the right chunks, but with 145 chunks in my 31-page PDF this noticeably increases cost and token usage. If I don't summarize, though, semantic retrieval gets diluted, since each big clause holds multiple obligations.
I'm getting pushback that keeping the chunk-summarization step in the pipeline costs too much.
Do you have a better approach for increasing accuracy?
Thanks in advance
