Relevant arXiv paper RAG

#devchallenge #pgaichallenge #database #ai

This is a submission for the Open Source AI Challenge with pgai and Ollama

What I Built

Imagine having an assistant that, given a research topic or paper, can instantly connect you with the papers most relevant to you, saving hours spent sifting through services like arXiv!

This project is a research paper recommendation system leveraging RAG with PostgreSQL, pgvectorscale, and a language model (choose from GPT-4o mini, Claude 3.5, or Llama 3.2 3B). Using a single arXiv paper ID, the system finds similar research articles using vector embeddings, allowing users to dive deeper into related works, spot trends, and explore different approaches to the topic.
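To make the "find similar articles using vector embeddings" step concrete, here is a minimal pure-Python sketch of a top-k cosine-similarity lookup. It is only an illustration: the paper IDs and the 3-dimensional vectors are made-up toy data, and in the actual project this ranking is done inside PostgreSQL rather than in Python.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_similar(query_id, embeddings, k=10):
    """Rank every other paper by similarity to the query paper."""
    query = embeddings[query_id]
    scored = [
        (paper_id, cosine_similarity(query, vector))
        for paper_id, vector in embeddings.items()
        if paper_id != query_id  # never recommend the query paper itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data (hypothetical): real embeddings have hundreds or thousands of dimensions.
toy_embeddings = {
    "paper-a": [1.0, 0.0, 0.0],
    "paper-b": [0.9, 0.1, 0.0],
    "paper-c": [0.0, 1.0, 0.0],
    "paper-d": [0.5, 0.5, 0.0],
}

ranked = top_k_similar("paper-a", toy_embeddings, k=2)
print([paper_id for paper_id, _ in ranked])
```

A vector index replaces this brute-force scan with an approximate search, which is what keeps the lookup fast as the number of papers grows.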

Demo

Hosted GUI:
https://tinyurl.com/timescalechallengeanyademo
Note: The hosted GUI may be unavailable due to Gradio's 72-hour URL limit (see the self-runnable, Colab-hosted source code below). It can also be slow when multiple users are connected, and some recent arXiv URLs/papers don't work.

Top-10 Similar Papers Demo

Top-10 Similar Paper Summary/Analysis Demo

Static Preview

Colab Notebook Source Code (Try it: ~10min):
https://tinyurl.com/timescalechallengeanyanotebook

NOTE: The default configuration uses Ollama, but OpenAI, or Anthropic Claude with Cohere embeddings, is preferred because of context length limitations with pgai and Ollama embeddings. With Ollama, the LLM similarity analysis/question step covers 3 papers instead of 10 (see Final Thoughts).

Tools Used

pgvector & pgvectorscale: Backbone for storing and searching vector embeddings of arXiv paper texts, each of which is converted into a vector representation. The embeddings are indexed with DiskANN (from pgvectorscale) or IVFFlat (from pgvector).
pgai: Used for generating embeddings and answering questions about research documents. pgai serves as a gateway to OpenAI, Anthropic, Cohere, and Ollama.
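As a rough sketch of how these extensions fit together, the snippet below holds the kind of SQL involved as Python strings. The table name `papers`, its columns, the index name, and the 768-dimension size are assumptions made for illustration and are not taken from the notebook; the `diskann` index type, the `vector_cosine_ops` operator class, and the `<=>` cosine-distance operator are the ones documented by pgvectorscale and pgvector.

```python
# Hypothetical schema: table, column, and index names are placeholders.
SETUP_SQL = """
CREATE EXTENSION IF NOT EXISTS ai CASCADE;
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;

CREATE TABLE IF NOT EXISTS papers (
    arxiv_id  TEXT PRIMARY KEY,
    title     TEXT NOT NULL,
    abstract  TEXT NOT NULL,
    embedding VECTOR(768)  -- assumed size; must match the embedding model
);

CREATE INDEX IF NOT EXISTS papers_embedding_idx
    ON papers USING diskann (embedding vector_cosine_ops);
"""

# <=> is pgvector's cosine-distance operator (smaller means more similar).
SIMILARITY_SQL = """
SELECT arxiv_id, title, embedding <=> %s::vector AS distance
FROM papers
WHERE arxiv_id <> %s
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""


def to_vector_literal(values):
    """Format a Python sequence as a pgvector text literal such as '[0.1,2.0]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def similarity_query(query_embedding, exclude_id, k=10):
    """Return (sql, params) ready for a DB-API cursor.execute call."""
    literal = to_vector_literal(query_embedding)
    return SIMILARITY_SQL, (literal, exclude_id, literal, k)


# "2401.00001" is a placeholder ID, not a paper used in the project.
sql, params = similarity_query([0.1, 2, -3.5], "2401.00001", k=5)
print(params)
```

With a driver such as psycopg, the returned pair can be passed straight to `cursor.execute(sql, params)`.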

Final Thoughts

Additional unique aspects of this project:
- Use of Postgres stored functions to call pgai functions in 'function mode', enabling users without any direct access to the database or pgai to build a RAG (better security).
- Integration with OpenAI, Anthropic, and local Ollama APIs.
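The stored-function approach from the first point could look roughly like the sketch below. Everything named here is an assumption for illustration: the function `similar_papers`, the role `rag_reader`, the `papers` table, the `nomic-embed-text` model, and the use of `SECURITY DEFINER` are my guesses at one way to do it, not the notebook's actual code, and the SQL has not been run against the project's database. `ai.ollama_embed` is pgai's Ollama embedding function.

```python
# Hypothetical server-side wrapper: callers get EXECUTE on the function only,
# never direct access to the table or to pgai.
CREATE_FUNCTION_SQL = """
CREATE OR REPLACE FUNCTION similar_papers(query_text TEXT, k INT DEFAULT 10)
RETURNS TABLE (paper_id TEXT, paper_title TEXT, cosine_distance FLOAT8)
LANGUAGE sql
SECURITY DEFINER
AS $$
    SELECT p.arxiv_id,
           p.title,
           p.embedding <=> ai.ollama_embed('nomic-embed-text', query_text) AS dist
    FROM papers p
    ORDER BY dist
    LIMIT k;
$$;

GRANT EXECUTE ON FUNCTION similar_papers(TEXT, INT) TO rag_reader;
"""


def find_similar(cursor, query_text, k=10):
    """Client side: the only SQL a user ever sends is the function call."""
    cursor.execute("SELECT * FROM similar_papers(%s, %s);", (query_text, k))
    return cursor.fetchall()


class FakeCursor:
    """Stand-in for a DB-API cursor so the sketch runs without a database."""

    def __init__(self, rows):
        self.rows = rows
        self.calls = []

    def execute(self, sql, params=None):
        self.calls.append((sql, params))

    def fetchall(self):
        return self.rows


cursor = FakeCursor([("paper-b", "A related paper", 0.12)])
rows = find_similar(cursor, "graph neural networks", k=3)
print(rows)
```

The point of the design is that the embedding call and the table scan both happen inside the function, so the caller's role needs no privileges beyond `EXECUTE`.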
Learnings
- The learning curve for the implementation was moderate, but I learned a lot from exploring Timescale's GitHub documentation and writing stored functions (with ChatGPT's help; I took my database systems course over a year ago and am a little rusty). In the future, I should add more documentation to the notebook and review it for inefficiencies.
Feedback
- pgvector can only index embeddings of up to 4,000 dimensions (2,000 if full-precision vectors are used), falling short of OpenAI's 4096. I wrote additional code to trim the output, which complicated the implementation.
- pgai's Ollama integration may have context length issues (when I use Ollama's interface directly there are no such issues), which limited the later question-answering function to 3 papers. With Anthropic/Cohere, it could handle more.
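The trimming step from the first feedback point can be sketched as follows. This is an assumption about what such code might look like, not the notebook's implementation: it keeps the first `max_dim` components and rescales the result to unit length so cosine distances stay well behaved. The function name and the default limit are illustrative.

```python
import math


def trim_embedding(vector, max_dim=2000):
    """Keep the first max_dim components and L2-normalize the result."""
    trimmed = list(vector[:max_dim])
    norm = math.sqrt(sum(x * x for x in trimmed))
    if norm == 0.0:
        return trimmed  # an all-zero vector cannot be normalized
    return [x / norm for x in trimmed]


full = [1.0] * 4096  # stand-in for an oversized embedding
short = trim_embedding(full, max_dim=2000)
print(len(short))
```

Truncation discards information, and how much retrieval quality survives depends on the embedding model, so it is worth spot-checking results after trimming.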

Top comments (1)

andrew • Nov 11 '24 • Edited

Forgot to mention which additional prize categories my project may be applicable towards:

Open-source Models from Ollama: This project allows users to choose Llama3.2:3b as their LLM (note that this is the default, but Anthropic/Cohere or OpenAI is preferred due to llama-pgai restrictions on context length and the slow Colab-Ollama run time of up to ~2min)
All the Extensions: This project utilizes pgvector, pgvectorscale, and pgai in its functionality

Additionally, please note that the original post has two notable links:

A self-contained Colab project notebook with the project source code, where anyone can create and run their own arXiv RAG by following the link (repost of link [tinyurl.com/timescalechallengeanya...]; see the first code block) and then pressing 'Run all' in the Colab notebook.
The demo Gradio UI (which you can try!) is running off of an active Colab instance of the above notebook (repost of link [tinyurl.com/timescalechallengeanya...]). NOTE: The Colab instance hosting may have ended prematurely due to idling. If so, I suggest building and running from scratch using the source code notebook linked above!