Looking to incorporate a recommendation module on your blog website? Something that would help you recommend high-quality, personalized blogs to your users based on their interests and preferences? Don’t have the time to acquaint yourself with machine learning algorithms, models, and Python programming? This tutorial will help you out in building a high-quality blog recommendation application by using the latest open-source no-code tools and large language models.
The code used in this tutorial can also be followed here.
Specifically, we are going to use Flowise, a no-code drag-and-drop graphical tool that aims to make it easy to visualize and build LLM apps, and Qdrant, an open-source vector search engine and vector database.
We need a vector database for this task because we will be recommending from a collection of blog articles. This collection can be too large to pass to an LLM in full, given limited context windows and the cost per input token. We therefore use the Retrieval-Augmented Generation (RAG) technique: first chunk, embed, and index the blogs in a vector database, then retrieve a smaller, more relevant set of blogs for the LLM to recommend from.
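To build intuition for what the vector database does during retrieval, here is a toy, self-contained sketch of the retrieve step. The bag-of-words "embedding" below is a crude stand-in for illustration only; in the actual app, Qdrant stores real OpenAI embeddings and performs the nearest-neighbour search for us.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a sparse bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Return the k documents most similar to the query.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

blogs = [
    "Intro to neural networks and deep learning",
    "A beginner's guide to sourdough baking",
    "Vector databases for semantic search",
]
print(top_k("deep learning with neural networks", blogs, k=1))
# ['Intro to neural networks and deep learning']
```

This is exactly the shape of the RAG retrieval step: only the few most relevant blogs, not all of them, are handed to the LLM.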
While there are many vector databases out there to choose from, Qdrant is a complete vector database for building LLM applications. It is open-source, and also provides a managed cloud service. It also doubles up as a hybrid and full-text search engine, benefiting from its sparse-vectors capability. Moreover, it offers metadata filtering, in-built embeddings creation (both text and image), sharding, disk-based indexing, and easily integrates with LangChain and LlamaIndex.
Following are the steps you need to take to ensure you have all the tools at your disposal before we begin using them to build the blog recommender.
- You need to ensure that you have an OpenAI API key.
- You need to obtain a Qdrant API key.
- You need to install Flowise on your machine or in the cloud.
Install Flowise (Official documentation)
- Download and install NodeJS >= 18.15.0
- Install Flowise:
npm install -g flowise
- Start Flowise:
npx flowise start
You can now see the Flowise dashboard open up in your browser (by default at http://localhost:3000).
The chatflow is where you plug in different components to create an LLM app. Notice that on the left-hand pane, apart from chatflows, you have other options as well. You can explore the marketplace to use ready-made LLM app flows, e.g. a conversation agent, a QnA agent, etc.
For now, let's try out the Flowise Docs QnA template. Click on it and you can see the template as connected blocks. These blocks, also called nodes, are essential components in any LLM app. Think of them as the functions one would use when programmatically creating an LLM app via LangChain or LlamaIndex.
These nodes are explained herein:
- You have got text splitter for chunking large documents, where you can specify the relevant parameters like chunk size and chunk overlap.
- The text splitter is connected to a document source, in this case the Flowise GitHub repo. You need to specify the connect credential, which is essentially any form of authorisation, such as an API key, needed to access the source documents.
- There is an embedding model, in this case OpenAI embeddings and therefore you need the OpenAI API key as connect credential.
- The embedding model and the document source is connected to the vector store, wherein the chunked and embedded documents are indexed and stored for retrieval.
- We also have the LLM model, in this case, ChatOpenAI model from Open AI.
- The LLM and the output of the vector store are input to the Conversational Retrieval QA chain, which is a chain of prompts meant to perform the required task: chatting with the LLM over the Flowise documentation.
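To make the text splitter node less of a black box, here is a minimal sketch of fixed-size chunking with overlap. It is illustrative only: the real recursive character text splitter in LangChain, which the Flowise node wraps, additionally falls back through a hierarchy of separators (paragraphs, sentences, words) so chunks break at natural boundaries.

```python
def split_text(text: str, chunk_size: int = 100, chunk_overlap: int = 20) -> list[str]:
    # Slide a fixed-size window over the text; consecutive chunks
    # share `chunk_overlap` characters so no context is lost at a boundary.
    chunks = []
    step = chunk_size - chunk_overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "word " * 60  # a 300-character toy document
chunks = split_text(doc, chunk_size=100, chunk_overlap=20)
print(len(chunks), len(chunks[0]))  # 4 100
```

The chunk size and chunk overlap you set on the node are exactly these two parameters.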
On the top right, you can see the Use template button. We can use this template as a starting point of our recommendation app.
As we can see, the first thing we need to build a recommendation system is a set of documents. We need a pool of blog articles from which our LLM agent can recommend blogs.
One can either scrape blogs from the internet using a scraper node like the Cheerio Web Scraper in Flowise, or load a collection of blogs from disk via a document loader.
Fortunately, I found a well-scraped, clean collection of Medium blogs on diverse topics in Hugging Face datasets. Clicking on this link and going to the Files and versions tab, one can download the 1.04 GB file named medium_articles.csv.
In the cells below, I show how to use this file to create the LLM blog recommendation agent.
First, we import the libraries required to load and process the dataset:
import pandas as pd
import ast
# replace the file path as appropriate
file_path = "./data/medium_articles.csv"
df = pd.read_csv(file_path)
df.head()
 | title | text | url | authors | timestamp | tags
---|---|---|---|---|---|---
0 | Mental Note Vol. 24 | Photo by Josh Riemer on Unsplash\n\nMerry Chri... | https://medium.com/invisible-illness/mental-no... | ['Ryan Fan'] | 2020-12-26 03:38:10.479000+00:00 | ['Mental Health', 'Health', 'Psychology', 'Sci... |
1 | Your Brain On Coronavirus | Your Brain On Coronavirus\n\nA guide to the cu... | https://medium.com/age-of-awareness/how-the-pa... | ['Simon Spichak'] | 2020-09-23 22:10:17.126000+00:00 | ['Mental Health', 'Coronavirus', 'Science', 'P... |
2 | Mind Your Nose | Mind Your Nose\n\nHow smell training can chang... | https://medium.com/neodotlife/mind-your-nose-f... | [] | 2020-10-10 20:17:37.132000+00:00 | ['Biotechnology', 'Neuroscience', 'Brain', 'We... |
3 | The 4 Purposes of Dreams | Passionate about the synergy between science a... | https://medium.com/science-for-real/the-4-purp... | ['Eshan Samaranayake'] | 2020-12-21 16:05:19.524000+00:00 | ['Health', 'Neuroscience', 'Mental Health', 'P... |
4 | Surviving a Rod Through the Head | You’ve heard of him, haven’t you? Phineas Gage... | https://medium.com/live-your-life-on-purpose/s... | ['Rishav Sinha'] | 2020-02-26 00:01:01.576000+00:00 | ['Brain', 'Health', 'Development', 'Psychology... |
# some basic info about the dataframe
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 192368 entries, 0 to 192367
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 192363 non-null object
1 text 192368 non-null object
2 url 192368 non-null object
3 authors 192368 non-null object
4 timestamp 192366 non-null object
5 tags 192368 non-null object
dtypes: object(6)
memory usage: 8.8+ MB
None
We see that there are close to 200k articles in this dataset. We don't need this many articles in our pool to recommend from.
Therefore, we focus on a niche domain, 'AI', and sample only those articles which include 'AI' as a tag.
# first converting the tags values from string to a list for the explode operation
df['tags'] = df.tags.apply(lambda d: ast.literal_eval(d))
df = df.explode('tags')
df.head()
 | title | text | url | authors | timestamp | tags
---|---|---|---|---|---|---
0 | Mental Note Vol. 24 | Photo by Josh Riemer on Unsplash\n\nMerry Chri... | https://medium.com/invisible-illness/mental-no... | ['Ryan Fan'] | 2020-12-26 03:38:10.479000+00:00 | Mental Health |
0 | Mental Note Vol. 24 | Photo by Josh Riemer on Unsplash\n\nMerry Chri... | https://medium.com/invisible-illness/mental-no... | ['Ryan Fan'] | 2020-12-26 03:38:10.479000+00:00 | Health |
0 | Mental Note Vol. 24 | Photo by Josh Riemer on Unsplash\n\nMerry Chri... | https://medium.com/invisible-illness/mental-no... | ['Ryan Fan'] | 2020-12-26 03:38:10.479000+00:00 | Psychology |
0 | Mental Note Vol. 24 | Photo by Josh Riemer on Unsplash\n\nMerry Chri... | https://medium.com/invisible-illness/mental-no... | ['Ryan Fan'] | 2020-12-26 03:38:10.479000+00:00 | Science |
0 | Mental Note Vol. 24 | Photo by Josh Riemer on Unsplash\n\nMerry Chri... | https://medium.com/invisible-illness/mental-no... | ['Ryan Fan'] | 2020-12-26 03:38:10.479000+00:00 | Neuroscience |
# now we see that the explode operation has duplicated the rest of the row for each tag in the tags list
# We can further filter only the AI tagged articles
df_ai = df.query('tags == "AI"')
df_ai.head()
 | title | text | url | authors | timestamp | tags
---|---|---|---|---|---|---
34 | AI creating Human-Looking Images and Tracking ... | AI creating Human-Looking Images and Tracking ... | https://medium.com/towards-artificial-intellig... | ['David Yakobovitch'] | 2020-09-07 18:01:01.467000+00:00 | AI |
69 | Predicting The Protein Structures Using AI | Proteins are found essentially in all organism... | https://medium.com/datadriveninvestor/predicti... | ['Vishnu Aravindhan'] | 2020-12-26 08:46:36.656000+00:00 | AI |
72 | Unleash the Potential of AI in Circular Econom... | Business Potential of AI in promoting circular... | https://medium.com/swlh/unleash-the-potential-... | ['Americana Chen'] | 2020-12-07 22:46:53.490000+00:00 | AI |
85 | Essential OpenCV Functions to Get You Started ... | Reading, writing and displaying images\n\nBefo... | https://towardsdatascience.com/essential-openc... | ['Juan Cruz Martinez'] | 2020-06-12 16:03:06.663000+00:00 | AI |
105 | Google Objectron — A giant leap for the 3D obj... | bjecrPhoto by Tamara Gak on Unsplash\n\nGoogle... | https://towardsdatascience.com/google-objectro... | ['Jair Ribeiro'] | 2020-11-23 17:48:03.183000+00:00 | AI |
# Next, we concatenate information across the columns into a single column, so that we have to index only a single column in the Qdrant vector DB.
# Also, we use only the title of the article as the summary of its content, to minimise the DB upsert times (as we will be running Qdrant on our local machine).
# We keep the url field, so that the LLM agent can cite the url of the recommended blog.
# Finally, we take a random sample of 200 articles, again to minimise the DB upsert times.
df_ai = df_ai.copy()  # avoid pandas' SettingWithCopyWarning on the query() result
df_ai['combined_info'] = df_ai.apply(lambda row: f"title: {row['title']}, url: {row['url']}", axis=1)
df_ai_combined = df_ai['combined_info']
df_ai_combined.sample(200, random_state=42).to_csv('medium_articles_ai.csv')  # fixed seed for reproducibility
df_ai_combined.head()
df_ai_combined.head()
34 title: AI creating Human-Looking Images and Tr...
69 title: Predicting The Protein Structures Using...
72 title: Unleash the Potential of AI in Circular...
85 title: Essential OpenCV Functions to Get You S...
105 title: Google Objectron — A giant leap for the...
Name: combined_info, dtype: object
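As a quick sanity check of the round trip we rely on, the snippet below (a toy two-row example, not the real file, so it runs standalone) confirms that a single 'combined_info' column survives to_csv / read_csv intact, which is what the Flowise CSV loader's 'Single Column Extraction' will depend on:

```python
import io
import pandas as pd

# Verify that a single 'combined_info' column survives the
# to_csv / read_csv round trip used by the CSV document loader.
s = pd.Series(
    ["title: A, url: https://example.com/a",
     "title: B, url: https://example.com/b"],
    name="combined_info",
)
buf = io.StringIO()
s.to_csv(buf)          # header row 'combined_info' is written by default
buf.seek(0)
back = pd.read_csv(buf, index_col=0)
print(back["combined_info"].tolist() == s.tolist())  # True
```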
Now that we have the CSV file, we go back to the Flowise dashboard.
In the QnA template we discussed earlier, replace the Markdown Text Splitter with the Recursive Character Text Splitter, the Github document loader with the CSV document loader, the in-memory vector store with the Qdrant vector store, and the Conversational Retrieval QA Chain with the Retrieval QA Chain.
Also, in the CSV document loader, upload the CSV file we created in the cell above, putting 'combined_info' in the 'Single Column Extraction' field. Also, ensure that you enter your OpenAI API key, set the Qdrant server URL to 'http://0.0.0.0:6333', and give a new collection (database) name.
The dashboard looks like this now:
You can save your chatflow using the save icon on the top right, and can run the flow using the green coloured database icon below it.
But before proceeding, you first need to start a local Qdrant server. The easiest way to do so is via Docker. Ensure you have Docker installed on your system. Then go to a terminal and run the following commands:
docker pull qdrant/qdrant
docker run -p 6333:6333 qdrant/qdrant
You can then go to your browser at http://localhost:6333/dashboard and see the Qdrant dashboard. We will come back to this dashboard later.
Run the flow now, using the green coloured database icon and click on the upsert button on the pop-up which follows.
Once the documents are upserted into the Qdrant DB, you can go over to the Qdrant dashboard at http://localhost:6333/dashboard and refresh to see the new collection 'medium_articles_ai' created. Click on it to see the indexed documents.
Finally, let's start the chatbot by clicking on the purple message icon next to the green database icon.
You can ask the bot about articles in AI and the bot would recommend you the articles from the collection we created with the csv file. Without us instructing it to do so, it also knows to cite the url of the recommended blog, so that the user can straightaway start reading the blogs.
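Once the chatflow works in the UI, it can also be queried programmatically: Flowise exposes each saved chatflow at a REST prediction endpoint. A minimal sketch, assuming a default local install on port 3000 (the chatflow ID below is a placeholder; copy yours from the Flowise UI):

```python
import json
import urllib.request

FLOWISE_URL = "http://localhost:3000"   # default local Flowise port
CHATFLOW_ID = "your-chatflow-id-here"   # placeholder: copy from the Flowise UI

def build_request(question: str) -> urllib.request.Request:
    # Flowise's prediction endpoint accepts a JSON body with a "question" field.
    payload = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{FLOWISE_URL}/api/v1/prediction/{CHATFLOW_ID}",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request("Recommend me some blogs on computer vision")
print(req.full_url)
# To actually send it (requires the Flowise server to be running):
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["text"])
```

This lets you embed the recommender in your own blog's backend rather than using the Flowise chat widget.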
Conclusion
In this article we saw how to use an LLM to build a blog recommendation application. We used OpenAI’s GPT-3.5 as the language model and Flowise as a graphical interface with built-in nodes to design the architecture of the application. The Qdrant vector database performed the retrieval step of RAG, which helped us augment the LLM’s reasoning capability by feeding it the pool of blogs in an optimal, vectorized manner. The LLM may not have access to this specific pool of blogs, as they may either be privately held or created after its pre-training cut-off date. While the chat example I showed asked the LLM agent to explicitly recommend blogs on a particular topic, one can also input another blog and ask the agent to recommend similar blogs.