Introduction
This blog post walks you through extracting topics and subtopics from a given set of sentences. Before we dive into topic extraction, let's first understand what topics are.
Topic Extraction
Here's a brief description of topic extraction, generated with the help of ChatGPT.
Topic extraction is a crucial natural language processing (NLP) technique that involves automatically identifying and extracting key themes or subjects from a body of text. The goal is to distill the most relevant and representative topics within a document or a collection of documents, providing a structured and concise overview of the content. By employing various algorithms and statistical models, topic extraction aids in uncovering the underlying themes, facilitating efficient content summarization, categorization, and analysis. This technique finds applications across diverse domains, including information retrieval, document clustering, and sentiment analysis, empowering organizations to derive actionable insights from large volumes of textual data.
Hands-on
- Head over to Google Colab.
- Log in to Google Cloud and note your Project ID and location.
- Use the code below to initialize Vertex AI.
import sys

# Additional authentication is required for Google Colab
if "google.colab" in sys.modules:
    # Authenticate user to Google Cloud
    from google.colab import auth
    auth.authenticate_user()

# Define project information
PROJECT_ID = "<<project_id>>"  # @param {type:"string"}
LOCATION = "<<location>>"  # @param {type:"string"}

# Initialize Vertex AI
import vertexai
vertexai.init(project=PROJECT_ID, location=LOCATION)
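If you would rather not hard-code the project details, one option is to read them from environment variables and fail early when a placeholder was left in place. This is only a sketch: the helper name `resolve_vertex_config`, the variable names `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION`, and the `us-central1` default are assumptions for illustration, not something the setup above requires.

```python
import os


def resolve_vertex_config(env=None, default_location="us-central1"):
    """Return (project_id, location) from an environment-style mapping.

    Hypothetical helper: the variable names and the default location
    are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    project_id = env.get("GOOGLE_CLOUD_PROJECT", "").strip()
    location = env.get("GOOGLE_CLOUD_LOCATION", default_location).strip()

    # Reject empty values and untouched "<<...>>" placeholders
    for name, value in (("project", project_id), ("location", location)):
        if not value or "<<" in value or ">>" in value:
            raise ValueError(f"{name} is not configured: {value!r}")
    return project_id, location


# Usage (uncomment once the variables are set in your environment):
# PROJECT_ID, LOCATION = resolve_vertex_config()
# vertexai.init(project=PROJECT_ID, location=LOCATION)
```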
Topic extraction depends above all on a carefully designed prompt. Here's the code snippet that builds it.
def get_topic_extraction_prompt(content):
    # The text to analyze goes first; the f-string interpolates it
    prompt = f"""Label the main topic or topics in the following text: {content}"""
    # The instructions are a plain string (not an f-string) because the schema contains braces.
    # The leading newline keeps the instructions separate from the text.
    prompt = prompt + """
1. Identify and list the primary topic or category or provide a short description of the main subject matter of the text.
2. If there are subtopics or secondary themes mentioned in the text, list them as well. If the text discusses multiple topics, provide a list of these topics and describe their relevance.
3. Consider the context and tone of the text to determine the most appropriate topics. Take into account keywords, phrases, or specific terms that relate to the topics.
4. If any notable entities (people, places, brands, products, etc.) are mentioned in the text that play a role in the topics, mention them and their associations.
5. If the text suggests any actions, decisions, or recommendations related to the identified topics, provide a brief summary of these insights.
Ensure that your labeling is clear, concise, and reflects the most significant topics or categories found in the text.
Here's the output schema:
```
{
    "Topic": "",
    "Subtopics": [""],
    "Context": "",
    "NotableEntities": [],
    "Recommendations": ""
}
```
Do not respond with your own suggestions or recommendations or feedback."""
    return prompt
Next, let's look at a generic function that executes the topic extraction prompt with the Google Gemini Pro model. Here's the code snippet.
import vertexai
from vertexai.preview.generative_models import GenerativeModel, Part


def execute_prompt(prompt, max_output_tokens=8192):
    model = GenerativeModel("gemini-pro")
    responses = model.generate_content(
        prompt,
        generation_config={
            "max_output_tokens": max_output_tokens,
            "temperature": 0,
            "top_p": 1,
        },
        stream=True,
    )

    # Collect the text of each streamed chunk
    final_response = []
    for response in responses:
        final_response.append(response.candidates[0].content.parts[0].text)
    # Join with an empty string so that no extra characters end up between chunks
    return "".join(final_response)
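Because the response is streamed, the text arrives in chunks that may be split at arbitrary positions, so whatever separator is used to join them ends up inside the final text. A minimal sketch with made-up chunks (hypothetical data, not real model output; `join_chunks` is an illustrative helper):

```python
import json


def join_chunks(chunks, separator=""):
    """Join streamed text chunks with the given separator."""
    return separator.join(chunks)


# Hypothetical chunks, split mid-word the way a stream might deliver them
chunks = ['{"Topic": "Cloud', ' computing", "Subto', 'pics": ["Pricing"]}']

plain = join_chunks(chunks)          # chunks glued back together unchanged
dotted = join_chunks(chunks, ".")    # a period is inserted at every boundary

print(json.loads(plain))
print(json.loads(dotted))
```

With the empty separator the original keys and values are restored, while the period separator alters both a key and a value at the chunk boundaries.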
Now it's time to execute the prompt and apply some JSON transformation to extract the topics.
The code block below extracts the JSON from the LLM response. Note that, at the time of writing, Google Gemini Pro has only recently been released to the public and has some known issues producing clean, well-formatted structured JSON, so the response needs a little tweaking.
import re
import json


def extract_json(input_string):
    # Extract the content between triple-backtick fences
    matches = re.findall(r'```(.*?)```', input_string, re.DOTALL)
    if matches:
        # Join the matches into a single string
        json_content = ''.join(matches)
        # Remove periods (note: this also strips periods inside the JSON values)
        json_content = re.sub(r'\.', '', json_content)
        return json_content
    else:
        print("No ``` block found.")
        return None


topics = []
# `summary` is the text to analyze; it must be defined before this point
prompt = get_topic_extraction_prompt(summary)
response = execute_prompt(prompt)
extracted_json = extract_json(response)
if extracted_json is not None:
    topics.append(extracted_json)
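The extracted value is still a string, so it has to be parsed before the topics can be used. The following is a minimal sketch, not part of the original code: `parse_topic_json` and `EXPECTED_KEYS` are hypothetical names, the key list is taken from the output schema in the prompt, and the assumption is that the model may add a `json` tag after the opening fence or omit the fence altogether.

```python
import json
import re

# Three backticks, built programmatically to keep this example easy to embed
FENCE = "`" * 3

# Keys taken from the output schema in the prompt
EXPECTED_KEYS = ("Topic", "Subtopics", "Context", "NotableEntities", "Recommendations")


def parse_topic_json(response_text):
    """Return the topics as a dict, or None if no valid JSON object is found."""
    # Prefer a fenced block; tolerate an optional "json" tag after the opening fence
    pattern = FENCE + r"(?:json)?\s*(.*?)" + FENCE
    match = re.search(pattern, response_text, re.DOTALL)
    candidate = match.group(1) if match else response_text

    # Narrow the candidate down to the outermost pair of braces
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        return None

    try:
        parsed = json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None

    # Keys missing from the response are filled with None
    return {key: parsed.get(key) for key in EXPECTED_KEYS}


# Made-up response text, for illustration only
sample = "Sure:\n" + FENCE + 'json\n{"Topic": "Cloud", "Subtopics": ["Pricing"]}\n' + FENCE
result = parse_topic_json(sample)
print(result["Topic"], result["Subtopics"])
```

Returning `None` on failure mirrors the behaviour of `extract_json`, so the caller can keep the same `is not None` check before appending to `topics`.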