Ranjan Dailata
Topic and Subtopic Extraction with Google Gemini Pro

Introduction

In this blog post, you will be guided through the process of effectively extracting topics and subtopics from a given set of sentences. Before we dive into topic extraction, let's first understand what topics actually are.

Topic Extraction

Here's a brief overview of topic extraction, generated with the help of ChatGPT.

Topic extraction is a crucial natural language processing (NLP) technique that involves automatically identifying and extracting key themes or subjects from a body of text. The goal is to distill the most relevant and representative topics within a document or a collection of documents, providing a structured and concise overview of the content. By employing various algorithms and statistical models, topic extraction aids in uncovering the underlying themes, facilitating efficient content summarization, categorization, and analysis. This technique finds applications across diverse domains, including information retrieval, document clustering, and sentiment analysis, empowering organizations to derive actionable insights from large volumes of textual data.

Hands-on

  1. Head over to Google Colab.
  2. Make sure to log in to Google Cloud and get the Project ID and Location info.
  3. Use the code below to initialize Vertex AI.
import sys

# Additional authentication is required for Google Colab
if "google.colab" in sys.modules:
    # Authenticate user to Google Cloud
    from google.colab import auth

    auth.authenticate_user()

PROJECT_ID = "<<project_id>>"  # @param {type:"string"}
LOCATION = "<<location>>"  # @param {type:"string"}

if "google.colab" in sys.modules:
    # Initialize Vertex AI with the project information defined above
    import vertexai
    vertexai.init(project=PROJECT_ID, location=LOCATION)

The key to accomplishing topic extraction is careful construction of the topic extraction prompt. Here's the code snippet for it.

def get_topic_extraction_prompt(content):
    prompt = f"""Label the main topic or topics in the following text: {content}\n"""
    # The instructions are appended outside the f-string, since the schema below contains literal braces
    prompt = prompt + """1. Identify and list the primary topic or category, or provide a short description of the main subject matter of the text.
      2. If there are subtopics or secondary themes mentioned in the text, list them as well. If the text discusses multiple topics, provide a list of these topics and describe their relevance.
      3. Consider the context and tone of the text to determine the most appropriate topics. Take into account keywords, phrases, or specific terms that relate to the topics.
      4. If any notable entities (people, places, brands, products, etc.) are mentioned in the text that play a role in the topics, mention them and their associations.
      5. If the text suggests any actions, decisions, or recommendations related to the identified topics, provide a brief summary of these insights.

      Ensure that your labeling is clear, concise, and reflects the most significant topics or categories found in the text.

      Here's the output schema:

      ```
      {
          "Topic": "",
          "Subtopics": [""],
          "Context": "",
          "NotableEntities": [],
          "Recommendations": ""
      }
      ```

      Do not respond with your own suggestions or recommendations or feedback."""
    return prompt

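For reference, a response that follows the schema above might look like this. The values here are hypothetical, purely for illustration:

```python
import json

# A hypothetical model response matching the prompt's output schema
example = {
    "Topic": "Natural Language Processing",
    "Subtopics": ["Topic extraction", "Text summarization"],
    "Context": "An overview of NLP techniques for analyzing text",
    "NotableEntities": ["ChatGPT"],
    "Recommendations": "Use topic extraction to summarize large text corpora",
}

# Round-trip through JSON to confirm the structure is valid
print(json.dumps(example, indent=2))
```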

Now let's look at generic code for executing the above topic extraction prompt using the Google Gemini Pro model. Here's the code snippet for it.

import vertexai
from vertexai.preview.generative_models import GenerativeModel, Part

def execute_prompt(prompt, max_output_tokens=8192):
  model = GenerativeModel("gemini-pro")
  responses = model.generate_content(
    prompt,
    generation_config={
        "max_output_tokens": max_output_tokens,
        "temperature": 0,
        "top_p": 1
    },
  stream=True,
  )

  final_response = []

  for response in responses:
      final_response.append(response.candidates[0].content.parts[0].text)

  # Concatenate the streamed chunks directly; joining with "." would
  # insert spurious periods into the middle of the response
  return "".join(final_response)

Now it's time to execute the prompt and perform some JSON transformation to extract the topics. Here's the code snippet for it.

Here is the code block for extracting the JSON from the LLM response. Please note that, at the time of writing, Google Gemini Pro has only recently been released to the public and has some known issues producing clean, well-formatted structured JSON responses. Hence the need to tweak the output a bit.

import re
import json

def extract_json(input_string):
    # Extract JSON within a ``` fenced block (with an optional "json" language tag)
    matches = re.findall(r'```(?:json)?\s*(.*?)```', input_string, re.DOTALL)

    if matches:
        # Join the matches into a single string
        json_content = ''.join(matches)
        return json_content
    else:
        print("No ``` block found.")
        return None
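Before wiring it into the pipeline, the extraction logic can be sanity-checked on a made-up response string. The sample text below is invented for illustration:

```python
import re
import json

# A made-up Gemini-style response wrapping JSON in a fenced block
sample = 'Here are the topics:\n```json\n{"Topic": "NLP", "Subtopics": ["Topic modeling"]}\n```'

# Same pattern as extract_json above: grab the content between ``` fences
matches = re.findall(r'```(?:json)?\s*(.*?)```', sample, re.DOTALL)
data = json.loads(matches[0])
print(data["Topic"])      # NLP
print(data["Subtopics"])  # ['Topic modeling']
```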
topics = []
# "summary" holds the input text from which topics are to be extracted
prompt = get_topic_extraction_prompt(summary)
response = execute_prompt(prompt)
extracted_json = extract_json(response)
if extracted_json is not None:
    topics.append(extracted_json)
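Since each entry in topics is still a JSON string, it can be parsed into a dictionary for downstream use. A minimal sketch, using a made-up entry in place of a real model response:

```python
import json

# A made-up extracted entry, standing in for a real model response
topics = ['{"Topic": "Machine Learning", "Subtopics": ["Neural networks", "Model evaluation"], '
          '"Context": "", "NotableEntities": [], "Recommendations": ""}']

# Parse each JSON string into a dictionary
parsed = [json.loads(t) for t in topics]
for item in parsed:
    print(item["Topic"], "->", ", ".join(item["Subtopics"]))
    # Machine Learning -> Neural networks, Model evaluation
```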

[Image: TopicExtraction output]
