Automated Labeling or Tagging with Google Gemini

#ai #llm #python #programming

Introduction

In this blog post, you will be guided through the process of performing automated tagging on a given piece of content. Before we dive deep into automated labeling or tagging, let's first understand what the topic is.

Automated Labeling

Here's a brief overview of automated labeling, generated with the help of ChatGPT.

Automated labeling or tagging refers to the process of assigning descriptive labels or tags to data, typically in the context of digital content, such as images, texts, or other types of media. The goal is to categorize and organize data automatically, enabling easier retrieval, analysis, and management. This process is essential in various fields, including information retrieval, content management, and machine learning.

Automated labeling enhances efficiency by reducing the manual effort required to organize and categorize large volumes of data. It is widely employed in applications such as content recommendation systems, image recognition, and document management, where accurate and timely labeling is crucial for effective data utilization and retrieval.

Hands-on

Head over to Google Colab.
Make sure you are logged in to Google Cloud and have your Project ID and location at hand.
Use the code below to initialize Vertex AI.



import sys

# Additional authentication is required for Google Colab
if "google.colab" in sys.modules:
    # Authenticate user to Google Cloud
    from google.colab import auth

    auth.authenticate_user()

# Define project information
PROJECT_ID = "<<project_id>>"  # @param {type:"string"}
LOCATION = "<<location>>"  # @param {type:"string"}

if "google.colab" in sys.modules:
    # Initialize Vertex AI
    import vertexai

    vertexai.init(project=PROJECT_ID, location=LOCATION)
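
Before running the initialization, it can help to confirm that the placeholders were actually replaced. The following is a minimal sketch and not part of the original notebook: `check_vertex_config` is a hypothetical helper, and its `<<...>>` check simply mirrors the placeholder style used above.

```python
def check_vertex_config(project_id: str, location: str) -> None:
    """Raise ValueError if either setting is empty or still a <<placeholder>>.

    Hypothetical helper for illustration; it does not call any Google API.
    """
    for name, value in (("PROJECT_ID", project_id), ("LOCATION", location)):
        stripped = value.strip()
        if not stripped or (stripped.startswith("<<") and stripped.endswith(">>")):
            raise ValueError(f"{name} is not set: {value!r}")


# Example values are made up; substitute your own project ID and region.
check_vertex_config("my-sample-project", "us-central1")
```

Calling it right before `vertexai.init` would surface a forgotten placeholder as a clear error rather than a failed API request later on.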

Topic extraction depends on a carefully written extraction prompt. Here's the code snippet for it.



def get_automated_tagger_extraction_prompt(content):
    # End with a blank line so the content does not run into the numbered instructions
    prompt = f"""Automate the tagging of the following unstructured data: {content}\n\n"""
    prompt = prompt + """1. Identify and extract the most relevant tags, keywords, or categories for the given data. These tags should succinctly represent the content's main themes, subjects, or topics.
        2. List the extracted tags, and provide a brief description or rationale for each tag to help users understand their significance.
        3. If there are subcategories or hierarchies in the tags, ensure that they are appropriately nested or organized.
        4. Consider the context, content, and domain-specific knowledge when selecting tags. Ensure that the tags accurately reflect the essence of the data.
        5. If any tags are ambiguous or could have multiple interpretations, address these challenges and provide explanations for the chosen tags.
        6. If there are specific tasks or analyses where the tagged data will be used, describe these use cases and how the tags are expected to be applied.
        7. If the data contains temporal or dynamic elements, mention any trends, changes, or time-sensitive aspects that might impact the tags.

        Ensure that your automated tagging results are clear, relevant, and make the data more accessible and useful.

        Here's the output schema:

    ```json
    {
        "AutomatedTagging": {
            "Tags": [
                {
                    "Tag": "",
                    "Sentences": []
                }
            ]
        }
    }
    ```

    Do not respond with your own suggestions or recommendations or feedback.
"""
    return prompt


Now let's look at generic code for executing the above prompt with the Google Gemini Pro model. Here's the code snippet.

import vertexai
from vertexai.preview.generative_models import GenerativeModel, Part

def execute_prompt(prompt, max_output_tokens=8192):
    model = GenerativeModel("gemini-pro")
    responses = model.generate_content(
        prompt,
        generation_config={
            "max_output_tokens": max_output_tokens,
            "temperature": 0,
            "top_p": 1,
        },
        stream=True,
    )

    # Collect the text of each streamed chunk
    final_response = []
    for response in responses:
        final_response.append(response.candidates[0].content.parts[0].text)

    # Concatenate the chunks exactly as received
    return "".join(final_response)
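
With `stream=True`, the reply arrives as a sequence of chunks, and a chunk boundary can fall in the middle of a word or a JSON token, so any separator inserted while joining ends up inside the text. The sketch below illustrates this with a stand-in generator instead of a real model call; `fake_stream` and `collect_text` are hypothetical names used only for this example.

```python
from typing import Iterable, Iterator


def fake_stream() -> Iterator[str]:
    """Stand-in for streamed model output; chunk boundaries are arbitrary."""
    yield '{"Tag": "clo'
    yield 'ud", "Sentences": ['
    yield '"Vertex AI runs on Google Cloud"]}'


def collect_text(chunks: Iterable[str]) -> str:
    """Concatenate streamed text chunks exactly as received."""
    return "".join(chunks)


full_text = collect_text(fake_stream())
print(full_text)
```

Joining the same chunks with a separator such as `"."` would split the word "cloud" in two, which is why the text is concatenated as-is here.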


Now it's time to execute the prompt and apply some JSON post-processing to extract the topics.

The code block below extracts the JSON from the LLM response. Note that, at the time of writing, Google Gemini Pro has only recently been released to the public and has some known issues with producing well-formed structured JSON, so the response needs a bit of tweaking.

import re
import json

def extract_json(input_string):
    # Extract the content between ``` fences, skipping an optional "json" language tag
    matches = re.findall(r'```(?:json)?(.*?)```', input_string, re.DOTALL)

    if matches:
        # Join the matches into a single string
        json_content = ''.join(matches)

        # Remove every period from the extracted text (this includes periods inside values)
        json_content = re.sub(r'\.', '', json_content)

        return json_content
    else:
        print("No ``` block found.")
        return None



taggers = []
# `summary` is the text to be tagged and must be defined beforehand
prompt = get_automated_tagger_extraction_prompt(summary)
response = execute_prompt(prompt)
extracted_json = extract_json(response)
if extracted_json is not None:
    taggers.append(extracted_json)
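
Each entry in `taggers` is still a JSON string. Below is a minimal sketch of turning such a string into a tag-to-sentences mapping with the standard `json` module. The sample string is made up to match the output schema from the prompt, and `tags_to_dict` is a hypothetical helper, not part of the original code.

```python
import json
from typing import Dict, List, Optional

# Made-up sample shaped like the prompt's output schema.
sample_json = """
{
    "AutomatedTagging": {
        "Tags": [
            {"Tag": "Machine Learning", "Sentences": ["Models learn from labeled data"]},
            {"Tag": "Content Management", "Sentences": []}
        ]
    }
}
"""


def tags_to_dict(json_text: str) -> Optional[Dict[str, List[str]]]:
    """Parse schema-shaped JSON into {tag: sentences}; return None if invalid."""
    try:
        data = json.loads(json_text)
        tags = data["AutomatedTagging"]["Tags"]
        return {item["Tag"]: item.get("Sentences", []) for item in tags}
    except (json.JSONDecodeError, KeyError, TypeError):
        # The model does not always return well-formed JSON.
        return None


print(tags_to_dict(sample_json))
```

Returning `None` on malformed input keeps one bad response from stopping a loop over many documents; whether to skip or retry such responses is left to the caller.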
