Introduction
In this blog post, you will be shown how to perform information extraction with ease using large language models such as Google Gemini Pro.
LLMs have been the hottest topic since 2022/23. Since then there has been great demand, and a ton of innovative applications are being built either by using LLMs directly or in combination with vector databases and similar tools. This blog post, however, focuses only on the information extraction aspect.
Background
Information extraction has long been a challenging problem. Turning unstructured data into structured data used to involve a great deal of complexity. These days, however, things have changed and evolved with the introduction of large language models.
Hands-on
- Please head over to Google Colab.
- Make sure to log in to Google Cloud and note your project ID and location.
- Use the code below to initialize Vertex AI.
import sys

# Additional authentication is required for Google Colab
if "google.colab" in sys.modules:
    # Authenticate the user to Google Cloud
    from google.colab import auth
    auth.authenticate_user()

# Define project information
PROJECT_ID = "<<project_id>>"  # @param {type:"string"}
LOCATION = "<<location>>"  # @param {type:"string"}

# Initialize Vertex AI
import vertexai
vertexai.init(project=PROJECT_ID, location=LOCATION)
For the purposes of this post, let's consider a web data extraction scenario.
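Before prompting the model, you need a page title and its text content. The snippets later in this post assume `title` and `content` variables already exist; here is a minimal, standard-library-only sketch of one way to obtain them from raw HTML (in a real pipeline you might fetch the page with `requests` and parse it with BeautifulSoup instead — the class and function names below are hypothetical):

```python
from html.parser import HTMLParser

class PageTextExtractor(HTMLParser):
    """Collects the <title> text and visible body text from raw HTML."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False
        self._skip = 0  # depth inside <script>/<style> tags

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip and data.strip():
            self.chunks.append(data.strip())

def parse_page(html):
    """Return (title, content) extracted from an HTML string."""
    parser = PageTextExtractor()
    parser.feed(html)
    return parser.title, " ".join(parser.chunks)

sample = "<html><head><title>Gemini Pro</title></head><body><p>Hello world</p></body></html>"
title, content = parse_page(sample)
```

The extracted `title` and `content` can then be fed straight into the prompt builder shown next.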
Here's the code snippet for building the text-extraction prompt. Our goal is to extract the meaningful information from the specified content, which may contain a ton of noise such as links, images, and HTML tags. It could be anything, for that matter.
def get_text_extract_prompt(title, content):
    prompt = f"""
    Here is its title: {title}
    Here is some text extracted:
    ---------
    {content}
    ---------
    Web pages can have a lot of useless junk in them.
    For example, there might be a lot of ads, or a
    lot of navigation links, or a lot of text that
    is not relevant to the topic of the page. We want
    to extract only the useful information from the text.
    You can use the url and title to help you understand
    the context of the text.
    Please extract only the useful information from the text.
    Try not to rewrite the text, but instead extract
    only the useful information from the text.
    """
    return prompt
Now let's take a look at the code snippet responsible for executing the prompt using the Google Gemini Pro LLM.
import vertexai
from vertexai.preview.generative_models import GenerativeModel

def execute_prompt(prompt, max_output_tokens=8192):
    model = GenerativeModel("gemini-pro")
    responses = model.generate_content(
        prompt,
        generation_config={
            "max_output_tokens": max_output_tokens,
            "temperature": 0,
            "top_p": 1,
        },
        stream=True,
    )
    # Concatenate the streamed chunks into a single string
    final_response = []
    for response in responses:
        final_response.append(response.candidates[0].content.parts[0].text)
    return "".join(final_response)
Finally, let's take a look at the code snippet that ties the above calls together. Notice below that the text-extraction prompt is constructed from the given title and content, and then execute_prompt is called to perform the information extraction using the Gemini Pro LLM.
information_extraction = []
text_extract_prompt = get_text_extract_prompt(title, content)
prompt_response = execute_prompt(text_extract_prompt)
information_extraction.append(prompt_response)
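If you need to process several pages, the same pattern extends naturally to a loop. Here is a minimal offline sketch: `pages` is a hypothetical list of (title, content) pairs, the prompt builder is trimmed to its essentials, and `fake_execute_prompt` is a stand-in stub for the real Gemini-backed `execute_prompt` so the sketch runs without Vertex AI credentials:

```python
def get_text_extract_prompt(title, content):
    # Trimmed stand-in for the fuller prompt builder shown earlier
    return f"Here is its title: {title}\n---------\n{content}\n---------"

def fake_execute_prompt(prompt):
    # Stub: a real call would send the prompt to Gemini Pro via execute_prompt.
    # Here we simply echo back the content line between the dashes.
    return prompt.splitlines()[-2]

pages = [
    ("Page One", "Useful fact about page one."),
    ("Page Two", "Useful fact about page two."),
]

information_extraction = []
for title, content in pages:
    prompt = get_text_extract_prompt(title, content)
    information_extraction.append(fake_execute_prompt(prompt))
```

Swapping `fake_execute_prompt` for the real `execute_prompt` gives you a simple batch extraction pipeline.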