TL;DR
This article demonstrates how to use an LLM to extract data from PDF invoices. I will build a FastAPI server that accepts a PDF file and returns the extracted data in JSON format.
Along the way, we will see how Paka streamlines the deployment and management of large language model (LLM) applications with a single-command approach.
Introduction
Previously, converting free-form text into a structured format often required me to write custom scripts, using a programming language like Python or NodeJS to parse the text and extract the relevant information. One big problem with this approach is that I needed a different script for each type of document.
The advent of LLMs makes it possible to extract information from diverse documents with a single model. In this article, I will show you how to use an LLM to extract information from PDF invoices.
Some of my goals for this project are:
- Use an open-source model (llama2-7B 🦙) from HuggingFace and avoid the OpenAI API or any other cloud AI APIs.
- Build a production-ready API that can handle multiple requests concurrently and scale horizontally.
Example PDF Invoice
We will be using a Linode invoice as an example.
We are going to extract the following information from this invoice:
- Invoice Number/ID
- Invoice Date
- Company Name
- Company Address
- Company Tax ID
- Customer Name
- Customer Address
- Invoice Amount
Building the API
Step 1: Preprocessing the PDF
Since LLMs require text input, the PDF file must first be converted to text. For this task, we can use the pypdf library or LangChain's wrapper around pypdf, PyPDFLoader:
from langchain_community.document_loaders import PyPDFLoader

pdf_path = "invoice.pdf"  # path to the invoice PDF

# Load the PDF and split it into pages
pdf_loader = PyPDFLoader(pdf_path)
pages = pdf_loader.load_and_split()
page_content = pages[0].page_content

print(page_content)
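If you prefer to skip LangChain for this step, plain pypdf works just as well. A minimal sketch of the equivalent extraction (the file path is a placeholder):

from pypdf import PdfReader

# Read the first page of the invoice and extract its text
reader = PdfReader("invoice.pdf")
page_content = reader.pages[0].extract_text()
print(page_content)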
Here is an example of the conversion result:
Page 1 of 1
Invoice Date: 2024-01-01T08:29:56
Remit to:
Akamai Technologies, Inc.
249 Arch St.
Philadelphia, PA 19106
USA
Tax ID(s):
United States EIN: 04-3432319Invoice To:
John Doe
1 Hacker Way
Menlo Park, CA
94025
Invoice: #25470322
Description From To Quantity Region Unit
PriceAmount TaxTotal
Nanode 1GB
debian-us-west
(51912110)2023-11-30
21:002023-12-31
20:59Fremont, CA
(us-west)0.0075 $5.00 $0.00$5.00
145 Broadway, Cambridge, MA 02142
USA
P:855-4-LINODE (855-454-6633) F:609-380-7200 W:https://www.linode.com
Subtotal (USD) $5.00
Tax Subtotal (USD) $0.00
Total (USD) $5.00
This invoice may include Linode Compute Instances that have been powered off as the data is maintained and
resources are still reserved. If you no longer need powered-down Linodes, you can remove the service
(https://www.linode.com/docs/products/platform/billing/guides/stop-billing/) from your account.
145 Broadway, Cambridge, MA 02142
USA
P:855-4-LINODE (855-454-6633) F:609-380-7200 W:https://www.linode.com
Admittedly, the text is not friendly for humans to read. But it is perfect for LLMs.
Step 2: Extracting Information
Instead of writing custom scripts in Python, NodeJS, or other programming languages for data extraction, we program LLMs through carefully crafted prompts. A good prompt is the key to getting an LLM to produce the desired output.
For our use case, we can write a prompt like this:
Extract all the following values: invoice number, invoice date, remit to company, remit to address, tax ID, invoice to customer, invoice to address, total amount from this invoice: <THE_INVOICE_TEXT>
Depending on the model, such a prompt might or might not work. To get a small, pre-trained, general-purpose model, e.g. llama2-7B, to produce consistent results, we are better off using the Few-Shot prompting technique. That's a fancy way of saying we should provide the model with examples of the output we want. Now we write our prompt like this:
Extract all the following values: invoice number, invoice date, remit to company, remit to address, tax ID, invoice to customer, invoice to address, total amount from this invoice: <THE_INVOICE_TEXT>
An example output:
{
  "invoice_number": "25470322",
  "invoice_date": "2024-01-01",
  "remit_to_company": "Akamai Technologies, Inc.",
  "remit_to_address": "249 Arch St. Philadelphia, PA 19106 USA",
  "tax_id": "United States EIN: 04-3432319",
  "invoice_to_customer": "John Doe",
  "invoice_to_address": "1 Hacker Way Menlo Park, CA 94025",
  "total_amount": "$5.00"
}
Most LLMs appreciate such examples and produce more accurate and consistent results.
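For illustration, here is a minimal sketch of how you might assemble such a Few-Shot prompt in plain Python. The template string and helper name are my own, not from the original project:

# Hypothetical helper that builds a Few-Shot extraction prompt.
# Doubled braces ({{ }}) render as literal braces in str.format().
FEW_SHOT_TEMPLATE = """\
Extract all the following values: invoice number, invoice date, remit to company,
remit to address, tax ID, invoice to customer, invoice to address, total amount
from this invoice: {invoice_text}

An example output (abridged to four fields for brevity):
{{
  "invoice_number": "25470322",
  "invoice_date": "2024-01-01",
  "remit_to_company": "Akamai Technologies, Inc.",
  "total_amount": "$5.00"
}}
"""

def build_prompt(invoice_text: str) -> str:
    return FEW_SHOT_TEMPLATE.format(invoice_text=invoice_text)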
However, instead of using the prompt described above, we will approach this using the LangChain method. While it's possible to accomplish these tasks without LangChain, it greatly simplifies the development of LLM applications.
With LangChain, we define the output schema in code, as a Pydantic model:
from langchain.output_parsers import PydanticOutputParser
from langchain_core.pydantic_v1 import BaseModel, Field

class Invoice(BaseModel):
    number: str = Field(description="invoice number, e.g. #25470322")
    date: str = Field(description="invoice date, e.g. 2024-01-01T08:29:56")
    company: str = Field(description="remit to company, e.g. Akamai Technologies, Inc.")
    company_address: str = Field(
        description="remit to address, e.g. 249 Arch St. Philadelphia, PA 19106 USA"
    )
    tax_id: str = Field(description="tax ID/EIN number, e.g. 04-3432319")
    customer: str = Field(description="invoice to customer, e.g. John Doe")
    customer_address: str = Field(
        description="invoice to address, e.g. 123 Main St. Springfield, IL 62701 USA"
    )
    amount: str = Field(description="total amount from this invoice, e.g. $5.00")

invoice_parser = PydanticOutputParser(pydantic_object=Invoice)
Write the field descriptions in detail; they will later be used to generate the prompt.
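As a quick sanity check, the parser can turn a raw JSON string (the kind of output we expect from the model) into a validated Invoice object. A minimal sketch, using the values from the sample invoice:

# Parse a raw model output string into a validated Invoice instance.
raw_output = (
    '{"number": "#25470322", "date": "2024-01-01T08:29:56", '
    '"company": "Akamai Technologies, Inc.", '
    '"company_address": "249 Arch St. Philadelphia, PA 19106 USA", '
    '"tax_id": "United States EIN: 04-3432319", '
    '"customer": "John Doe", '
    '"customer_address": "1 Hacker Way Menlo Park, CA 94025", '
    '"amount": "$5.00"}'
)
invoice = invoice_parser.parse(raw_output)
print(invoice.number)  # "#25470322"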
Then we define the prompt template, which will later be fed to the LLM:
from langchain_core.prompts import PromptTemplate

template = """
Extract all the following values: invoice number, invoice date, remit to company, remit to address,
tax ID, invoice to customer, invoice to address, total amount from this invoice: {invoice_text}

{format_instructions}

Only return the extracted JSON object, don't say anything else.
"""

prompt = PromptTemplate(
    template=template,
    input_variables=["invoice_text"],
    partial_variables={
        "format_instructions": invoice_parser.get_format_instructions()
    },
)
Huh, that's not as intuitive as the Few-Shot prompt. But invoice_parser.get_format_instructions() will produce a much more detailed example for the LLM to consume.
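To see exactly what will be sent to the model, you can render the template yourself (page_content is the invoice text from Step 1):

# Render the final prompt for inspection
print(prompt.format(invoice_text=page_content))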
The completed prompt, crafted using LangChain, appears as follows:
Extract all the following values:
...
...
...
The output should be formatted as a JSON instance that conforms to the JSON schema below.
As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.
Here is the output schema:
{"properties": {"number": {"title": "Number", "description": "invoice number, e.g. #25470322", "type": "string"}, "date": {"title": "Date", "description": "invoice date, e.g. 2024-01-01T08:29:56", "type": "string"}, "company": {"title": "Company
", "description": "remit to company, e.g. Akamai Technologies, Inc.", "type": "string"}, "company_address": {"title": "Company Address", "description": "remit to address, e.g. 249 Arch St. Philadelphia, PA 19106 USA", "type": "string"}, "tax_id"
: {"title": "Tax Id", "description": "tax ID/EIN number, e.g. 04-3432319", "type": "string"}, "customer": {"title": "Customer", "description": "invoice to customer, e.g. John Doe", "type": "string"}, "customer_address": {"title": "Customer Addre
ss", "description": "invoice to address, e.g. 123 Main St. Springfield, IL 62701 USA", "type": "string"}, "amount": {"title": "Amount", "description": "total amount from this invoice, e.g. $5.00", "type": "string"}}, "required": ["number", "date
", "company", "company_address", "tax_id", "customer", "customer_address", "amount"]}
Only return the extracted JSON object, don't say anything else.
You can see that the prompt is much more detailed and informative. "Only return the extracted JSON object, don't say anything else." was added by me to make sure the LLM doesn't output anything else.
Now, we are ready to employ LLMs for information extraction.
llm = LlamaCpp(
    model_url=LLM_URL,  # endpoint of the Llama2-7B model deployed by Paka
    temperature=0,  # deterministic output for extraction tasks
    streaming=False,
)

chain = prompt | llm | invoice_parser
result = chain.invoke({"invoice_text": page_content})
LlamaCpp is a client proxy to the Llama2-7B model that will be hosted on AWS by Paka; it is defined in the example's source code, linked at the end of this article. When Paka deploys the Llama2-7B model, it uses the awesome llama.cpp project and llama-cpp-python as the model runtime.
The chain is a pipeline composed of the prompt, the LLM, and the output parser: the prompt is fed into the LLM, and the LLM's output is parsed by the output parser. Aside from supplying the one-shot example to the prompt via its format instructions, invoice_parser can validate the output and return a Pydantic object.
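Assuming the extraction succeeds, result is a validated Invoice instance, so the extracted fields are plain attributes:

# result is an Invoice (Pydantic) object
print(result.number)  # e.g. "#25470322"
print(result.dict())  # plain dict, ready to serialize as JSON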
Step 3: Building the API
With the core logic in place, our next step is to construct an API endpoint that receives a PDF file and delivers the results in JSON format. We will be using FastAPI for this task.
import os
import shutil
from typing import Any
from uuid import uuid4

from fastapi import FastAPI, File, UploadFile

app = FastAPI()

@app.post("/extract_invoice")
async def upload_file(file: UploadFile = File(...)) -> Any:
    # Save the uploaded PDF to a unique temporary path
    unique_filename = str(uuid4())
    tmp_file_path = f"/tmp/{unique_filename}"
    try:
        with open(tmp_file_path, "wb") as buffer:
            shutil.copyfileobj(file.file, buffer)
        return extract(tmp_file_path)  # extract contains the LLM logic
    finally:
        # Always clean up the temporary file
        if os.path.exists(tmp_file_path):
            os.remove(tmp_file_path)
The code is pretty straightforward: it accepts a file, saves it to a temporary location, and then calls the extract function (sketched below) to extract the invoice data.
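For completeness, here is a minimal sketch of the extract function, which simply glues Step 1 and Step 2 together (the code in the repo may differ slightly):

def extract(pdf_path: str) -> Invoice:
    # Step 1: convert the PDF into text
    pdf_loader = PyPDFLoader(pdf_path)
    pages = pdf_loader.load_and_split()
    page_content = pages[0].page_content
    # Step 2: run the prompt -> LLM -> parser pipeline
    return chain.invoke({"invoice_text": page_content})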
Deploying the API
We are only halfway there. As promised, our aim is to develop a production-ready API, not merely a prototype operating on my local machine. This involves deploying the API and models to the cloud and ensuring they can scale horizontally. Additionally, we need to collect logs and metrics for monitoring and analysis purposes. That's a lot of work and it's less fun than building the core logic. Luckily, we have Paka to help us with this task.
But before diving deep into deployment, let's try to answer this question: why deploy the model ourselves rather than just using OpenAI's or Google's APIs? The main reasons you might want to deploy your own model:
- Cost: Using OpenAI APIs might become expensive with large volumes of data.
- Vendor lock-in: You may wish to avoid being tethered to a specific provider.
- Flexibility: You may prefer to tailor the model more closely to your needs or select an open-source option from the HuggingFace hub.
- Control: You maintain complete control over both the stability and the scalability of the system.
- Privacy: You may prefer not to expose your sensitive data to external parties.
Now, let's deploy the API to AWS using Paka:
Prerequisites
Installing the tools:
pip install paka
# Ensure AWS credentials and CLI are set up.
aws configure
# Install pack CLI and verify it is working (https://buildpacks.io/docs/for-platform-operators/how-to/integrate-ci/pack/)
pack --version
# Install pulumi CLI and verify it is working (https://www.pulumi.com/docs/install/)
pulumi version
# Ensure the Docker daemon is running
docker info
Creating the config file for the cluster
To run the model on CPU instances, we can create a cluster.yaml file with the following content:
aws:
  cluster:
    name: invoice-extraction
    region: us-west-2
    namespace: default
    nodeType: t2.medium
    minNodes: 2
    maxNodes: 4
  prometheus:
    enabled: false
  tracing:
    enabled: false
  modelGroups:
    - nodeType: c7a.xlarge
      minInstances: 1
      maxInstances: 3
      name: llama2-7b
      resourceRequest:
        cpu: 3600m
        memory: 6Gi
      autoScaleTriggers:
        - type: cpu
          metadata:
            type: Utilization
            value: "50"
Most of the fields are self-explanatory. The modelGroups field is where we define the model groups. In this case, we define one model group called llama2-7b with a c7a.xlarge instance type. The autoScaleTriggers field is where we define the auto-scaling triggers; here we define a CPU trigger that scales the instances based on CPU utilization. Please note that Paka doesn't support scaling a model group down to zero instances, because the cold start time is too long, so we need to keep at least one instance running.
To run the model with GPU instances, here is an example cluster config.
Provisioning the cluster
You can now provision the cluster using the following command:
# Provision the cluster and update ~/.kube/config
paka cluster up -f cluster.yaml -u
The above command will create a new EKS cluster with the specified configuration and update the ~/.kube/config file with the new cluster information. Paka also downloads the llama2-7b model from the HuggingFace hub and deploys it to the cluster.
Deploying the FastAPI app
Next, we deploy the FastAPI app to the cluster by running the following command:
# Change the directory to the source code directory
paka function deploy --name invoice-extraction --source . --entrypoint serve
The FastAPI app is deployed as a function, which means it is serverless: the function is invoked only when there is a request.
Behind the scenes, the command builds a Docker image with buildpacks and pushes it to the Elastic Container Registry. The image is then deployed to the cluster as a function.
Testing the API
First, we need to get the URL of the FastAPI app. We can do this by running the following command:
paka function list
If all steps are successful, the function should appear in the list marked as "READY". By default, the function is accessible via a public REST API endpoint, typically formatted like http://invoice-extraction.default.50.112.90.64.sslip.io.
You can test the API by sending a POST request to the endpoint using curl or another HTTP client. Here is an example using curl:
curl -X POST -H "Content-Type: multipart/form-data" -F "file=@/path/to/invoices/invoice-2024-02-29.pdf" http://invoice-extraction.default.xxxx.sslip.io/extract_invoice
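Equivalently, you can call the endpoint from Python; the URL and file path below are placeholders, just as in the curl example:

import requests

url = "http://invoice-extraction.default.xxxx.sslip.io/extract_invoice"
with open("/path/to/invoices/invoice-2024-02-29.pdf", "rb") as f:
    response = requests.post(url, files={"file": f})
print(response.json())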
If the invoice extraction succeeds, the response will display the structured data as follows:
{"number":"#25927345","date":"2024-01-31T05:07:53","company":"Akamai Technologies, Inc.","company_address":"249 Arch St. Philadelphia, PA 19106 USA","tax_id":"United States EIN: 04-3432319","customer":"John Doe","customer_address":"1 Hacker Way Menlo Park, CA 94025","amount":"$5.00"}
Monitoring
For monitoring purposes, Paka automatically ships all logs to CloudWatch, where they can be viewed directly in the CloudWatch console. Additionally, you can enable Prometheus in cluster.yaml to collect predefined metrics, as shown below.
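To turn Prometheus on, flip the corresponding flag in the cluster.yaml shown earlier and re-provision the cluster:

aws:
  prometheus:
    enabled: true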
Conclusion
This article has demonstrated how to use LLMs to extract data from PDF invoices. We constructed a FastAPI server capable of receiving a PDF file and returning the information in JSON format. Subsequently, we deployed the API on AWS using Paka and enabled horizontal scaling.
For the full source code, see https://github.com/jjleng/paka/tree/main/examples/invoice_extraction.