Today I will show you how to automate importing PDF data into Excel using Apache Tika and pandas. This can save a lot of time for anyone who does this daily.
Let's jump into the code!
First, we need to import our dependencies:
from tika import parser
import pprint
from collections import defaultdict
import re
import pandas as pd
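A quick note on these dependencies: pprint, collections, and re ship with the Python standard library, while tika, pandas, and xlsxwriter are third-party packages you can install with pip install tika pandas xlsxwriter. The tika package also needs a Java runtime on your machine, because it runs the Apache Tika server under the hood.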
After that, we define a PrettyPrinter, extract the content of the PDF file, and split it into a list of lines:
pp = pprint.PrettyPrinter(indent=3)
parsedPDF = parser.from_file("final-test.pdf")  # returns a dict with 'metadata' and 'content' keys
content = parsedPDF['content']
contentlist = content.split('\n')
Remove the empty strings that the split leaves behind:
contentlist = list(filter(lambda a: a != '', contentlist))
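As a side note, if your PDF also contains lines of only whitespace, a slightly stricter variant (a sketch, equivalent except that it drops whitespace-only lines too) would be:
contentlist = [line for line in contentlist if line.strip()]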
Now create an iterator and a few flags that the algorithm will use. First, an iterator over the PDF content, line by line:
iterateContent = iter(contentlist)
A dictionary to hold the scraped data:
data = defaultdict(dict)
A counter for how many blocks we manage to extract:
cntr = 0
An indicator of which line we are on within the current block of data:
line = 1
The algorithm uses the cntr and line flags to determine whether we are starting a new block or still filling an existing one:
while True:
    try:
        string = next(iterateContent)
    except StopIteration:
        break
    if re.match(r'^[A-Z\s]+$', string):
        # An all-uppercase line marks the start of a new block: the name
        cntr += 1
        data[cntr]['Name'] = string
        line = 2
        print('matched')
    elif line == 2:
        # Second line of a block: the address
        data[cntr]['Address'] = string
        line += 1
    elif line == 3:
        # Third line of a block: the website
        data[cntr]['Website'] = string
        line += 1
print("Total data:", len(data.keys()))
Load the data into a DataFrame:
df = pd.DataFrame(data.values())
df.index += 1  # start the row index at 1 instead of 0
print(df)
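To make the shape of the data concrete, here is a hypothetical example (the values are invented, not taken from final-test.pdf) of what data might contain after the loop and how it maps to DataFrame rows:
data = defaultdict(dict, {
    1: {'Name': 'ACME CORP', 'Address': '123 Main St', 'Website': 'acme.example.com'},
    2: {'Name': 'GLOBEX INC', 'Address': '456 Oak Ave', 'Website': 'globex.example.com'},
})
# pd.DataFrame(data.values()) produces one row per block,
# with Name, Address, and Website as columns.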
Write the DataFrame to an Excel file (using a context manager, so the file is saved and closed automatically):
with pd.ExcelWriter("dataframe.xlsx", engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name='output', index=False)
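If you do not need the extra control that an ExcelWriter gives you (multiple sheets, formatting), a single call does the same job; this sketch assumes pandas can find a default Excel engine such as openpyxl or xlsxwriter:
df.to_excel("dataframe.xlsx", sheet_name='output', index=False)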
Our final results:
Thank you all.
Top comments (1)
Very useful. I was wondering if it would make sense to add 'with Apache Tika' at a visible point, e.g. before the code snippets.
Thank you for the post again!