Introduction
This article shows you how to extract content or sections from Word documents using BeautifulSoup
and regular expressions. I assume you understand Python programming and regular expressions.
A Word Document
Behind the scenes, a Word (docx) document is a zip file containing a collection of XML files. Using Python, you can easily view these files. For example:
import zipfile
f = zipfile.ZipFile('test.docx')
print(f.namelist())
# [..., 'word/document.xml', ..., 'word/settings.xml', ...]
I created a Word document with "Introduction to Python Programming" as the title and several paragraphs of "Lorem Ipsum" text as the body.
The main XML file, word/document.xml, contains the text content you would normally see, while the remaining XML files hold settings, styling, and so on.
Viewing the Content
To view the content of document.xml, you can use the BeautifulSoup() class from the bs4 package. Here, I defined a function read_doc_to_bsoup() to make this easier. Note that you have to install the lxml package along with the beautifulsoup4 package for BeautifulSoup() to parse XML correctly.
import zipfile
from bs4 import BeautifulSoup

def read_doc_to_bsoup(filename: str) -> BeautifulSoup:
    """Read word/document.xml from a docx file into a BeautifulSoup object."""
    with zipfile.ZipFile(filename) as file:
        document = file.read('word/document.xml')
    return BeautifulSoup(document, 'xml')

soup = read_doc_to_bsoup('test.docx')
print(soup.prettify())
A sample output is shown below.
<?xml version="1.0" encoding="utf-8"?>
<w:document mc:Ignorable="w14 w15 w16se w16cid w16 w16cex w16 ...
...
<w:r w:rsidRPr="0002422F">
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:cs="Times New Roman" w:hAnsi="Times New Roman"/>
<w:b/>
<w:bCs/>
<w:sz w:val="24"/>
<w:szCs w:val="24"/>
<w:lang w:val="en-US"/>
</w:rPr>
<w:t>
Introduction to Python Programming
</w:t>
</w:r>
</w:p>
...
<w:sectPr w:rsidR="00F152B4" w:rsidRPr="0002422F">
<w:pgSz w:h="16838" w:w="11906"/>
<w:pgMar w:bottom="1440" w:footer="708" w:gutter="0" w:header="708" w:left="1440" w:right="1440" w:top="1440"/>
<w:cols w:space="708"/>
<w:docGrid w:linePitch="360"/>
</w:sectPr>
</w:body>
</w:document>
The text we're interested in usually sits within a w:t XML tag. Notice that the text "Introduction to Python Programming" above is wrapped in one.
Extracting The Content
To get the first text, simply use the .find() method of BeautifulSoup like so:
first_text = soup.find('w:t')
print(first_text) # <w:t>Introduction to Python Programming</w:t>
# To extract the text without the tags
print(first_text.get_text()) # Introduction to Python Programming
To get all the text tags, use the .find_all() method.
content = soup.find_all('w:t')
print(content)
[<w:t>Introduction to Python Programming</w:t>,
<w:t xml:space="preserve">Lorem ipsum </w:t>, <w:t>dolor</w:t>,
<w:t xml:space="preserve"> sit </w:t>, <w:t>amet</w:t>,
<w:t xml:space="preserve">, </w:t>, <w:t>consectetur</w:t>,
<w:t xml:space="preserve"> </w:t>, <w:t>adipiscing</w:t>, <w:t xml:space="preserve"> </w:t>,
<w:t>elit</w:t>, <w:t xml:space="preserve">. </w:t>,
<w:t>Pellentesque</w:t>, <w:t xml:space="preserve"> </w:t>, <w:t>metus</w:t>, ...]
To extract the text without the tags, you can use a list comprehension. For example:
content = [node.get_text() for node in soup.find_all('w:t')]
print(content)
# ['Introduction to Python Programming', 'Lorem ipsum ', 'dolor', ' sit ',
# 'amet', ', ', 'consectetur', ' ', 'adipiscing', ' ', 'elit', '. ',
# 'Pellentesque', ' ', 'metus', ' ', 'elit', ', ', 'consectetur', ' id ',
# 'mollis', ' non, ', 'fringilla', ' in eros. Mauris ', 'aliquam', ' ',
# 'quis', ' ', 'odio', ' id tempus. ', 'Aliquam', ' ', 'erat', ' ',
# 'volutpat', '. Donec id ', 'iaculis', ' ipsum. In ', 'tincidunt', ' ',
# 'massa', ' non ', 'aliquam', ' ', 'dignissim', '. Donec semper ', ...]
Processing The Content
While you have successfully extracted the content, you'll notice that in some cases the text does not form complete sentences. Some text is within <w:t xml:space="preserve"> tags, while other text is within plain <w:t> tags. This is what makes working with Word documents tricky.
The solution to the problem is entirely dependent on the structure of the document. Therefore, you'll have to spend time understanding the structure of your document(s).
In the simplest case, you could choose to combine all text into a single string and then split by period (.).
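A sketch of that simplest case is shown below. The fragments list here is made up for illustration, standing in for the output of the earlier list comprehension:

```python
# Illustrative fragments, standing in for the extracted w:t texts
fragments = ['Lorem ipsum ', 'dolor', ' sit ', 'amet', '. ',
             'Pellentesque', ' ', 'metus', '.']

# Combine everything into one string, then split into sentences on the period
text = ''.join(fragments)
sentences = [s.strip() for s in text.split('.') if s.strip()]
print(sentences)  # ['Lorem ipsum dolor sit amet', 'Pellentesque metus']
```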
Example: Extracting Research Objectives from Research Proposals
Thankfully, research proposals have a fairly regular structure across various fields. Just for fun, I wanted to extract the objectives from a series of research proposals.
In cases where all the objectives are on a single line, the function below just works!
import re

def process_simple(filename: str, target: str = 'To determine'):
    """Extract objectives using a target keyword."""
    soup = read_doc_to_bsoup(filename)
    wt = soup.find_all('w:t')
    results = [re.sub(r'\d+', '', t.get_text()).strip()
               for t in wt if t.get_text()]
    results = list(filter(lambda x: x, results))
    return list(filter(lambda x: x.startswith(target), results))
- Extract the content as you saw above.
- Remove digits and empty strings from the text array.
- Collect the texts that start with "To determine".
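Those three steps can be traced on a small stand-in list of strings (the list below is made up for illustration; in the real function the strings come from the w:t tags):

```python
import re

texts = ['3.1 OBJECTIVES', '', 'To determine the effect of X.', 'Some other line']

# Step 1: remove digits and strip whitespace
cleaned = [re.sub(r'\d+', '', t).strip() for t in texts]
# Step 2: drop empty strings
cleaned = [t for t in cleaned if t]
# Step 3: keep only the texts that start with the target keyword
objectives = [t for t in cleaned if t.startswith('To determine')]
print(objectives)  # ['To determine the effect of X.']
```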
In a more complicated scenario, the objectives spanned multiple lines, section numbers were different, and so on. The above function failed woefully. Luckily, I found that across objectives spanning multiple lines, each text within a <w:t> tag was part of the preceding <w:t xml:space="preserve"> tag.
The code below walks through the XML soup and combines all <w:t> tags with the previous <w:t xml:space="preserve"> tag. It also works for cases where the objectives are on single lines.
def process_xml_soup(soup: BeautifulSoup):
    # Extract the content
    wt = soup.find_all('w:t')
    results = []
    # Keep track of cursor positions
    current = 0
    next = current + 1
    stop = len(wt) - 1
    while current < stop:
        cur_elem = wt[current]
        next_elem = wt[next]
        # Get the current text
        current_text = cur_elem.get_text().strip() or ''
        # Are we between <w:t xml:space="preserve"> and <w:t>?
        if cur_elem.has_attr('xml:space') and not next_elem.has_attr('xml:space'):
            # Join subsequent <w:t> to the preceding <w:t xml:space="preserve">
            while not next_elem.has_attr('xml:space') and next < stop:
                current_text += f' {next_elem.get_text().strip() or ""}'
                # Advance the cursor until
                # we see another <w:t xml:space="preserve">
                next += 1
                next_elem = wt[next]
        # Collapse runs of whitespace into single spaces
        current_text = re.sub(r'\s+', ' ', current_text.strip())
        # Make the objectives title consistent:
        # if you find "3.1 objectives", make it "3.1 OBJECTIVES"
        current_text = re.sub(r'(\d\.\d\.?\d?) objectives',
                              r'\1 OBJECTIVES', current_text, flags=re.I)
        # Save the text
        results.append(current_text)
        # Move the cursor forward
        current = next
        next = current + 1
    results = [c.strip() for c in results]
    return ' '.join(list(filter(lambda x: x, results)))
Research objectives are usually sandwiched between different sections depending on the field/department. For example, between "AIM" and "HYPOTHESIS".
The following functions complete the extraction.
class NoneObject:
    """A stand-in object to avoid returning None from find_last()."""
    def start(self):
        return None
    def end(self):
        return None

# find_last() uses NoneObject to maintain API consistency
def find_last(text: str, target: str):
    """Find the last occurrence of a target string."""
    # finditer returns an iterator of match objects (possibly empty)
    result = list(re.finditer(target, text))
    # Avoid None checks in callers by returning a NoneObject
    # when there is no match
    return result[-1] if len(result) >= 1 else NoneObject()
# Use find_last() to extract a section
def extract_section(text: str, start: str, end: str, verbose: bool = True):
    """Extract a section of text between two patterns."""
    start = find_last(text, start)
    end = find_last(text, end)
    if verbose:
        print(start)
        print(end)
    return text[start.end():end.start()].strip()
# Account for different headers after the list of objectives
def gen_end_regex():
    """Generate a regex matching the possible section titles after the objectives."""
    end_titles = ['HYPOTHESES', 'HYPOTHESIS',
                  'RESEARCH QUESTIONS', 'NULL AND ALTERNATE HYPOTHESIS',
                  'CHAPTER TWO LITERATURE REVIEW', 'CHAPTER THREE LITERATURE REVIEW']
    fdgts = r'\d?\.\d?\.?\s*'
    # Join the alternatives into a single alternation pattern
    return '|'.join(fdgts + w for w in end_titles)
def extract_objectives(text: str, verbose: bool = True):
    """Extract the objectives section."""
    start = r'\d?\.?\d?\.?\s*OBJECTIVES[:;]?|\d?\.?\d?\.?\s*Objectives|OBJECTIVES OF THE STUDY'
    end = gen_end_regex()
    return extract_section(text, start, end, verbose)

def seperate_objectives(objectives: str):
    """Separate or split the objectives."""
    objectives = re.sub(r'\d\.?', '', objectives).removeprefix('OBJECTIVES ')
    objectives = list(filter(lambda x: x.strip(), objectives.split('To ')))
    objectives = list(
        map(lambda x: 'To ' + x.strip().replace('.', '').strip(), objectives))
    return objectives
def process(filename: str, verbose: bool = True):
    """Run all functions on a Word document."""
    soup = read_doc_to_bsoup(filename)
    document = process_xml_soup(soup)
    objectives = extract_objectives(document, verbose)
    objectives = seperate_objectives(objectives)
    return objectives
- In research proposals, each section title or heading often appears in two places: within the table of contents and within the body of the document. find_last() finds the last occurrence, which is the one in the body of the document.
- re.finditer() returns an iterator of match objects. Each match object has two methods, .start() and .end(), which return the start and end index of the match respectively. In the absence of a match, instead of returning None, I return a NoneObject() that mimics a match object's interface.
- extract_section() uses find_last() to do its job.
- extract_objectives() and seperate_objectives() are fairly obvious, I think :)
- Everything is combined within the process() function.
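To see the null-object idea in isolation, here is a minimal, self-contained sketch. The sample string is made up, and the class and function simply mirror the ones defined above:

```python
import re

class NoneObject:
    """Stand-in for a missing match; mirrors the class defined above."""
    def start(self):
        return None
    def end(self):
        return None

def find_last(text: str, target: str):
    """Return the last match of target in text, or a NoneObject."""
    result = list(re.finditer(target, text))
    return result[-1] if result else NoneObject()

sample = 'OBJECTIVES in the table of contents ... OBJECTIVES To determine X. HYPOTHESIS'
# The match points at the LAST occurrence, i.e. the one in the body
m = find_last(sample, 'OBJECTIVES')
print(sample[m.end():])  # ' To determine X. HYPOTHESIS'
# No match: callers can still call .end() without a None check
print(find_last(sample, 'MISSING').end())  # None
```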
Testing
Create a Word document objectives.docx containing the following content:
OBJECTIVES
- To determine the effect of exercise on weight loss in elderly people.
- To determine if the effect of exercise on weight loss in elderly people is influenced by factors such as location and gender.
- To determine the effect of dieting on weight loss in elderly people.
- To determine if the effect of dieting on weight loss in elderly people is influenced by factors such as location and gender.
from pprint import pp
pp(process('objectives.docx'))
['To determine the effect of exercise on weight loss in elderly people',
'To determine if the effect of exercise on weight loss in elderly people is '
'influenced by factors such as location and gender',
'To determine the effect of dieting on weight loss in elderly people',
'To determine if the effect of dieting']
Summary
In this article, you saw how to extract text from a Word document. You also saw a sample case that combined various methods including regex to extract the objective section of a research proposal.
Thank you for reading.