Aldo Ferlatti

Estimation of text complexity

Medium post: Medium

Motivation

To acquire new knowledge and insights in a complex field such as data science, machine learning, or any other scientific field, a person needs to practice a lot (of course), but a big chunk of time also goes to reading and researching papers in the related field. Anyone with research experience knows how challenging it can be to read even a single paper: understanding the concepts and connecting all the terms into a complete idea. Someone might ask what there is to understand; it’s “just a bunch of words put together to form complex sentences to display someone’s thoughts”, and I would agree with them. However, getting to a level where someone can easily read and comprehend scientific papers requires years of experience and a ton of reading. For that reason, complete beginners, such as first-year students, benefit from starting gradually to grasp the new vocabulary needed to understand such texts.

This post will go through the process of making a tool for paper complexity evaluation based on word frequency and academic vocabulary lists (no AI for this post).

AWL and low frequency words

Before beginning any work, I needed to decide how to measure complexity. The problem is that labelling something as difficult or easy is a subjective statement and can’t be generalized effectively. Because of that, I split the problem into smaller variables that can quantify complexity in an acceptable and generic way.

The two variables used to calculate complexity are the ratios of low-frequency words and of academic words (AWL) present in the document.

The Academic Word List (AWL) is a predefined list of words, as the name suggests, used primarily in academic environments. It is closely tied to academic language, defined as the specialized language, both oral and written, of academic settings that facilitates communication and thinking about disciplinary content [source]. Higher use of academic words decreases readers’ comprehension [source].

Word frequency is a good indicator of a person’s vocabulary size. High-frequency words are used by the majority of people and are therefore processed faster and more easily. They are also the first to be learned, and word frequency is correlated with text coverage: a small number of high-frequency words is enough to cover about 80% of a typical written text. By exclusion, texts that contain more low-frequency words are more complex and harder to understand.

For the reasons mentioned above, I selected the AWL and low-frequency word ratios as the complexity measurement of a given text. The higher the ratios, the more complex the processed text. For comparison: a children’s story (like The Little Mermaid) has an AWL ratio of 1.5% and a low-frequency word ratio of 0.6%, while a scientific paper has 19% and 3.5% respectively.

Before jumping into code

Before jumping into the code and the process, there are a few things to set up. Firstly, install the necessary packages:

  • NLTK: for token processing
  • pandas: for faster data processing and manipulation
  • PyMuPDF: PDF reader and word extraction package

These are the main ones; you can find the whole list in the requirements.txt file in the GitHub repository.

Secondly, find and prepare the needed word lists. For the AWL database, I combined the common list of 570 word families with a more recent list from a Kaggle post that contains an updated version; the resulting AWL data frame contains 1439 words. For the low-frequency word list, I extracted words with a frequency lower than 1%, resulting in a list of 36621 words. Both lists are in base lexical form, which means that before searching for occurrences, tokens must be preprocessed into their base lexical form.
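
As a rough sketch, combining the two AWL sources into a single deduplicated list with pandas could look like this; the file and column names below are my own placeholders, not necessarily those used in the repository:

```python
import pandas as pd

# Placeholder file names -- both sources are assumed to hold one word per row
# in a column called "word", already in base lexical (lemma) form.
awl_classic = pd.read_csv("awl_570_families.csv")
awl_kaggle = pd.read_csv("awl_kaggle_updated.csv")

# Merge the two sources and drop duplicates to get one lookup list.
awl_words = (
    pd.concat([awl_classic["word"], awl_kaggle["word"]])
    .str.lower()
    .drop_duplicates()
    .reset_index(drop=True)
)

# A plain set makes the later membership checks fast.
awl_set = set(awl_words)
```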

In addition to the corpus processing, I implemented a basic API for testing (written with Flask). However, this part won’t be covered in this post; you can find it via the Git link at the end of this article.

Process (code)

The process steps are as follows:

  1. Load document (PyMuPDF package).
  2. Preprocess document text: transformations needed to bring the tokens to their basic form (NLTK package).
  3. Get sample: in case of long documents, extract a representative random corpus.
  4. Calculate AWL and low frequency words ratios (pandas package).
  5. Extract complexity level.

Load document: to load and work with PDF documents, I am using the PyMuPDF library. The package has a lot of functionality for PDF manipulation, but for my case I need just the loading and word extraction functions. For now, this step supports only PDF documents, but it can easily be modified to support other common text formats.
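
To illustrate, a minimal loading helper with PyMuPDF might look like the following sketch (the load_pdf_words name is mine; get_text("words") returns per-word tuples whose fifth element is the word string):

```python
import fitz  # PyMuPDF

def load_pdf_words(path):
    """Return the raw words of each page as a list of lists."""
    pages = []
    with fitz.open(path) as doc:
        for page in doc:
            # Each tuple from get_text("words") carries the word text at index 4.
            words = [w[4] for w in page.get_text("words")]
            pages.append(words)
    return pages
```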

Preprocess text: after getting the list of all words from the document, I preprocess every word. The transformations are as follows: 1. lowercase the whole corpus; 2. tokenize the corpus; 3. remove punctuation; 4. remove stop words; 5. remove numbers; 6. lemmatize the tokens; 7. finally, remove single-letter tokens.
The preprocessing is done with the NLTK package.
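
A compact version of that pipeline with NLTK could look like the sketch below (the preprocess helper is my own simplification, not the exact code from the repository):

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time resource downloads (safe to call repeatedly).
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text):
    """Lowercase, tokenize, drop punctuation/stop words/numbers,
    lemmatize, and remove single-letter tokens."""
    tokens = nltk.word_tokenize(text.lower())
    tokens = [t.strip(string.punctuation) for t in tokens]
    # Drop empty strings, stop words, and purely numeric tokens.
    tokens = [t for t in tokens
              if t and t not in STOP_WORDS and not t.isnumeric()]
    tokens = [LEMMATIZER.lemmatize(t) for t in tokens]
    return [t for t in tokens if len(t) > 1]
```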

Get sample: this step only applies to longer documents. If the document is longer than 50 pages, I randomly extract 30 pages. Additionally, if the final corpus has more than 10k tokens, I extract 10k tokens at random. This step increases efficiency: a randomly selected corpus of 10k tokens has proven representative enough to estimate the whole document.
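
In code, that sampling logic can be sketched roughly like this (the limits follow the numbers above; the helper name is mine):

```python
import random

PAGE_THRESHOLD = 50    # documents longer than this get page-sampled
PAGE_SAMPLE = 30       # number of pages kept
TOKEN_SAMPLE = 10_000  # maximum corpus size in tokens

def sample_corpus(pages):
    """Reduce long documents to a representative random corpus."""
    if len(pages) > PAGE_THRESHOLD:
        pages = random.sample(pages, PAGE_SAMPLE)
    tokens = [token for page in pages for token in page]
    if len(tokens) > TOKEN_SAMPLE:
        tokens = random.sample(tokens, TOKEN_SAMPLE)
    return tokens
```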

Calculate ratios: the aim is to find the ratios of AWL words and low-frequency words. For faster processing, I convert the corpus to a pandas DataFrame. I calculate each ratio by dividing the number of corpus words found in the previously built list by the size of the corpus (a standard percentage calculation).
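
A pandas version of that calculation might look like the sketch below (word_list_ratio is a name I made up for illustration):

```python
import pandas as pd

def word_list_ratio(tokens, word_list):
    """Percentage of corpus tokens that appear in the given word list."""
    corpus = pd.Series(tokens)
    hits = corpus.isin(word_list).sum()
    return 100 * hits / len(corpus)

# Example usage with the sets built earlier:
# awl_ratio = word_list_ratio(tokens, awl_set)
# low_freq_ratio = word_list_ratio(tokens, low_freq_set)
```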

Get complexity level: the following matrix determines the complexity level, where 1 indicates low complexity and 5 high complexity. The x axis is determined by the AWL ratio (converted to an index), while the y axis is determined by the low-frequency word ratio. Of course, this matrix can (and should) be optimized.

Complexity level matrix: indexes are calculated from the respective ratios; top-left is low complexity, bottom-right is high complexity.
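
To make the lookup concrete, here is an illustrative sketch of how the ratios could be bucketed into indexes and mapped to a level. The cut-off values and matrix entries below are invented for illustration only; the real ones are in the repository:

```python
# Illustrative cut-offs only -- the real boundaries live in the repository.
AWL_BINS = [2, 5, 10, 15]        # AWL ratio (%) -> index 0..4
LOW_FREQ_BINS = [0.5, 1, 2, 3]   # low-frequency ratio (%) -> index 0..4

# Rows: low-frequency index (y axis); columns: AWL index (x axis).
# Values 1..5 are the complexity levels (also invented here).
COMPLEXITY_MATRIX = [
    [1, 1, 2, 2, 3],
    [1, 2, 2, 3, 3],
    [2, 2, 3, 3, 4],
    [2, 3, 3, 4, 4],
    [3, 3, 4, 4, 5],
]

def to_index(ratio, bins):
    """Count how many cut-offs the ratio exceeds."""
    return sum(ratio > b for b in bins)

def complexity_level(awl_ratio, low_freq_ratio):
    row = to_index(low_freq_ratio, LOW_FREQ_BINS)
    col = to_index(awl_ratio, AWL_BINS)
    return COMPLEXITY_MATRIX[row][col]
```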

The core function of the calculation ties all the steps described above together; the full implementation is in the repository.
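
A condensed sketch built from the helpers introduced above could look like this (the helper names are my own, and the actual function in the repository may differ, e.g. by sampling pages before preprocessing for speed):

```python
def estimate_complexity(pdf_path, awl_set, low_freq_set):
    """End-to-end sketch: load -> preprocess -> sample -> ratios -> level."""
    raw_pages = load_pdf_words(pdf_path)                          # 1. load document
    pages = [preprocess(" ".join(words)) for words in raw_pages]  # 2. preprocess text
    tokens = sample_corpus(pages)                                 # 3. get sample
    awl_ratio = word_list_ratio(tokens, awl_set)                  # 4. calculate ratios
    low_freq_ratio = word_list_ratio(tokens, low_freq_set)
    return {
        "awl_ratio": awl_ratio,
        "low_freq_ratio": low_freq_ratio,
        "complexity_level": complexity_level(awl_ratio, low_freq_ratio),  # 5. level
    }
```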

Link to GitHub repository of the project: link

Conclusion

Personally, I think this would be a great tool for professors and for new students who have just embarked on the scientific path. Instead of sending students random papers for reference, this tool would first indicate whether a student would be capable of understanding the concepts described in the research.

If you reached the end, thank you for your time. Let me know what you think of the idea of a complexity checker tool; any suggestions for a better variable selection are welcome.

My goal is to simplify complexity. I just want to build stuff that really simplifies our base human interaction. -Jack Dorsey

Top comments (1)

Youngbin Kwon

Just curious - do you have any specific reason to use the PyMuPDF package instead of PyPDF2?