<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aldo Ferlatti</title>
    <description>The latest articles on DEV Community by Aldo Ferlatti (@aldo95).</description>
    <link>https://dev.to/aldo95</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F892049%2Fc31f3ee2-e55d-4015-8dd8-420a10aed41c.jpeg</url>
      <title>DEV Community: Aldo Ferlatti</title>
      <link>https://dev.to/aldo95</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aldo95"/>
    <language>en</language>
    <item>
      <title>Estimation of text complexity</title>
      <dc:creator>Aldo Ferlatti</dc:creator>
      <pubDate>Thu, 24 Nov 2022 12:46:03 +0000</pubDate>
      <link>https://dev.to/aldo95/estimation-of-text-complexity-4oo7</link>
      <guid>https://dev.to/aldo95/estimation-of-text-complexity-4oo7</guid>
      <description>&lt;p&gt;Medium post: &lt;a href="https://medium.com/@ferlatti.aldo/estimation-of-text-complexity-c113d111e29f" rel="noopener noreferrer"&gt;Medium&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Motivation
&lt;/h2&gt;

&lt;p&gt;To acquire new knowledge and insights in a complex field such as data science, machine learning, or any other scientific field, a person needs to practice a lot (of course), but a big chunk of time also goes into reading and researching papers in the related field. Anyone with research experience knows how challenging reading even a single paper can be: understanding the concepts and connecting all the terms to build a complete idea. Someone might ask what there is to understand; it’s “just a bunch of words put together to form complex sentences to display someone’s thoughts”, and I would agree with them. However, getting to a level where someone can easily read and comprehend scientific papers requires years of experience and a ton of reading. For that reason, starting gradually would help complete beginners, such as first-year students, grasp the vocabulary needed to understand such texts.&lt;/p&gt;

&lt;p&gt;This post will go through the process of making a tool for paper complexity evaluation based on word frequency and academic vocabulary lists (no AI for this post).&lt;/p&gt;

&lt;h2&gt;
  
  
  AWL and low frequency words
&lt;/h2&gt;

&lt;p&gt;Before beginning any work, I needed to decide on a method for measuring complexity. The problem is that labeling something as difficult or easy is a subjective statement and can’t be generalized effectively. Because of that, I split the problem into smaller variables that quantify complexity in an acceptable and generic way.&lt;/p&gt;

&lt;p&gt;The two variables used to calculate complexity are the ratios of low-frequency words and of academic words (from the AWL) present in the document.&lt;/p&gt;

&lt;p&gt;The Academic Word List (AWL) is a predefined list of words that, as the name suggests, are used primarily in academic environments. It is closely tied to academic language, defined as the specialized language, both oral and written, of academic settings that facilitates communication and thinking about disciplinary content [&lt;a href="https://www.researchgate.net/publication/227762410_Words_as_Tools_Learning_Academic_Vocabulary_as_Language_Acquisition;%20https://doi.org/10.1002/RRQ.011" rel="noopener noreferrer"&gt;source&lt;/a&gt;]. Heavier use of academic words decreases readers’ comprehension [&lt;a href="https://doi.org/10.2307/3587951" rel="noopener noreferrer"&gt;source&lt;/a&gt;].&lt;/p&gt;

&lt;p&gt;Word frequency is a good indicator of a person’s vocabulary size. High-frequency words are used by the majority of people and are therefore processed faster and more easily. They are also the first to be learned. Moreover, word frequency correlates with text coverage: a small number of high-frequency words is enough to cover 80% of a typical written text. By exclusion, texts that contain more low-frequency words are therefore more complex and harder to understand.&lt;/p&gt;

&lt;p&gt;For the reasons mentioned above, I selected the AWL and low-frequency word ratios as the complexity measurements for a given text. The higher the ratios, the more complex the processed text. For comparison: a children’s story (like The Little Mermaid) has an AWL ratio of 1.5% and a low-frequency-word ratio of 0.6%, while a scientific paper has 19% and 3.5%, respectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Before jumping into code
&lt;/h2&gt;

&lt;p&gt;Before jumping into the code and the process, a few things need to be set up. First, install the necessary packages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.nltk.org" rel="noopener noreferrer"&gt;&lt;strong&gt;NLTK&lt;/strong&gt;&lt;/a&gt;: for token processing&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pandas.pydata.org" rel="noopener noreferrer"&gt;&lt;strong&gt;pandas&lt;/strong&gt;&lt;/a&gt;: for faster data processing and manipulation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pymupdf.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;&lt;strong&gt;PyMuPDF&lt;/strong&gt;&lt;/a&gt;: pdf reader and word extractor package&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are the main ones; you can find the whole list in the requirements.txt file in the GitHub repository.&lt;/p&gt;

&lt;p&gt;Second, find and prepare the needed word lists. For the AWL database, I combined the common list of 570 word families with a more recent list from a &lt;a href="https://www.kaggle.com/datasets/jimregan/new-academic-word-list" rel="noopener noreferrer"&gt;Kaggle&lt;/a&gt; post that contains an updated version. The AWL data frame contains 1439 words. For the low-frequency word list, I extracted words with a frequency lower than 1%, resulting in a list of 36621 words. Both lists are in base lexical form, which means tokens must be preprocessed into base lexical form before searching for matches.&lt;/p&gt;

&lt;p&gt;In addition to the corpus processing, I implemented a basic API for testing (written with Flask). However, this part won’t be covered in this post; you can find it at the GitHub link at the end of this article.&lt;/p&gt;

&lt;h2&gt;
  
  
  Process (code)
&lt;/h2&gt;

&lt;p&gt;The process steps are as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Load document (PyMuPDF package).&lt;/li&gt;
&lt;li&gt;Preprocess document text: transformations needed to bring the tokens to their basic form (NLTK package).&lt;/li&gt;
&lt;li&gt;Get sample: in case of long documents, extract a representative random corpus.&lt;/li&gt;
&lt;li&gt;Calculate AWL and low frequency words ratios (pandas package).&lt;/li&gt;
&lt;li&gt;Extract complexity level.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Load document&lt;/strong&gt;: to load and work with pdf documents I use the PyMuPDF library. The package has a lot of functionality for pdf manipulation, but for my case I need just the loading and word-extraction functions. For now, this step supports only pdf documents, but it can easily be modified to support other common text formats.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Preprocess text&lt;/strong&gt;: after getting the list of all words from the document, I preprocess every word. The transformations are as follows: 1. Lower-case the whole corpus; 2. Tokenize the corpus; 3. Remove punctuation; 4. Remove stop words; 5. Remove numbers; 6. Lemmatize the tokens; 7. and finally remove single-letter tokens.&lt;br&gt;
The preprocessing was done with the NLTK package.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
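&lt;p&gt;As a rough sketch of the seven steps above (a simplified stand-in for the NLTK version: a tiny illustrative stop-word set, a regex tokenizer, and no lemmatization, which NLTK’s WordNetLemmatizer handles in the real pipeline):&lt;/p&gt;

```python
import re

# Tiny stop-word set for illustration only; the real pipeline uses
# NLTK's full English stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it"}

def preprocess(text):
    """Simplified version of the preprocessing steps described above."""
    text = text.lower()                       # 1. lower-case the corpus
    tokens = re.findall(r"[a-z]+", text)      # 2, 3, 5. tokenize, dropping punctuation and numbers
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 4. remove stop words
    # 6. lemmatization would happen here (NLTK WordNetLemmatizer in the original)
    tokens = [t for t in tokens if len(t) > 1]           # 7. remove single-letter tokens
    return tokens

print(preprocess("The Model achieved 95% accuracy, a significant improvement."))
```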


&lt;p&gt;&lt;strong&gt;Get sample&lt;/strong&gt;: this step applies only to longer documents. If the document is longer than 50 pages, I randomly extract 30 pages. Additionally, if the final corpus has more than 10k tokens, I extract 10k tokens at random. This step increases efficiency: a randomly selected 10k-token corpus has proven representative enough for estimating the whole document.&lt;/p&gt;
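&lt;p&gt;A minimal sketch of this sampling logic (the 50-page, 30-page, and 10k-token thresholds are the ones stated above; the function names are mine):&lt;/p&gt;

```python
import random

def sample_pages(pages):
    """If the document is longer than 50 pages, work on 30 random pages."""
    if len(pages) > 50:
        return random.sample(pages, 30)
    return pages

def sample_tokens(tokens, max_tokens=10_000):
    """If the corpus exceeds 10k tokens, keep a random 10k-token sample."""
    if len(tokens) > max_tokens:
        return random.sample(tokens, max_tokens)
    return tokens

print(len(sample_pages(list(range(80)))))        # 30
print(len(sample_tokens(list(range(25_000)))))   # 10000
```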

&lt;p&gt;&lt;strong&gt;Calculate ratio&lt;/strong&gt;: the aim is to find the ratios of AWL words and low-frequency words. For faster processing, I convert the corpus to a pandas DataFrame. I calculate each ratio by dividing the number of words found in the previously built list by the size of the corpus (a standard percentage calculation).&lt;/p&gt;
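&lt;p&gt;A minimal sketch of the ratio calculation with pandas (the helper name and the example word lists are illustrative, not the actual AWL data):&lt;/p&gt;

```python
import pandas as pd

def word_list_ratio(tokens, word_list):
    """Percentage of corpus tokens that appear in the given word list."""
    corpus = pd.Series(tokens)
    hits = corpus.isin(word_list).sum()  # vectorized membership test
    return 100 * hits / len(corpus)

tokens = ["analyze", "data", "cat", "run", "hypothesis", "method"]
awl = {"analyze", "hypothesis", "method"}  # toy stand-in for the 1439-word AWL list
print(word_list_ratio(tokens, awl))  # 50.0
```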

&lt;p&gt;&lt;strong&gt;Get complexity level&lt;/strong&gt;: the following matrix determines the complexity level, where 1 indicates low complexity and 5 high complexity. The x axis is determined by the AWL ratio (converted to an index), while the y axis is determined by the low-frequency word ratio. Of course, this matrix can (and should) be optimized.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv8nvg3mmqrpv4p6zszd7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv8nvg3mmqrpv4p6zszd7.png" alt="Complexity level matrix: top-left is low complexity, bottom-right is high complexity" width="450" height="359"&gt;&lt;/a&gt;&lt;/p&gt;
Complexity level matrix: indexes are calculated from respective ratios.
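&lt;p&gt;The lookup itself can be sketched as follows. Note that the bin boundaries and matrix values below are placeholders I chose for illustration; the actual values are the ones shown in the figure.&lt;/p&gt;

```python
from bisect import bisect_right

# Illustrative 5x5 matrix: rows = low-frequency index, columns = AWL index.
# Placeholder values -- the real matrix is the one in the figure above.
LEVELS = [
    [1, 1, 2, 2, 3],
    [1, 2, 2, 3, 3],
    [2, 2, 3, 3, 4],
    [2, 3, 3, 4, 4],
    [3, 3, 4, 4, 5],
]
AWL_BINS = [2.0, 5.0, 10.0, 15.0]      # AWL ratio (%) boundaries -> column index
LOW_FREQ_BINS = [0.5, 1.0, 2.0, 3.0]   # low-frequency ratio (%) boundaries -> row index

def complexity_level(awl_ratio, low_freq_ratio):
    """Convert the two ratios to indexes and look up the level (1 = easy, 5 = hard)."""
    col = bisect_right(AWL_BINS, awl_ratio)
    row = bisect_right(LOW_FREQ_BINS, low_freq_ratio)
    return LEVELS[row][col]

# The children's story (AWL 1.5%, low-frequency 0.6%) vs. the scientific paper (19%, 3.5%):
print(complexity_level(1.5, 0.6), complexity_level(19.0, 3.5))  # 1 5
```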



&lt;p&gt;What follows is the core function of the calculation which covers all the above described steps:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Link to GitHub repository of the project: &lt;a href="https://github.com/AldoF95/document-checker" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Personally, I think this would be a great tool for professors and for new students who have just embarked on the scientific path. Instead of just sending students random papers for reference, it would first indicate whether a student is capable of understanding the concepts described in the research.&lt;/p&gt;

&lt;p&gt;If you reached the end, thank you for your time. Let me know what you think of the idea of a complexity-checker tool; any suggestions for better variable selection are welcome.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;My goal is to simplify complexity. I just want to build stuff that really simplifies our base human interaction. -Jack Dorsey&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>gratitude</category>
    </item>
    <item>
      <title>React Native + Tensorflow.js - implementing a model</title>
      <dc:creator>Aldo Ferlatti</dc:creator>
      <pubDate>Fri, 15 Jul 2022 11:36:49 +0000</pubDate>
      <link>https://dev.to/aldo95/react-native-tensorflowjs-implementing-a-model-2pj6</link>
      <guid>https://dev.to/aldo95/react-native-tensorflowjs-implementing-a-model-2pj6</guid>
      <description>&lt;p&gt;Original source: &lt;a href="https://medium.com/@ferlatti.aldo/react-native-tensorflow-js-implementing-a-model-daad1a2c7f30"&gt;Medium&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are three reasons why I decided to write this post:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Some time ago I came across an &lt;a href="https://towardsdatascience.com/building-a-machine-learning-web-application-using-flask-29fa9ea11dac"&gt;article&lt;/a&gt; about how to implement a machine learning model with React. It was about a simple Gaussian Naïve Bayes binary classifier made with scikit-learn, running on a Flask backend with a front end made in React. It is a very useful skill, and I recommend everyone read it. However, I had a different problem. What if my model needs to be loaded onto the device, must be mobile compatible, is more than 200 MB in size, and is made with &lt;em&gt;Tensorflow&lt;/em&gt;? The ‘simple’ server solution doesn’t work anymore.&lt;/li&gt;
&lt;li&gt;A claim made on &lt;a href="https://venturebeat.com/2019/07/19/why-do-87-of-data-science-projects-never-make-it-into-production/"&gt;VentureBeat&lt;/a&gt; says that 87% of data science projects never make it into production. That means only about 1 in 10 projects is actually used. Considering all the money and time (a lot of time) needed to develop a model, the odds are not very motivating. After nine projects you spent your time and effort on end up in some cloud folder (because maybe they will be used someday), you start to question whether this is the right way and whether your next project will also be a waste of time.&lt;/li&gt;
&lt;li&gt;Lastly, not every company has a data science team to build models and a development team to implement them. Sometimes, if you want your models to be used by people, you need to put them out there yourself, or nobody will.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Following these points, I wanted to write about a method for putting our hard-won, time-consuming models out there, in the world, by ourselves.&lt;/p&gt;

&lt;p&gt;Here I will not write about building the model (it has already been built), only about its implementation and use.&lt;/p&gt;

&lt;p&gt;There are two paths to choose from for mobile development: native code or cross-platform. As the choice of development approach can vary, so can the choice for model processing. If you prefer native code, a tensorflow-lite approach would be the better option; on the other hand, a cross-platform approach like React Native lets you transfer knowledge from web development to mobile, consequently making &lt;a href="https://www.tensorflow.org/js"&gt;TensorFlow.js&lt;/a&gt; (tfjs) a good choice.&lt;/p&gt;




&lt;p&gt;As you probably already guessed, in this article I’ll take the cross-platform path, so &lt;em&gt;tfjs&lt;/em&gt; will be the central library. For the conversion part we need the Python package:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install tensorflowjs&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;And because we are trying to implement it with React Native, we need the adapter for the framework:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm i @tensorflow/tfjs
npm i @tensorflow/tfjs-react-native
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The implementation is done in four steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Transform the model so it can be loaded onto the device and be used with &lt;em&gt;tfjs&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Load the model&lt;/li&gt;
&lt;li&gt;Transform the input (image) in a way it can be fed to the model&lt;/li&gt;
&lt;li&gt;And finally make predictions&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;
  
  
  Model transformation
&lt;/h2&gt;

&lt;p&gt;Once you have trained your model and are satisfied with the result, you save the entire model in the SavedModel format. SavedModel is a directory containing a protobuf binary and a TensorFlow checkpoint, which can be loaded with TensorFlow using the &lt;em&gt;load_model&lt;/em&gt; function. But this format is not suitable for mobile and cannot be loaded by the Tensorflow.js library. For that, &lt;em&gt;tfjs&lt;/em&gt; has a built-in converter which can convert a SavedModel into a JavaScript-compatible format (JSON + weights: more about it later).&lt;/p&gt;

&lt;p&gt;To convert a saved model, use the following command. Be sure you are inside the root directory, where your model is saved:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tensorflowjs_converter --input_format=tf_saved_model --saved_model_tags=serve --weight_shard_size_bytes=30000000 "path_to_your/model/" "converted_model"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;What does it do? The converter can handle three types of input models (SavedModel, Frozen Model, and TensorFlow Hub modules), so we need to specify the type of the input model (&lt;em&gt;input format&lt;/em&gt;). The output is a JSON file with the model’s dataflow graph and weight manifest, together with a collection of binary weight files.&lt;br&gt;
If the &lt;em&gt;shard size&lt;/em&gt; is smaller than the total size of the model’s weights, the weights are split across multiple files. However, to load the model with &lt;em&gt;tfjs&lt;/em&gt;, we need the weights in a single file; so if you set a small shard size, keep in mind that you need to merge the output files into one. The last two arguments are the input path, i.e. the location of your model, and the output directory where the generated files will be stored.&lt;/p&gt;
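&lt;p&gt;As a concrete illustration of how the shard size relates to the number of output weight files (simple arithmetic, not part of the converter’s API):&lt;/p&gt;

```javascript
// The converter splits the total weight bytes into ceil(total / shardSize) shards.
function shardCount(totalWeightBytes, shardSizeBytes) {
  return Math.ceil(totalWeightBytes / shardSizeBytes);
}

// A hypothetical 200 MB model with the 30 MB shard size used in the command above:
console.log(shardCount(200e6, 30e6)); // 7 files
// Raising the shard size above the total weight size keeps everything in one file:
console.log(shardCount(200e6, 300e6)); // 1 file
```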

&lt;p&gt;After running the above command, the result should be a &lt;em&gt;model.json&lt;/em&gt; file and a set of &lt;em&gt;group1-shardXofY&lt;/em&gt; binary weight files (in our case, it should be just one shard of weights).&lt;/p&gt;

&lt;p&gt;Before jumping to the application, we need to import all the necessary packages, and more importantly the files we just created with the converter.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;h2&gt;
  
  
  Loading the model
&lt;/h2&gt;

&lt;p&gt;The next step is to load the model, which, thanks to &lt;em&gt;tensorflow&lt;/em&gt; and its simple API, is basically a one-liner. &lt;em&gt;Tfjs&lt;/em&gt; allows loading graph models and layers models. Since this is a Keras sequential model, we will load a layers model with the &lt;em&gt;loadLayersModel&lt;/em&gt; function. We load the weights and the &lt;em&gt;json&lt;/em&gt; in one go, using the &lt;em&gt;bundleResourceIO&lt;/em&gt; helper from the react-native adapter.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;After this, the model is loaded and ready to use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Input image transformations
&lt;/h2&gt;

&lt;p&gt;Now that we have our model loaded, we need to feed it data. But first we need to do some transformations so the data is compatible with the model’s input shape. My model is for image classification and requires a tensor representing a 300x300-pixel image. The input depends on the model and how it was trained, so you need to transform your input the same way it was transformed during training. For this, I transform a local image into base64 encoding and then turn it into a tensor.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
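&lt;p&gt;A simplified sketch of the pixel-data part of this transformation, leaving out the tfjs-specific calls: in the app, the resulting array would be passed to something like tf.tensor3d(rgb, [300, 300, 3]), and decoding the JPEG bytes into raw RGBA pixels would be handled by a decoder (for example the jpeg-js package). Here we just show base64 decoding and dropping the alpha channel:&lt;/p&gt;

```javascript
// Drop the alpha channel from raw RGBA pixel data, keeping only RGB values.
function rgbaToRgb(rgbaBytes) {
  const rgb = [];
  for (let i = 0; rgbaBytes.length > i; i += 4) {
    rgb.push(rgbaBytes[i], rgbaBytes[i + 1], rgbaBytes[i + 2]); // skip every 4th byte (alpha)
  }
  return rgb;
}

// Base64 is just the transport encoding for the image's bytes:
const base64 = "AAECQAABAkA="; // two RGBA pixels, for illustration
const bytes = Buffer.from(base64, "base64");
console.log(rgbaToRgb(bytes)); // [ 0, 1, 2, 0, 1, 2 ]
```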


&lt;h2&gt;
  
  
  Make predictions
&lt;/h2&gt;

&lt;p&gt;Making a prediction is just as easy as loading the model. So yeah, another one-liner. The &lt;em&gt;predict&lt;/em&gt; function can run predictions on a batch of images; we only need to split the result based on the batch size.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
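&lt;p&gt;The splitting step can be sketched in plain JavaScript: after reading the prediction tensor’s values (e.g. with dataSync()) you get one flat array of scores, which is divided per image like this (the function name is mine):&lt;/p&gt;

```javascript
// Split a flat array of batch predictions into one score array per image.
function splitPredictions(flat, batchSize) {
  const perImage = flat.length / batchSize;
  const result = [];
  for (let b = 0; batchSize > b; b++) {
    result.push(flat.slice(b * perImage, (b + 1) * perImage));
  }
  return result;
}

// Two images, three classes each:
console.log(splitPredictions([0.1, 0.7, 0.2, 0.8, 0.1, 0.1], 2));
// [ [ 0.1, 0.7, 0.2 ], [ 0.8, 0.1, 0.1 ] ]
```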


&lt;h2&gt;
  
  
  Wrap everything together
&lt;/h2&gt;

&lt;p&gt;The only thing left is to call our functions. However, before using any &lt;em&gt;tfjs&lt;/em&gt; methods we need to wait for &lt;strong&gt;tf.ready()&lt;/strong&gt;; only after that can we use the &lt;em&gt;tensorflow&lt;/em&gt; package. We export this function so we can call it later from wherever we want in the application.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;





&lt;p&gt;&lt;strong&gt;Congratulations!&lt;/strong&gt; Now you can run inferences on your mobile device. In this case, the &lt;em&gt;tfjs&lt;/em&gt; library was used only for loading and predictions, but it also has all the tools for training models. I invite you to experiment with it and let me know whether it even makes sense to train a model on a mobile device, and if it does, up to what point.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Sometimes it is the people no one can imagine anything of who do the things no one can imagine. ― Alan Turing&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>reactnative</category>
      <category>machinelearning</category>
      <category>mobile</category>
      <category>javascript</category>
    </item>
  </channel>
</rss>
