<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Oursky</title>
    <description>The latest articles on DEV Community by Oursky (@oursky).</description>
    <link>https://dev.to/oursky</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F3626%2Fff6da174-23ce-4ad4-90c3-91ab92e35db9.png</url>
      <title>DEV Community: Oursky</title>
      <link>https://dev.to/oursky</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/oursky"/>
    <language>en</language>
    <item>
      <title>Parse Name and Address: Regex vs NER, with Code Examples</title>
      <dc:creator>Elliot Wong</dc:creator>
      <pubDate>Tue, 16 Mar 2021 12:17:13 +0000</pubDate>
      <link>https://dev.to/oursky/parse-name-and-address-regex-vs-ner-with-code-examples-3gdp</link>
      <guid>https://dev.to/oursky/parse-name-and-address-regex-vs-ner-with-code-examples-3gdp</guid>
      <description>&lt;p&gt;Here we have some regular expressions (regex) that can match a majority of names and addresses. Don't directly copy and paste them though, as there's no guarantee on always landing a 100% match only by using them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Name Regex
&lt;/h2&gt;

&lt;p&gt;We’re intentionally discussing the name regex first, because an address often includes a human name. Things will be clearer if we talk about names before addresses.&lt;/p&gt;

&lt;p&gt;The regex here can be applied to a first or last name text field. We’ll focus on a data field for a human name and ignore the details that differentiate first names from surnames.&lt;/p&gt;

&lt;p&gt;The patterns in more common names like “James,” “William,” “Elizabeth,” or “Mary” are trivial; they can be easily matched with a simple alphabetic pattern. How about those with more variation? There are plenty of languages with different naming conventions. We’ll try to group the different types of names into basic groups:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hyphenated names, e.g., Lloyd-Atkinson, Smith-Jones&lt;/li&gt;
&lt;li&gt;Names with apostrophes, e.g., D’Angelo, D’Esposito&lt;/li&gt;
&lt;li&gt;Names with spaces in-between, e.g., Van der Humpton, De Jong, Di Lorenzo&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Carry on reading to see how text extraction can be done with a regex.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;re&lt;/span&gt;

&lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="s"&gt;"james"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="s"&gt;"william"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="s"&gt;"elizabeth"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="s"&gt;"mary"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"d'angelo"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"andy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"lloyd-atkinson"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"van der humpton"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"jo"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;regex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s"&gt;'^[a-z ,.\'-]+$'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;regex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;findall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;],[]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
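&lt;p&gt;If the goal is to validate a whole form field rather than to search inside free text, &lt;code&gt;fullmatch&lt;/code&gt; is a closer fit than &lt;code&gt;findall&lt;/code&gt;. A minimal sketch using the same pattern (the wrapper name is ours, for illustration):&lt;/p&gt;

```python
import re

# Same character class as above: lowercase letters, space, comma, period,
# apostrophe, and hyphen.
name_re = re.compile(r"^[a-z ,.'-]+$")

def is_plausible_name(text):
    # For validating a whole form field, fullmatch is stricter than
    # findall: the entire string must fit the pattern.
    return name_re.fullmatch(text) is not None

print(is_plausible_name("van der humpton"))  # True
print(is_plausible_name("agent 47"))         # False
```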



&lt;h2&gt;
  
  
  Address Regex
&lt;/h2&gt;

&lt;p&gt;For geographical or language reasons, the format of an address varies all over the world. Here’s a &lt;a href="https://en.wikipedia.org/wiki/Address"&gt;long list&lt;/a&gt; describing these formats per country.&lt;/p&gt;

&lt;p&gt;Since address formats are so varied, it’s impossible for one regex to cover all the patterns. Even if one managed to do so, it’d be very challenging to test, as the required test data set would have to be enormous.&lt;/p&gt;

&lt;p&gt;Our address regex only covers some of the common formats in English-speaking countries. It should do the trick for addresses that start with a number, like “123 Sesame Street.” It’s from this &lt;a href="https://community.alteryx.com/t5/Alteryx-Designer-Discussions/RegEx-Addresses-different-formats-and-headaches/m-p/360176/highlight/true#M66106"&gt;discussion thread&lt;/a&gt;, where it received positive feedback.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;re&lt;/span&gt;

&lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="s"&gt;"224 Belmont Street APT 220"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"225 N Belmont St 220"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"123 west 2nd ave"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"4 Saffron Hill Road 1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="c1"&gt;# will fail as they don't start with a digit
&lt;/span&gt;  &lt;span class="s"&gt;"Flat A, 2 Second Avenue"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"Upper Level 10 ABC Street"&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;regex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s"&gt;'^(\d+) ?([A-Za-z](?= ))? (.*?) ([^ ]+?) ?((?&amp;lt;= )APT)? ?((?&amp;lt;= )\d*)?$'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;regex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;findall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;],[]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
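&lt;p&gt;To pull the individual pieces out of a match rather than just validate it, the capture groups can be read off the match object. A minimal sketch with a simplified variant of the pattern (not the exact regex above, to keep it readable):&lt;/p&gt;

```python
import re

# A simplified variant of the pattern above (illustrative, not the exact
# regex): street number, street words, then an optional APT unit.
addr_re = re.compile(r"^(\d+) (.+?)(?: APT (\d+))?$")

m = addr_re.match("224 Belmont Street APT 220")
print(m.groups())  # ('224', 'Belmont Street', '220')

m = addr_re.match("123 west 2nd ave")
print(m.groups())  # ('123', 'west 2nd ave', None)
```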



&lt;h2&gt;
  
  
  Limitations of Using Regex to Extract Names and Addresses
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Dealing with Uncommon Values
&lt;/h3&gt;

&lt;p&gt;While these regexes may be able to validate a large portion of names and addresses, they will likely miss out on some, especially those that are non-English or newly introduced. For example, Spanish or German names weren’t considered thoroughly here, probably because the developer wasn’t familiar with these languages.&lt;/p&gt;
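&lt;p&gt;One quick way to see this limitation, assuming the name pattern from earlier:&lt;/p&gt;

```python
import re

# The ASCII-only character class silently rejects perfectly valid
# non-English names; a Spanish example (illustrative):
name_re = re.compile(r"^[a-z ,.'-]+$")
print(name_re.fullmatch("josé") is None)  # True: "é" is outside [a-z]
print(name_re.fullmatch("jose") is None)  # False: the unaccented form passes
```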

&lt;h3&gt;
  
  
  No Pattern to Follow
&lt;/h3&gt;

&lt;p&gt;Regex works well against data that follows a strict pattern, and neither names nor addresses belong to that category. They’re ever-changing, with new instances created every day and massive variation between them, so regex isn’t going to do a good job of extracting them. In short, they aren’t “regular” enough: there’s no intuitive pattern to follow.&lt;/p&gt;

&lt;h3&gt;
  
  
  Unable to Find the Likeliest Name
&lt;/h3&gt;

&lt;p&gt;Regex also lacks the ability to find the “most likely” name. Let’s take a step back and assume there’s a regex R that can flawlessly extract names from documents scanned via an OCR data extraction service. We want to get the recipient’s name from a letter from Ann to Mary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Dear Mary,

How have you been these days? Lately, Tom and I have been planning to travel around the World.
...
...
...

Love,
Ann
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are three names in the letter — Mary, Tom, and Ann. If you use R to find names, you’ll end up with a list of the three names, but you won’t be receiving just Mary, the recipient.&lt;/p&gt;

&lt;p&gt;So, how can this be achieved? We can give each name a score based on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Its position on the document&lt;/li&gt;
&lt;li&gt;How “naive” it is (i.e., how often it appeared in a training data set)&lt;/li&gt;
&lt;li&gt;Likelihood of a name to be the single target from a training data set&lt;/li&gt;
&lt;/ul&gt;
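&lt;p&gt;The scoring idea above can be sketched roughly like this; every weight and number here is made up for illustration, not a trained model:&lt;/p&gt;

```python
# A toy illustration of the scoring idea; the weights, frequencies, and
# candidates below are made up, not a real model.
def score_name(line_index, total_lines, frequency):
    # Earlier lines score higher: a greeting like "Dear Mary" usually
    # sits near the top of a letter.
    position_score = 1.0 - line_index / max(total_lines, 1)
    return 0.7 * position_score + 0.3 * frequency

# (name, line it appears on, how common it is in some reference data)
candidates = [("Mary", 0, 0.9), ("Tom", 2, 0.8), ("Ann", 9, 0.7)]
best = max(candidates, key=lambda c: score_name(c[1], 10, c[2]))
print(best[0])  # Mary
```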

&lt;h3&gt;
  
  
  Unable to Differentiate Name and Address
&lt;/h3&gt;

&lt;p&gt;On paper, a name and an address can look the same. “John” can be a name, or part of an address, like “John Street”. Regexes have no capability to see this difference and react accordingly. We surely don’t want “Sesame Street” returned as a name and “Mr. Sherlock Holmes” as a street address!&lt;/p&gt;
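&lt;p&gt;A quick sketch of the ambiguity, using the name pattern from earlier and a simplified stand-in for the address pattern:&lt;/p&gt;

```python
import re

# Both patterns happily accept "john" in their own context; a regex alone
# cannot tell which role the word plays. The address pattern here is a
# simplified stand-in for the one shown earlier.
name_re = re.compile(r"^[a-z ,.'-]+$")
addr_re = re.compile(r"^(\d+) (.+)$")

print(name_re.fullmatch("john") is not None)         # True: a valid name
print(addr_re.match("123 john street") is not None)  # True: a valid address
```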

&lt;p&gt;Well, how can I achieve better extraction accuracy then? For more details and our proposed solution, please refer to this &lt;a href="https://code.oursky.com/how-to-build-a-name-and-address-parser-regex-vs-named-entity-recognition-ner/"&gt;article&lt;/a&gt;! Cheers!&lt;/p&gt;

</description>
      <category>regex</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>ner</category>
    </item>
    <item>
      <title>Regex for Date, Time and Currency, with Code Examples</title>
      <dc:creator>Elliot Wong</dc:creator>
      <pubDate>Wed, 03 Mar 2021 12:13:18 +0000</pubDate>
      <link>https://dev.to/oursky/regex-for-date-time-and-currency-2apm</link>
      <guid>https://dev.to/oursky/regex-for-date-time-and-currency-2apm</guid>
      <description>&lt;p&gt;In this article, regular expressions of currency (e.g., US$100, £0.12, or HK$54), time, and date are listed out for quick copy and paste. They’re all battle-tested. While each regex comes with limitations, we have notes addressing that along with customization tips.&lt;/p&gt;

&lt;p&gt;We do hope you check out the interactive code snippets to get a better idea on how the regexes work!&lt;/p&gt;

&lt;h2&gt;
  
  
  Currency Regex
&lt;/h2&gt;

&lt;p&gt;Note that currency signs apart from “$” will be dropped; the currency value will still get matched, e.g., the pound sterling sign £ in the first item of the test array.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;re&lt;/span&gt;

&lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="s"&gt;"$9876 £112.00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"asdf$1234"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"$12.00 14"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"$3000000000000"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"$00000000000001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"$00000000000000"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"asdf"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"one hundred forty two dollars"&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;regex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s"&gt;'\$?(?:(?:[1-9][0-9]{0,2})(?:,[0-9]{3})+|[1-9][0-9]*|0)(?:[\.,][0-9][0-9]?)?(?![0-9]+)'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;regex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;findall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;],[]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results should be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;['$9876', '112.00', '$1234', '$12.00', '14', '$3000000000000', '1', '0']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Interactive code snippets available &lt;a href="https://repl.it/@elly0t/CurrencyRegex"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Time Regex
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;re&lt;/span&gt;
&lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="s"&gt;"00:00:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"23:59:59"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"00 00 00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"23 59 59"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"00.00.00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"23.59.59"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"00:00.00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"23.59:59"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"9:00pm"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"9:00am"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"10:00:00 am"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="s"&gt;"13:00:12 am"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"13 pm"&lt;/span&gt; &lt;span class="c1"&gt;#won't be considered as valid time
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;regex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s"&gt;'(?=((?: |^)[0-2]?\d[:. ]?[0-5]\d(?:[:. ]?[0-5]\d)?(?:[ ]?[ap]\.?m?\.?)?(?: |$)))'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;regex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;findall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;],[]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results should be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;['00:00:00', '23:59:59', '00 00 00', ' 00 00', '23 59 59', '00.00.00', '23.59.59', '00:00.00', '23.59:59', '9:00pm', '9:00am', '10:00:00 am', '13:00:12 am']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Interactive code snippets available &lt;a href="https://repl.it/@elly0t/TimeRegex"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Date Regex with Months in English (YYYY/MMMM/dd)
&lt;/h2&gt;

&lt;p&gt;Note that only 19** and 20** years are treated as valid, and a space isn’t accepted as a delimiter; see the comments in the test array below for how to adjust this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;re&lt;/span&gt;

&lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="s"&gt;"2020-jan-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="s"&gt;"2012-jan-12"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"1920-feb-22"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="c1"&gt;# space isn't a valid delimiter here, you can add it in the regex though
&lt;/span&gt;  &lt;span class="s"&gt;"2020 mar 1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="c1"&gt;# only 19** and 20** are considered valid here, add year prefix accordingly, or extract with the last two year digits only
&lt;/span&gt;  &lt;span class="s"&gt;"1840-jun-12"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="c1"&gt;# Must follow the format YYYY-MMMM-dd
&lt;/span&gt;  &lt;span class="s"&gt;"2020-01-01"&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;regex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;'(?=((?:(?:[0][1-9]|[1-2][0-9]|3[0-1]|[1-9])[/\-,.]?(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*[/\-,.]?(?:19|20)?\d{2}(?!\:)|'&lt;/span&gt;
    &lt;span class="s"&gt;'(?:19|20)?\d{2}[/\-,.]?(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*[/\-,.]?(?:[0][1-9]|[1-2][0-9]|3[0-1]|[1-9])|'&lt;/span&gt;
    &lt;span class="s"&gt;'(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*[/\-,.]?(?:[0][1-9]|[1-2][0-9]|3[0-1]|[1-9])[/\-,.]?(?:19|20)\d{2}(?!\:)|'&lt;/span&gt;
    &lt;span class="s"&gt;'(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*[/\-,.]?(?:[0][1-9]|[1-2][0-9]|3[0-1]|[1-9])[/\-,.]?\d{2})))'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;regex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;findall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;],[]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results should be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;['2020-jan-1', '20-jan-1', '2012-jan-12', '12-jan-12', '2-jan-12', '1920-feb-22', '20-feb-22', '40-jun-12']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Interactive code snippets available &lt;a href="https://repl.it/@elly0t/DateRegexYYYYMMMMdd"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Check out the Original Post for More Details
&lt;/h2&gt;

&lt;p&gt;This is an abstract from our &lt;a href="https://code.oursky.com/regex-date-currency-and-time-how-to-extract-these-from-documnts-strings/"&gt;original blog post&lt;/a&gt;, which provides more regexes and explanations. In that article, more accurate ways to extract data are also discussed, with solutions proposed. It'd be nice if you can check it out and share your thoughts. Happy coding, cheers!&lt;/p&gt;

</description>
      <category>python</category>
      <category>regex</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Explaining Authentication Security Issues through Memes!</title>
      <dc:creator>Elliot Wong</dc:creator>
      <pubDate>Wed, 10 Feb 2021 08:19:41 +0000</pubDate>
      <link>https://dev.to/oursky/explaining-authentication-security-issues-through-memes-203</link>
      <guid>https://dev.to/oursky/explaining-authentication-security-issues-through-memes-203</guid>
      <description>&lt;h2&gt;
  
  
  Data not Hashed/Encrypted Properly
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oiCy8FFt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://code.oursky.com/wp-content/uploads/2021/02/Screenshot-2021-02-09-at-4.15.21-PM-1-scaled.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oiCy8FFt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://code.oursky.com/wp-content/uploads/2021/02/Screenshot-2021-02-09-at-4.15.21-PM-1-scaled.jpg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Adopt algorithms/functions that are &lt;strong&gt;widely regarded as secure&lt;/strong&gt; when it comes to password hashing or data encryption. MD5 can be handy on many occasions, but it’s cryptographically broken and not a candidate for protecting passwords or data.&lt;/p&gt;

&lt;p&gt;For hashing, try &lt;strong&gt;argon2&lt;/strong&gt; or &lt;strong&gt;bcrypt&lt;/strong&gt;, as suggested by &lt;a href="https://cheatsheetseries.owasp.org/cheatsheets/Password_Storage_Cheat_Sheet.html#password-hashing-algorithms"&gt;OWASP&lt;/a&gt;. Those from the SHA family are alright too, but be aware that they can be accelerated with GPUs, making them more susceptible to brute-force attacks.&lt;/p&gt;
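&lt;p&gt;For hashing with only the standard library at hand, Python’s &lt;code&gt;hashlib.scrypt&lt;/code&gt; (another memory-hard function on OWASP’s list) can serve as a sketch of the idea; the cost parameters below are illustrative, not a tuning recommendation:&lt;/p&gt;

```python
import hashlib
import os

# A minimal scrypt sketch; n, r, and p are the CPU/memory cost factors,
# and the values here are illustrative only.
def hash_password(password, salt):
    return hashlib.scrypt(
        password.encode(),
        salt=salt,
        n=2**14, r=8, p=1,
        maxmem=64 * 1024 * 1024,  # allow the ~16 MiB this setting needs
    )

salt = os.urandom(16)             # fresh random salt per user
digest = hash_password("hunter2", salt)
print(len(digest))                # 64-byte derived key by default
```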

&lt;h2&gt;
  
  
  Hmm, How to Identify a User when S/he Resets Password?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9tpns16i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://code.oursky.com/wp-content/uploads/2021/02/Screenshot-2021-02-10-at-12.36.57-PM-min-1-1160x651-min-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9tpns16i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://code.oursky.com/wp-content/uploads/2021/02/Screenshot-2021-02-10-at-12.36.57-PM-min-1-1160x651-min-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Please don't do this...&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Avoid using any personally identifiable information (PII).&lt;/strong&gt; Never take chances when it comes to PII. Even if encryption is applied, it can still be broken/decrypted by attackers, where they can then use the PII to match a user from your system.&lt;/p&gt;

&lt;p&gt;We’ve seen “encrypted” user IDs used as the password reset token passed in a URL, which, as mentioned above, is not a very good idea. In the case we saw, the token encryption wasn’t even done properly: a cryptographically broken algorithm (MD5) was used, hence the scare quotes around “encrypted”.&lt;/p&gt;

&lt;p&gt;Always use a randomly generated ID as the identifier. Give each ‘reset password’ session a limited lifespan, and block brute-force matching attempts on the ID by rate-limiting requests to the URL token.&lt;/p&gt;
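&lt;p&gt;A minimal sketch of the idea; the 15-minute lifespan and the function names are illustrative, and a real system would add the rate limiting mentioned above:&lt;/p&gt;

```python
import secrets
import time

# A sketch of a random reset token with an expiry attached; the TTL and
# names are illustrative, not from any particular framework.
RESET_TOKEN_TTL = 15 * 60  # seconds

def issue_reset_token():
    token = secrets.token_urlsafe(32)  # 256 bits of randomness, URL-safe
    expires_at = time.time() + RESET_TOKEN_TTL
    return token, expires_at

def is_token_valid(expires_at):
    # Valid only while the expiry still lies in the future.
    return expires_at > time.time()

token, expires_at = issue_reset_token()
print(is_token_valid(expires_at))  # True right after issuing
```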

&lt;h2&gt;
  
  
  No Expiry on Access Token
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4Sx40L04--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://code.oursky.com/wp-content/uploads/2021/02/Screenshot-2021-02-10-at-4.15.04-PM.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4Sx40L04--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://code.oursky.com/wp-content/uploads/2021/02/Screenshot-2021-02-10-at-4.15.04-PM.jpg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s assume that you are generating access tokens properly with safe encryption. If there’s no expiry mechanism, every token ever generated will haunt you forever. This literally gives hackers unlimited time to pull off token sidejacking. Just imagine an attacker getting their hands on an access token: they can authenticate themselves, get into your system, and do whatever they want. This is quite likely to happen. Just open the cookie manager in your favorite browser and check how many access tokens are stored there.&lt;/p&gt;

&lt;p&gt;Even if your machines are kept safe and all traffic goes over HTTPS, an access token with an unlimited lifespan still poses a serious threat. With enough computing power (which isn’t hard to come by nowadays) and unlimited time, an attacker can intercept your exchanged data and crack the sessions/tokens out of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  For more details and in-depth solutions
&lt;/h2&gt;

&lt;p&gt;Read the &lt;a href="https://code.oursky.com/authentication-security-password-reset-and-more-best-practice/"&gt;original post&lt;/a&gt;, served with more memes!&lt;/p&gt;

</description>
      <category>authentication</category>
      <category>security</category>
      <category>algorithms</category>
    </item>
    <item>
      <title>Receipt Data Extraction with OCR, Regex and AI</title>
      <dc:creator>Elliot Wong</dc:creator>
      <pubDate>Wed, 10 Feb 2021 07:42:56 +0000</pubDate>
      <link>https://dev.to/oursky/receipt-data-extraction-with-ocr-regex-and-ai-170a</link>
      <guid>https://dev.to/oursky/receipt-data-extraction-with-ocr-regex-and-ai-170a</guid>
      <description>&lt;p&gt;Optional Image Recognition (OCR) is often the default option when it comes to document data extraction. Still, a OCR receipt scanner itself cannot yield accurate-enough results, therefore we have added Regular Expressions (Regex) and some Artificial Intelligence (AI) models to the formula.&lt;/p&gt;

&lt;p&gt;This article records our journey developing this final solution, which is now branded under the name &lt;a href="https://www.formx.ai/"&gt;FormX&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let’s start with a successful case first – FormX played an important role in streamlining the vetting process of a non-government organization’s disbursement program by enabling them to digitize data from images, forms, and physical documents from &lt;a href="https://www.linkedin.com/posts/googlehk_vision-ai-derive-image-insights-via-ml-activity-6609684625419837440-b4BE/"&gt;43,000&lt;/a&gt; applications.&lt;/p&gt;

&lt;p&gt;We will dive deep into parts where data is captured and extracted. While FormX can pull data off all kinds of physical forms and documents, for the sake of readability, general receipts will be used as a primary example throughout this article.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tCL1TL2G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.substack.com/image/fetch/w_1456%2Cc_limit%2Cf_auto%2Cq_auto:good%2Cfl_progressive:steep/https%253A%252F%252Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%252Fpublic%252Fimages%252Fd7b48ba3-059b-44b5-8817-b01629ef4dd4_1160x773.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tCL1TL2G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.substack.com/image/fetch/w_1456%2Cc_limit%2Cf_auto%2Cq_auto:good%2Cfl_progressive:steep/https%253A%252F%252Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%252Fpublic%252Fimages%252Fd7b48ba3-059b-44b5-8817-b01629ef4dd4_1160x773.jpeg" alt="Data extraction from physical receipts" width="880" height="586"&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@carlijeen?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Carli Jeen&lt;/a&gt; on &lt;a href="https://unsplash.com/"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Proposed Stages to Solve the Problem
&lt;/h2&gt;

&lt;p&gt;The foremost problem we want to solve here is how to extract {amount}, {date}, and {time} from various receipts.&lt;/p&gt;

&lt;p&gt;All sorts of receipts with different layouts exist out there, which makes it challenging to extract just the amount, date, and time. We came up with a solution that has four main stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Get text data out from receipt images with OCR technology.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Filter outliers and group text data into horizontal lines.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Find candidates from horizontal lines.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Classify candidates with AI models and return positive ones.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Note that while Google Vision and other OCR providers consistently do a great job of turning a document image into an array of strings, accurate receipt data extraction requires a few more steps. Regex patterns filter out “candidates”, from which only the most likely one for each target { date, time, total amount } is picked by the AI models.&lt;/p&gt;

&lt;p&gt;We spent a considerable amount of time on the fourth stage, experimenting with AI models and tweaking parameters. However, we’d like to emphasize that the pre-processing stages (1 to 3) are equally important: they improve the quality of the text data, which, in turn, improves the final classification result.&lt;/p&gt;

&lt;h2&gt;
  
  
  Receipt OCR via Google Vision
&lt;/h2&gt;

&lt;p&gt;This is the first stage where a receipt image is converted to a collection of text with the aid of Google Vision API.&lt;/p&gt;

&lt;p&gt;Whether an image is used for training the AI models or is an actual receipt to have its information extracted, it is always passed to Google’s Text Detection API to have its text recognized. It’s worth mentioning that, to enhance OCR accuracy, every image first goes through an image-warping step.&lt;/p&gt;

&lt;p&gt;The returned result is represented by five hierarchy levels, in descending order: Page, Block, Paragraph, Word, and Symbol.&lt;/p&gt;

&lt;p&gt;Each entity, no matter which level it belongs to, contains its text data and a bounding box (a collection of four vertices with x and y coordinates).&lt;/p&gt;

&lt;p&gt;We only used the two most basic ones, Word and Symbol. The former is an array of Symbols, while the latter represents a character or punctuation mark. You can find more detailed definitions of these hierarchies on Google’s official &lt;a href="https://cloud.google.com/vision/docs/fulltext-annotations"&gt;documentation&lt;/a&gt;.&lt;/p&gt;
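&lt;p&gt;A stripped-down sketch of how these two levels can be modeled; the field names here are ours, for illustration, not Google Vision’s exact API types:&lt;/p&gt;

```python
from dataclasses import dataclass, field

# Illustrative models only: each entity carries its text plus a
# four-vertex bounding box.
@dataclass
class Symbol:
    text: str                                # a character or punctuation mark
    box: list = field(default_factory=list)  # four (x, y) vertices

@dataclass
class Word:
    symbols: list                            # ordered Symbols making up the word

    @property
    def text(self):
        return "".join(s.text for s in self.symbols)

w = Word([Symbol("H"), Symbol("i")])
print(w.text)  # Hi
```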

&lt;h2&gt;
  
  
  Line Orientation Estimation
&lt;/h2&gt;

&lt;p&gt;By this point, we have the text from the receipt images stored in Word and Symbol entities.&lt;/p&gt;

&lt;p&gt;We will now group them into horizontal lines &lt;em&gt;relative&lt;/em&gt; to the receipt, sorted by each line’s vertical offset from the top of the receipt, and store them as an array. Here’s the rationale behind it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Information in receipts is almost always horizontally printed. Text items on the same horizontal line are much more likely to be related.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It removes Words that aren’t horizontal enough. The output from OCR can sometimes contain some vertical items, which aren’t our target data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Different combinations of Words result in different meanings. Putting them together allows us to iterate through all possible ones.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Spacing between Words or Symbols is important. Once they are grouped within the same data instance, calculating the space length between them becomes easier.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Adjacent lines are also more likely to be related. To access them, we can simply move indices up and down as they are sorted instead of comparing the distance between a set of Words with another.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The images we receive can be captured with tilted angles, like the below one.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--j9bD7r2K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.substack.com/image/fetch/w_1456%2Cc_limit%2Cf_auto%2Cq_auto:good%2Cfl_progressive:steep/https%253A%252F%252Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%252Fpublic%252Fimages%252Fe88b6009-01b8-47b3-af33-00828194da6d_624x422.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--j9bD7r2K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.substack.com/image/fetch/w_1456%2Cc_limit%2Cf_auto%2Cq_auto:good%2Cfl_progressive:steep/https%253A%252F%252Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%252Fpublic%252Fimages%252Fe88b6009-01b8-47b3-af33-00828194da6d_624x422.png" alt="Pre-processing before Data Extraction - Line Estimation" width="624" height="422"&gt;&lt;/a&gt;Figure 1. Receipt Data Extraction from Relatively Horizontal Lines&lt;/p&gt;

&lt;p&gt;Let’s take the green lines shown in Figure 1 as an example. Apart from the lines being relatively horizontal, the date and time on each receipt are on the same line. Of course, this isn’t the case for every receipt.&lt;/p&gt;

&lt;p&gt;As a disclaimer, the example above is just a random image. In real life, receipts can be nowhere near as good and legible as we’d like them to be. For example, the receipt on the right is partially covered. While we can accommodate tilted angles, we cannot see through covered information.&lt;/p&gt;

&lt;h2&gt;
  
  
  Grouping Words into Horizontal Lines, with RANSAC
&lt;/h2&gt;

&lt;p&gt;Each instance of Word comes with a set of four vertices, from which a vector carrying the Word’s direction can be calculated as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KqNH5pAe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.substack.com/image/fetch/w_1456%2Cc_limit%2Cf_auto%2Cq_auto:good%2Cfl_progressive:steep/https%253A%252F%252Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%252Fpublic%252Fimages%252Fe532d64b-5af2-451a-947c-414f94067b67_624x299.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KqNH5pAe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.substack.com/image/fetch/w_1456%2Cc_limit%2Cf_auto%2Cq_auto:good%2Cfl_progressive:steep/https%253A%252F%252Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%252Fpublic%252Fimages%252Fe532d64b-5af2-451a-947c-414f94067b67_624x299.png" alt="Bounding Box Direction Calculation" width="624" height="299"&gt;&lt;/a&gt;Figure 2. Vector Direction of a Bounding Box&lt;/p&gt;

&lt;p&gt;All the Words’ vectors are computed and stored as a matrix. Now we need to determine whether they are horizontally on the same line. Calculating the distance between each Word’s vector and the average vector of all Words seems like a good approach. If the distance lies within a threshold, the Word is horizontal enough; otherwise, it is thrown away. Once all the Words are checked, the valid ones can be grouped into lines sorted by their vertical offsets (i.e., y coordinates).&lt;/p&gt;

&lt;p&gt;Although this method would filter out Words that are not horizontal enough, those Words may have already contaminated the calculation of the average vector. The filtering may end up pointless, as the result wouldn’t be accurate.&lt;/p&gt;

&lt;p&gt;Fortunately, there is a saying: when we see outliers, we RANSAC them! RANdom SAmple Consensus (RANSAC) is an algorithm for robustly fitting a model in the presence of outliers, which, when implemented, will take them out (i.e., Words that don’t fit). To run RANSAC, we take the vector of each Word as one data item.&lt;/p&gt;

&lt;p&gt;Let’s say there’s a 70% chance of getting an inlier (a value within a pattern) when picking one Word at random. We want to be 99.999% sure that at least one sampled Word is an inlier, according to this formula:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--h6Gt3S_8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.substack.com/image/fetch/w_1456%2Cc_limit%2Cf_auto%2Cq_auto:good%2Cfl_progressive:steep/https%253A%252F%252Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%252Fpublic%252Fimages%252Ff9827b2f-1c67-47d8-9d41-bcb753a73a10_624x321.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--h6Gt3S_8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.substack.com/image/fetch/w_1456%2Cc_limit%2Cf_auto%2Cq_auto:good%2Cfl_progressive:steep/https%253A%252F%252Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%252Fpublic%252Fimages%252Ff9827b2f-1c67-47d8-9d41-bcb753a73a10_624x321.png" alt="Get Inliers with RANSAC" width="624" height="321"&gt;&lt;/a&gt;Figure 3. Formula for Picking Inliers&lt;/p&gt;

&lt;p&gt;In the Figure 3 formula:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;C&lt;/strong&gt; is the required confidence = 99.999%&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;r&lt;/strong&gt; is inlier chance = 70%&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;k&lt;/strong&gt; is the number of samples needed to fit the model, which here is one vector per iteration&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;n&lt;/strong&gt; is the number of iterations needed to attain the required confidence&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To visualize the formula better, put the numbers in and do the math. You will see that the number of iterations (&lt;strong&gt;n&lt;/strong&gt;) needed to reach the required confidence (&lt;strong&gt;C&lt;/strong&gt;) in getting an inlier is &amp;gt;= 10.&lt;/p&gt;
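&lt;p&gt;The arithmetic can be sketched in a few lines of Python. This is the standard RANSAC iteration-count formula, n &amp;gt;= log(1 - C) / log(1 - r^k); the function name is ours, not from any library:&lt;/p&gt;

```python
import math

def ransac_iterations(confidence, inlier_ratio, sample_size):
    """Number of RANSAC iterations needed so that, with the given
    confidence, at least one sampled set contains only inliers:
    n = ceil(log(1 - C) / log(1 - r**k))."""
    return math.ceil(
        math.log(1 - confidence) / math.log(1 - inlier_ratio ** sample_size)
    )

# C = 99.999%, r = 70%, k = 1  ->  10 iterations, matching the article
print(ransac_iterations(0.99999, 0.7, 1))  # 10
```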

&lt;p&gt;In fact, a 70% inlier chance is pessimistic, as the majority of Words on a receipt are horizontally printed. Setting this lower than the actual value ensures the outliers are eliminated. Plus, since we are picking one Word each time to check if it’s an inlier, k = 1.&lt;/p&gt;

&lt;p&gt;Based on the &lt;strong&gt;n&lt;/strong&gt; value computed above, we ran 10 RANSAC iterations through the unprocessed Words, which gives a 99.999% chance that the consensus set we keep consists of inliers. The average vector can then be calculated from it.&lt;/p&gt;

&lt;p&gt;Now we have an accurate average vector. With a threshold, we can compare each Word’s vector against it to decide whether the Word is an inlier. All the inliers are then grouped into horizontal lines by their y-axis values.&lt;/p&gt;
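&lt;p&gt;A minimal sketch of this whole filtering loop, with a hypothetical cosine-similarity threshold (the production thresholds aren’t given here):&lt;/p&gt;

```python
import math
import random

def cos_sim(a, b):
    """Cosine similarity between two 2-D vectors."""
    return (a[0] * b[0] + a[1] * b[1]) / (math.hypot(*a) * math.hypot(*b))

def filter_horizontal(vectors, iterations=10, threshold=0.95):
    """RANSAC-style filter: sample one Word vector per iteration (k = 1),
    keep the sample with the largest consensus set, then average that
    set and retain only vectors close to the average."""
    best = []
    for _ in range(iterations):
        candidate = random.choice(vectors)
        consensus = [v for v in vectors if cos_sim(v, candidate) >= threshold]
        if len(consensus) > len(best):
            best = consensus
    avg = (sum(v[0] for v in best) / len(best),
           sum(v[1] for v in best) / len(best))
    return [v for v in vectors if cos_sim(v, avg) >= threshold]

random.seed(0)
# Eight nearly horizontal Word vectors plus two vertical outliers:
words = [(1.0, 0.02 * i) for i in range(8)] + [(0.0, 1.0), (0.1, 1.0)]
print(len(filter_horizontal(words)))  # 8 of the 10 vectors survive
```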

&lt;h2&gt;
  
  
  Shortlisting Candidates with Amount, Time and Date regex
&lt;/h2&gt;

&lt;p&gt;Before we pass data to the AI classifiers, we need to extract Candidates from the horizontal lines, mainly with regular expressions (regex). In this case, any text pattern that looks like a price, date, or time is considered a candidate. Below is an example of a regex for finding amount/price candidates:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(?=((?:[^$0-9]|^)\$?(?:[1-9]\d{2}|[1-9]\d|[1-9])(?:,?\d{3})?(?:.\d{1,2})?))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
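&lt;p&gt;As a quick sanity check, the pattern can be exercised in Python. The lookahead makes each match zero-width, so overlapping candidates are all captured; this is illustrative only, not the production FormX code:&lt;/p&gt;

```python
import re

# Lookahead wrapper captures overlapping amount-like substrings.
AMOUNT_RE = re.compile(
    r"(?=((?:[^$0-9]|^)\$?(?:[1-9]\d{2}|[1-9]\d|[1-9])(?:,?\d{3})?(?:\.\d{1,2})?))"
)

def amount_candidates(line):
    """Return every amount-like candidate found in one horizontal line."""
    return [m.group(1).strip() for m in AMOUNT_RE.finditer(line)]

print(amount_candidates("TOTAL $45.67"))
```

&lt;p&gt;The classifier downstream, not the regex, is responsible for picking the right candidate out of the overlapping matches.&lt;/p&gt;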

&lt;p&gt;Let’s say there are two adjacent Words, 12/20 and 21/01/2020, in a horizontal line. The no-space candidate from concatenating the two is 12/2021/01/2020, which looks like a badly garbled date, and no one can tell which part is the year. If any part of this is the date we are seeking, we might end up missing it. The with-space version, 12/20 21/01/2020, ensures the AI receives the separated Words, which improves the chance of landing a match.&lt;/p&gt;
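&lt;p&gt;A hypothetical helper for generating such space-preserving candidates from adjacent Words on a line might look like this (the &lt;code&gt;max_span&lt;/code&gt; limit is an assumption of ours):&lt;/p&gt;

```python
def line_candidates(words, max_span=3):
    """Concatenate runs of adjacent Words on one line, preserving spaces,
    so a downstream regex/classifier sees '12/20 21/01/2020' rather than
    the ambiguous '12/2021/01/2020'."""
    out = []
    for i in range(len(words)):
        for j in range(i + 1, min(len(words), i + max_span) + 1):
            out.append(" ".join(words[i:j]))
    return out

print(line_candidates(["12/20", "21/01/2020"]))
# ['12/20', '12/20 21/01/2020', '21/01/2020']
```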

&lt;p&gt;At this stage, we realized regex can be a very handy tool for netting candidates. Consequently, a regex builder is available on FormX’s portal to help users come up with a correct regex for their target documents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Extraction with AI Binary Classifiers
&lt;/h2&gt;

&lt;p&gt;Three models have been trained for our respective needs: price, date, and time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Addressing the Flood of Useless Metadata
&lt;/h3&gt;

&lt;p&gt;Receipts often contain unwanted metadata like the grocery’s name and the quantities of items purchased. If we simply train the classification model on an unprocessed dataset, it will be trained on far more negative examples than positive ones and end up extremely biased towards negative results. To balance the dataset, we oversample the amount, date, and time examples until positive and negative results reach a 1:1 ratio.&lt;/p&gt;
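&lt;p&gt;A minimal sketch of this balancing step, using simple random oversampling with replacement (the sample data below is made up for illustration):&lt;/p&gt;

```python
import random

def oversample_to_balance(positives, negatives, seed=42):
    """Duplicate random examples of the smaller class until positives
    and negatives reach a 1:1 ratio."""
    rng = random.Random(seed)
    pos, neg = list(positives), list(negatives)
    small = min(pos, neg, key=len)
    big = max(neg, pos, key=len)
    extra = [rng.choice(small) for _ in range(len(big) - len(small))]
    small.extend(extra)
    return pos, neg

amounts = [("TOTAL $45.67", 1)] * 3           # positive examples
noise = [("QTY 2", 0), ("THANK YOU", 0)] * 9  # metadata, negative examples
pos, neg = oversample_to_balance(amounts, noise)
print(len(pos), len(neg))  # 18 18
```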

&lt;h3&gt;
  
  
  Bag-of-Words (BoW) Model
&lt;/h3&gt;

&lt;p&gt;A BoW model is employed to first classify texts. In a BoW model, a dictionary is built from words that appear in the receipt’s training dataset. If there are n unique words, the BoW model will be a vector with n dimensions.&lt;/p&gt;

&lt;p&gt;Normally, a BoW model records how many times each word occurs, but in our case we only record whether it appears. Every word in the classification data (i.e., the receipt image copy) is matched against the BoW dictionary. If the word can’t be found there, it is ignored.&lt;/p&gt;

&lt;p&gt;For price data, the surrounding text on the same line is computed against the BoW dictionary. If a candidate’s surrounding text doesn’t match the dictionary, the candidate is marked as false. For the other targets, the lines one above and one below (+/-1) are also taken into account, as date or time data can reside across them.&lt;/p&gt;
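&lt;p&gt;A minimal presence-only BoW along these lines might look like this (toy vocabulary; the production dictionary is built from the receipt training set):&lt;/p&gt;

```python
def build_vocab(training_lines):
    """Build a BoW dictionary mapping each unique word to a vector index."""
    vocab = {}
    for line in training_lines:
        for word in line.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def binary_bow(vocab, text):
    """Binary BoW: record presence (not counts); unknown words are ignored."""
    vec = [0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] = 1
    return vec

vocab = build_vocab(["total amount due", "sub total"])
print(binary_bow(vocab, "TOTAL total mystery"))  # [1, 0, 0, 0]
```

&lt;p&gt;Note that “mystery” is dropped silently and the repeated “total” still yields a 1, not a 2, matching the presence-only behaviour described above.&lt;/p&gt;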

&lt;h3&gt;
  
  
  Amount Classifier
&lt;/h3&gt;

&lt;p&gt;The model we used for this is logistic regression (a model that estimates the probability of a binary outcome, such as pass/fail or win/lose). These are the input parameters we used:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Position in Receipt.&lt;/strong&gt; The Words and Symbols come with a bounding box property. With that, we can compute their line index divided by the total number of lines. It’s less likely to have a price right at the top of a receipt, so candidates at lower positions have a better likelihood.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Has Symbols.&lt;/strong&gt; For each candidate, we check whether symbols indicating price-related data, such as “$”, “.”, and “,”, exist in the pattern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Range Checking.&lt;/strong&gt; The numeric values in candidates are checked against a set of ranges like &amp;lt;10, &amp;gt;= 10, and &amp;lt;100, or an extreme one, like &amp;gt;= 10000000. Biases will be given based on the matching ranges. This can be tweaked based on the receipt. For example, if we’ve now extracted the amount from a bunch of receipts from a luxury brand, the range should be on the upper side of the scale.&lt;/p&gt;
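&lt;p&gt;A sketch of how these inputs could be turned into a feature vector for the logistic regression model (the bucket boundaries and feature order here are illustrative, not the production values):&lt;/p&gt;

```python
def amount_features(candidate, line_index, total_lines):
    """Feature vector for the amount classifier: relative position,
    price symbols, and coarse value-range buckets."""
    value = float(candidate.replace("$", "").replace(",", "") or 0)
    return [
        line_index / total_lines,  # position in receipt
        int("$" in candidate),     # has currency symbol
        int("." in candidate),     # has decimal point
        int("," in candidate),     # has thousands separator
        int(value >= 10),          # range buckets
        int(value >= 100),
        int(value >= 10_000_000),  # extreme-value flag
    ]

print(amount_features("$45.67", 28, 32))  # [0.875, 1, 1, 0, 1, 0, 0]
```

&lt;p&gt;For a luxury-brand dataset, as noted above, the bucket boundaries would simply be shifted upwards.&lt;/p&gt;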

&lt;h3&gt;
  
  
  Date Classifier
&lt;/h3&gt;

&lt;p&gt;The model we used for this is random forest (an ensemble of randomized decision trees) with the number of estimators at 300. These are the input parameters we used:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Position in receipt.&lt;/strong&gt; This is calculated similarly to the Amount Classifier. The date usually shows up near the top or bottom, so candidates in a more central position have their likelihood score reduced.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Has Symbols.&lt;/strong&gt; We check for symbols that imply date-related data, such as a slash (/) or period (.). Having fewer than two occurrences of these improves the candidate’s probability of being a date. Having a full year is also an advantage: a candidate containing “2019”, for example, is more likely to be a date than one containing only “19”. An English month name is also a good indicator, and a fully spelled-out month, like “September”, is a plus.&lt;/p&gt;

&lt;p&gt;Time and date are often printed on the same line or adjacent to each other, which we also take into consideration. Candidates with inconsistent delimiters get penalized; for example, 11/04-2019 loses marks compared with 11/04/2019. Some of the other factors we look at are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;1/(current year – extracted year + 1)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the time candidate is on the same line or +/- one line&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If different separators are used&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
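&lt;p&gt;A sketch of a few of these date features as code (the feature set and defaults are illustrative; the actual weights are learned by the random forest):&lt;/p&gt;

```python
import re

def date_features(candidate, current_year=2020, time_nearby=False):
    """Selected date-classifier features: full-year bonus, spelled-out
    month, separator consistency, recency, and a nearby time candidate."""
    years = re.findall(r"(?:19|20)\d{2}", candidate)
    year = int(years[0]) if years else current_year
    separators = set(re.findall(r"[/.\-]", candidate))
    months = ("january", "february", "march", "april", "may", "june",
              "july", "august", "september", "october", "november", "december")
    return [
        int(bool(years)),                                  # has a full year
        int(any(m in candidate.lower() for m in months)),  # spelled-out month
        int(len(separators) == 1),                         # consistent delimiters
        1 / (current_year - year + 1),                     # recency factor
        int(time_nearby),                                  # time on +/- 1 line
    ]

print(date_features("21/01/2019"))  # [1, 0, 1, 0.5, 0]
print(date_features("11/04-2019"))  # [1, 0, 0, 0.5, 0]
```

&lt;p&gt;The mixed-delimiter candidate scores lower on the consistency feature, exactly as described above.&lt;/p&gt;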

&lt;h3&gt;
  
  
  Time Classifier
&lt;/h3&gt;

&lt;p&gt;The model we used for this is random forest with the number of estimators at 300. These are the input parameters we used:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Position in receipt.&lt;/strong&gt; This is calculated similarly to the Date Classifier. Like the date, the time usually shows up near the top or bottom, so candidates in a more central position have their likelihood score reduced.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Has Symbols.&lt;/strong&gt; Candidates with fewer than two occurrences of “:” or whitespace are more likely to be times. The ones with “am” or “pm” are also prime candidates. Similar to how dates are classified, candidates with Words that imply time-related data get extra marks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--l9w7_vRc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.substack.com/image/fetch/w_1456%2Cc_limit%2Cf_auto%2Cq_auto:good%2Cfl_progressive:steep/https%253A%252F%252Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%252Fpublic%252Fimages%252F47522f31-271f-487b-a216-68939def554d_1160x774.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--l9w7_vRc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.substack.com/image/fetch/w_1456%2Cc_limit%2Cf_auto%2Cq_auto:good%2Cfl_progressive:steep/https%253A%252F%252Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%252Fpublic%252Fimages%252F47522f31-271f-487b-a216-68939def554d_1160x774.jpeg" alt="" width="880" height="587"&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@alx_andru?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Alex&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Future Plans for FormX
&lt;/h2&gt;

&lt;p&gt;Much like everything we do at Oursky, we are always looking for ways to improve our solutions and processes, and some improvements are already in the works. We are looking to train the models with different inputs and parameters to improve accuracy. We also plan to expand our dataset, and are currently collecting standard forms from around the world, like insurance forms in the U.S.&lt;/p&gt;

&lt;p&gt;There are tons of very helpful AI research projects all over the world, and the amount of investment in them is awe-inspiring. We will definitely keep an eye on them and integrate them if they prove to be innovative and outperform our current models.&lt;/p&gt;

&lt;p&gt;We’ll continue improving FormX so &lt;a href="https://us2.list-manage.com/subscribe?u=34db69ee3e01fe49e12302054&amp;amp;id=493e6df6f9"&gt;stay tuned&lt;/a&gt; for more of our explorations into the wonderful world of AI!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Addendum&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;This article has been updated on May 23, 5:25 p.m. HKT with key updates on the introduction, line orientation estimation, AI – binary classifier, and future plans for FormX. The updates are in line with our presentation of this topic in the &lt;a href="https://www.gdghk.org/2020/05/16/gdg-hong-kongs-ml-series-2020/"&gt;Google Developer Group Hong Kong ML Series 2020&lt;/a&gt;, an online event and series of learning sessions on machine learning. The webinar was presented as “&lt;strong&gt;How to extract 𝓧 from receipts?&lt;/strong&gt;”, which was held on May 23, 2020.&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;This article has been updated on October 15, 2020, 3:36 HKT with our official FormX brand/name.&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>Kubernetes Security - Network Encryption between k8s Deployments and Ingress</title>
      <dc:creator>Elliot Wong</dc:creator>
      <pubDate>Wed, 03 Feb 2021 14:11:53 +0000</pubDate>
      <link>https://dev.to/oursky/kubernetes-security-network-encryption-between-k8s-deployments-and-ingress-ld7</link>
      <guid>https://dev.to/oursky/kubernetes-security-network-encryption-between-k8s-deployments-and-ingress-ld7</guid>
      <description>&lt;p&gt;By Calvin, who refuses to create a dev.to account.&lt;/p&gt;

&lt;p&gt;TL;DR: With a simple example, we demonstrate how to secure connections between your Kubernetes (k8s) deployments and ingress by enabling TLS and HTTPS. This can be a critical part of your DevSecOps workflow or a business requirement your development team must fulfill.&lt;/p&gt;

&lt;h1&gt;
  
  
  Kubernetes Tutorial on Securing Connections
&lt;/h1&gt;

&lt;p&gt;This is a quick how-to guide on hardening a k8s application by enforcing secure communication between an Ingress controller and other k8s services. This is especially important if your business requirements, as in financial services or enterprise environments, compel you to enforce strict security measures such as encrypting all traffic in transit.&lt;/p&gt;

&lt;p&gt;Some caveats: Managing a Kubernetes cluster itself is complex enough, and securing it can be convoluted. This will add another layer of complexity, so consider what your actual requirements are and conduct risk assessments; not all projects require this level of security. Below is a simple visualization of traffic between the Ingress and backend services.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
        ╔═════════════════════════╗       ╔════════════════════╗
 https  ║ ingress                 ║ https ║ backend            ║
 ───&amp;gt;───╫─────────────────────────╫───&amp;gt;───╫────────────────────╢
        ║ demo.some-cluster.com   ║       ║ demo-app           ║
        ╚═════════════════════════╝       ╚════════════════════╝
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  What’s not covered in this Kubernetes Security how-to
&lt;/h1&gt;

&lt;p&gt;This guide only walks you through strengthening the connection between an Ingress and a k8s service. Say you have a collection of microservices; you may want to secure the connections between every one of them as well. Below are a few suggestions; weigh them accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Route all traffic with Ingress
&lt;/h2&gt;

&lt;p&gt;Calls from one backend app to another must be routed through the Ingress. The connection is secured because we have already implemented TLS between the Ingress and the service(s) pointing to the target backend app(s).&lt;/p&gt;

&lt;p&gt;This approach does have one downside, though: the communication points of all backend apps are exposed. IP whitelisting and internal headers are some measures to protect these exposed endpoints.&lt;/p&gt;

&lt;h2&gt;
  
  
  Encrypted connection for each app by implementing TLS
&lt;/h2&gt;

&lt;p&gt;With this one, you will have to implement TLS and manage the corresponding certificate for each backend app. Generating and managing the certificates can involve a lot of chores, though this approach completely avoids the exposed-endpoint issue.&lt;/p&gt;

&lt;h2&gt;
  
  
  Integrate service mesh
&lt;/h2&gt;

&lt;p&gt;You can install a service mesh like Linkerd or Istio. What’s a service mesh? Basically, it takes your YAML files and does some rewriting based on your instructions (e.g., some Istio commands). With these amended config files, your k8s cluster will be deployed with extra proxy services that intercept all communication between microservices and apply security measures to it.&lt;/p&gt;

&lt;h1&gt;
  
  
  Prerequisites and assumptions
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;You are familiar with the concepts of containers and Docker.&lt;/li&gt;
&lt;li&gt;You have a basic understanding of Kubernetes and how it achieves container orchestration.&lt;/li&gt;
&lt;li&gt;You know the fundamentals, like “HTTPS vs. HTTP” and “TLS vs. SSL”, and know how to generate a self-signed certificate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1550751827-4bd374c3f58b%3Fixid%3DMXwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHw%253D%26ixlib%3Drb-1.2.1%26auto%3Dformat%26fit%3Dcrop%26w%3D1000%26q%3D80" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1550751827-4bd374c3f58b%3Fixid%3DMXwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHw%253D%26ixlib%3Drb-1.2.1%26auto%3Dformat%26fit%3Dcrop%26w%3D1000%26q%3D80" alt="Kubernetes Security via Application-level Network Encrpytion"&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@adigold1" rel="noopener noreferrer"&gt;Adi Goldstein&lt;/a&gt; on &lt;a href="https://unsplash.com" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Components in our example
&lt;/h1&gt;

&lt;p&gt;The example is made of these Kubernetes components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;An Ingress, where SSL termination for the public-facing domain (such as &lt;code&gt;secure-demo.some-cluster.com&lt;/code&gt;) is set.&lt;/li&gt;
&lt;li&gt;A k8s Service, routing to our backend.&lt;/li&gt;
&lt;li&gt;A k8s Deployment, a.k.a. our backend: an nginx web server serving HTTPS.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Sample Kubernetes Configuration Files
&lt;/h2&gt;

&lt;p&gt;Here’s a configuration file named &lt;code&gt;backend.yaml&lt;/code&gt; covering our entire backend (an nginx server, a config map, and a service). By providing the certs, we are done with the TLS security settings.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo-app&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx-conf&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;site.conf&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;server {&lt;/span&gt;
      &lt;span class="s"&gt;listen 443 ssl;&lt;/span&gt;
      &lt;span class="s"&gt;server_name demo-app;&lt;/span&gt;
      &lt;span class="s"&gt;ssl_certificate /run/secrets/nginx-cert/tls.crt;&lt;/span&gt;
      &lt;span class="s"&gt;ssl_certificate_key /run/secrets/nginx-cert/tls.key;&lt;/span&gt;
      &lt;span class="s"&gt;location / {&lt;/span&gt;
        &lt;span class="s"&gt;root   /usr/share/nginx/html;&lt;/span&gt;
        &lt;span class="s"&gt;index  index.html index.htm;&lt;/span&gt;
        &lt;span class="s"&gt;try_files $uri $uri/ /index.html;&lt;/span&gt;
      &lt;span class="s"&gt;}&lt;/span&gt;
    &lt;span class="s"&gt;}&lt;/span&gt;
&lt;span class="s"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo-app&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo-app&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;443&lt;/span&gt;
    &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
    &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;443&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo-app&lt;/span&gt;
  &lt;span class="na"&gt;sessionAffinity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;None&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterIP&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo-app&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo-app&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo-app&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo-app&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Always&lt;/span&gt;
      &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx-conf&lt;/span&gt;
        &lt;span class="na"&gt;configMap&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx-conf&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo-app-tls&lt;/span&gt;
        &lt;span class="na"&gt;secret&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;secretName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo-app-tls&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo-app&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx:1.19.2-alpine&lt;/span&gt;
        &lt;span class="na"&gt;imagePullPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;IfNotPresent&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8m"&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16Mi"&lt;/span&gt;
          &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16m"&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;64Mi"&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;443&lt;/span&gt;
        &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx-conf&lt;/span&gt;
          &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/etc/nginx/conf.d"&lt;/span&gt;
          &lt;span class="na"&gt;readOnly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo-app-tls&lt;/span&gt;
          &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/run/secrets/nginx-cert"&lt;/span&gt;
          &lt;span class="na"&gt;readOnly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now comes the TLS network encryption part, where an Ingress config &lt;code&gt;ingress.yaml&lt;/code&gt; is applied:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo-app&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;ingress.kubernetes.io/proxy-body-size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;4m&lt;/span&gt;
    &lt;span class="na"&gt;kubernetes.io/tls-acme&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
    &lt;span class="na"&gt;kubernetes.io/ingress.class&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nginx"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/backend-protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HTTPS"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/proxy-ssl-secret&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NAMESPACE/demo-app-tls"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/proxy-ssl-verify&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;YOUR-NAME.EXAMPLE-CLUSTER.com&lt;/span&gt;
      &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/&lt;/span&gt;
            &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;serviceName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo-app&lt;/span&gt;
              &lt;span class="na"&gt;servicePort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;443&lt;/span&gt;
  &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;YOUR-NAME.EXAMPLE-CLUSTER.com&lt;/span&gt;
      &lt;span class="na"&gt;secretName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;YOUR-NAME.EXAMPLE-CLUSTER.com&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important Note&lt;/strong&gt;: If your deployment is not in the same namespace as the Ingress controller (which is the usual case), you need to specify the namespace for &lt;code&gt;proxy-ssl-secret&lt;/code&gt;, i.e. &lt;code&gt;NAMESPACE/demo-app-tls&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Network encryption with multiple back ends
&lt;/h2&gt;

&lt;p&gt;Microservices mean having many backend apps, but there is &lt;strong&gt;only one&lt;/strong&gt; &lt;code&gt;proxy-ssl-secret&lt;/code&gt; configuration per Ingress. To serve multiple apps from the same Ingress, you can configure the Ingress to verify all services under the same name, as shown in the example below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;nginx.ingress.kubernetes.io/proxy-ssl-name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo-app&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood, the proxy name is overridden to &lt;code&gt;demo-app&lt;/code&gt; for all services, so that they are all served with the same certificate. This slightly weakens security, so again, weigh the different options and decide what level of security you want to achieve. For a higher level of communication security, you may prefer to create several Ingresses instead. Don’t hesitate to let me know if you have other ideas; it’s always nice interacting with my fellow developers!&lt;/p&gt;

&lt;p&gt;You can also use a wildcard like &lt;code&gt;*.svc.cluster.local&lt;/code&gt; to match services, but this effectively trusts all services, which is not very elegant.&lt;/p&gt;

&lt;h2&gt;
  
  
  Create the demo server
&lt;/h2&gt;

&lt;p&gt;Below is a snippet for creating a self-signed certificate. Note that this is just a simple example; do not copy it as-is into production:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# root CA&lt;/span&gt;
openssl genrsa &lt;span class="nt"&gt;-out&lt;/span&gt; rootCA.key 4096
openssl req &lt;span class="nt"&gt;-x509&lt;/span&gt; &lt;span class="nt"&gt;-nodes&lt;/span&gt; &lt;span class="nt"&gt;-new&lt;/span&gt; &lt;span class="nt"&gt;-key&lt;/span&gt; rootCA.key &lt;span class="nt"&gt;-sha256&lt;/span&gt; &lt;span class="nt"&gt;-days&lt;/span&gt; 1024 &lt;span class="nt"&gt;-out&lt;/span&gt; rootCA.crt
&lt;span class="c"&gt;# generate cert for demo-app&lt;/span&gt;
openssl genrsa &lt;span class="nt"&gt;-out&lt;/span&gt; demo-app.key 4096
openssl req &lt;span class="nt"&gt;-new&lt;/span&gt; &lt;span class="nt"&gt;-sha256&lt;/span&gt; &lt;span class="nt"&gt;-key&lt;/span&gt; demo-app.key &lt;span class="nt"&gt;-out&lt;/span&gt; demo-app.csr &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-subj&lt;/span&gt; &lt;span class="s2"&gt;"/C=HK/ST=HK/L=HongKong/O=Example/OU=Org/CN=demo-app"&lt;/span&gt;
openssl x509 &lt;span class="nt"&gt;-req&lt;/span&gt; &lt;span class="nt"&gt;-in&lt;/span&gt; demo-app.csr &lt;span class="nt"&gt;-CA&lt;/span&gt; rootCA.crt &lt;span class="nt"&gt;-CAkey&lt;/span&gt; rootCA.key &lt;span class="nt"&gt;-CAcreateserial&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-out&lt;/span&gt; demo-app.crt &lt;span class="nt"&gt;-days&lt;/span&gt; 1024 &lt;span class="nt"&gt;-sha256&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then you can submit your secrets to k8s:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; NAMESPACE create secret generic demo-app-tls &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;tls.crt&lt;span class="o"&gt;=&lt;/span&gt;demo-app.crt &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;tls.key&lt;span class="o"&gt;=&lt;/span&gt;demo-app.key &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ca.crt&lt;span class="o"&gt;=&lt;/span&gt;rootCA.crt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here the actual deployment and ingress are applied:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; NAMESPACE apply &lt;span class="nt"&gt;-f&lt;/span&gt; backend.yaml
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; NAMESPACE apply &lt;span class="nt"&gt;-f&lt;/span&gt; ingress.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the deployment takes effect, the server will be ready at &lt;code&gt;https://YOUR-NAME.EXAMPLE-CLUSTER.com&lt;/code&gt;!&lt;/p&gt;

&lt;h2&gt;
  
  
  Clean up the namespace
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; NAMESPACE delete &lt;span class="nt"&gt;-f&lt;/span&gt; ingress.yaml
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; NAMESPACE delete &lt;span class="nt"&gt;-f&lt;/span&gt; backend.yaml
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; NAMESPACE delete secret demo-app-tls YOUR-NAME.EXAMPLE-CLUSTER.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To learn more about working with the Ingress controller, check out these references on Kubernetes’ user guide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/annotations/#backend-protocol" rel="noopener noreferrer"&gt;https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/annotations/#backend-protocol&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/annotations/#backend-certificate-authentication" rel="noopener noreferrer"&gt;https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/annotations/#backend-certificate-authentication&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>security</category>
      <category>devops</category>
    </item>
    <item>
      <title>How to Build an AI Text Generator: Text Generation with a GPT-2 Model</title>
      <dc:creator>Elliot Wong</dc:creator>
      <pubDate>Wed, 03 Feb 2021 06:34:06 +0000</pubDate>
      <link>https://dev.to/oursky/how-to-build-an-ai-text-generator-text-generation-with-a-gpt-2-model-4346</link>
      <guid>https://dev.to/oursky/how-to-build-an-ai-text-generator-text-generation-with-a-gpt-2-model-4346</guid>
      <description>&lt;p&gt;&lt;em&gt;We wrote this after the &lt;a href="https://oursky.com" rel="noopener noreferrer"&gt;Oursky&lt;/a&gt; &lt;a href="https://skylab.ai" rel="noopener noreferrer"&gt;Skylab.ai&lt;/a&gt; team completed an AI content generator for a startup client, and we’d like to share our experience and journey._ From a corpus of stories with an aligned writing style, provided by our client, we trained a text generation model that outputs similar text pieces.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this technical report, we will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Go through what a language model is.&lt;/li&gt;
&lt;li&gt; Discuss how to use language modeling to generate articles.&lt;/li&gt;
&lt;li&gt; Explain what Generative Pre-Trained Transformer 2 (GPT-2) is and how it can be used for language modeling.&lt;/li&gt;
&lt;li&gt; Visualize text predictions – print out our GPT-2 model’s internal states to show where input words affect the prediction of the next word the most.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Demo with a Generic GPT-2
&lt;/h2&gt;

&lt;p&gt;Let's start with a GIF showing the outputs of a standard GPT-2 model when it was fed with (1) a sentence randomly extracted from a Sherlock Holmes book, and (2) the definition of software engineering on Wikipedia.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcode.oursky.com%2Fwp-content%2Fuploads%2F2021%2F02%2F123_3.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcode.oursky.com%2Fwp-content%2Fuploads%2F2021%2F02%2F123_3.gif" alt="GPT-2 Demo"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  Basic knowledge of Natural Language Processing with Python&lt;/li&gt;
&lt;li&gt;  An understanding of &lt;a href="https://en.wikipedia.org/wiki/Probability_theory" rel="noopener noreferrer"&gt;Probability Theory&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before we start building a predictive text generator, let’s go through a few concepts first.&lt;/p&gt;

&lt;h2&gt;
  
  
  Language Model
&lt;/h2&gt;

&lt;p&gt;A language model is just a &lt;strong&gt;probability distribution of a sequence of words&lt;/strong&gt;. For example, given a language model of English, we can ask the probability of seeing a sentence, &lt;em&gt;“All roads lead to Rome”&lt;/em&gt; in English.  &lt;/p&gt;

&lt;p&gt;We could also estimate the probability of seeing grammatically wrong or nonsensical sentences – &lt;em&gt;“jump hamburger I”&lt;/em&gt; definitely has a much lower probability of being correct than &lt;em&gt;“I eat hamburger”&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Let’s pull in some mathematical notations to describe a language model better.&lt;/p&gt;

&lt;p&gt;P(w1, w2, …, wn) means the probability of having the sentence &lt;em&gt;“w1 w2 … wn”&lt;/em&gt;. Here, the language model is a probability distribution rather than a single probability. Having a probability distribution means we can tell the value of P(All, roads, lead, to, Rome) or P(I, eat, hamburger) for any choice of words w1, …, wn and any n in P(w1, w2, …, wn).  &lt;/p&gt;

&lt;p&gt;A bit more on the notation first. Whenever you see P(hello, world), where the items inside P() are actual words, P() describes a probability, since w1, …, wn and n are known (the former = &lt;em&gt;“hello”&lt;/em&gt;, &lt;em&gt;“world”&lt;/em&gt; while the latter = 2). If the items inside P() are unknown, P() indicates a probability distribution. From here on out, we’ll use &lt;em&gt;“probability”&lt;/em&gt; and &lt;em&gt;“probability distribution”&lt;/em&gt; interchangeably unless specified.  &lt;/p&gt;

&lt;p&gt;Sometimes, it’s more convenient to express P(w1, w2, …, wn) as P(w, context). What happens here is that we lump w1 to wn-1 (i.e., all words of the sentence except the last one) into a single bundle that we call “context”. We can then calculate the chance of being in this “context” (seeing the previous n-1 words) and ending up with the word &lt;em&gt;“w”&lt;/em&gt; at the end.&lt;/p&gt;

&lt;p&gt;Here, P(w1, w2, …, wn) and P(w, context) are describing the same thing.&lt;/p&gt;

&lt;p&gt;Using the &lt;a href="https://en.wikipedia.org/wiki/Chain_rule_(probability)" rel="noopener noreferrer"&gt;chain rule&lt;/a&gt;, we could write P(w, context) as P(w | context) P(context). We’d like to do this because P(w | context) is, in fact, the target we want most of the time. P(w | context) here is a conditional probability distribution. It tells the chance of seeing a word w given that the context (i.e. previous words) is known.&lt;/p&gt;

&lt;p&gt;Now let’s put in some words to P(w | context), say, P(apple | context) or P(orange | context). Assuming we have the previous words, we can start predicting how likely it is to have &lt;em&gt;“apple”&lt;/em&gt; or &lt;em&gt;“orange”&lt;/em&gt; as the next word of this sentence. By obtaining the &lt;em&gt;“mostly likely next word”&lt;/em&gt;, we can start creating some article generation AI models.  &lt;/p&gt;

&lt;p&gt;Right, so now we need a language model. How do we get one? &lt;a href="https://towardsdatascience.com/learning-nlp-language-models-with-real-data-cdff04c51c25" rel="noopener noreferrer"&gt;Another article&lt;/a&gt; answers this question.&lt;/p&gt;

&lt;p&gt;One approach is to count the number of wn that comes after w1 to wn-1 on a large text corpus, which will build a n-gram language model. Another is to directly learn the language model using a neural network by feeding lots of text.&lt;/p&gt;
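&lt;p&gt;As a tiny illustration of the counting approach, here is a toy bigram model in Python. The corpus, helper names, and numbers below are all made up for illustration – a real model would need vastly more text:&lt;/p&gt;

```python
# Toy n-gram (bigram) language model: estimate P(w | context) by
# counting which word follows each word in a tiny made-up corpus.
from collections import Counter, defaultdict

corpus = [
    "i eat hamburger", "i eat cake", "i read books",
    "i eat hamburger", "all roads lead to rome",
]

# Count bigrams: how often each word follows the previous one.
follows = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1

def p_next(word, context_word):
    """Estimate P(w | context) from bigram counts."""
    counts = follows[context_word]
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

print(p_next("hamburger", "eat"))  # "hamburger" follows "eat" 2 times out of 3
```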

&lt;p&gt;In our case, we used the latter approach by using the GPT-2 model to learn the language model.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Text Generation with a Language Model&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;As mentioned, P(w | context) is the basis for a neural network text generator.  &lt;/p&gt;

&lt;p&gt;P(w | context) tells the probability distribution of all English words given all seen words (as context). For example, for P(w | &lt;em&gt;“I eat”&lt;/em&gt;), we would expect a higher probability when w is a noun rather than a verb. The likelihood of w being a food is much higher than that of other nouns like &lt;em&gt;“book”&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;To generate the next word with all seen words, we could keep adding one word at a time with P(w | context) until we have enough for a sentence or have reached some ending word/character like a full stop.  &lt;/p&gt;

&lt;p&gt;There are various approaches on how to pick the next word, which we discuss below.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Greedy Approach&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;One approach is to pick the word with the highest probability. A quick example would be:&lt;/p&gt;

&lt;p&gt;P(w | &lt;em&gt;“I eat”&lt;/em&gt;), where &lt;em&gt;“hamburger”&lt;/em&gt; has the highest probability of being w among all words from a dictionary. We call this the greedy approach for sentence generation.  &lt;/p&gt;

&lt;p&gt;This approach is quick and very simple. The main drawback is that for the same set of previous words, we will always generate the same sentence. In other words, it lacks creativity.  &lt;/p&gt;

&lt;p&gt;Plus, when we always pick the highest probability, it’s very easy to fall into degenerate repetition – getting the same chunk of text over and over during sentence generation. For example:&lt;/p&gt;

&lt;pre&gt;I eat hamburger for breakfast. I eat hamburger for breakfast. I eat hamburger for breakfast ...&lt;/pre&gt;

&lt;p&gt;Not so human-like, right? We need something more random to create a language generator that yields human readable sentences.&lt;/p&gt;

&lt;h3&gt;
  
  
  Beam Approach
&lt;/h3&gt;

&lt;p&gt;Another approach is to generate lots of sentences first, then pick the most likely sentence.  &lt;/p&gt;

&lt;p&gt;Let’s assume that there are 20,000 words in the dictionary, and we want to generate a sentence with 5 words starting with word &lt;em&gt;“I”&lt;/em&gt;. The number of all possible sentences that we could generate will be 20000&lt;sup&gt;4&lt;/sup&gt;, or one hundred and sixty quadrillion. Clearly, that’s too many! We cannot calculate all those sentences’ probability within a reasonable time, even with a powerful computer.  &lt;/p&gt;

&lt;p&gt;Instead of constructing all possible sentences, we could instead just track the top-N partial sentences. At the end, we only need to check the probability of N sentences. By doing so, we hope to search the top-N likeliest sentence without having to try all combinations. This kind of searching is called beam search, and N is the beam width.  &lt;/p&gt;

&lt;p&gt;The decision tree figure below illustrates a case of generating a sentence with three words, starting with &lt;em&gt;“I”&lt;/em&gt; with N = 2. This means we only track top-2 partial sentences.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcode.oursky.com%2Fwp-content%2Fuploads%2F2020%2F05%2FArticle-Generation-and-GPT-2-Model-1-1160x284.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcode.oursky.com%2Fwp-content%2Fuploads%2F2020%2F05%2FArticle-Generation-and-GPT-2-Model-1-1160x284.png" alt="Decision Tree on Text Predictions"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, we first check P(w | &lt;em&gt;“I”&lt;/em&gt;). Among all the possible words, the language model tells &lt;em&gt;“eat”&lt;/em&gt; and &lt;em&gt;“read”&lt;/em&gt; are the most probable next words. Hence, in the next step, we’ll only consider the trees of P(w | &lt;em&gt;“I eat”&lt;/em&gt;) and P(w | &lt;em&gt;“I read”&lt;/em&gt;) and ignore other possibilities like sentences that start with  &lt;em&gt;“I drink”&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Afterwards, we repeat the same procedure and find the two most probable words after &lt;em&gt;“I eat”&lt;/em&gt; or &lt;em&gt;“I read”&lt;/em&gt;. Among all that start with &lt;em&gt;“I eat”&lt;/em&gt; and &lt;em&gt;“I read”&lt;/em&gt;, P(&lt;em&gt;“hamburger”&lt;/em&gt; | &lt;em&gt;“I eat”&lt;/em&gt;) and P(&lt;em&gt;“cake”&lt;/em&gt; | &lt;em&gt;“I eat”&lt;/em&gt;) have the highest two probabilities. We’ll thus only expand the search with sentence prefixes &lt;em&gt;“I eat hamburger”&lt;/em&gt; and &lt;em&gt;“I eat cake”&lt;/em&gt; while the &lt;em&gt;“I read”&lt;/em&gt; branch dies.&lt;/p&gt;

&lt;p&gt;We will keep repeating the &lt;em&gt;“expand and pick best-N”&lt;/em&gt; procedure until we have a sentence with desired length. This’ll finally return a sentence with the highest probability.&lt;/p&gt;

&lt;p&gt;You may already notice that when the beam width is reduced to 1, the beam search will become the greedy approach. When the beam width equals the size of the dictionary, beam search becomes an exhaustive search. Beam search allows us to choose between sentence quality and speed.&lt;/p&gt;
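&lt;p&gt;The &lt;em&gt;“expand and pick best-N”&lt;/em&gt; procedure can be sketched as follows. The toy distributions are invented; note how beam search can prefer &lt;em&gt;“I read books”&lt;/em&gt; even though the greedy approach would commit to &lt;em&gt;“eat”&lt;/em&gt; at the first step:&lt;/p&gt;

```python
import math

# Beam search sketch over toy next-word distributions (numbers invented).
lm = {
    ("i",): {"eat": 0.5, "read": 0.4, "drink": 0.1},
    ("i", "eat"): {"hamburger": 0.6, "cake": 0.3, "books": 0.1},
    ("i", "read"): {"books": 0.9, "cake": 0.1},
}

def beam_search(prefix, steps, beam_width):
    # Each beam entry: (log probability, list of words so far).
    beams = [(0.0, list(prefix))]
    for _ in range(steps):
        candidates = []
        for logp, words in beams:
            for w, p in lm.get(tuple(words), {}).items():
                candidates.append((logp + math.log(p), words + [w]))
        if not candidates:
            break
        # Keep only the top-N partial sentences.
        beams = sorted(candidates, reverse=True)[:beam_width]
    return beams[0][1]

print(beam_search(("i",), 2, 2))  # -> ['i', 'read', 'books']
```

With beam width 2, the partial sentence &lt;em&gt;“I read”&lt;/em&gt; survives the first step, and &lt;em&gt;“I read books”&lt;/em&gt; (probability 0.36) ends up beating the greedy pick &lt;em&gt;“I eat hamburger”&lt;/em&gt; (probability 0.30).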

&lt;p&gt;With a beam width larger than 1, beam search tends to generate more promising sentences. However, like the greedy approach, the lack of randomness remains. The same sentence prefix will lead to the same sentence, and degenerate repetition is still likely to happen.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pure Sampling
&lt;/h3&gt;

&lt;p&gt;The drawbacks of beam search and greedy approaches are due to the fact that we’re picking the most probable choice. Instead of picking the most probable word from P(w | context), we could sample a word with P(w | context). Time to add some randomness!&lt;/p&gt;

&lt;p&gt;For example, with a sentence that starts with &lt;em&gt;“I”&lt;/em&gt;, we can sample a word according to P(w | &lt;em&gt;“I”&lt;/em&gt;). Since the sampling is random, even if P(&lt;em&gt;“eat”&lt;/em&gt; | &lt;em&gt;“I”&lt;/em&gt;) &amp;gt; P(&lt;em&gt;“read”&lt;/em&gt; | &lt;em&gt;“I”&lt;/em&gt;), we could still sample the word &lt;em&gt;“read”&lt;/em&gt;. Using sampling, we’ll have a very high chance of getting a new sentence in each generation.&lt;/p&gt;
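&lt;p&gt;A pure-sampling sketch, assuming the same kind of toy distribution as before (the numbers are invented):&lt;/p&gt;

```python
import random

# Pure sampling: draw the next word from P(w | context) instead of
# taking the argmax. The distribution here is invented for illustration.
def sample_next(dist, rng=random):
    words = list(dist)
    weights = [dist[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

dist = {"eat": 0.6, "read": 0.3, "sleep": 0.1}
random.seed(0)
# Each run of the generator can now produce a different continuation.
print([sample_next(dist) for _ in range(5)])
```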

&lt;p&gt;Sentences generated from pure sampling are free from degenerate repetition, but they tend to contain some gibberish.&lt;/p&gt;

&lt;h3&gt;
  
  
  Top-k Sampling and Sampling with Temperature
&lt;/h3&gt;

&lt;p&gt;There are common ways to improve pure sampling.&lt;/p&gt;

&lt;p&gt;One is Top-k sampling. Instead of sampling from the full P(w | context), we only sample from the top K words according to P(w | context).&lt;/p&gt;

&lt;p&gt;Another is sampling with temperature. It means we reshape the P(w | context) with a temperature factor &lt;strong&gt;t&lt;/strong&gt;, where &lt;strong&gt;t&lt;/strong&gt; is between 0 and 1.  &lt;/p&gt;

&lt;p&gt;Recall that we’re using a neural network to estimate the language model. Instead of probability values (which are in the range of 0 to 1), the network outputs real numbers that could be in any range, called logits. We can convert logits to probability values using the &lt;a href="https://en.wikipedia.org/wiki/Softmax_function" rel="noopener noreferrer"&gt;softmax function&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;Temperature &lt;strong&gt;t&lt;/strong&gt; comes in when applying the softmax function to retrieve the probabilities: we reshape the resulting P(w | context) by dividing each logit value by &lt;strong&gt;t&lt;/strong&gt; before applying the softmax function. As &lt;strong&gt;t&lt;/strong&gt; is between 0 and 1, dividing by it will amplify the logit values.&lt;/p&gt;

&lt;p&gt;Summing up, more probable words become even more probable while the less probable ones become even less probable. Top-k sampling and sampling with temperature usually are applied together.&lt;/p&gt;
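&lt;p&gt;A sketch of both tricks combined – temperature reshaping followed by top-k filtering. The logits are invented, and the helper names are ours:&lt;/p&gt;

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def top_k_with_temperature(logits, k, t):
    """Divide each logit by temperature t (between 0 and 1), apply
    softmax, keep the top-k words, and renormalize to sum to 1."""
    probs = softmax([x / t for x in logits])
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return {i: probs[i] / total for i in top}

# Lower temperature amplifies the logits: the most probable word
# becomes even more probable.
print(top_k_with_temperature([2.0, 1.0, 0.1], k=2, t=1.0))
print(top_k_with_temperature([2.0, 1.0, 0.1], k=2, t=0.5))
```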

&lt;h3&gt;
  
  
  Nucleus Sampling
&lt;/h3&gt;

&lt;p&gt;When using Top-k sampling, we need to decide which k to use. The best k varies depending on context. The idea of Top-k sampling is to ignore very unlikely words according to P(w | context).&lt;/p&gt;

&lt;p&gt;We can do this in another way. Instead of focusing on the top-k words in sampling, we filter out the least likely words – those in the tail whose probabilities sum to less than a certain threshold – and only sample from the remaining words.  &lt;/p&gt;

&lt;p&gt;This approach is called nucleus sampling. According to &lt;a href="https://arxiv.org/abs/1904.09751" rel="noopener noreferrer"&gt;The Curious Case of Neural Text Degeneration&lt;/a&gt;, the original paper that proposed nucleus sampling, we should choose p = 0.95, which implies that the threshold value is 1-p = 0.05. By doing nucleus sampling with p = 0.95, we could generate text pieces that are statistically most similar to human-written text.  &lt;/p&gt;
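&lt;p&gt;A minimal nucleus-sampling filter, assuming word probabilities already computed by a model (the numbers below are invented):&lt;/p&gt;

```python
def nucleus_filter(probs, p=0.95):
    """Keep the smallest set of most-probable words whose cumulative
    probability reaches p; the unlikely tail (total probability below
    1 - p) is dropped, and the kept probabilities are renormalized."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# With p = 0.9, the two least likely words are dropped here.
print(nucleus_filter([0.5, 0.3, 0.15, 0.04, 0.01], p=0.9))
```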

&lt;p&gt;The paper is a must-read! It provides a lot of comparison among human-written text and texts generated through various approaches (beam search, top-k sampling, nucleus sampling, etc.), measured by different metrics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction to GPT-2 Model
&lt;/h2&gt;

&lt;p&gt;Time to dive into the AI model!&lt;/p&gt;

&lt;p&gt;Like we mentioned, we used a neural network, &lt;a href="https://openai.com/blog/better-language-models/" rel="noopener noreferrer"&gt;GPT-2&lt;/a&gt; model from &lt;a href="https://openai.com/" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt;, to estimate the language model.  &lt;/p&gt;

&lt;p&gt;GPT-2 is a &lt;a href="https://arxiv.org/abs/1706.03762" rel="noopener noreferrer"&gt;Transformer&lt;/a&gt;-based model trained for language modelling. It can be fine-tuned to solve a diverse set of natural language processing (NLP) problems, such as text generation, summarization, question answering, translation, and sentiment analysis, among others.&lt;/p&gt;

&lt;p&gt;Throughout this article some NLP Python code snippets will be provided to aid reading.&lt;/p&gt;

&lt;p&gt;Diving into the GPT-2 model itself deserves a separate blog. Here, we’ll focus on a few main concepts. We highly recommend reading two awesome articles from Jay Alammar on &lt;a href="http://jalammar.github.io/illustrated-transformer/" rel="noopener noreferrer"&gt;Transformer&lt;/a&gt; and &lt;a href="http://jalammar.github.io/illustrated-gpt2/" rel="noopener noreferrer"&gt;GPT-2&lt;/a&gt; for more in-depth information.  &lt;/p&gt;

&lt;p&gt;Here, we’ll talk about how GPT-2 model works by building it piece by piece.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example of GPT 2 – Input and Output
&lt;/h3&gt;

&lt;p&gt;First, let’s describe the input and output of the GPT-2 model. We’ll start small and seek to construct a sentence first.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcode.oursky.com%2Fwp-content%2Fuploads%2F2020%2F05%2FArticle-Generation-and-GPT-2-Model-2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcode.oursky.com%2Fwp-content%2Fuploads%2F2020%2F05%2FArticle-Generation-and-GPT-2-Model-2.png" alt="Text Generation with GPT-2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Given words in their &lt;a href="https://en.wikipedia.org/wiki/Word_embedding" rel="noopener noreferrer"&gt;embedded form&lt;/a&gt;, GPT-2 transforms the input word-embedding vectors (blue ellipses) into output word embeddings (purple ellipses). This transformation does not change the dimension of the word embedding (although it could). The output word embedding is also known as the hidden state.  &lt;/p&gt;

&lt;p&gt;During the transformation, input embeddings from previous words will affect the result of the current word’s output embedding, but not the other way round. In our example, the output embedding of &lt;em&gt;“cake”&lt;/em&gt; will depend on the input embedding of &lt;em&gt;“I”&lt;/em&gt;, &lt;em&gt;“eat”&lt;/em&gt;, and &lt;em&gt;“cake”&lt;/em&gt;. On the other hand, the output embedding of &lt;em&gt;“I”&lt;/em&gt; (the first word) will only depend on the input embedding of &lt;em&gt;“I”&lt;/em&gt;.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Due to this, the output embedding of the last input word somehow captures the essence of the whole input sentence.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To obtain the language model, we could have a matrix &lt;strong&gt;WLM&lt;/strong&gt; whose number of columns equals the dimension of the output embedding and whose number of rows equals the dictionary size, together with a bias vector &lt;strong&gt;bLM&lt;/strong&gt; whose dimension is the dictionary size.  &lt;/p&gt;

&lt;p&gt;We can then compute the logit of each word in the dictionary by multiplying &lt;strong&gt;WLM&lt;/strong&gt; with the output embedding of the last word, then adding &lt;strong&gt;bLM&lt;/strong&gt;. To convert those logits to probabilities, we’ll apply the softmax function, and its result could be interpreted as P(w | context).&lt;/p&gt;
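&lt;p&gt;In code, the step from the last output embedding to P(w | context) might look like this. The matrix, bias, and three-word “dictionary” are invented stand-ins; a real GPT-2 vocabulary has roughly 50,000 entries:&lt;/p&gt;

```python
import math

# Language-model head sketch: logits = WLM x (output embedding of the
# last word) + bLM, then softmax turns the logits into P(w | context).
def lm_head(hidden, W, b):
    logits = [sum(wi * hi for wi, hi in zip(row, hidden)) + bi
              for row, bi in zip(W, b)]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

hidden = [0.5, -1.0, 2.0]  # output embedding of the last input word
W = [[1.0, 0.0, 0.5],      # one row per dictionary word (3-word "dictionary")
     [0.0, 1.0, 0.0],
     [0.2, 0.2, 0.2]]
b = [0.0, 0.1, 0.0]
probs = lm_head(hidden, W, b)  # a probability over the whole dictionary
```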

&lt;h3&gt;
  
  
  Inside the GPT-2 Model
&lt;/h3&gt;

&lt;p&gt;Until now, we’ve discussed how output word embeddings are computed from input word embeddings.  &lt;/p&gt;

&lt;p&gt;Input word embeddings are simply vectors. The first step of the transformation is to create even more vectors from those input word embeddings. Three vectors – namely, the &lt;strong&gt;key vector&lt;/strong&gt;, &lt;strong&gt;query vector&lt;/strong&gt;, and &lt;strong&gt;value vector&lt;/strong&gt; – will be created based on each input word embedding.  &lt;/p&gt;

&lt;p&gt;Producing these vectors is simple. We just need three matrices &lt;strong&gt;Wkey&lt;/strong&gt;, &lt;strong&gt;Wquery&lt;/strong&gt;, and &lt;strong&gt;Wvalue&lt;/strong&gt;. By multiplying an input word embedding with these three matrices, we get the corresponding key, query, and value vectors of that input word. &lt;strong&gt;Wkey&lt;/strong&gt;, &lt;strong&gt;Wquery&lt;/strong&gt;, and &lt;strong&gt;Wvalue&lt;/strong&gt; are parts of the parameters of the GPT-2 model.  &lt;/p&gt;

&lt;p&gt;To further demonstrate, let’s consider &lt;strong&gt;Iinput&lt;/strong&gt;, the input word embedding of &lt;em&gt;“I”&lt;/em&gt;. Here, we have:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ikey = Wkey Iinput, Iquery = Wquery Iinput, Ivalue = Wvalue Iinput&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We’ll use the same &lt;strong&gt;Wkey&lt;/strong&gt;, &lt;strong&gt;Wquery&lt;/strong&gt;, and &lt;strong&gt;Wvalue&lt;/strong&gt; to compute the key, query, and value vectors for all other words.  &lt;/p&gt;

&lt;p&gt;After we know how to compute the key, query, and value vectors for each input word, it’s time to use these vectors to compute the output word embedding.  &lt;/p&gt;

&lt;p&gt;As mentioned, the current word’s output embedding will depend on the current word’s input embedding and all the previous words’ input embedding.  &lt;/p&gt;

&lt;p&gt;The output embedding of a current word is the weighted sum of the current word and all its previous words’ value vectors. This also explains why value vectors are called as such.  &lt;/p&gt;

&lt;p&gt;Let’s take &lt;strong&gt;eatoutput&lt;/strong&gt; as the output embedding of &lt;em&gt;“eat”&lt;/em&gt;. Its value is computed by:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;eatoutput = &lt;sup&gt;I&lt;/sup&gt;Aeat Ivalue + &lt;sup&gt;eat&lt;/sup&gt;Aeat eatvalue&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here, &lt;strong&gt;&lt;sup&gt;I&lt;/sup&gt;Aeat&lt;/strong&gt; and &lt;strong&gt;&lt;sup&gt;eat&lt;/sup&gt;Aeat&lt;/strong&gt; are attention values. They can be interpreted as how much attention &lt;em&gt;“eat”&lt;/em&gt; should pay to &lt;em&gt;“I”&lt;/em&gt; and to &lt;em&gt;“eat”&lt;/em&gt; itself when computing its output embedding. To avoid shrinking the output embedding, the attention values need to sum to 1.  &lt;/p&gt;

&lt;p&gt;This implies that the first word’s output embedding will be equal to its value vector; for example, &lt;strong&gt;I&lt;sub&gt;output&lt;/sub&gt;&lt;/strong&gt; equals &lt;strong&gt;I&lt;sub&gt;value&lt;/sub&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Each attention value &lt;strong&gt;&lt;sup&gt;x&lt;/sup&gt;A&lt;sub&gt;y&lt;/sub&gt;&lt;/strong&gt; is computed by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Taking the dot product between the key vector of x and query vector of y&lt;/li&gt;
&lt;li&gt;  Scaling down the dot product with the square root of the dimension of the key vector&lt;/li&gt;
&lt;li&gt;  Taking the softmax to ensure the related attention values are summing up to 1, as shown below:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;sup&gt;x&lt;/sup&gt;A&lt;sub&gt;y&lt;/sub&gt; = softmax(x&lt;sub&gt;key&lt;/sub&gt;&lt;sup&gt;T&lt;/sup&gt; y&lt;sub&gt;query&lt;/sub&gt; / sqrt(k))&lt;/strong&gt;, where k is the dimension of the key vector.&lt;/p&gt;

&lt;p&gt;Let’s recap!&lt;/p&gt;

&lt;p&gt;We should now know how an output embedding is computed as the weighted sum of the value vectors of the current and previous words. The weights used in the sum are called attention values; an attention value relates two words, and is computed by taking the dot product of one word’s key vector and the other word’s query vector. As the weights should sum up to 1, we also take the softmax of the dot products.&lt;/p&gt;
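&lt;p&gt;The whole recipe can be sketched as a short NumPy function. The sizes and random weights are illustrative assumptions; note that the first word’s output is just its own value vector:&lt;/p&gt;

```python
import numpy as np

def attention_outputs(X, W_key, W_query, W_value):
    """Causal self-attention: each word's output embedding is a weighted
    sum of the value vectors of itself and all previous words."""
    K, Q, V = X @ W_key.T, X @ W_query.T, X @ W_value.T
    k = K.shape[1]  # dimension of the key vector
    outputs = []
    for i in range(len(X)):  # word i attends to words 0..i only
        scores = K[: i + 1] @ Q[i] / np.sqrt(k)  # x_key . y_query / sqrt(k)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()  # softmax: attention values sum to 1
        outputs.append(weights @ V[: i + 1])
    return np.array(outputs)

rng = np.random.default_rng(0)
d_model, d_head, n_words = 8, 4, 3  # illustrative sizes
X = rng.normal(size=(n_words, d_model))  # stand-in embeddings of "I eat cake"
W_key, W_query, W_value = (rng.normal(size=(d_head, d_model)) for _ in range(3))
Y = attention_outputs(X, W_key, W_query, W_value)
```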

&lt;h3&gt;
  
  
  Structure Replication
&lt;/h3&gt;

&lt;p&gt;What we’ve discussed so far is just the &lt;strong&gt;attention layer&lt;/strong&gt; in GPT-2. This layer covers most of the details, as the rest of the GPT-2 model structure is just a replication of the attention layer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcode.oursky.com%2Fwp-content%2Fuploads%2F2020%2F05%2FArticle-Generation-and-GPT-2-Model-3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcode.oursky.com%2Fwp-content%2Fuploads%2F2020%2F05%2FArticle-Generation-and-GPT-2-Model-3.png" alt="Attention Layer in GPT-2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s continue our GPT-2 model construction journey. GPT-2 uses multiple attention layers. This is the so-called multi-head attention.  &lt;/p&gt;

&lt;p&gt;These attention layers run in parallel; they’re not dependent on each other and don’t share weights, i.e., each attention layer has its own set of &lt;strong&gt;W&lt;sub&gt;key&lt;/sub&gt;&lt;/strong&gt;, &lt;strong&gt;W&lt;sub&gt;query&lt;/sub&gt;&lt;/strong&gt;, and &lt;strong&gt;W&lt;sub&gt;value&lt;/sub&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;As we have multiple attention layers, we’ll have multiple output word embeddings for each word. To combine them into one, we first concatenate all the output word embeddings from the different attention layers. We then multiply the concatenated matrix by &lt;strong&gt;W&lt;sub&gt;project&lt;/sub&gt;&lt;/strong&gt; to make the output word embedding have the same dimension as the input word embedding.&lt;/p&gt;
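&lt;p&gt;Combining the heads can be sketched as follows. The head outputs here are random stand-ins, and the sizes are assumptions chosen so that the projected output matches the input dimension:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_heads, n_words = 8, 4, 2, 3  # illustrative sizes

# One output embedding per attention layer (head) for each word.
head_outputs = [rng.normal(size=(n_words, d_head)) for _ in range(n_heads)]

# Step 1: concatenate the heads' outputs along the feature dimension.
concat = np.concatenate(head_outputs, axis=1)  # n_words x (n_heads * d_head)

# Step 2: multiply by Wproject so the combined output embedding has
# the same dimension as the input word embedding (d_model).
W_project = rng.normal(size=(n_heads * d_head, d_model))
combined = concat @ W_project  # n_words x d_model
```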

&lt;p&gt;The output word embeddings we have so far are actually not the final ones. They will further go through a &lt;a href="https://en.wikipedia.org/wiki/Feedforward_neural_network" rel="noopener noreferrer"&gt;feedforward layer&lt;/a&gt; and be transformed into the actual output word embeddings.&lt;/p&gt;

&lt;p&gt;These attention layers running in parallel, together with the feedforward layer, are grouped into a block called the &lt;strong&gt;decoder block&lt;/strong&gt;&lt;sup&gt;1&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcode.oursky.com%2Fwp-content%2Fuploads%2F2020%2F05%2FArticle-Generation-and-GPT-2-Model-4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcode.oursky.com%2Fwp-content%2Fuploads%2F2020%2F05%2FArticle-Generation-and-GPT-2-Model-4.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GPT-2 doesn’t just include one decoder block; there’s a chain of them. We choose the input word embedding and output word embedding to have the same dimensionality so that we can chain the decoder blocks.&lt;/p&gt;

&lt;p&gt;These decoder blocks have exactly the same structure but don’t share weights.&lt;/p&gt;

&lt;p&gt;The GPT-2 model comes in different sizes. They differ in the embedding dimensionality; the dimensionality of the key, query, and value vectors; the number of attention layers in each decoder block; and the number of decoder blocks in the model.&lt;/p&gt;

&lt;h4&gt;
  
  
  Some Omitted Details
&lt;/h4&gt;

&lt;p&gt;Here are some details worth noting, and you can take these as pointers to learn more about them:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; GPT-2 uses &lt;a href="https://en.wikipedia.org/wiki/Byte_pair_encoding" rel="noopener noreferrer"&gt;Byte pair encoding&lt;/a&gt; when tokenizing the input string. One token does not necessarily correspond to one word. GPT-2 works in terms of tokens instead of words.&lt;/li&gt;
&lt;li&gt; Positional embeddings are added to the input embeddings of the first decoder block so as to encode the word order information in the word embedding.&lt;/li&gt;
&lt;li&gt; All residual addition and normalization layers are omitted.&lt;/li&gt;
&lt;/ol&gt;
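&lt;p&gt;Point 2 can be illustrated with a minimal sketch: a learned positional embedding row is added to each token embedding before the first decoder block. The token ids and the tiny embedding size below are hypothetical; the vocabulary size (50,257) and context length (1,024) are GPT-2’s:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_pos, d_model = 50257, 1024, 8  # d_model shrunk for illustration

token_emb = rng.normal(size=(vocab_size, d_model))  # learned token embeddings
pos_emb = rng.normal(size=(max_pos, d_model))       # learned positional embeddings

token_ids = np.array([40, 4483, 12187])  # hypothetical BPE token ids

# Input to the first decoder block: each token's embedding plus the
# embedding of its position in the sequence.
X = token_emb[token_ids] + pos_emb[np.arange(len(token_ids))]
```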

&lt;h3&gt;
  
  
  Training the GPT-2 Model
&lt;/h3&gt;

&lt;p&gt;So, now you have a sense of how GPT-2 works. You know how GPT-2 can be used to estimate the language model by converting the last word’s output embedding to logits using &lt;strong&gt;W&lt;sub&gt;LM&lt;/sub&gt;&lt;/strong&gt; and &lt;strong&gt;b&lt;sub&gt;LM&lt;/sub&gt;&lt;/strong&gt;, then to probabilities.&lt;/p&gt;

&lt;p&gt;We can now talk about training the GPT-2 model for text generation.  &lt;/p&gt;

&lt;p&gt;The first step to train a GPT-2 text generator is language model estimation. Given an input string, such as &lt;em&gt;“I eat cake”&lt;/em&gt;, GPT-2 can estimate P(eat | &lt;em&gt;“I”&lt;/em&gt;) and P(cake | &lt;em&gt;“I eat”&lt;/em&gt;).  &lt;/p&gt;

&lt;p&gt;For this input string in training, we’ll assume the following:&lt;br&gt;&lt;br&gt;
P(eat | &lt;em&gt;“I”&lt;/em&gt;) = 1, P(w != eat | &lt;em&gt;“I”&lt;/em&gt;) = 0  &lt;/p&gt;

&lt;p&gt;P(cake | &lt;em&gt;“I eat”&lt;/em&gt;) = 1, P(w != cake | &lt;em&gt;“I eat”&lt;/em&gt;) = 0  &lt;/p&gt;

&lt;p&gt;Now that we have the estimated and target probability distributions, we can compute the &lt;a href="https://en.wikipedia.org/wiki/Cross_entropy" rel="noopener noreferrer"&gt;cross entropy&lt;/a&gt; loss, and use it to update the weights.&lt;/p&gt;
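&lt;p&gt;With a one-hot target like this, the cross entropy loss reduces to the negative log probability the model assigns to the actual next word. Here’s a minimal sketch with a toy four-word vocabulary and made-up logits (both assumptions for illustration):&lt;/p&gt;

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

vocab = ["I", "eat", "cake", "sleep"]     # toy vocabulary (assumption)
logits = np.array([0.2, 1.5, 3.0, -0.5])  # made-up logits for the word after "I eat"

probs = softmax(logits)  # estimated P(w | "I eat")

# Target: P(cake | "I eat") = 1 and 0 for every other word, so the
# cross entropy is just -log of the probability assigned to "cake".
target = vocab.index("cake")
loss = -np.log(probs[target])
```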

&lt;p&gt;As you can see, we need to feed the model a large amount of text to train GPT-2.&lt;/p&gt;
&lt;h3&gt;
  
  
  Testing and Fine-Tuning GPT-2
&lt;/h3&gt;

&lt;p&gt;To quickly test GPT-2 on article generation, we’ll use &lt;a href="https://huggingface.co/transformers/" rel="noopener noreferrer"&gt;Huggingface 🤗 Transformers&lt;/a&gt;. It is a Python library that lets developers quickly test pre-trained transformer-based NLP models, with support for both PyTorch and TensorFlow. GPT-2 is one of the models it provides.&lt;/p&gt;

&lt;p&gt;To fine-tune a pre-trained model, we can use &lt;a href="https://github.com/huggingface/transformers/blob/master/examples/legacy/run_language_modeling.py" rel="noopener noreferrer"&gt;run_language_modeling.py&lt;/a&gt;. All we need are two text files: one containing the training text pieces, and another containing the text pieces for evaluation.&lt;/p&gt;

&lt;p&gt;Here’s an example of using run_language_modeling.py for fine-tuning a pre-trained model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python run_language_modeling.py &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--output_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;output &lt;span class="se"&gt;\ &lt;/span&gt;         &lt;span class="c"&gt;# The trained model will be store at ./output&lt;/span&gt;
    &lt;span class="nt"&gt;--model_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gpt2 &lt;span class="se"&gt;\ &lt;/span&gt;           &lt;span class="c"&gt;# Tell huggingface transformers we want to train gpt-2&lt;/span&gt;
    &lt;span class="nt"&gt;--model_name_or_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gpt2 &lt;span class="se"&gt;\ &lt;/span&gt;   &lt;span class="c"&gt;# This will use the pre-trained gpt2 samll model&lt;/span&gt;
    &lt;span class="nt"&gt;--do_train&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--train_data_file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$TRAIN_FILE&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--do_eval&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--eval_data_file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$TEST_FILE&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--per_gpu_train_batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1   &lt;span class="c"&gt;# For GPU training only, you may increase it if your GPU has more memory to hold more training data.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Huggingface 🤗 Transformers has a lot of built-in functions, and generating text is one of them.&lt;/p&gt;

&lt;p&gt;The following is a code snippet of text generation using a pre-trained GPT-2 model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import (
    GPT2LMHeadModel,
    GPT2Tokenizer,
)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

sentence_prefix = "I eat"

input_ids = tokenizer.encode(
    sentence_prefix,
    add_special_tokens=False,
    return_tensors="pt",
    add_space_before_punct_symbol=True
)

output_ids = model.generate(
    input_ids=input_ids,
    do_sample=True,
    max_length=20,  # desired output sentence length
    pad_token_id=model.config.eos_token_id,
)[0].tolist()

generated_text = tokenizer.decode(
    output_ids,
    clean_up_tokenization_spaces=True)

print(generated_text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Attention Visualization
&lt;/h2&gt;

&lt;p&gt;Thanks to jessevig’s &lt;a href="https://github.com/jessevig/bertviz" rel="noopener noreferrer"&gt;BertViz&lt;/a&gt; tool, we can peek at how GPT-2 works by visualizing the attention values.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcode.oursky.com%2Fwp-content%2Fuploads%2F2020%2F05%2FArticle-Generation-and-GPT-2-Model-5-1160x950.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcode.oursky.com%2Fwp-content%2Fuploads%2F2020%2F05%2FArticle-Generation-and-GPT-2-Model-5-1160x950.png" alt="Attention Values Visualised"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The figure above is a visualization of attention values on each decoder block (from top to bottom of the grid, with the first row as the first block). Each attention head (from left to right) of the GPT-2 small model takes &lt;em&gt;“I disapprove of what you say, but”&lt;/em&gt; as input.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcode.oursky.com%2Fwp-content%2Fuploads%2F2020%2F05%2FArticle-Generation-and-GPT-2-Model-6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcode.oursky.com%2Fwp-content%2Fuploads%2F2020%2F05%2FArticle-Generation-and-GPT-2-Model-6.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the left is a zoomed-in look at the 2&lt;sup&gt;nd&lt;/sup&gt; block’s 6&lt;sup&gt;th&lt;/sup&gt; attention head’s result.&lt;/p&gt;

&lt;p&gt;The words on the left are the output, and those on the right are the input. The opacity of the line indicates how much attention the output word paid to the input words.&lt;/p&gt;

&lt;p&gt;An interesting tidbit here is that most of the time, the first word is paid the most attention. This general pattern remains even if we use other input sentences.&lt;/p&gt;

&lt;h2&gt;
  
  
  Word Importance Visualization
&lt;/h2&gt;

&lt;p&gt;Purely looking at the attention values doesn’t seem to give us clues on how the input sentence affects how the GPT-2 model picks its next word. One reason could be that it’s hard to imagine how the attention is utilized for next-word prediction, as there are still the &lt;strong&gt;W&lt;sub&gt;project&lt;/sub&gt;&lt;/strong&gt; multiplication and the feedforward layer that transform the attention layers’ output.&lt;/p&gt;

&lt;p&gt;So, we’re interested in how the input sentence affects the probability distribution of the next word. We want to know which word in the input sentence will affect the next word’s probability distribution the most.&lt;/p&gt;

&lt;h3&gt;
  
  
  Measure Word Importance Through Input Perturbation
&lt;/h3&gt;

&lt;p&gt;In &lt;a href="http://proceedings.mlr.press/v97/guan19a.html" rel="noopener noreferrer"&gt;Towards a Deep and Unified Understanding of Deep Neural Models in NLP&lt;/a&gt;, the authors propose a way to answer this. They also provide the &lt;a href="https://github.com/icml2019paper2428/Towards-A-Deep-and-Unified-Understanding-of-Deep-Neural-Models-in-NLP" rel="noopener noreferrer"&gt;code&lt;/a&gt; that we could use to analyze the GPT-2 model with.  &lt;/p&gt;

&lt;p&gt;The paper also discussed measuring the importance of each input word. The idea is to assign a value σi to each input word, where σi is initially a random value between 0 and 1.&lt;/p&gt;

&lt;p&gt;Later on, we’ll generate a noise vector with the same size as the input word embedding. This noise vector is added to the input word embedding with the weight specified by σi; in other words, σi tells how much noise is added to the corresponding input word.&lt;/p&gt;

&lt;p&gt;With the original and perturbed input word embeddings, we feed both of them to our GPT-2 model and get two sets of logits from the last output embeddings.&lt;/p&gt;

&lt;p&gt;We then measure the difference (using the L&lt;sup&gt;2&lt;/sup&gt; norm) between these two sets of logits. This difference tells us how severely the perturbation affects the resultant logits that we use to construct the language model. We then optimize σi to minimize the difference between the two sets of logits.&lt;/p&gt;

&lt;p&gt;We keep generating new noise vectors and adding them to the original input word embeddings using the &lt;strong&gt;updated&lt;/strong&gt; σi. We then compute the difference between the resultant logits, and use this difference to guide the update of σi.&lt;/p&gt;

&lt;p&gt;During the iterations, we’ll track the best σi that leads to the smallest difference in the resultant logits, and report it as the result once we reach the maximum number of iterations.&lt;/p&gt;

&lt;p&gt;The reported σi tells us how much noise the corresponding input word can withstand without leading to a significant change in the resultant logits.&lt;/p&gt;

&lt;p&gt;If a word is important to the resultant logits, we’d expect that even a small perturbation on that word’s input embedding will lead to a significant change in the logits. Hence, the reported σi is inversely related to the importance of the words: the smaller the reported σi, the more important the corresponding input word is.&lt;/p&gt;
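&lt;p&gt;The loop just described can be sketched in NumPy with a toy linear model standing in for GPT-2. Everything here is an illustrative assumption, including the penalty term that rewards larger σi (without some such term, σi = 0 would trivially minimize the difference); the paper’s actual objective and optimizer differ:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d_model = 4, 8  # illustrative sizes (assumption)
X = rng.normal(size=(n_words, d_model))  # original input word embeddings
W = rng.normal(size=(d_model, d_model))

def phi(x):
    """Stand-in for the model: maps embeddings to the last word's logits."""
    return (x @ W)[-1]

lam, lr, eps = 0.05, 0.01, 1e-4

def objective(sigma, noise):
    # L2 difference between original and perturbed logits, plus a penalty
    # rewarding larger sigma (our assumption; the paper's objective differs).
    diff = np.linalg.norm(phi(X + sigma[:, None] * noise) - phi(X))
    return diff - lam * np.log(sigma).sum()

sigma = rng.uniform(0.5, 1.0, size=n_words)  # one noise weight per word
for _ in range(300):
    noise = rng.normal(size=X.shape)  # fresh noise every iteration
    base = objective(sigma, noise)
    grad = np.zeros_like(sigma)
    for i in range(n_words):  # crude finite-difference gradient
        bumped = sigma.copy()
        bumped[i] += eps
        grad[i] = (objective(bumped, noise) - base) / eps
    sigma = np.clip(sigma - lr * grad, 1e-3, None)

# Smaller sigma means the corresponding word is more important to the logits.
print(sigma)
```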

&lt;h4&gt;
  
  
  Code Snippet
&lt;/h4&gt;

&lt;p&gt;Here’s a code snippet for visualizing the word importance. Interpreter.py can be found &lt;a href="https://github.com/icml2019paper2428/Towards-A-Deep-and-Unified-Understanding-of-Deep-Neural-Models-in-NLP/blob/master/Interpreter.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from Interpreter import Interpreter 

def Phi(x):
    global model
    result = model(inputs_embeds=x)[0]
    return result # return the logit of last word

model_path = "gpt2"
model = GPT2LMHeadModel.from_pretrained(model_path, output_attentions=True)
tokenizer = GPT2Tokenizer.from_pretrained(model_path)

input_embedding_weight_std = (
    model.get_input_embeddings().weight.view(1,-1)
    .std().item()
)

text = "I disapprove of what you say , but"
inputs = tokenizer.encode_plus(text, return_tensors='pt', 
                               add_special_tokens=True, 
                               add_space_before_punct_symbol=True)
input_ids = inputs['input_ids']

with torch.no_grad():
    x = model.get_input_embeddings()(input_ids).squeeze()

interpreter = Interpreter(x=x, Phi=Phi, 
                          scale=10*input_embedding_weight_std,
                          words=text.split(' ')).to(model.device)

# This will take some time.
interpreter.optimize(iteration=1000, lr=0.01, show_progress=True)
interpreter.get_sigma()
interpreter.visualize()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Below are the reported σi and its visualization. The smaller the value, the darker the color.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;array&lt;span class="o"&gt;([&lt;/span&gt;0.8752377, 1.2462736, 1.3040292, 0.55643 , 1.3775877, 1.2515365, 1.2249271, 0.311358 &lt;span class="o"&gt;]&lt;/span&gt;, &lt;span class="nv"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;float32&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcode.oursky.com%2Fwp-content%2Fuploads%2F2020%2F05%2FArticle-Generation-and-GPT-2-Model-7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcode.oursky.com%2Fwp-content%2Fuploads%2F2020%2F05%2FArticle-Generation-and-GPT-2-Model-7.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the figures above, we can now know that P( w | “I disapprove of what you say, but”) will be affected by the word “but” the most, followed by “what”, then “I”.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;To sum up, we discussed what a language model is and how to utilize it for article generation, using different approaches to get text similar to what humans write.&lt;/p&gt;

&lt;p&gt;We also briefly introduced the GPT-2 model and some of its internal workings, and saw how to use Huggingface 🤗 Transformers to apply the GPT-2 model to text prediction.&lt;/p&gt;

&lt;p&gt;Finally, we visualized the attention values in the GPT-2 model and used the input perturbation approach to see which words in the input sentence affect the next word prediction the most.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnote&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; The actual structure of the decoder block consists of only one attention layer. What we describe here as an attention layer should be called an attention head. One attention layer includes multiple attention heads and the &lt;strong&gt;W&lt;sub&gt;project&lt;/sub&gt;&lt;/strong&gt; for combining the attention heads’ outputs.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>python</category>
    </item>
  </channel>
</rss>
