stereobooster

Posted on Sep 21, 2019 • Originally published at stereobooster.com on Sep 21, 2019

Introduction to natural language processing

#python #machinelearning #datascience #beginners

What is NLP?

Don't confuse it with neuro-linguistic programming 🤦

Natural language processing allows computers to access unstructured data expressed as speech or text. Speech or text data does involve linguistic structure. Linguistic structures vary depending on the language

-- Bender, 2019

NLP is a class of tasks (computer algorithms) for working with text in natural languages, for example named entity recognition (NER), part-of-speech tagging (POS), text categorization, and coreference resolution.

Image source: Understanding Natural Language Understanding

See paperswithcode and nlpprogress for a bigger taxonomy of tasks.

Getting started

I'm not an expert in machine learning (yet), but I know something about developer experience, so I will show how to get started with NLP fast and comfortably.

We will use:

Docker
Jupyter notebooks
Python with spaCy

There are a lot of tools in this field, but these seem approachable and modern to me.

Setup

Create Dockerfile:

FROM jupyter/datascience-notebook:1386e2046833
RUN pip install spacy
RUN python -m spacy download en_core_web_sm

The base image comes from the awesome Jupyter Docker Stacks.

Add docker-compose.yml:

version: "3"
services:
  web:
    build: .
    ports:
      - "8888:8888"
    volumes:
      - ./work:/home/jovyan/work

Run

Run (in the terminal, in the same folder where you created files):

docker-compose up

This command downloads the base image, builds the container, and starts the development environment. Once it is running, you will see output like this:

To access the notebook, copy and paste one of these URLs:
 http://127.0.0.1:8888/?token=...

Open the URL in a browser
Navigate to "work" folder
Click "New" in the top-right corner and select "Python 3" from the dropdown

Your notebook is ready for work.

A Jupyter notebook is a mix of a runtime environment for experiments and a scientific journal.
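Before running the experiments, it can help to confirm that the image really contains spaCy and the English model installed in the Dockerfile. Here is a minimal sketch that uses only the standard library. The helper name `missing_packages` is made up for this example, and it assumes the model is importable as a package named `en_core_web_sm`, which is how `spacy download` installs it:

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages the Dockerfile above is expected to install
missing = missing_packages(["spacy", "en_core_web_sm"])
if missing:
    print("Missing: " + ", ".join(missing))
else:
    print("spaCy and the English model are installed")
```

If anything is reported as missing, rebuilding the image with `docker-compose build` is a reasonable first thing to try.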

First experiment: POS

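POS stands for part-of-speech tagging: for a given text, we need to identify the part of speech (for example, noun or verb) of each word. The example below renders the dependency parse, which also labels each word with its part of speech.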

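import spacy
from spacy import displacy

# Load the small English model installed in the Dockerfile
nlp = spacy.load("en_core_web_sm")
doc1 = nlp("This is a sentence.")
# Draw the dependency parse; each word is labeled with its part of speech
displacy.render([doc1], style="dep", page=True)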

Type in the program and click "Run".

Here is the list of all tags.
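The rendered diagram is handy for a quick look, but the tags can also be read as plain values from each token. Here is a small sketch, not the only way to do it. The helper names `pos_rows` and `print_pos` are invented for this example, while `token.pos_`, `token.tag_`, and `spacy.explain` are part of spaCy:

```python
def pos_rows(doc):
    """Collect (text, coarse POS, fine-grained tag) for each token of a spaCy Doc."""
    return [(token.text, token.pos_, token.tag_) for token in doc]


def print_pos(text, model="en_core_web_sm"):
    """Tag `text` with the given spaCy model and print one row per token."""
    import spacy  # imported lazily so pos_rows stays usable on its own

    nlp = spacy.load(model)
    for word, pos, tag in pos_rows(nlp(text)):
        # spacy.explain returns a short description of a tag, or None if unknown
        print(f"{word:<12}{pos:<8}{tag:<6}{spacy.explain(tag)}")
```

In a notebook cell, calling `print_pos("This is a sentence.")` prints one row per token, with the coarse part of speech, the fine-grained tag, and a short explanation of that tag.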

Second experiment: NER

NER stands for named entity recognition. This task is about identifying specific entities in text, for example people's names, which may consist of more than one part (Siddhartha Gautama), country names (U.K.), or amounts of money ($1 billion).

import spacy
from spacy import displacy

text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."
# Load the small English model installed in the Dockerfile
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
# Highlight the recognized entities inline, each with its type label
displacy.render(doc, style="ent")

Type in the program and click "Run".

Here is the list of all entity types.
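Besides the highlighted rendering, the recognized entities are available as data through `doc.ents`. The sketch below collects them and groups them by type. The helper names `entity_rows`, `group_by_label`, and `print_entities` are hypothetical, while `ent.text`, `ent.label_`, `ent.start_char`, and `ent.end_char` are spaCy attributes:

```python
from collections import defaultdict


def entity_rows(doc):
    """Collect (text, label, start_char, end_char) for each entity in a spaCy Doc."""
    return [(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in doc.ents]


def group_by_label(rows):
    """Group entity texts by label, keeping the order in which they appear."""
    grouped = defaultdict(list)
    for text, label, _start, _end in rows:
        grouped[label].append(text)
    return dict(grouped)


def print_entities(text, model="en_core_web_sm"):
    """Run NER on `text` and print the entities found for each label."""
    import spacy  # imported lazily so the helpers above work without spaCy

    nlp = spacy.load(model)
    for label, texts in group_by_label(entity_rows(nlp(text))).items():
        print(f"{label}: {', '.join(texts)}")
```

Calling `print_entities(text)` in the notebook, with `text` defined as in the experiment above, prints one line per entity type. Which entities are found depends on the model version.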

Save your work

Rename your notebook (click "Untitled") to something more meaningful, for example, "experiments". Click the "Save" button.

Create .gitignore file:

work/.ipynb_checkpoints

Run (in the terminal, in the same folder where you created files):

git init
git add .
git commit -m "first commit"

Your work is now saved in git.

Tutorial

The purpose of these experiments was to show how easy it is to get started. If you actually want to learn NLP, you can use this tutorial.

Good luck!

PS

Check out the spaCy universe for more cool projects. spaCy is just one of the tools; you can use any alternative you like, for example nltk or Stanford CoreNLP.

Photo by Green Chameleon on Unsplash

Top comments (3)

Eric Ahnell • Sep 21 '19

Very interesting! For those of you with Macs running macOS Mojave (or later) who want to try some of this stuff locally without Python, Nalaprop from Eclectic Light is a way to get started... but to get truly deep in it, you will need more advanced tools, such as the set outlined above.

stereobooster • Sep 21 '19

👍

I avoid installing things locally, because I often want to test something once and then it sits on my machine forever. That is why Docker is my go-to tool (a matter of taste). Often (in my case) running Docker is easier and faster, because everything is precompiled and configured.

But all of the above is from my own experience.

Vikram Sharma • Feb 28 '20

Thanks for the primer. I took an audit course in machine learning last year. This year I have a couple of courses on advanced machine learning. As my project I am planning to build an article summary generator. My strategy is to create a paragraph summary generator using an API and then implement it in Python. I can use the API as a benchmark for my tool.
