入門 自然言語処理 (the Japanese edition of Natural Language Processing with Python), pp. 83-86

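How to fetch text from the web. With nltk 3.2.5, the book's sample does not run as written. To do the HTML processing from Section 3.1.2, use BeautifulSoup to strip the markup, as follows.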

$ pip install beautifulsoup4
$ python
Python 3.6.3 (default, Dec 14 2017, 14:27:23)
...
>>> import nltk
>>> from urllib.request import urlopen  # needed for urlopen() below
>>> from bs4 import BeautifulSoup
>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> html = urlopen(url).read()
>>> html = html.decode()
>>> soup = BeautifulSoup(html, "html.parser")
>>> raw = soup.get_text()
>>> tokens = nltk.word_tokenize(raw)
>>> text = nltk.Text(tokens)
>>> text.concordance("gene")
Displaying 7 of 7 matches:
hey say too few people now carry the gene for blondes to last beyond the next
blonde hair is caused by a recessive gene . In order for a child to have blond
...
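
The same steps can also be written as a small script. This is only a minimal sketch: the helper names `fetch_html` and `html_to_text` and the inline `sample` string are made up for illustration and do not come from the book. It also assumes the page's charset can be read from the HTTP headers, falling back to UTF-8, rather than calling `decode()` with its default.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Download a page and decode it (hypothetical helper)."""
    with urlopen(url) as response:
        # Assumption: use the charset declared in the Content-Type header,
        # and fall back to UTF-8 when the server does not declare one.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def html_to_text(html: str) -> str:
    """Strip the markup from an HTML string (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text()


# Small inline sample so the markup stripping can be tried without network access.
sample = "<html><body><h1>Blondes</h1><p>A recessive gene.</p></body></html>"
print(html_to_text(sample))
```

With a real page, `nltk.word_tokenize(html_to_text(fetch_html(url)))` should give the same tokens as the session above. Note that `word_tokenize` may require the tokenizer data to be downloaded first, for example with `nltk.download("punkt")`.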
Toru Furukawa