DEV Community

Toru Furukawa
Toru Furukawa

Posted on

入門自然言語処理 pp.83-86

ウェブのテキストを取得する方法。nltk 3.2.5 では、サンプルがそのまま動かない。3.1.2章のHTML処理をするには、以下のようにする。

$ pip install beautifulsoup4
Enter fullscreen mode Exit fullscreen mode
$ python
Python 3.6.3 (default, Dec 14 2017, 14:27:23)
...
>>> import nltk
>>> from bs4 import BeautifulSoup
>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> html = urlopen(url).read()
>>> html = html.decode()
>>> soup = BeautifulSoup(html, "html.parser")
>>> raw = soup.get_text()
>>> tokens = nltk.word_tokenize(raw)
>>> text = nltk.Text(tokens)
>>> text.concordance("gene")
Displaying 7 of 7 matches:
hey say too few people now carry the gene for blondes to last beyond the next
blonde hair is caused by a recessive gene . In order for a child to have blond
...
Enter fullscreen mode Exit fullscreen mode

Top comments (0)