DEV Community

Toru Furukawa


入門 自然言語処理 (Natural Language Processing with Python, Japanese edition), pp. 83-86

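How to fetch text from the web. With nltk 3.2.5, the book's sample does not run as written (it relies on nltk.clean_html(), which NLTK 3 no longer implements). To do the HTML processing from section 3.1.2, install Beautiful Soup and proceed as follows.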

$ pip install beautifulsoup4
$ python
Python 3.6.3 (default, Dec 14 2017, 14:27:23)
...
>>> import nltk
>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> html = urlopen(url).read()
>>> html = html.decode()
>>> soup = BeautifulSoup(html, "html.parser")
>>> raw = soup.get_text()
>>> tokens = nltk.word_tokenize(raw)
>>> text = nltk.Text(tokens)
>>> text.concordance("gene")
Displaying 7 of 7 matches:
hey say too few people now carry the gene for blondes to last beyond the next
blonde hair is caused by a recessive gene . In order for a child to have blond
...
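
One detail in the session above is worth a hedged note: html.decode() with no arguments assumes UTF-8, which may not match what the server actually sends. The sketch below is one possible way to honour the charset declared in the Content-Type header, using only the standard library. The names decode_body and fetch_html are hypothetical helpers invented for this illustration; they are not part of NLTK or Beautiful Soup.

```python
from email.message import Message
from urllib.request import urlopen


def decode_body(body, content_type, default="utf-8"):
    """Decode raw bytes using the charset named in a Content-Type value.

    Hypothetical helper: falls back to `default` when the header is missing,
    names no charset, or names one Python does not know.
    """
    msg = Message()
    msg["Content-Type"] = content_type or "text/html"
    charset = msg.get_content_charset() or default
    try:
        # errors="replace" substitutes undecodable bytes instead of raising
        return body.decode(charset, errors="replace")
    except LookupError:
        return body.decode(default, errors="replace")


def fetch_html(url):
    """Download a page and decode it with the charset the server declares."""
    with urlopen(url) as response:
        return decode_body(response.read(), response.headers.get("Content-Type"))
```

With this in place, html = fetch_html(url) would stand in for the urlopen(url).read() and html.decode() lines. Another option is to skip decoding entirely and pass the raw bytes to BeautifulSoup, which attempts to detect the encoding itself.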
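
To make the concordance output less of a black box, here is a rough keyword-in-context sketch in plain Python. It is a simplified stand-in written for this post, not NLTK's implementation: the function name concordance_lines and the width parameter are assumptions, and it returns the lines instead of printing them.

```python
def concordance_lines(tokens, word, width=30):
    """Return one line per occurrence of `word` with context on both sides.

    Matching ignores case. `width` caps the number of characters of context
    shown on each side of the match.
    """
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() != target:
            continue
        # A token is at least one character long, so `width` tokens per side
        # are always enough to fill `width` characters of context.
        left = " ".join(tokens[max(0, i - width):i])[-width:]
        right = " ".join(tokens[i + 1:i + 1 + width])[:width]
        # Right-align the left context so the keyword lines up in a column
        lines.append(f"{left:>{width}} {token} {right}")
    return lines


if __name__ == "__main__":
    sample = "blonde hair is caused by a recessive gene . In order".split()
    for line in concordance_lines(sample, "gene"):
        print(line)
```

Passing the tokens list from the session above in place of sample should give output in the same spirit as text.concordance("gene"), although spacing and truncation will differ from NLTK's.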
