Newsgroup Text classification with Machine Learning

petercour

Text can be classified automatically. As with anything in Machine Learning, it needs data. So what data are we going to use?

The Data

Let's say our source of data is the 20 newsgroups data set, which sklearn loads with fetch_20newsgroups.
This data set contains the text of nearly 20,000 newsgroup posts partitioned across 20 different newsgroups.

The dataset is quite old, but that doesn't matter. You can find the original homepage here: 20 news groups dataset

The data set is available through the Python Machine Learning module sklearn, which downloads and caches it the first time you load it.

To simplify, we'll only take 2 news groups: "rec.motorcycles" and "rec.sport.hockey".

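#!/usr/bin/python3
from sklearn.datasets import fetch_20newsgroups
# Load only the two chosen groups, from both the train and test subsets
news = fetch_20newsgroups(subset="all", categories=['rec.sport.hockey', 'rec.motorcycles'])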

Test the Algorithm

Before using the classifier, you want to know how well it works. That is done by splitting the data set into a training set and a test set.

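#!/usr/bin/python3
from sklearn.model_selection import train_test_split
# By default 25% of the samples are held out for testing
x_train, x_test, y_train, y_test = train_test_split(news.data, news.target)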
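The call above relies on the defaults, so the shuffle differs on every run. A minimal sketch of how the test_size, random_state and stratify parameters of train_test_split make the split reproducible and balanced. The toy texts, the labels and the split helper are made up for illustration and stand in for news.data and news.target:

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for news.data / news.target (hypothetical texts and labels).
texts = [
    "bike one", "bike two", "bike three", "bike four",
    "puck one", "puck two", "puck three", "puck four",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

def split(seed):
    # test_size=0.25 matches the default; stratify keeps the class ratio
    # the same in both parts; random_state fixes the shuffle.
    return train_test_split(
        texts, labels, test_size=0.25, random_state=seed, stratify=labels
    )

x_train, x_test, y_train, y_test = split(seed=0)
print(len(x_train), len(x_test))   # 6 2
print(sorted(y_test))              # [0, 1] (one test sample per class)
```

With a fixed random_state the score printed later stays the same from run to run, which makes experiments easier to compare.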

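The data we're dealing with is text, but the classifier needs numeric vectors. The TfidfVectorizer does that conversion. Fit it on the training texts only, then apply the same vocabulary to the test texts. So we have two matrices: x_train and x_test.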

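#!/usr/bin/python3
from sklearn.feature_extraction.text import TfidfVectorizer
transfer = TfidfVectorizer()
# Learn the vocabulary and idf weights from the training texts only
x_train = transfer.fit_transform(x_train)
# Reuse that vocabulary for the test texts
x_test = transfer.transform(x_test)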
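To make the vectorizer step less of a black box, here is a rough pure-Python sketch of the TF-IDF idea. It assumes the smoothed idf formula ln((1 + n) / (1 + df)) + 1 and unit-length scaling described in the scikit-learn documentation. The helper names tokenize, fit_idf and transform are hypothetical, and the real class returns a sparse matrix rather than a dict:

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_idf(docs):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    n = len(docs)
    df = Counter(word for doc in docs for word in set(tokenize(doc)))
    return {word: math.log((1 + n) / (1 + count)) + 1 for word, count in df.items()}

def transform(doc, idf):
    # Term count times idf, then scale the vector to unit length.
    # Words that were not seen while fitting are ignored.
    counts = Counter(word for word in tokenize(doc) if word in idf)
    weights = {word: count * idf[word] for word, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {word: w / norm for word, w in weights.items()}

train_docs = ["the bike is fast", "the puck is on the ice", "the bike needs oil"]
idf = fit_idf(train_docs)              # learned from the training texts only
vector = transform("the fast bike", idf)
print(vector)
```

In this toy corpus "fast" appears in one document and "the" in all three, so "fast" gets the largest weight in the vector and "the" the smallest.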

No need to change y_train and y_test, as those are the output labels (class 0 or class 1).

Create an algorithm object, here a Multinomial Naive Bayes classifier, and train it with the data.

#!/usr/bin/python3
from sklearn.naive_bayes import MultinomialNB
estimator = MultinomialNB()
estimator.fit(x_train, y_train)
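For intuition about what the classifier learns, here is a simplified pure-Python sketch of multinomial Naive Bayes with additive smoothing (alpha=1.0, which is also the MultinomialNB default). It works on raw word counts instead of TF-IDF values, and train_nb, predict_nb and the toy texts are hypothetical, so treat it as an illustration rather than what sklearn does internally:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    # Count words per class and how many documents each class has.
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab, alpha

def predict_nb(model, doc):
    word_counts, class_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        # log P(class) + sum of log P(word | class), with additive smoothing.
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for word in doc.lower().split():
            if word in vocab:  # words never seen in training are skipped
                p = (word_counts[label][word] + alpha) / (total_words + alpha * len(vocab))
                score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)

docs = ["bike engine oil", "ride the bike", "hockey puck ice", "the hockey game"]
labels = [0, 0, 1, 1]
model = train_nb(docs, labels)
print(predict_nb(model, "new bike engine"))      # 0
print(predict_nb(model, "hockey game tonight"))  # 1
```

The smoothing term keeps a word that never occurred in one class from driving that class's probability to zero.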

Then you can make predictions and see how well it classifies on the test data:

#!/usr/bin/python3
y_predict = estimator.predict(x_test)
print("y_predict:\n", y_predict)

score = estimator.score(x_test, y_test)
print("score:\n", score)

Run the program and you'll see the accuracy. The exact value changes a little between runs, because the split is random:

score: 0.9939879759519038
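The score is the mean accuracy, i.e. the fraction of test texts that got the right label. A small sketch with made-up labels shows the arithmetic, plus a count per (true, predicted) pair to see which class gets confused. The accuracy and confusion_counts helpers are hypothetical names, not sklearn functions:

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def confusion_counts(y_true, y_pred):
    # counts[(true, predicted)] = number of samples with that combination.
    counts = {}
    for t, p in zip(y_true, y_pred):
        counts[(t, p)] = counts.get((t, p), 0) + 1
    return counts

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # hypothetical true labels
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]   # hypothetical predictions, one mistake
print(accuracy(y_true, y_pred))          # 0.875
print(confusion_counts(y_true, y_pred))  # {(0, 0): 4, (1, 1): 3, (1, 0): 1}
```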

Make your own predictions

You can make predictions with new texts. Class 0 is rec.motorcycles and class 1 is rec.sport.hockey:

Enter some text: i like to drive motor cycle on the highway
y_predict:
[0]

Enter some text: i like to play hockey game
y_predict:
[1]

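To do so, add these lines (inside nb_news() if you use the full program below, since transfer and estimator are local to that function):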

#!/usr/bin/python3
sentence = input("Enter some text: ")
sentence_x = transfer.transform([sentence])
y_predict = estimator.predict(sentence_x)
print("y_predict:\n", y_predict)
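The prediction comes back as a class number. One possible way to print the group name instead, and to avoid calling the vectorizer by hand, is to chain both steps with make_pipeline from sklearn.pipeline. This is a sketch on a tiny made-up training set. The classify helper and the hard-coded target_names list are assumptions for illustration (with the real data you would use news.target_names):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data standing in for the newsgroup posts.
target_names = ["rec.motorcycles", "rec.sport.hockey"]
train_texts = [
    "i ride my motorcycle on the highway",
    "the bike engine needs new oil",
    "the hockey team won the game",
    "the goalie stopped the puck on the ice",
]
train_labels = [0, 0, 1, 1]

# The pipeline applies the vectorizer and then the classifier in one call.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

def classify(sentence):
    # predict() expects a list of texts, so wrap the single sentence.
    label = model.predict([sentence])[0]
    return target_names[label]

print(classify("my motorcycle needs oil"))   # rec.motorcycles
print(classify("the hockey game"))           # rec.sport.hockey
```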

The program

The program below puts the loading, training and testing steps together:

#!/usr/bin/python3
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

def nb_news():
    # Load the posts of the two chosen newsgroups
    news = fetch_20newsgroups(subset="all", categories=['rec.sport.hockey', 'rec.motorcycles'])

    # Hold out part of the data for testing
    x_train, x_test, y_train, y_test = train_test_split(news.data, news.target)

    # Turn the texts into TF-IDF vectors; fit on the training texts only
    transfer = TfidfVectorizer()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)

    # Train the Naive Bayes classifier
    estimator = MultinomialNB()
    estimator.fit(x_train, y_train)

    # Predict the labels of the test texts
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)

    # Accuracy: the fraction of test texts classified correctly
    score = estimator.score(x_test, y_test)
    print("score:\n", score)

nb_news()
