Case study: text classification with a Bayesian classifier

Bayesian classifier for text classification

Text classification is a large part of modern machine learning applications and one of the foundations of natural language processing. Once text has been converted into numeric data, we can use naive Bayes to judge the topic, the sentiment, or even the genre of a paragraph or an article. Most automated collection of social media data today works this way: the text is first encoded as numbers, and the required information is then gathered according to the classification results. Although natural language processing is now dominated by deep learning, the Bayesian classifier remains a gem for text classification. Let's see how a Bayesian classifier performs text classification.

Introduction to text encoding techniques

Word count vector

Before we can classify, we must first encode the text as numbers. A common method is the word count vector. In this technique, a sample can be a paragraph or a whole article. If 10 distinct words appear across the samples, there will be 10 features (n=10). Each feature represents one word, and its value is the number of times that word appears in the sample: a discrete, non-negative integer.
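
To make the idea concrete, here is a minimal hand-rolled sketch of a word count vector using only the standard library. It is not sklearn's implementation. The tokenization rule (lowercase, keep tokens of two or more word characters) is an assumption chosen to resemble CountVectorizer's default behaviour, and the three sentences are the same ones used in the sklearn example that follows.

```python
import re
from collections import Counter

# Toy documents (the same three sentences as in the sklearn example).
docs = [
    "Machine learning is fascinating, it is wonderful",
    "Machine learning is a sensational techonology",
    "Elsa is a popular character",
]

def tokenize(text):
    # Assumption: lowercase the text and keep tokens of 2+ word characters,
    # so one-letter words such as "a" are dropped.
    return re.findall(r"\b\w\w+\b", text.lower())

# The vocabulary: one feature per distinct word, in alphabetical order.
vocab = sorted({word for doc in docs for word in tokenize(doc)})

# One row per document, one column per word, each entry a count.
rows = []
for doc in docs:
    counts = Counter(tokenize(doc))
    rows.append([counts[word] for word in vocab])

print(vocab)
for row in rows:
    print(row)
```

Each row is the encoded version of one sentence. Most entries are zero, which is why libraries store this kind of matrix in sparse form.
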
In sklearn, the word count vector is implemented by the CountVectorizer class in the feature_extraction.text module. Let's take a simple example:
sample = ["Machine learning is fascinating, it is wonderful","Machine learning is a sensational techonology","Elsa is a popular character"]
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
X = vec.fit_transform(sample) 
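print(X)
import pandas as pd
#Use the interface get_feature_names_out() to get the name of each column
#(on sklearn versions before 1.0, use get_feature_names() instead)
#Note that a sparse matrix cannot be passed to pandas directly, so convert it with toarray()
CVresult = pd.DataFrame(X.toarray(),columns = vec.get_feature_names_out())
CVresult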

result:

TF-IDF

TF-IDF stands for term frequency-inverse document frequency. It weights a word by how often it appears in a document, scaled down by how many documents contain it: the IDF shrinks as the word's document frequency grows. The more common a word is across documents, the smaller the weight it receives after encoding, which suppresses frequently occurring but uninformative words. In sklearn, we use the TfidfVectorizer class in feature_extraction.text to perform this encoding.

from sklearn.feature_extraction.text import TfidfVectorizer as TFIDF
vec = TFIDF()
X = vec.fit_transform(sample) 
X

result:

#Again, use the interface get_feature_names_out() to get the name of each column
#(on sklearn versions before 1.0, use get_feature_names() instead)
TFIDFresult = pd.DataFrame(X.toarray(),columns=vec.get_feature_names_out())
TFIDFresult

result:  

#After TF-IDF encoding, is the relative weight of the most frequent words reduced?
CVresult.sum(axis=0)/CVresult.sum(axis=0).sum()

result:  

character      0.0625
elsa           0.0625
fascinating    0.0625
is             0.2500
it             0.0625
learning       0.1250
machine        0.1250
popular        0.0625
sensational    0.0625
techonology    0.0625
wonderful      0.0625
dtype: float64
TFIDFresult.sum(axis=0) / TFIDFresult.sum(axis=0).sum()

result:

character      0.083071
elsa           0.083071
fascinating    0.064516
is             0.173225
it             0.064516
learning       0.110815
machine        0.110815
popular        0.083071
sensational    0.081192
techonology    0.081192
wonderful      0.064516
dtype: float64
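
To see where this reduction comes from, the weights can be rebuilt by hand. The sketch below is a standard-library approximation, not sklearn's code. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df the number of documents containing the word, followed by scaling each row to unit Euclidean length. We believe these match TfidfVectorizer's defaults, so the printed shares should be close to the listing above.

```python
import math
import re
from collections import Counter

docs = [
    "Machine learning is fascinating, it is wonderful",
    "Machine learning is a sensational techonology",
    "Elsa is a popular character",
]

def tokenize(text):
    # Assumption: lowercase, keep tokens of 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

tokens = [tokenize(doc) for doc in docs]
vocab = sorted({word for doc in tokens for word in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each word occur?
df = {word: sum(word in doc for doc in tokens) for word in vocab}

# Smoothed inverse document frequency (assumed formula, see lead-in).
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}

rows = []
for doc in tokens:
    counts = Counter(doc)
    raw = [counts[word] * idf[word] for word in vocab]   # tf * idf
    norm = math.sqrt(sum(value * value for value in raw))
    rows.append([value / norm for value in raw])         # unit-length row

# Share of the total weight carried by each word, as in the comparison above.
col_sums = [sum(row[j] for row in rows) for j in range(len(vocab))]
total = sum(col_sums)
share = {word: s / total for word, s in zip(vocab, col_sums)}

for word in vocab:
    print(f"{word:<12} {share[word]:.6f}")
```

The word "is" appears in every document, so it gets the smallest possible IDF of 1 and its share drops from 0.25 under raw counts, while words that appear in a single document gain weight.
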

Case study

Explore text data

In practice, processing text data is time-consuming and labor-intensive, especially for irregular long text, and it cannot be covered in a sentence or two. So here we use a text dataset that sklearn can load for us with fetch_20newsgroups. It is a corpus drawn from 20 online newsgroups, containing about 20,000 posts, all in English. Working with Chinese is harder, and you would need to load your own Chinese corpus. Since the main purpose of this example is to show how naive Bayes is used and how well it works, we use the English corpus.

from sklearn.datasets import fetch_20newsgroups
#The first time this dataset is used, it is downloaded when the function is called
data = fetch_20newsgroups()
#Usually we would display data directly to see what it contains, but the object returned by
#fetch_20newsgroups is huge and mixes in a lot of raw text, so it is hard to read that way
#The different types of news (the label names)
data.target_names
#fetch_20newsgroups is a loader function, so it has parameters we can pass when calling it
#With simple datasets we often call the loader with no arguments, but here there is too much data to explore conveniently
#So let's look at which parameters of fetch_20newsgroups can help us

result:

import numpy as np
import pandas as pd
categories = ["sci.space" #Science and technology - space
             ,"rec.sport.hockey" #Sports - Hockey
             ,"talk.politics.guns" #Politics - firearms
             ,"talk.politics.mideast"] #Politics - Middle East issues
train = fetch_20newsgroups(subset="train",categories = categories)
test = fetch_20newsgroups(subset="test",categories = categories)

#The result is still a dictionary-like structure, and we can extract its contents by key
train.target_names

result:

Encoding text data using TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer as TFIDF
Xtrain = train.data
Xtest = test.data
Ytrain = train.target
Ytest = test.target

tfidf = TFIDF().fit(Xtrain)
Xtrain_ = tfidf.transform(Xtrain)
Xtest_ = tfidf.transform(Xtest)

Xtrain_

result:

#On sklearn versions before 1.0, use get_feature_names() instead of get_feature_names_out()
tosee = pd.DataFrame(Xtrain_.toarray(),columns=tfidf.get_feature_names_out())
tosee.head()

result:

tosee.shape

result:  

Fit each naive Bayes model and view the results

from sklearn.naive_bayes import MultinomialNB, ComplementNB, BernoulliNB
from sklearn.metrics import brier_score_loss as BS
names = ["Multinomial","Complement","Bernoulli"] #Note that Gaussian naive Bayes does not accept sparse matrices
models = [MultinomialNB(),ComplementNB(),BernoulliNB()]
for name,clf in zip(names,models):
    clf.fit(Xtrain_,Ytrain)
    proba = clf.predict_proba(Xtest_)
    score = clf.score(Xtest_,Ytest)
    print(name)
    #Brier score for each of the 4 labels, computed one-vs-rest
    Bscore = []
    for i in range(len(np.unique(Ytrain))):
        #Binarize the labels (1 where the true label is i) so the metric receives a binary target
        bs = BS((Ytest == i).astype(int),proba[:,i],pos_label=1)
        Bscore.append(bs)
        print("\tBrier under {}:{:.3f}".format(train.target_names[i],bs))

    print("\tAverage Brier:{:.3f}".format(np.mean(Bscore)))
    print("\tAccuracy:{:.3f}".format(score))
    print("\n")

result:

Multinomial
Brier under rec.sport.hockey:0.018
Brier under sci.space:0.033
Brier under talk.politics.guns:0.030
Brier under talk.politics.mideast:0.026
Average Brier: 0.027
Accuracy: 0.975

Complement
Brier under rec.sport.hockey:0.023
Brier under sci.space:0.039
Brier under talk.politics.guns:0.039
Brier under talk.politics.mideast:0.033
Average Brier:0.033
Accuracy: 0.986

Bernoulli
Brier under rec.sport.hockey: 0.068
Brier under sci.space:0.025
Brier under talk.politics.guns:0.045
Brier under talk.politics.mideast:0.053
Average Brier: 0.048
Accuracy: 0.902
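
The per-label Brier scores above treat each label as its own binary problem: for label i, we compare the predicted probability of i with an indicator that is 1 when the true label is i and 0 otherwise. A minimal sketch of that calculation, using hypothetical toy data rather than the newsgroup predictions (the helper name brier_one_vs_rest is ours, not part of sklearn):

```python
def brier_one_vs_rest(y_true, proba, label):
    """Mean squared gap between P(label) and the 0/1 indicator of `label`."""
    total = 0.0
    for truth, row in zip(y_true, proba):
        indicator = 1 if truth == label else 0
        total += (row[label] - indicator) ** 2
    return total / len(y_true)

# Hypothetical toy data: 4 samples, 3 classes; each row of proba sums to 1.
y_true = [0, 1, 2, 1]
proba = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],
]

scores = [brier_one_vs_rest(y_true, proba, label) for label in range(3)]
for label, score in enumerate(scores):
    print("Brier under class {}: {:.4f}".format(label, score))
print("Average Brier: {:.4f}".format(sum(scores) / len(scores)))
```

A score of 0 means the probabilities were perfectly confident and correct, so lower is better. The last sample, where the model gave the true class only 0.4, is what pushes the score for class 1 up.
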

Probability calibration  

from sklearn.calibration import CalibratedClassifierCV
names = ["Multinomial"
       ,"Multinomial + Isotonic"
       ,"Multinomial + Sigmoid"
       ,"Complement"
       ,"Complement + Isotonic"
       ,"Complement + Sigmoid"
       ,"Bernoulli"
       ,"Bernoulli + Isotonic"
       ,"Bernoulli + Sigmoid"]
models = [MultinomialNB()
         ,CalibratedClassifierCV(MultinomialNB(), cv=2, method='isotonic')
         ,CalibratedClassifierCV(MultinomialNB(), cv=2, method='sigmoid')
         ,ComplementNB()
         ,CalibratedClassifierCV(ComplementNB(), cv=2, method='isotonic')
         ,CalibratedClassifierCV(ComplementNB(), cv=2, method='sigmoid')
         ,BernoulliNB()
         ,CalibratedClassifierCV(BernoulliNB(), cv=2, method='isotonic')
         ,CalibratedClassifierCV(BernoulliNB(), cv=2, method='sigmoid')
         ]
for name,clf in zip(names,models):
    clf.fit(Xtrain_,Ytrain)
    proba = clf.predict_proba(Xtest_)
    score = clf.score(Xtest_,Ytest)
    print(name)
    #Brier score for each label, computed one-vs-rest
    Bscore = []
    for i in range(len(np.unique(Ytrain))):
        #Binarize the labels (1 where the true label is i) so the metric receives a binary target
        bs = BS((Ytest == i).astype(int),proba[:,i],pos_label=1)
        Bscore.append(bs)
        print("\tBrier under {}:{:.3f}".format(train.target_names[i],bs))
    print("\tAverage Brier:{:.3f}".format(np.mean(Bscore)))
    print("\tAccuracy:{:.3f}".format(score))
    print("\n")

result:

Multinomial
Brier under rec.sport.hockey: 0.018
Brier under sci.space: 0.033
Brier under talk.politics.guns: 0.030
Brier under talk.politics.mideast: 0.026
Average Brier: 0.027
Accuracy: 0.975

Multinomial + Isotonic
Brier under rec.sport.hockey: 0.006
Brier under sci.space: 0.012
Brier under talk.politics.guns: 0.013
Brier under talk.politics.mideast: 0.009
Average Brier: 0.010
Accuracy: 0.973

Multinomial + Sigmoid
Brier under rec.sport.hockey: 0.006
Brier under sci.space: 0.012
Brier under talk.politics.guns: 0.013
Brier under talk.politics.mideast: 0.009
Average Brier: 0.010
Accuracy: 0.973

Complement
Brier under rec.sport.hockey: 0.023
Brier under sci.space: 0.039
Brier under talk.politics.guns: 0.039
Brier under talk.politics.mideast: 0.033
Average Brier: 0.033
Accuracy: 0.986

Complement + Isotonic
Brier under rec.sport.hockey: 0.004
Brier under sci.space: 0.007
Brier under talk.politics.guns: 0.009
Brier under talk.politics.mideast: 0.006
Average Brier: 0.006
Accuracy: 0.985

Complement + Sigmoid
Brier under rec.sport.hockey: 0.004
Brier under sci.space: 0.009
Brier under talk.politics.guns: 0.010
Brier under talk.politics.mideast: 0.007
Average Brier: 0.007
Accuracy: 0.986

Bernoulli
Brier under rec.sport.hockey: 0.068
Brier under sci.space: 0.025
Brier under talk.politics.guns: 0.045
Brier under talk.politics.mideast: 0.053
Average Brier: 0.048
Accuracy: 0.902

Bernoulli + Isotonic
Brier under rec.sport.hockey: 0.016
Brier under sci.space: 0.014
Brier under talk.politics.guns: 0.034
Brier under talk.politics.mideast: 0.033
Average Brier: 0.024
Accuracy: 0.952

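
To give a feel for what the isotonic option does, here is a small standard-library sketch of its core idea: replace the raw scores with a non-decreasing step function fitted to the observed labels, using the pool-adjacent-violators procedure. This is an illustration on hypothetical toy data, not what CalibratedClassifierCV runs internally. The helper names pava and brier are ours, and the fit is evaluated on the same points it was fitted on, whereas the code above fits the calibrator on separate folds (cv=2).

```python
def pava(values):
    """Pool adjacent violators: the non-decreasing sequence closest to
    `values` in squared error."""
    blocks = []  # each block holds [sum of values, number of values]
    for value in values:
        blocks.append([value, 1])
        # Merge backwards while a block's mean exceeds the mean of the next one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            block_sum, block_count = blocks.pop()
            blocks[-1][0] += block_sum
            blocks[-1][1] += block_count
    fitted = []
    for block_sum, block_count in blocks:
        fitted.extend([block_sum / block_count] * block_count)
    return fitted

def brier(probs, labels):
    """Mean squared gap between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

# Hypothetical toy data: raw P(class) from an overconfident model, already
# sorted in increasing order, together with the true 0/1 labels.
raw = [0.01, 0.02, 0.90, 0.95, 0.98, 0.99]
labels = [0, 0, 1, 0, 1, 1]

# Fitting the labels in score order gives the calibrated probability
# assigned to each raw score.
calibrated = pava(labels)

print("calibrated:", calibrated)
print("Brier before:", round(brier(raw, labels), 4))
print("Brier after: ", round(brier(calibrated, labels), 4))
```

The two middle samples disagree with the ordering of the raw scores, so they are pooled into a single probability of 0.5, which removes the large penalty for the confident mistake at 0.95.
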
It can be observed that however multinomial naive Bayes is calibrated, it does not perform as well as complement naive Bayes, so complement naive Bayes is the one to choose for this classification task. Among the complement models, the one calibrated with Sigmoid performs best: its accuracy is the highest of the calibrated models and its Brier scores are all far below 0.1, which makes it a very good model. In machine learning, naive Bayes may not be the most commonly used classification algorithm, but as a simple, fast classifier built directly on probability theory, it is mentioned often. Moreover, naive Bayes performs very well in text classification. As long as we provide enough data and make sensible use of high-dimensional features for training, naive Bayes can give surprisingly good results.


Posted on Wed, 01 Dec 2021 20:33:05 -0500 by Decipher