A naive Bayes classifier for text classification: case study

Text classification is a major application area of modern machine learning and one of the foundations of natural language processing. Once text has been converted into numerical data, a Bayesian classifier can help us judge the topic, the sentiment, or even the genre of a paragraph or an article. Today, most automated collection of social media data works by first encoding text as numbers and then gathering the required information according to the classification results. Although natural language processing is now largely dominated by deep learning, the naive Bayes classifier remains a gem for text classification. Let's see how a Bayesian classifier performs text classification.

Introduction to text encoding techniques

Word count vector

Before we can classify, we must first encode the text as numbers. A common method is the word count vector. In this technique, a sample can be a paragraph or a whole article. If 10 distinct words appear across the samples, there will be 10 features (n=10). Each feature represents one word, and its value is the number of times that word appears in the sample, a discrete non-negative integer. In sklearn, word count vectors are produced by the CountVectorizer class in the feature_extraction.text module. Let's take a simple example:

sample = ["Machine learning is fascinating, it is wonderful",
          "Machine learning is a sensational techonology",
          "Elsa is a popular character"]

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
X = vec.fit_transform(sample)
print(X)

# get_feature_names_out() returns the word behind each column
# (scikit-learn versions before 1.0 only offer get_feature_names())
import pandas as pd

# Note that a sparse matrix cannot be passed to pandas directly, hence toarray()
CVresult = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
CVresult

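One detail worth noticing in the word counts: "a" occurs in two of the three sentences but gets no column. CountVectorizer lowercases the text and, with its default token_pattern, only keeps tokens of two or more word characters. The sketch below reuses the same three sentences; the alternative pattern passed as token_pattern is an illustrative choice, not something the original example uses.

```python
from sklearn.feature_extraction.text import CountVectorizer

sample = ["Machine learning is fascinating, it is wonderful",
          "Machine learning is a sensational techonology",
          "Elsa is a popular character"]

# Default settings: lowercase, keep only tokens with 2+ word characters
default_vec = CountVectorizer().fit(sample)
default_vocab = sorted(default_vec.vocabulary_)
print(len(default_vocab), default_vocab)

# Illustrative alternative: a pattern that also keeps one-character tokens
short_vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit(sample)
short_vocab = sorted(short_vec.vocabulary_)

# Words that only the second vectorizer keeps
extra_words = sorted(set(short_vocab) - set(default_vocab))
print(extra_words)
```

Whether dropping one-letter words matters depends on the language and the task; for English topic classification it rarely does.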

TF-IDF

TF-IDF stands for term frequency-inverse document frequency. It weights a word by how often it appears in a document (TF), scaled by an inverse document frequency (IDF) that shrinks as the word appears in more documents. The more common a word is, the smaller the weight it receives after encoding, which suppresses frequent but uninformative words. In sklearn, this encoding is performed by the TfidfVectorizer class in feature_extraction.text.

from sklearn.feature_extraction.text import TfidfVectorizer as TFIDF

vec = TFIDF()
X = vec.fit_transform(sample)
X


# Again, get_feature_names_out() returns the word behind each column
TFIDFresult = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
TFIDFresult


# After TF-IDF encoding, is the weight of the most frequent words reduced?
# Share of each word under plain word counts:
CVresult.sum(axis=0) / CVresult.sum(axis=0).sum()

result:

character      0.0625
elsa           0.0625
fascinating    0.0625
is             0.2500
it             0.0625
learning       0.1250
machine        0.1250
popular        0.0625
sensational    0.0625
techonology    0.0625
wonderful      0.0625
dtype: float64

TFIDFresult.sum(axis=0) / TFIDFresult.sum(axis=0).sum()

result:

character      0.083071
elsa           0.083071
fascinating    0.064516
is             0.173225
it             0.064516
learning       0.110815
machine        0.110815
popular        0.083071
sensational    0.081192
techonology    0.081192
wonderful      0.064516
dtype: float64
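The drop in the share of "is" from 0.25 to about 0.17 follows from the weighting itself. With its default settings, scikit-learn's TfidfVectorizer multiplies each raw count by a smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing t, and then scales every row to unit Euclidean length. A minimal sketch that reproduces the encoding by hand on the same three sentences (variable names such as doc_freq and manual are only for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sample = ["Machine learning is fascinating, it is wonderful",
          "Machine learning is a sensational techonology",
          "Elsa is a popular character"]

counts = CountVectorizer().fit_transform(sample).toarray()  # raw term counts
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each word

# Smoothed inverse document frequency (the default, smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale every row to unit L2 norm (the default, norm="l2")
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(sample).toarray()
matches = np.allclose(manual, reference)
print(matches)
```

A word that occurs in every document, such as "is" here, gets the minimum IDF of 1, so relative to rarer words its weight shrinks.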

Case study

Explore text data

In practice, processing text data is time-consuming and labor-intensive, especially for long, irregular text, and it cannot be covered in a sentence or two. The data set we use here is therefore one that sklearn can load for us: fetch_20newsgroups. It is a corpus drawn from 20 online newsgroups, containing about 20,000 posts, all in English. Working with Chinese text would be harder, and you would need to load your own Chinese corpus. Since the main purpose of this example is to show how naive Bayes is used and how well it works, we use the English corpus.

from sklearn.datasets import fetch_20newsgroups

# The data set is downloaded the first time this function is called
data = fetch_20newsgroups()

# Normally we would display data itself to see what it contains, but the object
# returned by fetch_20newsgroups is huge and mixes in a lot of raw text, so it is hard to read.
# Look at the different news categories instead
data.target_names

# fetch_20newsgroups is a function, so it has parameters we can set.
# With simple data sets we usually call the loader without arguments, but here there is
# too much data to explore comfortably, so let's see which parameters of
# fetch_20newsgroups can help us


import numpy as np
import pandas as pd

categories = ["sci.space",              # science and technology - space
              "rec.sport.hockey",       # sports - hockey
              "talk.politics.guns",     # politics - firearms
              "talk.politics.mideast"]  # politics - Middle East issues

train = fetch_20newsgroups(subset="train", categories=categories)
test = fetch_20newsgroups(subset="test", categories=categories)

# The result is still a dictionary-like structure, so its content can be extracted by key
train.target_names

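The object returned by fetch_20newsgroups is a scikit-learn Bunch, a dictionary whose keys can also be read as attributes. The sketch below builds a tiny stand-in Bunch by hand (the three texts and their labels are made up for illustration, and nothing is downloaded) to show both access styles and a quick way to count the samples per class.

```python
import numpy as np
from sklearn.utils import Bunch

# Hypothetical miniature data set with the same three fields used in this case study
toy = Bunch(
    data=["the puck hit the post",
          "the probe reached orbit",
          "a late goal won the game"],
    target=np.array([0, 1, 0]),
    target_names=["rec.sport.hockey", "sci.space"],
)

print(list(toy.keys()))

# Key access and attribute access return the very same object
same_object = toy["data"] is toy.data
print(same_object)

# Number of samples per class
class_counts = dict(zip(toy.target_names, np.bincount(toy.target)))
for label, count in class_counts.items():
    print(label, count)
```

On the real data, the same np.bincount call applied to train.target shows whether the four categories are roughly balanced, which is worth knowing before comparing accuracies.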

Encoding text data using TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer as TFIDF

Xtrain = train.data
Xtest = test.data
Ytrain = train.target
Ytest = test.target

# Learn the vocabulary and IDF weights on the training set only, then encode both sets
tfidf = TFIDF().fit(Xtrain)
Xtrain_ = tfidf.transform(Xtrain)
Xtest_ = tfidf.transform(Xtest)

Xtrain_


tosee = pd.DataFrame(Xtrain_.toarray(), columns=tfidf.get_feature_names_out())
tosee.head()


tosee.shape


Fit each naive Bayes model and compare the results

from sklearn.naive_bayes import MultinomialNB, ComplementNB, BernoulliNB
from sklearn.metrics import brier_score_loss as BS

names = ["Multinomial", "Complement", "Bernoulli"]
# Note that Gaussian naive Bayes does not accept sparse matrices
models = [MultinomialNB(), ComplementNB(), BernoulliNB()]

for name, clf in zip(names, models):
    clf.fit(Xtrain_, Ytrain)
    y_pred = clf.predict(Xtest_)
    proba = clf.predict_proba(Xtest_)
    score = clf.score(Xtest_, Ytest)
    print(name)
    # Brier score under each of the 4 label values
    Bscore = []
    for i in range(len(np.unique(Ytrain))):
        # brier_score_loss expects binary targets, so label i is marked as the positive class
        bs = BS((Ytest == i).astype(int), proba[:, i])
        Bscore.append(bs)
        print("\tBrier under {}:{:.3f}".format(train.target_names[i], bs))
    print("\tAverage Brier:{:.3f}".format(np.mean(Bscore)))
    print("\tAccuracy:{:.3f}".format(score))
    print("\n")

result:

Multinomial
    Brier under rec.sport.hockey:0.018
    Brier under sci.space:0.033
    Brier under talk.politics.guns:0.030
    Brier under talk.politics.mideast:0.026
    Average Brier:0.027
    Accuracy:0.975

Complement
    Brier under rec.sport.hockey:0.023
    Brier under sci.space:0.039
    Brier under talk.politics.guns:0.039
    Brier under talk.politics.mideast:0.033
    Average Brier:0.033
    Accuracy:0.986

Bernoulli
    Brier under rec.sport.hockey:0.068
    Brier under sci.space:0.025
    Brier under talk.politics.guns:0.045
    Brier under talk.politics.mideast:0.053
    Average Brier:0.048
    Accuracy:0.902
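The per-class Brier score printed here treats each label in turn as the positive class: it is the mean squared difference between the predicted probability of that label and a 0/1 indicator of whether the sample really carries it, so 0 is a perfect score and lower is better. A minimal check with made-up labels and probabilities (not taken from the newsgroup models) that the hand computation agrees with brier_score_loss:

```python
import numpy as np
from sklearn.metrics import brier_score_loss

# Hypothetical true labels and predicted class probabilities for 5 samples, 3 classes
y_true = np.array([0, 1, 2, 1, 0])
proba = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.7, 0.1],
                  [0.1, 0.2, 0.7],
                  [0.3, 0.4, 0.3],
                  [0.6, 0.3, 0.1]])

manual_scores, sklearn_scores = [], []
for i in range(proba.shape[1]):
    is_class_i = (y_true == i).astype(int)  # 1 where the sample has label i, else 0
    manual_scores.append(np.mean((proba[:, i] - is_class_i) ** 2))
    sklearn_scores.append(brier_score_loss(is_class_i, proba[:, i]))

agree = np.allclose(manual_scores, sklearn_scores)
print(agree)
print(np.round(manual_scores, 3))
```

Averaging the per-class values, as the loop in this section does, gives a single number per model that can be compared across classifiers.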

Probability calibration

from sklearn.calibration import CalibratedClassifierCV

names = ["Multinomial",
         "Multinomial + Isotonic",
         "Multinomial + Sigmoid",
         "Complement",
         "Complement + Isotonic",
         "Complement + Sigmoid",
         "Bernoulli",
         "Bernoulli + Isotonic",
         "Bernoulli + Sigmoid"]

models = [MultinomialNB(),
          CalibratedClassifierCV(MultinomialNB(), cv=2, method='isotonic'),
          CalibratedClassifierCV(MultinomialNB(), cv=2, method='sigmoid'),
          ComplementNB(),
          CalibratedClassifierCV(ComplementNB(), cv=2, method='isotonic'),
          CalibratedClassifierCV(ComplementNB(), cv=2, method='sigmoid'),
          BernoulliNB(),
          CalibratedClassifierCV(BernoulliNB(), cv=2, method='isotonic'),
          CalibratedClassifierCV(BernoulliNB(), cv=2, method='sigmoid')]

for name, clf in zip(names, models):
    clf.fit(Xtrain_, Ytrain)
    y_pred = clf.predict(Xtest_)
    proba = clf.predict_proba(Xtest_)
    score = clf.score(Xtest_, Ytest)
    print(name)
    Bscore = []
    for i in range(len(np.unique(Ytrain))):
        # brier_score_loss expects binary targets, so label i is marked as the positive class
        bs = BS((Ytest == i).astype(int), proba[:, i])
        Bscore.append(bs)
        print("\tBrier under {}:{:.3f}".format(train.target_names[i], bs))
    print("\tAverage Brier:{:.3f}".format(np.mean(Bscore)))
    print("\tAccuracy:{:.3f}".format(score))
    print("\n")

result:

Multinomial
    Brier under rec.sport.hockey:0.018
    Brier under sci.space:0.033
    Brier under talk.politics.guns:0.030
    Brier under talk.politics.mideast:0.026
    Average Brier:0.027
    Accuracy:0.975

Multinomial + Isotonic
    Brier under rec.sport.hockey:0.006
    Brier under sci.space:0.012
    Brier under talk.politics.guns:0.013
    Brier under talk.politics.mideast:0.009
    Average Brier:0.010
    Accuracy:0.973

Multinomial + Sigmoid
    Brier under rec.sport.hockey:0.006
    Brier under sci.space:0.012
    Brier under talk.politics.guns:0.013
    Brier under talk.politics.mideast:0.009
    Average Brier:0.010
    Accuracy:0.973

Complement
    Brier under rec.sport.hockey:0.023
    Brier under sci.space:0.039
    Brier under talk.politics.guns:0.039
    Brier under talk.politics.mideast:0.033
    Average Brier:0.033
    Accuracy:0.986

Complement + Isotonic
    Brier under rec.sport.hockey:0.004
    Brier under sci.space:0.007
    Brier under talk.politics.guns:0.009
    Brier under talk.politics.mideast:0.006
    Average Brier:0.006
    Accuracy:0.985

Complement + Sigmoid
    Brier under rec.sport.hockey:0.004
    Brier under sci.space:0.009
    Brier under talk.politics.guns:0.010
    Brier under talk.politics.mideast:0.007
    Average Brier:0.007
    Accuracy:0.986

Bernoulli
    Brier under rec.sport.hockey:0.068
    Brier under sci.space:0.025
    Brier under talk.politics.guns:0.045
    Brier under talk.politics.mideast:0.053
    Average Brier:0.048
    Accuracy:0.902

Bernoulli + Isotonic
    Brier under rec.sport.hockey:0.016
    Brier under sci.space:0.014
    Brier under talk.politics.guns:0.034
    Brier under talk.politics.mideast:0.033
    Average Brier:0.024
    Accuracy:0.952
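CalibratedClassifierCV wraps a classifier and learns a mapping from its raw scores to better-behaved probabilities. With cv=2 the training data is split into two folds; for each split, a copy of the naive Bayes model is fitted on one part and the calibrator (isotonic regression or a sigmoid) on the other, and the resulting pairs are averaged at prediction time. The sketch below runs this on small synthetic count data generated with NumPy, purely as a stand-in for the TF-IDF matrix; the sizes, rates, and seed are arbitrary assumptions, and the numbers it prints say nothing about the newsgroup results.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss
from sklearn.naive_bayes import ComplementNB

rng = np.random.default_rng(0)
n_per_class, n_features, n_classes = 100, 20, 3

# Synthetic "word counts": every class draws from its own Poisson rates
rates = rng.uniform(0.5, 3.0, size=(n_classes, n_features))
X = np.vstack([rng.poisson(rates[c], size=(n_per_class, n_features))
               for c in range(n_classes)])
y = np.repeat(np.arange(n_classes), n_per_class)

# Shuffle, then use the first half for training and the second half for testing
order = rng.permutation(len(y))
X, y = X[order], y[order]
X_train, y_train, X_test, y_test = X[:150], y[:150], X[150:], y[150:]

raw = ComplementNB().fit(X_train, y_train)
calibrated = CalibratedClassifierCV(ComplementNB(), cv=2, method="sigmoid")
calibrated.fit(X_train, y_train)

for label, model in [("raw", raw), ("sigmoid", calibrated)]:
    proba = model.predict_proba(X_test)
    briers = [brier_score_loss((y_test == i).astype(int), proba[:, i])
              for i in range(n_classes)]
    print(f"{label}: average Brier {np.mean(briers):.3f}, "
          f"accuracy {model.score(X_test, y_test):.3f}")
```

Calibration changes the probabilities much more than the predicted labels, which matches the pattern in the output of this section: Brier scores fall sharply while accuracy barely moves.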

It can be observed that however multinomial naive Bayes is calibrated, it does not match complement naive Bayes, so complement naive Bayes is the model to choose for this classification task. For complement naive Bayes, the model calibrated with the sigmoid method performs best: its accuracy is the highest, and its log loss and Brier score are both below 0.1, which makes it a very satisfactory model.

Naive Bayes may not be the most commonly used classification algorithm in machine learning, but it is often mentioned because it is simple, fast, and built directly on probability calculations. Moreover, its performance in text classification is truly excellent. As long as we provide enough data and make sensible use of high-dimensional features in training, naive Bayes can deliver surprisingly good results.
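To apply the chosen model to new raw text, the vectorizer and the classifier can be chained so that a single object goes from strings to labels. A minimal sketch using scikit-learn's make_pipeline on a made-up corpus (the six texts and the two labels are hypothetical, chosen so that the classes share almost no vocabulary):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

# Hypothetical miniature training corpus
texts = ["the goalie blocked the puck",
         "a late goal won the hockey game",
         "the team scored on the power play",
         "the rocket reached low orbit",
         "the probe sent images from mars",
         "the shuttle launch was delayed"]
labels = ["hockey", "hockey", "hockey", "space", "space", "space"]

# TF-IDF encoding followed by complement naive Bayes, fitted in one call
model = make_pipeline(TfidfVectorizer(), ComplementNB())
model.fit(texts, labels)

new_texts = ["the hockey team scored a goal",
             "the probe reached mars orbit"]
predictions = [str(p) for p in model.predict(new_texts)]
print(predictions)
```

Keeping the vectorizer inside the pipeline guarantees that new text is encoded with the vocabulary and IDF weights learned during training.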

1 December 2021, 20:33
