Text classification based on multinomial NB
1, Data processing
1. Data acquisition
Sogou lab Sohu News data download address: http://www.sogou.com/labs/resource
2. Data preprocessing
The downloaded data is GBK-encoded; it was converted to UTF-8 with UltraEdit and then reorganized into JSON for easier processing.
Next, the news category of each article was determined from its URL prefix, giving 15 categories in total; 11 categories with a relatively even distribution were kept.
Each article is truncated to its first 100 words, segmented with jieba, and tokens whose part-of-speech tag is 'x' (punctuation) are removed to produce the final data (a preprocessing sketch follows the data description below).
(1) News text data: one article per line, each consisting of space-separated words; 17600 lines of training text and 4324 lines of test text;
(2) News label data: one number per line, giving the category number of the corresponding article; 17600 lines of training labels and 4324 lines of test labels.
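The preprocessing script itself is not listed in this post; a minimal sketch of the segmentation step, assuming jieba's part-of-speech tagger and treating the 100-word limit as a cap on the number of kept tokens, could look like this:

```python
import jieba.posseg as pseg

def preprocess(text, max_words=100):
    """Segment one news article with jieba, drop tokens whose POS tag is 'x'
    (punctuation), and keep at most the first max_words words."""
    words = []
    for token in pseg.cut(text):
        if token.flag == 'x':   # 'x' marks punctuation / non-word symbols
            continue
        words.append(token.word)
        if len(words) >= max_words:
            break
    # each output line is a space-separated word sequence, matching the data format above
    return ' '.join(words)
```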
2, Discrete naive Bayes -- multinomial NB algorithm
1. NB in text classification
For a text classification task, the input is a set of documents and the output is the category of each document, a task on which naive Bayes performs well. NB is a probabilistic classifier: for each document it uses the formula below to compute the probability of every class and takes the class with the highest probability as the classification result.
The starting point is the Bayes decision rule, which picks the class with the highest posterior probability given document d:

$$\hat{c} = \underset{c \in C}{\operatorname{argmax}}\ P(c \mid d) = \underset{c \in C}{\operatorname{argmax}}\ P(d \mid c)\,P(c)$$

The prior probability P(c) is estimated as the fraction of training documents that belong to class c:

$$P(c) = \frac{N_c}{N_{doc}} \qquad \text{(formula 1)}$$

where Nc is the number of training documents of class c and Ndoc is the total number of documents.
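For example, with the 17600 training documents used later, a class that contained 1600 of them would get a prior of 1600 / 17600 ≈ 0.091 (the per-class count here is purely illustrative).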
Since document d is represented as a set of features (words) w, the initial likelihood of each word given the class is:

$$P(w_i \mid c) = \frac{count(w_i, c)}{\sum_{w \in V} count(w, c)}$$

that is, the count of the word in documents of class c divided by the total number of word tokens in documents of class c.
However, if a word never occurs in class c, its likelihood is 0, and since all feature likelihoods are multiplied together, the probability of the whole class also becomes 0. Laplace (add-one) smoothing is therefore applied:

$$P(w_i \mid c) = \frac{count(w_i, c) + 1}{\sum_{w \in V}\bigl(count(w, c) + 1\bigr)} = \frac{count(w_i, c) + 1}{\Bigl(\sum_{w \in V} count(w, c)\Bigr) + |V|} \qquad \text{(formula 2)}$$
where V is the vocabulary of all words. Because each probability is very small (e.g. 0.0001) and many such values are multiplied together, the product quickly underflows. To avoid underflow, the logarithm is introduced and the computation is carried out in log space. Using the occurrence counts of the bag-of-words model as features, the classifier becomes:

$$c_{NB} = \underset{c \in C}{\operatorname{argmax}}\ \Bigl[\log P(c) + \sum_{i \in positions} \log P(w_i \mid c)\Bigr] \qquad \text{(formula 3)}$$
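A quick standalone illustration of why log space is needed (not part of the classifier code):

```python
import numpy as np

# 300 word likelihoods of 0.0001 each: the direct product underflows to 0.0,
# while the sum of logs remains a perfectly usable score.
probs = np.full(300, 1e-4)
print(np.prod(probs))          # 0.0, because 1e-1200 is below the float64 range
print(np.sum(np.log(probs)))   # about -2763.1
```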
2. Training the naive Bayes classifier
Training naive Bayes is essentially the process of computing the prior probabilities and the likelihood functions.
2.1 Calculation of the prior probability P(c)
P(c) answers the question: among all documents, what fraction belongs to class c? Suppose there are Ndoc documents in the training data; counting the Nc documents that belong to class c gives the prior probability:

$$P(c) = \frac{N_c}{N_{doc}}$$
2.2 Calculation of the likelihood function P(wi|c)
Since the bag-of-words model is used to represent a document d, for each word wi in d, find all training documents of class c and count how many times wi occurs in them: count(wi, c).
Then count the total number of word tokens in the training documents of class c; the ratio of the two is the value of the likelihood function:

$$P(w_i \mid c) = \frac{count(w_i, c)}{\sum_{w \in V} count(w, c)}$$

where V is the vocabulary. (If a word is in the vocabulary but never occurs in class c, then count(w, c) = 0.)
Note that Laplace smoothing (formula 2) is also required in the actual training.
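For instance, if a word never occurs in class c (count(wi, c) = 0), the documents of class c contain 5000 word tokens in total (an illustrative number), and |V| = 58513 as in the run shown later, the smoothed likelihood is (0 + 1) / (5000 + 58513) ≈ 1.6 × 10⁻⁵ rather than 0.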
3. Code implementation
Next, the code is walked through in the order in which it runs.
3.1 Execution class
```python
class Implementation():
    '''
    Execution class
    accuracy: compute the accuracy
    main: main entry point
    readFile: read the dataset
    '''
    def __init__(self):
        self.labels = dict()

    @staticmethod
    def accuracy(prediction, test):
        acc = 0
        test_list = list(test)
        for idx, result in enumerate(prediction):
            if result == test_list[idx]:
                acc += 1
        return acc / len(test)

    def main(self):
        x_train, y_train, x_test, y_test = self.readFile()
        nb = MultinominalNB()
        """
        save: save the trained model
        cached: load the saved model
        """
        nb.fit(x_train, y_train, save=True)
        predictions = nb.predict(x_test, cached=False)
        print('Accuracy: ', self.accuracy(predictions, y_test))

    def readFile(self):
        x_train = open('mydata/train_contents.txt', encoding="utf-8").read().split('\n')
        y_train = open('mydata/train_labels.txt', encoding="utf-8").read().split('\n')
        x_test = open('mydata/test_contents.txt', encoding="utf-8").read().split('\n')
        y_test = open('mydata/test_labels.txt', encoding="utf-8").read().split('\n')
        print('Train data: ', len(x_train), 'Testing data: ', len(x_test),
              'Total: ', len(x_train) + len(x_test))
        return x_train, y_train, x_test, y_test
```
This class bundles the helper methods: computing the accuracy, reading the dataset, and driving training and prediction.
In the main() method, MultinominalNB() is the classifier class built for this post; it is discussed later. nb.fit() trains the model, and its parameter save controls whether the trained model is saved. nb.predict() classifies the test set and returns the predictions; its parameter cached controls whether a previously saved model is loaded for classification. The predictions are then passed to accuracy(), which compares them with the test-set labels and returns the fraction of correct predictions.
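A minimal entry point for running the whole pipeline (a hypothetical addition; the script's actual entry point is not shown in this post) would be:

```python
if __name__ == '__main__':
    Implementation().main()
```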
3.2 fit(): training the model
The fit() method implements the training of the model.
```python
def fit(self, x, y, save=False):
    self.docs = x
    self.classes = y
    num_doc = len(self.docs)              # number of training documents
    uniq_cls = np.unique(self.classes)    # distinct labels
    self.vocab = self.buildGlobalVocab()  # build the vocabulary of all words
    vocab_cnt = len(self.vocab)           # number of words in the vocabulary
    t = time()
    for _cls in uniq_cls:
        cls_docs_num = self.countCls(_cls)  # number of documents with the current label
        self.logprior[_cls] = np.log(cls_docs_num / num_doc)
        self.buildClassVocab(_cls)          # build the word list of the current class
        class_vocab_counter = Counter(self.class_vocab[_cls])  # occurrences of each word in the current class
        class_vocab_cnt = len(self.class_vocab[_cls])           # total number of word tokens in the current class
        for word in self.vocab:
            w_cnt = class_vocab_counter[word]
            self.loglikelihood[word, _cls] = np.log((w_cnt + 1) / (class_vocab_cnt + vocab_cnt))
    if save:
        self.saveModel()
    print('Training finished at {} mins.'.format(round((time() - t) / 60, 2)))
```
In main():
```python
nb = MultinominalNB()
nb.fit(x_train, y_train, save=True)
```
First, the datasets x_train and y_train are read in, and save=True is set for the first training run.
```python
self.docs = x
self.classes = y
```
Inside fit(), x_train is first stored in self.docs and y_train in self.classes.
```python
num_doc = len(self.docs)            # number of training documents
uniq_cls = np.unique(self.classes)  # distinct labels
```
This computes the number of training documents and the set of distinct labels; the output is as follows:
```
17600 ['1' '10' '11' '2' '3' '4' '5' '6' '7' '8' '9']
```
```python
self.vocab = self.buildGlobalVocab()  # build the vocabulary of all words
vocab_cnt = len(self.vocab)           # number of words in the vocabulary
```
The method **buildGlobalVocab()** (discussed below) builds the vocabulary of all words; the returned vocabulary is stored in vocab, and the total number of words in it is computed.
Part of the output is shown below:
```
['lower' 'Chinese style' 'Qiao Shi' 'Yes' 'Hendry' 'with' 'above' 'as well as' 'Buddhism' 'Foshan' 'Foshan City' 'all day' 'Family portraits' 'Totally enclosed' 'Global' 'Eight balls' 'champion' 'Droiyan' 'Come out' 'preliminary' 'after' 'ton' 'contain' 'Monday' 'Wednesday' 'Warming up' 'international' 'Image & Text' 'stay' 'recovery' 'foreign exchange' 'master' 'Bridge' 'chase ' 'implementation' 'Gong Yi' 'already' 'year' 'happiness' 'Guangfo' 'start' 'Zhang Guanghao' 'population' 'implement' 'challenge round' 'morgan ' 'data' 'construction' 'Travel' 'day' 'daily' 'Rixun' 'time' 'show' 'And the first' 'month' 'come from' 'cup' 'Liang Baozhong' 'automobile' 'Henan' 'Photo' 'world' 'Club' 'weak from fatigue' 'of' 'station' 'Second quarter' 'Phase II' 'The fourth stop' 'control' 'rice' 'experience' 'Economics' 'repair' 'Emerald' 'huashan' 'England' 'performance' 'express' 'Information' 'truck' 'rise' 'Vehicle passage' 'conduct' 'sign' 'player' 'Invitational tournament' 'Metropolitan Network' 'sales volume' 'Changsha' 'Chen' 'limit' 'Shaanxi' 'Shaanxi Province' 'retail' 'awarding ceremony' 'qualifying competition' 'height' 'shark']
58513
```
```python
for _cls in uniq_cls:
    cls_docs_num = self.countCls(_cls)  # number of documents with the current label
    self.logprior[_cls] = np.log(cls_docs_num / num_doc)
    self.buildClassVocab(_cls)          # build the word list of the current class
    class_vocab_counter = Counter(self.class_vocab[_cls])  # occurrences of each word in the current class
    class_vocab_cnt = len(self.class_vocab[_cls])           # total number of word tokens in the current class
```
For each class, the method **countCls()** first counts the number of documents with the current label, and the prior probability (formula 1) is computed and stored in logprior[_cls]. **buildClassVocab(_cls)** then builds the word list of the current class and stores it in class_vocab; Counter records the number of occurrences of each word of class _cls in class_vocab_counter, and the total number of word tokens in the current class is stored in class_vocab_cnt.
```python
for word in self.vocab:
    w_cnt = class_vocab_counter[word]
    self.loglikelihood[word, _cls] = np.log((w_cnt + 1) / (class_vocab_cnt + vocab_cnt))
```
The likelihood of every vocabulary word is then computed with Laplace smoothing according to formula 2 and stored in log space.
```python
if save:
    self.saveModel()
print('Training finished at {} mins.'.format(round((time() - t) / 60, 2)))
```
Finally, the model is saved if requested and the training time is printed.
3.3 buildGlobalVocab(self): build the vocabulary of all words
```python
def buildGlobalVocab(self):
    # build the vocabulary over all training documents
    vocab = []
    for doc in self.docs:
        vocab.extend(self.cleanDoc(doc))
    return np.unique(vocab)
```
For each document in the training set, cleanDoc() first splits it into words; np.unique() then removes duplicates (and sorts the result), and the resulting vocabulary is returned.
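A tiny illustration of what np.unique does to the collected word list (made-up values):

```python
import numpy as np

words = ['happiness', 'time', 'happiness', 'Travel']
print(np.unique(words))  # ['Travel' 'happiness' 'time'] -- duplicates removed, result sorted
```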
3.4 cleanDoc(): split a document into words
```python
def cleanDoc(doc):
    # word segmentation: the preprocessed documents are space-separated words
    return re.split(' ', doc)

# example
vocab_cnt = cleanDoc(x_train_test[1])
print(x_train_test[1])
print(vocab_cnt)
```
The output is as follows:
```
Happy moment family photo mm / DD / yyyy family photo of Cuihua Mountain in Shaanxi Province mm / DD / yyyy family photo of Cuihua Mountain in Shaanxi Province mm / DD / yyyy family photo of Cuihua Mountain in Shaanxi Province mm / DD / yyyy Shaanxi
['happiness', 'time', 'Family portraits', 'Travel', 'Photo', 'year', 'month', 'day', 'Shaanxi Province', 'Emerald', 'huashan', 'Family portraits', 'Travel', 'Photo', 'year', 'month', 'day', 'Shaanxi Province', 'Emerald', 'huashan', 'Family portraits', 'Travel', 'Photo', 'year', 'month', 'day', 'Shaanxi Province', 'Emerald', 'huashan', 'Family portraits', 'Travel', 'Photo', 'year', 'month', 'day', 'Shaanxi']
```
3.5 countCls(): count the documents of a given class in the dataset
```python
def countCls(self, cls):
    # count the number of documents in docs whose label is cls
    cnt = 0
    for idx, _docs in enumerate(self.docs):
        if (self.classes[idx] == cls):
            cnt += 1
    return cnt
```
The dataset is traversed and the counter is incremented whenever a document's label matches the given class.
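As an aside, the counts for all classes could also be obtained in a single pass with collections.Counter (a hypothetical alternative, not the code used here):

```python
from collections import Counter

# class_counts['3'] equals countCls('3') for the training labels y_train
class_counts = Counter(y_train)
```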
3.6 buildClassVocab(self, _cls): build the word list of a class
```python
def buildClassVocab(self, _cls):
    # collect all words from training documents whose label is _cls
    curr_word_list = []
    for idx, doc in enumerate(self.docs):
        if self.classes[idx] == str(_cls):
            curr_word_list.extend(self.cleanDoc(doc))
    # store the word list: create it for a new class, otherwise extend the existing one
    if _cls not in self.class_vocab:
        self.class_vocab[_cls] = curr_word_list
    else:
        self.class_vocab[_cls].extend(curr_word_list)
```
The whole training set is traversed to collect all words from documents of this class, and the collected words are stored in class_vocab[_cls].
3.7 saveModel(self) and readModel(): saving and loading the model
```python
def saveModel(self):
    try:
        f = open("classifier.txt", "wb")
        pickle.dump([self.logprior, self.vocab, self.loglikelihood, self.classes], f)
        f.close()
    except:
        print('Error saving the model')

@staticmethod
def readModel():
    try:
        f = open("classifier.txt", "rb")
        model = pickle.load(f)
        f.close()
        return model
    except:
        print('Error reading the model')
```
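Once saved, the model can be restored without retraining; for example (illustrative usage):

```python
# restore the four pickled objects written by saveModel()
logprior, vocab, loglikelihood, classes = MultinominalNB.readModel()
```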
3.8 predict(self, test_docs, cached=False): classify the test set
```python
def predict(self, test_docs, cached=False):
    '''
    Classify the test set
    :param test_docs: test dataset
    :param cached: whether to load the saved training model
    :return: predicted class for each test document
    '''
    output = []
    if not cached:
        logprior = self.logprior
        vocab = self.vocab
        loglikelihood = self.loglikelihood
        classes = self.classes
    else:
        logprior, vocab, loglikelihood, classes = self.readModel()
    for doc in test_docs:
        uniq_cls = np.unique(classes)
        sum = dict()  # per-class score: log prior plus summed log likelihoods
        for _cls in uniq_cls:
            sum[_cls] = logprior[_cls]
            for word in self.cleanDoc(doc):
                if word in vocab:
                    try:
                        sum[_cls] += loglikelihood[word, _cls]
                    except:
                        print(sum, _cls)
        result = np.argmax(list(sum.values()))
        output.append(uniq_cls[result])
    return output
```
First, the flag cached decides whether the saved model is loaded from disk or the attributes of the current object are used.
Then, for each document in the test set, the score of every class is computed according to formula 3, and the class with the highest score is appended to output.
np.argmax is used here:
```python
import numpy as np

a = np.array([3, 1, 2, 4, 6, 1])
b = np.argmax(a)  # index of the maximum element of a: the maximum 6 sits at index 4 (indices start from 0)
print(b)  # 4
```
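Note that np.argmax(list(sum.values())) returns an index into the score dictionary's values; mapping it back with uniq_cls[result] works because the dictionary is filled in the same order as uniq_cls and Python dicts preserve insertion order (guaranteed since Python 3.7).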
4. Implementation results
```
C:\Users\BC_PC\AppData\Local\Programs\Python\Python39\python.exe C:/Users/BC_PC/Desktop/multinomial-nb-master/mynb.py
Train data: 17600 Testing data: 4324 Total: 21924
Training finished at 0.12 mins.
Accuracy: 0.8533765032377428

Process finished with exit code 0
```