Naive Bayes (NB) for Chinese text classification

Text classification based on multinomial NB

1, Data processing

1. Data acquisition

Sogou Labs Sohu news data, download address: http://www.sogou.com/labs/resource

2. Data preprocessing

The downloaded data is in GBK encoding; it is converted to UTF-8 with UltraEdit and then reorganized into JSON for easier processing.
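
The write-up uses UltraEdit for the transcoding; as an alternative, the same GBK to UTF-8 conversion can be sketched in a few lines of Python (the file names below are placeholders, not the actual dump names):

# read the raw dump as GBK and write it back out as UTF-8
with open('sohu_news_gbk.txt', encoding='gbk', errors='ignore') as src, \
        open('sohu_news_utf8.txt', 'w', encoding='utf-8') as dst:
    dst.write(src.read())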

Next, the news items are grouped by the category encoded in their URL prefix, which yields 15 categories in total; the 11 categories with an even distribution are kept.

The first 100 words of each news item are taken, segmented with jieba, and words tagged with part of speech 'x' (punctuation) are removed to produce the final data (a sketch of the segmentation step follows the list below):

(1) News text data: one news item per line, each consisting of space-separated words; 17,600 lines of training text and 4,324 lines of test text;

(2) News label data: one number per line, giving the category number of the corresponding news item; 17,600 lines of training labels and 4,324 lines of test labels.
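
A minimal sketch of the segmentation step described above, assuming jieba is installed; the function name and the raw_news variable are illustrative, not taken from the author's preprocessing script:

import jieba.posseg as pseg

def segment(raw_news, max_words=100):
    # POS-tag the text with jieba and drop tokens tagged 'x' (punctuation)
    words = [w.word for w in pseg.cut(raw_news) if w.flag != 'x']
    # keep only the first max_words words of each news item
    return ' '.join(words[:max_words])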

2, Discrete Naive Bayes: the multinomial NB algorithm

1. NB in text classification

For a text classification task, the input is a set of documents and the output is each document's category. Naive Bayes performs well on this task. NB is a probabilistic classifier: for each document it computes the probability of every class and assigns the document to the class with the highest probability.

The initial formula is the Bayes decision rule: the predicted class is the one with the highest posterior probability given the document,

c_NB = argmax_{c ∈ C} P(c | d) = argmax_{c ∈ C} P(d | c) P(c)

The prior probability P(c) is computed as the fraction of training documents that belong to class c.

P(c) = Nc / Ndoc    (formula 1)

where Nc is the number of documents of class c in the data and Ndoc is the total number of documents.

Since document d is represented as a set of features (words) w, the initial likelihood of each word given the class is:

P(wi | c) = count(wi, c) / Σ_{w ∈ V} count(w, c)

That is, the ratio of the count of the given word in category c to the count of all words in category c.

However, if a given word never appears in category c, its likelihood becomes 0; since all the feature likelihoods are multiplied together, the probability of the whole category also collapses to 0. Laplace smoothing is therefore applied, giving:

P(wi | c) = (count(wi, c) + 1) / (Σ_{w ∈ V} count(w, c) + |V|)    (formula 2)

where V is the set of all our words (the vocabulary). Because each probability value is very small (e.g. 0.0001) and many such small values are multiplied together, the result keeps shrinking. To avoid underflow in the calculation, the log function is introduced and the computation is carried out in log space. Using the occurrence frequency of each word wi from the bag-of-words model as the feature, the decision rule becomes:

c_NB = argmax_{c ∈ C} [ log P(c) + Σ_i log P(wi | c) ]    (formula 3)
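
To make formula 3 concrete, here is a minimal self-contained sketch with made-up numbers (the class names, words, and probabilities are toy values, not taken from the dataset):

import numpy as np

# toy model: log priors and per-word log likelihoods for two classes
logprior = {'sports': np.log(0.5), 'finance': np.log(0.5)}
loglikelihood = {
    ('champion', 'sports'): np.log(0.01), ('champion', 'finance'): np.log(0.001),
    ('retail', 'sports'): np.log(0.001), ('retail', 'finance'): np.log(0.01),
}

doc = ['champion', 'champion', 'retail']  # an already-segmented document

# formula 3: log P(c) + sum of log P(wi | c); the highest score wins
scores = {c: logprior[c] + sum(loglikelihood[w, c] for w in doc)
          for c in ('sports', 'finance')}
print(max(scores, key=scores.get))  # sports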

2. Training the Naive Bayes classifier

Training Naive Bayes is essentially the process of computing the prior probabilities and the likelihood functions.

2.1 Calculation of the prior probability P(c)

P(c) asks: among all documents, what is the probability that a document belongs to category c? Suppose there are Ndoc documents in the training data in total; to compute P(c), just count the documents of category c. With Nc documents in category c, the prior is:

P(c) = Nc / Ndoc


2.2 Calculation of the likelihood function P(wi|c)

Since the bag-of-words model is used to represent a document d, for each word wi in document d, find all training documents of category c and count how many times wi occurs in them: count(wi, c).

Then count the total number of word tokens in the category-c documents of the training set. The ratio of the two is the value of the likelihood function:

P(wi | c) = count(wi, c) / Σ_{w ∈ V} count(w, c)

where V is the vocabulary (if a word is in the vocabulary but never appears in category c, then count(w, c) = 0).

Note that the actual training also needs to use Laplace smoothing.
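
For illustration with made-up numbers (not from this dataset): if count(wi, c) = 3, the category-c documents contain 1,000 word tokens in total, and the vocabulary has |V| = 50,000 words, then the smoothed likelihood is (3 + 1) / (1,000 + 50,000) ≈ 7.8e-5, and a word that never occurs in category c gets (0 + 1) / (1,000 + 50,000) instead of 0.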

3. Code implementation

Next, the code is walked through in the order in which it runs.

3.1 execution function

class Implementation():
    '''
    Execution function
    accuracy: compute classification accuracy
    main: main entry point
    readFile: read the dataset
    '''
    def __init__(self):
        self.labels = dict()

    @staticmethod
    def accuracy(prediction, test):
        acc = 0
        test_list = list(test)
        for idx, result in enumerate(prediction):
            if result == test_list[idx]:
                acc += 1

        return acc / len(test)

    def main(self):
        x_train, y_train, x_test, y_test = self.readFile()
        nb = MultinominalNB()

        """ 
        save: save the trained model to disk
        cached: load the saved model from disk
        """
        nb.fit(x_train, y_train, save=True)
        predictions = nb.predict(x_test, cached=False)

        print('Accuracy: ', self.accuracy(predictions, y_test))

    def readFile(self):
        x_train = open('mydata/train_contents.txt', encoding="utf-8").read().split('\n')
        y_train = open('mydata/train_labels.txt', encoding="utf-8").read().split('\n')
        x_test = open('mydata/test_contents.txt', encoding="utf-8").read().split('\n')
        y_test = open('mydata/test_labels.txt', encoding="utf-8").read().split('\n')

        print('Train data: ', len(x_train), 'Testing data: ', len(x_test), 'Total: ', len(x_train)+len(x_test))

        return x_train, y_train, x_test, y_test

This class drives the whole run: it includes the methods for computing accuracy, reading the dataset, and performing training and prediction.

In the main() method, MultinominalNB() is the classifier class built later in this article. nb.fit() trains the model; its parameter save controls whether the trained model is saved. nb.predict() classifies the test set and returns the results in predictions; its parameter cached controls whether the saved model is loaded for classification. predictions is then passed to accuracy().

accuracy() compares the predictions with the test-set labels and returns the fraction that match.

3.2 fit() trains the model

The fit() method implements training of the model.

    def fit(self, x, y, save=False):
        self.docs = x
        self.classes = y
        num_doc = len(self.docs)  # number of training documents
        uniq_cls = np.unique(self.classes)  # distinct label values
        self.vocab = self.buildGlobalVocab()  # build the vocabulary of all words
        vocab_cnt = len(self.vocab)  # number of words in the vocabulary

        t = time()

        for _cls in uniq_cls:
            cls_docs_num = self.countCls(_cls)  # number of documents with the current label
            self.logprior[_cls] = np.log(cls_docs_num / num_doc)  # formula 1, in log space
            self.buildClassVocab(_cls)  # collect the word list of the current category
            class_vocab_counter = Counter(self.class_vocab[_cls])  # occurrence count of each word in the current category
            class_vocab_cnt = len(self.class_vocab[_cls])  # total number of word tokens in the current category

            for word in self.vocab:
                w_cnt = class_vocab_counter[word]
                # formula 2 (Laplace smoothing), in log space
                self.loglikelihood[word, _cls] = np.log((w_cnt + 1) / (class_vocab_cnt + vocab_cnt))

        if save:
            self.saveModel()
        print('Training finished at {} mins.'.format(round((time() - t) / 60, 2)))

In main():

nb = MultinominalNB()
nb.fit(x_train, y_train, save=True)

First, the datasets x_train and y_train are read, and save=True is set for the first training run:

        self.docs = x
        self.classes = y

In fit(), x_train is first stored in docs and y_train in classes.

        num_doc = len(self.docs)  # number of training documents
        uniq_cls = np.unique(self.classes)  # distinct label values

The number of training documents and the distinct labels are computed; the output is as follows:

17600
['1' '10' '11' '2' '3' '4' '5' '6' '7' '8' '9']
        self.vocab = self.buildGlobalVocab()  # build the vocabulary of all words
        vocab_cnt = len(self.vocab)  # number of words in the vocabulary

The method buildGlobalVocab() (discussed below) builds a vocabulary of all the words. The returned vocabulary is saved in vocab, and the number of words it contains is stored in vocab_cnt.
Part of the output is shown below:

['lower' 'Chinese style' 'Qiao Shi' 'Yes' 'Hendry' 'with' 'above' 'as well as' 'Buddhism' 'Foshan' 'Foshan City' 'all day' 'Family portraits' 'Totally enclosed'
 'Global' 'Eight balls' 'champion' 'Droiyan' 'Come out' 'preliminary' 'after' 'ton' 'contain' 'Monday' 'Wednesday' 'Warming up' 'international' 'Image & Text' 'stay'
 'recovery' 'foreign exchange' 'master' 'Bridge' 'chase ' 'implementation' 'Gong Yi' 'already' 'year' 'happiness' 'Guangfo' 'start' 'Zhang Guanghao' 'population'
 'implement' 'challenge round' 'morgan ' 'data' 'construction' 'Travel' 'day' 'daily' 'Rixun' 'time' 'show' 'And the first' 'month' 'come from' 'cup'
 'Liang Baozhong' 'automobile' 'Henan' 'Photo' 'world' 'Club' 'weak from fatigue' 'of' 'station' 'Second quarter' 'Phase II' 'The fourth stop' 'control' 'rice'
 'experience' 'Economics' 'repair' 'Emerald' 'huashan' 'England' 'performance' 'express' 'Information' 'truck' 'rise' 'Vehicle passage' 'conduct' 'sign' 'player'
 'Invitational tournament' 'Metropolitan Network' 'sales volume' 'Changsha' 'Chen' 'limit' 'Shaanxi' 'Shaanxi Province' 'retail' 'awarding ceremony' 'qualifying competition' 'height' 'shark']
 58513
        for _cls in uniq_cls:
            cls_docs_num = self.countCls(_cls)  # number of documents with the current label
            self.logprior[_cls] = np.log(cls_docs_num / num_doc)  # formula 1, in log space
            self.buildClassVocab(_cls)  # collect the word list of the current category
            class_vocab_counter = Counter(self.class_vocab[_cls])  # occurrence count of each word in the current category
            class_vocab_cnt = len(self.class_vocab[_cls])  # total number of word tokens in the current category

For each class, countCls() first counts the number of documents with the current label, and the prior probability (formula 1) is computed and stored in logprior[_cls]. buildClassVocab(_cls) then builds the word list of the current category and saves it in class_vocab. Counter records the number of occurrences of each word of class _cls in class_vocab_counter, and the total number of word tokens in the current category's list is saved in class_vocab_cnt.

            for word in self.vocab:
                w_cnt = class_vocab_counter[word]
                self.loglikelihood[word, _cls] = np.log((w_cnt + 1) / (class_vocab_cnt + vocab_cnt))

The likelihood function is then computed according to formula 2, in log space.

        if save:
            self.saveModel()
        print('Training finished at {} mins.'.format(round((time() - t) / 60, 2)))

Finally, the model is saved if requested, and the training time is printed.

3.3 buildGlobalVocab(self) builds the global vocabulary

    def buildGlobalVocab(self):
        # build the global vocabulary
        vocab = []
        for doc in self.docs:
            vocab.extend(self.cleanDoc(doc))

        return np.unique(vocab)

Each document in the training set is split into words with cleanDoc() and the words are collected into vocab; np.unique() then removes duplicates, and the result is returned.

3.4 cleanDoc() splits a document into words

    @staticmethod
    def cleanDoc(doc):
        # word segmentation: the text is already space-separated, so just split on spaces
        return re.split(' ', doc)

# example
vocab_cnt = cleanDoc(x_train_test[1])
print(x_train_test[1])
print(vocab_cnt)

The output is as follows:

Happy moment family photo mm / DD / yyyy family photo of Cuihua Mountain in Shaanxi Province mm / DD / yyyy family photo of Cuihua Mountain in Shaanxi Province mm / DD / yyyy family photo of Cuihua Mountain in Shaanxi Province mm / DD / yyyy Shaanxi
['happiness', 'time', 'Family portraits', 'Travel', 'Photo', 'year', 'month', 'day', 'Shaanxi Province', 'Emerald', 'huashan', 'Family portraits', 'Travel', 'Photo', 'year', 'month', 'day', 'Shaanxi Province', 'Emerald', 'huashan', 'Family portraits', 'Travel', 'Photo', 'year', 'month', 'day', 'Shaanxi Province', 'Emerald', 'huashan', 'Family portraits', 'Travel', 'Photo', 'year', 'month', 'day', 'Shaanxi']

3.5 countCls() counts the documents of a given category in the dataset

    def countCls(self, cls):
        # count how many documents in docs have the label cls
        cnt = 0
        for idx, _docs in enumerate(self.docs):
            if (self.classes[idx] == cls):
                cnt += 1
        return cnt

The dataset is traversed, and the counter is incremented whenever a document's label matches cls.
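
As a side note, the loop above could equivalently be written in one line with numpy; this is an alternative sketch, not the author's code:

# count the documents whose label equals cls
cnt = int(np.sum(np.array(self.classes) == cls))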

3.6 buildClassVocab(self, _cls) builds the word list of a class

    def buildClassVocab(self, _cls):
        # build the class word list: collect the words of every training document labeled _cls
        curr_word_list = []
        for idx, doc in enumerate(self.docs):
            if self.classes[idx] == str(_cls):
                curr_word_list.extend(self.cleanDoc(doc))
        # save the word list: create it if the class is new, otherwise extend it
        if _cls not in self.class_vocab:
            self.class_vocab[_cls] = curr_word_list
        else:
            self.class_vocab[_cls].extend(curr_word_list)

The whole dataset is traversed, all words belonging to this class are collected, and the resulting word list is stored in class_vocab[_cls].

3.7 saveModel(self) and readModel(), model saving and reading

    def saveModel(self):
        try:
            f = open("classifier.txt", "wb")
            pickle.dump([self.logprior, self.vocab, self.loglikelihood, self.classes], f)
            f.close()
        except:
            print('Error saving the model')

    @staticmethod
    def readModel():
        try:
            f = open("classifier.txt", "rb")
            model = pickle.load(f)
            f.close()
            return model
        except:
            print('Error reading the model')
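
As a usage note, the same save/load logic can be written with context managers so the file handle is closed even if pickling fails; a minimal equivalent sketch using the same file name:

    def saveModel(self):
        # persist the trained parameters
        with open("classifier.txt", "wb") as f:
            pickle.dump([self.logprior, self.vocab, self.loglikelihood, self.classes], f)

    @staticmethod
    def readModel():
        # load the parameters written by saveModel()
        with open("classifier.txt", "rb") as f:
            return pickle.load(f)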

3.8 predict(self, test_docs, cached=False) classifies test sets

    def predict(self, test_docs, cached=False):
        '''
        Classify the test set
        :param test_docs: test documents
        :param cached: whether to load the saved model from disk
        :return: predicted class for each test document
        '''
        output = []

        if not cached:
            logprior = self.logprior
            vocab = self.vocab
            loglikelihood = self.loglikelihood
            classes = self.classes
        else:
            logprior, vocab, loglikelihood, classes = self.readModel()

        for doc in test_docs:
            uniq_cls = np.unique(classes)
            sum = dict()  # per-class log scores for this document (formula 3)

            for _cls in uniq_cls:
                sum[_cls] = logprior[_cls]

                for word in self.cleanDoc(doc):
                    if word in vocab:
                        try:
                            sum[_cls] += loglikelihood[word, _cls]
                        except:
                            print(sum, _cls)

            result = np.argmax(list(sum.values()))
            output.append(uniq_cls[result])

        return output

First, the cached flag determines whether the saved model is loaded from disk or the parameters already in memory are used.

Then each document in the test set is scored according to formula 3, and the highest-scoring class is appended to output.

np.argmax is used here; a small example:

import numpy as np
a = np.array([3, 1, 2, 4, 6, 1])
b = np.argmax(a)  # index of the largest element of a: the maximum is 6, at index 4 (indexing starts at 0)
print(b)  # 4

4. Implementation results

C:\Users\BC_PC\AppData\Local\Programs\Python\Python39\python.exe C:/Users/BC_PC/Desktop/multinomial-nb-master/mynb.py
Train data:  17600 Testing data:  4324 Total:  21924
Training finished at 0.12 mins.
Accuracy:  0.8533765032377428

Process finished with exit code 0
