Machine learning: naive Bayes classification of food safety news


Catalogue

Preface
1, Preparatory knowledge
  1. Basic concepts
  2. Bayesian formula
2, Naive Bayesian principle
  1. Discriminative model and generative model
  2. Bayesian classifier
    2.1 Bayesian criterion
    2.2 Bayesian classifier
  3. Naive Bayesian classifier
    3.1 Estimating the conditional probability of each feature attribute
    3.2 Laplace correction
    3.3 Anti-underflow strategy
  4. Test naive Bayesian classifier
    4.1 Construct word vectors
    4.2 Naive Bayes classifier training function
    4.3 Classification function
    4.4 Test function
3, Naive Bayes classification of food safety news
  1. Dataset
  2. Read training set
  3. Word segmentation of news
  4. Generate dictionary
  5. Vectorize the news after word segmentation
  6. Training function and classification function
  7. Training model
  8. Test model
Summary


Preface

In this experiment we learn to use the naive Bayes method to classify spam e-mail and food safety news.



1, Preparatory knowledge

1. Basic concepts

Prior probability: the probability of an event estimated from previous experience and analysis. For example, the probability of rain in Xiamen can be obtained from past experience or from statistics. We use P(Y) to denote the initial probability assumed for Y before any training data have been seen.

Posterior probability: the probability re-estimated after the relevant evidence has been observed. P(Y|X) denotes the probability that Y holds given that X has been observed; it reflects our confidence that Y is true after seeing the training data X.

Conditional probability: the probability of event B given that event A has occurred, i.e. P(B|A). A posterior probability is a conditional probability, but a special one: its event is the hidden target variable and its condition is the observation, whereas in a general conditional probability the event and the condition can be arbitrary. Bayesian inference uses the prior probability to estimate the posterior probability.

Joint probability: in a multivariate probability distribution, the probability that several random variables simultaneously satisfy their respective conditions. The joint probability of X and Y is written P(X,Y), P(XY) or P(X ∩ Y).

For example, if X and Y both follow normal distributions, then P(X < 5, Y < 0) is a joint probability: the probability that the conditions X < 5 and Y < 0 hold at the same time, i.e. that the two events occur together.

2. Bayesian formula

Let A denote the event of interest and B an influencing factor (the observation). Bayes' formula is then:

P(A|B) = P(B|A) * P(A) / P(B)

P(A) is the prior probability: the probability that A occurs without taking factor B into account. P(A|B) is the posterior probability: the probability of A given factor B. P(B|A) is the conditional probability (likelihood): the probability of observing B given that event A occurs. P(B) is the normalization factor; B is known, so it acts as a constant and does not affect the comparison between hypotheses.

The formula follows from the definition of conditional probability: P(A|B) = P(A ∩ B) / P(B) and P(B|A) = P(A ∩ B) / P(A), so eliminating the joint probability P(A ∩ B) gives P(A|B) = P(B|A) * P(A) / P(B). (The original figure showed this as a Venn diagram: P(A|B) is the share of the joint region A ∩ B inside B.)

Summary: posterior probability = prior probability * conditional probability (likelihood) / evidence.
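As a quick numerical illustration of the formula (all numbers below are invented for the example): let A be "the message is spam" and B be "the message contains a particular word".

# Worked example of Bayes' formula with made-up numbers
p_A = 0.01             # prior P(A): assume 1% of messages are spam
p_B_given_A = 0.60     # likelihood P(B|A): assume 60% of spam contains the word
p_B_given_notA = 0.02  # assume 2% of normal messages contain the word

# Normalization factor P(B) by the law of total probability
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Posterior P(A|B) = P(B|A) * P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(round(p_A_given_B, 3))       # about 0.233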




2, Naive Bayesian principle

1. Discriminative model and generative model

Supervised learning methods can be divided into generative methods and discriminative methods; the models they learn are called generative models and discriminative models respectively.

Generative model: learn the joint probability distribution P(X,Y) from the training data and then derive the posterior distribution P(Y|X). Concretely, the training data are used to estimate P(X|Y) and P(Y), giving the joint distribution P(X,Y) = P(Y) * P(X|Y), which is then used for classification. Classical algorithms: naive Bayes, hidden Markov models (HMM), deep belief networks (DBN), etc.

Discriminative model: learn the decision function Y = f(X) or the conditional probability distribution P(Y|X) directly from the data and use it as the prediction model. The basic idea is to build a discriminant function from a limited number of samples; we do not care about the underlying data distribution, only about what output should be predicted for a given input. Classical algorithms: linear regression, logistic regression, decision trees, support vector machines, K-nearest neighbors, etc.

An example: using the two kinds of model to decide whether an animal is a goat or a sheep.

Generative model: first learn a goat model from the features of goats and a sheep model from the features of sheep; then extract the features of the animal to be classified, feed them into the goat model to get one probability and into the sheep model to get another, and assign the animal to the class with the larger probability.

Discriminative model: extract the feature x directly from the goat and sheep data and learn a single model; then extract the feature x of the animal to be classified, substitute it into the model, and judge directly whether it is more likely to be a goat or a sheep.

We can see that the generative model emphasizes the characteristics of the data itself, while the discriminative model emphasizes the boundary between classes. During classification, the generative model has to try every class, traverse them all, and take the class with the largest probability, whereas the discriminative model obtains the result directly from the model.

2. Bayesian classifier

2.1 Bayesian criterion

Bayes' criterion tells us how to swap the condition and the outcome in a conditional probability: if P(X|C) is known and P(C|X) is required, it can be computed as

P(C|X) = P(X|C) * P(C) / P(X)

  2.2 Bayesian classifier

Suppose we have a set of training samples together with their classification labels. Each tuple is represented as an n-dimensional attribute vector X = (x1, x2, ..., xn), and there are K categories C1, C2, ..., Ck. The classifier must predict which category a sample belongs to: for each category Ci, Bayes' formula is used to estimate the conditional probability P(Ci|X) of the given tuple X:

P(Ci|X) = P(X|Ci) * P(Ci) / P(X)

Data X is assigned to class Ci if and only if P(Ci|X) is the largest among all categories. P(Ci) is the class prior probability, and P(X|Ci) is the class-conditional probability of sample X with respect to class Ci, also called the likelihood. P(X) is the evidence factor used for normalization and is constant across categories, so only P(Ci) and P(X|Ci) need to be estimated from the training data.

Problems with this Bayesian classifier:

(1) The class-conditional probability P(x|Ci) involves the joint probability of all attributes of x, so estimating it directly from the frequency of samples runs into serious difficulty. If the sample has d binary attributes, the sample space contains 2^d possible attribute combinations; in reality this number is often far larger than the number m of training samples, which means many combinations never appear in the training set at all. Since "not observed" and "impossible" are two different things, estimating P(x|Ci) directly from sample frequencies is clearly not feasible.

(2) Statistically, if N samples are needed to estimate one feature, then 10 features modelled jointly require N^10 samples and a vocabulary of 1000 features requires N^1000 samples. The required number of samples therefore grows explosively with the number of features. If the features are assumed to be independent of each other, the number of samples can be reduced from N^1000 to 1000 * N, as the small sketch below illustrates.
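A rough illustration of those counts, as a small sketch (N is an assumed per-feature sample requirement):

#Rough illustration of the sample counts discussed above
N = 10                   #assume N samples are needed to estimate one feature
print(N ** 10)           #10 jointly modelled features: N^10 = 10,000,000,000 samples
print(1000 * N)          #1000 mutually independent features: only 1000*N = 10,000 samples
#(N ** 1000, the count for 1000 jointly modelled features, is a number with 1001 digits)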

  3. Naive Bayesian classifier

We can see that the main difficulty in computing the posterior probability P(c|x) with Bayes' formula is that the class-conditional probability P(x|c) is the joint probability over all attributes, which is hard to estimate from a limited number of training samples. To sidestep this obstacle, the naive Bayes classifier adopts the attribute conditional independence assumption: given the class, all attributes are assumed to be mutually independent (an assumption that does affect the classification results when it is violated). The joint probability then factorizes into the product of the individual attribute probabilities:

P(c|x) = P(c) * P(x|c) / P(x) = (P(c) / P(x)) * P(x1|c) * P(x2|c) * ... * P(xd|c)

Because the denominator P(x) is the same for every category, the goal reduces to maximum a posteriori (MAP) classification:

h(x) = argmax over c of P(c) * P(x1|c) * P(x2|c) * ... * P(xd|c)

The training process of the naive Bayes classifier is therefore to estimate the class prior probability P(ci) and the class-conditional probabilities P(xi|ci) from the training data set D.

3.1 Estimating the conditional probability of each feature attribute

Let Dc denote the set of samples of class c in the training set D. The class prior probability is then estimated as

P(c) = |Dc| / |D|

For a discrete attribute, let Dc,xi be the subset of Dc in which attribute i takes the value xi; the class-conditional probability is estimated as P(xi|c) = |Dc,xi| / |Dc|. For a continuous attribute, it is assumed to follow a normal distribution whose mean and variance are estimated from the samples of that class, and the value of the normal probability density function is used in place of the conditional probability. A minimal sketch of this follows.
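As a minimal sketch of that idea (the attribute values below are invented; the mean and variance are estimated from the samples of one class and the density value stands in for P(xi|c)):

import math

def gaussian_pdf(x, mu, sigma):
    #Probability density of a normal distribution N(mu, sigma^2) at x
    coeff = 1.0 / (math.sqrt(2 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

#Hypothetical continuous attribute values observed for one class
values = [5.1, 4.9, 5.4, 5.0, 5.2]
mu = sum(values) / len(values)
sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
print(gaussian_pdf(5.3, mu, sigma))   #density used in place of the conditional probability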

3.2 Laplace correction

Note that if some attribute value never appears together with a class in the training set, the class-conditional probability P(xi|c) will be zero, which makes the whole cumulative product zero. To avoid this, the probability estimates are smoothed; the Laplace correction is commonly used to prevent probability estimates of zero caused by insufficient training samples. In the implementation below, the occurrence count of every feature is initialized to 1 and the denominator is initialized to 2. When the training set is large enough, the influence of this correction becomes negligible and the estimates converge toward the true probabilities.
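A minimal sketch of the correction on made-up counts; the +1 numerator / +2 denominator initialization is the same one used later in trainNB0:

#Laplace correction on made-up counts
count_word_in_class = 0          #the word never occurs together with this class
total_words_in_class = 500       #total word occurrences observed for this class (assumed)

p_unsmoothed = count_word_in_class / float(total_words_in_class)          #0.0 -- wipes out the whole product
p_smoothed = (count_word_in_class + 1) / float(total_words_in_class + 2)  #small but non-zero
print(p_unsmoothed, p_smoothed)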

3.3 Anti-underflow strategy

When multiplying many conditional probabilities, each factor is small (a real number less than 1). As the number of attributes grows, the cumulative product underflows and rounds to zero. Since ln(a*b) = ln(a) + ln(b), the product of conditional probabilities can be turned into a sum of logarithms; because the logarithm is monotonically increasing, comparing the log-probabilities gives the same classification decision as comparing the probabilities themselves.
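A small sketch showing the underflow and the logarithmic workaround (the probabilities are made up):

#Why the product of many small probabilities needs the logarithm trick
import math

probs = [1e-5] * 100                       #100 small conditional probabilities (made up)
product = 1.0
for p in probs:
    product *= p
print(product)                             #prints 0.0 -- the product underflows

log_sum = sum(math.log(p) for p in probs)  #ln(a*b) = ln(a) + ln(b)
print(log_sum)                             #about -1151.3, easily representable and comparable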

4. Test naive Bayesian classifier

The naive Bayes classifier is tested on 25 insulting e-mails and 25 non-insulting e-mails; of these 50 documents, 10 are randomly selected as the test set and the remaining 40 are used as training data:

4.1 Construct word vectors

from numpy import *                 #provides ones(), log(), array() and random.uniform() used below

'''
Function Description: create experimental sample
:return: Document collection after word segmentation; collection of category labels (insulting and non-insulting)
'''
def loadDataSet():
    postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0,1,0,1,0,1]          #1 means insulting words and 0 means normal speech
    return postingList,classVec

#Create a list of non repeating words that appear in all documents
def createVocabList(dataSet):
    vocabSet = set([])                      #Create an empty collection
    for document in dataSet:
        vocabSet = vocabSet | set(document) #Find the union of two sets
    return list(vocabSet)

#According to the vocabList vocabulary, vectorize each inputSet entry. Each value of the vector is 1 or 0, indicating whether the word appears in the vocabulary or not
#Input variables: vocabulary, a document
def setOfWords2Vec(vocabList, inputSet):
    #Create a vector with 0 elements
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec

#Naive Bayesian word bag model
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            #Each word can appear many times in the word bag. If it appears, it will accumulate
            returnVec[vocabList.index(word)] += 1
    return returnVec
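A short usage sketch of the helpers above (run in the same file), showing the difference between the set-of-words vector and the bag-of-words vector:

#Quick check of the word-vector helpers above
listOPosts, listClasses = loadDataSet()
myVocabList = createVocabList(listOPosts)
print(len(myVocabList))                                            #number of distinct words in the vocabulary
print(setOfWords2Vec(myVocabList, listOPosts[0]))                  #0/1 vector: word present or absent
print(bagOfWords2VecMN(myVocabList, ['stupid', 'stupid', 'dog']))  #the bag model counts repeated words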

4.2 Naive Bayes classifier training function

#Naive Bayesian classifier training function
'''
Function Description: naive Bayesian classifier training function
:param trainMatrix: Document matrix
:param trainCategory: Document category label vector
:return: Conditional probability array of non insult class, conditional probability array of insult class, and probability that the document belongs to insult class
'''
def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)                   #The number of training sets, such as 6 elements
    #print("quantity is:", numTrainDocs)
    numWords = len(trainMatrix[0])                    #The length of each entry vector, such as each is 32 dimensions
    #print("length is:", numWords)
    #sum(trainCategory) means adding (0,1) in the label vector to get the number of 1 (that is, the number of insulting documents)
    #"1" in the tag indicates insult and "0" indicates non insult, so it is the probability that the document belongs to insult category
    pAbusive = sum(trainCategory)/float(numTrainDocs)

    #The element values of the array created by zeros() are all 0
    #p0Num = zeros(numWords)
    #p1Num = zeros(numWords)
    #p0Denom = 0.0
    #p1Denom = 0.0

    #The ones() function can create an array of any dimension and number of elements, with element values of 1
    #Create numpy.ones array, initialize the number of entries to 1, and use Laplace smoothing method (to prevent multiplication with 0)
    p0Num = ones(numWords)
    p1Num = ones(numWords)
    #Denominator initialized to 2, Laplace smoothing method
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] ==1:
            #The data required for statistics of conditional probability belonging to insult class, i.e. P(w0/1),P(w1/1)
            p1Num += trainMatrix[i]            #Array addition
            #print("p1Num:",p1Num)
            p1Denom += sum(trainMatrix[i])     #sum(): add all elements in trainMatrix[i]
            #print("p1Denom:",p1Denom)
        else:
            #Statistics of data required for conditional probability belonging to non insult class, i.e. P(w0/0),P(w1/0)
            p0Num += trainMatrix[i]
            p0Denom +=sum(trainMatrix[i])
            #print("p0Denom:",p0Denom)
    p1Vect = log(p1Num/p1Denom)             #Take the logarithm of each item in p1Num
    p0Vect = log(p0Num/p0Denom)             #Probability of words appearing in non insulting messages
    return p0Vect,p1Vect,pAbusive
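A quick check of trainNB0 on the toy data from 4.1 (assumes the functions above are in the same file):

#Quick check of trainNB0 on the toy data from 4.1
listOPosts, listClasses = loadDataSet()
myVocabList = createVocabList(listOPosts)
trainMat = [setOfWords2Vec(myVocabList, doc) for doc in listOPosts]
p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
print(pAb)              #0.5: three of the six documents are labelled insulting
print(len(p1V))         #one log-probability per word in the vocabulary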


4.3 Classification function

#Naive Bayesian classification function
'''
Function Description: naive Bayesian classification function
:param vec2Classify: Vector to classify
:param p0Vec: Conditional probability array of non insulting classes
:param p1Vec: Conditional probability array of insult class
:param pClass1: Probability that the document belongs to the insult class
:return: 0->Represents a non insulting document; one->Represents an insult class document
'''
def classifyNB(vec2Classify,p0Vec,p1Vec,pClass1):
    #The corresponding elements of the two vectors are multiplied and then summed
    p1 = sum(vec2Classify * p1Vec) +log(pClass1)
    p0 = sum(vec2Classify * p0Vec) +log(1-pClass1)
    if p1>p0:
        return 1
    else:
        return 0

4.4 Test function

#Test with a single piece of data
def testingNB():
    listOPosts,listClasses = loadDataSet()
    # Create a list of non repeating words that appear in all documents
    myVocabList = createVocabList(listOPosts)
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList,postinDoc))
    p0V,p1V,pAb = trainNB0(array(trainMat),array(listClasses))
    testEntry=['love','my','dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList,testEntry))
    print(testEntry,'The classification results are:',classifyNB(thisDoc,p0V,p1V,pAb))
    testEntry = ['stupid','garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'The classification results are:', classifyNB(thisDoc, p0V, p1V, pAb))

#File parsing function
def textParse(bigString):           #input is a big string, output is a word list
    import re
    #Split on any run of characters that are not letters or digits
    listOfTokens = re.split(r'\W+', bigString)
    #Convert all tokens to lowercase and filter out strings shorter than 3 characters
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]
#Spam test function
def spamTest():
    docList = []           #Word vector for each message
    classList = []         #Label for storing mail
    fullText = []
    for i in range(1, 26):
        #Read insult class (stored in spam) mail and generate word vector
        wordList = textParse(open('./email/spam/%d.txt' % i).read())
        docList.append(wordList)               #Store word vectors in docList
        fullText.extend(wordList)
        classList.append(1)                    #Store the corresponding class label, and the insult class is 1
        # Read the non insulting mail (stored in ham) and generate the word vector
        wordList = textParse(open('./email/ham/%d.txt' % i).read())
        docList.append(wordList)               #Store word vectors in docList
        fullText.extend(wordList)
        classList.append(0)                    #Store the corresponding class label, and the non insulting class is 0
    #Generate thesaurus from all word vectors
    # xx = len(docList)
    # yy = list(range(xx))
    # print(xx,yy)
    vocabList = createVocabList(docList)
    trainSet = list(range(50))                      #Generate 50 numbers from 0-49
    testIndex = []                                  #Subscript for storing test data
    for i in range(10):
        #Randomly generate a subscript from 0-49
        randIndex = int(random.uniform(0, len(trainSet)))
        testIndex.append(trainSet[randIndex])  #Extract the corresponding data as test data
        del(trainSet[randIndex])              #Delete the corresponding data to avoid re selection next time
    trainDataSet = []                          #Store training data (for word set method)
    trainClasses = []                          #Label for storing training data (for word set method)
    trainDataSet1 = []                        #Store training data (for word bag method)
    trainClasses1 = []                        #Label for storing training data (for word bag method)
    for docIndex in trainSet:
        #Extract training data (word set method)
        trainDataSet.append(setOfWords2Vec(vocabList, docList[docIndex]))
        #Extract training data tag
        trainClasses.append(classList[docIndex])

        #Extract training data (word bag method)
        trainDataSet1.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses1.append(classList[docIndex])
    #Start training
    p0V, p1V, pSpam = trainNB0(array(trainDataSet), array(trainClasses))
    errorCount = 0                     #Count the number of data classified incorrectly during the test
    p0V_1, p1V_1, pSpam1 = trainNB0(array(trainDataSet1), array(trainClasses1))
    errorCount1 = 0
    #Start testing classifier
    for Index in testIndex:  # classify the remaining items
        #print("classification:", Index)
        wordVector = setOfWords2Vec(vocabList, docList[Index])   #Data preprocessing
        # Test the classifier. If the classification is incorrect, add 1 to the number of errors
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[Index]:
            errorCount += 1
        wordVector1 = bagOfWords2VecMN(vocabList, docList[Index])  #Data preprocessing
        if classifyNB(array(wordVector1), p0V_1, p1V_1, pSpam1) != classList[Index]:
            errorCount1 += 1
    #Output classification error rate
    print('Error rate of the set-of-words method: ', float(errorCount) / len(testIndex))
    print('Error rate of the bag-of-words method: ', float(errorCount1) / len(testIndex))

Operation results:

It can be seen that the average error rate is as high as 61%. However, because the training and test sets are very small, a single misclassified message out of 10 already raises the error rate by 10%, so this result does not fairly reflect the accuracy of naive Bayes classification. There is also error introduced by assuming conditional independence of the attributes, so we test it next with a larger data set.
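Because a single random split of 40 training / 10 test messages is very noisy, a steadier estimate is the error rate averaged over several splits. A minimal sketch, under the assumption that spamTest() is modified to return its set-of-words error rate (return float(errorCount) / len(testIndex)) instead of only printing it:

#Average the error rate over several random train/test splits
#(assumes spamTest() has been changed to return its set-of-words error rate)
numTrials = 10
totalError = 0.0
for _ in range(numTrials):
    totalError += spamTest()
print('average error rate over %d trials: %f' % (numTrials, totalError / numTrials))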

3, Naive Bayes classification of food safety news

1. Dataset

The data set consists of two parts: training data, train_food.txt (3,352 articles) and train_notfood.txt (14,946 articles), and test data, eval_food.txt (600 articles) and eval_notfood.txt (899 articles).

Dataset location: classification + naive Bayes + food safety news.zip - machine learning document resources - CSDN Library

Partial data display:

2. Read training set

#1 represents food safety news and 0 represents non food safety news
# Read dataset
def readData(posFile,negFile):
    docList = []  # Store training sets
    classList = []  # Storage classification label

    # Read in dataset with category "1" (food safety related)
    posList = list(open(posFile,encoding='utf-8').readlines())
    posVec =[1]*len(posList)
    docList += posList
    classList +=posVec
    # Read in the data set with category "0" (non food safety related training set) and generate a vector
    negList = list(open(negFile,encoding='utf-8').readlines())
    negVec = [0] * len(negList)
    docList += negList
    classList += negVec
    return docList,classList
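A quick sanity check of readData, using the training file paths given later in trainModal:

#Quick check of readData on the training files
docList, classList = readData("./data/train_food.txt", "./data/train_notfood.txt")
print(len(docList))          #total number of news items read
print(sum(classList))        #number of food-safety items (label 1)
print(docList[0][:30])       #first 30 characters of the first news item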

3. Word segmentation of news

Segment these news items into words. To save time in future training runs, the segmentation results are saved in the same directory as the training data under the name cleaned_trainMatrix2.txt.

#Word segmentation of news
import os
import re
import jieba                        #Chinese word segmentation library

def jieba_cut_and_save_file(inputList, output_cleaned_file=False):
    """
    1. Read Chinese files and segment sentences
    2. You can save the results of word segmentation to a file
    3. If a word segmented data file already exists, it will be loaded directly
    """
    output_file = os.path.join('./data/', 'cleaned_' + 'trainMatrix2.txt')
    #If a word segmented data file already exists, it will be loaded directly
    #os.path.exists() checks whether the given path really exists
    if os.path.exists(output_file):
        lines = list(open(output_file, 'r').readlines())
        lines = [line.strip('\n').split(' ') for line in lines]
    else:
        lines = [list(jieba.cut(clean_str(line))) for line in inputList]
        lines = [[word for word in line if word != ' '] for line in lines]
        #If it does not exist, save the file after word segmentation
        if output_cleaned_file:
            with open(output_file, 'w') as f:
                for line in lines:
                    f.write(" ".join(line) + '\n')
    # Generate thesaurus from all word vectors
    vocabulary = createVocabList(lines)
    # The word vector quantizer is generated according to the dictionary and word vector quantization is carried out
    setOfWords2Vec = setOfWords2VecFactory(vocabulary)
    vectorized = [setOfWords2Vec(news) for news in lines]
    return vectorized, vocabulary
def clean_str(string):
    """
    1. Convert characters other than Chinese characters into a space
    2. Remove the space characters before and after the sentence
    """
    string = re.sub(r'[^\u4e00-\u9fff]', ' ', string)
    string = re.sub(r'\s{2,}', ' ', string)
    return string.strip()
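A small example of what clean_str and jieba.cut produce for a single (invented) headline:

#Example: cleaning and segmenting one invented headline
line = "某市查处过期食品案件123起!"
cleaned = clean_str(line)                           #non-Chinese characters become spaces, then trimmed
words = [w for w in jieba.cut(cleaned) if w != ' ']
print(cleaned)
print(words)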

4. Generate dictionary

Generate a dictionary that contains all the words that appear in the news but are not repeated.

#Generate dictionary
def createVocabList(dataSet):
    """
    Construct a dictionary from the news list after word segmentation
    """
    # Create a list of non repeating words that appear in all news.
    vocabSet = set([])
    for news in dataSet:
        vocabSet = vocabSet | set(news)
        # |Take and merge
    return list(vocabSet)

5. Vectorize the news after word segmentation

#Vectorized news required for testing (vectorization of a single news)
def vectorize_newslist(news_list,vocabulary):
    """
    The news list is vectorized into a word vector matrix
    Note: if there is no dictionary, the default value is to create from the collection
    """
    # Word segmentation and filtering
    cut_news_list = [list(jieba.cut(clean_str(news))) for news in news_list]

    # The word vector quantizer is generated according to the dictionary and word vector quantization is carried out
    setOfWords2Vec = setOfWords2VecFactory(vocabulary)
    vectorized = [setOfWords2Vec(news) for news in cut_news_list]
    return vectorized,vocabulary

#Vectorization of the news
def setOfWords2VecFactory(vocabList):
    """
    Construct the corresponding dictionary through a given dictionary setOfWords2Vec
    Nested definition of functions
    """
    #Optimization: speed up the conversion by constructing a hash table from words to indexes in advance
    index_map = {}
    #The enumerate() function is used to combine a traversable data object into an index sequence, listing data and data subscripts at the same time
    for i, word in enumerate(vocabList):
        index_map[word] = i

    def setOfWords2Vec(news):
        """
        Based on the dictionary provided at the time of construction, the dictionary vectorizes a news
        """
        result = [0]*len(vocabList)
        for word in news:
            #get() returns the word's index if it is in the dictionary, otherwise None
            index = index_map.get(word, None)
            if index is not None:       #compare with None so that index 0 is not skipped
                result[index] = 1
        return result
    return setOfWords2Vec
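A short usage sketch of the factory on a tiny made-up vocabulary; note that the comparison with None matters so that the word at index 0 is not skipped:

#Quick check of setOfWords2VecFactory on a tiny made-up vocabulary
vocab = ['食品', '安全', '比赛', '冠军']
vec_fn = setOfWords2VecFactory(vocab)
print(vec_fn(['食品', '安全', '抽检']))     #[1, 1, 0, 0] -- '抽检' is not in the vocabulary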

6. Training function and classification function

The data used for training and testing here is large, and each training run takes a long time, so the trained model is saved and the tests load the saved model directly to save time. To make saving convenient, the model is wrapped in a class.

#The model is wrapped in a class so that the trained model can be saved and reloaded easily
from numpy import ones, log        #array operations used in train() and classify_vector()

class oldNB:
    def __init__(self,vocabulary):
        self.p1V = None
        self.p0V = None
        self.p1 = None
        self.vocabulary = vocabulary

    def train(self,trainMatrix, classVector):
        """
        Training function
        :param trainMatrix: Training word vector matrix
        :param classVector: Classification vector
        """
        # Number of news
        numTrainNews = len(trainMatrix)
        # The length of each term vector in the training set, for example, each term vector is 32 dimensions
        numWords = len(trainMatrix[0])

        # News belongs to the probability of food safety, a priori probability
        pFood = sum(classVector) / float(numTrainNews)
        # Manually set a priori probability
        # pFood = float(1)/float(200)
        # Initialize the probability to avoid the existence of zero and make the subsequent product result 0
        p0Num = ones(numWords)
        p1Num = ones(numWords)
        p0Sum = 2.0
        p1Sum = 2.0
        for i in range(numTrainNews):
            if classVector[i] == 1:
                # +=
                # The quantity distribution of each word in each food safety news, p1Num is a vector
                p1Num += trainMatrix[i]
                # Total number of words accumulated over food-safety news; p1Sum is the word count of category 1
                p1Sum += sum(trainMatrix[i])
            else:
                p0Num += trainMatrix[i]
                p0Sum += sum(trainMatrix[i])
        # Probability of occurrence of each word in the case of 1
        p1Vect = log(p1Num / p1Sum)
        # Probability of occurrence of each word in the case of 0
        p0Vect = log(p0Num / p0Sum)
        #return p1Vect, p0Vect, pFood
        # Save results
        self.p0V = p0Vect
        self.p1V = p1Vect
        self.p1 = pFood

    def classify_news(self,news):
        """
        Classification function: preprocess the input news text, then classify it
        :param news: News to be classified
        """
        vectorized, _ = vectorize_newslist([news], self.vocabulary)
        return self.classify_vector(vectorized)

    def classify_vector(self,vec2Classify):
        """
        Classification function,Classification of input word vectors
        :param vec2Classify: Word vector to be classified
        """
        vec2Classify = vec2Classify[0]
        p1 = sum(vec2Classify * self.p1V) + log(self.p1)
        # Element multiplication
        p0 = sum(vec2Classify * self.p0V) + log(1.0 - self.p1)
        if p1 > p0:
            return 1
        else:
            return 0

7. Training model

#Training model
import joblib                       #for saving and loading the trained model

def trainModal():
    posFile = "./data/train_food.txt"
    negFile = "./data/train_notfood.txt"

    print("Getting training matrix and its classification vector")
    #Save training dataset list, category label
    trainList, classVec = readData(posFile, negFile)

    trainMat =[]
    trainClass = []         #Category label list
    trainClass += classVec

    #global vocabulary       # Global variables, saving Thesaurus
    print("Segmenting training matrix and generating Thesaurus")
    #Participle the news you read
    trainMat,vocabulary = jieba_cut_and_save_file(trainList,True)
    # Initialization model
    bayes = oldNB(vocabulary)
    print("Training model")
    bayes.train(trainMat, trainClass)

    print("Save model")
    joblib.dump(bayes, "./arguments/train_model.m")

8. Test model

#test model 
def testModal():
    posFile1 = "./data/eval_food.txt"
    negFile1 = "./data/eval_notfood.txt"

    print("Getting the test matrix and its classification vector")
    trainList1, classVec1 = readData(posFile1, negFile1)

    # Read model
    nb = joblib.load("arguments/train_model.m")

    dataLen = range(len(trainList1))
    results = []
    for index in dataLen:
        result = nb.classify_news(trainList1[index])
        results.append(result)
    ans = 0
    sureCount = 0
    for i in range(len(classVec1)):
        if results[i] == classVec1[i]:
            sureCount += 1
    ans = sureCount / len(classVec1)
    print("Accuracy:" + str(ans))

Test results:

Compared with the earlier experiment, which used only 40 training and 10 test documents, the effect here is much clearer. An accuracy of 94% is quite respectable for this kind of classification. In addition, the prior probability can be set manually according to the actual situation to push the accuracy even higher.

When the a priori probability is manually set to 1 / 200, the following accuracy is obtained:




Summary

Advantages of using naive Bayesian method for classification:

(1) The model is very easy to build and makes decisions quickly. When new sample data arrive, incremental learning is possible: only the probability estimates involving the attribute values of the new samples need to be updated.

(2) The computation is simple and it can handle multi-class data

(3) The decision results are easy to explain

(4) Commonly used for text classification

  Disadvantages:

(1) The naive Bayes model assumes that attributes are mutually independent, which often does not hold in practice. Naive Bayes performs well when the correlation between attributes is small, but when the number of attributes is large or the correlations between them are strong, the classification performance suffers;

(2) The prior probability must be known, and it often rests on assumptions that may be inaccurate, which lowers the accuracy of the predictions;

(3) It is sensitive to how the input data are represented.

Tags: Python Machine Learning

Posted on Sun, 28 Nov 2021 12:25:13 -0500 by IMakeLousyCode