
# Summary

The naive Bayes classifier is a classification method based on Bayes' theorem and the assumption of conditional independence among features. Given a training set, it first learns the joint probability distribution of input and output under the conditional-independence assumption (because naive Bayes obtains a model of the joint distribution through learning, it clearly belongs to the generative models). Then, for a given input x, it uses Bayes' theorem to output the class y with the maximum posterior probability.

First, let's review a few concepts.

For example, suppose we already know that a bag holds 10 balls, each either black or white, and that 6 of them are black. If we reach in and draw one ball, we can compute the probability of drawing a black ball. But this is a God's-eye view: we understand the whole picture before making a judgment.

But what if we don't know the proportion of black and white balls in advance? Can we infer that proportion from the colors of the balls we draw out?
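As a sketch of that idea (assuming, purely for illustration, a bag of 10 balls, a uniform prior over the number of black balls, and draws made with replacement), Bayesian updating recovers the most plausible composition of the bag from the observed colors:

```python
# Hypotheses: the bag of 10 balls contains k black balls, k = 0..10.
# Prior: all values of k equally likely (an assumption for illustration).
priors = {k: 1 / 11 for k in range(11)}

def likelihood(k, blacks, whites):
    # Probability of the observed draws (with replacement) if k balls are black
    p = k / 10
    return (p ** blacks) * ((1 - p) ** whites)

# Suppose we drew 3 black balls and 1 white ball
unnorm = {k: priors[k] * likelihood(k, 3, 1) for k in priors}
total = sum(unnorm.values())
posterior = {k: v / total for k, v in unnorm.items()}

best = max(posterior, key=posterior.get)
print("most probable number of black balls:", best)
```

The posterior concentrates near a 3:1 black-to-white ratio, which matches the intuition that the draws themselves tell us about the bag.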

Prior probability:

The probability judged from experience before seeing any evidence. For example, the incidence rate of the (fictional) disease "Bayes disease" is 1/10000; that is a prior probability.

Posterior probability:

The posterior probability is the probability of a cause inferred after the result has occurred. For example, if someone is diagnosed with "Bayes disease", the cause may be A, B, or C. The probability that the disease is due to cause A, given the diagnosis, is a posterior probability. It is a kind of conditional probability.

Conditional probability:

The probability that event A occurs given that another event B has already occurred, written P(A|B). For example, the probability of contracting "Bayes disease" given cause A is a conditional probability.

Next, let's look at discriminative and generative models in machine learning.

In machine learning, supervised methods can be divided into two types: discriminative models and generative models. In short, a discriminative model models the conditional distribution, while a generative model models the joint distribution.

Simply put:

- A discriminative model only needs to learn what separates the classes. For example, to tell good melons from bad ones, it may be enough to learn that the vines of good melons are greener than those of bad melons.
- A generative model is different: it learns what good melons look like and what bad melons look like. Once both characterizations are in hand, a new sample is classified by comparing it against each.

Generative model: learn the joint probability distribution P(X,Y) from the data, then obtain the conditional distribution P(Y|X) = P(X,Y)/P(X) as the prediction model. This approach models how a given input X generates the output Y.

Discriminative model: learn the decision function Y = f(X) or the conditional probability distribution P(Y|X) directly from the data as the prediction model. A discriminative method cares only about which output Y should be predicted for a given input X.

The Bayesian model studied in this experiment is a generative model.

It builds one model per class, so there are as many models as categories. For example, if the class labels are {good melon, bad melon}, we first learn a good-melon model from the features of good melons, then a bad-melon model from the features of bad melons. For a new sample, we compute its joint probability with each of the two classes, and then, by Bayes' formula

P(class | features) = P(features | class) × P(class) / P(features)

calculate the posterior for each class and assign the sample to the class with the largest posterior.
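The per-class recipe above can be sketched as follows (all numbers are made-up illustrative values, with a single "color" feature standing in for the full feature set):

```python
# Toy generative classifier: one model per class, pick the class with the
# largest P(x | class) * P(class). All numbers here are illustrative assumptions.
priors = {'good': 0.5, 'bad': 0.5}

# Per-class feature model: P(color = value | class)
cond = {
    'good': {'green': 0.7, 'pale': 0.3},
    'bad':  {'green': 0.2, 'pale': 0.8},
}

def classify(color):
    # Joint score P(x, c) = P(x | c) * P(c); the shared denominator P(x)
    # does not affect the argmax, so it can be skipped
    scores = {c: cond[c][color] * priors[c] for c in priors}
    return max(scores, key=scores.get)

print(classify('green'))  # -> good (0.35 vs 0.10)
print(classify('pale'))   # -> bad  (0.15 vs 0.40)
```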

## Principle of the naive Bayes algorithm

Bayes' formula:

P(A|B) = P(B|A) × P(A) / P(B)

In Bayes' formula, P(A) is called the "prior probability": our judgment of the probability of event A before event B occurs.

P(A|B) is called the "posterior probability": the reassessment of the probability of event A after event B has occurred.

The ratio P(B|A)/P(B) acts as an adjustment factor (P(B|A) alone is the likelihood) that pulls the estimated probability closer to the true probability.

Therefore the formula can be read as: posterior probability = prior probability × adjustment factor.
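A worked example of "posterior = prior × adjustment factor", using the 1/10000 prior from the "Bayes disease" example above; the test sensitivity and false-positive rate are assumed values for illustration:

```python
# Posterior = prior * adjustment factor, on the fictional "Bayes disease".
p_disease = 1 / 10000        # prior P(A): incidence rate from the text
p_pos_given_disease = 0.99   # P(B|A): test sensitivity (assumed)
p_pos_given_healthy = 0.01   # P(B|not A): false positive rate (assumed)

# Total probability of a positive test, P(B)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

adjustment = p_pos_given_disease / p_pos   # P(B|A) / P(B)
posterior = p_disease * adjustment         # P(A|B) = P(A) * adjustment

print(f"adjustment factor: {adjustment:.1f}")
print(f"posterior P(disease | positive): {posterior:.4f}")
```

Even with an accurate test, the posterior stays below 1% because the prior is so small; a positive result raises the probability by the adjustment factor, it does not replace the prior.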

It is easier to understand in another form:

P("belongs to a certain class" | "has a certain feature") = P("has a certain feature" | "belongs to a certain class") × P("belongs to a certain class") / P("has a certain feature")

The Bayesian method thus transforms "the probability of belonging to a certain class given certain features" (the classification we want) into "the probability of having certain features given a certain class" (which can be estimated by training one model per class). Since the class labels are needed for training, this is supervised learning.

Let's use the teacher's watermelon example to understand the naive Bayes formula.

The problem: we pick a watermelon whose features are green color, curled root, dull knock sound, sunken navel, and hard-smooth touch, with specific values of density and sugar content. We must judge whether the melon is good or bad.

This is a typical classification problem. Turned into mathematics, it means comparing the probability of "good melon" with that of "bad melon" given the features; whichever is larger gives the answer.

P(good melon | green, curled, dull sound, sunken, hard-smooth, density, sugar content) = P(green, curled, dull sound, sunken, hard-smooth, density, sugar content | good melon) × P(good melon) / P(green, curled, dull sound, sunken, hard-smooth, density, sugar content)

Bayes' formula converts the quantity we want into the three quantities on the right. Furthermore, the naive assumption that the features are conditionally independent given the class lets the class-conditional probability factor into per-feature terms:

P(green, curled, dull sound, sunken, hard-smooth, density, sugar content | good melon) × P(good melon) = P(green | good melon) × P(curled | good melon) × P(dull sound | good melon) × P(sunken | good melon) × P(hard-smooth | good melon) × P(density | good melon) × P(sugar content | good melon) × P(good melon)
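Putting the pieces together, a sketch of the whole comparison for the watermelon (all priors and conditional probabilities below are made-up illustrative numbers; a continuous feature such as density or sugar content would normally be modeled with, e.g., a Gaussian density, so only the discrete features are used here):

```python
# Naive Bayes for the watermelon example. All probabilities are
# made-up illustrative values, not estimates from a real dataset.
priors = {'good': 0.4, 'bad': 0.6}

# P(feature = value | class) for each discrete feature
cond = {
    'good': {'color=green': 0.75, 'root=curled': 0.75, 'sound=dull': 0.75,
             'navel=sunken': 0.75, 'touch=hard-smooth': 0.75},
    'bad':  {'color=green': 0.33, 'root=curled': 0.33, 'sound=dull': 0.44,
             'navel=sunken': 0.22, 'touch=hard-smooth': 0.67},
}

features = ['color=green', 'root=curled', 'sound=dull',
            'navel=sunken', 'touch=hard-smooth']

# Score each class: P(c) * prod_i P(x_i | c). The evidence P(x) is the same
# for both classes, so it can be dropped when comparing.
scores = {}
for c in priors:
    score = priors[c]
    for f in features:
        score *= cond[c][f]
    scores[c] = score

label = max(scores, key=scores.get)
print(scores, '->', label)
```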

# Code practice

## 1. Message classification

In this demo, the key is to understand the following code.

```python
import math

import numpy as np


# Create experimental samples
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],  # tokenized posts
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    # Class label vector: 1 for an insulting post, 0 for a non-insulting post
    classVec = [0, 1, 0, 1, 0, 1]
    return postingList, classVec


# Collect the tokenized posts into a list of unique words, i.e. the vocabulary
def createVocabList(dataSet):
    vocabSet = set()  # empty set of unique words
    for document in dataSet:
        vocabSet = vocabSet | set(document)
    return list(vocabSet)


def setOfWords2Vec(vocabList, inputSet):
    """Vectorize inputSet against the vocabulary; each element is 1 or 0.

    Parameters:
        vocabList - vocabulary returned by createVocabList
        inputSet - tokenized word list
    Returns:
        returnVec - document vector (set-of-words model)
    """
    returnVec = [0] * len(vocabList)  # vector of zeros
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1  # word present: set 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec


def trainNB0(trainMatrix, trainCategory):
    """Naive Bayes classifier training function.

    Parameters:
        trainMatrix - training document matrix (the returnVec rows stacked into trainMat)
        trainCategory - training label vector (classVec from loadDataSet)
    Returns:
        p0Vect - log conditional probability array of the non-insulting class
        p1Vect - log conditional probability array of the insulting class
        pAbusive - probability that a document belongs to the insulting class
    """
    numTrainDocs = len(trainMatrix)   # 6 documents
    numWords = len(trainMatrix[0])    # vocabulary size
    pAbusive = sum(trainCategory) / float(numTrainDocs)  # 3/6
    p0Num = np.ones(numWords)         # Laplace smoothing: start word counts at 1 ...
    p1Num = np.ones(numWords)
    p0Denom = 2.0                     # ... and denominators at 2, so no probability is 0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            # accumulate counts for the insulting class: P(w0|1), P(w1|1), P(w2|1), ...
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            # accumulate counts for the non-insulting class: P(w0|0), P(w1|0), P(w2|0), ...
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # Take logs so classifyNB can add terms instead of multiplying many
    # tiny probabilities (which would underflow)
    p1Vect = np.log(p1Num / p1Denom)  # log conditional probabilities, insulting class
    p0Vect = np.log(p0Num / p0Denom)  # log conditional probabilities, non-insulting class
    return p0Vect, p1Vect, pAbusive


def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    """Naive Bayes classification function.

    Parameters:
        vec2Classify - document vector to classify
        p0Vec - log conditional probability array of the non-insulting class
        p1Vec - log conditional probability array of the insulting class
        pClass1 - probability that a document belongs to the insulting class
    Returns:
        0 - non-insulting, 1 - insulting
    """
    # Sums of logs correspond to products of probabilities
    p1 = sum(vec2Classify * p1Vec) + math.log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + math.log(1.0 - pClass1)
    return 1 if p1 > p0 else 0


if __name__ == '__main__':
    listOPosts, listClasses = loadDataSet()    # create experimental samples
    myVocabList = createVocabList(listOPosts)  # create the vocabulary
    trainMat = [setOfWords2Vec(myVocabList, postinDoc) for postinDoc in listOPosts]
    p0V, p1V, pAb = trainNB0(np.array(trainMat), np.array(listClasses))
    for testEntry in (['love', 'my', 'dalmation'], ['stupid', 'garbage']):
        thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))  # vectorize the test sample
        if classifyNB(thisDoc, p0V, p1V, pAb):
            print(testEntry, 'classified as insulting')
        else:
            print(testEntry, 'classified as non-insulting')
```

## 2. Spam classification

This test reuses createVocabList, setOfWords2Vec, trainNB0, and classifyNB from section 1, plus the textParse helper to tokenize raw e-mail text:

```python
import random
import re

import numpy as np


def textParse(bigString):
    # Split a raw e-mail string into lowercase tokens, dropping short fragments
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]


def spamTest():
    docList = []
    classList = []
    fullText = []
    for i in range(1, 26):
        # Read each spam e-mail and turn the string into a word list
        wordList = textParse(open('email/spam/%d.txt' % i, 'r').read())
        docList.append(wordList)
        fullText.append(wordList)
        classList.append(1)   # 1 marks spam
        # Read each non-spam e-mail and turn the string into a word list
        wordList = textParse(open('email/ham/%d.txt' % i, 'r').read())
        docList.append(wordList)
        fullText.append(wordList)
        classList.append(0)   # 0 marks non-spam
    vocabList = createVocabList(docList)  # vocabulary without repetition
    trainingSet = list(range(50))
    testSet = []
    # Hold out 10 randomly chosen e-mails as the test set
    for i in range(10):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del trainingSet[randIndex]
    trainMat = []
    trainClasses = []
    # Traverse the training set
    for docIndex in trainingSet:
        # Add the set-of-words vector to the training matrix
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
        # Add the corresponding label to the training label vector
        trainClasses.append(classList[docIndex])
    # Train the naive Bayes model
    p0V, p1V, pSpam = trainNB0(np.array(trainMat), np.array(trainClasses))
    errorCount = 0
    # Traverse the test set
    for docIndex in testSet:
        wordVector = setOfWords2Vec(vocabList, docList[docIndex])
        if classifyNB(np.array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
            print("misclassified test sample:", docList[docIndex])
    print('error rate: %.2f%%' % (float(errorCount) / len(testSet) * 100))
```