Machine learning -- naive Bayes

1, Naive Bayesian theory

1. Introduction

Naive Bayes is a supervised learning algorithm for classification problems, such as predicting customer churn, deciding whether an investment is worthwhile, or assigning credit ratings. It is easy to understand and efficient to train, and in some domains its performance is comparable to decision trees and neural networks. However, the algorithm assumes that the features are conditionally independent given the class and, in its Gaussian form, that continuous variables are normally distributed, so its accuracy suffers to some extent when these assumptions do not hold.

2. Bayesian decision theory

Naive Bayes is part of Bayesian decision theory, so a quick look at Bayesian decision theory is useful before turning to naive Bayes.
Suppose we have a data set made up of two classes of data. The data distribution is shown in the figure below:

We now use p1(x,y) to represent the probability that the data point (x,y) belongs to category 1 (the category represented by the red dot in the figure) and p2(x,y) to represent the probability that the data point (x,y) belongs to category 2 (the category represented by the blue triangle in the figure). Then for a new data point (x,y), its category can be determined by the following rules:

If p1(x,y) > p2(x,y), the category is 1
If p1(x,y) < p2(x,y), the category is 2

That is, we select the category with the higher probability. This is the core idea of Bayesian decision theory: choose the decision with the highest probability. Next, we look at how to calculate the probabilities p1 and p2.
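As a minimal sketch of this rule, assume the class probabilities for one data point are already known. The numbers and the helper name `decide` are made up for illustration and do not come from the data set above.

```python
def decide(probabilities):
    """Return the category whose probability is largest.

    `probabilities` maps a category label to p_k(x, y) for one data point.
    """
    return max(probabilities, key=probabilities.get)


# Hypothetical probabilities for a single point (x, y)
point_probs = {1: 0.7, 2: 0.3}
print(decide(point_probs))  # prints 1
```

The same helper works unchanged for more than two categories, which is how the rule generalizes.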

3. Conditional probability

P(A|B) is the probability of event A given that event B has occurred.

As can be seen from a Venn diagram,

$$P(A|B)=\frac{P(A\cap B)}{P(B)}$$

so

$$P(A\cap B)=P(A|B)P(B)$$

and, by symmetry,

$$P(A\cap B)=P(B|A)P(A)$$

Therefore

$$P(A|B)P(B)=P(B|A)P(A)$$

that is,

$$P(A|B)=\frac{P(B|A)P(A)}{P(B)}$$

This is the formula for computing conditional probability (Bayes' theorem).
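The identity can be checked numerically. The sketch below uses hypothetical counts over 100 messages (A = "is spam", B = "contains the word free") and exact fractions so that both sides can be compared without rounding.

```python
from fractions import Fraction

# Hypothetical counts over 100 messages
n_total = 100
n_a = 30          # messages where A holds
n_b = 25          # messages where B holds
n_a_and_b = 20    # messages where both A and B hold

p_a = Fraction(n_a, n_total)
p_b = Fraction(n_b, n_total)
p_a_and_b = Fraction(n_a_and_b, n_total)

# Definition of conditional probability
p_a_given_b = p_a_and_b / p_b
p_b_given_a = p_a_and_b / p_a

# Bayes' theorem reaches P(A|B) through P(B|A), P(A) and P(B)
bayes = p_b_given_a * p_a / p_b

print(p_a_given_b, bayes)  # prints 4/5 4/5
```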

4. Total probability

If A and A' form a partition of the sample space, then the probability of event B equals the sum, over A and A', of the probability of each event multiplied by the conditional probability of B given that event, i.e.

$$P(B)=P(B|A)P(A)+P(B|A')P(A')$$

The conditional probability formula can then also be written as:

$$P(A|B)=\frac{P(B|A)P(A)}{P(B|A)P(A)+P(B|A')P(A')}$$
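A worked example may help here. The numbers below are hypothetical (a condition A with prior 1%, and a test B that fires 95% of the time under A and 5% of the time under A'); they only illustrate how the total probability formula supplies the denominator.

```python
from fractions import Fraction

# Hypothetical inputs
p_a = Fraction(1, 100)              # P(A)
p_b_given_a = Fraction(95, 100)     # P(B|A)
p_b_given_not_a = Fraction(5, 100)  # P(B|A')

p_not_a = 1 - p_a                   # A and A' partition the sample space

# Total probability: P(B) = P(B|A)P(A) + P(B|A')P(A')
p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a

# Posterior with the expanded denominator
p_a_given_b = p_b_given_a * p_a / p_b

print(p_b, p_a_given_b)  # prints 59/1000 19/118
```

Even with a fairly accurate test, the posterior stays near 16% because the prior P(A) is small.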

5. Bayesian inference

Rearranging the conditional probability formula gives:

$$P(A|B)=P(A)\frac{P(B|A)}{P(B)}$$

We call P(A) the "prior probability": the initial probability assigned to A from historical data or experience before the model is trained. It reflects the probability distribution of A and is independent of the sample.

P(A|B) is called the "posterior probability": the probability that A holds given the data sample B. It reflects our confidence that A holds after seeing B.

Most machine learning models try to estimate the posterior probability.

P(B|A)/P(B) is called the "likelihood function" here (strictly speaking, the standardized likelihood, since P(B|A) alone is the likelihood). It is an adjustment factor that brings the estimated probability closer to the true probability.

Now consider the following: we only need to compare P(A1|B) and P(A2|B). Their denominators are the same, so we only need to compare the numerators. Therefore, to reduce the amount of computation, the total probability formula does not need to be evaluated in practice.

6. Naive Bayesian classifier

The naive Bayes classifier adopts the "attribute conditional independence assumption": given the class, each attribute influences the classification result independently of the others.
For notational convenience, write P(C=c|X=x) as P(c|x). Under the attribute conditional independence assumption, Bayes' formula can be rewritten as

$$P(c|x)=\frac{P(c)P(x|c)}{P(x)}=\frac{P(c)}{P(x)}\prod_{i=1}^{d}P(x_i|c)$$

where d is the number of attributes and $x_i$ is the value of x on the i-th attribute.
P(x) is the same for all categories, so only the numerators need to be computed and compared.
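To make the comparison of numerators concrete, here is a small sketch with two classes and d = 3 binary attributes. The priors, the conditional probabilities, and the names `prior`, `cond`, and `numerator` are all hypothetical; a real model would estimate these values from training data.

```python
from math import prod

# Hypothetical model parameters
prior = {"c1": 0.6, "c2": 0.4}
# cond[c][i] is P(x_i = 1 | c)
cond = {
    "c1": [0.8, 0.1, 0.5],
    "c2": [0.3, 0.7, 0.5],
}


def numerator(c, x):
    """Return P(c) * prod_i P(x_i | c) for a binary attribute vector x."""
    factors = [p if xi == 1 else 1.0 - p for p, xi in zip(cond[c], x)]
    return prior[c] * prod(factors)


x = [1, 0, 1]
scores = {c: numerator(c, x) for c in prior}
predicted = max(scores, key=scores.get)  # P(x) is never needed
print(scores, predicted)
```

Dividing each score by their sum would recover the actual posteriors, but the predicted class is the same either way.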

7. Laplace smoothing

Zero probability problem: when estimating the probability of an event, if the event never appears in the observed samples (the training set), its estimated probability is 0. This is unreasonable: an event cannot be considered impossible (probability 0) merely because it has not been observed.

Laplace smoothing addresses the zero probability problem.

The French mathematician Laplace first proposed adding 1 to estimate the probability of phenomena that have not yet occurred. Laplace smoothing, also known as add-one smoothing, adds 1 to each count in the numerator and increases the denominator accordingly (in the general form, by the number of possible values) so that the estimates still sum to 1.
Underlying assumption: when the training sample is large, adding 1 to the count of each component x changes the estimated probabilities negligibly, while avoiding the zero probability problem simply and effectively.
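A minimal sketch of add-one smoothing on a single categorical variable follows. The sample, the list of possible values, and the function name `smoothed_probabilities` are assumptions made for illustration; the denominator grows by the number of possible values, which is the general form of the method.

```python
from collections import Counter


def smoothed_probabilities(observations, possible_values, alpha=1):
    """Estimate P(value) with add-alpha smoothing (Laplace when alpha = 1)."""
    counts = Counter(observations)
    total = len(observations) + alpha * len(possible_values)
    return {v: (counts[v] + alpha) / total for v in possible_values}


# Hypothetical sample in which "green" is never observed
sample = ["red", "red", "blue", "red"]
values = ["red", "blue", "green"]

probs = smoothed_probabilities(sample, values)
print(probs)  # "green" gets 1/7 instead of 0, and the estimates still sum to 1
```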

2, Hands-on practice

1. Naive Bayesian document classification

Background: take messages in an online community as an example. To keep the community healthy, we need to block insulting remarks, so we build a fast filter: if a message uses negative or insulting language, it is marked as inappropriate content. Filtering such content is a common requirement. We define two classes for this problem, insulting and non-insulting, represented by 1 and 0 respectively.

1.1 preparing data: building word vectors from text

We treat text as a word vector (or token vector), that is, we convert each sentence into a vector. We consider the words that appear in all documents, decide which of them to include in the vocabulary, and then convert each document into a vector over that vocabulary. For simplicity, assume the text has already been tokenized and stored in a list, with a class label for each document. The code is as follows:

First, the function loadDataSet() creates the experimental samples.

def loadDataSet():
    """
    Function description: create experimental samples
    Parameters:
        none
    Returns:
        postingList - tokenized sample documents
        classVec - category label vector
    """
    # Tokenized documents; an entry can be any combination of characters
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    # Category label vector: 1 for insulting text, 0 for normal speech
    classVec = [0, 1, 0, 1, 0, 1]
    return postingList, classVec

As can be seen from the running results, postingList stores the tokenized documents and classVec stores the category of each document: 1 represents the insulting class and 0 the non-insulting class.

Next, the function createVocabList() creates a list of the unique entries that appear across all documents, that is, the vocabulary.

def createVocabList(dataSet):
    """
    Function description: create a list of the unique entries in all documents, that is, the vocabulary
    Parameters:
        dataSet - collated sample data set
    Returns:
        list of unique entries, that is, the vocabulary
    """
    # Start from an empty set
    vocabSet = set()
    for document in dataSet:
        # | is set union, so each document is converted to a set first
        vocabSet = vocabSet | set(document)
    # Convert the set to a list of unique entries
    return list(vocabSet)

Next, the function setOfWords2Vec() vectorizes inputSet according to the vocabList vocabulary; each element of the vector is 1 or 0.

def setOfWords2Vec(vocabList, inputSet):
    """
    Function description: vectorize inputSet according to the vocabList vocabulary; each element of the vector is 1 or 0
    Parameters:
        vocabList - vocabulary returned by createVocabList
        inputSet - tokenized entry list
    Returns:
        returnVec - document vector (set-of-words model)
    """
    # Create a zero vector with the same length as the vocabulary
    returnVec = [0] * len(vocabList)
    # Traverse each entry
    for word in inputSet:
        if word in vocabList:
            # The entry is in the vocabulary: set its position to 1
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec

The following test runs these functions.

if __name__ == '__main__':
    # Test the loadDataSet() function
    postingList, classVec = loadDataSet()
    print('postingList:\n', postingList)
    # Test the createVocabList() function
    myVocabList = createVocabList(postingList)
    print('myVocabList:\n', myVocabList)
    # Test the setOfWords2Vec() function
    print('testResult:\n', setOfWords2Vec(myVocabList, postingList[0]))

From the running results, we can see that postingList is the original list of tokenized documents and myVocabList is the vocabulary: the collection of all words, with no duplicates. The vocabulary is used to vectorize documents: if a vocabulary word appears in the document, the corresponding position is set to 1; otherwise it is set to 0.

1.2 training algorithm: calculate the probability from the word vector

Naive Bayesian classifier training function

import numpy as np

def trainNB0(trainMatrix, trainCategory):
    """
    Function description: naive Bayesian classifier training function
    Parameters:
        trainMatrix - training document matrix, i.e. the matrix made of the returnVec vectors returned by setOfWords2Vec
        trainCategory - training category label vector, i.e. the classVec returned by loadDataSet
    Returns:
        p0Vect - conditional probability array of the non-insulting class
        p1Vect - conditional probability array of the insulting class
        pAbusive - probability that a document belongs to the insulting class
    """
    numTrainDocs = len(trainMatrix)                       # number of training documents
    numWords = len(trainMatrix[0])                        # number of entries per document
    pAbusive = sum(trainCategory) / float(numTrainDocs)   # probability that a document is insulting
    # Word counts per class, initialized to 0
    p0Num = np.zeros(numWords)
    p1Num = np.zeros(numWords)
    # Denominators, initialized to 0
    p0Denom = 0.0
    p1Denom = 0.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            # Counts for the insulting class, i.e. P(w0|1), P(w1|1), P(w2|1)...
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            # Counts for the non-insulting class, i.e. P(w0|0), P(w1|0), P(w2|0)...
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = p1Num / p1Denom
    p0Vect = p0Num / p0Denom
    # Conditional probabilities for both classes and the prior of the insulting class
    return p0Vect, p1Vect, pAbusive

if __name__ == '__main__':
    postingList, classVec = loadDataSet()
    myVocabList = createVocabList(postingList)
    print('myVocabList:\n', myVocabList)
    trainMat = []
    for postinDoc in postingList:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(trainMat, classVec)
    print('p0V:\n', p0V)
    print('p1V:\n', p1V)
    print('classVec:\n', classVec)
    print('pAb:\n', pAb)

The running results are as follows. p0V stores, for each word, its conditional probability given category 0 (non-insulting), and p1V stores the same for category 1 (insulting). For example, the entry for the word stupid in p0V is 0, while its entry in p1V is 0.15789474, about 15.79%, so this word clearly points to the insulting class. pAb is the proportion of insulting documents among all documents. classVec shows that there are three insulting and three non-insulting documents, so this probability is 0.5. In other words, p0V stores P(I | non-insulting) = 0.04166667, P(problems | non-insulting) = 0.04166667, and so on up to P(so | non-insulting) = 0.04166667, that is, the conditional probability of each word given the non-insulting class. Similarly, p1V stores the conditional probability of each word given the insulting class, and pAb is the prior probability.
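These values can be reproduced by counting directly. The sketch below recomputes a few entries without NumPy; the helper name `word_probability` is hypothetical, and the counting mirrors the set-of-words model (a word counts at most once per document).

```python
posts = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
labels = [0, 1, 0, 1, 0, 1]


def word_probability(word, label):
    """Fraction of all word slots in class `label` that are occupied by `word`."""
    docs = [set(d) for d, c in zip(posts, labels) if c == label]
    total_words = sum(len(d) for d in docs)          # 19 for class 1, 24 for class 0
    docs_with_word = sum(word in d for d in docs)
    return docs_with_word / total_words


print(word_probability('stupid', 1))  # 3/19, about 0.1579
print(word_probability('stupid', 0))  # 0.0
print(word_probability('I', 0))       # 1/24, about 0.0417
```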

1.3 test algorithm: modify the classifier according to the actual situation

Naive Bayesian classification function
After training the classifier, we can use it for classification.

from functools import reduce
import numpy as np

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    """
    Function description: naive Bayesian classifier classification function
    Parameters:
        vec2Classify - array of entries to be classified
        p0Vec - conditional probability array of the non-insulting class
        p1Vec - conditional probability array of the insulting class
        pClass1 - probability that a document belongs to the insulting class
    Returns:
        0 - the document belongs to the non-insulting category
        1 - the document belongs to the insulting category
    """
    # Multiply the corresponding elements, then take the product over all entries
    p1 = reduce(lambda x, y: x * y, vec2Classify * p1Vec) * pClass1
    p0 = reduce(lambda x, y: x * y, vec2Classify * p0Vec) * (1.0 - pClass1)
    print('p0:', p0)
    print('p1:', p1)
    if p1 > p0:
        return 1
    else:
        return 0

def testingNB():
    """
    Function description: test the naive Bayesian classifier
    Parameters:
        none
    Returns:
        none
    """
    listOPosts, listClasses = loadDataSet()        # create experimental samples
    myVocabList = createVocabList(listOPosts)      # create the vocabulary
    trainMat = []
    for postinDoc in listOPosts:
        # Vectorize the experimental samples
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    # Train the naive Bayesian classifier
    p0V, p1V, pAb = trainNB0(np.array(trainMat), np.array(listClasses))
    # Test sample 1, then test sample 2
    for testEntry in (['love', 'my', 'dalmation'], ['stupid', 'garbage']):
        thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))   # vectorize the test sample
        # Perform classification and print the result
        if classifyNB(thisDoc, p0V, p1V, pAb):
            print(testEntry, 'It belongs to the insult category')
        else:
            print(testEntry, 'It belongs to the non insulting category')

if __name__ == '__main__':
    testingNB()

The running results are as follows: p0 and p1 are both 0, so the classifier cannot compare them.

There are two problems here.
Problem 1: when using a Bayes classifier to classify documents, we compute the product of several probabilities to obtain the probability that the document belongs to a category, that is, p(w0|1)p(w1|1)p(w2|1)... If any one of these probabilities is 0, the whole product is 0. This is the zero probability problem: an event cannot be considered impossible just because it has no samples. To reduce this effect, initialize the occurrence count of every word to 1 and the denominator to 2. This method is called Laplace smoothing, also known as add-one smoothing, and is a commonly used way to handle zero probabilities (see part 1, section 7).
Problem 2: underflow, caused by multiplying many very small numbers. The product of numbers smaller than 1 is smaller than each factor, so after enough multiplications the result is rounded to 0 in floating point. To solve this, take the natural logarithm of the product. Since ln(a*b) = ln(a) + ln(b), working with logarithms avoids errors caused by underflow or floating-point rounding, and nothing is lost for classification because the logarithm is monotonic. The following figure shows the curves of the functions f(x) and ln(f(x)).

The two curves increase and decrease over the same intervals and reach their extrema at the same points. Their values differ, but the result of the comparison is unaffected.
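The underflow itself is easy to reproduce. In this sketch the list of 100 probabilities, each equal to 1e-5, is a made-up stand-in for the per-word probabilities of a long document.

```python
import math

# Hypothetical per-word probabilities for a 100-word document
probabilities = [1e-5] * 100

product = 1.0
for p in probabilities:
    product *= p  # the true value, 1e-500, is below the range of a float

log_sum = sum(math.log(p) for p in probabilities)

print(product)  # 0.0: the product has underflowed
print(log_sum)  # about -1151.29: the log form is still usable
```

Because the logarithm is monotonic, comparing two such log sums gives the same ordering as comparing the original products would have.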
Therefore, to solve these two problems, we change the trainNB0(trainMatrix, trainCategory) function as follows.
For problem 1:

# Initialize the entry counts to 1 (Laplace smoothing)
p0Num = np.ones(numWords)
p1Num = np.ones(numWords)
# Initialize the denominators to 2 (Laplace smoothing)
p0Denom = 2.0
p1Denom = 2.0

For problem 2:

# Take logarithms to prevent underflow
p1Vect = np.log(p1Num / p1Denom)
p0Vect = np.log(p0Num / p0Denom)

It can be seen from the operation results that there is no zero probability problem.

Modification of the classification function

# Sum the log probabilities of the words present; log(A * B) = log(A) + log(B), so log(pClass1) is added
p1 = sum(vec2Classify * p1Vec) + np.log(pClass1)
p0 = sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)

It can be seen from the operation results that the previous problems have been solved.
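For reference, the fragments above can be assembled into one self-contained sketch. This is an illustration rather than the definitive listing: `setOfWords2Vec` is simplified here (it silently ignores unknown words), the data is inlined, and the vocabulary is sorted only to make the output reproducible, none of which is in the original code.

```python
import numpy as np

posts = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
labels = [0, 1, 0, 1, 0, 1]

# Assumption: sorted for reproducibility; the original uses list(set(...))
vocab = sorted({word for doc in posts for word in doc})


def setOfWords2Vec(vocabList, inputSet):
    """Simplified set-of-words vector: 1 where the vocabulary word is present."""
    present = set(inputSet)
    return np.array([1 if word in present else 0 for word in vocabList])


def trainNB0(trainMatrix, trainCategory):
    """Return log P(w|0), log P(w|1) and P(class 1), with add-one smoothing."""
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    p0Num = np.ones(numWords)   # counts start at 1 (Laplace smoothing)
    p1Num = np.ones(numWords)
    p0Denom = 2.0               # denominators start at 2
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # Logarithms prevent underflow when many probabilities are combined
    return np.log(p0Num / p0Denom), np.log(p1Num / p1Denom), pAbusive


def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    """Compare the two log scores and return the winning class (0 or 1)."""
    p1 = np.sum(vec2Classify * p1Vec) + np.log(pClass1)
    p0 = np.sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
    return 1 if p1 > p0 else 0


trainMat = np.array([setOfWords2Vec(vocab, doc) for doc in posts])
p0V, p1V, pAb = trainNB0(trainMat, np.array(labels))

for testEntry in (['love', 'my', 'dalmation'], ['stupid', 'garbage']):
    thisDoc = setOfWords2Vec(vocab, testEntry)
    print(testEntry, '->', classifyNB(thisDoc, p0V, p1V, pAb))
```

With smoothing and logarithms in place, the first test entry is assigned to class 0 and the second to class 1.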

2. Spam filtering with naive Bayes

First, let's look at the classification steps:

Collect data: provide text files.
Prepare data: parse the text file into an entry vector.
Analyze data: check entries to ensure the correctness of parsing.
Training algorithm: use the trainNB0() function we established earlier.
Test algorithm: use classifyNB() and build a new test function to calculate the error rate of the document set.
Using algorithm: build a complete program to classify a group of documents and output the misclassified documents to the screen.

2.1. Data collection

Download the email dataset from GitHub; ham contains the non-spam messages and spam contains the spam messages.

2.2. Preparing data: segmenting text

A text string can be split on non-alphanumeric characters with the split function of the re module. The code is as follows:

import re

def textParse(bigString):
    """
    Function description: receive a large string and parse it into a list of strings
    Parameters:
        bigString - the string to parse
    Returns:
        list of lowercase tokens longer than two characters
    """
    # Split on runs of characters that are neither letters nor digits
    listOfTokens = re.split(r'\W+', bigString)
    # Drop tokens of two characters or fewer (such as a capital I) and lowercase the rest
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

if __name__ == '__main__':
    docList = []; classList = []
    for i in range(1, 26):                                                  # traverse 25 txt files
        wordList = textParse(open('email/spam/%d.txt' % i, 'r').read())    # read each spam message and tokenize it
        docList.append(wordList)
        classList.append(1)                                                 # 1 marks spam
        wordList = textParse(open('email/ham/%d.txt' % i, 'r').read())     # read each non-spam message and tokenize it
        docList.append(wordList)
        classList.append(0)                                                 # 0 marks non-spam
    vocabList = createVocabList(docList)                                    # create the vocabulary without repetition
    print(vocabList)

Running this produces the vocabulary:

Using the vocabulary, we can vectorize each text. We divide the data set into a training set and a test set, and use hold-out cross-validation to test the accuracy of the naive Bayes classifier. The code is as follows:

# File parsing and complete spam test function
# (createVocabList, setOfWords2Vec, trainNB0 and classifyNB are the functions defined earlier)
import re
import random
import numpy as np

def textParse(bigString):
    """
    Function description: receive a large string and parse it into a list of strings
    Parameters:
        bigString - the string to parse
    Returns:
        list of lowercase tokens longer than two characters
    """
    # Split on runs of characters that are neither letters nor digits
    listOfTokens = re.split(r'\W+', bigString)
    # Drop tokens of two characters or fewer (such as a capital I) and lowercase the rest
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

def spamTest():
    """
    Function description: test the naive Bayesian classifier
    Parameters:
        none
    Returns:
        none
    Author:
        Jack Cui
    Blog:
        http://blog.csdn.net/c406495762
    Modify:
        2017-08-14
    """
    docList = []; classList = []; fullText = []
    for i in range(1, 26):                                                  # traverse 25 txt files
        wordList = textParse(open('email/spam/%d.txt' % i, 'r').read())    # read each spam message and tokenize it
        docList.append(wordList)
        fullText.append(wordList)
        classList.append(1)                                                 # 1 marks spam
        wordList = textParse(open('email/ham/%d.txt' % i, 'r').read())     # read each non-spam message and tokenize it
        docList.append(wordList)
        fullText.append(wordList)
        classList.append(0)                                                 # 0 marks non-spam
    vocabList = createVocabList(docList)                                    # create the vocabulary without repetition
    trainingSet = list(range(50)); testSet = []                             # index lists for the training set and the test set
    for i in range(10):                                                     # of the 50 emails, 40 are used for training and 10 for testing
        randIndex = random.randrange(len(trainingSet))                      # pick a random position (randrange never returns the upper bound)
        testSet.append(trainingSet[randIndex])                              # add the index to the test set
        del trainingSet[randIndex]                                          # and remove it from the training set
    trainMat = []; trainClasses = []                                        # training matrix and training label vector
    for docIndex in trainingSet:                                            # traverse the training set
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))       # add the set-of-words vector to the training matrix
        trainClasses.append(classList[docIndex])                            # add the category to the training label vector
    p0V, p1V, pSpam = trainNB0(np.array(trainMat), np.array(trainClasses))  # train the naive Bayesian model
    errorCount = 0                                                          # misclassification count
    for docIndex in testSet:                                                # traverse the test set
        wordVector = setOfWords2Vec(vocabList, docList[docIndex])           # set-of-words vector of the test document
        if classifyNB(np.array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:   # the classification is wrong
            errorCount += 1                                                 # error count plus 1
            print("Misclassified test sets:", docList[docIndex])
    print('Error rate:%.2f%%' % (float(errorCount) / len(testSet) * 100))

if __name__ == '__main__':
    spamTest()

The function spamTest() prints the classification error rate on 10 randomly selected emails. Because the emails are selected at random, the output may differ from run to run. When a message is misclassified, the function prints its word list, so you can see which document caused the error. For a better estimate of the error rate, repeat the process several times, for example 10 times, and average the results. Note that misjudging spam as normal mail is preferable to classifying normal mail as spam. The operation results are as follows:
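Separately from any single run, the suggestion to repeat the hold-out split and average the error rate could be sketched as follows. The helper `average_error_rate` is hypothetical, and it assumes a variant of spamTest() that returns errorCount / len(testSet) instead of printing it; `fake_run` is only a stand-in so that the sketch runs without the email files.

```python
import random


def average_error_rate(run_once, repeats=10, seed=None):
    """Call run_once() `repeats` times and return the mean of the error rates."""
    if seed is not None:
        random.seed(seed)  # optional: makes the random splits repeatable
    rates = [run_once() for _ in range(repeats)]
    return sum(rates) / len(rates)


def fake_run():
    """Stand-in for one hold-out run; returns a made-up error rate."""
    return random.choice([0.0, 0.1, 0.2])


print(average_error_rate(fake_run, repeats=10, seed=0))
```

Averaging over several random splits reduces the variance that comes from testing on only 10 messages.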

3, Summary

1. Advantages of naive Bayesian inference:

It is a generative model that classifies by computing probabilities, and it can handle multi-class problems.
It performs well on small data sets and is suitable for incremental training.
It is not very sensitive to missing data, the algorithm is relatively simple, and it is often used for text classification.

Disadvantages of naive Bayesian inference:

It is sensitive to how the input data is represented.
The "naive" assumption costs some accuracy. To produce its output, the model assumes that attributes are independent of each other given the class, which often does not hold in practice. When there are many attributes or the attributes are strongly correlated, classification performance is poor; naive Bayes performs best when the attribute correlation is small.
The prior probability must be estimated, and classification decisions carry an error rate.
The prior often depends on assumptions, and many hypothetical models are possible, so in some cases a poorly chosen prior model leads to poor predictions.

2. The txt files in email must all use the same encoding; otherwise they cannot be parsed and the code will not run.
3. Code in the textbook:

listOfTokens = re.split(r'\W*', bigString)

This code needs to be modified; otherwise the output is an empty list. On Python 3.7 and later, \W* can match the empty string, so the text is split between every character and the len(tok) > 2 filter discards everything. Change * to +.
4. Laplace smoothing noticeably improves the classification performance of naive Bayes.
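The regular-expression change in note 3 can be checked directly. This sketch assumes Python 3.7 or later, where re.split also splits on empty matches; the sample sentence and the helper name `tokens` are made up for illustration.

```python
import re

text = "Hello, World! Naive Bayes is fun."


def tokens(pattern):
    """Split `text` with `pattern`, then keep lowercase tokens longer than 2 characters."""
    return [tok.lower() for tok in re.split(pattern, text) if len(tok) > 2]


print(tokens(r'\W+'))  # ['hello', 'world', 'naive', 'bayes', 'fun']
print(tokens(r'\W*'))  # []: every piece is at most one character long
```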