Machine learning -- naive Bayes

Contents

Summary

Principle of naive Bayesian algorithm

Code practice

1. Message classification

2. Spam classification

3. My Data

Reference

Summary

The naive Bayes classifier is a classification method based on Bayes' theorem and the assumption of conditional independence among features. Given a training set, it first learns the joint probability distribution of input and output under the feature conditional independence assumption (since the model is obtained by learning this joint distribution, naive Bayes is clearly a generative model); then, for a given input x, it uses Bayes' theorem to output the class y with the maximum posterior probability.

First, let's understand some concepts

For example, suppose we already know that there are 10 balls in a bag, each either black or white, and that six of them are black. If you reach in and draw a ball, you can compute the probability of drawing a black one. But this is a God's-eye view: we understand the whole picture before making the judgment.

More often we do not know the proportion of black and white balls in advance. Can we infer that proportion from the colors of the balls we draw out of the bag?

Prior probability:
The probability of an event judged from experience, before any new evidence is observed. For example, suppose the incidence rate of a (hypothetical) disease called "Bayes disease" is 1/10000; that 1/10000 is the prior probability.

Posterior probability:
The probability of a cause inferred after the result has occurred. For example, if someone is found to have "Bayes disease", and the disease may be caused by A, B or C, then the probability that it was caused by A, given that the person has the disease, is a posterior probability. It is a kind of conditional probability.

Conditional probability:
The probability that event A occurs given that another event B has occurred, written P(A|B). For example, the probability of developing "Bayes disease" given cause A is a conditional probability.
 

Next, let's look at discriminative models and generative models in machine learning.

In machine learning, supervised models can be divided into two types: discriminative models and generative models. In short, a discriminative model models the conditional distribution, while a generative model models the joint distribution.

Simply put

  • For a discriminative model, we only need to learn what distinguishes the classes. In the good-melon/bad-melon example, it might be enough to learn that the vines of good melons are greener than those of bad melons.
  • A generative model is different: we need to learn what good melons look like and what bad melons look like. Once we have the characteristics of both, we can classify a new sample by comparing it against each.

Generative model: learn the joint probability distribution P(X,Y) from the data, then obtain the conditional probability distribution P(Y|X) = P(X,Y)/P(X) as the prediction model. This approach models how an output Y is generated for a given input X.

Discriminative model: learn the decision function Y=f(X) or the conditional probability distribution P(Y|X) directly from the data as the prediction model. The discriminative approach is concerned only with which output Y should be predicted for a given input X.
 

The naive Bayes model studied in this experiment is a generative model.

It builds one model per class: as many models as there are categories. For example, if the category labels are {good melon, bad melon}, it first learns a good-melon model from the characteristics of good melons, then learns a bad-melon model from the characteristics of bad melons. For a new sample x, it computes the joint probability of the sample with each of the two categories and then, according to Bayes' formula

P(c|x) = P(x, c) / P(x) = P(x|c) P(c) / P(x)

computes the posterior probability P(c|x) for each class and selects the class with the largest posterior as the sample's classification.

 

Principle of naive Bayesian algorithm

Bayes' formula: P(A|B) = P(B|A) P(A) / P(B)

In Bayes' formula, P(A) is called the "prior probability", that is, our judgment of the probability of event A before event B is observed.

P(A|B) is called "Posterior probability", that is, the reassessment of the probability of event A after the occurrence of event B.

P(B|A)/P(B) is called "likelihood function", which is an adjustment factor to make the estimated probability closer to the real probability.

Therefore, Bayes' formula can be understood as: posterior probability = prior probability × adjustment factor.
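To make this concrete, here is a minimal sketch that plugs numbers into the formula; the likelihood and evidence values below are made up purely for illustration, and only the 1/10000 prior comes from the "Bayes disease" example above.

# All numbers are hypothetical, chosen only to illustrate Bayes' formula
prior = 1 / 10000        # P(A): prior probability of having "Bayes disease"
likelihood = 0.9         # P(B|A): probability of the observed symptom given the disease (assumed)
evidence = 0.01          # P(B): overall probability of the observed symptom (assumed)

adjustment = likelihood / evidence   # the adjustment factor P(B|A) / P(B)
posterior = prior * adjustment       # P(A|B) = P(A) * P(B|A) / P(B)
print(posterior)                     # 0.009: the evidence raises the probability 90-fold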

It will be easier to understand in another form

 

P("belonging to a certain class" | "having a certain characteristic") = P("having a certain characteristic" | "belonging to a certain class") P("belonging to a certain class") / P("having a certain characteristic")

  The Bayesian method transforms the probability of "belonging to a certain class given certain characteristics" (i.e. the classification we want) into the probability of "having certain characteristics given a certain class", which can be estimated separately for each class from the training data; since it learns from labeled data, it belongs to supervised learning.

Let's use the teacher's example to understand the naive Bayesian formula

The problem is this: we have picked a batch of watermelons, and one of them has the features green color, curled stem, turbid (muffled) knock sound, sunken navel, hard and smooth touch, together with specific values of density and sugar content. We want to judge whether this melon is good or bad.

This is a typical classification problem. Turned into a mathematical problem, it amounts to comparing the posterior probability of "good melon" with that of "bad melon" given these features; whichever is larger gives the answer.

P(good melon | green, curled, turbid, sunken, hard and smooth, density, sugar content) = P(green, curled, turbid, sunken, hard and smooth, density, sugar content | good melon) × P(good melon) / P(green, curled, turbid, sunken, hard and smooth, density, sugar content)

Through Bayes' formula, the quantity we want is converted into the three quantities on the right. Under the naive assumption that the features are conditionally independent given the class, the class-conditional probability factorizes into a product over individual features:

P(green, curled, turbid, sunken, hard and smooth, density, sugar content | good melon) = P(green | good melon) × P(curled | good melon) × P(turbid | good melon) × P(sunken | good melon) × P(hard and smooth | good melon) × P(density | good melon) × P(sugar content | good melon)
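As a rough illustration of this comparison, the sketch below uses made-up counts from a hypothetical training set of 8 good melons and 9 bad melons (all numbers are invented, and the continuous features density and sugar content are left out to keep it short; in practice they are usually modelled with a per-class Gaussian density).

# Hypothetical per-class counts of each discrete feature value (invented numbers)
good_counts = {'green': 3, 'curled': 5, 'turbid': 6, 'sunken': 5, 'hard and smooth': 6}
bad_counts  = {'green': 3, 'curled': 3, 'turbid': 4, 'sunken': 2, 'hard and smooth': 6}
n_good, n_bad = 8, 9

def class_score(counts, n_class, n_total, features):
    # Numerator of Bayes' formula: P(class) * product of P(feature | class)
    score = n_class / n_total
    for f in features:
        score *= counts[f] / n_class
    return score

sample = ['green', 'curled', 'turbid', 'sunken', 'hard and smooth']
p_good = class_score(good_counts, n_good, n_good + n_bad, sample)
p_bad  = class_score(bad_counts,  n_bad,  n_good + n_bad, sample)
# The shared denominator P(features) is the same for both classes, so it can be dropped
print('good melon' if p_good > p_bad else 'bad melon')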

 

 

Code practice

1. Message classification

In this demo, the key is to understand what the following code does.

 

import numpy as np
from functools import reduce
import math
#Create experimental samples

def loadDataSet():
    postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],                #Segmented entry
                ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    # Category label vector, 1 for insulting words and 0 for non insulting words
    classVec = [0,1,0,1,0,1]
    print(sum(classVec))
    return postingList,classVec

#The segmented experimental sample entries are sorted into a list of non repeated entries, that is, a vocabulary
def createVocabList(dataSet):
    vocabSet=set([])   #Create an empty non repeating list
    for document in dataSet:
        vocabSet =vocabSet|set(document)
    return list(vocabSet)

#Vectorize the inputSet according to the vocabList vocabulary; each element of the vector is 1 or 0
"""Parameters:
    vocabList - createVocabList List of returned
    inputSet - Segmented entry list
Returns:
    returnVec - Document vector,Word set model"""
def setOfWords2Vec(vocabList, inputSet):
    # Create a vector with 0 elements
    returnVec = [0] * len(vocabList)
    # Traverse each entry
    for word in inputSet:
        if word in vocabList:
            # If the entry exists in the vocabulary, set 1
            returnVec[vocabList.index(word)] = 1
        else: print("the word: %s is not in my Vocabulary!" % word)
    print("returnVec ",returnVec)
    return returnVec

#Naive Bayesian classifier training function
"""
Parameters:
    trainMatrix - Training document matrix, i.e setOfWords2Vec Returned returnVec Composition matrix trainMat
    trainCategory - Training category label vector, i.e loadDataSet Returned classVec 
Returns:
    p0Vect - Conditional probability array of non insulting classes
    p1Vect - Conditional probability array of insult class
    pAbusive - Probability that the document belongs to the insult class
"""
def trainNB0(trainMatrix,trainCategory):

    numTrainDocs=len(trainMatrix) #6
    numWords = len(trainMatrix[0]) #32
    pAbusive = sum(trainCategory) / float(numTrainDocs) #3/6
    p0Num = np.ones(numWords)   # Laplace smoothing: initialize all word counts to 1 ...
    p1Num = np.ones(numWords)
    p0Denom = 2.0               # ... and the denominators to 2, so no conditional probability is ever 0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:  # The data required for statistics of conditional probability belonging to insult class, i.e. P(w0|1),P(w1|1),P(w2|1)···
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
            print("p1Num p1Denom",p1Num,p1Denom)
        else:  # The data required to count the conditional probability belonging to the non insult class, i.e. P(w0|0),P(w1|0),P(w2|0)···
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
            print("p0Num poDenom",p0Num,p0Denom)
    p1Vect = np.log(p1Num/p1Denom) #Conditional probability array of the insult class (log form, to avoid underflow and to match classifyNB)
    p0Vect = np.log(p0Num/p0Denom) #Conditional probability array of the non-insulting class (log form)
    return p0Vect,p1Vect,pAbusive

"""
Function description:Naive Bayesian classifier classification function

Parameters:
	vec2Classify - Array of entries to be classified
	p0Vec - Conditional probability array of non insulting classes
	p1Vec -Conditional probability array of insult class
	pClass1 - Probability that the document belongs to the insult class
Returns:
	0 - It belongs to the non insulting category
	1 - It belongs to the insult category
"""
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + math.log(pClass1)        # log P(w|1) summed over present words + log P(1)
    p0 = sum(vec2Classify * p0Vec) + math.log(1.0 - pClass1)  # log P(w|0) summed over present words + log P(0)
    print('p0:',p0)
    print('p1:',p1)
    if p1 > p0:
        return 1
    else:
        return 0

if __name__ == '__main__':
    listOPosts, listClasses = loadDataSet()  # Create experimental samples
    myVocabList = createVocabList(listOPosts)  # Create vocabulary
    trainMat = []
    for postinDoc in listOPosts:
        print(postinDoc)
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    print("trainMat ",trainMat)
    print("len(trainMat)",len(trainMat))
    print("len(trainMat[0])",len(trainMat[0]))
    p0V, p1V, pAb = trainNB0(np.array(trainMat), np.array(listClasses))
    print("p0V ",p0V)
    print("p1V ",p1V)
    print("pAb ",pAb)
    testEntry = ['love', 'my', 'dalmation']  # Test sample 1
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))  # Test sample Vectorization
    if classifyNB(thisDoc, p0V, p1V, pAb):
        print(testEntry, 'It belongs to the insult category')  # Perform classification and print classification results
    else:
        print(testEntry, 'It belongs to the non insulting category')  # Perform classification and print classification results

    testEntry = ['stupid', 'garbage']  # Test sample 2
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))  # Test sample Vectorization
    if classifyNB(thisDoc, p0V, p1V, pAb):
        print(testEntry, 'It belongs to the insult category')  # Perform classification and print classification results
    else:
        print(testEntry, 'It belongs to the non insulting category')

 

2. Spam classification
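The listing below uses the random module and a textParse helper that are not shown here. A minimal sketch of textParse, following the usual Machine Learning in Action style (treat the exact tokenization as an assumption), splits the raw e-mail text on non-word characters and keeps lowercased tokens longer than two characters:

import random
import re

def textParse(bigString):
    # Split the raw text on any run of non-word characters and drop very short tokens
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]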

def spamTest():
    docList = []; classList = []; fullText = []
    for i in range(1, 26):
        # Read each spam and convert the string into a string list
        wordList = textParse(open('email/spam/%d.txt' % i, 'r').read())
        print(i," ",wordList)
        docList.append(wordList)
        fullText.append(wordList)
        # Mark spam, 1 indicates junk file
        classList.append(1)
        print(i," ",classList)
        # Read each non spam message and convert the string into a string list
        wordList = textParse(open('email/ham/%d.txt' % i, 'r').read())
        print(i, " ", wordList)
        docList.append(wordList)
        fullText.append(wordList)
        # Mark non spam, 0 means non spam
        classList.append(0)
    # Create vocabulary without repetition
    vocabList = createVocabList(docList)
    trainingSet = list(range(50)); testSet = []
    print("trainingSet",trainingSet)
    for i in range(10):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []; trainClasses = []
    # Ergodic training set
    for docIndex in trainingSet:
        # Add the generated word set model to the training matrix
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
        # Add a category to the training set category label coefficient vector
        trainClasses.append(classList[docIndex])
    # Training naive Bayesian model
    p0V, p1V, pSpam = trainNB0(np.array(trainMat), np.array(trainClasses))
    errorCount = 0
    # Traversal test set
    for docIndex in testSet:                                                
        wordVector = setOfWords2Vec(vocabList, docList[docIndex])           
        if classifyNB(np.array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:   
            errorCount += 1                                                 
            print("Misclassified test sets:",docList[docIndex])
    print('Error rate:%.2f%%' % (float(errorCount) / len(testSet) * 100))
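spamTest() can then be called directly (the e-mail texts are expected under email/spam/ and email/ham/, as in the open() calls above). Because the 10 test messages are sampled at random, the printed error rate varies between runs; a simple sketch for getting a feel for the average behaviour is to run it several times:

if __name__ == '__main__':
    # Each call draws a new random test set, so the error rate fluctuates between runs
    for _ in range(5):
        spamTest()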

3. My Data

Reference

Take you to understand the naive Bayesian classification algorithm (CSDN blog): https://blog.csdn.net/AMDS123/article/details/70173402
