[Machine Learning in Action] Naive Bayes

Classification method based on Bayesian decision theory

Naive Bayes

  • Advantages: remains effective with small amounts of data and can handle multi-class problems.
  • Disadvantages: sensitive to how the input data is prepared.

Applicable data type: nominal data.

Classifying with conditional probability

Conditional probability

If you are not familiar with conditional probability, see my earlier article on the topic.

Bayesian decision theory requires computing two probabilities, $p_1(x, y)$ and $p_2(x, y)$:

  • If $p_1(x, y) > p_2(x, y)$, then the point belongs to category 1;
  • if $p_2(x, y) > p_1(x, y)$, then the point belongs to category 2.

However, these two criteria are not the whole of Bayesian decision theory; $p_1(\cdot)$ and $p_2(\cdot)$ are just simplified notation. What really has to be computed and compared is $p(c_1 \mid x, y)$ and $p(c_2 \mid x, y)$. These symbols mean: given a data point represented by $(x, y)$, what is the probability that it comes from class $c_1$? And what is the probability that it comes from class $c_2$? Note that these are not the same as the probabilities $p(x, y \mid c_i)$ given earlier, which denote the probability of observing $x, y$ under the condition that the class is $c_i$. Bayes' rule, however, lets us swap the condition and the outcome; applying it gives:

$$p(c_i \mid x, y) = \frac{p(x, y \mid c_i)\, p(c_i)}{p(x, y)}$$

Using these definitions, the Bayesian classification criterion can be stated as (a short numerical sketch follows the list):

  • If $p(c_1 \mid x, y) > p(c_2 \mid x, y)$, then the point belongs to class $c_1$.
  • If $p(c_1 \mid x, y) < p(c_2 \mid x, y)$, then the point belongs to class $c_2$.
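
As a quick numerical illustration (a minimal sketch with made-up numbers, not from the original example):

pC1 = 0.5; pC2 = 0.5                  #priors p(c1), p(c2), hypothetical values
pXYgivenC1 = 0.06                     #likelihood p(x,y|c1), hypothetical
pXYgivenC2 = 0.02                     #likelihood p(x,y|c2), hypothetical
pXY = pXYgivenC1*pC1 + pXYgivenC2*pC2 #evidence p(x,y) = 0.04
pC1givenXY = pXYgivenC1*pC1/pXY       #posterior p(c1|x,y) ≈ 0.75
pC2givenXY = pXYgivenC2*pC2/pXY       #posterior p(c2|x,y) ≈ 0.25
print(pC1givenXY, pC2givenXY)         #0.75 > 0.25, so the point is assigned to class c1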

Document classification using naive Bayes

The general process of naive Bayes

  1. Collect data: any method can be used; this chapter uses RSS feeds.
  2. Prepare data: numeric or boolean data is required.
  3. Analyze data: with a large number of features, plotting individual features is of little use; histograms work better.
  4. Train the algorithm: compute the conditional probabilities of the individual, independent features.
  5. Test the algorithm: compute the error rate.
  6. Use the algorithm: a common application of naive Bayes is document classification, but a naive Bayes classifier can be used in any classification scenario, not just text.

Text categorization using Python

#Word-list to vector conversion functions
def loadDataSet():
    postingList=[['my','dog','has','flea','problems','help','please'],
                 ['maybe','not','take','him','to','dog','park','stupid'],
                 ['my','dalmation','is','so','cute','I','love','him'],
                 ['stop','posting','stupid','worthless','garbage'],
                 ['mr','licks','ate','my','steak','how','to','stop','him'],
                 ['quit','buying','worthless','dog','food','stupid']]
    classVec = [0,1,0,1,0,1] #1 stands for insulting speech, 0 for normal speech
    return postingList,classVec
def createVocabList(dataSet):
    vocabSet = set()                        #Start with an empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document) #Take the union with each document's word set
    return list(vocabSet)
def setOfWord2Vec(vocabList, inputSet):
    returnVec = [0]*len(vocabList)          #Vector of zeros, one slot per vocabulary word
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1  #Mark the word as present
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec
listOposts,listClasses = loadDataSet()
myVocabList = createVocabList(listOposts)
myVocabList
['dog', 'to', 'ate', 'love', 'not', 'him', 'my', 'posting', 'steak', 'food',
 'mr', 'so', 'how', 'buying', 'has', 'is', 'park', 'dalmation', 'cute',
 'problems', 'stop', 'flea', 'stupid', 'I', 'quit', 'worthless', 'please',
 'licks', 'maybe', 'garbage', 'help', 'take']
setOfWord2Vec(myVocabList,listOposts[0])
[1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
setOfWord2Vec(myVocabList,listOposts[3])
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
from numpy import *
#Naive Bayes classifier training function
def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)   #Number of training documents
    numWords = len(trainMatrix[0])    #Number of vocabulary entries per document vector
    pAbusive = sum(trainCategory)/float(numTrainDocs)  #Probability that a document belongs to the insult class
    p0Num = zeros(numWords); p1Num = zeros(numWords)
    p0Denom = 0.0; p1Denom = 0.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:     #Accumulate counts for the insult-class conditional probabilities P(w0|1),P(w1|1),P(w2|1),...
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:                         #Accumulate counts for the non-insult-class conditional probabilities P(w0|0),P(w1|0),P(w2|0),...
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = p1Num/p1Denom
    p0Vect = p0Num/p0Denom
    return p0Vect,p1Vect,pAbusive    #Conditional probability vectors for the insult and non-insult classes, plus the insult-class prior
trainMat = []
for postinDoc in listOposts:
    trainMat.append(setOfWord2Vec(myVocabList,postinDoc))
p0V,p1V,pAb = trainNB0(trainMat,listClasses)
pAb
0.5
p0V
array([0.04166667, 0.04166667, 0.04166667, 0.04166667, 0.        ,
       0.08333333, 0.125     , 0.        , 0.04166667, 0.        ,
       0.04166667, 0.04166667, 0.04166667, 0.        , 0.04166667,
       0.04166667, 0.        , 0.04166667, 0.04166667, 0.04166667,
       0.04166667, 0.04166667, 0.        , 0.04166667, 0.        ,
       0.        , 0.04166667, 0.04166667, 0.        , 0.        ,
       0.04166667, 0.        ])
p1V
array([0.10526316, 0.05263158, 0.        , 0.        , 0.05263158,
       0.05263158, 0.        , 0.05263158, 0.        , 0.05263158,
       0.        , 0.        , 0.        , 0.05263158, 0.        ,
       0.        , 0.05263158, 0.        , 0.        , 0.        ,
       0.05263158, 0.        , 0.15789474, 0.        , 0.05263158,
       0.10526316, 0.        , 0.        , 0.05263158, 0.05263158,
       0.        , 0.05263158])
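As a quick sanity check (a minimal sketch, reusing myVocabList and p1V from the session above), the largest entry of p1V should line up with a word that is frequent in the insulting posts:

print(myVocabList[argmax(p1V)])   #'stupid' here; the exact index varies, since the vocabulary order comes from a set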
#Modified, numerically more stable version. Two changes: counts start at 1 and
#denominators at 2 (Laplace smoothing), so a word unseen in one class cannot
#zero out the whole product; and probabilities are returned as logarithms, so
#that multiplying many small factors cannot underflow to zero.
#Naive Bayes classifier training function
def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)   #Number of training documents
    numWords = len(trainMatrix[0])    #Number of vocabulary entries per document vector
    pAbusive = sum(trainCategory)/float(numTrainDocs)  #Probability that a document belongs to the insult class
    p0Num = ones(numWords); p1Num = ones(numWords)     #Laplace smoothing: counts start at 1
    p0Denom = 2.0; p1Denom = 2.0                       #and denominators at 2
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:     #Accumulate counts for the insult-class conditional probabilities P(w0|1),P(w1|1),P(w2|1),...
            #Count the occurrences of each word across all class-1 documents
            p1Num += trainMatrix[i]
            #Count the total number of words across all class-1 documents
            p1Denom += sum(trainMatrix[i])
        else:                         #Accumulate counts for the non-insult-class conditional probabilities P(w0|0),P(w1|0),P(w2|0),...
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num/p1Denom)
    p0Vect = log(p0Num/p0Denom)       #Log-frequency of each word across all class-0 documents, log p(wi|c0)
    return p0Vect,p1Vect,pAbusive     #Log conditional probability vectors for both classes, plus the insult-class prior
def classifyNB(vec2Classify,p0Vec,p1Vec,pClass1):
    #In log space the product p(c)*prod(p(wi|c)) becomes a sum; multiplying by
    #the 0/1 document vector keeps only the words that are actually present
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0
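Why work in log space? A product of many probabilities below 1 quickly underflows to 0.0 in floating point, while the equivalent sum of logs stays well-behaved. A minimal sketch to illustrate the point:

prod = 1.0; logSum = 0.0
for _ in range(1000):
    prod *= 1e-5            #the running product underflows to 0.0
    logSum += log(1e-5)     #the log-space sum is just 1000*log(1e-5)
print(prod, logSum)         #0.0 -11512.925...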
def testingNB():
    listOposts,listClasses = loadDataSet()
    myVocabList = createVocabList(listOposts)
    trainMat = []
    for postinDoc in listOposts:
        trainMat.append(setOfWord2Vec(myVocabList,postinDoc))
    p0V,p1V,pAb = trainNB0(array(trainMat),array(listClasses))
    testEntry = ['love','my','dalmation']
    thisDoc = array(setOfWord2Vec(myVocabList,testEntry))
    print(testEntry,"classified as :",classifyNB(thisDoc,p0V,p1V,pAb))
    testEntry = ['stupid','garbage']
    thisDoc = array(setOfWord2Vec(myVocabList,testEntry))
    print(testEntry,"classified as :",classifyNB(thisDoc,p0V,p1V,pAb))
testingNB()
['love', 'my', 'dalmation'] classified as : 0
['stupid', 'garbage'] classified as : 1
#Naive Bayes bag-of-words model
def bagOfWords2VecMN(vocabList,inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec   
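The only difference from setOfWord2Vec() is the += 1: if a word occurs several times in a document, the bag-of-words vector stores its count instead of just flagging its presence. A minimal sketch (hypothetical input, reusing the myVocabList built above):

bagVec = bagOfWords2VecMN(myVocabList,['stupid','garbage','stupid'])
setVec = setOfWord2Vec(myVocabList,['stupid','garbage','stupid'])
print(max(bagVec), max(setVec))   #2 1: the bag model counts the repeated 'stupid'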

Example: spam filtering

Dataset Download

Lanzou Cloud address

Example: e-mail classification using naive Bayes

  1. Collect data: text files are provided.
  2. Prepare data: parse the text files into token vectors.
  3. Analyze data: inspect the tokens to make sure parsing is correct.
  4. Train the algorithm: use the trainNB0() function established earlier.
  5. Test the algorithm: use classifyNB() and build a new test function that computes the error rate over the document set.
  6. Use the algorithm: build a complete program that classifies a set of documents and prints the misclassified ones to the screen.
#File parsing and spam test functions
def textParse(bigString):
    import re
    #Split on runs of non-word characters; note \W+ rather than \W*,
    #since patterns matching the empty string break re.split on newer Python versions
    listOfTokens = re.split(r'\W+',bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]  #Drop tokens shorter than 3 characters, lower-case the rest
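A quick check of the parser (an example sentence, not from the dataset):

textParse('This book is the best book on Python or M.L. I have ever laid eyes upon.')
#['this', 'book', 'the', 'best', 'book', 'python', 'have', 'ever', 'laid', 'eyes', 'upon']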

def spamTest():
    docList = []; classList = []; fullText = []
    for i in range(1,26):
        wordList = textParse(open('email/spam/%d.txt' % i,'rb').read().decode('utf8','ignore'))
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i,'rb').read().decode('utf8','ignore'))
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    #Build the vocabulary list from all words appearing in any message
    vocabList = createVocabList(docList)
    #Hold-out cross-validation: start from all 50 document indices,
    #then randomly move 10 of them into the test set
    trainingSet = list(range(50)); testSet = []
    for i in range(10):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = [] ;trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(setOfWord2Vec(vocabList,docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    errorCount = 0
    for docIndex in testSet:
        wordVector = setOfWord2Vec(vocabList,docList[docIndex])
        if classifyNB(array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
            errorCount += 1
    print('the error rate is: ',float(errorCount)/len(testSet))
spamTest()
the error rate is:  0.6
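
Two caveats about this number. First, the 0.6 error rate above came from the original \W* pattern: on recent Python versions it splits the text into single characters, which the len(tok) > 2 filter then throws away, leaving empty documents; with the corrected \W+ pattern the error rate is typically far lower. Second, the ten test messages are drawn at random, so the error rate varies from run to run, and averaging over several runs gives a steadier estimate. A minimal sketch, assuming spamTest() is changed to return float(errorCount)/len(testSet) instead of only printing it:

errSum = 0.0; numRuns = 10
for _ in range(numRuns):
    errSum += spamTest()          #assumes the modified spamTest() that returns the error rate
print('average error rate:', errSum/numRuns)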
