# Chapter 6 (1.6) Machine Learning in Practice: Building Your Own Bayesian Classifier

1. Introduction

• The project uses Spring Boot to provide a thin web wrapper
• The project uses HanLP for word segmentation
• The project uses JDK 8
• Bayes' rule: the probability of event A given that event B has occurred is, in general, different from the probability of event B given that A has occurred; however, there is a definite relationship between the two, and Bayes' rule is the statement of that relationship.
• Bayesian terminology: Bayes' rule states Pr(A|B) = Pr(B|A) · Pr(A) / Pr(B), where L(A|B) denotes the likelihood of A when B occurs. In Bayes' rule, each term has a conventional name: Pr(A) is the prior (or marginal) probability of A; it is called "prior" because it does not take B into account. Pr(A|B) is the conditional probability of A once B is known to have occurred; it is also called the posterior probability of A, because it is derived from the value of B. Pr(B|A) is the conditional probability of B given A, likewise called the posterior probability of B. Pr(B) is the prior (or marginal) probability of B, which also serves as a normalizing constant. Posterior probability = (likelihood × prior probability) / normalizing constant; that is, the posterior is proportional to the product of the prior and the likelihood.
• Meaning of Bayesian inference: rearranging the conditional probability formula yields the following form: P(A|B) = P(A) × P(B|A) / P(B)

• We call P(A) the "prior probability": our judgment of the probability of event A before event B occurs. P(A|B) is the "posterior probability": our reassessment of the probability of A after B has occurred. P(B|A)/P(B) is the "likelihood function", an adjustment factor that brings the estimated probability closer to the true probability. Posterior probability = prior probability × adjustment factor.
• This is what Bayesian inference means: we start from an estimated "prior probability", then incorporate the experimental result to see whether it strengthens or weakens that prior, arriving at a "posterior probability" closer to the facts. If the likelihood function P(B|A)/P(B) > 1, the prior is strengthened and event A becomes more likely; if it equals 1, event B tells us nothing about A; if it is less than 1, the prior is weakened and A becomes less likely.
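• A minimal numeric sketch of the update rule above (the probability values here are made up for illustration and do not come from the text):

```java
public class BayesUpdate {
    public static void main(String[] args) {
        double prior = 0.3;      // P(A): belief in A before observing B
        double likelihood = 0.8; // P(B|A)
        double evidence = 0.5;   // P(B)
        // Adjustment factor P(B|A)/P(B): > 1 strengthens the prior,
        // = 1 leaves it unchanged, < 1 weakens it.
        double factor = likelihood / evidence;
        double posterior = prior * factor; // P(A|B) = prior x adjustment factor
        System.out.printf("adjustment factor = %.2f%n", factor); // 1.60, so B makes A more likely
        System.out.printf("posterior = %.2f%n", posterior);      // 0.48, up from the 0.30 prior
    }
}
```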

2. Examples

• Villa and dog: a villa has been burgled twice in the past 20 years, and its owner keeps a dog that barks, on average, three nights a week. The probability that the dog barks when a thief breaks in is estimated at 0.9. Question: when the dog barks, what is the probability that a break-in is underway? Let event A be the dog barking at night and event B a burglary. Then P(A) = 3/7, P(B) = 2/(20 × 365) = 2/7300, and P(A|B) = 0.9. The formula immediately gives the result: P(B|A) = 0.9 × (2/7300) / (3/7) ≈ 0.00058
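• The arithmetic above can be checked in a few lines of Java (a direct transcription of the numbers in the example):

```java
public class VillaDog {
    public static void main(String[] args) {
        double pBark = 3.0 / 7.0;            // P(A): the dog barks on a given night
        double pBurglary = 2.0 / (20 * 365); // P(B): a burglary on a given night
        double pBarkGivenBurglary = 0.9;     // P(A|B)
        // Bayes' rule: P(B|A) = P(A|B) * P(B) / P(A)
        double pBurglaryGivenBark = pBarkGivenBurglary * pBurglary / pBark;
        System.out.printf("P(B|A) = %.5f%n", pBurglaryGivenBark); // prints 0.00058
    }
}
```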

3. Hands-on code

• Model file (classify.txt)
```
Naruto Naruto
Naruto secret
Naruto snake pill
Naruto theater Edition
Naruto action
Naruto fight
Naruto battle
Naruto reincarnation
Naruto Sasuke
Naruto Village
Naruto sixth generation Naruto
Naruto KRA
Naruto Kaka
Naruto with earth
Naruto blast
Naruto comes from
Naruto Naruto
Naruto Fairy
Naruto six ways
Naruto war
Naruto nine tails
Naruto Ninja
Naruto research
Naruto master
Naruto Naruto
Naruto leaf
Naruto Ninja
Naruto dirt
Naruto Yu Zhibo
Naruto Nine Tailed demon fox
Naruto a Fei
Pirate Wang Wenwen
Pirate Wang Weitian
Pirate king pirate king
Frankie the pirate king
Pirate king straw hat
Pirate king pirate
Pirate Wang Wuhai
Pirate king incident
Pirate king offers a reward
Pirate king's words
Pirate king dream
Pirate king blood group
Pirate king
Pirate king route
History of pirate king
Deres the pirate king
Captain pirate king
Pirate king demon
Pirate Wang Lufei
One Piece comics
Pirate king supernova
Pirate king Rosa chapter
Pirate king world
Pirate king fruit
Pirate king Pluto
Pirate Wang rongyilang
Pirate king pirate regiment
Pirate king justice
Pirate king Superman
The Pirate King became
Pirate king looking for
Legend of pirate king
Pirate king pirate king
Pirate king Zhonghai
Roger the pirate king
Pirate king's Secret Treasure
The pirate king stays
Pirate king partner
One Piece ONE
One Piece PIECE
Pirate king pirate
Like minded pirate king
Pirate Wang Yangqi
Pirate king
Dragon Ball Resurrection
Dragon pearl Fairy
Longzhu Wudao
Longzhu get
Dragon Ball Legion
Longzhu search
Dragon Ball demon king
Longzhu dumplings
Longzhu special
Dragon Ball defeat
Longzhu pear
Dragon Ball ribbon
Longzhu sale date
Longzhu Longzhu
Longzhu Tianjin
Dragon Ball seven dragon balls
Longzhu bick
Dragon Ball God
Dragon Ball cultivation
Dragon Ball Wukong
Dragon ball seal
Longzhujiro
Longzhuraf
Longzhu wish
Dragon Ball guards
Dragon ball animation
```
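• Each line of the model file is space-separated: the first token is the category label and the remaining tokens are keywords observed for that category. (Some translated lines above show multi-word labels; for the parsing code to work, each label must be a single token.) A hypothetical one-line illustration of how such a line is split:

```java
import java.util.Arrays;

public class ModelLineDemo {
    public static void main(String[] args) {
        String line = "Naruto nine tails"; // one (hypothetical) line of the model file
        String[] tokens = line.split(" ");
        String category = tokens[0];                                      // the label
        String[] keywords = Arrays.copyOfRange(tokens, 1, tokens.length); // the features
        System.out.println(category + " -> " + Arrays.toString(keywords)); // prints Naruto -> [nine, tails]
    }
}
```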
• Bayes.java
```java
package com.xinrui.util;

import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.commons.io.Charsets;
import org.apache.commons.io.FileUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.log4j.Logger;

import com.hankcs.hanlp.HanLP;

/**
 * Naive Bayes classifier.
 */
public class Bayes {

    private static Logger logger = Logger.getLogger(Bayes.class);

    /**
     * Group the raw training tuples by category.
     *
     * @param datas
     *            training tuples; the first element of each tuple is its category label
     * @return Map<category, training tuples belonging to that category>
     */
    public static Map<String, ArrayList<ArrayList<String>>> classifyByCategory(ArrayList<ArrayList<String>> datas) {
        if (datas == null) {
            return null;
        }

        Map<String, ArrayList<ArrayList<String>>> map = new HashMap<String, ArrayList<ArrayList<String>>>();
        ArrayList<String> singleTraining = null;
        String classification = "";
        for (int i = 0; i < datas.size(); i++) {
            singleTraining = datas.get(i);
            classification = singleTraining.get(0);
            singleTraining.remove(0); // strip the label; what remains is the feature vector
            if (map.containsKey(classification)) {
                map.get(classification).add(singleTraining);
            } else {
                ArrayList<ArrayList<String>> list = new ArrayList<ArrayList<String>>();
                list.add(singleTraining);
                map.put(classification, list);
            }
        }

        return map;
    }

    /**
     * Predict the category of a test tuple from the training data.
     *
     * @param datas
     *            training tuples
     * @param testData
     *            test tuple
     * @return predicted category of the test tuple
     */
    public static String predictClassify(ArrayList<ArrayList<String>> datas, ArrayList<String> testData) {

        if (datas == null || testData == null) {
            return null;
        }

        int maxPIndex = -1;
        Map<String, ArrayList<ArrayList<String>>> map = classifyByCategory(datas);
        Object[] classes = map.keySet().toArray();
        double maxProbability = 0.0;
        for (int i = 0; i < map.size(); i++) {
            double p = 0.0;
            // score the category by summing P(classify|key) over the test keywords
            for (int j = 0; j < testData.size(); j++) {
                p += calProbabilityClassificationInKey(map, classes[i].toString(), testData.get(j));
            }
            if (p > maxProbability) {
                maxProbability = p;
                maxPIndex = i;
            }
        }

        return maxPIndex == -1 ? "other" : classes[maxPIndex].toString();
    }

    /**
     * Predict the category of a test tuple. Note: mId is treated here as the
     * path to the model file; its exact meaning in the original project is
     * not documented.
     *
     * @param testData
     *            test tuple
     * @return predicted category of the test tuple
     * @throws Exception
     */
    public String predictClassify(ArrayList<String> testData, String mId) throws Exception {
        return predictClassify(read(mId), testData);
    }

    /**
     * Probability of a feature value within one category [P(key|Classify)]
     *
     * @param classify
     *            the feature vectors of one category
     * @param value
     *            a feature value
     * @return probability of the feature value in this category
     */
    private static double calProbabilityKeyInClassification(ArrayList<ArrayList<String>> classify, String value) {
        if (classify == null || StringUtils.isEmpty(value)) {
            return 0.0;
        }
        int totalKeyCount = 0;
        int foundKeyCount = 0;
        ArrayList<String> featureVector = null; // one feature vector of the category
        for (int i = 0; i < classify.size(); i++) {
            featureVector = classify.get(i);
            for (int j = 0; j < featureVector.size(); j++) {
                totalKeyCount++;
                if (featureVector.get(j).equalsIgnoreCase(value)) {
                    foundKeyCount++;
                }
            }
        }
        return totalKeyCount == 0 ? 0.0 : 1.0 * foundKeyCount / totalKeyCount;
    }

    /**
     * Prior probability of a category [P(Classify)]
     *
     * @param map
     *            all categorized training data
     * @param classify
     *            a specific category
     * @return probability of the category
     */
    private static double calProbabilityClassification(Map<String, ArrayList<ArrayList<String>>> map, String classify) {
        if (map == null || StringUtils.isEmpty(classify)) {
            return 0;
        }
        Object[] classes = map.keySet().toArray();
        int totalClassifyCount = 0;
        for (int i = 0; i < classes.length; i++) {
            totalClassifyCount += map.get(classes[i].toString()).size();
        }
        return 1.0 * map.get(classify).size() / totalClassifyCount;
    }

    /**
     * Overall probability of a keyword [P(key)]
     *
     * @param map
     *            all categorized training data
     * @param key
     *            a feature value
     * @return ratio of the feature value across all categories
     */
    private static double calProbabilityKey(Map<String, ArrayList<ArrayList<String>>> map, String key) {
        if (map == null || StringUtils.isEmpty(key)) {
            return 0;
        }
        int foundKeyCount = 0;
        int totalKeyCount = 0;
        Object[] classes = map.keySet().toArray();
        for (int i = 0; i < map.size(); i++) {
            ArrayList<ArrayList<String>> classify = map.get(classes[i]);
            ArrayList<String> featureVector = null; // one feature vector of the category
            for (int j = 0; j < classify.size(); j++) {
                featureVector = classify.get(j);
                for (int k = 0; k < featureVector.size(); k++) {
                    totalKeyCount++;
                    if (featureVector.get(k).equalsIgnoreCase(key)) {
                        foundKeyCount++;
                    }
                }
            }
        }
        return totalKeyCount == 0 ? 0.0 : 1.0 * foundKeyCount / totalKeyCount;
    }

    /**
     * Probability of a category given that a key appears [P(Classify|key)]
     *
     * @param map
     *            all categorized training data
     * @param classify
     *            a specific category
     * @param key
     *            a specific feature
     * @return P(Classify|key)
     */
    private static double calProbabilityClassificationInKey(Map<String, ArrayList<ArrayList<String>>> map, String classify, String key) {
        ArrayList<ArrayList<String>> classifyList = map.get(classify);
        double pkc = calProbabilityKeyInClassification(classifyList, key); // P(key|classify)
        double pc = calProbabilityClassification(map, classify); // P(classify)
        double pk = calProbabilityKey(map, key); // P(key)
        return pk == 0 ? 0 : pkc * pc / pk; // P(classify|key) by Bayes' rule
    }

    /**
     * Read and parse the training data from the model file.
     *
     * @param clzss
     *            path to the model file
     * @return training data set
     * @throws Exception
     */
    public static ArrayList<ArrayList<String>> read(String clzss) throws Exception {
        ArrayList<String> singleTraining = null;
        ArrayList<ArrayList<String>> trainingSet = new ArrayList<ArrayList<String>>();
        List<String> datas = new ArrayList<String>(FileUtils.readLines(new File(clzss), Charsets.UTF_8));
        if (datas.size() == 0) {
            return trainingSet; // empty model file, nothing to train on
        }
        for (int i = 0; i < datas.size(); i++) {
            String[] characteristicValues = datas.get(i).split(" ");
            singleTraining = new ArrayList<String>();
            for (int j = 0; j < characteristicValues.length; j++) {
                if (StringUtils.isNotEmpty(characteristicValues[j])) {
                    singleTraining.add(characteristicValues[j]);
                }
            }
            trainingSet.add(singleTraining);
        }
        return trainingSet;
    }

    /**
     * Evaluate the classifier against a labeled test file whose name (minus
     * the extension) is the expected category of every line in it.
     *
     * @param fileName
     *            test file; each line is one document
     * @param mId
     *            model file path (see predictClassify above)
     * @param size
     *            number of keywords to extract per line
     */
    public static void trainBayes(String fileName, String mId, int size) {
        try {
            Bayes bayes = new Bayes();
            int total = 0;
            int right = 0;
            long start = System.currentTimeMillis();
            // the expected category is the test file's name without its extension
            String expected = new File(fileName).getName().split("\\.")[0];
            List<String> lines = FileUtils.readLines(new File(fileName), Charsets.UTF_8);
            for (String line : lines) {
                ArrayList<String> testData = new ArrayList<String>(HanLP.extractKeyword(line, size));
                String classification = bayes.predictClassify(testData, mId);
                if (expected.equals(classification)) {
                    right += 1;
                }
                System.out.print("\nClassification: " + classification);
                total++;
            }
            long end = System.currentTimeMillis();
            System.out.println("Correctly classified: " + right);
            System.out.println("Total number of rows: " + total);
            // MathUtil.div is a project helper that divides to the given scale
            System.out.println("Accuracy: " + MathUtil.div(right, total, 4) * 100 + "%");
            System.out.println("Program run time: " + (end - start) / 1000 + "s");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

}
```
• TestBayes.java
```java
package com.xinrui.test;

import java.util.ArrayList;

import com.hankcs.hanlp.HanLP;
import com.xinrui.util.Bayes;

public class TestBayes {
    public static void main(String[] args) throws Exception {
        // Locate the classpath root of the current project
        String path = TestBayes.class.getResource("").getPath();
        String classPath = path.substring(0, path.indexOf("/com/xinrui"));
        // Model file location
        String modelName = classPath + "/model/classify_model.txt";
        // Extract 15 keywords to form a test tuple
        ArrayList<String> testData = new ArrayList<String>(HanLP.extractKeyword(
                "It was the \"age of the great pirate\". In order to find ONE PIECE, the great secret treasure left behind by the legendary pirate king Roger, countless pirates raised their flags and fought each other. A young man named Luffy dreamed of becoming a pirate; he turned into a rubber man after eating the \"devil fruit\" by mistake, gaining superhuman ability at the price of never being able to swim. Ten years later, Luffy went to sea to honor his promise to Shanks, who had lost an arm saving him. On the journey he kept looking for like-minded partners and began a great adventure with the goal of becoming the pirate king.",
                15));
        // Load the model and print the predicted category
        ArrayList<ArrayList<String>> model = Bayes.read(modelName);
        System.out.println(Bayes.predictClassify(model, testData));
    }
}
```
• Result screenshot (image not preserved)

Posted on Wed, 24 Nov 2021 08:37:01 -0500 by pourmeanother