Chapter 6 (1.6) Machine learning in practice -- building your own Bayesian classifier

GitHub project: https://github.com/liangzhicheng120/bayes

1. Introduction

  • The project uses SpringBoot to provide a thin web layer
  • The project uses HanLP for word segmentation
  • The project uses JDK 8
  • Bayes' rule. The probability of event A given that event B has occurred is, in general, different from the probability of event B given that event A has occurred; however, there is a definite relationship between the two, and Bayes' theorem is the statement of that relationship.
  • Bayesian terminology. Bayes' theorem reads: P(A|B) = P(B|A) · P(A) / P(B), where L(A|B) = P(B|A) is the likelihood of A when B is observed. In Bayes' law, every term has a conventional name: P(A) is the prior (or marginal) probability of A; it is called "prior" because it does not take any information about B into account. P(A|B) is the conditional probability of A once B is known, also called the posterior probability of A because it is derived from the observed value of B. P(B|A) is the conditional probability of B once A is known, likewise called the posterior probability of B because it is derived from the value of A. P(B) is the prior (or marginal) probability of B, and it also serves as the normalizing constant. In short: posterior probability = (likelihood × prior probability) / normalizing constant, i.e. the posterior probability is proportional to the product of the prior probability and the likelihood.
  • The meaning of Bayesian inference. Rearranging the conditional-probability formula yields the following form:

P(A|B) = P(A) · P(B|A) / P(B)

  • We call P(A) the "prior probability": our estimate of the probability of event A before event B is observed. P(A|B) is the "posterior probability": our re-estimate of the probability of event A after event B has occurred. P(B|A)/P(B) is the "likelihood function", an adjustment factor that pulls the estimated probability closer to the true probability. In other words: posterior probability = prior probability × adjustment factor.
  • This is what Bayesian inference means: we first estimate a "prior probability", then fold in the experimental evidence to see whether it strengthens or weakens that prior, arriving at a "posterior probability" that is closer to the truth. If the likelihood function P(B|A)/P(B) > 1, the prior is strengthened and event A becomes more likely; if it equals 1, event B tells us nothing about event A; if it is < 1, the prior is weakened and event A becomes less likely. For example, with a prior of 0.5 and a likelihood function of 1.2, the posterior rises to 0.6.

2. Example

  • Villa and dog. A villa has been burgled twice in the past 20 years, and the owner's dog barks, on average, three nights a week. The probability that the dog barks when a thief breaks in is estimated at 0.9. Question: when the dog barks, what is the probability of a break-in? Let event A be the dog barking at night and event B be a thief breaking in. Then P(A) = 3/7, P(B) = 2/(20 × 365) = 2/7300, and P(A|B) = 0.9. Applying Bayes' rule gives the result: P(B|A) = 0.9 × (2/7300) / (3/7) ≈ 0.00058. A sketch of this computation in Java follows below.
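A minimal, self-contained Java sketch of the computation above (the class and variable names are ours):

public class DogBarkBayes {
    public static void main(String[] args) {
        double pBark = 3.0 / 7;           // P(A): the dog barks on a given night
        double pIntrusion = 2.0 / 7300;   // P(B): a break-in on a given night (2 in 20 years)
        double pBarkGivenIntrusion = 0.9; // P(A|B): the dog barks during a break-in

        // Bayes' rule: P(B|A) = P(A|B) * P(B) / P(A)
        double pIntrusionGivenBark = pBarkGivenIntrusion * pIntrusion / pBark;
        System.out.printf("P(intrusion | bark) = %.5f%n", pIntrusionGivenBark); // ~0.00058
    }
}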

3. Hands-on code

  • Model file (classify.txt). Each line is one training row: the first token is the category label and the remaining tokens are keywords:
Naruto Naruto
 Naruto secret
 Naruto snake pill
 Naruto theater Edition
 Naruto action
 Naruto fight
 Naruto battle
 Naruto reincarnation
 Naruto Sasuke
 Naruto Village
 Naruto sixth generation Naruto
 Naruto KRA
 Naruto Kaka
 Naruto with earth
 Naruto blast
 Naruto comes from
 Naruto Naruto
 Naruto Fairy
 Naruto six ways
 Naruto war
 Naruto nine tails
 Naruto Ninja
 Naruto research
 Naruto master
 Naruto Naruto
 Naruto leaf
 Naruto Ninja
 Naruto dirt
 Naruto Yu Zhibo
 Naruto Nine Tailed demon fox
 Naruto a Fei
 Pirate Wang Wenwen
 Pirate Wang Weitian
 Pirate king pirate king
 Frankie the pirate king
 Pirate king straw hat
 Pirate king pirate
 Pirate Wang Wuhai
 Pirate king incident
 Pirate king offers a reward
 Pirate king's words
 Pirate king dream
 Pirate king blood group
 Pirate king
 Pirate king route
 History of pirate king
 Deres the pirate king
 Captain pirate king
 Pirate king demon
 Pirate Wang Lufei
 One Piece comics
 Pirate king supernova
 Pirate king Rosa chapter
 Pirate king world
 Pirate king fruit
 Pirate king Pluto
 Pirate Wang rongyilang
 Pirate king pirate regiment
 Pirate king justice
 Pirate king Superman
 The Pirate King became
 Pirate king looking for
 Legend of pirate king
 Pirate king pirate king
 Pirate king Zhonghai
 Roger the pirate king
 Pirate king's Secret Treasure
 The pirate king stays
 Pirate king partner
 One Piece ONE
 One Piece PIECE
 Pirate king pirate
 Like minded pirate king
 Pirate Wang Yangqi
 Pirate king
 Dragon Ball Resurrection
 Dragon pearl Fairy
 Longzhu Wudao
 Longzhu get
 Dragon Ball Legion
 Longzhu search
 Dragon Ball demon king
 Longzhu dumplings
 Longzhu special
 Dragon Ball defeat
 Longzhu pear
 Dragon Ball ribbon
 Longzhu sale date
 Longzhu Longzhu
 Longzhu Tianjin
 Dragon Ball seven dragon balls
 Longzhu bick
 Dragon Ball God
 Dragon Ball cultivation
 Dragon Ball Wukong
 Dragon ball seal
 Longzhujiro
 Longzhuraf
 Dragon bead seal
 Longzhu wish
 Dragon Ball guards
 Longzhu Yiwu Road
 Dragon ball animation
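Each model line is parsed by splitting on spaces: the first token becomes the category label and the rest become feature values. A minimal illustration of what read() and classifyByCategory() below do with one line (the demo class name is ours):

import java.util.Arrays;

public class ModelLineDemo {
    public static void main(String[] args) {
        // One raw line from the model file: "<category> <keyword>"
        String line = "Naruto secret";
        String[] values = line.split(" ");
        System.out.println(Arrays.toString(values)); // [Naruto, secret]
        // read() keeps the whole row; classifyByCategory() later treats
        // values[0] as the category and the remaining tokens as features.
    }
}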
  • Bayes.java
package com.xinrui.util;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.commons.io.Charsets;
import org.apache.commons.io.FileUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.log4j.Logger;

import com.hankcs.hanlp.HanLP;

/**
 * Core Bayesian classifier class
 */
public class Bayes {

    private static Logger logger = Logger.getLogger(Bayes.class);

    /**
     * Split the original training tuples by category
     * 
     * @param datas
     *            Training tuples
     * @return Map<category, training tuples belonging to that category>
     */
    public static Map<String, ArrayList<ArrayList<String>>> classifyByCategory(ArrayList<ArrayList<String>> datas) {
        if (datas == null) {
            return null;
        }

        Map<String, ArrayList<ArrayList<String>>> map = new HashMap<String, ArrayList<ArrayList<String>>>();
        ArrayList<String> singleTraining = null;
        String classification = "";
        for (int i = 0; i < datas.size(); i++) {
            singleTraining = datas.get(i);
            // The first element of each row is the category label; remove it so
            // that only the feature values remain (note: this mutates the input rows).
            classification = singleTraining.get(0);
            singleTraining.remove(0);
            if (map.containsKey(classification)) {
                map.get(classification).add(singleTraining);
            } else {
                ArrayList<ArrayList<String>> list = new ArrayList<ArrayList<String>>();
                list.add(singleTraining);
                map.put(classification, list);
            }
        }

        return map;
    }

    /**
     * Predict the category of a test tuple based on the training data
     * 
     * @param datas
     *            Training tuples
     * @param testData
     *            Test tuple
     * @return Predicted category of the test tuple
     */
    public static String predictClassify(ArrayList<ArrayList<String>> datas, ArrayList<String> testData) {

        if (datas == null || testData == null) {
            return null;
        }

        int maxPIndex = -1;
        Map<String, ArrayList<ArrayList<String>>> map = classifyByCategory(datas);
        Object[] classes = map.keySet().toArray();
        double maxProbability = 0.0;
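        // Score each candidate class by SUMMING the per-keyword posteriors
        // P(class|key). This differs from textbook naive Bayes, which multiplies
        // per-feature likelihoods; summing keeps one unseen keyword
        // (probability 0) from zeroing out the whole score.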
        for (int i = 0; i < map.size(); i++) {
            double p = 0.0;
            for (int j = 0; j < testData.size(); j++) {
                p += calProbabilityClassificationInKey(map, classes[i].toString(), testData.get(j));
            }
            if (p > maxProbability) {
                maxProbability = p;
                maxPIndex = i;
            }
        }

        return maxPIndex == -1 ? "other" : classes[maxPIndex].toString();
    }

    /**
     * Predict the category of a test tuple, loading the training data from a model file
     * 
     * @param testData
     *            Test tuple
     * @param mId
     *            Path to the model file
     * @return Predicted category of the test tuple
     * @throws Exception
     */
    public String predictClassify(ArrayList<String> testData, String mId) throws Exception {
        return predictClassify(read(mId), testData);
    }

    /**
     * Probability of a feature value within a classification [P(key|Classify)]
     * 
     * @param classify
     *            The feature-vector set of one classification
     * @param value
     *            A feature value
     * @return The probability
     */
    private static double calProbabilityKeyInClassification(ArrayList<ArrayList<String>> classify, String value) {
        if (classify == null || StringUtils.isEmpty(value)) {
            return 0.0;
        }
        int totalKeyCount = 0;
        int foundKeyCount = 0;
        ArrayList<String> featureVector = null; // One feature vector in the classification
        for (int i = 0; i < classify.size(); i++) {
            featureVector = classify.get(i);
            for (int j = 0; j < featureVector.size(); j++) {
                totalKeyCount++;
                if (featureVector.get(j).equalsIgnoreCase(value)) {
                    foundKeyCount++;
                }
            }
        }
        return totalKeyCount == 0 ? 0.0 : 1.0 * foundKeyCount / totalKeyCount;
    }

    /**
     * Probability of a classification [P(Classify)]
     * 
     * @param map
     *            All classified datasets
     * @param classify
     *            A specific classification
     * @return Probability of the classification
     */
    private static double calProbabilityClassification(Map<String, ArrayList<ArrayList<String>>> map, String classify) {
        if (map == null || StringUtils.isEmpty(classify)) {
            return 0;
        }
        Object[] classes = map.keySet().toArray();
        int totalClassifyCount = 0;
        for (int i = 0; i < classes.length; i++) {
            totalClassifyCount += map.get(classes[i].toString()).size();
        }
        return 1.0 * map.get(classify).size() / totalClassifyCount;
    }

    /**
     * Overall probability of a keyword [P(key)]
     * 
     * @param map
     *            All classified datasets
     * @param key
     *            A feature value
     * @return The proportion of the feature value across all classification datasets
     */
    private static double calProbabilityKey(Map<String, ArrayList<ArrayList<String>>> map, String key) {
        if (map == null || StringUtils.isEmpty(key)) {
            return 0;
        }
        int foundKeyCount = 0;
        int totalKeyCount = 0;
        Object[] classes = map.keySet().toArray();
        for (int i = 0; i < classes.length; i++) {
            ArrayList<ArrayList<String>> classify = map.get(classes[i]);
            ArrayList<String> featureVector = null; // One feature vector in the classification
            for (int j = 0; j < classify.size(); j++) {
                featureVector = classify.get(j);
                for (int k = 0; k < featureVector.size(); k++) {
                    totalKeyCount++;
                    if (featureVector.get(k).equalsIgnoreCase(key)) {
                        foundKeyCount++;
                    }
                }
            }
        }
        return totalKeyCount == 0 ? 0.0 : 1.0 * foundKeyCount / totalKeyCount;
    }

    /**
     * Calculate the probability of a classification given that a key appears [P(Classify|key)]
     * 
     * @param map
     *            All classified datasets
     * @param classify
     *            A specific classification
     * @param key
     *            A particular feature value
     * @return P(Classify|key)
     */
    private static double calProbabilityClassificationInKey(Map<String, ArrayList<ArrayList<String>>> map, String classify, String key) {
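        // Bayes' rule applied to a single keyword:
        // P(classify|key) = P(key|classify) * P(classify) / P(key)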
        ArrayList<ArrayList<String>> classifyList = map.get(classify);
        double pkc = calProbabilityKeyInClassification(classifyList, key); // p(key|classify)
        double pc = calProbabilityClassification(map, classify); // p(classify)
        double pk = calProbabilityKey(map, key); // p(key)
        return pk == 0 ? 0 : pkc * pc / pk; // p(classify | key)
    }

    /**
     * Read and parse the training data from the training document
     * 
     * @param clzss
     *            Path to the training document
     * @return Training data set
     * @throws Exception
     */
    public static ArrayList<ArrayList<String>> read(String clzss) throws Exception {
        ArrayList<String> singleTraining = null;
        ArrayList<ArrayList<String>> trainingSet = new ArrayList<ArrayList<String>>();
        List<String> datas = new ArrayList<String>(FileUtils.readLines(new File(clzss), Charsets.UTF_8));
        if (datas.size() == 0) {
            logger.error("[Model file loading error]" + clzss);
            throw new Exception("Model file loading error!");
        }
        for (int i = 0; i < datas.size(); i++) {
            String[] characteristicValues = datas.get(i).split(" ");
            singleTraining = new ArrayList<String>();
            for (int j = 0; j < characteristicValues.length; j++) {
                if (StringUtils.isNotEmpty(characteristicValues[j])) {
                    singleTraining.add(characteristicValues[j]);
                }
            }
            trainingSet.add(singleTraining);
        }
        return trainingSet;
    }

    /**
     * Run the classifier over every line of a labelled file and report accuracy
     * 
     * @param fileName
     *            Test document; its file-name prefix is treated as the expected category
     * @param mId
     *            Path to the model file
     * @param size
     *            Number of keywords to extract per line
     */
    public static void trainBayes(String fileName, String mId, int size) {
        try {
            Bayes bayes = new Bayes();
            BufferedReader reader = new BufferedReader(new FileReader(fileName));
            String line = null;
            int total = 0;
            int right = 0;
            long start = System.currentTimeMillis();
            while ((line = reader.readLine()) != null) {
                ArrayList<String> testData = (ArrayList<String>) HanLP.extractKeyword(line, size);
                String classification = bayes.predictClassify(testData, mId);
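                // The expected category is taken from the file-name prefix, e.g.
                // every line of "Naruto.txt" is expected to be classified as "Naruto".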
                if (classification.equals(fileName.split("\\.")[0])) {
                    right += 1;
                }
                System.out.print("\n Classification:" + classification);
                total++;
            }
            reader.close();
            long end = System.currentTimeMillis();
            System.out.println("Correct classification:" + right);
            System.out.println("Total number of rows:" + total);
            System.out.println("Accuracy:" + MathUtil.div(right, total, 4) * 100 + "%");
            System.out.println("Program run time: " + (end - start) / 1000 + "s");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

}
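Before wiring in HanLP, the classifier can be exercised against a tiny hand-built training set. Below is a minimal sketch (the toy data and the demo class name are ours; only Bayes.predictClassify comes from the class above). Both test keywords occur only under the "Pirate king" category, so that category should win:

import java.util.ArrayList;
import java.util.Arrays;

import com.xinrui.util.Bayes;

public class ToyBayesDemo {
    public static void main(String[] args) {
        // Each training row: the category label first, then one feature value.
        ArrayList<ArrayList<String>> datas = new ArrayList<ArrayList<String>>();
        datas.add(new ArrayList<String>(Arrays.asList("Naruto", "ninja")));
        datas.add(new ArrayList<String>(Arrays.asList("Naruto", "village")));
        datas.add(new ArrayList<String>(Arrays.asList("Pirate king", "pirate")));
        datas.add(new ArrayList<String>(Arrays.asList("Pirate king", "fruit")));

        // Test tuple: keywords extracted from the text to classify.
        ArrayList<String> testData = new ArrayList<String>(Arrays.asList("pirate", "fruit"));
        System.out.println(Bayes.predictClassify(datas, testData)); // prints: Pirate king
    }
}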
  • TestBayes.java
package com.xinrui.test;

import java.util.ArrayList;

import com.hankcs.hanlp.HanLP;
import com.xinrui.util.Bayes;

public class TestBayes {
    public static void main(String[] args) throws Exception {
        // Get the current project storage location
        String path = TestBayes.class.getResource("").getPath();
        String classPath = path.substring(0, path.indexOf("/com/xinrui"));
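        // NOTE: this resolves to <classes root>/model/classify_model.txt, so the
        // model directory must be copied alongside the compiled classes.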
        // Model file storage location
        String modelName = classPath + "/model/classify_model.txt";
        ArrayList<ArrayList<String>> model = Bayes.read(modelName);
        // Extract 15 keywords to form a test tuple
        ArrayList<String> testData = (ArrayList<String>) HanLP
                .extractKeyword(
                        "It was the \"age of the great pirates\": to find ONE PIECE, the great secret treasure left behind by the legendary pirate king Roger, countless pirates raised their flags and fought one another. A young man named Luffy dreamed of becoming a pirate. Having eaten the \"devil fruit\" by mistake, he became a rubber man, gaining superhuman ability at the price of never being able to swim. Ten years later, Luffy set out to sea to honor his promise to Shanks, who had lost an arm saving him. On the journey he kept searching for like-minded partners, beginning a great adventure with the goal of becoming the pirate king.",
                        15);
        // Output prediction results
        System.out.println(Bayes.predictClassify(model, testData));
    }
}
  • Result

[screenshot of the console output; the image is missing from the original post]
