Chapter 6 (1.6) Machine learning in practice -- building your own Bayesian classifier

GitHub project: https://github.com/liangzhicheng120/bayes

1. Introduction

  • The project uses SpringBoot to provide a thin web layer
  • The project uses HanLP for word segmentation
  • The project uses JDK 8
  • Bayes' rule. The probability of event A given that event B has occurred is, in general, different from the probability of event B given that event A has occurred; however, there is a definite relationship between the two, and Bayes' theorem is the statement of that relationship.
  • Bayesian terminology. Bayes' theorem reads: P(A|B) = P(B|A) · P(A) / P(B), where L(A|B) = P(B|A) is the likelihood of A when B is observed. In Bayes' law, every term has a conventional name: P(A) is the prior (or marginal) probability of A; it is called "prior" because it does not take any information about B into account. P(A|B) is the conditional probability of A once B is known, also called the posterior probability of A because it is derived from the observed value of B. P(B|A) is the conditional probability of B once A is known, likewise called the posterior probability of B because it is derived from the value of A. P(B) is the prior (or marginal) probability of B, and it also serves as the normalizing constant. In short: posterior probability = (likelihood × prior probability) / normalizing constant, i.e. the posterior probability is proportional to the product of the prior probability and the likelihood.
  • The meaning of Bayesian inference. Rearranging the conditional-probability formula yields the following form:

P(A|B) = P(A) · P(B|A) / P(B)

  • We call P(A) the "prior probability": our estimate of the probability of event A before event B is observed. P(A|B) is the "posterior probability": our re-estimate of the probability of event A after event B has occurred. P(B|A)/P(B) is the "likelihood function", an adjustment factor that pulls the estimated probability closer to the true probability. In other words: posterior probability = prior probability × adjustment factor.
  • This is what Bayesian inference means: we first estimate a "prior probability", then fold in the experimental evidence to see whether it strengthens or weakens that prior, arriving at a "posterior probability" that is closer to the truth. If the likelihood function P(B|A)/P(B) > 1, the prior is strengthened and event A becomes more likely; if it equals 1, event B tells us nothing about event A; if it is < 1, the prior is weakened and event A becomes less likely. For example, with a prior of 0.5 and a likelihood function of 1.2, the posterior rises to 0.6.

2. Example

  • Villa and dog. A villa has been burgled twice in the past 20 years, and the owner's dog barks, on average, three nights a week. The probability that the dog barks when a thief breaks in is estimated at 0.9. Question: when the dog barks, what is the probability of a break-in? Let event A be the dog barking at night and event B be a thief breaking in. Then P(A) = 3/7, P(B) = 2/(20 × 365) = 2/7300, and P(A|B) = 0.9. Applying Bayes' rule gives the result: P(B|A) = 0.9 × (2/7300) / (3/7) ≈ 0.00058. A sketch of this computation in Java follows below.
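A minimal, self-contained Java sketch of the computation above (the class and variable names are ours):

public class DogBarkBayes {
    public static void main(String[] args) {
        double pBark = 3.0 / 7;           // P(A): the dog barks on a given night
        double pIntrusion = 2.0 / 7300;   // P(B): a break-in on a given night (2 in 20 years)
        double pBarkGivenIntrusion = 0.9; // P(A|B): the dog barks during a break-in

        // Bayes' rule: P(B|A) = P(A|B) * P(B) / P(A)
        double pIntrusionGivenBark = pBarkGivenIntrusion * pIntrusion / pBark;
        System.out.printf("P(intrusion | bark) = %.5f%n", pIntrusionGivenBark); // ~0.00058
    }
}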

3. Hands-on code

  • Model file (classify.txt). Each line is one training row: the first token is the category label and the remaining tokens are keywords:
Naruto Naruto
 Naruto secret
 Naruto snake pill
 Naruto theater Edition
 Naruto action
 Naruto fight
 Naruto battle
 Naruto reincarnation
 Naruto Sasuke
 Naruto Village
 Naruto sixth generation Naruto
 Naruto KRA
 Naruto Kaka
 Naruto with earth
 Naruto blast
 Naruto comes from
 Naruto Naruto
 Naruto Fairy
 Naruto six ways
 Naruto war
 Naruto nine tails
 Naruto Ninja
 Naruto research
 Naruto master
 Naruto Naruto
 Naruto leaf
 Naruto Ninja
 Naruto dirt
 Naruto Yu Zhibo
 Naruto Nine Tailed demon fox
 Naruto a Fei
 Pirate Wang Wenwen
 Pirate Wang Weitian
 Pirate king pirate king
 Frankie the pirate king
 Pirate king straw hat
 Pirate king pirate
 Pirate Wang Wuhai
 Pirate king incident
 Pirate king offers a reward
 Pirate king's words
 Pirate king dream
 Pirate king blood group
 Pirate king
 Pirate king route
 History of pirate king
 Deres the pirate king
 Captain pirate king
 Pirate king demon
 Pirate Wang Lufei
 One Piece comics
 Pirate king supernova
 Pirate king Rosa chapter
 Pirate king world
 Pirate king fruit
 Pirate king Pluto
 Pirate Wang rongyilang
 Pirate king pirate regiment
 Pirate king justice
 Pirate king Superman
 The Pirate King became
 Pirate king looking for
 Legend of pirate king
 Pirate king pirate king
 Pirate king Zhonghai
 Roger the pirate king
 Pirate king's Secret Treasure
 The pirate king stays
 Pirate king partner
 One Piece ONE
 One Piece PIECE
 Pirate king pirate
 Like minded pirate king
 Pirate Wang Yangqi
 Pirate king
 Dragon Ball Resurrection
 Dragon pearl Fairy
 Longzhu Wudao
 Longzhu get
 Dragon Ball Legion
 Longzhu search
 Dragon Ball demon king
 Longzhu dumplings
 Longzhu special
 Dragon Ball defeat
 Longzhu pear
 Dragon Ball ribbon
 Longzhu sale date
 Longzhu Longzhu
 Longzhu Tianjin
 Dragon Ball seven dragon balls
 Longzhu bick
 Dragon Ball God
 Dragon Ball cultivation
 Dragon Ball Wukong
 Dragon ball seal
 Longzhujiro
 Longzhuraf
 Dragon bead seal
 Longzhu wish
 Dragon Ball guards
 Longzhu Yiwu Road
 Dragon ball animation
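Each model line is parsed by splitting on spaces: the first token becomes the category label and the rest become feature values. A minimal illustration of what read() and classifyByCategory() below do with one line (the demo class name is ours):

import java.util.Arrays;

public class ModelLineDemo {
    public static void main(String[] args) {
        // One raw line from the model file: "<category> <keyword>"
        String line = "Naruto secret";
        String[] values = line.split(" ");
        System.out.println(Arrays.toString(values)); // [Naruto, secret]
        // read() keeps the whole row; classifyByCategory() later treats
        // values[0] as the category and the remaining tokens as features.
    }
}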
  • Bayes.java
package com.xinrui.util;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.commons.io.Charsets;
import org.apache.commons.io.FileUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.log4j.Logger;

import com.hankcs.hanlp.HanLP;

/**
 * Core Bayesian classifier class
 */
public class Bayes {

    private static Logger logger = Logger.getLogger(Bayes.class);

    /**
     * Split the original training tuples by category
     * 
     * @param datas
     *            Training tuples
     * @return Map<category, training tuples belonging to that category>
     */
    public static Map<String, ArrayList<ArrayList<String>>> classifyByCategory(ArrayList<ArrayList<String>> datas) {
        if (datas == null) {
            return null;
        }

        Map<String, ArrayList<ArrayList<String>>> map = new HashMap<String, ArrayList<ArrayList<String>>>();
        ArrayList<String> singleTraining = null;
        String classification = "";
        for (int i = 0; i < datas.size(); i++) {
            singleTraining = datas.get(i);
            // The first element of each row is the category label; remove it so
            // that only the feature values remain (note: this mutates the input rows).
            classification = singleTraining.get(0);
            singleTraining.remove(0);
            if (map.containsKey(classification)) {
                map.get(classification).add(singleTraining);
            } else {
                ArrayList<ArrayList<String>> list = new ArrayList<ArrayList<String>>();
                list.add(singleTraining);
                map.put(classification, list);
            }
        }

        return map;
    }

    /**
     * Predict the category of a test tuple based on the training data
     * 
     * @param datas
     *            Training tuples
     * @param testData
     *            Test tuple
     * @return Predicted category of the test tuple
     */
    public static String predictClassify(ArrayList<ArrayList<String>> datas, ArrayList<String> testData) {

        if (datas == null || testData == null) {
            return null;
        }

        int maxPIndex = -1;
        Map<String, ArrayList<ArrayList<String>>> map = classifyByCategory(datas);
        Object[] classes = map.keySet().toArray();
        double maxProbability = 0.0;
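        // Score each candidate class by SUMMING the per-keyword posteriors
        // P(class|key). This differs from textbook naive Bayes, which multiplies
        // per-feature likelihoods; summing keeps one unseen keyword
        // (probability 0) from zeroing out the whole score.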
        for (int i = 0; i < map.size(); i++) {
            double p = 0.0;
            for (int j = 0; j < testData.size(); j++) {
                p += calProbabilityClassificationInKey(map, classes[i].toString(), testData.get(j));
            }
            if (p > maxProbability) {
                maxProbability = p;
                maxPIndex = i;
            }
        }

        return maxPIndex == -1 ? "other" : classes[maxPIndex].toString();
    }

    /**
     * Predict the category of a test tuple, loading the training data from a model file
     * 
     * @param testData
     *            Test tuple
     * @param mId
     *            Path to the model file
     * @return Predicted category of the test tuple
     * @throws Exception
     */
    public String predictClassify(ArrayList<String> testData, String mId) throws Exception {
        return predictClassify(read(mId), testData);
    }

    /**
     * Probability of a feature value within a classification [P(key|Classify)]
     * 
     * @param classify
     *            The feature-vector set of one classification
     * @param value
     *            A feature value
     * @return The probability
     */
    private static double calProbabilityKeyInClassification(ArrayList<ArrayList<String>> classify, String value) {
        if (classify == null || StringUtils.isEmpty(value)) {
            return 0.0;
        }
        int totalKeyCount = 0;
        int foundKeyCount = 0;
        ArrayList<String> featureVector = null; // One feature vector in the classification
        for (int i = 0; i < classify.size(); i++) {
            featureVector = classify.get(i);
            for (int j = 0; j < featureVector.size(); j++) {
                totalKeyCount++;
                if (featureVector.get(j).equalsIgnoreCase(value)) {
                    foundKeyCount++;
                }
            }
        }
        return totalKeyCount == 0 ? 0.0 : 1.0 * foundKeyCount / totalKeyCount;
    }

    /**
     * Probability of a classification [P(Classify)]
     * 
     * @param map
     *            All classified datasets
     * @param classify
     *            A specific classification
     * @return Probability of the classification
     */
    private static double calProbabilityClassification(Map<String, ArrayList<ArrayList<String>>> map, String classify) {
        if (map == null || StringUtils.isEmpty(classify)) {
            return 0;
        }
        Object[] classes = map.keySet().toArray();
        int totalClassifyCount = 0;
        for (int i = 0; i < classes.length; i++) {
            totalClassifyCount += map.get(classes[i].toString()).size();
        }
        return 1.0 * map.get(classify).size() / totalClassifyCount;
    }

    /**
     * Overall probability of a keyword [P(key)]
     * 
     * @param map
     *            All classified datasets
     * @param key
     *            A feature value
     * @return The proportion of the feature value across all classification datasets
     */
    private static double calProbabilityKey(Map<String, ArrayList<ArrayList<String>>> map, String key) {
        if (map == null || StringUtils.isEmpty(key)) {
            return 0;
        }
        int foundKeyCount = 0;
        int totalKeyCount = 0;
        Object[] classes = map.keySet().toArray();
        for (int i = 0; i < classes.length; i++) {
            ArrayList<ArrayList<String>> classify = map.get(classes[i]);
            ArrayList<String> featureVector = null; // One feature vector in the classification
            for (int j = 0; j < classify.size(); j++) {
                featureVector = classify.get(j);
                for (int k = 0; k < featureVector.size(); k++) {
                    totalKeyCount++;
                    if (featureVector.get(k).equalsIgnoreCase(key)) {
                        foundKeyCount++;
                    }
                }
            }
        }
        return totalKeyCount == 0 ? 0.0 : 1.0 * foundKeyCount / totalKeyCount;
    }

    /**
     * Calculate the probability of a classification given that a key appears [P(Classify|key)]
     * 
     * @param map
     *            All classified datasets
     * @param classify
     *            A specific classification
     * @param key
     *            A particular feature value
     * @return P(Classify|key)
     */
    private static double calProbabilityClassificationInKey(Map<String, ArrayList<ArrayList<String>>> map, String classify, String key) {
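        // Bayes' rule applied to a single keyword:
        // P(classify|key) = P(key|classify) * P(classify) / P(key)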
        ArrayList<ArrayList<String>> classifyList = map.get(classify);
        double pkc = calProbabilityKeyInClassification(classifyList, key); // p(key|classify)
        double pc = calProbabilityClassification(map, classify); // p(classify)
        double pk = calProbabilityKey(map, key); // p(key)
        return pk == 0 ? 0 : pkc * pc / pk; // p(classify | key)
    }

    /**
     * Read and parse the training data from the training document
     * 
     * @param clzss
     *            Path to the training document
     * @return Training data set
     * @throws Exception
     */
    public static ArrayList<ArrayList<String>> read(String clzss) throws Exception {
        ArrayList<String> singleTraining = null;
        ArrayList<ArrayList<String>> trainingSet = new ArrayList<ArrayList<String>>();
        List<String> datas = new ArrayList<String>(FileUtils.readLines(new File(clzss), Charsets.UTF_8));
        if (datas.size() == 0) {
            logger.error("[Model file loading error]" + clzss);
            throw new Exception("Model file loading error!");
        }
        for (int i = 0; i < datas.size(); i++) {
            String[] characteristicValues = datas.get(i).split(" ");
            singleTraining = new ArrayList<String>();
            for (int j = 0; j < characteristicValues.length; j++) {
                if (StringUtils.isNotEmpty(characteristicValues[j])) {
                    singleTraining.add(characteristicValues[j]);
                }
            }
            trainingSet.add(singleTraining);
        }
        return trainingSet;
    }

    /**
     * Run the classifier over every line of a labelled file and report accuracy
     * 
     * @param fileName
     *            Test document; its file-name prefix is treated as the expected category
     * @param mId
     *            Path to the model file
     * @param size
     *            Number of keywords to extract per line
     */
    public static void trainBayes(String fileName, String mId, int size) {
        try {
            Bayes bayes = new Bayes();
            BufferedReader reader = new BufferedReader(new FileReader(fileName));
            String line = null;
            int total = 0;
            int right = 0;
            long start = System.currentTimeMillis();
            while ((line = reader.readLine()) != null) {
                ArrayList<String> testData = (ArrayList<String>) HanLP.extractKeyword(line, size);
                String classification = bayes.predictClassify(testData, mId);
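                // The expected category is taken from the file-name prefix, e.g.
                // every line of "Naruto.txt" is expected to be classified as "Naruto".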
                if (classification.equals(fileName.split("\\.")[0])) {
                    right += 1;
                }
                System.out.print("\n Classification:" + classification);
                total++;
            }
            reader.close();
            long end = System.currentTimeMillis();
            System.out.println("Correct classification:" + right);
            System.out.println("Total number of rows:" + total);
            System.out.println("Accuracy:" + MathUtil.div(right, total, 4) * 100 + "%");
            System.out.println("Program run time: " + (end - start) / 1000 + "s");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

}
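Before wiring in HanLP, the classifier can be exercised against a tiny hand-built training set. Below is a minimal sketch (the toy data and the demo class name are ours; only Bayes.predictClassify comes from the class above). Both test keywords occur only under the "Pirate king" category, so that category should win:

import java.util.ArrayList;
import java.util.Arrays;

import com.xinrui.util.Bayes;

public class ToyBayesDemo {
    public static void main(String[] args) {
        // Each training row: the category label first, then one feature value.
        ArrayList<ArrayList<String>> datas = new ArrayList<ArrayList<String>>();
        datas.add(new ArrayList<String>(Arrays.asList("Naruto", "ninja")));
        datas.add(new ArrayList<String>(Arrays.asList("Naruto", "village")));
        datas.add(new ArrayList<String>(Arrays.asList("Pirate king", "pirate")));
        datas.add(new ArrayList<String>(Arrays.asList("Pirate king", "fruit")));

        // Test tuple: keywords extracted from the text to classify.
        ArrayList<String> testData = new ArrayList<String>(Arrays.asList("pirate", "fruit"));
        System.out.println(Bayes.predictClassify(datas, testData)); // prints: Pirate king
    }
}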
  • TestBayes.java
package com.xinrui.test;

import java.util.ArrayList;

import com.hankcs.hanlp.HanLP;
import com.xinrui.util.Bayes;

public class TestBayes {
    public static void main(String[] args) throws Exception {
        // Get the current project storage location
        String path = TestBayes.class.getResource("").getPath();
        String classPath = path.substring(0, path.indexOf("/com/xinrui"));
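        // NOTE: this resolves to <classes root>/model/classify_model.txt, so the
        // model directory must be copied alongside the compiled classes.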
        // Model file storage location
        String modelName = classPath + "/model/classify_model.txt";
        ArrayList<ArrayList<String>> model = Bayes.read(modelName);
        // Extract 15 keywords to form a test tuple
        ArrayList<String> testData = (ArrayList<String>) HanLP
                .extractKeyword(
                        "It was the \"age of the great pirates\": to find ONE PIECE, the great secret treasure left behind by the legendary pirate king Roger, countless pirates raised their flags and fought one another. A young man named Luffy dreamed of becoming a pirate. Having eaten the \"devil fruit\" by mistake, he became a rubber man, gaining superhuman ability at the price of never being able to swim. Ten years later, Luffy set out to sea to honor his promise to Shanks, who had lost an arm saving him. On the journey he kept searching for like-minded partners, beginning a great adventure with the goal of becoming the pirate king.",
                        15);
        // Output prediction results
        System.out.println(Bayes.predictClassify(model, testData));
    }
}
  • Result

[screenshot of the console output; the image is missing from the original post]
