Spark.ml -- Naive Bayes

Preface

The naive Bayes classifier is a low-variance, high-bias classifier. It assumes that the features are conditionally independent: given the category, all features are independent of each other. This assumption clearly oversimplifies most problems, yet naive Bayes still performs well on text classification tasks.

Naive Bayes models are a family of very simple and fast classification algorithms that are usually well suited to very high-dimensional data sets. Because they run quickly and have few tunable parameters, they make a good quick-and-rough baseline for classification problems. The method is called "naive" because it makes a very simple assumption about the generative model behind each label; with that assumption, an approximate generative model can be found for each class and Bayesian classification can then be applied. The basic idea is: for a given item to be classified, compute the probability of each category given that item, and assign the item to the category with the largest probability.

Naive Bayes algorithms include Gaussian NB (assumes that the data for each label follow a simple Gaussian distribution; applicable to continuous variables), multinomial NB (assumes that the features are generated by a simple multinomial distribution; applicable to discrete features such as counts), and Bernoulli NB (assumes that each feature follows a Bernoulli distribution; applicable to binary features). Unlike the multinomial model, each feature in the Bernoulli model can only take the value 1 or 0. In text classification, for example, the feature value is 1 if a word appears in the document and 0 otherwise.
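To make the difference between multinomial and Bernoulli features concrete, here is a minimal sketch in plain Python. The vocabulary, the document and both function names are made up for illustration; they are not part of Spark.

```python
from collections import Counter

def multinomial_features(tokens, vocabulary):
    """Count how many times each vocabulary word occurs in the document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def bernoulli_features(tokens, vocabulary):
    """Record only whether each vocabulary word occurs (1) or not (0)."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

vocabulary = ["free", "money", "meeting"]   # hypothetical vocabulary
document = "free money free offer".split()  # hypothetical document

print(multinomial_features(document, vocabulary))  # [2, 1, 0]
print(bernoulli_features(document, vocabulary))    # [1, 1, 0]
```

The word "free" occurs twice, so the multinomial encoding keeps the count 2 while the Bernoulli encoding only records that it is present.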

 

Characteristics

Advantages

  • Training and prediction are very fast.
  • It provides probabilistic predictions directly.
  • It's usually easy to explain.
  • There are very few adjustable parameters (if any).

Suitable when

  • The assumed distribution actually matches the data (rarely the case in practice).
  • The classes are well separated, so model complexity matters little.
  • The data are very high-dimensional, so model complexity matters little.

 

Principle

P(A|B) denotes the probability of event A given that event B has occurred; it is called the conditional probability of A given B.
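For reference, the Bayesian formula can be written in terms of this conditional probability, together with the form it takes under the naive independence assumption. The symbols c (a category) and x_1, ..., x_n (the feature values of an item) are notation introduced here, not taken from the original article.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}

P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c)
\qquad\Longrightarrow\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

Here P(c) is the prior probability of the category, each P(x_i | c) is the likelihood of one feature value within that category, and the denominator P(x_1, ..., x_n) is dropped because it is the same for every category.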

The classification principle is to use the Bayesian formula to compute, from an object's prior probability, its posterior probability (that is, the probability that the object belongs to a given category), and then to select the category with the maximum posterior probability as the object's category. Generally speaking, when the number of features is large or the features are strongly correlated, naive Bayes classifies less well than a decision tree model; it performs best when the correlation between features is small.
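The rule above (choose the category with the largest posterior) can be sketched in plain Python. This is a toy illustration under stated assumptions, not Spark's implementation: the six training samples are invented, each feature is a 0/1 flag, the function names are hypothetical, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def train(samples):
    """Collect class counts and per-class feature-value counts from (features, label) pairs."""
    class_counts = Counter(label for _, label in samples)
    # feature_counts[label][i][value] = number of `label` samples whose i-th feature equals `value`
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for features, label in samples:
        for i, value in enumerate(features):
            feature_counts[label][i][value] += 1
    return class_counts, feature_counts

def posteriors(features, class_counts, feature_counts):
    """Return P(label | features) for every label, assuming independent features."""
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior P(label)
        for i, value in enumerate(features):
            score *= feature_counts[label][i][value] / n  # likelihood P(x_i | label), unsmoothed
        scores[label] = score
    evidence = sum(scores.values())  # assumed non-zero for this toy data
    return {label: score / evidence for label, score in scores.items()}

# Hypothetical data: features are (contains "free", contains "meeting")
samples = [
    ((1, 0), "spam"), ((1, 0), "spam"), ((0, 1), "spam"),
    ((0, 1), "ham"), ((0, 1), "ham"), ((1, 0), "ham"),
]
class_counts, feature_counts = train(samples)
result = posteriors((1, 0), class_counts, feature_counts)
prediction = max(result, key=result.get)
print(prediction)  # spam
```

For the query (1, 0), the unnormalised scores are 0.5 × 2/3 × 2/3 for spam and 0.5 × 1/3 × 1/3 for ham, so the posterior for spam is 0.8 and the item is assigned to that category.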

The principle is easier to understand through examples; see the worked example in this article: https://blog.csdn.net/ac540101928/article/details/103941495

 

PySpark API

    class pyspark.ml.classification.NaiveBayes(featuresCol='features', labelCol='label', predictionCol='prediction', probabilityCol='probability', rawPredictionCol='rawPrediction', smoothing=1.0, modelType='multinomial', thresholds=None, weightCol=None)

  • smoothing: the smoothing parameter; must be >= 0, default is 1.0.
  • modelType: the model type, a case-sensitive string. Supported options: "multinomial" (default) and "bernoulli".
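As a rough illustration of what the smoothing parameter does in the multinomial case, the sketch below applies standard additive (Laplace) smoothing to a per-class feature probability. The function name and the numbers are hypothetical, and this is a plain-Python sketch of the usual formula rather than Spark's source code.

```python
def smoothed_likelihood(count, total, num_features, smoothing=1.0):
    """Additive-smoothing estimate of P(feature | class) for a multinomial model.

    count        -- occurrences of this feature within the class
    total        -- occurrences of all features within the class
    num_features -- number of distinct features (e.g. vocabulary size)
    """
    return (count + smoothing) / (total + smoothing * num_features)

# A word never seen in a class that has 10 word occurrences and a 5-word vocabulary
print(smoothed_likelihood(0, 10, 5, smoothing=0.0))  # 0.0
print(smoothed_likelihood(0, 10, 5, smoothing=1.0))
```

Without smoothing, a feature that never occurs in a class gets probability 0 and wipes out the whole product for that class; with smoothing = 1.0 the same feature gets the small non-zero probability 1/15.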

 

Example

# Content: naive Bayes binary classification with PySpark
# Version: Spark 2.4.4
# Data: spam data (Spambase)
# Data source: http://archive.ics.uci.edu/ml/datasets/Spambase

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import BinaryClassificationEvaluator,MulticlassClassificationEvaluator
from pyspark.ml import Pipeline
from pyspark.ml.tuning import CrossValidator,ParamGridBuilder

spark = SparkSession.builder.master("local").appName("NaiveBayes").getOrCreate()

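sc = spark.sparkContext

# Raw string so the backslashes in the Windows path are not treated as escape sequences
data = spark.read.format("csv").option("header", "true").load(r"D:\Data\Spam-Dataset\spambase.csv")

# Cast every column to double and rename the target column "spam" to "label"
for i in data.columns:
    data = data.withColumn(i, col(i).cast("double"))
    if i == "spam":
        data = data.withColumnRenamed(i, "label")

trainData, testData = data.randomSplit([0.8, 0.2])

# All columns except the last one (the label) are used as features
featuresArray = data.columns[:-1]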

# Assemble the feature columns into a single vector column
assembler = VectorAssembler().setInputCols(featuresArray).setOutputCol("features")

nb = NaiveBayes().setLabelCol("label").setFeaturesCol("features")

pipeline = Pipeline().setStages([assembler, nb])

# Candidate smoothing values to try during cross-validation
ParamGrid = ParamGridBuilder().addGrid(nb.smoothing, [0.1, 0.3, 0.5]).build()

evaluator = BinaryClassificationEvaluator().setMetricName("areaUnderROC").setRawPredictionCol("rawPrediction").setLabelCol("label")

# 3-fold cross-validation over the parameter grid, scored by area under ROC
CV = CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(ParamGrid).setNumFolds(3)

# Train on the training data
nbModel = CV.fit(trainData)

# Predict on the test data
nbPrediction = nbModel.transform(testData)
# nbPrediction.show(10)

# Best model parameters
bestModel = nbModel.bestModel
NBModel = bestModel.stages[1]  # stage 0 is the assembler, stage 1 is the fitted naive Bayes model
print("best smoothing : ", NBModel.explainParam(NBModel.smoothing))

# Evaluation 1 -- AUC
AUC = evaluator.evaluate(nbPrediction)

print("Naive Bayes area under ROC: ", AUC)

# Evaluation 2 -- accuracy
evaluatorX = MulticlassClassificationEvaluator().setMetricName("accuracy").setLabelCol("label")

ACC = evaluatorX.evaluate(nbPrediction)

print("Naive Bayes prediction accuracy:", ACC)

 



Tags: Spark less SQL Python

Posted on Thu, 16 Jan 2020 10:05:20 -0500 by sp2hari