3. Classification and regression
The spark.mllib package supports a variety of methods for binary classification, multiclass classification, and regression analysis.
Linear models
1) Mathematical formulation
Many standard machine learning methods can be formulated as a convex optimization problem, i.e. the task of finding a minimizer of a convex function f that depends on a variable vector w (called weights in the code) with d entries. Formally, we can write this as the optimization problem $\min_{w \in \mathbb{R}^d} f(w)$, where the objective function has the form

$$f(w) := \lambda\, R(w) + \frac{1}{n} \sum_{i=1}^{n} L(w; x_i, y_i) \qquad (1)$$

Here the vectors $x_i \in \mathbb{R}^d$ are the training data examples, for 1 ≤ i ≤ n, and $y_i \in \mathbb{R}$ are their corresponding labels, which we want to predict. We call the method linear if L(w; x, y) can be expressed as a function of $w^T x$ and y. Several of spark.mllib's classification and regression algorithms fall into this category and are discussed here.
The objective function f has two parts: the regularizer R(w), which controls the complexity of the model, and the loss, which measures the error of the model on the training data. The loss function L(w; ·) is typically a convex function in w. The fixed regularization parameter λ ≥ 0 (regParam in the code) defines the trade-off between the two goals of minimizing the loss (i.e. the training error) and minimizing model complexity (i.e. avoiding overfitting).
(1) Loss functions
The following table summarizes the loss functions and their gradients or sub-gradients for the methods that spark.mllib supports:
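For reference, the usual textbook forms of the three losses discussed in this section are collected below for a single example (x, y). This is a sketch under the assumption that labels are y ∈ {+1, −1} for the hinge and logistic losses and y ∈ ℝ for the squared loss; the entries are the standard definitions, not a quotation of the library source.

```latex
\begin{array}{lll}
\text{Loss} & L(w; x, y) & \text{Gradient or sub-gradient} \\
\hline
\text{hinge} & \max\{0,\ 1 - y\, w^T x\} &
  \begin{cases} -y \cdot x & \text{if } y\, w^T x < 1 \\ 0 & \text{otherwise} \end{cases} \\
\text{logistic} & \log\left(1 + \exp(-y\, w^T x)\right) &
  -y \left(1 - \frac{1}{1 + \exp(-y\, w^T x)}\right) \cdot x \\
\text{squared} & \frac{1}{2} \left(w^T x - y\right)^2 & \left(w^T x - y\right) \cdot x
\end{array}
```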
Note that in the mathematical formulation above, a binary label y is denoted as either +1 (positive) or −1 (negative), which is convenient for the formulation.
However, the negative label is represented by 0 in spark.mllib instead of −1, to be consistent with multiclass labeling.
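The relationship between the two label conventions and the per-example losses can be sketched in plain Scala. The object name LinearLosses and its helper names are hypothetical and chosen for illustration only; the code uses dense arrays rather than MLlib vectors and is not the library's implementation.

```scala
object LinearLosses {
  // Inner product w^T x of two dense vectors of equal length.
  def dot(w: Array[Double], x: Array[Double]): Double = {
    require(w.length == x.length, "w and x must have the same length")
    w.zip(x).map { case (wi, xi) => wi * xi }.sum
  }

  // Map a 0/1 class label to the -1/+1 convention used in the formulas.
  def toSigned(label01: Double): Double = 2.0 * label01 - 1.0

  // Hinge loss max{0, 1 - y * w^T x}, with y in {-1, +1}.
  def hinge(w: Array[Double], x: Array[Double], y: Double): Double =
    math.max(0.0, 1.0 - y * dot(w, x))

  // Logistic loss log(1 + exp(-y * w^T x)), with y in {-1, +1}.
  def logistic(w: Array[Double], x: Array[Double], y: Double): Double =
    math.log1p(math.exp(-y * dot(w, x)))

  // Squared loss (1/2) * (w^T x - y)^2, with a real-valued label y.
  def squared(w: Array[Double], x: Array[Double], y: Double): Double = {
    val r = dot(w, x) - y
    0.5 * r * r
  }
}
```

A label stored as 0 or 1 would first be passed through toSigned before being used with hinge or logistic.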
(2) Regularizers
The purpose of the regularizer is to encourage simple models and avoid overfitting. We support the following regularizers in spark.mllib:
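The standard forms of these regularizers and their gradients or sub-gradients are sketched below. The symbol α, a mixing weight assumed to lie in [0, 1], does not appear elsewhere in this text and is introduced here only to write the elastic net term.

```latex
\begin{array}{lll}
\text{Regularizer} & R(w) & \text{Gradient or sub-gradient} \\
\hline
\text{zero (unregularized)} & 0 & 0 \\
\text{L2} & \frac{1}{2} \|w\|_2^2 & w \\
\text{L1} & \|w\|_1 & \operatorname{sign}(w) \\
\text{elastic net} & \alpha \|w\|_1 + (1 - \alpha)\, \frac{1}{2} \|w\|_2^2 &
  \alpha \operatorname{sign}(w) + (1 - \alpha)\, w
\end{array}
```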
Here sign(w) is the vector consisting of the signs (±1) of all the entries of w. Because of their smoothness, L2-regularized problems are generally easier to solve than L1-regularized ones. However, L1 regularization can help promote sparsity in the weights, which leads to smaller and more interpretable models and is useful for feature selection. Elastic net is a combination of L1 and L2 regularization. Training a model without any regularization is not recommended, especially when the number of training examples is small.
(3) Optimization
Under the hood, linear methods use convex optimization to minimize the objective function. spark.mllib uses two methods, SGD and L-BFGS, which are described in the optimization section. Currently, most algorithm APIs support stochastic gradient descent (SGD), and a few support L-BFGS. Refer to the optimization section for guidelines on choosing between the two methods.
2) Classification
The goal of classification is to divide items into categories. The most common type is binary classification, where there are two categories, usually named positive and negative. If there are more than two categories, the task is called multiclass classification. spark.mllib supports two linear methods for classification: linear support vector machines (SVMs) and logistic regression. Linear SVMs support only binary classification, while logistic regression supports both binary and multiclass classification. For both methods, spark.mllib supports L1- and L2-regularized variants. The training data set is represented by an RDD of LabeledPoint in MLlib, where labels are zero-based class indices: 0, 1, 2, and so on.
(1) Linear support vector machines (SVM)
Linear SVM is a standard method for large-scale classification tasks. It is a linear method as described in equation (1) above, with the loss function given by the hinge loss:

$$L(w; x, y) := \max\{0,\ 1 - y\, w^T x\}$$

By default, linear SVMs are trained with L2 regularization. We also support alternative L1 regularization, in which case the problem becomes a linear program. The linear SVM algorithm outputs an SVM model. Given a new data point, denoted by x, the model makes predictions based on the value of $w^T x$. By default, if $w^T x \ge 0$ the outcome is positive, and negative otherwise.
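The decision rule just described can be sketched in plain Scala as follows. The object LinearSvmPredict and its method names are hypothetical; the comparison follows the rule as stated in the text, and passing no threshold returns the raw margin, which is an assumption about what a model with a cleared threshold would report rather than a copy of the library code.

```scala
object LinearSvmPredict {
  // Raw margin w^T x for dense vectors of equal length.
  def margin(w: Array[Double], x: Array[Double]): Double = {
    require(w.length == x.length, "w and x must have the same length")
    w.zip(x).map { case (wi, xi) => wi * xi }.sum
  }

  // With a threshold, return the class label (1.0 positive, 0.0 negative);
  // without one, return the raw margin.
  def predict(
      w: Array[Double],
      x: Array[Double],
      threshold: Option[Double] = Some(0.0)): Double =
    threshold match {
      case Some(t) => if (margin(w, x) >= t) 1.0 else 0.0
      case None    => margin(w, x)
    }
}
```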
Sample code
The following code snippet shows how to load a sample data set, run the training algorithm on the training split using a static method of the algorithm object, and evaluate the resulting model on the held-out test split by computing the area under the ROC curve.
For API details, refer to the SVMWithSGD Scala documentation and the SVMModel Scala documentation.
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils

// Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)

// Run training algorithm to build the model
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)

// Clear the default threshold.
model.clearThreshold()

// Compute raw scores on the test set.
val scoreAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}

// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
println(s"Area under ROC = $auROC")

// Save and load model
model.save(sc, "target/tmp/scalaSVMWithSGDModel")
val sameModel = SVMModel.load(sc, "target/tmp/scalaSVMWithSGDModel")
(2) Logistic regression
Logistic regression is widely used to predict a binary response. It is a linear method as described in equation (1) above, with the loss function given by the logistic loss:

$$L(w; x, y) := \log\left(1 + \exp(-y\, w^T x)\right)$$

For binary classification problems, the algorithm outputs a binary logistic regression model. Given a new data point, denoted by x, the model makes predictions by applying the logistic function

$$f(z) = \frac{1}{1 + e^{-z}}$$

where $z = w^T x$. By default, if $f(w^T x) > 0.5$ the outcome is positive, and negative otherwise. Unlike linear SVMs, the raw output f(z) of the logistic regression model has a probabilistic interpretation: it is the probability that x is positive.
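A minimal plain-Scala sketch of this prediction rule is shown below. The object LogisticPredict and its method names are hypothetical and the code works on dense arrays; it illustrates the formula above and is not the library's implementation.

```scala
object LogisticPredict {
  // Logistic function f(z) = 1 / (1 + e^(-z)).
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Probability that x is positive: f(w^T x).
  def probability(w: Array[Double], x: Array[Double]): Double = {
    require(w.length == x.length, "w and x must have the same length")
    sigmoid(w.zip(x).map { case (wi, xi) => wi * xi }.sum)
  }

  // Class label: 1.0 (positive) if the probability exceeds the threshold, else 0.0.
  def predict(w: Array[Double], x: Array[Double], threshold: Double = 0.5): Double =
    if (probability(w, x) > threshold) 1.0 else 0.0
}
```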
Binary logistic regression can be generalized to multinomial logistic regression to train and predict multiclass classification problems. For example, for K possible outcomes, one of the outcomes can be chosen as a "pivot", and the other K − 1 outcomes can be separately regressed against the pivot outcome. In spark.mllib, the first class, 0, is chosen as the "pivot" class. See Section 4.4 of The Elements of Statistical Learning for a detailed mathematical derivation.
For multiclass classification problems, the algorithm outputs a multinomial logistic regression model, which contains K − 1 binary logistic regression models regressed against the first class. Given a new data point, the K − 1 models are run, and the class with the largest probability is chosen as the predicted class.
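The pivot formulation can be sketched in plain Scala as follows. The object PivotMultinomial and the layout of the weights (one dense weight vector per non-pivot class) are assumptions made for illustration; the library stores and evaluates its model differently.

```scala
object PivotMultinomial {
  // weights(k - 1) is the weight vector of class k, for k = 1, ..., K - 1.
  // Class 0 is the pivot, so its margin is fixed at 0.
  def margins(weights: Seq[Array[Double]], x: Array[Double]): Seq[Double] =
    0.0 +: weights.map(w => w.zip(x).map { case (wi, xi) => wi * xi }.sum)

  // Class probabilities: exp(margin_k) normalized over all K classes.
  // Subtracting the largest margin first keeps exp from overflowing.
  def probabilities(weights: Seq[Array[Double]], x: Array[Double]): Seq[Double] = {
    val m = margins(weights, x)
    val maxMargin = m.max
    val exps = m.map(v => math.exp(v - maxMargin))
    val total = exps.sum
    exps.map(_ / total)
  }

  // Predicted class: the index with the largest margin (equivalently, the largest probability).
  def predict(weights: Seq[Array[Double]], x: Array[Double]): Int = {
    val m = margins(weights, x)
    m.indexOf(m.max)
  }
}
```

With K = 3, two weight vectors are supplied and the returned class index is 0, 1, or 2.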
We implemented two algorithms to solve logistic regression: mini-batch gradient descent and L-BFGS. We recommend L-BFGS over mini-batch gradient descent for faster convergence.
Sample code
The following code shows how to load a sample multiclass data set, split it into training and test sets, and use LogisticRegressionWithLBFGS to fit a logistic regression model. The model is then evaluated against the test data set and saved to disk.
For more information about the API, refer to the LogisticRegressionWithLBFGS Scala documentation and the LogisticRegressionModel Scala documentation.
import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

// Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)

// Run training algorithm to build the model
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)
  .run(training)

// Compute raw scores on the test set.
val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
  val prediction = model.predict(features)
  (prediction, label)
}

// Get evaluation metrics.
val metrics = new MulticlassMetrics(predictionAndLabels)
val accuracy = metrics.accuracy
println(s"Accuracy = $accuracy")

// Save and load model
model.save(sc, "target/tmp/scalaLogisticRegressionWithLBFGSModel")
val sameModel = LogisticRegressionModel.load(sc, "target/tmp/scalaLogisticRegressionWithLBFGSModel")
3) Regression
(1) Linear least squares, Lasso, and ridge regression
Linear least squares is the most common formulation for regression problems. It is a linear method as described in equation (1) above, with the loss function given by the squared loss:

$$L(w; x, y) := \frac{1}{2} \left(w^T x - y\right)^2$$

Various related regression methods are derived by using different types of regularization: ordinary least squares (or linear least squares) uses no regularization; ridge regression uses L2 regularization; and Lasso uses L1 regularization. For all of these models, the average loss or training error,

$$\frac{1}{n} \sum_{i=1}^{n} \left(w^T x_i - y_i\right)^2,$$

is known as the mean squared error.
Sample code
The following example shows how to load training data and parse it into an RDD of LabeledPoint. The example then uses LinearRegressionWithSGD to build a simple linear model that predicts label values. Finally, we compute the mean squared error to evaluate the goodness of fit.
For API details, see the LinearRegressionWithSGD Scala documentation and the LinearRegressionModel Scala documentation.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

// Load and parse the data
val data = sc.textFile("data/mllib/ridgedata/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

// Building the model
val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
println(s"training Mean Squared Error $MSE")

// Save and load model
model.save(sc, "target/tmp/scalaLinearRegressionWithSGDModel")
val sameModel = LinearRegressionModel.load(sc, "target/tmp/scalaLinearRegressionWithSGDModel")
RidgeRegressionWithSGD and LassoWithSGD are used in a similar way to LinearRegressionWithSGD.
To run the above application, follow the instructions in the Self-Contained Applications section of the Spark quick start guide. Make sure that spark-mllib is also included as a dependency in the build file.
(2) Streaming linear regression
When data arrive as a stream, it is useful to fit regression models online, updating the model parameters as new data arrive. spark.mllib currently supports streaming linear regression using ordinary least squares. The fitting is similar to that performed offline, except that it occurs on each batch of data, so the model is continually updated to reflect the data from the stream.
Example
The following example shows how to load training and test data from two different text file input streams, parse the streams as labeled points, fit a linear regression model online to the first stream, and make predictions on the second stream.
Sample code
First, we import the necessary classes for parsing the input data and creating the model. Then we create input streams for the training and test data. We assume that a StreamingContext ssc has already been created; see the Spark Streaming programming guide for more information. In this example we use labeled points in both the training and test streams, but in practice you will likely want to use unlabeled vectors for the test data.
We create the model by initializing the weights to zero, register the streams for training and testing, and then start the job. Printing predictions alongside the true labels makes it easy to see the results.
Finally, we can save text files with data to the training or test folders. Each line should be a data point formatted as (y,[x1,x2,x3]), where y is the label and x1, x2, x3 are the features. The model is updated whenever a text file is placed in args(0), and predictions are printed whenever a text file is placed in args(1). As more data are fed to the training directory, the predictions improve.
This is a complete example:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

val trainingData = ssc.textFileStream(args(0)).map(LabeledPoint.parse).cache()
val testData = ssc.textFileStream(args(1)).map(LabeledPoint.parse)

val numFeatures = 3
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))

model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

ssc.start()
ssc.awaitTermination()
4) Implementation (developer)
Behind the scenes, spark.mllib implements a simple distributed version of stochastic gradient descent (SGD), building on the underlying gradient descent primitive (as described in the optimization section). All provided algorithms take as input a regularization parameter (regParam) along with various parameters associated with stochastic gradient descent (stepSize, numIterations, miniBatchFraction). For each of them, we support all three possible regularizations (none, L1, or L2).
For logistic regression, the L-BFGS version is implemented in LogisticRegressionWithLBFGS, which supports both binary and multinomial logistic regression, while the SGD version supports only binary logistic regression. However, the L-BFGS version does not support L1 regularization, whereas the SGD version does. When L1 regularization is not required, the L-BFGS version is strongly recommended, since it converges faster and more accurately than SGD by approximating the inverse Hessian matrix with a quasi-Newton method.
All algorithms are implemented in Scala:
 SVMWithSGD
 LogisticRegressionWithLBFGS
 LogisticRegressionWithSGD
 LinearRegressionWithSGD
 RidgeRegressionWithSGD
 LassoWithSGD
Naive Bayes
Naive Bayes is a simple multiclass classification algorithm that assumes independence between every pair of features. Naive Bayes can be trained very efficiently. In a single pass over the training data, it computes the conditional probability distribution of each feature given the label; it then applies Bayes' theorem to compute the conditional probability distribution of the label given an observation and uses it for prediction.
spark.mllib supports multinomial naive Bayes and Bernoulli naive Bayes. These models are typically used for document classification. In that context, each observation is a document and each feature represents a term whose value is the frequency of the term (in multinomial naive Bayes) or a zero or one indicating whether the term was found in the document (in Bernoulli naive Bayes). Feature values must be nonnegative. The model type is selected with an optional parameter, "multinomial" or "bernoulli" (the default is "multinomial"). Additive smoothing can be used by setting the parameter λ (1.0 by default). For document classification, the input feature vectors are usually sparse, and sparse vectors should be supplied as input to take advantage of sparsity. Since the training data is used only once, it is not necessary to cache it.
Sample code
NaiveBayes implements multinomial naive Bayes. It takes as input an RDD of LabeledPoint, an optional smoothing parameter lambda, and an optional model type parameter (the default is "multinomial"), and outputs a NaiveBayesModel, which can be used for evaluation and prediction.
For API details, see the NaiveBayes Scala documentation and the NaiveBayesModel Scala documentation.
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Split data into training (60%) and test (40%).
val Array(training, test) = data.randomSplit(Array(0.6, 0.4))

val model = NaiveBayes.train(training, lambda = 1.0, modelType = "multinomial")

val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / test.count()

// Save and load model
model.save(sc, "target/tmp/myNaiveBayesModel")
val sameModel = NaiveBayesModel.load(sc, "target/tmp/myNaiveBayesModel")
Decision trees
Decision trees and their ensembles are popular methods for the machine learning tasks of classification and regression. Decision trees are widely used because they are easy to interpret, handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture nonlinearities and feature interactions. Tree ensemble algorithms such as random forests and boosting are among the top performers for classification and regression tasks.
spark.mllib supports decision trees for binary and multiclass classification and for regression, using both continuous and categorical features. The implementation partitions data by rows, allowing distributed training with millions of instances.
Ensembles of trees (random forests and gradient-boosted trees) are described in the ensembles guide.
1) Basic algorithm
The decision tree is a greedy algorithm that performs a recursive binary partitioning of the feature space. The tree predicts the same label for each bottommost (leaf) partition. Each partition is chosen greedily by selecting the best split from a set of possible splits, in order to maximize the information gain at a tree node. In other words, the split chosen at each tree node is taken from the set $\operatorname{argmax}_s IG(D, s)$, where IG(D, s) is the information gain when a split s is applied to a dataset D.
(1) Node impurity and information gain
The node impurity is a measure of the homogeneity of the labels at the node. The current implementation provides two impurity measures for classification (Gini impurity and entropy) and one impurity measure for regression (variance).
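The three impurity measures and the information gain of a split can be sketched in plain Scala using their usual definitions. The object Impurity and its method names are hypothetical, class-label statistics are assumed to be given as per-class counts, and the base-2 logarithm for entropy is an assumption of this sketch; information gain is taken as the parent impurity minus the size-weighted impurities of the two children.

```scala
object Impurity {
  private def log2(v: Double): Double = math.log(v) / math.log(2.0)

  // Gini impurity: sum over classes of f_i * (1 - f_i), where f_i is the class frequency.
  def gini(counts: Seq[Double]): Double = {
    val total = counts.sum
    if (total == 0.0) 0.0
    else counts.map { c =>
      val f = c / total
      f * (1.0 - f)
    }.sum
  }

  // Entropy: sum over classes of -f_i * log2(f_i), skipping empty classes.
  def entropy(counts: Seq[Double]): Double = {
    val total = counts.sum
    if (total == 0.0) 0.0
    else counts.filter(_ > 0.0).map { c =>
      val f = c / total
      -f * log2(f)
    }.sum
  }

  // Variance of real-valued labels: (1/N) * sum of (y_i - mean)^2.
  def variance(labels: Seq[Double]): Double =
    if (labels.isEmpty) 0.0
    else {
      val mean = labels.sum / labels.size
      labels.map(y => (y - mean) * (y - mean)).sum / labels.size
    }

  // Information gain of a split, given the per-class counts in the left and right children.
  def informationGain(
      impurity: Seq[Double] => Double,
      left: Seq[Double],
      right: Seq[Double]): Double = {
    val parent = left.zipAll(right, 0.0, 0.0).map { case (l, r) => l + r }
    val nLeft = left.sum
    val nRight = right.sum
    val n = nLeft + nRight
    if (n == 0.0) 0.0
    else impurity(parent) - nLeft / n * impurity(left) - nRight / n * impurity(right)
  }
}
```

A split that separates two balanced classes perfectly recovers the whole parent impurity as gain, while a split whose children have the same class mix as the parent gains nothing.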
(2) Split candidates
Continuous features
For small datasets in single-machine implementations, the split candidates for each continuous feature are typically the unique values of the feature. Some implementations sort the feature values and then use the ordered unique values as split candidates for faster tree calculations.
For large distributed datasets, sorting feature values is expensive. This implementation computes an approximate set of split candidates by performing a quantile calculation over a sampled fraction of the data. The ordered splits create "bins", and the maximum number of such bins can be specified with the maxBins parameter. Note that the number of bins cannot be greater than the number of instances N (a rare scenario, since the default maxBins value is 32). If this condition is not satisfied, the tree algorithm automatically reduces the number of bins.
Categorical features
For a categorical feature with M possible values (categories), one could come up with $2^{M-1} - 1$ split candidates. For binary (0/1) classification and regression, the number of split candidates can be reduced to M − 1 by ordering the categorical feature values by the average label. (See Section 9.2.4 of The Elements of Statistical Learning for details.) For example, for a binary classification problem with one categorical feature that has three categories A, B, and C, whose corresponding proportions of label 1 are 0.2, 0.6, and 0.4, the categorical feature values are ordered as A, C, B. The two split candidates are A | C, B and A, C | B, where | denotes the split.
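The ordering trick in the example above can be sketched in plain Scala. The object CategoricalSplits and its method are hypothetical names, and the input is assumed to be a map from category name to its proportion of label 1; the sketch only enumerates the M − 1 ordered split candidates and does not score them.

```scala
object CategoricalSplits {
  // Order the categories by their proportion of label 1, then emit the M - 1
  // splits of the ordered list into a left part and a right part.
  def orderedSplits(positiveRate: Map[String, Double]): Seq[(Seq[String], Seq[String])] = {
    val ordered = positiveRate.toSeq.sortBy(_._2).map(_._1)
    (1 until ordered.length).map(i => (ordered.take(i), ordered.drop(i)))
  }
}
```

For the three categories in the text, CategoricalSplits.orderedSplits(Map("A" -> 0.2, "B" -> 0.6, "C" -> 0.4)) yields the splits A | C, B and A, C | B.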
In multiclass classification, all $2^{M-1} - 1$ possible splits are used whenever possible. When $2^{M-1} - 1$ is greater than the maxBins parameter, we use a (heuristic) method similar to the one used for binary classification and regression. The M categorical feature values are ordered by impurity, and the resulting M − 1 split candidates are considered.
(3) Stopping rule
The recursive tree construction stops at a node when one of the following conditions is met:
 The node depth is equal to the maxDepth training parameter.
 No split candidate leads to an information gain greater than minInfoGain.
 No split candidate produces child nodes that each have at least minInstancesPerNode training instances.
2) Usage tips
We include a few guidelines for using decision trees by discussing the various parameters. The parameters are listed below roughly in order of descending importance. New users should mainly consider the "Problem specification parameters" section and the maxDepth parameter.
(1) Problem specification parameters
These parameters describe the problem you want to solve and your dataset. They should be specified and do not require tuning.
 algo: type of decision tree, either Classification or Regression.
 numClasses: number of classes (for classification only).
 categoricalFeaturesInfo: specifies which features are categorical and how many categorical values each of those features can take. It is given as a map from feature index to feature arity (number of categories). Any feature not in this map is treated as continuous.
 For example, Map(0 -> 2, 4 -> 10) specifies that feature 0 is binary (taking values 0 or 1) and that feature 4 has 10 categories (values {0, 1, ..., 9}). Note that feature indices are 0-based: features 0 and 4 are the 1st and 5th elements of an instance's feature vector.
 Note that you do not have to specify categoricalFeaturesInfo. The algorithm will still run and may get reasonable results. However, performance should be better if categorical features are properly designated.
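The meaning of the map in the example above can be sketched in plain Scala. The object CategoricalInfo and its helper methods are hypothetical and exist only to spell out how the map is read; only the Map(0 -> 2, 4 -> 10) value itself comes from the text.

```scala
object CategoricalInfo {
  // Feature 0 is binary, feature 4 has 10 categories, every other feature is continuous.
  val categoricalFeaturesInfo: Map[Int, Int] = Map(0 -> 2, 4 -> 10)

  // A feature is categorical exactly when its 0-based index appears in the map.
  def isCategorical(featureIndex: Int): Boolean =
    categoricalFeaturesInfo.contains(featureIndex)

  // A categorical value is valid when it is an integer in {0, ..., arity - 1};
  // continuous features accept any value.
  def isValidValue(featureIndex: Int, value: Double): Boolean =
    categoricalFeaturesInfo.get(featureIndex) match {
      case Some(arity) => value == math.floor(value) && value >= 0 && value < arity
      case None        => true
    }
}
```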
(2) Stopping criteria
These parameters determine when the tree stops building (adding new nodes). When tuning these parameters, be careful to validate on held-out test data to avoid overfitting.
 maxDepth: the maximum depth of a tree. Deeper trees are more expressive (potentially allowing higher accuracy), but they are also more costly to train and are more likely to overfit.
 minInstancesPerNode: for a node to be split further, each of its children must receive at least this number of training instances. This is commonly used with RandomForest, since those trees are often trained deeper than individual trees.
 minInfoGain: for a node to be split further, the split must improve at least this much (in terms of information gain).
(3) Tunable parameters
These parameters may be tuned. When tuning, be careful to validate on held-out test data to avoid overfitting.
 maxBins: the number of bins used when discretizing continuous features. Increasing maxBins allows the algorithm to consider more split candidates and make fine-grained split decisions. However, it also increases computation and communication.
 Note that the maxBins parameter must be at least the maximum number of categories M for any categorical feature.
 maxMemoryInMB: amount of memory to be used for collecting sufficient statistics.
 The default value is conservatively chosen to be 256 MB so that the decision algorithm works in most scenarios. Increasing maxMemoryInMB can lead to faster training (if the memory is available) by allowing fewer passes over the data. However, there may be diminishing returns as maxMemoryInMB grows, since the amount of communication on each iteration can be proportional to maxMemoryInMB.
 Implementation details: for faster processing, the decision tree algorithm collects statistics about groups of nodes to split (rather than one node at a time). The number of nodes that can be handled in one group is determined by the memory requirements (which vary per feature). The maxMemoryInMB parameter specifies the memory limit, in megabytes, that each worker can use for these statistics.
 subsamplingRate: the fraction of the training data used for learning the decision tree. This parameter is most relevant for training ensembles of trees (using RandomForest and GradientBoostedTrees), where it can be useful to subsample the original data. For training a single decision tree, this parameter is less useful, since the number of training instances is generally not the main constraint.
 impurity: impurity measure (discussed above) used to choose between candidate splits. This measure must match the algo parameter.
(4) Caching and checkpointing
MLlib 1.2 adds several features for scaling up to larger (deeper) trees and tree ensembles. When maxDepth is set to be large, it can be useful to turn on node ID caching and checkpointing. These parameters are also useful for RandomForest when numTrees is set to be large.
 useNodeIdCache: if set to true, the algorithm avoids passing the current model (tree or trees) to executors on each iteration.
 This can be useful with deep trees (speeding up computation on workers) and with large random forests (reducing communication on each iteration).
 Implementation details: by default, the algorithm communicates the current model to executors so that executors can match training instances with tree nodes. When this setting is turned on, the algorithm caches this information instead.
 Node ID caching generates a sequence of RDDs (one per iteration). This long lineage can cause performance problems, but checkpointing intermediate RDDs can alleviate them. Note that checkpointing is only applicable when useNodeIdCache is set to true.
 checkpointDir: directory for checkpointing node ID cache RDDs.
 checkpointInterval: frequency for checkpointing node ID cache RDDs. Setting this too low causes extra overhead from writing to HDFS; setting it too high can cause problems if executors fail and the RDD needs to be recomputed.
3) Scaling
Computation scales approximately linearly in the number of training instances, in the number of features, and in the maxBins parameter. Communication scales approximately linearly in the number of features and in maxBins. The implemented algorithm reads both sparse and dense data. However, it is not optimized for sparse input.
4) Examples
(1) Classification
The following example shows how to load a LIBSVM data file, parse it into an RDD of LabeledPoint, and then perform classification using a decision tree with Gini impurity as the impurity measure and a maximum tree depth of 5. The test error is computed to measure the accuracy of the algorithm.
Sample code
For API details, see the DecisionTree Scala documentation and DecisionTreeModel Scala documentation.
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a DecisionTree model.
// Empty categoricalFeaturesInfo indicates all features are continuous.
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "gini"
val maxDepth = 5
val maxBins = 32

val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  impurity, maxDepth, maxBins)

// Evaluate model on test instances and compute test error
val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testErr = labelAndPreds.filter(r => r._1 != r._2).count().toDouble / testData.count()
println(s"Test Error = $testErr")
println(s"Learned classification tree model:\n ${model.toDebugString}")

// Save and load model
model.save(sc, "target/tmp/myDecisionTreeClassificationModel")
val sameModel = DecisionTreeModel.load(sc, "target/tmp/myDecisionTreeClassificationModel")
(2) Regression
The following example shows how to load a LIBSVM data file, parse it into an RDD of LabeledPoint, and then perform regression using a decision tree with variance as the impurity measure and a maximum tree depth of 5. The mean squared error (MSE) is computed at the end to evaluate the goodness of fit.
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a DecisionTree model.
// Empty categoricalFeaturesInfo indicates all features are continuous.
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "variance"
val maxDepth = 5
val maxBins = 32

val model = DecisionTree.trainRegressor(trainingData, categoricalFeaturesInfo, impurity,
  maxDepth, maxBins)

// Evaluate model on test instances and compute test error
val labelsAndPredictions = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testMSE = labelsAndPredictions.map { case (v, p) => math.pow(v - p, 2) }.mean()
println(s"Test Mean Squared Error = $testMSE")
println(s"Learned regression tree model:\n ${model.toDebugString}")

// Save and load model
model.save(sc, "target/tmp/myDecisionTreeRegressionModel")
val sameModel = DecisionTreeModel.load(sc, "target/tmp/myDecisionTreeRegressionModel")
Ensembles of decision trees
An ensemble method is a learning algorithm that creates a model composed of a set of other base models. spark.mllib supports two major ensemble algorithms: GradientBoostedTrees and RandomForest. Both use decision trees as their base models.
Gradient-Boosted Trees vs. Random Forests
Both gradient-boosted trees (GBTs) and random forests are algorithms for learning ensembles of trees, but their training processes differ. There are several practical trade-offs:
 GBTs train one tree at a time, so they can take longer to train than random forests. Random forests can train multiple trees in parallel.
 On the other hand, it is often reasonable to use smaller (shallower) trees with GBTs than with random forests, and training smaller trees takes less time.
 Random forests can be less prone to overfitting. Training more trees in a random forest reduces the likelihood of overfitting, whereas training more trees with GBTs increases it. (In statistical language, random forests reduce variance by using more trees, whereas GBTs reduce bias by using more trees.)
 Random forests can be easier to tune, since performance improves monotonically with the number of trees (whereas performance can start to decrease for GBTs if the number of trees grows too large).
In short, both algorithms can be effective, and the choice should be based on the particular dataset.
1) Random forests
Random forests are ensembles of decision trees. They are among the most successful machine learning models for classification and regression, and they combine many decision trees in order to reduce the risk of overfitting. Like decision trees, random forests handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture nonlinearities and feature interactions.
spark.mllib supports random forests for binary and multiclass classification and for regression, using both continuous and categorical features. spark.mllib implements random forests using the existing decision tree implementation. See the decision tree guide for more information on trees.
Basic algorithm
Random forests train a set of decision trees separately, so the training can be done in parallel. The algorithm injects randomness into the training process so that each decision tree is a bit different. Combining the predictions from each tree reduces the variance of the predictions and improves performance on test data.
Training
The randomness injected into the training process includes:
 Subsampling the original dataset on each iteration to get a different training set (also known as bootstrapping).
 Considering different random subsets of features to split on at each tree node.
Apart from these randomizations, each tree in the forest is trained in the same way as an individual decision tree.
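The two sources of randomness above can be sketched in plain Scala. This is only an illustration under simple assumptions (sampling with replacement, uniform feature selection); the object and method names are hypothetical and it is not meant to mirror how spark.mllib samples internally.

```scala
import scala.util.Random

object ForestRandomness {
  // Draw a bootstrap sample: same size as the input, sampled with replacement.
  def bootstrap[A](data: IndexedSeq[A], rng: Random): IndexedSeq[A] =
    IndexedSeq.fill(data.length)(data(rng.nextInt(data.length)))

  // Pick k distinct feature indices out of numFeatures, uniformly at random.
  // (Hypothetical helper; a forest would call this once per tree node.)
  def featureSubset(numFeatures: Int, k: Int, rng: Random): Seq[Int] =
    rng.shuffle((0 until numFeatures).toList).take(k).sorted
}
```

Each tree would receive its own bootstrap sample, and each node of that tree its own feature subset, which is what makes the trees differ from one another.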
Prediction
To make a prediction on a new instance, a random forest must aggregate the predictions from its set of decision trees. This aggregation is done differently for classification and regression.
Classification: majority vote. Each tree's prediction counts as a vote for one class. The predicted label is the class that receives the most votes.
Regression: averaging. Each tree predicts a real value. The predicted label is the average of the tree predictions.
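As a minimal sketch of these two aggregation rules, assuming the per-tree predictions have already been collected into a sequence (the object and method names are hypothetical):

```scala
object ForestAggregation {
  // Classification: each tree casts one vote; the most frequent label wins.
  // Ties are broken arbitrarily. Requires a non-empty input.
  def majorityVote(treePredictions: Seq[Double]): Double =
    treePredictions.groupBy(identity).maxBy { case (_, votes) => votes.size }._1

  // Regression: the forest prediction is the mean of the tree predictions.
  // Requires a non-empty input.
  def average(treePredictions: Seq[Double]): Double =
    treePredictions.sum / treePredictions.size
}
```

For example, three trees voting 1.0, 0.0 and 1.0 give the class 1.0, and three trees predicting 1.0, 2.0 and 3.0 give the regression value 2.0.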
Usage tips
We include a few guidelines for using random forests by discussing the various parameters. We omit some decision tree parameters, since those are covered in the decision tree guide.
The first two parameters we mention are the most important, and tuning them can often improve performance:
 numTrees: Number of trees in the forest.
   Increasing the number of trees decreases the variance of the predictions, improving the model's test-time accuracy.
   Training time increases roughly linearly with the number of trees.
 maxDepth: Maximum depth of each tree in the forest.
   Increasing the depth makes the model more expressive and powerful. However, deep trees take longer to train and are also more prone to overfitting.
   In general, it is acceptable to train deeper trees when using random forests than when using a single decision tree. A single tree is more likely to overfit than a random forest (because averaging multiple trees in the forest reduces variance).
The next two parameters generally do not require tuning. However, they can be tuned to speed up training.
 subsamplingRate: This parameter specifies the size of the dataset used to train each tree in the forest, as a fraction of the size of the original dataset. The default (1.0) is recommended, but decreasing this fraction can speed up training.
 featureSubsetStrategy: Number of features to use as candidates for splitting at each tree node. The number is specified as a fraction or function of the total number of features. Decreasing this number speeds up training, but can sometimes hurt performance if it is too low.
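To make "a function of the total number of features" concrete, here is a hedged sketch of how a strategy name could translate into a per-node candidate count. The names "all", "sqrt", "log2" and "onethird" are the ones we understand spark.mllib to accept alongside "auto"; the helper itself and its rounding choices are assumptions made for illustration.

```scala
object FeatureSubset {
  // Hypothetical helper: how many candidate features a strategy name allows
  // per tree node, out of numFeatures. The rounding choices are assumptions.
  def candidateFeatureCount(strategy: String, numFeatures: Int): Int = {
    require(numFeatures > 0, "numFeatures must be positive")
    strategy match {
      case "all"      => numFeatures
      case "sqrt"     => math.ceil(math.sqrt(numFeatures.toDouble)).toInt
      // Ceiling of log2(numFeatures), computed with integer arithmetic.
      case "log2"     => math.max(1, 32 - Integer.numberOfLeadingZeros(numFeatures - 1))
      case "onethird" => math.ceil(numFeatures / 3.0).toInt
      case other      => throw new IllegalArgumentException(s"Unknown strategy: $other")
    }
  }
}
```

With 100 features, for instance, "sqrt" would consider 10 candidates per node instead of 100, which is where the training speed-up comes from.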
Sample code
Classification
The following example shows how to load a LIBSVM data file, parse it as an RDD of LabeledPoint, and then perform classification using a random forest. The test error is computed to measure the accuracy of the algorithm.
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a RandomForest model.
// Empty categoricalFeaturesInfo indicates all features are continuous.
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 3 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val impurity = "gini"
val maxDepth = 4
val maxBins = 32

val model = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

// Evaluate model on test instances and compute test error
val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testErr = labelAndPreds.filter(r => r._1 != r._2).count().toDouble / testData.count()
println(s"Test Error = $testErr")
println(s"Learned classification forest model:\n ${model.toDebugString}")

// Save and load model
model.save(sc, "target/tmp/myRandomForestClassificationModel")
val sameModel = RandomForestModel.load(sc, "target/tmp/myRandomForestClassificationModel")
Regression
The following example shows how to load a LIBSVM data file, parse it as an RDD of LabeledPoint, and then perform regression using a random forest. The mean squared error (MSE) is computed at the end to evaluate goodness of fit.
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a RandomForest model.
// Empty categoricalFeaturesInfo indicates all features are continuous.
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 3 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val impurity = "variance"
val maxDepth = 4
val maxBins = 32

val model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

// Evaluate model on test instances and compute the mean squared error
val labelsAndPredictions = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testMSE = labelsAndPredictions.map { case (v, p) => math.pow(v - p, 2) }.mean()
println(s"Test Mean Squared Error = $testMSE")
println(s"Learned regression forest model:\n ${model.toDebugString}")

// Save and load model
model.save(sc, "target/tmp/myRandomForestRegressionModel")
val sameModel = RandomForestModel.load(sc, "target/tmp/myRandomForestRegressionModel")
2) Gradient-Boosted Trees (GBTs)
Gradient-boosted trees (GBTs) are ensembles of decision trees. GBTs iteratively train decision trees in order to minimize a loss function. Like decision trees, GBTs handle categorical features, do not require feature scaling, and are able to capture non-linearities and feature interactions. The method extends in principle to the multiclass classification setting, although spark.mllib does not yet support this (see the note below).
spark.mllib supports GBTs for binary classification and for regression, using both continuous and categorical features. spark.mllib implements GBTs using the existing decision tree implementation. See the decision tree guide for more information on trees.
Note: GBTs do not yet support multiclass classification. For multiclass problems, please use decision trees or random forests.
Basic algorithm
Gradient boosting iteratively trains a sequence of decision trees. On each iteration, the algorithm uses the current ensemble to predict the label of each training instance and then compares the prediction with the true label. The dataset is re-labeled to put more emphasis on training instances with poor predictions. Thus, in the next iteration, the decision tree helps correct previous mistakes.
The specific mechanism for re-labeling instances is defined by a loss function (discussed below). With each iteration, GBTs further reduce this loss function on the training data.
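The loop described above can be sketched in plain Scala for the special case of squared error, where re-labeling amounts to fitting each new tree to the current residuals. Everything here is an assumption made for illustration: a single feature, depth-1 trees (stumps), an initial prediction of zero, and hypothetical names throughout. It is not spark.mllib's implementation.

```scala
object BoostingSketch {
  // A depth-1 regression tree (stump) on a single feature.
  final case class Stump(threshold: Double, left: Double, right: Double) {
    def predict(x: Double): Double = if (x <= threshold) left else right
  }

  private def mean(xs: Seq[Double]): Double = if (xs.isEmpty) 0.0 else xs.sum / xs.size

  // Fit the stump that minimizes squared error against the given targets,
  // trying every distinct feature value as a threshold. Requires non-empty xs.
  def fitStump(xs: Seq[Double], targets: Seq[Double]): Stump = {
    val pairs = xs.zip(targets)
    val candidates = xs.distinct.map { t =>
      val (l, r) = pairs.partition { case (x, _) => x <= t }
      val stump = Stump(t, mean(l.map(_._2)), mean(r.map(_._2)))
      val sse = pairs.map { case (x, y) => math.pow(y - stump.predict(x), 2) }.sum
      (stump, sse)
    }
    candidates.minBy(_._2)._1
  }

  // Train numIterations stumps; each one is fit to the residuals left by the
  // ensemble so far (the "re-labeled" dataset for squared error).
  def train(xs: Seq[Double], ys: Seq[Double], numIterations: Int,
            learningRate: Double): Vector[Stump] = {
    val init = (Vector.empty[Stump], ys)
    (0 until numIterations).foldLeft(init) { case ((stumps, residuals), _) =>
      val stump = fitStump(xs, residuals)
      val updated = xs.zip(residuals).map { case (x, r) => r - learningRate * stump.predict(x) }
      (stumps :+ stump, updated)
    }._1
  }

  // The ensemble prediction is the scaled sum of the stump predictions.
  def predict(stumps: Seq[Stump], learningRate: Double, x: Double): Double =
    stumps.map(s => learningRate * s.predict(x)).sum

  // Mean squared error of the ensemble on a dataset.
  def mse(stumps: Seq[Stump], learningRate: Double, xs: Seq[Double], ys: Seq[Double]): Double =
    xs.zip(ys).map { case (x, y) => math.pow(y - predict(stumps, learningRate, x), 2) }.sum / xs.size
}
```

On a toy dataset, training more stumps lowers the training error step by step, which matches the statement that each iteration further reduces the loss on the training data.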
Losses
The following table lists the losses currently supported by GBTs in spark.mllib. Note that each loss applies to either classification or regression, not both.
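For orientation, the usual forms of these losses can be written as below. This is a sketch under assumptions: we take the supported losses to be log loss, squared error and absolute error, and the notation ($N$ instances, label $y_i$, features $x_i$, ensemble prediction $F(x_i)$) is introduced here. The log loss is written for labels in $\{-1, +1\}$.

```latex
\begin{aligned}
\text{Log loss (classification):}\quad & 2 \sum_{i=1}^{N} \log\left(1 + \exp\left(-2\, y_i F(x_i)\right)\right) \\
\text{Squared error (regression):}\quad & \sum_{i=1}^{N} \left(y_i - F(x_i)\right)^2 \\
\text{Absolute error (regression):}\quad & \sum_{i=1}^{N} \left|y_i - F(x_i)\right|
\end{aligned}
```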
Usage tips
We include a few guidelines for using GBTs by discussing the various parameters. We omit some decision tree parameters, since those are covered in the decision tree guide.
 loss: See the section above for information on losses and their applicability to tasks (classification vs. regression). Different losses can give significantly different results, depending on the dataset.
 numIterations: This sets the number of trees in the ensemble. Each iteration produces one tree. Increasing this number makes the model more expressive and improves accuracy on the training data. However, test-time accuracy may suffer if it is too large.
 learningRate: This parameter should not need to be tuned. If the algorithm's behavior seems unstable, decreasing this value may improve stability.
 algo: The algorithm or task (classification vs. regression) is set using the tree [Strategy] parameter.
Validation while training
Gradient boosting can overfit when trained with more trees. To prevent overfitting, it is useful to validate while training. The method runWithValidation is provided for this purpose. It takes a pair of RDDs as arguments: the first is the training dataset and the second is the validation dataset.
Training stops when the improvement in the validation error is not more than a certain tolerance (supplied by the validationTol argument in BoostingStrategy). In practice, the validation error decreases initially and later increases. In some cases the validation error does not change monotonically; it is then advisable to set a sufficiently large negative tolerance and to examine the validation curve with evaluateEachIteration (which gives the error or loss per iteration) in order to tune the number of iterations.
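The stopping rule can be illustrated without Spark. The sketch below assumes a hypothetical list of validation errors, one per boosting iteration, and uses a plain absolute tolerance; the exact rule spark.mllib applies to validationTol may differ (for example, by scaling the tolerance), so treat this as an illustration of the idea only.

```scala
object EarlyStopping {
  // Returns how many trees to keep. Training "stops" at the first iteration
  // whose improvement over the best error so far is not more than tol.
  // validationErrors(i) is the validation error after i + 1 trees.
  def bestNumIterations(validationErrors: Seq[Double], tol: Double): Int = {
    require(validationErrors.nonEmpty, "need at least one validation error")
    var best = validationErrors.head
    var bestIter = 1
    var i = 1
    var stopped = false
    while (i < validationErrors.length && !stopped) {
      val current = validationErrors(i)
      if (best - current <= tol) {
        stopped = true            // improvement too small: stop training
      } else if (current < best) {
        best = current            // new best error: remember this iteration
        bestIter = i + 1
      }
      i += 1
    }
    bestIter
  }
}
```

With a positive tolerance the loop stops as soon as the curve flattens or turns upward. With a large negative tolerance it never stops early and simply reports the iteration with the lowest validation error, which is the behavior recommended above for non-monotonic curves.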
Sample code
Classification
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a GradientBoostedTrees model.
// The defaultParams for Classification use LogLoss by default.
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 3 // Note: Use more iterations in practice.
boostingStrategy.treeStrategy.numClasses = 2
boostingStrategy.treeStrategy.maxDepth = 5
// Empty categoricalFeaturesInfo indicates all features are continuous.
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()

val model = GradientBoostedTrees.train(trainingData, boostingStrategy)

// Evaluate model on test instances and compute test error
val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testErr = labelAndPreds.filter(r => r._1 != r._2).count().toDouble / testData.count()
println(s"Test Error = $testErr")
println(s"Learned classification GBT model:\n ${model.toDebugString}")

// Save and load model
model.save(sc, "target/tmp/myGradientBoostingClassificationModel")
val sameModel = GradientBoostedTreesModel.load(sc, "target/tmp/myGradientBoostingClassificationModel")
Regression
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a GradientBoostedTrees model.
// The defaultParams for Regression use SquaredError by default.
val boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.numIterations = 3 // Note: Use more iterations in practice.
boostingStrategy.treeStrategy.maxDepth = 5
// Empty categoricalFeaturesInfo indicates all features are continuous.
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()

val model = GradientBoostedTrees.train(trainingData, boostingStrategy)

// Evaluate model on test instances and compute the mean squared error
val labelsAndPredictions = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testMSE = labelsAndPredictions.map { case (v, p) => math.pow(v - p, 2) }.mean()
println(s"Test Mean Squared Error = $testMSE")
println(s"Learned regression GBT model:\n ${model.toDebugString}")

// Save and load model
model.save(sc, "target/tmp/myGradientBoostingRegressionModel")
val sameModel = GradientBoostedTreesModel.load(sc, "target/tmp/myGradientBoostingRegressionModel")
Isotonic regression
Isotonic regression belongs to the family of regression algorithms. Formally, isotonic regression is the following problem: given a finite set of real numbers $Y = y_1, y_2, \ldots, y_n$ representing observed responses, and $X = x_1, x_2, \ldots, x_n$ the unknown response values to be fitted, find a function that minimizes $f(x) = \sum_{i=1}^{n} w_i (y_i - x_i)^2$ subject to $x_1 \le x_2 \le \cdots \le x_n$, where the $w_i$ are positive weights.
spark.mllib supports a pool adjacent violators algorithm, which uses an approach to parallelizing isotonic regression. The training input is an RDD of tuples of three double values that represent label, feature and weight, in this order. In addition, the IsotonicRegression algorithm has one optional parameter called isotonic, which defaults to true. This parameter specifies whether the regression is isotonic (monotonically increasing) or antitonic (monotonically decreasing).
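The pool adjacent violators idea can be sketched sequentially in plain Scala. This is a minimal single-machine illustration with hypothetical names, taking (label, feature, weight) tuples as in the text and assuming positive weights; it does not attempt the parallelization that spark.mllib performs.

```scala
object Pava {
  // A block of pooled points: weighted sum of labels, total weight, point count.
  private final case class Block(weightedSum: Double, weight: Double, count: Int) {
    def value: Double = weightedSum / weight
    def merge(other: Block): Block =
      Block(weightedSum + other.weightedSum, weight + other.weight, count + other.count)
  }

  // Input: (label, feature, weight) tuples. Output: the fitted values, in
  // order of increasing feature. Weights are assumed to be positive.
  def fit(points: Seq[(Double, Double, Double)], isotonic: Boolean = true): Vector[Double] = {
    // The antitonic case is handled by negating labels, fitting an
    // increasing function, and negating the result back.
    val sign = if (isotonic) 1.0 else -1.0
    val sorted = points.sortBy(_._2)
    // The stack holds finished blocks, most recent block first.
    val blocks = sorted.foldLeft(List.empty[Block]) { case (stack, (label, _, w)) =>
      var current = Block(sign * label * w, w, 1)
      var rest = stack
      // Pool with earlier blocks while they violate the ordering.
      while (rest.nonEmpty && rest.head.value > current.value) {
        current = rest.head.merge(current)
        rest = rest.tail
      }
      current :: rest
    }
    blocks.reverse.flatMap(b => List.fill(b.count)(sign * b.value)).toVector
  }
}
```

For labels 1, 3, 2, 4 at features 1 to 4 with unit weights, the violating pair (3, 2) is pooled into its weighted mean 2.5, giving the fitted values 1, 2.5, 2.5, 4.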
Training returns an IsotonicRegressionModel that can be used to predict labels for both known and unknown features. The result of isotonic regression is treated as a piecewise linear function. The rules for prediction are therefore:
 If the prediction input exactly matches a training feature, the associated prediction is returned. If there are multiple predictions with the same feature, one of them is returned; which one is undefined (same as java.util.Arrays.binarySearch).
 If the prediction input is lower or higher than all training features, the prediction with the lowest or highest feature is returned, respectively. If there are multiple predictions with the same feature, the lowest or highest is returned, respectively.
 If the prediction input falls between two training features, the prediction is treated as a piecewise linear function and the interpolated value is calculated from the predictions of the two closest features. If there are multiple values with the same feature, the same rules as in the previous point apply.
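These three rules can be sketched as a small lookup function. The sketch assumes the training features have been reduced to a strictly increasing array with one fitted value per entry; the object, method and parameter names are chosen for illustration.

```scala
object IsotonicPredict {
  // boundaries: training features in strictly increasing order.
  // predictions: the fitted value for each boundary (same length).
  def predict(boundaries: Array[Double], predictions: Array[Double], x: Double): Double = {
    require(boundaries.nonEmpty && boundaries.length == predictions.length)
    val idx = java.util.Arrays.binarySearch(boundaries, x)
    if (idx >= 0) {
      predictions(idx)                       // rule 1: exact match
    } else {
      val insertion = -idx - 1               // index of the first boundary > x
      if (insertion == 0) predictions.head   // rule 2: below all features
      else if (insertion == boundaries.length) predictions.last // rule 2: above all
      else {
        // rule 3: linear interpolation between the two closest features
        val (x0, x1) = (boundaries(insertion - 1), boundaries(insertion))
        val (y0, y1) = (predictions(insertion - 1), predictions(insertion))
        y0 + (y1 - y0) * (x - x0) / (x1 - x0)
      }
    }
  }
}
```

A negative result from binarySearch encodes the insertion point, which is what lets a single search distinguish the exact-match, out-of-range and interpolation cases.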
Sample code
import org.apache.spark.mllib.regression.{IsotonicRegression, IsotonicRegressionModel}
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc,
  "data/mllib/sample_isotonic_regression_libsvm_data.txt").cache()

// Create label, feature, weight tuples from input data with weight set to default value 1.0.
val parsedData = data.map { labeledPoint =>
  (labeledPoint.label, labeledPoint.features(0), 1.0)
}

// Split data into training (60%) and test (40%) sets.
val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0)
val test = splits(1)

// Create isotonic regression model from training data.
// Isotonic parameter defaults to true so it is only shown for demonstration
val model = new IsotonicRegression().setIsotonic(true).run(training)

// Create tuples of predicted and real labels.
val predictionAndLabel = test.map { point =>
  val predictedLabel = model.predict(point._2)
  (predictedLabel, point._1)
}

// Calculate mean squared error between predicted and real labels.
val meanSquaredError = predictionAndLabel.map { case (p, l) => math.pow(p - l, 2) }.mean()
println(s"Mean Squared Error = $meanSquaredError")

// Save and load model
model.save(sc, "target/tmp/myIsotonicRegressionModel")
val sameModel = IsotonicRegressionModel.load(sc, "target/tmp/myIsotonicRegressionModel")