Datawhale October Learning - Tree Models and Ensemble Learning: Two Parallel Ensemble Tree Models

Summary at a glance

In this session I studied and implemented two parallel ensemble tree models: random forest and isolation forest. My coverage of isolation forest is still fairly shallow and needs to be supplemented later.

1 Random forest

1.1 Principle

Random forests use the bagging algorithm, and CART trees are commonly used as the base learners.

For regression problems, the output is the mean of the base learners' outputs. For classification problems, you can use either voting or, as in sklearn, probability aggregation.
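The two classification strategies do not always agree. The following is a minimal sketch with made-up numbers: the array `tree_probas` is hypothetical and does not come from a trained model.

```python
import numpy as np

# Hypothetical class-probability outputs of 3 trees for one sample (2 classes).
tree_probas = np.array([
    [0.45, 0.55],   # tree 1 leans slightly towards class 1
    [0.40, 0.60],   # tree 2 leans slightly towards class 1
    [0.95, 0.05],   # tree 3 is very confident in class 0
])

# Voting: each tree casts one vote for its most probable class.
votes = tree_probas.argmax(axis=1)
hard_vote = np.bincount(votes, minlength=2).argmax()

# Probability aggregation: average the probabilities, then take the argmax.
mean_proba = tree_probas.mean(axis=0)
soft_vote = mean_proba.argmax()

# Regression: the ensemble output is the mean of the tree outputs.
tree_values = np.array([2.0, 3.0, 4.0])
regression_output = tree_values.mean()

print(hard_vote, soft_vote, regression_output)
```

Here voting picks class 1 (two votes to one), while probability aggregation picks class 0 because the third tree is far more confident than the other two.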

The randomness in a random forest comes from three main sources:

Bootstrap sampling, which gives each tree a random training set
Random selection of a feature subset at each node for the impurity calculation
Random split point selection, when it is used
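The three sources can be sketched on toy data with NumPy alone. This only illustrates the sampling steps and is not a tree-building routine; the data, the `sqrt` subset size, and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))            # toy data: 100 samples, 8 features
n_samples, n_features = X.shape

# 1) Bootstrap sampling: draw n_samples row indices with replacement.
boot_idx = rng.integers(0, n_samples, size=n_samples)
X_boot = X[boot_idx]

# 2) Random feature subset at a node (sqrt(n_features) is a common choice).
k = int(np.sqrt(n_features))
feat_idx = rng.choice(n_features, size=k, replace=False)

# 3) Random split point: draw a threshold uniformly within the range of a
#    candidate feature instead of searching for the best threshold.
f = feat_idx[0]
threshold = rng.uniform(X_boot[:, f].min(), X_boot[:, f].max())

print(X_boot.shape, feat_idx, threshold)
```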

The feature importance of the forest is obtained by simply averaging the importance scores of all trees.
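As a quick check of this averaging, the sketch below compares a hand-computed mean of the per-tree importances with the forest-level attribute in sklearn. The iris data set and the choice of 20 trees are assumptions made for the example, and the comparison assumes every tree has more than one node.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Average the per-tree importance vectors by hand.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
manual_importance = per_tree.mean(axis=0)

print(manual_importance)
print(forest.feature_importances_)
```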

Because of bootstrap sampling, each base learner never draws a fraction of about $e^{-1} \approx 36.8\%$ of the data set: with $n$ draws with replacement from $n$ samples, a given sample is missed with probability $(1 - 1/n)^n \to e^{-1}$. We call this part of the data the out-of-bag sample, or OOB sample for short. After every base learner is trained, we predict on its OOB samples. The oob_prediction_ value of each sample is the mean of the predictions from all base learners that did not draw that sample for training (the logic is in the sklearn source implementation). Once oob_prediction_ has been obtained for all samples, oob_score_ is computed with r2_score for regression problems and with accuracy_score for classification problems.
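The size of the OOB set can be checked with a short simulation. This is a sketch for a single bootstrap sample; the sample size `n` and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# One bootstrap sample: n draws with replacement from n row indices.
boot_idx = rng.integers(0, n, size=n)

# A row is out-of-bag for this learner if it was never drawn.
in_bag = np.zeros(n, dtype=bool)
in_bag[boot_idx] = True
oob_fraction = 1.0 - in_bag.mean()

exact = (1 - 1 / n) ** n     # probability that a given row is never drawn
limit = np.exp(-1)           # its limit as n grows

print(oob_fraction, exact, limit)
```

The simulated fraction lands close to $e^{-1}$, so roughly a third of the data is available to each learner as a built-in validation set.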

[Think Question] What is the difference between r2_score and mean squared error? What advantages does it have?
r2_score is the coefficient of determination. It reflects the proportion of the total variability of the dependent variable that the regression model explains through the independent variables.
$$R^2 = 1 - \frac{SSE}{SST}$$
That is,
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
Dividing the numerator and the denominator by $n$ simplifies this further to
$$R^2 = 1 - \frac{MSE}{Var(y)}$$
The numerator becomes the commonly used evaluation metric MSE, and the denominator becomes the variance of the target. Intuitively, always predicting the mean serves as the error baseline, and R2_score tells us whether the model's prediction error is larger or smaller than that baseline error. Unlike MSE, which depends on the scale of the target, R2_score is dimensionless, so it can be compared across data sets.
R2_score = 1: the predicted and true values are exactly the same, with no error. The closer R2_score is to 1, the better the independent variables explain the dependent variable.
R2_score = 0: the numerator equals the denominator, so the model is only as good as predicting the mean for every sample.
R2_score is not the square of r and can be negative (numerator > denominator). In that case the model is no better than blind guessing and does worse than simply predicting the mean of the target variable.
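These cases can be reproduced with a few lines of NumPy. The helper `r2` below is a hand-written sketch of the MSE-over-variance form, not sklearn's r2_score, and the toy arrays are made up.

```python
import numpy as np

def r2(y_true, y_pred):
    mse = np.mean((y_true - y_pred) ** 2)   # mean squared error
    var = np.var(y_true)                    # population variance of the target
    return 1.0 - mse / var

y = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r2(y, y)                              # 1.0: no error at all
mean_only = r2(y, np.full_like(y, y.mean()))    # 0.0: same as predicting the mean
worse = r2(y, y[::-1])                          # -3.0: worse than predicting the mean

# Rescaling the target changes the MSE but leaves R^2 unchanged.
pred = y + 0.5
mse_small = np.mean((y - pred) ** 2)            # 0.25
mse_large = np.mean((10 * y - 10 * pred) ** 2)  # 25.0
r2_small = r2(y, pred)                          # 0.8
r2_large = r2(10 * y, 10 * pred)                # 0.8
```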

The method below gives a good representation of the implicit features of samples.

Finally, we introduce the Totally Random Trees Embedding method, which builds a forest-based embedding of each sample from the positions of the leaf nodes it falls into in every decision tree. Specifically, suppose there are four trees with four leaf nodes each, and number the leaves sequentially from 0 to 15. Suppose sample i falls into leaves [0,7,8,14] on the four trees and sample j into leaves [1,7,9,13]. The embedded vectors of the two samples are then [1,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0] and [0,1,0,0,0,0,0,1,0,1,0,0,0,1,0,0]. If sample k falls into leaves [0,6,8,14], its embedded vector is closer to that of sample i and farther from that of sample j. In other words, the more trees on which two samples land in the same leaf node, the closer they are. This method therefore cleverly uses the tree structure to obtain the implicit features of the samples.
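The example with samples i, j, and k can be reproduced directly. The helper `embed` is a hypothetical name introduced for this sketch; it one-hot encodes the globally numbered leaves.

```python
import numpy as np

def embed(leaf_ids, n_leaves_total=16):
    """One-hot encode the globally numbered leaves a sample falls into."""
    vec = np.zeros(n_leaves_total, dtype=int)
    vec[leaf_ids] = 1
    return vec

e_i = embed([0, 7, 8, 14])
e_j = embed([1, 7, 9, 13])
e_k = embed([0, 6, 8, 14])

# Squared Euclidean distance = 2 * (number of trees on which the leaves differ).
d_ki = np.sum((e_k - e_i) ** 2)   # 2: k and i differ on one tree
d_kj = np.sum((e_k - e_j) ** 2)   # 8: k and j differ on all four trees

print(e_i.tolist(), e_j.tolist(), d_ki, d_kj)
```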

[Think Question] If the Minkowski distance is used to measure the distance between two embedded vectors, does the order in which the leaf nodes are numbered affect the result?
(The Minkowski distance is a family of distances that includes the Manhattan, Euclidean, and Chebyshev distances.) No. Renumbering the leaves applies the same permutation to the coordinates of every embedded vector, and the Minkowski distance aggregates the coordinate-wise differences without regard to their order, so the distance is unchanged.
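A small numerical check of this invariance, using random 0/1 vectors and a random renumbering of 16 leaves. The function `minkowski` is written out by hand for the sketch.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two vectors."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=16)
b = rng.integers(0, 2, size=16)

perm = rng.permutation(16)   # a renumbering of the 16 leaves

before = [minkowski(a, b, p) for p in (1, 2, 3)]
after = [minkowski(a[perm], b[perm], p) for p in (1, 2, 3)]

print(before, after)
```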

1.2 Code implementation

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample


class RandomForest:
    def __init__(self, tree_count=10):
        self.tree_count = tree_count
        self.sample_rate = 0.8  # fraction of the training set drawn for each tree

    def fit(self, x, y):
        self._bagging(x, y)

    def predict(self, x):
        return self._vote(x)

    def _bagging(self, x, y):
        # Train every tree on its own bootstrap sample (drawn with replacement).
        clfLis = []
        for i in range(self.tree_count):
            clf = DecisionTreeClassifier()
            bootstrapX, bootstrapy = resample(
                x, y, n_samples=int(self.sample_rate * len(y))
            )
            clf.fit(bootstrapX, bootstrapy)
            clfLis.append(clf)
        self.treeList = clfLis

    def _vote(self, x):
        # Majority vote; np.bincount requires non-negative integer class labels.
        y_predict = np.zeros((x.shape[0], len(self.treeList)), dtype=int)
        for i, clf in enumerate(self.treeList):
            y_predict[:, i] = clf.predict(x)
        return np.array([np.argmax(np.bincount(row)) for row in y_predict])
```

2 Isolation forest

2.1 Principle

Isolation forest is also an ensemble algorithm. It uses trees to detect anomalies in data with continuous features. The basic idea is that if we repeatedly pick a random feature and a random split point to partition the sample points in space, outliers are easily separated from the other samples within the first few splits, whereas normal points need more splits because they lie closer together.
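The idea can be illustrated in one dimension. This is a toy sketch rather than the isolation forest algorithm itself: the data, the position of the outlier, and the helper `splits_to_isolate` are assumptions made for the example.

```python
import numpy as np

def splits_to_isolate(values, target, rng):
    """Count random splits until `target` is alone in its partition."""
    current = values
    n_splits = 0
    while current.size > 1:
        pivot = rng.uniform(current.min(), current.max())
        # Keep only the side of the split that contains the target.
        if target < pivot:
            current = current[current < pivot]
        else:
            current = current[current >= pivot]
        n_splits += 1
    return n_splits

rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 1.0, size=200)        # dense "normal" points
outlier = 10.0                                  # one point far from the cluster
values = np.append(cluster, outlier)
inlier = cluster[np.argmin(np.abs(cluster))]    # the cluster point nearest 0

trials = 500
mean_outlier = np.mean([splits_to_isolate(values, outlier, rng) for _ in range(trials)])
mean_inlier = np.mean([splits_to_isolate(values, inlier, rng) for _ in range(trials)])

print(mean_outlier, mean_inlier)
```

On average the outlier is isolated after far fewer splits than a point in the middle of the cluster, which is exactly the signal the forest exploits.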

The following illustration shows four successive splits with two features. The anomalous points in the upper right corner are clearly isolated on their own.

A tree stops growing if and only if its height (the longest path) reaches a given height limit, or a leaf node contains only one sample, or all samples in a leaf node have identical feature values (i.e., the points coincide in space and cannot be separated).

The given height limit is $\log n$.

The training pseudocode is as follows

In the prediction phase, we compute the length of the path along which each sample is routed to a leaf in every tree.
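The path lengths are then turned into an anomaly score. The sketch below follows the same formulas as the `_c` helper and the `predict` method in the implementation in section 2.2; the function names and the example path lengths are assumptions made for illustration.

```python
import numpy as np

EULER_GAMMA = 0.5772156649

def c(n):
    """Normalizing constant: approximate average path length for n points."""
    if n <= 1:
        return 0.0
    return 2.0 * (np.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def anomaly_score(path_lengths, n):
    """s = 2 ** (-mean(h) / c(n)), where h are the path lengths over all trees."""
    return 2.0 ** (-np.mean(path_lengths) / c(n))

n = 256
short_paths = anomaly_score([2, 3, 2, 3], n)    # isolated early: score above 0.5
typical_paths = anomaly_score([c(n)] * 4, n)    # average path length: score 0.5

print(short_paths, typical_paths)
```

Short paths push the score towards 1 (likely anomaly), while a sample whose average path length equals `c(n)` scores 0.5.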

2.2 Code implementation

(to be supplemented)

```python
import numpy as np
# Only needed for generating demo data and plotting; the classes below use NumPy alone.
from pyod.utils.data import generate_data
import matplotlib.pyplot as plt


class Node:
    def __init__(self, depth):
        self.depth = depth
        self.left = None
        self.right = None
        self.feature = None
        self.pivot = None


class Tree:
    def __init__(self, max_height):
        self.root = Node(0)
        self.max_height = max_height
        self.c = None

    def _build(self, node, X):
        # Stop when at most one sample is left (one side is empty when all
        # values of the chosen feature coincide).
        if X.shape[0] <= 1:
            return
        # Height limit reached: add the expected remaining path length.
        if node.depth + 1 > self.max_height:
            node.depth += self._c(X.shape[0])
            return
        node.feature = np.random.randint(X.shape[1])
        pivot_min = X[:, node.feature].min()
        pivot_max = X[:, node.feature].max()
        node.pivot = np.random.uniform(pivot_min, pivot_max)
        node.left, node.right = Node(node.depth + 1), Node(node.depth + 1)
        self._build(node.left, X[X[:, node.feature] < node.pivot])
        self._build(node.right, X[X[:, node.feature] >= node.pivot])

    def build(self, X):
        self.c = self._c(X.shape[0])
        self._build(self.root, X)

    def _c(self, n):
        # Approximate average path length for n points (0.5772 is Euler's constant).
        if n == 1:
            return 0
        return 2 * ((np.log(n - 1) + 0.5772) - (n - 1) / n)

    def _get_h_score(self, node, x):
        # Walk down to a leaf and return its (adjusted) depth.
        if node.left is None and node.right is None:
            return node.depth
        if x[node.feature] < node.pivot:
            return self._get_h_score(node.left, x)
        return self._get_h_score(node.right, x)

    def get_h_score(self, x):
        return self._get_h_score(self.root, x)


class IsolationForest:
    def __init__(self, n_estimators=100, max_samples=256):
        self.n_estimator = n_estimators
        self.max_samples = max_samples
        self.trees = []

    def fit(self, X):
        for tree_id in range(self.n_estimator):
            # Draw a subsample (with replacement) and build the tree on it.
            random_X = X[np.random.randint(0, X.shape[0], self.max_samples)]
            tree = Tree(np.log(random_X.shape[0]))
            tree.build(random_X)
            self.trees.append(tree)

    def predict(self, X):
        # Anomaly score per sample: 2 ** (-mean normalized path length).
        result = []
        for x in X:
            h = 0
            for tree in self.trees:
                h += tree.get_h_score(x) / tree.c
            score = np.power(2, -h / len(self.trees))
            result.append(score)
        return np.array(result)
```

3 Think Questions

What is the OOB score of a random forest?
OOB stands for out of bag. Because each base learner of a random forest is trained on a bootstrap sample, some samples are never drawn for that learner. The OOB score measures how well the learners predict the samples they did not see during training.
How does a random forest combine multiple decision tree models?
For regression problems, the ensemble output is the mean of the base learners' outputs. For classification problems, you can use either voting or, as in sklearn, probability aggregation.
Describe the principle and procedure of the isolation forest algorithm.
It can be roughly divided into two stages. In the first stage, t isolation trees are trained to form an isolation forest. In the second stage, each sample point is passed through every isolation tree in the forest, its average height is calculated, and the outlier score of the sample point is then computed from it.

Reference reading

In-depth study: regression model evaluation index R2_score