Machine learning
Definition
- Give computers the ability to learn and reason the way people do
- Let a computer learn a particular human skill
Classification
Classification by school
- Symbolism
Logical derivation over symbols, which represent knowledge. The basic unit of human cognition and thinking is the symbol, and cognition is an operation on symbolic representations.
- Connectionism (neuron-like)
Derivation and computation by simulating the human brain (neural networks, neurons).
- Behaviorism (evolutionary)
Simulates human evolution.
Classification by learning style
- Supervised
Labeled samples guide the learning, so right and wrong can be judged (the judgment comes from outside).
- Unsupervised
For example, the way infants acquire skills, clustering, and generative adversarial networks.
- Self-supervised
The model judges the samples itself and then compares what is correct and incorrect (the judgment comes from inside).
- Semi-supervised
Classification by application area
- Signal, radio
- Image
- Speech
- Natural language semantics
- Automation
Learning steps
- End-to-end learning
Input -> model -> output
- Non-end-to-end learning
Input -> feature extractor -> features -> classifier -> output
Learning techniques
- Transfer learning
Use previously acquired knowledge to learn new knowledge.
- Meta-learning
Learn the concept, essence, or principle, then generalize from a single instance.
- Cascade learning
Divide a large task into several small tasks and solve them one by one.
- Incremental learning
Learn from simple to difficult.
- Adversarial learning
Learning through competition.
- Cooperative learning
Learning rounds
- N-shot
Multi-round learning
Large-batch learning
Small-batch learning
- One-shot
Learn from a single pass.
- Zero-shot
Learn by analogy (for example, acquiring English-to-French translation from English-to-Chinese plus Chinese-to-French).
Tasks
- Regression / fitting / function approximation
Regression: use the model to obtain a value for a sample.
Fitting: express many samples by a curve and derive a formula.
Function approximation: the samples can be approximately expressed by a function.
- Classification
- Clustering
Samples with the same characteristics are grouped together; there is no ability to assign class labels (unsupervised).
- Feature extraction / dimensionality reduction / principal component analysis
Feature extraction: once features are extracted, similarity comparison becomes possible.
Dimensionality reduction: the important, easily distinguished features are selected.
Principal component analysis: weights the features.
- Generation / creation
Example: generate the face of a person who does not actually exist (StyleGAN2); generate an article.
- Evaluation and planning
Planning: lay out the steps one by one, then carry them out step by step; example: AI playing Go.
Evaluation: rate quality with a score or similar; example: rating appearance at 90 points.
- Decision-making
After obtaining the data sample, carry out a specific command or action.
Classification by model
- Statistics
Linear regression
Maximum entropy
- Bionics
Evolutionary algorithms
Ant colony
Deep learning also belongs to the field of machine learning; it has simply grown into a system of its own.
Machine learning development process
- Model development process
- Data processing (about 80% of the work)
- Data acquisition
Manual collection, system collection, web crawlers, virtual simulation, adversarial generation, open-source data
- Data annotation
Manual labeling, data labeling software, automatic labeling
- Data cleaning
Remove bad data
- Data augmentation
Augment the data (one sample is expanded into several)
Augmentation tools, OpenCV programming, etc.
- Data preprocessing
Preprocessing can improve the efficiency of model training
Making good use of unlabeled data or automating data labeling is the key to reducing cost
Scientific sampling reflects the distribution of the whole population
Data quality matters more than data quantity; it directly affects the accuracy and generalization ability of the model
Data augmentation increases the diversity and quantity of data, but the effect of any single augmentation is limited
- Model development
- Model design
- Model optimization
- Model training
- Test and evaluation
- Model testing
- Model evaluation
- Evaluation report
- Model deployment
Deploy to various platforms, servers, and software
Data sets
- Definition
Training set (trains the model): the data samples used for model fitting.
Test set (final evaluation of the learning method): used to evaluate the generalization ability of the final model. It must not be used as the basis for choices such as hyperparameter tuning or feature selection.
Validation set (selects the model): a sample set held out during model training. It is used to tune the model's hyperparameters and to make a preliminary assessment of the model's ability.
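The three roles above can be sketched as a simple random split. This is only an illustration: the function name `split_dataset`, the 60/20/20 ratios, and the toy arrays are assumptions, not something the notes prescribe.

```python
import numpy as np

def split_dataset(X, y, val_ratio=0.2, test_ratio=0.2, seed=0):
    """Shuffle once, then cut the indices into test, validation, and training parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(len(X) * test_ratio)
    n_val = int(len(X) * val_ratio)
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]
    return (X[train_idx], y[train_idx]), (X[val_idx], y[val_idx]), (X[test_idx], y[test_idx])

# Toy data: 50 samples with 2 features each (hypothetical)
X = np.arange(100).reshape(50, 2)
y = np.arange(50)
train, val, test = split_dataset(X, y)
print(len(train[0]), len(val[0]), len(test[0]))
```

The test part is set aside first and never touched during tuning; only the validation part is used to compare hyperparameters.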
Common data sets
- Image
MNIST, lfw128, CIFAR-10, CIFAR-100, VOC, ImageNet, COCO
- NLP
Movie reviews, sentiment analysis, poetry generation
Error analysis
- Error is the difference between the output the algorithm actually predicts and the true output of the sample.
- The error of the model on the training set is called the "training error".
- The error of the model on the whole population of samples is called the "generalization error".
- The error of the model on the test set is called the "test error".
- We cannot know the whole population, so we can only minimize the training error as far as possible, which leaves a clear gap between the training error and the generalization error; otherwise generative adversarial networks would not be so hard to train.
- Overfitting means the model fits the training samples well but fits the test samples poorly, so generalization performance drops. To prevent overfitting, you can reduce the number of parameters, reduce model complexity, add regularization, stop training early, use Dropout or max pooling, collect more data, augment the data, reduce the number of iterations, increase the learning rate, and so on.
- Underfitting means the model has not learned the general pattern of the data well, so the degree of fit is low. To prevent underfitting, you can tune the parameters, train for more iterations, switch to a more complex model, and so on.
Overfitting and underfitting can be understood intuitively, and overfitting is the one met more often in practice. In a high-dimensional space, three-dimensional information can fully express two-dimensional information. However, because there are more parameters, changes in some irrelevant parameters do not affect the results on the current samples, and the model ends up generalizing poorly.
Underfitting: too few parameters or weakly relevant features, plus a lot of data. The number of parameters needs to be increased.
Overfitting: too many parameters plus too little data. Reduce the parameters or add data.
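The link between parameter count and fit can be seen in a small experiment. The sketch below is an assumption-laden illustration (noisy sine data, polynomial degrees 1, 3, and 9, 15 training points): it fits polynomials of growing degree and records training and test error for each.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_samples(n):
    """Noisy samples of a sine curve (a made-up target for the demo)."""
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, n)
    return x, y

x_train, y_train = make_samples(15)
x_test, y_test = make_samples(200)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {}
for degree in (1, 3, 9):
    # More coefficients means more parameters to fit the 15 training points
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print('degree %d: train MSE %.4f, test MSE %.4f' % (degree, *errors[degree]))
```

The training error can only go down as the degree grows. The test error typically falls at first (the straight line underfits) and then rises again once the polynomial starts chasing noise, although the exact numbers depend on the random sample.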
Generalization error analysis
Bias is the gap between the model's expected output on a sample and the true label. It measures the accuracy of the model itself and reflects its fitting ability.
Variance is the error between the output of the function learned on different training sets and the expected output. It measures the stability of the model and reflects how much the model fluctuates.
How overfitting and underfitting show up:
Underfitting: high bias and low variance
- Find better features to describe the data more fully
- Increase the number of features
- Choose a more complex model
Overfitting: low bias and high variance
- Increase the number of training samples
- Reduce the feature dimension to ease the sparsity of the high-dimensional space
- Add a regularization term to make the model smoother
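For squared error, bias and variance add up in a standard decomposition. The notation is an assumption introduced here: the label is $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and $\hat f_D$ is the model trained on a data set $D$.

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - \hat f_D(x)\big)^2\right]
= \underbrace{\big(\mathbb{E}_D[\hat f_D(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_D\!\left[\big(\hat f_D(x) - \mathbb{E}_D[\hat f_D(x)]\big)^2\right]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{noise}}
```

Underfitting corresponds to a large first term, overfitting to a large second term; the noise term cannot be reduced by any model.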
Cross validation
Basic idea: divide the training set into K parts; K-1 parts serve as the training set and the remaining part as the validation set. After learning the function on the training set, calculate the error on the validation set.
- K-fold cross validation
- Repeated K-fold: K-fold is repeated several times, with a different partition generated in each repetition
- Leave-one-out
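The K-fold idea can be sketched as an index generator. The helper name `k_fold_indices` and the seed are illustrative assumptions. Leave-one-out is the special case where K equals the number of samples, and repeated K-fold simply calls the generator again with a different seed.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each fold serves as the validation set once."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    folds = np.array_split(order, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(10, 5))
for train_idx, val_idx in splits:
    print(sorted(train_idx.tolist()), sorted(val_idx.tolist()))
```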
Linear regression
Linear regression looks for a linear relationship between sample attributes and labels: it fits a linear model to the training data so that the gap between the model's predicted values and the sample labels is as small as possible.
Summary: y = wx + b
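Written out for $n$ training samples, a common choice of loss is the mean squared error, minimized by gradient descent. The symbols $X$ (sample matrix), $\hat y$ (predictions), and $\eta$ (learning rate) are notation assumed here for illustration.

```latex
\hat y = Xw + b, \qquad
L(w, b) = \frac{1}{n}\sum_{i=1}^{n} (\hat y_i - y_i)^2
\\[4pt]
\frac{\partial L}{\partial w} = \frac{2}{n} X^{\top}(\hat y - y), \qquad
\frac{\partial L}{\partial b} = \frac{2}{n}\sum_{i=1}^{n} (\hat y_i - y_i)
\\[4pt]
w \leftarrow w - \eta \frac{\partial L}{\partial w}, \qquad
b \leftarrow b - \eta \frac{\partial L}{\partial b}
```

Some implementations drop the constant factor 2 in the gradients, which only rescales the learning rate.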
Corresponding code implementation:
import numpy as np
from sklearn.utils import shuffle
from sklearn.datasets import load_diabetes


class lr_model():
    def __init__(self):
        pass

    def prepare_data(self):
        # Load the diabetes data set and append the target as the last column
        data = load_diabetes().data
        target = load_diabetes().target
        X, y = shuffle(data, target, random_state=42)
        X = X.astype(np.float32)
        y = y.reshape((-1, 1))
        data = np.concatenate((X, y), axis=1)
        return data

    def initialize_params(self, dims):
        w = np.zeros((dims, 1))
        b = 0
        return w, b

    def linear_loss(self, X, y, w, b):
        num_train = X.shape[0]
        y_hat = np.dot(X, w) + b
        # Mean squared error; the constant factor 2 of the gradient is dropped
        loss = np.sum((y_hat - y) ** 2) / num_train
        dw = np.dot(X.T, (y_hat - y)) / num_train
        db = np.sum((y_hat - y)) / num_train
        return y_hat, loss, dw, db

    def linear_train(self, X, y, learning_rate, epochs):
        w, b = self.initialize_params(X.shape[1])
        for i in range(1, epochs + 1):
            y_hat, loss, dw, db = self.linear_loss(X, y, w, b)
            # Gradient descent update
            w += -learning_rate * dw
            b += -learning_rate * db
            if i % 10000 == 0:
                print('epoch %d loss %f' % (i, loss))
        params = {'w': w, 'b': b}
        grads = {'dw': dw, 'db': db}
        return loss, params, grads

    def predict(self, X, params):
        w = params['w']
        b = params['b']
        y_pred = np.dot(X, w) + b
        return y_pred

    def linear_cross_validation(self, data, k, randomize=True):
        if randomize:
            # sklearn's shuffle returns a shuffled copy; it does not shuffle in place
            data = shuffle(data)
        data = list(data)
        slices = [data[i::k] for i in range(k)]
        for i in range(k):
            validation = slices[i]
            train = [row for s in slices if s is not validation for row in s]
            train = np.array(train)
            validation = np.array(validation)
            yield train, validation


if __name__ == '__main__':
    lr = lr_model()
    data = lr.prepare_data()
    # Collect the training loss of every fold (must be created outside the loop)
    loss5 = []
    for train, validation in lr.linear_cross_validation(data, 5):
        X_train = train[:, :10]
        y_train = train[:, -1].reshape((-1, 1))
        X_valid = validation[:, :10]
        y_valid = validation[:, -1].reshape((-1, 1))
        loss, params, grads = lr.linear_train(X_train, y_train, 0.001, 100000)
        loss5.append(loss)
        score = np.mean(loss5)
        print('five-fold cross validation score is', score)
        y_pred = lr.predict(X_valid, params)
        valid_score = np.sum(((y_pred - y_valid) ** 2)) / len(X_valid)
        print('valid score is', valid_score)
Logistic regression
- Despite its name, it is actually a classification algorithm
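A compact way to see why it classifies: the linear score is squashed into a probability by the sigmoid function, and the parameters are fitted by minimizing the cross-entropy cost. The symbols $a_i$, $J$, and $n$ are notation assumed here for illustration.

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
a_i = \sigma(w^{\top} x_i + b)
\\[4pt]
J(w, b) = -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i \log a_i + (1 - y_i)\log(1 - a_i) \Big]
\\[4pt]
\frac{\partial J}{\partial w} = \frac{1}{n} X^{\top}(a - y), \qquad
\frac{\partial J}{\partial b} = \frac{1}{n}\sum_{i=1}^{n} (a_i - y_i)
```

A sample is assigned to class 1 when $a_i > 0.5$, which is the same as $w^{\top} x_i + b > 0$, so the decision boundary is linear.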
A simple implementation:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification


class logistic_regression():
    def __init__(self):
        pass

    def sigmoid(self, x):
        z = 1 / (1 + np.exp(-x))
        return z

    def initialize_params(self, dims):
        W = np.zeros((dims, 1))
        b = 0
        return W, b

    def logistic(self, X, y, W, b):
        num_train = X.shape[0]
        # Forward pass: predicted probability of class 1
        a = self.sigmoid(np.dot(X, W) + b)
        # Cross-entropy cost and its gradients
        cost = -1 / num_train * np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))
        dW = np.dot(X.T, (a - y)) / num_train
        db = np.sum(a - y) / num_train
        cost = np.squeeze(cost)
        return a, cost, dW, db

    def logistic_train(self, X, y, learning_rate, epochs):
        W, b = self.initialize_params(X.shape[1])
        cost_list = []
        for i in range(epochs):
            a, cost, dW, db = self.logistic(X, y, W, b)
            W = W - learning_rate * dW
            b = b - learning_rate * db
            if i % 100 == 0:
                cost_list.append(cost)
                print('epoch %d cost %f' % (i, cost))
        params = {'W': W, 'b': b}
        grads = {'dW': dW, 'db': db}
        return cost_list, params, grads

    def predict(self, X, params):
        y_prediction = self.sigmoid(np.dot(X, params['W']) + params['b'])
        # Threshold the probabilities at 0.5
        return (y_prediction > 0.5).astype(float)

    def accuracy(self, y_test, y_pred):
        # Fraction of positions where the prediction equals the label
        return float(np.mean(y_test == y_pred))

    def create_data(self):
        X, labels = make_classification(n_samples=100, n_features=2, n_redundant=0,
                                        n_informative=2, random_state=1,
                                        n_clusters_per_class=2)
        labels = labels.reshape((-1, 1))
        # First 90% of the samples for training, the rest for testing
        offset = int(X.shape[0] * 0.9)
        X_train, y_train = X[:offset], labels[:offset]
        X_test, y_test = X[offset:], labels[offset:]
        return X_train, y_train, X_test, y_test

    def plot_logistic(self, X_train, y_train, params):
        n = X_train.shape[0]
        xcord1 = []
        ycord1 = []
        xcord2 = []
        ycord2 = []
        for i in range(n):
            if y_train[i] == 1:
                xcord1.append(X_train[i][0])
                ycord1.append(X_train[i][1])
            else:
                xcord2.append(X_train[i][0])
                ycord2.append(X_train[i][1])
        fig = plt.figure()
        ax = fig.add_subplot(111)
        ax.scatter(xcord1, ycord1, s=32, c='red')
        ax.scatter(xcord2, ycord2, s=32, c='green')
        # Decision boundary: W[0] * x1 + W[1] * x2 + b = 0
        x = np.arange(-1.5, 3, 0.1)
        y = (-params['b'] - params['W'][0] * x) / params['W'][1]
        ax.plot(x, y)
        plt.xlabel('X1')
        plt.ylabel('X2')
        plt.show()


if __name__ == "__main__":
    model = logistic_regression()
    X_train, y_train, X_test, y_test = model.create_data()
    print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
    cost_list, params, grads = model.logistic_train(X_train, y_train, 0.01, 1000)
    print(params)
    y_train_pred = model.predict(X_train, params)
    accuracy_score_train = model.accuracy(y_train, y_train_pred)
    print('train accuracy is:', accuracy_score_train)
    y_test_pred = model.predict(X_test, params)
    accuracy_score_test = model.accuracy(y_test, y_test_pred)
    print('test accuracy is:', accuracy_score_test)
    model.plot_logistic(X_train, y_train, params)
Support vector machine
The support vector machine (SVM) is one of the most influential methods in supervised learning. It is a model based on a linear discriminant function.
Basic idea of SVM: for linearly separable data, many hyperplanes can separate the training samples, so we look for the hyperplane that lies midway between the two classes of training samples, that is, the one that maximizes the margin. Intuitively, this separation tolerates local perturbations of the training samples best, and in practice it also performs well.
One discriminant surface and two sets of support vectors: maximize the margin between the discriminant surface and the support vectors.
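Margin maximization is usually written as the hard-margin optimization problem below. The notation is assumed for illustration: labels $y_i \in \{-1, +1\}$, weight vector $w$, offset $b$.

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad
y_i\,(w^{\top} x_i + b) \ge 1, \qquad i = 1, \dots, n
```

The two boundary hyperplanes $w^{\top}x + b = \pm 1$ pass through the support vectors, and the distance between them is $2 / \lVert w \rVert$, so minimizing $\lVert w \rVert$ maximizes the margin.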
If the data cannot be separated linearly, add a kernel function.
Multi-class data sets or complex problems require mapping the data into a higher-dimensional space.
Common kernel functions:
- Linear kernel
- Polynomial kernel
- Gaussian kernel (common)
- Laplace kernel
- Sigmoid kernel
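The kernels listed above can be written down in a few lines each. This is a sketch using common textbook forms; the parameter names and default values (`degree`, `coef0`, `sigma`, `beta`, `theta`) are assumptions, and libraries often parameterize these kernels differently.

```python
import numpy as np

def linear_kernel(x, z):
    return float(np.dot(x, z))

def polynomial_kernel(x, z, degree=3, coef0=1.0):
    return float((np.dot(x, z) + coef0) ** degree)

def gaussian_kernel(x, z, sigma=1.0):
    # exp(-||x - z||^2 / (2 * sigma^2))
    return float(np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2)))

def laplace_kernel(x, z, sigma=1.0):
    # exp(-||x - z|| / sigma)
    return float(np.exp(-np.linalg.norm(x - z) / sigma))

def sigmoid_kernel(x, z, beta=1.0, theta=-1.0):
    return float(np.tanh(beta * np.dot(x, z) + theta))

x = np.array([1.0, 2.0])
z = np.array([2.0, 0.0])
for kernel in (linear_kernel, polynomial_kernel, gaussian_kernel,
               laplace_kernel, sigmoid_kernel):
    print(kernel.__name__, kernel(x, z))
```

Each function returns the inner product of the two points in some implicit feature space, which is what lets a linear SVM draw a non-linear boundary in the original space.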
The code implementation is as follows:
import numpy as np
import matplotlib.pyplot as plt


class Hard_Margin_SVM:
    def __init__(self, visualization=True):
        self.visualization = visualization
        self.colors = {1: 'r', -1: 'g'}
        if self.visualization:
            self.fig = plt.figure()
            self.ax = self.fig.add_subplot(1, 1, 1)

    # Define training function
    def train(self, data):
        self.data = data
        # Parameter dictionary { ||w|| : [w, b] }
        opt_dict = {}
        # Sign patterns applied to w so that all four quadrants are searched
        transforms = [[1, 1], [-1, 1], [-1, -1], [1, -1]]
        # Get all feature values from the dictionary
        all_data = []
        for yi in self.data:
            for featureset in self.data[yi]:
                for feature in featureset:
                    all_data.append(feature)
        # Get data max and min
        self.max_feature_value = max(all_data)
        self.min_feature_value = min(all_data)
        all_data = None
        # Define a list of step sizes, from coarse to fine
        step_sizes = [self.max_feature_value * 0.1,
                      self.max_feature_value * 0.01,
                      self.max_feature_value * 0.001]
        # Range setting of parameter b
        b_range_multiple = 2
        b_multiple = 5
        latest_optimum = self.max_feature_value * 10
        # Search with each step size in turn
        for step in step_sizes:
            w = np.array([latest_optimum, latest_optimum])
            # The problem is convex, so stepping down ||w|| finds the optimum
            optimized = False
            while not optimized:
                for b in np.arange(-1 * (self.max_feature_value * b_range_multiple),
                                   self.max_feature_value * b_range_multiple,
                                   step * b_multiple):
                    for transformation in transforms:
                        w_t = w * transformation
                        found_option = True
                        # Every sample must satisfy yi * (w.xi + b) >= 1
                        for i in self.data:
                            for xi in self.data[i]:
                                yi = i
                                if not yi * (np.dot(w_t, xi) + b) >= 1:
                                    found_option = False
                        if found_option:
                            opt_dict[np.linalg.norm(w_t)] = [w_t, b]
                if w[0] < 0:
                    optimized = True
                    print('Optimized a step!')
                else:
                    w = w - step
            # Pick the feasible solution with the smallest ||w||
            norms = sorted([n for n in opt_dict])
            opt_choice = opt_dict[norms[0]]
            self.w = opt_choice[0]
            self.b = opt_choice[1]
            latest_optimum = opt_choice[0][0] + step * 2
        for i in self.data:
            for xi in self.data[i]:
                yi = i
                print(xi, ':', yi * (np.dot(self.w, xi) + self.b))

    # Define prediction function
    def predict(self, features):
        # sign( x.w+b )
        classification = np.sign(np.dot(np.array(features), self.w) + self.b)
        if classification != 0 and self.visualization:
            self.ax.scatter(features[0], features[1], s=200, marker='^',
                            c=self.colors[int(classification)])
        return classification

    # Define result drawing function
    def visualize(self):
        # Plot the training data stored by train()
        for i in self.data:
            for x in self.data[i]:
                self.ax.scatter(x[0], x[1], s=100, color=self.colors[i])

        # hyperplane: x.w + b = v
        # positive support vector: v = 1, negative: v = -1, decision boundary: v = 0
        def hyperplane(x, w, b, v):
            return (-w[0] * x - b + v) / w[1]

        datarange = (self.min_feature_value * 0.9, self.max_feature_value * 1.1)
        hyp_x_min = datarange[0]
        hyp_x_max = datarange[1]
        # (w.x+b) = 1, positive support vector
        psv1 = hyperplane(hyp_x_min, self.w, self.b, 1)
        psv2 = hyperplane(hyp_x_max, self.w, self.b, 1)
        self.ax.plot([hyp_x_min, hyp_x_max], [psv1, psv2], 'k')
        # (w.x+b) = -1, negative support vector
        nsv1 = hyperplane(hyp_x_min, self.w, self.b, -1)
        nsv2 = hyperplane(hyp_x_max, self.w, self.b, -1)
        self.ax.plot([hyp_x_min, hyp_x_max], [nsv1, nsv2], 'k')
        # (w.x+b) = 0, separating hyperplane
        db1 = hyperplane(hyp_x_min, self.w, self.b, 0)
        db2 = hyperplane(hyp_x_max, self.w, self.b, 0)
        self.ax.plot([hyp_x_min, hyp_x_max], [db1, db2], 'y--')
        plt.show()


data_dict = {-1: np.array([[1, 7], [2, 8], [3, 8]]),
             1: np.array([[5, 1], [6, -1], [7, 3]])}

if __name__ == '__main__':
    svm = Hard_Margin_SVM()
    svm.train(data=data_dict)
    predict_us = [[0, 10], [1, 3], [3, 4], [3, 5], [5, 5],
                  [5, 6], [6, -5], [5, 8], [2, 5], [8, -3]]
    for p in predict_us:
        svm.predict(p)
    svm.visualize()
Decision tree
This mainly involves entropy and related concepts, which are worth reviewing:
Entropy, conditional entropy, information gain, Gini index.
A decision tree is a machine learning method that makes decisions based on a tree structure, which is a very natural way for people to reason when facing a decision.
In such a tree, each leaf node gives a class label and each internal node represents an attribute.
For example, when a bank decides whether to lend to a customer, it usually makes a series of decisions. The bank first asks: is the customer's credit record good? If so, it asks whether the customer has a stable job. If not, the application may be rejected directly, or the bank may check whether the customer has collateral. This thinking process is the process of generating a decision tree.
When generating a decision tree, the most important factor is the choice of the root node, that is, which feature is selected as the decision factor. The ID3 algorithm uses information gain as the criterion.
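The four quantities can be computed directly from label counts. The sketch below uses a made-up four-customer version of the bank example; the lists `credit`, `job`, and `approved` are hypothetical data, not taken from the notes.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def conditional_entropy(feature, labels):
    """Entropy of the labels that remains after splitting on the feature."""
    n = len(labels)
    total = 0.0
    for value in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == value]
        total += len(subset) / n * entropy(subset)
    return total

def information_gain(feature, labels):
    return entropy(labels) - conditional_entropy(feature, labels)

# Hypothetical loan records
credit = ["good", "good", "bad", "bad"]
job = ["stable", "unstable", "stable", "unstable"]
approved = ["yes", "yes", "no", "no"]

print(entropy(approved), gini(approved))
print(information_gain(credit, approved), information_gain(job, approved))
```

In this toy data the credit record separates the two outcomes perfectly, so it has the larger information gain and an ID3-style tree would pick it as the root.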
import numpy as np
import pandas as pd
from math import log

df = pd.read_csv('./example_data.csv')


def entropy(ele):
    '''
    function: Calculating entropy value.
    input: A list containing categorical values.
    output: Entropy value.
    entropy = - sum(p * log(p)), p is a prob value.
    '''
    # Calculating the probability distribution of list value
    probs = [ele.count(i) / len(ele) for i in set(ele)]
    # Calculating entropy value
    entropy = -sum([prob * log(prob, 2) for prob in probs])
    return entropy


def split_dataframe(data, col):
    '''
    function: split pandas dataframe to sub-df based on data and column.
    input: dataframe, column name.
    output: a dict of split dataframes.
    '''
    # unique value of column
    unique_values = data[col].unique()
    # empty dict of dataframe
    result_dict = {elem: pd.DataFrame for elem in unique_values}
    # split dataframe based on column value
    for key in result_dict.keys():
        result_dict[key] = data[:][data[col] == key]
    return result_dict


def choose_best_col(df, label):
    '''
    NOTE: the original listing repeated split_dataframe here and never defined
    choose_best_col, which ID3Tree calls below. This version is a reconstruction
    (an assumption): it picks the column with the largest information gain.
    output: max information gain, best column name, split dataframe dict.
    '''
    entropy_D = entropy(df[label].tolist())
    cols = [col for col in df.columns if col != label]
    max_value, best_col, max_splited = -999, None, None
    for col in cols:
        splited_set = split_dataframe(df, col)
        # Conditional entropy of the label given this column
        entropy_DA = 0
        for subset_col, subset in splited_set.items():
            entropy_DA += len(subset) / len(df) * entropy(subset[label].tolist())
        info_gain = entropy_D - entropy_DA
        if info_gain > max_value:
            max_value, best_col, max_splited = info_gain, col, splited_set
    return max_value, best_col, max_splited


class ID3Tree:
    # define a Node class
    class Node:
        def __init__(self, name):
            self.name = name
            self.connections = {}

        def connect(self, label, node):
            self.connections[label] = node

    def __init__(self, data, label):
        self.columns = data.columns
        self.data = data
        self.label = label
        self.root = self.Node("Root")

    # print tree method
    def print_tree(self, node, tabs):
        print(tabs + str(node.name))
        for connection, child_node in node.connections.items():
            print(tabs + "\t" + "(" + str(connection) + ")")
            self.print_tree(child_node, tabs + "\t\t")

    def construct_tree(self):
        self.construct(self.root, "", self.data, self.columns)

    # construct tree
    def construct(self, parent_node, parent_connection_label, input_data, columns):
        max_value, best_col, max_splited = choose_best_col(input_data[columns], self.label)
        # No column left to split on: create a leaf holding the class label
        if not best_col:
            node = self.Node(input_data[self.label].iloc[0])
            parent_node.connect(parent_connection_label, node)
            return
        node = self.Node(best_col)
        parent_node.connect(parent_connection_label, node)
        new_columns = [col for col in columns if col != best_col]
        # Recursively constructing decision trees
        for splited_value, splited_data in max_splited.items():
            self.construct(node, splited_value, splited_data, new_columns)


from sklearn.datasets import load_iris
from sklearn import tree
import graphviz

iris = load_iris()
# criterion='entropy' makes scikit-learn's tree (an optimized CART) split by
# information gain, the same criterion that ID3 uses
clf = tree.DecisionTreeClassifier(criterion='entropy', splitter='best')
clf = clf.fit(iris.data, iris.target)
dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
Random forest
Decision stump algorithm
Unsupervised learning
The data set has no label information (self-study).
- Clustering: unsupervised learning can estimate how strongly samples are related. Strongly related samples are placed in the same category and weakly related samples in different categories; this is "clustering".
- Dimensionality reduction: unsupervised learning can also transform high-dimensional data that is expensive to compute with into low-dimensional data that is easy to process and loses little or no information; this is "dimensionality reduction".
Clustering
The purpose of clustering is to divide the data into several categories. Objects (entities) within the same category are highly similar, and objects in different categories differ greatly.
A set of samples without category labels is grouped according to the similarity between samples: similar samples go into one category and dissimilar samples into other categories. This is called cluster analysis, also known as unsupervised classification.
Common methods include K-Means clustering, mean shift clustering, density-based clustering, and so on.
K-Means
1) Select K objects in the data space as the initial centers; each object represents one cluster center.
2) For each data object in the sample, compute its Euclidean distance to the cluster centers and assign it, by the nearest-distance criterion, to the class of the nearest (most similar) cluster center.
3) Update the cluster centers: take the mean of all objects in each class as the new cluster center of that class, and compute the value of the objective function.
4) Check whether the cluster centers and the value of the objective function have changed. If not, output the result; if so, return to 2).
Note:
The clustering result is affected by the randomly chosen initial centers and by the distribution of the data.
Walk-through: generate two center points at random positions in a pile of data points, compute the distance from each data point to the two centers, and assign each point to the nearer center -> compute the mean of all points in each class and move the center to that mean, then recompute the distances to the new centers and reassign each point to the nearer one -> repeat.
K-Means++ is an initialization scheme for K-Means: it favors initial centers that are far apart from each other, which is more effective.
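The K-Means loop described above can be sketched in NumPy as follows. This is a minimal version under stated assumptions: initial centers are drawn at random from the data (plain K-Means, not K-Means++), and the function name, seeds, and the two synthetic blobs are illustrative.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Pick k distinct data points as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of the points assigned to it
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Two well-separated synthetic blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(5.0, 0.3, (20, 2))])
centers, labels = kmeans(X, k=2)
print(centers)
```

With blobs this far apart the result does not depend much on the initial centers; on overlapping data, different seeds can give different clusterings, which is the sensitivity noted above.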
Mean shift (no need to specify the number of clusters)
Pick a center point and, taking it as the center of a circle, compute the average of the vectors from this point to all points inside the circle. If the points in the circle are dense and evenly distributed, so that the average vector is close to zero, stop. Otherwise, move the center of the circle along the average vector to a new position and repeat.
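The shifting step for a single starting point can be sketched as below. This is a simplified flat-window version; the function name, the radius, and the synthetic blobs are assumptions. A full mean shift would run this from many starting points and merge the end points that coincide into clusters.

```python
import numpy as np

def mean_shift_point(X, start, radius, n_iter=100, tol=1e-6):
    """Move one circle center until the mean of the points inside stops changing."""
    center = np.asarray(start, dtype=float)
    for _ in range(n_iter):
        in_circle = X[np.linalg.norm(X - center, axis=1) <= radius]
        if len(in_circle) == 0:
            break
        # Moving to the mean equals shifting by the average vector to the points
        new_center = in_circle.mean(axis=0)
        if np.linalg.norm(new_center - center) < tol:
            break
        center = new_center
    return center

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (30, 2)), rng.normal(5.0, 0.3, (30, 2))])
mode_a = mean_shift_point(X, X[0], radius=1.5)
mode_b = mean_shift_point(X, X[-1], radius=1.5)
print(mode_a, mode_b)
```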
Hierarchical clustering
Pay attention to how far the clustering goes: avoid over-merging until everything collapses into a single class.
- Shortest distance method, longest distance method, middle distance method
- Shortest: works from small to large, merging small clusters into larger ones
- Longest: starts from the whole and peels it apart layer by layer
- Middle: uses the middle distance as the dividing line for clustering
- Clustering with a tree can easily go too far.
Density clustering
In the density clustering diagram, MinPts = 3 (outliers may appear).
Neighborhood: for example, with x1 as the center of a circle, the area of the circle is its neighborhood.
Directly density-reachable: the point must lie inside the neighborhood.
Density-reachable: if x2 is directly density-reachable from x1 and x3 is directly density-reachable from x2, then x3 is density-reachable from x1.
Density-connected: x3 and x4 are density-connected.
- You do not need to specify k
- The circle radius and the minimum number of points in a circle must be specified
- Computation is slow
- Prone to producing outliers
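These definitions translate into a short DBSCAN-style procedure. The sketch below is a simplified version under stated assumptions: `eps` is the circle radius, `min_pts` counts the point itself, the label -1 marks outliers, and the eight sample points are made up.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Neighborhood of each point: all points within eps (including itself)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)            # -1 means outlier or not yet assigned
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_pts:
            continue                   # not a core point; may still be reached later
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster    # density-reachable from the seed point
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_pts:
                    queue.extend(neighbors[j])   # j is a core point: keep expanding
        cluster += 1
    return labels

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10],
              [50, 50]], dtype=float)
labels = dbscan(X, eps=1.5, min_pts=3)
print(labels)
```

The number of clusters falls out of the data rather than being specified, and the isolated point is left as an outlier. The full pairwise distance matrix is also why this naive version is slow on large data sets.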
Gaussian mixture clustering
Gaussian mixture: fit the distributions first, then cluster.
After the two Gaussian distributions are fitted, each data point is compared under the two distributions and assigned to the one that fits it better.
EM algorithm (expectation-maximization algorithm)
- To find P(A, B): fix the data B and compute P(A|B), then fix the resulting A and compute P(B|A)
- Continue with P(A|B) again, and iterate until the values converge
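The alternation can be made concrete with a two-component, one-dimensional Gaussian mixture. In this sketch the E-step fixes the parameters and computes each point's membership probabilities, and the M-step fixes the memberships and re-estimates the parameters. The crude initialization, the iteration count, and the synthetic data are assumptions.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    # Crude initialization (an assumption): means at the data minimum and maximum
    mu = np.array([x.min(), x.max()], dtype=float)
    sigma = np.array([x.std(), x.std()], dtype=float)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: probability that each point belongs to each Gaussian
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and standard deviations
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        pi = nk / len(x)
    return pi, mu, sigma, resp

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])
pi, mu, sigma, resp = em_gmm_1d(x)
labels = resp.argmax(axis=1)   # cluster = the Gaussian under which the point fits better
print(pi, mu, sigma)
```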
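AP (affinity propagation) clustering
Responsibility (attraction information): a matrix, built from the similarities, that expresses how well suited each point is to serve as the cluster center of another point.
Availability (attribution information): a matrix that expresses how appropriate it is for each point to choose another point as its cluster center, compared against the other candidates.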
Dimensionality reduction
The purpose of dimensionality reduction is to reduce the dimension of the original sample data to a smaller number while minimizing the loss of information contained in the samples, or the error introduced when the data is restored. Principal component analysis is one example.
- Data is easier to process and use in low dimensions;
- Relevant features, especially important ones, show up clearly in the data;
- With only two or three dimensions, the data can be visualized;
- Data noise is removed and algorithm overhead is reduced.
PCA: principal component analysis (unsupervised)
- The features are decomposed (the basis is rotated), and components with little or no influence can be removed.
Compute the covariance matrix of the features, diagonalize it, and then discard the directions whose eigenvalues are small.
Sometimes the features are not completely independent and influence each other; dimensionality reduction is needed to remove these relationships.
Goal of the reduction: the larger the variance within a feature, the better; the smaller the covariance between different features, the better.
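The covariance-and-diagonalization recipe can be sketched with NumPy. The function name `pca` and the synthetic three-feature data, in which the second feature is almost a multiple of the first, are assumptions for illustration.

```python
import numpy as np

def pca(X, n_components):
    # Center the data, then build the covariance matrix of the features
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    # Diagonalize: eigh returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    # Keep the directions with the largest eigenvalues, drop the rest
    components = eigvecs[:, order[:n_components]]
    explained = eigvals[order[:n_components]] / eigvals.sum()
    return X_centered @ components, components, explained

rng = np.random.default_rng(0)
t = rng.normal(0.0, 3.0, 200)
X = np.column_stack([t, 2 * t + rng.normal(0.0, 0.1, 200), rng.normal(0.0, 0.1, 200)])
Z, components, explained = pca(X, n_components=2)
print(explained)
print(np.cov(Z, rowvar=False))
```

After the projection the new features are uncorrelated (the off-diagonal covariance is numerically zero), and almost all of the variance sits in the first component because the first two original features were strongly related.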
LDA: linear discriminant analysis, which can be used for supervised dimensionality reduction and for classification
- In essence, it finds a discriminant surface in a (k-1)-dimensional space.
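For two classes (k = 2) the (k-1)-dimensional space is a single line. The sketch below uses Fisher's two-class formulation, which is one common way to compute that direction; the function name and the synthetic classes are assumptions.

```python
import numpy as np

def fisher_lda_direction(X0, X1):
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter matrix: spread of each class around its own mean
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    # Direction that separates the class means relative to the within-class spread
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X0 = rng.normal([0.0, 0.0], 0.5, (50, 2))
X1 = rng.normal([3.0, 3.0], 0.5, (50, 2))
w = fisher_lda_direction(X0, X1)

# Project both classes onto the line and classify with a midpoint threshold
z0, z1 = X0 @ w, X1 @ w
threshold = (z0.mean() + z1.mean()) / 2
print(w, threshold)
```

The projection reduces each sample from two dimensions to one using the labels, which is what makes LDA supervised, in contrast to PCA.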