DataWhale -- Summary of common machine learning algorithms


machine learning


  • Give computers the same ability to learn and think that people have
  • Let computers learn specific human skills


Classification by school of thought

  • Symbolism
    Logical derivation over symbols, which represent knowledge (the basic unit of human cognition and thinking is the symbol, and cognition is an operation on symbolic representations)
  • Connectionism (modeled on neurons)
    Derivation and computation by simulating the human brain (neural networks, neurons)
  • Behaviorism (evolutionary)
    Simulates human evolution

Classification by learning style

  • Supervised
    Labeled samples guide the learning, so right and wrong can be judged (the signal is external)
  • Unsupervised
    For example, how infants acquire skills, clustering, and generative adversarial networks
  • Self-supervised
    The model derives its own labels from the samples and then checks its predictions against them (the signal is internal)
  • Semi-supervised

Classification by application area

  • Signals, radio
  • Images
  • Speech
  • Natural language
  • Automation

Learning steps

  • End-to-end learning
    Input -> model -> output
  • Non-end-to-end learning
    Input -> feature extractor -> features -> classifier -> output

Learning techniques

  • Transfer learning
    Use previously acquired knowledge to learn new knowledge
  • Meta learning
    Learn the concept, essence, or principle, then generalize from a single instance
  • Cascade learning
    A large task is divided into several small tasks that are solved one by one
  • Incremental learning
    Learning progresses from simple to difficult
  • Adversarial learning
    Learning through competition
  • Cooperative learning

Learning rounds

  • N-shot
    Multi-round learning
    Large-batch learning
    Small-batch learning
    One-shot learning
    Learning by analogy (for example, learning English-to-French translation from English-to-Chinese plus Chinese-to-French)


  • Regression / fitting / function approximation
    Regression: use the model to obtain a value for the sample
    Fitting: several samples are described by a curve, from which a formula is derived
    Function approximation: the samples can be approximately expressed by a function
  • Classification
  • Clustering
    Samples with similar characteristics are grouped together; no class labels are involved (unsupervised)
  • Feature extraction / dimensionality reduction / principal component analysis
    Feature extraction: once features are extracted, samples can be compared for similarity
    Dimensionality reduction: the important features are selected so that samples are easier to tell apart
    Principal component analysis: features are weighted
  • Generation and creation
    Examples: generating the face of a person who does not exist (StyleGAN2); generating an article
  • Evaluation and planning
    Planning: lay out the steps one by one, then carry them out in order; example: AI playing Go
    Evaluation: judge quality with a score or similar measure; example: rating appearance as 90 points
  • Policy decision
    After obtaining the data samples, carry out a specific command

Classification by model

  • Statistics
    Linear regression
    Maximum entropy
  • Bionics
    Evolutionary algorithms
    Ant colony algorithms

Deep learning is also part of machine learning; it has simply grown into a system of its own.

Machine learning development process

  • Model development process
    • Data processing (about 80% of the work)
      • Data acquisition
        Manual collection, system collection, web crawlers, virtual simulation, adversarial generation, open-source data
      • Data annotation
        Manual labeling, data labeling software, automatic labeling
      • Data cleaning
        Remove bad data
      • Data augmentation
        Augment the data (one sample is expanded into several)
        Augmentation tools, OpenCV programming, etc.
      • Data preprocessing
        Preprocessing can improve the efficiency of model training
        Making good use of unlabeled data, or automating data labeling, is the key to reducing cost
        Scientific sampling reflects the distribution of the whole population
        Data quality matters more than data quantity; it directly affects the accuracy and generalization ability of the model
        Data augmentation can increase the diversity and quantity of data, but the effect of any single augmentation is limited
  • Model development
    • Model design
    • Model optimization
    • Model training
  • Test evaluation
    • Model test
    • Model evaluation
    • Assessment report
  • Model deployment
    Deploy to various platforms, servers, and software

Data sets

  • Definition
    Training set (trains the model): the data samples used to fit the model.
    Test set (final evaluation of the learning method): used to evaluate the generalization ability of the final model. It must not be used as the basis for choices such as hyperparameter tuning or feature selection.
    Validation set (selects the model): a sample set held out during model training. It is used to tune the model's hyperparameters and to give a preliminary estimate of the model's ability.
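
As a rough illustration of the three roles, the sketch below shuffles a list of samples once and cuts it into disjoint training, validation, and test parts. The function name, the 60/20/20 proportions, and the seed are assumptions made for this example, not something the definitions above prescribe.

```python
import random

def train_val_test_split(samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then cut the list into three disjoint parts."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]                    # touched only for the final evaluation
    val = items[n_test:n_test + n_val]       # used for hyperparameter tuning
    train = items[n_test + n_val:]           # used to fit the model
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

Keeping the test part out of every tuning decision is what lets it stand in for unseen data.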

Common data sets

  • Images
  • NLP
    • Movie reviews, sentiment analysis, poetry generation

Error analysis

  • Error is the difference between the output the algorithm actually predicts and the true output of the sample.
    • The error of the model on the training set is called the "training error"
    • The error of the model on the whole population is called the "generalization error"
    • The error of the model on the test set is called the "test error"
  • The whole population is never available, so in practice we can only minimize the training error. This leaves a clear gap between training error and generalization error; otherwise generative adversarial networks would not be so hard to train.
  • Overfitting means the model fits the training samples well but fits the test samples poorly, so generalization performance drops. To prevent overfitting, you can reduce the number of parameters, reduce model complexity, add regularization, stop training early, use Dropout, use max pooling, collect more data, augment the data, reduce the number of iterations, increase the learning rate, and so on.
  • Underfitting means the model has not learned the general pattern in the data, so the degree of fit is low. To prevent underfitting, you can tune the parameters, train for more iterations, switch to a more complex model, and so on.
    Both overfitting and underfitting are normal, but overfitting is the one met most often in practice. A higher-dimensional representation can fully express lower-dimensional information (three dimensions can express everything two dimensions can), but the extra parameters are not pinned down by the training samples: changing them does not alter the result on the current samples, so the model ends up generalizing poorly.
    Underfitting: too few parameters or weakly relevant features, together with a lot of data.
    The number of parameters needs to be increased.
    Overfitting: too many parameters together with too little data. Reduce the parameters or add data.

Generalization error analysis

Bias is the gap between the model's expected output on a sample and the true label. It measures the accuracy of the model itself, that is, its ability to fit.
Variance is the spread between the outputs of the functions learned from different training sets and the expected output. It measures the stability of the model, that is, how much it fluctuates.
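
Under squared-error loss, these two quantities appear together in the standard decomposition of the expected generalization error. The notation here is assumed for illustration: f is the true function, \hat f_D is the model learned from training set D, and the label is y = f(x) + \varepsilon with zero-mean noise of variance \sigma^2.

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - \hat f_D(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}_D[\hat f_D(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\big(\hat f_D(x) - \mathbb{E}_D[\hat f_D(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```

The noise term cannot be reduced by any model, so lowering generalization error comes down to trading bias against variance.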

How overfitting and underfitting show up in these terms:

Underfitting: high bias and low variance

  • Find better features to describe the data more fully
  • Increase the number of features
  • Choose a more complex model

Overfitting: low bias and high variance

  • Increase the number of training samples
  • Reduce the feature dimension so that the high-dimensional space is less sparsely populated
  • Add a regularization term to make the model smoother
Cross validation

Basic idea: divide the training set into K parts. K-1 parts are used for training and the remaining part for validation. After the function is learned on the training parts, its error is computed on the validation part. Rotating the validation part through all K parts gives K-fold cross validation.

  • Repeated K-fold: K-fold is run several times, with a different partition generated in each repetition
  • Leave-one-out
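
The partitioning itself is easy to sketch. The helper below, whose name is made up for this example, yields the training and validation indices for each fold. It uses the same strided slicing as the cross-validation method in the linear regression code further down, without shuffling.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    indices = list(range(n_samples))
    folds = [indices[i::k] for i in range(k)]        # k roughly equal parts
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

splits = list(k_fold_indices(10, 5))
for train_idx, val_idx in splits:
    print('validate on', val_idx, 'train on', train_idx)

# Leave-one-out is the special case k == n_samples.
loo = list(k_fold_indices(4, 4))
```

The cross-validation score is then the average of the K validation errors.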

Linear regression

Linear regression looks for a linear relationship between sample attributes and labels: a linear model is fit to the training data so as to minimize the gap between the model's predictions and the sample labels.

In short: y = w x + b
The corresponding code implementation:

import numpy as np
from sklearn.utils import shuffle
from sklearn.datasets import load_diabetes

class lr_model():
    def __init__(self):
        pass

    def prepare_data(self):
        data = load_diabetes().data
        target = load_diabetes().target
        X, y = shuffle(data, target, random_state=42)
        X = X.astype(np.float32)
        y = y.reshape((-1, 1))
        data = np.concatenate((X, y), axis=1)        
        return data   
    def initialize_params(self, dims):
        w = np.zeros((dims, 1))
        b = 0
        return w, b    
    def linear_loss(self, X, y, w, b):
        num_train = X.shape[0]
        num_feature = X.shape[1]

        # Prediction, mean squared error, and gradient with respect to w
        y_hat = np.dot(X, w) + b
        loss = np.sum((y_hat-y)**2) / num_train
        dw = np.dot(X.T, (y_hat - y)) / num_train
        db = np.sum((y_hat - y)) / num_train        
        return y_hat, loss, dw, db    
    def linear_train(self, X, y, learning_rate, epochs):
        w, b = self.initialize_params(X.shape[1])        
        for i in range(1, epochs):
            y_hat, loss, dw, db = self.linear_loss(X, y, w, b)
            w += -learning_rate * dw
            b += -learning_rate * db            
            if i % 10000 == 0:
                print('epoch %d loss %f' % (i, loss))
        # Collect the final parameters and the last gradients
        params = {
            'w': w,
            'b': b
        }
        grads = {
            'dw': dw,
            'db': db
        }
        return loss, params, grads
    def predict(self, X, params):
        w = params['w']
        b = params['b']
        y_pred = np.dot(X, w) + b
        return y_pred    
    def linear_cross_validation(self, data, k, randomize=True):
        if randomize:
            # Shuffle the rows before slicing them into k folds
            data = list(data)
            np.random.shuffle(data)

        slices = [data[i::k] for i in range(k)]
        for i in range(k):
            validation = slices[i]
            train = [row for s in slices if s is not validation for row in s]
            train = np.array(train)
            validation = np.array(validation)
            yield train, validation
if __name__ == '__main__':
    lr = lr_model()
    data = lr.prepare_data()   
    for train, validation in lr.linear_cross_validation(data, 5):
        X_train = train[:, :10]
        y_train = train[:, -1].reshape((-1, 1))
        X_valid = validation[:, :10]
        y_valid = validation[:, -1].reshape((-1, 1))

        loss5 = []
        loss, params, grads = lr.linear_train(X_train, y_train, 0.001, 100000)
        loss5.append(loss)
        score = np.mean(loss5)
        print('five-fold cross validation score is', score)
        y_pred = lr.predict(X_valid, params)
        valid_score = np.sum(((y_pred - y_valid) ** 2)) / len(X_valid)
        print('valid score is', valid_score)

Logistic regression

  • Despite its name, it is actually a classification algorithm

  • A simple implementation:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

class logistic_regression():
    def __init__(self):
        pass

    def sigmoid(self, x):
        z = 1 / (1 + np.exp(-x))
        return z

    def initialize_params(self, dims):
        W = np.zeros((dims, 1))
        b = 0
        return W, b

    def logistic(self, X, y, W, b):
        num_train = X.shape[0]
        num_feature = X.shape[1]

        a = self.sigmoid(np.dot(X, W) + b)
        cost = -1 / num_train * np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))

        dW = np.dot(X.T, (a - y)) / num_train
        db = np.sum(a - y) / num_train
        cost = np.squeeze(cost)
        return a, cost, dW, db

    def logistic_train(self, X, y, learning_rate, epochs):
        W, b = self.initialize_params(X.shape[1])
        cost_list = []
        for i in range(epochs):
            a, cost, dW, db = self.logistic(X, y, W, b)
            W = W - learning_rate * dW
            b = b - learning_rate * db
            # Record and report the cost every 100 epochs
            if i % 100 == 0:
                cost_list.append(cost)
                print('epoch %d cost %f' % (i, cost))

        params = {
            'W': W,
            'b': b
        }
        grads = {
            'dW': dW,
            'db': db
        }

        return cost_list, params, grads

    def predict(self, X, params):
        y_prediction = self.sigmoid(np.dot(X, params['W']) + params['b'])
        # Threshold the probabilities at 0.5 to get class labels
        for i in range(len(y_prediction)):
            if y_prediction[i] > 0.5:
                y_prediction[i] = 1
            else:
                y_prediction[i] = 0

        return y_prediction

    def accuracy(self, y_test, y_pred):
        correct_count = 0
        # Count positions where the prediction matches the label
        for i in range(len(y_test)):
            if y_test[i] == y_pred[i]:
                correct_count += 1

        accuracy_score = correct_count / len(y_test)
        return accuracy_score

    def create_data(self):
        X, labels = make_classification(n_samples=100, n_features=2, n_redundant=0, n_informative=2, random_state=1)
        labels = labels.reshape((-1, 1))
        offset = int(X.shape[0] * 0.9)
        X_train, y_train = X[:offset], labels[:offset]
        X_test, y_test = X[offset:], labels[offset:]
        return X_train, y_train, X_test, y_test

    def plot_logistic(self, X_train, y_train, params):
        n = X_train.shape[0]
        xcord1 = []
        ycord1 = []
        xcord2 = []
        ycord2 = []
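        # Split the points by class so that each class gets its own color
        for i in range(n):
            if y_train[i] == 1:
                xcord1.append(X_train[i][0])
                ycord1.append(X_train[i][1])
            else:
                xcord2.append(X_train[i][0])
                ycord2.append(X_train[i][1])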
        fig = plt.figure()
        ax = fig.add_subplot(111)
        ax.scatter(xcord1, ycord1, s=32, c='red')
        ax.scatter(xcord2, ycord2, s=32, c='green')
        x = np.arange(-1.5, 3, 0.1)
        y = (-params['b'] - params['W'][0] * x) / params['W'][1]
        ax.plot(x, y)
        plt.show()

if __name__ == "__main__":
    model = logistic_regression()
    X_train, y_train, X_test, y_test = model.create_data()
    print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
    cost_list, params, grads = model.logistic_train(X_train, y_train, 0.01, 1000)
    y_train_pred = model.predict(X_train, params)
    accuracy_score_train = model.accuracy(y_train, y_train_pred)
    print('train accuracy is:', accuracy_score_train)
    y_test_pred = model.predict(X_test, params)
    accuracy_score_test = model.accuracy(y_test, y_test_pred)
    print('test accuracy is:', accuracy_score_test)
    model.plot_logistic(X_train, y_train, params)

Support vector machine

Support vector machines are among the most influential methods in supervised learning. The model is based on a linear discriminant function.
Basic idea of SVM: for linearly separable data, many hyperplanes can separate the training samples, so we look for the hyperplane that lies midway between the two classes of training samples, that is, the one that maximizes the margin. Intuitively, this separation is the most tolerant of local perturbations of the training samples, and in practice it also performs well.
One discriminant surface and two sets of support vectors: the margin between the discriminant surface and the support vectors is maximized.

If the data are not linearly separable, add a kernel function

Multi-class data sets or complex problems require mapping the data into a higher-dimensional space.
Common kernel functions:

  • Linear kernel
  • Polynomial kernel
  • Gaussian kernel (commonly used)
  • Laplacian kernel
  • Sigmoid kernel
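
For reference, the five kernels can be written as plain functions of two vectors. These are common textbook forms; the parameter names and default values are assumptions made for this sketch, and some libraries define the Laplacian kernel with a different norm.

```python
import math

def dot(x, z):
    return sum(a * b for a, b in zip(x, z))

def linear_kernel(x, z):
    return dot(x, z)

def polynomial_kernel(x, z, degree=3, coef0=1.0):
    return (dot(x, z) + coef0) ** degree

def gaussian_kernel(x, z, sigma=1.0):
    # exp(-||x - z||^2 / (2 * sigma^2)), also known as the RBF kernel
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def laplacian_kernel(x, z, sigma=1.0):
    # exp(-||x - z|| / sigma), using the Euclidean norm here
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))
    return math.exp(-dist / sigma)

def sigmoid_kernel(x, z, beta=1.0, theta=-1.0):
    return math.tanh(beta * dot(x, z) + theta)

x, z = [1.0, 2.0], [3.0, 4.0]
for kernel in (linear_kernel, polynomial_kernel, gaussian_kernel,
               laplacian_kernel, sigmoid_kernel):
    print(kernel.__name__, kernel(x, z))
```

Each function returns the inner product of the two points in some implicit feature space, which is what lets a linear separator work in the higher-dimensional space without the mapping ever being computed.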

The code implementation is as follows:

import numpy as np
import matplotlib.pyplot as plt

class Hard_Margin_SVM:
    def __init__(self, visualization=True):
        self.visualization = visualization
        self.colors = {1: 'r', -1: 'g'}
        if self.visualization:
            self.fig = plt.figure()
            self.ax = self.fig.add_subplot(1, 1, 1)

    # Define training function
    def train(self, data):
        self.data = data
        # Parameter dictionary {| w | [w, b]}
        opt_dict = {}

        # Data conversion list
        transforms = [[1, 1],
                      [-1, 1],
                      [-1, -1],
                      [1, -1]]

        # Get all data from the dictionary
        all_data = []
        for yi in self.data:
            for featureset in self.data[yi]:
                for feature in featureset:
                    all_data.append(feature)

        # Get data max min
        self.max_feature_value = max(all_data)
        self.min_feature_value = min(all_data)
        all_data = None

        # Define a list of learning rates (steps)
        step_sizes = [self.max_feature_value * 0.1,
                      self.max_feature_value * 0.01,
                      self.max_feature_value * 0.001]

        # Range setting of parameter b
        b_range_multiple = 2
        b_multiple = 5
        latest_optimum = self.max_feature_value * 10

        # Training optimization based on different step size
        for step in step_sizes:
            w = np.array([latest_optimum, latest_optimum])
            # Convex optimization
            optimized = False
            while not optimized:
                for b in np.arange(-1 * (self.max_feature_value * b_range_multiple),
                                   self.max_feature_value * b_range_multiple,
                                   step * b_multiple):
                    for transformation in transforms:
                        w_t = w * transformation
                        found_option = True

                        for i in self.data:
                            for xi in self.data[i]:
                                yi = i
                                # Every sample must satisfy yi * (w . xi + b) >= 1
                                if not yi * (np.dot(w_t, xi) + b) >= 1:
                                    found_option = False
                                    # print(xi, ':', yi * (np.dot(w_t, xi) + b))

                        if found_option:
                            opt_dict[np.linalg.norm(w_t)] = [w_t, b]

                if w[0] < 0:
                    optimized = True
                    print('Optimized a step!')
                else:
                    w = w - step

            norms = sorted([n for n in opt_dict])
            # ||w|| : [w,b]
            opt_choice = opt_dict[norms[0]]
            self.w = opt_choice[0]
            self.b = opt_choice[1]
            latest_optimum = opt_choice[0][0] + step * 2

        for i in self.data:
            for xi in self.data[i]:
                yi = i
                print(xi, ':', yi * (np.dot(self.w, xi) + self.b))

    # Define prediction function
    def predict(self, features):
        # sign( x.w+b )
        classification = np.sign(np.dot(np.array(features), self.w) + self.b)
        if classification != 0 and self.visualization:
            self.ax.scatter(features[0], features[1], s=200, marker='^', c=self.colors[classification])
        return classification

    # Define result drawing function
    def visualize(self):
        [[self.ax.scatter(x[0], x[1], s=100, color=self.colors[i]) for x in self.data[i]] for i in self.data]

        # hyperplane = x.w+b
        # v = x.w+b
        # psv = 1
        # nsv = -1
        # dec = 0
        # Define linear hyperplane
        def hyperplane(x, w, b, v):
            return (-w[0] * x - b + v) / w[1]

        datarange = (self.min_feature_value * 0.9, self.max_feature_value * 1.1)
        hyp_x_min = datarange[0]
        hyp_x_max = datarange[1]

        # (w.x+b) = 1
        # Positive support vector
        psv1 = hyperplane(hyp_x_min, self.w, self.b, 1)
        psv2 = hyperplane(hyp_x_max, self.w, self.b, 1)
        self.ax.plot([hyp_x_min, hyp_x_max], [psv1, psv2], 'k')

        # (w.x+b) = -1
        # Negative support vector
        nsv1 = hyperplane(hyp_x_min, self.w, self.b, -1)
        nsv2 = hyperplane(hyp_x_max, self.w, self.b, -1)
        self.ax.plot([hyp_x_min, hyp_x_max], [nsv1, nsv2], 'k')

        # (w.x+b) = 0
        # Linearly separated hyperplane
        db1 = hyperplane(hyp_x_min, self.w, self.b, 0)
        db2 = hyperplane(hyp_x_max, self.w, self.b, 0)
        self.ax.plot([hyp_x_min, hyp_x_max], [db1, db2], 'y--')
        plt.show()

data_dict = {-1: np.array([[1, 7],
                           [2, 8],
                           [3, 8], ]),

             1: np.array([[5, 1],
                          [6, -1],
                          [7, 3], ])}

if __name__ == '__main__':

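    svm = Hard_Margin_SVM()
    svm.train(data=data_dict)

    predict_us = [[0, 10],
                  [1, 3],
                  [3, 4],
                  [3, 5],
                  [5, 5],
                  [5, 6],
                  [6, -5],
                  [5, 8],
                  [2, 5],
                  [8, -3]]

    # Classify each new point, then draw the data and the margins
    for p in predict_us:
        svm.predict(p)
    svm.visualize()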

Decision tree

This section relies on entropy and related concepts: entropy, conditional entropy, information gain, and the Gini index.

A decision tree is a machine learning method that makes decisions with a tree structure, which mirrors the natural way people approach decisions.

In such a tree, each leaf node gives a class label and each internal node represents an attribute.
For example, when a bank decides whether to lend to a customer, it usually makes a series of judgments. It first asks: is the customer's credit record good? If so, it then asks whether the customer has a stable job. If the credit record is not good, the application may be rejected directly, or the bank may ask whether the customer has collateral. This thinking process is the process of generating a decision tree.
When generating a decision tree, the most important step is choosing the root node, that is, which feature to use as the decision factor. The ID3 algorithm uses information gain as the criterion.
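
To make the criterion concrete, here is a minimal sketch of information gain: the entropy of the labels minus the weighted entropy that remains after splitting on a feature. The loan records are invented for illustration and only loosely follow the bank example; the function names are likewise assumptions, not the helpers used in the listing below.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, feature, label):
    """Entropy of the labels minus the weighted entropy after splitting on `feature`."""
    base = entropy([r[label] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[feature], []).append(r[label])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Hypothetical loan records
rows = [
    {"credit": "good", "job": "stable", "loan": "yes"},
    {"credit": "good", "job": "unstable", "loan": "yes"},
    {"credit": "bad", "job": "stable", "loan": "no"},
    {"credit": "bad", "job": "unstable", "loan": "no"},
]
gains = {f: information_gain(rows, f, "loan") for f in ("credit", "job")}
best = max(gains, key=gains.get)
print(gains, best)
```

In this toy data the credit record separates the two outcomes completely while the job feature tells us nothing, so ID3 would place credit at the root.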

import numpy as np
import pandas as pd
from math import log

df = pd.read_csv('./example_data.csv')
def entropy(ele):
    '''
    function: Calculating entropy value.
    input: A list contain categorical value.
    output: Entropy value.
    entropy = - sum(p * log(p)), p is a prob value.
    '''
    # Calculating the probability distribution of list value
    probs = [ele.count(i)/len(ele) for i in set(ele)]    
    # Calculating entropy value
    entropy = -sum([prob*log(prob, 2) for prob in probs])    
    return entropy
def split_dataframe(data, col):
    '''
    function: split pandas dataframe to sub-df based on data and column.
    input: dataframe, column name.
    output: a dict of splited dataframe.
    '''
    # unique value of column
    unique_values = data[col].unique()    
    # empty dict of dataframe
    result_dict = {elem : pd.DataFrame for elem in unique_values}    
    # split dataframe based on column value
    for key in result_dict.keys():
        result_dict[key] = data[:][data[col] == key]    
    return result_dict
class ID3Tree:
    # define a Node class
    class Node:
        def __init__(self, name):
            self.name = name
            self.connections = {}
        def connect(self, label, node):
            self.connections[label] = node
    def __init__(self, data, label):
        self.columns = data.columns
        self.data = data
        self.label = label
        self.root = self.Node("Root")
    # print tree method
    def print_tree(self, node, tabs):
        print(tabs + node.name)
        for connection, child_node in node.connections.items():
            print(tabs + "\t" + "(" + connection + ")")
            self.print_tree(child_node, tabs + "\t\t")
    def construct_tree(self):
        self.construct(self.root, "", self.data, self.columns)
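    # construct tree
    # Note: choose_best_col is not defined in this listing. From its use here it
    # should return the largest information gain, the column that achieves it,
    # and the data split on that column.
    def construct(self, parent_node, parent_connection_label, input_data, columns):
        max_value, best_col, max_splited = choose_best_col(input_data[columns], self.label)
        if not best_col:
            # No column left to split on: attach a leaf carrying the class label
            node = self.Node(input_data[self.label].iloc[0])
            parent_node.connect(parent_connection_label, node)
            return
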
        node = self.Node(best_col)
        parent_node.connect(parent_connection_label, node)

        new_columns = [col for col in columns if col != best_col]        
        # Recursively constructing decision trees
        for splited_value, splited_data in max_splited.items():
            self.construct(node, splited_value, splited_data, new_columns)
from sklearn.datasets import load_iris
from sklearn import tree
import graphviz

iris = load_iris()
# criterion='entropy' chooses splits by information gain, as ID3 does
# (scikit-learn's tree implementation itself is CART-based)
clf = tree.DecisionTreeClassifier(criterion='entropy', splitter='best')
clf = clf.fit(iris.data, iris.target)

dot_data = tree.export_graphviz(clf, out_file=None)
graph = graphviz.Source(dot_data)

Random forest

Decision stump algorithm

Unsupervised learning

The data set has no label information (the model learns on its own)

  • Clustering: unsupervised learning can be used to estimate how closely samples are related. Closely related samples are put into the same category and weakly related samples into different categories. This is "clustering".
  • Dimensionality reduction: unsupervised learning can also be used to transform high-dimensional data that is costly to compute with into low-dimensional data that is easy to process, with little or no loss of information. This is "dimensionality reduction".


The purpose of clustering is to divide the data into several categories. Within a category, objects (entities) are highly similar; across categories, objects differ greatly.

A set of samples without category labels is grouped according to the similarity between samples: similar samples go into one category and dissimilar samples into others. This is called cluster analysis, also known as unsupervised classification.

Common methods include K-Means clustering, mean shift clustering, density-based clustering, and so on.


K-Means clustering

  • 1) Select K objects in the data space as the initial centers; each object represents one cluster center.

  • 2) For every data object in the sample, compute its Euclidean distance to each cluster center and assign it to the class of the nearest (most similar) center.

  • 3) Update the cluster centers: take the mean of all objects in each class as the new center of that class, and compute the value of the objective function.

  • 4) Check whether the cluster centers and the objective function value have changed. If not, output the result. If they have, return to step 2).
    Note:
    The clustering result depends on both the randomly chosen initial centers and the distribution of the data.
    For example, with two clusters: generate two center points at random positions among the data points, compute the distance from every data point to both centers, and assign each point to the nearer one -> compute the mean of the points in each class and move the center to that mean -> recompute the distances to the new centers and reassign -> repeat.

  • K-Means++ is an initialization scheme for K-Means: the initial centers are sampled so that they tend to lie far apart, which is usually more effective.
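
The four steps above can be sketched in a few lines of plain Python. This is a minimal version with random initialization (not K-Means++); the function name, the seed, and the toy points are assumptions made for the example.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-Means on a list of equal-length tuples, using Euclidean distance."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)           # step 1: pick K initial centers
    labels = None
    for _ in range(max_iter):
        # step 2: assign every point to its nearest center
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:              # step 4: stop once assignments are stable
            break
        labels = new_labels
        # step 3: move each center to the mean of its members
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:                       # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```

Which cluster receives label 0 depends on the random initial centers, so only the grouping is meaningful, not the label values.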

Mean shift (no need to specify the number of clusters)

Pick a center point and, treating it as the center of a circle, compute the average of the vectors from this point to all points inside the circle. If the density inside the circle is high and evenly distributed, stop. Otherwise, move the center of the circle to a new position and repeat.

Hierarchical clustering

Pay attention to how far the clustering is carried: avoid overfitting, and avoid merging everything into a single class.

  • Shortest distance method, longest distance method, middle distance method
    • Shortest: build up from small clusters, merging a little at a time
    • Longest: start from the whole set and peel it apart layer by layer
    • Middle: use the intermediate distance as the dividing line for clustering
  • Clustering with a tree structure is straightforward, so it is skipped here

Density clustering

In the density clustering figure, minpts = 3 (outliers may appear)
Neighborhood: for example, with x1 as the center of a circle, the area inside the circle is its neighborhood

Directly density-reachable: the point must lie within the neighborhood
Density-reachable: if x2 is directly density-reachable from x1 and x3 is directly density-reachable from x2, then x3 is density-reachable from x1
Density-connected: x3 and x4 are density-connected
- k does not need to be specified
- The circle radius and the minimum number of points in a circle need to be specified
- Computation is slow

Prone to producing outliers.
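
These definitions are the building blocks of DBSCAN, the best-known density-based clustering algorithm (the notes above do not name it, so treating them as DBSCAN is an assumption). A minimal sketch with minpts = 3 follows; the radius `eps` and the toy points are made up for the example.

```python
def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for an outlier."""
    def neighbors(i):
        # Neighborhood of point i: everything within radius eps, including i itself
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:          # not a core point: provisionally an outlier
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:           # outlier reachable from a core point: border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts: # j is a core point too: keep expanding
                queue.extend(reachable)
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

The isolated point at (5, 5) has too few neighbors and is not reachable from any core point, so it stays an outlier. The nested neighbor search is quadratic in the number of points, which is one reason the method is slow.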

Gaussian mixture clustering

Gaussian mixture (fit the distributions first, then cluster)
After the two Gaussian distributions are fit, each data point is compared under the two distributions
EM algorithm (expectation-maximization)
- To find P(A, B): fix the data B and compute P(A|B), then fix the resulting A
- and compute P(B|A); go back to P(A|B) and iterate until the values converge
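
A one-dimensional, two-component mixture is enough to show the alternation. In the sketch below the E step computes how strongly each Gaussian claims each point, and the M step refits the Gaussians from those soft assignments. The synthetic data, the crude initialization, and the fixed iteration count are all assumptions made for illustration.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """EM for a two-component one-dimensional Gaussian mixture."""
    mu = np.array([x.min(), x.max()], dtype=float)   # crude initial means
    var = np.array([x.var(), x.var()])
    weights = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E step: posterior probability of each component for each point
        dens = weights * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate the parameters from the soft assignments
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        weights = nk / len(x)
    return weights, mu, var, resp.argmax(axis=1)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])
weights, mu, var, labels = em_gmm_1d(x)
print(mu, var, weights)
```

Each point is finally assigned to the component with the larger posterior probability, which is the comparison under the two distributions described above.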

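AP (affinity propagation) clustering

Responsibility (attraction information): a matrix expressing, from the similarities, how well suited each point is to serve as the exemplar of another point
Availability (attribution information): a matrix expressing how appropriate it is for each point to choose another point as its exemplar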

Dimensionality reduction

The purpose of dimensionality reduction is to reduce the original sample data to fewer dimensions while keeping the loss of information, or the error incurred when restoring the data, as small as possible. Principal component analysis is one example.

  • Data is easier to process and use in low dimensions;
  • Relevant features, especially the important ones, show up clearly in the data;
  • With only two or three dimensions, the data can be visualized;
  • Noise is removed and the cost of later algorithms is reduced.

PCA: principal component analysis (unsupervised)

  • The features are decomposed (the basis is rotated), and components with little or no influence can be removed
    The covariance matrix of the features is computed and diagonalized, and the components with small eigenvalues are removed
    Features are often not fully independent and influence one another, so dimensionality reduction is needed to remove these relationships

Goal of dimensionality reduction: the variance within each feature should be as large as possible, and the covariance between different features as small as possible
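
The covariance and eigenvalue route described above can be sketched with NumPy as follows. The function name and the synthetic, strongly correlated two-feature data are assumptions made for the example.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto the top principal directions of its covariance matrix."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)        # covariance between features
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigh, since the matrix is symmetric
    order = np.argsort(eigvals)[::-1]             # largest variance first
    components = eigvecs[:, order[:n_components]]
    explained = eigvals[order] / eigvals.sum()    # share of variance per direction
    return X_centered @ components, components, explained

rng = np.random.default_rng(0)
t = rng.normal(size=200)
# Two strongly correlated features plus a little noise
X = np.column_stack([t, 2 * t + rng.normal(scale=0.1, size=200)])
Z, components, explained = pca(X, n_components=1)
print(Z.shape, explained)
```

Because the two features are almost copies of each other, nearly all of the variance lies along a single direction, and dropping the second component loses very little information.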

LDA: linear discriminant analysis can be used for supervised dimensionality reduction and for classification

- In effect, it looks for a discriminant surface in a space of at most k-1 dimensions, where k is the number of classes.


Tags: Python AI Deep Learning

Posted on Wed, 17 Nov 2021 23:12:44 -0500 by mrkite