Linear discriminant criterion and linear classification programming practice

1, Generate a simulated data set and practice the LDA algorithm

1. The LDA linear discriminant criterion

1. Basic idea of LDA

LDA is a supervised dimensionality reduction technique: every sample in the data set comes with a class label. This distinguishes it from PCA, which is an unsupervised dimensionality reduction technique that ignores class labels. The idea of LDA can be summarized in one sentence: after projection, the within-class variance should be as small as possible and the between-class variance as large as possible. In other words, we project the data onto a lower-dimensional space so that the projected points of each class stay as close together as possible, while the centers of different classes end up as far apart as possible.
Suppose we have two classes of data, red and blue, whose features are two-dimensional. We want to project these data onto a one-dimensional line so that the projected points of each class are as close together as possible, and the distance between the red and blue class centers is as large as possible.
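For this two-class case, the idea leads to the classical Fisher criterion: the optimal projection direction is proportional to $S_w^{-1}(\mu_1-\mu_0)$, where $\mu_0,\mu_1$ are the class means and $S_w$ is the within-class scatter matrix. A minimal NumPy sketch of that direction (function and variable names are illustrative, not from the original post):

import numpy as np

def fisher_direction(X0, X1):
    # Two-class Fisher/LDA direction: w is proportional to Sw^{-1} (mu1 - mu0)
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: sum of the two classes' unnormalized covariance matrices
    Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    w = np.linalg.solve(Sw, mu1 - mu0)  # solve Sw w = (mu1 - mu0) rather than inverting Sw
    return w / np.linalg.norm(w)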

2. The principle of multi-class LDA

With the two-class LDA case as a foundation, let's look at the principle of multi-class LDA.
Suppose our data set is $D=\{(x_1,y_1),(x_2,y_2),\ldots,(x_m,y_m)\}$, where each sample $x_i$ is an n-dimensional vector and $y_i \in \{C_1,C_2,\ldots,C_k\}$. We define $N_j\ (j=1,2,\ldots,k)$ as the number of samples of class $j$, $X_j\ (j=1,2,\ldots,k)$ as the set of samples of class $j$, $\mu_j\ (j=1,2,\ldots,k)$ as the mean vector of the class-$j$ samples, and $\Sigma_j\ (j=1,2,\ldots,k)$ as the covariance matrix of the class-$j$ samples. The quantities defined for two-class LDA carry over to multi-class LDA by direct analogy.
Because we are projecting multiple classes to a lower dimension, the target of the projection is generally not a single line but a d-dimensional subspace. Suppose the dimension of the low-dimensional space we project onto is $d$, with basis vectors $(w_1,w_2,\ldots,w_d)$; the matrix $W$ formed by these basis vectors is an $n \times d$ matrix.
At this point, our optimization objective becomes

$$\max_W \; \frac{W^T S_b W}{W^T S_w W}$$

where $S_b = \sum_{j=1}^{k} N_j(\mu_j-\mu)(\mu_j-\mu)^T$, $\mu$ is the mean vector of all samples, and $S_w = \sum_{j=1}^{k} S_{wj} = \sum_{j=1}^{k}\sum_{x\in X_j}(x-\mu_j)(x-\mu_j)^T$. But there is a problem: both $W^T S_b W$ and $W^T S_w W$ are matrices, not scalars, so the ratio cannot be optimized as a scalar function! That is, we cannot directly reuse the optimization method of two-class LDA. What should we do? In general, some alternative optimization objective is used instead.
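Before looking at that alternative objective, note that $S_b$ and $S_w$ can be computed directly from a labelled data set following the definitions above. A minimal NumPy sketch (function and variable names are illustrative, not from the original post):

import numpy as np

def scatter_matrices(X, y):
    # Between-class scatter Sb and within-class scatter Sw, as defined above
    mu = X.mean(axis=0)                       # overall mean vector
    n_features = X.shape[1]
    Sb = np.zeros((n_features, n_features))
    Sw = np.zeros((n_features, n_features))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)   # N_j (mu_j - mu)(mu_j - mu)^T
        Sw += (Xc - mu_c).T @ (Xc - mu_c)                # sum over x in X_j of (x - mu_j)(x - mu_j)^T
    return Sb, Sw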
A common multi-class LDA optimization objective is defined as

$$J(W) = \frac{\prod\nolimits_{\mathrm{diag}} W^T S_b W}{\prod\nolimits_{\mathrm{diag}} W^T S_w W}$$

where $\prod_{\mathrm{diag}} A$ denotes the product of the main diagonal elements of the matrix $A$, and $W$ is an $n \times d$ matrix.
The optimization of $J(W)$ can then be transformed into

$$J(W) = \frac{\prod_{i=1}^{d} w_i^T S_b w_i}{\prod_{i=1}^{d} w_i^T S_w w_i} = \prod_{i=1}^{d} \frac{w_i^T S_b w_i}{w_i^T S_w w_i}.$$

Look closely at the rightmost expression: each factor is a generalized Rayleigh quotient! The maximum of $J(W)$ is the product of the largest $d$ eigenvalues of the matrix $S_w^{-1} S_b$, and the corresponding $W$ is the matrix whose columns are the eigenvectors associated with those $d$ largest eigenvalues.
Since $W$ is a projection matrix obtained using the class labels of the samples, the maximum dimension $d$ we can reduce to is $k-1$. Why is the maximum not the number of classes $k$? Each term $N_j(\mu_j-\mu)(\mu_j-\mu)^T$ in $S_b$ is a rank-1 matrix, so after summing, the rank of $S_b$ is at most $k$ (the rank of a sum of matrices is at most the sum of their ranks). However, once the first $k-1$ mean vectors $\mu_1,\ldots,\mu_{k-1}$ and the overall mean $\mu$ are known, the last mean $\mu_k$ is a linear combination of them, so the rank of $S_b$ is at most $k-1$; that is, there are at most $k-1$ eigenvectors with nonzero eigenvalues.
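Putting the pieces together, here is a hedged NumPy sketch of multi-class LDA: it reuses the scatter_matrices helper sketched above, solves the eigenproblem of $S_w^{-1} S_b$, and keeps the eigenvectors of the largest $d \le k-1$ eigenvalues (names are illustrative, not from the original post):

import numpy as np

def lda_projection(X, y, d=None):
    # Project X onto the top-d discriminant directions; scatter_matrices is the helper sketched above
    Sb, Sw = scatter_matrices(X, y)
    k = len(np.unique(y))
    d = k - 1 if d is None else min(d, k - 1)             # rank(Sb) <= k - 1
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigvals.real)[::-1][:d]            # indices of the d largest eigenvalues
    W = eigvecs[:, order].real                            # n x d projection matrix
    return X @ W, W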

2. Algorithm implementation

1. Import packages and generate a simulated data set

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as lda
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt 
import numpy as np
import pandas as pd
# Generate 200 two-dimensional samples divided into 2 classes
x, y = make_classification(n_samples=200, n_features=2, n_redundant=0, n_classes=2,
                           n_informative=1, n_clusters_per_class=1, class_sep=0.5, random_state=100)
plt.scatter(x[:, 0], x[:, 1], marker='o', c=y)
plt.show()
# A first manual split: the first 60 samples for training, the remaining 140 for testing
x_train = x[:60, :]
y_train = y[:60]
x_test = x[60:, :]
y_test = y[60:]


2. Split the data set and evaluate LDA

# Re-split the data: the first 150 samples for training, the remaining 50 for testing
x_train = x[:150, :]
y_train = y[:150]
x_test = x[150:, :]
y_test = y[150:]
lda_test=lda()
lda_test.fit(x_train,y_train)
predict_y=lda_test.predict(x_test)
count=0
for i in range(len(predict_y)):
    if predict_y[i]==y_test[i]:
        count+=1
print("The number of accurate forecasts is"+str(count))
print("The accuracy is"+str(count/len(predict_y)))


2, Classify the moon data set with SVM and compare the results of the linear kernel, polynomial kernel and Gaussian kernel, as well as different parameters (such as the penalty coefficient C).

1. Linear classification algorithm

  • Support vector machine (SVM)

A support vector machine (SVM) is a binary classification model. Its goal is to find a hyperplane that separates the samples, and the separation principle is to maximize the margin; the problem is ultimately transformed into a convex quadratic programming problem. From simple to complex, the models include:

When the training samples are linearly separable, a linearly separable support vector machine is learned by hard-margin maximization;
When the training samples are approximately linearly separable, a linear support vector machine is learned by soft-margin maximization (the standard soft-margin objective is sketched right after this list);
When the training samples are linearly inseparable, a nonlinear support vector machine is learned through the kernel trick together with soft-margin maximization.
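For reference, the standard soft-margin formulation mentioned in the second item above (with penalty coefficient $C$, the same parameter varied in the experiments below) is:

$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i \quad \text{s.t.}\quad y_i(w^{\top}x_i+b) \ge 1-\xi_i,\quad \xi_i \ge 0,\quad i=1,\ldots,m.$$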

2. SVM classification of moon data set

import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
import numpy as np
import matplotlib as mpl
from sklearn.datasets import make_moons
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
# To display Chinese
mpl.rcParams['font.sans-serif'] = [u'SimHei']
mpl.rcParams['axes.unicode_minus'] = False  # rc parameters: default properties such as figure size, dots per inch, line width, color, line style, axes, grid, text and font can all be modified through rcParams
X, y = make_moons(n_samples=100, noise=0.15, random_state=42)#Generate moon dataset
def plot_dataset(X, y, axes):#Drawing graphics
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], "bs")
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], "g^")
    plt.axis(axes)
    plt.grid(True, which='both')
    plt.xlabel(r"$x_1$", fontsize=20)
    plt.ylabel(r"$x_2$", fontsize=20, rotation=0)
    plt.title("Moon data",fontsize=20)
plot_dataset(X, y, [-1.5, 2.5, -1, 1.5])
plt.show()
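The block above imports Pipeline, PolynomialFeatures, StandardScaler and LinearSVC but does not yet use them. One plausible next step, sketched below with illustrative hyperparameters (degree=3 and C=10 are assumptions, not values from the original post), is a polynomial-feature pipeline with a linear SVM fitted to the same X, y:

# A hedged sketch: add degree-3 polynomial features, standardize, then fit a linear SVM
polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge", max_iter=10000, random_state=42)),
])
polynomial_svm_clf.fit(X, y)
print("Training accuracy:", polynomial_svm_clf.score(X, y))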

3. Classify the moon data set using SVM with different kernels:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.datasets import make_moons

# Import dataset
X,y = make_moons(n_samples=200,random_state=0,noise=0.05)

h = .02  # Step size in mesh

# Create an instance of support vector machine and fit the data
C = 1.0  # SVM regularization parameters
svc = svm.SVC(kernel='linear', C=C).fit(X, y) # Linear kernel
rbf_svc = svm.SVC(kernel='rbf', gamma=0.7, C=C).fit(X, y) # Radial basis kernel
poly_svc = svm.SVC(kernel='poly', degree=3, C=C).fit(X, y) # Polynomial kernel
lin_svc = svm.LinearSVC(C=C).fit(X, y) #Linear kernel

# Create a mesh to draw an image
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# Title of the diagram
titles = ['SVC with linear kernel',
          'LinearSVC (linear kernel)',
          'SVC with RBF kernel',
          'SVC with polynomial (degree 3) kernel']


for i, clf in enumerate((svc, lin_svc, rbf_svc, poly_svc)):
    # Draw the decision boundary and assign different colors to different areas
    plt.subplot(2, 2, i + 1) # Create a graph with 2 rows and 2 columns, and take the ith graph as the current graph
    plt.subplots_adjust(wspace=0.4, hspace=0.4) # Set subgraph interval

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) #The elements in xx and yy form a pair of coordinates as the input of support vector machine and return an array
    
    # Draw the classification results
    Z = Z.reshape(xx.shape)  # Reshape the predictions back to the grid shape
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8) #Use the contour function to draw different areas

    # The training data are drawn in the form of discrete points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
    plt.xlabel(r"$x_1$")
    plt.ylabel(r"$x_2$")
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.title(titles[i])

plt.show()
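The section title also asks for a comparison over the penalty coefficient C. A minimal sketch of such a sweep (the C values and the 70/30 split are illustrative, not from the original post), reusing X, y and the svm import from the block above:

# Compare test accuracy of an RBF-kernel SVM across several penalty values C
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
for C in (0.01, 0.1, 1, 10, 100):
    clf = svm.SVC(kernel='rbf', gamma=0.7, C=C).fit(X_tr, y_tr)
    print("C = %-6s  test accuracy = %.3f" % (C, clf.score(X_te, y_te)))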

Summary

Advantages of LDA Algorithm

  1. During dimensionality reduction, LDA can use class-label prior knowledge, whereas unsupervised methods such as PCA cannot.
  2. LDA performs better than PCA when the class information is carried by the class means rather than by the variances.

Disadvantages of LDA Algorithm

  1. LDA is not well suited to reducing the dimensionality of samples that are not Gaussian distributed; PCA shares this problem.
  2. LDA can reduce the dimensionality to at most k-1, where k is the number of classes; if a target dimension greater than k-1 is needed, LDA cannot be used directly, although some extensions of LDA work around this limitation.
  3. When the class information depends on the variances rather than the means, LDA's dimensionality reduction works poorly.

