SVM (support vector machine) is an elegant algorithm built on a rigorous mathematical theory. It is often used for data classification and regression. Thanks to its strong theoretical guarantees and the kernel trick for handling linearly inseparable problems, SVM was very popular around the 1990s.
The svm module of sklearn contains the following classes and functions:
Support vector machines in sklearn

| Class | Meaning | Input |
|---|---|---|
| svm.LinearSVC | Linear support vector classification | ([penalty, loss, dual, tol, C, ...]) |
| svm.LinearSVR | Linear support vector regression | ([epsilon, tol, C, loss, ...]) |
| svm.SVC | Nonlinear multidimensional support vector classification | ([C, kernel, degree, gamma, coef0, ...]) |
| svm.SVR | Nonlinear multidimensional support vector regression | ([kernel, degree, gamma, coef0, tol, ...]) |
| svm.NuSVC | Nu support vector classification | ([nu, kernel, degree, gamma, ...]) |
| svm.NuSVR | Nu support vector regression | ([nu, C, kernel, degree, gamma, ...]) |
| svm.OneClassSVM | Unsupervised outlier detection | ([kernel, degree, gamma, ...]) |
| svm.l1_min_c | Returns the lowest bound for C such that for C in (l1_min_C, infinity) the model is guaranteed not to be empty | (X, y[, loss, fit_intercept, ...]) |
| **Use libsvm functions directly** | | |
| svm.libsvm.cross_validation | SVM-specific cross validation | |
| svm.libsvm.decision_function | SVM-specific decision (margin) function (the libsvm name is predict_values) | |
| svm.libsvm.fit | Train a model using libsvm | |
| svm.libsvm.predict | Predict the target values of X for a given model | |
| svm.libsvm.predict_proba | Predict probabilities | |
Note that, except for LinearSVC and LinearSVR, which are specifically linear, all other classes support both linear and nonlinear models.
NuSVC and NuSVR let you control the number of support vectors through the parameter nu; their other parameters are consistent with the most commonly used SVC and SVR. Note that OneClassSVM is an unsupervised class.
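As a minimal sketch of how nu relates to the number of support vectors, the snippet below fits NuSVC for a few nu values and prints the fraction of training samples that end up as support vectors. The data set (make_classification), the nu values and the variable names are assumptions chosen for illustration; they do not come from the original text.

```python
from sklearn.datasets import make_classification
from sklearn.svm import NuSVC

# Synthetic, roughly balanced two-class data (illustrative assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

sv_fraction = {}
for nu in (0.1, 0.3, 0.5):
    model = NuSVC(nu=nu, kernel="rbf", gamma="scale").fit(X_demo, y_demo)
    # n_support_ holds the number of support vectors of each class
    sv_fraction[nu] = model.n_support_.sum() / len(X_demo)
    print(f"nu={nu}: fraction of support vectors = {sv_fraction[nu]:.2f}")
```

In theory nu acts as a lower bound on the fraction of support vectors, so the printed fractions should grow with nu.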
In addition to its own classes, sklearn also provides several functions that call the libsvm library directly. Libsvm is a simple, easy-to-use, fast and effective SVM library developed by Professor Chih-Jen Lin of National Taiwan University. It provides many of the underlying computations and parameter selection tools for SVM, and it is the library called behind many of the sklearn classes. At present, libsvm is available in dozens of language versions such as C, Java, Matlab, Python and R. Each language version can be downloaded from the official libsvm website: https://www.csie.ntu.edu.tw/~cjlin/libsvm/
SVM is a binary classification model. Its basic idea is to find the separating hyperplane with the largest margin in the feature space, so that the data can be classified efficiently. Specifically, there are three cases (without a kernel function it is a linear model; with a kernel function it becomes a nonlinear model):
- When the training samples are linearly separable, a linear classifier, the linearly separable support vector machine, is learned by maximizing the hard margin;
- When the training data is approximately linearly separable, slack variables are introduced and a linear classifier, the linear support vector machine, is learned by maximizing the soft margin;
- When the training data is linearly inseparable, a nonlinear support vector machine is learned by using the kernel trick together with soft margin maximization.
sklearn.svm.SVC
class sklearn.svm.SVC(*, C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=None)
| Parameter | Meaning |
|---|---|
| C | Float; optional; default 1.0; must be strictly positive. Penalty coefficient for the slack variables. If C is large, SVC may choose a decision boundary with a smaller margin that classifies all training points better. If C is small, SVC maximizes the margin as far as possible and the decision function is simpler, at the cost of training accuracy. In other words, C plays a role in SVM similar to that of the regularization parameter in logistic regression. |
| kernel | String; optional; default "rbf". Specifies the type of kernel function used by the algorithm. You can enter 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed' or a callable object (such as a function or class). If a callable is given, it is used to pre-compute the kernel matrix from the feature matrix X; that matrix should have shape (n_samples, n_samples). |
| degree | Integer; optional; default 3. Degree of the polynomial kernel function ('poly'). Ignored by all other kernels. |
| gamma | Float or string; optional; default "scale". Coefficient of the kernel function; only used when kernel is "rbf", "poly" or "sigmoid". With "auto", 1/n_features is used as the value of gamma. With "scale" (the default since sklearn 0.22), 1/(n_features * X.var()) is used. In sklearn 0.20 and 0.21 the default was "auto_deprecated", which meant that no explicit gamma value was passed (not recommended). |
| coef0 | Float; optional; default 0.0. The independent term in the kernel function; only used when kernel is 'poly' or 'sigmoid'. |
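To make the meaning of the gamma option "scale" concrete, here is a small sketch. It assumes a recent scikit-learn (0.22 or later), where gamma="scale" is documented as 1 / (n_features * X.var()), and checks that passing that number explicitly gives the same decision function. The make_moons data and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X_demo, y_demo = make_moons(n_samples=100, noise=0.15, random_state=0)

# Value that gamma="scale" is documented to use: 1 / (n_features * X.var())
gamma_scale = 1.0 / (X_demo.shape[1] * X_demo.var())

clf_scale = SVC(kernel="rbf", gamma="scale").fit(X_demo, y_demo)
clf_manual = SVC(kernel="rbf", gamma=gamma_scale).fit(X_demo, y_demo)

# Both models should produce the same decision values on the training data
same = np.allclose(clf_scale.decision_function(X_demo),
                   clf_manual.decision_function(X_demo))
print(gamma_scale, same)
```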
kernel function
Visualization of linear SVM decision process
    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC
    import matplotlib.pyplot as plt
    import numpy as np

    X, y = make_blobs(n_samples=50, centers=2, random_state=0, cluster_std=0.6)
    plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap="rainbow")  # "rainbow" colormap
    plt.xticks([])
    plt.yticks([])
    plt.show()

    # First, we need the axes that hold the sample scatter plot as an object
    plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap="rainbow")
    ax = plt.gca()  # get the current axes; create new axes if none exist
Draw decision boundary: make grid and understand the function meshgrid
    # Get the minimum and maximum of both coordinate axes on the plane
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()

    # Create 30 evenly spaced values between the minimum and the maximum
    axisx = np.linspace(xlim[0], xlim[1], 30)
    axisy = np.linspace(ylim[0], ylim[1], 30)  # axisx.shape: (30,)

    axisy, axisx = np.meshgrid(axisy, axisx)
    # The two-dimensional arrays formed here are used as X and Y in the contour function.
    # meshgrid turns two one-dimensional vectors into coordinate matrices: it broadcasts
    # the two vectors to obtain the x and y coordinates of y.shape * x.shape grid points.
    # axisx.shape: (30, 30)

    xy = np.vstack([axisx.ravel(), axisy.ravel()]).T
    # ravel() flattens an array to one dimension; vstack stacks several
    # one-dimensional arrays of the same length row by row.
    # xy is the resulting grid: a dense set of points spread over the whole canvas
    plt.scatter(xy[:, 0], xy[:, 1], s=1)

    # Understand the functions meshgrid and vstack
    a = np.array([1, 2, 3])
    b = np.array([7, 8])
    # How many coordinates do you get when you combine them in pairs?
    # The answer is 6: (1,7), (2,7), (3,7), (1,8), (2,8), (3,8)
    v1, v2 = np.meshgrid(a, b)
    v1
    v2
    v = np.vstack([v1.ravel(), v2.ravel()]).T
    # Fit the model to compute the decision boundary
    clf = SVC(kernel="linear").fit(X, y)
    Z = clf.decision_function(xy).reshape(axisx.shape)  # Z.shape: (30, 30)
    # decision_function is an important interface: for each input sample it returns
    # the signed distance to the decision boundary (up to the constant factor 1/||w||).
    # The result is reshaped to the structure of axisx because the plotting function
    # contour requires Z to have the same shape as X and Y.

    # First, there must be a scatter plot
    plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap="rainbow")
    ax = plt.gca()  # get the current axes; create new axes if none exist

    # Draw the decision boundary and the two hyperplanes parallel to it
    ax.contour(axisx, axisy, Z,
               colors="k",
               levels=[-1, 0, 1],  # three contour lines: Z = -1, Z = 0 and Z = 1
               alpha=0.5,  # transparency
               linestyles=["--", "-", "--"])
    ax.set_xlim(xlim)  # set the x-axis range
    ax.set_ylim(ylim)
    # Remember what Z is: the distance from each input sample to the decision boundary.
    # The levels argument of contour takes exactly this kind of value.
    # Let's try it with a single point.
    plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap="rainbow")
    plt.scatter(X[10, 0], X[10, 1], c="black", s=50)

    level = clf.decision_function(X[10].reshape(1, 2))
    plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap="rainbow")
    ax = plt.gca()
    ax.contour(axisx, axisy, Z,
               colors="k",
               levels=level,
               alpha=0.5,
               linestyles=["--"])
    # Wrap the above process into a function:
    def plot_svc_decision_function(model, ax=None):
        if ax is None:
            ax = plt.gca()
        xlim = ax.get_xlim()
        ylim = ax.get_ylim()

        x = np.linspace(xlim[0], xlim[1], 30)
        y = np.linspace(ylim[0], ylim[1], 30)
        Y, X = np.meshgrid(y, x)
        xy = np.vstack([X.ravel(), Y.ravel()]).T
        P = model.decision_function(xy).reshape(X.shape)

        ax.contour(X, Y, P, colors="k", levels=[-1, 0, 1], alpha=0.5,
                   linestyles=["--", "-", "--"])
        ax.set_xlim(xlim)
        ax.set_ylim(ylim)

    # Then the whole drawing process can be written as:
    clf = SVC(kernel="linear").fit(X, y)
    plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap="rainbow")
    plot_svc_decision_function(clf)
    clf.predict(X)
    # Classifies the samples in X according to the decision boundary; the result has length n_samples

    clf.score(X, y)
    # Returns the mean accuracy on the given test data and labels

    clf.support_vectors_
    # Returns the coordinates of the support vectors

    clf.n_support_  # array([2, 1])
    # Returns the number of support vectors in each class
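The relation between decision_function, the support vectors and the margin can be checked numerically. The sketch below is an illustration under assumptions: it uses the same make_blobs settings as the example above, but a very large C (1e6, chosen here to approximate a hard margin), and the names w, scores and margin_width are introduced only for this snippet.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X_demo, y_demo = make_blobs(n_samples=50, centers=2, random_state=0, cluster_std=0.6)

# A very large C approximates a hard margin on this separable data (assumption)
clf_demo = SVC(kernel="linear", C=1e6).fit(X_demo, y_demo)

w = clf_demo.coef_[0]  # normal vector of the hyperplane (available for linear kernels)
scores = clf_demo.decision_function(clf_demo.support_vectors_)  # values of w.x + b
distances = scores / np.linalg.norm(w)  # signed geometric distances to the boundary
margin_width = 2 / np.linalg.norm(w)    # distance between the two dashed lines

print(scores)        # close to -1 or +1 for support vectors lying on the margin
print(distances)
print(margin_width)
```

Dividing the raw decision values by the norm of w turns them into geometric distances, which is why the contour levels -1, 0 and 1 correspond to the two margin lines and the decision boundary.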
Soft margin and the important parameter C
Not all data is completely linearly separable. There may be a straight line that classifies most of the data points correctly, but no line can classify every point correctly. As shown in the figure below, there are purple points mixed into the red category. In this case, no straight line can separate the two classes without error.
Key concepts: hard margin and soft margin

When two sets of data are completely linearly separable, we can find a decision boundary whose classification error on the training set is 0; such data is said to have a "hard margin". When two sets of data are almost linearly separable, but the decision boundary still makes a small error on the training set, the data is said to have a "soft margin".
In the soft margin case, the decision boundary no longer simply seeks the maximum margin, because the larger the margin, the more samples are misclassified. We therefore need to find a balance between "the maximum margin" and "the number of misclassified samples". Parameter C weighs the two goals that cannot be achieved at the same time, "classifying the training samples correctly" and "maximizing the margin of the decision function", in the hope of finding the balance point at which the model works best.
| Parameter | Meaning |
|---|---|
| C | Float; optional; default 1.0; must be strictly positive. Penalty coefficient for the slack variables. If C is large, SVC may choose a decision boundary with a smaller margin that classifies all training points better, but the training time of the model will be longer. If C is small, SVC maximizes the margin as far as possible and the decision function is simpler, at the cost of training accuracy. In other words, C plays a role in SVM similar to that of the regularization parameter in logistic regression. |
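The trade-off controlled by C is usually written as the soft margin optimization problem below. The symbols ($w$, $b$, $\xi_i$, $x_i$, $y_i$, $n$) follow the common textbook notation and are an assumption here, since the original text does not spell the formula out.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\qquad \text{s.t.} \qquad
y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n
```

The first term favours a wide margin (the margin width is $2 / \lVert w \rVert$), while the second term penalizes the slack $\xi_i$ of samples that fall inside the margin or on the wrong side. A large C makes violations expensive, a small C tolerates them.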
In practice, tuning C together with the kernel-related parameters (gamma, degree, etc.) is often the focus of SVM parameter tuning. Unlike gamma, C does not appear in the dual objective function, and its tuning goal is explicit: we can decide in which direction to adjust C by asking whether we need high accuracy on the training set. By default C is 1, which is usually a reasonable value. If the data is noisy, we tend to reduce C. Of course, we can also use a grid search or a learning curve to choose the value of C.
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn import datasets

    iris = datasets.load_iris()
    X = iris.data
    y = iris.target

    # Keep only the first two classes and the first two features
    X = X[y < 2, :2]
    y = y[y < 2]

    plt.scatter(X[y == 0, 0], X[y == 0, 1], color='red')
    plt.scatter(X[y == 1, 0], X[y == 1, 1], color='blue')
    plt.show()

Then standardize the data:

    from sklearn.preprocessing import StandardScaler

    standardScaler = StandardScaler()
    standardScaler.fit(X)
    X_std = standardScaler.transform(X)

First, take a very large C value for observation. In this case, the algorithm approximates a hard margin:

    from sklearn.svm import SVC

    svc = SVC(kernel='linear', C=1e9)
    svc.fit(X_std, y)
Draw decision boundaries and Margin boundaries:
    def plot_svc_decision_boundary(model, axis, ax=None):
        # Build a dense grid over the plotting range [xmin, xmax, ymin, ymax]
        x0, x1 = np.meshgrid(
            np.linspace(axis[0], axis[1], int((axis[1] - axis[0]) * 100)).reshape(-1, 1),
            np.linspace(axis[2], axis[3], int((axis[3] - axis[2]) * 100)).reshape(-1, 1),
        )
        xy = np.vstack([x0.ravel(), x1.ravel()]).T

        y_predict = model.predict(xy)
        zz = y_predict.reshape(x0.shape)
        P = model.decision_function(xy).reshape(x0.shape)

        from matplotlib.colors import ListedColormap
        if ax is None:
            ax = plt.gca()
        custom_cmap = ListedColormap(['#EF9A9A', '#FFF59D', '#90CAF9'])

        # Colored regions for the predicted classes
        ax.contourf(x0, x1, zz, cmap=custom_cmap)
        # Decision boundary (solid) and margin boundaries (dashed)
        ax.contour(x0, x1, P, colors='k', levels=[-1, 0, 1], alpha=0.5,
                   linestyles=["--", "-", "--"])

    plot_svc_decision_boundary(svc, axis=[-3, 3, -3, 3])
    plt.scatter(X_std[y == 0, 0], X_std[y == 0, 1])
    plt.scatter(X_std[y == 1, 0], X_std[y == 1, 1])
    plt.show()
The figure above clearly shows that three blue data points fall on the upper margin boundary and two orange data points fall on the lower margin boundary; these are the support vectors. Because the model approximates a hard margin SVM, no data lies between the margins. The model not only classifies every point correctly, but also places the decision boundary as far as possible from the points closest to it.
Then take a very small C value to see how a soft margin behaves:

    svc2 = SVC(kernel='linear', C=0.005)
    svc2.fit(X_std, y)

    plot_svc_decision_boundary(svc2, axis=[-3, 3, -3, 3])
    plt.scatter(X_std[y == 0, 0], X_std[y == 0, 1])
    plt.scatter(X_std[y == 1, 0], X_std[y == 1, 1])
    plt.show()

It can be seen that one orange point is now misclassified. The smaller C is, the larger the margin: many data points lie inside it, which leaves a lot of room for fault tolerance.
In addition, the soft margin was introduced for data sets that are not linearly separable only because of a small number of outliers. In fact, it can be regarded as adding a regularization term to the hard margin to improve its fault tolerance.
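The effect of C on the margin can also be read off numerically instead of from the plots. The following sketch reuses the two-class, two-feature iris subset and the two C values from the example above; the dictionaries margin_width and n_sv are names introduced only for this illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
mask = iris.target < 2                      # keep the first two classes
X_demo = StandardScaler().fit_transform(iris.data[mask, :2])
y_demo = iris.target[mask]

margin_width, n_sv = {}, {}
for C in (1e9, 0.005):
    model = SVC(kernel="linear", C=C).fit(X_demo, y_demo)
    # For a linear kernel the margin width is 2 / ||w||
    margin_width[C] = 2 / np.linalg.norm(model.coef_[0])
    n_sv[C] = int(model.n_support_.sum())
    print(f"C={C}: margin width={margin_width[C]:.3f}, support vectors={n_sv[C]}")
```

With the small C the margin is wider and many more samples become support vectors, which matches the picture of points lying inside the margin.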
    from sklearn.datasets import load_breast_cancer
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    import matplotlib.pyplot as plt
    import numpy as np
    from time import time
    import datetime
    import pandas as pd

    data = load_breast_cancer()
    X = data.data
    y = data.target

    print(X.shape)       # (569, 30)
    print(np.unique(y))  # [0 1]

    X = StandardScaler().fit_transform(X)
    Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, test_size=0.3, random_state=420)

    # Learning curve over C: test accuracy for 50 values of C
    score = []
    C_range = np.linspace(0.01, 30, 50)
    for i in C_range:
        clf = SVC(kernel="linear", C=i, cache_size=5000).fit(Xtrain, Ytrain)
        score.append(clf.score(Xtest, Ytest))

    print(max(score), C_range[score.index(max(score))])
    # 0.9766081871345029 1.2340816326530613
    plt.plot(C_range, score)
    plt.show()
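The learning curve above selects C by its score on the test set. As an alternative, a cross-validated grid search can choose C on the training data only. This is a hedged sketch, not the author's procedure: the grid of C values, the pipeline step names and the use of 5-fold cross validation are all assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_bc, y_bc = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X_bc, y_bc, test_size=0.3, random_state=420)

# Scaling lives inside the pipeline so each CV fold is standardized on its own training part
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC(kernel="linear"))])
param_grid = {"svc__C": np.logspace(-2, 1, 8)}  # hypothetical grid of C values

search = GridSearchCV(pipe, param_grid, cv=5).fit(Xtr, ytr)
best_C = search.best_params_["svc__C"]
test_acc = search.score(Xte, yte)
print(best_C, search.best_score_, test_acc)
```

Because the test set is not used to pick C, the final test accuracy is a less optimistic estimate than the maximum of a learning curve computed on the test set.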
Nonlinear SVM and kernel function
Important parameters kernel & degree & gamma
Important parameter kernel
As one of the most important parameters of the SVC class, "kernel" accepts the following options in sklearn:
kernel="poly"
- Method 1:
Polynomial thinking: expand the original data and create new polynomial features (add polynomial features to each sample).
Steps:
- PolynomialFeatures(degree=degree): expand the original data to generate polynomial features;
- StandardScaler(): standardize the expanded data;
- LinearSVC: train the model with the linear SVM algorithm;
- Method 2:
Use the kernel function built into scikit-learn: SVC(kernel='poly', degree=degree, C=C)
**Function:** when the kernel parameter of SVC() is 'poly', the data is handled as if polynomial features had been added;
Note: the data also needs to be standardized before using SVC()
Example
1) Generate data
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import datasets

    X, y = datasets.make_moons(noise=0.15, random_state=666)
    # X.shape: (100, 2), y.shape: (100,)

    plt.scatter(X[y == 0, 0], X[y == 0, 1])
    plt.scatter(X[y == 1, 0], X[y == 1, 1])
    plt.show()

2) Drawing function

    def plot_decision_boundary(model, axis):
        # Build a dense grid over the plotting range [xmin, xmax, ymin, ymax]
        x0, x1 = np.meshgrid(
            np.linspace(axis[0], axis[1], int((axis[1] - axis[0]) * 100)).reshape(-1, 1),
            np.linspace(axis[2], axis[3], int((axis[3] - axis[2]) * 100)).reshape(-1, 1)
        )
        X_new = np.c_[x0.ravel(), x1.ravel()]

        # Predict the class of every grid point and color the regions
        y_predict = model.predict(X_new)
        zz = y_predict.reshape(x0.shape)

        from matplotlib.colors import ListedColormap
        custom_cmap = ListedColormap(['#EF9A9A', '#FFF59D', '#90CAF9'])

        plt.contourf(x0, x1, zz, cmap=custom_cmap)
3) Method 1: polynomial thinking
    from sklearn.preprocessing import PolynomialFeatures, StandardScaler
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import Pipeline

    def PolynomialSVC(degree, C=1.0):
        return Pipeline([
            ('poly', PolynomialFeatures(degree=degree)),
            ('stdscaler', StandardScaler()),
            ('linearSVC', LinearSVC(C=C))
        ])

    poly_svc = PolynomialSVC(degree=3)
    poly_svc.fit(X, y)

    plot_decision_boundary(poly_svc, axis=[-1.5, 2.5, -1.0, 1.5])
    plt.scatter(X[y == 0, 0], X[y == 0, 1])
    plt.scatter(X[y == 1, 0], X[y == 1, 1])
    plt.show()
4) Method 2: use kernel function (SVC)
- For the SVM algorithm, scikit-learn can obtain the effect of high-dimensional polynomial data without first transforming the data with PolynomialFeatures and then passing it to the algorithm;
- SVC() with kernel='poly' works with the polynomial features directly;

    from sklearn.svm import SVC
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import Pipeline

    # With kernel='poly', SVC() directly achieves the effect of polynomial features;
    # the data still needs to be standardized before using SVC()
    def PolynomialKernelSVC(degree, C=1.0):
        return Pipeline([
            ('std_scaler', StandardScaler()),
            ('kernelSVC', SVC(kernel='poly', degree=degree, C=C))
        ])

    poly_kernel_svc = PolynomialKernelSVC(degree=3)
    poly_kernel_svc.fit(X, y)

    plot_decision_boundary(poly_kernel_svc, axis=[-1.5, 2.5, -1.0, 1.5])
    plt.scatter(X[y == 0, 0], X[y == 0, 1])
    plt.scatter(X[y == 1, 0], X[y == 1, 1])
    plt.show()
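Why the kernel version does not need explicit polynomial features can be illustrated with a small numerical check. The sketch below is an assumption-laden illustration: it uses degree 2, gamma=1 and coef0=1 so that the kernel is (x·z + 1)^2, and the feature map phi is written by hand for two-dimensional inputs. It compares the kernel values with inner products of the explicit features.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

def phi(X):
    """Explicit degree-2 feature map for 2-D inputs that matches (x.z + 1)^2."""
    x1, x2 = X[:, 0], X[:, 1]
    s = np.sqrt(2)
    return np.column_stack([np.ones(len(X)), s * x1, s * x2,
                            x1 ** 2, s * x1 * x2, x2 ** 2])

rng = np.random.RandomState(0)
A = rng.randn(5, 2)
B = rng.randn(4, 2)

K_explicit = phi(A) @ phi(B).T                                   # inner products in feature space
K_kernel = polynomial_kernel(A, B, degree=2, gamma=1, coef0=1)   # (a.b + 1)^2 computed directly

print(np.allclose(K_explicit, K_kernel))
```

The kernel computes the same inner products without ever building the six-dimensional features, which is what makes high degrees affordable inside SVC.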
kernel = "rbf" and gamma
1. Gaussian kernel function
- μ: expected value (mean); it determines the position of the central axis of the function: x = μ
- σ²: variance; it measures how far random samples deviate from the mean: $\sigma^2 = \sum (X - \mu)^2 / N$, where σ² is the population variance, X is the variable, μ is the population mean and N is the number of cases in the population
- In practice, when the population mean is difficult to obtain, sample statistics are used in place of the population parameters. After correction, the sample variance is computed as $S^2 = \sum (X - \bar{X})^2 / (n - 1)$, where $\bar{X}$ is the sample mean and n is the number of samples.
- σ: standard deviation; it reflects the spread of the sample data: the smaller σ is, the narrower the Gaussian distribution and the more concentrated the samples; the larger σ is, the wider the Gaussian distribution and the more spread out the samples
- γ = 1 / (2σ²): the larger γ is, the narrower the Gaussian distribution and the more concentrated the samples; the smaller γ is, the wider the Gaussian distribution and the more spread out the samples;
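For reference, the quantities in the list above fit together as follows. The formulas are the standard Gaussian density and the usual form of the RBF kernel; they are supplied here as an assumption, because the extracted text does not contain them, and the symbols $x$ and $x'$ for two samples are introduced for this block.

```latex
g(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left( -\frac{(x - \mu)^{2}}{2\sigma^{2}} \right),
\qquad
K(x, x') = \exp\!\left( -\gamma \lVert x - x' \rVert^{2} \right),
\qquad
\gamma = \frac{1}{2\sigma^{2}}
```

Replacing $1/(2\sigma^{2})$ by $\gamma$ and the distance to the mean by the distance between two samples turns the bell curve into a similarity measure: a large $\gamma$ makes the similarity drop off quickly, a small $\gamma$ makes it decay slowly.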
2. RBF kernel in scikit-learn
1) Format

    from sklearn.svm import SVC

    svc = SVC(kernel='rbf', gamma=1.0)  # set the parameter γ = 1.0 directly
2) Simulate a data set, define the drawing function, design the pipeline

Generalization ability is not investigated here; we only look at the decision boundary on the training data set, so train_test_split is not required.

Simulated data set

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import datasets

    X, y = datasets.make_moons(noise=0.15, random_state=666)

    plt.scatter(X[y == 0, 0], X[y == 0, 1])
    plt.scatter(X[y == 1, 0], X[y == 1, 1])
    plt.show()

Drawing function

    def plot_decision_boundary(model, axis):
        # Build a dense grid over the plotting range [xmin, xmax, ymin, ymax]
        x0, x1 = np.meshgrid(
            np.linspace(axis[0], axis[1], int((axis[1] - axis[0]) * 100)).reshape(-1, 1),
            np.linspace(axis[2], axis[3], int((axis[3] - axis[2]) * 100)).reshape(-1, 1)
        )
        X_new = np.c_[x0.ravel(), x1.ravel()]

        # Predict the class of every grid point and color the regions
        y_predict = model.predict(X_new)
        zz = y_predict.reshape(x0.shape)

        from matplotlib.colors import ListedColormap
        custom_cmap = ListedColormap(['#EF9A9A', '#FFF59D', '#90CAF9'])

        plt.contourf(x0, x1, zz, cmap=custom_cmap)

Design the pipeline

    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.pipeline import Pipeline

    def RBFKernelSVC(gamma=1.0):
        return Pipeline([
            ('std_scaler', StandardScaler()),
            ('svc', SVC(kernel='rbf', gamma=gamma))
        ])
3) Adjust the parameter γ to obtain different decision boundaries
γ = 0.1

    svc_gamma_01 = RBFKernelSVC(gamma=0.1)
    svc_gamma_01.fit(X, y)

    plot_decision_boundary(svc_gamma_01, axis=[-1.5, 2.5, -1.0, 1.5])
    plt.scatter(X[y == 0, 0], X[y == 0, 1])
    plt.scatter(X[y == 1, 0], X[y == 1, 1])
    plt.show()

γ = 0.5

    svc_gamma_05 = RBFKernelSVC(gamma=0.5)
    svc_gamma_05.fit(X, y)

    plot_decision_boundary(svc_gamma_05, axis=[-1.5, 2.5, -1.0, 1.5])
    plt.scatter(X[y == 0, 0], X[y == 0, 1])
    plt.scatter(X[y == 1, 0], X[y == 1, 1])
    plt.show()

γ = 1

    svc_gamma_1 = RBFKernelSVC(gamma=1.0)
    svc_gamma_1.fit(X, y)

    plot_decision_boundary(svc_gamma_1, axis=[-1.5, 2.5, -1.0, 1.5])
    plt.scatter(X[y == 0, 0], X[y == 0, 1])
    plt.scatter(X[y == 1, 0], X[y == 1, 1])
    plt.show()

γ = 10

    svc_gamma_10 = RBFKernelSVC(gamma=10)
    svc_gamma_10.fit(X, y)

    plot_decision_boundary(svc_gamma_10, axis=[-1.5, 2.5, -1.0, 1.5])
    plt.scatter(X[y == 0, 0], X[y == 0, 1])
    plt.scatter(X[y == 1, 0], X[y == 1, 1])
    plt.show()

γ = 100

    svc_gamma_100 = RBFKernelSVC(gamma=100)
    svc_gamma_100.fit(X, y)

    plot_decision_boundary(svc_gamma_100, axis=[-1.5, 2.5, -1.0, 1.5])
    plt.scatter(X[y == 0, 0], X[y == 0, 1])
    plt.scatter(X[y == 1, 0], X[y == 1, 1])
    plt.show()
4) Analysis

As the parameter γ goes from small to large, the model goes through three stages: underfitting, good fit, overfitting.

When γ = 100:
- Phenomenon: a "bell" pattern forms around each blue sample, with the blue sample point at the top of the "bell";
- Reason: the value of γ is too large, so the "bell" formed around each sample is very narrow and the model overfits;
- Geometric meaning of the decision boundary: only points that fall inside a "bell" are classified as blue; all other points are assigned to the other class;

When γ = 10, γ is smaller, the "bell" around each sample becomes wider, and the "bell" areas of different samples overlap to form the region of the blue class;

The smaller the hyperparameter γ, the lower the complexity of the model; the larger γ, the higher the complexity of the model;
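The underfitting-to-overfitting progression can also be seen in numbers by comparing training and test accuracy. The sketch below deviates from the example above in two labelled ways: it draws a larger make_moons sample (400 points, an assumption) and holds out 30% of it as a test set. The dictionaries train_acc and test_acc are names introduced for this illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = make_moons(n_samples=400, noise=0.15, random_state=666)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

train_acc, test_acc = {}, {}
for gamma in (0.1, 1.0, 10, 100):
    model = Pipeline([
        ("std_scaler", StandardScaler()),
        ("svc", SVC(kernel="rbf", gamma=gamma)),
    ]).fit(Xtr, ytr)
    train_acc[gamma] = model.score(Xtr, ytr)   # accuracy on data the model has seen
    test_acc[gamma] = model.score(Xte, yte)    # accuracy on held-out data
    print(f"gamma={gamma}: train={train_acc[gamma]:.3f}, test={test_acc[gamma]:.3f}")
```

A growing gap between training and test accuracy at large γ is the usual sign that the narrow "bells" are memorizing individual samples.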
Performance of Kernel on different data sets
    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.colors import ListedColormap
    from sklearn import svm  # "from sklearn.svm import SVC" works as well
    from sklearn.datasets import make_circles, make_moons, make_blobs, make_classification

    n_samples = 100

    datasets = [
        make_moons(n_samples=n_samples, noise=0.2, random_state=0),
        make_circles(n_samples=n_samples, noise=0.2, factor=0.5, random_state=1),
        make_blobs(n_samples=n_samples, centers=2, random_state=5),  # clustered data set
        make_classification(n_samples=n_samples, n_features=2, n_informative=2,
                            n_redundant=0, random_state=5)
        # n_features: number of features, n_informative: number of informative features,
        # n_redundant: number of redundant features
    ]

    Kernel = ["linear", "poly", "rbf", "sigmoid"]

    # What do the four data sets look like?
    for X, Y in datasets:
        plt.figure(figsize=(5, 4))
        plt.scatter(X[:, 0], X[:, 1], c=Y, s=50, cmap="rainbow")
    nrows = len(datasets)
    ncols = len(Kernel) + 1

    fig, axes = plt.subplots(nrows, ncols, figsize=(20, 16))

    [*enumerate(datasets)] == list(enumerate(datasets))
    # enumerate, map and zip objects can all be expanded in this way
    # Each element has the form (index, (X, Y)) = (index, (feature matrix X, labels Y))
    nrows = len(datasets)
    ncols = len(Kernel) + 1

    fig, axes = plt.subplots(nrows, ncols, figsize=(20, 16))

    # First-level loop: iterate over the data sets
    for ds_cnt, (X, Y) in enumerate(datasets):

        # The first column of the figure shows the distribution of the original data
        ax = axes[ds_cnt, 0]
        if ds_cnt == 0:
            ax.set_title("Input data")
        ax.scatter(X[:, 0], X[:, 1], c=Y, zorder=10, cmap=plt.cm.Paired, edgecolors='k')
        ax.set_xticks(())
        ax.set_yticks(())

        # Second-level loop: iterate over the kernel functions
        # Starting from the second column, fill in the classification results one by one
        for est_idx, kernel in enumerate(Kernel):

            # Select the subplot
            ax = axes[ds_cnt, est_idx + 1]

            # Fit the model
            clf = svm.SVC(kernel=kernel, gamma=2).fit(X, Y)
            score = clf.score(X, Y)

            # Scatter plot of the data itself
            ax.scatter(X[:, 0], X[:, 1], c=Y,
                       zorder=10, cmap=plt.cm.Paired, edgecolors='k')
            # Mark the support vectors
            ax.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=50,
                       facecolors='none', zorder=10, edgecolors='k')  # facecolors='none': transparent fill

            # Draw the decision boundary
            x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
            y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5

            # np.mgrid combines np.linspace and np.meshgrid, which we used before:
            # it builds the grid directly from the minimum and maximum values,
            # written as [start:end:step]. If the step is a complex number, its
            # magnitude is the number of points created between start and end,
            # and the end value is included.
            XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
            # np.c_ works similarly to np.vstack
            Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()]).reshape(XX.shape)
            # Fill the regions on the two sides of the boundary with different colors
            ax.pcolormesh(XX, YY, Z > 0, cmap=plt.cm.Paired)
            # Draw the contour lines
            ax.contour(XX, YY, Z, colors=['k', 'k', 'k'], linestyles=['--', '-', '--'],
                       levels=[-1, 0, 1])

            # Hide the axis ticks
            ax.set_xticks(())
            ax.set_yticks(())

            # Put the kernel name as a title on the first row
            if ds_cnt == 0:
                ax.set_title(kernel)

            # Add the classification score to each plot
            ax.text(0.95, 0.06, ('%.2f' % score).lstrip('0'),
                    size=15,
                    bbox=dict(boxstyle='round', alpha=0.8, facecolor='white'),  # white box behind the score; boxstyle sets the box shape
                    transform=ax.transAxes,  # position the text in the axes coordinates of this subplot
                    horizontalalignment='right')  # horizontal alignment of the text

    plt.tight_layout()  # automatically adjust the subplot parameters to fill the figure
    plt.show()
It can be observed that the performance of the linear and polynomial kernels fluctuates on nonlinear data. If the data is close to linearly separable, they perform well; if it is completely inseparable, like the ring data, they perform poorly. On linear data sets, the linear and polynomial kernels perform well even when there is noise. Although the polynomial kernel can also handle nonlinear cases, it behaves more like a linear function.
The sigmoid kernel is in an awkward position. On nonlinear data it is stronger than the linear and polynomial kernels, but clearly not as good as rbf. On linear data it is completely inferior to the linear kernel, and it has weak resistance to noise. It is therefore relatively weak and rarely used.
rbf, the Gaussian radial basis kernel function, performs well on almost any data set and is the general-purpose kernel. My personal experience is to try the Gaussian radial basis kernel first in any case: it maps the data to a high-dimensional space and its results are usually very good. If rbf does not work well, then try the other kernels. In addition, polynomial kernels are mostly used in image processing.
Advantages and disadvantages of kernel function
    from sklearn.datasets import load_breast_cancer
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split
    import matplotlib.pyplot as plt
    import numpy as np
    from time import time
    import datetime

    data = load_breast_cancer()
    X = data.data
    y = data.target

    X.shape       # (569, 30)
    np.unique(y)  # array([0, 1])

    plt.scatter(X[:, 0], X[:, 1], c=y)
    plt.show()

    Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, test_size=0.3, random_state=420)

    Kernel = ["linear", "poly", "rbf", "sigmoid"]

    # "poly" is left out of this run because it is very slow on the unscaled data
    Kernel = ["linear", "rbf", "sigmoid"]

    for kernel in Kernel:
        time0 = time()
        clf = SVC(kernel=kernel,
                  gamma="auto",
                  # degree=1,        # the default is 3
                  cache_size=10000  # memory used for the kernel cache in MB; the default is 200 MB
                  ).fit(Xtrain, Ytrain)
        print("The accuracy under kernel %s is %f" % (kernel, clf.score(Xtest, Ytest)))
        print(time() - time0)
The polynomial kernel needs a lot of time at this point and runs very slowly.
We can make two observations. First, the breast cancer data set is a linear data set, and the linear kernel performs very well on it. rbf and sigmoid, which are good at nonlinear data, perform very poorly here. Second, the linear kernel runs far slower than the two nonlinear kernels. If the data is linear and we set the degree parameter to 1, the polynomial kernel should also give good results:
rbf can also perform very well on linear data. Why is the result so bad here? In fact, the real problem is the scale of the features.
Finding the decision boundary depends on computing "distances". Although we cannot say that SVM is purely a distance-based model, it is seriously affected by the scale of the features. Let's explore the feature scales of the breast cancer data set:
    import pandas as pd

    data = pd.DataFrame(X)
    data.describe([0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99]).T  # descriptive statistics
    # The mean and std columns show that the features are on very different scales.
    # Compare the 1% quantile with the minimum and the 90% quantile with the maximum to
    # judge whether a feature is roughly normal or skewed: a large difference indicates a
    # skewed distribution, and the direction of the gap shows which side it is skewed to.
    # The features with large values turn out to be skewed,
    # so the data needs to be standardized.

    from sklearn.preprocessing import StandardScaler

    X = StandardScaler().fit_transform(X)
    Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, test_size=0.3, random_state=420)

    Kernel = ["linear", "poly", "rbf", "sigmoid"]

    for kernel in Kernel:
        time0 = time()
        clf = SVC(kernel=kernel,
                  gamma="auto",
                  degree=1,
                  cache_size=5000
                  ).fit(Xtrain, Ytrain)
        print("The accuracy under kernel %s is %f" % (kernel, clf.score(Xtest, Ytest)))
        print(time() - time0)
After the features are put on a common scale, the running time of all kernels drops sharply, especially for the linear kernel, and the polynomial kernel becomes the fastest. In addition, rbf now gives excellent results. From this exploration we can conclude that:
1. The linear kernel, and especially the polynomial kernel with high-degree terms, is very slow to compute
2. The rbf and polynomial kernels do not handle data sets whose features are on different scales well
Fortunately, both shortcomings can be resolved by making the data dimensionless (standardizing it). It is therefore highly recommended to standardize the data before fitting an SVM!
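The recommendation can be checked with a short comparison. This sketch is one possible way to do it, not the procedure used above: it evaluates an rbf SVC with and without StandardScaler by 5-fold cross validation on the breast cancer data, and it puts the scaler inside a pipeline so that each fold is standardized using only its own training part. The names raw_score and scaled_score are introduced for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_bc, y_bc = load_breast_cancer(return_X_y=True)

raw_model = SVC(kernel="rbf", gamma="auto")                                   # no scaling
scaled_model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="auto"))

raw_score = cross_val_score(raw_model, X_bc, y_bc, cv=5).mean()
scaled_score = cross_val_score(scaled_model, X_bc, y_bc, cv=5).mean()

print(f"rbf without scaling: {raw_score:.3f}")
print(f"rbf with scaling:    {scaled_score:.3f}")
```

Keeping the scaler in the pipeline also avoids fitting it on test data, which happens when the whole data set is standardized before splitting.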
Sample imbalance in binary SVC: class_weight
In classification, one pain point that never goes away is sample imbalance. Sample imbalance means that one class of labels naturally accounts for a large proportion of the data set, while the class we actually need to capture is rare. For example, suppose we need to separate potential criminals from ordinary people. Potential criminals make up a very small share of the population, perhaps only about 2%, while 98% are ordinary people, and our goal is to capture the potential criminals. Such a label distribution causes several problems.
First, the classification model naturally leans toward the majority class, so the majority class is more easily judged correctly and the minority class is sacrificed. For the model, the larger the sample size of a label, the more information it can learn from that label, so the algorithm relies more on what it learned from the majority class. If we want to capture the minority class, the model will fail. Second, the evaluation metric loses its meaning. In this situation, even if the model does nothing and treats everyone as a person who will not commit a crime, the accuracy can be very high. Accuracy then becomes meaningless and the modeling goal of "identifying people who will commit a crime" is not achieved.
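The second problem can be made concrete with a few lines of NumPy. The labels below are made up to match the 2% / 98% split in the example, and the "model" simply predicts the majority class for everyone.

```python
import numpy as np

# Hypothetical labels: 2 potential criminals (1) among 100 people
y_true = np.array([1] * 2 + [0] * 98)

# A "model" that does nothing and predicts 0 for everyone
y_pred = np.zeros_like(y_true)

# Share of all predictions that are correct
accuracy = (y_pred == y_true).mean()

# Share of the true positives that were actually captured
recall = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()

print("accuracy:", accuracy)
print("recall:  ", recall)
```

The accuracy is 0.98 although not a single minority sample is captured, which is why accuracy alone cannot be trusted on imbalanced data.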
To solve the first problem, we introduced some basic methods in logistic regression, such as upsampling and downsampling. However, upsampling increases the total number of samples. For support vector machines, where the sample size always has a large impact on computation speed, we do not want to increase the number of samples lightly. Moreover, decisions in a support vector machine depend only on the decision boundary, and the decision boundary depends only on the parameter C and the support vectors. Simply adding samples not only increases computation time but also adds many sample points that have no effect on the decision boundary. Therefore, in support vector machines we rely heavily on the parameters that adjust sample balance: the class_weight parameter of the SVC class and the sample_weight parameter of its fit interface.

In logistic regression, the parameter class_weight defaults to None. This mode assumes that all labels in the data set are balanced, that is, the label proportion is automatically taken to be 1:1. When the samples are imbalanced, we can pass a dictionary such as {"label value 1": weight 1, "label value 2": weight 2} to supply the real label proportions, so that the algorithm is aware of the imbalance. Alternatively, we can use the "balanced" mode, which uses n_samples / (n_classes*np.bincount(y)) as the weights and corrects the imbalance quite well.
In SVM, however, classification is based on the decision boundary, and the parameter that ultimately determines which support vectors are used and where the decision boundary lies is C, so all sample balancing is done through C.
SVC parameter: class_weight
Accepts a dictionary, "balanced", or None (the default). For SVC, the parameter C of class i is set to class_weight[i]*C. If class_weight is not given, all classes are assumed to have the same weight 1, and the model is trained on the data as it is. To correct for sample imbalance, pass a dictionary of the form {"label value 1": weight 1, "label value 2": weight 2}; the parameter C is then automatically set to:
C for label value 1: weight 1 * C; C for label value 2: weight 2 * C
Alternatively, use the "balanced" mode, which uses the values of y to set weights inversely proportional to the class frequencies in the input data: n_samples / (n_classes*np.bincount(y))
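As a quick check of the "balanced" formula, the sketch below computes the weights by hand for a made-up label vector with 500 samples of class 0 and 50 of class 1, and compares them with sklearn's `compute_class_weight` helper. The label vector, `C = 1.0` and the variable names are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 500 samples of class 0, 50 of class 1
y = np.array([0] * 500 + [1] * 50)
C = 1.0  # assumed base penalty, the SVC default

# The formula used by the "balanced" mode
n_samples = len(y)
n_classes = len(np.unique(y))
manual = n_samples / (n_classes * np.bincount(y))

# The same weights computed by sklearn
auto = compute_class_weight("balanced", classes=np.unique(y), y=y)

# Per-class penalty: class_weight[i] * C
effective_C = manual * C

print("manual weights:", manual)
print("sklearn weights:", auto)
print("effective C per class:", effective_C)
```

The minority class receives a weight ten times larger than the majority class, mirroring the 10:1 class ratio.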
Parameter of the SVC fit interface: sample_weight
An array of shape (n_samples,), with one entry for each sample of the feature matrix passed to fit.
The weight of each sample at fit time * the C value for that sample forces the classifier to emphasize samples with larger weights. Typically, larger weights are given to minority-class samples so that the model is pushed toward modeling the minority class.
Generally speaking, we set only one of these two parameters. If both are set at the same time, C is affected by both, that is: weight set in class_weight * weight set in sample_weight * C.
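Since both parameters act by rescaling C, giving every class-1 sample a weight of 10 through sample_weight should lead to the same model as class_weight={1: 10}. The sketch below checks this on a small synthetic imbalanced data set; the data, the weight of 10 and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic imbalanced data: 500 samples of class 0, 50 of class 1
X, y = make_blobs(n_samples=[500, 50],
                  centers=[[0.0, 0.0], [2.0, 2.0]],
                  cluster_std=[1.5, 0.5],
                  random_state=0, shuffle=False)

# Weighting by class: C for class 1 becomes 10 * C
by_class = SVC(kernel="linear", C=1.0, class_weight={1: 10}).fit(X, y)

# Weighting by sample: every class-1 sample gets weight 10, the rest 1
weights = np.where(y == 1, 10.0, 1.0)
by_sample = SVC(kernel="linear", C=1.0).fit(X, y, sample_weight=weights)

# Compare the two fitted models
same_predictions = np.array_equal(by_class.predict(X), by_sample.predict(X))
max_gap = np.abs(by_class.decision_function(X)
                 - by_sample.decision_function(X)).max()

print("same predictions:", same_predictions)
print("largest decision_function gap:", max_gap)
```

In practice class_weight is the more convenient choice when the weight depends only on the label, while sample_weight is useful when individual samples need different emphasis.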
Let's take a look at how to use this parameter.
First, we build a data set with imbalanced classes. We fit two SVC models on it, one with the class_weight parameter and one without, then evaluate both models and draw their decision boundaries to observe the effect of class_weight.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.datasets import make_blobs

class_1 = 500  # Category 1 has 500 samples, a 10:1 ratio
class_2 = 50   # Category 2 has only 50
centers = [[0.0, 0.0], [2.0, 2.0]]  # Centers of the two categories
clusters_std = [1.5, 0.5]  # Standard deviation of each category; the larger category is usually more spread out
X, y = make_blobs(n_samples=[class_1, class_2],
                  centers=centers,
                  cluster_std=clusters_std,
                  random_state=0, shuffle=False)
X.shape
# (550, 2)

# See what the data set looks like
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="rainbow", s=10)
plt.show()
# Red dots are the minority class and purple dots are the majority class
# Do not set class_weight
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)

# Set class_weight
wclf = svm.SVC(kernel='linear', class_weight={1: 10})
wclf.fit(X, y)

# Score the two models separately. This score is the accuracy.
# After sample balancing the accuracy decreases; it is higher without balancing.
print(clf.score(X, y))
# 0.9418181818181818
print(wclf.score(X, y))
# 0.9127272727272727
# First, plot the data distribution
plt.figure(figsize=(6, 5))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="rainbow", s=10)
ax = plt.gca()  # Get the current axes; create new ones if none exist

# Step 1 of drawing decision boundaries: build a grid
xlim = ax.get_xlim()
ylim = ax.get_ylim()
xx = np.linspace(xlim[0], xlim[1], 30)
yy = np.linspace(ylim[0], ylim[1], 30)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T

# Step 2: compute the distance from every grid point to each decision boundary
Z_clf = clf.decision_function(xy).reshape(XX.shape)
a = ax.contour(XX, YY, Z_clf, colors='black', levels=[0], alpha=0.5, linestyles=['-'])
Z_wclf = wclf.decision_function(xy).reshape(XX.shape)
b = ax.contour(XX, YY, Z_wclf, colors='red', levels=[0], alpha=0.5, linestyles=['-'])

# Step 3: draw the legend
# Note: ContourSet.collections is deprecated in newer Matplotlib releases;
# this call assumes a version that still provides it.
plt.legend([a.collections[0], b.collections[0]], ["non weighted", "weighted"],
           loc="upper right")
plt.show()
a.collections  # All lines drawn by this contour object, returned as a lazy list-like object
# <a list of 1 mcoll.LineCollection objects>

# Unpack it with [*]
[*a.collections]  # A list of LineCollection objects, one for each line in the contour
# [<matplotlib.collections.LineCollection at 0x1855e926dc0>]

# There is only one line here, so index 0 selects it
a.collections[0]
# <matplotlib.collections.LineCollection at 0x1855e926dc0>

# plt.legend([list of objects], [list of labels], loc)
# The legend is displayed as long as the object list matches the label list

# Precision: samples correctly predicted as 1 / all samples predicted as 1
# For the grey decision boundary, without class_weight (no sample balancing):
(y[y == clf.predict(X)] == 1).sum()/(clf.predict(X) == 1).sum()
# 0.7142857142857143

# True = 1, False = 0, so this counts the class-1 points whose prediction equals the true value
(y[y == clf.predict(X)] == 1).sum()
# 30

# For the red decision boundary, with class_weight (sample balancing):
(y[y == wclf.predict(X)] == 1).sum()/(wclf.predict(X) == 1).sum()
# 0.5102040816326531

# Recall: samples correctly predicted as 1 / all samples that are truly 1
# For the grey decision boundary, without class_weight (no sample balancing):
(y[y == clf.predict(X)] == 1).sum()/(y == 1).sum()
# 0.6

# For the red decision boundary, with class_weight (sample balancing):
(y[y == wclf.predict(X)] == 1).sum()/(y == 1).sum()
# 1.0

# Specificity: samples correctly predicted as 0 / all samples that are truly 0
# For the grey decision boundary, without class_weight (no sample balancing):
(y[y == clf.predict(X)] == 0).sum()/(y == 0).sum()
# 0.976

# For the red decision boundary, with class_weight (sample balancing):
(y[y == wclf.predict(X)] == 0).sum()/(y == 0).sum()
# 0.904
In terms of accuracy, the model without sample balancing scores higher and the model with sample balancing scores lower. This is because, after balancing, the model captures the minority class more effectively at the cost of misclassifying many majority-class samples, and the number of majority-class samples that become misclassified exceeds the number of minority-class samples that become correctly classified. If our goal is to maximize the overall accuracy of the model, we must give up sample balancing and use the earlier model, the one without class_weight.
In reality, however, we often want to capture the minority class, because in many cases the cost of misjudging a minority sample is huge. In that case we must use the model with class_weight set.
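The same comparison can be written with the standard metric functions in sklearn.metrics instead of hand-written ratios. This is a sketch under the same data settings as above; the dictionaries `models` and `report` are names introduced here for illustration, and the exact numbers may differ slightly between library versions.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.svm import SVC

# Same imbalanced data as in the example above
X, y = make_blobs(n_samples=[500, 50],
                  centers=[[0.0, 0.0], [2.0, 2.0]],
                  cluster_std=[1.5, 0.5],
                  random_state=0, shuffle=False)

models = {
    "non weighted": SVC(kernel="linear", C=1.0),
    "weighted": SVC(kernel="linear", class_weight={1: 10}),
}

report = {}
for name, model in models.items():
    pred = model.fit(X, y).predict(X)
    # For binary labels 0/1 the flattened matrix is ordered tn, fp, fn, tp
    tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
    report[name] = {
        "accuracy": (tp + tn) / len(y),
        "precision": precision_score(y, pred),  # tp / (tp + fp)
        "recall": recall_score(y, pred),        # tp / (tp + fn)
        "specificity": tn / (tn + fp),
    }
    print(name, report[name])
```

Reading the four numbers side by side makes the trade-off explicit: the weighted model gains recall on the minority class and pays for it in precision and specificity.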