KNN classification and classification prediction based on support vector machine
Iris dataset - KNN classification
Understanding: calculate the distance between the unknown sample and every known sample, select the K known samples closest to the unknown sample, and assign the unknown sample to the class that is most common among those K nearest samples (majority voting).
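The three steps described above (distances, K nearest, majority vote) can be sketched directly in NumPy. This is only a minimal illustration, not how scikit-learn implements KNN; the function name knn_predict and the toy data are made up for this example.

```python
from collections import Counter

import numpy as np


def knn_predict(x_train, y_train, x_query, k=3):
    """Classify one query point by majority vote among its k nearest neighbours."""
    # Step 1: Euclidean distance from the query to every known sample
    distances = np.linalg.norm(x_train - x_query, axis=1)
    # Step 2: indices of the k known samples with the smallest distance
    nearest = np.argsort(distances)[:k]
    # Step 3: the most common label among those neighbours wins
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (hypothetical values): two well-separated groups
x_known = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_known = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(x_known, y_known, np.array([1.1, 1.0]), k=3))  # query near the first group
print(knn_predict(x_known, y_known, np.array([4.9, 5.0]), k=3))  # query near the second group
```

With an even K and two classes a tie is possible; this sketch simply returns whichever tied label Counter lists first, so an odd K is the safer choice here.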
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data = load_iris()
iris_target = data.target
# Convert the feature matrix to a pandas DataFrame
iris_features = pd.DataFrame(data=data.data, columns=data.feature_names)

# To evaluate the model fairly, split the data into a training set and a test set:
# the model is trained on the training set and checked on the test set.
x_train, x_test, y_train, y_test = train_test_split(iris_features, iris_target, random_state=2020)

# Define the KNN model: use the 5 nearest samples.
# metric='minkowski' with the default p=2 is the Euclidean distance.
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski')

# Train the KNN model on the training set
knn.fit(x_train, y_train)

# Predict on both the training set and the test set
train_predict = knn.predict(x_train)
test_predict = knn.predict(x_test)

# Accuracy = number of correctly predicted samples / total number of predicted samples
print('Training accuracy:', metrics.accuracy_score(y_train, train_predict))
print('Test accuracy:', metrics.accuracy_score(y_test, test_predict))

# View the confusion matrix.
# The signature is confusion_matrix(y_true, y_pred). The predictions are passed first here,
# so rows are predicted labels and columns are true labels.
confusion_matrix_result = metrics.confusion_matrix(test_predict, y_test)
print('Confusion matrix results:\n', confusion_matrix_result)

plt.figure(figsize=(8, 6))  # width and height of the figure, in inches
sns.heatmap(confusion_matrix_result, annot=True, cmap='Blues')
plt.xlabel('True labels')
plt.ylabel('Predicted labels')
plt.show()
Confusion matrix results:
[[15 0 0]
[ 0 10 2]
[ 0 1 10]]
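The test accuracy can be recovered from this matrix alone, because the diagonal holds the correctly classified samples. The sketch below copies the printed matrix by hand; the variable names are made up for illustration.

```python
import numpy as np

# Confusion matrix copied from the printed result above
cm = np.array([[15, 0, 0],
               [0, 10, 2],
               [0, 1, 10]])

correct = np.trace(cm)  # diagonal entries: prediction equals true label
total = cm.sum()        # every test sample appears exactly once
accuracy = correct / total

print('correct:', correct, 'of', total)  # 35 of 38
print('accuracy:', round(accuracy, 3))   # 0.921
```

The three off-diagonal samples are all confusions between the second and third classes; the first class is separated perfectly.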
Classification and prediction based on support vector machine
Understanding of SVM: construct a separating hyperplane that maximizes the distance from the nearest sample points of the two classes to the plane (i.e. it achieves the maximum margin).
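For reference, the maximum-margin idea is usually written as the following optimization problem. This is the standard textbook hard-margin form, not something taken from the original notes: the symbols w (normal vector), b (offset), x_i (samples) and y_i in {-1, +1} (labels) are introduced here as assumptions.

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w\rVert^{2}
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i + b) \ge 1,\qquad i = 1,\dots,n
```

Under these constraints the margin width is 2 / ||w||, so minimizing ||w|| is the same as maximizing the margin.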
sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape=None,random_state=None)
C: penalty parameter of C-SVC. The default value is 1.0.
A larger C puts a heavier penalty on the slack variables and pushes them toward 0. Misclassification is penalized more, so the model tries to separate the training set completely. Accuracy on the training set is then very high, but generalization tends to be weaker. A smaller C reduces the penalty for misclassification and tolerates some errors by treating those points as noise, which usually generalizes better.
kernel: kernel function. The default is 'rbf'. It can be 'linear', 'poly', 'rbf', 'sigmoid' or 'precomputed'.
polynomial: (gamma*u'*v + coef0)^degree
sigmoid: tanh(gamma*u'*v + coef0)
RBF: exp(-gamma*|u-v|^2)
degree: the degree of the polynomial kernel 'poly'. The default value is 3. It is ignored by the other kernel functions.
gamma: kernel coefficient for 'rbf', 'poly' and 'sigmoid'. In the signature shown here the default is 'auto', which uses 1 / n_features. (Recent scikit-learn releases default to 'scale' instead.)
coef0: constant term of kernel function. Useful for 'poly' and 'sigmoid'.
probability: whether to enable probability estimates. Boolean, optional. The default value is False.
This must be set when the model is created, before calling fit(), if you want to use the related methods predict_proba and predict_log_proba.
shrinking: whether to use the shrinking heuristic. The default value is True.
tol: tolerance for the stopping criterion. The default value is 1e-3.
cache_size: kernel cache size, in MB. The default value is 200.
class_weight: per-class weights, passed as a dictionary. The parameter C of a class is set to weight * C (the C of C-SVC).
verbose: whether to enable verbose output.
max_iter: maximum number of iterations. -1 means no limit.
decision_function_shape: 'ovo', 'ovr' or None. In the signature shown here the default is None. (Recent scikit-learn releases default to 'ovr'.)
random_state: seed used when shuffling the data; an int value.
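The kernel formulas listed above for 'poly', 'sigmoid' and 'rbf' can be checked numerically against scikit-learn's pairwise kernel helpers. This is a small sanity check rather than part of the original notes; the vectors u and v and the values of gamma, coef0 and degree are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import (
    polynomial_kernel,
    rbf_kernel,
    sigmoid_kernel,
)

# Two arbitrary example vectors, as 1 x 2 arrays (hypothetical values)
u = np.array([[1.0, 2.0]])
v = np.array([[0.5, -1.0]])
gamma, coef0, degree = 0.5, 1.0, 3

# Kernel values written out by hand from the formulas
dot = (u @ v.T).item()                        # u'*v
poly = (gamma * dot + coef0) ** degree        # (gamma*u'*v + coef0)^degree
sig = np.tanh(gamma * dot + coef0)            # tanh(gamma*u'*v + coef0)
rbf = np.exp(-gamma * np.sum((u - v) ** 2))   # exp(-gamma*|u-v|^2)

# The same kernels computed by scikit-learn
poly_sk = polynomial_kernel(u, v, degree=degree, gamma=gamma, coef0=coef0).item()
sig_sk = sigmoid_kernel(u, v, gamma=gamma, coef0=coef0).item()
rbf_sk = rbf_kernel(u, v, gamma=gamma).item()

print('poly   :', poly, poly_sk)
print('sigmoid:', sig, sig_sk)
print('rbf    :', rbf, rbf_sk)
```

Each pair of printed values should agree up to floating-point rounding, which confirms how gamma, coef0 and degree enter each kernel.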
import matplotlib.pyplot as plt
import numpy as np
from sklearn import svm

x_features = np.array([[-1, -2], [-2, -1], [-3, -2], [1, 3], [2, 1], [3, 2]])
y_label = np.array([0, 0, 0, 1, 1, 1])

# SVM with a linear kernel
clf = svm.SVC(kernel='linear')
clf.fit(x_features, y_label)

# View the weights w of the fitted model
print('the weight of the SVM:', clf.coef_)
# View the intercept w0 of the fitted model
print('the intercept(w0) of the SVM:', clf.intercept_)

y_train_pred = clf.predict(x_features)
print('The prediction result:', y_train_pred)

# The decision boundary is w[0]*x + w[1]*y + w0 = 0, i.e. y = -(w[0]/w[1])*x - w0/w[1]
x_range = np.linspace(-3, 3)
w = clf.coef_[0]
a = -w[0] / w[1]
y_3 = a * x_range - clf.intercept_[0] / w[1]

# Visualize the decision boundary
plt.figure()
plt.scatter(x_features[:, 0], x_features[:, 1], c=y_label, s=50, cmap='viridis')
plt.plot(x_range, y_3, '-c')
plt.show()
the weight of the SVM: [[0.33364706 0.33270588]]
the intercept(w0) of the SVM: [-0.00031373]
The prediction result: [0 0 0 1 1 1]
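For the six-point toy example above, the fitted weights also give the width of the margin, and the support vectors are the points that lie on its edges. The sketch below refits the same data so that it runs on its own; the margin formula 2 / ||w|| is the standard result for a linear SVM, and the variable names are chosen here for illustration.

```python
import numpy as np
from sklearn import svm

# Same toy data as in the example above
x_features = np.array([[-1, -2], [-2, -1], [-3, -2], [1, 3], [2, 1], [3, 2]])
y_label = np.array([0, 0, 0, 1, 1, 1])

clf = svm.SVC(kernel='linear')
clf.fit(x_features, y_label)

# Margin width for a linear SVM: 2 / ||w||
w = clf.coef_[0]
margin = 2 / np.linalg.norm(w)

# Points that define the margin
support_vectors = clf.support_vectors_
# Signed score w.x + w0; on this separable data the support vectors score close to -1 or +1
scores = clf.decision_function(support_vectors)

print('support vectors:\n', support_vectors)
print('margin width:', margin)
print('scores of support vectors:', scores)
```

Points that are not support vectors, such as [-3, -2] and [3, 2], could be moved further from the boundary without changing the fitted hyperplane.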