KNN classification and classification prediction based on support vector machine

Iris dataset - KNN classification

KNN classification

Understanding: compute the distance between the unknown sample and every known sample, select the K known samples closest to it, and assign the unknown sample to the class that is most common among those K neighbors (majority voting rule).
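The rule described above can be sketched in a few lines of plain Python. This is only an illustration, not the scikit-learn implementation: the function name knn_predict and the toy points are made up for the example, and ties between equally distant samples are not handled specially.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every known sample
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest samples
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the labels of those neighbors
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(points, labels, (1.1, 1.0), k=3))  # a
print(knn_predict(points, labels, (5.1, 5.0), k=3))  # b
```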

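import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

# Load the iris data: features are in data.data, class labels in data.target
data = load_iris()
iris_target = data.target
# Convert the feature matrix to DataFrame format using Pandas
iris_features = pd.DataFrame(data=data.data, columns=data.feature_names)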

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
# To evaluate the model fairly, split the data into a training set and a test set: train on the training set, then verify performance on the test set.
x_train, x_test, y_train, y_test = train_test_split(iris_features, iris_target, random_state=2020)
## Define the KNN model
# Use the five nearest samples; the Minkowski metric with the default p=2 is the Euclidean distance
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski')
## Train the KNN model on the training set
knn.fit(x_train, y_train)

## Use the trained model to predict on both the training set and the test set
train_predict = knn.predict(x_train)
test_predict = knn.predict(x_test)

## Evaluate the model by accuracy: the proportion of correctly predicted samples among all predicted samples
print('Training accuracy:', metrics.accuracy_score(y_train, train_predict))
print('Test accuracy:', metrics.accuracy_score(y_test, test_predict))
## View the confusion matrix
# The predictions are passed first here, so rows are predicted classes and columns are true classes
confusion_matrix_result = metrics.confusion_matrix(test_predict, y_test)
print('Confusion matrix results:\n', confusion_matrix_result)

plt.figure(figsize=(8, 6))  # Width and height of the figure, in inches
sns.heatmap(confusion_matrix_result, annot=True, cmap='Blues')
plt.xlabel('True labels')
plt.ylabel('Predicted labels')
plt.show()

Training accuracy: 0.9821428571
Test accuracy: 0.9210526315789473
Confusion matrix results:
[[15 0 0]
[ 0 10 2]
[ 0 1 10]]
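
As a cross-check, the test accuracy can be recovered from the confusion matrix printed above, because the correctly predicted samples sit on its diagonal. A minimal sketch in plain Python, with the matrix copied by hand from the output (the variable names are illustrative):

```python
# Confusion matrix copied from the printed output above
confusion = [
    [15, 0, 0],
    [0, 10, 2],
    [0, 1, 10],
]

# Correct predictions lie on the diagonal
correct = sum(confusion[i][i] for i in range(len(confusion)))
# Every cell counts test samples, so the grand total is the test-set size
total = sum(sum(row) for row in confusion)

accuracy = correct / total
print(correct, total)      # 35 38
print(round(accuracy, 4))  # 0.9211
```

The ratio 35 / 38 agrees with the accuracy reported for the test set.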

Classification and prediction based on support vector machine

Understanding of SVM: construct a separating hyperplane that maximizes the distance from the nearest sample points of the two classes to the hyperplane (i.e. achieves the maximum margin).
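
For reference, the maximum-margin idea is usually written as the following optimization problem. The notation is an assumption of this sketch and is not used elsewhere in the post: the labels y_i are encoded as -1 or +1, w is the normal vector of the hyperplane and b its offset.

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad
y_i\,(w^\top x_i + b) \ge 1, \qquad i = 1, \dots, n
```

Under these constraints the width of the margin is 2 / ||w||, so minimizing ||w|| is the same as maximizing the margin.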

sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape=None, random_state=None)

C: penalty parameter of C-SVC. The default value is 1.0
The larger C is, the more heavily the slack variables are penalized, which pushes them toward 0: the penalty for misclassification increases and the model tends to classify the whole training set correctly. Accuracy on the training set is then very high, but generalization ability is weak. When C is small, the penalty for misclassification is reduced and some errors are tolerated as noise points, so generalization ability is stronger.
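
The role of C is easiest to see in the standard soft-margin objective. The symbols below are assumptions of this sketch: labels y_i are encoded as -1 or +1, and each slack variable ξ_i measures how far sample i falls on the wrong side of its margin.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i\,(w^\top x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0
```

A large C makes the sum of slack variables dominate the objective, while a small C lets the margin term dominate.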

  • kernel: kernel function, 'rbf' by default. It can be 'linear', 'poly', 'rbf', 'sigmoid' or 'precomputed'

    • linear: u'v

    • polynomial: (gamma*u'v + coef0)^degree

    • sigmoid: tanh(gamma*u'v + coef0)

    • RBF: exp(-gamma*|u-v|^2)

  • degree: the degree of the polynomial kernel 'poly'. The default value is 3. It is ignored by all other kernel functions.

  • gamma: kernel coefficient for 'rbf', 'poly' and 'sigmoid'. The default shown above is 'auto', which uses 1 / n_features. Since scikit-learn 0.22 the default is 'scale', which uses 1 / (n_features * X.var()).

  • coef0: constant term of kernel function. Useful for 'poly' and 'sigmoid'.

  • probability: whether to enable probability estimates. Boolean, optional; the default value is False.
    It must be set to True before calling fit() in order to use the related methods predict_proba and predict_log_proba afterwards.

  • shrinking: whether to use the shrinking heuristic. The default value is True

  • tol: tolerance for the stopping criterion. The default value is 1e-3

  • cache_size: kernel cache size in MB. The default value is 200

  • class_weight: the weight of each class, passed as a dictionary. Sets the parameter C of class i to class_weight[i] * C (C in C-SVC)

  • verbose: whether to enable verbose output

  • max_iter: maximum number of iterations. -1 means no limit.

  • decision_function_shape: 'ovo', 'ovr' or None; the default shown above is None. Since scikit-learn 0.19 the default is 'ovr'.

  • random_state: seed of the pseudo-random number generator used when shuffling the data for probability estimates; an int value
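
The four kernel formulas listed above can be written out in plain Python to see how gamma, coef0 and degree enter. This is a sketch for illustration only: the function names and the default values of 1.0 and 0.0 are chosen for the example and are not scikit-learn's defaults.

```python
import math

def dot(u, v):
    """Inner product u'v of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(u, v):
    return dot(u, v)

def poly_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * dot(u, v) + coef0) ** degree

def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)

def rbf_kernel(u, v, gamma=1.0):
    # Squared Euclidean distance |u - v|^2
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v))                                # 11.0
print(poly_kernel(u, v, gamma=1.0, coef0=1.0, degree=2))  # 144.0
print(rbf_kernel(u, u, gamma=0.5))                        # 1.0
```

The RBF kernel of a point with itself is always 1 and decays toward 0 as the two points move apart, with gamma controlling how fast.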

import matplotlib.pyplot as plt
import numpy as np
from sklearn import svm

x_features = np.array([[-1, -2], [-2, -1], [-3, -2], [1, 3], [2, 1], [3, 2]])
y_label = np.array([0, 0, 0, 1, 1, 1])

# Build an SVM classifier with a linear kernel
clf = svm.SVC(kernel='linear')
# Fit the model on the sample data
clf = clf.fit(x_features, y_label)
# View the weights w of the fitted model
print('the weight of SVM:', clf.coef_)
# View the intercept w0 of the fitted model
print('the intercept(w0) of SVM:', clf.intercept_)
y_train_pred = clf.predict(x_features)
print('The prediction result:', y_train_pred)
# The decision boundary w[0]*x + w[1]*y + w0 = 0, rewritten as y = a*x - w0/w[1]
x_range = np.linspace(-3, 3)
w = clf.coef_[0]
a = -w[0] / w[1]
y_3 = a*x_range - (clf.intercept_[0]) / w[1]
# Visualize the decision boundary
plt.scatter(x_features[:,0], x_features[:,1], c=y_label, s=50, cmap='viridis')
plt.plot(x_range, y_3, '-c')
plt.show()

the weight of SVM: [[0.33364706 0.33270588]]
the intercept(w0) of SVM: [-0.00031373]
The prediction result: [0 0 0 1 1 1]
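
The printed weights and intercept are enough to reproduce the predictions by hand and to estimate the margin width. The sketch below uses plain Python with the numbers copied from the output above; the helper decision_value is a name made up for this example, and the rule "positive score means class 1" is an assumption that matches this two-class data.

```python
import math

# Values copied from the printed output above
w = [0.33364706, 0.33270588]
b = -0.00031373
points = [[-1, -2], [-2, -1], [-3, -2], [1, 3], [2, 1], [3, 2]]

def decision_value(x):
    """Signed score w . x + b; its sign picks the class."""
    return w[0] * x[0] + w[1] * x[1] + b

# Positive score -> class 1, otherwise class 0
predictions = [1 if decision_value(x) > 0 else 0 for x in points]
print(predictions)  # [0, 0, 0, 1, 1, 1]

# For a maximum-margin linear SVM the margin width is 2 / ||w||
margin = 2 / math.hypot(w[0], w[1])
print(round(margin, 2))  # about 4.24
```

The points [-1, -2], [-2, -1] and [2, 1] give scores very close to -1 or +1, which suggests they are the samples lying on the margin.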

Tags: Machine Learning AI

Posted on Sat, 20 Nov 2021 10:38:52 -0500 by ChrisMayhew