Principle and implementation of machine learning iris classification

The iris data set is a classic small-scale data set in machine learning. Through the following experiments, together with the materials and videos consulted while learning, this post shares the learning experience and the full experimental process, in the hope of helping newcomers who love machine learning and are just getting started, while also pushing myself to keep making steady progress.

This article is organized into three parts: background knowledge prepared before the experiment, the main implementation of the experimental process, and a learning summary at the end of the experiment. Because of length constraints, the remaining content will be published in later blog posts, which you can read on your own:

(If in doubt, see this column for the first two blog posts.)

(2) Main implementation of the experimental process:

KNN algorithm API: the official scikit-learn documentation describes it in full, but it is entirely in English. A recommended blog post explains it intuitively and clearly; the following material is extracted from that post and, after sorting, modifying and summarizing, reads as follows:

KNeighborsClassifier parameter description:

KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs)
# Explanation:
n_neighbors: the number of nearest neighbors to use, i.e. the K in KNN.
weights: how the neighbors are weighted; possible values:
- uniform: equal weights; all points in each neighborhood are weighted the same.
- distance: weight points by the inverse of their distance, so closer neighbors get larger weights and farther neighbors get smaller weights.
- callable: a user-defined function that accepts an array of distances and returns an array of the same shape containing the weights.
algorithm: algorithm used to compute the nearest neighbors; one of {auto, ball_tree, kd_tree, brute}.
- auto: automatically choose a suitable algorithm from the training data; roughly, kd_tree when the number of features (dimensions) is about 20 or fewer, ball_tree when it is larger, and brute for data sets that are very small or simple, which are just searched one by one.
- ball_tree: build a "ball tree", which partitions the data with nested hyperspheres.
- kd_tree: build a KD tree, which partitions the data with axis-aligned hyperplanes.
- brute: brute-force search, i.e. the naive KNN procedure: compute the distance from the query point to every training sample, sort in ascending order, take the nearest K samples, and obtain the result by voting.
leaf_size: leaf size passed to the ball tree or KD tree. It affects the speed of construction and query, as well as the memory required to store the tree; the optimal value depends on the nature of the problem.
p: power parameter for the Minkowski metric, as used by sklearn.metrics.pairwise.pairwise_distances. p=1 is equivalent to the Manhattan distance (L1), p=2 to the Euclidean distance (L2), and an arbitrary p gives the Minkowski distance (L_p).
metric: distance metric used by the tree. The default is Minkowski with p=2, which is equivalent to the standard Euclidean metric. For a list of available metrics, consult the documentation of the distance metric class. If the metric is 'precomputed', X is assumed to be a distance matrix and must be square during fitting.
metric_params: additional keyword arguments for the metric function, given as a dict.
n_jobs: the number of parallel jobs used for the neighbor search. None means 1 unless inside a joblib.parallel_backend context, and -1 means using all processors. Look up joblib if you want to learn more about this.
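
As a small illustration (not from the referenced blog), here is a minimal sketch of instantiating the classifier with several non-default parameters; the specific values are arbitrary examples:

from sklearn.neighbors import KNeighborsClassifier
# Minimal sketch: weight neighbors by inverse distance, force a KD tree,
# and use the Manhattan distance (p=1) instead of the default Euclidean distance (p=2)
knn = KNeighborsClassifier(
    n_neighbors=7,        # K: number of nearest neighbors that vote
    weights='distance',   # closer neighbors get larger weights
    algorithm='kd_tree',  # partition the feature space with axis-aligned hyperplanes
    leaf_size=30,         # leaf size passed to the KD tree
    p=1,                  # Minkowski power parameter: 1 = Manhattan distance
    n_jobs=-1             # use all processors for the neighbor search
)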


Review the iris dataset:

Number of instances: 150 (50 for each of the three categories)
Number of attributes: 4 numeric, predictive attributes, plus the class
Attribute information:
- sepal length (cm)
- sepal width (cm)
- petal length (cm)
- petal width (cm)
Class:
- Iris-Setosa (mountain iris)
- Iris-Versicolour (variegated iris)
- Iris-Virginica (Virginia iris)
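
These attribute and class names can be verified directly from the sklearn data set object; a quick sketch (not from the original post):

from sklearn.datasets import load_iris
iris = load_iris()
print(iris.data.shape)      # (150, 4): 150 instances, 4 numeric attributes
print(iris.feature_names)   # sepal length/width and petal length/width, in cm
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']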

Step analysis:

- Get dataset
- Basic data processing
- Feature engineering
- Machine learning (model training)
- Model evaluation

The data set used below is the iris data set in sklearn:

Core code:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
# - Get the dataset
iris = load_iris()
# - Basic data processing
# This step would normally also deal with outliers; the iris data is clean, so only the train/test split is performed
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=22, test_size=0.2)
# - Feature engineering (standardization)
# Instantiate a transformer
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
# Use transform (not fit_transform) on the test set, so it is scaled with the statistics learned from the training set
x_test = transfer.transform(x_test)
# - Machine learning (model training)
# Instantiate an estimator
estimator = KNeighborsClassifier(n_neighbors=5)
# Model training
estimator.fit(x_train, y_train)
# - Model evaluation
# Output the predicted values
y_pre = estimator.predict(x_test)
print("The predicted target values of the test set are:\n", y_pre)
print("Comparison between the predicted values of the test set and the real values:\n", y_pre == y_test)
# Output the accuracy
ret = estimator.score(x_test, y_test)
print("The prediction accuracy is:", ret)

Resulting output:
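
The value n_neighbors=5 above was fixed by hand. As an optional extension that is not part of the original experiment, a minimal sketch of choosing K with cross-validated grid search could look like this:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=22, test_size=0.2)
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)
# Try several candidate values of K with 5-fold cross validation on the training set
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}
search = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=5)
search.fit(x_train, y_train)
print("Best K found by cross validation:", search.best_params_)
print("Test-set accuracy with the best K:", search.score(x_test, y_test))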

Building on the experiment above, the logistic regression method (to be covered in detail in a later post) is used to repeat this experiment:

Preliminary understanding:

A first look at logistic regression: see the linked introductory post
Using the logistic regression classes in scikit-learn: see the linked post
Official scikit-learn documentation: see the linked page

Experiment code:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score # Model evaluation
iris = load_iris()  # Read in dataset
# The segmentation data set is divided into training set and test set
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=0)
# Configure and instantiate a logistic regression classifier
logreg = LogisticRegression(C=1e5, solver='lbfgs', multi_class="multinomial", max_iter=3000)
# Call fit(x, y) to train the model, where x holds the training-set feature values and y the training-set target values
logreg.fit(x_train, y_train)
# The trained model is used to predict the test set and return the predicted target value array
y_predict = logreg.predict(x_test)
print("The real target values in the test set are:\n",y_test)
print("The target values predicted in the test set are:\n",y_predict)
print("Comparison between the predicted value of the test set and its real value:\n",y_predict==y_test)
print("The predicted success rate is:",accuracy_score(y_test, y_predict))

Run results:
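
Accuracy alone can hide per-class behavior. As an optional add-on that is not in the original post, here is a short sketch of a more detailed evaluation, reusing y_test, y_predict and iris from the code above:

from sklearn.metrics import classification_report, confusion_matrix
# Per-class precision, recall and F1 score for the logistic regression predictions above
print(classification_report(y_test, y_predict, target_names=iris.target_names))
# Confusion matrix: rows are the true classes, columns are the predicted classes
print(confusion_matrix(y_test, y_predict))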

(3) Learning summary at the end of the experiment:

For the experimental process, please refer to the first two blog posts in order: Principle and implementation of machine learning iris classification (1) and Principle and implementation of machine learning iris classification (2)
Several well written articles are recommended as a summary:
Recommended articles on L1 and L2 regularization:
What are the characteristics and advantages of L1 and L2 regularization?
Norm regularization in machine learning (I): L0, L1 and L2 norms
Understanding the L2 norm in deep learning regularization
Recommended explanation of logistic regression (Zhihu): Machine learning | logistic regression algorithm (II) logistic regression

Tags: Python Machine Learning AI sklearn
