Introduction to Machine Learning - K-Nearest Neighbor Algorithm

Introduction

This article introduces K-Nearest Neighbors (kNN), the first machine learning algorithm in this series.

Its idea is simple, it requires little mathematics (only a distance formula), and it works reasonably well, though it does have drawbacks.

This article will also cover some practical concerns of a machine learning task, such as how to evaluate an algorithm's performance.

kNN

The idea of this algorithm is best explained with an example.

Suppose we have tumor samples characterized by two features, tumor size and time, and labeled as benign or malignant. We can plot them as follows:

Here, benign tumors are shown in red and malignant tumors in blue. These serve as our initial information. Now suppose a new patient arrives; mapping this patient's features onto the plot yields the green dot below.

How do we determine if a new patient is benign or malignant?

If the k-nearest neighbor algorithm is used to solve this problem, we first need to choose a value of k; assume k = 3 here.

For each new data point, what the algorithm does is find the three nearest points to the new data point.


Then these nearest points vote with their own labels; this is also why it makes sense to set k to an odd number, so that ties are less likely.

The three nearest points here are all malignant, so the k-nearest neighbor algorithm concludes that this new data point most likely belongs to the malignant class.

The k-nearest neighbor algorithm assumes that if two samples are similar enough, they have a higher probability of belonging to the same category.

The similarity is described here by the distance in the feature space.

Now assume another new sample point arrives, shown as the green point below:

This time the nearest neighbors vote 2:1 in favor of benign, so kNN considers this sample point more likely to be benign.

The k-nearest neighbor algorithm mainly solves the classification problem in supervised learning.

Here we implement the idea of k-neighbors through code.

import numpy as np
import matplotlib.pyplot as plt

# A small hand-made (fake) dataset for illustration
raw_data_X = [[3.3935, 2.3312],
              [3.1101, 1.7815],
              [1.3438, 3.3684],
              [3.5823, 4.6792],
              [2.2804, 2.8670],
              [7.4234, 4.6965],
              [5.7451, 3.5340],
              [9.1721, 2.5110],
              [7.7928, 3.4241],
              [7.9398, 0.7917]]
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # 0 = benign tumor, 1 = malignant tumor

By convention, uppercase X denotes a matrix (the feature data) and lowercase y denotes a vector (the labels).

Let's start with a scatterplot to see how the data is distributed.

X_train = np.array(raw_data_X)
y_train = np.array(raw_data_y)

# Draw the two classes in different colors
plt.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1], color='g')
plt.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1], color='r')
plt.show()

Assuming that a new sample point x = np.array([8.0934,3.3657]) comes in at this point, how do we use knn to determine its category?

On the basis of the above scatterplot, we add to the drawing of the new sample:

x = np.array([8.0934, 3.3657])  # the new sample point

plt.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1], color='g')
plt.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1], color='r')
plt.scatter(x[0], x[1], color='b')
plt.show()


The new sample point is the blue dot above, and following kNN's reasoning, we can guess that it belongs to the category of the red sample points.

Next, let's see how to use code to implement knn's ideas.

The first step is to calculate the distance between the new sample point and all the original points. How do we calculate distances? We use the Euclidean distance formula:

$$
d(x, y) = \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2 + \dots + (x_n-y_n)^2} = \sqrt{\sum_{i=1}^{n}(x_i-y_i)^2}
$$

In a two-dimensional plane this reduces to $d(x, y) = \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2}$.
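To make this concrete, here is a minimal sketch in numpy: the distance between the new sample point and the first training point, computed once with the formula directly and once with np.linalg.norm, which gives the same Euclidean distance.

import numpy as np

x = np.array([8.0934, 3.3657])     # the new sample point
p = np.array([3.3935, 2.3312])     # the first training point
d_manual = np.sqrt(np.sum((p - x) ** 2))
d_builtin = np.linalg.norm(p - x)  # same Euclidean distance, computed by numpy
print(d_manual, d_builtin)         # the two values agree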

The code implementation is also simple (if you are not familiar with numpy, see Introduction to Machine Learning - Introduction to the Use of numpy and matplotlib):

from math import sqrt

distances = []  # the distance between the new sample point and each original point
for x_train in X_train:
    # x_train - x subtracts element-wise to give a new vector; square each element,
    # sum them with np.sum, and finally take the square root
    d = sqrt(np.sum((x_train - x) ** 2))
    distances.append(d)

Now we have the distance between each point in the training data and the new sample point; next we find the k points with the smallest distances.

In fact, the for loop above can be simplified by a list derivation:

distances = [sqrt(np.sum((x_train - x)**2)) for x_train in X_train]

The next step is to sort by distance. But the sorted distance values alone are not enough: we also need to know which points are closest to the new sample, i.e. their indices.

This is where np.argsort comes in: it sorts the array but returns the indices instead of the values.
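As a quick illustration of what argsort does (the array here is made up just for the demo):

a = np.array([3.2, 0.5, 1.7])
np.argsort(a)
# array([1, 2, 0]) - index 1 holds the smallest value, then index 2, then index 0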

np.argsort(distances)
#array([8, 7, 5, 6, 9, 3, 0, 1, 4, 2], dtype=int64)

As you can see above, the nearest point is the one with index 8, the next is 7, and so on.

nearest = np.argsort(distances)
k = 6
topK_y = [y_train[i] for i in nearest[:k]]  # labels of the k nearest points

Next, let's figure out which category appears most among these k points. If k is even, what do we do with a 5:5 tie? That is why odd values are recommended (though it also depends on the number of categories: with three categories, an odd k = 9 can still produce a 3:3:3 tie, so the choice of k matters).

Setting the odd/even question aside for now, we can use the Counter class to easily count the votes for each category.

from collections import Counter
Counter(topK_y)
# Counter({1: 5, 0: 1})

You can see that five points voted for category 1 and one point voted for category 0.
We can call the most_common method to get the category with the most votes; it returns a list of (element, count) tuples.
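A brief aside on ties: if the vote were split evenly, most_common would still return one of the tied categories (for equal counts, the order in which elements were first seen decides), so the tie is resolved silently rather than reported. A toy illustration with made-up labels:

from collections import Counter

tie_votes = Counter([0, 0, 1, 1])  # a hypothetical 2:2 tie
tie_votes.most_common(1)
# [(0, 2)] - one of the tied categories is returned, with no warning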

The final code is:

from collections import Counter
votes = Counter(topK_y)
predict_y = votes.most_common(1)[0][0]
print(predict_y) # 1

The predicted category is 1. Looking back, category 1 was drawn with red dots, which is consistent with our guess.

That's the simple idea of using code to implement knn.

Let's summarize the above code to form a method:

def kNN_classify(k, X_train, y_train, x):
    distances = [sqrt(np.sum((x_train - x) ** 2)) for x_train in X_train]
    nearest = np.argsort(distances)
    topK_y = [y_train[i] for i in nearest[:k]]
    votes = Counter(topK_y)

    return votes.most_common(1)[0][0]
    

Next, run it:

predict_y = kNN_classify(6,X_train,y_train,x)
print(predict_y) # 1 

Output is 1, no problem.

Let's review the process of machine learning.

In supervised learning, the training dataset contains both the training data and their labels; the process of feeding them into a training algorithm to obtain a model is called fitting (fit). When new input samples are sent to the model, the process of obtaining its output is called prediction (predict).


When we map the kNN algorithm onto this process, we find that kNN does not actually produce a model: it is an algorithm with no training step.

This makes it awkward to abstract the algorithm behind a common interface. To stay consistent with other algorithms, the training dataset itself is treated as the model of the kNN algorithm.

Let's first see how sklearn's kNN classifier is called to make a prediction.

kNN in sklearn

sklearn uses object-oriented thinking, and each algorithm is a class.

from sklearn.neighbors import KNeighborsClassifier

kNN_classifier = KNeighborsClassifier(n_neighbors=6)  # this parameter is k
kNN_classifier.fit(X_train, y_train)  # every learning algorithm in sklearn requires fit
x = np.array([8.0934, 3.3657])  # the new sample point
kNN_classifier.predict(x)

This code fails in newer versions of sklearn, where the input data must be a two-dimensional matrix, even if it contains only a single row or column.

ValueError: Expected 2D array, got 1D array instead
array=[8.0934 3.3657].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

The error message itself suggests the fix, so let's follow it.

x = np.array([8.0934, 3.3657]).reshape(1, -1)  # shape becomes (1, 2)
kNN_classifier.predict(x)

The result returned is

array([1])

Note that predict returns an array and can predict multiple samples at once: each row of the input matrix is one sample. Here we pass in only one sample (a one-row matrix), so the returned array contains a single element.
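Because each row is treated as one sample, several samples can be predicted in a single call. A minimal sketch (the second point here is made up for illustration):

X_new = np.array([[8.0934, 3.3657],
                  [2.5, 3.0]])  # a made-up second point, close to the benign cluster
kNN_classifier.predict(X_new)
# one predicted label per row, e.g. array([1, 0]) for these two points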

You can see that the process above matches the fit/predict diagram.

We'll also organize the code we wrote earlier into this pattern.

Here we inherit from BaseEstimator, which means the class can be used anywhere a scikit-learn estimator is expected.

import numpy as np
from math import sqrt
from collections import Counter

from sklearn.base import BaseEstimator


class KNNClassifier(BaseEstimator):
    def __init__(self,k):
        assert k >= 1, "k must be valid"
        self.k = k
        self._X_train = None
        self._y_train = None

    def fit(self,X_train,y_train):
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"
        assert self.k <= X_train.shape[0], \
            "the size of X_train must be at least k."

        self._X_train = X_train
        self._y_train = y_train
        return self

    def predict(self,X_predict):
        assert self._X_train is not None and self._y_train is not None, \
            "must fit before predict!"
        # The number of rows of X_predict does not matter, but the number of columns must match the training set.
        assert X_predict.shape[1] == self._X_train.shape[1], \
            "the feature number of X_predict must be equal to X_train"

        y_predict = [self._predict(x) for x in X_predict]
        return np.array(y_predict)

    def _predict(self,x):
        """Given a single data to be predicted x,Return x Forecast Result Value"""
        assert x.shape[0] == self._X_train.shape[1], \
            "the feature number of x must be equal to X_train"

        distances = [sqrt(np.sum((x_train - x) ** 2))
                     for x_train in self._X_train]
        nearest = np.argsort(distances)

        topK_y = [self._y_train[i] for i in nearest[:self.k]]
        votes = Counter(topK_y)

        return votes.most_common(1)[0][0]

    def __repr__(self):
        return "KNN(k=%d)" % self.k #Equivalent to java toString()

Next, let's apply the class we just wrote:

knn_clf = KNNClassifier(k=6)
knn_clf.fit(X_train,y_train)
X_predict = np.array([8.0934, 3.3657]).reshape(1, -1)

y_predict = knn_clf.predict(X_predict)
y_predict # array([1])

So far we have implemented the kNN algorithm, but how well does it perform, and how accurate is it? Next we will learn how to evaluate the performance of the algorithm.

Judging the performance of machine learning algorithms

Let's first review the process of machine learning. We use the original data as training data to train a model; in the kNN algorithm, this means computing the distance between the new data and every point in the training set and taking the k smallest distances. In other words, we use a model derived from all the data to predict which category the new data belongs to.

The model is meant to be used in a real environment, and there is a big problem with the approach so far.
The most serious problem is that we trained the model on all of the data, so we can only evaluate it in the real environment. What if the model is poor? Worse, in the real world it is hard to obtain the true labels.

So we don't know if our model is good or bad.

The easiest way to improve this problem is to separate the training and test data.

We use most of the original data as training data and keep the rest as test data. We train the model only on the training data, and then measure its quality on the test data, which was not involved in training.

Therefore, we can judge directly from the test data whether the model is good or bad, and improve it before it enters the real environment.

This method is called train test split. It still has some issues of its own, which will not be expanded on here but will be covered in later articles.

Let's test our previously written kNN algorithm this way.

Here we use the iris dataset provided by sklearn to train:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()  # load the iris dataset
X = iris.data    # shape (150, 4)
y = iris.target  # shape (150,)

Once we have the dataset, we separate the training test data.

Before splitting, we usually need to shuffle the data first. Why? Take a look at the label vector y:
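Printing y shows the labels grouped by class:

y
# array([0, 0, 0, ..., 1, 1, 1, ..., 2, 2, 2]) - 50 samples of each of the three classes, in order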


As you can see, it is ordered. If we split this data directly, the class distribution in the training and test sets will be very uneven, which will hurt the algorithm's performance.

It is important to note that we cannot shuffle X and y separately, because their rows correspond one-to-one. Instead, we shuffle their indices:

shuffle_indexs = np.random.permutation(len(X))  # a random permutation of the indices 0..149
shuffle_indexs

Now we can split the data by choosing a ratio for the test set; here we take 20% of the data as test data.

test_ratio = 0.2
test_size = int(len(X) * test_ratio)
test_indexes = shuffle_indexs[:test_size]    # the first 20% are test data
train_indexes = shuffle_indexs[test_size:]   # the remaining 80% are training data

With these indexes, we can use fancy indexes to get training data and test data:

X_train = X[train_indexes]
y_train = y[train_indexes]

X_test = X[test_indexes]
y_test = y[test_indexes]

After splitting, we can check the shapes of the training and test data; we will print them below after wrapping the procedure in a function.

Let's encapsulate the above procedure into a function so it can be reused later:

def train_test_split(X, y, test_ratio=0.2, seed=None):
    if seed is not None:
        np.random.seed(seed)  # supports a user-specified random seed for reproducibility
    shuffle_indexs = np.random.permutation(len(X))
    test_size = int(len(X) * test_ratio)
    test_indexes = shuffle_indexs[:test_size]
    train_indexes = shuffle_indexs[test_size:]

    X_train = X[train_indexes]
    y_train = y[train_indexes]

    X_test = X[test_indexes]
    y_test = y[test_indexes]

    return X_train, X_test, y_train, y_test

Next, we use this method to split the dataset and apply it to our own knn classification algorithm:

X_train, X_test, y_train, y_test = train_test_split(X, y)  # use the default values for the other two parameters

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

knn_clf = KNNClassifier(k=3)
knn_clf.fit(X_train,y_train)
y_predict = knn_clf.predict(X_test)#Predict all test data
y_predict


We compared the predicted results with the true labels and found that only one sample was misjudged and the others were correct.

How do we quantify this? Through accuracy: the fraction of test samples predicted correctly.

sum(y_predict == y_test)/len(y_test)

Accuracy: 0.966666666667

From here we can see that although kNN's idea is simple, its accuracy is quite high.
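For reference, sklearn also provides an accuracy helper in sklearn.metrics, so the comparison does not have to be written by hand; a minimal sketch:

from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_predict)  # same value as sum(y_predict == y_test) / len(y_test)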

Finally, let's look at the train_test_split function provided by sklearn:

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y)

You can see that it is used the same way as the function we wrote, because when designing our own version we actually followed sklearn's interface.

Note, however, that the parameter is named test_size rather than test_ratio.
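As a small sketch (the random_state value here is arbitrary), the sklearn call with an explicit ratio and seed looks like this:

from sklearn.model_selection import train_test_split

# test_size plays the role of our test_ratio; random_state plays the role of our seed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)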
