Note
This note records some important steps and insights from the author's self-study of an online class. Its purpose is to allow quick review when necessary and to document my own learning process, so some content may be incomplete; supplement the notes according to your own needs. If you happen to read these notes, they work best alongside the course "Introduction to Python 3 Artificial Intelligence: Mastering Machine Learning + Deep Learning and Improving Practical Ability" on moke.com.
Chapter 1 guidance
1. Implementation method of artificial intelligence
Implementation methods of artificial intelligence: Symbolic learning and Machine learning
2. Development environment
Development environment: Python, Anaconda, Jupyter Notebook
3. Basic Toolkit
Basic Toolkit: Pandas, Numpy, Matplotlib
4. Configuration environment
Configuration environment: Download and install python and anaconda.
In anaconda, it is recommended to create a new development environment for each project to avoid unnecessary conflicts.
The statement to create a new environment: conda create -n env_name; the statement to activate it: conda activate env_name (env_name is the custom environment name).
Install Jupyter Notebook in the newly created environment to complete the environment configuration.
5.Jupyter Notebook interface optimization
Jupyter Notebook interface optimization: because the default notebook is very uncomfortable, import a new theme.

Visit https://github.com/dunovank/jupyterthemes , find the installation command (pip install jupyterthemes or conda install -c conda-forge jupyterthemes), and run the corresponding pip / conda statement on the command line of the current environment to install the theme package.
Because the installation can be slow, a domestic (Chinese) mirror source can be used. Search Baidu for "python source" to find domestic mirrors such as Ali, Tsinghua and Douban. To use a mirror, append -i and the address of the source to the original statement. For example, pip install jupyterthemes becomes pip install jupyterthemes -i https://pypi.tuna.tsinghua.edu.cn/simple/ .
After the installation is completed, test whether it worked with the statement jt -h. If the help text is displayed, the installation succeeded. We can then set the theme according to the help, as shown in the figure below.

Recommended configuration: jt -t grade3 -f fira -fs 13 -cellw 90% -ofs 11 -dfs 11 -T
Original theme:
Current theme:
6. Start coding
Start coding: click New in the upper right corner of the notebook to create a new python file, and then rename it in the file
 Use markdown to edit the content except the code. For line feed, you should make two or more spaces at the end of the line
 After editing with markdown, run the cell to display the running results
 ctrl + enter to quickly run the selected cell
7. Basic syntax of Python and practical operations of Pandas, Numpy and Matplotlib
Basic syntax of Python and practical operations of Pandas, Numpy and Matplotlib (see basic_coding.ipynb), the download package is still recommended to use domestic sources
Domestic source address
University of science and technology of China: https://pypi.mirrors.ustc.edu.cn/simple/
Alibaba cloud: http://mirrors.aliyun.com/pypi/simple/
Douban: http://pypi.douban.com/simple/
Tsinghua University: https://pypi.tuna.tsinghua.edu.cn/simple/
University of science and technology of China: http://pypi.mirrors.ustc.edu.cn/simple/
USTC is listed twice to show that it is the better one
pip install numpy -i <source address>
Matplotlib drawing summary
 Import matplotlib and the corresponding sub package under the required matplotlib.
 Create and initialize graph objects (including initialization operations such as specifying graph size)
 Specify the image name using the sub package method
 Drawing with given coordinates and corresponding methods of sub package
 Display images using the sub package method
eg: use Matplotlib to draw a line chart
#Using Matplotlib
import matplotlib
from matplotlib import pyplot as plt
#Generate an instance of the graph
fig1 = plt.figure(figsize=(5,5))
#Specify diagram name
plt.title('My First Matplotlib Chart')
plt.xlabel('xaxis')
plt.ylabel('yaxis')
#Given coordinates of each point
x = [1,2,3,4,5]
y = [2,3,4,5,7]
plt.plot(x,y)
plt.show()
To draw a scatter diagram with Matplotlib, just change plt.plot to plt.scatter
#Scatter plot using Matplotlib
import matplotlib
from matplotlib import pyplot as plt
#Let the diagram be displayed directly in jupyter
%matplotlib inline
#Given coordinates of each point
x = [1,2,3,4,5]
y = [2,3,4,5,7]
#Generate an instance of the graph
fig1 = plt.figure(figsize=(5,5))
#Specify diagram name
plt.title('My First Matplotlib Chart')
plt.xlabel('xaxis')
plt.ylabel('yaxis')
plt.scatter(x,y)
plt.show()
Numpy summary
Numpy is powerful because it can easily operate on arrays. First, learn to use numpy to build arrays and do simple array operations. For more operations, refer to the official website documents.
eg: use numpy to generate two arrays and sum them.
import numpy as np
arr1 = np.eye(5)
print(type(arr1))
print(arr1)
#Print the array dimension
print(arr1.shape)

arr2 = np.ones((5,5))
print(type(arr2))
print(arr2)
#Print the array dimension
print(arr2.shape)

res = arr1+arr2
print(type(res))
print(res)
#Print the array dimension
print(res.shape)
Pandas summary
Pandas is powerful because it can easily load, save and index data.
eg1: read the csv file and export all row data corresponding to column x and column y respectively.
import pandas as pd
#Read csv file
data = pd.read_csv('data.csv')
print(type(data))
print(data)

#Index all rows corresponding to x and y respectively
x = data.loc[:,'x']
print(type(x))
print(x)
y = data.loc[:,'y']
eg2: filter out all rows with x < 0 and y > 50
#Index slice case: select the x values of all rows with x < 0 and y > 50
#The two conditions are combined into one boolean mask with &
data1 = data.loc[(x<0) & (y>50), 'x']
print(data1)
This was the first time I met loc, so I checked the official documents and blogs. It is similar to a database query: it finds the subset we need from all the data, only written in a different way. For reference, see https://blog.csdn.net/u014712482/article/details/85080864 and https://www.jianshu.com/p/521f6e302f38 . This also involves Python slicing, which I studied separately at https://www.jianshu.com/p/15715d6f4dad ; that article is very detailed. After reading about slicing, I feel that it is extremely powerful and will make many operations much more convenient in the future.
Finally, Pandas is used to import csv files during data processing, but it is not convenient to process. DataFrame type is usually converted to Numpy type to facilitate scientific calculation between arrays.
#DataFrame type is often converted to Numpy type for easy processing
print(type(data))
np_data = np.array(data)
print(type(np_data))
print(np_data)
Compare the result with the original csv file to understand the conversion process.
As shown in the figure, the conversion from csv to a NumPy array can be understood as follows. Because array indices are sequential and the row labels of the csv file are also sequential, the row labels can be omitted: each row becomes a one-dimensional array, and the rows are placed in order into an outer array, forming a two-dimensional array (in plain Python this would be a list of lists).
Finally, with regard to Pandas, it is worth noting its storage function.
After loading the file data with Pandas, we may need to save the data. In this case, use to_csv('save path')
eg: display and save the data obtained by adding 15 to all the data loaded from the data.csv file with Pandas
#After processing the data, save the data through pandas
new_data = data + 15
#Only the first five lines are displayed, because it is not necessary to show everything when the data is very large
new_data.head()

new_data.to_csv('new_data.csv')
After executing this statement, you will find a new file new_data.csv in the directory where the current ipynb file is located. Open it to see the saved data. The index in the first column can optionally be dropped (the to_csv parameter index=False does this); see the official documents.
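As a small illustration of dropping that index column, the sketch below assumes a made-up two-row DataFrame rather than the real data.csv, and writes to an in-memory buffer instead of a file. The variable names buffer and csv_text are only for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the data loaded from data.csv
data = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
new_data = data + 15

# Write to an in-memory buffer so the example leaves no file behind
buffer = io.StringIO()
new_data.to_csv(buffer, index=False)  # index=False omits the row index column
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a path such as 'new_data.csv' instead of the buffer would write the same text to disk.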
Chapter 2 linear regression of machine learning
1. Introduction to machine learning
1.1 what is machine learning?
for instance:
As shown in the figure, for a traditional algorithm we give the computer the equation (relational expression) for calculating the salary, and it computes the result for us. For machine learning, we do not need to give the equation; we just give a pile of data, and the computer learns the expression from the data and then produces results from it. This is obviously smarter than a traditional algorithm.
It is defined as follows:
1.2 application scenarios
1.3 basic framework
1.4 categories of machine learning
 Supervised learning: tell the computer what is the right data and what is the wrong data, and then let it train.
 Unsupervised learning: don't tell the computer what is the right data, let it train directly.
 Semi supervised learning: also called mixed learning, it only tells the computer a small amount of correct data and lets it train.
 Reinforcement learning: give the computer a feedback (score) according to each result, and let it optimize automatically according to the feedback.
Reinforcement learning eg: when a robot walks, it has only one channel. Let it choose its own path. When it takes a good path it is given +3 points, and when it takes a bad path it is given -3 points. It will automatically find the path that earns a high score.
Various learning application scenarios
2. Linear regression
2.1 what is regression analysis?
Talking about linear regression directly may confuse beginners. Who am I? Where am I? What am I doing? What's the use of this thing? Therefore, it is necessary to start with the basic concept: what is regression analysis?
As shown in the figure, the following three cases give three problems. These three problems are prediction problems. How can we do it? Of course, it's to find a lot of samples and draw pictures to find rules. Take the first problem as an example. According to the distribution law of samples on the graph, we can easily find that it is an increasing curve with gradually decreasing slope, so as to complete the prediction according to the curve. But how to make the computer understand this picture? Similarly, let it find a suitable curve to fit the sample points on the graph. This fitting process is called regression, and the other two cases are the same.
As an example, the following thrown concepts may be easier to understand.
It can be said that regression analysis is based on fitting a curve, and fitting an appropriate curve means determining appropriate parameters for it. For example, if the data points tend to lie on a straight line, we need to find an appropriate straight line y=ax+b, and whether it is appropriate depends on the two parameters a and b. Therefore, constantly adjusting the parameters to find an appropriate straight line is what regression analysis does. As for the types of regression, once one is understood, the principles of the others are similar.
2.2 linear regression
After knowing what regression analysis is, linear regression is simply too easy to understand. In short, it is a regression analysis of fitting a straight line by data.
Finally, the regression problem belongs to supervised learning in machine learning (you tell the computer which samples are right and which are wrong, and then let it train)
2.3 regression problem solving
So far, we have learned what is linear regression, and how to solve the regression problem? Let's start with a question
On the left is the problem and on the right is the solution. We know that the regression problem is to draw a graph and fit a curve according to the sample points. This curve is the quantitative relationship P=f(A). After obtaining the function, this problem is the problem of primary and junior middle schools. I won't repeat it.
Obviously, the focus and difficulty of the regression problem is to find the quantitative relationship. How to find it? Don't ask anything. Get the picture first.
According to the figure, we find that many curves can be used to fit the data. For simplicity, start from the most basic linear model y=ax+b; the idea of fitting with other curves is the same. The key to finding the appropriate linear model y=ax+b is the parameters a and b. You can intuitively experience the influence of a and b on the linear model.
In order to find the right a and b, we can work backwards from the result. What kind of model is good? Of course, one whose line is close to most points! How do we make that precise? Experts have already worked it out: for each sample point, calculate the square of the difference between its ordinate and the ordinate of the line at the same x, then sum the results over all sample points and minimize this sum, that is, the formula in the lower right corner. For later convenience, we divide this sum by 2m (m being the number of sample points) to get the loss function we often hear about, and record it as J.
In fact, dividing by 2m has no effect on the final result; it is only for the convenience of the later derivative calculation, where the 2 cancels out.
Then the original problem is to find a and b to minimize the value of the loss function.
In order to intuitively see the significance and effect of the loss function, you can make a small diagram, put several sample points, fit it with two different lines, and observe the value of the loss function and the corresponding fitting effect.
Columns x and y in the table are the horizontal and vertical coordinates of the actual sample points, and y1' and y2' are the predicted vertical coordinates, that is, the predicted points on the fitting curves. Calculate the loss functions corresponding to y1' and y2' to obtain J1 and J2, as shown in the following figure.
Obviously, J1 < J2, and on the image the fitting effect of y1' is clearly better than that of y2'. From this case, we can see the value of the loss function.
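This kind of comparison can also be reproduced numerically. The sketch below uses made-up sample points and two hypothetical candidate lines (not the values from the original table) and computes J = sum((y' - y)^2) / (2m) for each; the function name loss is only for illustration.

```python
def loss(y_true, y_pred):
    """Squared-error loss J = sum((y' - y)^2) / (2m)."""
    m = len(y_true)
    return sum((yp - yt) ** 2 for yt, yp in zip(y_true, y_pred)) / (2 * m)


# Hypothetical sample points lying on the line y = x
x = [1, 2, 3]
y = [1, 2, 3]

# Two candidate lines: y1' = x and y2' = 0.5x + 1
y1_pred = [xi for xi in x]
y2_pred = [0.5 * xi + 1 for xi in x]

J1 = loss(y, y1_pred)
J2 = loss(y, y2_pred)
print(J1, J2)
```

The line that passes through every point gets a loss of zero, while the second line gets a positive loss, which matches the intuition that a smaller J means a better fit.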
Back to the original problem: how do we determine the appropriate a and b that minimize the loss function J? In the loss function formula, y_i' = a*x_i + b can be substituted to obtain the relationship g between J and a and b, so that J=g(a,b). Wait, isn't this a bit like finding the extreme value of a multivariate function? Can we use the method for finding extreme values of multivariate functions from advanced mathematics to solve this problem? I don't know. Leave the question for now and come back to it with more knowledge. In the field of artificial intelligence, the famous gradient descent method is used to solve this problem; it can find the minimum automatically.
Now for the gradient descent method. Its effect is similar to placing a steel ball halfway up the slope of a groove that opens upward and letting it roll by itself: it swings back and forth and finally stops at the minimum point. The specific implementation is the formula in the figure above. If the abscissa of the current point is p_i, the position of the next point to be searched is p_(i+1). Use the formula to get the next point, then take p_(i+1) as the current point and apply the formula again, repeating until it converges to the abscissa of the minimum point. I don't know where the formula comes from; it must have been worked out by experts. At my current level I don't need to understand it, just remember how to use it. For more details about gradient descent, refer to online videos and blogs.
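A minimal sketch of the idea, assuming the usual update rule p_(i+1) = p_i - alpha * dJ/dp_i (the figure is not available here, so this standard form is an assumption). The step size alpha, the number of steps and the sample data are all hypothetical choices.

```python
def gradient_descent(x, y, alpha=0.05, steps=5000):
    """Fit y = a*x + b by minimising J = sum((a*x_i + b - y_i)^2) / (2m)."""
    a, b = 0.0, 0.0
    m = len(x)
    for _ in range(steps):
        # Prediction error of the current line at every sample point
        errors = [a * xi + b - yi for xi, yi in zip(x, y)]
        # Partial derivatives of J with respect to a and b
        grad_a = sum(e * xi for e, xi in zip(errors, x)) / m
        grad_b = sum(errors) / m
        # Move both parameters a small step against the gradient
        a -= alpha * grad_a
        b -= alpha * grad_b
    return a, b


# Hypothetical sample points lying exactly on y = 2x + 5
x = [1, 2, 3, 4, 5]
y = [7, 9, 11, 13, 15]
a, b = gradient_descent(x, y)
print(round(a, 3), round(b, 3))
```

With these settings the result ends up very close to a = 2 and b = 5. A step size that is too large makes the iteration diverge, so alpha usually has to be tuned for the data at hand.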
2.4 preparation for linear regression
First, let's take a look at sklearn, a powerful open source framework for machine learning.
It is very powerful, and common algorithms can be implemented with it quickly.
eg: call sklearn to solve the linear regression problem?
Analysis: through the previous section, we already know that the most important thing to solve linear regression is to determine the two parameters a and b in y=ax+b. There is a corresponding package in sklearn. You can quickly obtain parameters a and b by using the functions in the package.
 Import LinearRegression package
 Create a new LinearRegression instance lr_model
 The parameters a and b can be found by calling fit method to fit the model
 The fitted model is used to predict the new data
Model evaluation
There are many indicators for model evaluation, which will be introduced in the following chapters; three of them are briefly introduced for this small exercise
 MSE is very similar to the loss function, differing only by the factor 2 in the denominator. The smaller it is, the better.
 R^2 = 1 - MSE / variance. I don't understand the principle for the time being, so let's put it aside. What I know is that it is at most 1, and the closer it is to 1, the better.
 Draw a picture and compare y and y ', and visualize the model to evaluate the quality of the model, which is more intuitive.
The specific implementation method is as follows:
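A minimal hand-written sketch of the first two indicators, using made-up values rather than the course data. The helper names mse and r2 are hypothetical; sklearn's mean_squared_error and r2_score compute the same quantities.

```python
def mse(y_true, y_pred):
    """Mean squared error: sum((y' - y)^2) / m."""
    m = len(y_true)
    return sum((yp - yt) ** 2 for yt, yp in zip(y_true, y_pred)) / m


def r2(y_true, y_pred):
    """R^2 = 1 - MSE / variance of the true values."""
    m = len(y_true)
    mean_y = sum(y_true) / m
    variance = sum((yt - mean_y) ** 2 for yt in y_true) / m
    return 1 - mse(y_true, y_pred) / variance


# Hypothetical true values and slightly imperfect predictions
y = [7, 9, 11, 13, 15]
y_predict = [7.5, 9, 11, 13, 14.5]

mse_value = mse(y, y_predict)
r2_value = r2(y, y_predict)
print(mse_value, r2_value)
```

A perfect prediction gives MSE = 0 and R^2 = 1; the worse the predictions, the larger the MSE and the further R^2 falls below 1.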
When drawing, you may need to display multiple diagrams together using the subplot method. For example, subplot(211) selects the first subplot in a grid of two rows and one column.
2.5 single factor linear regression practice
2.5.1 preparation
The following two csv files will be used in this exercise
The generated_data.csv file contains simple data created by myself. It is mainly used to make learning simple and convenient. The contents are as follows.
The usa_housing_price.csv file contains some house prices and influencing factors in the United States, as follows.
Open anaconda, switch to the environment you created for the learning project, open the notebook, and create and save the file LR_generated_data.ipynb in the corresponding folder
2.5.2 review and steps
Task:
Based on generated_data.csv data, establish a linear regression model, predict the y value corresponding to x=3.5, and evaluate the performance of the model
In the previous linear regression chapter, we have introduced how to establish a linear regression model, and then review the steps.
 Load data
 Fetch data
 Observation data
 Establish the linear regression model with sklearn
 Use the established model for prediction
 Two key parameters a and b of the linear model y=ax+b are obtained
 Evaluation model: MSE (mean square deviation), r2 score, draw a graph between the predicted value and the actual value to see if it approaches y=x
The following will focus on these steps.
2.5.3 loading data
#Load data
import pandas as pd
#Using read_csv function, very commonly used, be sure to remember!!
lr_data = pd.read_csv("generated_data.csv")
#Display the first 5 rows of data
lr_data.head()
If you forget the basic operations of pandas, go back to section 7 of Chapter 1. As for the use of functions, search Baidu often. You don't know anything in the early stage, so you must memorize some of the most commonly used functions.
2.5.4 taking out data
#Take out the x and y columns of data
x = lr_data.loc[:,'x']
y = lr_data.loc[:,'y']
print(type(x))
print(x)
print(type(y))
print(y)
Similarly, the usage of loc is covered in Chapter 1, section 7; look back if you have forgotten it. Generally, check the data after taking it out. If there is a lot of data, use head to see the first five lines.
2.5.5 observation data
#Draw the points to see what the data looks like
from matplotlib import pyplot as plt
plt.figure()
#Draw a straight line or scatter diagram
plt.scatter(x,y)
#plt.plot(x,y)
plt.show()
Because this is perfect data that I made myself, the points lie exactly on a straight line.
2.5.6 establish linear regression model with sklearn
#Using sklearn to establish linear regression model
from sklearn import linear_model
lr_model = linear_model.LinearRegression()
#View current x, y and dimensions: 1
print(type(x),x.shape)
print(type(y),y.shape)
Here linear_model under sklearn is used: create an object, then call functions on the object. The usage is described in the official documents. Read the official documents, csdn and Baidu, and memorize the common calls.
Note that both x and y are 1-dimensional here, that is, in vector form.
import numpy as np
#Using fit directly will fail because it requires that the passed in X parameter be a two-dimensional array, so it is necessary to convert one-dimensional x into two-dimensional
x = np.array(x)
x = x.reshape(-1,1)
y = np.array(y)
y = y.reshape(-1,1)
#View the current x, y and dimensions again: 2D
print(type(x),x.shape)
print(x)
print(type(y),y.shape)
print(y)
Here, we need to convert x into two-dimensional form, that is, matrix form. In fact, y does not need to be converted (I still have a little doubt here, but the conclusion is based on the following practical operation). For the usage of reshape, refer to https://blog.csdn.net/u010916338/article/details/84066369 and https://www.zhihu.com/question/52684594 .
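A short sketch of what reshape does to the dimensions, using made-up arrays. In reshape, -1 means "infer this dimension from the array length", so reshape(-1,1) gives one column (many samples, one feature) and reshape(1,-1) gives one row (one sample, many features).

```python
import numpy as np

# One-dimensional vector of four hypothetical samples
x = np.array([1, 2, 3, 4])

# Column form: 4 samples with 1 feature each, the shape fit() expects for X
x_2d = x.reshape(-1, 1)

# Row form: a single sample with 3 features, the shape predict() expects
sample = np.array([65000, 5, 5]).reshape(1, -1)

print(x.shape, x_2d.shape, sample.shape)
```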
#Using fit to construct linear regression model
lr_model.fit(x,y)
The construction of the model is very simple. With this fit function, input the coordinate sets of x and y, and it can fit a linear model by itself.
2.5.7 prediction using the established model
#Use the constructed model to predict y corresponding to x (it should be a perfect prediction)
y_predict = lr_model.predict(x)
print(y_predict)
Input x in the original data for prediction. Since the data is perfect and the model is perfect, the predicted value must be perfect (that is, the prediction point is on a straight line). However, in normal processing, the data is not perfect, so we mainly look at the indicators of the evaluation model. In most cases, it is meaningless to directly look at the prediction results.
#Use the constructed model to predict the value of y at x=3.5
y_predict2 = lr_model.predict([[3.5]])
print(y_predict2)
Of course, this is also a perfect prediction...
2.5.8 two key parameters a and b of the linear model y=ax+b are obtained
#According to the model, parameters a and b are obtained and displayed
a = lr_model.coef_
b = lr_model.intercept_
print(a,b)
The attributes coef_ and intercept_ give the parameters a and b. Are they right? Look at the raw data.
Select all x and Y (excluding these two letters) points to insert into the scatter chart.
Right click one of the scatter points and select the trend line.
Check the display formula below. You can see that the formula is y=2x+5, and the above parameters a and b are 2 and 5, which proves that the parameters obtained above are correct.
2.5.9 evaluation model
The three common evaluation methods are MSE, the r2 score, and plotting y against y_predict. Remember all of them.
from sklearn.metrics import r2_score,mean_squared_error
#Evaluation index: mean square error MSE and R2 score
MSE = mean_squared_error(y,y_predict)
R2 = r2_score(y,y_predict)
print(MSE,R2)

#Evaluation index: Drawing
plt.figure()
plt.plot(y,y_predict)
plt.show()
2.6 multi factor linear regression practice
Linear regression prediction of house price
Task: Based on usa_housing_price.csv data, establish a linear regression model to predict the reasonable house price:
1. Taking the area as the input variable, a single factor model is established to evaluate the performance of the model and visualize the linear regression prediction results
2. Taking income, house age, numbers of rooms, population and area as input variables, a multi factor model was established to evaluate the performance of the model
3. Forecast the reasonable house price for Income=65000, House Age=5, Number of Rooms=5, Population=30000, size=200: X_test = [65000,5,5,30000,200]
It is similar to single factor linear regression, but the input is replaced by multiple factors. The steps are as follows:
 Import data
 Display data
 A single factor model is established with area as input variable
 Building a linear model using sklearn
 Using models to predict house prices
 Evaluation model
 Drawing evaluation
 Establish multi factor model
 Fitting linear model
 Forecast new data
 Evaluation model
 Drawing evaluation
 Forecast the reasonable house price for Income=65000, House Age=5, Number of Rooms=5, Population=30000, size=200: X_test = [65000,5,5,30000,200]
2.6.1 importing data
#Import data
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
data = pd.read_csv("usa_housing_price.csv")
data.head()
2.6.2 display data
#Show me the data
fig = plt.figure(figsize=(15,7))

fig1 = plt.subplot(231)
plt.scatter(data.loc[:,'Avg. Area Income'],data.loc[:,'Price'])
plt.title('Income VS Price')

fig2 = plt.subplot(232)
plt.scatter(data.loc[:,'Avg. Area House Age'],data.loc[:,'Price'])
plt.title('Age VS Price')

fig3 = plt.subplot(233)
plt.scatter(data.loc[:,'Avg. Area Number of Rooms'],data.loc[:,'Price'])
plt.title('Number VS Price')

fig4 = plt.subplot(234)
plt.scatter(data.loc[:,'Area Population'],data.loc[:,'Price'])
plt.title('Population VS Price')

fig5 = plt.subplot(235)
plt.scatter(data.loc[:,'size'],data.loc[:,'Price'])
plt.title('size VS Price')
plt.show()
2.6.3 establish a single factor model with area as input variable
#A single factor model is established with area as input variable
#Get X and y
X = data.loc[:,'size']
y = data.loc[:,'Price']
#Convert X to a two-dimensional array with one column, as required by fit
X = np.array(X).reshape(-1,1)
2.6.4 building a linear model using sklearn
#Building a linear model using sklearn
from sklearn import linear_model as lm
lr1 = lm.LinearRegression()
lr1.fit(X,y)
2.6.5 forecasting house prices using models
#Using models to predict house prices
y_predict1 = lr1.predict(X)
print(y_predict1)
2.6.6 evaluation model
#Evaluation model
from sklearn.metrics import mean_squared_error,r2_score
MSE1 = mean_squared_error(y,y_predict1)
R2_score1 = r2_score(y,y_predict1)
print(MSE1,R2_score1)
2.6.7 drawing evaluation
#Drawing evaluation
fig6 = plt.figure(figsize=(5,5))
plt.scatter(X,y)
plt.plot(X,y_predict1,'g')
plt.show()
2.6.8 establishment of multi factor model
#Establish multi factor model
#Define X_muti and y
X_muti = data.drop('Price',axis=1)
y = data.loc[:,'Price']
X_muti.head()
#y.head()
2.6.9 fitting linear model
#Fitting linear model
from sklearn import linear_model as lm
lr_muti = lm.LinearRegression()
lr_muti.fit(X_muti,y)
2.6.10 forecast new data
#Forecast new data
lr_muti_predict = lr_muti.predict(X_muti)
print(lr_muti_predict)
2.6.11 evaluation model
#Evaluation model
from sklearn.metrics import r2_score,mean_squared_error
lr_muti_r2_score = r2_score(y,lr_muti_predict)
lr_muti_mse = mean_squared_error(y,lr_muti_predict)
print(lr_muti_mse,lr_muti_r2_score)

#Compared with the previous single factor model, the r2 score is much better (closer to 1 than the single), and the mse is 10 times smaller. Obviously, the multi factor model in this case is better
print(MSE1,R2_score1)
2.6.12 drawing evaluation
#Drawing evaluation model
from matplotlib import pyplot as plt
fig7 = plt.figure(figsize=(8,5))
plt.scatter(y,lr_muti_predict)
plt.title("Price VS Price_predict under multiple factors")
plt.show()
#From the figure, there are many overlapping parts between the predicted value and the actual value, and the effect is good

#Compare it with single factor
fig8 = plt.figure(figsize=(8,5))
plt.scatter(y,y_predict1)
plt.title("Price VS Price_predict under single factors")
plt.show()
#Oh, my God, the multi factor effect is so much better!
2.6.13 forecast the reasonable house price under the given parameters
#Forecast the reasonable house price for Income=65000, House Age=5, Number of Rooms=5, Population=30000, size=200: X_test = [65000,5,5,30000,200]
import numpy as np
X_muti2 = [65000,5,5,30000,200]
#Convert the vector to a two-dimensional matrix (1 sample, 5 features) so that it can be passed into the predict function as a parameter
X_muti2 = np.array(X_muti2).reshape(1,-1)
print(X_muti2)

lr_muti_predict2 = lr_muti.predict(X_muti2)
print(lr_muti_predict2)
Chapter 3 logistic regression of machine learning
1. Introduction to classification issues
Linear regression mainly solves regression problems, while logistic regression solves classification problems. As for why logistic regression has "regression" in its name although it solves classification problems, my understanding is that classification is also a special kind of regression. If interested, search Baidu.
Let's look at the following examples of classification problems:
definition
Classification method
The obvious difference between classification task and regression task
2. Explanation of logistic regression
2.1 classification tasks
Now that we know that logistic regression deals with classification tasks, it is easy to understand what classification is. There are the following cases
How to make the machine understand this task? First, let's take a look at the framework of classification tasks.
There are two steps. The first step is to solve for a final result, that is, y=f(x1,x2,...,xn) in the figure below. The second step is to judge the category according to the result.
In this movie watching case, the first step is y=f(x) ∈ {0,1} (0 means not watching a movie and 1 means watching a movie). The second step is to judge the category according to the value of f(x). Obviously, the key is to find f(x).
In order to find f(x), you can first look at the distribution of points using a simple linear model, as shown in the following figure.
We can get the distribution function y=0.1364x+0.5. Based on this distribution function, we can easily decide about watching movies: when y >= 0.5, go to the movies, otherwise don't go.
Let's take a look at the prediction accuracy in this case. It seems very good.
But in fact, it has great limitations. It is not flexible enough, and when the sample size becomes larger, the accuracy may decline quickly. For example, in the following case, because one sample point lies far to the right, the whole linear regression function is pulled away, which is obviously unreasonable. This is where logistic regression comes in handy.
2.2 distribution function of logistic regression
Compared with linear regression, the main difference of logistic regression is the distribution function. The distribution function of logistic regression is y=1/(1+e^(-x)), also known as the sigmoid function. The distribution function is as follows.
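A minimal sketch of the sigmoid function y = 1/(1+e^(-x)), evaluated at a few hypothetical inputs to show how it squeezes any real number into the range (0, 1).

```python
import math


def sigmoid(x):
    """Logistic (sigmoid) function y = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


# Large negative inputs approach 0, large positive inputs approach 1
values = [sigmoid(v) for v in (-5, 0, 5)]
print(values)
```

The output is exactly 0.5 at x = 0, which is why 0.5 is the natural threshold between the two classes.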
definition
The case of applying logistic regression to watching movies is the following answer
2.3 classification tasks in multiple dimensions
When there are high-dimensional influencing factors, just replace the argument of the sigmoid function appropriately. For example, when the factor affecting the result is two-dimensional (x1,x2), you can replace x in the sigmoid function with g(x) on the right side of the figure below. For higher dimensions the same replacement is made, but g(x) will also have more dimensions.
This g(x) is called decision boundary. The most important thing for us to solve the classification task is to find it.
Similarly, finding the following decision boundary is to find g(x). When g(x) contains a quadratic term, it also means that it is a curve with higher flexibility. It can be seen that it plays a very good role in the following classification problems.
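A small sketch of how a decision boundary turns into a class label, assuming the linear form g(x) = theta0 + theta1*x1 + theta2*x2 (the figure is not available here, so this form and the parameter values below are assumptions). The function name predict is hypothetical.

```python
import math


def predict(x1, x2, theta0, theta1, theta2):
    """Return (probability, label) for a linear decision boundary g(x)."""
    g = theta0 + theta1 * x1 + theta2 * x2
    p = 1.0 / (1.0 + math.exp(-g))  # sigmoid applied to g(x)
    return p, 1 if p >= 0.5 else 0


# Hypothetical parameters: the boundary g(x) = 0 is the line x1 + x2 = 10
theta = (-10.0, 1.0, 1.0)

p_above, label_above = predict(8, 6, *theta)  # g = 4, above the boundary
p_below, label_below = predict(2, 3, *theta)  # g = -5, below the boundary
print(p_above, label_above, p_below, label_below)
```

Because the sigmoid equals 0.5 exactly when g(x) = 0, checking p >= 0.5 is the same as checking on which side of the boundary the point lies.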
3. Logistic regression solution
Here the sigmoid function is fixed, and the key is to solve g(x), that is, the decision boundary. The key to getting g(x) is to get its parameters. In the later exercise, we will use sklearn to get these parameters.
For logistic regression, the loss function is different from that of linear regression because the labels are discrete, so the following loss function is used. If interested, search Baidu for the relevant knowledge.
The loss function of logistic regression can be combined into one formula, as shown in the figure below. With the loss function formula, solving logistic regression is transformed into minimizing the loss function, which can be solved by the gradient descent method.
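For reference, the combined loss is usually written as below. This is the standard cross-entropy form and is assumed to match the figure; here P(x_i) denotes the sigmoid output for sample i, y_i its label (0 or 1), and m the number of samples.

```latex
J = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log P(x_i) + (1 - y_i) \log\bigl(1 - P(x_i)\bigr) \right],
\qquad P(x_i) = \frac{1}{1 + e^{-g(x_i)}}
```

When y_i = 1 only the first term remains and when y_i = 0 only the second, which is how the two separate cases merge into one formula.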
4. Actual combat preparation
When using matplotlib to plot the classified points, use a boolean mask to separate them: points where mask is True belong to one category, and points where it is False belong to the other. When plotting, index the data with mask and ~mask to draw each category separately.
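A small sketch of the mask idea with pandas, leaving the plotting out. The scores and labels are invented for illustration.

```python
import pandas as pd

# Hypothetical data: one score column and a 0/1 label column
scores = pd.Series([35.0, 80.0, 60.0, 92.0])
y = pd.Series([0, 1, 0, 1])

mask = y == 1            # boolean Series: True where the label is 1
passed = scores[mask]    # rows where mask is True
failed = scores[~mask]   # ~ inverts the mask, giving the other category

print(passed.tolist(), failed.tolist())
```

The same indexing works on the columns passed to plt.scatter, which is how the two categories get different markers.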
Using logistic regression for binary classification is similar to the linear regression workflow.
We also need to know how to obtain the parameters theta0, theta1 and theta2 (first-order, linear case), which are the three parameters of the decision boundary. If the decision boundary is a curve, squared terms must be introduced and a new data dictionary built from them.
Accuracy is introduced to evaluate the model; it is relatively simple.
The accuracy calculation and the boundary curve visualization are as follows
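Accuracy is just the fraction of predictions that match the true labels. Here is a hand-written sketch of that idea; the function name and sample labels are my own, and in the exercises below sklearn's accuracy_score is used instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical labels: 4 of the 5 predictions match
acc = accuracy([1, 0, 1, 1, 0], [1, 0, 0, 1, 0])
print(acc)
```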
5. Test practice
Task: use logistic regression to predict whether a student passes the exam.
Based on the data in examdata.csv, build a logistic regression model and predict whether a student with exam1 = 75 and exam2 = 60 passes or fails; then build a second-order boundary to improve the accuracy of the model.
Import csv file
# read the data
import pandas as pd
import numpy as np
data = pd.read_csv('examdata.csv')
data.head()
Get X and y
Here X has two dimensions, one per exam. y is one-dimensional and indicates whether the student passed.
# get X and y
X = data.drop('Pass',axis=1)
y = data.loc[:,'Pass']
X1 = data.loc[:,'Exam1']
X2 = data.loc[:,'Exam2']
X2.head()
Visual data
# visualize the data
from matplotlib import pyplot as plt
fig1 = plt.figure(figsize=(8,5))
plt.title('Exam1 VS Exam2')
plt.xlabel('Exam1')
plt.ylabel('Exam2')
plt.scatter(X1,X2)
plt.show()
Visualization after classification
# visualize the data, one marker per category
from matplotlib import pyplot as plt
fig2 = plt.figure(figsize=(8,5))
mask=y==1
plt.title('Exam1 VS Exam2')
plt.xlabel('Exam1')
plt.ylabel('Exam2')
passed = plt.scatter(X1[mask],X2[mask])
failed = plt.scatter(X1[~mask],X2[~mask],marker='^')
plt.legend([passed,failed],['pass','fail'])
plt.show()
Model construction and prediction
# establish the model and fit
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X,y)
# predict on the training data
predict1 = lr.predict(X)
print(predict1)
Evaluating models with accuracy
# evaluate the model
from sklearn.metrics import accuracy_score
accuracy_score(y,predict1)
Predict passage under specific data
# predict whether a student with exam1 = 75 and exam2 = 60 passes or fails
predict2 = lr.predict([[75,60]])
print('passed' if predict2[0]==1 else 'failed')
The parameters of decision boundary are obtained
# get the parameters of the decision boundary theta0 + theta1*X1 + theta2*X2 = 0
theta0 = lr.intercept_[0]
theta1 = lr.coef_[0][0]
theta2 = lr.coef_[0][1]
# solve the boundary equation for X2
X2_new = -(theta0+theta1*X1)/theta2
Visual decision boundary
# visualize the boundary
fig3 = plt.figure(figsize=(8,5))
mask=y==1
plt.title('Exam1 VS Exam2')
plt.xlabel('Exam1')
plt.ylabel('Exam2')
passed = plt.scatter(X1[mask],X2[mask])
failed = plt.scatter(X1[~mask],X2[~mask],marker='^')
boundary = plt.plot(X1,X2_new,c='r')
plt.legend([passed,failed],['pass','fail'])
plt.show()
Establish a second-order boundary, that is, a curve
# the second-order boundary is established to improve the accuracy of the model
# get the new data: squared terms and the mixed term
X1_2 = X1*X1
X2_2 = X2*X2
X1_X2 = X1*X2
X_new = {'X1':X1,'X2':X2,'X1_2':X1_2,'X2_2':X2_2,'X1_X2':X1_X2}
X_new = pd.DataFrame(X_new)
print(X_new)
Build models and predict
# fit a new model on the second-order features and predict
lr2 = LogisticRegression()
lr2.fit(X_new,y)
predict3 = lr2.predict(X_new)
print(predict3)
Evaluating models with accuracy
# get the accuracy score
accuracy_score(y,predict3)
Sort x1 so that the decision boundary is a smooth line visually
X1_new = X1.sort_values()
print(X1,X1_new)
Get the parameters of the second-order decision boundary, and compute the X2 coordinates by solving the boundary equation g(x) = 0
# get the new X2 in order to draw the boundary easily
theta0 = lr2.intercept_[0]
theta1,theta2,theta3,theta4,theta5 = lr2.coef_[0]
# the boundary is a quadratic in X2: a*X2^2 + b*X2 + c = 0
a = theta4
b = theta2+theta5*X1_new
c = theta0+theta1*X1_new+theta3*X1_new*X1_new
# print(theta0,theta1,theta2,theta3,theta4,theta5)
X2_new2 = (-b+np.sqrt(b*b-4*a*c))/(2*a)
print(X2_new2)
Visualize the second-order decision boundary
# draw the boundary
fig4 = plt.figure(figsize=(8,5))
mask=y==1
plt.title('Exam1 VS Exam2')
plt.xlabel('Exam1')
plt.ylabel('Exam2')
passed = plt.scatter(X1[mask],X2[mask])
failed = plt.scatter(X1[~mask],X2[~mask],marker='^')
boundary2 = plt.plot(X1_new,X2_new2,c='r')
plt.legend([passed,failed],['pass','fail'])
plt.show()
6. Chip monitoring practice
Task: logistic regression to predict chip quality
1. Based on the chip_test.csv data, build a logistic regression model (second-order boundary) and evaluate its performance;
2. The boundary curve is solved by function
3. Draw a complete decision boundary curve
The steps are as follows:
(1) Load data
(2) Visual data
(3) Define X and y to obtain corresponding parameters and relevant data
(4) Modeling, training and prediction
(5) Evaluation model
(6) Find decision boundary (second order)
(7) Custom boundary function
Load data
# read the data
import pandas as pd
import numpy as np
data = pd.read_csv('chip_test.csv')
data.head()
Get X and y
# get X and y
X = data.drop('pass',axis=1)
y = data.loc[:,'pass']
X1 = data.loc[:,'test1']
X2 = data.loc[:,'test2']
X2.head()
Visual classified data
# visualize the data
from matplotlib import pyplot as plt
fig2 = plt.figure(figsize=(8,5))
mask=y==1
plt.title('test1 VS test2')
plt.xlabel('test1')
plt.ylabel('test2')
passed = plt.scatter(X1[mask],X2[mask])
failed = plt.scatter(X1[~mask],X2[~mask],marker='^')
plt.legend([passed,failed],['pass','fail'])
plt.show()
Build the second-order curve boundary directly
# get the new data: squared terms and the mixed term
X1_2 = X1*X1
X2_2 = X2*X2
X1_X2 = X1*X2
X_new = {'X1':X1,'X2':X2,'X1_2':X1_2,'X2_2':X2_2,'X1_X2':X1_X2}
X_new = pd.DataFrame(X_new)
print(X_new)
Model and predict
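# LogisticRegression must be imported in this notebook before use
from sklearn.linear_model import LogisticRegression
lr2 = LogisticRegression()
lr2.fit(X_new,y)
predict3 = lr2.predict(X_new)
print(predict3)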
Evaluating models with accuracy
# get the accuracy score
from sklearn.metrics import accuracy_score
accuracy_score(y,predict3)
Sort x1 to draw the boundary curve
X1_new = X1.sort_values()
print(X1_new)
Get the parameters of the decision boundary and compute X2 along it
# get the new X2 in order to draw the boundary easily
theta0 = lr2.intercept_[0]
theta1,theta2,theta3,theta4,theta5 = lr2.coef_[0]
# the boundary is a quadratic in X2: a*X2^2 + b*X2 + c = 0
a = theta4
b = theta2+theta5*X1_new
c = theta0+theta1*X1_new+theta3*X1_new*X1_new
# print(theta0,theta1,theta2,theta3,theta4,theta5)
X2_new2 = (-b+np.sqrt(b*b-4*a*c))/(2*a)
print(X2_new2)
Draw decision boundaries
# draw the boundary
fig4 = plt.figure(figsize=(8,5))
mask=y==1
plt.title('test1 VS test2')
plt.xlabel('test1')
plt.ylabel('test2')
passed = plt.scatter(X1[mask],X2[mask])
failed = plt.scatter(X1[~mask],X2[~mask],marker='^')
boundary2 = plt.plot(X1_new,X2_new2,c='r')
plt.legend([passed,failed],['pass','fail'])
plt.show()
Because only one root, x2 = (-b + sqrt(b*b - 4*a*c))/(2*a), is used to solve for x2, only one branch of the decision boundary is drawn. To draw the whole boundary, compute both roots of the quadratic formula, which is what the following user-defined boundary function f(x) does.
Custom boundary function
# custom boundary function f(x): returns both roots of the boundary equation
def f(x):
    a = theta4
    b = theta2+theta5*x
    c = theta0+theta1*x+theta3*x*x
    boundary1 = (-b+np.sqrt(b*b-4*a*c))/(2*a)
    boundary2 = (-b-np.sqrt(b*b-4*a*c))/(2*a)
    return boundary1,boundary2
Obtain the ordinates of the two boundary curves
# obtain the test2 values of the two boundary branches for each test1 value
line1 = []
line2 = []
for x in X1_new:
    line1.append(f(x)[0])
    line2.append(f(x)[1])
print(line1,line2)
Draw decision boundaries
# draw the boundary
fig5 = plt.figure(figsize=(8,5))
mask=y==1
plt.title('test1 VS test2')
plt.xlabel('test1')
plt.ylabel('test2')
passed = plt.scatter(X1[mask],X2[mask])
failed = plt.scatter(X1[~mask],X2[~mask],marker='^')
boundary3 = plt.plot(X1_new,line1,c='r')
boundary4 = plt.plot(X1_new,line2,c='r')
plt.legend([passed,failed],['pass','fail'])
plt.show()
# the line is not closed because there are too few sample points in test1; denser points can complete it
At this time, we find that the decision curve is not closed because there are not enough data points, so we only need to generate more points and draw the graph.
Generate dense points to complete the decision boundary curve
# dense test1 values from -0.9 to 1.0 (see list comprehensions for this syntax)
X1_customize = [-0.9 + i/20000 for i in range(0,38001)]
X1_customize = np.array(X1_customize)
# obtain the test2 values of the two boundary branches for each test1 value
line3 = []
line4 = []
for x in X1_customize:
    line3.append(f(x)[0])
    line4.append(f(x)[1])
print(line3,line4)
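The loop above calls f(x) twice per point. As an alternative sketch, the same two branches can be computed in one vectorized numpy call. The theta values here are hypothetical (they describe the circle 0.5 - x1^2 - x2^2 = 0) and are not the fitted parameters; boundary is a helper name I made up, and I assume test1 runs from about -0.9 to 1.0.

```python
import numpy as np

def boundary(x1, theta):
    """Both roots of a*x2^2 + b*x2 + c = 0 for every value in x1."""
    theta0, theta1, theta2, theta3, theta4, theta5 = theta
    a = theta4
    b = theta2 + theta5 * x1
    c = theta0 + theta1 * x1 + theta3 * x1 * x1
    disc = b * b - 4 * a * c
    # Where the discriminant is negative there is no real root: use NaN,
    # which matplotlib simply skips when drawing the line
    root = np.sqrt(np.where(disc >= 0, disc, np.nan))
    return (-b + root) / (2 * a), (-b - root) / (2 * a)

# Hypothetical parameters for illustration only
theta = (0.5, 0.0, 0.0, -1.0, -1.0, 0.0)

x1_dense = np.linspace(-0.9, 1.0, 38001)   # evenly spaced test1 values
branch_a, branch_b = boundary(x1_dense, theta)
print(np.nanmin(branch_a), np.nanmax(branch_b))
```

Plotting branch_a and branch_b against x1_dense gives the two halves of the closed curve.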
Draw decision boundaries
# draw the boundary
fig6 = plt.figure(figsize=(8,5))
mask=y==1
plt.title('test1 VS test2')
plt.xlabel('test1')
plt.ylabel('test2')
passed = plt.scatter(X1[mask],X2[mask])
failed = plt.scatter(X1[~mask],X2[~mask],marker='^')
boundary5 = plt.plot(X1_customize,line3,c='r')
boundary6 = plt.plot(X1_customize,line4,c='r')
plt.legend([passed,failed],['pass','fail'])
plt.show()
# the line is closed
Done! It's finally over. I gained a lot from this chapter~
There were many things I had never touched at the beginning. I looked them up myself and really learned a lot.
Chapter 4 clustering of machine learning
1. Unsupervised learning
Unsupervised learning
For example, here are some pictures of cats that need to be classified. With unsupervised learning, the algorithm automatically finds what the pictures have in common and groups similar ones into one category. You don't need to tell it which pictures belong to which category.
Its definition, advantages and applications are as follows:
The difference between unsupervised and supervised learning is whether to label the data.
For example, the following figure shows supervised learning. It uses y to represent the label, and each data point is marked with the y (label) value that corresponds to its x value. For example, the red class is labeled 0 and the blue class is labeled 1.
Unsupervised learning has no label item. After being divided into two classes, it doesn't matter which class is 0 and which class is 1.
cluster analysis
Cluster analysis, also known as group analysis, automatically divides objects into different categories according to the similarity of some attributes.
Common clustering algorithms
2. KMeans, KNN, Meanshift
2.1 what is Kmeans
KMeans analysis is also called K-means clustering
In fact, this algorithm is very simple, that is, it constantly updates the position of the cluster center until it converges. The specific steps of the algorithm are as follows:
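The steps can be sketched in a few lines of numpy: assign each point to its nearest center, move each center to the mean of its points, and stop when the centers no longer move. This is only a minimal sketch under my own assumptions (random samples as initial centers, made-up data); sklearn's KMeans uses a smarter initialization.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means: returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct samples chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_samples, k)
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Move each center to the mean of its points (keep it if it has none)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return centers, labels

# Hypothetical data: two well-separated groups of three points
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers, labels = kmeans(X, k=2)
print(centers, labels)
```

Which group ends up as cluster 0 depends on the random start, which is exactly why the labels sometimes need correcting later.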
2.2 what is KNN
KNN is also called the K-nearest neighbor classification model
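The idea, as I understand it, is that a new point takes the most common label among its k closest training points. Below is a minimal pure-Python sketch of that idea; the function name and data are made up, and sklearn's KNeighborsClassifier is what the exercises actually use.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    # Indices of the training points, sorted by Euclidean distance to the query
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled data: two groups of three points
train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (2, 2)))
print(knn_predict(train_X, train_y, (7, 7)))
```

Because the labels are needed for the vote, KNN is supervised, unlike the two clustering algorithms in this section.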
2.3 what is Meanshift
Meanshift is also called mean-shift clustering
The algorithm flow is as follows:
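Since the flow chart is not reproduced here, this is a rough sketch of the flow as I understand it: every point is repeatedly moved to the mean of the samples within a radius (the bandwidth) until it stops moving, and points that end up at the same place form one cluster. It assumes a flat kernel and a simple merge rule of my own; sklearn's MeanShift is implemented differently.

```python
import numpy as np

def mean_shift(X, bandwidth, n_iter=50, tol=1e-4):
    """Minimal mean shift with a flat kernel: returns (centers, labels)."""
    points = X.astype(float).copy()
    for _ in range(n_iter):
        shifted = np.empty_like(points)
        for i, p in enumerate(points):
            # Average all original samples within `bandwidth` of p
            near = X[np.linalg.norm(X - p, axis=1) <= bandwidth]
            shifted[i] = near.mean(axis=0)
        done = np.linalg.norm(shifted - points, axis=1).max() < tol
        points = shifted
        if done:
            break
    # Merge converged points closer than bandwidth/2 into one center
    centers = []
    labels = np.empty(len(X), dtype=int)
    for i, p in enumerate(points):
        for j, c in enumerate(centers):
            if np.linalg.norm(p - c) < bandwidth / 2:
                labels[i] = j
                break
        else:
            centers.append(p)
            labels[i] = len(centers) - 1
    return np.array(centers), labels

# Hypothetical data: two well-separated groups of three points
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers, labels = mean_shift(X, bandwidth=3.0)
print(centers, labels)
```

Note that the number of clusters is not given in advance; it falls out of the bandwidth.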
I don't record much about each algorithm in these notes. On the one hand, the algorithms are hard to explain in words alone, and watching the videos works better. On the other hand, my notes are for quickly recalling and understanding the key points. I think this is enough for me to recall them, and none of the algorithms is hard to understand. So if this is your first contact with them, I recommend going through an explanation of each algorithm's principle first and then looking at the pictures above; you will get a lot more out of them. I won't go into detail in many of the later chapters either. After all, these are notes rather than lectures.
3. Actual combat preparation
3.1 preliminary knowledge of kmeans
Based on the previous chapters, the model training process here should already be familiar.
Note the parameters passed in when creating the KMeans object: n_clusters=3 and random_state=0.
n_clusters is the number of classes to cluster into; n_clusters=3 means three classes.
About random_state=0: after reading some blogs, I drew the following conclusions:
If random_state is set, others who run your code get exactly the same results and can reproduce your process. If it is set to None, a seed is selected randomly.
So setting it to 0 fixes the seed value, which makes the current random behavior reproducible in the future.
The cluster centers are obtained from the cluster_centers_ attribute, for example kmeans.cluster_centers_.
Accuracy is computed the same way as before, so nothing more needs to be said for KMeans.
One more thing to note: because KMeans is unsupervised learning, it has no labels. The clustering itself is fine, but the label numbers assigned to the clusters may not match the original ones, so a correction is needed. The method is simple: draw a plot to see which labels are swapped, remap them with a list, and then convert the corrected list into a numpy array to get the correctly labeled results.
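In these notes the mapping is read off a plot by hand. As an alternative sketch, when the original labels are available the mapping can also be built automatically by giving each cluster the most common original label among its points. The function name and data below are made up, and this majority-vote rule is my own assumption: it can map two clusters to the same label if the clustering is poor.

```python
import numpy as np

def align_labels(y_true, y_cluster):
    """Map each cluster id to the most common true label inside that cluster."""
    y_true = np.asarray(y_true)
    y_cluster = np.asarray(y_cluster)
    mapping = {}
    for c in np.unique(y_cluster):
        values, counts = np.unique(y_true[y_cluster == c], return_counts=True)
        mapping[int(c)] = int(values[counts.argmax()])
    corrected = np.array([mapping[int(c)] for c in y_cluster])
    return corrected, mapping

# Hypothetical labels: the clusters are right but numbered differently,
# and one sample (index 5) was put in the wrong cluster
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_cluster = [1, 1, 1, 2, 2, 0, 0, 0, 0]

corrected, mapping = align_labels(y_true, y_cluster)
print(mapping)
print(corrected)
```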
3.2 Meanshift preparatory knowledge
This is quite simple. The bandwidth can be estimated automatically or given manually (usually automatically, because you don't know what value to give).
The parameter X passed to estimate_bandwidth is an array, and n_samples is the number of samples used. If it is not specified, all samples are used.
estimate_bandwidth(X,n_samples=500) estimates a suitable bandwidth from 500 samples of X.
Finally, pass the bandwidth in when building the model.
3.3 KNN preliminary knowledge
KNN is simple. Define a KNN classifier and create an instance~
The only thing to pay attention to is to pass in y during training, because it is supervised learning, and you need to pass in the corresponding label. KMeans and Meanshift are unsupervised learning, and labels are not required during training.
4. Kmeans actual combat
Task: 2D data classification
1. Use the Kmeans algorithm to cluster the 2D data automatically, and predict the category of the data point V1 = 80, V2 = 60;
2. Calculate the prediction accuracy and complete the result correction
3. Repeat steps 1-2 with the KNN and Meanshift algorithms
Data: data.csv
Start work~
As usual, first create a new file with the following name (I've done it)
Read data
# read data
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
data = pd.read_csv('data.csv')
data.head()
Define X and y
# define X and y
X = data.drop('labels',axis=1)
y = data.loc[:,'labels']
y.head()
# look at the distribution of labels (how many categories there are and how many samples each has)
y.value_counts()
Draw the data observation distribution
# draw the original data
fig1 = plt.figure(figsize=(8,6))
plt.scatter(data.loc[:,'V1'],data.loc[:,'V2'])
plt.xlabel('V1')
plt.ylabel('V2')
plt.title('unlabeled data')
plt.show()
# draw the raw data with labels
fig2 = plt.figure(figsize=(8,6))
label0 = plt.scatter(data.loc[:,'V1'][y==0],data.loc[:,'V2'][y==0])
label1 = plt.scatter(data.loc[:,'V1'][y==1],data.loc[:,'V2'][y==1])
label2 = plt.scatter(data.loc[:,'V1'][y==2],data.loc[:,'V2'][y==2])
plt.xlabel('V1')
plt.ylabel('V2')
plt.title('labeled data')
plt.legend([label0,label1,label2],['label0','label1','label2'])
plt.show()
Build Kmeans model
# create an instance of KMeans and train the model
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3,random_state=0)
# unsupervised: only X is needed (KMeans.fit ignores y)
kmeans.fit(X)
Get cluster center
# get the cluster centers
cluster_center = kmeans.cluster_centers_
print(cluster_center)
# visualize the cluster centers together with the labeled data
fig3 = plt.figure(figsize=(8,6))
label0 = plt.scatter(data.loc[:,'V1'][y==0],data.loc[:,'V2'][y==0])
label1 = plt.scatter(data.loc[:,'V1'][y==1],data.loc[:,'V2'][y==1])
label2 = plt.scatter(data.loc[:,'V1'][y==2],data.loc[:,'V2'][y==2])
plt.xlabel('V1')
plt.ylabel('V2')
plt.title('labeled data')
plt.legend([label0,label1,label2],['label0','label1','label2'])
# slicing: [row slice, column slice]
plt.scatter(cluster_center[:,0],cluster_center[:,1])
plt.show()
Make predictions
# predict the class of the data point with V1 = 80 and V2 = 60
y_predict1 = kmeans.predict([[80,60]])
print('label%d' % y_predict1[0])
# from the plot it should be label2, but the prediction is label1, which is obviously a problem
# look at the predictions on the original training data
y_predict2 = kmeans.predict(X)
# compare the predicted label counts with the original label counts
print(pd.Series(y_predict2).value_counts(),y.value_counts())
Calculation accuracy
# calculate prediction accuracy
from sklearn.metrics import accuracy_score
print(accuracy_score(y,y_predict2))
The accuracy is alarmingly low, which shows that something went wrong. Draw a picture and have a look
# draw the current prediction results next to the original data
fig4 = plt.figure(figsize=(11,4))
plt.subplot(1,2,1)
label0 = plt.scatter(data.loc[:,'V1'][y==0],data.loc[:,'V2'][y==0])
label1 = plt.scatter(data.loc[:,'V1'][y==1],data.loc[:,'V2'][y==1])
label2 = plt.scatter(data.loc[:,'V1'][y==2],data.loc[:,'V2'][y==2])
plt.xlabel('V1')
plt.ylabel('V2')
plt.title('original data')
plt.legend([label0,label1,label2],['label0','label1','label2'])
fig5 = plt.subplot(1,2,2)
label0 = plt.scatter(data.loc[:,'V1'][y_predict2==0],data.loc[:,'V2'][y_predict2==0])
label1 = plt.scatter(data.loc[:,'V1'][y_predict2==1],data.loc[:,'V2'][y_predict2==1])
label2 = plt.scatter(data.loc[:,'V1'][y_predict2==2],data.loc[:,'V2'][y_predict2==2])
plt.xlabel('V1')
plt.ylabel('V2')
plt.title('predicted data')
plt.legend([label0,label1,label2],['label0','label1','label2'])
plt.show()
# kmeans grouped the points correctly, but the label numbers are wrong; we just need to remap the labels
The clustering itself is basically correct, but the labels are wrong, so they need to be corrected.
Correcting
# list that will hold the corrected labels
predict_correct = []
# custom correction function: map each predicted label to the corrected label
def check(para,lis):
    for x in para:
        if x==0:
            lis.append(1)
        elif x==1:
            lis.append(2)
        else:
            lis.append(0)
# apply the correction
check(y_predict2,predict_correct)
print(pd.Series(predict_correct).value_counts(),y.value_counts())
# convert the corrected list to a numpy array so it can be used as a mask
predict_correct = np.array(predict_correct)
type(predict_correct)
Draw the corrected image
# draw the corrected prediction results next to the original data
fig6 = plt.figure(figsize=(11,4))
plt.subplot(1,2,1)
label0 = plt.scatter(data.loc[:,'V1'][y==0],data.loc[:,'V2'][y==0])
label1 = plt.scatter(data.loc[:,'V1'][y==1],data.loc[:,'V2'][y==1])
label2 = plt.scatter(data.loc[:,'V1'][y==2],data.loc[:,'V2'][y==2])
plt.xlabel('V1')
plt.ylabel('V2')
plt.title('original data')
plt.legend([label0,label1,label2],['label0','label1','label2'])
fig7 = plt.subplot(1,2,2)
label0 = plt.scatter(data.loc[:,'V1'][predict_correct==0],data.loc[:,'V2'][predict_correct==0])
label1 = plt.scatter(data.loc[:,'V1'][predict_correct==1],data.loc[:,'V2'][predict_correct==1])
label2 = plt.scatter(data.loc[:,'V1'][predict_correct==2],data.loc[:,'V2'][predict_correct==2])
plt.xlabel('V1')
plt.ylabel('V2')
plt.title('kmeans predicted data')
plt.legend([label0,label1,label2],['label0','label1','label2'])
plt.show()
Calculate the accuracy after correction
# calculate the corrected accuracy
print(accuracy_score(y,predict_correct))
The effect is very good~
At this time, classify and predict the data with v1=80 and v2=60
# after correction, predict the class of the data point with V1 = 80 and V2 = 60
res = []
y_predict3 = kmeans.predict([[80,60]])
# the prediction is done; now correct its label
check(y_predict3,res)
print('label%d' % res[0])
KMeans, that's it~
5. KNN actual combat
KNN is supervised learning. When building a model, you need to pass in the corresponding label.
Based on the above code, KNN algorithm is used to build the model.
Build KNN model
# use the KNN algorithm (supervised learning, so labels must be given during training)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X,y)
Prediction using KNN model
# use the KNN model to predict the category for V1 = 80 and V2 = 60
knn_predict1 = knn.predict([[80,60]])
print(knn_predict1)
Calculate the prediction accuracy of KNN model
# calculate the accuracy of the KNN model (an accuracy of 1 means the model fits the data very well)
from sklearn.metrics import accuracy_score
knn_predict2 = knn.predict(X)
print(accuracy_score(y,knn_predict2))
The accuracy is as high as 100%! It can be seen that the effect of KNN model is very good.
Observe the distribution of the data
# compare the label counts of the KNN predictions with the original labels (they are completely consistent)
print(pd.Series(knn_predict2).value_counts(),y.value_counts())
The prediction results obtained by using KNN are completely consistent with the classification results of the original data, indicating that the effect of KNN is very good.
Draw the current KNN model
# draw the KNN prediction results next to the original data (completely consistent)
fig8 = plt.figure(figsize=(11,4))
plt.subplot(1,2,1)
label0 = plt.scatter(data.loc[:,'V1'][y==0],data.loc[:,'V2'][y==0])
label1 = plt.scatter(data.loc[:,'V1'][y==1],data.loc[:,'V2'][y==1])
label2 = plt.scatter(data.loc[:,'V1'][y==2],data.loc[:,'V2'][y==2])
plt.xlabel('V1')
plt.ylabel('V2')
plt.title('original data')
plt.legend([label0,label1,label2],['label0','label1','label2'])
fig9 = plt.subplot(1,2,2)
label0 = plt.scatter(data.loc[:,'V1'][knn_predict2==0],data.loc[:,'V2'][knn_predict2==0])
label1 = plt.scatter(data.loc[:,'V1'][knn_predict2==1],data.loc[:,'V2'][knn_predict2==1])
label2 = plt.scatter(data.loc[:,'V1'][knn_predict2==2],data.loc[:,'V2'][knn_predict2==2])
plt.xlabel('V1')
plt.ylabel('V2')
plt.title('knn predicted data')
plt.legend([label0,label1,label2],['label0','label1','label2'])
plt.show()
As shown in the figure, the distribution matches the original data, and the labels are not mixed up as they were with Kmeans, so no correction is needed. This is to be expected, because the corresponding label data was passed in when training the KNN model.
6. MeanShift practice
Based on the above, we can start to build the MeanShift model.
Calculate the bandwidth to be used by MeanShift model
# use the mean shift algorithm (calculate the bandwidth first, that is, the radius of the ball)
from sklearn.cluster import estimate_bandwidth
bw = estimate_bandwidth(X,n_samples=500)
print(bw)
Note that the bandwidth must be given before constructing the MeanShift model. You can set it yourself or let sklearn estimate it automatically. Generally we don't know what value to give, so we use automatic estimation, as is done here. For details of the estimate_bandwidth method, check the documentation yourself; there should be no need to elaborate~
Build MeanShift model
# build the model using the automatically estimated bandwidth
from sklearn.cluster import MeanShift
meanshift = MeanShift(bandwidth=bw)
# unsupervised: no labels are passed in
meanshift.fit(X)
forecast
# predict with the mean shift model (the label looks wrong; compare the predicted label counts with the original ones below)
meanshift_predict1 = meanshift.predict([[80,60]])
print(meanshift_predict1)
The prediction result is class 0, which is obviously wrong. The data points with V1 = 80 and V2 = 60 should be class 2.
Observed data distribution
meanshift_predict2 = meanshift.predict(X)
print(y.value_counts(),pd.Series(meanshift_predict2).value_counts())
# the comparison shows that the class 0 and class 2 labels of the mean shift model are swapped; the fix is the same as the kmeans correction above
Draw a picture to further observe the error
# draw a picture to see what's wrong (the class 0 and class 2 labels are swapped)
# draw the current prediction results next to the original data
fig8 = plt.figure(figsize=(11,4))
plt.subplot(1,2,1)
label0 = plt.scatter(data.loc[:,'V1'][y==0],data.loc[:,'V2'][y==0])
label1 = plt.scatter(data.loc[:,'V1'][y==1],data.loc[:,'V2'][y==1])
label2 = plt.scatter(data.loc[:,'V1'][y==2],data.loc[:,'V2'][y==2])
plt.xlabel('V1')
plt.ylabel('V2')
plt.title('original data')
plt.legend([label0,label1,label2],['label0','label1','label2'])
fig9 = plt.subplot(1,2,2)
label0 = plt.scatter(data.loc[:,'V1'][meanshift_predict2==0],data.loc[:,'V2'][meanshift_predict2==0])
label1 = plt.scatter(data.loc[:,'V1'][meanshift_predict2==1],data.loc[:,'V2'][meanshift_predict2==1])
label2 = plt.scatter(data.loc[:,'V1'][meanshift_predict2==2],data.loc[:,'V2'][meanshift_predict2==2])
plt.xlabel('V1')
plt.ylabel('V2')
plt.title('meanshift predicted data')
plt.legend([label0,label1,label2],['label0','label1','label2'])
plt.show()
It can be seen from the figure that the data classification is roughly correct, but the label is wrong. The correction method is the same as KMeans, and it can be corrected.
Calibrate the label
# correct the labels of the mean shift predictions (swap class 0 and class 2)
meanshift_correct = []
for i in meanshift_predict2:
    if i==0:
        meanshift_correct.append(2)
    elif i==1:
        meanshift_correct.append(1)
    else:
        meanshift_correct.append(0)
meanshift_correct = np.array(meanshift_correct)
print(meanshift_correct)
View corrected data distribution
# check whether the label counts are consistent after correction (consistency indicates that the correction is complete)
print(y.value_counts(),pd.Series(meanshift_correct).value_counts())
The output shows that the corrected label distribution is back to normal.
Draw the corrected data map
# if you still don't feel at ease, just draw a picture (exactly the same, no problem)
# draw the corrected prediction results next to the original data
fig8 = plt.figure(figsize=(11,4))
plt.subplot(1,2,1)
label0 = plt.scatter(data.loc[:,'V1'][y==0],data.loc[:,'V2'][y==0])
label1 = plt.scatter(data.loc[:,'V1'][y==1],data.loc[:,'V2'][y==1])
label2 = plt.scatter(data.loc[:,'V1'][y==2],data.loc[:,'V2'][y==2])
plt.xlabel('V1')
plt.ylabel('V2')
plt.title('original data')
plt.legend([label0,label1,label2],['label0','label1','label2'])
fig9 = plt.subplot(1,2,2)
label0 = plt.scatter(data.loc[:,'V1'][meanshift_correct==0],data.loc[:,'V2'][meanshift_correct==0])
label1 = plt.scatter(data.loc[:,'V1'][meanshift_correct==1],data.loc[:,'V2'][meanshift_correct==1])
label2 = plt.scatter(data.loc[:,'V1'][meanshift_correct==2],data.loc[:,'V2'][meanshift_correct==2])
plt.xlabel('V1')
plt.ylabel('V2')
plt.title('meanshift predicted data')
plt.legend([label0,label1,label2],['label0','label1','label2'])
plt.show()
As shown in the figure, it is obvious that the correction is successful and the effect is also very good~
Accuracy
Finally, take a look at the accuracy of MeanShift
# look at the accuracy of meanshift (very high)
print(accuracy_score(y,meanshift_correct))