# Introduction

Last time I ran the Boston house price forecasting code, and I left one problem open: training with only two parameters gave a much better result than training with all 13. At the time I assumed that more parameters would make the predicted trend more accurate. Since then I have read up on the subject and thought it through, and I now have a rough understanding of what happened.

## Code

The code from last time is here for anyone who wants to review it:

Boston house price forecast (I)

The code is much the same this time, but I still include the process and the results here.

Note: the code is copied from a notebook environment. If you run it as a .py file, adjust the formatting or delete the parts you don't need.

Python 3.8.3 / Jupyter 1.0.0 / VS Code

```python
# Import the required libraries
import random

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Note: load_boston was removed in scikit-learn 1.2, so this needs an older version
from sklearn.datasets import load_boston

# Load the data and inspect it
datasets = load_boston()
datasets.keys()

# Build the dataset: the first 70% for training, the remaining 30% for testing
data = datasets['data']
target = datasets['target']
feature_name = datasets['feature_names']

using_data = np.array(data)
price = np.array(target)
training_data, test_data = np.split(using_data, [int(using_data.shape[0] * 0.7)])
training_price, test_price = np.split(price, [int(price.shape[0] * 0.7)])

# Define the model, the loss and the gradients
def model(x, w, b):
    return np.dot(x, w.T) + b

def loss(y, yhat):
    return np.mean((yhat - y) ** 2)

def partial_w(x, y, yhat):
    return np.array([2 * np.mean(yhat - y) * x[i] for i in range(len(x))])

def partial_b(x, y, yhat):
    return 2 * np.mean(yhat - y)

# Start training
w = np.random.random_sample((1, training_data.shape[1]))
b = random.random()
learning_rate = 1e-7
losses = []
epoch = 600

for i in range(epoch):
    batch_losses = []
    for batch in range(training_data.shape[0]):
        # Pick one random training sample per step
        index = np.random.choice(range(training_data.shape[0]))
        x = training_data[index, :]
        y = training_price[index]
        yhat = model(x, w, b)
        loss_v = loss(y, yhat)
        batch_losses.append(loss_v)
        w = w - partial_w(x, y, yhat) * learning_rate
        b = b - partial_b(x, y, yhat) * learning_rate
        if batch % 100 == 0:
            print(f'epoch:{i} ,batch:{batch} ,loss:{loss_v}')
    losses.append(np.mean(batch_losses))

# View how the loss changes during training
plt.figure(figsize=(12, 8), dpi=80)
plt.plot(losses)
plt.show()

# Check the model on the test set
model_price = []
for i in range(test_data.shape[0]):
    x = test_data[i, :]
    res = model(x, w, b)
    model_price.append(res)

plt.figure(figsize=(12, 8), dpi=80)
plt.plot(model_price, label='model_price')
plt.plot(test_price, color='r', label='price')
plt.legend()  # Without this call the line labels are not displayed
plt.show()
```

##### The result from last time is attached here for comparison

The fit with two parameters last time is clearly much better than the current fit with 13 parameters.

## Analyze the cause

We can look at this together with the correlation graph.

And these are the model parameters after training.

It is clear that many of the trained parameters differ a lot from what the correlations suggest. Seven features are negatively correlated with price, but only two of the trained weights are negative, and their magnitudes are small. So I suspected a training problem and increased the number of training epochs, but the results did not improve much.
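The comparison between correlation signs and weight signs can also be done in code rather than by eye. The following is a minimal sketch on synthetic stand-in data, not the Boston data; the helper names `correlation_with_target` and `sign_mismatches`, the feature names and the `hypothetical_weights` values are all made up for illustration.

```python
import numpy as np

def correlation_with_target(features, target):
    """Pearson correlation of each feature column with the target."""
    centered_x = features - features.mean(axis=0)
    centered_y = target - target.mean()
    covariance = centered_x.T @ centered_y
    scale = np.sqrt((centered_x ** 2).sum(axis=0) * (centered_y ** 2).sum())
    return covariance / scale

def sign_mismatches(correlations, weights, names):
    """Names of features whose weight has a different sign from its correlation."""
    weights = np.ravel(weights)  # accepts the (1, n_features) shape used for w
    return [name for name, c, w in zip(names, correlations, weights)
            if np.sign(c) != np.sign(w)]

# Synthetic stand-in data (an assumption for illustration only):
# "a" pushes the target up, "b" pushes it down, "c" is pure noise.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3))
target = 3.0 * features[:, 0] - 2.0 * features[:, 1] + rng.normal(scale=0.1, size=200)
names = ["a", "b", "c"]

corr = correlation_with_target(features, target)
# Pretend these weights came out of training; all positive, like most of mine.
hypothetical_weights = np.array([[0.5, 0.4, 0.1]])
mismatched = sign_mismatches(corr, hypothetical_weights, names)
print(mismatched)
```

With the real data, `features` would be `training_data`, `target` would be `training_price`, and `hypothetical_weights` would be the trained `w`.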

When I started learning, I did not have a clear idea of what a data set really is. Later I read up on it and concluded that the cause was the small amount of data. The sklearn data set has only a little over 500 samples (506), which is far from enough for 14 parameters (13 weights plus the bias). A rule of thumb I came across is that each additional parameter needs roughly an order of magnitude more data, although this is not a fixed rule. So, generally speaking, the model trains poorly because there is not enough data.
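The general effect, more parameters needing more data, can be seen in a small experiment. This is a sketch on synthetic data, not a reproduction of the Boston result: it assumes only the first two of 13 features actually influence the target, and it uses ordinary least squares (`np.linalg.lstsq`) instead of my gradient descent loop so that the fit itself is not in question. The function names are made up for illustration.

```python
import numpy as np

def fit_least_squares(x, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    design = np.column_stack([x, np.ones(len(x))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict(x, coef):
    return np.column_stack([x, np.ones(len(x))]) @ coef

def average_test_mse(n_train, columns, trials=50, n_features=13, seed=0):
    """Mean test MSE over several random data sets, fitting only `columns`."""
    # The same seed gives the same data sets for every choice of columns,
    # so the two-feature and 13-feature fits are compared on identical data.
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(trials):
        x = rng.normal(size=(n_train + 1000, n_features))
        # Assumption: only the first two features matter, plus unit noise.
        y = 2.0 * x[:, 0] - 1.5 * x[:, 1] + rng.normal(size=len(x))
        x_train, x_test = x[:n_train, columns], x[n_train:, columns]
        y_train, y_test = y[:n_train], y[n_train:]
        coef = fit_least_squares(x_train, y_train)
        errors.append(np.mean((predict(x_test, coef) - y_test) ** 2))
    return float(np.mean(errors))

two_features = [0, 1]
all_features = list(range(13))

small_two = average_test_mse(20, two_features)
small_all = average_test_mse(20, all_features)
large_two = average_test_mse(5000, two_features)
large_all = average_test_mse(5000, all_features)

print(f"20 training samples:   2 features {small_two:.2f}, 13 features {small_all:.2f}")
print(f"5000 training samples: 2 features {large_two:.2f}, 13 features {large_all:.2f}")
```

With few training samples the 13-feature fit should generalize worse than the two-feature fit, while with plenty of samples the two should end up close to each other.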

## Occam's Razor

All of this reminds me of the principle of Occam's razor.

Occam's razor (also spelled Ockham's razor) was proposed by William of Occam (about 1285-1349), an English logician and Franciscan friar of the 14th century. The principle is usually stated as "do not multiply entities beyond necessity", that is, the principle of being simple and effective. As he put it in his commentary on the Sentences (Book 2, Question 15), "It is futile to do with more things that which can be done with fewer."

My understanding is that in machine learning we should choose the simplest model that solves the problem. The more complex and fine-grained a prediction problem becomes, the more factors you have to consider, and the more likely the prediction is to drift in the wrong direction. So, when the amount of data is limited, a simpler model tends to predict the trend more accurately.

Take the binary tree in the figure above as an example. I is the end point. With only one intersection, A, it is clear that I is to the left of A. Adding C, D and E may cause confusion, especially during training, because I is to the left of D. If the formula above is used, the weight is represented by the relative position of the x node, and if the weight of D is too large, an error results.

This explanation may not be very good, but I can't think of a better example for the time being.

Another point: Occam's razor as I understood it through this process is stated in terms of the number of parameters. I think it really depends on the amount of data, because the amount of data sets the limit of what the model can learn. As long as there is enough data, it can in theory support training with any number of parameters. If the amount of data is limited, the parameters with the greatest influence should be selected for training. In this case `['RM']` and `['LSTAT']` are the two features with the largest positive correlation and the largest negative correlation, respectively.
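Picking the feature with the largest positive correlation and the one with the largest negative correlation can be sketched as follows. The data here is synthetic and the feature names `f0` to `f3` are placeholders, not Boston columns; the function name `strongest_positive_and_negative` is also made up for illustration.

```python
import numpy as np

def strongest_positive_and_negative(features, target, names):
    """Return the names of the most positively and most negatively correlated features."""
    corr = np.array([
        np.corrcoef(features[:, j], target)[0, 1]  # Pearson correlation with the target
        for j in range(features.shape[1])
    ])
    return names[int(np.argmax(corr))], names[int(np.argmin(corr))]

# Synthetic stand-in data (an assumption): f1 raises the target, f3 lowers it,
# f0 and f2 have no influence at all.
rng = np.random.default_rng(1)
features = rng.normal(size=(300, 4))
target = 4.0 * features[:, 1] - 3.0 * features[:, 3] + rng.normal(size=300)
names = ["f0", "f1", "f2", "f3"]

best_positive, best_negative = strongest_positive_and_negative(features, target, names)
print(best_positive, best_negative)
```

On the real data, `features` would be `using_data`, `target` would be `price`, and `names` would be `feature_name` from the code above.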

## Summary

I feel that my explanation is still not clear enough, partly because I have only just started on this topic. Both my machine learning and my writing will improve as I keep learning. Overall, the most fundamental reason for the problem this time is that there is not enough data, so the training results are poor. In this situation, we can first use correlation or other means to find the parameters that have the greatest impact on the result. That gives the best model we can get with limited training resources.