MOOC: Deep Learning Application Development with TensorFlow in Practice
Chapter 6: Multiple linear regression: predicting Boston house prices with TensorFlow
TensorFlow version 2.3
Problem introduction
The Boston housing dataset contains 506 samples; each sample has 12 feature variables plus the average house price for the area. The price clearly depends on several feature variables at once, so this is not a univariate (simple) linear regression problem. Building a linear equation from multiple feature variables is a multiple (multivariable) linear regression problem.
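In matrix form (a sketch of the notation only, matching the code later in this post), writing the 12 features of one sample as x1 through x12, the model to be fitted is:

$$\hat{y} = w_1 x_1 + w_2 x_2 + \cdots + w_{12} x_{12} + b = \mathbf{x}\,\mathbf{w} + b$$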
Data set introduction
The dataset is linked below; you can also get it from the MOOC course page.
Link: https://pan.baidu.com/s/1CYTQSYUNi4U04i26wLYR9w
Extraction code: ymfa
Take a brief look at the dataset
Meaning of each column
- CRIM: per-capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
- INDUS: proportion of non-retail business land per town
- CHAS: Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
- NOX: nitric oxide concentration
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built before 1940
- DIS: weighted distance to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- LSTAT: percentage of lower-status population
- MEDV: median value of owner-occupied homes (the target house price), in thousands of US dollars
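If you want that brief look at the data yourself, a minimal sketch like the following works (it assumes the file is saved as boston.csv in the working directory, as in the rest of this post):

import pandas as pd

# Preview the raw CSV: first rows and per-column summary statistics
df_preview = pd.read_csv("boston.csv", header=0)
print(df_preview.head())      # first 5 rows
print(df_preview.describe())  # count / mean / std / min / max for each column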
Now on to the code!
Data processing
Reading the data
First download the dataset, then put the csv file in the same directory as your Python file. In the code below, I will skip comments on the obvious parts.
The first step is the usual imports:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
from sklearn.utils import shuffle
from sklearn.preprocessing import scale  # relevant sklearn utilities

print(tf.__version__)
Two new libraries are used here: pandas and sklearn. If the import fails with a ModuleNotFoundError, don't panic: the package simply isn't installed. Open a terminal, activate your environment, and run:
pip install pandas
The download is about 10.2 MB and may be painfully slow. What to do? Switch to a mirror:
pip install pandas -i https://pypi.doubanio.com/simple/
It'll be ready in a few seconds.
While you're at it, install scikit-learn as well (that is the pip package name for sklearn). Depending on your network, pick one of the following two commands:
pip install scikit-learn
or
pip install scikit-learn -i https://pypi.doubanio.com/simple/
Once both are installed, re-run the imports and everything should be ready.
The dataset is stored in a CSV file, so we use pandas to read it. I won't introduce pandas here; we only need one or two of its methods, and its documentation is easy to find.
df = pd.read_csv("boston.csv",header=0)
read_csv returns a pandas.core.frame.DataFrame, but for the processing that follows we need an np.array. We can get one from the DataFrame's values attribute, then check the shape:
df = df.values  # convert the DataFrame to an np.array
df.shape
Output:
(506, 13)
Separating features and labels
In the dataset, the first twelve columns are the feature values and the last column is the label. We need to separate them:
x_data = df[:, :12]
y_data = df[:, 12]
print(f"x_data shape:{x_data.shape},y_data shape:{y_data.shape}")
Output:
x_data shape:(506, 12),y_data shape:(506,)
Next comes normalization: add a for loop right after the code above. If you skip this step, you will see the consequences later in the training section.
for i in range(12):
    x_data[:, i] = (x_data[:, i] - x_data[:, i].min()) / (x_data[:, i].max() - x_data[:, i].min())
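The same min-max normalization can also be written without the loop; this vectorized version (used in place of the loop, just as an equivalent sketch) gives the same result:

# Equivalent vectorized min-max scaling over all 12 feature columns at once
x_data = (x_data - x_data.min(axis=0)) / (x_data.max(axis=0) - x_data.min(axis=0))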
Splitting into training, validation and test sets
Why split the dataset?
We build and train a machine learning model so that it makes good predictions on new data. If we train on all of the data, that is like being handed the complete question bank before an exam and told that every question will be drawn from it word for word: read it carefully and a 90+ is almost guaranteed.
A model behaves the same way. Train it on all the data and it will do very well on that dataset, but there is no telling how it will do on data it has never seen.
So we split the dataset, give the model only part of it for training, and keep new data aside for testing.
Generally speaking, we divide a dataset into a training set and a test set:
- Training set: the subset used to train the model
- Test set: the subset used to evaluate the model
Good performance on the test set is a useful indicator of good performance on new data, provided the test set is large enough and we don't cheat by reusing the same test set over and over.
In addition, make sure the test set meets two conditions:
- It is large enough to produce statistically significant results
- It is representative of the whole dataset, i.e. its characteristics match those of the training set
That is the basic idea of the split.
There is still a problem, though. In each iteration we train on the training data, evaluate on the test data, and use that evaluation to choose and adjust the model's hyperparameters. Repeating this loop means the model may end up, without us noticing, fitting the quirks of that particular test set.
What can we do?
Simple: split off a validation set and let it take over the test set's role during this tuning loop. The test set is then used only once, at the very end, so the model sees it exactly one time. The workflow becomes: train on the training set, tune against the validation set, and evaluate once on the test set.
OK, that's the theory; now for the code.
train_num = 300                                  # number of training samples
valid_num = 100                                  # number of validation samples
test_num = len(x_data) - train_num - valid_num   # number of test samples

# Training set
x_train = x_data[:train_num]
y_train = y_data[:train_num]

# Validation set
x_valid = x_data[train_num:train_num + valid_num]
y_valid = y_data[train_num:train_num + valid_num]

# Test set
x_test = x_data[train_num + valid_num:train_num + valid_num + test_num]
y_test = y_data[train_num + valid_num:train_num + valid_num + test_num]
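Note that this split simply takes the first 300 rows for training, the next 100 for validation, and the rest for testing. The imports at the top already bring in shuffle from sklearn.utils; if you would rather make the three subsets random samples, a sketch like the following (my addition, not part of the course code; the random_state value is arbitrary) would need to run before the slicing above:

# Optional: shuffle rows, keeping features and labels aligned, before splitting
x_data, y_data = shuffle(x_data, y_data, random_state=42)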
Since the data will be fed into the model for training and for computing the loss, we convert the features to tf.float32 here:
x_train = tf.cast(x_train, dtype=tf.float32)
x_valid = tf.cast(x_valid, dtype=tf.float32)
x_test = tf.cast(x_test, dtype=tf.float32)
Build model
Define model
The model is actually the same as in the univariate linear regression last time, except that w and b are no longer plain scalars: w is a 12×1 matrix (and b a length-1 vector), so the product becomes a matrix multiplication.
def model(x, w, b):
    return tf.matmul(x, w) + b
Create the variables
w = tf.Variable(tf.random.normal([12, 1], mean=0.0, stddev=1.0, dtype=tf.float32))
b = tf.Variable(tf.zeros(1), dtype=tf.float32)
print(w, b)
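With x of shape (N, 12), w of shape (12, 1) and b of shape (1,), the model output has shape (N, 1). A quick sanity check (just a sketch) would be:

# Sanity check: one prediction per training sample
print(model(x_train, w, b).shape)  # expected: (300, 1)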
Training the model
Set the hyperparameters
training_epochs = 50
lr = 0.001
batch_size = 10
Define the mean squared error loss function
def loss(x, y, w, b):
    err = model(x, w, b) - y            # difference between prediction and ground truth
    squared_err = tf.square(err)        # square the errors
    return tf.reduce_mean(squared_err)  # average them: the mean squared error
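For reference, over a batch of n samples this function computes the standard mean squared error:

$$L(w, b) = \frac{1}{n}\sum_{i=1}^{n}\bigl(\hat{y}_i - y_i\bigr)^2$$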
Define gradient calculation function
def grad(x, y, w, b):
    with tf.GradientTape() as tape:
        loss_ = loss(x, y, w, b)
    return tape.gradient(loss_, [w, b])  # return the gradients with respect to w and b
Select optimizer
Here we use the built-in SGD optimizer from Keras:
optimizer = tf.keras.optimizers.SGD(lr)
Iterative training
The next step is the training process
loss_list_train = []  # training loss per epoch
loss_list_valid = []  # validation loss per epoch
total_step = int(train_num / batch_size)

for epoch in range(training_epochs):
    for step in range(total_step):
        xs = x_train[step * batch_size:(step + 1) * batch_size, :]
        ys = y_train[step * batch_size:(step + 1) * batch_size]
        grads = grad(xs, ys, w, b)                     # compute the gradients
        optimizer.apply_gradients(zip(grads, [w, b]))  # apply them to update w and b
    loss_train = loss(x_train, y_train, w, b).numpy()
    loss_valid = loss(x_valid, y_valid, w, b).numpy()
    loss_list_train.append(loss_train)
    loss_list_valid.append(loss_valid)
    print(f"epoch={epoch+1},train_loss={loss_train},valid_loss={loss_valid}")
However, if you launch this training directly on the raw data, you will see the following output:
epoch=1,train_loss=nan,valid_loss=nan
epoch=2,train_loss=nan,valid_loss=nan
epoch=3,train_loss=nan,valid_loss=nan
...
epoch=50,train_loss=nan,valid_loss=nan
The loss comes out as nan because the raw feature values are large and span very different ranges, so the computation blows up. To avoid this, the data needs to be normalized, which is exactly the normalization loop already shown in the "Separating features and labels" step. With normalization in place, run the training again:
epoch=1,train_loss=611.9638061523438,valid_loss=410.65576171875
epoch=2,train_loss=480.1791076660156,valid_loss=300.2012634277344
epoch=3,train_loss=381.4595642089844,valid_loss=223.76609802246094
epoch=4,train_loss=307.4804992675781,valid_loss=171.88424682617188
epoch=5,train_loss=252.01597595214844,valid_loss=137.6017303466797
epoch=6,train_loss=210.4091796875,valid_loss=115.82573699951172
epoch=7,train_loss=179.17657470703125,valid_loss=102.8398666381836
epoch=8,train_loss=155.71231079101562,valid_loss=95.94418334960938
epoch=9,train_loss=138.06668090820312,valid_loss=93.18765258789062
epoch=10,train_loss=124.78082275390625,valid_loss=93.16976165771484
epoch=11,train_loss=114.76297760009766,valid_loss=94.89347839355469
epoch=12,train_loss=107.19605255126953,valid_loss=97.6563949584961
...
epoch=46,train_loss=81.6086196899414,valid_loss=130.1387939453125
epoch=47,train_loss=81.59551239013672,valid_loss=129.84445190429688
epoch=48,train_loss=81.58541107177734,valid_loss=129.5481414794922
epoch=49,train_loss=81.57818603515625,valid_loss=129.2509307861328
epoch=50,train_loss=81.5737075805664,valid_loss=128.953857421875
Next, we can visualize the loss curves:
plt.xlabel("Epochs") plt.ylabel("Loss") plt.plot(loss_list_train,'blue',label="Train loss") plt.plot(loss_list_valid,'red',label='Valid loss') plt.legend(loc=1)
Testing the model
Check the loss on the test set:
print(f"Test_loss:{loss(x_test,y_test,w,b).numpy()}")
Output:
Test_loss:115.94937133789062
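Since the loss is a mean squared error in units of (thousand dollars) squared, it can be easier to read as a root mean squared error; a small sketch:

# RMSE in the original unit (thousands of US dollars)
test_mse = loss(x_test, y_test, w, b).numpy()
print(f"Test RMSE: {np.sqrt(test_mse):.2f} thousand dollars")

With the test loss above, that comes to an RMSE of roughly 10.8 thousand dollars.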
Applying the model
Pick a house at random from the test set:
# apply the model to one sample
test_house_id = np.random.randint(0, test_num)
y = y_test[test_house_id]
y_pred = model(x_test, w, b)[test_house_id]
y_predict = tf.reshape(y_pred, ()).numpy()
print(f"House id {test_house_id} actual value {y} predicted value {y_predict}")
Output:
House id 34 actual value 11.7 predicted value 23.49941062927246
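To get a feel for the model beyond a single random house, a scatter plot of predicted versus actual prices on the whole test set can help. This is an extra sketch, not part of the course notebook; the 0–50 range for the reference line is just a convenient bound for this dataset:

# Compare predictions with actual prices over the whole test set
y_test_pred = tf.reshape(model(x_test, w, b), [-1]).numpy()
plt.scatter(y_test, y_test_pred, s=10)
plt.plot([0, 50], [0, 50], 'r--')  # reference line: prediction equals actual
plt.xlabel("Actual price (thousand $)")
plt.ylabel("Predicted price (thousand $)")
plt.show()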
That's it for this chapter; it is still recommended to watch the MOOC course itself.
These study notes are for reference only. If there are errors, corrections are welcome!