Multiple linear regression: TensorFlow practice of Boston house price prediction

MOOC course: Deep Learning Application Development with TensorFlow in Practice
Chapter 6: Multiple linear regression: TensorFlow practice of Boston house price prediction
TensorFlow version 2.3


Problem introduction

The Boston housing dataset contains 506 samples; each sample has 12 feature variables plus the median house price of the area. The price obviously depends on more than one feature variable, so this is not a univariate (simple) linear regression problem. Building a linear equation from multiple feature variables is a multiple (multivariable) linear regression problem.
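Concretely, writing the 12 features of a sample as x_1 through x_12, the model we build below has the form

y = w_1 * x_1 + w_2 * x_2 + ... + w_12 * x_12 + b

where the weights w_1 ... w_12 and the bias b are the parameters to be learned from the data.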

Data set introduction

The dataset can be downloaded from the link below; it is also available on the MOOC platform.
Link: https://pan.baidu.com/s/1CYTQSYUNi4U04i26wLYR9w
Extraction code: ymfa

Take a brief look at the dataset

Meaning of each column

  • CRIM: per capita crime rate by town
  • ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
  • INDUS: proportion of non-retail business acres per town
  • CHAS: Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
  • NOX: nitric oxide concentration
  • RM: average number of rooms per dwelling
  • AGE: proportion of owner-occupied units built before 1940
  • DIS: weighted distance to five Boston employment centers
  • RAD: index of accessibility to radial highways
  • TAX: full-value property tax rate per $10,000
  • PTRATIO: pupil-teacher ratio by town
  • LSTAT: percentage of the population with lower socioeconomic status
  • MEDV: median value of owner-occupied homes, in thousands of US dollars

Now, on to the code!

Data processing

Reading the data

First, download the dataset and put the CSV file in the same directory as your Python file. In the code below, obvious comments are omitted.
The first step is the usual package imports:

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
from sklearn.utils import shuffle
from sklearn.preprocessing import scale  # sklearn preprocessing helper (imported here, though we normalize manually below)
print(tf.__version__)

Two new libraries are used here: pandas and sklearn. If the import fails with a ModuleNotFoundError, don't panic: pandas simply isn't installed yet. Open a terminal, activate your environment, and run:

pip install pandas

The package is about 10.2 MB, so the download may be slow. What to do? Switch to a mirror:

pip install pandas -i  https://pypi.doubanio.com/simple/

It should finish in a few seconds.
While you're at it, install scikit-learn (imported as sklearn) as well. Depending on your network, pick one of the following two commands:

pip install scikit-learn
or
pip install scikit-learn -i https://pypi.doubanio.com/simple/

Once the installation finishes, we're good to go.

The dataset is stored in a CSV file, so we use the pandas library to read it. I won't introduce pandas here; if you're not familiar with it, look up a tutorial yourself. We really only use one or two of its methods.

df = pd.read_csv("boston.csv",header=0)

pd.read_csv returns a pandas.core.frame.DataFrame, but for the processing that follows we want a NumPy array, which we can get from the DataFrame's values attribute. Let's check the shape:

df = df.values  # get the underlying np.array
df.shape

output

(506, 13)
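As an aside, if you want to eyeball the raw data before converting it to an array, pandas can print a quick preview and summary. A minimal sketch (run it before the df = df.values line above, since after that df is a plain array):

df = pd.read_csv("boston.csv", header=0)
print(df.head())      # first five rows
print(df.describe())  # per-column count, mean, std, min, max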

Separating features and labels

In the dataset, the first twelve columns are the feature values and the last column is the label. We need to separate them:

x_data=df[:,:12]
y_data=df[:,12]
print(f"x_data shape:{x_data.shape},y_data shape:{y_data.shape}")

output

x_data shape:(506, 12),y_data shape:(506,)

We also need to normalize the features; append the following for loop to the code above (it min-max scales each column to [0, 1]). If you skip this step, you will see the consequences in the training section later.

for i in range(12):
    x_data[:,i]=(x_data[:,i]-x_data[:,i].min())/(x_data[:,i].max()-x_data[:,i].min())
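For reference, the same min-max normalization can be written in one line using NumPy's column-wise min and max instead of the loop; a minimal equivalent sketch:

x_data = (x_data - x_data.min(axis=0)) / (x_data.max(axis=0) - x_data.min(axis=0))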

Splitting into training, validation, and test sets

Why split the dataset?

We build and train a machine learning model so that it makes good predictions on new data. If we train on all of the data, it is like being handed the entire question bank before an exam and told that every question will be taken from it word for word; read it carefully and a 90+ is guaranteed.
A machine is no different. Train it on all the data and it will do very well on that dataset, but hand it new data and there is no telling how it will perform.
So let's split the dataset, give the model only part of it for training, and keep genuinely new data for testing.

Generally speaking, we divide a dataset into a training set and a test set:

  • Training set - the subset used to train the model
  • Test set - the subset used to evaluate the trained model

Performance on the test set is a useful indicator of how well the model will do on new data, provided the test set is large enough and you don't "cheat" by reusing the same test set over and over.
In addition, make sure the test set meets the following two conditions:

  • It is large enough to produce statistically significant results
  • It is representative of the whole dataset; its characteristics should match those of the training set

That's the basic idea.

However, there is a problem. In each iteration we train on the training data, evaluate on the test data, and then, guided by those evaluation results, select and adjust the model's hyperparameters. Repeating this process may cause the model to unconsciously fit the peculiarities of that particular test set.
So what do we do?

Simple: carve out a validation set and let it take over the test set's duties in the loop above. The test set is then used only once, at the very end, so the model sees it exactly once.
OK, that's the principle; now for the code.

train_num = 300  # number of training samples
valid_num = 100  # number of validation samples
test_num = len(x_data) - train_num - valid_num  # number of test samples

# Training set split
x_train = x_data[:train_num]
y_train = y_data[:train_num]

# Validation set split
x_valid = x_data[train_num:train_num + valid_num]
y_valid = y_data[train_num:train_num + valid_num]

# Test set split
x_test = x_data[train_num + valid_num:train_num + valid_num + test_num]
y_test = y_data[train_num + valid_num:train_num + valid_num + test_num]
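One thing to note: sklearn.utils.shuffle was imported at the top but is not used here; the slices above simply take the rows in file order (first 300 for training, next 100 for validation, the rest for testing). If you would rather randomize the order before splitting, a minimal sketch (run it before the slicing above; the loss numbers shown later were produced without shuffling, so yours would differ):

x_data, y_data = shuffle(x_data, y_data, random_state=1)  # shuffle features and labels together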

Since these arrays will be fed to the model for training and loss computation, we convert the features to tf.float32 here:

x_train=tf.cast(x_train,dtype=tf.float32)
x_valid=tf.cast(x_valid,dtype=tf.float32)
x_test=tf.cast(x_test,dtype=tf.float32)

Build model

Define model

The model itself is the same as in the univariate linear regression we did last time, except that w is no longer a scalar but a 12×1 matrix (b stays a single bias value), so the prediction becomes a matrix multiplication:

def model(x,w,b):
    return tf.matmul(x,w)+b

Create variables

w=tf.Variable(tf.random.normal([12,1],mean=0.0,stddev=1.0,dtype=tf.float32))
b=tf.Variable(tf.zeros(1),dtype=tf.float32)
print(w,b)
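As an optional sanity check, you can already push a couple of samples through the model; with x of shape (n, 12) and w of shape (12, 1), the output has shape (n, 1):

print(model(x_train[:2], w, b))  # a (2, 1) tensor of (still random) predictions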

Training model

Set hyperparameters

training_epochs=50
lr=0.001
batch_size=10

Define the mean squared error loss function

def loss(x,y,w,b):
    err = model(x, w, b) - y            # difference between predictions and ground truth
    squared_err = tf.square(err)        # squared error
    return tf.reduce_mean(squared_err)  # mean squared error
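One subtlety worth pointing out: model(x, w, b) returns a tensor of shape (n, 1) while y has shape (n,), so err broadcasts to an (n, n) matrix and the value returned is the mean over that whole matrix rather than the usual element-wise MSE. Training still converges, as the outputs below show, but if you want the textbook mean squared error you can squeeze the prediction first. A minimal sketch of such a variant (not the code that produced the outputs in this post):

def loss_mse(x, y, w, b):
    y = tf.cast(y, dtype=tf.float32)            # make the dtypes match explicitly
    pred = tf.squeeze(model(x, w, b), axis=1)   # (n, 1) -> (n,)
    return tf.reduce_mean(tf.square(pred - y))  # element-wise mean squared error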

Define gradient calculation function

def grad(x,y,w,b):
    with tf.GradientTape() as tape:
        loss_=loss(x,y,w,b)
    return tape.gradient(loss_,[w,b])# Return gradient vector

Select optimizer

Here we use the built-in SGD optimizer from tf.keras:

optimizer=tf.keras.optimizers.SGD(lr)
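For reference, plain SGD (without momentum) updates each variable by subtracting the learning rate times its gradient, which is what optimizer.apply_gradients does for us inside the training loop below. A minimal hand-written equivalent, using the same xs/ys mini-batch names as that loop:

grad_w, grad_b = grad(xs, ys, w, b)  # gradients of the loss w.r.t. w and b
w.assign_sub(lr * grad_w)            # w <- w - lr * grad_w
b.assign_sub(lr * grad_b)            # b <- b - lr * grad_b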

Iterative training

The next step is the training process

loss_list_train = []  # training loss per epoch
loss_list_valid = []  # validation loss per epoch
total_step = int(train_num / batch_size)

for epoch in range(training_epochs):
    for step in range(total_step):
        xs = x_train[step*batch_size:(step+1)*batch_size, :]
        ys = y_train[step*batch_size:(step+1)*batch_size]
        grads = grad(xs, ys, w, b)                     # compute gradients on this mini-batch
        optimizer.apply_gradients(zip(grads, [w, b]))  # let the optimizer update w and b
    loss_train = loss(x_train, y_train, w, b).numpy()
    loss_valid = loss(x_valid, y_valid, w, b).numpy()
    loss_list_train.append(loss_train)
    loss_list_valid.append(loss_valid)
    print(f"epoch={epoch+1},train_loss={loss_train},valid_loss={loss_valid}")

However, if you run this training directly on the raw (un-normalized) data, you get the following output:

epoch=1,train_loss=nan,valid_loss=nan
epoch=2,train_loss=nan,valid_loss=nan
epoch=3,train_loss=nan,valid_loss=nan
...
epoch=50,train_loss=nan,valid_loss=nan

As you can see, every loss comes out as nan. The raw features take on large and very differently scaled values, so the gradient updates blow up and the loss overflows to nan. To avoid this we need to normalize the data, which is exactly the for loop I already added in the "Separating features and labels" step above. With normalization in place, run the training again:

epoch=1,train_loss=611.9638061523438,valid_loss=410.65576171875
epoch=2,train_loss=480.1791076660156,valid_loss=300.2012634277344
epoch=3,train_loss=381.4595642089844,valid_loss=223.76609802246094
epoch=4,train_loss=307.4804992675781,valid_loss=171.88424682617188
epoch=5,train_loss=252.01597595214844,valid_loss=137.6017303466797
epoch=6,train_loss=210.4091796875,valid_loss=115.82573699951172
epoch=7,train_loss=179.17657470703125,valid_loss=102.8398666381836
epoch=8,train_loss=155.71231079101562,valid_loss=95.94418334960938
epoch=9,train_loss=138.06668090820312,valid_loss=93.18765258789062
epoch=10,train_loss=124.78082275390625,valid_loss=93.16976165771484
epoch=11,train_loss=114.76297760009766,valid_loss=94.89347839355469
epoch=12,train_loss=107.19605255126953,valid_loss=97.6563949584961
epoch=13,train_loss=101.46824645996094,valid_loss=100.97039031982422
epoch=14,train_loss=97.12144470214844,valid_loss=104.50247192382812
epoch=15,train_loss=93.81257629394531,valid_loss=108.03101348876953
epoch=16,train_loss=91.28462982177734,valid_loss=111.41390991210938
epoch=17,train_loss=89.34487915039062,valid_loss=114.56527709960938
epoch=18,train_loss=87.84889221191406,valid_loss=117.43827819824219
epoch=19,train_loss=86.68826293945312,valid_loss=120.01301574707031
epoch=20,train_loss=85.78163146972656,valid_loss=122.2874755859375
epoch=21,train_loss=85.06792449951172,valid_loss=124.2712631225586
epoch=22,train_loss=84.50117492675781,valid_loss=125.98110961914062
epoch=23,train_loss=84.04681396484375,valid_loss=127.43778991699219
epoch=24,train_loss=83.67878723144531,valid_loss=128.66384887695312
epoch=25,train_loss=83.3774642944336,valid_loss=129.68228149414062
epoch=26,train_loss=83.12799072265625,valid_loss=130.5154571533203
epoch=27,train_loss=82.91910552978516,valid_loss=131.18463134765625
epoch=28,train_loss=82.74230194091797,valid_loss=131.70956420898438
epoch=29,train_loss=82.59107971191406,valid_loss=132.10824584960938
epoch=30,train_loss=82.46050262451172,valid_loss=132.39715576171875
epoch=31,train_loss=82.34676361083984,valid_loss=132.59092712402344
epoch=32,train_loss=82.24696350097656,valid_loss=132.70252990722656
epoch=33,train_loss=82.1588363647461,valid_loss=132.74366760253906
epoch=34,train_loss=82.08061981201172,valid_loss=132.72451782226562
epoch=35,train_loss=82.01095581054688,valid_loss=132.6539764404297
epoch=36,train_loss=81.94876861572266,valid_loss=132.53993225097656
epoch=37,train_loss=81.89318084716797,valid_loss=132.38919067382812
epoch=38,train_loss=81.84352111816406,valid_loss=132.2075958251953
epoch=39,train_loss=81.79922485351562,valid_loss=132.00039672851562
epoch=40,train_loss=81.75981140136719,valid_loss=131.7720489501953
epoch=41,train_loss=81.72492980957031,valid_loss=131.52635192871094
epoch=42,train_loss=81.69425201416016,valid_loss=131.26663208007812
epoch=43,train_loss=81.66749572753906,valid_loss=130.99574279785156
epoch=44,train_loss=81.64443969726562,valid_loss=130.71609497070312
epoch=45,train_loss=81.6248779296875,valid_loss=130.42990112304688
epoch=46,train_loss=81.6086196899414,valid_loss=130.1387939453125
epoch=47,train_loss=81.59551239013672,valid_loss=129.84445190429688
epoch=48,train_loss=81.58541107177734,valid_loss=129.5481414794922
epoch=49,train_loss=81.57818603515625,valid_loss=129.2509307861328
epoch=50,train_loss=81.5737075805664,valid_loss=128.953857421875

Next, we can visualize the loss curves:

plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.plot(loss_list_train,'blue',label="Train loss")
plt.plot(loss_list_valid,'red',label='Valid loss')
plt.legend(loc=1)

Test model

View test set loss

print(f"Test_loss:{loss(x_test,y_test,w,b).numpy()}")

Output:

Test_loss:115.94937133789062

Apply the model

Pick a sample at random from the test set:

# Use the model to predict one sample
test_house_id = np.random.randint(0, test_num)
y = y_test[test_house_id]
y_pred = model(x_test, w, b)[test_house_id]
y_predict = tf.reshape(y_pred, ()).numpy()
print(f"House id {test_house_id} actual value {y} predicted value {y_predict}")

output

House id 34 actual value 11.7 predicted value 23.49941062927246

That's all for this post. It's still worth watching the MOOC course itself.

Study notes are for reference only. If there are errors, please correct them!

Tags: Python Machine Learning TensorFlow

Posted on Wed, 03 Nov 2021 05:42:29 -0400 by djtozz