Research assistant for data modeling based on PaddlePaddle 2.2


1. Project background

PaddlePaddle has powerful data processing capabilities.

The paddle.nn module contains the neural network layers supported by the PaddlePaddle framework and their related APIs.

https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/nn/Overview_cn.html#juanjiceng

Using paddle.nn, we can build a data regression model for big data analysis.

This project is based on PaddlePaddle 2.2. Building on the official Boston house price prediction example, it adds an epoch-vs-loss plot and a routine that finds the epoch_id with the lowest loss, so that the best model parameters can be located and modeling and parameter tuning become easier.

The Boston house price prediction example can be found at: https://www.paddlepaddle.org.cn/documentation/docs/zh/practices/quick_start/linear_regression.html

In addition, the project uses the correlation coefficient to evaluate the regression results.

2. Environment setting

import paddle
import numpy as np
import os
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

print(paddle.__version__)

3. Data import

#Download data
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data -O housing.data 

# Import data from file
datafile = './housing.data'
housing_data = np.fromfile(datafile, sep=' ')
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE','DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
feature_num = len(feature_names)
# Reshape the original data into a shape like [N, 14]
housing_data = housing_data.reshape([housing_data.shape[0] // feature_num, feature_num])
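
As a quick sanity check (not part of the original notebook), the shape of the reshaped array can be printed; for the UCI Boston housing file it should be (506, 14):

# Hypothetical sanity check: confirm the reshape produced [N, 14]
print(housing_data.shape)                          # expected: (506, 14)
print(dict(zip(feature_names, housing_data[0])))   # first record as name -> value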

4. Exploratory analysis of data

Before data modeling, exploratory analysis of data is required.

Exploratory analysis mainly aims to understand the patterns in the data (the functional relationship between the independent variables and the dependent variable) through visualization (plotting) and correlation analysis.

# Plot each feature against the target to see the relationship between pairs of variables (linear or nonlinear, whether there is an obvious correlation)
features_np = np.array([x[:13] for x in housing_data], np.float32)
labels_np = np.array([x[-1] for x in housing_data], np.float32)
df = pd.DataFrame(housing_data, columns=feature_names)
matplotlib.use('TkAgg')
%matplotlib inline
sns.pairplot(df.dropna(), y_vars=feature_names[-1], x_vars=feature_names[::1], diag_kind='kde')
plt.show()

# correlation analysis 
fig, ax = plt.subplots(figsize=(15, 1)) 
corr_data = df.corr().iloc[-1]
corr_data = np.asarray(corr_data).reshape(1, 14)
ax = sns.heatmap(corr_data, cbar=True, annot=True)
plt.show()

5. Data preprocessing

In regression, if the value ranges of the variables differ too much, the model parameters will not converge. Therefore, the data need to be normalized.

In this project, only attribute values (independent variables) are normalized, and house prices (dependent variables) are not normalized.

Note: if the house price (dependent variable) is also preprocessed, the computed results must be inverse-transformed; otherwise the predicted values will differ greatly from the original values, and in severe cases the model parameters will not converge.
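
For illustration only (the project does not normalize the label), a minimal sketch of min-max normalizing MEDV and inverse-transforming a prediction could look like this:

# Hypothetical sketch: if MEDV were normalized too, predictions must be mapped back
medv = housing_data[:, -1]
medv_min, medv_max = medv.min(), medv.max()
medv_norm = (medv - medv_min) / (medv_max - medv_min)   # forward transform used for training
# for a normalized prediction y_norm, the price in the original scale would be:
# y_price = y_norm * (medv_max - medv_min) + medv_min   # inverse transform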

#Before normalization, the data values vary greatly
sns.boxplot(data=df.iloc[:, 0:13])

#Define normalization method
features_max = housing_data.max(axis=0)
features_min = housing_data.min(axis=0)
features_avg = housing_data.sum(axis=0) / housing_data.shape[0]
def feature_norm(input):
    f_size = input.shape
    output_features = np.zeros(f_size, np.float32)
    for batch_id in range(f_size[0]):
        for index in range(13):
            output_features[batch_id][index] = (input[batch_id][index] - features_avg[index]) / (features_max[index] - features_min[index])
    return output_features 
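
The same normalization can also be written without explicit loops; a vectorized sketch (hypothetical, but producing the same values for the 13 feature columns) is:

# Hypothetical vectorized equivalent of feature_norm for the 13 feature columns
def feature_norm_vec(input):
    return ((input - features_avg[:13]) / (features_max[:13] - features_min[:13])).astype(np.float32)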

# Normalize attributes only
housing_features = feature_norm(housing_data[:, :13])
# print(feature_trian.shape)
housing_data = np.c_[housing_features, housing_data[:, -1]].astype(np.float32)
# print(training_data[0])

# After normalization, check the feature distributions again
features_np = np.array([x[:13] for x in housing_data],np.float32)
labels_np = np.array([x[-1] for x in housing_data],np.float32)
data_np = np.c_[features_np, labels_np]
df = pd.DataFrame(data_np, columns=feature_names)
sns.boxplot(data=df.iloc[:, 0:13])

6. Machine learning

6.1 construction of training data set and test data set

# The training data set and test data set are separated in the ratio of 8:2
ratio = 0.8
offset = int(housing_data.shape[0] * ratio)
train_data = housing_data[:offset]
test_data = housing_data[offset:]

6.2 determine the amount of data input in each batch during each round of training

The more data fed in each batch, the larger the computation of the regression, the slower the training, and the higher the risk of non-convergence.

However, if each batch contains too little data, the accuracy of the model deteriorates.

# Determine the amount of data input in each batch during each round of training
bratio = 0.2  # fraction of the training set fed per batch
BATCH_SIZE = int(len(train_data) * bratio)
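
With the 8:2 split above there are 404 training rows, so BATCH_SIZE is 80 and each epoch runs 6 mini-batches; this can be verified with a quick (hypothetical) check:

# Hypothetical check of the resulting batch size and number of mini-batches per epoch
import math
print("BATCH_SIZE:", BATCH_SIZE)                                            # 80 for 404 training rows
print("mini-batches per epoch:", math.ceil(len(train_data) / BATCH_SIZE))   # 6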

6.3 building data model (Networking)

The paddle.nn module contains the neural network layers supported by the PaddlePaddle framework and their related APIs.

https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/nn/Overview_cn.html#juanjiceng

Replacing, adding or removing paddle.nn layers and adjusting their parameters changes the data model, which determines the prediction accuracy.

import paddle
import paddle.nn.functional as F

# Note: despite the name, this is a simple fully connected regression network
class LeNet(paddle.nn.Layer):
    def __init__(self):
        super(LeNet, self).__init__()
        # Shape transformation: flatten multi-dimensional input to 1-D (unused here)
        # self.flatten = paddle.nn.Flatten()
        # First fully connected layer
        self.linear1 = paddle.nn.Linear(in_features=13, out_features=10)
        # Activation function (Softplus)
        self.act1 = paddle.nn.Softplus()
        # Second fully connected layer
        self.linear2 = paddle.nn.Linear(in_features=10, out_features=1)
        # Optional second activation and third layer (disabled)
        # self.act2 = paddle.nn.ReLU()
        # self.linear3 = paddle.nn.Linear(in_features=20, out_features=1)

    def forward(self, x):
        # x = self.flatten(x)
        x = self.linear1(x)
        x = self.act1(x)
        x = self.linear2(x)
        # x = self.act2(x)
        # x = self.linear3(x)
        return x

model = LeNet()

#Set the optimization method of the model
optimizer = paddle.optimizer.SGD(learning_rate=0.001, parameters=model.parameters())
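
As an example of the parameter adjustment mentioned in 6.3, the optimizer could be swapped; the line below is a hedged alternative (Adam, with an assumed learning rate), not part of the original run:

# Hypothetical alternative to the SGD optimizer above (uncomment to try it)
# optimizer = paddle.optimizer.Adam(learning_rate=0.001, parameters=model.parameters())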

6.4 model training

The house price (dependent variable) and normalized attribute value (independent variable) of the data training set are used for model training.

The independent variables are fed into the model to compute predictions, the mean squared error (mse_loss) between the predictions and the true values is calculated, and the optimizer then updates the model parameters based on it.

import paddle.nn.functional as F 
y_preds = []
labels_list = []
train_nums = []
train_costs = []

def train(model):
    print('start training ... ')
    # Open model training mode
    model.train()
    EPOCH_NUM = 500 #Set the number of training rounds
    train_num = 0
    
    for epoch_id in range(EPOCH_NUM):
        # Before each iteration, the sequence of training data is randomly disrupted
        np.random.shuffle(train_data)
        # Split the training data, and each batch contains BATCH_SIZE data
        mini_batches = [train_data[k: k+BATCH_SIZE] for k in range(0, len(train_data), BATCH_SIZE)]
        for batch_id, data in enumerate(mini_batches):
            features_np = np.array(data[:, :13], np.float32)
            labels_np = np.array(data[:, -1:], np.float32)
            features = paddle.to_tensor(features_np)
            labels = paddle.to_tensor(labels_np)
            
            # Forward calculation
            y_pred = model(features)
            cost = F.mse_loss(y_pred, label=labels)  # default reduction='mean'
            train_cost = cost.numpy()[0]
            # Back propagation
            cost.backward()
            # Minimize loss and update parameters
            optimizer.step()
            # Clear gradient
            optimizer.clear_grad()
            
            if epoch_id%5 == 0: # 500 training epochs in total; save the parameters every 5 epochs
                print("Pass:%d,Cost:%0.5f"%(epoch_id, train_cost))
                # save state_dict
                paddle.save(model.state_dict(),'./checkpoint/epoch{}.pdparams'.format(epoch_id))
                paddle.save(optimizer.state_dict(),'./checkpoint/epoch{}.pdopt'.format(epoch_id))
            train_num = train_num + BATCH_SIZE
            train_nums.append(train_num)
            train_costs.append(train_cost)
        

train(model)


Draw the model training process diagram

def draw_train_process(iters, train_costs):
    plt.title("training cost", fontsize=24)
    plt.xlabel("iter", fontsize=14)
    plt.ylabel("cost", fontsize=14)
    plt.plot(iters, train_costs, color='red', label='training cost')
    plt.show()

matplotlib.use('TkAgg')
%matplotlib inline
draw_train_process(train_nums, train_costs)

6.5 model prediction

In this project, the saved parameters are loaded into the model in turn, and then the regression effect of the test data set is evaluated with the model and corresponding parameters.

The evaluation metrics are the mean loss (mean_loss) between the predicted values and the true values, and the correlation coefficient (corr) between them.

By drawing the epoch v.s. loss and epoch v.s. corr diagrams, the changes of loss and corr during training can be viewed intuitively.

The epoch_id with the lowest loss is recorded as the best epoch and its parameters are saved to the best-parameter directory.

Likewise, the epoch_id with the highest corr is recorded and its parameters are saved to the best-parameter directory.

#Evaluate the prediction results of the test data set

batch=[]
loss=[]
corr=[]
for batch_id in range(0, 500, 5):  # 500 training epochs; checkpoints were saved every 5 epochs
    # Load model parameters
    layer_state_dict = paddle.load('checkpoint/epoch{}.pdparams'.format(batch_id))
    opt_state_dict = paddle.load('checkpoint/epoch{}.pdopt'.format(batch_id))
    optimizer = paddle.optimizer.SGD(learning_rate=0.001, parameters=model.parameters())
    model.set_state_dict(layer_state_dict)
    optimizer.set_state_dict(opt_state_dict)
    
    # Get forecast data
    INFER_BATCH_SIZE = 100  # number of test samples evaluated; len(test_data) is 102 here

    infer_features_np = np.array([data[:13] for data in test_data]).astype("float32")
    infer_labels_np = np.array([data[-1] for data in test_data]).astype("float32")

    infer_features = paddle.to_tensor(infer_features_np)
    infer_labels = paddle.to_tensor(infer_labels_np)
    fetch_list = model(infer_features)

    sum_cost = 0
    for i in range(INFER_BATCH_SIZE):
        infer_result = fetch_list[i][0]
        ground_truth = infer_labels[i]
        #if i % 10 == 0:
            #print("No.%d: infer result is %.2f,ground truth is %.2f" % (i, infer_result, ground_truth))
        cost = paddle.pow(infer_result - ground_truth, 2)
        sum_cost += cost
    mean_loss = sum_cost / INFER_BATCH_SIZE  # mean squared error over the evaluated samples
    x = pd.Series(np.array(fetch_list.flatten()).tolist())  # convert predictions to a pandas Series so .corr() can be used
    y = pd.Series(infer_labels_np.tolist())
    xycorr = round(x.corr(y), 4)  # correlation coefficient, rounded to 4 decimal places
    print("epoch_id:%d,  Mean loss is:%.4f,  corr is:%.4f"%(batch_id, mean_loss.numpy(),xycorr))
    if mean_loss.numpy() < 30:  # keep only checkpoints whose mean loss is below 30 for plotting
        batch = np.append(batch, batch_id)
        loss = np.append(loss, mean_loss.numpy())
        corr = np.append(corr, xycorr)
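
The per-sample loop above can also be written as a single vectorized computation; for one loaded checkpoint, a hypothetical equivalent using the tensors defined in the loop is:

# Hypothetical vectorized equivalent of the per-sample loss loop for one checkpoint
preds = fetch_list[:INFER_BATCH_SIZE, 0]            # predicted prices
truth = infer_labels[:INFER_BATCH_SIZE]             # true prices
mean_loss_vec = paddle.mean((preds - truth) ** 2)   # same mean squared error
print("vectorized mean loss:", float(mean_loss_vec.numpy()))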


#Draw epoch v.s. loss diagram
def plot_epoch_loss(batch, loss):
    plt.figure()   
    plt.title("epoch v.s. loss", fontsize=24)
    plt.xlabel("epoch_id", fontsize=14)
    plt.ylabel("loss", fontsize=14)
    plt.ylim(0,30)
    plt.scatter(batch, loss, alpha=0.5)  # scatter plot; alpha sets the transparency
    #plt.plot(ground, ground, c='red')
    #plt.plot(ground, ground, c='red')
    plt.show()


#Draw epoch v.s. corr diagram
def plot_epoch_corr(batch, corr):
    plt.figure()   
    plt.title("epoch v.s. corr", fontsize=24)
    plt.xlabel("epoch_id", fontsize=14)
    plt.ylabel("corr", fontsize=14)
    plt.ylim(0,1)
    plt.scatter(batch, corr, alpha=0.5)  # scatter plot; alpha sets the transparency
    #plt.plot(ground, ground, c='red')
    #plt.plot(ground, ground, c='red')
    plt.show()


plot_epoch_loss(batch, loss)  # plot epoch v.s. loss

zuijia = int(batch[loss.tolist().index(min(loss))])  # epoch_id with the lowest loss
print('loss optimum epoch:', zuijia)

plot_epoch_corr(batch, corr)  # plot epoch v.s. corr
zuijia2 = int(batch[corr.tolist().index(max(corr))])  # epoch_id with the highest corr
print('corr optimum epoch:', zuijia2)
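
Equivalently, the best positions can be obtained with NumPy's argmin/argmax (a sketch that gives the same result):

# Hypothetical equivalent using NumPy
best_loss_epoch = int(batch[np.argmin(loss)])   # same value as zuijia
best_corr_epoch = int(batch[np.argmax(corr)])   # same value as zuijia2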

# Create a folder to save the parameters of the best epoch_id
import os
import datetime
# Three timestamp formats are listed below
# year-month-day hour:minute:second
nowTime = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
# year-month-day
dayTime = datetime.datetime.now().strftime('%Y-%m-%d')
# hour:minute:second
hourTime = datetime.datetime.now().strftime('%H:%M:%S')
# Specify the folder path
file_path = 'best model{}'.format(nowTime)  # dayTime or hourTime could be used here instead, depending on the desired format

# Determine whether the folder already exists
isExists = os.path.exists(file_path)
if not isExists:
    os.makedirs(file_path)  # makedirs creates intermediate directories; mkdir creates only a single one


from shutil import copy  # shutil is used to copy files
copy('checkpoint/epoch{}.pdparams'.format(zuijia), file_path)   # parameters of the lowest-loss epoch
copy('checkpoint/epoch{}.pdopt'.format(zuijia), file_path)      # optimizer state of the lowest-loss epoch
copy('checkpoint/epoch{}.pdparams'.format(zuijia2), file_path)  # parameters of the highest-corr epoch
copy('checkpoint/epoch{}.pdopt'.format(zuijia2), file_path)     # optimizer state of the highest-corr epoch

7. Model application

Import the data that needs to be predicted by the application model

Preprocess the imported data according to the preprocessing method in part 5

Load the best parameters of the model, predict the data and evaluate the prediction results

Visualize the predicted and actual values

# Import application data from file
datafile = './yingyong.data'
yingyong_data = np.fromfile(datafile, sep=' ')
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE','DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
feature_num = len(feature_names)
# Reshape the original data into a shape like [N, 14]
yingyong_data = yingyong_data.reshape([yingyong_data.shape[0] // feature_num, feature_num])

#!! The normalization procedure must be exactly the same as that used in part 5!! Normalize attributes only
yingyong_features = feature_norm(yingyong_data[:, :13]) 
# print(feature_trian.shape)
yingyong_data = np.c_[yingyong_features, yingyong_data[:, -1]].astype(np.float32)
# print(training_data[0])

# load optimal parameters
batch_id = zuijia  # or zuijia2; adjust according to the best epoch printed above
layer_state_dict = paddle.load('best model2021-11-19 12:17:58/epoch170.pdparams')  # path from the author's run; replace with your own best-model folder
#layer_state_dict = paddle.load('checkpoint/epoch{}.pdparams'.format(batch_id))
opt_state_dict = paddle.load('checkpoint/epoch{}.pdopt'.format(batch_id))

model.set_state_dict(layer_state_dict)
optimizer.set_state_dict(opt_state_dict)
    
# Get forecast data
INFER_BATCH_SIZE = len(yingyong_data)

infer_features_np = np.array([data[:13] for data in yingyong_data]).astype("float32")
infer_labels_np = np.array([data[-1] for data in yingyong_data]).astype("float32")

infer_features = paddle.to_tensor(infer_features_np)
infer_labels = paddle.to_tensor(infer_labels_np)
fetch_list = model(infer_features)

sum_cost = 0
for i in range(INFER_BATCH_SIZE):
    infer_result = fetch_list[i][0]
    ground_truth = infer_labels[i]
    if i % 1 == 0:
        print("No.%d: infer result is %.2f,ground truth is %.2f" % (i, infer_result, ground_truth))
    cost = paddle.pow(infer_result - ground_truth, 2)
    sum_cost += cost
mean_loss = sum_cost / INFER_BATCH_SIZE
print("Mean loss is:", mean_loss.numpy())
x = pd.Series(np.array(fetch_list.flatten()).tolist())  # convert predictions to a pandas Series so .corr() can be used
y = pd.Series(infer_labels_np.tolist())

xycorr = round(x.corr(y), 4)  # correlation coefficient, rounded to 4 decimal places
 
print('correlation coefficient :', xycorr)

def plot_pred_ground(pred, ground):
    plt.figure()   
    plt.title("Predication v.s. Ground truth"+ str(xycorr), fontsize=24)
    plt.xlabel("ground truth price(unit:$1000)", fontsize=14)
    plt.ylabel("predict price(unit:$1000)", fontsize=14)
    plt.scatter(ground, pred, alpha=0.5)  #  Scatter: scatter, alpha: "transparency"
    plt.plot(ground, ground, c='red')
    plt.show()

plot_pred_ground(fetch_list, infer_labels_np)

8. Experience

Using this research assistant, I compared the effect of randomly shuffling the training data (np.random.shuffle(train_data)) before each epoch, which clarified the difference between machine learning and traditional data fitting.

If the training data are not shuffled, the program performs a traditional data-fitting operation. As training proceeds, the mean squared error keeps decreasing, indicating that the fit to the training data improves. However, when the saved model and parameters are loaded to predict the test set, the loss on the test set is very large and varies widely.

If the training data are shuffled, the program behaves as a machine learning process. As training proceeds, the mean squared error declines with fluctuations, because each shuffle gives the loss a pulse, yet the loss still gradually decreases with optimization. When the saved model and parameters are loaded to predict the test set, the loss on the test set is small and stays within a narrow range.
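
To reproduce this comparison, the shuffle step inside train() can be guarded by a flag; the snippet below is a hypothetical modification (SHUFFLE is not in the original code):

# Hypothetical toggle for the experiment described above
SHUFFLE = True   # set to False to reproduce the "traditional fitting" behaviour
# inside train(), the unconditional shuffle would become:
# if SHUFFLE:
#     np.random.shuffle(train_data)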

