A data modeling research assistant based on PaddlePaddle 2.2
1. Project background
PaddlePaddle has powerful data processing capabilities.
The paddle.nn module contains the neural network layers supported by the PaddlePaddle framework and the APIs of related functions.
https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/nn/Overview_cn.html#juanjiceng
Using paddle.nn, we can build a data regression model for big data analysis.
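As a minimal illustration (the layer size and random input below are assumptions for demonstration only, not part of this project), a single paddle.nn layer already maps a batch of features to predictions:

import paddle

# Hypothetical stand-alone example: one linear layer mapping 13 features to 1 output
layer = paddle.nn.Linear(in_features=13, out_features=1)
x = paddle.rand([4, 13])   # 4 random samples with 13 features each
y = layer(x)               # output shape: [4, 1]
print(y.shape)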
This project is based on PaddlePaddle 2.2. Starting from the Boston house price prediction example, it adds an epoch v.s. loss diagram and a routine that finds the epoch ID with the lowest loss, in order to locate the best model parameters and assist with modeling and parameter tuning.
The Boston house price prediction example is available at: https://www.paddlepaddle.org.cn/documentation/docs/zh/practices/quick_start/linear_regression.html
In addition, the project uses the correlation coefficient to evaluate the regression results.
2. Environment setting
import paddle
import numpy as np
import os
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

print(paddle.__version__)
3. Data import
# Download data
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data -O housing.data
# Import data from file
datafile = './housing.data'
housing_data = np.fromfile(datafile, sep=' ')
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
feature_num = len(feature_names)
# Reshape the original data into a shape like [N, 14]
housing_data = housing_data.reshape([housing_data.shape[0] // feature_num, feature_num])
4. Exploratory analysis of data
Before data modeling, exploratory analysis of data is required.
Exploratory analysis mainly uses data visualization (plotting) and correlation analysis to understand the patterns in the data (the functional relationship between the independent variables and the dependent variable).
# Draw plots to see the relationship between features, mainly the relationship between two variables
# (linear or nonlinear, and whether there is an obvious correlation)
features_np = np.array([x[:13] for x in housing_data], np.float32)
labels_np = np.array([x[-1] for x in housing_data], np.float32)
df = pd.DataFrame(housing_data, columns=feature_names)
matplotlib.use('TkAgg')
%matplotlib inline
sns.pairplot(df.dropna(), y_vars=feature_names[-1], x_vars=feature_names[::1], diag_kind='kde')
plt.show()
# Correlation analysis
fig, ax = plt.subplots(figsize=(15, 1))
corr_data = df.corr().iloc[-1]
corr_data = np.asarray(corr_data).reshape(1, 14)
ax = sns.heatmap(corr_data, cbar=True, annot=True)
plt.show()
5. Data preprocessing
In regression, if the values of the data differ too much in scale, the model parameters will not converge, so the data need to be normalized.
In this project, only the attribute values (independent variables) are normalized; the house prices (dependent variable) are not.
Note: if the house price (dependent variable) is also preprocessed, the predictions must be inverse-transformed; otherwise the predicted values will differ greatly from the original values, and in severe cases the model parameters will not converge.
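Concretely, the normalization defined below is mean-centred min-max scaling applied per feature column: x_norm = (x - avg) / (max - min), where avg, max and min are computed column by column over the whole data set.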
# Before normalization, the data values vary greatly
sns.boxplot(data=df.iloc[:, 0:13])
# Define the normalization method
features_max = housing_data.max(axis=0)
features_min = housing_data.min(axis=0)
features_avg = housing_data.sum(axis=0) / housing_data.shape[0]

def feature_norm(input):
    f_size = input.shape
    output_features = np.zeros(f_size, np.float32)
    for batch_id in range(f_size[0]):
        for index in range(13):
            output_features[batch_id][index] = (input[batch_id][index] - features_avg[index]) / (features_max[index] - features_min[index])
    return output_features
# Normalize attributes only
housing_features = feature_norm(housing_data[:, :13])
# print(feature_trian.shape)
housing_data = np.c_[housing_features, housing_data[:, -1]].astype(np.float32)
# print(training_data[0])
# Look at the normalized attributes of train_data
features_np = np.array([x[:13] for x in housing_data], np.float32)
labels_np = np.array([x[-1] for x in housing_data], np.float32)
data_np = np.c_[features_np, labels_np]
df = pd.DataFrame(data_np, columns=feature_names)
sns.boxplot(data=df.iloc[:, 0:13])
6. Machine learning
6.1 Constructing the training and test data sets
# Split into training and test data sets in a ratio of 8:2
ratio = 0.8
offset = int(housing_data.shape[0] * ratio)
train_data = housing_data[:offset]
test_data = housing_data[offset:]
6.2 Determining the amount of data fed in each batch during each training round
The more data fed in each batch, the larger the computation for each regression step, the slower the training, and the greater the chance of non-convergence.
However, if the amount of data fed in each batch is too small, the model accuracy deteriorates.
# Determine the amount of data fed in each batch during each training round
bratio = 0.2  # data feeding ratio
BATCH_SIZE = int(len(train_data) * bratio)
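For example, the Boston housing data has 506 rows, so the 8:2 split above leaves int(506 * 0.8) = 404 training samples and BATCH_SIZE = int(404 * 0.2) = 80, giving 6 mini-batches per training round (the last one smaller).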
6.3 Building the data model (network construction)
The paddle.nn module contains the neural network layers supported by the PaddlePaddle framework and the APIs of related functions.
https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/nn/Overview_cn.html#juanjiceng
Replacing, adding, or removing paddle.nn layers and adjusting their parameters changes the data model, which determines the prediction accuracy.
import paddle
import paddle.nn.functional as F

class LeNet(paddle.nn.Layer):
    def __init__(self):
        super(LeNet, self).__init__()
        # Shape transformation, which changes the data shape from multi-dimensional to 1-dimensional
        # self.flatten = paddle.nn.Flatten()
        # First fully connected layer
        self.linear1 = paddle.nn.Linear(in_features=13, out_features=10)
        # Activation function (Softplus)
        self.act1 = paddle.nn.Softplus()
        # Second fully connected layer
        self.linear2 = paddle.nn.Linear(in_features=10, out_features=1)
        # # Activation function using ReLU
        # self.act2 = paddle.nn.ReLU()
        # # Third fully connected layer
        # self.linear3 = paddle.nn.Linear(in_features=20, out_features=1)

    def forward(self, x):
        # x = self.flatten(x)
        x = self.linear1(x)
        x = self.act1(x)
        x = self.linear2(x)
        # x = self.act2(x)
        # x = self.linear3(x)
        return x

model = LeNet()
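As a sketch of such an adjustment (this variant is only an illustration, not the model trained in this project), the commented-out ideas above can be turned into a deeper network with ReLU activations:

# Illustrative variant (assumption, not the trained model): three linear layers with ReLU
class DeeperNet(paddle.nn.Layer):
    def __init__(self):
        super(DeeperNet, self).__init__()
        self.linear1 = paddle.nn.Linear(in_features=13, out_features=10)
        self.act1 = paddle.nn.ReLU()
        self.linear2 = paddle.nn.Linear(in_features=10, out_features=20)
        self.act2 = paddle.nn.ReLU()
        self.linear3 = paddle.nn.Linear(in_features=20, out_features=1)

    def forward(self, x):
        x = self.linear1(x)
        x = self.act1(x)
        x = self.linear2(x)
        x = self.act2(x)
        x = self.linear3(x)
        return x

# model = DeeperNet()  # swap in instead of LeNet() to compare prediction accuracy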
# Set the optimization method of the model
optimizer = paddle.optimizer.SGD(learning_rate=0.001, parameters=model.parameters())
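The optimizer is another tuning knob. For instance (just a suggestion for experimentation, not part of the original project), Adam can be substituted for SGD through the same interface:

# Alternative optimizer (assumption, for comparison only): Adam instead of SGD
optimizer = paddle.optimizer.Adam(learning_rate=0.001, parameters=model.parameters())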
6.4 Model training
The house prices (dependent variable) and the normalized attribute values (independent variables) of the training set are used to train the model.
The independent variables are fed into the model to compute predicted values, the mean squared error (mse_loss) between the predicted and true values is calculated, and the optimizer then updates the model parameters based on this loss.
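With the default reduction='mean', F.mse_loss computes loss = (1/N) * Σ_i (y_pred_i - y_i)², i.e. the average squared difference over the N samples in the batch.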
import paddle.nn.functional as F

y_preds = []
labels_list = []
train_nums = []
train_costs = []

def train(model):
    print('start training ... ')
    # Switch the model to training mode
    model.train()
    EPOCH_NUM = 500  # set the number of training rounds
    train_num = 0
    for epoch_id in range(EPOCH_NUM):
        # Before each iteration, randomly shuffle the order of the training data
        np.random.shuffle(train_data)
        # Split the training data; each batch contains BATCH_SIZE samples
        mini_batches = [train_data[k: k + BATCH_SIZE] for k in range(0, len(train_data), BATCH_SIZE)]
        for batch_id, data in enumerate(mini_batches):
            features_np = np.array(data[:, :13], np.float32)
            labels_np = np.array(data[:, -1:], np.float32)
            features = paddle.to_tensor(features_np)
            labels = paddle.to_tensor(labels_np)
            # Forward pass
            y_pred = model(features)
            cost = F.mse_loss(y_pred, label=labels)  # default reduction='mean'
            train_cost = cost.numpy()[0]
            # Back propagation
            cost.backward()
            # Minimize the loss and update the parameters
            optimizer.step()
            # Clear the gradients
            optimizer.clear_grad()
            if epoch_id % 5 == 0:  # 500 training rounds in total; save the parameters every 5 rounds
                print("Pass:%d,Cost:%0.5f" % (epoch_id, train_cost))
                # Save state_dict
                paddle.save(model.state_dict(), './checkpoint/epoch{}.pdparams'.format(epoch_id))
                paddle.save(optimizer.state_dict(), './checkpoint/epoch{}.pdopt'.format(epoch_id))
            train_num = train_num + BATCH_SIZE
            train_nums.append(train_num)
            train_costs.append(train_cost)

train(model)
Draw the model training process diagram
def draw_train_process(iters, train_costs):
    plt.title("training cost", fontsize=24)
    plt.xlabel("iter", fontsize=14)
    plt.ylabel("cost", fontsize=14)
    plt.plot(iters, train_costs, color='red', label='training cost')
    plt.show()

matplotlib.use('TkAgg')
%matplotlib inline
draw_train_process(train_nums, train_costs)
6.5 Model prediction
In this project, the saved parameters are loaded into the model one by one, and the regression performance on the test data set is evaluated with the model under each set of parameters.
The evaluation metrics are the mean loss (mean_loss) between the predicted and true values and the correlation coefficient (corr) between them.
By plotting the epoch v.s. loss and epoch v.s. corr diagrams, the changes of loss and corr over the course of training can be viewed intuitively.
The epoch_id with the lowest loss is recorded and its parameters are copied to the best-parameter directory.
The epoch_id with the highest corr is recorded and its parameters are copied to the best-parameter directory.
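Note that pandas' Series.corr computes the Pearson correlation coefficient by default, corr = cov(pred, truth) / (std(pred) * std(truth)); values close to 1 mean the predicted prices track the true prices closely.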
# Evaluate the prediction results on the test data set
batch = []
loss = []
corr = []
for batch_id in range(0, 500, 5):  # 500 training rounds in total; parameters were saved every 5 rounds
    # Load model parameters
    layer_state_dict = paddle.load('checkpoint/epoch{}.pdparams'.format(batch_id))
    opt_state_dict = paddle.load('checkpoint/epoch{}.pdopt'.format(batch_id))
    optimizer = paddle.optimizer.SGD(learning_rate=0.001, parameters=model.parameters())
    model.set_state_dict(layer_state_dict)
    optimizer.set_state_dict(opt_state_dict)

    # Get the prediction data
    INFER_BATCH_SIZE = 100  # len(test_data)
    infer_features_np = np.array([data[:13] for data in test_data]).astype("float32")
    infer_labels_np = np.array([data[-1] for data in test_data]).astype("float32")
    infer_features = paddle.to_tensor(infer_features_np)
    infer_labels = paddle.to_tensor(infer_labels_np)
    fetch_list = model(infer_features)

    sum_cost = 0
    for i in range(INFER_BATCH_SIZE):
        infer_result = fetch_list[i][0]
        ground_truth = infer_labels[i]
        # if i % 10 == 0:
        #     print("No.%d: infer result is %.2f,ground truth is %.2f" % (i, infer_result, ground_truth))
        cost = paddle.pow(infer_result - ground_truth, 2)
        sum_cost += cost
    mean_loss = sum_cost / INFER_BATCH_SIZE  # mean squared error
    x = pd.Series(np.array(fetch_list.flatten()).tolist())  # use Series to turn the predictions into pandas-processable data
    y = pd.Series(infer_labels_np.tolist())
    xycorr = round(x.corr(y), 4)  # correlation coefficient, rounded to four decimal places
    print("epoch_id:%d, Mean loss is:%.4f, corr is:%.4f" % (batch_id, mean_loss.numpy(), xycorr))
    if mean_loss.numpy() < 30:
        batch = np.append(batch, batch_id)
        loss = np.append(loss, mean_loss.numpy())
        corr = np.append(corr, xycorr)
# Draw the epoch v.s. loss diagram
def plot_epoch_loss(batch, loss):
    plt.figure()
    plt.title("epoch v.s. loss", fontsize=24)
    plt.xlabel("epoch_id", fontsize=14)
    plt.ylabel("loss", fontsize=14)
    plt.ylim(0, 30)
    plt.scatter(batch, loss, alpha=0.5)  # scatter plot; alpha sets transparency
    # plt.plot(ground, ground, c='red')
    plt.show()
# Draw the epoch v.s. corr diagram
def plot_epoch_corr(batch, corr):
    plt.figure()
    plt.title("epoch v.s. corr", fontsize=24)
    plt.xlabel("epoch_id", fontsize=14)
    plt.ylabel("corr", fontsize=14)
    plt.ylim(0, 1)
    plt.scatter(batch, corr, alpha=0.5)  # scatter plot; alpha sets transparency
    # plt.plot(ground, ground, c='red')
    plt.show()
plot_epoch_loss(batch, loss)  # plot
zuijia = int(batch[loss.tolist().index(min(loss))])  # record the epoch with the lowest loss
print('loss optimum epoch Location:', zuijia)

plot_epoch_corr(batch, corr)  # plot
zuijia2 = int(batch[corr.tolist().index(max(corr))])  # record the epoch with the highest corr
print('corr optimum epoch Location:', zuijia2)

# Create a folder to save the parameters of the best epoch_id
import os
import datetime

# Three timestamp formats are listed below
# Year-month-day hour:minute:second
nowTime = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
# Year-month-day
dayTime = datetime.datetime.now().strftime('%Y-%m-%d')
# Hour:minute:second
hourTime = datetime.datetime.now().strftime('%H:%M:%S')

# Specify the folder path
file_path = 'best model{}'.format(nowTime)  # dayTime or hourTime can also be used here, whichever format you prefer
# Check whether the folder already exists
isExists = os.path.exists(file_path)
if not isExists:
    os.makedirs(file_path)  # use makedirs for nested folders; mkdir can only create a single level

from shutil import copy  # shutil is used to copy files
copy('checkpoint/epoch{}.pdparams'.format(zuijia), file_path)   # copy the lowest-loss parameters
copy('checkpoint/epoch{}.pdopt'.format(zuijia), file_path)
copy('checkpoint/epoch{}.pdparams'.format(zuijia2), file_path)  # copy the highest-corr parameters
copy('checkpoint/epoch{}.pdopt'.format(zuijia2), file_path)
7. Model application
Import the data to which the model will be applied.
Preprocess the imported data using the same preprocessing method as in part 5.
Load the best model parameters, predict on the data, and evaluate the prediction results.
Visualize the predicted and actual values.
# Import the application data from file
datafile = './yingyong.data'
yingyong_data = np.fromfile(datafile, sep=' ')
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
feature_num = len(feature_names)
# Reshape the original data into a shape like [N, 14]
yingyong_data = yingyong_data.reshape([yingyong_data.shape[0] // feature_num, feature_num])

# !! The normalization procedure must be exactly the same as in part 5 !! Normalize attributes only
yingyong_features = feature_norm(yingyong_data[:, :13])
# print(feature_trian.shape)
yingyong_data = np.c_[yingyong_features, yingyong_data[:, -1]].astype(np.float32)
# print(training_data[0])
# Load the optimal parameters
batch_id = zuijia  # or zuijia2 -- adjust according to the printed optimum epoch position!
layer_state_dict = paddle.load('best model2021-11-19 12:17:58/epoch170.pdparams')
# layer_state_dict = paddle.load('checkpoint/epoch{}.pdparams'.format(batch_id))
opt_state_dict = paddle.load('checkpoint/epoch{}.pdopt'.format(batch_id))
model.set_state_dict(layer_state_dict)
optimizer.set_state_dict(opt_state_dict)

# Get the prediction data
INFER_BATCH_SIZE = len(yingyong_data)
infer_features_np = np.array([data[:13] for data in yingyong_data]).astype("float32")
infer_labels_np = np.array([data[-1] for data in yingyong_data]).astype("float32")
infer_features = paddle.to_tensor(infer_features_np)
infer_labels = paddle.to_tensor(infer_labels_np)
fetch_list = model(infer_features)

sum_cost = 0
for i in range(INFER_BATCH_SIZE):
    infer_result = fetch_list[i][0]
    ground_truth = infer_labels[i]
    if i % 1 == 0:
        print("No.%d: infer result is %.2f,ground truth is %.2f" % (i, infer_result, ground_truth))
    cost = paddle.pow(infer_result - ground_truth, 2)
    sum_cost += cost
mean_loss = sum_cost / INFER_BATCH_SIZE
print("Mean loss is:", mean_loss.numpy())
x = pd.Series(np.array(fetch_list.flatten()).tolist())  # use Series to turn the predictions into pandas-processable data
y = pd.Series(infer_labels_np.tolist())
xycorr = round(x.corr(y), 4)  # correlation coefficient, rounded to four decimal places
print('correlation coefficient :', xycorr)
def plot_pred_ground(pred, ground):
    plt.figure()
    plt.title("Prediction v.s. Ground truth " + str(xycorr), fontsize=24)
    plt.xlabel("ground truth price(unit:$1000)", fontsize=14)
    plt.ylabel("predict price(unit:$1000)", fontsize=14)
    plt.scatter(ground, pred, alpha=0.5)  # scatter plot; alpha sets transparency
    plt.plot(ground, ground, c='red')
    plt.show()

plot_pred_ground(fetch_list, infer_labels_np)
8. Experience
Using this research assistant, I compared the effect of randomly shuffling the training data (np.random.shuffle(train_data)) before each training round, and gained a better understanding of the difference between machine learning and traditional data fitting.
If the order of the training data is not shuffled, the program performs traditional data fitting. As the program runs, the mean squared error loss keeps decreasing, indicating that the fitting accuracy is improving. However, when the model and parameters are loaded to predict the test set, the loss on the test set is very large and fluctuates widely.
If the order of the training data is randomly shuffled, the program performs machine learning. As the program runs, the mean squared error loss shows a fluctuating downward trend, because each shuffle gives the loss a small jolt, but the loss still gradually declines with optimization. When the model and parameters are loaded to predict the test set, the loss on the test set is small and stays within a narrow range.
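A simple way to reproduce this comparison (the flag and helper below are my own addition, not part of the original notebook) is to make the shuffle switchable and rerun training with it turned off:

import numpy as np

SHUFFLE = True  # set to False to mimic the fixed-order "traditional fitting" run described above

def maybe_shuffle(data, shuffle=SHUFFLE):
    # Shuffle the training rows in place only when the flag is enabled (hypothetical helper)
    if shuffle:
        np.random.shuffle(data)

# In train(), replace the call `np.random.shuffle(train_data)` with `maybe_shuffle(train_data)`.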