Temperature prediction based on random forest

1. Weather maximum temperature prediction task

Experimental data and source code https://github.com/wyn-365/temp-predict.git

1.1 Tasks

We have three tasks to accomplish:

1. Use the random forest algorithm to complete the basic modeling task: process the data, examine the features, build the model, and analyze the results visually.
2. Study how the amount of data and the number of features affect the results: keeping the algorithm unchanged, increase the amount of data and observe how the results change; then revisit the feature engineering and watch how the results trend after new features are introduced.
3. Tune the parameters of the random forest algorithm to find the most suitable setting: master two classical hyperparameter search methods in machine learning and use them to adjust the current model.

# data fetch
import pandas as pd

features = pd.read_csv('data/temps.csv')
features.head(5)

In the data table:
year, month, day, week: the specific date of each record (week is the day of the week)
temp_2: the maximum temperature of the day before yesterday
temp_1: yesterday's maximum temperature
average: the historical average maximum temperature for this day of the year
actual: the label value, the actual maximum temperature of the day
friend: a fun column, a friend's guess of the possible value; we can ignore it

1.2 Data size

print('The shape of our features is:', features.shape)
# statistical indicators
features.describe()
# Time data can also be converted to a standard datetime format, which some toolkits require for plotting or computation:
# Processing time data
import datetime

# Get year, month, day respectively
years = features['year']
months = features['month']
days = features['day']

# datetime format
dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years, months, days)]
dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in dates]
dates[:5]
 

The results show: the shape of our features is (348, 9), meaning the data has 348 records with 9 features per sample. To inspect the statistical characteristics of each indicator, .describe() displays them directly:
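As an aside, pandas can build the datetime values directly from the year/month/day columns; a minimal equivalent sketch (pd.to_datetime accepts a DataFrame with columns named year, month, and day):

# Equivalent one-liner using pandas; the result matches the list comprehension above
dates_alt = pd.to_datetime(features[['year', 'month', 'day']])
dates_alt[:5]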

1.3 Data display

# Prepare to draw
import matplotlib.pyplot as plt

%matplotlib inline

# Specify default style
plt.style.use('fivethirtyeight')

# Next we design the plot layout. We want to show four series: the actual max temperature (label), yesterday's and the day before yesterday's max temperatures, and the friend's guess. With four plots, a 2 * 2 grid is clearest; each subplot gets its own title and axis labels:
# Set layout
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, figsize = (10,10))
fig.autofmt_xdate(rotation = 45)

# Label value
ax1.plot(dates, features['actual'])
ax1.set_xlabel(''); ax1.set_ylabel('Temperature'); ax1.set_title('Max Temp')

# yesterday
ax2.plot(dates, features['temp_1'])
ax2.set_xlabel(''); ax2.set_ylabel('Temperature'); ax2.set_title('Previous Max Temp')

# The day before yesterday
ax3.plot(dates, features['temp_2'])
ax3.set_xlabel('Date'); ax3.set_ylabel('Temperature'); ax3.set_title('Two Days Prior Max Temp')

# The friend's guess
ax4.plot(dates, features['friend'])
ax4.set_xlabel('Date'); ax4.set_ylabel('Temperature'); ax4.set_title('Friend Estimate')

plt.tight_layout(pad=2)

# All indicators look normal. Note these are US weather data, so the temperatures are in Fahrenheit. Next comes data preprocessing: in the raw data the week column holds not numbers but strings naming the day of the week, which the model cannot use directly, so we need to convert them as follows:

1.5 Data preprocessing

# One-hot encoding
features = pd.get_dummies(features)
features.head(5)

# This completes the preprocessing of the attribute values in the dataset. By default, all string attributes are converted to one-hot encoding, with a suffix appended automatically for clarity. We can also name the encoded features ourselves. If you meet an unfamiliar function and want the details, the most direct way is to call the help tool right in the notebook and read its API documentation, which contains not only parameter descriptions but also small examples. Developing the habit of practicing and looking things up as you go is an important skill:

help(pd.get_dummies)

print('Shape of features after one-hot encoding:', features.shape)
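As a small illustration of the custom-naming option mentioned above (a hedged sketch; the column name 'week' comes from this dataset, the prefix is our own choice):

# Hypothetical variant: encode only the 'week' column with a custom prefix
raw = pd.read_csv('data/temps.csv')
encoded = pd.get_dummies(raw, columns=['week'], prefix='weekday')
print(encoded.columns.tolist())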

# Data and labels
import numpy as np

# label
labels = np.array(features['actual'])

# Remove labels from features
features= features.drop('actual', axis = 1)

# Save the feature names for later use
feature_list = list(features.columns)

# Convert to appropriate format
features = np.array(features)

1.6 Training and test sets

# Data set segmentation
from sklearn.model_selection import train_test_split

train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.25,
                                                                           random_state = 42)

print('Training set features:', train_features.shape)
print('Training set label:', train_labels.shape)
print('Test set features:', test_features.shape)
print('Test set label:', test_labels.shape)

1.7 Build a baseline random forest model

Everything is ready, so we can build the random forest model: first import the toolkit, then build 1000 trees and keep the default values for the other parameters. We will turn to parameter tuning later.

# Import algorithm
from sklearn.ensemble import RandomForestRegressor

# modeling
rf = RandomForestRegressor(n_estimators= 1000, random_state=42)

# train
rf.fit(train_features, train_labels)

# Because the data volume is still small, we get results quickly. We first evaluate with MAPE, the mean absolute percentage error. There are in fact many evaluation metrics for regression tasks; several are simple to implement, and other metrics can be chosen as well:

1.8 Testing

# Forecast results
predictions = rf.predict(test_features)

# calculation error
errors = abs(predictions - test_labels)

# mean absolute percentage error (MAPE)
mape = 100 * (errors / test_labels)

print('MAPE:', np.mean(mape))
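MAPE is only one choice; as a sketch of a few other common regression metrics (using sklearn.metrics with the predictions and test_labels arrays from above):

# Other regression metrics for comparison
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print('MAE :', mean_absolute_error(test_labels, predictions))
print('RMSE:', np.sqrt(mean_squared_error(test_labels, predictions)))
print('R^2 :', r2_score(test_labels, predictions))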

1.9 Feature importance

# Get feature importance
importances = list(rf.feature_importances_)

# Transform format
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]

# sort
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)

# Print accordingly
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances]

1.10 try again with the most important features

# Choose the two most important features to try
rf_most_important = RandomForestRegressor(n_estimators= 1000, random_state=42)

# Get these two features
important_indices = [feature_list.index('temp_1'), feature_list.index('average')]
train_important = train_features[:, important_indices]
test_important = test_features[:, important_indices]

# Retraining model
rf_most_important.fit(train_important, train_labels)

# Forecast results
predictions = rf_most_important.predict(test_important)

errors = abs(predictions - test_labels)

# Assessment results

mape = np.mean(100 * (errors / test_labels))

print('mape:', mape)

# Convert to list format
x_values = list(range(len(importances)))

# Plotting
plt.bar(x_values, importances, orientation = 'vertical')

# x-axis name
plt.xticks(x_values, feature_list, rotation='vertical')

# Title
plt.ylabel('Importance'); plt.xlabel('Variable'); plt.title('Variable Importances'); 

1.11 Difference between predicted and actual values

# Date data
months = features[:, feature_list.index('month')]
days = features[:, feature_list.index('day')]
years = features[:, feature_list.index('year')]

# Convert date format
dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years, months, days)]
dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in dates]

# Create a table to store dates and their corresponding label values
true_data = pd.DataFrame(data = {'date': dates, 'actual': labels})

# In the same way, create a storage date and its corresponding model forecast value
months = test_features[:, feature_list.index('month')]
days = test_features[:, feature_list.index('day')]
years = test_features[:, feature_list.index('year')]

test_dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years, months, days)]

test_dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in test_dates]

predictions_data = pd.DataFrame(data = {'date': test_dates, 'prediction': predictions}) 

# True value
plt.plot(true_data['date'], true_data['actual'], 'b-', label = 'actual')

# predicted value
plt.plot(predictions_data['date'], predictions_data['prediction'], 'ro', label = 'prediction')
plt.xticks(rotation = '60'); 
plt.legend()

# Title
plt.xlabel('Date'); plt.ylabel('Maximum Temperature (F)'); plt.title('Actual and Predicted Values');
plt.show()

This looks reasonable: the model has basically captured the trend. Next we dig further into the data and consider two questions: 1. If the amount of available data increases, how will it affect the results? 2. Will adding new features improve the model, and what happens to time efficiency?

2. Will more data work better?

2.1 Data import

# Import Toolkit
import pandas as pd

# Read data
features = pd.read_csv('data/temps_extended.csv')
features.head(5)
print('Data size',features.shape)

# statistical indicators
round(features.describe(), 2)

Data size: (2191, 12)
In the new dataset the scale has changed: the volume has grown to 2191 records, and new weather indicators have been added:
ws_1: wind speed on the previous day
prcp_1: precipitation on the previous day
snwd_1: snow depth on the previous day
Now that we have new features, let's see what they look like first; we can plot them the same way as before:

2.2 Date processing

# Convert to standard format
import datetime

# Get all kinds of date data
years = features['year']
months = features['month']
days = features['day']

# format conversion
dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years, months, days)]
dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in dates]

# Plotting
import matplotlib.pyplot as plt

%matplotlib inline

# Style settings
plt.style.use('fivethirtyeight')

# Set up the plotting layout
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, figsize = (15,10))
fig.autofmt_xdate(rotation = 45)

# Actual max temperature measurement
ax1.plot(dates, features['actual'])
ax1.set_xlabel(''); ax1.set_ylabel('Temperature (F)'); ax1.set_title('Max Temp')

# Temperature from 1 day ago
ax2.plot(dates, features['temp_1'])
ax2.set_xlabel(''); ax2.set_ylabel('Temperature (F)'); ax2.set_title('Prior Max Temp')

# Temperature from 2 days ago
ax3.plot(dates, features['temp_2'])
ax3.set_xlabel('Date'); ax3.set_ylabel('Temperature (F)'); ax3.set_title('Two Days Prior Max Temp')

# Friend Estimate
ax4.plot(dates, features['friend'])
ax4.set_xlabel('Date'); ax4.set_ylabel('Temperature (F)'); ax4.set_title('Friend Estimate')

plt.tight_layout(pad=2)

# Set up the overall layout
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, figsize = (15,10))
fig.autofmt_xdate(rotation = 45)

# Average maximum temperature
ax1.plot(dates, features['average'])
ax1.set_xlabel(''); ax1.set_ylabel('Temperature (F)'); ax1.set_title('Historical Avg Max Temp')

# wind speed
ax2.plot(dates, features['ws_1'], 'r-')
ax2.set_xlabel(''); ax2.set_ylabel('Wind Speed (mph)'); ax2.set_title('Prior Wind Speed')

# precipitation
ax3.plot(dates, features['prcp_1'], 'r-')
ax3.set_xlabel('Date'); ax3.set_ylabel('Precipitation (in)'); ax3.set_title('Prior Precipitation')

# accumulated snow
ax4.plot(dates, features['snwd_1'], 'ro')
ax4.set_xlabel('Date'); ax4.set_ylabel('Snow Depth (in)'); ax4.set_title('Prior Snow Depth')

plt.tight_layout(pad=2)

In data analysis and feature extraction, our aim is to collect as many valuable features as possible, because the more information we gather at this stage, the more we have available for modeling later. For example, this dataset has complete date information, and the weather certainly varies with the seasons, yet the raw data contains no seasonal indicator. We can create a season variable as a new feature, which should help both modeling and analysis:

With the season feature in hand, how can we observe how the indicators above change across the seasons? A very practical plotting function for this is pairplot; we need to install seaborn first. Seaborn is essentially a wrapper on top of Matplotlib that, put simply, makes standard statistical plots simpler and more uniform to produce:
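If seaborn is not installed yet, it can be added from within the notebook first (one-time setup, assuming a pip-based environment):

# One-time installation of seaborn
!pip install seaborn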

2.3 Plotting

# Create a season variable
seasons = []

for month in features['month']:
    if month in [1, 2, 12]:
        seasons.append('winter')
    elif month in [3, 4, 5]:
        seasons.append('spring')
    elif month in [6, 7, 8]:
        seasons.append('summer')
    elif month in [9, 10, 11]:
        seasons.append('fall')
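The same labels can be produced more compactly with a dict lookup; a sketch that should match the loop above (season_map is our own name, not from the original code):

# Hypothetical alternative to the loop: dict lookup via Series.map
season_map = {12: 'winter', 1: 'winter', 2: 'winter',
              3: 'spring', 4: 'spring', 5: 'spring',
              6: 'summer', 7: 'summer', 8: 'summer',
              9: 'fall', 10: 'fall', 11: 'fall'}
seasons_alt = features['month'].map(season_map).tolist()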

# With seasons, we can analyze more
reduced_features = features[['temp_1', 'prcp_1', 'average', 'actual']].copy()  # .copy() avoids a pandas SettingWithCopyWarning below
reduced_features['season'] = seasons

# Import seaborn Toolkit
import seaborn as sns
sns.set(style="ticks", color_codes=True);

# Choose your favorite color template
palette = sns.xkcd_palette(['dark blue', 'dark green', 'gold', 'orange'])

# Draw pairplot
sns.pairplot(reduced_features, hue = 'season', diag_kind = 'kde', palette= palette, plot_kws=dict(alpha = 0.7),
                   diag_kws=dict(shade=True)); 

In the plot, both the x-axis and the y-axis carry our four indicators, and the point colors mark the different seasons. On the main diagonal, where the x and y axes would be the same feature, each panel instead shows that feature's value distribution in the different seasons. The off-diagonal panels are scatter plots showing the relationship between two features; for example, temp_1 vs. actual in the lower left shows a strong correlation.
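To put numbers on the relationships the pairplot suggests, we can also print the plain correlation matrix (a quick check, not part of the original code):

# Pearson correlation between the four indicators
print(reduced_features[['temp_1', 'prcp_1', 'average', 'actual']].corr())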

2.4 Data preprocessing

# One-hot encoding
features = pd.get_dummies(features)

# Extract features and labels
labels = features['actual']
features = features.drop('actual', axis = 1)

# Save the feature names for later use
feature_list = list(features.columns)

# Convert to required format
import numpy as np

features = np.array(features)
labels = np.array(labels)

# Data set segmentation
from sklearn.model_selection import train_test_split

train_features, test_features, train_labels, test_labels = train_test_split(features, labels, 
                                                                            test_size = 0.25, random_state = 0)
print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)

# The new training set has 1643 samples and the test set has 548. First, let's check how the old dataset performs. Because this is a fresh notebook, the code re-reads the old temperature data and applies the same preprocessing:

2.5 A first look at the results on the old data

# Toolkit import
import pandas as pd

# To rule out any influence of the number of features on the results, keep only the features that exist in the old dataset
original_feature_indices = [feature_list.index(feature) for feature in
                                      feature_list if feature not in
                                      ['ws_1', 'prcp_1', 'snwd_1']]

# Read old data set
original_features = pd.read_csv('data/temps.csv')

original_features = pd.get_dummies(original_features)

import numpy as np

# Data and label conversion
original_labels = np.array(original_features['actual'])

original_features= original_features.drop('actual', axis = 1)

original_feature_list = list(original_features.columns)

original_features = np.array(original_features)

# Data set segmentation
from sklearn.model_selection import train_test_split

original_train_features, original_test_features, original_train_labels, original_test_labels = train_test_split(original_features, original_labels, test_size = 0.25, random_state = 42)

# Modeling with the same tree model
from sklearn.ensemble import RandomForestRegressor

# The same parameters and random seed
rf = RandomForestRegressor(n_estimators= 100, random_state=0)

# The training set here uses the old data set
rf.fit(original_train_features, original_train_labels);

# To make the comparison fair, all models are evaluated on one consistent test set: the test split of the new dataset created above
predictions = rf.predict(test_features[:,original_feature_indices])

# Calculate the average temperature error first
errors = abs(predictions - test_labels)

print('Average temperature error:', round(np.mean(errors), 2), 'degrees.')

# MAPE
mape = 100 * (errors / test_labels)

# For readability, report 100 minus the MAPE as an accuracy score, so that higher is better
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')

We can see that the effect improves when we increase the amount of data, which matches expectations: in machine learning tasks we generally want as much data as possible, so that more information can be used. Next we compare the influence of the number of features. Neither of the runs so far included the new features; this time we add the three new ones, precipitation, wind speed, and snow depth, to the training set and see how the results change.

2.6 With new data in hand, does increasing only the amount of data improve the results?

from sklearn.ensemble import RandomForestRegressor

# Remove new features to ensure the consistency of data features
original_train_features = train_features[:,original_feature_indices]

original_test_features = test_features[:, original_feature_indices]

rf = RandomForestRegressor(n_estimators= 100 ,random_state=0)

rf.fit(original_train_features, train_labels);

# forecast
baseline_predictions = rf.predict(original_test_features)

# result
baseline_errors = abs(baseline_predictions - test_labels)

print('Average temperature error:', round(np.mean(baseline_errors), 2), 'degrees.')

# (MAPE)
baseline_mape = 100 * np.mean((baseline_errors / test_labels))

# accuracy
baseline_accuracy = 100 - baseline_mape
print('Accuracy:', round(baseline_accuracy, 2), '%.')

2.7 Adding the new features

# Ready to add new features
from sklearn.ensemble import RandomForestRegressor

rf_exp = RandomForestRegressor(n_estimators= 100, random_state=0)
rf_exp.fit(train_features, train_labels)

# Same test set
predictions = rf_exp.predict(test_features)

# assessment
errors = abs(predictions - test_labels)

print('Average temperature error:', round(np.mean(errors), 2), 'degrees.')

# (MAPE)
mape = np.mean(100 * (errors / test_labels))

# Let's see how much we've improved
improvement_baseline = 100 * abs(mape - baseline_mape) / baseline_mape
print('Model effect improved with more features:', round(improvement_baseline, 2), '%.')

# accuracy
accuracy = 100 - mape
print('Accuracy:', round(accuracy, 2), '%.')

The overall effect of the model improves slightly. We also record how much it improves relative to the baseline model, for easy comparison. Now that there are more features, we can study feature importance again; it is only a reference, but there are some unwritten rules of thumb in the industry. Let's take a look:

2.8 Feature importance

# Get the feature importances
importances = list(rf_exp.feature_importances_)

# Names, numbers combined
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]

# sort
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)

# Print out 
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances];

# Specify style
plt.style.use('fivethirtyeight')

# Specify location
x_values = list(range(len(importances)))

# Plotting
plt.bar(x_values, importances, orientation = 'vertical', color = 'r', edgecolor = 'k', linewidth = 1.2)

# The name of the x-axis has to be written vertically
plt.xticks(x_values, feature_list, rotation='vertical')

# Title
plt.ylabel('Importance'); plt.xlabel('Variable'); plt.title('Variable Importances');
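Impurity-based importances can be biased toward high-cardinality features; as an alternative sketch, scikit-learn (0.22+) also provides permutation importance, which measures how much shuffling one feature hurts the score:

# Permutation importance on the test set (assumes rf_exp was fit on all features above)
from sklearn.inspection import permutation_importance

result = permutation_importance(rf_exp, test_features, test_labels,
                                n_repeats=10, random_state=0)
for name, score in sorted(zip(feature_list, result.importances_mean),
                          key=lambda pair: pair[1], reverse=True):
    print('Variable: {:20} Importance: {:.3f}'.format(name, score))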

Earlier we only looked at the individually most important features; this time we consider their cumulative importance. First we sort the features by importance, then compute the running total with the cumsum() function; for example, cumsum([1, 2, 3, 4]) gives (1, 3, 6, 10). Taking 95% as the usual threshold, we count how many of the top features are needed before their cumulative importance exceeds it, and treat those as the selected features:

2.9 How many features does it take to reach 95% cumulative importance?

# Sorting features
sorted_importances = [importance[1] for importance in feature_importances]
sorted_features = [importance[0] for importance in feature_importances]

# Cumulative importance
cumulative_importances = np.cumsum(sorted_importances)

# Draw line chart
plt.plot(x_values, cumulative_importances, 'g-')

# Draw a red dashed line at 0.95
plt.hlines(y = 0.95, xmin=0, xmax=len(sorted_importances), color = 'r', linestyles = 'dashed')

# X axis
plt.xticks(x_values, sorted_features, rotation = 'vertical')

# Y axis and name
plt.xlabel('Variable'); plt.ylabel('Cumulative Importance'); plt.title('Cumulative Importances');

# After the fifth feature, the cumulative importance exceeds 95%. So here comes another comparative experiment: what if we use only these five features? And what about time efficiency?
# Look at the features
print('Number of features for 95% importance:', np.where(cumulative_importances > 0.95)[0][0] + 1)

# Select these features
important_feature_names = [feature[0] for feature in feature_importances[0:5]]
# Find their column indices
important_indices = [feature_list.index(feature) for feature in important_feature_names]

# Recreate training set
important_train_features = train_features[:, important_indices]
important_test_features = test_features[:, important_indices]

# Data dimension
print('Important train features shape:', important_train_features.shape)
print('Important test features shape:', important_test_features.shape)

# Retraining model
rf_exp.fit(important_train_features, train_labels);

# Same test set
predictions = rf_exp.predict(important_test_features)

# Assessment results
errors = abs(predictions - test_labels)

print('Average temperature error:', round(np.mean(errors), 2), 'degrees.')

mape = 100 * (errors / test_labels)

# accuracy
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')

In fact, the random forest algorithm itself already accounts for feature selection, preferring the valuable features first. Removing the features we judged unimportant simply leaves it fewer candidates to choose from, so it is not surprising that the result barely changes!

2.10 Time efficiency

# Time to count
import time

# This time with all the features
all_features_time = []

# One calculation may not be accurate. Take an average of 10
for _ in range(10):
    start_time = time.time()
    rf_exp.fit(train_features, train_labels)
    all_features_predictions = rf_exp.predict(test_features)
    end_time = time.time()
    all_features_time.append(end_time - start_time)

all_features_time = np.mean(all_features_time)
print('Average time spent modeling and testing with all features:', round(all_features_time, 2), 'seconds.')

When all the features are used, the total time for modeling plus testing is about 0.5 seconds; your numbers will vary with machine performance. Now let's look at the time when only the high-importance features are selected:

# This time with some important features
reduced_features_time = []

# One calculation may not be accurate. Take an average of 10
for _ in range(10):
    start_time = time.time()
    rf_exp.fit(important_train_features, train_labels)
    reduced_features_predictions = rf_exp.predict(important_test_features)
    end_time = time.time()
    reduced_features_time.append(end_time - start_time)

reduced_features_time = np.mean(reduced_features_time)
print('Average time spent modeling and testing with reduced features:', round(reduced_features_time, 2), 'seconds.')

# Accuracy vs. run time: evaluate each model with its own predictions
all_accuracy =  100 * (1- np.mean(abs(all_features_predictions - test_labels) / test_labels))
reduced_accuracy = 100 * (1- np.mean(abs(reduced_features_predictions - test_labels) / test_labels))

# Create a dataframe to save the results
comparison = pd.DataFrame({'features': ['all (17)', 'reduced (5)'], 
                           'run_time': [round(all_features_time, 2), round(reduced_features_time, 2)],
                           'accuracy': [round(all_accuracy, 2), round(reduced_accuracy, 2)]})

comparison[['features', 'accuracy', 'run_time']]

relative_accuracy_decrease = 100 * (all_accuracy - reduced_accuracy) / all_accuracy
print('relative accuracy decline:', round(relative_accuracy_decrease, 3), '%.')

relative_runtime_decrease = 100 * (all_features_time - reduced_features_time) / all_features_time
print('Relative time efficiency improvement:', round(relative_runtime_decrease, 3), '%.')

Just as we weigh cost against benefit when shopping, the same question arises here: the gain in time efficiency is relatively large, while the effect of the model stays almost the same. Finally, let's summarize all the experimental results and compare them:

# Pandas is used for data manipulation
import pandas as pd

# Read in data as pandas dataframe and display first 5 rows
original_features = pd.read_csv('data/temps.csv')
original_features = pd.get_dummies(original_features)

# Use numpy to convert to arrays
import numpy as np

# Labels are the values we want to predict
original_labels = np.array(original_features['actual'])

# Remove the labels from the features
# axis 1 refers to the columns
original_features= original_features.drop('actual', axis = 1)

# Saving feature names for later use
original_feature_list = list(original_features.columns)

# Convert to numpy array
original_features = np.array(original_features)

# Using scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
original_train_features, original_test_features, original_train_labels, original_test_labels = train_test_split(original_features, original_labels, test_size = 0.25, random_state = 42)

# Find the original feature indices 
original_feature_indices = [feature_list.index(feature) for feature in
                                      feature_list if feature not in
                                      ['ws_1', 'prcp_1', 'snwd_1']]

# Create a test set of the original features
original_test_features = test_features[:, original_feature_indices]

# Time to train on original data set (1 year)
original_features_time = []

# Do 10 iterations and take average for all features
for _ in range(10):
    start_time = time.time()
    rf.fit(original_train_features, original_train_labels)
    original_features_predictions = rf.predict(original_test_features)
    end_time = time.time()
    original_features_time.append(end_time - start_time)
    
original_features_time = np.mean(original_features_time)

# Calculate mean absolute error for each model
original_mae = np.mean(abs(original_features_predictions - test_labels))
exp_all_mae = np.mean(abs(all_features_predictions - test_labels))
exp_reduced_mae = np.mean(abs(reduced_features_predictions - test_labels))

# Calculate accuracy for model trained on 1 year of data
original_accuracy = 100 * (1 - np.mean(abs(original_features_predictions - test_labels) / test_labels))

# Create a dataframe for comparison
model_comparison = pd.DataFrame({'model': ['original', 'exp_all', 'exp_reduced'], 
                                 'error (degrees)':  [original_mae, exp_all_mae, exp_reduced_mae],
                                 'accuracy': [original_accuracy, all_accuracy, reduced_accuracy],
                                 'run_time (s)': [original_features_time, all_features_time, reduced_features_time]})

# Order the dataframe
model_comparison = model_comparison[['model', 'error (degrees)', 'accuracy', 'run_time (s)']]

model_comparison

# Draw to summarize
# Set the overall layout; a single row looks better
fig, (ax1, ax2, ax3) = plt.subplots(nrows=1, ncols=3, figsize = (16,5), sharex = True)

# X axis
x_values = [0, 1, 2]
labels = list(model_comparison['model'])
plt.xticks(x_values, labels)

# font size
fontdict = {'fontsize': 18}
fontdict_yaxis = {'fontsize': 14}

# Comparison between predicted temperature and real temperature
ax1.bar(x_values, model_comparison['error (degrees)'], color = ['b', 'r', 'g'], edgecolor = 'k', linewidth = 1.5)
ax1.set_ylim(bottom = 3.5, top = 4.5)
ax1.set_ylabel('Error (degrees) (F)', fontdict = fontdict_yaxis); 
ax1.set_title('Model Error Comparison', fontdict= fontdict)

# Accuracy comparison
ax2.bar(x_values, model_comparison['accuracy'], color = ['b', 'r', 'g'], edgecolor = 'k', linewidth = 1.5)
ax2.set_ylim(bottom = 92, top = 94)
ax2.set_ylabel('Accuracy (%)', fontdict = fontdict_yaxis); 
ax2.set_title('Model Accuracy Comparison', fontdict= fontdict)

# Time efficiency comparison
ax3.bar(x_values, model_comparison['run_time (s)'], color = ['b', 'r', 'g'], edgecolor = 'k', linewidth = 1.5)
ax3.set_ylim(bottom = 0, top = 1)
ax3.set_ylabel('Run Time (sec)', fontdict = fontdict_yaxis); 
ax3.set_title('Model Run-Time Comparison', fontdict= fontdict);


# original represents our old data, the one with fewer samples and fewer features; exp_all represents our complete new data; exp_reduced represents the subset of important features selected with the 95% threshold. The result is clear: more data and more features give a better effect, but lower time efficiency.

3. Algorithm parameter selection and tuning

3.1 Data preparation

# Pandas is used for data manipulation
import pandas as pd

# Read in data as a dataframe
features = pd.read_csv('data/temps_extended.csv')

# One Hot Encoding
features = pd.get_dummies(features)

# Extract features and labels
labels = features['actual']
features = features.drop('actual', axis = 1)

# List of features for later use
feature_list = list(features.columns)

# Convert to numpy arrays
import numpy as np

features = np.array(features)
labels = np.array(labels)

# Training and Testing Sets
from sklearn.model_selection import train_test_split

train_features, test_features, train_labels, test_labels = train_test_split(features, labels, 
                                                                            test_size = 0.25, random_state = 42)

print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)

print('{:0.1f} years of data in the training set'.format(train_features.shape[0] / 365.))
print('{:0.1f} years of data in the test set'.format(test_features.shape[0] / 365.))

3.2 Choose the six most important features

# Names of the six features accounting for roughly 95% of the total importance
important_feature_names = ['temp_1', 'average', 'ws_1', 'temp_2', 'friend', 'year']

# Find the columns of the most important features
important_indices = [feature_list.index(feature) for feature in important_feature_names]

# Create training and testing sets with only the important features
important_train_features = train_features[:, important_indices]
important_test_features = test_features[:, important_indices]

# Sanity check on operations
print('Important train features shape:', important_train_features.shape)
print('Important test features shape:', important_test_features.shape)

# Use only the most important features
train_features = important_train_features[:]
test_features = important_test_features[:]

# Update feature list for visualizations
feature_list = important_feature_names[:]

# Inspect the model's tunable parameters
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(random_state = 42)

from pprint import pprint

# Print all parameters
pprint(rf.get_params())

3.3 Trying out different parameters

Parameter tuning is a long road: the number of possible parameter combinations is huge, so we need a strategy. First comes RandomizedSearchCV(), a function that repeatedly samples a random set of parameters from the candidate grid, builds a model, and reports the cross-validated score. Why sample randomly instead of trying everything one by one? Suppose we have 5 parameters to determine and each has 10 candidate values: that is already 10^5 = 100,000 combinations, and since building a model takes real time (with large data, a single fit can take hours), traversing all the possibilities is impractical. Random sampling is a compromise strategy that still finds a reasonably good combination of parameters. All of the function's parameters are explained in detail in the API documentation.
from sklearn.model_selection import RandomizedSearchCV

# Number of trees
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at each split
max_features = ['auto', 'sqrt']
# Maximum depth of each tree
max_depth = [int(x) for x in np.linspace(10, 20, num = 2)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples per leaf; no split may leave a child node with fewer samples than this
min_samples_leaf = [1, 2, 4]
# Whether to bootstrap-sample the training data
bootstrap = [True, False]

# Random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
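To see why random sampling is needed even for this modest candidate set, we can count the full number of combinations in random_grid (a small sketch):

# Total size of the full grid: product of the candidate-list lengths
grid_size = 1
for values in random_grid.values():
    grid_size *= len(values)
print('Full grid:', grid_size, 'combinations')  # 10 * 2 * 3 * 3 * 3 * 2 = 1080 here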
In this task we only give illustrative examples; to keep the run time reasonable, we do not list many candidate values per parameter. Note that the candidate space of each parameter must be controlled carefully: if the given ranges are inappropriate, the best result found cannot be very good either. We can start from published empirical values and keep reshaping the parameter space in light of the experimental results. Machine learning modeling is an iterative process: once the results are in, we go back again and again to compare different parameters and different preprocessing schemes.
# Randomly search for the most suitable parameter combination
rf = RandomForestRegressor()

rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                              n_iter = 100, scoring='neg_mean_absolute_error', 
                              cv = 3, verbose=2, random_state=42, n_jobs=-1)

# Perform search operation
rf_random.fit(train_features, train_labels)

rf_random.best_params_

Here is a brief explanation of the commonly used RandomizedSearchCV parameters; they are all covered in the API documentation, and consulting the docs is a habit worth developing.
estimator: RandomizedSearchCV is a generic method, not designed for random forests specifically, so we must specify which model we are tuning.
param_distributions: the candidate parameter space, given here in dictionary format.
n_iter: the number of random parameter combinations to sample; 100 here means 100 randomly drawn groups of parameters, from which the best is kept.
scoring: the evaluation metric used to decide which combination of parameters is best.
cv: cross-validation, which we have discussed before.
verbose: how much progress information to print; set it according to your needs.
random_state: the random seed, fixed so our results stay consistent and free of random interference; just use your own lucky number.
n_jobs: how many jobs to run in parallel; -1 uses all cores, though the machine may then become sluggish.
Even with n_jobs set to -1 the program runs slowly, because selecting the parameters means building 100 models, each with 3-fold cross-validation: the equivalent of 300 training tasks.

3.4 evaluation function

Next we compare the result after tuning with the result under default parameters. All defaults are described in the API; for example, n_estimators: integer, optional (default=10) means that a random forest builds 10 trees by default. First, the evaluation criterion:
def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    errors = abs(predictions - test_labels)
    mape = 100 * np.mean(errors / test_labels)
    accuracy = 100 - mape

    print('Average temperature error:', np.mean(errors))
    print('Accuracy = {:0.2f}%.'.format(accuracy))
# Old model
base_model = RandomForestRegressor( random_state = 42)
base_model.fit(train_features, train_labels)
evaluate(base_model, test_features, test_labels)

# New model (best parameters from the random search)
best_random = rf_random.best_estimator_
evaluate(best_random, test_features, test_labels)

3.5 Fine-tuning with grid search

from sklearn.model_selection import GridSearchCV

# Grid search
param_grid = {
    'bootstrap': [True],
    'max_depth': [8,10,12],
    'max_features': ['auto'],
    'min_samples_leaf': [2,3, 4, 5,6],
    'min_samples_split': [3, 5, 7],
    'n_estimators': [800, 900, 1000, 1200]
}

# Select basic algorithm model
rf = RandomForestRegressor()

# Grid search
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, 
                           scoring = 'neg_mean_absolute_error', cv = 3, 
                           n_jobs = -1, verbose = 2)

# Perform search
grid_search.fit(train_features, train_labels)

grid_search.best_params_

best_grid = grid_search.best_estimator_
evaluate(best_grid, test_features, test_labels)
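Beyond the single best setting, GridSearchCV also exposes the full table of cross-validation results through its cv_results_ attribute; a quick sketch for inspecting the top combinations (pd is pandas, imported earlier):

# Peek at the best-ranked parameter combinations
cv_results = pd.DataFrame(grid_search.cv_results_)
cols = ['params', 'mean_test_score', 'rank_test_score']
print(cv_results[cols].sort_values('rank_test_score').head())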

3.6 A second round of grid search

# After re-tuning, the effect of our model improves a little more. Each step is small, but accumulating every small step adds up to a real gain. Grid search has to traverse many combinations, so we usually do not throw every possibility in at once but split them into groups and run them separately. Here is another group of candidates for the grid search:
param_grid = {
    'bootstrap': [True],
    'max_depth': [12, 15, None],
    'max_features': [3, 4,'auto'],
    'min_samples_leaf': [5, 6, 7],
    'min_samples_split': [7,10,13],
    'n_estimators': [900, 1000, 1200]
}

# Select algorithm model
rf = RandomForestRegressor()

# Keep looking
grid_search_ad = GridSearchCV(estimator = rf, param_grid = param_grid, 
                           scoring = 'neg_mean_absolute_error', cv = 3, 
                           n_jobs = -1, verbose = 2)

grid_search_ad.fit(train_features, train_labels) 

grid_search_ad.best_params_

best_grid_ad = grid_search_ad.best_estimator_
evaluate(best_grid_ad, test_features, test_labels)

3.7 Final model

print('Final model parameters:\n')
pprint(best_grid_ad.get_params())

Let's summarize our assignment:

1. The parameter space is very important; it has a decisive influence on the results, so before starting the task we must choose suitable intervals, for example by consulting empirical values from papers on similar tasks.
2. Random search saves time, especially at the start of a task when we do not yet know which region of which parameter works better; we can set wide parameter intervals and use random search to locate the approximate positions first.
3. Grid search is like a carpet search: once we know the approximate position, we use it to pin down the optimal parameters. Random search plus grid search make a good one-two combination.
4. Finally, there are many more ways to tune parameters, for example Bayesian optimization, which is genuinely interesting. Briefly: think about how we tuned parameters above; every trial is independent and has no influence on the following ones. The basic idea of Bayesian optimization is that each evaluation accumulates experience that guides where to look next. If you are interested, have a look at the Hyperopt toolkit, which is also very easy to use; a small sketch follows this list.
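A minimal Hyperopt sketch, under stated assumptions: hyperopt is installed, train_features and train_labels from above are in scope, and the search space below (n_estimators and min_samples_leaf only) is our own illustrative choice, not a recommendation from the original:

# Hypothetical Bayesian-optimization sketch with Hyperopt's TPE algorithm
from hyperopt import fmin, tpe, hp
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def objective(params):
    model = RandomForestRegressor(n_estimators=int(params['n_estimators']),
                                  min_samples_leaf=int(params['min_samples_leaf']),
                                  random_state=42)
    # cross_val_score returns negative MAE; Hyperopt minimizes, so flip the sign
    score = cross_val_score(model, train_features, train_labels,
                            scoring='neg_mean_absolute_error', cv=3).mean()
    return -score

space = {'n_estimators': hp.quniform('n_estimators', 200, 1200, 100),
         'min_samples_leaf': hp.quniform('min_samples_leaf', 1, 8, 1)}

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=30)
print(best)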
