Introduction to the competition: Given the passenger flow data of a high-speed rail line from 08-25 to 06-25, the task is to forecast the number of passengers from 06-25 to 24-09-25 for either the morning or the afternoon. When the morning is to be predicted, the afternoon data are provided, and when the afternoon is to be predicted, the morning data are provided.
Scoring metric: 100 - MAE
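As a minimal sketch of the metric, assuming MAE is the plain mean absolute error over all predicted points (the competition's exact averaging is not stated here), the score could be computed as follows. The function name `competition_score` and the sample numbers are hypothetical, chosen only for illustration.

```python
def competition_score(y_true, y_pred):
    """Return 100 minus the mean absolute error of the predictions."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean absolute error: average of |actual - predicted|.
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    return 100 - mae


# Hypothetical passenger counts, for illustration only.
actual = [120, 80, 200]
predicted = [110, 90, 230]
score = competition_score(actual, predicted)
```

A higher score therefore means a smaller average error, and a perfect forecast would score 100.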
Data Understanding: From the bold part of the competition description, we can see that the data leak future information (the training data extend into the prediction period), so the task has little practical significance.
Online score: 42.6881
1. The historical data trend upward overall, and the level of the early data differs greatly from that of the recent data. Training on data that is too old reduces the model's predictive power, so this solution uses only data from June 10, 2014 onward for training.
2. Because either the morning or the afternoon data are missing from the training data, it is better to model "Morning" and "Afternoon" separately. The labels are in quotation marks because, in practice, 8:00 to 20:00 was chosen as "Morning" and the rest of the day as "Afternoon". The reason is that 8:00 to 22:00 is a relatively complete process that rises monotonically to a peak and then falls (in the words of a 4th grader at Dalai Primary, it obeys a Poisson distribution). You can plot the data yourself to see this, or find a better way to divide the day.
3. With steps 1 and 2 plus parameter tuning, the score can reach 39+. On top of that, applying a Box-Cox transformation to the label adds more than 3 points, bringing the score to 42+.
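The "Morning"/"Afternoon" split from step 2 can be sketched without pandas. This is only an illustration, assuming hours 8 through 20 inclusive count as "Morning", which is what `np.arange(8, 21)` selects in the code further down; the helper name `split_by_period` and the sample timestamps are hypothetical.

```python
from datetime import datetime

# Hours 8 through 20 inclusive, matching np.arange(8, 21).
MORNING_HOURS = set(range(8, 21))


def split_by_period(timestamps):
    """Split timestamps into the "Morning" and "Afternoon" groups by hour."""
    morning = [ts for ts in timestamps if ts.hour in MORNING_HOURS]
    afternoon = [ts for ts in timestamps if ts.hour not in MORNING_HOURS]
    return morning, afternoon


# Hypothetical sample points around the two boundaries.
sample = [datetime(2014, 6, 10, hour) for hour in (7, 8, 20, 21)]
morning, afternoon = split_by_period(sample)
```

Note that "Afternoon" here is simply everything outside the "Morning" window, so it covers both the late evening and the early hours before 8:00.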
The following is the code explanation section:
Import related packages
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
from fbprophet import Prophet
import matplotlib.pyplot as plt
from scipy import stats
from scipy.special import inv_boxcox
Data Loading and Processing
# Read the training data
rese_df = pd.read_csv('../data/train.csv', names=['ds', 'y'], header=0)
rese_df['ds'] = rese_df['ds'].astype('datetime64[ns]')
rese_df['y'] = rese_df['y'].astype(int)

# Read the test data
test = pd.read_csv('../data/test.csv', names=['id', 'ds'], header=0)
test['ds'] = test['ds'].astype('datetime64[ns]')

# Build the prediction time points
test_df = test[['ds']].copy()

# Build the submission frame
subs = test[['id']].copy()
# Hours 8 through 20 inclusive form the "Morning" part
hours = np.arange(8, 21)

# Split the training data
rese_df_mor = rese_df[rese_df['ds'].dt.hour.isin(hours)]
rese_df_aft = rese_df[~rese_df['ds'].dt.hour.isin(hours)]

# Split the prediction time points (copies, so that columns can be added later)
test_df_mor = test_df[test_df['ds'].dt.hour.isin(hours)].copy()
test_df_aft = test_df[~test_df['ds'].dt.hour.isin(hours)].copy()
# Model the morning and afternoon parts separately
# Select data from 2014-06-10 onward as training data.
# Note: pd.date_range(start='06/10/2014', freq='1M', periods=1) would yield
# the month end 2014-06-30 rather than June 10, so a plain Timestamp is used.
cut_off = pd.Timestamp('2014-06-10')

# Morning part: modeling and prediction
tail_df = rese_df_mor[rese_df_mor['ds'] >= cut_off].copy()
# Box-Cox requires strictly positive values: replace zeros with the mean
tail_df['y'] = tail_df['y'].astype(float)
tail_df.loc[tail_df['y'] == 0, 'y'] = tail_df['y'].mean()
# Box-Cox transformation of y; lambda is estimated from the data
xt, fitted_lambda_mor = stats.boxcox(tail_df['y'])
tail_df['y'] = xt
m = Prophet(yearly_seasonality=False,
            daily_seasonality=True,
            weekly_seasonality=True,
            seasonality_mode='multiplicative',
            interval_width=0.95,
            changepoint_range=0.95,
            changepoint_prior_scale=0.1)
m.fit(tail_df)
preds_mor = m.predict(test_df_mor)['yhat']

# Afternoon part: modeling and prediction
tail_df = rese_df_aft[rese_df_aft['ds'] >= cut_off].copy()
# Box-Cox requires strictly positive values: replace zeros with the mean
tail_df['y'] = tail_df['y'].astype(float)
tail_df.loc[tail_df['y'] == 0, 'y'] = tail_df['y'].mean()
# Box-Cox transformation of y; lambda is estimated from the data
xt, fitted_lambda_aft = stats.boxcox(tail_df['y'])
tail_df['y'] = xt
m = Prophet(yearly_seasonality=False,
            daily_seasonality=True,
            weekly_seasonality=True,
            seasonality_mode='multiplicative',
            interval_width=0.95,
            changepoint_range=0.95,
            changepoint_prior_scale=0.1)
m.fit(tail_df)
preds_aft = m.predict(test_df_aft)['yhat']
Generate online submission files
# Inverse Box-Cox transformation of the predicted values
test_df_mor = test_df_mor.assign(y=inv_boxcox(np.array(preds_mor), fitted_lambda_mor))
test_df_aft = test_df_aft.assign(y=inv_boxcox(np.array(preds_aft), fitted_lambda_aft))

# Merge the results
result = pd.concat([test_df_mor, test_df_aft]).sort_values(by=['ds'], ascending=True)

# Generate the online submission file (the assignment aligns rows on the index)
subs['y'] = result['y']
subs.to_csv('../subs/prophet39_boxcox.csv', index=False, header=False)
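To make the transform and its inverse concrete, here is a small pure-Python sketch of the Box-Cox formula for a fixed lambda. It is an illustration under assumptions rather than the code used above: `stats.boxcox` estimates lambda from the data when none is passed, whereas the value 0.5, the helper names `boxcox_scalar` and `inv_boxcox_scalar`, and the sample counts below are all hypothetical.

```python
import math


def boxcox_scalar(y, lam):
    """Box-Cox transform of one strictly positive value."""
    if y <= 0:
        raise ValueError("Box-Cox is only defined for positive values")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam


def inv_boxcox_scalar(z, lam):
    """Inverse of boxcox_scalar for the same lambda."""
    if lam == 0:
        return math.exp(z)
    return (lam * z + 1) ** (1 / lam)


lam = 0.5                       # hypothetical lambda, for illustration only
counts = [12.0, 150.0, 980.0]   # hypothetical passenger counts
transformed = [boxcox_scalar(y, lam) for y in counts]
restored = [inv_boxcox_scalar(z, lam) for z in transformed]
```

The round trip recovers the original counts up to floating-point error, which is why the lambda fitted on each training subset has to be kept and reused when the predictions are transformed back.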
Published on 2020-11-16