Building an ARIMA model in Python for time series analysis (paper)

#How to start a paper on data processing from scratch
I honestly didn't know how to do this at first. It's pure entry-level: draw a few plots and hand in a paper. Here's how I did it (for reference only)

##CSV data

Pay attention to the date format. When you open the file in Excel, it gets converted to Excel's default date format. Remember to change the slashes to dashes; otherwise things sometimes get weird
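If you'd rather normalize the dates in code than in Excel, a minimal sketch (a hypothetical in-memory example, not the actual CSV) using `pd.to_datetime`:

```python
import pandas as pd

# Hypothetical slash-format dates, the way Excel tends to re-save them
raw = pd.Series(['2008/1/11', '2008/2/6', '2008/3/17'])
dates = pd.to_datetime(raw, format='%Y/%m/%d')
print(dates.dt.strftime('%Y-%m-%d').tolist())
# ['2008-01-11', '2008-02-06', '2008-03-17']
```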

Vertical displacement monitoring of the dam crest is a main part of deformation monitoring, and its results accurately reflect the vertical deformation of hydraulic structures. The data below give the vertical displacement of the dam crest at measuring point 5 × 104 of a gravity dam in China from January 2008 to January 2012: 50 sets of original data.

Date,dy
2008-01-11,0.62
2008-02-06,1.01
2008-03-17,1.78
2008-04-13,1.29
2008-05-17,0.11
2008-06-15,-0.35
2008-07-01,-0.44
2008-07-12,-0.3
2008-08-10,-1.11
2008-09-14,-1.78
2008-10-12,-1.39
2008-11-16,-0.94
2008-12-13,-0.36
2009-01-19,1.47
2009-02-16,1.75
2009-03-13,2.04
2009-04-18,1.03
2009-05-16,0.02
2009-06-20,-0.59
2009-07-11,-1.35
2009-08-15,-2.14
2009-09-19,-1.96
2009-10-16,-1.46
2009-11-14,-0.56
2009-12-11,0.04
2010-01-15,0.96
2010-02-20,1.58
2010-03-13,1.43
2010-04-17,0.95
2010-05-15,0.14
2010-06-12,-0.3
2010-07-16,-1.35
2010-08-14,-1.6
2010-09-18,-1.98
2010-10-16,-1.58
2010-11-19,-0.98
2010-12-24,0.56
2011-01-21,1.14
2011-02-18,1.19
2011-03-19,1.18
2011-04-17,0.61
2011-05-15,0.76
2011-06-18,-0.66
2011-07-16,-1.14
2011-08-20,-1.35
2011-09-24,-1.85
2011-10-22,-0.95
2011-11-19,-0.65
2011-12-24,0.44
2012-01-14,1.09

The experimental procedure: use all 50 data points to fit candidate models and select the best one; then build a model on the first 40 points, predict the next ten, and plot the result to compare. Honestly, I wasn't very rigorous here. Don't ask; I couldn't be bothered to change it
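The split described above can be sketched with plain pandas (a stand-in frame instead of the real CSV; time series are split in order, never shuffled):

```python
import pandas as pd

dta = pd.DataFrame({'dy': range(50)})  # stand-in for the 50 observations
train = dta.iloc[:40]   # first 40 points: fit the model
hold = dta.iloc[40:]    # last 10 points: compare against the forecast
print(len(train), len(hold))  # 40 10
```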

##Program

Because I kept changing my mind, I ended up with two programs, one of them for prediction; the commented-out parts also run, so whenever switching tasks got tedious I just commented code out directly. This first program is mainly used for fitting and plotting

###Fitting

import warnings  # Ignore warnings
warnings.filterwarnings('ignore')


import pandas as pd #Table and data operations
import numpy as np
import matplotlib.pyplot as plt
from random import randrange
from statsmodels.tsa.stattools import adfuller as ADF
from statsmodels.stats.diagnostic import acorr_ljungbox #White noise test
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima_model import ARMA
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.api import tsa
import statsmodels.api as sm
from statsmodels.graphics.api import qqplot
from itertools import product

dta=pd.read_csv('Time series 1 - copy - copy.CSV',header=0 ,index_col=0)



dta = dta.dropna() #Remove incomplete item data

print(dta.describe())

#Stability test by unit root test (ADF)

dta1= dta
diff=0
adf=ADF(dta1)
if adf[1]>0.05:
    print(u'The original series is non-stationary, p-value: %s'%(adf[1]))
else:
    print(u'The original series is stationary, p-value: %s'%(adf[1]))
while adf[1]>=0.05: #adf[1] is the p-value; p < 0.05 is taken as stationary
    diff=diff+1
    adf=ADF(dta1.diff(diff).dropna()) #note: diff(n) is a lag-n difference, not an order-n difference
print(u'The order-%s difference of the original series is stationary, p-value: %s'%(diff,adf[1]))

dta_diff = dta
for i in range(1,3):
    dta_diff = dta_diff.diff(1).dropna() #one more first-order difference each pass: order-i difference
    adf = ADF(dta_diff)
    print(u'Order-%s difference of the original series, p-value: %s'%(i,adf[1]))

#White noise test with the LB statistic

dta2 = dta
[[lb],[p]]=acorr_ljungbox(dta2,lags=1)
if p<0.05:
    print(u'The original series is non-white noise, p-value: %s'%p)
else:
    print(u'The original series is white noise, p-value: %s'%p)
[[lb],[p]]=acorr_ljungbox(dta2.diff(1).dropna(),lags=1)
if p<0.05:
    print(u'The first-order difference series is non-white noise, p-value: %s'%p)
else:
    print(u'The first-order difference series is white noise, p-value: %s'%p)


dta.plot(figsize=(8,7))
plt.show()


#Plot autocorrelation and partial autocorrelation
fig = plt.figure(figsize=(8,7))
ax1= fig.add_subplot(211)
ax2= fig.add_subplot(212)
plot_acf(dta,ax=ax1)
plot_pacf(dta,ax=ax2)
plt.show()


#Model recognition
#Determine the best p,d,q values

###Rank determination
##pmax = int(len(dta)/10) #General order no more than length/10
##qmax = int(len(dta)/10) #General order no more than length/10
##bic_matrix = [] #bic matrix
##for p in range(pmax+1):
##    tmp = []
##    for q in range(qmax+1):
##        try: #Some errors are reported, so try is used to skip them.
##            tmp.append(ARIMA(dta, (p,1,q)).fit().bic)
##        except:
##            tmp.append(None)
##    bic_matrix.append(tmp)
##
##bic_matrix = pd.DataFrame(bic_matrix) #From which we can find the minimum value
##
##p,q = bic_matrix.stack().astype('float64').idxmin() #First use stack to flatten, then use idxmin to find the minimum position.
##print(u'The p and q values with minimum BIC are: %s, %s' % (p, q))

p=2  #Result of the grid search above: the (2,1,2) model is best, so p and q are assigned directly to avoid re-running the search every time
q=2

#Establishing ARIMA(p,1,q) model
arima = ARIMA(dta.dropna(), (p,1,q)).fit()
print("Optimal model", arima.summary())

#Model test: after the model is established, check whether the residual sequence is white noise

dta3 = dta.drop(axis = 0, index = '2008-01-11') #Drop the first row; its difference is undefined
dta_pred = arima.predict(typ = 'levels') #In-sample prediction from the model
print(dta_pred) #printed so it can be copied out by hand
#Plot the fit
fig3 = plt.figure(figsize=(12,8))
plt.plot(dta3)
plt.plot(dta_pred, color='red')
plt.show()

#Fix the residual series format
dta_pred2=pd.read_csv('forecast.CSV',header=0 ,index_col=0) #dta_pred above is printed out and copied into a CSV, because using it directly hits a bug -- the computed column seems to lose its name
dta_pred2 = dta_pred2.dropna() #Remove incomplete rows

pred_error = (dta_pred2 - dta3).dropna() #Compute residuals
lb,p_l= acorr_ljungbox(pred_error, lags = 16) #LB test
print(p_l)
h = (p_l < 0.05).sum() #p < 0.05 is taken as non-white noise
if h > 0:
    print (u'Model ARIMA(%s,1,%s) fails the white noise test'%(p,q))
else:
    print (u'Model ARIMA(%s,1,%s) passes the white noise test'%(p,q))


#Autocorrelation graph of residuals
fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = plot_acf(pred_error,ax=ax1)
ax2 = fig.add_subplot(212)
fig = plot_pacf(pred_error, ax=ax2)
plt.show()

#Durbin-Watson test
print(sm.stats.durbin_watson(pred_error))

#Draw Q-Q plot
fig = plt.figure(figsize=(12,8))
fig = qqplot(pred_error, line='q', fit=True)
plt.show()

###Accuracy for different difference orders
##print(ARIMA(dta, (p,0,q)).fit().bic)
##print(ARIMA(dta, (p,1,q)).fit().bic)
##
##print(ARIMA(dta, (p,0,q)).fit().aic)
##print(ARIMA(dta, (p,1,q)).fit().aic)
###print(ARIMA(dta, (p,2,q)).fit().aic) #MA coefficients non-invertible
##
###Differential comparison
##fig1 = plt.figure(figsize=(8,7))
##ax1= fig1.add_subplot(211)
##diff1 = dta.diff(1)
##diff1.plot(ax=ax1)
##ax2= fig1.add_subplot(212)
##diff2 = dta.diff(2)
##diff2.plot(ax=ax2)
##fig1.show()
##
###Autocorrelation graph after difference
##dta= dta.diff(1)#First order difference
##dta = dta.dropna()
##
##fig2 = plt.figure(figsize=(8,7))
##ax1=fig2.add_subplot(211)
##fig2 = plot_acf(dta,ax=ax1)
##ax2 = fig2.add_subplot(212)
##fig2 = plot_pacf(dta,ax=ax2)
##fig2.show()

###Forecast

import warnings  # Ignore warnings
warnings.filterwarnings('ignore')

import pandas as pd #Table and data operations
import numpy as np
import matplotlib.pyplot as plt

from statsmodels.tsa.arima_model import ARMA
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.api import tsa
import statsmodels.api as sm

from itertools import product


dta0=pd.read_csv('time series_40.CSV',header=0)
dta0.index = pd.to_datetime(dta0['Date'])

data = pd.DataFrame()
data['date'] = ['2008/1/11','2008/2/6','2008/3/17','2008/4/13','2008/5/17','2008/6/15','2008/7/1','2008/7/12','2008/8/10','2008/9/14','2008/10/12','2008/11/16','2008/12/13','2009/1/19','2009/2/16','2009/3/13','2009/4/18','2009/5/16','2009/6/20','2009/7/11','2009/8/15','2009/9/19','2009/10/16','2009/11/14','2009/12/11','2010/1/15','2010/2/20','2010/3/13','2010/4/17','2010/5/15','2010/6/12','2010/7/16','2010/8/14','2010/9/18','2010/10/16','2010/11/19','2010/12/24','2011/1/21','2011/2/18','2011/3/19','2011/4/17','2011/5/15','2011/6/18','2011/7/16','2011/8/20','2011/9/24','2011/10/22','2011/11/19','2011/12/24','2012/1/14']
data['dy'] = [0.62,1.01,1.78,1.29,0.11,-0.35,-0.44,-0.3,-1.11,-1.78,-1.39,-0.94,-0.36,1.47,1.75,2.04,1.03,0.02,-0.59,-1.35,-2.14,-1.96,-1.46,-0.56,0.04,0.96,1.58,1.43,0.95,0.14,-0.3,-1.35,-1.6,-1.98,-1.58,-0.98,0.56,1.14,1.19,1.18,0.61,0.76,-0.66,-1.14,-1.35,-1.85,-0.95,-0.65,0.44,1.09]
data.index = pd.to_datetime(data['date'])


p=2
q=2

arima = ARIMA(dta0['dy'].dropna(), (p,1,q)).fit()

print(arima.summary())

dta_pred = arima.predict(typ = 'levels') #forecast

###fitting
##fig1 = plt.figure(figsize=(12,8))
##plt.plot(dta0['dy'], color='green')
##plt.plot(dta_pred, color='yellow')
##fig1.show()

#model prediction 
forecast_ts = arima.forecast(10)
fore = pd.DataFrame()
fore['date'] = ['2011-4-17','2011-5-15','2011-6-18','2011-7-16','2011-8-20','2011-9-24','2011-10-22','2011-11-19','2011-12-24','2012-1-14']
fore['result'] = pd.DataFrame(forecast_ts[0])
fore.index = pd.to_datetime(fore['date'])

print(fore['result'])

#Plot the results
dta0['dy'].plot(color='blue', label='Original (first 40)',figsize=(12,8))
dta_pred.plot(color='red', label='Predict',figsize=(12,8))
data.dy.plot(color='green', label='Actual (all 50)',figsize=(12,8))
fore.result.plot(color='yellow', label='Forecast',figsize=(12,8))
plt.legend()

plt.show()


##Results


Original data sequence diagram

Time sequence diagram of primary difference

Sequence diagram of quadratic difference

Stationarity test and white noise test

| Difference order | ADF test (p < 0.05 taken as stationary) | LB test (p < 0.05 taken as non-white noise) |
|---|---|---|
| 0 | 1.1067394073215614e-08 | 4.4372756913599096e-09 |
| 1 | 2.3032333702927282e-11 | 0.0002755346487733437 |
| 2 | 1.2900060307415488e-08 | 1.1951415276363802e-07 |

Autocorrelation graph

Partial correlation graph

This table and the figure above were produced with SPSS, because the partial autocorrelation plot I drew with the program looked outrageous

The autocorrelation and partial autocorrelation plots drawn by my own program:


The program gives p = q = 2.
Next, determine the difference order:

| Difference order | AIC | BIC |
|---|---|---|
| 0 | 53.22929865693882 | 64.7014366895077 |
| 1 | 61.743467139654456 | 73.09438892831821 |
| 2 | MA coefficients non-invertible | MA coefficients non-invertible |

Although the order-0 difference gives the smallest AIC/BIC, the residuals of the (2,0,2) model fail the white noise test, which is also pretty outrageous, so (2,1,2) is what ended up being used
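One pandas pitfall worth flagging here: `Series.diff(n)` computes the lag-n difference y_t − y_{t−n}, not an n-th order difference, so an order-2 difference is `diff(1)` applied twice. A small check:

```python
import pandas as pd

y = pd.Series([1.0, 4.0, 9.0, 16.0, 25.0])
lag2 = y.diff(2).dropna()            # y_t - y_{t-2}
order2 = y.diff(1).diff(1).dropna()  # true second-order difference
print(lag2.tolist())    # [8.0, 12.0, 16.0]
print(order2.tolist())  # [2.0, 2.0, 2.0]
```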

And then there's the model test

First, pad out the page count.

As with exponential smoothing models, we need to check whether the ARIMA residuals follow a normal distribution with zero mean and constant variance, and whether consecutive residuals are autocorrelated.
Let's first look at the autocorrelation of the residuals. Mm-hmm, it looks pretty good. Very pleased

LB test

After the model is established, we need to test whether the residual series is white noise. If it is not, the residuals still contain useful information, and the model should be revised or the information further extracted. Subtracting the fitted values from the original data gives the residual series. The LB test with a lag of 16 gives the following p-values:

[0.30226802 0.31656683 0.26285311 0.39561196 0.14861177 0.20026693 0.28606258 0.32825364 0.35109379 0.28906182 0.33058463 0.36657827 0.4070982 0.47689337 0.53658179 0.5805578 ]

All are above 0.05, so there is little useful information left in the residuals; the model fits well and can be used as a prediction model.
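For the record, the statistic behind the LB test is Q = n(n+2) Σ_{k=1}^{h} r_k² / (n − k), compared against a χ² distribution with h degrees of freedom. A self-contained numpy/scipy sketch on synthetic data (not the residuals above):

```python
import numpy as np
from scipy.stats import chi2

def ljung_box(x, h):
    """Ljung-Box Q statistic and p-value over lags 1..h."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    denom = np.sum(x * x)
    q = 0.0
    for k in range(1, h + 1):
        r_k = np.sum(x[k:] * x[:-k]) / denom   # lag-k autocorrelation
        q += r_k * r_k / (n - k)
    q *= n * (n + 2)
    return q, chi2.sf(q, df=h)

rng = np.random.default_rng(0)
q, p = ljung_box(rng.normal(size=200), h=16)
print(q, p)  # for white noise the p-value should be large
```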

Durbin-Watson test

A DW value significantly close to 0 or 4 indicates autocorrelation, while a value close to 2 indicates no (first-order) autocorrelation. The test result is 2.26339224, indicating that the residual autocorrelation is very small and can be treated as none
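The DW statistic itself is just Σ(e_t − e_{t−1})² / Σ e_t², which lives in [0, 4]; a quick numpy sketch with made-up residual vectors:

```python
import numpy as np

def durbin_watson(e):
    """Durbin-Watson statistic of a residual vector."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

alternating = np.array([1.0, -1.0] * 50)  # strong negative autocorrelation
constant = np.ones(100)                   # strong positive autocorrelation
print(durbin_watson(alternating))  # 3.96, close to 4
print(durbin_watson(constant))     # 0.0
```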

A Q-Q plot visually checks whether a set of data comes from a given distribution, or whether two data sets come from the same (family of) distributions. If the points fall roughly on a straight line, the two distributions can be considered the same. In the Q-Q plot here, most of the points lie around a straight line, so the residuals can be considered approximately normal
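The same check can be done without statsmodels using `scipy.stats.probplot`, which returns the ordered sample against normal quantiles plus a least-squares line fit; r close to 1 means the points hug a straight line (synthetic residuals here, not the model's):

```python
import numpy as np
from scipy.stats import probplot

rng = np.random.default_rng(1)
residuals = rng.normal(loc=0.0, scale=0.5, size=50)  # assumed residual sample

(osm, osr), (slope, intercept, r) = probplot(residuals, fit=True)
print(r)  # near 1.0 for approximately normal residuals
```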

In the prediction stage, another model is built with the first 40 data points. Why 40? Because with only the first 30, the model wouldn't fit at all (cue coughing blood)

The ~0.97 residual in the table is outrageous; it's fine to just remove it

| date | actual value | predicted value | residual |
|---|---|---|---|
| 2011/4/17 | 0.61 | 0.599297 | 0.010703 |
| 2011/5/15 | 0.76 | -0.20582 | 0.96582 |
| 2011/6/18 | -0.66 | -1.01002 | 0.35002 |
| 2011/7/16 | -1.14 | -1.597523 | 0.457523 |
| 2011/8/20 | -1.35 | -1.820037 | 0.470037 |
| 2011/9/24 | -1.85 | -1.63406 | -0.21594 |
| 2011/10/22 | -0.95 | -1.10831 | 0.15831 |
| 2011/11/19 | -0.65 | -0.400514 | -0.24949 |
| 2011/12/24 | 0.44 | 0.2889 | 0.1511 |
| 2012/1/14 | 1.09 | 0.77275 | 0.31725 |

Residual analysis: count 10, mean 0.241534, std 0.358969. After removing the extreme value: count 9, mean 0.161057, std 0.268526.
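The residual summary can be reproduced with pandas (values copied from the table):

```python
import pandas as pd

# Residuals copied from the table above
residuals = pd.Series([0.010703, 0.96582, 0.35002, 0.457523, 0.470037,
                       -0.21594, 0.15831, -0.24949, 0.1511, 0.31725])
print(residuals.describe())  # count 10, mean ~0.2415, std ~0.3590

trimmed = residuals.drop(residuals.abs().idxmax())  # drop the extreme ~0.97 value
print(trimmed.describe())    # count 9, mean ~0.1611, std ~0.2685
```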

Final plot: observations used for fitting (blue), fitted values (red), all actual observations (green), and forecast (yellow)


Posted on Tue, 16 Jun 2020 01:16:46 -0400 by seran128