# How to start a paper on data processing from scratch

I really didn't know how to do this at first. It's pure entry-level work: just draw some pictures and hand in a paper. Here's how I did it (for reference only).

## CSV data

Pay attention to the date format. When you open the file in Excel, the dates get converted to Excel's default date format. Remember to change the slashes to dashes; otherwise things sometimes get weird.

Vertical displacement monitoring of the dam crest is a main part of deformation monitoring, and its results can accurately reflect the vertical deformation of hydraulic structures. The following are the vertical displacements of the dam crest at point 5 × 104 of a gravity dam in China from January 2008 to January 2012: 50 sets of original data.

```
Date,dy
2008-01-11,0.62
2008-02-06,1.01
2008-03-17,1.78
2008-04-13,1.29
2008-05-17,0.11
2008-06-15,-0.35
2008-07-01,-0.44
2008-07-12,-0.3
2008-08-10,-1.11
2008-09-14,-1.78
2008-10-12,-1.39
2008-11-16,-0.94
2008-12-13,-0.36
2009-01-19,1.47
2009-02-16,1.75
2009-03-13,2.04
2009-04-18,1.03
2009-05-16,0.02
2009-06-20,-0.59
2009-07-11,-1.35
2009-08-15,-2.14
2009-09-19,-1.96
2009-10-16,-1.46
2009-11-14,-0.56
2009-12-11,0.04
2010-01-15,0.96
2010-02-20,1.58
2010-03-13,1.43
2010-04-17,0.95
2010-05-15,0.14
2010-06-12,-0.3
2010-07-16,-1.35
2010-08-14,-1.6
2010-09-18,-1.98
2010-10-16,-1.58
2010-11-19,-0.98
2010-12-24,0.56
2011-01-21,1.14
2011-02-18,1.19
2011-03-19,1.18
2011-04-17,0.61
2011-05-15,0.76
2011-06-18,-0.66
2011-07-16,-1.14
2011-08-20,-1.35
2011-09-24,-1.85
2011-10-22,-0.95
2011-11-19,-0.65
2011-12-24,0.44
2012-01-14,1.09
```
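To sidestep the Excel date mangling entirely, the CSV can be read directly with pandas and the dates parsed explicitly. A minimal sketch (the inlined rows are just the first few values from above; in practice you would `pd.read_csv` your saved file):

```python
import io
import pandas as pd

# First few rows of the CSV above, inlined for illustration
csv_text = """Date,dy
2008-01-11,0.62
2008-02-06,1.01
2008-03-17,1.78
"""

df = pd.read_csv(io.StringIO(csv_text))
# Normalize slashes to dashes first, in case Excel rewrote the dates
df['Date'] = df['Date'].str.replace('/', '-')
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
df = df.set_index('Date')
print(df)
```

Parsing the dates yourself like this avoids depending on whatever format Excel happened to save.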

The experimental process: use all 50 sets of data to fit models and pick the best one; then build a model on the first 40 sets, predict the next ten, and plot the result to see how it looks. Honestly I wasn't being very rigorous. Don't ask, I can't be bothered to change it.
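The 40/10 split described above is just slicing. A minimal sketch with the `dy` values from the dataset:

```python
import pandas as pd

# The 50 dy values from the dataset above
dy = [0.62, 1.01, 1.78, 1.29, 0.11, -0.35, -0.44, -0.3, -1.11, -1.78,
      -1.39, -0.94, -0.36, 1.47, 1.75, 2.04, 1.03, 0.02, -0.59, -1.35,
      -2.14, -1.96, -1.46, -0.56, 0.04, 0.96, 1.58, 1.43, 0.95, 0.14,
      -0.3, -1.35, -1.6, -1.98, -1.58, -0.98, 0.56, 1.14, 1.19, 1.18,
      0.61, 0.76, -0.66, -1.14, -1.35, -1.85, -0.95, -0.65, 0.44, 1.09]
series = pd.Series(dy)

train = series[:40]  # first 40 sets: build the model
test = series[40:]   # last 10 sets: compare against the forecast
print(len(train), len(test))
```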

## program

Because I kept changing my mind, I wrote two programs, one of which is for prediction. The commented-out parts also run; when things got too troublesome I just commented code out directly. That code is mainly for drawing the plots.

fitting

```python
import warnings
warnings.filterwarnings('ignore')  # Ignore warnings
import pandas as pd  # Tables and data operations
import numpy as np
import matplotlib.pyplot as plt
from random import randrange
from statsmodels.tsa.stattools import adfuller as ADF
from statsmodels.stats.diagnostic import acorr_ljungbox  # White noise test
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# Note: the old arima_model API used below was removed in statsmodels 0.13;
# newer versions provide statsmodels.tsa.arima.model.ARIMA instead
from statsmodels.tsa.arima_model import ARMA
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.api import tsa
import statsmodels.api as sm
from statsmodels.graphics.api import qqplot
from itertools import product

dta = pd.read_csv('Time series 1 - copy - copy.CSV', header=0, index_col=0)
dta = dta.dropna()  # Remove incomplete items
print(dta.describe())

# Stationarity test via the unit root (ADF) test
dta1 = dta
diff = 0
adf = ADF(dta1)
if adf[1] > 0.05:
    print(u'The original sequence is not stationary, p value: %s' % (adf[1]))
else:
    print(u'The original sequence is stationary, p value: %s' % (adf[1]))
while adf[1] >= 0.05:  # adf[1] is the p value; p < 0.05 is considered stationary
    diff = diff + 1
    adf = ADF(dta1.diff(diff).dropna())
    print(u'The %s-order difference of the original sequence is stationary, p value: %s' % (diff, adf[1]))
for i in range(1, 3):
    dta1 = dta1.diff(i).dropna()
    adf = ADF(dta1)
    print(u'%s-order difference of the original sequence, p value: %s' % (i, adf[1]))

# White noise test via the LB statistic
dta2 = dta
[[lb], [p]] = acorr_ljungbox(dta2, lags=1)
if p < 0.05:
    print(u'The original sequence is not white noise, p value: %s' % p)
else:
    print(u'The original sequence is white noise, p value: %s' % p)
[[lb], [p]] = acorr_ljungbox(dta2.diff(1).dropna(), lags=1)
if p < 0.05:
    print(u'The first-order difference sequence is not white noise, p value: %s' % p)
else:
    print(u'The first-order difference sequence is white noise, p value: %s' % p)

dta.plot(figsize=(8, 7))
plt.show()

# Plot the autocorrelation and partial autocorrelation
fig = plt.figure(figsize=(8, 7))
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212)
fig = plot_acf(dta, ax=ax1)
fig = plot_pacf(dta, ax=ax2)
fig.show()

# Model identification: determine the best p, d, q values
## Order determination
##pmax = int(len(dta)/10)  # In general the order is no more than length/10
##qmax = int(len(dta)/10)  # In general the order is no more than length/10
##bic_matrix = []  # BIC matrix
##for p in range(pmax + 1):
##    tmp = []
##    for q in range(qmax + 1):
##        try:  # Some combinations raise errors, so try is used to skip them
##            tmp.append(ARIMA(dta, (p, 1, q)).fit().bic)
##        except:
##            tmp.append(None)
##    bic_matrix.append(tmp)
##
##bic_matrix = pd.DataFrame(bic_matrix)  # Find the minimum value in it
##
##p, q = bic_matrix.stack().astype('float64').idxmin()  # stack flattens it, idxmin finds the minimum position
##print(u'The p and q with minimum BIC are: %s, %s' % (p, q))
p = 2  # Result of the search above: the (2,1,2) model is best, so assign directly to skip the long grid search
q = 2

# Build the ARIMA(p,1,q) model
arima = ARIMA(dta.dropna(), (p, 1, q)).fit()
print("Optimal model", arima.summary())

# Model test: after the model is built, check whether the residual sequence is white noise
dta3 = dta.drop(axis=0, index='2008/1/11')  # Drop the first item, whose difference is missing
dta_pred = arima.predict(typ='levels')  # In-sample prediction from the model
print(dta_pred)  # Manual step below..

# Plot the fit
fig3 = plt.figure(figsize=(12, 8))
plt.plot(dta3)
plt.plot(dta_pred, color='red')
fig3.show()

# Fix the residual sequence format: the dta_pred above is printed and copied into a CSV,
# because using it directly hit a bug. I don't know what happened;
# it seems the computed column loses its name
dta_pred2 = pd.read_csv('forecast.CSV', header=0, index_col=0)
dta_pred2 = dta_pred2.dropna()  # Remove incomplete items
pred_error = (dta_pred2 - dta3).dropna()  # Compute the residuals
lb, p_l = acorr_ljungbox(pred_error, lags=16)  # LB test
print(p_l)
h = (p_l < 0.05).sum()  # p < 0.05 is considered non-white-noise
if h > 0:
    print(u'Model ARIMA(%s,1,%s) fails the white noise test' % (p, q))
else:
    print(u'Model ARIMA(%s,1,%s) passes the white noise test' % (p, q))

# Autocorrelation plots of the residuals
fig = plt.figure(figsize=(12, 8))
ax1 = fig.add_subplot(211)
fig = plot_acf(pred_error, ax=ax1)
ax2 = fig.add_subplot(212)
fig = plot_pacf(pred_error, ax=ax2)
fig.show()

# D-W test
print(sm.stats.durbin_watson(pred_error))

# Q-Q plot
fig = plt.figure(figsize=(12, 8))
fig = qqplot(pred_error, line='q', fit=True)
fig.show()

## Accuracy under different differencing orders
##print(ARIMA(dta, (p, 0, q)).fit().bic)
##print(ARIMA(dta, (p, 1, q)).fit().bic)
##
##print(ARIMA(dta, (p, 0, q)).fit().aic)
##print(ARIMA(dta, (p, 1, q)).fit().aic)
###print(ARIMA(dta, (p, 2, q)).fit().aic)  # MA coefficients non-invertible

## Differencing comparison
##fig1 = plt.figure(figsize=(8, 7))
##ax1 = fig1.add_subplot(211)
##diff1 = dta.diff(1)
##diff1.plot(ax=ax1)
##ax2 = fig1.add_subplot(212)
##diff2 = dta.diff(2)
##diff2.plot(ax=ax2)
##fig1.show()

## Autocorrelation plots after differencing
##dta = dta.diff(1)  # First-order difference
##dta = dta.dropna()
##
##fig2 = plt.figure(figsize=(8, 7))
##ax1 = fig2.add_subplot(211)
##fig2 = plot_acf(dta, ax=ax1)
##ax2 = fig2.add_subplot(212)
##fig2 = plot_pacf(dta, ax=ax2)
##fig2.show()
```

forecast

```python
import warnings
warnings.filterwarnings('ignore')  # Ignore warnings
import pandas as pd  # Tables and data operations
import numpy as np
import matplotlib.pyplot as plt
# Note: the old arima_model API used below was removed in statsmodels 0.13
from statsmodels.tsa.arima_model import ARMA
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.api import tsa
import statsmodels.api as sm
from itertools import product

dta0 = pd.read_csv('time series_40.CSV', header=0)
dta0.index = pd.to_datetime(dta0['Date'])

data = pd.DataFrame()
data['date'] = ['2008/1/11', '2008/2/6', '2008/3/17', '2008/4/13', '2008/5/17',
                '2008/6/15', '2008/7/1', '2008/7/12', '2008/8/10', '2008/9/14',
                '2008/10/12', '2008/11/16', '2008/12/13', '2009/1/19', '2009/2/16',
                '2009/3/13', '2009/4/18', '2009/5/16', '2009/6/20', '2009/7/11',
                '2009/8/15', '2009/9/19', '2009/10/16', '2009/11/14', '2009/12/11',
                '2010/1/15', '2010/2/20', '2010/3/13', '2010/4/17', '2010/5/15',
                '2010/6/12', '2010/7/16', '2010/8/14', '2010/9/18', '2010/10/16',
                '2010/11/19', '2010/12/24', '2011/1/21', '2011/2/18', '2011/3/19',
                '2011/4/17', '2011/5/15', '2011/6/18', '2011/7/16', '2011/8/20',
                '2011/9/24', '2011/10/22', '2011/11/19', '2011/12/24', '2012/1/14']
data['dy'] = [0.62, 1.01, 1.78, 1.29, 0.11, -0.35, -0.44, -0.3, -1.11, -1.78,
              -1.39, -0.94, -0.36, 1.47, 1.75, 2.04, 1.03, 0.02, -0.59, -1.35,
              -2.14, -1.96, -1.46, -0.56, 0.04, 0.96, 1.58, 1.43, 0.95, 0.14,
              -0.3, -1.35, -1.6, -1.98, -1.58, -0.98, 0.56, 1.14, 1.19, 1.18,
              0.61, 0.76, -0.66, -1.14, -1.35, -1.85, -0.95, -0.65, 0.44, 1.09]
data.index = pd.to_datetime(data['date'])

p = 2
q = 2
arima = ARIMA(dta0['dy'].dropna(), (p, 1, q)).fit()
print(arima.summary())
dta_pred = arima.predict(typ='levels')  # In-sample fit

## Fitting plot
##fig1 = plt.figure(figsize=(12, 8))
##plt.plot(dta0['dy'], color='green')
##plt.plot(dta_pred, color='yellow')
##fig1.show()

# Model prediction
forecast_ts = arima.forecast(10)
fore = pd.DataFrame()
fore['date'] = ['2011-4-17', '2011-5-15', '2011-6-18', '2011-7-16', '2011-8-20',
                '2011-9-24', '2011-10-22', '2011-11-19', '2011-12-24', '2012-1-14']
fore['result'] = pd.DataFrame(forecast_ts[0])
fore.index = pd.to_datetime(fore['date'])
print(fore['result'])

# Plot the results
dta0['dy'].plot(color='blue', label='Original', figsize=(12, 8))
dta_pred.plot(color='red', label='Predict', figsize=(12, 8))
data.dy.plot(color='green', label='Original', figsize=(12, 8))
fore.result.plot(color='yellow', label='Forecast', figsize=(12, 8))
plt.show()
```

## result

Above

Original data sequence diagram

Time sequence diagram of primary difference

Sequence diagram of quadratic difference

Stationarity test and white noise test

Order of difference | ADF test p-value (< 0.05 can be regarded as stationary) | LB test p-value (< 0.05 can be regarded as non-white-noise) |
---|---|---|
0 | 1.1067394073215614e-08 | 4.4372756913599096e-09 |
1 | 2.3032333702927282e-11 | 0.0002755346487733437 |
2 | 1.2900060307415488e-08 | 1.1951415276363802e-07 |

Autocorrelation graph

Partial autocorrelation graph

This table and the figures above were produced with SPSS, because the partial autocorrelation plot I drew with the program looked outrageous

The autocorrelation and partial autocorrelation plots drawn by my program:

The program gives p = q = 2

Then determine the order of differencing

Order of difference | AIC | BIC |
---|---|---|
0 | 53.22929865693882 | 64.7014366895077 |
1 | 61.743467139654456 | 73.09438892831821 |
2 | MA coefficients non-invertible | MA coefficients non-invertible |

Although zero-order differencing gives the smallest AIC/BIC, the residuals of the (2,0,2) model fail the white noise test, which is also outrageous, so (2,1,2) was used in the end
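One thing worth double-checking when reproducing this: in pandas, `diff(2)` is the lag-2 difference x_t − x_{t−2}, not the second-order difference; a true second-order difference is `diff().diff()`. A minimal sketch of the distinction:

```python
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0, 16.0, 25.0])

lag2 = s.diff(2)          # x_t - x_{t-2}
second = s.diff().diff()  # first-order difference applied twice

print(lag2.tolist())      # lag-2 difference
print(second.tolist())    # second-order difference
```

The two disagree on anything that is not linear, so which one you feed into the ADF test changes the result.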

And then there's the model test

First, some padding for the page count

As with the exponential smoothing model, we need to check whether the residuals of the ARIMA model follow a normal distribution with a mean of 0 and constant variance (i.e., obey a zero-mean, constant-variance normal distribution), and at the same time whether consecutive residuals are autocorrelated

Let's first look at the autocorrelation of the residuals. Mm-hmm, it looks pretty good. I'm very pleased

LB Test

After the model is established, we need to test whether the residual sequence is white noise. If it is not, the residuals still contain useful information, and the model must be revised or the remaining information extracted further. Taking the difference between the model output and the original data gives the residual series. The LB test with a lag of 16 gives the following p-values:

[0.30226802 0.31656683 0.26285311 0.39561196 0.14861177 0.20026693 0.28606258 0.32825364 0.35109379 0.28906182 0.33058463 0.36657827 0.4070982 0.47689337 0.53658179 0.5805578 ]

All of them are well above 0.05, so there is little useful information left in the residuals; the model fits well and can be used as a prediction model.
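For intuition, the LB statistic itself is easy to compute by hand: Q = n(n+2) · Σ_{k=1..h} ρ̂_k² / (n−k), where ρ̂_k is the sample autocorrelation at lag k, compared against a χ² distribution with h degrees of freedom. A minimal numpy sketch of the formula (not the statsmodels implementation):

```python
import numpy as np

def ljung_box_q(x, h):
    """Ljung-Box Q statistic over lags 1..h."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()
    denom = np.sum(x * x)
    q = 0.0
    for k in range(1, h + 1):
        rho_k = np.sum(x[k:] * x[:-k]) / denom  # sample autocorrelation at lag k
        q += rho_k ** 2 / (n - k)
    return n * (n + 2) * q

rng = np.random.default_rng(0)
white = rng.standard_normal(200)
print(ljung_box_q(white, 16))  # small for white noise, large for autocorrelated series
```

A large Q (small p-value) means the series is autocorrelated, i.e., not white noise.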

D-W test

When the DW value is significantly close to 0 or 4, there is autocorrelation; when it is close to 2, there is no (first-order) autocorrelation. The test result is 2.26339224, indicating that the autocorrelation of the residuals is very small and can be regarded as absent
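The D-W statistic is also simple to verify by hand: DW = Σ(e_t − e_{t−1})² / Σe_t², which comes out near 2 for uncorrelated residuals. A minimal sketch:

```python
import numpy as np

def durbin_watson(e):
    """Durbin-Watson statistic of a residual series."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(1)
resid = rng.standard_normal(500)
print(durbin_watson(resid))  # near 2 for uncorrelated residuals
```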

A Q-Q plot visually checks whether a set of data comes from a certain distribution, or whether two sets of data come from the same (family of) distribution. If all the points lie roughly on a straight line, the two distributions can be considered the same. Since most of the points in the Q-Q plot lie around a straight line, the residuals can be considered to basically follow a normal distribution
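The same check behind the Q-Q plot can be done numerically: pair the sorted residuals with the corresponding theoretical normal quantiles and see how close the pairs are to a line. A minimal stdlib sketch (no plotting; the plotting positions (i + 0.5)/n are one common convention):

```python
from statistics import NormalDist

def normal_qq_pairs(data):
    """(theoretical normal quantile, observed value) pairs for a Q-Q plot."""
    xs = sorted(data)
    n = len(xs)
    nd = NormalDist()
    theo = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]  # standard normal quantiles
    return list(zip(theo, xs))

pairs = normal_qq_pairs([0.2, -0.1, 0.05, -0.3, 0.15, 0.0])
for t, x in pairs:
    print(round(t, 3), x)
```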

At the prediction stage, another model is built from the first 40 sets of data. Why the first 40? Because with the first 30 the model can't be built at all (spitting blood)

The residual of about 0.97 in the table is outrageous; the statistics are fine after removing it

date | actual value | predicted value | residual | residual analysis |
---|---|---|---|---|
2011/4/17 | 0.61 | 0.599297 | 0.010703 | count 10 |
2011/5/15 | 0.76 | -0.20582 | 0.96582 | mean 0.241534 |
2011/6/18 | -0.66 | -1.01002 | 0.35002 | std 0.358969 |
2011/7/16 | -1.14 | -1.597523 | 0.457523 | |
2011/8/20 | -1.35 | -1.820037 | 0.470037 | residuals after removing the extreme value |
2011/9/24 | -1.85 | -1.63406 | -0.21594 | count 9 |
2011/10/22 | -0.95 | -1.10831 | 0.15831 | mean 0.161057 |
2011/11/19 | -0.65 | -0.400514 | -0.24949 | std 0.268526 |
2011/12/24 | 0.44 | 0.2889 | 0.1511 | |
2012/1/14 | 1.09 | 0.77275 | 0.31725 | |
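The two summary columns in the table ("count 10, mean 0.2415…" vs. "count 9, mean 0.1611…") are just `describe()` over the residuals with and without the extreme ≈0.97 value. A minimal sketch using the residuals from the table:

```python
import pandas as pd

# Residuals (actual - predicted) from the table above
resid = pd.Series([0.010703, 0.96582, 0.35002, 0.457523, 0.470037,
                   -0.21594, 0.15831, -0.24949, 0.1511, 0.31725])

print(resid.describe())                # count 10, mean ~0.2415
trimmed = resid[resid != resid.max()]  # drop the extreme ~0.97 residual
print(trimmed.describe())              # count 9, mean ~0.1611
```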

Actual observations (green), fitted (red), and predicted (yellow)