# How to start a paper on data processing from scratch
Honestly, I don't really know how to do this properly. It's pure entry-level stuff: fit something, draw a picture, and hand in the paper. Here is how I did it (for reference only).
## CSV data
Pay attention to the date format. When you open the CSV in Excel, the dates get converted to Excel's default date format; remember to change the slashes to dashes, otherwise things sometimes get weird. (A short reading sketch follows the data listing below.)
Vertical displacement monitoring of the dam crest is a main part of deformation monitoring, and its results directly reflect the vertical deformation of a hydraulic structure. The data below are the vertical displacements of the dam crest at point 5 × 104 of a gravity dam in China from January 2008 to January 2012, 50 records in total.
```csv
Date,dy
2008-01-11,0.62
2008-02-06,1.01
2008-03-17,1.78
2008-04-13,1.29
2008-05-17,0.11
2008-06-15,-0.35
2008-07-01,-0.44
2008-07-12,-0.3
2008-08-10,-1.11
2008-09-14,-1.78
2008-10-12,-1.39
2008-11-16,-0.94
2008-12-13,-0.36
2009-01-19,1.47
2009-02-16,1.75
2009-03-13,2.04
2009-04-18,1.03
2009-05-16,0.02
2009-06-20,-0.59
2009-07-11,-1.35
2009-08-15,-2.14
2009-09-19,-1.96
2009-10-16,-1.46
2009-11-14,-0.56
2009-12-11,0.04
2010-01-15,0.96
2010-02-20,1.58
2010-03-13,1.43
2010-04-17,0.95
2010-05-15,0.14
2010-06-12,-0.3
2010-07-16,-1.35
2010-08-14,-1.6
2010-09-18,-1.98
2010-10-16,-1.58
2010-11-19,-0.98
2010-12-24,0.56
2011-01-21,1.14
2011-02-18,1.19
2011-03-19,1.18
2011-04-17,0.61
2011-05-15,0.76
2011-06-18,-0.66
2011-07-16,-1.14
2011-08-20,-1.35
2011-09-24,-1.85
2011-10-22,-0.95
2011-11-19,-0.65
2011-12-24,0.44
2012-01-14,1.09
```
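A minimal sketch of reading this CSV with pandas and parsing the dates explicitly, so the result does not depend on how Excel rewrote the date column (the file name is the one used in the fitting program below):

```python
import pandas as pd

# Parse the first column as dates while reading; slash or dash dates both work here.
dta = pd.read_csv('Time series 1 - copy - copy.CSV', header=0, index_col=0,
                  parse_dates=True)
dta.index = pd.to_datetime(dta.index)  # normalise the index to proper timestamps
print(dta.head())
```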
The experimental procedure: fit models to all 50 records and select the best one; then build a model from the first 40 records, forecast the next 10, and plot the result to see how it does. Honestly, I wasn't very rigorous here. Don't ask; I couldn't be bothered to change it.
## Program
Because I wasn't consistent, I ended up with two programs, one of them for prediction. The commented-out parts can also run; when re-running everything got too troublesome I just commented things out directly. In practice these programs are mainly used to draw the figures.
### Fitting
```python
import warnings
warnings.filterwarnings('ignore')        # ignore warnings

import pandas as pd                      # tables and data operations
import numpy as np
import matplotlib.pyplot as plt
from random import randrange
from statsmodels.tsa.stattools import adfuller as ADF
from statsmodels.stats.diagnostic import acorr_ljungbox    # white-noise test
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima_model import ARMA
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.api import tsa
import statsmodels.api as sm
from statsmodels.graphics.api import qqplot
from itertools import product

dta = pd.read_csv('Time series 1 - copy - copy.CSV', header=0, index_col=0)
dta = dta.dropna()                       # drop incomplete records
print(dta.describe())

# Stationarity check with the unit-root (ADF) test
dta1 = dta
diff = 0
adf = ADF(dta1)
if adf[1] > 0.05:
    print(u'The original series is not stationary, p value: %s' % adf[1])
else:
    print(u'The original series is stationary, p value: %s' % adf[1])
while adf[1] >= 0.05:                    # adf[1] is the p value; p < 0.05 is taken as stationary
    diff = diff + 1
    adf = ADF(dta1.diff(diff).dropna())
print(u'The series becomes stationary after %s-order differencing, p value: %s' % (diff, adf[1]))
for i in range(1, 3):
    dta1 = dta1.diff(i).dropna()
    adf = ADF(dta1)
    print(u'%s-order difference of the original series, p value: %s' % (i, adf[1]))

# White-noise check with the Ljung-Box (LB) statistic
dta2 = dta
[[lb], [p]] = acorr_ljungbox(dta2, lags=1)
if p < 0.05:
    print(u'The original series is not white noise, p value: %s' % p)
else:
    print(u'The original series is white noise, p value: %s' % p)
[[lb], [p]] = acorr_ljungbox(dta2.diff(1).dropna(), lags=1)
if p < 0.05:
    print(u'The first-order differenced series is not white noise, p value: %s' % p)
else:
    print(u'The first-order differenced series is white noise, p value: %s' % p)

dta.plot(figsize=(8, 7))
plt.show()

# Plot the autocorrelation and partial autocorrelation
fig = plt.figure(figsize=(8, 7))
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212)
fig = plot_acf(dta, ax=ax1)
fig = plot_pacf(dta, ax=ax2)
fig.show()

# Model identification: determine the best p, d, q values
## Order determination by BIC grid search (commented out; the result is assigned below)
##pmax = int(len(dta) / 10)              # the order is usually no more than length/10
##qmax = int(len(dta) / 10)
##bic_matrix = []                        # BIC matrix
##for p in range(pmax + 1):
##    tmp = []
##    for q in range(qmax + 1):
##        try:                           # some combinations raise errors, so skip them
##            tmp.append(ARIMA(dta, (p, 1, q)).fit().bic)
##        except:
##            tmp.append(None)
##    bic_matrix.append(tmp)
##bic_matrix = pd.DataFrame(bic_matrix)  # find the minimum value in it
##p, q = bic_matrix.stack().astype('float64').idxmin()   # stack to flatten, idxmin for the minimum position
##print(u'The p and q with the minimum BIC are: %s, %s' % (p, q))
p = 2   # result of the search above: the (2,1,2) model is best, so assign it directly
q = 2   # to avoid re-running the slow grid search every time

# Fit the ARIMA(p,1,q) model
arima = ARIMA(dta.dropna(), (p, 1, q)).fit()
print("Optimal model", arima.summary())

# Model check: after fitting, test whether the residual series is white noise
dta3 = dta.drop(axis=0, index='2008/1/11')   # drop the first record, whose difference is missing
dta_pred = arima.predict(typ='levels')       # in-sample prediction from the model
print(dta_pred)                              # printed and copied into a CSV by hand, see below

# Plot the fit
fig3 = plt.figure(figsize=(12, 8))
plt.plot(dta3)
plt.plot(dta_pred, color='red')
fig3.show()

# Fix the format of the predicted series
dta_pred2 = pd.read_csv('forecast.CSV', header=0, index_col=0)
# The dta_pred printed above was copied into a CSV by hand, because using it directly
# raised an error -- the column name of the computed series seems to get lost
dta_pred2 = dta_pred2.dropna()               # drop incomplete records
pred_error = (dta_pred2 - dta3).dropna()     # residuals

lb, p_l = acorr_ljungbox(pred_error, lags=16)   # LB test
print(p_l)
h = (p_l < 0.05).sum()   # p < 0.05 is taken as non-white-noise
if h > 0:
    print(u'Model ARIMA(%s,1,%s) fails the white-noise test' % (p, q))
else:
    print(u'Model ARIMA(%s,1,%s) passes the white-noise test' % (p, q))

# Autocorrelation plots of the residuals
fig = plt.figure(figsize=(12, 8))
ax1 = fig.add_subplot(211)
fig = plot_acf(pred_error, ax=ax1)
ax2 = fig.add_subplot(212)
fig = plot_pacf(pred_error, ax=ax2)
fig.show()

# D-W (Durbin-Watson) test
print(sm.stats.durbin_watson(pred_error))

# Q-Q plot
fig = plt.figure(figsize=(12, 8))
fig = qqplot(pred_error, line='q', fit=True)
fig.show()

## Information criteria for different differencing orders
##print(ARIMA(dta, (p, 0, q)).fit().bic)
##print(ARIMA(dta, (p, 1, q)).fit().bic)
##print(ARIMA(dta, (p, 0, q)).fit().aic)
##print(ARIMA(dta, (p, 1, q)).fit().aic)
###print(ARIMA(dta, (p, 2, q)).fit().aic)   # MA coefficients are not invertible

## Comparison of differencing orders
##fig1 = plt.figure(figsize=(8, 7))
##ax1 = fig1.add_subplot(211)
##diff1 = dta.diff(1)
##diff1.plot(ax=ax1)
##ax2 = fig1.add_subplot(212)
##diff2 = dta.diff(2)
##diff2.plot(ax=ax2)
##fig1.show()

## Autocorrelation plots after differencing
##dta = dta.diff(1)        # first-order difference
##dta = dta.dropna()
##fig2 = plt.figure(figsize=(8, 7))
##ax1 = fig2.add_subplot(211)
##fig2 = plot_acf(dta, ax=ax1)
##ax2 = fig2.add_subplot(212)
##fig2 = plot_pacf(dta, ax=ax2)
##fig2.show()
```
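Note that `statsmodels.tsa.arima_model.ARIMA` (the `ARIMA(dta, (p, 1, q))` call above) only works with older statsmodels releases; it has been removed from current ones. A minimal sketch of the same fit with the newer API, assuming the same CSV file:

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA   # newer statsmodels API

dta = pd.read_csv('Time series 1 - copy - copy.CSV', header=0, index_col=0,
                  parse_dates=True).dropna()
res = ARIMA(dta, order=(2, 1, 2)).fit()          # ARIMA(p=2, d=1, q=2)
print(res.summary())
fitted = res.fittedvalues                        # in-sample fitted values (no hand-copied CSV needed)
resid = res.resid                                # residual series for the white-noise checks
```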
### Forecast
```python
import warnings
warnings.filterwarnings('ignore')        # ignore warnings

import pandas as pd                      # tables and data operations
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima_model import ARMA
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.api import tsa
import statsmodels.api as sm
from itertools import product

# First 40 records, used to fit the forecasting model
dta0 = pd.read_csv('time series_40.CSV', header=0)
dta0.index = pd.to_datetime(dta0['Date'])

# Full 50-record series, used for comparison in the final plot
data = pd.DataFrame()
data['date'] = ['2008/1/11', '2008/2/6', '2008/3/17', '2008/4/13', '2008/5/17',
                '2008/6/15', '2008/7/1', '2008/7/12', '2008/8/10', '2008/9/14',
                '2008/10/12', '2008/11/16', '2008/12/13', '2009/1/19', '2009/2/16',
                '2009/3/13', '2009/4/18', '2009/5/16', '2009/6/20', '2009/7/11',
                '2009/8/15', '2009/9/19', '2009/10/16', '2009/11/14', '2009/12/11',
                '2010/1/15', '2010/2/20', '2010/3/13', '2010/4/17', '2010/5/15',
                '2010/6/12', '2010/7/16', '2010/8/14', '2010/9/18', '2010/10/16',
                '2010/11/19', '2010/12/24', '2011/1/21', '2011/2/18', '2011/3/19',
                '2011/4/17', '2011/5/15', '2011/6/18', '2011/7/16', '2011/8/20',
                '2011/9/24', '2011/10/22', '2011/11/19', '2011/12/24', '2012/1/14']
data['dy'] = [0.62, 1.01, 1.78, 1.29, 0.11, -0.35, -0.44, -0.3, -1.11, -1.78,
              -1.39, -0.94, -0.36, 1.47, 1.75, 2.04, 1.03, 0.02, -0.59, -1.35,
              -2.14, -1.96, -1.46, -0.56, 0.04, 0.96, 1.58, 1.43, 0.95, 0.14,
              -0.3, -1.35, -1.6, -1.98, -1.58, -0.98, 0.56, 1.14, 1.19, 1.18,
              0.61, 0.76, -0.66, -1.14, -1.35, -1.85, -0.95, -0.65, 0.44, 1.09]
data.index = pd.to_datetime(data['date'])

# Fit ARIMA(2,1,2) on the first 40 records
p = 2
q = 2
arima = ARIMA(dta0['dy'].dropna(), (p, 1, q)).fit()
print(arima.summary())
dta_pred = arima.predict(typ='levels')   # in-sample fit

### Fitting plot
##fig1 = plt.figure(figsize=(12, 8))
##plt.plot(dta0['dy'], color='green')
##plt.plot(dta_pred, color='yellow')
##fig1.show()

# Out-of-sample forecast for the next 10 records
forecast_ts = arima.forecast(10)
fore = pd.DataFrame()
fore['date'] = ['2011-4-17', '2011-5-15', '2011-6-18', '2011-7-16', '2011-8-20',
                '2011-9-24', '2011-10-22', '2011-11-19', '2011-12-24', '2012-1-14']
fore['result'] = pd.DataFrame(forecast_ts[0])
fore.index = pd.to_datetime(fore['date'])
print(fore['result'])

# Plot the results
dta0['dy'].plot(color='blue', label='Original', figsize=(12, 8))
dta_pred.plot(color='red', label='Predict', figsize=(12, 8))
data.dy.plot(color='green', label='Original', figsize=(12, 8))
fore.result.plot(color='yellow', label='Forecast', figsize=(12, 8))
plt.show()
```
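The program above only draws a picture. If you also want a number for how far off the forecast is, here is a minimal sketch (not in the original program) that compares the 10 forecast values with the last 10 observations, assuming `fore` and `data` from the script above:

```python
import numpy as np

actual = data['dy'].iloc[-10:].to_numpy()     # last 10 observed values
predicted = fore['result'].to_numpy()         # the 10-step forecast
rmse = np.sqrt(np.mean((actual - predicted) ** 2))
mae = np.mean(np.abs(actual - predicted))
print('RMSE: %.3f, MAE: %.3f' % (rmse, mae))
```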
## Results
Time series plot of the original data
Time series plot after first-order differencing
Time series plot after second-order differencing
### Stationarity and white-noise tests
| Order of difference | ADF test p-value (< 0.05 taken as stationary) | LB test p-value (< 0.05 taken as non-white noise) |
| --- | --- | --- |
| 0 | 1.1067394073215614e-08 | 4.4372756913599096e-09 |
| 1 | 2.3032333702927282e-11 | 0.0002755346487733437 |
| 2 | 1.2900060307415488e-08 | 1.1951415276363802e-07 |

Autocorrelation graph
Partial autocorrelation graph
This table and the figures above were produced with SPSS, because the partial autocorrelation plot drawn by my program looked completely off.
For comparison, the autocorrelation and partial autocorrelation plots drawn by the program:
The program's order selection gives p = q = 2.
Next, the differencing order is determined.
Although zero-order differencing gives the smallest value, the residuals of the ARIMA(2,0,2) model fail the white-noise test, which is also pretty outrageous, so the model finally used is ARIMA(2,1,2).
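A small sketch of this comparison with the newer statsmodels API (the old program does the same thing in its commented-out AIC/BIC lines); `dta` is the series loaded earlier:

```python
from statsmodels.tsa.arima.model import ARIMA      # newer statsmodels API
from statsmodels.stats.diagnostic import acorr_ljungbox

# Compare d = 0 against d = 1: information criteria plus a residual white-noise check.
for d in (0, 1):
    res = ARIMA(dta, order=(2, d, 2)).fit()
    lb_p = acorr_ljungbox(res.resid, lags=16)['lb_pvalue'].min()
    print('ARIMA(2,%d,2): AIC=%.2f  BIC=%.2f  min LB p-value=%.4f'
          % (d, res.aic, res.bic, lb_p))
```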
Then comes the model testing.
First, the criteria to check:
As with exponential smoothing models, we need to check whether the residuals of the ARIMA model follow a normal distribution with zero mean and constant variance, and also whether successive residuals are autocorrelated.
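A compact sketch of these checks on a residual series `resid` (for example `res.resid` from the newer-API fit sketched earlier); all calls are standard statsmodels/scipy functions:

```python
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_ljungbox
from scipy import stats

print('mean of residuals:', float(resid.mean()))              # should be close to 0
print('Durbin-Watson:', sm.stats.durbin_watson(resid))        # close to 2 -> no first-order autocorrelation
print(acorr_ljungbox(resid, lags=16))                         # p > 0.05 at all lags -> looks like white noise
print('Jarque-Bera (normality):', stats.jarque_bera(resid))   # large p -> normality not rejected
```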
Let's look at the autocorrelation of the residuals first. Mm-hmm, it looks pretty good. I'm very pleased.
### LB test
After the model is fitted, the residual series must be tested for white noise. If it is not white noise, useful information remains in the residuals and the model needs to be revised or the remaining information extracted further. Subtracting the fitted values from the original data gives the residual series; the LB test with lags up to 16 returns the following p-values:

[0.30226802 0.31656683 0.26285311 0.39561196 0.14861177 0.20026693 0.28606258 0.32825364 0.35109379 0.28906182 0.33058463 0.36657827 0.4070982 0.47689337 0.53658179 0.5805578 ]

All of them are well above 0.05, so little useful information is left in the residuals; the model fits well and can be used for prediction.
### D-W test
When the DW statistic is close to 0 or 4, the residuals are autocorrelated; when it is close to 2, there is no (first-order) autocorrelation. The test result here is 2.26339224, so the autocorrelation of the residuals is very small and can be treated as absent.
A Q-Q plot is a visual check of whether a set of data comes from a given distribution, or whether two sets of data come from the same (family of) distributions. If the points lie roughly on a straight line, the two distributions can be considered the same. In the Q-Q plot here, most points are distributed around a straight line, so the residuals can be regarded as approximately normally distributed.
For the prediction stage, another model is built from the first 40 records. Why the first 40? Because with only the first 30 the model simply would not fit (spitting blood).
The residuals of about 0.9 in there are pretty outrageous; it would be fine to remove them.
Actual observations (green), fitted (red), and predicted (yellow)