Chapter 9 time series data

import pandas as pd
import numpy as np

1, Creation of time series

1. Four kinds of time variables

There may be some confusion about ③ and ④; this will be explained later

① Date times: a specific date or point in time; element type Timestamp; created with to_datetime or date_range
② Time spans: a span of time anchored at a point in time; element type Period; created with Period or period_range
③ Date offsets (relative time difference): a calendar-relative length of time, whose wall-clock result is unaffected by daylight saving; element type DateOffset; created with DateOffset
④ Time deltas (absolute time difference): an absolute length of time, whose wall-clock result is affected by daylight saving; element type Timedelta; created with to_timedelta or timedelta_range
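As a minimal sketch (the variable names here are only illustrative), each of the four element types can be created directly:

```python
import pandas as pd

ts = pd.to_datetime('2020-01-01')     # ① Date time -> Timestamp
p = pd.Period('2020-01', freq='M')    # ② Time span -> Period
off = pd.DateOffset(days=1)           # ③ Relative time difference -> DateOffset
td = pd.to_timedelta('1 day')         # ④ Absolute time difference -> Timedelta
print(type(ts).__name__, type(p).__name__, type(off).__name__, type(td).__name__)
# Timestamp Period DateOffset Timedelta
```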

2. Creation of time points

(a) the to_datetime method

pandas gives a lot of freedom in the input format when creating a time point. All of the following statements correctly create the same time point

pd.to_datetime('2020.1.1')
pd.to_datetime('2020 1.1')
pd.to_datetime('2020 1 1')
pd.to_datetime('2020 1-1')
pd.to_datetime('2020-1 1')
pd.to_datetime('2020-1-1')
pd.to_datetime('2020/1/1')
pd.to_datetime('1.1.2020')
pd.to_datetime('1.1 2020')
pd.to_datetime('1 1 2020')
pd.to_datetime('1 1-2020')
pd.to_datetime('1-1 2020')
pd.to_datetime('1-1-2020')
pd.to_datetime('1/1/2020')
pd.to_datetime('20200101')
pd.to_datetime('2020.0101')
Timestamp('2020-01-01 00:00:00')

The following statements raise errors

#pd.to_datetime('2020\\1\\1')
#pd.to_datetime('2020`1`1')
#pd.to_datetime('2020.1 1')
#pd.to_datetime('1 1.2020')

In such cases, the format parameter can be used to force a match

pd.to_datetime('2020\\1\\1',format='%Y\\%m\\%d')
pd.to_datetime('2020`1`1',format='%Y`%m`%d')
pd.to_datetime('2020.1 1',format='%Y.%m %d')
pd.to_datetime('1 1.2020',format='%d %m.%Y')
Timestamp('2020-01-01 00:00:00')

Passing a list produces a DatetimeIndex, which can serve as a time-point index

pd.Series(range(2),index=pd.to_datetime(['2020/1/1','2020/1/2']))
2020-01-01    0
2020-01-02    1
dtype: int64
type(pd.to_datetime(['2020/1/1','2020/1/2']))
pandas.core.indexes.datetimes.DatetimeIndex

For a DataFrame whose columns are named after date components (year, month, day, ...), to_datetime converts it automatically

df = pd.DataFrame({'year': [2020, 2020],'month': [1, 1], 'day': [1, 2]})
pd.to_datetime(df)
0   2020-01-01
1   2020-01-02
dtype: datetime64[ns]

(b) Time precision and range limitation

In fact, the precision of Timestamp goes far beyond the day: it can be as fine as a nanosecond (ns)

pd.to_datetime('2020/1/1 00:00:00.123456789')
Timestamp('2020-01-01 00:00:00.123456789')

The cost of this precision is a limited range: only about 584 years can be represented

pd.Timestamp.min
Timestamp('1677-09-21 00:12:43.145225')
pd.Timestamp.max
Timestamp('2262-04-11 23:47:16.854775807')
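Since timestamps are stored as 64-bit nanosecond counts, the roughly 584-year span can be verified with a quick sketch (halving avoids overflowing Timedelta's own 64-bit range):

```python
import pandas as pd

# Timestamp.max - Timestamp.min would overflow Timedelta, so measure
# half of the span from the epoch and double it
half = pd.Timestamp.max - pd.Timestamp('1970-01-01')
years = 2 * half.days / 365.25
print(int(years))  # 584
```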

(c) the date_range method

Generally speaking, start / end / periods / freq (interval) are the most important parameters of this method; given any three of them, the remaining one is determined

pd.date_range(start='2020/1/1',end='2020/1/10',periods=3)
DatetimeIndex(['2020-01-01 00:00:00', '2020-01-05 12:00:00',
               '2020-01-10 00:00:00'],
              dtype='datetime64[ns]', freq=None)
pd.date_range(start='2020/1/1',end='2020/1/10',freq='D')
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10'],
              dtype='datetime64[ns]', freq='D')
pd.date_range(start='2020/1/1',periods=3,freq='D')
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03'], dtype='datetime64[ns]', freq='D')
pd.date_range(end='2020/1/3',periods=3,freq='D')
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03'], dtype='datetime64[ns]', freq='D')

There are many options for the freq parameter. The common ones are listed below; more are available in the official documentation

D/B: calendar day / business day
W: week
M/Q/Y: month / quarter / year end
BM/BQ/BY: last business day of month / quarter / year
MS/QS/YS: month / quarter / year start
BMS/BQS/BYS: first business day of month / quarter / year
H: hour
T: minute
S: second
pd.date_range(start='2020/1/1',periods=3,freq='T')
DatetimeIndex(['2020-01-01 00:00:00', '2020-01-01 00:01:00',
               '2020-01-01 00:02:00'],
              dtype='datetime64[ns]', freq='T')
pd.date_range(start='2020/1/1',periods=3,freq='M')
DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31'], dtype='datetime64[ns]', freq='M')
pd.date_range(start='2020/1/1',periods=3,freq='BYS')
DatetimeIndex(['2020-01-01', '2021-01-01', '2022-01-03'], dtype='datetime64[ns]', freq='BAS-JAN')

bdate_range is similar to date_range; its distinctive feature is that, on top of the built-in business-day intervals, the weekmask and holidays parameters can be chosen.

Its freq accepts the special options 'C'/'CBM'/'CBMS', meaning custom; these must be used together with the weekmask and holidays parameters

For example, suppose we want to keep only Monday, Tuesday and Friday as working days, and also remove some holidays

weekmask = 'Mon Tue Fri'
holidays = [pd.Timestamp('2020/1/%s'%i) for i in range(7,13)]
# note that the dates listed in holidays are excluded
pd.bdate_range(start='2020-1-1',end='2020-1-15',freq='C',weekmask=weekmask,holidays=holidays)
DatetimeIndex(['2020-01-03', '2020-01-06', '2020-01-13', '2020-01-14'], dtype='datetime64[ns]', freq='C')

3. DateOffset object

(a) Difference between DateOffset and Timedelta

Timedelta is an absolute time difference: whether on winter time or daylight saving time, adding or subtracting 1 day always means exactly 24 hours.

DateOffset is a relative time difference: no matter whether a given day has 23, 24 or 25 hours, adding or subtracting 1 day lands on the same wall-clock time one day later or earlier.

For example, in many European time zones daylight saving time began at the end of March 2020, when clocks jumped forward one hour (in the UK, 2020-03-29 01:00:00 became 02:00:00).

ts = pd.Timestamp('2020-3-29 01:00:00', tz='Europe/Helsinki')
ts + pd.Timedelta(days=1)
Timestamp('2020-03-30 02:00:00+0300', tz='Europe/Helsinki')
ts + pd.DateOffset(days=1)
Timestamp('2020-03-30 01:00:00+0300', tz='Europe/Helsinki')

This may seem a bit overwhelming, but if no tz (time zone) is involved you can ignore the distinction: the two behave identically unless a daylight-saving transition is crossed

ts = pd.Timestamp('2020-3-29 01:00:00')
ts + pd.Timedelta(days=1)
Timestamp('2020-03-30 01:00:00')
ts + pd.DateOffset(days=1)
Timestamp('2020-03-30 01:00:00')

(b) Adding or subtracting a span of time

Optional parameters for DateOffset include years/months/weeks/days/hours/minutes/seconds

pd.Timestamp('2020-01-01') + pd.DateOffset(minutes=20) - pd.DateOffset(weeks=2)
Timestamp('2019-12-18 00:20:00')

(c) Various common offset objects

D/B: Day/BDay
W: Week
(B)M/(B)Q/(B)Y: (B)MonthEnd/(B)QuarterEnd/(B)YearEnd
(B)MS/(B)QS/(B)YS: (B)MonthBegin/(B)QuarterBegin/(B)YearBegin
H: Hour
T: Minute
S: Second
C: CDay (custom business day)
pd.Timestamp('2020-01-01') + pd.offsets.Week(2)
Timestamp('2020-01-15 00:00:00')
pd.Timestamp('2020-01-01') + pd.offsets.BQuarterBegin(1)
Timestamp('2020-03-02 00:00:00')

(d) offset operation of sequence

Using the apply function

pd.Series(pd.offsets.BYearBegin(3).apply(i) for i in pd.date_range('20200101',periods=3,freq='Y'))
0   2023-01-02
1   2024-01-01
2   2025-01-01
dtype: datetime64[ns]

Use object addition and subtraction directly

pd.date_range('20200101',periods=3,freq='Y') + pd.offsets.BYearBegin(3)
DatetimeIndex(['2023-01-02', '2024-01-01', '2025-01-01'], dtype='datetime64[ns]', freq='A-DEC')

For a custom offset, the weekmask and holidays parameters can be specified (think about why all three results come out as the same value)

pd.Series(pd.offsets.CDay(3,weekmask='Wed Fri',holidays='2020010').apply(i)
                                  for i in pd.date_range('20200105',periods=3,freq='D'))
0   2020-01-15
1   2020-01-15
2   2020-01-15
dtype: datetime64[ns]

2, Indexing and attributes of time series

1. Indexing and slicing

The rules here are almost identical to those in Chapter 2

rng = pd.date_range('2020','2021', freq='W')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts.head()
2020-01-05   -0.064380
2020-01-12   -0.027114
2020-01-19    1.408292
2020-01-26   -1.042610
2020-02-02   -1.091203
Freq: W-SUN, dtype: float64
ts['2020-01-26']
-1.0426103532400182

Legal date strings are automatically converted to time points

ts['2020-01-26':'20200726'].head()
2020-01-26   -1.042610
2020-02-02   -1.091203
2020-02-09    0.578561
2020-02-16    0.743681
2020-02-23   -1.366325
Freq: W-SUN, dtype: float64

2. Subset indexing

ts['2020-7'].head()
2020-07-05   -1.304767
2020-07-12    0.715708
2020-07-19    1.115148
2020-07-26    0.768978
Freq: W-SUN, dtype: float64

Mixed forms of index are supported

ts['2011-1':'20200726'].head()
2020-01-05   -0.064380
2020-01-12   -0.027114
2020-01-19    1.408292
2020-01-26   -1.042610
2020-02-02   -1.091203
Freq: W-SUN, dtype: float64

3. Attributes of time points

Information about the time can easily be obtained through the dt accessor

pd.Series(ts.index).dt.year.head()
0    2020
1    2020
2    2020
3    2020
4    2020
dtype: int64
pd.Series(ts.index).dt.day.head()
0     5
1    12
2    19
3    26
4     2
dtype: int64

The display format can be changed with strftime

pd.Series(ts.index).dt.strftime('%Y-Interval 1-%m-Interval 2-%d').head()
0    2020-Interval 1-01-Interval 2-05
1    2020-Interval 1-01-Interval 2-12
2    2020-Interval 1-01-Interval 2-19
3    2020-Interval 1-01-Interval 2-26
4    2020-Interval 1-02-Interval 2-02
dtype: object

For datetime objects, you can get information directly from the properties

pd.date_range('2020','2021', freq='W').month
Int64Index([ 1,  1,  1,  1,  2,  2,  2,  2,  3,  3,  3,  3,  3,  4,  4,  4,  4,
             5,  5,  5,  5,  5,  6,  6,  6,  6,  7,  7,  7,  7,  8,  8,  8,  8,
             8,  9,  9,  9,  9, 10, 10, 10, 10, 11, 11, 11, 11, 11, 12, 12, 12,
            12],
           dtype='int64')
pd.date_range('2020','2021', freq='W').weekday
Int64Index([6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
            6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
            6, 6, 6, 6, 6, 6, 6, 6],
           dtype='int64')

3, Resampling

Resampling refers to the resample function, which can be seen as the time-series version of groupby

1. Basic operations on the resample object

The sampling frequency is generally given as one of the offset strings mentioned above

df_r = pd.DataFrame(np.random.randn(1000, 3),index=pd.date_range('1/1/2020', freq='S', periods=1000),
                  columns=['A', 'B', 'C'])
df_r
A B C
2020-01-01 00:00:00 0.342367 0.045786 -0.063409
2020-01-01 00:00:01 -0.927340 0.341773 1.530537
2020-01-01 00:00:02 2.071863 0.676199 -0.182570
2020-01-01 00:00:03 -1.454321 0.994234 -1.063875
2020-01-01 00:00:04 -0.637741 0.043385 0.219948
... ... ... ...
2020-01-01 00:16:35 0.709417 -1.864620 -1.142192
2020-01-01 00:16:36 -0.968802 0.528431 0.000216
2020-01-01 00:16:37 1.677542 -0.456622 -0.718650
2020-01-01 00:16:38 -0.135881 0.157089 1.948446
2020-01-01 00:16:39 -1.098189 0.540106 0.791522

1000 rows × 3 columns

r = df_r.resample('3min')
r
<pandas.core.resample.DatetimeIndexResampler object at 0x000001766EC63198>
r.sum()
A B C
2020-01-01 00:00:00 -8.772685 -27.074716 2.134617
2020-01-01 00:03:00 3.822484 8.912459 -15.448955
2020-01-01 00:06:00 2.744722 -8.055139 -11.364361
2020-01-01 00:09:00 4.655620 -11.524496 -10.536002
2020-01-01 00:12:00 -10.546811 5.063887 11.776490
2020-01-01 00:15:00 8.795150 -12.828809 -8.393950
df_r2 = pd.DataFrame(np.random.randn(200, 3),index=pd.date_range('1/1/2020', freq='D', periods=200),
                  columns=['A', 'B', 'C'])
r = df_r2.resample('CBMS')
r.sum()
A B C
2020-01-01 5.278470 1.688588 5.904806
2020-02-03 -3.581797 7.515267 0.205308
2020-03-02 -5.021605 -4.441066 5.433917
2020-04-01 0.671702 3.840042 4.922487
2020-05-01 4.613352 9.702408 -4.928112
2020-06-01 -0.598191 7.387416 8.716921
2020-07-01 -0.327200 -1.577507 -3.956079

2. Aggregation after sampling

r = df_r.resample('3T')
r['A'].mean()
2020-01-01 00:00:00   -0.048737
2020-01-01 00:03:00    0.021236
2020-01-01 00:06:00    0.015248
2020-01-01 00:09:00    0.025865
2020-01-01 00:12:00   -0.058593
2020-01-01 00:15:00    0.087952
Freq: 3T, Name: A, dtype: float64
r['A'].agg([np.sum, np.mean, np.std])
sum mean std
2020-01-01 00:00:00 -8.772685 -0.048737 0.939954
2020-01-01 00:03:00 3.822484 0.021236 1.004048
2020-01-01 00:06:00 2.744722 0.015248 1.018865
2020-01-01 00:09:00 4.655620 0.025865 1.020881
2020-01-01 00:12:00 -10.546811 -0.058593 0.954328
2020-01-01 00:15:00 8.795150 0.087952 1.199379

Similarly, functions and lambda expressions can be used

r.agg({'A': np.sum,'B': lambda x: max(x)-min(x)})
A B
2020-01-01 00:00:00 -8.772685 4.950006
2020-01-01 00:03:00 3.822484 5.711679
2020-01-01 00:06:00 2.744722 6.923072
2020-01-01 00:09:00 4.655620 6.370589
2020-01-01 00:12:00 -10.546811 4.544878
2020-01-01 00:15:00 8.795150 5.244546

3. Iterating over sampling groups

Iterating over sampling groups is similar to iterating over a groupby object; each group can be operated on separately

small = pd.Series(range(6),index=pd.to_datetime(['2020-01-01 00:00:00', '2020-01-01 00:30:00'
                                                 , '2020-01-01 00:31:00','2020-01-01 01:00:00'
                                                 ,'2020-01-01 03:00:00','2020-01-01 03:05:00']))
resampled = small.resample('H')
resampled
for name, group in resampled:
    print("Group: ", name)
    print("-" * 27)
    print(group, end="\n\n")
Group:  2020-01-01 00:00:00
---------------------------
2020-01-01 00:00:00    0
2020-01-01 00:30:00    1
2020-01-01 00:31:00    2
dtype: int64

Group:  2020-01-01 01:00:00
---------------------------
2020-01-01 01:00:00    3
dtype: int64

Group:  2020-01-01 02:00:00
---------------------------
Series([], dtype: int64)

Group:  2020-01-01 03:00:00
---------------------------
2020-01-01 03:00:00    4
2020-01-01 03:05:00    5
dtype: int64

4, Window functions

pandas has two main types of window functions: rolling and expanding

s = pd.Series(np.random.randn(1000),index=pd.date_range('1/1/2020', periods=1000))
s.head()
2020-01-01    1.498756
2020-01-02   -0.352679
2020-01-03   -0.375851
2020-01-04    0.654390
2020-01-05   -1.187620
Freq: D, dtype: float64

1. Rolling

(a) Common aggregation

The rolling method specifies a window; like a groupby object, it computes nothing by itself and needs an aggregation function to produce a result

s.rolling(window=50)
Rolling [window=50,center=False,axis=0]
s.rolling(window=50).mean()
2020-01-01         NaN
2020-01-02         NaN
2020-01-03         NaN
2020-01-04         NaN
2020-01-05         NaN
                ...   
2022-09-22   -0.040783
2022-09-23   -0.065672
2022-09-24   -0.049524
2022-09-25   -0.026894
2022-09-26    0.008426
Freq: D, Length: 1000, dtype: float64

The min_periods parameter is the minimum number of non-missing data points required in the window

s.rolling(window=50,min_periods=3).mean().head()
2020-01-01         NaN
2020-01-02         NaN
2020-01-03    0.256742
2020-01-04    0.356154
2020-01-05    0.047399
Freq: D, dtype: float64

count/sum/mean/median/min/max/std/var/skew/kurt/quantile/cov/corr are the commonly used aggregation functions
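As a quick sketch of two items from this list (on made-up data), the rolling median agrees with the rolling 0.5-quantile:

```python
import pandas as pd
import numpy as np

s = pd.Series(np.arange(10, dtype=float))
med = s.rolling(window=3).median()     # median of each 3-element window
q = s.rolling(window=3).quantile(0.5)  # 0.5-quantile of each window
print(med.iloc[-1], q.iloc[-1])  # 8.0 8.0 (the last window is [7, 8, 9])
```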

(b) apply aggregation with rolling

When aggregating with apply, just remember that the input is a Series of window size and the output must be a scalar. For example, the coefficient of variation is computed as follows

s.rolling(window=50,min_periods=3).apply(lambda x:x.std()/x.mean()).head()
2020-01-01          NaN
2020-01-02          NaN
2020-01-03   -10.018809
2020-01-04    -2.040720
2020-01-05    -1.463460
Freq: D, dtype: float64

(c) Time-based rolling

s.rolling('15D').mean()
2020-01-01    1.498756
2020-01-02    0.573038
2020-01-03    0.256742
2020-01-04    0.356154
2020-01-05    0.047399
                ...   
2022-09-22    0.202153
2022-09-23    0.092951
2022-09-24    0.175540
2022-09-25    0.165124
2022-09-26    0.108596
Freq: D, Length: 1000, dtype: float64

The closed parameter ('right' (default), 'left', 'both' or 'neither') determines whether the window endpoints are included

s.rolling('15D', closed='right').sum().head()
2020-01-01    1.498756
2020-01-02    1.146077
2020-01-03    0.770226
2020-01-04    1.424616
2020-01-05    0.236996
Freq: D, dtype: float64

2. Expanding

(a) expanding function

The plain expanding function is equivalent to rolling(window=len(s), min_periods=1); it performs a cumulative calculation over the sequence

s.rolling(window=len(s),min_periods=1).sum()
2020-01-01     1.498756
2020-01-02     1.146077
2020-01-03     0.770226
2020-01-04     1.424616
2020-01-05     0.236996
                ...    
2022-09-22    14.578431
2022-09-23    13.896989
2022-09-24    16.404925
2022-09-25    17.196337
2022-09-26    18.187211
Freq: D, Length: 1000, dtype: float64
s.expanding().sum().head()
2020-01-01    0.305974
2020-01-02    0.491195
2020-01-03   -0.155277
2020-01-04   -1.585570
2020-01-05   -2.541664
Freq: D, dtype: float64

The apply method is also available

s.expanding().apply(lambda x:sum(x)).head()
2020-01-01    0.305974
2020-01-02    0.491195
2020-01-03   -0.155277
2020-01-04   -1.585570
2020-01-05   -2.541664
Freq: D, dtype: float64

(b) Several special Expanding type functions

cumsum/cumprod/cummax/cummin are all special expanding cumulative calculation methods

s.cumsum().head()
2020-01-01    0.305974
2020-01-02    0.491195
2020-01-03   -0.155277
2020-01-04   -1.585570
2020-01-05   -2.541664
Freq: D, dtype: float64

shift/diff/pct_change all operate on relationships between elements

① shift keeps the index unchanged and moves the values backward

② diff takes the difference between an element and an earlier one; the periods parameter gives the interval, which defaults to 1 and can be negative

③ pct_change gives the percentage change between values; its periods parameter works like that of diff

s.shift(2).head()
2020-01-01         NaN
2020-01-02         NaN
2020-01-03    0.305974
2020-01-04    0.185221
2020-01-05   -0.646472
Freq: D, dtype: float64
s.diff(3).head()
2020-01-01         NaN
2020-01-02         NaN
2020-01-03         NaN
2020-01-04   -1.736267
2020-01-05   -1.141316
Freq: D, dtype: float64
s.pct_change(3).head()
2020-01-01         NaN
2020-01-02         NaN
2020-01-03         NaN
2020-01-04   -5.674559
2020-01-05   -6.161897
Freq: D, dtype: float64

5, Questions and exercises

[question 1] How can date_range be used to insert points in batches, or to increase the timestamp density over a certain period?
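One possible sketch of an answer (the dates below are invented for illustration): build a denser index for the target sub-period and merge it into the original index with union:

```python
import pandas as pd

idx = pd.date_range('2020-01-01', '2020-01-05', freq='D')  # daily points
dense = pd.date_range('2020-01-02', '2020-01-03',
                      freq=pd.Timedelta(hours=12))         # denser sub-period
combined = idx.union(dense)  # the 5 daily points plus the extra 12:00 point
print(len(combined))  # 6
```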

[question 2] How to increase the precision of timestamps in batches?

[question 3] For time points beyond the representable range, is there really no way to handle them?
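A possible hint (one approach among several, not a definitive answer): Period stores an integer ordinal per frequency rather than a nanosecond count, so it can represent dates far outside Timestamp's range:

```python
import pandas as pd

# 1500 is well before pd.Timestamp.min, yet a day-frequency Period handles it
p = pd.Period('1500-01-01', freq='D')
print(p, p.year)  # 1500-01-01 1500
```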

[question 4] Given a set of non-continuous dates, how to quickly find the dates that lie between the minimum and maximum date but do not appear in the set?
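A sketch for this one (dates invented for illustration): generate the full daily range between the minimum and maximum, then take the set difference:

```python
import pandas as pd

dates = pd.to_datetime(['2020-01-01', '2020-01-03', '2020-01-06'])
full = pd.date_range(dates.min(), dates.max())  # daily frequency by default
missing = full.difference(dates)                # dates never observed
print(list(missing.strftime('%Y-%m-%d')))  # ['2020-01-02', '2020-01-04', '2020-01-05']
```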

[exercise 1] Given time-series data of milk sales in a supermarket, please complete the following questions:

(a) On which day of the week does the sales peak fall? (hint: use the dayofweek attribute)

(b) Calculate the total monthly sales, excluding the Spring Festival, National Day and May Day holidays

(c) Total sales on weekends (Saturdays and Sundays) on a quarterly basis

(d) Counting backward from the last day in 5-day units, compute the totals while skipping Saturdays and Mondays

(e) Suppose the data contain an error: within each week, the Monday and Friday sales records have been swapped. Calculate the sales on the first Monday of each month in 2018 (if that week has no Monday or Friday record, the value stays unchanged)

df=pd.read_csv('./time_series_one.csv', parse_dates=['date'])
df['date'].dt.dayofweek[df['sales volume'].idxmax()]
6
holiday = pd.date_range(start='20170501', end='20170503').append(
          pd.date_range(start='20171001', end='20171007')).append(
          pd.date_range(start='20180215', end='20180221')).append(
          pd.date_range(start='20180501', end='20180503')).append(
          pd.date_range(start='20181001', end='20181007')).append(
          pd.date_range(start='20190204', end='20190224')).append(
          pd.date_range(start='20190501', end='20190503')).append(
          pd.date_range(start='20191001', end='20191007'))
df[~df['date'].isin(holiday)].set_index('date').resample('MS').sum()
sales volume
date
2017-02-01 31740
2017-03-01 80000
2017-04-01 74734
2017-05-01 76237
2017-06-01 80750
2017-07-01 83107
2017-08-01 85715
2017-09-01 79604
2017-10-01 61197
2017-11-01 78877
2017-12-01 84900
2018-01-01 85869
2018-02-01 61838
2018-03-01 88339
2018-04-01 82011
2018-05-01 76932
2018-06-01 85307
2018-07-01 82316
2018-08-01 85917
2018-09-01 82740
2018-10-01 67018
2018-11-01 81333
2018-12-01 91381
2019-01-01 89407
2019-02-01 21219
2019-03-01 87565
2019-04-01 90202
2019-05-01 85665
2019-06-01 90301
2019-07-01 90902
2019-08-01 93664
2019-09-01 89077
2019-10-01 72099
2019-11-01 38423
 df[df['date'].dt.dayofweek.isin([5,6])].set_index('date').resample('QS').sum()
sales volume
date
2017-01-01 32894
2017-04-01 66692
2017-07-01 69099
2017-10-01 70384
2018-01-01 74671
2018-04-01 69950
2018-07-01 74245
2018-10-01 74699
2019-01-01 77835
2019-04-01 77042
2019-07-01 76276
2019-10-01 35994
df[~df['date'].dt.dayofweek.isin([0,5])].set_index('date').sort_index(ascending=False).resample('5D').sum()
sales volume
date
2017-02-17 8026
2017-02-22 10917
2017-02-27 11052
2017-03-04 6945
2017-03-09 8916
... ...
2019-10-20 12817
2019-10-25 8902
2019-10-30 11955
2019-11-04 11281
2019-11-09 8701

200 rows × 1 columns

df_temp = df[~df['date'].dt.dayofweek.isin([5,6])].set_index('date').iloc[::-1]
L_temp,date_temp = [],[0]*df_temp.shape[0]
for i in range(df_temp.shape[0]//5):
    L_temp.extend([i]*5)
L_temp.extend([df_temp.shape[0]//5]*(df_temp.shape[0]-df_temp.shape[0]//5*5))
date_temp = pd.Series([i%5==0 for i in range(df_temp.shape[0])])
df_temp['num'] = L_temp
result = pd.DataFrame({'5 Total days':df_temp.groupby('num')['sales volume'].sum().values},
                       index=df_temp.reset_index()[date_temp]['date']).iloc[::-1]
result.head()
5 Total days
date
2017-02-22 9855
2017-03-01 12296
2017-03-08 13323
2017-03-15 13845
2017-03-22 11356
from datetime import datetime 
df_temp = df.copy()
df_fri = df.shift(4)[df.shift(4)['date'].dt.dayofweek==1]['sales volume']
df_mon = df.shift(-4)[df.shift(-4)['date'].dt.dayofweek==5]['sales volume']
df_temp.loc[df_fri.index,['sales volume']] = df_fri
df_temp.loc[df_mon.index,['sales volume']] = df_mon
df_temp.loc[df_temp[df_temp['date'].dt.year==2018]['date'][
        df_temp[df_temp['date'].dt.year==2018]['date'].apply(
        lambda x:True if datetime.strptime(str(x).split()[0],'%Y-%m-%d').weekday() == 0 
        and 1 <= datetime.strptime(str(x).split()[0],'%Y-%m-%d').day <= 7 else False)].index,:]
date sales volume
318 2018-01-01 2863.0
353 2018-02-05 2321.0
381 2018-03-05 2705.0
409 2018-04-02 2487.0
444 2018-05-07 3204.0
472 2018-06-04 2927.0
500 2018-07-02 2574.0
535 2018-08-06 2504.0
563 2018-09-03 2483.0
591 2018-10-01 2431.0
626 2018-11-05 2395.0
654 2018-12-03 2373.0

[exercise 2] Continuing with the data from the previous question, please complete the following:

(a) Calculate the mean value and maximum value of sliding window with 50 days as window (min_periods is set to 1)

(b) Apply the following rule: if a day's sales volume exceeds the mean of the previous five days, record 1; otherwise record 0. Give the corresponding results for 2018

(c) In (b), change "the previous five days" to "the previous five non-weekend days" and compute the results again

df = pd.read_csv('./time_series_one.csv',index_col='date',parse_dates=['date'])
df['sales volume'].rolling(window=50,min_periods=1).mean().head()
date
2017-02-17    2154.000000
2017-02-18    2124.500000
2017-02-19    2569.333333
2017-02-20    2476.500000
2017-02-21    2463.800000
Name: sales volume, dtype: float64
df['sales volume'].rolling(window=50,min_periods=1).max().head()
date
2017-02-17    2154.0
2017-02-18    2154.0
2017-02-19    3459.0
2017-02-20    3459.0
2017-02-21    3459.0
Name: sales volume, dtype: float64


Posted on Mon, 29 Jun 2020 01:01:49 -0400 by heropage