
[Python Time Series Analysis] Time Series Decomposition

Python Analysis and Programming/Python Statistical Analysis 2020. 1. 2. 23:46

The previous post introduced the component factors of a time series: the trend, cycle, seasonal, and irregular factors.

That post composed a synthetic time series with an additive model (time series composition). This post goes the other way and uses time series decomposition to split a series into trend (including cycle), seasonality, and residual.

Time series decomposition is intuitive and easy to implement, so although it is a classical method of time series analysis, it is still widely used today. The procedure is roughly as follows.

(1) Look at the time series plot and judge whether the series shows periodic repetition/seasonality, and whether an additive model (y = t + s + r) or a multiplicative model (y = t * s * r) is the better fit.

(Assuming an additive model)

(2) Extract the trend from the series with a centered moving average.

(3) Subtract the estimated trend from the original data (detrend). Only the seasonal and irregular factors remain.

(4) For each position in the seasonal period, average the detrended values across all cycles to get the average seasonality. (E.g., January seasonal average = (2020-01 + 2021-01 + 2022-01 + 2023-01)/4, February seasonal average = (2020-02 + 2021-02 + 2022-02 + 2023-02)/4.)

(5) Subtract the estimated trend and seasonality from the original values. What remains is the irregular (random) factor.

If the residual left after removing trend and seasonality (the random/irregular factor) shows no particular pattern, looks random, and is small, then trend and seasonality model the series well, and the decomposition can be used to understand the series and to forecast it. If the residual does show a pattern (for example, a periodic wave, or a variance that keeps growing), a separate model can be fitted to the residual alone.
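Steps (2) to (5) can be sketched by hand with pandas. The block below is a minimal illustration under stated assumptions, not the statsmodels implementation: the helper name `classical_additive_decompose`, the toy series, and the choice of a 2 x 12 centered moving average for an even period are all assumptions made here for illustration.

```python
import numpy as np
import pandas as pd


def classical_additive_decompose(y: pd.Series, period: int = 12) -> pd.DataFrame:
    """Hypothetical helper: classical additive decomposition, steps (2)-(5)."""
    # (2) Trend: centered moving average. For an even period, use a window of
    #     period + 1 points with half weight on both ends (a "2 x period" MA).
    if period % 2 == 0:
        weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    else:
        weights = np.ones(period) / period
    trend = y.rolling(window=len(weights), center=True).apply(
        lambda w: float(np.dot(w, weights)), raw=True
    )

    # (3) Detrend: the seasonal and irregular factors remain.
    detrended = y - trend

    # (4) Average the detrended values by position within the period
    #     (NaN edges are skipped), then center the averages on zero.
    position = np.arange(len(y)) % period
    seasonal_means = detrended.groupby(position).mean()
    seasonal_means = seasonal_means - seasonal_means.mean()
    seasonal = pd.Series(seasonal_means.to_numpy()[position], index=y.index)

    # (5) Residual: what is left after removing trend and seasonality.
    resid = y - trend - seasonal
    return pd.DataFrame({"observed": y, "trend": trend,
                         "seasonal": seasonal, "resid": resid})


# Toy series (assumed for illustration): linear trend + exact 12-month pattern.
index = pd.date_range("2020-01-01", periods=48, freq="MS")
pattern = 7 * np.sin(2 * np.pi * np.arange(12) / 12)
y = pd.Series(1.1 * np.arange(48) + np.tile(pattern, 4), index=index)

parts = classical_additive_decompose(y, period=12)
print(parts.dropna().head())
```

Because the moving average needs half a period of data on each side, the first and last 6 months of the trend (and therefore of the residual) are NaN, which is the same edge effect visible in the library output later in this post.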

As example data, let's build a time series with an additive model: 'linear trend + 4-year cycle + 1-year seasonality + irregular noise'.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

dates = pd.date_range('2020-01-01', periods=48, freq='M') # month-end frequency; pandas >= 2.2 spells this alias 'ME'
dates
[Out]:
DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30',
               '2020-05-31', '2020-06-30', '2020-07-31', '2020-08-31',
               '2020-09-30', '2020-10-31', '2020-11-30', '2020-12-31',
               '2021-01-31', '2021-02-28', '2021-03-31', '2021-04-30',
               '2021-05-31', '2021-06-30', '2021-07-31', '2021-08-31',
               '2021-09-30', '2021-10-31', '2021-11-30', '2021-12-31',
               '2022-01-31', '2022-02-28', '2022-03-31', '2022-04-30',
               '2022-05-31', '2022-06-30', '2022-07-31', '2022-08-31',
               '2022-09-30', '2022-10-31', '2022-11-30', '2022-12-31',
               '2023-01-31', '2023-02-28', '2023-03-31', '2023-04-30',
               '2023-05-31', '2023-06-30', '2023-07-31', '2023-08-31',
               '2023-09-30', '2023-10-31', '2023-11-30', '2023-12-31'],
              dtype='datetime64[ns]', freq='M')

# additive model: trend + cycle + seasonality + irregular factor

timestamp = np.arange(len(dates))
trend_factor = timestamp*1.1
cycle_factor = 10*np.sin(np.linspace(0, 3.14*2, 48))
seasonal_factor = 7*np.sin(np.linspace(0, 3.14*8, 48))
np.random.seed(2004)
irregular_factor = 2*np.random.randn(len(dates))

df = pd.DataFrame({'timeseries': trend_factor + cycle_factor + seasonal_factor + irregular_factor,
                   'trend': trend_factor,
                   'cycle': cycle_factor,
                   'trend_cycle': trend_factor + cycle_factor,
                   'seasonal': seasonal_factor,
                   'irregular': irregular_factor},
                  index=dates)

df
[Out]:

	timeseries	trend	cycle	trend_cycle	seasonal	irregular
2020-01-31	2.596119	0.0	0.000000	0.000000	0.000000	2.596119
2020-02-29	6.746160	1.1	1.332198	2.432198	3.565684	0.748278
2020-03-31	8.112100	2.2	2.640647	4.840647	6.136825	-2.865371
2020-04-30	8.255941	3.3	3.902021	7.202021	6.996279	-5.942358
2020-05-31	16.889655	4.4	5.093834	9.493834	5.904327	1.491495
2020-06-30	16.182357	5.5	6.194839	11.694839	3.165536	1.321981
2020-07-31	14.128087	6.6	7.185409	13.785409	-0.456187	0.798865
2020-08-31	11.943313	7.7	8.047886	15.747886	-3.950671	0.146099
2020-09-30	9.728095	8.8	8.766892	17.566892	-6.343231	-1.495567
2020-10-31	12.483489	9.9	9.329612	19.229612	-6.966533	0.220411
2020-11-30	12.141808	11.0	9.726013	20.726013	-5.646726	-2.937480
2020-12-31	15.143334	12.1	9.949029	22.049029	-2.751930	-4.153764
2021-01-31	21.774516	13.2	9.994684	23.194684	0.910435	-2.330604
2021-02-28	28.432892	14.3	9.862164	24.162164	4.318862	-0.048134
2021-03-31	32.350583	15.4	9.553832	24.953832	6.522669	0.874082
2021-04-30	30.596556	16.5	9.075184	25.575184	6.907169	-1.885797
2021-05-31	32.510523	17.6	8.434753	26.034753	5.365118	1.110653
2021-06-30	30.425519	18.7	7.643955	26.343955	2.326624	1.754939
2021-07-31	24.300958	19.8	6.716890	26.516890	-1.360813	-0.855119
2021-08-31	20.450917	20.9	5.670082	26.570082	-4.668691	-1.450475
2021-09-30	18.870881	22.0	4.522195	26.522195	-6.674375	-0.976939
2021-10-31	21.326310	23.1	3.293690	26.393690	-6.818438	1.751059
2021-11-30	22.902448	24.2	2.006469	26.206469	-5.060699	1.756678
2021-12-31	26.620578	25.3	0.683478	25.983478	-1.891426	2.528526
2022-01-31	27.626499	26.4	-0.651696	25.748304	1.805404	0.072791
2022-02-28	31.858923	27.5	-1.975253	25.524747	4.998670	1.335506
2022-03-31	35.930469	28.6	-3.263598	25.336402	6.797704	3.796363
2022-04-30	30.177870	29.7	-4.493762	25.206238	6.700718	-1.729087
2022-05-31	30.016165	30.8	-5.643816	25.156184	4.734764	0.125217
2022-06-30	26.591729	31.9	-6.693258	25.206742	1.448187	-0.063200
2022-07-31	21.118481	33.0	-7.623379	25.376621	-2.242320	-2.015820
2022-08-31	16.636031	34.1	-8.417599	25.682401	-5.307397	-3.738973
2022-09-30	17.682613	35.2	-9.061759	26.138241	-6.892132	-1.563496
2022-10-31	21.163298	36.3	-9.544375	26.755625	-6.554509	0.962182
2022-11-30	22.455672	37.4	-9.856844	27.543156	-4.388699	-0.698786
2022-12-31	26.919529	38.5	-9.993595	28.506405	-0.998790	-0.588086
2023-01-31	33.964623	39.6	-9.952191	29.647809	2.669702	1.647112
2023-02-28	37.459776	40.7	-9.733369	30.966631	5.593559	0.899586
2023-03-31	40.793766	41.8	-9.341031	32.458969	6.957257	1.377540
2023-04-30	43.838415	42.9	-8.782171	34.117829	6.380433	3.340153
2023-05-31	41.301780	44.0	-8.066751	35.933249	4.023975	1.344556
2023-06-30	39.217866	45.1	-7.207526	37.892474	0.545147	0.780245
2023-07-31	35.125502	46.2	-6.219813	39.980187	-3.085734	-1.768952
2023-08-31	33.841926	47.3	-5.121219	42.178781	-5.855940	-2.480916
2023-09-30	38.770511	48.4	-3.931329	44.468671	-6.992803	1.294643
2023-10-31	37.371216	49.5	-2.671356	46.828644	-6.179230	-3.278198
2023-11-30	46.587633	50.6	-1.363760	49.236240	-3.642142	0.993536
2023-12-31	46.403326	51.7	-0.031853	51.668147	-0.089186	-5.175634

(1) Time series decomposition using Python

Let's use Python's statsmodels library to decompose the series under an additive model.

from statsmodels.tsa.seasonal import seasonal_decompose

ts = df.timeseries
result = seasonal_decompose(ts, model='additive')

plt.rcParams['figure.figsize'] = [12, 8]
result.plot()
plt.show()

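Next, the original components (trend + cycle, seasonality, irregular factor) are overlaid on the trend (& cycle), seasonality, and residual recovered by the decomposition: the true trend against the decomposed trend, the true seasonal factor against the decomposed seasonality, and the true irregular factor against the decomposed residual.

The decomposition comes out roughly similar to the original components and looks quite plausible.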

# ground truth & time series decomposition all together
# -- observed data
plt.figure(figsize=(12, 12))
plt.subplot(4,1, 1)
result.observed.plot()
plt.grid(True)
plt.ylabel('Observed', fontsize=14)

# -- trend & cycle factor
plt.subplot(4, 1, 2)
result.trend.plot() # from timeseries decomposition
df.trend_cycle.plot() # ground truth
plt.grid(True)
plt.ylabel('Trend', fontsize=14)

# -- seasonal factor
plt.subplot(4, 1, 3)
result.seasonal.plot() # from timeseries decomposition
df.seasonal.plot() # ground truth
plt.grid(True)
plt.ylabel('Seasonality', fontsize=14)

# -- irregular factor (noise)
plt.subplot(4, 1, 4)
result.resid.plot() # from timeseries decomposition
df.irregular.plot() # ground truth
plt.grid(True)
plt.ylabel('Residual', fontsize=14)

plt.show()

The observed values, trend, seasonality, and residual can be read from the decomposition result object through its observed, trend, seasonal, and resid attributes, as shown below.

Observed

print(result.observed)
[Out]:
2020-01-31 2.596119
2020-02-29 6.746160
2020-03-31 8.112100
2020-04-30 8.255941
2020-05-31 16.889655
2020-06-30 16.182357
2020-07-31 14.128087
2020-08-31 11.943313
2020-09-30 9.728095
2020-10-31 12.483489
2020-11-30 12.141808
2020-12-31 15.143334
2021-01-31 21.774516
2021-02-28 28.432892
2021-03-31 32.350583
2021-04-30 30.596556
2021-05-31 32.510523
2021-06-30 30.425519
2021-07-31 24.300958
2021-08-31 20.450917
2021-09-30 18.870881
2021-10-31 21.326310
2021-11-30 22.902448
2021-12-31 26.620578
2022-01-31 27.626499
2022-02-28 31.858923
2022-03-31 35.930469
2022-04-30 30.177870
2022-05-31 30.016165
2022-06-30 26.591729
2022-07-31 21.118481
2022-08-31 16.636031
2022-09-30 17.682613
2022-10-31 21.163298
2022-11-30 22.455672
2022-12-31 26.919529
2023-01-31 33.964623
2023-02-28 37.459776
2023-03-31 40.793766
2023-04-30 43.838415
2023-05-31 41.301780
2023-06-30 39.217866
2023-07-31 35.125502
2023-08-31 33.841926
2023-09-30 38.770511
2023-10-31 37.371216
2023-11-30 46.587633
2023-12-31 46.403326
Freq: M, Name: timeseries, dtype: float64

Trend (& Cycle)

print(result.trend)
[Out]:
2020-01-31 NaN
2020-02-29 NaN
2020-03-31 NaN
2020-04-30 NaN
2020-05-31 NaN
2020-06-30 NaN
2020-07-31 11.994971
2020-08-31 13.697685
2020-09-30 15.611236
2020-10-31 17.552031
2020-11-30 19.133760
2020-12-31 20.378094
2021-01-31 21.395429
2021-02-28 22.173782
2021-03-31 22.909215
2021-04-30 23.658616
2021-05-31 24.475426
2021-06-30 25.402005
2021-07-31 26.124056
2021-08-31 26.510640
2021-09-30 26.802553
2021-10-31 26.934270
2021-11-30 26.812893
2021-12-31 26.549220
2022-01-31 26.256876
2022-02-28 25.965319
2022-03-31 25.756854
2022-04-30 25.700551
2022-05-31 25.675143
2022-06-30 25.668984
2022-07-31 25.945528
2022-08-31 26.442986
2022-09-30 26.878992
2022-10-31 27.650819
2022-11-30 28.690242
2022-12-31 29.686565
2023-01-31 30.796280
2023-02-28 32.096818
2023-03-31 33.692393
2023-04-30 35.246385
2023-05-31 36.927214
2023-06-30 38.744537
2023-07-31 NaN
2023-08-31 NaN
2023-09-30 NaN
2023-10-31 NaN
2023-11-30 NaN
2023-12-31 NaN
Freq: M, Name: timeseries, dtype: float64

Seasonality and Residual (Noise)

print(result.seasonal)
[Out]:
2020-01-31 1.501630
2020-02-29 5.701170
2020-03-31 8.768065
2020-04-30 6.531709
2020-05-31 5.446174
2020-06-30 2.002476
2020-07-31 -1.643064
2020-08-31 -6.011071
2020-09-30 -7.807785
2020-10-31 -5.858728
2020-11-30 -5.849710
2020-12-31 -2.780867
2021-01-31 1.501630
2021-02-28 5.701170
2021-03-31 8.768065
2021-04-30 6.531709
2021-05-31 5.446174
2021-06-30 2.002476
2021-07-31 -1.643064
2021-08-31 -6.011071
2021-09-30 -7.807785
2021-10-31 -5.858728
2021-11-30 -5.849710
2021-12-31 -2.780867
2022-01-31 1.501630
2022-02-28 5.701170
2022-03-31 8.768065
2022-04-30 6.531709
2022-05-31 5.446174
2022-06-30 2.002476
2022-07-31 -1.643064
2022-08-31 -6.011071
2022-09-30 -7.807785
2022-10-31 -5.858728
2022-11-30 -5.849710
2022-12-31 -2.780867
2023-01-31 1.501630
2023-02-28 5.701170
2023-03-31 8.768065
2023-04-30 6.531709
2023-05-31 5.446174
2023-06-30 2.002476
2023-07-31 -1.643064
2023-08-31 -6.011071
2023-09-30 -7.807785
2023-10-31 -5.858728
2023-11-30 -5.849710
2023-12-31 -2.780867
Freq: M, Name: timeseries, dtype: float64

print(result.resid)
[Out]:
2020-01-31 NaN
2020-02-29 NaN
2020-03-31 NaN
2020-04-30 NaN
2020-05-31 NaN
2020-06-30 NaN
2020-07-31 3.776179
2020-08-31 4.256699
2020-09-30 1.924644
2020-10-31 0.790186
2020-11-30 -1.142242
2020-12-31 -2.453893
2021-01-31 -1.122544
2021-02-28 0.557940
2021-03-31 0.673303
2021-04-30 0.406231
2021-05-31 2.588922
2021-06-30 3.021039
2021-07-31 -0.180034
2021-08-31 -0.048653
2021-09-30 -0.123887
2021-10-31 0.250769
2021-11-30 1.939265
2021-12-31 2.852225
2022-01-31 -0.132007
2022-02-28 0.192434
2022-03-31 1.405550
2022-04-30 -2.054390
2022-05-31 -1.105152
2022-06-30 -1.079730
2022-07-31 -3.183983
2022-08-31 -3.795884
2022-09-30 -1.388594
2022-10-31 -0.628793
2022-11-30 -0.384861
2022-12-31 0.013830
2023-01-31 1.666713
2023-02-28 -0.338212
2023-03-31 -1.666692
2023-04-30 2.060321
2023-05-31 -1.071608
2023-06-30 -1.529146
2023-07-31 NaN
2023-08-31 NaN
2023-09-30 NaN
2023-10-31 NaN
2023-11-30 NaN
2023-12-31 NaN
Freq: M, Name: timeseries, dtype: float64
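As a quick sanity check on the additive model, the printed values for any month with a non-NaN trend should satisfy observed = trend + seasonal + resid. The sketch below copies three rows from the outputs above; the dictionary layout and variable names are only illustrative.

```python
# Values copied from the printed outputs above: (observed, trend, seasonal, resid)
rows = {
    "2020-07-31": (14.128087, 11.994971, -1.643064, 3.776179),
    "2021-01-31": (21.774516, 21.395429, 1.501630, -1.122544),
    "2023-06-30": (39.217866, 38.744537, 2.002476, -1.529146),
}

gaps = {}
for date, (observed, trend, seasonal, resid) in rows.items():
    # Difference between the observed value and the sum of its three parts
    gaps[date] = observed - (trend + seasonal + resid)
    print(date, round(gaps[date], 6))
```

The gaps are on the order of 1e-6, which is just rounding in the six-decimal printout.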

# export to csv file
df.to_csv('ts_components.txt', sep=',', index=False)

(2) Time series decomposition using R

Let's export the time series built above in Python with the additive model to a text file, read it into R, and decompose it there.


# read time series text file
df <- read.table('ts_components.txt', sep=',', header=T)
head(df)

A data.frame: 6 × 6
timeseries	trend	cycle	trend_cycle	seasonal	irregular
<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
2.596119	0.0	0.000000	0.000000	0.000000	2.5961193
6.746160	1.1	1.332198	2.432198	3.565684	0.7482781
8.112100	2.2	2.640647	4.840647	6.136825	-2.8653712
8.255941	3.3	3.902021	7.202021	6.996279	-5.9423582
16.889655	4.4	5.093834	9.493834	5.904327	1.4914946
16.182357	5.5	6.194839	11.694839	3.165536	1.3219812

From the df data frame read this way, take only the 'timeseries' column to decompose and convert it to R's time series type with ts(), setting frequency = 12 because a year has 12 months.

Then decompose it with decompose(). Since we are applying an additive model, call decompose(ts, "additive").

Printing ts_decompose, the list that holds the decomposition results, shows the original values $x, the seasonal factor $seasonal, the trend (& cycle) factor $trend, and the irregular factor $random, stored in that order.

# transforming data to time series with 12 months frequency
ts <- ts(df$timeseries, frequency = 12) # 12 months

# time series decomposition
ts_decompose <- decompose(ts, "additive") # additive model

# decomposition results
ts_decompose
[Out]:
$x
  Jan Feb Mar Apr May Jun Jul
1 2.596119 6.746160 8.112100 8.255941 16.889655 16.182357 14.128087
2 21.774516 28.432892 32.350583 30.596556 32.510523 30.425519 24.300958
3 27.626499 31.858923 35.930469 30.177870 30.016165 26.591729 21.118481
4 33.964623 37.459776 40.793766 43.838415 41.301780 39.217866 35.125502

Aug Sep Oct Nov Dec
1 11.943313 9.728095 12.483489 12.141808 15.143334
2 20.450917 18.870881 21.326310 22.902448 26.620578
3 16.636031 17.682613 21.163298 22.455672 26.919529
4 33.841926 38.770511 37.371216 46.587633 46.403326

$seasonal
  Jan Feb Mar Apr May Jun Jul
1 1.501630 5.701170 8.768065 6.531709 5.446174 2.002476 -1.643064
2 1.501630 5.701170 8.768065 6.531709 5.446174 2.002476 -1.643064
3 1.501630 5.701170 8.768065 6.531709 5.446174 2.002476 -1.643064
4 1.501630 5.701170 8.768065 6.531709 5.446174 2.002476 -1.643064

Aug Sep Oct Nov Dec
1 -6.011071 -7.807785 -5.858728 -5.849710 -2.780867
2 -6.011071 -7.807785 -5.858728 -5.849710 -2.780867
3 -6.011071 -7.807785 -5.858728 -5.849710 -2.780867
4 -6.011071 -7.807785 -5.858728 -5.849710 -2.780867

$trend
  Jan Feb Mar Apr May Jun Jul Aug
1 NA NA NA NA NA NA 11.99497 13.69769
2 21.39543 22.17378 22.90922 23.65862 24.47543 25.40200 26.12406 26.51064
3 26.25688 25.96532 25.75685 25.70055 25.67514 25.66898 25.94553 26.44299
4 30.79628 32.09682 33.69239 35.24639 36.92721 38.74454 NA NA

Sep Oct Nov Dec
1 15.61124 17.55203 19.13376 20.37809
2 26.80255 26.93427 26.81289 26.54922
3 26.87899 27.65082 28.69024 29.68657
4 NA NA NA NA

$random
  Jan Feb Mar Apr May Jun
1 NA NA NA NA NA NA
2 -1.12254398 0.55793990 0.67330316 0.40623129 2.58892205 3.02103851
3 -0.13200705 0.19243403 1.40555039 -2.05439035 -1.10515213 -1.07973023
4 1.66671294 -0.33821202 -1.66669164 2.06032097 -1.07160802 -1.52914637

Jul Aug Sep Oct Nov Dec
1 3.77617915 4.25669908 1.92464365 0.79018590 -1.14224232 -2.45389325
2 -0.18003389 -0.04865275 -0.12388741 0.25076875 1.93926482 2.85222467
3 -3.18398336 -3.79588443 -1.38859434 -0.62879275 -0.38486059 0.01383049
4 NA NA NA NA NA NA

$figure
[1] 1.501630 5.701170 8.768065 6.531709 5.446174 2.002476 -1.643064
[8] -6.011071 -7.807785 -5.858728 -5.849710 -2.780867

$type
[1] "additive"

attr(,"class")
[1] "decomposed.ts"

The output above is a wall of numbers and hard to take in at a glance. So let's visualize the original values (observed) together with the decomposed trend, seasonal, and random components using the plot() function.

# change plot in jupyter
library(repr)

# Change plot size to 12 x 10
options(repr.plot.width=12, repr.plot.height=10)

plot(ts_decompose)

The trend (ts_decompose$trend), seasonal (ts_decompose$seasonal), and random (ts_decompose$random) components can also be plotted one by one.

# change the plot size
options(repr.plot.width=12, repr.plot.height=5)

# Trend
plot(as.ts(ts_decompose$trend))

# Seasonality
plot(as.ts(ts_decompose$seasonal))

# Random (Irregular factor)
plot(as.ts(ts_decompose$random))

I hope this was helpful.


License: Attribution, NonCommercial, NoDerivatives


Posted by Rfriend

[Python Time Series Analysis] Time series component factors: trend, cycle, seasonal, and irregular factors

Python Analysis and Programming/Python Statistical Analysis 2020. 1. 1. 22:53

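Over several previous posts I introduced the functions, methods, and attributes for handling time series data in the Python pandas library.

This post introduces the four component factors of a time series: the trend, cycle, seasonal, and irregular factors.

Depending on how the components are combined, time series models fall into two types: (1) the additive model, which assumes the components are independent of one another and adds them, and (2) the multiplicative model, which assumes the components are not independent but interact, and multiplies them.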

Time series additive model
= trend factor + cycle factor + seasonal factor + irregular/random factor
: Yt = Tt + Ct + St + It

Time series multiplicative model
= trend factor × cycle factor × seasonal factor × irregular factor
: Yt = Tt × Ct × St × It
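The practical difference between the two models can be sketched with a small synthetic example. Everything below is an assumption chosen for illustration (the trend slope, the seasonal amplitudes, and leaving out the cycle and irregular factors to keep it short): in the additive series the seasonal swing keeps a fixed size, while in the multiplicative series it grows with the level of the trend. Taking logarithms turns the multiplicative model back into an additive one.

```python
import numpy as np

t = np.arange(1, 49)                                  # 48 monthly time steps
trend = 10 + 1.1 * t                                  # assumed linear trend (kept positive)
seasonal_add = 7 * np.sin(2 * np.pi * t / 12)         # additive: swing in absolute units
seasonal_mul = 1 + 0.3 * np.sin(2 * np.pi * t / 12)   # multiplicative: swing as a ratio

y_add = trend + seasonal_add
y_mul = trend * seasonal_mul

# Seasonal swing (max - min of the detrended series) in the first and last year
swing_add = (np.ptp((y_add - trend)[:12]), np.ptp((y_add - trend)[-12:]))
swing_mul = (np.ptp((y_mul - trend)[:12]), np.ptp((y_mul - trend)[-12:]))
print("additive swing (first year, last year):", swing_add)
print("multiplicative swing (first year, last year):", swing_mul)

# log(T * S) = log(T) + log(S): the multiplicative model is additive on the log scale
log_is_additive = np.allclose(np.log(y_mul), np.log(trend) + np.log(seasonal_mul))
print("log scale is additive:", log_is_additive)
```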

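This post walks through the easier-to-understand additive model using synthetic example data.

[ Time Series Component Factors ]

The four component factors of a time series, the trend, cycle, seasonal, and irregular factors, are explained in turn below.

(* Reference: '한국의 경기순환 분석' (An Analysis of Business Cycles in Korea), 김혜원, 2004, http://kostat.go.kr/attach/journal/9-1-4.PDF)

(1) The trend factor is a long-term component driven by things such as changes in population, resources, capital goods, and technology. It tends to persist unless there is a sudden shock. Something like "the trend of global economic fluctuations on a 10-year cycle" is an example of a trend factor.

(2) The cycle factor is a medium-term component that recurs, but at irregular intervals, such as the expansion and contraction of economic activity. Stock investors who say they take the "construction / semiconductor / shipbuilding industry cycle" into account are a good example.
If the observed dataset covers less than 10 years, it is very hard to separate the trend factor from the cycle factor. For short observation periods, the two are therefore often not separated but lumped together and analyzed as a single trend factor.

(3) The seasonal factor is a variation that repeats with a 12-month (1-year) period, caused by things like the change of seasons, recurring public holidays, and the annual Chuseok holiday.

(4) The irregular factor (random factor, noise) is variation with no recognizable regularity. It collectively covers fluctuations caused by unpredictable chance events such as natural disasters, wars, and epidemics. Irregular variation often has only a minor effect on economic activity, but at times it has a major one.

To make the explanation above concrete, let's use Python to build a synthetic additive time series that sums the trend, cycle, seasonal, and irregular factors (Yt = Tt + Ct + St + It).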

import numpy as np
import pandas as pd

# DatetimeIndex
dates = pd.date_range('2020-01-01', periods=48, freq='M') # month-end frequency; pandas >= 2.2 spells this alias 'ME'
dates

[Out]:

DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30',
               '2020-05-31', '2020-06-30', '2020-07-31', '2020-08-31',
               '2020-09-30', '2020-10-31', '2020-11-30', '2020-12-31',
               '2021-01-31', '2021-02-28', '2021-03-31', '2021-04-30',
               '2021-05-31', '2021-06-30', '2021-07-31', '2021-08-31',
               '2021-09-30', '2021-10-31', '2021-11-30', '2021-12-31',
               '2022-01-31', '2022-02-28', '2022-03-31', '2022-04-30',
               '2022-05-31', '2022-06-30', '2022-07-31', '2022-08-31',
               '2022-09-30', '2022-10-31', '2022-11-30', '2022-12-31',
               '2023-01-31', '2023-02-28', '2023-03-31', '2023-04-30',
               '2023-05-31', '2023-06-30', '2023-07-31', '2023-08-31',
               '2023-09-30', '2023-10-31', '2023-11-30', '2023-12-31'],
              dtype='datetime64[ns]', freq='M')

# additive model: trend + cycle + seasonality + irregular factor
timestamp = np.arange(len(dates))
trend_factor = timestamp*1.1
cycle_factor = 10*np.sin(np.linspace(0, 3.14*2, 48))
seasonal_factor = 7*np.sin(np.linspace(0, 3.14*8, 48))
np.random.seed(2004)
irregular_factor = 2*np.random.randn(len(dates))

df = pd.DataFrame({'timeseries': trend_factor + cycle_factor + seasonal_factor + irregular_factor,
                   'trend': trend_factor,
                   'cycle': cycle_factor,
                   'seasonal': seasonal_factor,
                   'irregular': irregular_factor},
                  index=dates)

df

[Out]:

	timeseries	trend	cycle	seasonal	irregular
2020-01-31	2.596119	0.0	0.000000	0.000000	2.596119
2020-02-29	6.746160	1.1	1.332198	3.565684	0.748278
2020-03-31	8.112100	2.2	2.640647	6.136825	-2.865371
2020-04-30	8.255941	3.3	3.902021	6.996279	-5.942358
2020-05-31	16.889655	4.4	5.093834	5.904327	1.491495
2020-06-30	16.182357	5.5	6.194839	3.165536	1.321981
2020-07-31	14.128087	6.6	7.185409	-0.456187	0.798865
2020-08-31	11.943313	7.7	8.047886	-3.950671	0.146099
2020-09-30	9.728095	8.8	8.766892	-6.343231	-1.495567
2020-10-31	12.483489	9.9	9.329612	-6.966533	0.220411
2020-11-30	12.141808	11.0	9.726013	-5.646726	-2.937480
2020-12-31	15.143334	12.1	9.949029	-2.751930	-4.153764
2021-01-31	21.774516	13.2	9.994684	0.910435	-2.330604
2021-02-28	28.432892	14.3	9.862164	4.318862	-0.048134
2021-03-31	32.350583	15.4	9.553832	6.522669	0.874082
2021-04-30	30.596556	16.5	9.075184	6.907169	-1.885797
2021-05-31	32.510523	17.6	8.434753	5.365118	1.110653
2021-06-30	30.425519	18.7	7.643955	2.326624	1.754939
2021-07-31	24.300958	19.8	6.716890	-1.360813	-0.855119
2021-08-31	20.450917	20.9	5.670082	-4.668691	-1.450475
2021-09-30	18.870881	22.0	4.522195	-6.674375	-0.976939
2021-10-31	21.326310	23.1	3.293690	-6.818438	1.751059
2021-11-30	22.902448	24.2	2.006469	-5.060699	1.756678
2021-12-31	26.620578	25.3	0.683478	-1.891426	2.528526
2022-01-31	27.626499	26.4	-0.651696	1.805404	0.072791
2022-02-28	31.858923	27.5	-1.975253	4.998670	1.335506
2022-03-31	35.930469	28.6	-3.263598	6.797704	3.796363
2022-04-30	30.177870	29.7	-4.493762	6.700718	-1.729087
2022-05-31	30.016165	30.8	-5.643816	4.734764	0.125217
2022-06-30	26.591729	31.9	-6.693258	1.448187	-0.063200
2022-07-31	21.118481	33.0	-7.623379	-2.242320	-2.015820
2022-08-31	16.636031	34.1	-8.417599	-5.307397	-3.738973
2022-09-30	17.682613	35.2	-9.061759	-6.892132	-1.563496
2022-10-31	21.163298	36.3	-9.544375	-6.554509	0.962182
2022-11-30	22.455672	37.4	-9.856844	-4.388699	-0.698786
2022-12-31	26.919529	38.5	-9.993595	-0.998790	-0.588086
2023-01-31	33.964623	39.6	-9.952191	2.669702	1.647112
2023-02-28	37.459776	40.7	-9.733369	5.593559	0.899586
2023-03-31	40.793766	41.8	-9.341031	6.957257	1.377540
2023-04-30	43.838415	42.9	-8.782171	6.380433	3.340153
2023-05-31	41.301780	44.0	-8.066751	4.023975	1.344556
2023-06-30	39.217866	45.1	-7.207526	0.545147	0.780245
2023-07-31	35.125502	46.2	-6.219813	-3.085734	-1.768952
2023-08-31	33.841926	47.3	-5.121219	-5.855940	-2.480916
2023-09-30	38.770511	48.4	-3.931329	-6.992803	1.294643
2023-10-31	37.371216	49.5	-2.671356	-6.179230	-3.278198
2023-11-30	46.587633	50.6	-1.363760	-3.642142	0.993536
2023-12-31	46.403326	51.7	-0.031853	-0.089186	-5.175634
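One detail worth noticing in the table: the cycle and seasonal columns do not return exactly to 0 in the last row (-0.031853 and -0.089186). That follows from the code using 3.14 as an approximation of π, so each sine wave stops slightly short of a whole number of periods. A small check, with `np.pi` shown only for comparison (it is not what the code above uses):

```python
import numpy as np

# Last points of the cycle and seasonal factors as generated above (3.14 for pi)
cycle_last = 10 * np.sin(np.linspace(0, 3.14 * 2, 48))[-1]
seasonal_last = 7 * np.sin(np.linspace(0, 3.14 * 8, 48))[-1]

# Same expressions with np.pi, for comparison only
cycle_last_pi = 10 * np.sin(np.linspace(0, np.pi * 2, 48))[-1]
seasonal_last_pi = 7 * np.sin(np.linspace(0, np.pi * 8, 48))[-1]

print(round(cycle_last, 6), round(seasonal_last, 6))
print(round(cycle_last_pi, 6), round(seasonal_last_pi, 6))
```

The gap is tiny next to the noise term, so it does not change anything in the discussion that follows.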

The additive time series built above by summing the trend, cycle, seasonal, and irregular (noise) factors is plotted below.

In the time series plot you can see a positive linear trend, 1-year seasonality, and irregular noise. (With only 4 years of data, the trend and the cycle are not easy to tell apart.)

# Time series plot
import matplotlib.pyplot as plt

plt.figure(figsize=[10, 6])
df.timeseries.plot()
plt.title('Time Series (Additive Model)', fontsize=16)
plt.ylim(-12, 55)
plt.show()

(1) The linear trend factor used to build the synthetic additive time series above looks like this.

# -- Trend factor
#timestamp = np.arange(len(dates))
#trend_factor = timestamp*1.1

plt.figure(figsize=[10, 6])
df.trend.plot()
plt.title('Trend Factor', fontsize=16)
plt.ylim(-12, 55)
plt.show()

(2) The 4-year cycle factor used to build the synthetic additive time series above looks like this.

# -- Cycle factor
#cycle_factor = 10*np.sin(np.linspace(0, 3.14*2, 48))

plt.figure(figsize=[10, 6])
df.cycle.plot()
plt.title('Cycle Factor', fontsize=16)
plt.ylim(-12, 55)
plt.show()

(3) The 1-year seasonal factor used to build the synthetic additive time series above looks like this.

# -- Seasonal factor
#seasonal_factor = 7*np.sin(np.linspace(0, 3.14*8, 48))

plt.figure(figsize=[10, 6])
df.seasonal.plot()
plt.title('Seasonal Factor', fontsize=16)
plt.ylim(-12, 55)
plt.show()

(4) The irregular factor used to build the synthetic additive time series above looks like this.

# -- Irregular/ Random factor
#np.random.seed(2004)
#irregular_factor = 2*np.random.randn(len(dates))

plt.figure(figsize=[10, 6])
df.irregular.plot()
plt.title('Irregular Factor', fontsize=16)
plt.ylim(-12, 55)
plt.show()

Plotting the trend, cycle, seasonal, and irregular factors together with their sum, the time series itself, in a single graph gives the following.

# All in one: Time series = Trend factor + Cycle factor + Seasonal factor + Irregular factor
plt.rcParams['figure.figsize'] = (12, 8)
df.plot()
plt.ylim(-12, 55)
plt.show()

The next post decomposes the synthetic time series built above into its components (time series decomposition) using Python and R. (https://rfriend.tistory.com/509)

I hope this was helpful.


[Python pandas] Filling missing values (fill na) and linear interpolation after upsampling

Python Analysis and Programming/Python Data Preprocessing 2019. 12. 31. 13:49

The previous post covered downsampling with pandas resampling: which side of each interval, left or right, to include (inclusive, closed) and which side to use for the label (https://rfriend.tistory.com/507).

This post introduces two ways to handle the missing values that appear when upsampling converts a time series to a higher frequency with pandas resampling.

(1) Filling the missing values created by upsampling (filling forward/backward)

(2) Linearly interpolating the missing values created by upsampling (linear interpolation)

As example data, let's build a simple time series DataFrame with two columns and a frequency of 5 seconds.

import pandas as pd
import numpy as np

rng = pd.date_range('2019-12-31', periods=3, freq='5S') # pandas >= 2.2 spells this alias '5s'
rng

[Out]:
DatetimeIndex(['2019-12-31 00:00:00', '2019-12-31 00:00:05',
               '2019-12-31 00:00:10'],
              dtype='datetime64[ns]', freq='5S')

ts = pd.DataFrame(np.array([0, 1, 3, 2, 10, 3]).reshape(3, 2),
                  index=rng,
                  columns=['col_1', 'col_2'])
ts

[Out]:

	col_1	col_2
2019-12-31 00:00:00	0	1
2019-12-31 00:00:05	3	2
2019-12-31 00:00:10	10	3

Now let's use the pandas resample() method to upsample the original 5-second data (freq='5S') to 1-second data (freq='1S'). The newly created date-time rows contain missing values, as shown below.

ts_upsample = ts.resample('S').mean()
ts_upsample

[Out]:

	col_1	col_2
2019-12-31 00:00:00	0.0	1.0
2019-12-31 00:00:01	NaN	NaN
2019-12-31 00:00:02	NaN	NaN
2019-12-31 00:00:03	NaN	NaN
2019-12-31 00:00:04	NaN	NaN
2019-12-31 00:00:05	3.0	2.0
2019-12-31 00:00:06	NaN	NaN
2019-12-31 00:00:07	NaN	NaN
2019-12-31 00:00:08	NaN	NaN
2019-12-31 00:00:09	NaN	NaN
2019-12-31 00:00:10	10.0	3.0
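Taking `.mean()` while upsampling only has an effect on the rows that already exist; every new 1-second slot is empty, so it becomes NaN. As a minimal sketch (an alternative spelling, not the post's method), `.asfreq()` on the resampler gives the same frame without aggregating. The lowercase aliases `'5s'` and `'1s'` are used here because they are the current pandas spelling.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2019-12-31", periods=3, freq="5s")
ts = pd.DataFrame({"col_1": [0, 3, 10], "col_2": [1, 2, 3]}, index=rng)

up_mean = ts.resample("1s").mean()      # as in the post: aggregate, empty bins -> NaN
up_asfreq = ts.resample("1s").asfreq()  # reindex to the new frequency, no aggregation

n_missing = up_asfreq.isna().sum()      # missing values per column after upsampling
print(n_missing)
```

Of the 11 one-second rows, 3 come from the original data and the other 8 are the gaps that the filling and interpolation methods below operate on.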

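Let's handle the missing values created by the upsampling above by (1) filling and (2) linear interpolation.

(1) Filling the missing values created by upsampling (filling missing values)

(1-1) Filling missing values with the preceding value (filling forward)

# (1) filling forward: the three calls below are equivalent
ts_upsample.ffill()
ts_upsample.fillna(method='ffill') # the method= argument of fillna() is deprecated since pandas 2.1
ts_upsample.fillna(method='pad')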

[Out]:

	col_1	col_2
2019-12-31 00:00:00	0.0	1.0
2019-12-31 00:00:01	0.0	1.0
2019-12-31 00:00:02	0.0	1.0
2019-12-31 00:00:03	0.0	1.0
2019-12-31 00:00:04	0.0	1.0
2019-12-31 00:00:05	3.0	2.0
2019-12-31 00:00:06	3.0	2.0
2019-12-31 00:00:07	3.0	2.0
2019-12-31 00:00:08	3.0	2.0
2019-12-31 00:00:09	3.0	2.0
2019-12-31 00:00:10	10.0	3.0

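(1-2) Filling missing values with the following value (filling backward)

# (2) filling backward: the three calls below are equivalent
ts_upsample.bfill()
ts_upsample.fillna(method='bfill') # the method= argument of fillna() is deprecated since pandas 2.1
ts_upsample.fillna(method='backfill')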

[Out]:

	col_1	col_2
2019-12-31 00:00:00	0.0	1.0
2019-12-31 00:00:01	3.0	2.0
2019-12-31 00:00:02	3.0	2.0
2019-12-31 00:00:03	3.0	2.0
2019-12-31 00:00:04	3.0	2.0
2019-12-31 00:00:05	3.0	2.0
2019-12-31 00:00:06	10.0	3.0
2019-12-31 00:00:07	10.0	3.0
2019-12-31 00:00:08	10.0	3.0
2019-12-31 00:00:09	10.0	3.0
2019-12-31 00:00:10	10.0	3.0

(1-3) Filling missing values with a specific value

# (3) fill missing values with 0
ts_upsample.fillna(0)

[Out]:

	col_1	col_2
2019-12-31 00:00:00	0.0	1.0
2019-12-31 00:00:01	0.0	0.0
2019-12-31 00:00:02	0.0	0.0
2019-12-31 00:00:03	0.0	0.0
2019-12-31 00:00:04	0.0	0.0
2019-12-31 00:00:05	3.0	2.0
2019-12-31 00:00:06	0.0	0.0
2019-12-31 00:00:07	0.0	0.0
2019-12-31 00:00:08	0.0	0.0
2019-12-31 00:00:09	0.0	0.0
2019-12-31 00:00:10	10.0	3.0

(1-4) Filling missing values with the mean

# (4) filling with mean value
# mean per column
ts_upsample.mean()

[Out]:

col_1    4.333333
col_2    2.000000
dtype: float64

ts_upsample.fillna(ts_upsample.mean())

[Out]:

	col_1	col_2
2019-12-31 00:00:00	0.000000	1.0
2019-12-31 00:00:01	4.333333	2.0
2019-12-31 00:00:02	4.333333	2.0
2019-12-31 00:00:03	4.333333	2.0
2019-12-31 00:00:04	4.333333	2.0
2019-12-31 00:00:05	3.000000	2.0
2019-12-31 00:00:06	4.333333	2.0
2019-12-31 00:00:07	4.333333	2.0
2019-12-31 00:00:08	4.333333	2.0
2019-12-31 00:00:09	4.333333	2.0
2019-12-31 00:00:10	10.000000	3.0

(1-5) Limiting the number of rows that get filled

# (5) limit the number of consecutive rows filled
ts_upsample.ffill(limit=1)

[Out]:

	col_1	col_2
2019-12-31 00:00:00	0.0	1.0
2019-12-31 00:00:01	0.0	1.0
2019-12-31 00:00:02	NaN	NaN
2019-12-31 00:00:03	NaN	NaN
2019-12-31 00:00:04	NaN	NaN
2019-12-31 00:00:05	3.0	2.0
2019-12-31 00:00:06	3.0	2.0
2019-12-31 00:00:07	NaN	NaN
2019-12-31 00:00:08	NaN	NaN
2019-12-31 00:00:09	NaN	NaN
2019-12-31 00:00:10	10.0	3.0

(2) Linearly interpolating the missing values created by upsampling (linear interpolation)

# (6) Linear interpolation by index values
ts_upsample.interpolate(method='values') # uses the index values; the default is method='linear'

[Out]:

	col_1	col_2
2019-12-31 00:00:00	0.0	1.0
2019-12-31 00:00:01	0.6	1.2
2019-12-31 00:00:02	1.2	1.4
2019-12-31 00:00:03	1.8	1.6
2019-12-31 00:00:04	2.4	1.8
2019-12-31 00:00:05	3.0	2.0
2019-12-31 00:00:06	4.4	2.2
2019-12-31 00:00:07	5.8	2.4
2019-12-31 00:00:08	7.2	2.6
2019-12-31 00:00:09	8.6	2.8
2019-12-31 00:00:10	10.0	3.0

ts_upsample.interpolate(method='values').plot()

# (7) Linear interpolation by time

ts_upsample.interpolate(method='time')

[Out]:

	col_1	col_2
2019-12-31 00:00:00	0.0	1.0
2019-12-31 00:00:01	0.6	1.2
2019-12-31 00:00:02	1.2	1.4
2019-12-31 00:00:03	1.8	1.6
2019-12-31 00:00:04	2.4	1.8
2019-12-31 00:00:05	3.0	2.0
2019-12-31 00:00:06	4.4	2.2
2019-12-31 00:00:07	5.8	2.4
2019-12-31 00:00:08	7.2	2.6
2019-12-31 00:00:09	8.6	2.8
2019-12-31 00:00:10	10.0	3.0
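In the example above the rows are evenly spaced (1 second apart), which is why interpolating by index values and by time give identical numbers. The methods differ once the index is irregular. The sketch below uses a small made-up Series (the timestamps and values are assumptions for illustration) to compare the positional `method='linear'` with the index-aware `method='time'`.

```python
import numpy as np
import pandas as pd

# Irregular index: the gap is 1 second on one side and 9 seconds on the other
idx = pd.to_datetime(["2019-12-31 00:00:00",
                      "2019-12-31 00:00:01",
                      "2019-12-31 00:00:10"])
s = pd.Series([0.0, np.nan, 10.0], index=idx)

by_position = s.interpolate(method="linear")  # ignores the index, treats rows as equally spaced
by_time = s.interpolate(method="time")        # weights by the actual time distance

print(by_position.iloc[1], by_time.iloc[1])
```

Here `method='linear'` puts the missing point halfway between its neighbours, while `method='time'` places it one tenth of the way, matching its position on the time axis.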

I hope this was helpful.


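[Python pandas] Setting which side of the bin is included (closed) and which side labels the bin (label) when aggregating time series data by downsampling

Python Analysis and Programming / Python Data Preprocessing 2019. 12. 30. 17:47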

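In the previous post, I introduced how to create quarterly period date ranges and how to convert between periods and timestamps (https://rfriend.tistory.com/506).

With the resample() method of Python pandas you can do

(a) Downsampling, which aggregates/summarizes time series data of a finer (higher) frequency into a lower frequency (e.g., seconds --> 10 seconds, day --> week, day --> month), and

(b) Upsampling, which converts time series data of a lower frequency into data of a finer frequency (e.g., 10 seconds --> 1 second, week --> day, month --> week, year --> day).

This post shows how to set the following when downsampling with the pandas resample() method (e.g., resampling from a 1-second frequency to a 10-second / 1-minute / 1-hour frequency):

(1) which side of the bin, left or right, is included (closed)

(2) which side of the bin, left or right, is used as the label (label)

There is no rule about whether to use left or right for inclusion and labeling. What matters is (a) being clearly aware of which one you use (especially when several people work together), and (b) using it consistently throughout the product code. (I once split time series preprocessing in SQL between two groups, found out only later that our inclusion and label rules differed, and spent time reconciling them into a single rule. -_-;;;)

As an example, let's create a simple time series pandas Series with 6 data points at a 1-minute frequency.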

import pandas as pd

# generate dates range

dates = pd.date_range('2020-12-31', periods=6, freq='min') # 'T' is an older alias for 'min'

dates

[Out]:
DatetimeIndex(['2020-12-31 00:00:00', '2020-12-31 00:01:00',
               '2020-12-31 00:02:00', '2020-12-31 00:03:00',
               '2020-12-31 00:04:00', '2020-12-31 00:05:00'],
              dtype='datetime64[ns]', freq='T')

# create Series

ts_series = pd.Series(range(len(dates)), index=dates)

ts_series

[Out]:

2020-12-31 00:00:00    0
2020-12-31 00:01:00    1
2020-12-31 00:02:00    2
2020-12-31 00:03:00    3
2020-12-31 00:04:00    4
2020-12-31 00:05:00    5
Freq: T, dtype: int64

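Now let's downsample the time series with a '1-minute frequency' (freq='min') to a '2-minute frequency' (freq='2min' or freq='2T') using the resample() method.

We will compare the downsampling results for the 4 combinations of the inclusion side, (a) closed='left' (default) or (b) closed='right', and the label side, (c) label='left' (default) or (d) label='right'. The aggregation function is sum() throughout.

(1) By default: Downsampling with closed='left', label='left'

When downsampling, one side of each bin is included (inclusive, default: 'left') and the other side is not. The label of each bin after resampling defaults to the left edge (label='left'). (These defaults apply to most frequencies, including the minute frequency used here; month-end, quarter-end, year-end, and weekly frequencies default to 'right'.)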

ts_series = pd.Series(range(len(dates)), index=dates)

ts_series

[Out]:

2020-12-31 00:00:00    0
2020-12-31 00:01:00    1
2020-12-31 00:02:00    2
2020-12-31 00:03:00    3
2020-12-31 00:04:00    4
2020-12-31 00:05:00    5
Freq: T, dtype: int64

# by default, left side of bin interval is closed

# by default, left side of bin interval is labeled

ts_series.resample('2min').sum()

[Out]:

2020-12-31 00:00:00    1
2020-12-31 00:02:00    5
2020-12-31 00:04:00    9
Freq: 2T, dtype: int64

# same result with above

ts_series.resample('2min', closed='left', label='left').sum()

[Out]:

2020-12-31 00:00:00    1
2020-12-31 00:02:00    5
2020-12-31 00:04:00    9
Freq: 2T, dtype: int64

(2) Downsampling with closed='right', label='left'

ts_series = pd.Series(range(len(dates)), index=dates)

ts_series

[Out]:

2020-12-31 00:00:00    0
2020-12-31 00:01:00    1
2020-12-31 00:02:00    2
2020-12-31 00:03:00    3
2020-12-31 00:04:00    4
2020-12-31 00:05:00    5
Freq: T, dtype: int64

# right side of bin interval is closed using closed='right'

ts_series.resample('2min', closed='right', label='left').sum()

[Out]:

2020-12-30 23:58:00    0
2020-12-31 00:00:00    3
2020-12-31 00:02:00    7
2020-12-31 00:04:00    5
Freq: 2T, dtype: int64

(3) Downsampling with closed='left', label='right'

ts_series = pd.Series(range(len(dates)), index=dates)

ts_series

[Out]:

2020-12-31 00:00:00    0
2020-12-31 00:01:00    1
2020-12-31 00:02:00    2
2020-12-31 00:03:00    3
2020-12-31 00:04:00    4
2020-12-31 00:05:00    5
Freq: T, dtype: int64

# right side of bin interval is labeled using label='right'

ts_series.resample('2min', closed='left', label='right').sum()

[Out]:

2020-12-31 00:02:00    1
2020-12-31 00:04:00    5
2020-12-31 00:06:00    9
Freq: 2T, dtype: int64

(4) Downsampling with closed='right', label='right'

The example below is the exact opposite of the default: the right side of each bin is included (closed='right') and the right edge is used as the label (label='right').

ts_series = pd.Series(range(len(dates)), index=dates)

ts_series

[Out]:

2020-12-31 00:00:00    0
2020-12-31 00:01:00    1
2020-12-31 00:02:00    2
2020-12-31 00:03:00    3
2020-12-31 00:04:00    4
2020-12-31 00:05:00    5
Freq: T, dtype: int64

# right side of bin interval is closed using closed='right'

# right side of bin interval is labeled using label='right'

ts_series.resample('2min', closed='right', label='right').sum()

[Out]:

2020-12-31 00:00:00    0
2020-12-31 00:02:00    3
2020-12-31 00:04:00    7
2020-12-31 00:06:00    5
Freq: 2T, dtype: int64

(5) Setting the inclusion side (closed) and label side (label) when downsampling a time series pandas DataFrame

Examples (1), (2), (3), and (4) above used a pandas Series. A time series pandas DataFrame with a DatetimeIndex as its index is downsampled in the same way as a Series, with the same closed and label settings.

import pandas as pd

# generate dates range

dates = pd.date_range('2020-12-31', periods=6, freq='min')

dates

[Out]:

DatetimeIndex(['2020-12-31 00:00:00', '2020-12-31 00:01:00',
               '2020-12-31 00:02:00', '2020-12-31 00:03:00',
               '2020-12-31 00:04:00', '2020-12-31 00:05:00'],
              dtype='datetime64[ns]', freq='T')

# create timeseries DataFrame

ts_df = pd.DataFrame({'val': range(len(dates))}, index=dates)

ts_df

[Out]:

	val
2020-12-31 00:00:00	0
2020-12-31 00:01:00	1
2020-12-31 00:02:00	2
2020-12-31 00:03:00	3
2020-12-31 00:04:00	4
2020-12-31 00:05:00	5

# (a) Downsampling using default setting

ts_df.resample('2min').sum()

[Out]:

	val
2020-12-31 00:00:00	1
2020-12-31 00:02:00	5
2020-12-31 00:04:00	9

# (b) Downsampling using closed='right'

ts_df.resample('2min', closed='right').sum()

[Out]:

	val
2020-12-30 23:58:00	0
2020-12-31 00:00:00	3
2020-12-31 00:02:00	7
2020-12-31 00:04:00	5

# (c) Downsampling using label='right'

ts_df.resample('2min', label='right').sum()

[Out]:

	val
2020-12-31 00:02:00	1
2020-12-31 00:04:00	5
2020-12-31 00:06:00	9

# (d) Downsampling using closed='right', label='right'

ts_df.resample('2min', closed='right', label='right').sum()

[Out]:

	val
2020-12-31 00:00:00	0
2020-12-31 00:02:00	3
2020-12-31 00:04:00	7
2020-12-31 00:06:00	5
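The four closed/label combinations can also be compared side by side in a single loop. This is a minimal sketch that rebuilds the same 6-point Series; the results dictionary is just a convenience introduced here for illustration.

```python
import pandas as pd

dates = pd.date_range('2020-12-31', periods=6, freq='min')
ts_series = pd.Series(range(len(dates)), index=dates)

# Collect the downsampled sums for every (closed, label) pair.
results = {}
for closed in ('left', 'right'):
    for label in ('left', 'right'):
        out = ts_series.resample('2min', closed=closed, label=label).sum()
        results[(closed, label)] = out
        print(f"closed={closed!r}, label={label!r}")
        print(out, end="\n\n")
```

Note that closed changes which observations fall into each bin (and therefore the sums), whereas label only changes the timestamp attached to each bin.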

I hope this was helpful.


[Python pandas] Creating quarterly period date ranges and converting to and from timestamps (Quarterly period frequencies and range, conversion b/w timestamp)

Python Analysis and Programming / Python Data Preprocessing 2019. 12. 30. 12:30

In the previous post, I introduced how to check, set, and convert time zones in Python pandas (https://rfriend.tistory.com/505).

In this post, I will introduce how to do the following in Python pandas:

(1) Creating quarterly period frequencies (quarterly period frequencies)

(2) Creating quarterly period date ranges (quarterly period date-range)

(3) Converting between quarterly periods and timestamps (conversion between quarterly period and timestamp)

(4) Aggregating by quarterly period (quarterly period group by aggregation)

These features are especially useful in finance and accounting, when results are aggregated and analyzed by fiscal-year quarter with pandas.

[ Figure 1. Creating a quarterly period range in pandas (Quarterly Period Range) ]

(1) Creating quarterly period frequencies (quarterly period frequencies)

Let's create the 4th quarter of fiscal year 2020 with the pandas Period() function. As [Figure 1] above shows, fiscal year 2020 with quarters ending in February consists of March~May 2019 (2020-Q1), June~August 2019 (2020-Q2), September~November 2019 (2020-Q3), and December 2019~February 2020 (2020-Q4). (It may look odd that fiscal year 2020 contains March through December of 2019, but pandas names the fiscal year after the calendar year in which it ends. ^^')

import pandas as pd

import numpy as np

p = pd.Period('2020Q4', freq='Q-FEB')

[Out]: Period('2020Q4', 'Q-FEB')

With the pandas asfreq() method you can convert a pandas Period object to the frequency you want (Period frequency). Let's use asfreq() to convert the 2020-Q4 quarterly period above into (a) the starting date and ending date of the quarter, and (b) the starting business date and ending business date of the quarter. (Business days here mean weekdays, Monday through Friday; public holidays are not taken into account.)

(a) converting from Period to Date: 'D'

(b) converting from Period to Business Date: 'B'

# starting date

p.asfreq('D', how='start')

[Out]: Period('2019-12-01', 'D')

# ending date

p.asfreq('D', how='end')

[Out]: Period('2020-02-29', 'D')

# starting business date

p.asfreq('B', how='start')

[Out]: Period('2019-12-02', 'B')

# ending business date

p.asfreq('B', how='end')

[Out]: Period('2020-02-28', 'B')

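By chaining asfreq() calls one after another, you can

(a) select the ending business date of the quarter --> (b) convert it to the starting (how='start') minute frequency (freq='T' or freq='min'),

(c) select the ending business date of the quarter --> (d) convert it to the ending minute, or

(e) select the ending business date of the quarter --> (f) convert it to the ending second.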

# (a) from ending Business date --> (b) to starting Minutes

p.asfreq('B', how='end').asfreq('T', how='start')

[Out]: Period('2020-02-28 00:00', 'T')

# (c) from ending Business date --> (d) to ending Minutes

p.asfreq('B', how='end').asfreq('T', how='end')

[Out]: Period('2020-02-28 23:59', 'T')

# (e) from Business date --> (f) to Seconds

p.asfreq('B', how='end').asfreq('S', how='end')

[Out]: Period('2020-02-28 23:59:59', 'S')
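As a complement to asfreq(), a Period object also exposes its boundaries and fiscal position directly. The sketch below uses the qyear, quarter, start_time, and end_time attributes of pandas Period, which the examples above do not use; treat it as an illustrative cross-check rather than part of the original walkthrough.

```python
import pandas as pd

p = pd.Period('2020Q4', freq='Q-FEB')

# Fiscal year and quarter number encoded in the period.
print(p.qyear, p.quarter)

# First and last instants covered by the quarter, as Timestamps.
print(p.start_time)
print(p.end_time)
```

The start and end timestamps agree with the 'D' conversions shown earlier: the quarter runs from 2019-12-01 through 2020-02-29.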

(2) Creating a quarterly period range (quarterly period range)

Just as the pandas date_range() function creates a DatetimeIndex covering a date-time range, the pandas period_range('start', 'end', freq='Q-[ending-month]') function creates a quarterly period range. (For reference, freq='A-DEC' means a yearly period ending in December, and freq='Q-FEB' means a quarterly period whose fiscal year ends in February.)

The example below creates the quarterly periods with a fiscal year ending in February (freq='Q-FEB') from 2020-Q1 to 2020-Q4, using pd.period_range('2020Q1', '2020Q4').

p_rng = pd.period_range('2020Q1', '2020Q4', freq='Q-FEB')

p_rng

[Out]: PeriodIndex(['2020Q1', '2020Q2', '2020Q3', '2020Q4'], dtype='period[Q-FEB]', freq='Q-FEB')

Let's use the asfreq() method to convert the '2020-Q1' ~ '2020-Q4' periods created above (period with a Quarter ending at February) into their starting business date and ending business date.

# convert period into desired frequency using asfreq() methods

# starting business day per quarter 'Q-FEB'

p_rng.asfreq('B', how='start')

[Out]:

PeriodIndex(['2019-03-01', '2019-06-03', '2019-09-02', '2019-12-02'],

dtype='period[B]', freq='B')

# ending business day per quarter 'Q-FEB'

p_rng.asfreq('B', how='end')

[Out]:

PeriodIndex(['2019-05-31', '2019-08-30', '2019-11-29', '2020-02-28'],

dtype='period[B]', freq='B')

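After converting a Period object to another frequency, arithmetic operations are possible. The example below adds 1 to the ending business date of each of the four quarters ending in February; because the frequency is 'B', this moves each date forward by one business day.

# arithmetic operation: plus one business day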

p_rng.asfreq('B', how='end') + 1

[Out]:

PeriodIndex(['2019-06-03', '2019-09-02', '2019-12-02', '2020-03-02'],

dtype='period[B]', freq='B')

The example below first converts the period object to the ending business date, then converts that to the starting hour frequency, and finally adds 12 hours.

# period ending Business day, starting Hour

p_rng.asfreq('B', how='end').asfreq('H', how='start')

[Out]:

PeriodIndex(['2019-05-31 00:00', '2019-08-30 00:00', '2019-11-29 00:00',
             '2020-02-28 00:00'],
            dtype='period[H]', freq='H')

# plus 12 hours

p_12h_rng = p_rng.asfreq('B', how='end').asfreq('H', how='start') + 12

p_12h_rng

[Out]:

PeriodIndex(['2019-05-31 12:00', '2019-08-30 12:00', '2019-11-29 12:00',
             '2020-02-28 12:00'],
            dtype='period[H]', freq='H')

(3) Converting between quarterly periods and timestamps

(conversion between quarterly period and timestamp)

A date-time DatetimeIndex created with pandas date_range() can be converted to a PeriodIndex with the to_period() method.

import pandas as pd

# generate dates range with 12 Months

ts = pd.date_range('2020-01-01', periods = 12, freq='M')

[Out]:
DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30',
               '2020-05-31', '2020-06-30', '2020-07-31', '2020-08-31',
               '2020-09-30', '2020-10-31', '2020-11-30', '2020-12-31'],
              dtype='datetime64[ns]', freq='M')

# convert from DatetimeIndex to PeriodIndex

p = ts.to_period()

[Out]:
PeriodIndex(['2020-01', '2020-02', '2020-03', '2020-04', '2020-05', '2020-06',
             '2020-07', '2020-08', '2020-09', '2020-10', '2020-11', '2020-12'],
            dtype='period[M]', freq='M')

Conversely, a PeriodIndex can be converted to a DatetimeIndex with the to_timestamp() method.

# convert from PeriodIndex to DatetimeIndex with starting month('M')

p.asfreq('B', how='end').asfreq('M', how='start').to_timestamp()

[Out]:

DatetimeIndex(['2020-01-01', '2020-02-01', '2020-03-01', '2020-04-01',
               '2020-05-01', '2020-06-01', '2020-07-01', '2020-08-01',
               '2020-09-01', '2020-10-01', '2020-11-01', '2020-12-01'],
              dtype='datetime64[ns]', freq='MS')

# convert from PeriodIndex to DatetimeIndex with ending minutes('T')

p.asfreq('B', how='end').asfreq('T', how='end').to_timestamp()

[Out]:

DatetimeIndex(['2020-01-31 23:59:00', '2020-02-28 23:59:00',
               '2020-03-31 23:59:00', '2020-04-30 23:59:00',
               '2020-05-29 23:59:00', '2020-06-30 23:59:00',
               '2020-07-31 23:59:00', '2020-08-31 23:59:00',
               '2020-09-30 23:59:00', '2020-10-30 23:59:00',
               '2020-11-30 23:59:00', '2020-12-31 23:59:00'],
              dtype='datetime64[ns]', freq='BM')
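The round trip between timestamps and periods can be sketched compactly as follows. The three month-end dates are written out explicitly as an assumption for illustration, so the example does not depend on any particular frequency alias.

```python
import pandas as pd

# Month-end timestamps written out explicitly.
ts = pd.to_datetime(['2020-01-31', '2020-02-29', '2020-03-31'])

# Timestamp -> monthly Period: the day of month is dropped.
p = ts.to_period('M')

# Period -> Timestamp: choose which edge of the period to land on.
starts = p.to_timestamp(how='start')
ends = p.to_timestamp(how='end')

print(p)
print(starts)
print(ends)
```

Because a Period stands for a whole span of time, converting back to a timestamp requires picking an edge, which is what the how argument does.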

(4) Aggregating by quarterly period (quarterly period group by aggregation)

Let's convert a simple monthly pandas Series into a Series with a quarterly PeriodIndex, and then aggregate the mean by quarter.

ts = pd.date_range('2020-01-01', periods = 12, freq='M')

ts_series = pd.Series(range(len(ts)), index=ts)

ts_series

[Out]:

2020-01-31     0
2020-02-29     1
2020-03-31     2
2020-04-30     3
2020-05-31     4
2020-06-30     5
2020-07-31     6
2020-08-31     7
2020-09-30     8
2020-10-31     9
2020-11-30    10
2020-12-31    11
Freq: M, dtype: int64

# convert from DatetimeIndex to Quarterly PeriodIndex

ts_series.index = ts.to_period(freq='Q-FEB')

ts_series

[Out]:

2020Q4     0
2020Q4     1
2021Q1     2
2021Q1     3
2021Q1     4
2021Q2     5
2021Q2     6
2021Q2     7
2021Q3     8
2021Q3     9
2021Q3    10
2021Q4    11

Freq: Q-FEB, dtype: int64

# quarterly groupby mean aggregation

ts_series.groupby(ts_series.index).mean()

[Out]:

2020Q4     0.5
2021Q1     3.0
2021Q2     6.0
2021Q3     9.0
2021Q4    11.0
Freq: Q-FEB, dtype: float64

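For reference, below the mean is aggregated by quarter by downsampling with the resample() method. The bins and values are the same as in the groupby() on to_period(freq='Q-FEB') above, but the years in the labels differ (2020 vs. 2021): resample() labels each bin with the calendar date on which the quarter ends, whereas the PeriodIndex shows the fiscal year and quarter.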

ts = pd.date_range('2020-01-01', periods = 12, freq='M')

ts_series = pd.Series(range(len(ts)), index=ts)

ts_series.resample('Q-FEB').mean()

[Out]:

2020-02-29     0.5
2020-05-31     3.0
2020-08-31     6.0
2020-11-30     9.0
2021-02-28    11.0
Freq: Q-FEB, dtype: float64

Setting the kind='period' option in resample() gives the same result as grouping by ts.to_period(freq='Q-FEB').

ts_series.resample('Q-FEB', kind='period').mean()

[Out]:

2020Q4     0.5
2021Q1     3.0
2021Q2     6.0
2021Q3     9.0
2021Q4    11.0
Freq: Q-FEB, dtype: float64
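The relationship between the two kinds of labels can be checked with a short sketch. It builds the monthly data as a PeriodIndex (a choice made here for illustration, not the original construction), groups by fiscal quarter, and then converts each quarter label to the calendar date on which that quarter ends.

```python
import pandas as pd

# 12 monthly periods for calendar year 2020 with values 0..11.
months = pd.period_range('2020-01', periods=12, freq='M')
s = pd.Series(range(12), index=months)

# Map every month to its fiscal quarter (fiscal year ending in February).
quarters = months.asfreq('Q-FEB')
q_mean = s.groupby(quarters).mean()

# The same bins, labeled by the calendar date on which each quarter ends.
q_end = q_mean.index.to_timestamp(how='end').normalize()

print(q_mean)
print(q_end)
```

The period label 2020Q4 and the date label 2020-02-29 refer to the same bin, which is why the two aggregations above return identical values under different-looking indexes.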

I hope this was helpful.


[Python pandas] Checking, setting, and converting time zones (time zone generation, localization, conversion)

Python Analysis and Programming / Python Data Preprocessing 2019. 12. 28. 18:13

In this post, I will introduce how to

(1) check time zones in Python

(2) set the time zone when generating date-times with date_range() in pandas

(time zone setting)

(3) change from the naive state, with no time zone information, to a local time zone

(convert from naive timezone to localized timezone)

(4) change a date-time DatetimeIndex from one time zone to another

(convert from a timezone to another timezone)

(1) Checking time zones in Python

If you work across countries and several time zones, or in a country that observes daylight saving time (Daylight Saving Time, DST, in US usage; Summer Time in British usage), such as the United States, Canada, most European countries, and parts of Australia, having to account for time zones in your programs is a headache.

For this reason, UTC (Coordinated Universal Time, the successor to Greenwich Mean Time, GMT) is widely used as the international standard for time zones by country/region. The map below shows the time zones by country; the center of the map, passing through the Greenwich Observatory in the UK, is the UTC time zone.

For reference, Korea and Japan are in the UTC+9 time zone (central Australia is close by, at UTC+9:30).

[ Standard Time Zones of the World ]

* Source: https://en.wikipedia.org/wiki/Coordinated_Universal_Time#/media/File:World_Time_Zones_Map.png

In Python you can look up time zone information with the pytz library, and pandas wraps the pytz library to handle time zone information.

Let's look at the time zone names in Asia.

# time zone information

import pytz

# regular expression in Python

import re

# regular expression for names starting with 'Asia'

pattern = re.compile(r'^Asia')

# list comprehension for selecting 'Asia****' time zones

tz_asia = [x for x in pytz.common_timezones if pattern.match(x)]

tz_asia

[Out]:

['Asia/Aden',
 'Asia/Almaty',
 'Asia/Amman',
 'Asia/Anadyr',
 'Asia/Aqtau',
 'Asia/Aqtobe',
 'Asia/Ashgabat',
 'Asia/Atyrau',
 'Asia/Baghdad',
 'Asia/Bahrain',
 'Asia/Baku',
 'Asia/Bangkok',
 'Asia/Barnaul',
 'Asia/Beirut',
 'Asia/Bishkek',
 'Asia/Brunei',
 'Asia/Chita',
 'Asia/Choibalsan',
 'Asia/Colombo',
 'Asia/Damascus',
 'Asia/Dhaka',
 'Asia/Dili',
 'Asia/Dubai',
 'Asia/Dushanbe',
 'Asia/Famagusta',
 'Asia/Gaza',
 'Asia/Hebron',
 'Asia/Ho_Chi_Minh',
 'Asia/Hong_Kong',
 'Asia/Hovd',
 'Asia/Irkutsk',
 'Asia/Jakarta',
 'Asia/Jayapura',
 'Asia/Jerusalem',
 'Asia/Kabul',
 'Asia/Kamchatka',
 'Asia/Karachi',
 'Asia/Kathmandu',
 'Asia/Khandyga',
 'Asia/Kolkata',
 'Asia/Krasnoyarsk',
 'Asia/Kuala_Lumpur',
 'Asia/Kuching',
 'Asia/Kuwait',
 'Asia/Macau',
 'Asia/Magadan',
 'Asia/Makassar',
 'Asia/Manila',
 'Asia/Muscat',
 'Asia/Nicosia',
 'Asia/Novokuznetsk',
 'Asia/Novosibirsk',
 'Asia/Omsk',
 'Asia/Oral',
 'Asia/Phnom_Penh',
 'Asia/Pontianak',
 'Asia/Pyongyang',
 'Asia/Qatar',
 'Asia/Qostanay',
 'Asia/Qyzylorda',
 'Asia/Riyadh',
 'Asia/Sakhalin',
 'Asia/Samarkand',
 'Asia/Seoul',
 'Asia/Shanghai',
 'Asia/Singapore',
 'Asia/Srednekolymsk',
 'Asia/Taipei',
 'Asia/Tashkent',
 'Asia/Tbilisi',
 'Asia/Tehran',
 'Asia/Thimphu',
 'Asia/Tokyo',
 'Asia/Tomsk',
 'Asia/Ulaanbaatar',
 'Asia/Urumqi',
 'Asia/Ust-Nera',
 'Asia/Vientiane',
 'Asia/Vladivostok',
 'Asia/Yakutsk',
 'Asia/Yangon',
 'Asia/Yekaterinburg',
 'Asia/Yerevan']

Below are the results of looking up the time zone information for Seoul (Korea), Singapore, Shanghai (China), and Tokyo (Japan).

# UTC: coordinated universal time

pytz.timezone('UTC')

[Out]: <UTC>

pytz.timezone('Asia/Seoul')

[Out]: <DstTzInfo 'Asia/Seoul' LMT+8:28:00 STD>

pytz.timezone('Asia/Singapore')

[Out]: <DstTzInfo 'Asia/Singapore' LMT+6:55:00 STD>

pytz.timezone('Asia/Shanghai')

[Out]: <DstTzInfo 'Asia/Shanghai' LMT+8:06:00 STD>

pytz.timezone('Asia/Tokyo')

[Out]: <DstTzInfo 'Asia/Tokyo' LMT+9:19:00 STD>
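The LMT+8:28:00 style values in the representations above are each zone's earliest historical entry (local mean time), not the offset in force today. To see the offset that applies on a given date, one option is to attach the zone to a concrete timestamp and ask for its UTC offset. The sketch below does this through pandas; the offsets dictionary is just a helper name introduced here.

```python
import datetime as dt
import pandas as pd

offsets = {}
for name in ['Asia/Seoul', 'Asia/Singapore', 'Asia/Shanghai', 'Asia/Tokyo']:
    # Attach the zone to a concrete date, then ask for its offset from UTC.
    offsets[name] = pd.Timestamp('2019-12-28', tz=name).utcoffset()
    print(name, offsets[name])
```

On this date Seoul and Tokyo are 9 hours ahead of UTC, and Singapore and Shanghai are 8 hours ahead.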

(2) Generating date ranges with a time zone (generate date ranges with time zone)

When creating a date-time DatetimeIndex with the pandas date_range() function, the tz='time_zone_name' option sets the time zone. The example below creates 4 days of dates starting from 2019-12-28 with the 'Asia/Seoul' time zone.

import pandas as pd

ts_seoul = pd.date_range('2019-12-28', periods=4, freq='D', tz='Asia/Seoul')

ts_seoul

[Out]:

DatetimeIndex(['2019-12-28 00:00:00+09:00', '2019-12-29 00:00:00+09:00',
               '2019-12-30 00:00:00+09:00', '2019-12-31 00:00:00+09:00'],
              dtype='datetime64[ns, Asia/Seoul]', freq='D')

ts_seoul_series = pd.Series(range(len(ts_seoul)), index = ts_seoul)

ts_seoul_series.index.tz

[Out]:

<DstTzInfo 'Asia/Seoul' LMT+8:28:00 STD>

(3) Setting a local time zone from the naive state, with no time zone

(convert from naive to localized time zone)

By default, a date-time DatetimeIndex created with the pandas date_range() function is naive, that is, it carries no time zone. To assign the time zone of a specific country/region to such a tz-naive index, use the tz_localize('timezone_name') method.

# timezone-naive timestamps

ts_naive = pd.date_range('2019-12-28', periods=6, freq='D')

ts_naive

[Out]:

DatetimeIndex(['2019-12-28', '2019-12-29', '2019-12-30', '2019-12-31',
               '2020-01-01', '2020-01-02'],
              dtype='datetime64[ns]', freq='D')

# localize timezone to 'Asia/Seoul' using tz_localize() methods

ts_local_seoul = ts_naive.tz_localize('Asia/Seoul')

ts_local_seoul

[Out]:

DatetimeIndex(['2019-12-28 00:00:00+09:00', '2019-12-29 00:00:00+09:00',
               '2019-12-30 00:00:00+09:00', '2019-12-31 00:00:00+09:00',
               '2020-01-01 00:00:00+09:00', '2020-01-02 00:00:00+09:00'],
              dtype='datetime64[ns, Asia/Seoul]', freq='D')

If you instead call the tz_convert('timezone_name') method on a tz-naive index to set a time zone, it raises the type error 'TypeError: Cannot convert tz-naive timestamps, use tz_localize to localize'.

# TypeError: Cannot convert tz-naive timestamps, use tz_localize to localize

ts_naive.tz_convert('Asia/Seoul')

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-69-270d3596ed05> in <module>
----> 1 ts_naive.tz_convert('Asia/Seoul')

--- middle omitted ---

~/anaconda3/envs/py3.6_tf2.0/lib/python3.6/site-packages/pandas/core/arrays/datetimes.py in tz_convert(self, tz)
    958             # tz naive, use tz_localize
    959             raise TypeError(
--> 960                 "Cannot convert tz-naive timestamps, use " "tz_localize to localize"
    961             )
    962 

TypeError: Cannot convert tz-naive timestamps, use tz_localize to localize
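One way to avoid this error in code that may receive either kind of index is to check the tz attribute first. The helper below, to_zone, is a hypothetical function written for this illustration (it is not part of pandas), and treating naive timestamps as UTC is an assumption that has to match how the data was actually recorded.

```python
import pandas as pd

def to_zone(index, tz, assume='UTC'):
    """Return `index` expressed in time zone `tz`.

    A tz-naive index is first localized to `assume` (a hypothetical default),
    because tz_convert() only works on tz-aware data.
    """
    if index.tz is None:
        index = index.tz_localize(assume)
    return index.tz_convert(tz)

ts_naive = pd.date_range('2019-12-28', periods=2, freq='D')
ts_converted = to_zone(ts_naive, 'Asia/Seoul')
print(ts_converted)
```

Note the difference in meaning: tz_localize() attaches a zone to wall-clock times without shifting them, while tz_convert() re-expresses the same instants in another zone.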

(4) Converting from one time zone to another (convert from a time-zone to another one)

The example below uses the tz_convert('Asia/Singapore') method to convert the 'Asia/Seoul' time zone to the 'Asia/Singapore' time zone.

# timezone 'Asia/Seoul'

ts_seoul = pd.date_range('2019-12-28', periods=4, freq='D', tz='Asia/Seoul')

ts_seoul

[Out]:

DatetimeIndex(['2019-12-28 00:00:00+09:00', '2019-12-29 00:00:00+09:00',
               '2019-12-30 00:00:00+09:00', '2019-12-31 00:00:00+09:00'],
              dtype='datetime64[ns, Asia/Seoul]', freq='D')

# convert from 'Asia/Seoul' to 'Asia/Singapore' using tz_convert()

ts_singapore = ts_seoul.tz_convert('Asia/Singapore')

ts_singapore

[Out]:

DatetimeIndex(['2019-12-27 23:00:00+08:00', '2019-12-28 23:00:00+08:00',
               '2019-12-29 23:00:00+08:00', '2019-12-30 23:00:00+08:00'],
              dtype='datetime64[ns, Asia/Singapore]', freq='D')
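A conversion between time zones changes only how each moment is displayed, not the moment itself. The sketch below checks this by comparing the two indexes element by element and by expressing the Seoul index in UTC.

```python
import pandas as pd

ts_seoul = pd.date_range('2019-12-28', periods=4, freq='D', tz='Asia/Seoul')
ts_singapore = ts_seoul.tz_convert('Asia/Singapore')

# tz-aware values are compared as instants, so both indexes match exactly.
same_instant = (ts_seoul == ts_singapore).all()

# The same instants expressed in UTC.
ts_utc = ts_seoul.tz_convert('UTC')

print(same_instant)
print(ts_utc)
```

This is why midnight in Seoul appears as 23:00 of the previous day in Singapore: the wall-clock reading moves back one hour while the underlying instant stays fixed.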

I hope this was helpful.


[ Python time zone names (Python Timezone Names) ]

import pytz

pytz.common_timezones

Africa
['Africa/Abidjan', 'Africa/Accra', 'Africa/Addis_Ababa', 'Africa/Algiers', 'Africa/Asmara', 'Africa/Bamako', 'Africa/Bangui', 'Africa/Banjul', 'Africa/Bissau', 'Africa/Blantyre', 'Africa/Brazzaville', 'Africa/Bujumbura', 'Africa/Cairo', 'Africa/Casablanca', 'Africa/Ceuta', 'Africa/Conakry', 'Africa/Dakar', 'Africa/Dar_es_Salaam', 'Africa/Djibouti', 'Africa/Douala', 'Africa/El_Aaiun', 'Africa/Freetown', 'Africa/Gaborone', 'Africa/Harare', 'Africa/Johannesburg', 'Africa/Juba', 'Africa/Kampala', 'Africa/Khartoum', 'Africa/Kigali', 'Africa/Kinshasa', 'Africa/Lagos', 'Africa/Libreville', 'Africa/Lome', 'Africa/Luanda', 'Africa/Lubumbashi', 'Africa/Lusaka', 'Africa/Malabo', 'Africa/Maputo', 'Africa/Maseru', 'Africa/Mbabane', 'Africa/Mogadishu', 'Africa/Monrovia', 'Africa/Nairobi', 'Africa/Ndjamena', 'Africa/Niamey', 'Africa/Nouakchott', 'Africa/Ouagadougou', 'Africa/Porto-Novo', 'Africa/Sao_Tome', 'Africa/Tripoli', 'Africa/Tunis', 'Africa/Windhoek']
America
['America/Adak', 'America/Anchorage', 'America/Anguilla', 'America/Antigua', 'America/Araguaina', 'America/Argentina/Buenos_Aires', 'America/Argentina/Catamarca', 'America/Argentina/Cordoba', 'America/Argentina/Jujuy', 'America/Argentina/La_Rioja', 'America/Argentina/Mendoza', 'America/Argentina/Rio_Gallegos', 'America/Argentina/Salta', 'America/Argentina/San_Juan', 'America/Argentina/San_Luis', 'America/Argentina/Tucuman', 'America/Argentina/Ushuaia', 'America/Aruba', 'America/Asuncion', 'America/Atikokan', 'America/Bahia', 'America/Bahia_Banderas', 'America/Barbados', 'America/Belem', 'America/Belize', 'America/Blanc-Sablon', 'America/Boa_Vista', 'America/Bogota', 'America/Boise', 'America/Cambridge_Bay', 'America/Campo_Grande', 'America/Cancun', 'America/Caracas', 'America/Cayenne', 'America/Cayman', 'America/Chicago', 'America/Chihuahua', 'America/Costa_Rica', 'America/Creston', 'America/Cuiaba', 'America/Curacao', 'America/Danmarkshavn', 'America/Dawson', 'America/Dawson_Creek', 'America/Denver', 'America/Detroit', 
'America/Dominica', 'America/Edmonton', 'America/Eirunepe', 'America/El_Salvador', 'America/Fort_Nelson', 'America/Fortaleza', 'America/Glace_Bay', 'America/Godthab', 'America/Goose_Bay', 'America/Grand_Turk', 'America/Grenada', 'America/Guadeloupe', 'America/Guatemala', 'America/Guayaquil', 'America/Guyana', 'America/Halifax', 'America/Havana', 'America/Hermosillo', 'America/Indiana/Indianapolis', 'America/Indiana/Knox', 'America/Indiana/Marengo', 'America/Indiana/Petersburg', 'America/Indiana/Tell_City', 'America/Indiana/Vevay', 'America/Indiana/Vincennes', 'America/Indiana/Winamac', 'America/Inuvik', 'America/Iqaluit', 'America/Jamaica', 'America/Juneau', 'America/Kentucky/Louisville', 'America/Kentucky/Monticello', 'America/Kralendijk', 'America/La_Paz', 'America/Lima', 'America/Los_Angeles', 'America/Lower_Princes', 'America/Maceio', 'America/Managua', 'America/Manaus', 'America/Marigot', 'America/Martinique', 'America/Matamoros', 'America/Mazatlan', 'America/Menominee', 'America/Merida', 'America/Metlakatla', 'America/Mexico_City', 'America/Miquelon', 'America/Moncton', 'America/Monterrey', 'America/Montevideo', 'America/Montserrat', 'America/Nassau', 'America/New_York', 'America/Nipigon', 'America/Nome', 'America/Noronha', 'America/North_Dakota/Beulah', 'America/North_Dakota/Center', 'America/North_Dakota/New_Salem', 'America/Ojinaga', 'America/Panama', 'America/Pangnirtung', 'America/Paramaribo', 'America/Phoenix', 'America/Port-au-Prince', 'America/Port_of_Spain', 'America/Porto_Velho', 'America/Puerto_Rico', 'America/Punta_Arenas', 'America/Rainy_River', 'America/Rankin_Inlet', 'America/Recife', 'America/Regina', 'America/Resolute', 'America/Rio_Branco', 'America/Santarem', 'America/Santiago', 'America/Santo_Domingo', 'America/Sao_Paulo', 'America/Scoresbysund', 'America/Sitka', 'America/St_Barthelemy', 'America/St_Johns', 'America/St_Kitts', 'America/St_Lucia', 'America/St_Thomas', 'America/St_Vincent', 'America/Swift_Current', 'America/Tegucigalpa', 
'America/Thule', 'America/Thunder_Bay', 'America/Tijuana', 'America/Toronto', 'America/Tortola', 'America/Vancouver', 'America/Whitehorse', 'America/Winnipeg', 'America/Yakutat', 'America/Yellowknife' ]
Antarctica
[ 'Antarctica/Casey', 'Antarctica/Davis', 'Antarctica/DumontDUrville', 'Antarctica/Macquarie', 'Antarctica/Mawson', 'Antarctica/McMurdo', 'Antarctica/Palmer', 'Antarctica/Rothera', 'Antarctica/Syowa', 'Antarctica/Troll', 'Antarctica/Vostok', 'Arctic/Longyearbyen']
Australia
[ 'Australia/Adelaide', 'Australia/Brisbane', 'Australia/Broken_Hill', 'Australia/Currie', 'Australia/Darwin', 'Australia/Eucla', 'Australia/Hobart', 'Australia/Lindeman', 'Australia/Lord_Howe', 'Australia/Melbourne', 'Australia/Perth', 'Australia/Sydney']
Canada
[ 'Canada/Atlantic', 'Canada/Central', 'Canada/Eastern', 'Canada/Mountain', 'Canada/Newfoundland', 'Canada/Pacific']
Europe
[ 'Europe/Amsterdam', 'Europe/Andorra', 'Europe/Astrakhan', 'Europe/Athens', 'Europe/Belgrade', 'Europe/Berlin', 'Europe/Bratislava', 'Europe/Brussels', 'Europe/Bucharest', 'Europe/Budapest', 'Europe/Busingen', 'Europe/Chisinau', 'Europe/Copenhagen', 'Europe/Dublin', 'Europe/Gibraltar', 'Europe/Guernsey', 'Europe/Helsinki', 'Europe/Isle_of_Man', 'Europe/Istanbul', 'Europe/Jersey', 'Europe/Kaliningrad', 'Europe/Kiev', 'Europe/Kirov', 'Europe/Lisbon', 'Europe/Ljubljana', 'Europe/London', 'Europe/Luxembourg', 'Europe/Madrid', 'Europe/Malta', 'Europe/Mariehamn', 'Europe/Minsk', 'Europe/Monaco', 'Europe/Moscow', 'Europe/Oslo', 'Europe/Paris', 'Europe/Podgorica', 'Europe/Prague', 'Europe/Riga', 'Europe/Rome', 'Europe/Samara', 'Europe/San_Marino', 'Europe/Sarajevo', 'Europe/Saratov', 'Europe/Simferopol', 'Europe/Skopje', 'Europe/Sofia', 'Europe/Stockholm', 'Europe/Tallinn', 'Europe/Tirane', 'Europe/Ulyanovsk', 'Europe/Uzhgorod', 'Europe/Vaduz', 'Europe/Vatican', 'Europe/Vienna', 'Europe/Vilnius', 'Europe/Volgograd', 'Europe/Warsaw', 'Europe/Zagreb', 'Europe/Zaporozhye', 'Europe/Zurich']
Pacific
['Pacific/Apia', 'Pacific/Auckland', 'Pacific/Bougainville', 'Pacific/Chatham', 'Pacific/Chuuk', 'Pacific/Easter', 'Pacific/Efate', 'Pacific/Enderbury', 'Pacific/Fakaofo', 'Pacific/Fiji', 'Pacific/Funafuti', 'Pacific/Galapagos', 'Pacific/Gambier', 'Pacific/Guadalcanal', 'Pacific/Guam', 'Pacific/Honolulu', 'Pacific/Kiritimati', 'Pacific/Kosrae', 'Pacific/Kwajalein', 'Pacific/Majuro', 'Pacific/Marquesas', 'Pacific/Midway', 'Pacific/Nauru', 'Pacific/Niue', 'Pacific/Norfolk', 'Pacific/Noumea', 'Pacific/Pago_Pago', 'Pacific/Palau', 'Pacific/Pitcairn', 'Pacific/Pohnpei', 'Pacific/Port_Moresby', 'Pacific/Rarotonga', 'Pacific/Saipan', 'Pacific/Tahiti', 'Pacific/Tarawa', 'Pacific/Tongatapu', 'Pacific/Wake', 'Pacific/Wallis']
US
[ 'US/Alaska', 'US/Arizona', 'US/Central', 'US/Eastern', 'US/Hawaii', 'US/Mountain', 'US/Pacific']
Indian
['Indian/Antananarivo', 'Indian/Chagos', 'Indian/Christmas', 'Indian/Cocos', 'Indian/Comoro', 'Indian/Kerguelen', 'Indian/Mahe', 'Indian/Maldives', 'Indian/Mauritius', 'Indian/Mayotte', 'Indian/Reunion']


[Python] Parsing a text document into a DataFrame and creating a new column by lagging within each group

Python Analysis and Programming / Python Data Preprocessing 2019. 12. 26. 19:56

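In this post, I will

(1) open a text file, read it line by line, and parse it with string methods

--> to build a pandas DataFrame, and

(2) create a new column by shifting values down one row (lag) within each group, keyed by ID.

Suppose we have a text file like the one below.

[ 'color_range.txt' file ]

The first row, AAA, means that 0 to 100 is region a, 100 to 200 is region b, and so on. Here a (red) and b (blue) denote colors, so for AAA, 0 (inclusive) to 100 (exclusive) is red, 100 (inclusive) to 200 (exclusive) is blue, 200 (inclusive) to 300 (exclusive) is red, ...

The data in this file is laid out as long horizontal rows. The goal is to build a pandas DataFrame in which, for each ID ('AAA', 'BBB'), the start number and end number of each color (a: red, b: blue) sit in separate, easy-to-read columns.

(1) Opening a text file, reading it line by line, and parsing it with string methods

--> building a pandas DataFrame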

import pandas as pd
import os

# set file path
cwd = os.getcwd()
file_path = os.path.join(cwd, 'color_range.txt')

# read 'color_range.txt' and parse each line into id and value
df = pd.DataFrame() # blank DataFrame to store results

# open the file; the with-block closes it automatically
with open(file_path) as f:
    # parse the text line by line
    for line in f:
        id_list = []
        color_list = []
        bin_list = []

        # remove surrounding white space
        line = line.strip()
        # delete '"'
        line = line.replace('"', '')

        # get ID and VALUE from a line
        id = line[:3]
        val = line[4:]

        # turn the space before each color code into a comma separator
        val = val.replace(' a', ',a')
        val = val.replace(' b', ',b')

        # split a line using separator ','
        val_split = val.split(sep=',')

        # get the 'ID', 'COLOR', 'BIN_END' values and append them to the lists
        for j in range(len(val_split)):
            id_list.append(id)
            color_list.append(val_split[j][:1])
            bin_list.append(val_split[j][2:])

        # make a temp DataFrame holding the ID, COLOR, BIN_END values of this line
        # note: the values are passed as lists, because passing only scalar values
        # without an index raises a ValueError
        df_tmp = pd.DataFrame({'id': id_list,
                               'color_cd': color_list,
                               'bin_end': bin_list})

        # append df_tmp to df
        df = pd.concat([df, df_tmp], axis=0, ignore_index=True)

# let's check df DataFrame

df

[Out]:

	id	color_cd	bin_end
0	AAA	a	100
1	AAA	b	200
2	AAA	a	300
3	AAA	b	400
4	BBB	a	250
5	BBB	b	350
6	BBB	a	450
7	BBB	b	550
8	BBB	a	650
9	BBB	b	750
10	BBB	a	800
11	BBB	b	910
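
As a self-contained sketch of the parsing step, the version below splits each line on whitespace instead of relying on fixed slice positions, and collects all rows in a plain list. The sample lines are an assumption reconstructed from the output table above (the original 'color_range.txt' is not reproduced here), and parse_color_ranges is a hypothetical helper name.

```python
import io

# Assumed sample input, reconstructed from the output table above;
# the real 'color_range.txt' may be laid out differently.
SAMPLE = '''AAA "a 100 b 200 a 300 b 400"
BBB "a 250 b 350 a 450 b 550 a 650 b 750 a 800 b 910"
'''

def parse_color_ranges(lines):
    """Return (id, color_cd, bin_end) tuples, one per color/number pair."""
    rows = []
    for line in lines:
        tokens = line.replace('"', '').split()
        if not tokens:
            continue  # skip blank lines
        row_id, values = tokens[0], tokens[1:]
        # values alternate: color code, bin end, color code, bin end, ...
        for color_cd, bin_end in zip(values[0::2], values[1::2]):
            rows.append((row_id, color_cd, int(bin_end)))
    return rows

rows = parse_color_ranges(io.StringIO(SAMPLE))
print(rows[:4])
```

Passing rows to pd.DataFrame(rows, columns=['id', 'color_cd', 'bin_end']) once, after the loop, would avoid calling pd.concat for every line.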

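(2) Creating a new column by shifting values down one row (lag) within each ID group

Grouping by 'id', we shift the 'bin_end' column down by one row (shift(1)) and fill the missing value in the first row of each group with '0' (fillna(0)).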

# lag 1 group by 'id' and fill missing value with '0'

df['bin_start'] = df.groupby('id')['bin_end'].shift(1).fillna(0)

[Out]:

	id	color_cd	bin_end	bin_start
0	AAA	a	100	0
1	AAA	b	200	100
2	AAA	a	300	200
3	AAA	b	400	300
4	BBB	a	250	0
5	BBB	b	350	250
6	BBB	a	450	350
7	BBB	b	550	450
8	BBB	a	650	550
9	BBB	b	750	650
10	BBB	a	800	750
11	BBB	b	910	800
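
In the code above 'bin_end' holds strings, so fillna(0) mixes the integer 0 with string values in the new column. A minimal sketch of the same lag, assuming 'bin_end' has been converted to integers first (the small frame below is hypothetical):

```python
import pandas as pd

# Small hypothetical frame with numeric bin_end values
df = pd.DataFrame({
    'id': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB'],
    'bin_end': [100, 200, 300, 250, 350],
})

# shift(1) moves each value down one row inside its own group,
# so the first row of every group becomes NaN
lagged = df.groupby('id')['bin_end'].shift(1)

# NaN forces a float dtype; fill with 0 and cast back to int
df['bin_start'] = lagged.fillna(0).astype(int)
print(df)
```

Because the shift happens per group, the first 'BBB' row gets 0 rather than the last 'AAA' value.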

Next, we map the color codes in 'color_cd' to color names: 'a' to red and 'b' to blue.

# mapping color using color_cd
color_map = {'a': 'red',
             'b': 'blue'}

df['color'] = df['color_cd'].map(lambda x: color_map.get(x, x))

[Out]:

	id	color_cd	bin_end	bin_start	color
0	AAA	a	100	0	red
1	AAA	b	200	100	blue
2	AAA	a	300	200	red
3	AAA	b	400	300	blue
4	BBB	a	250	0	red
5	BBB	b	350	250	blue
6	BBB	a	450	350	red
7	BBB	b	550	450	blue
8	BBB	a	650	550	red
9	BBB	b	750	650	blue
10	BBB	a	800	750	red
11	BBB	b	910	800	blue

For readability, let's reorder the columns as 'id', 'color_cd', 'color', 'bin_start', 'bin_end'.

# change the sequence of columns

df = df[['id', 'color_cd', 'color', 'bin_start', 'bin_end']]

[Out]:

	id	color_cd	color	bin_start	bin_end
0	AAA	a	red	0	100
1	AAA	b	blue	100	200
2	AAA	a	red	200	300
3	AAA	b	blue	300	400
4	BBB	a	red	0	250
5	BBB	b	blue	250	350
6	BBB	a	red	350	450
7	BBB	b	blue	450	550
8	BBB	a	red	550	650
9	BBB	b	blue	650	750
10	BBB	a	red	750	800
11	BBB	b	blue	800	910

To make it clear that bin_start is included and bin_end is not,

==> we combine the inclusive sign '[' + 'bin_start', 'bin_end' + the exclusive sign ')'

into a new column named 'bin_range'.

# make a 'bin_range' column with the include '[' and exclude ')' signs
# note: str(x) + ')' works for both string and numeric values
df['bin_range'] = df['bin_start'].apply(lambda x: '[' + str(x) + ',') + \
                  df['bin_end'].apply(lambda x: str(x) + ')')

[Out]:

	id	color_cd	color	bin_start	bin_end	bin_range
0	AAA	a	red	0	100	[0,100)
1	AAA	b	blue	100	200	[100,200)
2	AAA	a	red	200	300	[200,300)
3	AAA	b	blue	300	400	[300,400)
4	BBB	a	red	0	250	[0,250)
5	BBB	b	blue	250	350	[250,350)
6	BBB	a	red	350	450	[350,450)
7	BBB	b	blue	450	550	[450,550)
8	BBB	a	red	550	650	[550,650)
9	BBB	b	blue	650	750	[650,750)
10	BBB	a	red	750	800	[750,800)
11	BBB	b	blue	800	910	[800,910)
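
The same label can also be built without apply(). The sketch below uses vectorized string concatenation on a small hypothetical frame; astype(str) makes it work whether the columns hold numbers or strings.

```python
import pandas as pd

# Hypothetical frame with the two boundary columns
df = pd.DataFrame({'bin_start': [0, 100, 200], 'bin_end': [100, 200, 300]})

# Vectorized string concatenation over whole columns
df['bin_range'] = ('[' + df['bin_start'].astype(str) + ','
                   + df['bin_end'].astype(str) + ')')
print(df['bin_range'].tolist())
```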

I hope this was helpful.


Posted by Rfriend

[Python pandas] Time series frequencies and date offsets (Frequencies and Date Offsets)

Python Analysis and Programming/Python Data Preprocessing 2019. 12. 26. 13:37

In the previous post I showed how to compute a simple moving average of order m with numpy and pandas (https://rfriend.tistory.com/502).

This post covers pandas frequencies and date offsets, which are useful when generating time series data in Python pandas. Using the 2020 calendar, we will try the alias of each offset type and check the results.

[ Python pandas Base Time Series Frequencies and Date Offsets ]

(* In Korean, 'frequency' is usually rendered as '빈도' (frequency), although '주기' (period) seems closer in meaning to me. I am not sure how best to translate 'offset'.)

(1) Generating dates with a 3-day frequency

A pandas frequency consists of a 'base frequency' and a 'multiplier', and the base frequency is referred to by its alias string. The example below multiplies 'd' (or 'D'), the alias for 'Day', by 3 to generate a range of 8 dates (periods = 8) at a '3 Days' frequency.

import pandas as pd

pd.date_range('2019-12-01', periods = 8, freq = '3d') # or freq = '3D'

[Out]:
DatetimeIndex(['2019-12-01', '2019-12-04', '2019-12-07', '2019-12-10',
               '2019-12-13', '2019-12-16', '2019-12-19', '2019-12-22'],
              dtype='datetime64[ns]', freq='3D')

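Writing freq = 3 * '1D' gives the same result as freq = '3D'. Note that 3 * '1D' is ordinary Python string repetition and evaluates to '1D1D1D', which pandas parsed here as the sum of three one-day offsets.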

pd.date_range('2019-12-01', periods = 8, freq = 3 * '1d') # or freq = 3 * '1D'

[Out]:

DatetimeIndex(['2019-12-01', '2019-12-04', '2019-12-07', '2019-12-10',
               '2019-12-13', '2019-12-16', '2019-12-19', '2019-12-22'],
              dtype='datetime64[ns]', freq='3D')
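
To make the difference between repeating a string and scaling an offset explicit, here is a minimal sketch. It assumes only that offset objects support multiplication by an integer, which is how Day() * 3 becomes a three-day offset.

```python
import pandas as pd
from pandas.tseries.offsets import Day

# Plain Python: multiplying a string repeats it
repeated = 3 * '1D'
print(repeated)

# Multiplying an offset object scales it instead
three_days = Day() * 3

by_alias = pd.date_range('2019-12-01', periods=8, freq='3D')
by_offset = pd.date_range('2019-12-01', periods=8, freq=three_days)
print(by_alias.equals(by_offset))
```

Using the offset object or the '3D' alias states the multiplier directly and does not depend on how a repeated alias string is parsed.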

Each base frequency also has a corresponding class called a 'date offset'. Below we import the Day, Hour, Minute, and Second date offsets from pandas.tseries.offsets; passing freq = Day(3) gives the same result as freq = '3d' above.

from pandas.tseries.offsets import Day, Hour, Minute, Second

pd.date_range('2019-12-01', periods = 8, freq = Day(3))

[Out]:

DatetimeIndex(['2019-12-01', '2019-12-04', '2019-12-07', '2019-12-10',
               '2019-12-13', '2019-12-16', '2019-12-19', '2019-12-22'],
              dtype='datetime64[ns]', freq='3D')

Because 2 days + 23 hours + 59 minutes + 60 seconds = 3 days, using myfreq = Day(2) + Hour(23) + Minute(59) + Second(60) as freq = myfreq returns the same result as above (8 dates at a 3-day frequency).

from pandas.tseries.offsets import Day, Hour, Minute, Second

myfreq = Day(2) + Hour(23) + Minute(59) + Second(60) # 3 days

myfreq

[Out]: <3 * Days>

pd.date_range('2019-12-01', periods = 8, freq = myfreq)

[Out]:

DatetimeIndex(['2019-12-01', '2019-12-04', '2019-12-07', '2019-12-10',
               '2019-12-13', '2019-12-16', '2019-12-19', '2019-12-22'],
              dtype='datetime64[ns]', freq='3D')

Of course, the alias string freq = '2D23H59min60S' (or freq = '2d23h59T60s') gives the same result.

pd.date_range('2019-12-01', periods = 8, freq = '2D23H59min60S') # or freq = '2d23h59T60s'

[Out]:

DatetimeIndex(['2019-12-01', '2019-12-04', '2019-12-07', '2019-12-10',
               '2019-12-13', '2019-12-16', '2019-12-19', '2019-12-22'],
              dtype='datetime64[ns]', freq='3D')

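To look at the date offsets and aliases that return (a) the first and last calendar day and (b) the first and last business day of each week, month, or quarter, the examples below use the 2020 calendar. Note that 'business day' here means Monday through Friday; the built-in business offsets do not take public holidays into account.

(2) Month End vs. Business Month End, Month Begin vs. Business Month Begin

(2-1) (a) Last calendar day of the month (Month End) vs. (b) last business day of the month (Business Month End)

Below we create a DatetimeIndex from (a) the last calendar day of each month from January to August 2020 (offset type: MonthEnd, alias: 'M') and (b) the last business day of each month (offset type: BusinessMonthEnd, alias: 'BM'). In February and May 2020 the month ends on a weekend (2020-02-29 is a Saturday and 2020-05-31 is a Sunday), so 'Month End' and 'Business Month End' differ for those two months, as the results below show.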

# (a) Month End: 'M'

pd.date_range('2020-01-01', periods = 8, freq = 'M')

[Out]:

DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30',
               '2020-05-31', '2020-06-30', '2020-07-31', '2020-08-31'],
              dtype='datetime64[ns]', freq='M')

# (b) Business Month End: 'BM'

pd.date_range('2020-01-01', periods = 8, freq = 'BM')

[Out]:

DatetimeIndex(['2020-01-31', '2020-02-28', '2020-03-31', '2020-04-30',
               '2020-05-29', '2020-06-30', '2020-07-31', '2020-08-31'],
              dtype='datetime64[ns]', freq='BM')

(2-2) (a) First calendar day of the month (Month Begin) vs. (b) first business day of the month (Business Month Begin)

Below we create a DatetimeIndex from (a) the first calendar day of each month from January to August 2020 (offset type: MonthBegin, alias: 'MS') and (b) the first business day of each month (offset type: BusinessMonthBegin, alias: 'BMS').

# (a) Month Start: 'MS'

pd.date_range('2020-01-01', periods = 8, freq = 'MS')

[Out]:

DatetimeIndex(['2020-01-01', '2020-02-01', '2020-03-01', '2020-04-01',
               '2020-05-01', '2020-06-01', '2020-07-01', '2020-08-01'],
              dtype='datetime64[ns]', freq='MS')

# (b) Business Month Start: 'BMS'

pd.date_range('2020-01-01', periods = 8, freq = 'BMS')

[Out]:

DatetimeIndex(['2020-01-01', '2020-02-03', '2020-03-02', '2020-04-01',
               '2020-05-01', '2020-06-01', '2020-07-01', '2020-08-03'],
              dtype='datetime64[ns]', freq='BMS')
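
What the 'BM' and 'BMS' aliases compute can be sketched with the standard library alone. The two helpers below are hypothetical names and, like the built-in business offsets, they skip only Saturdays and Sundays; a holiday-aware version would need a holiday calendar (pandas offers CustomBusinessMonthEnd and CustomBusinessMonthBegin for that purpose).

```python
import calendar
from datetime import date, timedelta

def last_business_day(year, month):
    """Last Monday-to-Friday date of the month (holidays are ignored)."""
    last_day = calendar.monthrange(year, month)[1]
    d = date(year, month, last_day)
    while d.weekday() >= 5:   # 5 = Saturday, 6 = Sunday
        d -= timedelta(days=1)
    return d

def first_business_day(year, month):
    """First Monday-to-Friday date of the month (holidays are ignored)."""
    d = date(year, month, 1)
    while d.weekday() >= 5:
        d += timedelta(days=1)
    return d

print(last_business_day(2020, 2), last_business_day(2020, 5))
print(first_business_day(2020, 2), first_business_day(2020, 8))
```

The four dates agree with the 'BM' and 'BMS' outputs above for February, May, and August 2020.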

(3) A given day of the week, every week (Week)

The example below creates a DatetimeIndex from the first 8 Mondays (freq = 'W-MON') on or after January 1, 2020. The alias of each day of the week is the first three letters of its English name.

# -- Week 'Alias': Offset Type

# 'W-MON': Monday

# 'W-TUE': Tuesday

# 'W-WED': Wednesday

# 'W-THU': Thursday

# 'W-FRI': Friday

# 'W-SAT': Saturday

# 'W-SUN': Sunday

pd.date_range('2020-01-01', periods = 8, freq = 'W-MON')

[Out]:
DatetimeIndex(['2020-01-06', '2020-01-13', '2020-01-20', '2020-01-27',
               '2020-02-03', '2020-02-10', '2020-02-17', '2020-02-24'],
              dtype='datetime64[ns]', freq='W-MON')

(4) The n-th given weekday of each month (Week of Month)

The example below creates a DatetimeIndex from 8 dates on or after January 1, 2020 that fall on the first Friday of each month (freq = 'WOM-1FRI'). Coding this by hand without such an offset type would be tedious, so it is a very convenient feature.

(For example, to generate the second Tuesday of each month, use freq = "WOM-2TUE".)

pd.date_range('2020-01-01', periods = 8, freq = 'WOM-1FRI')

[Out]:
DatetimeIndex(['2020-01-03', '2020-02-07', '2020-03-06', '2020-04-03',
               '2020-05-01', '2020-06-05', '2020-07-03', '2020-08-07'],
              dtype='datetime64[ns]', freq='WOM-1FRI')
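
The 'WOM-...' aliases correspond to the WeekOfMonth offset class. A short sketch of the second-Tuesday case mentioned above, written with the offset object instead of the alias string:

```python
import pandas as pd
from pandas.tseries.offsets import WeekOfMonth

# week is zero-based (0 = first week); weekday uses 0 = Monday ... 6 = Sunday
second_tuesday = WeekOfMonth(week=1, weekday=1)

idx = pd.date_range('2020-01-01', periods=4, freq=second_tuesday)
print(idx)
```

Note the zero-based week argument: the '2' in 'WOM-2TUE' corresponds to week=1 on the class.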

(5-1) Last calendar day of the quarter (Quarter End), last business day of the quarter (Business Quarter End)

The example below creates a DatetimeIndex from 8 (periods=8) quarter-end dates on or after January 1, 2020, with February as the last month of a quarter (freq = 'Q-FEB').

To generate 8 quarter-end dates with March (or, equivalently, December) as the last month of a quarter (freq = 'Q-MAR'), see the second example below.

# quarterly dates anchored on February: last calendar day of each quarter

pd.date_range('2020-01-01', periods = 8, freq = 'Q-FEB')

[Out]:

DatetimeIndex(['2020-02-29', '2020-05-31', '2020-08-31', '2020-11-30',
               '2021-02-28', '2021-05-31', '2021-08-31', '2021-11-30'],
              dtype='datetime64[ns]', freq='Q-FEB')

# quarterly dates anchored on March: last calendar day of each quarter

pd.date_range('2020-01-01', periods = 8, freq = 'Q-MAR')

[Out]:

DatetimeIndex(['2020-03-31', '2020-06-30', '2020-09-30', '2020-12-31',
               '2021-03-31', '2021-06-30', '2021-09-30', '2021-12-31'],
              dtype='datetime64[ns]', freq='Q-MAR')

To generate 8 business quarter-end dates (the last business day of each quarter) with February as the last month of a quarter, use freq = 'BQ-FEB' as in the example below.

# quarterly dates anchored on February: last business day of each quarter

pd.date_range('2020-01-01', periods = 8, freq = 'BQ-FEB')

[Out]:

DatetimeIndex(['2020-02-28', '2020-05-29', '2020-08-31', '2020-11-30',
               '2021-02-26', '2021-05-31', '2021-08-31', '2021-11-30'],
              dtype='datetime64[ns]', freq='BQ-FEB')
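
The anchored quarterly aliases can also be written with the offset classes, which makes the anchor month explicit. This is a sketch rather than the post's own approach; as far as I know, recent pandas releases (2.2 and later) rename these aliases to 'QE-FEB' and 'BQE-FEB' and warn on the older spellings, while the classes below keep the same names.

```python
import pandas as pd
from pandas.tseries.offsets import QuarterEnd, BQuarterEnd

# startingMonth=2 anchors the quarters on February: Feb, May, Aug, Nov
q_end = pd.date_range('2020-01-01', periods=4, freq=QuarterEnd(startingMonth=2))
bq_end = pd.date_range('2020-01-01', periods=4, freq=BQuarterEnd(startingMonth=2))

# Quarter ends that fall on a weekend, where the two offsets disagree
print(q_end[q_end != bq_end])
```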

(5-2) First calendar day of the quarter (Quarter Begin), first business day of the quarter (Business Quarter Begin)

For quarter start dates, append 'S' to the 'Q' aliases of (5-1): 'QS' and 'BQS'.

The example below creates 8 (periods=8) quarter-start dates on or after January 1, 2020, anchored on February (freq = 'QS-FEB'), that is, the first day of February, May, August, and November.

# quarterly dates anchored on February: first calendar day of each quarter

pd.date_range('2020-01-01', periods = 8, freq = 'QS-FEB')

[Out]:
DatetimeIndex(['2020-02-01', '2020-05-01', '2020-08-01', '2020-11-01',
               '2021-02-01', '2021-05-01', '2021-08-01', '2021-11-01'],
              dtype='datetime64[ns]', freq='QS-FEB')

The example below creates a DatetimeIndex from 8 business quarter-start dates anchored on February (freq = 'BQS-FEB'), that is, the first business day of February, May, August, and November.

# quarterly dates anchored on February: first business day of each quarter

pd.date_range('2020-01-01', periods = 8, freq = 'BQS-FEB')

[Out]:

DatetimeIndex(['2020-02-03', '2020-05-01', '2020-08-03', '2020-11-02',
               '2021-02-01', '2021-05-03', '2021-08-02', '2021-11-01'],
              dtype='datetime64[ns]', freq='BQS-FEB')

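(6) Last calendar day of a given month each year (Year End), last business day of a given month each year (Business Year End)

The example below creates a DatetimeIndex from 8 dates that are the last day of February of each year (freq = 'A-FEB'). (To generate the last day of January each year, use the first three letters of JANUARY: freq = 'A-JAN'.)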

# Year End

# Annual dates anchored on last calendar day of given month

pd.date_range('2020-01-01', periods = 8, freq = 'A-FEB')

[Out]:
DatetimeIndex(['2020-02-29', '2021-02-28', '2022-02-28', '2023-02-28',
               '2024-02-29', '2025-02-28', '2026-02-28', '2027-02-28'],
              dtype='datetime64[ns]', freq='A-FEB')

The example below creates a DatetimeIndex from 8 (periods=8) dates that are the last business day of February of each year (freq = 'BA-FEB').

# Business Year End

# Annual dates anchored on last business day of given month

pd.date_range('2020-01-01', periods = 8, freq = 'BA-FEB')

[Out]:

DatetimeIndex(['2020-02-28', '2021-02-26', '2022-02-28', '2023-02-28',
               '2024-02-29', '2025-02-28', '2026-02-27', '2027-02-26'],
              dtype='datetime64[ns]', freq='BA-FEB')
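
As with the quarterly case, the annual aliases map onto offset classes that take the anchor month as a parameter. A minimal sketch with YearEnd and BYearEnd, assuming the month keyword selects the anchor month:

```python
import pandas as pd
from pandas.tseries.offsets import YearEnd, BYearEnd

# month=2 anchors the yearly dates on February
y_end = pd.date_range('2020-01-01', periods=3, freq=YearEnd(month=2))
by_end = pd.date_range('2020-01-01', periods=3, freq=BYearEnd(month=2))

print(y_end)
print(by_end)
```

The leap day 2020-02-29 is a Saturday and 2021-02-28 is a Sunday, which is why the first two business year ends move back to the preceding Friday.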

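(7) First calendar day of a given month each year (Year Begin), first business day of a given month each year (Business Year Begin)

The example below creates a DatetimeIndex from 8 (periods=8) dates on or after 2020-01-01 that are the first day of February of each year (freq = 'AS-FEB'). It adds 'S' to the 'A' alias of (6).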

# Year Begin

pd.date_range('2020-01-01', periods = 8, freq = 'AS-FEB')

[Out]:

DatetimeIndex(['2020-02-01', '2021-02-01', '2022-02-01', '2023-02-01',
               '2024-02-01', '2025-02-01', '2026-02-01', '2027-02-01'],
              dtype='datetime64[ns]', freq='AS-FEB')

The example below creates a DatetimeIndex from 8 (periods=8) dates on or after 2020-01-01 that are the first business day of February of each year (freq = 'BAS-FEB'). It adds 'B' in front of freq = 'AS-FEB' above.

# Business Year Begin

pd.date_range('2020-01-01', periods = 8, freq = 'BAS-FEB')

[Out]:

DatetimeIndex(['2020-02-03', '2021-02-01', '2022-02-01', '2023-02-01',
               '2024-02-01', '2025-02-03', '2026-02-02', '2027-02-01'],
              dtype='datetime64[ns]', freq='BAS-FEB')

(8) Shifting dates with offsets

So far we have looked at base time series frequencies and offset types. These offset objects can be added to or subtracted from datetime objects.

from datetime import datetime

from pandas.tseries.offsets import MonthEnd, MonthBegin

now = datetime.now()

now

[Out]: datetime.datetime(2019, 12, 21, 15, 30, 39, 654904)

now + MonthEnd()

[Out]: Timestamp('2019-12-31 15:30:39.654904')

now - MonthEnd()

[Out]: Timestamp('2019-11-30 15:30:39.654904')

now + MonthBegin()

[Out]: Timestamp('2020-01-01 15:30:39.654904')

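Alternatively, the rollforward() method of an offset object rolls a date forward (into the future) and rollback() rolls it backward (into the past) to the nearest date on the offset. A date that already lies on the offset is returned unchanged.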

offset_me = MonthEnd()

offset_me.rollforward(now)

[Out]: Timestamp('2019-12-31 15:30:39.654904')

offset_me.rollback(now)

[Out]: Timestamp('2019-11-30 15:30:39.654904')
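
Adding an offset and rolling with it give the same result in the example above, but they differ when the date already sits on the offset. The sketch below uses a fixed timestamp (an assumption, chosen so that it is already a month end) to show the difference.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

ts = pd.Timestamp('2019-12-31')   # already a month end
offset = MonthEnd()

added = ts + offset                 # always moves: next month end
rolled_fwd = offset.rollforward(ts) # already on the offset: unchanged
rolled_back = offset.rollback(ts)   # already on the offset: unchanged
subtracted = ts - offset            # always moves: previous month end

print(added, rolled_fwd, rolled_back, subtracted)
```

This makes rollforward() and rollback() the safer choice when a date should be snapped to a month end without skipping a month.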

(9) Creating periods of time with pandas.period_range()

(a) The DatetimeIndex created by pd.date_range('2020-01-01', periods=10, freq='d') can be used as the index of a pandas Series, and (b) so can the PeriodIndex created by pd.period_range('2020-01-01', '2020-01-10', freq='d'). The two Series print the same, although (a) is indexed by timestamps and (b) by periods.

import pandas as pd

import numpy as np

# (a) pandas.date_range('start_date', periods, freq)

dr = pd.date_range('2020-01-01', periods=10, freq='d')

dr

[Out]:

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10'],
              dtype='datetime64[ns]', freq='D')

pd.Series(range(10), index=dr)

[Out]:

2020-01-01    0
2020-01-02    1
2020-01-03    2
2020-01-04    3
2020-01-05    4
2020-01-06    5
2020-01-07    6
2020-01-08    7
2020-01-09    8
2020-01-10    9
Freq: D, dtype: int64

# (b) pandas.period_range('start_date', 'end_date', freq)

pr = pd.period_range('2020-01-01', '2020-01-10', freq='d')

pr

[Out]:

PeriodIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
             '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
             '2020-01-09', '2020-01-10'],
            dtype='period[D]', freq='D')

pd.Series(range(10), index=pr)

[Out]:

2020-01-01    0
2020-01-02    1
2020-01-03    2
2020-01-04    3
2020-01-05    4
2020-01-06    5
2020-01-07    6
2020-01-08    7
2020-01-09    8
2020-01-10    9
Freq: D, dtype: int64
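
Although the two Series look alike when printed, their indexes are different types. A short sketch of how to check this and convert between them with to_period() and to_timestamp():

```python
import pandas as pd

dr = pd.date_range('2020-01-01', periods=10, freq='D')
pr = pd.period_range('2020-01-01', '2020-01-10', freq='D')

# Different index classes: points in time vs. spans of time
print(type(dr).__name__, type(pr).__name__)

# Converting between the two index types
as_periods = dr.to_period('D')
as_timestamps = pr.to_timestamp()   # start of each period
print(as_periods.equals(pr), as_timestamps.equals(dr))
```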

I hope this was helpful.


Posted by Rfriend

[Python pandas] Simple moving average of order m (simple moving average with order m)

Python Analysis and Programming/Python Data Preprocessing 2019. 12. 25. 18:01

In this post we will, in order,

(1) download Apple's 2019 stock prices from Yahoo Finance,

(2) compute 5-day, 10-day, and 20-day simple moving averages (Simple Moving Average) of the closing price, and

(3) plot the closing price and the 5/10/20-day moving averages with matplotlib.

(1) Downloading Apple's 2019 stock prices from Yahoo Finance

One easy way to download stock prices from Yahoo Finance is to install the yfinance library and use its download() function. Here I installed the library directly from a Jupyter Notebook cell with !pip install yfinance, imported it, and used download() to fetch the Apple ('AAPL') prices from 2019-01-01 to 2019-12-24.

# Install yfinance package.

!pip install yfinance

# Import yfinance

import yfinance as yf

# Get the data for the stock Apple by specifying the stock ticker, start date, and end date

aapl = yf.download('AAPL','2019-01-01','2019-12-25')

Requirement already satisfied: yfinance in /Users/ihongdon/anaconda3/envs/py3.6_tf2.0/lib/python3.6/site-packages (0.1.52)
Requirement already satisfied: multitasking>=0.0.7 in /Users/ihongdon/anaconda3/envs/py3.6_tf2.0/lib/python3.6/site-packages (from yfinance) (0.0.9)
Requirement already satisfied: numpy>=1.15 in /Users/ihongdon/anaconda3/envs/py3.6_tf2.0/lib/python3.6/site-packages (from yfinance) (1.17.3)
Requirement already satisfied: pandas>=0.24 in /Users/ihongdon/anaconda3/envs/py3.6_tf2.0/lib/python3.6/site-packages (from yfinance) (0.25.3)
Requirement already satisfied: requests>=2.20 in /Users/ihongdon/anaconda3/envs/py3.6_tf2.0/lib/python3.6/site-packages (from yfinance) (2.22.0)
Requirement already satisfied: pytz>=2017.2 in /Users/ihongdon/anaconda3/envs/py3.6_tf2.0/lib/python3.6/site-packages (from pandas>=0.24->yfinance) (2019.3)
Requirement already satisfied: python-dateutil>=2.6.1 in /Users/ihongdon/anaconda3/envs/py3.6_tf2.0/lib/python3.6/site-packages (from pandas>=0.24->yfinance) (2.8.1)
Requirement already satisfied: certifi>=2017.4.17 in /Users/ihongdon/anaconda3/envs/py3.6_tf2.0/lib/python3.6/site-packages (from requests>=2.20->yfinance) (2019.11.28)
Requirement already satisfied: idna<2.9,>=2.5 in /Users/ihongdon/anaconda3/envs/py3.6_tf2.0/lib/python3.6/site-packages (from requests>=2.20->yfinance) (2.8)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /Users/ihongdon/anaconda3/envs/py3.6_tf2.0/lib/python3.6/site-packages (from requests>=2.20->yfinance) (1.25.7)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /Users/ihongdon/anaconda3/envs/py3.6_tf2.0/lib/python3.6/site-packages (from requests>=2.20->yfinance) (3.0.4)
Requirement already satisfied: six>=1.5 in /Users/ihongdon/anaconda3/envs/py3.6_tf2.0/lib/python3.6/site-packages (from python-dateutil>=2.6.1->pandas>=0.24->yfinance) (1.13.0)
[*********************100%***********************]  1 of 1 completed

aapl.head()

[Out]:

	Open	High	Low	Close	Adj Close	Volume
Date
2018-12-31	158.529999	159.360001	156.479996	157.740005	155.405045	35003500
2019-01-02	154.889999	158.850006	154.229996	157.919998	155.582367	37039700
2019-01-03	143.979996	145.720001	142.000000	142.190002	140.085220	91312200
2019-01-04	144.529999	148.550003	143.800003	148.259995	146.065353	58607100
2019-01-07	148.699997	148.830002	145.899994	147.929993	145.740265	54777800

(2) Computing 5-day, 10-day, and 20-day moving averages of the closing price (Close)

We will compute moving averages of the closing price (Close) of the Apple stock data.

aapl.Close[:10]

[Out]:

Date
2018-12-31    157.740005
2019-01-02    157.919998
2019-01-03    142.190002
2019-01-04    148.259995
2019-01-07    147.929993
2019-01-08    150.750000
2019-01-09    153.309998
2019-01-10    153.800003
2019-01-11    152.289993
2019-01-14    150.000000
Name: Close, dtype: float64

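Moving averages are widely used as a preprocessing step to remove noise from time series data, and to strip the seasonal component from seasonal data in order to look at the long-term trend factor or the medium-term cycle factor. They are also used for forecasting.

A moving average is either a simple moving average (SMA), which ignores weights (that is, assumes all values have equal weight), or a weighted moving average (WMA), which assigns weights. This post deals with the simple moving average (SMA).

A simple moving average of order m can further be divided into the centered moving average and the trailing moving average. A centered moving average places each average at the middle of its window, while a trailing moving average places it at the last point of the window. This post uses the trailing moving average, which is the default behavior of rolling() in pandas, to compute simple moving averages with windows of 5, 10, and 20 days.

I will show two ways to compute a moving average: a manual approach with a for loop and numpy.mean(), and a more convenient approach with rolling(window=m).mean() from pandas.

(2-1) 5-day moving average with a for loop and numpy.mean()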

import numpy as np

for i in range(0, 6):
    stock_close_5days = aapl.Close[i:(i+5)]
    sma_5d = np.mean(stock_close_5days)
    print('SMA(5 Days Window) of', aapl.Close.index[i+4].date(), ':', sma_5d)


[Out]:
SMA(5 Days Window) of 2019-01-07 : 150.80799865722656
SMA(5 Days Window) of 2019-01-08 : 149.40999755859374
SMA(5 Days Window) of 2019-01-09 : 148.48799743652344
SMA(5 Days Window) of 2019-01-10 : 150.80999755859375
SMA(5 Days Window) of 2019-01-11 : 151.61599731445312
SMA(5 Days Window) of 2019-01-14 : 152.02999877929688

(2-2) 5-day moving average with pandas rolling(window=5).mean()

A trailing moving average of order m produces m-1 missing values at the start of the series.

import pandas as pd

sma_5d = aapl.Close.rolling(window=5).mean()

sma_5d[:10]

[Out]:

Date
2018-12-31           NaN
2019-01-02           NaN
2019-01-03           NaN
2019-01-04           NaN
2019-01-07    150.807999
2019-01-08    149.409998
2019-01-09    148.487997
2019-01-10    150.809998
2019-01-11    151.615997
2019-01-14    152.029999
Name: Close, dtype: float64
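
To compare the trailing average with the centered variant, the sketch below applies both to the first ten closing prices listed earlier. It assumes only the center=True option of rolling(), which labels each window by its middle point instead of its last point.

```python
import pandas as pd

# First ten closing prices from the output above
close = pd.Series([157.740005, 157.919998, 142.190002, 148.259995, 147.929993,
                   150.750000, 153.309998, 153.800003, 152.289993, 150.000000])

# trailing: the average is labeled with the last point of the window
trailing = close.rolling(window=5).mean()
# centered: the average is labeled with the middle point of the window
centered = close.rolling(window=5, center=True).mean()

# Same averages, different positions: trailing lags centered by two rows
print(pd.DataFrame({'close': close, 'trailing': trailing, 'centered': centered}))
```

With window=5 the trailing version leaves four NaN values at the start, while the centered version leaves two at each end.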

Now that we know the pandas rolling() method, let's compute simple trailing moving averages with orders (windows) of 5, 10, and 20 days.

# simple trailing moving average with window 5 days/ 10 days/ 20 days
df_sma = pd.DataFrame({
    'close': aapl.Close
    , 'sma_5d': aapl.Close.rolling(window=5).mean()
    , 'sma_10d': aapl.Close.rolling(window=10).mean()
    , 'sma_20d': aapl.Close.rolling(window=20).mean()
    })

# first 25 rows

df_sma[:25]

[Out]:

	close	sma_5d	sma_10d	sma_20d
Date
2018-12-31	157.740005	NaN	NaN	NaN
2019-01-02	157.919998	NaN	NaN	NaN
2019-01-03	142.190002	NaN	NaN	NaN
2019-01-04	148.259995	NaN	NaN	NaN
2019-01-07	147.929993	150.807999	NaN	NaN
2019-01-08	150.750000	149.409998	NaN	NaN
2019-01-09	153.309998	148.487997	NaN	NaN
2019-01-10	153.800003	150.809998	NaN	NaN
2019-01-11	152.289993	151.615997	NaN	NaN
2019-01-14	150.000000	152.029999	151.418999	NaN
2019-01-15	153.070007	152.494000	150.951999	NaN
2019-01-16	154.940002	152.820001	150.653999	NaN
2019-01-17	155.860001	153.232001	152.020999	NaN
2019-01-18	156.820007	154.138004	152.877000	NaN
2019-01-22	153.300003	154.798004	153.414001	NaN
2019-01-23	153.919998	154.968002	153.731001	NaN
2019-01-24	152.699997	154.520001	153.670001	NaN
2019-01-25	157.759995	154.900000	154.066000	NaN
2019-01-28	156.300003	154.795999	154.467001	NaN
2019-01-29	154.679993	155.071997	154.935001	153.177000
2019-01-30	165.250000	157.337997	156.153000	153.552499
2019-01-31	166.440002	160.085999	157.303000	153.978500
2019-02-01	166.520004	161.838000	158.369000	155.195000
2019-02-04	171.250000	164.828000	159.812000	156.344500
2019-02-05	174.179993	168.728000	161.899998	157.657000

(3) Plotting the closing price and the 5/10/20-day moving averages with matplotlib

A trailing moving average of order m leaves m-1 missing values (NaN) at the start. For plotting, we drop the rows that contain missing values.

df_sma.dropna(axis=0, inplace=True)

df_sma.head(10)

[Out]:

	close	sma_5d	sma_10d	sma_20d
Date
2019-01-29	154.679993	155.071997	154.935001	153.177000
2019-01-30	165.250000	157.337997	156.153000	153.552499
2019-01-31	166.440002	160.085999	157.303000	153.978500
2019-02-01	166.520004	161.838000	158.369000	155.195000
2019-02-04	171.250000	164.828000	159.812000	156.344500
2019-02-05	174.179993	168.728000	161.899998	157.657000
2019-02-06	174.240005	170.526001	163.931999	158.831500
2019-02-07	170.940002	171.426001	165.756000	159.713000
2019-02-08	170.410004	172.204001	167.021001	160.543501
2019-02-11	169.429993	171.839999	168.334000	161.400500

We draw line plots of the closing price (Close) and the 5-day, 10-day, and 20-day moving averages with Matplotlib.

# line plot with moving average of 5 window, 10 window, 20 window

import matplotlib.pyplot as plt

plt.figure(figsize=(15, 10))

plt.plot(df_sma.index, df_sma.close, 'y-', label='close_price')

plt.plot(df_sma.index, df_sma.sma_5d, 'b-', label='sma_5d')

plt.plot(df_sma.index, df_sma.sma_10d, 'r-', label='sma_10d')

plt.plot(df_sma.index, df_sma.sma_20d, 'g-', label='sma_20d')

plt.legend()

plt.show()

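The lines in the one-year plot above overlap and are hard to tell apart, so we redraw the line plot with only the first 20 observations (starting 2019-01-29, mostly February).

As the plot below shows, the moving averages trail behind the original closing price (Close), and the larger the order (rolling window), the more slowly the moving average follows.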

plt.figure(figsize=(15, 10))

plt.plot(df_sma.index[:20], df_sma.close[:20], 'yo-', label='close_price')

plt.plot(df_sma.index[:20], df_sma.sma_5d[:20], 'bo-', label='sma_5d')

plt.plot(df_sma.index[:20], df_sma.sma_10d[:20], 'ro-', label='sma_10d')

plt.plot(df_sma.index[:20], df_sma.sma_20d[:20], 'go-', label='sma_20d')

plt.legend()

plt.show()

I hope this was helpful.


Posted by Rfriend

[Python pandas] Creating a fixed-frequency time series Series and DataFrame

Python Analysis and Programming/Python Data Preprocessing 2019. 12. 24. 18:29

In the previous post I showed how to check for and handle duplicated indices in time series data in a pandas Series or DataFrame (https://rfriend.tistory.com/500).

This post shows how to create a Series and a DataFrame with a fixed-frequency time series index in Python pandas.

[ Characteristics of time series data ]

Date-time index at equal, fixed intervals (equally spaced time interval, fixed frequency)
Date-time index with no duplicates and no gaps (no redundant values or gaps)
Sorted in time order (sequential order)

(* A time series does not have to have equal, fixed date-time intervals. For example, stock price data exists only on business days when the market is open; there is no data for weekends and holidays.)

(1) Creating a fixed-frequency time series Series (fixed frequency time series pandas Series)

(1-1) Creating a time series Series with missing dates (non-equally spaced time series)

First, let's create a simple time series pandas Series to use as an example. I deliberately dropped the date-time index entries '2019-12-04' and '2019-12-08' to create an index with gaps.

import pandas as pd

# generate dates from 2019-12-01 to 2019-12-10

date_idx = pd.date_range('2019-12-01', periods=10)

date_idx

[Out]:
DatetimeIndex(['2019-12-01', '2019-12-02', '2019-12-03', '2019-12-04',
               '2019-12-05', '2019-12-06', '2019-12-07', '2019-12-08',
               '2019-12-09', '2019-12-10'],
              dtype='datetime64[ns]', freq='D')

# drop 2 dates from DatetimeIndex

date_idx = date_idx.drop(pd.DatetimeIndex(['2019-12-04', '2019-12-08']))

date_idx

[Out]:

DatetimeIndex(['2019-12-01', '2019-12-02', '2019-12-03', '2019-12-05',
               '2019-12-06', '2019-12-07', '2019-12-09', '2019-12-10'],
              dtype='datetime64[ns]', freq=None)

# Time Series with missing dates index
series_ts_missing = pd.Series(range(len(date_idx))
                              , index=date_idx)

series_ts_missing

[Out]:

2019-12-01    0
2019-12-02    1
2019-12-03    2
2019-12-05    3
2019-12-06    4
2019-12-07    5
2019-12-09    6
2019-12-10    7
dtype: int64

(1-2) Converting a time series with gaps into a fixed-frequency pandas Series

(fixed frequency, equally spaced time interval time series)

The Series created in (1-1) lacks the index entries '2019-12-04' and '2019-12-08'. Here we use resample('D') to fill in the date-time index at equal daily intervals; the values at the added dates become missing values (NaN, Not a Number).

# Create a 1 day Fixed Frequency Time Series using resample('D')

series_ts_fixed_freq = series_ts_missing.resample('D')

series_ts_fixed_freq.first()

[Out]:

2019-12-01    0.0
2019-12-02    1.0
2019-12-03    2.0
2019-12-04    NaN <---
2019-12-05    3.0
2019-12-06    4.0
2019-12-07    5.0
2019-12-08    NaN <---
2019-12-09    6.0
2019-12-10    7.0
Freq: D, dtype: float64

Filling in the missing date-time index entries produced 'NaN' values; let's replace them with '0' using fillna(0).

# fill missing value with '0'

series_ts_fixed_freq.first().fillna(0)

[Out]:

2019-12-01    0.0
2019-12-02    1.0
2019-12-03    2.0
2019-12-04    0.0 <---
2019-12-05    3.0
2019-12-06    4.0
2019-12-07    5.0
2019-12-08    0.0 <---
2019-12-09    6.0
2019-12-10    7.0
Freq: D, dtype: float64
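
When no aggregation is needed, asfreq() is a more direct way to put a series on a regular grid than resample('D').first(). The sketch below rebuilds the same example series; its fill_value argument replaces only the values at dates that asfreq() had to add.

```python
import pandas as pd

# Same series as above: ten days with 2019-12-04 and 2019-12-08 removed
idx = pd.date_range('2019-12-01', periods=10).drop(
    pd.DatetimeIndex(['2019-12-04', '2019-12-08']))
s = pd.Series(range(len(idx)), index=idx)

# asfreq() reindexes to a regular daily grid; the added dates become NaN
daily = s.asfreq('D')

# fill_value is applied only to the dates that asfreq() added
daily_zero = s.asfreq('D', fill_value=0)
print(daily_zero)
```

Unlike fillna(0), fill_value leaves any NaN that was already present in the original data untouched.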

This time, let's use resample('10T') to create a time series with a date-time index at equal 10-minute intervals. Again, date-time index entries that were not in the original data set get missing values ('NaN').

# resampling with 10 minutes frequency (interval)

series_ts_missing.resample('10T').first()

[Out]:

2019-12-01 00:00:00    0.0
2019-12-01 00:10:00    NaN
2019-12-01 00:20:00    NaN
2019-12-01 00:30:00    NaN
2019-12-01 00:40:00    NaN
                      ...
2019-12-09 23:20:00    NaN
2019-12-09 23:30:00    NaN
2019-12-09 23:40:00    NaN
2019-12-09 23:50:00    NaN
2019-12-10 00:00:00    7.0
Freq: 10T, Length: 1297, dtype: float64

(2) Creating a fixed-frequency time series DataFrame

(fixed frequency time series pandas DataFrame)

(2-1) Creating a time series DataFrame with missing dates (non-equally spaced time series DataFrame)

After creating an equally spaced 10-day date-time index with pd.date_range(), I removed '2019-12-04' and '2019-12-08' with drop(pd.DatetimeIndex()) to create a date-time index with gaps.

import pandas as pd

# generate dates from 2019-12-01 to 2019-12-10

date_idx = pd.date_range('2019-12-01', periods=10)

# drop 2 dates from DatetimeIndex

date_idx = date_idx.drop(pd.DatetimeIndex(['2019-12-04', '2019-12-08']))

date_idx

[Out]:
DatetimeIndex(['2019-12-01', '2019-12-02', '2019-12-03', '2019-12-05',
               '2019-12-06', '2019-12-07', '2019-12-09', '2019-12-10'],
              dtype='datetime64[ns]', freq=None)

df_ts_missing = pd.DataFrame(range(len(date_idx))
                             , columns=['col']
                             , index=date_idx)

df_ts_missing

[Out]:

	col
2019-12-01	0
2019-12-02	1
2019-12-03	2
2019-12-05	3
2019-12-06	4
2019-12-07	5
2019-12-09	6
2019-12-10	7

(2-2) Converting a time series with gaps into a fixed-frequency pandas DataFrame

(fixed frequency, equally spaced time interval time series pandas DataFrame)

Using the resample('D') method, we create a time series DataFrame whose date-time index has equal daily intervals. Date-time index entries that were not in the original data get 'NaN' values.

df_ts_fixed_freq = df_ts_missing.resample('D').first()

df_ts_fixed_freq

[Out]:

	col
2019-12-01	0.0
2019-12-02	1.0
2019-12-03	2.0
2019-12-04	NaN <---
2019-12-05	3.0
2019-12-06	4.0
2019-12-07	5.0
2019-12-08	NaN <---
2019-12-09	6.0
2019-12-10	7.0
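For plain upsampling like this, where no aggregation is involved, asfreq('D') is another way to get a fixed daily index. The sketch below compares the two on the same example data; it is an illustration under that assumption and not a claim that the two methods are interchangeable in general (resample aggregates when several rows fall into one bin, while asfreq does not).

```python
import pandas as pd

# Rebuild the gapped DataFrame from the example above
date_idx = pd.date_range('2019-12-01', periods=10)
date_idx = date_idx.drop(pd.DatetimeIndex(['2019-12-04', '2019-12-08']))
df_ts_missing = pd.DataFrame({'col': range(len(date_idx))}, index=date_idx)

# Two ways to move onto a fixed daily frequency
by_resample = df_ts_missing.resample('D').first()
by_asfreq = df_ts_missing.asfreq('D')

# Compare the results and count the rows that were newly created as NaN
same = by_resample.equals(by_asfreq)
n_missing = int(by_asfreq['col'].isna().sum())

print(same, n_missing)
```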

Let's use the fillna(0) method to replace with '0' the 'NaN' missing values that appeared during the conversion to a fixed-frequency time series.

# fill missing value with '0'

df_ts_fixed_freq = df_ts_fixed_freq.fillna(0)

df_ts_fixed_freq

[Out]:

	col
2019-12-01	0.0
2019-12-02	1.0
2019-12-03	2.0
2019-12-04	0.0 <---
2019-12-05	3.0
2019-12-06	4.0
2019-12-07	5.0
2019-12-08	0.0 <---
2019-12-09	6.0
2019-12-10	7.0
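Filling with 0 is only one option, and a 0 can be mistaken for a real observation. As a hedged sketch, here are two alternatives on the same example data: carrying the previous value forward with ffill() and linear interpolation with interpolate(). Which one is appropriate depends on what the data represents; the variable names are introduced here for illustration.

```python
import pandas as pd

# Rebuild the gapped DataFrame and move it onto a fixed daily frequency
date_idx = pd.date_range('2019-12-01', periods=10)
date_idx = date_idx.drop(pd.DatetimeIndex(['2019-12-04', '2019-12-08']))
df_ts_missing = pd.DataFrame({'col': range(len(date_idx))}, index=date_idx)
df_ts_fixed_freq = df_ts_missing.resample('D').first()

# Option 1: constant fill, as in the post
filled_zero = df_ts_fixed_freq.fillna(0)

# Option 2: carry the last observed value forward
filled_ffill = df_ts_fixed_freq.ffill()

# Option 3: linear interpolation between the neighboring observations
filled_linear = df_ts_fixed_freq.interpolate(method='linear')

# Show the three results side by side for the two dates that were missing
gap_dates = pd.DatetimeIndex(['2019-12-04', '2019-12-08'])
comparison = pd.DataFrame({
    'zero': filled_zero.loc[gap_dates, 'col'],
    'ffill': filled_ffill.loc[gap_dates, 'col'],
    'linear': filled_linear.loc[gap_dates, 'col'],
})
print(comparison)
```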

I hope this was helpful.


License: Attribution, NonCommercial, NoDerivatives


Posted by Rfriend
