지난번 포스팅에서는 Python pandas에서 resampling 중 Downsampling 으로 집계할 때에 왼쪽과 오른쪽 중에서 어느쪽을 포함(inclusive, closed)할 지와 어느쪽으로 라벨 이름(label)을 쓸지(https://rfriend.tistory.com/507)에 대해서 알아보았습니다. 


이번 포스팅에서는 pandas의 resampling 중 Upsampling으로 시계열 데이터 주기(frequency)를 변환(conversion) 할 때 생기는 결측값을 처리하는 두 가지 방법을 소개하겠습니다. 


(1) Upsampling 으로 주기 변환 시 생기는 결측값을 채우는 방법 (filling forward/backward)

(2) Upsampling 으로 주기 변환 시 생기는 결측값을 선형 보간하는 방법 (linear interpolation)






예제로 사용할 간단할 2개의 칼럼을 가지고 주기(frequency)가 5초(5 seconds)인 시계열 데이터 DataFrame을 만들어보겠습니다. 



import pandas as pd

import numpy as np


rng = pd.date_range('2019-12-31', periods=3, freq='5S')

rng

[Out]:

DatetimeIndex(['2019-12-31 00:00:00', '2019-12-31 00:00:05', '2019-12-31 00:00:10'], dtype='datetime64[ns]', freq='5S')


ts = pd.DataFrame(np.array([0, 1, 3, 2, 10, 3]).reshape(3, 2), 

                  index=rng

                  columns=['col_1', 'col_2'])

ts

[Out]:

col_1col_2
2019-12-31 00:00:0001
2019-12-31 00:00:0532
2019-12-31 00:00:10103




이제 pandas resample() 메소드를 사용해서 주기가 5초(freq='5S')인 원래 데이터를 주기가 1초(freq='1S')인 데이터로 Upsampling 변환을 해보겠습니다. 그러면 아래처럼 새로 생긴 날짜-시간 행에 결측값(missing value)이 생깁니다ㅣ  



ts_upsample = ts.resample('S').mean()

ts_upsample

[Out]:

col_1col_2
2019-12-31 00:00:000.01.0
2019-12-31 00:00:01NaNNaN
2019-12-31 00:00:02NaNNaN
2019-12-31 00:00:03NaNNaN
2019-12-31 00:00:04NaNNaN
2019-12-31 00:00:053.02.0
2019-12-31 00:00:06NaNNaN
2019-12-31 00:00:07NaNNaN
2019-12-31 00:00:08NaNNaN
2019-12-31 00:00:09NaNNaN
2019-12-31 00:00:1010.03.0

 



위에 Upsampling을 해서 생긴 결측값들을 (1) 채우기(filling), (2) 선형 보간(linear interpolation) 해보겠습니다. 



  (1) Upsampling 으로 주기 변환 시 생기는 결측값을 채우기 (filling missing values)


(1-1) 앞의 값으로 뒤의 결측값 채우기 (Filling forward)



# (1) filling forward

ts_upsample.ffill()

ts_upsample.fillna(method='ffill')

ts_upsample.fillna(method='pad')

[Out]:

col_1col_2
2019-12-31 00:00:000.01.0
2019-12-31 00:00:010.01.0
2019-12-31 00:00:020.01.0
2019-12-31 00:00:030.01.0
2019-12-31 00:00:040.01.0
2019-12-31 00:00:053.02.0
2019-12-31 00:00:063.02.0
2019-12-31 00:00:073.02.0
2019-12-31 00:00:083.02.0
2019-12-31 00:00:093.02.0
2019-12-31 00:00:1010.03.0





(1-2) 뒤의 값으로 앞의 결측값 채우기 (Filling backward)



# (2)filling backward

ts_upsample.bfill()

ts_upsample.fillna(method='bfill')

ts_upsample.fillna(method='backfill')

[Out]:

col_1col_2
2019-12-31 00:00:000.01.0
2019-12-31 00:00:013.02.0
2019-12-31 00:00:023.02.0
2019-12-31 00:00:033.02.0
2019-12-31 00:00:043.02.0
2019-12-31 00:00:053.02.0
2019-12-31 00:00:0610.03.0
2019-12-31 00:00:0710.03.0
2019-12-31 00:00:0810.03.0
2019-12-31 00:00:0910.03.0
2019-12-31 00:00:1010.03.0





(1-3) 특정 값으로 결측값 채우기



# (3)fill Missing value with '0'

ts_upsample.fillna(0)

[Out]:

col_1col_2
2019-12-31 00:00:000.01.0
2019-12-31 00:00:010.00.0
2019-12-31 00:00:020.00.0
2019-12-31 00:00:030.00.0
2019-12-31 00:00:040.00.0
2019-12-31 00:00:053.02.0
2019-12-31 00:00:060.00.0
2019-12-31 00:00:070.00.0
2019-12-31 00:00:080.00.0
2019-12-31 00:00:090.00.0
2019-12-31 00:00:1010.03.0

 




(1-4) 평균 값으로 결측값 채우기



# (4) filling with mean value

# mean per column

ts_upsample.mean()

[Out]:
col_1    4.333333
col_2    2.000000
dtype: float64


ts_upsample.fillna(ts_upsample.mean())

[Out]:

col_1col_2
2019-12-31 00:00:000.0000001.0
2019-12-31 00:00:014.3333332.0
2019-12-31 00:00:024.3333332.0
2019-12-31 00:00:034.3333332.0
2019-12-31 00:00:044.3333332.0
2019-12-31 00:00:053.0000002.0
2019-12-31 00:00:064.3333332.0
2019-12-31 00:00:074.3333332.0
2019-12-31 00:00:084.3333332.0
2019-12-31 00:00:094.3333332.0
2019-12-31 00:00:1010.0000003.0

 




(1-5) 결측값 채우는 행의 개수 제한하기



# (5) limit the number of filling observation

ts_upsample.ffill(limit=1)

[Out]:

col_1col_2
2019-12-31 00:00:000.01.0
2019-12-31 00:00:010.01.0
2019-12-31 00:00:02NaNNaN
2019-12-31 00:00:03NaNNaN
2019-12-31 00:00:04NaNNaN
2019-12-31 00:00:053.02.0
2019-12-31 00:00:063.02.0
2019-12-31 00:00:07NaNNaN
2019-12-31 00:00:08NaNNaN
2019-12-31 00:00:09NaNNaN
2019-12-31 00:00:1010.03.0
 




  (2) Upsampling 으로 주기 변환 시 생기는 결측값을 선형 보간하기 (linear interpolation)



# (6) Linear interpolation by values

ts_upsample.interpolate(method='values') # by default

[Out]:

col_1col_2
2019-12-31 00:00:000.01.0
2019-12-31 00:00:010.61.2
2019-12-31 00:00:021.21.4
2019-12-31 00:00:031.81.6
2019-12-31 00:00:042.41.8
2019-12-31 00:00:053.02.0
2019-12-31 00:00:064.42.2
2019-12-31 00:00:075.82.4
2019-12-31 00:00:087.22.6
2019-12-31 00:00:098.62.8
2019-12-31 00:00:1010.03.0



ts_upsample.interpolate(method='values').plot()





# (7) Linear interpolation by time

ts_upsample.interpolate(method='time')

[Out]:

col_1col_2
2019-12-31 00:00:000.01.0
2019-12-31 00:00:010.61.2
2019-12-31 00:00:021.21.4
2019-12-31 00:00:031.81.6
2019-12-31 00:00:042.41.8
2019-12-31 00:00:053.02.0
2019-12-31 00:00:064.42.2
2019-12-31 00:00:075.82.4
2019-12-31 00:00:087.22.6
2019-12-31 00:00:098.62.8
2019-12-31 00:00:1010.03.0

 



많은 도움이 되었기를 바랍니다. 

이번 포스팅이 도움이 되었다면 아래의 '공감~'를 꾹 눌러주세요. :-)


반응형
Posted by Rfriend

댓글을 달아 주세요