
[Python pandas] Choosing the closed side (closed) and the label side (label) when aggregating time series by downsampling

Python Analysis and Programming / Python Data Preprocessing, 2019-12-30 17:47

The previous post covered building quarterly period ranges and converting between periods and timestamps (https://rfriend.tistory.com/506).

With the pandas resample() method you can do

(a) Downsampling: aggregating/summarizing a higher-frequency time series to a lower frequency (e.g. seconds --> 10 seconds, day --> week, day --> month), and

(b) Upsampling: converting a lower-frequency time series to a higher frequency (e.g. 10 seconds --> 1 second, week --> day, month --> week, year --> day).

This post shows how, when downsampling with resample() (e.g. resampling a 1-second series to a 10-second, 1-minute, or 1-hour frequency), to

(1) choose which side of each bin interval, left or right, is included (closed)

(2) choose which side of each bin interval, left or right, supplies the label (label)

There is no rule that says which side to use for inclusion or for labels. What matters is that (a) everyone is clearly aware of the choice (especially when several people work on the same data), and (b) the choice is applied consistently throughout the product's code. (I once split time series preprocessing in SQL between two groups, found out only later that the two groups had used different inclusion and label rules, and then lost time bringing everything to a single rule. -_-;;;)

As an example, let's create a simple time series pandas Series with 6 data points at a 1-minute frequency.

import pandas as pd

# generate dates range

dates = pd.date_range('2020-12-31', periods=6, freq='min') # or freq='T'

dates

[Out]:
DatetimeIndex(['2020-12-31 00:00:00', '2020-12-31 00:01:00',
               '2020-12-31 00:02:00', '2020-12-31 00:03:00',
               '2020-12-31 00:04:00', '2020-12-31 00:05:00'],
              dtype='datetime64[ns]', freq='T')

# create Series

ts_series = pd.Series(range(len(dates)), index=dates)

ts_series

[Out]:

2020-12-31 00:00:00    0
2020-12-31 00:01:00    1
2020-12-31 00:02:00    2
2020-12-31 00:03:00    3
2020-12-31 00:04:00    4
2020-12-31 00:05:00    5
Freq: T, dtype: int64

Now let's downsample this 1-minute series (freq='min') to a 2-minute frequency (freq='2min' or freq='2T') with the resample() method.

We will compare the downsampling results for all 4 combinations of the inclusion side, (a) closed='left' (the default) or (b) closed='right', and the label side, (c) label='left' (the default) or (d) label='right'. The aggregation function is sum() in every case.
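The four combinations can also be produced in a single loop, which makes them easy to compare side by side. This is a minimal sketch on the same 6-point series; the `results` dictionary is a name introduced here for illustration only.

```python
import pandas as pd

dates = pd.date_range('2020-12-31', periods=6, freq='min')
ts_series = pd.Series(range(len(dates)), index=dates)

# collect the 2-minute sums for every (closed, label) combination
results = {}
for closed in ('left', 'right'):
    for label in ('left', 'right'):
        results[(closed, label)] = ts_series.resample(
            '2min', closed=closed, label=label).sum()

# print the sums and the first label of each combination
for key, res in results.items():
    print(key, res.tolist(), str(res.index[0]))
```

Note that closed changes the sums (and the number of bins), while label only changes the timestamps attached to them.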

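(1) By default: downsampling with closed='left', label='left'

When downsampling, one side of each bin interval is included (inclusive, default: 'left') and the other side is excluded. The label attached to each bin after resampling is by default the left edge of the bin (label='left'). (These defaults apply to most frequencies, including the minute frequency used here; for 'M', 'A', 'Q', 'BM', 'BA', 'BQ', and 'W' pandas defaults to 'right' for both.)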

ts_series = pd.Series(range(len(dates)), index=dates)

ts_series

[Out]:

2020-12-31 00:00:00    0
2020-12-31 00:01:00    1
2020-12-31 00:02:00    2
2020-12-31 00:03:00    3
2020-12-31 00:04:00    4
2020-12-31 00:05:00    5
Freq: T, dtype: int64

# by default, left side of bin interval is closed

# by default, left side of bin interval is labeled

ts_series.resample('2min').sum()

[Out]:

2020-12-31 00:00:00    1
2020-12-31 00:02:00    5
2020-12-31 00:04:00    9
Freq: 2T, dtype: int64

# same result as above

ts_series.resample('2min', closed='left', label='left').sum()

[Out]:

2020-12-31 00:00:00    1
2020-12-31 00:02:00    5
2020-12-31 00:04:00    9
Freq: 2T, dtype: int64
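One way to see what closed='left', label='left' means: for this evenly spaced example it gives the same result as rounding every timestamp down to the start of its 2-minute bin and grouping on that. The sketch below assumes the same series; `manual` is an illustrative name, and this is not how resample() is implemented internally.

```python
import pandas as pd

dates = pd.date_range('2020-12-31', periods=6, freq='min')
ts_series = pd.Series(range(len(dates)), index=dates)

default = ts_series.resample('2min').sum()

# floor() maps 00:00 and 00:01 to 00:00, 00:02 and 00:03 to 00:02, and so on
manual = ts_series.groupby(ts_series.index.floor('2min')).sum()
print(manual)
```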

(2) Downsampling with closed='right', label='left'

ts_series = pd.Series(range(len(dates)), index=dates)

ts_series

[Out]:

2020-12-31 00:00:00    0
2020-12-31 00:01:00    1
2020-12-31 00:02:00    2
2020-12-31 00:03:00    3
2020-12-31 00:04:00    4
2020-12-31 00:05:00    5
Freq: T, dtype: int64

# right side of bin interval is closed using closed='right'

ts_series.resample('2min', closed='right', label='left').sum()

[Out]:

2020-12-30 23:58:00    0
2020-12-31 00:00:00    3
2020-12-31 00:02:00    7
2020-12-31 00:04:00    5
Freq: 2T, dtype: int64
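The extra 23:58 bin appears because with closed='right' the bins are (23:58, 00:00], (00:00, 00:02], and so on, so the 00:00 observation falls into the bin that ends at 00:00. A minimal sketch that reproduces this by hand, assuming the same series (`right_edge` and `manual` are illustrative names):

```python
import pandas as pd

dates = pd.date_range('2020-12-31', periods=6, freq='min')
ts_series = pd.Series(range(len(dates)), index=dates)

step = pd.Timedelta('2min')
# ceil() gives the right edge of the right-closed bin each timestamp belongs to
right_edge = ts_series.index.ceil('2min')
# label='left' means the bin is named after its left edge: right edge minus 2 minutes
manual = ts_series.groupby(right_edge - step).sum()

resampled = ts_series.resample('2min', closed='right', label='left').sum()
print(manual)
```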

(3) Downsampling with closed='left', label='right'

ts_series = pd.Series(range(len(dates)), index=dates)

ts_series

[Out]:

2020-12-31 00:00:00    0
2020-12-31 00:01:00    1
2020-12-31 00:02:00    2
2020-12-31 00:03:00    3
2020-12-31 00:04:00    4
2020-12-31 00:05:00    5
Freq: T, dtype: int64

# right side of bin interval is labeled using label='right'

ts_series.resample('2min', closed='left', label='right').sum()

[Out]:

2020-12-31 00:02:00    1
2020-12-31 00:04:00    5
2020-12-31 00:06:00    9
Freq: 2T, dtype: int64

(4) Downsampling with closed='right', label='right'

The example below is the exact opposite of the default: the right side of each bin interval is included (closed='right') and the right edge of the bin is used as the label (label='right').

ts_series = pd.Series(range(len(dates)), index=dates)

ts_series

[Out]:

2020-12-31 00:00:00    0
2020-12-31 00:01:00    1
2020-12-31 00:02:00    2
2020-12-31 00:03:00    3
2020-12-31 00:04:00    4
2020-12-31 00:05:00    5
Freq: T, dtype: int64

# right side of bin interval is closed using closed='right'

# right side of bin interval is labeled using label='right'

ts_series.resample('2min', closed='right', label='right').sum()

[Out]:

2020-12-31 00:00:00    0
2020-12-31 00:02:00    3
2020-12-31 00:04:00    7
2020-12-31 00:06:00    5
Freq: 2T, dtype: int64

(5) Setting the closed side (closed) and label side (label) when downsampling a time series pandas DataFrame

Examples (1), (2), (3), and (4) above all used a pandas Series. A time series pandas DataFrame with a DatetimeIndex as its index is downsampled in exactly the same way, with the same closed and label options.

import pandas as pd

# generate dates range

dates = pd.date_range('2020-12-31', periods=6, freq='min')

dates

[Out]:

DatetimeIndex(['2020-12-31 00:00:00', '2020-12-31 00:01:00',
               '2020-12-31 00:02:00', '2020-12-31 00:03:00',
               '2020-12-31 00:04:00', '2020-12-31 00:05:00'],
              dtype='datetime64[ns]', freq='T')

# create timeseries DataFrame

ts_df = pd.DataFrame({'val': range(len(dates))}, index=dates)

ts_df

[Out]:

	val
2020-12-31 00:00:00	0
2020-12-31 00:01:00	1
2020-12-31 00:02:00	2
2020-12-31 00:03:00	3
2020-12-31 00:04:00	4
2020-12-31 00:05:00	5

# (a) Downsampling using default setting

ts_df.resample('2min').sum()

[Out]:

	val
2020-12-31 00:00:00	1
2020-12-31 00:02:00	5
2020-12-31 00:04:00	9

# (b) Downsampling using closed='right'

ts_df.resample('2min', closed='right').sum()

[Out]:

	val
2020-12-30 23:58:00	0
2020-12-31 00:00:00	3
2020-12-31 00:02:00	7
2020-12-31 00:04:00	5

# (c) Downsampling using label='right'

ts_df.resample('2min', label='right').sum()

[Out]:

	val
2020-12-31 00:02:00	1
2020-12-31 00:04:00	5
2020-12-31 00:06:00	9

# (d) Downsampling using closed='right', label='right'

ts_df.resample('2min', closed='right', label='right').sum()

[Out]:

	val
2020-12-31 00:00:00	0
2020-12-31 00:02:00	3
2020-12-31 00:04:00	7
2020-12-31 00:06:00	5
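With a DataFrame, different columns can also be aggregated differently in the same resample call by passing a dict to agg(). The sketch below adds a second column, `val_x10`, which is hypothetical and exists only to show the per-column aggregation.

```python
import pandas as pd

dates = pd.date_range('2020-12-31', periods=6, freq='min')
ts_df = pd.DataFrame(
    {'val': range(len(dates)),
     'val_x10': [v * 10 for v in range(len(dates))]},  # hypothetical extra column
    index=dates)

# sum one column and average the other over right-closed, right-labeled bins
out = ts_df.resample('2min', closed='right', label='right').agg(
    {'val': 'sum', 'val_x10': 'mean'})
print(out)
```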

I hope this was helpful.


License: Attribution, NonCommercial, NoDerivatives


Posted by Rfriend

[Python pandas] Building quarterly period ranges and converting to and from timestamps (Quarterly period frequencies and range, conversion b/w timestamp)

Python Analysis and Programming / Python Data Preprocessing, 2019-12-30 12:30

The previous post showed how to check, set, and convert time zones in Python pandas (https://rfriend.tistory.com/505).

This post covers, in Python pandas,

(1) creating quarterly period frequencies

(2) creating a quarterly period date-range

(3) converting between quarterly periods and timestamps

(4) aggregating by quarterly period (quarterly period group by aggregation)

These features are especially useful in finance and accounting, where results are aggregated and analyzed by fiscal-year quarter.

[ Figure 1. Building a quarterly period range with pandas (Quarterly Period Range) ]

(1) Creating quarterly period frequencies

Let's create the fourth quarter of fiscal year 2020 with the pandas Period() constructor. With quarters ending in February (freq='Q-FEB'), fiscal year 2020 consists of, as shown in [Figure 1], March to May 2019 (2020Q1), June to August 2019 (2020Q2), September to November 2019 (2020Q3), and December 2019 to February 2020 (2020Q4). (It may look odd that fiscal year 2020 contains March through December of 2019, but the fiscal year is named after the calendar year in which it ends. ^^')

import pandas as pd

import numpy as np

p = pd.Period('2020Q4', freq='Q-FEB')

[Out]: Period('2020Q4', 'Q-FEB')
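To check which calendar dates a fiscal quarter covers, a Period exposes its fiscal year, quarter number, and start and end timestamps as attributes. A small sketch for the same period:

```python
import pandas as pd

p = pd.Period('2020Q4', freq='Q-FEB')

# fiscal year and quarter number of the period
print(p.qyear, p.quarter)

# first and last instant covered by the period
print(p.start_time)
print(p.end_time)
```

The start falls in December 2019 and the end on 2020-02-29, which matches the layout described above.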

With the asfreq() method you can convert a pandas Period object to any desired period frequency. Let's use asfreq() to convert the quarterly period 2020Q4 above into (a) the quarter's starting date and ending date, and (b) the quarter's starting business date and ending business date. (The business-day frequency 'B' skips weekends; it does not know about public holidays.)

(a) converting from Period to Date: 'D'

(b) converting from Period to Business Date: 'B'

# starting date

p.asfreq('D', how='start')

[Out]: Period('2019-12-01', 'D')

# ending date

p.asfreq('D', how='end')

[Out]: Period('2020-02-29', 'D')

# starting business date

p.asfreq('B', how='start')

[Out]: Period('2019-12-02', 'B')

# ending business date

p.asfreq('B', how='end')

[Out]: Period('2020-02-28', 'B')

By chaining asfreq() calls you can, for example,

(a) take the quarter's ending business date --> (b) convert it to the starting (how='start') minute (freq='T' or freq='min'),

(c) take the quarter's ending business date --> (d) convert it to the ending minute, or

(e) take the quarter's ending business date --> (f) convert it to the ending second.

# (a) from ending Business date --> (b) to starting Minutes

p.asfreq('B', how='end').asfreq('T', how='start')

[Out]: Period('2020-02-28 00:00', 'T')

# (c) from ending Business date --> (d) to ending Minutes

p.asfreq('B', how='end').asfreq('T', how='end')

[Out]: Period('2020-02-28 23:59', 'T')

# (e) from Business date --> (f) to Seconds

p.asfreq('B', how='end').asfreq('S', how='end')

[Out]: Period('2020-02-28 23:59:59', 'S')

(2) Creating a quarterly period range

Just as the pandas date_range() function builds a DatetimeIndex of dates and times, the pandas period_range('start', 'end', freq='Q-[ending-month]') function builds a quarterly period range. (For reference, freq='A-DEC' means yearly periods ending in December, and freq='Q-FEB' means quarterly periods whose fiscal year ends in February.)

The example below creates the quarterly periods from 2020Q1 to 2020Q4 (pd.period_range('2020Q1', '2020Q4')) with the fiscal year ending in February (freq='Q-FEB').

p_rng = pd.period_range('2020Q1', '2020Q4', freq='Q-FEB')

p_rng

[Out]:PeriodIndex(['2020Q1', '2020Q2', '2020Q3', '2020Q4'], dtype='period[Q-FEB]', 
freq='Q-FEB')
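To see which calendar dates each of these fiscal quarters spans, the period range can be converted to daily frequency at both ends and laid out as a small table. A sketch, where the `quarters` DataFrame and its column names are introduced here for illustration:

```python
import pandas as pd

p_rng = pd.period_range('2020Q1', '2020Q4', freq='Q-FEB')

# first and last calendar day of every quarter
starts = p_rng.asfreq('D', how='start')
ends = p_rng.asfreq('D', how='end')

quarters = pd.DataFrame(
    {'start': [str(d) for d in starts],
     'end': [str(d) for d in ends]},
    index=[str(q) for q in p_rng])
print(quarters)
```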

Using asfreq(), let's convert the '2020Q1' ~ '2020Q4' periods created above (quarters of a fiscal year ending in February) to each quarter's starting business date and ending business date.

# convert periods to the desired frequency using the asfreq() method

# starting business day per quarter 'Q-FEB'

p_rng.asfreq('B', how='start')

[Out]:

PeriodIndex(['2019-03-01', '2019-06-03', '2019-09-02', '2019-12-02'],

dtype='period[B]', freq='B')

# ending business day per quarter 'Q-FEB'

p_rng.asfreq('B', how='end')

[Out]:

PeriodIndex(['2019-05-31', '2019-08-30', '2019-11-29', '2020-02-28'],

dtype='period[B]', freq='B')

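After a Period object has been converted to another frequency, arithmetic operations are possible. Adding an integer shifts each period by that many units of its own frequency. The example below adds 1 to the ending business date of each February-year quarter, which moves it to the next business day.

# arithmetic operation: plus one business day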

p_rng.asfreq('B', how='end') + 1

[Out]:

PeriodIndex(['2019-06-03', '2019-09-02', '2019-12-02', '2020-03-02'],

dtype='period[B]', freq='B')
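The unit of `+ 1` depends entirely on the frequency of the period it is applied to. A short sketch with three different frequencies (the variable names are illustrative):

```python
import pandas as pd

day = pd.Period('2020-02-28', freq='D')
month = pd.Period('2020-02', freq='M')
quarter = pd.Period('2020Q4', freq='Q-FEB')

# the same "+ 1" moves one day, one month, or one quarter
print(day + 1)
print(month + 1)
print(quarter + 1)
```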

The example below first converts the periods to their ending business date, then converts that to the starting hour frequency, and finally adds 12 hours.

# period ending Business day, starting Hour

p_rng.asfreq('B', how='end').asfreq('H', how='start')

[Out]:

PeriodIndex(['2019-05-31 00:00', '2019-08-30 00:00', '2019-11-29 00:00',
             '2020-02-28 00:00'],
            dtype='period[H]', freq='H')

# plus 12 hours

p_12h_rng = p_rng.asfreq('B', how='end').asfreq('H', how='start') + 12

p_12h_rng

[Out]:

PeriodIndex(['2019-05-31 12:00', '2019-08-30 12:00', '2019-11-29 12:00',
             '2020-02-28 12:00'],
            dtype='period[H]', freq='H')

(3) Converting between quarterly periods and timestamps

(conversion between quarterly period and timestamp)

A DatetimeIndex created with pandas date_range() can be converted to a PeriodIndex with its to_period() method.

import pandas as pd

# generate dates range with 12 Months

ts = pd.date_range('2020-01-01', periods = 12, freq='M')

[Out]:
DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30',
               '2020-05-31', '2020-06-30', '2020-07-31', '2020-08-31',
               '2020-09-30', '2020-10-31', '2020-11-30', '2020-12-31'],
              dtype='datetime64[ns]', freq='M')

# convert from DatetimeIndex to PeriodIndex

p = ts.to_period()

[Out]:
PeriodIndex(['2020-01', '2020-02', '2020-03', '2020-04', '2020-05', '2020-06',
             '2020-07', '2020-08', '2020-09', '2020-10', '2020-11', '2020-12'],
            dtype='period[M]', freq='M')

Conversely, the to_timestamp() method converts a PeriodIndex to a DatetimeIndex.

# convert from PeriodIndex to DatetimeIndex with starting month('M')

p.asfreq('B', how='end').asfreq('M', how='start').to_timestamp()

[Out]:

DatetimeIndex(['2020-01-01', '2020-02-01', '2020-03-01', '2020-04-01',
               '2020-05-01', '2020-06-01', '2020-07-01', '2020-08-01',
               '2020-09-01', '2020-10-01', '2020-11-01', '2020-12-01'],
              dtype='datetime64[ns]', freq='MS')

# convert from PeriodIndex to DatetimeIndex with ending minutes('T')

p.asfreq('B', how='end').asfreq('T', how='end').to_timestamp()

DatetimeIndex(['2020-01-31 23:59:00', '2020-02-28 23:59:00',
               '2020-03-31 23:59:00', '2020-04-30 23:59:00',
               '2020-05-29 23:59:00', '2020-06-30 23:59:00',
               '2020-07-31 23:59:00', '2020-08-31 23:59:00',
               '2020-09-30 23:59:00', '2020-10-30 23:59:00',
               '2020-11-30 23:59:00', '2020-12-31 23:59:00'],
              dtype='datetime64[ns]', freq='BM')
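to_timestamp() also takes a how argument directly, so the start or end of each period can be picked without an intermediate asfreq() call. A minimal sketch on three month-end dates, written with explicit dates so it does not depend on a particular frequency alias:

```python
import pandas as pd

ts = pd.to_datetime(['2020-01-31', '2020-02-29', '2020-03-31'])

# DatetimeIndex -> monthly PeriodIndex
p = ts.to_period('M')

# PeriodIndex -> DatetimeIndex at the start or the end of each month
starts = p.to_timestamp(how='start')
ends = p.to_timestamp(how='end').normalize()  # drop the 23:59:59.999... time part

print(starts)
print(ends)
```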

(4) Aggregating by quarterly period (quarterly period group by aggregation)

Let's convert a simple monthly pandas Series to a Series with a quarterly PeriodIndex, and then compute the mean for each quarter.

ts = pd.date_range('2020-01-01', periods = 12, freq='M')

ts_series = pd.Series(range(len(ts)), index=ts)

ts_series

[Out]:

2020-01-31     0
2020-02-29     1
2020-03-31     2
2020-04-30     3
2020-05-31     4
2020-06-30     5
2020-07-31     6
2020-08-31     7
2020-09-30     8
2020-10-31     9
2020-11-30    10
2020-12-31    11
Freq: M, dtype: int64

# convert from DatetimeIndex to Quarterly PeriodIndex

ts_series.index = ts.to_period(freq='Q-FEB')

ts_series

[Out]:

2020Q4     0
2020Q4     1
2021Q1     2
2021Q1     3
2021Q1     4
2021Q2     5
2021Q2     6
2021Q2     7
2021Q3     8
2021Q3     9
2021Q3    10
2021Q4    11

Freq: Q-FEB, dtype: int64

# quarterly groupby mean aggregation

ts_series.groupby(ts_series.index).mean()

[Out]:

2020Q4     0.5
2021Q1     3.0
2021Q2     6.0
2021Q3     9.0
2021Q4    11.0
Freq: Q-FEB, dtype: float64

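For reference, below is the quarterly mean computed by downsampling with the resample() method. The values match the groupby() on to_period(freq='Q-FEB') above, but the labels show different years (2020 vs. 2021): resample() labels each bin with the timestamp at which the quarter ends, whereas the period labels use the fiscal year.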

ts = pd.date_range('2020-01-01', periods = 12, freq='M')

ts_series = pd.Series(range(len(ts)), index=ts)

ts_series.resample('Q-FEB').mean()

[Out]:

2020-02-29     0.5
2020-05-31     3.0
2020-08-31     6.0
2020-11-30     9.0
2021-02-28    11.0
Freq: Q-FEB, dtype: float64

Setting the kind='period' option in resample() gives the same result as grouping by ts.to_period(freq='Q-FEB').

ts_series.resample('Q-FEB', kind='period').mean()

[Out]:

2020Q4     0.5
2021Q1     3.0
2021Q2     6.0
2021Q3     9.0
2021Q4    11.0
Freq: Q-FEB, dtype: float64
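The two labelings describe the same quarters: converting each fiscal-quarter period to the date on which it ends gives the timestamp labels that resample() printed. The sketch below shows this with to_period() and groupby() only; it builds the month-end dates from a period range, which is a choice made here rather than the post's own code.

```python
import pandas as pd

# month-end dates for January to December 2020
month_ends = pd.period_range('2020-01', periods=12, freq='M') \
    .to_timestamp(how='end').normalize()
ts_series = pd.Series(range(len(month_ends)), index=month_ends)

# group by fiscal quarter (fiscal year ending in February)
quarters = ts_series.index.to_period('Q-FEB')
by_quarter = ts_series.groupby(quarters).mean()

# the calendar date on which each fiscal quarter ends
quarter_end_dates = by_quarter.index.to_timestamp(how='end').normalize()

print(by_quarter)
print(quarter_end_dates)
```

For example, 2021Q1 ends on 2020-05-31, so "2021Q1" and "2020-05-31" are two names for the same bin.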

I hope this was helpful.


[Python pandas] Checking, setting, and converting time zones (time zone generation, localization, conversion)

Python Analysis and Programming / Python Data Preprocessing, 2019-12-28 18:13

This post shows how to

(1) check time zones in Python

(2) set the time zone when generating dates and times with pandas date_range()

(time zone setting)

(3) convert from a naive state, with no time zone information, to a local time zone

(convert from naive timezone to localized timezone)

(4) convert a DatetimeIndex from one time zone to another

(convert from a timezone to another timezone)

(1) Checking time zones in Python

If your work spans several countries and time zones, or you are in a country that observes daylight saving time (Daylight Saving Time, DST, in the US; Summer Time in the UK), such as the US, Canada, most European countries, or parts of Australia, having to account for time zones in your code is a headache.

For that reason UTC (Coordinated Universal Time, the successor to Greenwich Mean Time, GMT) is widely used as the international standard for national and regional time zones. The map below shows the time zones by country; the center of the map, passing through the Greenwich observatory in the UK, is the UTC zone.

For reference, Korea and Japan are in the UTC+9 time zone (central Australia is at UTC+9:30).

[ Standard Time Zones of the World ]

* Source: https://en.wikipedia.org/wiki/Coordinated_Universal_Time#/media/File:World_Time_Zones_Map.png

In Python, the pytz library provides time zone information, and pandas wraps pytz to handle time zones.

Let's look at the time zone names in Asia.

# time zone information

import pytz

# regular expression in Python

import re

# regular expression for pattern containing 'Asia' texts

pattern = re.compile(r'^Asia')

# list comprehension for selecting 'Asia****' time zones

tz_asia = [x for x in pytz.common_timezones if pattern.match(x)]

tz_asia

[Out]:

['Asia/Aden',
 'Asia/Almaty',
 'Asia/Amman',
 'Asia/Anadyr',
 'Asia/Aqtau',
 'Asia/Aqtobe',
 'Asia/Ashgabat',
 'Asia/Atyrau',
 'Asia/Baghdad',
 'Asia/Bahrain',
 'Asia/Baku',
 'Asia/Bangkok',
 'Asia/Barnaul',
 'Asia/Beirut',
 'Asia/Bishkek',
 'Asia/Brunei',
 'Asia/Chita',
 'Asia/Choibalsan',
 'Asia/Colombo',
 'Asia/Damascus',
 'Asia/Dhaka',
 'Asia/Dili',
 'Asia/Dubai',
 'Asia/Dushanbe',
 'Asia/Famagusta',
 'Asia/Gaza',
 'Asia/Hebron',
 'Asia/Ho_Chi_Minh',
 'Asia/Hong_Kong',
 'Asia/Hovd',
 'Asia/Irkutsk',
 'Asia/Jakarta',
 'Asia/Jayapura',
 'Asia/Jerusalem',
 'Asia/Kabul',
 'Asia/Kamchatka',
 'Asia/Karachi',
 'Asia/Kathmandu',
 'Asia/Khandyga',
 'Asia/Kolkata',
 'Asia/Krasnoyarsk',
 'Asia/Kuala_Lumpur',
 'Asia/Kuching',
 'Asia/Kuwait',
 'Asia/Macau',
 'Asia/Magadan',
 'Asia/Makassar',
 'Asia/Manila',
 'Asia/Muscat',
 'Asia/Nicosia',
 'Asia/Novokuznetsk',
 'Asia/Novosibirsk',
 'Asia/Omsk',
 'Asia/Oral',
 'Asia/Phnom_Penh',
 'Asia/Pontianak',
 'Asia/Pyongyang',
 'Asia/Qatar',
 'Asia/Qostanay',
 'Asia/Qyzylorda',
 'Asia/Riyadh',
 'Asia/Sakhalin',
 'Asia/Samarkand',
 'Asia/Seoul',
 'Asia/Shanghai',
 'Asia/Singapore',
 'Asia/Srednekolymsk',
 'Asia/Taipei',
 'Asia/Tashkent',
 'Asia/Tbilisi',
 'Asia/Tehran',
 'Asia/Thimphu',
 'Asia/Tokyo',
 'Asia/Tomsk',
 'Asia/Ulaanbaatar',
 'Asia/Urumqi',
 'Asia/Ust-Nera',
 'Asia/Vientiane',
 'Asia/Vladivostok',
 'Asia/Yakutsk',
 'Asia/Yangon',
 'Asia/Yekaterinburg',
 'Asia/Yerevan']

Below are the time zone objects for Seoul (Korea), Singapore, Shanghai (China), and Tokyo (Japan).

# UTC: coordinated universal time

pytz.timezone('UTC')

[Out]: <UTC>

pytz.timezone('Asia/Seoul')

[Out]: <DstTzInfo 'Asia/Seoul' LMT+8:28:00 STD>

pytz.timezone('Asia/Singapore')

[Out]: <DstTzInfo 'Asia/Singapore' LMT+6:55:00 STD>

pytz.timezone('Asia/Shanghai')

[Out]: <DstTzInfo 'Asia/Shanghai' LMT+8:06:00 STD>

pytz.timezone('Asia/Tokyo')

[Out]: <DstTzInfo 'Asia/Tokyo' LMT+9:19:00 STD>
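The LMT offsets in the repr above (for example LMT+8:28:00 for Seoul) are, as far as I understand pytz, the zone's earliest historical entry (local mean time), not the offset in effect today. To get the offset that applies at a particular moment, localize a datetime and ask for its utcoffset(). A minimal sketch; the date and the `offsets` dictionary are chosen here for illustration:

```python
from datetime import datetime, timedelta
import pytz

when = datetime(2019, 12, 28)  # naive datetime to interpret in each zone

offsets = {}
for name in ('Asia/Seoul', 'Asia/Singapore', 'Asia/Shanghai', 'Asia/Tokyo'):
    tz = pytz.timezone(name)
    # localize() attaches the zone and picks the offset valid on that date
    offsets[name] = tz.localize(when).utcoffset()
    print(name, offsets[name])
```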

(2) Generating date ranges with a time zone

When creating a DatetimeIndex with the pandas date_range() function, the tz='time_zone_name' option sets the time zone. The example below generates 4 days of dates starting from 2019-12-28 in the 'Asia/Seoul' time zone.

import pandas as pd

ts_seoul = pd.date_range('2019-12-28', periods=4, freq='D', tz='Asia/Seoul')

ts_seoul

[Out]:

DatetimeIndex(['2019-12-28 00:00:00+09:00', '2019-12-29 00:00:00+09:00',
               '2019-12-30 00:00:00+09:00', '2019-12-31 00:00:00+09:00'],
              dtype='datetime64[ns, Asia/Seoul]', freq='D')

ts_seoul_series = pd.Series(range(len(ts_seoul)), index = ts_seoul)

ts_seoul_series.index.tz

[Out]:

<DstTzInfo 'Asia/Seoul' LMT+8:28:00 STD>

(3) Setting a local time zone on naive timestamps

(convert from naive to localized time zone)

A DatetimeIndex created with the pandas date_range() function is by default naive, that is, it has no time zone. To assign the time zone of a particular country or region to such naive timestamps, use the tz_localize('timezone_name') method.

# timezone-naive timestamps

ts_naive = pd.date_range('2019-12-28', periods=6, freq='D')

ts_naive

[Out]:

DatetimeIndex(['2019-12-28', '2019-12-29', '2019-12-30', '2019-12-31',
               '2020-01-01', '2020-01-02'],
              dtype='datetime64[ns]', freq='D')

# localize timezone to 'Asia/Seoul' using tz_localize() methods

ts_local_seoul = ts_naive.tz_localize('Asia/Seoul')

ts_local_seoul

[Out]:

DatetimeIndex(['2019-12-28 00:00:00+09:00', '2019-12-29 00:00:00+09:00',
               '2019-12-30 00:00:00+09:00', '2019-12-31 00:00:00+09:00',
               '2020-01-01 00:00:00+09:00', '2020-01-02 00:00:00+09:00'],
              dtype='datetime64[ns, Asia/Seoul]', freq='D')

If you call the tz_convert('timezone_name') method on naive timestamps to set a time zone, you get 'TypeError: Cannot convert tz-naive timestamps, use tz_localize to localize'.

# TypeError: Cannot convert tz-naive timestamps, use tz_localize to localize

ts_naive.tz_convert('Asia/Seoul')

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-69-270d3596ed05> in <module>
----> 1 ts_naive.tz_convert('Asia/Seoul')

--- 중간 생략 ---

~/anaconda3/envs/py3.6_tf2.0/lib/python3.6/site-packages/pandas/core/arrays/datetimes.py in tz_convert(self, tz)
    958             # tz naive, use tz_localize
    959             raise TypeError(
--> 960                 "Cannot convert tz-naive timestamps, use " "tz_localize to localize"
    961             )
    962 

TypeError: Cannot convert tz-naive timestamps, use tz_localize to localize
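In code that may receive either naive or time-zone-aware indexes, one way to avoid this error is to check the tz attribute first. The helper below, `to_timezone`, is hypothetical (it is not part of pandas), and its default of treating naive timestamps as UTC is an assumption that has to match how the data was actually recorded.

```python
import pandas as pd

def to_timezone(index, tz, assume_tz='UTC'):
    """Hypothetical helper: localize a naive index to assume_tz, then convert to tz."""
    if index.tz is None:
        index = index.tz_localize(assume_tz)
    return index.tz_convert(tz)

ts_naive = pd.date_range('2019-12-28', periods=2, freq='D')

# calling tz_convert() directly on a naive index raises TypeError
try:
    ts_naive.tz_convert('Asia/Seoul')
    raised = False
except TypeError:
    raised = True

seoul = to_timezone(ts_naive, 'Asia/Seoul')
print(raised)
print(seoul)
```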

(4) Converting from one time zone to another

The example below uses the tz_convert('Asia/Singapore') method to convert from the 'Asia/Seoul' time zone to the 'Asia/Singapore' time zone.

# timezone 'Asia/Seoul'

ts_seoul = pd.date_range('2019-12-28', periods=4, freq='D', tz='Asia/Seoul')

ts_seoul

DatetimeIndex(['2019-12-28 00:00:00+09:00', '2019-12-29 00:00:00+09:00',
               '2019-12-30 00:00:00+09:00', '2019-12-31 00:00:00+09:00'],
              dtype='datetime64[ns, Asia/Seoul]', freq='D')

# convert from 'Asia/Seoul' to 'Asia/Singapore' using tz_convert()

ts_singapore = ts_seoul.tz_convert('Asia/Singapore')

ts_singapore

[Out]:

DatetimeIndex(['2019-12-27 23:00:00+08:00', '2019-12-28 23:00:00+08:00',
               '2019-12-29 23:00:00+08:00', '2019-12-30 23:00:00+08:00'],
              dtype='datetime64[ns, Asia/Singapore]', freq='D')
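tz_convert() changes only how the timestamps are displayed; the underlying instants stay the same. A short sketch that checks this by comparing the two indexes element by element:

```python
import pandas as pd

ts_seoul = pd.date_range('2019-12-28', periods=4, freq='D', tz='Asia/Seoul')
ts_singapore = ts_seoul.tz_convert('Asia/Singapore')

# time-zone-aware timestamps compare by instant, not by wall-clock text
same_instant = bool((ts_seoul == ts_singapore).all())
print(same_instant)

# both correspond to the same UTC times
print(ts_seoul.tz_convert('UTC'))
```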

I hope this was helpful.


[ Python Timezone Names ]

import pytz

pytz.common_timezones

Africa	America
['Africa/Abidjan', 'Africa/Accra', 'Africa/Addis_Ababa', 'Africa/Algiers', 'Africa/Asmara', 'Africa/Bamako', 'Africa/Bangui', 'Africa/Banjul', 'Africa/Bissau', 'Africa/Blantyre', 'Africa/Brazzaville', 'Africa/Bujumbura', 'Africa/Cairo', 'Africa/Casablanca', 'Africa/Ceuta', 'Africa/Conakry', 'Africa/Dakar', 'Africa/Dar_es_Salaam', 'Africa/Djibouti', 'Africa/Douala', 'Africa/El_Aaiun', 'Africa/Freetown', 'Africa/Gaborone', 'Africa/Harare', 'Africa/Johannesburg', 'Africa/Juba', 'Africa/Kampala', 'Africa/Khartoum', 'Africa/Kigali', 'Africa/Kinshasa', 'Africa/Lagos', 'Africa/Libreville', 'Africa/Lome', 'Africa/Luanda', 'Africa/Lubumbashi', 'Africa/Lusaka', 'Africa/Malabo', 'Africa/Maputo', 'Africa/Maseru', 'Africa/Mbabane', 'Africa/Mogadishu', 'Africa/Monrovia', 'Africa/Nairobi', 'Africa/Ndjamena', 'Africa/Niamey', 'Africa/Nouakchott', 'Africa/Ouagadougou', 'Africa/Porto-Novo', 'Africa/Sao_Tome', 'Africa/Tripoli', 'Africa/Tunis', 'Africa/Windhoek']	['America/Adak', 'America/Anchorage', 'America/Anguilla', 'America/Antigua', 'America/Araguaina', 'America/Argentina/Buenos_Aires', 'America/Argentina/Catamarca', 'America/Argentina/Cordoba', 'America/Argentina/Jujuy', 'America/Argentina/La_Rioja', 'America/Argentina/Mendoza', 'America/Argentina/Rio_Gallegos', 'America/Argentina/Salta', 'America/Argentina/San_Juan', 'America/Argentina/San_Luis', 'America/Argentina/Tucuman', 'America/Argentina/Ushuaia', 'America/Aruba', 'America/Asuncion', 'America/Atikokan', 'America/Bahia', 'America/Bahia_Banderas', 'America/Barbados', 'America/Belem', 'America/Belize', 'America/Blanc-Sablon', 'America/Boa_Vista', 'America/Bogota', 'America/Boise', 'America/Cambridge_Bay', 'America/Campo_Grande', 'America/Cancun', 'America/Caracas', 'America/Cayenne', 'America/Cayman', 'America/Chicago', 'America/Chihuahua', 'America/Costa_Rica', 'America/Creston', 'America/Cuiaba', 'America/Curacao', 'America/Danmarkshavn', 'America/Dawson', 'America/Dawson_Creek', 'America/Denver', 'America/Detroit', 
'America/Dominica', 'America/Edmonton', 'America/Eirunepe', 'America/El_Salvador', 'America/Fort_Nelson', 'America/Fortaleza', 'America/Glace_Bay', 'America/Godthab', 'America/Goose_Bay', 'America/Grand_Turk', 'America/Grenada', 'America/Guadeloupe', 'America/Guatemala', 'America/Guayaquil', 'America/Guyana', 'America/Halifax', 'America/Havana', 'America/Hermosillo', 'America/Indiana/Indianapolis', 'America/Indiana/Knox', 'America/Indiana/Marengo', 'America/Indiana/Petersburg', 'America/Indiana/Tell_City', 'America/Indiana/Vevay', 'America/Indiana/Vincennes', 'America/Indiana/Winamac', 'America/Inuvik', 'America/Iqaluit', 'America/Jamaica', 'America/Juneau', 'America/Kentucky/Louisville', 'America/Kentucky/Monticello', 'America/Kralendijk', 'America/La_Paz', 'America/Lima', 'America/Los_Angeles', 'America/Lower_Princes', 'America/Maceio', 'America/Managua', 'America/Manaus', 'America/Marigot', 'America/Martinique', 'America/Matamoros', 'America/Mazatlan', 'America/Menominee', 'America/Merida', 'America/Metlakatla', 'America/Mexico_City', 'America/Miquelon', 'America/Moncton', 'America/Monterrey', 'America/Montevideo', 'America/Montserrat', 'America/Nassau', 'America/New_York', 'America/Nipigon', 'America/Nome', 'America/Noronha', 'America/North_Dakota/Beulah', 'America/North_Dakota/Center', 'America/North_Dakota/New_Salem', 'America/Ojinaga', 'America/Panama', 'America/Pangnirtung', 'America/Paramaribo', 'America/Phoenix', 'America/Port-au-Prince', 'America/Port_of_Spain', 'America/Porto_Velho', 'America/Puerto_Rico', 'America/Punta_Arenas', 'America/Rainy_River', 'America/Rankin_Inlet', 'America/Recife', 'America/Regina', 'America/Resolute', 'America/Rio_Branco', 'America/Santarem', 'America/Santiago', 'America/Santo_Domingo', 'America/Sao_Paulo', 'America/Scoresbysund', 'America/Sitka', 'America/St_Barthelemy', 'America/St_Johns', 'America/St_Kitts', 'America/St_Lucia', 'America/St_Thomas', 'America/St_Vincent', 'America/Swift_Current', 'America/Tegucigalpa', 
'America/Thule', 'America/Thunder_Bay', 'America/Tijuana', 'America/Toronto', 'America/Tortola', 'America/Vancouver', 'America/Whitehorse', 'America/Winnipeg', 'America/Yakutat', 'America/Yellowknife' ]
Antarctica
[ 'Antarctica/Casey', 'Antarctica/Davis', 'Antarctica/DumontDUrville', 'Antarctica/Macquarie', 'Antarctica/Mawson', 'Antarctica/McMurdo', 'Antarctica/Palmer', 'Antarctica/Rothera', 'Antarctica/Syowa', 'Antarctica/Troll', 'Antarctica/Vostok', 'Arctic/Longyearbyen']
Australia
[ 'Australia/Adelaide', 'Australia/Brisbane', 'Australia/Broken_Hill', 'Australia/Currie', 'Australia/Darwin', 'Australia/Eucla', 'Australia/Hobart', 'Australia/Lindeman', 'Australia/Lord_Howe', 'Australia/Melbourne', 'Australia/Perth', 'Australia/Sydney']
Canada
[ 'Canada/Atlantic', 'Canada/Central', 'Canada/Eastern', 'Canada/Mountain', 'Canada/Newfoundland', 'Canada/Pacific']
Europe
[ 'Europe/Amsterdam', 'Europe/Andorra', 'Europe/Astrakhan', 'Europe/Athens', 'Europe/Belgrade', 'Europe/Berlin', 'Europe/Bratislava', 'Europe/Brussels', 'Europe/Bucharest', 'Europe/Budapest', 'Europe/Busingen', 'Europe/Chisinau', 'Europe/Copenhagen', 'Europe/Dublin', 'Europe/Gibraltar', 'Europe/Guernsey', 'Europe/Helsinki', 'Europe/Isle_of_Man', 'Europe/Istanbul', 'Europe/Jersey', 'Europe/Kaliningrad', 'Europe/Kiev', 'Europe/Kirov', 'Europe/Lisbon', 'Europe/Ljubljana', 'Europe/London', 'Europe/Luxembourg', 'Europe/Madrid', 'Europe/Malta', 'Europe/Mariehamn', 'Europe/Minsk', 'Europe/Monaco', 'Europe/Moscow', 'Europe/Oslo', 'Europe/Paris', 'Europe/Podgorica', 'Europe/Prague', 'Europe/Riga', 'Europe/Rome', 'Europe/Samara', 'Europe/San_Marino', 'Europe/Sarajevo', 'Europe/Saratov', 'Europe/Simferopol', 'Europe/Skopje', 'Europe/Sofia', 'Europe/Stockholm', 'Europe/Tallinn', 'Europe/Tirane', 'Europe/Ulyanovsk', 'Europe/Uzhgorod', 'Europe/Vaduz', 'Europe/Vatican', 'Europe/Vienna', 'Europe/Vilnius', 'Europe/Volgograd', 'Europe/Warsaw', 'Europe/Zagreb', 'Europe/Zaporozhye', 'Europe/Zurich']
Pacific	US
['Pacific/Apia', 'Pacific/Auckland', 'Pacific/Bougainville', 'Pacific/Chatham', 'Pacific/Chuuk', 'Pacific/Easter', 'Pacific/Efate', 'Pacific/Enderbury', 'Pacific/Fakaofo', 'Pacific/Fiji', 'Pacific/Funafuti', 'Pacific/Galapagos', 'Pacific/Gambier', 'Pacific/Guadalcanal', 'Pacific/Guam', 'Pacific/Honolulu', 'Pacific/Kiritimati', 'Pacific/Kosrae', 'Pacific/Kwajalein', 'Pacific/Majuro', 'Pacific/Marquesas', 'Pacific/Midway', 'Pacific/Nauru', 'Pacific/Niue', 'Pacific/Norfolk', 'Pacific/Noumea', 'Pacific/Pago_Pago', 'Pacific/Palau', 'Pacific/Pitcairn', 'Pacific/Pohnpei', 'Pacific/Port_Moresby', 'Pacific/Rarotonga', 'Pacific/Saipan', 'Pacific/Tahiti', 'Pacific/Tarawa', 'Pacific/Tongatapu', 'Pacific/Wake', 'Pacific/Wallis']	[ 'US/Alaska', 'US/Arizona', 'US/Central', 'US/Eastern', 'US/Hawaii', 'US/Mountain', 'US/Pacific']
	Indian
	['Indian/Antananarivo', 'Indian/Chagos', 'Indian/Christmas', 'Indian/Cocos', 'Indian/Comoro', 'Indian/Kerguelen', 'Indian/Mahe', 'Indian/Maldives', 'Indian/Mauritius', 'Indian/Mayotte', 'Indian/Reunion']


[Python pandas] Computing a simple moving average of order m (simple moving average with order m)

Python Analysis and Programming / Python Data Preprocessing, 2019-12-25 18:01

In this post we will, in order,

(1) fetch Apple's 2019 stock price data from Yahoo Finance

(2) compute 5-day, 10-day, and 20-day simple moving averages of the closing price

(3) visualize the closing price and the 5/10/20-day moving averages with seaborn

(1) Fetching Apple's 2019 stock price data from Yahoo Finance

One easy way to download stock price data from Yahoo Finance is to install the yfinance library and use its download() function. Here the library is installed directly from a Jupyter Notebook cell with !pip install yfinance, then imported, and download() fetches the Apple ('AAPL') price data for 2019-01-01 through 2019-12-24.

# Install yfinance package.

!pip install yfinance

# Import yfinance

import yfinance as yf

# Get the data for the stock Apple by specifying the stock ticker, start date, and end date

aapl = yf.download('AAPL','2019-01-01','2019-12-25')

Requirement already satisfied: yfinance in /Users/ihongdon/anaconda3/envs/py3.6_tf2.0/lib/python3.6/site-packages (0.1.52)
Requirement already satisfied: multitasking>=0.0.7 in /Users/ihongdon/anaconda3/envs/py3.6_tf2.0/lib/python3.6/site-packages (from yfinance) (0.0.9)
Requirement already satisfied: numpy>=1.15 in /Users/ihongdon/anaconda3/envs/py3.6_tf2.0/lib/python3.6/site-packages (from yfinance) (1.17.3)
Requirement already satisfied: pandas>=0.24 in /Users/ihongdon/anaconda3/envs/py3.6_tf2.0/lib/python3.6/site-packages (from yfinance) (0.25.3)
Requirement already satisfied: requests>=2.20 in /Users/ihongdon/anaconda3/envs/py3.6_tf2.0/lib/python3.6/site-packages (from yfinance) (2.22.0)
Requirement already satisfied: pytz>=2017.2 in /Users/ihongdon/anaconda3/envs/py3.6_tf2.0/lib/python3.6/site-packages (from pandas>=0.24->yfinance) (2019.3)
Requirement already satisfied: python-dateutil>=2.6.1 in /Users/ihongdon/anaconda3/envs/py3.6_tf2.0/lib/python3.6/site-packages (from pandas>=0.24->yfinance) (2.8.1)
Requirement already satisfied: certifi>=2017.4.17 in /Users/ihongdon/anaconda3/envs/py3.6_tf2.0/lib/python3.6/site-packages (from requests>=2.20->yfinance) (2019.11.28)
Requirement already satisfied: idna<2.9,>=2.5 in /Users/ihongdon/anaconda3/envs/py3.6_tf2.0/lib/python3.6/site-packages (from requests>=2.20->yfinance) (2.8)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /Users/ihongdon/anaconda3/envs/py3.6_tf2.0/lib/python3.6/site-packages (from requests>=2.20->yfinance) (1.25.7)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /Users/ihongdon/anaconda3/envs/py3.6_tf2.0/lib/python3.6/site-packages (from requests>=2.20->yfinance) (3.0.4)
Requirement already satisfied: six>=1.5 in /Users/ihongdon/anaconda3/envs/py3.6_tf2.0/lib/python3.6/site-packages (from python-dateutil>=2.6.1->pandas>=0.24->yfinance) (1.13.0)
[*********************100%***********************]  1 of 1 completed

aapl.head()

[Out]:

	Open	High	Low	Close	Adj Close	Volume
Date
2018-12-31	158.529999	159.360001	156.479996	157.740005	155.405045	35003500
2019-01-02	154.889999	158.850006	154.229996	157.919998	155.582367	37039700
2019-01-03	143.979996	145.720001	142.000000	142.190002	140.085220	91312200
2019-01-04	144.529999	148.550003	143.800003	148.259995	146.065353	58607100
2019-01-07	148.699997	148.830002	145.899994	147.929993	145.740265	54777800

(2) Computing 5-day, 10-day, and 20-day moving averages of the closing price (Close)

Let's compute moving averages on the closing price (Close) from Apple's stock data.

aapl.Close[:10]

[Out]:

Date
2018-12-31    157.740005
2019-01-02    157.919998
2019-01-03    142.190002
2019-01-04    148.259995
2019-01-07    147.929993
2019-01-08    150.750000
2019-01-09    153.309998
2019-01-10    153.800003
2019-01-11    152.289993
2019-01-14    150.000000
Name: Close, dtype: float64

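Moving averages are commonly used as a preprocessing step to remove noise from time series data, or to strip out the seasonal component of a seasonal series so that the long-term trend factor or the medium-term cycle factor can be examined. They are also used for time series forecasting.

Moving averages come in two kinds: the simple moving average (SMA), which ignores weights (that is, it assumes every value has the same weight), and the weighted moving average (WMA), which assigns weights. This post covers only the simple moving average (SMA).

A simple moving average of order m can be further divided into the centered moving average and the trailing moving average (the original post illustrates the difference with a comparison figure). In this post I use the trailing moving average, which is what pandas rolling() computes by default, to calculate simple moving averages with windows of 5, 10, and 20 days.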
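To make the centered/trailing distinction concrete, here is a minimal sketch on a toy Series (the values are made up for illustration and are not the AAPL prices). It assumes the `center=True` option of pandas `rolling()`, which places the window around the current observation instead of ending at it.

```python
import pandas as pd

# Toy data (hypothetical values, not from the post)
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])

# Trailing moving average: the window ends at the current observation
trailing = s.rolling(window=3).mean()

# Centered moving average: the window is centered on the current observation
centered = s.rolling(window=3, center=True).mean()

comparison = pd.DataFrame({'value': s, 'trailing': trailing, 'centered': centered})
print(comparison)
```

With the trailing version the missing values pile up at the start of the series, while the centered version leaves them at both ends.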

I will show two ways to compute the moving average: a manual approach using a for loop with numpy.mean(), and a more convenient approach using rolling(window=m).mean() from the pandas library.

(2-1) 5-day moving average using a for loop and numpy.mean()

import numpy as np

for i in range(0, 6):
    # closing prices in the 5-row window that ends at position i+4
    stock_close_5days = aapl.Close[i:(i+5)]
    sma_5d = np.mean(stock_close_5days)
    print('SMA(5 Days Window) of', aapl.Close.index[i+4].date(), ':', sma_5d)


[Out]:
SMA(5 Days Window) of 2019-01-07 : 150.80799865722656
SMA(5 Days Window) of 2019-01-08 : 149.40999755859374
SMA(5 Days Window) of 2019-01-09 : 148.48799743652344
SMA(5 Days Window) of 2019-01-10 : 150.80999755859375
SMA(5 Days Window) of 2019-01-11 : 151.61599731445312
SMA(5 Days Window) of 2019-01-14 : 152.02999877929688

(2-2) 5-day moving average using pandas rolling(window=5).mean()

A trailing moving average of order m produces m-1 missing values at the start of the series.

import pandas as pd

sma_5d = aapl.Close.rolling(window=5).mean()

sma_5d[:10]

[Out]:

Date
2018-12-31           NaN
2019-01-02           NaN
2019-01-03           NaN
2019-01-04           NaN
2019-01-07    150.807999
2019-01-08    149.409998
2019-01-09    148.487997
2019-01-10    150.809998
2019-01-11    151.615997
2019-01-14    152.029999
Name: Close, dtype: float64
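The m-1 leading missing values can be checked on a small toy Series (hypothetical values). The sketch also shows the `min_periods` option of `rolling()`, which is one way to get partial-window averages instead of NaN if that suits your analysis; the post itself does not use it.

```python
import pandas as pd

# Toy data (hypothetical values)
s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])
m = 3

# Default: a value is produced only once the window holds m observations
sma = s.rolling(window=m).mean()
n_missing = int(sma.isna().sum())   # expected to equal m - 1

# Optional: allow partial windows at the start of the series
sma_partial = s.rolling(window=m, min_periods=1).mean()

print(sma)
print(sma_partial)
```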

Now that we know the pandas rolling() method for moving averages, let's compute simple trailing moving averages with order (window) 5, 10, and 20 days.

# simple trailing moving average with window 5 days/ 10 days/ 20 days

df_sma = pd.DataFrame({
    'close': aapl.Close,
    'sma_5d': aapl.Close.rolling(window=5).mean(),
    'sma_10d': aapl.Close.rolling(window=10).mean(),
    'sma_20d': aapl.Close.rolling(window=20).mean(),
})

# top first 25 rows

df_sma[:25]

[Out]:

	close	sma_5d	sma_10d	sma_20d
Date
2018-12-31	157.740005	NaN	NaN	NaN
2019-01-02	157.919998	NaN	NaN	NaN
2019-01-03	142.190002	NaN	NaN	NaN
2019-01-04	148.259995	NaN	NaN	NaN
2019-01-07	147.929993	150.807999	NaN	NaN
2019-01-08	150.750000	149.409998	NaN	NaN
2019-01-09	153.309998	148.487997	NaN	NaN
2019-01-10	153.800003	150.809998	NaN	NaN
2019-01-11	152.289993	151.615997	NaN	NaN
2019-01-14	150.000000	152.029999	151.418999	NaN
2019-01-15	153.070007	152.494000	150.951999	NaN
2019-01-16	154.940002	152.820001	150.653999	NaN
2019-01-17	155.860001	153.232001	152.020999	NaN
2019-01-18	156.820007	154.138004	152.877000	NaN
2019-01-22	153.300003	154.798004	153.414001	NaN
2019-01-23	153.919998	154.968002	153.731001	NaN
2019-01-24	152.699997	154.520001	153.670001	NaN
2019-01-25	157.759995	154.900000	154.066000	NaN
2019-01-28	156.300003	154.795999	154.467001	NaN
2019-01-29	154.679993	155.071997	154.935001	153.177000
2019-01-30	165.250000	157.337997	156.153000	153.552499
2019-01-31	166.440002	160.085999	157.303000	153.978500
2019-02-01	166.520004	161.838000	158.369000	155.195000
2019-02-04	171.250000	164.828000	159.812000	156.344500
2019-02-05	174.179993	168.728000	161.899998	157.657000

(3) Visualizing the closing price and the 5-day/10-day/20-day moving averages with Matplotlib

A trailing moving average of order m leaves m-1 missing values (NaN) at the start. For plotting, I drop the rows that contain missing values.

df_sma.dropna(axis=0, inplace=True)

df_sma.head(10)

[Out]:

	close	sma_5d	sma_10d	sma_20d
Date
2019-01-29	154.679993	155.071997	154.935001	153.177000
2019-01-30	165.250000	157.337997	156.153000	153.552499
2019-01-31	166.440002	160.085999	157.303000	153.978500
2019-02-01	166.520004	161.838000	158.369000	155.195000
2019-02-04	171.250000	164.828000	159.812000	156.344500
2019-02-05	174.179993	168.728000	161.899998	157.657000
2019-02-06	174.240005	170.526001	163.931999	158.831500
2019-02-07	170.940002	171.426001	165.756000	159.713000
2019-02-08	170.410004	172.204001	167.021001	160.543501
2019-02-11	169.429993	171.839999	168.334000	161.400500

Let's draw a line plot of the closing price (Close) and the 5-day, 10-day, and 20-day moving averages with Matplotlib.

# line plot with moving average of 5 window, 10 window, 20 window

import matplotlib.pyplot as plt

plt.figure(figsize=(15, 10))

plt.plot(df_sma.index, df_sma.close, 'y-', label='close_price')

plt.plot(df_sma.index, df_sma.sma_5d, 'b-', label='sma_5d')

plt.plot(df_sma.index, df_sma.sma_10d, 'r-', label='sma_10d')

plt.plot(df_sma.index, df_sma.sma_20d, 'g-', label='sma_20d')

plt.legend()

plt.show()

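In the full-year plot above the lines overlap and are hard to tell apart, so let's redraw the line plot using only the first 20 observations, which fall mostly in February.

As the plot below shows, the moving averages trail behind the original closing price (Close). The larger the order (rolling window), the more slowly the moving average follows.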

plt.figure(figsize=(15, 10))

plt.plot(df_sma.index[:20], df_sma.close[:20], 'yo-', label='close_price')

plt.plot(df_sma.index[:20], df_sma.sma_5d[:20], 'bo-', label='sma_5d')

plt.plot(df_sma.index[:20], df_sma.sma_10d[:20], 'ro-', label='sma_10d')

plt.plot(df_sma.index[:20], df_sma.sma_20d[:20], 'go-', label='sma_20d')

plt.legend()

plt.show()
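The lag described above can be quantified on an idealized series. The sketch below is an illustration under a simplifying assumption (a perfectly linear ramp rather than real prices): on such a ramp, a trailing SMA of order m sits exactly (m-1)/2 steps behind the current value, so a larger window lags more.

```python
import numpy as np
import pandas as pd

# Idealized linear trend: value at step t is t (hypothetical data)
ramp = pd.Series(np.arange(30, dtype=float))

lags = {}
for m in (5, 10, 20):
    sma = ramp.rolling(window=m).mean()
    # gap between the current value and its trailing average
    gap = (ramp - sma).dropna()
    lags[m] = float(gap.iloc[-1])

print(lags)  # each gap equals (m - 1) / 2 on a linear ramp
```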

I hope this was helpful.


Posted by Rfriend

[Python pandas] Creating fixed-frequency time series Series and DataFrames

Python Analysis and Programming / Python Data Preprocessing, 2019. 12. 24. 18:29

The previous post covered how to check for and handle duplicated time series indices in pandas Series and DataFrames (https://rfriend.tistory.com/500).

This post shows how to create pandas Series and DataFrames that hold fixed-frequency time series.

[ Characteristics of time series data ]

Equally spaced date-time index (equally spaced time interval, fixed frequency)
Date-time index with no duplicates and no gaps (no redundant values or gaps)
Sorted in time order (sequential order)

(* Time series data does not necessarily have an equally spaced date-time index. For example, stock price data exists only on business days when the market is open, and there is no data on holidays.)
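The three characteristics listed above can be checked programmatically. The following is a minimal sketch on a hypothetical index with two dates removed; `is_monotonic_increasing`, `is_unique`, and `pd.infer_freq()` are standard pandas tools, and `infer_freq()` returns None when no regular frequency can be inferred.

```python
import pandas as pd

# Hypothetical daily index with two dates removed on purpose
idx = pd.date_range('2019-12-01', periods=10).drop(
    pd.DatetimeIndex(['2019-12-04', '2019-12-08']))

is_sorted = idx.is_monotonic_increasing   # sequential order?
is_unique = idx.is_unique                 # no duplicated timestamps?
inferred = pd.infer_freq(idx)             # None when spacing is irregular

# Dates that a complete daily index would contain but this one lacks
full = pd.date_range(idx.min(), idx.max(), freq='D')
missing = full.difference(idx)

print(is_sorted, is_unique, inferred)
print(missing)
```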

(1) Creating an equally spaced time series Series (fixed frequency time series pandas Series)

(1-1) Creating a time series Series with missing dates (non-equally spaced time series)

First, let's build a simple time series pandas Series to use as an example. I deliberately dropped the date-time index entries for '2019-12-04' and '2019-12-08' to create an index with gaps.

import pandas as pd

# generate dates from 2019-12-01 to 2019-12-10

date_idx = pd.date_range('2019-12-01', periods=10)

date_idx

[Out]:
DatetimeIndex(['2019-12-01', '2019-12-02', '2019-12-03', '2019-12-04',
               '2019-12-05', '2019-12-06', '2019-12-07', '2019-12-08',
               '2019-12-09', '2019-12-10'],
              dtype='datetime64[ns]', freq='D')

# drop 2 dates from DatetimeIndex

date_idx = date_idx.drop(pd.DatetimeIndex(['2019-12-04', '2019-12-08']))

date_idx

[Out]:

DatetimeIndex(['2019-12-01', '2019-12-02', '2019-12-03', '2019-12-05',
               '2019-12-06', '2019-12-07', '2019-12-09', '2019-12-10'],
              dtype='datetime64[ns]', freq=None)

# Time Series with missing dates index

series_ts_missing = pd.Series(range(len(date_idx)), index=date_idx)

series_ts_missing

[Out]:

2019-12-01    0
2019-12-02    1
2019-12-03    2
2019-12-05    3
2019-12-06    4
2019-12-07    5
2019-12-09    6
2019-12-10    7
dtype: int64

(1-2) Converting the gapped time series into an equally spaced pandas Series

(fixed frequency, equally spaced time interval time series)

The Series from (1-1) is missing the index entries for '2019-12-04' and '2019-12-08'. Here I use resample('D') to fill in the index with equally spaced dates, while the values for the added dates become missing values (NaN, Not a Number).

# Create a 1 day Fixed Frequency Time Series using resample('D')

series_ts_fixed_freq = series_ts_missing.resample('D')

series_ts_fixed_freq.first()

[Out]:

2019-12-01    0.0
2019-12-02    1.0
2019-12-03    2.0
2019-12-04    NaN <---
2019-12-05    3.0
2019-12-06    4.0
2019-12-07    5.0
2019-12-08    NaN <---
2019-12-09    6.0
2019-12-10    7.0
Freq: D, dtype: float64
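As a side note, when no aggregation is needed, `asfreq('D')` is another way to obtain the same regular daily index. The sketch below rebuilds the example Series so that it is self-contained, then compares the two approaches; it is an alternative for illustration, not the method used in this post.

```python
import pandas as pd

# Rebuild the gapped example Series
date_idx = pd.date_range('2019-12-01', periods=10).drop(
    pd.DatetimeIndex(['2019-12-04', '2019-12-08']))
series_ts_missing = pd.Series(range(len(date_idx)), index=date_idx)

# Reindex to a regular daily frequency; new dates get NaN
via_asfreq = series_ts_missing.asfreq('D')

# Same result through resample, taking the first value of each daily bin
via_resample = series_ts_missing.resample('D').first()

print(via_asfreq)
```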

Filling the gaps in the date-time index created 'NaN' values; let's replace them with 0 using fillna(0).

# fill missing value with '0'

series_ts_fixed_freq.first().fillna(0)

[Out]:

2019-12-01    0.0
2019-12-02    1.0
2019-12-03    2.0
2019-12-04    0.0 <---
2019-12-05    3.0
2019-12-06    4.0
2019-12-07    5.0
2019-12-08    0.0 <---
2019-12-09    6.0
2019-12-10    7.0
Freq: D, dtype: float64
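Filling with 0 is only one option, and which fill is appropriate depends on what the values mean. As a hedged illustration (not part of the original post), the sketch below compares fillna(0) with forward fill and linear interpolation on the same example data.

```python
import pandas as pd

# Rebuild the gapped example Series and make it regular
date_idx = pd.date_range('2019-12-01', periods=10).drop(
    pd.DatetimeIndex(['2019-12-04', '2019-12-08']))
regular = pd.Series(range(len(date_idx)), index=date_idx).asfreq('D')

filled_zero = regular.fillna(0)        # replace NaN with 0
filled_ffill = regular.ffill()         # carry the previous value forward
filled_linear = regular.interpolate()  # linear interpolation between neighbors

print(pd.DataFrame({'zero': filled_zero,
                    'ffill': filled_ffill,
                    'linear': filled_linear}))
```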

This time, let's use resample('10T') to build a time series with an equally spaced 10-minute date-time index. Again, for index entries that did not exist in the original data, the values are treated as missing and filled with 'NaN'.

# resampling with 10 minutes frequency (interval)

series_ts_missing.resample('10T').first()

[Out]:

2019-12-01 00:00:00    0.0
2019-12-01 00:10:00    NaN
2019-12-01 00:20:00    NaN
2019-12-01 00:30:00    NaN
2019-12-01 00:40:00    NaN
                      ...
2019-12-09 23:20:00    NaN
2019-12-09 23:30:00    NaN
2019-12-09 23:40:00    NaN
2019-12-09 23:50:00    NaN
2019-12-10 00:00:00    7.0
Freq: 10T, Length: 1297, dtype: float64

(2) Creating an equally spaced time series DataFrame

(fixed frequency time series pandas DataFrame)

(2-1) Creating a time series DataFrame with missing dates (non-equally spaced time series DataFrame)

I created an equally spaced 10-day date-time index with pd.date_range(), then removed '2019-12-04' and '2019-12-08' with drop(pd.DatetimeIndex()) to produce an index with gaps.

import pandas as pd

# generate dates from 2019-12-01 to 2019-12-10

date_idx = pd.date_range('2019-12-01', periods=10)

# drop 2 dates from DatetimeIndex

date_idx = date_idx.drop(pd.DatetimeIndex(['2019-12-04', '2019-12-08']))

date_idx

[Out]:
DatetimeIndex(['2019-12-01', '2019-12-02', '2019-12-03', '2019-12-05',
               '2019-12-06', '2019-12-07', '2019-12-09', '2019-12-10'],
              dtype='datetime64[ns]', freq=None)

df_ts_missing = pd.DataFrame(range(len(date_idx)),
                             columns=['col'],
                             index=date_idx)

df_ts_missing

[Out]:

	col
2019-12-01	0
2019-12-02	1
2019-12-03	2
2019-12-05	3
2019-12-06	4
2019-12-07	5
2019-12-09	6
2019-12-10	7

(2-2) Converting the gapped time series into an equally spaced pandas DataFrame

(fixed frequency, equally spaced time interval time series pandas DataFrame)

I used the resample('D') method to build a time series DataFrame with an equally spaced daily date-time index. For index entries that were not in the original data, the values are treated as missing and set to 'NaN'.

df_ts_fixed_freq = df_ts_missing.resample('D').first()

df_ts_fixed_freq

[Out]:

	col
2019-12-01	0.0
2019-12-02	1.0
2019-12-03	2.0
2019-12-04	NaN <---
2019-12-05	3.0
2019-12-06	4.0
2019-12-07	5.0
2019-12-08	NaN <---
2019-12-09	6.0
2019-12-10	7.0

Let's replace the 'NaN' values produced during the conversion to an equally spaced time series with 0 using the fillna(0) method.

# fill missing value with '0'

df_ts_fixed_freq = df_ts_fixed_freq.fillna(0)

df_ts_fixed_freq

	col
2019-12-01	0.0
2019-12-02	1.0
2019-12-03	2.0
2019-12-04	0.0 <---
2019-12-05	3.0
2019-12-06	4.0
2019-12-07	5.0
2019-12-08	0.0 <---
2019-12-09	6.0
2019-12-10	7.0

I hope this was helpful.


Posted by Rfriend

[Python pandas] Indexing, slicing, and selecting time series data in Series and DataFrames

Python Analysis and Programming / Python Data Preprocessing, 2019. 12. 23. 18:41

The previous post covered converting date-time objects to strings and, conversely, converting strings to date-time objects (https://rfriend.tistory.com/498).

This post shows how to index, slice, select, and truncate by date-time in pandas Series and DataFrames whose index is a date-time time series.

(1) Indexing, slicing, selection, and truncation of time series data in a pandas Series

(2) Indexing, slicing, selection, and truncation of time series data in a pandas DataFrame

(1) Indexing, slicing, selection, and truncation of time series data in a pandas Series

First, as a simple example, let's create a pandas Series whose index holds the 10 year-month-day dates from 2019-11-25 through 2019-12-04.

I generated the date-time data with pandas.date_range(start date, periods=number of date-times to generate) and used it as the index of a pandas Series.

import pandas as pd

from datetime import datetime

# DatetimeIndex

ts_days_idx = pd.date_range('2019-11-25', periods=10)

ts_days_idx

[Out]:
DatetimeIndex(['2019-11-25', '2019-11-26', '2019-11-27', '2019-11-28',
               '2019-11-29', '2019-11-30', '2019-12-01', '2019-12-02',
               '2019-12-03', '2019-12-04'],
              dtype='datetime64[ns]', freq='D')

# Series with time series index

series_ts = pd.Series(range(len(ts_days_idx)), index=ts_days_idx)

series_ts

[Out]:

2019-11-25    0
2019-11-26    1
2019-11-27    2
2019-11-28    3
2019-11-29    4
2019-11-30    5
2019-12-01    6
2019-12-02    7
2019-12-03    8
2019-12-04    9
Freq: D, dtype: int64

series_ts.index

[Out]:

DatetimeIndex(['2019-11-25', '2019-11-26', '2019-11-27', '2019-11-28',
               '2019-11-29', '2019-11-30', '2019-12-01', '2019-12-02',
               '2019-12-03', '2019-12-04'],
              dtype='datetime64[ns]', freq='D')

series_ts.index[6]

[Out]: Timestamp('2019-12-01 00:00:00', freq='D')

For reference, specifying the start and end explicitly, as in pd.date_range(start='start date-time', end='end date-time') below, gives the same result as the periods example above.

import pandas as pd

pd.date_range(start='2019-11-25', end='2019-12-04')

[Out]:

DatetimeIndex(['2019-11-25', '2019-11-26', '2019-11-27', '2019-11-28',
               '2019-11-29', '2019-11-30', '2019-12-01', '2019-12-02',
               '2019-12-03', '2019-12-04'],
              dtype='datetime64[ns]', freq='D')

One more note: in pandas.date_range('start date-time', periods=number of date-times to generate, freq='frequency unit'), the freq option sets the generation frequency, for example 'S' for 1 second, '10S' for 10 seconds, 'H' for 1 hour, 'D' for 1 day, 'M' for 1 month (month end), and 'Y' for 1 year (year end). Very convenient!

< Example: 10 date-times at 1-second intervals >

# 10 timeseries data points by Second(freq='S')

pd.date_range('2019-11-25 00:00:00', periods=10, freq='S')

[Out]:

DatetimeIndex(['2019-11-25 00:00:00', '2019-11-25 00:00:01',
               '2019-11-25 00:00:02', '2019-11-25 00:00:03',
               '2019-11-25 00:00:04', '2019-11-25 00:00:05',
               '2019-11-25 00:00:06', '2019-11-25 00:00:07',
               '2019-11-25 00:00:08', '2019-11-25 00:00:09'],
              dtype='datetime64[ns]', freq='S')

< Example: 10 date-times at 10-second intervals >

# 10 timeseries data points by 10 Seconds (freq='10S')

pd.date_range('2019-11-25 00:00:00', periods=10, freq='10S')

[Out]:
DatetimeIndex(['2019-11-25 00:00:00', '2019-11-25 00:00:10',
               '2019-11-25 00:00:20', '2019-11-25 00:00:30',
               '2019-11-25 00:00:40', '2019-11-25 00:00:50',
               '2019-11-25 00:01:00', '2019-11-25 00:01:10',
               '2019-11-25 00:01:20', '2019-11-25 00:01:30'],
              dtype='datetime64[ns]', freq='10S')
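A note on the alias strings: the examples in this post were run on pandas 0.25. To my knowledge, recent pandas releases (2.2 and later) deprecate several single-letter aliases such as 'T', 'S', 'H', 'M', and 'Y' in favor of 'min', 's', 'h', 'ME', and 'YE', so please check the documentation for your installed version. The sketch below sticks to 'min' and 'D', which I expect to be accepted by both older and newer versions.

```python
import pandas as pd

# 10-minute frequency using the 'min' alias (equivalent to 'T' in older pandas)
every_10_min = pd.date_range('2019-11-25 00:00:00', periods=4, freq='10min')

# Daily frequency
daily = pd.date_range('2019-11-25', periods=4, freq='D')

# Spacing between consecutive timestamps
step = every_10_min[1] - every_10_min[0]

print(every_10_min)
print(daily)
print(step)
```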

(1-1) Indexing a specific date-time in a pandas Series with a time series index

First, from the time-ordered Series series_ts created above, let's index the value '6' at '2019-12-01', which sits at the 7th position.

As in (a) and (b), you can index by position.

As in (c) and (d), you can also index with a date-time string.

As in (e), you can index with a datetime.datetime(year, month, day) object as well.

import pandas as pd

from datetime import datetime

# (a) indexing with index number

series_ts[6]

[Out]: 6

# (b) indexing with index number using iloc

series_ts.iloc[6]

[Out]: 6

# (c) indexing with string ['year-month-day']

series_ts['2019-12-01']

[Out]: 6

# (d) indexing with string ['month/day/year']

series_ts['12/01/2019']

[Out]: 6

# (e) indexing with datetime.datetime(year, month, day)

series_ts[datetime(2019, 12, 1)]

[Out]: 6
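The bare series_ts[6] form relies on [] falling back to a positional lookup when the index is not integer-based. As far as I know, newer pandas versions warn about that fallback, so a more explicit style is to use iloc for positions and loc for labels. A minimal sketch, rebuilding the example Series so that it runs on its own:

```python
import pandas as pd
from datetime import datetime

# Rebuild the example Series
series_ts = pd.Series(range(10), index=pd.date_range('2019-11-25', periods=10))

by_position = series_ts.iloc[6]                              # explicit position
by_string = series_ts.loc['2019-12-01']                      # explicit label (string)
by_datetime = series_ts.loc[datetime(2019, 12, 1)]           # explicit label (datetime)
by_timestamp = series_ts.loc[pd.Timestamp('2019-12-01')]     # explicit label (Timestamp)

print(by_position, by_string, by_datetime, by_timestamp)
```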

(1-2) Slicing date-time data in a pandas Series with a time series index

Below are 5 ways to slice all values from '2019-12-01' onward.

(a) and (b) slice the date-indexed Series using position:position.

(c) and (d) slice using 'year-month-day':'year-month-day' or 'month/day/year':'month/day/year' strings.

(e) slices using datetime.datetime(year, month, day):datetime.datetime(year, month, day).

import pandas as pd

from datetime import datetime

# (a) slicing with position

series_ts[6:]

[Out]:

2019-12-01    6
2019-12-02    7
2019-12-03    8
2019-12-04    9
Freq: D, dtype: int64

# (b) slicing with position using iloc

series_ts.iloc[6:]

[Out]:

2019-12-01    6
2019-12-02    7
2019-12-03    8
2019-12-04    9
Freq: D, dtype: int64

# (c) slicing with string

series_ts['2019-12-01':'2019-12-10']

[Out]:

2019-12-01    6
2019-12-02    7
2019-12-03    8
2019-12-04    9
Freq: D, dtype: int64

# (d) slicing with string

series_ts['12/01/2019':'12/10/2019']

[Out]:

2019-12-01    6
2019-12-02    7
2019-12-03    8
2019-12-04    9
Freq: D, dtype: int64

# (e) slicing with datetime

series_ts[datetime(2019, 12, 1):datetime(2019, 12, 10)]

[Out]:

2019-12-01    6
2019-12-02    7
2019-12-03    8
2019-12-04    9
Freq: D, dtype: int64
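One detail worth keeping in mind when mixing these styles: position-based slices exclude the end position, whereas label-based (date) slices include the end label. The sketch below rebuilds the example Series and shows the difference.

```python
import pandas as pd

# Rebuild the example Series
series_ts = pd.Series(range(10), index=pd.date_range('2019-11-25', periods=10))

# Label-based slice: both endpoints are included (3 rows)
by_label = series_ts.loc['2019-11-26':'2019-11-28']

# Position-based slice: the end position is excluded (2 rows)
by_position = series_ts.iloc[1:3]

print(by_label)
print(by_position)
```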

(1-3) Selecting date-time data in a pandas Series with a time series index

You can also use a date-time string to select all data for a given year or month (partial string indexing). It is a handy feature.

< Example: selecting all data for 2019 >

# selection with year string

series_ts['2019']

[Out]:

2019-11-25    0
2019-11-26    1
2019-11-27    2
2019-11-28    3
2019-11-29    4
2019-11-30    5
2019-12-01    6
2019-12-02    7
2019-12-03    8
2019-12-04    9
Freq: D, dtype: int64

< Example: selecting all data for December 2019 >

# selection with year-month string

series_ts['2019-12']

[Out]:

2019-12-01    6
2019-12-02    7
2019-12-03    8
2019-12-04    9
Freq: D, dtype: int64

(1-4) Truncating date-time data in a pandas Series with a time series index (Truncate)

The truncate() method cuts off part of the data. The before and after options set the range to cut; pay close attention to whether the boundary date itself is kept.

< Example: truncating all data before '2019-12-01' >

(All data through '2019-11-30' is removed; the '2019-12-01' row remains.)

# truncate before

series_ts.truncate(before='2019-12-01')

[Out]:

2019-12-01    6
2019-12-02    7
2019-12-03    8
2019-12-04    9
Freq: D, dtype: int64

< Example: truncating all data after '2019-11-30' >

(All data from '2019-12-01' onward is removed; the '2019-11-30' row remains.)

# truncate after

series_ts.truncate(after='2019-11-30')

[Out]:

2019-11-25    0
2019-11-26    1
2019-11-27    2
2019-11-28    3
2019-11-29    4
2019-11-30    5
Freq: D, dtype: int64
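The before and after options can also be combined in a single call to keep a window of dates, with both boundary dates retained. On a sorted index this gives the same rows as an inclusive label slice, as the following sketch (which rebuilds the example Series) illustrates.

```python
import pandas as pd

# Rebuild the example Series
series_ts = pd.Series(range(10), index=pd.date_range('2019-11-25', periods=10))

# Keep only 2019-11-28 through 2019-12-01 (both boundaries are kept)
window = series_ts.truncate(before='2019-11-28', after='2019-12-01')

# Equivalent inclusive label slice
same = series_ts.loc['2019-11-28':'2019-12-01']

print(window)
```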

(2) Indexing, slicing, selection, and truncation of time series data in a pandas DataFrame

The indexing, slicing, selection, and truncation methods introduced in (1) for a pandas Series can be applied to a pandas DataFrame in much the same way.

Let's create a simple example pandas DataFrame with year-month-day dates as its index.

import pandas as pd

from datetime import datetime

# DatetimeIndex

ts_days_idx = pd.date_range('2019-11-25', periods=10)

ts_days_idx

[Out]:

DatetimeIndex(['2019-11-25', '2019-11-26', '2019-11-27', '2019-11-28',
               '2019-11-29', '2019-11-30', '2019-12-01', '2019-12-02',
               '2019-12-03', '2019-12-04'],
              dtype='datetime64[ns]', freq='D')

# DataFrame with DatetimeIndex

df_ts = pd.DataFrame(range(len(ts_days_idx)),
                     columns=['col'],
                     index=ts_days_idx)

df_ts

	col
2019-11-25	0
2019-11-26	1
2019-11-27	2
2019-11-28	3
2019-11-29	4
2019-11-30	5
2019-12-01	6
2019-12-02	7
2019-12-03	8
2019-12-04	9

(2-1) Indexing a specific date-time in a pandas DataFrame with a time series index

This is almost the same as the Series indexing in (1-1), except that df_ts[6] and df_ts[datetime(2019, 12, 1)] raise a KeyError on a DataFrame, because [] with a single key looks up columns rather than rows. The 3 approaches below work for indexing.

(a) Indexing by position with iloc[integer]

(b), (c) Indexing by label with loc['label']

# (a) indexing with index position integer using iloc[]

df_ts.iloc[6]

[Out]:

col    6
Name: 2019-12-01 00:00:00, dtype: int64

# (b) indexing with index labels ['year-month-day']

df_ts.loc['2019-12-01']

[Out]:

col    6
Name: 2019-12-01 00:00:00, dtype: int64

# (c) indexing with index labels ['month/day/year']

df_ts.loc['12/01/2019']

[Out]:

col    6
Name: 2019-12-01 00:00:00, dtype: int64
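Although a bare df_ts[datetime(...)] fails, a datetime object can still be used for row lookup when it is passed through loc. A minimal sketch, rebuilding the example DataFrame so that it runs on its own:

```python
import pandas as pd
from datetime import datetime

# Rebuild the example DataFrame
df_ts = pd.DataFrame({'col': range(10)},
                     index=pd.date_range('2019-11-25', periods=10))

# Row lookup by datetime label: returns the row as a Series
row = df_ts.loc[datetime(2019, 12, 1)]

# Row and column together: returns a single scalar value
value = df_ts.loc[datetime(2019, 12, 1), 'col']

print(row)
print(value)
```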

(2-2) Slicing date-time data in a pandas DataFrame with a time series index

Below are 4 ways to slice all values from '2019-12-01' onward.

(a) slices the date-indexed DataFrame using position:position.

(b) and (c) slice using date strings, loc['year-month-day':'year-month-day'] or loc['month/day/year':'month/day/year'].

(d) slices using loc[datetime.datetime(year, month, day):datetime.datetime(year, month, day)].

# (a) slicing DataFrame with position integer

df_ts[6:10]

	col
2019-12-01	6
2019-12-02	7
2019-12-03	8
2019-12-04	9

# (b) slicing using date strings 'year-month-day'

df_ts.loc['2019-12-01':'2019-12-10']

	col
2019-12-01	6
2019-12-02	7
2019-12-03	8
2019-12-04	9

# (c) slicing using date strings 'month/day/year'

df_ts.loc['12/01/2019':'12/10/2019']

	col
2019-12-01	6
2019-12-02	7
2019-12-03	8
2019-12-04	9

# (d) slicing using datetime objects

from datetime import datetime

df_ts.loc[datetime(2019, 12, 1):datetime(2019, 12, 10)]

	col
2019-12-01	6
2019-12-02	7
2019-12-03	8
2019-12-04	9

(2-3) Selecting date-time data in a pandas DataFrame with a time series index

Passing a 'year' or 'year-month' date string to df.loc['year'] or df.loc['year-month'] selects all data for that year or month.

< Example: selecting all data for 2019 >

# selection of year '2019'

df_ts.loc['2019'] # df_ts['2019']

	col
2019-11-25	0
2019-11-26	1
2019-11-27	2
2019-11-28	3
2019-11-29	4
2019-11-30	5
2019-12-01	6
2019-12-02	7
2019-12-03	8
2019-12-04	9

< Example: selecting all data for December 2019 >

# selection of year-month '2019-12'

df_ts.loc['2019-12']

	col
2019-12-01	6
2019-12-02	7
2019-12-03	8
2019-12-04	9

(2-4) Truncating date-time data in a pandas DataFrame with a time series index (Truncate)

With the truncate() method you can cut off the data before the 'before' date or after the 'after' date.

< Example: truncating data before '2019-12-01' >

(The '2019-12-01' row is not removed.)

# truncate before

df_ts.truncate(before='2019-12-01') # '2019-12-01' is not removed

	col
2019-12-01	6
2019-12-02	7
2019-12-03	8
2019-12-04	9

< Example: truncating data after '2019-11-30' >

(The '2019-11-30' row is not removed.)

# truncate after

df_ts.truncate(after='2019-11-30') # '2019-11-30' is not removed

	col
2019-11-25	0
2019-11-26	1
2019-11-27	2
2019-11-28	3
2019-11-29	4
2019-11-30	5

I hope this was helpful.


Posted by Rfriend

[Python pandas] Aggregating time series data into 10-minute intervals with the resample() method

Python Analysis and Programming / Python Data Preprocessing, 2019. 12. 15. 00:42

This post shows how to use the Python pandas library to aggregate/summarize time series data by fixed time spans such as 10 minutes, 20 minutes, 1 hour, 1 day, or 1 month (downsampling).

(* For aggregating time series by time span in PostgreSQL or Greenplum database, see https://rfriend.tistory.com/495)

You could also use the groupby() operator introduced earlier to aggregate by group, but for time series data the pandas resample() method is more convenient and keeps the code cleaner.

First, let's build a simple time series with a 'year-month-day hour:minute:second' timestamp index and two columns, price and amount. I used pandas date_range(start, end, freq) to generate a date range at 2-minute intervals (freq='2min') and kept only the first 20 rows for the example.

import pandas as pd

import numpy as np

# generate time series index
# (named date_rng rather than range, to avoid shadowing the built-in range())

date_rng = pd.date_range('2019-12-19', '2019-12-20', freq='2min')

df = pd.DataFrame(index=date_rng)[:20]

# add 'price' column using random numbers

np.random.seed(seed=1004) # for reproducibility

df['price'] = np.random.randint(low=10, high=100, size=20)

# add 'amount' column using random numbers

df['amount'] = np.random.randint(low=1, high=5, size=20)

print('Shape of df DataFrame:', df.shape)

[Out]:Shape of df DataFrame: (20, 2)

[Out]:

	price	amount
2019-12-19 00:00:00	12	4
2019-12-19 00:02:00	21	2
2019-12-19 00:04:00	41	1
2019-12-19 00:06:00	79	4
2019-12-19 00:08:00	61	2
2019-12-19 00:10:00	81	1
2019-12-19 00:12:00	24	3
2019-12-19 00:14:00	62	1
2019-12-19 00:16:00	76	3
2019-12-19 00:18:00	63	1
2019-12-19 00:20:00	95	2
2019-12-19 00:22:00	82	1
2019-12-19 00:24:00	82	3
2019-12-19 00:26:00	70	1
2019-12-19 00:28:00	30	4
2019-12-19 00:30:00	33	1
2019-12-19 00:32:00	22	2
2019-12-19 00:34:00	77	3
2019-12-19 00:36:00	58	3
2019-12-19 00:38:00	96	3

(1) First value and last value of each column per 10-minute interval

(select the first and last value by 10 minutes time span using pandas resample method)

resample('10T') groups the 'year-month-day hour:minute:second' time series index into equally spaced 10-minute bins. In terms of split-apply-combine in pandas groupby(), it plays the role of split, dividing the data into equal time intervals.
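To illustrate the groupby analogy, the sketch below computes the same 10-minute sums once with resample() and once with groupby() plus pd.Grouper(freq=...). The data are hypothetical (simple integers, not the random prices of this post), and the code uses the '10min' alias, which is equivalent to '10T'.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 20 points at 2-minute intervals with values 0..19
idx = pd.date_range('2019-12-19', periods=20, freq='2min')
df = pd.DataFrame({'price': np.arange(20)}, index=idx)

# Split into 10-minute bins with resample, then apply sum
by_resample = df['price'].resample('10min').sum()

# The same split expressed as a groupby with a time-based Grouper
by_groupby = df['price'].groupby(pd.Grouper(freq='10min')).sum()

print(by_resample)
print(by_groupby)
```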

[ Setting the time span in the resample() method ]
- 5-minute span : resample('5T')
- 10-minute span : resample('10T')
- 20-minute span : resample('20T')
- 1-hour span : resample('1H')
- 1-day span : resample('1D')
- 1-week span : resample('1W')
- 1-month span : resample('1M')
- 1-year span : resample('1Y')

With the rows in each time span sorted in time order, the first row's value is obtained with the first() method and the last row's value with the last() method. (This corresponds to apply in groupby's split-apply-combine.)

# Resampling by a given time span (group)

# : first, last

df_summary = pd.DataFrame()

df_summary['price_10m_first'] = df.price.resample('10T').first()

df_summary['price_10m_last'] = df.price.resample('10T').last()

df_summary['amount_10m_first'] = df.amount.resample('10T').first()

df_summary['amount_10m_last'] = df.amount.resample('10T').last()

df_summary

	price_10m_first	price_10m_last	amount_10m_first	amount_10m_last
2019-12-19 00:00:00	12	61	4	2
2019-12-19 00:10:00	81	63	1	1
2019-12-19 00:20:00	95	30	2	4
2019-12-19 00:30:00	33	96	1	3
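The same first/last summary can also be produced in one pass by handing a list of function names to agg() on the resampler. This is a sketch of an alternative style, not the code used in this post; the price values are copied by hand from the table printed earlier so that the block does not depend on the random seed.

```python
import pandas as pd

# Price values copied from the example table above
prices = [12, 21, 41, 79, 61, 81, 24, 62, 76, 63,
          95, 82, 82, 70, 30, 33, 22, 77, 58, 96]
idx = pd.date_range('2019-12-19', periods=20, freq='2min')
price = pd.Series(prices, index=idx, name='price')

# One resampler, several aggregations at once
summary = price.resample('10min').agg(['first', 'last'])

print(summary)
```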

(2) Sum and cumulative sum of numeric data per 10-minute interval

(sum, cumulative sum by 10 minutes time span using pandas resample method)

# Resampling by a given time span (group)

# sum, cumulative sum

df_summary = pd.DataFrame()

df_summary['price_10m_sum'] = df.price.resample('10T').sum()

df_summary['price_10m_cumsum'] = df.price.resample('10T').sum().cumsum()

df_summary['amount_10m_sum'] = df.amount.resample('10T').sum()

df_summary['amount_10m_cumsum'] = df.amount.resample('10T').sum().cumsum()

df_summary

	price_10m_sum	price_10m_cumsum	amount_10m_sum	amount_10m_cumsum
2019-12-19 00:00:00	214	214	13	13
2019-12-19 00:10:00	306	520	9	22
2019-12-19 00:20:00	359	879	11	33
2019-12-19 00:30:00	286	1165	12	45

(3) Minimum, maximum, mean, median, and range per 10-minute interval

(summary statistics by 10 minutes time span using pandas resample method)

The minimum, maximum, mean, and median can be obtained with the min(), max(), mean(), and median() methods. There is no dedicated method for the range, so I computed it as range = max - min.

# Resampling by a given time span (group)

# min, max, mean, median, range

df_summary = pd.DataFrame()

df_summary['price_10m_min'] = df.price.resample('10T').min()

df_summary['price_10m_max'] = df.price.resample('10T').max()

df_summary['price_10m_mean'] = df.price.resample('10T').mean()

df_summary['price_10m_median'] = df.price.resample('10T').median()

df_summary['price_10m_range'] = \
    df.price.resample('10T').max() - df.price.resample('10T').min()

df_summary

	price_10m_min	price_10m_max	price_10m_mean	price_10m_median	price_10m_range
2019-12-19 00:00:00	12	79	42.8	41	67
2019-12-19 00:10:00	24	81	61.2	63	57
2019-12-19 00:20:00	30	95	71.8	82	65
2019-12-19 00:30:00	22	96	57.2	58	74
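As a more compact variant (a sketch, not the code of this post), the four built-in statistics can be requested in a single agg() call and the range derived from the resulting columns, which avoids resampling the data repeatedly. The price values are copied by hand from the example table so that the block is self-contained.

```python
import pandas as pd

# Price values copied from the example table above
prices = [12, 21, 41, 79, 61, 81, 24, 62, 76, 63,
          95, 82, 82, 70, 30, 33, 22, 77, 58, 96]
idx = pd.date_range('2019-12-19', periods=20, freq='2min')
price = pd.Series(prices, index=idx, name='price')

# Several summary statistics in one pass over the 10-minute bins
stats = price.resample('10min').agg(['min', 'max', 'mean', 'median'])

# Range derived from the aggregated columns: max - min
stats['range'] = stats['max'] - stats['min']

print(stats)
```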

(4) Variance and standard deviation per 10-minute interval

(variance, standard deviation by 10 minutes time span using pandas resample() method)

resample('10T') groups the data into 10-minute bins, and the var() method computes the sample variance. (* Note: the population variance divides the sum of squared deviations by the number of elements N, whereas the sample variance divides it by N-1.)

Here the sample standard deviation is obtained by taking the square root of the sample variance; the resampler's std() method returns the same value directly.

# Resampling by a given time span (group)

# variance, standard deviation

df_summary = pd.DataFrame()

# sample variance 1/(N-1)*sigma(X-X_bar)^2

df_summary['price_10m_var'] = df.price.resample('10T').var()

# sample standard deviation using sqrt(var) formula

df_summary['price_10m_stddev'] = np.sqrt(df.price.resample('10T').var())

	price_10m_var	price_10m_stddev
2019-12-19 00:00:00	767.2	27.698375
2019-12-19 00:10:00	499.7	22.353971
2019-12-19 00:20:00	624.2	24.983995
2019-12-19 00:30:00	930.7	30.507376
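The relation between the two statistics can be verified with the resampler's own std() method, and the ddof argument switches between the sample (N-1) and population (N) denominators. A minimal sketch, with the price values copied by hand from the example table:

```python
import numpy as np
import pandas as pd

# Price values copied from the example table above
prices = [12, 21, 41, 79, 61, 81, 24, 62, 76, 63,
          95, 82, 82, 70, 30, 33, 22, 77, 58, 96]
idx = pd.date_range('2019-12-19', periods=20, freq='2min')
price = pd.Series(prices, index=idx, name='price')

resampler = price.resample('10min')

var_10m = resampler.var()            # sample variance (divides by N-1)
std_10m = resampler.std()            # sample standard deviation
pop_var_10m = resampler.var(ddof=0)  # population variance (divides by N)

# std() should match the square root of var()
print(np.allclose(std_10m, np.sqrt(var_10m)))
print(var_10m)
print(pop_var_10m)
```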

(5) User-defined function for summary statistics by time span

(User Defined Function for aggregating summary statistics by specific time span)

In (1) through (4) above, I used the pandas resample() method to sample time series data by time span and compute the first value (first), last value (last), sum, cumulative sum (cumsum), minimum (min), maximum (max), mean, median, range, variance, and standard deviation.

To make this more convenient, let's define a user-defined function that takes the following parameters.

[ Parameters of the resample_summary() user-defined function ]

(a) ts_data : time series DataFrame whose index is a 'year-month-day hour:minute:second' time series
(b) col_nm : name of the column to aggregate/summarize
(c) time_span : time span (e.g. '10T' for 10 minutes, '1H' for 1 hour, '1D' for 1 day)
(d) func_list : functions to aggregate with (e.g. 'first' for the first value, 'last' for the last value, 'sum', 'cumsum' for the cumulative sum, 'min', 'max', 'mean', 'median', 'range', 'var' for the sample variance, 'stddev' for the sample standard deviation)

The shared part, resampler = ts_data[col_nm].resample(time_span), is created once as a resampler object and reused.

The if [function name] in func_list conditions make the function compute and return only the aggregations the user selected.

For readability, each aggregated column is named by joining [ original column name + '_' + time span + '_' + aggregation function ] (e.g. price_10T_first).

# UDF of Resampling by column name, time span and summary functions
import numpy as np
import pandas as pd

def resample_summary(ts_data, col_nm, time_span, func_list):
    df_summary = pd.DataFrame() # blank DataFrame to store results

    # resampler with column name by time span (group by)
    resampler = ts_data[col_nm].resample(time_span)

    # aggregation functions with suffix name
    if 'first' in func_list:
        df_summary[col_nm + '_' + time_span + '_first'] = resampler.first()
    if 'last' in func_list:
        df_summary[col_nm + '_' + time_span + '_last'] = resampler.last()
    if 'sum' in func_list:
        df_summary[col_nm + '_' + time_span + '_sum'] = resampler.sum()
    if 'cumsum' in func_list:
        df_summary[col_nm + '_' + time_span + '_cumsum'] = resampler.sum().cumsum()
    if 'min' in func_list:
        df_summary[col_nm + '_' + time_span + '_min'] = resampler.min()
    if 'max' in func_list:
        df_summary[col_nm + '_' + time_span + '_max'] = resampler.max()
    if 'mean' in func_list:
        df_summary[col_nm + '_' + time_span + '_mean'] = resampler.mean()
    if 'median' in func_list:
        df_summary[col_nm + '_' + time_span + '_median'] = resampler.median()
    if 'range' in func_list:
        df_summary[col_nm + '_' + time_span + '_range'] = resampler.max() - resampler.min()
    if 'var' in func_list:
        df_summary[col_nm + '_' + time_span + '_var'] = resampler.var() # sample variance
    if 'stddev' in func_list:
        df_summary[col_nm + '_' + time_span + '_stddev'] = np.sqrt(resampler.var())

    return df_summary

Using the resample_summary() user-defined function defined in (5) above, let's compute the first value ('first'), last value ('last'), sum ('sum'), cumulative sum ('cumsum'), minimum ('min'), and maximum ('max') of the 'price' column of the df dataset by 10-minute span (time_span = '10T').

func_list = ['first', 'last', 'sum', 'cumsum', 'min', 'max']

resample_summary(df, 'price', '10T', func_list)

	price_10T_first	price_10T_last	price_10T_sum	price_10T_cumsum	price_10T_min	price_10T_max
2019-12-19 00:00:00	12	61	214	214	12	79
2019-12-19 00:10:00	81	63	306	520	24	81
2019-12-19 00:20:00	95	30	359	879	30	95
2019-12-19 00:30:00	33	96	286	1165	22	96
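The chain of if statements in resample_summary() could also be shortened by handing the names pandas already understands to agg() and treating only 'cumsum', 'range', and 'stddev' specially. The function name resample_summary_agg and the demo data are hypothetical; this is a sketch of one possible variant, not the post's implementation, and it raises a KeyError for names it does not know.

```python
import pandas as pd

def resample_summary_agg(ts_data, col_nm, time_span, func_list):
    """Hypothetical variant of resample_summary() built on Resampler.agg()."""
    resampler = ts_data[col_nm].resample(time_span)

    # Names that agg() understands directly
    known = ('first', 'last', 'sum', 'min', 'max', 'mean', 'median', 'var')
    builtin = [f for f in func_list if f in known]
    if builtin:
        out = resampler.agg(builtin)
    else:
        out = pd.DataFrame(index=resampler.sum().index)

    # Statistics that need their own expression
    if 'cumsum' in func_list:
        out['cumsum'] = resampler.sum().cumsum()
    if 'range' in func_list:
        out['range'] = resampler.max() - resampler.min()
    if 'stddev' in func_list:
        out['stddev'] = resampler.std()

    # Keep the caller's order and the [column]_[span]_[function] naming
    out = out[func_list]
    out.columns = [f'{col_nm}_{time_span}_{f}' for f in func_list]
    return out

# Made-up demo data: two 10-minute bins of five prices each (assumed values)
idx = pd.date_range('2019-12-19', periods=10, freq='2min')
df_demo = pd.DataFrame({'price': [12, 21, 41, 79, 61, 81, 24, 76, 62, 63]}, index=idx)

result = resample_summary_agg(df_demo, 'price', '10min',
                              ['first', 'sum', 'cumsum', 'range', 'stddev'])
print(result)
```

The explicit if-chain in the original function is easier to read line by line. The agg() variant is shorter, at the cost of one extra step to restore the requested column order.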

This time, let's widen the time span to 20 minutes ('20T') and call the resample_summary() user-defined function again.

func_list = ['mean', 'median', 'range', 'var', 'stddev']

resample_summary(df, 'price', '20T', func_list)

	price_20T_mean	price_20T_median	price_20T_range	price_20T_var	price_20T_stddev
2019-12-19 00:00:00	52.0	61.5	69	657.111111	25.634179
2019-12-19 00:20:00	64.5	73.5	74	750.277778	27.391199

This time, let's change the column being aggregated to 'amount' and call the resample_summary() user-defined function.

func_list = ['mean', 'median', 'range', 'var', 'stddev']

resample_summary(df, 'amount', '20T', func_list) # with 'amount' column

	amount_20T_mean	amount_20T_median	amount_20T_range	amount_20T_var	amount_20T_stddev
2019-12-19 00:00:00	2.2	2.0	3	1.511111	1.229273
2019-12-19 00:20:00	2.3	2.5	3	1.122222	1.059350

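Now let's change the aggregation functions to mean ('mean'), median ('median'), range ('range'), variance ('var'), and standard deviation ('stddev') and call the resample_summary() user-defined function on 'price' with a 10-minute span.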

func_list = ['mean', 'median', 'range', 'var', 'stddev']

resample_summary(df, 'price', '10T', func_list)

	price_10T_mean	price_10T_median	price_10T_range	price_10T_var	price_10T_stddev
2019-12-19 00:00:00	42.8	41	67	767.2	27.698375
2019-12-19 00:10:00	61.2	63	57	499.7	22.353971
2019-12-19 00:20:00	71.8	82	65	624.2	24.983995
2019-12-19 00:30:00	57.2	58	74	930.7	30.507376

This time, let's generate about one month of random time series data, from 2019-12-19 to 2020-01-18, and aggregate it while changing the time span to 1 hour ('1H'), 1 day ('1D'), 1 week ('1W'), 2 weeks ('2W'), and 1 month ('1M').

# generate time series index

ts_index = pd.date_range('2019-12-19', '2020-01-18', freq='2min') # about one month at 2-minute frequency

df_1m = pd.DataFrame(index = ts_index)

# add 'price' column using random numbers

np.random.seed(seed=1004) # for reproducibility

df_1m['price'] = np.random.randint(low=10, high=100, size=len(df_1m))

# add 'amount' column using random numbers

df_1m['amount'] = np.random.randint(low=1, high=5, size=len(df_1m))

print('Shape of df_1m DataFrame:', df_1m.shape)

Shape of df_1m DataFrame: (21601, 2)

# by 1 Hour

func_list = ['first', 'sum', 'mean', 'stddev']
resample_summary(df_1m, 'price', '1H', func_list).head() # by 1 Hour

 price_1H_first price_1H_sum price_1H_mean price_1H_stddev
2019-12-19 00:00:00 12 1684 56.133333 25.143359
2019-12-19 01:00:00 44 1534 51.133333 24.764732
2019-12-19 02:00:00 70 1435 47.833333 25.223256
2019-12-19 03:00:00 22 1867 62.233333 24.842515
2019-12-19 04:00:00 80 1766 58.866667 23.292345

# by 1 Day

func_list = ['first', 'sum', 'mean', 'stddev']
resample_summary(df_1m, 'price', '1D', func_list).head() # by 1 Day

 price_1D_first price_1D_sum price_1D_mean price_1D_stddev
2019-12-19 12 39746 55.202778 25.946355
2019-12-20 26 40171 55.793056 25.547419
2019-12-21 87 39737 55.190278 26.238314
2019-12-22 65 39350 54.652778 25.675714
2019-12-23 69 39835 55.326389 26.230239

# by 1 Week

func_list = ['first', 'sum', 'mean', 'stddev']
resample_summary(df_1m, 'price', '1W', func_list) # by 1 Week

 price_1W_first price_1W_sum price_1W_mean price_1W_stddev
2019-12-22 12 159004 55.209722 25.842990
2019-12-29 69 272943 54.155357 26.084089
2020-01-05 72 274740 54.511905 25.840425
2020-01-12 41 276563 54.873611 26.295806
2020-01-19 55 197090 54.732019 25.984207

# by 2 Weeks

func_list = ['first', 'sum', 'mean', 'stddev']
resample_summary(df_1m, 'price', '2W', func_list) # by 2 Week

 price_2W_first price_2W_sum price_2W_mean price_2W_stddev
2019-12-22 12 159004 55.209722 25.842990
2020-01-05 69 547683 54.333631 25.961867
2020-01-19 41 473653 54.814605 26.164988

# by 1 Month

func_list = ['first', 'sum', 'mean', 'stddev']
resample_summary(df_1m, 'price', '1M', func_list) # by 1 Month

 price_1M_first price_1M_sum price_1M_mean price_1M_stddev
2019-12-31 12 510036 54.491026 25.912189
2020-01-31 48 670304 54.758925 26.117109
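The weekly labels in the tables above (2019-12-22, 2019-12-29, ...) are all Sundays, because the 'W' alias defaults to weeks ending on Sunday and labels each bin with its right edge. A small sketch with one made-up observation per day (an assumption for illustration) makes the bin boundaries visible:

```python
import pandas as pd

# One observation per day over the same date range as df_1m (values are all 1)
idx = pd.date_range('2019-12-19', '2020-01-18', freq='D')
daily = pd.Series(1, index=idx)

# Summing the ones counts how many days fall into each weekly bin
weekly = daily.resample('W').sum()

print(weekly)
```

The first bin holds only 4 days (Thursday 2019-12-19 through Sunday 2019-12-22) and the last only 6, which is why the first and last weekly sums in the table above are smaller than the others.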

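(6) Amount-weighted average price by 10-minute time span

(amount-weighted average of price by 10 minutes time span using pandas resample method)

We create a new column 'price_mult_amt' by multiplying price ('price') by amount ('amount'), then use resample('10T') to compute the amount-weighted average price for each 10-minute span (sum of price*amount in the span / total amount in the span).

Note that the backslash ('\') in the code below is used when a statement is too long to fit on one line: it makes Python treat the following line as a continuation of the current one.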

# function: weighted average

# amount-weighted average price per time span: sum(price*amount)/sum(amount)

# (the weighted average is the sum of price*amount over all purchases in the span, divided by the total amount purchased)

df_summary = pd.DataFrame()

df['price_mult_amt'] = df['price']*df['amount']

df_summary['price_10m_amount_weighted_avg'] = \
df.price_mult_amt.resample('10T').sum() / df.amount.resample('10T').sum()

df_summary

	price_10m_amount_weighted_avg
2019-12-19 00:00:00	43.769231
2019-12-19 00:10:00	56.222222
2019-12-19 00:20:00	64.363636
2019-12-19 00:30:00	64.166667
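The ratio-of-sums formula above can be cross-checked against np.average() with weights, computed bin by bin. The prices and amounts here are made-up demo values (an assumption), and the loop assumes every bin contains at least one row with a nonzero amount.

```python
import numpy as np
import pandas as pd

# Made-up demo data: two 10-minute bins of five purchases each
idx = pd.date_range('2019-12-19', periods=10, freq='2min')
df_demo = pd.DataFrame({'price':  [12, 21, 41, 79, 61, 81, 24, 76, 62, 63],
                        'amount': [1, 2, 3, 4, 3, 2, 1, 4, 1, 1]}, index=idx)

# Approach used above: sum(price*amount) / sum(amount) per bin
wavg_ratio = ((df_demo['price'] * df_demo['amount']).resample('10min').sum()
              / df_demo['amount'].resample('10min').sum())

# Cross-check: np.average with weights, one 10-minute group at a time
manual = {ts: np.average(g['price'], weights=g['amount'])
          for ts, g in df_demo.groupby(pd.Grouper(freq='10min'))}
wavg_manual = pd.Series(manual)

print(wavg_ratio)
print(wavg_manual)
```

For the first bin, sum(price*amount) is 676 and sum(amount) is 13, so both approaches give 52.0.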

(7) Exporting the summary statistics by 10-minute time span to a csv file

(exporting summary results by 10 minutes time span into 'csv file' using pandas to_csv() method)

Using the resample_summary() user-defined function (UDF) defined in (5) above, we first build a DataFrame that aggregates 'first', 'last', 'sum', 'cumsum', 'min', 'max', 'mean', 'median', 'range', 'var', and 'stddev' of the 'price' column by 10-minute span ('10T').

Next, we compute the amount-weighted average of price by 10-minute span.

Finally, we export the combined DataFrame to a csv file named 'df_summary.csv' with the pandas to_csv() method. Because the index holds the 'year-month-day hour:minute:second' time information and must be exported too, index=True is set in to_csv(). Missing values are written as na_rep='NaN', and float_format='%.2f' limits floating-point summary statistics to two decimal places.

# summary statistics using resample_summary() User Defined Function, refer to (5)

func_list = ['first', 'last', 'sum', 'cumsum', 'min', 'max',

'mean', 'median', 'range', 'var', 'stddev']

df_summary = resample_summary(df, 'price', '10T', func_list)

# amount-weighted average of price, refer to (6)

df['price_mult_amt'] = df['price']*df['amount']

df_summary['price_10m_amount_weighted_avg'] = \
df.price_mult_amt.resample('10T').sum() / df.amount.resample('10T').sum()

# export df_summary DataFrame into csv file

import os

work_dir = os.getcwd() # current working directory

file_path = os.path.join(work_dir, 'df_summary.csv')

df_summary.to_csv(file_path

, index=True # include index

, na_rep='NaN' # representation of missing value

, float_format = '%.2f') # 2 decimal places
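To see what the three to_csv() options do without touching the file system, the sketch below writes a small made-up summary (hypothetical values and column name) to an in-memory buffer and reads it back.

```python
import io
import numpy as np
import pandas as pd

# Made-up summary with one missing value (assumed data for illustration)
idx = pd.date_range('2019-12-19', periods=3, freq='10min')
df_demo = pd.DataFrame({'price_mean': [42.8, np.nan, 71.83333]}, index=idx)

# Write to an in-memory text buffer instead of a file
buffer = io.StringIO()
df_demo.to_csv(buffer,
               index=True,            # keep the datetime index
               na_rep='NaN',          # how missing values are written
               float_format='%.2f')   # two decimal places
csv_text = buffer.getvalue()
print(csv_text)

# Read it back, restoring the first column as a datetime index
df_back = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)
print(df_back)
```

Note that float_format rounds the values that are written, so the data read back has only two decimal places of precision.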

I hope this was helpful.


License: Attribution, NonCommercial, NoDerivatives


Posted by Rfriend

[Python] pandas DataFrame: how to fix the "ValueError: If using all scalar values, you must pass an index" error

Python Analysis and Programming / Python Data Preprocessing 2019. 9. 15. 12:18

This post introduces four ways to fix the "ValueError: If using all scalar values, you must pass an index" error that can occur when creating a Python pandas DataFrame.

As in the example below, when you try to build a pandas DataFrame from a dictionary of key-value pairs and every value is a scalar (if using all scalar values), the "ValueError: If using all scalar values, you must pass an index" error is raised.

import pandas as pd

df = pd.DataFrame({'col_1': 1,

'col_2': 2})

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-73d6f192ba2a> in <module>()
      1 df = pd.DataFrame({'col_1': 1, 
----> 2                   'col_2': 2})

~/anaconda3/envs/py3.5_tf1.4/lib/python3.5/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    273                                  dtype=dtype, copy=copy)
    274         elif isinstance(data, dict):
--> 275             mgr = self._init_dict(data, index, columns, dtype=dtype)
    276         elif isinstance(data, ma.MaskedArray):
    277             import numpy.ma.mrecords as mrecords

~/anaconda3/envs/py3.5_tf1.4/lib/python3.5/site-packages/pandas/core/frame.py in _init_dict(self, data, index, columns, dtype)
    409             arrays = [data[k] for k in keys]
    410 
--> 411         return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
    412 
    413     def _init_ndarray(self, values, index, columns, dtype=None, copy=False):

~/anaconda3/envs/py3.5_tf1.4/lib/python3.5/site-packages/pandas/core/frame.py in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
   5494     # figure out the index, if necessary
   5495     if index is None:
-> 5496         index = extract_index(arrays)
   5497     else:
   5498         index = _ensure_index(index)

~/anaconda3/envs/py3.5_tf1.4/lib/python3.5/site-packages/pandas/core/frame.py in extract_index(data)
   5533 
   5534         if not indexes and not raw_lengths:
-> 5535             raise ValueError('If using all scalar values, you must pass'
   5536                              ' an index')
   5537 

ValueError: If using all scalar values, you must pass an index

The four ways to fix this error are introduced below in order.

(1) Solution 1 : pass an index

Following the guidance in the error message, "you must pass an index", simply supply an index value as well.

# (1) pass an index

df = pd.DataFrame({'col_1': 1,

'col_2': 2},

index = [0])

	col_1	col_2
0	1	2

Of course, you can set the index to any value you like. Let's try 'row_1' as the index.

df = pd.DataFrame({'col_1': 1,

'col_2': 2},

index = ['row_1'])

	col_1	col_2
row_1	1	2

(2) Solution 2 : use a list instead of scalar values

If you wrap each value in square brackets [ ] so that the dictionary values are lists, the error does not occur.

# (2) use a list instead of scalar values

df2 = pd.DataFrame({'col_1': [1],

'col_2': [2]})

df2

	col_1	col_2
0	1	2

(3) Solution 3 : create the DataFrame with pd.DataFrame.from_records([{'key': value}])

Here too, the dictionary must be wrapped in [ ] and passed as a list. (Leaving out the [ ] raises the same error.)

# (3) use pd.DataFrame.from_records() with a list

df3 = pd.DataFrame.from_records([{'col_1': 1,

'col_2': 2}])

df3

	col_1	col_2
0	1	2

(4) Solution 4 : create the DataFrame with pd.DataFrame.from_dict([{'key': value}])

This is almost the same as (3), except that from_dict([]) is used instead of from_records([]). Again, wrap the dictionary in [ ] and pass it as a list.

# (4) use pd.DataFrame.from_dict([]) with a list

df4 = pd.DataFrame.from_dict([{'col_1': 1,

'col_2': 2}])

df4

	col_1	col_2
0	1	2
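The error and several of the fixes above can be compared side by side in one short sketch. The last variant, passing a list containing the dict straight to the pd.DataFrame() constructor, is not one of the four methods in this post; it is included here as an additional option, and the variable names are illustrative.

```python
import pandas as pd

record = {'col_1': 1, 'col_2': 2}

# An all-scalar dict without an index raises the ValueError discussed above
try:
    pd.DataFrame(record)
    raised = False
except ValueError as err:
    raised = True
    print(err)

df_a = pd.DataFrame(record, index=[0])                     # (1) pass an index
df_b = pd.DataFrame({k: [v] for k, v in record.items()})   # (2) list values
df_c = pd.DataFrame.from_records([record])                 # (3) from_records with a list
df_e = pd.DataFrame([record])                              # extra: list of dicts to the constructor

for frame in (df_a, df_b, df_c, df_e):
    print(frame)
```

All of these produce a single row with col_1 = 1 and col_2 = 2, so the choice is mostly a matter of which form the input data already has.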

I hope this was helpful.


[Python pandas] Splitting one string column by a delimiter to create a DataFrame with multiple columns

Python Analysis and Programming / Python Data Preprocessing 2019. 8. 25. 19:26

This post shows how to split a string column that contains a delimiter (separator) into several columns and build a DataFrame from them.

It then shows how to split a delimited column into several columns in PostgreSQL and Greenplum DB as well.

(1) Splitting a string column in a pandas DataFrame by a delimiter to create multiple columns

To keep the example simple, let's first create a pandas DataFrame with a column named 'col' whose values use ':' as the delimiter (separator).

import pandas as pd

df = pd.DataFrame({'col': ['a:1:20.3:S', 'b:2:10.5:C', 'c:3:51.9:A']})

	col
0	a:1:20.3:S
1	b:2:10.5:C
2	c:3:51.9:A

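Now, keeping the 'col' variable in the original DataFrame named 'df', let's split the 'col' string column on the ':' delimiter and add four new columns named 'group', 'id', 'value', and 'grade'. The pandas str.split() string method takes the separator as its first argument (pat) and the maximum number of splits as n. Recent pandas versions require n to be passed as a keyword argument.

df[['group', 'id', 'value', 'grade']] = pd.DataFrame(df.col.str.split(':', n=3).tolist())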

	col	group	id	value	grade
0	a:1:20.3:S	a	1	20.3	S
1	b:2:10.5:C	b	2	10.5	C
2	c:3:51.9:A	c	3	51.9	A

If the original 'col' column is no longer needed, either overwrite the original DataFrame or create a new DataFrame that does not include 'col'.

df2 = pd.DataFrame(df.col.str.split(':', n=3).tolist(),

columns = ['group', 'id', 'value', 'grade'])

df2

	group	id	value	grade
0	a	1	20.3	S
1	b	2	10.5	C
2	c	3	51.9	A

The new columns created by splitting a string are all of string data type. To convert the 'id' and 'value' columns among them to numeric types, see the post at https://rfriend.tistory.com/470.

df2.dtypes

group    object
id       object
value    object
grade    object
dtype: object
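A slightly shorter route to the same result is str.split() with expand=True, which returns the pieces as DataFrame columns directly, followed by astype() to convert 'id' and 'value' to numbers. This is a sketch on the same three sample strings; the variable names df_demo and parts are illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame({'col': ['a:1:20.3:S', 'b:2:10.5:C', 'c:3:51.9:A']})

# expand=True returns one column per piece instead of a list per row
parts = df_demo['col'].str.split(':', n=3, expand=True)
parts.columns = ['group', 'id', 'value', 'grade']

# The pieces are strings, so convert the numeric-looking columns explicitly
parts = parts.astype({'id': int, 'value': float})

print(parts)
print(parts.dtypes)
```

After the conversion, 'id' is an integer column and 'value' a float column, while 'group' and 'grade' remain strings.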

(2) Splitting a string column by a delimiter to create multiple columns in PostgreSQL, GPDB

In PostgreSQL and Greenplum DB, a string column can be split with split_part(string_column, separator, field_number).

Let's create the same example table as the Python pandas DataFrame above, and then build a new table in which the 'col' string column is split into four string columns: 'group', 'id', 'value', and 'grade'.

-- make a table

DROP TABLE IF EXISTS grp_val_grade;

CREATE TABLE grp_val_grade (

col varchar(100) NOT NULL

);

INSERT INTO grp_val_grade VALUES ('a:1:20.3:S');

INSERT INTO grp_val_grade VALUES('b:2:10.5:C');

INSERT INTO grp_val_grade VALUES('c:3:51.9:A');

SELECT * FROM grp_val_grade;

Next, let's use the split_part(string_column, separator, field_number) function to split the string column on the ':' delimiter and create the new columns.

-- split a column by delimiter and make 4 columns

DROP TABLE IF EXISTS grp_val_grade2;

CREATE TABLE grp_val_grade2 AS (

SELECT

col

, split_part(col, ':', 1) AS group

, split_part(col, ':', 2) AS id

, split_part(col, ':', 3) AS value

, split_part(col, ':', 4) AS grade

FROM grp_val_grade

);

SELECT * FROM grp_val_grade2;

I hope this was helpful.


[Python pandas] Converting string columns of a DataFrame to numeric : pd.to_numeric(), DataFrame.astype()

Python Analysis and Programming / Python Data Preprocessing 2019. 8. 25. 16:54

This post introduces two ways to convert string columns in a Python pandas DataFrame or Series to numeric data types.

(1) Converting string columns to numeric with the pd.to_numeric() function

(2) Converting string columns to numeric with the astype() method

(1) Converting string columns to numeric with the pd.to_numeric() function

(1-1) Converting a single string column to numeric

import numpy as np

import pandas as pd

# make a DataFrame

df = pd.DataFrame({'col_str': ['1', '2', '3', '4', '5']})

	col_str
0	1
1	2
2	3
3	4
4	5

# check data types

df.dtypes

col_str    object
dtype: object

df['col_int'] = pd.to_numeric(df['col_str'])

	col_str	col_int
0	1	1
1	2	2
2	3	3
3	4	4
4	5	5

df.dtypes

col_str    object
col_int     int64
dtype: object

(1-2) Converting multiple string columns in a DataFrame to numeric with apply() and to_numeric()

# make a DataFrame with 3 string columns

df2 = pd.DataFrame({'col_str_1': ['1', '2', '3'],

'col_str_2': ['4', '5', '6'],

'col_str_3': ['7.0', '8.1', '9.2']})

df2

	col_str_1	col_str_2	col_str_3
0	1	4	7.0
1	2	5	8.1
2	3	6	9.2

df2.dtypes

col_str_1    object
col_str_2    object
col_str_3    object
dtype: object

# convert 'col_str_1' and 'col_str_2' to numeric

df2[['col_int_1', 'col_int_2']] = df2[['col_str_1', 'col_str_2']].apply(pd.to_numeric)

df2

	col_str_1	col_str_2	col_str_3	col_int_1	col_int_2
0	1	4	7.0	1	4
1	2	5	8.1	2	5
2	3	6	9.2	3	6

df2.dtypes

col_str_1    object
col_str_2    object
col_str_3    object
col_int_1     int64
col_int_2     int64
dtype: object

# convert all columns of a DataFrame to numeric using apply() and to_numeric together

df3 = df2.apply(pd.to_numeric)

df3.dtypes

col_str_1      int64
col_str_2      int64
col_str_3    float64
col_int_1      int64
col_int_2      int64
dtype: object

(2) Converting string columns to numeric with the astype() method

(2-1) Converting all string columns in a DataFrame to float at once

df4 = pd.DataFrame({'col_str_1': ['1', '2', '3'],

'col_str_2': ['4.1', '5.5', '6.0']})

df4.dtypes

col_str_1    object
col_str_2    object
dtype: object

df5 = df4.astype(float)

df5

	col_str_1	col_str_2
0	1.0	4.1
1	2.0	5.5
2	3.0	6.0

df5.dtypes

col_str_1    float64
col_str_2    float64
dtype: object

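(2-2) Converting string columns to numeric by specifying int or float for each column individually

df6 = df4.astype({'col_str_1': int,

'col_str_2': float}) # the np.float alias was removed in NumPy 1.24; the built-in float is equivalent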

df6

	col_str_1	col_str_2
0	1	4.1
1	2	5.5
2	3	6.0

df6.dtypes

col_str_1      int64
col_str_2    float64
dtype: object

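ValueError when the DataFrame also contains columns with non-numeric text

Of course, if the DataFrame contains strings made up of letters rather than digits, trying to convert everything to a numeric type at once with apply(pd.to_numeric) or DataFrame.astype(int) raises a ValueError. (This is so obvious that I wondered whether to write it here at all... ^^;)

In that case, select only the string columns that contain nothing but numbers and convert them individually.

Below is an example: the df7 DataFrame includes 'col_2', a string column made up only of letters, and a ValueError is raised when we try to convert all columns to numeric.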

df7 = pd.DataFrame({'col_1': ['1', '2', '3'],

'col_2': ['aaa', 'bbb', 'ccc']})

df7

	col_1	col_2
0	1	aaa
1	2	bbb
2	3	ccc

df7.dtypes

col_1    object
col_2    object
dtype: object

* ValueError

# ValueError

df7 = df7.apply(pd.to_numeric)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
pandas/_libs/src/inference.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "aaa"

During handling of the above exception, another exception occurred:

-- (middle part omitted) --

~/anaconda3/lib/python3.6/site-packages/pandas/core/tools/numeric.py in to_numeric(arg, errors, downcast)
    124             coerce_numeric = False if errors in ('ignore', 'raise') else True
    125             values = lib.maybe_convert_numeric(values, set(),
--> 126                                                coerce_numeric=coerce_numeric)
    127 
    128     except Exception:

pandas/_libs/src/inference.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: ('Unable to parse string "aaa" at position 0', 'occurred at index col_2')

* ValueError

# ValueError

df7 = df7.astype(int)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-124-f50dad302c83> in <module>()
----> 1 df7 = df7.astype(int)

-- (middle part omitted) --

~/anaconda3/lib/python3.6/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy)
    623     elif arr.dtype == np.object_ and np.issubdtype(dtype.type, np.integer):
    624         # work around NumPy brokenness, #1987
--> 625         return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
    626 
    627     if dtype.name in ("datetime64", "timedelta64"):

pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe()

pandas/_libs/src/util.pxd in util.set_value_at_unsafe()

ValueError: invalid literal for int() with base 10: 'aaa'
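The advice above, converting only the columns that contain nothing but numbers, can be automated with a simple check. This is one possible sketch under a stated assumption: a column counts as numeric only if every one of its values parses, so a column that already contains missing values would be left as it is.

```python
import pandas as pd

df_demo = pd.DataFrame({'col_1': ['1', '2', '3'],
                        'col_2': ['aaa', 'bbb', 'ccc']})

for col in df_demo.columns:
    # Try the conversion without raising: unparseable values become NaN
    converted = pd.to_numeric(df_demo[col], errors='coerce')
    # Replace the column only when every value was parsed successfully
    if converted.notna().all():
        df_demo[col] = converted

print(df_demo.dtypes)
```

Here 'col_1' becomes an integer column while 'col_2' keeps its original strings, and no ValueError is raised.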

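Ignoring the ValueError when converting strings to numeric: df.apply(pd.to_numeric, errors = 'coerce')

Unlike the example above, suppose the column really is one that should be converted from string to numeric, but a few of its values were entered by mistake as letters rather than digits. This raises a ValueError saying the string cannot be parsed as a number. To force the values that contain letters to NaN and convert the remaining numeric strings to numbers, add the errors = 'coerce' option.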

df8 = pd.DataFrame({'col_1': ['1', '2', '3'],

'col_2': ['4', 'bbb', '6']})

df8

	col_1	col_2
0	1	4
1	2	bbb
2	3	6

df8 = df8.apply(pd.to_numeric)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
pandas/_libs/src/inference.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "bbb"

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-130-9e8d711c10d5> in <module>()
----> 1 df8 = df8.apply(pd.to_numeric)

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
   4260                         f, axis,
   4261                         reduce=reduce,
-> 4262                         ignore_failures=ignore_failures)
   4263             else:
   4264                 return self._apply_broadcast(f, axis)

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in _apply_standard(self, func, axis, ignore_failures, reduce)
   4356             try:
   4357                 for i, v in enumerate(series_gen):
-> 4358                     results[i] = func(v)
   4359                     keys.append(v.name)
   4360             except Exception as e:

~/anaconda3/lib/python3.6/site-packages/pandas/core/tools/numeric.py in to_numeric(arg, errors, downcast)
    124             coerce_numeric = False if errors in ('ignore', 'raise') else True
    125             values = lib.maybe_convert_numeric(values, set(),
--> 126                                                coerce_numeric=coerce_numeric)
    127 
    128     except Exception:

pandas/_libs/src/inference.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: ('Unable to parse string "bbb" at position 1', 'occurred at index col_2')

df8 = df8.apply(pd.to_numeric, errors = 'coerce')

df8

	col_1	col_2
0	1	4.0
1	2	NaN
2	3	6.0
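Because errors='coerce' silently turns bad entries into NaN, it can be worth listing which original values were affected before relying on the result. A minimal sketch on the 'col_2' values from above (the variable names are illustrative):

```python
import pandas as pd

s = pd.Series(['4', 'bbb', '6'], name='col_2')

# Coerce: values that cannot be parsed become NaN
coerced = pd.to_numeric(s, errors='coerce')

# Entries that were present in the original but are NaN after coercion
bad = s[coerced.isna() & s.notna()]

print(coerced)
print(bad)
```

The bad Series keeps the original index, so the offending rows (here the single value 'bbb' at position 1) can be traced back and corrected at the source.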

I hope this was helpful.
