
[Python pandas] Checking and handling duplicated indices in time series data (duplicated indices in time series)

Python Analysis and Programming / Python Data Preprocessing 2019. 12. 23. 23:49

The previous post (https://rfriend.tistory.com/499) covered indexing, slicing, and selecting time series data by the date-time index of a pandas Series or DataFrame.

This post shows how to do the following with a pandas Series or DataFrame:

(1) Check whether a time series index contains duplicates (check duplicated time series indices)

(2) Keep only the first row for each duplicated index (keep the first row from duplicated time series indices)

(3) Aggregate with group by on the time series index (group by aggregation using time series indices)

First, let's build a simple pandas Series of time series data to use as an example.

Because the topic of this post is duplicated time series indices (duplicated DatetimeIndex), I used the append() method to add the index values '2019-12-01' and '2019-12-02' a second time, which creates the duplicates.

Time series data is normally stored in chronological order, so I also sorted the index in ascending order with the sort_values() method.

import pandas as pd

# generate dates from 2019-12-01 to 2019-12-10

date_idx = pd.date_range('2019-12-01', periods=10)

date_idx

[Out]:
DatetimeIndex(['2019-12-01', '2019-12-02', '2019-12-03', '2019-12-04',
               '2019-12-05', '2019-12-06', '2019-12-07', '2019-12-08',
               '2019-12-09', '2019-12-10'],
              dtype='datetime64[ns]', freq='D')

# append duplicated dates index

date_idx = date_idx.append(pd.DatetimeIndex(['2019-12-01', '2019-12-02']))

date_idx

[Out]:

DatetimeIndex(['2019-12-01', '2019-12-02', '2019-12-03', '2019-12-04',
               '2019-12-05', '2019-12-06', '2019-12-07', '2019-12-08',
               '2019-12-09', '2019-12-10', '2019-12-01', '2019-12-02'],
              dtype='datetime64[ns]', freq=None)

# Time Series with duplicated dates index

series_ts = pd.Series(range(len(date_idx)), index=date_idx.sort_values())

series_ts

[Out]:

2019-12-01     0
2019-12-01     1
2019-12-02     2
2019-12-02     3
2019-12-03     4
2019-12-04     5
2019-12-05     6
2019-12-06     7
2019-12-07     8
2019-12-08     9
2019-12-09    10
2019-12-10    11
dtype: int64

(1) Check whether a time series index contains duplicates (check duplicated DatetimeIndex indices)

There are several ways to check whether a date-time index has duplicates. The simplest is the is_unique attribute of the index, which returns True if every index value is unique and False if any value is duplicated.

# duplication check

series_ts.index.is_unique

[Out]: False

Another way is to count the distinct index values with nunique() on the index and compare that with the total number of rows from len(series_ts). If the index has duplicates, the two counts differ.

# duplication check

series_ts.index.nunique() == len(series_ts)

[Out]: False

To inspect this at the row level, group by the date-time index (the first index level) with groupby(level=0), count the rows per index value, and look at the index values whose count is greater than 1.

< Count of rows per date-time index value >

# count group by time series index

series_ts.groupby(level=0).count() # or size()

[Out]:

2019-12-01    2
2019-12-02    2
2019-12-03    1
2019-12-04    1
2019-12-05    1
2019-12-06    1
2019-12-07    1
2019-12-08    1
2019-12-09    1
2019-12-10    1
dtype: int64

< Select every row whose date-time index value occurs more than once, i.e. all rows with a duplicated index >

# selecting duplicated index rows

series_ts[series_ts.groupby(level=0).count() > 1]

[Out]:

2019-12-01    0
2019-12-01    1
2019-12-02    2
2019-12-02    3
dtype: int64
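
As an alternative to the groupby count, rows with a duplicated index can also be picked out with Index.duplicated(). The following is a minimal sketch that rebuilds the example Series under the assumed names idx and s (these names are not from the original post). With keep=False, every occurrence of a repeated index value is flagged, not only the later ones.

```python
import pandas as pd

# Rebuild the example: 10 daily dates plus two repeated dates, sorted
idx = pd.date_range('2019-12-01', periods=10).append(
    pd.DatetimeIndex(['2019-12-01', '2019-12-02'])
).sort_values()
s = pd.Series(range(len(idx)), index=idx)

# keep=False flags every row whose index value appears more than once
dup_mask = s.index.duplicated(keep=False)
dup_rows = s[dup_mask]
print(dup_rows)
```

The mask has one boolean per row, so it also works on a DataFrame with the same index.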

(2) Keep only the first row for each duplicated time series index

(keep the first row from duplicated time series indices)

When the date-time index has duplicates, use groupby(level=0).first() to take the first row of each group, or groupby(level=0).last() to take the last row.

# selecting FIRST row in case duplicated index

series_ts.groupby(level=0).first()

[Out]:

2019-12-01     0
2019-12-02     2
2019-12-03     4
2019-12-04     5
2019-12-05     6
2019-12-06     7
2019-12-07     8
2019-12-08     9
2019-12-09    10
2019-12-10    11
dtype: int64

# selecting LAST row in case duplicated index

series_ts.groupby(level=0).last()

[Out]:

2019-12-01     1
2019-12-02     3
2019-12-03     4
2019-12-04     5
2019-12-05     6
2019-12-06     7
2019-12-07     8
2019-12-08     9
2019-12-09    10
2019-12-10    11
dtype: int64
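
The same first-row or last-row selection can be sketched without groupby by negating Index.duplicated(). This is only an illustration on a small made-up Series (the variable names are assumptions); keep='first' leaves the first occurrence unflagged, and keep='last' leaves the last one unflagged.

```python
import pandas as pd

# Small example with one duplicated date
idx = pd.DatetimeIndex(['2019-12-01', '2019-12-01', '2019-12-02', '2019-12-03'])
s = pd.Series([0, 1, 2, 3], index=idx)

# Rows that are NOT flagged as duplicates are the ones we keep
first_rows = s[~s.index.duplicated(keep='first')]
last_rows = s[~s.index.duplicated(keep='last')]

print(first_rows)
print(last_rows)
```

Unlike groupby(level=0).first(), this keeps the original row order and does not sort the index.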

(3) Group by aggregation on the time series index

(group by aggregation using time series indices)

Since groupby(level=0) groups by the date-time index, you can pass the aggregation functions you want to groupby().agg(). The example below computes the number of rows (size), sum, mean, minimum (min), and maximum (max) for each date-time index value, so rows that share a duplicated index are summarized together.

Note that groupby().agg() with several aggregation functions returns a DataFrame, whereas a single aggregation such as groupby(level=0).size() returns a Series.

# aggregating size, sum, mean, min, max by group by time series index

series_ts.groupby(level=0).agg(['size', 'sum', 'mean', 'min', 'max'])

[Out]:

	size	sum	mean	min	max
2019-12-01	2	1	0.5	0	1
2019-12-02	2	5	2.5	2	3
2019-12-03	1	4	4.0	4	4
2019-12-04	1	5	5.0	5	5
2019-12-05	1	6	6.0	6	6
2019-12-06	1	7	7.0	7	7
2019-12-07	1	8	8.0	8	8
2019-12-08	1	9	9.0	9	9
2019-12-09	1	10	10.0	10	10
2019-12-10	1	11	11.0	11	11

I hope this was helpful.


License: Attribution, NonCommercial, NoDerivatives


Posted by Rfriend

[Python pandas] Indexing, slicing, and selecting time series data in a Series or DataFrame

Python Analysis and Programming / Python Data Preprocessing 2019. 12. 23. 18:41

The previous post (https://rfriend.tistory.com/498) covered converting date-time objects to strings and, conversely, converting strings to date-time objects.

This post shows how to index, slice, select, and truncate by date-time in a pandas Series or DataFrame whose index is date-time time series data.

(1) Indexing, slicing, selecting, and truncating time series data in a pandas Series

(2) Indexing, slicing, selecting, and truncating time series data in a pandas DataFrame

(1) Indexing, slicing, selecting, and truncating time series data in a pandas Series

First, as a simple example, let's create a pandas Series indexed by year-month-day dates for the 10 days from November 25, 2019 to December 4, 2019.

I generated the dates with pandas.date_range(start date, periods=number of date-times to generate) and used them as the index of a pandas Series.

import pandas as pd

from datetime import datetime

# DatetimeIndex

ts_days_idx = pd.date_range('2019-11-25', periods=10)

ts_days_idx

[Out]:
DatetimeIndex(['2019-11-25', '2019-11-26', '2019-11-27', '2019-11-28',
               '2019-11-29', '2019-11-30', '2019-12-01', '2019-12-02',
               '2019-12-03', '2019-12-04'],
              dtype='datetime64[ns]', freq='D')

# Series with time series index

series_ts = pd.Series(range(len(ts_days_idx))

, index=ts_days_idx)

series_ts

[Out]:

2019-11-25    0
2019-11-26    1
2019-11-27    2
2019-11-28    3
2019-11-29    4
2019-11-30    5
2019-12-01    6
2019-12-02    7
2019-12-03    8
2019-12-04    9
Freq: D, dtype: int64

series_ts.index

[Out]:

DatetimeIndex(['2019-11-25', '2019-11-26', '2019-11-27', '2019-11-28',
               '2019-11-29', '2019-11-30', '2019-12-01', '2019-12-02',
               '2019-12-03', '2019-12-04'],
              dtype='datetime64[ns]', freq='D')

series_ts.index[6]

[Out]: Timestamp('2019-12-01 00:00:00', freq='D')

Note that specifying the start and end explicitly, as in pd.date_range(start='start date-time', end='end date-time') below, gives the same result as the periods example above.

import pandas as pd

pd.date_range(start='2019-11-25', end='2019-12-04')

[Out]:

DatetimeIndex(['2019-11-25', '2019-11-26', '2019-11-27', '2019-11-28',
               '2019-11-29', '2019-11-30', '2019-12-01', '2019-12-02',
               '2019-12-03', '2019-12-04'],
              dtype='datetime64[ns]', freq='D')

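One more note: in pandas.date_range('start date-time', periods=number of date-times to generate, freq='frequency unit'), the freq option sets the interval at which the date-times are generated, for example 'S' for 1 second, '10S' for 10 seconds, 'H' for 1 hour, 'D' for 1 day, 'M' for 1 month (month end), and 'Y' for 1 year (year end). Very convenient! (Recent pandas releases prefer the aliases 's', 'h', 'ME', and 'YE' for seconds, hours, month end, and year end, and warn about the older spellings.)

< Example: 10 date-time values at 1-second intervals >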

# 10 timeseries data points by Second(freq='S')

pd.date_range('2019-11-25 00:00:00', periods=10, freq='S')

[Out]:

DatetimeIndex(['2019-11-25 00:00:00', '2019-11-25 00:00:01',
               '2019-11-25 00:00:02', '2019-11-25 00:00:03',
               '2019-11-25 00:00:04', '2019-11-25 00:00:05',
               '2019-11-25 00:00:06', '2019-11-25 00:00:07',
               '2019-11-25 00:00:08', '2019-11-25 00:00:09'],
              dtype='datetime64[ns]', freq='S')

< Example: 10 date-time values at 10-second intervals >

# 10 timeseries data points by 10 Seconds (freq='10S')

pd.date_range('2019-11-25 00:00:00', periods=10, freq='10S')

[Out]:
DatetimeIndex(['2019-11-25 00:00:00', '2019-11-25 00:00:10',
               '2019-11-25 00:00:20', '2019-11-25 00:00:30',
               '2019-11-25 00:00:40', '2019-11-25 00:00:50',
               '2019-11-25 00:01:00', '2019-11-25 00:01:10',
               '2019-11-25 00:01:20', '2019-11-25 00:01:30'],
              dtype='datetime64[ns]', freq='10S')
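
As a further illustration of the freq option, here is a minimal sketch that uses the '10min' and 'D' aliases and checks the spacing of the generated values. The variable names are assumptions made for this example.

```python
import pandas as pd

# Six timestamps spaced 10 minutes apart
idx_10min = pd.date_range('2019-11-25 00:00:00', periods=6, freq='10min')

# The gap between consecutive values is a Timedelta
step = idx_10min[1] - idx_10min[0]

# start/end with an explicit daily frequency (both end points included)
idx_range = pd.date_range(start='2019-11-25', end='2019-11-27', freq='D')

print(idx_10min)
print(step)
print(idx_range)
```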

(1-1) Indexing a specific date-time in a pandas Series with a time series index

From the chronologically sorted Series series_ts created above, let's index the value 6, which sits at the 7th position, '2019-12-01'.

As in (a) and (b), you can index by position.

As in (c) and (d), you can also index with a date-time string.

As in (e), you can index with a datetime.datetime(year, month, day) object as well.

import pandas as pd

from datetime import datetime

# (a) indexing with index number

series_ts[6]

[Out]: 6

# (b) indexing with index number using iloc

series_ts.iloc[6]

[Out]: 6

# (c) indexing with string ['year-month-day']

series_ts['2019-12-01']

[Out]: 6

# (d) indexing with string ['month/day/year']

series_ts['12/01/2019']

[Out]: 6

# (e) indexing with datetime.datetime(year, month, day)

series_ts[datetime(2019, 12, 1)]

[Out]: 6
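
The lookups above can also be written with the explicit .iloc and .loc indexers, which state whether a key is a position or a label. The following is a sketch on a rebuilt copy of the example Series (the name s is an assumption). As far as I know, recent pandas versions warn when a plain integer such as s[6] is used as a position on a non-integer index, so .iloc is the safer spelling.

```python
import pandas as pd

s = pd.Series(range(10), index=pd.date_range('2019-11-25', periods=10))

by_position = s.iloc[6]                           # position-based
by_label_str = s.loc['2019-12-01']                # label given as a string
by_label_ts = s.loc[pd.Timestamp('2019-12-01')]   # label given as a Timestamp

print(by_position, by_label_str, by_label_ts)
```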

(1-2) Slicing date-time data in a pandas Series with a time series index

Below are 5 ways to slice out all values from '2019-12-01' onward.

(a) and (b) slice the date-indexed Series by position:position.

(c) and (d) slice with strings, in the form 'year-month-day':'year-month-day' or 'month/day/year':'month/day/year'.

(e) slices with datetime.datetime(year, month, day):datetime.datetime(year, month, day).

import pandas as pd

from datetime import datetime

# (a) slicing with position

series_ts[6:]

[Out]:

2019-12-01    6
2019-12-02    7
2019-12-03    8
2019-12-04    9
Freq: D, dtype: int64

# (b) slicing with position using iloc

series_ts.iloc[6:]

[Out]:

2019-12-01    6
2019-12-02    7
2019-12-03    8
2019-12-04    9
Freq: D, dtype: int64

# (c) slicing with string

series_ts['2019-12-01':'2019-12-10']

[Out]:

2019-12-01    6
2019-12-02    7
2019-12-03    8
2019-12-04    9
Freq: D, dtype: int64

# (d) slicing with string

series_ts['12/01/2019':'12/10/2019']

[Out]:

2019-12-01    6
2019-12-02    7
2019-12-03    8
2019-12-04    9
Freq: D, dtype: int64

# (e) slicing with datetime

series_ts[datetime(2019, 12, 1):datetime(2019, 12, 10)]

[Out]:

2019-12-01    6
2019-12-02    7
2019-12-03    8
2019-12-04    9
Freq: D, dtype: int64
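
One detail worth checking when slicing: a label slice includes both end points, while a position slice excludes the stop position. A minimal sketch, using an assumed variable name s for a rebuilt copy of the example:

```python
import pandas as pd

s = pd.Series(range(10), index=pd.date_range('2019-11-25', periods=10))

# Label slice: '2019-12-01' itself is included
by_label = s.loc['2019-11-29':'2019-12-01']

# Position slice: position 6 ('2019-12-01') is excluded
by_position = s.iloc[4:6]

print(by_label)
print(by_position)
```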

(1-3) Selecting date-time data in a pandas Series with a time series index

You can also select all data for a given year or month with a date-time string. It is a convenient and fun feature.

< Example: select all data for the year '2019' >

# selection with year string

series_ts['2019']

[Out]:

2019-11-25    0
2019-11-26    1
2019-11-27    2
2019-11-28    3
2019-11-29    4
2019-11-30    5
2019-12-01    6
2019-12-02    7
2019-12-03    8
2019-12-04    9
Freq: D, dtype: int64

< Example: select all data for December 2019 >

# selection with year-month string

series_ts['2019-12']

[Out]:

2019-12-01    6
2019-12-02    7
2019-12-03    8
2019-12-04    9
Freq: D, dtype: int64

(1-4) Truncating date-time data in a pandas Series with a time series index (Truncate)

The truncate() method cuts data off at either end. The before and after options set the range to cut; pay close attention to whether the given date itself is kept.

< Example: truncate all data before '2019-12-01' >

(All data up to '2019-11-30' is dropped; the '2019-12-01' row is kept)

# truncate before

series_ts.truncate(before='2019-12-01')

[Out]:

2019-12-01    6
2019-12-02    7
2019-12-03    8
2019-12-04    9
Freq: D, dtype: int64

< Example: truncate all data after '2019-11-30' >

(All data from '2019-12-01' onward is dropped; the '2019-11-30' row is kept)

# truncate after

series_ts.truncate(after='2019-11-30')

[Out]:

2019-11-25    0
2019-11-26    1
2019-11-27    2
2019-11-28    3
2019-11-29    4
2019-11-30    5
Freq: D, dtype: int64
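
The before and after options can also be combined in a single call. The sketch below (variable names are assumptions) keeps only the rows between the two dates, with both end dates kept, which gives the same rows as a label slice with .loc.

```python
import pandas as pd

s = pd.Series(range(10), index=pd.date_range('2019-11-25', periods=10))

# Drop everything before 2019-11-28 and everything after 2019-12-01
trimmed = s.truncate(before='2019-11-28', after='2019-12-01')

# Equivalent label slice for comparison
sliced = s.loc['2019-11-28':'2019-12-01']

print(trimmed)
```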

(2) Indexing, slicing, selecting, and truncating time series data in a pandas DataFrame

The indexing, slicing, selection, and truncation methods for a pandas Series shown in (1) work the same way on a pandas DataFrame.

Let's create a simple pandas DataFrame indexed by year-month-day dates.

import pandas as pd

from datetime import datetime

# DatetimeIndex

ts_days_idx = pd.date_range('2019-11-25', periods=10)

ts_days_idx

[Out]:

DatetimeIndex(['2019-11-25', '2019-11-26', '2019-11-27', '2019-11-28',
               '2019-11-29', '2019-11-30', '2019-12-01', '2019-12-02',
               '2019-12-03', '2019-12-04'],
              dtype='datetime64[ns]', freq='D')

# DataFrame with DatetimeIndex

df_ts = pd.DataFrame(range(len(ts_days_idx)), columns=['col'], index=ts_days_idx)

df_ts

	col
2019-11-25	0
2019-11-26	1
2019-11-27	2
2019-11-28	3
2019-11-29	4
2019-11-30	5
2019-12-01	6
2019-12-02	7
2019-12-03	8
2019-12-04	9

(2-1) Indexing a specific date-time in a pandas DataFrame with a time series index

This is almost the same as Series indexing in (1-1), but on a DataFrame the two forms df_ts[6] and df_ts[datetime(2019, 12, 1)] raise a KeyError, because [] with a single key on a DataFrame looks up a column. Only the three approaches below work for indexing.

(a) Index by position with the iloc[integer] indexer

(b), (c) Index by label with the loc['label'] indexer

# (a) indexing with index position integer using iloc[]

df_ts.iloc[6]

[Out]:

col    6
Name: 2019-12-01 00:00:00, dtype: int64

# (b) indexing with index labels ['year-month-day']

df_ts.loc['2019-12-01']

[Out]:

col    6
Name: 2019-12-01 00:00:00, dtype: int64

# (c) indexing with index labels ['month/day/year']

df_ts.loc['12/01/2019']

[Out]:

col    6
Name: 2019-12-01 00:00:00, dtype: int64
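
When a single cell is needed rather than a whole row, the row label and column name can be passed together. This is a small sketch on a rebuilt copy of the example DataFrame; the name df and the other variable names are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col': range(10)},
                  index=pd.date_range('2019-11-25', periods=10))

key = pd.Timestamp('2019-12-01')

value = df.loc[key, 'col']    # row label + column name -> scalar
value_at = df.at[key, 'col']  # .at is the scalar-only accessor
row = df.iloc[6]              # whole row by position -> Series

print(value, value_at)
print(row)
```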

(2-2) Slicing date-time data in a pandas DataFrame with a time series index

Below are 4 ways to slice out all values from '2019-12-01' onward.

(a) slices the date-indexed DataFrame by position:position.

(b) and (c) slice with date strings, using loc['year-month-day':'year-month-day'] or loc['month/day/year':'month/day/year'].

(d) slices with loc[datetime.datetime(year, month, day):datetime.datetime(year, month, day)].

# (a) slicing DataFrame with position integer

df_ts[6:10]

	col
2019-12-01	6
2019-12-02	7
2019-12-03	8
2019-12-04	9

# (b) slicing using date strings 'year-month-day'

df_ts.loc['2019-12-01':'2019-12-10']

	col
2019-12-01	6
2019-12-02	7
2019-12-03	8
2019-12-04	9

# (c) slicing using date strings 'month/day/year'

df_ts.loc['12/01/2019':'12/10/2019']

	col
2019-12-01	6
2019-12-02	7
2019-12-03	8
2019-12-04	9

# (d) slicing using datetime objects

from datetime import datetime

df_ts.loc[datetime(2019, 12, 1):datetime(2019, 12, 10)]

	col
2019-12-01	6
2019-12-02	7
2019-12-03	8
2019-12-04	9

(2-3) Selecting date-time data in a pandas DataFrame with a time series index

Passing a 'year' or 'year-month' date string to df.loc['year'] or df.loc['year-month'] selects all data for that year or month.

< Example: select all data for the year 2019 >

# selection of year '2019'

df_ts.loc['2019'] # df_ts['2019']

	col
2019-11-25	0
2019-11-26	1
2019-11-27	2
2019-11-28	3
2019-11-29	4
2019-11-30	5
2019-12-01	6
2019-12-02	7
2019-12-03	8
2019-12-04	9

< Example: select all data for December 2019 >

# selection of year-month '2019-12'

df_ts.loc['2019-12']

	col
2019-12-01	6
2019-12-02	7
2019-12-03	8
2019-12-04	9
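
The same year or month selection can be sketched with a boolean mask built from the index attributes, which is handy when the condition is not a simple prefix of the date. The names below are assumptions for this example.

```python
import pandas as pd

df = pd.DataFrame({'col': range(10)},
                  index=pd.date_range('2019-11-25', periods=10))

# All rows in December
dec_rows = df[df.index.month == 12]

# All rows in 2019 that fall on the 1st to 3rd of any month
early_rows = df[(df.index.year == 2019) & (df.index.day <= 3)]

print(dec_rows)
print(early_rows)
```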

(2-4) Truncating date-time data in a pandas DataFrame with a time series index (Truncate)

With the truncate() method you can cut off the data before the before date or the data after the after date.

< Example: truncate the data before '2019-12-01' >

(The '2019-12-01' row is not removed)

# truncate before

df_ts.truncate(before='2019-12-01') # '2019-12-01' is not removed

	col
2019-12-01	6
2019-12-02	7
2019-12-03	8
2019-12-04	9

< Example: truncate the data after '2019-11-30' >

(The '2019-11-30' row is not removed)

# truncate after

df_ts.truncate(after='2019-11-30') # '2019-11-30' is not removed

	col
2019-11-25	0
2019-11-26	1
2019-11-27	2
2019-11-28	3
2019-11-29	4
2019-11-30	5

I hope this was helpful.


[Python pandas] Converting Python datetime and pandas Timestamp to strings, and strings to datetime and Timestamp

Python Analysis and Programming / Python Data Preprocessing 2019. 12. 22. 22:08

The previous post covered creating, converting, and inspecting date-time data with Python standard datetime and pandas Timestamp objects.

This post covers, in order:

(1) Converting Python datetime and pandas Timestamp objects to strings

(Converting native Python datetime, pandas Timestamp objects to Strings)

(2) Converting strings to Python datetime and pandas Timestamp objects

(Converting Strings to Python datetime, pandas Timestamp)

(1) Converting Python datetime and pandas Timestamp objects to strings

(Converting native Python datetime, pandas Timestamp objects to Strings)

(1-1) Converting a Python datetime object to a string with str()

# Create Python datetime objects

import datetime as dt

ts = dt.datetime(2019, 12, 22, 13, 30, 59) # (year, month, day, hour, minute, second)

type(ts)

[Out]: datetime.datetime

ts

[Out]: datetime.datetime(2019, 12, 22, 13, 30, 59)

# converting Python datetime objects to Strings

ts_str = str(ts)

type(ts_str)

[Out]: str

ts_str

[Out]: '2019-12-22 13:30:59'

# indexing from a string

print('year-month-day:', ts_str[:10])

[Out]: year-month-day: 2019-12-22

print('hour:minute:second:', ts_str[10:])
[Out]: hour:minute:second:  13:30:59
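
Slicing the string by character position works, but it keeps the separating space in the time part (note the two spaces in the output above). As a sketch of an alternative, the date() and time() methods of the datetime object return the two parts directly, and isoformat() turns each into a string.

```python
import datetime as dt

ts = dt.datetime(2019, 12, 22, 13, 30, 59)

# Split into date and time objects, then format each as an ISO string
date_part = ts.date().isoformat()
time_part = ts.time().isoformat()

print('year-month-day:', date_part)
print('hour:minute:second:', time_part)
```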

(1-2) Converting a Python datetime object to a string with strftime()

Pass a format specification to the strftime() method of the Python standard datetime object. For example, to format as 4-digit year, month, day, 0-23 hour, minute, and second, use strftime('%Y-%m-%d %H:%M:%S').

import datetime as dt

ts = dt.datetime(2019, 12, 22, 13, 30, 59) # (year, month, day, hour, minute, second)

ts

[Out]: datetime.datetime(2019, 12, 22, 13, 30, 59)

ts_strftime = ts.strftime('%Y-%m-%d %H:%M:%S')

type(ts_strftime)

[Out]: str

ts_strftime

[Out]: '2019-12-22 13:30:59'

To format the year as a 2-digit year, use '%y'. To format the hour on a 12-hour clock (01 to 12), use '%I'.

# %Y: 4-digit year vs. %y: 2-digit year

# %H: 24-hour clock [00, 23] vs. %I: 12-hour clock [01, 12]

ts.strftime('%y-%m-%d %I:%M:%S')

[Out]: '19-12-22 01:30:59'
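
To make the range of '%I' concrete, the following sketch formats a few hours of the day with it. The dictionary name hours_12 is an assumption for this example. Midnight and noon both come out as '12', so '%I' is usually paired with '%p' (AM/PM) to stay unambiguous.

```python
import datetime as dt

# Map 24-hour values to their '%I' (12-hour clock) representation
hours_12 = {
    hour: dt.datetime(2019, 12, 22, hour).strftime('%I')
    for hour in (0, 1, 12, 13, 23)
}

for hour, text in hours_12.items():
    print(hour, '->', text)
```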

datetime.datetime(2019, 12, 22).strftime('%w') returns the day of the week as a digit string: '0' for Sunday, '1' for Monday, ..., '6' for Saturday.

# '%w': weekday

print('----- %w: weekday as integer -----')

print('Sunday :', dt.datetime(2019, 12, 22).strftime('%w'))

print('Monday :', dt.datetime(2019, 12, 23).strftime('%w'))

print('Tuesday :', dt.datetime(2019, 12, 24).strftime('%w'))

print('Wednesday :', dt.datetime(2019, 12, 25).strftime('%w'))

print('Thursday :', dt.datetime(2019, 12, 26).strftime('%w'))

print('Friday :', dt.datetime(2019, 12, 27).strftime('%w'))

print('Saturday :', dt.datetime(2019, 12, 28).strftime('%w'))

[Out]:

----- %w: weekday as integer -----
Sunday    : 0
Monday    : 1
Tuesday   : 2
Wednesday : 3
Thursday  : 4
Friday    : 5
Saturday  : 6

datetime.datetime(2019, 12, 22).strftime('%U') returns the week number of the year. The difference is that '%U' treats Sunday as the first day of the week, while '%W' treats Monday as the first day of the week.

# '%U': week number of the year [00, 53].

# Sunday is considered the first day of the week

dt.datetime(2019, 12, 22).strftime('%U') # '2019-12-22' is Sunday

[Out]: '51'

# '%W': week number of the year [00, 53].

# Monday is considered the first day of the week

dt.datetime(2019, 12, 22).strftime('%W') # '2019-12-22' is Sunday

[Out]: '50'

dt.datetime(2019, 12, 23).strftime('%W') # '2019-12-23' is Monday

[Out]: '51'
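
For comparison, the ISO 8601 week number is available through isocalendar(), which counts weeks from Monday and numbers the weekday 1 (Monday) to 7 (Sunday). This sketch puts the three week numbers side by side for the same Sunday; the variable names are assumptions.

```python
import datetime as dt

d = dt.date(2019, 12, 22)   # a Sunday

iso = d.isocalendar()       # (ISO year, ISO week, ISO weekday)
iso_week, iso_weekday = iso[1], iso[2]

week_U = d.strftime('%U')   # weeks start on Sunday
week_W = d.strftime('%W')   # weeks start on Monday

print('ISO week:', iso_week, 'ISO weekday:', iso_weekday)
print('%U:', week_U, '%W:', week_W)
```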

(1-3) Converting a pandas Timestamp to a string with strftime()

import pandas as pd

pd_ts = pd.Timestamp(2019, 12, 22, 13, 30, 59)

pd_ts

[Out]: Timestamp('2019-12-22 13:30:59')

# convert pandas Timestamp to string

pd_ts.strftime('%y-%m-%d %I:%M:%S')

[Out]: '19-12-22 01:30:59'

(2) Converting strings to Python datetime and pandas Timestamp objects

(Converting Strings to Python datetime, pandas Timestamp)

(2-1) Converting a string to a Python datetime.datetime object with datetime.strptime()

In strptime(string, date-time format), specify the format so that it matches the layout of the string being parsed.

# create a string with 'year-month-day hour:minute:second'

ts_str = '2019-12-22 13:30:59'

ts_str

[Out]: '2019-12-22 13:30:59'

# convert a string to datetime object

import datetime as dt

dt.datetime.strptime(ts_str, '%Y-%m-%d %H:%M:%S')

[Out]: datetime.datetime(2019, 12, 22, 13, 30, 59)

A list of date-time strings can be turned into a list of datetime objects with a list comprehension.

# convert strings list using list comprehension

ts_str_list = ['2019-12-22', '2019-12-23', '2019-12-24', '2019-12-25']

[dt.datetime.strptime(date, '%Y-%m-%d') for date in ts_str_list]

[Out]:

[datetime.datetime(2019, 12, 22, 0, 0),
 datetime.datetime(2019, 12, 23, 0, 0),
 datetime.datetime(2019, 12, 24, 0, 0),
 datetime.datetime(2019, 12, 25, 0, 0)]

The example below covers a pandas DataFrame in which a "Date" string column and a "Time" string column are separate: (1) first combine the two columns into a single "DateTime" column, and (2) then convert the strings to datetime objects.

I used strptime() inside apply() as an anonymous lambda function, and set a date and time format that is slightly different from the one above ('%d/%m/%Y %H.%M.%S').

import pandas as pd

# make a sample DataFrame with Date and Time (string format)

Date = ['01/12/2019', '01/12/2019', '01/12/2019']

Time = ['09.01.00', '09.01.01', '09.01.02']

val = [1, 2, 3]

df = pd.DataFrame({'Date': Date, 'Time': Time, 'val': val})

df

[Out]:

	Date	Time	val
0	01/12/2019	09.01.00	1
1	01/12/2019	09.01.01	2
2	01/12/2019	09.01.02	3

# combine 'Date' and 'Time' column as 'DateTime'

df['DateTime'] = df.Date + ' ' + df.Time

df

[Out]:

	Date	Time	val	DateTime
0	01/12/2019	09.01.00	1	01/12/2019 09.01.00
1	01/12/2019	09.01.01	2	01/12/2019 09.01.01
2	01/12/2019	09.01.02	3	01/12/2019 09.01.02

# check date type : strings

type(df.DateTime[0])

[Out]: str

# convert 'DateTime' column from strings to datetime object

# using datetime.strptime() and lambda, apply function

from datetime import datetime

df.DateTime = df.DateTime.apply(lambda x: datetime.strptime(x, '%d/%m/%Y %H.%M.%S'))

df.set_index('DateTime', inplace=True)

df

[Out]:

	Date	Time	val
DateTime
2019-12-01 09:01:00	01/12/2019	09.01.00	1
2019-12-01 09:01:01	01/12/2019	09.01.01	2
2019-12-01 09:01:02	01/12/2019	09.01.02	3

# check data type : datetime object Timestamp

type(df.index[0])

[Out]: pandas._libs.tslibs.timestamps.Timestamp
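
As an alternative to apply() with a lambda, the combined string column can be converted in one vectorized call to pd.to_datetime() with an explicit format. This is a sketch on a rebuilt copy of the sample data, not the original post's method.

```python
import pandas as pd

df = pd.DataFrame({'Date': ['01/12/2019', '01/12/2019', '01/12/2019'],
                   'Time': ['09.01.00', '09.01.01', '09.01.02'],
                   'val': [1, 2, 3]})

# Combine the two string columns and parse them with a day-first format
df['DateTime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'],
                                format='%d/%m/%Y %H.%M.%S')

# Use the parsed column as the index (returns a new DataFrame)
df = df.set_index('DateTime')

print(df)
```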

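datetime.strptime(string, date-time format) requires the format to match the string exactly. In the call below, '%m' expects a numeric month, so the month name 'Dec' does not match and a ValueError is raised (the directive for an abbreviated month name is '%b'). When the layout of the strings varies, the parse() function of dateutil.parser in (2-2) parses date-time strings more flexibly.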

dt.datetime.strptime('Dec 22, 2019 13:30:59', '%m %d, %Y %H:%M:%S')

[Out]:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-42-7dfd356853ba> in <module>
----> 1 dt.datetime.strptime('Dec 22, 2019 13:30:59', '%m %d, %Y %H:%M:%S')

~/anaconda3/envs/py3.6_tf2.0/lib/python3.6/_strptime.py in _strptime_datetime(cls, data_string, format)
    563     """Return a class cls instance based on the input string and the
    564     format string."""
--> 565     tt, fraction = _strptime(data_string, format)
    566     tzname, gmtoff = tt[-2:]
    567     args = tt[:6] + (fraction,)

~/anaconda3/envs/py3.6_tf2.0/lib/python3.6/_strptime.py in _strptime(data_string, format)
    360     if not found:
    361         raise ValueError("time data %r does not match format %r" %
--> 362                          (data_string, format))
    363     if len(data_string) != found.end():
    364         raise ValueError("unconverted data remains: %s" %

ValueError: time data 'Dec 22, 2019 13:30:59' does not match format '%m %d, %Y %H:%M:%S'
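
The failure above comes from the '%m' directive, not from strptime() itself. The sketch below catches the ValueError and then parses the same string with '%b', the directive for an abbreviated month name. It assumes the default C/English locale, because month names in '%b' are locale-dependent; the variable names are assumptions.

```python
import datetime as dt

text = 'Dec 22, 2019 13:30:59'

# '%m' expects a number such as '12', so this format does not match
try:
    dt.datetime.strptime(text, '%m %d, %Y %H:%M:%S')
    numeric_month_ok = True
except ValueError:
    numeric_month_ok = False

# '%b' matches an abbreviated month name such as 'Dec'
parsed = dt.datetime.strptime(text, '%b %d, %Y %H:%M:%S')

print(numeric_month_ok)
print(parsed)
```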

(2-2) Converting a string to a datetime object with parser.parse() from the dateutil library

As the three examples below show, parser.parse() from the dateutil library recognizes date-time strings in a variety of layouts and returns a datetime.datetime object. Compared with datetime.strptime() in (2-1), parse() from dateutil.parser is more convenient because no format string is needed.

# Converting Strings to datetime.datetime objects using parser.parse()

from dateutil.parser import parse

parse('2019-12-22 13:30:59')

[Out]: datetime.datetime(2019, 12, 22, 13, 30, 59)

parse('Dec 22, 2019 13:30:59')

[Out]: datetime.datetime(2019, 12, 22, 13, 30, 59)

parse('Dec 22, 2019 01:30:59 PM')

[Out]: datetime.datetime(2019, 12, 22, 13, 30, 59)

Whether the first number in the string passed to parse() is the month or the day can be stated explicitly with the option dayfirst=False (default) or dayfirst=True.

# month first

parse('01/12/2019 13:30:59')

[Out]: datetime.datetime(2019, 1, 12, 13, 30, 59)

# day first

parse('01/12/2019 13:30:59', dayfirst=True)

[Out]: datetime.datetime(2019, 12, 1, 13, 30, 59)
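
pandas offers the same switch: pd.to_datetime() also accepts a dayfirst argument. The following is a brief sketch for comparison, with assumed variable names.

```python
import pandas as pd

text = '01/12/2019 13:30:59'

month_first = pd.to_datetime(text)                # read as January 12
day_first = pd.to_datetime(text, dayfirst=True)   # read as December 1

print(month_first)
print(day_first)
```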

A list of date-time strings can be turned into a list of datetime.datetime objects with a list comprehension.

# convert strings list using list comprehension

ts_str_list = ['2019-12-22', '2019-12-23', '2019-12-24', '2019-12-25']

[parse(date) for date in ts_str_list]

[Out]:

[datetime.datetime(2019, 12, 22, 0, 0),
 datetime.datetime(2019, 12, 23, 0, 0),
 datetime.datetime(2019, 12, 24, 0, 0),
 datetime.datetime(2019, 12, 25, 0, 0)]

(2-3) Converting a date-time string to a pandas Timestamp with pd.to_datetime()

* For details on pandas Timestamp, see https://rfriend.tistory.com/497

# pandas Timestamp

import pandas as pd

pd.to_datetime('2019-12-22 13:30:59')

[Out]: Timestamp('2019-12-22 13:30:59')

From a list of date-time strings, pandas.to_datetime() creates a pandas DatetimeIndex. The resulting DatetimeIndex can be used as the index when creating a pandas Series or DataFrame of time series data.

# pandas DatetimeIndex

ts_str_list = ['2019-12-22', '2019-12-23', '2019-12-24', '2019-12-25']

pd.to_datetime(ts_str_list)

[Out]:

DatetimeIndex(['2019-12-22', '2019-12-23', '2019-12-24', '2019-12-25'], dtype='datetime64[ns]', freq=None)
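
Real input lists often contain entries that are not valid dates. As a sketch of one way to handle that, errors='coerce' makes pd.to_datetime() return NaT (not a time) for entries it cannot parse instead of raising. The sample list below is made up for illustration.

```python
import pandas as pd

raw = ['2019-12-22', '2019-12-23', 'not a date']

# Unparseable entries become NaT instead of raising an error
idx = pd.to_datetime(raw, errors='coerce')

print(idx)
print(idx.isna())
```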

I hope this was helpful.


[Python pandas] Creating, converting, and extracting date and time data with Timestamp

Python Analysis and Programming 2019. 12. 22. 16:26

The previous post (https://rfriend.tistory.com/496) introduced the 4 data types of the native Python datetime module: date, time, datetime, and timedelta.

This post shows how to create, convert, and extract information from date-time data with the pandas Timestamp class, which works much like Python datetime. The material is basic and easy.

(1) Creating a pandas Timestamp : pd.Timestamp()

With the pandas.Timestamp() class, a pandas Timestamp object can be created in two ways: (a) by position or (b) by keyword.

(a) With positional arguments, the order means (year, month, day, hour, minute, second).

(b) With keyword arguments, give a value for each of year, month, day, hour, minute, and second, in any order.

However, the positional and keyword styles cannot be mixed in a single Timestamp call.

import pandas as pd

# (a) position: (year, month, day, hour, minute, second)

pd_ts = pd.Timestamp(2019, 12, 22, 13, 30, 59)

pd_ts

[OUT] Timestamp('2019-12-22 13:30:59')

# (b) keyword

pd.Timestamp(year=2019, month=12, day=22, hour=13, minute=30, second=59)

[OUT] Timestamp('2019-12-22 13:30:59')

# keyword : with different position

pd.Timestamp(day=22, month=12, year=2019, hour=13, minute=30, second=59)

[OUT] Timestamp('2019-12-22 13:30:59')

(2) Inspecting date and time information with pandas Timestamp attributes

# pandas Timestamp's Attributes

stamp = pd.Timestamp(2019, 12, 22, 13, 30, 59)

print('---------------------------')

print('pandas Timestamp Attributes')

print('---------------------------')

print('pandas Timestamp:', stamp)

print('year:', stamp.year)

print('month:', stamp.month)

print('day:', stamp.day)

print('hour:', stamp.hour)

print('minute:', stamp.minute)

print('second:', stamp.second)

print('microsecond:', stamp.microsecond)

print('day of week:', stamp.dayofweek) # [Monday 0 ~ Sunday 6]

print('day of year:', stamp.dayofyear)

print('days in month:', stamp.days_in_month) # or daysinmonth

print('quarter:', stamp.quarter)

print('week number of the year:', stamp.week) # or weekofyear

---------------------------
pandas Timestamp Attributes
---------------------------
pandas Timestamp: 2019-12-22 13:30:59
year: 2019
month: 12
day: 22
hour: 13
minute: 30
second: 59
microsecond: 0
day of week: 6
day of year: 356
days in month: 31
quarter: 4
week number of the year: 51
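
The numeric attributes above can be combined with simple comparisons. For example, since dayofweek runs from Monday 0 to Sunday 6, a weekend check is one line. A minimal sketch with assumed variable names; day_name() is a method, not an attribute.

```python
import pandas as pd

stamp = pd.Timestamp(2019, 12, 22, 13, 30, 59)

weekday_name = stamp.day_name()     # English name of the weekday
is_weekend = stamp.dayofweek >= 5   # 5 = Saturday, 6 = Sunday

print(weekday_name, is_weekend)
```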

(3) Inspecting date and time information with pandas Timestamp methods

- date() method: returns a year-month-day object

- time() method: returns an hour-minute-second object

# pandas Timestamp

pd_ts = pd.Timestamp(2019, 12, 22, 13, 30, 59)

pd_ts

[Out] Timestamp('2019-12-22 13:30:59')

# pandas Timestamp date(), time() method

print('date:', pd_ts.date())

print('time', pd_ts.time())

[Out] date: 2019-12-22
[Out] time 13:30:59

- combine() method: combines a date object and a time object into a date-time object

# combine() method

pd.Timestamp.combine(pd_ts.date(), pd_ts.time())

[Out] Timestamp('2019-12-22 13:30:59')

- month_name() method: returns the English name of the month

pd_ts.month_name()

[Out] 'December'

- timestamp() method: returns the POSIX timestamp as a float

pd_ts.timestamp()

[Out] 1577021459.0

- now(), today() methods: return the current date and time

(cf. equivalent to datetime.now())

# current date and time

pd.Timestamp.now()

[Out] Timestamp('2019-12-22 15:55:52.916704')

pd.Timestamp.today()

[Out] Timestamp('2019-12-22 15:55:52.924013')

# equivalent to datetime.now()

import datetime as dt

dt.datetime.now()

[Out] datetime.datetime(2019, 12, 22, 15, 55, 52, 933711)

(4) Converting a pandas Timestamp to a string, and a string to a Timestamp

(Converting between pandas Timestamp and Strings)

(4-1) strftime(format): convert a pandas Timestamp to a string

# convert pandas Timestamp to string

pd_ts = pd.Timestamp(2019, 12, 22, 13, 30, 59)

pd_ts.strftime('%Y-%m-%d %H:%M:%S') # 4-digit year, 24-hour format

[Out] '2019-12-22 13:30:59'

pd_ts.strftime('%y-%m-%d %I:%M:%S') # 2-digit year, 12-hour format

[Out] '19-12-22 01:30:59'

(4-2) pd.to_datetime(string): convert a string to a pandas Timestamp

# convert string to pandas Timestamp

pd.to_datetime('2019-12-22 01:30:59')

[Out] Timestamp('2019-12-22 01:30:59')

From a list of strings in a date(-time) format, pd.to_datetime() creates a pandas DatetimeIndex. This DatetimeIndex can then be used as the index of a pandas Series or DataFrame.

# convert string list to pandas datetimeIndex

ts_list = ['2019-12-22', '2019-12-23', '2019-12-24', '2019-12-25']

ts_idx = pd.to_datetime(ts_list)

ts_idx

[Out] DatetimeIndex(['2019-12-22', '2019-12-23', '2019-12-24', '2019-12-25'], dtype='datetime64[ns]', freq=None)

val = [1, 2, 3, 4]

ts_series = pd.Series(val, index=ts_idx)

ts_series

[Out]

2019-12-22    1
2019-12-23    2
2019-12-24    3
2019-12-25    4
dtype: int64

ts_df = pd.DataFrame(val, columns=['val'], index=ts_idx)

ts_df

[Out]

	val
2019-12-22	1
2019-12-23	2
2019-12-24	3
2019-12-25	4
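One reason to use a DatetimeIndex is that rows can then be selected by date. The sketch below rebuilds the same ts_series and slices it with date strings; the name subset is only illustrative.

```python
import pandas as pd

ts_idx = pd.to_datetime(['2019-12-22', '2019-12-23', '2019-12-24', '2019-12-25'])
ts_series = pd.Series([1, 2, 3, 4], index=ts_idx)

# label slicing on a DatetimeIndex includes both end points
subset = ts_series.loc['2019-12-23':'2019-12-24']
print(subset)

# an exact Timestamp key returns the single value stored at that date
print(ts_series.loc[pd.Timestamp('2019-12-25')])
```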

(5) Converting a pandas Timestamp to a Python standard datetime

The to_pydatetime() method of pandas converts a pandas Timestamp object into a date-time object of the native Python datetime module.

pd_ts = pd.Timestamp(2019, 12, 22, 13, 30, 59)

pd_ts

[Out] Timestamp('2019-12-22 13:30:59')

# convert pandas Timestamp to native Python datetime

pd_ts.to_pydatetime()

[Out] datetime.datetime(2019, 12, 22, 13, 30, 59)
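The conversion also works in the other direction. A minimal round-trip sketch, where py_dt and back are illustrative names:

```python
import datetime as dt
import pandas as pd

py_dt = dt.datetime(2019, 12, 22, 13, 30, 59)

pd_ts = pd.Timestamp(py_dt)     # native datetime -> pandas Timestamp
back = pd_ts.to_pydatetime()    # pandas Timestamp -> native datetime

print(type(pd_ts).__name__, type(back).__name__)
print(back == py_dt)
```

Because pd.Timestamp is a subclass of datetime.datetime, a Timestamp can usually be passed wherever a datetime is expected; to_pydatetime() is for the cases where the exact native type matters.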

I hope this was helpful.


License: Attribution, NonCommercial, NoDerivatives

Posted by Rfriend

[Python datetime] Data types in the datetime module

Python Analysis and Programming/Python Data Preprocessing 2019. 12. 21. 22:28

This post covers the four data types of the datetime module, which is part of the Python standard library (not pandas) and is used to process, manipulate, and analyze the dates and times of time series data.

(1) datetime.date: year, month, day

(2) datetime.time: hour, minute, second, microsecond

(3) datetime.datetime: date (year, month, day) & time (hour, minute, second, microsecond)

(4) datetime.timedelta: the difference between two datetime values

--> stored as days, seconds, microseconds

The module also provides the datetime.tzinfo and datetime.timezone classes.

[ Python standard library: 4 data types in datetime module ]

(1) datetime.date : year, month, day

datetime.date(year, month, day) creates a calendar date object that holds year/month/day information, and the year, month, and day attributes extract each value from the date object.

import pandas as pd

import datetime as dt

# date: (year, month, day)

mydate = dt.date(2019, 12, 21)

mydate

datetime.date(2019, 12, 21)

print('year:', mydate.year)

print('month:', mydate.month)

print('day:', mydate.day)

year: 2019
month: 12
day: 21

(2) datetime.time : hour, minute, second, microsecond

The datetime.time class creates and reads clock-time objects made up of hour, minute, second, and microsecond.

# time: (hour, minute, second, microsecond)

mytime = dt.time(20, 46, 22, 445671)

mytime

datetime.time(20, 46, 22, 445671)

print('hour:', mytime.hour)

print('minute:', mytime.minute)

print('second:', mytime.second)

print('microsecond:', mytime.microsecond)

hour: 20
minute: 46
second: 22
microsecond: 445671

(3) datetime.datetime : date(year, month, day)
& time(hour, minute, second, microsecond)

datetime.datetime combines the datetime.date from (1) and the datetime.time from (2) into a date-time object that holds both date and time information.

datetime.datetime.now() returns the current date and time as an object.

The year, month, day, hour, minute, second, and microsecond attributes extract each piece of information from a datetime object.

# datetime: (year, month, day, hour, minute, second, microsecond)

now = dt.datetime.now() # current date and time

now

datetime.datetime(2019, 12, 21, 20, 46, 22, 445671)

print('year:', now.year)

print('month:', now.month)

print('day:', now.day)

print('hour:', now.hour)

print('minute:', now.minute)

print('second:', now.second)

print('microsecond:', now.microsecond)

year: 2019
month: 12
day: 21
hour: 20
minute: 46
second: 22
microsecond: 445671

Subtracting one datetime.datetime object from another with the - operator gives the difference between the two date-times.

The days, seconds, and microseconds attributes of this difference (delta) return its day, second, and microsecond components.

now = dt.datetime.now() # current date and time

delta = now - dt.datetime(2019, 12, 1, 23, 59, 59)

delta

datetime.timedelta(19, 74783, 445671)

print('delta days:', delta.days)

print('delta seconds:', delta.seconds)

print('delta microseconds:', delta.microseconds)

delta days: 19
delta seconds: 74783
delta microseconds: 445671
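Note that delta.seconds is only the seconds component left over after whole days are removed, not the whole duration. The sketch below fixes "now" to the value printed above so that it is reproducible, and uses total_seconds() to get the full length in seconds.

```python
import datetime as dt

# same instants as in the output above, with 'now' fixed for reproducibility
delta = dt.datetime(2019, 12, 21, 20, 46, 22, 445671) - dt.datetime(2019, 12, 1, 23, 59, 59)

# components: whole days, leftover seconds, leftover microseconds
print(delta.days, delta.seconds, delta.microseconds)

# the entire difference expressed in seconds
total = delta.total_seconds()
print(total)
```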

(4) datetime.timedelta : the difference between two datetime values

(the difference between 2 datetime values)

The timedelta class represents the difference between two datetime values, and it makes it easy to compute a date-time that lies a given number of days or amount of time away from another.

A timedelta stores the difference in the form datetime.timedelta(days, seconds, microseconds).

weeks = 1 is converted to 7 days, minutes = 1 to 60 seconds, and milliseconds = 1000 to 1 second.

# timedelta() class

import datetime as dt

delta = dt.timedelta(days=1,

seconds=20,

microseconds=1000,

milliseconds=5000,

minutes=5,

hours=12,

weeks=2)

delta

datetime.timedelta(15, 43525, 1000)

# check

days = 1

weeks = 2

seconds = 20

microseconds = 1000

milliseconds = 5000

minutes = 5

hours = 12

print('days:', days + 7*weeks)

print('seconds:', seconds + 60*minutes + 60*60*hours + milliseconds/1000)

print('microsecond:', microseconds)

days: 15
seconds: 43525.0
microsecond: 1000

Let's use the timedelta class to add 1 day, 1 day 10 seconds, and 1 day 10 seconds 100 microseconds, respectively.

# timedelta: difference between two datetime values

# (days)

dt.datetime(2019, 12, 21) + dt.timedelta(1) # + 1 day

datetime.datetime(2019, 12, 22, 0, 0)

# (days, seconds)

dt.datetime(2019, 12, 21, 23, 59, 59) + dt.timedelta(1, 10) # + 1 day 10 seconds

datetime.datetime(2019, 12, 23, 0, 0, 9)

# (days, seconds, microseconds)

dt.datetime(2019, 12, 21, 23, 59, 59, 1000) + dt.timedelta(1, 10, 100) # + 1day 10seconds 100microseconds

datetime.datetime(2019, 12, 23, 0, 0, 9, 1100)

This time, going the other way, let's subtract 1 day, 10 seconds, 100 microseconds with the datetime.timedelta class.

# minus

dt.datetime(2019, 12, 21, 23, 59, 59, 1000) - dt.timedelta(1, 10, 100)

datetime.datetime(2019, 12, 20, 23, 59, 49, 900)

You can also multiply or divide a timedelta before subtracting it. In the first example, - 5 * datetime.timedelta(1) subtracts 5 days. In the second, - datetime.timedelta(10) / 2 divides 10 days by 2 and also subtracts 5 days, so both return the same result.

# multiplication

dt.datetime(2019, 12, 21, 23, 59, 59, 1000) - 5 * dt.timedelta(1)

datetime.datetime(2019, 12, 16, 23, 59, 59, 1000)

# divide

dt.datetime(2019, 12, 21, 23, 59, 59, 1000) - dt.timedelta(10) / 2

datetime.datetime(2019, 12, 16, 23, 59, 59, 1000)

For entering, converting, and querying date-time values with the pandas Timestamp class, see https://rfriend.tistory.com/497.

I hope this was helpful.


Posted by Rfriend

[Python pandas] Aggregating and summarizing time series data by 10-minute intervals with the resample() method

Python Analysis and Programming/Python Data Preprocessing 2019. 12. 15. 00:42

This post shows how to use the Python pandas library to aggregate and summarize time series data by a fixed time span such as 10 minutes, 20 minutes, 1 hour, 1 day, or 1 month. (Downsampling)

(* For aggregating and summarizing time series data by time span in PostgreSQL or Greenplum database, see https://rfriend.tistory.com/495)

You could also aggregate by group with the groupby() operator introduced earlier, but for time series data the pandas resample() method is more convenient and keeps the code cleaner.

First, let's build a simple time series dataset whose index is a time stamp of the form 'year-month-day hour:minute:second' and which has two columns, price and amount. The pandas date_range(start, end, freq) function generates a date range at 2-minute intervals (freq='2min'), and only the first 20 rows are kept for the example.

import pandas as pd

import numpy as np

# generate time series index (named date_idx so the built-in range() is not shadowed)
date_idx = pd.date_range('2019-12-19', '2019-12-20', freq='2min')
df = pd.DataFrame(index=date_idx)[:20]

# add 'price' column using random numbers
np.random.seed(seed=1004) # for reproducibility
df['price'] = np.random.randint(low=10, high=100, size=20)

# add 'amount' column using random numbers
df['amount'] = np.random.randint(low=1, high=5, size=20)

print('Shape of df DataFrame:', df.shape)

[Out]:Shape of df DataFrame: (20, 2)

[Out]:

	price	amount
2019-12-19 00:00:00	12	4
2019-12-19 00:02:00	21	2
2019-12-19 00:04:00	41	1
2019-12-19 00:06:00	79	4
2019-12-19 00:08:00	61	2
2019-12-19 00:10:00	81	1
2019-12-19 00:12:00	24	3
2019-12-19 00:14:00	62	1
2019-12-19 00:16:00	76	3
2019-12-19 00:18:00	63	1
2019-12-19 00:20:00	95	2
2019-12-19 00:22:00	82	1
2019-12-19 00:24:00	82	3
2019-12-19 00:26:00	70	1
2019-12-19 00:28:00	30	4
2019-12-19 00:30:00	33	1
2019-12-19 00:32:00	22	2
2019-12-19 00:34:00	77	3
2019-12-19 00:36:00	58	3
2019-12-19 00:38:00	96	3

(1) First value and last value of each column per 10-minute interval

(select the first and last value by 10 minutes time span using pandas resample method)

resample('10T') splits the 'year-month-day hour:minute:second' time series index into equal 10-minute bins. In terms of the split-apply-combine pattern of pandas groupby(), it plays the role of split, with equal-width time intervals as the groups.

[ Time span settings for the resample() method ]
- 5-minute intervals : resample('5T')
- 10-minute intervals : resample('10T')
- 20-minute intervals : resample('20T')
- 1-hour intervals : resample('1H')
- 1-day intervals : resample('1D')
- 1-week intervals : resample('1W')
- 1-month intervals : resample('1M')
- 1-year intervals : resample('1Y')

Within each time span, with the rows sorted in time order, first() returns the value of the first row and last() returns the value of the last row. (This corresponds to the apply step of groupby's split-apply-combine.)

# Resampling by a given time span (group)

# : first, last

df_summary = pd.DataFrame()

df_summary['price_10m_first'] = df.price.resample('10T').first()

df_summary['price_10m_last'] = df.price.resample('10T').last()

df_summary['amount_10m_first'] = df.amount.resample('10T').first()

df_summary['amount_10m_last'] = df.amount.resample('10T').last()

df_summary

	price_10m_first	price_10m_last	amount_10m_first	amount_10m_last
2019-12-19 00:00:00	12	61	4	2
2019-12-19 00:10:00	81	63	1	1
2019-12-19 00:20:00	95	30	2	4
2019-12-19 00:30:00	33	96	1	3
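The same first/last table can probably be produced in one call with agg(). The following is a minimal sketch under a few assumptions: the price values are typed in from the table above rather than regenerated with the random seed, and the alias '10min' is used, which denotes the same 10-minute span as '10T'.

```python
import pandas as pd

idx = pd.date_range('2019-12-19', periods=20, freq='2min')
price = [12, 21, 41, 79, 61, 81, 24, 62, 76, 63,
         95, 82, 82, 70, 30, 33, 22, 77, 58, 96]
df = pd.DataFrame({'price': price}, index=idx)

# one pass over the resampler; each function name becomes a column
summary = df['price'].resample('10min').agg(['first', 'last'])
print(summary)
```

The columns come out named 'first' and 'last', so they would still need renaming if the price_10m_first style of naming is wanted.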

(2) Sum and cumulative sum of numeric data per 10-minute interval

(sum, cumulative sum by 10 minutes time span using pandas resample method)

# Resampling by a given time span (group)

# sum, cumulative sum

df_summary = pd.DataFrame()

df_summary['price_10m_sum'] = df.price.resample('10T').sum()

df_summary['price_10m_cumsum'] = df.price.resample('10T').sum().cumsum()

df_summary['amount_10m_sum'] = df.amount.resample('10T').sum()

df_summary['amount_10m_cumsum'] = df.amount.resample('10T').sum().cumsum()

df_summary

	price_10m_sum	price_10m_cumsum	amount_10m_sum	amount_10m_cumsum
2019-12-19 00:00:00	214	214	13	13
2019-12-19 00:10:00	306	520	9	22
2019-12-19 00:20:00	359	879	11	33
2019-12-19 00:30:00	286	1165	12	45

(3) Minimum, maximum, mean, median, and range per 10-minute interval

(summary statistics by 10 minutes time span using pandas resample method)

The min(), max(), mean(), and median() methods return the minimum, maximum, mean, and median. There is no dedicated method for the range, so it is computed as range = max - min.

# Resampling by a given time span (group)

# min, max, mean, median, range

df_summary = pd.DataFrame()

df_summary['price_10m_min'] = df.price.resample('10T').min()

df_summary['price_10m_max'] = df.price.resample('10T').max()

df_summary['price_10m_mean'] = df.price.resample('10T').mean()

df_summary['price_10m_median'] = df.price.resample('10T').median()

df_summary['price_10m_range'] = \
    df.price.resample('10T').max() - df.price.resample('10T').min()

df_summary

	price_10m_min	price_10m_max	price_10m_mean	price_10m_median	price_10m_range
2019-12-19 00:00:00	12	79	42.8	41	67
2019-12-19 00:10:00	24	81	61.2	63	57
2019-12-19 00:20:00	30	95	71.8	82	65
2019-12-19 00:30:00	22	96	57.2	58	74

(4) Variance and standard deviation per 10-minute interval

(variance, standard deviation by 10 minutes time span using pandas resample() method)

resample('10T') groups the data into 10-minute intervals, and var() computes the sample variance. (* Note: the population variance divides the sum of squared deviations by the number of elements N, whereas the sample variance divides it by N-1.)

The sample standard deviation is computed here as the square root of the sample variance. The resampler's std() method returns the same value directly.

# Resampling by a given time span (group)

# variance, standard deviation

df_summary = pd.DataFrame()

# sample variance 1/(N-1)*sigma(X-X_bar)^2

df_summary['price_10m_var'] = df.price.resample('10T').var()

# sample standard deviation using sqrt(var) formula

df_summary['price_10m_stddev'] = np.sqrt(df.price.resample('10T').var())

	price_10m_var	price_10m_stddev
2019-12-19 00:00:00	767.2	27.698375
2019-12-19 00:10:00	499.7	22.353971
2019-12-19 00:20:00	624.2	24.983995
2019-12-19 00:30:00	930.7	30.507376
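As a quick check, the resampler also has a std() method that returns the sample standard deviation directly, and it should agree with the square root of var(). The sketch below types in the price values from the table above and uses '10min' as the alias for the same 10-minute span.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2019-12-19', periods=20, freq='2min')
price = [12, 21, 41, 79, 61, 81, 24, 62, 76, 63,
         95, 82, 82, 70, 30, 33, 22, 77, 58, 96]
df = pd.DataFrame({'price': price}, index=idx)

resampler = df['price'].resample('10min')

std_direct = resampler.std()              # sample standard deviation (divides by N-1)
std_from_var = np.sqrt(resampler.var())   # square root of the sample variance

print(pd.DataFrame({'std': std_direct, 'sqrt_var': std_from_var}))
```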

(5) User-defined function for computing summary statistics by time span

(User Defined Function for aggregating summary statistics by specific time span)

Sections (1) to (4) above used the pandas resample() method to sample time series data by a given time span and compute the first value (first), last value (last), sum (sum), cumulative sum (cumsum), minimum (min), maximum (max), mean (mean), median (median), range (range), variance (variance), and standard deviation (standard deviation).

To make this more convenient, let's define a user-defined function that takes the parameters below.

[ Parameters of the resample_summary() user-defined function ]

(a) ts_data : a time series DataFrame whose index is a 'year-month-day hour:minute:second' time series range
(b) col_nm : the name of the column to aggregate/summarize
(c) time_span : the time span (e.g. 10 minutes '10T', 1 hour '1H', 1 day '1D')
(d) func_list : the summary functions to apply (e.g. first value 'first', last value 'last', sum 'sum', cumulative sum 'cumsum', minimum 'min', maximum 'max', mean 'mean', median 'median', range 'range', sample variance 'var', sample standard deviation 'stddev')

The shared part, resampler = ts_data[col_nm].resample(time_span), is built once as a resampler object and reused.

An if [function name] in func_list condition is added for each function so that only the summaries the user asked for are computed and returned.

To keep the results easy to read, each output column is named by joining [ original column name + '_' + time span + '_' + summary function ] (e.g. price_10T_first).

# UDF of Resampling by column name, time span and summary functions

def resample_summary(ts_data, col_nm, time_span, func_list):
    import numpy as np
    import pandas as pd

    df_summary = pd.DataFrame() # blank DataFrame to store results

    # resampler with column name by time span (group by)
    resampler = ts_data[col_nm].resample(time_span)

    # aggregation functions with suffix name
    if 'first' in func_list:
        df_summary[col_nm + '_' + time_span + '_first'] = resampler.first()
    if 'last' in func_list:
        df_summary[col_nm + '_' + time_span + '_last'] = resampler.last()
    if 'sum' in func_list:
        df_summary[col_nm + '_' + time_span + '_sum'] = resampler.sum()
    if 'cumsum' in func_list:
        df_summary[col_nm + '_' + time_span + '_cumsum'] = resampler.sum().cumsum()
    if 'min' in func_list:
        df_summary[col_nm + '_' + time_span + '_min'] = resampler.min()
    if 'max' in func_list:
        df_summary[col_nm + '_' + time_span + '_max'] = resampler.max()
    if 'mean' in func_list:
        df_summary[col_nm + '_' + time_span + '_mean'] = resampler.mean()
    if 'median' in func_list:
        df_summary[col_nm + '_' + time_span + '_median'] = resampler.median()
    if 'range' in func_list:
        df_summary[col_nm + '_' + time_span + '_range'] = resampler.max() - resampler.min()
    if 'var' in func_list:
        df_summary[col_nm + '_' + time_span + '_var'] = resampler.var() # sample variance
    if 'stddev' in func_list:
        df_summary[col_nm + '_' + time_span + '_stddev'] = np.sqrt(resampler.var())

    return df_summary

Using the resample_summary() user-defined function from (5), let's compute the first value ('first'), last value ('last'), sum ('sum'), cumulative sum ('cumsum'), minimum ('min'), and maximum ('max') of the 'price' column of the df dataset per 10-minute interval (time_span = '10T').

func_list = ['first', 'last', 'sum', 'cumsum', 'min', 'max']

resample_summary(df, 'price', '10T', func_list)

	price_10T_first	price_10T_last	price_10T_sum	price_10T_cumsum	price_10T_min	price_10T_max
2019-12-19 00:00:00	12	61	214	214	12	79
2019-12-19 00:10:00	81	63	306	520	24	81
2019-12-19 00:20:00	95	30	359	879	30	95
2019-12-19 00:30:00	33	96	286	1165	22	96
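The chain of if statements in resample_summary() could also be written as a lookup table. The following is only a sketch of that idea, not the function used in this post: the name resample_summary_compact is hypothetical, the price values are typed in from the table above, and the time span is passed as '10min' (the same span as '10T'), so the column names carry that suffix.

```python
import pandas as pd


def resample_summary_compact(ts_data, col_nm, time_span, func_list):
    """Hypothetical table-driven variant of resample_summary()."""
    resampler = ts_data[col_nm].resample(time_span)

    # each accepted name maps to a zero-argument callable returning a Series
    funcs = {
        'first': resampler.first,
        'last': resampler.last,
        'sum': resampler.sum,
        'cumsum': lambda: resampler.sum().cumsum(),
        'min': resampler.min,
        'max': resampler.max,
        'mean': resampler.mean,
        'median': resampler.median,
        'range': lambda: resampler.max() - resampler.min(),
        'var': resampler.var,
        'stddev': resampler.std,
    }

    # column name: original column + '_' + time span + '_' + function name
    return pd.DataFrame({
        f'{col_nm}_{time_span}_{name}': funcs[name]() for name in func_list
    })


idx = pd.date_range('2019-12-19', periods=20, freq='2min')
price = [12, 21, 41, 79, 61, 81, 24, 62, 76, 63,
         95, 82, 82, 70, 30, 33, 22, 77, 58, 96]
df = pd.DataFrame({'price': price}, index=idx)

result = resample_summary_compact(df, 'price', '10min', ['sum', 'cumsum', 'range'])
print(result)
```

One difference in behavior: a name that is not in the table raises a KeyError here, whereas the if-based version silently skips it.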

This time, let's widen the time span to 20 minutes ('20T') and call the resample_summary() user-defined function again.

func_list = ['mean', 'median', 'range', 'var', 'stddev']

resample_summary(df, 'price', '20T', func_list)

	price_20T_mean	price_20T_median	price_20T_range	price_20T_var	price_20T_stddev
2019-12-19 00:00:00	52.0	61.5	69	657.111111	25.634179
2019-12-19 00:20:00	64.5	73.5	74	750.277778	27.391199

This time, let's change the column being summarized to 'amount' and call the resample_summary() user-defined function.

func_list = ['mean', 'median', 'range', 'var', 'stddev']

resample_summary(df, 'amount', '20T', func_list) # with 'amount' column

	amount_20T_mean	amount_20T_median	amount_20T_range	amount_20T_var	amount_20T_stddev
2019-12-19 00:00:00	2.2	2.0	3	1.511111	1.229273
2019-12-19 00:20:00	2.3	2.5	3	1.122222	1.059350

Let's change the summary functions to mean ('mean'), median ('median'), range ('range'), variance ('var'), and standard deviation ('stddev') and call the resample_summary() user-defined function on the 10-minute span.

func_list = ['mean', 'median', 'range', 'var', 'stddev']

resample_summary(df, 'price', '10T', func_list)

	price_10T_mean	price_10T_median	price_10T_range	price_10T_var	price_10T_stddev
2019-12-19 00:00:00	42.8	41	67	767.2	27.698375
2019-12-19 00:10:00	61.2	63	57	499.7	22.353971
2019-12-19 00:20:00	71.8	82	65	624.2	24.983995
2019-12-19 00:30:00	57.2	58	74	930.7	30.507376

This time, let's generate about one month of random time series data, from 2019-12-19 to 2020-01-18, and summarize it while changing the time span to 1 hour ('1H'), 1 day ('1D'), 1 week ('1W'), 2 weeks ('2W'), and 1 month ('1M').

# generate time series index (named date_idx so the built-in range() is not shadowed)
date_idx = pd.date_range('2019-12-19', '2020-01-18', freq='2min') # one month period
df_1m = pd.DataFrame(index=date_idx)

# add 'price' column using random numbers
# (size must match df_1m, not the 20-row df from the earlier example)
np.random.seed(seed=1004) # for reproducibility
df_1m['price'] = np.random.randint(low=10, high=100, size=len(df_1m))

# add 'amount' column using random numbers
df_1m['amount'] = np.random.randint(low=1, high=5, size=len(df_1m))

print('Shape of df_1m DataFrame:', df_1m.shape)

Shape of df_1m DataFrame: (21601, 2)

# by 1 Hour

func_list = ['first', 'sum', 'mean', 'stddev']
resample_summary(df_1m, 'price', '1H', func_list).head() # by 1 Hour

 price_1H_first price_1H_sum price_1H_mean price_1H_stddev
2019-12-19 00:00:00 12 1684 56.133333 25.143359
2019-12-19 01:00:00 44 1534 51.133333 24.764732
2019-12-19 02:00:00 70 1435 47.833333 25.223256
2019-12-19 03:00:00 22 1867 62.233333 24.842515
2019-12-19 04:00:00 80 1766 58.866667 23.292345

# by 1 Day

func_list = ['first', 'sum', 'mean', 'stddev']
resample_summary(df_1m, 'price', '1D', func_list).head() # by 1 Day

 price_1D_first price_1D_sum price_1D_mean price_1D_stddev
2019-12-19 12 39746 55.202778 25.946355
2019-12-20 26 40171 55.793056 25.547419
2019-12-21 87 39737 55.190278 26.238314
2019-12-22 65 39350 54.652778 25.675714
2019-12-23 69 39835 55.326389 26.230239

# by 1 Week

func_list = ['first', 'sum', 'mean', 'stddev']
resample_summary(df_1m, 'price', '1W', func_list) # by 1 Week

 price_1W_first price_1W_sum price_1W_mean price_1W_stddev
2019-12-22 12 159004 55.209722 25.842990
2019-12-29 69 272943 54.155357 26.084089
2020-01-05 72 274740 54.511905 25.840425
2020-01-12 41 276563 54.873611 26.295806
2020-01-19 55 197090 54.732019 25.984207

# by 2 Weeks

func_list = ['first', 'sum', 'mean', 'stddev']
resample_summary(df_1m, 'price', '2W', func_list) # by 2 Week

 price_2W_first price_2W_sum price_2W_mean price_2W_stddev
2019-12-22 12 159004 55.209722 25.842990
2020-01-05 69 547683 54.333631 25.961867
2020-01-19 41 473653 54.814605 26.164988

# by 1 Month

func_list = ['first', 'sum', 'mean', 'stddev']
resample_summary(df_1m, 'price', '1M', func_list) # by 1 Month

 price_1M_first price_1M_sum price_1M_mean price_1M_stddev
2019-12-31 12 510036 54.491026 25.912189
2020-01-31 48 670304 54.758925 26.117109

(6) Amount-weighted average price per 10-minute interval

(amount-weighted average of price by 10 minutes time span using pandas resample method)

We create a new column 'price_mult_amt' by multiplying price ('price') by amount ('amount'), then use resample('10T') to compute the amount-weighted average price per 10-minute interval (sum of price*amount within the interval / sum of amount within the interval).

Note that the backslash ('\') in the code below continues a statement onto the next line when it is too long to fit on one line.

# function: weighted average

# amount-weighted average price per interval: sum(price*amount)/sum(amount)

# (* the weighted average is the sum of price*amount over all purchases in the interval / the total amount purchased)

df_summary = pd.DataFrame()

df['price_mult_amt'] = df['price']*df['amount']

df_summary['price_10m_amount_weighted_avg'] = \
    df.price_mult_amt.resample('10T').sum() / df.amount.resample('10T').sum()

df_summary

	price_10m_amount_weighted_avg
2019-12-19 00:00:00	43.769231
2019-12-19 00:10:00	56.222222
2019-12-19 00:20:00	64.363636
2019-12-19 00:30:00	64.166667
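A variant worth considering is to compute the weighted average without adding the helper column to df itself, resampling only once. This is a sketch under assumptions: the price and amount values are typed in from the table at the top of this post, '10min' stands for the same span as '10T', and the names sums and wavg are illustrative.

```python
import pandas as pd

idx = pd.date_range('2019-12-19', periods=20, freq='2min')
df = pd.DataFrame({
    'price': [12, 21, 41, 79, 61, 81, 24, 62, 76, 63,
              95, 82, 82, 70, 30, 33, 22, 77, 58, 96],
    'amount': [4, 2, 1, 4, 2, 1, 3, 1, 3, 1,
               2, 1, 3, 1, 4, 1, 2, 3, 3, 3],
}, index=idx)

# assign() returns a copy, so df keeps only its original two columns
sums = (
    df.assign(price_mult_amt=df['price'] * df['amount'])[['price_mult_amt', 'amount']]
      .resample('10min')
      .sum()
)

# sum(price*amount) / sum(amount) per 10-minute interval
wavg = sums['price_mult_amt'] / sums['amount']
print(wavg)
```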

(7) Exporting the 10-minute summary statistics to a csv file

(exporting summary results by 10 minutes time span into 'csv file' using pandas to_csv() method)

Using the resample_summary() user-defined function (UDF) from (5), we build a DataFrame that aggregates 'first', 'last', 'sum', 'cumsum', 'min', 'max', 'mean', 'median', 'range', 'var', and 'stddev' for the 'price' column per 10-minute interval ('10T'),

then compute the amount-weighted average of price per 10-minute interval,

and export the combined DataFrame to a csv file named 'df_summary.csv' with the pandas to_csv() method. The index holding the 'year-month-day hour:minute:second' time information has to be exported as well, so index=True is set in to_csv(). Missing values, if any, are written as 'NaN' via na_rep='NaN', and float summary statistics are written with two decimal places via float_format='%.2f'.

# summary statistics using resample_summary() User Defined Function, refer to (5)

func_list = ['first', 'last', 'sum', 'cumsum', 'min', 'max',

'mean', 'median', 'range', 'var', 'stddev']

df_summary = resample_summary(df, 'price', '10T', func_list)

# amount-weighted average of price, refer to (6)

df['price_mult_amt'] = df['price']*df['amount']

df_summary['price_10m_amount_weighted_avg'] = \
    df.price_mult_amt.resample('10T').sum() / df.amount.resample('10T').sum()

# export df_summary DataFrame into csv file

import os

work_dir = os.getcwd() # current working directory

file_path = os.path.join(work_dir, 'df_summary.csv')

df_summary.to_csv(file_path

, index=True # include index

, na_rep='NaN' # representation of missing value

, float_format = '%.2f') # 2 decimal places

I hope this was helpful.


Posted by Rfriend

[Python, R, PostgreSQL] Shifting rows down and up by group (Lag, Lead a row by Group)

Python Analysis and Programming/Python Data Preprocessing 2019. 12. 9. 00:55

The lag and lead window functions are very useful and widely used when working with time series data.

This post shows how to shift rows down or up by one within each group (lag or lead a row by group) using PostgreSQL, Python (pandas), and R (dplyr).

1. Shifting the rows of a column down or up by one within each group in PostgreSQL

(lag, lead a row by group using PostgreSQL lag(), lead() window function)

Let's create a time series table in a PostgreSQL DB with three columns: date (dt), group ID (id), and measured value (val).

DROP TABLE IF EXISTS ts;

CREATE TABLE ts (

dt date not null

, id text not null

, val numeric not null

);

INSERT INTO ts VALUES

('2019-12-01', 'a', 5)

, ('2019-12-02', 'a', 6)

, ('2019-12-03', 'a', 7)

, ('2019-12-04', 'a', 8)

, ('2019-12-01', 'b', 13)

, ('2019-12-02', 'b', 14)

, ('2019-12-03', 'b', 15)

, ('2019-12-04', 'b', 16);

SELECT * FROM ts ORDER BY id, dt;

Using the PostgreSQL window functions LAG(value, offset, default) and LEAD(value, offset, default), let's shift the measured value ('val') down (lag) and up (lead) by one row within each group ID ('id'). Cells left empty after the shift are set to NULL.

When calling LAG() and LEAD(), the rows are partitioned by group ID ('id') and sorted by date ('dt') in ascending order (OVER (PARTITION BY id ORDER BY dt)).

-- lag(), lead() window functions

SELECT
    *
    , LAG(val, 1, NULL) OVER (PARTITION BY id ORDER BY dt) AS val_lag_1
    , LEAD(val, 1, NULL) OVER (PARTITION BY id ORDER BY dt) AS val_lead_2
FROM ts;

Let's create a table named 'ts_lag_lead' that adds the new columns val_lag_1 and val_lead_2 produced by the lag() and lead() functions.

DROP TABLE IF EXISTS ts_lag_lead;

CREATE TABLE ts_lag_lead AS (

SELECT
    *
    , LAG(val, 1, NULL) OVER (PARTITION BY id ORDER BY dt) AS val_lag_1
    , LEAD(val, 1, NULL) OVER (PARTITION BY id ORDER BY dt) AS val_lead_2
FROM ts

);

SELECT * FROM ts_lag_lead ORDER BY id, dt;

2. Shifting the rows of a column down or up by one within each group of a pandas DataFrame

(shift a row by group using Python pandas library)

Let's do the same thing as the PostgreSQL lag() and lead() window functions above, this time with Python pandas.

First, let's create a pandas DataFrame of time series data with the columns dt, id, and val.

import pandas as pd

ts = pd.DataFrame({'dt': ['2019-12-01', '2019-12-02', '2019-12-03', '2019-12-04',

'2019-12-01', '2019-12-02', '2019-12-03', '2019-12-04'],

'id': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],

'val': [5, 6, 7, 8, 13, 14, 15, 16]})

	dt	id	val
0	2019-12-01	a	5
1	2019-12-02	a	6
2	2019-12-03	a	7
3	2019-12-04	a	8
4	2019-12-01	b	13
5	2019-12-02	b	14
6	2019-12-03	b	15
7	2019-12-04	b	16

Before calling shift(), the data is sorted with sort_values(). Note that lag needs an ascending sort and lead needs a descending sort. (This is a bit less convenient in Python than in PostgreSQL or R.)

(a) lag: sort by date ('dt') in ascending order (ascending=True) with sort_values(), then shift 'val' down by one within each 'id' group: groupby('id')['val'].shift(1)

(b) lead: sort by date ('dt') in descending order (ascending=False) with sort_values(), then shift 'val' by one within each 'id' group, which moves it up in date order: groupby('id')['val'].shift(1)

# lag a row by group 'id'

ts['val_lag_1'] = ts.sort_values(by='dt', ascending=True).groupby('id')['val'].shift(1)

# lead a row by group 'id'

ts['val_lead_1'] = ts.sort_values(by='dt', ascending=False).groupby('id')['val'].shift(1)

ts.sort_values(by=['id', 'dt'])

	dt	id	val	val_lag_1	val_lead_1
0	2019-12-01	a	5	NaN	6.0
1	2019-12-02	a	6	5.0	7.0
2	2019-12-03	a	7	6.0	8.0
3	2019-12-04	a	8	7.0	NaN
4	2019-12-01	b	13	NaN	14.0
5	2019-12-02	b	14	13.0	15.0
6	2019-12-03	b	15	14.0	16.0
7	2019-12-04	b	16	15.0	NaN
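An alternative that avoids sorting twice is to sort once in ascending order and pass a negative period to shift() for the lead. The sketch below assumes the same ts data as above.

```python
import pandas as pd

ts = pd.DataFrame({
    'dt': ['2019-12-01', '2019-12-02', '2019-12-03', '2019-12-04',
           '2019-12-01', '2019-12-02', '2019-12-03', '2019-12-04'],
    'id': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'val': [5, 6, 7, 8, 13, 14, 15, 16],
})

# sort once, ascending by group and date
ts = ts.sort_values(by=['id', 'dt'])

# shift(1) takes the previous row of the group (lag),
# shift(-1) takes the next row of the group (lead)
ts['val_lag_1'] = ts.groupby('id')['val'].shift(1)
ts['val_lead_1'] = ts.groupby('id')['val'].shift(-1)

print(ts)
```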

3. Shifting the rows of a column down or up by one within each group of an R data.frame with dplyr

(lag, lead a row by group using R dplyr library)

Let's do the same thing as the PostgreSQL lag() and lead() window functions above, this time with the R dplyr library.

First, let's create an R data.frame of time series data with the columns dt, id, and val.

#install.packages("dplyr")

library(dplyr)

dt <- c(rep(c('2019-12-01', '2019-12-02', '2019-12-03', '2019-12-04'), 2))

id <- c(rep('a', 4), rep('b', 4))

val <- c(5, 6, 7, 8, 13, 14, 15, 16)

ts <- data.frame(dt, id, val)

A data.frame: 8 × 3
dt	id	val
<fct>	<fct>	<dbl>
2019-12-01	a	5
2019-12-02	a	6
2019-12-03	a	7
2019-12-04	a	8
2019-12-01	b	13
2019-12-02	b	14
2019-12-03	b	15
2019-12-04	b	16

Like PostgreSQL, R has lag() and lead() window functions. With the dplyr chain operator you can sort by 'dt' in ascending order with arrange(), group by group ID with group_by(id), and then apply lag() and lead() within each group to create new variables with mutate(), which is very convenient.

ts <- ts %>%

arrange(dt) %>%

group_by(id) %>%

mutate(val_lag_1 = lag(val, 1),

val_lead_1 = lead(val, 1))

arrange(ts, id, dt)

A grouped_df: 8 × 5
dt	id	val	val_lag_1	val_lead_1
<fct>	<fct>	<dbl>	<dbl>	<dbl>
2019-12-01	a	5	NA	6
2019-12-02	a	6	5	7
2019-12-03	a	7	6	8
2019-12-04	a	8	7	NA
2019-12-01	b	13	NA	14
2019-12-02	b	14	13	15
2019-12-03	b	15	14	16
2019-12-04	b	16	15	NA

I hope this was helpful.


Posted by Rfriend

[Python] Setting the cell width, DataFrame column width, text alignment, decimal places, plot size, and maximum number of rows in Jupyter Notebook

Python Analysis and Programming/Python Installation and Basic Usage 2019. 10. 7. 23:24

This post collects small Jupyter Notebook tips that you do not need often but that cost time to look up whenever you do, so they are worth noting down somewhere for later use.

(1) Setting the cell width in Jupyter Notebook

(2) Setting the maximum column width of a DataFrame in Jupyter Notebook

(3) Left-aligning text in a DataFrame in Jupyter Notebook

(4) Setting the number of decimal places of a DataFrame in Jupyter Notebook

(5) Setting default matplotlib plot options in Jupyter Notebook (figure size, line width, color, grid)

(6) Setting the maximum number of rows displayed in Jupyter Notebook (number of max rows)

(7) Setting the maximum number of columns displayed in Jupyter Notebook (number of max columns)

(1) Setting the cell width in Jupyter Notebook

Change the percentage in the width setting of the code below.

from IPython.display import display, HTML

display(HTML("<style>.container { width: 50% !important; }</style>"))

(2) Setting the maximum column width of a DataFrame in Jupyter Notebook

import pandas as pd

pd.set_option('display.max_colwidth', 10)

df = pd.DataFrame({'a': [100000000.0123, 20000000.54321],

'b': ['abcdefghijklmnop', 'qrstuvwxyz']})

Out[2]:

	a	b
0	1.0000...	abcdef...
1	2.0000...	qrstuv...

pd.set_option('display.max_colwidth', 50)

Out[3]:

	a	b
0	1.000000e+08	abcdefghijklmnop
1	2.000000e+07	qrstuvwxyz
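If the narrower width is only needed for one output, pd.option_context() can apply it temporarily and restore the previous value afterwards. This is a minimal sketch, with repr() standing in for what the notebook would display; the names narrow and wide are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'b': ['abcdefghijklmnop', 'qrstuvwxyz']})

width_before = pd.get_option('display.max_colwidth')

# the option applies only inside the with block
with pd.option_context('display.max_colwidth', 10):
    narrow = repr(df)   # long strings are cut off with '...'

wide = repr(df)         # the previous setting is back in effect here

print(narrow)
print(wide)
```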

(3) Left-aligning text in a pandas DataFrame in Jupyter Notebook

dfStyler = df.style.set_properties(**{'text-align': 'left'})

dfStyler.set_table_styles([dict(selector='th',

props=[('text-align', 'left')])])

Out[4]:

	a	b
0	1e+08	abcdefghijklmnop
1	2e+07	qrstuvwxyz

(4) Setting the decimal point format of a pandas DataFrame in Jupyter Notebook

As an example, the code below switches from scientific notation to plain numeric notation and shows only two decimal places.

import pandas as pd

pd.options.display.float_format = '{:.2f}'.format

Out[6]:

	a	b
0	100000000.01	abcdefghijklmnop
1	20000000.54	qrstuvwxyz
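The float format stays in effect for the rest of the session, so it can be useful to know how to undo it. A minimal sketch using pd.reset_option(), with repr() standing in for the notebook display:

```python
import pandas as pd

df = pd.DataFrame({'a': [100000000.0123, 20000000.54321]})

# fixed notation with two decimal places
pd.options.display.float_format = '{:.2f}'.format
fixed = repr(df)

# back to the pandas default (no custom formatter)
pd.reset_option('display.float_format')
default = repr(df)

print(fixed)
print(default)
```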

(5) Setting the figure size, line width, color, and grid of matplotlib plots in Jupyter Notebook

plt.rcParams[] of matplotlib.pyplot sets the figure size, line width, line color, whether to show a grid, and so on.

# matplotlib setting

import matplotlib.pyplot as plt

%matplotlib inline

plt.rcParams["figure.figsize"] = (6, 5)

plt.rcParams["lines.linewidth"] = 2

plt.rcParams["lines.color"] = 'r'

plt.rcParams["axes.grid"] = True

# simple plot

x = [1, 2, 3, 4, 5]

y = [2, 3.5, 5, 8, 9]

plt.plot(x, y)

plt.show()
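Assigning to plt.rcParams changes the defaults for every later plot. If the settings are wanted for a single figure only, plt.rc_context() can scope them. The sketch below assumes it is run as a plain script, so it selects the non-interactive 'Agg' backend; in a notebook with %matplotlib inline that line would be left out.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: script use without a display
import matplotlib.pyplot as plt

size_before = list(plt.rcParams['figure.figsize'])

# the settings apply only to figures created inside the with block
with plt.rc_context({'figure.figsize': (6, 5),
                     'lines.linewidth': 2,
                     'axes.grid': True}):
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3, 4, 5], [2, 3.5, 5, 8, 9])
    size_inside = list(fig.get_size_inches())

plt.close(fig)
print(size_inside, list(plt.rcParams['figure.figsize']))
```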

(6) Setting the maximum number of rows displayed in Jupyter Notebook (number of max rows)

When a DataFrame is printed in a Jupyter Notebook cell and it has many rows, the default setting replaces the middle with dots ("...") and shows only the first 5 and the last 5 rows.

# if there are many rows, JN does not print all rows

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(200).reshape(100, 2))
df

Sometimes, though, you need to print every row and check it by eye. In that case you can set the maximum number of rows that Jupyter Notebook prints. In the example above the rows after the fifth were replaced with dots and hidden. After raising the maximum number of printed rows to 100 with pd.set_option('display.max_rows', 100), all 100 rows are printed.

# setting the number of maximum rows in Jupyter Notebook

import pandas as pd

pd.set_option('display.max_rows', 100)

df

(7) Setting the maximum number of columns displayed in Jupyter Notebook (number of max columns)

When a DataFrame is printed in a Jupyter Notebook cell and it has more than 20 columns, the default setting replaces the middle with dots ("...") and prints only 20 of them, so the columns in the middle are hidden behind "..." and cannot be checked.

In that case, pandas.set_option('display.max_columns', col_num) sets the maximum number of columns shown on screen. Columns 10 to 12, which were replaced with "..." before, are then printed in the notebook.

# setting the maximum number of columns in jupyter notebook display

import pandas as pd

pd.set_option('display.max_columns', 50) # set the maximum number to whatever you want

I hope this was helpful.


Posted by Rfriend

[R] Using R in Jupyter Notebook

Python Analysis and Programming / Python Installation and Basic Usage 2019. 10. 6. 14:31

Some readers may assume that Python belongs with Jupyter Notebook and R with RStudio, but once IRkernel is installed you can use R in Jupyter Notebook as well.

This post shows how to set up Jupyter Notebook so that it can run R.

(Environment: macOS, Anaconda, Python 3.6)

Install IRkernel from a command prompt (terminal) by following the steps below.

(Trying to install it from within RStudio produces an error. Please work from the command prompt / Terminal.)

1. Installing IRkernel for use with Jupyter Notebook

(1) In the terminal, type 'R' and press Enter to start R.

MacBook-Pro:~ ihongdon$

MacBook-Pro:~ ihongdon$ R

R version 3.6.0 (2019-04-26) -- "Planting of a Tree"

Platform: x86_64-apple-darwin15.6.0 (64-bit)

(2) Install the devtools package: install.packages('devtools')

When a combo box appears with the message 'Please select a CRAN mirror for use in this session', you can pick any server. I chose 'SEOUL'.

> install.packages('devtools')

--- Please select a CRAN mirror for use in this session ---

SEOUL

Warning: unable to access index for repository https://cran.seoul.go.kr/bin/macosx/el-capitan/contrib/3.6:

cannot open URL 'https://cran.seoul.go.kr/bin/macosx/el-capitan/contrib/3.6/PACKAGES'

installing the source package ‘devtools’

trying URL 'https://cran.seoul.go.kr/src/contrib/devtools_2.2.1.tar.gz'

Content type 'application/x-gzip' length 372273 bytes (363 KB)

==================================================

downloaded 363 KB

* installing *source* package ‘devtools’ ...

** package ‘devtools’ successfully unpacked and MD5 sums checked

** using staged installation

** R

** inst

** byte-compile and prepare package for lazy loading

** help

*** installing help indices

*** copying figures

** building package indices

** installing vignettes

** testing if installed package can be loaded from temporary location

** testing if installed package can be loaded from final location

** testing if installed package keeps a record of temporary installation path

* DONE (devtools)

The downloaded source packages are in

‘/private/var/folders/6q/mtq6ftrj6_z4txn_zsxcfyxc0000gn/T/Rtmpg1bePN/downloaded_packages’

(3) Install IRkernel: run devtools::install_github('IRkernel/IRkernel').

> devtools::install_github('IRkernel/IRkernel')

Downloading GitHub repo IRkernel/IRkernel@master

'/usr/local/bin/git' clone --depth 1 --no-hardlinks --recurse-submodules https://github.com/jupyter/jupyter_kernel_test.git /var/folders/6q/mtq6ftrj6_z4txn_zsxcfyxc0000gn/T//Rtmpg1bePN/remotes36921150b621/IRkernel-IRkernel-67592db/tests/testthat/jkt

Cloning into '/var/folders/6q/mtq6ftrj6_z4txn_zsxcfyxc0000gn/T//Rtmpg1bePN/remotes36921150b621/IRkernel-IRkernel-67592db/tests/testthat/jkt'...

remote: Enumerating objects: 12, done.

remote: Counting objects: 100% (12/12), done.

remote: Compressing objects: 100% (11/11), done.

remote: Total 12 (delta 1), reused 3 (delta 0), pack-reused 0

Unpacking objects: 100% (12/12), done.

'/usr/local/bin/git' clone --depth 1 --no-hardlinks --recurse-submodules https://github.com/flying-sheep/ndjson-testrunner.git /var/folders/6q/mtq6ftrj6_z4txn_zsxcfyxc0000gn/T//Rtmpg1bePN/remotes36921150b621/IRkernel-IRkernel-67592db/tests/testthat/njr

Cloning into '/var/folders/6q/mtq6ftrj6_z4txn_zsxcfyxc0000gn/T//Rtmpg1bePN/remotes36921150b621/IRkernel-IRkernel-67592db/tests/testthat/njr'...

remote: Enumerating objects: 10, done.

remote: Counting objects: 100% (10/10), done.

remote: Compressing objects: 100% (8/8), done.

remote: Total 10 (delta 0), reused 6 (delta 0), pack-reused 0

Unpacking objects: 100% (10/10), done.

Skipping 1 packages ahead of CRAN: htmltools

✔ checking for file ‘/private/var/folders/6q/mtq6ftrj6_z4txn_zsxcfyxc0000gn/T/Rtmpg1bePN/remotes36921150b621/IRkernel-IRkernel-67592db/DESCRIPTION’

─ preparing ‘IRkernel’:

✔ checking DESCRIPTION meta-information ...

─ checking for LF line-endings in source and make files and shell scripts

─ checking for empty or unneeded directories

Removed empty directory ‘IRkernel/example-notebooks’

─ building ‘IRkernel_1.0.2.9000.tar.gz’

* installing *source* package ‘IRkernel’ ...

** using staged installation

** R

** inst

** byte-compile and prepare package for lazy loading

** help

*** installing help indices

** building package indices

** testing if installed package can be loaded from temporary location

** testing if installed package can be loaded from final location

** testing if installed package keeps a record of temporary installation path

* DONE (IRkernel)

(4) Register the kernel and check the result with IRkernel::installspec()

> IRkernel::installspec()

[InstallKernelSpec] Installed kernelspec ir in /Users/ihongdon/Library/Jupyter/kernels/ir

If you see the message below instead, it is probably because steps (1), (2), and (3) above were run in "RStudio" rather than in a "command prompt (Terminal)".

jupyter-client has to be installed but “jupyter kernelspec --version” exited with code 127.

If you got this message, open a command prompt (Terminal), run steps (1), (2), and (3) there, and then check again with (4).
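In POSIX shells, exit code 127 conventionally means that the command itself could not be found, so the message usually points to jupyter not being on the PATH of the process that launched R. The sketch below is one way to check that from Python using only the standard library. The helper names find_command and list_kernelspecs are made up for illustration; jupyter kernelspec list is the regular Jupyter command for listing installed kernels.

```python
import shutil
import subprocess


def find_command(name="jupyter"):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


def list_kernelspecs(jupyter_path):
    """Run `jupyter kernelspec list` and return whatever it printed."""
    result = subprocess.run(
        [jupyter_path, "kernelspec", "list"],
        capture_output=True,
        text=True,
        check=False,  # do not raise if the command reports an error
    )
    return result.stdout


jupyter_path = find_command("jupyter")
if jupyter_path is None:
    print("jupyter is not on PATH: activate the Anaconda environment or fix PATH first")
else:
    # After IRkernel::installspec() has run, an 'ir' entry should be listed here.
    print(list_kernelspecs(jupyter_path))
```

Running this from the same terminal in which R was started shows whether that shell can see the jupyter executable at all.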

If you are on Windows 10, try adding the following paths to PATH as well.

Anaconda\Lib\site-packages\jupyter_client

C:\Users\Anaconda3\Scripts

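For reference, this is how to register a PATH entry on Windows 10 and Windows 8.

[ Registering PATH on Windows 10 or Windows 8 ]

Select 'System (Control Panel)'

Select (click) the 'Advanced system settings' link

Select 'Environment Variables' -> select the 'PATH' variable in the System variables section -> select 'Edit' -> if the PATH variable does not exist, select 'New'

In the 'Edit System Variable (or New System Variable)' window, set the value of the PATH variable -> select 'OK' -> close all remaining windows

2. Trying out R in Jupyter Notebook

Now that the R kernel is installed, let's use R in Jupyter Notebook. In the terminal where R is running, press 'ctrl + z' to get back to the shell. (As the '[1]+ Stopped R' line below shows, this suspends R rather than quitting it; q() ends the R session cleanly.)

$ conda info -e (or conda env list) lists the virtual environments,

$ source activate [environment name] enters a specific virtual environment,

(on Windows: activate [environment name])

$ jupyter notebook opens the Jupyter Notebook window.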

[1]+ Stopped R

MacBook-Pro:~ ihongdon$

MacBook-Pro:~ ihongdon$ conda info -e

# conda environments:

base * /Users/ihongdon/anaconda3

py2.7_tf1.4 /Users/ihongdon/anaconda3/envs/py2.7_tf1.4

py3.5_tf1.4 /Users/ihongdon/anaconda3/envs/py3.5_tf1.4

py3.6_tf2.0 /Users/ihongdon/anaconda3/envs/py3.6_tf2.0

MacBook-Pro:~ ihongdon$

MacBook-Pro:~ ihongdon$ source activate py3.6_tf2.0

(py3.6_tf2.0) MacBook-Pro:~ ihongdon$

(py3.6_tf2.0) MacBook-Pro:~ ihongdon$ jupyter notebook

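When the Jupyter Notebook window appears as below, click the 'New' menu at the top right and choose 'R' from the submenu.

As a simple example, I created an R data frame with two variables, x and y, and drew a scatter plot with ggplot2.

Next, I fitted a linear regression model with y as the dependent variable and x as the explanatory variable.

I hope this was helpful.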


Posted by Rfriend

[Python] pandas DataFrame: How to fix "ValueError: If using all scalar values, you must pass an index"

Python Analysis and Programming / Python Data Preprocessing 2019. 9. 15. 12:18

This post introduces four ways to fix the "ValueError: If using all scalar values, you must pass an index" error that can occur when creating a Python pandas DataFrame.

As in the example below, when you try to build a pandas DataFrame from a dictionary of key-value pairs and every value is a scalar, pandas cannot infer how many rows the frame should have and raises "ValueError: If using all scalar values, you must pass an index".

import pandas as pd

df = pd.DataFrame({'col_1': 1,

'col_2': 2})

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-73d6f192ba2a> in <module>()
      1 df = pd.DataFrame({'col_1': 1, 
----> 2                   'col_2': 2})

~/anaconda3/envs/py3.5_tf1.4/lib/python3.5/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    273                                  dtype=dtype, copy=copy)
    274         elif isinstance(data, dict):
--> 275             mgr = self._init_dict(data, index, columns, dtype=dtype)
    276         elif isinstance(data, ma.MaskedArray):
    277             import numpy.ma.mrecords as mrecords

~/anaconda3/envs/py3.5_tf1.4/lib/python3.5/site-packages/pandas/core/frame.py in _init_dict(self, data, index, columns, dtype)
    409             arrays = [data[k] for k in keys]
    410 
--> 411         return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
    412 
    413     def _init_ndarray(self, values, index, columns, dtype=None, copy=False):

~/anaconda3/envs/py3.5_tf1.4/lib/python3.5/site-packages/pandas/core/frame.py in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
   5494     # figure out the index, if necessary
   5495     if index is None:
-> 5496         index = extract_index(arrays)
   5497     else:
   5498         index = _ensure_index(index)

~/anaconda3/envs/py3.5_tf1.4/lib/python3.5/site-packages/pandas/core/frame.py in extract_index(data)
   5533 
   5534         if not indexes and not raw_lengths:
-> 5535             raise ValueError('If using all scalar values, you must pass'
   5536                              ' an index')
   5537 

ValueError: If using all scalar values, you must pass an index

The four ways to fix this error are introduced in turn below.

(1) Solution 1: set an index value (pass an index)

As the error message "you must pass an index" suggests, simply supply an index in addition to the data.

# (1) pass an index

df = pd.DataFrame({'col_1': 1,

'col_2': 2},

index = [0])

	col_1	col_2
0	1	2

Of course, you can set the index to any value you like. Let's try 'row_1' as the index.

df = pd.DataFrame({'col_1': 1,

'col_2': 2},

index = ['row_1'])

	col_1	col_2
row_1	1	2
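The index is what tells pandas how many rows to create, and each scalar is then repeated for every row. A small sketch of that behavior, with a made-up three-label index and the illustrative name df_multi:

```python
import pandas as pd

# Each scalar is broadcast to every label in the index, giving three identical rows.
df_multi = pd.DataFrame({'col_1': 1,
                         'col_2': 2},
                        index=['row_1', 'row_2', 'row_3'])

print(df_multi)
print(df_multi.shape)  # (3, 2)
```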

(2) Solution 2: use a list instead of scalar values

If you wrap each value in square brackets [ ] so that the dictionary values are lists, the error does not occur.

# (2) use a list instead of scalar values

df2 = pd.DataFrame({'col_1': [1],

'col_2': [2]})

df2

	col_1	col_2
0	1	2

(3) Solution 3: create the DataFrame with pd.DataFrame.from_records([{'key': value}])

Here too, the dictionary has to be wrapped in [ ] so that a list is passed. (Leaving out the [ ] raises the same error.)

# (3) use pd.DataFrame.from_records() with a list

df3 = pd.DataFrame.from_records([{'col_1': 1,

'col_2': 2}])

df3

	col_1	col_2
0	1	2

(4) Solution 4: create the DataFrame with pd.DataFrame.from_dict([{'key': value}])

This is almost the same as (3), except that from_dict([]) is used instead of from_records([]). Again, wrap the dictionary in [ ] so that a list is passed.

# (4) use pd.DataFrame.from_dict([]) with a list

df4 = pd.DataFrame.from_dict([{'col_1': 1,

'col_2': 2}])

df4

	col_1	col_2
0	1	2
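The four fixes can be put side by side to confirm that they build the same one-row table. This is only a sketch for comparison: the dictionary name data and the frames mapping are made up here, and the try/except at the top simply reproduces the original error message.

```python
import pandas as pd

data = {'col_1': 1, 'col_2': 2}

# Reproduce the error: all values are scalars and no index is given.
try:
    pd.DataFrame(data)
except ValueError as err:
    print(err)

# The four fixes from this post, applied to the same dictionary.
frames = {
    '(1) index=[0]': pd.DataFrame(data, index=[0]),
    '(2) list values': pd.DataFrame({key: [value] for key, value in data.items()}),
    '(3) from_records': pd.DataFrame.from_records([data]),
    '(4) from_dict': pd.DataFrame.from_dict([data]),
}

# Each result has the same columns and the same single row of values.
for label, frame in frames.items():
    print(label, list(frame.columns), frame.values.tolist())
```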

I hope this was helpful.


Posted by Rfriend


R, Python 분석과 프로그래밍의 친구 (A Friend for R and Python Analysis and Programming, by R Friend)
