Posts in the category 'Python Analysis and Programming / Python Data Preprocessing' (page 5)

157 posts in 'Python Analysis and Programming / Python Data Preprocessing'

2020.05.17 [Python pandas] Parsing date/time data when reading files with read_csv()
2020.02.18 [Python pandas] Converting continuous values to categories: comparing np.digitize() and pd.cut()
2020.02.15 [Python] Train/test set split by stratified random sampling
2020.02.11 [Python numpy] Splitting train and test sets
2020.02.05 [Python Numpy] How to reverse a numpy array
2019.12.31 [Python pandas] Filling missing values (fill na) and linear interpolation after upsampling
2019.12.30 [Python pandas] Setting the inclusive side (closed) and the label when aggregating time series by downsampling
2019.12.30 [Python pandas] Quarterly period frequencies and ranges, and conversion to and from timestamps
2019.12.28 [Python pandas] Time zone generation, localization, and conversion
2019.12.26 [Python] Parsing a text document into a DataFrame and creating lagged columns by group

[Python pandas] Parsing date/time data when reading files with read_csv() (parsing datetime from file using read_csv())

Python Analysis and Programming/Python Data Preprocessing 2020. 5. 17. 12:57

An earlier post (https://rfriend.tistory.com/250) showed how to read text and csv files with the pandas read_csv() function.

This post shows how to parse date/time columns into a DateTime format while reading a csv or text file into a DataFrame with read_csv().

[Sample data for the examples]

date_sample

(1) Reading date/time data with pd.read_csv() without specifying a date/time format

The sample data is a text file with a date/time column in the form '1/5/2020 10:00:00' (day 1 / month 5 / year 2020, 10:00:00), as shown at the upper right of the image above.

If you read this data with read_csv() without any explicit date/time settings, the date column is loaded as the object dtype, as shown below. You then have to convert the strings to a DateTime format in a separate step, which is inconvenient. (See: https://rfriend.tistory.com/498)

import pandas as pd

df = pd.read_csv('date_sample', sep=",", names=['date', 'id', 'val']) # no datetime parsing

	date	id	val
0	1/5/2020 10:00:00	1	10
1	1/5/2020 10:10:00	1	12
2	1/5/2020 10:20:00	1	17
3	1/5/2020 10:00:00	2	11
4	1/5/2020 10:10:00	2	14
5	1/5/2020 10:20:00	2	16

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
date    6 non-null object <-- not datetime format
id      6 non-null int64
val     6 non-null int64
dtypes: int64(2), object(1)
memory usage: 272.0+ bytes
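
The separate conversion step mentioned above could look roughly like the sketch below. The inline `raw` string is a hypothetical stand-in for the date_sample file (it is not from the original post), and the format string assumes the day-first layout described in this section.

```python
import io
import pandas as pd

# Hypothetical inline stand-in for the 'date_sample' file
raw = "1/5/2020 10:00:00,1,10\n1/5/2020 10:10:00,1,12\n"

# Read without any datetime parsing: 'date' comes back as plain text
df = pd.read_csv(io.StringIO(raw), sep=",", names=["date", "id", "val"])
is_text_before = not pd.api.types.is_datetime64_any_dtype(df["date"])

# Convert afterwards, stating the day-first format explicitly
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y %H:%M:%S")
is_datetime_after = pd.api.types.is_datetime64_any_dtype(df["date"])
```

Stating the format explicitly avoids any ambiguity between day-first and month-first readings.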

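-- Automatic inference of the date/time format --

(2) Parsing with parse_dates=['date'], dayfirst=True, infer_datetime_format=True

parse_dates=['date']: parse the 'date' column

dayfirst=True: treat the day as coming before the month

infer_datetime_format=True: infer the date/time format and parse accordingly

In this sample, '1/5/2020 10:00:00' means day 1 / month 5 / year 2020, 10:00:00, so the day comes before the month and dayfirst=True is set. With infer_datetime_format=True, pandas infers the date/time format on its own and parses the column correctly.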

import pandas as pd

df_date = pd.read_csv("date_sample",
                      sep=",",
                      names=['date', 'id', 'val'],
                      parse_dates=['date'],
                      dayfirst=True, # May 1st
                      infer_datetime_format=True)

df_date # May 1st, 2020

	date	id	val
0	2020-05-01 10:00:00	1	10
1	2020-05-01 10:10:00	1	12
2	2020-05-01 10:20:00	1	17
3	2020-05-01 10:00:00	2	11
4	2020-05-01 10:10:00	2	14
5	2020-05-01 10:20:00	2	16

df_date.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
date    6 non-null datetime64[ns] <-- datetime format!!
id      6 non-null int64
val     6 non-null int64
dtypes: datetime64[ns](1), int64(2)
memory usage: 272.0 bytes

(3) dayfirst=False: parsing date/time in files where the day comes after the month

If the file uses '1/5/2020 10:00:00' to mean month 1 / day 5 / year 2020, 10:00:00, with the month before the day, set dayfirst=False. (False is the default, so the argument can be omitted.)

df_date_dayfirstF = pd.read_csv("date_sample",

sep=",",

names=['date', 'id', 'val'],

parse_dates=['date'],

dayfirst=False, # January 5th, default setting

infer_datetime_format=True)

df_date_dayfirstF # January 5th, 2020

	date	id	val
0	2020-01-05 10:00:00	1	10
1	2020-01-05 10:10:00	1	12
2	2020-01-05 10:20:00	1	17
3	2020-01-05 10:00:00	2	11
4	2020-01-05 10:10:00	2	14
5	2020-01-05 10:20:00	2	16
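
The effect of dayfirst on an ambiguous string can be checked directly with pd.to_datetime(), independent of any file. This is a minimal sketch using the same sample string as above.

```python
import pandas as pd

s = "1/5/2020 10:00:00"

# The same text yields two different dates depending on dayfirst
day_first = pd.to_datetime(s, dayfirst=True)     # read as 1 May 2020
month_first = pd.to_datetime(s, dayfirst=False)  # read as 5 January 2020
```

Because both readings are valid dates, pandas cannot detect a wrong setting for this string, so it is worth checking a few parsed rows against the source file.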

-- Specifying the date/time parsing function manually --

(4) Parsing date/time while reading data with a date_parser function

In (2) and (3) above, infer_datetime_format=True let pandas infer the date/time format and parse it for us.

This time the user states the date/time format explicitly in a lambda function, and read_csv() uses that function to parse the date/time column while reading the file.

To match the date/time format of this example, the function converts each string to a datetime with datetime.strptime(x, "%d/%m/%Y %H:%M:%S").

# parsing datetime string using lambda Function at date_parser

# Reference: converting string to DateTime: https://rfriend.tistory.com/498

import pandas as pd

from datetime import datetime

dt_parser = lambda x: datetime.strptime(x, "%d/%m/%Y %H:%M:%S")

df_date2 = pd.read_csv("date_sample",

sep=",",

names=['date', 'id', 'val'],

parse_dates=['date'], # column name

date_parser=dt_parser)

df_date2

	date	id	val
0	2020-05-01 10:00:00	1	10
1	2020-05-01 10:10:00	1	12
2	2020-05-01 10:20:00	1	17
3	2020-05-01 10:00:00	2	11
4	2020-05-01 10:10:00	2	14
5	2020-05-01 10:20:00	2	16

Note that in pd.read_csv() you can identify the date/time column by name, as in parse_dates=['date'] above, or by position, as in parse_dates=[0] below.

# parse_dates=[column_position]

import pandas as pd

from datetime import datetime

dt_parser = lambda x: datetime.strptime(x, "%d/%m/%Y %H:%M:%S")

df_date3 = pd.read_csv("date_sample",

sep=",",

names=['date', 'id', 'val'],

parse_dates=[0], # position index

date_parser=dt_parser)

df_date3

	date	id	val
0	2020-05-01 10:00:00	1	10
1	2020-05-01 10:10:00	1	12
2	2020-05-01 10:20:00	1	17
3	2020-05-01 10:00:00	2	11
4	2020-05-01 10:10:00	2	14
5	2020-05-01 10:20:00	2	16
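
In pandas 2.0 and later, date_parser and infer_datetime_format are deprecated, and read_csv() accepts a date_format string instead. The sketch below picks whichever argument the installed version supports; the inline `raw` string is a hypothetical stand-in for the date_sample file, and the version check is only one possible way to stay compatible.

```python
import io
from datetime import datetime
import pandas as pd

# Hypothetical inline stand-in for the 'date_sample' file
raw = "1/5/2020 10:00:00,1,10\n1/5/2020 10:10:00,1,12\n"
fmt = "%d/%m/%Y %H:%M:%S"
common = dict(sep=",", names=["date", "id", "val"], parse_dates=["date"])

if int(pd.__version__.split(".")[0]) >= 2:
    # pandas >= 2.0: pass the format string directly
    df_fmt = pd.read_csv(io.StringIO(raw), date_format=fmt, **common)
else:
    # older pandas: fall back to a parser function, as in the post
    df_fmt = pd.read_csv(io.StringIO(raw),
                         date_parser=lambda x: datetime.strptime(x, fmt),
                         **common)
```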

(5) Loading date-time data as the DataFrame index with index_col

Finally, setting index_col=['column_name'] or index_col=column_position when reading the file loads the date/time column directly as the index of the pandas DataFrame.

(You can also read it as an ordinary column first and then make the date column the index with set_index().

* See: https://rfriend.tistory.com/255)

# use datetime as an index

import pandas as pd

df_date_idx = pd.read_csv("date_sample",

sep=",",

names=['date', 'id', 'val'],

index_col=['date'], # or index_col=0

parse_dates=True,

dayfirst=True,

infer_datetime_format=True)

df_date_idx

	id	val
date
2020-05-01 10:00:00	1	10
2020-05-01 10:10:00	1	12
2020-05-01 10:20:00	1	17
2020-05-01 10:00:00	2	11
2020-05-01 10:10:00	2	14
2020-05-01 10:20:00	2	16
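
The alternative of reading the column first and promoting it to the index afterwards might look like this. set_index() is the pandas method that moves a column into the index; the inline `raw` string is again a hypothetical stand-in for the date_sample file.

```python
import io
import pandas as pd

# Hypothetical inline stand-in for the 'date_sample' file
raw = "1/5/2020 10:00:00,1,10\n1/5/2020 10:10:00,1,12\n"

df = pd.read_csv(io.StringIO(raw), sep=",", names=["date", "id", "val"])
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y %H:%M:%S")

# Move the parsed 'date' column into the index
df_idx = df.set_index("date")
```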

I hope this was helpful.


License: Attribution, NonCommercial, NoDerivatives


Posted by Rfriend

[Python pandas] Converting continuous values to categories: comparing np.digitize() and pd.cut() (comparison of categorization using np.digitize(), pd.cut())

Python Analysis and Programming/Python Data Preprocessing 2020. 2. 18. 17:19

This post compares two ways of converting a continuous variable into a categorical variable by splitting it into several bins (categorization of a continuous variable by multiple bins).

(1) Binning a continuous variable with np.digitize(X, bins)

(2) Binning a continuous variable with pd.cut(X, bins, labels)

np.digitize(X, bins) and pd.cut(X, bins, labels) look similar but differ slightly in almost every respect, so check the syntax of each function carefully before using it.

[ Comparison of np.digitize() and pd.cut() ]

Item	np.digitize(X, bins)	pd.cut(X, bins, labels)
Interval for bins=[start, end]	[included, excluded)	(excluded, included]
Values below the first bin edge	(-inf, start) --> automatically digitized as 0	NaN (a value equal to the first edge is also NaN)
Values at or beyond the last bin edge	[end, inf) --> automatically digitized following the bin order	values larger than the last edge --> NaN
Label	non-negative integers 0, 1, 2, ... assigned automatically	user-definable (labels option)
Return value	numpy array	categorical data (a Series of category dtype for a Series input)

(1) Binning a continuous variable with np.digitize(X, bins)

First, create a simple pandas DataFrame to use as an example.

import pandas as pd

import numpy as np

df = pd.DataFrame({'col': np.arange(10)})

	col
0	0
1	1
2	2
3	3
4	4
5	5
6	6
7	7
8	8
9	9

Now use np.digitize(X, bins=[0, 5, 8]) to assign the positive integers {1, 2, 3}, in order, to the bins {[0, 5), [5, 8), [8, inf)}, and store the result in a new column named 'grp_digitize' in the df DataFrame.

For reference, a round bracket '(' or ')' means the endpoint is excluded, and a square bracket '[' or ']' means it is included.

bins=[0, 5, 8]

# returns numpy array

np.digitize(df['col'], bins)

[Out]: array([1, 1, 1, 1, 1, 2, 2, 2, 3, 3])

df['grp_digitize'] = np.digitize(df['col'], bins)

[Out]:

	col	grp_digitize
0	0	1
1	1	1
2	2	1
3	3	1
4	4	1
5	5	2
6	6	2
7	7	2
8	8	3
9	9	3

(2) Binning a continuous variable with pd.cut(X, bins, labels)

This time use pd.cut(X, bins=[0, 5, 8]) to categorize the values into the two bins {(0, 5], (5, 8]}. It returns categorical data showing which bin each element of array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) belongs to.

import pandas as pd

import numpy as np

df = pd.DataFrame({'col': np.arange(10)})

# pd.cut intervals: (excluded, included]

bins=[0, 5, 8]

# returns a Series of category dtype

pd.cut(df["col"], bins=bins)

[Out]:
0           NaN
1    (0.0, 5.0]
2    (0.0, 5.0]
3    (0.0, 5.0]
4    (0.0, 5.0]
5    (0.0, 5.0]
6    (5.0, 8.0]
7    (5.0, 8.0]
8    (5.0, 8.0]
9           NaN
Name: col, dtype: category
Categories (2, interval[int64]): [(0, 5] < (5, 8]]

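np.digitize() in (1) uses [included, excluded) intervals, whereas pd.cut() uses (excluded, included] intervals, the exact opposite.

np.digitize() in (1) also assigns integers automatically to values outside the bins: 0 to values smaller than the first bin edge, and an integer following the bin order to values at or above the last bin edge. pd.cut(), in contrast, returns NaN for values that fall outside the bins and, for values inside them, assigns the labels you pass, such as labels=['a', 'b'].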

df['grp_cut'] = pd.cut(df["col"], bins=bins, labels=['a', 'b'])

[Out]:

	col	grp_digitize	grp_cut
0	0	1	NaN
1	1	1	a
2	2	1	a
3	3	1	a
4	4	1	a
5	5	2	a
6	6	2	b
7	7	2	b
8	8	3	b
9	9	3	NaN
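
The edge behavior of the two functions can be seen side by side on a few boundary values. The sketch below uses a small hypothetical array `x` that includes a value below the first edge; right=True is an existing np.digitize() option that switches it to closed-right intervals like those of pd.cut().

```python
import numpy as np
import pandas as pd

x = np.array([-1, 0, 5, 8, 9])   # hypothetical boundary values
bins = [0, 5, 8]

# np.digitize: [start, end) intervals, out-of-range values still get a code
codes = np.digitize(x, bins)
# -1 -> 0, 0 -> 1, 5 -> 2, 8 -> 3, 9 -> 3

# pd.cut: (start, end] intervals, out-of-range values become NaN
cats = pd.cut(x, bins=bins)
# -1 -> NaN, 0 -> NaN, 5 -> (0, 5], 8 -> (5, 8], 9 -> NaN

# np.digitize with closed-right intervals, closer to pd.cut's convention
codes_right = np.digitize(x, bins, right=True)
# -1 -> 0, 0 -> 0, 5 -> 1, 8 -> 2, 9 -> 3
```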

After converting the continuous variable to a categorical one, aggregate the 'col' variable by group with groupby('grp_cut') to get the sum by group.

df.groupby('grp_cut')['col'].sum()

[Out]:

grp_cut
a    15
b    21
Name: col, dtype: int64

To compute several statistics at once for each 'grp_cut' group ('a', 'b'), such as the sum, count, mean, and variance, define a user-defined function and call it through apply(), as in df.groupby('grp_cut')['col'].apply(my_summary). To make the statistics easier to read by group, unstack() is used to spread the long vertical result into a wide table.

# UDF of summary statistics

def my_summary(x):
    result = {
        'sum': x.sum(),
        'count': x.count(),
        'mean': x.mean(),
        'variance': x.var()
    }
    return result

df.groupby('grp_cut')['col'].apply(my_summary).unstack()

[Out]:

	sum	count	mean	variance
grp_cut
a	15.0	5.0	3.0	2.5
b	21.0	3.0	7.0	1.0
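
As an alternative sketch (not the method used in the post), the same per-group statistics can be requested from agg() with a list of built-in aggregation names, which avoids the user-defined function and the unstack() step. observed=True is an assumption made here so that only categories that actually occur are reported.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col": np.arange(10)})
df["grp_cut"] = pd.cut(df["col"], bins=[0, 5, 8], labels=["a", "b"])

# One row per group, one column per statistic
summary = (df.groupby("grp_cut", observed=True)["col"]
             .agg(["sum", "count", "mean", "var"]))
```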

I hope this was helpful.


Posted by Rfriend

[Python] Train/test set split by stratified random sampling (Train, Test set Split by Stratified Random Sampling in Python)

Python Analysis and Programming/Python Data Preprocessing 2020. 2. 15. 23:29

The previous post showed how to split a dataset into a train set and a test set by random sampling (Train set, Test set split by Random Sampling).

This post shows how to split a dataset into train and test sets by sampling randomly within each stratum, so that the proportion of each stratum in the dataset is preserved (Train, Test set Split by Stratified Random Sampling).

(1) Train/test split with the sklearn.model_selection.train_test_split function

(returns stratified X_train, X_test, y_train, y_test)

(2) Train/test split with the sklearn.model_selection.StratifiedShuffleSplit class

(returns stratified train/test indices --> index the train and test sets yourself)

For an introduction to simple random sampling, systematic sampling, stratified random sampling, cluster sampling, and multi-stage sampling, see https://rfriend.tistory.com/58.

(1) Train/test split with the sklearn.model_selection.train_test_split function

(returns stratified X_train, X_test, y_train, y_test)

First, as a simple example, create with numpy an X array of 15 rows and 2 columns and a y array of 15 elements. Treat the first 5 observations as group (stratum) '0' and observations 6 through 15 as group (stratum) '1', and store this information in a list named 'grp'.

import numpy as np

X = np.arange(30).reshape(15, 2)

[Out]:
array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11],
       [12, 13],
       [14, 15],
       [16, 17],
       [18, 19],
       [20, 21],
       [22, 23],
       [24, 25],
       [26, 27],
       [28, 29]])

y = np.arange(15)

[Out]: 
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

# stratum (group)

grp = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

grp

[Out]:

[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Now import the train_test_split function from the scikit-learn model_selection module and split the data into X_train, X_test, y_train, y_test.

- With X and y held as separate arrays, pass X and y as the first and second arguments.

- Pass the test set proportion as test_size and the stratum variable as stratify. The test_size proportion is then applied within each stratum (group) separately.

- shuffle=True performs random sampling. shuffle=False splits the data sequentially without shuffling and cannot be combined with stratify.

- random_state is the random seed used for reproducibility; any number will do.

# returns X_train, X_test, y_train, y_test dataset

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    shuffle=True,
                                                    stratify=grp,
                                                    random_state=1004)

print('X_train shape:', X_train.shape)

print('X_test shape:', X_test.shape)

print('y_train shape:', y_train.shape)

print('y_test shape:', y_test.shape)

[Out]:

X_train shape: (12, 2)
X_test shape: (3, 2)
y_train shape: (12,)
y_test shape: (3,)

Below are the resulting X_train, y_train, X_test, and y_test.

X_train

[Out]:

array([[12, 13],
       [ 8,  9],
       [28, 29],
       [ 0,  1],
       [10, 11],
       [ 6,  7],
       [ 2,  3],
       [18, 19],
       [20, 21],
       [22, 23],
       [26, 27],
       [14, 15]])

y_train

[Out]:

array([ 6,  4, 14,  0,  5,  3,  1,  9, 10, 11, 13,  7])

X_test

[Out]:

array([[16, 17],
       [ 4,  5],
       [24, 25]])

y_test

[Out]: array([ 8,  2, 12])
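
One way to confirm that the split respected the strata is to pass the group labels through train_test_split() as a third array and count them on each side. This is a minimal sketch; the names g_tr and g_te are introduced here for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(30).reshape(15, 2)
y = np.arange(15)
grp = np.array([0] * 5 + [1] * 10)

# Any extra array passed in is split the same way as X and y
X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(
    X, y, grp, test_size=0.2, stratify=grp, random_state=1004)

train_counts = np.bincount(g_tr)  # samples per stratum in the train set
test_counts = np.bincount(g_te)   # samples per stratum in the test set
```

With 5 and 10 samples per stratum and test_size=0.2, both sides keep the 1:2 ratio of the full dataset.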

(2) Train/test split with the sklearn.model_selection.StratifiedShuffleSplit class

(returns stratified train/test indices --> index the train and test sets yourself)

(2-1) numpy array example

The train_test_split() function above takes X and y as input and returns datasets already split into X_train, X_test, y_train, y_test while respecting the proportion of each stratum. StratifiedShuffleSplit(), introduced here, instead returns indices for a stratified random train/test split, so you need the extra step of indexing the train and test sets with those indices. A little less convenient than (1), isn't it? (On the other hand, it is more convenient than (1) when you need several stratified random splits for cross-validation via n_splits.)

Because only one train/test split is needed here, set n_splits=1. Pass the test set proportion as test_size, and pass any number as random_state to seed the random generator for reproducibility.

Since it yields train_idx and test_idx, a for loop is used to build X_train, X_test, y_train, y_test by indexing X and y.

# Stratified ShuffleSplit cross-validator
# provides train/test indices to split data in train/test sets.

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=1004)

for train_idx, test_idx in split.split(X, grp):
    X_train = X[train_idx]
    X_test = X[test_idx]
    y_train = y[train_idx]
    y_test = y[test_idx]

Checking X_train, y_train, X_test, and y_test gives the values below. Because random_state=1004 is the same as in (1), stratified random sampling draws the same train and test sets as the train_test_split() call in (1).

X_train

[Out]:
array([[12, 13],
       [ 8,  9],
       [28, 29],
       [ 0,  1],
       [10, 11],
       [ 6,  7],
       [ 2,  3],
       [18, 19],
       [20, 21],
       [22, 23],
       [26, 27],
       [14, 15]])

y_train

[Out]: array([ 6,  4, 14,  0,  5,  3,  1,  9, 10, 11, 13,  7])

X_test

[Out]:

array([[16, 17],
       [ 4,  5],
       [24, 25]])

y_test

[Out]: array([ 8,  2, 12])
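
When several splits are needed, the same object can be created with a larger n_splits. The sketch below collects the stratum counts of each test set to show that every split keeps the same proportions; note that, unlike k-fold cross-validation, the test sets of different splits may overlap.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.arange(30).reshape(15, 2)
grp = np.array([0] * 5 + [1] * 10)

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=1004)

test_strata = []
for train_idx, test_idx in sss.split(X, grp):
    # count how many test samples fall in each stratum for this split
    test_strata.append(np.bincount(grp[test_idx]).tolist())
```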

(2-2) pandas DataFrame example

Section (2-1) used numpy arrays. This time apply StratifiedShuffleSplit() to a pandas DataFrame to split it into train and test sets by stratified random sampling.

First, create a DataFrame with the columns X1, X2, y, and grp, holding the same values as the dataset used above.

import pandas as pd

import numpy as np

X = np.arange(30).reshape(15, 2)

y = np.arange(15)

df = pd.DataFrame(np.column_stack((X, y)), columns=['X1','X2', 'y'])

	X1	X2	y
0	0	1	0
1	2	3	1
2	4	5	2
3	6	7	3
4	8	9	4
5	10	11	5
6	12	13	6
7	14	15	7
8	16	17	8
9	18	19	9
10	20	21	10
11	22	23	11
12	24	25	12
13	26	27	13
14	28	29	14

df['grp'] = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

	X1	X2	y	grp
0	0	1	0	0
1	2	3	1	0
2	4	5	2	0
3	6	7	3	0
4	8	9	4	0
5	10	11	5	1
6	12	13	6	1
7	14	15	7	1
8	16	17	8	1
9	18	19	9	1
10	20	21	10	1
11	22	23	11	1
12	24	25	12	1
13	26	27	13	1
14	28	29	14	1

Now use StratifiedShuffleSplit() to build train and test DataFrames at random while preserving the proportion of each stratum.

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=1004)

for train_idx, test_idx in split.split(df, df["grp"]):
    df_strat_train = df.loc[train_idx]
    df_strat_test = df.loc[test_idx]

The DataFrames produced by stratified random sampling, which preserves the class proportions, are shown below.

df_strat_train

	X1	X2	y	grp
6	12	13	6	1
4	8	9	4	0
14	28	29	14	1
0	0	1	0	0
5	10	11	5	1
3	6	7	3	0
1	2	3	1	0
9	18	19	9	1
10	20	21	10	1
11	22	23	11	1
13	26	27	13	1
7	14	15	7	1

df_strat_test

	X1	X2	y	grp
8	16	17	8	1
2	4	5	2	0
12	24	25	12	1

Now check whether the percentage of samples for each class is really preserved in the train set and the test set.

df["grp"].value_counts() / len(df)

[Out]:
1    0.666667
0    0.333333
Name: grp, dtype: float64

df_strat_train["grp"].value_counts() / len(df_strat_train)

[Out]:

1    0.666667
0    0.333333
Name: grp, dtype: float64

df_strat_test["grp"].value_counts() / len(df_strat_test)

[Out]:

1    0.666667
0    0.333333
Name: grp, dtype: float64
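
The proportion check above divides value_counts() by the length of the data. As a small sketch of an equivalent shortcut, value_counts(normalize=True) returns the proportions directly; the Series below simply reproduces the 'grp' column.

```python
import pandas as pd

grp = pd.Series([0] * 5 + [1] * 10, name="grp")

# normalize=True returns relative frequencies instead of counts
ratio = grp.value_counts(normalize=True)
```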

For random sampling from a pandas DataFrame, see https://rfriend.tistory.com/602.

I hope this was helpful.


Posted by Rfriend

[Python numpy] Splitting train and test sets (split train and test set)

Python Analysis and Programming/Python Data Preprocessing 2020. 2. 11. 21:48

In machine learning, you start by dividing the data into a train set, used to fit the model, and a test set, used to evaluate the fitted model.

This post shows how to split a dataset held as a 2D array into a train set and a test set by random sampling.

(1) Splitting with the train_test_split function of the scikit-learn model_selection module

(2) Splitting with the permutation() function of the numpy random module

(3) Splitting with the choice() function of the numpy random module

(4) Splitting with the shuffle() function of the numpy random module

(1) Splitting train and test sets with train_test_split from sklearn.model_selection

(split train and test set using sklearn.model_selection train_test_split())

The most convenient, and therefore probably the most widely used, approach is the train_test_split() function of the scikit-learn model_selection module. It offers a shuffle option to choose whether to sample at random and a stratify option for stratified sampling, so a few simple settings are all it takes.

Create a simple 2D numpy array X and a 1D numpy array y to use as an example.

import numpy as np

X = np.arange(20).reshape(10, 2)

[Out]:

array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11],
       [12, 13],
       [14, 15],
       [16, 17],
       [18, 19]])

y = np.arange(10)

[Out]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

(1-1) Splitting train and test sets sequentially

Now use train_test_split() from sklearn.model_selection to split the data into a 60% train set and a 40% test set sequentially, without shuffling (shuffle=False). Use this when the order must be preserved, as with time series data. The default for the shuffle option is True, so if you want a sequential rather than a random split you must set shuffle=False explicitly.

from sklearn.model_selection import train_test_split

# shuffle = False

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.4,
                                                    shuffle=False,
                                                    random_state=1004)

print('X_train shape:', X_train.shape)

print('X_test shape:', X_test.shape)

print('y_train shape:', y_train.shape)

print('y_test shape:', y_test.shape)

[Out]:

X_train shape: (6, 2)
X_test shape: (4, 2)
y_train shape: (6,)
y_test shape: (4,)

X_train

[Out]:

array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11]])

y_train

[Out]: array([0, 1, 2, 3, 4, 5])

(1-2) Splitting train and test sets by random sampling

This time split the data into a 60% train set and a 40% test set by random sampling (shuffle=True). random_state seeds the random number generator for reproducibility; any number will do. shuffle=True is the default, so it can be omitted.

# shuffle = True

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.4,
                                                    shuffle=True,
                                                    random_state=1004)

X_train

array([[ 2,  3],
       [ 8,  9],
       [ 6,  7],
       [14, 15],
       [10, 11],
       [ 4,  5]])

y_train

array([1, 4, 3, 7, 5, 2])

(2) Splitting train and test sets with the permutation function of the numpy random module

This time write a user-defined function that splits train and test sets with numpy alone. The idea is simple: first use np.random.permutation() to shuffle the integers from 0 up to the number of observations in X (X.shape[0]), then slice off train_num rows as the train set and test_num rows as the test set.

np.random.seed(seed_number) sets the random seed for reproducibility.

# UDF of split train, test set using np.random.permutation()
# X: 2D array, y:1D array

def permutation_train_test_split(X, y, test_size=0.2, shuffle=True, random_state=1004):
    import numpy as np

    test_num = int(X.shape[0] * test_size)
    train_num = X.shape[0] - test_num

    if shuffle:
        # shuffle the row indices, then reorder X and y the same way
        np.random.seed(random_state)
        shuffled = np.random.permutation(X.shape[0])
        X = X[shuffled,:]
        y = y[shuffled]

    # first train_num rows become the train set, the rest the test set
    X_train = X[:train_num]
    X_test = X[train_num:]
    y_train = y[:train_num]
    y_test = y[train_num:]

    return X_train, X_test, y_train, y_test

# create 2D X and 1D y array

X = np.arange(20).reshape(10, 2)

y = np.arange(10)

# split train and test set using by random sampling

X_train, X_test, y_train, y_test = permutation_train_test_split(X, y,
                                                                test_size=0.4,
                                                                shuffle=True,
                                                                random_state=1004)

X_train

[Out]:

array([[ 0,  1],
       [12, 13],
       [16, 17],
       [18, 19],
       [ 2,  3],
       [ 8,  9]])

y_train

[Out]: array([0, 6, 8, 9, 1, 4])
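
The same permutation idea can also be written with numpy's Generator API, which keeps the random state local instead of seeding the global generator. This is a sketch under that assumption; the function name rng_train_test_split and its seed parameter are hypothetical, and its shuffled order will differ from the np.random.seed() output shown above.

```python
import numpy as np

def rng_train_test_split(X, y, test_size=0.2, seed=1004):
    """Split X and y at random using a local random generator."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    test_num = int(n * test_size)
    train_num = n - test_num

    idx = rng.permutation(n)                  # shuffled row indices
    train_idx, test_idx = idx[:train_num], idx[train_num:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_train, X_test, y_train, y_test = rng_train_test_split(X, y, test_size=0.4)
```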

(3) Splitting train and test sets with the choice function of the numpy random module

(3-1) numpy.random.choice(int_A, int_B, replace=False) draws B integers at random from A integers without replacement (replace=False). Those integers serve as the train set indices, and np.setdiff1d() gives the remaining indices, which serve as the test set indices.

import numpy as np

def choice_train_test_split(X, y, test_size=0.2, shuffle=True, random_state=1004):
    test_num = int(X.shape[0] * test_size)
    train_num = X.shape[0] - test_num

    if shuffle:
        np.random.seed(random_state)
        train_idx = np.random.choice(X.shape[0], train_num, replace=False)

        #-- way 1: using np.setdiff1d()
        test_idx = np.setdiff1d(range(X.shape[0]), train_idx)

        X_train = X[train_idx, :]
        X_test = X[test_idx, :]   # test rows come from test_idx
        y_train = y[train_idx]    # train labels come from train_idx
        y_test = y[test_idx]
    else:
        X_train = X[:train_num]
        X_test = X[train_num:]
        y_train = y[:train_num]
        y_test = y[train_num:]

    return X_train, X_test, y_train, y_test

(3-2) The version below draws the train set indices at random with np.random.choice() exactly as in (3-1). The difference is that the test set indices are built with a list comprehension that combines a for loop with an 'if not in' condition.

import numpy as np

def choice_train_test_split(X, y, test_size=0.2, shuffle=True, random_state=1004):
    test_num = int(X.shape[0] * test_size)
    train_num = X.shape[0] - test_num

    if shuffle:
        np.random.seed(random_state)
        train_idx = np.random.choice(X.shape[0], train_num, replace=False)

        #-- way 2: using list comprehension with for loop
        test_idx = [idx for idx in range(X.shape[0]) if idx not in train_idx]

        X_train = X[train_idx, :]
        X_test = X[test_idx, :]   # test rows come from test_idx
        y_train = y[train_idx]    # train labels come from train_idx
        y_test = y[test_idx]
    else:
        X_train = X[:train_num]
        X_test = X[train_num:]
        y_train = y[:train_num]
        y_test = y[train_num:]

    return X_train, X_test, y_train, y_test
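
The two ways of deriving the test indices can be compared directly, outside the function. This short sketch checks that np.setdiff1d() and the list comprehension select the same rows and that neither overlaps with the train indices.

```python
import numpy as np

np.random.seed(1004)
n, train_num = 10, 6

# draw train indices without replacement
train_idx = np.random.choice(n, train_num, replace=False)

# way 1: set difference (returns a sorted array)
test_a = np.setdiff1d(np.arange(n), train_idx)

# way 2: list comprehension over all indices, in ascending order
test_b = np.array([i for i in range(n) if i not in train_idx])
```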

(4) Splitting train and test sets with the numpy random shuffle() function

(split train, test set using np.random.shuffle() function)

np.random.shuffle() shuffles a whole array in place along its first axis (it returns None). To shuffle X and y in the same random order: (4-1) first join X and y side by side with np.column_stack((X, y)); --> (4-2) shuffle the joined Xy array with np.random.shuffle(Xy), which works in place; --> (4-3) slice the first train_num rows and the X columns (every column except the last) to get X_train. (4-4) Slice the same first train_num rows of the shuffled Xy array and, because y was attached as the rightmost column, index that column with Xy[:train_num, -1] to get y_train.

# UDF of split train, test set using np.random.shuffle()
# X: 2D array, y:1D array

def shuffle_train_test_split(X, y, test_size=0.2, shuffle=True, random_state=1004):
    import numpy as np

    test_num = int(X.shape[0] * test_size)
    train_num = X.shape[0] - test_num

    if shuffle:
        np.random.seed(random_state) # for reproducibility
        Xy = np.column_stack((X, y)) # concatenate first
        np.random.shuffle(Xy) # random shuffling second (in place)

        X_train = Xy[:train_num, :-1] # first train_num rows, X columns
        X_test = Xy[train_num:, :-1] # remaining rows, X columns
        y_train = Xy[:train_num, -1] # first train_num rows, y column
        y_test = Xy[train_num:, -1] # remaining rows, y column
    else:
        X_train = X[:train_num]
        X_test = X[train_num:]
        y_train = y[:train_num]
        y_test = y[train_num:]

    return X_train, X_test, y_train, y_test

# shuffle = True

X = np.arange(20).reshape(10, 2)

y = np.arange(10)

X_train, X_test, y_train, y_test = shuffle_train_test_split(X, y,
                                                            test_size=0.4,
                                                            shuffle=True)

X_train

[Out]:

array([[ 0,  1],
       [12, 13],
       [16, 17],
       [18, 19],
       [ 2,  3],
       [ 8,  9]])

y_train

[Out]: array([0, 6, 8, 9, 1, 4])
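
The reason for stacking X and y before shuffling is that whole rows move together. The sketch below verifies this on the example arrays, where each row of X starts with twice its y value, so any broken pairing would show up immediately.

```python
import numpy as np

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

np.random.seed(1004)
Xy = np.column_stack((X, y))   # y becomes the last column
np.random.shuffle(Xy)          # shuffles rows in place and returns None

X_shuf, y_shuf = Xy[:, :-1], Xy[:, -1]

# In the original data X[i, 0] == 2 * y[i]; this must still hold row by row
rows_still_paired = bool((X_shuf[:, 0] == 2 * y_shuf).all())
```

One caveat: np.column_stack() produces a single array with one dtype, so if X is float and y is integer, the y values come back as floats.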

For splitting train and test sets by stratified random sampling (train and test set split by stratified random sampling in python), see https://rfriend.tistory.com/520.

For drawing a random sample from a 1-D array with np.random.choice() (Generate a random sample from a given 1-D array), see https://rfriend.tistory.com/548.

For random sampling from a pandas DataFrame, see https://rfriend.tistory.com/602.

I hope this was helpful.


Posted by Rfriend

[Python Numpy] How to reverse a numpy array (how to reverse numpy array)

Python Analysis and Programming/Python Data Preprocessing 2020. 2. 5. 23:09

This post shows two ways to reverse the order of the elements of a Python numpy array. Reversing a 1D array is simple, but arrays of two or more dimensions (multi-dimensional arrays) are a little more involved, so each case is explained with a short example.

(1) Reversing an array with x[::-1] (returns a mirrored view)

(2) Reversing an array with np.flip(x)

Reversing a 1D numpy array (how to reverse 1D numpy array?)

First, create a simple 1D numpy array to use as an example.

import numpy as np

# 1D array

arr_1d = np.arange(5)

arr_1d

[Out]: array([0, 1, 2, 3, 4])

Next, reverse the 1D numpy array with (1) x[::-1] and (2) np.flip(x).

(1) x[::-1]

(2) np.flip(x)

# returns a view in reversed order

arr_1d[::-1]

[Out]: array([4, 3, 2, 1, 0])

# 1D array in reversed order using np.flip()

np.flip(arr_1d)

[Out]: array([4, 3, 2, 1, 0])
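
Both x[::-1] and np.flip(x) return views rather than copies, so a later change to the original array shows up in the reversed result as well. The sketch below illustrates this and uses .copy() for the case where an independent reversed array is wanted.

```python
import numpy as np

arr = np.arange(5)

rev_slice = arr[::-1]         # view on arr in reversed order
rev_flip = np.flip(arr)       # also a view on arr
rev_copy = arr[::-1].copy()   # independent reversed copy

arr[0] = 100                  # modify the original array

# The views see the change at their last position; the copy does not
last_values = (rev_slice[-1], rev_flip[-1], rev_copy[-1])
```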

Reversing a 2D numpy array (how to reverse 2D numpy array?)

Reversing numpy arrays of two or more dimensions is harder to describe in words, because the basis for the reversal and the expected reversed output change depending on which axis of the array you choose. The three cases below each show both x[::-1] and np.flip(x) on a 2D numpy array, so pick the one that matches the output you want.

First, create a 2D numpy array to use as an example.

import numpy as np

# 2D array

arr_2d = np.arange(10).reshape(2, 5)

arr_2d

[Out]:
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

Now let's reverse the 2D numpy array.

(2-1) Reversing a 2D numpy array along axis = 0 (how to reverse numpy 2D array by axis=0)

(1) x[::-1]

(2) np.flip(x, axis=0)

# returns a view in reversed order by axis=0

arr_2d[::-1]

[Out]:
array([[5, 6, 7, 8, 9],
       [0, 1, 2, 3, 4]])

# reverse 2D array by axis 0

np.flip(arr_2d, axis=0)

[Out]:

array([[5, 6, 7, 8, 9],
       [0, 1, 2, 3, 4]])

(2-2) Reversing a 2D numpy array along axis = 1 (how to reverse numpy 2D array by axis=1)

(1) x[:, ::-1]

(2) np.flip(x, axis=1)

# returns a view of 2D array by axis=1

arr_2d[:, ::-1]

[Out]:

array([[4, 3, 2, 1, 0],
       [9, 8, 7, 6, 5]])

# reverse 2D array by axis 1

np.flip(arr_2d, axis=1)

[Out]:

array([[4, 3, 2, 1, 0],
       [9, 8, 7, 6, 5]])

(2-3) Reversing a 2D numpy array along both axis = 0 and axis = 1 (reverse numpy 2D array by axis=0 & 1)

(1) x[:, ::-1][::-1]

(2) np.flip(x)

# returns a view

arr_2d[:, ::-1][::-1]

[Out]:
array([[9, 8, 7, 6, 5],
       [4, 3, 2, 1, 0]])

# 2D array

np.flip(arr_2d)

[Out]:

array([[9, 8, 7, 6, 5],
       [4, 3, 2, 1, 0]])
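
Reversing both axes can also be written in a single indexing step, and np.flip() accepts a tuple of axes, which makes the intent explicit. The sketch below checks that these forms agree with the chained slicing on the same example array.

```python
import numpy as np

arr_2d = np.arange(10).reshape(2, 5)

both_chained = arr_2d[:, ::-1][::-1]          # chained slicing
both_slice = arr_2d[::-1, ::-1]               # one indexing step
both_flip = np.flip(arr_2d, axis=(0, 1))      # explicit tuple of axes
```

For arrays with more than two dimensions, np.flip(x) without an axis reverses every axis, so the tuple form is the safer choice when only some axes should be reversed.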

I hope this was helpful.


Posted by Rfriend

[Python pandas] Filling missing values (fill na) and linear interpolation after upsampling

Python Analysis and Programming/Python Data Preprocessing 2019. 12. 31. 13:49

The previous post (https://rfriend.tistory.com/507) covered pandas resampling by downsampling: which side of each interval is inclusive (closed) during aggregation, and which side supplies the label.

This post covers two ways to handle the missing values that appear when you convert the frequency of time series data by upsampling.

(1) Filling the missing values created by upsampling (filling forward/backward)

(2) Linearly interpolating the missing values created by upsampling (linear interpolation)

Create a simple time series DataFrame with two columns and a frequency of 5 seconds to use as an example.

import pandas as pd

import numpy as np

rng = pd.date_range('2019-12-31', periods=3, freq='5S')

rng

[Out]:
DatetimeIndex(['2019-12-31 00:00:00', '2019-12-31 00:00:05',
               '2019-12-31 00:00:10'],
              dtype='datetime64[ns]', freq='5S')

ts = pd.DataFrame(np.array([0, 1, 3, 2, 10, 3]).reshape(3, 2),

index=rng,

columns=['col_1', 'col_2'])

[Out]:

	col_1	col_2
2019-12-31 00:00:00	0	1
2019-12-31 00:00:05	3	2
2019-12-31 00:00:10	10	3

Now use the pandas resample() method to upsample the original data from a 5-second frequency (freq='5S') to a 1-second frequency (freq='1S'). As shown below, the newly created date-time rows contain missing values.

ts_upsample = ts.resample('S').mean()

ts_upsample

[Out]:

	col_1	col_2
2019-12-31 00:00:00	0.0	1.0
2019-12-31 00:00:01	NaN	NaN
2019-12-31 00:00:02	NaN	NaN
2019-12-31 00:00:03	NaN	NaN
2019-12-31 00:00:04	NaN	NaN
2019-12-31 00:00:05	3.0	2.0
2019-12-31 00:00:06	NaN	NaN
2019-12-31 00:00:07	NaN	NaN
2019-12-31 00:00:08	NaN	NaN
2019-12-31 00:00:09	NaN	NaN
2019-12-31 00:00:10	10.0	3.0
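As a quick check of the upsampled result above, here is a minimal sketch that rebuilds the example and counts rows and missing values. It uses the lowercase aliases '5s' and '1s', which is my assumption so that it also runs on recent pandas releases that prefer lowercase 's' over 'S'. The names n_rows and n_missing are only for illustration.

```python
import numpy as np
import pandas as pd

# same example data as above: 3 rows, 5 seconds apart
rng = pd.date_range('2019-12-31', periods=3, freq='5s')
ts = pd.DataFrame(np.array([0, 1, 3, 2, 10, 3]).reshape(3, 2),
                  index=rng,
                  columns=['col_1', 'col_2'])

# upsample from a 5-second to a 1-second frequency
ts_upsample = ts.resample('1s').mean()

n_rows = len(ts_upsample)                        # rows from 00:00:00 through 00:00:10
n_missing = ts_upsample.isna().sum().to_dict()   # NaN count per column

print(n_rows)
print(n_missing)
```

Of the 11 rows only the 3 original observations carry values, so each column has 8 missing values to deal with.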

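Let's handle the missing values produced by the Upsampling above by (1) filling and (2) linear interpolation.

(1) Filling the missing values created by Upsampling (filling missing values)

(1-1) Filling missing values with the preceding value (Filling forward)

# (1) filling forward
# the three calls below give the same result

ts_upsample.ffill()

ts_upsample.fillna(method='ffill') # the method= argument is deprecated in recent pandas releases; prefer ffill()

ts_upsample.fillna(method='pad')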

[Out]:

	col_1	col_2
2019-12-31 00:00:00	0.0	1.0
2019-12-31 00:00:01	0.0	1.0
2019-12-31 00:00:02	0.0	1.0
2019-12-31 00:00:03	0.0	1.0
2019-12-31 00:00:04	0.0	1.0
2019-12-31 00:00:05	3.0	2.0
2019-12-31 00:00:06	3.0	2.0
2019-12-31 00:00:07	3.0	2.0
2019-12-31 00:00:08	3.0	2.0
2019-12-31 00:00:09	3.0	2.0
2019-12-31 00:00:10	10.0	3.0

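(1-2) Filling missing values with the following value (Filling backward)

# (2) filling backward
# the three calls below give the same result

ts_upsample.bfill()

ts_upsample.fillna(method='bfill') # the method= argument is deprecated in recent pandas releases; prefer bfill()

ts_upsample.fillna(method='backfill')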

[Out]:

	col_1	col_2
2019-12-31 00:00:00	0.0	1.0
2019-12-31 00:00:01	3.0	2.0
2019-12-31 00:00:02	3.0	2.0
2019-12-31 00:00:03	3.0	2.0
2019-12-31 00:00:04	3.0	2.0
2019-12-31 00:00:05	3.0	2.0
2019-12-31 00:00:06	10.0	3.0
2019-12-31 00:00:07	10.0	3.0
2019-12-31 00:00:08	10.0	3.0
2019-12-31 00:00:09	10.0	3.0
2019-12-31 00:00:10	10.0	3.0

(1-3) Filling missing values with a specific value

# (3)fill Missing value with '0'

ts_upsample.fillna(0)

[Out]:

	col_1	col_2
2019-12-31 00:00:00	0.0	1.0
2019-12-31 00:00:01	0.0	0.0
2019-12-31 00:00:02	0.0	0.0
2019-12-31 00:00:03	0.0	0.0
2019-12-31 00:00:04	0.0	0.0
2019-12-31 00:00:05	3.0	2.0
2019-12-31 00:00:06	0.0	0.0
2019-12-31 00:00:07	0.0	0.0
2019-12-31 00:00:08	0.0	0.0
2019-12-31 00:00:09	0.0	0.0
2019-12-31 00:00:10	10.0	3.0

(1-4) Filling missing values with the mean

# (4) filling with mean value

# mean per column

ts_upsample.mean()

[Out]:

col_1    4.333333
col_2    2.000000
dtype: float64

ts_upsample.fillna(ts_upsample.mean())

[Out]:

	col_1	col_2
2019-12-31 00:00:00	0.000000	1.0
2019-12-31 00:00:01	4.333333	2.0
2019-12-31 00:00:02	4.333333	2.0
2019-12-31 00:00:03	4.333333	2.0
2019-12-31 00:00:04	4.333333	2.0
2019-12-31 00:00:05	3.000000	2.0
2019-12-31 00:00:06	4.333333	2.0
2019-12-31 00:00:07	4.333333	2.0
2019-12-31 00:00:08	4.333333	2.0
2019-12-31 00:00:09	4.333333	2.0
2019-12-31 00:00:10	10.000000	3.0
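The column means used above come only from the observed rows, because mean() skips NaN by default. The sketch below, again using the lowercase '5s' and '1s' aliases as an assumption, makes that explicit. The names col_means, filled and remaining_nan are illustrative.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2019-12-31', periods=3, freq='5s')
ts = pd.DataFrame(np.array([0, 1, 3, 2, 10, 3]).reshape(3, 2),
                  index=rng, columns=['col_1', 'col_2'])
ts_upsample = ts.resample('1s').mean()

# mean() skips NaN by default, so only the 3 observed rows contribute:
# col_1 -> (0 + 3 + 10) / 3, col_2 -> (1 + 2 + 3) / 3
col_means = ts_upsample.mean()

# passing a Series to fillna() fills each column with its own value
filled = ts_upsample.fillna(col_means)

remaining_nan = int(filled.isna().sum().sum())

print(col_means)
print(remaining_nan)
```

Because the 8 upsampled rows do not enter the mean, filling with it leaves the column mean unchanged.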

(1-5) Limiting the number of rows filled

# (5) limit the number of filling observation

ts_upsample.ffill(limit=1)

[Out]:

	col_1	col_2
2019-12-31 00:00:00	0.0	1.0
2019-12-31 00:00:01	0.0	1.0
2019-12-31 00:00:02	NaN	NaN
2019-12-31 00:00:03	NaN	NaN
2019-12-31 00:00:04	NaN	NaN
2019-12-31 00:00:05	3.0	2.0
2019-12-31 00:00:06	3.0	2.0
2019-12-31 00:00:07	NaN	NaN
2019-12-31 00:00:08	NaN	NaN
2019-12-31 00:00:09	NaN	NaN
2019-12-31 00:00:10	10.0	3.0

(2) Linearly interpolating the missing values created by Upsampling (linear interpolation)

# (6) Linear interpolation by values

ts_upsample.interpolate(method='values') # same result here as the default method='linear', because the index is evenly spaced

[Out]:

	col_1	col_2
2019-12-31 00:00:00	0.0	1.0
2019-12-31 00:00:01	0.6	1.2
2019-12-31 00:00:02	1.2	1.4
2019-12-31 00:00:03	1.8	1.6
2019-12-31 00:00:04	2.4	1.8
2019-12-31 00:00:05	3.0	2.0
2019-12-31 00:00:06	4.4	2.2
2019-12-31 00:00:07	5.8	2.4
2019-12-31 00:00:08	7.2	2.6
2019-12-31 00:00:09	8.6	2.8
2019-12-31 00:00:10	10.0	3.0

ts_upsample.interpolate(method='values').plot()

# (7) Linear interpolation by time

ts_upsample.interpolate(method='time')

[Out]:

	col_1	col_2
2019-12-31 00:00:00	0.0	1.0
2019-12-31 00:00:01	0.6	1.2
2019-12-31 00:00:02	1.2	1.4
2019-12-31 00:00:03	1.8	1.6
2019-12-31 00:00:04	2.4	1.8
2019-12-31 00:00:05	3.0	2.0
2019-12-31 00:00:06	4.4	2.2
2019-12-31 00:00:07	5.8	2.4
2019-12-31 00:00:08	7.2	2.6
2019-12-31 00:00:09	8.6	2.8
2019-12-31 00:00:10	10.0	3.0
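In the example above the index is evenly spaced, so the interpolation methods give identical results. The difference shows up with an irregular index. The sketch below uses a small hypothetical Series (not from the example above) with observations at 0, 1 and 10 seconds.

```python
import numpy as np
import pandas as pd

# hypothetical irregular index: observations at 0s, 1s and 10s
idx = pd.to_datetime(['2019-12-31 00:00:00',
                      '2019-12-31 00:00:01',
                      '2019-12-31 00:00:10'])
s = pd.Series([0.0, np.nan, 10.0], index=idx)

# 'linear' ignores the index and treats the rows as equally spaced
by_position = s.interpolate(method='linear')

# 'time' weights the interpolation by the elapsed time between rows
by_time = s.interpolate(method='time')

print(by_position.iloc[1])
print(by_time.iloc[1])
```

With method='linear' the missing value lands halfway between its neighbours (5.0). With method='time' it comes out at about 1.0, because only 1 of the 10 seconds has elapsed.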

I hope this was helpful.


License: Attribution, NonCommercial, NoDerivatives


Posted by Rfriend

[Python pandas] Setting which side is included (closed) and which side is the label (label) when aggregating time series by Downsampling

Python Analysis and Programming / Python Data Preprocessing 2019. 12. 30. 17:47

The previous post (https://rfriend.tistory.com/506) introduced how to create quarterly period date ranges and how to convert between period and timestamp.

With the pandas resample() method you can do

(a) Downsampling, which aggregates/summarizes a higher-frequency time series to a lower frequency (e.g. seconds --> 10 seconds, day --> week, day --> month), and

(b) Upsampling, which converts a lower-frequency time series to a higher frequency (e.g. 10 seconds --> 1 second, week --> day, month --> week, year --> day).

This post introduces how to set the following when Downsampling with the pandas resample() method (e.g. resampling from a 1-second frequency to a 10-second / 1-minute / 1-hour frequency):

(1) which side, left or right, is included (closed)

(2) which side, left or right, supplies the label (label)

There is no rule about whether to use left or right for inclusion and labels. What matters is that (a) the choice is clearly understood (especially when several people collaborate) and (b) it is used consistently throughout the product's code. (I once split time series preprocessing with SQL in a DB between two groups, found out only later that the inclusion and label rules differed, and spent time reworking them into the same rule. -_-;;;)

As an example, let's create a simple time series pandas Series with 6 data points at a 1-minute frequency.

import pandas as pd

# generate dates range

dates = pd.date_range('2020-12-31', periods=6, freq='min') # or freq='T'

dates

[Out]:
DatetimeIndex(['2020-12-31 00:00:00', '2020-12-31 00:01:00',
               '2020-12-31 00:02:00', '2020-12-31 00:03:00',
               '2020-12-31 00:04:00', '2020-12-31 00:05:00'],
              dtype='datetime64[ns]', freq='T')

# create Series

ts_series = pd.Series(range(len(dates)), index=dates)

ts_series

[Out]:

2020-12-31 00:00:00    0
2020-12-31 00:01:00    1
2020-12-31 00:02:00    2
2020-12-31 00:03:00    3
2020-12-31 00:04:00    4
2020-12-31 00:05:00    5
Freq: T, dtype: int64

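Now let's use the resample() method to downsample the time series from a '1-minute frequency' (freq='min') to a '2-minute frequency' (freq='2min' or freq='2T').

We will compare the Downsampling results for all 4 combinations of the inclusion side, (a) closed='left' (by default) or (b) closed='right', and the label side, (c) label='left' (by default) or (d) label='right'. sum() is used as the aggregation function throughout.

(1) By default: closed='left', label='left' when Downsampling

When Downsampling, one side of each bin, left or right, is included (inclusive, default: 'left') and the other is not. The label of each bin after resampling defaults to its left edge (label='left'). These defaults apply to most frequencies, including the minute frequency used here; month-end, quarter-end, year-end and weekly frequencies default to 'right' for both.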

ts_series = pd.Series(range(len(dates)), index=dates)

ts_series

[Out]:

2020-12-31 00:00:00    0
2020-12-31 00:01:00    1
2020-12-31 00:02:00    2
2020-12-31 00:03:00    3
2020-12-31 00:04:00    4
2020-12-31 00:05:00    5
Freq: T, dtype: int64

# by default, left side of bin interval is closed

# by default, left side of bin interval is labeled

ts_series.resample('2min').sum()

[Out]:

2020-12-31 00:00:00    1
2020-12-31 00:02:00    5
2020-12-31 00:04:00    9
Freq: 2T, dtype: int64

# same result with above

ts_series.resample('2min', closed='left', label='left').sum()

[Out]:

2020-12-31 00:00:00    1
2020-12-31 00:02:00    5
2020-12-31 00:04:00    9
Freq: 2T, dtype: int64

(2) closed='right', label='left' when Downsampling

ts_series = pd.Series(range(len(dates)), index=dates)

ts_series

[Out]:

2020-12-31 00:00:00    0
2020-12-31 00:01:00    1
2020-12-31 00:02:00    2
2020-12-31 00:03:00    3
2020-12-31 00:04:00    4
2020-12-31 00:05:00    5
Freq: T, dtype: int64

# right side of bin interval is closed using closed='right'

ts_series.resample('2min', closed='right', label='left').sum()

[Out]:

2020-12-30 23:58:00    0
2020-12-31 00:00:00    3
2020-12-31 00:02:00    7
2020-12-31 00:04:00    5
Freq: 2T, dtype: int64

(3) closed='left', label='right' when Downsampling

ts_series = pd.Series(range(len(dates)), index=dates)

ts_series

[Out]:

2020-12-31 00:00:00    0
2020-12-31 00:01:00    1
2020-12-31 00:02:00    2
2020-12-31 00:03:00    3
2020-12-31 00:04:00    4
2020-12-31 00:05:00    5
Freq: T, dtype: int64

# right side of bin interval is labeled using label='right'

ts_series.resample('2min', closed='left', label='right').sum()

[Out]:

2020-12-31 00:02:00    1
2020-12-31 00:04:00    5
2020-12-31 00:06:00    9
Freq: 2T, dtype: int64

(4) closed='right', label='right' when Downsampling

The example below is the exact opposite of the default: it includes the right edge of each bin (closed='right') and also takes the label from the right edge (label='right') when Downsampling.

ts_series = pd.Series(range(len(dates)), index=dates)

ts_series

[Out]:

2020-12-31 00:00:00    0
2020-12-31 00:01:00    1
2020-12-31 00:02:00    2
2020-12-31 00:03:00    3
2020-12-31 00:04:00    4
2020-12-31 00:05:00    5
Freq: T, dtype: int64

# right side of bin interval is closed using closed='right'

# right side of bin interval is labeled using label='right'

ts_series.resample('2min', closed='right', label='right').sum()

[Out]:

2020-12-31 00:00:00    0
2020-12-31 00:02:00    3
2020-12-31 00:04:00    7
2020-12-31 00:06:00    5
Freq: 2T, dtype: int64
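To see the four combinations from (1) through (4) side by side, they can be run in a single loop. This is only a sketch for comparison; the dictionary name results is illustrative.

```python
import pandas as pd

dates = pd.date_range('2020-12-31', periods=6, freq='min')
ts_series = pd.Series(range(len(dates)), index=dates)

# run the same 2-minute sum for every (closed, label) combination
results = {}
for closed in ('left', 'right'):
    for label in ('left', 'right'):
        out = ts_series.resample('2min', closed=closed, label=label).sum()
        results[(closed, label)] = out
        print(f"closed={closed!r}, label={label!r}")
        print(out, end="\n\n")
```

closed decides which observations fall into each bin, so it changes the sums. label only decides which bin edge is shown in the index, so it changes the timestamps and leaves the sums alone.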

(5) Setting the inclusion side (closed) and label side (label) when Downsampling a time series pandas DataFrame

Examples (1), (2), (3), and (4) above all used a pandas Series. A time series pandas DataFrame with a DatetimeIndex as its index is downsampled the same way as a Series, with the inclusion and label sides set in the same manner.

import pandas as pd

# generate dates range

dates = pd.date_range('2020-12-31', periods=6, freq='min')

dates

[Out]:

DatetimeIndex(['2020-12-31 00:00:00', '2020-12-31 00:01:00',
               '2020-12-31 00:02:00', '2020-12-31 00:03:00',
               '2020-12-31 00:04:00', '2020-12-31 00:05:00'],
              dtype='datetime64[ns]', freq='T')

# create timeseries DataFrame

ts_df = pd.DataFrame({'val': range(len(dates))}, index=dates)

ts_df

[Out]:

	val
2020-12-31 00:00:00	0
2020-12-31 00:01:00	1
2020-12-31 00:02:00	2
2020-12-31 00:03:00	3
2020-12-31 00:04:00	4
2020-12-31 00:05:00	5

# (a) Downsampling using default setting

ts_df.resample('2min').sum()

[Out]:

	val
2020-12-31 00:00:00	1
2020-12-31 00:02:00	5
2020-12-31 00:04:00	9

# (b) Downsampling using closed='right'

ts_df.resample('2min', closed='right').sum()

[Out]:

	val
2020-12-30 23:58:00	0
2020-12-31 00:00:00	3
2020-12-31 00:02:00	7
2020-12-31 00:04:00	5

# (c) Downsampling using label='right'

ts_df.resample('2min', label='right').sum()

[Out]:

	val
2020-12-31 00:02:00	1
2020-12-31 00:04:00	5
2020-12-31 00:06:00	9

# (d) Downsampling using closed='right', label='right'

ts_df.resample('2min', closed='right', label='right').sum()

[Out]:

	val
2020-12-31 00:00:00	0
2020-12-31 00:02:00	3
2020-12-31 00:04:00	7
2020-12-31 00:06:00	5

I hope this was helpful.


Posted by Rfriend

[Python pandas] Creating quarterly period date ranges and converting to and from timestamp (Quarterly period frequencies and range, conversion b/w timestamp)

Python Analysis and Programming / Python Data Preprocessing 2019. 12. 30. 12:30

The previous post (https://rfriend.tistory.com/505) introduced how to check, set, and convert time zones in Python pandas.

This post covers the following in Python pandas:

(1) Creating quarterly period frequencies

(2) Creating quarterly period date ranges

(3) Converting between quarterly period and timestamp

(4) Aggregating by quarterly period (group by aggregation)

These features are especially useful in finance and accounting, when aggregating and analyzing results by fiscal year quarters with pandas.

[ Figure 1. Creating a quarterly period range in pandas (Quarterly Period Range) ]

(1) Creating quarterly period frequencies

Let's use the pandas Period() function to create the Fiscal Year 2020 4th Quarter. As [Figure 1] above shows, fiscal year 2020 with quarters ending in February consists of March-May 2019 (2020-Q1), June-August 2019 (2020-Q2), September-November 2019 (2020-Q3), and December 2019-February 2020 (2020-Q4). (It may look a little odd that fiscal year 2020 contains March-December of 2019, but that is simply how it works. ^^')

import pandas as pd

import numpy as np

p = pd.Period('2020Q4', freq='Q-FEB')

[Out]: Period('2020Q4', 'Q-FEB')

With the pandas asfreq() method, a pandas Period object can be converted to a desired frequency (Period frequency). Let's use asfreq() to convert the 2020-Q4 quarterly period above into (a) the quarter's starting date and ending date, and (b) the quarter's starting business date and ending business date (business days here are weekdays, Monday to Friday).

(a) converting from Period to Date: 'D'

(b) converting from Period to Business Date: 'B'

# starting date

p.asfreq('D', how='start')

[Out]: Period('2019-12-01', 'D')

# ending date

p.asfreq('D', how='end')

[Out]: Period('2020-02-29', 'D')

# starting business date

p.asfreq('B', how='start')

[Out]: Period('2019-12-02', 'B')

# ending business date

p.asfreq('B', how='end')

[Out]: Period('2020-02-28', 'B')

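asfreq() calls can be chained, so all of the following are possible:

(a) select the quarter's ending business date --> (b) convert it to the starting (how='start') minute (freq='T' or freq='min'),

(c) select the quarter's ending business date --> (d) convert it to the ending minute,

(e) select the quarter's ending business date --> (f) convert it to the ending second.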

# (a) from ending Business date --> (b) to starting Minutes

p.asfreq('B', how='end').asfreq('T', how='start')

[Out]: Period('2020-02-28 00:00', 'T')

# (c) from ending Business date --> (d) to ending Minutes

p.asfreq('B', how='end').asfreq('T', how='end')

[Out]: Period('2020-02-28 23:59', 'T')

# (e) from Business date --> (f) to Seconds

p.asfreq('B', how='end').asfreq('S', how='end')

[Out]: Period('2020-02-28 23:59:59', 'S')
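The boundaries of the fiscal quarter can also be read directly from the Period object. The sketch below is a small supplement to the asfreq() calls above; it sticks to the calendar-day frequency 'D' and to Period attributes that I am confident exist (qyear, quarter, start_time, end_time).

```python
import pandas as pd

p = pd.Period('2020Q4', freq='Q-FEB')

# fiscal year and quarter number encoded in the period
fiscal_year, quarter = p.qyear, p.quarter

# first and last calendar day covered by the quarter
first_day = p.asfreq('D', how='start')
last_day = p.asfreq('D', how='end')

# the same boundaries expressed as timestamps
start_ts, end_ts = p.start_time, p.end_time

print(fiscal_year, quarter)
print(first_day, last_day)
print(start_ts, end_ts)
```

qyear returns the fiscal year (2020) even though the quarter starts in calendar year 2019, which matches the labeling in [Figure 1].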

(2) Creating quarterly period ranges (quarterly period range)

Just as the pandas date_range() function creates a DatetimeIndex of date-times, the pandas period_range('start', 'end', freq='Q-[ending-month]') function creates a quarterly period range. (For reference, freq='A-DEC' means a yearly period ending in December, and freq='Q-FEB' means a quarterly period whose fiscal year ends in February.)

The example below creates the quarterly periods from 2020-Q1 to 2020-Q4 (pd.period_range('2020Q1', '2020Q4')) with the fiscal year ending in February (freq='Q-FEB').

p_rng = pd.period_range('2020Q1', '2020Q4', freq='Q-FEB')

p_rng

[Out]:
PeriodIndex(['2020Q1', '2020Q2', '2020Q3', '2020Q4'], dtype='period[Q-FEB]', freq='Q-FEB')

Let's use the asfreq() method to convert the '2020-Q1' ~ '2020-Q4' periods created above (period with a Quarter ending at February) into their starting business date and ending business date.

# convert period into desired frequency using asfreq() methods

# starting business day per quarter 'Q-FEB'

p_rng.asfreq('B', how='start')

[Out]:

PeriodIndex(['2019-03-01', '2019-06-03', '2019-09-02', '2019-12-02'],

dtype='period[B]', freq='B')

# ending business day per quarter 'Q-FEB'

p_rng.asfreq('B', how='end')

[Out]:

PeriodIndex(['2019-05-31', '2019-08-30', '2019-11-29', '2020-02-28'],

dtype='period[B]', freq='B')

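Arithmetic operations are possible after converting a Period object to a frequency. Adding an integer shifts by that many units of the frequency, so the example below adds 1 business day to the ending business date of each of the 4 quarters ending in February (for example, Friday 2019-05-31 becomes Monday 2019-06-03).

# arithmetic operation: plus one business day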

p_rng.asfreq('B', how='end') + 1

[Out]:

PeriodIndex(['2019-06-03', '2019-09-02', '2019-12-02', '2020-03-02'],

dtype='period[B]', freq='B')

The example below first converts the period objects to the ending business date, converts that to an hourly frequency at the start of the day, and then adds 12 hours.

# period ending Business day, starting Hour

p_rng.asfreq('B', how='end').asfreq('H', how='start')

[Out]:

PeriodIndex(['2019-05-31 00:00', '2019-08-30 00:00', '2019-11-29 00:00',
             '2020-02-28 00:00'],
            dtype='period[H]', freq='H')

# plus 12 hours

p_12h_rng = p_rng.asfreq('B', how='end').asfreq('H', how='start') + 12

p_12h_rng

[Out]:

PeriodIndex(['2019-05-31 12:00', '2019-08-30 12:00', '2019-11-29 12:00',
             '2020-02-28 12:00'],
            dtype='period[H]', freq='H')
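Since adding an integer moves a PeriodIndex by whole units of its own frequency, the same + 1 behaves differently at the calendar-day frequency 'D' than at the business-day frequency 'B' used above. A minimal sketch for comparison; the names last_days and next_quarter_starts are illustrative.

```python
import pandas as pd

p_rng = pd.period_range('2020Q1', '2020Q4', freq='Q-FEB')

# last calendar day of each quarter
last_days = p_rng.asfreq('D', how='end')

# adding an integer shifts by that many units of the current frequency,
# here 1 calendar day, which lands on the first day of the next quarter
next_quarter_starts = last_days + 1

print(last_days)
print(next_quarter_starts)
```

At the 'D' frequency, 2019-05-31 + 1 is 2019-06-01 (a Saturday), whereas at the 'B' frequency the same operation skipped the weekend and gave 2019-06-03.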

(3) Converting between quarterly period and timestamp

(conversion between quarterly period and timestamp)

A DatetimeIndex created with pandas date_range() can be converted to a PeriodIndex with the to_period() method.

import pandas as pd

# generate dates range with 12 Months

ts = pd.date_range('2020-01-01', periods = 12, freq='M')

[Out]:
DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30',
               '2020-05-31', '2020-06-30', '2020-07-31', '2020-08-31',
               '2020-09-30', '2020-10-31', '2020-11-30', '2020-12-31'],
              dtype='datetime64[ns]', freq='M')

# convert from DatetimeIndex to PeriodIndex

p = ts.to_period()

[Out]:
PeriodIndex(['2020-01', '2020-02', '2020-03', '2020-04', '2020-05', '2020-06',
             '2020-07', '2020-08', '2020-09', '2020-10', '2020-11', '2020-12'],
            dtype='period[M]', freq='M')

Conversely, a PeriodIndex can be converted to a DatetimeIndex with the to_timestamp() method.

# convert from PeriodIndex to DatetimeIndex with starting month('M')

p.asfreq('B', how='end').asfreq('M', how='start').to_timestamp()

[Out]:

DatetimeIndex(['2020-01-01', '2020-02-01', '2020-03-01', '2020-04-01',
               '2020-05-01', '2020-06-01', '2020-07-01', '2020-08-01',
               '2020-09-01', '2020-10-01', '2020-11-01', '2020-12-01'],
              dtype='datetime64[ns]', freq='MS')

# convert from PeriodIndex to DatetimeIndex with ending minutes('T')

p.asfreq('B', how='end').asfreq('T', how='end').to_timestamp()

[Out]:

DatetimeIndex(['2020-01-31 23:59:00', '2020-02-28 23:59:00',
               '2020-03-31 23:59:00', '2020-04-30 23:59:00',
               '2020-05-29 23:59:00', '2020-06-30 23:59:00',
               '2020-07-31 23:59:00', '2020-08-31 23:59:00',
               '2020-09-30 23:59:00', '2020-10-30 23:59:00',
               '2020-11-30 23:59:00', '2020-12-31 23:59:00'],
              dtype='datetime64[ns]', freq='BM')
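The round trip between DatetimeIndex and PeriodIndex can also be written without the intermediate asfreq() calls by using the how argument of to_timestamp(). In this sketch the month-end dates are generated with the pd.offsets.MonthEnd() offset object instead of an alias string. That is my own choice to stay independent of which month-end alias a given pandas version accepts, not what the examples above use.

```python
import pandas as pd

# month-end timestamps for 2020, built with an offset object
ts = pd.date_range('2020-01-31', periods=12, freq=pd.offsets.MonthEnd())

# DatetimeIndex -> monthly PeriodIndex
p = ts.to_period('M')

# PeriodIndex -> DatetimeIndex at the first instant of each month
starts = p.to_timestamp(how='start')

# PeriodIndex -> last instant of each month, then time reset to midnight
ends = p.to_timestamp(how='end').normalize()

print(p[:3])
print(starts[:3])
print(ends[:3])
```

how='end' on its own returns the very last instant of the period, so normalize() is added here to get back plain month-end dates.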

(4) Aggregating by quarterly period (quarterly period group by aggregation)

Let's convert a simple monthly pandas Series into a Series with a quarterly PeriodIndex, and then aggregate the mean by quarter.

ts = pd.date_range('2020-01-01', periods = 12, freq='M')

ts_series = pd.Series(range(len(ts)), index=ts)

ts_series

[Out]:

2020-01-31     0
2020-02-29     1
2020-03-31     2
2020-04-30     3
2020-05-31     4
2020-06-30     5
2020-07-31     6
2020-08-31     7
2020-09-30     8
2020-10-31     9
2020-11-30    10
2020-12-31    11
Freq: M, dtype: int64

# convert from DatetimeIndex to Quarterly PeriodIndex

ts_series.index = ts.to_period(freq='Q-FEB')

ts_series

[Out]:

2020Q4     0
2020Q4     1
2021Q1     2
2021Q1     3
2021Q1     4
2021Q2     5
2021Q2     6
2021Q2     7
2021Q3     8
2021Q3     9
2021Q3    10
2021Q4    11

Freq: Q-FEB, dtype: int64

# quarterly groupby mean aggregation

ts_series.groupby(ts_series.index).mean()

[Out]:

2020Q4     0.5
2021Q1     3.0
2021Q2     6.0
2021Q3     9.0
2021Q4    11.0
Freq: Q-FEB, dtype: float64

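For reference, the example below aggregates quarterly means by downsampling with the resample() method. Compared with converting the frequency with to_period(freq='Q-FEB') and aggregating with groupby(), the index differs: resample() labels each bin with the quarter's end date, whereas to_period() labels it with the fiscal quarter, so the year shown is different (2020 vs. 2021).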

ts = pd.date_range('2020-01-01', periods = 12, freq='M')

ts_series = pd.Series(range(len(ts)), index=ts)

ts_series.resample('Q-FEB').mean()

[Out]:

2020-02-29     0.5
2020-05-31     3.0
2020-08-31     6.0
2020-11-30     9.0
2021-02-28    11.0
Freq: Q-FEB, dtype: float64

Setting the kind='period' option in resample gives the same result as the groupby on ts.to_period(freq='Q-FEB').

ts_series.resample('Q-FEB', kind='period').mean()

[Out]:

2020Q4     0.5
2021Q1     3.0
2021Q2     6.0
2021Q3     9.0
2021Q4    11.0
Freq: Q-FEB, dtype: float64
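The fiscal-quarter aggregation can also be written without replacing the Series index and without the kind option, by passing the quarterly PeriodIndex straight to groupby(). This is a sketch of an alternative, not the method used above. The month-end dates are built with pd.offsets.MonthEnd() as my own assumption, to avoid depending on a month-end alias string, and the name fiscal_quarter is illustrative.

```python
import pandas as pd

ts = pd.date_range('2020-01-31', periods=12, freq=pd.offsets.MonthEnd())
ts_series = pd.Series(range(len(ts)), index=ts)

# label every month with its fiscal quarter (fiscal year ending in February)
fiscal_quarter = ts_series.index.to_period('Q-FEB')

# group by the fiscal-quarter labels; the original index stays untouched
quarterly_mean = ts_series.groupby(fiscal_quarter).mean()

print(quarterly_mean)
```

The result carries the same 2020Q4 to 2021Q4 labels and the same means as the groupby example above, while ts_series keeps its DatetimeIndex.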

I hope this was helpful.


Posted by Rfriend

[Python pandas] Checking, setting, and converting time zones (time zone generation, localization, conversion)

Python Analysis and Programming / Python Data Preprocessing 2019. 12. 28. 18:13

This post introduces how to

(1) check time zones in Python

(2) set a time zone when generating date-times with date_range() in pandas

(time zone setting)

(3) convert from a naive state with no time zone information to a local time zone

(convert from naive timezone to localized timezone)

(4) convert a DatetimeIndex from one time zone to another

(convert from a timezone to another timezone)

(1) Checking time zones in Python

Programming with time zones in mind is a headache when work spans several countries and time zones, or in countries that apply daylight saving time (Daylight Saving Time, DST, in the US; Summer Time in the UK), for example the US, Canada, most European countries, and parts of Australia.

For this reason the UTC (Coordinated Universal Time, the successor to Greenwich Mean Time, GMT) time zone is widely used as the international standard for national/regional time zones. The map below shows time zones by country; the band in the middle of the map, passing through the Greenwich Observatory in the UK, is the UTC time zone.

For reference, Korea and Japan belong to the UTC+9 time zone (central Australia is at UTC+9:30).

[ Standard Time Zones of the World ]

* Source: https://en.wikipedia.org/wiki/Coordinated_Universal_Time#/media/File:World_Time_Zones_Map.png

In Python, time zone information can be looked up with the pytz library, and pandas wraps the pytz library to handle time zone information.

Let's look at the time zone names in Asia.

# time zone information

import pytz

# regular expression in Python

import re

# regular expression matching names that start with 'Asia'

pattern = re.compile(r'^Asia')

# list comprehension for selecting 'Asia****' time zones

tz_asia = [x for x in pytz.common_timezones if pattern.match(x)]

tz_asia

[Out]:

['Asia/Aden',
 'Asia/Almaty',
 'Asia/Amman',
 'Asia/Anadyr',
 'Asia/Aqtau',
 'Asia/Aqtobe',
 'Asia/Ashgabat',
 'Asia/Atyrau',
 'Asia/Baghdad',
 'Asia/Bahrain',
 'Asia/Baku',
 'Asia/Bangkok',
 'Asia/Barnaul',
 'Asia/Beirut',
 'Asia/Bishkek',
 'Asia/Brunei',
 'Asia/Chita',
 'Asia/Choibalsan',
 'Asia/Colombo',
 'Asia/Damascus',
 'Asia/Dhaka',
 'Asia/Dili',
 'Asia/Dubai',
 'Asia/Dushanbe',
 'Asia/Famagusta',
 'Asia/Gaza',
 'Asia/Hebron',
 'Asia/Ho_Chi_Minh',
 'Asia/Hong_Kong',
 'Asia/Hovd',
 'Asia/Irkutsk',
 'Asia/Jakarta',
 'Asia/Jayapura',
 'Asia/Jerusalem',
 'Asia/Kabul',
 'Asia/Kamchatka',
 'Asia/Karachi',
 'Asia/Kathmandu',
 'Asia/Khandyga',
 'Asia/Kolkata',
 'Asia/Krasnoyarsk',
 'Asia/Kuala_Lumpur',
 'Asia/Kuching',
 'Asia/Kuwait',
 'Asia/Macau',
 'Asia/Magadan',
 'Asia/Makassar',
 'Asia/Manila',
 'Asia/Muscat',
 'Asia/Nicosia',
 'Asia/Novokuznetsk',
 'Asia/Novosibirsk',
 'Asia/Omsk',
 'Asia/Oral',
 'Asia/Phnom_Penh',
 'Asia/Pontianak',
 'Asia/Pyongyang',
 'Asia/Qatar',
 'Asia/Qostanay',
 'Asia/Qyzylorda',
 'Asia/Riyadh',
 'Asia/Sakhalin',
 'Asia/Samarkand',
 'Asia/Seoul',
 'Asia/Shanghai',
 'Asia/Singapore',
 'Asia/Srednekolymsk',
 'Asia/Taipei',
 'Asia/Tashkent',
 'Asia/Tbilisi',
 'Asia/Tehran',
 'Asia/Thimphu',
 'Asia/Tokyo',
 'Asia/Tomsk',
 'Asia/Ulaanbaatar',
 'Asia/Urumqi',
 'Asia/Ust-Nera',
 'Asia/Vientiane',
 'Asia/Vladivostok',
 'Asia/Yakutsk',
 'Asia/Yangon',
 'Asia/Yekaterinburg',
 'Asia/Yerevan']

Below are the time zone details for Seoul (Korea), Singapore, Shanghai (China), and Tokyo (Japan).

# UTC: coordinated universal time

pytz.timezone('UTC')

[Out]: <UTC>

pytz.timezone('Asia/Seoul')

[Out]: <DstTzInfo 'Asia/Seoul' LMT+8:28:00 STD>

pytz.timezone('Asia/Singapore')

[Out]: <DstTzInfo 'Asia/Singapore' LMT+6:55:00 STD>

pytz.timezone('Asia/Shanghai')

[Out]: <DstTzInfo 'Asia/Shanghai' LMT+8:06:00 STD>

pytz.timezone('Asia/Tokyo')

[Out]: <DstTzInfo 'Asia/Tokyo' LMT+9:19:00 STD>
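The LMT+8:28:00 part shown in the pytz output above is, as far as I understand, the zone's earliest historical entry (local mean time) and not the offset in force today. To see the offset that applies at a particular moment, one option is to build a time zone aware pandas Timestamp and ask for its UTC offset. The date used below is just an example, and the names zones and offsets are illustrative.

```python
import pandas as pd

zones = ['Asia/Seoul', 'Asia/Singapore', 'Asia/Shanghai', 'Asia/Tokyo']

# UTC offset, in hours, that applies at the given moment in each zone
offsets = {}
for tz in zones:
    stamp = pd.Timestamp('2019-12-28 12:00', tz=tz)
    offsets[tz] = stamp.utcoffset().total_seconds() / 3600

print(offsets)
```

For this date, Seoul and Tokyo come out at +9 hours and Singapore and Shanghai at +8 hours.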

(2) Generating date ranges with a time zone (generate date ranges with time zone)

When generating a DatetimeIndex with the pandas date_range() function, the tz = 'time_zone_name' option sets the time zone. The example below generates 4 days of dates starting from 2019-12-28 in the 'Asia/Seoul' time zone.

import pandas as pd

ts_seoul = pd.date_range('2019-12-28', periods=4, freq='D', tz='Asia/Seoul')

ts_seoul

[Out]:

DatetimeIndex(['2019-12-28 00:00:00+09:00', '2019-12-29 00:00:00+09:00',
               '2019-12-30 00:00:00+09:00', '2019-12-31 00:00:00+09:00'],
              dtype='datetime64[ns, Asia/Seoul]', freq='D')

ts_seoul_series = pd.Series(range(len(ts_seoul)), index=ts_seoul)

ts_seoul_series.index.tz

[Out]:

<DstTzInfo 'Asia/Seoul' LMT+8:28:00 STD>

(3) Setting a local time zone from a naive state with no time zone

(convert from naive to localized time zone)

A DatetimeIndex generated with the pandas date_range() function is, by default, naive, meaning it has no time zone. To assign the time zone of a particular country/region to such a naive index, use the tz_localize('timezone_name') method.

# timezone-naive timestamps

ts_naive = pd.date_range('2019-12-28', periods=6, freq='D')

ts_naive

[Out]:

DatetimeIndex(['2019-12-28', '2019-12-29', '2019-12-30', '2019-12-31',
               '2020-01-01', '2020-01-02'],
              dtype='datetime64[ns]', freq='D')

# localize timezone to 'Asia/Seoul' using tz_localize() methods

ts_local_seoul = ts_naive.tz_localize('Asia/Seoul')

ts_local_seoul

[Out]:

DatetimeIndex(['2019-12-28 00:00:00+09:00', '2019-12-29 00:00:00+09:00',
               '2019-12-30 00:00:00+09:00', '2019-12-31 00:00:00+09:00',
               '2020-01-01 00:00:00+09:00', '2020-01-02 00:00:00+09:00'],
              dtype='datetime64[ns, Asia/Seoul]', freq='D')

Calling tz_convert('timezone_name') on a naive (time zone unaware) DatetimeIndex to set its time zone raises 'TypeError: Cannot convert tz-naive timestamps, use tz_localize to localize'.

# TypeError: Cannot convert tz-naive timestamps, use tz_localize to localize

ts_naive.tz_convert('Asia/Seoul')

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-69-270d3596ed05> in <module>
----> 1 ts_naive.tz_convert('Asia/Seoul')

--- (middle omitted) ---

~/anaconda3/envs/py3.6_tf2.0/lib/python3.6/site-packages/pandas/core/arrays/datetimes.py in tz_convert(self, tz)
    958             # tz naive, use tz_localize
    959             raise TypeError(
--> 960                 "Cannot convert tz-naive timestamps, use " "tz_localize to localize"
    961             )
    962 

TypeError: Cannot convert tz-naive timestamps, use tz_localize to localize
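One way to avoid the TypeError above is to check whether the index is naive before choosing between tz_localize() and tz_convert(). The helper to_timezone below is hypothetical (it is not part of pandas) and is only a sketch of that check.

```python
import pandas as pd

def to_timezone(idx, tz):
    """Hypothetical helper: localize a naive index, convert an aware one."""
    if idx.tz is None:
        # naive: attach a time zone, the wall-clock time stays the same
        return idx.tz_localize(tz)
    # aware: keep the same instants, show them in another time zone
    return idx.tz_convert(tz)

ts_naive = pd.date_range('2019-12-28', periods=3, freq='D')

ts_seoul = to_timezone(ts_naive, 'Asia/Seoul')   # goes through tz_localize
ts_utc = to_timezone(ts_seoul, 'UTC')            # goes through tz_convert

print(ts_seoul)
print(ts_utc)
```

Whether silently localizing a naive index is appropriate depends on knowing which time zone the naive values were recorded in, so in real code that assumption should be made explicit.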

(4) Converting from one time zone to another (convert from a time-zone to another one)

The example below uses the tz_convert('Asia/Singapore') method to convert the 'Asia/Seoul' time zone to the 'Asia/Singapore' time zone.

# timezone 'Asia/Seoul'

ts_seoul = pd.date_range('2019-12-28', periods=4, freq='D', tz='Asia/Seoul')

ts_seoul

[Out]:

DatetimeIndex(['2019-12-28 00:00:00+09:00', '2019-12-29 00:00:00+09:00',
               '2019-12-30 00:00:00+09:00', '2019-12-31 00:00:00+09:00'],
              dtype='datetime64[ns, Asia/Seoul]', freq='D')

# convert from 'Asia/Seoul' to 'Asia/Singapore' using tz_convert()

ts_singapore = ts_seoul.tz_convert('Asia/Singapore')

ts_singapore

[Out]:

DatetimeIndex(['2019-12-27 23:00:00+08:00', '2019-12-28 23:00:00+08:00',
               '2019-12-29 23:00:00+08:00', '2019-12-30 23:00:00+08:00'],
              dtype='datetime64[ns, Asia/Singapore]', freq='D')
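tz_convert() changes only how the timestamps are displayed; the underlying instants stay the same. The sketch below checks this for the example above. The names same_instants, same_in_utc and hours are illustrative.

```python
import pandas as pd

ts_seoul = pd.date_range('2019-12-28', periods=4, freq='D', tz='Asia/Seoul')
ts_singapore = ts_seoul.tz_convert('Asia/Singapore')

# time zone aware indexes compare by instant, not by wall-clock time
same_instants = bool((ts_seoul == ts_singapore).all())

# both indexes map to identical UTC timestamps
same_in_utc = ts_seoul.tz_convert('UTC').equals(ts_singapore.tz_convert('UTC'))

# wall-clock hour of the first element in each time zone
hours = (ts_seoul[0].hour, ts_singapore[0].hour)

print(same_instants, same_in_utc, hours)
```

Midnight in Seoul is 23:00 of the previous day in Singapore, yet both indexes describe the same moments.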

I hope this was helpful.


[ Python Timezone Names ]

import pytz

pytz.common_timezones

Africa	America
['Africa/Abidjan', 'Africa/Accra', 'Africa/Addis_Ababa', 'Africa/Algiers', 'Africa/Asmara', 'Africa/Bamako', 'Africa/Bangui', 'Africa/Banjul', 'Africa/Bissau', 'Africa/Blantyre', 'Africa/Brazzaville', 'Africa/Bujumbura', 'Africa/Cairo', 'Africa/Casablanca', 'Africa/Ceuta', 'Africa/Conakry', 'Africa/Dakar', 'Africa/Dar_es_Salaam', 'Africa/Djibouti', 'Africa/Douala', 'Africa/El_Aaiun', 'Africa/Freetown', 'Africa/Gaborone', 'Africa/Harare', 'Africa/Johannesburg', 'Africa/Juba', 'Africa/Kampala', 'Africa/Khartoum', 'Africa/Kigali', 'Africa/Kinshasa', 'Africa/Lagos', 'Africa/Libreville', 'Africa/Lome', 'Africa/Luanda', 'Africa/Lubumbashi', 'Africa/Lusaka', 'Africa/Malabo', 'Africa/Maputo', 'Africa/Maseru', 'Africa/Mbabane', 'Africa/Mogadishu', 'Africa/Monrovia', 'Africa/Nairobi', 'Africa/Ndjamena', 'Africa/Niamey', 'Africa/Nouakchott', 'Africa/Ouagadougou', 'Africa/Porto-Novo', 'Africa/Sao_Tome', 'Africa/Tripoli', 'Africa/Tunis', 'Africa/Windhoek']	['America/Adak', 'America/Anchorage', 'America/Anguilla', 'America/Antigua', 'America/Araguaina', 'America/Argentina/Buenos_Aires', 'America/Argentina/Catamarca', 'America/Argentina/Cordoba', 'America/Argentina/Jujuy', 'America/Argentina/La_Rioja', 'America/Argentina/Mendoza', 'America/Argentina/Rio_Gallegos', 'America/Argentina/Salta', 'America/Argentina/San_Juan', 'America/Argentina/San_Luis', 'America/Argentina/Tucuman', 'America/Argentina/Ushuaia', 'America/Aruba', 'America/Asuncion', 'America/Atikokan', 'America/Bahia', 'America/Bahia_Banderas', 'America/Barbados', 'America/Belem', 'America/Belize', 'America/Blanc-Sablon', 'America/Boa_Vista', 'America/Bogota', 'America/Boise', 'America/Cambridge_Bay', 'America/Campo_Grande', 'America/Cancun', 'America/Caracas', 'America/Cayenne', 'America/Cayman', 'America/Chicago', 'America/Chihuahua', 'America/Costa_Rica', 'America/Creston', 'America/Cuiaba', 'America/Curacao', 'America/Danmarkshavn', 'America/Dawson', 'America/Dawson_Creek', 'America/Denver', 'America/Detroit', 
'America/Dominica', 'America/Edmonton', 'America/Eirunepe', 'America/El_Salvador', 'America/Fort_Nelson', 'America/Fortaleza', 'America/Glace_Bay', 'America/Godthab', 'America/Goose_Bay', 'America/Grand_Turk', 'America/Grenada', 'America/Guadeloupe', 'America/Guatemala', 'America/Guayaquil', 'America/Guyana', 'America/Halifax', 'America/Havana', 'America/Hermosillo', 'America/Indiana/Indianapolis', 'America/Indiana/Knox', 'America/Indiana/Marengo', 'America/Indiana/Petersburg', 'America/Indiana/Tell_City', 'America/Indiana/Vevay', 'America/Indiana/Vincennes', 'America/Indiana/Winamac', 'America/Inuvik', 'America/Iqaluit', 'America/Jamaica', 'America/Juneau', 'America/Kentucky/Louisville', 'America/Kentucky/Monticello', 'America/Kralendijk', 'America/La_Paz', 'America/Lima', 'America/Los_Angeles', 'America/Lower_Princes', 'America/Maceio', 'America/Managua', 'America/Manaus', 'America/Marigot', 'America/Martinique', 'America/Matamoros', 'America/Mazatlan', 'America/Menominee', 'America/Merida', 'America/Metlakatla', 'America/Mexico_City', 'America/Miquelon', 'America/Moncton', 'America/Monterrey', 'America/Montevideo', 'America/Montserrat', 'America/Nassau', 'America/New_York', 'America/Nipigon', 'America/Nome', 'America/Noronha', 'America/North_Dakota/Beulah', 'America/North_Dakota/Center', 'America/North_Dakota/New_Salem', 'America/Ojinaga', 'America/Panama', 'America/Pangnirtung', 'America/Paramaribo', 'America/Phoenix', 'America/Port-au-Prince', 'America/Port_of_Spain', 'America/Porto_Velho', 'America/Puerto_Rico', 'America/Punta_Arenas', 'America/Rainy_River', 'America/Rankin_Inlet', 'America/Recife', 'America/Regina', 'America/Resolute', 'America/Rio_Branco', 'America/Santarem', 'America/Santiago', 'America/Santo_Domingo', 'America/Sao_Paulo', 'America/Scoresbysund', 'America/Sitka', 'America/St_Barthelemy', 'America/St_Johns', 'America/St_Kitts', 'America/St_Lucia', 'America/St_Thomas', 'America/St_Vincent', 'America/Swift_Current', 'America/Tegucigalpa', 
'America/Thule', 'America/Thunder_Bay', 'America/Tijuana', 'America/Toronto', 'America/Tortola', 'America/Vancouver', 'America/Whitehorse', 'America/Winnipeg', 'America/Yakutat', 'America/Yellowknife' ]
Antarctica
[ 'Antarctica/Casey', 'Antarctica/Davis', 'Antarctica/DumontDUrville', 'Antarctica/Macquarie', 'Antarctica/Mawson', 'Antarctica/McMurdo', 'Antarctica/Palmer', 'Antarctica/Rothera', 'Antarctica/Syowa', 'Antarctica/Troll', 'Antarctica/Vostok', 'Arctic/Longyearbyen']
Australia
[ 'Australia/Adelaide', 'Australia/Brisbane', 'Australia/Broken_Hill', 'Australia/Currie', 'Australia/Darwin', 'Australia/Eucla', 'Australia/Hobart', 'Australia/Lindeman', 'Australia/Lord_Howe', 'Australia/Melbourne', 'Australia/Perth', 'Australia/Sydney']
Canada
[ 'Canada/Atlantic', 'Canada/Central', 'Canada/Eastern', 'Canada/Mountain', 'Canada/Newfoundland', 'Canada/Pacific']
Europe
[ 'Europe/Amsterdam', 'Europe/Andorra', 'Europe/Astrakhan', 'Europe/Athens', 'Europe/Belgrade', 'Europe/Berlin', 'Europe/Bratislava', 'Europe/Brussels', 'Europe/Bucharest', 'Europe/Budapest', 'Europe/Busingen', 'Europe/Chisinau', 'Europe/Copenhagen', 'Europe/Dublin', 'Europe/Gibraltar', 'Europe/Guernsey', 'Europe/Helsinki', 'Europe/Isle_of_Man', 'Europe/Istanbul', 'Europe/Jersey', 'Europe/Kaliningrad', 'Europe/Kiev', 'Europe/Kirov', 'Europe/Lisbon', 'Europe/Ljubljana', 'Europe/London', 'Europe/Luxembourg', 'Europe/Madrid', 'Europe/Malta', 'Europe/Mariehamn', 'Europe/Minsk', 'Europe/Monaco', 'Europe/Moscow', 'Europe/Oslo', 'Europe/Paris', 'Europe/Podgorica', 'Europe/Prague', 'Europe/Riga', 'Europe/Rome', 'Europe/Samara', 'Europe/San_Marino', 'Europe/Sarajevo', 'Europe/Saratov', 'Europe/Simferopol', 'Europe/Skopje', 'Europe/Sofia', 'Europe/Stockholm', 'Europe/Tallinn', 'Europe/Tirane', 'Europe/Ulyanovsk', 'Europe/Uzhgorod', 'Europe/Vaduz', 'Europe/Vatican', 'Europe/Vienna', 'Europe/Vilnius', 'Europe/Volgograd', 'Europe/Warsaw', 'Europe/Zagreb', 'Europe/Zaporozhye', 'Europe/Zurich']
Pacific
['Pacific/Apia', 'Pacific/Auckland', 'Pacific/Bougainville', 'Pacific/Chatham', 'Pacific/Chuuk', 'Pacific/Easter', 'Pacific/Efate', 'Pacific/Enderbury', 'Pacific/Fakaofo', 'Pacific/Fiji', 'Pacific/Funafuti', 'Pacific/Galapagos', 'Pacific/Gambier', 'Pacific/Guadalcanal', 'Pacific/Guam', 'Pacific/Honolulu', 'Pacific/Kiritimati', 'Pacific/Kosrae', 'Pacific/Kwajalein', 'Pacific/Majuro', 'Pacific/Marquesas', 'Pacific/Midway', 'Pacific/Nauru', 'Pacific/Niue', 'Pacific/Norfolk', 'Pacific/Noumea', 'Pacific/Pago_Pago', 'Pacific/Palau', 'Pacific/Pitcairn', 'Pacific/Pohnpei', 'Pacific/Port_Moresby', 'Pacific/Rarotonga', 'Pacific/Saipan', 'Pacific/Tahiti', 'Pacific/Tarawa', 'Pacific/Tongatapu', 'Pacific/Wake', 'Pacific/Wallis']
US
[ 'US/Alaska', 'US/Arizona', 'US/Central', 'US/Eastern', 'US/Hawaii', 'US/Mountain', 'US/Pacific']
Indian
['Indian/Antananarivo', 'Indian/Chagos', 'Indian/Christmas', 'Indian/Cocos', 'Indian/Comoro', 'Indian/Kerguelen', 'Indian/Mahe', 'Indian/Maldives', 'Indian/Mauritius', 'Indian/Mayotte', 'Indian/Reunion']


License: Attribution, NonCommercial, NoDerivatives


Posted by Rfriend

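[Python] Parsing a text document into a DataFrame and creating a new column by lagging values within each group

Python Analysis and Programming / Python Data Preprocessing 2019. 12. 26. 19:56

In this post we will

(1) open a text file, read it line by line, and parse each line with string methods

--> to build a pandas DataFrame, and

(2) create a new column by shifting values down one row (lag) within each ID group.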

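Suppose we have a text file like the one below.

The 'color_range.txt' file

color_range.txt

In the first row, AAA means that 0 to 100 is region a and 100 to 200 is region b. Here a (red) and b (blue) denote colors: for AAA, 0 (inclusive) to 100 (exclusive) is red, 100 (inclusive) to 200 (exclusive) is blue, 200 (inclusive) to 300 (exclusive) is red, and so on.

The data for each ID is written out as one long row. The goal is to turn this file into a pandas DataFrame in which, for each ID ('AAA', 'BBB'), the start and end number of each color (a: red, b: blue) sit in their own columns and are easy to read.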

(1) Open the text file, read it line by line, and parse each line with string methods

--> build a pandas DataFrame

import os

import pandas as pd

# set file path
cwd = os.getcwd()
file_path = os.path.join(cwd, 'color_range.txt')

# read 'color_range.txt' file and parse it by id and value
df = pd.DataFrame()  # blank DataFrame to store results

# open the file; the with-block closes it once the loop is done
with open(file_path) as f:
    # parse the text line by line using a for loop
    for line in f:
        id_list = []
        color_list = []
        bin_list = []

        # remove leading/trailing white space and delete '"'
        line = line.strip()
        line = line.replace('"', '')

        # skip blank lines
        if not line:
            continue

        # get ID and VALUE from a line
        id = line[:3]
        val = line[4:]

        # turn the space before each color code into a comma separator
        val = val.replace(' a', ',a')
        val = val.replace(' b', ',b')

        # split the line using the separator ','
        val_split = val.split(sep=',')

        # get the 'ID', 'COLOR', 'BIN_END' values and append them to the lists
        # (bin_end is kept as a string here)
        for j in range(len(val_split)):
            id_list.append(id)
            color_list.append(val_split[j][:1])
            bin_list.append(val_split[j][2:])

        # make a temp DataFrame holding the ID, COLOR, BIN_END values of this line
        # note: the values are passed as lists; with bare scalars
        # pd.DataFrame() would raise an error asking for an index
        df_tmp = pd.DataFrame({'id': id_list,
                               'color_cd': color_list,
                               'bin_end': bin_list})

        # combine df and df_tmp one by one
        df = pd.concat([df, df_tmp], axis=0, ignore_index=True)

# let's check the df DataFrame
df

[Out]:

	id	color_cd	bin_end
0	AAA	a	100
1	AAA	b	200
2	AAA	a	300
3	AAA	b	400
4	BBB	a	250
5	BBB	b	350
6	BBB	a	450
7	BBB	b	550
8	BBB	a	650
9	BBB	b	750
10	BBB	a	800
11	BBB	b	910
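For comparison, the same parsing step can be sketched more compactly with a regular expression. The contents of 'color_range.txt' are not reproduced in this post, so the two sample lines below are an assumption reconstructed from the output table above (a quoted ID followed by quoted "color end" pairs). The helper name parse_line and the variable df_sketch are hypothetical, and this is only one possible approach, not the method used above.

```python
import re

import pandas as pd

# Hypothetical sample lines: the real file layout is assumed, not confirmed.
sample_lines = [
    '"AAA" "a 100 b 200 a 300 b 400"',
    '"BBB" "a 250 b 350 a 450 b 550 a 650 b 750 a 800 b 910"',
]


def parse_line(line):
    """Split one line into a list of {'id', 'color_cd', 'bin_end'} rows."""
    line = line.strip().replace('"', '')
    # everything before the first space is the ID, the rest holds the pairs
    row_id, _, rest = line.partition(' ')
    # each pair is a single lowercase letter, white space, then digits
    pairs = re.findall(r'([a-z])\s+(\d+)', rest)
    return [{'id': row_id, 'color_cd': c, 'bin_end': b} for c, b in pairs]


# collect all rows first, then build the DataFrame in one call
rows = [row for line in sample_lines for row in parse_line(line)]
df_sketch = pd.DataFrame(rows, columns=['id', 'color_cd', 'bin_end'])
print(df_sketch)
```

Collecting the rows in a plain list and calling pd.DataFrame() once avoids concatenating inside the loop. As in the code above, bin_end is kept as a string.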

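(2) Create a new column by shifting values down one row (lag) within each ID group

Grouping by 'id', we shift the 'bin_end' column down by one row (shift(1)) and fill the missing value in the first row of each group with 0 (fillna(0)).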

# lag 1 group by 'id' and fill missing value with '0'

df['bin_start'] = df.groupby('id')['bin_end'].shift(1).fillna(0)

[Out]:

	id	color_cd	bin_end	bin_start
0	AAA	a	100	0
1	AAA	b	200	100
2	AAA	a	300	200
3	AAA	b	400	300
4	BBB	a	250	0
5	BBB	b	350	250
6	BBB	a	450	350
7	BBB	b	550	450
8	BBB	a	650	550
9	BBB	b	750	650
10	BBB	a	800	750
11	BBB	b	910	800
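To see why the groupby matters here, the following minimal sketch compares a plain shift(1) with a grouped one. The toy DataFrame and its integer values are made up for illustration (the parsing code above stores bin_end as strings instead).

```python
import pandas as pd

# Toy data for illustration only; values are integers here.
toy = pd.DataFrame({
    'id': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB'],
    'bin_end': [100, 200, 300, 250, 350],
})

# A plain shift ignores group boundaries: the first 'BBB' row
# receives the last 'AAA' value.
plain = toy['bin_end'].shift(1)

# A grouped shift restarts in every group, so the first row of
# each id becomes NaN.
grouped = toy.groupby('id')['bin_end'].shift(1)

# Fill the per-group NaN with 0 and cast back to int.
toy['bin_start'] = grouped.fillna(0).astype(int)
print(toy)
```

Without the groupby, the first 'BBB' row would start at 300 instead of 0.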

Next, we map the color code ('color_cd') to a color name: 'a' is red and 'b' is blue.

# mapping color using color_cd
color_map = {'a': 'red',
             'b': 'blue'}

df['color'] = df['color_cd'].map(lambda x: color_map.get(x, x))

[Out]:

	id	color_cd	bin_end	bin_start	color
0	AAA	a	100	0	red
1	AAA	b	200	100	blue
2	AAA	a	300	200	red
3	AAA	b	400	300	blue
4	BBB	a	250	0	red
5	BBB	b	350	250	blue
6	BBB	a	450	350	red
7	BBB	b	550	450	blue
8	BBB	a	650	550	red
9	BBB	b	750	650	blue
10	BBB	a	800	750	red
11	BBB	b	910	800	blue
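The lambda with color_map.get(x, x) behaves differently from passing the dict directly to map() when a code is missing from the dict. A small sketch, in which the code 'c' is a hypothetical unmapped value that does not occur in the data above:

```python
import pandas as pd

color_map = {'a': 'red', 'b': 'blue'}

# 'c' is a hypothetical code with no entry in color_map
codes = pd.Series(['a', 'b', 'c'])

# passing the dict directly: unmapped codes become NaN
with_dict = codes.map(color_map)

# dict.get(x, x) as a fallback: unmapped codes are kept as they are
with_fallback = codes.map(lambda x: color_map.get(x, x))

print(with_dict.tolist())
print(with_fallback.tolist())
```

Which behavior is preferable depends on whether an unknown code should stand out as missing or pass through unchanged.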

For readability, we reorder the columns as 'id', 'color_cd', 'color', 'bin_start', 'bin_end'.

# change the sequence of columns

df = df[['id', 'color_cd', 'color', 'bin_start', 'bin_end']]

[Out]:

	id	color_cd	color	bin_start	bin_end
0	AAA	a	red	0	100
1	AAA	b	blue	100	200
2	AAA	a	red	200	300
3	AAA	b	blue	300	400
4	BBB	a	red	0	250
5	BBB	b	blue	250	350
6	BBB	a	red	350	450
7	BBB	b	blue	450	550
8	BBB	a	red	550	650
9	BBB	b	blue	650	750
10	BBB	a	red	750	800
11	BBB	b	blue	800	910

To make it clear that bin_start is included and bin_end is not, we create a new column called 'bin_range' by joining the inclusive sign ('['), 'bin_start', 'bin_end', and the exclusive sign (')').

# make a 'bin_range' column with an inclusive '[' and an exclusive ')' sign
# str(x) + ')' works whether x is a string or a number
df['bin_range'] = (df['bin_start'].apply(lambda x: '[' + str(x) + ',')
                   + df['bin_end'].apply(lambda x: str(x) + ')'))

[Out]:

	id	color_cd	color	bin_start	bin_end	bin_range
0	AAA	a	red	0	100	[0,100)
1	AAA	b	blue	100	200	[100,200)
2	AAA	a	red	200	300	[200,300)
3	AAA	b	blue	300	400	[300,400)
4	BBB	a	red	0	250	[0,250)
5	BBB	b	blue	250	350	[250,350)
6	BBB	a	red	350	450	[350,450)
7	BBB	b	blue	450	550	[450,550)
8	BBB	a	red	550	650	[550,650)
9	BBB	b	blue	650	750	[650,750)
10	BBB	a	red	750	800	[750,800)
11	BBB	b	blue	800	910	[800,910)
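The '[start,end)' notation corresponds to a left-closed interval, which pandas can also represent directly. The sketch below assumes numeric bin_start and bin_end values and uses a made-up bins DataFrame for the 'AAA' rows; it is an optional illustration of the inclusive/exclusive rule, not part of the steps above.

```python
import pandas as pd

# Made-up numeric copy of the 'AAA' rows, for illustration only.
bins = pd.DataFrame({
    'color': ['red', 'blue', 'red', 'blue'],
    'bin_start': [0, 100, 200, 300],
    'bin_end': [100, 200, 300, 400],
})

# closed='left' means: start is included, end is excluded
intervals = pd.IntervalIndex.from_arrays(
    bins['bin_start'].tolist(), bins['bin_end'].tolist(), closed='left'
)

# string form of each interval, similar in spirit to 'bin_range'
bins['bin_range'] = [str(iv) for iv in intervals]

# which color does a value fall into?
color_at_99 = bins.loc[intervals.contains(99), 'color'].tolist()
color_at_100 = bins.loc[intervals.contains(100), 'color'].tolist()

# a single interval shows the same rule
first = pd.Interval(0, 100, closed='left')
print(0 in first, 100 in first)
print(color_at_99, color_at_100)
```

Here the value 100 belongs to the second bin rather than the first, which matches the rule that bin_end is not included.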

I hope this post was helpful.


Posted by Rfriend
