
[Python pandas] GroupBy aggregation with multiple functions: grouped.agg()

Python Analysis and Programming / Python Data Preprocessing  2018. 9. 2. 20:03

The previous post covered the GroupBy aggregation methods and functions of Python pandas.

This post introduces several ways to apply multiple functions in a single GroupBy aggregation with grouped.agg().

(1) Applying several GroupBy aggregation functions to the same column with a list of functions

(2) Applying specific GroupBy aggregation functions per column with a dict that maps columns to functions

(3) Naming the aggregated columns with (name, function) tuples

[ Python pandas: GroupBy with multiple functions using lists, Dicts, tuples ]

The example data is the Abalone data set from the UCI Machine Learning Repository. It records measurements of 4,177 abalones, including length, diameter, height, whole weight, and shell weight.

[ UCI Machine Learning Repository ]

Abalone Data Set description: http://archive.ics.uci.edu/ml/datasets/Abalone
Abalone Data Set: http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data
Variables

Name	Data Type	Meas.	Description
Sex	nominal		M, F, and I (infant)
Length	continuous	mm	Longest shell measurement
Diameter	continuous	mm	perpendicular to length
Height	continuous	mm	with meat in shell
Whole weight	continuous	grams	whole abalone
Shucked weight	continuous	grams	weight of meat
Viscera weight	continuous	grams	gut weight (after bleeding)
Shell weight	continuous	grams	after being dried
Rings	integer		+1.5 gives the age in years

Let's download the Abalone data set as a CSV file from the UCI Machine Learning Repository website and load it into a pandas DataFrame.

# Importing common libraries
import numpy as np
import pandas as pd

# Import the Abalone data set directly from the UCI Machine Learning Repository.
# pd.read_csv() accepts a URL, so the Python 2-only urllib2 module is not needed.
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'

abalone = pd.read_csv(url,
                      names=['sex', 'length', 'diameter', 'height',
                             'whole_weight', 'shucked_weight', 'viscera_weight',
                             'shell_weight', 'rings'],
                      header=None)

abalone.head()

	sex	length	diameter	height	whole_weight	shucked_weight	viscera_weight	shell_weight	rings
0	M	0.455	0.365	0.095	0.5140	0.2245	0.1010	0.150	15
1	M	0.350	0.265	0.090	0.2255	0.0995	0.0485	0.070	7
2	F	0.530	0.420	0.135	0.6770	0.2565	0.1415	0.210	9
3	M	0.440	0.365	0.125	0.5160	0.2155	0.1140	0.155	10
4	I	0.330	0.255	0.080	0.2050	0.0895	0.0395	0.055	7

To have a second grouping key for the GroupBy examples, let's create one more categorical variable, 'length_cat', that records whether 'length' is greater than its median or not.

abalone['length_cat'] = np.where(abalone.length > np.median(abalone.length),
                                 'length_long',   # True
                                 'length_short')  # False

abalone[['length', 'length_cat']][:10]

	length	length_cat
0	0.455	length_short
1	0.350	length_short
2	0.530	length_short
3	0.440	length_short
4	0.330	length_short
5	0.425	length_short
6	0.530	length_short
7	0.545	length_short
8	0.475	length_short
9	0.550	length_long

(1) Applying several GroupBy aggregation functions to the same column with a list of functions

The two categorical variables 'sex' (levels 'F', 'I', 'M') and 'length_cat' (levels 'length_short', 'length_long') serve as the grouping keys, and the GroupBy aggregation functions are applied to the continuous variable 'whole_weight'.

grouped_ww = abalone.groupby(['sex', 'length_cat'])['whole_weight']

grouped_ww

<pandas.core.groupby.SeriesGroupBy object at 0x10a7e0290>

First, as a review, here are the two ways to aggregate with a single function that the previous post introduced: (1) call a GroupBy method, or (2) call grouped.agg(function). Applying a single aggregation function returns a Series.

(Method 1) GroupBy methods

(Method 2) grouped.agg(function)

grouped_ww.mean() # Series

sex  length_cat  
F    length_long     1.261330
     length_short    0.589702
I    length_long     0.923215
     length_short    0.351234
M    length_long     1.255182
     length_short    0.538157
Name: whole_weight, dtype: float64

grouped_ww.agg('mean') # Series

sex  length_cat  
F    length_long     1.261330
     length_short    0.589702
I    length_long     0.923215
     length_short    0.351234
M    length_long     1.255182
     length_short    0.538157
Name: whole_weight, dtype: float64

Now let's aggregate with multiple functions. The first approach passes a list of function names, as strings, to grouped.agg(). When several aggregation functions are applied like this, the result is a DataFrame.

grouped_ww.agg(['size', 'mean', 'std', 'min', 'max']) # DataFrame

		size	mean	std	min	max
sex	length_cat
F	length_long	889	1.261330	0.329656	0.6405	2.6570
F	length_short	418	0.589702	0.202400	0.0800	1.3580
I	length_long	188	0.923215	0.218334	0.5585	2.0495
I	length_short	1154	0.351234	0.204237	0.0020	1.0835
M	length_long	966	1.255182	0.354682	0.5990	2.8255
M	length_short	562	0.538157	0.246498	0.0155	1.2825

function_list = ['size', 'mean', 'std', 'min', 'max']

grouped_ww.agg(function_list)

		size	mean	std	min	max
sex	length_cat
F	length_long	889	1.261330	0.329656	0.6405	2.6570
F	length_short	418	0.589702	0.202400	0.0800	1.3580
I	length_long	188	0.923215	0.218334	0.5585	2.0495
I	length_short	1154	0.351234	0.204237	0.0020	1.0835
M	length_long	966	1.255182	0.354682	0.5990	2.8255
M	length_short	562	0.538157	0.246498	0.0155	1.2825
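The list-of-functions pattern can be tried without downloading anything. The following is a minimal sketch on a made-up toy table (the values are hypothetical and only the column names follow the abalone example):

```python
import pandas as pd

# Hypothetical toy data standing in for the abalone table.
toy = pd.DataFrame({
    'sex': ['F', 'F', 'M', 'M', 'M'],
    'whole_weight': [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Passing a list of function names returns a DataFrame
# with one column per function.
summary = toy.groupby('sex')['whole_weight'].agg(['size', 'mean', 'min', 'max'])
print(summary)
```

Each name in the list becomes a column of the result, in the order given.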

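Of course, several functions can also be applied to multiple columns. The example below applies a list of GroupBy aggregation functions to two columns, 'whole_weight' and 'shell_weight'.

grouped = abalone.groupby(['sex', 'length_cat'])

function_list = ['size', 'mean', 'std']

# Select several columns with a list (double brackets); recent pandas versions no longer accept grouped['a', 'b'].
groupby_result = grouped[['whole_weight', 'shell_weight']].agg(function_list)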

groupby_result

		whole_weight			shell_weight
		size	mean	std	size	mean	std
sex	length_cat
F	length_long	889	1.261330	0.329656	889	0.360013	0.104014
F	length_short	418	0.589702	0.202400	418	0.178650	0.063085
I	length_long	188	0.923215	0.218334	188	0.273247	0.064607
I	length_short	1154	0.351234	0.204237	1154	0.104549	0.061003
M	length_long	966	1.255182	0.354682	966	0.351683	0.102636
M	length_short	562	0.538157	0.246498	562	0.162141	0.075629

Because the GroupBy aggregation result is a pandas DataFrame, the usual DataFrame indexing techniques apply to it. For example, let's select by column only the results for 'shell_weight' from the aggregated DataFrame groupby_result.

groupby_result['shell_weight']

		size	mean	std
sex	length_cat
F	length_long	889	0.360013	0.104014
F	length_short	418	0.178650	0.063085
I	length_long	188	0.273247	0.064607
I	length_short	1154	0.104549	0.061003
M	length_long	966	0.351683	0.102636
M	length_short	562	0.162141	0.075629

groupby_result['shell_weight'][['size', 'mean']]

		size	mean
sex	length_cat
F	length_long	889	0.360013
F	length_short	418	0.178650
I	length_long	188	0.273247
I	length_short	1154	0.104549
M	length_long	966	0.351683
M	length_short	562	0.162141

The aggregated DataFrame can also be indexed by row, with DataFrame.loc[], exactly as for any other DataFrame.

groupby_result.loc['M']

	whole_weight			shell_weight
	size	mean	std	size	mean	std
length_cat
length_long	966	1.255182	0.354682	966	0.351683	0.102636
length_short	562	0.538157	0.246498	562	0.162141	0.075629

groupby_result.loc['M', 'shell_weight']

	size	mean	std
length_cat
length_long	966	0.351683	0.102636
length_short	562	0.162141	0.075629

(2) Applying specific GroupBy aggregation functions per column with a dict that maps columns to functions

First, let's define user-defined functions that compute the range and the IQR (Inter-Quartile Range), and then pass them to grouped.agg().

def range_func(x):
    max_val = np.max(x)
    min_val = np.min(x)
    range_val = max_val - min_val
    return range_val

def iqr_func(x):
    q3, q1 = np.percentile(x, [75, 25])
    iqr = q3 - q1
    return iqr

Now let's use a dict to map the size(), mean(), and std() methods to the 'whole_weight' column and the user-defined functions range_func and iqr_func to the 'shell_weight' column.

Built-in methods such as size(), mean(), and std() are passed to grouped.agg() by name as strings, hence the single quotes ('method_name'); user-defined functions (UDFs) are passed as the function objects themselves, without quotes.

grouped.agg({'whole_weight': ['size', 'mean', 'std'],   # put method's name as a string
             'shell_weight': [range_func, iqr_func]})   # UDF name

		whole_weight			shell_weight
		size	mean	std	range_func	iqr_func
sex	length_cat
F	length_long	889	1.261330	0.329656	0.850	0.127000
F	length_short	418	0.589702	0.202400	0.378	0.080500
I	length_long	188	0.923215	0.218334	0.485	0.067875
I	length_short	1154	0.351234	0.204237	0.349	0.092750
M	length_long	966	1.255182	0.354682	0.776	0.124000
M	length_short	562	0.538157	0.246498	0.375	0.102750

(3) Naming the aggregated columns with (name, function) tuples

In (2) above, the dict applied the user-defined functions range_func and iqr_func to shell_weight, but the resulting columns are simply named 'range_func' and 'iqr_func', which is not very readable. In that case, (name, function) tuples let you give each aggregated column a name of your choice.

The example below renames 'range_func' to 'Range' and 'iqr_func' to 'Inter-Quartile_Range' to make the result easier to read.

# (name, function) tuples

grouped.agg({'whole_weight': ['size', 'mean', 'std'],
             'shell_weight': [('Range', range_func),                  # (name, function) tuple
                              ('Inter-Quartile_Range', iqr_func)]})   # (name, function) tuple

		whole_weight			shell_weight
		size	mean	std	Range	Inter-Quartile_Range
sex	length_cat
F	length_long	889	1.261330	0.329656	0.850	0.127000
F	length_short	418	0.589702	0.202400	0.378	0.080500
I	length_long	188	0.923215	0.218334	0.485	0.067875
I	length_short	1154	0.351234	0.204237	0.349	0.092750
M	length_long	966	1.255182	0.354682	0.776	0.124000
M	length_short	562	0.538157	0.246498	0.375	0.102750
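As an alternative to (name, function) tuples, newer pandas versions (0.25 and later, as far as I know) also offer "named aggregation", where each keyword argument is an output column name. This is not the approach used in this post, just a sketch of the same renaming idea on hypothetical toy data:

```python
import numpy as np
import pandas as pd

def range_func(x):
    return np.max(x) - np.min(x)

# Hypothetical toy data; column names follow the abalone example.
toy = pd.DataFrame({
    'sex': ['F', 'F', 'M', 'M'],
    'shell_weight': [1.0, 4.0, 2.0, 2.5],
})

# Named aggregation: output_name=(input_column, function).
named = toy.groupby('sex').agg(
    n=('shell_weight', 'count'),
    Range=('shell_weight', range_func),
)
print(named)
```

The result has flat column names ('n', 'Range') instead of a two-level column index. Keyword names must be valid Python identifiers, so a name such as 'Inter-Quartile_Range' would have to be passed by unpacking a dict.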

I hope this was helpful.

License: Attribution, NonCommercial, NoDerivatives

Posted by Rfriend

[Python pandas] GroupBy aggregation methods and functions

Python Analysis and Programming / Python Data Preprocessing  2018. 9. 2. 14:56

The previous post introduced four ways to specify the groups of a GroupBy by row or by column: Dicts, Series, Functions, and Index Levels.

This post introduces the GroupBy aggregation methods and functions that compute descriptive statistics of continuous variables in Python pandas.

(1) GroupBy aggregation using methods: e.g. grouped.sum()

(2) GroupBy aggregation using functions: grouped.agg(function)

[ Python pandas GroupBy aggregation methods and functions ]

There are two ways to do a GroupBy aggregation in pandas: (1) use the descriptive-statistics methods built into pandas, or (2) pass a (user-defined) function as grouped.agg(function). The GroupBy methods are optimized and run faster than custom functions, so prefer a method whenever one exists for the per-group statistic you need.

NA values are ignored; the GroupBy methods are applied to the non-NA values only.

None of these descriptive statistics is difficult, so treat this post as a light breather. Let's build a simple example DataFrame and create a GroupBy object on its 'group' variable.

# Importing common libraries

import numpy as np

import pandas as pd

# sample DataFrame

df = pd.DataFrame({'group': ['a', 'a', 'a', 'b', 'b', 'b'],

'value_1': np.arange(6),

'value_2': np.random.randn(6)})

df

	group	value_1	value_2
0	a	0	-1.739302
1	a	1	0.851955
2	a	2	0.874874
3	b	3	-0.461543
4	b	4	0.880763
5	b	5	-0.346675

# Making GroupBy object

grouped = df.groupby('group')

grouped

<pandas.core.groupby.DataFrameGroupBy object at 0x11136f550>

(1) GroupBy aggregation using methods

(1-1) count(), sum()

count(): number of non-NA values in each group

sum(): sum of the non-NA values in each group

grouped.count()

	value_1	value_2
group
a	3	3
b	3	3

grouped.sum() # DataFrame

	value_1	value_2
group
a	3	-0.012473
b	12	0.072545

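*cf. grouped.size() returns the number of rows in each group as a Series and counts NA values too, so it matches grouped.count() only when a column has no NA values, as in this example.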
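The difference between size() and count() only shows up when NA values are present. A minimal sketch with a hypothetical three-row table:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in group 'a'.
df_na = pd.DataFrame({'group': ['a', 'a', 'b'],
                      'value': [1.0, np.nan, 3.0]})
grouped_na = df_na.groupby('group')

sizes = grouped_na.size()    # Series: rows per group, NA included
counts = grouped_na.count()  # DataFrame: non-NA values per column

print(sizes)
print(counts)
```

Group 'a' has two rows but only one non-NA value, so the two results differ for it.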

In the example above, 'value_1' and 'value_2' are both numeric, so pandas applied count() and sum() to them, and the result is a DataFrame.

To summarize only one variable by group, index that variable; aggregating a single variable returns a Series. To turn that Series into a DataFrame, wrap the result in pd.DataFrame().

grouped.sum()['value_2'] # Series

group
a   -0.012473
b    0.072545
Name: value_2, dtype: float64

pd.DataFrame(grouped.sum()['value_2']) # DataFrame

	value_2
group
a	-0.012473
b	0.072545
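Besides pd.DataFrame(), there are two other common ways to get a one-column DataFrame. The sketch below uses hypothetical fixed values instead of the random numbers above:

```python
import pandas as pd

df_fixed = pd.DataFrame({'group': ['a', 'a', 'b'],
                         'value_1': [0, 1, 2],
                         'value_2': [1.5, 2.5, 4.0]})
grouped_fixed = df_fixed.groupby('group')

as_series = grouped_fixed['value_2'].sum()                # Series
via_to_frame = grouped_fixed['value_2'].sum().to_frame()  # DataFrame
via_brackets = grouped_fixed[['value_2']].sum()           # DataFrame

print(via_to_frame)
```

Selecting the column before aggregating (grouped['value_2'].sum()) also avoids summing columns that are not needed.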

(1-2) Minimum and maximum: min(), max()

min(): minimum of the non-NA values in each group

max(): maximum of the non-NA values in each group

grouped.min()

	value_1	value_2
group
a	0	-1.739302
b	3	-0.461543

grouped.max()

	value_1	value_2
group
a	2	0.874874
b	5	0.880763

(1-3) Central tendency: mean(), median()

mean(): mean of the non-NA values in each group

median(): median of the non-NA values in each group

grouped.mean()

	value_1	value_2
group
a	1	-0.004158
b	4	0.024182

grouped.median()

	value_1	value_2
group
a	1	0.851955
b	4	-0.346675

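※ To add a prefix to the names of the aggregated variables, use add_prefix().

e.g. df.groupby('key_col').mean().add_prefix('mean_')

(1-4) Dispersion: std(), var(), quantile()

The standard deviation and variance are computed with n-1 degrees of freedom, so they are the sample standard deviation and the sample variance.

Pass quantile() a value between 0 and 1 to get that quantile; by default the result is linearly interpolated between the two nearest data points.

std(): standard deviation within each group

var(): variance within each group

quantile(): quantile within each group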

grouped.std()

	value_1	value_2
group
a	1.0	1.502723
b	1.0	0.744042

grouped.var()

	value_1	value_2
group
a	1	2.258176
b	1	0.553598

# interpolation

grouped.quantile(0.1)

0.1	value_1	value_2
group
a	0.2	-1.221051
b	3.2	-0.438569
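The value 0.2 for group 'a' can be reproduced by hand. This sketch assumes the default linear interpolation and recomputes the 0.1 quantile of the sorted values [0, 1, 2]:

```python
import pandas as pd

s = pd.Series([0, 1, 2])          # value_1 of group 'a', already sorted
q = s.quantile(0.1)               # pandas result

# Manual linear interpolation between the two nearest data points.
pos = 0.1 * (len(s) - 1)          # fractional position in the sorted data
lower = int(pos)                  # index of the data point just below
manual = s.iloc[lower] + (pos - lower) * (s.iloc[lower + 1] - s.iloc[lower])

print(q, manual)
```

The fractional position is 0.1 * (3 - 1) = 0.2, which lies between the first two values 0 and 1, hence 0 + 0.2 * (1 - 0) = 0.2.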

(1-5) first(), last()

first(): first non-NA value in each group

last(): last non-NA value in each group

grouped.first()

	value_1	value_2
group
a	0	-1.739302
b	3	-0.461543

grouped.last()

	value_1	value_2
group
a	2	0.874874
b	5	-0.346675

(1-6) describe()

describe(): descriptive statistics per group, laid out wide (one row per group)

describe().T: descriptive statistics per group, laid out long (one column per group)

grouped.describe()['value_1']

	count	mean	std	min	25%	50%	75%	max
group
a	3.0	1.0	1.0	0.0	0.5	1.0	1.5	2.0
b	3.0	4.0	1.0	3.0	3.5	4.0	4.5	5.0

grouped.describe()['value_1'].T

group	a	b
count	3.0	3.0
mean	1.0	4.0
std	1.0	1.0
min	0.0	3.0
25%	0.5	3.5
50%	1.0	4.0
75%	1.5	4.5
max	2.0	5.0

(2) GroupBy aggregation using functions: grouped.agg(function)

If the aggregation you need is not among the pandas GroupBy methods, you can define your own function and use it. Let's define the IQR (Inter-Quartile Range, Q3 - Q1) as a user-defined function and pass it to grouped.aggregate() or grouped.agg() to compute the IQR per group.

def iqr_func(x):
    q3, q1 = np.percentile(x, [75, 25])
    iqr = q3 - q1
    return iqr

The two spellings grouped.aggregate(function) and grouped.agg(function) are equivalent:

grouped.aggregate(iqr_func)

	value_1	value_2
group
a	1	1.307088
b	1	0.671153

grouped.agg(iqr_func)

	value_1	value_2
group
a	1	1.307088
b	1	0.671153

Checking the result of the user-defined function against the per-group Q3 and Q1 from the quantile() method confirms that grouped.agg(iqr_func) computed the IQR correctly.

grouped.quantile([0.75, 0.25])

		value_1	value_2
group
a	0.75	1.5	0.863414
a	0.25	0.5	-0.443674
b	0.75	4.5	0.267044
b	0.25	3.5	-0.404109
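The same check can be done in code rather than by eye. A minimal sketch on deterministic data (only the 'value_1' column, since 'value_2' above is random):

```python
import numpy as np
import pandas as pd

df_check = pd.DataFrame({'group': ['a', 'a', 'a', 'b', 'b', 'b'],
                         'value_1': np.arange(6)})
grouped_check = df_check.groupby('group')

def iqr_func(x):
    q3, q1 = np.percentile(x, [75, 25])
    return q3 - q1

# IQR from the user-defined function ...
iqr_udf = grouped_check['value_1'].agg(iqr_func)
# ... and from the built-in quantile() method.
iqr_quantile = (grouped_check['value_1'].quantile(0.75)
                - grouped_check['value_1'].quantile(0.25))

print(iqr_udf)
print(iqr_quantile)
```

Both give 1.0 for each group, matching the value_1 column of the tables above.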

The next post will introduce more ways to use grouped.agg().

I hope this was helpful.


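[Python pandas] Various ways to group for GroupBy aggregation: Dicts, Series, Lists, Functions, Index Levels

Python Analysis and Programming / Python Data Preprocessing  2018. 9. 1. 19:32

This post introduces four ways to define the groups of a GroupBy aggregation in Python pandas.

(1) GroupBy aggregation using Dicts

(2) GroupBy aggregation using Series

(3) GroupBy aggregation using Functions

(4) GroupBy aggregation using Index Levels

The default is axis = 0, which groups rows (row-wise aggregation); setting axis = 1 groups columns (column-wise aggregation).

[ Four ways of GroupBy aggregation in pandas ]

(1) GroupBy aggregation using Dicts

Let's build a simple DataFrame to use as an example.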

# importing libraries

import numpy as np

import pandas as pd

from pandas import DataFrame

from pandas import Series

# making sample dataset

df = DataFrame(data=np.arange(20).reshape(4, 5),

columns = ['c1', 'c2', 'c3', 'c4', 'c5'],

index = ['r1', 'r2', 'r3', 'r4'])

df

	c1	c2	c3	c4	c5
r1	0	1	2	3	4
r2	5	6	7	8	9
r3	10	11	12	13	14
r4	15	16	17	18	19

Next, let's aggregate with Dicts, first by row (row-wise) and then by column (column-wise).

(1-1) Row-wise GroupBy aggregation using Dicts (axis = 0)

mapping_dict_row = {'r1': 'row_g1',

'r2': 'row_g1',

'r3': 'row_g2',

'r4': 'row_g2'}

grouped_by_row = df.groupby(mapping_dict_row)

grouped_by_row.sum()

	c1	c2	c3	c4	c5
row_g1	5	7	9	11	13
row_g2	25	27	29	31	33

(1-2) Column-wise GroupBy aggregation using Dicts (axis = 1)

mapping_dict_col = {'c1': 'col_g1',

'c2': 'col_g1',

'c3': 'col_g2',

'c4': 'col_g2',

'c5': 'col_g2'}

grouped_by_col = df.groupby(mapping_dict_col, axis=1)

grouped_by_col.sum()

	col_g1	col_g2
r1	1	9
r2	11	24
r3	21	39
r4	31	54
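In recent pandas releases the axis=1 argument of groupby() is, to my knowledge, deprecated (since 2.1). A version-independent sketch of the same column-wise sum is to transpose, group the rows, and transpose back. This is an alternative, not the code used in this post:

```python
import numpy as np
import pandas as pd

df_cols = pd.DataFrame(np.arange(20).reshape(4, 5),
                       columns=['c1', 'c2', 'c3', 'c4', 'c5'],
                       index=['r1', 'r2', 'r3', 'r4'])

mapping_dict_col = {'c1': 'col_g1', 'c2': 'col_g1',
                    'c3': 'col_g2', 'c4': 'col_g2', 'c5': 'col_g2'}

# Transpose so the columns become rows, group them, then transpose back.
col_sums = df_cols.T.groupby(mapping_dict_col).sum().T
print(col_sums)
```

The result matches the table above: for example, r1 gives 0 + 1 = 1 for col_g1 and 2 + 3 + 4 = 9 for col_g2.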

Series and Lists can be used for GroupBy aggregation in much the same way as Dicts.

(2) GroupBy aggregation using Series

(2-1) Row-wise GroupBy aggregation using Series (axis = 0)

mapping_series_row = Series(mapping_dict_row)

mapping_series_row

r1    row_g1
r2    row_g1
r3    row_g2
r4    row_g2
dtype: object

df.groupby(mapping_series_row).sum()

	c1	c2	c3	c4	c5
row_g1	5	7	9	11	13
row_g2	25	27	29	31	33

(2-2) Column-wise GroupBy aggregation using Series (axis = 1)

mapping_series_col = Series(mapping_dict_col)

mapping_series_col

c1    col_g1
c2    col_g1
c3    col_g2
c4    col_g2
c5    col_g2
dtype: object

df.groupby(mapping_series_col, axis=1).sum()

	col_g1	col_g2
r1	1	9
r2	11	24
r3	21	39
r4	31	54

df.groupby(mapping_series_col, axis=1).mean()

	col_g1	col_g2
r1	0.5	3.0
r2	5.5	8.0
r3	10.5	13.0
r4	15.5	18.0

(2-3) Column-wise GroupBy aggregation using Lists (axis = 1)

A List gives the same aggregation result as (2-2).

mapping_list_col = ['col_g1', 'col_g1', 'col_g2', 'col_g2', 'col_g2'] # lists

df.groupby(mapping_list_col, axis=1).mean()

	col_g1	col_g2
r1	0.5	3.0
r2	5.5	8.0
r3	10.5	13.0
r4	15.5	18.0

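Unlike a Dict, a List used as the grouping key must have exactly as many elements as there are rows or columns on the grouped axis, and a Series should cover the same labels. When the number of elements in the List differs from the number of labels in the DataFrame, a 'KeyError' is raised.

In the example below, the list mapping_list_col_2 has 6 elements, whereas the DataFrame df has only 5 columns, so a KeyError is raised.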

mapping_list_col_2 = ['col_g1', 'col_g1', 'col_g2', 'col_g2', 'col_g2', 'col_g2']

df.groupby(mapping_list_col_2, axis=1).mean()

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-31-76bb72a2996c> in <module>()
----> 1 df.groupby(mapping_list_col_2, axis=1).mean()

/Users/ihongdon/anaconda3/envs/py2.7_tf1.4/lib/python2.7/site-packages/pandas/core/generic.pyc in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, **kwargs)
   4414         return groupby(self, by=by, axis=axis, level=level, as_index=as_index,
   4415                        sort=sort, group_keys=group_keys, squeeze=squeeze,
-> 4416                        **kwargs)
   4417 
   4418     def asfreq(self, freq, method=None, how=None, normalize=False,

/Users/ihongdon/anaconda3/envs/py2.7_tf1.4/lib/python2.7/site-packages/pandas/core/groupby.pyc in groupby(obj, by, **kwds)
   1697         raise TypeError('invalid type: %s' % type(obj))
   1698 
-> 1699     return klass(obj, by, **kwds)
   1700 
   1701 

/Users/ihongdon/anaconda3/envs/py2.7_tf1.4/lib/python2.7/site-packages/pandas/core/groupby.pyc in __init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, **kwargs)
    390                                                     level=level,
    391                                                     sort=sort,
--> 392                                                     mutated=self.mutated)
    393 
    394         self.obj = obj

/Users/ihongdon/anaconda3/envs/py2.7_tf1.4/lib/python2.7/site-packages/pandas/core/groupby.pyc in _get_grouper(obj, key, axis, level, sort, mutated)
   2688                 in_axis, name, level, gpr = False, None, gpr, None
   2689             else:
-> 2690                 raise KeyError(gpr)
   2691         elif isinstance(gpr, Grouper) and gpr.key is not None:
   2692             # Add key to exclusions

KeyError: 'col_g1'

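(3) GroupBy aggregation using Functions

Interestingly, a function can also be passed to groupby(); it is called on each index label, and its return value becomes the group. The example below defines a user-defined function row_grp_func() and uses it for a GroupBy sum.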

df

	c1	c2	c3	c4	c5
r1	0	1	2	3	4
r2	5	6	7	8	9
r3	10	11	12	13	14
r4	15	16	17	18	19

def row_grp_func(x):
    if x == 'r1' or x == 'r2':
        row_group = 'row_g1'
    else:
        row_group = 'row_g2'
    return row_group

df.groupby(row_grp_func).sum()

	c1	c2	c3	c4	c5
row_g1	5	7	9	11	13
row_g2	25	27	29	31	33
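Any callable that takes an index label works here, so a short lambda can stand in for row_grp_func(). A self-contained sketch of the same grouping:

```python
import numpy as np
import pandas as pd

df_rows = pd.DataFrame(np.arange(20).reshape(4, 5),
                       columns=['c1', 'c2', 'c3', 'c4', 'c5'],
                       index=['r1', 'r2', 'r3', 'r4'])

# The function receives each index label ('r1', 'r2', ...) and returns its group.
row_sums = df_rows.groupby(
    lambda label: 'row_g1' if label in ('r1', 'r2') else 'row_g2'
).sum()
print(row_sums)
```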

(4) GroupBy aggregation using Index Levels

Finally, for a DataFrame with a hierarchical index, you can aggregate by index level. Giving the levels names makes them convenient to refer to. R has no hierarchical index, but it turns out to be quite useful once you get used to it.

hier_columns = pd.MultiIndex.from_arrays([['col_g1', 'col_g1', 'col_g2', 'col_g2', 'col_g2'],

['c1', 'c2', 'c3', 'c4', 'c5']],

names = ['col_level_1', 'col_level_2'])

hier_columns

MultiIndex(levels=[[u'col_g1', u'col_g2'], [u'c1', u'c2', u'c3', u'c4', u'c5']],
           labels=[[0, 0, 1, 1, 1], [0, 1, 2, 3, 4]],
           names=[u'col_level_1', u'col_level_2'])

hier_df = DataFrame(data = np.arange(20).reshape(4,5),

columns = hier_columns,

index = ['r1', 'r2', 'r3', 'r4'])

hier_df

col_level_1	col_g1		col_g2
col_level_2	c1	c2	c3	c4	c5
r1	0	1	2	3	4
r2	5	6	7	8	9
r3	10	11	12	13	14
r4	15	16	17	18	19

hier_df.groupby(level = 'col_level_1', axis=1).mean()

col_level_1	col_g1	col_g2
r1	0.5	3.0
r2	5.5	8.0
r3	10.5	13.0
r4	15.5	18.0

I hope this was helpful.


[Python pandas] Creating a new variable by combining the elements of two string variables in a DataFrame

Python Analysis and Programming / Python Data Preprocessing  2018. 9. 1. 12:52

This post shows how to reformat string variables in a pandas DataFrame and build new variables from them. It sounds easy, but it can be tricky if you have never seen it done. ^^; Knowing lambda, apply(), and the string methods makes it easier to follow.

(1) Create a new variable 'id_2' by left-padding 'id' with '0' to a total width of 5 characters

(Left padding with zeros to a width of 5)

(2) Create a new variable 'id_name' in the DataFrame by concatenating the new 'id_2' variable and the 'name' variable element by element

(element-wise string concatenation of multiple columns in a pandas DataFrame)

First, let's build a simple DataFrame to use as an example.

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'id': [1, 2, 10, 20, 100, 200],

...: 'name': ['aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff']})

In [3]: df

Out[3]:
    id name
0    1  aaa
1    2  bbb
2   10  ccc
3   20  ddd
4  100  eee
5  200  fff

(1) Create a new variable 'id_2' by left-padding 'id' with '0' to a total width of 5 characters

(Left padding with zeros to a width of 5)

A lambda wrapping format() is applied with apply() to fill the empty positions of the 5-character width with '0'.

In [4]: df['id_2'] = df['id'].apply(lambda x: "{:0>5d}".format(x))

In [5]: df

Out[5]:
    id name   id_2
0    1  aaa  00001
1    2  bbb  00002
2   10  ccc  00010
3   20  ddd  00020
4  100  eee  00100
5  200  fff  00200
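For non-negative integers, the same padding can also be written with the vectorized string method str.zfill(). This is an alternative sketch, not the method used in this post:

```python
import pandas as pd

df_pad = pd.DataFrame({'id': [1, 2, 10, 20, 100, 200]})

# The approach from this post: format each element with a lambda.
padded_format = df_pad['id'].apply(lambda x: "{:0>5d}".format(x))

# Alternative: convert to string, then left-pad with zeros to width 5.
padded_zfill = df_pad['id'].astype(str).str.zfill(5)

print(padded_zfill.tolist())
```

The two differ for negative numbers (zfill keeps the sign in front), so this equivalence assumes ids are never negative.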

For more number formats, see https://mkaz.blog/code/python-string-format-cookbook/ .

(2) Create a new variable 'id_name' in the DataFrame by concatenating the new 'id_2' variable and the 'name' variable element by element

(element-wise string concatenation of multiple columns in a pandas DataFrame)

Likewise, a lambda that joins the two strings with '_' as the separator ('_'.join) is applied with apply(). Note that 'axis = 1' must be set so that the function is applied to each row.

In [6]: df['id_name'] = df[['id_2', 'name']].apply(lambda x: '_'.join(x), axis=1)

In [7]: df

Out[7]:
    id name   id_2    id_name
0    1  aaa  00001  00001_aaa
1    2  bbb  00002  00002_bbb
2   10  ccc  00010  00010_ccc
3   20  ddd  00020  00020_ddd
4  100  eee  00100  00100_eee
5  200  fff  00200  00200_fff
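When both columns already hold strings, the row-wise apply() can be replaced by vectorized concatenation. A short sketch comparing the three spellings on hypothetical two-row data:

```python
import pandas as pd

df_cat = pd.DataFrame({'id_2': ['00001', '00010'],
                       'name': ['aaa', 'ccc']})

# The approach from this post: join each row with apply(axis=1).
joined_apply = df_cat[['id_2', 'name']].apply(lambda x: '_'.join(x), axis=1)

# Alternatives: the + operator on string columns, or Series.str.cat().
joined_plus = df_cat['id_2'] + '_' + df_cat['name']
joined_cat = df_cat['id_2'].str.cat(df_cat['name'], sep='_')

print(joined_plus.tolist())
```

The vectorized forms avoid calling a Python function once per row, which tends to matter on large DataFrames.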

Stopping here would feel a little thin, so let's try a few more format changes.

(3) Create a new variable 'id_3' that shows 'id' to two decimal places

(4) Create a new variable 'name_3' with the 'name' strings converted to upper case

(5) Create a new variable 'id_name_3' by combining 'id_3' and 'name_3' in the DataFrame

(3) Create a new variable 'id_3' that shows 'id' to two decimal places

"{:.2f}".format() is used to show two decimal places.

In [8]: df['id_3'] = df['id'].apply(lambda x: "{:.2f}".format(x))

In [9]: df

(4) Create a new variable 'name_3' with the 'name' strings converted to upper case

The built-in string method upper() converts lower case to upper case.

In [10]: df['name_3'] = df['name'].apply(lambda x: x.upper())

In [11]: df

(5) Create a new variable 'id_name_3' by combining 'id_3' and 'name_3' in the DataFrame

In [14]: df['id_name_3'] = df[['id_3', 'name_3']].apply(lambda x: ':'.join(x), axis=1)

In [15]: df
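Steps (3) to (5) can be run end to end on a shortened version of the example table. This sketch uses only the first three rows and replaces the upper() lambda with the equivalent str.upper() accessor:

```python
import pandas as pd

df_fmt = pd.DataFrame({'id': [1, 2, 10],
                       'name': ['aaa', 'bbb', 'ccc']})

df_fmt['id_3'] = df_fmt['id'].apply(lambda x: "{:.2f}".format(x))  # two decimals
df_fmt['name_3'] = df_fmt['name'].str.upper()                      # upper case
df_fmt['id_name_3'] = df_fmt[['id_3', 'name_3']].apply(
    lambda x: ':'.join(x), axis=1)                                 # join with ':'

print(df_fmt)
```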

I hope this was helpful.


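Fixing "ValueError: Program dot not found in path" when running graphviz on a MacBook (Mac OS)

Python Analysis and Programming / Python Installation and Basic Usage  2018. 8. 31. 09:43

This post describes a workaround for the following error message, which appeared when running the visualization software Graphviz and pygraphviz from Python after installing them on a Mac OS computer:

ValueError: Program dot not found in path

On my MacBook laptop everything installed and ran without trouble, but on my other Mac the installation succeeded and this error appeared as soon as I tried to use it, so I spent about two hours searching and fumbling. ㅜ_ㅜ It looks like a bug in pygraphviz. I hope this saves you from wasting the same time.

(For reference, I am using Python 2.7.)

(1) Check the path of 'dot'

(2) In the agraph.py file of the pygraphviz package, set runprog to the path found in (1)

(1) Check the path of 'dot': $ which dot

Open a terminal window and enter '$ which dot' as shown below to find the path where the dot program is installed. On my machine it is installed at /usr/local/bin/dot.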

abc:~ ddd$ which dot

/usr/local/bin/dot

abc:~ ddd$ dot -V

dot - graphviz version 2.40.1 (20161225.0304)

(2) In the agraph.py file of the pygraphviz package, set runprog to the path found in (1)

Searching for 'agraph.py' in Spotlight (command + spacebar) finds the agraph.py file of the pygraphviz package. Open that Python file.

Then press 'command + F' to open the search box, search for 'runprog', and find the line containing runprog = self._which(prog). In my installation it is on line 1,289.

In step (1), '$ which dot' in the terminal gave the path where the dot program is installed (/usr/local/bin/dot on my machine). Copy that path and edit the line as shown below (on my Mac, to runprog = "/usr/local/bin/dot").

Before

def _get_prog(self, prog):
    # private: get path of graphviz program
    progs = ['neato', 'dot', 'twopi', 'circo',
             'fdp', 'nop', 'wc', 'acyclic', 'gvpr',
             'gvcolor', 'ccomps', 'sccmap',
             'tred', 'sfdp']
    if prog not in progs:
        raise ValueError("Program %s is not one of: %s." % (prog, ', '.join(progs)))
    try:
        runprog = self._which(prog)
    except:
        raise ValueError("Program %s not found in path." % prog)
    return runprog

After

def _get_prog(self, prog):
    # private: get path of graphviz program
    progs = ['neato', 'dot', 'twopi', 'circo',
             'fdp', 'nop', 'wc', 'acyclic', 'gvpr',
             'gvcolor', 'ccomps', 'sccmap',
             'tred', 'sfdp']
    if prog not in progs:
        raise ValueError("Program %s is not one of: %s." % (prog, ', '.join(progs)))
    try:
        # hard-coded path found with `which dot`
        runprog = "/usr/local/bin/dot"
    except:
        raise ValueError("Program %s not found in path." % prog)
    return runprog

After saving the change and closing agraph.py, visualizing a network diagram with Graphviz through pygraphviz works fine.

I hope this was helpful.
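Editing an installed package is lost on every reinstall, and the hard-coded path is returned for every Graphviz program, not only dot. If the error occurs simply because the directory containing dot is missing from the PATH seen by the Python process (an assumption I have not verified for this pygraphviz version), a less invasive sketch is to extend PATH before using pygraphviz. The helper name ensure_on_path is made up for illustration, and shutil.which() requires Python 3.3 or later rather than the Python 2.7 used in this post:

```python
import os
import shutil

def ensure_on_path(program, extra_dir):
    """Return the resolved path of `program`, appending `extra_dir`
    to PATH first if the program cannot be found otherwise."""
    found = shutil.which(program)
    if found is None and os.path.isdir(extra_dir):
        os.environ['PATH'] = os.environ.get('PATH', '') + os.pathsep + extra_dir
        found = shutil.which(program)
    return found

# '/usr/local/bin' is the directory reported by `which dot` in this post.
dot_path = ensure_on_path('dot', '/usr/local/bin')
print(dot_path)  # None if dot is not installed on this machine
```

Call this before importing or using pygraphviz so that its own path lookup sees the updated PATH.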


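[Python pandas] Mapping the categories of a categorical variable with a reference table: dict.get()

Python Analysis and Programming / Python Data Preprocessing  2018. 8. 31. 00:44

This post shows how to convert the categories (classes) of a categorical variable in a pandas DataFrame all at once using reference information (a mapping table or reference table).

(1) Store the reference information for mapping the categories as a dict,

(2) build a user-defined function for the mapping with dict.get(), and

(3) apply the function from (2) to the categorical variable of the DataFrame with map().

Let's walk through an example step by step.

First, let's build a simple example DataFrame.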

import pandas as pd

from pandas import DataFrame

df = DataFrame({'name': ['kim', 'KIM', 'Kim', 'lee', 'LEE', 'Lee', 'wang', 'hong'],

'value': [1, 2, 3, 4, 5, 6, 7, 8],

'value_2': [100, 300, 200, 100, 100, 300, 50, 80]

})

df

	name	value	value_2
0	kim	1	100
1	KIM	2	300
2	Kim	3	200
3	lee	4	100
4	LEE	5	100
5	Lee	6	300
6	wang	7	50
7	hong	8	80

In the DataFrame df above, let's create a new variable name_2 by mapping (kim, KIM, Kim) in the name variable to (kim), (lee, LEE, Lee) to (lee), and (wang, hong) to (others).

(1) Store the reference information for mapping the categories as a dict

name_mapping = {

'KIM': 'kim',

'Kim': 'kim',

'LEE': 'lee',

'Lee': 'lee',

'wang': 'others',

'hong': 'others'

}

name_mapping

{'KIM': 'kim',
 'Kim': 'kim',
 'LEE': 'lee',
 'Lee': 'lee',
 'hong': 'others',
 'wang': 'others'}

(2) Build a user-defined function for the mapping with dict.get()

The user-defined function func below, built on dict.get(), means: if the reference dict name_mapping has an entry for the input, return the mapped value; otherwise return the input unchanged. 'lee' and 'kim' have no entry in name_mapping, so they are returned as they are.

func = lambda x: name_mapping.get(x, x)

(3) Apply the mapping function to the categorical variable of the DataFrame with map()

Using the reference dict name_mapping, this creates a new categorical variable named 'name_2'.

df['name_2'] = df.name.map(func)

df

	name	value	value_2	name_2
0	kim	1	100	kim
1	KIM	2	300	kim
2	Kim	3	200	kim
3	lee	4	100	lee
4	LEE	5	100	lee
5	Lee	6	300	lee
6	wang	7	50	others
7	hong	8	80	others
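Series.map() also accepts the dict directly; labels without an entry then become NaN and can be filled from the original column. The sketch below compares that alternative with the dict.get() approach used in this post:

```python
import pandas as pd

df_map = pd.DataFrame({'name': ['kim', 'KIM', 'Kim', 'lee', 'LEE', 'Lee',
                                'wang', 'hong']})

name_mapping = {'KIM': 'kim', 'Kim': 'kim',
                'LEE': 'lee', 'Lee': 'lee',
                'wang': 'others', 'hong': 'others'}

# Approach from this post: fall back to the input when no entry exists.
via_get = df_map['name'].map(lambda x: name_mapping.get(x, x))

# Alternative: map with the dict, then fill unmapped labels (NaN) from the original.
via_fillna = df_map['name'].map(name_mapping).fillna(df_map['name'])

print(via_fillna.tolist())
```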

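(4) Aggregate by the groups of the categorical variable with groupby()

Now that mapping the categories has produced new group information, let's use groupby() to sum the continuous variables ('value', 'value_2') by the new name_2 variable.

# aggregation by name_2; selecting the numeric columns keeps the string column 'name' out of the sum
df.groupby('name_2')[['value', 'value_2']].sum()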

	value	value_2
name_2
kim	6	600
lee	15	500
others	15	130

Using both 'name_2' and 'name' in groupby() gives a hierarchical index over the two categorical variables; here only the continuous variable 'value_2' is summed. (Calling unstack() on the result below moves the name variable into the columns and gives a cross-tab view.)

df.groupby(['name_2', 'name'])['value_2'].sum()

name_2  name
kim     KIM     300
        Kim     200
        kim     100
lee     LEE     100
        Lee     300
        lee     100
others  hong     80
        wang     50
Name: value_2, dtype: int64
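The unstack() step mentioned above can be sketched as follows. The data is the example table from this post, rebuilt here so the snippet runs on its own:

```python
import pandas as pd

df_x = pd.DataFrame({'name': ['kim', 'KIM', 'Kim', 'lee', 'LEE', 'Lee',
                              'wang', 'hong'],
                     'value_2': [100, 300, 200, 100, 100, 300, 50, 80]})

name_mapping = {'KIM': 'kim', 'Kim': 'kim', 'LEE': 'lee', 'Lee': 'lee',
                'wang': 'others', 'hong': 'others'}
df_x['name_2'] = df_x['name'].map(lambda x: name_mapping.get(x, x))

# Move the inner index level ('name') into the columns.
cross = df_x.groupby(['name_2', 'name'])['value_2'].sum().unstack()
print(cross)
```

Combinations that never occur, such as ('kim', 'LEE'), appear as NaN in the cross-tab.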

I hope this was helpful.


[Greenplum, Postgresql] How to delete duplicated rows

Greenplum and PostgreSQL Database  2018. 8. 29. 22:09

This post introduces two ways, in Greenplum DB and Postgresql DB, to handle duplicated observations (duplicated rows): keep only the first or the last observation of each set of duplicates and delete the rest, so that only unique observations remain.

(Method 1) Keep one of the duplicated observations in the original table and delete the rest

: DELETE FROM original_table

(Method 2) Create a new table that takes one observation from each set of duplicates, then drop the original table

: CREATE TABLE new_table & DROP TABLE original_table

(Method 1) Keep one of the duplicated observations in the original table and delete the rest

: DELETE FROM original_table

Let's create a simple example table with duplicated observations in Greenplum Database. In this example, duplicates are identified by the two variables 'name' and 'price' and then removed.

drop table if exists prod_master;

create table prod_master (

id int not null

, name text not null

, price real not null

) distributed randomly;

insert into prod_master values

(1, 'a', 1000)

, (2, 'a', 1000)

, (3, 'a', 1000)

, (4, 'b', 2000)

, (5, 'b', 2000)

, (6, 'c', 3000)

, (7, 'c', 3000);

select * from prod_master;

Now let's use a DELETE query to keep only the first row of each set of duplicates and remove the rest. A DELETE statement has the form 'DELETE FROM table_name WHERE [condition];'.

The point to watch is that the subquery must use the window function row_number() over (partition by ...) so that the "first row" of each set of duplicates survives (the row_number() line in the query below). Get this wrong and you can delete every row that has a duplicate, leaving none of them behind, so be careful.

delete from prod_master where id in (

select id

from

(select id,

row_number() over (partition by name, price order by id) as row_num

from prod_master) a

where a.row_num > 1

);

[Messages]

DELETE 4 Query returned successfully in 177 msec.

select * from prod_master;

If you would rather keep the "last" row of each set of duplicates instead of the "first" one (as in the example above), use order by id desc inside the row_number() over() window function.

--- Create a sample table

drop table if exists prod_master;

create table prod_master (

id int not null

, name text not null

, price real not null

) distributed randomly;

insert into prod_master values

(1, 'a', 1000)

, (2, 'a', 1000)

, (3, 'a', 1000)

, (4, 'b', 2000)

, (5, 'b', 2000)

, (6, 'c', 3000)

, (7, 'c', 3000);

---- keep the last observation in case of duplication

delete from prod_master where id in (

select id

from

(select id,

row_number() over (partition by name, price order by id desc) as row_num

from prod_master) a

where a.row_num > 1

);

select * from prod_master;
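
If no Greenplum instance is at hand, both DELETE variants above can be tried out from Python. This is only a sketch under a few assumptions: it uses an in-memory SQLite database as a stand-in (window functions need SQLite 3.25 or later), it drops the Greenplum-specific `distributed randomly` clause, and `dedupe` with its `keep` argument is a hypothetical helper of mine, not something from the original queries.

```python
import sqlite3

ROWS = [(1, "a", 1000), (2, "a", 1000), (3, "a", 1000),
        (4, "b", 2000), (5, "b", 2000),
        (6, "c", 3000), (7, "c", 3000)]


def dedupe(keep="first"):
    """Delete duplicates on (name, price), keeping the first or last id."""
    # 'first' ranks rows by ascending id, 'last' by descending id.
    order = "id" if keep == "first" else "id desc"
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table prod_master "
        "(id int not null, name text not null, price real not null)"
    )
    conn.executemany("insert into prod_master values (?, ?, ?)", ROWS)
    cursor = conn.execute(
        f"""
        delete from prod_master where id in (
            select id from (
                select id,
                       row_number() over (
                           partition by name, price order by {order}
                       ) as row_num
                from prod_master
            ) a
            where a.row_num > 1
        )
        """
    )
    deleted = cursor.rowcount  # number of rows removed by the DELETE
    remaining = [row[0] for row in
                 conn.execute("select id from prod_master order by id")]
    conn.close()
    return deleted, remaining


first_deleted, first_ids = dedupe("first")
last_deleted, last_ids = dedupe("last")
print(first_deleted, first_ids)
print(last_deleted, last_ids)
```

Both calls delete 4 rows, matching the "DELETE 4" message above; only the surviving ids differ.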

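Method 1 removes the duplicates while keeping the original table in place, so there is no need to create a new table or drop the existing one. Its drawback is that, when many duplicates have to be removed from a large table, it is slow relative to Method 2 below. For large data, consider Method 2 if deduplication speed matters. (The run times reported under 'Messages' show Method 2 finishing faster: 177 msec vs. 118 msec. With only 7 rows in this example the difference is a matter of milliseconds and easy to dismiss, and a single run is not a reliable benchmark, but on large data the gap can become too big to ignore.)

(Method 2) Build a new table with one row from each set of duplicates, then drop the original table

: CREATE TABLE new_table & DROP TABLE original_table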

--- Create a sample table

drop table if exists prod_master;

create table prod_master (

id int not null

, name text not null

, price real not null

) distributed randomly;

insert into prod_master values

(1, 'a', 1000)

, (2, 'a', 1000)

, (3, 'a', 1000)

, (4, 'b', 2000)

, (5, 'b', 2000)

, (6, 'c', 3000)

, (7, 'c', 3000);

---- keep the first observation in case of duplication by creating a new table

drop table if exists prod_master_unique;

create table prod_master_unique as (

select * from prod_master

where id NOT IN (

select id

from

(select id,

row_number() over (partition by name, price order by id) as row_num

from prod_master) a

where a.row_num > 1)

) distributed randomly;

[Messages]

SELECT 3 Query returned successfully in 118 msec.

select * from prod_master_unique order by id;

-- Drop the original table to save disk storage

drop table prod_master;
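
Method 2 can be sketched the same way from Python. Again this is an illustration under assumptions rather than the Greenplum run itself: an in-memory SQLite database stands in for Greenplum (SQLite 3.25 or later for the window function), and the `distributed randomly` clause is omitted because SQLite has no such option.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table prod_master "
    "(id int not null, name text not null, price real not null)"
)
conn.executemany(
    "insert into prod_master values (?, ?, ?)",
    [(1, "a", 1000), (2, "a", 1000), (3, "a", 1000),
     (4, "b", 2000), (5, "b", 2000),
     (6, "c", 3000), (7, "c", 3000)],
)

# Copy only the first row of each (name, price) group into a new table.
conn.execute(
    """
    create table prod_master_unique as
    select * from prod_master
    where id not in (
        select id from (
            select id,
                   row_number() over (
                       partition by name, price order by id
                   ) as row_num
            from prod_master
        ) a
        where a.row_num > 1
    )
    """
)

# Drop the original table once the unique copy exists.
conn.execute("drop table prod_master")

unique_ids = [row[0] for row in
              conn.execute("select id from prod_master_unique order by id")]
tables = [row[0] for row in
          conn.execute("select name from sqlite_master where type = 'table'")]
conn.close()
print(unique_ids, tables)
```

After the script runs, only prod_master_unique is left, holding one row per (name, price) pair.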

To remove duplicated observations and keep one of each in R, see:
=> http://rfriend.tistory.com/165 , http://rfriend.tistory.com/235
To do the same in Python, see:
=> http://rfriend.tistory.com/266

I hope this was helpful.


[Python pandas] Iterating over groups with GroupBy (Iteration over groups)

Python Analysis and Programming/Python Data Preprocessing 2018. 8. 26. 19:35

This post shows how to use GroupBy to repeat a task group by group (iteration over groups).

When you iterate over a pandas GroupBy object in a for loop, each iteration yields a 2-tuple of the group name and that group's dataset. Making good use of this behavior makes group-wise for loops convenient.

[ Iterating over groups with GroupBy ]

The example data is the public abalone dataset registered in the UCI machine learning repository.

abalone text file download: abalone.txt

import numpy as np

import pandas as pd

abalone = pd.read_csv("/Users/ihongdon/Documents/Python/abalone.txt",

sep=",",

names = ['sex', 'length', 'diameter', 'height',

'whole_weight', 'shucked_weight', 'viscera_weight',

'shell_weight', 'rings'],

header = None)

abalone['length_cat'] = np.where(abalone.length > np.median(abalone.length),

'length_long',

'length_short')

abalone.head()

	sex	length	diameter	height	whole_weight	shucked_weight	viscera_weight	shell_weight	rings	length_cat
0	M	0.455	0.365	0.095	0.5140	0.2245	0.1010	0.150	15	length_short
1	M	0.350	0.265	0.090	0.2255	0.0995	0.0485	0.070	7	length_short
2	F	0.530	0.420	0.135	0.6770	0.2565	0.1415	0.210	9	length_short
3	M	0.440	0.365	0.125	0.5160	0.2155	0.1140	0.155	10	length_short
4	I	0.330	0.255	0.080	0.2050	0.0895	0.0395	0.055	7	length_short

Let's group the abalone dataset above by 'sex' and then run a for loop that prints the dataset for each group name (sex: 'F', 'I', 'M').

for sex, group_data in abalone[['sex', 'length_cat', 'whole_weight', 'rings']].groupby('sex'):
    print sex,
    print group_data[:5]

F    sex    length_cat  whole_weight  rings
2    F  length_short        0.6770      9
6    F  length_short        0.7775     20
7    F  length_short        0.7680     16
9    F   length_long        0.8945     19
10   F  length_short        0.6065     14

I    sex    length_cat  whole_weight  rings
4    I  length_short        0.2050      7
5    I  length_short        0.3515      8
16   I  length_short        0.2905      7
21   I  length_short        0.2255     10
42   I  length_short        0.0700      5

M    sex    length_cat  whole_weight  rings
0    M  length_short        0.5140     15
1    M  length_short        0.2255      7
3    M  length_short        0.5160     10
8    M  length_short        0.5095      9
11   M  length_short        0.4060     10
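
For readers on Python 3, here is a minimal sketch of the same loop. The DataFrame `toy` is a hypothetical stand-in built from the first seven abalone rows shown earlier, and `group_sizes` is just an illustrative dictionary I added to show that the group name arrives as the first element of each pair.

```python
import pandas as pd

# Hypothetical stand-in: first seven abalone rows, three columns only.
toy = pd.DataFrame({
    "sex": ["M", "M", "F", "M", "I", "I", "F"],
    "whole_weight": [0.5140, 0.2255, 0.6770, 0.5160, 0.2050, 0.3515, 0.7775],
    "rings": [15, 7, 9, 10, 7, 8, 20],
})

group_sizes = {}
for sex, group_data in toy.groupby("sex"):
    # sex is the group name; group_data is the sub-DataFrame for that group.
    print(sex)
    print(group_data.head(2))
    group_sizes[sex] = len(group_data)

print(group_sizes)
```

Groups come back sorted by key, so the loop visits 'F', then 'I', then 'M'.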

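This time let's group by two categorical variables (sex, length_cat) and use a for loop to print each group name (the combinations of sex and length_cat: F & length_long, F & length_short, I & length_long, I & length_short, M & length_long, M & length_short) together with the dataset for that group.

Note that the backslash '\' in the code below continues a statement on the next line when it is too long to fit on one line. Nothing, not even a blank line, may come between the backslash and the continued code.

for (sex, length_cat), group_data in abalone[['sex', 'length_cat', 'whole_weight', 'rings']]\
        .groupby(['sex', 'length_cat']):
    print sex, length_cat
    print group_data[:5]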

F length_long
   sex   length_cat  whole_weight  rings
9    F  length_long        0.8945     19
22   F  length_long        0.9395     12
23   F  length_long        0.7635      9
24   F  length_long        1.1615     10
25   F  length_long        0.9285     11

F length_short
   sex    length_cat  whole_weight  rings
2    F  length_short        0.6770      9
6    F  length_short        0.7775     20
7    F  length_short        0.7680     16
10   F  length_short        0.6065     14
13   F  length_short        0.6845     10

I length_long
    sex   length_cat  whole_weight  rings
509   I  length_long        0.8735     16
510   I  length_long        1.1095     10
549   I  length_long        0.8750     11
550   I  length_long        1.1625     17
551   I  length_long        0.9885     13

I length_short
   sex    length_cat  whole_weight  rings
4    I  length_short        0.2050      7
5    I  length_short        0.3515      8
16   I  length_short        0.2905      7
21   I  length_short        0.2255     10
42   I  length_short        0.0700      5

M length_long
   sex   length_cat  whole_weight  rings
27   M  length_long        0.9310     12
28   M  length_long        0.9365     15
29   M  length_long        0.8635     11
30   M  length_long        0.9975     10
32   M  length_long        1.3380     18

M length_short
   sex    length_cat  whole_weight  rings
0    M  length_short        0.5140     15
1    M  length_short        0.2255      7
3    M  length_short        0.5160     10
8    M  length_short        0.5095      9
11   M  length_short        0.4060     10
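
A Python 3 sketch of the two-key loop follows, under the assumption that a handful of rows is enough to show the idea. `toy` and `mean_weight` are hypothetical names, and the groupby call is wrapped in nothing more than the for statement itself, so no backslash is needed here.

```python
import pandas as pd

# Hypothetical stand-in with both grouping columns.
toy = pd.DataFrame({
    "sex": ["M", "F", "F", "I", "M"],
    "length_cat": ["length_short", "length_short", "length_long",
                   "length_short", "length_long"],
    "whole_weight": [0.5140, 0.6770, 0.8945, 0.2050, 0.9310],
})

mean_weight = {}
for (sex, length_cat), group_data in toy.groupby(["sex", "length_cat"]):
    # With two keys, the group name is a tuple that can be unpacked directly.
    mean_weight[(sex, length_cat)] = group_data["whole_weight"].mean()

for key, value in mean_weight.items():
    print(key, value)
```

Only combinations that actually occur in the data appear as groups; in this toy data there is no ('I', 'length_long') group.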

Next, let's group by sex and build a dict whose keys are the sex groups ('F', 'I', 'M') and whose values are the corresponding datasets.

abalone_sex_group = dict(list(abalone[:10][['sex', 'length_cat', 'whole_weight', 'rings']]

.groupby('sex')))

abalone_sex_group

{'F':   sex    length_cat  whole_weight  rings
 2   F  length_short        0.6770      9
 6   F  length_short        0.7775     20
 7   F  length_short        0.7680     16
 9   F   length_long        0.8945     19,

 'I':   sex    length_cat  whole_weight  rings
 4   I  length_short        0.2050      7
 5   I  length_short        0.3515      8,

 'M':   sex    length_cat  whole_weight  rings
 0   M  length_short        0.5140     15
 1   M  length_short        0.2255      7
 3   M  length_short        0.5160     10
 8   M  length_short        0.5095      9}

A dict keyed by group name makes it convenient to index a dataset by its group name. As an example, let's pull out the dataset for sex 'M'.

abalone_sex_group['M']

	sex	length_cat	whole_weight	rings
0	M	length_short	0.5140	15
1	M	length_short	0.2255	7
3	M	length_short	0.5160	10
8	M	length_short	0.5095	9

Of course, boolean indexing on the original DataFrame, as in abalone[:10][abalone['sex'] == 'M'], selects the same rows. Once the dict has been built, though, looking up a group by name avoids re-evaluating the boolean mask every time, so repeated lookups should be faster.
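
The three ways of getting at one group can be compared side by side. This is a small sketch on hypothetical data (`toy`, `by_sex`); `get_group` is an existing GroupBy method that the post does not use, included here only as another option.

```python
import pandas as pd

# Hypothetical stand-in data.
toy = pd.DataFrame({
    "sex": ["M", "M", "F", "M", "I"],
    "rings": [15, 7, 9, 10, 7],
})

# Build the dict once: group name -> sub-DataFrame.
by_sex = {sex: group for sex, group in toy.groupby("sex")}

males_from_dict = by_sex["M"]                           # dict lookup
males_from_mask = toy[toy["sex"] == "M"]                # boolean indexing
males_from_groupby = toy.groupby("sex").get_group("M")  # GroupBy.get_group

print(males_from_dict)
```

All three select rows 0, 1, and 3; the dict version pays the grouping cost once up front and then serves each lookup directly.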

I hope this was helpful.


[Python pandas] Aggregating by group with groupby() (data aggregation by groups)

Python Analysis and Programming/Python Data Preprocessing 2018. 8. 26. 18:20

This post shows how to use pandas groupby() to aggregate and summarize data by group.

The whole dataset is split into groups (split), an aggregation function is applied to each group (apply), and the per-group results are combined into one (combine). (Split => Apply function => Combine)

[ GroupBy aggregation mechanics ]

Because group-wise aggregation is a routine part of analyzing datasets with many variables, groupby() is used very frequently and is well worth learning.
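
To make the split, apply, and combine steps described above concrete, here is a rough sketch that performs them by hand and compares the result with groupby(). The data and the names `toy`, `pieces`, `means`, `manual`, and `builtin` are all hypothetical; the weights are chosen so the means come out as exact values.

```python
import pandas as pd

# Hypothetical stand-in data with round weights.
toy = pd.DataFrame({
    "sex": ["M", "M", "F", "M", "I", "I", "F"],
    "whole_weight": [0.5, 0.25, 0.75, 0.75, 0.25, 0.5, 1.25],
})

# Split: one Series of weights per sex.
pieces = {sex: toy.loc[toy["sex"] == sex, "whole_weight"]
          for sex in sorted(toy["sex"].unique())}

# Apply: compute the mean of each piece.
means = {sex: piece.mean() for sex, piece in pieces.items()}

# Combine: gather the per-group results into one Series.
manual = pd.Series(means, name="whole_weight")

# The same three steps in a single groupby() call.
builtin = toy.groupby("sex")["whole_weight"].mean()

print(manual)
print(builtin)
```

The hand-rolled version is only for illustration; groupby() does the same work in one call and is what the rest of this post uses.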

The hands-on example uses a public dataset about abalone, a kind of shellfish.

[ UCI Machine Learning Repository ]

Abalone Data Set description: http://archive.ics.uci.edu/ml/datasets/Abalone
Abalone Data Set: http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data

Abalone CSV dataset download: abalone.txt

Variables

Name            Data Type   Meas.   Description
----            ---------   -----   -----------
Sex             nominal             M, F, and I (infant)
Length          continuous  mm      Longest shell measurement
Diameter        continuous  mm      perpendicular to length
Height          continuous  mm      with meat in shell
Whole weight    continuous  grams   whole abalone
Shucked weight  continuous  grams   weight of meat
Viscera weight  continuous  grams   gut weight (after bleeding)
Shell weight    continuous  grams   after being dried
Rings           integer             +1.5 gives the age in years

First, download the abalone.txt file linked just above, then load it with pandas read_csv() to create a DataFrame.

# Importing common libraries

import pandas as pd

from pandas import DataFrame

from pandas import Series

import numpy as np

# Reading abalone data set

abalone = pd.read_csv("/Users/ihongdon/Documents/Python/abalone.txt",

sep=",",

names = ['sex', 'length', 'diameter', 'height',

'whole_weight', 'shucked_weight', 'viscera_weight',

'shell_weight', 'rings'],

header = None)

Now that we have a pandas DataFrame named abalone, let's explore what the data looks like. Fortunately there are no missing values, and there are 4,177 observations. The abalone's sex variable is categorical.

# View of top 5 observations

abalone.head()

	sex	length	diameter	height	whole_weight	shucked_weight	viscera_weight	shell_weight	rings
0	M	0.455	0.365	0.095	0.5140	0.2245	0.1010	0.150	15
1	M	0.350	0.265	0.090	0.2255	0.0995	0.0485	0.070	7
2	F	0.530	0.420	0.135	0.6770	0.2565	0.1415	0.210	9
3	M	0.440	0.365	0.125	0.5160	0.2155	0.1140	0.155	10
4	I	0.330	0.255	0.080	0.2050	0.0895	0.0395	0.055	7

# Check the missing value

np.sum(pd.isnull(abalone))

sex               0
length            0
diameter          0
height            0
whole_weight      0
shucked_weight    0
viscera_weight    0
shell_weight      0
rings             0
dtype: int64

# Descriptive statistics

abalone.describe()

	length	diameter	height	whole_weight	shucked_weight	viscera_weight	shell_weight	rings
count	4177.000000	4177.000000	4177.000000	4177.000000	4177.000000	4177.000000	4177.000000	4177.000000
mean	0.523992	0.407881	0.139516	0.828742	0.359367	0.180594	0.238831	9.933684
std	0.120093	0.099240	0.041827	0.490389	0.221963	0.109614	0.139203	3.224169
min	0.075000	0.055000	0.000000	0.002000	0.001000	0.000500	0.001500	1.000000
25%	0.450000	0.350000	0.115000	0.441500	0.186000	0.093500	0.130000	8.000000
50%	0.545000	0.425000	0.140000	0.799500	0.336000	0.171000	0.234000	9.000000
75%	0.615000	0.480000	0.165000	1.153000	0.502000	0.253000	0.329000	11.000000
max	0.815000	0.650000	1.130000	2.825500	1.488000	0.760000	1.005000	29.000000

With the data ready, let's run GroupBy aggregations on the whole weight ('whole_weight') variable for each abalone sex group ('F', 'M', 'I').

Use grouped.size() for the size of each group, grouped.sum() for the per-group sum, and grouped.mean() for the per-group mean.

grouped = abalone['whole_weight'].groupby(abalone['sex'])

grouped

<pandas.core.groupby.SeriesGroupBy object at 0x112668c10>

grouped.size()

sex
F    1307
I    1342
M    1528
Name: whole_weight, dtype: int64

grouped.sum()

sex
F    1367.8175
I     578.8885
M    1514.9500
Name: whole_weight, dtype: float64

grouped.mean()

sex
F    1.046532
I    0.431363
M    0.991459
Name: whole_weight, dtype: float64

The example above aggregated only one continuous variable, 'whole_weight', by sex, but you can also aggregate all continuous variables in the DataFrame at once, excluding the grouping key.

abalone.groupby(abalone['sex']).mean()

	length	diameter	height	whole_weight	shucked_weight	viscera_weight	shell_weight	rings
sex
F	0.579093	0.454732	0.158011	1.046532	0.446188	0.230689	0.302010	11.129304
I	0.427746	0.326494	0.107996	0.431363	0.191035	0.092010	0.128182	7.890462
M	0.561391	0.439287	0.151381	0.991459	0.432946	0.215545	0.281969	10.705497

The form DataFrame.groupby('key_var').func() also works: abalone.groupby(abalone['sex']).mean() above can be written as abalone.groupby('sex').mean(), as below, with exactly the same result.

abalone.groupby('sex').mean()

	length	diameter	height	whole_weight	shucked_weight	viscera_weight	shell_weight	rings
sex
F	0.579093	0.454732	0.158011	1.046532	0.446188	0.230689	0.302010	11.129304
I	0.427746	0.326494	0.107996	0.431363	0.191035	0.092010	0.128182	7.890462
M	0.561391	0.439287	0.151381	0.991459	0.432946	0.215545	0.281969	10.705497

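From here on, in addition to sex, let's derive one more categorical variable from length and aggregate by group using two categorical keys.

Using np.where(), we create a new categorical variable whose value is 'length_long' when length is greater than the median and 'length_short' otherwise (that is, when length is less than or equal to the median).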

abalone['length_cat'] = np.where(abalone.length > np.median(abalone.length),

'length_long', # True

'length_short') # False

abalone[['length', 'length_cat']][:10]

	length	length_cat
0	0.455	length_short
1	0.350	length_short
2	0.530	length_short
3	0.440	length_short
4	0.330	length_short
5	0.425	length_short
6	0.530	length_short
7	0.545	length_short
8	0.475	length_short
9	0.550	length_long
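
The boundary case in the table above (row 7, whose length of 0.545 equals the median yet is labeled length_short) follows from the strict > comparison. A tiny sketch on made-up lengths shows the same behavior; `lengths`, `median_length`, and `length_cat` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up lengths; the middle value 0.5 is the median.
lengths = pd.Series([0.3, 0.4, 0.5, 0.6, 0.7])
median_length = np.median(lengths)

# np.where(condition, value_if_true, value_if_false), applied element-wise.
length_cat = np.where(lengths > median_length, "length_long", "length_short")

print(median_length)
print(list(length_cat))
```

Because the comparison is strict, a value exactly equal to the median falls into 'length_short'; using >= instead would move it to 'length_long'.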

Now let's use GroupBy to compute the mean for each combination of sex group (sex) and length category (length_cat).

mean_by_sex_length = abalone['whole_weight'].groupby([abalone['sex'], abalone['length_cat']]).mean()

mean_by_sex_length

sex  length_cat  
F    length_long     1.261330
     length_short    0.589702
I    length_long     0.923215
     length_short    0.351234
M    length_long     1.255182
     length_short    0.538157
Name: whole_weight, dtype: float64

The result above is laid out in long form, much like the output of a SQL aggregation. The unstack() function rearranges it into rows and columns, which is easier to read.

mean_by_sex_length.unstack()

length_cat	length_long	length_short
sex
F	1.261330	0.589702
I	0.923215	0.351234
M	1.255182	0.538157

abalone['whole_weight'].groupby([abalone['sex'], abalone['length_cat']]).mean() can be written more concisely as shown below: put the target DataFrame first, pass the key variables that define the groups to groupby(), and put the name of the continuous variable to aggregate last.

abalone.groupby(['sex', 'length_cat'])['whole_weight'].mean()

sex  length_cat  
F    length_long     1.261330
     length_short    0.589702
I    length_long     0.923215
     length_short    0.351234
M    length_long     1.255182
     length_short    0.538157
Name: whole_weight, dtype: float64

I hope this was helpful.


Installing Graphviz and pygraphviz on a MacBook and visualizing a Decision Tree

Python Analysis and Programming/Python Installation and Basic Usage 2018. 8. 25. 12:55

Graphviz is open-source visualization software created at AT&T and Bell Labs. It renders structured information as abstract graphs or network-style diagrams. For example, you can use Graphviz to draw the result of training a machine learning Decision Tree as a tree, or to draw a workflow discovered through Process Mining as a directed network.

PyGraphviz is a Python package that provides an interface to the Graphviz software. With PyGraphviz you can access Graphviz's graph data structures and layout algorithms to create, edit, read, write, and draw graphs.

To use Graphviz from Python, you must (a) install the Graphviz software first and (b) install PyGraphviz after that. If you get the order wrong and try to install PyGraphviz without Graphviz in place, you will see an error message telling you to install Graphviz first. The order matters!

Your Graphviz installation could not be found.

1) You don't have Graphviz installed:

Install Graphviz (http://graphviz.org)
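
Before installing PyGraphviz, it can help to check whether Graphviz is already present. The sketch below is a heuristic under an assumption of mine: it only looks for Graphviz's `dot` command on the PATH with the standard library, while building PyGraphviz also needs the Graphviz headers and libraries. The function name `graphviz_available` is hypothetical.

```python
import shutil


def graphviz_available(executable="dot"):
    """Return True if the given Graphviz command is found on the PATH."""
    return shutil.which(executable) is not None


if graphviz_available():
    print("Graphviz found; PyGraphviz can be installed next.")
else:
    print("Graphviz not found; install it first (for example, brew install graphviz).")
```

If the check fails, install Graphviz as described in step (1) and run it again before moving on to step (2).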

This post covers the following:

(1) Installing the Graphviz software on Mac OS High Sierra

(2) Installing the PyGraphviz library on Python 2.7

(3) Visualizing a Decision Tree with Graphviz and PyGraphviz

(1) Installing the Graphviz software on Mac OS High Sierra

(For reference, I am using Mac OS High Sierra version 10.13.6.)

(1-1) Install Homebrew.

Homebrew is an open-source package manager that simplifies installing software packages on Apple's Mac OS. Open a terminal, then copy and run the command below.

$ ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)" < /dev/null 2> /dev/null

ihongdon-ui-MacBook-Pro:~ ihongdon$ ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)" < /dev/null 2> /dev/null
==> This script will install:
/usr/local/bin/brew
/usr/local/share/doc/homebrew
/usr/local/share/man/man1/brew.1
/usr/local/share/zsh/site-functions/_brew
/usr/local/etc/bash_completion.d/brew
/usr/local/Homebrew
==> The following existing directories will be made group writable:
/usr/local/bin
/usr/local/include
/usr/local/lib
/usr/local/share
/usr/local/lib/pkgconfig
/usr/local/share/info
/usr/local/share/man
/usr/local/share/man/man1
/usr/local/share/man/man3
/usr/local/share/man/man5
/usr/local/share/man/man7
==> The following existing directories will have their owner set to ihongdon:
/usr/local/bin
/usr/local/include
/usr/local/lib
/usr/local/share
/usr/local/lib/pkgconfig
/usr/local/share/info
/usr/local/share/man
/usr/local/share/man/man1
/usr/local/share/man/man3
/usr/local/share/man/man5
/usr/local/share/man/man7
==> The following existing directories will have their group set to admin:
/usr/local/bin
/usr/local/include
/usr/local/lib
/usr/local/share
/usr/local/lib/pkgconfig
/usr/local/share/info
/usr/local/share/man
/usr/local/share/man/man1
/usr/local/share/man/man3
/usr/local/share/man/man5
/usr/local/share/man/man7
==> The following new directories will be created:
/usr/local/Cellar
/usr/local/Homebrew
/usr/local/Frameworks
/usr/local/etc
/usr/local/opt
/usr/local/sbin
/usr/local/share/zsh
/usr/local/share/zsh/site-functions
/usr/local/var
==> /usr/bin/sudo /bin/chmod u+rwx /usr/local/bin /usr/local/include /usr/local/lib /usr/local/share /usr/local/lib/pkgconfig /usr/local/share/info /usr/local/share/man /usr/local/share/man/man1 /usr/local/share/man/man3 /usr/local/share/man/man5 /usr/local/share/man/man7
Password:
==> /usr/bin/sudo /bin/chmod g+rwx /usr/local/bin /usr/local/include /usr/local/lib /usr/local/share /usr/local/lib/pkgconfig /usr/local/share/info /usr/local/share/man /usr/local/share/man/man1 /usr/local/share/man/man3 /usr/local/share/man/man5 /usr/local/share/man/man7
==> /usr/bin/sudo /usr/sbin/chown ihongdon /usr/local/bin /usr/local/include /usr/local/lib /usr/local/share /usr/local/lib/pkgconfig /usr/local/share/info /usr/local/share/man /usr/local/share/man/man1 /usr/local/share/man/man3 /usr/local/share/man/man5 /usr/local/share/man/man7
==> /usr/bin/sudo /usr/bin/chgrp admin /usr/local/bin /usr/local/include /usr/local/lib /usr/local/share /usr/local/lib/pkgconfig /usr/local/share/info /usr/local/share/man /usr/local/share/man/man1 /usr/local/share/man/man3 /usr/local/share/man/man5 /usr/local/share/man/man7
==> /usr/bin/sudo /bin/mkdir -p /usr/local/Cellar /usr/local/Homebrew /usr/local/Frameworks /usr/local/etc /usr/local/opt /usr/local/sbin /usr/local/share/zsh /usr/local/share/zsh/site-functions /usr/local/var
==> /usr/bin/sudo /bin/chmod g+rwx /usr/local/Cellar /usr/local/Homebrew /usr/local/Frameworks /usr/local/etc /usr/local/opt /usr/local/sbin /usr/local/share/zsh /usr/local/share/zsh/site-functions /usr/local/var
==> /usr/bin/sudo /bin/chmod 755 /usr/local/share/zsh /usr/local/share/zsh/site-functions
==> /usr/bin/sudo /usr/sbin/chown ihongdon /usr/local/Cellar /usr/local/Homebrew /usr/local/Frameworks /usr/local/etc /usr/local/opt /usr/local/sbin /usr/local/share/zsh /usr/local/share/zsh/site-functions /usr/local/var
==> /usr/bin/sudo /usr/bin/chgrp admin /usr/local/Cellar /usr/local/Homebrew /usr/local/Frameworks /usr/local/etc /usr/local/opt /usr/local/sbin /usr/local/share/zsh /usr/local/share/zsh/site-functions /usr/local/var
==> /usr/bin/sudo /bin/mkdir -p /Users/ihongdon/Library/Caches/Homebrew
==> /usr/bin/sudo /bin/chmod g+rwx /Users/ihongdon/Library/Caches/Homebrew
==> /usr/bin/sudo /usr/sbin/chown ihongdon /Users/ihongdon/Library/Caches/Homebrew
==> /usr/bin/sudo /bin/mkdir -p /Library/Caches/Homebrew
==> /usr/bin/sudo /bin/chmod g+rwx /Library/Caches/Homebrew
==> /usr/bin/sudo /usr/sbin/chown ihongdon /Library/Caches/Homebrew
==> Downloading and installing Homebrew...
HEAD is now at 1c7c876f3 Merge pull request #4736 from scpeters/bottle_json_local_filename
==> Homebrew is run entirely by unpaid volunteers. Please consider donating:
https://github.com/Homebrew/brew#donations
==> Tapping homebrew/core
ihongdon-ui-MacBook-Pro:~ ihongdon$
ihongdon-ui-MacBook-Pro:~ ihongdon$

(1-2) Install Graphviz.

Once Homebrew is installed, run the Homebrew command below in the terminal to install Graphviz.

$ brew install graphviz

ihongdon-ui-MacBook-Pro:~ ihongdon$
ihongdon-ui-MacBook-Pro:~ ihongdon$ brew install graphviz
==> Installing dependencies for graphviz: libtool, libpng, freetype, fontconfig, jpeg, libtiff, webp, gd
==> Installing graphviz dependency: libtool
==> Downloading https://homebrew.bintray.com/bottles/libtool-2.4.6_1.high_sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring libtool--2.4.6_1.high_sierra.bottle.tar.gz
==> Caveats
In order to prevent conflicts with Apple's own libtool we have prepended a "g"
so, you have instead: glibtool and glibtoolize.
==> Summary
🍺 /usr/local/Cellar/libtool/2.4.6_1: 71 files, 3.7MB
==> Installing graphviz dependency: libpng
==> Downloading https://homebrew.bintray.com/bottles/libpng-1.6.35.high_sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring libpng--1.6.35.high_sierra.bottle.tar.gz
🍺 /usr/local/Cellar/libpng/1.6.35: 26 files, 1.2MB
==> Installing graphviz dependency: freetype
==> Downloading https://homebrew.bintray.com/bottles/freetype-2.9.1.high_sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring freetype--2.9.1.high_sierra.bottle.tar.gz
🍺 /usr/local/Cellar/freetype/2.9.1: 60 files, 2.6MB
==> Installing graphviz dependency: fontconfig
==> Downloading https://homebrew.bintray.com/bottles/fontconfig-2.13.0.high_sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring fontconfig--2.13.0.high_sierra.bottle.tar.gz
==> Regenerating font cache, this may take a while
==> /usr/local/Cellar/fontconfig/2.13.0/bin/fc-cache -frv
🍺 /usr/local/Cellar/fontconfig/2.13.0: 511 files, 3.2MB
==> Installing graphviz dependency: jpeg
==> Downloading https://homebrew.bintray.com/bottles/jpeg-9c.high_sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring jpeg--9c.high_sierra.bottle.tar.gz
🍺 /usr/local/Cellar/jpeg/9c: 21 files, 724.5KB
==> Installing graphviz dependency: libtiff
==> Downloading https://homebrew.bintray.com/bottles/libtiff-4.0.9_4.high_sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring libtiff--4.0.9_4.high_sierra.bottle.tar.gz
🍺 /usr/local/Cellar/libtiff/4.0.9_4: 246 files, 3.5MB
==> Installing graphviz dependency: webp
==> Downloading https://homebrew.bintray.com/bottles/webp-1.0.0.high_sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring webp--1.0.0.high_sierra.bottle.tar.gz
🍺 /usr/local/Cellar/webp/1.0.0: 38 files, 2MB
==> Installing graphviz dependency: gd
==> Downloading https://homebrew.bintray.com/bottles/gd-2.2.5.high_sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring gd--2.2.5.high_sierra.bottle.tar.gz
🍺 /usr/local/Cellar/gd/2.2.5: 35 files, 1.1MB
==> Installing graphviz
==> Downloading https://homebrew.bintray.com/bottles/graphviz-2.40.1.high_sierra.bottle.1.tar.gz
######################################################################## 100.0%
==> Pouring graphviz--2.40.1.high_sierra.bottle.1.tar.gz
🍺 /usr/local/Cellar/graphviz/2.40.1: 500 files, 11.2MB
==> Caveats
==> libtool
In order to prevent conflicts with Apple's own libtool we have prepended a "g"
so, you have instead: glibtool and glibtoolize.
ihongdon-ui-MacBook-Pro:~ ihongdon$
ihongdon-ui-MacBook-Pro:~ ihongdon$

(2) Installing the PyGraphviz library on Python 2.7

PyGraphviz can be used with Python 3.x and with Python 2.7. In this post we install the PyGraphviz package on Python 2.7.

I listed the virtual environments with conda env list, activated the Python 2.7 environment with source activate, upgraded pip with pip install --upgrade pip, and then installed PyGraphviz with easy_install pygraphviz.

I am not sure why, but pip install pygraphviz failed with an error near the end of the installation.

I also tried pip install git://github.com/pygraphviz/pygraphviz.git, which failed with an error as well.

Fortunately, easy_install pygraphviz worked.

ihongdon-ui-MacBook-Pro:~ ihongdon$ conda env list

# conda environments:

#

base * /Users/ihongdon/anaconda3

py2.7_tf1.4 /Users/ihongdon/anaconda3/envs/py2.7_tf1.4

py3.5_tf1.4 /Users/ihongdon/anaconda3/envs/py3.5_tf1.4

ihongdon-ui-MacBook-Pro:~ ihongdon$

ihongdon-ui-MacBook-Pro:~ ihongdon$ source activate py2.7_tf1.4

(py2.7_tf1.4) ihongdon-ui-MacBook-Pro:~ ihongdon$ pip install --upgrade pip

Requirement already up-to-date: pip in ./anaconda3/envs/py2.7_tf1.4/lib/python2.7/site-packages (18.0)

(py2.7_tf1.4) ihongdon-ui-MacBook-Pro:~ ihongdon$

(py2.7_tf1.4) ihongdon-ui-MacBook-Pro:~ ihongdon$ easy_install pygraphviz

Searching for pygraphviz

Reading https://pypi.python.org/simple/pygraphviz/

Downloading https://files.pythonhosted.org/packages/87/5e/40efbb2d02ee9d0282f6c8b9e477f6444a025a7ecf8cc0b15fe87a288708

/pygraphviz-1.4rc1.zip#sha256=e0b3a7f1d9203f9748b94e8365656755201966b562e53fd6424bed89e98fdc4e

Best match: pygraphviz 1.4rc1

Processing pygraphviz-1.4rc1.zip

Writing /var/folders/6q/mtq6ftrj6_z4txn_zsxcfyxc0000gn/T/easy_install-U3TQDK/pygraphviz-1.4rc1/setup.cfg

Running pygraphviz-1.4rc1/setup.py -q bdist_egg --dist-dir /var/folders/6q/mtq6ftrj6_z4txn_zsxcfyxc0000gn/T/easy_install-U3TQDK/pygraphviz-1.4rc1/egg-dist-tmp-wc0yf4

warning: no previously-included files matching '*~' found anywhere in distribution

warning: no previously-included files matching '*.pyc' found anywhere in distribution

warning: no previously-included files matching '.svn' found anywhere in distribution

no previously-included directories found matching 'doc/build'

pygraphviz/graphviz_wrap.c:3354:12: warning: incompatible pointer to integer conversion returning 'Agsym_t *' (aka 'struct Agsym_s *') from a function with result type 'int' [-Wint-conversion]

return agattr(g, kind, name, val);

^~~~~~~~~~~~~~~~~~~~~~~~~~

pygraphviz/graphviz_wrap.c:3438:7: warning: unused variable 'fd1' [-Wunused-variable]

int fd1 ;

^

pygraphviz/graphviz_wrap.c:3439:13: warning: unused variable 'mode_obj1' [-Wunused-variable]

PyObject *mode_obj1 ;

^

pygraphviz/graphviz_wrap.c:3440:13: warning: unused variable 'mode_byte_obj1' [-Wunused-variable]

PyObject *mode_byte_obj1 ;

^

pygraphviz/graphviz_wrap.c:3441:9: warning: unused variable 'mode1' [-Wunused-variable]

char *mode1 ;

^

pygraphviz/graphviz_wrap.c:3509:7: warning: unused variable 'fd2' [-Wunused-variable]

int fd2 ;

^

pygraphviz/graphviz_wrap.c:3510:13: warning: unused variable 'mode_obj2' [-Wunused-variable]

PyObject *mode_obj2 ;

^

pygraphviz/graphviz_wrap.c:3511:13: warning: unused variable 'mode_byte_obj2' [-Wunused-variable]

PyObject *mode_byte_obj2 ;

^

pygraphviz/graphviz_wrap.c:3512:9: warning: unused variable 'mode2' [-Wunused-variable]

char *mode2 ;

^

9 warnings generated.

zip_safe flag not set; analyzing archive contents...

pygraphviz.graphviz: module references __file__

pygraphviz.release: module references __file__

pygraphviz.tests.test: module references __file__

creating /Users/ihongdon/anaconda3/envs/py2.7_tf1.4/lib/python2.7/site-packages/pygraphviz-1.4rc1-py2.7-macosx-10.6-x86_64.egg

Extracting pygraphviz-1.4rc1-py2.7-macosx-10.6-x86_64.egg to /Users/ihongdon/anaconda3/envs/py2.7_tf1.4/lib/python2.7/site-packages

Adding pygraphviz 1.4rc1 to easy-install.pth file

Installed /Users/ihongdon/anaconda3/envs/py2.7_tf1.4/lib/python2.7/site-packages/pygraphviz-1.4rc1-py2.7-macosx-10.6-x86_64.egg

Processing dependencies for pygraphviz

Finished processing dependencies for pygraphviz

(py2.7_tf1.4) ihongdon-ui-MacBook-Pro:~ ihongdon$

(3) Visualizing a Decision Tree with Graphviz and PyGraphviz

Now let's use Graphviz and PyGraphviz to visualize a Decision Tree in a Jupyter Notebook. The example classifies Iris species with a Decision Tree trained on the Iris dataset.

# Common imports

import numpy as np

import os

# To plot pretty figures

%matplotlib inline

import matplotlib

import matplotlib.pyplot as plt

# Where to save the figures

PROJECT_ROOT_DIR = "."

SUB_DIR = "decision_trees"

def image_path(fig_id):

    return os.path.join(PROJECT_ROOT_DIR, "images", SUB_DIR, fig_id)

# Iris dataset import

from sklearn.datasets import load_iris

from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

list(iris.keys())

['target_names', 'data', 'target', 'DESCR', 'feature_names']

print(iris.DESCR)

Iris Plants Database
====================

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A Fisher

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

References
----------
   - Fisher,R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...

iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

iris.data[:5,]

array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  1.4,  0.2]])

Let's fit an iris classification model using the DecisionTreeClassifier class from scikit-learn.

X = iris.data[:, 2:] # petal length and width

y = iris.target

# Train the model

tree_clf = DecisionTreeClassifier(max_depth=2)

tree_clf.fit(X, y)
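Before exporting the tree, it can help to confirm that the fitted model behaves as expected. The following is a minimal sketch, not part of the original post: it refits the same model and classifies two made-up flowers. The sample measurements and `random_state=42` are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width, as above
y = iris.target

# random_state is an assumption added here so that tie-breaking is reproducible
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

# Two hypothetical flowers: [petal length (cm), petal width (cm)]
samples = [[1.4, 0.2], [5.0, 1.5]]

predicted = tree_clf.predict(samples)            # class index per sample
probabilities = tree_clf.predict_proba(samples)  # class fractions in the leaf reached

for sample, cls, proba in zip(samples, predicted, probabilities):
    print(sample, iris.target_names[cls], proba.round(3))
```

With a depth-2 tree, the class probabilities are simply the training-class fractions of the leaf each sample falls into, so they can be compared directly with the `value` arrays in the exported tree.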

Let's export the Decision Tree model above to a Graphviz dot format file using the export_graphviz() function.

# Visualization

from sklearn.tree import export_graphviz

export_graphviz(
    tree_clf,
    out_file=image_path("iris_tree.dot"),
    feature_names=iris.feature_names[2:],
    class_names=iris.target_names,
    rounded=True,
    filled=True
)
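Note that `export_graphviz()` opens `out_file` for writing but does not create missing folders, so the call above fails if `images/decision_trees` does not exist yet. The sketch below is one way to handle that. It assumes a scratch directory from `tempfile.mkdtemp()` (used here only so the example does not touch the project folder) and `random_state=42`; neither is part of the original post. It also shows `out_file=None`, which returns the DOT source as a string instead of writing a file.

```python
import os
import tempfile

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(iris.data[:, 2:], iris.target)

# Scratch location (an assumption for this sketch; the post uses ./images/decision_trees)
out_dir = os.path.join(tempfile.mkdtemp(), "images", "decision_trees")
if not os.path.isdir(out_dir):
    os.makedirs(out_dir)  # create the folder before exporting

export_options = dict(
    feature_names=iris.feature_names[2:],
    class_names=list(iris.target_names),
    rounded=True,
    filled=True,
)

# Write the DOT file to disk
dot_path = os.path.join(out_dir, "iris_tree.dot")
export_graphviz(tree_clf, out_file=dot_path, **export_options)

# With out_file=None the DOT source is returned as a string
dot_source = export_graphviz(tree_clf, out_file=None, **export_options)
print(dot_source.splitlines()[0])
```

Keeping the DOT source in memory is convenient when the file itself is not needed, for example when the string is passed straight to a rendering library.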

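Running the export_graphviz() code above creates the file iris_tree.dot in the "/Users/ihongdon/images/decision_trees/" folder (that is, images/decision_trees/ under the notebook's working directory). If you open this dot format file in a text editor such as Notepad, it looks like this: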

digraph Tree {
node [shape=box, style="filled, rounded", color="black", fontname=helvetica] ;
edge [fontname=helvetica] ;
0 [label="petal width (cm) <= 0.8\ngini = 0.667\nsamples = 150\nvalue = [50, 50, 50]\nclass = setosa", fillcolor="#e5813900"] ;
1 [label="gini = 0.0\nsamples = 50\nvalue = [50, 0, 0]\nclass = setosa", fillcolor="#e58139ff"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="petal width (cm) <= 1.75\ngini = 0.5\nsamples = 100\nvalue = [0, 50, 50]\nclass = versicolor", fillcolor="#39e58100"] ;
0 -> 2 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
3 [label="gini = 0.168\nsamples = 54\nvalue = [0, 49, 5]\nclass = versicolor", fillcolor="#39e581e5"] ;
2 -> 3 ;
4 [label="gini = 0.043\nsamples = 46\nvalue = [0, 1, 45]\nclass = virginica", fillcolor="#8139e5f9"] ;
2 -> 4 ;
}
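Each node's `gini` value in the file above can be reproduced from its `value` array, which holds the per-class sample counts: the Gini impurity is 1 minus the sum of the squared class fractions. A small sketch of that arithmetic follows (the helper name `gini_impurity` is ours, not a scikit-learn function):

```python
def gini_impurity(class_counts):
    """Return 1 - sum(p_k ** 2) for the class fractions p_k of a node."""
    total = float(sum(class_counts))
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# `value` arrays copied from the nodes in iris_tree.dot
nodes = {
    0: [50, 50, 50],
    1: [50, 0, 0],
    2: [0, 50, 50],
    3: [0, 49, 5],
    4: [0, 1, 45],
}

for node_id, counts in sorted(nodes.items()):
    print(node_id, round(gini_impurity(counts), 3))
```

Rounded to three decimals, the results match the `gini` labels in the DOT file (0.667, 0.0, 0.5, 0.168, 0.043).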

Finally, let's use the pygraphviz package to visualize the Decision Tree from the iris_tree.dot file above.

import pygraphviz as pgv

from IPython.display import Image

graph = pgv.AGraph("/Users/ihongdon/images/decision_trees/iris_tree.dot")

graph.draw('iris_tree_out.png', prog='dot')

Image('iris_tree_out.png')
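`graph.draw(..., prog='dot')` relies on the Graphviz `dot` executable being available on the PATH. If it is missing, the draw call cannot lay out the graph. A small pre-check can make that failure easier to diagnose. This is a sketch assuming Python 3.3 or later (for `shutil.which`), and the helper name `find_layout_program` is hypothetical:

```python
import shutil

def find_layout_program(prog="dot"):
    """Return the full path of a Graphviz layout program, or None if it is not on PATH."""
    return shutil.which(prog)

dot_executable = find_layout_program("dot")
if dot_executable is None:
    print("Graphviz 'dot' was not found on PATH; install Graphviz before calling graph.draw().")
else:
    print("Using Graphviz layout program at: " + dot_executable)
```

Running this in the same environment as the notebook kernel matters, because the kernel's PATH can differ from the PATH of an interactive terminal.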

[Reference]

Graphviz: https://graphviz.gitlab.io/
PyGraphviz: https://pygraphviz.github.io/

I hope this was helpful.


License: Attribution-NonCommercial-NoDerivatives


Posted by Rfriend
