[Python pandas] Various GroupBy Aggregation Methods: Dicts, Series, Lists, Functions, Index Levels

Python Analysis and Programming / Python Data Preprocessing, 2018. 9. 1. 19:32

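This post introduces four ways of specifying the groups for a pandas GroupBy aggregation. Lists are covered together with Series.

(1) GroupBy aggregation using Dicts

(2) GroupBy aggregation using Series

(3) GroupBy aggregation using Functions

(4) GroupBy aggregation using Index Levels

The default is axis = 0, which groups the rows by their index labels (row-wise aggregation). Setting axis = 1 groups the columns by their column labels instead (column-wise aggregation).

[ Four GroupBy aggregation methods in pandas ]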

(1) GroupBy aggregation using Dicts

First, let's create a simple DataFrame to use as an example.

# importing libraries
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

# making sample dataset
df = DataFrame(data=np.arange(20).reshape(4, 5),
               columns=['c1', 'c2', 'c3', 'c4', 'c5'],
               index=['r1', 'r2', 'r3', 'r4'])

	c1	c2	c3	c4	c5
r1	0	1	2	3	4
r2	5	6	7	8	9
r3	10	11	12	13	14
r4	15	16	17	18	19

Next come examples of GroupBy aggregation using Dicts, first row-wise and then column-wise.

(1-1) Row-wise GroupBy aggregation using Dicts (axis = 0)

mapping_dict_row = {'r1': 'row_g1',
                    'r2': 'row_g1',
                    'r3': 'row_g2',
                    'r4': 'row_g2'}

grouped_by_row = df.groupby(mapping_dict_row)

grouped_by_row.sum()

	c1	c2	c3	c4	c5
row_g1	5	7	9	11	13
row_g2	25	27	29	31	33

(1-2) Column-wise GroupBy aggregation using Dicts (axis = 1)

mapping_dict_col = {'c1': 'col_g1',
                    'c2': 'col_g1',
                    'c3': 'col_g2',
                    'c4': 'col_g2',
                    'c5': 'col_g2'}

grouped_by_col = df.groupby(mapping_dict_col, axis=1)

grouped_by_col.sum()

	col_g1	col_g2
r1	1	9
r2	11	24
r3	21	39
r4	31	54
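As far as I know, recent pandas releases (2.1 and later) deprecate the axis=1 argument of groupby. The following is a minimal sketch of one way to get the same column-wise sums without it, by transposing, grouping the rows, and transposing back. The name col_sum is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Same sample data as above
df = pd.DataFrame(np.arange(20).reshape(4, 5),
                  columns=['c1', 'c2', 'c3', 'c4', 'c5'],
                  index=['r1', 'r2', 'r3', 'r4'])

mapping_dict_col = {'c1': 'col_g1', 'c2': 'col_g1',
                    'c3': 'col_g2', 'c4': 'col_g2', 'c5': 'col_g2'}

# Transpose so the columns become rows, group those rows with the dict,
# then transpose back to the original orientation.
col_sum = df.T.groupby(mapping_dict_col).sum().T
print(col_sum)
```

The values match the grouped_by_col.sum() output shown above.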

Series and Lists can be used for GroupBy aggregation in much the same way as Dicts.

(2) GroupBy aggregation using Series

(2-1) Row-wise GroupBy aggregation using Series (axis = 0)

mapping_series_row = Series(mapping_dict_row)

mapping_series_row

r1    row_g1
r2    row_g1
r3    row_g2
r4    row_g2
dtype: object

df.groupby(mapping_series_row).sum()

	c1	c2	c3	c4	c5
row_g1	5	7	9	11	13
row_g2	25	27	29	31	33

(2-2) Column-wise GroupBy aggregation using Series (axis = 1)

mapping_series_col = Series(mapping_dict_col)

mapping_series_col

c1    col_g1
c2    col_g1
c3    col_g2
c4    col_g2
c5    col_g2
dtype: object

df.groupby(mapping_series_col, axis=1).sum()

	col_g1	col_g2
r1	1	9
r2	11	24
r3	21	39
r4	31	54

df.groupby(mapping_series_col, axis=1).mean()

	col_g1	col_g2
r1	0.5	3.0
r2	5.5	8.0
r3	10.5	13.0
r4	15.5	18.0

(2-3) Column-wise GroupBy aggregation using Lists (axis = 1)

A list gives the same aggregation result as (2-2).

mapping_list_col = ['col_g1', 'col_g1', 'col_g2', 'col_g2', 'col_g2'] # lists

df.groupby(mapping_list_col, axis=1).mean()

	col_g1	col_g2
r1	0.5	3.0
r2	5.5	8.0
r3	10.5	13.0
r4	15.5	18.0
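A list works for row-wise grouping as well. The sketch below assumes the same sample df; the names mapping_list_row and row_sum are made up for this illustration. The list has one element per row, in row order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(4, 5),
                  columns=['c1', 'c2', 'c3', 'c4', 'c5'],
                  index=['r1', 'r2', 'r3', 'r4'])

# One group name per row, matched by position: r1, r2, r3, r4
mapping_list_row = ['row_g1', 'row_g1', 'row_g2', 'row_g2']

row_sum = df.groupby(mapping_list_row).sum()
print(row_sum)
```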

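A dict or a Series is matched to the axis by label, but a list is matched by position, so a list needs exactly one element per row (or column) being grouped. If the number of elements differs from the number of rows or columns, pandas treats the elements as keys to look up on the DataFrame and raises a 'KeyError'.

In the example below, the list mapping_list_col_2 has six elements while the df DataFrame has only five columns, so a KeyError is raised.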

mapping_list_col_2 = ['col_g1', 'col_g1', 'col_g2', 'col_g2', 'col_g2', 'col_g2']

df.groupby(mapping_list_col_2, axis=1).mean()

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-31-76bb72a2996c> in <module>()
----> 1 df.groupby(mapping_list_col_2, axis=1).mean()

/Users/ihongdon/anaconda3/envs/py2.7_tf1.4/lib/python2.7/site-packages/pandas/core/generic.pyc in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, **kwargs)
   4414         return groupby(self, by=by, axis=axis, level=level, as_index=as_index,
   4415                        sort=sort, group_keys=group_keys, squeeze=squeeze,
-> 4416                        **kwargs)
   4417 
   4418     def asfreq(self, freq, method=None, how=None, normalize=False,

/Users/ihongdon/anaconda3/envs/py2.7_tf1.4/lib/python2.7/site-packages/pandas/core/groupby.pyc in groupby(obj, by, **kwds)
   1697         raise TypeError('invalid type: %s' % type(obj))
   1698 
-> 1699     return klass(obj, by, **kwds)
   1700 
   1701 

/Users/ihongdon/anaconda3/envs/py2.7_tf1.4/lib/python2.7/site-packages/pandas/core/groupby.pyc in __init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, **kwargs)
    390                                                     level=level,
    391                                                     sort=sort,
--> 392                                                     mutated=self.mutated)
    393 
    394         self.obj = obj

/Users/ihongdon/anaconda3/envs/py2.7_tf1.4/lib/python2.7/site-packages/pandas/core/groupby.pyc in _get_grouper(obj, key, axis, level, sort, mutated)
   2688                 in_axis, name, level, gpr = False, None, gpr, None
   2689             else:
-> 2690                 raise KeyError(gpr)
   2691         elif isinstance(gpr, Grouper) and gpr.key is not None:
   2692             # Add key to exclusions

KeyError: 'col_g1'
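By contrast, label-based mappings do not have to line up one-to-one with the axis. The sketch below is only an illustration under the same sample df: a dict with an extra key and a Series that leaves one row unmapped. The names mapping_dict_extra, partial_series, by_dict and by_series are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(4, 5),
                  columns=['c1', 'c2', 'c3', 'c4', 'c5'],
                  index=['r1', 'r2', 'r3', 'r4'])

# Dict with an extra key ('r9') that is not in df's index: the key is simply unused.
mapping_dict_extra = {'r1': 'row_g1', 'r2': 'row_g1',
                      'r3': 'row_g2', 'r4': 'row_g2',
                      'r9': 'row_g3'}
by_dict = df.groupby(mapping_dict_extra).sum()

# Series that maps only three of the four row labels: it is aligned on the
# index, r4 gets no group key (NaN) and is left out of the result by default.
partial_series = pd.Series({'r1': 'row_g1', 'r2': 'row_g1', 'r3': 'row_g2'})
by_series = df.groupby(partial_series).sum()

print(by_dict)
print(by_series)
```

In by_series, row_g2 contains only r3, so its sums are just the values of that row.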

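(3) GroupBy aggregation using Functions

Interestingly, a function can also be passed to groupby. pandas calls the function on each index label and uses the return value as the group name. In the example below, a user-defined function row_grp_func() is used to compute a GroupBy sum.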

	c1	c2	c3	c4	c5
r1	0	1	2	3	4
r2	5	6	7	8	9
r3	10	11	12	13	14
r4	15	16	17	18	19

def row_grp_func(x):
    if x == 'r1' or x == 'r2':
        row_group = 'row_g1'
    else:
        row_group = 'row_g2'
    return row_group

df.groupby(row_grp_func).sum()

	c1	c2	c3	c4	c5
row_g1	5	7	9	11	13
row_g2	25	27	29	31	33
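Because the function only sees the index label, any rule that can be computed from the label works. As a small sketch, the hypothetical helper odd_even_row below (not part of the original example) groups rows by whether the number in the label is odd or even.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(4, 5),
                  columns=['c1', 'c2', 'c3', 'c4', 'c5'],
                  index=['r1', 'r2', 'r3', 'r4'])

def odd_even_row(label):
    """Hypothetical helper: classify a label such as 'r3' by its trailing number."""
    return 'odd' if int(label[1:]) % 2 == 1 else 'even'

# groupby calls odd_even_row on 'r1', 'r2', 'r3', 'r4' in turn
odd_even_sum = df.groupby(odd_even_row).sum()
print(odd_even_sum)
```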

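(4) GroupBy aggregation using Index Levels

Finally, here is how to aggregate by index level on a DataFrame that has a hierarchical index. Giving the levels names with the names argument makes them convenient to refer to. Hierarchical indexing is a feature R does not have, but after using it for a while I find it quite useful.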

hier_columns = pd.MultiIndex.from_arrays([['col_g1', 'col_g1', 'col_g2', 'col_g2', 'col_g2'],
                                          ['c1', 'c2', 'c3', 'c4', 'c5']],
                                         names=['col_level_1', 'col_level_2'])

hier_columns

MultiIndex(levels=[[u'col_g1', u'col_g2'], [u'c1', u'c2', u'c3', u'c4', u'c5']],
           labels=[[0, 0, 1, 1, 1], [0, 1, 2, 3, 4]],
           names=[u'col_level_1', u'col_level_2'])

hier_df = DataFrame(data=np.arange(20).reshape(4, 5),
                    columns=hier_columns,
                    index=['r1', 'r2', 'r3', 'r4'])

hier_df

col_level_1	col_g1		col_g2
col_level_2	c1	c2	c3	c4	c5
r1	0	1	2	3	4
r2	5	6	7	8	9
r3	10	11	12	13	14
r4	15	16	17	18	19

hier_df.groupby(level = 'col_level_1', axis=1).mean()

col_level_1	col_g1	col_g2
r1	0.5	3.0
r2	5.5	8.0
r3	10.5	13.0
r4	15.5	18.0
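Level-based grouping is not limited to columns. The sketch below puts a hierarchical index on the rows instead and groups on the outer level with the default axis. The names hier_index, hier_row_df, row_level_1 and row_level_2 are assumptions chosen to mirror the column example.

```python
import numpy as np
import pandas as pd

# Two-level row index: outer level is the group, inner level is the row label
hier_index = pd.MultiIndex.from_arrays(
    [['row_g1', 'row_g1', 'row_g2', 'row_g2'],
     ['r1', 'r2', 'r3', 'r4']],
    names=['row_level_1', 'row_level_2'])

hier_row_df = pd.DataFrame(np.arange(20).reshape(4, 5),
                           columns=['c1', 'c2', 'c3', 'c4', 'c5'],
                           index=hier_index)

# Group the rows by the outer index level and sum within each group
level_sum = hier_row_df.groupby(level='row_level_1').sum()
print(level_sum)
```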

I hope this was helpful.


License: Attribution, NonCommercial, NoDerivatives


Posted by Rfriend
