[Python pandas] Computing ranks in a DataFrame or Series with the rank() function

Python Analysis and Programming / Python Data Preprocessing 2019. 7. 27. 20:01

In this post I introduce the rank() function, which computes ranks based on a chosen variable in a Python pandas DataFrame or Series.

Ranking is closely related to sorting. For how to sort each Python data type, see the links below.

Sorting a pandas DataFrame, Tuple, or List : https://rfriend.tistory.com/281
Sorting a NumPy array : https://rfriend.tistory.com/357

(1) Comparing ranks under different tie-breaking methods

There are five ways to compute ranks, depending on how observations with the same score (ties) are handled.

[ Five ways to handle ties when ranking ]
Average (method='average') : each tied observation gets the mean of the ranks its tied group occupies (the default)
Minimum (method='min') : each tied observation gets the lowest rank in its tied group
Maximum (method='max') : each tied observation gets the highest rank in its tied group
First (method='first') : tied observations are ranked in the order they appear in the data
Dense (method='dense') : like 'min', but the rank always increases by exactly 1 between groups, so no rank numbers are skipped

Let's create a simple example DataFrame that contains tied scores.

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({
   ...:     'name': ['kim', 'lee', 'park', 'choi', 'jung', 'gang', 'nam'],
   ...:     'score': [70, 95, 100, 95, 70, 90, 70]
   ...: }, columns=['name', 'score'])

In [4]: df
Out[4]:
   name  score
0   kim     70
1   lee     95
2  park    100
3  choi     95
4  jung     70
5  gang     90
6   nam     70

Now let's check how the resulting ranks differ across the five tie-breaking methods. (We treat the example as exam scores and set ascending=False so that higher scores receive better ranks.)

In [5]: df['rank_by_average'] = df['score'].rank(ascending=False)  # rank default method='average'

In [6]: df['rank_by_min'] = df['score'].rank(method='min', ascending=False)

In [7]: df['rank_by_max'] = df['score'].rank(method='max', ascending=False)

In [8]: df['rank_by_first'] = df['score'].rank(method='first', ascending=False)

In [9]: df['rank_by_dense'] = df['score'].rank(method='dense', ascending=False)

In [10]: df
Out[10]:
   name  score  rank_by_average  rank_by_min  rank_by_max  rank_by_first  \
0   kim     70              6.0          5.0          7.0            5.0
1   lee     95              2.5          2.0          3.0            2.0
2  park    100              1.0          1.0          1.0            1.0
3  choi     95              2.5          2.0          3.0            3.0
4  jung     70              6.0          5.0          7.0            6.0
5  gang     90              4.0          4.0          4.0            4.0
6   nam     70              6.0          5.0          7.0            7.0

   rank_by_dense
0            4.0
1            2.0
2            1.0
3            2.0
4            4.0
5            3.0
6            4.0
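One way to see how 'average', 'min', and 'max' relate to 'first' is to rebuild them by hand. The sketch below is illustrative only: the variable names are my own, and this is not a description of how pandas computes ranks internally. It takes the ordinal positions from method='first' and then summarizes them within each group of tied scores.

```python
import pandas as pd

scores = pd.Series([70, 95, 100, 95, 70, 90, 70], name="score")

# Ordinal positions: ties are broken by order of appearance.
ordinal = scores.rank(method="first", ascending=False)

# Within each group of tied scores, take the mean / min / max ordinal position.
by_tie_group = ordinal.groupby(scores)
derived = pd.DataFrame({
    "average": by_tie_group.transform("mean"),
    "min": by_tie_group.transform("min"),
    "max": by_tie_group.transform("max"),
})

# The same three columns computed directly by rank().
builtin = pd.DataFrame({
    m: scores.rank(method=m, ascending=False) for m in ["average", "min", "max"]
})

# Element-wise comparison of the hand-built ranks with the built-in ones.
print((derived == builtin).all().all())
```

For the three students who scored 70, the ordinal positions are 5, 6, and 7, which is where the 6.0 (mean), 5.0 (min), and 7.0 (max) in the table above come from.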

(2) Ranking in ascending order (Rank in Ascending order)

Setting rank(ascending=True) ranks in ascending order: the smallest score gets rank 1, and the rank goes up as the score gets higher. Because ascending=True is the default, you get ascending ranks even if you do not set it explicitly.

In [11]: df_score = df[['name', 'score']].copy()

In [12]: df_score['rank_ascending'] = df_score['score'].rank(method='min', ascending=True)

In [13]: df_score
Out[13]:
   name  score  rank_ascending
0   kim     70             1.0
1   lee     95             5.0
2  park    100             7.0
3  choi     95             5.0
4  jung     70             1.0
5  gang     90             4.0
6   nam     70             1.0
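As a quick check of the default, the sketch below compares rank() with and without an explicit ascending=True. It also shows a relationship that I am adding for illustration and that holds only when there are no missing values: a descending 'min' rank is the mirror image of an ascending 'max' rank. The variable names are my own.

```python
import pandas as pd

scores = pd.Series([70, 95, 100, 95, 70, 90, 70])

# Omitting `ascending` gives the same result as ascending=True.
implicit = scores.rank(method="min")
explicit = scores.rank(method="min", ascending=True)
print(implicit.equals(explicit))

# Assumption: no NaN values. Then descending 'min' == n + 1 - ascending 'max'.
n = len(scores)
descending_min = scores.rank(method="min", ascending=False)
mirrored = n + 1 - scores.rank(method="max", ascending=True)
print(descending_min.tolist() == mirrored.tolist())
```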

(3) Ranking within groups (Rank by Groups): df.groupby().rank()

With the groupby operator you can compute ranks separately within each group.

In [14]: from itertools import chain, repeat

In [15]: df2 = pd.DataFrame({
   ...:     'name': ['kim', 'lee', 'park', 'choi']*3,
   ...:     'course': list(chain.from_iterable((repeat(course, 4)
   ...:         for course in ['korean', 'english', 'math']))),
   ...:     'score': [70, 95, 100, 95, 65, 80, 95, 90, 100, 85, 90, 90]
   ...: }, columns=['name', 'course', 'score'])

In [16]: df2
Out[16]:
    name   course  score
0    kim   korean     70
1    lee   korean     95
2   park   korean    100
3   choi   korean     95
4    kim  english     65
5    lee  english     80
6   park  english     95
7   choi  english     90
8    kim     math    100
9    lee     math     85
10  park     math     90
11  choi     math     90

In [17]: df2['rank_by_min_per_course'] = df2.groupby('course')['score'].rank(method='min', ascending=False)

In [18]: df2
Out[18]:
    name   course  score  rank_by_min_per_course
0    kim   korean     70                     4.0
1    lee   korean     95                     2.0
2   park   korean    100                     1.0
3   choi   korean     95                     2.0
4    kim  english     65                     4.0
5    lee  english     80                     3.0
6   park  english     95                     1.0
7   choi  english     90                     2.0
8    kim     math    100                     1.0
9    lee     math     85                     4.0
10  park     math     90                     2.0
11  choi     math     90                     2.0
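To make "separately within each group" concrete, the sketch below ranks each course's rows on their own and stitches the pieces back together, then compares that with the groupby result. This is only an illustration with my own variable names, and the course column is built with plain list repetition instead of itertools for brevity.

```python
import pandas as pd

df2 = pd.DataFrame({
    "name": ["kim", "lee", "park", "choi"] * 3,
    "course": ["korean"] * 4 + ["english"] * 4 + ["math"] * 4,
    "score": [70, 95, 100, 95, 65, 80, 95, 90, 100, 85, 90, 90],
})

# Group-wise rank in one call.
grouped_rank = df2.groupby("course")["score"].rank(method="min", ascending=False)

# The same thing by hand: rank each course's subset, then restore row order.
pieces = [
    subset["score"].rank(method="min", ascending=False)
    for _, subset in df2.groupby("course")
]
manual_rank = pd.concat(pieces).sort_index()

print(grouped_rank.tolist() == manual_rank.tolist())
```

In practice the single groupby().rank() call is the one to use; the manual loop is only there to show what it does.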

(4) Ranking across columns (Rank over the columns): df.rank(axis=1)

Examples (1), (2), and (3) above all ranked values down the rows (vertically, within a single column). Sometimes you need to rank values across the columns instead (horizontally, within each row). Setting rank(axis=1) gives you this column-wise ranking.

In [19]: df3 = pd.DataFrame({
   ...:     'col_1': [1, 2, 3, 4],
   ...:     'col_2': [3, 5, 1, 2],
   ...:     'col_3': [3, 1, 2, 4]})

In [20]: df3
Out[20]:
   col_1  col_2  col_3
0      1      3      3
1      2      5      1
2      3      1      2
3      4      2      4

In [21]: df3.rank(method='min', ascending=False, axis=1)
Out[21]:
   col_1  col_2  col_3
0    3.0    1.0    1.0
1    2.0    1.0    3.0
2    1.0    3.0    2.0
3    1.0    3.0    1.0
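If axis=1 is hard to picture, one way to think about it is as the default ranking applied to the transposed table. The sketch below (illustrative only, with my own variable names) checks that the two routes give the same values.

```python
import pandas as pd

df3 = pd.DataFrame({
    "col_1": [1, 2, 3, 4],
    "col_2": [3, 5, 1, 2],
    "col_3": [3, 1, 2, 4],
})

# Rank the values inside each row, across the columns.
row_wise = df3.rank(method="min", ascending=False, axis=1)

# Same result by transposing, ranking down the rows (axis=0), and transposing back.
via_transpose = df3.T.rank(method="min", ascending=False, axis=0).T

print((row_wise == via_transpose).all().all())
```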

I hope this was helpful.


Attribution, NonCommercial, NoDerivatives


Posted by Rfriend
