[Python pandas] Finding unique values: pd.Series.unique(), counting occurrences of each unique value: pd.Series.value_counts()

Python Analysis and Programming / Python Data Preprocessing 2016. 12. 12. 23:17

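In the previous post, we looked at how to check for and handle duplicate values.

In this post, to find unique values and count how often each one occurs, I will introduce two Python pandas methods:

- pd.Series.unique(): find the unique values

(Return np.ndarray of unique values in the object)

- pd.Series.value_counts(): count the occurrences of each unique value

(Returns object containing counts of unique values)

Both come up often during data preprocessing and exploratory data analysis, so they are worth knowing.

First, let's import the required modules and create an example DataFrame.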

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: df = pd.DataFrame({'A': ['A1', 'A1', 'A2', 'A2', 'A3', 'A3'],

...: 'B': ['B1', 'B1', 'B1', 'B1', 'B2', np.nan],

...: 'C': [1, 1, 3, 4, 4, 4]})

In [4]: df

Out[4]:

    A    B C
0 A1   B1 1
1 A1   B1 1
2 A2   B1 3
3 A2   B1 4
4 A3   B2 4
5 A3 NaN 4

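(1) Finding unique values: pd.Series.unique()

pd.Series.unique() returns a np.ndarray (for extension dtypes such as categorical, it returns the corresponding extension array instead). The results of selecting each column of the DataFrame and applying unique() are shown below.

Note that NaN in column 'B' is included in the result of unique().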

In [5]: df['A'].unique()

Out[5]: array(['A1', 'A2', 'A3'], dtype=object)

In [6]: df['B'].unique()

Out[6]: array(['B1', 'B2', nan], dtype=object)

In [7]: df['C'].unique()

Out[7]: array([1, 3, 4], dtype=int64)
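Since unique() keeps NaN, it can help to compare it with two related calls that are not part of the examples above: Series.nunique(), which returns only the number of distinct values and skips NaN by default, and dropna() chained before unique(). The following is a minimal sketch under the assumption that the Series simply rebuilds column 'B' of the example DataFrame.

```python
import numpy as np
import pandas as pd

# Same values as column 'B' of the example DataFrame
b = pd.Series(['B1', 'B1', 'B1', 'B1', 'B2', np.nan], name='B')

# unique() keeps NaN as one of the distinct values
uniq_with_nan = b.unique()

# Drop missing values first if NaN should not be listed
uniq_without_nan = b.dropna().unique()

# nunique() returns only the number of distinct values (NaN excluded by default)
n_unique = b.nunique()
n_unique_with_nan = b.nunique(dropna=False)

print(uniq_with_nan)
print(uniq_without_nan)
print(n_unique, n_unique_with_nan)
```

Which variant to use depends on whether a missing value should count as its own category in your analysis.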

(2) Counting occurrences of each unique value: pd.Series.value_counts()

# Returns object containing counts of unique values

pd.Series.value_counts(normalize=False, # False: counts; True: relative frequencies

sort=True, # True: sort by count; False: do not sort by count

ascending=False, # False: descending order; True: ascending order

bins=None, # None: count each unique value; otherwise count values per bin

dropna=True) # True: ignore NaN; False: include NaN among the counted values

# Output of pd.Series.value_counts() with default settings

In [8]: df['A'].value_counts()

Out[8]:

A3    2
A2    2
A1    2

Name: A, dtype: int64

In [9]: df['B'].value_counts()

Out[9]:

B1 4
B2 1

Name: B, dtype: int64

In [10]: df['C'].value_counts()

Out[10]:

4    3
1    2
3    1

Name: C, dtype: int64
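Because value_counts() returns a Series indexed by the unique values, the result can be queried like any other Series. Here is a small sketch of that idea; the variable names are my own, and the Series again just mirrors column 'B' of the example DataFrame.

```python
import numpy as np
import pandas as pd

# Same values as column 'B' of the example DataFrame
b = pd.Series(['B1', 'B1', 'B1', 'B1', 'B2', np.nan], name='B')

counts = b.value_counts()        # NaN is ignored by default

count_b1 = counts['B1']          # count for a single value
most_common = counts.idxmax()    # value with the highest count
as_dict = counts.to_dict()       # plain dict of value -> count

print(count_b1, most_common, as_dict)
```

This is handy when the counts feed into later steps, for example picking the most frequent category to fill in missing values.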

(2-1) Relative frequency of each unique value: pd.Series.value_counts(normalize=True)

In [11]: df['C'].value_counts(normalize=True)

Out[11]:

4    0.500000
1    0.333333
3    0.166667

Name: C, dtype: float64
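To see what normalize=True computes, the proportions can be reproduced by hand by dividing the raw counts by their total. This is only an illustrative sketch; the Series rebuilds column 'C' of the example DataFrame.

```python
import pandas as pd

# Same values as column 'C' of the example DataFrame
c = pd.Series([1, 1, 3, 4, 4, 4], name='C')

counts = c.value_counts()
ratios = c.value_counts(normalize=True)

# Dividing the raw counts by their total gives the same proportions
manual = counts / counts.sum()

print(ratios)
print(manual)
```

The proportions always add up to 1 over the values that were counted, so NaN is left out of the denominator unless dropna=False is passed.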

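(2-2) Sorting the result: pd.Series.value_counts(sort=True, ascending=True)

The combinations of sort=True/False and ascending=True/False give the three cases shown in the examples below.

[12]: sorted by count in descending order (sort descending order by value_counts)

[13]: sorted by count in ascending order (sort ascending order by value_counts)

[14]: not sorted by count (sort=False). The unique values happen to come out in ascending order for this column; to sort by the unique values themselves regardless of the data, call sort_index() on the result.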

# sort descending order by value_counts

In [12]: df['C'].value_counts(sort=True, ascending=False) # by default

Out[12]:

4    3
1    2
3    1

Name: C, dtype: int64

# sort ascending order by value_counts

In [13]: df['C'].value_counts(sort=True, ascending=True)

Out[13]:

3    1
1    2
4    3

Name: C, dtype: int64

# Don't sort by count (sort=False)

In [14]: df['C'].value_counts(sort=False)

Out[14]:

1    2
3    1
4    3

Name: C, dtype: int64
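With sort=False the row order is not tied to the unique values; as far as I know it depends on the data and on the pandas version. If the goal is to order the result by the unique values themselves, sort_index() is a dependable way to do it. A minimal sketch, using a hypothetical Series (not from the example DataFrame) whose values do not first appear in ascending order:

```python
import pandas as pd

# Hypothetical Series whose values do not first appear in ascending order
s = pd.Series([4, 4, 1, 3, 4, 1])

# sort=False only turns off sorting by count; the row order then depends on
# the data and on the pandas version
unsorted_counts = s.value_counts(sort=False)

# sort_index() orders the result by the unique values themselves
by_value = s.value_counts().sort_index()

print(unsorted_counts)
print(by_value)
```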

(2-3) Whether to include missing values among the unique values: pd.Series.value_counts(dropna=True)

# dropna=True : Don’t include counts of NaN

In [15]: df['B'].value_counts(dropna=True) # by default

Out[15]:

B1 4
B2 1

Name: B, dtype: int64

# dropna=False : Include counts of NaN

In [16]: df['B'].value_counts(dropna=False)

Out[16]:

B1     4
B2     1
NaN    1

Name: B, dtype: int64
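One way to sanity-check the dropna option is to compare the totals with the length of the Series and with isna().sum(). The sketch below does that for column 'B'; the variable names are my own.

```python
import numpy as np
import pandas as pd

# Same values as column 'B' of the example DataFrame
b = pd.Series(['B1', 'B1', 'B1', 'B1', 'B2', np.nan], name='B')

n_missing = int(b.isna().sum())                            # number of NaN entries
total_default = int(b.value_counts().sum())                # NaN excluded
total_with_nan = int(b.value_counts(dropna=False).sum())   # NaN included

print(n_missing, total_default, total_with_nan, len(b))
```

With dropna=False the counts add up to the full length of the Series; with the default they add up to the number of non-missing entries.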

(2-4) Counting values per bin: pd.Series.value_counts(bins=[ , , ,])

In [17]: df['C'].value_counts(bins=[0, 1, 2, 3, 4, 5], sort=False)

Out[17]:

0    2
1    0
2    1
3    3
4    0

Name: C, dtype: int64

The pd.cut(Series, bins=[ , , , ]) approach below gives the same counts as the result above; Series.value_counts(bins=[ , , , ]) is just a little more convenient to use.

In [18]: out = pd.cut(df['C'], bins=[0, 1, 2, 3, 4, 5])

In [19]: out.value_counts()

Out[19]:

(3, 4]    3
(0, 1]    2
(2, 3]    1
(4, 5]    0
(1, 2]    0

Name: C, dtype: int64
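To compare the two approaches side by side, the counts can be put in bin order with sort_index() and compared directly. This is a small sketch under the assumption that only the counts matter; depending on the pandas version, the row labels may be printed as intervals rather than as the left bin edges.

```python
import pandas as pd

# Same values as column 'C' of the example DataFrame
c = pd.Series([1, 1, 3, 4, 4, 4], name='C')
bins = [0, 1, 2, 3, 4, 5]

# Count per bin directly with value_counts(bins=...)
via_bins = c.value_counts(bins=bins).sort_index()

# Same idea in two steps: cut into bins, then count each bin
via_cut = pd.cut(c, bins=bins).value_counts().sort_index()

print(via_bins.tolist())
print(via_cut.tolist())
```

Both routes keep the empty bins with a count of 0, which is useful when the result is used to draw a histogram-like table.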

I hope this was helpful.

Attribution, NonCommercial, NoDerivs

Posted by Rfriend

A Friend for R and Python Analysis and Programming (by R Friend)
