[Python pandas] Reshaping data: pd.DataFrame.stack(), pd.DataFrame.unstack()

Python Analysis and Programming / Python Data Preprocessing 2016. 12. 24. 23:29

This series introduces the Python pandas functions you can use to reshape data, in the following order.

- (1) pivot(), pd.pivot_table()

- (2) stack(), unstack()

- (3) melt()

- (4) wide_to_long()

- (5) pd.crosstab()

This post is the second in the series and covers pd.DataFrame.stack() and pd.DataFrame.unstack().

The dictionary definition of "stack" is:

stack [stæk]
~ (sth) (up): to pile things up neatly; to be piled up
~ sth (with sth): to fill a place by piling things in it

If stack means piling things up (tall, from top to bottom), then unstack is easy to picture as laying the pile out sideways (wide, from left to right).

Let's import the modules needed for the stack() and unstack() exercises and build an example DataFrame with a hierarchical index.

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: from pandas import DataFrame

In [4]: mul_index = pd.MultiIndex.from_tuples([('cust_1', '2015'), ('cust_1', '2016'),

...: ('cust_2', '2015'), ('cust_2', '2016')])

...:

In [5]: data = DataFrame(data=np.arange(16).reshape(4, 4),

...: index=mul_index,

...: columns=['prd_1', 'prd_2', 'prd_3', 'prd_4'],

...: dtype='int')

...:

In [6]: data

Out[6]:

             prd_1  prd_2  prd_3  prd_4
cust_1 2015      0      1      2      3
       2016      4      5      6      7
cust_2 2015      8      9     10     11
       2016     12     13     14     15

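Let's use the stack() method to pile the example dataset up tall, from top to bottom. Because the columns have only one level, there is no need to pass stack(level=-1) explicitly.

(1) pd.DataFrame.stack(level=-1, dropna=True)

After calling stack() on the DataFrame, we will inspect the resulting index and try indexing into it.

Calling stack() on a DataFrame with single-level columns returns a Series.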

# stack()

In [7]: data_stacked = data.stack()

# DataFrame.stack() => returns Series

In [8]: data_stacked

Out[8]:

cust_1 2015 prd_1     0
                  prd_2     1
                  prd_3     2
                  prd_4     3
        2016 prd_1     4
                  prd_2     5
                  prd_3     6
                  prd_4     7
cust_2 2015 prd_1     8
                  prd_2     9
                  prd_3    10
                  prd_4    11
         2016 prd_1    12
                  prd_2    13
                  prd_3    14
                  prd_4    15

dtype: int32

# MultiIndex(levels) after stack()

In [9]: data_stacked.index

Out[9]:

MultiIndex(levels=[['cust_1', 'cust_2'], ['2015', '2016'], ['prd_1', 'prd_2', 'prd_3', 'prd_4']],

labels=[[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1], [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]])

# indexing

In [10]: data_stacked['cust_2']['2015'][['prd_1', 'prd_2']]

Out[10]:

prd_1 8

prd_2 9

dtype: int32
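
To make the relationship between the two methods concrete, here is a minimal sketch that rebuilds the example frame and checks that unstack() undoes stack(). The variable names stacked, round_trip, and value are mine, not from the original session.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the post (2-level row index, 4 product columns).
mul_index = pd.MultiIndex.from_tuples(
    [("cust_1", "2015"), ("cust_1", "2016"), ("cust_2", "2015"), ("cust_2", "2016")]
)
data = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=mul_index,
    columns=["prd_1", "prd_2", "prd_3", "prd_4"],
)

stacked = data.stack()          # the columns become the innermost index level
round_trip = stacked.unstack()  # the innermost index level goes back to the columns

# A single value can be addressed with the full 3-part key.
value = stacked.loc[("cust_2", "2015", "prd_1")]
```

Because no values are missing here, the round trip gives back a frame with the same labels and values as the original.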

When a dataset contains missing values, stack() lets you choose whether to drop them (dropna=True) or keep them as NaN (dropna=False). Here is an example of both settings.

# putting NaN into the DataFrame

In [11]: data.loc[('cust_2', slice(None)), 'prd_4'] = np.nan

In [12]: data

Out[12]:

             prd_1 prd_2 prd_3 prd_4
cust_1 2015      0      1      2    3.0
         2016      4      5      6    7.0
cust_2 2015      8      9     10    NaN
         2016     12     13     14    NaN

# stack with 'dropna=False' argument

In [13]: data.stack(dropna=False)

Out[13]:

cust_1 2015 prd_1     0.0
                  prd_2     1.0
                  prd_3     2.0
                  prd_4     3.0
        2016 prd_1     4.0
                  prd_2     5.0
                  prd_3     6.0
                  prd_4     7.0
cust_2 2015 prd_1     8.0
                  prd_2     9.0
                  prd_3    10.0
                  prd_4     NaN
          2016 prd_1    12.0
                  prd_2    13.0
                  prd_3    14.0
                  prd_4     NaN

dtype: float64

# stack with 'dropna=True' argument

In [14]: data.stack(dropna=True) # by default

Out[14]:

cust_1 2015 prd_1     0.0
                  prd_2     1.0
                  prd_3     2.0
                  prd_4     3.0
        2016 prd_1     4.0
                  prd_2     5.0
                  prd_3     6.0
                  prd_4     7.0
cust_2 2015 prd_1     8.0
                  prd_2     9.0
                  prd_3    10.0
          2016 prd_1    12.0
                  prd_2    13.0
                  prd_3    14.0

dtype: float64

Now let's go the other way and use unstack() to spread the tall, stacked dataset out wide, from left to right.

As shown below, the stacked data_stacked dataset has a MultiIndex with three levels. Watch closely which level moves into the columns for each of unstack(level=-1), unstack(level=0), and unstack(level=1).

(2) pd.DataFrame.unstack(level=-1, fill_value=None)

In [15]: data_stacked

Out[15]:

cust_1 2015 prd_1     0
                  prd_2     1
                  prd_3     2
                  prd_4     3
          2016 prd_1     4
                  prd_2     5
                  prd_3     6
                  prd_4     7
cust_2 2015 prd_1     8
                  prd_2     9
                  prd_3    10
                  prd_4    11
          2016 prd_1    12
                  prd_2    13
                  prd_3    14
                  prd_4    15

dtype: int32

In [16]: data_stacked.unstack(level=-1)

Out[16]:

                 prd_1 prd_2 prd_3 prd_4
cust_1 2015      0      1      2      3
       2016      4      5      6      7
cust_2 2015      8      9     10     11
         2016     12     13     14     15

In [17]: data_stacked.unstack(level=0)

Out[17]:

                cust_1 cust_2
2015 prd_1       0       8
     prd_2       1       9
       prd_3       2      10
      prd_4       3      11
2016 prd_1       4      12
     prd_2       5      13
       prd_3       6      14
      prd_4       7      15

In [18]: data_stacked.unstack(level=1)

Out[18]:

                  2015 2016
cust_1 prd_1     0     4
       prd_2     1     5
         prd_3     2     6
         prd_4     3     7
cust_2 prd_1     8    12
       prd_2     9    13
         prd_3    10    14
         prd_4    11    15
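
Level numbers are easy to mix up, so here is a small sketch that selects the level by name instead. The level names cust, year, and prd are hypothetical (the index in this post is unnamed), and the data is a reduced version of the example.

```python
import numpy as np
import pandas as pd

# Hypothetical level names; the index levels in the post have no names.
idx = pd.MultiIndex.from_product(
    [["cust_1", "cust_2"], ["2015", "2016"], ["prd_1", "prd_2"]],
    names=["cust", "year", "prd"],
)
s = pd.Series(np.arange(8), index=idx)

by_year = s.unstack(level="year")  # same result as s.unstack(level=1)
by_cust = s.unstack(level="cust")  # same result as s.unstack(level=0)
```

Named levels keep the code readable when the order of the index levels changes later.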

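The result of unstack() above is already a DataFrame, but the customer and year are still held in a hierarchical row index. Let's move those index levels into ordinary columns.

# unstacking the Series back into a DataFrame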

In [19]: data_stacked_unstacked = data_stacked.unstack(level=-1)

In [20]: data_stacked_unstacked

Out[20]:

                prd_1 prd_2 prd_3 prd_4
cust_1 2015      0      1      2      3
         2016      4      5      6      7
cust_2 2015      8      9     10     11
         2016     12     13     14     15

# converting index to columns

In [21]: data_stacked_unstacked_df = data_stacked_unstacked.reset_index()

# changing columns' name

In [22]: data_stacked_unstacked_df.rename(columns={'level_0' : 'custID',

...: 'level_1' : 'year'}, inplace=True)

...:

In [23]: data_stacked_unstacked_df

Out[23]:

   custID year prd_1 prd_2 prd_3 prd_4
0 cust_1 2015      0      1      2      3
1 cust_1 2016      4      5      6      7
2 cust_2 2015      8      9     10     11
3 cust_2 2016     12     13     14     15
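
As an alternative to renaming 'level_0' and 'level_1' afterwards, the index levels can be named before reset_index(). This is a sketch on the same example data; the names wide and flat are mine.

```python
import numpy as np
import pandas as pd

mul_index = pd.MultiIndex.from_tuples(
    [("cust_1", "2015"), ("cust_1", "2016"), ("cust_2", "2015"), ("cust_2", "2016")]
)
wide = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=mul_index,
    columns=["prd_1", "prd_2", "prd_3", "prd_4"],
)

# Name the index levels first, then move them into columns in one chain.
flat = wide.rename_axis(index=["custID", "year"]).reset_index()
```

This avoids inplace=True and does not depend on the automatically generated 'level_N' names.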

That wraps up reshaping data with stack() and unstack().

The next post will cover reshaping data with melt() and wide_to_long().

I hope you found this helpful.

License: Attribution, NonCommercial, NoDerivatives

Posted by Rfriend

[Python pandas] Reshaping data: data.pivot(), pd.pivot_table(data)

Python Analysis and Programming / Python Data Preprocessing 2016. 12. 23. 23:44

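During analysis, the structure of the raw data often does not match what the analysis technique expects, so you need to swap rows and columns or aggregate by some factor to change the structure.

Starting with this post, this series introduces the Python pandas functions you can use to reshape data, in the following order.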

- (1) pivot(), pd.pivot_table()

- (2) stack(), unstack()

- (3) melt()

- (4) wide_to_long()

- (5) pd.crosstab()

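This post is the first in the series and covers data.pivot() and pd.pivot_table(data).

First, let's import the required modules and build a small example dataset: a DataFrame with four variables, customer ID (cust_id), product code (prod_cd), grade (grade), and purchase amount (pch_amt).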

# importing libraries

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: from pandas import DataFrame

# making an example DataFrame

In [4]: data = DataFrame({'cust_id': ['c1', 'c1', 'c1', 'c2', 'c2', 'c2', 'c3', 'c3', 'c3'],

...: 'prod_cd': ['p1', 'p2', 'p3', 'p1', 'p2', 'p3', 'p1', 'p2', 'p3'],

...: 'grade' : ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'],

...: 'pch_amt': [30, 10, 0, 40, 15, 30, 0, 0, 10]})

...:

In [5]: data

Out[5]:

cust_id grade pch_amt prod_cd
0      c1     A       30      p1
1      c1     A       10      p2
2      c1     A        0      p3
3      c2     A       40      p1
4      c2     A       15      p2
5      c2     A       30      p3
6      c3     B        0      p1
7      c3     B        0      p2
8      c3     B       10      p3

Let's reshape the long dataset above so that customer ID (cust_id) is on the rows, product code (prod_cd) is on the columns, and purchase amount (pch_amt) fills each cell where a row and a column meet. This is easier to see than to describe, so look at the data.pivot(index, columns, values) example below.

(1) Reshaping data: data.pivot(index, columns, values)

# reshaping DataFrame by pivoting

In [6]: data_pivot = data.pivot(index='cust_id', columns='prod_cd', values='pch_amt')

In [7]: data_pivot

Out[7]:

prod_cd p1 p2 p3
cust_id
c1       30 10   0
c2       40 15 30
c3        0   0 10

(2) Reshaping data: pd.pivot_table(data, index, columns, values, aggfunc)

pd.pivot_table() can reshape the data into the same result as data.pivot() above.

# pd.pivot_table(data, index, columns, values, aggfunc)

In [8]: pd.pivot_table(data, index='cust_id', columns='prod_cd', values='pch_amt')

Out[8]:

prod_cd p1 p2 p3
cust_id
c1       30 10   0
c2       40 15 30
c3        0   0 10

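There are a few cases where data.pivot() raises an error and only pivot_table(data) works. If you would rather not memorize several functions, mastering pivot_table() alone is a reasonable approach.

The examples below show, side by side, cases where pivot() fails but pivot_table() succeeds. The errors shown come from the pandas version used when this post was written (2016); newer pandas releases accept a list for index or columns in pivot().

(a) When there are two or more index keys.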

# pivot() with 2 indices :ValueError

In [9]: data.pivot(index=['cust_id', 'grade'], columns='prod_cd', values='pch_amt')

ValueError: Wrong number of items passed 9, placement implies 2

# pd.pivot_table() with 2 indices : works well!

In [10]: pd.pivot_table(data, index=['cust_id', 'grade'], columns='prod_cd', values='pch_amt')

Out[10]:

prod_cd        p1 p2 p3
cust_id grade
c1      A      30 10   0
c2      A      40 15 30
c3      B       0   0 10

(b) When there are two or more column keys.

# pivot() with 2 columns : KeyError

In [11]: data.pivot(index='cust_id', columns=['grade', 'prod_cd'], values='pch_amt')

KeyError: 'Level grade not found'

# pd.pivot_table() with 2 columns : works well!

In [12]: pd.pivot_table(data, index='cust_id', columns=['grade', 'prod_cd'], values='pch_amt')

Out[12]:

grade       A                B
prod_cd    p1    p2    p3   p1   p2    p3
cust_id
c1       30.0 10.0   0.0 NaN NaN   NaN
c2       40.0 15.0 30.0 NaN NaN   NaN
c3        NaN   NaN   NaN 0.0 0.0 10.0

pivot() raises a ValueError when the index and column combination contains duplicate entries. pd.pivot_table(), on the other hand, accepts an aggregation function such as aggfunc=np.sum or aggfunc=np.mean, so duplicate entries are not a problem.

# pivot() with index which contains duplicate entries: ValueError

In [13]: data.pivot(index='grade', columns='prod_cd', values='pch_amt')

ValueError: Index contains duplicate entries, cannot reshape

# pd.pivot_table() with aggfunc : works well!

In [14]: pd.pivot_table(data, index='grade', columns='prod_cd',

...: values='pch_amt', aggfunc=np.sum)

Out[14]:

prod_cd p1 p2 p3
grade
A        70 25 30
B         0   0 10

In [15]: pd.pivot_table(data, index='grade', columns='prod_cd',

...: values='pch_amt', aggfunc=np.mean)

Out[15]:

prod_cd    p1    p2    p3
grade
A        35.0 12.5 15.0
B         0.0   0.0 10.0

# pivot_table(aggfunc=np.mean), by default

In [16]: pd.pivot_table(data, index='grade', columns='prod_cd', values='pch_amt')

Out[16]:

prod_cd    p1    p2    p3
grade
A        35.0 12.5 15.0
B         0.0   0.0 10.0
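
The aggregation step can be sketched as follows. This version passes the aggregation as a string name ('sum', 'mean') rather than the NumPy function, which is my choice here and not what the original session used, and it rebuilds the same totals with groupby() plus unstack() to show what pivot_table() is doing.

```python
import pandas as pd

data = pd.DataFrame({
    "cust_id": ["c1", "c1", "c1", "c2", "c2", "c2", "c3", "c3", "c3"],
    "prod_cd": ["p1", "p2", "p3", "p1", "p2", "p3", "p1", "p2", "p3"],
    "grade":   ["A", "A", "A", "A", "A", "A", "B", "B", "B"],
    "pch_amt": [30, 10, 0, 40, 15, 30, 0, 0, 10],
})

# String names select the built-in sum and mean aggregations.
total = pd.pivot_table(data, index="grade", columns="prod_cd",
                       values="pch_amt", aggfunc="sum")
average = pd.pivot_table(data, index="grade", columns="prod_cd",
                         values="pch_amt", aggfunc="mean")

# The same totals via groupby + unstack.
total_groupby = data.groupby(["grade", "prod_cd"])["pch_amt"].sum().unstack()
```

Seen this way, pivot_table() is a group-by on the index and column keys followed by a reshape, which is why duplicate key combinations are not a problem for it.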

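Setting margins=True in pd.pivot_table() adds an All row and an All column holding the aggregate computed across each row and column (row and column totals when aggfunc is a sum), which is quite convenient.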

# pd.pivot_table : margins=True
# special All columns and rows will be added with partial group aggregates
# across the categories on the rows and columns

In [17]: pd.pivot_table(data, index='grade', columns='prod_cd',

...: values='pch_amt', aggfunc=np.sum, margins=True)

Out[17]:

prod_cd    p1    p2    p3    All
grade
A        70.0 25.0 30.0 125.0
B         0.0   0.0 10.0   10.0
All      70.0 25.0 40.0 135.0

In [18]: pd.pivot_table(data, index='grade', columns='prod_cd',

...: values='pch_amt', aggfunc=np.mean, margins=True)

Out[18]:

prod_cd         p1         p2         p3        All
grade
A        35.000000 12.500000 15.000000 20.833333
B         0.000000   0.000000 10.000000   3.333333
All      23.333333   8.333333 13.333333 15.000000
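
One detail worth checking is how the All entries are computed when aggfunc is a mean. The sketch below, using the same example data, compares them with means taken directly over the underlying rows; the names table, grade_a_mean, and overall_mean are mine.

```python
import pandas as pd

data = pd.DataFrame({
    "cust_id": ["c1", "c1", "c1", "c2", "c2", "c2", "c3", "c3", "c3"],
    "prod_cd": ["p1", "p2", "p3", "p1", "p2", "p3", "p1", "p2", "p3"],
    "grade":   ["A", "A", "A", "A", "A", "A", "B", "B", "B"],
    "pch_amt": [30, 10, 0, 40, 15, 30, 0, 0, 10],
})

table = pd.pivot_table(data, index="grade", columns="prod_cd",
                       values="pch_amt", aggfunc="mean", margins=True)

# Means taken directly over the raw rows, for comparison with the margins.
grade_a_mean = data.loc[data["grade"] == "A", "pch_amt"].mean()  # 125 / 6
p1_mean = data.loc[data["prod_cd"] == "p1", "pch_amt"].mean()    # 70 / 3
overall_mean = data["pch_amt"].mean()                            # 135 / 9
```

The All row for p1 is 70 / 3, the mean of the three raw p1 rows, not the average of the two cell means 35.0 and 0.0.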

That concludes the introduction to reshaping data with data.pivot() and pd.pivot_table(data).

The next post will cover reshaping data with stack() and unstack().

I hope you found this helpful.

[Python] Polynomial feature transformation and interaction terms: sklearn.preprocessing.PolynomialFeatures()

Python Analysis and Programming / Python Data Preprocessing 2016. 12. 21. 23:59

The previous posts covered

- binarization of continuous variables with Python sklearn.preprocessing.Binarizer()

- binarization of categorical variables with Python sklearn.preprocessing.OneHotEncoder()

- discretization of continuous variables with Python np.digitize() and np.where()

This post introduces polynomial feature transformation and interaction-term generation with sklearn.preprocessing.PolynomialFeatures().

In regression analysis, polynomial terms are used to capture non-linear patterns and relationships, and products of variables are used to build features that represent interaction effects.

First, let's import the required modules and build an example array dataset.

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: from sklearn.preprocessing import PolynomialFeatures

In [4]: X = np.arange(6).reshape(3, 2)

In [5]: X

Out[5]:

array([[0, 1],
[2, 3],
[4, 5]])

(1) Creating second-degree polynomial features with sklearn.preprocessing.PolynomialFeatures()

# making degree-2 polynomial features

In [6]: poly = PolynomialFeatures(degree=2)

# transform from (x1, x2) to (1, x1, x2, x1^2, x1*x2, x2^2)

In [7]: poly.fit_transform(X)

Out[7]:

array([[ 1.,   0.,   1.,   0.,   0.,   1.],
       [ 1.,   2.,   3.,   4.,   6.,   9.],
       [ 1.,   4.,   5., 16., 20., 25.]])

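With three variables, there are more combinations of polynomial and interaction terms (the example below again uses degree=2, generating squared terms and pairwise interaction terms). For n input variables and degree d the number of output features is C(n + d, d), so it grows quickly as variables are added and deserves some care.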

In [8]: X2 = np.arange(9).reshape(3, 3)

In [9]: X2

Out[9]:

array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])

# transform from (x1, x2, x3) to (1, x1, x2, x3, x1^2, x1*x2, x1*x3, x2^2, x2*x3, x3^2)

In [10]: poly.fit_transform(X2)

Out[10]:

array([[ 1.,   0.,   1.,   2.,   0.,   0.,   0.,   1.,   2.,   4.],
       [ 1.,   3.,   4.,   5.,   9., 12., 15., 16., 20., 25.],
       [ 1.,   6.,   7.,   8., 36., 42., 48., 49., 56., 64.]])
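
To see where the columns above come from, here is a NumPy-only sketch that builds the same features by multiplying every combination of columns up to the given degree. The function polynomial_features is a hypothetical helper written for illustration; it is not scikit-learn's implementation.

```python
from itertools import combinations_with_replacement
from math import comb

import numpy as np


def polynomial_features(X, degree=2):
    """Return all monomials of the columns of X up to `degree`, bias column first."""
    X = np.asarray(X, dtype=float)
    n_features = X.shape[1]
    columns = []
    for d in range(degree + 1):
        # Each index tuple picks d columns (repeats allowed) to multiply together.
        # The empty tuple (d = 0) yields the constant column of ones.
        for idx in combinations_with_replacement(range(n_features), d):
            columns.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(columns)


X = np.arange(6).reshape(3, 2)
poly = polynomial_features(X, degree=2)

# Number of output columns for n features and degree d is C(n + d, d).
n_out = comb(X.shape[1] + 2, 2)
```

For the 3-column X2 the same function returns 10 columns, matching C(3 + 2, 2) = 10.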

(2) Creating interaction terms only: interaction_only=True

To analyze only interaction effects, without powers of a single variable, set interaction_only=True. The degree argument controls the highest order of interaction to include.

In [11]: X2

Out[11]:

array([[0, 1, 2],

[3, 4, 5],

[6, 7, 8]])

# transform from (x1, x2, x3) to (1, x1, x2, x3, x1*x2, x1*x3, x2*x3)

In [12]: poly_d2 = PolynomialFeatures(degree=2, interaction_only=True)

In [13]: poly_d2.fit_transform(X2)

Out[13]:

array([[ 1.,   0.,   1.,   2.,   0.,   0.,   2.],
       [ 1.,   3.,   4.,   5., 12., 15., 20.],
       [ 1.,   6.,   7.,   8., 42., 48., 56.]])

# transform from (x1, x2, x3) to (1, x1, x2, x3, x1*x2, x1*x3, x2*x3, x1*x2*x3)

In [14]: poly_d3 = PolynomialFeatures(degree=3, interaction_only=True)

In [15]: poly_d3.fit_transform(X2)

Out[15]:

array([[   1.,    0.,    1.,    2.,    0.,    0.,    2.,    0.],
       [   1.,    3.,    4.,    5.,   12.,   15.,   20.,   60.],
       [   1.,    6.,    7.,    8.,   42.,   48.,   56., 336.]])
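
The interaction-only case can be sketched the same way: the only change is that a column may not be multiplied with itself, so combinations without repetition are used. The helper interaction_features is hypothetical and written for illustration only.

```python
from itertools import combinations

import numpy as np


def interaction_features(X, degree=2):
    """Return the bias column, the inputs, and products of distinct columns."""
    X = np.asarray(X, dtype=float)
    columns = []
    for d in range(degree + 1):
        # combinations() never repeats a column, so no squared terms appear.
        for idx in combinations(range(X.shape[1]), d):
            columns.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(columns)


X2 = np.arange(9).reshape(3, 3)
inter_d2 = interaction_features(X2, degree=2)  # 1, x1, x2, x3, x1*x2, x1*x3, x2*x3
inter_d3 = interaction_features(X2, degree=3)  # adds x1*x2*x3 as the last column
```

With three inputs, degree=3 adds exactly one column, the three-way product.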

That concludes the introduction to polynomial feature transformation and interaction-term generation with sklearn.preprocessing.PolynomialFeatures().

The next post will cover reshaping datasets (pivot, reshape).

I hope you found this helpful.

[Python] Discretization of continuous variables: np.digitize(data, bins), pd.get_dummies(), np.where(condition, 'factor1', 'factor2', ...)

Python Analysis and Programming / Python Data Preprocessing 2016. 12. 20. 23:33

The previous posts covered

- binarization of continuous variables with Python sklearn.preprocessing.Binarizer()

- binarization of categorical variables with Python sklearn.preprocessing.OneHotEncoder()

This post covers discretization of continuous variables with Python np.digitize() and np.where().

Binarization means creating a dummy variable that takes only the values '0' and '1', whereas discretization means converting a continuous variable into a variable with two or more categories.

First, let's import the required modules and build the example DataFrame.

#%% discretization of continuous data, binning data

# importing modules

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: from pandas import DataFrame

# setting random seed number

In [4]: np.random.seed(10)

# making DataFrame with continuous values

In [5]: df = DataFrame({'C1': np.random.randn(20),

...: 'C2': ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a',

...: 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b']})

...:

In [6]: df

Out[6]:

         C1   C2
0   1.331587 a
1   0.715279 a
2 -1.545400 a
3 -0.008384 a
4   0.621336 a
5 -0.720086 a
6   0.265512 a
7   0.108549 a
8   0.004291 a
9 -0.174600 a
10 0.433026 b
11 1.203037 b
12 -0.965066 b
13 1.028274 b
14 0.228630 b
15 0.445138 b
16 -1.136602 b
17 0.135137 b
18 1.484537 b
19 -1.079805 b

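(1) Discretizing a continuous variable with np.digitize(data, bins)

Let's discretize the continuous variable 'C1' by splitting the range from its minimum to its maximum evenly and storing the bin number in a new column named 'C1_bin'. np.linspace(min, max, 10) returns 10 evenly spaced edges, which define 9 equal-width intervals; the maximum value itself sits on the last edge and is assigned to bin 10. With np.linspace() and np.digitize() there is no need for long if/else chains.

# making 10 evenly spaced bin edges, from min to max of 'C1' column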

In [7]: bins = np.linspace(df.C1.min(), df.C1.max(), 10)

In [8]: bins

Out[8]:

array([-1.54540029, -1.20874059, -0.87208089, -0.53542119, -0.19876149,

0.1378982 , 0.4745579 , 0.8112176 , 1.1478773 , 1.484537 ])

# making digitized column using np.digitize(data, bins)

In [9]: df['C1_bin'] = np.digitize(df['C1'], bins)

In [10]: df

Out[10]:

          C1    C2 C1_bin
0   1.331587 a       9
1   0.715279 a       7
2 -1.545400 a       1
3 -0.008384 a       5
4   0.621336 a       7
5 -0.720086 a       3
6   0.265512 a       6
7   0.108549 a       5
8   0.004291 a       5
9 -0.174600 a       5
10 0.433026 b       6
11 1.203037 b       9
12 -0.965066 b       2
13 1.028274 b       8
14 0.228630 b       6
15 0.445138 b       6
16 -1.136602 b       2
17 0.135137 b       5
18 1.484537 b      10
19 -1.079805 b       2
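
The edge behaviour of np.digitize() is easier to see on a tiny array. The sketch below uses made-up values (not the post's random data) and also shows one possible way to get exactly k equal-width bins by passing only the interior edges.

```python
import numpy as np

x = np.array([0.0, 0.2, 0.5, 0.99, 1.0])

# 3 edges define 2 intervals, but the maximum lands in an extra bin of its own,
# because each bin is closed on the left: edges[i-1] <= x < edges[i].
edges = np.linspace(x.min(), x.max(), 3)   # [0. , 0.5, 1. ]
codes = np.digitize(x, edges)              # [1, 1, 2, 2, 3]

# For exactly k equal-width bins, build k + 1 edges and pass only the
# interior ones, so the codes run from 0 to k - 1.
k = 2
interior = np.linspace(x.min(), x.max(), k + 1)[1:-1]  # [0.5]
codes_k = np.digitize(x, interior)                     # [0, 0, 1, 1, 1]
```

In the post's output the same effect is visible in row 18, where the maximum of 'C1' is the only member of bin 10.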

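You may wonder what discretized variables are good for. A few examples: computing summary statistics per category (or factor), testing for differences in means or independence between categories, using the result as the target variable of a classification model, or using it for indexing.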

# aggregation with groupby()

In [11]: df.groupby('C1_bin')['C1'].size()

Out[11]:

C1_bin
1     1
2     3
3     1
5     5
6     4
7     2
8     1
9     2
10    1
dtype: int64

# mean by 'C1_bin' groups

In [12]: df.groupby('C1_bin')['C1'].mean()

Out[12]:

C1_bin
1    -1.545400
2    -1.060491
3    -0.720086
5     0.012999
6     0.343076
7     0.668307
8     1.028274
9     1.267312
10    1.484537
Name: C1, dtype: float64

# standard deviation by 'C1_bin' groups

In [13]: df.groupby('C1_bin')['C1'].std()

Out[13]:

C1_bin
1          NaN
2     0.087384
3          NaN
5     0.122243
6     0.111985
7     0.066428
8          NaN
9     0.090898
10         NaN
Name: C1, dtype: float64

# value counts by 'C1_bin' groups

In [14]: df.groupby('C1_bin')['C2'].value_counts()

Out[14]:

C1_bin C2
1          a     1
2          b     3
3          a     1
5          a     4
          b     1
6          b     3
           a     1
7          a     2
8          b     1
9          a     1
         b     1
10        b     1
Name: C2, dtype: int64

# indexing

In [15]: df_bin2 = df[df['C1_bin'] == 2]

In [16]: df_bin2

Out[16]:

          C1 C2 C1_bin
12 -0.965066 b       2
16 -1.136602 b       2
19 -1.079805 b       2

(2) Creating dummy variables with pd.get_dummies()

Let's create dummy variables from the new categorical variable 'C1_bin' using pd.get_dummies().

The prefix option adds a common prefix to the dummy variable names.

Setting drop_first=True automatically drops the first dummy variable, which avoids the dummy variable trap.

# get dummy variables with prefix from a categorical variable

In [17]: pd.get_dummies(df['C1_bin'], prefix='C1')

Out[17]:

    C1_1 C1_2 C1_3 C1_5 C1_6 C1_7 C1_8 C1_9 C1_10
0    0.0   0.0   0.0   0.0   0.0   0.0   0.0   1.0    0.0
1    0.0   0.0   0.0   0.0   0.0   1.0   0.0   0.0    0.0
2    1.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0    0.0
3    0.0   0.0   0.0   1.0   0.0   0.0   0.0   0.0    0.0
4    0.0   0.0   0.0   0.0   0.0   1.0   0.0   0.0    0.0
5    0.0   0.0   1.0   0.0   0.0   0.0   0.0   0.0    0.0
6    0.0   0.0   0.0   0.0   1.0   0.0   0.0   0.0    0.0
7    0.0   0.0   0.0   1.0   0.0   0.0   0.0   0.0    0.0
8    0.0   0.0   0.0   1.0   0.0   0.0   0.0   0.0    0.0
9    0.0   0.0   0.0   1.0   0.0   0.0   0.0   0.0    0.0
10   0.0   0.0   0.0   0.0   1.0   0.0   0.0   0.0    0.0
11   0.0   0.0   0.0   0.0   0.0   0.0   0.0   1.0    0.0
12   0.0   1.0   0.0   0.0   0.0   0.0   0.0   0.0    0.0
13   0.0   0.0   0.0   0.0   0.0   0.0   1.0   0.0    0.0
14   0.0   0.0   0.0   0.0   1.0   0.0   0.0   0.0    0.0
15   0.0   0.0   0.0   0.0   1.0   0.0   0.0   0.0    0.0
16   0.0   1.0   0.0   0.0   0.0   0.0   0.0   0.0    0.0
17   0.0   0.0   0.0   1.0   0.0   0.0   0.0   0.0    0.0
18   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0    1.0
19   0.0   1.0   0.0   0.0   0.0   0.0   0.0   0.0    0.0

# drop_first : Whether to get k-1 dummies out of k categorical levels
# by removing the first level to avoid dummy trap

In [18]: pd.get_dummies(df['C1_bin'], prefix='C1', drop_first=True)

Out[18]:

    C1_2 C1_3 C1_5 C1_6 C1_7 C1_8 C1_9 C1_10
0    0.0   0.0   0.0   0.0   0.0   0.0   1.0    0.0
1    0.0   0.0   0.0   0.0   1.0   0.0   0.0    0.0
2    0.0   0.0   0.0   0.0   0.0   0.0   0.0    0.0
3    0.0   0.0   1.0   0.0   0.0   0.0   0.0    0.0
4    0.0   0.0   0.0   0.0   1.0   0.0   0.0    0.0
5    0.0   1.0   0.0   0.0   0.0   0.0   0.0    0.0
6    0.0   0.0   0.0   1.0   0.0   0.0   0.0    0.0
7    0.0   0.0   1.0   0.0   0.0   0.0   0.0    0.0
8    0.0   0.0   1.0   0.0   0.0   0.0   0.0    0.0
9    0.0   0.0   1.0   0.0   0.0   0.0   0.0    0.0
10   0.0   0.0   0.0   1.0   0.0   0.0   0.0    0.0
11   0.0   0.0   0.0   0.0   0.0   0.0   1.0    0.0
12   1.0   0.0   0.0   0.0   0.0   0.0   0.0    0.0
13   0.0   0.0   0.0   0.0   0.0   1.0   0.0    0.0
14   0.0   0.0   0.0   1.0   0.0   0.0   0.0    0.0
15   0.0   0.0   0.0   1.0   0.0   0.0   0.0    0.0
16   1.0   0.0   0.0   0.0   0.0   0.0   0.0    0.0
17   0.0   0.0   1.0   0.0   0.0   0.0   0.0    0.0
18   0.0   0.0   0.0   0.0   0.0   0.0   0.0    1.0
19   1.0   0.0   0.0   0.0   0.0   0.0   0.0    0.0
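
Here is a compact sketch of the same two calls on five made-up bin codes, small enough to check by eye. Passing dtype=int is my addition so that the columns hold 0/1 integers whatever the default output type of the installed pandas version is.

```python
import pandas as pd

bins = pd.Series([9, 7, 1, 5, 7], name="C1_bin")

# dtype=int gives 0/1 columns regardless of the pandas default.
full = pd.get_dummies(bins, prefix="C1", dtype=int)
reduced = pd.get_dummies(bins, prefix="C1", dtype=int, drop_first=True)
```

In reduced, the row whose code is 1 (the dropped first category) is all zeros, so that category becomes the baseline.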

(3) Discretizing a continuous variable with np.where(condition, factor1, factor2, ...)

np.where() lets you specify the condition more flexibly when discretizing or categorizing. Let's create a new variable 'high_low' that is 'high' when the continuous variable 'C1' is at or above its mean and 'low' when it is below the mean, and then compute summary statistics grouped by 'high_low'.

# discretization using np.where(condition, factor1, factor2, ...)

In [17]: df['high_low'] = np.where(df['C1'] >= df.C1.mean(), 'high', 'low')

In [18]: df

Out[18]:

          C1 C2 C1_bin high_low
0   1.331587 a       9     high
1   0.715279 a       7     high
2 -1.545400 a       1      low
3 -0.008384 a       5      low
4   0.621336 a       7     high
5 -0.720086 a       3      low
6   0.265512 a       6     high
7   0.108549 a       5      low
8   0.004291 a       5      low
9 -0.174600 a       5      low
10 0.433026 b       6     high
11 1.203037 b       9     high
12 -0.965066 b       2      low
13 1.028274 b       8     high
14 0.228630 b       6     high
15 0.445138 b       6     high
16 -1.136602 b       2      low
17 0.135137 b       5     high
18 1.484537 b      10     high
19 -1.079805 b       2      low

In [19]: df.groupby('high_low')['C1'].size()

Out[19]:

high_low

high 11

low 9

dtype: int64

In [20]: df.groupby('high_low')['C1'].mean()

Out[20]:

high_low

high 0.717408

low -0.613011

Name: C1, dtype: float64

In [21]: df.groupby('high_low')['C1'].std()

Out[21]:

high_low

high 0.473769

low 0.607895

Name: C1, dtype: float64

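Now let's make the condition in np.where(condition, ...) a little more complex and split the data into '01_high', '02_medium', and '03_low', using Q1 (the first quartile, 25%) and Q3 (the third quartile, 75%) as cut points.

The nested conditions inside the parentheses are a little tricky, so read them carefully.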

# calculating Q1, Q3

In [22]: Q1 = np.percentile(df['C1'], 25)

In [23]: Q1

Out[23]: -0.31097154812443017

In [24]: Q3 = np.percentile(df['C1'], 75)

In [25]: Q3

Out[25]: 0.64482172401746174

# discretizing 3 categories by using np.where() 2 times

In [26]: df['h_m_l'] = np.where(df['C1'] >= Q3, '01_high',

...: np.where(df['C1'] >= Q1, '02_medium', '03_low'))

...:

In [27]: df

Out[27]:

          C1 C2 C1_bin high_low      h_m_l
0   1.331587 a       9     high      01_high
1   0.715279 a       7     high      01_high
2 -1.545400 a       1      low      03_low
3 -0.008384 a       5      low      02_medium
4   0.621336 a       7     high      02_medium
5 -0.720086 a       3      low      03_low
6   0.265512 a       6     high      02_medium
7   0.108549 a       5      low      02_medium
8   0.004291 a       5      low      02_medium
9 -0.174600 a       5      low      02_medium
10 0.433026 b       6     high     02_medium
11 1.203037 b       9     high     01_high
12 -0.965066 b       2      low     03_low
13 1.028274 b       8     high     01_high
14 0.228630 b       6     high     02_medium
15 0.445138 b       6     high     02_medium
16 -1.136602 b       2      low     03_low
17 0.135137 b       5     high     02_medium
18 1.484537 b      10     high    01_high
19 -1.079805 b       2      low     03_low

Summary statistics grouped by the new three-category variable 'h_m_l' are shown below.

In [28]: df.groupby('h_m_l')['C1'].size()

Out[28]:

h_m_l
01_high       5
02_medium    10
03_low        5

dtype: int64

In [29]: df.groupby('h_m_l')['C1'].mean()

Out[29]:

h_m_l
01_high      1.152543
02_medium    0.205863
03_low      -1.089392

Name: C1, dtype: float64

In [30]: df.groupby('h_m_l')['C1'].std()

Out[30]:

h_m_l
01_high      0.296424
02_medium    0.242969
03_low       0.300877

Name: C1, dtype: float64
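
When the nesting gets deeper, np.select() is one alternative that lists the conditions in order instead of nesting them. This is a sketch on eight made-up values, not the post's data, and np.select() is my substitution for the nested np.where() used above.

```python
import numpy as np

c1 = np.array([1.33, 0.72, -1.55, -0.01, 0.62, -0.72, 0.27, 0.11])
q1, q3 = np.percentile(c1, [25, 75])

# Nested np.where, as in the post.
nested = np.where(c1 >= q3, "01_high",
                  np.where(c1 >= q1, "02_medium", "03_low"))

# np.select checks the conditions in order; the first one that matches wins.
selected = np.select([c1 >= q3, c1 >= q1],
                     ["01_high", "02_medium"],
                     default="03_low")
```

Both arrays hold the same labels; the np.select() form stays flat however many categories are added.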

That concludes this post on discretization.

The next post will cover polynomial variable transformation.

I hope you found this helpful.

[Python] Binarization of categorical variables: sklearn.preprocessing.OneHotEncoder()

Python Analysis and Programming / Python Data Preprocessing 2016. 12. 18. 21:38

The previous post showed how to convert a continuous variable into a binary variable with Python sklearn.preprocessing.Binarizer().

This post shows how to binarize categorical features with Python sklearn.preprocessing.OneHotEncoder().

Suppose gender is encoded as '0' for male and '1' for female,

age group is encoded as '0' for 20s, '1' for 30s, and '2' for 40s,

and grade is encoded as '0' for 'S', '1' for 'A', '2' for 'B', '3' for 'C', and '4' for 'D'.

Binarization turns each of these into dummy variables that take only the values '0' and '1' (the original post illustrated this with an image). OneHotEncoder() automatically detects the categories of each variable and how many there are, and binarizes them for you, which is very convenient.

Let's walk through sklearn.preprocessing.OneHotEncoder() using the example data described above.

First, let's import the required modules and build the example data array.

# importing modules

In [1]: from sklearn.preprocessing import OneHotEncoder

In [2]: import numpy as np

# making an example data array

In [3]: data_train = np.array([[0, 0, 0],

...: [0, 1, 1],

...: [0, 2, 2],

...: [1, 0, 3],

...: [1, 1, 4]])

...:

In [4]: data_train

Out[4]:

array([[0, 0, 0],
       [0, 1, 1],
       [0, 2, 2],
       [1, 0, 3],
       [1, 1, 4]])

(1) Fitting OneHotEncoder() to the categorical variables: enc.fit()

# making the utility class OneHotEncoder

In [5]: enc= OneHotEncoder()

# fitting OneHotEncoder

In [6]: enc.fit(data_train)

Out[6]:

OneHotEncoder(categorical_features='all', dtype=<class 'float'>,

handle_unknown='error', n_values='auto', sparse=True)

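(2) Inspecting the attributes of the fitted OneHotEncoder()

: enc.active_features_ , enc.n_values_ , enc.feature_indices_

Comparing the attribute values below with the 'before' (categorical codes) and 'after' (binarized) layout described above should make them easier to understand. These attributes belong to the scikit-learn version used in this post; later releases removed them in favor of enc.categories_.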

# Attributes : active_features_
# - Indices for active features, actually occur in the training set

In [7]: enc.active_features_

Out[7]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int64)

# Note: [male, female, 20s, 30s, 40s, S, A, B, C, D]

# Attributes : n_values_
# - Number of values per feature

In [8]: enc.n_values_

Out[8]: array([2, 3, 5])

# Note: [gender has 2 categories, age group has 3, grade has 5]

# Attributes : feature_indices_
# - Indices to feature ranges

In [9]: enc.feature_indices_

Out[9]: array([ 0, 2, 5, 10], dtype=int32)

# Note: [gender in columns 0 to 1, age group in columns 2 to 4, grade in columns 5 to 9]

(3) Binarizing a new categorical dataset with the fitted OneHotEncoder()

For a new customer with gender 'female (1)', age group '40s (2)', and grade 'D (4)', let's binarize the categorical attributes using enc.transform(new data).toarray() on the OneHotEncoder() fitted in step (1).

# new data : female, age_group 40s, D grade

In [10]: data_new = np.array([[1, 2, 4]])

# applying OneHotEncoder to new data, returning array

In [11]: enc.transform(data_new).toarray()

Out[11]: array([[ 0., 1., 0., 0., 1., 0., 0., 0., 0., 1.]])
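
The link between the attributes in step (2) and the output row in step (3) can be reproduced with plain NumPy. In this sketch one_hot is a hypothetical helper, and it assumes the category codes in each column run from 0 up to the column maximum, as they do in data_train.

```python
import numpy as np

data_train = np.array([[0, 0, 0], [0, 1, 1], [0, 2, 2], [1, 0, 3], [1, 1, 4]])

# Number of categories per column, assuming codes run 0..max in each column.
n_values = data_train.max(axis=0) + 1                         # [2, 3, 5]
# Start position of each feature's block of columns in the output row.
feature_indices = np.concatenate([[0], np.cumsum(n_values)])  # [0, 2, 5, 10]


def one_hot(rows):
    rows = np.asarray(rows)
    out = np.zeros((rows.shape[0], feature_indices[-1]))
    # Column to switch on = block start + category code.
    cols = feature_indices[:-1] + rows
    out[np.arange(rows.shape[0])[:, None], cols] = 1.0
    return out


encoded = one_hot([[1, 2, 4]])  # female, 40s, grade D
```

For the new customer the active columns are 0 + 1, 2 + 2, and 5 + 4, that is positions 1, 4, and 9, which is the row shown in Out[11].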

The next post will cover discretization in Python, which splits a continuous variable into multiple categories.

I hope you found this helpful.

[Python] Binarization: sklearn.preprocessing.Binarizer(), sklearn.preprocessing.binarize()

Python Analysis and Programming / Python Data Preprocessing 2016. 12. 17. 16:54

지난번 포스팅에서는 Python sklearn.preprocessing.MinMaxScaler()를 사용해서 연속형 변수를 '최소~최대' 값이 '0~1' 사이 범위의 연속형 값을 가지도록 변환하는 [0~1] 변환에 대해서 알아보았습니다.

이번 포스팅에서는 Python sklearn.preprocessing.Binarizer()를 사용해서 연속형 변수를 특정 기준값 이하(equal or less the threshold)이면 '0', 특정 기준값 초과(above the threshold)이면 '1'의 두 개의 값만을 가지는 변수로 변환하는 방법을 소개하겠습니다.

확률변수 X가 이항분포(binomial distribution)를 따른다고 했을 때 '0' 또는 '1'의 값만을 가지는 이항변수화가 필요합니다. 참고로, 범주형 자료에 대한 회귀분석이나 연관성 분석, 텍스트 마이닝을 할 때도 '0'과 '1'의 값을 가지는 가변수(dummy variable)를 만들어서 분석하기도 합니다.

어떤 실험을 반복해서 시행한다고 했을 때 각 시행마다 "성공(success, 1)" 또는 "실패(failure, 0)"의 두 가지 경우의 수만 나온다고 할 때, 우리는 이런 시행을 "베르누이 시행(Bernoulli trial)"이라고 합니다.

그리고 성공확률이 p인 베르누이 시행을 n번 반복했을 때 성공하는 횟수를 X라 하면, 확률변수 X는 모수 n과 p인 이항분포(Binomial distributio)을 따른다고 합니다.
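The relationship between Bernoulli trials and the binomial count can be sketched in a few lines of NumPy. The values of n and p and the seed below are arbitrary assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for reproducibility
n, p = 10, 0.3                       # hypothetical number of trials and success probability

# Each Bernoulli trial yields 1 (success) with probability p, else 0 (failure)
trials = (rng.random(n) < p).astype(int)

# The number of successes in n trials is one draw from Binomial(n, p)
x = trials.sum()
print(trials, x)
```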

First, let's import the required modules and create a very simple example array.

# importing modules

In [1]: import numpy as np

In [2]: from sklearn.preprocessing import Binarizer

# making a training data array

In [3]: X = np.array([[ 10., -10., 1.],

...: [ 5., 0., 2.],

...: [ 0., 10., 3.]])

...:

In [4]: X

Out[4]:

array([[ 10., -10.,   1.],
        [ 5.,   0.,   2.],
      [ 0., 10.,   3.]])

(1) Binarization with the sklearn.preprocessing.Binarizer() class

# making the utility class binarizer

In [5]: binarizer = Binarizer().fit(X)

# threshold=0.0 by default
In [6]: binarizer

Out[6]: Binarizer(copy=True, threshold=0.0)

# Feature values below or equal to the threshold are replaced by 0, above it by 1

In [7]: binarizer.transform(X)

Out[7]:

array([[ 1., 0., 1.],
[ 1., 0., 1.],
[ 0., 1., 1.]])

Now let's change the binarization threshold from the default '0.0' to '2.0'.

In [8]: binarizer = Binarizer(threshold=2.0)

In [9]: X

Out[9]:

array([[ 10., -10.,   1.],
        [ 5.,   0.,   2.],
      [ 0., 10.,   3.]])

In [10]: binarizer.transform(X)

Out[10]:

array([[ 1., 0., 0.],
[ 1., 0., 0.],
[ 0., 1., 1.]])
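The rule applied above (values above the threshold become 1, everything else 0) can be checked with plain NumPy. This is a sketch of the rule only; `binarize_manual` is a hypothetical helper, not the scikit-learn implementation.

```python
import numpy as np

X = np.array([[10., -10., 1.],
              [ 5.,   0., 2.],
              [ 0.,  10., 3.]])

def binarize_manual(arr, threshold=0.0):
    """Return 1.0 where arr > threshold, else 0.0 (hypothetical helper)."""
    # Strict comparison: a value equal to the threshold maps to 0
    return (arr > threshold).astype(float)

b0 = binarize_manual(X)                 # default threshold 0.0
b2 = binarize_manual(X, threshold=2.0)  # threshold raised to 2.0
print(b0)
print(b2)
```

Note that the 2.0 in the third column maps to 0 when the threshold is 2.0, because the comparison is strict.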

(2) Binarization with the sklearn.preprocessing.binarize() function

The sklearn.preprocessing module also provides the binarize() function, which binarizes without going through the Transformer API.

# sklearn.preprocessing.binarize function which is used without transformer API
# sklearn.preprocessing.binarize(X, threshold=0.0, copy=True)

In [11]: from sklearn.preprocessing import binarize

In [12]: X

Out[12]:

array([[ 10., -10.,   1.],
       [ 5.,   0.,   2.],
       [ 0., 10.,   3.]])

In [13]: binarize(X)

Out[13]:

array([[ 1., 0., 1.],
[ 1., 0., 1.],
[ 0., 1., 1.]])

Let's change the binarization threshold from the default '0.0' to '2.0'.

# adjusting the threshold of the binarizer

In [14]: binarize(X, threshold=2.0)

Out[14]:

array([[ 1., 0., 0.],
[ 1., 0., 0.],
[ 0., 1., 1.]])

# original data is not replaced by binarize() (compare with the 'copy=False' example below)
In [15]: X

Out[15]:

array([[ 10., -10.,   1.],
      [ 5.,   0.,   2.],
       [ 0., 10.,   3.]])

When binarizing with binarize(), the copy option defaults to 'True': the original data are left untouched and the binarized values are returned.

With 'copy=False', as in the example below, the original data themselves are overwritten with the binarized values.

# set to False to perform inplace binarization and avoid a copy

In [16]: binarize(X, threshold=2.0, copy=False)

Out[16]:

array([[ 1., 0., 0.],
[ 1., 0., 0.],
[ 0., 1., 1.]])

# oops, original data has been changed by binarizer
In [17]: X

Out[17]:

array([[ 1., 0., 0.],
[ 1., 0., 0.],
[ 0., 1., 1.]])

In the next post I will introduce discretization and binarization of categorical variables.

I hope this was helpful.


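[Python] Scaling to the 0~1 range with min and max : sklearn.preprocessing.MinMaxScaler()

Python Analysis and Programming / Python Data Preprocessing, 2016. 12. 16. 23:30

In the previous posts I introduced two ways to standardize variables whose scales differ so that they can be compared:

- z standardization of normally distributed data to the standard normal distribution

(using the mean and standard deviation)

- scaling data with outliers

(using the median and IQR (InterQuartile Range))

In this post we look at the '0~1 transformation', which uses the minimum (Min) and maximum (Max) to rescale data into the range 0 to 1. You may wonder where this is used: it is commonly applied when preparing input variables for artificial neural networks and deep learning, which have drawn a lot of attention recently.

For the '0~1' range transformation, scikit-learn provides

- the sklearn.preprocessing.MinMaxScaler() class and

- the sklearn.preprocessing.minmax_scale() function.

First, let's import the required modules and create the array used in the examples.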

# importing modules

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: from sklearn.preprocessing import MinMaxScaler

# making a training data array

In [4]: X = np.array([[ 10., -10., 1.],

...: [ 5., 0., 2.],

...: [ 0., 10., 3.]])

...:

In [5]: X

Out[5]:

array([[ 10., -10.,   1.],
        [ 5.,   0.,   2.],
        [ 0., 10.,   3.]])

Let's go through three ways of doing the '0~1' transformation, one at a time.

(1) Computing the min and max and rescaling to the '0~1' range by hand

In [5]: X

Out[5]:

array([[ 10., -10.,   1.],
        [ 5.,   0.,   2.],
        [ 0., 10.,   3.]])

# Scaling features to [0-1] range using the formula

In [6]: X_MinMax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

In [7]: X_MinMax

Out[7]:

array([[ 1. , 0. , 0. ],
[ 0.5, 0.5, 0.5],
[ 0. , 1. , 1. ]])

(2) Min-max '0~1' scaling with the sklearn.preprocessing.MinMaxScaler() class

We fit the '0~1' transformation on the training set through the transformer API, then apply the fitted min-max model to a test set (new data).

In [5]: X

Out[5]:

array([[ 10., -10.,   1.],
        [ 5.,   0.,   2.],
        [ 0., 10.,   3.]])

# min_max_scaler training using the transformer API

In [8]: min_max_scaler = MinMaxScaler()

In [9]: X_MinMax_train = min_max_scaler.fit_transform(X)

In [10]: X_MinMax_train

Out[10]:

array([[ 1. , 0. , 0. ],
[ 0.5, 0.5, 0.5],
[ 0. , 1. , 1. ]])

# making new test set

In [11]: X_new = np.array([[9., -10., 1.],

...: [5., -5., 3.],

...: [1., 0., 5.]])

In [12]: X_new

Out[12]:

array([[ 9., -10.,   1.],
       [ 5., -5.,   3.],
       [ 1.,   0.,   5.]])

# applying min_max_scaler to test set (new data)

In [13]: X_MinMax_new = min_max_scaler.transform(X_new)

In [14]: X_MinMax_new

Out[14]:

array([[ 0.9 , 0. , 0. ],
[ 0.5 , 0.25, 1. ],
[ 0.1 , 0.5 , 2. ]])
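The last column of the output above reaches 2.0, which is outside the 0~1 range. The NumPy sketch below shows why, under the assumption that transforming new data simply reuses the minimum and range learned from the training set (the variable names are illustrative).

```python
import numpy as np

X_train = np.array([[10., -10., 1.],
                    [ 5.,   0., 2.],
                    [ 0.,  10., 3.]])
X_new = np.array([[9., -10., 1.],
                  [5.,  -5., 3.],
                  [1.,   0., 5.]])

# "Fit": remember the per-column minimum and range of the training data
col_min = X_train.min(axis=0)
col_range = X_train.max(axis=0) - col_min

# "Transform": rescale new data with the training statistics, not its own
X_new_scaled = (X_new - col_min) / col_range
print(X_new_scaled)
```

Because 5 lies above the training maximum of 3 in the third column, it maps to (5 - 1) / (3 - 1) = 2.0. New data are only guaranteed to fall within 0~1 when they stay inside the training range.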

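(3) Min-max '0~1' scaling with the sklearn.preprocessing.minmax_scale() function

This is slightly more convenient than using the MinMaxScaler() class, and the result is the same. Note that minmax_scale() does not keep the fitted minimum and maximum, so it cannot be reused to transform new data the way min_max_scaler.transform() was in (2).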

In [15]: X

Out[15]:

array([[ 10., -10.,   1.],
        [ 5.,   0.,   2.],
        [ 0., 10.,   3.]])

# importing minmax_scale function from sklearn.preprocessing

In [16]: from sklearn.preprocessing import minmax_scale

# scaling X to min~max '0~1' range

In [17]: X_MinMax_scaled = minmax_scale(X, axis=0, copy=True)

In [18]: X_MinMax_scaled

Out[18]:

array([[ 1. , 0. , 0. ],
[ 0.5, 0.5, 0.5],
[ 0. , 1. , 1. ]])

In the next post we will look at binarization.

I hope this was helpful.


[Python] Scaling data with outliers

Python Analysis and Programming / Python Data Preprocessing, 2016. 12. 15. 23:40

In the previous post we used zscore(), StandardScaler(), and so on to standardize variables measured on different scales to mean 0 and standard deviation 1, as in the standard normal distribution X ~ N(0, 1).

However, standardizing this way assumes that the data contain no outliers. The formula is z = (x - mean) / std, and both the mean and the standard deviation are highly sensitive to outliers.

So what should we do when the data contain outliers?

The first approach is to find and remove the outliers, then standardize and proceed with analysis and modeling. Finding and removing outliers takes real effort. For parametric modeling such as regression analysis, modeling after removing outliers is the appropriate approach.

The second approach is to standardize the scale using the median and the IQR (Inter-Quartile Range), which are less sensitive to outliers. For non-parametric modeling such as K-NN, the second approach is worth trying.

This post is about the second approach.

The star of this post is RobustScaler(), but to make it easier to see why it is needed and how it differs from standardizing with the mean and standard deviation, I will compare, on the same example data containing outliers, two scalers from Python's sklearn.preprocessing:

(1) StandardScaler(), which standardizes with (x - mean) / std, and

(2) RobustScaler(), which standardizes with (x - median) / IQR.

First, let's import the required modules and create example data containing outliers.

# importing modules

In [1]: import numpy as np

In [2]: from sklearn.preprocessing import StandardScaler, RobustScaler

In [3]: import matplotlib.pyplot as plt

In [4]: import pandas as pd

# setting the number of digits of precision for floating point output

In [5]: np.set_printoptions(precision=2)

# setting random seed number

In [6]: np.random.seed(10)

# making 100 random values, x ~ N(mean=10, std=2)

In [7]: mu, sigma = 10, 2

In [8]: x = mu + sigma*np.random.randn(100)

In [9]: x

Out[9]:

array([ 12.66, 11.43,   6.91,   9.98, 11.24,   8.56, 10.53, 10.22,
        10.01,   9.65, 10.87, 12.41,   8.07, 12.06, 10.46, 10.89,
         7.73, 10.27, 12.97,   7.84,   6.04,   6.51, 10.53, 14.77,
        12.25, 13.35, 10.2 , 12.8 ,   9.46, 11.23,   9.47,   8.9 ,
        10.27,   9.05, 12.62, 10.39, 10.8 ,   9.32, 12.51,   8.54,
        11.32,   9.3 ,   8.12,   9.02,   8.39,   9.57,   9.32, 10.62,
        11.13,   9.71,   9.95, 10.58,   8.92, 11.42, 11.68, 10.41,
        14.79, 11.83,   9.78,   9.28,   9.54,   9. , 12.26,   8.6 ,
         9.84,   8.94, 12.09,   7.16,   9.28,   9.76, 10.64, 10.92,
         9.57, 11.98, 10.63, 14.94,   6.98, 11.24,   7.91,   8.4 ,
        13.97, 13.49,   6.29,   9.55,   9.87,   5.74,   9.9 , 10.79,
        10.43,   6.01, 12.22, 10.49,   9.88,   8.49, 11.42, 11.84,
         9.04, 10.18, 11.65,   6.09])

# plotting histogram

In [10]: plt.hist(x)

Out[10]:

(array([ 6., 3., 8., 15., 22., 20., 11., 9., 3., 3.]),

array([ 5.74, 6.66, 7.58, 8.5 , 9.42, 10.34, 11.26, 12.18,

13.1 , 14.02, 14.94]),

<a list of 10 Patch objects>)

# checking mean, std

In [11]: np.mean(x) # 10.16 (about 10)

Out[11]: 10.158833325873747

In [12]: np.std(x) # 1.93 (about 2)

Out[12]: 1.9340789542274115

# inserting outliers

In [13]: x[98:100] = 100

In [14]: x

Out[14]:

array([ 12.66,   11.43,    6.91,    9.98,   11.24,    8.56,   10.53,
         10.22,   10.01,    9.65,   10.87,   12.41,    8.07,   12.06,
         10.46,   10.89,    7.73,   10.27,   12.97,    7.84,    6.04,
          6.51,   10.53,   14.77,   12.25,   13.35,   10.2 ,   12.8 ,
          9.46,   11.23,    9.47,    8.9 ,   10.27,    9.05,   12.62,
         10.39,   10.8 ,    9.32,   12.51,    8.54,   11.32,    9.3 ,
          8.12,    9.02,    8.39,    9.57,    9.32,   10.62,   11.13,
          9.71,    9.95,   10.58,    8.92,   11.42,   11.68,   10.41,
         14.79,   11.83,    9.78,    9.28,    9.54,    9. ,   12.26,
          8.6 ,    9.84,    8.94,   12.09,    7.16,    9.28,    9.76,
         10.64,   10.92,    9.57,   11.98,   10.63,   14.94,    6.98,
         11.24,    7.91,    8.4 ,   13.97,   13.49,    6.29,    9.55,
          9.87,    5.74,    9.9 ,   10.79,   10.43,    6.01,   12.22,
         10.49,    9.88,    8.49,   11.42,   11.84,    9.04,   10.18,
        100. , 100. ])

# checking change of mean and std

In [15]: np.mean(x) # 11.98

Out[15]: 11.981383595820532

In [16]: np.std(x) # 12.71

Out[16]: 12.714552555982538

# plotting histogram to see the change of distribution, especially by outliers

In [17]: plt.hist(x, bins=np.arange(0, 102, 2))

Out[17]:

(array([ 0.,   0.,   1., 10., 36., 34., 14.,   3.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   2.]),
array([ 0,   2,   4,   6,   8, 10, 12, 14, 16, 18, 20, 22, 24,
         26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50,
         52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76,
         78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98, 100]),

<a list of 50 Patch objects>)

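(1) Standardizing data with outliers using the mean and standard deviation :
sklearn.preprocessing.StandardScaler()

# reshape the 1-D array into a 2-D column array of shape (n_samples, 1), as the scalers expect

In [18]: x = x.reshape(-1, 1)

# first 10 rows of the 2-D numpy.ndarray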

In [19]: x[0:10]

Out[19]:

array([[ 12.66],
       [ 11.43],
       [ 6.91],
       [ 9.98],
       [ 11.24],
       [ 8.56],
       [ 10.53],
       [ 10.22],
       [ 10.01],
       [ 9.65]])

# By StandardScaler()

In [20]: x_StandardScaler = StandardScaler().fit_transform(x)

In [21]: x_StandardScaler

Out[21]:
array([[ 5.36e-02],
       [ -4.33e-02],
       [ -3.99e-01],
       [ -1.57e-01],
       [ -5.81e-02],
       [ -2.69e-01],
       [ -1.14e-01],
       [ -1.39e-01],
       [ -1.55e-01],
       .....

..... (중간 생략)

.....

       [ -4.38e-02],
       [ -1.14e-02],
       [ -2.32e-01],
       [ -1.42e-01],
       [ 6.92e+00],
       [ 6.92e+00]])

# checking mean and std after scaling (mean = 0, std = 1)

In [22]: np.mean(x_StandardScaler) # mean = 0

Out[22]: 5.3290705182007512e-17

In [23]: np.std(x_StandardScaler) # std = 1

Out[23]: 1.0

# look at the outliers at the far right end

In [24]: plt.hist(x_StandardScaler)

Out[24]:

(array([ 98., 0., 0., 0., 0., 0., 0., 0., 0., 2.]),

array([-0.49, 0.25, 0.99, 1.73, 2.47, 3.22, 3.96, 4.7 , 5.44,

6.18, 6.92]),

<a list of 10 Patch objects>)

# zoom in

In [25]: x_StandardScaler_zoomin = x_StandardScaler[x_StandardScaler < 5]

In [26]: plt.hist(x_StandardScaler_zoomin, bins=np.arange(-3.0, 3.0, 0.2))

Out[26]:

(array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,

0., 5., 26., 50., 14., 3., 0., 0., 0., 0., 0.,

0., 0., 0., 0., 0., 0., 0.]),

array([ -3.00e+00, -2.80e+00, -2.60e+00, -2.40e+00, -2.20e+00,

-2.00e+00, -1.80e+00, -1.60e+00, -1.40e+00, -1.20e+00,

-1.00e+00, -8.00e-01, -6.00e-01, -4.00e-01, -2.00e-01,

2.66e-15, 2.00e-01, 4.00e-01, 6.00e-01, 8.00e-01,

1.00e+00, 1.20e+00, 1.40e+00, 1.60e+00, 1.80e+00,

2.00e+00, 2.20e+00, 2.40e+00, 2.60e+00, 2.80e+00]),

<a list of 29 Patch objects>)

Compare the last histogram in the StandardScaler() example above, drawn over the range -3 to 3 with the outliers excluded, with the values standardized by RobustScaler() below.

Standardization by StandardScaler() is affected more strongly by the outliers, so the non-outlier values end up packed tightly together.

(2) Standardizing data with outliers using the median and IQR

: sklearn.preprocessing.RobustScaler()

In [27]: np.median(x) # 10.2

Out[27]: 10.207697741550213

In [28]: Q1 = np.percentile(x, 25, axis=0)

In [29]: Q1 # 9.04

Out[29]: array([ 9.04])

In [30]: Q3 = np.percentile(x, 75, axis=0)

In [31]: Q3 # 11.42

Out[31]: array([ 11.42])

In [32]: IQR = Q3 - Q1

In [33]: IQR # 2.37

Out[33]: array([ 2.37])

In [34]: x_RobustScaler = RobustScaler().fit_transform(x)

# list the last 10 elements

In [35]: x_RobustScaler[-10:]

Out[35]:

array([[ 8.46e-01],
       [ 1.19e-01],
       [ -1.40e-01],
       [ -7.23e-01],
       [ 5.12e-01],
       [ 6.86e-01],
       [ -4.94e-01],
       [ -1.20e-02],
       [ 3.78e+01],
       [ 3.78e+01]])

# RobustScaler() removes the median and scales the data according to the IQR (Interquartile Range)

In [36]: np.median(x_RobustScaler)

Out[36]: 0.0

In [37]: np.mean(x_RobustScaler)

Out[37]: 0.74729363822182793

In [38]: np.std(x_RobustScaler)

Out[38]: 5.3569262082386482
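The two formulas can be compared side by side on a tiny array. The sketch below uses plain NumPy and a made-up array `x_demo` with a single outlier (an assumption for illustration, not the data from this post); it mirrors the (x - mean) / std and (x - median) / IQR formulas rather than calling the scikit-learn classes.

```python
import numpy as np

# Hypothetical data: nine ordinary values plus one outlier
x_demo = np.array([1., 2., 3., 4., 5., 6., 7., 8., 9., 100.])

# Standardization with mean and standard deviation
z = (x_demo - x_demo.mean()) / x_demo.std()

# Robust scaling with median and interquartile range
q1, q3 = np.percentile(x_demo, [25, 75])
r = (x_demo - np.median(x_demo)) / (q3 - q1)

# How widely are the non-outlier values spread after each scaling?
inliers = x_demo < 50
z_spread = z[inliers].max() - z[inliers].min()
r_spread = r[inliers].max() - r[inliers].min()
print(z_spread, r_spread)
```

The outlier inflates the standard deviation, so the nine ordinary values are squeezed into a narrow band after z-scoring, while the median and IQR barely move and the same values keep a much wider spread.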

# checking the distribution by histogram after RobustScaler

In [39]: plt.hist(x_RobustScaler)

Out[39]:

(array([ 98.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   2.]),
array([ -1.88,   2.09,   6.06, 10.03, 14. , 17.97, 21.95, 25.92,
         29.89, 33.86, 37.83]),
<a list of 10 Patch objects>)

# zoom in

In [40]: x_RobustScaler_zoomin = x_RobustScaler[x_RobustScaler < 5]

In [41]: plt.hist(x_RobustScaler_zoomin, bins=np.arange(-3, 3, 0.2))

Out[41]:

(array([ 0.,   0.,   0.,   0.,   0.,   1.,   3.,   1.,   3.,   1.,   4.,
          6.,   7., 13., 11., 14.,   6.,   7.,   6.,   5.,   4.,   2.,
          1.,   0.,   3.,   0.,   0.,   0.,   0.]),
array([ -3.00e+00, -2.80e+00, -2.60e+00, -2.40e+00, -2.20e+00,
         -2.00e+00, -1.80e+00, -1.60e+00, -1.40e+00, -1.20e+00,
         -1.00e+00, -8.00e-01, -6.00e-01, -4.00e-01, -2.00e-01,
          2.66e-15,   2.00e-01,   4.00e-01,   6.00e-01,   8.00e-01,
          1.00e+00,   1.20e+00,   1.40e+00,   1.60e+00,   1.80e+00,
          2.00e+00,   2.20e+00,   2.40e+00,   2.60e+00,   2.80e+00]),
<a list of 29 Patch objects>)

The two histograms below show the distribution of data containing outliers after (1) standardization with the mean and standard deviation (zoomed in on the range that excludes the outliers) and (2) outlier-robust standardization with the median and IQR (Interquartile Range) (zoomed in on the same range).

Compared with StandardScaler() on the left, RobustScaler() on the right spreads the same values over a wider range. Because the non-outlier values are more spread out, the scaled variable may be more useful as an explanatory variable x when classifying or predicting a target variable y.

(1) StandardScaler() zoom in

(2) RobustScaler() zoom in

In the next post I will introduce min-max scaling to the [0~1] range.

I hope this was helpful.


[Python] Standardization to mean 0 and standard deviation 1 : (x-mean())/std(), ss.zscore(), StandardScaler().fit_transform(data)

Python Analysis and Programming / Python Data Preprocessing, 2016. 12. 13. 23:41

When variables are measured on different scales, they cannot be compared directly. In modeling, differences in scale can also distort the parameter estimates.

So when the scales differ, we usually standardize the variables before modeling.

Among standardization methods, when the population follows a normal (Gaussian) distribution, a widely used one is to standardize to the standard normal distribution with mean 0 and standard deviation 1.

In this post I will introduce standardization (mean removal and variance scaling, mean = 0, std = 1) using

- Numpy : z = (x - mean())/std()

- scipy.stats : zscore()

- sklearn.preprocessing : StandardScaler().fit_transform()

Let's import the required module and create an example dataset. No random seed is set, so the values will differ from run to run.

In [1]: import numpy as np

In [2]: data = np.random.randint(30, size=(6, 5))

In [3]: data

Out[3]:

array([[ 3, 5, 14, 24, 24],
       [ 3, 9, 1, 20, 3],
       [10, 5, 11, 17, 28],
       [26, 9, 20, 10, 8],
       [15, 7, 1, 24, 2],
       [15, 19, 10, 13, 2]])

Let's go through three ways of standardizing, one at a time.

(1) Standardization with Numpy : z = (x - mean())/std()

To standardize each column with its own mean and standard deviation, set 'axis=0', as in np.mean(data, axis=0) and np.std(data, axis=0).

# (1) Using numpy, z = (x-mean)/std
# (np.mean / np.std are used instead of 'from numpy import *', which would shadow Python built-ins)

In [5]: data_standardized_np = (data - np.mean(data, axis=0)) / np.std(data, axis=0)

In [6]: data_standardized_np

Out[6]:

array([[-1.13090555, -0.84016805, 0.66169316, 1.14070365, 1.19426502],
       [-1.13090555, 0.        , -1.24986486, 0.38023455, -0.75998683],
       [-0.25131234, -0.84016805, 0.22056439, -0.19011728, 1.56650347],
       [ 1.75918641, 0.        , 1.54395071, -1.5209382 , -0.29468877],
       [ 0.37696852, -0.42008403, -1.24986486, 1.14070365, -0.85304644],
       [ 0.37696852, 2.10042013, 0.07352146, -0.95058638, -0.85304644]])

# check of 'mean=0', 'standard deviation=1'

In [7]: np.mean(data_standardized_np, axis=0)

Out[7]:

array([ -5.55111512e-17, 0.00000000e+00, 9.25185854e-18,
0.00000000e+00, 3.70074342e-17])

In [8]: np.std(data_standardized_np, axis=0)

Out[8]: array([ 1., 1., 1., 1., 1.])

Mean: np.mean(arr)
Standard deviation: np.std(arr)
Variance: np.var(arr)

import numpy as np

arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

print('mean:', np.mean(arr))

print('standard deviation:', np.std(arr))

print('variance:', np.var(arr))

mean: 5.0
standard deviation: 3.1622776601683795
variance: 10.0

(2) Standardization with scipy.stats : ss.zscore()

# (2) Standardization using zscore() of scipy.stats

In [9]: import scipy.stats as ss

In [10]: data_standardized_ss = ss.zscore(data)

In [11]: data_standardized_ss

Out[11]:

array([[-1.13090555, -0.84016805, 0.66169316, 1.14070365, 1.19426502],
       [-1.13090555, 0.        , -1.24986486, 0.38023455, -0.75998683],
       [-0.25131234, -0.84016805, 0.22056439, -0.19011728, 1.56650347],
       [ 1.75918641, 0.        , 1.54395071, -1.5209382 , -0.29468877],
       [ 0.37696852, -0.42008403, -1.24986486, 1.14070365, -0.85304644],
       [ 0.37696852, 2.10042013, 0.07352146, -0.95058638, -0.85304644]])

(3) Standardization with sklearn.preprocessing : StandardScaler().fit_transform()

In [12]: from sklearn.preprocessing import StandardScaler

In [13]: data_standardized_skl = StandardScaler().fit_transform(data)

In [14]: data_standardized_skl

Out[14]:

array([[-1.13090555, -0.84016805, 0.66169316, 1.14070365, 1.19426502],
       [-1.13090555, 0.        , -1.24986486, 0.38023455, -0.75998683],
       [-0.25131234, -0.84016805, 0.22056439, -0.19011728, 1.56650347],
       [ 1.75918641, 0.        , 1.54395071, -1.5209382 , -0.29468877],
       [ 0.37696852, -0.42008403, -1.24986486, 1.14070365, -0.85304644],
       [ 0.37696852, 2.10042013, 0.07352146, -0.95058638, -0.85304644]])
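All three results above agree because np.std(), scipy.stats.zscore(), and StandardScaler() all default to the population standard deviation (ddof=0). The NumPy sketch below, using the example array from this post, shows how the result shifts if the sample standard deviation (ddof=1, the default of pandas' .std()) is used instead. The variable names are illustrative.

```python
import numpy as np

data = np.array([[ 3,  5, 14, 24, 24],
                 [ 3,  9,  1, 20,  3],
                 [10,  5, 11, 17, 28],
                 [26,  9, 20, 10,  8],
                 [15,  7,  1, 24,  2],
                 [15, 19, 10, 13,  2]])
n = data.shape[0]

# Population standard deviation (ddof=0), as used in the three methods above
z_pop = (data - data.mean(axis=0)) / data.std(axis=0)

# Sample standard deviation (ddof=1) gives slightly smaller z values
z_sample = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

print(z_pop.std(axis=0))     # every column has std 1
print(z_sample.std(axis=0))  # every column has std sqrt((n-1)/n)
```

With only six rows the difference is noticeable, so it is worth checking which convention a tool uses before comparing standardized values across libraries.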

In the next post I will introduce sklearn.preprocessing.robust_scale and sklearn.preprocessing.RobustScaler, which standardize robustly when the dataset contains outliers.

I hope this was helpful.


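[Python pandas] Finding unique values : pd.Series.unique(), counting occurrences of each unique value : pd.Series.value_counts()

Python Analysis and Programming / Python Data Preprocessing, 2016. 12. 12. 23:17

In the previous post we looked at checking for and handling duplicate values.

In this post, to find unique values and count them, I will introduce the following pandas methods:

- pd.Series.unique() for finding unique values

(Return np.ndarray of unique values in the object)

- pd.Series.value_counts() for counting occurrences of each unique value

(Returns object containing counts of unique values)

These come up often during data preprocessing and exploratory data analysis, so they are worth knowing.

First, let's import the required modules and create an example DataFrame.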

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: df = pd.DataFrame({'A': ['A1', 'A1', 'A2', 'A2', 'A3', 'A3'],

...: 'B': ['B1', 'B1', 'B1', 'B1', 'B2', np.nan],

...: 'C': [1, 1, 3, 4, 4, 4]})

In [4]: df

Out[4]:

    A    B C
0 A1   B1 1
1 A1   B1 1
2 A2   B1 3
3 A2   B1 4
4 A3   B2 4
5 A3 NaN 4

(1) Finding unique values : pd.Series.unique()

pd.Series.unique() returns an np.ndarray. Selecting each column of the DataFrame and applying unique() gives the results below.

Note that the NaN in column 'B' is included in the output of unique().

In [5]: df['A'].unique()

Out[5]: array(['A1', 'A2', 'A3'], dtype=object)

In [6]: df['B'].unique()

Out[6]: array(['B1', 'B2', nan], dtype=object)

In [7]: df['C'].unique()

Out[7]: array([1, 3, 4], dtype=int64)

(2) Counting occurrences of each unique value : pd.Series.value_counts()

# returns object containing counts of unique values

pd.Series.value_counts(normalize=False, # False: counts, True: relative frequencies

sort=True, # True: sort by count, False: do not sort by count

ascending=False, # False: descending order, True: ascending order

bins=None, # None: count each unique value; otherwise count values per bin

dropna=True) # True: exclude NaN, False: count NaN as well

In [4]: df

Out[4]:

    A    B C
0 A1   B1 1
1 A1   B1 1
2 A2   B1 3
3 A2   B1 4
4 A3   B2 4
5 A3 NaN 4

# results of pd.Series.value_counts() with default settings

In [8]: df['A'].value_counts()

Out[8]:

A3    2
A2    2
A1    2

Name: A, dtype: int64

In [9]: df['B'].value_counts()

Out[9]:

B1 4
B2 1

Name: B, dtype: int64

In [10]: df['C'].value_counts()

Out[10]:

4    3
1    2
3    1

Name: C, dtype: int64
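As a cross-check, the same counts for column 'C' can be obtained with NumPy's np.unique(). This is a swapped-in NumPy equivalent for illustration, not how pandas computes value_counts(); the array `c` simply copies the values of column 'C'.

```python
import numpy as np

c = np.array([1, 1, 3, 4, 4, 4])  # values of column 'C'

# Unique values (sorted ascending) and how often each occurs
values, counts = np.unique(c, return_counts=True)

# Relative frequencies, comparable to value_counts(normalize=True)
ratios = counts / counts.sum()

# Reorder by count, largest first, comparable to the default value_counts() order
order = np.argsort(-counts, kind="stable")
print(values[order], counts[order])
```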

(2-1) Relative frequency of each unique value : pd.Series.value_counts(normalize=True)

In [4]: df

Out[4]:

    A    B   C
0 A1   B1 1
1 A1   B1 1
2 A2   B1 3
3 A2   B1 4
4 A3   B2 4
5 A3 NaN 4

In [11]: df['C'].value_counts(normalize=True)

Out[11]:

4    0.500000
1    0.333333
3    0.166667

Name: C, dtype: float64

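(2-2) Sorting the result : pd.Series.value_counts(sort=True, ascending=True)

The combinations of sort=True/False and ascending=True/False give the three cases shown below.

[12] : sorted by count in descending order (sort descending order by value_counts)

[13] : sorted by count in ascending order (sort ascending order by value_counts)

[14] : sort=False, so the result is not sorted by count. In this output the unique values happen to appear in ascending order, but the order with sort=False may depend on the pandas version and should not be relied on.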

In [4]: df

Out[4]:

    A    B   C
0 A1   B1 1
1 A1   B1 1
2 A2   B1 3
3 A2   B1 4
4 A3   B2 4
5 A3 NaN 4

# sort descending order by value_counts

In [12]: df['C'].value_counts(sort=True, ascending=False) # by default

Out[12]:

4    3
1    2
3    1

Name: C, dtype: int64

# sort ascending order by value_counts

In [13]: df['C'].value_counts(sort=True, ascending=True)

Out[13]:

3    1
1    2
4    3

Name: C, dtype: int64

# Don't sort by count (here the unique values happen to come out in ascending order)

In [14]: df['C'].value_counts(sort=False)

Out[14]:

1    2
3    1
4    3

Name: C, dtype: int64

(2-3) Whether to count missing values : pd.Series.value_counts(dropna=True)

In [4]: df

Out[4]:

    A    B   C
0 A1   B1 1
1 A1   B1 1
2 A2   B1 3
3 A2   B1 4
4 A3   B2 4
5 A3 NaN 4

# dropna=True : Don’t include counts of NaN

In [15]: df['B'].value_counts(dropna=True) # by default

Out[15]:

B1 4
B2 1

Name: B, dtype: int64

# dropna=False : Include counts of NaN

In [16]: df['B'].value_counts(dropna=False)

Out[16]:

B1     4
B2     1
NaN    1

Name: B, dtype: int64

(2-4) Counting values per bin : pd.Series.value_counts(bins=[ , , ,])

In [4]: df

Out[4]:

    A    B   C
0 A1   B1 1
1 A1   B1 1
2 A2   B1 3
3 A2   B1 4
4 A3   B2 4
5 A3 NaN 4

In [17]: df['C'].value_counts(bins=[0, 1, 2, 3, 4, 5], sort=False)

Out[17]:

0 2

1 0

2 1

3 3

4 0

Name: C, dtype: int64

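The pd.cut(Series, bins=[ , , , ]) approach below yields the same counts as above (listed in a different order and labeled with the bin intervals); Series.value_counts(bins=[ , , , ]) above is a little more convenient to use.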

In [18]: out = pd.cut(df['C'], bins=[0, 1, 2, 3, 4, 5])

In [19]: out.value_counts() # the top-level pd.value_counts() is deprecated in recent pandas

Out[19]:

(3, 4] 3

(0, 1] 2

(2, 3] 1

(4, 5] 0

(1, 2] 0

Name: C, dtype: int64
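The bin counts can also be reproduced with NumPy as a sanity check. The sketch below assumes right-closed bins (a, b], matching the interval labels printed by pd.cut above; np.digitize and np.bincount are swapped in for illustration and are not what pandas uses internally.

```python
import numpy as np

c = np.array([1, 1, 3, 4, 4, 4])     # values of column 'C'
bins = np.array([0, 1, 2, 3, 4, 5])  # same bin edges as above

# With right=True, index i means bins[i-1] < value <= bins[i]
idx = np.digitize(c, bins, right=True)

# Count per bin; drop slot 0, which would hold values <= bins[0]
counts = np.bincount(idx, minlength=len(bins))[1:]
print(counts)  # counts for (0, 1], (1, 2], (2, 3], (3, 4], (4, 5]
```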

I hope this was helpful.

