
[Python] Binarization: sklearn.preprocessing.Binarizer(), sklearn.preprocessing.binarize()

Python Analysis and Programming / Python Data Preprocessing  2016. 12. 17. 16:54

In the previous post, we used Python's sklearn.preprocessing.MinMaxScaler() to rescale a continuous variable so that its minimum maps to 0 and its maximum maps to 1 (the [0, 1] transformation).

In this post, I will show how to use Python's sklearn.preprocessing.Binarizer() to convert a continuous variable into a variable that takes only two values: '0' when the value is less than or equal to a threshold, and '1' when the value is above the threshold.
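The rule above can be sketched in plain NumPy before turning to scikit-learn. This is only an illustration of the threshold rule, not scikit-learn's implementation; the helper name binarize_sketch and the sample values are made up for this example.

```python
import numpy as np


def binarize_sketch(X, threshold=0.0):
    """Hypothetical helper: return 1.0 where X > threshold, else 0.0."""
    X = np.asarray(X, dtype=float)
    # The comparison is strict, so a value equal to the threshold maps to 0.
    return (X > threshold).astype(float)


values = np.array([-1.5, 0.0, 0.3, 2.0])
result = binarize_sketch(values, threshold=0.0)
print(result)  # [0. 0. 1. 1.]
```

Note that 0.0, which equals the threshold, is mapped to 0, matching the "less than or equal to" side of the rule.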

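Binarization is needed when an analysis assumes outcomes coded as '0' or '1', as in models built on Bernoulli trials and the binomial distribution. Dummy variables taking the values '0' and '1' are also commonly created for regression on categorical data, association analysis, and text mining.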

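Suppose an experiment is repeated and each trial has only two possible outcomes, "success (1)" or "failure (0)". Such a trial is called a Bernoulli trial.

If a Bernoulli trial with success probability p is repeated n times independently and X denotes the number of successes, then the random variable X follows a binomial distribution with parameters n and p.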
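As a small aside, the binomial distribution described above has the probability mass function P(X = k) = C(n, k) p^k (1 - p)^(n - k). The sketch below evaluates it with the standard library only; the function name binomial_pmf and the values n = 10, p = 0.3 are arbitrary choices for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 10, 0.3  # illustrative values, not taken from the post
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# The probabilities sum to 1, and the mean equals n * p (up to rounding).
total = sum(pmf)
mean = sum(k * prob for k, prob in enumerate(pmf))
print(total, mean)
```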

First, let's import the required modules and create a very simple example array.

# importing modules
In [1]: import numpy as np

In [2]: from sklearn.preprocessing import Binarizer

# making a training data array
In [3]: X = np.array([[ 10., -10., 1.],
   ...:               [  5.,   0., 2.],
   ...:               [  0.,  10., 3.]])

In [4]: X
Out[4]:
array([[ 10., -10.,   1.],
       [  5.,   0.,   2.],
       [  0.,  10.,   3.]])

(1) Binarization with the sklearn.preprocessing.Binarizer() class

# making the utility class binarizer
# (Binarizer is stateless: fit() learns nothing and exists for API consistency)
In [5]: binarizer = Binarizer().fit(X)

# threshold=0.0 by default
In [6]: binarizer
Out[6]: Binarizer(copy=True, threshold=0.0)

# Feature values below or equal to the threshold are replaced by 0, above it by 1
In [7]: binarizer.transform(X)
Out[7]:
array([[ 1.,  0.,  1.],
       [ 1.,  0.,  1.],
       [ 0.,  1.,  1.]])

Now let's change the binarization threshold from the default '0.0' to '2.0'.

In [8]: binarizer = Binarizer(threshold=2.0)

In [9]: X
Out[9]:
array([[ 10., -10.,   1.],
       [  5.,   0.,   2.],
       [  0.,  10.,   3.]])

In [10]: binarizer.transform(X)
Out[10]:
array([[ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  1.]])
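Because Binarizer follows the Transformer API, it can be chained with other preprocessing steps. The following is a minimal sketch, assuming we want to combine it with the MinMaxScaler() from the previous post; the 0.25 cut-off is an arbitrary value chosen for illustration, not a recommendation.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Binarizer, MinMaxScaler

X = np.array([[10., -10., 1.],
              [5., 0., 2.],
              [0., 10., 3.]])

# Step 1 rescales each column to the [0, 1] range.
# Step 2 flags scaled values above 0.25 (arbitrary cut-off for this sketch).
pipe = make_pipeline(MinMaxScaler(), Binarizer(threshold=0.25))
flags = pipe.fit_transform(X)
print(flags)
```

After scaling, every column contains the values 0, 0.5, and 1, so only the column minimum falls below the cut-off and is mapped to 0.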

(2) Binarization with the sklearn.preprocessing.binarize() function

The sklearn.preprocessing module also provides the binarize() function, which performs binarization without going through the Transformer API.

# sklearn.preprocessing.binarize function which is used without transformer API
# sklearn.preprocessing.binarize(X, threshold=0.0, copy=True)
In [11]: from sklearn.preprocessing import binarize

In [12]: X
Out[12]:
array([[ 10., -10.,   1.],
       [  5.,   0.,   2.],
       [  0.,  10.,   3.]])

In [13]: binarize(X)
Out[13]:
array([[ 1.,  0.,  1.],
       [ 1.,  0.,  1.],
       [ 0.,  1.,  1.]])
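As a quick check that the two interfaces agree, the sketch below runs the class and the function on the same array and compares the results. The variable names from_class and from_function are just labels for this example.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, binarize

X = np.array([[10., -10., 1.],
              [5., 0., 2.],
              [0., 10., 3.]])

# Transformer API: build the object, then fit_transform.
from_class = Binarizer(threshold=0.0).fit_transform(X)

# Plain function: a single call, no object to keep around.
from_function = binarize(X, threshold=0.0)

same = np.array_equal(from_class, from_function)
print(same)  # True
```

The function is convenient for a one-off conversion, while the class is the natural choice when the step has to live inside a scikit-learn pipeline.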

Let's change the binarization threshold from the default '0.0' to '2.0'.

# adjusting the threshold of binarize()
In [14]: binarize(X, threshold=2.0)
Out[14]:
array([[ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  1.]])

# original data is not replaced by binarize() (compare with the 'copy=False' example below)
In [15]: X
Out[15]:
array([[ 10., -10.,   1.],
       [  5.,   0.,   2.],
       [  0.,  10.,   3.]])

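When binarizing with binarize(), the copy option defaults to 'True': the original data is left untouched and the binarized values are returned as a new array.

If binarize() is called with 'copy=False', the original data itself is overwritten with the binarized values, as in the example below.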

# set copy=False to perform in-place binarization and avoid a copy
In [16]: binarize(X, threshold=2.0, copy=False)
Out[16]:
array([[ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  1.]])

# oops, the original data has been changed by binarize()
In [17]: X
Out[17]:
array([[ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  1.]])
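The difference between the two copy settings can also be checked programmatically. The sketch below assumes a float NumPy array as input, as in the examples above; the backup variable is introduced here only to compare against the original values.

```python
import numpy as np
from sklearn.preprocessing import binarize

X = np.array([[10., -10., 1.],
              [5., 0., 2.],
              [0., 10., 3.]])
backup = X.copy()  # kept only so we can compare against the original values

# Default copy=True: X keeps its original values.
out_default = binarize(X, threshold=2.0)
unchanged = np.array_equal(X, backup)

# copy=False: X itself is overwritten with 0/1 values.
out_inplace = binarize(X, threshold=2.0, copy=False)
overwritten = np.array_equal(X, out_inplace)

print(unchanged, overwritten)  # True True
```

If the original values are still needed later, keeping the default copy=True (or making an explicit copy first, as with backup here) avoids the surprise shown above.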

In the next post, I will cover discretization and binarization of categorical variables.

I hope this was helpful.

License: Attribution, NonCommercial, NoDerivatives

Posted by Rfriend
