[Python] Standardizing data to the standard normal distribution (standardization) : (x - mean())/std(), ss.zscore(), StandardScaler().fit_transform(data)

In data analysis, variables measured on different scales cannot be compared with one another directly. In modeling, differences in scale can also distort the estimated parameters.
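
To make the scale problem concrete, here is a minimal sketch with made-up data (the `people` array and the `distance` helper are hypothetical, not from the original example). Height is in cm and income is in won, so the raw Euclidean distance is driven almost entirely by income, while the standardized distance treats both variables on an equal footing.

```python
import numpy as np

# Hypothetical data: [height in cm, annual income in won] for three people.
people = np.array([
    [160.0, 30_000_000.0],
    [190.0, 30_500_000.0],   # very different height, similar income
    [161.0, 60_000_000.0],   # similar height, very different income
])

def distance(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Raw scale: the income column dominates the distance.
raw_01 = distance(people[0], people[1])
raw_02 = distance(people[0], people[2])

# Column-wise standardization: z = (x - mean) / std
z = (people - people.mean(axis=0)) / people.std(axis=0)
z_01 = distance(z[0], z[1])
z_02 = distance(z[0], z[2])

print('raw distances         :', raw_01, raw_02)
print('standardized distances:', z_01, z_02)
```

On the raw scale the second pair looks far more distant only because won are large numbers; after standardization the two pairs are about equally far apart.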

For that reason, when variables are on different scales, we usually standardize them before starting the modeling work.

Among standardization methods, when the population follows a normal distribution (Gaussian distribution), a widely used approach is to convert the data to the standard normal distribution, which has mean 0 and standard deviation 1.

In this post I will introduce standardization to the standard normal distribution (mean removal and variance scaling, mean = 0, std = 1) using the following modules and methods:

- NumPy : z = (x - mean())/std()

- scipy.stats : zscore()

- sklearn.preprocessing : StandardScaler().fit_transform()

Let's import the module needed for the exercise and create an example dataset.

In [1]: import numpy as np

In [2]: data = np.random.randint(30, size=(6, 5))

In [3]: data

Out[3]:

array([[ 3, 5, 14, 24, 24],
       [ 3, 9, 1, 20, 3],
       [10, 5, 11, 17, 28],
       [26, 9, 20, 10, 8],
       [15, 7, 1, 24, 2],
       [15, 19, 10, 13, 2]])

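I will go through the three ways of standardizing to the standard normal distribution in turn.

(1) Standardization with NumPy : z = (x - mean())/std()

To standardize each column with its own mean and standard deviation, set 'axis=0', as in mean(data, axis=0) and std(data, axis=0).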

# (1) Using numpy, z = (x-mean)/std
# (np was imported in In [1]; 'from numpy import *' is avoided because it shadows built-ins such as sum, min and max)

In [5]: data_standardized_np = (data - np.mean(data, axis=0)) / np.std(data, axis=0)

In [6]: data_standardized_np

Out[6]:

array([[-1.13090555, -0.84016805, 0.66169316, 1.14070365, 1.19426502],
       [-1.13090555, 0.        , -1.24986486, 0.38023455, -0.75998683],
       [-0.25131234, -0.84016805, 0.22056439, -0.19011728, 1.56650347],
       [ 1.75918641, 0.        , 1.54395071, -1.5209382 , -0.29468877],
       [ 0.37696852, -0.42008403, -1.24986486, 1.14070365, -0.85304644],
       [ 0.37696852, 2.10042013, 0.07352146, -0.95058638, -0.85304644]])

# check that each column now has mean = 0 and standard deviation = 1
# (the means are zero up to floating-point rounding error)

In [7]: np.mean(data_standardized_np, axis=0)

Out[7]:

array([ -5.55111512e-17, 0.00000000e+00, 9.25185854e-18,
        0.00000000e+00, 3.70074342e-17])

In [8]: np.std(data_standardized_np, axis=0)

Out[8]: array([ 1., 1., 1., 1., 1.])
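
The role of 'axis=0' in the calculation above can be seen more clearly by comparing it with the alternatives. The sketch below uses a small made-up array (not the random dataset from this post); the names `col_z`, `row_z` and `all_z` are only for illustration.

```python
import numpy as np

# Small made-up array: three columns on very different scales.
data = np.array([[1.0, 10.0, 100.0],
                 [2.0, 20.0, 200.0],
                 [3.0, 30.0, 300.0],
                 [4.0, 40.0, 400.0]])

# axis=0: one mean/std per column (the usual choice for variables in columns)
col_z = (data - data.mean(axis=0)) / data.std(axis=0)

# axis=1: one mean/std per row; keepdims=True keeps shape (4, 1) so it broadcasts
row_z = (data - data.mean(axis=1, keepdims=True)) / data.std(axis=1, keepdims=True)

# no axis: a single mean/std over all 12 values
all_z = (data - data.mean()) / data.std()

print('column-wise std per column:', col_z.std(axis=0))
print('row-wise std per row      :', row_z.std(axis=1))
print('global std per column     :', all_z.std(axis=0))
```

Only the column-wise version gives every variable mean 0 and standard deviation 1; with no axis, the columns keep their relative scale differences.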

Mean: np.mean(arr)
Standard deviation: np.std(arr)
Variance: np.var(arr)

import numpy as np

arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

print('mean:', np.mean(arr))

print('standard deviation:', np.std(arr))

print('variance:', np.var(arr))

mean: 5.0
standard deviation: 3.1622776601683795
variance: 10.0
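
One detail worth knowing here: np.std() and np.var() divide by N by default (ddof=0, the population formula). If you want the sample version that divides by N - 1, pass ddof=1. A short sketch on the same 0 to 10 array:

```python
import numpy as np

arr = np.arange(11)  # 0, 1, ..., 10

# Default ddof=0: divide the sum of squared deviations (110) by N = 11
pop_var = np.var(arr)            # 10.0
pop_std = np.std(arr)

# ddof=1: divide by N - 1 = 10
sample_var = np.var(arr, ddof=1)  # 11.0
sample_std = np.std(arr, ddof=1)

print('population variance / std:', pop_var, pop_std)
print('sample variance / std    :', sample_var, sample_std)
```

The z-scores in this post all use the ddof=0 convention, which is why the NumPy formula, ss.zscore() and StandardScaler agree by default.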

(2) Standardization with scipy.stats : ss.zscore()

# (2) Standardization using zscore() of scipy.stats

In [9]: import scipy.stats as ss

In [10]: data_standardized_ss = ss.zscore(data)

In [11]: data_standardized_ss

Out[11]:
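
With its default arguments (axis=0, ddof=0), ss.zscore() computes the same column-wise z-scores as the NumPy formula in (1). The sketch below checks this on the example array shown in Out[3], typed in by hand so that it is reproducible, and also shows the `axis` and `ddof` arguments. The variable names are only for illustration.

```python
import numpy as np
import scipy.stats as ss

# The example array from Out[3], entered by hand so the result is reproducible.
data = np.array([[ 3,  5, 14, 24, 24],
                 [ 3,  9,  1, 20,  3],
                 [10,  5, 11, 17, 28],
                 [26,  9, 20, 10,  8],
                 [15,  7,  1, 24,  2],
                 [15, 19, 10, 13,  2]])

# NumPy formula from (1)
manual = (data - np.mean(data, axis=0)) / np.std(data, axis=0)

# Defaults: axis=0 (per column), ddof=0 (divide by N)
z_default = ss.zscore(data)
print('same as NumPy formula:', np.allclose(manual, z_default))

# ddof=1 uses the sample standard deviation (divide by N - 1)
z_sample = ss.zscore(data, ddof=1)

# axis=1 standardizes each row instead of each column
z_rows = ss.zscore(data, axis=1)
```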

(3) Standardization with sklearn.preprocessing : StandardScaler().fit_transform()

In [12]: from sklearn.preprocessing import StandardScaler

In [13]: data_standardized_skl = StandardScaler().fit_transform(data)

In [14]: data_standardized_skl

Out[14]:
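
StandardScaler also uses the population standard deviation (ddof=0), so on the same data it should give the same values as methods (1) and (2). What it adds is that the fitted object remembers the column means and standard deviations, so the same transformation can be applied to new rows later. The sketch below uses the example array from Out[3]; splitting it into `train` and `new` rows is an assumption made here purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The example array from Out[3], as floats to avoid dtype-conversion warnings.
data = np.array([[ 3,  5, 14, 24, 24],
                 [ 3,  9,  1, 20,  3],
                 [10,  5, 11, 17, 28],
                 [26,  9, 20, 10,  8],
                 [15,  7,  1, 24,  2],
                 [15, 19, 10, 13,  2]], dtype=float)

# Fitting on the whole array reproduces the NumPy formula from (1).
manual = (data - data.mean(axis=0)) / data.std(axis=0)
skl_all = StandardScaler().fit_transform(data)
print('same as NumPy formula:', np.allclose(manual, skl_all))

# Hypothetical split: fit on the first four rows, reuse on the last two.
train, new = data[:4], data[4:]
scaler = StandardScaler()
train_z = scaler.fit_transform(train)   # learns mean_ and scale_ from train
new_z = scaler.transform(new)           # applies the train statistics to new rows

print('learned means:', scaler.mean_)
print('learned stds :', scaler.scale_)

# inverse_transform maps standardized values back to the original scale.
restored = scaler.inverse_transform(train_z)
```

Note that `new_z` is standardized with the statistics of `train`, so its columns will generally not have mean exactly 0 or standard deviation exactly 1.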

In the next post, I will introduce sklearn.preprocessing.robust_scale and sklearn.preprocessing.RobustScaler as ways to standardize robustly when the dataset contains outliers.

I hope this was helpful.


License: Attribution, NonCommercial, NoDerivatives

A Friend for R and Python Analysis and Programming (by R Friend)