[Python pandas] Fitting Linear Regression Models by Group (Group-wise Linear Regression)

Python Analysis and Programming / Python Data Preprocessing, 2018. 12. 25. 18:48

Over several previous posts, I showed how to use groupby() and apply() to compute statistics by group, impute missing values by group, and compute correlation coefficients between variables by group.

In this post, I will show how to use groupby() and apply() to fit a linear regression model for each group (Group-wise Linear Regression).

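When you have many groups and want to compare regression coefficients across them, the approach in this post lets you fit all the group-wise models with a few lines of code instead of fitting each group's model one by one.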

[ Group-wise Linear Regression ]

First, let's import the diabetes data used in this example, along with sklearn's linear_model module, which we will use to fit the linear regression models.

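The dataset covers 442 diabetes patients. It has 10 explanatory variables ('age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6') and the response variable 'target' that we want to predict. According to the dataset description, the explanatory variables have already been mean-centered and scaled.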

import numpy as np
import pandas as pd
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
print(diabetes.DESCR)

Diabetes dataset
================

Notes
-----

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

Data Set Characteristics:

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attributes:
    :Age:
    :Sex:
    :Body mass index:
    :Average blood pressure:
    :S1:
    :S2:
    :S3:
    :S4:
    :S5:
    :S6:

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)

diabetes.feature_names

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

diabetes.target[:5]

array([151.,  75., 141., 206., 135.])
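The scaling note in the dataset description can be illustrated on made-up numbers. The sketch below is only an illustration under stated assumptions, not sklearn's actual preprocessing code: `raw` is a synthetic, hypothetical column, and the divisor is the standard deviation times the square root of the sample count, which is the scaling that makes the sum of squares of a centered column equal 1.

```python
import numpy as np

# Synthetic raw measurements (NOT the diabetes data), one value per "patient"
rng = np.random.default_rng(0)
raw = rng.normal(loc=50.0, scale=10.0, size=442)

# Mean-center, then divide by (population std * sqrt(n_samples))
centered = raw - raw.mean()
scaled = centered / (raw.std() * np.sqrt(raw.size))

print("mean of scaled column   :", scaled.mean())
print("sum of squares of column:", (scaled ** 2).sum())
```

With this scaling the individual values become small (on the order of 0.01 to 0.1 for a few hundred samples), which matches the magnitude of the feature values in the diabetes data.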

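Since this post is about 'group-wise' linear regression, we will split the data by 'sex' into two groups, 'M' (Male) and 'F' (Female), and fit one regression model per group that predicts 'target' from just two explanatory variables, 'age' and 'bmi'. Let's start by building a DataFrame that contains only the variables we need.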

# Make a DataFrame of Y ('target')

diabetes_Y = pd.DataFrame(diabetes.target, columns = ['target'])

diabetes_Y[:5]

	target
0	151.0
1	75.0
2	141.0
3	206.0
4	135.0

# Make a DataFrame of X with age, sex, bmi
diabetes_X = pd.DataFrame(diabetes.data[:, 0:3],
                          columns=['age', 'sex', 'bmi'])  # age, sex, bmi

diabetes_X[:5]

	age	sex	bmi
0	0.038076	0.050680	0.061696
1	-0.001882	-0.044642	-0.051474
2	0.085299	0.050680	0.044451
3	-0.089063	-0.044642	-0.011595
4	0.005383	-0.044642	-0.036385

diabetes_df = pd.concat([diabetes_Y, diabetes_X], axis=1)

diabetes_df[:5]

	target	age	sex	bmi
0	151.0	0.038076	0.050680	0.061696
1	75.0	-0.001882	-0.044642	-0.051474
2	141.0	0.085299	0.050680	0.044451
3	206.0	-0.089063	-0.044642	-0.011595
4	135.0	0.005383	-0.044642	-0.036385

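Next, we use the 'sex' variable to create a categorical variable 'grp' with the classes 'M' and 'F', and then drop 'sex'. The dataset description does not say which of the two scaled values denotes which sex, so labeling positive values as 'M' is a choice made for this example.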

diabetes_df['grp'] = np.where(diabetes_df['sex'] > 0, 'M', 'F')

diabetes_df.drop(columns=['sex'], inplace=True)

diabetes_df[:3]

	target	age	bmi	grp
0	151.0	0.038076	0.061696	M
1	75.0	-0.001882	-0.051474	F
2	141.0	0.085299	0.044451	M
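Before fitting a model per group, it can help to confirm that the labeling worked and that each group has enough rows. The sketch below uses a tiny hand-made frame (`toy` is a hypothetical name, and its five 'sex' values simply mirror the first rows shown earlier) rather than the full dataset.

```python
import numpy as np
import pandas as pd

# Five scaled 'sex' values copied from the rows displayed above
toy = pd.DataFrame({'sex': [0.050680, -0.044642, 0.050680, -0.044642, -0.044642]})

# Same rule as in the post: positive -> 'M', otherwise 'F'
toy['grp'] = np.where(toy['sex'] > 0, 'M', 'F')

# Number of rows per group; very small groups give unstable coefficients
print(toy['grp'].value_counts())
```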

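Now let's define the linear regression user-defined function (UDF) that we will pass to apply() on the GroupBy object. Because we want to compare the regression coefficients of age and bmi across groups, the UDF returns the intercept and the coefficients of each group's model.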

# UDF of linear regression model
def lin_regress(data, yvar, xvars):
    # output, input variables
    Y = data[yvar]
    X = data[xvars]

    # Create linear regression object
    linreg = linear_model.LinearRegression()

    # Fit the linear regression model
    model = linreg.fit(X, Y)

    # Get the intercept and coefficients
    intercept = model.intercept_
    coef = model.coef_
    result = [intercept, coef]

    return result

Next, let's use groupby() and apply() to fit a linear regression model for each sex group ('M', 'F').

# GroupBy

grouped = diabetes_df.groupby('grp')

# Apply the UDF of linear regression model by Group

lin_reg_coef = grouped.apply(lin_regress, 'target', ['age', 'bmi'])
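To make the split-apply-combine step explicit, the same idea can be written as a loop over the groups. The sketch below is a cross-check under assumptions, not the method used in this post: it runs on synthetic data (`toy`), and it swaps sklearn's LinearRegression for `np.linalg.lstsq` with an explicit column of ones, which solves the same ordinary least squares problem. The helper name `ols_by_lstsq` is hypothetical.

```python
import numpy as np
import pandas as pd


def ols_by_lstsq(data, yvar, xvars):
    """Hypothetical helper: OLS fit, returns [intercept, coef_1, coef_2, ...]."""
    # Prepend a column of ones so the first coefficient is the intercept
    A = np.column_stack([np.ones(len(data)), data[xvars].to_numpy()])
    beta, *_ = np.linalg.lstsq(A, data[yvar].to_numpy(), rcond=None)
    return beta


# Synthetic data (NOT the diabetes data) with a known linear relationship
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'grp': np.repeat(['F', 'M'], 30),
    'age': rng.normal(size=60),
    'bmi': rng.normal(size=60),
})
noise = rng.normal(scale=0.1, size=60)
toy['target'] = 5.0 + 1.5 * toy['age'] - 2.0 * toy['bmi'] + noise

# Split by group, fit each piece, and combine the results in a dict
fits = {name: ols_by_lstsq(group, 'target', ['age', 'bmi'])
        for name, group in toy.groupby('grp')}

for name, beta in fits.items():
    print(name, np.round(beta, 3))
```

Each entry of `fits` should land close to the true values (5.0, 1.5, -2.0) used to generate the data, which is a quick way to convince yourself that every group was fitted on its own rows only.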

Let's look at the fitted Y-intercept and the age and bmi regression coefficients for the male ('M') and female ('F') groups.

lin_reg_coef

grp
F    [152.40684676047456, [23.199210147823813, 814....
M    [148.21507864445124, [291.7563226838977, 1092....

dtype: object

lin_reg_coef['M']

[148.21507864445124, array([ 291.75632268, 1092.80118705])]

lin_reg_coef['F']

[152.40684676047456, array([ 23.19921015, 814.50932703])]
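A Series of Python lists is awkward to compare across groups. One possible variant, sketched here as an assumption rather than as the code used in this post, returns a labeled pd.Series from the UDF so that apply() assembles a DataFrame with one row per group. The name `lin_regress_tidy` and the synthetic `demo` data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model


def lin_regress_tidy(data, yvar, xvars):
    """Hypothetical variant of lin_regress that returns a labeled pd.Series."""
    model = linear_model.LinearRegression().fit(data[xvars], data[yvar])
    # One labeled value per parameter: intercept first, then each coefficient
    values = [model.intercept_, *model.coef_]
    return pd.Series(values, index=['intercept', *xvars])


# Synthetic, noise-free data (NOT the diabetes data) with a known bmi slope per group
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'grp': np.repeat(['F', 'M'], 50),
    'age': rng.normal(size=100),
    'bmi': rng.normal(size=100),
})
bmi_slope = np.where(demo['grp'] == 'M', 3.0, 1.0)
demo['target'] = 10.0 + 2.0 * demo['age'] + bmi_slope * demo['bmi']

# Select the model columns explicitly so the UDF never sees the grouping column
coef_table = demo.groupby('grp')[['target', 'age', 'bmi']].apply(
    lin_regress_tidy, 'target', ['age', 'bmi'])

print(coef_table)
```

The result has the group labels as its index and 'intercept', 'age', and 'bmi' as columns, so the coefficients of the groups can be read side by side or sorted like any other DataFrame.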

From the group-wise linear regression results above, we obtain the following models.

Male ('M') group: disease progression target = 148.2 + 291.8*age + 1,092.8*bmi
Female ('F') group: disease progression target = 152.4 + 23.2*age + 814.5*bmi

(Here age and bmi are the scaled input values.)

Therefore, holding the other variable fixed, a one-unit increase in (scaled) bmi is associated with an increase of 1,092.8 in the disease progression target for the male ('M') group, compared with an increase of 814.5 for the female ('F') group.
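Because the scaled values shown earlier are on the order of 0.01 to 0.1, a full one-unit change in bmi is much larger than anything in the displayed rows. As a worked piece of arithmetic, the sketch below rescales the bmi coefficients reported above to a 0.01 step; the step size `delta_bmi` is my own choice for illustration, not something from the fitted models.

```python
# bmi coefficients copied from the fitted results above
coef_bmi = {'M': 1092.80118705, 'F': 814.50932703}

# Illustrative step in scaled bmi (an assumption, chosen to match the data's scale)
delta_bmi = 0.01

# Change in the predicted target when bmi rises by delta_bmi, age held fixed
effect = {grp: coef * delta_bmi for grp, coef in coef_bmi.items()}

for grp, change in effect.items():
    print(grp, round(change, 2))
```

On this scale the two groups differ by a little under 3 points of the target per 0.01 of scaled bmi (about 10.9 for 'M' versus about 8.1 for 'F').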

I hope this was helpful.

License: Attribution, NonCommercial, NoDerivatives

Posted by Rfriend

A Friend for R and Python Analysis and Programming (by R Friend)