[Python] Per-observation, per-variable contribution (sensitivity) analysis for linear regression and logistic regression models

Python Analysis and Programming / Python Statistical Analysis, 2020. 1. 12. 20:05

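In this post, I will introduce per-observation, per-variable contribution analysis (each variable's contribution for each observation), first for a linear regression model and then for a logistic regression model.

Variable importance (feature importance) expresses the (relative) importance of each variable at the model level, for a model fitted on all observations. The per-observation, per-variable contribution (sensitivity) introduced in this post instead looks at a single observation and asks how much one column contributes to the model's prediction for that observation.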

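Taking a linear regression model as an example, variable importance roughly corresponds to the regression coefficients, while the per-observation contribution of a variable is the prediction obtained when only that column's value for the given observation is kept and '0' is substituted for all the other variables. If variable importance is the set of model weights that apply globally, the per-observation, per-variable contribution is that specific variable's weight * value.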
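
As a minimal sketch of the weight * value idea for a linear model: the weights and observation values below are made-up numbers for illustration only, not taken from the abalone model fitted later in this post.

```python
import numpy as np

# hypothetical fitted weights: intercept, x1, x2 (illustrative values only)
weights = np.array([-0.5, 2.0, 0.1])

# one observation: the constant term 1.0, then the values of x1 and x2
x_obs = np.array([1.0, 0.3, 4.0])

# per-variable contribution for this observation = weight * value
contribution = weights * x_obs

# for a linear model the contributions add up to the prediction
prediction = weights @ x_obs

print(contribution)
print(contribution.sum(), prediction)
```

Note that x2 has the smallest weight but, because its value is large, it contributes almost as much as x1 for this particular observation.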

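You might wonder why we need per-observation contribution analysis once variable importance has been analyzed. We use it to find out, for each individual observation, which variables had the largest effect on that observation's prediction. (Is that a tautology? ^^;)

A variable's contribution to the prediction for a given observation depends on the combination of the model's weight for that variable (high or low) and the observation's value for that variable (high or low). For example, for a variable to contribute strongly to an observation's prediction, the model weight (importance) must be high and, at the same time, the observed value of that variable for that observation must also be high.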
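
The four weight/value combinations can be laid out as a small grid. This is only an illustrative sketch; the "high" and "low" numbers are arbitrary assumptions, not values from the post.

```python
import numpy as np

# hypothetical (illustrative) model weights and observed values
weights = {'high weight': 2.0, 'low weight': 0.1}
values = {'high value': 1.0, 'low value': 0.05}

# contribution = weight * value for every combination
grid = np.outer(list(weights.values()), list(values.values()))

for w_name, row in zip(weights, grid):
    for v_name, c in zip(values, row):
        print(f'{w_name:11s} x {v_name:10s} -> contribution {c:.3f}')
```

Only the "high weight x high value" cell produces a large contribution; the other three combinations are much smaller.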

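Per-observation contribution analysis is not based on complex or difficult theory; it is simple arithmetic repeated in a loop, so you should be able to pick it up quickly by following the example below.

As an example, I will use the public abalone dataset to fit a linear regression model that predicts whole weight, and then analyze how much each variable contributed to the prediction for each observation.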

1. Per-observation, per-variable contribution (sensitivity) analysis for a linear regression model

(1) Loading the abalone dataset into a pandas DataFrame

import numpy as np
import pandas as pd

url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'

abalone = pd.read_csv(url,
                      sep=',',
                      header=None,
                      names=['sex', 'length', 'diameter', 'height',
                             'whole_weight', 'shucked_weight', 'viscera_weight',
                             'shell_weight', 'rings'],
                      index_col=None)

abalone.head()

[Out]:

	sex	length	diameter	height	whole_weight	shucked_weight	viscera_weight	shell_weight	rings
0	M	0.455	0.365	0.095	0.5140	0.2245	0.1010	0.150	15
1	M	0.350	0.265	0.090	0.2255	0.0995	0.0485	0.070	7
2	F	0.530	0.420	0.135	0.6770	0.2565	0.1415	0.210	9
3	M	0.440	0.365	0.125	0.5160	0.2155	0.1140	0.155	10
4	I	0.330	0.255	0.080	0.2050	0.0895	0.0395	0.055	7

(2) Creating the X and y datasets

I will use 'whole weight' as the target variable y, and the six variables 'sex', 'length', 'diameter', 'height', 'shell_weight', and 'rings' as the explanatory variables X. Since 'sex' is a categorical variable, it is converted into dummy variables.
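
The post creates only two dummy columns (sex_m, sex_f) for the three levels of 'sex', which makes 'I' the baseline. The post does not state the reason; a common one is that a constant plus all three dummies would be perfectly collinear. A small sketch with made-up 'sex' values (the sex_i column is hypothetical and is not created in the post):

```python
import numpy as np

# a few illustrative 'sex' values (M, F, I), not the real dataset
sex = np.array(['M', 'M', 'F', 'M', 'I', 'F', 'I'])

const = np.ones(len(sex))
sex_m = (sex == 'M').astype(float)
sex_f = (sex == 'F').astype(float)
sex_i = (sex == 'I').astype(float)  # hypothetical third dummy

# constant + all three dummies: sex_m + sex_f + sex_i equals const,
# so the design matrix does not have full column rank
X_three = np.column_stack([const, sex_m, sex_f, sex_i])

# constant + two dummies ('I' as the baseline): full column rank
X_two = np.column_stack([const, sex_m, sex_f])

print(np.linalg.matrix_rank(X_three), 'of', X_three.shape[1], 'columns')
print(np.linalg.matrix_rank(X_two), 'of', X_two.shape[1], 'columns')
```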

# transform the categorical variable into dummy variables
abalone['sex'].unique()

[Out]: array(['M', 'F', 'I'], dtype=object)

abalone['sex_m'] = np.where(abalone['sex'] == 'M', 1, 0)
abalone['sex_f'] = np.where(abalone['sex'] == 'F', 1, 0)

# get X variables
X = abalone[["sex_m", "sex_f", "length", "diameter", "height", "shell_weight", "rings"]]

import statsmodels.api as sm

X = sm.add_constant(X)  # add a constant (intercept) column to the model

print(X.columns)
print(X)

[Out]:

Index(['const', 'sex_m', 'sex_f', 'length', 'diameter', 'height',
       'shell_weight', 'rings'],
      dtype='object')


      const  sex_m  sex_f  length  diameter  height  shell_weight  rings
0       1.0      1      0   0.455     0.365   0.095        0.1500     15
1       1.0      1      0   0.350     0.265   0.090        0.0700      7
2       1.0      0      1   0.530     0.420   0.135        0.2100      9
3       1.0      1      0   0.440     0.365   0.125        0.1550     10
4       1.0      0      0   0.330     0.255   0.080        0.0550      7
...     ...    ...    ...     ...       ...     ...           ...    ...
4172    1.0      0      1   0.565     0.450   0.165        0.2490     11
4173    1.0      1      0   0.590     0.440   0.135        0.2605     10
4174    1.0      1      0   0.600     0.475   0.205        0.3080      9
4175    1.0      0      1   0.625     0.485   0.150        0.2960     10
4176    1.0      1      0   0.710     0.555   0.195        0.4950     12

[4177 rows x 8 columns]

# get y value
y = abalone["whole_weight"]

I will split the dataset into train and test sets at a ratio of Train:Test = 8:2.

# train, test set split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=2004)

print('Shape of X_train:', X_train.shape)
print('Shape of X_test:', X_test.shape)
print('Shape of y_train:', y_train.shape)
print('Shape of y_test:', y_test.shape)

[Out]:
Shape of X_train: (3341, 8)
Shape of X_test: (836, 8)
Shape of y_train: (3341,)
Shape of y_test: (836,)
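
To see where the 3341 / 836 row counts come from, here is a rough sketch of an 8:2 split done by hand with NumPy. It is not scikit-learn's implementation, and the shuffled row order will differ from train_test_split even with the same seed; the rounding rule is an assumption chosen because it matches the shapes printed above.

```python
import math
import numpy as np

n_rows = 4177
test_size = 0.2

# assumption: the test count is rounded up and the rest goes to training
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test

# shuffle the row positions with a fixed seed, then cut into two parts
rng = np.random.default_rng(2004)
shuffled = rng.permutation(n_rows)
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

print(n_train, n_test)
```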

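(3) Fitting the linear regression model

Since explaining linear regression is not the purpose of this post, I will skip diagnosing and handling the assumptions required to fit a regression model, simply fit the model with the statsmodels.api library, and move straight on to the sensitivity analysis.

The per-observation, per-variable contribution analysis in this post only needs a fitted model that can predict, so in principle it can be applied to any algorithm (for example, black-box models such as Random Forest or deep learning) to help interpret how each variable affects an individual prediction. Note, however, that the contributions add up exactly to the model's prediction only for additive models such as linear regression.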

# multiple linear regression model
import statsmodels.api as sm

lin_reg = sm.OLS(y_train, X_train).fit()

lin_reg.summary()

OLS Regression Results
Dep. Variable:	whole_weight	R-squared:	0.942
Model:	OLS	Adj. R-squared:	0.942
Method:	Least Squares	F-statistic:	7738.
Date:	Fri, 17 Jan 2020	Prob (F-statistic):	0.00
Time:	20:06:12	Log-Likelihood:	2375.3
No. Observations:	3341	AIC:	-4735.
Df Residuals:	3333	BIC:	-4686.
Df Model:	7
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
const	-0.3609	0.015	-23.795	0.000	-0.391	-0.331
sex_m	0.0349	0.006	6.039	0.000	0.024	0.046
sex_f	0.0217	0.006	3.493	0.000	0.010	0.034
length	1.2539	0.107	11.724	0.000	1.044	1.464
diameter	0.0749	0.135	0.554	0.580	-0.190	0.340
height	0.3880	0.088	4.410	0.000	0.216	0.561
shell_weight	2.4384	0.038	64.802	0.000	2.365	2.512
rings	-0.0154	0.001	-18.423	0.000	-0.017	-0.014

Omnibus:	698.165	Durbin-Watson:	1.978
Prob(Omnibus):	0.000	Jarque-Bera (JB):	4041.763
Skew:	0.866	Prob(JB):	0.00
Kurtosis:	8.102	Cond. No.	866.

# prediction
predicted = lin_reg.predict(X_test)
actual = y_test

act_pred_df = pd.DataFrame({'actual': actual,
                            'predicted': predicted,
                            'error': actual - predicted})

act_pred_df.head()

[Out]:

	actual	predicted	error
3473	0.0455	-0.094703	0.140203
3523	0.0970	0.020604	0.076396
1862	0.5185	0.690349	-0.171849
2966	1.4820	1.358950	0.123050
659	0.9585	1.137433	-0.178933

# Scatter Plot: Actual vs. Predicted
import matplotlib.pyplot as plt

plt.plot(act_pred_df['actual'], act_pred_df['predicted'], 'o')
plt.xlabel('actual', fontsize=14)
plt.ylabel('predicted', fontsize=14)
plt.show()

# RMSE (Root Mean Squared Error)
from sklearn.metrics import mean_squared_error
from math import sqrt

rmse = sqrt(mean_squared_error(actual, predicted))

rmse

[Out]: 0.11099248621173345
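
For reference, RMSE is the square root of the mean of the squared errors (actual - predicted). The sketch below computes it by hand for only the five rows shown in act_pred_df.head() above, so the result will not equal the RMSE over all 836 test rows.

```python
import numpy as np

# the five actual / predicted pairs shown in act_pred_df.head() above
actual = np.array([0.0455, 0.0970, 0.5185, 1.4820, 0.9585])
predicted = np.array([-0.094703, 0.020604, 0.690349, 1.358950, 1.137433])

# error = actual - predicted, as in the 'error' column of act_pred_df
error = actual - predicted

# RMSE over these five rows only
rmse_head = np.sqrt(np.mean(error ** 2))
print(rmse_head)
```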

(4) Analyzing each variable's contribution to the model's prediction for a specific observation

In the example below, I analyze the contribution of each of the 8 variables for the 1st observation among the 836 observations in the test set.

X_test.shape

[Out]: (836, 8)

# get 1st observation's value as an example

X_i = X_test.iloc[0, :]

X_i

[Out]:
const           1.000
sex_m           0.000
sex_f           0.000
length          0.210
diameter        0.150
height          0.055
shell_weight    0.013
rings           4.000
Name: 3473, dtype: float64

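The core idea of per-observation, per-variable contribution (sensitivity) analysis is to take the fitted model and go through the variables of a specific observation one at a time, predicting with that variable's measured value kept as-is and the values of all the other variables replaced with '0'. Look closely at the for i, j in enumerate(X_i): X_mat[i, i] = j part of the code below.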

# all zeros matrix with shape of [col_num, col_num]

X_mat = np.zeros(shape=[X_i.shape[0], X_i.shape[0]])

X_mat

[Out]:

array([[0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.]])

# fill only 1 variable's value per row and leave '0' for the others
for i, j in enumerate(X_i):
    X_mat[i, i] = j

X_mat

[Out]:

array([[1.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.   , 0.21 , 0.   , 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.   , 0.   , 0.15 , 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.   , 0.   , 0.   , 0.055, 0.   , 0.   ],
       [0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.013, 0.   ],
       [0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 4.   ]])
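
The matrix built by the loop is simply a diagonal matrix of the observation's values, so np.diag gives the same result in one call. A small sketch using the values of the 1st test observation shown above (the variable names x_i, X_mat_loop and X_mat_diag are mine, not the post's):

```python
import numpy as np

# values of the 1st test observation, in the column order of X_test
x_i = np.array([1.0, 0.0, 0.0, 0.21, 0.15, 0.055, 0.013, 4.0])

# loop version, as in the post
X_mat_loop = np.zeros((x_i.shape[0], x_i.shape[0]))
for i, value in enumerate(x_i):
    X_mat_loop[i, i] = value

# equivalent one-liner: place the values on the diagonal
X_mat_diag = np.diag(x_i)

print(np.array_equal(X_mat_loop, X_mat_diag))
```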

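Using the X_mat matrix built just above (each row keeps the actual value of one variable of the 1st observation and replaces the rest with '0'), predicting y with the linear regression model fitted in (3) (lin_reg.predict(X_mat)) gives what we want: the contribution of each individual variable to the predicted y for the 1st observation.

I then build a DataFrame sorted in descending order of each variable's contribution to the predicted y, and draw a horizontal bar plot.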

# sensitivity analysis
sensitivity_df = pd.DataFrame({'x': X_test.iloc[0, :],
                               'contribution_x': lin_reg.predict(X_mat)}).\
    sort_values(by='contribution_x', ascending=False)

sensitivity_df

[Out]:

	x	contribution_x
length	0.210	0.263315
shell_weight	0.013	0.031699
height	0.055	0.021341
diameter	0.150	0.011237
sex_m	0.000	0.000000
sex_f	0.000	0.000000
rings	4.000	-0.061437
const	1.000	-0.360857

# horizontal bar plot by column's contribution

sensitivity_df['contribution_x'].plot(kind='barh', figsize=(10, 5))

plt.title('Sensitivity Analysis', fontsize=16)

plt.xlabel('Contribution', fontsize=16)

plt.ylabel('Variable', fontsize=16)

plt.yticks(fontsize=14)

plt.show()

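Naturally, adding up all the results of the per-variable contribution (sensitivity) analysis for an observation gives the same value as feeding all of that observation's variable values into the original model.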

# result from sum of contribution analysis

sum(sensitivity_df['contribution_x'])

[Out]: -0.09470251191012563

# result from linear regression model's prediction

lin_reg.predict(X_test.iloc[0, :].to_numpy())

[Out]: array([-0.09470251])
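
The reason the two numbers agree is that each row of X_mat has a single non-zero entry, so the linear model's prediction for row k is just coef_k * x_k, and summing over k gives the ordinary dot product. The sketch below redoes this arithmetic with the coefficients as printed in the OLS summary; because those are rounded to four decimals, the results match the table above only approximately.

```python
import numpy as np

names = ['const', 'sex_m', 'sex_f', 'length', 'diameter',
         'height', 'shell_weight', 'rings']

# coefficients as printed (rounded) in the OLS summary above
coef = np.array([-0.3609, 0.0349, 0.0217, 1.2539, 0.0749, 0.3880, 2.4384, -0.0154])

# values of the 1st test observation
x_i = np.array([1.0, 0.0, 0.0, 0.21, 0.15, 0.055, 0.013, 4.0])

# contribution of each variable = coefficient * value
contribution = coef * x_i

for name, c in zip(names, contribution):
    print(f'{name:13s}{c: .6f}')

print('sum of contributions:', contribution.sum())
```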

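(5) A user-defined function for per-observation, per-variable contribution (sensitivity) analysis

Let's write a user-defined function that stores the contribution analysis result in a pandas DataFrame and draws a horizontal bar plot sorted by contribution.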

# UDF for contribution (sensitivity) analysis per variable
def sensitivity_analysis(model, X, idx, bar_plot_yn):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import statsmodels.api as sm
    pd.options.mode.chained_assignment = None

    # get one observation's X values
    X_i = X.iloc[idx, :]

    # make a zero matrix with shape [num_cols, num_cols]
    X_mat = np.zeros(shape=[X_i.shape[0], X_i.shape[0]])

    # fill X_mat one column at a time, leaving the others as zeros
    for i, j in enumerate(X_i):
        X_mat[i, i] = j

    # data frame with the contribution of each X column, sorted ascending
    # so that the largest bar is drawn at the top of the horizontal bar plot
    sensitivity_df = pd.DataFrame({'idx': idx,
                                   'x': X_i,
                                   'contribution_x': model.predict(X_mat)}).\
        sort_values(by='contribution_x', ascending=True)

    # if bar_plot_yn is True then display the bar plot
    col_n = X_i.shape[0]
    if bar_plot_yn:
        sensitivity_df['contribution_x'].plot(kind='barh', figsize=(10, 0.7*col_n))
        plt.title('Sensitivity Analysis', fontsize=18)
        plt.xlabel('Contribution', fontsize=16)
        plt.ylabel('Variable', fontsize=16)
        plt.yticks(fontsize=14)
        plt.show()

    # return the contributions in descending order
    return sensitivity_df.sort_values(by='contribution_x', ascending=False)

# check UDF

sensitivity_df = sensitivity_analysis(model=lin_reg, X=X_test, idx=0, bar_plot_yn=True)

sensitivity_df

Below, bar_plot_yn=False is passed to the sensitivity_analysis() function defined above, so it skips the horizontal bar plot and only returns the contribution analysis DataFrame.

# without bar plot (bar_plot_yn=False)

sensitivity_df = sensitivity_analysis(model=lin_reg, X=X_test, idx=0, bar_plot_yn=False)

sensitivity_df

[Out]:

	idx	x	contribution_x
length	0	0.210	0.263315
shell_weight	0	0.013	0.031699
height	0	0.055	0.021341
diameter	0	0.150	0.011237
sex_m	0	0.000	0.000000
sex_f	0	0.000	0.000000
rings	0	4.000	-0.061437
const	0	1.000	-0.360857

(6) Per-observation, per-variable contribution analysis for multiple observations

Taking the first 10 observations, I analyze the variable contributions of each observation using the sensitivity_analysis() user-defined function from (5) and a for loop. (The bar plots are omitted.)

# calculate the sensitivity of each column for the first 10 observations using a for loop
# blank DataFrame to collect the sensitivity results
sensitivity_df_all = pd.DataFrame()

to_idx = 10

for idx in range(0, to_idx):
    sensitivity_df_idx = sensitivity_analysis(model=lin_reg,
                                              X=X_test,
                                              idx=idx,
                                              bar_plot_yn=False)

    sensitivity_df_all = pd.concat([sensitivity_df_all, sensitivity_df_idx], axis=0)

    print("[STATUS]", idx+1, "/", to_idx, "(", 100*(idx+1)/to_idx, "%) is completed...")

[STATUS] 1 / 10 ( 10.0 %) is completed...
[STATUS] 2 / 10 ( 20.0 %) is completed...
[STATUS] 3 / 10 ( 30.0 %) is completed...
[STATUS] 4 / 10 ( 40.0 %) is completed...
[STATUS] 5 / 10 ( 50.0 %) is completed...
[STATUS] 6 / 10 ( 60.0 %) is completed...
[STATUS] 7 / 10 ( 70.0 %) is completed...
[STATUS] 8 / 10 ( 80.0 %) is completed...
[STATUS] 9 / 10 ( 90.0 %) is completed...
[STATUS] 10 / 10 ( 100.0 %) is completed...

Let's check the result.

sensitivity_df_all[:20]

[Out]:

	idx	x	contribution_x
length	0	0.2100	0.263315
shell_weight	0	0.0130	0.031699
height	0	0.0550	0.021341
diameter	0	0.1500	0.011237
sex_m	0	0.0000	0.000000
sex_f	0	0.0000	0.000000
rings	0	4.0000	-0.061437
const	0	1.0000	-0.360857
length	1	0.2600	0.326009
shell_weight	1	0.0305	0.074372
height	1	0.0700	0.027161
diameter	1	0.2050	0.015357
sex_m	1	0.0000	0.000000
sex_f	1	0.0000	0.000000
rings	1	4.0000	-0.061437
const	1	1.0000	-0.360857
length	2	0.5200	0.652018
shell_weight	2	0.1840	0.448668
height	2	0.1100	0.042682
diameter	2	0.4100	0.030713
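
For a linear model specifically, the same contributions can be computed for many observations at once by multiplying each row by the coefficient vector, without building X_mat per observation. This is an alternative sketch, not the approach used in this post, and it does not carry over to non-linear models. It uses the rounded coefficients from the OLS summary and the first two test observations listed above, so the numbers are approximate.

```python
import numpy as np

# coefficients as printed (rounded) in the OLS summary:
# const, sex_m, sex_f, length, diameter, height, shell_weight, rings
coef = np.array([-0.3609, 0.0349, 0.0217, 1.2539, 0.0749, 0.3880, 2.4384, -0.0154])

# first two test observations (idx 0 and 1) in the same column order
X_rows = np.array([
    [1.0, 0.0, 0.0, 0.21, 0.150, 0.055, 0.0130, 4.0],
    [1.0, 0.0, 0.0, 0.26, 0.205, 0.070, 0.0305, 4.0],
])

# broadcasting: every row is multiplied element-wise by the coefficients
contrib_all = X_rows * coef

# row sums reproduce the model's predictions for those observations
pred_all = contrib_all.sum(axis=1)

print(contrib_all.round(6))
print(pred_all)
```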

2. Per-observation, per-variable sensitivity analysis for a logistic regression model

I said earlier that the sensitivity analysis can be used with any model. This time, using the same abalone dataset, I create a binary target that is '1' if whole weight is greater than or equal to the mean and '0' if it is below the mean, fit a logistic regression model for this binary classification, and then apply the sensitivity analysis.

First, I create a categorical variable named y_cat, split the data into train and test sets, fit a logistic regression model, and compute the predicted probabilities used for binary classification (classified as '1' if the probability is greater than 0.5).

# make a y_category variable: if y is greater or equal to mean, then 1

y_cat = np.where(abalone["whole_weight"] >= np.mean(abalone["whole_weight"]), 1, 0)

y_cat[:20]

[Out]: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

cat_class, counts = np.unique(y_cat, return_counts=True)

dict(zip(cat_class, counts))

[Out]: {0: 2178, 1: 1999}

# train, test set split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y_cat,
                                                    test_size=0.2,
                                                    random_state=2004)

# fitting logistic regression
import statsmodels.api as sm
pd.options.mode.chained_assignment = None

logitreg = sm.Logit(y_train, X_train)
logitreg_fit = logitreg.fit()

print(logitreg_fit.summary())

[Out]:

Optimization terminated successfully.
         Current function value: 0.108606
         Iterations 11
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                 3341
Model:                          Logit   Df Residuals:                     3333
Method:                           MLE   Df Model:                            7
Date:                Fri, 17 Jan 2020   Pseudo R-squ.:                  0.8431
Time:                        21:00:24   Log-Likelihood:                -362.85
converged:                       True   LL-Null:                       -2313.0
Covariance Type:            nonrobust   LLR p-value:                     0.000
=============================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
const          -35.3546      2.219    -15.935      0.000     -39.703     -31.006
sex_m            1.3064      0.250      5.217      0.000       0.816       1.797
sex_f            1.1418      0.258      4.420      0.000       0.635       1.648
length          36.6625      5.348      6.855      0.000      26.180      47.145
diameter        11.7960      6.160      1.915      0.055      -0.277      23.869
height           7.6179      2.011      3.789      0.000       3.677      11.559
shell_weight    38.0930      3.286     11.593      0.000      31.653      44.533
rings           -0.1062      0.038     -2.777      0.005      -0.181      -0.031
=============================================================================

# prediction

test_prob_logitreg = logitreg_fit.predict(X_test)

test_prob_logitreg.head()

[Out]:

3473    9.341757e-12
3523    2.440305e-10
1862    1.147617e-02
2966    9.999843e-01
659     9.964952e-01
dtype: float64
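
Turning these probabilities into class labels with the 0.5 cut-off mentioned above could look like the following sketch, which uses the five probabilities printed above (the names prob and pred_class are mine):

```python
import numpy as np

# predicted probabilities shown in test_prob_logitreg.head() above
prob = np.array([9.341757e-12, 2.440305e-10, 1.147617e-02,
                 9.999843e-01, 9.964952e-01])

# classify as 1 when the probability is greater than 0.5, otherwise 0
pred_class = np.where(prob > 0.5, 1, 0)

print(pred_class)
```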

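Next, I reuse the sensitivity analysis function written for the linear regression model, slightly modified so that it also works with a logistic regression model.

The function below assumes that the logistic regression model was fitted with the statsmodels library. If you fitted it with LogisticRegression from scikit-learn (from sklearn.linear_model import LogisticRegression), uncomment the part disabled with '#' and use that instead, because the probability of class 1 then has to be taken from predict_proba(X_mat)[:, 1].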

# UDF for contribution (sensitivity) analysis per variable
# task: "LinearReg" or "LogitReg"
def sensitivity_analysis_LinearReg_LogitReg(task, model, X, idx, bar_plot_yn):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import statsmodels.api as sm
    pd.options.mode.chained_assignment = None

    # get one observation's X values
    X_i = X.iloc[idx, :]

    # make a zero matrix with shape [num_cols, num_cols]
    X_mat = np.zeros(shape=[X_i.shape[0], X_i.shape[0]])

    # fill X_mat one column at a time, leaving the others as zeros
    for i, j in enumerate(X_i):
        X_mat[i, i] = j

    # data frame with the contribution of each X column
    sensitivity_df = pd.DataFrame({
        'idx': idx
        , 'task': task
        , 'x': X_i
        , 'contribution_x': model.predict(X_mat)
    })

    # # ==== Remark =====
    # # if you used LogisticRegression from sklearn.linear_model
    # # then use the code below instead
    # if task == "LinearReg":
    #     sensitivity_df = pd.DataFrame({
    #         'idx': idx
    #         , 'task': task
    #         , 'x': X_i
    #         , 'contribution_x': model.predict(X_mat)
    #     })
    # elif task == "LogitReg":
    #     sensitivity_df = pd.DataFrame({
    #         'idx': idx
    #         , 'task': task
    #         , 'x': X_i
    #         , 'contribution_x': model.predict_proba(X_mat)[:,1]
    #     })
    # else:
    #     print('Please choose task one of "LinearReg" or "LogitReg"...')

    # sort ascending so that the largest bar is drawn at the top of the plot
    sensitivity_df = sensitivity_df.sort_values(by='contribution_x', ascending=True)

    # if bar_plot_yn is True then display the bar plot
    col_n = X_i.shape[0]
    if bar_plot_yn == True:
        sensitivity_df['contribution_x'].plot(kind='barh', figsize=(10, 0.7*col_n))
        plt.title('Sensitivity Analysis', fontsize=18)
        plt.xlabel('Contribution', fontsize=16)
        plt.ylabel('Variable', fontsize=16)
        plt.yticks(fontsize=14)
        plt.show()

    # return the contributions in descending order
    return sensitivity_df.sort_values(by='contribution_x', ascending=False)

# apply the sensitivity analysis function to the 1st observation for logistic regression
sensitivity_analysis_LinearReg_LogitReg(task="LogitReg",
                                        model=logitreg_fit,
                                        X=X_test,
                                        idx=0,
                                        bar_plot_yn=True)
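
One point worth keeping in mind when reading the logistic regression result: predict(X_mat) returns a probability for each row, i.e. sigmoid(coef_k * x_k), so a variable whose value is 0 shows 0.5 rather than 0, and the values no longer add up to the predicted probability. A possible alternative (my own sketch, not part of the function above) is to look at coef_k * x_k on the log-odds scale, where the contributions are additive again. The coefficients below are the rounded values from the Logit summary, so the results are approximate.

```python
import numpy as np

def sigmoid(z):
    # logistic function: maps log-odds to a probability
    return 1.0 / (1.0 + np.exp(-z))

names = ['const', 'sex_m', 'sex_f', 'length', 'diameter',
         'height', 'shell_weight', 'rings']

# coefficients as printed (rounded) in the Logit summary above
coef = np.array([-35.3546, 1.3064, 1.1418, 36.6625, 11.7960, 7.6179, 38.0930, -0.1062])

# values of the 1st test observation
x_i = np.array([1.0, 0.0, 0.0, 0.21, 0.15, 0.055, 0.013, 4.0])

# probability scale: what predicting on the diagonal matrix gives row by row
prob_scale = sigmoid(coef * x_i)

# log-odds scale: additive contributions
logit_scale = coef * x_i

# the summed log-odds, passed through the sigmoid, gives the full prediction
prob_full = sigmoid(logit_scale.sum())

for name, p, l in zip(names, prob_scale, logit_scale):
    print(f'{name:13s} prob-scale {p:.6f}   log-odds {l: .4f}')

print('predicted probability:', prob_full)
```

With these rounded coefficients the predicted probability comes out close to the 9.34e-12 shown for observation 3473 in the prediction output above.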

I hope this was helpful.

License: Attribution, NonCommercial, NoDerivatives

Posted by Rfriend
