R, Python 분석과 프로그래밍의 친구 (by R Friend)

'statsmodels'에 해당되는 글 2건

2021.07.29 [Python] 지수평활법 모형 훈련 및 예측, 모델평가 (Exponential Smoothing in Python) 19
2021.05.08 [Python] DataFrame에서 여러개의 변수에 대해 일원분산분석 검정하기 (ANOVA test for multiple numeric variables in pandas DataFrame)

[Python] 지수평활법 모형 훈련 및 예측, 모델평가 (Exponential Smoothing in Python)

Python 분석과 프로그래밍/Python 통계분석 2021. 7. 29. 19:49

지난번 포스팅에서는 시계열 패턴 별 지수평활법 모델을 설정하는 방법 (https://rfriend.tistory.com/511), 그리고 시계열 데이터 예측 모델의 성능, 모델 적합도 (goodness-of-fit of time series model)를 평가하는 지표 (https://rfriend.tistory.com/667) 에 대해서 소개하였습니다.

이번 포스팅에서는 간단한 시계열 예제 데이터에 대해 앞서 소개한 이론들을 Python 코드로 실습하여

(1) 지수평활법(Exponential Smoothing) 기법으로 시계열 예측 모델을 만들고,

(2) 모델 적합도를 평가하는 여러 지표들로 성능을 평가해서 최적의 모델을 선택

해보겠습니다.

예제로 사용할 시계열 데이터는 1991년 7월부터 2008년 6월까지 월 단위(monthly)로 집계된 약 판매량 시계열 데이터셋 (drug sales time series dataset) 입니다.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## getting drug sales dataset
file_path = 'https://raw.githubusercontent.com/selva86/datasets/master/a10.csv'
df = pd.read_csv(file_path, 
                 parse_dates=['date'], 
                 index_col='date')
                 
df.head(12)

# [Out]
#             value
# date	
# 1991-07-01	3.526591
# 1991-08-01	3.180891
# 1991-09-01	3.252221
# 1991-10-01	3.611003
# 1991-11-01	3.565869
# 1991-12-01	4.306371
# 1992-01-01	5.088335
# 1992-02-01	2.814520
# 1992-03-01	2.985811
# 1992-04-01	3.204780
# 1992-05-01	3.127578
# 1992-06-01	3.270523


## time series plot
df.plot(figsize=[12, 8])
plt.title('Drug Sales', fontsize=14)
plt.show()

위의 약 판매량 시도표 (time series plot of drug sales) 를 보면 세 가지를 알 수 있습니다.

(a) 선형 추세가 존재합니다. (linear or quadratic trend)

(b) 1년 단위의 계절성이 존재합니다. (seasonality of a year, ie. 12 months)

(c) 분산이 시간의 흐름에 따라 점점 커지고 있습니다. (increasing variation)

우리가 관찰한 이들 세가지의 시계열 데이터의 패턴으로 부터 "Multiplicative Winters' method with Linear / Quadratic Trend" 모델이 적합할 것이라고 판단해볼 수 있습니다.

* 참조: 시계열 패턴 별 지수평활법 모델을 설정하는 방법 (https://rfriend.tistory.com/511)

그럼, 실제로 다양한 지수평활법으로 위의 약 판매량 시계열 데이터에 대한 예측 모델을 만들어 예측을 해보고, 다양한 평가지표로 모델 간 비교를 해봐서 가장 우수한 모델이 무엇인지, 과연 앞서 말한 "Multiplicative Winters' method with Linear Trend" 모델이 가장 우수할 것인지 확인해보겠습니다. 보는게 믿는거 아니겠습니까!

이제 월 단위로 시간 순서대로 정렬된 시계열 데이터를 제일 뒤 (최근)의 12개월치는 시계열 예측 모델의 성능을 평가할 때 사용할 test set으로 따로 떼어놓고, 나머지 데이터는 시계열 예측모형 훈련에 사용할 training set 으로 분할 (splitting between the training and the test set)을 해보겠습니다.

## split between the training and the test data sets. 
## The last 12 periods form the test data
df_train = df.iloc[:-12]
df_test = df.iloc[-12:]

(1) 지수평활법 기법으로 시계열 예측모형 적합하기
(training the exponential smoothing models in Python)

Python의 statsmodels 라이브러리는 시계열 데이터 분석을 위한 API를 제공합니다. 아래의 코드는 순서대로

(a) Simple Exponential Smoothing

-- 추세 요소의 유형 (type of trend components)에 따라서

(b) Holt's (trend=None)

(c) Exponential trend (exponential=True)

(d) Additive damped trend (damped_trend=True)

(e) Multiplicative damped trend (exponential=True, damped_trend=True)

로 구분할 수 있습니다.

## exponential smoothing in Python
from statsmodels.tsa.api import ExponentialSmoothing, SimpleExpSmoothing, Holt

# Simple Exponential Smoothing
fit1 = SimpleExpSmoothing(df_train, 
                          initialization_method="estimated").fit()

# Trend
fit2 = Holt(df_train, 
            initialization_method="estimated").fit()

# Exponential trend
fit3 = Holt(df_train,
            exponential=True, 
            initialization_method="estimated").fit()

# Additive damped trend
fit4 = Holt(df_train,
            damped_trend=True, 
            initialization_method="estimated").fit()

# Multiplicative damped trend
fit5 = Holt(df_train,
            exponential=True, 
            damped_trend=True, 
            initialization_method="estimated").fit()

statsmodels.tsa.api 메소드를 사용해서 적합한 시계열 데이터 예측 모델에 대해서는 model_object.summary() 메소드를 사용해서 적합 결과를 조회할 수 있습니다. 그리고 model_object.params['parameter_name'] 의 방식으로 지수평활법 시계열 적합 모델의 속성값(attributes of the fitted exponential smoothing results)에 대해서 조회할 수 있습니다.

## accessing the results of SimpleExpSmoothing Model
print(fit1.summary())

#                        SimpleExpSmoothing Model Results                       
# ==============================================================================
# Dep. Variable:                  value   No. Observations:                  192
# Model:             SimpleExpSmoothing   SSE                            692.196
# Optimized:                       True   AIC                            250.216
# Trend:                           None   BIC                            256.731
# Seasonal:                        None   AICC                           250.430
# Seasonal Periods:                None   Date:                 Thu, 29 Jul 2021
# Box-Cox:                        False   Time:                         19:26:46
# Box-Cox Coeff.:                  None                                         
# ==============================================================================
#                        coeff                 code              optimized      
# ------------------------------------------------------------------------------
# smoothing_level            0.3062819                alpha                 True
# initial_level              3.4872869                  l.0                 True
# ------------------------------------------------------------------------------


## getting all results by models
params = ['smoothing_level', 'smoothing_trend', 'damping_trend', 'initial_level', 'initial_trend']
results=pd.DataFrame(index=[r"$\alpha$",r"$\beta$",r"$\phi$",r"$l_0$","$b_0$","SSE"],
                     columns=['SES', "Holt's","Exponential", "Additive", "Multiplicative"])

results["SES"] =            [fit1.params[p] for p in params] + [fit1.sse]
results["Holt's"] =         [fit2.params[p] for p in params] + [fit2.sse]
results["Exponential"] =    [fit3.params[p] for p in params] + [fit3.sse]
results["Additive"] =       [fit4.params[p] for p in params] + [fit4.sse]
results["Multiplicative"] = [fit5.params[p] for p in params] + [fit5.sse]

results
# [Out]
#      SES	Holt's	Exponential	Additive	Multiplicative
# 𝛼 	0.306282	0.053290	1.655513e-08	0.064947	0.040053
# 𝛽 	NaN	0.053290	1.187778e-08	0.064947	0.040053
# 𝜙 	NaN	NaN	NaN	0.995000	0.995000
# 𝑙0 	3.487287	3.279317	3.681273e+00	3.234974	3.382898
# 𝑏0 	NaN	0.045952	1.009056e+00	0.052515	1.016242
# SSE  692.195826	631.551871	5.823167e+02	641.058197	621.344891

계절성(seasonality)을 포함한 지수평활법에 대해서는 statsmodels.tsa.holtwinters 내의 메소드를 사용해서 시계열 예측 모델을 적합해보겠습니다.

seasonal_periods 에는 계절 주기를 입력해주는데요, 이번 예제에서는 월 단위 집계 데이터에서 1년 단위로 계절성이 나타나고 있으므로 seasonal_periods = 12 를 입력해주었습니다.

trend = 'add' 로 일단 고정을 했으며, 대신에 계절성의 경우 (f) seasonal = 'add' (or 'additive'), (g) seasonal = 'mul' (or 'multiplicative') 로 구분해서 모델을 각각 적합해보았습니다.

위의 시계열 시도표를 보면 시간이 흐름에 따라 분산이 점점 커지고 있으므로 seasonal = 'mul' (or 'multiplicative') 가 더 적합할 것이라고 우리는 예상할 수 있습니다. 실제 그런지 데이터로 확인해보겠습니다.

## Holt's Winters's method for time series data with Seasonality
from statsmodels.tsa.holtwinters import ExponentialSmoothing as HWES

# additive model for fixed seasonal variation
fit6 = HWES(df_train, 
             seasonal_periods=12, 
             trend='add', 
             seasonal='add').fit(optimized=True, use_brute=True)

# multiplicative model for increasing seasonal variation
fit7 = HWES(df_train, 
             seasonal_periods=12, 
             trend='add', 
             seasonal='mul').fit(optimized=True, use_brute=True)

(2) 지수평활법 적합 모델로 예측하여 모델 간 성능 평가/비교하고 최적 모델 선택하기

(forecasting using the exponential smoothing models, model evaluation/ model comparison and best model selection)

(2-1) 향후 12개월 약 판매량 예측하기 (forecasting the drug sales for the next 12 months)

forecast() 함수를 사용하며, 괄호 안에는 예측하고자 하는 기간을 정수로 입력해줍니다.

## forecasting for 12 months
forecast_1 = fit1.forecast(12)
forecast_2 = fit2.forecast(12)
forecast_3 = fit3.forecast(12)
forecast_4 = fit4.forecast(12)
forecast_5 = fit5.forecast(12)
forecast_6 = fit6.forecast(12)
forecast_7 = fit7.forecast(12


y_test = df_test['value']

t_p = pd.DataFrame({'test': y_test, 
                    'f1': forecast_1, 
                    'f2': forecast_2, 
                    'f3': forecast_3, 
                    'f4': forecast_4, 
                    'f5': forecast_5, 
                    'f6': forecast_6, 
                    'f7': forecast_7})
                    
                    
print(t_p)

#                  test        f1         f2         f3         f4         f5  \
# 2007-07-01  21.834890  20.15245  20.350267  20.970823  20.454211  20.261185   
# 2007-08-01  23.930204  20.15245  20.511013  21.160728  20.616026  20.421627   
# 2007-09-01  22.930357  20.15245  20.671760  21.352352  20.777032  20.582528   
# 2007-10-01  23.263340  20.15245  20.832507  21.545711  20.937233  20.743882   
# 2007-11-01  25.250030  20.15245  20.993254  21.740821  21.096633  20.905686   
# 2007-12-01  25.806090  20.15245  21.154000  21.937699  21.255236  21.067932   
# 2008-01-01  29.665356  20.15245  21.314747  22.136359  21.413046  21.230618   
# 2008-02-01  21.654285  20.15245  21.475494  22.336818  21.570067  21.393736   
# 2008-03-01  18.264945  20.15245  21.636241  22.539092  21.726303  21.557283   
# 2008-04-01  23.107677  20.15245  21.796987  22.743198  21.881758  21.721254   
# 2008-05-01  22.912510  20.15245  21.957734  22.949152  22.036435  21.885642   
# 2008-06-01  19.431740  20.15245  22.118481  23.156972  22.190339  22.050443   

#                    f6         f7  
# 2007-07-01  20.882394  21.580197  
# 2007-08-01  22.592307  22.164134  
# 2007-09-01  21.388974  22.136953  
# 2007-10-01  25.178594  24.202879  
# 2007-11-01  26.776556  25.322929  
# 2007-12-01  26.424404  27.549045  
# 2008-01-01  30.468546  31.194773  
# 2008-02-01  19.065461  18.181477  
# 2008-03-01  21.834584  21.034783  
# 2008-04-01  19.061426  20.085947  
# 2008-05-01  23.452042  23.076423  
# 2008-06-01  22.353486  22.598155

(2-2) 모델 간 모델적합도/성능 평가하고 비교하기

Python의 scikit-learn 라이브러리의 metrics 메소드에 이미 정의되어 있는 MSE, MAE 지표는 가져다가 쓰구요, 없는 지표는 모델 적합도 (goodness-of-fit of time series model)를 평가하는 지표 (https://rfriend.tistory.com/667) 포스팅에서 소개했던 수식대로 Python 사용자 정의 함수(User Defined Funtion)을 만들어서 사용하겠습니다.

아래의 모델 성능 평가 지표 중에서 AIC, SBC, APC, Adjsted-R2 지표는 모델 적합에 사용된 모수의 개수 (the number of parameters) 를 알아야지 계산할 수 있으며, 모수의 개수는 지수평활법 모델 종류별로 다르므로, 적합된 모델 객체로부터 모수 개수를 세는 사용자 정의함수를 먼저 정의해보겠습니다.

## UDF for counting the number of parameters in model
def num_params(model):
    n_params = 0

    for p in list(model.params.values()):
        if isinstance(p, np.ndarray):
            n_params += len(p)
            #print(p)
        elif p in [np.nan, False, None]:
            pass
        elif np.isnan(float(p)):
            pass
        else:
            n_params += 1
            #print(p)
    
    return n_params
    

num_params(fit1)
[Out] 2

num_params(fit7)
[Out] 17

이제 각 지표별로 Python을 사용해서 지수평활법 모델별로 성능을 평가하는 사용자 정의함수를 정의해보겠습니다. MSE와 MAE는 scikit-learn.metrics 에 있는 메소드를 가져다가 사용하며, 나머지는 공식에 맞추어서 정의해주었습니다.

(혹시 sklearn 라이브러리에 모델 성능 평가 지표 추가로 더 있으면 그거 가져다 사용하시면 됩니다. 저는 sklearn 튜토리얼 찾아보고 없는거 같아서 아래처럼 정의해봤어요. 혹시 코드 잘못된거 있으면 댓글에 남겨주시면 감사하겠습니다.)

## number of observations in training set
T = df_train.shape[0]
print(T)
[Out] 192


## evaluation metrics
from sklearn.metrics import mean_squared_error as MSE
from sklearn.metrics import mean_absolute_error as MAE

# Mean Absolute Percentage Error
def SSE(y_test, y_pred):
    y_test, y_pred = np.array(y_test), np.array(y_pred)
    return np.sum((y_test - y_pred)**2)

def ME(y_test, y_pred):
    y_test, y_pred = np.array(y_test), np.array(y_pred)
    return np.mean(y_test - y_pred)

def RMSE(y_test, y_pred):
    y_test, y_pred = np.array(y_test), np.array(y_pred)
    return np.sqrt(np.mean((y_test - y_pred)**2))   
    #return np.sqrt(MSE(y_test - y_pred))

def MPE(y_test, y_pred): 
    y_test, y_pred = np.array(y_test), np.array(y_pred)
    return np.mean((y_test - y_pred) / y_test) * 100

def MAPE(y_test, y_pred): 
    y_test, y_pred = np.array(y_test), np.array(y_pred)
    return np.mean(np.abs((y_test - y_pred) / y_test)) * 100

def AIC(y_test, y_pred, T, model):
    y_test, y_pred = np.array(y_test), np.array(y_pred)
    sse = np.sum((y_test - y_pred)**2)
    #T = len(y_train) # number of observations
    k = num_params(model) # number of parameters
    return T * np.log(sse/T) + 2*k

def SBC(y_test, y_pred, T, model):
    y_test, y_pred = np.array(y_test), np.array(y_pred)
    sse = np.sum((y_test - y_pred)**2)
    #T = len(y_train) # number of observations
    k = num_params(model) # number of parameters
    return T * np.log(sse/T) + k * np.log(T)

def APC(y_test, y_pred, T, model):
    y_test, y_pred = np.array(y_test), np.array(y_pred)
    sse = np.sum((y_test - y_pred)**2)
    #T = len(y_train) # number of observations
    k = num_params(model) # number of parameters
    return ((T+k)/(T-k)) * sse / T

def ADJ_R2(y_test, y_pred, T, model):
    y_test, y_pred = np.array(y_test), np.array(y_pred)
    sst = np.sum((y_test - np.mean(y_test))**2)
    sse = np.sum((y_test - y_pred)**2)
    #T = len(y_train) # number of observations
    k = num_params(model) # number of parameters
    r2 = 1 - sse/sst
    return 1 - ((T - 1)/(T - k)) * (1 - r2)
    

## Combining all metrics together
def eval_all(y_test, y_pred, T, model):
    sse = SSE(y_test, y_pred)
    mse = MSE(y_test, y_pred)
    rmse = RMSE(y_test, y_pred)
    me = ME(y_test, y_pred)
    mae = MAE(y_test, y_pred)
    mpe = MPE(y_test, y_pred)
    mape = MAPE(y_test, y_pred)
    aic = AIC(y_test, y_pred, T, model)
    sbc = SBC(y_test, y_pred, T, model)
    apc = APC(y_test, y_pred, T, model)
    adj_r2 = ADJ_R2(y_test, y_pred, T, model)
    
    return [sse, mse, rmse, me, mae, mpe, mape, aic, sbc, apc, adj_r2]

위에서 정의한 시계열 예측 모델 성능평가 지표 사용자 정의 함수를 모델 성능 평가를 위해 따로 떼어놓았던 'test set' 에 적용하여 각 지수평활법 모델별/ 모델 성능평가 지표 별로 비교를 해보겠습니다.

모델 성능평가 지표 중에서 실제값과 예측값의 차이인 잔차(residual)를 기반으로 한 MSE, RMSE, ME, MAE, MPE, MAPE, AIC, SBC, APC 는 낮을 수록 상대적으로 더 좋은 모델이라고 평가할 수 있으며, 실제값과 평균값의 차이 중에서 시계열 예측 모델이 설명할 수 있는 부분의 비중 (모델의 설명력) 을 나타내는 Adjusted-R2 는 값이 높을 수록 상대적으로 더 좋은 모델이라고 평가합니다. (Ajd.-R2 가 음수가 나온 값이 몇 개 보이는데요, 이는 예측 모델이 불량이어서 예측값이 실제값과 차이가 큼에 따라 SSE가 SST보다 크게 되었기 때문입니다.)

eval_all_df = pd.DataFrame(
    {'SES': eval_all(y_test, forecast_1, T, fit1), 
    "Holt's": eval_all(y_test, forecast_2, T, fit2), 
    'Exponential': eval_all(y_test, forecast_3, T, fit3), 
    'Trend_Add': eval_all(y_test, forecast_4, T, fit4), 
    'Trend_Mult': eval_all(y_test, forecast_5, T, fit5), 
    'Trend_Season_Add': eval_all(y_test, forecast_6, T, fit6), 
    'Trend_Season_Mult': eval_all(y_test, forecast_7, T, fit7)}
    , index=['SSE', 'MSE', 'RMSE', 'ME', 'MAE', 'MPE', 'MAPE', 'AIC', 'SBC', 'APC', 'Adj_R2'])


print(eval_all_df)
#                SES      Holt's  Exponential   Trend_Add  Trend_Mult  \
# SSE     205.629692  155.645609   130.567477  150.867568  159.856508   
# MSE      17.135808   12.970467    10.880623   12.572297   13.321376   
# RMSE      4.139542    3.601454     3.298579    3.545744    3.649846   
# ME        3.018502    1.936578     1.123475    1.841425    2.019134   
# MAE       3.453205    2.946251     2.576567    2.878085    3.004307   
# MPE      11.739201    6.900165     3.313095    6.486233    7.259990   
# MAPE     14.079695   12.280881    10.960354   12.010763   12.510302   
# AIC      17.167661  -32.303428   -66.036202  -36.289847  -25.178006   
# SBC      23.682652  -19.273447   -53.006220  -20.002370   -8.890530   
# APC       1.093535    0.845150     0.708977    0.827788    0.877109   
# Adj_R2   -1.146688   -0.642161    -0.377571   -0.600262   -0.695608   

#         Trend_Season_Add  Trend_Season_Mult  
# SSE            56.743149          48.994715  
# MSE             4.728596           4.082893  
# RMSE            2.174533           2.020617  
# ME             -0.118946          -0.089689  
# MAE             1.863421           1.641140  
# MPE            -0.847971          -0.607618  
# MAPE            8.538100           7.461092  
# AIC          -200.040404        -228.230323  
# SBC          -144.662983        -172.852902  
# APC             0.352956           0.304759  
# Adj_R2          0.356850           0.444674

시계열 모델 성능 지표별로 보면 전반적으로 7번째 지수평활법 시계열 예측 모델인 "Multiplicative Winters' method with Linear Trend" 모델이 상대적으로 가장 우수함을 알 수 있습니다. 역시 시도표를 보고서 우리가 예상했던대로 결과가 나왔습니다.

MAPE (Mean Absolute Percentage Error) 지표에 대해서만 옆으로 누운 막대그래프로 모델 간 성능을 비교해서 시각화를 해보면 아래와 같습니다. (MAPE 값이 작을수록 상대적으로 더 우수한 모델로 해석함.)

# horizontal bar chart
eval_all_df.loc['MAPE', :].plot(kind='barh', figsize=[8, 6])
plt.title('Mean Absolute Percentage Error (MAPE)', fontsize=16)
plt.show()

아래의 시도표는 실제값과 예측값을 시간의 흐름에 따라 선그래프로 시각화해 본 것입니다.

예상했던 대로 4번째의 "1차 선형 추세만 반영하고 계절성은 없는 이중 지수 평활법 모델" 보다는, 7번째의 "선형추세+계절성까지 반영한 multiplicative holt-winters method exponential smoothing with linear trend 지수평활법 모델" 이 더 실제값을 잘 모델링하여 근접하게 예측하고 있음을 눈으로 확인할 수 있습니다.

# 1차 선형 추세는 있고 계절성은 없는 이중 지수 평활법
plt.rcParams['figure.figsize']=[12, 8]
past, = plt.plot(df_train.index, df_train, 'b.-', label='Sales History')
test, = plt.plot(df_test.index, df_test, 'r.-', label='y_test')
pred, = plt.plot(df_test.index, forecast_4, 'y.-', label='y_pred')
plt.title('Two Parameter Exponential Smoothing', fontsize=14)
plt.legend(handles=[past, test, pred])
plt.show()

# 1차 선형 추세와 확산계절변동이 있는 승법 윈터스 지수평활법
plt.rcParams['figure.figsize']=[12, 8]
past, = plt.plot(df_train.index, df_train, 'b.-', label='Sales History')
test, = plt.plot(df_test.index, df_test, 'r.-', label='y_test')
pred, = plt.plot(df_test.index, forecast_7, 'y.-', label='y_pred')
plt.title('Multiplicative Winters Method Exponential Smoothing with Linear Trend', fontsize=14)
plt.legend(handles=[past, test, pred])
plt.show()

적합된 statsmodels.tsa 시계열 예측 모델에 대해 summary() 메소드를 사용해서 적합 결과를 조회할 수 있으며, model.params['param_name'] 으로 적합된 모델의 속성값에 접근할 수 있습니다.

#print out the training summary
print(fit7.summary())

#                        ExponentialSmoothing Model Results                       
# ================================================================================
# Dep. Variable:                    value   No. Observations:                  192
# Model:             ExponentialSmoothing   SSE                             75.264
# Optimized:                         True   AIC                           -147.808
# Trend:                         Additive   BIC                            -95.688
# Seasonal:                Multiplicative   AICC                          -143.854
# Seasonal Periods:                    12   Date:                 Thu, 29 Jul 2021
# Box-Cox:                          False   Time:                         19:40:52
# Box-Cox Coeff.:                    None                                         
# =================================================================================
#                           coeff                 code              optimized      
# ---------------------------------------------------------------------------------
# smoothing_level               0.2588475                alpha                 True
# smoothing_trend               0.0511336                 beta                 True
# smoothing_seasonal           3.3446e-16                gamma                 True
# initial_level                 2.1420252                  l.0                 True
# initial_trend                 0.0466926                  b.0                 True
# initial_seasons.0             1.4468847                  s.0                 True
# initial_seasons.1             1.4713123                  s.1                 True
# initial_seasons.2             1.4550909                  s.2                 True
# initial_seasons.3             1.5754307                  s.3                 True
# initial_seasons.4             1.6324776                  s.4                 True
# initial_seasons.5             1.7590615                  s.5                 True
# initial_seasons.6             1.9730448                  s.6                 True
# initial_seasons.7             1.1392096                  s.7                 True
# initial_seasons.8             1.3057795                  s.8                 True
# initial_seasons.9             1.2354317                  s.9                 True
# initial_seasons.10            1.4064560                 s.10                 True
# initial_seasons.11            1.3648905                 s.11                 True
# ---------------------------------------------------------------------------------



print('Smoothing Level: %.3f' %(fit7.params['smoothing_level']))
print('Smoothing Trend: %.3f' %(fit7.params['smoothing_trend']))
print('Smoothing Seasonality: %.3f' %(fit7.params['smoothing_seasonal']))
print('Initial Level: %.3f' %(fit7.params['initial_level']))
print('Initial Trend: %.3f' %(fit7.params['initial_trend']))
print('Initial Seasons:', fit7.params['initial_seasons'])

# Smoothing Level: 0.259
# Smoothing Trend: 0.051
# Smoothing Seasonality: 0.000
# Initial Level: 2.142
# Initial Trend: 0.047
# Initial Seasons: [1.44688473 1.47131227 1.45509092 1.5754307  1.63247755 1.75906148
#  1.97304481 1.13920961 1.30577953 1.23543175 1.40645604 1.36489053]

[Reference]

* Exponential Smoothing
: https://www.statsmodels.org/stable/examples/notebooks/generated/exponential_smoothing.html
* Holt-Winters
: https://www.statsmodels.org/devel/generated/statsmodels.tsa.holtwinters.ExponentialSmoothing.html

Exponential smoothing — statsmodels

Exponential smoothing Let us consider chapter 7 of the excellent treatise on the subject of Exponential Smoothing By Hyndman and Athanasopoulos [1]. We will work through all the examples in the chapter as they unfold. [1] [Hyndman, Rob J., and George Athan

www.statsmodels.org

statsmodels.tsa.holtwinters.ExponentialSmoothing — statsmodels

An dictionary containing bounds for the parameters in the model, excluding the initial values if estimated. The keys of the dictionary are the variable names, e.g., smoothing_level or initial_slope. The initial seasonal variables are labeled initial_season

www.statsmodels.org

* 시계열 패턴 별 지수평활법 모델을 설정하는 방법
: https://rfriend.tistory.com/511

* 시계열 데이터 예측 모델의 성능, 모델 적합도 (goodness-of-fit of time series model)를 평가하는 지표
: https://rfriend.tistory.com/667

이번 포스팅이 많은 도움이 되었기를 바랍니다.

행복한 데이터 과학자 되세요! :-)

728x90

저작자표시 비영리 변경금지

'Python 분석과 프로그래밍 > Python 통계분석' 카테고리의 다른 글

[Python] 스펙트럼 분석 (spectral analysis, spectrum analysis) (2)	2021.10.03
[Python] scipy를 이용한 시계열 데이터 보간 (interpolation of time series data using scipy) (4)	2021.09.07
[Python 시계열 자료 분석] 시계열 모형의 예측 성능 평가 지표 (7)	2021.07.18
[Python] DataFrame에서 여러개의 변수에 대해 일원분산분석 검정하기 (ANOVA test for multiple numeric variables in pandas DataFrame) (0)	2021.05.08
[Python] 샘플 크기가 다른 2개 이상 그룹간 일원분산분석 (one-way ANOVA with different sized samples) (0)	2021.05.07

Posted by Rfriend

[Python] DataFrame에서 여러개의 변수에 대해 일원분산분석 검정하기 (ANOVA test for multiple numeric variables in pandas DataFrame)

Python 분석과 프로그래밍/Python 통계분석 2021. 5. 8. 19:22

지난번 포스팅에서는 샘플 크기가 다른 2개 이상의 집단에 대해 평균의 차이가 존재하는지를 검정하는 일원분산분석(one-way ANOVA)에 대해 scipy 모듈의 scipy.stats.f_oneway() 메소드를 사용해서 분석하는 방법(rfriend.tistory.com/638)을 소개하였습니다.

이번 포스팅에서는 2개 이상의 집단에 대해 pandas DataFrame에 들어있는 여러 개의 숫자형 변수(one-way ANOVA for multiple numeric variables in pandas DataFrame) 별로 일원분산분석 검정(one-way ANOVA test)을 하는 방법을 소개하겠습니다.

숫자형 변수와 집단 변수의 모든 가능한 조합을 MultiIndex 로 만들어서 statsmodels.api 모듈의 stats.anova_lm() 메소드의 모델에 for loop 순환문으로 변수를 바꾸어 가면서 ANOVA 검정을 하도록 작성하였습니다.

먼저, 3개의 집단('grp 1', 'grp 2', 'grp 3')을 별로 'x1', 'x2', 'x3, 'x4' 의 4개의 숫자형 변수를 각각 30개씩 가지는 가상의 pandas DataFrame을 만들어보겠습니다. 이때 숫자형 변수는 모두 정규분포로 부터 난수를 발생시켜 생성하였으며, 'x3'와 'x4'에 대해서는 집단3 ('grp 3') 의 평균이 다른 2개 집단의 평균과는 다른 정규분포로 부터 난수를 발생시켜 생성하였습니다.

아래의 가상 데이터셋은 결측값이 없이 만들었습니다만, 실제 기업에서 쓰는 데이터셋에는 혹시 결측값이 존재할 수도 있으므로 결측값을 없애거나 또는 결측값을 그룹 별 평균으로 대체한 후에 one-way ANOVA 를 실행하기 바랍니다.

## Creating sample dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# generate 90 IDs
id = np.arange(90) + 1

# Create 3 groups with 30 observations in each group.
from itertools import chain, repeat
grp = list(chain.from_iterable((repeat(number, 30) for number in [1, 2, 3])))

# generate random numbers per each groups from normal distribution
np.random.seed(1004)

# for 'x1' from group 1, 2 and 3
x1_g1 = np.random.normal(0, 1, 30)
x1_g2 = np.random.normal(0, 1, 30)
x1_g3 = np.random.normal(0, 1, 30)

# for 'x2' from group 1, 2 and 3
x2_g1 = np.random.normal(10, 1, 30)
x2_g2 = np.random.normal(10, 1, 30)
x2_g3 = np.random.normal(10, 1, 30)

# for 'x3' from group 1, 2 and 3
x3_g1 = np.random.normal(30, 1, 30)
x3_g2 = np.random.normal(30, 1, 30)
x3_g3 = np.random.normal(50, 1, 30) # different mean

x4_g1 = np.random.normal(50, 1, 30)
x4_g2 = np.random.normal(50, 1, 30)
x4_g3 = np.random.normal(20, 1, 30) # different mean

# make a DataFrame with all together
df = pd.DataFrame({'id': id, 
                   'grp': grp, 
                   'x1': np.concatenate([x1_g1, x1_g2, x1_g3]), 
                   'x2': np.concatenate([x2_g1, x2_g2, x2_g3]), 
                   'x3': np.concatenate([x3_g1, x3_g2, x3_g3]), 
                   'x4': np.concatenate([x4_g1, x4_g2, x4_g3])})
                   
df.head()
[Out] 

id	grp	x1	x2	x3	x4
0	1	1	0.594403	10.910982	29.431739	49.232193
1	2	1	0.402609	9.145831	28.548873	50.434544
2	3	1	-0.805162	9.714561	30.505179	49.459769
3	4	1	0.115126	8.885289	29.218484	50.040593
4	5	1	-0.753065	10.230208	30.072990	49.601211


df[df['grp'] == 3].head()
[Out] 

id	grp	x1	x2	x3	x4
60	61	3	-1.034244	11.751622	49.501195	20.363374
61	62	3	0.159294	10.043206	50.820755	19.800253
62	63	3	0.330536	9.967849	50.461775	20.993187
63	64	3	0.025636	9.430043	50.209187	17.892591
64	65	3	-0.092139	12.543271	51.795920	18.883919

가령, 'x3' 변수에 대해 집단별로 상자 그래프 (Box plot for 'x3' by groups) 를 그려보면, 아래와 같이 집단1과 집단2는 유사한 반면에 집단3은 평균이 차이가 많이 나게 가상의 샘플 데이터가 생성되었음을 알 수 있습니다.

## Boxplot for 'x3' by 'grp'
plt.rcParams['figure.figsize'] = [10, 6]
sns.boxplot(x='grp', y='x3', data=df)
plt.show()

여러개의 변수에 대해 일원분산분석을 하기 전에, 먼저 이해를 돕기 위해 Python의 statsmodels.api 모듈의 stats.anova_lm() 메소드를 사용해서 'x1' 변수에 대해 집단(집단 1/2/3)별로 평균이 같은지 일원분산분석으로 검정을 해보겠습니다.

- 귀무가설(H0) : 집단1의 x1 평균 = 집단2의 x1 평균 = 집단3의 x1 평균

- 대립가설(H1) : 적어도 1개 이상의 집단의 x1 평균이 다른 집단의 평균과 다르다. (Not H0)

# ANOVA for x1 and grp
import statsmodels.api as sm
from statsmodels.formula.api import ols

model = ols('x1 ~ grp', data=df).fit()
sm.stats.anova_lm(model, typ=1)
[Out]

df	sum_sq	mean_sq	F	PR(>F)
grp	1.0	0.235732	0.235732	0.221365	0.639166
Residual	88.0	93.711314	1.064901	NaN	NaN

일원분산분석 결과 F 통계량이 0.221365, p-value가 0.639 로서 유의수준 5% 하에서 귀무가설을 채택합니다. 즉, 3개 집단 간 x1의 평균의 차이는 없다고 판단할 수 있습니다. (정규분포 X ~ N(0, 1) 를 따르는 모집단으로 부터 무작위로 3개 집단의 샘플을 추출했으므로 차이가 없게 나오는게 맞겠습니다.)

한개의 변수에 대한 일원분산분석하는 방법을 알아보았으니, 다음으로는 3개 집단별로 여러개의 연속형 변수인 'x1', 'x2', 'x3', 'x4' 에 대해서 for loop 순환문으로 돌아가면서 일원분산분석을 하고, 그 결과를 하나의 DataFrame에 모아보도록 하겠습니다.

(1) 먼저, 일원분산분석을 하려는 모든 숫자형 변수와 집단 변수에 대한 가능한 조합의 MultiIndex 를 생성해줍니다.

# make a multiindex for possible combinations of Xs and Group
num_col = ['x1','x2', 'x3', 'x4']
cat_col =  ['grp']
mult_idx = pd.MultiIndex.from_product([num_col, cat_col],
                                   names=['x', 'grp'])

print(mult_idx)
[Out]
MultiIndex([('x1', 'grp'),
            ('x2', 'grp'),
            ('x3', 'grp'),
            ('x4', 'grp')],
           names=['x', 'grp'])

(2) for loop 순환문(for x, grp in mult_idx:)으로 model = ols('{} ~ {}'.format(x, grp) 의 선형모델의 y, x 부분의 변수 이름을 바꾸어가면서 sm.stats.anova_lm(model, typ=1) 로 일원분산분석을 수행합니다. 이렇게 해서 나온 일원분산분석 결과 테이블을 anova_tables.append(anova_table) 로 순차적으로 append 해나가면 됩니다.

# ANOVA test for multiple combinations of X and Group
import statsmodels.api as sm
from statsmodels.formula.api import ols

anova_tables = []
for x, grp in mult_idx:
    model = ols('{} ~ {}'.format(x, grp), data=df).fit()
    anova_table = sm.stats.anova_lm(model, typ=1)
    anova_tables.append(anova_table)

df_anova_tables = pd.concat(anova_tables, keys=mult_idx, axis=0)

df_anova_tables
[Out]

df	sum_sq	mean_sq	F	PR(>F)
x1	grp	grp	1.0	0.235732	0.235732	0.221365	6.391661e-01
Residual	88.0	93.711314	1.064901	NaN	NaN
x2	grp	grp	1.0	0.448662	0.448662	0.415853	5.206912e-01
Residual	88.0	94.942885	1.078896	NaN	NaN
x3	grp	grp	1.0	6375.876120	6375.876120	259.202952	5.779374e-28
Residual	88.0	2164.624651	24.598007	NaN	NaN
x4	grp	grp	1.0	13760.538009	13760.538009	256.515180	8.145953e-28
Residual	88.0	4720.684932	53.644147	NaN	NaN

만약 특정 변수에 대한 일원분산분석 결과만을 조회하고 싶다면, 아래처럼 DataFrame의 MultiIndex 에 대해 인덱싱을 해오면 됩니다. 가령, 'x3' 에 대한 집단별 평균 차이 여부를 검정한 결과는 아래처럼 인덱싱해오면 됩니다.

## Getting values of 'x3' from ANOVA tables
df_anova_tables.loc[('x3', 'grp', 'grp')]
[Out]

df         1.000000e+00
sum_sq     6.375876e+03
mean_sq    6.375876e+03
F          2.592030e+02
PR(>F)     5.779374e-28
Name: (x3, grp, grp), dtype: float64

F 통계량과 p-value 에 대해서 조회하고 싶으면 위의 결과에서 DataFrame 의 칼럼 이름으로 선택해오면 됩니다.

# F-statistic
df_anova_tables.loc[('x3', 'grp', 'grp')]['F']
[Out]
259.2029515179077


# P-value
df_anova_tables.loc[('x3', 'grp', 'grp')]['PR(>F)']
[Out]
5.7793742588216585e-28

MultiIndex 를 인덱싱해오는게 좀 불편할 수 도 있는데요, 이럴 경우 df_anova_tables.reset_index() 로 MultiIndex 를 칼럼으로 변환해서 사용할 수도 있습니다.

# resetting index to columns
df_anova_tables_2 = df_anova_tables.reset_index().dropna()


df_anova_tables_2
[Out]

level_0	level_1	level_2	df	sum_sq	mean_sq	F	PR(>F)
0	x1	grp	grp	1.0	0.235732	0.235732	0.221365	6.391661e-01
2	x2	grp	grp	1.0	0.448662	0.448662	0.415853	5.206912e-01
4	x3	grp	grp	1.0	6375.876120	6375.876120	259.202952	5.779374e-28
6	x4	grp	grp	1.0	13760.538009	13760.538009	256.515180	8.145953e-28

Greenplum DB에서 PL/Python (또는 PL/R)을 사용하여 여러개의 숫자형 변수에 대해 일원분산분석을 분산병렬처리하는 방법(one-way ANOVA in parallel using PL/Python on Greenplum DB)은 rfriend.tistory.com/640 를 참고하세요.

[reference]

* ANOVA test using Python statsmodels
: https://www.statsmodels.org/stable/generated/statsmodels.stats.anova.anova_lm.html

이번 포스팅이 많은 도움이 되었기를 바랍니다.

행복한 데이터 과학자 되세요! :-)

728x90

저작자표시 비영리 변경금지

'Python 분석과 프로그래밍 > Python 통계분석' 카테고리의 다른 글

[Python] 지수평활법 모형 훈련 및 예측, 모델평가 (Exponential Smoothing in Python) (19)	2021.07.29
[Python 시계열 자료 분석] 시계열 모형의 예측 성능 평가 지표 (7)	2021.07.18
[Python] 샘플 크기가 다른 2개 이상 그룹간 일원분산분석 (one-way ANOVA with different sized samples) (0)	2021.05.07
[Python] 선형회귀모형, 로지스틱 회귀모형에 대한 각 관측치 별 변수별 기여도(민감도) 분석 (Sensitivity analysis of linear regression & Logistic regression per each variables and each observations) (1)	2020.01.12
[Python 시계열 자료 분석] 시계열 패턴별 지수 평활법 (exponential smoothing by time series patterns) (5)	2020.01.05

Posted by Rfriend

이전 1 다음

R, Python 분석과 프로그래밍의 친구 (by R Friend)

'statsmodels'에 해당되는 글 2건

[Python] 지수평활법 모형 훈련 및 예측, 모델평가 (Exponential Smoothing in Python)

(1) 지수평활법 기법으로 시계열 예측모형 적합하기
(training the exponential smoothing models in Python)

(2) 지수평활법 적합 모델로 예측하여 모델 간 성능 평가/비교하고 최적 모델 선택하기

(forecasting using the exponential smoothing models, model evaluation/ model comparison and best model selection)

'Python 분석과 프로그래밍 > Python 통계분석' 카테고리의 다른 글

[Python] DataFrame에서 여러개의 변수에 대해 일원분산분석 검정하기 (ANOVA test for multiple numeric variables in pandas DataFrame)

'Python 분석과 프로그래밍 > Python 통계분석' 카테고리의 다른 글

카테고리

태그목록

티스토리툴바

R, Python 분석과 프로그래밍의 친구 (by R Friend)

'statsmodels'에 해당되는 글 2건

[Python] 지수평활법 모형 훈련 및 예측, 모델평가 (Exponential Smoothing in Python)

(1) 지수평활법 기법으로 시계열 예측모형 적합하기 (training the exponential smoothing models in Python)

(2) 지수평활법 적합 모델로 예측하여 모델 간 성능 평가/비교하고 최적 모델 선택하기

(forecasting using the exponential smoothing models, model evaluation/ model comparison and best model selection)

'Python 분석과 프로그래밍 > Python 통계분석' 카테고리의 다른 글

[Python] DataFrame에서 여러개의 변수에 대해 일원분산분석 검정하기 (ANOVA test for multiple numeric variables in pandas DataFrame)

'Python 분석과 프로그래밍 > Python 통계분석' 카테고리의 다른 글

카테고리

태그목록

티스토리툴바

(1) 지수평활법 기법으로 시계열 예측모형 적합하기
(training the exponential smoothing models in Python)